2023 Q4 | Science of Security Virtual Organization

2023 Q4

Leveraging Machine Learning for Binary Software Understanding

Research Team Status

Names of researchers and position:
Yan Shoshitaishvili - Lead PI, Associate Professor
Adam Doupe - Co-I, Associate Professor
Chitta Baral - Co-I, Professor
Michael Tompkins - PhD Student
Steven Wirsz - PhD Student
Any new collaborations with other universities/researchers?
None.

Project Goals

What is the current project goal?

Task 1 (Base Period): Achieving Semantically-equivalent Decompilation. The focus here is on generating a representative dataset of original source code and compiled binary code to train machine learning models. We also aim to fundamentally improve the evaluation of decompilation techniques.
- Task 1.1: Recovering Missing Human-Level Constructs
- Task 1.2: Using Recovered Human-Level Constructs to Generate Original Source Code
- Task 1.3: Evaluating Decompiled Code Quality

How does the current goal factor into the long-term goal of the project?

Long-Term Goal: Achieving binary software understanding, in order to make identifying security issues much easier and cheaper.
Task 1 is the foundation for the overall goal of the project. By being able to generate a representative dataset of original source code and compiled binary code and training machine learning models, we can then advance to Task 2. Task 2 will build upon this foundation by working towards being able to describe code in natural language, in a variety of programming languages, and to more abstract structural representations such as flow graphs or state transition diagrams.

Accomplishments

VarBERT: Our research on variable name prediction directly contributes to the goal of semantically equivalent decompilation by addressing the need for meaningful variable names in decompiled code. We developed VarBERT, a novel neural network model that utilizes transfer learning from a large dataset of human-written source code to predict variable names and their origins in decompiled output, supporting multiple decompilers such as IDA and Ghidra. We employed novel techniques to build a new dataset, including decompiled functions from both C and C++ binaries across four compiler optimizations (O0, O1, O2, and O3). Our dataset features (i) more unique functions, (ii) better variable matching, and (iii) more meaningful variable names. Our largest dataset includes 2.6M decompiled functions from IDA for O0, and our human source code dataset used for transfer learning comprises 5M functions.

VarBERT achieved a 50.70% prediction accuracy, outperforming existing solutions, DIRE (35.94%), and DIRTY (38.00%) on O0 binaries for IDA. Our evaluation demonstrates that VarBERT is applicable to variable name and origin prediction in the decompilation output of stripped, real-world software. Additionally, we conducted a user study to evaluate the impact of predicted variable names in decompiled functions on code understanding. The study results provide statistically significant evidence to support the usefulness of VarBERT in understanding decompiled code.

Our paper “Len or index or count, anything but v1”: Predicting Variable Names in Decompilation Output with Transfer Learning got accepted to IEEE S&P in October 2023.

Binary Type Inference: Our research focuses on binary type inference, a core research challenge in binary program analysis and reverse engineering. It concerns identifying the data types of registers and memory values in a stripped executable (or object file), whose type information is discarded during compilation.

We propose a novel graph-based representation of data-flow information that allows a synergistic combination of a data-flow analysis and a graph neural network model to balance scalability and accuracy. We implement it as a system that uses a GNN model that is trained on a large data set of binary functions to infer types of program variables in unseen functions from stripped binaries. We have also demonstrated the effectiveness of our approach by extensively evaluating it on a large data set. Our approach demonstrates an overall accuracy of 76.6 % and struct type accuracy of 45.2% on the x64 dataset across four optimization levels (O0-O3) and outperforms existing works by a minimum of 26.1% in overall accuracy and 10.2% in struct type accuracy. The research paper is currently under review.

SAILR: Our research on the SAILR project concluded last November in a publication to USENIX 2024 with two notable improvements for decompilation:
A1. A new methodology for measuring decompilation’s closeness to source (given the original source)

A2. A new algorithm to restructure decompilation to resemble its original source code more closely by undoing compiler optimizations.

In A1, we introduced a new speed-improved graph edit distance algorithm, which calculated the minimum edits needed to turn our decompilation into the original source code. This Control Flow Graph Edit Distance algorithm is referred to as CFGED. Using CFGED, we measured against modern decompilers. In the case of IDA Pro, we generated code that was on par with their output. Additionally, we had nearly three times fewer spurious gotos than both IDA Pro and Ghidra. We also had a CFGED score 13% better than Ghidra. These findings indicate that we’ve found a sweet spot for simplifying decompilation complexity while being faithful to the original source code.

To achieve these numbers, we created a series of graph-based and condition-based schema algorithms found in A2 that allow us to precisely revert the most impactful compiler optimizations in decompilation. Our decompiler, dataset, algorithms, and pipeline have all been open-sourced and documented for the community. This openness has prompted other researchers to make pull requests to Ghidra to implement parts of SAILR in the decompiler. It has also received broad discussion from decompiler companies on social media and was featured in a blog post on the front page of Hacker News.

Rust: Our research on Rust decompilation aims to develop a Rust decompiler on top of C/C++ decompiler angr to generate semantically equivalent Rust pseudocode. The milestones we have achieved are (i) We reused C type inference algorithms and we are now able to translate type inference results into Rust data types instead of C data types. (ii) We replaced the original C structured code generator with our custom Rust structured code generator, which is able to generate Rust-specific control flow structures and Rust-like pseudocode. (iii) We implemented some optimization passes to simplify AIL code to get more understandable Rust pseudocode.

This research project is still in progress. Our next steps are (i) recovering data-specific data types like String, Array, and Slices, (ii) Rust calling convention recovery, (iii) memory allocation and deallocation simplification, (iv) security check and error handling simplification, (v) control flow de-optimization, and (vi) syntax re-sugaring. Our final goal is to successfully recover semantically-equivalent and understandable Rust pseudocode.

Publications and presentations

“Len or index or count, anything but v1”: Predicting Variable Names in Decompilation Output with Transfer Learning

Pal, Kuntal Kumar, et al. "“Len or index or count, anything but v1”: Predicting Variable Names in Decompilation Output with Transfer Learning." 2024 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 2024.

Ahoy SAILR! There is No Need to DREAM of C: A Compiler-Aware Structuring Algorithm for Binary Decompilation

Basque, Zion Leonahenahe, et al. "Ahoy sailr! there is no need to dream of c: A compiler-aware structuring algorithm for binary decompilation." 33st USENIX Security Symposium (USENIX Security 24). 2024.

Lead PI:

Yan Shoshitaishvili

Co-Pi(s):

Adam Doupé