2024 Q3 | Science of Security Virtual Organization

Leveraging Machine Learning for Binary Software Understanding

Research Team Status

Names of researchers and position
(e.g. Research Scientist, PostDoc, Student (Undergrad/Masters/PhD))
- Yan Shoshitaishvili - Lead PI, Associate Professor
- Adam Doupé - Co-I, Associate Professor
- Chitta Baral - Co-I, Professor
- Divij Handa - PhD Student
- William Gibbs - PhD Student
- Michael Tompkins - PhD Student
Any new collaborations with other universities/researchers?
- None.

Project Goals

What is the current project goal?
- Task 2 (Option Year 1): Higher-level decompilation abstraction. The focus here is to abstract the binary software beyond the decompiled code into human-level representations.
  - Task 2.1: Code to Human Description
  - Task 2.2: Translating Decompiled Code
  - Task 2.3: Code to High Level Structural Representations
How does the current goal factor into the long-term goal of the project?
- Long-Term Goal: Achieving binary software understanding, in order to make identifying security issues much easier and cheaper.
- Task 2 builds upon the foundations created by Task 1 by working towards being able to describe code in natural language, in a variety of programming languages, and to more abstract structural representations such as flow graphs or state transition diagrams.

Accomplishments

Address whether project milestones were met. If milestones were not met, explain why, and what are the next steps.
What is the contribution to foundational cybersecurity research? Was there something discovered or confirmed?
Impact of research
- Internal to the university (coursework/curriculum)
- External to the university (transition to industry/government (local/federal); patents, start-ups, software, etc.)
- Any acknowledgements, awards, or references in media?

Recompilable Decompilation (New as of 10/15/2024):

10/15/2024 (July-Sep 2024): The goal of this project is to make angr's decompiled code recompilable, ensuring that the recompiled binary not only compiles successfully but also exhibits the intended behavior. A key focus is on verifying the correctness of the recompiled binaries' behavior, ensuring they faithfully reproduce the original functionality.

Decompiled code typically does not recompile out of the box because it does not conform to the C syntax rules expected by compilers like GCC. We have developed a preliminary pipeline that attempts to recompile the decompiled code and verify the functionality of the recompiled binary. Currently, we are focusing on studying the recompilation failures of coreutils binaries to improve the quality of angr's decompiled code.
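A recompile-and-verify check of the kind described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the helper names (recompile, observe, behaviors_match) are hypothetical, and behavioral equivalence is approximated here by comparing exit codes and standard output on a fixed set of inputs.

```python
import subprocess
import sys
from typing import Iterable, Sequence, Tuple


def recompile(c_path: str, out_path: str) -> bool:
    """Try to compile decompiled C with GCC; return True if compilation succeeds.

    Hypothetical helper: assumes `gcc` is on PATH (raises FileNotFoundError otherwise).
    """
    result = subprocess.run(
        ["gcc", "-w", "-o", out_path, c_path],  # -w suppresses warnings
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def observe(cmd: Sequence[str], stdin_data: bytes, timeout: float = 10.0) -> Tuple[int, bytes]:
    """Run a command on one input and return (exit code, stdout)."""
    result = subprocess.run(cmd, input=stdin_data, capture_output=True, timeout=timeout)
    return result.returncode, result.stdout


def behaviors_match(
    original_cmd: Sequence[str],
    recompiled_cmd: Sequence[str],
    inputs: Iterable[bytes],
) -> bool:
    """True if both programs produce the same exit code and stdout on every input."""
    return all(observe(original_cmd, data) == observe(recompiled_cmd, data) for data in inputs)


if __name__ == "__main__":
    # Stand-in "binaries": two Python one-liners, used only to exercise the comparison.
    echo = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
    print(behaviors_match(echo, echo, [b"hello", b""]))
```

Matching output on sample inputs is only evidence of equivalence, not proof; a real study would also need to cover command-line arguments, files, and other side effects.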

REaLLM (New as of 10/15/2024):

10/15/2024 (July-Sep 2024): In our push to make decompilation more like source, we’ve finished the SAILR project. We presented our work at USENIX Security 2024 and have finished open-sourcing into angr all of the special passes that SAILR introduced. The latest version of SAILR is now available in the latest version of angr. As part of this work, we also made the decompiler easier to access from the command line: pip install angr && angr decompile /bin/true.

Much of SAILR focused on making decompilation more like source code through automated techniques. This decompilation can often be easier for humans to understand and, interestingly, may also be easier for Large Language Models (LLMs) to reason about. In new work, termed the REaLLM project, we’ve begun studying how LLMs interact with and augment decompilation and how they might improve the reverse engineering process. We’ve open-sourced a prototype of this work: https://github.com/mahaloz/daila, which we use in our REaLLM study.

The REaLLM study aims to accomplish the following:

1. Study the ways current practitioners and professionals use LLMs with decompilers
2. Implement those uses in a tool that integrates directly into the decompiler
3. Study how those integrated features affect the reversing process quantitatively

We have completed tasks 1 and 2. After some pilot studies, we plan to accomplish task 3 in the coming months. Overall, this project will produce the following gains:

- Find out how effective current LLMs are in augmenting decompilation
- Identify new ways decompilers can be improved to utilize LLMs better
- Develop a prototype for practitioners to use LLMs in decompilers more easily

Rust Decompilation:

2/22/2024 (Start of project through Dec 2023): Our research on Rust decompilation aims to develop a Rust decompiler on top of the angr C/C++ decompiler to generate semantically equivalent Rust pseudocode. The milestones we have achieved are (i) we reused C type inference algorithms and are now able to translate type inference results into Rust data types instead of C data types; (ii) we replaced the original C structured code generator with our custom Rust structured code generator, which is able to generate Rust-specific control flow structures and Rust-like pseudocode; (iii) we implemented some optimization passes to simplify AIL code to get more understandable Rust pseudocode.
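To illustrate what translating type inference results into Rust data types can look like, here is a toy sketch. It is not the project's implementation: the table and the function c_type_to_rust are hypothetical, the input is assumed to be a C-style type string, and the integer widths assume an LP64 target such as x86-64 Linux.

```python
# Hypothetical mapping from C-style inferred type names to Rust spellings.
# Assumption: LP64 target (e.g. x86-64 Linux), where `long` is 64 bits
# and plain `char` is signed.
C_TO_RUST = {
    "char": "i8",
    "unsigned char": "u8",
    "short": "i16",
    "unsigned short": "u16",
    "int": "i32",
    "unsigned int": "u32",
    "long": "i64",
    "unsigned long": "u64",
    "long long": "i64",
    "unsigned long long": "u64",
    "float": "f32",
    "double": "f64",
    "void": "()",
}


def c_type_to_rust(c_type: str) -> str:
    """Translate a C-style type string into a Rust type string."""
    c_type = " ".join(c_type.split())  # normalize whitespace
    if c_type.endswith("*"):
        pointee = c_type[:-1].strip()
        # `void *` is conventionally spelled as a pointer to c_void in Rust.
        inner = "core::ffi::c_void" if pointee == "void" else c_type_to_rust(pointee)
        return f"*mut {inner}"
    # Unknown names (e.g. struct types) are passed through unchanged.
    return C_TO_RUST.get(c_type, c_type)


if __name__ == "__main__":
    for t in ["unsigned int", "char *", "void *", "struct foo"]:
        print(f"{t!r} -> {c_type_to_rust(t)!r}")
```

Every pointer is emitted as a raw *mut pointer here; recovering references, String, or slices requires the further analyses listed in the next steps.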

This research project is still in progress. Our next steps are (i) recovering data-specific data types like String, Array, and Slices, (ii) Rust calling convention recovery, (iii) memory allocation and deallocation simplification, (iv) security check and error handling simplification, (v) control flow de-optimization, and (vi) syntax re-sugaring. Our final goal is to successfully recover semantically-equivalent and understandable Rust pseudocode.

4/15/2024 (Jan-Mar 2024): Our research on Rust decompilation aims to develop a Rust decompiler on top of the angr C/C++ decompiler to generate semantically equivalent Rust pseudocode. The milestones we have achieved during this reporting period are (i) we identified the challenges we need to solve in a Rust decompiler; (ii) we designed a pipeline for the Rust decompiler; (iii) we implemented some simplification passes to simplify redundant memory operation details into high-level Rust code.

This research project is still in progress. Our next steps are (i) recovering data-specific data types like Array and Slices, (ii) Rust calling convention recovery, (iii) security check and error handling simplification, (iv) control flow de-optimization, and (v) syntax re-sugaring. Our final goal is to successfully recover semantically-equivalent and understandable Rust pseudocode.

7/15/2024 (Apr-Jun 2024): Our research on Rust decompilation aims to develop a Rust decompiler on top of the angr C/C++ decompiler to generate semantically equivalent Rust pseudocode. The milestones we have achieved during this reporting period are (i) we are now able to collect function prototypes and struct definitions from open-source Rust packages to facilitate decompilation; (ii) we have some simplification passes for simplifying error handling; (iii) we completed a pipeline for the Rust decompiler and will start writing the paper soon.

This research project is still in progress. Our next steps are (i) Rust callsite simplification; (ii) syntax re-sugaring; (iii) control flow deoptimization. Our final goal is to successfully recover semantically-equivalent and understandable Rust pseudocode.

10/15/2024 (July-Sep 2024): Our research on Rust decompilation aims to develop a Rust decompiler on top of the angr C/C++ decompiler to generate semantically equivalent Rust pseudocode. The whole decompilation pipeline is now built and working. The milestones we have achieved during this reporting period are (i) better type recovery for Rust decompilation: we are now able to recover the struct return types and struct argument types of a function; (ii) better control flow and data flow simplification that significantly reduces lines of code, number of variables, and so on; (iii) a completed pipeline for the Rust decompiler.

This research project is still in progress. Our next step is to build a Rust decompiler that is able to handle a specific scope of Rust binaries. Our final goal is to successfully recover semantically-equivalent and understandable Rust pseudocode.

Binary Type Inference (Now complete)

2/22/2024 (Start of project through Dec 2023): Our research focuses on binary type inference, a core research challenge in binary program analysis and reverse engineering. It concerns identifying the data types of registers and memory values in a stripped executable (or object file), whose type information is discarded during compilation.

We propose a novel graph-based representation of data-flow information that allows a synergistic combination of a data-flow analysis and a graph neural network (GNN) model to balance scalability and accuracy. We implement it as a system whose GNN model is trained on a large data set of binary functions to infer the types of program variables in unseen functions from stripped binaries. We have also demonstrated the effectiveness of our approach by extensively evaluating it on a large data set. Our approach demonstrates an overall accuracy of 76.6% and struct type accuracy of 45.2% on the x64 dataset across four optimization levels (O0-O3) and outperforms existing works by a minimum of 26.1% in overall accuracy and 10.2% in struct type accuracy. The research paper is currently under review.
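For readers unfamiliar with the two metrics, the sketch below shows one plausible way to compute them. It rests on assumptions not stated in this report: overall accuracy is taken as the fraction of variables whose predicted type exactly matches the ground truth, and struct type accuracy as the same fraction restricted to variables whose ground-truth type is a struct. The function names and the sample data are illustrative only and are not results from the evaluation.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (predicted type, ground-truth type)


def accuracy(pairs: Iterable[Pair]) -> float:
    """Fraction of pairs whose predicted type equals the ground-truth type."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(pred == truth for pred, truth in pairs) / len(pairs)


def struct_accuracy(pairs: Iterable[Pair]) -> float:
    """Accuracy restricted to variables whose ground-truth type is a struct."""
    return accuracy((pred, truth) for pred, truth in pairs if truth.startswith("struct"))


if __name__ == "__main__":
    # Toy data for illustration only.
    sample: List[Pair] = [
        ("int", "int"),
        ("char *", "char *"),
        ("struct a", "struct a"),
        ("int", "struct b"),
    ]
    print(accuracy(sample))         # 3 of 4 predictions match
    print(struct_accuracy(sample))  # 1 of 2 struct variables match
```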

4/15/2024 (Jan-Mar 2024): Binary Type Inference: We have completed the evaluation and finished writing the paper. The paper has been submitted and is under review.

7/15/2024 (Apr-Jun 2024): TYGR: TYGR was accepted to USENIX Security 2024 and we will present it in August.

10/15/2024 (July-Sep 2024): TYGR: TYGR was accepted to USENIX Security 2024. We have presented the paper at the conference and have also open-sourced the tool: https://github.com/sefcom/TYGR

VarBERT (complete as of 7/15/2024):

Published paper: “Len or index or count, anything but v1”: Predicting Variable Names in Decompilation Output with Transfer Learning

Pal, Kuntal Kumar, et al. "“Len or index or count, anything but v1”: Predicting Variable Names in Decompilation Output with Transfer Learning." 2024 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 2024.

SAILR (complete as of 7/15/2024):

Published paper: Ahoy SAILR! There is No Need to DREAM of C: A Compiler-Aware Structuring Algorithm for Binary Decompilation

Basque, Zion Leonahenahe, et al. "Ahoy SAILR! There is No Need to DREAM of C: A Compiler-Aware Structuring Algorithm for Binary Decompilation." 33rd USENIX Security Symposium (USENIX Security 24). 2024.

Publications and presentations

Add publication reference in the publications section below. An authors copy or final should be added in the report file(s) section. This is for NSA's review only.
Optionally, upload technical presentation slides that may go into greater detail. For NSA's review only.

No new published papers since last quarterly report.

Lead PI:

Yan Shoshitaishvili

Co-PI(s):

Adam Doupé