The goal of our research is to achieve binary software understanding: the ability to recover both the semantic meaning of the original source code and the intention of the developer. We envision that this will help human analysts reverse engineer software, identify vulnerabilities, decode and defang malware, and patch legacy software, and that it will help automated techniques make sense of the vast amount of binary software.
We aim to significantly advance the state of the art in decompilation by leveraging machine learning techniques to achieve semantically equivalent decompilation of binary software. Our focus here is on generating a representative dataset of original source code paired with its compiled binary code for training machine learning models. Our preliminary work in this area has shown that existing techniques created improper datasets, which significantly affected both the resulting models and their evaluation. We also aim to fundamentally improve how decompilation techniques are evaluated.