Binary Software Composition Analysis with CodeSentry
Abstract
Most modern software systems have significant third-party dependencies, which often contain exploitable vulnerabilities. Once a vulnerability is disclosed, a race begins between malicious actors trying to exploit it and the defenders of critical infrastructure. The recent Log4j disclosure is just one of many examples from the last few years. Deployed systems must be continuously scanned for known vulnerabilities and patched before attackers breach them. Keeping track of third-party dependencies, their versions, and their associated vulnerabilities is challenging, but it is essential for mapping a software system’s supply chain. The task is made harder by the fact that much of today’s software is distributed in binary form, without a comprehensive software bill of materials (SBOM) enumerating its dependencies. SBOMs are now the subject of legislation and regulation, e.g., the May 12, 2021 executive order on improving the nation’s cybersecurity, and they are expected to be required by compliance directives in the future.
To address this problem, we have developed CodeSentry, a deep binary scanner that identifies known vulnerable components in binaries. CodeSentry combines lightweight binary analysis and machine learning to reliably identify third-party components in software and their associated vulnerabilities, providing a comprehensive cybersecurity assessment and helping cyber defenders prioritize risks.
In this presentation we will discuss two of the analysis techniques that CodeSentry uses to identify software components in binaries. These techniques must cope with the fact that compilation options, such as the choice of compiler and optimization flags, introduce substantial variability in the binary code generated from the same source code. They also need to be accurate and lightweight so that modern software deployments containing hundreds of binaries can be scanned within minutes. The first technique, Strlibid, extracts component signatures from the strings in a binary. Strings provide useful information for identifying binary components: they are relatively easy to extract, and they are typically not modified by the compilation process. Nonetheless, to turn strings into an effective component identification signature, several information retrieval techniques must be applied to filter and weight them. The second technique, Embedlibid, uses function embeddings as signatures. Embedlibid computes function embeddings from a function’s assembly listing using Siamese deep neural networks, which are trained to produce similar embeddings for similar functions and can therefore be used to identify library functions in binaries. We will discuss some of the challenges of training such networks, as well as how function similarity scores can be lifted into software component matches.
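The abstract does not spell out which information retrieval techniques Strlibid applies. One common way to filter and weight strings is inverse-document-frequency weighting, so that strings shared by many libraries contribute little to a match. The Python sketch below is purely illustrative: the string corpus, function names, and scoring scheme are our own assumptions, not the CodeSentry implementation. It extracts printable strings from a binary and scores candidate library versions by the weighted fraction of their signature strings found in it.

```python
import math
import re
from collections import Counter

# Hypothetical signature corpus: strings previously extracted from known
# library builds (in practice, built under many compiler/flag combinations).
LIBRARY_STRINGS = {
    "zlib-1.2.11": {
        "inflate 1.2.11 Copyright 1995-2017 Mark Adler",
        "incorrect header check",
        "invalid distance too far back",
    },
    "libpng-1.6.37": {
        "libpng version 1.6.37",
        "Not a PNG file",
        "IHDR: CRC error",
    },
}

def extract_strings(data: bytes, min_len: int = 6) -> set:
    """Extract printable ASCII runs, like the Unix `strings` utility."""
    pattern = rb"[ -~]{%d,}" % min_len
    return {m.group().decode("ascii") for m in re.finditer(pattern, data)}

def idf_weights(library_strings: dict) -> dict:
    """Weight each string by inverse document frequency: strings that occur
    in many libraries carry little identifying signal."""
    n_libs = len(library_strings)
    df = Counter(s for strings in library_strings.values() for s in strings)
    return {s: math.log(n_libs / df[s]) + 1.0 for s in df}

def match_components(binary_strings: set, library_strings: dict) -> list:
    """Score each candidate library by the IDF-weighted fraction of its
    signature strings that appear in the scanned binary."""
    weights = idf_weights(library_strings)
    scores = {}
    for lib, strings in library_strings.items():
        hit = sum(weights[s] for s in strings & binary_strings)
        total = sum(weights[s] for s in strings)
        scores[lib] = hit / total if total else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Usage: scan one binary and rank candidate components.
with open("target_binary", "rb") as f:
    print(match_components(extract_strings(f.read()), LIBRARY_STRINGS))
```

Similarly, the abstract leaves Embedlibid’s architecture and training details open. As a rough sketch only, and assuming a PyTorch setting, the code below pairs a small token-embedding encoder with a cosine-based Siamese objective: pairs of functions compiled from the same source (possibly with different compilers or optimization flags) are labeled +1 and pulled together, while unrelated pairs are labeled -1 and pushed apart. The architecture, tokenization, and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsmEncoder(nn.Module):
    """Toy Siamese branch: embeds assembly tokens and mean-pools them into a
    fixed-size, L2-normalized vector (real encoders are typically richer,
    e.g., sequence or graph models over the disassembly)."""
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        mask = (token_ids != 0).unsqueeze(-1).float()
        pooled = (self.embed(token_ids) * mask).sum(1) / mask.sum(1).clamp(min=1.0)
        return F.normalize(self.proj(pooled), dim=-1)

def train_step(encoder, optimizer, left, right, labels, margin=0.3):
    """One Siamese update. `labels` is +1 for pairs compiled from the same
    source function and -1 for unrelated pairs; the cosine embedding loss
    pulls positives together and pushes negatives apart."""
    loss = nn.CosineEmbeddingLoss(margin=margin)(encoder(left), encoder(right), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with random stand-in data: batches of tokenized assembly listings.
encoder = AsmEncoder(vocab_size=10_000)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
left = torch.randint(1, 10_000, (32, 200))
right = torch.randint(1, 10_000, (32, 200))
labels = (torch.rand(32) > 0.5).float() * 2 - 1
print(train_step(encoder, optimizer, left, right, labels))
```

At scan time, embeddings of functions from a scanned binary would be compared against a database of library-function embeddings, for example by nearest-neighbor search, and the resulting per-function similarity scores lifted into component-level matches, as the abstract outlines.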
--
Antonio Flores Montoya is a Senior Scientist at GrammaTech. He earned his PhD in Computer Science from the Technical University of Darmstadt (Germany). His research interests are in static program analysis, binary analysis and rewriting, and machine learning applications to binary analysis.
He is one of the lead developers of CodeSentry, where he focuses on using neural networks to compute embeddings of binary procedures. These embeddings are used to find third-party library components in binaries and their associated known vulnerabilities. Dr. Flores Montoya is also the lead developer of ddisasm, a state-of-the-art open-source disassembler that produces reassembleable assembly, thus enabling low-overhead binary rewriting. Finally, he currently serves as the Principal Investigator of GrammaTech’s team for ReMath, a DARPA program focused on extracting high-level mathematical representations from cyber-physical algorithms encoded in binaries.