Malware Detection Using Features from Static Disassembly

ABSTRACT

Machine learning has emerged as an important tool for malware detection and classification, with the potential to generalize to previously unseen malware that can evade traditional signature-based techniques. However, many existing machine learning approaches for malware detection rely primarily on inexpensive parsing-based features, or on limited static or dynamic analyses, which may fail to capture deeper structural and semantic characteristics of malicious binaries. As malware continues to evolve to evade detection, there is a need for richer feature representations that reflect program behavior and structure while remaining scalable to large datasets.

In this talk, we describe GrammaTech's work on malware detection using sophisticated features extracted at scale from static disassembly of PE32 Windows binaries. We show that by incorporating features derived from static disassembly using GrammaTech's state-of-the-art disassembler DDisasm¹, in conjunction with features based on binary parsing and capability labeling using Mandiant's CAPA², we can substantially improve malware classification accuracy and robustness. By extracting detailed control-flow, instruction-level, and semantic patterns from binaries, we can detect characteristics of binaries that generalize well across datasets, helping ensure the models remain effective against new threats.
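
As a rough illustration of how the three feature families might be merged into one input for a classifier, the Python sketch below combines a mnemonic histogram (standing in for disassembly-derived features), a few header fields (standing in for parsing-based features), and a set of capability labels. Every function name, feature name, and sample value here is hypothetical; the sketch does not call DDisasm or CAPA and is not the pipeline presented in the talk.

```python
from collections import Counter


def mnemonic_histogram(instructions, vocabulary):
    """Relative frequency of each mnemonic in `vocabulary`.

    `instructions` is a list of text lines such as "mov eax, 1"
    (an assumed format, not real DDisasm output).
    """
    counts = Counter(ins.split()[0].lower() for ins in instructions if ins.strip())
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {f"disasm.mnemonic.{m}": counts[m] / total for m in vocabulary}


def capability_flags(labels, known_capabilities):
    """One binary flag per known capability label (label names are assumed)."""
    return {f"capability.{c}": 1.0 if c in labels else 0.0 for c in known_capabilities}


def combine_features(instructions, header_fields, labels, vocabulary, known_capabilities):
    """Merge the three feature families into one named feature vector."""
    features = {}
    features.update(mnemonic_histogram(instructions, vocabulary))
    features.update({f"parse.{k}": float(v) for k, v in header_fields.items()})
    features.update(capability_flags(labels, known_capabilities))
    return features


# Hypothetical sample values, for illustration only.
sample = combine_features(
    instructions=["push ebp", "mov ebp, esp", "call 0x401000", "mov eax, 1"],
    header_fields={"num_sections": 4, "num_imports": 12},
    labels={"create process"},
    vocabulary=["mov", "call", "push", "xor"],
    known_capabilities=["create process", "encrypt data"],
)
```

Prefixing each feature name with its family (`disasm.`, `parse.`, `capability.`) keeps the origin of each feature visible, which helps when comparing the contribution of each family later.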

We also perform feature importance analysis of trained malware detection models to generate insights into salient features for classifying and analyzing malware. Our findings indicate that disassembly-based features provide signals that are complementary to traditional parsing-based features. These features can improve the robustness of models across binary datasets from different sources and time periods, reducing overfitting and maintaining high detection rates against previously unseen threats. In addition, we can achieve near-peak performance with a relatively small subset of informative features, suggesting a practical path toward efficient deployment.
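
The abstract does not say which feature importance method is used. As one common possibility, the Python sketch below computes permutation importance: each feature column is shuffled in turn, and the resulting drop in accuracy is taken as that feature's importance. The toy model, the synthetic data, and all names are assumptions made for illustration, not the models or datasets from the talk.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the correct label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Synthetic toy data: the label depends only on feature 0; feature 1 is noise.
data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]


def model(row):
    """Stand-in for a trained classifier; it looks only at feature 0."""
    return int(row[0] > 0.5)


importances = permutation_importance(model, rows, labels)
# Feature indices ordered from most to least important.
ranked = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)
```

Retraining on only the top-ranked features and comparing accuracy against the full model is one way to check how small a feature subset can be while staying near peak performance.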

Our research demonstrates how machine learning can be combined with rigorous static program analysis to augment malware detection capabilities and generate insights for malware experts. The results suggest that static disassembly-based features can play a meaningful role in improving the robustness, interpretability, and scalability of AI-assisted malware detection systems within high-confidence software and security workflows.

¹Flores-Montoya, A., & Schulte, E. (2020). Datalog disassembly. In 29th USENIX Security Symposium (USENIX Security 20) (pp. 1075-1092).

²https://github.com/mandiant/capa

BIO

Dr. Akshay Sood is a senior scientist at GrammaTech. He received a doctorate in Computer Science in 2021 from the University of Wisconsin-Madison, where he was advised by Dr. Mark Craven. His research interests span diverse application areas of trustworthy machine learning, including software analysis, malware detection, and clinical risk prediction. His doctoral research explored methods that help engender trust and transparency in black-box models by interpreting or explaining their decision-making, with a focus on models in the healthcare domain. At GrammaTech, his work focuses on applications of machine learning to software analysis and security.

Submitted by Katie Dey on Mon, 04/06/2026 - 09:35