Research Team Status

  • Names and positions of researchers
    (e.g. Research Scientist, PostDoc, Student (Undergrad/Masters/PhD))
    • Kailani "Cai" Lemieux-Mack, PhD student
    • Thuy Dung "Judy" Nguyen, PhD student
    • Kevin Leach, Assistant Professor
    • Taylor Johnson, Associate Professor
  • Any new collaborations with other universities/researchers?

    We have established contact with an external faculty member, Kexin Pei at the University of Chicago, whose binary analysis research is relevant to our malware classification work.

Project Goals

  • What is the current project goal?
    • The current goal of the project is to develop techniques for augmenting malware samples in feature space to enhance the performance of malware classifiers (and neural classifiers more generally).  In particular, our technique partially addresses the challenge of labeling a sufficient volume of malware to adequately train a classification model.  By carefully synthesizing novel samples that lie nearby in the feature space, we can produce a much larger volume of labeled malware.  We hypothesize that augmenting samples with careful consideration of the types of features being augmented improves the downstream performance of classifiers trained on such data (a simplified sketch of this idea follows this list).
  • How does the current goal factor into the long-term goal of the project?
    • Accomplishing the current goal addresses two long-term goals of this project.  First, the augmentation approach will help with neural network verification: the augmented samples can be treated as "hard examples" that are difficult to classify, and neural network verification techniques frequently rely on reasoning about samples near an input sample in the feature space.  Second, the samples generated in the feature space will later serve as starting points for distillation back into the space of executable binaries.  In doing so, we can more faithfully generate realistic, challenging samples that further improve malware classifier performance.
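
    The sketch below illustrates the augmentation idea from the current-goal bullet above: new labeled points are synthesized by linearly interpolating between two feature vectors drawn from the same malware family, so the synthetic points inherit that family's label.  This is a minimal, hypothetical Python sketch; the names and parameters (augment_pair, n_new, the 64-dimensional vectors) are illustrative and do not come from the project's implementation.

        # Hypothetical illustration only; not the project's actual augmentation code.
        import numpy as np

        def augment_pair(x_a, x_b, rng, n_new=4):
            """Synthesize labeled points near two same-family samples by linear interpolation."""
            alphas = rng.uniform(0.0, 1.0, size=n_new)              # mixing coefficients in [0, 1]
            return np.stack([a * x_a + (1.0 - a) * x_b for a in alphas])

        rng = np.random.default_rng(0)
        x1 = rng.normal(size=64)                 # feature vector of sample A
        x2 = rng.normal(size=64)                 # feature vector of sample B, same family as A
        synthetic = augment_pair(x1, x2, rng)    # shape (4, 64); each row reuses the family label

    Because both endpoints carry the same family label, every interpolated point can be assigned that label, which is how a small labeled set is expanded.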

Accomplishments

  • Address whether project milestones were met. If milestones were not met, explain why, and what are the next steps.
    • Following our proposal milestones, we are on track with respect to our current tasks:
      • Task 1-A: Feature extraction pipeline.  We have developed an initial prototype for extracting features from malware samples in the BODMAS malware family classification dataset.  We extract features in two categories: (1) interpolatable features, which can be linearly interpolated during augmentation (e.g., entropy, filesize), and (2) non-interpolatable features, which are unordered or otherwise cannot be meaningfully interpolated (e.g., file hashes, loaded libraries); a simplified sketch of this split appears at the end of this section.  The featurized malware samples are embedded in Task 1-B.
      • Task 1-B: Development of malware embeddings.  Once their features are extracted, the malware samples are mapped into an embedding space.  These embeddings serve as the basis for the augmentation technique we are developing in Task 1-C; the augmentation operates with respect to this embedding space.
         
  • What is the contribution to foundational cybersecurity research? Was there something discovered or confirmed?
    • This project contributes to foundational cybersecurity research in several ways.  First, by improving low-resource malware classifier performance (e.g., in zero-, one-, and few-shot settings), we can reduce the overall engineering effort required to fully understand new malware families and novel malware capabilities.  Second, our techniques can be applied more broadly to neural network classifiers in general, enabling the creation of more robust AI models.  Third, these techniques contribute to the verification of neural networks, allowing more rigorous evaluation and vetting of complex neural networks.
       
  • Impact of research
    • Internal to the university (coursework/curriculum)
      • Developed a half-lecture on machine learning applications to malware detection and classification, to be delivered during the Spring 2024 semester.
    • External to the university (transition to industry/government (local/federal); patents, start-ups, software, etc.)
      • Invited talk: Workshop on Trustworthy AI-Enabled Cyber-Physical Systems, 85th IFIP Working Group 10.4 Meeting, St. Simons Island, GA, USA, February 2, 2024.
      • Organizing VNN-COMP, a competition for verification of neural networks.  This year, we will propose malware classifiers from our papers as benchmarks. 
    • Any acknowledgements, awards, or references in media?
      • N/A
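
    As a purely illustrative companion to Task 1-A above, the sketch below shows one way the interpolatable / non-interpolatable feature split could be represented in Python; the class and function names are hypothetical, and the feature names are taken only from the examples listed in Task 1-A.

        # Hypothetical illustration only; not the project's actual feature extraction pipeline.
        from dataclasses import dataclass, field
        from typing import Dict

        @dataclass
        class SampleFeatures:
            # Interpolatable: numeric values that may be linearly mixed during augmentation.
            interpolatable: Dict[str, float] = field(default_factory=dict)
            # Non-interpolatable: unordered/categorical values that augmentation leaves intact.
            non_interpolatable: Dict[str, object] = field(default_factory=dict)

        def extract_features(entropy, filesize, sha256, libraries):
            """Route each raw feature into the category that controls how augmentation may treat it."""
            return SampleFeatures(
                interpolatable={"entropy": float(entropy), "filesize": float(filesize)},
                non_interpolatable={"sha256": sha256, "libraries": list(libraries)},
            )

        sample = extract_features(entropy=6.8, filesize=204800,
                                  sha256="<hash placeholder>", libraries=["kernel32.dll", "ws2_32.dll"])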

 

Publications and presentations

  • Add publication references in the publications section below.  An author's copy or the final version should be added in the report file(s) section.  This is for NSA's review only.
  • Optionally, upload technical presentation slides that may go into greater detail. For NSA's review only.