The recent growth and proliferation of malware have tested practitioners’ ability to promptly classify new samples. Compared with labor-intensive reverse engineering, machine learning approaches have demonstrated greater speed and accuracy. However, most existing malware classifiers must be trained on a large number of samples that have been painstakingly analyzed and labeled by hand. Furthermore, as novel malware samples arise that fall outside the scope of the training set, additional reverse engineering effort is needed to update it. The sheer volume of new samples found in the wild places substantial pressure on practitioners’ ability to reverse engineer enough malware to adequately train modern classifiers.
To address this problem, we propose a three-pronged approach. First, we will leverage data mixing techniques to generate novel, plausibly realistic malware samples by mixing the feature representations of pairs of malware samples. In contrast to rudimentary perturbation techniques, our semantics-aware augmentation will generate novel samples that correspond to plausible malicious binaries. Second, we will leverage neural network verification techniques to analyze and improve classification robustness, ensuring a specified level of coverage in the input and feature spaces of malware binaries. This will yield improved classification boundaries in the feature space and, in turn, more accurate malware classification. Third, we will develop a search-based malware evolution engine that generates additional novel malicious binaries. Existing data augmentation techniques operate in the feature space or on an abstract representation, so the augmented samples do not necessarily correspond to real, functioning binaries. We will therefore leverage automated program repair techniques to generate new malware samples, guiding an evolutionary search toward variants that evade classification.
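As a rough illustration of the first prong, the sketch below mixes the feature vectors of two labeled samples by convex combination, in the style of mixup. The proposal does not specify its mixing rule, so the interpolation scheme, the Beta-distributed mixing weight, the `alpha` parameter, and the function name `mix_pair` are all assumptions made for illustration. The sketch also omits the semantics-aware constraints that the proposed approach would add.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mix_pair(
    x_a: Sequence[float],
    y_a: Sequence[float],
    x_b: Sequence[float],
    y_b: Sequence[float],
    alpha: float = 0.4,
    rng: Optional[random.Random] = None,
) -> Tuple[List[float], List[float], float]:
    """Mix two feature vectors and their one-hot labels (mixup-style).

    Hypothetical helper: the mixing rule and `alpha` are assumptions,
    not the method described in the proposal.
    """
    if len(x_a) != len(x_b) or len(y_a) != len(y_b):
        raise ValueError("feature and label vectors must have matching lengths")
    rng = rng or random.Random()
    # Draw the mixing weight from Beta(alpha, alpha); the result lies in [0, 1].
    lam = rng.betavariate(alpha, alpha)
    # Convex combination of the features, and of the labels with the same weight.
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


if __name__ == "__main__":
    # Toy feature vectors for two samples from different (hypothetical) families.
    features_a, label_a = [0.9, 0.1, 0.0], [1.0, 0.0]
    features_b, label_b = [0.2, 0.7, 0.5], [0.0, 1.0]
    mixed_x, mixed_y, weight = mix_pair(
        features_a, label_a, features_b, label_b, rng=random.Random(0)
    )
    print(weight, mixed_x, mixed_y)
```

Because the same weight is applied to the features and to the labels, the mixed label records how much each parent sample contributed. A plain interpolation like this gives no guarantee that the mixed vector corresponds to a functioning binary, which is the gap the third prong is meant to address.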
Together, these research thrusts will improve the accuracy and robustness of existing malware classifiers by focusing on data quality, which means the findings will generalize across model architectures and feature representations. Moreover, our approach will enable classification with drastically fewer manually labeled malware samples. In turn, we will reduce the effort spent reverse engineering and labeling samples, focusing instead on the automatic generation of novel samples that augment a small seed training set. Overall, this project will address not only the critical need for high-quality malware classification but also the ever-growing need for scalable classification, given the sheer volume of new malware threats discovered in the wild each year.
The proposal consists of three thrusts spanning three years.
In the first year, we will develop augmentation techniques that generate novel malware samples by mixing pairs of labeled samples. In the second year, we will develop robustness analysis techniques that perturb malware samples to ensure adequate coverage of classification boundaries. Finally, in the third year, we will develop an evolutionary search that generates realistic novel malware samples intended to evade classification. Each thrust will include an internal evaluation to confirm that progress is being made and that goals are being met with respect to malware classification accuracy, robustness, and efficiency.