Publications | Science of Security Virtual Organization

A Survey of Explainable Graph Neural Networks for Cyber Malware Analysis

Malware Analysis and Graph Theory - Malicious cybersecurity activities have become increasingly worrisome for individuals and companies alike. While machine learning methods like Graph Neural Networks (GNNs) have proven successful on the malware detection task, their output is often difficult to understand. Explainable malware detection methods are needed to automatically identify malicious programs and present results to malware analysts in a way that is human interpretable. In this survey, we outline a number of GNN explainability methods and compare their performance on a real-world malware detection dataset. Specifically, we formulated the detection problem as a graph classification problem on the malware Control Flow Graphs (CFGs). We find that gradient-based methods outperform perturbation-based methods in terms of computational expense and performance on explainer-specific metrics (e.g., Fidelity and Sparsity). Our results provide insights into designing new GNN-based models for cyber malware detection and attribution.

Authored by Dana Warmsley, Alex Waagen, Jiejun Xu, Zhining Liu, Hanghang Tong

Malware analysis and multi-label category detection issues: Ensemble-based approaches

Malware Analysis - Detection of malware and security attacks is a complex process that can vary in its details and analysis activities. As part of the detection process, malware scanners try to categorize a malware once it is detected under one of the known malware categories (e.g. worms, spywares, viruses, etc.). However, many studies and researches indicate problems with scanners categorizing or identifying a particular malware under more than one malware category. This paper, and several others, show that machine learning can be used for malware detection especially with ensemble base prediction methods. In this paper, we evaluated several custom-built ensemble models. We focused on multi-label malware classification as individual or classical classifiers showed low accuracy in such territory.This paper showed that recent machine models such as ensemble and deep learning can be used for malware detection with better performance in comparison with classical models. This is very critical in such a dynamic and yet important detection systems where challenges such as the detection of unknown or zero-day malware will continue to exist and evolve.

Authored by Izzat Alsmadi, Bilal Al-Ahmad, Mohammad Alsmadi

GNN-Based Malicious Network Entities Identification In Large-Scale Network Data

A reliable database of Indicators of Compromise (IoC’s) is a cornerstone of almost every malware detection system. Building the database and keeping it up-to-date is a lengthy and often manual process where each IoC should be manually reviewed and labeled by an analyst. In this paper, we focus on an automatic way of identifying IoC’s intended to save analysts’ time and scale to the volume of network data. We leverage relations of each IoC to other entities on the internet to build a heterogeneous graph. We formulate a classification task on this graph and apply graph neural networks (GNNs) in order to identify malicious domains. Our experiments show that the presented approach provides promising results on the task of identifying high-risk malware as well as legitimate domains classification.

Authored by Stepan Dvorak, Pavel Prochazka, Lukas Bajer

CFGExplainer: Explaining Graph Neural Network-Based Malware Classification from Control Flow Graphs

With the ever increasing threat of malware, extensive research effort has been put on applying Deep Learning for malware classification tasks. Graph Neural Networks (GNNs) that process malware as Control Flow Graphs (CFGs) have shown great promise for malware classification. However, these models are viewed as black-boxes, which makes it hard to validate and identify malicious patterns. To that end, we propose CFG-Explainer, a deep learning based model for interpreting GNN-oriented malware classification results. CFGExplainer identifies a subgraph of the malware CFG that contributes most towards classification and provides insight into importance of the nodes (i.e., basic blocks) within it. To the best of our knowledge, CFGExplainer is the first work that explains GNN-based mal-ware classification. We compared CFGExplainer against three explainers, namely GNNExplainer, SubgraphX and PGExplainer, and showed that CFGExplainer is able to identify top equisized subgraphs with higher classification accuracy than the other three models.

Authored by Jerome Herath, Priti Wakodikar, Ping Yang, Guanhua Yan

Representation Learning with Function Call Graph Transformations for Malware Open Set Recognition

Open set recognition (OSR) problem has been a challenge in many machine learning (ML) applications, such as security. As new/unknown malware families occur regularly, it is difficult to exhaust samples that cover all the classes for the training process in ML systems. An advanced malware classification system should classify the known classes correctly while sensitive to the unknown class. In this paper, we introduce a self-supervised pre-training approach for the OSR problem in malware classification. We propose two transformations for the function call graph (FCG) based malware representations to facilitate the pretext task. Also, we present a statistical thresholding approach to find the optimal threshold for the unknown class. Moreover, the experiment results indicate that our proposed pre-training process can improve different performances of different downstream loss functions for the OSR problem.

Authored by Jingyun Jia, Philip Chan

Detection of Botnets in IoT Networks using Graph Theory and Machine Learning

The Internet of things (IoT) is proving to be a boon in granting internet access to regularly used objects and devices. Sensors, programs, and other innovations interact and trade information with different gadgets and frameworks over the web. Even in modern times, IoT gadgets experience the ill effects of primary security threats, which expose them to many dangers and malware, one among them being IoT botnets. Botnets carry out attacks by serving as a vector and this has become one of the significant dangers on the Internet. These vectors act against associations and carry out cybercrimes. They are used to produce spam, DDOS attacks, click frauds, and steal confidential data. IoT gadgets bring various challenges unlike the common malware on PCs and Android devices as IoT gadgets have heterogeneous processor architecture. Numerous researches use static or dynamic analysis for detection and classification of botnets on IoT gadgets. Most researchers haven t addressed the multi-architecture issue and they use a lot of computing resources for analyzing. Therefore, this approach attempts to classify botnets in IoT by using PSI-Graphs which effectively addresses the problem of encryption in IoT botnet detection, tackles the multi-architecture problem, and reduces computation time. It proposes another methodology for describing and recognizing botnets utilizing graph-based Machine Learning techniques and Exploratory Data Analysis to analyze the data and identify how separable the data is to recognize bots at an earlier stage so that IoT devices can be prevented from being attacked.

Authored by Putsa Pranav, Sachin Verma, Sahana Shenoy, S. Saravanan

A Survey of Explainable Graph Neural Networks for Cyber Malware Analysis

Malicious cybersecurity activities have become increasingly worrisome for individuals and companies alike. While machine learning methods like Graph Neural Networks (GNNs) have proven successful on the malware detection task, their output is often difficult to understand. Explainable malware detection methods are needed to automatically identify malicious programs and present results to malware analysts in a way that is human interpretable. In this survey, we outline a number of GNN explainability methods and compare their performance on a real-world malware detection dataset. Specifically, we formulated the detection problem as a graph classification problem on the malware Control Flow Graphs (CFGs). We find that gradient-based methods outperform perturbation-based methods in terms of computational expense and performance on explainer-specific metrics (e.g., Fidelity and Sparsity). Our results provide insights into designing new GNN-based models for cyber malware detection and attribution.

Authored by Dana Warmsley, Alex Waagen, Jiejun Xu, Zhining Liu, Hanghang Tong

Malware analysis and multi-label category detection issues: Ensemble-based approaches

Detection of malware and security attacks is a complex process that can vary in its details and analysis activities. As part of the detection process, malware scanners try to categorize a malware once it is detected under one of the known malware categories (e.g. worms, spywares, viruses, etc.). However, many studies and researches indicate problems with scanners categorizing or identifying a particular malware under more than one malware category. This paper, and several others, show that machine learning can be used for malware detection especially with ensemble base prediction methods. In this paper, we evaluated several custom-built ensemble models. We focused on multi-label malware classification as individual or classical classifiers showed low accuracy in such territory.This paper showed that recent machine models such as ensemble and deep learning can be used for malware detection with better performance in comparison with classical models. This is very critical in such a dynamic and yet important detection systems where challenges such as the detection of unknown or zero-day malware will continue to exist and evolve.

Authored by Izzat Alsmadi, Bilal Al-Ahmad, Mohammad Alsmadi

Objection!: Identifying Misclassified Malicious Activities with XAI

Many studies have been conducted to detect various malicious activities in cyberspace using classifiers built by machine learning. However, it is natural for any classifier to make mistakes, and hence, human verification is necessary. One method to address this issue is eXplainable AI (XAI), which provides a reason for the classification result. However, when the number of classification results to be verified is large, it is not realistic to check the output of the XAI for all cases. In addition, it is sometimes difficult to interpret the output of XAI. In this study, we propose a machine learning model called classification verifier that verifies the classification results by using the output of XAI as a feature and raises objections when there is doubt about the reliability of the classification results. The results of experiments on malicious website detection and malware detection show that the proposed classification verifier can efficiently identify misclassified malicious activities.

Authored by Koji Fujita, Toshiki Shibahara, Daiki Chiba, Mitsuaki Akiyama, Masato Uchida

Detecting and Classifying Self-Deleting Windows Malware Using Prefetch Files

Malware detection and analysis can be a burdensome task for incident responders. As such, research has turned to machine learning to automate malware detection and malware family classification. Existing work extracts and engineers static and dynamic features from the malware sample to train classifiers. Despite promising results, such techniques assume that the analyst has access to the malware executable file. Self-deleting malware invalidates this assumption and requires analysts to find forensic evidence of malware execution for further analysis. In this paper, we present and evaluate an approach to detecting malware that executed on a Windows target and further classify the malware into its associated family to provide semantic insight. Specifically, we engineer features from the Windows prefetch file, a file system forensic artifact that archives process information. Results show that it is possible to detect the malicious artifact with 99% accuracy; furthermore, classifying the malware into a fine-grained family has comparable performance to techniques that require access to the original executable. We also provide a thorough security discussion of the proposed approach against adversarial diversity.

Authored by Adam Duby, Teryl Taylor, Gedare Bloom, Yanyan Zhuang

NBP-MS: Malware Signature Generation Based on Network Behavior Profiling

With the proliferation of malware, the detection and classification of malware have been hot topics in the academic and industrial circles of cyber security, and the generation of malware signatures is one of the important research directions. In this paper, we propose NBP-MS, a method of signature generation that is based on network traffic generated by malware. Specifically, we utilize the network traffic generated by malware to perform fine-grained profiling of its network behaviors first, and then cluster all the profiles to generate network behavior signatures to classify malware, providing support for subsequent analysis and defense.

Authored by Zhixin Shi, Xiangyu Wang, Pengcheng Liu

Static Malware Analysis using PE Header files API

In today’s fast pacing world, cybercrimes have time and again proved to be one of the biggest hindrances in national development. According to recent trends, most of the times the victim’s data is breached by trapping it in a phishing attack. Security and privacy of user’s data has become a matter of tremendous concern. In order to address this problem and to protect the naive user’s data, a tool which may help to identify whether a window executable is malicious or not by doing static analysis on it has been proposed. As well as a comparative study has been performed by implementing different classification models like Logistic Regression, Neural Network, SVM. The static analysis approach used takes into parameters of the executables, analysis of properties obtained from PE Section Headers i.e. API calls. Comparing different model will provide the best model to be used for static malware analysis

Authored by Naman Aggarwal, Pradyuman Aggarwal, Rahul Gupta

Automation Detection of Malware and Stenographical Content using Machine Learning

In recent times, the occurrence of malware attacks are increasing at an unprecedented rate. Particularly, the image-based malware attacks are spreading worldwide and many people get harmful malware-based images through the technique called steganography. In the existing system, only open malware and files from the internet can be identified. However, the image-based malware cannot be identified and detected. As a result, so many phishers make use of this technique and exploit the target. Social media platforms would be totally harmful to the users. To avoid these difficulties, Machine learning can be implemented to find the steganographic malware images (contents). The proposed methodology performs an automatic detection of malware and steganographic content by using Machine Learning. Steganography is used to hide messages from apparently innocuous media (e.g., images), and steganalysis is the approach used for detecting this malware. This research work proposes a machine learning (ML) approach to perform steganalysis. In the existing system, only open malware and files from the internet are identified but in the recent times many people get harmful malware-based images through the technique called steganography. Social media platforms would be totally harmful to the users. To avoid these difficulties, the proposed Machine learning has been developed to appropriately detect the steganographic malware images (contents). Father, the steganalysis method using machine learning has been developed for performing logistic classification. By using this, the users can avoid sharing the malware images in social media platforms like WhatsApp, Facebook without downloading it. It can be also used in all the photo-sharing sites such as google photos.

Authored by Henry Samuel, Santhanam Kumar, R. Aishwarya, G. Mathivanan

Countermeasure Against Anti-Sandbox Technology Based on Activity Recognition

In order to prevent malicious environment, more and more applications use anti-sandbox technology to detect the running environment. Malware often uses this technology against analysis, which brings great difficulties to the analysis of applications. Research on anti-sandbox countermeasure technology based on application virtualization can solve such problems, but there is no good solution for sensor simulation. In order to prevent detection, most detection systems can only use real device sensors, which brings great hidden dangers to users’ privacy. Aiming at this problem, this paper proposes and implements a sensor anti-sandbox countermeasure technology for Android system. This technology uses the CNN-LSTM model to identify the activity of the real machine sensor data, and according to the recognition results, the real machine sensor data is classified and stored, and then an automatic data simulation algorithm is designed according to the stored data, and finally the simulation data is sent back by using the Hook technology for the application under test. The experimental results show that the method can effectively simulate the data characteristics of the acceleration sensor and prevent the triggering of anti-sandbox behaviors.

Authored by Jin Yang, Yunqing Liu

Analysing the Impact of Cyber-Threat to ICS and SCADA Systems

The aim of this paper is to examine noteworthy cyberattacks that have taken place against ICS and SCADA systems and to analyse them. This paper also proposes a new classification scheme based on the severity of the attack. Since the information revolution, computers and associated technologies have impacted almost all aspects of daily life, and this is especially true of the industrial sector where one of the leading trends is that of automation. This widespread proliferation of computers and computer networks has also made it easier for malicious actors to gain access to these systems and networks and carry out harmful activities.

Authored by Cheerag Kaura, Nidhi Sindhwani, Alka Chaudhary

Ransomware Classification and Detection With Machine Learning Algorithms

Malicious attacks, malware, and ransomware families pose critical security issues to cybersecurity, and it may cause catastrophic damages to computer systems, data centers, web, and mobile applications across various industries and businesses. Traditional anti-ransomware systems struggle to fight against newly created sophisticated attacks. Therefore, state-of-the-art techniques like traditional and neural network-based architectures can be immensely utilized in the development of innovative ransomware solutions. In this paper, we present a feature selection-based framework with adopting different machine learning algorithms including neural network-based architectures to classify the security level for ransomware detection and prevention. We applied multiple machine learning algorithms: Decision Tree (DT), Random Forest (RF), Naïve Bayes (NB), Logistic Regression (LR) as well as Neural Network (NN)-based classifiers on a selected number of features for ransomware classification. We performed all the experiments on one ransomware dataset to evaluate our proposed framework. The experimental results demonstrate that RF classifiers outperform other methods in terms of accuracy, F -beta, and precision scores.

Authored by Mohammad Masum, Md Faruk, Hossain Shahriar, Kai Qian, Dan Lo, Muhaiminul Adnan

Improving the Prediction Accuracy with Feature Selection for Ransomware Detection

This paper presents the machine learning algorithm to detect whether an executable binary is benign or ransomware. The ransomware cybercriminals have targeted our infrastructure, businesses, and everywhere which has directly affected our national security and daily life. Tackling the ransomware threats more effectively is a big challenge. We applied a machine-learning model to classify and identify the security level for a given suspected malware for ransomware detection and prevention. We use the feature selection data preprocessing to improve the prediction accuracy of the model.

Authored by Chulan Gao, Hossain Shahriar, Dan Lo, Yong Shi, Kai Qian

Analyzing Machine Learning-based Feature Selection for Botnet Detection

In this cyber era, the number of cybercrime problems grows significantly, impacting network communication security. Some factors have been identified, such as malware. It is a malicious code attack that is harmful. On the other hand, a botnet can exploit malware to threaten whole computer networks. Therefore, it needs to be handled appropriately. Several botnet activity detection models have been developed using a classification approach in previous studies. However, it has not been analyzed about selecting features to be used in the learning process of the classification algorithm. In fact, the number and selection of features implemented can affect the detection accuracy of the classification algorithm. This paper proposes an analysis technique for determining the number and selection of features developed based on previous research. It aims to obtain the analysis of using features. The experiment has been conducted using several classification algorithms, namely Decision tree, k-NN, Naïve Bayes, Random Forest, and Support Vector Machine (SVM). The results show that taking a certain number of features increases the detection accuracy. Compared with previous studies, the results obtained show that the average detection accuracy of 98.34% using four features has the highest value from the previous study, 97.46% using 11 features. These results indicate that the selection of the correct number and features affects the performance of the botnet detection model.

Authored by Winda Safitri, Tohari Ahmad, Dandy Hostiadi

Cyber Automated Network Resilience Defensive Approach against Malware Images

Cyber threats have been a major issue in the cyber security domain. Every hacker follows a series of cyber-attack stages known as cyber kill chain stages. Each stage has its norms and limitations to be deployed. For a decade, researchers have focused on detecting these attacks. Merely watcher tools are not optimal solutions anymore. Everything is becoming autonomous in the computer science field. This leads to the idea of an Autonomous Cyber Resilience Defense algorithm design in this work. Resilience has two aspects: Response and Recovery. Response requires some actions to be performed to mitigate attacks. Recovery is patching the flawed code or back door vulnerability. Both aspects were performed by human assistance in the cybersecurity defense field. This work aims to develop an algorithm based on Reinforcement Learning (RL) with a Convoluted Neural Network (CNN), far nearer to the human learning process for malware images. RL learns through a reward mechanism against every performed attack. Every action has some kind of output that can be classified into positive or negative rewards. To enhance its thinking process Markov Decision Process (MDP) will be mitigated with this RL approach. RL impact and induction measures for malware images were measured and performed to get optimal results. Based on the Malimg Image malware, dataset successful automation actions are received. The proposed work has shown 98% accuracy in the classification, detection, and autonomous resilience actions deployment.

Authored by Kainat Rizwan, Mudassar Ahmad, Muhammad Habib

Data Sanitization Approach to Mitigate Clean-Label Attacks Against Malware Detection Systems

Machine learning (ML) models are increasingly being used in the development of Malware Detection Systems. Existing research in this area primarily focuses on developing new architectures and feature representation techniques to improve the accuracy of the model. However, recent studies have shown that existing state-of-the art techniques are vulnerable to adversarial machine learning (AML) attacks. Among those, data poisoning attacks have been identified as a top concern for ML practitioners. A recent study on clean-label poisoning attacks in which an adversary intentionally crafts training samples in order for the model to learn a backdoor watermark was shown to degrade the performance of state-of-the-art classifiers. Defenses against such poisoning attacks have been largely under-explored. We investigate a recently proposed clean-label poisoning attack and leverage an ensemble-based Nested Training technique to remove most of the poisoned samples from a poisoned training dataset. Our technique leverages the relatively large sensitivity of poisoned samples to feature noise that disproportionately affects the accuracy of a backdoored model. In particular, we show that for two state-of-the art architectures trained on the EMBER dataset affected by the clean-label attack, the Nested Training approach improves the accuracy of backdoor malware samples from 3.42% to 93.2%. We also show that samples produced by the clean-label attack often successfully evade malware classification even when the classifier is not poisoned during training. However, even in such scenarios, our Nested Training technique can mitigate the effect of such clean-label-based evasion attacks by recovering the model's accuracy of malware detection from 3.57% to 93.2%.

Authored by Samson Ho, Achyut Reddy, Sridhar Venkatesan, Rauf Izmailov, Ritu Chadha, Alina Oprea