Publications | Science of Security Virtual Organization

Rotten Apples Spoil the Bunch: An Anatomy of Google Play Malware

Malware Analysis and Graph Theory - This paper provides an in-depth analysis of Android malware that bypassed the strictest defenses of the Google Play application store and penetrated the official Android market between January 2016 and July 2021. We systematically identified 1,238 such malicious applications, grouped them into 134 families, and manually analyzed one application from 105 distinct families. During our manual analysis, we identified malicious payloads the applications execute, conditions guarding execution of the payloads, hiding techniques applications employ to evade detection by the user, and other implementation-level properties relevant for automated malware detection. As most applications in our dataset contain multiple payloads, each triggered via its own complex activation logic, we also contribute a graph-based representation showing activation paths for all application payloads in form of a control- and data-flow graph. Furthermore, we discuss the capabilities of existing malware detection tools, put them in context of the properties observed in the analyzed malware, and identify gaps and future research directions. We believe that our detailed analysis of the recent, evasive malware will be of interest to researchers and practitioners and will help further improve malware detection tools.

Authored by Michael Cao, Khaled Ahmed, Julia Rubin

Mal-Bert-GCN: Malware Detection by Combining Bert and GCN

Malware Analysis and Graph Theory - With the dramatic increase in malicious software, the sophistication and innovation of malware have increased over the years. In particular, the dynamic analysis based on the deep neural network has shown high accuracy in malware detection. However, most of the existing methods only employ the raw API sequence feature, which cannot accurately reflect the actual behavior of malicious programs in detail. The relationship between API calls is critical for detecting suspicious behavior. Therefore, this paper proposes a malware detection method based on the graph neural network. We first connect the API sequences executed by different processes to build a directed process graph. Then, we apply Bert to encode the API sequences of each process into node embedding, which facilitates the semantic execution information inside the processes. Finally, we employ GCN to mine the deep semantic information based on the directed process graph and node embedding. In addition to presenting the design, we have implemented and evaluated our method on 10,000 malware and 10,000 benign software datasets. The results show that the precision and recall of our detection model reach 97.84\% and 97.83\%, verifying the effectiveness of our proposed method.

Authored by Zhenquan Ding, Hui Xu, Yonghe Guo, Longchuan Yan, Lei Cui, Zhiyu Hao

Detection of Botnets in IoT Networks using Graph Theory and Machine Learning

Malware Analysis and Graph Theory - The Internet of things (IoT) is proving to be a boon in granting internet access to regularly used objects and devices. Sensors, programs, and other innovations interact and trade information with different gadgets and frameworks over the web. Even in modern times, IoT gadgets experience the ill effects of primary security threats, which expose them to many dangers and malware, one among them being IoT botnets. Botnets carry out attacks by serving as a vector and this has become one of the significant dangers on the Internet. These vectors act against associations and carry out cybercrimes. They are used to produce spam, DDOS attacks, click frauds, and steal confidential data. IoT gadgets bring various challenges unlike the common malware on PCs and Android devices as IoT gadgets have heterogeneous processor architecture. Numerous researches use static or dynamic analysis for detection and classification of botnets on IoT gadgets. Most researchers haven t addressed the multi-architecture issue and they use a lot of computing resources for analyzing. Therefore, this approach attempts to classify botnets in IoT by using PSI-Graphs which effectively addresses the problem of encryption in IoT botnet detection, tackles the multi-architecture problem, and reduces computation time. It proposes another methodology for describing and recognizing botnets utilizing graph-based Machine Learning techniques and Exploratory Data Analysis to analyze the data and identify how separable the data is to recognize bots at an earlier stage so that IoT devices can be prevented from being attacked.

Authored by Putsa Pranav, Sachin Verma, Sahana Shenoy, S. Saravanan

A Survey of Explainable Graph Neural Networks for Cyber Malware Analysis

Malware Analysis and Graph Theory - Malicious cybersecurity activities have become increasingly worrisome for individuals and companies alike. While machine learning methods like Graph Neural Networks (GNNs) have proven successful on the malware detection task, their output is often difficult to understand. Explainable malware detection methods are needed to automatically identify malicious programs and present results to malware analysts in a way that is human interpretable. In this survey, we outline a number of GNN explainability methods and compare their performance on a real-world malware detection dataset. Specifically, we formulated the detection problem as a graph classification problem on the malware Control Flow Graphs (CFGs). We find that gradient-based methods outperform perturbation-based methods in terms of computational expense and performance on explainer-specific metrics (e.g., Fidelity and Sparsity). Our results provide insights into designing new GNN-based models for cyber malware detection and attribution.

Authored by Dana Warmsley, Alex Waagen, Jiejun Xu, Zhining Liu, Hanghang Tong

Detecting Malware Using Graph Embedding and DNN

Malware Analysis and Graph Theory - Nowadays, the popularity of intelligent terminals makes malwares more and more serious. Among the many features of application, the call graph can accurately express the behavior of the application. The rapid development of graph neural network in recent years provides a new solution for the malicious analysis of application using call graphs as features. However, there are still problems such as low accuracy. This paper established a large-scale data set containing more than 40,000 samples and selected the class call graph, which was extracted from the application, as the feature and used the graph embedding combined with the deep neural network to detect the malware. The experimental results show that the accuracy of the detection model proposed in this paper is 97.7\%; the precision is 96.6\%; the recall is 96.8\%; the F1-score is 96.4\%, which is better than the existing detection model based on Markov chain and graph embedding detection model.

Authored by Rui Wang, Jun Zheng, Zhiwei Shi, Yu Tan

Malware analysis and multi-label category detection issues: Ensemble-based approaches

Malware Analysis - Detection of malware and security attacks is a complex process that can vary in its details and analysis activities. As part of the detection process, malware scanners try to categorize a malware once it is detected under one of the known malware categories (e.g. worms, spywares, viruses, etc.). However, many studies and researches indicate problems with scanners categorizing or identifying a particular malware under more than one malware category. This paper, and several others, show that machine learning can be used for malware detection especially with ensemble base prediction methods. In this paper, we evaluated several custom-built ensemble models. We focused on multi-label malware classification as individual or classical classifiers showed low accuracy in such territory.This paper showed that recent machine models such as ensemble and deep learning can be used for malware detection with better performance in comparison with classical models. This is very critical in such a dynamic and yet important detection systems where challenges such as the detection of unknown or zero-day malware will continue to exist and evolve.

Authored by Izzat Alsmadi, Bilal Al-Ahmad, Mohammad Alsmadi

DynaMalDroid: Dynamic Analysis-Based Detection Framework for Android Malware Using Machine Learning Techniques

Malware Analysis - Android malware is continuously evolving at an alarming rate due to the growing vulnerabilities. This demands more effective malware detection methods. This paper presents DynaMalDroid, a dynamic analysis-based framework to detect malicious applications in the Android platform. The proposed framework contains three modules: dynamic analysis, feature engineering, and detection. We utilized the well-known CICMalDroid2020 dataset, and the system calls of apps are extracted through dynamic analysis. We trained our proposed model to recognize malware by selecting features obtained through the feature engineering module. Further, with these selected features, the detection module applies different Machine Learning classifiers like Random Forest, Decision Tree, Logistic Regression, Support Vector Machine, Naïve-Bayes, K-Nearest Neighbour, and AdaBoost, to recognize whether an application is malicious or not. The experiments have shown that several classifiers have demonstrated excellent performance and have an accuracy of up to 99\%. The models with Support Vector Machine and AdaBoost classifiers have provided better detection accuracy of 99.3\% and 99.5\%, respectively.

Authored by Hashida Manzil, Manohar S

Disparity Analysis Between the Assembly and Byte Malware Samples with Deep Autoencoders

Malware Analysis - Malware attacks in the cyber world continue to increase despite the efforts of Malware analysts to combat this problem. Recently, Malware samples have been presented as binary sequences and assembly codes. However, most researchers focus only on the raw Malware sequence in their proposed solutions, ignoring that the assembly codes may contain important details that enable rapid Malware detection. In this work, we leveraged the capabilities of deep autoencoders to investigate the presence of feature disparities in the assembly and raw binary Malware samples. First, we treated the task as outliers to investigate whether the autoencoder would identify and justify features as samples from the same family. Second, we added noise to all samples and used Deep Autoencoder to reconstruct the original samples by denoising. Experiments with the Microsoft Malware dataset showed that the byte samples features differed from the assembly code samples.

Authored by Muhammed Abdullah, Yongbin Yu, Jingye Cai, Yakubu Imrana, Nartey Tettey, Daniel Addo, Kwabena Sarpong, Bless Lord Y. Agbley, Benjamin Appiah

Effective of Obfuscated Android Malware Detection using Static Analysis

Malware Analysis - The effective security system improvement from malware attacks on the Android operating system should be updated and improved. Effective malware detection increases the level of data security and high protection for the users. Malicious software or malware typically finds a means to circumvent the security procedure, even when the user is unaware whether the application can act as malware. The effectiveness of obfuscated android malware detection is evaluated by collecting static analysis data from a data set. The experiment assesses the risk level of which malware dataset using the hash value of the malware and records malware behavior. A set of hash SHA256 malware samples has been obtained from an internet dataset and will be analyzed using static analysis to record malware behavior and evaluate which risk level of the malware. According to the results, most of the algorithms provide the same total score because of the multiple crime inside the malware application.

Authored by Teddy Mantoro, Muhammad Fahriza, Muhammad Bhakti

PDF Malware Analysis

Malware Analysis - This document addresses the issue of the actual security level of PDF documents. Two types of detection approaches are utilized to detect dangerous elements within malware: static analysis and dynamic analysis. Analyzing malware binaries to identify dangerous strings, as well as reverse-engineering is included in static analysis for t1he malware to disassemble it. On the other hand, dynamic analysis monitors malware activities by running them in a safe environment, such as a virtual machine. Each method has its own set of strengths and weaknesses, and it is usually best to employ both methods while analyzing malware. Malware detection could be simplified without sacrificing accuracy by reducing the number of malicious traits. This may allow the researcher to devote more time to analysis. Our worry is that there is no obvious need to identify malware with numerous functionalities when it isn t necessary. We will solve this problem by developing a system that will identify if the given file is infected with malware or not.

Authored by Md Khalil, Vivek, Kumar Anand, Antarlina Paul, Rahul Grover

GNN-Based Malicious Network Entities Identification In Large-Scale Network Data

A reliable database of Indicators of Compromise (IoC’s) is a cornerstone of almost every malware detection system. Building the database and keeping it up-to-date is a lengthy and often manual process where each IoC should be manually reviewed and labeled by an analyst. In this paper, we focus on an automatic way of identifying IoC’s intended to save analysts’ time and scale to the volume of network data. We leverage relations of each IoC to other entities on the internet to build a heterogeneous graph. We formulate a classification task on this graph and apply graph neural networks (GNNs) in order to identify malicious domains. Our experiments show that the presented approach provides promising results on the task of identifying high-risk malware as well as legitimate domains classification.

Authored by Stepan Dvorak, Pavel Prochazka, Lukas Bajer

Malware Detection Approach Based on the Swarm-Based Behavioural Analysis over API Calling Sequence

The rapidly increasing malware threats must be coped with new effective malware detection methodologies. Current malware threats are not limited to daily personal transactions but dowelled deeply within large enterprises and organizations. This paper introduces a new methodology for detecting and discriminating malicious versus normal applications. In this paper, we employed Ant-colony optimization to generate two behavioural graphs that characterize the difference in the execution behavior between malware and normal applications. Our proposed approach relied on the API call sequence generated when an application is executed. We used the API calls as one of the most widely used malware dynamic analysis features. Our proposed method showed distinctive behavioral differences between malicious and non-malicious applications. Our experimental results showed a comparative performance compared to other machine learning methods. Therefore, we can employ our method as an efficient technique in capturing malicious applications.

Authored by Eslam Amer, Adham Samir, Hazem Mostafa, Amer Mohamed, Mohamed Amin

Rotten Apples Spoil the Bunch: An Anatomy of Google Play Malware

This paper provides an in-depth analysis of Android malware that bypassed the strictest defenses of the Google Play application store and penetrated the official Android market between January 2016 and July 2021. We systematically identified 1,238 such malicious applications, grouped them into 134 families, and manually analyzed one application from 105 distinct families. During our manual analysis, we identified malicious payloads the applications execute, conditions guarding execution of the payloads, hiding techniques applications employ to evade detection by the user, and other implementation-level properties relevant for automated malware detection. As most applications in our dataset contain multiple payloads, each triggered via its own complex activation logic, we also contribute a graph-based representation showing activation paths for all application payloads in form of a control- and data-flow graph. Furthermore, we discuss the capabilities of existing malware detection tools, put them in context of the properties observed in the analyzed malware, and identify gaps and future research directions. We believe that our detailed analysis of the recent, evasive malware will be of interest to researchers and practitioners and will help further improve malware detection tools.

Authored by Michael Cao, Khaled Ahmed, Julia Rubin

Mal-Bert-GCN: Malware Detection by Combining Bert and GCN

With the dramatic increase in malicious software, the sophistication and innovation of malware have increased over the years. In particular, the dynamic analysis based on the deep neural network has shown high accuracy in malware detection. However, most of the existing methods only employ the raw API sequence feature, which cannot accurately reflect the actual behavior of malicious programs in detail. The relationship between API calls is critical for detecting suspicious behavior. Therefore, this paper proposes a malware detection method based on the graph neural network. We first connect the API sequences executed by different processes to build a directed process graph. Then, we apply Bert to encode the API sequences of each process into node embedding, which facilitates the semantic execution information inside the processes. Finally, we employ GCN to mine the deep semantic information based on the directed process graph and node embedding. In addition to presenting the design, we have implemented and evaluated our method on 10,000 malware and 10,000 benign software datasets. The results show that the precision and recall of our detection model reach 97.84\% and 97.83\%, verifying the effectiveness of our proposed method.

Authored by Zhenquan Ding, Hui Xu, Yonghe Guo, Longchuan Yan, Lei Cui, Zhiyu Hao

Detection of Botnets in IoT Networks using Graph Theory and Machine Learning

The Internet of things (IoT) is proving to be a boon in granting internet access to regularly used objects and devices. Sensors, programs, and other innovations interact and trade information with different gadgets and frameworks over the web. Even in modern times, IoT gadgets experience the ill effects of primary security threats, which expose them to many dangers and malware, one among them being IoT botnets. Botnets carry out attacks by serving as a vector and this has become one of the significant dangers on the Internet. These vectors act against associations and carry out cybercrimes. They are used to produce spam, DDOS attacks, click frauds, and steal confidential data. IoT gadgets bring various challenges unlike the common malware on PCs and Android devices as IoT gadgets have heterogeneous processor architecture. Numerous researches use static or dynamic analysis for detection and classification of botnets on IoT gadgets. Most researchers haven t addressed the multi-architecture issue and they use a lot of computing resources for analyzing. Therefore, this approach attempts to classify botnets in IoT by using PSI-Graphs which effectively addresses the problem of encryption in IoT botnet detection, tackles the multi-architecture problem, and reduces computation time. It proposes another methodology for describing and recognizing botnets utilizing graph-based Machine Learning techniques and Exploratory Data Analysis to analyze the data and identify how separable the data is to recognize bots at an earlier stage so that IoT devices can be prevented from being attacked.

Authored by Putsa Pranav, Sachin Verma, Sahana Shenoy, S. Saravanan

A Survey of Explainable Graph Neural Networks for Cyber Malware Analysis

Malicious cybersecurity activities have become increasingly worrisome for individuals and companies alike. While machine learning methods like Graph Neural Networks (GNNs) have proven successful on the malware detection task, their output is often difficult to understand. Explainable malware detection methods are needed to automatically identify malicious programs and present results to malware analysts in a way that is human interpretable. In this survey, we outline a number of GNN explainability methods and compare their performance on a real-world malware detection dataset. Specifically, we formulated the detection problem as a graph classification problem on the malware Control Flow Graphs (CFGs). We find that gradient-based methods outperform perturbation-based methods in terms of computational expense and performance on explainer-specific metrics (e.g., Fidelity and Sparsity). Our results provide insights into designing new GNN-based models for cyber malware detection and attribution.

Authored by Dana Warmsley, Alex Waagen, Jiejun Xu, Zhining Liu, Hanghang Tong

Detecting Malware Using Graph Embedding and DNN

Nowadays, the popularity of intelligent terminals makes malwares more and more serious. Among the many features of application, the call graph can accurately express the behavior of the application. The rapid development of graph neural network in recent years provides a new solution for the malicious analysis of application using call graphs as features. However, there are still problems such as low accuracy. This paper established a large-scale data set containing more than 40,000 samples and selected the class call graph, which was extracted from the application, as the feature and used the graph embedding combined with the deep neural network to detect the malware. The experimental results show that the accuracy of the detection model proposed in this paper is 97.7\%; the precision is 96.6\%; the recall is 96.8\%; the F1-score is 96.4\%, which is better than the existing detection model based on Markov chain and graph embedding detection model.

Authored by Rui Wang, Jun Zheng, Zhiwei Shi, Yu Tan

Malware analysis and multi-label category detection issues: Ensemble-based approaches

Detection of malware and security attacks is a complex process that can vary in its details and analysis activities. As part of the detection process, malware scanners try to categorize a malware once it is detected under one of the known malware categories (e.g. worms, spywares, viruses, etc.). However, many studies and researches indicate problems with scanners categorizing or identifying a particular malware under more than one malware category. This paper, and several others, show that machine learning can be used for malware detection especially with ensemble base prediction methods. In this paper, we evaluated several custom-built ensemble models. We focused on multi-label malware classification as individual or classical classifiers showed low accuracy in such territory.This paper showed that recent machine models such as ensemble and deep learning can be used for malware detection with better performance in comparison with classical models. This is very critical in such a dynamic and yet important detection systems where challenges such as the detection of unknown or zero-day malware will continue to exist and evolve.

Authored by Izzat Alsmadi, Bilal Al-Ahmad, Mohammad Alsmadi

DynaMalDroid: Dynamic Analysis-Based Detection Framework for Android Malware Using Machine Learning Techniques

Android malware is continuously evolving at an alarming rate due to the growing vulnerabilities. This demands more effective malware detection methods. This paper presents DynaMalDroid, a dynamic analysis-based framework to detect malicious applications in the Android platform. The proposed framework contains three modules: dynamic analysis, feature engineering, and detection. We utilized the well-known CICMalDroid2020 dataset, and the system calls of apps are extracted through dynamic analysis. We trained our proposed model to recognize malware by selecting features obtained through the feature engineering module. Further, with these selected features, the detection module applies different Machine Learning classifiers like Random Forest, Decision Tree, Logistic Regression, Support Vector Machine, Naïve-Bayes, K-Nearest Neighbour, and AdaBoost, to recognize whether an application is malicious or not. The experiments have shown that several classifiers have demonstrated excellent performance and have an accuracy of up to 99\%. The models with Support Vector Machine and AdaBoost classifiers have provided better detection accuracy of 99.3\% and 99.5\%, respectively.

Authored by Hashida Manzil, Manohar S

Disparity Analysis Between the Assembly and Byte Malware Samples with Deep Autoencoders

Malware attacks in the cyber world continue to increase despite the efforts of Malware analysts to combat this problem. Recently, Malware samples have been presented as binary sequences and assembly codes. However, most researchers focus only on the raw Malware sequence in their proposed solutions, ignoring that the assembly codes may contain important details that enable rapid Malware detection. In this work, we leveraged the capabilities of deep autoencoders to investigate the presence of feature disparities in the assembly and raw binary Malware samples. First, we treated the task as outliers to investigate whether the autoencoder would identify and justify features as samples from the same family. Second, we added noise to all samples and used Deep Autoencoder to reconstruct the original samples by denoising. Experiments with the Microsoft Malware dataset showed that the byte samples features differed from the assembly code samples.

Authored by Muhammed Abdullah, Yongbin Yu, Jingye Cai, Yakubu Imrana, Nartey Tettey, Daniel Addo, Kwabena Sarpong, Bless Lord Y. Agbley, Benjamin Appiah

Effective of Obfuscated Android Malware Detection using Static Analysis

The effective security system improvement from malware attacks on the Android operating system should be updated and improved. Effective malware detection increases the level of data security and high protection for the users. Malicious software or malware typically finds a means to circumvent the security procedure, even when the user is unaware whether the application can act as malware. The effectiveness of obfuscated android malware detection is evaluated by collecting static analysis data from a data set. The experiment assesses the risk level of which malware dataset using the hash value of the malware and records malware behavior. A set of hash SHA256 malware samples has been obtained from an internet dataset and will be analyzed using static analysis to record malware behavior and evaluate which risk level of the malware. According to the results, most of the algorithms provide the same total score because of the multiple crime inside the malware application.

Authored by Teddy Mantoro, Muhammad Fahriza, Muhammad Bhakti

PDF Malware Analysis

This document addresses the issue of the actual security level of PDF documents. Two types of detection approaches are utilized to detect dangerous elements within malware: static analysis and dynamic analysis. Analyzing malware binaries to identify dangerous strings, as well as reverse-engineering is included in static analysis for t1he malware to disassemble it. On the other hand, dynamic analysis monitors malware activities by running them in a safe environment, such as a virtual machine. Each method has its own set of strengths and weaknesses, and it is usually best to employ both methods while analyzing malware. Malware detection could be simplified without sacrificing accuracy by reducing the number of malicious traits. This may allow the researcher to devote more time to analysis. Our worry is that there is no obvious need to identify malware with numerous functionalities when it isn t necessary. We will solve this problem by developing a system that will identify if the given file is infected with malware or not.

Authored by Md Khalil, Vivek, Kumar Anand, Antarlina Paul, Rahul Grover

Objection!: Identifying Misclassified Malicious Activities with XAI

Many studies have been conducted to detect various malicious activities in cyberspace using classifiers built by machine learning. However, it is natural for any classifier to make mistakes, and hence, human verification is necessary. One method to address this issue is eXplainable AI (XAI), which provides a reason for the classification result. However, when the number of classification results to be verified is large, it is not realistic to check the output of the XAI for all cases. In addition, it is sometimes difficult to interpret the output of XAI. In this study, we propose a machine learning model called classification verifier that verifies the classification results by using the output of XAI as a feature and raises objections when there is doubt about the reliability of the classification results. The results of experiments on malicious website detection and malware detection show that the proposed classification verifier can efficiently identify misclassified malicious activities.

Authored by Koji Fujita, Toshiki Shibahara, Daiki Chiba, Mitsuaki Akiyama, Masato Uchida

The Application of 1D-CNN in Microsoft Malware Detection

In the computer field, cybersecurity has always been the focus of attention. How to detect malware is one of the focuses and difficulties in network security research effectively. Traditional existing malware detection schemes can be mainly divided into two methods categories: database matching and the machine learning method. With the rise of deep learning, more and more deep learning methods are applied in the field of malware detection. Deeper semantic features can be extracted via deep neural network. The main tasks of this paper are as follows: (1) Using machine learning methods and one-dimensional convolutional neural networks to detect malware (2) Propose a machine The method of combining learning and deep learning is used for detection. Machine learning uses LGBM to obtain an accuracy rate of 67.16%, and one-dimensional CNN obtains an accuracy rate of 72.47%. In (2), LGBM is used to screen the importance of features and then use a one-dimensional convolutional neural network, which helps to further improve the detection result has an accuracy rate of 78.64%.

Authored by Da Huo, Xiaoyong Li, Linghui Li, Yali Gao, Ximing Li, Jie Yuan

Detecting and Classifying Self-Deleting Windows Malware Using Prefetch Files

Malware detection and analysis can be a burdensome task for incident responders. As such, research has turned to machine learning to automate malware detection and malware family classification. Existing work extracts and engineers static and dynamic features from the malware sample to train classifiers. Despite promising results, such techniques assume that the analyst has access to the malware executable file. Self-deleting malware invalidates this assumption and requires analysts to find forensic evidence of malware execution for further analysis. In this paper, we present and evaluate an approach to detecting malware that executed on a Windows target and further classify the malware into its associated family to provide semantic insight. Specifically, we engineer features from the Windows prefetch file, a file system forensic artifact that archives process information. Results show that it is possible to detect the malicious artifact with 99% accuracy; furthermore, classifying the malware into a fine-grained family has comparable performance to techniques that require access to the original executable. We also provide a thorough security discussion of the proposed approach against adversarial diversity.

Authored by Adam Duby, Teryl Taylor, Gedare Bloom, Yanyan Zhuang