Malware Analysis - Android malware is continuously evolving at an alarming rate due to growing vulnerabilities, which demands more effective malware detection methods. This paper presents DynaMalDroid, a dynamic-analysis-based framework to detect malicious applications on the Android platform. The proposed framework contains three modules: dynamic analysis, feature engineering, and detection. We utilized the well-known CICMalDroid2020 dataset, extracting the system calls of apps through dynamic analysis. We trained our proposed model to recognize malware using features selected by the feature engineering module. With these selected features, the detection module applies different machine learning classifiers, such as Random Forest, Decision Tree, Logistic Regression, Support Vector Machine, Naïve-Bayes, K-Nearest Neighbour, and AdaBoost, to recognize whether an application is malicious. The experiments show that several classifiers achieve excellent performance, with accuracy of up to 99\%. The models with Support Vector Machine and AdaBoost classifiers provided the best detection accuracy, at 99.3\% and 99.5\%, respectively.
Authored by Hashida Manzil, Manohar S
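As a rough illustration of the detection module described above, the sketch below trains the two best-reported classifiers on system-call frequency vectors with scikit-learn. The synthetic data and feature layout are placeholders, not the CICMalDroid2020 pipeline itself.

```python
# Minimal sketch, assuming each app is represented by per-system-call
# frequency counts; data here is random, for illustration only.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(500, 40)).astype(float)  # one column per system call
y = rng.integers(0, 2, size=500)                        # 1 = malicious

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
for name, clf in [("SVM", SVC(kernel="rbf")), ("AdaBoost", AdaBoostClassifier())]:
    clf.fit(X_tr, y_tr)
    print(name, "accuracy:", clf.score(X_te, y_te))
```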
Information Reuse and Security - Common Vulnerabilities and Exposures (CVE) databases contain information about vulnerabilities in software products and source code. If the individual elements of CVE descriptions can be extracted and structured, the data can be used to search and analyze CVE descriptions. Herein we propose a method to label each element in CVE descriptions by applying Named Entity Recognition (NER). For NER, we used BERT, a transformer-based natural language processing model. Using NER with machine learning can label information from CVE descriptions even when the data contains some distortions. An experiment involving manually prepared label information for 1000 CVE descriptions shows that the labeling accuracy of the proposed method is about 0.81 for precision and about 0.89 for recall. In addition, we devise a way to train the model by dividing the data by label. Our proposed method can be used to label each element of CVE descriptions automatically.
Authored by Kensuke Sumoto, Kenta Kanakogi, Hironori Washizaki, Naohiko Tsuda, Nobukazu Yoshioka, Yoshiaki Fukazawa, Hideyuki Kanuka
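The following hedged sketch shows transformer-based NER over a CVE-style description with the Hugging Face transformers library. The checkpoint name is a generic public NER model, not the authors' fine-tuned model; reproducing their label set (product, version, vulnerability type, and so on) would require fine-tuning on labeled CVE data.

```python
# Sketch only: a public BERT NER checkpoint stands in for the paper's model.
from transformers import pipeline

ner = pipeline("token-classification", model="dslim/bert-base-NER",
               aggregation_strategy="simple")
text = ("Buffer overflow in the web administration console in Example Server "
        "2.1 allows remote attackers to execute arbitrary code.")
for ent in ner(text):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 3))
```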
Information Reuse and Security - Successive approximation register analog-to-digital converters (SAR ADCs) are widely adopted in Internet of Things (IoT) systems due to their simple structure and high energy efficiency. Unfortunately, a SAR ADC dissipates varied and unique power features as it converts different input signals, leaving it severely vulnerable to power side-channel attack (PSA). An adversary can accurately derive the input signal by measuring only the power information from the analog supply pin (AVDD), digital supply pin (DVDD), and/or reference pin (Ref), which is fed to trained machine learning models. This paper first presents a detailed mathematical analysis of PSA on the SAR ADC, concluding that the power information from AVDD is the most vulnerable to PSA compared with the other supply pins. Then, an LSB-reused protection technique is proposed, which utilizes the characteristic of the LSB from the SAR ADC itself to protect against PSA. Lastly, this technique is verified in a 12-bit 5 MS/s secure SAR ADC implemented in 65nm technology. Using the current waveform from AVDD, the adopted convolutional neural network (CNN) algorithms achieve over 99\% prediction accuracy from LSB to MSB in the unprotected SAR ADC. With the proposed protection, the bit-wise accuracy drops to around 50\%.
Authored by Lele Fang, Jiahao Liu, Yan Zhu, Chi-Hang Chan, Rui Martins
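A toy version of the attack model described above: a 1-D CNN that predicts one ADC output bit from a supply-current waveform. The trace length, layer sizes, and data are placeholders, not the paper's 12-bit 65nm design or its exact network.

```python
# Illustrative sketch, assuming fixed-length AVDD current traces per conversion.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
traces = rng.normal(size=(1000, 256, 1)).astype("float32")  # current samples
bits = rng.integers(0, 2, size=1000)                        # target bit value

model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(16, 8, activation="relu", input_shape=(256, 1)),
    tf.keras.layers.MaxPooling1D(4),
    tf.keras.layers.Conv1D(32, 8, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(bit = 1)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(traces, bits, epochs=2, batch_size=64, verbose=0)
```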
Intrusion Intolerance - Container-based virtualization has gained momentum over the past few years thanks to its lightweight nature and support for agility. However, its appealing features come at the price of a reduced isolation level compared to traditional host-based virtualization techniques, exposing workloads to various faults, such as co-residency attacks like container escape. In this work, we propose to leverage the automated management capabilities of containerized environments to derive a Fault and Intrusion Tolerance (FIT) framework based on error detection-recovery and fault treatment. Namely, we aim to derive a specification-based error detection mechanism at the host level to systematically and formally capture security state errors indicating breaches potentially caused by malicious containers. Although the paper focuses on security-side use cases, the results logically extend to accidental faults. Our aim is to immunize the target environments against accidental and malicious faults and preserve their core dependability and security properties.
Authored by Taous Madi, Paulo Esteves-Verissimo
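To make the specification-based error detection idea concrete, here is a minimal sketch in which observed container state is checked against an allowed-state specification at the host level. The state fields and rules are invented for illustration and are not the paper's formal specification.

```python
# Hypothetical host-level specification check; field names are assumptions.
ALLOWED_SPEC = {
    "max_processes": 50,
    "allowed_mounts": {"/app", "/tmp"},
    "host_pid_namespace": False,  # sharing the host PID namespace is an error
}

def detect_state_errors(observed: dict) -> list:
    """Return the list of security state errors for one container snapshot."""
    errors = []
    if observed["process_count"] > ALLOWED_SPEC["max_processes"]:
        errors.append("process count exceeds specification")
    if not set(observed["mounts"]) <= ALLOWED_SPEC["allowed_mounts"]:
        errors.append("unexpected mount point (possible escape attempt)")
    if observed["host_pid_namespace"] != ALLOWED_SPEC["host_pid_namespace"]:
        errors.append("container joined host PID namespace")
    return errors

print(detect_state_errors({"process_count": 3,
                           "mounts": ["/app", "/proc/1/root"],
                           "host_pid_namespace": True}))
```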
Malware Analysis and Graph Theory - The rapidly increasing malware threats must be met with new, effective malware detection methodologies. Current malware threats are not limited to daily personal transactions but are embedded deeply within large enterprises and organizations. This paper introduces a new methodology for detecting and discriminating malicious versus normal applications. We employed ant-colony optimization to generate two behavioural graphs that characterize the difference in execution behavior between malware and normal applications. Our proposed approach relies on the API call sequence generated when an application is executed; API calls are among the most widely used features for dynamic malware analysis. Our proposed method showed distinctive behavioral differences between malicious and non-malicious applications, and our experimental results showed performance comparable to other machine learning methods. Therefore, our method can be employed as an efficient technique for capturing malicious applications.
Authored by Eslam Amer, Adham Samir, Hazem Mostafa, Amer Mohamed, Mohamed Amin
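A sketch of the graph side of this approach: building a behavioural graph from an API-call sequence, with nodes as API calls and weighted edges counting observed transitions. The ant-colony optimization step from the paper is omitted here, and the API names are invented examples.

```python
# Minimal behavioural-graph sketch over a dynamic API-call trace.
import networkx as nx

api_sequence = ["CreateFile", "WriteFile", "RegSetValue", "CreateFile",
                "WriteFile", "InternetOpen", "InternetConnect"]

g = nx.DiGraph()
for src, dst in zip(api_sequence, api_sequence[1:]):
    if g.has_edge(src, dst):
        g[src][dst]["weight"] += 1   # repeated transition
    else:
        g.add_edge(src, dst, weight=1)

for u, v, w in g.edges(data="weight"):
    print(f"{u} -> {v} (x{w})")
```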
Malware Analysis and Graph Theory - The open set recognition (OSR) problem has been a challenge in many machine learning (ML) applications, such as security. As new and unknown malware families emerge regularly, it is difficult to gather samples covering all classes for the training process in ML systems. An advanced malware classification system should classify the known classes correctly while remaining sensitive to the unknown class. In this paper, we introduce a self-supervised pre-training approach for the OSR problem in malware classification. We propose two transformations for function call graph (FCG) based malware representations to facilitate the pretext task. We also present a statistical thresholding approach to find the optimal threshold for the unknown class. The experimental results indicate that our proposed pre-training process can improve the performance of different downstream loss functions for the OSR problem.
Authored by Jingyun Jia, Philip Chan
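The statistical-thresholding idea can be sketched as follows: fit a threshold on the confidence distribution of known classes, then reject low-confidence samples as "unknown". The k-sigma rule and the synthetic scores below are assumptions for illustration, not the paper's exact statistic.

```python
# Open-set rejection sketch: threshold fit on known-class validation scores.
import numpy as np

rng = np.random.default_rng(0)
known_scores = rng.beta(8, 2, size=1000)           # confidences on known classes
threshold = known_scores.mean() - 2 * known_scores.std()

def classify(score, predicted_class):
    """Keep the prediction only if its confidence clears the threshold."""
    return predicted_class if score >= threshold else "unknown"

print(round(threshold, 3), classify(0.95, 3), classify(0.30, 3))
```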
Malware Analysis and Graph Theory - The Internet of Things (IoT) is proving to be a boon in granting internet access to regularly used objects and devices. Sensors, programs, and other innovations interact and trade information with different gadgets and frameworks over the web. Even in modern times, IoT gadgets suffer from primary security threats, which expose them to many dangers and malware, among them IoT botnets. Botnets carry out attacks by serving as a vector, and this has become one of the significant dangers on the Internet. These vectors act against organizations and carry out cybercrimes: they are used to produce spam, mount DDoS attacks, commit click fraud, and steal confidential data. Unlike common malware on PCs and Android devices, IoT gadgets pose unique challenges because they have heterogeneous processor architectures. Numerous studies use static or dynamic analysis for the detection and classification of botnets on IoT gadgets, but most have not addressed the multi-architecture issue and consume substantial computing resources for analysis. Therefore, this approach attempts to classify botnets in IoT by using PSI-Graphs, which effectively address the problem of encryption in IoT botnet detection, tackle the multi-architecture problem, and reduce computation time. It proposes another methodology for describing and recognizing botnets utilizing graph-based machine learning techniques and exploratory data analysis, in order to analyze the data, identify how separable it is, and recognize bots at an earlier stage so that IoT devices can be prevented from being attacked.
Authored by Putsa Pranav, Sachin Verma, Sahana Shenoy, S. Saravanan
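A rough illustration of graph-based botnet classification: each sample's graph is summarized by simple structural features and fed to a classifier. The graphs and labels below are synthetic stand-ins, and the features are generic examples rather than the PSI-Graph features used in the paper.

```python
# Sketch: structural graph features + a classifier, on synthetic graphs.
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def graph_features(g):
    degrees = [d for _, d in g.degree()]
    return [g.number_of_nodes(), g.number_of_edges(),
            float(np.mean(degrees)), float(np.max(degrees))]

graphs = [nx.gnm_random_graph(20, 40, seed=i) for i in range(200)]
labels = np.random.default_rng(0).integers(0, 2, size=200)  # 1 = botnet

X = np.array([graph_features(g) for g in graphs])
clf = RandomForestClassifier(random_state=0).fit(X[:150], labels[:150])
print("held-out accuracy:", clf.score(X[150:], labels[150:]))
```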
Malware Analysis and Graph Theory - Malicious cybersecurity activities have become increasingly worrisome for individuals and companies alike. While machine learning methods like Graph Neural Networks (GNNs) have proven successful on the malware detection task, their output is often difficult to understand. Explainable malware detection methods are needed to automatically identify malicious programs and present results to malware analysts in a way that is human interpretable. In this survey, we outline a number of GNN explainability methods and compare their performance on a real-world malware detection dataset. Specifically, we formulated the detection problem as a graph classification problem on the malware Control Flow Graphs (CFGs). We find that gradient-based methods outperform perturbation-based methods in terms of computational expense and performance on explainer-specific metrics (e.g., Fidelity and Sparsity). Our results provide insights into designing new GNN-based models for cyber malware detection and attribution.
Authored by Dana Warmsley, Alex Waagen, Jiejun Xu, Zhining Liu, Hanghang Tong
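The gradient-based explanation family surveyed above can be sketched in a few lines: take the gradient of the malware score with respect to node features and rank CFG nodes (basic blocks) by influence. The one-layer graph model and the data below are toys, not the surveyed GNN architectures.

```python
# Minimal gradient saliency over a toy graph classifier in PyTorch.
import torch

A = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])  # CFG adjacency
X = torch.randn(3, 4, requires_grad=True)                     # node features
W = torch.randn(4, 1)                                         # fixed toy weights

score = torch.sigmoid((A @ X @ W).mean())   # graph-level malware score
score.backward()
saliency = X.grad.abs().sum(dim=1)          # per-node importance
print("most influential node:", saliency.argmax().item())
```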
Malware Analysis - Detection of malware and security attacks is a complex process that can vary in its details and analysis activities. As part of the detection process, malware scanners try to categorize a malware once it is detected under one of the known malware categories (e.g. worms, spywares, viruses, etc.). However, many studies indicate problems with scanners categorizing or identifying a particular malware under more than one malware category. This paper, and several others, show that machine learning can be used for malware detection, especially with ensemble-based prediction methods. In this paper, we evaluated several custom-built ensemble models. We focused on multi-label malware classification, as individual or classical classifiers showed low accuracy in that territory. This paper showed that recent machine learning models such as ensembles and deep learning can be used for malware detection with better performance in comparison with classical models. This is critical in such dynamic yet important detection systems, where challenges such as the detection of unknown or zero-day malware will continue to exist and evolve.
Authored by Izzat Alsmadi, Bilal Al-Ahmad, Mohammad Alsmadi
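A sketch of the multi-label setting described above: one sample can carry several category labels at once (e.g., worm and spyware), handled by wrapping an ensemble in a multi-output strategy. The data and the four-label set are illustrative, not the paper's models.

```python
# Multi-label classification sketch with a random-forest ensemble.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import hamming_loss
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
Y = rng.integers(0, 2, size=(400, 4))  # columns: worm, spyware, virus, trojan

clf = MultiOutputClassifier(RandomForestClassifier(random_state=0))
clf.fit(X[:300], Y[:300])
print("Hamming loss:", hamming_loss(Y[300:], clf.predict(X[300:])))
```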
Machine Learning - Estimating obesity levels is an important topic in the medical field, since such estimates can provide useful guidance for people who would like to lose weight or keep fit. This article tries to find a model that can predict obesity and provide people with information on how to avoid becoming overweight. More specifically, it applies dimension reduction to simplify the data and tries to identify the most decisive feature of obesity through Principal Component Analysis (PCA). The article also uses machine learning methods such as Support Vector Machine (SVM) and Decision Tree to predict obesity and find its major causes, and additionally applies an Artificial Neural Network (ANN), which has more powerful feature extraction ability. Finally, the article finds that family history of obesity is the most decisive feature, perhaps because obesity is greatly affected by genes or because the family diet has a strong influence. Both the ANN and the Decision Tree achieve prediction accuracy higher than 90\%.
Authored by Zhenghao He
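The PCA step used to look for a decisive feature can be sketched as follows: fit PCA on standardized data, then inspect the first component's loadings. The feature names and random data are placeholders, not the study's dataset.

```python
# PCA loading inspection sketch; features are invented examples.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = ["family_history", "calorie_intake", "activity", "age"]
X = np.random.default_rng(0).normal(size=(300, len(features)))

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
loadings = np.abs(pca.components_[0])   # weight of each feature on PC1
print("explained variance:", pca.explained_variance_ratio_)
print("heaviest feature on PC1:", features[int(loadings.argmax())])
```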
Machine Learning - Fashion, the way we present ourselves, mainly focuses on vision and has attracted great interest from computer vision researchers. It is generally used to search fashion products in online shopping malls and retrieve descriptive information about products. The main objective of our paper is to use deep learning (DL) and machine learning (ML) methods to correctly identify and categorize clothing images. In this work, we used ML algorithms (support vector machines (SVM), K-Nearest Neighbors (KNN), Decision Tree (DT), Random Forest (RF)), DL algorithms (Convolutional Neural Networks (CNN), AlexNet, GoogleNet, LeNet, LeNet5), and transfer learning using pretrained models (VGG16, MobileNet, and ResNet50). We trained and tested our models online using Google Colaboratory with the Tensorflow/Keras and Scikit-Learn libraries, which support deep learning and machine learning in Python. The main metrics used in our study to evaluate the performance of the ML and DL algorithms are accuracy and the confusion matrix. The best result for the ML models is obtained with ANN (88.71\%), and for the DL models with the GoogleNet architecture (93.75\%). The results show that the number of epochs and the depth of the network affect the quality of the results.
Authored by Bougareche Samia, Zehani Soraya, Mimi Malika
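A small CNN of the kind compared above, trained on the public Fashion-MNIST clothing dataset via Keras. This is a baseline sketch, not the exact architectures (AlexNet, GoogleNet, LeNet, ...) benchmarked in the paper.

```python
# Baseline clothing-image classifier on Fashion-MNIST.
import tensorflow as tf

(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.fashion_mnist.load_data()
x_tr, x_te = x_tr[..., None] / 255.0, x_te[..., None] / 255.0  # scale to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 clothing classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_tr, y_tr, epochs=1, validation_data=(x_te, y_te))
```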
Machine Learning - Stock selection strategy design based on machine learning and multi-factor analysis is a research hotspot in the quantitative investment field. In this paper, four machine learning algorithms, including support vector machine, gradient boosting regression, random forest, and linear regression, are used to predict the rise and fall of stocks, taking stock fundamentals as input variables. The portfolio strategy is constructed on this basis, and the stock selection strategy is then further optimized. The empirical results show that the multi-factor quantitative stock selection strategy has a good stock selection effect, with the support vector machine algorithm yielding the best performance. As the number of factors increases, the fitting degree and the yield are inversely related across the various algorithms.
Authored by Chengzhao Zhang, Huiyue Tang
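The four-model comparison can be sketched as a cross-validated rise/fall prediction loop. The inputs below are synthetic stand-ins for the fundamental factors, and classifier variants stand in for the paper's regression-style models.

```python
# Cross-validated comparison sketch over four model families.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 12))      # fundamental factors
y = rng.integers(0, 2, size=600)    # 1 = rise, 0 = fall

models = {"SVM": SVC(), "GBDT": GradientBoostingClassifier(),
          "RF": RandomForestClassifier(), "Linear": LogisticRegression()}
for name, m in models.items():
    print(name, round(cross_val_score(m, X, y, cv=5).mean(), 3))
```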
Machine Learning - An IDS is a system that helps detect any kind of suspicious activity on a computer network. It is capable of identifying suspicious activities at both levels, i.e., locally at the system level and in transit at the network level. Since the system does not have its own dataset, it is inefficient at identifying unknown attacks. To overcome this inefficiency, we make use of ML, which assists in analysing and categorizing attacks on diverse datasets. In this study, the efficacy of eight machine learning algorithms is assessed on KDD CUP99. Based on our implementation and analysis, among the eight algorithms considered here, Support Vector Machine (SVM), Random Forest (RF), and Decision Tree (DT) have the highest testing accuracy, with SVM the highest of the three.
Authored by Utkarsh Dixit, Suman Bhatia, Pramod Bhatia
Machine Learning - Sentiment Analysis (SA) is an approach for detecting subjective information such as thoughts, outlooks, reactions, and emotional state. The majority of previous SA work treats it as a text-classification problem that requires labelled input to train the model. However, obtaining a tagged dataset is difficult and usually must be done by hand. Another concern is that the absence of sufficient cross-domain portability makes it challenging to reuse the same labelled data across applications, so data must be manually classified for each domain. This research work applies sentiment analysis to an entire vaccine Twitter dataset. The work involves lexicon analysis using NLP libraries such as neattext and textblob, and multi-class classification using BERT. This work evaluates and compares the results of the machine learning algorithms.
Authored by Amarjeet Rawat, Himani Maheshwari, Manisha Khanduja, Rajiv Kumar, Minakshi Memoria, Sanjeev Kumar
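The lexicon step can be sketched with TextBlob: polarity in [-1, 1] gives a quick sentiment signal without labelled data, before the BERT classification stage. The tweets below are invented examples, and the neattext cleaning step is omitted.

```python
# Lexicon-based sentiment sketch with TextBlob.
from textblob import TextBlob

tweets = ["The vaccine rollout has been impressively fast",
          "Terrible side effects, never again"]
for t in tweets:
    polarity = TextBlob(t).sentiment.polarity  # -1 (negative) .. +1 (positive)
    label = "positive" if polarity > 0 else "negative" if polarity < 0 else "neutral"
    print(f"{polarity:+.2f} {label}: {t}")
```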
Machine Learning - This study develops a framework for personalized care to tackle heart disease risk using an at-home system. The machine learning models used to predict heart disease are Logistic Regression, K-Nearest Neighbor, Support Vector Machine, Naive Bayes, Decision Tree, Random Forest, and XGBoost. Timely and efficient detection of heart disease plays an important role in health care: it is essential to detect cardiovascular disease (CVD) as early as possible, consult a specialist doctor before the disease becomes severe, and start medication. The performance of the proposed model was assessed using the Cleveland Heart Disease dataset from the UCI Machine Learning Repository. Compared to all the other machine learning algorithms, the Random Forest algorithm shows the best performance, with an accuracy score of 90.16\%. The best model may evaluate patient fitness in place of routine hospital visits, reducing the burden on hospitals and helping them attend only to critical patients.
Authored by Goutam Sahoo, Keerthana Kanike, Santos Das, Poonam Singh
Machine Learning - A good ecological environment is crucial to attracting, cultivating, and retaining talent and making it fully effective. This study provides a solution to the mainstream problem of how to anticipate the turnover of excellent employees, so as to promote a sustainable and harmonious human-resources ecology in enterprises facing a shortage of talent. The study obtains open data sets and conducts data preprocessing, model construction, and model optimization, describing a set of enterprise employee turnover prediction models based on RapidMiner workflows. The data preprocessing is completed with the help of the statistical analysis software IBM SPSS Statistics and RapidMiner. Statistical charts, scatter plots, and boxplots are generated to support visual analysis of the data. Machine learning, model application, performance vectors, and cross-validation are realized through RapidMiner's operators and workflows. The model design algorithms include support vector machines, naive Bayes, decision trees, and neural networks. Comparing the performance of the algorithm models in terms of accuracy, precision, recall, and F1-score, it is concluded that the decision tree model performs best. The performance evaluation results confirm the effectiveness of this model for sustainably exploring employee turnover prediction in human resource management.
Authored by Yong Shi
Machine Learning - With the development of technology, mobile phones have become an indispensable part of human life. Factors such as brand, internal memory, wifi, battery power, camera, and availability of 4G now shape consumers' decisions when buying mobile phones, but people often fail to link those factors with price. This paper therefore aims to address the problem by using machine learning algorithms, namely Support Vector Machine, Decision Tree, K-Nearest Neighbors, and Naive Bayes, to train on a mobile phone dataset before predicting the price level. We evaluated the algorithms on accuracy, precision, recall, and F1 score. This not only helps customers make a better choice of mobile phone but also advises businesses selling mobile phones on how to set reasonable prices for the features they offer. This idea of predicting price levels will support customers in choosing mobile phones wisely in the future. The results illustrate that among the four classifiers, SVM delivers the most desirable performance, with 94.8\% accuracy and 97.3\% F1 score without feature selection, and 95.5\% accuracy and 97.7\% F1 score with feature selection.
Authored by Ningyuan Hu
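The with/without feature-selection comparison can be sketched as follows: SelectKBest keeps the most class-relevant features before the SVM. The random dataset stands in for the real phone-spec features (RAM, battery, camera, ...), and k=10 is an arbitrary choice.

```python
# Feature-selection vs. no-selection sketch around an SVM.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = rng.integers(0, 4, size=500)   # four price levels

plain = SVC()
selected = make_pipeline(SelectKBest(f_classif, k=10), SVC())
print("without selection:", cross_val_score(plain, X, y, cv=5).mean())
print("with selection:   ", cross_val_score(selected, X, y, cv=5).mean())
```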
Machine Learning - Nowadays, the MOBA genre has the largest audience and player base of any game type in the world. Recently, League of Legends became an official e-sport, one of 37 events at the 2022 Asian Games held in Hangzhou, and as e-sports develop, analytical skills are increasingly involved in the field. The topic of this research is to use a machine learning approach to analyze League of Legends data and predict the result of a game. The method is applied to a dataset recording the first 10 minutes of diamond-ranked games. Several popular machine learning algorithms (AdaBoost, GradientBoost, RandomForest, ExtraTree, SVM, Naïve Bayes, KNN, LogisticRegression, and DecisionTree) are applied and their performance tested by cross-validation. The algorithms that outperform the others are then selected to form a voting classifier that predicts the game result. The accuracy of the voting classifier is 72.68\%.
Authored by Qiyuan Shen
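A sketch of the final voting step: the best-performing base models are combined into a hard-voting ensemble. The features below stand in for the 10-minute match statistics (gold, kills, objectives, ...), and the chosen base estimators are examples, not necessarily the paper's winners.

```python
# Hard-voting ensemble sketch for win/loss prediction.
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 15))      # 10-minute match statistics
y = rng.integers(0, 2, size=800)    # 1 = blue side wins

voter = VotingClassifier([("ada", AdaBoostClassifier()),
                          ("gb", GradientBoostingClassifier()),
                          ("rf", RandomForestClassifier())], voting="hard")
print("voting accuracy:", cross_val_score(voter, X, y, cv=5).mean())
```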
Machine Learning - In this research work, we attempted to predict the creditworthiness of smartphone users in Indonesia during the COVID-19 pandemic using machine learning. Principal Component Analysis (PCA) and K-means algorithms are used for the prediction of creditworthiness, with a dataset of 1050 respondents answering twelve questions posed to smartphone users in Indonesia during the pandemic. Four different classification algorithms (Logistic Regression, Support Vector Machine, Decision Tree, and Naive Bayes) were tested to classify the creditworthiness of smartphone users. The tests carried out included accuracy, precision, recall, F1-score, and Area Under the Receiver Operating Characteristic Curve (AUC-ROC) assessment. The Logistic Regression algorithm shows the best performance, whereas Naïve Bayes (NB) shows the worst. The results of this research also provide new knowledge about the influential and non-influential variables among the twelve questions asked of the smartphone users in Indonesia during the COVID-19 pandemic.
Authored by R Winahyu, Maman Somantri, Oky Nurhayati
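The PCA + K-means stage can be sketched as follows: survey responses are reduced to two principal components, then clustered into groups that would later be interpreted as creditworthiness classes. The twelve answer columns below are random placeholders, not the survey data.

```python
# PCA + K-means clustering sketch on survey-style data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
answers = rng.integers(1, 6, size=(1050, 12)).astype(float)  # 12 survey questions

reduced = PCA(n_components=2).fit_transform(answers)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print("cluster sizes:", np.bincount(clusters))
```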
Intelligent Data and Security - Artificial intelligence technology has developed rapidly in recent years. An AI system can perform tasks without human intervention and can be used for various purposes, such as speech recognition and face recognition. AI can be used for good or bad purposes, depending on how it is implemented. We discuss the application of AI in data security technology and its advantages over traditional security methods, focusing on beneficial uses by analyzing the impact of AI on the development of big data security technology. AI can enhance security technology through machine learning algorithms, which can analyze large amounts of data and identify patterns that humans cannot detect on their own. The AI-based computer big data security technology platform in this paper is built by creating a system that can identify and prevent malicious programs. The system must be able to detect all types of threats, including viruses, worms, Trojans, and spyware. It should also be able to monitor network activity and respond quickly in the event of an attack.
Authored by Yu Miao
Intelligent Data and Security - Problems such as the increase in the number of private vehicles with the population, rising environmental pollution, unmet infrastructure and resource needs, and decreasing time efficiency in cities have pushed local governments, cities, and countries to search for solutions. These problems are being addressed under the concepts of smart cities and intelligent transportation by using information and communication technologies in line with actual needs. When designing intelligent transportation systems (ITS), the big data involved should be handled, beyond traditional methods, in a state-of-the-art and appropriate way with the help of methods such as artificial intelligence, machine learning, and deep learning. In this study, a data-driven decision support system model was established to help businesses make strategic decisions with the help of intelligent transportation data and to contribute to eliminating public transportation problems in the city. Our model has been established using big data and business intelligence technologies: a decision support system comprising a data sources layer, data ingestion/collection layer, data storage and processing layer, data analytics layer, application/presentation layer, developer layer, and data management/data security layer. In our study, the decision support system was modeled using ITS data supported by big data technologies, where the traditional structure could not provide a solution. This paper aims to create a basis for future studies looking for solutions to the problems of integrating, storing, processing, and analyzing big data, and to add the missing model framework to the literature. We both fill this gap in the literature and provide a model study, prior to applying existing data sets to the business intelligence architecture, ahead of the implementation to be carried out by the authors.
Authored by Kutlu Sengul, Cigdem Tarhan, Vahap Tecim
Intellectual Property Security - Artificial intelligence creation has come into fashion and has brought unprecedented challenges to intellectual property law. In order to study the viewpoints on AI-creation copyright ownership held by professionals in different institutions, we took papers on AI creation from CNKI from 2016 to 2021 and applied orthogonal design and the analysis-of-variance method to construct the dataset. Kernel-SVM classifiers with different kernel methods, in addition to some shallow machine learning classifiers, are selected for analyzing and predicting the copyright ownership of AI creation. The support vector machine (SVM) is widely used in statistics, and the performance of an SVM is closely related to the choice of kernel function. SVM with the RBF kernel surpasses the other seven kernel-SVM classifiers and five shallow classifiers, although the accuracy provided by all of them was not satisfactory. Various performance metrics, such as accuracy and F1-score, are used to evaluate the performance of the KSVM and the other classifiers. The purpose of this study is to explore the overall viewpoints on AI-creation copyright ownership, investigate the influence of different features on the final copyright ownership, and predict the most likely viewpoint in the future. It should encourage investors and researchers and promote intellectual property protection in China.
Authored by Xinjia Xie, Yunxiao Guo, Jiangting Yin, Shun Gai, Han Long
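The kernel comparison can be sketched by cross-validating the same SVM with different kernel functions, mirroring the finding that the RBF kernel performed best. The feature matrix below is a synthetic stand-in for the orthogonal-design dataset.

```python
# Kernel-comparison sketch for an SVM classifier.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = rng.integers(0, 3, size=300)   # viewpoint categories

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    acc = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(f"{kernel:>7}: {acc:.3f}")
```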
Insider Threat - Insider threats have high-risk and concealment characteristics, which make traditional anomaly detection methods less effective for insider threat detection. Existing detection methods ignore the logical relationship between user behaviors and the consistency of behavior sequences among homogeneous users, resulting in poor model performance. We propose an insider threat detection method based on internal-user heterogeneous graph embedding. First, according to the characteristics of the CERT data, we comprehensively consider the relationships between users, the time sequence, and the logical relationships, and construct a heterogeneous graph. Second, according to the characteristics of heterogeneous graphs, embedding learning of graph nodes is carried out using random walks and Word2vec. Finally, we propose an Insider Threat Detection Design (ITDD) model that maps the user behavior sequence information into a high-dimensional feature space. On the CERT r5.2 dataset, our method performs significantly better than a variety of traditional machine learning methods.
Authored by Chaofan Zheng, Wenhui Hu, Tianci Li, Xueyang Liu, Jinchan Zhang, Litian Wang
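The embedding step can be sketched in the DeepWalk style: random walks over the behavior graph are treated as sentences for Word2vec. The toy graph below stands in for the heterogeneous graph the paper builds from CERT logs, and the walk parameters are arbitrary.

```python
# Random-walk + Word2vec node-embedding sketch.
import random

import networkx as nx
from gensim.models import Word2Vec

g = nx.karate_club_graph()   # toy stand-in for the user-behavior graph
random.seed(0)

def random_walk(start, length=10):
    walk = [start]
    while len(walk) < length:
        walk.append(random.choice(list(g.neighbors(walk[-1]))))
    return [str(n) for n in walk]   # Word2vec expects string tokens

walks = [random_walk(n) for n in g.nodes() for _ in range(5)]
model = Word2Vec(walks, vector_size=32, window=3, min_count=1, sg=1, seed=0)
print("embedding for node 0:", model.wv["0"][:5])
```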
Insider Threat - Among the greatest obstacles in cybersecurity is the insider threat, a well-known and massive issue. This anomaly calls for specialized detection techniques and resources that can help with the accurate and quick detection of a harmful insider. Numerous studies on identifying insider threats and related topics have been conducted to tackle this problem, and various works have sought to improve the conceptual understanding of insider risks. Nevertheless, there are numerous drawbacks, including a dearth of actual cases, unfairness in drawing decisions, a lack of self-optimization in learning (which remains a huge and still vague concern), and the absence of an investigation that covers the conceptual, technological, and numerical facets of insider threats and their identification from a wide range of perspectives. The intention of this paper is to afford a thorough exploration of the categories, levels, and methodologies of modern insider-threat detection based on machine learning techniques. Further, the approaches and evaluation metrics for predictive models based on machine learning are discussed. The paper concludes by outlining the difficulties encountered and offering some suggestions for efficient threat identification using machine learning.
Authored by Nagabhushana Babu, M Gunasekaran