Data poisoning is a type of adversarial attack on training data in which an attacker manipulates a fraction of the data to degrade the performance of a machine learning model. Several defense mechanisms are known for handling offline attacks; however, defenses for online learning, where data points arrive sequentially, have not garnered similar interest. In this work, we propose a defense mechanism that minimizes the degradation caused by poisoned training data on a learner's model in an online setup. Our method uses the influence function, a classic technique from robust statistics, and supplements it with existing data sanitization methods to filter out some of the poisoned data points. We study the effectiveness of our defense mechanism on multiple datasets and across multiple attack strategies against an online learner.
Authored by Sanjay Seetharaman, Shubham Malaviya, Rosni Vasu, Manish Shukla, Sachin Lodha
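As a rough illustration of the influence-function idea above, the sketch below scores an incoming point by how much upweighting it would increase validation loss for a ridge-regression learner; the model choice, the threshold tau, and the helper names are our own assumptions, not the authors' implementation.

import numpy as np

def influence_score(X_tr, y_tr, x_new, y_new, X_val, y_val, lam=1e-3):
    # Classic influence approximation: I(z) = -g_val^T H^{-1} g_z;
    # a positive score means adding z raises validation loss.
    d = X_tr.shape[1]
    theta = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)
    H = X_tr.T @ X_tr / len(X_tr) + lam * np.eye(d)          # loss Hessian
    g_new = (x_new @ theta - y_new) * x_new                   # gradient at the new point
    g_val = X_val.T @ (X_val @ theta - y_val) / len(X_val)    # mean validation gradient
    return -float(g_val @ np.linalg.solve(H, g_new))

def accept(x_new, y_new, X_tr, y_tr, X_val, y_val, tau=10.0):
    # Online sanitization step: reject points whose influence is suspiciously high.
    return influence_score(X_tr, y_tr, x_new, y_new, X_val, y_val) < tau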
Machine learning (ML) models are increasingly used in the development of malware detection systems. Existing research in this area primarily focuses on developing new architectures and feature representation techniques to improve model accuracy. However, recent studies have shown that existing state-of-the-art techniques are vulnerable to adversarial machine learning (AML) attacks, among which data poisoning attacks have been identified as a top concern for ML practitioners. A recently studied clean-label poisoning attack, in which an adversary intentionally crafts training samples so that the model learns a backdoor watermark, was shown to degrade the performance of state-of-the-art classifiers. Defenses against such poisoning attacks have been largely under-explored. We investigate this clean-label poisoning attack and leverage an ensemble-based Nested Training technique to remove most of the poisoned samples from a poisoned training dataset. Our technique exploits the relatively large sensitivity of poisoned samples to feature noise, which disproportionately affects the accuracy of a backdoored model. In particular, we show that for two state-of-the-art architectures trained on the EMBER dataset and affected by the clean-label attack, the Nested Training approach improves the accuracy on backdoored malware samples from 3.42% to 93.2%. We also show that samples produced by the clean-label attack often successfully evade malware classification even when the classifier is not poisoned during training. Even in such scenarios, however, our Nested Training technique mitigates the effect of these clean-label-based evasion attacks by recovering the model's malware detection accuracy from 3.57% to 93.2%.
Authored by Samson Ho, Achyut Reddy, Sridhar Venkatesan, Rauf Izmailov, Ritu Chadha, Alina Oprea
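The noise-sensitivity intuition behind this defense can be sketched as follows; the Gaussian noise scale, trial count, and filtering threshold are illustrative assumptions, and the actual Nested Training ensemble procedure is more involved.

import numpy as np

def noise_sensitivity(model, X, n_trials=20, sigma=0.05, seed=0):
    # Fraction of noisy copies on which the model's prediction flips;
    # poisoned samples tend to be disproportionately sensitive to feature noise.
    rng = np.random.default_rng(seed)
    base = model.predict(X)
    flips = np.zeros(len(X))
    for _ in range(n_trials):
        noisy = X + rng.normal(0.0, sigma, size=X.shape)
        flips += (model.predict(noisy) != base)
    return flips / n_trials

# Sanitization: keep only samples with a low flip rate, then retrain.
# keep = noise_sensitivity(model, X_train) < 0.3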
In recent years, to continuously advance the construction of safe cities, security monitoring equipment has been widely deployed. Using computer vision technology to perform effective intelligent analysis of violence in video surveillance is therefore important for maintaining social stability and protecting people's lives and property. Video surveillance systems are widely used because they are intuitive and convenient; however, existing systems offer relatively limited functionality, generally supporting only the viewing, querying, and playback of surveillance video. In addition, researchers have paid less attention to complex abnormal violent behavior, and related research often ignores the differences between violent behaviors in different scenes. At present, video abnormal-behavior detection faces two main problems: video data of abnormal behavior is scarce, and the definition of abnormal behavior cannot be clearly distinguished across scenes. The main existing approach is to first model normal behavior events and then define videos that do not conform to the normal model as abnormal; among such approaches, deep learning of video spatio-temporal feature representations shows good promise. Faced with massive volumes of surveillance video, deep learning is needed to recognize violent behavior, so that machines learn to identify human actions instead of relying on manual monitoring of camera feeds to raise alarms. The network is trained on video datasets to learn to recognize such actions.
Authored by Xuezhong Wang
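As a minimal sketch of the normal-behavior modeling approach described above (not the paper's architecture), a small spatio-temporal autoencoder can be trained on normal clips only, with high reconstruction error flagging abnormal, potentially violent events.

import torch
import torch.nn as nn

class Clip3DAutoencoder(nn.Module):
    # Toy 3D-convolutional autoencoder over video clips shaped (B, 3, T, H, W).
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv3d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(
            nn.ConvTranspose3d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose3d(16, 3, 3, stride=2, padding=1, output_padding=1))
    def forward(self, clip):
        return self.dec(self.enc(clip))

def anomaly_score(model, clip):
    # Clips the model reconstructs poorly deviate from learned normal behavior.
    with torch.no_grad():
        return torch.mean((model(clip) - clip) ** 2).item()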
Attack detection in enterprise networks increasingly faces large data volumes, at times high data bursts, and heavily fluctuating data flows that often cause arbitrary discarding of data packets in overload situations, which attackers can exploit to hide their activities. Attack detection systems usually configure a comprehensive set of signatures for known vulnerabilities in different operating systems, protocols, and applications. Many of these signatures, however, are not relevant in every context, since certain vulnerabilities have already been eliminated, or the vulnerable applications or operating system versions are not installed on the involved systems. In this paper, we present an approach for clustering data flows to assign them to dedicated analysis units that contain only the signature sets relevant for the analysis of those flows. We discuss the performance of this clustering and show how it can be used in practice to improve the efficiency of an analysis pipeline.
Authored by Michael Vogel, Franka Schuster, Fabian Kopp, Hartmut König
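A toy version of the flow-to-analysis-unit assignment might look like the following; the flow features, cluster count, and rule-file names are illustrative assumptions rather than the paper's configuration.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical flow features: [dst_port, ip_proto, mean_pkt_size, pkts_per_sec]
flows = np.array([
    [443.0,  6.0, 900.0, 120.0],
    [ 80.0,  6.0, 600.0,  80.0],
    [ 53.0, 17.0,  90.0,  15.0],
    [123.0, 17.0,  76.0,   2.0],
])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(flows)

# Each analysis unit loads only the signatures relevant to its cluster; in
# practice the mapping would be derived from the protocols in each cluster.
signature_sets = {0: ["web.rules"], 1: ["dns.rules", "ntp.rules"]}
for flow, cluster in zip(flows, km.labels_):
    print(flow, "->", signature_sets[int(cluster)])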
Current intrusion detection techniques cannot keep up with the increasing amount and complexity of cyber attacks. In fact, most traffic is encrypted, which precludes deep packet inspection approaches. In recent years, machine learning techniques have been proposed for post-mortem detection of network attacks, and many datasets have been shared by research groups and organizations for training and validation. Differently from the vast related literature, in this paper we propose an early classification approach, evaluated on the CSE-CIC-IDS2018 dataset (which contains both benign and malicious traffic), for detecting malicious attacks before they can damage an organization. To this aim, we investigate different sets of features and the sensitivity of the performance of five classification algorithms to the number of observed packets. Results show that ML approaches relying on only ten packets provide satisfactory results.
Authored by Idio Guarino, Giampaolo Bovenzi, Davide Di Monda, Giuseppe Aceto, Domenico Ciuonzo, Antonio Pescapè
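A simple rendering of early classification on the first ten packets of a flow; the feature set (packet sizes and inter-arrival times) and the random-forest choice are our assumptions, as the paper compares five classifiers over several feature sets.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

N = 10  # packets observed per flow before issuing a verdict

def early_features(pkt_sizes, pkt_iats):
    # Features from only the first N packets, zero-padded for short flows.
    sizes, iats = np.zeros(N), np.zeros(N)
    sizes[:min(N, len(pkt_sizes))] = pkt_sizes[:N]
    iats[:min(N, len(pkt_iats))] = pkt_iats[:N]
    return np.concatenate([sizes, iats])

# X_train rows are early_features(...) per flow; y_train marks benign/attack.
# clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
# verdict = clf.predict([early_features(sizes, iats)])  # after only N packets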
The Manufacturer Usage Description (MUD) standard aims to reduce the attack surface of IoT devices by locking down their behavior to a formally specified set of network flows (access control entries). Formal network behaviors can also be systematically and rigorously verified in any operating environment. Enforcing MUD flows and monitoring their activity in real time can be relatively effective in securing IoT devices; however, its scope is limited to endpoints (domain names and IP addresses) and transport-layer protocols and services. Therefore, misconfigured or compromised IoT devices may conform to their MUD-specified behavior yet exchange unintended (or even malicious) content across those flows. This paper develops PicP-MUD, which aims to profile the information content of packet payloads (whether unencrypted, encoded, or encrypted) in each MUD flow of an IoT device. That way, tasks like cyber-risk analysis, change detection, and selective deep packet inspection can be performed more systematically. Our contributions are twofold: (1) we analyze over 123K network flows of 6 transparent (e.g., HTTP), 11 encrypted (e.g., TLS), and 7 encoded (e.g., RTP) protocols, collected in our lab and obtained from public datasets, to identify 17 statistical features of their application payloads that help us distinguish different content types; and (2) we develop and evaluate PicP-MUD using a machine learning model and show that it achieves an average accuracy of 99% in predicting the content type of a flow.
Authored by Arman Pashamokhtari, Arunan Sivanathan, Ayyoob Hamza, Hassan Gharakheili
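The payload-profiling idea can be illustrated with a few simple statistics; the features below (byte entropy, printable ratio, mean byte value) are only a small stand-in for the paper's 17 features, and the example payload is hypothetical.

import math
from collections import Counter

def payload_stats(payload: bytes):
    # Byte entropy alone already separates plaintext, encoded, and encrypted
    # content: encrypted payloads approach 8 bits/byte, plaintext is far lower.
    n = len(payload)
    entropy = -sum(c / n * math.log2(c / n) for c in Counter(payload).values())
    printable = sum(1 for b in payload if 32 <= b < 127) / n
    return {"entropy": entropy, "printable_ratio": printable,
            "mean_byte": sum(payload) / n}

print(payload_stats(b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n"))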
The rise of social media has brought with it a rise in fake news, which carries negative consequences. With fake news being such a large issue, efforts should be made to identify it in all its forms; however, this is not simple. Manually identifying fake news is highly subjective, as determining the accuracy of the information in a story is complex and difficult even for experts. On the other hand, an automated solution requires a good command of natural language processing (NLP), which is also complex and may struggle to produce accurate output. The main problem addressed in this project is therefore the viability of developing a system that can effectively and accurately detect and identify fake news. A solution would significantly benefit the media industry, particularly social media, where a large proportion of fake news is published and spread. To address this problem, this project proposed the development of a fake news identification system using deep learning and natural language processing. The system was built using a Word2vec model combined with a Long Short-Term Memory (LSTM) model to showcase the compatibility of the two models in a single system. It was trained and tested on two dataset collections, each consisting of one real news dataset and one fake news dataset. Furthermore, three independent variables were chosen (the number of training cycles, data diversity, and vector size) to analyze their relationship with the accuracy of the system, and all three were found to have a significant effect. The system was then trained and tested with the optimal variable settings and achieved the minimum expected accuracy level of 90%. Achieving this accuracy confirms the compatibility of the LSTM and Word2vec models and their capability to be synergized into a single system that identifies fake news with a high level of accuracy.
Authored by Anand Matheven, Burra Kumar
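A minimal sketch of the Word2vec-plus-LSTM pipeline, assuming gensim and Keras; the sequence length, vector size, and layer widths are illustrative, not the project's actual settings.

import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras import layers, models

# 1) Learn word vectors from the (tokenized) news corpus.
corpus = [["breaking", "news", "story"], ["official", "report", "released"]]
w2v = Word2Vec(corpus, vector_size=100, min_count=1)
# Articles are then embedded token-by-token via w2v.wv[token].

# 2) Classify embedded articles with an LSTM (1 = fake, 0 = real).
model = models.Sequential([
    layers.Input(shape=(200, 100)),   # 200 tokens, each a 100-d Word2vec vector
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=10)  # X_train: (n_articles, 200, 100)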
Fake news is a recent phenomenon that promotes misleading information and fraud via internet social media or traditional news sources. Fake news is easily manufactured and transmitted across numerous social media platforms and has a significant influence on the real world. It is vital to create effective algorithms and tools for detecting misleading information on social media platforms. Most modern research approaches for identifying fraudulent information are based on machine learning, deep learning, feature engineering, graph mining, image and video analysis, and newly built datasets and online services. There is a pressing need for a viable approach to readily detect misleading information. In this work, deep learning LSTM and Bi-LSTM models are proposed as a method for detecting fake news. First, the NLTK toolkit is used to remove stop words, punctuation, and special characters from the text; the same toolkit is used to tokenize and preprocess the text. GloVe word embeddings are then applied to the preprocessed text, incorporating higher-level characteristics of the input extracted from the long-term relationships between word sequences captured by the RNN-LSTM and Bi-LSTM models. The proposed model additionally employs dropout with Dense layers to improve its efficiency. The proposed RNN Bi-LSTM-based technique obtains best accuracies of 94% and 93% using the Adam optimizer and the binary cross-entropy loss function with dropout rates of 0.1 and 0.2; increasing the dropout rate to 0.3 decreases the model's accuracy to 92%.
Authored by Govind Mahara, Sharad Gangele
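The Bi-LSTM-with-GloVe setup described above might be sketched as follows; the vocabulary size, sequence length, and the random placeholder standing in for real GloVe vectors are assumptions.

import numpy as np
from tensorflow.keras import layers, models, initializers

vocab_size, embed_dim, seq_len = 20000, 100, 200
glove_matrix = np.random.rand(vocab_size, embed_dim)  # placeholder for real GloVe rows

model = models.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, embed_dim, trainable=False,
                     embeddings_initializer=initializers.Constant(glove_matrix)),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.2),          # the abstract reports best accuracy at 0.1-0.2
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])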
Health diseases have become seriously harmful to human life due to unhealthy food and disturbances in the working environment. Precise prediction and diagnosis of disease is a serious and challenging task for early prevention, recognition, and treatment. Motivated by these challenges, we propose Medical Things (MT) and machine learning models to solve healthcare problems with appropriate services for disease supervision, forecasting, and diagnosis. We developed a prediction framework with machine learning approaches to obtain different categories of classification for the predicted disease. The framework is designed as a fuzzy model with a decision tree to lessen the data complexity. We considered heart disease for the experiments, and the experimental evaluation determined the predictions for the categories of classification. The number of decision trees (M), samples (MS), leaf nodes (ML), and learning rate (I) are determined experimentally, with MS = 20.
Authored by Hemanta Bhuyan, Arun Sai, M. Charan, Vignesh Chowdary, Biswajit Brahma
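Reading M, MS, ML, and I as gradient-boosted decision-tree hyperparameters, a sketch of the prediction step could look like this; the feature columns and all values except MS = 20 are our assumptions, and the abstract's fuzzy component is omitted.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical heart-disease features: [age, resting_bp, cholesterol, max_hr]
X = np.array([[63, 145, 233, 150], [37, 130, 250, 187],
              [56, 120, 236, 178], [57, 140, 192, 148]] * 25, dtype=float)
y = np.array([1, 0, 0, 1] * 25)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# M trees, MS samples per split (20, per the abstract), ML leaf nodes, rate I
clf = GradientBoostingClassifier(n_estimators=100, min_samples_split=20,
                                 max_leaf_nodes=8, learning_rate=0.1)
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))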
This paper analyzes techniques to enable differential privacy by adding Laplace noise to healthcare data. First, as healthcare data contain natural constraints for data to take only integral values, we show that drawing only integral values does not provide differential privacy. In contrast, rounding randomly drawn values to the nearest integer provides differential privacy. Second, when a variable is constructed using two other variables, noise must be added to only one of them. Third, if the constructed variable is a fraction, then noise must be added to its constituent private variables, and not to the fraction directly. Fourth, the accuracy of analytics following noise addition increases with the privacy budget, ϵ, and the variance of the independent variable. Finally, the accuracy of analytics following noise addition increases disproportionately with an increase in the privacy budget when the variance of the independent variable is greater. Using actual healthcare data, we provide evidence supporting the two predictions on the accuracy of data analytics. Crucially, to enable accuracy of data analytics with differential privacy, we derive a relationship to extract the slope parameter in the original dataset using the slope parameter in the noisy dataset.
Authored by Rishabh Subramanian
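The paper's first observation (round a real-valued Laplace draw rather than drawing integral noise) is directly executable; the sensitivity and epsilon values below are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def laplace_integer(value, sensitivity=1.0, epsilon=1.0):
    # Draw real-valued Laplace noise, then round the noisy result to the
    # nearest integer: per the paper, rounding preserves differential
    # privacy, whereas drawing only integral noise values does not.
    return int(round(value + rng.laplace(0.0, sensitivity / epsilon)))

print(laplace_integer(42, epsilon=0.5))  # e.g., a privatized patient count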
With the variety of cloud services, cloud service providers deliver machine learning services that are used in many applications, including risk assessment, product recommendation, and image recognition. The cloud service provider initiates a protocol for the classification service that enables data owners to request an evaluation of their data. The owners may not entirely trust the cloud environment, as it is managed by third parties, and protecting data privacy while sharing it is a significant challenge. We propose a novel privacy-preserving model based on differential privacy and machine learning approaches. The proposed model allows various data owners to store, share, and utilize data in the cloud environment. Experiments are conducted on the Blood Transfusion Service Center, Phoneme, and Wilt datasets to demonstrate the proposed model's efficiency in terms of accuracy, precision, recall, and F1-score. The results show that the proposed model achieves accuracy, precision, recall, and F1-score up to 97.72%, 98.04%, 97.72%, and 98.80%, respectively.
Authored by Rishabh Gupta, Ashutosh Singh
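One simple instantiation of the differential-privacy-plus-ML idea (not the paper's exact protocol): each owner perturbs records with Laplace noise before uploading, and the cloud trains only on the privatized copy. All parameters and the synthetic data are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def privatize(X, epsilon=1.0, sensitivity=1.0):
    # Owner-side perturbation so the cloud never sees raw records.
    return X + rng.laplace(0.0, sensitivity / epsilon, size=X.shape)

X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # synthetic labels
clf = LogisticRegression().fit(privatize(X, epsilon=2.0), y)
print("accuracy on clean data:", clf.score(X, y))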
Stealthy attackers often disable or tamper with system monitors to hide their tracks and evade detection. In this poster, we present a data-driven technique for detecting such monitor compromise using evidential reasoning. Leveraging the fact that it is difficult for an attacker to hide from multiple, redundant monitors, we identify potential monitor compromise by combining alerts from different sets of monitors using Dempster-Shafer theory and comparing the results to find outliers. We describe our ongoing work in this area.
Authored by Uttam Thakore, Ahmed Fawaz, William Sanders
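A small sketch of Dempster's rule of combination over a two-hypothesis frame ("compromised" vs. "clean"); the mass assignments are made-up illustrations of alerts carrying uncertainty.

def combine(m1, m2):
    # Dempster's rule: multiply masses, keep non-empty intersections,
    # and renormalize by the total non-conflicting mass.
    combined, conflict = {}, 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

C = frozenset({"compromised"})
K = frozenset({"clean"})
U = C | K                                    # mass assigned to "don't know"
monitor_a = {C: 0.6, U: 0.4}                 # alert with residual uncertainty
monitor_b = {K: 0.3, U: 0.7}
print(combine(monitor_a, monitor_b))         # disagreement hints at compromise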