Ethical bias in machine learning models has become a matter of concern in the software engineering community. Most prior software engineering work has concentrated on finding ethical bias in models rather than fixing it. After finding bias, the next step is mitigation. Prior researchers mainly tried to use supervised approaches to achieve fairness. However, in the real world, getting data with trustworthy ground truth is challenging, and the ground truth itself can contain human bias. Semi-supervised learning is a technique where labeled data is incrementally used to generate pseudo-labels for the rest of the data (and then all that data is used for model training). In this work, we apply four popular semi-supervised techniques as pseudo-labelers to create fair classification models. Our framework, Fair-SSL, takes a very small amount (10%) of labeled data as input and generates pseudo-labels for the unlabeled data. We then synthetically generate new data points to balance the training data based on class and protected attribute, as proposed by Chakraborty et al. in FSE 2021. Finally, a classification model is trained on the balanced pseudo-labeled data and validated on test data. After experimenting on ten datasets and three learners, we find that Fair-SSL achieves performance similar to three state-of-the-art bias mitigation algorithms. That said, the clear advantage of Fair-SSL is that it requires only 10% of the labeled training data. To the best of our knowledge, this is the first SE work where semi-supervised techniques are used to fight against ethical bias in SE ML models. To facilitate open science and replication, all our source code and datasets are publicly available at https://github.com/joymallyac/FairSSL.
Authored by Joymallya Chakraborty, Suvodeep Majumder, Huy Tu
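A minimal sketch of the pseudo-labeling step described above, using scikit-learn's SelfTrainingClassifier as one stand-in for the paper's four semi-supervised techniques; the 10% labeled split is taken from the abstract, but the toy data, base learner, and threshold are assumptions, and the fairness-aware balancing step is only noted in a comment.

```python
# Hypothetical sketch of the Fair-SSL idea: train on ~10% labeled data,
# pseudo-label the rest, then fit a downstream classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier

# Toy tabular data standing in for a fairness benchmark dataset.
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Keep only 10% of the training labels; mark the rest as unlabeled (-1).
rng = np.random.default_rng(42)
unlabeled = rng.random(len(y_train)) > 0.10
y_semi = y_train.copy()
y_semi[unlabeled] = -1

# Self-training: the base learner iteratively assigns pseudo-labels
# to unlabeled points it predicts with high confidence.
pseudo_labeler = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.8)
pseudo_labeler.fit(X_train, y_semi)

# In Fair-SSL, the pseudo-labeled data would next be balanced across
# class and protected attribute before training the final model;
# here we simply evaluate the self-trained model.
print("Test accuracy:", pseudo_labeler.score(X_test, y_test))
```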
During the COVID-19 pandemic, the number of cyber-attacks, including phishing activities, has increased tremendously. Many technical solutions for phishing detection have been developed; however, these approaches have either been unsuccessful or unable to identify phishing pages and detect malicious code efficiently. One downside is poor detection accuracy and low adaptability to new phishing connections. Another reason behind unsuccessful anti-phishing solutions is the arbitrary selection of URL-based classification features, which may produce false detection results. Therefore, in this work, an intelligent phishing detection and prevention model is designed. The proposed model employs a self-destruct detection algorithm in which machine learning, specifically a supervised learning algorithm, is used. All rules employed in the algorithm focus on URL-based web characteristics, which attackers rely upon to redirect victims to simulated sites. Datasets from various sources such as PhishTank and the UCI Machine Learning Repository were used, and testing was conducted in a controlled lab environment. As a result, a Chrome extension for phishing detection was developed based on the proposed model to help prevent phishing attacks with appropriate countermeasures and to keep users aware of phishing when visiting illegitimate websites. It is believed that this smart phishing detection and prevention model is able to prevent fraud and spam websites and lessen the cyber-crime and cyber-crises that arise from year to year.
Authored by Amir Rose, Nurlida Basir, Nur Heng, Nurzi Zaizi, Madihah Saudi
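A minimal sketch of URL-based feature extraction feeding a supervised classifier, consistent with the general approach described above; the specific lexical features, the RandomForest learner, and the toy URLs are illustrative assumptions, not the authors' rule set or dataset.

```python
# Hypothetical sketch: hand-crafted URL features fed to a supervised learner.
from urllib.parse import urlparse
from sklearn.ensemble import RandomForestClassifier

def url_features(url: str) -> list:
    """Extract simple lexical features commonly used in phishing detection."""
    parsed = urlparse(url)
    host = parsed.netloc
    return [
        len(url),                               # overall URL length
        url.count('.'),                         # number of dots
        url.count('-'),                         # hyphens often appear in spoofed hosts
        int(parsed.scheme == 'https'),          # HTTPS or not
        int('@' in url),                        # '@' can hide the real destination
        int(any(ch.isdigit() for ch in host)),  # digits in the host name
    ]

# Tiny illustrative training set (labels: 1 = phishing, 0 = legitimate).
urls = [
    ("https://www.example.com/login", 0),
    ("http://secure-example.com.verify-account.xyz/@login", 1),
    ("https://university.edu/courses", 0),
    ("http://192.168.10.5/paypa1-update", 1),
]
X = [url_features(u) for u, _ in urls]
y = [label for _, label in urls]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([url_features("http://account-verify.example-login.top/@signin")]))
```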
Supervisory control and data acquisition (SCADA) systems play a pivotal role in the operation of modern critical infrastructures (CIs). Technological advancements, innovations, economic trends, etc. have continued to improve the effectiveness of SCADA systems and overall CI throughput. However, these trends have also continued to expose SCADA systems to security menaces. Intrusions and attacks on SCADA systems can cause service disruptions, equipment damage, and even fatalities. Conventional intrusion detection models have shown trends of ineffectiveness due to the complexity and sophistication of modern-day SCADA attacks and intrusions. Also, SCADA characteristics and requirements necessitate exceptional security considerations with regard to mitigating intrusive events. This paper explores the viability of supervised learning algorithms in detecting intrusions specific to SCADA systems and their communication protocols. Specifically, we examine four supervised learning algorithms: Random Forest, Naïve Bayes, J48 Decision Tree, and Sequential Minimal Optimization-Support Vector Machines (SMO-SVM) for evaluating SCADA datasets. Two SCADA datasets were used for evaluating the performance of our approach. To improve classification performance, feature selection using principal component analysis was used to preprocess the datasets. Using prominent classification metrics, SMO-SVM presented the best overall results on the two datasets. In summary, the results showed that supervised learning algorithms were able to classify intrusions targeted against SCADA systems with satisfactory performance.
Authored by Oyeniyi Alimi, Khmaies Ouahada, Adnan Abu-Mahfouz, Suvendi Rimer, Kuburat Alimi
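A minimal sketch of the preprocessing-plus-classification pipeline named above: principal component analysis for dimensionality reduction followed by an SVM (scikit-learn's libsvm-based SVC standing in for Weka's SMO solver). The synthetic dataset, number of components, and kernel settings are assumptions.

```python
# Hypothetical sketch: PCA preprocessing followed by an SVM classifier,
# mirroring the SCADA intrusion-detection pipeline described above.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a SCADA network-traffic dataset (imbalanced classes).
X, y = make_classification(n_samples=3000, n_features=30, n_informative=12,
                           weights=[0.8, 0.2], random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=7)

pipeline = Pipeline([
    ("scale", StandardScaler()),          # PCA is sensitive to feature scale
    ("pca", PCA(n_components=10)),        # keep the top principal components
    ("svm", SVC(kernel="rbf", C=1.0)),    # libsvm solver; analogous to Weka's SMO
])
pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))
```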
The volume of SMS messages sent daily worldwide has continued to grow significantly over the past years. Hence, mobile phones are becoming increasingly vulnerable to SMS spam messages, exposing users to the risk of fraud and theft of personal data. Filtering of messages to detect and eliminate SMS spam is now a critical functionality for which different types of machine learning approaches are still being explored. In this paper, we propose a system for detecting SMS spam using a semi-supervised novelty detection approach based on a one-class SVM classifier. The system is built as an anomaly detector that learns only from normal SMS messages, thus enabling detection models to be implemented in the absence of labelled SMS spam training examples. We evaluated our proposed system using a benchmark dataset consisting of 747 SMS spam and 4,827 non-spam messages. The results show that our proposed method outperformed traditional supervised machine learning approaches based on binary, frequency, or TF-IDF bag-of-words. The overall accuracy was 98%, with a 100% SMS spam detection rate and only around a 3% false positive rate.
Authored by Suleiman Yerima, Abul Bashar
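A minimal sketch of the one-class SVM novelty-detection idea described above: the model is fit only on legitimate (ham) messages represented with TF-IDF, and anything it flags as an outlier is treated as spam. The toy messages and the nu/gamma settings are assumptions, not the paper's configuration.

```python
# Hypothetical sketch: TF-IDF features + one-class SVM trained on ham only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

ham = [
    "are we still meeting for lunch tomorrow",
    "can you send me the slides from class",
    "happy birthday hope you have a great day",
    "running late be there in ten minutes",
]
unseen = [
    "call me when you get home",
    "WINNER!! claim your FREE prize now, text 80086 to collect cash",
]

vectorizer = TfidfVectorizer()
X_ham = vectorizer.fit_transform(ham)      # learn the vocabulary from ham only
X_unseen = vectorizer.transform(unseen)

# nu bounds the fraction of training points treated as outliers.
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_ham)

# +1 = looks like a normal SMS (ham), -1 = novelty, i.e. likely spam.
print(detector.predict(X_unseen))
```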
Recent years have witnessed a surge in ransomware attacks. In particular, many new ransomware variants have continued to emerge, employing more advanced techniques to distribute the payload while avoiding detection. This renders traditional static ransomware detection mechanisms ineffective. In this paper, we present our Hardware Anomaly Realtime Detection - Lightweight (HARD-Lite) framework, which employs a semi-supervised machine learning method to detect ransomware using low-level hardware information. By using an LSTM network with a weighted majority voting ensemble and an exponential moving average, we take into consideration the temporal aspect of hardware-level information formed as time series in order to detect deviations in system behavior, thereby increasing detection accuracy whilst reducing the number of false positives. Tested against various ransomware samples across multiple families, HARD-Lite has demonstrated remarkable effectiveness, successfully detecting all cases tested. Moreover, with a hierarchical design that distributes the classifier from the monitored user machine to a server machine, HARD-Lite also enables good scalability.
Authored by Chutitep Woralert, Chen Liu, Zander Blasingame
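A minimal sketch of the temporal-smoothing and ensemble-voting ideas mentioned above: per-window anomaly scores are smoothed with an exponential moving average and then combined by a weighted majority vote. The LSTM scoring itself is abstracted away behind placeholder score arrays, and all thresholds and weights are assumptions.

```python
# Hypothetical sketch: combine per-window anomaly scores with an exponential
# moving average and a weighted majority vote. In HARD-Lite the raw scores
# would come from LSTM models over hardware performance-counter time series.
import numpy as np

def exponential_moving_average(scores, alpha=0.3):
    """Smooth anomaly scores: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed = np.zeros_like(scores, dtype=float)
    smoothed[0] = scores[0]
    for t in range(1, len(scores)):
        smoothed[t] = alpha * scores[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed

def weighted_majority_vote(votes, weights, threshold=0.5):
    """Each detector votes 0/1 per window; flag windows whose weighted vote exceeds the threshold."""
    weights = np.asarray(weights, dtype=float)
    weighted = (np.asarray(votes, dtype=float) * weights[:, None]).sum(axis=0) / weights.sum()
    return weighted > threshold

# Placeholder scores from three hypothetical detectors over 8 time windows.
raw_scores = np.array([
    [0.1, 0.2, 0.1, 0.7, 0.8, 0.9, 0.2, 0.1],
    [0.0, 0.1, 0.3, 0.6, 0.9, 0.8, 0.3, 0.2],
    [0.2, 0.1, 0.2, 0.5, 0.7, 0.9, 0.1, 0.1],
])
votes = np.stack([exponential_moving_average(s) > 0.5 for s in raw_scores])
print(weighted_majority_vote(votes, weights=[0.5, 0.3, 0.2]))
```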
The spread of Internet of Things (IoT) devices through our homes, healthcare facilities, industries, etc., which are more easily infiltrated than desktop computers, has resulted in a surge of botnet attacks based on IoT devices, jeopardizing IoT security. Hence, there is a need to detect these attacks and mitigate the damage. Existing systems rely on supervised learning-based intrusion detection methods, which require a large labelled data set to achieve high accuracy. Botnets are onerous to detect because of their stealthy command & control protocols and the large amount of network traffic they generate, so obtaining a large labelled data set is also difficult. Because the network traffic is unlabeled, supervised classification techniques cannot be applied directly to identify the botnet responsible for an attack. To overcome this limitation, a semi-supervised Deep Learning (DL) approach is proposed which uses a Semi-supervised GAN (SGAN) for IoT botnet detection on the N-BaIoT dataset, which contains "Bashlite" and "Mirai" attacks along with their sub-attacks. The results have been compared with state-of-the-art supervised solutions and found to be more efficient, achieving 99.89% accuracy in binary classification and 59% in multi-class classification on the larger dataset, yielding a faster and more reliable model for IoT botnet detection.
Authored by Kumar Saurabh, Ayush Singh, Uphar Singh, O.P. Vyas, Rahamatullah Khondoker
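A minimal PyTorch sketch of the semi-supervised GAN (SGAN) discriminator objective mentioned above: a cross-entropy loss on the few labeled flows plus an unsupervised real-vs-fake term derived from the same class logits via logsumexp. The network sizes, feature width, and random placeholder batches are assumptions, not the paper's architecture or data.

```python
# Hypothetical sketch of the SGAN discriminator losses for botnet detection.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_FEATURES, N_CLASSES, NOISE_DIM = 115, 2, 64  # e.g., an N-BaIoT-like feature width

discriminator = nn.Sequential(
    nn.Linear(N_FEATURES, 128), nn.ReLU(),
    nn.Linear(128, N_CLASSES),                 # class logits (benign vs. botnet)
)
generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 128), nn.ReLU(),
    nn.Linear(128, N_FEATURES),                # fake traffic feature vectors
)

def real_fake_logit(class_logits):
    """SGAN trick: the real-vs-fake score is the logsumexp over the K class logits."""
    return torch.logsumexp(class_logits, dim=1)

# Placeholder batches: a few labeled flows, many unlabeled flows, and noise.
x_labeled = torch.randn(16, N_FEATURES)
y_labeled = torch.randint(0, N_CLASSES, (16,))
x_unlabeled = torch.randn(64, N_FEATURES)
z = torch.randn(64, NOISE_DIM)

# Supervised term: ordinary cross-entropy on the small labeled portion.
loss_supervised = F.cross_entropy(discriminator(x_labeled), y_labeled)

# Unsupervised term: push real unlabeled samples toward "real" and
# generated samples toward "fake" (softplus form of the GAN loss).
real_score = real_fake_logit(discriminator(x_unlabeled))
fake_score = real_fake_logit(discriminator(generator(z).detach()))
loss_unsupervised = F.softplus(-real_score).mean() + F.softplus(fake_score).mean()

d_loss = loss_supervised + loss_unsupervised
d_loss.backward()  # in a real training loop, an optimizer step would follow
print(float(d_loss))
```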
Cyber threat intelligence (CTI) is vital for enabling effective cybersecurity decisions by providing timely, relevant, and actionable information about emerging threats. Monitoring the dark web to generate CTI is one of the emerging trends in cybersecurity. As a result, developing CTI capabilities through dark web investigation is a significant focus for cybersecurity companies like Deepwatch, DarkOwl, SixGill, ThreatConnect, CyLance, ZeroFox, and many others. In addition, dark web marketplace (DWM) monitoring tools are of much interest to law enforcement agencies (LEAs). The fact that darknet market participants operate anonymously and online transactions are pseudo-anonymous makes it challenging to identify and investigate them. Therefore, keeping up with the DWMs poses significant challenges for LEAs today. Nevertheless, the offerings on the DWMs give LEAs insight into the dark web economy. The present work is one such attempt to describe and analyze dark web market data collected for CTI using a dark web crawler. After processing and labeling, the authors have data from 53 DWMs with their product listings and pricing.
Authored by Ashwini Dalvi, Gunjan Patil, S Bhirud
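A minimal sketch of one crawling step of the kind described above: fetching a marketplace listing page over a local Tor SOCKS proxy and extracting product titles and prices for later processing and labeling. The onion URL, CSS selectors, and proxy port are purely illustrative placeholders (and `requests` needs the `requests[socks]` extra for SOCKS support).

```python
# Hypothetical sketch: fetch a DWM listing page via a local Tor SOCKS proxy
# and extract product names and prices for later labeling.
import requests
from bs4 import BeautifulSoup

TOR_PROXY = {"http": "socks5h://127.0.0.1:9050",
             "https": "socks5h://127.0.0.1:9050"}
LISTING_URL = "http://exampleonionaddress.onion/listings"  # placeholder, not a real market

def crawl_listings(url: str) -> list:
    response = requests.get(url, proxies=TOR_PROXY, timeout=60)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    items = []
    # Selectors are market-specific; these class names are placeholders.
    for card in soup.select("div.listing"):
        items.append({
            "title": card.select_one("a.title").get_text(strip=True),
            "price": card.select_one("span.price").get_text(strip=True),
        })
    return items

if __name__ == "__main__":
    for item in crawl_listings(LISTING_URL):
        print(item)
```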
Cyber threats can cause severe damage to computing infrastructure and systems, as well as data breaches that leave sensitive data vulnerable to attackers and adversaries. It is therefore imperative to discover those threats and stop them before bad actors penetrate the information systems. Threat hunting algorithms based on machine learning have shown great advantages over classical methods. Reinforcement learning models are becoming more accurate at identifying not only signature-based but also behavior-based threats. Quantum mechanics brings a new dimension to improving classification speed, with an exponential advantage. The accuracy of AI/ML algorithms can be affected by many factors, ranging from the algorithm and the data to prejudicial or even intentional bias. As a result, AI/ML applications need to be non-biased and trustworthy. In this research, we developed a machine learning-based cyber threat detection and assessment tool. It uses a two-stage (both unsupervised and supervised learning) analysis method on 822,226 log records from a web server on the AWS cloud. The results show that the algorithm is able to identify threats with high confidence.
Authored by Shuangbao Wang, Md Arafin, Onyema Osuagwu, Ketchiozo Wandji
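A minimal sketch of a two-stage pipeline in the spirit described above: an unsupervised clustering pass over log-derived features to surface candidate threat groups, followed by a supervised classifier trained on the resulting (analyst-reviewed) labels. The features, cluster count, and learner are assumptions, not the authors' exact tool.

```python
# Hypothetical sketch: stage 1 clusters log-derived features to surface
# suspicious groups; stage 2 trains a supervised model on reviewed labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Toy per-client features: [requests per minute, error ratio, distinct URLs hit].
rng = np.random.default_rng(0)
normal = rng.normal([20, 0.02, 5], [5, 0.01, 2], size=(500, 3))
scanner = rng.normal([300, 0.60, 80], [50, 0.10, 10], size=(30, 3))
X = np.vstack([normal, scanner])
X_scaled = StandardScaler().fit_transform(X)

# Stage 1 (unsupervised): cluster, then treat the small high-rate cluster
# as candidate threats for analyst review.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
threat_cluster = np.argmin(np.bincount(clusters))        # the minority cluster
candidate_labels = (clusters == threat_cluster).astype(int)

# Stage 2 (supervised): train on the reviewed labels for scoring future traffic.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_scaled, candidate_labels)
print("Flagged as threats:", int(clf.predict(X_scaled).sum()))
```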
Nowadays, although it is much more convenient to obtain news through social media and various news platforms, the emergence of all kinds of fake news has become a headache and an urgent problem that needs to be solved. Currently, fake news recognition algorithms mainly use GCNs, along with some other niche algorithms such as GRUs, CNNs, etc. Although all fake news verification algorithms can reach quite high accuracy with sufficient datasets, there is still room for improvement in semi-supervised and unsupervised learning. Through comparison with other neural network models, this article finds that the accuracy of the GCN method for fake news detection is roughly 85%, which is satisfactory, and proposes that the field currently lacks a unified training dataset and that future fake news detection models should focus more on semi-supervised and unsupervised learning.
Authored by Zhichao Wang
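A minimal sketch of the kind of graph convolutional layer the GCN-based detectors above rely on, implemented directly with the normalized-adjacency propagation rule H' = relu(D^{-1/2}(A + I)D^{-1/2} H W); the toy propagation graph, feature sizes, and node-pooling readout are assumptions.

```python
# Hypothetical sketch: a bare-bones GCN layer over a news-propagation graph.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution: H' = relu(D^-1/2 (A + I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, adj, features):
        a_hat = adj + torch.eye(adj.size(0))       # add self-loops
        deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm_adj = deg_inv_sqrt[:, None] * a_hat * deg_inv_sqrt[None, :]
        return torch.relu(self.linear(norm_adj @ features))

# Toy propagation graph: node 0 is the source article, nodes 1-3 are reposts.
adj = torch.tensor([[0., 1., 1., 0.],
                    [1., 0., 0., 1.],
                    [1., 0., 0., 0.],
                    [0., 1., 0., 0.]])
features = torch.randn(4, 16)                      # e.g., text embeddings per node

layer1, layer2 = GCNLayer(16, 32), GCNLayer(32, 2)
hidden = layer1(adj, features)
logits = layer2(adj, hidden).mean(dim=0)           # mean-pool nodes into fake/real scores
print(logits)
```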