In this work, we present a comprehensive survey of applications of the recent attention-based transformer architecture in information security. Our review reveals three primary areas of application: intrusion detection, anomaly detection, and malware detection. We present an overview of attention-based mechanisms and their application to each cybersecurity use case, and discuss open grounds for future trends in Artificial Intelligence-enabled information security.
Authored by M. Vubangsi, Sarumi Abidemi, Olukayode Akanni, Auwalu Mubarak, Fadi Al-Turjman
Cybersecurity is the practice of preventing cyberattacks on vital infrastructure and private data. Government organisations, banks, hospitals, and every other industry sector are increasingly investing in cybersecurity infrastructure to safeguard their operations and the millions of consumers who entrust them with their personal information. Cyber threat activity is alarming in a world where businesses are more interconnected than ever before, raising concerns about how well organisations can protect themselves from widespread attacks. Threat intelligence solutions employ Natural Language Processing (NLP) to read and interpret the meaning of words and technical data in various languages and to find trends in them. NLP is enabling machines to analyse diverse data sources in multiple languages with increasing precision. This paper frames software vulnerability detection as an NLP problem, treating source code as text, and addresses automated software vulnerability detection with recent advanced deep learning NLP models. We created and compared various deep learning models based on their accuracy; the best performer achieved 95\% accuracy. Furthermore, we have made an effort to predict which vulnerability class a particular source code sample belongs to, and have developed a robust dashboard using FastAPI and ReactJS.
Authored by Kanchan Singh, Sakshi Grover, Ranjini Kumar
Topic modeling algorithms from the natural language processing (NLP) discipline have been used for various applications; for instance, topic modeling has been applied to product recommendation systems in e-commerce. In this paper, we briefly review topic modeling applications and then describe our proposed idea of utilizing topic modeling approaches for cyber threat intelligence (CTI) applications. We improve on previous work by implementing the BERTopic and Top2Vec approaches, enabling users to select their preferred pre-trained text/sentence embedding model, and supporting various languages. We implemented our proposed idea as the new topic modeling module for the Open Web Application Security Project (OWASP) Maryam: Open-Source Intelligence (OSINT) framework. We also describe our experiment results using a leaked hacker forum dataset (nulled.io) to attract more researchers and open-source communities to participate in the Maryam project of the OWASP Foundation.
Authored by Hatma Suryotrisongko, Hari Ginardi, Henning Ciptaningtyas, Saeed Dehqan, Yasuo Musashi
Vulnerability Detection 2022 - With the booming development of deep learning and machine learning, the use of neural networks for software source code security vulnerability detection has become a hot topic in the field of software security. As a data structure, graphs can adequately represent the complex syntactic information, semantic information, and dependencies in software source code. In this paper, we propose the MPGVD model based on the idea of text classification in natural language processing. The model uses BERT for source code pre-training, transforms graphs into corresponding feature vectors, uses MPNN (Message Passing Neural Networks) based on graph neural networks in the feature extraction phase, and finally outputs the detection results. Our proposed MPGVD, compared with other existing vulnerability detection models on the same dataset, CodeXGLUE, obtains the highest detection accuracy of 64.34\%.
Authored by Yang Xue, Junjun Guo, Li Zhang, Huiyu Song
Privacy Policies - Data privacy laws like the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) provide guidelines for collecting personal information from individuals and processing it. These frameworks also require service providers to inform their customers how clients' data is gathered, used, protected, and shared with other parties. A privacy policy is a legal document used by service providers to inform users how their personal information is collected, stored, and shared. Privacy policies are expected to adhere to data privacy regulations; however, it has been observed that some policies deviate from the practices recommended by data protection regulations. Detecting instances where a policy may violate a certain regulation is quite challenging because privacy policy text is long and complex and there are numerous regulations. To address this problem, we have designed an approach to automatically detect whether a policy violates the articles of the GDPR. This paper demonstrates how we use Natural Language Inference (NLI) tasks to compare privacy content against the GDPR in order to detect privacy policy text that violates the GDPR. We provide two designs using the Stanford Natural Language Inference (SNLI) and the Multi-Genre Natural Language Inference (MultiNLI) datasets. The results from both designs are promising, as our approach detected the deviations with 76\% accuracy.
Authored by Abdullah Alshamsan, Shafique Chaudhry
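The NLI framing described above can be sketched in miniature. This is an illustrative sketch, not the authors' system: a real implementation would replace `nli_predict` with a model fine-tuned on SNLI/MultiNLI, and the GDPR hypothesis sentences and the keyword heuristic here are invented for the example.

```python
# Sketch: checking policy sentences against GDPR-derived hypotheses via NLI.
# `nli_predict` is a trivial stand-in so the pipeline structure is runnable;
# the hypothesis texts are paraphrases assumed for illustration only.

GDPR_HYPOTHESES = {
    "Art. 17": "users can request deletion of their personal data",
    "Art. 20": "users can obtain and reuse their personal data",
}

def nli_predict(premise: str, hypothesis: str) -> str:
    """Placeholder NLI: flag explicit negation as contradiction, then
    use word overlap for entailment vs. neutral."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    if "not" in p or "never" in p:
        return "contradiction"
    overlap = len(p & h) / len(h)
    return "entailment" if overlap > 0.5 else "neutral"

def check_policy(policy_sentences, hypotheses=GDPR_HYPOTHESES):
    """Return the articles whose hypothesis is contradicted by any sentence."""
    violations = []
    for article, hypothesis in hypotheses.items():
        for sentence in policy_sentences:
            if nli_predict(sentence, hypothesis) == "contradiction":
                violations.append(article)
                break
    return violations
```

In a real system the comparison would run sentence-by-sentence over the full policy, exactly because policy texts are long; the per-article loop above is the part that carries over.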
Privacy Policies - Privacy policies, despite the important information they provide about the collection and use of one’s data, tend to be skipped over by most Internet users. In this paper, we seek to make privacy policies more accessible by automatically classifying text samples into web privacy categories. We use natural language processing techniques and multiple machine learning models to determine the effectiveness of each method in the classification method. We also explore the effectiveness of these methods to classify privacy policies of Internet of Things (IoT) devices.
Authored by Jasmine Carson, Lisa DiSalvo, Lydia Ray
Network Coding - Software vulnerabilities, caused by unintentional flaws in source code, are the main root cause of cyberattacks. Source code static analysis has been used extensively to detect the unintentional defects, i.e. vulnerabilities, introduced into source code by software developers. In this paper, we propose a deep learning approach to detect vulnerabilities from their LLVM IR representations based on techniques that have been used in natural language processing. The proposed approach uses a hierarchical process to first identify source code with vulnerabilities, and then identify the lines of code that contribute to the vulnerability within the detected source code. This two-step approach reduces the false alarm rate of detecting vulnerable lines. Our extensive experiments on real-world and synthetic code collected from the NVD and SARD show high accuracy (about 98\%) in detecting source code vulnerabilities.
Authored by Arash Mahyari
Natural Language Processing - The dissemination of fake news is a matter of major concern that can result in national and social damage with devastating impacts. Misleading information on the internet is dubious and arduous to identify. Machine learning models are becoming an irreplaceable component in the detection of fake news spreading on social media. LSTM is a memory-based machine learning model for the detection of false news; it is a promising approach that mitigates the vanishing gradient issue of RNNs. The integration of natural language processing and the LSTM model is considered effective for false news identification.
Authored by Abina Azees, Geevarghese Titus
Natural Language Processing - Rule-based Web vulnerability detection is the most common method, usually based on analysis of the website code and feedback from the detection target. This process generates a large amount of contaminated data and network pressure, and the false positive rate is high. This study implements a detection platform based on a crawler and NLP. We first use the crawler to obtain HTTP requests on the target system, then classify the dataset according to whether parameters are present and whether the samples interact with a database. We then convert the text into word vectors, reduce their dimensionality and serialize them, and train on the dataset with an NLP algorithm to obtain a model that accurately predicts Web vulnerabilities. Experimental results show that this method detects Web vulnerabilities efficiently, greatly reduces invalid attack test parameters, and reduces network pressure.
Authored by Xin Ge, Min-Nan Yue
Natural Language Processing - Application code analysis and static rules are the most common methods for Web vulnerability detection, but this process generates a large amount of contaminated data and network pressure, and the false positive rate is high. This study implements a detection system based on a crawler and NLP. The crawler visits pages in imitation of a human; we collect the HTTP requests and responses as a dataset, classify the dataset according to parameter characteristics and whether the samples interact with a database, then convert the text into word vectors, reduce their dimensionality and serialize them, and train on the dataset with an NLP algorithm. We finally obtain a model that accurately predicts Web vulnerabilities. Experimental results show that this method detects Web vulnerabilities efficiently, greatly reduces invalid attack test parameters, and reduces network pressure.
Authored by Xin Ge, Minnan Yue
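The first stage described in the two abstracts above, partitioning crawled HTTP requests by whether they carry parameters and whether they plausibly reach a database, can be sketched as follows. This is an illustrative sketch, not the papers' implementation; the database-interaction test here is a naive parameter-name heuristic invented for the example.

```python
# Sketch: triaging crawled URLs so only parameterised, likely
# database-facing requests are fed to the downstream NLP model.

from urllib.parse import urlparse, parse_qs

# Assumed parameter names that often map to database lookups (illustrative).
DB_HINTS = {"id", "uid", "user", "product", "page", "cat"}

def classify_request(url: str) -> str:
    """Bucket a crawled URL: static, parameterised, or db-interactive."""
    params = parse_qs(urlparse(url).query)
    if not params:
        return "static"            # no parameters: nothing to fuzz
    if DB_HINTS & set(params):
        return "db-interactive"    # likely backed by a database query
    return "parameterised"

def partition(urls):
    """Group a crawl result into the three buckets."""
    groups = {"static": [], "parameterised": [], "db-interactive": []}
    for url in urls:
        groups[classify_request(url)].append(url)
    return groups
```

This triage is what lets the approach "greatly reduce invalid attack test parameters": static pages are dropped before any payloads are sent.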
Natural Language Processing - Story Ending Generation (SEG) is a challenging task in natural language generation. Recently, methods based on Pre-trained Language Models (PLM) have achieved great success and can produce fluent and coherent story endings. However, the pre-training objective of PLM-based methods is unable to model the consistency between story context and ending. The goal of this paper is to adopt contrastive learning to generate endings more consistent with the story context, though there are two main challenges in contrastive learning for SEG. The first is the negative sampling of wrong endings inconsistent with story contexts. The second is the adaptation of contrastive learning to SEG. To address these two issues, we propose a novel Contrastive Learning framework for Story Ending Generation (CLSEG), which has two steps: multi-aspect sampling and story-specific contrastive learning. For the first issue, we utilize novel multi-aspect sampling mechanisms to obtain wrong endings considering the consistency of order, causality, and sentiment. To solve the second issue, we design a story-specific contrastive training strategy adapted for SEG. Experiments show that CLSEG outperforms baselines and can produce story endings with stronger consistency and rationality.
Authored by Yuqiang Xie, Yue Hu, Luxi Xing, Yunpeng Li, Wei Peng, Ping Guo
Natural Language Processing - The new capital city (IKN) of the Republic of Indonesia was ratified and inaugurated by President Joko Widodo in January 2022. Unfortunately, many Indonesian citizens still do not understand all the information regarding the determination of the new capital city. Even though the Indonesian Government has created an official website for the new capital city (www.ikn.go.id), the information is still not optimal because web page visitors cannot interact actively with the required information. Therefore, the development of a chatbot application is deemed necessary as an interactive component for obtaining the information users need about the new capital city. In this study, a chatbot application was developed by applying Natural Language Processing (NLP), using the Term Frequency-Inverse Document Frequency (TF-IDF) method for term weighting and the cosine-similarity algorithm to calculate the similarity of the questions asked by the user. The research successfully designed and developed a chatbot application using the cosine-similarity algorithm. The testing phase of the chatbot model uses several scenarios related to the points of NLP implementation. The test results show that all question scenarios are answered well by the chatbot.
Authored by Harry Achsan, Deni Kurniawan, Diki Purnama, Quintin Barcah, Yuri Astoria
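The TF-IDF weighting and cosine-similarity matching described above can be sketched with a few lines of pure Python. This is an illustrative sketch, not the authors' code; the stored question set and the smoothed IDF formula are assumptions made for the example.

```python
# Sketch: TF-IDF vectors plus cosine similarity for retrieval-style
# question matching, the core of the chatbot pipeline described above.

import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per document (smoothed IDF)."""
    tokenised = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenised for t in set(toks))
    n = len(docs)
    vecs = []
    for toks in tokenised:
        tf = Counter(toks)
        vecs.append({t: (tf[t] / len(toks)) * math.log((1 + n) / (1 + df[t]))
                     for t in tf})
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(question, stored_questions):
    """Index of the stored question most similar to the user's question."""
    vecs = tfidf_vectors(stored_questions + [question])
    q_vec, doc_vecs = vecs[-1], vecs[:-1]
    scores = [cosine(q_vec, d) for d in doc_vecs]
    return scores.index(max(scores))
```

The chatbot would then return the canned answer associated with the best-matching stored question.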
Natural Language Processing - In today’s digital era, online attacks are increasing in number and severity, especially those related to web applications. The data accessible over the web entices attackers to launch new kinds of attacks. Serious research on web security has shown that the most hazardous attack affecting web security is Structured Query Language Injection (SQLI). This attack poses a genuine threat to web application security, and several research works have been directed at defending against it by detecting it when it happens. Traditional methods like input validation and filtering or the use of parameterized queries are not sufficient to counter these attacks, as they rely solely on the implementation of the code and hence on the developer’s skill set, which in turn gave rise to Machine Learning based solutions. In this study, we propose a novel approach that uses Natural Language Processing (NLP) and BERT for feature extraction, adapts to SQLI variants, and provides an accuracy of 97\% with a false positive rate of 0.8\% and a false negative rate of 5.8\%.
Authored by Sagar Lakhani, Ashok Yadav, Vrijendra Singh
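To make the detection idea concrete, here is an illustrative sketch of SQLI payload screening. The paper above uses BERT embeddings as features; this sketch substitutes a hand-crafted regex feature extractor (patterns invented for the example) so the classification step is runnable, and it is not the authors' method.

```python
# Sketch: turning a candidate input string into binary SQLI features and
# flagging it when any pattern fires. A learned classifier would consume
# these (or BERT) features instead of the simple threshold used here.

import re

SQLI_PATTERNS = [
    r"(?i)\bunion\b.*\bselect\b",                      # UNION-based injection
    r"(?i)\bor\b\s+['\"]?\d+['\"]?\s*=\s*['\"]?\d+",   # tautology: OR 1=1
    r"--|#|/\*",                                       # SQL comment markers
    r"(?i)\bsleep\s*\(",                               # time-based probing
]

def sqli_features(payload: str):
    """Binary feature vector: one entry per suspicious pattern."""
    return [1 if re.search(p, payload) else 0 for p in SQLI_PATTERNS]

def is_suspicious(payload: str) -> bool:
    """Flag the payload if at least one pattern matched."""
    return sum(sqli_features(payload)) >= 1
```

Rule lists like this are exactly what fail on novel SQLI variants, which is the motivation the abstract gives for moving to learned BERT features.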
Natural Language Processing - In today’s digital age, businesses create tremendous amounts of data as part of their regular operations. On legacy or cloud platforms, this data is stored mainly in structured, semi-structured, and unstructured formats, and most of the data kept in the cloud is amorphous, containing sensitive information. With the evolution of AI, organizations are using deep learning and natural language processing to extract the meaning of this big data through unstructured data analysis and insights (UDAI). This study aims to investigate the influence of these unstructured big data analyses and insights on the organization’s decision-making system (DMS), financial sustainability, customer lifetime value (CLV), and long-term growth prospects, while encouraging a culture of self-service analytics. This study uses a validated survey instrument to collect responses from Fortune 500 organizations to find the adaptability and influence of UDAI in current data-driven decision making and how it impacts organizational DMS, financial sustainability, and CLV.
Authored by Bibhu Dash, Swati Swayamsiddha, Azad Ali
Natural Language Processing - Natural language processing (NLP) trains computers to read and understand text and spoken words in the same way that people do. Within NLP, Named Entity Recognition (NER) is a crucial field. It extracts information from given texts and is used in machine translation, text-to-speech synthesis, natural language understanding, etc. Its main goal is to categorize words in a text that represent names into specified tags such as location, organization, person name, date, time, and measures. In this paper, the proposed method extracts entities from a Hindi Fraud Call annotated corpus (not publicly available) using XLM-RoBERTa (base-sized model). To build an accurate NER system for the dataset, the authors use XLM-RoBERTa as a multi-layer bidirectional transformer encoder to learn deep bidirectional Hindi word representations. The model has been fine-tuned to extract nine entities from sentences based on sentence context to achieve better performance. The annotated Hindi corpus uses a tag set of nine Named Entity (NE) classes, defined as part of the NER Shared Task for South and Southeast Asian Languages (SSEAL) at IJCNLP. Nine entities have been recognized from sentences. The obtained F1-score (micro) and F1-score (macro) are 0.96 and 0.80, respectively.
Authored by Aditya Choure, Rahul Adhao, Vinod Pachghare
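The abstract reports both micro- and macro-averaged F1, and the gap between them (0.96 vs. 0.80) is meaningful: micro-F1 pools all decisions, while macro-F1 averages per-class F1 so rare entity classes weigh equally. The sketch below shows the two computations on toy counts invented for the example (not the paper's data).

```python
# Sketch: micro vs. macro F1 over per-class (TP, FP, FN) counts,
# illustrating why a model strong on frequent classes but weak on rare
# ones shows micro-F1 well above macro-F1.

def f1(tp, fp, fn):
    """Standard F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def micro_macro_f1(per_class):
    """per_class: {label: (tp, fp, fn)} -> (micro_f1, macro_f1)."""
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    micro = f1(tp, fp, fn)                                   # pool all counts
    macro = sum(f1(*c) for c in per_class.values()) / len(per_class)
    return micro, macro
```

With one frequent, well-predicted class and one rare, poorly-predicted class, micro-F1 exceeds macro-F1, mirroring the 0.96/0.80 pattern in the abstract.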
Natural Language Processing - The Internet of Things (IoT) is mainly considered a key technology that enables connecting many devices through the internet; it supports the overall exchange of data and information, the receiving of instructions, and acting upon them effectively. With the advent of IoT, many devices are connected to the internet, which assists individuals in operating devices virtually, sharing data, and programming required actions. This study is focused on understanding the key determinants of creating smart homes by applying natural language processing (NLP) through IoT. The major determinants considered are integrating voice understanding into devices, the ability to control devices remotely, and support in reducing energy bills.
Authored by Shahanawaj Ahamad, Deepalkumar Shah, R. Udhayakumar, T.S. Rajeswari, Pankaj Khatiwada, Joel Alanya-Beltran
Natural Language Processing - This paper presents a system that identifies social engineering attacks using only text as input. The system can be used in different environments in which the input is text, such as SMS, chats, emails, etc. It uses Natural Language Processing to extract features from the dialog text, such as URL report and count, spell check, blacklist count, and others. The features are used to train Machine Learning algorithms (Neural Network, Random Forest, and SVM) to classify social engineering attacks. The classification algorithms achieved an accuracy above 80\% in detecting this type of attack.
Authored by Juan Lopez, Jorge Camargo
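The feature-extraction step described above can be sketched as a small function over raw message text. This is an illustrative sketch, not the authors' feature set: the blacklist words and the exact features chosen here are assumptions made for the example.

```python
# Sketch: turning a text message into a feature dict (URL count, blacklist
# hits, punctuation cues) that a classifier such as Random Forest or SVM
# would consume.

import re

# Assumed blacklist of social-engineering trigger words (illustrative).
BLACKLIST = {"verify", "urgent", "suspended", "password", "winner"}

URL_RE = re.compile(r"https?://\S+")

def extract_features(text: str) -> dict:
    """Extract simple numeric features from a candidate message."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "url_count": len(URL_RE.findall(text)),
        "blacklist_count": sum(w in BLACKLIST for w in words),
        "exclamations": text.count("!"),
        "length": len(words),
    }
```

Each message becomes one feature row; stacking the rows for a labelled corpus gives the training matrix for the classifiers named in the abstract.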
Information Reuse and Security - Common Vulnerabilities and Exposures (CVE) databases contain information about vulnerabilities of software products and source code. If individual elements of CVE descriptions can be extracted and structured, then the data can be used to search and analyze CVE descriptions. Herein, we propose a method to label each element in CVE descriptions by applying Named Entity Recognition (NER). For NER, we used BERT, a transformer-based natural language processing model. NER with machine learning can label information from CVE descriptions even if there are some distortions in the data. An experiment involving manually prepared label information for 1000 CVE descriptions shows that the labeling accuracy of the proposed method is about 0.81 for precision and about 0.89 for recall. In addition, we devise a way to train on the data by dividing it by label. Our proposed method can be used to label each element of CVE descriptions automatically.
Authored by Kensuke Sumoto, Kenta Kanakogi, Hironori Washizaki, Naohiko Tsuda, Nobukazu Yoshioka, Yoshiaki Fukazawa, Hideyuki Kanuka
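The target output of the labeling task above, token-level tags over a CVE description, can be illustrated with a deliberately naive rule-based tagger. This is only a sketch of the label structure; the paper fine-tunes BERT rather than using rules, and the VERSION/PRODUCT rules below are invented for the example.

```python
# Sketch: emitting (token, label) pairs for a CVE description, the data
# shape a BERT-based NER model would be trained to produce.

import re

def label_cve(description: str):
    """Naive tagger: dotted numbers -> VERSION, capitalised words ->
    PRODUCT, everything else -> O (outside)."""
    labels = []
    for tok in description.split():
        if re.fullmatch(r"\d+(\.\d+)+,?", tok):
            labels.append((tok, "VERSION"))
        elif tok[:1].isupper() and tok.isalpha():
            labels.append((tok, "PRODUCT"))
        else:
            labels.append((tok, "O"))
    return labels
```

The rules obviously over-trigger (any capitalised word becomes PRODUCT), which is exactly the kind of brittleness that motivates the learned NER model in the paper.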
Machine Learning - Sentiment Analysis (SA) is an approach for detecting subjective information such as thoughts, outlooks, reactions, and emotional state. The majority of previous SA work treats it as a text-classification problem that requires labelled input to train the model. However, obtaining a tagged dataset is difficult and must mostly be done by hand. Another concern is that the absence of sufficient cross-domain portability makes it challenging to reuse the same labelled data across applications; as a result, data must be classified manually for each domain. This research work applies sentiment analysis to evaluate an entire vaccine Twitter dataset. The work involves lexicon analysis using NLP libraries such as NeatText and TextBlob, and multi-class classification using BERT. This work evaluates and compares the results of the machine learning algorithms.
Authored by Amarjeet Rawat, Himani Maheshwari, Manisha Khanduja, Rajiv Kumar, Minakshi Memoria, Sanjeev Kumar
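The lexicon-analysis step mentioned above can be sketched in a few lines. This is an illustrative stand-in written for this survey, not TextBlob itself: the tiny lexicon, negator set, and averaging scheme are assumptions made for the example.

```python
# Sketch: lexicon-based polarity scoring in the spirit of the TextBlob
# step described above. Score each known word, flip its sign after a
# negator, and average over the matched words.

LEXICON = {"good": 1, "great": 2, "safe": 1,
           "bad": -1, "awful": -2, "scared": -1}
NEGATORS = {"not", "no", "never"}

def polarity(text: str) -> float:
    """Average signed score of lexicon words; 0.0 if none match."""
    tokens = text.lower().split()
    score, matched = 0, 0
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            sign = -1 if i > 0 and tokens[i - 1] in NEGATORS else 1
            score += sign * LEXICON[tok]
            matched += 1
    return score / matched if matched else 0.0
```

Lexicon scores like this give a cheap unsupervised baseline; the abstract's BERT classifier is the supervised alternative it is compared against.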
XAI with natural language processing aims to produce human-readable explanations as evidence for AI decision-making, which addresses explainability and transparency. However, from an HCI perspective, the current approaches only focus on delivering a single explanation, which fails to account for the diversity of human thoughts and experiences in language. This paper thus addresses this gap, by proposing a generative XAI framework, INTERACTION (explain aNd predicT thEn queRy with contextuAl CondiTional varIational autO-eNcoder). Our novel framework presents explanation in two steps: (step one) Explanation and Label Prediction; and (step two) Diverse Evidence Generation. We conduct intensive experiments with the Transformer architecture on a benchmark dataset, e-SNLI [1]. Our method achieves competitive or better performance against state-of-the-art baseline models on explanation generation (up to 4.7% gain in BLEU) and prediction (up to 4.4% gain in accuracy) in step one; it can also generate multiple diverse explanations in step two.
Authored by Jialin Yu, Alexandra Cristea, Anoushka Harit, Zhongtian Sun, Olanrewaju Aduragba, Lei Shi, Noura Moubayed
For a long time, SQL injection has been considered one of the most serious security threats. NoSQL databases are becoming increasingly popular as big data and cloud computing technologies progress, and NoSQL injection attacks are designed to take advantage of applications that employ NoSQL databases. NoSQL injections can be particularly harmful because they allow unrestricted code execution. In this paper, we use supervised learning and natural language processing to construct a model to detect NoSQL injections. Our model is designed to work with MongoDB, CouchDB, CassandraDB, and Couchbase queries and has achieved an F1 score of 0.95 as established by 10-fold cross-validation.
Authored by Sivakami Praveen, Alysha Dcouth, A Mahesh
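The kind of payload this work targets can be illustrated with a small rule-based screen. This is an illustrative sketch, not the authors' supervised model: the operator list below covers common MongoDB-style injection vectors and is assumed for the example.

```python
# Sketch: flagging MongoDB-style query documents that smuggle in query
# operators or JavaScript, the pattern NoSQL injection payloads typically
# use (e.g. sending {"$ne": ""} instead of a password string).

SUSPICIOUS_KEYS = {"$where", "$ne", "$gt", "$regex", "$function"}

def flags(query) -> list:
    """Recursively collect suspicious operator keys from a query value."""
    found = []
    if isinstance(query, dict):
        for key, value in query.items():
            if key in SUSPICIOUS_KEYS:
                found.append(key)
            found.extend(flags(value))
    elif isinstance(query, list):
        for item in query:
            found.extend(flags(item))
    return found

def is_injection(query) -> bool:
    """True if any suspicious operator appears anywhere in the query."""
    return bool(flags(query))
```

A learned model generalises past a fixed operator list, which is why the paper trains a classifier instead of relying on rules like these.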
In the present work, we present a thorough study developed on a digital library, called the HAT corpus, for the purpose of authorship attribution. A dataset of 300 documents written by 100 different authors was extracted from the web digital library and processed for author style analysis. All the documents are related to the travel topic and written in Arabic. Three important rules in stylometry should be respected: a minimum document size, the same topic for all documents, and the same genre. In this work, we made a particular effort to respect those conditions during corpus preparation. Three lexical features, fixed-length words, rare words, and suffixes, are used and evaluated using a centroid-based Manhattan distance. The identification approach shows interesting results, with an accuracy of about 0.94.
Authored by S. Ouamour, H. Sayoud
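The centroid-based Manhattan-distance matching described above can be sketched generically. This is an illustrative sketch, not the paper's system: it uses word-length profiles as a stand-in for the fixed-length-word, rare-word, and suffix features, which would each produce an analogous frequency vector.

```python
# Sketch: build a per-author centroid feature vector from training
# documents, then attribute a new document to the author whose centroid
# lies at the smallest Manhattan (L1) distance.

from collections import Counter

def length_profile(text: str, max_len: int = 10):
    """Relative frequency of word lengths 1..max_len (longer clipped)."""
    lengths = Counter(min(len(w), max_len) for w in text.split())
    total = sum(lengths.values())
    return [lengths[i] / total for i in range(1, max_len + 1)]

def centroid(profiles):
    """Component-wise mean of equal-length feature vectors."""
    return [sum(col) / len(profiles) for col in zip(*profiles)]

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def attribute(doc, author_docs):
    """Return the author whose centroid profile is nearest to `doc`."""
    cents = {author: centroid([length_profile(d) for d in docs])
             for author, docs in author_docs.items()}
    probe = length_profile(doc)
    return min(cents, key=lambda a: manhattan(probe, cents[a]))
```

Swapping `length_profile` for a suffix-frequency or rare-word-frequency extractor reproduces the paper's three feature variants over the same centroid machinery.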
The range of text analysis methods in the field of natural language processing (NLP) has become more and more extensive thanks to the increasing computational resources of the 21st century. As a result, many deep learning-based solutions have been proposed for authorship attribution, as they offer more flexibility and automated feature extraction compared to traditional statistical methods. A number of solutions have appeared for the attribution of English texts; however, the number of methods designed for the Hungarian language is extremely small. Hungarian is a morphologically rich language, sentence formation is flexible, and the alphabet differs from that of other languages. Furthermore, a language-specific POS tagger, pretrained word embeddings, a dependency parser, etc. are required. As a result, methods designed for other languages cannot be directly applied to Hungarian texts. In this paper, we review deep learning-based authorship attribution methods for English texts and offer techniques for adapting these solutions to the Hungarian language. As part of the paper, we collected a new dataset consisting of Hungarian literary works of 15 authors. In addition, we extensively evaluate the implemented methods on the new dataset.
Authored by Laura Oldal, Gábor Kertész