The range of text analysis methods in the field of natural language processing (NLP) has broadened considerably thanks to the increasing computational resources of the 21st century. As a result, many deep learning-based solutions have been proposed for authorship attribution, as they offer more flexibility and automated feature extraction compared to traditional statistical methods. A number of solutions have appeared for the attribution of English texts; however, the number of methods designed for the Hungarian language is extremely small. Hungarian is a morphologically rich language with flexible sentence formation, and its alphabet differs from that of other languages. Furthermore, a language-specific POS tagger, pretrained word embeddings, a dependency parser, and similar resources are required. As a result, methods designed for other languages cannot be directly applied to Hungarian texts. In this paper, we review deep learning-based authorship attribution methods for English texts and offer techniques for adapting these solutions to the Hungarian language. As part of the paper, we collected a new dataset consisting of Hungarian literary works by 15 authors. In addition, we extensively evaluate the implemented methods on the new dataset.
Authored by Laura Oldal, Gábor Kertész
Topic modeling algorithms from the natural language processing (NLP) discipline have been used for various applications; for instance, topic modeling has been applied to product recommendation in e-commerce systems. In this paper, we briefly reviewed topic modeling applications and then described our proposed idea of utilizing topic modeling approaches for cyber threat intelligence (CTI) applications. We improved on previous work by implementing the BERTopic and Top2Vec approaches, enabling users to select their preferred pre-trained text/sentence embedding model, and supporting various languages. We implemented our proposed idea as the new topic modeling module for the Open Web Application Security Project (OWASP) Maryam: Open-Source Intelligence (OSINT) framework. We also described our experimental results using a leaked hacker forum dataset (nulled.io) to attract more researchers and open-source communities to participate in the Maryam project of the OWASP Foundation.
Authored by Hatma Suryotrisongko, Hari Ginardi, Henning Ciptaningtyas, Saeed Dehqan, Yasuo Musashi
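As an illustration of the approach described in the abstract above, the following sketch runs BERTopic over forum posts with a user-selected sentence-embedding model. It is a minimal, hypothetical example, not the Maryam module's actual code: the file name, column name, embedding checkpoint, and `min_topic_size` value are assumptions.

```python
# Minimal, hypothetical sketch: topic modeling over forum posts with BERTopic
# and a user-selected sentence-embedding model.  File/column names are assumed.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
import pandas as pd

# Hypothetical CSV export of forum posts (the paper uses the nulled.io dump).
docs = pd.read_csv("forum_posts.csv")["post_text"].dropna().tolist()

# The embedding model is a user choice; a multilingual checkpoint supports
# non-English forums, as the abstract mentions.
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

topic_model = BERTopic(embedding_model=embedder, min_topic_size=20)
topics, probs = topic_model.fit_transform(docs)

# Inspect discovered topics, e.g. to look for CTI-relevant clusters
# (credential dumps, exploits, malware trade, ...).
print(topic_model.get_topic_info().head(10))
print(topic_model.get_topic(0))   # top keywords of the largest non-outlier topic
```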
Recent advances in deep learning, particularly the introduction of transformer-based models, have shown massive improvement and success in many Natural Language Processing (NLP) tasks. One area that has benefited immensely is conversational agents, or chatbots, in open-ended (chit-chat) and task-specific (such as medical or legal dialogue bots) domains. However, in the era of automation, there is still a dearth of work focused on one of the most relevant use cases, i.e., tutoring dialog systems that can help students learn new subjects or topics of their interest. Most previous works in this domain are either rule-based systems that require a lot of manual effort or are based on multiple-choice factual questions. In this paper, we propose EDICA (Educational Domain Infused Conversational Agent), a language tutoring Virtual Agent (VA). EDICA employs two mechanisms in order to converse fluently with a student/user over a question and assist them in learning a language: (i) a Student/Tutor Intent Classification (SIC-TIC) framework to identify the intent of the student and decide the action of the VA, respectively, in the ongoing conversation, and (ii) a Tutor Response Generation (TRG) framework to generate domain-infused and intent/action-conditioned tutor responses at every step of the conversation. The VA is able to provide hints, ask questions, and correct the student's reply by generating an appropriate, informative, and relevant tutor response. We establish the superiority of our proposed approach on various evaluation metrics over other baselines and state-of-the-art models.
Authored by Raghav Jain, Tulika Saha, Souhitya Chakraborty, Sriparna Saha
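To make the intent-to-action flow described above concrete, the sketch below classifies a student utterance and maps it to a tutor action that conditions the next response. It is not the authors' SIC-TIC implementation: the intent labels, action names, and the use of an off-the-shelf zero-shot classifier are all illustrative assumptions.

```python
# Illustrative only (not the authors' SIC-TIC framework): classify a student
# utterance into a hypothetical intent set and map it to a tutor action.
from transformers import pipeline

INTENTS = ["asks for a hint", "attempts an answer",
           "asks for clarification", "goes off topic"]          # assumed labels
ACTIONS = {
    "asks for a hint": "give_hint",
    "attempts an answer": "evaluate_and_correct",
    "asks for clarification": "rephrase_question",
    "goes off topic": "redirect_to_task",
}

# Zero-shot classification stands in for a fine-tuned intent classifier.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def next_tutor_action(student_utterance: str) -> str:
    result = classifier(student_utterance, candidate_labels=INTENTS)
    intent = result["labels"][0]        # highest-scoring intent
    return ACTIONS[intent]              # action that conditions the tutor reply

print(next_tutor_action("Can you give me a clue for this word?"))
```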
Due to the migration megatrend, efficient and effective second-language acquisition is vital. One proposed solution involves AI-enabled conversational agents for person-centered interactive language practice. We present results from ongoing action research targeting quality assurance of proprietary generative dialog models trained for virtual job interviews. The action team elicited a set of 38 requirements, and we designed corresponding automated test cases for 15 of particular interest to the evolving solution. Our results show that six of the test case designs can detect meaningful differences between candidate models. While quality assurance of natural language processing applications is complex, we provide initial steps toward an automated framework for machine learning model selection in the context of an evolving conversational agent. Future work will focus on model selection in an MLOps setting.
Authored by Markus Borg, Johan Bengtsson, Harald Österling, Alexander Hagelborn, Isabella Gagner, Piotr Tomaszewski
Over the past two decades, several forms of non-intrusive technology have been deployed in cooperation with medical specialists in order to aid patients diagnosed with some form of mental, cognitive or psychological condition. Along with the availability and accessibility of applications offered by mobile devices, as well as the advancements in the fields of Artificial Intelligence and Natural Language Processing, Conversational Agents have been developed with the objective of aiding medical specialists in detecting those conditions in their early stages and monitoring their symptoms and effects on the cognitive state of the patient, as well as supporting the patient in their effort to mitigate those symptoms. Coupled with the recent advances in the scientific field of machine and deep learning, we aim to explore the degree of applicability of such technologies in cognitive health support Conversational Agents, and their impact on the acceptability of such applications by their end users. Therefore, we conduct a systematic literature review, following a transparent and thorough process, in order to search and analyze the bibliography of the past five years, focused on the implementation of Conversational Agents supported by Artificial Intelligence technologies and in service of patients diagnosed with Mild Cognitive Impairment and its variants.
Authored by Ioannis Kostis, Konstantinos Karamitsios, Konstantinos Kotrotsios, Magda Tsolaki, Anthoula Tsolaki
Conversational Intelligent Tutoring Systems (CITS) in learning environments are capable of providing personalized instruction to students in different domains to improve the learning process. This interaction between the Intelligent Tutoring System (ITS) and the user is carried out through dialogues in natural language. In this study, we use an open-source framework called Rasa to adapt the original button-based user interface of an algebraic/arithmetic word problem-solving ITS to one based primarily on the use of natural language. We conducted an empirical study showing that, once properly trained, our conversational agent was able to recognize the intent related to the content of the student's message with an average accuracy above 0.95.
Authored by Romina De Luise, Pablo Arnau-González, Miguel Arevalillo-Herráez
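As an illustration of what such an adaptation involves, the sketch below writes a small Rasa 3.x NLU training file whose intents replace the original button presses. The intent names and example utterances are hypothetical, not taken from the study.

```python
# Hypothetical sketch: Rasa 3.x NLU training data replacing the ITS's buttons
# with free-text intents.  Intent names and examples are illustrative.
from pathlib import Path

NLU_DATA = """\
version: "3.1"
nlu:
- intent: provide_answer
  examples: |
    - the result is 42
    - I think x equals 7
    - it should be 15 apples
- intent: ask_for_hint
  examples: |
    - can you give me a hint
    - I don't know what to do next
    - help me with this step
- intent: ask_explanation
  examples: |
    - why is that the next operation
    - I don't understand the last step
"""

Path("data").mkdir(exist_ok=True)
Path("data/nlu.yml").write_text(NLU_DATA, encoding="utf-8")

# Training and cross-validated evaluation then run through the Rasa CLI:
#   rasa train nlu
#   rasa test nlu --nlu data/nlu.yml --cross-validation
```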
Currently, the Dark Web is one key platform for the online trading of illegal products and services. Analysing the .onion sites hosting marketplaces is of interest for law enforcement and security researchers. This paper presents a study of 123k listings obtained from 6 different Dark Web markets. While most current works leverage existing datasets, these are outdated and might not contain new products, e.g., those related to the 2020 COVID pandemic. Thus, we build a custom focused crawler to collect the data. Being able to conduct analyses on current data is of considerable importance, as these marketplaces continue to change and grow, both in terms of products offered and users. Also, several anti-crawling mechanisms are being improved, making this task more difficult and, consequently, reducing the amount of data obtained in recent years on these marketplaces. We conduct a data analysis evaluating multiple characteristics regarding the products, sellers, and markets. These characteristics include, among others, the number of sales, existing categories in the markets, and the origin of the products and the sellers. Our study sheds light on the products and services being offered in these markets nowadays. Moreover, we have conducted a case study on one particularly productive and dynamic drug market, i.e., Cannazon. Our initial goal was to understand its evolution over time, analyzing the variation of products in stock and their prices longitudinally. We realized, though, that during the period of study the market suffered a DDoS attack which damaged its reputation and affected users' trust in it, a potential reason that led to the subsequent closure of the market by its operators. Consequently, our study provides insights regarding the last days of operation of such a productive market, and showcases the effectiveness of a potential intervention approach by means of disrupting the service and fostering mistrust.
Authored by Víctor Labrador, Sergio Pastrana
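As a rough illustration of the data-collection step mentioned above, the sketch below fetches a single .onion page through a local Tor SOCKS proxy with the requests library (requests[socks] installed). The onion address is a placeholder, and the snippet omits the focused-crawling logic, session handling, and anti-crawling countermeasures the paper's crawler would need.

```python
# Rough sketch: fetch one .onion listing page through a local Tor SOCKS proxy
# (requires requests[socks] and a running Tor client on port 9050).
# The onion address below is a placeholder, not a real market.
import requests

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",    # socks5h resolves DNS through Tor
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_listing_page(url: str) -> str:
    resp = requests.get(url, proxies=TOR_PROXIES, timeout=60)
    resp.raise_for_status()
    return resp.text

html = fetch_listing_page("http://exampleplaceholderaddress.onion/listings?page=1")
print(len(html), "bytes fetched")
```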
Deep learning models rely on single-word features and location features of text to achieve good results in text relation extraction tasks. However, previous studies have failed to make full use of the semantic information contained in sentence dependency syntax trees, and data sparseness and noise propagation still affect classification models. The BERT (Bidirectional Encoder Representations from Transformers) pretrained language model provides a better representation for natural language processing tasks, and entity enhancement methods have proved effective in relation extraction tasks. Therefore, this paper proposes combining the shortest dependency path with an entity-enhanced BERT pretrained language model to reduce the impact of noise terms on the classification model and obtain a more semantically expressive feature representation. The algorithm is tested on the SemEval-2010 Task 8 English relation extraction dataset, and the F1 score of the final experiment reaches 0.881.
Authored by Zeyu Sun, Chi Zhang
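The sketch below illustrates the two ingredients named in the abstract, computing a shortest dependency path with spaCy and marking entities for a BERT sequence classifier, without claiming to reproduce the paper's model; the marker tokens, label count, and checkpoints are assumptions based on the SemEval-2010 Task 8 convention (9 directed relations plus Other).

```python
# Illustrative sketch (not the paper's exact model): shortest dependency path
# between two entities via spaCy + networkx, and an entity-marked sentence fed
# to a BERT sequence classifier with an assumed 19-way label set.
import networkx as nx
import spacy
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nlp = spacy.load("en_core_web_sm")

def shortest_dependency_path(sentence, head_word, tail_word):
    doc = nlp(sentence)
    graph = nx.Graph([(tok.i, child.i) for tok in doc for child in tok.children])
    start = next(t.i for t in doc if t.text == head_word)
    end = next(t.i for t in doc if t.text == tail_word)
    return [doc[i].text for i in nx.shortest_path(graph, start, end)]

print(shortest_dependency_path(
    "The pollution was caused by the shipwreck.", "pollution", "shipwreck"))

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(["<e1>", "</e1>", "<e2>", "</e2>"])       # entity markers
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=19)
model.resize_token_embeddings(len(tokenizer))

marked = "The <e1>pollution</e1> was caused by the <e2>shipwreck</e2>."
logits = model(**tokenizer(marked, return_tensors="pt")).logits  # untrained head
print(logits.shape)                                              # (1, 19)
```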
The rise of social media has brought the rise of fake news, and this fake news comes with negative consequences. With fake news being such a huge issue, efforts should be made to identify any form of fake news; however, this is not so simple. Manually identifying fake news can be extremely subjective, as determining the accuracy of the information in a story is complex and difficult, even for experts. On the other hand, an automated solution would require a good understanding of NLP, which is also complex and may have difficulty producing accurate output. Therefore, the main problem addressed in this project is the viability of developing a system that can effectively and accurately detect and identify fake news. Finding a solution would be a significant benefit to the media industry, particularly the social media industry, as this is where a large proportion of fake news is published and spread. In order to find a solution to this problem, this project proposed the development of a fake news identification system using deep learning and natural language processing. The system was developed using a Word2vec model combined with a Long Short-Term Memory (LSTM) model in order to showcase the compatibility of the two models in a whole system. The system was trained and tested using two different dataset collections, each consisting of one real news dataset and one fake news dataset. Furthermore, three independent variables were chosen, namely the number of training cycles, data diversity and vector size, to analyze the relationship between these variables and the accuracy of the system. It was found that these three variables had a significant effect on the accuracy of the system. The system was then trained and tested with the optimal variables and was able to achieve the minimum expected accuracy level of 90%. Achieving this accuracy level confirms the compatibility of the LSTM and Word2vec models and their capability to be combined into a single system that is able to identify fake news with a high level of accuracy.
Authored by Anand Matheven, Burra Kumar
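A minimal sketch of such a Word2vec-plus-LSTM pipeline is shown below, with toy data and assumed hyper-parameters; the vector size and number of epochs correspond to the kinds of variables the project varied, but the specific values here are illustrative only.

```python
# Minimal sketch with toy data and assumed hyper-parameters: train Word2vec on
# the corpus, then classify padded token-index sequences with an LSTM whose
# embedding layer is initialised from the Word2vec vectors.
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras import initializers, layers, models

texts = [["celebrity", "endorses", "miracle", "cure"],
         ["government", "publishes", "official", "budget", "report"]]
labels = np.array([1, 0])              # 1 = fake, 0 = real (toy labels)

VECTOR_SIZE, MAX_LEN = 100, 50         # vector size was one of the studied variables
w2v = Word2Vec(sentences=texts, vector_size=VECTOR_SIZE, window=5, min_count=1)

vocab = {word: i + 1 for i, word in enumerate(w2v.wv.index_to_key)}  # 0 = padding
emb_matrix = np.zeros((len(vocab) + 1, VECTOR_SIZE))
for word, idx in vocab.items():
    emb_matrix[idx] = w2v.wv[word]

def pad(indices, maxlen=MAX_LEN):      # right-pad or truncate to a fixed length
    return (indices[:maxlen] + [0] * maxlen)[:maxlen]

X = np.array([pad([vocab[w] for w in doc]) for doc in texts])

model = models.Sequential([
    layers.Embedding(len(vocab) + 1, VECTOR_SIZE, trainable=False,
                     embeddings_initializer=initializers.Constant(emb_matrix)),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=3)         # training cycles were also a studied variable
```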
People connect with a plethora of information from many online portals due to the availability and ease of access to the internet and electronic communication devices. However, news portals sometimes abuse press freedom by manipulating facts. Most of the time, people are unable to discriminate between true and false news. It is difficult to avoid the detrimental impact of Bangla fake news spreading quickly through online channels and influencing people's judgment. In this work, we investigated many real and fake news pieces in Bangla to discover a common pattern for determining whether an article is disseminating incorrect information. We developed a deep learning model that was trained and validated on our selected dataset. The dataset used for learning contains 48,678 legitimate and 1,299 fraudulent news articles. To deal with the imbalanced data, we used random undersampling and then ensembled the resulting models to obtain a combined output. In terms of Bangla text processing, our proposed model achieved an accuracy of 98.29% and a recall of 99%.
Authored by Md. Rahman, Faisal Bin Ashraf, Md. Kabir
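The imbalance-handling strategy described above can be sketched as follows; a linear classifier on TF-IDF features stands in for the paper's deep model, and the number of ensemble members and the voting rule are assumptions.

```python
# Illustrative sketch: several random undersamples of the majority class each
# train one ensemble member; predictions are combined by majority vote.
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the corpus: 0 = legitimate, 1 = fraudulent (imbalanced).
texts = ["sample legitimate news text"] * 90 + ["sample fraudulent news text"] * 10
labels = np.array([0] * 90 + [1] * 10)

X = TfidfVectorizer().fit_transform(texts)

members = []
for seed in range(5):                                  # 5 ensemble members (assumed)
    sampler = RandomUnderSampler(random_state=seed)    # balance the two classes
    X_bal, y_bal = sampler.fit_resample(X, labels)
    members.append(LogisticRegression(max_iter=1000).fit(X_bal, y_bal))

votes = np.stack([m.predict(X) for m in members])      # shape: (5, n_samples)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)       # majority vote
print("ensemble accuracy on the toy data:", (y_pred == labels).mean())
```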
Deep learning has a variety of applications in different fields such as computer vision, automated self-driving cars, natural language processing tasks, and many more. One such adversarial deep learning architecture changed the fundamentals of data manipulation. The inception of the Generative Adversarial Network (GAN) in the computer vision domain drastically changed the way we see and manipulate data. However, this manipulation of data using GANs has found application in various types of malicious activities, such as creating fake images, swapped videos, and forged documents. These generative models have become so efficient at manipulating data, especially image data, that they create real-life problems for people. The manipulation of images and videos by GAN architectures is done in such a way that humans cannot differentiate between real and fake images/videos. Numerous studies have been conducted in the field of deepfake detection. In this paper, we present a structured survey explaining the advantages and gaps of existing work in the domain of deepfake detection.
Authored by Pramod Bide, Varun, Gaurav Patil, Samveg Shah, Sakshi Patil