Darknet Anomaly Detection in Internet Access Using Machine Learning: A Big Data Analytics Examination

ABSTRACT

The Darknet is a separate part of the Deep Web, where users browse anonymously over encrypted connections. It is therefore most often associated with criminal activity, as law enforcement finds it difficult to track users. Because of this, the detection of Darknet signals through their most common and revealing features has become an important area of research. Although previous studies have applied machine learning to Darknet signal detection, to our knowledge few have focused on creating a multi-stage machine learning methodology with a pipeline. The aim of this study is to use five different machine learning models - Decision Tree, Logistic Regression, Random Forest, Gradient Boosted Tree and XGBoost - to classify the type of data being transferred, picking the best-performing model for each data type. A pipeline has been created for the second stage - a binary classification of connections as "anomalous" (Tor or VPN) or "non-anomalous" (Non-Tor or Non-VPN) in the Darknet Dataset from the Canadian Institute for Cybersecurity at the University of New Brunswick. The results showed that the most important features for predicting an anomalous connection varied by data type. The XGBoost model performed best at multiclass classification of data types, and best at binary classification of Darknet connections for all but one data type. The overall model achieved an accuracy of 98.31%, a precision of 98.34%, a recall of 98.31%, and an F1-score of 98.32%. The model's performance suggests that this framework can be used to create an alert system for cyber-attacks from the Darknet by analyzing connections in real time.
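
The two-stage structure described in the abstract can be illustrated with a minimal sketch. This is not the study's implementation: the class name TwoStagePipeline, the data-type labels "streaming" and "chat", the feature layout, and the threshold rules are all hypothetical stand-ins, and any fitted model (for example an XGBoost classifier) would need to be wrapped as a function from a feature row to a label. The sketch only shows the routing idea: a first-stage model predicts the data type, and that prediction selects which second-stage binary model judges the connection.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Row = Sequence[float]
# Hypothetical interface: any fitted model wrapped as "feature row -> label".
Model = Callable[[Row], str]


@dataclass
class TwoStagePipeline:
    type_model: Model                 # stage 1: multiclass data-type classifier
    anomaly_models: Dict[str, Model]  # stage 2: one binary classifier per data type

    def predict(self, rows: List[Row]) -> List[Tuple[str, str]]:
        results = []
        for row in rows:
            data_type = self.type_model(row)               # stage 1 prediction
            verdict = self.anomaly_models[data_type](row)  # route to matching stage 2 model
            results.append((data_type, verdict))
        return results


# Toy stand-ins for fitted models; labels, features and thresholds are invented.
def toy_type_model(row: Row) -> str:
    return "streaming" if row[0] > 100.0 else "chat"


def toy_streaming_detector(row: Row) -> str:
    return "anomalous" if row[1] > 0.5 else "non-anomalous"


def toy_chat_detector(row: Row) -> str:
    return "anomalous" if row[1] > 0.8 else "non-anomalous"


pipeline = TwoStagePipeline(
    type_model=toy_type_model,
    anomaly_models={"streaming": toy_streaming_detector, "chat": toy_chat_detector},
)

predictions = pipeline.predict([(250.0, 0.7), (20.0, 0.7), (20.0, 0.9)])
for data_type, verdict in predictions:
    print(data_type, verdict)
```

Keeping one second-stage model per data type mirrors the abstract's observation that the most important features, and in one case the best-performing model, differ by data type.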

Mason Turner is a Business Intelligence Analyst at the University of Michigan-Dearborn and a graduate student in Data Science at the University of Michigan-Flint. With 10 years of experience in computer science and data analytics, he has led research and analysis projects across both the private sector and the University of Michigan system. His proudest achievement is developing a machine learning model to detect academic risk using student data from UM-Flint’s Canvas LMS, work he recently presented at the 2024 MI/AIR Conference and the 2025 FIE Conference in Nashville.
Submitted by Katie Dey on