C3E 2014 Challenge Problem: Metadata-based Malicious Cyber Discovery
Metadata is often described simply as ‘data about data’; in cyberspace it could be defined even more simply as all network communications data other than the communications content, or payload. Encryption of network communications is becoming increasingly prevalent, including among those who carry out malicious activities in cyberspace. Encryption effectively makes content or payload unavailable to traditional signature-based detection methods, which makes metadata-based modeling and analysis approaches increasingly important. Even when content or payload is available, the analysis involved in extracting detectable features can be very time consuming. Unfortunately, the Internet of today has become a vast ocean of anomalous activities, and discovering the interesting anomalous behaviors that represent malicious activities has become an increasingly difficult challenge.
At an abstract level, an anomaly is defined as a pattern that does not conform to expected normal behavior. The complexities of cyberspace, however, make this apparently simple definition hard to apply, for the following reasons:
- Defining normal behavior is very difficult when the differences between normal and anomalous behavior are very subtle.
- Malicious adversaries work hard to make their activity look like normal behavior.
- Normal behavior keeps evolving and what we think of as normal today might be very different tomorrow.
- Availability of labeled data for the research community to support training and validation of models is a constant issue.
- Network data is very noisy and much of the noise is generated from known malicious activities that are detectable through traditional mechanisms.
The last point listed above suggests a need not only for metadata-based methods for detecting malicious behavior, but also for methods for prioritizing the behaviors detected. As part of one of its themes, exploring Big Data models to support cybersecurity, C3E proposes a Discovery Problem.
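As a purely illustrative sketch of what detecting and prioritizing from metadata alone can mean, the Python below scores hosts by how far one metadata feature (the number of distinct destination ports contacted) lies from the population median, then ranks the outliers. The record layout, the sample data, the feature, and the threshold of 3.5 are all assumptions made for illustration; none of them comes from C3E or from any PREDICT dataset, and this is exactly the kind of conventional anomaly technique the challenge hopes to move beyond.

```python
from collections import defaultdict
from statistics import median

# Hypothetical flow-metadata records: (source host, destination port, bytes sent).
# The field layout is an assumption for illustration, not a PREDICT dataset schema.
FLOWS = [
    ("10.0.0.1", 443, 1200), ("10.0.0.1", 443, 1100), ("10.0.0.1", 80, 900),
    ("10.0.0.2", 443, 1000), ("10.0.0.2", 53, 200),
    ("10.0.0.3", 443, 1300), ("10.0.0.3", 80, 800), ("10.0.0.3", 53, 150),
    ("10.0.0.4", 443, 950), ("10.0.0.4", 80, 1050),
] + [("10.0.0.9", port, 60) for port in range(1000, 1040)]  # scan-like host


def host_features(flows):
    """Summarize each source host by its number of distinct destination ports."""
    ports = defaultdict(set)
    for src, dst_port, _nbytes in flows:
        ports[src].add(dst_port)
    return {src: len(p) for src, p in ports.items()}


def robust_scores(features):
    """Score each host by its distance from the median, scaled by the MAD."""
    values = list(features.values())
    med = median(values)
    # Median absolute deviation; fall back to 1.0 to avoid division by zero.
    mad = median(abs(v - med) for v in values) or 1.0
    return {host: abs(v - med) / mad for host, v in features.items()}


def prioritize(flows, threshold=3.5):
    """Return (host, score) pairs above the threshold, highest score first."""
    scores = robust_scores(host_features(flows))
    flagged = [(h, s) for h, s in scores.items() if s > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for host, score in prioritize(FLOWS):
        print(host, score)
```

The median and median absolute deviation are used instead of the mean and standard deviation so that a single extreme host does not distort the baseline it is measured against. No payload is inspected at any point; only who talked to which port.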
The Discovery Problem
The challenge is to invent and prototype approaches for identifying high-interest, suspicious, and likely malicious behaviors from metadata, using approaches that challenge the way we traditionally think about the cyber problem. At C3E, we value innovation and paradigm-shifting approaches above incremental improvements to existing anomaly techniques.
The Big Data Sets
In support of this research challenge, C3E will facilitate access to the Protected Repository for the Defense of Infrastructure against Cyber Threats (PREDICT), a data repository for cyber security research. PREDICT is supported by the Department of Homeland Security, Science & Technology Directorate. PREDICT technical advisors have suggested at least three data sets that could be used to demonstrate new and innovative research approaches to this discovery problem. Researchers can also browse the catalog and select any other data sets that are appropriate for their approach to the problem. The datasets suggested for initial consideration are as follows:
- GT Malware Passive DNS Data
- Darknet datasets (Internet pollution traffic)
- Anonymized netflow
There are more than 400 datasets within the PREDICT repository. To participate in this cyber discovery problem, the government sponsors strongly encourage researchers to use the PREDICT datasets.
Researchers can access the DHS PREDICT repository via https://www.predict.org. The C3E planning team is available to assist in getting researchers registered with the system and to help with any questions.
Sample Malware Traffic Patterns
To bound the problem to research of potential interest to typical cybersecurity practitioners, the open literature details several kinds of threats of interest that might be detectable through metadata approaches, such as the following:
An excellent resource listing example Advanced Persistent Threat (APT) activity is the website http://www.deependresearch.org, which provides a library of malware traffic patterns. See also the spreadsheet located at: https://docs.google.com/spreadsheet/ccc?key=0AjvsQV3iSLa1dDFfWHduQlA5THBRd081eFhsZThwUlE#gid=1
The “Links” worksheet (tab) of this document provides a lot of useful links to reference material associated with most of the malware listed. The URL to the links worksheet is below. Another source of information especially germane to the problem is at the URL:
Researchers are invited to submit a one-page abstract describing their approaches and results by October 1, 2014, ahead of the workshop. The C3E planning team will use these submissions to choose researchers to participate in the Discovery Problem Panel Discussion.
The workshop agenda includes a poster session on the Discovery Problem at the end of the first day’s activities.
In addition to the Panel Discussion and poster presentations, a group of cybersecurity research experts will award “special recognition” to significant approaches that use metadata to discover the malware highlighted in the sample patterns.
In connection with the C3E Workshop, there is the potential for limited funding of interesting approaches to the Metadata-based Malicious Cyber Discovery Problem. This possible financial support will come through a formal proposal process that solicits innovative approaches to examine hard problems. Researchers interested in these potential follow-up efforts should contact the C3E support team about registering for the future formal application notices.
For more information about the Metadata-based Malicious Cyber Discovery Problem, contact Chip Willard at gnwilla[at]nsa.gov or Dan Wolf at dwolf[at]CyberPackVentures.com. Additionally, more information will be posted on the website as it becomes available: http://cps-vo.org/group/c3e.