C3E Idea Detail - Data for Cyber Research and Testing

Submitted by Luanne Burns

Title: Data for Cyber Research and Testing

Problem:

Researchers in the cyber security realm have difficulty obtaining high-quality datasets for testing cyber analytics. Privacy rules make collecting live traffic difficult, leaving researchers without a good source of test data.

Proposal:

The government should embark on a multi-pronged approach to develop high-quality datasets and make them available to the research community for use in cyber security analytics research and testing.

Such a research program could include the following elements:

Data 1 – Cyber Research Data Collection Protocols – The government could develop standard protocols and guidance for collecting cyber data at one's own organization for use in cyber experimentation. Rather than focusing on creating datasets to share among organizations, this approach would develop a common, repeatable process that multiple organizations could use to collect information, offering a basis for research collaboration without data exchange. The protocols could address the security and privacy aspects of collecting data for this use.

Data 2 – Synthetic Cyber Research Data Generation – Embark on a research program whose end goal is to develop high-quality synthetic data for use in cyber security research. The program could begin with the capture of real data, proceed to the development of a model of the data based on analysis and feature extraction, and ultimately generate synthetic data based on the model. The research should address not only the types of data generated today but also the new types of data anticipated in the future.
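The capture, model, and generate steps described under Data 2 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record fields (`protocol`, `bytes`), the sample values, and the choice of a per-protocol Gaussian over byte counts as the model. A real program would extract far richer features than this.

```python
import random
import statistics
from collections import Counter


def fit_model(records):
    """Feature extraction: protocol frequencies plus per-protocol byte statistics."""
    counts = Counter(r["protocol"] for r in records)
    total = sum(counts.values())
    model = {}
    for proto, count in counts.items():
        sizes = [r["bytes"] for r in records if r["protocol"] == proto]
        model[proto] = {
            "weight": count / total,             # how often this protocol appears
            "mean": statistics.mean(sizes),      # typical transfer size
            "stdev": statistics.pstdev(sizes),   # spread of transfer sizes
        }
    return model


def generate(model, n, seed=0):
    """Draw n synthetic records from the fitted model (no real record is copied)."""
    rng = random.Random(seed)
    protos = list(model)
    weights = [model[p]["weight"] for p in protos]
    synthetic = []
    for _ in range(n):
        proto = rng.choices(protos, weights=weights, k=1)[0]
        params = model[proto]
        size = max(1, round(rng.gauss(params["mean"], params["stdev"])))
        synthetic.append({"protocol": proto, "bytes": size})
    return synthetic


# Hypothetical "captured" records standing in for real traffic summaries.
captured = [
    {"protocol": "tcp", "bytes": 1500},
    {"protocol": "tcp", "bytes": 1200},
    {"protocol": "tcp", "bytes": 900},
    {"protocol": "udp", "bytes": 300},
    {"protocol": "udp", "bytes": 500},
    {"protocol": "icmp", "bytes": 64},
]

model = fit_model(captured)
synthetic = generate(model, 5)
```

Only the fitted model, not the captured records, is needed to produce `synthetic`, which is the property that makes this approach attractive when the original data cannot be shared.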

Data 3 – Cyber Data Anonymization – Embark on a program that will generate useful test data by converting actual cyber data into a form that can be used and shared for research. Candidate approaches include metadata extraction from the stream, hashing, and data redaction.
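As a rough sketch of how the three approaches named under Data 3 might combine, the example below applies keyed hashing to addresses, keeps only the payload length as extracted metadata, and redacts the payload itself. The record layout, the field names, and the use of HMAC-SHA256 are assumptions chosen for illustration, not a recommended scheme. As the Weaknesses section notes, pseudonymized data can still leak identity through the relationships it preserves.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Keyed hash: the same input maps to the same token, but only under the same key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability; an illustrative choice


def anonymize_record(record: dict, key: bytes) -> dict:
    """Hash the addresses, extract payload metadata, and drop the payload itself."""
    return {
        "src": pseudonymize(record["src_ip"], key),
        "dst": pseudonymize(record["dst_ip"], key),
        "dst_port": record["dst_port"],
        "payload_len": len(record["payload"]),  # metadata only; content is redacted
    }


# Hypothetical record and key; addresses come from documentation ranges.
secret_key = b"example-key-held-by-the-data-owner"
raw = {
    "src_ip": "192.0.2.10",
    "dst_ip": "198.51.100.7",
    "dst_port": 443,
    "payload": b"GET /private HTTP/1.1",
}
shared = anonymize_record(raw, secret_key)
```

Because the hash is keyed, flows from the same host remain linkable within one release, while a party without the key cannot simply hash candidate addresses to reverse the mapping.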

Data 4 – Cyber Research Data Library – Create a library of existing cyber security datasets where organizations can share the datasets that they are developing/using for cyber security research.

Data 5 – Cyber Rosetta Stone – Develop a method for exchanging cyber security-related information among organizations while placing minimal data structuring and transformation demands on participating organizations. Potential methods include development of a dynamic cyber domain model, development of a semantic ontology for cyber data, or development of dynamic translators that can be used in each transaction to align and transfer data. Solutions should be capable of continually evolving their concepts to keep up with the rapidly changing technology terrain.
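One way to picture the translator option under Data 5 is a per-organization field mapping applied at exchange time, so that each organization keeps its native format. The sketch below is a simplified illustration only: the organization maps, local field names, and common field names are all hypothetical, and a real solution would need a much richer domain model.

```python
from typing import Any, Callable, Dict, Tuple

# Maps a local field name to (common field name, converter). Hypothetical examples.
FieldMap = Dict[str, Tuple[str, Callable[[Any], Any]]]

ORG_A_MAP: FieldMap = {
    "srcaddr": ("source_ip", str),
    "dport": ("dest_port", int),
}
ORG_B_MAP: FieldMap = {
    "SourceIP": ("source_ip", str),
    "DestinationPort": ("dest_port", int),
}


def translate(record: Dict[str, Any], field_map: FieldMap) -> Dict[str, Any]:
    """Align one organization's record with the shared vocabulary.

    Fields without a mapping are left out, so adding a concept later
    only requires a new map entry, not a change to the stored data.
    """
    common = {}
    for local_name, (common_name, convert) in field_map.items():
        if local_name in record:
            common[common_name] = convert(record[local_name])
    return common


from_a = translate({"srcaddr": "192.0.2.10", "dport": "443"}, ORG_A_MAP)
from_b = translate({"SourceIP": "192.0.2.10", "DestinationPort": 443, "Note": "x"}, ORG_B_MAP)
```

Both calls yield the same common record, even though neither organization changed how it stores its own data.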

Data 6 – Cyber Environment Test Harness – Create a set of standards and technologies to support the connection of live, streaming cyber environments to research analytics without risk to the performance or quality of the mission of the live network. Even when researchers in cyber analytics have permission to use live data, data captured for offline analysis may lack the context or external connections needed to demonstrate the efficacy of the research, which delays the benefit of new and innovative analytics within cyberspace.
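The requirement that research analytics must not degrade the live network can be sketched as a bounded, non-blocking tap: the live path offers each event to the research side and moves on immediately, and events are dropped rather than delayed when the research consumer falls behind. The class and method names below are hypothetical, and this is only one possible design under that assumption.

```python
import queue


class ResearchTap:
    """Bounded buffer between a live event stream and research analytics."""

    def __init__(self, capacity: int):
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0  # events discarded because the buffer was full

    def offer(self, event) -> bool:
        """Called from the live path; never blocks, drops the event if full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self) -> list:
        """Called from the research side; returns all buffered events."""
        events = []
        while True:
            try:
                events.append(self._queue.get_nowait())
            except queue.Empty:
                return events


tap = ResearchTap(capacity=2)
accepted = [tap.offer(e) for e in ("flow-1", "flow-2", "flow-3")]
buffered = tap.drain()
```

Tracking `dropped` lets researchers tell how complete their view of the live stream was, which bears on the context concerns raised above.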

Strengths:

Making data more widely available could allow more small businesses, research teams and individuals to become engaged in cyber security analytics development by lowering the barriers to entry and competition.
Providing datasets to academic institutions could help spur students into studying cyber security-related fields, since the availability of data might attract students to direct their research projects toward this application area.
The initial efforts required to create synthetic data (e.g., collecting, analyzing and characterizing in detail the traffic and behavior on networks) would have the benefit of advancing our understanding of “normal” behavior and thus help us identify “abnormal” behavior.

Weaknesses:

Past attempts at anonymization have shown that it is hard both to retain significant relational content in a dataset and to keep the entities in the dataset anonymous. A good chronology and collection of articles related to the 2006 release of search queries by AOL can be found at http://sifaka.cs.uiuc.edu/xshen/aol_querylog.html
Release of open datasets would give our adversaries access to the same datasets for their own research and development efforts, potentially negating some of the benefits that we would gain.

References:

National Cyber Leap Year Summit 2009 Participants’ Ideas Report, NITRD Program Office, September 16, 2009, pp 44-47, 114-115

DHS PREDICT Project (Protected REpository for the Defense of Infrastructure against Cyber Threats) - https://www.predict.org/ - DHS program currently working to address this issue

CAIDA – The Cooperative Association for Internet Data Analysis

http://www.caida.org/data/anonymization/ - Includes a reading list on data sharing and anonymization

http://www.caida.org/data/ - a list of datasets currently available through CAIDA

KDD Cup 1999 Dataset

http://www.sigkdd.org/kddcup/index.php?section=1999&method=info – A data set used in the KDD Cup 1999 contest, whose task was ". . . to build a network intrusion detector, a predictive model capable of distinguishing between 'bad' connections, called intrusions or attacks, and 'good' normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment."