PREDICT: A data repository for cyber security research


Access to appropriate data is an essential component of successful research. In the area of cyber security, large-scale datasets containing realistic network traffic are necessary to understand how proposed methods and tools actually work in real-life situations. Such data, by its very nature, contains sensitive information about users and organizations and is subject to legal and regulatory restrictions. It is therefore not easy for researchers to acquire this data.

As a component of its effort to advance research infrastructure for cyber security, the Department of Homeland Security (DHS) established the Protected Repository for Defense of Infrastructure against Cyber Threats (PREDICT) to provide a trusted framework for sharing real-world security-related datasets. As the only freely available, legally collected repository of large-scale datasets containing real network traffic and system logs, PREDICT provides access to data that can be used by the research community to test and evaluate research prototypes and by government technology decision-makers to evaluate competing cyber security tools and methods.

Establishing PREDICT required addressing a set of key issues for sharing these types of data: providing secure, centralized access to multiple sources of data; assuring confidentiality to protect the privacy of the individuals and the security of the networks from which the data is collected; assuring data integrity to protect access to the data and ensure its proper use; and protecting proprietary information and reducing legal risks.

In this presentation, we discuss the PREDICT framework for acquiring and sharing datasets, the process by which researchers are authorized to access the data, related program activities that have been initiated in response to issues encountered in developing PREDICT, and future activities.