Governance for Big Data
Lead PI:
Deirdre Mulligan
Abstract

Privacy governance for Big Data is challenging: data may be rich enough to allow the inference of private information that has been removed, redacted, or minimized. We must protect against both malicious and accidental inference, whether by data analysts or by automated systems. To do this, we are extending existing methods for controlling the inference risks of common analysis tools (drawn from the literature on the related problem of nondiscriminatory data analysis). We are coupling these methods with auditing tools such as audit logs whose integrity can be verified. Robust audit logs hold analysts accountable for their investigations, reducing exposure to malicious sensitive inference. Further, tools for controlling information flow and leakage in analytical models eliminate many types of accidental sensitive inference. Together, the analytical guarantees of inference-sensitive data mining technologies and the record-keeping functions of logs can create a methodology for truly accountable systems that hold and process private data.

This project will deliver a data governance methodology that enables more expressive policies, relying on accountability to support exploration of the privacy consequences of Big Data analysis. The methodology will combine known techniques from computer science (including verification using formal methods) with principles from the study of privacy-by-design and with accountability mechanisms. Systems subject to this methodology will generate evidence at two points. Before they operate, they produce guarantees about when and how data can be analyzed and disclosed, e.g., "differential privacy in this database implies that no analysis can determine the presence or absence of a single row." As they operate, they produce audit log entries that describe how the system's capabilities were actually used, e.g., "the marketing application submitted a query Q which returned a set of rows R," recorded in a way that demonstrates that these audit materials could not have been altered or forged. While these techniques are not novel, they have not previously been synthesized into an actionable methodology for practical data governance, especially as it applies to Big Data. Information gleaned from examining this evidence can inform the development of traditional-style data access policies that support full personal accountability for data access.
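As an illustration of the second kind of evidence, the following is a minimal sketch of a tamper-evident audit log built as a hash chain. It is not the project's implementation: the `AuditLog` class, the `GENESIS` constant, and the record fields (`app`, `query`, `rows_returned`) are hypothetical names chosen for this example. Each entry's digest covers the previous entry's digest, so altering or removing an earlier record changes every digest that follows it.

```python
import hashlib
import json

# Hypothetical placeholder used as the "previous digest" of the first entry.
GENESIS = "0" * 64


def _digest(prev_digest: str, record: dict) -> str:
    """Hash the previous entry's digest together with this record."""
    # sort_keys gives a stable serialization, so equal records hash equally.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_digest.encode("ascii") + payload).hexdigest()


class AuditLog:
    """Append-only log in which each entry is chained to its predecessor."""

    def __init__(self):
        self.entries = []  # each entry: {"record": dict, "digest": str}

    def append(self, record: dict) -> str:
        prev = self.entries[-1]["digest"] if self.entries else GENESIS
        digest = _digest(prev, record)
        # Store a copy so later changes by the caller do not alter the log.
        self.entries.append({"record": dict(record), "digest": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain and check every stored digest."""
        prev = GENESIS
        for entry in self.entries:
            if _digest(prev, entry["record"]) != entry["digest"]:
                return False
            prev = entry["digest"]
        return True


# Example use: log two queries, then simulate tampering with the first record.
log = AuditLog()
log.append({"app": "marketing", "query": "Q", "rows_returned": 3})
log.append({"app": "billing", "query": "Q2", "rows_returned": 1})
intact = log.verify()            # chain is consistent at this point
log.entries[0]["record"]["rows_returned"] = 0
tampered_ok = log.verify()       # recomputed digest no longer matches
```

A hash chain on its own only makes tampering detectable by a party who holds a trusted copy of the latest digest; whoever controls the whole log could otherwise recompute every digest. A deployed system would therefore need to anchor the most recent digest somewhere the log holder cannot rewrite, which this sketch does not attempt.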

To demonstrate that the output methodology is actionable, this project will also produce a set of generalizable design patterns applying the methodology to real data analysis scenarios drawn from interviews with practitioners in industry and government. These patterns will then inform practical use of the methodology. Together, the new methodology and design patterns will provide the start of a new generation of data governance approaches that function properly in the era of Big Data, machine learning, and automated decision-making. This project will relate the emerging science of privacy to data analysis practice, governance, and compliance and show how to make the newest technologies relevant, actionable, and useful.

Performance Period: 01/01/2018 - 01/01/2018
Institution: University of California, Berkeley
Sponsor: National Security Agency