Overview
The goal of the work presented in this course is to explore computational techniques for releasing useful information in such a way that the identity of any individual or entity contained in the data cannot be recognized while the data remain practically useful.
We begin by demonstrating ways to learn information about entities from publicly available information.
We then provide a formal framework for reasoning about disclosure control and the ability to infer the identities of entities contained within the data.
We formally define and present null-map, k-map and wrong-map as models of protection. Each model provides protection by ensuring that released information maps to no, k or incorrect entities, respectively.
We discuss the strengths and weaknesses of these protection models and provide real-world examples.
Data explosion
Materials presented in this section of the course propose a relation between the availability of inexpensive computers with large storage capacities and the collection of person-specific information. They also discuss the lack of barriers to widely distributing collected information, and then provide a formal mathematical model for characterizing and comparing real-world data sharing practices and policies and for computing privacy and risk measurements.
Demographics and Uniqueness
Materials presented in this section of the course provide experimental results from summary data showing how demographics often combine to make individuals unique or almost unique, and that such uniqueness typically occurs for a substantial number of individuals within a population. Knowing how many people share a particular set of characteristics forms the basis for confidently drawing inferences from data.
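As a concrete sketch of how such uniqueness is measured, the Python fragment below counts how many records share each combination of demographic values; a combination whose count is 1 identifies an individual uniquely. The field names are hypothetical placeholders, not those of any actual data set.

    from collections import Counter

    def bin_sizes(records, quasi_identifier):
        """Count how many records share each combination of values
        for the given demographic fields (the 'bin size')."""
        return Counter(tuple(r[f] for f in quasi_identifier) for r in records)

    # Hypothetical sample; field names are illustrative only.
    records = [
        {"zip": "02139", "birth_date": "1965-07-04", "sex": "F"},
        {"zip": "02139", "birth_date": "1965-07-04", "sex": "F"},
        {"zip": "02138", "birth_date": "1971-01-15", "sex": "M"},
    ]
    counts = bin_sizes(records, ["zip", "birth_date", "sex"])
    unique = sum(1 for c in counts.values() if c == 1)
    print(f"{unique} of {len(records)} records are demographically unique")

The bin size of a combination is exactly the quantity referred to above: the number of people who share a particular set of characteristics.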
Data linking
Materials in this section of the course report on an experiment that demonstrates how health information that contains no explicit identifiers, such as name, address or phone number, can be linked to fully identified information, such as a voter list, to re-identify the patients who are the subjects of the health data.
Materials also chronicle an experiment in which five patients from a proposed release of cancer incidence information are accurately re-identified by drawing probabilistic inferences from publicly available data.
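The core of such a linking attack can be sketched in a few lines. The sketch below, with hypothetical field names, joins de-identified health records to an identified voter list on the demographic fields the two sources share; a health record whose key matches exactly one voter is re-identified outright, while a small candidate set invites the kind of probabilistic inference used in the cancer-incidence experiment.

    def link(health_records, voter_records, shared_fields):
        """Join de-identified health records to an identified voter list
        on the demographic fields the two sources share."""
        index = {}
        for v in voter_records:
            index.setdefault(tuple(v[f] for f in shared_fields), []).append(v)
        matches = []
        for h in health_records:
            candidates = index.get(tuple(h[f] for f in shared_fields), [])
            if candidates:
                matches.append((h, candidates))  # candidate re-identifications
        return matches

    # e.g. link(health, voters, ["zip", "birth_date", "sex"])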
Data profiling
Materials presented in this section of the course report on a difficult re-identification experiment in which children are identified from seemingly innocent cancer incidence information. The experiment combines the techniques introduced earlier with pattern matching and employs additional materials, including web pages and e-mail discussion groups.
Data Privacy Attacks (Computer Security is not privacy)
Materials presented in this section of the course review related work in the statistics community and in the field of computer security. However, none of this work provides solutions to the broader problems experienced in today’s setting that are the topic of this course.
Protection Models
Materials presented in this section of the course provide a formal framework for reasoning about disclosure control and the ability to infer the identities of entities contained within data. Formal protection models are defined and these provide the basis for characterizing and comparing systems that appear in later weeks.
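To make the three models concrete, the sketch below classifies the protection a single released record receives by examining the population entities its values are consistent with: none (null-map), only incorrect ones (wrong-map), or an ambiguous set of at least k that includes the true entity (k-map). This is an illustration of the definitions, not the formal framework itself; the matches predicate is a hypothetical placeholder.

    def classify_protection(record, population, true_entity, k, matches):
        """Classify a released record under the null-map, k-map and
        wrong-map protection models. matches(record, entity) is True
        when the record's values are consistent with the entity."""
        candidates = [e for e in population if matches(record, e)]
        if not candidates:
            return "null-map"    # maps to no entity at all
        if true_entity not in candidates:
            return "wrong-map"   # maps only to incorrect entities
        if len(candidates) >= k:
            return "k-map"       # ambiguous among at least k entities
        return "insufficient"    # fewer than k candidates include the truth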
Survey of techniques
Materials presented in this section of the course survey disclosure limitation techniques and then provide a formal framework for reasoning about disclosure control and the ability to infer the identities of entities contained within data.
We further define an anonymous database system as one that makes individual and entity-specific data available such that individuals and other entities contained in the released data cannot be reliably identified.
We continue by introducing formal protection models, named null-map, k-map and wrong-map. Each model provides protection by ensuring that released information maps to no, k or incorrect entities, respectively.
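Two of the simplest disclosure limitation techniques surveyed here are generalization, which replaces a value with a less specific one, and suppression, which removes the value altogether. A minimal sketch, with hypothetical rules:

    def generalize_zip(zip_code, level):
        """Replace the trailing digits of a ZIP code with '*'."""
        keep = max(0, len(zip_code) - level)
        return zip_code[:keep] + "*" * (len(zip_code) - keep)

    def generalize_date_to_year(date):
        """Generalize an ISO date (YYYY-MM-DD) to its year."""
        return date[:4]

    SUPPRESSED = "*"  # suppression removes the value entirely

    print(generalize_zip("02139", 2))             # -> 021**
    print(generalize_date_to_year("1965-07-04"))  # -> 1965

Systems such as MinGen and Datafly, described below, differ chiefly in how they decide which values to generalize or suppress, and by how much.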
Protecting delimited data I
Materials presented in this section of the course include MinGen, a theoretical computational system that uses a formal protection model to ensure that releases provide adequate protection and, of all the possible releases that provide adequate protection, returns one that is minimally distorted. The real-world systems explored in subsequent weeks are compared to the theoretical results of MinGen.
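The guarantee MinGen provides can be conveyed with a small sketch: exhaustively try combinations of generalization levels for each field and return a release that satisfies the protection requirement (here, k-anonymity as a stand-in) with the least total generalization. This conveys the flavor of minimal distortion, not MinGen's actual algorithm or distortion metric; the hierarchies are hypothetical and level 0 of each is assumed to be the identity.

    from collections import Counter
    from itertools import product

    def is_k_anonymous(rows, k):
        """Every combination of released values occurs at least k times."""
        return all(c >= k for c in Counter(map(tuple, rows)).values())

    def mingen_sketch(rows, hierarchies, k):
        """hierarchies[i] is a list of functions, one per generalization
        level of column i. Return a k-anonymous release with the fewest
        total generalization steps, i.e. a minimally distorted release."""
        best = None
        for levels in product(*(range(len(h)) for h in hierarchies)):
            release = [[hierarchies[i][lvl](val)
                        for i, (val, lvl) in enumerate(zip(row, levels))]
                       for row in rows]
            if is_k_anonymous(release, k):
                if best is None or sum(levels) < best[0]:
                    best = (sum(levels), release)
        return best[1] if best else None

The exhaustive search also explains why MinGen is theoretical: the number of level combinations grows exponentially with the number of fields.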
Protecting delimited data II
Materials presented in this section of the course include Datafly, a computational system applied to field-structured databases. It uses a formal protection model to ensure that releases provide adequate protection and, of all the possible releases that provide adequate protection, returns one that is considered useful to the recipient of the data.
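A minimal sketch of the greedy idea behind Datafly: while some combination of values occurs fewer than k times, generalize the field with the most distinct values; once the remaining outliers fit within a suppression budget, suppress them. The generalizer functions are hypothetical and are assumed to eventually collapse all values toward a single most-general value.

    from collections import Counter

    def datafly_sketch(rows, generalizers, k, max_suppress):
        """Greedy heuristic: generalize the widest column until the
        outliers (combinations occurring fewer than k times) are few
        enough to suppress. generalizers[i](v) coarsens column i one step."""
        rows = [list(r) for r in rows]
        while True:
            counts = Counter(map(tuple, rows))
            n_outliers = sum(c for c in counts.values() if c < k)
            if n_outliers <= max_suppress:
                return [r for r in rows if counts[tuple(r)] >= k]
            # generalize the attribute with the most distinct values
            widest = max(range(len(rows[0])),
                         key=lambda i: len({r[i] for r in rows}))
            for r in rows:
                r[widest] = generalizers[widest](r[widest])

Greedy choices make the heuristic fast and the result useful to a recipient, but, unlike MinGen, nothing guarantees the release is minimally distorted.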
Protecting delimited data III
Materials presented in this section of the course include Mu-Argus, a computational system created and used by Statistics Netherlands that is surprisingly similar to my Datafly and Datafly II Systems, even though the systems were developed at roughly the same time with no knowledge of each other and come from different academic traditions. Datafly II tends to over-distort data while Mu-Argus tends to under-protect data. Neither is perfect.
Protecting delimited data IV
Materials presented in this section of the course include k-Similar, another computational solution that attempts to provide privacy protection in field-structured data. Unlike my Datafly II System and Statistics Netherlands’ Mu-Argus System discussed earlier, my k-Similar algorithm produces optimal releases without over-distorting the data and without under-protecting the data. Further, the metrics used by k-Similar are extensible, so that most of the disclosure limitation techniques surveyed earlier can be accommodated. However, k-Similar is not practical on large data sets.
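k-Similar searches for an optimal grouping of records; the greedy pass below only illustrates the underlying idea, namely clustering records into groups of at least k under a distance metric so that each group can be released with common values. It does not reproduce the optimality k-Similar achieves, and the distance function is a hypothetical parameter.

    def greedy_k_groups(records, k, distance):
        """Illustration only: gather each seed record with its k-1
        nearest unassigned neighbours. Assumes len(records) >= k.
        k-Similar itself minimizes the total within-group distance,
        which this greedy pass does not guarantee."""
        remaining = list(records)
        groups = []
        while len(remaining) >= 2 * k:
            seed = remaining.pop(0)
            remaining.sort(key=lambda r: distance(seed, r))
            groups.append([seed] + remaining[:k - 1])
            remaining = remaining[k - 1:]
        if remaining:
            groups.append(remaining)  # final group keeps at least k records
        return groups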
Protecting textual documents
Materials presented in this section of the course include Scrub, a computational
system that attempts to provide privacy protection in textual documents such as letters and email messages. An additional privacy problem that appears in unrestricted text, and not in field-structured databases, involves explicit identifiers within the text itself, which need to be located and altered. Because of the abundance of textual information on the Web, demand for systems that can render text sufficiently anonymous is great.
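The detection-and-replacement step can be sketched with a few regular expressions, as below; Scrub itself goes much further, running many detection algorithms in parallel to recognize names, addresses, dates and other identifiers that simple patterns miss. The patterns here are illustrative only.

    import re

    # Illustrative detectors; names like "Dr. Smith" need the richer
    # knowledge-based detection a system like Scrub provides.
    PATTERNS = {
        "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
        "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
        "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    }

    def scrub_sketch(text):
        """Replace each detected identifier with a placeholder for its type."""
        for label, pattern in PATTERNS.items():
            text = pattern.sub(f"[{label}]", text)
        return text

    print(scrub_sketch("Call Dr. Smith at 617-555-0100 before 3/14/1999."))
    # -> Call Dr. Smith at [PHONE] before [DATE].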
Technology, Policy, Privacy and Freedom
Materials presented in these sections of the course include current discussions on medical privacy legislation, policies and best practices. Ultimately, the social setting places many constraints on these problems, and those constraints must be understood if proposed solutions are to be viable. For the proposed technical solutions discussed earlier to be most effective, accompanying policies are required, and so the two must work together. At present, policy makers as well as the general public appear unaware not only of possible solutions, but also of the problems themselves.
Materials presented in these sections also include an examination of privacy matters specific to the World Wide Web and to the growing challenges facing society. As earlier discussions demonstrate, we are moving towards a society that can have all the data on all the people, a situation that is at odds with the philosophical foundation of the American way of life. It is not clear what words like “liberty” and “freedom” mean in the absence of personal privacy, for example. Nor is it clear how protections previously provided by the Freedom of Information Act or the requirement for search warrants can continue to offer the security they have historically provided.