Overview
The goal of the work presented in this course is to explore computational techniques for releasing useful information in such a way that the identity of any individual or entity contained in the data cannot be recognized while the data remain practically useful.
We begin by demonstrating ways to learn information about entities from publicly available information.
We then provide a formal framework for reasoning about disclosure control and the ability to infer the identities of entities contained within the data.
We formally define and present null-map, k-map and wrong-map as models of protection. Each model provides protection by ensuring that released information maps to no, k or incorrect entities, respectively.
We discuss the strengths and weaknesses of these protection models and provide real-world examples.
Data explosion
Materials presented in this section of the course propose a relation between the availability of inexpensive computers with large storage capacities and the collection of person-specific information. They also discuss the lack of barriers to widely distributing collected information, and then provide a formal mathematical model for characterizing and comparing real-world data sharing practices and policies and for computing privacy and risk measurements.
Demographics and Uniqueness
Materials presented in this section of the course provide experimental results from summary data showing how demographics often combine to make individuals unique or almost unique, and that such uniqueness typically occurs for a substantial number of individuals within a population. Knowing how many people share a particular set of characteristics forms the basis for confidently drawing inferences from data.
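As a concrete sketch of how such uniqueness is measured, the Python fragment below counts how many records share each combination of demographic values; a combination whose count is 1 identifies an individual uniquely. The field names are hypothetical placeholders, not those of any actual data set.

    from collections import Counter

    def bin_sizes(records, quasi_identifier):
        """Count how many records share each combination of values
        for the given demographic fields (the 'bin size')."""
        return Counter(tuple(r[f] for f in quasi_identifier) for r in records)

    # Hypothetical sample; field names are illustrative only.
    records = [
        {"zip": "02139", "birth_date": "1965-07-04", "sex": "F"},
        {"zip": "02139", "birth_date": "1965-07-04", "sex": "F"},
        {"zip": "02138", "birth_date": "1971-01-15", "sex": "M"},
    ]
    counts = bin_sizes(records, ["zip", "birth_date", "sex"])
    unique = sum(1 for c in counts.values() if c == 1)
    print(f"{unique} of {len(records)} records are demographically unique")

The bin size of a combination is exactly the quantity referred to above: the number of people who share a particular set of characteristics.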
Data linking
Materials in this section of the course report on an experiment that demonstrates how health information that contains no explicit identifiers, such as name, address or phone number, can be linked to fully identified information, such as a voter list, to re-identify the patients who are the subjects of the health data.
Materials also chronicle an experiment in which five patients from a proposed release of cancer incidence information are accurately re-identified by drawing probabilistic inferences from publicly available data.
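The core of such a linking attack can be sketched in a few lines. The sketch below, with hypothetical field names, joins de-identified health records to an identified voter list on the demographic fields the two sources share; a health record whose key matches exactly one voter is re-identified outright, while a small candidate set invites the kind of probabilistic inference used in the cancer-incidence experiment.

    def link(health_records, voter_records, shared_fields):
        """Join de-identified health records to an identified voter list
        on the demographic fields the two sources share."""
        index = {}
        for v in voter_records:
            index.setdefault(tuple(v[f] for f in shared_fields), []).append(v)
        matches = []
        for h in health_records:
            candidates = index.get(tuple(h[f] for f in shared_fields), [])
            if candidates:
                matches.append((h, candidates))  # candidate re-identifications
        return matches

    # e.g. link(health, voters, ["zip", "birth_date", "sex"])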
Data profiling
Materials presented in this section of the course report on a difficult re-identification experiment in which children are identified from seemingly innocent cancer incidence information. The experiment combines the techniques introduced earlier with pattern matching and employs additional materials, including web pages and e-mail discussion groups.
Data Privacy Attacks (Computer Security is not privacy)
Materials presented in this section of the course review related work in the statistics community and in the field of computer security. However, none of this work provides solutions to the broader problems experienced in today’s setting that are the topic of this course.
Protection Models
Materials presented in this section of the course provide a formal framework for reasoning about disclosure control and the ability to infer the identities of entities contained within data. Formal protection models are defined and these provide the basis for characterizing and comparing systems that appear in later weeks.
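To make the three models concrete, the sketch below classifies the protection a single released record receives by examining the population entities its values are consistent with: none (null-map), only incorrect ones (wrong-map), or an ambiguous set of at least k that includes the true entity (k-map). This is an illustration of the definitions, not the formal framework itself; the matches predicate is a hypothetical placeholder.

    def classify_protection(record, population, true_entity, k, matches):
        """Classify a released record under the null-map, k-map and
        wrong-map protection models. matches(record, entity) is True
        when the record's values are consistent with the entity."""
        candidates = [e for e in population if matches(record, e)]
        if not candidates:
            return "null-map"    # maps to no entity at all
        if true_entity not in candidates:
            return "wrong-map"   # maps only to incorrect entities
        if len(candidates) >= k:
            return "k-map"       # ambiguous among at least k entities
        return "insufficient"    # fewer than k candidates include the truth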
Survey of techniques
Materials presented in this section of the course survey disclosure limitation techniques and then provide a formal framework for reasoning about disclosure control and the ability to infer the identities of entities contained within data.
We further define an anonymous database system as one that makes individual and entity-specific data available such that individuals and other entities contained in the released data cannot be reliably identified.
We continue by introducing formal protection models, named null-map, k-map and wrong-map. Each model provides protection by ensuring that released information maps to no, k or incorrect entities, respectively.
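Two of the simplest disclosure limitation techniques surveyed here are generalization, which replaces a value with a less specific one, and suppression, which removes the value altogether. A minimal sketch, with hypothetical rules:

    def generalize_zip(zip_code, level):
        """Replace the trailing digits of a ZIP code with '*'."""
        keep = max(0, len(zip_code) - level)
        return zip_code[:keep] + "*" * (len(zip_code) - keep)

    def generalize_date_to_year(date):
        """Generalize an ISO date (YYYY-MM-DD) to its year."""
        return date[:4]

    SUPPRESSED = "*"  # suppression removes the value entirely

    print(generalize_zip("02139", 2))             # -> 021**
    print(generalize_date_to_year("1965-07-04"))  # -> 1965

Systems such as MinGen and Datafly, described below, differ chiefly in how they decide which values to generalize or suppress, and by how much.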
Protecting delimited data I
Materials presented in this section of the course include MinGen, a theoretical computational system that uses a formal protection model to ensure that releases provide adequate protection and, of all the possible releases that provide adequate protection, returns one that is minimally distorted. The real-world systems explored in subsequent weeks are compared to the theoretical results of MinGen.
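The guarantee MinGen provides can be conveyed with a small sketch: exhaustively try combinations of generalization levels for each field and return a release that satisfies the protection requirement (here, k-anonymity as a stand-in) with the least total generalization. This conveys the flavor of minimal distortion, not MinGen's actual algorithm or distortion metric; the hierarchies are hypothetical and level 0 of each is assumed to be the identity.

    from collections import Counter
    from itertools import product

    def is_k_anonymous(rows, k):
        """Every combination of released values occurs at least k times."""
        return all(c >= k for c in Counter(map(tuple, rows)).values())

    def mingen_sketch(rows, hierarchies, k):
        """hierarchies[i] is a list of functions, one per generalization
        level of column i. Return a k-anonymous release with the fewest
        total generalization steps, i.e. a minimally distorted release."""
        best = None
        for levels in product(*(range(len(h)) for h in hierarchies)):
            release = [[hierarchies[i][lvl](val)
                        for i, (val, lvl) in enumerate(zip(row, levels))]
                       for row in rows]
            if is_k_anonymous(release, k):
                if best is None or sum(levels) < best[0]:
                    best = (sum(levels), release)
        return best[1] if best else None

The exhaustive search also explains why MinGen is theoretical: the number of level combinations grows exponentially with the number of fields.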
Protecting delimited data II
Materials presented in this section of the course include Datafly, a computational system applied to field-structured databases. It uses a formal protection model to ensure that releases provide adequate protection and, of all the possible releases that provide adequate protection, returns one that is considered useful to the recipient of the data.
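A minimal sketch of the greedy idea behind Datafly: while some combination of values occurs fewer than k times, generalize the field with the most distinct values; once the remaining outliers fit within a suppression budget, suppress them. The generalizer functions are hypothetical and are assumed to eventually collapse all values toward a single most-general value.

    from collections import Counter

    def datafly_sketch(rows, generalizers, k, max_suppress):
        """Greedy heuristic: generalize the widest column until the
        outliers (combinations occurring fewer than k times) are few
        enough to suppress. generalizers[i](v) coarsens column i one step."""
        rows = [list(r) for r in rows]
        while True:
            counts = Counter(map(tuple, rows))
            n_outliers = sum(c for c in counts.values() if c < k)
            if n_outliers <= max_suppress:
                return [r for r in rows if counts[tuple(r)] >= k]
            # generalize the attribute with the most distinct values
            widest = max(range(len(rows[0])),
                         key=lambda i: len({r[i] for r in rows}))
            for r in rows:
                r[widest] = generalizers[widest](r[widest])

Greedy choices make the heuristic fast and the result useful to a recipient, but, unlike MinGen, nothing guarantees the release is minimally distorted.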
Protecting delimited data III
Materials presented in this section of the course include Mu-Argus, a computational system created and used by Statistics Netherlands that is surprisingly similar to my Datafly and Datafly II Systems, even though the systems were developed at roughly the same time with no knowledge of each other and come from different academic traditions. Datafly II tends to over-distort data while Mu-Argus tends to under-protect data. Neither is perfect.
Protecting delimited data IV
Materials presented in this section of the course include k-Similar, another computational solution that attempts to provide privacy protection in field-structured data. Unlike my Datafly II System and Statistics Netherlands’ Mu-Argus System discussed earlier, my k-Similar algorithm produces optimal releases without over-distorting the data and without under-protecting the data. Further, the metrics used by k-Similar are extensible, so that most of the disclosure limitation techniques surveyed earlier can be accommodated. However, k-Similar is not practical on large data sets.
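k-Similar searches for an optimal grouping of records; the greedy pass below only illustrates the underlying idea, namely clustering records into groups of at least k under a distance metric so that each group can be released with common values. It does not reproduce the optimality k-Similar achieves, and the distance function is a hypothetical parameter.

    def greedy_k_groups(records, k, distance):
        """Illustration only: gather each seed record with its k-1
        nearest unassigned neighbours. Assumes len(records) >= k.
        k-Similar itself minimizes the total within-group distance,
        which this greedy pass does not guarantee."""
        remaining = list(records)
        groups = []
        while len(remaining) >= 2 * k:
            seed = remaining.pop(0)
            remaining.sort(key=lambda r: distance(seed, r))
            groups.append([seed] + remaining[:k - 1])
            remaining = remaining[k - 1:]
        if remaining:
            groups.append(remaining)  # final group keeps at least k records
        return groups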
Protecting textual documents
Materials presented in this section of the course include Scrub, a computational
system that attempts to provide privacy protection in textual documents such as letters and email messages. An additional privacy problem that appears in unrestricted text, and not in field-structured databases, involves explicit identifiers within the text itself, which need to be located and altered. Because of the abundance of textual information on the Web, demand for systems that can render text sufficiently anonymous is great.
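The detection-and-replacement step can be sketched with a few regular expressions, as below; Scrub itself goes much further, running many detection algorithms in parallel to recognize names, addresses, dates and other identifiers that simple patterns miss. The patterns here are illustrative only.

    import re

    # Illustrative detectors; names like "Dr. Smith" need the richer
    # knowledge-based detection a system like Scrub provides.
    PATTERNS = {
        "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
        "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
        "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    }

    def scrub_sketch(text):
        """Replace each detected identifier with a placeholder for its type."""
        for label, pattern in PATTERNS.items():
            text = pattern.sub(f"[{label}]", text)
        return text

    print(scrub_sketch("Call Dr. Smith at 617-555-0100 before 3/14/1999."))
    # -> Call Dr. Smith at [PHONE] before [DATE].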
Technology, Policy, Privacy and Freedom
Materials presented in these sections of the course include current discussions on medical privacy legislation, policies and best practices. Ultimately, the social setting places many constraints on these problems, and those constraints must be understood if proposed solutions are to be viable. For the proposed technical solutions discussed earlier to be most effective, accompanying policies are required, and so the two must work together. At present, policy makers as well as the general public appear unaware not only of possible solutions, but also of the problems themselves.
Materials presented in these sections also include an examination of privacy matters specific to the World Wide Web and to the growing challenges facing society. As earlier discussions demonstrate, we are moving towards a society that can have all the data on all the people, a situation that is at odds with the philosophical foundation of the American way of life. It is not clear what words like “liberty” and “freedom” mean in the absence of personal privacy, for example. Nor is it clear how protections previously provided by the Freedom of Information Act or the requirement for search warrants can continue to offer the security they have historically provided.