Carnegie Mellon University

Data Privacy Center

Data Privacy Course


Project Track 5: Provably Anonymous Data




Objective

A tenet in the Data Privacy Lab is to construct privacy solutions such that data can be shared with guarantees of privacy protection while the data remain practically useful. Described in this way, a privacy solution must have a compliance statement addressing the privacy guarantee provided. Likewise, a privacy solution must have a warranty statement addressing the usefulness that remains.

Projects in this track provide privacy solutions for sharing a medical database such that the data can be useful for answering public health questions but the identities of the patients who are the subjects of the data cannot be reliably determined. In these projects, the privacy compliance is to thwart the ability to re-identify the names of the patients by linking to population registers. This must be achieved while also warranting the data remain useful for public health survey.

Raw Materials

Health Data
In Lab 8a, you worked with a sample of hospital visit records of patients who died in the hospital. These records include demographic information, dates of admission, and diagnosis and treatment information. These records do not include any explicit identifiers such as name, address, or Social Security number. A copy is available in Excel .xls format and in HTML format.


ICD-9 Diagnosis Codes
In Lab 8a, you had to interpret ICD-9 diagnosis codes (from a code to an English description) in order to determine the diagnoses of patients during the hospital visit in which they died. A copy of this information is available in Excel .xls format and in HTML format.


Death Registry
In Lab 8a's assignment, the students in the class helped assemble a death registry of people who have some information matching the demographics in the health data. It is believed that most (about 90%) of the patients in the health data appear in this death registry. However, there are some patients in the health data (about 10%) that do not appear in the death registry at all. Conversely, there are some people in the death registry who do not appear in the health data. For example, there are 402 people in the death registry and only 200 people in the health data. A copy of the death registry is available in Excel, tab-delimited text, and HTML formats.
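As a rough illustration of what linking the health data to the death registry involves, the sketch below joins the two tables on shared demographic fields and reports which health records match exactly one registry entry. The field names (`sex`, `birth_year`, `death_date`, `name`) and the toy rows are assumptions made up for this example; they are not the actual columns or contents of the course files.

```python
from collections import defaultdict

def link_records(health_rows, registry_rows, keys):
    """For each health record, list the registry names that agree on every key."""
    index = defaultdict(list)
    for person in registry_rows:
        index[tuple(person[k] for k in keys)].append(person["name"])
    # .get() avoids creating empty entries for health records with no match
    return [index.get(tuple(row[k] for k in keys), []) for row in health_rows]

# Hypothetical toy data, not taken from the course files
health = [
    {"sex": "F", "birth_year": 1920, "death_date": "2001-03-04"},
    {"sex": "M", "birth_year": 1935, "death_date": "2001-05-10"},
    {"sex": "M", "birth_year": 1940, "death_date": "2001-07-01"},
]
registry = [
    {"name": "Person A", "sex": "F", "birth_year": 1920, "death_date": "2001-03-04"},
    {"name": "Person B", "sex": "M", "birth_year": 1935, "death_date": "2001-05-10"},
    {"name": "Person C", "sex": "M", "birth_year": 1935, "death_date": "2001-05-10"},
    {"name": "Person D", "sex": "F", "birth_year": 1950, "death_date": "2001-01-01"},
]

matches = link_records(health, registry, keys=["sex", "birth_year", "death_date"])
unique = sum(1 for m in matches if len(m) == 1)     # re-identified to one name
ambiguous = sum(1 for m in matches if len(m) > 1)   # several candidate names
unmatched = sum(1 for m in matches if not m)        # not in the registry
print(unique, ambiguous, unmatched)
```

A health record that matches exactly one registry entry is a candidate re-identification; records with several candidates or none correspond to the ambiguous and missing cases described above.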


Measuring Identifiability
In Lab 8a's assignment, you estimated the identifiability of the patients in the health data. In lecture, Professor Sweeney described ways of measuring identifiability. Here are Professor Sweeney's slides on identifiability.
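One simple way to get a feel for identifiability is to count, for each record, how many records share the same combination of demographic values (its "bin size"); a bin of size 1 means the record is unique on those fields. The sketch below is only an assumed illustration of that idea, with hypothetical field names and toy rows, and is not necessarily the measure presented in the slides.

```python
from collections import Counter

def bin_sizes(rows, quasi_identifiers):
    """Return, per row, the number of rows sharing its quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return [counts[tuple(r[q] for q in quasi_identifiers)] for r in rows]

def fraction_unique(rows, quasi_identifiers):
    """Share of rows that are the only member of their bin."""
    sizes = bin_sizes(rows, quasi_identifiers)
    return sum(1 for s in sizes if s == 1) / len(sizes)

# Hypothetical toy data, not taken from the course files
rows = [
    {"sex": "F", "birth_year": 1920, "zip": "15213"},
    {"sex": "F", "birth_year": 1920, "zip": "15213"},
    {"sex": "M", "birth_year": 1935, "zip": "15217"},
    {"sex": "M", "birth_year": 1940, "zip": "15206"},
]
qi = ["sex", "birth_year", "zip"]

sizes = bin_sizes(rows, qi)
share_unique = fraction_unique(rows, qi)
print(sizes, share_unique)
```

Smaller bins mean fewer people a record could belong to, so a larger share of size-1 bins suggests a more identifiable data set.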


Anonymization Techniques
In lecture, Professor Sweeney described numerous techniques that can be used to distort information to provide privacy protection. She also described formal protection models and introduced computational approaches. Here are Professor Sweeney's slides on techniques and first k-anonymity algorithms.

For related papers, see k-anonymity, more k-anonymity, and Gen-Tree.
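To make the k-anonymity idea concrete: a table is k-anonymous over a set of quasi-identifiers when every combination of their values appears in at least k records. The sketch below checks that property and applies one illustrative generalization (birth year to decade, ZIP code to its first three digits). The field names, the toy rows, and the particular generalization are assumptions for this example only; this is not one of the algorithms from the lecture or the related papers.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return all(c >= k for c in counts.values())

def generalize(row):
    """Coarsen birth year to its decade and ZIP to a 3-digit prefix (illustrative)."""
    out = dict(row)
    out["birth_year"] = f"{row['birth_year'] // 10 * 10}s"
    out["zip"] = row["zip"][:3] + "**"
    return out

# Hypothetical toy data, not taken from the course files
rows = [
    {"sex": "F", "birth_year": 1921, "zip": "15213"},
    {"sex": "F", "birth_year": 1924, "zip": "15217"},
    {"sex": "M", "birth_year": 1935, "zip": "15206"},
    {"sex": "M", "birth_year": 1938, "zip": "15208"},
]
qi = ["sex", "birth_year", "zip"]

before = is_k_anonymous(rows, qi, k=2)
generalized = [generalize(r) for r in rows]
after = is_k_anonymous(generalized, qi, k=2)
print(before, after)
```

The trade-off is visible even in this toy case: generalization raises the bin sizes, but exact birth years and full ZIP codes are no longer available for analysis, which is the usefulness question the warranty statement has to address.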

Project Ideas

The exact nature of your project is up to you with some guidance from the course TAs and Professor Sweeney. If you are interested in working in this track, then you will need to complete at least one of the activities below as your "first assignment." Then, you can complete a second activity below (or propose and complete another related activity of your own design), so that together they comprise your final project in the course.

Final report

Write a summary report of your findings. Include all graphs, tables, spreadsheets and findings reported as part of your project presentation. Submit your final report by email to pad@dataprivacylab.org. Additionally, FTP any supporting documents you have, such as spreadsheets or tab-delimited files, into your personal space on dataprivacylab.org.

Graduate credit

If you are taking this course for graduate credit, you must complete at least three of the activities above (not two). Rather than writing a project report, you will write a conference-style paper on your work.


Fall 2004 Privacy and Anonymity in Data
Professor: Latanya Sweeney, Ph.D. [latanya@dataprivacylab.org]