Data Science to Save the World
Lab 2. Data Re-identification Project
In this lab, you will design experiments that help re-identify people in publicly available data.
A goal of this project is to help improve de-identification standards in publicly available data. There is growing demand to share personal data widely. Here are some examples.
We will perform a re-identification experiment as a class. The purpose of today's lab is to identify possible experiments we might undertake. An ideal dataset to select is one where the re-identification strategy is likely to work, we can substantiate our findings, and the result of the re-identification poses serious concerns.
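One rough way to judge whether a re-identification strategy is likely to work is to check how many records are unique on the fields an outside source could also supply. The sketch below is only an illustration under stated assumptions: the field names (`zip`, `birth_year`, `sex`), the helper `uniqueness_rate`, and the toy records are all hypothetical and invented here, not drawn from any real dataset.

```python
from collections import Counter

def uniqueness_rate(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination appears exactly once."""
    if not records:
        return 0.0
    # Build one key per record from the chosen fields.
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)

# Hypothetical toy records, invented purely for illustration.
toy_records = [
    {"zip": "02138", "birth_year": 1970, "sex": "F"},
    {"zip": "02138", "birth_year": 1970, "sex": "F"},
    {"zip": "02139", "birth_year": 1985, "sex": "M"},
    {"zip": "02140", "birth_year": 1992, "sex": "F"},
]

rate = uniqueness_rate(toy_records, ["zip", "birth_year", "sex"])
print(rate)
```

A high rate suggests that many records could be singled out if a second source carries the same fields; a low rate suggests the strategy may need additional or finer-grained fields.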
Some possible sources to consider are open government initiatives in Seattle, Chicago, New York, and elsewhere. Specialized databases are also available through open record acts and other means, including stop and frisk data, state health data, national health data, and educational data. You are not limited to these sources.
Professor Sweeney wonders whether you can identify a dataset and a re-identification strategy that would be successful. Your task for next week is to design an experiment that, if done, would likely lead to significant re-identification results.
Activity. Experimental Design
Divide into groups of up to 5 people. Join a group of people with whom you have not yet worked. Choose a group leader who will be responsible for writing your experimental design and making the presentation next week. The group leader must be a person who has not yet been a group leader. In your group, identify datasets and brainstorm possible approaches.
Important. At the end of the session today, log in to the Course Wiki and identify your group as one of the named groups listed there for Project 2. The names are Alice, Bob, Cathy, David, Ellen, Frank, Gail and Harry. Select one of these as the name for your group and place your group information there. Identify your group members and the group leader. Write a sentence or two about the experiment you will likely address.
Paper and Presentation (Due Next Week)
Submit a short paper (3-5 pages) from your group describing your experiment. Give a 5-minute presentation of your experimental design. Be sure to include the names of all group members on the paper.
Your experimental design should include: the dataset you would attempt to re-identify, other datasets and sources needed for the re-identification, a description of the strategy to use, and a description of how to measure and confirm whether re-identifications were successful.
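The four elements above can be made concrete with a small sketch: link a de-identified dataset to an identified source on shared fields, then score the claimed matches against known answers. Everything here is an assumption made for illustration: the functions `link_records` and `score`, the field names, and the toy records are hypothetical, and exact matching on a unique key is only one of many possible strategies. As a simplification, the sketch checks uniqueness only on the identified side.

```python
from collections import defaultdict

def link_records(deidentified, identified, keys):
    """Map each de-identified record id to a name when exactly one person matches."""
    index = defaultdict(list)
    for person in identified:
        index[tuple(person[k] for k in keys)].append(person["name"])
    matches = {}
    for rec in deidentified:
        candidates = index.get(tuple(rec[k] for k in keys), [])
        if len(candidates) == 1:  # claim a match only when it is unambiguous
            matches[rec["id"]] = candidates[0]
    return matches

def score(matches, ground_truth, total):
    """Compare claimed matches with confirmed identities."""
    correct = sum(1 for rid, name in matches.items() if ground_truth.get(rid) == name)
    return {
        "claimed": len(matches),
        "correct": correct,
        "precision": correct / len(matches) if matches else 0.0,
        "reidentified_fraction": correct / total if total else 0.0,
    }

# Hypothetical toy data, invented purely for illustration.
keys = ["zip", "birth_year", "sex"]
deidentified = [
    {"id": 1, "zip": "02138", "birth_year": 1970, "sex": "F"},
    {"id": 2, "zip": "02139", "birth_year": 1985, "sex": "M"},
    {"id": 3, "zip": "02140", "birth_year": 1992, "sex": "F"},
]
identified = [
    {"name": "Person A", "zip": "02138", "birth_year": 1970, "sex": "F"},
    {"name": "Person B", "zip": "02139", "birth_year": 1985, "sex": "M"},
    {"name": "Person C", "zip": "02139", "birth_year": 1985, "sex": "M"},
    {"name": "Person D", "zip": "02140", "birth_year": 1992, "sex": "F"},
]
# Confirmed identities, as they might come from a verification step.
ground_truth = {1: "Person A", 2: "Person B", 3: "Person E"}

matches = link_records(deidentified, identified, keys)
result = score(matches, ground_truth, total=len(deidentified))
print(matches)
print(result)
```

Reporting claimed matches separately from confirmed ones matters because a unique match is not necessarily a correct one; in the toy data, record 3 links to a single person who turns out to be the wrong one.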
In this course you will write scientific papers. The scientific paper you will write for this assignment has five parts: Abstract, Introduction, Methods, Results, and Discussion. These should be the headings in your write-up.
The Abstract should be a one-paragraph summary stating what you intend to show with the experiment, how you will show it, and the expected outcome.
The Introduction describes why your experiment is important. Do not assume the person reading the paper knows anything about this assignment. Include references and use authoritative sources to make your points. Do not make any sweeping or unsubstantiated statements in your writing. End the Introduction with a description of the hypothesis you will test.
The Methods section is where you describe the experiment you would do to test the hypothesis.
You will probably not have results for the experiment, though you may have preliminary results that suggest an expected outcome. Use the Discussion section to explain why this test would prove or disprove the hypothesis. Include at least one statement about the limits of your approach.
For an example of this format, see the write-up on the Gov1430 class-wide experiment.
Copyright © 2013-15 President and Fellows Harvard University | Data Privacy Lab