Data Privacy Lab  

Harvard University

Data Science to Save the World

Gov 2430


Lab 2b. Re-identification of Admissions Data

Goal.

A goal of this project is to help improve de-identification standards in publicly available data. There is growing demand to share personal data widely. You will design a plausible experiment to show how publicly available data can combine to reveal sensitive information about individuals, specifically the LSAT scores of applicants to law schools.

Suggested Readings


Problem.

Using public record requests, a researcher gathered admissions data about applicants to law schools. He shares the data publicly at the Data Website. All the values appearing in the data are truthful, though some values (usually gender) are suppressed in an attempt to provide privacy by making sure that at least 5 applicants to a school share the same race characteristics. The sensitive information is an applicant's LSAT score and undergraduate GPA. The Family Educational Rights and Privacy Act (FERPA), 20 U.S.C. § 1232g; 34 CFR Part 99, restricts the release of student scores. These scores should not relate to a specific student. Do they?
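The suppression rule described above can be checked mechanically. The following is a minimal sketch in Python, assuming each released record is a dictionary with the hypothetical fields `school`, `race`, and `gender`; the rows are invented for illustration and are not taken from the actual data. It counts how many records share each combination of values and reports the combinations shared by fewer than 5 records.

```python
from collections import Counter

# Invented toy records (NOT from the real data set). The field names
# "school", "race", and "gender" are assumptions made for this sketch.
rows = [{"school": "A", "race": "X", "gender": g} for g in "FFMMM"]


def small_groups(records, keys, k=5):
    """Return the value combinations over `keys` shared by fewer than k records."""
    counts = Counter(tuple(rec[key] for key in keys) for rec in records)
    return {group: n for group, n in counts.items() if n < k}


# With school and race only, all 5 toy records fall into one group of size 5.
print(small_groups(rows, ("school", "race")))

# Adding gender splits that group into two groups smaller than 5, which is
# the kind of situation in which a value such as gender would be suppressed.
print(small_groups(rows, ("school", "race", "gender")))
```

In this toy case, releasing gender would leave groups of 2 and 3 records, while withholding it leaves one group of 5. Whether the real data actually satisfies the rule is something to verify, not assume.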

Professor Sweeney wonders whether you can design a re-identification strategy that would be successful. You will have to identify other data sources to put names to the data. Your task for next week is to design an experiment that, if done, would likely lead to re-identification results, and to demonstrate your approach in part.


Activity. Experimental Design

Divide into groups of up to 5 people. Join a group of people with whom you have not yet worked. Choose a group leader who will be responsible for writing your experimental design and leading the presentation next week. The group leader must be a person who has not yet been a group leader. In your group, identify datasets and brainstorm possible approaches. The Course Wiki has copies of the data and other data sources for your consideration. See the section for Lab 2b.

Important. By the end of the session today, log in to the Course Wiki and identify your group as being one of the named groups listed there for Lab 3. The names are Alice, Bob, Cathy, David, Ellen, Frank, Gail, Harry, Irene and Jack. Select one of these as the name for your group and place your group information there. Identify your group members and the group leader.


Paper and Presentation (Due Next Week)

Submit a short paper (3-5 pages) from your group describing your experiment. Make a 5-minute presentation of your experimental design. Be sure to include the names of all group members on the paper.

You will design and demonstrate, in part, a plausible experiment that shows how LSAT scores can be reasonably associated with the named people to whom those scores likely belong (or a named person to one of 5 or 6 scores). Provide some anecdotal evidence to substantiate your approach.
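The idea of associating a named person with "one of 5 or 6 scores" can be illustrated with a small linkage sketch in Python. Everything here is an assumption made for illustration: the field names (`school`, `race`, `gender`, `lsat`), the invented released records, and the invented named person standing in for an entry from some other data source. A suppressed value (`None`) is treated as matching anything.

```python
# Invented released records (NOT real data). None marks a suppressed value.
released = [
    {"school": "A", "race": "X", "gender": None, "lsat": score}
    for score in (150, 155, 160, 165, 170)
]
released.append({"school": "A", "race": "Y", "gender": "F", "lsat": 172})


def candidate_scores(person, records, keys):
    """Return the sorted scores of all records consistent with `person` on `keys`.

    A record is consistent if, for every key, its value is suppressed (None)
    or equals the person's value.
    """
    return sorted(
        rec["lsat"]
        for rec in records
        if all(rec[key] is None or rec[key] == person[key] for key in keys)
    )


# A hypothetical named individual taken from a hypothetical second source.
person = {"name": "Hypothetical Person", "school": "A", "race": "X", "gender": "M"}

scores = candidate_scores(person, released, ("school", "race", "gender"))
print(person["name"], "matches", len(scores), "candidate scores:", scores)
```

The size of the candidate set is one way to express how strong a match is: a set of one score is a unique association, while a set of five means the person can only be tied to one of five scores. An actual experiment would also need to argue that the second data source is accurate and complete.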

In this course you will write scientific papers. The scientific paper you will write for this assignment will have five parts: an Abstract, Introduction, Methods, Results, and Discussion. These should be the headings in your writeup. The Abstract section should be a one-paragraph summary stating what you intend to show with the experiment, how, and the expected outcome. The Introduction describes why your experiment is important. You should not assume the person reading the paper knows anything about this assignment. Include references and use authoritative sources to make your points. You should not make any sweeping or unsubstantiated statements in your writing. End the Introduction with a description of the hypothesis you will test. The Methods section is where you describe the experiment you would do to test the hypothesis. You will not likely have results for the experiment, though you may have preliminary results that provide an expected outcome for your experiment. Use the Discussion section to explain why this test would prove or disprove the hypothesis. Include at least one statement about the limits of your approach. An example of this format appears on the Course Wiki. See the write-up on the Gov1430 class-wide experiment.


Copyright © 2013-15 President and Fellows Harvard University | Data Privacy Lab