Data Privacy Lab  

Harvard University

Data Science to Save the World

Gov 2430


Lab 1. Identity Angel Project


In this lab, you will design experiments that help inform and construct an overall system.



A goal of this project is to develop an automated, benevolent program that locates resumes on the Web that contain Social Security numbers and then, for each resume found, sends an automated email message alerting the subject that they are needlessly placing themselves at risk of identity theft. We would have the system operate 24/7 for the remainder of the semester and measure its overall performance and impact.

We can even build on the Identity Angel Project of Prof. Sweeney from years earlier, but before any of that, Professor Sweeney wonders whether this is a viable project and good use of our time together. That's the purpose of today's lab.

If we were going to build a system (don't worry if you don't know how to program!), these are things we would have to address:

  1. Identify "real" resumes. How does a program identify real resumes from documents that are retrieved from a Google search (using Google's automated search feature)? Seems hard.

  2. Harvest email address from resumes. Given a resume text file, how does a program harvest the email address of the subject of the resume, if the email address is present?

  3. Harvest Social Security number from resumes. Given a resume text file, how does a program harvest the Social Security number of the subject of the resume, if present?

  4. Harvest date of birth from resumes. Given a resume text file, how does a program harvest the date of birth of the subject of the resume, if present?

  5. Convert PDF into text. Given a resume in PDF format, can a program produce a text file containing the content?

  6. Convert HTML into text. Given a resume in HTML format, can a program produce a text file containing the content?

  7. Send email messages. Given an email address and a URL where a resume was found, can a program send an email message to the person who is the subject of the resume?
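As a starting point for thinking about items 2 and 3, here is a minimal sketch of harvesting with regular expressions. The patterns, the `harvest` function, and the sample text are assumptions made for illustration, not part of any existing Identity Angel code. Real resumes will contain formats these patterns miss, and dates of birth (item 4) are left out because their formats vary even more.

```python
import re

# Assumed patterns, for illustration only; real resumes need broader rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Nine digits grouped 3-2-4, separated by hyphens or spaces,
# and not embedded inside a longer run of digits.
SSN_RE = re.compile(r"(?<!\d)\d{3}[- ]\d{2}[- ]\d{4}(?!\d)")


def harvest(text):
    """Return the first email address and SSN-like string found, or None."""
    email = EMAIL_RE.search(text)
    ssn = SSN_RE.search(text)
    return {
        "email": email.group(0) if email else None,
        "ssn": ssn.group(0) if ssn else None,
    }


# Hypothetical resume text; 078-05-1120 is a well-known invalid number.
sample = (
    "Jane Doe\n"
    "Email: jane.doe@example.com\n"
    "SSN: 078-05-1120\n"
    "Phone: 617-555-0100"
)
print(harvest(sample))
# → {'email': 'jane.doe@example.com', 'ssn': '078-05-1120'}
```

Note that the phone number is not mistaken for a Social Security number because its digits are grouped 3-3-4. An experiment could count how often patterns like these miss or misread identifiers in a hand-checked sample.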

Meet professors who have hypotheses that question Professor Sweeney's vision.


Professor Alice asserts that you cannot write a program that will automatically identify real resumes from other pages (#1 above).

Professor Bob asserts that you cannot write a program that will automatically identify demographics in resumes (#2, 3 and 4 above).

Professor Cathy asserts that most resumes are in PDF or HTML format so you will not be able to retrieve the text (#5 and #6 above).
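One way to probe the HTML half of Professor Cathy's assertion is to try the conversion. The sketch below uses only Python's standard `html.parser` module; the `TextExtractor` class and `html_to_text` function are hypothetical names chosen for this example. PDF conversion is not shown because it generally requires a third-party library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text chunks of an HTML document, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical resume page.
page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Jane Doe</h1><p>Email: jane@example.com</p></body></html>"
)
print(html_to_text(page))
```

An experiment could run a converter like this over a sample of HTML resumes and have a person check what fraction of the output is readable and complete.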

Not effective

Professor David asserts that you can write a program that sends email, but you need a human to respond to inquiries and that is cost prohibitive (#7 above).

Professor Ellen asserts that you can do it, but the number of takedowns will be small.

Professor Frank asserts that you can do it, but you can never tell the public because then identity thieves will build the same system.

Professor Gail asserts that you can do it, but permanent archives will always exist anyway.

Don't bother

Professor Harry asserts that Social Security numbers are not a problem.

Professor Irene asserts that online resumes are such a small problem that there is no reason to bother.

Professor Jack asserts that the problem is solved if you get Google to not answer these kinds of search queries.

Professor Kim asserts that you cannot do it because it is illegal for us to have a repository of Social Security numbers on Harvard's computers.


Professor Larry asserts that resumes with Social Security numbers are making the U.S. vulnerable to identity theft (or "vulnerable to economic warfare").

Professor Sweeney wonders whether you can prove or disprove the claims of these professors. Your task for next week is to design an experiment that, when done, will reveal the answer. You do not have to carry out the experiment you design; just describe it in enough detail that its credibility, utility, and effectiveness in answering the question can be evaluated.
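For assertions about whether a program can identify something (such as Professor Alice's or Professor Bob's), one possible way to score an experiment is to compare the program's decisions against labels assigned by hand. The sketch below assumes that setup; the function name and the sample labels are made up for illustration and are not results from any real run.

```python
def precision_recall(predicted, actual):
    """Compare a program's yes/no decisions with hand-assigned labels.

    predicted, actual: equal-length lists of booleans, one per document.
    Returns (precision, recall); each is 0.0 when its denominator is zero.
    """
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if a and not p)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Made-up example: five documents, True means "this is a real resume".
program_says = [True, True, False, True, False]
human_says = [True, False, False, True, True]
precision, recall = precision_recall(program_says, human_says)
print(f"precision={precision:.2f} recall={recall:.2f}")
# → precision=0.67 recall=0.67
```

Precision answers "of the documents the program flagged, how many were right?" and recall answers "of the documents it should have flagged, how many did it find?" Your design would still need to say how the sample is chosen and what threshold counts as proving or disproving the assertion.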

Activity. Experimental Design

Divide into groups of 4-5 people. Select an assertion made by one of the professors. The goal of your group is to report on an experiment that could test that professor's assertion. You might consider the resumes we compiled in class or other resources. Choose a group leader who will be responsible for writing up your experimental design and making the presentation next week. In your group, brainstorm possible approaches to the experiment. Then, at the end of this segment of today's class, your group leader will make a 5-minute presentation describing your group's plan. Others will comment on the proposed plans but not discuss them.

Paper and Presentation (Due Next Week)

Submit a short paper (3-5 pages) from your group describing the experiment you designed. Make a 5-minute presentation of your experimental design. Be sure to include the names of all group members on the paper.

In this course you will write scientific papers. The scientific paper you write for this assignment will have five basic parts: an Abstract, Introduction, Methods, Results, and Discussion. These should be the headings in your writeup. The Abstract section should be a one-paragraph summary stating what you intend to show with the experiment, how you will show it, and the expected outcome. The Introduction describes why your experiment is important. You should not assume the person reading the paper knows anything about this assignment. Include references and use authoritative sources to make your points. You should not make any sweeping or unsubstantiated statements in your writing. End the Introduction with a description of the hypothesis you will test. The Methods section is where you describe the experiment you would do to test the hypothesis. You will probably not have results for the experiment, though you may have preliminary results that suggest an expected outcome. Use the Discussion section to explain why this test would prove or disprove the hypothesis. Include at least one statement about the limits of your approach.

Copyright © 2013-15 President and Fellows Harvard University | Data Privacy Lab