Harvard University | The Politics of Personal Data | Gov 1430
Lab 2 Supplement: Locating and Predicting Social Security Numbers
Goal.
In this lab you will gain first-hand experience of the ease with which Social Security numbers (SSNs) can be acquired.
Suggested Readings
Experiment. Predicting SSNs (Due Tuesday).
The goal of this activity is to give you first-hand experience predicting SSNs from seemingly innocent information.
Here is a file containing a list of SSNs that share the same first 5 digits. Only the last 4 digits and the dates of birth of these people appear in the file. Many of the people were born in the United States after 1990. The file has the fields {Last4SSN, Date of birth}. Select a candidate from the list; it must be someone who was born after 1990. Your job is to predict the SSN of the candidate. How many digits can you predict accurately?
You will try different ways to make predictions. First, try using the {Last4SSN, Date of birth} information alone. Fit a line to the dates of birth and SSN assignments of the people born after 1990 (not including your candidate). Then see how many of your candidate's digits you can predict correctly. What is the range of possible values?
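For those working outside Excel, the line fit described above might be sketched in Python as follows. This is only an illustration under stated assumptions: the records are made up (they are not from the lab file), the horizontal axis is assumed to be days from a reference date, the vertical axis is assumed to be the last 4 digits read as a number, and the names fit_line and predict_last4 are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_last4(days, slope, intercept):
    """Predict the last 4 digits for a birth date given in days, as a 4-character string."""
    value = round(slope * days + intercept)
    value = max(0, min(9999, value))  # keep the prediction inside 0000-9999
    return f"{value:04d}"


# Made-up records for illustration only: (days from reference date, last 4 digits).
records = [(0, 1000), (10, 1020), (20, 1040), (30, 1060), (40, 1080)]

candidate = records[2]                                # the person to predict
training = [r for r in records if r is not candidate]  # leave the candidate out

slope, intercept = fit_line([d for d, _ in training], [s for _, s in training])
prediction = predict_last4(candidate[0], slope, intercept)
print("predicted:", prediction, "actual:", f"{candidate[1]:04d}")
```

The candidate is left out of the fit so that the prediction is not helped by the answer itself. Real data will not lie exactly on a line, so the spread of the other points around the line gives a sense of the range of possible values.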
Then repeat your predictions, accounting for the renumbering done with every fifth SSN (see the Lecture 2 slides on the Course Wiki).
Instructions for fitting a line to points using Excel are available in the suggested readings above. Decide which points you believe make the most appropriate line. For example, you should discard any birth dates prior to 1989 (enumeration at birth began in 1987). You might also elect to discard or discount birth dates for which you do not believe the SSN was assigned at the time of birth. For example, if many of the births are in August 2000 but among them are some in April or February, you could ignore the earlier births when fitting the line, on the assumption that the August births were enumerated at birth and the others were delayed in receiving a Social Security number.
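One way to make the point selection above repeatable is sketched below in Python. It is a heuristic chosen for illustration, not a method prescribed by the lab: records are sorted by their last 4 digits, and a record is dropped when its birth date falls well before the median birth date of its neighbors in that ordering. The records are made up, and CUTOFF, MAX_LAG_DAYS, WINDOW and drop_delayed are hypothetical names and values.

```python
from datetime import date
from statistics import median

CUTOFF = date(1989, 1, 1)  # discard birth dates prior to 1989, as the text suggests
MAX_LAG_DAYS = 60          # hypothetical tolerance; tune it by looking at your data
WINDOW = 2                 # hypothetical number of neighbors checked on each side


def drop_delayed(records):
    """Return records that pass the cutoff and do not look like delayed assignments."""
    kept = [r for r in records if r[0] >= CUTOFF]
    kept.sort(key=lambda r: r[1])  # order by the last 4 digits
    result = []
    for i, (dob, last4) in enumerate(kept):
        neighbors = kept[max(0, i - WINDOW):i] + kept[i + 1:i + 1 + WINDOW]
        if not neighbors:
            result.append((dob, last4))
            continue
        typical = median(d.toordinal() for d, _ in neighbors)
        # A birth date far earlier than its neighbors suggests a delayed assignment.
        if typical - dob.toordinal() <= MAX_LAG_DAYS:
            result.append((dob, last4))
    return result


# Made-up records for illustration only: (date of birth, last 4 digits as a number).
records = [
    (date(1985, 3, 1), 5000),   # before the cutoff
    (date(2000, 8, 1), 5010),
    (date(2000, 8, 11), 5030),
    (date(2000, 2, 15), 5040),  # February birth among August assignments
    (date(2000, 8, 21), 5050),
    (date(2000, 8, 31), 5070),
]

cleaned = drop_delayed(records)
print(cleaned)
```

Comparing each point with its neighbors, rather than with a line fitted to all points, is a deliberate choice here: a birth date far from the rest can pull a fitted line toward itself and so hide its own error.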
Second, use your line to predict each of the other points for birth dates after 1989. How correct are the predictions? To how many digits?
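A small Python sketch of how the predictions might be scored is given below. It assumes that "correct to how many digits" means the number of matching leading digits of the last 4; that reading, the function names and the sample values are all assumptions made for illustration.

```python
def leading_digits_correct(predicted, actual):
    """Count how many leading digits of two 4-character strings agree."""
    count = 0
    for p, a in zip(predicted, actual):
        if p != a:
            break
        count += 1
    return count


def absolute_error(predicted, actual):
    """Numeric distance between the predicted and actual last 4 digits."""
    return abs(int(predicted) - int(actual))


# Made-up (predicted, actual) pairs for illustration only.
pairs = [("1040", "1047"), ("1999", "2000"), ("1234", "1234")]
for predicted, actual in pairs:
    print(predicted, actual,
          leading_digits_correct(predicted, actual),
          absolute_error(predicted, actual))
```

The second pair shows why both measures are worth reporting: a prediction can be off by a single serial number and still share no leading digits with the true value.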
NOTE: When using Excel to fit a line, you may want to convert the dates to a number of days relative to a reference date, because some versions of Excel do not work well with dates as numbers. Take the smallest date among your values as the reference date and subtract it from each of the other dates to get the number of days from the reference date. For example, if your earliest date of interest is 6/9/1990, subtract 6/9/1990 from each of the other birth dates in Excel. Use the number of days as the horizontal axis on your plot. The line will then be expressed in days, which will likely be more recognizable numbers. To translate a number of days back to a date, add 6/9/1990 to it and display the result as a date.
You may find the following Excel functions useful: DATEVALUE(C2&"/"&D2&"/"&B2) to construct a date from the values in cells C2, D2, and B2; VALUE(RIGHT(A2,4)) to get a numeric value for the rightmost 4 characters of the string in A2; and T2-DATEVALUE("6/9/1990") to compute the number of days between the date in T2 and 6/9/1990. You do not have to use Excel for this assignment if you prefer another way.
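If you prefer not to use Excel, the same conversions can be sketched in Python with the standard datetime module. The function names here are hypothetical, and the reference date is simply the 6/9/1990 example from the note.

```python
from datetime import date, timedelta

REFERENCE = date(1990, 6, 9)  # the example reference date, 6/9/1990


def days_from_reference(dob, reference=REFERENCE):
    """Number of days from the reference date to a birth date."""
    return (dob - reference).days


def date_from_days(days, reference=REFERENCE):
    """Translate a number of days back into a calendar date."""
    return reference + timedelta(days=days)


def last4_as_int(ssn_text):
    """Numeric value of the rightmost 4 characters, like VALUE(RIGHT(A2,4))."""
    return int(ssn_text[-4:])


print(days_from_reference(date(1990, 6, 19)))
print(date_from_days(10))
print(last4_as_int("0042"))
```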
Write a summary of your experiments as a 3-5 page scientific paper. The paper will have five parts: an Abstract, Introduction, Methods, Results, and Discussion. The Abstract should be a one-paragraph summary stating the goal of predicting Social Security numbers. The Introduction describes why your experiment is important. You should not assume the person reading the paper knows anything about this assignment. Include references and use authoritative sources to make your points. Do not make any sweeping or unsubstantiated statements in your writing. The Methods section is where you describe the actual ways you attempted predictions. In the Results section, include a graph or chart showing how successful each method was in predicting SSNs. Use the Discussion section to explain the significance of this experiment, its shortcomings, and its broader implications. Include at least one statement about the limits of your approach.
Submit
Send your work to Sean Hooley by email.
Presentation
Be prepared to discuss your findings in class on Tuesday.
Copyright © 2013-2015 President and Fellows Harvard University | Data Privacy Lab