Carnegie Mellon University

Data Privacy Center

Data Privacy Course


Project Track 2: Identity Theft and On-line SSNs




Objective

Student projects in this track build on two of Prof. Sweeney's research projects, Identity Angel and SSNwatch. The idea is to develop AI technologies that combat identity theft vulnerabilities specifically related to Social Security numbers (SSNs). Student projects in this track test the viability of these tools. More information about these projects is available here. Projects in this track also build on projects completed by students enrolled in this course in previous semesters.

Raw Materials

ConvertedResumes Database
In Lab 2 this semester, you built on Yea-Wen Yang's student project (Fall 2003). In that project, she located 144 on-line resumes in PDF format that contained SSNs. She then piped the results through Google to obtain them in HTML format. The resulting collection is termed the ConvertedResumes database. If you are working in this track and need a copy of this database, contact Professor Sweeney by sending an email message to pad@dataprivacylab.org.


Lab 2 Extraction Results
In Lab 2 this semester, you wrote programs to extract the SSN, date of birth, and email address from documents in the ConvertedResumes database. Student solutions and comparative results appear at the end of Lab 2.
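
As a reminder of what such an extraction program can look like, here is a minimal Python sketch. It is not one of the student solutions from Lab 2. The patterns, the function name extract_fields, and the assumption that SSNs appear as 3-2-4 digit groups and dates in month/day/year order are illustrative choices only, and real resumes will need more robust handling.

```python
import re

# Assumption: SSNs are written as 3-2-4 digits separated by hyphens or spaces.
SSN_RE = re.compile(r"(?<!\d)(\d{3})[- ](\d{2})[- ](\d{4})(?!\d)")
# Assumption: dates look like 4/15/1975 or 04-15-75 (month/day/year).
DATE_RE = re.compile(r"(?<!\d)(\d{1,2})[/-](\d{1,2})[/-](\d{4}|\d{2})(?!\d)")
# A deliberately simple email pattern; it does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_fields(text):
    """Return the first SSN, date, and email address found in text.

    Each value is None when no match is found. The first date found is
    treated as the date of birth, which is only a heuristic.
    """
    ssn = SSN_RE.search(text)
    dob = DATE_RE.search(text)
    email = EMAIL_RE.search(text)
    return {
        "ssn": "-".join(ssn.groups()) if ssn else None,
        "dob": dob.group(0) if dob else None,
        "email": email.group(0) if email else None,
    }


if __name__ == "__main__":
    # Made-up sample text; area number 000 is never issued as an SSN.
    sample = "Jane Doe\nSSN: 000 12 3456\nDOB: 4/15/1975\njane.doe@example.edu\n"
    print(extract_fields(sample))
```

For HTML documents such as those in the ConvertedResumes database, you would typically strip the markup before applying the patterns, since tags can split a number across elements.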


LatestResumes Database
Professor Sweeney wrote a routine to extract 75 on-line resumes that contain Social Security numbers. Resumes that appeared on-line in PDF format have been stored as text files, and resumes that appeared in HTML format have been stored as HTML. These resumes were extracted recently from the Web (week of 12/1/2004). This collection is termed the LatestResumes database. If you are working in this track and need a copy of this database, contact Professor Sweeney by sending an email message to pad@dataprivacylab.org.


SSNwatch
During the in-class activities of Lab 3 this semester, you worked with the SSNwatch validation server, which is available on-line at https://dataprivacylab.org/dataprivacy/projects/ssnwatch/index.html. This server provides inferences about the age and demographics of a person using only the first few digits of that person's SSN.
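
To make "the first few digits" concrete, the following Python sketch splits an SSN into its three fields (a 3-digit area number, a 2-digit group number, and a 4-digit serial number) and applies the basic rule that area numbers 000, 666, and 900-999, group number 00, and serial number 0000 are never issued. It is not the SSNwatch server's code. The function names split_ssn and is_plausible are hypothetical, and the lookup tables that a tool like SSNwatch would need to infer a place or period of issuance are not reproduced here.

```python
import re


def split_ssn(ssn):
    """Split an SSN string into (area, group, serial) digit strings.

    Accepts the digits with or without hyphens, e.g. "123-45-6789".
    Raises ValueError when the string does not have nine digits in that layout.
    """
    match = re.fullmatch(r"(\d{3})-?(\d{2})-?(\d{4})", ssn.strip())
    if match is None:
        raise ValueError("expected an SSN in AAA-GG-SSSS form")
    return match.groups()


def is_plausible(ssn):
    """Return False for SSNs whose fields use values that are never issued."""
    try:
        area, group, serial = split_ssn(ssn)
    except ValueError:
        return False
    # Area numbers 000, 666, and 900-999 are never assigned.
    if area in ("000", "666") or int(area) >= 900:
        return False
    # Group 00 and serial 0000 are never assigned either.
    return group != "00" and serial != "0000"


if __name__ == "__main__":
    for candidate in ("123-45-6789", "000-12-3456", "666-12-3456"):
        print(candidate, is_plausible(candidate))
```

A check like this only screens out impossible numbers; it says nothing about whether a plausible number was actually issued or to whom.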


Filtered Search
Professor Sweeney has written a Java program called FilteredSearch for your use (if needed). This program provides a skeleton for performing automated web searches and filtering the results. Based on the search criteria, the pages at the resulting URLs are fetched and stored locally. Multiple searches can be performed automatically, and searches can be limited to a particular site. Google is used as the basis for the search. (Note: You will have to get a registration key from Google, as described in the header of the Java file.)
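
The filtering step can be pictured with a short Python sketch. This is not the FilteredSearch program, which is written in Java and is not reproduced here. The function names limit_to_site and contains_all_terms are hypothetical, and the sketch assumes you already have a list of result URLs and the text of each fetched page; it performs no searching or fetching itself.

```python
from urllib.parse import urlparse


def limit_to_site(urls, site):
    """Keep only URLs whose host is `site` or one of its subdomains."""
    site = site.lower()
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if host == site or host.endswith("." + site):
            kept.append(url)
    return kept


def contains_all_terms(page_text, terms):
    """Return True when every term occurs in the page text, ignoring case."""
    lowered = page_text.lower()
    return all(term.lower() in lowered for term in terms)


if __name__ == "__main__":
    # Made-up result URLs for illustration.
    results = [
        "http://www.example.edu/resume.html",
        "http://example.com/cv.html",
        "http://notexample.edu/resume.html",
    ]
    print(limit_to_site(results, "example.edu"))
    print(contains_all_terms("Resume of Jane Doe, SSN on file", ["resume", "ssn"]))
```

Comparing the parsed host name, rather than searching for the site string anywhere in the URL, avoids keeping look-alike hosts such as notexample.edu.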


Project Ideas

The exact nature of your project is up to you with some guidance from the course TAs and Professor Sweeney. If you are interested in working in this track, then you will need to complete at least one of the activities below as your "first assignment." Then, you can complete a second activity below (or propose and complete another related activity of your own design), so that together they comprise your final project in the course.

Final report

Write a summary report of your findings. Include all graphs and findings reported as part of your project presentation. Submit your final report by email to pad@dataprivacylab.org. Additionally, FTP any cluster results you have, as spreadsheets or tab-delimited files, into your personal space on dataprivacylab.org.

Graduate credit

If you are taking this course for graduate credit, you must complete at least three of the activities above (not two). Rather than writing a project report, you will write a conference-style paper on your work.


Fall 2004 Privacy and Anonymity in Data
Professor: Latanya Sweeney, Ph.D. [latanya@dataprivacylab.org]