Carnegie Mellon University

Data Privacy Center

Data Privacy Course


Project Track 4: Mining On-line SSNs




Earlier in this course, you explored how on-line resumes can be used to subject a person to identity theft by providing the person's name and Social Security number (SSN). In this track, you will write programs that help combat this problem by alerting people who have posted SSNs on-line of the dangers of their practice.

Several students enrolled in this course last Fall (2003) submitted projects that located on-line resumes and identified those that contained SSNs. If you choose this track this semester (Spring 2004), you will build on that work by writing a program that actually mines the SSN and other relevant information from an on-line resume. If an email address is found, your program will additionally send an email message to the person, alerting them that having their resume on-line with an SSN is not a good practice.

Assignment 1

Yea-Wen Yang did some outstanding work in this area last term. A copy of her project report is available here. In experiments related to her project, she constructed two repositories which you will use in Assignment 1. The first is a repository of on-line resumes, which appeared in PDF format. She piped these through the Google converter so the resulting files were made available in HTML. These resumes are available in the ConvertedResumes repository.

Note. You may download the files in the ConvertedResumes repository either by clicking on each file and saving it to your local machine, or by FTP'ing to dataprivacylab.org as you did in Lab 5 and downloading the files from the resumes/converted subdirectory found there.

She also provided a repository of resumes in HTML format. These are available in the HTMLResumes repository.

Note. Similarly, you may download the files in the HTMLResumes repository either by clicking on each file and saving it to your local machine, or by FTP'ing to dataprivacylab.org as you did in Lab 5 and downloading the files from the resumes/html subdirectory found there.

Write a program that, given an on-line resume from the ConvertedResumes repository, returns the following information from the document, if present:

  1. Social Security number

  2. Date of birth

Some of the documents in the repository may not have values for each of the fields. Your program should identify those that are present. Hint: Think of the Scrub system! Exploit the conventions people follow when writing these values in resumes, and use those conventions as cues to locate the values.
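
As a starting point for the hint above, here is a minimal sketch of cue-based detection in Python. It assumes the resume has already been reduced to plain text, and it looks for a value only when it follows a cue phrase such as "SSN" or "Date of Birth". The cue lists, the accepted value formats, and the function name extract_fields are all illustrative assumptions, not a prescribed solution; real resumes will use cues and formats this sketch misses.

```python
import re

# Cue phrases that often precede an SSN (an assumed, incomplete list).
SSN_CUE = r"\b(?:SSN|SS#|S\.S\.N\.|Social\s+Security(?:\s+(?:Number|No\.?|#))?)"
# Nine digits written 3-2-4, separated by hyphens or spaces.
SSN_VALUE = r"(\d{3}[- ]\d{2}[- ]\d{4})"
SSN_RE = re.compile(SSN_CUE + r"\s*[:#.\-]?\s*" + SSN_VALUE, re.IGNORECASE)

# Cue phrases that often precede a date of birth (also assumed).
DOB_CUE = r"\b(?:DOB|D\.O\.B\.|Date\s+of\s+Birth|Birth\s*date|Born)"
# Either a numeric date (7/4/1976, 07-04-76) or "Month day, year".
DOB_VALUE = r"(\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|[A-Z][a-z]+\.?\s+\d{1,2},?\s+\d{4})"
DOB_RE = re.compile(DOB_CUE + r"\s*[:.\-]?\s*" + DOB_VALUE, re.IGNORECASE)


def extract_fields(text):
    """Return the first cued SSN and date of birth in text, or None for each."""
    ssn = SSN_RE.search(text)
    dob = DOB_RE.search(text)
    return {
        "ssn": ssn.group(1) if ssn else None,
        "dob": dob.group(1) if dob else None,
    }


if __name__ == "__main__":
    sample = "Jane Doe\nSocial Security Number: 123-45-6789\nDate of Birth: 07/04/1976\n"
    print(extract_fields(sample))
```

Requiring a cue keeps a phone number such as 412-555-1234 from being reported as an SSN, at the cost of missing values that appear with no label at all. Your hand examination of the documents will tell you which trade-off fits the repository.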

Given a URL to one of these documents, your program should report the values found for the Social Security number and date of birth, as present in the document. Try your program on the documents in the ConvertedResumes repository. Examine the documents by hand and see how well your program performed.
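
Because the documents in the repository are HTML, it is usually easier to search them after the markup has been removed. The sketch below shows one possible way to do that with Python's standard library. The class name TextExtractor, the helper names html_to_text and fetch_text, the set of tags treated as line breaks, and the assumption that pages decode as UTF-8 are all choices made for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    # Tags assumed to start a new line of text (an illustrative choice).
    BLOCK_TAGS = {"p", "br", "div", "tr", "td", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags and normalize whitespace, one non-empty line per text block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def fetch_text(url):
    """Download a document and return its text (assumes UTF-8; bad bytes are replaced)."""
    with urlopen(url) as response:
        raw = response.read()
    return html_to_text(raw.decode("utf-8", errors="replace"))
```

With something like this in place, your program can call fetch_text on the URL it is given and pass the result to whatever field-detection code you write.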

Submit an abstract (up to 2 pages) describing your methods for detecting these fields and your statistical results. Send the abstract to pad@dataprivacylab.org by the due date for Project Assignment 1.

If you are enrolled for graduate credit, you should additionally modify your programs so they also work in the HTMLResumes repository. Given a URL from either repository, your program should report the values found for the Social Security number and date of birth, as present in the document.

Give a 5-minute presentation to the class on your experiment and findings for either this assignment (Assignment 1) or the next assignment (Assignment 2).

Note. You may complete Assignment 1 and later change your mind about which project you will submit as your term project, provided your final decision occurs before the second project assignment and is approved by the instructor. See the course schedule.


Assignment 2

Update your program so it identifies the following information:

Given a URL from the ConvertedResumes repository, your program should report these values as present in the document. Hint: You can typically identify an email address by looking for the @ character.
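
The @ hint can be sketched as a regular expression in Python. The pattern below is a deliberately simple approximation, not a full implementation of the email address standard, and the names EMAIL_RE and find_emails are illustrative. It assumes the document has already been converted to plain text.

```python
import re

# One or more address characters, an @, then a domain with at least one dot.
# A simplified pattern: it will miss some valid addresses and accept some invalid ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_emails(text):
    """Return the distinct email addresses in text, in order of first appearance."""
    found = []
    seen = set()
    for addr in EMAIL_RE.findall(text):
        key = addr.lower()  # treat differently capitalized copies as duplicates
        if key not in seen:
            seen.add(key)
            found.append(addr)
    return found


if __name__ == "__main__":
    print(find_emails("Contact: jane.doe@example.org. Alternate: j_doe@mail.example.com"))
```

Because the domain part must end in a letter, digit, or hyphen, a period that merely ends the sentence is not swallowed into the address.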

If you are enrolled for graduate credit, you should additionally modify your programs so they also work in the HTMLResumes repository. Given a URL from either repository, your program should report the values found for the Social Security number and date of birth, as present in the document.

Try your program on the documents. Examine the documents by hand and see how well your program performed.
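
One way to turn the hand examination into statistical results is to record, for each document, the value you found by hand and the value your program reported, and then count agreements and disagreements. The sketch below computes precision and recall for a single field. The function name score_field and the dictionary layout (document name mapped to a value, or None when the field is absent) are assumptions for illustration; you are free to report other measures.

```python
def score_field(predicted, actual):
    """Compare program output with hand labels for one field.

    predicted and actual map a document name to the field value,
    or to None when no value was reported / none is in the document.
    """
    tp = fp = fn = 0
    for doc, truth in actual.items():
        guess = predicted.get(doc)
        if guess is not None and guess == truth:
            tp += 1  # program reported exactly the value found by hand
        else:
            if guess is not None:
                fp += 1  # program reported a value that is wrong or not present
            if truth is not None:
                fn += 1  # program missed, or misread, a value that is present
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


if __name__ == "__main__":
    by_hand = {"a.html": "123-45-6789", "b.html": None, "c.html": "987-65-4321"}
    by_program = {"a.html": "123-45-6789", "b.html": "555-12-3456", "c.html": None}
    print(score_field(by_program, by_hand))
```

Note that a misread value counts against both precision and recall in this sketch, which is one of several reasonable conventions; state the convention you use in your abstract.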

Submit an abstract (up to 2 pages) describing your methods for detecting these fields and your statistical results. Send the abstract to pad@dataprivacylab.org by the due date for Project Assignment 2.


Final report

To conclude your project, complete any one of the following:

  1. Write a program that uses the Google API to search for on-line resumes and then extracts the demographic fields of information described in Assignment 2. You may use the FilteredSearch programs, which are useful for working with the Google API from a Java program.

  2. Update your program from Assignment 2 to work directly with PDF files. The file format for PDF is available at https://partners.adobe.com/asn/tech/pdf/specifications.jsp. Use this information to allow your program to be given a resume in PDF format directly and have it extract the demographic fields of information described in Assignment 2.

  3. Update your program to automatically send an email message to the person who is the subject of the resume, using the email address found in the resume, if present. The message should alert the person that the resume was found with an SSN and that this practice can lead to identity theft.
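
For option 3, the alert message can be sketched with Python's standard email and smtplib modules. Everything specific here is an assumption for illustration: the function names build_alert and send_alert, the sender address alerts@example.org, the wording of the message, and a mail server listening on localhost port 25. The sketch deliberately leaves the SSN itself out of the message so the alert does not spread the number further.

```python
import smtplib
from email.message import EmailMessage


def build_alert(recipient, resume_url, sender="alerts@example.org"):
    """Compose (but do not send) an alert about an SSN found in an on-line resume."""
    msg = EmailMessage()
    msg["From"] = sender  # hypothetical sender address
    msg["To"] = recipient
    msg["Subject"] = "Your on-line resume exposes your Social Security number"
    msg.set_content(
        "Hello,\n\n"
        "A copy of your resume was found on-line at:\n\n"
        f"  {resume_url}\n\n"
        "The resume includes a Social Security number. Posting an SSN together\n"
        "with your name can lead to identity theft. Please consider removing\n"
        "the number from the on-line copy.\n"
    )
    return msg


def send_alert(msg, host="localhost", port=25):
    """Hand the message to an SMTP server (host and port are assumed values)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)


if __name__ == "__main__":
    # Print the message for review instead of sending it.
    print(build_alert("jane.doe@example.org", "http://example.org/resume.html"))
```

Separating composition from sending lets you review every message before anything leaves your machine, which is worth doing while your detection code can still produce false alarms.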

Graduate credit. If you are enrolled for graduate credit, complete two of the options above. Rather than writing a project report, you will write a conference-style paper on your work.


Spring 2004 Privacy and Anonymity in Data
Professor: Latanya Sweeney, Ph.D. [latanya@dataprivacylab.org]