Several students enrolled in this course last Fall (2003) provided projects that located on-line resumes and identified those that contained SSNs. If you are considering a project in this track this semester (Spring 2004), you will build on that work by writing a program that actually mines the SSN and other relevant information from an on-line resume. If an email address is found, your program will additionally send an email message to the person, alerting them that having their resume on-line with an SSN is not a good practice.
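As a rough illustration of the alerting step, the sketch below finds the first email address in a resume's text and composes (but does not send) a warning message. The function name build_alert, the wording of the message, and the sender address are all hypothetical; the email pattern is a simplification and will not match every valid address.

```python
import re
from email.message import EmailMessage

# Simplified email pattern (an assumption; real addresses can be more varied).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def build_alert(resume_text, resume_url, sender):
    """Compose an alert for the first email address in resume_text.

    Returns None when the resume contains no email address.
    """
    match = EMAIL_RE.search(resume_text)
    if match is None:
        return None
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = match.group(0)
    msg["Subject"] = "Your on-line resume contains a Social Security number"
    msg.set_content(
        "Your resume at " + resume_url + " appears to include your "
        "Social Security number. Posting an SSN on-line is not a good "
        "practice; please consider removing it."
    )
    return msg


# Example with made-up data.
alert = build_alert(
    "Jane Doe\njane.doe@example.com\nSSN: 123-45-6789",
    "http://example.com/resume.txt",
    "student@example.edu",
)
print(alert["To"])
```

Actually delivering the message could be done with smtplib.SMTP(...).send_message(msg), which requires access to a mail server and is left out of this sketch.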
Note. You may download the files in the ConvertedResumes repository either by clicking on each file and saving it to your local machine, or by FTP'ing to dataprivacylab.org as you did in Lab 5 and then downloading the files from the resumes/converted subdirectory found there.
She also provided a repository of resumes in HTML format. These are available in the HTMLResumes repository.
Note. Similarly, you may download the files in the HTMLResumes repository either by clicking on each file and saving it to your local machine, or by FTP'ing to dataprivacylab.org as you did in Lab 5 and then downloading the files from the resumes/html subdirectory found there.
Write a program that, given an on-line resume from the ConvertedResumes repository, returns the following information from the document, if present:
Some of the documents in the repository may not have values for each of the fields. Your program should identify those that are present. Hint: Think of the Scrub system! Exploit the way in which society writes these values in resumes and use that as cues to locate the values.
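One possible way to act on the hint is to look for the cue words that people conventionally write in front of these values. The sketch below is only an illustration under that assumption: the cue lists, the accepted number and date formats, and the name extract_fields are all hypothetical choices, not a prescribed solution, and real resumes will contain variations these patterns miss.

```python
import re

# Cue words that often precede an SSN (an assumed, incomplete list),
# followed by nine digits written with optional dashes or spaces.
SSN_RE = re.compile(
    r"\b(?:SSN|SS#|Social\s+Security(?:\s+(?:Number|No\.?|#))?)\s*[:#-]?\s*"
    r"(\d{3}[- ]?\d{2}[- ]?\d{4})",
    re.IGNORECASE,
)

# Cue words that often precede a date of birth, followed by either a
# numeric date (3/4/1970) or a month-name date (March 4, 1970).
DOB_RE = re.compile(
    r"\b(?:Date\s+of\s+Birth|Birth\s*date|DOB|D\.O\.B\.|Born)\s*[:-]?\s*"
    r"(\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|[A-Za-z]+\.?\s+\d{1,2},?\s+\d{4})",
    re.IGNORECASE,
)


def extract_fields(text):
    """Return the SSN and date of birth found in text; None for absent fields."""
    ssn = SSN_RE.search(text)
    dob = DOB_RE.search(text)
    return {
        "ssn": ssn.group(1) if ssn else None,
        "dob": dob.group(1) if dob else None,
    }


# Example with made-up data.
sample = "John Doe\nSSN: 123-45-6789\nDate of Birth: March 4, 1970\n"
print(extract_fields(sample))
```

Requiring a cue word before the digits trades recall for precision: a bare nine-digit number (a phone extension, a student ID) is not reported, but an SSN written without any label is missed as well.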
Given a URL to one of these documents, your program should report the values found for the Social Security number and date of birth, as present in the document. Try your program on the documents in the ConvertedResumes repository. Examine the documents by hand and see how well your program performed.
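To turn the hand examination into statistical results, one common approach is to compare the program's output with the hand-checked values and compute precision and recall. The sketch below assumes both are stored as dictionaries keyed by document name, with None meaning "no value"; the function name score and the sample data are hypothetical.

```python
def score(predicted, truth):
    """Return (precision, recall) of predicted values against hand-checked truth.

    Both arguments map a document name to the value for one field
    (for example the SSN), or to None when no value is present/reported.
    """
    tp = fp = fn = 0
    for doc, actual in truth.items():
        found = predicted.get(doc)
        if found is not None and found == actual:
            tp += 1  # reported exactly the value that is in the document
        else:
            if found is not None:
                fp += 1  # reported a value that is wrong or not in the document
            if actual is not None:
                fn += 1  # missed (or got wrong) a value that is in the document
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Example with made-up data: four documents checked by hand.
truth = {"a.txt": "123-45-6789", "b.txt": None,
         "c.txt": "111-22-3333", "d.txt": "222-33-4444"}
predicted = {"a.txt": "123-45-6789", "b.txt": "000-00-0000",
             "c.txt": None, "d.txt": "222-33-4444"}
precision, recall = score(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this sketch a wrong value counts as both a false positive and a false negative, and values must match exactly, so "123 45 6789" and "123-45-6789" would be treated as different unless normalized first.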
Submit an abstract (up to 2 pages) describing your methods for detecting these fields and your statistical results. Send the abstract to pad@dataprivacylab.org by the due date for Project Assignment 1.
If you are enrolled for graduate credit, you should additionally modify your programs so they also work in the HTMLResumes repository. Given a URL from either repository, your program should report the values found for the Social Security number and date of birth, as present in the document.
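For the HTML documents, one option is to reduce each page to its visible text first, so that the same field detection can run on documents from either repository. The sketch below uses the html.parser and urllib.request modules from the Python standard library; the names TextExtractor, html_to_text, and fetch_text are hypothetical, and the test for whether a page is HTML (looking for "<html" in the body) is a crude assumption.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("br", "p", "div", "tr", "li"):
            self.parts.append("\n")  # keep line structure for cue matching

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def fetch_text(url):
    """Download a resume and return plain text, converting HTML if needed."""
    with urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "latin-1"
        body = resp.read().decode(charset, errors="replace")
    return html_to_text(body) if "<html" in body.lower() else body


# Example with made-up markup.
page = "<html><body><p>SSN: <b>123-45-6789</b></p></body></html>"
print(html_to_text(page))
```

Stripping tags before matching matters because a value may be split across markup, as in the example where the label and the number sit in different elements.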
Give a 5-minute presentation to the class on your experiment and findings for either this assignment (Assignment 1) or the next assignment (Assignment 2).
Note. You may complete Assignment 1 and later change your mind about which project you will submit as your term project, provided your final decision is made before the second project assignment and is approved by the instructor. See the course schedule.
If you are enrolled for graduate credit, you should additionally modify your programs so they also work in the HTMLResumes repository. Given a URL from either repository, your program should report the values found for the Social Security number and date of birth, as present in the document.
Try your program on the documents. Examine the documents by hand and see how well your program performed.
Submit an abstract (up to 2 pages) describing your methods for detecting these fields and your statistical results. Send the abstract to pad@dataprivacylab.org by the due date for Project Assignment 2.
Graduate credit. If you are enrolled for graduate credit, complete two of the options above. Rather than writing a project report, you will write a conference-style paper on your work.