Lab 2 Extraction Results
In Lab 2 this semester,
you wrote programs to extract the SSN, date of birth, and email address
from documents in the ConvertedResumes database. Student solutions
and comparative results appear at the end of
Lab 2.
LatestResumes Database
Professor Sweeney wrote a routine to extract 75 on-line resumes that contain Social Security numbers. Each document is stored as a text file if it appeared on-line in PDF format,
or as HTML if it appeared on-line in HTML format. These
resumes were extracted recently from the Web (week of 12/1/2004). This database
is termed the LatestResumes Database. If you are working in this track
and need a copy of this database, contact Professor Sweeney by sending an
email message to paddataprivacylab.org.
SSNwatch
During the in-class activities of
Lab 3 this semester,
you worked with the SSNwatch validation server. This is available on-line
at https://dataprivacylab.org/dataprivacy/projects/ssnwatch/index.html. This server provides inferences about the age and
demographics of a person using only the first few digits of their SSN.
Filtered Search
Professor Sweeney has written a Java program called
FilteredSearch for your
use (if needed). This program provides a skeleton for performing automated
web searches and filtering the results. Based on search criteria,
URL pages are fetched and stored locally. Multiple searches can be performed
automatically. Searches can be limited to a particular site. Google is used as the
basis for the search. (Note: you will have to get a registration key from Google,
as described in the header of the Java file.)
Send each person who has a resume in the LatestResumes database a personal email message alerting them that their SSN was found on-line and that there are identity theft risks associated with providing such information. Use the approved email message. Then report how many people respond and the nature of each response. Some people may send email responses directly to you. Record these and classify them. Just before the assignment is due, visit each of the URLs containing the on-line resumes and report how many have had the SSNs (or the resumes themselves) removed. Do you think the service was well received?
Your presentation goal is to assess the responsiveness and usefulness of alerting people to their risk of identity theft. Your presentation should describe this experiment and report the results found.
Your presentation goal is to determine the utility of SSNwatch as a tool for combating identity theft. Report the usefulness of SSNwatch by reporting how many resumes contained supporting information sufficient to check against. [See also Activity C, D and E.]
Another approach is for you to write a program that conducts a brute-force attack by cycling through assigned SSNs to see which appear on-line. You may want to use number ranges that include the SSNs appearing in the ConvertedResumes or LatestResumes databases. Search on each number within a range of legal and useful values surrounding an SSN that appears in one of the databases. Recall that we discussed how SSNs are assigned in an earlier lecture. Not all 9-digit numbers are valid SSNs or likely to belong to people whose SSNs appear on-line. Describe the nature and number of documents found. Among them should be resumes. How many were found? Were any other kinds of documents found?
Rather than the brute-force attack described above on numbers appearing in the format ddd-dd-dddd, you may want to try keywords (such as SSN, Social Security Number, or ID) and/or search for numbers appearing as a sequence of 9 consecutive digits (surrounded by non-digits). Use the first 5 digits of each number to check that the retrieved documents are likely to hold assigned SSNs. Recall that we discussed how SSNs are assigned in an earlier lecture.
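The pattern matching described above (ddd-dd-dddd, or 9 consecutive digits surrounded by non-digits) can be sketched in Java as follows. This is only a minimal illustration, not part of FilteredSearch: the class name SsnCandidateFinder and its methods are hypothetical, and the check on the leading digits is a simplified assumption based on publicly documented numbering rules (area 000, 666, and 900-999, group 00, and serial 0000 are not assigned). It does not reproduce the assignment ranges covered in lecture.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of FilteredSearch): finds SSN-like numbers in text. */
class SsnCandidateFinder {

    // Matches ddd-dd-dddd or nine consecutive digits, with no digit directly
    // before or after the match ("surrounded by non-digits").
    private static final Pattern CANDIDATE =
            Pattern.compile("(?<!\\d)(\\d{3}-\\d{2}-\\d{4}|\\d{9})(?!\\d)");

    /**
     * Simplified structural check on the area (first 3 digits), group (next 2),
     * and serial (last 4). Assumption: area 000, 666, and 900-999, group 00,
     * and serial 0000 are never assigned. A fuller check would compare the
     * first five digits against the issued ranges.
     */
    static boolean isPlausible(String nineDigits) {
        int area = Integer.parseInt(nineDigits.substring(0, 3));
        int group = Integer.parseInt(nineDigits.substring(3, 5));
        int serial = Integer.parseInt(nineDigits.substring(5));
        return area != 0 && area != 666 && area < 900 && group != 0 && serial != 0;
    }

    /** Returns plausible candidates, normalized to nine digits, in order of appearance. */
    static List<String> findCandidates(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            String digits = m.group(1).replace("-", "");
            if (isPlausible(digits)) {
                found.add(digits);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample text; none of these numbers come from the course databases.
        String sample = "SSN: 123-45-6789, phone 4125551234, ID 219099999, "
                + "bad: 000-12-3456, 666-12-3456, 912345678.";
        System.out.println(findCandidates(sample));
    }
}
```

In the sample text, the 10-digit phone number is skipped because its digits are not bounded by non-digits at 9 characters, and the three numbers with unassigned area values are filtered out, leaving only the first and third numbers. Combining this filter with the keyword criteria (SSN, Social Security Number, ID) is left to your own search design.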
Your presentation goal may be any of the following: (a) identify efficient search criteria to harvest on-line SSNs; and/or (b) identify different kinds of on-line documents that contain SSNs. In your presentation, you will provide examples of interesting cases and report overall results.
Your presentation goal is to report on the performance of particular search criteria in locating SSNs. You should show the criteria and the results found. You should also allow visitors to your project to review interesting documents by the kind of search criteria you used.