Carnegie Mellon University
Data Privacy Course
Project Track 5: Identifiability of IP Addresses
In Lecture 8 and Lab 8, you learned how to perform what I term, in the RosterFinder
work, "filtered searches" on the web.
These kinds of searches are necessary when the keyword strings used for
web searching retrieve a large number of false positives. There are typically
so many unwanted web pages that humans would waste too much time reviewing
the retrieved pages by hand, yet there are no refined search strings that yield
the desired pages only.
In order to locate the pages sought, we write a simple program that filters the
retrieved web pages, thereby identifying the desired pages. In Lecture 8, we looked at
RosterFinder, which used filtered searching as a way to find on-line webpages listing
names of people. In this project track, you will write filtered searches to gather
information about IP addresses.
Assignment 1
The first assignment you must complete if you want to work in this track
is to perform a filtered search on publicly available weblogs.
If you perform a
Google search using robots.txt get, you will see lots of web pages,
many about how to write robots.txt files, but among them will be actual
weblogs as well. An example of a weblog appears at
https://www.ursaoutdoors.com/logs/access.log.
Below are the first few lines of the log.
216.39.48.161 - - [18/Oct/2003:11:54:26 -0700] "GET /robots.txt HTTP/1.1" 404 347 "-" "Scooter/3.2"
216.39.48.161 - - [18/Oct/2003:11:54:29 -0700] "GET /logs/ HTTP/1.1" 200 1367 "-" "Scooter/3.2"
152.163.253.69 - - [18/Oct/2003:18:49:33 -0700] "GET /rainforestgentle2.html HTTP/1.0" 404 343 "https://www.google.com/search?q=free+brids+or+quaker+hl=en&lr=&ie=UTF-8&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows NT 5.1; Hotbar 4.3.5.0)"
The first line informs us that the machine whose IP address was 216.39.48.161 requested
/robots.txt from this site on October 18, 2003, and received a 404 (not found) response.
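The fields in such a line can be pulled apart mechanically. Below is a minimal sketch, assuming the log follows the common/combined layout seen in the sample above (host, two placeholder fields, bracketed timestamp, quoted request, status, size, and optionally quoted referrer and user agent). The names LOG_LINE and parse_log_line are illustrative only, and logs in other layouts will need a different pattern.

```python
import re
from datetime import datetime

# Pattern for the layout shown in the sample lines above.
# Assumption: IPv4 addresses only, and no embedded double quotes in fields.
LOG_LINE = re.compile(
    r'^(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?\s*$'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    # Timestamps look like 18/Oct/2003:11:54:26 -0700
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


sample = ('216.39.48.161 - - [18/Oct/2003:11:54:26 -0700] '
          '"GET /robots.txt HTTP/1.1" 404 347 "-" "Scooter/3.2"')
entry = parse_log_line(sample)
print(entry["ip"], entry["time"].date(), entry["request"], entry["status"])
```

Lines that do not match return None, which makes the same function usable as a building block for deciding whether a page is a weblog at all.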
In this assignment, you will uncover on-line weblogs.
Below is an overview of steps.
- Perform a manual Google search on robots.txt get and examine the
retrieved pages. Pay particular attention to the webpages that are actual weblogs.
Based on your examination, design an algorithm that will detect which of the
retrieved web pages are weblogs and which are not.
- Implement your algorithm from step 1 above. Score the results on at least 120 pages.
You will have to manually examine each of these pages to confirm whether your program
was correct. Record the results. Examine what kinds of errors your program tends to
make.
- Run your program and see how many distinct weblogs you can uncover using reasonable resources.
Record how many pages are examined and how many are classified as weblogs.
Write a 2-page report of your findings. Provide a summary discussion about
the privacy concerns as these findings may relate to trails or IP re-identifications.
Give a 5-minute presentation to the class on your findings.
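Steps 1 and 2 of the overview above can be sketched as follows. This is one possible heuristic, not a prescribed algorithm: it assumes a page is a weblog when at least half of its non-empty lines look like log entries, and the thresholds (min_fraction, min_lines) and function names are hypothetical choices you should tune against your own manual examination. The score function compares the program's output with your manual labels.

```python
import re

# A line "looks like" a log entry if it starts with an IPv4 address followed by
# a bracketed timestamp, a quoted HTTP request, a status code, and a size.
LOG_LIKE = re.compile(
    r'^\d{1,3}(?:\.\d{1,3}){3} \S+ \S+ \[[^\]]+\] '
    r'"(?:GET|POST|HEAD) [^"]*" \d{3} (?:\d+|-)'
)


def is_weblog(page_text, min_fraction=0.5, min_lines=3):
    """Classify a retrieved page as a weblog (True) or not (False)."""
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    if len(lines) < min_lines:
        return False
    hits = sum(1 for ln in lines if LOG_LIKE.match(ln))
    return hits / len(lines) >= min_fraction


def score(predictions, labels):
    """Compare program output with manual labels (both map URL -> bool)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for url, truth in labels.items():
        guess = predictions[url]
        if guess and truth:
            counts["tp"] += 1
        elif guess and not truth:
            counts["fp"] += 1   # program said weblog, human said no
        elif truth:
            counts["fn"] += 1   # program missed a real weblog
        else:
            counts["tn"] += 1
    flagged = counts["tp"] + counts["fp"]
    actual = counts["tp"] + counts["fn"]
    counts["precision"] = counts["tp"] / flagged if flagged else 0.0
    counts["recall"] = counts["tp"] / actual if actual else 0.0
    return counts
```

Recording false positives and false negatives separately, rather than a single accuracy figure, makes it easier to describe what kinds of errors the program tends to make.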
Note. You must complete
this task within the first weeks of the course.
You may complete assignment 1 and then later change your mind about
which project you will in fact provide as your term project, provided
your final decision occurs prior to the second project assignment
and is approved by the instructor.
See the course schedule.
Project 5-1: Assessing Publicly-Available Weblogs
This project continues the work of Assignment 1 by having you
assess the number and nature of publicly available weblogs.
Using your results from assignment 1, report summary information about
the IP addresses appearing in the logs. Below is an overview of the values
you must report:
- Revise your algorithm from assignment 1 if needed to capture as many weblogs
as reasonable.
- Report the number of distinct weblogs and characteristics of the weblogs,
such as the top-level domain (.com, .edu, etc.) of the hosting site.
- Report basic statistics (average, minimum, maximum,
standard deviation) on number of IP addresses per log.
- Report the total number of distinct IP addresses in all logs, and the number
of IP addresses appearing in 1, 2, ..., n weblogs.
- Report on the time frames captured by the weblogs -- that is, note the time period
covered in the log. Report how many cover one day, 7 days, 30 days, and so on.
Report your findings in comprehensive and informative charts.
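The summary values listed above can be computed along these lines. This is a sketch under stated assumptions: each weblog has already been reduced to a set of IP address strings keyed by its URL, the function names are hypothetical, and the standard deviation shown is the population form (statistics.pstdev); use statistics.stdev if you treat your logs as a sample.

```python
import statistics
from collections import Counter
from urllib.parse import urlparse


def top_level_domain(url):
    """Return the last label of the host name, e.g. 'com' or 'edu'."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1]


def span_days(timestamps):
    """Number of calendar days covered by a log, counting both end days."""
    days = [t.date() for t in timestamps]
    return (max(days) - min(days)).days + 1


def summarize(logs):
    """logs maps weblog URL -> set of IP address strings found in that log."""
    per_log = [len(ips) for ips in logs.values()]
    # For each IP, count the number of logs in which it appears.
    appearances = Counter(ip for ips in logs.values() for ip in ips)
    # Histogram: k -> number of IPs that appear in exactly k logs.
    histogram = Counter(appearances.values())
    return {
        "logs": len(logs),
        "by_tld": dict(Counter(top_level_domain(u) for u in logs)),
        "mean": statistics.mean(per_log),
        "min": min(per_log),
        "max": max(per_log),
        "stdev": statistics.pstdev(per_log),
        "distinct_ips": len(appearances),
        "ips_by_log_count": dict(sorted(histogram.items())),
    }


# Made-up example data (documentation-reserved addresses and domains).
example = {
    "http://a.example.com/access.log": {"192.0.2.1", "192.0.2.2"},
    "http://b.example.edu/log": {"192.0.2.2", "192.0.2.3", "192.0.2.4", "192.0.2.5"},
}
print(summarize(example))
```

The dictionary returned by summarize can feed directly into whatever charting tool you use for the report.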
Assignment 2.
Report on your results from steps 1, 2, and 3 above.
Describe your algorithmic design if changed. As you progress in this project,
you may further modify or even abandon your original design.
That is allowed, but for assignment 2 you report your algorithm and
the results it generated.
Final report. Complete the steps above and gather findings.
Be selective in what information you present. You may elect to report on other
aspects not necessarily listed above.
Write a final report for the project and prepare and conduct
an in-person poster presentation of your work.
Graduate credit: If you are taking this course for graduate credit, you must
additionally examine how these findings relate to the trails re-identification
algorithms introduced in lecture 7. Estimate likely results or risks.
Rather than writing a project report, you will write a conference-style paper
on your work.
Project 5-2: Constructing Trails from Publicly Available Weblogs
This project continues the work of Assignment 1 by having you
add additional weblogs based on IP addresses appearing in the original set
of logs.
In assignment 1 you found an initial set of weblogs. In this project you will
grow this set by adding weblogs known to contain at least one of the IP
addresses contained in the set of known weblogs.
Below is an overview of the steps.
- Revise your algorithm from assignment 1 if needed to generate your initial
set of 120 weblogs.
- Write an additional program that, given an IP address, performs a Google
search on the IP address and returns the URLs of the weblogs found that
contain the IP address.
- Using your program from step 2 above, grow your set of weblogs by adding weblogs
to the set only if the weblog was not originally in the set (recall, a set has
no duplicates), and contains an IP address already appearing in a weblog in the
set. Attempt to run your program until no new additional weblogs are added.
- Report the total number of distinct IP addresses in all logs, and the number
of IP addresses appearing in 1, 2, ..., n weblogs.
Report your findings in comprehensive and informative charts.
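The growth procedure in step 3 is a worklist loop that stops when no unsearched IP address remains. Below is a minimal sketch. The two callables are assumptions that stand in for your own programs: fetch_ips(url) returns the set of IP addresses in a weblog, and search_logs_for_ip(ip) is your step 2 program. The max_queries cap is a hypothetical safeguard for keeping resource use reasonable.

```python
from collections import deque


def grow_weblog_set(initial_logs, fetch_ips, search_logs_for_ip, max_queries=1000):
    """Grow a set of weblog URLs until no new weblog is found (or the cap is hit).

    Returns (known_logs, seen_ips).
    """
    known_logs = set(initial_logs)
    seen_ips = set()
    pending = deque()          # IPs that have not been searched yet

    def enqueue(ips):
        for ip in ips:
            if ip not in seen_ips:
                seen_ips.add(ip)
                pending.append(ip)

    for url in known_logs:
        enqueue(fetch_ips(url))

    queries = 0
    while pending and queries < max_queries:
        ip = pending.popleft()
        queries += 1
        for url in search_logs_for_ip(ip):
            if url in known_logs:
                continue       # a set has no duplicates
            known_logs.add(url)
            enqueue(fetch_ips(url))
    return known_logs, seen_ips


# Toy in-memory "web" standing in for real searches (made-up data).
toy_web = {
    "log-a": {"192.0.2.1", "192.0.2.2"},
    "log-b": {"192.0.2.2", "192.0.2.3"},
    "log-c": {"192.0.2.3"},
    "log-d": {"198.51.100.9"},   # shares no IP with the others
}
logs, ips = grow_weblog_set(
    ["log-a"],
    fetch_ips=lambda url: toy_web[url],
    search_logs_for_ip=lambda ip: [u for u, s in toy_web.items() if ip in s],
)
print(sorted(logs))
```

In the toy data, log-d is never reached because it shares no IP address with the starting log, which is the behavior the step describes.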
Assignment 2.
Report on your results from steps 1 and 2 above, working on a few weblogs and
IP addresses. Describe your algorithmic design.
As you progress in this project,
you may further modify or even abandon your original design.
That is allowed, but for assignment 2 you report your algorithm and
some initial results it generated.
Final report. Complete the steps above and gather findings.
Write a final report for the project and prepare and conduct
an in-person poster presentation of your work.
Graduate credit: If you are taking this course for graduate credit, you must
additionally examine how these findings relate to the trails re-identification
algorithms introduced in lecture 7. Estimate likely results or risks.
Rather than writing a project report, you will write a conference-style paper
on your work.
Project 5-3: Identifiability of IP Addresses
There are numerous network tools that provide inferences about an IP address.
In this project, you will use these network tools, as well as other recorded
information that is publicly available, to provide information about IP addresses
found in weblogs.
Below is an overview of steps.
- Research available network tools, such as reverse nslookup and
network connection and allocation lookups, that provide information about
an IP address. Summarize your findings.
- Write a program that, given a weblog, provides the kind of information you
described in step 1 for each IP address found in the log.
- Sometimes an email address can be associated with an IP address if the user
has posted an email message. To uncover such information, write a program
to search the web on an IP address, and identify whether the IP address (or its name)
appears in an email header of an email message that is on-line. If so,
return the email message content.
- Combine your program from step 3 with the one from step 2. Given a weblog,
the enhanced program should report various network information about each IP address
as well as any associated email postings.
- Run your enhanced program from the step above on the IP addresses appearing
in the weblogs you found in assignment 1. Report your results. Note how much and
what kind of information was found on each IP address.
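Two building blocks for the steps above can be sketched as follows. This is an illustration, not a complete solution: reverse_lookup wraps the standard library's socket.gethostbyaddr, which performs a reverse name lookup, and ips_in_email_headers assumes the retrieved text begins with the message's header block and looks only at a hypothetical short list of routing headers (Received, X-Originating-IP, NNTP-Posting-Host). Archived messages embedded in HTML would need to be cleaned first.

```python
import re
import socket

IPV4 = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')
ROUTING_HEADER = re.compile(
    r'^(?:Received|X-Originating-IP|NNTP-Posting-Host):\s*(.*)$', re.IGNORECASE
)


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the host name for ip, or None if the lookup fails."""
    try:
        return resolver(ip)[0]
    except OSError:            # socket.herror and socket.gaierror are OSErrors
        return None


def ips_in_email_headers(message_text):
    """Return the IPv4 addresses that appear in routing headers of a message."""
    found = set()
    in_routing_header = False
    for line in message_text.splitlines():
        if not line.strip():
            break              # first blank line ends the header block
        m = ROUTING_HEADER.match(line)
        if m:
            in_routing_header = True
            found.update(IPV4.findall(m.group(1)))
        elif line[0] in " \t" and in_routing_header:
            # Folded continuation of the previous routing header.
            found.update(IPV4.findall(line))
        else:
            in_routing_header = False
    return found


def ip_in_email_header(message_text, ip):
    """True if ip appears in a routing header (not merely in the body)."""
    return ip in ips_in_email_headers(message_text)


# Made-up message using documentation-reserved names and addresses.
message = (
    "Received: from host.example.org (host.example.org [192.0.2.7])\n"
    "\tby mail.example.com; Sat, 18 Oct 2003 11:54:26 -0700\n"
    "From: someone@example.org\n"
    "Subject: hello\n"
    "\n"
    "The body mentions 198.51.100.4, which should not count.\n"
)
print(ip_in_email_header(message, "192.0.2.7"))
```

Passing the resolver as a parameter lets you test the rest of your program without network access, and lets you substitute other lookup tools from step 1.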
Assignment 2.
Report on your work for steps 1 and 2. Demonstrate your progress by providing
some preliminary examples.
Final report. Complete all the steps above.
Write a final report for the project and prepare and conduct
an in-person poster presentation of your work.
Graduate credit: If you are taking this course for graduate credit, you must
additionally address how identifiable you find IP addresses (or certain kinds
of IP addresses). Rather than writing a project report,
you will write a conference-style paper on your work.
Fall 2003
Privacy and Anonymity in Data
Professor: Latanya Sweeney, Ph.D.
[latanya@dataprivacylab.org]