Carnegie Mellon University

Data Privacy Center

Data Privacy Course


Project Track 5: Identifiability of IP Addresses




In Lecture 8 and Lab 8, you learned how to do what I term in the RosterFinder work "filtered searches" on the web. These kinds of searches are necessary when the keyword strings used for web searching retrieve a large number of false positives. There are typically so many unwanted web pages that humans would waste too much time reviewing the retrieved pages by hand, yet there are no refined search strings that yield only the desired pages. To locate the pages sought, we write a simple program that filters the retrieved web pages, thereby identifying the desired pages. In Lecture 8, we looked at RosterFinder, which used filtered searching as a way to find on-line web pages listing names of people. In this project track, you will write filtered searches to gather information about IP addresses instead.

Assignment 1

The first assignment you must complete if you want to work in this track is to perform a filtered search on publicly available weblogs.

If you perform a Google search using the search terms robots.txt get, you will see lots of web pages, many about how to write robots.txt files, but among them will be actual weblogs as well. An example of a weblog appears at https://www.ursaoutdoors.com/logs/access.log. Below are the first few lines of the log.


216.39.48.161 - - [18/Oct/2003:11:54:26 -0700] "GET /robots.txt HTTP/1.1" 404 347 "-" "Scooter/3.2"
216.39.48.161 - - [18/Oct/2003:11:54:29 -0700] "GET /logs/ HTTP/1.1" 200 1367 "-" "Scooter/3.2"
152.163.253.69 - - [18/Oct/2003:18:49:33 -0700] "GET /rainforestgentle2.html HTTP/1.0" 404 343 "https://www.google.com/search?q=free+brids+or+quaker+hl=en&lr=&ie=UTF-8&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows NT 5.1; Hotbar 4.3.5.0)"

The first line informs us that the machine whose IP address was 216.39.48.161 visited this site on October 18, 2003.
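
The fields of such a line can be pulled apart with a regular expression. The sketch below assumes the log follows the common Apache "combined" layout seen in the sample lines above; the names LOG_LINE and parse_log_line are illustrative, and logs found in the wild may use other layouts.

```python
import re
from datetime import datetime

# Assumed layout: IP, two identity fields, [timestamp], "request", status,
# size, and optionally "referrer" "user agent" (as in the sample lines).
LOG_LINE = re.compile(
    r'^(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Timestamps look like 18/Oct/2003:11:54:26 -0700
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    return fields

sample = ('216.39.48.161 - - [18/Oct/2003:11:54:26 -0700] '
          '"GET /robots.txt HTTP/1.1" 404 347 "-" "Scooter/3.2"')
entry = parse_log_line(sample)
print(entry["ip"], entry["time"].date(), entry["request"], entry["status"])
# prints: 216.39.48.161 2003-10-18 GET /robots.txt HTTP/1.1 404
```

Note that the month abbreviation is parsed with the interpreter's current locale, so this sketch assumes an English or C locale.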

In this assignment, you will uncover on-line weblogs. Below is an overview of steps.

  1. Perform a manual Google search on robots.txt get and examine the retrieved pages. Pay particular attention to the webpages that are actual weblogs. Based on your examination, design an algorithm that will detect which of the retrieved web pages are weblogs and which are not.

  2. Implement your algorithm from step 1 above. Score the results on at least 120 pages. You will have to manually examine each of these pages to confirm whether your program was correct. Record the results. Examine what kinds of errors your program tends to make.

  3. Run your program and see how many distinct weblogs you can uncover using reasonable resources. Record how many pages are examined and how many are classified as weblogs.
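
One possible starting point for steps 1 and 2 is sketched below. It is only one of many reasonable designs, not the expected solution: it calls a page a weblog when most of its non-empty lines look like access-log entries, and it tallies the program's answers against your manual labels. The thresholds min_fraction and min_lines and the function names are assumptions chosen for illustration.

```python
import re

# Loose signature of an access-log line:
# IP, two fields, [timestamp], "request", three-digit status.
LOG_SIGNATURE = re.compile(
    r'^\d{1,3}(?:\.\d{1,3}){3} \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} '
)

def looks_like_weblog(page_text, min_fraction=0.8, min_lines=3):
    """Guess whether page_text is a raw access log.

    min_fraction and min_lines are illustrative thresholds, not tuned values.
    """
    lines = [ln for ln in page_text.splitlines() if ln.strip()]
    if len(lines) < min_lines:
        return False
    hits = sum(1 for ln in lines if LOG_SIGNATURE.match(ln.strip()))
    return hits / len(lines) >= min_fraction

def score(predictions, labels):
    """Tally program output against manual labels (two lists of booleans)."""
    pairs = list(zip(predictions, labels))
    return {
        "tp": sum(1 for p, l in pairs if p and l),          # weblog, found
        "fp": sum(1 for p, l in pairs if p and not l),      # not a weblog, flagged
        "fn": sum(1 for p, l in pairs if not p and l),      # weblog, missed
        "tn": sum(1 for p, l in pairs if not p and not l),  # not a weblog, rejected
    }
```

The false positives and false negatives in the tally are the pages to study when you examine what kinds of errors your program tends to make.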

Write a 2-page report of your findings. Provide a summary discussion about the privacy concerns as these findings may relate to trails or IP re-identifications. Give a 5-minute presentation to the class on your findings.

Note. You must complete this task within the first weeks of the course. You may complete assignment 1 and then later change your mind about which project you will in fact provide as your term project, provided your final decision occurs prior to the second project assignment and is approved by the instructor. See the course schedule.


Project 5-1: Assessing Publicly-Available Weblogs

This project continues on the work in Assignment 1, by having you assess the number and nature of publicly available weblogs.

Using your results from assignment 1, report summary information about the IP addresses appearing in the logs. Below is an overview of the values you must report:

  1. Revise your algorithm from assignment 1 if needed to capture as many weblogs as reasonable.
  2. Report the number of distinct weblogs and characteristics about the weblogs, such as .com, .edu, etc.
  3. Report basic statistics (average, minimum, maximum, standard deviation) on number of IP addresses per log.
  4. Report the total number of distinct IP addresses in all logs, and the number of IP addresses appearing in 1, 2, ..., n weblogs.
  5. Report on the time frames captured by the weblogs -- that is, note the time period covered in the log. Report how many cover one day, 7 days, 30 days, and so on.
Report your findings in comprehensive and informative charts.
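
The counts asked for in steps 3 and 4 can be computed along the following lines. This is a minimal sketch that assumes you have already extracted the set of IP addresses from each weblog; the function name summarize, the dictionary layout, and the demo data (placeholder URLs and partly reserved documentation addresses) are made up for illustration.

```python
import statistics
from collections import Counter

def summarize(ips_by_log):
    """ips_by_log maps a weblog URL to the set of IP addresses found in it."""
    counts = [len(ips) for ips in ips_by_log.values()]
    # In how many logs does each IP address appear?
    logs_per_ip = Counter(ip for ips in ips_by_log.values() for ip in ips)
    # How many IP addresses appear in exactly 1, 2, ..., n logs?
    ips_by_number_of_logs = Counter(logs_per_ip.values())
    return {
        "logs": len(counts),
        "mean": statistics.mean(counts),
        "min": min(counts),
        "max": max(counts),
        # Sample standard deviation; undefined for a single log, so report 0.0
        "stdev": statistics.stdev(counts) if len(counts) > 1 else 0.0,
        "distinct_ips": len(logs_per_ip),
        "ips_by_number_of_logs": dict(sorted(ips_by_number_of_logs.items())),
    }

# Made-up demo data, not real findings.
demo = {
    "http://a.example/access.log": {"216.39.48.161", "152.163.253.69"},
    "http://b.example/access.log": {"216.39.48.161"},
    "http://c.example/access.log": {"216.39.48.161", "192.0.2.7", "192.0.2.8"},
}
summary = summarize(demo)
```

In the demo, one address appears in all three logs and three addresses appear in one log each, so ips_by_number_of_logs is {1: 3, 3: 1}.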

Assignment 2. Report on your results from steps 1, 2, and 3 above. Describe your algorithmic design if changed. As you progress in this project, you may further modify or even abandon your original design. That is allowed, but for assignment 2 you report your algorithm and the results it generated.

Final report. Complete the steps above and gather findings. Be selective in what information you present. You may elect to report on other aspects not necessarily listed above. Write a final report for the project and prepare and conduct an in-person poster presentation of your work.

Graduate credit: If you are taking this course for graduate credit, you must additionally examine how these findings relate to the trails re-identification algorithms introduced in lecture 7. Estimate likely results or risks. Rather than writing a project report, you will write a conference-style paper on your work.


Project 5-2: Constructing Trails from Publicly Available Weblogs

This project continues on the work in Assignment 1, by having you add additional weblogs based on IP addresses appearing in the original set of logs.

In assignment 1 you found an initial set of weblogs. In this project you will grow this set by adding weblogs known to contain at least one of the IP addresses contained in the set of known weblogs. Below is an overview of the steps.

  1. Revise your algorithm from assignment 1 if needed to generate your initial set of 120 weblogs.

  2. Write an additional program that, given an IP address, performs a Google search on the IP address and returns the URLs of the weblogs found that contain the IP address.

  3. Using your program from step 2 above, grow your set of weblogs by adding a weblog only if it is not already in the set (recall, a set has no duplicates) and it contains an IP address already appearing in a weblog in the set. Attempt to run your program until no new weblogs are added.

  4. Report the total number of distinct IP addresses in all logs, and the number of IP addresses appearing in 1, 2, ..., n weblogs.

Report your findings in comprehensive and informative charts.
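
The set-growing procedure in step 3 is a fixed-point computation: keep searching on newly seen IP addresses until no new weblog turns up. A minimal sketch follows. The two helpers passed in, find_logs_for_ip and extract_ips, are hypothetical stand-ins for your own search and parsing programs, and max_logs is an assumed safety cap rather than a requirement of the project.

```python
from collections import deque

def grow_weblog_set(seed_logs, find_logs_for_ip, extract_ips, max_logs=1000):
    """Expand seed_logs until no new weblog shares an IP address with the set.

    find_logs_for_ip(ip) -> iterable of weblog URLs containing ip (hypothetical)
    extract_ips(url)     -> set of IP addresses in that weblog (hypothetical)
    """
    known_logs = set(seed_logs)
    seen_ips = set()
    pending = deque()  # IP addresses not yet searched on

    def enqueue_ips(url):
        for ip in extract_ips(url):
            if ip not in seen_ips:
                seen_ips.add(ip)
                pending.append(ip)

    for url in list(known_logs):
        enqueue_ips(url)

    while pending and len(known_logs) < max_logs:
        ip = pending.popleft()
        for url in find_logs_for_ip(ip):
            if url in known_logs:
                continue  # a set has no duplicates
            known_logs.add(url)
            enqueue_ips(url)
    return known_logs, seen_ips

# In-memory stand-in for the web, using reserved documentation addresses.
FAKE_WEB = {
    "log-a": {"192.0.2.1", "192.0.2.2"},
    "log-b": {"192.0.2.2", "192.0.2.3"},
    "log-c": {"192.0.2.3"},
    "log-d": {"198.51.100.9"},
}
logs, ips = grow_weblog_set(
    {"log-a"},
    lambda ip: [url for url, found in FAKE_WEB.items() if ip in found],
    lambda url: FAKE_WEB[url],
)
```

Starting from log-a, the demo reaches log-b and log-c through shared addresses but never log-d, which shares no address with the others.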

Assignment 2. Report on your results from steps 1 and 2 above, working on a few weblogs and IP addresses. Describe your algorithmic design. As you progress in this project, you may further modify or even abandon your original design. That is allowed, but for assignment 2 you report your algorithm and some initial results it generated.

Final report. Complete the steps above and gather findings. Write a final report for the project and prepare and conduct an in-person poster presentation of your work.

Graduate credit: If you are taking this course for graduate credit, you must additionally examine how these findings relate to the trails re-identification algorithms introduced in lecture 7. Estimate likely results or risks. Rather than writing a project report, you will write a conference-style paper on your work.


Project 5-3: Identifiability of IP Addresses

There are numerous network tools that provide inferences about an IP address. In this project, you will use these network tools, as well as other recorded information that is publicly available, to provide information about IP addresses found in weblogs. Below is an overview of steps.

  1. Research available network tools, such as reverse nslookup and sources of network connection and allocation information, that provide information about an IP address. Summarize your findings.

  2. Write a program that, given a weblog, provides the kind of information you described in step 1 for each IP address found in the log.

  3. Sometimes an email address can be associated with an IP address if the user has posted an email message. To uncover such information, write a program to search the web on an IP address, and identify whether the IP address (or its name) appears in an email header of an email message that is on-line. If so, return the email message content.

  4. Incorporate your program in step 3 with the one in step 2. Given a weblog, the enhanced program should report various network information about each IP address as well as any associated email postings.

  5. Run your enhanced program from the step above on the IP addresses appearing in the weblogs you found in assignment 1. Report your results. Note how much and what kind of information was found on each IP address.
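
Two small building blocks for steps 2 and 3 are sketched below, using only the Python standard library. The first does a reverse name lookup on an IP address; the second checks whether an address (or its host name) appears in the Received or X-Originating-IP headers of an email message and, if so, returns the message content. The function names, the choice of headers, and the sample message are assumptions for illustration; fetching messages from the web and the other network tools from step 1 are left out.

```python
import re
import socket
from email import message_from_string

IPV4 = re.compile(r'\b\d{1,3}(?:\.\d{1,3}){3}\b')

def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the host name registered for ip, or None if the lookup fails.

    resolver can be replaced for testing; by default it queries the network.
    """
    try:
        hostname, _aliases, _addresses = resolver(ip)
        return hostname
    except OSError:  # covers socket.herror and socket.gaierror
        return None

def email_content_for_ip(raw_message, ip, hostname=None):
    """Return the message body if ip or hostname is in the headers, else None."""
    message = message_from_string(raw_message)
    headers = (message.get_all("Received", [])
               + message.get_all("X-Originating-IP", []))
    for value in headers:
        # Compare whole addresses so that 192.0.2.4 does not match 192.0.2.44
        if ip in IPV4.findall(value):
            return message.get_payload()  # a string for single-part messages
        if hostname and hostname.lower() in value.lower():
            return message.get_payload()
    return None

# Made-up message using reserved documentation addresses and domains.
SAMPLE = (
    "Received: from dialup.example.net (dialup.example.net [192.0.2.44])\n"
    "\tby mail.example.org; Sat, 18 Oct 2003 11:54:26 -0700\n"
    "From: someone@example.org\n"
    "Subject: test\n"
    "\n"
    "Body text mentioning 192.0.2.99 is not a header match.\n"
)
found = email_content_for_ip(SAMPLE, "192.0.2.44")
```

Only header lines are searched, so an address that merely appears in the body of a message is not treated as evidence of who sent it.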

Assignment 2. Report on your work for steps 1 and 2. Demonstrate your progress by providing some preliminary examples.

Final report. Complete all the steps above. Write a final report for the project and prepare and conduct an in-person poster presentation of your work.

Graduate credit: If you are taking this course for graduate credit, you must additionally address how identifiable you find IP addresses (or certain kinds of IP addresses). Rather than writing a project report, you will write a conference-style paper on your work.


Fall 2003 Privacy and Anonymity in Data
Professor: Latanya Sweeney, Ph.D. [latanya@dataprivacylab.org]