Carnegie Mellon University
Data Privacy Course
Project Track 1: Privacy Concerns of the Past
In Lab 1, you searched the New York Times Historical Archive for articles that contained the word privacy. You then summarized each article in terms of the privacy issues it raised. You also identified the entities (people or organizations) involved and the notion of 'privacy space' that was the subject of the article. In this project, you will build tools that perform these tasks semi-automatically. The end result is an indexed abstract of privacy articles from the archive.
Assignment 1
The first project assignment, which you must complete if you want to do a project in this track, is to extract all the privacy articles from the on-line archive. You may use any semi-automated, automated, or even manual system that works best for you. Save a local copy of each article that contains the word 'privacy' for further processing.
Build a database containing the extracted articles. The fields in the database should include title and citation sub-fields, as well as date and content. These fields should be correctly filled with the information from the extracted articles.
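As a rough illustration of the fields described above, here is a minimal sketch using Python's built-in sqlite3 module. The table name, the column names, and the choice of "section" and "page" as citation sub-fields are assumptions made for this example; the assignment does not prescribe a schema or a database system.

```python
import sqlite3

def create_article_db(path=":memory:"):
    """Open (or create) a database holding one row per extracted article."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS articles (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            citation_section TEXT,   -- hypothetical citation sub-field
            citation_page TEXT,      -- hypothetical citation sub-field
            pub_date TEXT NOT NULL,  -- stored as an ISO date string (YYYY-MM-DD)
            content TEXT NOT NULL
        )
        """
    )
    return conn

def add_article(conn, title, section, page, pub_date, content):
    """Insert one article and return its row id."""
    cur = conn.execute(
        "INSERT INTO articles "
        "(title, citation_section, citation_page, pub_date, content) "
        "VALUES (?, ?, ?, ?, ?)",
        (title, section, page, pub_date, content),
    )
    conn.commit()
    return cur.lastrowid
```

Storing dates as ISO strings keeps them sortable and makes per-year or per-decade grouping straightforward later on.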
From your database, report summary statistics on the abstracted articles. Validate your results using searches from the original database.
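One possible starting point for the summary statistics is a count of articles per decade, together with a sanity check that every stored article really contains the search word. The sketch below assumes each article is a dictionary with "title", "date" (an ISO date string), and "content" keys; those names are illustrative and not part of the assignment.

```python
from collections import Counter

def articles_per_decade(records):
    """Count articles by the decade of their publication date."""
    counts = Counter()
    for rec in records:
        year = int(rec["date"][:4])      # first four characters are the year
        counts[year - year % 10] += 1    # e.g. 1923 falls in the 1920 bucket
    return dict(sorted(counts.items()))

def missing_keyword(records, keyword="privacy"):
    """Return titles of articles whose content lacks the keyword."""
    return [
        rec["title"]
        for rec in records
        if keyword not in rec["content"].lower()
    ]
```

Comparing the per-decade counts with the hit counts returned by equivalent date-restricted searches in the original archive is one way to carry out the validation step.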
Submit a text dump of your database by FTP'ing the results into your space on dataprivacylab.org. The text dump may have one file per article or all articles in one file, whichever is easiest for you. Submit your dump on the day Project Assignment 1 is due.
Submit a one-page abstract describing your methods and statistics of results. Send the abstract to paddataprivacylab.org by the due date for Project Assignment 1.
Note. You may complete Assignment 1 and later change your mind about which project you will pursue as your term project, provided your final decision is made before the second project assignment and is approved by the instructor. See the course schedule.
Note. You will have to provide a public 5-minute presentation of your work to the course. You may elect to present your results from assignment 1 or 2.
Project: Classifying Privacy Articles
In the remainder of this project, you will develop an annotated database to facilitate more intelligent retrieval of articles. The resulting database should enable more insight into the nature of privacy concerns over time than was possible by searching the original database.
You may elect to do any one of the following:
- Write a system that extracts summaries in terms of the privacy issues addressed in each article. (We can provide some ideas on methods. Summaries do not have to be well-formed English sentences.) Append a summary to each article in your database. The summaries can then be used for rapid review of articles, especially when a search returns numerous articles. The summaries can also enhance the search itself.
- Write a system that automatically appends descriptive categorical information to each article in your database. You may use the categories and tags used in Lab 1 (see https://dataprivacylab.org/dataprivacy/projects/news/submit.html). (We can provide some ideas on methods.) Given an article, your system will automatically annotate the article with the appropriate tags. Because each tag is likely to represent a concept that may be realized by many different kinds of words appearing in the text, having a tagged database will allow more generalized searching.
- Write a system that finds like articles using a statistical clustering technique. (We can provide some ideas on methods.) Your resulting system should allow a user to select some articles of interest, and then your system will find similar articles to those selected. Having such a system should allow for more targeted searches.
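To give a flavor of the third option, the sketch below ranks articles by similarity to a user's selection. It uses TF-IDF weighting with cosine similarity, which is a simple stand-in chosen for this example rather than a full clustering method and is not a technique prescribed by the course. All function names are hypothetical, and articles are assumed to be plain strings.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    term_counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())   # each document counts once per term
    # Smoothed inverse document frequency, so no weight is zero.
    return [
        {t: tf * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0)
         for t, tf in counts.items()}
        for counts in term_counts
    ]

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(docs, selected, top_k=3):
    """Rank unselected documents by similarity to the selected ones."""
    vectors = tfidf_vectors(docs)
    query = Counter()
    for i in selected:                   # sum the selected vectors
        for t, w in vectors[i].items():
            query[t] += w
    scored = [
        (cosine(query, vectors[i]), i)
        for i in range(len(docs))
        if i not in selected
    ]
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_k]]
```

Calling most_similar(docs, [0]) returns the indices of the articles closest to the first one. A real system would likely add stop-word removal and stemming, and could feed the same vectors into a clustering algorithm.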
Assignment 2. Provide a description of your method and the algorithm you will use for your system. Include some initial results. Discuss what you perceive as the advantages and disadvantages of your approach. Submit a summary report (3 to 5 pages) by email. Include the text that provided the basis of your initial results as an appendix. Send your report to pad@dataprivacylab.org.
Final report. Feel free to revise and modify your method and algorithm as you deem appropriate. Use your final version on all the articles in your database. Review the results by providing meaningful descriptive statistics. Also, analyze the usefulness of your resulting database in terms of the uses you foresee. Report on a couple of interesting facts learned from your final database. Submit your final report by email to pad@dataprivacylab.org. Provide your supporting final database in a text format by FTP'ing the results to your personal space on dataprivacylab.org.
Graduate credit: If you are taking this course for graduate credit, you must also provide a rigorous analysis of the results in terms of the methods used, as well as in terms of the benefits afforded by your new database. Rather than writing a project report, you will write a conference-style paper on your work.
Fall 2004
Privacy and Anonymity in Data
Professor: Latanya Sweeney, Ph.D.
[latanya@dataprivacylab.org]