Trails Learning Project




The Trails Learning Project

Work in the Data Privacy Lab on the Trails Learning Project has sought to answer the question:

How can people be identified from the trail of seemingly innocent and anonymous data they leave behind at different locations?
Here are some examples.

DNA
Consider a set of hospitals and a set of patients with gene-based diseases who visit them. As part of their clinical experience, patients leave behind sequenced DNA at some of the hospitals they visit. How, in the real world, can these patients be re-identified to their DNA sequences based on the hospitals visited?
Answers: see Trail re-identification of DNA sequences, and How (not) to protect genomic data privacy in a distributed network.

Web logs
Given a set of websites, each of which provides a weblog (a list of IP addresses recorded from machines visiting the site), how can the people using those machines be identified?
Answers: see Trail re-identification of on-line consumers using IP addresses.
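To make the notion of a trail concrete, here is a minimal Python sketch that turns per-site weblogs into one trail per IP address, where a trail is simply the set of sites at which that address was recorded. The function name `build_trails`, the site names, and the IP addresses are all hypothetical and chosen only for illustration; they do not come from the cited papers.

```python
from collections import defaultdict


def build_trails(logs):
    """Map each identifier to the set of locations whose log contains it.

    logs: dict mapping a location name to the list of identifiers
          (here, IP addresses) recorded at that location.
    """
    trails = defaultdict(set)
    for location, identifiers in logs.items():
        for identifier in identifiers:
            trails[identifier].add(location)
    return dict(trails)


# Hypothetical weblogs: each site lists the IP addresses it recorded.
weblogs = {
    "site_a": ["10.0.0.1", "10.0.0.2"],
    "site_b": ["10.0.0.1"],
    "site_c": ["10.0.0.2", "10.0.0.3"],
}

ip_trails = build_trails(weblogs)
for ip, sites in sorted(ip_trails.items()):
    print(ip, sorted(sites))
```

Each trail is still anonymous at this point: it says where an address appeared, not who was using the machine.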

Generalized solutions (the REID-IT algorithms):

Each of the solutions provided above exploits situations in which there also exists a parallel (though incomplete and perhaps erroneous) log of visits to the locations, one in which the subjects are identified. Given this parallel identified log, the trails learning problem becomes one of matching the trails of identified information to the trails of de-identified information. For example, publicly available hospital data provides demographics on each hospital patient, and publicly available marketing lists identify on-line consumers who have made purchases at particular websites. Using this information, a trail can be constructed that lists the places the identified subject is known to have visited. Algorithms are then provided that match the identified trails (those in which subjects are known) to the de-identified trails (such as DNA sequences or IP addresses, in which the subjects are not known).
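The matching idea can be sketched in a few lines of Python. This is a simplified illustration, not the published REID-IT algorithms. It assumes that the identified log is incomplete but not erroneous (so a subject's known locations are a subset of the locations in that subject's de-identified trail) and that each de-identified trail belongs to at most one subject. The function name `match_trails` and all sample data are hypothetical.

```python
def match_trails(identified, deidentified):
    """Link identified subjects to de-identified trails when unambiguous.

    identified:   name -> set of locations the subject is known to have visited
    deidentified: pseudonym -> set of locations where the datum was recorded

    Assumption (not from the source): a correct match requires the identified
    trail to be a subset of the de-identified trail, and links are one-to-one.
    """
    links = {}
    unresolved = dict(identified)
    remaining = dict(deidentified)
    progress = True
    while progress:
        progress = False
        for name, known in list(unresolved.items()):
            # Candidates are the unlinked trails that cover every known visit.
            candidates = [p for p, trail in remaining.items() if known <= trail]
            if len(candidates) == 1:
                links[name] = candidates[0]
                del unresolved[name]
                # A linked trail can no longer be a candidate for anyone else.
                del remaining[candidates[0]]
                progress = True
    return links


# Hypothetical identified trails, e.g. purchases taken from marketing lists.
identified = {
    "Alice": {"site_a", "site_b"},
    "Bob": {"site_a"},
    "Carol": {"site_c"},
}
# Hypothetical de-identified trails, e.g. IP addresses seen in weblogs.
deidentified = {
    "10.0.0.1": {"site_a", "site_b", "site_c"},
    "10.0.0.2": {"site_a"},
    "10.0.0.3": {"site_c"},
}

print(match_trails(identified, deidentified))
```

In this example Bob's single known visit is initially compatible with two trails, and it becomes unambiguous only after Alice's trail has been linked and removed. Subjects whose trails remain compatible with more than one candidate are left unlinked rather than guessed.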

Solutions: See the REID-IT algorithms.

Keywords: trail re-identification, trail linkage, trail matching, data linkage, entity resolution, "connect the dots", data inference, data integration

Related Publications

Malin, B. and Sweeney, L. How (not) to protect genomic data privacy in a distributed network: using trail re-identification to evaluate and design anonymity protection systems. Journal of Biomedical Informatics. 2004; 37(3): 179-192. Also available on MEDLINE.

Malin, B., Sweeney, L., and Newton, E. Trail Re-identification: Learning Who You are From Where You Have Been. Carnegie Mellon University, School of Computer Science, Data Privacy Laboratory Technical Report, LIDAP-WP12. Pittsburgh: February 2003.

Malin, B. Compromising Privacy with Trail Re-Identification: the REIDIT algorithms. Master's Thesis. Carnegie Mellon University, School of Computer Science, Technical Report, CMU-CALD-02-108. Pittsburgh: December 2002.

Malin, B. and Sweeney, L. Compromising Privacy in Distributed Population-Based Databases with Trail Matching. Carnegie Mellon University, School of Computer Science, Technical Report, CMU-CS-02-189. Pittsburgh: December 2002.

Malin, B. and Sweeney, L. Re-Identification of DNA through an Automated Linkage Process. Proceedings, Journal of the American Medical Informatics Association. Washington, DC: Hanley & Belfus, Inc. Nov 2001; 423-427.

Malin, B. and Sweeney, L. Composition and disclosure of unlinkable distributed databases. 22nd IEEE International Conference on Data Engineering, Atlanta, GA, April 2006.

Malin, B. and Sweeney, L. A secure protocol to distribute unlinkable health data. Proceedings, Journal of the American Medical Informatics Association (AMIA). Washington, DC. Oct 2005: 485-489.

Related Links

Genomic Privacy Project
The Watchlist Problem
Data Privacy Lab
Identifying Computer Science Undergraduates (ICU)