Trails Learning Project

Trail Re-identification:
Learning Who You are From Where You Have Been

by Bradley Malin, Latanya Sweeney, and Elaine Newton

Demonstration (Internet Explorer only)

Abstract

This work provides algorithms for learning the identities of individuals from the trails of seemingly anonymous information they leave behind. Consider online consumers, who have the IP addresses of their computers logged at each website visited. Many falsely believe they cannot be identified. The term “re-identification” refers to correctly relating seemingly anonymous data to explicitly identifying information (such as the name or address) of the person who is the subject of those data. Re-identification has historically been associated with data released from a single data holder. This work extends the concept to “trail re-identification,” in which a person is related to a trail of seemingly anonymous and homogeneous data left across different locations. The three novel algorithms presented in this work perform trail re-identifications by exploiting the fact that some locations also capture explicitly identifying information and subsequently provide the unidentified data and the identified data as separate data releases. Intersecting occurrences in these two kinds of data can reveal identities. For example, one online consumer may visit 50 websites and purchase at 5, and another may visit 30 sites and purchase at 7. Shared visit logs provide unidentified data. Exchanged customer lists provide identified data. The algorithms presented herein re-identify individuals based on the uniqueness of trails across unidentified and identified datasets. The algorithms differ in the degree of completeness and multiplicity assumed in the data. Successful re-identifications are reported for DNA sequences left by hospital patients and for IP addresses left by online consumers. These algorithms are extensible to tracking collocations of people, which is an objective of homeland defense surveillance.
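The idea of intersecting the two kinds of data can be illustrated with a small sketch. This is not one of the three algorithms from the report; it is a simplified illustration under stated assumptions: each person has exactly one pseudonym (for example, an IP address) in the visit logs, every named customer also appears in the visit logs, and a customer can only purchase at a site they visited, so a name's purchase trail must be a subset of its pseudonym's visit trail. The function name `reidentify` and the sample data are hypothetical.

```python
from typing import Dict, Set


def reidentify(
    unidentified: Dict[str, Set[str]],
    identified: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Link names to pseudonyms whose visit trail uniquely fits.

    unidentified: pseudonym -> set of locations where it was logged.
    identified:   name -> set of locations where the name was captured.
    Assumes a one-to-one mapping between names and pseudonyms (hypothetical
    simplification), so a matched pseudonym can be ruled out for other names.
    """
    matches: Dict[str, str] = {}
    names = dict(identified)
    pseudonyms = dict(unidentified)

    progress = True
    while progress:
        progress = False
        for name, purchased in list(names.items()):
            # A pseudonym is a candidate if its visit trail covers
            # every location where this name was captured.
            candidates = [p for p, visited in pseudonyms.items()
                          if purchased <= visited]
            if len(candidates) == 1:
                # Unique fit: record the link and remove both from play,
                # which may make other names uniquely resolvable.
                matches[name] = candidates[0]
                del names[name]
                del pseudonyms[candidates[0]]
                progress = True
    return matches


# Hypothetical sample data: three pseudonymous visit trails, three customers.
visits = {
    "ip_1": {"A", "B", "C"},
    "ip_2": {"A", "B"},
    "ip_3": {"C", "D"},
}
purchases = {
    "Alice": {"A", "C"},
    "Bob": {"A"},
    "Carol": {"D"},
}

links = reidentify(visits, purchases)
```

In this sample, only `ip_1` covers Alice's trail, so she is linked first. Bob's trail initially fits both `ip_1` and `ip_2`, and becomes unique once `ip_1` is taken. When two pseudonyms have identical trails and neither can be eliminated, the sketch leaves the corresponding names unmatched rather than guessing.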

Keywords: Re-identification Algorithms, Distributed Databases, Homeland Defense, Security and Privacy

Citation:
B. Malin, L. Sweeney, and E. Newton. Trail Re-identification: Learning Who You are From Where You Have Been. Carnegie Mellon University, School of Computer Science, Data Privacy Laboratory Technical Report, LIDAP-WP12. Pittsburgh: February 2003. (10 pages in PS, PDF) Under review for publication.


Fall 2004 Data Privacy Laboratory [LIDAP@dataprivacylab.org]