Trails Learning Project

Compromising Privacy with Trail Re-Identification:
The REIDIT Algorithms

by Bradley Malin

Abstract

Re-identification is the process of relating unique and specific entities to seemingly anonymous data, and as such, is an attack on the privacy of a data collection. This work introduces a new reidentification attack, termed the trail problem, for data distributed over multiple locations. Through the use of data trails an adversary can independently reconstruct the trails of locations that identified entities and their un-identified data visited, which can then employed for re-identification via trail matching. The attack strategy is based on the premise that data collecting institutions partition and release a dataset as multiple subsets, such that one release contains identifying attributes (e.g. name, social security number, phone number) and a second is devoid of these attributes (e.g. DNA sequences). The trail attack is dependent on whether the identified data is always collected with the un-identified data, termed complete, or whether one of the attributes is under-collected, termed incomplete. Both the complete and incomplete trail problems are formalized and several novel algorithms for re-identification are introduced. Examples are drawn from the areas of clickstream, DNA sequence, health, and video data.

Keywords: Re-identification Algorithms, Distributed Databases, Homeland Defense, Security and Privacy, data mining

Citation:
Malin, B. Compromising Privacy with Trail Re-Identification: The REIDIT Algorithms Masters Thesis. Carnegie Mellon University, School of Computer Science, Technical Report, CMU-CALD-02-108. Pittsburgh: December 2002. (23 pages in PDF)

Related Links


Fall 2004 Data Privacy Laboratory [LIDAP@dataprivacylab.org]