Trails Learning Project |
Keywords: Re-identification Algorithms, Distributed Databases, Homeland Defense, Security and Privacy, data mining
Citation:
Abstract
Re-identification is the process of relating unique and specific entities to seemingly anonymous
data, and as such, is an attack on the privacy of a data collection. This work introduces a new reidentification
attack, termed the trail problem, for data distributed over multiple locations.
Through the use of data trails an adversary can independently reconstruct the trails of locations that
identified entities and their un-identified data visited, which can then employed for re-identification
via trail matching. The attack strategy is based on the premise that data collecting institutions
partition and release a dataset as multiple subsets, such that one release contains identifying
attributes (e.g. name, social security number, phone number) and a second is devoid of these
attributes (e.g. DNA sequences). The trail attack is dependent on whether the identified data is
always collected with the un-identified data, termed complete, or whether one of the attributes is
under-collected, termed incomplete. Both the complete and incomplete trail problems are
formalized and several novel algorithms for re-identification are introduced. Examples are drawn
from the areas of clickstream, DNA sequence, health, and video data.
Malin, B.
Compromising Privacy with Trail Re-Identification:
The REIDIT Algorithms
Masters Thesis. Carnegie Mellon University, School of Computer Science, Technical Report, CMU-CALD-02-108. Pittsburgh: December 2002.
(23 pages in PDF)
Related Links