Trails Learning Project

How (Not) to Protect Genomic Data Privacy in a Distributed Network: Using Trail Re-identification to Evaluate and Design Anonymity Protection Systems

by Bradley Malin and Latanya Sweeney

Abstract

The increasing integration of patient-specific genomic data into clinical practice and research raises serious privacy concerns. Various systems have been proposed that protect privacy by removing or encrypting explicitly identifying information, such as name or social security number, into pseudonyms. Though these systems claim to protect identity from being disclosed, they lack formal proofs. In this paper, we study the erosion of privacy when genomic data, either pseudonymous or data believed to be anonymous, is released into a distributed healthcare environment. Several algorithms are introduced, collectively called RE-Identification of Data In Trails (REIDIT), which link genomic data to named individuals in publicly available records by leveraging unique features in patient-location visit patterns. Algorithmic proofs of re-identification are developed and we demonstrate, with experiments on real-world data, that susceptibility to re-identification is neither trivial nor the result of bizarre isolated occurrences. We propose that such techniques can be applied as system tests of privacy protection capabilities.

Keywords: Privacy, Anonymity, Re-identification Algorithms, Distributed Databases, Genomics, DNA Databases
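
As a rough illustration of the trail idea in the abstract, the sketch below links de-identified DNA records to names whenever a visit pattern (the set of locations where a record appears) is shared by exactly one DNA record and exactly one name. This is a simplified sketch, not the REIDIT algorithms from the paper: it assumes every location releases both a de-identified DNA list and an identified list covering the same patients, and all function names, record keys, and sample data are hypothetical.

```python
from collections import defaultdict


def build_trails(releases):
    """Map each record key to the frozenset of locations that released it.

    `releases` maps a location name to the set of record keys it released.
    """
    trails = defaultdict(set)
    for location, keys in releases.items():
        for key in keys:
            trails[key].add(location)
    return {key: frozenset(locs) for key, locs in trails.items()}


def group_by_trail(trails):
    """Invert a key -> trail mapping into trail -> sorted list of keys."""
    groups = defaultdict(list)
    for key, trail in trails.items():
        groups[trail].append(key)
    return {trail: sorted(keys) for trail, keys in groups.items()}


def link_unique_trails(dna_releases, identity_releases):
    """Link a DNA record to a name when their shared trail is unique on both sides."""
    dna_groups = group_by_trail(build_trails(dna_releases))
    name_groups = group_by_trail(build_trails(identity_releases))
    links = {}
    for trail, dna_keys in dna_groups.items():
        names = name_groups.get(trail, [])
        # Only an unambiguous one-to-one trail match counts as a link.
        if len(dna_keys) == 1 and len(names) == 1:
            links[dna_keys[0]] = names[0]
    return links


# Hypothetical releases from three hospitals (illustrative data only).
dna_releases = {
    "H1": {"d1", "d2"},
    "H2": {"d1", "d3"},
    "H3": {"d2", "d3", "d4", "d5"},
}
identity_releases = {
    "H1": {"alice", "bob"},
    "H2": {"alice", "carol"},
    "H3": {"bob", "carol", "dave", "erin"},
}

links = link_unique_trails(dna_releases, identity_releases)
```

In this toy data, three records have a visit pattern that no other patient shares and are linked, while the two patients who visited only H3 stay ambiguous. Applied to a proposed release, a check of this kind is one way to read the abstract's suggestion of using re-identification as a system test.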

Citation:
B. Malin and L. Sweeney. How (Not) to Protect Genomic Data Privacy in a Distributed Network: Using Trail Re-identification to Evaluate and Design Anonymity Protection Systems. Journal of Biomedical Informatics. 2004; 37(3): 179-192. (PDF).

An early version is available as: B. Malin and L. Sweeney. How (Not) to Protect Genomic Data Privacy in a Distributed Network: Using Trail Re-identification to Evaluate and Design Privacy Protection Systems. Technical Report CMU-ISRI-04-115, School of Computer Science, Carnegie Mellon University. Pittsburgh, PA: April 2004. (pdf) (ps)

Best of the Year Award! Appearing in the 2006 Yearbook of Medical Informatics

We are pleased to announce that this paper received one of the highest honors possible for a paper in medical informatics: inclusion in the Yearbook of Medical Informatics, which selects the "best of the year" among all peer-reviewed journal papers published in the field.

Here is how they describe the selection process: "Papers are selected based on a systematic search of all peer-reviewed medical informatics journal publications between April 2004 and March 2005. For each area within medical informatics, this search is coordinated by one managing editor. Around 150 papers are pre-selected this way each year. Each pre-selected paper is then internationally reviewed by three additional reviewers. Based on those reviews, the editors and the respective managing editors decide on inclusion of papers in the next yearbook. This final selection is usually done around September (e.g. September 2005 for the Yearbook 2006 which will then appear in March 2006). Congratulations..."

Fall 2004 Data Privacy Laboratory [LIDAP@dataprivacylab.org]