Trails Learning Project

Compromising Privacy in Distributed Population-Based Databases with Trail Matching: A DNA Example

by Bradley Malin and Latanya Sweeney

Abstract

This paper is concerned with the privacy of person-specific data collected across multiple institutions. In particular, we focus on an example of person-specific DNA sequences collected and stored at various hospitals in a defined geographic region. The applications of human genetics and genomic analysis have generated much discussion of privacy and confidentiality as ethical, legal, and social issues. For the most part, previous analysis has concentrated on the direct application and disclosure of an individual's genetic information; much less attention has been devoted to the computational challenges to privacy in the secondary sharing of de-identified databases (i.e., databases released in a format devoid of directly identifying information, such as name, address, or phone number). We introduce methods for determining the re-identifiability of such DNA data and, in doing so, prove that removing identifying information from DNA data does not sufficiently protect the privacy of the entities from which the data were derived. We demonstrate, through several novel re-identification algorithms, that despite a lack of personal demographic information, such database entries can be re-identified through linkage to other publicly available databases, such as hospital discharge information. The linkage exploits hospital visit and data collection patterns, which we refer to as data trails and which are iteratively discovered from released data collections. Using real-world data, we are able to determine when identifiable linkages can occur for a substantial number of individuals with particular gene-based disorders. Furthermore, we provide empirical analysis of the re-identification algorithms with respect to population-institution visit distributions and data trails.
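To make the idea of a data trail concrete, the following is a minimal sketch in Python. It is a simplified illustration, not one of the algorithms from the report. It assumes that each hospital releases two lists, one of de-identified DNA record keys and one of identified discharge names, and that every patient leaves both kinds of record at every hospital visited. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def build_trails(releases):
    """Map each record key to the set of institutions that released it."""
    trails = defaultdict(set)
    for institution, keys in releases.items():
        for key in keys:
            trails[key].add(institution)
    return {key: frozenset(sites) for key, sites in trails.items()}


def group_by_trail(trails):
    """Invert a key -> trail mapping into trail -> list of keys."""
    groups = defaultdict(list)
    for key, trail in trails.items():
        groups[trail].append(key)
    return groups


def match_unique_trails(dna_releases, discharge_releases):
    """Link a DNA key to a name when both are the only holders of a trail."""
    dna_groups = group_by_trail(build_trails(dna_releases))
    name_groups = group_by_trail(build_trails(discharge_releases))
    matches = {}
    for trail, dna_keys in dna_groups.items():
        names = name_groups.get(trail, [])
        # A link is made only when the trail is unique in both collections.
        if len(dna_keys) == 1 and len(names) == 1:
            matches[dna_keys[0]] = names[0]
    return matches


# Hypothetical toy releases: hospital -> keys appearing in its release.
discharge_releases = {
    "H1": ["alice", "bob"],
    "H2": ["alice", "carol"],
    "H3": ["bob", "carol", "dave", "eve"],
}
dna_releases = {
    "H1": ["d1", "d2"],
    "H2": ["d1", "d3"],
    "H3": ["d2", "d3", "d4", "d5"],
}

matches = match_unique_trails(dna_releases, discharge_releases)
print(matches)
```

In this toy data, three DNA records have a trail shared by exactly one name and are linked, while d4 and d5 share the single-hospital trail of dave and eve and remain ambiguous. The sketch requires exact equality of trails; it does not handle the case where a patient's DNA is collected at only some of the hospitals visited.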

Keywords: Re-identification Algorithms, Distributed Databases, DNA Privacy, Data Mining

Citation:

  • Malin, B. and Sweeney, L. Compromising Privacy in Distributed Population-Based Databases with Trail Matching. Carnegie Mellon University, School of Computer Science, Technical Report, CMU-CS-02-189. Pittsburgh: December 2002. (23 pages in PDF)


    Fall 2004 Data Privacy Laboratory [LIDAP@dataprivacylab.org]