Trails Learning Project
Keywords: Re-identification Algorithms, Distributed Databases, DNA privacy, data mining
Citation:
Abstract
This paper is concerned with the privacy of person-specific data collected over multiple institutions. In
particular, we focus on an example of person-specific DNA sequences collected and stored at various
hospitals in a defined geographic region. The applications of human genetics and genomic analysis have
generated much discussion with respect to privacy and confidentiality in ethical, legal, and social issues.
For the most part, previous analysis has concentrated on the direct application and disclosure of an
individual's genetic information; much less attention has been devoted to the computational
challenges to privacy that arise in the secondary sharing of de-identified databases (i.e., databases released in
a format devoid of directly identifying information, such as name, address, or phone number). We
introduce methods for determining the re-identifiability of such DNA data and, in the process of doing so,
prove that removing identifying information from DNA data does not sufficiently protect the privacy of
the entities from which the data was derived. We demonstrate, through several novel re-identification
algorithms, that despite a lack of personal demographic information, such database entries can be
re-identified by linking them to other publicly available databases, such as hospital discharge information.
The linkage relies on patterns of hospital visits and data collection, which we refer to as data trails and
which are iteratively discovered from released data collections. Using real-world data, we are able to determine
when identifiable linkages can occur for a substantial number of individuals with particular gene-based
disorders. Furthermore, we provide empirical analysis of the re-identification algorithms with respect to
population-institution visit distributions and data trails.
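To make the idea of a data trail concrete, the following is a minimal sketch and not a reproduction of the paper's algorithms. It assumes the simplest case: every hospital releases both a de-identified DNA collection and identified discharge records, so each entity's trail is the exact set of hospitals where it appears. The function names, hospital labels, sample IDs, and patient names are all hypothetical.

```python
from collections import defaultdict


def build_trails(records):
    """Map each subject to the set of hospitals where it appears.

    `records` is an iterable of (hospital, subject) pairs. The resulting
    frozenset of hospitals is that subject's data trail.
    """
    trails = defaultdict(set)
    for hospital, subject in records:
        trails[subject].add(hospital)
    return {subject: frozenset(h) for subject, h in trails.items()}


def link_unique_trails(dna_trails, identified_trails):
    """Link a DNA sample to a named patient when their trails match uniquely.

    Assumption for this sketch: a link is made only if exactly one DNA
    sample and exactly one named patient share the same trail.
    """
    samples_by_trail = defaultdict(list)
    for sample, trail in dna_trails.items():
        samples_by_trail[trail].append(sample)

    names_by_trail = defaultdict(list)
    for name, trail in identified_trails.items():
        names_by_trail[trail].append(name)

    links = {}
    for trail, samples in samples_by_trail.items():
        names = names_by_trail.get(trail, [])
        if len(samples) == 1 and len(names) == 1:
            links[samples[0]] = names[0]
    return links


# Hypothetical releases: (hospital, de-identified sample) and (hospital, patient).
dna_records = [("H1", "d1"), ("H2", "d1"), ("H1", "d2"), ("H3", "d3"), ("H3", "d4")]
discharge_records = [
    ("H1", "Alice"), ("H2", "Alice"), ("H1", "Bob"), ("H3", "Carol"), ("H3", "Dan"),
]

links = link_unique_trails(build_trails(dna_records), build_trails(discharge_records))
print(links)
```

In this toy data, samples d1 and d2 have trails that no other sample shares, so each links to a single patient, while d3 and d4 share the trail {H3} with two patients and remain ambiguous. The sketch ignores incomplete trails, where a hospital releases discharge data but no DNA for a given patient.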
Related Links