Genomic Privacy Project
The incorporation of genomic data into personal medical records in clinical practice and research poses many challenges to patient privacy. In response, various systems for preserving patient privacy in shared genomic data have been developed and deployed. Though these systems de-identify the data by removing explicit identifiers (e.g., name, address, or Social Security number) and incorporate sound security design principles, they lack formal models of the inferences that can be learned from shared data. This paper evaluates the extent to which current protection systems can withstand a range of re-identification methods, including genotype-phenotype inferences, location-visit patterns, family structures, and dictionary attacks. For a comparative re-identification analysis, the systems are mapped to a common formalism. Though susceptibility varies, each system is deficient in its protection capacity. We discover patterns of protection failure and discuss several reasons why these systems are susceptible. The analyses and discussion within provide guideposts for the development of next-generation protection methods amenable to formal proofs.
Keywords: Privacy, Confidentiality, Databases, Genetics, Genomics, Medical Genetics
B. Malin. An Evaluation of the Current State of Genomic Data Privacy Protection Technology and a Roadmap for the Future. Journal of the American Medical Informatics Association. Forthcoming.
Early version, with additional analysis and results:
B. Malin. Why Pseudonyms Don't Anonymize: A Computational Re-identification Analysis of Genomic Data Privacy Protection Systems. Working Paper LIDAP-WP19. Carnegie Mellon University, Data Privacy Laboratory, Pittsburgh, PA. Nov 2003. (pdf) (ps)