Genomic Privacy Project

Determining the Identifiability of DNA Database Entries

by Bradley Malin and Latanya Sweeney


CleanGene is a software program that helps determine the identifiability of sequenced DNA, independent of any explicit demographics or identifiers maintained with the DNA. The program computes the likelihood that released DNA database entries could be related to the specific individuals who are the subjects of the data. The engine within CleanGene relies on publicly available health care data and on knowledge of particular diseases to help relate identified individuals to DNA entries. More than 20 diseases, including ataxias, blood diseases, and sex-linked mutations, are accounted for, with 98-100% of individuals found identifiable. We assume the genetic material is released in a linear sequencing format from an individual's genome. CleanGene and its related experiments are useful tools for any institution seeking to provide anonymous genetic material for research purposes.

Keywords: DNA privacy, genetic privacy, privacy technology

Malin, B. and Sweeney, L. Determining the Identifiability of DNA Database Entries. Proceedings, Journal of the American Medical Informatics Association. Washington, DC: Hanley & Belfus, Inc. Nov 2000; 537-541. Available on MEDLINE.

