Scrub: de-identification of textual documents

by Latanya Sweeney, Ph.D.


We define a new approach to locating and replacing personally-identifying information in unrestricted text that extends beyond straight search-and-replace procedures, and we provide techniques for minimizing risk to patient confidentiality. Global search and replace alone located only 30-60% of the personally-identifying information that appeared explicitly in letters between physicians and in notes written by clinicians within a pediatric database. By contrast, our Scrub system found 99-100% of these references. Scrub uses detection algorithms that employ templates and specialized knowledge of what constitutes a name, address, phone number, and so forth.
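The gap between the two approaches can be illustrated with a small sketch. It is not the Scrub system itself: the abstract does not give Scrub's templates, so the two regular expressions, the placeholder labels, and the function names `naive_replace` and `template_scrub` are all assumptions made for illustration. The sketch contrasts replacing a list of known names with matching on the shape of a name or phone number.

```python
import re

# Hypothetical templates for illustration only; these are NOT Scrub's
# actual templates, which the abstract does not specify.
TEMPLATES = [
    # A title, an optional initial, then a capitalized surname.
    ("NAME", re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.\s+(?:[A-Z]\.\s+)?[A-Z][a-z]+")),
    # A US-style phone number such as 617-555-0142 or (617) 555-0142.
    ("PHONE", re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")),
]


def naive_replace(text, known_names):
    """Global search and replace: only strings listed in advance are caught."""
    for name in known_names:
        text = text.replace(name, "[NAME]")
    return text


def template_scrub(text):
    """Replace anything that matches the shape of a name or phone number."""
    for label, pattern in TEMPLATES:
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Dr. Jones referred the patient to Dr. A. Smith; call 617-555-0142."

print(naive_replace(note, ["Jones"]))
print(template_scrub(note))
```

In this toy note, the search-and-replace pass knows only the name "Jones", so it leaves "Smith" and the phone number in place, while the template pass catches all three references. Real clinical text is far less regular than this example, which is why the abstract also refers to specialized knowledge beyond simple patterns.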

Keywords: data anonymity, data privacy, text de-identification

L. Sweeney, Replacing Personally-Identifying Information in Medical Records, the Scrub System. In: Cimino, JJ, ed. Proceedings, Journal of the American Medical Informatics Association. Washington, DC: Hanley & Belfus, Inc, 1996:333-337. (This paper was awarded First Prize at AMIA 1996.) Paper: 5 pages in PS or PDF.


Last modified 2/2003 by