De-identification Project

Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression

by Pierangela Samarati and Latanya Sweeney, Ph.D.

Abstract

Today's globally networked society places great demand on the dissemination and sharing of person-specific data. Situations where aggregate statistical information was once the reporting norm now rely heavily on the transfer of microscopically detailed transaction and encounter information. This happens at a time when more and more historically public information is also electronically available. When these data are linked together, they provide an electronic shadow of a person or organization that is as identifying and personal as a fingerprint, even when the sources of the information contain no explicit identifiers, such as name and phone number. In order to protect the anonymity of individuals to whom released data refer, data holders often remove or encrypt explicit identifiers such as names, addresses and phone numbers. However, other distinctive data, which we term quasi-identifiers, often combine uniquely and can be linked to publicly available information to re-identify individuals.

In this paper we address the problem of releasing person-specific data while, at the same time, safeguarding the anonymity of individuals to whom the data refer. The approach is based on the definition of k-anonymity. A table provides k-anonymity if attempts to link explicitly identifying information to its contents ambiguously map the information to at least k entities. We illustrate how k-anonymity can be provided by using generalization and suppression techniques. We introduce the concept of minimal generalization, which captures the property of the release process not to distort the data more than needed to achieve k-anonymity. We illustrate possible preference policies to choose among different minimal generalizations. Finally, we present an algorithm and the experimental results obtained when an implementation of the algorithm was used to produce releases of real medical information. We also report on the quality of the released data by measuring precision and completeness of the results for different values of k.
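
As an informal illustration of the definition above, the following Python sketch checks whether a small table is k-anonymous with respect to a set of quasi-identifiers, then applies one generalization step and one suppression step. It is not the algorithm presented in the paper and makes no attempt to find a minimal generalization. The table contents, the choice of ZIP code and sex as quasi-identifiers, and the function names (is_k_anonymous, generalize_zip, suppress_small_groups) are all assumptions made up for this example.

```python
from collections import Counter


def qi_key(row, quasi_identifiers):
    """Return the tuple of quasi-identifier values for one row."""
    return tuple(row[q] for q in quasi_identifiers)


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at least k rows."""
    counts = Counter(qi_key(row, quasi_identifiers) for row in rows)
    return all(count >= k for count in counts.values())


def generalize_zip(rows, digits):
    """Generalization: mask the last `digits` characters of each ZIP code with '*'."""
    generalized = []
    for row in rows:
        z = row["zip"]
        masked = z[:len(z) - digits] + "*" * digits
        generalized.append({**row, "zip": masked})
    return generalized


def suppress_small_groups(rows, quasi_identifiers, k):
    """Suppression: drop rows whose quasi-identifier combination occurs fewer than k times."""
    counts = Counter(qi_key(row, quasi_identifiers) for row in rows)
    return [row for row in rows if counts[qi_key(row, quasi_identifiers)] >= k]


# Hypothetical table: explicit identifiers are already removed, but the
# combination (zip, sex) is unique for every row.
rows = [
    {"zip": "02138", "sex": "F", "diagnosis": "flu"},
    {"zip": "02139", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02141", "sex": "M", "diagnosis": "flu"},
    {"zip": "02142", "sex": "M", "diagnosis": "diabetes"},
    {"zip": "94305", "sex": "F", "diagnosis": "flu"},
]
QI = ("zip", "sex")

before = is_k_anonymous(rows, QI, 2)                    # every row is unique
generalized = generalize_zip(rows, 1)                   # 02138 -> 0213*, etc.
after_generalization = is_k_anonymous(generalized, QI, 2)
released = suppress_small_groups(generalized, QI, 2)    # drops the lone 9430* row
after_suppression = is_k_anonymous(released, QI, 2)

print(before, after_generalization, after_suppression, len(released))
```

In this toy table, masking one ZIP digit is not enough on its own because one row remains unique; suppressing that single row is less distorting than masking further digits for every row. Weighing generalization against suppression in this way is the kind of trade-off that the notion of minimal generalization is meant to capture.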

Keywords: data anonymity, data privacy, re-identification, data fusion, privacy

Citation:
Pierangela Samarati and L. Sweeney. k-anonymity: a model for protecting privacy. Proceedings of the IEEE Symposium on Research in Security and Privacy (S&P). May 1998, Oakland, CA. (PDF).

This same paper also appears as Protecting respondents' identities in microdata release, IEEE Transactions on Knowledge and Data Engineering, 2001.

See links below for authoritative sources on k-anonymity.

Related Links


Summer 2003 Data Privacy Lab [De-identification Project]