Re-Identifications of Law School Data

Saying it's Anonymous Doesn't Make It So: re-identifications of "anonymized" law school data

by Latanya Sweeney, Michael von Lowenfeldt, and Melissa Perry

Abstract

Data privacy practitioners are trusted to make decisions about which fields of personal income, medical, or educational information get shared publicly. How good are the decisions they make? They don't have to publish the protocol used, and they often prohibit others from telling them about vulnerabilities found in the data. So, in the silence, they assert that there are no problems. We had a unique opportunity in a legal setting to examine the real-world decision-making of a team of accomplished data privacy experts and to test whether the decisions they make are any good. The litigation, Richard Sander et al. v. State Bar of California et al., concerned whether California law required the release of requested data. During the lawsuit, an expert team of data privacy practitioners proposed four "best practice" protocols that they asserted were sufficient to protect the privacy of individuals whose information was in the data. All four protocols claimed to leverage approaches widely used today in government and research practice. This paper presents their protocols and shows, based on analysis that was made public during the trial, the vulnerabilities each protocol had to re-identification, that is, the ability to associate real names with "anonymized" data records.

Results Summary: One protocol used a physical data enclave; two purported to produce a k-anonymity version of the data; and a fourth protocol developed a statistical model of the data. None of the protocols provided the privacy protection promised or commensurate with common expectations under public records laws. We demonstrated important lessons. k-anonymity guarantees that an adversary cannot do better than guessing that a name matches at least k records, or vice versa. All fields that can be used for linking must be included in the k-anonymization, which in today's networked society means including all fields or justifying any field left out. None of the "k-anonymity" protocols provided k-anonymity protection. Small group re-identifications can be as harmful as unique re-identifications. Physical data enclaves cannot thwart hiding or memorizing sensitive information. Adversarial testing on de-identified data can point out vulnerabilities and improve real-world practice. All four protocols left the records of Black and Hispanic test takers significantly more identifiable than the records of Whites. The Superior Court of California denied Sander's request for compelled disclosure of the data.
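To make the k-anonymity lesson concrete, here is a minimal Python sketch of how one might check whether a table satisfies k-anonymity over a chosen set of linkable fields. It is an illustration only, not the protocols or the analysis from the litigation. The function names, the field names (school, grad_year, group), and the five toy records are all hypothetical and invented for this example.

```python
from collections import Counter

def equivalence_class_sizes(records, linkable_fields):
    """Count how many records share each combination of linkable field values."""
    return Counter(tuple(r[f] for f in linkable_fields) for r in records)

def is_k_anonymous(records, linkable_fields, k):
    """True if every combination of linkable values occurs in at least k records."""
    sizes = equivalence_class_sizes(records, linkable_fields)
    return all(n >= k for n in sizes.values())

# Hypothetical toy data, invented for illustration (not real records).
records = [
    {"school": "A", "grad_year": 2005, "group": "X"},
    {"school": "A", "grad_year": 2005, "group": "X"},
    {"school": "A", "grad_year": 2005, "group": "Y"},
    {"school": "B", "grad_year": 2006, "group": "X"},
    {"school": "B", "grad_year": 2006, "group": "X"},
]

# Checking only two fields makes the table look 2-anonymous ...
print(is_k_anonymous(records, ["school", "grad_year"], k=2))

# ... but adding a third linkable field exposes a record that is unique.
print(is_k_anonymous(records, ["school", "grad_year", "group"], k=2))
```

In this toy table the first check passes and the second fails: the guarantee holds only with respect to the fields included in the check, which is why leaving out a field that can be used for linking undermines the claimed protection.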

To Appear in the Journal of Technology Science on November 7, 2017.

Citation:

Sweeney L, Von Lowenfeldt M and Perry M. Saying it's Anonymous Doesn't Make It So: re-identifications of "anonymized" law school data. Technology Science. 2017110702. November 7, 2017. https://techscience.org/a/2017110702
A white paper version is available as:
Sweeney L, Von Lowenfeldt M and Perry M. Saying it's Anonymous Doesn't Make It So: re-identifications of "anonymized" law school data. Harvard University. Data Privacy Lab. White Paper. Oct 25, 2017. (PDF)

