Research Accomplishments of Latanya Sweeney, Ph.D.



Overview

Medical Informatics
      Scrub
      Datafly
      Genomic identifiability
      Patient-centered management

Database Security
      k-anonymity

Surveillance
      Selective-revelation
      Risk assessment server
      PrivaMix

Vision
      Face de-identification

Biometrics
      Contactless capture

Policy and Law
      Identifiability of de-identified data
      HIPAA assessments
      Privacy-preserving surveillance

Public Education
      Identity angel
      SSNwatch
      CameraWatch

Quantitative assessments

Overview

Latanya Sweeney is a Distinguished Career Professor of Computer Science, Technology and Policy at Carnegie Mellon University, founder and director of the Data Privacy Lab, and an elected fellow of the American College of Medical Informatics, with almost 100 academic publications, 2 patents, citations in the Federal Register for 2 regulations, and 3 company spin-offs. She has received professional and academic awards, and testified before federal and international government bodies. In 2009, through a national GAO search, she was appointed to the privacy and security seat of the Federal Health Information Technology Policy Committee.

Dr. Sweeney has made numerous cross-disciplinary contributions to privacy technology that have had significant scientific influence and real-world impact. Scientific American profiled her earlier work (profile). Her greatest impact has been in medical privacy (medical informatics, policy and law). Dr. Sweeney's most cited statistic is "87% of the U.S. population is uniquely identified by {date of birth, gender, postal code}" [cite]. Her most cited academic work is k-anonymity. An index of her work, across disciplines and scientific areas, appears on the left side panel.

Dr. Sweeney's current research goal is to replace the 3 historical pillars of privacy (consent, notice, and de-identification) with new technology-powered mechanisms that jointly provide a privacy fabric appropriate for today's setting. The goal is to allow society to reap the benefits of emerging technologies while enjoying privacy protection.

Dr. Sweeney's prior research goal was to create systems that automatically learn strategic and sensitive information from data and, conversely, to create systems that control what can be learned. Most often, the sensitive information she wanted systems to learn consisted of ways to relate personal identity to seemingly innocent facts ("re-identification" and "identifiability"). Her pursuits in the opposite direction most often involved creating systems with guarantees that identity could not be learned ("anonymity") while ensuring that the anonymized results remained useful.

In the 2000s, Dr. Sweeney believed that computer scientists who pioneered privacy-invasive technologies and scholars who designed related policy were in the best positions to solve privacy-technology clashes [note 1] through design [note 5]. So, she worked within the communities engaged in constructing privacy-invasive technologies, and in related disciplines, to encourage such solutions [note 6]. Her approach was to identify a privacy-technology clash within a community, formulate a privacy problem statement, and then offer a solution to the problem as an exemplar to seed privacy-preserving work within the originating community.

In computer science, these were: medical informatics, database security, surveillance, vision, and biometrics. Related communities outside computer science were: policy and law, and public education.

Figure 1 (below) gives an overview of work foci, areas, and the relationships between them. The 5 scientific areas appear as rectangles left of the vertical bar and the 2 societal areas appear right of the bar. Each of the 15 research foci (circles in Figure 1) appears within its respective community. Arrows between foci show an influential relationship.


Figure 1. Overview of Latanya Sweeney's contributions in 15 research foci (circles) across 5 scientific communities (rectangles left of vertical bar) and 2 societal communities (rectangles right of vertical bar). Arrows show influential relationship.


The index in the left margin links to descriptions of each of the 15 research foci listed in Figure 1, organized by area overview and by focus, one per page. Below is a text summary of each area.


Medical Informatics

In medical informatics, Dr. Sweeney contributed: (1) Scrub, which de-identifies textual documents [cite, cite]; (2) Datafly, which balances privacy and utility in field-structured data [cite, cite, cite, cite]; (3) re-identifications of genomic data [cite, cite, cite, cite]; and (4) a healthcare experiment using anonymous data to compare cohort outcomes [cite].

Scientific influence and impact:

  • Scrub [cite] was the first to introduce the problem of de-identifying medical text and to pose a solution. Academic institutions, such as MIT [Szolovits et al.] and the University of Pittsburgh, have implemented versions of Scrub and related alternatives, which they currently license to medical organizations (e.g., Vanderbilt) for real-world use.

  • Datafly [cite] was one of the first to pose a completely algorithmic solution to balancing privacy and utility in field-structured data. Researchers later proposed efficiencies and alternatives [Ohno-Machado, Vinterbo, et al.].

  • The DNA re-identification experiments of Dr. Sweeney and her student, Bradley Malin, were the first [cite, cite, cite, cite]. Other researchers then demonstrated further vulnerabilities [Kohane, Altman, et al.] and, most recently, NIH ceased providing human DNA databases publicly because of re-identifications [McGuire et al.].

  • Dr. Sweeney's work seems to be the first to introduce an experimental design for comparing health outcomes of cohorts over time using provably anonymous data for analysis [cite]. Hagan at Price Waterhouse Coopers reports that other healthcare organizations [e.g., Healthnet, Alere, et al.] are already using variants of the experimental design.

Other achievements: 3 best paper awards; papers among the 15 most cited American Medical Informatics papers; Fellow of the American College of Medical Informatics; and a Privacy Leadership Award.

See more about Dr. Sweeney's accomplishments with Scrub, Datafly, Genomic identifiability, and Patient-centered management.

Database Security

In database security, Dr. Sweeney contributed k-anonymity [cite, cite, cite, cite, cite, cite, cite, including two papers with Pierangela Samarati]. Data are k-anonymized if the data for each person are indistinguishable from the data of at least k-1 other individuals who also appear in the data.
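The definition above can be illustrated with a minimal sketch. It is not Dr. Sweeney's implementation; it only checks the stated property on a toy table. The record layout, the field names, the sample values, and the term "quasi-identifiers" (used here for the fields on which records must be indistinguishable) are assumptions made for illustration, as is the simplification that each person contributes exactly one record.

```python
from collections import Counter

def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[f] for f in quasi_identifiers) for r in records)

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    is shared by at least k records (one record per person assumed)."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return all(n >= k for n in sizes.values())

# Invented toy data; the fields and values are hypothetical.
records = [
    {"birth_year": 1965, "gender": "F", "zip3": "152", "diagnosis": "asthma"},
    {"birth_year": 1965, "gender": "F", "zip3": "152", "diagnosis": "flu"},
    {"birth_year": 1970, "gender": "M", "zip3": "021", "diagnosis": "diabetes"},
    {"birth_year": 1970, "gender": "M", "zip3": "021", "diagnosis": "asthma"},
    {"birth_year": 1970, "gender": "M", "zip3": "021", "diagnosis": "flu"},
]
QI = ["birth_year", "gender", "zip3"]

print(is_k_anonymous(records, QI, 2))  # True: the smallest group has 2 records
print(is_k_anonymous(records, QI, 3))  # False: one group has only 2 records
```

In this sketch the sensitive field (diagnosis) is deliberately left out of the comparison; only the chosen fields determine whether two records are indistinguishable.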

Scientific influence and impact:

  • k-anonymity was the first formal privacy protection model. Its original intention was to thwart the ability to link field-structured databases, but it has been viewed more broadly, which spurred a series of highly cited works. For example, other researchers have proposed efficiencies, alternatives, and hardness proofs [Meyerson, Williams, et al.]. To improve utility, k-anonymity allows the assumption that it need only be enforced on the subset of fields known to lead to re-identifications. L-diversity [Gehrke et al.] poses an alternative motivated by the case in which that subset is chosen incorrectly. T-closeness [Li et al.] poses an alternative that addresses concerns found in l-diversity and vulnerabilities that arise if k-anonymity is applied generally. Most recently, differential privacy [Dwork et al.] poses another alternative, which typically distorts data using randomization and noise, enforced across all values, to report inexact versions of commonly occurring information.

Other achievements: recognition award; patent; second highest citation count among joint citation counts of Associate Professors in the School of Computer Science at Carnegie Mellon.

See more about Dr. Sweeney's accomplishments with k-anonymity.

Surveillance

In surveillance, Dr. Sweeney contributed: (1) Selective Revelation: a data sharing architecture that matches identifiability and utility [cite]; (2) Risk Assessment Server: computes the identifiability of data [cite, cite, cite]; and, (3) PrivaMix: allows a network of data holders to jointly produce a de-identified linked dataset without a trusted third party [cite, cite].

Scientific influence and impact:

  • Selective-revelation [cite] was part of congressional and media discussions regarding surveillance of Americans through secondary uses of data they leave behind. Robert Popp, then Deputy Director at DARPA for the Total Information Awareness Project (TIA), described it often in response to privacy concerns.

  • Risk Assessment Server [cite, cite, cite] originated with Dr. Sweeney's study of the identifiability of basic demographics, which led to her highly cited result: "87% of the population of the United States is uniquely identified by {date of birth, gender, ZIP}." [Golle et al.] found 64% using more recent population data and a different model. [Malin et al.] attributed the difference to the models and showed that there is no difference for bin sizes >= 5.

  • Even though PrivaMix [cite, cite] is very recent, HUD had the system and functions evaluated by independent security and cryptographic experts, who confirmed their correctness and applicability. PrivaMix worked flawlessly in real-world HUD experiments in Iowa. NIH provided support to help port PrivaMix to healthcare.
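The notion of identifiability behind the Risk Assessment Server bullet above can be sketched as a direct count: group people into bins by a set of demographic fields and measure the fraction who fall into bins at or below a given size. This is only an illustration on invented data. It is not the Risk Assessment Server, nor the population models used in the studies cited above; the function names, field names, and sample values are hypothetical.

```python
from collections import Counter

def bin_sizes(people, fields):
    """Count how many people share each combination of values for the given fields."""
    return Counter(tuple(p[f] for f in fields) for p in people)

def fraction_in_small_bins(people, fields, max_bin_size=1):
    """Fraction of people whose bin holds at most max_bin_size people.
    With max_bin_size=1 this is the fraction uniquely identified by the fields."""
    if not people:
        return 0.0
    sizes = bin_sizes(people, fields)
    at_risk = sum(n for n in sizes.values() if n <= max_bin_size)
    return at_risk / len(people)

# Invented toy population; values are hypothetical.
people = [
    {"dob": "1965-03-14", "gender": "F", "zip": "15213"},
    {"dob": "1965-03-14", "gender": "F", "zip": "15213"},
    {"dob": "1970-07-02", "gender": "M", "zip": "15213"},
    {"dob": "1982-11-30", "gender": "F", "zip": "02139"},
    {"dob": "1982-11-30", "gender": "M", "zip": "02139"},
]
FIELDS = ["dob", "gender", "zip"]

print(fraction_in_small_bins(people, FIELDS))   # 0.6: three of five people are unique
print(fraction_in_small_bins(people, ["zip"]))  # 0.0: nobody is unique on ZIP alone
```

Raising max_bin_size asks a weaker question (how many people share their values with only a few others), which is the sense in which results can be compared across bin sizes.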

Other achievements: praise from a Federal Advisory Committee; DARPA, HUD, and NIH funding; patent filing; 2 licenses to businesses; and news articles [note 7].

See more about Dr. Sweeney's accomplishments with Selective revelation, Risk assessment server, and PrivaMix.

Vision

In vision, Dr. Sweeney and her students contributed formal methods for de-identifying and anonymizing faces in video and photographs [cite, cite, cite, cite, cite, cite, cite].

Scientific influence and impact:

  • Dr. Sweeney and her students were the first to demonstrate the importance of using provable privacy protection over ad hoc approaches, by showing how face recognition, used in its most ideal settings, could re-identify faces distorted by masking, additive noise, and pixelation [cite]. They then introduced the first formal model for protection [cite]. Others have introduced alternatives and enhancements [Defaux et al.]. Senior recently edited a book on the topic to which Dr. Sweeney and her students contributed [cite]. In more recent work [cite, cite, cite, cite, cite], Dr. Sweeney, working with her student Ralph Gross and other collaborators, produced anonymized, photorealistic video. Working with Gross, Cohn, de la Torre, and Baker, she produced anonymized, photorealistic video of pain grimaces in patients for NIH [cite].

Other achievements: paper in a top CS journal (IEEE TKDE); paper in a top CS conference (IEEE Conference on Biometrics, 10% acceptance rate).

See more about Dr. Sweeney's accomplishments with Face de-identification.

Biometrics

In biometrics, Dr. Sweeney and her collaborators contributed new technologies that use photography for contactless capture of fingerprints [cite, cite, cite, cite, cite, cite, cite, cite, cite, cite, cite, cite, cite, cite].

Scientific influence and impact:

  • This work is still underway, but it has already attracted considerable interest from government funding agencies, including DOJ, DOD, and DHS, as well as interest in early real-world trials from local jails (for booking) and from U.S. border stations.

Other achievements: DOJ funding; paper in a top CS conference (IEEE BTAS, 10% acceptance rate); 2 patent filings; business venture; and news articles [note 7].

See more about Dr. Sweeney's accomplishments with Contactless capture.

Policy and Law

In policy and law, Dr. Sweeney contributed: (1) numerous real-world re-identification studies [cite, cite, cite, cite, cite, cite, cite, cite, cite, cite, cite, cite]; (2) operational standards for determining compliance (e.g., HIPAA) [cite, cite, cite, cite]; and (3) real-world examples of surveillance with privacy protection [cite, cite, cite, cite, cite, cite, cite, cite, cite, cite, cite, cite].

Scientific influence and impact:

  • Dr. Sweeney's earliest re-identification studies were discussed and cited as reasons for approaches taken in the HIPAA Privacy Rule [Gellman, Federal Register, et al.]. Four court decisions cite and discuss her re-identifications, and in one case, her method was sealed [Southern Illinoisan v. Dept. of Public Health]. Researchers have replicated her experiments in other countries [Emam, et al.]. Legal scholars have discussed the ramifications [Kerr, et al.] and offered new legal theories to address her findings [Rothstein, Ohm, Weitzner, et al.].

  • Attorneys publicly endorsed Dr. Sweeney's standard for determining HIPAA compliance as a means of reducing litigation risk [Tupman, et al.] and support its use in practice [American health lawyers, et al.]. Two companies have licenses to her related technology and use it to commercially provide HIPAA Compliance Assessments [Privacert, et al.].

Other achievements: citation in the commentary of the HIPAA Privacy Rule, in Medical Breach Regulation, and in 4 court decisions; presentations at the European Union and the U.S. Senate; Privacy Advocacy award; appointment to the Privacy and Security Seat of the Federal HIT Policy Committee in the Obama Administration; and news articles [note 7].

See more about Dr. Sweeney's accomplishments with Identifiability of de-identified data, HIPAA assessments, and Privacy-preserving surveillance.

Public Education

In public education, Dr. Sweeney contributed: (1) Identity Angel, which crawls the Web and notifies people of sensitive personal information found about them on-line [cite, cite, cite]; (2) SSNwatch, which validates Social Security numbers [cite]; and (3) CameraWatch, which locates URLs of publicly available webcams [cite, cite] [note 7].

Scientific influence and impact:

  • Dr. Sweeney's Identity Angel program [cite, cite, cite] found almost 10,000 Social Security numbers on-line and attempted to email about 3000 individuals whose {SSN, email} were found. A month later, about 2000 SSNs were removed. CBS News interviewed different people in different cities for reactions and aired the interviews on local stations, e.g. Denver [cbs4denver.com/video/?id=10164@kcnc.dayport.com].

  • With respect to SSN validation, SSNwatch [cite] receives about 1000 hits/week. District attorneys are the primary users, apparently checking SSNs against information in statements.

  • With respect to SSN prediction, Dr. Sweeney was first to warn of a pending crisis in the ability to predict the 9 digits of a person's SSN given only {date of birth, home town}. A private company confirmed her suspicion using millions of SSNs of live people. One of her students, Ralph Gross, working with a colleague, Alessandro Acquisti, repeated the experiment using SSNs of dead people and got publishable results and deserved media attention.

See more about Dr. Sweeney's accomplishments with Identity angel, SSNwatch, and CameraWatch.


Notes

1 To assist reviews of Dr. Sweeney's work across areas, see quantitative assessments of her work.

2 Given a constraint to satisfy, a "discipline-specific proof" is one in which the evidence and validity uses the methods appropriate to the discipline that describes the constraint. For example, if the constraint is a legal requirement, then the proof must satisfy legal muster, not necessarily scientific reasoning.

3 Dr. Sweeney's Iterative Profiler [cite] sifts through publicly available data, using inferential linkages of data fragments across data sources to construct profiles of people whose information appears in the data. After it re-identified the names of children who appeared in a cancer registry, an Illinois court ordered the approach sealed. For the appellate decision, which upheld the sealing of the methodology while highly praising her abilities, see https://www.state.il.us/court/Opinions/AppellateCourt/2004/5thDistrict/June/Html/5020836.htm.

4 These tools resonate with a recent CACM paper by Weitzner, Abelson, and Sussman at MIT, and seemingly with the Obama Administration's voiced notion of accountability and transparency.

5 Dr. Sweeney's Fair Data Sharing Practices modernizes prior work by Bob Gellman in 1974 in Fair Information Practices, which focused on policy practices for data collection. Fair Information Practices are the cornerstone of the U.S. 1974 Privacy Act and the EU Privacy Directive. They do not include descriptions of enabling technologies.

5 Helen Nissenbaum discusses design decisions made by technology developers. See her book, Privacy in Context (2009).

6 Working across areas is unorthodox. Rather than residing in one community, as is customary, Dr. Sweeney's work pursues scientific contributions to privacy in multiple communities and in the real world too, in the places where technology-privacy clashes are underway. This makes her work difficult to review for the same reasons that working across areas is difficult: each area has its own language, concepts, history, and scientific methods. Even though her papers are reviewed with the same rigor as others within a community, it is not easy to assess impact from outside that community. So, an array of quantitative assessments is available.

7 Featured news articles on Dr. Sweeney's work in surveillance, biometrics, policy and law, and public education have appeared in venues including Scientific American, CBS, NBC, ABC, Newsweek, USA Today, and National Public Radio.

Fall 2009