Invited Talks Given by Lab Members Internationally

Recommendations to Identify and Combat Privacy Problems in the Commonwealth

Talk by Latanya Sweeney, PhD


Testimony before the Pennsylvania House Select Committee on Information Security (House Resolution 351), Pittsburgh, PA, October 5, 2005.



Statement of Latanya Sweeney, PhD
Associate Professor of Computer Science, Technology and Policy
Director, Data Privacy Laboratory
Carnegie Mellon University

before the Pennsylvania House Select Committee on Information Security (House Resolution 351), "Recommendations to Identify and Combat Privacy Problems in the Commonwealth"

October 5, 2005

Chairman Flick and respected members,

Thank you for visiting us here at CyLab in Pittsburgh and for this opportunity to address this committee today. As noted in the introduction, I am an associate professor in computer science here at Carnegie Mellon, and I am the Director and founder of the Data Privacy Lab, which works with real-world stakeholders, including the government, on privacy matters.

I would like to make 6 recommendations to you today. Before I enumerate them, let me give some background. My comments and recommendations build on three perspectives: my work with the federal government, my work in the Data Privacy Lab on learning algorithms, and my work in the Data Privacy Lab on constructing privacy technology. All of these have provided some unique relevant experiences that I would like to share with this committee. Before I elaborate, let me qualify each of these perspectives.

My work on privacy and privacy technology in the federal government includes:

  • medical privacy –e.g., my NCVHS testimony on HIPAA, the citation to my work in the original HIPAA regulation in the Federal Register, and my testimony before the healthcare committee chaired by Senators Frist (R-TN) and Rockefeller (D-WV).
  • surveillance –e.g., my testimony before the Department of Homeland Security’s Privacy Advisory Committee and my testimony before the TAPAC Committee on privacy and surveillance.
  • recent work –e.g., my current work with HUD on personal identifiers to track the homeless (HMIS) and my newly announced project with the Department of Justice on fingerprint capture.
While these are at the federal level, they give witness to the breadth of privacy concerns I have addressed.

The goal of the Data Privacy Lab here at Carnegie Mellon University, of which I am the Director, is to create technologies and related policies with provable guarantees of privacy protection while allowing society to collect and share sensitive information for many worthy purposes. In order to accomplish this goal, we develop algorithms that learn sensitive information across disparate pieces of data. And if we are good at this “data detective” work, we better understand how to be good “data protectors” by constructing privacy technology that limits what can be learned. Examples in the context of this committee’s pursuits are provided below.

This committee is particularly focused on identity theft as it relates to state databases. I have 6 recommendations. These relate to privacy risks of data given away. I will not talk about computer security issues, such as hacking, which relate to break-ins and unauthorized access.

Recommendation #1: Launch programs to educate citizens, agencies, and departments about best practices in order to reduce risks. Additionally, deploy technologies that automatically identify risky behavior (by individuals or government entities) and that send alerts and notifications to prompt change.

My work on Identity Angel provides an example. This week, in fact, CBS News is running a piece that includes some of my work in this effort.

In the Data Privacy Lab, we have a project named Identity Angel, whose goal is to scan the Web and determine whether there is sufficient publicly available information to fraudulently represent a person in financial transactions. We began by focusing on the acquisition of personal information sufficient to fraudulently acquire a new credit card using freely available on-line resumes.

To fraudulently acquire a new credit card, an imposter needs to learn the {name, Social Security number, address, date of birth} of the subject. Only these 4 elements are needed. Our results show that thousands of resumes containing this information are available on-line. We were able to write simple programs that harvest this information from the Web in a matter of minutes. Of the resumes containing a Social Security number, 69% contained all the information necessary to acquire a credit card in the person’s name, so no further effort was needed in these cases.
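As a rough illustration of the kind of check described above, the following Python sketch flags resume text that exposes a Social Security number, a date of birth, and a street address together. It is not the Identity Angel implementation. The regular expressions, the function names `exposed_elements` and `is_high_risk`, and the sample text are all assumptions made for this example, and the sketch assumes the subject's name always appears on a resume.

```python
import re

# Illustrative patterns only; real resumes format these fields in many more ways.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
DOB_RE = re.compile(r"\b(?:0?[1-9]|1[0-2])/(?:0?[1-9]|[12]\d|3[01])/(?:19|20)\d{2}\b")
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+)+(?:Street|St|Avenue|Ave|Road|Rd|Drive|Dr)\b"
)

PATTERNS = {"ssn": SSN_RE, "date_of_birth": DOB_RE, "address": ADDRESS_RE}


def exposed_elements(resume_text: str) -> set:
    """Return the names of the sensitive elements whose pattern appears in the text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(resume_text)}


def is_high_risk(resume_text: str) -> bool:
    """True when SSN, date of birth, and street address all appear.

    Assumption: the subject's name is always present on a resume, so it is not checked.
    """
    return exposed_elements(resume_text) == set(PATTERNS)


# Fabricated sample text, not real people or real numbers.
risky = "Jane Doe\n12 Elm Street, Springfield\nDOB: 04/15/1980\nSSN: 123-45-6789"
safer = "John Roe\nPittsburgh, PA\nAge 35"
print(is_high_risk(risky), is_high_risk(safer))
```

The second sample follows the remedy of listing only a general age and city, so none of the patterns match it.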

Clearly, a simple remedy is to have Social Security numbers removed from the resume, and dates of birth and addresses removed or replaced with more general age and city-only information. We wrote a program that automatically sent an email message to each person alerting them to the danger of having the noted information posted on the Web. While many of the postings were done by the individuals, some resumes were posted on-line by employers and others who were not the subject of the resume. Within a month of sending a single email message, more than half the resumes no longer had the sensitive information available. A year later, most (68%) no longer had the information available. A few employers changed their policy about posting employee information. We received many thank-you messages!

This experiment showed that education can have an impact, and that risky behavior can be automatically detected and then reduced through notification. For this reason, I not only recommend educating individuals and agencies about best practices in order to reduce risks, but I also recommend the deployment of technologies that automatically identify risky behavior (by individuals or state agencies) and send notifications to effect change. While this recommendation is based on the example of on-line resumes, it is applicable to many other forms of risky behavior and serves as a general model to educate and effect change.

In terms of on-line resumes and on-line state information, we offer the service of our Identity Angel to any state agency that may want to host it. More information about Identity Angel is available at

Recommendation #2: Promote the use of automated validation and verification tools to identify fraudulent presentations early, at data capture. Doing so can help identify data errors (innocent or fraudulent) before they become part of a government record.

One of my projects on Social Security numbers, named SSNwatch, provides an example.

No discussion on identity theft can exclude consideration of Social Security numbers (SSNs) because SSNs have become a de facto national identifying number used to identify a person in data.

While there are grave concerns with SSNs, some of which will be discussed later, there is at least one good thing about SSNs. Social Security numbers are not random values. Instead, information is encoded within the number and part of the number is sequentially assigned. This allows us to make reliable inferences about a person from part or all of their SSN. This ability can be extremely useful in validating whether a person presenting an SSN is likely to be the person to whom the SSN was issued. Here is how it works.

Using publicly available information about SSN encoding and SSN assignments, we constructed the SSNwatch Validation Server, which is publicly available on-line. A user enters the first 3, 5, 6, or 9 digits of an SSN, and the program returns information about the person’s residence, year of birth, the date they received their SSN, and whether the SSN has been retired (usually due to death).

At present, the SSNwatch Validation Server is used as a quick way to validate applications and claims. About 20 law enforcement agencies have reported using it for various activities. A larger number of claims processing houses have reported using it to validate the correctness of information before it is entered into a large database. Here are some examples.

Given the first 5 digits, 078-05, SSNwatch reports that the person resided in New York at the time of issuance, and that person was most likely born between 1879 and 1921. So, if a person presenting the SSN is about age 20, it is extremely unlikely that the provided SSN was issued to that person.

Given the first 5 digits, 221-98, SSNwatch would alert that as of January 2004, no SSN beginning with those digits has been issued.

Another example is when a recently issued SSN is provided (such as 615-23 issued in February 2001), the user of SSNwatch would know that this SSN does not match a person presenting work experience dating back many years.
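The three examples above can be sketched as a small prefix lookup in Python. This is an illustration under stated assumptions, not the SSNwatch Validation Server itself: the table holds only the three prefixes cited in this testimony, and the names `PrefixInfo`, `PREFIX_TABLE`, and `check_ssn_prefix` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PrefixInfo:
    issued: bool                                   # has any SSN with this prefix been issued?
    state: Optional[str] = None                    # residence at time of issuance, if known
    birth_years: Optional[Tuple[int, int]] = None  # likely birth-year range, if known
    issue_year: Optional[int] = None               # year the prefix was issued, if known


# Hypothetical table holding only the three prefixes cited in the testimony.
PREFIX_TABLE = {
    "07805": PrefixInfo(issued=True, state="New York", birth_years=(1879, 1921)),
    "22198": PrefixInfo(issued=False),             # not issued as of January 2004
    "61523": PrefixInfo(issued=True, issue_year=2001),
}


def check_ssn_prefix(prefix: str,
                     birth_year: Optional[int] = None,
                     earliest_work_year: Optional[int] = None) -> List[str]:
    """Return warnings when the presented facts are implausible for the SSN prefix."""
    key = prefix.replace("-", "")[:5]
    info = PREFIX_TABLE.get(key)
    if info is None:
        return ["no data for this prefix"]
    if not info.issued:
        return ["no SSN with this prefix has been issued"]
    warnings = []
    if info.birth_years is not None and birth_year is not None:
        low, high = info.birth_years
        if not low <= birth_year <= high:
            warnings.append("birth year outside the likely range for this prefix")
    if (info.issue_year is not None and earliest_work_year is not None
            and earliest_work_year < info.issue_year):
        warnings.append("work history predates issuance of this prefix")
    return warnings


# A person of about age 20 in 2005 presenting an SSN that begins 078-05.
print(check_ssn_prefix("078-05", birth_year=1985))
```

An empty list means that nothing in the table contradicts the presented facts. It does not mean the SSN is confirmed to belong to the person.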

The SSNwatch project demonstrates that validation services can be provided to help detect incorrect or fraudulent information before it becomes part of a government record. Many other kinds of validation can be performed on the many other forms of data that are captured in state databases and that need to be accurate. For this reason, I recommend that government agencies use validation services.

In terms of Social Security number validation, we offer the service of our SSNwatch Validation Server to any state agency that may want to host or use it. More information about SSNwatch is available at

Recommendation #3: Promote enhanced technology for linking records about people across data collections rather than the use of explicit identifiers. While many of these strategies are likely to be developed and deployed through federal government initiatives, I mention them to this committee in order for Committee members to be better informed about ways the landscape may change in the near future.

For example, given two datasets containing records on many of the same people, many errors arise when matching on names because of the various ways a person can write their own name and the likelihood that a name can match multiple people. Despite these problems, name matching remains a common way in which state databases are linked. An alternative has been to match Social Security numbers across the databases. This increases the use of Social Security numbers, which themselves are problematic. Instead of these outdated approaches, new approaches exist and should be considered.

Enhanced technology for linking and matching exists, and I encourage taking opportunities to deploy it. I am currently working with HUD on examining these kinds of enhanced approaches to linking personal information across social service databases related to the homeless. In this regard, I think there will be significant guidance from the federal government to local municipalities on the use of these technologies because of the extensive funding by HUD into municipalities to deploy these technologies through the HMIS program. [A preliminary presentation of my findings on the HUD project is available at]

Another strategy to improve the ability to reliably link records across databases is the use of biometrics (such as fingerprints and iris scans). A biometric is a measurement of the person that is specific to the person. Using biometrics to match records belonging to the same person in multiple databases is a strategy further supported by the recent dramatic decrease in the cost of capturing fingerprints (less than $100 for low quality images) and by policy changes aimed at authenticating citizenship. The latter includes the RealID Act which is part of an overall effort to include fingerprints within Pennsylvania driver licenses.

For the purposes of representing a person in a database, one can liken the use of a biometric to that of Social Security numbers. SSNs are easy to replicate, easy to provide in person and remotely, and easy to store and match. But SSNs are not verifiable (unless services like SSNwatch, described above, are used), are easy to forge, and, because they are encoded, leak information. If we replace SSNs with biometrics, we find that biometrics have the same advantages as SSNs. But biometrics also have many of the same weaknesses. Biometrics are verifiable when presented in person, which is an improvement over SSNs, but they are not verifiable remotely. Forgery and encoding inferences are also possible.

The point of this comparison is that merely replacing SSNs with biometrics does not by itself solve the problem. While some may champion biometrics as “the” solution to personal authentication, identification, recognition and authorization concerns, my work shows how biometrics alone can only partially solve these problems. Further, widespread deployment of biometrics can generate its own privacy concerns. What is needed are holistic systems in which biometrics are embedded with provable guarantees of appropriateness and utility. Biometrics with accompanying policy and technical infrastructures can allow society to reap potential benefits without introducing new problems or exacerbating old ones. At this point, the work of identifying holistic solutions is still underway.

I do think there will be significant guidance from the federal government to local municipalities and state agencies on the use of biometric technologies within a holistic framework driven by national security interests. [My recent talk to the 2005 Biometrics Symposium, hosted by the Department of Homeland Security and the Department of Justice, addresses these issues. A copy is available at]

Recommendation #4: Establish a person (or small group of people) to help coordinate state releases of data. The overall effort would include the creation and maintenance of a database that describes the databases kept by the state, along with a description of the data elements released or shared with businesses, the public, or researchers. Having such a database makes it easier to identify re-identification potential, or when a proposed dataset can make other datasets more identifiable. An additional role for this effort is that when data are shared with researchers and other third parties, they can advise which technical tools should be used to prove there is minimal privacy risk. A final role for this effort is to coordinate and provide advice on best practices across agencies, departments, and municipalities.

In earlier work, I used Census data to show that a few characteristics often combine in populations to uniquely or nearly uniquely identify some individuals. For example, a finding in my study was that 87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}. Clearly, data released containing such information about these individuals should not be considered anonymous. Yet, health and other person-specific data are often publicly available in this form.
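The measurement behind this kind of finding can be sketched in a few lines of Python: count how many records are the only one with their combination of {5-digit ZIP, gender, date of birth}. This is a minimal sketch, not the method of the original study. The function name `fraction_unique`, the field names, and the five sample records are fabricated for illustration.

```python
from collections import Counter


def fraction_unique(records, quasi_identifiers=("zip", "gender", "birth_date")):
    """Fraction of records whose combination of quasi-identifier values occurs exactly once."""
    keys = [tuple(record[field] for field in quasi_identifiers) for record in records]
    if not keys:
        return 0.0
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Fabricated records: the first two share every quasi-identifier, the other three do not.
people = [
    {"zip": "00001", "gender": "F", "birth_date": "1970-05-02"},
    {"zip": "00001", "gender": "F", "birth_date": "1970-05-02"},
    {"zip": "00001", "gender": "M", "birth_date": "1970-05-02"},
    {"zip": "00002", "gender": "F", "birth_date": "1981-11-23"},
    {"zip": "00003", "gender": "M", "birth_date": "1965-08-14"},
]
print(fraction_unique(people))
```

Running the same function with fewer or coarser fields, for example ZIP and gender only, shows how quickly uniqueness falls when less detail is released.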

One of my earliest examples of such a re-identification involved linking hospital discharge data from the state with voter lists. This experiment was conducted in the State of Massachusetts, though the same datasets existed in Pennsylvania.

In Massachusetts, the Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees. GIC collected patient-specific data with nearly one hundred attributes per encounter for approximately 135,000 state employees and their families. All explicit identifiers, such as name, address, and Social Security number, were removed, so the data were falsely believed to be anonymous and were then given freely to researchers and industry. For twenty dollars I purchased the voter registration list for Cambridge, Massachusetts, and received the information on two diskettes [nowadays I can purchase voter data over the Web and receive it immediately]. The voter list included the name, address, ZIP code, birth date, and gender of each voter. I showed how the two datasets could be linked using ZIP code, birth date and gender, thereby relating diagnoses, procedures, and medications to particular named individuals.

One case stood out. William Weld was governor of Massachusetts at that time and his medical records were in the GIC data. Governor Weld lived in Cambridge, Massachusetts. According to the Cambridge voter list, six people had his particular birth date; only three of them were men; and he was the only one in his 5-digit ZIP code.
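The linkage described above amounts to a join on the shared fields. The Python sketch below shows the idea on fabricated data: it indexes a voter list by (ZIP, birth date, gender) and reports a medical record as re-identified only when exactly one voter matches. The function name `link_records`, the field names, and every record are hypothetical.

```python
def link_records(medical, voters, keys=("zip", "birth_date", "gender")):
    """Pair each de-identified medical record with a voter when the match is unique."""
    index = {}
    for voter in voters:
        index.setdefault(tuple(voter[k] for k in keys), []).append(voter)
    linked = []
    for record in medical:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # exactly one voter fits, so the record is re-identified
            linked.append((candidates[0]["name"], record))
    return linked


# Fabricated voter list and medical data; no real people are represented.
voters = [
    {"name": "Voter A", "zip": "00001", "birth_date": "1950-01-01", "gender": "M"},
    {"name": "Voter B", "zip": "00002", "birth_date": "1950-01-01", "gender": "M"},
    {"name": "Voter C", "zip": "00001", "birth_date": "1950-01-01", "gender": "F"},
    {"name": "Voter D", "zip": "00001", "birth_date": "1962-03-04", "gender": "F"},
    {"name": "Voter E", "zip": "00001", "birth_date": "1962-03-04", "gender": "F"},
]
medical = [
    {"zip": "00001", "birth_date": "1950-01-01", "gender": "M", "diagnosis": "condition X"},
    {"zip": "00001", "birth_date": "1962-03-04", "gender": "F", "diagnosis": "condition Y"},
]
for name, record in link_records(medical, voters):
    print(name, record["diagnosis"])
```

In this sample, the first medical record links to a single voter, while the second matches two voters and therefore stays ambiguous.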

This experiment shows that care must be taken when decisions are made to share state databases. Part of this care can be addressed by identifying which other datasets could be brought to bear and coordinating releases. More information about my work on coordinating data releases is available upon request.

Existing technology and technical strategies can also help. One of our technologies, the Privacert Risk Assessment Server, has already been licensed to a company, which uses it to identify the privacy risk of a dataset. This technology was used in my earlier work on privacy-preserving bio-terrorism surveillance; see

A technical strategy for coordinating releases is a concept termed selective revelation. I developed this concept while working with DARPA on surveillance. Under selective revelation, multiple datasets are provided with varying levels of identifiability. The level of anonymity of the dataset is lowered based on scientific and evidentiary need. For example, a department may release one version of the data that was determined to be sufficiently anonymous. While not all requests can be satisfied with this one version, other versions may be made available with increasing scrutiny for access. While the principle of selective revelation is generally applicable and can be performed manually, the original concept design is included within an automated access system; see
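One way to picture selective revelation is as a function that produces different views of the same record depending on an approved access level. The Python sketch below is only an illustration of that idea, not the original design. The three levels, the generalization rules (3-digit ZIP, year of birth), and the name `release_view` are assumptions chosen for this example.

```python
def release_view(record, level):
    """Return a view of the record; higher levels reveal more and call for more scrutiny.

    Hypothetical levels:
      0 - general release: 3-digit ZIP prefix and year of birth only
      1 - restricted release: full ZIP, year and month of birth
      2 - justified need: full ZIP and full date of birth
    """
    if level not in (0, 1, 2):
        raise ValueError("level must be 0, 1, or 2")
    view = {"gender": record["gender"], "diagnosis": record["diagnosis"]}
    if level == 0:
        view["zip"] = record["zip"][:3] + "**"
        view["birth_date"] = record["birth_date"][:4]   # year only
    elif level == 1:
        view["zip"] = record["zip"]
        view["birth_date"] = record["birth_date"][:7]   # year and month
    else:
        view["zip"] = record["zip"]
        view["birth_date"] = record["birth_date"]
    return view


# Fabricated record with an ISO-formatted birth date.
record = {"zip": "00001", "birth_date": "1950-01-31", "gender": "M", "diagnosis": "condition X"}
for level in (0, 1, 2):
    print(level, release_view(record, level))
```

Whether a given level is in fact sufficiently anonymous has to be established separately, for example with a risk assessment of the released fields against other available datasets.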

In concluding, I will not go into as much detail on the last two recommendations because I am confident they are notions that have already been presented to this committee. I do support them as recommendations.

Recommendation #5: Provide consumer protections when data collected for state purposes are stored in or gathered by commercial sources. At present, no Fair Information Practices are possible. For example, a person cannot inquire whether they are in the dataset, cannot review the information stored (and thereby being used by the state), and cannot correct false information.

Recommendation #6: Have any known breaches be publicly announced. In our society, individuals bear the burden of privacy problems, yet the existence of a problem is difficult to discover and trace. Severe penalties offer limited protection because the probability of being caught is far less than the probability of a successful violation. One way to help improve protection practices is to publicly announce known breaches.

Thank you.

Latanya Sweeney, PhD
Associate Professor of Computer Science, Technology and Policy
Carnegie Mellon University
Voice: 412-268-4484


Copyright © 2011. President and Fellows Harvard University.