Privacy Algorithms
Computerworld
OCTOBER 14, 2002

MATT HAMBLEN

Technology-based protections could make personal data impersonal.

In the ongoing debate over how to protect personal information, much of the attention has focused on whether - and to what degree - the government should limit the amount of personal information companies can ask for or share.

Recently, however, a small group of computer scientists has been taking a different tack. They're building software tools that promise to keep names, addresses, health status and other information secret while still allowing patterns to emerge from large data sets, patterns that can help predict broad social trends, buying behaviors, or massive health or terrorist threats.

Some of this software has been patented and used by government agencies in the U.S.; other algorithms are several years from practical implementation. The tools may someday help health care providers, financial services firms and the government collect and analyze data gleaned from individuals.

Some of the existing tools enhance anonymity. For example, the Freedom browser from Zero-Knowledge Systems Inc. in Montreal prevents personal information from being sent over an Internet connection without the user's consent.

The techniques under development alter personal data in various ways: making it anonymous, possibly through cryptography, or disguising it in other ways.

For example, researchers at the IBM Privacy Research Institute in San Jose are perfecting an approach that "randomizes" data before it's communicated. A Web business might use it to extract valuable demographic data without knowing the underlying personal data of the consumer.

A user would enter his age, salary or weight, and the software would randomize it by adding a random value to, or subtracting one from, the true number. The random value would differ for every user, but the range from which it is drawn would be fixed and known. Using the randomized values and that known range, the software can recover a close approximation of the true distribution across all users, IBM officials say. Experiments show only a 5% to 10% loss in accuracy even when every value is randomized, says Rakesh Agrawal, an IBM researcher on the project.
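
In outline, the idea can be sketched in a few lines of code. The following Python sketch is an illustration of the general randomization technique, not IBM's implementation; the uniform noise, the range R and the simulated ages are all assumptions:

```python
import random

# Sketch of value randomization (an illustration of the general idea,
# not IBM's code). Each user adds noise drawn uniformly from [-R, R]
# to the true value before sending it. R is public; the individual
# noise draws are not, so the collector never sees the true value.

R = 10.0  # range of randomization, known to the data collector

def randomize(true_value: float, r: float = R) -> float:
    """Return the value perturbed by uniform noise in [-r, r]."""
    return true_value + random.uniform(-r, r)

# Simulate 100,000 users each reporting a randomized age.
true_ages = [random.gauss(40, 12) for _ in range(100_000)]
reported = [randomize(age) for age in true_ages]

# Aggregates survive the noise: uniform noise averages to zero, so the
# mean of the randomized data closely estimates the true mean.
true_mean = sum(true_ages) / len(true_ages)
est_mean = sum(reported) / len(reported)
print(f"true mean: {true_mean:.2f}  estimated from randomized data: {est_mean:.2f}")
```

Recovering the full shape of the distribution, rather than just its mean, requires a reconstruction step: because the range of the noise is known, its effect can be statistically factored out, which is where the reported 5% to 10% accuracy loss arises.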

Carnegie Mellon University in Pittsburgh is focusing on protecting personal information that's already public, such as voter registration information and hospital discharge data. "One of the biggest problems is that people think their data might be anonymous when it is not," says Latanya Sweeney, a computer science professor and director of the school's Laboratory for International Data Privacy.

Sweeney estimates that 87% of the U.S. population can be uniquely identified if only a date of birth, gender and five-digit ZIP code are known. "It doesn't take much to identify you," she says.
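
Her claim is straightforward to test against any "de-identified" table: count the records that are unique on those three fields. The short Python sketch below illustrates the check; the records and field names are made up:

```python
from collections import Counter

# Hypothetical "de-identified" hospital records: no names, but date of
# birth, gender and ZIP code remain. (Fabricated data for illustration.)
records = [
    {"dob": "1960-03-14", "gender": "F", "zip": "15213", "diagnosis": "flu"},
    {"dob": "1960-03-14", "gender": "F", "zip": "15213", "diagnosis": "asthma"},
    {"dob": "1971-11-02", "gender": "M", "zip": "15217", "diagnosis": "flu"},
    {"dob": "1983-07-30", "gender": "F", "zip": "15232", "diagnosis": "diabetes"},
]

def quasi_id(record):
    """The combination Sweeney warns about: DOB, gender, five-digit ZIP."""
    return (record["dob"], record["gender"], record["zip"])

counts = Counter(quasi_id(r) for r in records)
unique = sum(1 for r in records if counts[quasi_id(r)] == 1)
print(f"{unique} of {len(records)} records are unique on (dob, gender, zip)")
# Each unique record can be re-identified by anyone holding a public
# list, such as a voter roll, that carries the same three fields.
```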

Sweeney helped found DatAnon LLC in Pittsburgh in August to commercialize technology she developed at Carnegie Mellon. Her tools look at an individual record in a database, determine which elements make that record unique and then modify only the elements necessary to make the record anonymous. For example, a date of birth might be generalized to a year of birth.
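
That generalization step can be sketched against Sweeney's own formal criterion, k-anonymity, under which every quasi-identifier combination must match at least k records. The code below is a simplified illustration, not DatAnon's algorithm; the two-field generalization hierarchy and the helper functions are assumptions:

```python
from collections import Counter

def quasi_id(r):
    return (r["dob"], r["gender"], r["zip"])

def generalize_dob(dob):
    """Date of birth -> year of birth (the example from the article)."""
    return dob[:4]

def generalize_zip(zipcode):
    """Five-digit ZIP -> three-digit prefix."""
    return zipcode[:3] + "**"

def anonymize(records, k=2):
    """Coarsen only the records whose quasi-identifier is shared by
    fewer than k rows, one field at a time."""
    for field, coarsen in (("dob", generalize_dob), ("zip", generalize_zip)):
        counts = Counter(quasi_id(r) for r in records)
        if all(c >= k for c in counts.values()):
            break  # already k-anonymous; stop generalizing
        for r in records:
            if counts[quasi_id(r)] < k:
                r[field] = coarsen(r[field])
    # A production system would keep generalizing, or suppress the few
    # records that remain unique after every field has been coarsened.
    return records

records = [
    {"dob": "1960-03-14", "gender": "F", "zip": "15213"},
    {"dob": "1960-08-02", "gender": "F", "zip": "15213"},
    {"dob": "1971-11-02", "gender": "M", "zip": "15217"},
    {"dob": "1971-04-19", "gender": "M", "zip": "15217"},
]
for r in anonymize(records, k=2):
    print(r)
```

In this framing, the "anonymity standard" that can be raised or lowered is essentially the parameter k, which is what the bioterrorism scenario described next turns on.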

Another DatAnon tool, known as Datafly, could let public health authorities conducting bioterrorism surveillance vary the anonymity standard of a data pool to match how urgently individuals need to be identified, Sweeney says. For example, if a large group of people in an area were ill and missing work, public health officials could temporarily lower the anonymity standard within hospital discharge records and other data to find people connected with, or possibly responsible for, the illness.

She's also working with Carnegie Mellon students on video anonymity systems that blot out images of innocent people on surveillance tapes.

Another idea for protecting privacy is to store different pieces of data in different databases, so that no one source has a complete record that could violate a person's privacy, says Chris Clifton, an associate professor at Purdue University in West Lafayette, Ind. This method requires data encryption so that nobody can tell which information comes from which source. A patient's name, for example, might be kept in one database, his medical history in a second and his drug regimen in a third. All could be brought together only by authorized users of the data. That would be valuable, for example, if a pharmacy, doctors and hospitals wanted to collaborate on a new drug dosage procedure.
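
A minimal Python sketch of how such keyed linkage could work appears below. It is an illustration of the partitioning idea, not Clifton's design; the HMAC-based token, the three-way field split and the in-memory dictionaries standing in for databases are all assumptions:

```python
import hashlib
import hmac
import secrets

# Sketch: each record is split across three stores, linked only by an
# opaque token derived from a secret key. Without the key, the pieces
# cannot be joined, and no single store holds a complete record.

LINK_KEY = secrets.token_bytes(32)  # held only by the authorized linker

def link_token(patient_id: str, key: bytes) -> str:
    """Opaque join key: HMAC-SHA256 of the patient ID under the secret key."""
    return hmac.new(key, patient_id.encode(), hashlib.sha256).hexdigest()

# Three separate stores, standing in for three separate databases.
identity_db: dict[str, str] = {}
history_db: dict[str, list] = {}
regimen_db: dict[str, list] = {}

def store(patient_id, name, history, regimen):
    token = link_token(patient_id, LINK_KEY)
    identity_db[token] = name      # name in one database...
    history_db[token] = history    # ...medical history in a second...
    regimen_db[token] = regimen    # ...drug regimen in a third

def join(patient_id, key):
    """Only a holder of the link key can reassemble a complete record."""
    token = link_token(patient_id, key)
    return identity_db.get(token), history_db.get(token), regimen_db.get(token)

store("patient-001", "Jane Doe", ["asthma"], ["albuterol, 90 mcg"])
print(join("patient-001", LINK_KEY))        # authorized: the full record
print(join("patient-001", b"not the key"))  # unauthorized: (None, None, None)
```

Because the token is keyed, an attacker who steals one store sees only fragments and cannot correlate them with the others; only the key holder can reassemble a record.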

Clifton says he believes it will be several years before tools are sold that provide automatic demographic comparisons while keeping personal information secret. "The tools in use today for changing private data are very limited and tied to limited data sets and not easily applied beyond those data sets," he says.

