Semantic Learning Algorithms

Privacy is a Learning Problem

An essay by Latanya Sweeney

Constructing machines that learn in the human sphere of experience and enlightenment has always fascinated me, even from my earliest years. How well I remember sitting in my third grade classroom, bored with the happenings in class, fantasizing about a black box that could learn and make connections about people and places as I could. My childhood dream’s origin is distant in time, yet the drive to fulfill that dream is ever present in my life today, only now it’s grown into a passion – a passion that has driven me to study the science of “privacy.” Learning… privacy? Yes, privacy!

When someone voices the word “privacy,” all kinds of images of seclusion and one’s personal affairs spring to mind. “Risk,” “danger,” and even “intimacy” might be words associated with privacy, but “learning” is probably not in the top 100 associations for most people. What does learning – the acquiring of knowledge or skill – have to do with privacy? Everything!

When I speak of learning, I don’t particularly mean Papert’s Logo-style understanding of how children learn through play. When I speak of learning, I don’t particularly mean Agrawal’s notion of mining massive datasets for statistically valid correlations. When I speak of learning, I don’t particularly mean any of the constructs that computer scientists have found fascinating before. Instead I mean something like them all, yet something very different too, something very pointed and directed at learning something about you, especially if you thought it could not be learned. I want to write algorithms and design systems and protocols that learn sensitive or strategic information from disparate, seemingly innocuous, or unrelated information. When I learn something that someone thought could not be learned from data, I’ve not only learned something, I’ve also educated others about what can be learned in the process!

Here are some examples. Suppose you have a string of A’s, C’s, G’s, and T’s that constitutes a person’s DNA. Certainly, DNA is unique to each person, but can I tell you to whom that particular sequence belongs? Suppose you share the first or last few digits of a Social Security number (SSN): can I tell you demographics (residence and age) about the person to whom the SSN was issued? Suppose your friend is walking down a street in lower Manhattan picking his nose: can I be in Pittsburgh and see him when he does it, and if so, can I know it’s your friend doing it? The answer to each of these questions is yes. These are things we already know how to do (to some extent) [Genomic Privacy Project; SSNwatch; CameraWatch].
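
To make the SSN example a bit more concrete, here is a minimal sketch in Python of the underlying idea (it is not the SSNwatch tool itself): before SSN randomization in 2011, the first three digits of an SSN, the “area number,” were allocated by state, so even a partial number narrows down where it was likely issued. The lookup table below is a small illustrative subset, not the full historical allocation.

```python
# Minimal sketch of the idea behind partial-SSN inference (not the SSNwatch tool).
# Before randomization in 2011, the first three digits of an SSN (the "area number")
# were allocated by state, so a partial SSN hints at where it was issued.

# Small illustrative subset of the historical allocation; the real table covers
# every state and territory.
AREA_NUMBER_RANGES = [
    ((1, 3), "New Hampshire"),
    ((50, 134), "New York"),
    ((545, 573), "California"),
]

def likely_issuing_state(ssn_prefix: str) -> str:
    """Map the first three digits of a pre-2011 SSN to its likely issuing state."""
    area = int(ssn_prefix[:3])
    for (low, high), state in AREA_NUMBER_RANGES:
        if low <= area <= high:
            return state
    return "unknown (outside this illustrative subset)"

print(likely_issuing_state("050"))  # -> New York
```

Knowing roughly where and when a number was issued is what lets a tool like SSNwatch attach residence and age estimates to a partial SSN.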

Now you might not be too surprised that these things can be learned because you realize I could pay money to a human detective and get answers to each of these questions. But what makes this work more interesting is the ability to accomplish these feats not with human detectives, but with generalizable computer methods, which I term “data detectives.”

I term the pursuit of constructing data detective tools “semantic learning.” These methods aim to learn not just facts or patterns but strategic or sensitive knowledge about a person, place, or entity. Such algorithms, programs, systems, and protocols can themselves be incredibly useful tools, because learning strategic knowledge about people is almost always a benefit to someone. So, many semantic learning technologies have noteworthy uses unto themselves (see Sweeney CV).

Now clearly, we can turn this all around and ask: how might I prevent sensitive information from being learned? One way is for you to pass a law and make it illegal for me to do it, but then you would likely thwart the noteworthy uses too. Instead, we want to construct methods that prevent some kinds of sensitive information from being learned while still allowing noteworthy uses. That’s where “privacy technology” comes in. It is the converse of semantic learning: it seeks to control what can be learned.

Here are some examples. Suppose you want to share video clips with law enforcement routinely so they can monitor captured images for suspicious behavior, but you want to do so in such a way that no matter how good face recognition may become, the faces cannot be matched to driver license photos without a court warrant. You want to make sure law enforcement cannot track the whereabouts of all the non-suspicious people all the time. Here’s a second example. Suppose some stores want to compute the total hourly sales among them without revealing the private sales of any store. Yet one business owns most of the stores and there are thousands of stores, so the computation has to be safe and fast. Here is a final example. Suppose you have some medical data you want to put online to share with researchers: how can you be sure no one can be re-identified? These are things we already know how to do [k-Same faces in video; PrivaSum, 2004; k-Anonymity]. These examples show the symbiotic relationship between semantic learning algorithms and privacy technology. It is not surprising, therefore, that my work encompasses both and that the Data Privacy Lab pursues both intensely.
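
To give a flavor of the second example, here is a minimal sketch in Python of a classic ring-based secure-sum protocol. It is not the PrivaSum protocol itself, and it ignores collusion, dropouts, and the scale of thousands of stores; it only shows why the stores can learn the total without any store revealing its own hourly sales.

```python
import random

# Minimal sketch of a classic ring-based secure sum (not the PrivaSum protocol).
# The first store adds a random mask to its private value and passes the masked
# running total around the ring; every intermediate total looks random, so no
# store learns another's sales. The first store removes the mask at the end.

def secure_total(hourly_sales, modulus=10**12):
    """Compute sum(hourly_sales) without exposing any individual store's figure."""
    mask = random.randrange(modulus)                 # known only to store 0
    running = (hourly_sales[0] + mask) % modulus     # store 0 starts the ring

    for sales in hourly_sales[1:]:                   # each store adds its own value
        running = (running + sales) % modulus        # what it received was masked

    return (running - mask) % modulus                # store 0 strips the mask

# Each value below is known only to its own store; only the total is revealed.
print(secure_total([1200, 830, 4500, 260]))  # -> 6790
```

The masking works because addition modulo a large number hides each partial sum; tolerating colluding stores and keeping the computation fast at scale takes more machinery than this sketch shows.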

June 2004.

