Trails Learning Project

Work in the Data Privacy Lab on the Trails Learning Project has sought to answer the question:

How can people be re-identified from the trail of seemingly innocent and anonymous data they leave behind at different locations?

Here are some examples.

  1. DNA.
    Given a set of hospitals and a set of patients with gene-based diseases who visit those hospitals, patients leave behind sequenced DNA at some of the hospitals they visit as part of their clinical experience. How (in the real world) can these patients be re-identified to their DNA sequences based on the hospitals visited?
    Answers: see Trail re-identification of DNA sequences, and How (not) to protect genomic data privacy in a distributed network.

  2. Web logs.
    Given a set of websites in which each site provides a weblog, which is a list of IP addresses recorded from machines visiting the website, how can the people who are using the machines be identified?
    Answers: see Trail re-identification of on-line consumers using IP addresses.
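
Both examples start from per-location records (DNA sequences held at each hospital, IP addresses logged at each website). A trail is then simply the set of locations at which the same item appears. The following is a minimal sketch of that inversion, not code from the project: the function name build_trails, the site names, and the IP addresses are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_trails(logs):
    """Invert per-location logs into one trail per logged item.

    `logs` maps a location (e.g. a website) to the items recorded there
    (e.g. IP addresses). The result maps each item to the set of
    locations where it was seen, which is that item's trail.
    """
    trails = defaultdict(set)
    for location, items in logs.items():
        for item in items:
            trails[item].add(location)
    return dict(trails)


# Made-up weblogs: each site lists the IP addresses it recorded.
weblogs = {
    "site-a.example": ["192.0.2.10", "192.0.2.20"],
    "site-b.example": ["192.0.2.10"],
    "site-c.example": ["192.0.2.20", "192.0.2.10"],
}

ip_trails = build_trails(weblogs)
for address, sites in sorted(ip_trails.items()):
    print(address, sorted(sites))
```

The same function would apply to the DNA example by mapping each hospital to the sequences it holds, assuming identical sequences can be recognized across hospitals.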

Generalized solutions (the REID-IT algorithms):

Each of the solutions provided above exploits situations in which there also exists a parallel (though incomplete and perhaps erroneous) log of visits to the locations, one in which the subjects are identified. Given this parallel identified log, the trails learning problem becomes one of matching the trails of identified information to the trails of de-identified information. For example, publicly available hospital data provides demographics on each hospital patient, and publicly available marketing lists identify on-line consumers who have made purchases at particular websites. Using this information, a trail can be constructed that lists the places the identified subject is known to have visited. Algorithms are then provided that match the identified trails (those in which subjects are known) to the de-identified trails (such as DNA sequences or IP addresses, trails in which the subjects are not known).
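
To make the matching step concrete, here is a simplified sketch. It is not the REID-IT algorithms themselves, only an illustration under stated assumptions: the identified trail may be missing visits but contains no false ones (so the "perhaps erroneous" case is ignored), every identified subject has exactly one de-identified record in the data, and no record belongs to two subjects. Under those assumptions, a subject can be linked whenever exactly one unclaimed de-identified trail covers all of the subject's known visits. The function name match_trails and the sample data are hypothetical.

```python
def match_trails(identified, deidentified):
    """Link identified subjects to de-identified records by trail containment.

    Both arguments map a key (subject name or record id) to a set of
    locations. Assumes identified trails are incomplete but error-free
    and that the mapping from subjects to records is one-to-one.
    """
    links = {}
    unclaimed = dict(deidentified)
    progress = True
    while progress:
        progress = False
        for subject, known_visits in identified.items():
            if subject in links:
                continue
            # A record is compatible if its trail covers every known visit.
            candidates = [record for record, trail in unclaimed.items()
                          if known_visits <= trail]
            if len(candidates) == 1:
                # Unique candidate: link it and remove it from the pool,
                # which may make other subjects unambiguous in turn.
                links[subject] = candidates[0]
                del unclaimed[candidates[0]]
                progress = True
    return links


# Made-up data: identified (incomplete) trails and de-identified trails.
identified = {
    "patient_A": {"hospital_1", "hospital_2"},
    "patient_B": {"hospital_2"},
    "patient_C": {"hospital_3"},
}
deidentified = {
    "dna_x": {"hospital_1", "hospital_2"},
    "dna_y": {"hospital_2", "hospital_3"},
    "dna_z": {"hospital_3"},
}

links = match_trails(identified, deidentified)
print(links)
```

In this toy data, patient_B's single known visit is compatible with both dna_x and dna_y, and becomes unambiguous only once dna_x has been claimed by patient_A. Subjects whose candidates never narrow to one are simply left unlinked.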

Solutions: See the REID-IT algorithms.

Keywords: trail re-identification, trail linkage, trail matching, data linkage, entity resolution, "connect the dots", data inference, data integration

Fall 2004 [Data Privacy Lab]