License for Software and Data

Contact our Technology Transfer Office to acquire a license for restricted software or data.

Below are descriptions of a couple of the available systems.

  • The Scrub System
    The Scrub System concerns maintaining privacy in textual documents. In field-structured databases, explicit identifiers, which provide a means to directly communicate with the person who is the subject of the data, appear within the data, grouped by a field name, such as {name, phone number}. Locating explicit identifiers in unrestricted text, however, becomes a problem unto itself. The Scrub System provides a new computational approach to locating and replacing personally identifying information in textual documents that extends beyond straight search-and-replace procedures, which were previously the norm. The system's approach is based on a model of how humans de-identify textual documents. The basic idea is that a system of detectors works in parallel, with each detector specializing in recognizing a particular kind of explicit identifier.

    While the Scrub System was proven to be quite effective, accurately locating 98-100% of all explicit identifiers found in letters to referring physicians, the final analysis reveals that de-identifying textual documents (i.e., removal of explicit identifiers) is not sufficient to ensure anonymity. Therefore, Scrub is not an anonymous database system. Nonetheless, de-identifying textual documents remains in great demand primarily due to archives of email messages, personal web pages and other information found on the World Wide Web.

  • The Datafly I and Datafly II Systems
    The Datafly Systems concern field-structured databases. Both the Datafly I and Datafly II systems use computational disclosure techniques to maintain anonymity in entity-specific data by automatically generalizing, substituting and removing information as appropriate without losing many of the details found within the data. Decisions are made at the attribute (field) and tuple (record) level at the time of database access, so the approach can be used on the fly in role-based security within an institution, and in batch mode for exporting data from an institution. Organizations often release person-specific data with all explicit identifiers, such as name, address, phone number, and social security number, removed, in the incorrect belief that the identity of the individuals is protected because the resulting data look anonymous. However, in most of these cases, the remaining data can be used to re-identify individuals by linking or matching the data to other databases or by looking at unique characteristics found in the attributes and tuples of the database itself. When these less apparent aspects are taken into account, as is done in my Datafly II System, each released tuple can be made to ambiguously map to many possible people, providing a level of anonymity that the data provider determines.

    This model of protection is termed k-map protection. In the Datafly I and Datafly II systems, the k is enforced on the data itself, resulting in a special form of k-map protection called k-anonymity. This is attractive because adherence to k-anonymity can be determined from the data holder’s data alone and does not require omniscience. Further, in the Datafly systems the data holder assigns to each attribute the amount of tolerance for distortion that is desirable and the amount of protection necessary. In this way, the Datafly systems transform the disclosure limitation problem into an optimization problem. As a consequence, the final results are adequately protected while remaining useful to the recipient. It is shown that Datafly is an anonymous database system.
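
To make the k-anonymity requirement concrete, here is a minimal Python sketch. It is not the Datafly algorithm. It only illustrates that the requirement can be checked from the data holder's own table, and that generalizing a field can bring a table into compliance. The function names, the sample records, the choice of ZIP code and birth year as the linkable attributes, and the digit-masking rule are all assumptions made for illustration.

```python
from collections import Counter


def is_k_anonymous(rows, linkable_attrs, k):
    """Return True if every combination of values over linkable_attrs
    appears in at least k rows (checked from the table alone)."""
    counts = Counter(tuple(row[a] for a in linkable_attrs) for row in rows)
    return all(c >= k for c in counts.values())


def generalize_zip(rows, digits_kept):
    """Return copies of rows with trailing ZIP digits masked by '*'.
    This is one hypothetical generalization rule among many possible ones."""
    out = []
    for row in rows:
        z = row["zip"]
        masked = z[:digits_kept] + "*" * (len(z) - digits_kept)
        out.append({**row, "zip": masked})
    return out


# Hypothetical records with explicit identifiers already removed.
records = [
    {"zip": "02138", "birth_year": 1965, "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1965, "diagnosis": "flu"},
    {"zip": "02141", "birth_year": 1970, "diagnosis": "asthma"},
    {"zip": "02142", "birth_year": 1970, "diagnosis": "diabetes"},
]
LINKABLE = ["zip", "birth_year"]

raw_ok = is_k_anonymous(records, LINKABLE, k=2)        # each row is unique
generalized = generalize_zip(records, digits_kept=3)   # 02138 -> 021**
gen_ok = is_k_anonymous(generalized, LINKABLE, k=2)    # rows now pair up

print(raw_ok, gen_ok)
```

In this toy table every raw record is unique on (zip, birth_year), so the k = 2 check fails. After masking the last two ZIP digits, each combination is shared by two records and the check passes, at the cost of some detail in the ZIP field.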

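The detector idea behind the Scrub System can be sketched in the same spirit. The example below is an assumption-laden toy, not the Scrub implementation: it uses two hypothetical regular-expression detectors (phone numbers and email addresses), runs each one independently over the same text, and replaces whatever they find with a placeholder label. The detector names, patterns, and placeholder format are all invented here.

```python
import re


def detect_phone(text):
    """Hypothetical detector: US-style phone numbers such as 617-555-0123."""
    pattern = r"\b\d{3}-\d{3}-\d{4}\b"
    return [(m.start(), m.end(), "PHONE") for m in re.finditer(pattern, text)]


def detect_email(text):
    """Hypothetical detector: simple email addresses."""
    pattern = r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"
    return [(m.start(), m.end(), "EMAIL") for m in re.finditer(pattern, text)]


# Each detector specializes in one kind of explicit identifier.
DETECTORS = [detect_phone, detect_email]


def scrub(text):
    """Run every detector over the same text and replace each hit.
    Assumes detector spans do not overlap (true for the two toy detectors)."""
    spans = [span for detector in DETECTORS for span in detector(text)]
    # Replace from right to left so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + "[" + label + "]" + text[end:]
    return text


letter = "Please call the patient at 617-555-0123 or write to jdoe@example.org."
scrubbed = scrub(letter)
print(scrubbed)
```

The detectors here are called one after another, but because each reads the original text and reports its own spans, they could equally be run concurrently. Names, addresses, and other identifiers that do not follow a fixed pattern would need detectors well beyond simple regular expressions.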




Copyright © 2011. President and Fellows Harvard University.   |   IQSS   |    Data Privacy Lab   |    [info@dataprivacylab.org]