Projects in this track provide privacy solutions for sharing a medical database such that the data can be useful for answering public health questions but the identities of the patients who are the subjects of the data cannot be reliably determined. In these projects, the privacy requirement is to thwart re-identification of patients by name through linking to population registers. This must be achieved while also ensuring that the data remain useful for public health surveys.
ICD-9 Diagnosis Codes
In Lab 8a, you had to interpret ICD-9 diagnosis codes (from a code to an English description) in order to determine the diagnoses of patients during the hospital visit in which they died. A copy of this information is available in Excel .xls format and in HTML format.
Death Registry
In Lab 8a's assignment, the students in the class helped assemble a death registry of people who have some information matching the demographics in the health data. It is believed that most (about 90%) of the patients in the health data appear in this death registry. However, some patients in the health data (about 10%) do not appear in the death registry at all. Conversely, some people in the death registry do not appear in the health data. For example, there are 402 people in the death registry and only 200 people in the health data. A copy of the death registry is available in Excel, tab-delimited text, and HTML formats.
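Linking the health data to the death registry on shared demographics can be sketched as follows. This is a minimal illustration, not the lab's prescribed method: the field names (`birth_date`, `gender`, `zip`, `name`) and the sample rows are hypothetical, and the actual columns in the lab files may differ. A health record with exactly one candidate is a potential re-identification, several candidates make an ambiguous case, and an empty list corresponds to a patient who does not appear in the registry.

```python
from collections import defaultdict

# Hypothetical field names; the real lab files may use different columns.
QUASI_IDENTIFIER = ("birth_date", "gender", "zip")

def link_records(health_rows, registry_rows, keys=QUASI_IDENTIFIER):
    """Map each health-record index to the registry names sharing its key values."""
    index = defaultdict(list)
    for person in registry_rows:
        index[tuple(person[k] for k in keys)].append(person["name"])
    return {
        i: index.get(tuple(row[k] for k in keys), [])
        for i, row in enumerate(health_rows)
    }

# Made-up sample rows for illustration only.
health = [
    {"birth_date": "1920-03-01", "gender": "F", "zip": "15213"},
    {"birth_date": "1931-07-15", "gender": "M", "zip": "15217"},
    {"birth_date": "1945-11-30", "gender": "M", "zip": "15232"},
]
registry = [
    {"name": "Person A", "birth_date": "1920-03-01", "gender": "F", "zip": "15213"},
    {"name": "Person B", "birth_date": "1931-07-15", "gender": "M", "zip": "15217"},
    {"name": "Person C", "birth_date": "1931-07-15", "gender": "M", "zip": "15217"},
    {"name": "Person D", "birth_date": "1950-01-01", "gender": "F", "zip": "15206"},
]

matches = link_records(health, registry)
for i, names in matches.items():
    print(i, names)
```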
Measuring Identifiability
In Lab 8a's assignment, you estimated the identifiability of the patients in the health data. In lecture, Professor Sweeney described ways of measuring identifiability. Here are Professor Sweeney's slides on identifiability.
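One simple way to estimate identifiability is to count how many records are unique on a chosen set of demographic fields. The sketch below is an assumption about how such a measure could be computed, not necessarily the measure presented in the slides; the field names and sample rows are hypothetical.

```python
from collections import Counter

def bin_sizes(rows, keys):
    """Count how many records share each combination of values for `keys`."""
    return Counter(tuple(row[k] for k in keys) for row in rows)

def fraction_unique(rows, keys):
    """Fraction of records whose combination of `keys` values occurs exactly once."""
    if not rows:
        return 0.0
    counts = bin_sizes(rows, keys)
    return sum(1 for c in counts.values() if c == 1) / len(rows)

# Made-up sample rows for illustration only.
records = [
    {"birth_date": "1920-03-01", "gender": "F", "zip": "15213"},
    {"birth_date": "1931-07-15", "gender": "M", "zip": "15217"},
    {"birth_date": "1931-07-15", "gender": "M", "zip": "15217"},
    {"birth_date": "1945-11-30", "gender": "M", "zip": "15232"},
]

full = fraction_unique(records, ("birth_date", "gender", "zip"))
coarse = fraction_unique(records, ("gender",))
print(full, coarse)
```

Comparing the fraction for the full field set against coarser subsets shows how much each field contributes to uniqueness.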
Anonymization Techniques
In lecture, Professor Sweeney described numerous techniques that can be used to distort information to provide privacy protection. She also described formal protection models and introduced computational approaches. Here are Professor Sweeney's slides on techniques and first k-anonymity algorithms. For related papers, see k-anonymity, more k-anonymity, and Gen-Tree.
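As a small illustration of the k-anonymity idea, the sketch below checks whether every combination of demographic values is shared by at least k records, and applies one simple generalization (birth date reduced to year, ZIP reduced to its first three digits). It is not one of the algorithms from the slides or papers; the field names, the generalization choices, and the sample rows are assumptions made for the example.

```python
from collections import Counter

def is_k_anonymous(rows, keys, k):
    """True if every combination of `keys` values appears in at least k rows."""
    counts = Counter(tuple(row[key] for key in keys) for row in rows)
    return all(c >= k for c in counts.values())

def generalize(row, zip_digits=3):
    """Return a copy with birth date cut to the year and ZIP partly masked.

    Assumes birth dates are 'YYYY-MM-DD' strings and ZIPs are digit strings.
    """
    masked_zip = row["zip"][:zip_digits] + "*" * (len(row["zip"]) - zip_digits)
    return {**row, "birth_date": row["birth_date"][:4], "zip": masked_zip}

# Made-up sample rows for illustration only.
KEYS = ("birth_date", "gender", "zip")
rows = [
    {"birth_date": "1931-07-15", "gender": "M", "zip": "15213"},
    {"birth_date": "1931-02-02", "gender": "M", "zip": "15217"},
]

generalized = [generalize(r) for r in rows]
print(is_k_anonymous(rows, KEYS, 2))
print(is_k_anonymous(generalized, KEYS, 2))
```

Generalization trades detail for protection, so the same histograms should be recomputed on the generalized data to confirm it remains useful.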
ICD-9 code (first 3 digits) | ICD-9 Description |
---|---|
146 | MALIG NEO OROPHARYNX |
147 | MALIG NEO NASOPHARYNX |
148 | MALIG NEOPL HYPOPHARYNX |
150 | MALIGNANT NEO ESOPHAGUS |
151 | MALIGNANT NEO STOMACH |
152 | MALIG NEO SMALL BOWEL |
155 | MALIGNANT NEOPLASM LIVER |
156 | MAL NEO GB/EXTRAHEPATIC |
157 | MALIGNANT NEO PANCREAS |
158 | MALIG NEO PERITONEUM |
159 | OTH MALIG NEO GI/PERITON |
160 | MAL NEO NASAL CAV/SINUS |
162 | MAL NEO TRACHEA/LUNG |
163 | MALIGNANT NEOPL PLEURA |
164 | MAL NEO THYMUS/MEDIASTIN |
165 | OTH/ILL-DEF MAL NEO RESP |
170 | MAL NEO BONE/ARTIC CART |
183 | MAL NEO UTERINE ADNEXA |
189 | MAL NEO URINARY NEC/NOS |
191 | MALIGNANT NEOPLASM BRAIN |
192 | MAL NEO NERVE NEC/NOS |
194 | MAL NEO OTHER ENDOCRINE |
196 | MALIG NEO LYMPH NODES |
197 | SECONDRY MAL NEO GI/RESP |
198 | SEC MALIG NEO OTH SITES |
199 | MALIGNANT NEOPLASM NOS |
200 | LYMPHOSARC/RETICULOSARC |
201 | HODGKIN'S DISEASE |
202 | OTH MAL NEO LYMPH/HISTIO |
203 | MULTIPLE MYELOMA ET AL |
204 | LYMPHOID LEUKEMIA |
205 | MYELOID LEUKEMIA |
206 | MONOCYTIC LEUKEMIA |
207 | OTHER SPECIFIED LEUKEMIA |
208 | LEUKEMIA-UNSPECIF CELL |
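
Looking up a patient's diagnosis codes against the table above can be sketched as follows. The dictionary holds only a few rows of the table for brevity, and the sketch assumes each code is stored as a string whose first three characters are the ICD-9 category (with or without a decimal point); the sample codes are made up.

```python
# A few rows copied from the table above; extend with the remaining rows as needed.
CANCER_CATEGORIES = {
    "157": "MALIGNANT NEO PANCREAS",
    "162": "MAL NEO TRACHEA/LUNG",
    "191": "MALIGNANT NEOPLASM BRAIN",
    "205": "MYELOID LEUKEMIA",
}

def cancer_descriptions(codes, table=CANCER_CATEGORIES):
    """Return the table descriptions for codes whose first 3 characters are listed."""
    return [table[code[:3]] for code in codes if code[:3] in table]

# Hypothetical diagnosis codes for one hospital visit.
visit_codes = ["162.9", "4280", "1570"]
found = cancer_descriptions(visit_codes)
print(found)
```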
In your assignment in Lab 8a, you re-identified many of the patients in the health data. For each re-identified person in the health data who died from a terminal cancer, list the person's name and specific diagnosis (use the English description). This list will include not only the basic cancer diagnosis above, but also the descriptions of the person's other diagnosis codes.
Your project presentation should include the histograms showing the usefulness of the health data, as well as the re-identifications. Visitors to your project should be able to scroll through the re-identifications and see the English descriptions of the diagnoses associated with the person.
Your project presentation should include comparative results. Visitors should be convinced that the resulting data remain useful yet thwart re-identification.
Your project presentation will describe your scheme for matching names and provide some examples. Display your identifiability results graphically. In earlier work, Professor Sweeney reports that 87% of the population of the United States is uniquely identified by {date of birth, gender, 5-digit ZIP}; how does this compare with your findings? Visitors to your project should be able to view re-identified records, including ambiguous cases.