HHS plans to test re-identification of “de-identified” health data

According to a special notice posted yesterday on FedBizOpps, the HHS Office of the National Coordinator for Health IT (ONC) is getting ready to put out a contract to fund research on re-identifying datasets that have been de-identified according to HIPAA Privacy Rule standards. As noted previously in this space, academic researchers at Carnegie Mellon and at UT-Austin have already reported success in identifying records that had ostensibly been anonymized, although to be fair neither of these research examples was based on HIPAA de-identified data. What’s most intriguing about this solicitation notice is that ONC has one of the leading experts on the subject, Latanya Sweeney, sitting on its Health IT Policy Committee. Sweeney’s doctoral research included work with anonymized medical records which, she discovered, could be positively identified a majority of the time simply by correlating the medical records with other publicly available data sources that included the demographic information stripped out of the health data.

The research to be funded by ONC will focus on data that has been de-identified according to current HIPAA standards, which basically require the removal of 18 specific identifiers and any other information in a health record that might otherwise uniquely identify the individual in question. Specifically:

  1. Names.
  2. All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP Code, and their equivalent geographical codes, except for the initial three digits of a ZIP Code if, according to the current publicly available data from the Bureau of the Census: (1) The geographic unit formed by combining all ZIP Codes with the same three initial digits contains more than 20,000 people; and (2) The initial three digits of a ZIP Code for all such geographic units containing 20,000 or fewer people are changed to 000.
  3. All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older.
  4. Telephone numbers.
  5. Facsimile numbers.
  6. Electronic mail addresses.
  7. Social security numbers.
  8. Medical record numbers.
  9. Health plan beneficiary numbers.
  10. Account numbers.
  11. Certificate/license numbers.
  12. Vehicle identifiers and serial numbers, including license plate numbers.
  13. Device identifiers and serial numbers.
  14. Web universal resource locators (URLs).
  15. Internet protocol (IP) address numbers.
  16. Biometric identifiers, including fingerprints and voiceprints.
  17. Full-face photographic images and any comparable images.
  18. Any other unique identifying number, characteristic, or code, unless otherwise permitted by the Privacy Rule for re-identification.
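
To make items 2 and 3 in the list above a little more concrete, here is a minimal Python sketch of how the ZIP Code and date/age rules might be applied to a single record. It is only an illustration, not an official or complete implementation of the HIPAA standard. The function names, the returned field names, and the `ZIP3_POPULATION` table are all hypothetical, and the population figures in that table are invented stand-ins for the Census data the rule refers to.

```python
from datetime import date

# Hypothetical stand-in for Census data: population of each area formed by
# ZIP Codes sharing the same first three digits. Figures are invented.
ZIP3_POPULATION = {"152": 450_000, "036": 12_000}


def generalize_zip(zip_code, zip3_population):
    """Keep the first three digits of a ZIP Code only if the combined area
    holds more than 20,000 people; otherwise return "000" (item 2)."""
    prefix = zip_code[:3]
    if zip3_population.get(prefix, 0) > 20_000:
        return prefix
    return "000"


def age_on(birth, as_of):
    """Age in whole years on the reference date."""
    years = as_of.year - birth.year
    if (as_of.month, as_of.day) < (birth.month, birth.day):
        years -= 1
    return years


def generalize_birth(birth, as_of):
    """Keep only the birth year, and fold ages over 89 (together with the
    year that would reveal them) into a single "90+" category (item 3)."""
    age = age_on(birth, as_of)
    if age > 89:
        return {"birth_year": None, "age": "90+"}
    return {"birth_year": birth.year, "age": str(age)}


if __name__ == "__main__":
    print(generalize_zip("15213", ZIP3_POPULATION))
    print(generalize_zip("03601", ZIP3_POPULATION))
    print(generalize_birth(date(1915, 3, 1), date(2010, 1, 1)))
```

Unknown ZIP prefixes are treated as small areas here and mapped to "000", which is a conservative choice made for this sketch rather than something the rule spells out.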

It’s not clear at this point what the outcome of such research might be, assuming some level of “success” in re-identifying health data. One mitigation might be an expansion of the list of fields that need to be removed to effectively de-identify a record. A more significant response might be an acknowledgment that true anonymization of health data to the degree sought (and, one could argue, assumed) under current law and policy simply isn’t possible without more extensive alteration of the original data.
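
For readers wondering what “re-identifying” looks like in practice, the correlation approach mentioned earlier (matching de-identified records against a public dataset that still carries names) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method of any of the researchers named above or of the ONC solicitation. The choice of `zip3`, `birth_year`, and `sex` as linking fields is hypothetical, and every record below is invented.

```python
from collections import defaultdict

# Hypothetical linking fields that survive in both datasets.
QUASI_IDENTIFIERS = ("zip3", "birth_year", "sex")


def link_records(deidentified, public, keys=QUASI_IDENTIFIERS):
    """Return {index of de-identified record: name} for every record whose
    linking fields match exactly one entry in the public dataset."""
    index = defaultdict(list)
    for person in public:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = {}
    for i, record in enumerate(deidentified):
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match is a re-identification
            matches[i] = candidates[0]
    return matches


# Invented toy data, for illustration only.
health_records = [
    {"zip3": "152", "birth_year": 1961, "sex": "F", "diagnosis": "asthma"},
    {"zip3": "787", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
]
public_records = [
    {"name": "Person A", "zip3": "152", "birth_year": 1961, "sex": "F"},
    {"name": "Person B", "zip3": "787", "birth_year": 1975, "sex": "M"},
    {"name": "Person C", "zip3": "787", "birth_year": 1975, "sex": "M"},
]

if __name__ == "__main__":
    print(link_records(health_records, public_records))
```

In this toy data the first health record matches only one person and is re-identified, while the second matches two people and stays ambiguous. The more fields a dataset retains, the fewer people share any given combination, which is why the list of removed identifiers matters so much.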