In the last year, a private citizen attempted to use the Freedom of Information Act (FOIA) to obtain a "HIPAA de-identified" research-generated dataset containing mainly CPRS information on VA patients with rheumatoid arthritis (RA). The FOIA office, which has final decision authority, was initially inclined to release the data, but ultimately decided not to release the dataset after much deliberation. One rationale for reversal of the decision was ORD's argument that combining a "de-identified" dataset with identified datasets poses a significant risk of re-identification. ORD was asked to assess the re-identification risk of the de-identified RA dataset. Andrew Zhou was appointed to lead the "Re-Identification Risk Evaluation Committee" to understand the methods used to assess re-identification risk and to apply them to the RA dataset.
The committee had two primary objectives:
- Obtain a comprehensive understanding of issues related to the re-identification risk of VA patient health data that meets HIPAA de-identification standards.
- Evaluate the re-identification risk of the specific HIPAA de-identified VA Rheumatoid Arthritis (VARA) Registry using available risk measures and estimation methods.
For the first objective, the project team reviewed and discussed in detail more than 20 peer-reviewed original articles on re-identification risk methods in order to obtain a comprehensive understanding of re-identification issues and estimation methods. This allowed the project team to classify all statistical population-based methods as falling under one of four measures. These four measures are based on how probable it is that someone could match a subject in the released dataset to a record in an external dataset. The more "unique" a subject is in the dataset with respect to the values of the dataset variables (labeled key variables), the higher the risk of re-identification. More specifically, re-identification risk typically arises when small counts on cross-classified key variables (such as age, sex, marital status, and occupation) can be used to identify a subject, so that confidential information can be learned. Quantifying re-identification risk requires realistic assumptions about the information and IT tools available to an intruder that increase the probability of re-identification, e.g., assumptions about the intruder's ability to match a dataset to an external public file on a common set of key variables, and to identify unique subjects through visible and rare attributes.
The four re-identification measures are: the expected number of population uniques; the expected number of sample uniques that are population unique; the expected number of correct matches for sample uniques; and the expected number of correct matches for all subjects. These measures are defined in detail in Table 1 of the longer report.
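As a rough illustration of how the four measures relate to cell counts on cross-classified key variables, the following minimal sketch computes them on toy data. It assumes the population cell counts are known, which is not the case in practice (the estimation methods reviewed in the report exist precisely because they must be estimated). The formulas follow common formulations in the disclosure-risk literature and may differ in detail from Table 1 of the longer report. All data values and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: each record is a tuple of key-variable values
# (age band, sex, marital status). Not drawn from any VA dataset.
population = [
    ("60-69", "M", "married"),
    ("60-69", "M", "married"),
    ("60-69", "M", "married"),
    ("70-79", "M", "widowed"),
    ("70-79", "M", "widowed"),
    ("50-59", "F", "single"),
    ("80+", "F", "widowed"),
]
# The released sample is assumed to be a subset of the population.
sample = [
    ("60-69", "M", "married"),
    ("60-69", "M", "married"),
    ("70-79", "M", "widowed"),
    ("50-59", "F", "single"),
]

F = Counter(population)  # population cell counts F_k
f = Counter(sample)      # sample cell counts f_k

# Measure 1: number of cells with exactly one subject in the population.
population_uniques = sum(1 for count in F.values() if count == 1)

# Measure 2: sample uniques that are also unique in the population.
sample_uniques = [cell for cell, count in f.items() if count == 1]
sample_uniques_pop_unique = sum(1 for cell in sample_uniques if F[cell] == 1)

# Measure 3: expected correct matches for sample uniques, assuming the
# intruder picks uniformly among the F_k population members of the cell.
matches_sample_uniques = sum(1 / F[cell] for cell in sample_uniques)

# Measure 4: expected correct matches over all sample subjects, under the
# same uniform-guess assumption (each record contributes 1 / F_k).
matches_all = sum(count / F[cell] for cell, count in f.items())

print(population_uniques, sample_uniques_pop_unique)
print(round(matches_sample_uniques, 3), round(matches_all, 3))
```

With real data the population counts would be replaced by model-based estimates, so the measures become expected values rather than exact counts.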
For the second objective, the project team reviewed the most recent advances in modeling and estimation methods in the literature with a view to implementing them. The measures and estimation methods were either coded by project team members in available software (Matlab, R, and SAS) or provided in existing software to which the VA was given access for the purpose of this project (e.g., SUDA1). The project team then applied these risk methods to the de-identified VA Rheumatoid Arthritis (VARA) Registry in order to evaluate the risk of re-identification.
The project team fulfilled the two stated objectives.
For a summary of results:
- The project team created a 46-page report detailing their comprehensive review of re-identification issues and estimation methods, and the classification of methods into one of four commonly used measures. The report includes details of the most recent advances in modeling and estimation methods in the published literature, including methods developed and used by statistical agencies in Europe.
- No single method has been widely adopted by organizations. Rather, multiple methods are available, each with its own strengths and weaknesses.
- The application of risk methods to the VARA dataset provided evidence of a very low overall risk of re-identification. In particular, the estimated number of expected population uniques was very low; very few estimated sample uniques were found to be population unique; and the estimated number of correct matches amongst sample uniques was close to zero under all estimation methods. This translates into a very low re-identification risk for the VARA dataset.
The only evidence of any sizeable re-identification risk was found in the individual per-record risk measures for a handful of subjects, under a conservative scenario in which the intruder has access to comorbidities in an external dataset.
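To see why adding comorbidities to the intruder's key variables can raise per-record risk for a few subjects, consider the minimal sketch below. It uses one simple per-record measure, 1/F_k, where F_k is the number of population members sharing a record's key-variable values. This is an assumed formulation for illustration only, not necessarily the one used in the report. The function name `per_record_risk` and all data are hypothetical.

```python
from collections import Counter

def per_record_risk(sample, population, keys):
    """Return 1 / F_k for each sample record, where F_k counts the
    population records sharing that record's values on `keys`.
    Assumes every sample record also appears in the population."""
    def cell(record):
        return tuple(record[k] for k in keys)
    F = Counter(cell(r) for r in population)
    return [1 / F[cell(r)] for r in sample]

# Hypothetical toy population; not drawn from any VA dataset.
population = [
    {"age": "60-69", "sex": "M", "comorbidity": "diabetes"},
    {"age": "60-69", "sex": "M", "comorbidity": "none"},
    {"age": "60-69", "sex": "M", "comorbidity": "none"},
    {"age": "60-69", "sex": "M", "comorbidity": "none"},
]
sample = population[:2]

# Baseline scenario: the intruder knows only age band and sex.
baseline = per_record_risk(sample, population, ["age", "sex"])

# Conservative scenario: the intruder also knows comorbidities.
conservative = per_record_risk(
    sample, population, ["age", "sex", "comorbidity"]
)

print(baseline)
print(conservative)
```

In this toy case the subject with the rare comorbidity becomes unique once comorbidity is treated as a key variable, while the risk for the other subject rises only modestly. This mirrors the pattern of a low overall risk with a handful of high per-record values.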
Dissemination of VA data can facilitate advances in research, inform public policy, and further citizens' knowledge. At the same time, the VA is ethically and legally obligated to protect the confidentiality of veterans' identities and sensitive attributes. Failure to do so can break promises, violate laws, cause veterans to give lower-quality answers, and reduce future participation rates. Data disseminators are therefore pulled in two opposite directions: the benefits of data access encourage them to release data, but the need to protect confidentiality encourages them to withhold it. Dr. Zhou successfully organized and led a Re-Identification Risk Evaluation Committee that summarized the state of the major methods used to assess re-identification risk and applied these methods to a VA dataset to assess its potential risk of re-identification.