National agencies commonly release statistical data to other agencies for research purposes or to inform public policy. A disclosure, or re-identification, occurs when a person or organization learns something they did not already know about an entity because of the released data. Organizations typically de-identify data by removing identifying information, such as personal names, from a dataset before release. In the U.S., the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule offers guidance on de-identifying personal health information. A common method used to satisfy this rule is Safe Harbor, which consists of removing 18 groups of identifiers from a dataset. However, there is increasing evidence that blanket policies such as Safe Harbor can still leave different organizations vulnerable to disclosure at different rates, which justifies locally performed statistical disclosure risk analyses prior to data release. Although many risk measures have been developed, their statistical properties have not been well studied and comparisons between competing estimators are limited. The goal of this proposal is to develop a set of novel statistical approaches for estimating re-identification risk in the context of VA's data-sharing policies and to provide recommendations for the VA Re-identification Risk subcommittee.
The objectives are:
1. To develop new and more flexible statistical methods to assess re-identification risk.
2. To assess the performance of new and existing estimation methods using imitation-data and real-data simulations.
3. To apply existing and newly developed methods to estimate re-identification risk of a de-identified VA dataset.
4. To develop recommendations for the VA Re-identification Risk subcommittee on the elements the VA must consider before releasing data and the actions the VA can take when faced with privacy threats.
We first reviewed the latest peer-reviewed articles on this topic and developed a tutorial describing current statistical methods for estimating disclosure risk. We identified limitations in the existing parametric methods, all of which use log-linear models, and developed a log-nonlinear model, which has not been used before in a disclosure risk setting. In this model, the log-expected cell frequencies are related to the key variables nonlinearly through an unknown function. Because there are many zero counts, we modeled the cell frequencies Fj with a zero-inflated Poisson distribution, using sliced inverse regression to estimate the model coefficients and a local linear method to estimate the unknown function.
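One way to write down a model of this kind is sketched below. The notation is an assumption made for illustration and is not taken from the project: j indexes cells of the contingency table formed by the key variables, x_j is the vector of key-variable values for cell j, beta is the coefficient vector, g is the unknown function, p is the zero-inflation probability, and a single-index form g(x_j' beta) is assumed.

```latex
F_j \sim
\begin{cases}
0 & \text{with probability } p, \\
\mathrm{Poisson}(\lambda_j) & \text{with probability } 1 - p,
\end{cases}
\qquad
\log \lambda_j = g\!\left(x_j^{\top} \beta\right),
\qquad
\Pr(F_j = 0) = p + (1 - p)\, e^{-\lambda_j}.
```

Under this reading, sliced inverse regression would target the direction beta and the local linear method would target g; the extra mass p at zero is what lets the model absorb more empty cells than a plain Poisson model would predict.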
We also applied existing disclosure risk measures to two VA datasets that had been de-identified using HIPAA Safe Harbor guidelines: the VA rheumatoid arthritis dataset and the suicide dataset from the REACH VET Prediction Cohort. In both cases, to determine the sensitivity of the risk measures, we used a "small key" with only a small subset of demographic information that might be available to the general public and a "large key" containing a wider range of variables from research-only public datasets that would be available to other researchers. Five major risk measures were calculated to inform the disclosure risk of each dataset: expected number of population uniques, expected number of sample uniques that are population unique, expected number of correct matches among sample uniques, probability of a correct match given a unique match, and probability of a correct match.
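To make these measures concrete, the following is a minimal sketch that computes four of them directly from key counts. It assumes the population key counts are known, which is only true in a simulation; with real data they are unknown and the measures must be estimated with a model. The function name `risk_measures`, the dictionary keys, and the toy records are hypothetical, and the unconditional probability of a correct match is left out because its exact definition is not given here.

```python
from collections import Counter


def risk_measures(population_keys, sample_keys):
    """Compute simple disclosure risk measures when population counts are known.

    Each element of population_keys / sample_keys is the tuple of key-variable
    values for one record. The sample is assumed to be drawn from the population.
    """
    F = Counter(population_keys)  # population frequency of each key combination
    f = Counter(sample_keys)      # sample frequency of each key combination

    if any(F[k] < c for k, c in f.items()):
        raise ValueError("sample contains records not present in the population")

    sample_unique_cells = [k for k, c in f.items() if c == 1]

    # Number of key combinations held by exactly one person in the population.
    population_uniques = sum(1 for c in F.values() if c == 1)
    # Sample uniques that are also unique in the population.
    su_and_pu = sum(1 for k in sample_unique_cells if F[k] == 1)
    # Expected correct matches if an intruder guesses at random within each cell.
    expected_correct = sum(1.0 / F[k] for k in sample_unique_cells)
    # Probability that a unique match is a correct match.
    matched = sum(F[k] for k in sample_unique_cells)
    p_correct_given_unique = (
        len(sample_unique_cells) / matched if matched else float("nan")
    )

    return {
        "population_uniques": population_uniques,
        "sample_uniques_population_unique": su_and_pu,
        "expected_correct_matches": expected_correct,
        "p_correct_given_unique_match": p_correct_given_unique,
    }


# Toy data (hypothetical): key = (sex, age band, 3-digit ZIP).
population = [
    ("M", "30-39", "981"), ("M", "30-39", "981"),
    ("F", "30-39", "981"),
    ("F", "40-49", "982"), ("F", "40-49", "982"), ("F", "40-49", "982"),
    ("M", "50-59", "983"),
]
sample = [
    ("M", "30-39", "981"), ("F", "30-39", "981"),
    ("F", "40-49", "982"), ("M", "50-59", "983"),
]
result = risk_measures(population, sample)
print(result)
```

Running the same function with a shorter key (for example, dropping the ZIP element from every tuple) mimics the "small key" versus "large key" comparison: fewer key variables means larger cells and lower measured risk.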
Aims 1 & 2: We assessed the performance of our new log-nonlinear disclosure risk model in a simulation study and found that, compared to existing log-linear models, our proposed model produced less biased estimates of disclosure risk. In log-linear modeling, the maximum likelihood estimator may not exist because of the many zero counts in contingency tables, a problem known as sparsity. The method we developed, which uses log-nonlinear models and a zero-inflated Poisson distribution, was better able to handle this sparsity.
Aim 2: We developed a tutorial describing current statistical methods for estimating disclosure risk for microdata. It serves as a guide for risk analysis by defining existing disclosure risk measures, comparing their effectiveness, and explaining how to estimate them with available software. We also evaluated an existing measure from Skinner and Elliott (2002) using real-data simulations based on U.S. Census American Community Survey data to estimate coverage probabilities and bias, and found no differences in estimator behavior across sampling methods.
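The general shape of such a simulation can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the population is synthetic rather than American Community Survey data, only Bernoulli sampling is simulated, only bias (not coverage) is computed, and the estimator is written in the form n1 / (n1 + 2(1/pi - 1) n2) as we understand it from Skinner and Elliott (2002), where n1 and n2 are the numbers of key combinations appearing once and twice in the sample and pi is the sampling fraction. That form should be checked against the paper before use. All function names are hypothetical.

```python
import random
from collections import Counter


def true_theta(pop_counts, sample_counts):
    """Probability that a unique match is correct, using known population counts."""
    uniques = [k for k, c in sample_counts.items() if c == 1]
    matched = sum(pop_counts[k] for k in uniques)
    return len(uniques) / matched if matched else float("nan")


def skinner_elliott_estimate(sample_counts, pi):
    """Sample-only estimate of theta under Bernoulli sampling (assumed form)."""
    n1 = sum(1 for c in sample_counts.values() if c == 1)  # sample uniques
    n2 = sum(1 for c in sample_counts.values() if c == 2)  # sample pairs
    denom = n1 + 2.0 * (1.0 / pi - 1.0) * n2
    return n1 / denom if denom else float("nan")


def simulate_bias(population_keys, pi, n_reps, seed=0):
    """Average (estimate - truth) over repeated Bernoulli samples."""
    rng = random.Random(seed)
    pop_counts = Counter(population_keys)
    errors = []
    for _ in range(n_reps):
        # Bernoulli sampling: each record is included independently with prob. pi.
        sample_counts = Counter(k for k in population_keys if rng.random() < pi)
        truth = true_theta(pop_counts, sample_counts)
        estimate = skinner_elliott_estimate(sample_counts, pi)
        if truth == truth and estimate == estimate:  # skip replicates with NaN
            errors.append(estimate - truth)
    return sum(errors) / len(errors) if errors else float("nan")


# Synthetic population (hypothetical): skewed integer keys stand in for
# combinations of key variables.
pop_rng = random.Random(1)
population = [min(int(pop_rng.expovariate(0.01)), 999) for _ in range(2000)]
bias = simulate_bias(population, pi=0.1, n_reps=200)
print(f"average (estimate - truth): {bias:.4f}")
```

Comparing sampling methods would mean swapping the Bernoulli sampling line for another design (for example, simple random sampling without replacement) and repeating the same comparison.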
Aim 3: For both of the real VA datasets we analyzed, rheumatoid arthritis and REACH VET, the disclosure risk estimates were relatively low. In both cases, for the "small key", all five risk measures were close to 0, suggesting a low disclosure risk for these datasets. For the "large key", estimates were somewhat higher but still small relative to the total population size. Although these datasets had a relatively low re-identification risk, the variation in estimates between the two keys shows that it is still important to use multiple risk measures and sets of key variables to create an accurate portrayal of re-identification risk.
Aim 4: In addition to the reference tutorial we created, our team of statisticians is currently working with John F. Quinn, Director of VA National Data Systems, to assist in risk analyses of potentially high-risk VA datasets to determine whether they are safe for release. This work will be funded by National Data Systems for 3 years.
This proposal has a high degree of significance to the VA for two reasons: (1) this work adds to the methodology and science of data protection which is important to the VA; and (2) this work informs VA policy and operations on release of VA data which ultimately impacts future research. This benefits veterans by reducing the risk that veteran data will be compromised, as well as allowing datasets to be more safely and effectively shared, facilitating important future research and policies serving veterans.
- Taylor L, Zhou XH, Rise P. A tutorial in assessing disclosure risk in microdata. Statistics in Medicine. 2018 Nov 10; 37(25):3693-3706.
- Sherwood B, Zhou AX, Weintraub S, Wang L. Using quantile regression to create baseline norms for neuropsychological tests. Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring. 2015 Dec 19; 2:12-8.