Health Services Research & Development

HIR 09-006 – HSR&D Study

HIR 09-006
Consortium for Healthcare Informatics Research - De-Identification
Matthew H. Samore MD
VA Salt Lake City Health Care System, Salt Lake City, UT
Funding Period: February 2009 - January 2013

BACKGROUND/RATIONALE:
The privacy and confidentiality of a patient's health information are a cornerstone of the physician-patient relationship. Regulations protecting confidentiality require the patient's informed consent before their medical record is used for purposes other than their own health care, such as research. But obtaining informed consent from a large population of patients, especially in retrospective research, is difficult and costly. The informed consent requirement can be waived if the medical record is de-identified. To reduce the time and effort required to manually de-identify medical records, natural language processing (NLP) methods can be applied to automatically de-identify narrative text documents in the Electronic Health Record (EHR). Several systems for automated de-identification have been developed, but each is tailored to the document types and formats it was designed to process. VA CPRS narrative text documents differ significantly from documents in other systems, most prominently in their widespread use of templates. Any automated de-identification system would therefore require significant adaptation before it could be used with VA CPRS narrative text documents.

OBJECTIVE(S):
This project was driven by the following research questions: 1) Can automatic text de-identification be applied to VA clinical narratives with good performance? 2) What is the risk that a de-identified clinical note can be re-identified? 3) How much does automatic text de-identification impact subsequent uses of the clinical narratives?
The objectives of this study were to:
1. Evaluate existing automated text de-identification methods and develop a best-of-breed application for VA clinical narratives by combining the best performing methods for each type of identifier.
2. Evaluate the risk that de-identified clinical text can be linked to the identity of the corresponding patient.
3. Determine the influence of automated de-identification on the accuracy of information extraction and the optimal combination of both.

METHODS:
The developments and evaluations in this project were based on a stratified random sample of various VHA clinical narratives authored between April 1, 2008 and March 31, 2009 from VHA patient EHRs in VISN 19. No patient criteria were used for the selection. The 100 most frequent note types (addendum excluded) were used as strata for sampling. We then randomly selected eight documents in each stratum, reaching a total of 800 clinical documents.
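As an illustration only, the sampling design described above (most frequent note types as strata, a fixed number of documents drawn at random from each) could be sketched as follows. This is not the project's actual code: the function names, the `(doc_id, note_type)` record layout, and the fixed random seed are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def most_frequent_note_types(documents, n, exclude=()):
    """Return the n most frequent note types, skipping excluded types.

    documents: iterable of (doc_id, note_type) pairs (hypothetical layout).
    """
    counts = Counter(t for _, t in documents if t not in exclude)
    return [t for t, _ in counts.most_common(n)]


def stratified_sample(documents, strata, per_stratum, seed=0):
    """Draw up to per_stratum documents at random from each stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_type = defaultdict(list)
    for doc_id, note_type in documents:
        by_type[note_type].append(doc_id)
    sample = {}
    for note_type in strata:
        pool = by_type[note_type]
        # A stratum with fewer documents than requested is taken whole.
        sample[note_type] = rng.sample(pool, min(per_stratum, len(pool)))
    return sample
```

With 100 strata and eight documents per stratum, such a procedure yields the 800 documents reported above, provided every stratum contains at least eight documents.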
The first objective included an evaluation of existing text de-identification methods (a comprehensive survey and an evaluation of selected algorithms and systems), the development of a best-of-breed automatic clinical text de-identification application, and the evaluation of this new application.
The second objective consisted of evaluating the level of anonymity of automatically de-identified clinical documents when presented to healthcare providers at various levels of proximity to the patient (e.g., a nurse working in the ward where a patient was hospitalized versus an attending physician consulting in the same hospital). For this survey, discharge summaries from a random sample of 100 patients hospitalized in acute medicine at the Salt Lake City VHA Medical Center between September and December 2012 were automatically de-identified with BoB, the best-of-breed application developed under the first objective. This objective also included an estimation of the re-identification risk, based on the uniqueness of automatically de-identified clinical documents and on the other identified data sets that could be used for re-identification.
The third objective focused on evaluating the impact of automatic de-identification on clinical data (readability and interpretability) and on subsequent information extraction processes.

To guide our efforts and better understand the opinions of Information Security and Privacy Officers about the use of automated de-identification and de-identified notes in research, we conducted a survey of these VHA employees.

FINDINGS/RESULTS:
Each document in our sample of VHA clinical notes was independently annotated by two reviewers for PHI (Protected Health Information) and clinical eponyms; disagreements were adjudicated by a third reviewer. This annotated corpus served as the reference standard for training and testing.

First objective: We conducted and published a comprehensive survey of research and software developed for clinical text de-identification. We also implemented and evaluated several such applications with VHA clinical documents. Based on the results of this evaluation and analysis, we chose the best methods and resources for each type of PHI and developed a best-of-breed automatic de-identification application for VHA clinical text (called BoB). A first version of BoB was released in December 2011. Subsequent performance optimization efforts brought it to an overall sensitivity of 92.6% (98-100% for highly sensitive PHI) and a positive predictive value of 84.1%.
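For readers less familiar with these metrics, the sketch below shows how sensitivity and positive predictive value can be computed by comparing system output against a reference standard. It is a minimal illustration, not the project's evaluation code: the exact-match criterion on `(note_id, start, end)` spans and the function name are assumptions, and the published evaluation may have used different matching rules.

```python
def sensitivity_and_ppv(reference, predicted):
    """Compare predicted PHI annotations against a reference standard.

    Both arguments are collections of hashable annotations, here assumed
    to be (note_id, start_offset, end_offset) tuples matched exactly.
    """
    reference, predicted = set(reference), set(predicted)
    true_pos = len(reference & predicted)   # PHI found by the system
    false_neg = len(reference - predicted)  # PHI the system missed
    false_pos = len(predicted - reference)  # non-PHI flagged as PHI
    # Sensitivity (recall): share of reference PHI that was detected.
    sens = true_pos / (true_pos + false_neg) if reference else 0.0
    # Positive predictive value (precision): share of detections that are PHI.
    ppv = true_pos / (true_pos + false_pos) if predicted else 0.0
    return sens, ppv
```

For de-identification, a missed identifier (false negative) is generally more costly than an over-redacted term (false positive), which is consistent with sensitivity being the higher of the two figures reported here.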

Second objective: The anonymity survey used 100 automatically de-identified notes, and none was formally identified by the participating healthcare providers. Eight residents and four attending physicians in acute medicine at the Salt Lake City VHA Medical Center participated in the survey, and even residents who had taken care of the patients within the past 3 months did not formally recognize them.
The uniqueness of automatically de-identified clinical documents was estimated by automatically mapping terms in clinical notes from the 2010 i2b2 NLP challenge corpus to ICD-9-CM and CPT-4 codes. About 23% of the notes had a unique ICD-9-CM or CPT-4 code, and might therefore be linked to an identified database that includes these codes.
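One way to read this uniqueness estimate is as the share of notes containing at least one code that appears in no other note of the corpus. The sketch below computes that share under this assumed interpretation; the function name and the `note_id -> set of codes` input layout are hypothetical, and the term-to-code mapping step itself is not shown.

```python
from collections import Counter


def fraction_with_unique_code(note_codes):
    """Share of notes holding at least one code found in no other note.

    note_codes: dict mapping a note id to the set of ICD-9-CM / CPT-4
    codes mapped from that note (hypothetical input layout).
    """
    if not note_codes:
        return 0.0
    # Count, for each code, the number of distinct notes it occurs in.
    notes_per_code = Counter(
        code for codes in note_codes.values() for code in set(codes)
    )
    unique_notes = [
        note_id
        for note_id, codes in note_codes.items()
        if any(notes_per_code[code] == 1 for code in codes)
    ]
    return len(unique_notes) / len(note_codes)
```

A note flagged by this measure is not necessarily re-identifiable; it only indicates that a code combination could serve as a link to an identified data set containing the same codes.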

Third objective: We studied the impact of de-identification on the readability and interpretability of clinical documents and on subsequent information extraction, using an existing corpus of clinical notes from the 2010 i2b2 NLP challenge and part of our VHA clinical narratives corpus. The impact was minimal (0.81-1.87% of clinical terms).

IMPACT:
The creation of a de-identified patient data repository would have significant implications for the future of research within the VHA. Such a repository would provide researchers with greatly increased access to patient data across the entire VHA system, thereby facilitating research projects currently not possible within VHA research confines. The BoB system we have developed would enable the creation of such a repository, and our other findings could guide updated or new policies to access this repository.

PUBLICATIONS:

Journal Articles

  1. Meystre SM, Dalianis H, Aberdeen J, Malin B. Automatic clinical text de-identification: is it worth it, and could it work for me? Studies in health technology and informatics. 2013 Aug 7; 192:1242.
  2. Ferrández Ó, South BR, Shen S, Friedlin FJ, Samore MH, Meystre SM. Generalizability and comparison of automatic clinical text de-identification methods and resources. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium. 2013 Jul 30; 2012:199-208.
  3. Meystre S, Mallin BM. Automatic Clinical Text De-Identification: Is It Worth It, and Could It Work for Me? Medinfo. 2013 Jun 30; 2013:1-3.
  4. Meystre SM, Ferrández O, South BR, Shen S, Samore MH. How much does automatic text de-identification impact clinical problems, tests, and treatments? AMIA Summits on Translational Science proceedings. 2013 Mar 18; 2013:177.
  5. Doing-Harris K, Meystre SM, Samore M, Ceusters W. Applying ontological realism to medically unexplained syndromes. Studies in health technology and informatics. 2013 Jan 1; 192:97-101.
  6. Ferrández O, South BR, Shen S, Friedlin FJ, Samore MH, Meystre SM. BoB, a best-of-breed automated text de-identification system for VHA clinical documents. Journal of the American Medical Informatics Association : JAMIA. 2013 Jan 1; 20(1):77-83.
  7. Kim Y, Garvin J, Heavirland J, Meystre SM. Improving heart failure information extraction by domain adaptation. Studies in health technology and informatics. 2013 Jan 1; 192:185-9.
  8. Ferrández O, South BR, Shen S, Friedlin FJ, Samore MH, Meystre SM. Evaluating current automatic de-identification methods with Veteran's health administration clinical documents. BMC medical research methodology. 2012 Jul 27; 12:109.
  9. Meystre SM, Thibault J, Shen S, Hurdle JF, South BR. Textractor: a hybrid system for medications and reason for their prescription extraction from clinical text documents. Journal of the American Medical Informatics Association : JAMIA. 2010 Sep 1; 17(5):559-62.
  10. Meystre SM, Friedlin FJ, South BR, Shen S, Samore MH. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC medical research methodology. 2010 Aug 2; 10:70.
  11. Mayer J, Shen S, South BR, Meystre S, Friedlin FJ, Ray WR, Samore M. Inductive creation of an annotation schema and a reference standard for de-identification of VA electronic clinical notes. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium. 2009 Nov 14; 2009:416-20.

Conference Presentations

  1. Kim Y, Garvin JH, Heavirland J, Meystre S. Relatedness Analysis of LVEF Qualitative Assessments and Quantitative Values. Poster session presented at: American Medical Informatics Association Spring Congress; 2013 Mar 20; San Francisco, CA.
  2. Meystre S, Ferrandez O, South B, Shen S, Samore MH. How Much Does Automatic Text De-Identification Impact Clinical Problems, Tests, and Treatments? Presented at: American Medical Informatics Association Translational Bioinformatics / Clinical Research Informatics Annual Joint Summits on Translational Science; 2013 Mar 18; San Francisco, CA.
  3. Nokes N, Meystre S, Scehnet JS, South B, Shen S, Maw M. A Survey of VHA Privacy Officers for the External Use of Automatically De-Identified Clinical Documents. Presented at: American Medical Informatics Association Annual Symposium; 2012 Nov 3; Chicago, IL.
  4. Ferrandez O, South B, Shen S, Meystre S. A Hybrid Stepwise Approach for De-identifying Person Names in Clinical Documents. Presented at: North American Chapter of the Association for Computational Linguistics: Human Language Technologies Annual Conference; 2012 Jun 8; Montreal, Canada.
  5. South B, Meystre S, Shen S, Ferrandez O, Nokes N, Maw M. An Evaluation of the Informativeness of De-identified Documents. Presented at: American Medical Informatics Association Spring Congress; 2012 Mar 21; San Francisco, CA.
  6. South B, Shen S, Maw M, Ferrandez O, Meystre S. Prevalence Estimates of Clinical Eponyms in De-Identified Clinical Documents. Presented at: American Medical Informatics Association Spring Congress; 2012 Mar 21; San Francisco, CA.
  7. Ferrandez O, South B, Shen S, Maw M, Nokes N, Meystre S. Striving for Optimal Sensitivity to De-identify Clinical Documents. Presented at: American Medical Informatics Association Translational Bioinformatics / Clinical Research Informatics Annual Joint Summits on Translational Science; 2012 Mar 19; San Francisco, CA.
  8. Samore MH, Meystre S. Coverage of Manual De-Identification on VA Clinical Documents. Poster session presented at: American Medical Informatics Association Annual Symposium; 2011 Nov 3; Washington, DC.
  9. Shen S, South B, Nokes N, Ferrandez O, Meystre S. Estimating the Judges and Coverage required to generate an adequate reference standard for De-Identification of Clinical texts. Poster session presented at: American Medical Informatics Association Annual Symposium; 2011 Oct 21; Washington, DC.
  10. Divita G, Zeng Q, Meystre S, South B, Shen S, Cornia R, Garvin JH, Nebeker JR, Samore MH. Standardization to aid interoperability between NLP systems. Paper presented at: International Society for Disease Surveillance Annual Conference; 2010 Dec 1; Park City, UT.
  11. South B, Shen S, Friedlin FJ, Samore MH, Meystre S. Enhancing Annotation of Clinical Text using Pre-Annotation of Common PHI. Poster session presented at: American Medical Informatics Association Annual Symposium; 2010 Nov 13; Washington, DC.
  12. Ferrandez O, South B, Shen S, Samore MH, Meystre S. Generalizability and comparison of automatic clinical text de-identification methods and resources. Presented at: American Medical Informatics Association Annual Symposium; 2010 Nov 3; Chicago, IL.
  13. South B, DuVall SL, Shen S, Meystre S. Beyond the basics: Building a NLP application and a reference standard with open source tools. Poster session presented at: American Medical Informatics Association Spring Congress; 2010 May 25; Phoenix, AZ.


DRA: Health Systems
DRE: Technology Development and Assessment, Research Infrastructure
Keywords: Data Management
MeSH Terms: none
