SHP 08-179 – HSR&D Study
Career Development Projects
New NLP Tools for Extraction of Values from Microbiology Text
Michael E Matheny MD MS MPH
Tennessee Valley Healthcare System Nashville Campus, Nashville, TN
May 2008 -
Extracting interpretable data from semi-structured free-text records has been a persistent problem in health services research and medical informatics. At least part of each such record lacks regular sentence structure, and a record can contain many values (numeric or categorical) attached to medical information that is expressed either as a question or as a short phrase.
Natural language processing (NLP) is a text extraction method that analyzes sentence structure and syntax to extract meaning from words and phrases, and it can relate terms to other terms in a sentence or paragraph using grammar rules. These systems have improved greatly over the years, and medical NLP systems associate terms with medical ontologies, which encode large bodies of medical knowledge as computer-interpretable relationships. However, applying NLP to semi-structured text is difficult because such text lacks the sentence syntax and grammar that NLP systems rely on to encode relationships.
Blood culture reports are an important example of medical information reported in a semi-structured format, consisting primarily of phrase-and-value pairs. These reports contain critical patient information that informs public health officials about antibiotic resistance and provides definitive diagnoses for sepsis, urinary tract infections, and other infections. Contamination during specimen collection is also a significant problem, with reported rates of 2-3% of all blood cultures and 30-40% of positive blood cultures.
We sought to develop an informatics tool to extract information from semi-structured text records that addresses limitations in the current standard method of expression matching extraction by leveraging the strengths of a medical ontology-based natural language processing system. Microbiology reports were selected because of the challenges this type of report will present to an NLP system and the high yield of medical information present in the reports.
Objective 1: Develop and validate a natural language processing solution for extracting values from semi-structured microbiology reports
A concept-based natural language processing (NLP) system will be adapted to parse values from blood culture microbiology reports. After the system parses a random sample of microbiology reports from each of the hospitals in VISN 9, manual review will be conducted to evaluate whether the system correctly paired antibiotic and bacteria information with minimum inhibitory concentrations and sensitivity interpretations.
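To make the pairing task concrete, the following is a minimal sketch of the baseline expression-matching approach, not the study's NLP system. The report layout (`ORGANISM:` header followed by `<antibiotic> <MIC> <S|I|R>` lines), the sample text, and all names (`SUSCEPTIBILITY_LINE`, `parse_report`, `SAMPLE`) are assumptions made for illustration; real microbiology reports vary by facility.

```python
import re

# Hypothetical line layout: "<ANTIBIOTIC>   <MIC>   <S|I|R>".
# This pattern is an illustrative assumption, not the format of any real report.
SUSCEPTIBILITY_LINE = re.compile(
    r"^\s*(?P<antibiotic>[A-Za-z][A-Za-z/\- ]*?)\s+"
    r"(?P<mic>(?:<=|>=|<|>)?\s*\d+(?:\.\d+)?)\s+"
    r"(?P<interp>[SIR])\s*$"
)
ORGANISM_LINE = re.compile(r"^\s*ORGANISM:\s*(?P<name>.+?)\s*$", re.IGNORECASE)


def parse_report(text):
    """Pair each antibiotic with its MIC and interpretation under the
    most recently seen organism header."""
    organism = None
    rows = []
    for line in text.splitlines():
        org = ORGANISM_LINE.match(line)
        if org:
            organism = org.group("name")
            continue
        hit = SUSCEPTIBILITY_LINE.match(line)
        if hit and organism is not None:
            rows.append({
                "organism": organism,
                "antibiotic": hit.group("antibiotic"),
                # Drop any spaces between the comparator and the number.
                "mic": re.sub(r"\s+", "", hit.group("mic")),
                "interpretation": hit.group("interp"),
            })
    return rows


# Invented sample text, used only to exercise the sketch.
SAMPLE = """\
ORGANISM: ESCHERICHIA COLI
AMPICILLIN      >=32   R
CEFTRIAXONE     <=1    S
GENTAMICIN      2      S
"""

for row in parse_report(SAMPLE):
    print(row)
```

A pattern like this breaks as soon as a facility changes column order or wording, which is the kind of limitation of expression matching that the ontology-based approach is meant to address.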
Objective 2: Develop and validate a rule algorithm to determine whether a positive blood culture result should be considered contaminated
A clinical rule algorithm to detect blood culture contamination will be developed within the NLP system environment in compliance with guidelines and clinical expert opinion. The positive culture data used in Objective 1 will also be used in Objective 2. Two independent clinicians will review the microbiology susceptibility data and determine whether each sample was contaminated. The processed data from Objective 1 will then be evaluated with the rule algorithm to determine the accuracy of automated detection of culture contamination.
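The study does not state its contamination rules, so the sketch below only illustrates the general shape such a rule could take. It assumes a commonly cited heuristic: an organism typical of skin flora that grows in only one of several culture sets is a probable contaminant. The organism list, the function name `is_probable_contaminant`, and its parameters are hypothetical and are not taken from the study.

```python
# Illustrative, incomplete list of organisms often treated as skin commensals.
# This list is an assumption for the sketch, not the study's rule set.
COMMON_SKIN_COMMENSALS = {
    "coagulase-negative staphylococcus",
    "corynebacterium species",
    "bacillus species",
}


def is_probable_contaminant(organism, positive_sets, total_sets):
    """Return True when a typical skin commensal grew in exactly one of
    two or more culture sets (a hypothetical rule for illustration)."""
    name = organism.strip().lower()
    if name not in COMMON_SKIN_COMMENSALS:
        return False
    return total_sets >= 2 and positive_sets == 1


print(is_probable_contaminant("Coagulase-negative Staphylococcus", 1, 2))
print(is_probable_contaminant("Escherichia coli", 1, 2))
```

A rule of this kind only works if the organism name has already been extracted and normalized correctly, which is why the accuracy of contamination detection depends on the extraction step.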
We used the VA NSQIP database to identify surgical admissions from 1999-2006 among VISN 9 hospital facilities. Random training and testing data sets were extracted and manually reviewed to determine which organisms were present and make a determination of contamination. An NLP tool was developed and adapted to extract microbiology culture and sensitivity information. The tool was then iteratively developed for both data extraction and contamination determination using the training data. Finally, it was evaluated on the testing data in order to determine the accuracy of antibiotic susceptibility data extraction and contamination determination.
The automated algorithm correctly mapped the antibiotic and bacteria to the appropriate sensitivity finding in 5073 of 5217 pairs and produced an additional 91 false-positive pairs, for a sensitivity of 97.2% and a positive predictive value of 98.2%. Detection of contamination depends on identification of the bacteria. The algorithm provided an exact match for 717 of 860 organisms and a near match for an additional 96; it missed 47 organisms and falsely added 91. For exact matches, this resulted in a sensitivity of 83.4% and a positive predictive value of 88.7%, with a correct contamination determination in 97.6%.
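The reported percentages can be checked from the counts given above. The sketch below uses the standard definitions, sensitivity = TP / (TP + FN) and positive predictive value = TP / (TP + FP). For organisms, it assumes that near matches and misses both count as false negatives and that only the 91 falsely added organisms count as false positives; that reading reproduces the reported figures but is an inference, not a stated method. The function names are illustrative.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of reference items that the algorithm found."""
    return true_pos / (true_pos + false_neg)


def positive_predictive_value(true_pos, false_pos):
    """Fraction of the algorithm's outputs that were correct."""
    return true_pos / (true_pos + false_pos)


# Antibiotic/bacteria/sensitivity pairs: 5073 of 5217 correct, 91 false positives.
pair_sens = sensitivity(5073, 5217 - 5073)
pair_ppv = positive_predictive_value(5073, 91)

# Organisms: 717 exact matches of 860, 91 falsely added (assumed false positives).
org_sens = sensitivity(717, 860 - 717)
org_ppv = positive_predictive_value(717, 91)

print(f"pairs:     sensitivity {pair_sens:.1%}, PPV {pair_ppv:.1%}")
print(f"organisms: sensitivity {org_sens:.1%}, PPV {org_ppv:.1%}")
```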
These studies highlight the utility of combining a natural language processing system with expression matching to provide potentially more robust and accurate processing of blood culture microbiology reports. These tools can be extended to national VA blood culture data as well as to other types of microbiology reports and other semi-structured text (such as pathology reports).
None at this time.