Knowledge discovery in databases (KDD) techniques were applied to text data from the electronic health record (EHR). We compared two KDD techniques, regular expression-based pattern matching and statistical text mining (STM), on their ability to identify and characterize circumstances of falls treated in the ambulatory care setting. Falls are an important health care issue among aging Veterans. A history of a previous fall is the single most important clinical indicator of risk for additional falls and targets Veterans for fall prevention programs.
The goal of this study was to explore the usefulness of KDD strategies to identify Veterans who seek care for injurious falls in VHA ambulatory care. Objectives: 1) Create a benchmark dataset on which KDD analyses will be conducted; 2) Compare the ability of KDD techniques to identify fall-related ambulatory events (FRAE) with three types of data (text-based notes alone, text-based notes plus information from administrative data, and text-based notes plus information from chart review) using area under the receiver operating characteristic (AUROC) curve analyses; 3) Test the generalizability of the results from VISN 8 in data from other VAMCs; 4) Apply the KDD method found to be most highly predictive in Objective 2 to identify mechanism and place of injury associated with FRAE; and 5) Investigate the effects of embedded semi-structured text passages such as templates on statistical text mining algorithms.
A large dataset was developed using text data from four VAMCs in VISN 8 and two VAMCs outside of VISN 8. All ambulatory encounters (FY 2007) with ICD-9-CM codes for an injury due to a fall (E-880-889) and matched controls with similar injuries but no fall-related E-codes were identified. Encounters on a given day were combined to form visits, and associated documents within 48 hours of the visits were extracted. A benchmark dataset was created using chart review. We compared the ability of regular expression pattern matching and STM to identify falls in text-based notes in a single institution. We compared the ability of STM to identify FRAE in text notes with and without information obtained through chart review in data from VISN 8. We applied models developed in VISN 8 to other VAMCs outside of VISN 8. We explored the ability of STM to identify the mechanism of the fall and the place in which it occurred. Finally, we investigated the impact of removing semi-structured text passages on results of STM models.
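As a rough illustration of the regular expression approach described above, the sketch below flags a note when it contains a fall term that is not preceded by a simple negation cue. The term list, the negation cues, and the function name `mentions_fall` are assumptions made for this example; the rules actually used in the study are not given here and were likely more extensive.

```python
import re

# Hypothetical fall vocabulary, for illustration only (not the study's rule set).
FALL_PATTERN = re.compile(
    r"\b(fell|fall|falls|falling|fallen|slipped|tripped)\b",
    re.IGNORECASE,
)

# Assumed negation cue followed, within a few words, by a fall term.
NEGATION_PATTERN = re.compile(
    r"\b(no|denies|denied|without)\b(?:\W+\w+){0,3}?\W+(fell|fall|falls|falling|fallen)\b",
    re.IGNORECASE,
)


def mentions_fall(note: str) -> bool:
    """Return True if the note contains a fall term outside any negated span."""
    negated_spans = [m.span() for m in NEGATION_PATTERN.finditer(note)]
    for match in FALL_PATTERN.finditer(note):
        start = match.start()
        # Skip fall terms that sit inside a "denies ... falls" style span.
        if not any(lo <= start < hi for lo, hi in negated_spans):
            return True
    return False
```

Rules of this kind are easy to read and audit, but they need hand tuning: a phrase such as "fall season" would be flagged by this sketch, which is one reason to compare them against a learned STM model.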
Documents (n = 27,619) were obtained for 1,652 Veterans with ICD-9-CM codes indicating an ambulatory encounter for an injury coded with E-880-889 and 1,341 matched controls with similar injuries but no fall-related E-codes. Chart review was completed by trained nurse annotators. Regular expression-based rules correctly identified 95.2% of documents with FRAE from one VAMC, a result similar to that of STM. STM models using no information from the chart review (unsupervised term weighting) did not identify FRAE as well as those using information from chart review (supervised term weighting). AUROC values for unsupervised STM models from VISN 8 facilities ranged from 89.9 to 94.3, while those from supervised STM models ranged from 95.2 to 96.4. The supervised STM models improved AUROC values by between 2.1 and 5.3 (p < .001) across the VISN 8 facilities. An STM model based on data from one VAMC and then applied to unseen data from that VAMC or other VISN 8 VAMCs achieved sensitivity ranging from 0.76 to 0.83, and specificity from 0.94 to 0.96. Inferences at the visit and patient level based on these models demonstrated improved sensitivity and reduced specificity. When the STM model developed in VISN 8 was applied to data from two other VAMCs, sensitivity (85.2 and 92.6) was as good as or better than results within VISN 8; however, specificity was less consistent, with values of 94.5 and 77.6, respectively. STM models developed to identify the mechanism of injury resulted in sensitivity of 61.1 and specificity of 94.2, while models for place of injury resulted in sensitivity of 14.5 and specificity of 97.7. We investigated the impact of removing semi-structured data before performing STM and found that this had minimal effect on model performance, suggesting that STM is robust to these data formatting issues.
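For readers less familiar with the metrics reported above, the following minimal sketch shows how sensitivity, specificity, and AUROC can be computed from document-level labels and model scores. The function names and the toy inputs are assumptions for illustration, not the study's code. The functions return fractions between 0 and 1, whereas several of the figures above appear to be expressed as percentages.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels coded 1 = fall, 0 = not fall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg) and is
    meant for clarity, not for large datasets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Sensitivity and specificity depend on the score threshold chosen to call a document a fall, while AUROC summarizes performance across all thresholds, which is why it suits comparisons between term weighting schemes.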
To estimate the potential impact of combining STM and E-codes to identify FRAE, document-level results were rolled up to the patient level. Of the 1,260 patients identified as having a FRAE based on chart review, 885 (70.2%) were identified by both STM and E-codes. A total of 315 (25.0%) and 28 (2.2%) were identified by STM alone and E-codes alone, respectively. Those patients identified by STM alone represent 13.8% of the matched controls. Only 32 patients (2.5%) were not identified by either STM or E-codes as having a fall-related injury.
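The roll-up and cross-tabulation described above can be sketched as follows. The rule that a patient counts as positive when any of their documents is flagged is an assumption for this example, as are the function names; the four counts in `reported` are taken directly from the paragraph above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def roll_up_to_patient(doc_flags: Iterable[Tuple[str, bool]]) -> Dict[str, bool]:
    """Collapse (patient_id, is_fall) document flags: positive if any document is."""
    patient_flags: Dict[str, bool] = defaultdict(bool)
    for patient_id, is_fall in doc_flags:
        patient_flags[patient_id] = patient_flags[patient_id] or bool(is_fall)
    return dict(patient_flags)


def overlap_counts(stm: Dict[str, bool], ecode: Dict[str, bool],
                   patient_ids: Iterable[str]) -> Dict[str, int]:
    """Count patients flagged by both sources, by one source only, or by neither."""
    counts = {"both": 0, "stm_only": 0, "ecode_only": 0, "neither": 0}
    for pid in patient_ids:
        by_stm, by_ecode = stm.get(pid, False), ecode.get(pid, False)
        if by_stm and by_ecode:
            counts["both"] += 1
        elif by_stm:
            counts["stm_only"] += 1
        elif by_ecode:
            counts["ecode_only"] += 1
        else:
            counts["neither"] += 1
    return counts


def percentages(counts: Dict[str, int]) -> Dict[str, float]:
    """Express each count as a percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {key: round(100 * value / total, 1) for key, value in counts.items()}


# Patient-level counts reported in the text (chart-review-confirmed FRAE).
reported = {"both": 885, "stm_only": 315, "ecode_only": 28, "neither": 32}
```

Applying `percentages(reported)` reproduces the 70.2%, 25.0%, 2.2%, and 2.5% figures, and the four counts sum to the 1,260 chart-review-confirmed patients.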
Results of this study suggest that STM can identify FRAE recorded in text notes but not recorded in ICD-9-CM codes and has the potential to improve surveillance of this important health issue. STM was robust across multiple institutions within one VISN but less so across VISNs. Automated systems derived from data within individual institutions would likely improve results. A major advantage of STM is that models can be trained with simple reference standard data sets labeled at the document level (fall/not fall). While we used traditional chart review (which can be very expensive) to create the reference standard in this study, an automated system could be developed based on feedback directly from clinicians. During normal clinical practice, a random sample of results from the model could be presented to volunteer clinicians for classification. After an initial reference set is developed, this sample could be very small, perhaps just several patients a week, but would provide continuously updated reference sets and STM models. VISN and facility level reports could be developed that accurately describe FRAE and the resources expended to treat FRAE, and that identify high risk groups for fall prevention programs. Additionally, applications could be written to allow clinicians and researchers to search the EMR to identify patients at high risk.
- McCart JA, Berndt DJ, Jarman J, Finch DK, Luther SL. Finding falls in ambulatory care clinical documents using statistical text mining. Journal of the American Medical Informatics Association : JAMIA. 2013 Sep 1; 20(5):906-14.
- Finch D, Berndt DJ, Luther SL. Extracting Semi-Structured Text Elements in Medical Progress Notes: A Machine Learning Approach. Poster session presented at: American Medical Informatics Association Annual Symposium; 2012 Nov 3; Chicago, IL.
- Jarman J, McCart J, Luther SL, Berndt DJ. Automated Rule Development Using Text Mining. Poster session presented at: American Medical Informatics Association Annual Symposium; 2012 Nov 3; Chicago, IL.
- McCart J, Berndt DJ, Finch D, Jarman J, Luther SL. Using Statistical Text Mining to Identify Fall-related Injuries in VHA Ambulatory Care Data. Poster session presented at: American Medical Informatics Association Annual Symposium; 2012 Nov 3; Chicago, IL.
- McCart J, Finch D, Jarman J, Luther SL. Identifying Fall-Related Injuries Using Statistical Text Mining: A Preliminary Analysis. Poster session presented at: VA HSR&D / QUERI National Meeting; 2012 Jul 16; National Harbor, MD.
Aging, Older Veterans' Health and Care, Health Systems
Diagnosis, Research Infrastructure, Treatment - Observational, Technology Development and Assessment
Behavior (provider), Healthcare Algorithms, Informatics, Natural Language Processing, Patient outcomes, Predictive Modeling