Health Services Research & Development

2012 HSR&D/QUERI National Conference Abstract

3008 — Identifying Fall-Related Injuries Using Statistical Text Mining: A Preliminary Analysis

McCart JA, Finch DK, Jarman J, and Luther SL, HSR&D/RR&D Center of Excellence, James A. Haley Veterans Hospital

Objectives:
The goal of this study was to determine the effectiveness of statistical text mining (STM) in identifying progress notes about fall-related injuries (FRIs).

Methods:
The dataset consisted of 19,698 outpatient progress notes from 3,008 Veterans at four VAMCs in VISN8 during FY 2007. Each note was annotated as FRI (n = 3,837; 19%) or not FRI (n = 15,861; 81%) by one of three trained nurses, and one clinical expert reviewed a random sample of the annotations. Notes were separated into one training dataset (TRAIN) and two test datasets (TEST1 and TEST2). Three VAMCs each contributed a stratified random selection of 70% of their notes to TRAIN (n = 12,687), with the remaining 30% placed in TEST1 (n = 5,436). Notes from the fourth VAMC were used only for TEST2 (n = 1,575). STM models based on decision trees (DT), logistic regression (LR), and support vector machines (SVM) were built in a two-phase training process using 10-fold stratified cross-validation. A total of 406 model combinations were built and evaluated on TRAIN. For each algorithm, the model with the highest F1 score was then applied to TEST1 and TEST2.
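
As an illustration of the 70/30 stratified selection described above, here is a minimal Python sketch. It assumes the stratification variable is the FRI label, which the abstract does not state. The function name `stratified_split`, the seed, and the toy labels are hypothetical and are not taken from the study.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split indices into train/test so each class keeps roughly the same share.

    Assumption: stratification is by class label; the abstract does not
    say which variable was used.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random selection within the class
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])           # about 70% of this class
        test_idx.extend(indices[cut:])            # the remaining 30%
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy data mirroring the roughly 19% / 81% class balance.
labels = ["FRI"] * 19 + ["not_FRI"] * 81
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

In the study this selection was made separately within each of the three contributing VAMCs; the sketch shows the split for a single site.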

Results:
F1 scores from the best model built during training and then applied to the test datasets are shown by algorithm and dataset (TRAIN/TEST1/TEST2): DT (86.0/84.7/82.1), LR (83.8/83.2/81.2), and SVM (86.0/86.6/83.9). SVM had the highest F1 score on both test datasets and tied with DT on TRAIN. Additional statistics for SVM are shown by dataset (TRAIN/TEST1/TEST2): accuracy (94.5/94.8/94.1), sensitivity (86.0/86.8/83.5), specificity (96.6/96.7/96.5), positive predictive value (86.1/86.4/84.3), and negative predictive value (96.6/96.8/96.3).
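
The F1 score is the harmonic mean of positive predictive value (precision) and sensitivity (recall). The short Python sketch below recomputes F1 from the SVM values reported above as a consistency check. The helper name `f1_score` and the dictionary layout are illustrative choices, not part of the study.

```python
def f1_score(ppv, sensitivity):
    """Harmonic mean of positive predictive value and sensitivity."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# (PPV, sensitivity) for SVM, in percent, as reported in the Results.
svm = {
    "TRAIN": (86.1, 86.0),
    "TEST1": (86.4, 86.8),
    "TEST2": (84.3, 83.5),
}

svm_f1 = {name: round(f1_score(ppv, sens), 1) for name, (ppv, sens) in svm.items()}
for name, value in svm_f1.items():
    print(name, value)
```

The recomputed values, rounded to one decimal place, agree with the reported SVM F1 scores of 86.0, 86.6, and 83.9.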

Implications:
All three algorithms were effective at identifying FRIs, with SVM having the best overall performance. SVM also showed the most consistent performance between the training and test datasets (its F1 score fell 2.1 points from TRAIN to TEST2, compared with 3.9 for DT and 2.6 for LR), which suggests that the trained model generalizes. Future work will evaluate STM using additional VAMCs outside VISN8, with the eventual goal of a nationwide STM-based surveillance system to detect FRIs.

Impacts:
FRIs are an important health care issue, especially among aging Veterans. Despite being documented in medical records, FRIs are significantly under-coded in administrative databases, making it difficult to identify at-risk Veterans and take steps to help prevent future FRIs. This preliminary analysis demonstrated the effectiveness of STM models in identifying FRI notes, which can then be linked to at-risk Veterans.
