Using Text Mining to Differentiate Between PTSD and Mild TBI in OEF/OIF Veterans
Stephen Lee Luther PhD MA
James A. Haley Veterans' Hospital, Tampa, FL
Funding Period: May 2008 - September 2008
In April 2007, the VA issued VHA Directive 2007-013, which established a policy for screening and evaluation of possible traumatic brain injury (TBI) in OEF and OIF veterans: the TBI Clinical Reminder. A recent study found that the TBI Clinical Reminder questions have not been empirically validated, and hypothesized that the deployment-related health problems targeted by such population screening represent a complex clinical phenomenon, with unique and co-morbid conditions often involving mild TBI (mTBI), PTSD, depression, pain, and other conditions. Studying these factors in retrospective data is hampered by the fact that important clinical information is not routinely coded in the electronic medical record (EMR). It is likely, however, that much of this information is available in the free text of the EMR. Automated techniques that search the text-based EMR clinical notes to extract information about OEF veterans could potentially make these data available to clinicians and researchers.
1) Determine if text mining of clinical notes from the VistA electronic medical record (EMR) can reliably distinguish between OEF/OIF veterans who screened positive for both PTSD and mTBI and those who screened positive for PTSD alone.
2) Determine whether any specific type(s) of text notes (e.g. in-patient, outpatient, physician, nursing, psychology, neuropsychology, etc.) contribute more to the ability to classify PTSD patients with and without a reported history of mild TBI.
3) Conduct preliminary analyses to investigate whether the results of the text mining can identify clinical concepts, symptoms, signs, or findings that may improve the accuracy of the TBI Clinical Reminder.
A list of unique OEF/OIF patients from the James A. Haley Veterans' Hospital who screened positive for PTSD and mild TBI was identified based on reports from the VHA Support Service Center (VSSC) website. Text notes (both inpatient and outpatient) for the six months before and the six months after the positive screen were extracted. Social security numbers were replaced with sequential patient IDs, and all variables that contained the names of patients or providers were eliminated. Veterans positive for both screens were labeled as target cases, and those positive for PTSD alone were treated as controls. Notes were concatenated to form a single record for each patient. Statistically based machine learning algorithms were employed to extract information from the text notes. These approaches parse the text and count terms (words or small phrases), constructing a potentially very large term-by-document frequency matrix. Several weighting schemes can be used to refine the raw frequencies, including methods that use target information in the training data (supervised learning). As in principal component analysis, a large term-by-document matrix can be reduced in complexity using singular value decomposition (SVD) or other summarization techniques. In addition, clinical domain knowledge can be incorporated through the inclusion or exclusion of specific terms (using start and stop lists) or through the use of synonyms. The resulting text mining output can then be used for clustering or as input to additional predictive models.
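The parsing, counting, and weighting steps described above can be illustrated with a minimal sketch. It is not the study's software: the stop list, the tokenizer, the function names, and the three one-line "records" are all invented for illustration, and the inverse document frequency formula is one common variant. The SVD reduction step is not shown.

```python
import math
from collections import Counter

# Hypothetical stop list; a real one would draw on clinical domain knowledge.
STOP_LIST = {"the", "a", "of", "and", "for", "with", "no"}

def tokenize(text):
    """Lowercase the text, split on non-letters, and drop stop-list terms."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    return [word for word in cleaned.split() if word not in STOP_LIST]

def term_document_matrix(documents):
    """Return the sorted vocabulary and one term Counter per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    return vocabulary, counts

def idf_weights(vocabulary, counts):
    """Inverse document frequency: log(N / documents containing the term)."""
    n_docs = len(counts)
    return {
        term: math.log(n_docs / sum(1 for c in counts if term in c))
        for term in vocabulary
    }

def weighted_matrix(vocabulary, counts, weights):
    """Rows are documents; columns follow the vocabulary order."""
    return [[c[term] * weights[term] for term in vocabulary] for c in counts]

# Invented one-line "patient records", purely for illustration.
records = [
    "headache and dizziness after blast exposure",
    "nightmares and hypervigilance",
    "headache with nightmares",
]
vocabulary, counts = term_document_matrix(records)
weights = idf_weights(vocabulary, counts)
matrix = weighted_matrix(vocabulary, counts, weights)
```

Terms that occur in fewer records (here "blast") receive a larger weight than terms shared across records (here "headache"), which is the intent of refining raw frequencies before any dimension reduction.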
In the six months before the positive screens, text notes were identified for 43 veterans positive for both PTSD and TBI and 414 veterans positive for PTSD alone; in the six months after, there were 52 veterans positive for both PTSD and TBI and 527 veterans positive for PTSD alone. Text mining models developed on notes from the six months before the positive screen proved unstable. To determine why, text notes for positive cases were manually reviewed, and very little clinical information attributable to TBI was found in these notes. Based on these results, the remainder of the analysis was conducted on the text notes generated in the six months after the positive screen. For this data set, a series of text mining models was developed and fed into logistic regression models to predict whether cases should be classified as mild TBI or not. The use of domain knowledge via explicit term inclusion or exclusion was somewhat limited and remains an area for future research. The models were evaluated through cross-validation (repeated trials with separate training and evaluation data sets). The combination of text mining and logistic regression models retained thus far achieved a predictive accuracy of 85.3%, a sensitivity of 86.5%, and a specificity of 84.7%. The candidate models differed in several ways, including the term weighting methods: information gain and mutual information gain (both of which take advantage of target classes in the training data), along with inverse document frequency, produced competitive models. Several challenges were identified through chart review of the results, including the role of templates embedded in the text notes. Some of the terms extracted during text mining were associated with templates, which can include terms in the prompts and indicate the presence or absence of items with "checkmarks" or other symbols.
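To illustrate how a supervised weighting such as information gain uses the target classes, the sketch below scores a term by how much knowing its presence reduces uncertainty about the case/control label. This is one common textbook definition; the exact formulas used in the study are not given in the source, and the function names and toy labels here are hypothetical.

```python
import math

def entropy(positives, total):
    """Binary entropy in bits for `positives` target cases out of `total`."""
    if total == 0 or positives in (0, total):
        return 0.0
    p = positives / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(term_present, labels):
    """Reduction in label entropy from knowing whether a term occurs.

    term_present: one boolean per patient record (assumed non-empty)
    labels: 1 for a target case, 0 for a control, same length
    """
    n = len(labels)
    with_term = [y for present, y in zip(term_present, labels) if present]
    without_term = [y for present, y in zip(term_present, labels) if not present]
    conditional = sum(
        len(group) / n * entropy(sum(group), len(group))
        for group in (with_term, without_term)
    )
    return entropy(sum(labels), n) - conditional

# Toy data: two target cases followed by two controls.
labels = [1, 1, 0, 0]
perfect_term = [True, True, False, False]   # appears only in target cases
useless_term = [True, False, True, False]   # appears equally in both groups

gain_perfect = information_gain(perfect_term, labels)
gain_useless = information_gain(useless_term, labels)
```

A term that separates the groups completely scores the full label entropy, while a term spread evenly across both groups scores zero, so higher-scoring terms would be weighted more heavily.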
Strategies for preprocessing templates, differing note types, and other embedded patterns are among the approaches being developed for future analyses. Subsequent analyses will also build on these initial findings through more comprehensive use of clinical knowledge by pre-specifying key terms and synonyms.
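For readers less familiar with the accuracy, sensitivity, and specificity figures reported in the findings, the following sketch shows how these measures are conventionally computed from predicted and actual labels. The function name and the ten example labels are invented for illustration and are not study data.

```python
def classification_metrics(actual, predicted):
    """Compute accuracy, sensitivity, and specificity for 0/1 labels.

    Assumes both classes occur at least once in `actual`.
    """
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    true_neg = sum(1 for a, p in pairs if a == 0 and p == 0)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    return {
        # Share of all records classified correctly.
        "accuracy": (true_pos + true_neg) / len(pairs),
        # Share of target cases (label 1) that the model flags.
        "sensitivity": true_pos / (true_pos + false_neg),
        # Share of controls (label 0) that the model leaves unflagged.
        "specificity": true_neg / (true_neg + false_pos),
    }

# Invented labels: four target cases followed by six controls.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(actual, predicted)
```

Under cross-validation, these measures would be computed on each held-out evaluation set and then summarized across the repeated trials.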
Automated techniques that could search the text-based EMR clinical notes to identify mTBI patients and extract information about their condition represent a potentially important resource for VA clinicians and HSR&D researchers. Such a tool could augment the process of identifying patients with this complex and subtle condition, and this study represents an initial step towards developing it. Because mTBI, PTSD, and related conditions are distinct yet often overlap, it is challenging for health care providers to determine which condition or conditions account for the clinical presentation.
None at this time.
DRA: Mental, Cognitive and Behavioral Disorders
MeSH Terms: none