Health Services Research & Development

2011 HSR&D National Meeting Abstract

3061 — Statistical Text Mining to Supplement the Development of a Clinical Vocabulary for PTSD in Veterans

Luther SL (COE-Tampa), Berndt DJ (University of South Florida), Finch D (COE-Tampa), Richardson M (COE-Tampa), Hickling E (COE-Tampa), Hickam D (HSR&D Research Enhancement Program, Portland VA Medical Center)

Objectives:
Statistical text mining was used to supplement efforts to develop a clinical vocabulary to support the development of natural language processing programs for post-traumatic stress disorder (PTSD) in the VA. The objective of this study was to develop a strategy to summarize information obtained from multiple statistical text mining models of progress notes to generate a term list for experts to review and consider for inclusion in the clinical vocabulary.

Methods:
A set of outpatient progress notes was collected at one VA hospital for a cohort of 405 unique veterans with PTSD and a comparison group of 392 veterans with other psychological conditions. Statistical text mining and stepwise logistic regression were applied to these data to develop 21 separate models by combining each of three frequency weight options with each of seven term weight options. The ability of terms to distinguish between notes from PTSD and non-PTSD cases across the logistic regression models was summarized in two ways: the total number of models in which the term was significantly associated with the prediction of PTSD, and the mean value of the term's regression coefficient in the models.
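The two cross-model summaries described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the option names (`fw1`, `tw1`, and so on), the example terms, and the coefficients are all invented, since the abstract reports only the number of options. It also assumes the mean coefficient is taken over the models in which a term was retained, which the abstract does not state.

```python
from collections import defaultdict
from itertools import product
from statistics import mean

# Hypothetical placeholder names: the abstract gives only the counts (3 and 7).
FREQUENCY_WEIGHTS = ["fw1", "fw2", "fw3"]
TERM_WEIGHTS = ["tw1", "tw2", "tw3", "tw4", "tw5", "tw6", "tw7"]


def model_grid():
    """Return every (frequency weight, term weight) pairing: 3 x 7 = 21 models."""
    return list(product(FREQUENCY_WEIGHTS, TERM_WEIGHTS))


def summarize_terms(models):
    """Summarize terms across models.

    `models` maps a model id to {term: coefficient}, holding only the terms
    found significant in that model. Returns, per term, the number of models
    in which it appeared and its mean coefficient across those models
    (an assumption; see the lead-in).
    """
    coefficients = defaultdict(list)
    for significant_terms in models.values():
        for term, coefficient in significant_terms.items():
            coefficients[term].append(coefficient)
    return {
        term: {"n_models": len(values), "mean_coef": mean(values)}
        for term, values in coefficients.items()
    }


def ranked_terms(summary):
    """Order terms for expert review: most models first, then largest mean coefficient."""
    return sorted(
        summary,
        key=lambda term: (-summary[term]["n_models"], -summary[term]["mean_coef"]),
    )


# Invented toy results for three of the 21 models.
toy_models = {
    ("fw1", "tw1"): {"nightmare": 1.5, "combat": 0.75},
    ("fw1", "tw2"): {"nightmare": 0.5},
    ("fw2", "tw1"): {"combat": 0.25, "sleep": -0.25},
}
toy_summary = summarize_terms(toy_models)
toy_ranking = ranked_terms(toy_summary)
```

Sorting by model count before mean coefficient is one plausible way to order a review list; the abstract does not say how the two summaries were combined.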

Results:
The resultant regression models had high sensitivity for correctly classifying the PTSD cases (0.983-0.987 across the 21 models). However, specificity was low across the models (0.317-0.611). Models developed using the information gain term weight had the highest specificity. A maximum of 113 terms were identified in any one model. Combining results of the 21 models identified a total of 450 individual terms for review by the investigators developing the clinical vocabulary.
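For reference, sensitivity and specificity follow from a model's confusion counts as shown in this short sketch. The counts are invented for illustration and are not the study's data; the function name is likewise hypothetical.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Compute sensitivity and specificity from confusion counts.

    tp: PTSD cases classified as PTSD      fn: PTSD cases classified as non-PTSD
    tn: non-PTSD classified as non-PTSD    fp: non-PTSD classified as PTSD
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("each class needs at least one case")
    sensitivity = tp / (tp + fn)  # share of true PTSD cases that were detected
    specificity = tn / (tn + fp)  # share of non-PTSD cases correctly excluded
    return sensitivity, specificity


# Invented counts: a model that flags most cases as PTSD scores high on
# sensitivity and low on specificity, the pattern reported above.
example_sensitivity, example_specificity = sensitivity_specificity(
    tp=90, fn=10, tn=40, fp=60
)
```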

Implications:
This is a preliminary analysis, and ongoing refinement of the methods is warranted. However, we believe the approach is a robust and practical way to conduct this type of analysis.

Impacts:
Inductive approaches such as the one described here hold tremendous promise to assist with the development of clinical vocabularies and ontologies. The method described in this study does not require extensive development of customized software and is therefore easy to implement and accessible to a wide variety of clinicians and researchers.

