2011 National Meeting

3051 — Accounting for Symptom Severity and Other Item Level Characteristics Yields a More Precise Measure of Depression

Kudel I (Cincinnati VAMC), Edwards MC (The Ohio State University), Justice AC (Veterans Affairs Connecticut Healthcare System), Tsevat J (Cincinnati VAMC)

Objectives:
Depression screening is routinely conducted in outpatient settings, but the statistical approach for deriving the score – summing responses across items – assumes that each item has the same psychometric properties and carries equal weight. Alternative state-of-the-art scoring methods such as item response theory (IRT) can differentially weight item-level characteristics such as symptom severity, thereby yielding more precise scores. In this study, we compared scores on a depression measure using the sum-score vs. an IRT-score.

Methods:
Outpatients without HIV (N = 2813) enrolled in the Veterans Aging Cohort Study, an ongoing longitudinal, prospective study, responded to the PHQ-9, a 9-item measure used widely to screen for depression. The data were analyzed to produce a sum-score and an IRT-score. The latter procedure required two steps: 1) a Graded Response Model was fit to estimate item properties, and 2) those item properties were applied to each patient's responses across all items to produce a score.
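
The two-step IRT procedure can be illustrated with a minimal Python sketch. It is not the analysis code used in this study. The item parameters below (`A_PARAMS`, `B_PARAMS`) are hypothetical values invented for illustration, not estimates from the Veterans Aging Cohort Study data, and the abstract does not state which scoring estimator was used, so expected a posteriori (EAP) scoring with a standard normal prior is assumed here. Step 1 (estimating item properties) is taken as already done; the sketch covers step 2 and shows how two response patterns with the same sum-score can receive different IRT scores.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Graded Response Model category probabilities for one item.

    theta: (Q,) grid of latent trait values
    a:     item discrimination (slope)
    b:     (K-1,) ordered thresholds for an item with K response categories
    Returns an array of shape (Q, K); each row sums to 1.
    """
    theta = np.asarray(theta, dtype=float)
    b = np.asarray(b, dtype=float)
    # P(response >= k | theta) for k = 1..K-1
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    # Pad with P(response >= 0) = 1 and P(response >= K) = 0, then difference
    upper = np.hstack([np.ones((theta.size, 1)), p_star])
    lower = np.hstack([p_star, np.zeros((theta.size, 1))])
    return upper - lower

def eap_score(responses, a_params, b_params, grid=None):
    """EAP estimate of the latent trait under a standard normal prior (assumed)."""
    if grid is None:
        grid = np.linspace(-4.0, 4.0, 81)
    prior = np.exp(-0.5 * grid ** 2)  # unnormalized N(0, 1) density
    likelihood = np.ones_like(grid)
    for x, a, b in zip(responses, a_params, b_params):
        likelihood *= grm_category_probs(grid, a, b)[:, x]
    posterior = prior * likelihood
    return float(np.sum(grid * posterior) / np.sum(posterior))

# Hypothetical parameters for 9 items, each scored 0-3, ordered from
# least severe (item 1) to most severe (item 9). Illustrative only.
A_PARAMS = [1.2, 1.5, 1.0, 1.3, 1.1, 1.6, 1.4, 1.7, 2.0]
B_PARAMS = [[-0.5 + s, 0.5 + s, 1.5 + s] for s in np.linspace(0.0, 1.6, 9)]

# Two response patterns with the same sum-score of 3
pattern_a = [3, 0, 0, 0, 0, 0, 0, 0, 0]  # endorses only the least severe item
pattern_b = [0, 0, 0, 0, 0, 0, 0, 0, 3]  # endorses only the most severe item

score_a = eap_score(pattern_a, A_PARAMS, B_PARAMS)
score_b = eap_score(pattern_b, A_PARAMS, B_PARAMS)
print(sum(pattern_a), sum(pattern_b))  # identical sum-scores
print(score_a, score_b)                # different IRT (EAP) scores
```

Under these assumed parameters, the pattern that endorses the more severe item receives the higher IRT score even though both patterns have the same sum-score.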

Results:
Veterans were predominantly male, African-American, and middle-aged. The sum-score procedure yielded all 28 possible scores (range 0-27). The first step of the IRT analysis ordered the 9 items from least severe (changes in sleep) to most severe (suicidal ideation). The second step yielded 931 distinct scores on a standard normal metric (mean = 0, sd = 1), ranging from -1.06 to 2.58. The number of IRT scores per sum-score ranged from 1 to 89, and each sum-score masked an average of 0.43 sd of IRT scores. Patients with a sum-score at the cut-point of 10 (scores >= 10 indicate major depression) differed by as much as 0.63 sd, indicating different levels of depression. Conversely, some patients with dissimilar sum-scores had very similar IRT scores. For example, two respondents had sum-scores of 7 and 10, but IRT scores of 0.484 and 0.485, respectively, indicating that they actually had comparable levels of depression.

Implications:
Modeling self-report data on the PHQ-9 using IRT produces scores that reflect more fine-grained differences among respondents and are therefore more precise indicators of depression.

Impacts:
Applying IRT scoring to the PHQ-9 may revolutionize the detection of depression in primary care if it can be easily administered and prove better at identifying veterans with major depression.