2017 HSR&D/QUERI National Conference

1012 — Development of a Natural Language Processing Engine to Generate Pathology Data to Advance Bladder Cancer Care

Lead/Presenter: Florian Schroeck
All Authors: Schroeck FR (White River Junction VAMC); Patterson OV (VA Salt Lake City Healthcare System); Alba PR (VA Salt Lake City Healthcare System); DuVall SL (VA Salt Lake City Healthcare System); Sirovich B (White River Junction VAMC); Robertson DJ (White River Junction VAMC); Seigne JD (Dartmouth Hitchcock Medical Center); Goodney PP (White River Junction VAMC)

Objectives:
Bladder cancer is the third most prevalent non-cutaneous cancer among Veterans. Most Veterans with bladder cancer have early-stage disease. The standard of care for these Veterans is resection followed by frequent cystoscopic surveillance procedures. Research informing the recommended frequency of surveillance has been limited by the lack of population-based datasets that include longitudinal pathology data. As a first step towards assembling such cohorts, we developed and validated a natural language processing (NLP) engine capable of accurately abstracting pathology data from full-text pathology reports.

Methods:
We used 600 randomly selected bladder pathology reports from the Department of Veterans Affairs. We developed and tested the NLP engine to abstract data on histology, invasion (presence versus absence and depth), grade, presence of muscularis propria, and presence of carcinoma in situ. Our gold standard was based on independent annotation of reports by two urologists, followed by adjudication. We assessed NLP performance by calculating accuracy, positive predictive value (PPV), and sensitivity. We then applied the NLP engine to pathology reports from 10,725 Veterans with bladder cancer.
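
For readers less familiar with these performance measures, the following is a minimal sketch of how accuracy, PPV, and sensitivity could be computed for a single yes/no variable (for example, presence of carcinoma in situ) by comparing NLP output against the adjudicated gold standard. It is not the study's code: the function name, the class name, and the ten example labels are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    ppv: float
    sensitivity: float


def binary_metrics(gold, predicted):
    """Compare NLP output with gold-standard labels for one yes/no variable."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("no reports to evaluate")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # true positives
    fp = sum(1 for g, p in pairs if not g and p)      # false positives
    fn = sum(1 for g, p in pairs if g and not p)      # false negatives
    tn = sum(1 for g, p in pairs if not g and not p)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # PPV: of the reports the NLP engine flagged, the fraction truly positive.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    # Sensitivity: of the truly positive reports, the fraction the engine found.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    return BinaryMetrics(accuracy, ppv, sensitivity)


# Hypothetical labels for ten reports (illustrative only, not study data).
gold = [True, True, True, False, False, False, False, True, False, False]
nlp = [True, True, False, False, False, True, False, True, False, False]

metrics = binary_metrics(gold, nlp)
print(metrics)
```

For a variable with more than two categories, such as depth of invasion, PPV and sensitivity would be computed per category by treating each category in turn as the positive class.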

Results:
The validated engine was capable of abstracting pathologic characteristics for 99% of bladder cancer patients. When comparing the NLP output to the gold standard, NLP achieved the highest accuracy (0.98) for presence of carcinoma in situ. Accuracy for histology, invasion (presence versus absence), grade, and presence of muscularis propria ranged from 0.83 to 0.96. The most challenging variable was depth of invasion (accuracy 0.68), with acceptable PPV for superficial (0.82) and muscle-invasive (0.87) disease.

Implications:
The NLP engine accurately abstracted details from full-text bladder pathology reports for the vast majority of patients.

Impacts:
The data abstracted by the NLP engine now allow for assembly of population-based cohorts with longitudinal pathology data. These cohorts will be used to assess how different surveillance practices affect recurrence and progression of disease. The results will inform best surveillance practices for Veterans with bladder cancer.