Colorectal cancer (CRC) is the second leading cause of cancer death in the United States and the third most commonly diagnosed cancer among Veterans. To reduce the risk of CRC, polyps found at colonoscopy are routinely removed. Current post-polypectomy guidelines recommend repeat colonoscopy in 3 to 10 years based on the size, number, and histology of polyps. However, the sensitivity and specificity of current guidelines for predicting advanced neoplasia (defined as occurrence of CRC or a polyp with features considered high-risk for developing into CRC) are limited. The result is both over- and underutilization of surveillance colonoscopy. This issue is especially relevant to Veterans, who have the highest rates of adenomatous polyps (precursors of most CRCs) in the U.S. Post-polypectomy surveillance may be improved by developing a risk prediction model that includes clinical factors (e.g., age, smoking), polyp factors (e.g., size, location, histology), and quality factors (e.g., colonoscopist adenoma detection rate) associated with advanced neoplasia.
The objectives of this study are to: A) Identify a large, representative cohort of all Veterans with baseline polypectomy and at least one follow-up surveillance colonoscopy between 2000 and 2014 (the VA Colonoscopy Cohort); B) Develop and validate natural language processing (NLP) algorithms to extract key factors required for the prediction model that are only available in free-text colonoscopy and pathology reports; and C) Use the VA Colonoscopy Cohort to develop and validate a risk prediction model for advanced neoplasia. We hypothesize that the new risk prediction model will significantly improve on the sensitivity of current guidelines for predicting advanced neoplasia, without loss of specificity.
We will create NLP algorithms to extract key free-text data that may be associated with risk for advanced neoplasia (e.g., quality factors, polyp characteristics). Algorithms will be validated against manual chart review and then applied to the entire VA Colonoscopy Cohort. We will then develop a prediction model for advanced neoplasia using predictors of recurrent neoplasia. Variables considered will include discrete data readily available in the Corporate Data Warehouse, as well as free-text data derived from our NLP work. Using a randomly sampled training set drawn from the cohort, we will train the model using cross-validated measures of discrimination, calibration, and prediction error. Finally, we will validate the novel prediction model on an independent validation set and estimate its improvement in performance over current guidelines for predicting advanced neoplasia.
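The performance measures named above can be illustrated with a minimal Python sketch. It is not the study's analysis code: the outcome labels, predicted probabilities, guideline classifications, and the 0.3 decision threshold are all made-up values chosen for illustration, and the specific statistics (AUC for discrimination, calibration-in-the-large for calibration, Brier score for prediction error) are assumed stand-ins for whichever measures the study actually uses. Cross-validation itself is omitted; these functions would be evaluated on each held-out fold.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """Discrimination: chance that a random case outscores a random non-case (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, probs):
    """Prediction error: mean squared difference between predicted risk and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)


def calibration_in_the_large(y_true, probs):
    """Calibration: mean predicted risk minus observed event rate (0 is ideal)."""
    return sum(probs) / len(probs) - sum(y_true) / len(y_true)


# Hypothetical data: 1 = advanced neoplasia at surveillance, 0 = none.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
# Hypothetical predicted risks from a model.
model_probs = [0.8, 0.6, 0.3, 0.4, 0.2, 0.1, 0.1, 0.1]
# Hypothetical guideline classification: 1 = high risk, 0 = low risk.
guideline_high_risk = [1, 0, 0, 1, 0, 0, 0, 0]

# Assumed decision threshold for turning model risk into a high/low call.
model_high_risk = [1 if p >= 0.3 else 0 for p in model_probs]

guideline_sens, guideline_spec = sensitivity_specificity(y_true, guideline_high_risk)
model_sens, model_spec = sensitivity_specificity(y_true, model_high_risk)

print(f"guideline: sensitivity={guideline_sens:.2f}, specificity={guideline_spec:.2f}")
print(f"model:     sensitivity={model_sens:.2f}, specificity={model_spec:.2f}")
print(f"AUC={auc(y_true, model_probs):.3f}, Brier={brier_score(y_true, model_probs):.3f}")
print(f"calibration-in-the-large={calibration_in_the_large(y_true, model_probs):+.3f}")
```

In this toy data the model's sensitivity is higher than the guideline's at equal specificity, which is the pattern the study hypothesizes; the numbers themselves carry no evidential weight.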
To date, we have identified 364,040 Veterans with baseline polypectomy and at least one follow-up colonoscopy. This number is substantially larger than our initial projection of 30,000 Veterans. As a result, the likelihood of generating meaningful, interpretable results has markedly increased, and we are likely to have sufficient power to study CRC in isolation, in addition to our previously proposed composite outcome of advanced neoplasia. We have curated several clinical factors (e.g., age, sex, race/ethnicity, body mass index, diabetes, smoking, aspirin) that will be included in the risk prediction model. We have also developed NLP-based algorithms to extract polyp factors (e.g., size, location, histology) and quality factors (e.g., bowel prep, extent of exam, indication) that are only available in free-text colonoscopy and pathology reports.
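As a rough illustration of what extracting a polyp factor from free text involves, the following Python sketch pulls size mentions out of a report with a regular expression and normalizes them to millimeters. It is a toy example and not the project's NLP algorithms: the pattern, the function names, the sample report text, and the 10 mm threshold are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: a number followed by "mm" or "cm" (e.g. "6 mm", "1.2 cm").
SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(mm|cm)\b", re.IGNORECASE)


def polyp_sizes_mm(report_text):
    """Return every size mention in the text, converted to millimeters."""
    sizes = []
    for value, unit in SIZE_PATTERN.findall(report_text):
        size = float(value)
        if unit.lower() == "cm":
            size *= 10  # convert centimeters to millimeters
        sizes.append(size)
    return sizes


def largest_polyp_at_least(report_text, threshold_mm=10.0):
    """True if any extracted size meets the (assumed) threshold in millimeters."""
    sizes = polyp_sizes_mm(report_text)
    return bool(sizes) and max(sizes) >= threshold_mm


# Made-up report text for illustration only.
report = (
    "A 6 mm sessile polyp was removed from the ascending colon. "
    "A 1.2 cm pedunculated polyp was removed from the sigmoid colon."
)

sizes = polyp_sizes_mm(report)
print(sizes)
print(largest_polyp_at_least(report))
```

A pattern this simple would also capture measurements that are not polyp sizes, such as a distance from the anal verge, which is one reason extraction algorithms need validation against manual chart review before being applied to a full cohort.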
The potential impact of the study is substantial. We will create new knowledge on the value of colonoscopy surveillance among individuals with polyps and identify the individuals most likely to benefit from surveillance. Current approaches to risk stratification and management are suboptimal. Improving them may lead to better prevention and detection of CRC and a reduction in colonoscopy over- and under-use. Our ultimate goal is to transform the manner in which Veterans with polyps are advised and managed.
- Bustamante R, Earles A, Murphy JD, Bryant AK, Patterson OV, Gawron AJ, Kaltenbach T, Whooley MA, Fisher DA, Saini SD, Gupta S, Liu L. Ascertainment of Aspirin Exposure Using Structured and Unstructured Large-scale Electronic Health Record Data. Medical Care. 2019 Oct 1; 57(10):e60-e64.
- Demb J, Earles A, Martínez ME, Bustamante R, Bryant AK, Murphy JD, Liu L, Gupta S. Risk factors for colorectal cancer significantly vary by anatomic site. BMJ Open Gastroenterology. 2019 Aug 24; 6(1):e000313.
- Earles A, Liu L, Bustamante R, Coke P, Lynch J, Messer K, Martínez ME, Murphy JD, Williams CD, Fisher DA, Provenzale DT, Gawron AJ, Kaltenbach T, Gupta S. Structured Approach for Evaluating Strategies for Cancer Ascertainment Using Large-Scale Electronic Health Record Data. JCO Clinical Cancer Informatics. 2018 Dec 1; 2:1-12.
- Gupta S, Liu L, Patterson OV, Earles A, Bustamante R, Gawron AJ, Thompson WK, Scuba W, Denhalter D, Martinez ME, Messer K, Fisher DA, Saini SD, DuVall SL, Chapman WW, Whooley MA, Kaltenbach T. A Framework for Leveraging "Big Data" to Advance Epidemiology and Improve Quality: Design of the VA Colonoscopy Collaborative. EGEMS (Washington, DC). 2018 Apr 13; 6(1):4.
- Liu L, Messer K, Baron JA, Lieberman DA, Jacobs ET, Cross AJ, Murphy G, Martinez ME, Gupta S. A prognostic model for advanced colorectal neoplasia recurrence. Cancer Causes & Control. 2016 Oct 1; 27(10):1175-85.
- Scuba W, Tharp M, Mowery D, Tseytlin E, Liu Y, Drews FA, Chapman WW. Knowledge Author: facilitating user-driven, domain content development to support clinical information extraction. Journal of Biomedical Semantics. 2016 Jun 23; 7(1):42.
Cancer, Digestive Diseases
Best Practices, Care Management Tools