Colorectal cancer (CRC) is the 2nd leading cause of cancer death in the United States, and the 3rd most commonly diagnosed cancer among Veterans. To reduce risk for CRC, polyps found at colonoscopy are routinely removed. Current post-polypectomy guidelines recommend repeat colonoscopy in 3 to 10 years based on size, number, and histology of polyps. However, the sensitivity and specificity of current guidelines for predicting advanced neoplasia (defined as occurrence of CRC or a polyp with features considered high-risk for developing into CRC) are limited. The result is surveillance colonoscopy over- and underutilization. This issue is especially relevant to Veterans, who have the highest rates of adenomatous polyps (precursors of most CRCs) in the U.S. Post-polypectomy surveillance may be improved by developing a risk prediction model that includes clinical factors (e.g. age, smoking), as well as polyp factors (e.g. size, location, histology) and quality factors (e.g. colonoscopist adenoma detection rate) associated with advanced neoplasia.
The objectives of this study are to: A) Identify a large, representative cohort of all Veterans with baseline polypectomy and at least one follow-up surveillance colonoscopy (the VA Colonoscopy Cohort) between 2000 and 2014; B) Develop and validate natural language processing (NLP) algorithms to extract key factors required for the prediction model that are only available in free-text colonoscopy and pathology reports; and C) Use the VA Colonoscopy Cohort to develop and validate a risk prediction model for advanced neoplasia. We hypothesize that the new risk prediction model will significantly improve on the sensitivity of current guidelines for predicting advanced neoplasia, without loss of specificity.
We will create NLP algorithms to extract key free-text data that may be associated with risk for advanced neoplasia (e.g. quality factors, polyp characteristics). Algorithms will be validated against manual chart review and then applied to the entire VA Colonoscopy Cohort. We will then develop a prediction model for advanced neoplasia using predictors of recurrent neoplasia. Variables considered will include discrete data readily available in the Corporate Data Warehouse, as well as free-text data derived from our NLP work. Using a random sample training set of the cohort, we will apply cross-validated measures of discrimination, calibration, and prediction error to train the model. Finally, we will validate the novel prediction model on an independent validation set and estimate the improvement in performance compared to current guidelines for prediction of advanced neoplasia.
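The performance measures named above (sensitivity, specificity, discrimination, and calibration-related prediction error) can be made concrete with a minimal sketch. The function names and the toy data are illustrative assumptions, not the study's actual code: discrimination is computed here as the area under the ROC curve via pairwise comparison, and prediction error as the Brier score. The study may use different estimators.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Sensitivity and specificity for binary labels (1 = advanced neoplasia)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Discrimination: probability that a random case scores higher than a
    random non-case (ties count as one half). Quadratic time; fine for a demo."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true: Sequence[int], probs: Sequence[float]) -> float:
    """Prediction error: mean squared difference between predicted risk and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)


# Toy data (hypothetical): four patients, two with advanced neoplasia.
outcomes = [1, 1, 0, 0]
guideline_high_risk = [1, 0, 0, 0]       # binary guideline classification
model_risk = [0.9, 0.4, 0.5, 0.1]        # predicted probabilities from a model

sens, spec = sensitivity_specificity(outcomes, guideline_high_risk)
model_auc = auc(outcomes, model_risk)
model_brier = brier_score(outcomes, model_risk)
```

In a comparison of this kind, the guideline classification and a thresholded model risk would both be scored with `sensitivity_specificity` on the same held-out validation set.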
To date, we have identified 364,040 Veterans with baseline polypectomy and at least one follow-up colonoscopy. This cohort is substantially larger than our initial projection of 30,000 Veterans. As a result, the likelihood of generating meaningful, interpretable results has markedly increased, and we are likely to have sufficient power to study CRC in isolation, in addition to our previously proposed composite outcome of advanced neoplasia. We have curated several clinical factors (e.g. age, sex, race/ethnicity, body mass index, diabetes, smoking, aspirin use) that will be included in the risk prediction model. We have also developed NLP-based algorithms to extract polyp factors (e.g. size, location, histology) and quality factors (e.g. bowel prep, extent of exam, indication) that are only available in free-text colonoscopy and pathology reports.
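As an illustration of what extracting a polyp factor from free text involves, the following is a deliberately simple rule-based sketch for polyp size. It is an assumption for illustration only and does not represent the study's NLP algorithms: the function name, the regular expression, and the sample sentence are all hypothetical, and a real system would also need to handle negation, ranges, and measurements that do not refer to polyps.

```python
import re
from typing import List

# Matches a number followed by a unit, e.g. "6 mm", "6mm", "1.5 cm".
# Assumption: sizes are written as digits plus "mm" or "cm".
_SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mm|cm)\b", re.IGNORECASE)


def extract_sizes_mm(report_text: str) -> List[float]:
    """Return every size mention in the text, converted to millimeters.

    Caveat: this captures any measurement with an mm/cm unit, not only
    polyp sizes, so its output would need validation against chart review.
    """
    sizes = []
    for value, unit in _SIZE_RE.findall(report_text):
        size = float(value)
        if unit.lower() == "cm":
            size *= 10  # convert centimeters to millimeters
        sizes.append(size)
    return sizes


# Hypothetical report sentence, written for this example.
sample = ("A 6 mm sessile polyp in the sigmoid colon and a 1.5 cm "
          "pedunculated polyp in the cecum were removed.")
sizes = extract_sizes_mm(sample)
largest_mm = max(sizes) if sizes else None
```

The largest extracted size is the kind of derived variable that could feed a risk model, which is why validation against manual chart review matters before applying such rules to a full cohort.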
The potential impact of the study is substantial. We will create new knowledge on the value of colonoscopy surveillance among individuals with polyps and identify the individuals most likely to benefit from surveillance. Current approaches to risk stratification and management are suboptimal. Improving them could lead to better prevention and detection of CRC and to a reduction in colonoscopy over- and underuse. Our ultimate goal is to transform the manner in which Veterans with polyps are advised and managed.
- Liu L, Messer K, Baron JA, Lieberman DA, Jacobs ET, Cross AJ, Murphy G, Martinez ME, Gupta S. A prognostic model for advanced colorectal neoplasia recurrence. Cancer Causes & Control. 2016 Oct 1; 27(10):1175-85.
- Scuba W, Tharp M, Mowery D, Tseytlin E, Liu Y, Drews FA, Chapman WW. Knowledge Author: facilitating user-driven, domain content development to support clinical information extraction. Journal of Biomedical Semantics. 2016 Jun 23; 7(1):42.