Screening for colorectal cancer (CRC) is recommended for average-risk persons aged 50 years and older. However, 7-11% of all CRC occurs in persons younger than 50, most of whom have no classic risk factors at the time of diagnosis. These persons are not only younger but also often present with more advanced disease and have a less favorable prognosis than older persons. During the last 20 years, the incidence of CRC has fallen in persons aged 50 and older but has risen steadily in persons under age 50. For these reasons, it is critically important to identify, among Veterans (who are already a high-risk group), those younger than 50 who are at high risk for CRC and may be candidates for "early" screening. From a practical perspective, an efficient way to identify these Veterans using electronic medical record (EMR) data would facilitate implementation.
1) Identify risk factors for sporadic (i.e., non-hereditary) CRC in persons < age 50;
2) Derive and validate a prediction model for quantifying absolute and relative risks for CRC;
3) Compare the accuracy of automated data abstraction using natural language processing for identifying and abstracting risk factor information from VA electronic health information to the gold standard of manual electronic medical record review.
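As an illustration of the third objective, the comparison of automated abstraction against manual review can be sketched as a per-risk-factor agreement calculation. This is a minimal sketch, not the project's actual evaluation code: the function name `compare_to_gold`, the boolean encoding of "risk factor present", and the choice of metrics (sensitivity, specificity, positive predictive value, Cohen's kappa) are assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Agreement:
    sensitivity: float
    specificity: float
    ppv: float
    kappa: float


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def compare_to_gold(nlp: list[bool], manual: list[bool]) -> Agreement:
    """Compare NLP-abstracted flags to manual-review flags (the gold standard).

    Both lists hold one boolean per chart for a single risk factor
    (hypothetical encoding: True = risk factor documented).
    """
    if len(nlp) != len(manual) or not manual:
        raise ValueError("inputs must be non-empty and of equal length")

    tp = sum(1 for n, m in zip(nlp, manual) if n and m)
    fp = sum(1 for n, m in zip(nlp, manual) if n and not m)
    fn = sum(1 for n, m in zip(nlp, manual) if not n and m)
    tn = sum(1 for n, m in zip(nlp, manual) if not n and not m)
    total = tp + fp + fn + tn

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / total
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
    kappa = (observed - expected) / (1 - expected) if expected != 1 else math.nan

    return Agreement(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        kappa=kappa,
    )


if __name__ == "__main__":
    # Toy data for illustration only; not study results.
    nlp_flags = [True, True, False, False, True, False]
    manual_flags = [True, False, False, False, True, True]
    print(compare_to_gold(nlp_flags, manual_flags))
```

Reporting kappa alongside sensitivity and specificity is one common way to separate real agreement from agreement expected by chance when a risk factor is rare or very common.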
Using the VA Central Cancer registry, we will identify incident cases of CRC diagnosed between 2008 and 2014. We will verify case eligibility from manual review of CPRS, excluding those with inflammatory bowel disease, a high-risk family history, polyposis syndrome, or hereditary nonpolyposis colon cancer syndrome. Using medical SAS datasets, we will match each final case to 4 controls during the same time period and validate the control group by using a second control group with a negative (i.e., no neoplasia) diagnostic colonoscopy. The same exclusions will apply to controls, along with previous colectomy of any extent and for any reason. Cases and controls will be matched for facility. Manual review of EMR in VistAweb will be conducted by trained research personnel, who will identify information about candidate risk factors of lifestyle habits (cigarette and ethanol use, occupation, leisure activity/exercise), family cancer history, BMI, socio-demographic features, certain laboratory test results, prior CRC screening test results, and medication use. Logistic regression will be used to identify independent factors associated with CRC. A prediction model will be derived and internally validated. Age- and gender-specific SEER CRC incidence rates will be used in conjunction with the prediction model to provide estimates of absolute and relative CRC risks (or "colon age"). Depending on the magnitude of the absolute risk and how it compares with SEER population risks, CRC screening using some screening modality may be considered. From a methodological perspective, we will create a natural language processing tool and use it to perform automated identification and abstraction on the EMRs of cases and controls, comparing its capture of information to that of manual EMR review.
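The step that combines the prediction model with population incidence rates can be sketched as follows. This is a simplified illustration under stated assumptions, not the study's method: the incidence table holds made-up placeholder values rather than actual SEER rates, the coefficient and feature names are hypothetical, the odds ratio from the logistic model is treated as an approximation of relative risk (reasonable only when the outcome is rare), and "colon age" is taken to mean the youngest age band whose population rate reaches the individual's estimated rate.

```python
import math
from typing import Optional

# HYPOTHETICAL placeholder rates per 100,000 person-years, keyed by the
# starting age of each band. These are NOT actual SEER values.
PLACEHOLDER_INCIDENCE = {40: 10.0, 45: 20.0, 50: 40.0, 55: 60.0, 60: 80.0}


def relative_risk(coefs: dict[str, float], features: dict[str, float]) -> float:
    """Exponentiated linear predictor (intercept excluded).

    This is an odds ratio relative to a person with all features at zero;
    it approximates relative risk only when the disease is rare.
    """
    return math.exp(sum(coefs[name] * features.get(name, 0.0) for name in coefs))


def absolute_risk(rr: float, age_band: int, table: dict[int, float]) -> float:
    """Scale the population rate for the person's age band by their relative risk."""
    return rr * table[age_band]


def colon_age(abs_risk: float, table: dict[int, float]) -> Optional[int]:
    """Youngest age band whose population rate is at least the individual's rate."""
    for age in sorted(table):
        if table[age] >= abs_risk:
            return age
    return None  # risk exceeds every band in the table


if __name__ == "__main__":
    # Hypothetical coefficients on the log-odds scale, for illustration only.
    coefs = {"current_smoker": 0.4, "bmi_over_30": 0.3}
    person = {"current_smoker": 1.0, "bmi_over_30": 1.0}

    rr = relative_risk(coefs, person)
    risk = absolute_risk(rr, 45, PLACEHOLDER_INCIDENCE)
    print(f"relative risk: {rr:.2f}")
    print(f"estimated rate per 100,000 person-years: {risk:.1f}")
    print(f"colon age band: {colon_age(risk, PLACEHOLDER_INCIDENCE)}")
```

A fuller treatment would typically rescale the population rate to a baseline (reference-group) rate before applying the relative risk; the direct multiplication here is kept only to make the idea easy to follow.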
Chart abstractions confirmed the lower exclusion rates of earlier screening. Twenty percent of the original cohort was excluded. Of the included cohort, 65% were 45-49 years old at index and 27% were 40-44. The racial and ethnic composition of the included cohort was 32% Black, 60% White, and 5% unknown, with 6.39% reporting some Hispanic background. The most frequent presenting symptom was rectal bleeding (46% of the cohort), followed by abdominal pain (38%) and blood in stool (30%). Hypertension was the most common comorbidity. Roughly 32% were current tobacco users.
Identifying risk factors for sporadic colorectal cancer (CRC) and creating a prediction model for it will help target high-risk persons for early screening. Such targeting may reduce morbidity and mortality from this particularly devastating disease in this very vital age group, without the need to apply screening broadly to a population where non-targeted screening is likely to cause more harm than good. A natural language processing tool that accurately performs automated identification and data abstraction will facilitate the conduct of health-services research, expediting the completion of research and the implementation of its findings in clinical practice.
None at this time.
Treatment - Observational, TRL - Applied/Translational
Clinical Diagnosis and Screening, Natural Language Processing, Predictive Modeling, Risk Factors, Cancer