The term “big data” is becoming increasingly popular. Although big data has come to mean many things to people in different fields, the term generally refers to data sets so large and complex that processing them with conventional hardware, software, and techniques is extremely difficult or impossible.
Routine clinical care by its very nature generates a vast array of data. Currently, data from clinical notes, imaging and pathology reports, vital sign measurements, and answers to questionnaires are often in unstructured or semi-structured format. As we continue to work with this vast data store, we must develop strategies to effectively utilize and manage these data, or the lack of standardization and structure will pose significant challenges.
We are now at a time when sophisticated data mining techniques are available to help make sense of, and use, big data in health care. These techniques can accumulate serial quantitative structured data points on a patient. Natural language processing (NLP) methods are becoming more capable of extracting and codifying data from unstructured narrative reports.
Data generated by clinical encounters can be used to develop predictive models for specific subpopulations. These models offer the promise of improving the accuracy of inferences a human can achieve unaided, a capability sometimes referred to as “cognitive extension.” This assistance can take various forms, ranging from relatively simple calculators to extremely complex and comprehensive full-scale simulation and prediction models.
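At the simple end of that spectrum, a risk calculator sums point weights for documented risk factors and maps the total to a risk tier. The sketch below illustrates the idea only; the factor names, weights, and thresholds are hypothetical and are not taken from any actual VA model.

```python
# A minimal sketch of a point-based risk calculator.
# All factor names, weights, and tier cutoffs are illustrative assumptions.

def risk_score(factors):
    """Sum hypothetical point weights for documented risk factors."""
    weights = {
        "prior_homelessness": 4,
        "substance_use_disorder": 3,
        "recent_eviction": 3,
        "unemployment": 2,
    }
    return sum(weights.get(f, 0) for f in factors)

def risk_tier(score):
    """Map a total score to a coarse risk tier (cutoffs are arbitrary here)."""
    if score >= 6:
        return "high"
    if score >= 3:
        return "moderate"
    return "low"
```

A full-scale prediction model would replace the hand-set weights with coefficients estimated from longitudinal data, but the calculator shape above is the same one a clinician-facing tool would expose.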
While this sounds modern, the history of predictive modeling runs deep. In 1951, Morris Collen and colleagues, using technology that was antique by today’s standards, introduced computerized “multiphasic” screening and diagnosis at the Kaiser Foundation Health Plan in San Francisco. They used a statistical method to automatically determine the likelihood of a number of diseases based on the analysis of captured and recorded extensive historical, physical, and laboratory data.1
Within VA there are extensive and diverse data available for research. VA Informatics and Computing Infrastructure (VINCI) is an initiative to improve researchers’ access to VA data and to facilitate the analysis of those data while ensuring Veterans’ privacy, confidentiality, and data security. VINCI partners with the VHA Corporate Data Warehouse (CDW) to host its data, and also makes data available from other VA sources. VINCI currently houses data on over 21 million Veterans nationwide. The longitudinal care provided to these Veterans over the past 14 years has generated 2.64 billion clinical notes, 114 million radiology reports, 938 million outpatient encounters, 899 million outpatient prescriptions, and 10 million hospital stays.
Post-deployment homelessness has been a major issue for Veterans after all conflicts and is a priority area for VA. To support VA’s commitment to end homelessness, there is a need to develop electronic algorithms and alerting systems to identify Veterans at risk for homelessness.
Estimates of Veterans experiencing homelessness are based on those who are currently receiving, have previously received, or are currently being directed to specific VA homeless services. Existing methods for risk stratification rely solely on administrative data. Veterans considered “at risk” of homelessness, especially for the first time, are a major focus of VA prevention efforts. Early warning indicators to identify these Veterans are currently inferred only from known risk factors for homelessness that can be gleaned from administrative data.
Our current project, “Current Evidence and Early Warning Indicators of Homelessness Risk Among Veterans,” aims to: (1) use text data to improve the accuracy of determination of the homelessness status of Veterans (including the risk of becoming homeless); and (2) develop and apply predictive models for homelessness and homelessness outcomes in Veterans.
This project builds on the informatics methods that the research team has developed under prior and current VA HSR&D funding. Using NLP, we demonstrated that references to indicators of risk are often recorded by VA providers in the clinical notes prior to the formal identification of Veterans as being homeless.2 We also developed an NLP algorithm for detecting psychosocial concepts from the free text of the clinical narratives written by VA providers.3 These NLP methods have been used across multiple research projects to extract information from the free text contained in clinical narratives.
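A common starting point for this kind of concept detection is a dictionary lookup: each psychosocial concept is paired with a list of trigger terms, and a note is flagged for any concept whose terms appear in its text. The sketch below assumes a tiny illustrative lexicon; the actual concept dictionaries and NLP pipelines developed by the team are far richer and are not reproduced here.

```python
import re

# Illustrative two-concept lexicon; real concept dictionaries are much larger
# and real pipelines add negation handling, context, and section awareness.
PSYCHOSOCIAL_LEXICON = {
    "housing_instability": ["evicted", "homeless", "shelter", "couch surfing"],
    "social_isolation": ["lives alone", "no family support", "estranged"],
}

def detect_concepts(note_text):
    """Return the set of concepts whose trigger terms appear in a note."""
    text = note_text.lower()
    found = set()
    for concept, terms in PSYCHOSOCIAL_LEXICON.items():
        # Whole-word matching so "shelter" does not match "sheltered" etc.
        if any(re.search(r"\b" + re.escape(t) + r"\b", text) for t in terms):
            found.add(concept)
    return found
```

Run over millions of notes, even a lookup this simple can surface indicators of risk well before a Veteran is formally identified as homeless, which is the pattern the cited work documented.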
We anticipate big data playing an increasingly central role in clinical practice and decision support both at the point of care and at the population level. Use of such methods will ultimately lead to improvement in the care provided to Veterans.
1. Krall MA, Gundlapalli AV, Samore MH. “Big Data and Population-Based Decision Support,” in Clinical Decision Support: The Road to Broad Adoption, RA Greenes, Editor. Academic Press, 2014.
2. Redd A, et al. “Detecting Earlier Indicators of Homelessness in the Free Text of Medical Records,” International Conference on Informatics, Management, and Technology in Healthcare. Athens, Greece: IOS Press, 2014.
3. Gundlapalli AV, et al. “Validating a Strategy for Psychosocial Phenotyping using a Large Corpus of Clinical Text,” Journal of the American Medical Informatics Association 2013.