HSR Citation Abstract

Text classification performance: is the sample size the only factor to be considered?

Figueroa RL, Zeng-Treitler Q. Text classification performance: is the sample size the only factor to be considered? Studies in health technology and informatics. 2013 Jan 1; 192:1193.

Related HSR&D Project(s)

HIR 09-005 – Consortium for Health Care Informatics Research: Information Extraction

Abstract:

The use of text mining and supervised machine learning algorithms on biomedical databases has become increasingly common. However, a question remains: how much data must be annotated to create a suitable training set for a machine learning classifier? In prior research on active learning in medical text classification, we found evidence that not only sample size but also some intrinsic characteristics of the texts being analyzed, such as vocabulary size and document length, may influence the resulting classifier's performance. This study attempts to build a regression model that predicts performance from sample size and other text features. Although the model must be trained on existing datasets, we believe that, once it is built, it is feasible to predict performance on new datasets without obtaining annotations for them.
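
As a rough illustration of the idea in the abstract, the sketch below computes two of the text features it mentions (vocabulary size and mean document length) and fits an ordinary least-squares model to predict a performance score. The abstract does not state the form of the regression, the exact feature definitions, or the tokenization, so the linear form, whitespace tokenization, and all function names here (`text_features`, `fit_linear`, `predict`) are assumptions made for illustration only, not the authors' method.

```python
def text_features(docs):
    """Return (vocabulary size, mean document length in tokens).

    Assumption: lowercase whitespace tokenization; the study's actual
    feature definitions are not given in the abstract.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = {tok for doc in tokenized for tok in doc}
    mean_len = sum(len(doc) for doc in tokenized) / len(tokenized)
    return len(vocab), mean_len


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    Assumes A is square and non-singular (a singular A raises
    ZeroDivisionError).
    """
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_linear(X, y):
    """Ordinary least squares with an intercept, via the normal equations.

    X: one row per existing dataset, e.g. (sample size, vocabulary size,
       mean document length). y: the observed performance per row.
    Returns [intercept, coefficient_1, ...].
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(XtX, Xty)


def predict(coef, x):
    """Predicted performance for one feature row x."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))
```

In this sketch, `fit_linear` would be trained on rows from datasets that are already annotated, and `predict` would then be applied to the features of a new, unannotated corpus. The normal-equations solver is used only to keep the example dependency-free; a statistics library would be the more usual choice in practice.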
