unbalanced data

Screening PubMed abstracts: is class imbalance always a challenge to machine learning?

We combined four machine learning techniques with four data preprocessing strategies for class imbalance to identify the best-performing approach for screening PubMed articles for inclusion in systematic reviews. We used the textual data of 14 systematic reviews as case studies. Meta-analytic fixed-effect models were used to pool delta AUCs separately by classifier and by strategy. Resampling techniques slightly improved the performance of the investigated machine learning techniques. From a computational perspective, random undersampling 35:65 may be preferred.
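
To make the random undersampling step concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes that 35:65 means included (positive) to excluded (negative) abstracts, that all positives are kept, and that negatives are dropped at random; the function name `undersample`, its parameters, and the toy label vector are hypothetical.

```python
import random

def undersample(labels, minority_share=0.35, seed=0):
    """Return sorted row indices to keep so that positives make up
    roughly `minority_share` of the retained rows.

    Assumption: label 1 = included abstract (minority class),
    label 0 = excluded abstract (majority class).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Negatives needed to reach the target ratio when every positive is kept.
    n_neg = round(len(pos) * (1 - minority_share) / minority_share)
    # Never ask for more negatives than are available.
    n_neg = min(n_neg, len(neg))
    kept_neg = rng.sample(neg, n_neg)
    return sorted(pos + kept_neg)

# Toy screening set: 35 relevant abstracts among 1000 retrieved records.
labels = [1] * 35 + [0] * 965
kept = undersample(labels)
n_pos = sum(labels[i] for i in kept)
print(n_pos, len(kept) - n_pos)  # 35 65
```

In this sketch the training set shrinks from 1000 to 100 records, which illustrates why undersampling is attractive computationally: the classifier is fitted on far fewer rows than with oversampling or with the original data. Resampling would normally be applied to the training split only, leaving the evaluation split at its original class distribution.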