C. Padurariu, M. E. Breaban

Dealing with Data Imbalance in Text Classification

Machine Learning, Natural Language Processing

Many real-world datasets are imbalanced: some classes are far better represented than others, so regular classifiers receive too little training input for the minority classes. Imbalanced data raises problems in Machine Learning classification, because predicting an outcome becomes difficult when there is not enough data to learn from.

The object of classification in our study is data from the field of Human Resources: short descriptions of work experience that must be classified into several highly imbalanced classes expressing job types. We perform an extensive experimental analysis using various representations of text data, several classification algorithms and several balancing schemes to derive a model that achieves the highest performance with respect to metrics such as precision and recall.
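To illustrate why precision and recall are preferred over plain accuracy on imbalanced classes, the following minimal sketch computes both metrics per class. The job labels and the always-majority classifier are hypothetical toy values, not data or models from the study.

```python
from collections import Counter

def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for every label in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        # Convention used here: a metric with an empty denominator is 0.0.
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        scores[c] = (precision, recall)
    return scores

# Hypothetical toy labels: nine "engineer" descriptions and one "nurse".
y_true = ["engineer"] * 9 + ["nurse"]
# A classifier that always predicts the majority class.
y_pred = ["engineer"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_precision_recall(y_true, y_pred)
print(accuracy, scores)
```

In this toy case the accuracy is 0.9, yet the minority class has a recall of 0.0, which is the kind of failure that per-class precision and recall expose.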

The contribution is twofold:
a) using a comprehensive experimental design, we study the interactions between classification algorithms, text vectorization choices and the schemes for dealing with data imbalance, at several degrees of imbalance;
b) besides state-of-the-art balancing schemes, we propose and analyze a cost-sensitive approach formulated as a numerical optimization problem in which the costs are derived with a Differential Evolution algorithm in two steps: first, costs are optimized at the class level; then, costs are refined at the data instance level.
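The class-level step of such a cost-sensitive scheme can be sketched as follows. This is an illustrative sketch under several assumptions, not the method evaluated in the study: the data are synthetic 1-D Gaussians, the costs rescale class posteriors at decision time, the fitness is macro-averaged F1, and the optimizer is a plain rand/1/bin Differential Evolution. All names and parameter values are hypothetical, and the instance-level refinement step is not shown.

```python
import math
import random

rng = random.Random(0)

# Hypothetical toy problem: three classes with strongly imbalanced priors.
MEANS = [0.0, 1.5, 3.0]
PRIORS = [0.90, 0.07, 0.03]

def posterior(x):
    """Class posteriors of a fixed base classifier (unit-variance Gaussians)."""
    scores = [p * math.exp(-0.5 * (x - m) ** 2) for p, m in zip(PRIORS, MEANS)]
    total = sum(scores)
    return [s / total for s in scores]

def sample(n):
    """Draw n (posterior vector, true label) pairs from the toy distribution."""
    data = []
    for _ in range(n):
        label = rng.choices(range(3), weights=PRIORS)[0]
        data.append((posterior(rng.gauss(MEANS[label], 1.0)), label))
    return data

def macro_f1(data, costs):
    """Macro-averaged F1 when class k's posterior is multiplied by costs[k]."""
    preds = []
    for probs, _ in data:
        weighted = [c * p for c, p in zip(costs, probs)]
        preds.append(weighted.index(max(weighted)))
    f1s = []
    for k in range(3):
        tp = sum(1 for p, (_, y) in zip(preds, data) if p == k and y == k)
        fp = sum(1 for p, (_, y) in zip(preds, data) if p == k and y != k)
        fn = sum(1 for p, (_, y) in zip(preds, data) if p != k and y == k)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

def differential_evolution(fitness, dim, pop_size=20, generations=30,
                           F=0.7, CR=0.9, low=0.1, high=50.0):
    """Maximize fitness over [low, high]^dim with rand/1/bin DE."""
    # Seed the population with uniform costs so the result is never worse
    # than the unweighted baseline.
    pop = [[1.0] * dim] + [[rng.uniform(low, high) for _ in range(dim)]
                           for _ in range(pop_size - 1)]
    fit = [fitness(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for j in range(dim):
                if rng.random() < CR or j == j_rand:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    trial.append(min(max(v, low), high))  # clip to bounds
                else:
                    trial.append(pop[i][j])
            f = fitness(trial)
            if f >= fit[i]:  # greedy selection: keep the trial if not worse
                pop[i], fit[i] = trial, f
    best = max(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]

validation = sample(600)
baseline_f1 = macro_f1(validation, [1.0, 1.0, 1.0])
best_costs, best_f1 = differential_evolution(
    lambda costs: macro_f1(validation, costs), dim=3)
print(baseline_f1, best_f1, best_costs)
```

In a real experiment the costs would be tuned on a validation split and the resulting model scored on a separate test split, so that the reported metrics are not optimistically biased.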

The results indicate that cost-sensitive classifiers whose cost matrices are optimized with a Differential Evolution algorithm bring important benefits on our real-world problem.

This article is also authored by Synbrain data scientists and collaborators.