Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric
Authors
Abstract
Data imbalance is frequently encountered in biomedical applications. Resampling techniques can be used in binary classification to tackle this issue. However, such solutions are undesirable when the number of samples in the minority class is limited. Moreover, the use of inadequate performance metrics, such as accuracy, leads to poor generalization because classifiers tend to predict the largest class. One good approach to this issue is to optimize performance metrics that are designed to handle data imbalance. The Matthews Correlation Coefficient (MCC) is widely used in bioinformatics as a performance metric. We are interested in developing a new classifier based on the MCC metric to handle imbalanced data. We derive an optimal Bayes classifier for the MCC metric using an approach based on the Fréchet derivative. We show that the proposed algorithm has the theoretical property of consistency. Using simulated data, we verify the correctness of our optimality result by searching the space of all possible binary classifiers. The proposed classifier is evaluated on 64 datasets spanning a wide range of data imbalance. We compare both classification performance and CPU efficiency for three classifiers: 1) the proposed algorithm (MCC-classifier), 2) the Bayes classifier with a default threshold (MCC-base), and 3) imbalanced SVM (SVM-imba). The experimental evaluation shows that MCC-classifier performs close to SVM-imba while being simpler and more efficient.
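To make the metric concrete, the following is a minimal sketch of how MCC is computed from confusion-matrix counts and why it behaves differently from accuracy on imbalanced data. It is not the paper's algorithm. The abstract does not describe the form of the optimal classifier, so the threshold scan below is only an illustrative assumption: it picks the cut-off on a list of made-up scores that maximizes empirical MCC. The helper names (`mcc`, `confusion`, `best_mcc_threshold`) and the toy data are hypothetical.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when a row or column of the matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def best_mcc_threshold(y_true, scores):
    """Hypothetical helper: scan observed scores as thresholds and
    return the (threshold, mcc) pair with the highest empirical MCC."""
    best = (0.5, -1.0)
    for thr in sorted(set(scores)):
        y_pred = [1 if s >= thr else 0 for s in scores]
        value = mcc(*confusion(y_true, y_pred))
        if value > best[1]:  # strict: keep the first threshold reaching the maximum
            best = (thr, value)
    return best


# Toy imbalanced data (made up): 2 positives out of 10 samples.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.30, 0.35, 0.45]

# With the default 0.5 threshold every sample is predicted negative:
# accuracy is 0.8, yet MCC is 0.0 because no positive is ever found.
default_pred = [1 if s >= 0.5 else 0 for s in scores]
accuracy = sum(t == p for t, p in zip(y_true, default_pred)) / len(y_true)
print(accuracy, mcc(*confusion(y_true, default_pred)))
print(best_mcc_threshold(y_true, scores))
```

On this toy data the default threshold yields a classifier that looks acceptable by accuracy but carries no information by MCC, which is the failure mode the abstract attributes to accuracy-driven classifiers.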
Similar resources
Evaluation of Classifiers in Software Fault-Proneness Prediction
Software reliability depends on its fault-prone modules: the fewer fault-prone units a piece of software contains, the more we may trust it. Therefore, if we can predict the number of fault-prone modules in a piece of software, we can judge its reliability. In predicting fault-prone software modules, one of the contributing features is software metric by which one ...
Synthetic Protein Sequence Oversampling Method for Classification and Remote Homology Detection in Imbalanced Protein Data
Many classifiers are designed with the assumption of well-balanced datasets. But in real problems, like protein classification and remote homology detection, when using binary classifiers like support vector machine (SVM) and kernel methods, we face imbalanced data in which we have a low number of protein sequences as positive data (minor class) compared with negative data (major class). A...
Optimization of General Statistical Accuracy Measures for Classification Based on Learning Vector Quantization
We propose a framework for classification learning based on generalized learning vector quantization using statistical quality measures as cost function. Statistical measures like the F-measure or the Matthews correlation coefficient reflect performance on two-class classification problems better than simple accuracy, in particular if the data classes are imbalanced. For this purpose,...
Bias: The Use of Machine Learning in Software Defect Prediction: Supplementary Materials
As mentioned in the main paper, one slightly awkward property of the Matthews Correlation Coefficient (MCC) is that depending upon the marginal distributions of the confusion matrix, plus or minus unity may not be attainable and so the theoretical maxima and minima are constrained. Some statisticians propose a φ/φmax rescaling [1]. We choose not to follow this procedure since it results in an o...
How to evaluate an agent's behavior to infrequent events?—Reliable performance estimation insensitive to class distribution
In everyday life, humans and animals often have to base decisions on infrequent relevant stimuli with respect to frequent irrelevant ones. When research in neuroscience mimics this situation, the effect of this imbalance in stimulus classes on performance evaluation has to be considered. This is most obvious for the often used overall accuracy, because the proportion of correct responses is gov...