Alleviating Classification Problem of Imbalanced Dataset
Abstract
The class imbalance problem occurs when some classes have many more instances than others, i.e. when the class distribution is skewed. In such cases, a standard classifier tends to be overwhelmed by the majority class and to ignore the minority class. It is one of the 10 challenging problems of data mining research and pattern recognition. An imbalanced dataset degrades classifier performance because accuracy is biased towards the majority class. Several techniques have been proposed to solve this problem. This paper aims to improve the true positive rate (detection) of the minority class, Gestational Diabetes Mellitus (GDM), which is the class of interest. The study proposes the use of two under-sampling techniques reported in the literature. Both under-sample the majority class to balance the dataset before classification. These under-sampling schemes were evaluated on three learning algorithms (pruned decision tree, unpruned decision tree, and RIPPER) using the Matthews Correlation Coefficient (MCC) and the Kappa statistic as metrics. The under-sampling techniques were assessed in the medical domain. The real-life dataset collected contained 886 instances of patients with diabetes mellitus. The diagnosis fell into three classes with the following distribution: TYPE1 with 62 instances, TYPE2 with 807 instances, and GDM, the class of interest, with only 17 instances. The study revealed that, compared with the original dataset and the random under-sampling (RUS) dataset, the neighbourhood cleaning rule (NCL) dataset achieved a better true positive rate for the minority class, as well as high MCC and Kappa values, with all three learning algorithms.
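The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the under-sampler reduces every class to the size of the smallest one (an assumption, since the abstract only states that the majority class is under-sampled), and MCC is computed with the standard multiclass generalisation, which the abstract does not confirm was the variant used. The label list reproduces only the class counts reported above (62, 807, 17); no patient data is involved.

```python
import math
import random
from collections import Counter


def random_under_sample(labels, seed=0):
    """Return sorted indices that keep an equal number of instances per class.

    Assumption: every class is cut down to the size of the smallest class.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, target))
    return sorted(kept)


def confusion_counts(y_true, y_pred):
    """Build a nested dict m[true_class][predicted_class] of counts."""
    classes = sorted(set(y_true) | set(y_pred))
    matrix = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return classes, matrix


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(y_true)
    classes, m = confusion_counts(y_true, y_pred)
    observed = sum(m[c][c] for c in classes) / n
    expected = sum(
        sum(m[c].values()) * sum(m[t][c] for t in classes) for c in classes
    ) / (n * n)
    if expected == 1.0:
        return 0.0
    return (observed - expected) / (1.0 - expected)


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient from the confusion matrix."""
    n = len(y_true)
    classes, m = confusion_counts(y_true, y_pred)
    correct = sum(m[c][c] for c in classes)
    true_tot = [sum(m[c].values()) for c in classes]
    pred_tot = [sum(m[t][c] for t in classes) for c in classes]
    numerator = correct * n - sum(t * p for t, p in zip(true_tot, pred_tot))
    denominator = math.sqrt(
        (n * n - sum(p * p for p in pred_tot))
        * (n * n - sum(t * t for t in true_tot))
    )
    return numerator / denominator if denominator else 0.0


# Class counts taken from the abstract; the labels themselves are synthetic.
labels = ["TYPE1"] * 62 + ["TYPE2"] * 807 + ["GDM"] * 17

# A classifier that always predicts the majority class looks accurate
# but has no skill according to kappa and MCC.
majority_pred = ["TYPE2"] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, majority_pred)) / len(labels)

kept = random_under_sample(labels, seed=42)
balanced = Counter(labels[i] for i in kept)

print(f"accuracy={accuracy:.3f}")
print(f"kappa={cohen_kappa(labels, majority_pred):.3f}")
print(f"mcc={multiclass_mcc(labels, majority_pred):.3f}")
print(dict(balanced))
```

The always-TYPE2 baseline shows why accuracy alone is misleading on this distribution: it scores about 91% accuracy while never detecting a single GDM case, which is what chance-corrected metrics such as MCC and Kappa are meant to expose.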
Similar resources
Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining
This paper presents a data mining application in metabolomics. It aims at building an enhanced machine learning classifier that can be used for diagnosing cachexia syndrome and identifying its involved biomarkers. To achieve this goal, a data-driven analysis is carried out using a public dataset consisting of 1H-NMR metabolite profile. This dataset suffers from the problem of imbalanced classes...
Improving Imbalanced data classification accuracy by using Fuzzy Similarity Measure and subtractive clustering
Classification is one of the important parts of data mining and knowledge discovery. In most cases, the data used for training the clusters is not well distributed. This inappropriate distribution occurs when one class has a large number of samples while the number of samples in the other class is inherently low. In general, the methods of solving this kind of prob...
Proposing a Novel Cost Sensitive Imbalanced Classification Method based on Hybrid of New Fuzzy Cost Assigning Approaches, Fuzzy Clustering and Evolutionary Algorithms
In this paper, a new hybrid methodology is introduced to design a cost-sensitive fuzzy rule-based classification system. A novel cost metric is proposed based on the combination of three different concepts: Entropy, Gini index and DKM criterion. In order to calculate the effective cost of patterns, a hybrid of fuzzy c-means clustering and particle swarm optimization algorithm is utilized. This ...
Imbalanced Dataset Classification and Solutions: a Review
The imbalanced data set problem occurs in classification when the number of instances of one class is much lower than the number of instances of the other classes. The main challenge in the imbalance problem is that the small classes are often more useful, but standard classifiers tend to be weighed down by the huge classes and ignore the tiny ones. In machine learning, imbalanced datasets have become a cri...
An Improved Instance Based K-Nearest Neighbor (IIBK) Classification of Imbalanced Datasets with Enhanced Preprocessing
The presence of data with skewed class distributions is a problem common to a variety of fields, including bioinformatics, computer science, text classification, remote sensing, and manufacturing industries. In bioinformatics applications, the number of non-interacting proteins (majority class) is greater than the number of interacting proteins (minority class) in predicting the protein-protein i...