Semi-Supervised Self-training Approaches for Imbalanced Splice Site Datasets
Abstract
Machine learning algorithms produce accurate classifiers when trained on large, balanced datasets. However, labeled data is generally expensive to acquire, while unlabeled data is available in much larger amounts. A cost-effective alternative is semi-supervised learning, which uses unlabeled data to improve supervised classifiers. Furthermore, for many practical problems, data often exhibits imbalanced class distributions, and learning becomes more challenging in both supervised and semi-supervised scenarios. While the problem of supervised learning from imbalanced data has been studied extensively, it has received little attention in the semi-supervised setting. Thus, in this study, we carry out an empirical evaluation of a semi-supervised learning algorithm, specifically self-training based on Naïve Bayes Multinomial (NBM), and address the issue of imbalanced class distributions both at the data level (by re-sampling) and at the algorithmic level (using cost-sensitive learning and ensembles). We conduct our study on the problem of splice site prediction, a problem for which the ratio of negative to positive examples is very high. Our experiments on five different datasets show that a simple method that adds only positive instances to the labeled data in the semi-supervised iterations produces consistently better results compared with other methods that deal with data imbalance.
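The best-performing setup described above — self-training with a multinomial Naïve Bayes base classifier that moves only confidently predicted *positive* unlabeled instances into the labeled set — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names, the confidence threshold, and the stopping rule are assumptions made for the example.

```python
import numpy as np

def train_mnb(X, y, alpha=1.0):
    """Fit a multinomial Naive Bayes model on count features X, labels y."""
    classes = np.unique(y)
    log_priors = np.log(np.array([np.mean(y == c) for c in classes]))
    # Laplace-smoothed per-class feature probabilities
    counts = np.array([X[y == c].sum(axis=0) + alpha for c in classes])
    feat_logprob = np.log(counts / counts.sum(axis=1, keepdims=True))
    return classes, log_priors, feat_logprob

def predict_proba(model, X):
    """Posterior class probabilities under the fitted model."""
    classes, log_priors, feat_logprob = model
    joint = X @ feat_logprob.T + log_priors      # log P(x, c) up to a constant
    joint -= joint.max(axis=1, keepdims=True)    # stabilise the exponentials
    p = np.exp(joint)
    return p / p.sum(axis=1, keepdims=True)

def self_train_positive_only(X_l, y_l, X_u,
                             pos_label=1, threshold=0.9, max_iter=10):
    """Self-training loop that adds only confident positive predictions
    from the unlabeled pool to the labeled set (threshold is illustrative)."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(max_iter):
        model = train_mnb(X_l, y_l)
        classes = model[0]
        pos_idx = np.where(classes == pos_label)[0][0]
        proba = predict_proba(model, X_u)
        confident = proba[:, pos_idx] >= threshold
        if not confident.any():
            break
        # Move only the confident positives into the labeled set
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, np.full(confident.sum(), pos_label)])
        X_u = X_u[~confident]
        if len(X_u) == 0:
            break
    return train_mnb(X_l, y_l)
```

Because only minority-class (positive) instances are added, the labeled set becomes progressively less imbalanced across iterations, which is the intuition behind the method's consistent advantage reported in the abstract.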
Related papers
Reserved Self-training: A Semi-supervised Sentiment Classification Method for Chinese Microblogs
The imbalanced sentiment distribution of microblogs induces poor performance of binary classifiers on the minority class. To address this problem, we present a semi-supervised method for sentiment classification of Chinese microblogs. This method is similar to self-training, except that a set of labeled samples is reserved for computing confidence scores, through which samples that are ...
Iterative Nearest Neighborhood Oversampling in Semisupervised Learning from Imbalanced Data
Transductive graph-based semi-supervised learning methods usually build an undirected graph using both labeled and unlabeled samples as vertices. These methods propagate label information from labeled samples to their neighbors through the edges in order to predict the labels of unlabeled samples. Most popular semi-supervised learning approaches are sensitive to the initial label distribution wh...
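The graph-based propagation this excerpt describes can be sketched with a simple clamped iteration: labels spread along edges while labeled vertices keep their known labels. This is a generic illustration of the technique, not code from the cited paper; the affinity matrix, iteration count, and uniform initialisation for unlabeled nodes are assumptions.

```python
import numpy as np

def label_propagation(W, y_init, labeled_mask, n_iter=100):
    """Propagate label distributions over a graph with affinity matrix W.
    y_init holds one-hot rows for labeled nodes and uniform rows for
    unlabeled ones; labeled nodes are clamped to their known labels."""
    # Row-normalise the symmetric affinity matrix into a transition matrix
    P = W / W.sum(axis=1, keepdims=True)
    F = y_init.astype(float).copy()
    for _ in range(n_iter):
        F = P @ F                                # spread label mass along edges
        F[labeled_mask] = y_init[labeled_mask]   # clamp the labeled vertices
    return F.argmax(axis=1)                      # predicted label per vertex
```

On a connected graph this clamped iteration converges to the harmonic solution, so each unlabeled vertex ends up with the label that dominates among its graph neighbours — which is also why the result is sensitive to the initial label distribution, as the excerpt notes.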
Empowering Imbalanced Data in Supervised Learning: A Semi-supervised Learning Approach
We present a framework to address the imbalanced data problem using semi-supervised learning. Specifically, from a supervised problem, we create a semi-supervised problem and then use a semi-supervised learning method to identify the most relevant instances with which to establish a well-defined training set. We present extensive experimental results, which demonstrate that the proposed framework significa...
Semi-Supervised Boosting for Multi-Class Classification
Most semi-supervised learning algorithms have been designed for binary classification and are extended to multi-class classification by approaches such as one-against-the-rest. The main shortcoming of these approaches is that they are unable to exploit the fact that each example is assigned to only one class. Additional problems with extending semi-supervised binary classifiers to multi-class p...
Benchmarking the semi-supervised naïve Bayes classifier
Semi-supervised learning involves constructing predictive models with both labelled and unlabelled training data. The need for semi-supervised learning is driven by the fact that unlabelled data are often easy and cheap to obtain, whereas labelling data requires costly and time-consuming human intervention and expertise. Semi-supervised methods commonly use self-training, which involves using t...