Semi-Supervised Self-training Approaches for Imbalanced Splice Site Datasets

Author

  • Ana Stanescu
Abstract

Machine learning algorithms produce accurate classifiers when trained on large, balanced datasets. However, labeled data is generally expensive to acquire, while unlabeled data is available in much larger amounts. A cost-effective alternative is semi-supervised learning, which uses unlabeled data to improve supervised classifiers. Furthermore, in many practical problems the data exhibits imbalanced class distributions, which makes learning more challenging in both supervised and semi-supervised scenarios. While supervised learning from imbalanced data has been studied extensively, semi-supervised learning from imbalanced data has received much less attention. Thus, in this study, we carry out an empirical evaluation of a semi-supervised learning algorithm, specifically self-training based on Naïve Bayes Multinomial (NBM), and address the issue of imbalanced class distributions at both the data level (by re-sampling) and the algorithmic level (using cost-sensitive learning and ensembles). We conduct our study on the problem of splice site prediction, a problem for which the ratio of negative to positive examples is very high. Our experiments on five different datasets show that a simple method that adds only positive instances to the labeled data in the semi-supervised iterations produces consistently better results than other methods that deal with data imbalance.
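The positive-only self-training idea from the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes 3-mer count features, a confidence threshold of 0.9, and one instance added per iteration, and the toy sequences, the class `TinyMultinomialNB`, and the function `self_train_positive_only` are all hypothetical names introduced here for illustration.

```python
import math
from collections import Counter

def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a DNA sequence (assumed feature choice)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

class TinyMultinomialNB:
    """Multinomial Naive Bayes with Laplace smoothing for labels 0 and 1.
    Assumes both classes occur in the training data."""

    def fit(self, X, y):
        self.vocab = {f for x in X for f in x}
        self.log_prior, self.log_like = {}, {}
        for c in (0, 1):
            docs = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(docs) / len(X))
            totals = Counter()
            for x in docs:
                totals.update(x)
            denom = sum(totals.values()) + len(self.vocab)
            self.log_like[c] = {f: math.log((totals[f] + 1) / denom)
                                for f in self.vocab}
        return self

    def prob_positive(self, x):
        """Posterior P(class = 1 | x); features outside the vocabulary are skipped."""
        score = {}
        for c in (0, 1):
            score[c] = self.log_prior[c] + sum(
                n * self.log_like[c][f] for f, n in x.items() if f in self.vocab)
        top = max(score.values())
        e0, e1 = math.exp(score[0] - top), math.exp(score[1] - top)
        return e1 / (e0 + e1)

def self_train_positive_only(X_lab, y_lab, X_unl,
                             per_iter=1, threshold=0.9, max_iter=10):
    """Self-training that moves only confidently positive instances
    from the unlabeled pool into the labeled set."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unl)
    model = TinyMultinomialNB().fit(X_lab, y_lab)
    for _ in range(max_iter):
        scored = sorted(((model.prob_positive(x), i)
                         for i, x in enumerate(pool)), reverse=True)
        chosen = [i for p, i in scored[:per_iter] if p >= threshold]
        if not chosen:
            break  # no remaining instance is confidently positive
        for i in sorted(chosen, reverse=True):  # pop high indices first
            X_lab.append(pool.pop(i))
            y_lab.append(1)
        model = TinyMultinomialNB().fit(X_lab, y_lab)
    return model, X_lab, y_lab

# Toy imbalanced data: one positive and three negatives (made-up sequences).
labeled = ["AGGTAAGT", "CCCTCCCA", "TTTCTTTC", "CACACACA"]
labels = [1, 0, 0, 0]
unlabeled = ["AGGTAAGA", "CAGGTAAG", "CCCACCCT", "TTTCCTTT"]

model, X_final, y_final = self_train_positive_only(
    [kmer_counts(s) for s in labeled], labels,
    [kmer_counts(s) for s in unlabeled])
```

Because negatives are never added, the labeled set can only become less skewed over the iterations, which is the intuition behind the method the abstract reports as most effective. The threshold and the number of instances added per iteration are tuning choices in this sketch, not values taken from the paper.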


Similar resources

Reserved Self-training: A Semi-supervised Sentiment Classification Method for Chinese Microblogs

The imbalanced sentiment distribution of microblogs leads to poor performance of binary classifiers on the minority class. To address this problem, we present a semi-supervised method for sentiment classification of Chinese microblogs. This method is similar to self-training, except that a set of labeled samples is reserved for a confidence score computation process through which samples that are ...


Iterative Nearest Neighborhood Oversampling in Semisupervised Learning from Imbalanced Data

Transductive graph-based semi-supervised learning methods usually build an undirected graph with both labeled and unlabeled samples as vertices. These methods propagate the label information of labeled samples to their neighbors through the edges in order to predict the labels of unlabeled samples. Most popular semi-supervised learning approaches are sensitive to initial label distribution wh...


Empowering Imbalanced Data in Supervised Learning: A Semi-supervised Learning Approach

We present a framework to address the imbalanced data problem using semi-supervised learning. Specifically, from a supervised problem, we create a semi-supervised problem and then use a semi-supervised learning method to identify the most relevant instances to establish a well-defined training set. We present extensive experimental results, which demonstrate that the proposed framework significa...


Semi-Supervised Boosting for Multi-Class Classification

Most semi-supervised learning algorithms have been designed for binary classification and are extended to multi-class classification by approaches such as one-against-the-rest. The main shortcoming of these approaches is that they are unable to exploit the fact that each example is assigned to only one class. Additional problems with extending semi-supervised binary classifiers to multi-class p...


Benchmarking the semi-supervised naïve Bayes classifier

Semi-supervised learning involves constructing predictive models with both labelled and unlabelled training data. The need for semi-supervised learning is driven by the fact that unlabelled data are often easy and cheap to obtain, whereas labelling data requires costly and time-consuming human intervention and expertise. Semi-supervised methods commonly use self-training, which involves using t...




Publication date: 2014