Improving Rule Induction Precision for Automated Annotation by Balancing Skewed Data Sets

Authors

  • Gustavo E. A. P. A. Batista
  • Maria Carolina Monard
  • Ana L. C. Bazzan
Abstract

Submissions to genomic databases are increasing at an overwhelming rate, which poses a problem for database maintenance, especially the annotation of fields left blank during submission. To avoid including all data exactly as submitted, one alternative is to perform the annotation manually. A less resource-demanding alternative is automatic annotation. The latter helps the curator, since manually predicting the properties of each protein sequence is becoming a bottleneck, at least for protein databases. Machine Learning (ML) techniques have been used to generate automatic annotation and to help curators. A challenging problem for automatic annotation is that traditional ML algorithms assume a balanced training set. However, real-world data sets are predominantly imbalanced (skewed), i.e., there is a large number of examples of one class compared with just a few examples of the other class. This is the case for protein databases, where a large number of proteins are not annotated for every feature. In this work we discuss some over- and under-sampling techniques that deal with class imbalance. We also propose a new method that combines two known over- and under-sampling methods. Experimental results show that the symbolic classifiers induced by C4.5 from data sets balanced with the known over- and under-sampling methods, as well as with the newly proposed method, are always more accurate than those induced from the original imbalanced data sets. This is therefore a step towards producing more accurate rules for automating annotation.
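To make the idea of balancing a skewed training set concrete, the following is a minimal sketch of the two simplest baselines, random under-sampling and random over-sampling. It is an illustration only: the abstract does not name the specific methods that were evaluated or combined, and the function names (`random_undersample`, `random_oversample`, `class_counts`) and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter


def _group_by_class(examples, labels):
    """Group examples by their class label."""
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def random_undersample(examples, labels, seed=0):
    """Randomly discard examples until every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(examples, labels)
    target = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # Sample without replacement, so no example is duplicated.
        balanced.extend((x, y) for x in rng.sample(xs, target))
    rng.shuffle(balanced)
    return balanced


def random_oversample(examples, labels, seed=0):
    """Randomly duplicate examples until every class is as large as the most frequent one."""
    rng = random.Random(seed)
    by_class = _group_by_class(examples, labels)
    target = max(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # Keep every original example, then add random copies to reach the target size.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    rng.shuffle(balanced)
    return balanced


def class_counts(pairs):
    """Count how many (example, label) pairs belong to each class."""
    return Counter(y for _, y in pairs)
```

Either function returns a list of `(example, label)` pairs with equal class counts, which could then be passed to a rule or decision-tree learner. Under-sampling risks discarding informative majority-class examples, while over-sampling by duplication can encourage overfitting to the few minority-class examples; this trade-off is one motivation for combining the two approaches.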

Similar resources

Balancing Training Data for Automated Annotation of Keywords: a Case Study

There has been an increasing interest in tools for automating the annotation of databases. Machine learning techniques are promising candidates to help curators to, at least, guide the process of annotation which is mostly done manually. Following previous works on automated annotation using symbolic machine learning techniques, the present work deals with a common problem in machine learning: ...

Additional Paper for “Biological Data Mining”

Nowadays, the size of sequence databases is increasing exponentially. There is also a proportional increase in the number of protein sequences. In order to make sense of these amino acid sequences, it is necessary to annotate them (annotation: comment/explanation). However, owing to their huge size, it is infeasible to manually annotate the protein sequences. One solution is automating the annotat...

Boosting First-Order Clauses for Large, Skewed Data Sets

Creating an effective ensemble of clauses for large, skewed data sets requires finding a diverse, high-scoring set of clauses and then combining them in such a way as to maximize predictive performance. We have adapted the RankBoost algorithm in order to maximize area under the recall-precision curve, a much better metric when working with highly skewed data sets than ROC curves. We have also expl...

On Burr III-Inverse Weibull Distribution with COVID-19 Applications

We introduce a flexible lifetime distribution called Burr III-Inverse Weibull (BIII-IW). The new proposed distribution has well-known sub-models. The BIII-IW density function includes exponential, left-skewed, right-skewed and symmetrical shapes. The BIII-IW model’s failure rate can be monotone and non-monotone depending on the parameter values. To show the importance of the BIII-IW distributio...

Extracting generic basis of association rules from SAGE data

Applying the classical association rule extraction framework to dense SAGE data leads to unmanageably large association rule sets, compounded by their low precision, that often make the perusal of knowledge ineffective and their exploitation time-consuming and frustrating for the user. To overcome this drawback, we advocate the extraction and exploitation of compact and informative generi...

Publication date: 2004