Balancing Training Data for Automated Annotation of Keywords: a Case Study
Abstract
Interest in tools for automating the annotation of databases has been growing. Machine learning techniques are promising candidates for helping curators, or at least for guiding an annotation process that is still mostly manual. Following previous work on automated annotation using symbolic machine learning techniques, the present work deals with a common problem in machine learning: classes often have skewed prior probabilities, i.e., there are many examples of one class and only a few examples of the other. This happens because a large number of proteins are not annotated for every feature. We therefore analyze and employ several techniques for balancing the training data. Our experiments show that classifiers induced from balanced data sampled with our method are more accurate than those induced from the original data.
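To make the idea of balancing skewed training data concrete, the following is a minimal sketch of random undersampling, one generic balancing technique. The abstract does not describe the authors' own sampling method, so this is not a reproduction of it; the function name `undersample`, the toy protein identifiers, and the 95/5 class split are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def undersample(examples, labels, seed=0):
    """Randomly drop examples from larger classes until every class
    has as many examples as the smallest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group the examples by their class label.
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)

    # Every class is reduced to the size of the rarest class.
    target = min(len(xs) for xs in by_class.values())

    balanced = []
    for y, xs in by_class.items():
        for x in rng.sample(xs, target):
            balanced.append((x, y))

    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 95 proteins without a keyword, 5 with it.
examples = [f"protein_{i}" for i in range(100)]
labels = ["negative"] * 95 + ["positive"] * 5

balanced = undersample(examples, labels)
counts = Counter(y for _, y in balanced)
print(counts)
```

Undersampling discards majority-class examples, which may lose information; oversampling the minority class is a common alternative with its own trade-offs.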
Similar resources
A Data Mining approach for forecasting failure root causes: A case study in an Automated Teller Machine (ATM) manufacturing company
According to findings from the Massachusetts Institute of Technology, organizations' data double every five years. However, the rate of using data is 0.3. Nowadays, data mining tools have greatly facilitated the process of knowledge extraction from a welter of data. This paper presents a hybrid model using data gathered from an ATM manufacturing company. The steps of the research are based on CRISP-D...
Scalable Image Annotation by Summarizing Training Samples into Labeled Prototypes
As the number of images grows, it is essential to provide fast search methods and intelligent filtering of images. To handle images in large datasets, some relevant tags are assigned to each image to describe its content. Automatic Image Annotation (AIA) aims to automatically assign a group of keywords to an image based on visual content of the image. AIA frameworks have two main sta...
Nasullah Khalid Alham
Machine learning techniques have facilitated image retrieval by automatically classifying and annotating images with keywords. Among them, Support Vector Machines (SVMs) are used extensively due to their generalization properties. However, SVM training is a computationally intensive process, especially when the training dataset is large. In this thesis distributed computing paradigms have...
Automatic Image Annotation of News Images with Large Vocabularies and Low Quality Training Data
A traditional approach to retrieving images is to manually annotate the image with textual keywords and then retrieve images using these keywords. Manual annotation is expensive and recently a few approaches have been proposed for automatically annotating images. These techniques usually learn a statistical model using a training set of images annotated with keywords and use this model to autom...
Automated annotation of keywords for proteins related to mycoplasmataceae using machine learning techniques
MOTIVATION: With the increase in submission of sequences to public databases, their curators are not able to cope with the amount of information. The motivation of this work is to generate a system for automated annotation of data we are particularly interested in, namely proteins related to the Mycoplasmataceae family. Following previous works on automatic annotation using symbolic machi...