Validation of Automated Protein Annotation

Authors

  • Francisco M. Couto
  • Mário J. Silva
  • Pedro M. Coutinho
Abstract

Given the large amount of data stored in biological databases, managing their uncertainty and incompleteness is a non-trivial problem. To cope with the large number of sequences being produced, a significant number of genes and proteins have been functionally characterized by automated tools. However, these tools have also produced a significant number of misannotations that are now present in the databases. This paper proposes a new approach for validating automated annotations, which uses the large amount of publicly available information to compare automated annotations with preexisting curated annotations. To test the proposed approach, we developed a novel unsupervised method for filtering misannotations produced by automated annotation systems. We evaluated our method using the automated annotations submitted to BioCreAtIvE, a joint evaluation of state-of-the-art text-mining systems in biology. The method scored each of these annotations, and those scoring below a given threshold were discarded. The results showed a small loss in recall in exchange for a large improvement in precision. For example, at three different thresholds we were able to discard 44.6%, 66.8%, and 81% of the misannotations while retaining 96.9%, 84.2%, and 47.8% of the correct annotations, respectively. Moreover, by properly adjusting the threshold, we were able to outperform each individual submission to BioCreAtIvE. These results show the effectiveness of our approach in assisting curators of large biological databases in the use of contemporary tools for automatic identification of annotations.
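The threshold-based filtering step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the scores are computed, so the scores, the `Annotation` type, the function names, and the toy data below are all hypothetical; the sketch only shows how discarding low-scoring annotations trades retained correct annotations against discarded misannotations.

```python
from typing import List, NamedTuple, Tuple


class Annotation(NamedTuple):
    protein: str   # hypothetical protein identifier
    term: str      # hypothetical annotation term
    score: float   # hypothetical validation score (higher = more plausible)
    correct: bool  # gold-standard label, used only for evaluation


def filter_by_threshold(annotations: List[Annotation], threshold: float) -> List[Annotation]:
    """Keep annotations whose score is at or above the threshold."""
    return [a for a in annotations if a.score >= threshold]


def evaluate(annotations: List[Annotation], kept: List[Annotation]) -> Tuple[float, float, float]:
    """Return (precision of kept set, fraction of correct annotations kept,
    fraction of misannotations discarded)."""
    total_correct = sum(a.correct for a in annotations)
    total_wrong = len(annotations) - total_correct
    kept_correct = sum(a.correct for a in kept)
    kept_wrong = len(kept) - kept_correct
    precision = kept_correct / len(kept) if kept else 0.0
    correct_kept = kept_correct / total_correct if total_correct else 0.0
    wrong_discarded = (total_wrong - kept_wrong) / total_wrong if total_wrong else 0.0
    return precision, correct_kept, wrong_discarded


# Toy data, invented purely for illustration (not from the paper).
TOY = [
    Annotation("P1", "term-a", 0.9, True),
    Annotation("P2", "term-b", 0.8, True),
    Annotation("P3", "term-c", 0.6, False),
    Annotation("P4", "term-d", 0.4, True),
    Annotation("P5", "term-e", 0.3, False),
    Annotation("P6", "term-f", 0.1, False),
]

if __name__ == "__main__":
    for threshold in (0.2, 0.5, 0.7):
        kept = filter_by_threshold(TOY, threshold)
        precision, correct_kept, wrong_discarded = evaluate(TOY, kept)
        print(f"threshold={threshold:.1f} precision={precision:.2f} "
              f"correct kept={correct_kept:.2f} misannotations discarded={wrong_discarded:.2f}")
```

Raising the threshold in this toy setting discards more misannotations but also starts to drop correct annotations, which is the kind of trade-off the reported figures describe.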

Similar resources

Filtering erroneous protein annotation

MOTIVATION Automatically generated annotation on protein data of UniProt (Universal Protein Resource) is planned to be publicly available on the UniProt web pages in April 2004. It is expected that the data content of over 500,000 protein entries in the TrEMBL section will be enhanced by the output of an automated annotation pipeline. However, a part of the automatically added data will be erro...

Automated annotation of keywords for proteins related to mycoplasmataceae using machine learning techniques

MOTIVATION With the increase in submission of sequences to public databases, the curators of these are not able to cope with the amount of information. The motivation of this work is to generate a system for automated annotation of data we are particularly interested in, namely proteins related to the Mycoplasmataceae family. Following previous works on automatic annotation using symbolic machi...

PathBuilder - open source software for annotating and developing pathway resources

SUMMARY We have developed PathBuilder, an open-source web application to annotate biological information pertaining to signaling pathways and to create web-based pathway resources. PathBuilder enables annotation of molecular events including protein-protein interactions, enzyme-substrate relationships and protein translocation events either manually or through automated importing of data from o...

About Viral and Phage Genome Processing and Tools

The National Center for Biotechnology Information (NCBI) Viral Genome Resource hosts all virus-related data and tools. All complete viral genome sequences deposited in the International Nucleotide Sequence Database Collaboration (INSDC) databases are collected by the NCBI Viral Genome Project (1). A RefSeq record is created from one of the complete genome sequences for each virus species, and t...

GPCRRD: G protein-coupled receptor spatial restraint database for 3D structure modeling and function annotation

SUMMARY G protein-coupled receptors (GPCRs) comprise the largest family of integral membrane proteins. They are the most important class of drug targets. While there exist crystal structures for only a very few GPCR sequences, numerous experiments have been performed on GPCRs to identify the critical residues and motifs. GPCRRD database is designed to systematically collect all experimental res...

Simple topological properties predict functional misannotations in a metabolic network

MOTIVATION Misannotation in sequence databases is an important obstacle for automated tools for gene function annotation, which rely extensively on comparison with sequences with known function. To improve current annotations and prevent future propagation of errors, sequence-independent tools are, therefore, needed to assist in the identification of misannotated gene products. In the case of e...

Publication date: 2005