NLProt: extracting protein names and sequences from papers

Authors

  • Sven Mika
  • Burkhard Rost

Abstract

Automatically extracting protein names from the literature and linking these names to the associated entries in sequence databases is becoming increasingly important for annotating biological databases. NLProt is a novel system that combines dictionary- and rule-based filtering with several support vector machines (SVMs) to tag protein names in PubMed abstracts. When considering partially tagged names as errors, NLProt still reached a precision of 75% at a recall of 76%. By many criteria our system outperformed other tagging methods significantly; in particular, it proved very reliable even for novel names. Names encountered particularly frequently in Drosophila, such as white, wing and bizarre, constitute an obvious limitation of NLProt. Our method is available both as an Internet server and as a program for download (http://cubic.bioc.columbia.edu/services/NLProt/). Input can be PubMed/MEDLINE identifiers, authors, titles and journals, as well as collections of abstracts, or entire papers.
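
The abstract reports precision and recall under a strict criterion in which partially tagged names count as errors. As a rough illustration of what that criterion implies, the minimal sketch below scores predicted name spans by exact match only. The function name `strict_precision_recall` and the example spans are hypothetical and are not taken from NLProt or its evaluation data.

```python
def strict_precision_recall(predicted, gold):
    """Score tagged spans by exact match: a partial overlap earns no credit."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty inputs so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (start, end) character spans for one abstract; not data from the paper.
gold = [(0, 5), (20, 31), (40, 44), (60, 72)]
predicted = [(0, 5), (20, 28), (40, 44), (80, 85)]  # (20, 28) is only a partial tag

precision, recall = strict_precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this scoring the partially tagged span (20, 28) counts against both measures: it is a false positive for precision, and the gold span it overlaps goes unmatched for recall.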

Similar resources

Protein names precisely peeled off free text

MOTIVATION Automatically identifying protein names from the scientific literature is a pre-requisite for the increasing demand in data-mining this wealth of information. Existing approaches are based on dictionaries, rules and machine-learning. Here, we introduced a novel system that combines a pre-processing dictionary- and rule-based filtering step with several separately trained support vect...

High-throughput, interoperability and benchmarking of text-mining with BeCalm biomedical metaserver

Biomedical annotators are very specific tools applied to a highly complex field. Therefore, this kind of software suffers from an extreme complexity which impedes its usage. This complexity, which is reflected in usability problems, is the main cause of disuse, rejection and low impact. This document discusses several of these problems, as well as possible solutions. As a use case, the NLProt p...

Toward information extraction: identifying protein names from biological papers.

To solve the mystery of the life phenomenon, we must clarify when genes are expressed and how their products interact with each other. But since the amount of continuously updated knowledge on these interactions is massive and is only available in the form of published articles, an intelligent information extraction (IE) system is needed. To extract these information directly from articles, the...

Contentment and Architecture An Investigation of the Manifestation of the Concept of Contentment in the Pattern of Iranian Traditional Houses (Case Study: Mortaz House)

The concept of contentment derived from the content is one of the names of Allah that in Islamic foundations has been emphasized. The traditional man, relying on these bases, to illustrate the divine names, he has tried in different aspects of his life. Architecture is one of the areas which these names appear in it and from a variety of architectures; the house provides the most possible for t...

Semi-Automatically Extracting Features from Source Code of Android Applications

It is not always easy for an Android user to choose the most suitable application for a particular task from the great number of applications available. In this paper, we propose a semi-automatic approach to extract feature names from Android applications. The case study verifies that we can associate common sequences of Android API calls with feature names. key words: Android, feature extracti...

Journal title:
  • Nucleic Acids Research

Volume 32, Web Server issue

Publication date: 2004