A hybrid approach to protein name identification in biomedical texts

Authors

  • Kazuhiro Seki
  • Javed Mostafa
Abstract

This paper presents a hybrid approach to identifying protein names in biomedical texts, which is regarded as a crucial step in text mining. The approach uses a set of simple heuristics for initial detection of protein names and a probabilistic model for locating complete protein names. A protein name dictionary is also consulted as a complement. In contrast to previously proposed methods, the method avoids natural language processing tools such as part-of-speech taggers and syntactic parsers and relies solely on surface clues, so as to reduce processing overhead. Moreover, we propose a framework for automatically creating a large-scale corpus annotated with protein names, which can then be used to train the probabilistic model. We implemented a protein name identification system, named P, based on the proposed method and evaluated it against a system developed by other researchers on a common test set. The experiments showed that the automatically constructed corpus is as useful for training as manually annotated corpora and that P performs effectively in identifying compound protein names.
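To make the pipeline in the abstract more concrete, the following is a minimal sketch of the general idea only, not the authors' system. It assumes a hypothetical surface-clue pattern (`SEED`), a toy dictionary (`DICTIONARY`), and a fixed word list (`NAME_WORDS`) that stands in for the probabilistic model used to locate complete names. All of these names and values are illustrative assumptions that do not come from the paper.

```python
import re

# Hypothetical surface clue: a token that mixes letters and digits,
# such as "p53" or "IL-2". Illustrative only, not the paper's heuristics.
SEED = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9-]+$")

# Hypothetical words that may continue a compound protein name.
# A fixed list stands in here for the paper's probabilistic model.
NAME_WORDS = {"protein", "receptor", "kinase", "factor"}

# Toy stand-in for a real protein name dictionary.
DICTIONARY = {"insulin", "hemoglobin"}


def find_protein_names(text):
    """Return candidate protein names found using surface clues only."""
    tokens = re.findall(r"[A-Za-z0-9-]+", text)
    names = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        # Step 1: initial detection by heuristic or dictionary lookup.
        if SEED.match(token) or token.lower() in DICTIONARY:
            # Step 2: extend the right boundary over name-like words.
            j = i + 1
            while j < len(tokens) and tokens[j].lower() in NAME_WORDS:
                j += 1
            names.append(" ".join(tokens[i:j]))
            i = j
        else:
            i += 1
    return names


if __name__ == "__main__":
    print(find_protein_names("The p53 protein binds insulin receptor and DNA."))
```

The sketch uses no part-of-speech tagger or parser, which mirrors the abstract's emphasis on surface clues. A real system would replace the fixed word list with a trained model and would extend boundaries in both directions.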


Similar resources

A Probabilistic Model for Identifying Protein Names and their Name Boundaries

This paper proposes a method for identifying protein names in biomedical texts with an emphasis on detecting protein name boundaries. We use a probabilistic model which exploits several surface clues characterizing protein names and incorporates word classes for generalization. In contrast to previously proposed methods, our approach does not rely on natural language processing tools such as pa...


Research Paper: A Simple and Practical Dictionary-based Approach for Identification of Proteins in Medline Abstracts

OBJECTIVE The aim of this study was to develop a practical and efficient protein identification system for biomedical corpora. DESIGN The developed system, called ProtScan, utilizes a carefully constructed dictionary of mammalian proteins in conjunction with a specialized tokenization algorithm to identify and tag protein name occurrences in biomedical texts and also takes advantage of Medlin...


A hybrid named entity tagger for tagging human proteins/genes

The predominant step and pre-requisite in the analysis of scientific literature is the extraction of gene/protein names in biomedical texts. Though many taggers are available for this Named Entity Recognition (NER) task, we found none of them achieve a good state-of-art tagging for human genes/proteins. As most of the current text mining research is related to human literature, a good tagger to...


Nano-bio Hybrid Material Based on Bacteriorhodopsin and ZnO for Bioelectronics Applications

Bioelectronics has attracted increasing interest in recent years because of their applications in various disciplines, such as biomedical. Development of efficient bio-nano hybrid materials is a new move towards revolution of nano-bioelectronics. A novel nano-bio hybrid electrode based on ZnO–protein for bioelectronics applications was prepared and characterized. The electrode was made by coval...



Journal title:
  • Inf. Process. Manage.

Volume 41  Issue 

Pages  -

Publication date 2005