What's in a gene name? Automated refinement of gene name dictionaries
Abstract
Many approaches to named entity recognition rely on dictionaries gathered from curated databases (such as Entrez Gene for gene names). Strategies for matching dictionary entries against arbitrary text use inexact string matching that allows for known deviations, dictionaries enriched according to observed rules, or a combination of both. Such refined dictionaries cover potential structural, lexical, orthographical, or morphological variations. In this paper, we present an approach that automatically analyzes dictionaries to discover how names are composed and which variations typically occur. This knowledge can be derived from single entries (the names and synonyms for one gene) and then transferred to entries that show similar patterns in one or more synonyms. For instance, knowledge about words that are frequently missing in (or added to) a name (“antigen”, “protein”, “human”) could be extracted automatically from dictionaries. This paper should be seen as a vision paper, though we implemented most of the ideas presented and show results for the task of gene name recognition. The automatically extracted name composition rules can easily be included in existing approaches, and they provide valuable insights into the biomedical sub-language.
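The idea of learning frequently omitted or added words from the synonyms of single entries can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `tokenize` and `optional_tokens`, the toy dictionary, and the rule "two synonyms differ by exactly one token" are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def tokenize(name):
    """Lowercase a name and split it on whitespace and hyphens (a simplifying assumption)."""
    return name.lower().replace("-", " ").split()


def optional_tokens(dictionary):
    """Count tokens whose removal turns one synonym of an entry into another.

    `dictionary` maps a (hypothetical) gene identifier to its list of synonyms.
    """
    counts = Counter()
    for synonyms in dictionary.values():
        token_lists = [tokenize(s) for s in synonyms]
        for a, b in combinations(token_lists, 2):
            shorter, longer = sorted((a, b), key=len)
            # Only consider pairs that differ by exactly one token.
            if len(longer) != len(shorter) + 1:
                continue
            for i, token in enumerate(longer):
                if longer[:i] + longer[i + 1:] == shorter:
                    counts[token] += 1
                    break  # count each synonym pair at most once
    return counts


# Toy entries invented for illustration; not taken from Entrez Gene.
TOY_DICTIONARY = {
    "gene_a": ["CD4 antigen", "CD4"],
    "gene_b": ["p53 protein", "p53", "tumor protein p53"],
    "gene_c": ["human insulin", "insulin"],
    "gene_d": ["CD8 antigen", "CD8"],
}

counts = optional_tokens(TOY_DICTIONARY)
for token, n in counts.most_common():
    print(token, n)
```

Tokens with high counts are candidates for words that may be dropped from or added to a name. A real system would presumably need normalization, frequency thresholds, and handling of reordered tokens, none of which this sketch attempts.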
Related articles
Gene/Protein/Family Name Recognition In Biomedical Literature
Rapid advances in the biomedical field have resulted in the accumulation of numerous experimental results, mainly in text form. Extracting knowledge from biomedical papers, or using the information they contain to interpret experimental results, requires improved techniques for retrieving information from the biomedical literature. In many cases, since the information is required in gene units, r...
Polymorphism and Sequencing of DGAT1 Gene in Iranian Holstein Bulls
A quantitative trait locus for milk production traits has been described on the centromeric end of bovine chromosome 14. Reports name the acyl-CoA:diacylglycerol acyltransferase (DGAT1) gene as a potential candidate gene, with a dinucleotide substitution (AA to GC) in exon VIII that causes a lysine-to-alanine amino acid change (K232A). The aim of the present study was to estimate the frequency...
IDENTIFICATION, ISOLATION, CLONING AND SEQUENCING A PARTIAL ANNEXIN GENE FROM AUREOBASIDIUM PULLULANS
Background and Objectives: Annexin is the common name for genes and proteins that were identified as calcium-dependent phospholipid-binding proteins. Recently a more complex set of functions has been recognized for this superfamily of proteins including in vesicle trafficking, cell division, apoptosis, calcium signalling, mineralization, crystal nucleation inside the extracellular organelle...
Identification and characterization of a NBS–LRR class resistance gene analog in Pistacia atlantica subsp. Kurdica
P. atlantica subsp. Kurdica, with the local name of Baneh, is a wild medicinal plant which grows in Kurdistan, Iran. The identification of resistance gene analogs holds great promise for the development of resistant cultivars. A PCR approach with degenerate primers designed according to conserved NBS-LRR (nucleotide binding site-leucine rich repeat) regions of known disease-resistance (R) gene...
Named Entity Recognition and Classification in Kannada Language
Named Entity Recognition and Classification (NERC) is an essential and challenging task in natural language processing (NLP). Kannada is a highly inflectional and agglutinating language, providing one of the richest and most challenging sets of linguistic and statistical features and resulting in long and complex word forms, which are large in number. It is primarily a suffixing language, and an inflected word starts with a root an...
Publication date: 2007