Linked annotations: a middle ground for manual curation of biomedical databases and text corpora
Authors
Abstract
Annotators of text corpora and biomedical databases carry out the same labor-intensive task of manually extracting structured data from unstructured text. Tasks are needlessly repeated because text corpora are widely scattered. We envision that a linked annotation resource unifying many corpora could be a game changer. Such an open forum would help annotators focus on novel annotations and benefit optimally from the energy of many experts. As a proof of concept, we annotated protein subcellular localization in 100 abstracts cited by UniProtKB. The detailed comparison between our new corpus and the original UniProtKB annotations revealed sustained novel annotations for 42% of the entries (proteins). In a unified linked annotation resource, these could immediately extend the utility of text corpora beyond the text-mining community. Our example motivates the central idea that linked annotations from text corpora can complement database annotations.
Similar resources
The Meta-knowledge of Causality in Biomedical Scientific Discourse
Causality lies at the heart of biomedical knowledge, being involved in diagnosis, pathology or systems biology. Thus, automatic causality recognition can greatly reduce the human workload by suggesting possible causal connections and aiding in the curation of pathway models. For this, we rely on corpora that are annotated with classified, structured representations of important facts and findin...
Biocuration workflows and text mining: overview of the BioCreative 2012 Workshop Track II
Manual curation of data from the biomedical literature is a rate-limiting factor for many expert curated databases. Despite the continuing advances in biomedical text mining and the pressing needs of biocurators for better tools, few existing text-mining tools have been successfully integrated into production literature curation systems such as those used by the expert curated databases. To clo...
Automatic functional annotation of predicted active sites: combining PDB and literature mining
texts was drawn from the Uniprot corpus, where every abstract text must contain the tri-occurrences of organism, protein and residue. Notice that the detection of the entities was based on the entity recognition (ER) systems described in the previous section. It is not expected that the ER systems are performing at top level, and therefore a certain proportion of the filtered abstract texts con...
CALBC: Releasing the Final Corpora
A number of gold standard corpora for named entity recognition are available to the public. However, the existing gold standard corpora are limited in size and semantic entity types. These usually lead to implementation of trained solutions (1) for a limited number of semantic entity types and (2) lacking in generalization capability. In order to overcome these problems, the CALBC project has a...
ProFAL: PROtein Functional Annotation through Literature
We introduce ProFAL (PROtein Functional Annotation through Literature), a new information system for automatic annotation of biological databases using Bioinformatics methods. The annotations are (gene-product, functional property) pairs, associating the attributes of a gene-product, stored in the database, to functional properties. The system retrieves documents related to each gene-product fro...