An automated method to build a corpus of rhetorically-classified sentences in biomedical texts
Authors
Abstract
The rhetorical classification of sentences in biomedical texts is an important task in recognizing the components of a scientific argument. Training supervised machine-learned models for this recognition requires corpora annotated for the rhetorical categories Introduction (or Background), Method, Result, and Discussion (or Conclusion). Currently, only a few small annotated corpora exist. We use a straightforward feature of co-referring text, the word “this”, to build a self-annotating corpus extracted from a large biomedical research paper dataset. The corpus is annotated for all of the rhetorical categories except Introduction, without involving domain experts. In 10-fold cross-validation, we report an overall F-score of 97% with Naïve Bayes and 98.7% with SVM, far above those previously reported.
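As a rough illustration of the self-annotation idea, the following minimal sketch labels a sentence when it contains "this" followed by a cue word. The cue list in `CUE_TO_CATEGORY`, the function names, and the choice to label the sentence that contains the phrase are all assumptions made for illustration; the abstract does not specify the actual cues or how labels are assigned.

```python
import re

# Hypothetical cue words mapped to rhetorical categories.
# The abstract does not list the real cues; these are illustrative only.
CUE_TO_CATEGORY = {
    "method": "Method",
    "approach": "Method",
    "technique": "Method",
    "result": "Result",
    "finding": "Result",
    "conclusion": "Conclusion",
}

# Matches "this <word>" on word boundaries, ignoring case.
THIS_PATTERN = re.compile(r"\bthis\s+([a-z]+)", re.IGNORECASE)


def label_sentence(sentence):
    """Return a category if the sentence contains 'this <cue>', else None."""
    for match in THIS_PATTERN.finditer(sentence):
        category = CUE_TO_CATEGORY.get(match.group(1).lower())
        if category is not None:
            return category
    return None


def build_corpus(sentences):
    """Keep only the sentences that receive a label, paired with that label."""
    corpus = []
    for sentence in sentences:
        label = label_sentence(sentence)
        if label is not None:
            corpus.append((sentence, label))
    return corpus
```

Sentences without a recognized "this <cue>" phrase are simply skipped, so a corpus built this way would contain only the sentences the pattern can label, with no expert annotation involved.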
Similar resources
Automated Construction and Evaluation of Japanese Web-based Reference Corpora
A particularly promising approach to the use of the Web for linguistic research is to build corpora via automated queries to search engines, retrieving and post-processing the pages found in this way (Ghani et al. 2003, Baroni and Bernardini 2004, Sharoff to appear). This approach differs from the traditional method of corpus construction, where one needs to spend considerable time finding and ...
PaCCSS-IT: A Parallel Corpus of Complex-Simple Sentences for Automatic Text Simplification
In this paper we present PaCCSS–IT, a Parallel Corpus of Complex–Simple Sentences for ITalian. To build the resource we develop a new method for automatically acquiring a corpus of complex–simple paired sentences able to intercept structural transformations and particularly suitable for text simplification. The method requires a wide amount of texts that can be easily extracted from the web mak...
An Automated MR Image Segmentation System Using Multi-layer Perceptron Neural Network
Background: Brain tissue segmentation for delineation of 3D anatomical structures from magnetic resonance (MR) images can be used for neuro-degenerative disorders, characterizing morphological differences between subjects based on volumetric analysis of gray matter (GM), white matter (WM) and cerebrospinal fluid (CSF), but only if the obtained segmentation results are correct. Due to image arti...
Producing a Persian Text Tokenizer Corpus Focusing on Its Computational Linguistics Considerations
The main task of the tokenization is to divide the sentences of the text into its constituent units and remove punctuation marks (dots, commas, etc.). Each unit is a continuous lexical or grammatical writing chain that is an independent semantic unit. Tokenization occurs at the word level and the extracted units can be used as input to other components such as stemmer. The requirement to create...
A Simple Ensemble Method for Hedge Identification
We present in this paper a simple hedge identification method and its application on biomedical text. The problem at hand is a subtask of CoNLL-2010 shared task. Our solution consists of two classifiers, a statistical one and a CRF model, and a simple combination schema that combines their predictions. We report in detail on each component of our system and discuss the results. We also show tha...