Automatic discourse connective detection in biomedical text

نویسندگان

  • Balaji Polepalli Ramesh
  • Rashmi Prasad
  • Tim Miller
  • Brian Harrington
  • Hong Yu
چکیده

OBJECTIVE Relation extraction in biomedical text mining systems has largely focused on identifying clause-level relations, but increasing sophistication demands the recognition of relations at discourse level. A first step in identifying discourse relations involves the detection of discourse connectives: words or phrases used in text to express discourse relations. In this study supervised machine-learning approaches were developed and evaluated for automatically identifying discourse connectives in biomedical text. MATERIALS AND METHODS Two supervised machine-learning models (support vector machines and conditional random fields) were explored for identifying discourse connectives in biomedical literature. In-domain supervised machine-learning classifiers were trained on the Biomedical Discourse Relation Bank, an annotated corpus of discourse relations over 24 full-text biomedical articles (~112,000 word tokens), a subset of the GENIA corpus. Novel domain adaptation techniques were also explored to leverage the larger open-domain Penn Discourse Treebank (~1 million word tokens). The models were evaluated using the standard evaluation metrics of precision, recall and F1 scores. RESULTS AND CONCLUSION Supervised machine-learning approaches can automatically identify discourse connectives in biomedical text, and the novel domain adaptation techniques yielded the best performance: 0.761 F1 score. A demonstration version of the fully implemented classifier BioConn is available at: http://bioconn.askhermes.org.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Towards Cross-Domain PDTB-Style Discourse Parsing

Discourse relation parsing is an important task with the goal of understanding text beyond the sentence boundaries. With the availability of annotated corpora (Penn Discourse Treebank) statistical discourse parsers were developed. In the literature it was shown that the discourse parsing subtasks of discourse connective detection and relation sense classification do not generalize well across d...

متن کامل

BioDCA Identifier: A System for Automatic Identification of Discourse Connective and Arguments from Biomedical Text

This paper describes a Natural language processing system developed for automatic identification of explicit connectives, its sense and arguments. Prior work has shown that the difference in usage of connectives across corpora affects the cross domain connective identification task negatively. Hence the development of domain specific discourse parser has become indispensable. Here, we present a...

متن کامل

Automatic evaluation of surface coherence in L2 texts in Czech

We introduce possibilities of automatic evaluation of surface text coherence (cohesion) in texts written by learners of Czech during certified exams for non-native speakers. On the basis of a corpus analysis, we focus on finding and describing relevant distinctive features for automatic detection of A1–C1 levels (established by CEFR – the Common European Framework of Reference for Languages) in...

متن کامل

Multilingual Annotation and Disambiguation of Discourse Connectives for Machine Translation

Many discourse connectives can signal several types of relations between sentences. Their automatic disambiguation, i.e. the labeling of the correct sense of each occurrence, is important for discourse parsing, but could also be helpful to machine translation. We describe new approaches for improving the accuracy of manual annotation of three discourse connectives (two English, one French) by u...

متن کامل

Automatic sense prediction for implicit discourse relations in text

We present a series of experiments on automatically identifying the sense of implicit discourse relations, i.e. relations that are not marked with a discourse connective such as “but” or “because”. We work with a corpus of implicit relations present in newspaper text and report results on a test set that is representative of the naturally occurring distribution of senses. We use several linguis...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Journal of the American Medical Informatics Association : JAMIA

دوره 19 5  شماره 

صفحات  -

تاریخ انتشار 2012