Feasibility of pooling annotated corpora for clinical concept extraction

Authors

  • Kavishwar Wagholikar
  • Manabu Torii
  • Siddhartha Jonnalagadda
  • Hongfang Liu
Abstract

The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, preparing annotated corpora at individual institutions is expensive, and pooling annotated corpora from other institutions is a potential solution. In this paper we investigate whether pooling corpora from two different sources can improve the performance and portability of the resulting machine learning taggers for medical problem detection. Specifically, we pool corpora from the 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester to evaluate taggers for recognition of medical problems. Contrary to our expectations, pooling the corpora is found to decrease the F1-score. We examine the annotation guidelines to identify factors behind the incompatibility of the corpora, and suggest that the clinical NLP community develop a standard annotation guideline to make annotated corpora compatible.
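
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of an exact-match F1-score over annotated concept spans. The span representation, the function name `f1_score`, and the toy data are assumptions made for illustration only; they are not taken from the paper, and the paper's actual evaluation procedure may differ.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start token, end token, concept label)
Span = Tuple[int, int, str]


def f1_score(gold: Set[Span], predicted: Set[Span]) -> float:
    """Exact-match F1: harmonic mean of precision and recall over spans."""
    true_pos = len(gold & predicted)  # spans that match exactly
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy annotations, invented for illustration (not results from the paper)
gold = {(0, 2, "problem"), (5, 6, "problem"), (9, 11, "problem")}
predicted = {(0, 2, "problem"), (5, 7, "problem")}

score = f1_score(gold, predicted)
print(f"F1 = {score:.2f}")
```

Under exact matching, a predicted span with a boundary that differs from the gold annotation counts as both a false positive and a false negative, which is one way that differing annotation guidelines between corpora can lower the score.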


Similar resources

Pooling annotated corpora for clinical concept extraction

BACKGROUND The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, high expenditure and labor are required for creating the annotations. A potential alternative is to reuse existing corpora from other institutions by pooling with local corpora, for training machine taggers. In this paper we have inv...


Exploiting large corpora: A circular process of partial syntactic analysis, corpus query and extraction of lexicographic information

Our approach follows the work of Eckle-Kohler (1999), who used a regular grammar to extract lexicographic information from text corpora. We employ a system that allows us to improve her query-based grammar, especially with respect to recall and speed, without reducing accuracy. In contrast to Eckle-Kohler (1999), we do not attempt to parse a whole sentence or phrase at once during the extraction proce...


Exploiting Multiply Annotated Corpora in Biomedical Information Extraction Tasks

This paper discusses the problem of utilising multiply annotated data in training biomedical information extraction systems. Two corpora, annotated with entities and relations, and containing a number of multiply annotated documents, are used to train named entity recognition and relation extraction systems. Several methods of automatically combining the multiple annotations to produce a single...


Investigation and determination of the effective parameters in the extraction of Mn2+ and Cu2+ cations from a solid matrix by supercritical fluid

The feasibility of using Cyanex 301 as the auxiliary agent for supercritical extraction of copper and manganese cations from a solid matrix was studied statistically. The amount of extraction is influenced by several parameters, such as the amount of ligand, pressure, temperature, SCCO2 flow rate, extraction time and amount of acid. Recent research has shown that factorial design is an effective tool...


Cross-corpus Training with TreeLSTM for the Extraction of Biomedical Relationships from Text

A bottleneck in machine learning-based relationship extraction (RE) algorithms, particularly deep learning-based ones, is the availability of training data in the form of annotated corpora. For specific domains, such as biomedicine, the long time and high expertise required for the development of manually annotated corpora explain why most of the existing ones are relatively smal...



Volume: 2012

Publication date: 2012