The Gold Standard in Corpus Annotation

Authors

Lars Wißler

Mohammed Almashraee

Dagmar Monett Díaz

Adrian Paschke

Abstract

Trustworthy corpora are necessary for training and meaningful evaluation of algorithms which use annotations. These standard collections are called Gold Standard Corpora (GSC). However, constructing a GSC is laborious and time-consuming, and the size, quality, and above all availability of task-specific GSC directly influence the development of machine-learning-based natural language processing algorithms. This paper provides an introduction to gold standard corpus construction in the context of natural language processing and gives an overview of alternative approaches.

Similar resources

Collection, Annotation and Analysis of Gold Standard Corpora for Knowledge-Rich Context Extraction in Russian and German

This paper describes the collection, annotation and linguistic analysis of a gold standard for knowledge-rich context extraction on the basis of Russian and German web corpora as part of ongoing PhD thesis work. In the following sections, the concept of knowledge-rich contexts is refined and gold standard creation is described. Linguistic analyses of the gold standard data and their results are...

An Annotation Scheme and Gold Standard for Dutch-English Word Alignment

The importance of sentence-aligned parallel corpora has been widely acknowledged. Reference corpora in which sub-sentential translational correspondences are indicated manually are more labour-intensive to create, and hence less wide-spread. Such manually created reference alignments – also called Gold Standards – have been used in research projects to develop or test automatic word alignment s...

Named Entity Recognition in Wikipedia

Named entity recognition (NER) is used in many domains beyond the newswire text that comprises current gold-standard corpora. Recent work has used Wikipedia’s link structure to automatically generate near gold-standard annotations. Until now, these resources have only been evaluated on newswire corpora or themselves. We present the first NER evaluation on a Wikipedia gold standard (WG) corpus. ...

A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC

OBJECTIVE To create a multilingual gold-standard corpus for biomedical concept recognition. MATERIALS AND METHODS We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified M...

Anaphoric relations in the clinical narrative: corpus creation

OBJECTIVE The long-term goal of this work is the automated discovery of anaphoric relations from the clinical narrative. The creation of a gold standard set from a cross-institutional corpus of clinical notes and high-level characteristics of that gold standard are described. METHODS A standard methodology for annotation guideline development, gold standard annotations, and inter-annotator ag...

A Gold Standard Corpus of Early Modern German

This paper describes an annotated gold standard sample corpus of Early Modern German containing over 50,000 tokens of text manually annotated with POS tags, lemmas, and normalised spelling variants. The corpus is the first resource of its kind for this variant of German, and represents an ideal test bed for evaluating and adapting existing NLP tools on historical data. We describe the corpus fo...

Publication date: 2014
