The Gold Standard in Corpus Annotation
Abstract
Trustworthy corpora are necessary for training and meaningfully evaluating algorithms that use annotations. These standard collections are called Gold Standard Corpora (GSC). However, constructing a GSC is a laborious and time-consuming process, and the size, quality, and above all availability of task-specific GSC directly influence the development of machine-learning-based natural language processing algorithms. This paper provides an introduction to gold standard corpus construction in the context of natural language processing and gives an overview of alternative approaches.
Similar Resources
Collection, Annotation and Analysis of Gold Standard Corpora for Knowledge-Rich Context Extraction in Russian and German
This paper describes the collection, annotation and linguistic analysis of a gold standard for knowledge-rich context extraction on the basis of Russian and German web corpora as part of ongoing PhD thesis work. In the following sections, the concept of knowledge-rich contexts is refined and gold standard creation is described. Linguistic analyses of the gold standard data and their results are...
An Annotation Scheme and Gold Standard for Dutch-English Word Alignment
The importance of sentence-aligned parallel corpora has been widely acknowledged. Reference corpora in which sub-sentential translational correspondences are indicated manually are more labour-intensive to create, and hence less wide-spread. Such manually created reference alignments – also called Gold Standards – have been used in research projects to develop or test automatic word alignment s...
Named Entity Recognition in Wikipedia
Named entity recognition (NER) is used in many domains beyond the newswire text that comprises current gold-standard corpora. Recent work has used Wikipedia’s link structure to automatically generate near gold-standard annotations. Until now, these resources have only been evaluated on newswire corpora or themselves. We present the first NER evaluation on a Wikipedia gold standard (WG) corpus. ...
A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC
OBJECTIVE: To create a multilingual gold-standard corpus for biomedical concept recognition. MATERIALS AND METHODS: We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified M...
Anaphoric relations in the clinical narrative: corpus creation
OBJECTIVE: The long-term goal of this work is the automated discovery of anaphoric relations from the clinical narrative. The creation of a gold standard set from a cross-institutional corpus of clinical notes and high-level characteristics of that gold standard are described. METHODS: A standard methodology for annotation guideline development, gold standard annotations, and inter-annotator ag...
A Gold Standard Corpus of Early Modern German
This paper describes an annotated gold standard sample corpus of Early Modern German containing over 50,000 tokens of text manually annotated with POS tags, lemmas, and normalised spelling variants. The corpus is the first resource of its kind for this variant of German, and represents an ideal test bed for evaluating and adapting existing NLP tools on historical data. We describe the corpus fo...