corpora creation

corpus-tools.org: An Interoperable Generic Software Tool Set for Multi-layer Linguistic Corpora

2016

Stephan Druskat, Volker Gast, Thomas Krause, Florian Zipser

This paper introduces an open source, interoperable generic software tool set catering for the entire workflow of creation, migration, annotation, query and analysis of multi-layer linguistic corpora. It consists of four components: Salt, a graph-based meta model and API for linguistic data, the common data model for the rest of the tool set; Pepper, a conversion tool and platform for linguisti...

Digging into Signs: Emerging Annotation Standards for Sign Language Corpora

2017

Kearsy Cormier, Onno Crasborn, Richard Bank

This paper describes the creation of annotation standards for glossing sign language corpora as part of the Digging into Signs project (2014-2015, http://www.ru.nl/sign-lang/projects/digging-signs/). This project was based on the annotation of two major sign language corpora, the BSL Corpus (British Sign Language) and the Corpus NGT (Sign Language of the Netherlands). The focus of the gloss ann...

Automatic Adaptation of Annotations

Journal: Computational Linguistics 2015

Wenbin Jiang, Yajuan Lü, Liang Huang, Qun Liu

Manually annotated corpora are indispensable resources, yet for many annotation tasks, such as the creation of treebanks, there exist multiple corpora with different and incompatible annotation guidelines. This leads to an inefficient use of human expertise, but it could be remedied by integrating knowledge across corpora with different annotation guidelines. In this article we describe the pro...

Linguistic Technologies Applied Lexicography and Scientific Text Corpora

2014

Larisa Beliaeva

Nowadays applied lexicography is a special domain of applied linguistics and language engineering in the framework of problem-oriented automated and automatic dictionaries and databases. The modern approach to dictionary creation assumes preliminary work with parallel or comparable text corpora, which serve as a reference database for solving both research and practical lexicographic problems. Pa...

Hooking up to the corpus: the Viennese Lexicographic Editor’s corpus interface

2011

Gerhard Budin, Karlheinz Mörth

The paper addresses the issue of interfacing between digital corpora and a new dictionary writing application being developed at the ICLTT (Institute of Corpus Linguistics and Text Technology of the Austrian Academy of Sciences). It deals with issues of dictionary creation, software design, usability and interoperability in relation to the example of this fairly new piece of software, the Vienn...

Corpora for Learning the Mutual Relationship between Semantic Relatedness and Textual Entailment

2016

Ngoc Phuoc An Vo, Octavian Popescu

In this paper we present the creation of a corpus annotated with both semantic relatedness (SR) scores and textual entailment (TE) judgments. In building this corpus we aimed at discovering the relationship, if any, between these two tasks, so that one of them can be resolved by relying on the insights gained from the other. We considered a corpus already annotated with TE judgment...

Exploiting the Two-Dimensional Nature of Agnostic Music Notation for Neural Optical Music Recognition

Journal: Applied Sciences 2021

State-of-the-art Optical Music Recognition (OMR) techniques follow an end-to-end or holistic approach, i.e., a sole stage for completely processing a single-staff section image and retrieving the symbols that appear therein. Such recognition systems are characterized by not requiring an exact alignment between each staff and its corresponding labels, hence facilitating the creation and retrieval of labeled co...

Induction of a Stem Lexicon for Two-level Morphological Analysis

1998

Erika F. de Lima

A method is described to automatically acquire from text corpora a Portuguese stem lexicon for two-level morphological analysis. It makes use of a lexical transducer to generate all possible stems for a given unknown inflected word form, and the EM algorithm to rank alternative stems. 1 Motivation: Morphological analysis is the basis for most natural language processing tasks. Hand-code...

A Web Survey on the Use of Active Learning to Support Annotation of Text Data

2009

Katrin Tomanek, Fredrik Olsson

As supervised machine learning methods for addressing tasks in natural language processing (NLP) prove increasingly viable, the focus of attention is naturally shifted towards the creation of training data. The manual annotation of corpora is a tedious and time-consuming process. Obtaining high-quality annotated data constitutes a bottleneck in machine learning for NLP today. Active learning is...

Knowtator: A Protégé plug-in for annotated corpus construction

2006

Philip V. Ogren

A general-purpose text annotation tool called Knowtator is introduced. Knowtator facilitates the manual creation of annotated corpora that can be used for evaluating or training a variety of natural language processing systems. Building on the strengths of the widely used Protégé knowledge representation system, Knowtator has been developed as a Protégé plug-in that leverages Protégé’s knowledg...
