corpora creation

Slate - A Tool for Creating and Maintaining Annotated Corpora

Journal: JLCL 2011

Dain Kaplan Ryu Iida Kikuko Nishina Takenobu Tokunaga

Recent research trends of the last five years show that richly annotated corpora inspire novel research. These richly annotated corpora are indispensable for progressing research, but also more difficult to manage and maintain due to increasing complexity – what is needed is a way to manage the annotation project in its entirety. However, annotation project management has received little attent...

Two Years of Aranea: Increasing Counts and Tuning the Pipeline

2016

Vladimír Benko

The Aranea Project is targeted at the creation of a family of Gigaword web corpora for a dozen languages that could be used for teaching language- and linguistics-related subjects at Slovak universities, as well as for research purposes in various areas of linguistics. All corpora are being built according to a standard methodology and using the same set of tools for processing and annotation, whi...

TWORPUS - An Easy-to-Use Tool for the Creation of Tailored Twitter Corpora

2013

Alexander Bazo Manuel Burghardt Christian Wolff

In this paper we present Tworpus, an easy-to-use tool for the creation of tailored Twitter corpora. Tworpus allows scholars to create corpora without having to know about the Twitter Application Programming Interface (API) and related technical aspects. At the same time our tool complies with Twitter’s “rules of the road” on how to use tweet data. Corpora may be composed in various sizes and fo...

Interlingual Annotation of Parallel Text Corpora: A New Framework for Annotation and Evaluation

2004

This paper focuses on the next step in the creation of a system of meaning representation and the development of semantically-annotated parallel corpora, for use in applications such as machine translation, question answering, text summarization, and information retrieval. The work described below constitutes the first effort of any kind to provide parallel corpora annotated with detailed deep ...

Generative Adversarial Nets for Multiple Text Corpora

Journal: CoRR 2017

Baiyang Wang Diego Klabjan

Generative adversarial nets (GANs) have been successfully applied to the artificial generation of image data. In terms of text data, much has been done on the artificial generation of natural language from a single corpus. We consider multiple text corpora as the input data, for which there can be two applications of GANs: (1) the creation of consistent cross-corpus word embeddings given differ...

Evaluation of Corpus Assisted Spanish Learning

2013

Hui-Chuan Lu Yu-Hsin Chu

In the development of corpus linguistics, the creation of corpora has had a critical role in corpus-based studies. The majority of created corpora have been associated with English and native languages, while other languages and types of corpora have received relatively less attention. Because an increasing number of corpora have been constructed, and each corpus is constructed for a definite p...

A massively parallel corpus: the Bible in 100 languages

2015

Christos Christodoulopoulos Mark Steedman

We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other En...

Development of Language Resources for Speech-to-speech Translation

2007

Victoria Arranz

This paper describes the creation of linguistically enriched aligned corpora for Catalan, Spanish and US-English for Speech-to-Speech Translation. These corpora are obtained from two different sources: US-English transcribed speech data and transcriptions of conversations recorded in Catalan and Spanish. After human translation, a large trilingual spontaneous speech corpus has been obtained. Thi...

Querying Annotated Speech Corpora

2004

Ulrike Gut Jan-Torsten Milde Holger Voormann Ulrich Heid

This paper is concerned with querying annotated speech corpora. A growing number of such corpora is currently being created worldwide; however, their usefulness for a wider research community is restricted by the lack of standard tools for creating, editing, annotating, storing and querying them. Two solutions for these problems are presented here: the XML-based data format TASX for corpus crea...

A Multi-criteria Text Selection Approach for Building a Speech Corpus

2015

Chiragkumar Patel Sunil Kumar Kopparapu

A speech corpus is an important and primary requirement for several speech tasks. Building a speech corpus is a lengthy, time-consuming and expensive process; it typically involves collection of a large set of textual utterances and then selective distribution of these text utterances among a set of speakers, called speaker sheets. These speaker sheets are articulated by speakers to generate the...
