corpora creation

Typos in Czech Corpora

2013

Marek Grác

The extended usage of written corpora not only for manual querying but also for machine learning led to the creation of massive corpora. These corpora are almost solely crawled from the internet and contain texts of various quality. Corpora that contain more typos or ungrammatical texts are more difficult to use for computational linguists and are thus a major obstacle in automatic development....

The contours of a semantic annotation scheme for Dutch

2005

Ineke Schuurman Paola Monachesi

The creation of semantically annotated corpora has lagged dramatically behind. As a result, the need for such resources has now become urgent. Several initiatives have been launched at the international level in recent years; however, they have focussed almost entirely on English and not much attention has been dedicated to the creation of semantically annotated Dutch corpora. The Flemish-Dut...

Automatic Identification of Closely-related Indian Languages: Resources and Experiments

2018

Ritesh Kumar Bornini Lahiri Deepak Alok Atul Kr. Ojha Mayank Jain Abdul Basit Yogesh Dawer

In this paper, we discuss an attempt to develop an automatic language identification system for 5 closely-related Indo-Aryan languages of India – Awadhi, Bhojpuri, Braj, Hindi and Magahi. We have compiled comparable corpora of varying length for these languages from various resources. We discuss the method of creation of these corpora in detail. Using these corpora, a language identification ...

AnCoraPipe: A new tool for corpora annotation

2011

Manuel Bertran Oriol Borrega M. Antònia Martí Mariona Taulé

This paper describes AnCoraPipe, an environment for the creation, editing and analysis of linguistic corpora and lexicons. AnCoraPipe has been used in the development of different linguistic resources: AnCora, CesCa, ClInt, Amazighe corpora, and the verbal and nominal AnCora lexicons. We present the functionalities of AnCoraPipe, the way in which the data and metadata are structured, as well as s...

Preparation and Analysis of Linguistic Corpora

2005

The corpus is a fundamental tool for any type of research on language. The availability of computers in the 1950s immediately led to the creation of corpora in electronic form that could be searched automatically for a variety of language features and used to compute frequency, distributional characteristics, and other descriptive statistics. Corpora of literary works were compiled to enable stylistic...

ANAWIKI: Creating Anaphorically Annotated Resources through Web Cooperation

2008

Massimo Poesio Udo Kruschwitz Jon Chamberlain

The ability to make progress in Computational Linguistics depends on the availability of large annotated corpora, but creating such corpora by hand annotation is very expensive and time-consuming; in practice, it is unfeasible to think of annotating more than one million words. However, the success of Wikipedia, the ESP game, and other projects shows that another approach might be possible: col...

Analysis of Wikipedia-based Corpora for Question Answering

Journal: CoRR 2018

Tomasz Jurczyk Amit Deshmane Jinho D. Choi

This paper gives comprehensive analyses of corpora based on Wikipedia for several tasks in question answering. Four recent corpora are collected, WIKIQA, SELQA, SQUAD, and INFOBOXQA, and first analyzed intrinsically by contextual similarities, question types, and answer categories. These corpora are then analyzed extrinsically by three question answering tasks, answer retrieval, selection, and ...

The Mixer and Transcript Reading Corpora: Resources for Multilingual, Crosschannel Speaker Recognition Research

2006

Christopher Cieri Walter D. Andrews Joseph P. Campbell George R. Doddington John J. Godfrey Shudong Huang Mark Liberman Alvin F. Martin Hirotaka Nakasone Mark A. Przybocki Kevin Walker

This paper describes the planning and creation of the Mixer and Transcript Reading corpora, their properties and yields, and reports on the lessons learned during their development.

Towards a comprehensive open repository of Polish language resources

2012

Maciej Ogrodniczuk Piotr Pezik Adam Przepiórkowski

The aim of this paper is to present current efforts towards the creation of a comprehensive open repository of Polish language resources and tools (LRTs). The work described here is carried out within the CESAR project, a member of the META-NET consortium. It has already resulted in the creation of the Computational Linguistics in Poland website containing an exhaustive collection of Polish LRTs....

Designing the Latvian Speech Recognition Corpus

2014

Marcis Pinnis Ilze Auzina Karlis Goba

In this paper the authors present the first Latvian speech corpus designed specifically for speech recognition purposes. The paper outlines the decisions made in the corpus design process through analysis of related work on speech corpora creation for different languages. The authors also provide guidelines that were used for the creation of the Latvian speech recognition corpus. The corpus ...
