Search results for: corpora creation

Number of results: 147847

2013
Marek Grác

The extended usage of written corpora not only for manual querying but also for machine learning led to the creation of massive corpora. These corpora are almost solely crawled from the internet and contain texts of various quality. Corpora that contain more typos or ungrammatical texts are more difficult to use for computational linguists and are thus a major obstacle in automatic development....

2005
Ineke Schuurman Paola Monachesi

The creation of semantically annotated corpora has lagged dramatically behind. As a result, the need for such resources has now become urgent. Several initiatives have been launched at the international level in the last years, however, they have focussed almost entirely on English and not much attention has been dedicated to the creation of semantically annotated Dutch corpora. The Flemish-Dut...

2018
Ritesh Kumar Bornini Lahiri Deepak Alok Atul Kr. Ojha Mayank Jain Abdul Basit Yogesh Dawer

In this paper, we discuss an attempt to develop an automatic language identification system for 5 closely-related Indo-Aryan languages of India – Awadhi, Bhojpuri, Braj, Hindi and Magahi. We have compiled comparable corpora of varying length for these languages from various resources. We discuss the method of creation of these corpora in detail. Using these corpora, a language identification ...

2011
Manuel Bertran Oriol Borrega M. Antònia Martí Mariona Taulé

This paper describes AnCoraPipe, an environment for the creation, edition and analysis of linguistic corpora and lexicons. AnCoraPipe has been used in the development of different linguistic resources: AnCora, CesCa, ClInt, Amazighe corpora, and the verbal and nominal AnCora lexicons. We present the functionalities of AnCoraPipe, the way in which the data and metadata are structured, as well as s...

2005

The corpus is a fundamental tool for any type of research on language. The availability of computers in the 1950s immediately led to the creation of corpora in electronic form that could be searched automatically for a variety of language features and used to compute frequency, distributional characteristics, and other descriptive statistics. Corpora of literary works were compiled to enable stylistic...

2008
Massimo Poesio Udo Kruschwitz Jon Chamberlain

The ability to make progress in Computational Linguistics depends on the availability of large annotated corpora, but creating such corpora by hand annotation is very expensive and time consuming; in practice, it is unfeasible to think of annotating more than one million words. However, the success of Wikipedia, the ESP game, and other projects shows that another approach might be possible: col...

Journal: CoRR 2018
Tomasz Jurczyk Amit Deshmane Jinho D. Choi

This paper gives comprehensive analyses of corpora based on Wikipedia for several tasks in question answering. Four recent corpora are collected, WIKIQA, SELQA, SQUAD, and INFOBOXQA, and first analyzed intrinsically by contextual similarities, question types, and answer categories. These corpora are then analyzed extrinsically by three question answering tasks, answer retrieval, selection, and ...

2006
Christopher Cieri Walter D. Andrews Joseph P. Campbell George R. Doddington John J. Godfrey Shudong Huang Mark Liberman Alvin F. Martin Hirotaka Nakasone Mark A. Przybocki Kevin Walker

This paper describes the planning and creation of the Mixer and Transcript Reading corpora, their properties and yields, and reports on the lessons learned during their development.

2012
Maciej Ogrodniczuk Piotr Pezik Adam Przepiórkowski

The aim of this paper is to present current efforts towards the creation of a comprehensive open repository of Polish language resources and tools (LRTs). The work described here is carried out within the CESAR project, member of the META-NET consortium. It has already resulted in the creation of the Computational Linguistics in Poland website containing an exhaustive collection of Polish LRTs....

2014
Marcis Pinnis Ilze Auzina Karlis Goba

In this paper the authors present the first Latvian speech corpus designed specifically for speech recognition purposes. The paper outlines the decisions made in the corpus design process through analysis of related work on speech corpora creation for different languages. The authors also provide guidelines that were used for the creation of the Latvian speech recognition corpus. The corpus ...
