corpora creation

Search results for: corpora creation

Number of results: 147847

Integration of Russian Language Resources

2004

Serge A. Yablonsky

In this paper we describe the creation of large-scale linguistic resources for the Russian language. An Internet/intranet system architecture was developed to make a large volume of Russian language lexical information, corpora (texts) and a knowledge base (Russian WordNet) available to the system at development and/or run time. There are four linguistic counterparts, corresponding to the major categori...


Collecting Language Resources for the Latvian e-Government Machine Translation Platform

2016

Roberts Rozis Andrejs Vasiljevs Raivis Skadins

This paper describes corpora collection activity for building large machine translation systems for the Latvian e-Government platform. We describe requirements for corpora, selection and assessment of data sources, collection of the public corpora and creation of new corpora from miscellaneous sources. Methodology, tools and assessment methods are also presented along with the results achieved, cha...


Festvox: Tools for Creation and Analyses of Large Speech Corpora

2010

Gopala Krishna Anumanchipalli Kishore Prahallad Alan W Black

This paper summarises the tools provided within Festvox[1], a freely available software suite for creation and analyses of large scale speech corpora for enabling research, development and instruction in speech technologies.


The Crúbadán Project: Corpus building for under-resourced languages

2007

Kevin P. Scannell

We present an overview of the Crúbadán project, the aim of which is the creation of text corpora for a large number of under-resourced languages by crawling the web.


About the creation of a parallel bilingual corpora of web-publications

Journal: CoRR 2008

D. V. Lande V. V. Zhygalo

An algorithm for creating parallel text corpora is presented. The algorithm is based on the use of "key words" in text documents and on the means of their automated translation. Key words were singled out using Russian and Ukrainian morphological dictionaries, as well as dictionaries of the translation of nouns for the Russian and Ukrainian languages. Besides, to calculate the...


Creating German unit selection voices for the MARY TTS platform from the BITS corpora

2007

Marc Schröder Anna Hunecke

The present paper reports on the creation of German unit selection voices from corpora which had been recorded and annotated previously in the BITS project. We describe the unit selection mechanism of our MARY TTS platform, as well as the tools for creating a synthesis voice from a speech corpus, and their application to the creation of German unit selection voices from the BITS corpora. Becaus...


Creation of comparable corpora for English-Urdu, Arabic, Persian

2016

Murad Abouammoh Kashif Shah Ahmet Aker

Statistical Machine Translation (SMT) relies on the availability of rich parallel corpora. However, in the case of under-resourced languages or some specific domains, parallel corpora are not readily available. This leads to under-performing machine translation systems in those sparse data settings. To overcome the low availability of parallel resources the machine translation community has rec...


Options for Automatic Creation of Dictionary Definitions from Corpora

2016

Marie Stará Vojtech Kovár

This paper maps the possibilities of using existing corpus tools to acquire definitions for Czech in an automatic way. It compares definitions from Dictionary of contemporary Czech (Slovník současné češtiny pro školu a veřejnost) and data acquired using Thesaurus and Word sketch in corpus czTenTen12.


Rapid creation of large-scale corpora and frequency dictionaries

2012

Attila Zséder Gábor Recski Dániel Varga András Kornai

We describe, and make public, large-scale language resources and the toolchain used in their creation, for fifteen medium density European languages: Catalan, Czech, Croatian, Danish, Dutch, Finnish, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Serbian, Slovak, Spanish, and Swedish. To make the process uniform across languages, we selected tools that are either language-independent or e...


Report on the annotation of semantic roles - TR7

2007

Paola Monachesi Jantine Trapman

The creation of semantically annotated corpora has lagged dramatically behind. As a result, the need for such resources has now become urgent. Several initiatives have been launched at the international level in recent years; however, they have focussed almost entirely on English, and not much attention has been dedicated to the creation of semantically annotated Dutch corpora. The Flemish-Dut...

