Building High Quality Databases for Minority Languages such as Galician
Authors
Abstract
This paper describes the results of a joint R&D project between Microsoft Portugal and the Signal Theory Group of the University of Vigo (Spain), in which a set of language resources was developed for Text-to-Speech synthesis. First, a large corpus of 10,000 Galician sentences was designed and recorded by a professional female speaker. Second, a lexicon of over 90,000 entries with phonetic and grammatical information was collected and manually reviewed by an expert linguist. Finally, these resources were used in a MOS (Mean Opinion Score) perceptual test comparing the two groups' state-of-the-art speech synthesizers: Microsoft's, based on HMMs, and the University of Vigo's, based on unit selection.
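As a rough illustration of how a MOS figure is typically obtained, the sketch below averages listener ratings on the usual 1 to 5 scale and attaches a normal-approximation 95% confidence interval. The function name `mos`, the system labels, and all ratings are hypothetical and invented for illustration; they are not data or results from the paper, and the paper's actual test protocol may differ.

```python
from math import sqrt
from statistics import mean, stdev


def mos(ratings):
    """Return (mean opinion score, 95% CI half-width) for 1-5 ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    score = mean(ratings)
    # Normal-approximation interval; undefined for a single rating, so use 0.0.
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = 0.0
    return score, half_width


# Hypothetical listener ratings, invented for illustration only.
ratings_by_system = {
    "system_a": [4, 5, 3, 4, 4, 5],
    "system_b": [3, 4, 4, 3, 5, 4],
}

for name, ratings in ratings_by_system.items():
    score, half_width = mos(ratings)
    print(f"{name}: MOS = {score:.2f} +/- {half_width:.2f}")
```

Overlapping intervals would suggest that the listening test cannot separate the two systems at that sample size; a real evaluation would normally use many more listeners and sentences than this toy input.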
Similar resources
Acoustic Modeling and Training of a Bilingual ASR System when a Minority Language is Involved
This paper describes our work in developing a bilingual speech recognition system using two SpeechDat databases. The bilingual aspect of this work is of particular importance in the Galician region of Spain, where Galician and Spanish coexist and Galician is the minority language. Based on a global Spanish-Galician phoneme set we built a bilingual spee...
Tecnologías del habla y lenguas minoritarias
In this paper we present our latest developments in speech and language technology for two languages: Spanish and Galician. Special attention is devoted to the situation of the minority language, Galician, where the lack of resources endangers its inclusion in speech products.
Proactive Learning for Building Machine Translation Systems for Minority Languages
Building machine translation (MT) for many of the world's minority languages is a serious challenge. For many minority languages there is little machine-readable text, few knowledgeable linguists, and little money available for MT development. For these reasons, it becomes very important for an MT system to make the best use of its resources, both labeled and unlabeled, in building a quality system. In ...
CORILGA: a Galician Multilevel Annotated Speech Corpus for Linguistic Analysis
This paper describes CORILGA (“Corpus Oral Informatizado da Lingua Galega”), a large, high-quality corpus of spoken Galician from the 1960s to the present day, including both formal and informal spoken language from standard and non-standard varieties, and across different generations and social levels. The corpus will be available to the research community upon completion. Ga...
Algoritmo de stemming para el gallego
The quantity and quality of natural language processing resources and tools vary from language to language. In the Iberian Peninsula, Galician is one of the languages that lack such tools and resources. To contribute to their development, this paper presents a stemmer specifically designed for the Galician language. It was first introduced in 2002, but since then it...