Quantifying the effect of corpus size on the quality of automatic diacritization of Yorùbá texts

Authors

  • Tunde Adegbola
  • Lydia Uchechukwu Odilinye
Abstract

Yorùbá, being a tone language, requires tone information for the correct pronunciation of words in Text-to-Speech synthesis. In standard Yorùbá orthography, this information is held in tone marks, which are applied to vowels and syllabic nasals as diacritical markings. However, tone marks are not always correctly applied in Yorùbá documents, because appropriate input devices for the accurate application of the diacritic marks are not always available. Hence, the absence of tone marks in most written Yorùbá texts presents a major challenge in speech synthesis, as the information required for applying the right tone sequences to synthesized Yorùbá speech may not always be available. This study proposes the use of Machine Learning techniques as a basis for the automatic application of tone marks as part of the pre-processing in high-level synthesis. Because Yorùbá is a resource-scarce language, however, there is a lack of sufficiently large Yorùbá corpora for training an automatic diacritizer. The study therefore investigated the relationship between corpus size and the quality of automatic diacritization, with a view to estimating the corpus size required for an ideal level of accuracy.
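The relationship between corpus size and diacritization quality described in the abstract can be illustrated with a small experiment. The Python sketch below is not the authors' method, which the abstract does not specify. It assumes a simple word-level baseline that restores the most frequent tone-marked form seen in training, and it assumes that tone marks are the combining grave, acute, and macron, while the underdot is left in place as part of the base letter. The names `strip_tones`, `train`, `accuracy`, and `learning_curve` are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict

# Assumption: tone marks are combining grave, acute, and macron.
# The underdot (U+0323) is kept because it belongs to the base letter.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}


def strip_tones(word):
    """Remove tone marks from a word, keeping underdots."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def train(words):
    """Map each tone-stripped form to its most frequent marked form."""
    counts = defaultdict(Counter)
    for word in words:
        word = unicodedata.normalize("NFC", word)
        counts[strip_tones(word)][word] += 1
    return {bare: forms.most_common(1)[0][0] for bare, forms in counts.items()}


def accuracy(model, test_words):
    """Fraction of test words whose marked form is restored exactly."""
    test_words = [unicodedata.normalize("NFC", w) for w in test_words]
    correct = 0
    for word in test_words:
        bare = strip_tones(word)
        # Unseen words fall back to the unmarked form.
        if model.get(bare, bare) == word:
            correct += 1
    return correct / len(test_words)


def learning_curve(train_words, test_words, sizes):
    """Word accuracy after training on the first n words, for each n."""
    return [(n, accuracy(train(train_words[:n]), test_words)) for n in sizes]
```

Calling `learning_curve` with increasing values in `sizes` on a tone-marked corpus gives accuracy as a function of training size, which is the kind of curve one would extrapolate to estimate the corpus size needed for a target accuracy. A context-free baseline like this cannot resolve words whose tone marks depend on the surrounding sentence, so it only sketches the measurement procedure.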

Similar resources

Attentive Sequence-to-Sequence Learning for Diacritic Restoration of Yorùbá Language Text

Yorùbá is a widely spoken West African language with a writing system rich in tonal and orthographic diacritics. With very few exceptions, diacritics are omitted from electronic texts, due to limited device and application support. Diacritics provide morphological information, are crucial for lexical disambiguation, pronunciation and are vital for any Yorùbá text-to-speech (TTS), automatic spee...

Design Issues in Automatic Grapheme-to-Phoneme Conversion for Standard Yorùbá

Grapheme-to-Phoneme (G2P) conversion is an important problem in Human Language Processing development, particularly Text-to-Speech (TTS). Its primary goal is to accurately compute the pronunciation of words in the input texts. This work examines design issues with respect to components of the automatic G2P for standard Yorùbá (SY). The automatic process includes: (i) Tokenisation of Input, (ii) ...

Smoothing methods for a morpho-statistical approach of automatic diacritization Arabic texts (Méthodes de lissage d'une approche morpho-statistique pour la voyellation automatique des textes arabes) [in French]

We present in this work a new three-stage approach to the automatic diacritization of Arabic texts. During the first phase, we integrated a lexical database containing the most frequent words of Arabic with morphological analysis by Alkhalil Morpho Sys, which provided possible diacritizations for each word. The objective of the second module is to eliminate the ambiguity using a statisti...

Diacritization for Real-World Arabic Texts

For Arabic, diacritizing written text is important for many NLP tasks. In the work presented here, we investigate the quality of a diacritization approach that has a high success rate on treebank data but more limited success on real-world data. One of the problems we encountered is the non-standard use of the hamza diacritic, which leads to a decrease in diacritization accuracy. If an auto...

Tashkeela: Novel corpus of Arabic vocalized texts, data for auto-diacritization systems

Arabic diacritics are often omitted in Arabic script. This is a handicap for new learners reading Arabic, for text-to-speech conversion systems, and for the reading and semantic analysis of Arabic texts. Automatic diacritization systems are the best solution to this issue, but such automation needs resources such as diacritized texts to train and evaluate the systems. In this paper, we describe ...

Publication date: 2012