Reducing the corpus-based TTS signal degradation due to speaker's word pronunciations
Authors
Abstract
The goal of producing a corpus-based synthesizer with the owner’s voice can only be achieved if the system can handle recordings with less-than-ideal characteristics. One such limitation is that a typical speaker does not always pronounce a word exactly as the language rules predict. In this work we compare two methods for handling variation in word pronunciation in corpus-based speech synthesizers. Both approaches rely on a speech corpus aligned with a phone-level segmentation tool that allows alternative word pronunciations. The first approach aligns the observed pronunciation with the canonical form used in the system’s lexicon, so that the time labels of the observed phones can be mapped onto the canonical form. At synthesis time, unit selection is then performed on the phone sequence predicted by the system. In the second approach, the phone sequence generated by the segmentation tool is left unmodified, so at synthesis time words are converted into phones using the speaker’s own pronunciation rather than the system’s lexicon. Finally, the two approaches are compared by evaluating the naturalness of the signals each one generates.
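The first approach can be illustrated with a minimal sketch. The abstract does not say how the alignment or the label mapping is actually done, so everything below is an assumption made for illustration: a plain edit-distance (Levenshtein) alignment with unit costs, a zero-length segment for a canonical phone the speaker did not produce, and the duration of an extra observed phone merged into a neighbouring canonical phone. The names `Segment`, `align` and `map_labels` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    phone: str
    start: float  # seconds
    end: float    # seconds

def align(canonical, observed):
    """Unit-cost edit-distance alignment of two phone sequences.
    Returns (canonical_index or None, observed_index or None) pairs."""
    n, m = len(canonical), len(observed)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (canonical[i - 1] != observed[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover one optimal path.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and cost[i][j] ==
                cost[i - 1][j - 1] + (canonical[i - 1] != observed[j - 1])):
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((i - 1, None)); i -= 1   # canonical phone not spoken
        else:
            pairs.append((None, j - 1)); j -= 1   # extra phone spoken
    pairs.reverse()
    return pairs

def map_labels(canonical, observed_segs):
    """Carry the observed time labels over to the canonical phone sequence.
    The handling of deletions and insertions is an illustrative assumption."""
    pairs = align(canonical, [s.phone for s in observed_segs])
    out = []
    cursor = observed_segs[0].start if observed_segs else 0.0
    pending_start = None  # start of extra phones seen before any canonical phone
    for ci, oj in pairs:
        if ci is not None and oj is not None:
            seg = observed_segs[oj]
            start = seg.start if pending_start is None else pending_start
            pending_start = None
            out.append(Segment(canonical[ci], start, seg.end))
            cursor = seg.end
        elif ci is not None:
            # Canonical phone missing from the recording: zero-length segment.
            out.append(Segment(canonical[ci], cursor, cursor))
        else:
            seg = observed_segs[oj]
            if out:
                out[-1].end = seg.end  # merge extra phone into previous phone
                cursor = seg.end
            elif pending_start is None:
                pending_start = seg.start
    return out

# Example: the lexicon predicts five phones, the speaker drops the schwa.
observed = [Segment("s", 0.00, 0.10), Segment("p", 0.10, 0.18),
            Segment("ow", 0.18, 0.35), Segment("z", 0.35, 0.45)]
mapped = map_labels(["s", "ax", "p", "ow", "z"], observed)
```

In this example the dropped "ax" receives a zero-length segment at 0.10 s, so the canonical sequence keeps one label per phone while the remaining phones retain their observed boundaries. A real system would more likely use phonetically weighted substitution costs than the unit costs assumed here.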
Similar resources
Pronunciation lexicon adaptation for TTS voice building
This paper describes reducing phone label errors in TTS voice building by means of modeling of speaker pronunciation variants. Each speaker has his or her own unique pronunciations (and context-dependent variations), so that no one standard lexicon is able to cover all of the speaker’s variations. Creating speaker-dependent pronunciation lexicons for automatic speech labeling of our TTS voice d...
A comparison of pronunciation modeling approaches for HMM-TTS
Hidden Markov model-based text-to-speech (HMM-TTS) systems are often trained on manual voice corpus phonetic transcriptions, despite the fact that because these manual pronunciations cannot be predicted with complete accuracy at synthesis time, the result is training/synthesis mismatch. In this paper, an alternate approach is proposed in which a set of manually written post-lexical effects (PLE...
Optimal Feature Set and Minimal Training Size for Pronunciation Adaptation in TTS
Text-to-Speech (TTS) systems rely on a grapheme-to-phoneme converter which is built to produce canonical, or statically stylized, pronunciations. Hence, the TTS quality drops when phoneme sequences generated by this converter are inconsistent with those labeled in the speech corpus on which the TTS system is built, or when a given expressivity is desired. To solve this problem, the present work...
Learning Similarity Functions for Pronunciation Variations
A significant source of errors in Automatic Speech Recognition (ASR) systems is due to pronunciation variations which occur in spontaneous and conversational speech. Usually ASR systems use a finite lexicon that provides one or more pronunciations for each word. In this paper, we focus on learning a similarity function between two pronunciations. The pronunciations can be the canonical and the ...