Joint Learning of Phonetic Units and Word Pronunciations for ASR
Authors
Abstract
The creation of a pronunciation lexicon remains the most inefficient process in developing an Automatic Speech Recognizer (ASR). In this paper, we propose an unsupervised alternative – requiring no language-specific knowledge – to the conventional manual approach for creating pronunciation dictionaries. We present a hierarchical Bayesian model, which jointly discovers the phonetic inventory and the Letter-to-Sound (L2S) mapping rules in a language using only transcribed data. When tested on a corpus of spontaneous queries, the results demonstrate the superiority of the proposed joint learning scheme over its sequential counterpart, in which the latent phonetic inventory and L2S mappings are learned separately. Furthermore, the recognizers built with the automatically induced lexicon consistently outperform grapheme-based recognizers and even approach the performance of recognition systems trained using conventional supervised procedures.
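As a rough illustration of what an automatically induced lexicon provides (this is not the paper's hierarchical Bayesian model), the sketch below assumes an invented inventory of discovered units ("u1", "u3", ...) and a toy letter-to-sound table, and simply expands a spelling into weighted candidate pronunciations. In the proposed approach, the unit inventory, the L2S mapping rules, and their weights are learned jointly from transcribed speech rather than being given.

```python
from itertools import product

# Toy illustration only: each grapheme maps to a few automatically
# discovered, language-agnostic units with weights. The unit labels
# and probabilities below are made up for this sketch.
L2S = {
    "c": {"u7": 0.8, "u3": 0.2},
    "a": {"u1": 0.9, "u4": 0.1},
    "t": {"u5": 1.0},
}

def candidate_pronunciations(word, top_k=3):
    """Expand a spelling into weighted unit-sequence pronunciations."""
    per_letter = [sorted(L2S[ch].items(), key=lambda kv: -kv[1]) for ch in word]
    candidates = []
    for combo in product(*per_letter):
        units = [u for u, _ in combo]
        prob = 1.0
        for _, p in combo:
            prob *= p
        candidates.append((units, prob))
    candidates.sort(key=lambda c: -c[1])
    return candidates[:top_k]

print(candidate_pronunciations("cat"))
```

Each letter is expanded independently here; the actual model captures context-dependent mappings and resolves the unit inventory and the pronunciations in a single inference procedure.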
Similar Resources
Wiktionary as a source for automatic pronunciation extraction
In this paper, we analyze whether dictionaries from the World Wide Web that contain phonetic notations may support the rapid creation of pronunciation dictionaries within the speech recognition and speech synthesis system building process. As a representative dictionary, we selected Wiktionary [1], since it is available in multiple languages and, in addition to the definitions of the words, many...
Learning Similarity Functions for Pronunciation Variations
A significant source of errors in Automatic Speech Recognition (ASR) systems is pronunciation variation, which occurs in spontaneous and conversational speech. ASR systems usually use a finite lexicon that provides one or more pronunciations for each word. In this paper, we focus on learning a similarity function between two pronunciations. The pronunciations can be the canonical and the ...
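To make the task concrete, the toy sketch below stands in for such a similarity function with a plain normalized edit distance over phone sequences; the phone labels and the reduced conversational variant are invented for illustration, and the cited work learns the scoring function from data rather than fixing it by hand.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences (lists of strings)."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            cur = min(dp[j] + 1,           # deletion
                      dp[j - 1] + 1,       # insertion
                      prev + (pa != pb))   # substitution
            prev, dp[j] = dp[j], cur
    return dp[-1]

def similarity(canonical, surface):
    """Map distance into [0, 1]; 1.0 means identical pronunciations."""
    dist = edit_distance(canonical, surface)
    return 1.0 - dist / max(len(canonical), len(surface), 1)

# Canonical vs. an invented reduced variant of "probably" (ARPAbet-style labels).
print(similarity(["p", "r", "aa", "b", "ah", "b", "l", "iy"],
                 ["p", "r", "aa", "b", "l", "iy"]))
```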
Learning from Mistakes: Expanding Pronunciation Lexicons Using Word Recognition Errors
We introduce the problem of learning pronunciations of out-of-vocabulary words from word recognition mistakes made by an automatic speech recognition (ASR) system. This question is especially relevant in cases where the ASR engine is a black box, meaning that the only acoustic cues about the speech data come from the word recognition outputs. This paper presents an expectation maximization appr...
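A minimal sketch of the general expectation-maximization idea (not the cited paper's algorithm) is shown below: given invented candidate pronunciations for one out-of-vocabulary word and stand-in likelihoods of each candidate explaining each erroneous recognition output, EM alternates soft assignment of outputs to candidates with re-estimation of the candidate weights.

```python
# Toy EM over candidate pronunciations for one out-of-vocabulary word.
# Candidates and per-utterance likelihoods are invented for illustration;
# in practice they would be derived from the black-box ASR error outputs.
candidates = ["m ey s iy", "m ae s iy", "m ey z iy"]

# likelihoods[u][c]: how well candidate c explains the recognizer's
# (erroneous) output for utterance u.  Stand-in numbers.
likelihoods = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.20, 0.70, 0.10],
]

weights = [1.0 / len(candidates)] * len(candidates)   # uniform prior

for _ in range(20):
    # E-step: posterior responsibility of each candidate for each utterance.
    counts = [0.0] * len(candidates)
    for lik in likelihoods:
        joint = [w * l for w, l in zip(weights, lik)]
        z = sum(joint)
        for c, j in enumerate(joint):
            counts[c] += j / z
    # M-step: re-estimate candidate weights from the soft counts.
    total = sum(counts)
    weights = [c / total for c in counts]

for cand, w in sorted(zip(candidates, weights), key=lambda x: -x[1]):
    print(f"{cand}: {w:.3f}")
```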
A comparison of methods for speaker-dependent pronunciation tuning for text-to-speech synthesis
Unit-based text-to-speech (TTS) systems typically use a set of speech recordings that have been phonetically transcribed to create a large set of phonetic units. During synthesis, pronunciations for input text are generated and used to guide the selection of a sequence of phonetic units. The style of these system pronunciations must match the style of the phonetic transcriptions of the recorded...