Multilingual speech recognition: A posterior based approach
Abstract
Modern automatic speech recognition (ASR) systems are based on parametric statistical models such as hidden Markov models (HMMs), exploiting 1) acoustic-phonetic models, which need to be trained on large amounts of acoustic data, 2) a language model, which needs to be trained on large amounts of text data and, finally, 3) a lexicon with phonetic transcriptions, which requires linguistic expertise. Developing multilingual ASR systems, or systems that are robust to accents and dialects, is therefore a very challenging task for the current state of the art in ASR. In this thesis, we focus on investigating acoustic-phonetic modeling and lexical diversity across languages and databases, and assume that a language model is available. This is done in the context of hybrid HMM/MLP ASR, where the HMM emission probabilities are modeled as posterior probabilities of HMM states, conditioned on the acoustics and estimated at the output of a multilayer perceptron (MLP). We build upon a recently proposed acoustic modeling approach, referred to as KL-HMM, where posterior probabilities are directly used as acoustic features, and where the HMM states are directly parametrized by trained posterior probabilities. The set of HMM reference posteriors is then estimated by minimizing the Kullback–Leibler divergence between posterior features extracted from the training data and reference posteriors. The proposed KL-HMM model is extensively developed and adapted to tackle several challenging problems related to multilingual ASR, including lexical diversity, stochastic phone space transformations, accented speech recognition and the use of multilingual data resources to boost monolingual systems. The efficiency of the proposed approach is demonstrated through theoretical and experimental comparisons with similar approaches such as probabilistic acoustic mapping, linear hidden networks and maximum a posteriori adaptation.
Furthermore, KL-HMM is also compared with other posterior feature based ASR techniques such as Tandem, and with short-term spectral feature based approaches such as subspace Gaussian mixture models. The comparison reveals that the KL-HMM framework is a suitable alternative to conventional acoustic modeling techniques and seems to be preferable when little training data is available as well as in phoneme set mismatch scenarios.
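To make the KL-HMM idea in the abstract more concrete, the following is a minimal sketch rather than the thesis implementation. It assumes that each HMM state holds one categorical distribution over phoneme classes (a reference posterior), that the local cost of a frame is the divergence KL(z_t || y_s) from the posterior feature z_t to the reference posterior y_s, and that frames have already been aligned to states. The direction of the divergence, the function names and the toy data are all illustrative assumptions; the abstract does not specify them. Under this particular direction, the reference posterior that minimizes the summed divergence over the aligned frames is their arithmetic mean.

```python
import numpy as np

EPS = 1e-12  # floor to avoid log(0); an illustrative choice, not from the source


def kl_divergence(p, q):
    """KL(p || q) between two categorical distributions."""
    p = np.clip(np.asarray(p, dtype=float), EPS, None)
    q = np.clip(np.asarray(q, dtype=float), EPS, None)
    p = p / p.sum()  # renormalize after flooring
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def local_scores(posterior_features, reference_posteriors):
    """Cost matrix of shape (frames, states) holding KL(z_t || y_s)."""
    return np.array([[kl_divergence(z, y) for y in reference_posteriors]
                     for z in posterior_features])


def estimate_reference(aligned_features):
    """Minimizer of sum_t KL(z_t || y) over y: the arithmetic mean of the z_t."""
    return np.mean(np.asarray(aligned_features, dtype=float), axis=0)


# Toy posterior features over three phoneme classes (hypothetical MLP outputs).
frames = np.array([[0.7, 0.2, 0.1],
                   [0.6, 0.3, 0.1],
                   [0.1, 0.1, 0.8]])

# Hypothetical alignment: the first two frames belong to state 0, the last to state 1.
states = np.array([estimate_reference(frames[:2]),
                   estimate_reference(frames[2:])])

costs = local_scores(frames, states)
best_state = costs.argmin(axis=1)  # lowest-cost state for each frame
```

In a full system these local costs would replace the usual emission log-likelihoods inside Viterbi decoding, and the alignment and reference posteriors would be re-estimated iteratively. Using the reverse or a symmetric divergence changes the closed-form update for the reference posteriors.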
Similar resources
Improving Non-Native ASR Through Stochastic Multilingual Phoneme Space Transformations
We propose a stochastic phoneme space transformation technique that allows the conversion of conditional source phoneme posterior probabilities (conditioned on the acoustics) into target phoneme posterior probabilities. The source and target phonemes can be in any language and phoneme format such as the International Phonetic Alphabet. The novel technique makes use of a Kullback-Leibler diverge...
Multilingual speech recognition: a unified approach
In this paper, we present a unified approach for hidden Markov model based multilingual speech recognition. The proposed approach could be used across acoustically similar as well as diverse languages. We use an automatic phone mapping algorithm to map phones across languages and reduce the effective number of phones in the multilingual acoustic model. We experimentally verify the effectivene...
Comparing Three Methods to Create Multilingual Phone Models for Vocabulary Independent Speech Recognition Tasks
This paper presents three different methods to develop multilingual phone models for flexible speech recognition tasks. The main goal of our investigations is to find multilingual speech units which work equally well in many languages. With this universal set it is possible to build speech recognition systems for a variety of languages. One advantage of this approach is to share acoustic-phonet...
Multilingual text-to-phoneme mapping
This paper introduces a novel approach for generating multilingual text-to-phoneme mappings for use in multilingual speech recognition systems. The multilingual mappings are based on the weighted outputs from a neural network text-to-phoneme model, trained on data mixed from several languages. The multilingual mappings, used together with a branched grammar decoding scheme, are able to capture bot...
Online Unsupervised Multilingual Acoustic Model Adaptation for Nonnative ASR
Automatic speech recognition (ASR) is currently one of the main research interests in computer science. Hence, many ASR systems are available in the market. Yet, the performance of speech and language recognition systems is poor on nonnative speech. The challenge for nonnative speech recognition is to maximize the accuracy of a speech recognition system when only a small amount of nonnative dat...