Improved bimodal speech recognition using tied-mixture HMMs and 5000 word audio-visual synchronous database
Abstract
This paper presents methods to improve speech recognition accuracy by incorporating automatic lip reading. Lip-reading accuracy is improved through three approaches: 1) collection of image-and-speech synchronous data for 5,240 words; 2) extraction of 2-dimensional power-spectrum features around the mouth; and 3) sub-word-unit HMMs with tied-mixture distributions (tied-mixture HMMs). Experiments on a 100-word test show a performance of 85% by lip reading alone. It is also shown that tied-mixture HMMs improve lip-reading accuracy. Speech recognition experiments integrating audio-visual information are carried out over various SNRs. The results show that the integration always achieves better performance than using either audio or visual information alone.
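The defining property of the tied-mixture HMMs mentioned in the abstract is that all states share a single pool of Gaussians and differ only in their mixture weights over that pool. The sketch below illustrates one state's emission log-likelihood under that scheme; it is illustrative only, not the paper's implementation, and the pool size, feature dimensionality, and weight values are invented for the example:

```python
import numpy as np

def log_gauss(o, mean, var):
    """Log density of a diagonal-covariance Gaussian at observation o."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mean) ** 2 / var)

def tied_mixture_log_likelihood(o, pool_means, pool_vars, state_weights):
    """Emission log-likelihood of one state in a tied-mixture HMM.

    Every state shares the same Gaussian pool (pool_means, pool_vars);
    a state is characterized only by its mixture weights over that pool.
    """
    comp = np.array([log_gauss(o, m, v) for m, v in zip(pool_means, pool_vars)])
    a = comp + np.log(state_weights)
    m = a.max()                      # log-sum-exp for numerical stability
    return m + np.log(np.exp(a - m).sum())

# hypothetical shared pool: 4 Gaussians over a 2-dimensional feature vector
rng = np.random.default_rng(0)
pool_means = rng.normal(size=(4, 2))
pool_vars = np.ones((4, 2))
state_weights = np.array([0.4, 0.3, 0.2, 0.1])  # one state's weights, sum to 1
print(tied_mixture_log_likelihood(np.zeros(2), pool_means, pool_vars, state_weights))
```

Because the Gaussians are tied, adding states only adds weight vectors, which keeps the parameter count low when training data (such as lip images) are scarce.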
Similar papers
Product HMMs for audio-visual continuous speech recognition using facial animation parameters
The use of visual information in addition to acoustic can improve automatic speech recognition. In this paper we compare different approaches for audio-visual information integration and show how they affect automatic speech recognition performance. We utilize Facial Animation Parameters (FAPs), supported by the MPEG-4 standard for the visual representation as visual features. We use both Singl...
Audio-visual speech recognition using MCE-based HMMs and model-dependent stream weights
This paper presents a framework for designing a hidden Markov model (HMM)-based audio-visual automatic speech recognition (ASR) system based on minimum classification error training. Audio/visual HMM parameters are optimized with the generalized probabilistic descent (GPD) method, and their likelihoods are combined using model-dependent stream weights which are also estimated with the GPD metho...
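The stream-weighted combination described above scores each candidate model by a weighted sum of its audio and visual log-likelihoods before the recognition decision. A minimal sketch of that fusion step, assuming weights that sum to one (a common convention); the words, scores, and weight value are invented, and the GPD estimation of the weights is not shown:

```python
def fuse_stream_log_likelihoods(log_p_audio, log_p_visual, audio_weight):
    """Combine audio and visual log-likelihoods for one model.

    A larger audio_weight trusts the acoustic stream more (useful at
    high SNR); here the two stream weights sum to 1.
    """
    return audio_weight * log_p_audio + (1.0 - audio_weight) * log_p_visual

def decide(word_scores):
    """Pick the word whose fused score is highest."""
    return max(word_scores, key=word_scores.get)

# hypothetical two-word example: noisy audio slightly favours "pa",
# but the visual stream strongly favours "ba"
log_audio = {"ba": -10.0, "pa": -9.5}
log_visual = {"ba": -4.0, "pa": -8.0}
scores = {w: fuse_stream_log_likelihoods(log_audio[w], log_visual[w], 0.3)
          for w in log_audio}
print(decide(scores))  # at low audio weight the visual evidence dominates
```

Lowering the audio weight as SNR drops is what lets such a system degrade gracefully in noise rather than tracking the corrupted acoustic stream.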
CENSREC-1-AV: an audio-visual corpus for noisy bimodal speech recognition
In this paper, an audio-visual speech corpus CENSREC-1-AV for noisy speech recognition is introduced. CENSREC-1-AV consists of an audio-visual database and a baseline system of bimodal speech recognition which uses audio and visual information. In the database, there are 3,234 and 1,963 utterances made by 42 and 51 speakers as a training and a test sets respectively. Each utterance consists of ...
Combination of Standard and Complementary Models for Audio-Visual Speech Recognition
In this work, new multi-classifier schemes for isolated word speech recognition based on the combination of standard Hidden Markov Models (HMMs) and Complementary Gaussian Mixture Models (CGMMs) are proposed. Typically, in speech recognition systems, each word or phoneme in the vocabulary is represented by a model trained with samples of each particular class. The recognition is then performed ...
Improving lip-reading performance for robust audiovisual speech recognition using DNNs
This paper presents preliminary experiments using the Kaldi toolkit [1] to investigate audiovisual speech recognition (AVSR) in noisy environments using deep neural networks (DNNs). In particular we use a single-speaker large vocabulary, continuous audiovisual speech corpus to compare the performance of visual-only, audio-only and audiovisual speech recognition. The models trained using the Kal...