Large-vocabulary audio-visual speech recognition by machines and humans
Authors
Abstract
We compare automatic recognition with human perception of audio-visual speech in the large-vocabulary, continuous speech recognition (LVCSR) domain. Specifically, we study the benefit of the visual modality for both machines and humans when it is combined with audio degraded by speech-babble noise at various signal-to-noise ratios (SNRs). We first consider an automatic speechreading system with a pixel-based visual front end that uses feature fusion for bimodal integration, and we compare its performance with an audio-only LVCSR system. We then describe results of human speech perception experiments, where subjects are asked to transcribe audio-only and audio-visual utterances at various SNRs. For both machines and humans, we observe an effective SNR gain of approximately 6 dB compared to the audio-only performance at 10 dB; however, such gains diverge significantly at other SNRs. Furthermore, automatic audio-visual recognition outperforms human audio-only speech perception at low SNRs.
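The "effective SNR gain" mentioned in the abstract can be read as the horizontal shift between two error curves: how much cleaner the audio would need to be for the audio-only system to match the audio-visual result. The sketch below illustrates one plausible way to compute it with linear interpolation. The function names and all word error rate (WER) values are hypothetical and are not taken from the paper, whose exact procedure is not given in the abstract.

```python
def interpolate(x, xs, ys):
    """Piecewise-linear interpolation of y at x; xs must be strictly increasing."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the sampled range")
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def effective_snr_gain(snrs_db, wer_audio, wer_av, at_snr_db):
    """Return the dB shift at which audio-only WER equals audio-visual WER.

    Assumes the audio-only WER decreases as SNR increases.
    """
    # Audio-visual WER at the operating point of interest.
    target = interpolate(at_snr_db, snrs_db, wer_av)
    # Invert the audio-only curve: find the SNR where it reaches the same WER.
    for s0, s1, w0, w1 in zip(snrs_db, snrs_db[1:], wer_audio, wer_audio[1:]):
        if w1 <= target <= w0 and w0 != w1:
            equivalent_snr = s0 + (s1 - s0) * (w0 - target) / (w0 - w1)
            return equivalent_snr - at_snr_db
    raise ValueError("audio-visual WER lies outside the audio-only curve")


# Made-up WER values (percent) for illustration only, not results from the paper.
snrs_db = [0, 5, 10, 15, 20]
wer_audio = [80, 60, 40, 25, 15]
wer_av = [55, 40, 28, 19, 13]

gain = effective_snr_gain(snrs_db, wer_audio, wer_av, at_snr_db=10)
print(f"Effective SNR gain at 10 dB: {gain:.1f} dB")
```

With these invented numbers, the audio-visual WER at 10 dB (28%) is reached by the audio-only curve at 14 dB, so the gain is 4 dB. Because the gain depends on the local slopes of both curves, it can differ from one SNR to another, which is consistent with the divergence the abstract reports.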
Similar resources
Large Vocabulary Audio-Visual Speech Recognition Using the Janus Speech Recognition Toolkit
This paper describes audio-visual speech recognition experiments on a multi-speaker, large vocabulary corpus using the Janus speech recognition toolkit. We describe a complete audio-visual speech recognition system and present experiments on this corpus. By using visual cues as additional input to the speech recognizer, we observed good improvements, both on clean and noisy speech in our experi...
Large-vocabulary audio-visual speech recognition: a summary of the Johns Hopkins Summer 2000 Workshop
We report a summary of the Johns Hopkins Summer 2000 Workshop on audio-visual automatic speech recognition (ASR) in the large-vocabulary, continuous speech domain. Two problems of audio-visual ASR were mainly addressed: Visual feature extraction and audio-visual information fusion. First, image transform and model-based visual features were considered, obtained by means of the discrete cosine t...
A 3D Audio-visual Corpus for Speech Recognition
A new 3D audio-visual speech recognition corpus is described in this paper. The corpus consists of a large number of read numbers, various types of vocabularies, and well-designed sentences spoken by approximately 1000 speakers. In this paper, we describe the process of generating the corpus, with particular emphasis on visual speech processing. The visual data is collected by a stereo cam...
Audio-Visual Speech Recognition
We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, for ASR to approach human levels of performance and for speech to become a truly pervasive user interface, we need novel, nontraditional approaches that have the potential of yielding...
Recent Advances in the Automatic Recognition of Audio-Visual Speech
Visual speech information from the speaker’s mouth region has been shown to improve the noise robustness of automatic speech recognizers, thus promising to extend their usability in the human-computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: First, the visual front end design,...