Audio-visual speech recognition in challenging environments
نویسندگان
چکیده
Visual speech information is known to improve accuracy and noise robustness of automatic speech recognizers. However, todate, all audio-visual ASR work has concentrated on “visually clean” data with limited variation in the speaker’s frontal pose, lighting, and background. In this paper, we investigate audiovisual ASR in two practical environments that present significant challenges to robust visual processing: (a) Typical offices, where data are recorded by means of a portable PC equipped with an inexpensive web camera, and (b) automobiles, with data collected at three approximate speeds. The performance of all components of a state-of-the-art audio-visual ASR system is reported on these two sets and benchmarked against “visually clean” data recorded in a studio-like environment. Not surprisingly, both audioand visual-only ASR degrade, more than doubling their respective word error rates. Nevertheless, visual speech remains beneficial to ASR.
منابع مشابه
Recent Advances in the Automatic Recognition of Audio-Visual Speech
Visual speech information from the speaker’s mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability in the human computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: First, the visual front end design,...
متن کاملJoint audio-visual speech processing for recognition and enhancement
Visual speech information present in the speaker’s mouth region has long been viewed as a source for improving the robustness and naturalness of human-computer-interfaces (HCI). Such information can be particularly crucial in realistic HCI environments, where the acoustic channel is corrupted, and as a result, the performance of traditional automatic speech recognition (ASR) systems falls below...
متن کاملClassification based on located ROI Normalise ROI for Scale and Rotation
The performance of visual speech recognition (VSR) systems are significantly influenced by the accuracy of the visual front-end. The current state-of-the-art VSR systems use off-the-shelf face detectors such as ViolaJones (VJ) which has limited reliability for changes in illumination and head poses. For a VSR system to perform well under these conditions, an accurate visual front end is require...
متن کاملVisual front-endwars: Viola-Jones face detector vs Fourier Lucas-Kanade
The performance of visual speech recognition (VSR) systems are significantly influenced by the accuracy of the visual front-end. The current state-of-the-art VSR systems use off-the-shelf face detectors such as ViolaJones (VJ) which has limited reliability for changes in illumination and head poses. For a VSR system to perform well under these conditions, an accurate visual front end is require...
متن کاملCENSREC-AV: evaluation frameworks for audio-visual speech recognition
This paper introduces incoming evaluation frameworks for bimodal speech recognition in noisy conditions and real environments. In order to develop a robust speech recognition in noisy environments, bimodal speech recognition which uses acoustic and visual information has been paid attention to particularly for this decade. As a lot of methods and techniques for bimodal speech recognition have b...
متن کامل