Stream weight estimation for multistream audio-visual speech recognition in a multispeaker environment
Authors
Abstract
The paper considers the problem of audio-visual speech recognition in a simultaneous (target/masker) speaker environment. The paper follows a conventional multistream approach and examines the specific problem of estimating reliable time-varying audio and visual stream weights. The task is challenging because, in the two-speaker condition, signal-to-noise ratio (SNR) – and hence audio stream weight – cannot always be reliably inferred from the acoustics alone. Similarity between the target and masker sound sources can cause the foreground and background to be confused. The paper presents a novel solution that combines both audio and visual information to estimate acoustic SNR. The method employs artificial neural networks to estimate the SNR from hidden Markov model (HMM) state-likelihoods calculated using separate audio and visual streams. SNR estimates are then mapped to either constant utterance-level (global) stream weights or time-varying frame-based (local) stream weights. The system has been evaluated using either gender-dependent models that are specific to the target speaker, or gender-independent models that discriminate poorly between target and masker. When using known SNR, the time-varying stream weight system outperforms the constant stream weight systems at all SNRs tested. It is thought that the time-varying weight allows the automatic speech recognition system to take advantage of regions where the local SNR is temporarily high despite the global SNR being low. When using estimated SNR, the time-varying system outperformed the constant stream weight system at SNRs of 0 dB and above. Systems using stream weights estimated from both audio and video information performed better than those using stream weights estimated from the audio stream alone, particularly in the gender-independent case.
However, when mixtures are at a global SNR below 0 dB, stream weights are not sufficiently well estimated to produce good performance. Methods for improving the SNR estimation are discussed. The paper also relates the use of visual information in the current system to its role in recent simultaneous speaker intelligibility studies, where, as well as providing phonetic content, it triggers 'informational masking release', helping the listener to attend selectively to the target speech stream.
Preprint submitted to Elsevier Science, 8 November 2007.
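The core fusion step described above is the standard multistream log-linear combination: per-stream HMM log-likelihoods are mixed with an audio weight that the system derives from an SNR estimate. The sketch below illustrates that combination only; the sigmoid mapping from SNR to weight and its constants (`alpha`, `snr0`) are illustrative assumptions, not values taken from the paper, which instead trains the SNR-to-weight mapping from data.

```python
import math

def combine_stream_loglik(log_p_audio, log_p_video, snr_db,
                          alpha=0.25, snr0=0.0):
    """Weighted-product (log-linear) fusion of audio and visual stream
    log-likelihoods, as in a conventional multistream HMM.

    The audio weight is taken here as a sigmoid of the (global or
    frame-local) SNR: high SNR -> trust the audio stream, low SNR ->
    fall back on the visual stream. `alpha` and `snr0` are hypothetical
    tuning constants for illustration.
    """
    lam = 1.0 / (1.0 + math.exp(-alpha * (snr_db - snr0)))  # audio weight in (0, 1)
    # Log-domain weighted product: lam * log P_a + (1 - lam) * log P_v
    return lam * log_p_audio + (1.0 - lam) * log_p_video, lam
```

Using a frame-local SNR estimate here (rather than one utterance-level value) yields the time-varying weights that the abstract reports as performing best.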
Similar articles
Audio-Visual Speech Modeling for Continuous Speech Recognition
This paper describes a speech recognition system that uses both acoustic and visual speech information to improve the recognition performance in noisy environments. The system consists of three components: 1) a visual module; 2) an acoustic module; and 3) a sensor fusion module. The visual module locates and tracks the lip movements of a given speaker and extracts relevant speech features. This...
Speaker adaptation for audio-visual speech recognition
In this paper, speaker adaptation is investigated for audiovisual automatic speech recognition (ASR) using the multistream hidden Markov model (HMM). First, audio-only and visual-only HMM parameters are adapted by combining maximum a posteriori and maximum likelihood linear regression adaptation. Subsequently, the audio-visual HMM stream exponents are adapted to better capture the reliability o...
A new posterior based audio-visual integration method for robust speech recognition
We describe the development of a multistream HMM based audio-visual speech recognition (AVSR) system and a new method for integrating the audio and visual streams using frame level posterior probabilities. This is compared to the standard feature concatenation and weighted product methods in speaker-dependent tests using our own multimodal database, by examining speech recognition robustness to...
Audio-visual speech recognition in the presence of a competing speaker
This paper examines the problem of estimating stream weights for a multistream audio-visual speech recogniser in the context of a simultaneous speaker task. The task is challenging because signal-to-noise ratio (SNR) cannot be readily inferred from the acoustics alone. The method proposed employs artificial neural networks (ANNs) to estimate the SNR from HMM state-likelihoods. SNR is converted t...
Stream weight estimation using higher order statistics in multi-modal speech recognition
In this paper, stream weight optimization for multi-modal speech recognition using audio information and visual information is examined. In a conventional multi-stream Hidden Markov Model (HMM) used in multi-modal speech recognition, a constraint in which the summation of audio and visual weight factors should be one is employed. This means balance between transition and observation probabiliti...
Journal:
Speech Communication
Volume: 50
Issue: -
Pages: -
Publication year: 2008