Mask estimation incorporating time-frequency trajectories for a CASA-based ASR front-end
Abstract
In this paper, we propose a mask estimation method for a computational auditory scene analysis (CASA) based speech recognition front-end that uses speech obtained from two microphones. The method builds on the observation that mask information should be correlated over contiguous analysis time frames and across adjacent frequency channels. To this end, two hidden Markov models (HMMs), a time HMM and a frequency HMM, representing the time and frequency trajectories respectively, are trained on features such as the interaural time difference and the interaural level difference of the two-channel signals. The mask for a given time-frequency bin is estimated by combining the likelihoods obtained from the two HMMs, and is then used to separate the desired speech from the noisy speech. To show the effectiveness of the proposed mask estimation, we first measure the root mean square error between the ideal mask and the mask estimated by the proposed method. We then compare the performance of a speech recognition system using the proposed mask estimation method with that of systems using conventional methods. In these experiments, the proposed method provides an average word error rate reduction of 63.2% and 3.1% when compared with the Gaussian kernel-based and time HMM-based mask estimation methods, respectively.
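The abstract does not say how the two HMM likelihoods are combined, so the following is only a minimal sketch under stated assumptions: each model is assumed to supply, for every time-frequency bin, a log-likelihood ratio between "target-dominant" and "noise-dominant", the two ratios are merged with a hypothetical weight, and a logistic function maps the result to a soft mask in [0, 1]. The function names, the weight, and the logistic mapping are illustrative choices, not the authors' method. The root mean square error against an ideal mask is computed as described in the abstract.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def combine_to_mask(time_llr, freq_llr, weight=0.5):
    """Combine per-bin log-likelihood ratios from two models into a soft mask.

    time_llr[t][f] and freq_llr[t][f] are hypothetical log-likelihood ratios,
    log p(target-dominant) - log p(noise-dominant), for bin (t, f).
    `weight` is an assumed balance between the time and frequency models.
    """
    mask = []
    for t_row, f_row in zip(time_llr, freq_llr):
        row = []
        for lt, lf in zip(t_row, f_row):
            combined = weight * lt + (1.0 - weight) * lf
            row.append(_sigmoid(combined))  # soft mask value in [0, 1]
        mask.append(row)
    return mask


def apply_mask(magnitudes, mask):
    """Scale each time-frequency magnitude by its mask value."""
    return [[m * w for m, w in zip(m_row, w_row)]
            for m_row, w_row in zip(magnitudes, mask)]


def mask_rmse(ideal, estimated):
    """Root mean square error between an ideal and an estimated mask."""
    squared = [(i - e) ** 2
               for i_row, e_row in zip(ideal, estimated)
               for i, e in zip(i_row, e_row)]
    return math.sqrt(sum(squared) / len(squared))
```

With equal weighting, a bin on which the two models disagree by the same amount in opposite directions receives a mask value of 0.5, which is one way to see why agreement between the time and frequency trajectories matters for a confident decision.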
Related papers
SNR-based mask compensation for computational auditory scene analysis applied to speech recognition in a car environment
In this paper, we propose a computational auditory scene analysis (CASA)-based front-end for two-microphone speech recognition in a car environment. One of the important issues associated with CASA is the accurate estimation of mask information for target speech separation within multiple microphone noisy speech. For such a task, the time-frequency mask information is compensated through the si...
ASR-driven Binary Mask Estimation for Robust Automatic Speech Recognition
Additive noise has long been an issue for robust automatic speech recognition (ASR) systems. One approach to noise robustness is the removal of noise information through segregation by binary time-frequency masks; each time-frequency unit in a spectro-temporal representation of the speech signal is labeled either noise-dominant or signal-dominant. The noise-dominant units are masked and their e...
CASA based speech separation for robust speech recognition
This paper introduces a speech separation system as a front-end processing step for automatic speech recognition (ASR). It employs computational auditory scene analysis (CASA) to separate the target speech from the interfering speech. Specifically, the mixed speech is preprocessed with an auditory peripheral model. Pitch tracking is then conducted, and the dominant pitch is used as a main cu...
Mask estimation in non-stationary noise environments for missing feature based robust speech recognition
In missing feature based automatic speech recognition (ASR), the role of the spectro-temporal mask in providing an accurate description of the relationship between target speech and environmental noise is critical for minimizing the degradation in ASR word accuracy (WAC) as the signal-to-noise ratio (SNR) decreases. This paper demonstrates the importance of accurate characterization of instanta...
Optimization of Speech Enhancement Front-End with Speech Recognition-Level Criterion
This paper concerns the use of speech enhancement to improve automatic speech recognition (ASR) performance in noisy environments. Speech enhancement systems are usually designed separately from a back-end recognizer by optimizing the front-end parameters with signal-level criteria. Such a disjoint processing approach is not always useful for ASR. Indeed, time-frequency masking, which is widely u...
Publication date: 2008