Informing multisource decoding in robust automatic speech recognition
Abstract
Listeners are remarkably adept at recognising speech in natural multisource environments, while most Automatic Speech Recognition (ASR) technology fails in these conditions. It has been proposed that this human ability is governed by Auditory Scene Analysis (ASA) processes, in which a sound mixture is segregated into perceptual packages, called ‘streams’, by a combination of bottom-up and top-down processing. This thesis examines a novel ASR framework based on the ASA account, Speech Fragment Decoding (SFD). A ‘fragment’ is a spectro-temporal region where energy from a single sound source dominates. SFD employs techniques developed from knowledge about the auditory system to identify fragments. A decoding process using statistical speech models is applied to the fragment representation to simultaneously identify speech evidence and recognise speech. In this study three techniques for improving SFD are investigated. Firstly, explicit duration modelling is exploited to combat the corruption of acoustic data which often causes the decoder to produce word matches with unrealistic durations. Secondly, it is argued that the top-down information in recognition models may be insufficient to mediate the speech identification. Knowledge that can assist the decoder in the choice of speech evidence is investigated. Thirdly, pitch cues derived from structure in the correlogram are used in the fragment generation process. A range of small-vocabulary speech recognition experiments are conducted for evaluation. The improved SFD system is able to produce word error rates significantly lower than conventional ASR, and is relatively insensitive to a range of noise conditions. In conclusion, the framework provides some progress towards finding a general solution to the robust ASR problem in multisource environments.
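The pitch cues described above are derived from periodicity structure in the correlogram. As a toy illustration only (the function and synthetic signal below are illustrative assumptions, not the thesis's actual auditory front end, which computes per-channel autocorrelations across a filterbank), the core idea can be sketched as a summary autocorrelation that peaks at the pitch period:

```python
import numpy as np

def autocorr_pitch(signal, fs, fmin=50.0, fmax=400.0):
    """Toy pitch estimate from the autocorrelation peak of one frame.

    A simplified stand-in for correlogram-based pitch cues: a voiced
    frame's autocorrelation peaks at lags that are multiples of the
    pitch period, and the first strong peak gives the fundamental.
    """
    x = signal - np.mean(signal)                      # remove DC offset
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # non-negative lags
    lo = int(fs / fmax)                               # shortest plausible period
    hi = int(fs / fmin)                               # longest plausible period
    lag = lo + np.argmax(ac[lo:hi + 1])               # best lag in pitch range
    return fs / lag

# Usage: a synthetic 100 Hz harmonic tone sampled at 16 kHz
fs = 16000
t = np.arange(0, 0.05, 1.0 / fs)
tone = sum(np.sin(2 * np.pi * 100 * k * t) for k in range(1, 4))
print(autocorr_pitch(tone, fs))  # close to 100 Hz
```

In the full correlogram this computation is carried out per cochlear channel, and spectro-temporal regions sharing a common periodicity can then be grouped into a single fragment.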
Similar resources
Advances in audio source separation and multisource audio content retrieval
Audio source separation aims to extract the signals of individual sound sources from a given recording. In this paper, we review three recent advances which improve the robustness of source separation in real-world challenging scenarios and enable its use for multisource content retrieval tasks, such as automatic speech recognition (ASR) or acoustic event detection (AED) in noisy environments. ...
Improving the performance of MFCC for Persian robust speech recognition
The Mel Frequency cepstral coefficients are the most widely used features in speech recognition, but they are very sensitive to noise. In this paper, to achieve satisfactory performance in Automatic Speech Recognition (ASR) applications, we introduce a new noise-robust set of MFCC vectors estimated through the following steps. First, spectral mean normalization is a pre-processing which applies to t...
A Simplified Decoding Method for a Robust Distant-talking ASR Concept Based on Feature-domain Dereverberation
A simplified decoding method for the concept of REverberation MOdeling for Speech recognition (REMOS) [1] is proposed. In order to achieve robust distant-talking Automatic Speech Recognition (ASR), the REMOS concept uses a combination of clean-speech HMMs and a reverberation model to perform feature-domain dereverberation during decoding. The simplified decoding/dereverberation method proposed ...
Uncertainty Decoding for Noise Robust Automatic Speech Recognition
This report presents uncertainty decoding as a method for robust automatic speech recognition for the Noise Robust Automatic Speech Recognition project funded by Toshiba Research Europe Limited. The effects of noise on speech recognition are reviewed and a general framework for noise robust speech recognition introduced. Common and related noise robustness techniques are described in the contex...
Automatic Assessment of Reading with Speech Recognition Technology
In this paper, we describe ongoing research towards building an automatic reading assessment system that emulates a human expert in a spoken language learning scenario. Audio recordings of read aloud English stories by children of grades 6-8 are acquired on an available tablet application that facilitates guided oral reading and recording. The created recordings, uploaded to a web-based ratings...