Endpoint Detection Using Grid Long Short-Term Memory Networks for Streaming Speech Recognition
Authors
Abstract
The task of endpointing is to determine when the user has finished speaking. This is important for interactive speech applications such as voice search and Google Home. In this paper, we propose a GLDNN-based (grid long short-term memory deep neural network) endpointer model and show that it provides significant improvements over a state-of-the-art CLDNN (convolutional, long short-term memory, deep neural network) model. Specifically, we replace the convolution layer in the CLDNN with a grid LSTM layer that models both spectral and temporal variations through recurrent connections. Results show that the GLDNN achieves a 32% relative improvement in false alarm rate at a fixed false reject rate of 2%, and reduces median latency by 11%. We also include detailed experiments investigating why grid LSTMs offer better performance than convolution layers. Analysis reveals that the recurrent connection along the frequency axis is an important factor that greatly contributes to the performance of grid LSTMs, especially in the presence of background noise. Finally, we show that multichannel input further increases robustness to background speech. Overall, we achieve a 16% (100 ms) endpointer latency improvement relative to our previous best model on a Voice Search task.
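The operating-point metric quoted in the abstract (false alarm rate at a fixed false reject rate) can be illustrated with a minimal sketch. The abstract does not say how the authors score their endpointer, so everything here is an assumption made for illustration: the function names `fa_rate_at_fixed_fr` and `relative_improvement`, the rule that an example is accepted when its score is at or above the threshold, and the integer scores in the usage lines.

```python
import math


def fa_rate_at_fixed_fr(pos_scores, neg_scores, target_fr=0.02):
    """Pick the highest threshold whose false reject rate is <= target_fr.

    pos_scores: detector scores for examples where the event is truly present.
    neg_scores: detector scores for examples where the event is absent.
    An example is accepted when score >= threshold (an assumed convention).
    Returns (threshold, false_reject_rate, false_alarm_rate).
    """
    pos = sorted(pos_scores)
    # Number of true positives we are allowed to reject.
    max_rejects = math.floor(target_fr * len(pos))
    # Only positives strictly below this score are rejected,
    # so at most max_rejects of them fall under the threshold.
    threshold = pos[max_rejects]
    fr = sum(s < threshold for s in pos) / len(pos)
    fa = sum(s >= threshold for s in neg_scores) / len(neg_scores)
    return threshold, fr, fa


def relative_improvement(baseline, new):
    """Relative reduction of an error-like quantity (lower is better)."""
    return (baseline - new) / baseline


# Hypothetical scores, not data from the paper.
pos = list(range(100))               # scores 0..99 for true events
neg = [i - 50 for i in range(100)]   # scores -50..49 for non-events
threshold, fr, fa = fa_rate_at_fixed_fr(pos, neg, target_fr=0.02)
print(threshold, fr, fa)
print(relative_improvement(100.0, 68.0))
```

Fixing the false reject rate first and then comparing false alarm rates puts two models at the same operating point, which is what lets a single relative figure summarize the difference between them.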
Related resources
Speech Emotion Recognition Using Scalogram Based Deep Structure
Speech Emotion Recognition (SER) is an important part of speech-based Human-Computer Interface (HCI) applications. Previous SER methods rely on the extraction of features and training an appropriate classifier. However, most of those features can be affected by emotionally irrelevant factors such as gender, speaking styles and environment. Here, an SER method has been proposed based on a concat...
Speaker Change Detection in Broadcast TV Using Bidirectional Long Short-Term Memory Networks
Speaker change detection is an important step in a speaker diarization system. It aims at finding speaker change points in the audio stream. In this paper, it is treated as a sequence labeling task and addressed by Bidirectional long short term memory networks (Bi-LSTM). The system is trained and evaluated on the Broadcast TV subset from ETAPE database. The result shows that the proposed model ...
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, the IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive, is one of the key issues for this broadcasting...
Robust speech recognition using long short-term memory recurrent neural networks for hybrid acoustic modelling
One method to achieve robust speech recognition in adverse conditions including noise and reverberation is to employ acoustic modelling techniques involving neural networks. Using long short-term memory (LSTM) recurrent neural networks proved to be efficient for this task in a setup for phoneme prediction in a multi-stream GMM-HMM framework. These networks exploit a self-learnt amount of tempor...
Reducing the Computational Complexity of Two-Dimensional LSTMs
Long Short-Term Memory Recurrent Neural Networks (LSTMs) are good at modeling temporal variations in speech recognition tasks, and have become an integral component of many state-of-the-art ASR systems. More recently, LSTMs have been extended to model variations in the speech signal in two dimensions, namely time and frequency [1, 2]. However, one of the problems with two-dimensional LSTMs, suc...