Scalable Recurrent Neural Network Language Models for Speech Recognition
Authors
Abstract
Language modelling is a crucial component in many areas and applications, including automatic speech recognition (ASR). n-gram language models (LMs) have been the dominant technology over the last few decades, due to their easy implementation and good generalisation to unseen data. However, there are two well-known problems with n-gram LMs: data sparsity and the n-order Markov assumption. Previous research has explored various options to mitigate these issues. Recently, recurrent neural network LMs (RNNLMs) have been found to offer a solution to both. The data sparsity issue is addressed by projecting each word into a low-dimensional, continuous space, and the long-term history is modelled via the recurrent connection between the hidden and input layers. Hence, RNNLMs have become increasingly popular and promising results have been reported on a range of tasks.

However, several issues remain to be solved before RNNLMs can be applied to the ASR task. Because of the long-term history, the training of RNNLMs is difficult to parallelise and slow when large quantities of training data and large models are used. Viterbi decoding and lattice rescoring are easy to apply with standard n-gram LMs, as they have a limited history, but difficult with RNNLMs because of their long-term history.

This thesis aims to facilitate the application of RNNLMs in ASR. First, efficient training and evaluation of RNNLMs are developed. By splicing multiple sentences, RNNLMs can be trained efficiently in bunch (i.e. minibatch) mode on GPUs. Several improved training criteria are also investigated to further improve the efficiency of training and evaluation. Second, two algorithms are proposed for efficient lattice rescoring, which also allow compact lattices to be generated. Third, the adaptation of RNNLMs is investigated: model fine-tuning and the incorporation of informative features are studied, and various topic models are applied to extract topic representations for efficient adaptation. Finally, the differing modelling power of RNNLMs and n-gram LMs is explored, and the interpolation of these two types of model is studied.

The first contribution of this thesis is the efficient training and inference of RNNLMs. The training of RNNLMs is computationally heavy due to the large output layer and the difficulty of parallelisation. In most previous work, class-based RNNLMs were trained on CPUs. In this thesis, a novel sentence-splicing method is proposed, which allows RNNLMs to be trained much more efficiently in bunch mode; GPUs are also used to accelerate training.
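To make the two modelling assumptions above concrete, the following equations sketch them in generic notation (word sequence $w_1,\dots,w_T$, input vector $\mathbf{x}_t$ for word $w_t$, hidden state $\mathbf{h}_t$, weight matrices $\mathbf{U}$, $\mathbf{W}$, $\mathbf{V}$); the symbols are illustrative and are not taken from the thesis itself.

```latex
% n-gram LM: the n-order Markov assumption truncates the history
P(w_1,\dots,w_T) \;\approx\; \prod_{t=1}^{T} P(w_t \mid w_{t-n+1},\dots,w_{t-1})

% RNNLM: the full history is compressed into a recurrent hidden state,
% with each word first projected into a continuous space via U
\mathbf{h}_t = f\big(\mathbf{U}\mathbf{x}_t + \mathbf{W}\mathbf{h}_{t-1}\big),
\qquad
P(w_{t+1} \mid w_1,\dots,w_t) = \operatorname{softmax}\big(\mathbf{V}\mathbf{h}_t\big)
```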
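As an illustration of the sentence-splicing idea described above, the Python sketch below shows one way multiple sentences can be spliced into parallel streams and cut into fixed-length bunches so that a GPU processes several streams at each time step. The function names (splice_into_streams, make_bunches), the round-robin assignment, and the stream layout are hypothetical illustrations, not the thesis's actual implementation.

```python
import numpy as np

def splice_into_streams(sentences, num_streams):
    """Concatenate sentences end-to-end into a fixed number of streams.

    At every time step the RNNLM processes one word from each stream,
    so a bunch (minibatch) of `num_streams` words is fed in parallel.
    """
    streams = [[] for _ in range(num_streams)]
    for i, sent in enumerate(sentences):
        # Round-robin assignment keeps the streams roughly balanced.
        streams[i % num_streams].extend(sent)
    return streams

def make_bunches(streams, bptt_steps):
    """Cut the streams into aligned chunks of up to `bptt_steps` steps.

    Each yielded array has shape (steps, num_streams): column j is a
    contiguous piece of stream j, so the recurrent state for that
    stream can be carried over from one bunch to the next.
    """
    min_len = min(len(s) for s in streams)
    for start in range(0, min_len - 1, bptt_steps):
        end = min(start + bptt_steps, min_len - 1)
        inputs = np.array([s[start:end] for s in streams]).T
        targets = np.array([s[start + 1:end + 1] for s in streams]).T
        yield inputs, targets

# Toy example: 6 sentences already mapped to word ids, 3 parallel streams.
sentences = [[1, 2, 3, 0], [4, 5, 0], [6, 7, 8, 9, 0], [2, 3, 0], [5, 6, 0], [7, 0]]
streams = splice_into_streams(sentences, num_streams=3)
for inputs, targets in make_bunches(streams, bptt_steps=2):
    print(inputs.shape, targets.shape)  # (steps, num_streams) word-id bunches
```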
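The interpolation mentioned in the final contribution is commonly realised as a linear combination of the two models' probabilities, with the weight tuned on held-out data; the formulation below is the standard generic one rather than the thesis's specific scheme.

```latex
P(w_t \mid h_t) \;=\; \lambda \, P_{\text{RNN}}(w_t \mid h_t)
\;+\; (1-\lambda) \, P_{\text{NG}}(w_t \mid h_t),
\qquad 0 \le \lambda \le 1
```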
Similar resources
Investigating Bidirectional Recurrent Neural Network Language Models for Speech Recognition
Recurrent neural network language models (RNNLMs) are powerful language modeling techniques. Compared to n-gram language models, significant performance improvements have been reported in a range of tasks including speech recognition. Conventional n-gram and neural network language models are trained to predict the probability of the next word given its preceding context history. In contrast, bi...
Speed versus Accuracy in Neural Sequence Tagging for Natural Language Processing
Sequence Tagging, including part of speech tagging, chunking and named entity recognition, is an important task in NLP. Recurrent neural network models such as Bidirectional LSTMs have produced impressive results on sequence tagging. In this work, we first present a Bidirectional LSTM neural network model for sequence tagging tasks. Then we show a simple and fast greedy sequence tagging system ...
Speech Emotion Recognition Using Scalogram Based Deep Structure
Speech Emotion Recognition (SER) is an important part of speech-based Human-Computer Interface (HCI) applications. Previous SER methods rely on the extraction of features and training an appropriate classifier. However, most of those features can be affected by emotionally irrelevant factors such as gender, speaking styles and environment. Here, an SER method has been proposed based on a concat...
Convolutional Neural Network with Adaptable Windows for Speech Recognition
Although, speech recognition systems are widely used and their accuracies are continuously increased, there is a considerable performance gap between their accuracies and human recognition ability. This is partially due to high speaker variations in speech signal. Deep neural networks are among the best tools for acoustic modeling. Recently, using hybrid deep neural network and hidden Markov mo...
Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting ...
Prosodically-enhanced recurrent neural network language models
Recurrent neural network language models have been shown to consistently reduce the word error rates (WERs) of large vocabulary speech recognition tasks. In this work we propose to enhance the RNNLMs with prosodic features computed using the context of the current word. Since it is plausible to compute the prosody features at the word and syllable level we have trained the models on prosody fea...