Integrating Knowledge Into End-to-End Speech Recognition From External Text-Only Data
Authors
Abstract
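Attention-based encoder-decoder (AED) models have achieved promising performance in speech recognition. However, because of the end-to-end training, an AED model is usually trained with speech-text paired data. It is challenging to incorporate external text-only data into AED models. Another issue with the AED model is that it does not use the right context of a text token while predicting the token. To alleviate the above two issues, we propose a unified method called LST (Learn Spelling from Teachers) to integrate knowledge into an AED model from external text-only data and leverage the whole context in a sentence. The method is divided into two stages. First, in the representation stage, a language model (LM) is trained on the text. The knowledge in the text can be seen as compressed into the LM. Then, at the transferring stage, the knowledge is transferred to the AED model via teacher-student learning. To further use the whole context of the text sentence, we propose an LM called causal cloze completer (COR), which estimates the probability of a token, given both the left context and the right context of it. Therefore, with LST training, the AED model can leverage the whole context in the sentence. Unlike fusion-based methods, which use an LM during decoding, the proposed method does not add any extra complexity at the inference stage. We conduct experiments on two scales of public Chinese datasets, AISHELL-1 and AISHELL-2. The experimental results demonstrate the effectiveness of leveraging external text-only data and the whole context in a sentence with our proposed method, compared with baseline hybrid systems and AED-based systems.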
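The transferring stage described in the abstract can be illustrated with a generic teacher-student loss. The following is only a minimal sketch under stated assumptions, not the loss used in the paper: the function names, the interpolation weight `weight`, and the choice of combining a hard-label cross-entropy with a soft-label cross-entropy against the teacher LM distribution are all hypothetical choices made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_probs, target_index, weight=0.5):
    """Per-token teacher-student loss (hypothetical form, assumed for illustration).

    student_logits: scores of the student (e.g. an AED decoder) over the vocabulary
    teacher_probs:  soft labels from a teacher LM trained on text-only data
    target_index:   index of the ground-truth token from the paired transcript
    weight:         assumed interpolation weight between hard and soft losses
    """
    student_probs = softmax(student_logits)
    # Hard-label term: ordinary cross-entropy with the transcript token.
    hard_loss = -math.log(student_probs[target_index])
    # Soft-label term: cross-entropy between teacher and student distributions.
    soft_loss = -sum(
        t * math.log(s) for t, s in zip(teacher_probs, student_probs) if t > 0
    )
    return (1.0 - weight) * hard_loss + weight * soft_loss


if __name__ == "__main__":
    # Toy vocabulary of three tokens; the teacher prefers token 1.
    loss = distillation_loss([1.0, 2.0, 0.5], [0.1, 0.8, 0.1], target_index=1)
    print(f"loss = {loss:.4f}")
```

Because the teacher is consulted only when computing the training loss, a student trained this way decodes without the teacher, which is consistent with the abstract's claim that no extra complexity is added at the inference stage.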
Similar resources
End-to-end Audiovisual Speech Recognition
Several end-to-end deep learning approaches have been recently presented which extract either audio or visual features from the input images or audio signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. In this work, we present an end-to-end audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the ...
End-to-End Speech Recognition Models
For the past few decades, the bane of Automatic Speech Recognition (ASR) systems has been phonemes and Hidden Markov Models (HMMs). HMMs assume conditional independence between observations, and the reliance on explicit phonetic representations requires expensive handcrafted pronunciation dictionaries. Learning is often via detached proxy problems, and there especially exists a disconnect betw...
Multichannel End-to-end Speech Recognition
The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology. Using an attention mechanism in a recurrent encoder-decoder architecture solves the dynamic time alignment problem, allowing joint end-to-end training of the acoustic and language modeling components. In this paper we extend ...
Towards End-to-End Speech Recognition
Standard automatic speech recognition (ASR) systems follow a divide and conquer approach to convert speech into text. Alternately, the end goal is achieved by a combination of sub-tasks, namely, feature extraction, acoustic modeling and sequence decoding, which are optimized in an independent manner. More recently, in the machine learning community deep learning approaches have emerged which al...
Deep Speech: Scaling up end-to-end speech recognition
We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model backgro...
Journal
Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Year: 2021
ISSN: 2329-9304, 2329-9290
DOI: https://doi.org/10.1109/taslp.2021.3066274