Improving Mandarin End-to-End Speech Recognition With Word N-Gram Language Model
Abstract
Despite the rapid progress of end-to-end (E2E) automatic speech recognition (ASR), it has been shown that incorporating external language models (LMs) into decoding can further improve the performance of E2E ASR systems. To align with the modeling units adopted in E2E ASR systems, subword-level (e.g., character, BPE) LMs are usually used to cooperate with current E2E ASR systems. However, the use of subword-level LMs ignores word-level information, which may limit the strength of external LMs in E2E ASR. Although several methods have been proposed to incorporate word-level LMs into E2E ASR, these methods are mainly designed for languages with clear word boundaries, such as English, and cannot be directly applied to languages like Mandarin, in which each character sequence can have multiple corresponding word sequences. To this end, we propose a novel decoding algorithm in which a word lattice is constructed on-the-fly to consider all possible word sequences for each partial hypothesis. The LM score of the hypothesis is then obtained by intersecting the generated lattice with an external word N-gram LM. The proposed method is examined on both Attention-based Encoder-Decoder (AED) and Neural Transducer (NT) frameworks. Experiments suggest that our method consistently outperforms subword-level LMs, including the neural network LM. We achieve state-of-the-art results on the Aishell-1 (CER 4.18%) and Aishell-2 (CER 5.06%) datasets and reduce CER by 14.8% relatively on a 21K-hour Mandarin dataset.
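The core idea of the abstract, that a Mandarin character sequence admits multiple word segmentations which must all be considered when scoring with a word-level LM, can be illustrated with a minimal sketch. This is not the paper's lattice/WFST-intersection implementation; it replaces the lattice with explicit enumeration over a hypothetical toy lexicon and a hand-written word-bigram table, both of which are assumptions for illustration only.

```python
import math

# Hypothetical toy lexicon and word-bigram log-probabilities (assumptions,
# not from the paper). The paper intersects an on-the-fly word lattice with
# a word N-gram LM; here we enumerate segmentations explicitly instead.
LEXICON = {"我们", "我", "们", "喜欢", "喜", "欢", "学习", "学", "习"}
BIGRAM_LOGP = {
    ("<s>", "我们"): -0.5, ("我们", "喜欢"): -0.7, ("喜欢", "学习"): -0.6,
    ("<s>", "我"): -1.5, ("我", "们"): -2.0,
}
UNK_LOGP = -5.0  # flat back-off score for unseen bigrams (assumption)

def segmentations(chars):
    """Enumerate every word sequence consistent with a character sequence."""
    if not chars:
        yield []
        return
    for i in range(1, len(chars) + 1):
        word = chars[:i]
        if word in LEXICON:
            for rest in segmentations(chars[i:]):
                yield [word] + rest

def lm_score(chars):
    """Best word-bigram log-prob over all segmentations (lattice best path)."""
    best = -math.inf
    for seg in segmentations(chars):
        score, prev = 0.0, "<s>"
        for w in seg:
            score += BIGRAM_LOGP.get((prev, w), UNK_LOGP)
            prev = w
        best = max(best, score)
    return best
```

In decoding, this word-level score would be interpolated with the E2E model's subword-level score for each partial hypothesis; the lattice formulation in the paper makes the same computation incremental rather than enumerative.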
Similar Papers
Improving End-to-End Speech Recognition with Policy Learning
Connectionist temporal classification (CTC) is widely used for maximum likelihood learning in end-to-end speech recognition models. However, there is usually a disparity between the negative maximum likelihood and the performance metric used in speech recognition, e.g., word error rate (WER). This results in a mismatch between the objective function and metric during training. We show that the ...
Multilingual Speech Recognition With A Single End-To-End Model
Training a conventional automatic speech recognition (ASR) system to support multiple languages is challenging because the subword unit, lexicon and word inventories are typically language specific. In contrast, sequence-to-sequence models are well suited for multilingual ASR because they encapsulate an acoustic, pronunciation and language model jointly in a single network. In this work we pres...
Towards Language-Universal End-to-End Speech Recognition
Building speech recognizers in multiple languages typically involves replicating a monolingual training recipe for each language, or utilizing a multi-task learning approach where models for different languages have separate output labels but share some internal parameters. In this work, we exploit recent progress in end-to-end speech recognition to create a single multilingual speech recogniti...
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech–two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our a...
End-to-end Audiovisual Speech Recognition
Several end-to-end deep learning approaches have been recently presented which extract either audio or visual features from the input images or audio signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. In this work, we present an end-toend audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the ...
Journal: IEEE Signal Processing Letters
Year: 2022
ISSN: 1558-2361, 1070-9908
DOI: https://doi.org/10.1109/lsp.2022.3154241