A hybrid language model for open-vocabulary Thai LVCSR
نویسندگان
چکیده
This paper investigates the use of a hybrid language model for open-vocabulary Thai LVCSR. Thai text is written without word boundary markers and the definition of word unit is often ambiguous due to the presence of compound words. Hence, to build open-vocabulary LVCSR, a very large lexicon is required to also handle word unit ambiguity. Pseudomorpheme (PM), a syllable-like sub-word unit specifically designed for Thai is considered to be a more well-defined unit. To overcome the problem of out-of-vocabulary words and to also reduce the size of the lexicon, a hybrid language model which combines word and sub-word units is proposed. Words and sub-words frequently found in several domains constitute open-vocabulary for general domain Thai LVCSR. To verify our scheme, we run recognition experiments on data from various tasks including broadcast news transcription, dictation and mobile speech-to-speech translation. Open-vocabulary Thai LVCSR using the hybrid language model obviously reduces the out-of-vocabulary problem. The proposed model having a much smaller lexicon size achieves a comparable recognition error rate to a baseline system using a full-word lexicon.
منابع مشابه
A hybrid input-type recurrent neural network for LVCSR language modeling
Substantial amounts of resources are usually required to robustly develop a language model for an open vocabulary speech recognition system as out-of-vocabulary (OOV) words can hurt recognition accuracy. In this work, we applied a hybrid lexicon of word and sub-word units to resolve the problem of OOV words in a resource-efficient way. As sub-lexical units can be combined to form new words, a c...
متن کاملLexical units for Thai LVCSR
Traditional language models rely on lexical units that are de ned as entities separated from each other by word boundary markers. Since there are no such boundaries in Thai, alternative de nitions of lexical units have to be pursued. The problem is to nd the optimal set of lexical units that constitutes the vocabulary of the language model and yields the best nal result. The word is a tradition...
متن کاملHybrid Language Models Using Mixed Types of Sub-Lexical Units for Open Vocabulary German LVCSR
German is a highly inflected language with a large number of words derived from the same root. It makes use of a high degree of word compounding leading to high Out-of-vocabulary (OOV) rates, and Language Model (LM) perplexities. For such languages the use of sub-lexical units for Large Vocabulary Continuous Speech Recognition (LVCSR) becomes a natural choice. In this paper, we investigate the ...
متن کاملInvestigation of Maximum Entropy Hybrid Language Models for Open Vocabulary German and Polish LVCSR
For languages like German and Polish, higher numbers of word inflections lead to high out-of-vocabulary (OOV) rates and high language model (LM) perplexities. Thus, one of the main challenges in large vocabulary continuous speech recognition (LVCSR) is recognizing an open vocabulary. In this paper, we investigate the use of mixed type of sub-word units in the same recognition lexicon. Namely, m...
متن کاملImprovements in RWTH LVCSR evaluation systems for Polish, Portuguese, English, urdu, and Arabic
In this work, Portuguese, Polish, English, Urdu, and Arabic automatic speech recognition evaluation systems developed by the RWTH Aachen University are presented. Our LVCSR systems focus on various domains like broadcast news, spontaneous speech, and podcasts. All these systems but Urdu are used for Euronews and Skynews evaluations as part of the EUBridge project. Our previously developed LVCSR...
متن کامل