A hybrid language model for open-vocabulary Thai LVCSR

Authors

Kwanchiva Thangthai

Ananlada Chotimongkol

Chai Wutiwiwatchai

Abstract

This paper investigates the use of a hybrid language model for open-vocabulary Thai large vocabulary continuous speech recognition (LVCSR). Thai text is written without word boundary markers, and the definition of a word unit is often ambiguous because of compound words. Hence, building an open-vocabulary LVCSR system requires a very large lexicon that also handles word-unit ambiguity. The pseudomorpheme (PM), a syllable-like sub-word unit designed specifically for Thai, is considered a better-defined unit. To overcome the out-of-vocabulary problem and also reduce the size of the lexicon, a hybrid language model that combines word and sub-word units is proposed. Words and sub-words frequently found in several domains constitute the open vocabulary for general-domain Thai LVCSR. To verify the scheme, we run recognition experiments on data from various tasks, including broadcast news transcription, dictation, and mobile speech-to-speech translation. Open-vocabulary Thai LVCSR using the hybrid language model clearly reduces the out-of-vocabulary problem. With a much smaller lexicon, the proposed model achieves a recognition error rate comparable to that of a baseline system using a full-word lexicon.
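The abstract does not say how the hybrid vocabulary is assembled, so the following is only a minimal sketch of the general idea under stated assumptions: frequent words are kept as whole-word entries, the remaining words are broken into sub-word units, and an unseen word can then be covered by units already in the lexicon. The frequency cutoff `max_words`, the `+` marker, the function names, and the hyphen-based `toy_segment` function are all hypothetical; the romanized toy tokens stand in for real Thai text and are not the paper's pseudomorpheme segmentation.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set, Tuple

SUBWORD_MARK = "+"  # hypothetical prefix that tags sub-word entries in the lexicon

Segmenter = Callable[[str], List[str]]


def build_hybrid_vocab(
    corpus: Iterable[List[str]], segment: Segmenter, max_words: int
) -> Tuple[Set[str], Set[str]]:
    """Keep the most frequent words whole; decompose the rest into sub-word units."""
    counts = Counter(w for sentence in corpus for w in sentence)
    words = {w for w, _ in counts.most_common(max_words)}
    subwords: Set[str] = set()
    for w in counts:
        if w not in words:
            subwords.update(SUBWORD_MARK + u for u in segment(w))
    return words, subwords


def to_hybrid_tokens(sentence: List[str], words: Set[str], segment: Segmenter) -> List[str]:
    """Rewrite a sentence so that words outside the word list become sub-word tokens."""
    tokens: List[str] = []
    for w in sentence:
        if w in words:
            tokens.append(w)
        else:
            tokens.extend(SUBWORD_MARK + u for u in segment(w))
    return tokens


def oov_rate(tokens: List[str], vocab: Set[str]) -> float:
    """Fraction of tokens that are not covered by the vocabulary."""
    if not tokens:
        return 0.0
    return sum(t not in vocab for t in tokens) / len(tokens)


def toy_segment(word: str) -> List[str]:
    """Assumed stand-in segmenter: toy words carry explicit hyphens between units."""
    return word.split("-")


# Toy training data (romanized placeholders, not real Thai text).
train = [["pai", "ban"], ["pai", "rong-rian"], ["pai", "ban", "rong-ngan"]]
test_sentence = ["pai", "rian-ngan"]  # "rian-ngan" never occurs in training

full_word_vocab = {w for s in train for w in s}
words, subwords = build_hybrid_vocab(train, toy_segment, max_words=2)
hybrid_tokens = to_hybrid_tokens(test_sentence, words, toy_segment)

word_oov = oov_rate(test_sentence, full_word_vocab)
hybrid_oov = oov_rate(hybrid_tokens, words | subwords)
```

In this toy setup the unseen word is out of vocabulary for the full-word lexicon, while the hybrid lexicon covers it through sub-word units observed in other words. A language model would then be trained on the rewritten token sequences; that step is omitted here.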

Similar resources

A hybrid input-type recurrent neural network for LVCSR language modeling

Substantial amounts of resources are usually required to robustly develop a language model for an open vocabulary speech recognition system as out-of-vocabulary (OOV) words can hurt recognition accuracy. In this work, we applied a hybrid lexicon of word and sub-word units to resolve the problem of OOV words in a resource-efficient way. As sub-lexical units can be combined to form new words, a c...

Lexical units for Thai LVCSR

Traditional language models rely on lexical units that are defined as entities separated from each other by word boundary markers. Since there are no such boundaries in Thai, alternative definitions of lexical units have to be pursued. The problem is to find the optimal set of lexical units that constitutes the vocabulary of the language model and yields the best final result. The word is a tradition...

Hybrid Language Models Using Mixed Types of Sub-Lexical Units for Open Vocabulary German LVCSR

German is a highly inflected language with a large number of words derived from the same root. It makes use of a high degree of word compounding leading to high Out-of-vocabulary (OOV) rates, and Language Model (LM) perplexities. For such languages the use of sub-lexical units for Large Vocabulary Continuous Speech Recognition (LVCSR) becomes a natural choice. In this paper, we investigate the ...

Investigation of Maximum Entropy Hybrid Language Models for Open Vocabulary German and Polish LVCSR

For languages like German and Polish, higher numbers of word inflections lead to high out-of-vocabulary (OOV) rates and high language model (LM) perplexities. Thus, one of the main challenges in large vocabulary continuous speech recognition (LVCSR) is recognizing an open vocabulary. In this paper, we investigate the use of mixed type of sub-word units in the same recognition lexicon. Namely, m...

Improvements in RWTH LVCSR evaluation systems for Polish, Portuguese, English, Urdu, and Arabic

In this work, Portuguese, Polish, English, Urdu, and Arabic automatic speech recognition evaluation systems developed by the RWTH Aachen University are presented. Our LVCSR systems focus on various domains like broadcast news, spontaneous speech, and podcasts. All these systems but Urdu are used for Euronews and Skynews evaluations as part of the EUBridge project. Our previously developed LVCSR...

Publication date: 2013
