Class-Based N-Gram Language Model for New Words Using Out-of-Vocabulary to In-Vocabulary Similarity
Abstract
Out-of-vocabulary (OOV) words create serious problems for automatic speech recognition (ASR) systems. Not only are they misrecognized as in-vocabulary (IV) words with similar phonetics, but the error also causes further errors in nearby words. Language models (LMs) for most open-vocabulary ASR systems treat OOV words as a single entity, ignoring their linguistic information. In this paper we present a class-based n-gram LM that handles each OOV word individually, without retraining all the LM parameters. Each OOV word is assigned to an IV class, where a class groups IV words with similar semantic meanings. The World Wide Web is used to acquire additional data for finding the relation between the OOV and IV words. An evaluation based on adjusted perplexity and word error rate was carried out on the Wall Street Journal corpus. The results suggest that using multiple classes for OOV words is preferable to using a single unknown class.
Key words: out-of-vocabulary, class-based n-gram, language model, adjusted perplexity, speech recognition
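The idea of giving each OOV word its own IV class can be illustrated with a toy class-based bigram model. This is a minimal sketch under stated assumptions, not the paper's implementation: the class names, the similarity scores (which the paper derives from web data), the function names, and the fixed `oov_mass` share given to a new word are all hypothetical. No smoothing is applied, and every IV word is assumed to occur in the training text.

```python
from collections import Counter


def train(sentences, word2class):
    """Collect class-bigram and word-within-class counts from tokenized text."""
    class_bigrams, class_history = Counter(), Counter()
    word_counts, class_totals = Counter(), Counter()
    for sentence in sentences:
        classes = ["<s>"] + [word2class[w] for w in sentence]
        for prev_c, cur_c in zip(classes, classes[1:]):
            class_bigrams[(prev_c, cur_c)] += 1
            class_history[prev_c] += 1
        for w in sentence:
            word_counts[w] += 1
            class_totals[word2class[w]] += 1
    return class_bigrams, class_history, word_counts, class_totals


def assign_oov_class(similarity_to_iv, word2class):
    """Put an OOV word in the class of its most similar IV word."""
    best_iv = max(similarity_to_iv, key=similarity_to_iv.get)
    return word2class[best_iv]


def bigram_prob(word, prev_class, model, word2class, oov2class, oov_mass=0.1):
    """P(word | prev_class) = P(class | prev_class) * P(word | class)."""
    class_bigrams, class_history, word_counts, class_totals = model
    if word in word2class:
        c = word2class[word]
        emission = (1.0 - oov_mass) * word_counts[word] / class_totals[c]
    else:
        c = oov2class[word]
        # Assumed fixed share per new word; this toy choice is only
        # normalized when a class receives a single OOV word.
        emission = oov_mass
    if class_history[prev_class] == 0:
        return 0.0  # unseen history; a real model would back off or smooth
    transition = class_bigrams[(prev_class, c)] / class_history[prev_class]
    return transition * emission


# Toy data: two hypothetical classes and one new word, "equities".
sentences = [["stocks", "rose"], ["shares", "fell"], ["stocks", "fell"]]
word2class = {"stocks": "FIN_NOUN", "shares": "FIN_NOUN",
              "rose": "MOVE_VERB", "fell": "MOVE_VERB"}
model = train(sentences, word2class)
oov2class = {"equities": assign_oov_class({"stocks": 0.9, "rose": 0.1}, word2class)}
p_oov = bigram_prob("equities", "<s>", model, word2class, oov2class)
p_iv = bigram_prob("stocks", "<s>", model, word2class, oov2class)
```

Only the OOV-to-class mapping changes when a new word is added, so the class-bigram counts are left untouched. That is the sense in which the LM parameters need no retraining.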
Similar resources
Multi Class-based n-gram Language Model for New Words Using Web Data
Out-of-vocabulary (OOV) words cause a serious problem for automatic speech recognition (ASR) systems. Not only are they misrecognized as in-vocabulary words with similar phonetics, but the error also causes errors in nearby words. Language models (LMs) for most open-vocabulary ASR systems treat OOV words as one entity, ignoring the linguistic information. In this paper we pres...
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieving this archive, is one of the key issues for this broadcasting...
A hierarchical language model incorporating class-dependent word models for OOV words recognition
A new language model is proposed to cope with the demands for recognizing out-of-vocabulary (OOV) words not registered in the lexicon. This language model is a class N-gram incorporating a set of word models that reflect the statistical characteristics of the phonotactics, which depend on the lexical classes. Utilization of class-dependency enhances recognition accuracy and enables identificati...
Online Persian handwriting recognition using a language model and reduction of user writing rules
The joined-up, cursive form of Persian words, the immense variety of Persian scripts, and the different shapes that Persian letters take depending on their position in a word have made Persian handwriting recognition an intense challenge. The major obstacle of most recognition methods is their inattention to sentence context, which causes the use of a word with correct appea...
The Comparison of Computer Assisted Teaching and Traditional Explicit Method in Learning / Teaching English Vocabulary.
This review surveys research on second language vocabulary teaching and learning since 1999. It first considers the distinction between incidental and intentional vocabulary learning. Although learners certainly acquire word knowledge incidentally while engaged in various language learning activities, more direct and systematic study of vocabulary is also required. There is a discussion of how word...
Journal title:
- IEICE Transactions
Volume 95-D
Publication date: 2012