Multiple text segmentation for statistical language modeling
Abstract
In this article we address the text segmentation problem in statistical language modeling for under-resourced languages whose writing systems have no word boundary delimiters. While the lack of text resources has a negative impact on the performance of language models, the errors introduced by automatic word segmentation make those data even less usable. To better exploit the text resources, we propose a method based on weighted finite state transducers that estimates the N-gram language model from a training corpus in which each sentence is segmented in multiple ways instead of receiving a single, unique segmentation. Multiple segmentation generates more N-grams from the training corpus and makes it possible to obtain N-grams that are not found with unique segmentation. We use this approach to train language models for automatic speech recognition systems for Khmer and Vietnamese, and multiple segmentation leads to better performance than the unique segmentation approach.
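The effect described in the abstract, that segmenting a sentence in several ways yields N-grams a single segmentation misses, can be illustrated with a small sketch. This is not the weighted finite state transducer method the article proposes: it swaps in brute-force enumeration over a toy vocabulary, and the function names (`segmentations`, `ngram_counts`), the vocabulary, and the uniform weighting of each segmentation are all assumptions made for illustration only.

```python
from collections import Counter
from functools import lru_cache


def segmentations(text, vocab):
    """Return every way to split `text` into words drawn from `vocab`."""
    max_len = max(map(len, vocab))

    @lru_cache(maxsize=None)
    def split_from(start):
        # Base case: the whole string has been consumed.
        if start == len(text):
            return [()]
        results = []
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            if word in vocab:
                results.extend((word,) + rest for rest in split_from(end))
        return results

    return [list(seg) for seg in split_from(0)]


def ngram_counts(segs, n=2):
    """Count n-grams over the segmentations, each weighted 1/len(segs).

    The uniform weight is an assumption of this sketch, chosen so that
    one sentence contributes a total mass of 1 per n-gram position.
    """
    counts = Counter()
    if not segs:
        return counts
    weight = 1.0 / len(segs)
    for seg in segs:
        padded = ["<s>"] * (n - 1) + seg + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += weight
    return counts


# Hypothetical toy vocabulary; "abc" stands in for an unsegmented sentence.
vocab = {"a", "b", "ab", "c"}
segs = segmentations("abc", vocab)

multi = ngram_counts(segs, n=2)       # all segmentations
single = ngram_counts(segs[:1], n=2)  # one segmentation only

# Bigrams that only appear when multiple segmentations are used.
extra = set(multi) - set(single)
```

Here `extra` holds the bigrams contributed only by the alternative segmentation, which is the additional N-gram coverage the abstract refers to. Exhaustive enumeration grows exponentially with sentence length, which is one reason a compact lattice representation such as a transducer is attractive in practice.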
Similar resources
Text classification in Asian languages without word segmentation
We present a simple approach for Asian language text classification without word segmentation, based on statistical n-gram language modeling. In particular, we examine Chinese and Japanese text classification. With character n-gram models, our approach avoids word segmentation. However, unlike traditional ad hoc n-gram models, the statistical language modeling based approach has strong information...
Vers une modélisation statistique multi-niveau du langage, application aux langues peu dotées. (Toward a multi-level statistical language modeling for under-resourced language)
This PhD thesis focuses on the problems encountered when developing automatic speech recognition for under-resourced languages with a writing system without explicit separation between words. The specificity of the languages covered in our work requires automatic segmentation of text corpus into words in order to make the n-gram language modeling applicable. While the lack of text data has an i...
Integration of Multiple Bilingually-Learned Segmentation Schemes into Statistical Machine Translation
This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous source language text in order to improve the translation quality of statistical machine translation (SMT) approaches. The method can be applied to any language pair where the source language is unsegmented and the target language segmentation is known. First, an iterative bootstrap meth...
Towards better language modeling for Thai LVCSR
One of the difficulties of Thai language modeling is the process of text corpus preparation. Because there is no explicit word boundary marker in written Thai text, word segmentation must be performed prior to training a language model. This paper presents two approaches to language model construction for Thai LVCSR based on pseudo-morpheme merging. The first approach merges pseudo-morphemes us...
Sequence segmentation for statistical machine translation
In the last decade, while statistical machine translation has advanced significantly, there is still much room for further improvements relating to many natural language processing tasks such as word segmentation, word alignment and parsing. Human language is composed of sequences of meaningful units. These sequences can be words, phrases, sentences or even articles serving as basic elements in...