Richer Interpolative Smoothing Based on Modified Kneser-Ney Language Modeling
Abstract
In this work we present a generalisation of Modified Kneser-Ney interpolative smoothing that enables richer smoothing via additional discount parameters. We provide mathematical underpinning for the estimator of the new discount parameters, and showcase the utility of our rich MKN language models on several European languages. We further explore the interdependency among training data size, language model order, and the number of discount parameters. Our empirical results illustrate that a larger number of discount parameters (i) allows for better allocation of mass in the smoothing process, particularly in the small data regime where statistical sparsity is severe, and (ii) leads to significant reductions in perplexity, particularly for out-of-domain test sets, which introduce a higher ratio of out-of-vocabulary words.
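To make the idea concrete, below is a minimal sketch of how discount parameters can be estimated from count-of-counts statistics. It follows the standard Chen & Goodman (1999) closed-form estimator, D_i = i - (i + 1) * Y * n_{i+1} / n_i with Y = n_1 / (n_1 + 2 * n_2), extended to an arbitrary number of discount levels in the spirit of the paper; the paper's exact estimator for the additional discounts may differ, and the function and fallback choices here are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def mkn_discounts(ngram_counts, num_discounts=3):
    """Estimate Kneser-Ney-style discounts D_1..D_k from count-of-counts.

    Uses the Chen & Goodman closed-form estimator,
        D_i = i - (i + 1) * Y * n_{i+1} / n_i,  Y = n_1 / (n_1 + 2 * n_2),
    where n_k is the number of distinct n-grams seen exactly k times.
    num_discounts=3 recovers standard Modified Kneser-Ney; larger values
    sketch the richer parameterisation discussed in the abstract.
    """
    # n[k] = number of distinct n-grams occurring exactly k times
    n = Counter(ngram_counts.values())
    Y = n[1] / (n[1] + 2 * n[2])
    discounts = []
    for i in range(1, num_discounts + 1):
        if n[i] == 0:
            # Sparse count-of-counts: fall back to plain absolute
            # discounting (a hypothetical choice for this sketch).
            discounts.append(float(i))
            continue
        discounts.append(i - (i + 1) * Y * n[i + 1] / n[i])
    return discounts

# Usage: trigram counts from a toy corpus
counts = Counter({("a", "b", "c"): 1, ("b", "c", "d"): 2,
                  ("c", "d", "e"): 1, ("d", "e", "f"): 3,
                  ("e", "f", "g"): 1, ("f", "g", "h"): 4})
print(mkn_discounts(counts, num_discounts=3))
```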
Similar Resources
Sub-word Based Language Modeling for Amharic
This paper presents sub-word based language models for Amharic, a morphologically rich and under-resourced language. The language models have been developed (using the open-source language modeling toolkit SRILM) with different n-gram orders (2 to 5) and smoothing techniques. Among the developed models, the best performing one is a 5-gram model with modified Kneser-Ney smoothing and with interpola...
Language Modeling with Power Low Rank Ensembles
We present power low rank ensembles (PLRE), a flexible framework for n-gram language modeling where ensembles of low rank matrices and tensors are used to obtain smoothed probability estimates of words in context. Our method can be understood as a generalization of n-gram modeling to non-integer n, and includes standard techniques such as absolute discounting and Kneser-Ney smoothing as special ...
Improved Smoothing for N-gram Language Models Based on Ordinary Counts
Kneser-Ney (1995) smoothing and its variants are generally recognized as having the best perplexity of any known method for estimating N-gram language models. Kneser-Ney smoothing, however, requires nonstandard N-gram counts for the lower-order models used to smooth the highest-order model. For some applications, this makes Kneser-Ney smoothing inappropriate or inconvenient. In this paper, we int...
Smoothing a Tera-word Language Model
Frequency counts from very large corpora, such as the Web 1T dataset, have recently become available for language modeling. Omission of low frequency n-gram counts is a practical necessity for datasets of this size. Naive implementations of standard smoothing methods do not realize the full potential of such large datasets with missing counts. In this paper I present a new smoothing algorithm t...
A Hierarchical Bayesian Language Model Based On Pitman-Yor Processes
We propose a new hierarchical Bayesian n-gram model of natural languages. Our model makes use of a generalization of the commonly used Dirichlet distributions called Pitman-Yor processes which produce power-law distributions more closely resembling those in natural languages. We show that an approximation to the hierarchical Pitman-Yor language model recovers the exact formulation of interpolat...