A PAC-Bayesian Approach to Minimum Perplexity Language Modeling

نویسندگان

  • Sujeeth Bharadwaj
  • Mark Hasegawa-Johnson
چکیده

Despite the overwhelming use of statistical language models in speech recognition, machine translation, and several other domains, few high probability guarantees exist on their generalization error. In this paper, we bound the test set perplexity of two popular language models – the n-gram model and class-based n-grams – using PAC-Bayesian theorems for unsupervised learning. We extend the bound to sequence clustering, wherein classes represent longer context such as phrases. The new bound is dominated by the maximum number of sequences represented by each cluster, which is polynomial in the vocabulary size. We show that we can still encourage small sample generalization by sparsifying the cluster assignment probabilities. We incorporate our bound into an efficient HMM-based sequence clustering algorithm and validate the theory with empirical results on the resource management corpus.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Automatic Selection of t-SNE Perplexity

t-Distributed Stochastic Neighbor Embedding (t-SNE) is one of the most widely used dimensionality reduction methods for data visualization, but it has a perplexity hyperparameter that requires manual selection. In practice, proper tuning of t-SNE perplexity requires users to understand the inner working of the method as well as to have hands-on experience. We propose a model selection objective...

متن کامل

Modeling Topic and Role Information in Meetings Using the Hierarchical Dirichlet Process

In this paper, we address the modeling of topic and role information in multiparty meetings, via a nonparametric Bayesian model called the hierarchical Dirichlet process. This model provides a powerful solution to topic modeling and a flexible framework for the incorporation of other cues such as speaker role information. We present our modeling framework for topic and role on the AMI Meeting C...

متن کامل

Articulatory feature-based pronunciation modeling

Spoken language, especially conversational speech, is characterized by great variability in word pronunciation, including many variants that differ grossly from dictionary prototypes. This is one factor in the poor performance of automatic speech recognizers on conversational speech, and it has been very difficult to mitigate in traditional phonebased approaches to speech recognition. An altern...

متن کامل

Interpolated Dirichlet Class Language Model for Speech Recognition Incorporating Long-distance N-grams

We propose a language modeling (LM) approach incorporating interpolated distanced n-grams in a Dirichlet class language model (DCLM) (Chien and Chueh, 2011) for speech recognition. The DCLM relaxes the bag-of-words assumption and documents topic extraction of latent Dirichlet allocation (LDA). The latent variable of DCLM reflects the class information of an n-gram event rather than the topic in...

متن کامل

A parallel training algorithm for hierarchical pitman-yor process language models

The Hierarchical Pitman Yor Process Language Model (HPYLM) is a Bayesian language model based on a nonparametric prior, the Pitman-Yor Process. It has been demonstrated, both theoretically and practically, that the HPYLM can provide better smoothing for language modeling, compared with state-of-the-art approaches such as interpolated KneserNey and modified Kneser-Ney smoothing. However, estimat...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014