Word clustering for a word bi-gram model

نویسندگان

  • Shinsuke Mori
  • Masafumi Nishimura
  • Nobuyasu Itoh
چکیده

In this paper we describe a word clustering method for class-based n-gram model. The measurement for clustering is the entropy on a corpus di erent from the corpus for n-gram model estimation. The search method is based on the greedy algorithm. We applied this method to a Japanese EDR corpus and English Penn Treebank corpus. The perplexities of word-based n-gram model on EDR corpus and Penn Treebank are 153.1 and 203.5 respectively. And Those of class-based n-gram model, estimated through our method, are 146.4 and 136.0 respectively. The result tells us that our clustering methods is better than the Brown's method and the Ney's method called leaving-one-out.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Enhanced Word Classing for Recurrent Neural Network Language

Recurrent Neural Network Language Model (RNNLM) has recently been shown to outperform conventional N-gram LM as well as many other competing advanced language model techniques. However, the computation complexity of RNNLM is much higher than the conventional N-gram LM. As a result, the Class-based RNNLM (CRNNLM) is usually employed to speed up both the training and testing phase of RNNLM. In pr...

متن کامل

Using Collocations and K-means Clustering to Improve the N-pos Model for Japanese IME

Kana-Kanji conversion is known as one of the representative applications of Natural Language Processing (NLP) for the Japanese language. The N-pos model, presenting the probability of a Kanji candidate sequence by the product of bi-gram Part-of-Speech (POS) probabilities and POS-to-word emission probabilities, has been successfully applied in a number of well-known Japanese Input Method Editor ...

متن کامل

Bangla Word Clustering Based on Tri-gram, 4-gram and 5-gram Language Model

 SUST, ICERIE. Abstract: — In this paper, we describe a research method that generates Bangla word clusters on the basis of relating to meaning in language and contextual similarity. The importance of word clustering is in parts of speech (POS) tagging, word sense disambiguation, text classification, recommender system, spell checker, grammar checker, knowledge discover and for many others Nat...

متن کامل

یک مدل موضوعی احتمالاتی مبتنی بر روابط محلّی واژگان در پنجره‌های هم‌پوشان

A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process, given the documents and extract topics. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...

متن کامل

Word Clustering Using Word Embedding Generated by Neural Net-based Skip Gram

This paper proposes word clustering using word embedding. We used a neural net-based continuous skip-gram method for generating word embedding in continuous space. The proposed word clustering method represents each word in the vector space using a neural network. The K-means clustering method partitions word embedding into predetermined K-word

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1998