A Method of Accounting Bigrams in Topic Models
Authors
Abstract
The paper describes the results of an empirical study of integrating bigram collocations, and similarities between them and unigrams, into topic models. First, we propose a novel algorithm, PLSA-SIM, which is a modification of the original PLSA algorithm. It incorporates bigrams and maintains relationships between unigrams and bigrams based on their component structure. We then analyze a variety of word association measures in order to integrate top-ranked bigrams into topic models. All experiments were conducted on four text collections of different domains and languages. The experiments distinguish a subgroup of the tested measures that produce top-ranked bigrams which, when integrated into the PLSA-SIM algorithm, significantly improve topic model quality for all collections.
Similar papers
Topic Models: Accounting Component Structure of Bigrams
The paper describes the results of an empirical study of integrating bigram collocations and similarities between them and unigrams into topic models. First of all, we propose a novel algorithm PLSA-SIM that is a modification of the original algorithm PLSA. It incorporates bigrams and maintains relationships between unigrams and bigrams based on their component structure. Then we analyze a varie...
Bigram Anchor Words Topic Model
A probabilistic topic model is a modern statistical tool for document collection analysis that allows extracting a number of topics in the collection and describes each document as a discrete probability distribution over topics. Classical approaches to statistical topic modeling can be quite effective in various tasks, but the generated topics may be too similar to each other or poorly interpr...
Lau, Jey Han, David Newman and Timothy Baldwin (to appear) On Collocations and Topic Models, ACM Transactions on Speech and Language Processing
We investigate the impact of pre-extracting and tokenising bigram collocations on topic models. Using extensive experiments on four different corpora, we show that incorporating bigram collocations in the document representation creates more parsimonious models and improves topic coherence. We point out some problems in interpreting test likelihood and test perplexity to compare model fit, and ...
Тематические модели: учет сходства между униграммами и биграммами (Topic Models: Taking into Account Similarity Between Unigrams and Bigrams)
Compact Representations of Word Location Independence in Connectionist Models
We studied representations built in Cascade-Correlation (Cascor) connectionist models, using a modified encoder task in which networks learn to reproduce four-letter strings of characters and words in a location-independent fashion. We found that Cascor successfully encodes input patterns onto a smaller set of hidden units. Cascor learned simultaneously regularities related to word structure (“wo...