A Method of Accounting Bigrams in Topic Models
Authors
Abstract
The paper describes the results of an empirical study of integrating bigram collocations, and similarities between them and unigrams, into topic models. First, we propose a novel algorithm, PLSA-SIM, which is a modification of the original PLSA algorithm. It incorporates bigrams and maintains relationships between unigrams and bigrams based on their component structure. We then analyze a variety of word association measures in order to integrate top-ranked bigrams into topic models. All experiments were conducted on four text collections of different domains and languages. The experiments distinguish a subgroup of the tested measures that produce top-ranked bigrams which, when integrated into the PLSA-SIM algorithm, significantly improve topic model quality for all collections.
Similar papers
Topic Models: Accounting Component Structure of Bigrams
The paper describes the results of an empirical study of integrating bigram collocations and similarities between them and unigrams into topic models. First of all, we propose a novel algorithm PLSA-SIM that is a modification of the original algorithm PLSA. It incorporates bigrams and maintains relationships between unigrams and bigrams based on their component structure. Then we analyze a varie...
Bigram Anchor Words Topic Model
A probabilistic topic model is a modern statistical tool for document collection analysis that allows extracting a number of topics in the collection and describes each document as a discrete probability distribution over topics. Classical approaches to statistical topic modeling can be quite effective in various tasks, but the generated topics may be too similar to each other or poorly interpr...
Lau, Jey Han, David Newman and Timothy Baldwin (to appear) On Collocations and Topic Models, ACM Transactions on Speech and Language Processing
We investigate the impact of pre-extracting and tokenising bigram collocations on topic models. Using extensive experiments on four different corpora, we show that incorporating bigram collocations in the document representation creates more parsimonious models and improves topic coherence. We point out some problems in interpreting test likelihood and test perplexity to compare model fit, and ...
Тематические модели: учет сходства между униграммами и биграммами (Topic Models: Taking into Account Similarity Between Unigrams and Bigrams)
Compact Representations of Word Location Independence in Connectionist Models
We studied representations built in Cascade-Correlation (Cascor) connectionist models, using a modified encoder task in which networks learn to reproduce four-letter strings of characters and words in a location-independent fashion. We found that Cascor successfully encodes input patterns onto a smaller set of hidden units. Cascor learned simultaneously regularities related to word structure (“wo...