Latent Tree Models for Hierarchical Topic Detection
Authors
Nevin L. Zhang (corresponding author, [email protected])
Abstract
We present a novel method for hierarchical topic detection where topics are obtained by clustering documents in multiple ways. Specifically, we model document collections using a class of graphical models called hierarchical latent tree models (HLTMs). The variables at the bottom level of an HLTM are observed binary variables that represent the presence/absence of words in a document. The variables at other levels are binary latent variables, with those at the lowest latent level representing word co-occurrence patterns and those at higher levels representing co-occurrence of patterns at the level below. Each latent variable gives a soft partition of the documents, and document clusters in the partitions are interpreted as topics. Latent variables at high levels of the hierarchy capture long-range word co-occurrence patterns and hence give thematically more general topics, while those at low levels of the hierarchy capture short-range word co-occurrence patterns and give thematically more specific topics. Unlike LDA-based topic models, HLTMs do not refer to a document generation process and use word variables instead of token variables. They use a tree structure to model the relationships between topics and words, which is conducive to the discovery of meaningful topics and topic hierarchies.
arXiv:1605.06650v2 [cs.CL], 21 December 2016
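To make the input representation concrete, the following sketch (a toy two-document corpus invented for illustration; it is not taken from the paper) turns documents into the binary word-presence vectors that form the bottom, observed level of an HLTM. Each word in the vocabulary becomes a 0/1 variable indicating whether it occurs in the document, regardless of how many times.

# Minimal sketch of the presence/absence document representation described above.
# The corpus and vocabulary here are illustrative assumptions, not data from the paper.
docs = [
    "latent tree models for hierarchical topic detection",
    "topic models cluster documents by word co-occurrence patterns",
]

# Build the vocabulary, then one binary (presence/absence) vector per document.
vocab = sorted({word for doc in docs for word in doc.split()})
binary_vectors = [
    [1 if word in set(doc.split()) else 0 for word in vocab]
    for doc in docs
]

for doc, vec in zip(docs, binary_vectors):
    print(doc)
    print(dict(zip(vocab, vec)))  # each word is a 0/1 variable for this document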
Similar Resources
Document Generation with Hierarchical Latent Tree Models
In most probabilistic topic models, a document is viewed as a collection of tokens and each token is a variable whose values are all the words in a vocabulary. One exception is the class of hierarchical latent tree models (HLTMs), where a document is viewed as a binary vector over the vocabulary and each word is regarded as a binary variable. The use of word variables allows the detection and representation...
Progressive EM for Latent Tree Models and Hierarchical Topic Detection
Hierarchical latent tree analysis (HLTA) was recently proposed as a new method for topic detection. It differs fundamentally from LDA-based methods in terms of topic definition, topic-document relationship, and learning method. It has been shown to discover significantly more coherent topics and better topic hierarchies. However, HLTA relies on the Expectation-Maximization (EM) algorithm for...
Latent Tree Analysis
Latent tree analysis seeks to model the correlations among a set of random variables using a tree of latent variables. It was proposed as an improvement to latent class analysis — a method widely used in social sciences and medicine to identify homogeneous subgroups in a population. It provides new and fruitful perspectives on a number of machine learning areas, including cluster analysis, topi...
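As an illustration of what "a tree of latent variables" over observed variables means, here is a minimal sketch. The structure (one root latent variable with two latent children, each emitting two binary word variables) and all probabilities are invented for the example; they do not come from any of the papers above. The sketch computes the probability of an observed word vector by summing out the latent states, which is the basic inference step such models support.

# Toy latent tree: root Z0 -> latent children Z1, Z2 -> binary word leaves.
# All variables are binary; parameters are illustrative assumptions only.
from itertools import product

p_z0 = {0: 0.5, 1: 0.5}  # P(Z0)
# P(child latent state | Z0), indexed as [parent state][child state]
p_z_given_z0 = {
    "Z1": {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}},
    "Z2": {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}},
}
# Each word is attached to one latent parent; table gives P(word present | parent state)
p_word_given_z = {
    "learning": ("Z1", {0: 0.05, 1: 0.85}),
    "network":  ("Z1", {0: 0.10, 1: 0.70}),
    "market":   ("Z2", {0: 0.05, 1: 0.80}),
    "stock":    ("Z2", {0: 0.05, 1: 0.75}),
}

def word_vector_prob(doc):
    """P(observed 0/1 word vector), summing over the latent variables Z0, Z1, Z2."""
    total = 0.0
    for z0, z1, z2 in product((0, 1), repeat=3):
        p = p_z0[z0]
        p *= p_z_given_z0["Z1"][z0][z1] * p_z_given_z0["Z2"][z0][z2]
        latent = {"Z1": z1, "Z2": z2}
        for word, present in doc.items():
            parent, table = p_word_given_z[word]
            p_present = table[latent[parent]]
            p *= p_present if present else (1.0 - p_present)
        total += p
    return total

print(word_vector_prob({"learning": 1, "network": 1, "market": 0, "stock": 0}))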
Hierarchical Topic Models for Image Categorization
We propose a hierarchical topic model for the image categorization task. Motivated by standard topic models such as PLSA and LDA, and augmented with prior knowledge extracted from WordNet, our model explicitly specifies the latent topics with emphasis on their semantic relationships. The proposed model offers several advantages over current approaches in the literature. The latent topic model com...
Efficient Methods for Inferring Large Sparse Topic Hierarchies
Latent variable topic models such as Latent Dirichlet Allocation (LDA) can discover topics from text in an unsupervised fashion. However, scaling the models up to the many distinct topics exhibited in modern corpora is challenging. “Flat” topic models like LDA have difficulty modeling sparsely expressed topics, and richer hierarchical models become computationally intractable as the number of t...
Journal: Artif. Intell.
Volume: 250
Pages: -
Year of publication: 2017