Topic Trend Detection in Text Collections using Latent Dirichlet Allocation
Authors
Abstract
Algorithms that automatically mine distinct topics in document collections have become increasingly important due to their applications in many fields and the extensive growth of the number of documents in many domains. Traditionally, the task of topic discovery has been mainly addressed through algorithms that work on a snapshot view of the repository, which ignores the temporal characteristics of the collection. In a significant number of collections, the documents are temporal in nature and this temporal dimension can influence the topic discovery process. This paper proposes a generative model based on latent Dirichlet allocation that integrates the temporal ordering of the documents into the generative process in an iterative fashion. The document collection is divided into time segments, and the topics discovered in each segment are propagated to influence topic discovery in the subsequent time segments. We conduct experiments on the collection of academic papers from the CiteSeer repository. In addition to the textual content of the documents, we augment the text corpus with user queries and tags and integrate the citation graph to boost the weight of the topical terms. The experimental results show that the segmented topic model can effectively detect distinct topics and their evolution over time.
Preprint submitted to Information Systems, 24 December 2007.
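The abstract does not say how topics are carried from one time segment to the next, so the following is only a minimal sketch of the general idea, not the authors' model. It assumes a collapsed Gibbs sampler for LDA and a hypothetical propagation rule in which the topic-word counts of one segment, scaled by a made-up `carry` factor, are added to the topic-word prior of the next segment. The function names, parameters, and toy data are illustrative assumptions, and the query, tag, and citation-graph weighting mentioned above is not modeled.

```python
import random


def gibbs_lda(docs, vocab_size, num_topics, prior, alpha=0.1, iters=30, seed=0):
    """Collapsed Gibbs sampling for LDA on one time segment.

    docs  : list of documents, each a list of integer word ids
    prior : prior[k][w] is the pseudo-count of word w under topic k
    Returns the topic-word count matrix as a list of lists.
    """
    rng = random.Random(seed)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    doc_topic = [[0] * num_topics for _ in docs]
    prior_total = [sum(row) for row in prior]

    def update(d, k, w, delta):
        # Add or remove one token of word w in document d from topic k.
        topic_word[k][w] += delta
        topic_total[k] += delta
        doc_topic[d][k] += delta

    # Random initial topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            update(d, k, w, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, assign[d][i], w, -1)  # take the token out of the counts
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + prior[k][w])
                    / (topic_total[k] + prior_total[k])
                    for k in range(num_topics)
                ]
                k_new = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k_new
                update(d, k_new, w, +1)
    return topic_word


def segmented_lda(segments, vocab_size, num_topics, beta=0.01, carry=0.5):
    """Run LDA segment by segment, in temporal order.

    Assumed (hypothetical) propagation rule: the prior for the next segment
    is beta plus `carry` times the topic-word counts of the current segment.
    """
    prior = [[beta] * vocab_size for _ in range(num_topics)]
    history = []
    for docs in segments:
        counts = gibbs_lda(docs, vocab_size, num_topics, prior)
        history.append(counts)
        prior = [[beta + carry * c for c in row] for row in counts]
    return history


# Toy corpus: two time segments, word ids 0..3 (illustrative data only).
segments = [
    [[0, 1, 0, 1], [1, 0, 0]],     # earlier segment
    [[0, 1, 2, 3], [2, 3, 3, 2]],  # later segment, where words 2 and 3 appear
]
history = segmented_lda(segments, vocab_size=4, num_topics=2)
for t, counts in enumerate(history):
    print(f"segment {t}: {counts}")
```

Under this assumed rule, topic `k` in one segment starts from the word distribution of topic `k` in the previous segment, so comparing row `k` of consecutive entries in `history` gives a rough view of how that topic changes over time.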
Similar resources
Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation
Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...
Topic Trend Detection in Text Collections using Latent Dirichlet Allocation
Algorithms that enable the process of automatically mining distinct topics in document collections have become increasingly important due to their applications in many fields and the extensive growth of the number of documents in many domains. Traditionally, the task of topic discovery has been mainly addressed through algorithms that work on a snapshot view of the documents, which ignores the ...
Topic and Trend Detection in Text Collections Using Latent Dirichlet Allocation
Algorithms that enable the process of automatically mining distinct topics in document collections have become increasingly important due to their applications in many fields and the extensive growth of the number of documents in various domains. In this paper, we propose a generative model based on latent Dirichlet allocation that integrates the temporal ordering of the documents into the gene...
Document Clustering and Visualization with Latent Dirichlet Allocation and Self-Organizing Maps
Clustering and visualization of large text document collections aids in browsing, navigation, and information retrieval. We present a document clustering and visualization method based on Latent Dirichlet Allocation and self-organizing maps (LDA-SOM). LDA-SOM clusters documents based on topical content and renders clusters in an intuitive two-dimensional format. Document topics are inferred usin...
Using Variational Inference and MapReduce to Scale Topic Modeling
Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for exploring document collections. Because of the increasing prevalence of large datasets, there is a need to improve the scalability of inference of LDA. In this paper, we propose a technique called MapReduce LDA (Mr. LDA) to accommodate very large corpus collections in the MapReduce framework. In contrast to other techni...