Enriching Text Representation with Frequent Pattern Mining for Probabilistic Topic Modeling

نویسندگان

  • Hyun Duk Kim
  • Dae Hoon Park
  • Yue Lu
  • ChengXiang Zhai
چکیده

Probabilistic topic models have been proven very useful for many text mining tasks. Although many variants of topic models have been proposed, most existing works are based on the bag-of-words representation of text in which word combination and order are generally ignored, resulting in inaccurate semantic representation of text. In this paper, we propose a general way to go beyond the bag-of-words representation for topic modeling by applying frequent pattern mining to discover frequent word patterns that can capture semantic associations between words and then using them as additional supplementary semantic units to augment the conventional bag-of-words representation. By viewing a topic model as a generative model for such augmented text data, we can go beyond the bag-of-words assumption to potentially capture more semantic associations between words. Since efficient algorithms for mining frequent word patterns are available, this general strategy for improving topic models can be applied to improve any topic models without substantially increasing the computational complexity of the model. Experiment results show that such a frequent pattern-based data enrichment approach can improve over two representative existing probabilistic topic models for the classification task. We also studied variations of frequent pattern usage in topic modeling and found that using compressed and closed patterns performs best.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...

متن کامل

A Two-Stage Approach for Generating Topic Models

Topic modeling has been widely utilized in the fields of information retrieval, text mining, text classification etc. Most existing statistical topic modeling methods such as LDA and pLSA generate a term based representation to represent a topic by selecting single words from multinomial word distribution over this topic. There are two main shortcomings: firstly, popular or common words occur v...

متن کامل

A review of text mining approaches and their function in discovering and extracting a topic

Background and aim: Four text mining methods are examined and focused on understanding and identifying their properties and limitations in subject discovery. Methodology: The study is an analytical review of the literature of text mining and topic modeling.  Findings: LSA could be used to classify specific and unique topics in documents that address only a single topic. The other three text min...

متن کامل

Topic Modeling and Classification of Cyberspace Papers Using Text Mining

The global cyberspace networks provide individuals with platforms to can interact, exchange ideas, share information, provide social support, conduct business, create artistic media, play games, engage in political discussions, and many more. The term cyberspace has become a conventional means to describe anything associated with the Internet and the diverse Internet culture. In fact, cyberspac...

متن کامل

Chapter 8 PROBABILISTIC MODELS FOR TEXT MINING

A number of probabilistic methods such as LDA, hidden Markov models, Markov random fields have arisen in recent years for probabilistic analysis of text data. This chapter provides an overview of a variety of probabilistic models for text mining. The chapter focuses more on the fundamental probabilistic techniques, and also covers their various applications to different text mining problems. So...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012