Latent Dirichlet Allocation For Text And Image Topic Modeling

Authors

  • Jong-Chyi Su
  • Wei-Ping Liao
Abstract

Latent Dirichlet Allocation (LDA) is a generative model for text documents. It is an unsupervised method that learns latent topics from a document collection. We investigate topic modeling of documents using LDA, with parameters trained by collapsed Gibbs sampling. Because training is unsupervised and the true labels of the training documents are absent, it is hard to measure goodness-of-fit and the degree of overfitting. We discuss the harmonic mean and perplexity as measures of goodness-of-fit and degree of overfitting, respectively. In this report, we apply LDA to the Classic400 text dataset and the Binary Alphadigits image dataset. For the text dataset, results are shown by visualizing the probability that each document belongs to each topic. For the image dataset, we show the topics, which represent interesting features of characters.
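
As a rough illustration of the training procedure named in the abstract, the following is a minimal sketch of collapsed Gibbs sampling for LDA, together with a perplexity computation. It is not the authors' implementation. The function names (`lda_gibbs`, `perplexity`), the symmetric hyperparameters `alpha` and `beta`, their default values, and the representation of documents as lists of integer word ids are all assumptions made for this example. The perplexity here is computed from point estimates on the same documents used for training; using it to gauge overfitting would require held-out documents and an estimate of their topic proportions, which is not shown.

```python
import math
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids.

    alpha and beta are assumed symmetric Dirichlet hyperparameters.
    """
    rng = random.Random(seed)
    # Count tables: document-topic, topic-word, and per-topic totals.
    ndk = [[0] * n_topics for _ in docs]
    nkw = [[0] * vocab_size for _ in range(n_topics)]
    nk = [0] * n_topics

    # Assign every token a random initial topic and fill the count tables.
    z = []
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(zd)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token's assignment from the counts.
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # Full conditional p(z_i = t | all other assignments), unnormalized.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1

    # Point estimates of document-topic (theta) and topic-word (phi) distributions.
    theta = [
        [(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(nkw[t][w] + beta) / (nk[t] + vocab_size * beta) for w in range(vocab_size)]
        for t in range(n_topics)
    ]
    return theta, phi


def perplexity(docs, theta, phi):
    """exp of the negative average per-token log-likelihood under theta and phi."""
    log_lik = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            p = sum(theta[d][t] * phi[t][w] for t in range(len(phi)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)
```

Calling `lda_gibbs(docs, n_topics=2, vocab_size=6)` on a small toy corpus returns one row of topic probabilities per document in `theta`, which corresponds to the per-document topic probabilities the abstract describes visualizing. Lower perplexity indicates that the model assigns higher probability to the observed words.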

Similar resources

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...

Latent Dirichlet Allocation For Text And Image Topic Modeling

Latent Dirichlet allocation (LDA) is a popular unsupervised technique for topic modeling. It learns a generative model which can discover latent topics given a collection of training documents. In the unsupervised learning framework, where the class label is unavailable, it is less intuitive to evaluate the goodness-of-fit and degree of overfitting of the learned model. We discuss two measurements ...

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

Latent Dirichlet Allocation with Topic-in-Set Knowledge

Latent Dirichlet Allocation is an unsupervised graphical model which can discover latent topics in unlabeled data. We propose a mechanism for adding partial supervision, called topic-in-set knowledge, to latent topic modeling. This type of supervision can be used to encourage the recovery of topics which are more relevant to user modeling goals than the topics which would be recovered otherwise...

Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey

Topic modeling is one of the most powerful techniques in text mining for data mining, latent data discovery, and finding relationships among data and text documents. Researchers have published many articles in the field of topic modeling and applied it in various fields such as software engineering, political science, medical and linguistic science, etc. There are various methods for topic modeling, ...

Publication date: 2014