Stability of Topic Modeling via Matrix Factorization

Authors

  • Mark Belford
  • Brian Mac Namee
  • Derek Greene
Abstract

Topic models can provide us with an insight into the underlying latent structure of a large corpus of documents. A range of methods have been proposed in the literature, including probabilistic topic models and techniques based on matrix factorization. However, in both cases, standard implementations rely on stochastic elements in their initialization phase, which can potentially lead to different results being generated on the same corpus when using the same parameter values. This corresponds to the concept of “instability” which has previously been studied in the context of k-means clustering. In many applications of topic modeling, this problem of instability is not considered and topic models are treated as being definitive, even though the results may change considerably if the initialization process is altered. In this paper we demonstrate the inherent instability of popular topic modeling approaches, using a number of new measures to assess stability. To address this issue in the context of matrix factorization for topic modeling, we propose the use of ensemble learning strategies. Based on experiments performed on annotated text corpora, we show that a K-Fold ensemble strategy, combining both ensembles and structured initialization, can significantly reduce instability, while simultaneously yielding more accurate topic models.
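
The abstract does not define its stability measures, so the following is only an illustrative sketch of one plausible way to quantify instability: compare the top-term lists produced by several runs of a topic model on the same corpus, match topics between runs, and average their Jaccard overlap. The function names (`jaccard`, `run_agreement`, `stability`) and the example term lists are hypothetical and are not taken from the paper.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def run_agreement(run_a, run_b):
    """Mean Jaccard similarity under the best one-to-one topic matching.

    Each run is a list of topics; each topic is a list of its top terms.
    Brute-force matching over permutations is only practical for a small
    number of topics (an assumption made to keep the sketch short).
    """
    if len(run_a) != len(run_b):
        raise ValueError("both runs must have the same number of topics")
    k = len(run_a)
    if k == 0:
        raise ValueError("runs must contain at least one topic")
    best = 0.0
    for perm in permutations(range(k)):
        score = sum(jaccard(run_a[i], run_b[j]) for i, j in enumerate(perm)) / k
        best = max(best, score)
    return best


def stability(runs):
    """Average pairwise agreement over all pairs of runs (1.0 = identical)."""
    pairs = [(i, j) for i in range(len(runs)) for j in range(i + 1, len(runs))]
    if not pairs:
        raise ValueError("need at least two runs to measure stability")
    return sum(run_agreement(runs[i], runs[j]) for i, j in pairs) / len(pairs)


# Hypothetical top-3 terms from three runs that differ only in random seed.
run1 = [["match", "goal", "team"], ["vote", "party", "election"]]
run2 = [["party", "vote", "election"], ["team", "goal", "league"]]
run3 = [["goal", "vote", "team"], ["party", "election", "match"]]

score = stability([run1, run2, run3])
```

A score close to 1.0 means the runs recover essentially the same topics regardless of topic order; lower values indicate that the output depends on the initialization, which is the behaviour the abstract calls instability.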

Similar resources

Ensemble Topic Modeling via Matrix Factorization

Topic models can provide us with an insight into the underlying latent structure of a large corpus of documents, facilitating knowledge discovery and information summarization. A range of methods have been proposed in the literature, including probabilistic topic models and techniques based on matrix factorization. However, these methods tend to have stochastic elements in their initialization,...

Finding Hierarchy of Topics from Twitter Data

Topic modeling of text collections is rapidly gaining importance for a wide variety of applications including information retrieval and automatic multimedia indexing. Our motivation is to exploit a hierarchical topic selection via nonnegative matrix factorization to capture the nature content of text posted on Twitter. This paper explores the use of an effective framework to automatically disco...

High-Recall Document Retrieval from Large-Scale Noisy Documents via Visual Analytics based on Targeted Topic Modeling

We present a visual analytics system for large-scale document retrieval tasks with high recall where any missing relevant documents can be critical. Our system utilizes a novel user-driven topic modeling called targeted topic modeling, a variant of nonnegative matrix factorization (NMF). Our system visualizes a topic summary in a treemap form and lets users keep relevant topics and incrementall...

WZ factorization via Abaffy-Broyden-Spedicato algorithms

Classes of Abaffy-Broyden-Spedicato (ABS) methods have been introduced for solving linear systems of equations. The algorithms are powerful methods for developing matrix factorizations and many fundamental numerical linear algebra processes. Here, we show how to apply the ABS algorithms to devise algorithms to compute the WZ and ZW factorizations of a nonsingular matrix as well as...

Weak Supervision for Semi-supervised Topic Modeling via Word Embeddings

Semi-supervised algorithms have been shown to improve the results of topic modeling when applied to unstructured text corpora. However, sufficient supervision is not always available. This paper proposes a new process, Weak+, suitable for use in semi-supervised topic modeling via matrix factorization, when limited supervision is available. This process uses word embeddings to provide additional...

Journal:
  • Expert Syst. Appl.

Volume 91

Published 2018