Language Model-Based Document Clustering Using Random Walks
Abstract
We propose a new document vector representation specifically designed for the document clustering task. Instead of the traditional term-based vectors, a document is represented as an n-dimensional vector, where n is the number of documents in the cluster. The value at each dimension of the vector is closely related to the generation probability based on the language model of the corresponding document. Inspired by recent graph-based NLP methods, we reinforce the generation probabilities by iterating random walks on the underlying graph representation. Experiments with k-means and hierarchical clustering algorithms show significant improvements over the alternative vector representation.
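The idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not necessarily the paper's formulation: the unigram models with linear interpolation smoothing (weight `lam`), the per-word geometric mean used to normalize for document length, the row normalization, and the helper names `unigram_model`, `generation_matrix`, and `random_walk`. Documents are assumed to be non-empty token lists.

```python
import math
from collections import Counter


def unigram_model(tokens, background, lam=0.5):
    """Return a smoothed unigram probability function for one document.

    Assumption: linear interpolation with a corpus-wide background model.
    """
    counts = Counter(tokens)
    total = len(tokens)

    def prob(word):
        return lam * counts[word] / total + (1 - lam) * background[word]

    return prob


def generation_matrix(docs, lam=0.5):
    """Row i holds the normalized probabilities of generating document i
    from the language model of each document j."""
    all_tokens = [w for d in docs for w in d]
    background = {w: c / len(all_tokens) for w, c in Counter(all_tokens).items()}
    models = [unigram_model(d, background, lam) for d in docs]
    matrix = []
    for d in docs:
        row = []
        for model in models:
            log_p = sum(math.log(model(w)) for w in d)
            # Assumption: per-word geometric mean, so long documents are not penalized.
            row.append(math.exp(log_p / len(d)))
        total = sum(row)
        # Normalize so each row is a probability distribution (a stochastic matrix).
        matrix.append([v / total for v in row])
    return matrix


def random_walk(matrix, steps):
    """Return the `steps`-step transition probabilities (the matrix power)."""
    n = len(matrix)
    result = matrix
    for _ in range(steps - 1):
        result = [
            [sum(result[i][k] * matrix[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return result
```

Each row of `random_walk(generation_matrix(docs), steps)` is then an n-dimensional vector for one document, which could be passed to k-means or a hierarchical clustering routine in place of a term-based vector.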
Similar resources
A Random Walks Method for Text Classification
A practical text classification system should be able to utilize information from both expensive labelled documents and large volumes of cheap unlabelled documents. It should also easily deal with newly input samples. In this paper, we propose a random walks method for text classification, in which the classification problem is formulated as solving the absorption probabilities of Markov random w...
CLAIRLIB Documentation v1.03
The Clair library is intended to simplify a number of generic tasks in Natural Language Processing (NLP), Information Retrieval (IR), and Network Analysis. Its architecture also allows for external software to be plugged in with very little effort. Functionality native to Clairlib includes Tokenization, Summarization, LexRank, Biased LexRank, Document Clustering, Document Indexing, PageRank, Bi...
Faster Clustering via Non-Backtracking Random Walks
This paper presents VEC-NBT, a variation on the unsupervised graph clustering technique VEC, which improves upon the performance of the original algorithm significantly for sparse graphs. VEC employs a novel application of the state-of-the-art word2vec model to embed a graph in Euclidean space via random walks on the nodes of the graph. In VEC-NBT, we modify the original algorithm to use a non-b...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
Cluster-based language model for spoken document retrieval using NMF-based document clustering
In this paper, a non-negative matrix factorization (NMF)-based document clustering approach is proposed for the cluster-based language model for spoken document retrieval. The retrieval language model comprises three different unigram models: a whole corpus collect-based unigram, a document-based unigram, and a document clustering-based unigram. They are combined with double linear interpolations. ...