A Joint Semantic Vector Representation Model for Text Clustering and Classification

Authors

  • A. Rahbar Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran.
  • D. Salami Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran.
  • I. Khanijazani Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran.
  • S. Momtazi Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran.
Abstract:

Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use semantic models for document vector representations. Latent Dirichlet allocation (LDA) topic modeling and doc2vec neural document embedding are two well-known techniques for this purpose. In this paper, we first study the conceptual difference between the two models and show that they have different behavior and capture semantic features of texts from different perspectives. We then proposed a hybrid approach for document vector representation to benefit from the advantages of both models. The experimental results on 20newsgroup show the superiority of the proposed model compared to each of the baselines on both text clustering and classification tasks. We achieved 2.6% improvement in F-measure for text clustering and 2.1% improvement in F-measure in text classification compared to the best baseline model.

Upgrade to premium to download articles

Sign up to access the full text

Already have an account?login

similar resources

Document Vector Space Representation Model for Automatic Text Classification

Classification of text documents presents a unique challenge to conventional classification algorithms. Due to the existence of large number of features in the datasets, providing a desired representation for text documents can be seen as another problem. In this paper a simple but effective representation model for text documents to tackle the classification problem is discussed. Two different...

full text

New Word Vector Representation for Semantic Clustering

RÉSUMÉ. L’idée que nous défendons dans cet article est qu’il est possible d’obtenir des concepts sémantiques significatifs par des méthodes de classification automatique. Pour ce faire, nous commençons par proposer des mesures permettant de quantifier les relations sémantiques entre mots. Ensuite, nous utilisons les méthodes de classification non supervisée pour construire les concepts d’une ma...

full text

A Content Vector Model for Text Classification

As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications. In this paper, an LSI-based content vector model for text classification is presented, which constructs multiple augmented category LSI spaces and classifies text by their content. The model integrates the class discriminative information from the traini...

full text

Semantic Clustering for a Functional Text Classification Task

We describe a semantic clustering method designed to address shortcomings in the common bag-of-words document representation for functional semantic classification tasks. The method uses WordNetbased distance metrics to construct a similarity matrix, and expectation maximization to find and represent clusters of semantically-related terms. Using these clusters as features for machine learning h...

full text

Distributional Semantic Representation for Text Classification and Information Retrieval

The objective of this experiment is to validate the performance of the distributional semantic representation of text in the classification (Question Classification) task and the Information Retrieval task. Followed by the distributional representation, first level classification of the questions is performed and relevant tweets with respect to the given queries are retrieved. The distributiona...

full text

the innovation of a statistical model to estimate dependable rainfall (dr) and develop it for determination and classification of drought and wet years of iran

آب حاصل از بارش منبع تأمین نیازهای بی شمار جانداران به ویژه انسان است و هرگونه کاهش در کم و کیف آن مستقیماً حیات موجودات زنده را تحت تأثیر منفی قرار می دهد. نوسان سال به سال بارش از ویژگی های اساسی و بسیار مهم بارش های سالانه ایران محسوب می شود که آثار زیان بار آن در تمام عرصه های اقتصادی، اجتماعی و حتی سیاسی- امنیتی به نحوی منعکس می شود. چون میزان آب ناشی از بارش یکی از مولفه های اصلی برنامه ...

15 صفحه اول

My Resources

Save resource for easier access later

Save to my library Already added to my library

{@ msg_add @}


Journal title

volume 7  issue 3

pages  443- 450

publication date 2019-07-01

By following a journal you will be notified via email when a new issue of this journal is published.

Hosted on Doprax cloud platform doprax.com

copyright © 2015-2023