Evaluation of Internal Validity Measures in Short-Text Corpora
Abstract
Short-text clustering is one of the most difficult tasks in natural language processing because document terms occur with low frequency. We are interested in analysing this kind of corpus in order to develop novel techniques that may improve the results obtained by classical clustering algorithms. In this paper we present an evaluation of different internal clustering validity measures in order to determine whether they correlate with the F-Measure, a well-known external clustering measure used to assess the performance of clustering algorithms. We used several short-text corpora in our experiments. The correlation obtained for a particular set of internal validity measures lets us conclude that some of them may be used to improve the performance of text clustering algorithms.
Similar resources
Particle Swarm Optimization for clustering short-text corpora
Clustering of short-text collections is a very relevant research area, given the current and future mode for people to use “small-language” (e.g. blogs, snippets, news and text-message generation such as email or chat). In recent years, a few approaches based on Particle Swarm Optimization (PSO) have been proposed to solve document clustering problems. However, the particularities that arise wh...
Measuring the homogeneity and similarity of language corpora
Corpus-based methods are now dominant in Natural Language Processing (NLP). Creating big corpora is no longer difficult and the technology to analyze them is growing faster, more robust and more accurate. However, when an NLP application performs well on one corpus, it is unclear whether this level of performance would be maintained on others. To make progress on these questions, we need metho...
Syntactic Complexity of Russian Unified State Exam Texts in English: A Study on Reliability and Validity
In this study we analyze texts used in the Russian Unified State Exam (USE) in English. The texts, which form two small research corpora, were retrieved from two sources: the official USE database as a reference point, and “Neznaika” (https://neznaika.pro/), a popular website used by pupils for USE training. The two corpora are balanced in size: USE has 11934 tokens and “Neznaika” has 11918 tokens. We share Biber’...
Rough Text Assisting Text Mining: Focus on Document Clustering Validity
In this chapter, the applications of rough set theory (RST) in text mining are discussed and a new concept named “Rough Text” is presented along with some RST-based measures for the evaluation of decision systems. We will focus on the application of this concept to clustering validity, specifically cluster labeling and multidocument summarization. The experimental studies show that the proposed...
A Discrete Particle Swarm Optimizer for Clustering Short-text Corpora
Work on “short-text clustering” is relevant, particularly if we consider the current/future mode for people to use ‘small-language’, e.g. blogs, text-messaging, snippets, etc. Potential applications in different areas of natural language processing may include re-ranking of snippets in information retrieval, and automatic clustering of scientific texts available on the Web. Despite its relevanc...