Text clustering for topic detection
Authors
Abstract
The World Wide Web holds vast stores of information. However, the sheer volume of that information makes it practically impossible for any human user to be aware of much of it. It would therefore be very helpful to have a system that automatically discovers relevant, yet previously unknown, information and reports it to users in human-readable form. As a first step toward this goal, we propose a new clustering algorithm and compare it with existing clustering algorithms. The proposed method is motivated by constructive and competitive learning from neural network research. In the construction phase, it tries to find the optimal number of clusters by adding a new cluster whenever an intrinsic difference between the presented instance and the existing clusters is detected. Each cluster then moves toward the optimal cluster center, at a pace set by the learning rate, by adjusting its weight vector. In experiments on three real-world data sets, the proposed method performs consistently across the different domains, and its performance on text domains is better than that reported in previous research.
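The two phases described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not say how an "intrinsic difference" is detected, so the sketch assumes a simple rule (Euclidean distance to the nearest center exceeding a fixed `threshold`), and the function names, the `threshold` and `learning_rate` parameters, and the number of epochs are all hypothetical choices made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def constructive_competitive_clustering(points, threshold,
                                        learning_rate=0.1, epochs=5):
    """Grow a set of cluster centers and move them toward the data.

    Assumption: an instance is "intrinsically different" from the existing
    clusters when its distance to the nearest center exceeds `threshold`.
    """
    centers = []
    for _ in range(epochs):
        for p in points:
            if not centers:
                # The first instance seeds the first cluster.
                centers.append(list(p))
                continue
            # Competitive step: the nearest center wins the instance.
            dists = [euclidean(p, c) for c in centers]
            winner = min(range(len(centers)), key=dists.__getitem__)
            if dists[winner] > threshold:
                # Constructive step: add a new cluster at the instance.
                centers.append(list(p))
            else:
                # Move the winner's weight vector toward the instance.
                c = centers[winner]
                for i in range(len(c)):
                    c[i] += learning_rate * (p[i] - c[i])
    return centers


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
        for p in points
    ]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
    centers = constructive_competitive_clustering(pts, threshold=1.0)
    print(len(centers), assign(pts, centers))
```

In this sketch the number of clusters is not fixed in advance; it depends on `threshold`, so a smaller value produces more clusters. For text, the input vectors would typically be term-weight vectors, and a cosine-based distance may be a better fit than the Euclidean one assumed here.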
Related articles
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
Network Topic Detection Model Based on Text Reconstructions
The single-pass clustering algorithm is widely used in topic detection and tracking, and it is a key part of the network topic detection model. However, the clustering results of the single-pass algorithm are not satisfactory, and the similarity matching would be reduced. Focusing on these two defects, this paper physically reconstructs web information into a volume, in which every document contains “theme ...
Chinese Microblog Topic Detection Based on the Latent Semantic Analysis and Structural Property
Traditional topic detection methods cannot be applied directly to microblog topic detection, because microblog text is short, fragmented, and grass-roots. In order to detect hot topics in microblog text effectively, we propose a microblog topic detection method based on the combination of latent semantic analysis and structural property. According to the...
Query-Based Topic Detection Using Concepts and Named Entities
In this paper, we present a framework for topic detection in news articles. The framework receives as input the results retrieved from a query-based search and clusters them by topic. To this end, the recently introduced “DBSCAN-Martingale” method for automatically estimating the number of topics and the well-established Latent Dirichlet Allocation topic modelling approach for the assignment of...
Health-Related Hot Topic Detection in Online Communities Using Text Clustering
Recently, health-related social media services, especially online health communities, have rapidly emerged. Patients with various health conditions participate in online health communities to share their experiences and exchange healthcare knowledge. Exploring hot topics in online health communities helps us better understand patients' needs and interest in health-related knowledge. However, th...
A confirmatory study of Differential Item Functioning on EFL reading comprehension
The present study aimed at investigating DIF sources on an EFL reading comprehension test. Accordingly, 2 DIF detection methods, logistic regression (LR) and item response theory (IRT), were used to flag emergent DIF of 203 (110 females & 93 males) Iranian EFL examinees’ performance on a reading comprehension test. Seven hypothetical DIF sources were examin...