Enhancing News Articles Clustering using Word N-Grams
Abstract
In this work we explore whether document clustering results, in particular for news articles from the web, can be improved by using word-based n-grams during the keyword extraction phase. We present and evaluate a weighting approach that combines keywords with n-grams, extracted from the articles at an offline stage, when clustering news articles collected from the web. We compare this technique with the simple bag-of-words representation that our clustering algorithm, W-kmeans, previously used. Our experiments show that tuning the weighting parameters between keywords and n-grams, as well as n itself, can yield a significant improvement in the clustering evaluation metrics. This reflects more coherent clusters and better overall clustering performance.
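As a rough illustration of the idea of weighting keywords against word n-grams, the following minimal sketch builds a combined feature vector for a document and compares two documents by cosine similarity. It is not the W-kmeans implementation: the whitespace tokenization, the function names, and the `keyword_weight` and `ngram_weight` parameters are all assumptions made for this example.

```python
from collections import Counter
import math


def word_ngrams(tokens, n):
    """Return the word n-grams of a token list, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def weighted_features(text, n=2, keyword_weight=0.6, ngram_weight=0.4):
    """Build a sparse vector mixing single-word and n-gram counts.

    keyword_weight and ngram_weight are hypothetical tuning parameters;
    the abstract does not state the actual values or weighting formula.
    """
    tokens = text.lower().split()  # assumption: naive whitespace tokenizer
    features = Counter()
    for token, count in Counter(tokens).items():
        features[token] += keyword_weight * count
    for gram, count in Counter(word_ngrams(tokens, n)).items():
        features[gram] += ngram_weight * count
    return features


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = weighted_features("new york times")
doc_b = weighted_features("york new times")
similarity = cosine(doc_a, doc_b)
```

With a pure bag-of-words vector (`ngram_weight=0`), the two example documents are indistinguishable because they contain the same words. Once bigrams carry some weight, the different word order lowers their similarity, which is the kind of signal a clustering step could exploit.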
Similar resources
Assisting cluster coherency via n-grams and clustering as a tool to deal with the new user problem
Collaborative filtering systems typically need to acquire some data about the new user in order to start making personalized suggestions, a situation commonly referred to as the "new user problem". In this work we attempt to address the new user problem via a unique personalized strategy for prompting the user with articles to rate. Our approach makes use of hypernyms extracted from the WordN...
Jumping Distance based Chinese Person Name Disambiguation
In this paper, we describe a Chinese person name disambiguation system for news articles and report the results obtained on the data set of the CLP 2010 Bakeoff-3. The main task of the Bakeoff is to identify different persons from the news stories that contain the same person-name string. Compared to the traditional methods, two additional features are used in our system: 1) n-grams co-occurred...
Exploring Word Embeddings and Character N-Grams for Author Clustering
We present our system for the PAN 2016 Author Clustering task. Our software uses simple character n-grams to represent the document collection. We then run K-Means clustering optimized using the Silhouette Coefficient. Our system yields competitive results and requires only a short runtime. Character n-grams can capture a wide range of information, making them effective for authorship attribution...
Identification of Case, Digits and Special Symbols Using a Context Window
We present strategies and results for identifying the symbol type of every character in a text document. Assuming reasonable word and character segmentation for shape clustering, we designed several type recognition methods that depend on cluster n-grams, characteristics of neighbors, and within-word context. On an ASCII test corpus of 925 articles, these methods represent a substantial improve...
Evaluating the Unification of Multiple Information Retrieval Techniques into a News Indexing Service
As online information sources rapidly increase in number, so does the daily available online news content. Several approaches have been proposed for organizing this immense amount of data. In this work we explore the integration of multiple information retrieval techniques, like text preprocessing, n-grams expansion, summarization, categorization and item/user clustering into a single...