Filtering Methods for Feature Selection in Web-Document Clustering

Authors

  • Heum Park
  • Hyuk-Chul Kwon
Abstract

This paper presents a comparative study of filtering methods for feature selection in web-document clustering. First, we focus on feature selection methods based on Mutual Information (MI) and Information Gain (IG). Using the MI and IG values of the features, we extract from the documents the representative max-value features, together with a representative cluster for each feature and a representative cluster for each document. Second, we test the Max Feature Selection Method (MFSM) with those representative features and clusters, and evaluate the resulting web-document clustering performance. However, when a document set clusters poorly on term frequency, the MFSM cannot obtain good features from the MI and IG values. We therefore propose two new filtering methods: Min Count of Representative Cluster for a Feature (MCRCF) and Min Count of Representative Cluster for a Document (MCRCD). In the experiments, the MFSM performed better than term frequency, MI, and IG used alone, and applying the new filtering methods (MCRCF, MCRCD) improved clustering performance notably. We conclude that these filtering methods are an effective means of feature selection and perform well in web-document clustering.
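The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea of scoring features by MI and picking a representative cluster for each feature. It assumes the common pointwise form MI(t, c) = log(P(t, c) / (P(t) P(c))) estimated from document counts, and it assumes that a feature's representative cluster is the cluster where its MI score is highest. The function names `mutual_information` and `representative_clusters` and the sample data are hypothetical, not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def mutual_information(docs, labels):
    """Return {term: {cluster: MI score}} from document-level presence counts.

    Assumed estimate: MI(t, c) = log(A * N / (df(t) * |c|)), where A is the
    number of documents in cluster c that contain term t, N is the total
    number of documents, df(t) is the number of documents containing t,
    and |c| is the number of documents in cluster c.
    """
    n = len(docs)
    cluster_sizes = Counter(labels)
    term_df = Counter()
    term_cluster_df = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            term_df[term] += 1
            term_cluster_df[term][label] += 1
    return {
        term: {
            c: math.log((count * n) / (term_df[term] * cluster_sizes[c]))
            for c, count in per_cluster.items()
        }
        for term, per_cluster in term_cluster_df.items()
    }


def representative_clusters(scores):
    """Map each term to the (cluster, score) pair with its highest MI score."""
    return {
        term: max(per_cluster.items(), key=lambda kv: kv[1])
        for term, per_cluster in scores.items()
    }


# Hypothetical toy corpus: tokenized documents with known cluster labels.
sample_docs = [
    ["the", "apple", "fruit"],
    ["the", "apple", "pie"],
    ["the", "goal", "match"],
    ["the", "goal", "team"],
]
sample_labels = ["food", "food", "sport", "sport"]

sample_scores = mutual_information(sample_docs, sample_labels)
sample_representatives = representative_clusters(sample_scores)
```

In this toy corpus, a term that occurs in only one cluster gets a positive score there, while a term spread evenly across clusters (such as "the") scores zero. This pointwise estimate is known to favor rare terms, which is one reason additional filtering on top of raw MI values can be useful.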


Similar resources

RRLUFF: Ranking function based on Reinforcement Learning using User Feedback and Web Document Features

The principal aim of a search engine is to provide results sorted according to the user's requirements. To achieve this aim, it employs ranking methods to rank web documents by their significance and relevance to the user's query. The novelty of this paper is a user-feedback-based ranking algorithm that uses reinforcement learning. The proposed algorithm is called RRLUFF, in which the rank...


Optimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines

In this paper, the principles of feature selection and existing feature selection methods for classifying and clustering data are introduced. To that end, categorizing frameworks for finding selected subsets, namely search-based and non-search-based procedures, as well as evaluation criteria and data mining tasks, are discussed. In the following, a platform is developed as an intermediate step toward developing an intell...


A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcomings in capturing the semantic concepts of text have motivated researchers to use...



Using Fuzzy Logic Clustering Discover Semantic Similarity in Web Document

The complex, dense interactions between terms in documents produce vague and ambiguous meanings. There are complicated associations within a single web document and in its links to other documents. Most existing approaches rely on similarity and feature selection methods. There is a need for clustering of complex documents that produces meaningful results. The methodology proposed in this paper is capable of handling...




Publication date: 2007