Feature Selection via Correlation Coefficient Clustering

Authors

  • Hui-Huang Hsu
  • Cheng-Wei Hsieh
Abstract

Feature selection is a fundamental problem in machine learning and data mining: choosing the features most relevant to the problem from a set of collected features is essential. This paper proposes a novel method that uses correlation coefficient clustering to remove similar or redundant features. The collected features are grouped into clusters according to their correlation coefficient values. The most class-dependent feature in each cluster is retained, and the others in the same cluster are removed. In this way, the most class-related and mutually unrelated features are identified. The proposed method was applied to two datasets: the disordered protein dataset and the Arrhythmia (ARR) dataset. The experimental results show that the method is superior to other feature selection methods in speed and/or accuracy. Detailed discussions are given in the paper.
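
The idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, the correlation threshold, or the class-dependence measure, so everything below is an assumption made for illustration: features are grouped by single-linkage merging whenever their absolute Pearson correlation reaches a hypothetical `threshold`, and class dependence is approximated by the absolute Pearson correlation between a feature and numeric class labels. The names `pearson` and `select_features` are hypothetical and do not come from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / sqrt(var_x * var_y)


def select_features(columns, labels, threshold=0.8):
    """Return sorted indices of one representative feature per cluster.

    columns   -- list of feature columns (each a list of numbers)
    labels    -- numeric class labels, one per sample
    threshold -- hypothetical |r| cutoff for calling two features redundant
    """
    n = len(columns)
    parent = list(range(n))  # union-find forest over feature indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merge features whose absolute correlation reaches the threshold
    # (single-linkage grouping; an assumption, not the paper's procedure).
    for i in range(n):
        for j in range(i + 1, n):
            if abs(pearson(columns[i], columns[j])) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)

    # Keep the feature most correlated with the class label in each cluster
    # (an assumed stand-in for the paper's class-dependence measure).
    relevance = [abs(pearson(col, labels)) for col in columns]
    return sorted(max(members, key=lambda i: relevance[i])
                  for members in clusters.values())


# Toy data: features 0 and 1 are nearly identical, feature 2 is unrelated.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.1, 2.0, 2.9, 4.2, 5.1, 5.9],
    [3.0, 1.0, 3.0, 1.0, 3.0, 1.0],
]
labels = [0, 0, 0, 1, 1, 1]
selected = select_features(columns, labels)
```

On this toy data, features 0 and 1 fall into one cluster and feature 2 forms its own, so one feature from the redundant pair is dropped. Raising `threshold` yields more, smaller clusters and therefore keeps more features; lowering it removes features more aggressively.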

Related resources

Optimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines

In this paper, principles and existing feature selection methods for classifying and clustering data are introduced. To that end, categorizing frameworks for finding selected subsets, namely search-based and non-search-based procedures, as well as evaluation criteria and data mining tasks, are discussed. In the following, a platform is developed as an intermediate step toward developing an intell...

Cancerous Tissue Classification Using Microarray Gene Expression

In this project, we apply machine learning techniques to perform tumor vs. normal tissue classification using gene expression microarray data, which was proven to be useful for early-stage cancer diagnosis and cancer subtype identification. We compare the results of both supervised learning (k-nearest-neighbors, SVMs, boosting) and unsupervised learning (k-means clustering, hierarchical cluster...

Comparison of Similarity Coefficients for Clustering and Compound Selection

Recent studies into the use of a selection of similarity coefficients, when applied to searches of chemical databases represented by binary fingerprints, have shown considerable variation in their retrieval performance and in the sets of compounds being retrieved. The main factor influencing performance is the density distribution of the bitstrings for the active class, a feature which is close...

Association Coefficient Measures for Document Clustering

This paper presents Association Coefficient Measures for Document Clustering. The proposed Association Coefficient Measures approach is based on Intuitionistic Fuzzy Sets. In this paper twelve Association Coefficient Measures from f1 to f12 are used. In Document Clustering Document collection, Text Pre-processing, Feature Selection, Indexing, Clustering Process and Results Analysis steps are us...

Journal:
  • JSW

Volume 5 (issue number not listed)

Pages: not listed

Publication date: 2010