Simultaneous Gene Clustering and Subset Selection for Sample Classification Via MDL

نویسندگان

  • Rebecka Jörnsten
  • Bin Yu
چکیده

MOTIVATION The microarray technology allows for the simultaneous monitoring of thousands of genes for each sample. The high-dimensional gene expression data can be used to study similarities of gene expression profiles across different samples to form a gene clustering. The clusters may be indicative of genetic pathways. Parallel to gene clustering is the important application of sample classification based on all or selected gene expressions. The gene clustering and sample classification are often undertaken separately, or in a directional manner (one as an aid for the other). However, such separation of these two tasks may occlude informative structure in the data. Here we present an algorithm for the simultaneous clustering of genes and subset selection of gene clusters for sample classification. We develop a new model selection criterion based on Rissanen's MDL (minimum description length) principle. For the first time, an MDL code length is given for both explanatory variables (genes) and response variables (sample class labels). The final output of the proposed algorithm is a sparse and interpretable classification rule based on cluster centroids or the closest genes to the centroids. RESULTS Our algorithm for simultaneous gene clustering and subset selection for classification is applied to three publicly available data sets. For all three data sets, we obtain sparse and interpretable classification models based on centroids of clusters. At the same time, these models give competitive test error rates as the best reported methods. Compared with classification models based on single gene selections, our rules are stable in the sense that the number of clusters has a small variability and the centroids of the clusters are well correlated (or consistent) across different cross validation samples. We also discuss models where the centroids of clusters are replaced with the genes closest to the centroids. These models show comparable test error rates to models based on single gene selection, but are more sparse as well as more stable. Moreover, we comment on how the inclusion of a classification criterion affects the gene clustering, bringing out class informative structure in the data. AVAILABILITY The methods presented in this paper have been implemented in the R language. The source code is available from the first author.

منابع مشابه

Investigation on Several Model Selection Criteria for Determining the Number of Cluster

Abstract Determining the number of clusters is a crucial problem in clustering. Conventionally, selection of the number of clusters was effected via cost function based criteria such as Akaike’s information criterion (AIC), the consistent Akaike’s information criterion (CAIC), the minimum description length (MDL) criterion which formally coincides with the Bayesian inference criterion (BIC). In...

متن کامل

SFLA Based Gene Selection Approach for Improving Cancer Classification Accuracy

 In this paper, we propose a new gene selection algorithm based on Shuffled Frog Leaping Algorithm that is called SFLA-FS. The proposed algorithm is used for improving cancer classification accuracy. Most of the biological datasets such as cancer datasets have a large number of genes and few samples. However, most of these genes are not usable in some tasks for example in cancer classification....

متن کامل

Simultaneous Gene Clustering and Subset Selection for Sample Classi cation via MDL

Motivation: The microarray technology allows for the simultaneous monitoring of thousands of gene expression for each sample. The high-dimensional gene expression data can be used to study similarities of genes' expression pro les across di erent samples, that is gene clustering. The gene clusters may be indicative of genetic pathways. Another important application is sample classi cation, base...

متن کامل

Classification of Right/Left Hand Motor Imagery by Effective Connectivity Based on Transfer Entropy in EEG Signal

The right and left hand Motor Imagery (MI) analysis based on the electroencephalogram (EEG) signal can directly link the central nervous system to a computer or a device. This study aims to identify a set of robust and nonlinear effective brain connectivity features quantified by transfer entropy (TE) to characterize the relationship between brain regions from EEG signals and create a hierarchi...

متن کامل

Optimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines

In this paper, principles and existing feature selection methods for classifying and clustering data be introduced. To that end, categorizing frameworks for finding selected subsets, namely, search-based and non-search based procedures as well as evaluation criteria and data mining tasks are discussed. In the following, a platform is developed as an intermediate step toward developing an intell...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

متن کامل
عنوان ژورنال:
  • Bioinformatics

دوره 19 9  شماره 

صفحات  -

تاریخ انتشار 2003