Sparse Matrix Factorization for Analyzing Gene Expression Patterns

Authors

  • Nathan Srebro
  • Tommi Jaakkola
Abstract

Motivated by the analysis of gene expression data, we develop a new unsupervised modeling technique. Specifically, we study how such data can be modeled via sparse matrix factorization (SMF). Unsupervised modeling using constrained matrix factorization has been studied by Lee and Seung [1, 2, 3]. Under this approach, one unveils structure in a data matrix A ∈ R^{n×d} by approximating it as a product of two matrices, A ≈ C · F, with C ∈ R^{n×k} and F ∈ R^{k×d}, subject to various (e.g., non-negativity) constraints on C and F. We suggest an explicit sparsity constraint on C: each row of C is to have at most m non-zero entries. Setting m = 1, we obtain a clustering of the data rows, where the rows of F indicate the cluster centers. At the other extreme, setting m = k leaves C unconstrained and yields a low-rank approximation, specified by the leading components of the singular value decomposition. Focusing on small values of m, and viewing the rows of F as factors, each row of the data matrix A is approximated as a linear combination of only m of the k factors. In the context of gene expression analysis, where, e.g., the rows of the data matrix correspond to genes and the columns to different experiments, we get a model in which the expression pattern of a gene is explained as a (linear) combination of a few (at most m) underlying factors. This model allows us to capture combinatorial effects and genes that take part in more than one underlying process. Constraining C to be sparse permits us to recover a higher number of interpretable factors than is possible with singular value decomposition [4, 5, 6]. When m < k, finding the best SMF (i.e., finding C and F that best approximate A) is a difficult optimization task. We formulate and investigate several iterative (alternating) maximization techniques in this context.
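
The alternating idea described above can be illustrated with a minimal Python sketch. It is a sketch under stated assumptions, not the authors' algorithm: it minimizes squared error, chooses each row's support by exhaustive search over all size-m subsets of the k factors (practical only for small k and m), and initializes F from randomly chosen data rows. The function names fit_row and smf are hypothetical.

```python
import itertools
import numpy as np

def fit_row(a, F, m):
    """Coefficients for one data row using at most m rows of F.

    Assumption: exhaustive search over supports, least squares on each.
    """
    k = F.shape[0]
    best_err, best_c = np.inf, np.zeros(k)
    for support in itertools.combinations(range(k), m):
        idx = list(support)
        # Least-squares fit of the row on the selected factors only.
        coef, *_ = np.linalg.lstsq(F[idx].T, a, rcond=None)
        err = np.sum((a - coef @ F[idx]) ** 2)
        if err < best_err:
            best_err = err
            best_c = np.zeros(k)
            best_c[idx] = coef
    return best_c

def smf(A, k, m, n_iter=20, seed=0):
    """Alternate between the sparse rows of C and the factors F."""
    rng = np.random.default_rng(seed)
    n, _ = A.shape
    # Assumption: start from k distinct data rows as initial factors.
    F = A[rng.choice(n, size=k, replace=False)].copy()
    errors = []
    for _ in range(n_iter):
        # Step 1: with F fixed, refit every row of C under the sparsity cap.
        C = np.vstack([fit_row(a, F, m) for a in A])
        # Step 2: with C fixed, F is an ordinary least-squares problem.
        F, *_ = np.linalg.lstsq(C, A, rcond=None)
        errors.append(float(np.sum((A - C @ F) ** 2)))
    return C, F, errors

if __name__ == "__main__":
    # Synthetic data with a planted factorization (illustrative only).
    rng = np.random.default_rng(1)
    C_true = np.zeros((30, 4))
    for i in range(30):
        C_true[i, rng.choice(4, size=2, replace=False)] = rng.standard_normal(2)
    A = C_true @ rng.standard_normal((4, 8))
    C, F, errors = smf(A, k=4, m=2)
    print("squared error per iteration:", errors)
```

In this sketch, m = 1 assigns each row to a single factor (up to a scale coefficient), and m = k reduces the row step to unconstrained least squares. Each step cannot increase the squared error, but the procedure may stop at a local optimum.
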
Alternatively, the hard sparsity constraint can be relaxed to regularization penalties on the rows of matrix C, yielding a continuous, and thus easier to handle, optimization problem. We study the statistical problem of reconstructing a sparse matrix factorization in the presence of noise. We determine the conditions under which the factors in F can be reconstructed, and study the problem of recovering the pattern of zeroes in C as an error-correcting code, whose error-correction properties can be determined as a function of the noise level. We also address the model selection problem of identifying meaningful settings of the number of factors k and the polymorphicity m. The primary goal of this work is to provide a large-scale functional genomics analysis tool using gene expression and other data sources. Beyond using SMF to recover underlying factors and structure among genes, we use SMF to extend partial factor realizations (some factors fixed according to known profiles of transcriptional activators). Moreover, we recover expression profiles for factors identified by common sequence motifs. We also explore the connection of SMF to other learning and inference tasks. SMF can be viewed as a class of probabilistic relational models (PRMs), similar to the ones suggested by Segal et al. [7] for analyzing gene expression data. Moreover, SMF can be seen as a technique for independent component analysis (ICA), where the sparsity requirement serves as an additional (symmetry-breaking) regularization constraint. Lee and Seung suggested viewing constrained matrix factorization as a coding of the rows in A, a point of view explicitly taken in our analysis.
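
The relaxed formulation mentioned in the abstract can be sketched in Python as follows. The abstract does not say which penalty is used, so this sketch assumes an L1 penalty on the entries of C and solves only the C-step for a fixed, non-zero F by proximal gradient descent (ISTA). The names soft_threshold and l1_codes and the parameter lam are hypothetical.

```python
import numpy as np

def soft_threshold(X, t):
    """Shrink every entry of X toward zero by t (proximal map of t * |x|)."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def l1_codes(A, F, lam, n_iter=200):
    """Minimize 0.5 * ||A - C F||^2 + lam * sum|C| over C, with F held fixed.

    Assumption: an L1 penalty stands in for the unspecified regularizer.
    """
    # Lipschitz constant of the smooth term's gradient: squared spectral norm of F.
    L = np.linalg.norm(F, 2) ** 2
    C = np.zeros((A.shape[0], F.shape[0]))
    history = []
    for _ in range(n_iter):
        grad = (C @ F - A) @ F.T          # gradient of the squared-error term
        C = soft_threshold(C - grad / L, lam / L)
        objective = 0.5 * np.sum((A - C @ F) ** 2) + lam * np.abs(C).sum()
        history.append(float(objective))
    return C, history

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    F = rng.standard_normal((4, 8))
    A = rng.standard_normal((30, 4)) @ F
    C, history = l1_codes(A, F, lam=0.5)
    print("non-zero entries in C:", np.count_nonzero(C))
```

Unlike the hard constraint, the penalty does not cap the number of non-zero entries per row at m; larger values of lam merely push more coefficients to exactly zero. A full relaxed method would also update F and fix its scale, which this sketch leaves out.
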


Similar resources

Iterative Weighted Non-smooth Non-negative Matrix Factorization for Face Recognition

Non-negative Matrix Factorization (NMF) is a part-based image representation method. It comes from the intuitive idea that an entire face image can be constructed by combining several parts. In this paper, we propose a framework for face recognition by finding localized, part-based representations, denoted “Iterative weighted non-smooth non-negative matrix factorization” (IWNS-NMF). A new cost fun...


Voice-based Age and Gender Recognition using Training Generative Sparse Model

Abstract: Gender recognition and age detection are important problems in telephone speech processing to investigate the identity of an individual using voice characteristics. In this paper a new gender and age recognition system is introduced based on generative incoherent models learned using sparse non-negative matrix factorization and atom correction post-processing method. Similar to genera...


Identifying Repeated Patterns in Music Using Sparse Convolutive Non-negative Matrix Factorization

We describe an unsupervised, data-driven, method for automatically identifying repeated patterns in music by analyzing a feature matrix using a variant of sparse convolutive non-negative matrix factorization. We utilize sparsity constraints to automatically identify the number of patterns and their lengths, parameters that would normally need to be fixed in advance. The proposed analysis is app...


Sparse Matrix Factorization of Gene Expression Data

Motivation: Gene expression data consists of expression level reads for thousands of genes across dozens of experimental conditions, time points, cell types or repeated experiments. The goal of unsupervised modeling of such data is to find some underlying organization, structure or redundancy in the data, such as similarity or dependency between genes or between experiments. Such structure can ...


Robust hierarchical image representation using non-negative matrix factorization with sparse code shrinkage preprocessing

When analyzing patterns, our goals are (i) to find structure in the presence of noise, (ii) to decompose the observed structure into sub-components, and (iii) to use the components for pattern completion. Here, a novel loop architecture is introduced to perform these tasks in an unsupervised manner. The architecture combines sparse code shrinkage with non-negative matrix factorization and blend...


Improving molecular cancer class discovery through sparse non-negative matrix factorization

MOTIVATION: Identifying different cancer classes or subclasses with similar morphological appearances presents a challenging problem and has important implications in cancer diagnosis and treatment. Clustering based on gene-expression data has been shown to be a powerful method in cancer class discovery. Non-negative matrix factorization is one such method and was shown to be advantageous over ot...




Publication date: 2001