Efficient Learning of Sparse Gaussian Mixture Models of Protein Conformational Substates
Abstract
Molecular Dynamics (MD) simulations are an important technique for studying the conformational dynamics of proteins in computational structural biology. Traditional methods for analyzing MD simulations assume a single conformational state underlying the data. With recent advances in simulation technology, MD can now produce massive, long-time-scale trajectories that span multiple conformational substates, so efficient new methods are needed to analyze these trajectories and to learn the structural dynamics of the substates. In this thesis, we develop new methods to learn parametric and semi-parametric sparse generative models from the positional fluctuations of amino acid residues in the simulation. Specifically, our methods learn a mixture of sparse Gaussian or nonparanormal distributions, in which each mixture component encodes the statistics of a different substate. L1 regularization is used to produce sparse graphical models that are easier to interpret than a simple covariance analysis, because the topology of the graphical model reveals the coupling structure between different parts of the molecule. Our methods also employ coreset sampling to improve scalability. We demonstrate that the resulting models have a number of advantages over traditional Gaussian Mixture Models (GMMs). Experiments on synthetic data show substantial improvements over GMMs in recovering the true network structure, while remaining competitive in terms of test likelihood and imputation error. Experiments on a large real MD data set are consistent with the results on synthetic data. We also demonstrate the benefits of semi-parametric models in terms of likelihood and imputation metrics.
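The interpretability claim above can be illustrated with a toy example. The sketch below is not the thesis's method: it builds a small chain-structured precision (inverse covariance) matrix by hand, with a hypothetical coupling value of 0.4 and illustrative helper names `chain_precision` and `edges`, and compares its sparsity with that of the corresponding covariance matrix. It assumes NumPy is available.

```python
import numpy as np

def chain_precision(n, coupling=0.4):
    """Tridiagonal precision matrix: each variable couples only to its chain neighbours.

    The coupling value is an arbitrary illustrative choice that keeps the
    matrix positive definite; it is not taken from the thesis.
    """
    theta = np.eye(n)
    for i in range(n - 1):
        theta[i, i + 1] = theta[i + 1, i] = -coupling
    return theta

def edges(matrix, tol=1e-8):
    """Return off-diagonal index pairs (i, j), i < j, whose entry exceeds tol in magnitude."""
    n = matrix.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(matrix[i, j]) > tol]

theta = chain_precision(6)       # sparse graphical-model structure
sigma = np.linalg.inv(theta)     # implied covariance matrix

precision_edges = edges(theta)
covariance_edges = edges(sigma)

print(len(precision_edges), "direct couplings in the precision matrix")
print(len(covariance_edges), "correlated pairs in the covariance matrix")
```

In this toy case the precision matrix contains only the 5 neighbour couplings of the chain, while all 15 pairs of variables are correlated in the covariance matrix. Direct couplings are therefore readable from the zero pattern of the precision matrix but not from the covariance, which is the motivation for estimating a sparse precision matrix with an L1 penalty.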