Bayesian Grouped Variable Selection

Author

  • Sudhir Shankar Raman
Abstract

Variable selection in linear regression has traditionally been approached with optimization-based methods such as the classical Lasso. Such methods yield a sparse point estimate of the regression coefficients but provide no further information about their distribution, such as expectations or variance estimates. In recent years there has been progress on Bayesian formulations of variable selection, for example the Bayesian Lasso. Motivated by these developments, in this thesis we build an omnibus Bayesian framework for grouped-variable selection in linear regression models. The framework summarizes the posterior distribution over the regression coefficients with estimates of its moments and its mode. Inference is carried out using Markov chain Monte Carlo (MCMC) sampling. The estimate of the posterior mode is generated by the same MCMC sampling algorithm, with minimal changes, using simulated annealing. Going beyond simple linear regression, the framework is extended, again with minimal changes, to generalized linear models such as Poisson and binomial models. On the algorithmic side, we develop a highly efficient MCMC sampling algorithm for inference. Apart from the Poisson and binomial models, the framework also incorporates the Weibull model, which is used extensively in survival analysis. This extension is combined with an additional clustering component using a survival mixture-of-experts model. The clustering component performs variable selection (per cluster) simultaneously with cluster identification using Dirichlet processes, which avoids the need to fix the number of clusters in advance.
The resulting framework has been applied to several biological problems, such as identifying novel compound biomarkers for breast cancer from tissue microarray data and analyzing splice-site data to identify distinguishing features of true splice sites. Survival data for breast cancer patients have been used to identify low-risk and high-risk patients and the significant compound markers of each group.
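
The abstract's idea of obtaining both posterior moments and a posterior mode from one sampler can be illustrated with a minimal sketch. The code below is not the thesis's algorithm (which is described only as highly efficient). It is a generic random-walk Metropolis sampler, and it assumes a Gaussian likelihood with known noise variance and a group-Lasso-type prior proportional to exp(-lam * sum_g sqrt(p_g) * ||beta_g||). All names (`log_posterior`, `metropolis`, `simulate_data`, `group_norm`, `lam`) and the toy data are hypothetical. With `anneal=False` the chain targets the posterior itself, and with `anneal=True` the same chain is cooled so that it concentrates near the mode.

```python
import math
import random


def group_norm(beta, idx):
    """Euclidean norm of the coefficients belonging to one group."""
    return math.sqrt(sum(beta[j] ** 2 for j in idx))


def log_posterior(beta, X, y, groups, lam, sigma2=1.0):
    """Unnormalized log posterior: Gaussian likelihood plus an assumed
    group-Lasso-type prior (illustrative, not the thesis's exact prior)."""
    rss = 0.0
    for row, target in zip(X, y):
        pred = sum(b * x for b, x in zip(beta, row))
        rss += (target - pred) ** 2
    penalty = sum(math.sqrt(len(idx)) * group_norm(beta, idx) for idx in groups)
    return -rss / (2.0 * sigma2) - lam * penalty


def metropolis(X, y, groups, lam, n_iter=5000, step=0.1, anneal=False, seed=0):
    """Single-coordinate random-walk Metropolis.

    With anneal=True the temperature decreases linearly from 1 towards 0
    (simulated annealing), so the chain settles near the posterior mode.
    Returns all visited states and the best state seen.
    """
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    current = log_posterior(beta, X, y, groups, lam)
    best, best_lp = list(beta), current
    samples = []
    for t in range(n_iter):
        temp = max(1.0 - t / n_iter, 1e-3) if anneal else 1.0
        proposal = list(beta)
        proposal[rng.randrange(len(beta))] += rng.gauss(0.0, step)
        cand = log_posterior(proposal, X, y, groups, lam)
        # Accept with probability min(1, exp((cand - current) / temp)).
        if math.log(1.0 - rng.random()) < (cand - current) / temp:
            beta, current = proposal, cand
            if current > best_lp:
                best, best_lp = list(beta), current
        samples.append(list(beta))
    return samples, best


def simulate_data(n=40, seed=1):
    """Toy data: the first group is active, the second group is inactive."""
    rng = random.Random(seed)
    true_beta = [2.0, -1.0, 0.0, 0.0]
    X = [[rng.gauss(0.0, 1.0) for _ in true_beta] for _ in range(n)]
    y = [sum(b * x for b, x in zip(true_beta, row)) + rng.gauss(0.0, 0.5)
         for row in X]
    return X, y


groups = [[0, 1], [2, 3]]
X, y = simulate_data()
samples, mode = metropolis(X, y, groups, lam=5.0, anneal=True)
print("approximate mode:", [round(b, 2) for b in mode])
```

A random-walk sampler over a continuous prior does not return exact zeros, so in this sketch an inactive group shows up as a group norm close to zero rather than as an exactly sparse estimate. Running the same function with `anneal=False` and averaging the states after a burn-in period gives a rough estimate of the posterior mean.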

Similar resources

The Bayesian Elastic Net: Classifying Multi-Task Gene-Expression Data

Highly correlated relevant features are frequently encountered in variable-selection problems, with gene-expression analysis an important example. It is desirable to select all of these highly correlated features simultaneously as a group, for better model interpretation and robustness. Further, irrelevant features should be excluded, resulting in a sparse solution (of importance for avoiding o...

Measuring the Risk of Public Contracts Using Bayesian Classifiers

Bayesian Classifiers are widely used in machine learning supervised models where there is a reasonable reliability in the dependent variable. This work aims to create a risk measurement model of companies that negotiate with the government using indicators grouped into four risk dimensions: operational capacity, history of penalties and findings, bidding profile, and political ties. It is expec...

The Impact of Different Genetic Architectures on Accuracy of Genomic Selection Using Three Bayesian Methods

Genome-wide evaluation uses the associations of a large number of single nucleotide polymorphism (SNP) markers across the whole genome and then combines statistical methods with genomic data to predict genetic values. Genomic prediction relies on linkage disequilibrium (LD) between genetic markers and quantitative trait loci (QTL) in a population. Methods that use all markers simultaneo...

Bayesian Variable Selections for Probit Models with Componentwise Gibbs Samplers

For variable selection in binary response regression, stochastic search variable selection and the Bayesian Lasso have recently been popular. However, these two variable selection methods suffer from a heavy computational burden caused by hyperparameter tuning and by matrix inversions, especially when the number of covariates is large. Therefore, this article incorporates the componentwise Gibbs sampl...

Variable selection via the grouped weighted lasso for factor analysis models

L1 regularization, such as the lasso, has been widely used in regression analysis because it tends to produce some coefficients that are exactly zero, which leads to variable selection. We consider the problem of variable selection for factor analysis models via the L1 regularization procedure. In order to select variables each of which is controlled by multiple parameters, we treat parameters ...

Publication date: 2012