Finding Sparse Features in Strongly Confounded Medical Binary Data
نویسندگان
چکیده
A typical task in statistical genetics is to find a sparse linear relation between genotypes with phenotypes, but often the data are confounded by age, ethnicity or population structure. We generalize the linear mixed model (LMM) Lasso approach for feature selection under confounding to the case of binary labels. This case is much more involved, as marginalization over the correlated noise leads to an intractable integral. We can overcome this problem with approximate inference techniques. We demonstrate on synthetic and real-world data that the sparse features that our method finds are less correlated with the top confounders. Introduction Genetic association studies have emerged as an important field of statistical genetics [1, 2]. In this class of problems, we associate high dimensional vectors of genotypes, such as SNPs or gene expression levels, with observable outcomes or phenotypes. These outcomes may be binary, such as the risk of getting a certain disease. For various diseases such as type 2 diabetes [3], the sparse linear effects that relate genotypes and phenotypes are largely undetected, which is why these missing associations have been entitled the The Dark Matter of Genomic Associations [4]. The problem is that these sparse signals can be spurious due to cofounders that induce spurious non-causal correlations between genotypes and phenotypes. Confounding can stem from varying experimental conditions and demographics such as age, ethnicity or gender [5]. The perhaps most important type of confounding in statistical genetics arises due to population structure [6], which is due to the relatedness between the samples [7, 5, 8]. Ignoring such confounders can often lead to spurious false positive findings that cannot be replicated on independent data [9]. Correcting for such confounding dependencies is considered to be one of the greatest challenges in statistical genetics [10]. In this paper, we propose an algorithm for feature selection in binary classification in the presence of confounding. Our goal is to eliminate the confounder as well as possible and find a sparse weight vector that best captures causal relations. Model Our model builds on the LMM-Lasso [11], an important method of statistical genetics to limit the impact of confounding. While the LMM-lasso relies on linear regression, we generalize this approach to the much more involved classification setup, where the target values are binary. Let X ∈ Rd×n be the matrix of n observed data points. The corresponding labels y ∈ {−1,+1} are assumed to be realized according to the following model, y = sign(X⊤w + ε), ε ∼ N (0,Σ), (1) where Σ ∈ Rn×n is a fixed covariance matrix and the model parameter w ∈ R is unknown. As in the LMM-lasso, the noise covariance Σ captures similarities between the samples that offer an alternative explanation of the observed labels. This way, the sparse weight vector w focuses on strong sparse signals that can not be well explained in terms of correlated noise. We choose Σ = λ1I + λ2X ⊤X + λ3Σside, where Σside is constructed from side information such as age and geographical location. The weights λi are cross-validated. In order to train the model we aim to maximize the marginal likelihood in the presence of a l1-norm regularizer (Lasso). This leads to the objective function
منابع مشابه
VLE Predictions of Strongly Non-Ideal Binary Mixtures by Modifying Van Der Waals and Orbey-Sandler Mixing Rules
By proposing a predictive method with no adjustable parameter and by using infinite dilution activity coefficients of components in binary mixtures obtained from UNIFAC model, the binary interaction parameters (k12) in van der Waals mixing rule (vdWMR) and Orbey-Sandler mixing rule (OSMR) have been evaluated. The predicted binary interaction parameters are used in Peng-Robinson-S...
متن کاملHyperspectral Image Classification Based on the Fusion of the Features Generated by Sparse Representation Methods, Linear and Non-linear Transformations
The ability of recording the high resolution spectral signature of earth surface would be the most important feature of hyperspectral sensors. On the other hand, classification of hyperspectral imagery is known as one of the methods to extracting information from these remote sensing data sources. Despite the high potential of hyperspectral images in the information content point of view, there...
متن کاملBiclustering Sparse Binary Genomic Data
Genomic datasets often consist of large, binary, sparse data matrices. In such a dataset, one is often interested in finding contiguous blocks that (mostly) contain ones. This is a biclustering problem, and while many algorithms have been proposed to deal with gene expression data, only two algorithms have been proposed that specifically deal with binary matrices. None of the gene expression bi...
متن کاملSparse Structured Principal Component Analysis and Model Learning for Classification and Quality Detection of Rice Grains
In scientific and commercial fields associated with modern agriculture, the categorization of different rice types and determination of its quality is very important. Various image processing algorithms are applied in recent years to detect different agricultural products. The problem of rice classification and quality detection in this paper is presented based on model learning concepts includ...
متن کاملMining non-redundant high order correlations in binary data
Many approaches have been proposed to find correlations in binary data. Usually, these methods focus on pair-wise correlations. In biology applications, it is important to find correlations that involve more than just two features. Moreover, a set of strongly correlated features should be non-redundant in the sense that the correlation is strong only when all the interacting features are consid...
متن کامل