Guarding against Spurious Discoveries in High Dimensions

Authors

  • Jianqing Fan
  • Wen-Xin Zhou
Abstract

Many data-mining and statistical machine learning algorithms have been developed to select a subset of covariates to associate with a response variable. Spurious discoveries can easily arise in high-dimensional data analysis because of the enormous number of possible selections. How can we know statistically that our discoveries are better than those made by chance? In this paper, we define a measure of goodness of spurious fit, which shows how well a response variable can be fitted by an optimally selected subset of covariates under the null model, and propose a simple and effective LAMM algorithm to compute it. It coincides with the maximum spurious correlation for linear models and can be regarded as a generalized maximum spurious correlation. We derive the asymptotic distribution of this goodness of spurious fit for generalized linear models and L1-regression. The asymptotic distribution depends on the sample size, the ambient dimension, the number of variables used in the fit, and the covariance information. It can be consistently estimated by multiplier bootstrapping and used as a benchmark to guard against spurious discoveries. It can also be applied to model selection, which considers only candidate models whose goodness of fit is better than that of spurious fits. The theory and method are convincingly illustrated by simulated examples and an application to the binary outcomes from German Neuroblastoma Trials.
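To make the idea of a spurious-fit benchmark concrete, the following is a minimal sketch for the simplest case only: a linear model with a single selected covariate, where the goodness of spurious fit reduces to the maximum absolute sample correlation. It is not the paper's LAMM algorithm or its multiplier bootstrap; plain Monte Carlo simulation under an assumed null with independent standard normal covariates is swapped in, and the function names `max_abs_correlation` and `null_quantile` are hypothetical.

```python
import numpy as np


def max_abs_correlation(X, y):
    """Largest absolute sample correlation between y and any single column of X."""
    Xc = X - X.mean(axis=0)          # center each covariate
    yc = y - y.mean()                # center the response
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.max(np.abs(corr))


def null_quantile(n, p, level=0.95, n_sim=500, seed=0):
    """Monte Carlo quantile of the maximum spurious correlation.

    Assumption (not from the paper): covariates and response are independent
    standard normal draws, so any correlation found is spurious by design.
    """
    rng = np.random.default_rng(seed)
    stats = np.empty(n_sim)
    for b in range(n_sim):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)   # response unrelated to every covariate
        stats[b] = max_abs_correlation(X, y)
    return np.quantile(stats, level)


# Benchmark for n = 50 observations and p = 1000 candidate covariates.
n, p = 50, 1000
threshold = null_quantile(n, p, n_sim=200)
print(f"95% null quantile of the max spurious correlation: {threshold:.3f}")
```

Under these assumptions, an observed best single-covariate correlation below `threshold` would be no better than what chance alone produces at that sample size and dimension. Correlated covariates or more than one selected variable would need the covariance-aware procedure described in the abstract.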

Similar resources

Guarding from Spurious Discoveries in High Dimension

Many data-mining and statistical machine learning algorithms have been developed to select a subset of covariates to associate with a response variable. Spurious discoveries can easily arise in high-dimensional data analysis because of the enormous number of possible selections. How can we know statistically that our discoveries are better than those made by chance? In this paper, we define a measure of goodness ...

A total variation diminishing high resolution scheme for nonlinear conservation laws

In this paper we propose a novel high resolution scheme for scalar nonlinear hyperbolic conservation laws. The aim of high resolution schemes is to provide at least second order accuracy in smooth regions and produce sharp solutions near the discontinuities. We prove that the proposed scheme, which is derived by utilizing an appropriate flux limiter, is nonlinearly stable in the sense of total varia...

Spurious Hyperleukocytosis

Hyperleukocytosis is an oncological emergency but is extremely rare in non-malignant conditions. Nucleated RBCs give rise to a spuriously high total leucocyte count and cause a clinical dilemma. Thalassemia major patients are known to have leucocytosis even after correction for nucleated RBCs. We report a case of undiagnosed Thalassemia major in a 4-month-old infant with total leucocyte count highe...

Categories or dimensions: lessons learned from a taxometric analysis of Adult Attachment Interview data.

Booth-LaForce and Roisman's monograph on the Adult Attachment Interview (AAI) featured a taxometric analysis to determine whether variation along two components, dismissing and preoccupied states of mind, was categorical or dimensional. Empirically evaluating the latent structure of these constructs helps to avoid spurious categories or dimensions. This benefits researchers working with measure...

Mate-guarding courtship behaviour: tactics in a changing world

DOI: http://dx.doi.org/10.1016/j.anbehav.2014.08.007. Mate guarding is one of the most common tactics in sperm competition. Males are expected to guard their mates when costs of guarding (accrued from physical confrontations with rivals and/or reduced foraging) are low relative to the benefits of ensuring mating opportunities and paternity. We inves...

Journal title:
  • Journal of machine learning research : JMLR

Volume 17  Issue 

Pages  -

Publication date 2016