Variable selection in model-based discriminant analysis

Authors

  • Cathy Maugis
  • Gilles Celeux
  • Marie-Laure Martin-Magniette
Abstract

A general methodology for selecting predictors for Gaussian generative classification models is presented. The problem is regarded as a model selection problem. Three different roles for each possible predictor are considered: a variable can be a relevant classification predictor or not, and the irrelevant classification variables can either be linearly dependent on a subset of the relevant predictors or be independent variables. This variable selection model was inspired by the model-based clustering model of Maugis et al. (2009b). A BIC-like model selection criterion is proposed. It is optimized through two embedded forward stepwise variable selection algorithms, one for classification and one for linear regression. The model identifiability and the consistency of the variable selection criterion are proved. Numerical experiments on simulated and real data sets illustrate the value of this variable selection methodology. In particular, it is shown that this well-grounded variable selection model can substantially improve the classification performance of quadratic discriminant analysis in high-dimensional settings.

Key-words: Discriminant, redundant or independent variables, Variable selection, Gaussian classification models, Linear regression, BIC

  • ∗ Institut de Mathématiques de Toulouse, INSA de Toulouse, Université de Toulouse
  • † INRIA Saclay Île-de-France, Projet select, Université Paris-Sud 11
  • ‡ UMR AgroParisTech/INRA MIA 518, Paris
  • § URGV UMR INRA 1165, UEVE, ERL CNRS 8196, Evry

Sélection de variables pour l'analyse discriminante gaussienne (Variable selection for Gaussian discriminant analysis)

Résumé: We propose a general methodology for variable selection in discriminant analysis with Gaussian generative models. The problem is approached as one of model choice. The competing variables can play three roles: they are either useful predictors for supervised classification, or redundant variables linked to the predictors by a linear regression, or independent variables. This model is directly inspired by the model of Maugis et al. (2009b) for variable selection in unsupervised classification with Gaussian mixture models. A BIC-type criterion is proposed to choose the role of the variables. This criterion is optimized by two nested forward stepwise selection algorithms, in which earlier choices can be reconsidered, for classification and for regression. We establish the identifiability of our model and prove the asymptotic optimality of our criterion. We illustrate the good performance of our approach with experiments on simulated and real data. In particular, we show that our variable selection methodology can benefit quadratic discriminant analysis in high dimension.

Key-words: Discriminant, redundant or independent variables, Variable selection, Gaussian supervised classification, Linear regression, BIC
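The idea of choosing variable roles with a BIC-type criterion and a forward stepwise search can be illustrated with a small sketch. This is not the authors' algorithm: it is a simplified, hypothetical version that only performs forward inclusion (no removal step and no separate "independent" role), and the function names `class_loglik`, `regression_loglik` and `forward_select` are assumptions introduced for illustration. A candidate variable is added when modelling it with class-specific Gaussians (jointly with the already selected variables) gives a better BIC than explaining it by a class-free linear regression on the selected variables.

```python
import numpy as np

def class_loglik(X, y):
    """Max. log-likelihood and parameter count of class-specific
    full-covariance Gaussians (a QDA-type model) fitted to X."""
    d = X.shape[1]
    if d == 0:
        return 0.0, 0
    loglik, n_params = 0.0, 0
    for k in np.unique(y):
        Xk = X[y == k]
        nk = Xk.shape[0]
        cov = np.atleast_2d(np.cov(Xk, rowvar=False, bias=True))  # ML covariance
        _, logdet = np.linalg.slogdet(cov)
        loglik += -0.5 * nk * (d * np.log(2 * np.pi) + logdet + d)
        n_params += d + d * (d + 1) // 2
    return loglik, n_params

def regression_loglik(x, Z):
    """Max. log-likelihood and parameter count of a Gaussian linear
    regression of x on the columns of Z (with intercept), ignoring classes."""
    n = x.shape[0]
    design = np.column_stack([np.ones(n), Z])
    beta, *_ = np.linalg.lstsq(design, x, rcond=None)
    sigma2 = np.mean((x - design @ beta) ** 2)  # ML residual variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik, design.shape[1] + 1

def forward_select(X, y):
    """Greedy forward inclusion driven by a BIC difference (simplified sketch)."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    while remaining:
        ll_s, k_s = class_loglik(X[:, selected], y)
        gains = {}
        for j in remaining:
            ll_sj, k_sj = class_loglik(X[:, selected + [j]], y)
            ll_r, k_r = regression_loglik(X[:, j], X[:, selected])
            # BIC(j relevant) minus BIC(j explained by regression on selected)
            gains[j] = 2 * (ll_sj - ll_s - ll_r) - (k_sj - k_s - k_r) * np.log(n)
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Simulated illustration (made-up data): x0 separates the classes,
# x1 depends linearly on x0, x2 is independent noise.
rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)
x0 = rng.normal(loc=3.0 * y, scale=1.0)
x1 = 0.5 * x0 + rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
selected = forward_select(X, y)
print(selected)
```

In this simulated setting one would expect only the first variable to be retained, since the second is redundant given the first and the third carries no class information. Comparing a class-specific model against a regression on the selected variables is what keeps the BIC values comparable across subsets: every candidate variable is modelled under both alternatives.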

Similar resources

On Model-Based Clustering, Classification, and Discriminant Analysis

The use of mixture models for clustering and classification has burgeoned into an important subfield of multivariate analysis. These approaches have been around for a half-century or so, with significant activity in the area over the past decade. The primary focus of this paper is to review work in model-based clustering, classification, and discriminant analysis, with particular attenti...

Variable Selection and Updating in Model-based Discriminant Analysis for High Dimensional Data with Food Authenticity Applications by Thomas

Food authenticity studies are concerned with determining if food samples have been correctly labeled or not. Discriminant analysis methods are an integral part of the methodology for food authentication. Motivated by food authenticity applications, a model-based discriminant analysis method that includes variable selection is presented. The discriminant analysis model is fitted in a semi-superv...

Variable Selection and Updating In Model-Based Discriminant Analysis for High-Dimensional Data

A model-based discriminant analysis method that includes variable selection is presented. The discriminant analysis model is fitted in a semi-supervised manner using both labeled and unlabeled data. The method is shown to give excellent classification performance on several high-dimensional multiclass datasets with more variables than observations. The variables selected by the proposed method ...

Variable Selection Method Affects SVM Approach in Bankruptcy Prediction

This paper examined the bankruptcy predictive accuracy of five statistical models (discriminant analysis, logistic regression, probit regression, neural networks, support vector machine (SVM), and genetic-based SVM (GA-SVM)) as influenced by variable selection. Empirical results indicate that the SVM-based models are very promising models for predicting financial failure, in terms of both best predic...

Variable Selection and Updating In Model-Based Discriminant Analysis for High Dimensional Data with Food Authenticity Applications.

Food authenticity studies are concerned with determining if food samples have been correctly labelled or not. Discriminant analysis methods are an integral part of the methodology for food authentication. Motivated by food authenticity applications, a model-based discriminant analysis method that includes variable selection is presented. The discriminant analysis model is fitted in a semi-super...

Journal:
  • J. Multivariate Analysis

Volume 102

Publication date: 2011