Carefully Approximated Bayes Factors for Feature Selection in MaxEnt Models

Author

  • Hal Daumé
Abstract

Feature selection is essentially a model selection problem. If we take a frequentist maximum likelihood approach, we will, in the limit, select all features (unless, as is typical, we apply some sort of “early stopping” criterion). Additionally, by basing the choice of the next feature solely on standard measures such as likelihood gain, we fail to account for the variance of the estimate of this feature. In this note, I carefully derive an approximation to the Bayes factor in the feature/model selection problem for maximum entropy models. See [2] for an introduction to the use of maximum entropy models in the natural language processing domain. The advantages of using a Bayesian criterion for model selection are numerous, but the two strongest are that (a) it enables us to take into account uncertainty in likelihood when adding new features and (b) it allows us to decide when to stop adding features.
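
For orientation, the quantity the abstract refers to can be written out explicitly. The block below gives the standard definition of a Bayes factor and the generic Laplace approximation to the marginal likelihood. The notation is assumed for illustration and does not come from the note: $M_0$ is the current feature set, $M_1$ adds one candidate feature, $\lambda$ are the maximum entropy weights, and $D$ is the training data. This is a sketch of the general idea, not necessarily the approximation the note derives.

```latex
% Assumed notation (not taken from the source):
%   M_0 = current feature set, M_1 = M_0 plus one candidate feature,
%   \lambda = maxent weights, D = training data,
%   \hat\lambda = MAP estimate under model M, d = number of weights in M,
%   H = Hessian of -\log [ p(D | \lambda, M) p(\lambda | M) ] at \hat\lambda.
\begin{align}
  \mathrm{BF}_{10}
    &= \frac{p(D \mid M_1)}{p(D \mid M_0)},
  \qquad
  p(D \mid M) = \int p(D \mid \lambda, M)\, p(\lambda \mid M)\, d\lambda \\
  \log p(D \mid M)
    &\approx \log p(D \mid \hat\lambda, M) + \log p(\hat\lambda \mid M)
      + \frac{d}{2} \log 2\pi - \frac{1}{2} \log \lvert H \rvert
\end{align}
```

Read this way, a candidate feature would be added only when $\log \mathrm{BF}_{10} > 0$, and selection would stop once no candidate passes that test, which corresponds to advantage (b). The $\log \lvert H \rvert$ term depends on the curvature of the posterior around $\hat\lambda$, so it is one place where uncertainty in the estimate, advantage (a), enters the comparison.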

Similar resources

A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier

With the rapid increase in the number of documents, the use of Text Document Classification (TDC) methods has become crucial. This paper presents a hybrid model of Invasive Weed Optimization (IWO) and the Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS), in order to reduce the large size of the feature space in TDC. TDC includes different actions such as text processing, feature extraction, form...

Comparison of Different Statistical Methods in Genomic Selection of Holstein Cattle

Genomic selection combines statistical methods with genomic data to predict genetic values for complex traits. The accuracy of prediction of genetic values in selected population has a great effect on the success of this selection method. Accuracy of genomic prediction is highly dependent on the statistical model used to estimate marker effects in reference population. Various factors such a...

IRISM @ NTCIR-12 Temporalia Task: Experiments with MaxEnt, Naive Bayes and Decision Tree Classifiers

This paper describes our participation in Temporal Intent Disambiguation (TID), which is a subtask of the pilot task of the NTCIR’12 Temporal Information Access (Temporalia-2) task [6]. We considered the task as a slight variation of a supervised machine learning classification problem. Our strategy involves building models on different standard classifiers based on probabilistic and entropy models f...

A New Hybrid Method for Improving the Performance of Myocardial Infarction Prediction

Introduction: Myocardial Infarction, also known as heart attack, normally occurs due to such causes as smoking, family history, diabetes, and so on. It is recognized as one of the leading causes of death in the world. Therefore, the present study aimed to evaluate the performance of classification models in order to predict Myocardial Infarction, using a feature selection method tha...

Ensemble Classification and Extended Feature Selection for Credit Card Fraud Detection

Due to the rise of technology, the possibility of fraud in different areas such as banking has increased. Credit card fraud is a crucial problem in banking and its danger is ever increasing. This paper proposes an advanced data mining method, considering both feature selection and decision cost for accuracy enhancement of credit card fraud detection. After selecting the best and most effec...

Publication date: 2004