Risk bounds for purely uniformly random forests

Author

  • Robin Genuer
Abstract

Random forests, introduced by Leo Breiman in 2001, are a very effective statistical method. The complex mechanism of the method makes theoretical analysis difficult. A simplified version, called purely random forests, which is easier to handle theoretically, has therefore been considered. In this paper we introduce a variant of this kind of random forest, which we call purely uniformly random forests. In the context of regression problems with a one-dimensional predictor space, we show that both random trees and random forests reach the minimax rate of convergence. In addition, we prove that, compared to random trees, random forests improve accuracy by reducing the estimator variance by a factor of three fourths.

Keywords: Random Forests, Non-parametric regression, Rate of convergence, Randomization.

Affiliations: Univ Paris-Sud, Laboratoire de Mathématique, UMR 8628, Orsay F-91405; Inria Saclay Ile-de-France.

Identifier: inria-00492231, version 1, 15 Jun 2010.

French title: Bornes de risque pour les forêts purement uniformément aléatoires.
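The abstract describes the setting only in words, so the following is a minimal sketch of what a purely uniformly random tree and forest could look like for a one-dimensional predictor on [0, 1]. It assumes that each tree draws `k` split points uniformly at random, independently of the data, and predicts the mean response in each resulting cell, and that the forest averages the trees. The function names, the `seed` parameter, and the convention that an empty cell predicts 0 are assumptions made for illustration, not details taken from the paper.

```python
import bisect
import random


def fit_tree(xs, ys, k, rng):
    """Fit one tree: k uniform random splits on [0, 1], then cell means."""
    # Split points are drawn without looking at the data ("purely" random).
    splits = sorted(rng.random() for _ in range(k))
    sums = [0.0] * (k + 1)
    counts = [0] * (k + 1)
    for x, y in zip(xs, ys):
        cell = bisect.bisect_right(splits, x)  # index of the cell containing x
        sums[cell] += y
        counts[cell] += 1
    # Assumed convention: a cell with no observations predicts 0.
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return splits, means


def predict_tree(tree, x):
    """Return the mean response of the cell that contains x."""
    splits, means = tree
    return means[bisect.bisect_right(splits, x)]


def fit_forest(xs, ys, k, n_trees, seed=0):
    """Fit n_trees independent random trees on the same sample."""
    rng = random.Random(seed)
    return [fit_tree(xs, ys, k, rng) for _ in range(n_trees)]


def predict_forest(forest, x):
    """Average the predictions of all trees in the forest."""
    return sum(predict_tree(tree, x) for tree in forest) / len(forest)
```

With `k = 0` every tree has a single cell, so the forest simply returns the sample mean of the responses; larger `k` gives finer random partitions, and averaging over trees smooths the piecewise-constant tree estimates.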

Similar resources

Analysis of purely random forests bias

Random forests are a very effective and commonly used statistical method, but their full theoretical analysis is still an open problem. As a first step, simplified models such as purely random forests have been introduced, in order to shed light on the good performance of random forests. In this paper, we study the approximation error (the bias) of some purely random forest models in a regressi...

Bounds for Functions of Multivariate Risks

Abstract. Li, Scarsini, and Shaked (1996a) provide bounds on the distribution and the tail for functions of dependent random vectors having fixed multivariate marginals. In this paper, we correct a result stated in the above article and we give improved bounds in the case of the sum of identically distributed random vectors. Moreover, we provide the dependence structures meeting the bounds when...

Coalescent Random Forests

Various enumerations of labeled trees and forests, including Cayley's formula n^{n-2} for the number of trees labeled by [n], and Cayley's multinomial expansion over trees, are derived from the following coalescent construction of a sequence of random forests (R_n, R_{n-1}, ..., R_1) such that R_k has uniform distribution over the set of all forests of k rooted trees labeled by [n]. Let R_n be the trivial...

Some Infinity Theory for Predictor Ensembles

To dispel some of the mystery about what makes tree ensembles work, they are looked at in distribution space i.e. the limit case of "infinite" sample size. It is shown that the simplest kind of trees are complete in D-dimensional space if the number of terminal nodes T is greater than D. For such trees we show that the Adaboost minimization algorithm gives an ensemble converging to the Bayes ri...

Comparison of Random Survival Forests for Competing Risks and Regression Models in Determining Mortality Risk Factors in Breast Cancer Patients in Mahdieh Center, Hamedan, Iran

Introduction: Breast cancer is one of the most common cancers among women worldwide. Patients with cancer may die due to disease progression or other types of events. These different event types are called competing risks. This study aimed to determine the factors affecting the survival of patients with breast cancer using three different approaches: cause-specific hazards regression, subdistri...

Publication date: 2010