TD(λ) Converges with Probability 1

Author

  • Richard Sutton

Abstract

The methods of temporal differences (Samuel, 1959; Sutton, 1984, 1988) allow an agent to learn accurate predictions of stationary stochastic future outcomes. The learning is effectively stochastic approximation based on samples extracted from the process generating the agent's future. Sutton (1988) proved that for a special case of temporal differences the expected values of the predictions converge to their correct values as more samples are taken, and Dayan (1992) extended his proof to the general case. This article proves the stronger result that the predictions of a slightly modified form of temporal-difference learning converge with probability one, and shows how to quantify the rate of convergence.

Keywords: reinforcement learning, temporal differences, Q-learning
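To make the learning procedure concrete, here is a minimal tabular TD(λ) sketch with accumulating eligibility traces. The function name, episode layout, and step sizes are illustrative choices of this summary, not taken from the paper.

```python
import numpy as np

def td_lambda(episodes, n_states, alpha=0.1, gamma=1.0, lam=0.9):
    """Tabular TD(lambda) with accumulating eligibility traces.

    `episodes` is a list of trajectories, each a list of
    (state, reward, next_state, done) tuples.  All names and
    hyperparameters here are illustrative assumptions.
    """
    V = np.zeros(n_states)
    for episode in episodes:
        e = np.zeros(n_states)              # eligibility traces
        for s, r, s_next, done in episode:
            target = r if done else r + gamma * V[s_next]
            delta = target - V[s]           # TD error
            e[s] += 1.0                     # accumulate trace for s
            V += alpha * delta * e          # update all traced states
            e *= gamma * lam                # decay traces
    return V
```

With λ = 0 this reduces to one-step TD(0), and as λ approaches 1 the updates approach Monte Carlo estimates; the convergence result of the article concerns this family of predictors.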


Similar articles

On the Convergence of Stochastic … (Neural Computation)

Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(λ) algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of conve...
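The DP connection the snippet mentions can be illustrated with a single tabular Q-learning backup in the style of Watkins (1989); the array layout and step size below are assumptions of this sketch, not taken from the paper.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, done, alpha=0.5, gamma=0.9):
    """One tabular Q-learning update.

    Q is an (n_states, n_actions) array.  The max over next-state
    actions is the DP-style backup the snippet alludes to; names
    and step sizes are illustrative.
    """
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

Repeating this update along sampled transitions performs a stochastic approximation of the Bellman optimality backup of dynamic programming.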


Pointwise Convergence of Some Multiple Ergodic Averages

We show that for every ergodic system $(X, \mu, T_1, \dots, T_d)$ with commuting transformations, the average $$\frac{1}{N^{d+1}} \sum_{0 \le n_1,\dots,n_d \le N-1} \; \sum_{0 \le n \le N-1} f_1\Big(T_1^{n} \prod_{j=1}^{d} T_j^{n_j} x\Big)\, f_2\Big(T_2^{n} \prod_{j=1}^{d} T_j^{n_j} x\Big) \cdots f_d\Big(T_d^{n} \prod_{j=1}^{d} T_j^{n_j} x\Big)$$ converges for $\mu$-a.e. $x \in X$ as $N \to \infty$. If $X$ is distal, we prove that the average $\frac{1}{N} \sum_{n=0}^{N-1} f_1(T_1^{n} x)\, f_2(T_2^{n} x) \cdots f_d(T_d^{n} x)$ converges for $\mu$-a.e. $x \in X$ as $N \to \infty$...


Skorohod Representation on a given Probability Space

Let (Ω, A, P) be a probability space, S a metric space, μ a probability measure on the Borel σ-field of S, and Xn : Ω → S an arbitrary map, n = 1, 2, . . .. If μ is tight and Xn converges in distribution to μ (in Hoffmann-Jørgensen's sense), then X ∼ μ for some S-valued random variable X on (Ω, A, P). If, in addition, the Xn are measurable and tight, there are S-valued random variables ∼ Xn and ...


Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation

Sutton, Szepesvári and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their “gradient temporal difference” (GTD) algorithm converges reliably, it can be very slow compared to conventional linear...
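As a rough illustration of the gradient-TD family the snippet refers to, here is a TDC-style update with linear function approximation. The variable names and step sizes are assumptions of this sketch, and it stands in for the general idea rather than the paper's exact algorithms.

```python
import numpy as np

def tdc_step(theta, w, phi, r, phi_next, alpha=0.01, beta=0.05, gamma=0.99):
    """One TDC-style gradient-TD update with linear features.

    theta: primary weights; w: secondary weights estimating the
    expected TD error per feature.  Names and step sizes are
    illustrative, not taken from the paper.
    """
    delta = r + gamma * (theta @ phi_next) - theta @ phi      # TD error
    theta = theta + alpha * (delta * phi - gamma * (w @ phi) * phi_next)
    w = w + beta * (delta - w @ phi) * phi                    # secondary update
    return theta, w
```

The secondary weight vector w is what lets the method remain stable under off-policy sampling while keeping per-step cost linear in the number of features.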


Supplementary Appendix to “Incentive Compatibility of Large Centralized Matching Markets”

First, we summarize definitions and related theorems of asymptotic statistics in Section A. We prove the theorems in Section B. Lastly, Section C contains additional simulation results. A Asymptotic Statistics. We summarize some results of asymptotic statistics from Serfling (1980). Let X1, X2, . . . and X be random variables on a probability space (Ω, A, P). We say that Xn converges in probability...
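The definition the snippet truncates is the standard one from asymptotic statistics; in the usual notation it reads:

```latex
% Convergence in probability:
X_n \xrightarrow{\;P\;} X
\quad\Longleftrightarrow\quad
\lim_{n \to \infty} P\big(|X_n - X| > \varepsilon\big) = 0
\quad \text{for every } \varepsilon > 0.
```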





Publication date: 1994