An Analysis of Temporal-Difference Learning with Function Approximation
Authors
Abstract
We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator on-line, during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence (with probability 1), a characterization of the limit of convergence, and a bound on the resulting approximation error. Furthermore, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporal-difference learning. In addition to proving new and stronger positive results than those previously available, we identify the significance of on-line updating and potential hazards associated with the use of nonlinear function approximators. First, we prove that divergence may occur when updates are not based on trajectories of the Markov chain. This fact reconciles positive and negative results that have been discussed in the literature regarding the soundness of temporal-difference learning. Second, we present an example illustrating the possibility of divergence when temporal-difference learning is used in the presence of a nonlinear function approximator.
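As a concrete illustration of the algorithm being analyzed, the following is a minimal sketch of TD(λ) with a linear function approximator, updated on-line along a single trajectory of a Markov chain. The chain, per-state costs, features, discount factor, λ, and step sizes below are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

# Sketch of on-line TD(lambda) with a linear function approximator.
# All numerical choices here (chain, costs, features, step sizes) are
# illustrative assumptions, not values from the paper.

rng = np.random.default_rng(0)

n_states = 5
P = rng.dirichlet(np.ones(n_states), size=n_states)   # random transition matrix
g = rng.uniform(0.0, 1.0, size=n_states)              # per-state cost
gamma = 0.9                                           # discount factor
lam = 0.7                                             # TD(lambda) parameter

Phi = rng.normal(size=(n_states, 3))                  # feature vector phi(x) per state
theta = np.zeros(3)                                   # weights: V(x) ~ phi(x)^T theta
z = np.zeros(3)                                       # eligibility trace

x = 0
for t in range(100_000):
    x_next = rng.choice(n_states, p=P[x])
    # Temporal-difference error for the observed transition x -> x_next.
    delta = g[x] + gamma * Phi[x_next] @ theta - Phi[x] @ theta
    # Accumulate the eligibility trace and update the weights on-line.
    z = gamma * lam * z + Phi[x]
    alpha = 1.0 / (t + 1)                             # diminishing step size
    theta = theta + alpha * delta * z
    x = x_next

print("learned weights:", theta)
```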
Similar resources
An Analysis of Temporal-Difference Learning with Function Approximation
We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator on-line, during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence (with proba...
LIDS-P-2390, May 1997: Average Cost Temporal-Difference Learning
We propose a variant of temporal-difference learning that approximates average and differential costs of an irreducible aperiodic Markov chain. Approximations are comprised of linear combinations of fixed basis functions whose weights are incrementally updated during a single endless trajectory of the Markov chain. We present a proof of convergence (with probability 1), and a characterization of th...
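The average-cost variant summarized above can be sketched in the same style: a running estimate of the average cost and incrementally updated linear weights for the differential cost, along one trajectory. The chain, features, λ, and step sizes below are illustrative assumptions, not taken from that paper.

```python
import numpy as np

# Sketch of average-cost TD(lambda): mu tracks the average cost, and r are
# linear weights for the differential cost. Numerical choices are illustrative
# assumptions, not values from the paper.

rng = np.random.default_rng(1)

n_states = 5
P = rng.dirichlet(np.ones(n_states), size=n_states)   # transition probabilities
g = rng.uniform(0.0, 1.0, size=n_states)              # per-state cost
Phi = rng.normal(size=(n_states, 3))                   # fixed basis functions
lam = 0.5

r = np.zeros(3)        # weights of the differential-cost approximation
mu = 0.0               # running estimate of the average cost
z = np.zeros(3)        # eligibility trace

x = 0
for t in range(200_000):
    x_next = rng.choice(n_states, p=P[x])
    # TD error relative to the current average-cost estimate.
    delta = g[x] - mu + Phi[x_next] @ r - Phi[x] @ r
    z = lam * z + Phi[x]
    step = 1.0 / (t + 1)
    mu = mu + step * (g[x] - mu)       # incremental average-cost update
    r = r + step * delta * z           # incremental weight update
    x = x_next

print("average cost estimate:", mu)
print("differential-cost weights:", r)
```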
Stable Function Approximation in Dynamic Programming
The success of reinforcement learning in practical problems depends on the ability to combine function approximation with temporal-difference methods such as value iteration. Experiments in this area have produced mixed results; there have been both notable successes and notable disappointments. Theory has been scarce, mostly due to the difficulty of reasoning about function approximators that gen...
Average Cost Temporal-Difference Learning
We propose a variant of temporal-difference learning that approximates average and differential costs of an irreducible aperiodic Markov chain. Approximations are comprised of linear combinations of fixed basis functions whose weights are incrementally updated during a single endless trajectory of the Markov chain. We present a proof of convergence (with probability 1), and a characterization of the limit...
Learning and value function approximation in complex decision processes
In principle, a wide variety of sequential decision problems, ranging from dynamic resource allocation in telecommunication networks to financial risk management, can be formulated in terms of stochastic control and solved by the algorithms of dynamic programming. Such algorithms compute and store a value function, which evaluates expected future reward as a function of current state. Unfortuna...