An Analysis of Temporal-Difference Learning with Function Approximation
Authors
John N. Tsitsiklis, Benjamin Van Roy
Abstract
We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator on-line, during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence (with probability 1), a characterization of the limit of convergence, and a bound on the resulting approximation error. Furthermore, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporal-difference learning. In addition to proving new and stronger positive results than those previously available, we identify the significance of on-line updating and potential hazards associated with the use of nonlinear function approximators. First, we prove that divergence may occur when updates are not based on trajectories of the Markov chain. This fact reconciles positive and negative results that have been discussed in the literature regarding the soundness of temporal-difference learning. Second, we present an example illustrating the possibility of divergence when temporal-difference learning is used in the presence of a nonlinear function approximator.
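As a concrete illustration, the following is a minimal sketch of on-line TD(lambda) with a linear function approximator in the setting described above. The small Markov chain, per-stage costs, basis functions, discount factor, trace parameter, and diminishing step sizes are illustrative assumptions, not quantities taken from the paper.

import numpy as np

# Minimal sketch of on-line TD(lambda) with a linear approximator
# J_tilde(x, theta) = phi(x)' theta for the discounted cost-to-go function.
rng = np.random.default_rng(0)

n_states = 5
P = rng.dirichlet(np.ones(n_states), size=n_states)  # assumed transition probabilities
g = rng.standard_normal(n_states)                     # assumed per-stage costs
alpha, lam = 0.9, 0.7                                 # discount factor, trace parameter

phi = rng.standard_normal((n_states, 3))              # fixed basis functions (features)
theta = np.zeros(3)                                   # approximator parameters
z = np.zeros(3)                                       # eligibility trace

x = 0
for t in range(1, 200_000):
    x_next = rng.choice(n_states, p=P[x])
    # temporal difference: d_t = g(x_t) + alpha * J_tilde(x_{t+1}) - J_tilde(x_t)
    d = g[x] + alpha * (phi[x_next] @ theta) - phi[x] @ theta
    # eligibility trace of (alpha * lambda)-discounted feature vectors
    z = alpha * lam * z + phi[x]
    theta = theta + (1.0 / t) * d * z                 # diminishing step size gamma_t = 1/t
    x = x_next

print("approximate cost-to-go per state:", phi @ theta)

Updating along a single simulated trajectory, as in this sketch, is what the abstract refers to as on-line updating; the divergence example mentioned above arises precisely when the states at which updates are performed are not sampled along such a trajectory.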
Similar Articles
An Analysis of Temporal-Difference Learning with Function Approximation
We discuss the temporal-difference learning algorithm as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator on-line during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence (with probabili...
LIDS-P-2390, May 1997: Average Cost Temporal-Difference Learning
We propose a variant of temporal-difference learning that approximates average and differential costs of an irreducible aperiodic Markov chain. Approximations are comprised of linear combinations of fixed basis functions whose weights are incrementally updated during a single endless trajectory of the Markov chain. We present a proof of convergence (with probability 1), and a characterization of th...
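A minimal sketch of this average-cost variant, under the same kind of illustrative assumptions as before (made-up chain, costs, basis functions, and step sizes): a scalar estimate of the average cost and a linearly parameterized differential cost are updated together along a single trajectory.

import numpy as np

# Minimal sketch of average-cost temporal-difference learning with a linear
# approximator of the differential cost. All numerical choices are assumptions.
rng = np.random.default_rng(1)

n_states, lam = 4, 0.5
P = rng.dirichlet(np.ones(n_states), size=n_states)  # assumed transition probabilities
g = rng.standard_normal(n_states)                     # assumed per-stage costs
phi = rng.standard_normal((n_states, 2))              # fixed basis functions
theta = np.zeros(2)                                   # differential-cost weights
z = np.zeros(2)                                       # eligibility trace
mu = 0.0                                              # running estimate of the average cost

x = 0
for t in range(1, 200_000):
    x_next = rng.choice(n_states, p=P[x])
    mu += (1.0 / t) * (g[x] - mu)                     # update average-cost estimate
    # temporal difference for the differential cost approximation
    d = g[x] - mu + phi[x_next] @ theta - phi[x] @ theta
    z = lam * z + phi[x]                              # undiscounted eligibility trace
    theta = theta + (1.0 / t) * d * z
    x = x_next

print("average cost estimate:", mu)
print("approximate differential costs:", phi @ theta)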
Average Cost Temporal-Difference Learning
We propose a variant of temporal-difference learning that approximates average and differential costs of an irreducible aperiodic Markov chain. Approximations are comprised of linear combinations of fixed basis functions whose weights are incrementally updated during a single endless trajectory of the Markov chain. We present a proof of convergence (with probability 1) and a characterization of the limit...
Stable Function Approximation in Dynamic Programming
The success of reinforcement learning in practical problems depends on the ability to combine function approximation with temporal-difference methods such as value iteration. Experiments in this area have produced mixed results; there have been both notable successes and notable disappointments. Theory has been scarce, mostly due to the difficulty of reasoning about function approximators that gen...
Learning and value function approximation in complex decision processes
In principle, a wide variety of sequential decision problems, ranging from dynamic resource allocation in telecommunication networks to financial risk management, can be formulated in terms of stochastic control and solved by the algorithms of dynamic programming. Such algorithms compute and store a value function, which evaluates expected future reward as a function of current state. Unfortuna...