Sample-Efficient Iterative Lower Bound Optimization of Deep Reactive Policies for Planning in Continuous MDPs
Authors
Abstract
Recent advances in deep learning have enabled optimization of deep reactive policies (DRPs) for continuous MDP planning by encoding a parametric policy as a neural network and exploiting automatic differentiation in an end-to-end model-based gradient descent framework. This approach has proven effective at optimizing DRPs in nonlinear MDPs, but it requires a large number of sampled trajectories to learn effectively and can suffer from high variance in solution quality. In this work, we revisit the overall DRP objective and instead take a minorization-maximization perspective to iteratively optimize the DRP w.r.t. a locally tight lower-bounded objective. This novel formulation of iterative lower bound optimization (ILBO) is particularly appealing because (i) each step is structurally easier to optimize than the overall objective, (ii) it guarantees a monotonically improving objective under certain theoretical conditions, and (iii) it reuses samples between iterations, thus lowering sample complexity. Empirical evaluation confirms that ILBO is significantly more sample-efficient than the state-of-the-art DRP planner and consistently produces better solution quality with lower variance. We additionally demonstrate that ILBO generalizes well to new problem instances (i.e., different initial states) without requiring retraining.
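The minorization-maximization structure described above can be illustrated on a toy problem. The sketch below is a generic MM example under assumed definitions, not the paper's actual DRP objective: we maximize J(θ) = -Σᵢ |θ - aᵢ| by repeatedly maximizing a quadratic surrogate that lower-bounds J and touches it at the current iterate, using the bound |u| ≤ u²/(2c) + c/2 for any c > 0.

```python
# Toy minorization-maximization (MM) sketch: maximize J(theta) = -sum_i |theta - a_i|.
# Each step maximizes a surrogate g(theta | theta_t) that lower-bounds J and is
# tight at theta_t, obtained from |u| <= u^2/(2c) + c/2 with c = |theta_t - a_i|.
# The surrogate is quadratic, so its maximizer is a weighted average in closed form.

def J(theta, a):
    """Objective: negative sum of absolute deviations (maximized at the median)."""
    return -sum(abs(theta - ai) for ai in a)

def mm_step(theta, a, eps=1e-9):
    """One surrogate maximization: a weighted least-squares update.
    eps guards against division by zero when theta coincides with a data point."""
    w = [1.0 / max(abs(theta - ai), eps) for ai in a]
    return sum(wi * ai for wi, ai in zip(w, a)) / sum(w)

def mm_optimize(a, theta0, iters=50):
    theta, history = theta0, [J(theta0, a)]
    for _ in range(iters):
        theta = mm_step(theta, a)
        history.append(J(theta, a))  # non-decreasing by the MM argument
    return theta, history

a = [1.0, 2.0, 10.0]          # toy data; the maximizer of J is the median, 2.0
theta, history = mm_optimize(a, theta0=5.0)
```

Each surrogate is strictly easier to maximize than J (a quadratic versus a piecewise-linear function), and since g(θ | θ_t) ≤ J(θ) with equality at θ_t, every step can only improve J. This is the same monotone-improvement structure the abstract credits to ILBO, transplanted to a one-dimensional example.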
Similar Resources
Sample Efficient Feature Selection for Factored MDPs
In reinforcement learning, the state of the real world is often represented by feature vectors. However, not all of the features may be pertinent for solving the current task. We propose Feature Selection Explore and Exploit (FS-EE), an algorithm that automatically selects the necessary features while learning a Factored Markov Decision Process, and prove that under mild assumptions, its sample...
Lower bound on complexity of optimization of continuous functions
This paper considers the problem of approximating the minimum of a continuous function using a fixed number of sequentially selected function evaluations. A lower bound on the complexity is established by analyzing the average case for the Brownian bridge.
Learning Reactive Policies for Probabilistic Planning Domains
We present a planning system for selecting policies in probabilistic planning domains. Our system is based on a variant of approximate policy iteration that combines inductive machine learning and simulation to perform policy improvement. Given a planning domain, the system iteratively improves the best policy found so far until no more improvement is observed or a time limit is exceeded. Thoug...
Discrepancy Search with Reactive Policies for Planning
We consider a novel use of mostly-correct reactive policies. In classical planning, reactive policy learning approaches could find good policies from solved trajectories of small problems and such policies have been successfully applied to larger problems of the target domains. Often, due to the inductive nature, the learned reactive policies are mostly correct but commit errors on some portion...
Reactive Policies with Planning for Action Languages
We describe a representation in a high-level transition system for policies that express a reactive behavior for the agent. We consider a target decision component that figures out what to do next and an (online) planning capability to compute the plans needed to reach these targets. Our representation allows one to analyze the flow of executing the given reactive policy, and to determine wheth...
Journal
Journal Title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2022
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v36i9.21220