High-Confidence Off-Policy (or Counterfactual) Variance Estimation
نویسندگان
چکیده
Many sequential decision-making systems leverage data collected using prior policies to propose a new policy. For critical applications, it is important that high-confidence guarantees on the policy’s behavior are provided before deployment, ensure policy will behave as desired. Prior works have studied off-policy estimation of expected return, however, variance returns can be equally for high-risk applications. In this paper we tackle previously open problem estimating and bounding, with high confidence, from data.
منابع مشابه
High-Confidence Off-Policy Evaluation
Many reinforcement learning algorithms use trajectories collected from the execution of one or more policies to propose a new policy. Because execution of a bad policy can be costly or dangerous, techniques for evaluating the performance of the new policy without requiring its execution have been of recent interest in industry. Such off-policy evaluation methods, which estimate the performance ...
متن کاملHigh Confidence Policy Improvement
We present a batch reinforcement learning (RL) algorithm that provides probabilistic guarantees about the quality of each policy that it proposes, and which has no hyper-parameters that require expert tuning. The user may select any performance lower-bound, ρ−, and confidence level, δ, and our algorithm will ensure that the probability that it returns a policy with performance below ρ− is at mo...
متن کاملStochastic Variance Reduction for Policy Gradient Estimation
Recent advances in policy gradient methods and deep learning have demonstrated their applicability for complex reinforcement learning problems. However, the variance of the performance gradient estimates obtained from the simulation is often excessive, leading to poor sample efficiency. In this paper, we apply the stochastic variance reduced gradient descent (SVRG) technique [1] to model-free p...
متن کاملRegularized Policy Gradients: Direct Variance Reduction in Policy Gradient Estimation
Policy gradient algorithms are widely used in reinforcement learning problems with continuous action spaces, which update the policy parameters along the steepest direction of the expected return. However, large variance of policy gradient estimation often causes instability of policy update. In this paper, we propose to suppress the variance of gradient estimation by directly employing the var...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
ژورنال
عنوان ژورنال: Proceedings of the ... AAAI Conference on Artificial Intelligence
سال: 2021
ISSN: ['2159-5399', '2374-3468']
DOI: https://doi.org/10.1609/aaai.v35i8.16855