Search results for: reward

Number of results: 29303

Journal: :CoRR 2015
Yusen Zhan Matthew E. Taylor

This paper proposes an online transfer framework to capture the interaction among agents and shows that current transfer learning in reinforcement learning is a special case of online transfer. Furthermore, this paper re-characterizes existing agents-teaching-agents methods as online transfer and analyzes one such teaching method in three ways. First, the convergence of Q-learning and Sarsa with ...

2006
Laurissa N. Tokarchuk John Bigham Laurie G. Cuthbert

Reinforcement learning (RL) is a machine learning technique for sequential decision making. This approach is well proven in many small-scale domains. The true potential of this technique cannot be fully realised until it can adequately deal with the large domain sizes that typically describe real world problems. RL with function approximation is one method of dealing with the domain size proble...

Journal: :CoRR 2018
Kamil Ciosek Shimon Whiteson

We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by Expected Sarsa, EPG integrates (or sums) across actions when estimating the gradient, instead of relying only on the action in the sampled trajectory. For continuous action spaces, we first derive a practical result for Gaussi...

2005
Peter Vamplew Robert Ollington

In order to scale to problems with large or continuous state-spaces, reinforcement learning algorithms need to be combined with function approximation techniques. The majority of work on function approximation for reinforcement learning has so far focused either on global function approximation with a static structure (such as multi-layer perceptrons), or on constructive architectures using loc...

Journal: :Theor. Comput. Sci. 2007
Rocco De Nicola Joost-Pieter Katoen Diego Latella Michele Loreti Mieke Massink

The Temporal Mobile Stochastic Logic (MOSL) has been introduced in previous work by the authors for formulating properties of systems specified in STOKLAIM, a Markovian extension of KLAIM. The main purpose of MOSL is to address key functional aspects of global computing such as distribution awareness, mobility, and security and their integration with performance and dependability guarantees. In ...

2006
Ian Wehrman Aaron Stump Edwin M. Westbrook

A Knuth-Bendix completion procedure is parametrized by a reduction ordering used to ensure termination of intermediate and resulting rewriting systems. While in principle any reduction ordering can be used, modern completion tools typically implement only Knuth-Bendix and path orderings. Consequently, the theories for which completion can possibly yield a decision procedure are limited to those...

اعتمادی چهارده, نیلوفر, بخشی پور, عباس, کریم پور وظیفه خورانی, علیرضا, کمالی قاسم آبادی, حسین,

Objectives: The present study examined the effects of a reward-driven task on improving affective levels in individuals with depressive symptoms. Methods: The present study is an experimental study with a pretest-posttest and follow-up design and a control group. The study population was the students of Tabriz University in the 2016-2017 semester. The sample consisted of 40 students who had visited t...

Journal: :Inf. Comput. 1999
Albert Rubio

We present the first fully syntactic (i.e., non-interpretation-based) AC-compatible recursive path ordering (RPO). It is simple, and hence easy to implement, and its behaviour is intuitive as in the standard RPO. The ordering is AC-total and defined uniformly for both ground and nonground terms, as well as for partial precedences. More important, it is the first one that can deal incrementally ...

Journal: :Journal of experimental psychopathology 2011
Michael Amlung James MacKillop

Delayed reward discounting (DRD) is a common index of impulsivity that refers to an individual's devaluation of rewards based on delay of receipt and has been linked to alcohol misuse and other maladaptive behaviors. The current study investigated response consistency and reward magnitude effects in two measures of DRD in a sample of 111 undergraduates who consumed an average of 10.7 drinks/wee...

2007

Policy evaluation methods using least-squares techniques (such as LSTD and iLSTD) have been shown to estimate the value of a policy with far less data than traditional TD techniques. Unfortunately, they make use of policy-dependent statistics that have to be discarded when the policy changes. This makes it difficult to use the techniques for online control problems. In this paper, we explore the effect...
