Discounted Markov decision processes with utility constraints
Authors: Y. Kadota et al.
Abstract
We consider utility-constrained Markov decision processes. The expected utility of the total discounted reward is maximized subject to multiple expected utility constraints. By introducing a corresponding Lagrange function, a saddle-point theorem for the utility-constrained optimization is derived. The existence of a constrained optimal policy is then characterized by optimal action sets specified with a parametric utility.

Keywords: Markov decision processes, Utility constraints, Discount criterion, Lagrange technique, Saddle-point, Constrained optimal policy.

1. INTRODUCTION AND PROBLEM FORMULATION

Utility-constrained Markov decision processes (MDPs) arise when the decision maker wants to maximize the total reward under more than one utility function. A typical case is the group decision problem with different utility functions, in which each player wants to maximize the reward under his own specified utility function. In such a case, we want to maximize one type of expected utility of the reward while keeping the other types of expected utilities higher than some given bounds.

In this paper, we consider general utility-constrained MDPs in which the expected utility of the total discounted reward is maximized subject to multiple expected utility constraints; the objective is to show that the Lagrange approach to general utility-constrained MDPs can be carried out successfully. In fact, by introducing a corresponding Lagrange function, a saddle-point theorem is given, by which the existence of a constrained optimal policy is proved, and a constrained optimal policy is characterized by optimal action sets specified with a parametric utility. We do not specify the kind of utility function; this is expected to enlarge the range of practical applications of MDPs.

As far as we are aware, little work has been done on the Lagrange method for general utility-constrained MDPs. The method of analysis for general utility functions is closely related to [1,2], in which discounted MDPs with a general utility function have been studied and whose results are applied here to characterize a constrained optimal policy. Recently, Kurano et al. [3] derived a saddle-point theorem for constrained MDPs with average reward criteria. For the utility treatment of MDPs and constrained MDPs, refer to [1,2,4-7] and their references.

(The authors express grateful thanks to the anonymous referee who gave useful comments and suggestions on the earlier draft.)

In the remainder of this section, we define the utility-constrained problem to be examined and a constrained optimal policy.

First we consider standard Markov decision processes (MDPs), specified by a tuple $(S, \{A(i)\}_{i \in S}, q, r)$, where $S = \{1, 2, \ldots\}$ denotes the set of states of the process, and $A(i)$ is the set of actions available at each state $i \in S$, taken to be a Borel subset of some Polish space $A$. The matrix $q = (q_{ij}(a))$ is a transition probability satisfying $\sum_{j \in S} q_{ij}(a) = 1$ for all $i \in S$ and $a \in A(i)$, and $r(i,a,j)$ is an immediate reward function defined on $\{(i,a,j) \mid i \in S,\ a \in A(i),\ j \in S\}$.
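To fix notation, the following minimal Python sketch (not from the paper) instantiates the model data $(S, \{A(i)\}_{i\in S}, q, r)$ for a hypothetical two-state, two-action instance and checks the stochasticity and boundedness properties just stated; all states, actions, probabilities, and reward values are illustrative assumptions.

```python
# A minimal, hypothetical instance of the model data (S, {A(i)}, q, r); the states,
# actions, transition probabilities and rewards below are illustrative only.
S = [0, 1]                                   # state space S
A = {0: ["stay", "move"],                    # A(i): actions available in state i
     1: ["stay", "move"]}

# q[i][a][j] plays the role of q_{ij}(a); every row sums to one.
q = {0: {"stay": [0.9, 0.1], "move": [0.2, 0.8]},
     1: {"stay": [0.3, 0.7], "move": [0.6, 0.4]}}

def r(i, a, j):
    """Immediate reward r(i, a, j), uniformly bounded by M."""
    return 1.0 if j == 1 else -0.5

M = 1.0
assert all(abs(sum(q[i][a]) - 1.0) < 1e-9 for i in S for a in A[i])
assert all(abs(r(i, a, j)) <= M for i in S for a in A[i] for j in S)
```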
Throughout this paper, the following assumption will remain operative.

ASSUMPTION 1.
(i) For each $i \in S$, $A(i)$ is a closed subset of a compact metric space $A$.
(ii) For each $i, j \in S$, both $q_{ij}(\cdot)$ and $r(i, \cdot, j)$ are continuous on $A(i)$.
(iii) The function $r$ is uniformly bounded, i.e., $|r(i,a,j)| \le M$ for all $i, j \in S$, $a \in A(i)$, and some $M > 0$.

The sample space is the product space $\Omega = (S \times A)^{\infty}$, such that the projections $X_t, A_t$ onto the $t$-th factors $S, A$ describe the state and the action at time $t$ of the process ($t \ge 0$). A policy $\pi = (\pi_0, \pi_1, \ldots)$ is a sequence of conditional probabilities $\pi_t$ such that $\pi_t(A(i_t) \mid i_0, a_0, \ldots, i_t) = 1$ for all histories $(i_0, a_0, \ldots, i_t) \in (S \times A)^t \times S$. The set of policies is denoted by $\Pi$. Let $H_t = (X_0, A_0, \ldots, A_{t-1}, X_t)$ for $t \ge 0$.

ASSUMPTION 2. We assume that
(i) $\mathrm{Prob}(X_{t+1} = j \mid H_{t-1}, A_{t-1}, X_t = i, A_t = a) = q_{ij}(a)$,
(ii) $\mathrm{Prob}(A_t \in D \mid H_t) = \pi_t(D \mid H_t)$,
for all $t \ge 0$, $i, j \in S$, $a \in A(i)$, any Borel subset $D \subset A$, and any given $\pi = (\pi_0, \pi_1, \ldots) \in \Pi$.

Let $\mathcal{P}(X)$ denote the set of all probability measures on a Borel measurable set $X$. Then, any initial probability measure $\nu \in \mathcal{P}(S)$ and policy $\pi \in \Pi$ determine the probability measure $P_\nu^\pi \in \mathcal{P}(\Omega)$ in the usual way. For the state-action process $\{X_t, A_t;\ t = 0, 1, 2, \ldots\}$, its discounted present value is defined by
$$
B := \sum_{t=0}^{\infty} \beta^t r(X_t, A_t, X_{t+1}), \tag{1.1}
$$
where $\beta$ ($0 < \beta < 1$) is a discount factor. Then, for each $\nu \in \mathcal{P}(S)$ and $\pi \in \Pi$, $B$ is a random variable from the probability space $(\Omega, P_\nu^\pi)$ into the interval $[-M/(1-\beta), M/(1-\beta)]$.

ASSUMPTION 3. Let $g, h_i$ ($1 \le i \le k$) be any real-valued functions on the set of real numbers $\mathbb{R}$ satisfying:
(i) $g$ is upper semicontinuous;
(ii) each $h_i$ ($1 \le i \le k$) is lower semicontinuous.

For any given threshold vector $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_k) \in \mathbb{R}^k$ and any initial probability measure $\nu \in \mathcal{P}(S)$, let
$$
V(\nu, \alpha) := \{\pi \in \Pi \mid E_\nu^\pi(h_i(B)) \le \alpha_i \text{ for all } i\ (1 \le i \le k)\},
$$
where $E_\nu^\pi$ is the expectation with respect to $P_\nu^\pi$. Interpreting $g, h_i$ ($1 \le i \le k$) as given utility functions, we will consider the following utility-constrained optimization problem.

Problem A: maximize $E_\nu^\pi(g(B))$ subject to $\pi \in V(\nu, \alpha)$.

The optimal solution $\pi^* \in V(\nu, \alpha)$ of Problem A, if it exists, is called a $\nu$-constrained optimal policy, or simply a constrained optimal policy.

Note that Problem A includes, for example, the constrained moment problem (cf. [8]): for the $i$-th moment of $B$ with a sign $(-1)^i$,
• maximize $E_\nu^\pi(B)$ subject to $(-1)^i E_\nu^\pi(B^i) \le \alpha_i$ ($2 \le i \le k+1$),
and the constrained threshold probability problem (cf. [9,10]):
• maximize $P_\nu^\pi(B \ge a)$ subject to $P_\nu^\pi(B \le b) \le \alpha$ for some $b < a$.

We shall use the following result in the sequel.

LEMMA 1.1. (See [11].) For any $\nu \in \mathcal{P}(S)$, the set $\{P_\nu^\pi \in \mathcal{P}(\Omega) \mid \pi \in \Pi\}$ is convex and compact in the weak topology.

In Section 2, the saddle-point statement for Problem A is given, and its results are applied to obtain the existence of a constrained optimal policy. The characterization of a constrained optimal policy is given and the exponential case is discussed in Section 3.

2. SADDLE-POINT THEOREM FOR UTILITY-CONSTRAINED MDPS

In this section, we prove the saddle-point theorem for the Lagrangian associated with Problem A. For any initial probability measure $\nu \in \mathcal{P}(S)$, we define the Lagrangian $L^\nu$ corresponding to Problem A as follows:
$$
L^\nu(\pi, \lambda) := E_\nu^\pi(g(B)) + \sum_{i=1}^{k} \lambda_i \big(\alpha_i - E_\nu^\pi(h_i(B))\big) \tag{2.1}
$$
for any $\pi \in \Pi$ and $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_k) \in \mathbb{R}_+^k := \{\lambda \in \mathbb{R}^k \mid \lambda_i \ge 0\ (1 \le i \le k)\}$. Without any confusion, $\lambda \in \mathbb{R}_+^k$ will be written simply as $\lambda \ge 0$.
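As a concrete illustration of the quantities entering Problem A and the Lagrangian (2.1), the following hedged Python sketch simulates a truncated version of the present value $B$ under a fixed stationary policy on a hypothetical two-state instance, estimates $E_\nu^\pi(g(B))$ and $E_\nu^\pi(h_1(B))$ by Monte Carlo, checks the constraint against the threshold $\alpha$, and evaluates $L^\nu(\pi,\lambda)$ for one multiplier. The chain, the policy, the utilities $g$ and $h_1$, the threshold, and the truncation horizon are all illustrative assumptions, and truncating the infinite sum in (1.1) is an approximation.

```python
import math
import random

# Hedged Monte-Carlo illustration of Problem A and the Lagrangian (2.1) with k = 1.
# The two-state chain, the stationary policy, the utilities g and h_1, the threshold
# alpha and the truncation horizon T are all hypothetical choices.
random.seed(0)

S = [0, 1]
q = {0: {"stay": [0.9, 0.1], "move": [0.2, 0.8]},     # q_{ij}(a)
     1: {"stay": [0.3, 0.7], "move": [0.6, 0.4]}}
r = lambda i, a, j: 1.0 if j == 1 else -0.5            # bounded reward r(i, a, j)
policy = {0: "move", 1: "stay"}                        # a fixed stationary policy pi
beta, T = 0.95, 200                                    # discount factor, truncation horizon

g = lambda x: math.tanh(x)                             # utility to maximize (upper semicontinuous)
h = lambda x: -x                                       # constraint utility (lower semicontinuous)
alpha = -5.0                                           # pi is feasible iff E[h(B)] <= alpha

def sample_B(i0):
    """One truncated sample of the present value B in (1.1) under `policy`."""
    i, value = i0, 0.0
    for t in range(T):
        a = policy[i]
        j = random.choices(S, weights=q[i][a])[0]
        value += beta ** t * r(i, a, j)
        i = j
    return value

draws = [sample_B(0) for _ in range(5_000)]            # nu degenerate at state 0
Eg = sum(g(b) for b in draws) / len(draws)             # estimate of E_nu^pi[g(B)]
Eh = sum(h(b) for b in draws) / len(draws)             # estimate of E_nu^pi[h(B)]
lam = 0.5                                              # one Lagrange multiplier lambda >= 0
L = Eg + lam * (alpha - Eh)                            # Lagrangian (2.1)
print(f"E[g(B)] ~ {Eg:.3f}, E[h(B)] ~ {Eh:.3f}, feasible: {Eh <= alpha}, L ~ {L:.3f}")
```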
The following statement on saddle-points can be proved similarly to that of Luenberger [12, p. 221, Theorem 2], and so its proof is omitted.

THEOREM 2.1. (Cf. [12].) Suppose that there exist $\pi^* \in \Pi$ and $\lambda^* \ge 0$ such that $L^\nu(\cdot, \cdot)$ with $\nu \in \mathcal{P}(S)$ possesses a saddle-point at $(\pi^*, \lambda^*)$, i.e.,
$$
L^\nu(\pi, \lambda^*) \le L^\nu(\pi^*, \lambda^*) \le L^\nu(\pi^*, \lambda) \tag{2.2}
$$
for all $\pi \in \Pi$ and $\lambda \ge 0$. Then, $\pi^*$ solves Problem A and is a $\nu$-constrained optimal policy.

The above theorem motivates us to obtain sufficient conditions for the existence of a saddle-point of the Lagrangian $L^\nu$. To this purpose, it is convenient to rewrite the expected utility using the distribution function of the present value. Let, for each $\nu \in \mathcal{P}(S)$ and $\pi \in \Pi$,
$$
F_\nu^\pi(x) := P_\nu^\pi(B \le x), \tag{2.3}
$$
$$
\Phi(\nu) := \{F_\nu^\pi(\cdot) \mid \pi \in \Pi\}. \tag{2.4}
$$
Now, with some abuse of notation, we define
$$
L^\nu(F, \lambda) := \int g_\lambda(x)\, dF(x) \tag{2.5}
$$
for any $F \in \Phi(\nu)$ and $\lambda \ge 0$, where
$$
g_\lambda(x) := g(x) + \sum_{i=1}^{k} \lambda_i \big(\alpha_i - h_i(x)\big). \tag{2.6}
$$
Then, the Lagrangian $L^\nu$ defined in (2.1) is obviously rewritten as $L^\nu(\pi, \lambda) = L^\nu(F, \lambda)$ with $F = F_\nu^\pi$. Thus, we have the following corollary.

COROLLARY 2.1. Let $\pi^* \in \Pi$ and $\lambda^* \ge 0$. Then, $L^\nu(\cdot, \cdot)$ with $\nu \in \mathcal{P}(S)$ possesses a saddle-point at $(\pi^*, \lambda^*)$ if and only if the following relation holds with $F^* = F_\nu^{\pi^*}$:
$$
L^\nu(F, \lambda^*) \le L^\nu(F^*, \lambda^*) \le L^\nu(F^*, \lambda), \tag{2.7}
$$
for all $F \in \Phi(\nu)$ and $\lambda \ge 0$. Then, $\pi^*$ solves Problem A and is a $\nu$-constrained optimal policy.

LEMMA 2.1. For any $\nu \in \mathcal{P}(S)$, it holds that
(i) $\Phi(\nu)$ is convex and compact in the weak topology;
(ii) $L^\nu(\cdot, \lambda)$ is concave and upper semicontinuous for each $\lambda \ge 0$;
(iii) $L^\nu(F, \cdot)$ is convex and continuous for each $F \in \Phi(\nu)$.

PROOF. Noting that the present value $B$ is a continuous map from $\Omega$ to $[-M/(1-\beta), M/(1-\beta)]$, (i) follows from Lemma 1.1. Since $g_\lambda(\cdot)$ is upper semicontinuous, (ii) follows from (2.5); also, (iii) clearly holds.

From Lemma 2.1, we observe that Fan's minimax theorem (cf. [13]) is applicable to obtain the following.

LEMMA 2.2. It holds that, for any $\nu \in \mathcal{P}(S)$,
$$
\inf_{\lambda \ge 0}\ \max_{F \in \Phi(\nu)} L^\nu(F, \lambda) = \max_{F \in \Phi(\nu)}\ \inf_{\lambda \ge 0} L^\nu(F, \lambda). \tag{2.8}
$$
Henceforth, the common value of (2.8) will be denoted by $L^*$.

In order to prove the existence of a saddle-point with (2.7), we need the following condition.

SLATER CONDITION. There exists a $\hat{\pi} \in \Pi$ such that
$$
E_\nu^{\hat{\pi}}(h_i(B)) < \alpha_i, \quad \text{for all } i,\ 1 \le i \le k. \tag{2.9}
$$

Since $L^\nu(\hat{F}, \lambda) \to \infty$ as $\|\lambda\| \to \infty$ with $\hat{F} = F_\nu^{\hat{\pi}}$ under condition (2.9), the convex function $\max_{F \in \Phi(\nu)} L^\nu(F, \lambda)$ attains its infimum over $\lambda \ge 0$, so that there exists $\lambda^* \ge 0$ such that
$$
L^\nu(F, \lambda^*) \le L^*, \quad \text{for all } F \in \Phi(\nu), \tag{2.10}
$$
by (2.8). On the other hand, by Lemma 2.2, there exists $F^* \in \Phi(\nu)$ with
$$
L^\nu(F^*, \lambda) \ge L^*, \quad \text{for all } \lambda \ge 0. \tag{2.11}
$$
Thus, applying Corollary 2.1, (2.10) and (2.11) lead to the following main theorem.

THEOREM 2.2. Under condition (2.9), the Lagrangian $L^\nu(\cdot, \cdot)$ with the initial probability measure $\nu \in \mathcal{P}(S)$ has a saddle-point, i.e., there exist $\pi^* \in \Pi$ and $\lambda^* \ge 0$ satisfying (2.2).

Also, from Theorems 2.1 and 2.2, the following corollary holds.

COROLLARY 2.2. Under condition (2.9), there exists a constrained optimal policy.
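Numerically, Lemma 2.2 and the Slater condition (2.9) suggest looking for $\lambda^*$ by minimizing the convex dual function $\lambda \mapsto \max_{F \in \Phi(\nu)} L^\nu(F, \lambda)$. The following hedged Python caricature does this for $k = 1$ by projected subgradient descent over a hypothetical finite list of candidate policies, each summarized by made-up values of $E(g(B))$ and $E(h_1(B))$; in the paper the inner maximization ranges over the whole convex compact set $\Phi(\nu)$, which is what rules out a duality gap in general.

```python
# Hypothetical illustration of the dual side of the saddle point (Lemma 2.2,
# Theorem 2.2) with k = 1.  Each candidate policy is summarized by the pair
# (E[g(B)], E[h(B)]); the numbers are made up, and the inner maximization is over
# this finite list only, whereas in the paper it ranges over all of Phi(nu).
candidates = {                    # pi -> (E[g(B)], E[h(B)])
    "pi_1": (1.00, 0.80),
    "pi_2": (0.70, 0.30),
    "pi_3": (0.40, -0.10),
}
alpha = 0.30                      # threshold; pi_3 witnesses the Slater condition (2.9)

def dual_value(lam):
    """phi(lam) = max over candidates of E[g(B)] + lam * (alpha - E[h(B)])."""
    name, (eg, eh) = max(candidates.items(),
                         key=lambda kv: kv[1][0] + lam * (alpha - kv[1][1]))
    return name, eg + lam * (alpha - eh), alpha - eh

lam = 0.0
for t in range(200):              # projected subgradient descent on the convex dual
    _, _, slack = dual_value(lam)          # slack is a subgradient of phi at lam
    lam = max(0.0, lam - (0.5 / (t + 1)) * slack)

name, phi, slack = dual_value(lam)
print(f"lambda ~ {lam:.3f}, inner maximizer {name}, dual value ~ {phi:.3f}, "
      f"lambda*(alpha - E[h(B)]) ~ {lam * slack:.3f}")
```

With these particular numbers the iteration settles at a multiplier whose inner maximizer is feasible with zero slack, so the printed complementary-slackness term vanishes and the dual value coincides with the best feasible candidate's $E(g(B))$; this mirrors the saddle-point conditions of Theorem 2.1 and condition (iii) of Lemma 3.1 below.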
3. CHARACTERIZATION OF THE CONSTRAINED OPTIMAL POLICY

In this section, by applying the results in [1], a constrained optimal policy is characterized by optimal action sets. Let $\nu \in \mathcal{P}(S)$. Then, for each $\lambda \ge 0$, $\pi^* \in \Pi$ is called $g_\lambda$-optimal if
$$
E_\nu^{\pi^*}(g_\lambda(B)) \ge E_\nu^{\pi}(g_\lambda(B)), \quad \text{for all } \pi \in \Pi,
$$
where $g_\lambda$ is given in (2.6). The following lemma can be easily proved (cf. [14]).

LEMMA 3.1. Let $\hat{\pi} \in \Pi$ and $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_k) \in \mathbb{R}_+^k$. Then, for any $\nu \in \mathcal{P}(S)$, the Lagrangian $L^\nu(\cdot, \cdot)$ given in (2.1) has a saddle-point at $(\hat{\pi}, \lambda)$ if and only if the following hold:
(i) $\hat{\pi}$ is $g_\lambda$-optimal;
(ii) $\hat{\pi} \in V(\nu, \alpha)$;
(iii) $\sum_{i=1}^{k} \lambda_i \big(\alpha_i - E_\nu^{\hat{\pi}}(h_i(B))\big) = 0$.

To characterize $g_\lambda$-optimality in Lemma 3.1(i), let
$$
U_t\{g_\lambda\}(s, i, a, j) := \max_{F \in \Phi(j)} \int g_\lambda\big(s + \beta^t r(i,a,j) + \beta^{t+1} x\big)\, F(dx), \tag{3.1}
$$
for $t \ge 0$, $s \in [-M/(1-\beta), M/(1-\beta)]$, and $i, j \in S$, where if $\nu \in \mathcal{P}(S)$ is degenerate at $\{j\}$, $\nu$ is simply denoted by $j$ and $\Phi(\nu)$ by $\Phi(j)$. Since $g_\lambda(\cdot)$ is upper semicontinuous and $\Phi(j)$ is compact in the weak topology, the maximum in (3.1) is attained. Here, for each $\lambda \ge 0$, we define the sequence $\{A_t^\lambda\}_{t=0}^{\infty}$ by
$$
A_t^\lambda(s, i) := \operatorname*{arg\,max}_{a \in A(i)} \sum_{j \in S} q_{ij}(a)\, U_t\{g_\lambda\}(s, i, a, j), \tag{3.2}
$$
for $s \in [-M/(1-\beta), M/(1-\beta)]$ and $i \in S$. Then, we have the following.

THEOREM 3.1. For any $\nu \in \mathcal{P}(S)$, a policy $\pi^* \in V(\nu, \alpha)$ is a constrained optimal policy if and only if there exists $\lambda^* \ge 0$ such that
(i) $P_\nu^{\pi^*}\big(A_t \in A_t^{\lambda^*}(B_{t-1}, X_t)\big) = 1$, where $B_t = \sum_{\tau=0}^{t-1} \beta^\tau r(X_\tau, A_\tau, X_{\tau+1})$ ($t \ge 1$);
(ii) $\sum_{i=1}^{k} \lambda_i^* \big(\alpha_i - E_\nu^{\pi^*}(h_i(B))\big) = 0$.

PROOF. Applying the results of Theorem 3.3 in [1], it can be shown that $\pi^*$ is $g_{\lambda^*}$-optimal if and only if the above (i) holds. So, Theorem 3.1 follows from Lemma 3.1.

Consider the exponential utility case with $k = 1$, i.e., $g(x) = h_{\lambda_1}(x)$ and $h_1(x) = h_{\lambda_2}(x)$ ($\lambda_1, \lambda_2 \ne 0$), where $h_\delta(\cdot)$ is a utility function with constant risk sensitivity $\delta$, as follows:
$$
h_\delta(x) := \begin{cases} \operatorname{sign}(\delta)\, e^{\delta x}, & \delta \ne 0, \\ x, & \delta = 0. \end{cases}
$$
In this case, $g_\lambda(x)$ in (2.6) is given as $g_\lambda(x) = g(x) + \lambda(\alpha - h_1(x))$ with a Lagrange multiplier $\lambda$. For each $\lambda \ge 0$, $i \in S$, $t \ge 0$, and $-\infty < s < \infty$, let
$$
P_t^\lambda(i, s) := \sup_{F \in \Phi(i)} \int \Big\{ \operatorname{sign}(\lambda_1)\, e^{\lambda_1 s + \beta^t \lambda_1 x} - \lambda\, \operatorname{sign}(\lambda_2)\, e^{\lambda_2 s + \beta^t \lambda_2 x} \Big\}\, dF(x). \tag{3.3}
$$
Then, the following recursive equation holds:
$$
P_t^\lambda(i, s) = \max_{a \in A(i)} \sum_{j \in S} q_{ij}(a)\, P_{t+1}^\lambda\big(j, s + \beta^t r(i,a,j)\big). \tag{3.4}
$$
In fact, by using the dynamic programming method,
$$
\begin{aligned}
P_t^\lambda(i, s) &= \sup_{F \in \Phi(i)} \int \Big\{ \operatorname{sign}(\lambda_1)\, e^{\lambda_1 s + \beta^t \lambda_1 x} - \lambda\, \operatorname{sign}(\lambda_2)\, e^{\lambda_2 s + \beta^t \lambda_2 x} \Big\}\, dF(x) \\
&= \max_{a \in A(i)} \sum_{j \in S} q_{ij}(a) \sup_{F \in \Phi(j)} \int \Big\{ \operatorname{sign}(\lambda_1)\, e^{\lambda_1 (s + \beta^t r(i,a,j)) + \beta^{t+1} \lambda_1 x} - \lambda\, \operatorname{sign}(\lambda_2)\, e^{\lambda_2 (s + \beta^t r(i,a,j)) + \beta^{t+1} \lambda_2 x} \Big\}\, dF(x) \\
&= \max_{a \in A(i)} \sum_{j \in S} q_{ij}(a)\, P_{t+1}^\lambda\big(j, s + \beta^t r(i,a,j)\big).
\end{aligned}
$$
Obviously,
$$
\lim_{t \to \infty} P_t^\lambda(i, s) = \operatorname{sign}(\lambda_1)\, e^{\lambda_1 s} - \lambda\, \operatorname{sign}(\lambda_2)\, e^{\lambda_2 s}.
$$
Also, $U_t\{g_\lambda\}$ in (3.1) can be written as follows:
$$
U_t\{g_\lambda\}(s, i, a, j) = P_{t+1}^\lambda\big(j, s + \beta^t r(i,a,j)\big) + \lambda \alpha. \tag{3.5}
$$
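To see how the recursion (3.4) and the optimal action sets (3.2) might be evaluated in practice, here is a hedged Python sketch on a hypothetical finite two-state instance. It runs the backward recursion over a finite truncation horizon and closes it with the limiting value given above; the transition law, rewards, risk sensitivities $\lambda_1, \lambda_2$, the multiplier $\lambda$, the discount factor, and the horizon are all illustrative assumptions, and the truncation only approximates the infinite-horizon quantity (the paper's setting also allows compact, non-finite action sets).

```python
import math
from functools import lru_cache

# Hedged sketch of the recursion (3.4) for the exponential utility case on a
# hypothetical two-state, two-action instance.  The recursion is truncated at a
# horizon T and closed with the limiting value lim_t P_t^lambda(i, s), so the
# output only approximates the infinite-horizon quantity.  All numbers below
# (transition law, rewards, l1, l2, lam, beta, T) are illustrative assumptions.
S = (0, 1)
ACTIONS = ("stay", "move")
q = {0: {"stay": (0.9, 0.1), "move": (0.2, 0.8)},
     1: {"stay": (0.3, 0.7), "move": (0.6, 0.4)}}
r = lambda i, a, j: 1.0 if j == 1 else -0.5

beta, T = 0.9, 12            # discount factor and truncation horizon
l1, l2 = 0.5, -0.5           # risk sensitivities lambda_1, lambda_2 (both nonzero)
lam = 0.3                    # Lagrange multiplier lambda >= 0
sign = lambda d: 1.0 if d > 0 else -1.0

def tail(s):
    """Limiting value of P_t^lambda(i, s) as t -> infinity, used at the horizon T."""
    return sign(l1) * math.exp(l1 * s) - lam * sign(l2) * math.exp(l2 * s)

@lru_cache(maxsize=None)
def P(t, i, s):
    """Backward evaluation of (3.4); s is the discounted reward accumulated so far."""
    if t >= T:
        return tail(s)
    return max(sum(q[i][a][j] * P(t + 1, j, round(s + beta ** t * r(i, a, j), 10))
                   for j in S)
               for a in ACTIONS)

def A_opt(t, i, s):
    """Optimal action set A_t^lambda(s, i) of (3.2), computed through (3.4) and (3.5);
    the constant lambda*alpha from (3.5) is dropped since it does not affect the argmax."""
    vals = {a: sum(q[i][a][j] * P(t + 1, j, round(s + beta ** t * r(i, a, j), 10))
                   for j in S)
            for a in ACTIONS}
    best = max(vals.values())
    return [a for a, v in vals.items() if abs(v - best) < 1e-9]

print("P_0^lambda(0, 0) ~", round(P(0, 0, 0.0), 4), "  A_0^lambda(0, 0) =", A_opt(0, 0, 0.0))
```

For a fixed multiplier $\lambda$, the set returned by A_opt plays the role of $A_t^{\lambda}(s, i)$ in Theorem 3.1(i) for this toy instance.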
Similar articles
Accelerated decomposition techniques for large discounted Markov decision processes
Many hierarchical techniques to solve large Markov decision processes (MDPs) are based on the partition of the state space into strongly connected components (SCCs) that can be classified into some levels. In each level, smaller problems named restricted MDPs are solved, and then these partial solutions are combined to obtain the global solution. In this paper, we first propose a novel algorith...
Iterated risk measures for risk-sensitive Markov decision processes with discounted cost
We demonstrate a limitation of discounted expected utility, a standard approach for representing the preference to risk when future cost is discounted. Specifically, we provide an example of the preference of a decision maker that appears to be rational but cannot be represented with any discounted expected utility. A straightforward modification to discounted expected utility leads to inconsis...
Continuous Time Discounted Jump Markov Decision Processes: A Discrete-Event Approach
This paper introduces and develops a new approach to the theory of continuous time jump Markov decision processes (CTJMDP). This approach reduces discounted CTJMDPs to discounted semi-Markov decision processes (SMDPs) and eventually to discrete-time Markov decision processes (MDPs). The reduction is based on the equivalence of strategies that change actions between jumps and the randomized stra...
Semi-Markov Decision Processes
Considered are infinite horizon semi-Markov decision processes (SMDPs) with finite state and action spaces. Total expected discounted reward and long-run average expected reward optimality criteria are reviewed. Solution methodology for each criterion is given; constraints and variance sensitivity are also discussed.
Chapter for MARKOV DECISION PROCESSES
Mixed criteria are linear combinations of standard criteria which cannot be represented as standard criteria. Linear combinations of total discounted and average rewards as well as linear combinations of total discounted rewards are examples of mixed criteria. We discuss the structure of optimal policies and algorithms for their computation for problems with and without constraints.
Dynamic programming in constrained Markov decision processes
We consider a discounted Markov Decision Process (MDP) supplemented with the requirement that another discounted loss must not exceed a specified value, almost surely. We show that the problem can be reformulated as a standard MDP and solved using the Dynamic Programming approach. An example on a controlled queue is presented. In the last section, we briefly reinforce the connection of the Dyna...
Journal: Computers & Mathematics with Applications
Volume: 51
Pages: -
Publication date: 2006
DOI: 10.1016/j.camwa.2005.11.013