Abstract

In this study, the numerical calculation of optimal policy pairs in two-person zero-sum stochastic games with unbounded reward functions and state-dependent discount factors is studied. First, the expected discount criterion, ε-optimal values, and policy pairs are defined for the zero-sum stochastic game model. Then, an iterative algorithm is given and its correctness is verified. Finally, an inventory-system example is presented, the numerical simulation is carried out according to the steps of the iterative algorithm, and the difference between varying discount factors and a constant discount factor is examined in a further discussion.

1. Introduction

According to the time parameter and the state process, research on zero-sum stochastic games can be roughly divided into four categories: (i) discrete-time Markov games, see [1–3], (ii) continuous-time Markov games, see [4, 5], (iii) semi-Markov games, see [6, 7], and (iv) stochastic differential games, see [8, 9]. Discrete-time Markov games are the most basic and are very convenient in practical operations such as action selection, as well as in designing algorithms, solving for game values, and calculating optimal policies, so they have been widely applied in the real world. The total reward obtained by the players is generally computed either as the discounted sum of the stage rewards or as the limit inferior of the average stage reward. Thus, the expected discounted reward criterion and the expected average reward criterion are two important research directions in stochastic games. The most studied is the infinite-horizon expected discounted criterion, which considers the expected total discounted return of the system over its long-run operation and is widely used in economic and financial models [4, 10].

The discounted criterion for stochastic games with a constant discount factor has been widely studied; see [4, 6, 11]. In recent years, interest in discounted stochastic games with a nonconstant discount factor has been increasing, both because of their importance in applications and because of the mathematical challenges that their analysis entails. It is therefore reasonable to consider varying discount factors; see, for example, the works of Schall [12], Gonzalez-Hernandez et al. [13, 14], Zhang [15], and Stachurski and Zhang [16] and the references therein.

In addition, most of the aforementioned works focus mainly on the existence of optimal values and policies for stochastic games. From a practical point of view, however, the computation of optimal values and policies is more important. A common method is to transform the stochastic game problem into a linear programming problem, which is then used to identify and calculate optimal stationary policies. For example, Hordijk and Kallenberg [17] used linear programming to find average-optimal policies of general Markov decision processes (abbreviated as MDPs), Denardo and Fox [18] obtained a linear programming algorithm for some special cases of undiscounted semi-Markov decision processes (semi-MDPs), and Vrieze [19] gave a linear programming algorithm for undiscounted stochastic games with finite state and action spaces. Recently, Yang et al. [9, 20, 21] proposed new algorithms for optimal control problems, such as a novel adaptive optimal control design method for the constrained control problem in [9], an iterative adaptive dynamic programming (ADP) algorithm for the infinite-horizon continuous-time optimal control problem for nonlinear systems in [20], and λ-policy iteration (λ-PI) for the discrete-time linear quadratic regulation problem in [21].

In this study, we mainly discuss discrete-time Markov games with a finite state space, countable action spaces, unbounded reward functions, and state-dependent discount factors; we give an iterative algorithm and verify its correctness. Moreover, we illustrate the numerical calculation with an example of an inventory system.

This study contains three main contributions:
(a) The discount factor α(i) is a state-dependent measurable function from the state space to [0, 1), which generalizes the case of a constant discount factor.
(b) For two-person zero-sum Markov games with a finite state space and state-dependent discount factors, the ε-optimal value and policy pairs can be calculated by the iterative algorithm, which lays a solid foundation for the numerical treatment of the countable and general state space cases.
(c) We illustrate the approximate calculation with a discrete-time inventory system and, in a further discussion, exhibit the difference between varying discount factors and a constant discount factor.

This study is organized as follows. In Section 2, we introduce the two-person zero-sum stochastic game model and its expected discount criterion. In Section 3, under suitable conditions, we give the iterative algorithm to calculate the ε-optimal value and policy pairs by linear programming. In Section 4, we give an example to illustrate the numerical calculation of the ε-optimal value and policy pairs by the iterative algorithm. In Section 5, we discuss the difference between varying discount factors and a constant discount factor. Finally, Section 6 concludes the study.

2. Zero-Sum Stochastic Game Model and Expected Discount Criterion

Consider the two-person zero-sum stochastic game model given by the following collection of elements, where the state space is finite and equipped with its σ-field, and the action spaces of player 1 and player 2 are countable spaces equipped with their respective σ-fields. For each state, the admissible action sets of player 1 and player 2 are given, and the set of admissible state-action triples is a measurable subset of the product space. The transition probability is a stochastic kernel on the state space given the admissible triples: for any admissible state-action triple it is a probability measure on the state space, and for any measurable set of states it is a Borel function on the set of admissible triples. The discount factor is a measurable function from the state space to [0, 1). The reward (or cost) function is a measurable function defined on the set of admissible triples; it is the amount that player 1 gets (and player 2 pays) when the system is in the current state and the two players take their respective actions (each player selects actions independently).

Definition 1 (see [4, 5, 22]). (a) If there is a sequence of stochastic kernels on the action space of player 1 given the current state such that each kernel is concentrated on the admissible action set of that state, then we call this sequence a randomized Markov policy of player 1, and the class of all randomized Markov policies of player 1 is denoted accordingly. (b) A randomized Markov policy is said to be stationary if there is a probability measure, depending only on the current state, that the policy applies at every stage; the class of all stationary policies of player 1 is denoted accordingly.
Similarly, we can define the randomized Markov policy class and the stationary policy class of player 2, and is said to be the policy pair.
Let the spaces of admissible histories be defined in the usual way; an element of such a space is a vector recording the successive states and actions, which is known as the history of the system up to the current time. For any initial state and each policy pair, by the Ionescu–Tulcea theorem (see [23]), there exists a unique probability measure and an underlying stochastic process defined on the canonical sample space such that the initial-state, action-selection, and state-transition properties hold. Moreover, the expectation operator with respect to this probability measure is denoted correspondingly.
By independence of the players' action choices, for any initial state and any policy pair, the joint conditional distribution of the two players' actions at each stage is the product of their individual (conditional) action distributions.

Definition 2 (see [4, 5, 22]). (a) For any initial state, any policy pair, and the given discount factor, the expected discounted criteria of player 1 and player 2 are defined by the corresponding displays. (b) The lower and upper values of model (1) are defined accordingly. Obviously, for every state, the lower value does not exceed the upper value. Furthermore, if the two coincide for all states, then we call the common function the optimal value of model (1). (c) If the stochastic game (1) has an optimal value, then we call a policy of player 1 optimal if it guarantees player 1 the optimal value against every policy of player 2; similarly, we call a policy of player 2 optimal if it guarantees that player 2 pays no more than the optimal value against every policy of player 1. Furthermore, if the two players' policies are both optimal, then we call the pair an optimal policy pair.
Now, we introduce some notation and terminology. For the finite state space, the set of probability measures on it is endowed with the topology of weak convergence. Furthermore, we fix a measurable weight function and call a real-valued function on the state space weighted-bounded if its weighted norm is finite, where the weighted norm is the maximum over states of the absolute value of the function divided by the weight. The Banach space of all weighted-bounded measurable functions on the state space is denoted accordingly.
In addition, for each state, each admissible action pair, and each weighted-bounded function, we introduce the one-stage quantities in (14); then, the relation in (15) follows. Also, for any pair of mixed actions of the two players, we can define the corresponding averaged quantities. Furthermore, (14) and (15) can be written, respectively, in the corresponding stationary forms if both policies are stationary.
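Since the display formulas in this part did not survive, the following LaTeX fragment is a hedged sketch, in assumed notation, of how the state-dependent discounted payoff and the weighted norm are typically written; the symbols J, x_t, a_t, b_t, alpha, r, w, S, and u are not necessarily the paper's own.

```latex
% Hedged sketch, assumed notation: state-dependent discounted payoff and weighted norm.
\[
  J(i,\pi,\sigma)
  = \mathbb{E}^{\pi,\sigma}_{i}\!\left[ r(x_0,a_0,b_0)
    + \sum_{t=1}^{\infty}\Bigl(\prod_{k=0}^{t-1}\alpha(x_k)\Bigr) r(x_t,a_t,b_t) \right],
  \qquad
  \|u\|_{w} = \max_{i\in S}\frac{|u(i)|}{w(i)} .
\]
```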

3. Iterative Algorithm of Optimal Policy Pairs

In this section, we will give the iterative algorithm of optimal value and policy pairs in two-person zero-sum stochastic games. In order to guarantee the existence of optimal value and policy pairs, we need the following assumptions.

Assumption 1 (see [4, 5, 22, 24]). (a) There exists a constant smaller than 1 that bounds the discount factors from above. (b) There exist nonnegative constants (satisfying the usual smallness relation) and a weight function such that, for all admissible state-action triples, the expected growth inequalities hold. (c) For each state, the admissible action sets of the two players are compact. (d) For each state, the reward function is continuous on the set of admissible action pairs. (e) For each state and any bounded measurable function on the state space, the corresponding expectation with respect to the transition law is continuous in the actions, and the same holds for the weight function.

Remark 1. Assumption 1 (a) obviously holds in the case where the discount factor is constant, and (b) shows that the reward function may be unbounded both from above and from below. Assumption 1 (a) and (b) are the so-called "expected growth" conditions (see Assumption 3.1 in [24] and Assumption 1 in [22]), and (c)–(e) are the "continuous-compact" conditions (see [3–5, 11, 22, 24] and their references).
By Theorem 1 in [22], we have a direct conclusion as follows.

Lemma 1 (see [22]). Suppose that Assumption 1 holds; then, (a) the optimal value of the two-person zero-sum stochastic game (1) exists and satisfies equation (18); in addition, it is the unique solution of (18) in the space of weighted-bounded functions. (b) A stationary policy pair is optimal if and only if it solves (18).
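Equation (18) itself is not reproduced above; as a hedged sketch in assumed notation, an optimality equation of Shapley type with a state-dependent discount factor usually takes the following form, where val[·] denotes the value of the zero-sum matrix game obtained by mixing over the admissible actions in the given state:

```latex
% Hedged sketch of an optimality equation of Shapley type (assumed notation).
\[
  V^{*}(i)
  = \operatorname{val}\Bigl[\, r(i,a,b)
    + \alpha(i) \sum_{j\in S} Q(j\mid i,a,b)\, V^{*}(j) \Bigr],
  \qquad i\in S .
\]
```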

Remark 2. Lemma 1 is not only an important theoretical foundation for finding the optimal value and optimal policy pairs but also an important basis for designing iterative algorithms.
Next, we give the iterative algorithm, for which we first need the definition of ε-optimal policies.
Let the optimal value of the two-person zero-sum stochastic game (1) be as above; if condition (19) holds, then we call the corresponding pair of policies an ε-optimal policy pair, and the corresponding value an ε-optimal value. Note that condition (19) is guaranteed by the proof of Theorem 1, so the ε-optimal policy pair and ε-optimal value are well defined.
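As a hedged sketch of what condition (19) typically requires (the policy-class symbols Π and Σ are assumed notation here):

```latex
% Hedged sketch of the epsilon-optimality condition (assumed notation).
\[
  J(i,\pi,\sigma^{*}_{\varepsilon}) \le V^{*}(i) + \varepsilon
  \quad\text{and}\quad
  J(i,\pi^{*}_{\varepsilon},\sigma) \ge V^{*}(i) - \varepsilon,
  \qquad \forall\, i\in S,\ \pi\in\Pi,\ \sigma\in\Sigma .
\]
```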
In addition, for a given initial value, we define the iterative value sequence, where the iteration operator is given by the corresponding display. Moreover, for a fixed iterate, we introduce the associated per-state quantities. Inspired by [25], we can propose an iterative algorithm to calculate the ε-optimal value and policy pairs as follows.

Theorem 1 (iterative algorithm). Suppose that Assumption 1 holds; then, the ε-optimal value and policy pairs can be obtained by the following steps.
Step 1: for every state, choose an initial value.
Step 2: at the current iteration, for every state, the updated value and player 1's optimal mixed action in the one-stage game can be obtained from the primal linear program; then, for every state, player 2's optimal mixed action can be obtained from the dual linear program (see the code sketch after the theorem).
Step 3: regarding the resulting quantity as the payoff function, by [22], we can obtain the corresponding stationary policy pair.
Step 4: for sufficiently small ε, if the stopping criterion is satisfied, the iteration stops; the current iterate is then the ε-optimal value, and the current policy pair is an ε-optimal policy pair. Otherwise, advance the iteration index and return to Step 2.
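To make Steps 1–4 concrete, the following Python sketch implements a Shapley-type value iteration in which each state's one-stage matrix game is solved by a primal/dual pair of linear programs via scipy.optimize.linprog. The function names solve_matrix_game and shapley_iteration, the data layout of r, Q, and alpha, and the plain sup-norm stopping rule are illustrative assumptions, not the paper's exact specification.

```python
# Hedged sketch (not the paper's exact specification): Shapley-type value
# iteration in which each state's one-stage matrix game is solved by a
# primal/dual pair of linear programs.
import numpy as np
from scipy.optimize import linprog


def solve_matrix_game(M):
    """Value and optimal mixed strategies of the zero-sum matrix game M,
    with player 1 as the maximizing row player and player 2 as the
    minimizing column player."""
    m, n = M.shape
    # Player 1 (primal LP): maximize v subject to M^T x >= v*1, sum(x) = 1, x >= 0.
    c1 = np.r_[np.zeros(m), -1.0]                      # minimize -v
    A_ub1 = np.hstack([-M.T, np.ones((n, 1))])         # v - (M^T x)_j <= 0
    A_eq1 = np.r_[np.ones(m), 0.0].reshape(1, -1)      # sum(x) = 1
    p1 = linprog(c1, A_ub=A_ub1, b_ub=np.zeros(n), A_eq=A_eq1, b_eq=[1.0],
                 bounds=[(0, None)] * m + [(None, None)], method="highs")
    # Player 2 (dual LP): minimize w subject to M y <= w*1, sum(y) = 1, y >= 0.
    c2 = np.r_[np.zeros(n), 1.0]
    A_ub2 = np.hstack([M, -np.ones((m, 1))])           # (M y)_i - w <= 0
    A_eq2 = np.r_[np.ones(n), 0.0].reshape(1, -1)      # sum(y) = 1
    p2 = linprog(c2, A_ub=A_ub2, b_ub=np.zeros(m), A_eq=A_eq2, b_eq=[1.0],
                 bounds=[(0, None)] * n + [(None, None)], method="highs")
    return p1.x[-1], p1.x[:m], p2.x[:n]                # value, x*, y*


def shapley_iteration(r, Q, alpha, eps=1e-5, max_iter=1000):
    """r[i]: payoff matrix in state i; Q[i][a, b, j]: probability of moving
    from state i to state j under actions (a, b); alpha[i]: discount factor."""
    S = len(r)
    V = np.zeros(S)                                    # Step 1: initial value
    pol1, pol2 = [], []
    for _ in range(max_iter):
        V_new, pol1, pol2 = np.empty(S), [], []
        for i in range(S):
            # Step 2: one-stage payoff plus discounted continuation value,
            # then the matrix game is solved by the primal/dual LPs.
            M = r[i] + alpha[i] * Q[i].dot(V)
            V_new[i], x, y = solve_matrix_game(M)
            pol1.append(x)
            pol2.append(y)
        if np.max(np.abs(V_new - V)) < eps:            # Step 4: stopping test
            return V_new, pol1, pol2
        V = V_new                                      # otherwise iterate again
    return V, pol1, pol2
```

Solving the two linear programs per state mirrors the primal/dual pair in Step 2, and the sup-norm test plays the role of the stopping criterion in Step 4.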

Remark 3. The iterative algorithm combines the value iteration algorithm with the linear programming method for matrix games; it differs both from the value (or policy) iteration algorithms in the existing literature, such as [17, 18, 22, 23], and from the linear programming methods for matrix games, such as [19, 25].
Furthermore, in order to verify the correctness of the above algorithm, we give the following theorem.

Theorem 2. Suppose that Assumption 1 holds; then, (a) for any given ε > 0, there exists a nonnegative integer, given by the corresponding expression (in which the round-down function and the indicator function appear), such that the stopping criterion of the algorithm is met; that is to say, the iteration converges within that number of steps. (b) The policy pair obtained from the above iterative algorithm is an ε-optimal policy pair, with the achieved accuracy given by the corresponding expression.

Proof. By the proof of Lemma 1 in [24], the iteration operator is a contraction with respect to the weighted norm.
(a) By the definition of the iterative sequence, the distance between successive iterates decreases geometrically, which yields a geometric bound on the gap at each step. For any given ε, we distinguish two cases according to the size of the initial gap: in the first case, a small iteration index already meets the stopping criterion; otherwise, choosing the index according to the stated formula gives the required bound. Thus, with this choice of the iteration index, the stopping criterion is satisfied.
(b) By Theorem 1 in [24], we have the corresponding estimate; combining it with the stopping criterion yields a bound on the distance between the current iterate and the optimal value. Letting the iteration stop at that step, it follows that the current iterate is the ε-optimal value for each state. Therefore, the policy pair obtained by the algorithm is an ε-optimal policy pair.
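As a rough numerical companion to part (a), the following sketch computes the generic geometric-convergence bound for a contraction with modulus alpha_hat: if the initial gap is at most C, the error after n iterations is at most C * alpha_hat**n. The exact constant and tolerance used in the theorem involve ε and the weight function and are not reproduced here; this generic form is an assumption.

```python
# Hedged sketch of an a-priori iteration count for a contraction with modulus
# alpha_hat and initial gap at most C; the paper's exact constant and tolerance
# are not reproduced here, so this generic bound is an assumption.
import math


def iterations_needed(alpha_hat, C, tol):
    """A sufficient number of iterations n with C * alpha_hat**n <= tol."""
    if C <= tol:
        return 0
    return math.floor(math.log(tol / C) / math.log(alpha_hat)) + 1


# Example: alpha_hat = 0.7, initial gap 10, tolerance 1e-5 -> 39 iterations.
print(iterations_needed(0.7, 10.0, 1e-5))
```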

4. An Application Example

Example 1. Consider an inventory system whose storage level is the state of the system. In each stage, the system has only three states, which represent different storage levels: when the system is in state 1, the warehouse storage is loose; when it is in state 2, the warehouse storage is tight; and when it is in state 3, the warehouse storage is relatively loose. When the system is in a given state, player 1 chooses the supply from his admissible action set and player 2 chooses the order quantity from his admissible action set, and then player 1 receives the corresponding reward. Given the current state and the actions taken, the system moves to the next state according to the transition probabilities. In addition, we assume that the discount factors depend on the storage level of the warehouse.
These two players are in a state of competition. Player 1 wants to maximize his own reward, while player 2 holds the opposite attitude, that is, to minimize the reward of player 1 (which can also be regarded as the cost of player 2).
From the above description, we obtain the model of the two-person zero-sum stochastic game with varying discount factors: the state space, the admissible action sets of the two players, the state-dependent discount factors, and the transition probabilities are specified accordingly. For the numerical calculation, we set the model parameter values as shown in Table 1.
With an appropriate choice of the weight function, Assumption 1 holds. By Lemma 1, there exists an optimal policy pair for the inventory system.
Next, we use the iterative algorithm to find the optimal values and policy pairs of the game. The specific steps are as follows:
Step 1: choose the initial value in every state.
Step 2: for every state and every admissible action pair, form the one-stage payoffs and solve the primal linear program; its solution gives the updated value and player 1's mixed action in state 1, and the corresponding dual linear program gives player 2's mixed action in state 1. Similarly, we obtain the updated values and the two players' mixed actions in states 2 and 3 (an illustrative snippet follows this example).
Step 3: if the stopping criterion is satisfied, the iteration stops; the current iterate is the ε-optimal value and the current policy pair is an ε-optimal policy pair.
Finally, by the software MATLAB, we obtain the following:
(1) When the stopping criterion is first satisfied, the iteration stops, and the ε-optimal values are obtained.
(2) In state 1, the ε-optimal policy pair is that player 1 mixes his two admissible actions with probabilities 0.68045 and 0.31955 and player 2 mixes his two admissible actions with probabilities 0.73865 and 0.26135.
(3) In state 2, the ε-optimal policy pair is that player 1 mixes his two admissible actions with probabilities 0.38703 and 0.61297 and player 2 mixes his two admissible actions with probabilities 0.32584 and 0.67416.
(4) In state 3, the ε-optimal policy pair is that each player chooses a single action with probability 1.
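For illustration only, the following snippet shows how the 3-state inventory game could be fed to the shapley_iteration sketch from Section 3 (assumed to be in scope). The reward matrices, transition kernels, and discount factors below are hypothetical placeholders; the actual parameter values are those reported in Table 1 and are not reproduced in the text.

```python
# Hypothetical usage sketch for the 3-state inventory game of Example 1;
# all numerical values below are placeholders, not the paper's Table 1 data.
import numpy as np

S = 3                                        # storage levels 1, 2, 3
alpha = np.array([0.6, 0.7, 0.5])            # placeholder state-dependent discounts
rng = np.random.default_rng(0)
r = [rng.uniform(-5.0, 5.0, size=(2, 2)) for _ in range(S)]   # placeholder payoffs
Q = []
for i in range(S):                           # placeholder transition kernels
    q = rng.uniform(size=(2, 2, S))
    Q.append(q / q.sum(axis=2, keepdims=True))

V, pol1, pol2 = shapley_iteration(r, Q, alpha, eps=1e-5)
print("approximate game values:", np.round(V, 5))
print("player 1 mixed actions per state:", [np.round(x, 5) for x in pol1])
print("player 2 mixed actions per state:", [np.round(y, 5) for y in pol2])
```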

5. Discussion: Comparison between the Constant and Varying Discount Factors

To better compare the varying discount factors with a constant discount factor, we compare, for the game in Example 1, the game values and the numbers of iterations in Table 2 for the varying discount factor case and for the constant discount factor case.

The error between two adjacent iterative values and the ε-optimal policy pairs are shown in Figure 1 for the varying discount factors and in Figure 2 for the constant discount factor. Note that, from the plots of the error between two adjacent iterative values, it can be seen that, in the iterative process, this error decreases exponentially and tends to zero as the number of iterations increases, which is consistent with our previous theoretical analysis.

From the above analysis, it is concluded that when the discount factors vary with the state, the game values and iteration counts differ from those under a fixed constant discount factor, and the optimal policy pairs also differ; this shows that varying discount factors affect the optimal values and policy pairs of the games. Therefore, it is of practical significance to consider varying discount factors in stochastic games.

6. Conclusion

In this study, we are concerned with the numerical calculation of the ε-optimal value and policy pairs in two-person zero-sum stochastic games with unbounded reward functions and state-dependent discount factors. For the two-person zero-sum Markov games with finite-state space and state-dependent discount factors, this study gives the iterative algorithm to calculate the ε-optimal value and policy pairs. Future work will discuss the numerical calculation of the ε-optimal value and policy pairs in two-person zero-sum stochastic games in countable state space and even general state space with varying discount factors. For the case of the countable state space, we will try to use the finite approximation technique. For the case of the general state space, we will introduce the concept of quantized policies and use the optimal policies in finite action space to approximate the optimal policies in general state space.

Data Availability

No data were used to support this study.

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 11961005) and the Opening Project of the Guangdong Province Key Laboratory of Computational Science at Sun Yat-sen University (Grant no. 2021021).