Abstract

In this study, the numerical calculation of optimal policy pairs in two-person zero-sum stochastic games with unbounded reward functions and state-dependent discount factors is studied. First, the expected discount criterion, ε-optimal values, and policy pairs are defined for the zero-sum stochastic game model. Then, an iterative algorithm is given and its correctness is verified. Finally, an inventory-system example is presented, the numerical simulation is carried out according to the steps of the iterative algorithm, and the difference between varying discount factors and a constant discount factor is examined in a further discussion.

1. Introduction

According to the time parameter and the state process, research on zero-sum stochastic games can be roughly divided into four categories: (i) discrete-time Markov games, see [1–3], (ii) continuous-time Markov games, see [4, 5], (iii) semi-Markov games, see [6, 7], and (iv) stochastic differential games, see [8, 9]. Discrete-time Markov games are the most basic and are very convenient in practical operations such as action selection, as well as in designing algorithms, solving for game values, and calculating optimal policies, so they have been widely applied in the real world. The total reward obtained by the players is generally computed either as the discounted sum of the stage rewards or as the limit inferior of the average stage reward. Thus, the expected discounted reward criterion and the expected average reward criterion are two important research directions in stochastic games. The most studied is the infinite-horizon expected discounted criterion, which considers the expected total discounted return of the system over its long-run operation and is widely used in economic and financial models [4, 10].

The discounted criterion for stochastic games with a constant discount factor has been widely studied; see [4, 6, 11]. In recent years, interest in discounted stochastic games with a nonconstant discount factor has been increasing, both because of their importance in applications and because of the mathematical challenges that their analysis entails. It is therefore reasonable to consider varying discount factors; see, for example, the works of Schall [12], Gonzalez-Hernandez et al. [13, 14], Zhang [15], and Stachurski and Zhang [16] and the references therein.

In addition, most of the aforementioned works focus mainly on the existence of optimal values and policies for stochastic games. From a practical point of view, however, the computation of optimal values and policies is more important. A common method is to transform the stochastic game problem into a linear programming problem, which is then used to identify and calculate optimal stationary policies. For example, Hordijk and Kallenberg [17] used linear programming to find average-optimal policies of general Markov decision processes (abbreviated as MDPs), Denardo and Fox [18] obtained a linear programming algorithm for some special cases of undiscounted semi-Markov decision processes (semi-MDPs), and Vrieze [19] gave a linear programming algorithm for undiscounted stochastic games with finite state and action spaces. Recently, Yang et al. [9, 20, 21] proposed new algorithms for optimal control problems, such as a novel adaptive optimal control design method for the constrained control problem in [9], an iterative adaptive dynamic programming (ADP) algorithm for the infinite-horizon continuous-time optimal control problem for nonlinear systems in [20], and λ-policy iteration (λ-PI) for the discrete-time linear quadratic regulation problem in [21].

In this study, we mainly discuss discrete-time Markov games with a finite state space, countable action spaces, unbounded reward functions, and state-dependent discount factors; we give an iterative algorithm and verify its correctness. Moreover, we illustrate the numerical calculation with an example of an inventory system.

This study contains three main contributions:
(a) The discount factor α(i) is a state-dependent measurable function from the state space to [0, 1), which generalizes the case of a constant discount factor.
(b) For two-person zero-sum Markov games with a finite state space and state-dependent discount factors, the ε-optimal value and policy pairs can be calculated by the iterative algorithm, which lays a solid foundation for the numerical treatment of the countable and general state space cases.
(c) We illustrate the approximate calculation with a discrete-time inventory system and, in a further discussion, exhibit the difference between varying discount factors and a constant discount factor.

This study is organized as follows. In Section 2, we introduce the two-person zero-sum stochastic game model and its expected discount criterion. In Section 3, under suitable conditions, we give the iterative algorithm to calculate the ε-optimal value and policy pairs by linear programming. In Section 4, we give an example to illustrate the numerical calculation of the ε-optimal value and policy pairs by the iterative algorithm. In Section 5, we discuss the difference between varying discount factors and a constant discount factor. Finally, Section 6 concludes the study.

2. Zero-Sum Stochastic Game Model and Expected Discount Criterion

Consider the two-person zero-sum stochastic game model given by the following collection of elements, where the state space is finite and equipped with its σ-field, and the action spaces of player 1 and player 2 are countable spaces equipped with their respective σ-fields. For each state, the admissible action sets of player 1 and player 2 are given, and the set of admissible state-action triples is a measurable subset of the product space. The transition probability is a stochastic kernel on the state space given the admissible triples: for any admissible state-action triple it is a probability measure on the state space, and for any measurable set of states it is a Borel function on the set of admissible triples. The discount factor is a measurable function from the state space to [0, 1). The reward (or cost) function is a measurable function defined on the set of admissible triples; it is the amount that player 1 gets (and player 2 pays) when the system is in the current state and the two players take their respective actions (each player selects actions independently).

Definition 1 (see [4, 5, 22]). (a) If there is a sequence of stochastic kernels on the action space of player 1 given the current state such that each kernel is concentrated on the admissible action set of that state, then we call this sequence a randomized Markov policy of player 1, and the class of all randomized Markov policies of player 1 is denoted accordingly. (b) A randomized Markov policy is said to be stationary if there is a probability measure, depending only on the current state, that the policy applies at every stage; the class of all stationary policies of player 1 is denoted accordingly.
Similarly, we can define the randomized Markov policy class and the stationary policy class of player 2, and is said to be the policy pair.
Let the spaces of admissible histories be defined in the usual way; an element of such a space is a vector recording the successive states and actions, which is known as the history of the system up to the current time. For any initial state and each policy pair, by the Ionescu–Tulcea theorem (see [23]), there exists a unique probability measure and an underlying stochastic process defined on the canonical sample space such that the initial-state, action-selection, and state-transition properties hold. Moreover, the expectation operator with respect to this probability measure is denoted correspondingly.
By independence of the players' action choices, for any initial state and any policy pair, the joint conditional distribution of the two players' actions at each stage is the product of their individual (conditional) action distributions.

Definition 2 (see [4, 5, 22]). (a) For any initial state, any policy pair, and the given discount factor, the expected discounted criteria of player 1 and player 2 are defined by the corresponding displays. (b) The lower and upper values of model (1) are defined accordingly. Obviously, for every state, the lower value does not exceed the upper value. Furthermore, if the two coincide for all states, then we call the common function the optimal value of model (1). (c) If the stochastic game (1) has an optimal value, then we call a policy of player 1 optimal if it guarantees player 1 the optimal value against every policy of player 2; similarly, we call a policy of player 2 optimal if it guarantees that player 2 pays no more than the optimal value against every policy of player 1. Furthermore, if the two players' policies are both optimal, then we call the pair an optimal policy pair.
Now, we introduce some notation and terminology. For the finite state space, the set of probability measures on it is endowed with the topology of weak convergence. Furthermore, we fix a measurable weight function and call a real-valued function on the state space weighted-bounded if its weighted norm is finite, where the weighted norm is the maximum over states of the absolute value of the function divided by the weight. The Banach space of all weighted-bounded measurable functions on the state space is denoted accordingly.
In addition, for each state, each admissible action pair, and each weighted-bounded function, we introduce the one-stage quantities in (14); then, the relation in (15) follows. Also, for any pair of mixed actions of the two players, we can define the corresponding averaged quantities. Furthermore, (14) and (15) can be written, respectively, in the corresponding stationary forms if both policies are stationary.
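Since the display formulas in this part did not survive, the following LaTeX fragment is a hedged sketch, in assumed notation, of how the state-dependent discounted payoff and the weighted norm are typically written; the symbols J, x_t, a_t, b_t, alpha, r, w, S, and u are not necessarily the paper's own.

```latex
% Hedged sketch, assumed notation: state-dependent discounted payoff and weighted norm.
\[
  J(i,\pi,\sigma)
  = \mathbb{E}^{\pi,\sigma}_{i}\!\left[ r(x_0,a_0,b_0)
    + \sum_{t=1}^{\infty}\Bigl(\prod_{k=0}^{t-1}\alpha(x_k)\Bigr) r(x_t,a_t,b_t) \right],
  \qquad
  \|u\|_{w} = \max_{i\in S}\frac{|u(i)|}{w(i)} .
\]
```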

3. Iterative Algorithm of Optimal Policy Pairs

In this section, we will give the iterative algorithm of optimal value and policy pairs in two-person zero-sum stochastic games. In order to guarantee the existence of optimal value and policy pairs, we need the following assumptions.

Assumption 1 (see [4, 5, 22, 24]). (a) There exists a constant smaller than 1 that bounds the discount factors from above. (b) There exist nonnegative constants (satisfying the usual smallness relation) and a weight function such that, for all admissible state-action triples, the expected growth inequalities hold. (c) For each state, the admissible action sets of the two players are compact. (d) For each state, the reward function is continuous on the set of admissible action pairs. (e) For each state and any bounded measurable function on the state space, the corresponding expectation with respect to the transition law is continuous in the actions, and the same holds for the weight function.

Remark 1. Assumption 1 (a) obviously holds in the case where the discount factor is constant, and (b) shows that the reward function may be unbounded both from above and from below. Assumption 1 (a) and (b) are the so-called "expected growth" conditions (see Assumption 3.1 in [24] and Assumption 1 in [22]), and (c)–(e) are the "continuous-compact" conditions (see [3–5, 11, 22, 24] and their references).
By Theorem 1 in [22], we have a direct conclusion as follows.

Lemma 1 (see [22]). Suppose that Assumption 1 holds; then, (a) the optimal value of the two-person zero-sum stochastic game (1) exists and satisfies equation (18); in addition, it is the unique solution of (18) in the space of weighted-bounded functions. (b) A stationary policy pair is optimal if and only if it solves (18).
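Equation (18) itself is not reproduced above; as a hedged sketch in assumed notation, an optimality equation of Shapley type with a state-dependent discount factor usually takes the following form, where val[·] denotes the value of the zero-sum matrix game obtained by mixing over the admissible actions in the given state:

```latex
% Hedged sketch of an optimality equation of Shapley type (assumed notation).
\[
  V^{*}(i)
  = \operatorname{val}\Bigl[\, r(i,a,b)
    + \alpha(i) \sum_{j\in S} Q(j\mid i,a,b)\, V^{*}(j) \Bigr],
  \qquad i\in S .
\]
```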

Remark 2. Lemma 1 is not only an important theoretical foundation for finding the optimal value and optimal policy pairs but also an important basis for designing iterative algorithms.
Next, we give the iterative algorithm, for which we first need the definition of ε-optimal policies.
Let the optimal value of the two-person zero-sum stochastic game (1) be as above; if condition (19) holds, then we call the corresponding pair of policies an ε-optimal policy pair, and the corresponding value an ε-optimal value. Note that condition (19) is guaranteed by the proof of Theorem 1, so the ε-optimal policy pair and ε-optimal value are well defined.
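As a hedged sketch of what condition (19) typically requires (the policy-class symbols Π and Σ are assumed notation here):

```latex
% Hedged sketch of the epsilon-optimality condition (assumed notation).
\[
  J(i,\pi,\sigma^{*}_{\varepsilon}) \le V^{*}(i) + \varepsilon
  \quad\text{and}\quad
  J(i,\pi^{*}_{\varepsilon},\sigma) \ge V^{*}(i) - \varepsilon,
  \qquad \forall\, i\in S,\ \pi\in\Pi,\ \sigma\in\Sigma .
\]
```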
In addition, for a given initial value, we define the iterative value sequence, where the iteration operator is given by the corresponding display. Moreover, for a fixed iterate, we introduce the associated per-state quantities. Inspired by [25], we can propose an iterative algorithm to calculate the ε-optimal value and policy pairs as follows.

Theorem 1 (iterative algorithm). Suppose that Assumption 1 holds; then, the ε-optimal value and policy pairs can be obtained by the following steps.
Step 1: for every state, choose an initial value.
Step 2: at the current iteration, for every state, the updated value and player 1's optimal mixed action in the one-stage game can be obtained from the primal linear program; then, for every state, player 2's optimal mixed action can be obtained from the dual linear program (see the code sketch after the theorem).
Step 3: regarding the resulting quantity as the payoff function, by [22], we can obtain the corresponding stationary policy pair.
Step 4: for sufficiently small ε, if the stopping criterion is satisfied, the iteration stops; the current iterate is then the ε-optimal value, and the current policy pair is an ε-optimal policy pair. Otherwise, advance the iteration index and return to Step 2.
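To make Steps 1–4 concrete, the following Python sketch implements a Shapley-type value iteration in which each state's one-stage matrix game is solved by a primal/dual pair of linear programs via scipy.optimize.linprog. The function names solve_matrix_game and shapley_iteration, the data layout of r, Q, and alpha, and the plain sup-norm stopping rule are illustrative assumptions, not the paper's exact specification.

```python
# Hedged sketch (not the paper's exact specification): Shapley-type value
# iteration in which each state's one-stage matrix game is solved by a
# primal/dual pair of linear programs.
import numpy as np
from scipy.optimize import linprog


def solve_matrix_game(M):
    """Value and optimal mixed strategies of the zero-sum matrix game M,
    with player 1 as the maximizing row player and player 2 as the
    minimizing column player."""
    m, n = M.shape
    # Player 1 (primal LP): maximize v subject to M^T x >= v*1, sum(x) = 1, x >= 0.
    c1 = np.r_[np.zeros(m), -1.0]                      # minimize -v
    A_ub1 = np.hstack([-M.T, np.ones((n, 1))])         # v - (M^T x)_j <= 0
    A_eq1 = np.r_[np.ones(m), 0.0].reshape(1, -1)      # sum(x) = 1
    p1 = linprog(c1, A_ub=A_ub1, b_ub=np.zeros(n), A_eq=A_eq1, b_eq=[1.0],
                 bounds=[(0, None)] * m + [(None, None)], method="highs")
    # Player 2 (dual LP): minimize w subject to M y <= w*1, sum(y) = 1, y >= 0.
    c2 = np.r_[np.zeros(n), 1.0]
    A_ub2 = np.hstack([M, -np.ones((m, 1))])           # (M y)_i - w <= 0
    A_eq2 = np.r_[np.ones(n), 0.0].reshape(1, -1)      # sum(y) = 1
    p2 = linprog(c2, A_ub=A_ub2, b_ub=np.zeros(m), A_eq=A_eq2, b_eq=[1.0],
                 bounds=[(0, None)] * n + [(None, None)], method="highs")
    return p1.x[-1], p1.x[:m], p2.x[:n]                # value, x*, y*


def shapley_iteration(r, Q, alpha, eps=1e-5, max_iter=1000):
    """r[i]: payoff matrix in state i; Q[i][a, b, j]: probability of moving
    from state i to state j under actions (a, b); alpha[i]: discount factor."""
    S = len(r)
    V = np.zeros(S)                                    # Step 1: initial value
    pol1, pol2 = [], []
    for _ in range(max_iter):
        V_new, pol1, pol2 = np.empty(S), [], []
        for i in range(S):
            # Step 2: one-stage payoff plus discounted continuation value,
            # then the matrix game is solved by the primal/dual LPs.
            M = r[i] + alpha[i] * Q[i].dot(V)
            V_new[i], x, y = solve_matrix_game(M)
            pol1.append(x)
            pol2.append(y)
        if np.max(np.abs(V_new - V)) < eps:            # Step 4: stopping test
            return V_new, pol1, pol2
        V = V_new                                      # otherwise iterate again
    return V, pol1, pol2
```

Solving the two linear programs per state mirrors the primal/dual pair in Step 2, and the sup-norm test plays the role of the stopping criterion in Step 4.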

Remark 3. The iterative algorithm combines the value iteration algorithm with the linear programming method for matrix games; it differs both from the value (or policy) iteration algorithms in the existing literature, such as [17, 18, 22, 23], and from the linear programming methods for matrix games, such as [19, 25].
Furthermore, in order to verify the correctness of the above algorithm, we give the following theorem.

Theorem 2. Suppose that Assumption 1 holds; then, (a) for any given ε > 0, there exists a nonnegative integer, given by the corresponding expression (in which the round-down function and the indicator function appear), such that the stopping criterion of the algorithm is met; that is to say, the iteration converges within that number of steps. (b) The policy pair obtained from the above iterative algorithm is an ε-optimal policy pair, with the achieved accuracy given by the corresponding expression.

Proof. By the proof of Lemma 1 in [24], the iteration operator is a contraction with respect to the weighted norm.
(a) By the definition of the iterative sequence, the distance between successive iterates decreases geometrically, which yields a geometric bound on the gap at each step. For any given ε, we distinguish two cases according to the size of the initial gap: in the first case, a small iteration index already meets the stopping criterion; otherwise, choosing the index according to the stated formula gives the required bound. Thus, with this choice of the iteration index, the stopping criterion is satisfied.
(b) By Theorem 1 in [24], we have the corresponding estimate; combining it with the stopping criterion yields a bound on the distance between the current iterate and the optimal value. Letting the iteration stop at that step, it follows that the current iterate is the ε-optimal value for each state. Therefore, the policy pair obtained by the algorithm is an ε-optimal policy pair.
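As a rough numerical companion to part (a), the following sketch computes the generic geometric-convergence bound for a contraction with modulus alpha_hat: if the initial gap is at most C, the error after n iterations is at most C * alpha_hat**n. The exact constant and tolerance used in the theorem involve ε and the weight function and are not reproduced here; this generic form is an assumption.

```python
# Hedged sketch of an a-priori iteration count for a contraction with modulus
# alpha_hat and initial gap at most C; the paper's exact constant and tolerance
# are not reproduced here, so this generic bound is an assumption.
import math


def iterations_needed(alpha_hat, C, tol):
    """A sufficient number of iterations n with C * alpha_hat**n <= tol."""
    if C <= tol:
        return 0
    return math.floor(math.log(tol / C) / math.log(alpha_hat)) + 1


# Example: alpha_hat = 0.7, initial gap 10, tolerance 1e-5 -> 39 iterations.
print(iterations_needed(0.7, 10.0, 1e-5))
```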

4. An Application Example

Example 1. Consider an inventory system whose storage level is the state of the system. In each stage, the system has only three states, which represent different storage levels: when the system is in state 1, the warehouse storage is loose; when it is in state 2, the warehouse storage is tight; and when it is in state 3, the warehouse storage is relatively loose. When the system is in a given state, player 1 chooses the supply from his admissible action set and player 2 chooses the order quantity from his admissible action set, and then player 1 receives the corresponding reward. Given the current state and the actions taken, the system moves to the next state according to the transition probabilities. In addition, we assume that the discount factors depend on the storage level of the warehouse.
These two players are in a state of competition. Player 1 wants to maximize his own reward, while player 2 holds the opposite attitude, that is, to minimize the reward of player 1 (which can also be regarded as the cost of player 2).
From the above description, we obtain the model of the two-person zero-sum stochastic game with varying discount factors: the state space, the admissible action sets of the two players, the state-dependent discount factors, and the transition probabilities are specified accordingly. For the numerical calculation, we set the model parameter values as shown in Table 1.
With an appropriate choice of the weight function, Assumption 1 holds. By Lemma 1, there exists an optimal policy pair for the inventory system.
Next, we use the iterative algorithm to find the optimal values and policy pairs of the game. The specific steps are as follows:
Step 1: choose the initial value in every state.
Step 2: for every state and every admissible action pair, form the one-stage payoffs and solve the primal linear program; its solution gives the updated value and player 1's mixed action in state 1, and the corresponding dual linear program gives player 2's mixed action in state 1. Similarly, we obtain the updated values and the two players' mixed actions in states 2 and 3 (an illustrative snippet follows this example).
Step 3: if the stopping criterion is satisfied, the iteration stops; the current iterate is the ε-optimal value and the current policy pair is an ε-optimal policy pair.
Finally, by the software MATLAB, we obtain the following:
(1) When the stopping criterion is first satisfied, the iteration stops, and the ε-optimal values are obtained.
(2) In state 1, the ε-optimal policy pair is that player 1 mixes his two admissible actions with probabilities 0.68045 and 0.31955 and player 2 mixes his two admissible actions with probabilities 0.73865 and 0.26135.
(3) In state 2, the ε-optimal policy pair is that player 1 mixes his two admissible actions with probabilities 0.38703 and 0.61297 and player 2 mixes his two admissible actions with probabilities 0.32584 and 0.67416.
(4) In state 3, the ε-optimal policy pair is that each player chooses a single action with probability 1.
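For illustration only, the following snippet shows how the 3-state inventory game could be fed to the shapley_iteration sketch from Section 3 (assumed to be in scope). The reward matrices, transition kernels, and discount factors below are hypothetical placeholders; the actual parameter values are those reported in Table 1 and are not reproduced in the text.

```python
# Hypothetical usage sketch for the 3-state inventory game of Example 1;
# all numerical values below are placeholders, not the paper's Table 1 data.
import numpy as np

S = 3                                        # storage levels 1, 2, 3
alpha = np.array([0.6, 0.7, 0.5])            # placeholder state-dependent discounts
rng = np.random.default_rng(0)
r = [rng.uniform(-5.0, 5.0, size=(2, 2)) for _ in range(S)]   # placeholder payoffs
Q = []
for i in range(S):                           # placeholder transition kernels
    q = rng.uniform(size=(2, 2, S))
    Q.append(q / q.sum(axis=2, keepdims=True))

V, pol1, pol2 = shapley_iteration(r, Q, alpha, eps=1e-5)
print("approximate game values:", np.round(V, 5))
print("player 1 mixed actions per state:", [np.round(x, 5) for x in pol1])
print("player 2 mixed actions per state:", [np.round(y, 5) for y in pol2])
```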

5. Discussion: Comparison between the Constant and Varying Discount Factors

To better compare the varying discount factors with a constant discount factor, we compare, for the game in Example 1, the game values and the numbers of iterations in Table 2 for the varying discount factor case and for the constant discount factor case.

The error between two adjacent iterative values and the ε-optimal policy pairs are shown in Figure 1 for the varying discount factors and in Figure 2 for the constant discount factor. Note that, from the plots of the error between two adjacent iterative values, it can be seen that, in the iterative process, this error decreases exponentially and tends to zero as the number of iterations increases, which is consistent with our previous theoretical analysis.

From the above analysis, it is concluded that when the discount factors vary with the state, the game values and iteration counts differ from those under a fixed constant discount factor, and the optimal policy pairs also differ; this shows that varying discount factors affect the optimal values and policy pairs of the games. Therefore, it is of practical significance to consider varying discount factors in stochastic games.

6. Conclusion

In this study, we are concerned with the numerical calculation of the ε-optimal value and policy pairs in two-person zero-sum stochastic games with unbounded reward functions and state-dependent discount factors. For the two-person zero-sum Markov games with finite-state space and state-dependent discount factors, this study gives the iterative algorithm to calculate the ε-optimal value and policy pairs. Future work will discuss the numerical calculation of the ε-optimal value and policy pairs in two-person zero-sum stochastic games in countable state space and even general state space with varying discount factors. For the case of the countable state space, we will try to use the finite approximation technique. For the case of the general state space, we will introduce the concept of quantized policies and use the optimal policies in finite action space to approximate the optimal policies in general state space.

Data Availability

No data were used to support this study.

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 11961005) and the Opening Project of the Guangdong Province Key Laboratory of Computational Science at Sun Yat-sen University (Grant no. 2021021).