Formal Model and Analysis for the Random Event in the Intelligent Car with Stochastic Petri Nets and Z
Algorithm 1
Parameter optimization algorithm based on actor-critic.
Input: SPN, environment, and the reward and punishment rules.
Output: .
Initialize the global time t. The SPN consists of places and transitions.
(1)
The reachable marking graph and the corresponding Markov chain with n states can be derived from the SPN.
(2)
Randomly initialize the firing rates of the transitions and let .
(3)
An nth-order square matrix is constructed from the Markov chain obtained in Step 1 and Definition 1.
(4)
According to the Markov chain in Step 1, the steady-state probability vector is an n-dimensional vector.
(5)
According to equation (5), the vector at the current moment is calculated.
(6)
Define the set of good states and the set of bad states. The steady-state probability vector is then split into the steady-state probability vector of the good states and the steady-state probability vector of the bad states.
(7)
iteration = 0;
(8)
While (iteration < 100)
(1) Observe the current state, i.e., the current steady-state probability. Randomly initialize the policy network and the value network (on the first iteration only), and randomly sample an action according to the policy network.
(2) Execute the action. The environment generates the next state according to the action, and the reward is calculated according to the reward and punishment rules.
(3) The policy network randomly samples an action according to the current state, but does not execute the action.
(4) Evaluate the value network for the executed action and for the sampled action.
(5) Calculate the TD error according to the TD algorithm, using the discount rate.
(6) Calculate the value network gradient .
(7) Update the value network, .
(8) Calculate the policy network gradient, .
(9) Update the policy network, .
(10) t++; iteration++,
where and correspond to the learning rates in the value network and policy network, respectively.
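The preparatory steps and a single pass through the loop can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the generator matrix `Q`, the choice of bad states, the tabular "networks" `theta` and `w`, the function names, and the hyperparameter values are all hypothetical. The steady-state vector is obtained by solving pi Q = 0 with the entries of pi summing to 1, which is the standard condition for a continuous-time Markov chain and is assumed here to play the role of equation (5). The TD error uses the common "target minus estimate" sign convention, and the policy update is weighted by the TD error; the paper's own convention may differ.

```python
import math


def steady_state(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 for a generator matrix Q (list of rows)."""
    n = len(Q)
    # Transposed system Q^T pi^T = 0; the last equation is replaced by sum(pi) = 1.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return [b[i] / A[i][i] for i in range(n)]


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, w, s, a, r, s_next, a_next,
                      gamma=0.9, alpha=0.1, beta=0.1):
    """One tabular actor-critic update (assumed stand-in for the two networks).

    theta[s][a] holds policy preferences, w[s][a] holds action-value estimates.
    alpha and beta are the (hypothetical) critic and actor learning rates.
    """
    q_t = w[s][a]                      # value of the executed action
    q_next = w[s_next][a_next]         # value of the sampled, unexecuted action
    delta = r + gamma * q_next - q_t   # TD error: target minus estimate
    # For a softmax policy, d ln pi(a|s) / d theta[s][b] = 1[b == a] - pi(b|s).
    probs = softmax(theta[s])
    for b in range(len(probs)):
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += beta * delta * grad_log
    # For a table, the gradient of q(s, a; w) with respect to w[s][a] is 1.
    w[s][a] += alpha * delta
    return delta


# Toy two-state chain (hypothetical rates): state 0 is good, state 1 is bad.
Q = [[-1.0, 1.0],
     [3.0, -3.0]]
pi = steady_state(Q)
bad_states = {1}
p_bad = sum(pi[i] for i in bad_states)   # total steady-state mass on bad states

# One update on a two-state, two-action table with a hand-picked transition.
theta = [[0.0, 0.0], [0.0, 0.0]]
w = [[0.0, 0.0], [0.0, 0.0]]
delta = actor_critic_step(theta, w, s=0, a=1, r=1.0, s_next=1, a_next=0)
```

In a fuller version, the reward `r` would come from the reward and punishment rules (for example, penalizing a large `p_bad`), and the update would be repeated inside the 100-iteration loop with actions sampled from `softmax(theta[s])`.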