Abstract

When multiple buses send priority requests at a single intersection, existing fixed-phase-sequence control methods cannot serve priority requests from buses in multiple phases. Determining the order in which priority vehicles pass, given conflicting multiphase bus priority requests at an intersection, is the focus of this study. This paper proposes a connected vehicle-enabled transit signal priority system (CV-TSPS), which uses vehicle-to-infrastructure (V2I) communication technology to obtain real-time vehicle movements, road traffic states, and traffic signal phase information. A deep Q-learning neural network (DQNN) is developed to optimize the traffic signal control strategy so that public transit vehicles are prioritized and their travel efficiency improves, while the overall delay of road traffic flow is balanced to ensure the safe and orderly passage of the intersection. To verify the validity of the model, the SUMO traffic simulation software is used to simulate real-time traffic control. The experimental results show that, compared with traditional fixed-time signal control, vehicle loss time is reduced by nearly 40% and cumulative per-capita loss time by nearly 43.5%, achieving a good control effect. Under medium and low traffic densities, the proposed scheme outperforms the fixed-time traffic control scheme.

1. Introduction

Transit signal priority (TSP) is a way of giving preferential treatment to the movement of transit vehicles such as buses or light rail. Many research papers and several successful real-world deployments have shown the applicability and efficiency of TSP in promoting public transportation and reducing congestion in urban areas. “Bus priority” was first implemented in Paris, France, in the early 1960s [1]. In 1967, Wilbur et al. of the Los Angeles Highway Bureau proposed bus signal priority, giving buses priority in the time dimension. Bus signal priority control is widely used because of its high regulation flexibility.

At present, TSP control strategies can be divided into three types: passive priority, active priority, and real-time priority [2]. Passive priority is optimized offline: a fixed-time strategy establishes the best signal timing for each phase sequence with a fixed signal phase. When short-term traffic demand changes little, the best signal timing scheme can be computed from historical traffic information, which helps reduce mainline delays. Some scholars have applied fixed-time strategies to a 1 × 2 arterial road network and to isolated intersections, respectively, handling general intersections more easily than the original Webster method while maximizing the throughput of the network. However, because of the influence of intersection nodes and road driving conditions, buses cannot always arrive at stops on time in actual operation, so the overall benefit of the passive priority control strategy is not obvious.

Since the active priority strategy was proposed by Elias, early research has focused on active control, and many results have been achieved. The active priority strategy collects real-time data from infrastructure sensors and applies simple logic rules such as green extension, minimum green, or maximum green. Green extension lengthens the current green phase when an approaching vehicle is detected; the minimum green rule ends the current green phase once the interval between successive detector actuations exceeds a set threshold; and the maximum green constraint terminates the green phase once its duration exceeds the set maximum. A conflict exists between real-time phase priority and the phase time allocated to ordinary vehicles at the intersection. Adaptive signal control considers bus priority at the intersection approach together with other factors such as road traffic flow delay and, following the planning and design requirements of the intersection, balances traffic delays across the intersection. Using data from roadside sensing devices to obtain bus and traffic flow information, TSP control strategies have evolved from simple bus-only signal control to the coordinated control of signals for both social (private) vehicles and public transportation [3].

However, most of the above methods rely on passive information acquisition followed by decision-making control. This lag in acquiring bus status information means they cannot provide a sound decision-making basis for bus priority control or further reduce the occurrence of accidents. The main contributions of this paper are as follows: (a) this paper presents the connected vehicle transit signal priority system (CV-TSPS), in which real-time vehicle information is obtained through vehicle-road cooperation technology and the traffic signal control strategy is optimized by a reinforcement learning algorithm, improving the priority of buses; (b) the method balances the overall delay of road traffic flow.

The remainder of this paper is organized as follows: a literature review is presented in Section 2, followed by the methodology in Section 3. After that, the experiments and result analysis are described in Section 4. Finally, Section 5 presents the conclusions.

2. Literature Review

Traffic signal control is one of the fields that signal control researchers continue to explore, and various solutions to the traffic signal control problem (TSCP) have been proposed. Some TSCP solutions are converted into mathematical models based on formulations reflecting the dynamics of traffic flow and solved with analytical methods. The Lighthill-Whitham-Richards model [4], the cell transmission model (CTM) [5], and various other models have been used to describe macroscopic traffic dynamics, while other approaches rely on game theory, neural networks (NN), or reinforcement learning (RL).

Most early researchers adopted rule-based approaches to solving TSCP [6]. For example, Abdulhai et al. [7] combined fuzzy logic with rule-based methods by defining states from approximate input keys. Considering an adaptive strategy, Ekeila et al. [8] used a dynamic rule-based strategy that changes the control rules according to the traffic situation to deal with real-time traffic problems. Regardless of the size or status of the problem, various solution approaches have been applied to TSCP, including rule-based approaches, genetic algorithms (GA), simulation-based approaches, dynamic programming (DP), and multiagent systems (MAS). Liu and Chang [9] explicitly modeled the evolution of vehicle queues in terms of lane groups to address shared lanes at intersections and solved the model with a genetic algorithm. Because genetic algorithms have been shown to obtain high-quality optimization solutions with microscopic simulation tools [10], Stevanovic [11] optimized the capacity of mixed traffic flow with private and public transportation by configuring public transportation priority. In addition, many researchers have proposed methods based on general model simulation to address traffic flow interactions [9]. Game-theoretical methods and RL are typical solutions associated with MAS. Roozemond [12] defined each unit of the traffic system as an agent and applied artificial intelligence to these agents (intelligent traffic signals) to provide prediction and control strategies. Some studies combine RL with fuzzy theory or neural networks. Choy et al. [13] divided TSCP into regional intersection subproblems, each handled by an agent with a fuzzy neural decision module; fuzzy NNs and RL are combined to optimize the signal timing of large, complex traffic networks, with RL adjusting the learning rate and the weights of the fuzzy relations and thereby refining the fuzzy relations themselves.

The adaptive priority strategy is an incentive strategy: it predicts future traffic conditions, computes the system-optimal result, and then adjusts the signal phase times. Dell’Olmo [14] and Mirchandani [15] used the approximate prediction of response to signal demand (APRES-NET) model to identify vehicle queues, predicted their movement in the road network, and provided timing services matched to the predicted demand. Koehler et al. [16] proposed an integrated control strategy that simultaneously implements the required bus headway corrections and bus priority through signalized intersections, with the objective of minimizing the total delay of passengers onboard and at stops; the developed model, formulated as a mathematical programming problem, captures the behavior of BRT and transit systems and bus priority at signalized intersections. Shu et al. [17] calculated the delay increments of buses and private vehicles in three scenarios according to the dissipation time of the vehicular queue; a set of constraints is introduced to avoid queue overflows and to keep the signal timing reasonable, and the results show that the proposed model can reduce total person delay at near-saturated intersections. Seredynski et al. [18] demonstrated how coordinated in-vehicle driver advisory systems (DASs) can partly reduce signal priority requests; such a cooperative solution, enabled by connectivity between vehicles and the signal control system, combines signal adjustments with dwelling and speed advisories, and they showed that combining in-vehicle advisories with signal priority reduces stops at signals and decreases trip time compared with TSP alone. Xu et al. [19] proposed an optimization model to resolve conflicting transit signal priority (TSP) requests: bus travel delay was selected as the index of a request’s priority level; weighting expressions for priority strategy, transit route level, and transit mode were introduced so that requests calling for different priority strategies and coming from different route levels and modes receive different levels of importance; and priority is granted to the buses with the largest weighted delay. Consoli et al. [20] demonstrated the effectiveness of TSP in improving bus corridor travel time in a simulated environment using real-world data for the International Drive corridor, comparing unconditional and conditional TSP with a no-TSP scenario; the evaluation covered performance metrics for buses and for all vehicles, including average speed profiles, average travel times, average number of stops, and cross-street delays.

3. Methodology

3.1. Construction of DRL-TSP Model

Most existing bus priority control methods formulate linear programming problems based on the spatiotemporal information of road conditions, vehicle motion states, and traffic signal phase states, with a number of constraints imposed on top. Such linear programs are difficult to solve optimally because their formulation involves many complex variables: road elements are decomposed into lane IDs, intersection IDs, road lengths, and so on, while vehicle status is further divided into vehicle type, position, and speed, and the passage of vehicles at each approach is controlled by the traffic lights. Unlike traditional linear programming, which cannot handle such high-dimensional optimization, reinforcement learning (RL) [7] learns by trial and error, accumulating effective experience over successive iterations to obtain the best control strategy. Two difficulties remain in applying RL to urban traffic control: (1) the large number of complex traffic states and (2) the large number of complex control factors. This study uses an RL training strategy and a deep neural network to approximate the optimal policy function. Deep reinforcement learning (DRL) can learn a state-space representation of road traffic, reducing the difficulty of expressing the complex flow of vehicles at the intersection approaches; by combining deep learning with reinforcement learning, high-dimensional road and vehicle information can be mapped to green-light decisions at the intersection. In this paper, the reinforcement learning environment is a standard two-way, four-lane intersection. The end of each entry road and the beginning of each exit road are connected to the intersection, and the left-turn, through, and right-turn movements of opposing entry roads correspond to traffic control signal phases. The DRL-TSP model consists of a single agent that interacts with the environment through states s_t, actions a_t, and rewards r_t. Figure 1 shows the modeling process of the proposed DRL-TSP, where s_t is the state of the signalized intersection, a_t is the action on the traffic light phases, and r_t is the reward function.
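To make this interaction concrete, the following minimal Python sketch runs one training episode of the agent-environment loop described above. The environment wrapper and the agent interface (env.reset, env.step, choose_action, store) are hypothetical placeholders standing in for the SUMO simulation and the DQN agent, not the authors' code.

```python
# Minimal sketch of the DRL-TSP interaction loop; interfaces are assumptions.
import random

class RandomAgent:
    """Stand-in for the DQN agent: picks one of the 4 green-phase actions."""
    def __init__(self, n_actions=4):
        self.n_actions = n_actions

    def choose_action(self, state):
        return random.randrange(self.n_actions)

    def store(self, state, action, reward, next_state):
        pass  # a real agent would push the sample into replay memory

def run_episode(env, agent, max_steps=5400):
    """One training episode: observe state, pick a phase, collect the reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.choose_action(state)          # index into {NS, NSL, EW, EWL}
        next_state, reward, done = env.step(action)  # reward = WT(t-1) - WT(t)
        agent.store(state, action, reward, next_state)
        state = next_state
        total_reward += reward
        if done:
            break
    return total_reward
```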

3.1.1. Definition of Environment

Because of the constraints on the communication distance between the roadside unit (RSU) and the on-board unit (OBU), vehicle information cannot be obtained from far away. Therefore, the effective communication range of the intersection in the simulation environment is set to 750 meters; vehicle information is available only when a vehicle is within 750 meters of the intersection. In practice, the communication capability between the RSU and OBU is affected by environmental factors: the greater the distance, the higher the probability of packet loss, and the farther a vehicle is from the intersection in this model, the smaller its impact on the model’s decision. The possible driving directions are assigned to the four lanes of each entry road segment. When a vehicle approaches the intersection, it selects the lane required for its destination; each entry road provides a left-turn lane, through lanes, and a shared right-turn/through lane.

The traffic lights in the environment are indicated by the color on the stop line of each entry lane, which represents the state of the traffic lights for that lane at a precise time step. For example, Figure 2 shows a green light for north-south traffic, while all other phases are red. In a real environment, there are 8 different traffic lights, each of which controls one or more adjacent lanes, as represented by equation (1).

Among them, each value describes the control scheme of one phase signal. For example, one value represents the signal for the through movement at the north entrance, and another represents the signal for the left-turn movement from the north entry road; the other phases follow the same rules. As shown by the lane markings in Figure 1, red, green, and yellow lights alternate in each phase, and the adjustment of each phase follows these rules:

(1) The signal colors cycle in the order red, green, yellow, red.
(2) The initial duration of each phase is fixed and satisfies the minimum green time of 10 s, and each green is followed by a 3 s yellow. The red time of a phase equals the remaining green time of the current phase plus the interval until the target phase is served.
(3) In each time step, at least one signal must be lit for each phase, i.e., the alternating signal cycle logic follows normal traffic conventions.
(4) No all-red interval is used.
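As an illustration only, the eight signal groups of equation (1) and the fixed timing rules above could be encoded as in the sketch below; the group names are hypothetical, since the paper does not give the exact symbols.

```python
# Hypothetical encoding of the 8 signal groups and the timing rules above.
SIGNAL_GROUPS = [
    "N_through", "N_left",   # north entrance: through / left turn
    "S_through", "S_left",   # south entrance
    "E_through", "E_left",   # east entrance
    "W_through", "W_left",   # west entrance
]

MIN_GREEN_S = 10   # fixed initial duration of every green phase
YELLOW_S = 3       # yellow time appended after every green phase
# Red time of a group = remaining green of the current phase plus the interval
# until its own phase is served; no all-red interval is used.
```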

3.1.2. Description of the State

The state of the agent is determined by the state of the environment at a given time step t. For the agent to learn to optimize traffic conditions effectively, the environment state should provide enough relevant information about the distribution of cars in each lane. The purpose of this construction is to let the agent learn, at time step t, where the CV vehicles are located in the road environment. The method proposed in this paper is inspired by Genders; the difference is that the amount of information encoded in the state is smaller. In particular, the state design contains only the spatial positions of CV vehicles on the roads and is used to discretize the continuous environment into cells of unequal size. The varying importance of CV vehicles at different locations is reflected in this irregular partition, which simplifies the state statistics of CV vehicles at different locations, and the chosen state representation emphasizes realism: under actual CV conditions, the roadside unit (RSU) can easily obtain accurate CV vehicle locations, classify and count CV vehicles on different road sections, and aggregate them into discrete cells as training input for the model. Figure 2 shows the road state distribution on the west entry road of the intersection.

The arrival state of CV vehicles is crucial for recognizing CV vehicle delays in the model, and the discrete distribution adopted in this paper is shown in Figure 2. The queuing positions closest to the stop line are the most important for counting queuing delays at the intersection, so the segment nearest the stop line is divided into cells of 7 m each, mainly accounting for vehicle length and safety spacing. The lengths of the cells farther from the stop line are shown in the figure: the farther a segment is from the intersection, the coarser the perception of vehicle position there. This paper defines a mathematical representation of the discrete state space of the intersection, in which the value of each element is calculated according to formula (2) and the discrete spatial distribution is shown in Figure 2; each cell is mapped to a position in the vector RDS, which is updated according to formula (3).

When the agent samples the environment at time step t, it receives a vector containing a discretized representation of the environment at that time step. This is the main information the agent receives from the environment, so it is designed to be as precise as possible without being so detailed that it increases the computational complexity of the neural network during training. In reinforcement learning, the time an agent spends exploring the state space reflects its learning ability: if it does not explore an important part of the state space, it cannot estimate the value of those states correctly and has no way to choose the best action in every situation. After training, the agent can choose the best action even when the global state is unavailable, based on the experience gained earlier, because it knows the value of each action choice; this means the design of the state space should suit the learning expected of the agent.

The state space consists of 80 Boolean cells, so the number of possible traffic state representations is 2^80. The choice of Boolean cells for the environment representation is also critical, because the agent should only have to explore the most important parts of the state space to learn optimal actions. For example, a critical state of the environment occurs when at least one vehicle has stopped and is waiting for a green light; in this state it is most necessary to take the best action to let the vehicle enter the intersection and improve overall efficiency. Cells closer to the stop line are more important than cells farther away, and state combinations with active cells near the stop line contribute more to the agent’s search performance (delays are caused by queued vehicles), which effectively shrinks the relevant state space so that the training duration stays within the expected range.
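The sketch below shows one way to build the 80-cell Boolean state vector from CV positions reported via V2I. The cell boundaries (7 m near the stop line, growing with distance) and the split into 8 lane groups of 10 cells each are assumptions for illustration; the paper only states that cells near the stop line are finer than those farther away.

```python
# Assumed discretization: 8 lane groups x 10 cells = 80 Boolean cells.
import numpy as np

CELL_EDGES = [7, 14, 21, 28, 40, 60, 100, 160, 400, 750]  # metres, hypothetical
N_LANE_GROUPS = 8

def cell_index(distance_to_stop_line):
    """Map a vehicle's distance to the stop line onto a cell index 0..9."""
    for i, edge in enumerate(CELL_EDGES):
        if distance_to_stop_line <= edge:
            return i
    return None  # outside the 750 m communication range of the RSU

def build_state(vehicles):
    """vehicles: iterable of (lane_group, distance) pairs reported via V2I."""
    state = np.zeros(N_LANE_GROUPS * len(CELL_EDGES), dtype=bool)
    for lane_group, distance in vehicles:
        idx = cell_index(distance)
        if idx is not None:
            state[lane_group * len(CELL_EDGES) + idx] = True
    return state
```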

3.1.3. Action Setting

Action set A represents the possible actions the agent can take. The agent’s decision drives the traffic light control system, turning the light of a chosen phase green and keeping it green for a fixed period: the green time is set to 10 s and the yellow time to 3 s. The agent’s task is to select the start of a green phase from a predefined phase action space. Each action has the potential to occur, and the action space is defined as A = {NS, NSL, EW, EWL}:

In the formula, NS: the lane corresponding to the traffic phase in the north-south direction turns on the green light for release; NSL: the lane corresponding to the left-turn traffic phase in the north-south direction turns on the green light for release; EW: the lane corresponding to the traffic phase in the east-west direction turns on the green light for release; EWL: the lane corresponding to the left turn traffic phase in the east-west direction turns on the green light to release.

If the action chosen at time t is the same as the action taken at time t−1, the current phase simply continues for another green interval. If the action chosen at time t differs from the previous action, a 3 s yellow time is inserted between the two actions. The number of simulation steps per action is 10: in SUMO, one simulation step corresponds to 1 s, so each action lasts at least 10 s, and the yellow stage adds 3 simulation steps. Thus, whenever the phase changes, one action spans 13 consecutive simulation steps.
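The following sketch illustrates this action-to-signal timing: repeating the previous action extends the green by 10 simulation steps, while changing the action inserts a 3-step yellow before the new 10-step green (13 steps in total). The env object and the phase-naming helpers are hypothetical stand-ins for SUMO/TraCI control code.

```python
# Sketch of one agent decision being applied to the simulator (assumed API).
GREEN_STEPS = 10
YELLOW_STEPS = 3
ACTIONS = ["NS", "NSL", "EW", "EWL"]

def yellow_of(action):
    return f"yellow_{action}"   # phase id naming is an assumption

def green_of(action):
    return f"green_{action}"

def apply_action(env, action, previous_action):
    """Run the simulation steps implied by one agent decision."""
    if previous_action is not None and action != previous_action:
        env.set_phase(yellow_of(previous_action))   # 3 s transition
        for _ in range(YELLOW_STEPS):
            env.simulation_step()
    env.set_phase(green_of(action))                 # 10 s of green
    for _ in range(GREEN_STEPS):
        env.simulation_step()
```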

Traffic signal optimization parameters include the cycle and green signal ratio. Therefore, the action space needs to satisfy the following constraints:

(1) The Longest Cycle and the Shortest Cycle. The signal cycle length has a significant impact on the normal operation of traffic flow at the intersection. The design principle of the shortest cycle is that the vehicles queued during the red phase should just be able to clear during the green phase. The shortest signal cycle is therefore the sum of the total lost time in the cycle and the time needed for the vehicles arriving in each phase to pass through the intersection at the saturation flow rate, that is, Cmin = L/(1 − Y), where L is the total lost time in the cycle and Y is the sum of the maximum flow ratios of the phases. If the cycle is too long, green time is wasted, vehicles wait longer at the intersection, and delay increases; therefore a longest signal cycle must also be set. The longest signal cycle is determined by the longest red time: if the red time is too long and a large number of vehicles accumulate behind the stop line, the queue may exceed the capacity of the approach, causing spillback and reducing the traffic efficiency of the intersection. The longest red time of a phase is therefore determined by the maximum number of vehicles that can queue in the entrance lane corresponding to that phase. In the corresponding formula, n is the number of phases, i is the phase index, and j is the direction corresponding to phase i; the longest red time of phase i and the average vehicle length l enter this expression.

(2) Green Time Allocation. Every phase is assigned at least a minimum green time so that no phase becomes congested. Given the critical saturation (flow ratio) of each phase, the minimum green time of phase i is derived from it; the remaining green time of the cycle is then distributed among the phases, and the effective green time of each phase is obtained from its share, which depends on the maximum ratio of the phase-i bus flow to the total flow.

To sum up, the above cycle-length and green-time requirements together form the constraints on the action space.
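As a partial, hedged illustration of these constraints, the sketch below checks only the shortest-cycle relation Cmin = L/(1 − Y) discussed above; the maximum-red-time and green-split expressions are not reproduced here.

```python
# Shortest-cycle check: lost time plus saturated service of arrivals.
def min_cycle(total_lost_time_s, sum_flow_ratios):
    """Cmin = L / (1 - Y); raises if the intersection is oversaturated."""
    if sum_flow_ratios >= 1.0:
        raise ValueError("intersection is oversaturated (Y >= 1)")
    return total_lost_time_s / (1.0 - sum_flow_ratios)

# Example: 4 phases, 3 s lost time each, critical flow ratios summing to 0.6
print(min_cycle(total_lost_time_s=12.0, sum_flow_ratios=0.6))  # -> 30.0 s
```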

3.1.4. Setting of Reward Function

In reinforcement learning, the reward is the feedback from the environment after the agent chooses an action. Rewards are critical to the learning process, since the agent uses them to understand the consequences of its actions and to improve future choices. Rewards typically take one of two signs: positive rewards for good behavior and negative rewards for bad behavior. The goal of this paper is to maximize the improvement of road traffic over time. To this end, several traffic indicators are selected so that the agent can judge whether the current action improves or degrades the traffic efficiency of the intersection. In the traffic flow analysis, this paper uses three indicators: average queue length, average delay, and traffic flow. They are measured as follows:

(1) Queue length: queued vehicles are defined as vehicles whose speed does not exceed 0.1 m/s.
(2) Waiting time: the total waiting time is the sum of the waiting times of all vehicles at time t, where each vehicle accumulates waiting time whenever its speed does not exceed 0.1 m/s.
(3) Traffic flow: the number of vehicles passing through the signalized intersection within a defined time period.

In this article, the waiting time is calculated as a weighted sum of bus and car delays, WT_t = w_C · Σ d_C + w_B · Σ d_B, where WT_t is the total delay of all CV vehicles at the intersection at time t; d_C and d_B are the delays of a single ordinary CV vehicle and of a single bus (CB), accumulated whenever the vehicle speed falls below 0.1 m/s; CN and BN are the total numbers of ordinary cars and buses entering the calculation area at time t (the sums run over these vehicles); and w_C and w_B are the delay weight coefficients of ordinary social vehicles and public transportation vehicles, respectively.

The total waiting time is extremely important when the agent uses vehicle delay as the reward, and it also yields the most accurate delay statistics for each vehicle. The delay weight coefficients of ordinary social vehicles and public transportation vehicles reflect the importance of buses in the bus priority control system and are used to improve the efficiency of buses at the intersection. According to Article 4 of the 2017 edition of the “Technical Conditions for Safe Operation of Motor Vehicles,” the standing area of a bus shall carry no more than 8 persons per square meter, based on the effective standing area defined in national standard GB/T 12428. The buses in this experiment are all medium-sized buses with a length of 10 m, assumed to carry 80 persons when fully loaded. In the experiment, 40% of the full load is taken as the comfortable average occupancy, i.e., 32 persons per bus, and the occupancy of ordinary cars is set to 1.5 persons per vehicle. The weight coefficients are the reciprocals of the average occupancies, applied to the CB and ordinary CV counts, respectively.
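The sketch below computes the weighted intersection waiting time WT(t) described above. The delay weights w_bus and w_car are treated as parameters: the text relates them to the average occupancies (32 passengers per bus, 1.5 per car), so the values used in the example call are placeholders, not the authors' exact setting.

```python
# Weighted waiting time of all CVs at the intersection (weights are assumptions).
STOP_SPEED = 0.1  # m/s: vehicles slower than this are counted as delayed

def weighted_waiting_time(vehicles, w_bus, w_car):
    """vehicles: list of dicts with keys 'type' ('bus'/'car'), 'speed', 'wait_s'."""
    wt = 0.0
    for v in vehicles:
        if v["speed"] > STOP_SPEED:
            continue  # only stopped (queued) vehicles accumulate delay
        wt += (w_bus if v["type"] == "bus" else w_car) * v["wait_s"]
    return wt

# Illustrative call with placeholder weights.
wt = weighted_waiting_time(
    [{"type": "bus", "speed": 0.0, "wait_s": 30.0},
     {"type": "car", "speed": 0.0, "wait_s": 12.0}],
    w_bus=1.0, w_car=0.1)
```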

Evaluating the delay of the whole intersection by computing each vehicle’s waiting time is more accurate than counting the queue length on each approach, so this paper does not use queue length in the reward design. Two reward function designs are considered in this paper, following the reward function designed by Genders and Razavi. The reward is defined as the difference between the total delay of CV vehicles at the intersection at time t−1 and that at time t, i.e., formula (7): r_t = WT_{t−1} − WT_t.

As shown above, the reward may be positive or negative. As formula (7) shows, a positive reward means that the overall CV vehicle delay caused by the signal action at time t is smaller than the delay at time t−1, i.e., the current signal adjustment reduced the queuing delay of CV vehicles and the action was effective. A negative reward means that the action taken at time t performed worse than the action taken at time t−1 and increased the queuing delay of CV vehicles. The larger the positive value of r_t, the better the agent’s action is judged to be; the larger its negative magnitude, the worse the chosen action.
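A minimal numeric illustration of formula (7): the reward is simply the change in weighted waiting time, positive when the chosen green phase reduced the queuing delay.

```python
# Reward = decrease in weighted waiting time between consecutive decisions.
def reward(wt_previous, wt_current):
    return wt_previous - wt_current

print(reward(120.0, 90.0))   # 30.0  -> the action improved traffic
print(reward(90.0, 120.0))   # -30.0 -> the action made queues worse
```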

3.1.5. Deep Reinforcement Learning Mechanism

The combination of Q-learning and a neural network yields the deep Q-learning network (DQN). In practice, the number of states is extremely large and features would otherwise have to be designed manually; poorly designed features prevent the desired result. A neural network solves this problem by replacing the Q table used in Q-learning.

(1) Reinforcement Learning Mechanism. Q-learning is a model-free reinforcement learning method. Formula (8) updates the action-value function Q as Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t)], where Q(s_t, a_t) is the value of taking action a_t in state s_t; α is the learning rate used to update the current Q-value; r_{t+1} is the reward obtained by taking action a_t in state s_t, the subscript t + 1 emphasizing the temporal relation between taking an action and receiving the corresponding reward; Q(s_{t+1}, a_{t+1}) is the value at time t + 1, with a_{t+1} the action taken after a_t in state s_{t+1}; max_a selects the action with the optimal value in state s_{t+1}; and γ is the discount factor, ranging from 0 to 1, used to discount the importance of future rewards relative to the immediate reward.

The reason for using this update formula is ultimately to preserve the relationship of formula (9): Q(s_t, a_t) = r_{t+1} + γ · max_a Q(s_{t+1}, a).

This equation says that the Q-value of taking the current action a_t in state s_t is updated using the reward just earned plus the discounted Q-value of the best action available in the future. The future action value max_a Q(s_{t+1}, a) is the maximum value obtainable in state s_{t+1}, so Q(s_t, a_t) combines the reward obtained in state s_t with the largest value attainable from the future state. By analogy, at the last time step before the end of the episode there are no further actions, so the future reward is 0 at that point. The agent can therefore choose actions based not only on immediate rewards but also on predicted, discounted future rewards. Expanding the recursion gives formula (10): Q(s_t, a_t) = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ⋯.
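The following minimal tabular sketch applies the Q-learning update of formula (8); the state and action counts are arbitrary, and the DQN described next replaces this table with a neural network approximator.

```python
# Tabular Q-learning update (formula (8)) with the target from formula (9).
import numpy as np

n_states, n_actions = 5, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.75   # learning rate and discount factor from the text

def q_update(s, a, r, s_next):
    td_target = r + gamma * np.max(Q[s_next])   # best discounted future value
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s, a) toward the target

q_update(s=0, a=2, r=-30.0, s_next=1)
```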

The choice of γ determines how strongly subsequent rewards influence the current action selection. If γ is too large, the current action depends heavily on future rewards and the importance of the immediate reward decreases; if γ is too small, future rewards carry little weight and provide little guidance for the current action, which is not conducive to finding the optimal policy. For example, when the green phase activated at time t + 1 leaves a total CV waiting time of 100 s, the corresponding value is −100 + 100, which defers the effect of the current reward. When γ is set to 0.09, the agent looks only a short distance ahead and considers just a small part of the future reward; when γ is set to 0.25, a larger share of the future reward contributes to the current decision; and when γ is set to 0.75, most of the contribution comes from actions that will be taken in the future. A γ of 0.75 helps the agent obtain better rewards in the long run.

If action a_t is taken in state s_t at time t, the output value of the corresponding output-layer neuron is Q(s_t, a_t). For the agent to learn to bring this output close to the target value, it suffices to use the squared-error function, taking the square of the difference between the target value and the output value as the error: E = (Q′(s_t, a_t) − Q(s_t, a_t))^2, where Q′(s_t, a_t) denotes the updated target Q-value.

However, since state s_{t+1} is obtained only after taking action a_t in state s_t, the target value must be obtained by feeding state s_{t+1} into the neural network. The calculation process is shown in Figure 3.

(2) Deep Q-Learning Network. To map the environmental state to the Q-values of the available actions, a deep neural network is built. The input to the network is the environment state at time t, represented by the vector RDS, and the output of the network is the Q-value of each possible action in that state. Each element of the input is the k-th state component at time t; because the state design contains 80 cells, the size of the input layer is 80. Each element of the output layer is the j-th output of the network at time t, i.e., the Q-value of the j-th action in the state at time t. The number of outputs equals the size of the action space A, so the network outputs four Q-values, one per action.

The neural network is a fully connected deep neural network with rectified linear unit (ReLU) activation functions. The exact number of layers and neurons per layer is specified in Section 3.2, where multiple agent configurations are tested. The scheme of the neural network is shown in Figure 3: the vector RDS is the input to the network, followed by the hidden layers, and finally an output layer with 4 neurons representing the 4 Q-values of the 4 possible actions.
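A minimal PyTorch sketch of such a fully connected Q-network is given below: 80 Boolean state inputs, ReLU hidden layers, and 4 linear outputs (one Q-value per action). The hidden-layer widths are assumptions, since the paper defers the exact architecture to Section 3.2.

```python
# Sketch of the Q-network: 80 inputs -> ReLU hidden layers -> 4 Q-values.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_states=80, n_actions=4, hidden=(400, 400)):  # widths assumed
        super().__init__()
        layers, prev = [], n_states
        for width in hidden:
            layers += [nn.Linear(prev, width), nn.ReLU()]
            prev = width
        layers.append(nn.Linear(prev, n_actions))   # linear output: Q(s, a)
        self.net = nn.Sequential(*layers)

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
q_values = q_net(torch.zeros(1, 80))   # -> tensor of shape (1, 4)
```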

3.2. Training of the DRL-TSP Model

The basic principles of the agent, such as states, possible actions, and reward values, have been described in the previous subsection. Figure 4 shows the workflow made up of all modules in one time step.

After a fixed number of simulation steps, the agent starts time step t. First, it obtains the state of the environment and the delay time. Next, using the delay of this time step t and of the previous time step t−1, it calculates the reward associated with the action taken at t−1; this sample is collected and saved to memory for training purposes, as detailed in Section 3.2.1. Finally, the agent selects a new action, applies it to the environment, and a new sequence of simulation steps begins. In this work, the agent is trained with a traffic microsimulator over multiple episodes of traffic scenarios from which experience is gathered; each episode consists of 5400 steps, equivalent to 1 hour and 30 minutes of traffic simulation.

3.2.1. Experience Replay

Experience replay is a technique used in the training phase to improve the agent’s performance and learning efficiency. It supplies the information needed for learning by randomly sampling a set of stored samples rather than immediately using the information the agent has just collected during simulated training (often referred to as online learning). The batch is drawn directly from an in-memory data structure that stores every sample collected during the training phase. Each sample is defined as a tuple of four elements, m = (s_t, a_t, r_{t+1}, s_{t+1}), where r_{t+1} is the reward received after taking action a_t in state s_t, which transitions the environment to the next state s_{t+1}. A training instance draws a batch of such samples from memory and uses them to train the neural network.

In this paper, the memory is sized to hold 50,000 samples for experience replay. The batch size is defined as the number of samples drawn from memory in one training instance. When the memory is full, the oldest samples are deleted to make room for new ones. The strategy adopted here is to run a training phase at every time step with a batch size of 50 samples. In a 5400-step episode with about 415 time steps and a batch size of 50, roughly 20,750 samples are used for training per episode, which is about half the memory size.
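The sketch below mirrors this replay memory: 50,000-sample capacity, oldest samples evicted when full, and batches of 50 drawn uniformly at random.

```python
# Replay memory: bounded FIFO store with uniform random batch sampling.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=50_000):
        self.samples = deque(maxlen=capacity)   # deque drops the oldest entry

    def add(self, state, action, reward, next_state):
        self.samples.append((state, action, reward, next_state))

    def sample_batch(self, batch_size=50):
        k = min(batch_size, len(self.samples))
        return random.sample(self.samples, k)
```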

3.2.2. Training Process
(1) Add a sample m containing the latest information, as defined by the tuple above, to memory.
(2) According to the sampling strategy, a fixed number of samples are drawn at random from memory to form batch X. Each sample contains the state s_t, the action a_t, the reward r_{t+1}, and the next state s_{t+1}, and each sample is processed as follows (a hedged code sketch of these steps is given after this list):
  (a) The state s_t obtained from the simulation environment is input to the neural network to obtain the Q-value of each action, thereby computing Q(s_t, a_t) in Figure 5.
  (b) As shown in Figure 6, the Q-values of the next state are computed by submitting the vector RDS of the agent’s next state s_{t+1} to the neural network; these predicted Q-values describe how the environment will develop and which actions may be taken next. As shown in Figure 7, the Q-value is then updated using equation (9): among the possible future Q-values computed in this step, the best one is selected, representing the maximum expected future return.
  (c) Neural network training. Figure 8 shows the training of the neural network: the input is the vector RDS representing state s_t, and the expected output is the updated Q-value Q′(s_t, a_t), which, as a result of the Q-value update, now includes the maximum expected future return. The next time the agent encounters state s_t or a similar state, the network is likely to output a Q-value for that action that reflects the best future case.
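A hedged sketch of one such training step with a PyTorch Q-network (q_net) and a replay batch of (state, action, reward, next_state) tuples is shown below; it is a generic DQN update in the spirit of steps (a)-(c), not the authors' exact training code.

```python
# One DQN training step: current Q-values, target from eq. (9), squared-error fit.
import torch
import torch.nn.functional as F

def train_step(q_net, optimizer, batch, gamma=0.75):
    states, actions, rewards, next_states = zip(*batch)
    s = torch.stack([torch.as_tensor(x, dtype=torch.float32) for x in states])
    s_next = torch.stack([torch.as_tensor(x, dtype=torch.float32) for x in next_states])
    a = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    r = torch.as_tensor(rewards, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a).squeeze(1)          # step (a): Q(s_t, a_t)
    with torch.no_grad():
        q_next = q_net(s_next).max(dim=1).values     # step (b): max_a Q(s_t+1, a)
    target = r + gamma * q_next                      # updated Q-value, eq. (9)

    loss = F.mse_loss(q_sa, target)                  # step (c): squared-error fit
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```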

4. Simulation and Results Analysis

The basic layout of the road intersection is drawn with the netedit tool in SUMO, producing the net.xml file of the intersection network. Modeling traffic flow at intersections is essential for their design, control, and management, but modeling vehicle trajectories within an intersection is challenging because there are infinitely many possible paths in two-dimensional space and drivers can adjust their speeds at the same time [21, 22]. Therefore, this paper does not consider two-dimensional movement inside the intersection. In a road network whose signals are controlled by reinforcement learning, generating a realistic normal traffic flow is an important part of a proper evaluation model. To obtain a high degree of realism, a Weibull distribution is used in the training phase to simulate the arrival of normal traffic. Figure 9 shows an example, in which the horizontal axis represents the time step at which a CV vehicle is generated and the vertical axis represents the traffic flow generated in each step window. The Weibull distribution is chosen because it approximates the arrival pattern of traffic: flow accumulates up to the peak hour and then decreases continuously as congestion gradually dissipates. In mixed traffic, the Weibull distribution can effectively describe the cumulative probability distribution of the right-of-way of CV vehicles, with the shape parameter decreasing and the scale parameter increasing as interference factors grow. The arrival of CV buses, however, shows no strong regularity, so the hourly arrival pattern of bus traffic is used; that is, the total traffic flow is the bus flow plus the normal car flow.

The distribution in Figure 9 determines the exact steps at which each car is generated. For each generated car, the origin and destination are chosen with a random number generator whose seed changes with each new episode. To maximize the performance of the agent, the simulation should include different sequences and patterns; therefore, five different scenarios are defined in this paper.
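The sketch below generates car departure times from a Weibull distribution over a 5400-step episode, roughly matching the arrival profile of Figure 9. The shape parameter and the rescaling to the episode length are assumptions for illustration.

```python
# Weibull-shaped departure times over one 5400-step episode (parameters assumed).
import numpy as np

def generate_departures(n_cars=1000, n_steps=5400, shape=2.0, seed=0):
    rng = np.random.default_rng(seed)
    timings = np.sort(rng.weibull(shape, size=n_cars))   # skewed arrival curve
    timings = timings / timings.max() * n_steps          # rescale to the episode
    return timings.astype(int)

departures = generate_departures()
print(departures[:5], departures[-5:])
```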

4.1. Algorithm Performance Evaluation Metrics

To evaluate the performance of the agents, a series of fixed-time signal experiments is conducted alongside each trained agent. For each scenario, the random seeds chosen for vehicle generation differ from those used in the training episodes. The Weibull distribution used to generate vehicle arrivals for evaluation is described in Section 4.3. The results of 5 experiments are averaged to compare the performance of reinforcement learning in each scenario, and the median of the cumulative negative reward values is used as the evaluation value. The performance metrics used for evaluation are as follows:

Average negative return:

The sum of the negative rewards received at each time step t in an evaluation episode, averaged over the 5 runs.

Total waiting time:

The sum of the waiting times of all vehicles at each time step t in an evaluation episode, averaged over the 5 runs.

Average vehicle waiting time: the total waiting time divided by the number of vehicles, averaged over the 5 runs.
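A hedged sketch of the three metrics, averaged over the 5 evaluation runs, is given below; the exact formulas are not reproduced in the text, so these are straightforward interpretations of the descriptions above.

```python
# Evaluation metrics averaged over the 5 runs (interpretation of the text).
import numpy as np

def average_negative_return(rewards_per_run):
    """rewards_per_run: list of per-step reward lists, one per run."""
    return np.mean([sum(r for r in run if r < 0) for run in rewards_per_run])

def total_waiting_time(wait_per_run):
    """wait_per_run: list of per-step total waiting times, one per run."""
    return np.mean([sum(run) for run in wait_per_run])

def average_vehicle_waiting_time(wait_per_run, vehicles_per_run):
    return total_waiting_time(wait_per_run) / np.mean(vehicles_per_run)
```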

4.2. Fixed Signal Timing Scheme

As shown in Table 1, the phase sequence is north-south through, north-south left turn, east-west through, east-west left turn. The green time for the east-west and north-south left-turn phases is 16 seconds, and each green is followed by a 3-second yellow to clear vehicles inside the intersection. When planning this traffic flow, arrivals are simulated according to the Weibull distribution. The total arrival flow is 1074 pcu/h, including 1000 normal cars, distributed evenly among the approaches; 25% of the traffic is turning traffic, split equally between left and right turns. In addition, 37 buses are added to each of the east-west and north-south through phases, inserted into the traffic flow at random.

4.3. Analysis of Training Results

As shown in Figure 10(a), different control results are obtained with different γ-values. When γ is 0.1, 0.25, 0.5, or 0.75, a good training effect is obtained; when γ is 0.9, the reward value increases continuously and the training results are invalid, making it impossible to complete the signal control. This may be because the agent pays too much attention to future experience and ignores the experience it already has, so it cannot accumulate effective experience. Results with γ = 0.9 are therefore not considered in the following analysis. Figure 10(b) shows that training results are relatively good over the whole range below 0.9; despite some volatility, the training effect is best when γ is 0.75, with a final reward value of around 100,000 and good control performance.

Second, as shown in Figure 11, the total delay at the intersection shows that with γ-values of 0.1, 0.25, 0.5, and 0.75 the delay is reduced from an initial value of up to 85,000 s to about 10,000 s, all good control effects. In the early stage of training the fluctuations are large and each run deviates considerably, but after about 200 training episodes the curves become stable; this may be because the agent has not yet learned enough at the beginning and cannot find an effective control strategy from its limited experience, whereas once experience accumulates, the agent achieves accurate control based on the vehicle conditions at each approach. The figure shows that when γ is 0.75 the total passenger delay decreases by about 1500 s compared with γ values of 0.5 and 0.1; with γ = 0.75 the agent’s control effect is the best, improving the control effect by 15%.

Finally, as shown in Figure 12, the queue analysis indicates that the training results resemble the delay results: the final average queue length is about two vehicles and convergence is stable. When γ is 0.75 the control effect is best, with the original queue of 8 vehicles reduced to about 2, showing that the variable phase sequence and dynamic green time proposed in this paper achieve a good control effect. Convergence is essentially reached after about 1300 iterations, faster than with γ values of 0.1, 0.25, and 0.5.

Based on the training results, the delay data of vehicles at the intersection are obtained. As shown in Figure 13, the cumulative delay of the control scheme adopted in this chapter is compared with that of the traditional signal timetable. From Figure 13(a), the total intersection delay is reduced from 40,000 seconds to about 24,000 seconds with the DRL strategy: the total delay decreases by 4.4 hours, the average delay per vehicle decreases by 16 seconds, and the delay time falls by 40%, a large improvement in intersection delay. As shown in Figure 13(b), the delay of buses under the optimized conditions is reduced from about 2400 seconds to about 1200 seconds, the average delay per bus falls by 16.2 seconds, and the delay decreases by nearly 50%. The control delay is thus significantly improved under the DRL model. Comparing the time lost when vehicles fall below their expected speed shows that the average vehicle delay is greatly reduced. Converting vehicles to passengers, with an average occupancy of 1.5 persons per car and 32 persons per bus, yields the cumulative passenger delay curves in Figure 14. Figure 14(a) shows that cumulative passenger delay decreases from 133,641 seconds to 75,487 seconds, a reduction of 43.5%, and per capita delay decreases by 37 seconds. Figure 14(b) shows that cumulative bus passenger delay decreases from 61,475 seconds to 35,978 seconds, a reduction of 41.5%, and the per capita loss time of bus passengers decreases by 11.7 seconds. The control effect is thus significantly improved.

5. Conclusions

For conditions in which traffic flow intensity varies greatly, this paper introduces an intersection priority control system based on a deep Q-network. A multilane, four-phase intersection is simulated with the urban traffic simulator SUMO under different traffic density distributions. By using per-capita delay as the reward function, the weighting of bus priority decisions is improved, enhancing bus priority efficiency. The results show that vehicle loss time is reduced by nearly 40% and cumulative per-capita loss time by nearly 43.5%.

In the future, we will optimize the following aspects to achieve a better transit signal priority system. (1) Agents trained with a simple state space and with a large, complex state space perform similarly, but the simple state saves computation and improves learning efficiency; we therefore propose using a simple state as the agent's input to improve the learning efficiency of traffic signal control. (2) Since most traffic signal control studies are conducted in simulation, the traffic data and road networks used are usually generated from artificial assumptions and do not reflect real traffic. More realistic traffic scenarios, including vehicles, pedestrians, and transit stops near intersections, are required to transfer the method from the simulated environment to the real world.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study has been supported by Jiangxi Provincial Natural Science Foundation (grant no. 20224BAB204066).