Abstract

To explore the impact of autonomous vehicles (AVs) on human-driven vehicles (HDVs), this article provides a solution for AVs to coexist harmoniously with HDVs during car following when AVs are at a low market penetration rate (MPR). An extended car following framework with two possible soft optimization targets is proposed to improve the experience of HDV followers with different following strategies using the deep deterministic policy gradient (DDPG) algorithm. The pretreated Next Generation Simulation (NGSIM) dataset was used for the experiments: 1027 redefined car following events were extracted from it, of which 600 were used for training and 427 for testing. Different driving strategies obtained from classical car following models were embedded into a virtual environment built with OpenAI Gym. A reward function combining safety, efficiency, jerk, and stability was used, and the DDPG agent was trained to maximize it. The results reveal that, under the soft optimization that enhances motion intention, the disturbance of HDV followers decreases by 2.362% (strategy a), 8.184% (strategy b), and 13.904% (strategy c), respectively; under the soft optimization that minimizes disturbance, it decreases by 14.961% (strategy a), 12.020% (strategy b), and 13.425% (strategy c), respectively. HDV followers with different strategies experience less jerk under both soft optimizations. AV passengers incur a loss in jerk and efficiency, but safety is enhanced. In addition, AV car following outperforms HDV car following under both soft and brutal optimizations. Finally, two possible solutions for the harmonious coexistence of HDVs and AVs at low AV MPR are proposed.

1. Introduction

With the development of technology, autonomous driving has become achievable, and a great deal of research has been devoted to it. Connected automated vehicles (CAVs), as part of intelligent transportation systems, carry high expectations for improving vehicle operation safety, traffic efficiency, and passenger experience. However, in the early phase of CAV implementation, the number of CAVs is far smaller than the number of human-driven vehicles (HDVs). CAVs then have less of the effect they were designed for and may degenerate into autonomous vehicles (AVs). Will AVs coexist harmoniously with HDVs under a brutal optimization for the AV's own passengers? Car following is a routine situation that AVs cannot avoid. So, what are the effects of an AV on its HDV follower when the AV follows an HDV, if the AV is designed and operated like a selfish but experienced driver? Will a selfless optimization of the AV in the car following process make a difference to HDV followers? Many new questions need to be studied.

Car following is a basic phenomenon of traffic flow consisting of HDVs [1]. The research models can be divided into physical models and data-driven models. In early studies, mathematical formulas and physical theories were used to express the car following phenomenon, and there are abundant theories describing driver behavior. The safety distance theory proposed by Gipps describes drivers seeking a velocity that balances their desires, vehicle performance, and safety [2]. The optimal velocity (OV) model proposed by Bando et al. assumes that drivers pursue an optimal velocity to approach an equilibrium state [3]; however, without a limitation on acceleration, it produces unrealistic accelerations in simulation. Researchers kept improving the OV model and proposed variants such as the generalized force model (GFM) by Helbing and Tilch [4] and the full velocity difference model (FVDM) by Jiang et al. [5]. The intelligent driver model (IDM) proposed by Treiber et al. [6] reflects the conflict between drivers' desired velocity and driving safety. With the rise of machine learning, car following models have shifted from physical models to data-driven models; without physical formulas, researchers focus on improving the accuracy of data-driven car following models. In summary, physical models tend to express the behavior of HDVs, while data-driven models tend to predict HDV actions or mimic human driving [7].

Autonomous driving has been proposed for a long time, and there are abundant methods for AVs to execute car following. According to Zhu et al., human-like driving is more likely to be accepted [8], and many studies on mimicking human-like car following have been conducted, in which higher accuracy in reproducing human driving trajectories or velocities is the main goal. Simonelli et al. proposed a feed-forward artificial neural network (ANN) and a recurrent ANN to mimic human driving [9, 10]; the results reveal that the recurrent ANN performs better at mimicking human driving, while the feed-forward ANN suits online learning because of its fast learning and acceptable accuracy, even though its accuracy is not as good as that of the recurrent ANN [11]. K-nearest neighbor (KNN) has been used to fit human driving trajectories [12], where a simple model yields a remarkably good fit. A recurrent neural network (RNN) was first used to learn the memory effect in a car following model by Zhou et al. [13]; then long short-term memory (LSTM) RNNs, as an improvement of RNNs, were applied to model car following [14]. The deep deterministic policy gradient has been used to train an agent with the mean square error as the reward function, so that the agent drives like a human [8], showing high accuracy in fitting human driving trajectories. Zhou et al. [15] considered different driving styles and used GAIL-GRU with proximal policy optimization (PPO) to update the actor-critic network to create a human-like car following model, which has been proven to fit time headway better for both aggressive and conservative drivers.

Researchers are aware of the limitation of mimicking human driving: at best, mimicry can only make AVs drive like HDVs. Advanced on-board units (OBUs) might help AVs surpass HDVs. Optimization of AVs in car following aims at improving safety, efficiency, and comfort [16]. Zhu et al. [8] used the deep deterministic policy gradient (DDPG) algorithm with an objective function that rewards safe, efficient, and comfortable driving; this work made the AV agent drive more safely, efficiently, and comfortably than human drivers [17]. A time margin indicating driving safety, different from TTC, has been constructed [18], and AVs are not allowed to fall below it. Several reinforcement learning methods have been compared, and automatic entropy adjustment has been employed in the Tsallis actor-critic algorithm; the autonomous car following model based on this algorithm shows good driver acceptance. Some methods use visual monitors rather than kinematic sensors to achieve end-to-end control. Q-learning has been employed to discretize the vehicle's actions for car following [19]. Moreover, deep Q-learning has been used to let the agent take continuous actions for car following [20].

Researchers have also shown through traffic simulations that AVs can improve the whole traffic flow. An agent-based macroscopic simulation of a dedicated lane for AVs was conducted [21], and the results show that as the proportion of AVs in traffic increases, the average travel time decreases. Lu et al. conducted a SUMO-based simulation to study the effect of different AV MPRs and automation levels on the macroscopic fundamental diagram (MFD) [22]. Cellular automata-based models have been used to simulate mixed traffic flow composed of AV platoons and normal vehicles [23]; road capacity increases with platoon size but reaches a maximum value and does not continue to grow. Chen et al. proposed the notion of "1 + n", in which CAVs lead HDVs in the intersection area [24], and built an optimization framework considering fuel consumption, velocity deviation, and the terminal velocity of the whole platoon to improve vehicle throughput.

Although the above studies have proposed methods for modeling autonomous car following and for simulating mixed traffic flow, some issues emerge:
(i) Most HDV car following models ignore the vehicles following behind the target vehicle [25]. This assumption continues to affect the parameter setting of autonomous vehicles, which leads AVs toward a brutal optimization of self-interest.
(ii) The more convenient a macroscopic traffic simulation is, the less it considers the details of driver heterogeneity. Relatively macroscopic simulations may omit the detailed impacts that an AV leader imposes on its HDV follower.
(iii) When CAVs are at a low MPR, AVs surrounded by HDVs on the highway will be the norm, producing a minimal system-level effect on traffic [26]. Previous research has proven that AVs make a great difference in improving traffic flow operation; however, little attention has been paid to the HDV right behind the AV, especially when CAVs are at a low MPR. Few works discuss the harmonious coexistence of AVs and HDVs.

To address the above questions, we further study the interaction between an AV leader and its HDV follower. Based on the theory proposed by Gong and Zhu [27], we provide an indicator for improving the stability of the AV leader and the HDV follower, and we integrate it into the reward function of the RL algorithm. 1027 redefined car following events are extracted from the pretreated NGSIM data; 600 of them are used for training and 427 for testing. Representative models are selected to express different driving strategies, namely the IDM, the Gipps model, and the FVDM, and these models are embedded into a new virtual environment built with OpenAI Gym. The differences between the RL autonomous car following model proposed in this paper and others can be summarized as follows:
(i) The model is willing to sacrifice some self-interest in efficiency and comfort to stay stable in the deceleration process (stable deceleration) or in the equilibrium state (low absolute acceleration).
(ii) The comfort of HDV followers with different driving strategies is improved.
(iii) The model is not standalone; it can cooperate with other models that do not consider following HDVs in car following events, and it can be treated as an extension for the situation in which the AV has an HDV follower.

The rest of this article is organized as follows: Section 2 introduces the methodology, including the DDPG algorithm and the virtual environment designed to fit the redefined car following event. Section 3 gives the details of processing the NGSIM data and the reward indicators. The experiment is described in Section 4. Results and analysis are presented in Section 5. Finally, Section 6 gives the conclusion and discussion.

2. Methodology

Considering that AVs need to be controlled while interacting with the environment, reinforcement learning is well suited to AVs [28]. In this section, the deep deterministic policy gradient is introduced, and the feature that distinguishes this work from other papers, namely considering the driver behind the AV, is demonstrated. An interactive environment is created for the reinforcement learning agent to learn safe and efficient driving.

2.1. Deep Deterministic Policy Gradients (DDPGs)

DDPG is a method for solving continuous control problems [29]; it uses the actor-critic architecture to combine Q-learning [30] and policy gradients [31]. To explain the advantages of DDPG, it is necessary to briefly introduce the reinforcement learning family. Reinforcement learning can be divided into two kinds: value-based reinforcement learning and policy-based reinforcement learning.

Value-based reinforcement learning gives a Q-value Q(s, a) for the action a taken in state s. The agent explores the environment and builds up a Q-table, which it uses to evaluate its actions under different states: in each state, the agent takes the action with the highest Q-value. However, the existence of the Q-table makes continuous problems hard to deal with, because the table is finite with a fixed size. The deep Q-network (DQN) was then proposed to handle this [32]: the traditional Q-table is replaced by neural networks, and without the limitation of the Q-table, DQN shows great strength.

Policy-based reinforcement learning makes the agent take actions by following a policy π(a|s), a probability density function over actions. The policy evaluates the current state and generates π(a|s); each action gets its probability of being taken, and the agent then performs stochastic sampling or deterministic selection to choose the action it takes [33].

Each kind of RL has its own characteristics, and the actor-critic architecture was proposed as a combination of the two. DDPG is one of the RL algorithms that use an actor and a critic; the algorithm is shown in Algorithm 1.

(1) Randomly initialize critic network Q(s, a | θ^Q) and actor μ(s | θ^μ) with weights θ^Q and θ^μ
(2) Initialize target networks Q′ and μ′ with weights θ^Q′ ← θ^Q, θ^μ′ ← θ^μ
(3) Initialize replay buffer R
(4) For episode = 1, M do
(5)  Initialize a random process N for action exploration
(6)  Receive initial observation state s_1
(7)  For t = 1, T do
(8)   Select action a_t = μ(s_t | θ^μ) + N_t according to the current policy and exploration noise
(9)   Execute action a_t and observe reward r_t and new state s_{t+1}
(10)   Store transition (s_t, a_t, r_t, s_{t+1}) in R
(11)   Sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from R
(12)   Set y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1} | θ^μ′) | θ^Q′)
(13)   Update critic by minimizing the loss: L = (1/N) Σ_i (y_i − Q(s_i, a_i | θ^Q))²
(14)   Update actor policy using the sampled policy gradient: ∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s_i}
(15)   Update target networks: θ^Q′ ← τ θ^Q + (1 − τ) θ^Q′, θ^μ′ ← τ θ^μ + (1 − τ) θ^μ′
(16)  End for
(17) End for

DDPG not only has its characteristic deterministic policy but also integrates an efficient replay buffer into the training process. The replay buffer of DDPG stores transitions, i.e., the history of interactions executed by the agent. Each stored transition has the form (s_t, a_t, r_t, s_{t+1}), where s_t represents the state of the environment, a_t represents the action taken by the agent in this state, s_{t+1} represents the next state of the environment after the agent takes the action, and r_t represents the reward obtained from the change of the environment. The DDPG algorithm achieves good performance in control problems with continuous action spaces.
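As an illustration of how the replay buffer and the soft target update described above can be realized, the following is a minimal Python sketch. The class and function names (ReplayBuffer, soft_update) are our own, not from the original implementation; the capacity of 20,000, batch size of 256, and τ = 0.01 follow the hyperparameters reported in Section 4.3.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size buffer storing transitions (s_t, a_t, r_t, s_{t+1})."""

    def __init__(self, capacity=20000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=256):
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states = map(np.array, zip(*batch))
        return states, actions, rewards, next_states


def soft_update(target_params, source_params, tau=0.01):
    """Soft target update: theta_target <- tau * theta + (1 - tau) * theta_target."""
    for name in target_params:
        target_params[name] = tau * source_params[name] + (1.0 - tau) * target_params[name]
    return target_params
```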

2.2. Virtual Interactive Environment

The rest of this section describes the design of the virtual interactive environment. Previous deep reinforcement learning environments are built in a default setting in which only the agent is considered. The state used by Zhu et al. [8] is shown in Figure 1, in which the velocity of the preceding vehicle is replaced by the relative velocity (the velocity of the preceding vehicle minus the velocity of the objective vehicle); the remaining state variables are the bump-to-bump gap distance and the velocities of the objective vehicle and the preceding vehicle, respectively.

To study the effect of the objective vehicle on its follower, a more complicated environment is created for agent interaction. A classical physical car following model is used to express the HDV driver's strategy as an extra human-driven vehicle following behind the agent. A new car following event considering both the objective AV and its HDV follower is proposed, so the environment the agent studies is changed to the situation shown in Figure 2.

The updating of the environment is determined by the acceleration of the preceding vehicle, the acceleration of the objective vehicle, and the acceleration of the following vehicle. The state update is made up of equations (1) to (5), and the velocity of the preceding vehicle is updated with the original data.

The transformation between space headway and gap distance is shown in the following equation: g_n(t) = h_n(t) − L_{n−1}, where h_n(t) denotes the space headway of the vehicle, g_n(t) denotes the gap (bump-to-bump) distance, and L_{n−1} denotes the vehicle length of the leader.
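Since equations (1)-(5) describe a kinematic state update driven by the recorded leader trajectory, the agent's acceleration, and the follower model, a minimal sketch of one such update step is given below, assuming a simple constant-acceleration update over the 0.1 s NGSIM interval. The class and attribute names are illustrative and not taken from the original code.

```python
import numpy as np

DT = 0.1  # NGSIM sampling interval (s)


class MiniPlatoonEnv:
    """Three-vehicle car following environment: preceding (from data), AV agent, HDV follower."""

    def __init__(self, leader_speed_profile, follower_model, veh_length=5.0):
        self.leader_v = leader_speed_profile  # recorded leader velocities (m/s)
        self.follower_model = follower_model  # callable(gap, v_follower, v_leader) -> acceleration
        self.L = veh_length
        self.t = 0
        # positions and velocities: [leader, agent, follower]
        self.x = np.array([40.0, 20.0, 0.0])
        self.v = np.array([leader_speed_profile[0], 10.0, 10.0])

    def step(self, agent_accel):
        v_lead_next = self.leader_v[min(self.t + 1, len(self.leader_v) - 1)]
        follower_accel = self.follower_model(
            self.x[1] - self.x[2] - self.L, self.v[2], self.v[1])

        # constant-acceleration update over one 0.1 s step
        self.x[0] += 0.5 * (self.v[0] + v_lead_next) * DT              # leader follows recorded data
        self.x[1] += self.v[1] * DT + 0.5 * agent_accel * DT ** 2      # AV agent
        self.x[2] += self.v[2] * DT + 0.5 * follower_accel * DT ** 2   # HDV follower

        self.v[0] = v_lead_next
        self.v[1] += agent_accel * DT
        self.v[2] += follower_accel * DT
        self.t += 1

        gap_agent = self.x[0] - self.x[1] - self.L  # bump-to-bump gap to the leader
        state = np.array([gap_agent, self.v[1], self.v[0] - self.v[1]])
        return state
```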

3. Data Preparation and Reward Indicators

In this section, the data preparation is explained, and the selection of the reward indicators is described in detail.

3.1. Data Preparation

The dataset used for the experiment is the I-80 NGSIM dataset, reconstructed by the methods mentioned in previous studies [34, 35]. Because the sampling frequency is 10 Hz, the average velocity over each interval is treated as the instantaneous velocity, and the same holds for the acceleration.

A car following event requires that the vehicles stay in the same lane and that the duration is no less than 15 s. In total, 1027 car following events are extracted to fit our environment. It should be stressed that car following events in this paper involve three vehicles, i.e., one preceding vehicle, one objective vehicle, and one following vehicle. The total data contain 429,909 time slots.

3.2. Reward Indicators
3.2.1. Safety

Safety is an important criterion for both HDVs and CAVs. Time to collision (TTC) is widely used as a safety indicator in many applications [36], such as the design of advanced driver assistance systems (ADASs) and autonomous driving.

TTC is defined by the following equation: TTC(t) = (x_pre(t) − x_obj(t) − L_pre) / (v_obj(t) − v_pre(t)), where x_pre and x_obj are the positions of the preceding and objective vehicles, L_pre is the length of the preceding vehicle, and v_pre and v_obj are their velocities.

TTC has a threshold, and the agent tries to keep TTC above it. In this article, the value of the threshold is set as in the previous study [37]. The safety indicator is shown in the following equation, in which a penalty factor makes the agent avoid TTC values falling into the interval between 0 and the threshold. Safety is a fundamental operation criterion, so more dangerous actions lead to larger punishment.
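A possible form of such a TTC-based safety term, loosely following the logarithmic penalty used in the study from which the threshold is taken [37], is sketched below. The threshold of 4 s, the logarithmic shape, and the function names are assumptions for illustration, not the paper's exact formula.

```python
import math

TTC_THRESHOLD = 4.0  # assumed threshold (s), following the cited previous study


def ttc(gap, v_obj, v_pre):
    """Time to collision; negative or infinite values mean no closing conflict."""
    dv = v_obj - v_pre
    return gap / dv if abs(dv) > 1e-6 else float("inf")


def safety_reward(gap, v_obj, v_pre):
    """Penalize TTC values that fall inside (0, threshold]; neutral otherwise."""
    t = ttc(gap, v_obj, v_pre)
    if 0.0 < t <= TTC_THRESHOLD:
        # log penalty grows sharply as TTC approaches zero (more dangerous)
        return math.log(t / TTC_THRESHOLD)
    return 0.0
```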

3.2.2. Efficiency

Efficiency requires drivers to drive as fast as possible under safe conditions. Time headway (TH) plays a significant role in both road safety and traffic capacity, and improving traffic capacity lets the traffic flow serve more vehicles in the same time interval [38].

Different countries have their own standards for time headway, and an appropriate time headway based on empirical data is advised. The time gap distribution is used in this study so that the results can be contrasted with the previous study. The previous study arranged different efficiency rewards for different velocities; here, however, the efficiency reward for the agent is kept simple. The distribution of the car following events' time gaps in the NGSIM dataset is shown in Figure 3.

The efficiency indicator is shown in the following equation, where the two distribution parameters are set as 0.6058 and 0.5060, respectively, as in the previous study [8].
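If, as in comparable work, the two reported parameters are interpreted as the location and scale of a lognormal time-gap distribution fitted to the NGSIM data, the efficiency term could reward time gaps near the empirical mode, as in the sketch below. This interpretation, the normalization, and the function names are assumptions for illustration, not the paper's exact indicator.

```python
import math

MU, SIGMA = 0.6058, 0.5060  # parameters reported in the paper (interpretation assumed)


def lognormal_pdf(x, mu=MU, sigma=SIGMA):
    """Probability density of a lognormal distribution."""
    if x <= 0:
        return 0.0
    return math.exp(-(math.log(x) - mu) ** 2 / (2 * sigma ** 2)) / (x * sigma * math.sqrt(2 * math.pi))


def efficiency_reward(time_gap):
    """Reward time gaps that are common in the empirical NGSIM distribution."""
    mode = math.exp(MU - SIGMA ** 2)          # mode of the lognormal distribution
    return lognormal_pdf(time_gap) / lognormal_pdf(mode)  # normalized to [0, 1]
```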

3.2.3. Comfort

Comfort is evaluated by jerk, defined as the first derivative of acceleration [39]. However, the NGSIM data are discrete, so the derivative is approximated by the change of acceleration over a short time interval, set to 0.1 s according to the sampling frequency of the NGSIM data.

The comfort indicator is set as the following equation, based on the discrete jerk jerk(t) = (a(t) − a(t − Δt)) / Δt with Δt = 0.1 s:

The maximum acceleration is 3 m/s2, the maximum deceleration is −3 m/s2, and the maximum absolute jerk follows from these bounds and the 0.1 s time step. These parameters are set in the virtual environment as action boundaries according to the literature [37].
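A common way to turn jerk into a reward term is a negative quadratic penalty normalized by the largest jerk the action bounds allow; the sketch below is such an assumed form, not the paper's exact equation.

```python
DT = 0.1                         # NGSIM time step (s)
A_MAX, A_MIN = 3.0, -3.0         # acceleration bounds (m/s^2)
JERK_MAX = (A_MAX - A_MIN) / DT  # largest jerk reachable within one step (m/s^3)


def jerk(accel_now, accel_prev, dt=DT):
    """Discrete approximation of the first derivative of acceleration."""
    return (accel_now - accel_prev) / dt


def comfort_reward(accel_now, accel_prev):
    """Quadratic jerk penalty scaled to roughly [-1, 0]."""
    j = jerk(accel_now, accel_prev)
    return -(j / JERK_MAX) ** 2
```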

3.2.4. Stability

Stability is frequently discussed in physics-based car following models, and it has been proven that self-stability makes car following more stable [27]. Based on the previous study, a stability reward indicator is set as the following formula, in which the variable denotes the acceleration of the vehicle, covering both acceleration and deceleration.

However, minimizing acceleration might conflict with minimizing jerk. Rechecking the NGSIM I-80 dataset, we find that it contains sustained deceleration or acceleration processes that form or dissipate congestion, during which the acceleration oscillates between negative and positive values within a narrow interval. Keeping a small jerk during the deceleration (or acceleration) process expresses a clear motion intention that the HDV follower can catch, and a smaller jerk in that process means a more stable deceleration (or acceleration). Less acceleration oscillation in the relative equilibrium state means less change in velocity, i.e., the vehicle operates in a stable state. Therefore, stability consists of dynamic stability during the deceleration (or acceleration) process and equilibrium stability during the relative equilibrium state, formalized in the following equation, which involves the history acceleration of the agent in the previous state.

During the experiment, we found that minimizing acceleration in the equilibrium state might cause frequent switching between positive and negative acceleration; therefore, keeping jerk small while minimizing the oscillation is necessary, and a penalty term is proposed to avoid frequent changes of acceleration. Whether the state is a deceleration (or acceleration) process or an equilibrium state is signalled by the history acceleration of the objective vehicle. The distribution of acceleration in the NGSIM dataset is investigated and shown in Figure 4.

We take the interval between the first quartile and the third quartile of this distribution. A history acceleration within this interval means that the vehicle is in the equilibrium state; a history acceleration outside this interval means that the vehicle is in a deceleration (or acceleration) process.
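Putting the two cases together, a stability term of the kind described above could look like the following sketch, where the quartile bounds are placeholders, and the absolute-acceleration penalty in the equilibrium case and the jerk-based penalty in the dynamic case are our reading of the text rather than the exact published formula.

```python
# Quartile bounds of the NGSIM acceleration distribution (placeholder values;
# the paper takes them from the empirical distribution in Figure 4).
Q1_ACCEL, Q3_ACCEL = -0.3, 0.3


def stability_reward(accel_now, accel_hist, dt=0.1):
    """Penalize instability, switching on the history acceleration of the agent.

    Equilibrium state (history acceleration inside [Q1, Q3]): penalize the
    absolute acceleration, i.e., encourage staying near zero acceleration.
    Dynamic state (history acceleration outside the interval): penalize the
    jerk, i.e., encourage a steady deceleration or acceleration.
    """
    if Q1_ACCEL <= accel_hist <= Q3_ACCEL:
        return -abs(accel_now)                 # equilibrium stability
    return -abs(accel_now - accel_hist) / dt   # dynamic stability (jerk magnitude)
```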

4. Experiment

In this section, the details of the experiment are presented. The experiment is divided into three parts, and the heterogeneity of driving strategies is considered; strategies are expressed in three kinds. Strategy a is that drivers try to keep a reasonable time headway; since this target is hard to reach, it corresponds to drivers with abundant driving experience or vehicles with an advanced driver assistance system. Strategy b is that drivers want to drive at an optimal velocity. Strategy c is that drivers are sensitive to the safe distance and are willing to change their velocity to maintain it. Agents with three reward functions, one brutal optimization and two soft optimizations, are trained to maximize reward. The soft optimizations are enhancing motion intention (EMI) and minimizing disturbance (MD).

4.1. Environment Setup

The environment is built with OpenAI Gym. The basic rule by which the RL environment updates its state was given in Section 2.2; this section shows the differences from other environments. Each training episode denotes a car following event containing three vehicles: the objective vehicle is the agent under DDPG control, while the following vehicle is controlled by a classical CF model according to its driving strategy.

Strategy a is represented by the intelligent driver model (IDM): a_n(t) = a_max [1 − (v_n(t)/v_des)^4 − (s*(v_n(t), Δv_n(t)) / s_n(t))^2], with s*(v_n(t), Δv_n(t)) = s_0 + s_1 √(v_n(t)/v_des) + T v_n(t) + v_n(t) Δv_n(t) / (2 √(a_max b)), where a_n(t) denotes the acceleration at time t, a_max denotes the maximum acceleration, v_n(t) denotes the follower's velocity, v_des denotes the desired velocity, s_n(t) denotes the bump-to-bump distance, s_0 and s_1 are the two jam distance parameters, T denotes the safe time headway, b denotes the maximum deceleration, and Δv_n(t) denotes the velocity difference with respect to the leader.

Strategy b is represented by the full velocity difference model (FVDM), which is an improved optimal velocity model.

Strategy c is represented by the safe distance model (SDM), whose parameters are the maximum acceleration, the maximum deceleration, the minimum space headway (made up of the vehicle length and the minimum bump-to-bump distance acceptable to the driver), the desired velocity of drivers, the location of the vehicle at time t, the velocity of the vehicle at time t, and the reaction time. The values of the parameters of the different CF models are shown in Table 1.
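For concreteness, the following sketch implements one common formulation of each of the three follower strategies: the IDM (strategy a), the FVDM in its standard Jiang et al. form (strategy b), and a Gipps-type safe distance rule (strategy c). The default parameter values are illustrative, not the calibrated values in Table 1, and the function signatures match the follower interface assumed in the environment sketch of Section 2.2.

```python
import math


def idm_accel(gap, v, v_lead, a_max=1.0, b=2.0, v_des=15.0, s0=2.0, s1=0.0, T=1.5):
    """Intelligent driver model (strategy a)."""
    dv = v - v_lead
    s_star = s0 + s1 * math.sqrt(max(v, 0.0) / v_des) + T * v + v * dv / (2 * math.sqrt(a_max * b))
    return a_max * (1 - (v / v_des) ** 4 - (s_star / max(gap, 0.1)) ** 2)


def fvdm_accel(gap, v, v_lead, kappa=0.4, lam=0.5, v1=6.75, v2=7.91, c1=0.13, c2=1.57, lc=5.0):
    """Full velocity difference model (strategy b), standard Jiang et al. form."""
    v_opt = v1 + v2 * math.tanh(c1 * (gap - lc) - c2)  # optimal velocity function
    return kappa * (v_opt - v) + lam * (v_lead - v)


def sdm_accel(gap, v, v_lead, a_max=1.7, b=3.0, v_des=15.0, d_min=1.5, tau=0.667):
    """Gipps-type safe distance model (strategy c): min of free and safe velocity."""
    # free (acceleration) term of the Gipps model
    v_free = v + 2.5 * a_max * tau * (1 - v / v_des) * math.sqrt(0.025 + v / v_des)
    # safe (braking) term: velocity that still allows stopping behind the leader
    under_sqrt = b ** 2 * tau ** 2 + b * (2 * (gap - d_min) - v * tau + v_lead ** 2 / b)
    v_safe = -b * tau + math.sqrt(max(under_sqrt, 0.0))
    v_next = max(0.0, min(v_free, v_safe))
    return max(-b, min(a_max, (v_next - v) / tau))
```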

The agent aims to maximize reward, so the model's performance relates strongly to the reward function. The reward indicators were provided in Section 3.2, and the reward function of the soft optimization is a weighted sum of them, shown in the following equation: R = w1 r_safety + w2 r_efficiency + w3 r_jerk + w4 r_stability.

The weights w1, w2, and w3 are set as 1, 1, and 1, respectively. The weight w4 is a specific parameter that takes effect at the moment when the objective vehicle considers its follower. In the brutal optimization, i.e., without self-stability, w4 is never triggered. In the soft optimization EMI, when the agent judges that the vehicles are in an acceleration or deceleration process, w4 is triggered and set to 1; otherwise it is set to 0.

In the experiment with the soft optimization MD, the recommended reward function takes the same weighted form, with the stability weight w4 handled as follows.

The trigger for DDPG with self-stable minimizing disturbance is a time gap between the agent and its follower below 3 s. The optimization objective is to minimize disturbance while maximizing the reward function. The weight w4 is tested with different values (−0.1, −0.2, −0.3, −0.4, −0.5); a value of −0.2 performs best on minimizing disturbance when w1, w2, and w3 are fixed to 1. Therefore, w4 is set to −0.2 when it is triggered.
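Combining the indicator sketches above, the overall reward for the three optimization targets could be assembled as below. The weight names and the exact trigger logic follow our reading of the text (EMI: w4 = 1 during an acceleration/deceleration process; MD: w4 = −0.2 applied to the follower's disturbance when its time gap is below 3 s; brutal: the stability term is never triggered) and are assumptions rather than the published formulation.

```python
def total_reward(r_safe, r_eff, r_jerk, r_stab, follower_accel, follower_time_gap,
                 mode="brutal", in_dynamic_process=False):
    """Assemble the reward for the three optimization targets (assumed structure).

    brutal : self-interest only (safety + efficiency + comfort).
    EMI    : adds the self-stability term while the AV is accelerating/decelerating.
    MD     : adds a disturbance penalty on the follower when its time gap < 3 s.
    """
    w1 = w2 = w3 = 1.0
    reward = w1 * r_safe + w2 * r_eff + w3 * r_jerk

    if mode == "EMI" and in_dynamic_process:
        reward += 1.0 * r_stab                # w4 = 1 when triggered
    elif mode == "MD" and follower_time_gap < 3.0:
        reward += -0.2 * abs(follower_accel)  # w4 = -0.2 when triggered
    return reward
```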

4.2. Emergency Braking and Acceleration Boundary

Collision is not an option, because no dangerous state is allowed to happen. According to the previous study [8], an emergency braking system is necessary as a safety guarantee. Its rule depends on the reaction time, whose value changes the action the agent takes. The reaction time includes the state update time, the time to reach the maximum deceleration, and the sensor delay time (assumed to be zero). Since the reaction time is hard to determine exactly, it is set to 0.2 s considering vehicle performance and sensor accuracy.

The RL agent takes actions subject to the emergency braking rule in the following equation, where the key quantity is the bump-to-bump distance between the objective vehicle and the preceding vehicle.
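One plausible form of such a rule, assumed here for illustration, compares the current gap with the distance needed to avoid a collision given the 0.2 s reaction time and the maximum deceleration, and overrides the learned action with full braking when the margin is violated; the threshold expression and names are ours, not the paper's exact equation.

```python
A_MIN = -3.0         # maximum deceleration (m/s^2)
T_REACT_AV = 0.2     # AV reaction time (s)
T_REACT_HDV = 0.667  # HDV reaction time, following Gipps (s)


def braking_distance(v, t_react, decel=abs(A_MIN)):
    """Distance covered during the reaction time plus braking to standstill."""
    return v * t_react + v ** 2 / (2.0 * decel)


def apply_emergency_brake(action, gap, v_obj, v_pre, t_react=T_REACT_AV):
    """Override the learned action with full braking if the gap is not sufficient
    for the objective vehicle to stop behind the (also braking) preceding vehicle."""
    needed = braking_distance(v_obj, t_react) - v_pre ** 2 / (2.0 * abs(A_MIN))
    if gap <= max(needed, 0.0):
        return A_MIN
    return action
```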

The emergency braking system is also equipped on the following vehicle. The emergency braking system for HDVs is similar to that of the AV, but the reaction time is changed to 0.667 s, as proposed by Gipps [2], to account for the sensitivity difference between humans and machines.

To keep the actions in the RL environment uniform, the following vehicle's action boundary needs to follow the agent's action boundary. The formula for the IDM follower is shown as follows.

4.3. Details of the DDPG Model and Training

DDPG is made up of four neural networks, two for the actor and two for the critic. The actor network and the actor target network share the same structure, as do the critic network and the critic target network. The learning rate of the actor network is 0.0001, the learning rate of the critic network is 0.0001, the reward discount factor is 0.9, the soft replacement coefficient is 0.01, the memory capacity is 20,000, and the batch size is 256.

Considering that more hidden layers cost more computation while bringing little improvement in learning, a single hidden layer with 30 neurons is used, i.e., the simplest ANN structure.

The structures of the actor network and the critic network are shown in Figures 5 and 6. The target network of the actor shares the same structure as the actor network, and the target network of the critic shares the same structure as the critic network; they share structures but have different parameters.

Activation functions are applied after the hidden layer and the output layer: a ReLU function after the hidden layer and a tanh function after the output layer.
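The description above (one hidden layer of 30 neurons, ReLU after the hidden layer, tanh after the actor output) corresponds to networks like the following PyTorch sketch. Scaling the tanh output by the 3 m/s2 action bound and the critic's concatenated state-action input layout are assumptions, since Figures 5 and 6 are not reproduced here.

```python
import torch
import torch.nn as nn

A_BOUND = 3.0  # acceleration bound (m/s^2)


class Actor(nn.Module):
    """Maps the state (gap, velocity, relative velocity) to an acceleration."""

    def __init__(self, state_dim=3, hidden=30):
        super().__init__()
        self.hidden = nn.Linear(state_dim, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, state):
        x = torch.relu(self.hidden(state))
        return A_BOUND * torch.tanh(self.out(x))  # scale tanh output to action bound


class Critic(nn.Module):
    """Estimates Q(s, a) from the concatenated state and action."""

    def __init__(self, state_dim=3, hidden=30):
        super().__init__()
        self.hidden = nn.Linear(state_dim + 1, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, state, action):
        x = torch.relu(self.hidden(torch.cat([state, action], dim=-1)))
        return self.out(x)
```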

In total, 1027 car following events are extracted from the NGSIM dataset; 600 are used for training and 427 for testing. For each training episode, one car following event is randomly selected from the 600 training events, and training runs for 3000 episodes. Each episode stops when the car following event ends or a collision appears. The training and test data do not carry the traditional meaning used in machine learning: the training data are treated as part of a complex game for the agent to play, and we test the learning result by constructing another game from the test data, which only generates trajectories for the agent to follow. The test data are used only once with each reward function. Hyperparameters are not tuned, in order to stay comparable with predecessors' work; the same approach can be seen in previous studies [18, 37].

The training results of the different reward functions are shown in Figure 7.

The soft optimization EMI is easier for the agent to learn than the others, while the soft optimization MD also shows good learning performance. However, both EMI and MD are built on top of the brutal optimization without self-stability, and both obtain less reward than the brutal optimization.

5. Result and Analysis

In this section, the original NGSIM driving data, the driving data of DDPG without the self-stability constraint (i.e., brutal optimization), and the driving data of DDPG with the self-stability constraint (i.e., soft optimization) are contrasted. First, the followers of the objective vehicle are ignored and the contrast of safety, efficiency, and comfort is presented; then, the analysis of HDV followers is conducted. It should be noted that no collision occurs in any experiment with the emergency braking system for AVs or HDVs.

5.1. Self-Interest

Self-interest denotes the view of autonomous vehicle passengers toward the operation of the autonomous vehicle. It is a generalized expression, because keeping the autonomous vehicle safe also keeps the whole traffic flow safe. How willing AV passengers are to tolerate jerk and lower efficiency is unknown; hence, the loss to AV passengers is treated in two parts. Part one tries to minimize the loss in AV passenger jerk and efficiency, and part two explores how much the disturbance can be decreased.

5.1.1. Safety

TTC is an important indicator for evaluating driving safety. The TTC of the original NGSIM test data, the DDPG test data without the self-stability constraint, and the DDPG test data with the self-stability constraint are given in Figure 8. Because a TTC below 0 indicates that the objective vehicle is not closing in on its leader and is therefore safe, and a TTC above 50 s is large enough to be safe, only the interval between 0 and 50 s is shown.

There is little difference in TTC between DDPG without self-stability and DDPG with self-stability; however, DDPG with self-stability is the safest among these data. Considering the following vehicle therefore does not harm driving safety.

5.1.2. Efficiency

The time gap is used to evaluate traffic efficiency. The distributions of the original NGSIM driving data, DDPG without the self-stability constraint, and DDPG with the self-stability constraint are shown in Figure 9.

The average time headway for the NGSIM test data, DDPG without the self-stability constraint, DDPG with the self-stability constraint and enhancing motion intention for the AV passenger (EMI), and DDPG with the self-stability constraint and minimizing the disturbance generated by the AV (MD) is 2.111 s, 1.219 s, 1.365 s, and 1.582 s, respectively.

5.1.3. Jerk

Jerk reflects the comfort of drivers. Considering that positive and negative jerk affect the passengers of the objective vehicle equally badly, the absolute value of jerk is used to express passenger comfort. The distributions of the absolute jerk for the original NGSIM driving data, DDPG without the self-stability constraint, and DDPG with the self-stability constraint are shown in Figure 10.

The average absolute jerk for the original NGSIM test data, DDPG without the self-stability constraint, DDPG with the self-stability constraint and EMI, and DDPG with the self-stability constraint and MD is 1.721 m/s3, 0.947 m/s3, 1.090 m/s3, and 0.741 m/s3, respectively. It appears that selfish driving offers better comfort than EMI.

5.2. Contribution to HDV Followers

The disturbance of a vehicle is expressed by the absolute value of its acceleration, and the mean absolute accelerations under the different HDV driving strategies are shown in Figure 11. Considering the following vehicle makes the actor network face more situations, which may lead the agent to different decisions. AVs with self-stability that minimize the loss of the AV passengers' self-interest (EMI) decrease the absolute acceleration of HDV followers with different driving strategies by 2.362% (strategy a), 8.184% (strategy b), and 13.904% (strategy c), respectively. AVs with self-stability that minimize the disturbance generated by the AV (MD) decrease it by 14.961% (strategy a), 12.020% (strategy b), and 13.425% (strategy c), respectively.

5.3. Analysis

It can be seen in Figure 12 that it is hard for the agent to achieve both less jerk and less disturbance caused by acceleration. In the experiments, we compromise part of the jerk and efficiency to generate less disturbance, which is why the agent performs worse on self-interest than the agent under brutal optimization (i.e., without self-stability). We do not compromise the safety of AV passengers and road users. Lower efficiency leads to a larger gap between the HDV preceding vehicle and the AV objective vehicle, which means safer driving, and it gives the AV more room to choose between being more efficient and generating less disturbance when the HDV preceding vehicle produces a disturbance.

The difference of our models is to sacrifice some of the AV passengers' experience to benefit the mini-platoon, and it has been proven that this sacrifice does help alleviate the disturbance generated by AVs. The experiments are divided into two parts. Part one explores minimizing the loss of self-interest for AV passengers while enhancing motion intention, so that HDV followers with different driving strategies generate less disturbance, as shown in Figure 13. Part two focuses on minimizing the disturbance generated by the AV.

6. Discussion and Conclusion

To sum up, this study continues predecessors' research and focuses on the effect of the objective vehicle on the vehicle behind it during a car following event. The article addresses the question: will autonomous vehicles improve microscopic traffic flow even though they are designed to be selfish under a low MPR? An extreme situation is proposed in which the autonomous vehicle is surrounded by HDVs, and the heterogeneity of drivers' strategies is also considered; the IDM, FVDM, and SDM are used in the experiment, respectively.

A car following event is redefined to balance the two kinds of predecessors' research. An AV car following model is proposed to improve the mini-platoon under a low AV MPR, and three optimization targets are used to guide the DDPG agent. The model is designed for the AV to coexist harmoniously with HDVs, making HDV followers generate less disturbance even though different drivers adopt different strategies.

As a result, the RL model with self-stability, which considers the deceleration process and the equilibrium state, decreases the average absolute acceleration of HDV followers and makes them more stable, even though the followers adopt different strategies; the same holds for the minimizing-disturbance optimization with self-stability. The soft optimization with enhancing motion intention indeed shows the enhanced motion intention feature, while the soft optimization with minimizing disturbance causes a longer time gap but less jerk. HDV followers with different strategies also enjoy less jerk when following the AV leader. These results are achieved at only a small loss in time gap and absolute jerk for the AV passenger, and even considering this loss, AV car following still performs better than HDV car following on the whole.

This paper also has limitations. HDV drivers in reality might not execute accelerations precisely, so noise should be taken into consideration in further study. Although the heterogeneity of driving strategies is discussed, the heterogeneity of drivers within the same strategy is ignored. The dataset used in this paper is extracted from a congested state, which means the car following model may not suit other situations; a dataset covering more situations will be considered to improve the generalization of the proposed model, and an RNN will be considered to train on the time-series data in the future. Enhancing motion intention uses limited input under a low MPR, so its results might not be obvious. As roadside units (RSUs) and inter-vehicle communication improve, AVs will be more capable of perceiving the traffic state, and the results might become better.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Natural Science Fund for Colleges and Universities in Jiangsu Province under Grant 20KJB580016, General Program of Philosophy and Social Science Research in Jiangsu Universities under Grant 2020SJA0133, MOE of PRC Industry-University Collaborative Education Program (Program No. 202102055014, Kingfar-CES “Human Factors and Economics” Program), and the Science and Technology Innovation Fund for Youth Scientists of Nanjing Forestry University under Grant CX2018011.