Abstract
Traffic state estimation plays a fundamental role in traffic control and management. In the connected vehicles (CVs) environment, more traffic-related data perceived and interacted by CVs can be used to estimate traffic state. However, when there is a low penetration rate of CVs, the data collected from CVs would be inadequate. Meanwhile, the representativeness of the collected data is positively correlated with the penetration rate. This article presents a traffic state estimation method based on a deep learning algorithm under a low and dynamic CVs penetration rate environment. Specifically, we design a K-Nearest Neighbor (KNN) data filling model integrating acceleration data to solve the problem of insufficient data. This method can fuse the time feature of speed by acceleration modification and mine the distribution features of speed by KNN. In addition, to reduce the estimation error caused by penetration rate, we design a Long Short-Term Memory (LSTM) model, which uses penetration rate estimated by Macroscopic Fundamental Diagram (MFD) as one of the input factors. Finally, we use the concept of operational efficiency for reference, dividing traffic state into three categories according to the estimated speed: free flow, optimal flow, and congestion. SUMO is used to simulate traffic cases under different penetration rates to evaluate our scheme. The results suggest that our data filling model can significantly improve filling accuracy under a low penetration rate; there is also a better performance of our estimation model than that of other comparison models in both low and dynamic penetration rates.
1. Introduction
Traffic state estimation is a vitally important part of intelligent transportation management. It can improve transportation efficiency and reduce transportation costs [1]. Transportation control, management, and information services, such as congestion monitoring, ramp control, event detection, and travel time estimation, can be implemented according to state evaluation [2–4]. Traffic state estimation needs to infer real traffic conditions from incomplete and uncertain traffic information, which depends on strong traffic data support [5]. In recent years, data of traffic estimation mainly come from fixed detectors [6], probe vehicles [7, 8], cellular networks [9, 10], and so on. However, these devices have both advantages and disadvantages [11]. For example, fixed detectors are limited by excessive installation and maintenance budgets; floating vehicle data is often limited to urban roads with low proportion; cellular network data is large, but it is difficult to be used due to serious data noise and inaccurate positioning [8]. In addition, these data are difficult to meet the requirements of increasing real-time control.
Nowadays, the emergence of connected vehicles (CVs) provides new ideas and methods for traffic data acquisition and traffic state estimation [11, 12]. One salient feature of CVs is that they can sense their own state and the road operation status (such as speed, position, and acceleration) and can communicate with roadside units or central control units [13, 14]. They are equivalent to mobile sensors that can feed back data in real time on any road with communication facilities. The feasibility and accuracy of using CV data for traffic state estimation have been demonstrated [15]. Traffic state estimation is the estimation of the three macroscopic traffic flow parameters, namely, speed, volume, and density [16]. However, when only CV data are available, only vehicle speeds can be obtained directly and taken to represent the overall speed of the road, while volume and density need the support of additional sensors [11]. Therefore, this article selects speed as the index for estimating traffic state.
However, research shows that the popularization of CVs would take a long time, and research of mixed traffic, especially the low penetration rate of CVs, is still needed [13]. In the low penetration rate case, CVs could not be detected for some time, resulting in no data for a period [17]. This is one of the reasons for the poor estimation accuracy. The model proposed by Bekiaris et al. [18] needs to ensure minimum data requirements with the help of other sensor data in low penetration rate cases. Although this can ensure the accuracy of estimation, it is not suitable to be extended to most roads due to a lack of flexibility and scalability. Without drawing support of other data, scholars mainly use the time smoothing method, interpolation method, and regression method to fill insufficient data [19, 20]. By using the periodicity of data, the K-Nearest Neighbors (KNN) can apply more relevant data to explore potential laws of data [20, 21]. However, the common feature of these methods is to estimate missing data based on the time characteristics of data. The tensor completion method of rank minimization can synthesize relevant information and explore potential features [22], but its calculation is complex and consumes a lot of computer resources [23]. Besides, filling the missing data or modifying abnormal data caused by machine failure is the main purpose of the current data filling methods, and filling vacant data caused by low penetration rate has not received much attention. The CVs provide more diverse and timely data that can help find a more reliable filling method.
Based on the filled data, traffic state can be estimated. In the literature, state estimation methods can be summarized as dynamic model-based methods and data-driven methods [24, 25]. The classical dynamic models include the Macroscopic Fundamental Diagram (MFD) and the Cell Transmission Model (CTM). Wang et al. [26] took MFD as a time update equation and used the Kalman filter to estimate highway traffic state. Fountoulakis et al. [27] and Bekiaris et al. [15] applied the Kalman filter combined with MFD to estimate traffic state in mixed traffic scenarios. CTM can simulate traffic flow phenomena such as queuing and dissipation well [28]. Tampere et al. [29] proposed a traffic state estimation model based on CTM. However, dynamic model-based methods are difficult to apply in all scenarios and conditions, while data-driven methods are much more flexible [30]. In recent years, data-driven methods have made great progress, and neural networks in particular, such as the backpropagation (BP) neural network, Artificial Neural Network (ANN), and Recurrent Neural Network (RNN), have achieved great success in traffic state estimation [30]. However, traditional neural networks suffer from vanishing gradients and difficulty with long-term dependence [31, 32]. Long Short-Term Memory (LSTM) is a special type of RNN that addresses these problems through gate controls [33]. Vinayakumar et al. [34] designed traffic prediction experiments on the RNN method family, and the results showed that LSTM performed well compared with other RNN and classical methods. Cui et al. [35] used stacked bidirectional and unidirectional LSTM to predict the network-wide traffic state with missing values. Du et al. [36] used LSTM to predict traffic state in the connected vehicle environment. Moreover, Khan et al. [1] integrated connected vehicle technology with deep learning to evaluate the traffic state, and the result showed that when the penetration rate exceeded 20%, the accuracy reached 85%.
Deep learning models are widely used in traffic state estimation and prediction and have achieved good results [25]. However, the model based on a single data lacks a comprehensive description of the traffic network. In addition to the features directly related to the estimated object, some external factors would also affect the accuracy of the estimated model. Therefore, current studies draw more attention to the fusion estimation of multiple data by adding additional explanatory variables [37]. Loop detector data and floating car data for state estimation are a common combination method [38]. Antoniou et al. [30] used a data-driven method, fusing traffic and other data to estimate traffic state. In this article, we would like to focus on the LSTM model that uses multiple-input factors for traffic state estimation, which can flexibly add other factors. In a mixed traffic environment, one of the biggest influencing factors in the overall state estimation by using partial connected vehicle data is the penetration rate of connected vehicles [15, 27]. When the penetration rate is higher, the accuracy of the model is higher, and vice versa. This influence could be put into the deep learning model for learning. Therefore, it is necessary to convert the penetration rate into an input factor in the estimation model.
In summary, we expand the application of connected vehicle data and propose an ACC-P-LSTM model to improve the accuracy of traffic state estimation under low and dynamic penetration rates in mixed traffic environments. Firstly, to improve the accuracy of data filling at a low penetration rate, we propose a KNN filling model integrating vehicle acceleration. This model can effectively combine the dynamic characteristics of vehicles and extract the temporal and spatial distribution of speed. Secondly, for environments with a dynamic penetration rate, we design a penetration rate estimation method based on MFD and take the estimation result as an input factor of the state estimation model. Then, we construct an LSTM model for overall speed estimation, which takes speed and penetration rate as input. Finally, SUMO is used to build a mixed traffic environment for obtaining CV data. In our experiment, we compare the estimation accuracy in cases with penetration rates from 0.01 to 1.0. We also set up cases with irregular penetration rates to verify the sensitivity of the model to the penetration rate.
The rest of this article is structured as follows: in the second section, the speed filling model and LSTM state estimation model are explained. In the third section, the simulation environment is built, and CVs data are analyzed. In the fourth section, under the different CVs penetration rate, our model is compared with the benchmark models without our scheme and other existing advanced algorithms. The fifth section summarizes the full article.
2. Methods
This section explains in detail the proposed ACC-P-LSTM model including the data filling model integrating acceleration, the method of estimating penetration rate, and the LSTM model based on the first two. These three parts would be introduced, respectively, below, while the whole model is introduced first to show a logical relationship between them.
2.1. Overall Framework
The structure of the ACC-P-LSTM model is shown in Figure 1. Firstly, the original speed, acceleration, and number of CVs are acquired and preprocessed. In this model, we estimate the spatial average speed over one minute, while CV data are collected through the communication equipment every 10 seconds. The data of the multiple CVs collected at each time are averaged to give the average vehicle state. However, because the collected speed data are incomplete and poorly representative, they need to be filled and corrected. The KNN model is used to fill the vacant speed, with acceleration fused in as an influencing parameter. The filled speed data are input into the LSTM model and also into the MFD to estimate traffic volume. Then, the penetration rate of CVs is calculated from the count of CVs and the estimated traffic volume. The estimated penetration rate is fed into the LSTM model together with the filled speed to estimate traffic state. In short, the LSTM model estimates the overall traffic speed from the six instantaneous average speeds and also learns the impact of changes in penetration rate on the accuracy of the model.

2.2. KNN Data Filling Model Integrating Acceleration
To deal with insufficient speed data, this article designs a KNN data filling model integrating acceleration, called ACC-KNN. Acceleration significantly affects the speed at the next moment. Although the acceleration of a single vehicle is a microscopic parameter, the average acceleration of multiple vehicles at the same time can characterize the macroscopic traffic flow; that is, the average acceleration serves as an indicator of the macroscopic trend. On the one hand, when the speed at the next moment is missing, it can be estimated from the acceleration. On the other hand, when a speed deviates from the surrounding speeds (for example, because at a low penetration rate a few vehicles represent the whole), the original speed can be corrected by acceleration. In this study, the speed of the next state is estimated based on dynamic formula (1):
$$\hat{v}_{i+1} = v_i + a_i \Delta t, \tag{1}$$
where $v_i$ is the speed of state $i$ ($i = 1, 2, \ldots, n$; $n$ is the number of collected speed data), $a_i$ is the acceleration of state $i$, $\hat{v}_{i+1}$ is the estimated speed of state $i+1$, and $\Delta t$ is the acceleration duration, taken as 1 s in this article. Through formula (1), two sets of speed data are obtained: the original speed and the estimated speed. We do not discard the original speed but combine it with the estimated speed to reduce the influence of speed deviation.
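The acceleration-based estimate in formula (1) is a one-step kinematic update. The following Python sketch illustrates it under stated assumptions: missing samples are represented by `None`, the function name `estimate_next_speeds` is hypothetical, and clamping negative results to zero is an added safeguard that the article does not mention.

```python
from typing import List, Optional, Sequence


def estimate_next_speeds(
    speeds: Sequence[Optional[float]],
    accels: Sequence[Optional[float]],
    dt: float = 1.0,
) -> List[Optional[float]]:
    """Estimate the speed of state i+1 as speeds[i] + accels[i] * dt.

    The first state has no predecessor, so its estimate is None.
    A missing speed or acceleration at state i yields None at state i+1.
    """
    estimated: List[Optional[float]] = [None]
    for v, a in zip(speeds[:-1], accels[:-1]):
        if v is None or a is None:
            estimated.append(None)
        else:
            # Assumption: a speed cannot be negative, so clamp at zero.
            estimated.append(max(0.0, v + a * dt))
    return estimated


if __name__ == "__main__":
    # The speed at state 1 is missing but can be estimated from state 0.
    print(estimate_next_speeds([20.0, None, 18.0], [0.5, None, -1.0]))
```

The result is a second speed series aligned with the original one, so both can later be filled and combined.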
In the next step, we use the KNN model to fill the original speed and the estimated speed separately. The KNN filling model uses correlations in different dimensions to fill missing values and correct outliers in the data. For example, if the measured value $v_t$ at a certain time is not obtained, several measured values $v_j$ around it can still be obtained. We can use these existing speeds to estimate the target speed $v_t$:
$$v_t = \sum_{j=1}^{k} w_j v_j, \tag{2}$$
where $w_j$ is the weight of state $j$. The weight is inversely related to the distance between the adjacent point and the target point.
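Through formula (2), we get two filling speeds: $v^{o}$, filled from the original speed, and $v^{e}$, filled from the estimated speed. The modified filling speed $v^{f}$ is obtained by a weighted sum of the two filling speeds:
$$v^{f} = \alpha v^{o} + \beta v^{e},$$
where $\alpha$ is the weight of the original speed and $\beta$ is the weight of the estimated speed; each is taken as 0.5 in this article.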
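The filling and blending steps can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation. It assumes that the neighbor distance is the gap in time index, that weights are inverse distances normalized to sum to one, and that only missing values (`None`) are filled, so outlier correction is left out. The names `knn_fill` and `blend` are hypothetical.

```python
from typing import List, Optional, Sequence


def knn_fill(series: Sequence[Optional[float]], k: int = 4) -> List[Optional[float]]:
    """Fill None entries with an inverse-distance weighted mean of the
    k nearest observed samples, where distance is the time-index gap."""
    observed = [(i, v) for i, v in enumerate(series) if v is not None]
    filled: List[Optional[float]] = []
    for t, v in enumerate(series):
        if v is not None or not observed:
            filled.append(v)  # keep observed values; nothing to fill from
            continue
        nearest = sorted(observed, key=lambda p: abs(p[0] - t))[:k]
        weights = [1.0 / abs(i - t) for i, _ in nearest]
        total = sum(weights)
        filled.append(sum(w * u for w, (_, u) in zip(weights, nearest)) / total)
    return filled


def blend(orig_filled: Sequence[float], est_filled: Sequence[float],
          alpha: float = 0.5) -> List[float]:
    """Weighted sum of the two filled series (weights alpha and 1 - alpha)."""
    return [alpha * o + (1.0 - alpha) * e for o, e in zip(orig_filled, est_filled)]


if __name__ == "__main__":
    original = knn_fill([10.0, None, 20.0], k=2)
    estimated = knn_fill([None, 11.0, 19.0], k=2)
    print(original, estimated, blend(original, estimated))
```

With equal weights of 0.5, the blended value is simply the midpoint between the two filled series at each time step.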
2.3. Estimating Penetration Rate of CVs
The penetration rate describes the relationship of a part to the whole. Estimating it requires knowing both local and global features. Local features can be obtained by counting the number of CVs, while global features can be obtained from the Macroscopic Fundamental Diagram (MFD). MFD is an inherent attribute of roads [16]. It reflects the relationship between the three parameters of traffic flow: speed, volume, and density. Therefore, when one parameter is known, the other two can be calculated through the formula of the fundamental diagram. The widely accepted Greenshields model [39] is used in this article:
$$q_i = k_{jam}\left(\bar{v}_i - \frac{\bar{v}_i^{2}}{v_f}\right),$$
where $\bar{v}_i$ is the spatial average speed of state $i$ ($i = 1, 2, \ldots, m$; $m$ is the number of traffic state data), $q_i$ is the volume of state $i$, $k_{jam}$ is the blocking density, and $v_f$ is the free flow speed. $k_{jam}$ is the traffic flow density when the vehicle speed is close to zero, which can be calculated from the vehicle length and the shortest stopping distance. $v_f$ refers to the traffic flow speed not affected by upstream and downstream conditions, generally 110% of the speed limit. Therefore, the MFD of a road section for a given period can be obtained from the geometric characteristics of the road, the speed limit, and the dynamic performance of vehicles.
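The speed-to-volume step can be illustrated with a short Python sketch of the Greenshields relation, in which density falls linearly with speed and volume is density times speed. The function name `greenshields_volume` and the example parameter values are assumptions for illustration only.

```python
def greenshields_volume(v: float, v_free: float, k_jam: float) -> float:
    """Volume implied by the Greenshields model.

    Density is k = k_jam * (1 - v / v_free), so volume is
    q = k * v = k_jam * (v - v**2 / v_free).
    With v in m/s and k_jam in veh/m, q is in veh/s.
    """
    if not 0.0 <= v <= v_free:
        raise ValueError("speed must lie between 0 and the free flow speed")
    return k_jam * (v - v * v / v_free)


if __name__ == "__main__":
    # Hypothetical parameters: free flow speed 30 m/s, jam density 0.1 veh/m.
    for speed in (0.0, 15.0, 30.0):
        print(speed, greenshields_volume(speed, v_free=30.0, k_jam=0.1))
```

The volume is zero at both ends of the speed range and peaks at half the free flow speed, which is why a single speed value is enough to read a volume off the diagram once the two parameters are fixed.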
The speed collected in reality is location speed, which cannot be directly used for speed input in MFD. According to the conversion formula between space average speed and time average speed [40], space average speed of 1 min is calculated by using the time speed collected for six times, where .
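A small Python sketch can illustrate the conversion from sampled speeds to a space average speed. It assumes the harmonic mean form, which is a standard relation between spot speeds and space mean speed; whether this is the exact formula taken from [40] cannot be recovered from the text. The function name `space_mean_speed` is hypothetical.

```python
from typing import Optional, Sequence


def space_mean_speed(spot_speeds: Sequence[Optional[float]]) -> float:
    """Harmonic mean of the available spot speeds.

    Missing samples (None) are skipped. If any sample is non-positive,
    the harmonic mean degenerates, so 0.0 is returned (an assumption).
    """
    values = [v for v in spot_speeds if v is not None]
    if not values:
        raise ValueError("no speed samples available")
    if any(v <= 0.0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


if __name__ == "__main__":
    # Six hypothetical samples taken every 10 s within one minute.
    print(space_mean_speed([20.0, 22.0, 25.0, 24.0, 21.0, 23.0]))
```

The harmonic mean is never larger than the arithmetic mean, which reflects the fact that slower vehicles occupy a road section for longer.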
Finally, the number of CVs obtained by equipment counting is converted into the CV traffic flow $q_i^{cv}$. Comparing it with the traffic volume $q_i$ gives the penetration rate of CVs:
$$p_i = \frac{q_i^{cv}}{q_i}.$$
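The ratio can be sketched in Python as below. The article does not say how the CV count is converted into a flow, so dividing the count by the length of the counting interval is an assumption, as is clamping the result to the range from 0 to 1. The function name `penetration_rate` is hypothetical.

```python
def penetration_rate(cv_count: int, interval_s: float,
                     est_volume_veh_per_s: float) -> float:
    """Estimated penetration rate as CV flow divided by total volume.

    cv_count: number of CVs counted during the interval.
    interval_s: length of the counting interval in seconds.
    est_volume_veh_per_s: total volume estimated from the MFD, in veh/s.
    """
    if est_volume_veh_per_s <= 0.0:
        return 0.0  # no estimated traffic, so no meaningful ratio
    cv_flow = cv_count / interval_s
    # Assumption: estimation noise may push the ratio above 1, so clamp it.
    return min(1.0, cv_flow / est_volume_veh_per_s)


if __name__ == "__main__":
    # 3 CVs in one minute against an estimated 0.5 veh/s (1800 veh/h).
    print(penetration_rate(3, 60.0, 0.5))
```

Because the denominator comes from a fundamental diagram rather than a measurement, the result is a rough estimate whose error the downstream model has to tolerate.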
2.4. LSTM Model
LSTM includes input layers, output layers, and hidden layers. Figure 2 shows the framework and principle of the LSTM model. Our inputs include the six speed values collected every 10 s within 1 min and a penetration rate $p$. The output is the one-minute spatial average speed of the road section. The hidden layers are realized by a carefully designed structure called a ‘gate’ together with neural network layers [41]. Specifically, the ‘forget gate’ determines the information to be discarded. It reads the previous hidden state $h_{t-1}$ and the current input $x_t$ and outputs a forget gate weight $f_t$ between 0 and 1:
$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right),$$
where $\sigma$ is the sigmoid function, $W_f$ is the input weight for the forget gate, and $b_f$ is the bias for the forget gate.

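The second step is to determine the information to be updated through the ‘input gate’ $i_t$. A new candidate value vector $\tilde{C}_t$ is generated by ‘tanh’ and can be written into the new cell state:
$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad \tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right),$$
where $h_{t-1}$ is the previous hidden state, $x_t$ is the current input, $W_i$ is the input weight for the input gate, $b_i$ is the bias for the input gate, $W_C$ is the input weight for the current state, and $b_C$ is the bias for the current state.
The third step is to update the cell state $C_t$ by combining the part of the previous cell state $C_{t-1}$ selected by the ‘forget gate’ $f_t$ with the part of the candidate $\tilde{C}_t$ selected by the ‘input gate’ $i_t$:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t.$$
Finally, a ‘sigmoid gate’ $o_t$ is used to determine which part of the cell state is output. The final output $h_t$ is obtained by multiplying the cell state $C_t$ processed by ‘tanh’ with the output of the ‘sigmoid gate’:
$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t \odot \tanh(C_t),$$
where $W_o$ is the input weight for the output gate, and $b_o$ is the bias for the output gate.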
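The four gate steps can be traced in a single forward step of an LSTM cell. The Python sketch below uses only the standard library and follows the standard LSTM formulation; the parameter names (`Wf`, `bf`, and so on), the dictionary layout, and the tiny hidden size are assumptions for illustration and do not reproduce the trained network described in this article.

```python
import math
from typing import Dict, List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(W: List[List[float]], b: List[float], z: List[float]) -> List[float]:
    """Compute W @ z + b for a weight matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, z)) + bias for row, bias in zip(W, b)]


def lstm_step(x: List[float], h_prev: List[float], c_prev: List[float],
              params: Dict[str, list]) -> Tuple[List[float], List[float]]:
    """One forward step of a standard LSTM cell."""
    z = h_prev + x  # concatenation [h_{t-1}, x_t]
    f = [sigmoid(u) for u in affine(params["Wf"], params["bf"], z)]  # forget gate
    i = [sigmoid(u) for u in affine(params["Wi"], params["bi"], z)]  # input gate
    c_tilde = [math.tanh(u) for u in affine(params["Wc"], params["bc"], z)]
    # New cell state: keep part of the old state, add part of the candidate.
    c = [fj * cp + ij * ct for fj, cp, ij, ct in zip(f, c_prev, i, c_tilde)]
    o = [sigmoid(u) for u in affine(params["Wo"], params["bo"], z)]  # output gate
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


if __name__ == "__main__":
    hidden, inputs = 2, 7  # seven inputs: six speeds plus the penetration rate
    zero_w = [[0.0] * (hidden + inputs) for _ in range(hidden)]
    params = {name: zero_w for name in ("Wf", "Wi", "Wc", "Wo")}
    params.update({name: [0.0] * hidden for name in ("bf", "bi", "bc", "bo")})
    x = [20.0, 21.0, 19.5, 22.0, 20.5, 21.5, 0.05]
    print(lstm_step(x, [0.0, 0.0], [1.0, -2.0], params))
```

With all weights set to zero, every gate outputs 0.5 and the candidate is zero, so the cell state is simply halved, which makes the role of each gate easy to check by hand.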
3. Simulation and Data
Since CVs are not popular on real roads now, the SUMO simulation platform is used to simulate the scene of mixed traffic. SUMO can interact with the outside world in real-time through TraCI interface, sending vehicles data. To approach the real road environment, our simulation has made the following settings:(1)We set up a three-lane highway, in which there are ramps in the upstream and downstream, and our detection section is within 500 meters in the middle. The maximum capacity of one lane is 2000 veh/h, and the speed limit is 120 km/h (33.33 m/s);(2)We assume that the characteristics of CVs are the same as conventional vehicles and only have the function of sensing and transmitting their own information; that is, functions of automatic driving and assisted driving are not considered;(3)We set up a dynamic traffic flow and ramp control strategy to generate different traffic states. The traffic volume changes gradually over time, and all possible traffic states in the road environment are simulated with two peaks;(4)The simulation time in this article is set to 7200 s, of which the first 600 s is warm-up time. Data are collected every 10 s, a total of 1380 groups, including speed, acceleration, and number of CVs.
We detect the real spatial speed and volume through sensors installed in the first place of the road section. The MFD is determined according to the characteristics of the road and the speed limit and size of the vehicle. On this simulated road, the maximum speed is 120 km/h, and the optimal traffic flow is 2000 veh/h. According to the Green Shields model, we can determine the relationship between speed and flow; that is, (unit of : m/s). Finally, the calculated speed-volume relationship is compared with the detected real speed and flow relationship, and the results are basically consistent, which shows the feasibility of the simulation model.
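To illustrate how a speed limit and a capacity value fix the fundamental diagram, the Python sketch below derives the jam density from the maximum flow. Under the Greenshields model the flow peaks at half the free flow speed, which gives a jam density of four times the capacity divided by the free flow speed. Treating 2000 veh/h as a single-lane capacity and 33.33 m/s as the free flow speed are assumptions made for this example, and the function names are hypothetical.

```python
def jam_density_from_capacity(q_max_veh_per_h: float, v_free_mps: float) -> float:
    """Jam density (veh/m) implied by the Greenshields model.

    The flow q = k_jam * (v - v**2 / v_free) peaks at v = v_free / 2,
    where q_max = k_jam * v_free / 4, hence k_jam = 4 * q_max / v_free.
    """
    q_max_veh_per_s = q_max_veh_per_h / 3600.0
    return 4.0 * q_max_veh_per_s / v_free_mps


def flow_veh_per_h(v_mps: float, v_free_mps: float, k_jam: float) -> float:
    """Greenshields flow in veh/h for a speed given in m/s."""
    return 3600.0 * k_jam * (v_mps - v_mps * v_mps / v_free_mps)


if __name__ == "__main__":
    k_jam = jam_density_from_capacity(2000.0, 33.33)
    print("jam density (veh/km):", 1000.0 * k_jam)
    print("flow at half free flow speed:", flow_veh_per_h(33.33 / 2, 33.33, k_jam))
```

A calibrated curve of this kind can then be compared against the simulated speed and flow pairs, as described above.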
Meanwhile, due to the low penetration rate of CVs and traffic volume, it may not be able to collect data at some moments. To analyze the missing data under different penetration rates and different flows, we show the distribution of missing data in Figure 3. The white line indicates vacant data, and the black line indicates data. Clearly, the severity of vacant data would decrease with the increase of traffic density and penetration rate. In the case of low penetration rate, even if there is a large traffic volume, there would still be many vacant data. This situation could be alleviated when the penetration rate reaches 0.03. At the same time, when the penetration rate reaches 0.3, the loss basically disappears, even at low traffic volume.

We use ACC-KNN data filling model to fill the original data. There is not a standard for selecting the best ‘k’ in KNN [42], so we determine it from the characteristics of the data set and the experimental results. From the perspective of the data set, firstly, there should be data in the k neighborhood under low penetration rate, and secondly, the influence interval of speed should be considered. Therefore, we take ‘k’ as 3–6. From the perspective of the experimental results, we set the value of ‘k’ to 3, 4, 5, and 6, respectively, and the mean difference between the filling speed and the real speed is 6.07 m/s, 5.73 m/s, 5.81 m/s, and 6.40 m/s, respectively. Therefore, we take ‘k’ as 4.
Figure 4 shows the real speed (red line), the original data (blue line) collected when the penetration rate is 0.05, and the filling speed (orange line and green line) when KNN filling and ACC-KNN filling are used. The results show that, in the case of 0.05 penetration rate, error with acceleration is reduced by 5.8% compared with that without acceleration. In three cases of dynamic penetration rate, the errors are reduced by 4.0%, 12.7%, and 7.9%, respectively. Clearly, our filling model is more effective than a single KNN filling model.

4. Experiment and Comparison
4.1. Speed Estimation
In this method, the LSTM model needs to learn multiple speed characteristics and penetration rate to estimate the whole speed, which is a multiple-input single-output model. Firstly, according to the spatial average speed and MFD, CVs penetration rate is estimated. Then, six speed data and the penetration rate are input into the LSTM model for estimation.
In our experiment, the first 115 groups of data are used for training, and the last 115 groups are used for testing. According to the results of many experiments, we set the parameters of the LSTM model. There are 3 hidden layers and 2 dense layers in LSTM, and the number of neurons in the hidden layer is 256, 256, and 128. Meanwhile, hyperbolic tangent function (tanh) is used for activation function, mean absolute error (MAE) is used for the loss function, adaptive moment estimation (Adam) optimizer is used for optimization, and the learning rate is set to 0.005.
We conducted traffic state estimation experiments with six models: ACC-P-LSTM (LSTM with acceleration-fused KNN filling and penetration rate estimation), P-LSTM (LSTM with penetration rate estimation), ACC-LSTM (LSTM with acceleration-fused KNN filling), plain LSTM, ACC-KF (Kalman filtering with acceleration-fused KNN filling), and plain Kalman filtering (KF). On the one hand, we set up cases with penetration rates from 0.01 to 1.0 to compare the performance of these models. On the other hand, we set up three groups of unknown and dynamic penetration rate cases to compare their performance. In these experiments, the performance of the traffic state estimation model is measured by three indicators: Root Mean Square Error (RMSE), Mean Relative Error (MRE), and Mean Absolute Error (MAE). Their calculation formulas are as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^{2}}, \qquad \mathrm{MRE} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|y_i - \hat{y}_i\right|}{y_i}, \qquad \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|,$$
where $y_i$ is the real speed, $\hat{y}_i$ is the estimated speed, and $N$ is the number of test samples.
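The three indicators can be computed with a few lines of Python. This sketch uses the standard definitions of RMSE, MRE, and MAE; the function names are hypothetical, and MRE assumes that every real speed is nonzero.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mre(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean relative error; every true value must be nonzero."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    real = [10.0, 20.0]       # hypothetical real speeds in m/s
    estimated = [12.0, 16.0]  # hypothetical estimated speeds in m/s
    print(rmse(real, estimated), mre(real, estimated), mae(real, estimated))
```

RMSE penalizes large deviations more heavily than MAE, while MRE expresses the error relative to the real speed, so the three indicators together give a fuller picture than any one alone.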
Figure 5 shows the RMSE of these models running results for five times and their trend in the 95% confidence range from rate = 0.01 to rate = 1.0. It can be found that the LSTM algorithm is better than the Kalman filter algorithm, especially in the cases of the high penetration rate. ACC-P-LSTM has the best prediction effect, especially in the cases of the low penetration rate. Specifically, when the penetration rate is 0.01, 0.02, 0.03, 0.05, 0.10, 0.15, and 0.20, the error of the ACC-P-LSTM is reduced by 13.5%, 12.6%, 18.3%, 7.8%, 7.6%, 5.6%, and 4.5%, respectively, compared with the single LSTM. When the penetration rate exceeds 0.35, all LSTM models show almost the same effect, but they are still half the error of the KF model. In addition, the effect of using KNN filling with fusion acceleration is greater than that of using penetration rate. This may be caused by the fixed penetration rate in these cases we set. The fixed penetration rate does not give full play to the advantages of the P-LSTM model, but it still improves the accuracy to a limited extent. In short, we believe that this data filling method and penetration rate estimation method proposed can improve the estimation effect under the condition of low CVs penetration rate.

To verify the estimation effect under a dynamic penetration rate, we set up three groups of experiments in which the penetration rate changes with time. In these three cases, the penetration rate varies within a reasonable interval, and the average penetration rate is kept below 30%, because the accuracy of the models differs little above 30%. We again ran ACC-P-LSTM, P-LSTM, ACC-LSTM, LSTM, ACC-KF, and KF and computed their RMSE, MRE, and MAE for comparison, as shown in Table 1. In all three cases, ACC-P-LSTM shows good results, while ACC-LSTM performs similarly to P-LSTM, and both are better than the Kalman filter algorithms. Thus, correcting the data with acceleration and taking the penetration rate as an input achieve similar improvements in accuracy. This result differs from the first experiment with fixed penetration rates, which reflects that the P-LSTM model can perceive changes in the penetration rate well. At the same time, the two improvement measures are complementary, and the model that uses both achieves the best results. In short, ACC-P-LSTM performs well in cases of dynamic penetration rate.
In our model, estimation of the penetration rate is difficult and inaccurate. It is necessary to exclude that the increased accuracy of adding penetration rate is caused by the addition of noise. Therefore, we have added a set of experiments with random penetration rate, that is, replacing the penetration rate with random noise from 0 to 1. We use LSTM, P-LSTM, and LSTM with random penetration rate as an input (R-P-LSTM) for comparison experiments. Three cases are also set up, and the estimation error is shown in Figure 6. In case 1 and case 2, the error of the R-P-LSTM is lower than that of the LSTM, but the error of the P-LSTM decreases more. Meanwhile, the random penetration rate in case 3 is counterproductive. Therefore, we can eliminate the interference of random noise and believe that the addition of the estimated penetration rate could help improve the estimation accuracy. This model can learn the impact of different penetration rates on the accuracy of estimation and can autonomously adjust and weaken this impact.

4.2. Traffic State Judgment
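The final task is to judge the state of traffic flow. Speed is the only quantity that can be obtained directly from CVs alone, and it can be used to judge the traffic state. We refer to the concept of operation efficiency proposed by Xu et al. [6], in which the operation efficiency of a road is related to its volume and speed, as shown in formula (12). Operation efficiency suggests that the best traffic state is one in which the number of vehicles is large and the speed is high at the same time. Therefore, the formula of operation efficiency can be expressed as
$$E = q v = k_{jam}\left(v^{2} - \frac{v^{3}}{v_f}\right), \tag{12}$$
where $q$ is the volume and $v$ is the speed, and the second equality substitutes the Greenshields relation with blocking density $k_{jam}$ and free flow speed $v_f$.
Differentiating both sides with respect to $v$:
$$\frac{dE}{dv} = k_{jam}\left(2v - \frac{3v^{2}}{v_f}\right).$$
Setting $\frac{dE}{dv} = 0$ gives $v = \frac{2}{3}v_f$. Then, taking the derivative again and setting $\frac{d^{2}E}{dv^{2}} = k_{jam}\left(2 - \frac{6v}{v_f}\right) = 0$ gives $v = \frac{1}{3}v_f$.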
Taking $\frac{2}{3}v_f$ and $\frac{1}{3}v_f$ as the dividing lines, where $v_f$ is the free flow speed, we divide the traffic state into free flow, optimal flow, and congestion. Therefore, in this experiment, the free flow state corresponds to speeds in (22.22 m/s, 33.33 m/s), the optimal flow state to speeds in (11.11 m/s, 22.22 m/s), and the congestion state to speeds in (0, 11.11 m/s).
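The three speed intervals translate directly into a simple classifier. The Python sketch below assumes a free flow speed of 33.33 m/s, so that the thresholds fall at 11.11 m/s and 22.22 m/s; how speeds exactly on a threshold are assigned is an assumption, since the article gives open intervals. The function name `classify_state` is hypothetical.

```python
def classify_state(speed_mps: float, v_free_mps: float = 33.33) -> str:
    """Map an estimated speed to one of the three traffic states.

    Thresholds sit at one third and two thirds of the free flow speed.
    Assumption: a speed exactly on a threshold goes to the faster state.
    """
    if speed_mps < v_free_mps / 3.0:
        return "congestion"
    if speed_mps < 2.0 * v_free_mps / 3.0:
        return "optimal flow"
    return "free flow"


if __name__ == "__main__":
    for v in (5.0, 15.0, 30.0):
        print(v, classify_state(v))
```

Applying the same function to both the estimated and the real speed gives two label sequences whose agreement rate is one way to express classification accuracy.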
Three dynamic penetration rate cases are used as validation experiments. The results are shown in Figure 7. In three cases, ACC-P-LSTM has achieved better results than other models. Specifically, the average accuracy of ACC-P-LSTM, ACC-LSTM, P-LSTM, and LSTM in the three cases is 93.91%, 93.33%, 93.04%, and 91.88%. In addition, estimation errors are mostly in optimal flow and congestion. These two states have large random fluctuations and fuzzy boundaries, which are the main reasons for errors. Moreover, due to the randomness and dynamics of CVs, vehicle state performance is difficult to show the whole state, especially at low penetration rates. For example, in the case of heavy traffic, it takes a certain time for aggregation wave to reach the rear of the vehicle platoon. Further, if there are only CVs in the rear, the result would be inaccurate. This is also a reason why the penetration rate of CVs is introduced in this article. In general, our model has achieved good performance under a dynamic penetration rate.

5. Conclusion
CVs environment provides a new way to obtain data, but it is still difficult to estimate the traffic state under low or dynamic penetration rates of CVs. Due to the small amount of acquisition and unstable representativeness, the data transmitted by CVs need to be filled effectively and evaluated. In this article, a KNN data filling model integrating acceleration and an LSTM speed estimation model introducing penetration rate estimation are proposed. Specifically:(1)Acceleration of CVs is used for data filling to fuse the time feature of speed, while KNN is used to mine distribution features. The results show that adding acceleration to the speed filling reduces the error by about 10%.(2)The volume is estimated according to spatial speed based on MFD, inferring the penetration rate of CVs. The penetration rate that is directly related to estimation accuracy is taken as an input factor of the speed estimation model. Further, we construct a multifactor LSTM model that extracts the features of multisource data to estimate space speed. The speed estimation results show that the accuracy of using these strategies is better than that of not using these strategies. Especially when the penetration rate is lower than 30%, the ACC-P-LSTM model has the lowest speed estimation error and maintains the best stability.(3)The speed is used to divide the traffic state, which can directly show traffic characteristics without additional sensor information. The results show that the accuracy of our model (average accuracy 0.933) is better than benchmark models (average accuracy 0.890) in various cases.
These can illustrate that the acceleration is effective for filling vacant data, and the penetration rate can improve the accuracy of estimation. But this article does not further study the correlation between them. The influence of penetration rate can be discussed in detail. At the same time, the performance of this estimation model is different under different traffic states. Especially under the optimal flow and congestion state, the probability of model error is high. The estimation accuracy in heavy traffic flow can be strengthened in the future. Besides, dynamic characteristics of CVs different from conventional vehicles are not considered in our model, which may affect speed estimation. Future research can consider the influence of the dynamic performance of CVs.
Data Availability
The data used to support the findings of this study were generated from SUMO simulation software and could be available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the National Key Research and Development Program of China (2019YFB1600100).