Abstract
This paper uses automobile traffic accident data collected from 2018 to 2020 by the Chinese National Automobile Accident In-Depth Investigation System. It extends the set of features used to predict traffic accident severity. Eight accident features (engine capacity, hour of day, age of vehicle, month of year, day of week, age band of drivers, vehicle maneuver, and speed limit) have been used in previous studies. Four accident features that had not previously been included in an importance ranking were added: accident location, accident form, road information, and collision speed. Random forest was used to rank the importance of the 12 accident features, and 7 important features were finally adopted. By comparing algorithms and optimizing the results, a prediction model of traffic accident severity with higher accuracy was obtained.
1. Introduction
With the rapid development of society, the number of cars has increased dramatically. Traffic accidents have also increased, resulting in huge human and economic losses (Micheale [1]). According to the World Health Organization, road traffic accidents kill more than 1.25 million people each year, and nonfatal accidents affect between 20 and 50 million people (Bahiru et al. [2]). Road traffic accidents have thus become one of the leading causes of death and injury worldwide. How to prevent and how to predict traffic accidents has become a hot topic in traffic science and intelligent vehicle research.
The severity of traffic accidents is an important index of traffic accident harm, and many factors contribute to accidents of different severity. Many algorithms and factors have been examined in traffic accident research. Lu et al. [3] studied the location of a car in road transects, the road safety grade, the road surface condition, the visual condition, the vehicle condition, and the driver state, and established a prediction model with an accuracy of 86.67%. Alkheder et al. [4] predicted the severity of traffic accidents from 16 attributes and four injury degrees (minor, moderate, severe, and death) through artificial neural networks. Akanbi et al. [5] found that old age, overtaking, speeding, religious beliefs, poor braking performance, and bad tires were the main factors causing traffic accidents. Some effects of weather and accident conditions on the characteristics of highway traffic behavior have also been pointed out by Caleffi et al. [6]. An et al. [7] applied a fuzzy convolutional neural network to traffic flow prediction under uncertain traffic accident information and verified its effectiveness using real car trajectories and meteorological data. Multiobjective genetic algorithms have also achieved good results in predicting the severity of traffic accidents according to users’ preferences (Hashmienejad and Hashmienejad [8]). A deep learning method was used to obtain a short-term traffic accident risk prediction model from traffic accidents, traffic flow, weather conditions, and air pollution (Ren et al. [9]). The spatio-temporal correlation of traffic accidents has been exploited in urban traffic accident risk prediction (Ren et al. [10]). The temporal aggregation neural network layer developed by Huang et al. [11] automatically captures correlation scores from the temporal dimension to predict the occurrence of traffic accidents. Kumeda et al. [12] revealed that Lighting Conditions, 1st Road Class & No., and Number of vehicles are the key features in selecting the attributes. Driver behavior was effectively analyzed by Murphey et al. [13] through data mining methods. Bao et al. [14] also proposed an accident prediction model based on uncertainty and spatio-temporal relationship learning. Yaman et al. [15] used fuzzy data mining to analyze factors affecting the degree of injury in traffic accidents, such as age, gender, seatbelt use, and alcohol and drug involvement, and obtained the normalized importance of the independent variables affecting injury. A variety of algorithms have thus been applied to the prevention and prediction of traffic accidents. In recent years, the use of the random forest algorithm in traffic accident data processing has gradually increased.
The random forest algorithm is widely used in many fields, such as medicine (Iwendi et al. [16]), meteorology (Ding et al. [17]), and statistics (Schonlau and Zou [18]). It has also achieved results in traffic accident research. Yan and Shen [19] used random forest and Bayesian optimization to study how influencing factors affect the severity of traffic accidents. Zhao et al. [20] proposed an accident risk prediction algorithm based on a deep convolutional neural network and random forest. Chen and Chen [21] used three prediction performance evaluation indexes, namely, accuracy, sensitivity, and specificity, to identify the best combination of prediction model and input variables, that is, the one with the most positive impact on these three indexes. Koma et al. [22] used random forest to detect cognitive driver distraction by considering the types of eye movements. Wang et al. [23] selected different time periods, road grades, tidal lanes, proximity to infrastructure, and accident sections as indicators affecting traffic; their experimental results show that the method can effectively avoid congested roads and obtain a fast route. Zhang et al. [24] introduced the generalized random forest (GRF) to estimate heterogeneous treatment effects in road safety analysis, in order to provide local authorities and policymakers with more comprehensive information and improve the performance of speed camera projects. GRF can thus be a promising method for revealing the heterogeneity of treatment effects in road safety analysis.
Traffic accidents are usually caused by the combined influence of people, vehicles, roads, and the environment. Given the strong performance of the random forest algorithm, this paper uses it to rank the influence of traffic accident factors on accident severity. The low-impact factors are then removed, and a random forest is trained on the remaining factors. Finally, the model is used to predict the severity of traffic accidents. The prediction results show that the accident factors proposed in this paper perform well in predicting the severity of traffic accidents.
2. Materials and Methods
2.1. The Data Source
The data in this paper are from 2800 automobile traffic accidents collected by the Chinese National Automobile Accident In-Depth Investigation System (NAIS) from 2018 to 2020. NAIS draws on the traffic accident databases of NHTSA in the United States and GIDAS in Germany and incorporates the characteristics of traffic accidents in China. It was jointly established by the Defective Product Management Center of AQSIQ of China, together with a number of university vehicle accident research institutions and judicial appraisal institutions. The accident data include the basic variables of traffic accidents, such as time, location, age, gender, and degree of injury. The data collection area is distributed across the Northeast, North, East, South, Southwest, and Central regions and covers plains, hills, mountains, plateaus, and other terrain.
In terms of traffic accident characteristics, Gan et al. [25] selected 8 traffic accident features by the random forest algorithm to predict the severity of traffic accidents: engine capacity, hour of day, age of vehicle, month of year, day of week, age band of drivers, vehicle maneuver, and speed limit. Šliupas and Bazaras [26] emphasize the importance of road information and the road environment for traffic accidents. In addition, the number of deaths related to speeding has been increasing over the years. Therefore, this paper added four accident features that had not been included in the importance ranking: accident location, accident form, road information, and collision speed. In total, 13 variables are considered in this paper: the severity of the accident, which is the target variable, and 12 accident features, namely, engine capacity, hour of day, age of vehicle, month of year, day of week, age band of drivers, vehicle maneuver, speed limit, accident location, accident form, road information, and collision speed. The vehicle speed is determined from PC-CRASH reconstruction, video recordings, EDR vehicle data records, and the statements of drivers and witnesses.
2.2. Random Forest
Data mining technology includes association, classification, clustering, prediction, sequential pattern mining, and so on [27]. In this paper, a random forest algorithm is used to predict the severity of traffic accidents. Random forest is a combined classifier algorithm proposed by Breiman [28] that contains multiple decision trees. It has clear advantages in processing multidimensional data and is widely regarded as a strong classification algorithm. The accident data are preprocessed before prediction. The importance of the 12 traffic accident variables is then obtained by the random forest algorithm. Finally, the traffic accident factors with high importance are selected to predict the severity of traffic accidents.
The random forest algorithm combines multiple decision trees. Each tree is trained on a randomly selected dataset, with a randomly selected subset of features as input. Figure 1 shows the specific flow of the random forest algorithm; in a classification problem, the class predicted by the majority of the trees is taken as the final result.

The random forest is an ensemble learning method and adopts the idea of bagging. Bagging proceeds as follows:
(1) Each time, n training samples are drawn with replacement from the training set to form a new training set.
(2) Each new training set is used to train one submodel, giving M submodels in total.
(3) For a classification problem, the submodels vote, and the category with the most votes is the final category.
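The three bagging steps can be illustrated with a minimal Python sketch. The function names and the trivial submodel, which always predicts the most frequent label in its own bootstrap sample, are assumptions made for illustration only; in a random forest the submodels are decision trees, and the toy data are made up.

```python
import random
from collections import Counter


def bootstrap_sample(samples, n, rng):
    """Step 1: draw n training samples with replacement."""
    return [rng.choice(samples) for _ in range(n)]


def fit_majority_label(sample):
    """Illustrative submodel: always predicts the most common label it saw."""
    label = Counter(lbl for _, lbl in sample).most_common(1)[0][0]
    return lambda features: label


def bagging_fit(samples, n_models, rng):
    """Step 2: train one submodel per bootstrap sample."""
    return [
        fit_majority_label(bootstrap_sample(samples, len(samples), rng))
        for _ in range(n_models)
    ]


def bagging_predict(models, features):
    """Step 3: every submodel votes; the category with the most votes wins."""
    votes = Counter(model(features) for model in models)
    return votes.most_common(1)[0][0]


# Toy data: (features, severity category). Values are invented.
toy = [((60,), 3)] * 7 + [((30,), 1)] * 3
models = bagging_fit(toy, n_models=25, rng=random.Random(0))
prediction = bagging_predict(models, (50,))
```

Because each submodel sees a different bootstrap sample, individual submodels can disagree, and the vote smooths out those differences.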
A random forest takes the decision tree as its basic unit and is formed by integrating a large number of decision trees. Its construction process is as follows:
Step 1: the training set T has N samples in total, and N samples are randomly drawn from it with replacement. The selected N samples are used to train a decision tree, as the samples at the root node of that tree.
Step 2: each sample has M attributes. When a node of the decision tree needs to be split, m attributes are randomly selected from the M attributes, with m << M. Then, some strategy (such as information gain) is adopted to select one of the m attributes as the split attribute of the node.
Step 3: during the formation of the decision tree, each node is split according to Step 2 until it can no longer be split. Note that there is no pruning in the entire decision tree formation process.
Step 4: Steps 1 through 3 are repeated to create a large number of decision trees, which form the random forest.
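Step 2 can be sketched in Python as follows. This is a minimal illustration that assumes categorical attributes stored in dictionaries and uses information gain as the split strategy; the function names, attribute names, and toy values are hypothetical and do not come from the NAIS data.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one categorical attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def choose_split_attribute(rows, labels, m, rng):
    """Sample m of the M attributes, then keep the best of those m."""
    all_attributes = sorted(rows[0])            # the M available attributes
    candidates = rng.sample(all_attributes, m)  # m << M in practice
    return max(candidates, key=lambda a: information_gain(rows, labels, a))


# Toy node with M = 2 attributes; m equals M here only so that the
# example is deterministic.
rows = [
    {"accident_form": 1, "day_of_week": 1},
    {"accident_form": 1, "day_of_week": 2},
    {"accident_form": 4, "day_of_week": 1},
    {"accident_form": 4, "day_of_week": 2},
]
labels = [3, 3, 1, 1]
best = choose_split_attribute(rows, labels, m=2, rng=random.Random(0))
```

Restricting each split to a random subset of attributes is what decorrelates the trees, so that their votes are not dominated by the same few strong attributes.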
The bootstrap method is used for random forest classification. First, k bootstrap samples are drawn from the original training set of N samples. Second, a decision tree model is built for each of the k samples. Finally, the k trees vote, and the final classification result is selected by majority rule. The classification decision is as follows:
$$H(x) = \arg\max_{Y} \sum_{i=1}^{k} I\left(h_i(x) = Y\right),$$
where $H(x)$ is the combined classification model; $h_i$ is a single decision tree classification model; $Y$ is the output variable (target variable); and $I(\cdot)$ is the indicator function.
Another strength of the random forest algorithm is that it makes it easy to measure the relative importance of each feature to the prediction. The importance of a feature is measured by how much the tree nodes that split on it reduce impurity, taken over all trees in the forest. This score is calculated automatically for each feature after training, and the results are scaled so that the importances sum to one. This meets our need to rank the importance of traffic accident factors.
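The impurity-based importance described above can be sketched as follows. The sketch assumes Gini impurity and a simple pooled sum over all split nodes; the paper does not state which impurity measure or aggregation was used, and the feature names and label lists below are invented.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def split_gain(parent, left, right, n_total):
    """Weighted impurity decrease contributed by one split node."""
    n = len(parent)
    children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return n / n_total * (gini(parent) - children)


def normalized_importance(splits, n_total):
    """Sum the decrease per feature over all nodes, then scale to sum to one."""
    raw = defaultdict(float)
    for feature, parent, left, right in splits:
        raw[feature] += split_gain(parent, left, right, n_total)
    total = sum(raw.values())
    return {feature: value / total for feature, value in raw.items()}


# Each entry: (feature used at the node, labels at the node, left, right).
# The nodes may come from any tree; n_total is the training size per tree.
splits = [
    ("feature_a", [1, 1, 3, 3], [1, 1], [3, 3]),
    ("feature_b", [1, 1, 1, 3], [1, 1], [1, 3]),
]
importance = normalized_importance(splits, n_total=4)
```

In this toy case feature_a separates the classes perfectly at its node, so it receives the larger share of the normalized importance.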
3. Results and Discussion
3.1. Preprocessing
A simple statistical analysis was first performed on the whole data set. We found that some accident cases had incomplete feature data and that some contained obviously wrong feature information. Incomplete cases were deleted directly; for example, the speed limit was missing in 0.7% of the cases. Cases with obviously abnormal values were also removed directly. For example, drivers in China are only allowed to drive if they are 18 years old or older, so the 0.6% of cases in which the recorded driver age was under 18 were deleted.
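The direct-deletion rule can be sketched as a small filter. The record layout and the field names `speed_limit` and `driver_age` are hypothetical, since the NAIS schema is not given in the paper.

```python
def clean_records(records, required_fields, min_driver_age=18):
    """Drop records with missing required fields or an under-age driver."""
    kept = []
    for record in records:
        # Incomplete record: direct deletion.
        if any(record.get(field) is None for field in required_fields):
            continue
        # Legally impossible driver age: treated as a recording error.
        age = record.get("driver_age")
        if age is not None and age < min_driver_age:
            continue
        kept.append(record)
    return kept


# Toy records with invented values.
records = [
    {"speed_limit": 60, "driver_age": 35},
    {"speed_limit": None, "driver_age": 40},  # missing speed limit
    {"speed_limit": 80, "driver_age": 16},    # driver under 18
]
cleaned = clean_records(records, ["speed_limit", "driver_age"])
```

Deletion is the simplest option when the affected share is small, as it is here; imputation would be an alternative if many more records were incomplete.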
After data processing, there are 2,800 groups of traffic accident data. The characteristics of some traffic accidents are classified as follows:
(1) Accident location is as follows. 1: ground nonexpressway, 2: ground expressway, 3: elevated nonexpressway, 4: elevated expressway, and 5: other roads
(2) Road information is as follows. 1: three-branch bifurcation, 2: four-branch bifurcation, 3: multi-branch bifurcation, 4: roundabout crossing, 5: ramp, 6: ordinary road, 7: elevated road, 8: narrow road, 9: narrow road, 10: bridge, 11: tunnel, 12: entry and exit of road, 13: dangerous road, and 14: other special road
(3) Accident form is as follows. 1: collision between people and cars, 2: collision between nonmotor vehicles and cars, 3: collision between motor vehicles and tricycles, 4: single car accident, 5: collision between cars and stationary cars, 6: collision between cars and cars, 7: collision between cars and cars, 8: collision between cars and cars, 9: collision between cars and cars, and 10: collision between cars
(4) Accident classification is as follows. 1: minor accident, 2: general accident, 3: major accident, and 4: extramajor accident
In the accident classification, a minor accident is one that causes one or two minor injuries, or property damage of less than 1000 yuan for a motor vehicle accident or less than 200 yuan for a nonmotor vehicle accident. A general accident is one that causes one or two serious injuries, or three or more minor injuries, or property damage of less than 30,000 yuan. A major accident is one that causes one or two deaths, or serious injuries to between 3 and 10 people, or property damage of more than 30,000 yuan but less than 60,000 yuan. An extramajor accident is one that causes three or more deaths; or serious injuries to 11 or more people; or one death together with serious injuries to eight or more people; or two deaths together with serious injuries to five or more people; or property damage of more than 60,000 yuan.
Among the 2800 accidents, 1378 were major accidents, accounting for 49.2%. In terms of road information, 63.0% of the accidents occurred on ordinary roads, followed by 18.2% at four-branch bifurcations and 10.9% at three-branch bifurcations. The data of traffic accident classification and road section information are shown in Table 1.
Since these data have not previously been used to predict the severity of traffic accidents, this paper considers three dimensions, namely, longitude and latitude distribution, date, and time, to verify the reliability of the data. The longitude and latitude of all traffic accidents were compared with the geographic extent of China, and no outliers were found. In addition, the distribution of traffic accidents in some provinces is shown in Table 2. Shanghai had the highest share of accidents, at 34.8%, followed by Sichuan and Yunnan, with 18.6% and 13.6%, respectively. Although the traffic accident data do not cover the whole country, they cover a variety of terrain such as plains, hills, mountains, and plateaus. Therefore, the accident location dimension supports the reliability of the data.
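One simple way to run such a longitude and latitude check is a bounding-box test, sketched below. The paper does not describe how the comparison was made, so the box limits (rounded outward, covering mainland China and Hainan) and the function name are assumptions; the city coordinates are approximate and serve only as a toy input.

```python
# Approximate bounding box in degrees (assumed values, not from the paper).
LON_RANGE = (73.0, 136.0)  # east longitude
LAT_RANGE = (18.0, 54.0)   # north latitude


def find_coordinate_outliers(points):
    """Return the (longitude, latitude) pairs that fall outside the box."""
    return [
        (lon, lat)
        for lon, lat in points
        if not (LON_RANGE[0] <= lon <= LON_RANGE[1]
                and LAT_RANGE[0] <= lat <= LAT_RANGE[1])
    ]


# Toy input: approximate city centers plus one clearly invalid point.
points = [
    (121.47, 31.23),  # Shanghai
    (104.07, 30.67),  # Chengdu, Sichuan
    (102.71, 25.04),  # Kunming, Yunnan
    (0.0, 0.0),       # typical placeholder for a missing coordinate
]
outliers = find_coordinate_outliers(points)
```

A bounding box only catches gross errors such as swapped or zero coordinates; a stricter check would test each point against province boundaries.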
For the date and time dimensions, we selected months in the date dimension and hours in the time dimension. The data for the date dimension are shown in Figure 2. The distribution in each month is relatively random, which corresponds to the randomness of traffic accidents in the date dimension. In addition, the time heat map of the accident is shown in Figure 3. It is easy to see that the highest number of accidents occurs during the morning and evening rush hours. In addition, traffic accidents occur more frequently in the evening than in the early morning. Therefore, the reliability of the data is verified from three aspects: accident location, accident date, and accident time.
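The month and hour distributions can be tallied with a few lines of Python. The timestamp format and the function name are assumptions, as the paper does not describe how accident times are stored, and the timestamps below are invented.

```python
from collections import Counter
from datetime import datetime


def count_by_month_and_hour(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Count accidents per calendar month and per hour of day."""
    months = Counter()
    hours = Counter()
    for text in timestamps:
        moment = datetime.strptime(text, fmt)
        months[moment.month] += 1
        hours[moment.hour] += 1
    return months, hours


# Toy timestamps, not taken from the NAIS data.
timestamps = ["2019-03-05 08:15", "2019-03-20 18:40", "2020-11-02 18:05"]
months, hours = count_by_month_and_hour(timestamps)
```

The two counters correspond to the monthly distribution and the hourly heat map, respectively.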


3.2. Prediction of Severity
In this paper, 2240 of the 2800 traffic accidents were randomly selected to train the random forest, and the remaining 560 accidents were used for testing. The importance of each feature is measured by how much the tree nodes that split on it reduce impurity, taken over all trees in the forest. The final importance ranking is shown in Figure 4. Our principle for choosing the importance threshold is the Ø80 value of the cumulative importance curve. According to the importance value of each feature, the Ø80 value is around 0.04. Therefore, we chose 0.04 as the threshold for selecting important features. Finally, seven accident features were selected to predict the severity of traffic accidents: accident form, engine capacity, collision speed, speed limit, road information, accident location, and vehicle maneuver.
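The random 2240/560 split and the 0.04 importance threshold can be sketched as follows. The function names, the fixed seed, and the toy importance values are assumptions for illustration; the actual importance values are those reported in Figure 4, and whether the threshold is inclusive is not stated in the paper.

```python
import random


def train_test_split(records, n_train, rng):
    """Shuffle a copy of the records and cut it into train and test parts."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def select_features(importance, threshold=0.04):
    """Keep features whose importance reaches the threshold, best first."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]


# 2800 record indices stand in for the accident records.
train, test = train_test_split(range(2800), n_train=2240, rng=random.Random(0))

# Toy importance values with placeholder feature names.
toy_importance = {"f1": 0.30, "f2": 0.05, "f3": 0.03}
selected = select_features(toy_importance)
```

Fixing the seed makes the split reproducible, which matters when several models are later compared on the same test set.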

3.3. Model Comparison
Using the 7 selected features, we conducted several experiments on the original data of 2800 traffic accidents. 2240 accidents were randomly selected for training, and the rest were used as the test set. Random forest, BP neural network, SVM, and radial basis neural network were each trained and tested. The prediction performance of each model on the severity of traffic accidents is shown in Table 3. The random forest algorithm performs better than the other models: its recall is higher and its false alarm rate is lower, and its overall F1 score is also higher. Its ROC value is the highest, which indicates that the prediction result of this model is more reliable and stable.
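For reference, the metrics named above can be computed from the counts of one category treated as the positive class. This is a minimal sketch with invented counts; the paper does not state how the per-category values were aggregated for Table 3, and the function name is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, false alarm rate, and F1 from one-vs-rest counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    false_alarm_rate = fp / (fp + tn) if fp + tn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "false_alarm_rate": false_alarm_rate,
        "f1": f1,
    }


# Invented counts for a single category.
metrics = binary_metrics(tp=8, fp=2, fn=2, tn=88)
```

Note that F1 is the harmonic mean of precision and recall, so a lower false alarm rate raises F1 only indirectly, through fewer false positives and hence higher precision.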
3.4. Random Forest Results Analysis
Each column of the confusion matrix represents a predicted category, and the column total is the number of instances predicted as that category. Each row represents the true category, and the row total is the number of instances that belong to that category. The accuracy of the trained model is 99%. In the prediction of category 1, three category 2 accidents were predicted as category 1, giving an accuracy of 98.6%. There were 7 prediction errors in category 2, an error rate of 0.8%. There were 13 errors in the category 3 predictions; nine of them were category 2 and four were category 1, giving an accuracy of 98.8%. Category 4 was predicted with 100% accuracy. The confusion matrix is shown in Figure 5.
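The layout described above, with rows for true categories and columns for predicted categories, can be reproduced with a short sketch. The function names and the toy labels are made up for illustration and are unrelated to the results in Figure 5.

```python
def confusion_matrix(true_labels, predicted_labels, categories):
    """Rows are true categories; columns are predicted categories."""
    index = {category: i for i, category in enumerate(categories)}
    matrix = [[0] * len(categories) for _ in categories]
    for truth, predicted in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[predicted]] += 1
    return matrix


def overall_accuracy(matrix):
    """Share of instances on the diagonal, i.e., correctly classified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels for the four severity categories.
true_labels = [1, 2, 2, 3, 3, 4]
predicted_labels = [1, 2, 1, 3, 2, 4]
matrix = confusion_matrix(true_labels, predicted_labels, [1, 2, 3, 4])
accuracy = overall_accuracy(matrix)
```

Off-diagonal entries show which categories are confused with each other; here one category 2 case is predicted as category 1 and one category 3 case as category 2.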

The specific prediction results and the real types are shown in Figure 6. The prediction accuracy on the test set is 80%, which shows that the model has a good prediction effect.

3.5. Discussion
Most of the data used in this paper come from a few provinces, because the NAIS database collection points are located in those provinces, and the data collected in other provinces are fewer or incomplete. In addition, major accidents are the most numerous in this paper, followed by general accidents, and minor accidents are fewer. The main reason is that NAIS requires that at least one of the accident participants be a car and that the accident caused death or an injury of level AIS3 or above. Therefore, minor accidents are underrepresented in this paper. Overall, however, the reliability of the dataset was verified from three dimensions: location, date, and time.
The random forest can handle both classification and regression problems. It can maintain high classification accuracy even when some data are missing, and it can handle a large number of high-dimensional features without the need for dimensionality reduction. It can also assess the importance of each feature in a classification problem and is not sensitive to outliers. The main limitation of random forests is that a large number of decision trees can slow down the algorithm and make it unsuitable for real-time prediction. In general, these algorithms are fast to train but slow to predict once trained. More accurate predictions require more trees, which leads to slower models. However, Wang and Chen [29] realized a weighted optimization of the random forest model that addresses problems of the traditional random forest algorithm. In addition, Wang et al. [30] optimized the random forest algorithm; the improved algorithm has higher classification accuracy and can effectively classify data.
In terms of accident characteristics, the newly added accident features are accident location, accident form, road information, and collision speed. The importance of these four features is found to be much greater than that of hour of day, age of vehicle, month of year, day of week, and age band of drivers. Replacing the latter with the former results in a more accurate prediction model.
In terms of prediction results, Gan et al. [25] reported an accuracy of 75% with 10,000 records, whereas the model in this study achieved an accuracy of 80% with 2800 records. Our accuracy is therefore higher even with a smaller dataset, and it may improve further if the sample size is increased. This reflects the advantage of the prediction model in this paper.
From the perspective of traffic safety management, more accurate prediction of the severity of traffic accidents has always been an important research direction for maintaining a safe and efficient traffic system. Many traffic accidents happen suddenly, before people can take effective measures to reduce the damage. The prediction model based on the random forest algorithm presented in this paper performs well in predicting the severity of traffic accidents. It can provide a useful reference for drivers and traffic safety management departments in preventing traffic accidents, for example, in the design of the vehicle body structure, the improvement of road surface conditions, the adjustment of road speed limits, and the marking of dangerous sections.
4. Conclusion
In this paper, a random forest algorithm is used to predict the severity of traffic accidents, based on its high performance in data classification. We add four accident features, namely, accident location, accident form, road information, and collision speed, to the features used in the existing traffic accident prediction model, and obtain a more efficient and accurate traffic accident model. Compared with the BP neural network, SVM, and RBF neural network, the random forest has the highest prediction accuracy and can be used as an effective tool to predict the severity of traffic accidents. In addition, the accident form (collision pattern) is found to be the most important feature for the severity of traffic accidents, which underscores the importance of car structure to the safety of drivers and passengers. Besides car structure, road information is also a main factor in traffic accidents. This paper finds that most traffic accidents occur on ordinary road sections, so improving the infrastructure of ordinary road sections and strengthening their supervision should help prevent traffic accidents. The importance of collision speed and speed limit shows that, in addition to cars, roads, and the environment, it is also indispensable to strengthen safe driving training for drivers.
In future research, more traffic accident features will be used to optimize the model. The influence of driver factors on the severity of traffic accidents also needs further study. The data used in this research are all from China, so the results may not transfer directly to other countries and regions. Due to the requirements of data collection, cases in some areas have not been collected. In addition, the model parameters and algorithms are not optimized in this paper, and the relationship between different accident features and different algorithms remains to be explored.
Data Availability
The article includes some data to support the results of this research. In order to protect the privacy and security of the Chinese NAIS database, NAIS restricted other information. This information can be obtained from NAIS for researchers who meet the criteria for accessing confidential data.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This study was supported by The Open Research Fund of Sichuan Key Laboratory of Vehicle Measurement, Control, and Safety (Grant no. szjj2018-130) and Sichuan Province Innovation Training Project (Grant nos. S202210623048 and S202210623064).