Abstract

Pedestrian injuries and fatalities due to traffic accidents remain high, and the need for efforts to reduce them continues to grow. Machine learning models can facilitate the exploration of the various factors that influence the occurrence of pedestrian accidents. In this study, we used data on pedestrian traffic accidents classified into three categories of injury severity: minor, serious, and fatal. To compare the performance of different types of models, logistic regression, Naïve Bayes, XGBoost, LightGBM, and CatBoost were used for the analysis, and hyperparameter tuning was performed to improve model performance. The classification accuracies of the five models were 0.688, 0.577, 0.705, 0.708, and 0.707, respectively, with LightGBM achieving the best accuracy of 0.708. Based on SHAP (Shapley additive explanation), one of the explainable artificial intelligence (XAI) techniques, we obtained the variable importance of the LightGBM model and identified the main factors affecting each level of injury severity. In addition, using LIME (local interpretable model-agnostic explanation), another XAI technique, we found that the ages of the driver and pedestrian had the most significant influence on the model's classification predictions. Specifically, accident severity increases with vehicle size, and accidents involving older drivers tended to be less severe, whereas accidents involving young drivers tended to be more severe.

1. Introduction

Traffic accidents account for a large proportion of accident-related casualties. Because traffic accidents can cause physical injury and disability, and in severe cases pose a critical threat to survival, they are recognized as a major public health problem [1]. Moreover, the loss of productivity due to death represents a great loss to the national economy [2]. Classifying injury severity from traffic accident data alone is a complex and difficult process. Hence, it is essential to investigate the factors that contribute to injury severity [3, 4].

Accurately predicting the severity of injury in traffic accidents is difficult because many variables are involved, such as driver, road, vehicle, and weather characteristics. Moreover, empirical studies have shown these variables to be important factors [5–7]; this sets the direction for the current study. Previous studies have used machine learning models for accident classification with good performance [8–10]. Explainable artificial intelligence (XAI) techniques have also been used in traffic accident analysis [11]. In this paper, we use XAI to interpret results from machine learning models, which, unlike statistical models, are difficult to interpret directly; in this way, XAI increases confidence in model performance.

We systematically analyzed various causal factors by interpreting the machine learning classification results through XAI. In particular, it was possible to classify the degree of accident into three levels through the design of a multiclass classification model for accidents. In addition, by applying the XAI technique to the analysis results, we tried to interpret the prediction results and identify the factors that affect the severity of accident. Ultimately, through this process, it was intended to help establish effective vehicle safety measures to prevent traffic accidents.

To this end, we developed a high-performance classification model, interpreted its results with XAI to systematically analyze the causal factors, and thereby aimed to support effective traffic safety measures to prevent pedestrian accidents.

Most traffic accident classification studies have aimed to establish an optimal classification model by improving performance on metrics such as accuracy, precision, recall, or F1-score [12]. Moreover, previous studies have used statistical models, such as logit and probit models, to analyze the effect of variables on accident occurrence with specific numerical values such as odds ratios [13–16].

Unlike statistical models, machine learning models can be expected to return highly accurate results; however, their interpretability is relatively inferior to that of statistical models. Hence, the XAI technique was used to supplement the interpretability of the machine learning results.

The need to use XAI techniques is greater when the classification factors for each class are important, as in the case of multiple classifications. Because the severity of pedestrian injury is classified into three levels (three classes) in our data, a detailed interpretation of the classification is deemed necessary.

Because real data involve many variables, predicting the severity of injury in a traffic accident is difficult. Empirical studies have indicated that road characteristics, vehicle attributes, and personal characteristics are significant determinants, and these findings inform the current study.

This study is structured as follows. A comprehensive description of the data is presented in the data section. The methods used for classification and model improvement are discussed in Section 2.2. The results of the analysis are presented in the results section, followed by the interpretation of the classification using XAI techniques. The discussion covers the overall results and limitations of our study and suggests future implications of this work. Finally, the conclusion summarizes the contributions and limitations of this study, as well as future directions.

2. Materials and Methods

2.1. Data

We analyzed the relationship between pedestrian accident severity by region and the factors that cause pedestrian accidents. To analyze these factors, the collected pedestrian accident data were organized into 14 accident-related categories. We collected data on a total of 48,381 pedestrians involved in 45,261 pedestrian traffic accident cases nationwide. After eliminating 10,792 cases in which the pedestrian accident factors were uncertain, information from 37,589 pedestrians was used [17].

In Table 1, all explanatory variables are binary categorical variables with values of 0 or 1. To make all variables categorical, variables that are continuous (e.g., driver's age or pedestrian's age) were divided into intervals (e.g., driver's age was divided into three categories: under 25 years, 25–60 years, and over 60 years) to create categorical variables that take values of 0 or 1 for each interval. The dataset contains 49 variables observed for each individual accident, including the dependent variable for injury severity level. The dependent variable is a multiclass variable similar to those used in previous studies [18, 19]. Each accident case is classified into one of three classes based on the severity of the accident, namely, minor, serious, and fatal injury.

2.1.1. Data Preprocessing

Since the raw data were all categorical variables, preprocessing was required, which entailed creating dummy variables using one-hot encoding for each categorical variable in the data. Thereby, eight “region” variables; five “accident location” variables; four “car type,” “road type,” “lane number,” and “season” variables; and three “driver’s age,” “pedestrian’s age,” and “weather” variables were expressed as 1 and 0 using dummy variables.
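As a rough illustration of this preprocessing step, the following Python sketch bins a continuous age column and one-hot encodes the categorical columns. The column names, bin edges, and the small synthetic dataset are hypothetical placeholders rather than the actual fields of the dataset used in this study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300  # small synthetic stand-in for the real pedestrian dataset

# Hypothetical raw columns; the real dataset has 14 categories of variables.
df = pd.DataFrame({
    "driver_age": rng.integers(18, 90, n),
    "region": rng.choice(["Seoul", "Busan", "Incheon"], n),
    "severity": rng.choice(["minor", "serious", "fatal"], n, p=[0.37, 0.58, 0.05]),
})

# Bin the continuous driver's age into the three intervals described above.
df["driver_age_group"] = pd.cut(df["driver_age"], bins=[0, 24, 60, 120],
                                labels=["D25", "D2560", "D60"])

# One-hot encode every categorical column into 0/1 dummy variables.
X = pd.get_dummies(df[["driver_age_group", "region"]], dtype=int)
y = df["severity"]
print(X.head())
```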

2.1.2. Data Split

The data were divided into training and testing datasets at a ratio of 0.7 : 0.3. The model was trained on the training dataset, and the test dataset was used for classification prediction; the prediction results were evaluated with several metrics. The percentages of minor, serious, and fatal labels were approximately 37%, 57%, and 5%, respectively, so fatal cases were relatively rare. Therefore, we used stratified k-fold cross-validation to prevent a specific target class from becoming concentrated in a specific fold. As described in Figure 1, each fold contains the same ratio of the three classes, and cross-validation was then used to obtain reliable classification results.
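The split and stratification could be implemented along the following lines with scikit-learn, assuming the feature matrix X and target y from the preprocessing sketch above; the random seed and number of folds are illustrative.

```python
from sklearn.model_selection import train_test_split, StratifiedKFold

# 0.7 : 0.3 split with class ratios preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Stratified k-fold keeps roughly the same minor/serious/fatal proportions in
# every fold, so the rare "fatal" class is never concentrated in one fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    print(f"fold {fold}: {len(tr_idx)} train rows, {len(va_idx)} validation rows")
```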

2.2. Machine Learning Method

To classify pedestrian accident severity, we used logistic regression (LR), Naïve Bayes (NB), XGBoost (XGB), LightGBM (LGBM), and CatBoost (CB). Logistic regression and Naïve Bayes were used as models based on statistical ideas, and XGBoost and LightGBM, which are known for their high performance among tree-based models, were also used. In addition, considering that all the data used in this study consist of categorical variables, we used CatBoost, which performs well when there are many categorical variables. The machine learning models used are described below.

2.2.1. Logistic Regression

A logistic regression model generally establishes the relationship between a binary dependent variable and several explanatory variables [20]. Because the dependent variable “severity” has three classes, we performed multinomial logistic regression, an extension of logistic regression to multiclass classification. It uses the cross-entropy loss function and predicts a multinomial probability distribution, which defines the probability of each of the multiple classes; multinomial logistic regression is fitted to learn and predict this distribution.
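A minimal multinomial logistic regression sketch with scikit-learn is shown below, assuming the X_train/X_test split from the previous sketch; recent scikit-learn versions apply the multinomial (softmax) formulation by default for multiclass targets with the lbfgs solver.

```python
from sklearn.linear_model import LogisticRegression

# Multinomial logistic regression: with the lbfgs solver, scikit-learn uses
# the multinomial (softmax / cross-entropy) formulation for a 3-class target.
lr = LogisticRegression(solver="lbfgs", max_iter=1000)
lr.fit(X_train, y_train)

# Predicted probability distribution over the three severity classes.
print(lr.predict_proba(X_test)[:3])
```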

2.2.2. Naïve Bayes

Naïve Bayes is a classification method based on probability. The Naïve Bayes algorithm predicts the probability of a class from past data, assuming conditional independence between the features. The probability is stated in the following equation:

$$P(y \mid x_1, \ldots, x_n) = \frac{P(y)\prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \ldots, x_n)}$$

The variable $y$ is the dependent variable, which is divided into three classes (minor, serious, and fatal). The variables $x_1, \ldots, x_n$ are the independent variables, which help to predict the class. Compared with other techniques, Naïve Bayes is faster and easier to apply; it requires less data and delivers good performance without an additional training period for the training dataset.
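Because every explanatory variable is a 0/1 dummy, a Bernoulli Naïve Bayes model is one natural choice; the sketch below uses scikit-learn and the same train/test split as before.

```python
from sklearn.naive_bayes import BernoulliNB

# Every explanatory variable is a 0/1 dummy, so Bernoulli Naive Bayes fits:
# it estimates P(x_i | y) for each binary feature and each severity class.
nb = BernoulliNB()
nb.fit(X_train, y_train)
print(nb.predict(X_test[:5]))
```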

2.2.3. XGBoost

The name XGBoost comes from “extreme gradient boosting.” It is a decision tree-based ensemble method, used especially as a boosting method (Figure 2). XGBoost finds a classifier using a greedy algorithm and finds appropriate parameters quickly using distributed processing. Moreover, it has a flexible learning system, so the model can be optimized by adjusting various parameters. In addition, overfitting can be prevented, and visualization is easier than that with a neural network; hence, it is more intuitive to understand than other models. Above all, XGBoost shows good performance compared with other existing models and is one of the most used machine learning models for various analyses.
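A hedged XGBoost sketch for the three-class problem follows; the hyperparameter values are placeholders, and the labels are integer-encoded because the XGBoost scikit-learn wrapper expects numeric class labels.

```python
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# The XGBoost scikit-learn wrapper expects integer class labels, so the
# severity strings are label-encoded first. Hyperparameters are placeholders.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)

xgb = XGBClassifier(objective="multi:softprob", n_estimators=300,
                    max_depth=6, learning_rate=0.1, eval_metric="mlogloss")
xgb.fit(X_train, y_train_enc)

# Map integer predictions back to the original severity labels.
print(le.inverse_transform(xgb.predict(X_test))[:5])
```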

2.2.4. LightGBM

The gradient boosting decision tree (GBDT) is time-consuming because it has to scan all data instances to estimate the information gain of all possible split points for each feature (Figure 3). In terms of efficiency and scalability, when the feature dimension is high and the dataset is large, obtaining a satisfactory output is difficult. To solve these problems, two methods, GOSS (gradient-based one-side sampling) and EFB (exclusive feature bundling), have been proposed. LightGBM is a GBDT model that uses GOSS and EFB; it achieves almost the same performance as a traditional GBDT while training more than 20 times faster [21]. The GOSS and EFB techniques handle large numbers of data instances and large numbers of features, respectively. Furthermore, these two methods give LightGBM significantly better computational speed and memory usage than XGBoost. An additional advantage is that optimal splitting of categorical features is possible.
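A corresponding LightGBM sketch is shown below; GOSS and EFB operate internally, and the option for enabling GOSS explicitly differs between LightGBM versions, so the sketch keeps the default boosting settings with placeholder hyperparameters.

```python
from lightgbm import LGBMClassifier

# LightGBM multiclass sketch; hyperparameter values are placeholders.
lgbm = LGBMClassifier(objective="multiclass", num_leaves=31,
                      learning_rate=0.05, n_estimators=300)
lgbm.fit(X_train, y_train)
print(lgbm.score(X_test, y_test))
```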

2.2.5. CatBoost

CatBoost is a boosting algorithm based on ordered boosting, a permutation-driven alternative to classic boosting, together with dedicated categorical variable processing. It was designed to combat the prediction shift caused by a form of target leakage that occurs when existing gradient boosting algorithms are implemented. One of the most common techniques for handling categorical variables in a boosting tree is one-hot encoding, which adds a new binary variable for each category; another is to group categories using a target statistic (TS), which estimates the expected target value for each category. CatBoost uses the TS as a new numerical variable to handle categorical variables with minimal loss of information. As such, CatBoost can be more effective than other algorithms at processing categorical variables, but when the data consist of continuous variables, its performance may be relatively poor. Furthermore, it trains more slowly than LightGBM [22].
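The sketch below illustrates CatBoost's native categorical handling, which would apply if the original (non one-hot) categorical columns were kept. It reuses the raw columns of the toy DataFrame from the preprocessing sketch; in this study the one-hot encoded data were used instead.

```python
from catboost import CatBoostClassifier, Pool

# CatBoost can consume raw categorical columns directly (no one-hot encoding),
# replacing them with target statistics internally.
cat_cols = ["region", "driver_age_group"]
train_pool = Pool(data=df[cat_cols].astype(str), label=df["severity"],
                  cat_features=cat_cols)

cb = CatBoostClassifier(loss_function="MultiClass", iterations=300, verbose=0)
cb.fit(train_pool)
```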

2.3. Hyperparameter Tuning

Hyperparameters are algorithmic parameters used to construct a machine learning model or to minimize a loss function [23]. To design an optimal machine learning model, it is necessary to find the optimal parameters within the parameter range of the machine learning model. Configuring the optimal parameters in the process of designing a model is called hyperparameter tuning, and it is one of the most important processes in designing an efficient model. In particular, its importance is greater in specific cases that have many hyperparameters [24].

The goal of hyperparameter optimization (HPO) is to tune hyperparameters so that the model performs optimally or near optimally with given values [25]. The model’s performance can be evaluated by accuracy, root mean square error (RMSE), and F1-score. To improve a machine learning model using HPO, a designer first needs to know the main idea for tuning a machine learning model according to a specific problem or type of dataset.

Machine learning models are traditionally classified into supervised and unsupervised learning algorithms; supervised learning algorithms use both input and output values. In this study, HPO is used to improve the performance of a supervised model so that it achieves higher classification accuracy.

2.3.1. Grid Search (GS)

GS is one of the most widely used hyperparameter configuration search techniques. It evaluates every combination of the given hyperparameters, that is, the Cartesian product of user-specified finite sets of values [26]. Although GS is easy to implement and parallelize, the number of evaluations increases exponentially with the number of hyperparameters, making it inefficient in a high-dimensional hyperparameter configuration space; this is a manifestation of the curse of dimensionality [27].
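A grid search sketch over a small, illustrative LightGBM grid is shown below, reusing the stratified folds from the data split sketch; every combination in the Cartesian product of the listed values is evaluated.

```python
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMClassifier

# Grid search: every combination of the values below is cross-validated.
param_grid = {
    "num_leaves": [15, 31, 63],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300],
}
gs = GridSearchCV(LGBMClassifier(objective="multiclass"), param_grid,
                  cv=skf, scoring="accuracy", n_jobs=-1)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)
```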

2.3.2. Bayesian Optimization (BO)

BO is a well-known iterative algorithm for solving HPO problems. It determines the future evaluation point from the previous results; it uses a surrogate model and an acquisition function to determine the next hyperparameter configuration.

The goal of the surrogate model is to fit all currently observed points to an objective function. The acquisition function regulates the use of different points by balancing the trade-off between exploration and exploitation after it obtains the predictive distribution of the probabilistic surrogate model. “Exploration” refers to sampling instances in areas that have not been sampled. Meanwhile, “exploitation” refers to sampling based on the posterior distribution and sampling in promising regions where the global optimum is most likely to occur.

The BO model detects the region that is most likely to contain the optimum while avoiding missing better configurations in unexplored areas by balancing the exploration and exploitation processes [28]. In this study, we use the BO-TPE (tree-structured Parzen estimator) algorithm, which serves as the surrogate model, for HPO.
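One way to run BO-TPE is with the hyperopt library, which implements the TPE algorithm; the search space, evaluation budget, and LightGBM hyperparameters below are illustrative assumptions rather than the settings used in this study.

```python
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

# Illustrative search space for three LightGBM hyperparameters.
space = {
    "num_leaves": hp.quniform("num_leaves", 15, 127, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, -1),
    "n_estimators": hp.quniform("n_estimators", 100, 500, 50),
}

def objective(params):
    model = LGBMClassifier(
        objective="multiclass",
        num_leaves=int(params["num_leaves"]),
        learning_rate=params["learning_rate"],
        n_estimators=int(params["n_estimators"]),
    )
    acc = cross_val_score(model, X_train, y_train, cv=skf,
                          scoring="accuracy").mean()
    return {"loss": -acc, "status": STATUS_OK}  # TPE minimizes the loss

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=Trials())
print(best)
```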

2.4. Model Evaluation

Given that the degree of injury was divided into three classes, we used precision, recall, and F1-score rather than accuracy alone as evaluation indices for the model. Table 2 shows the confusion matrix for the multiclass (three-class) classification; it defines the quantities used in each metric formula and was used to obtain the evaluation indices for each of the three classes.

Accuracy refers to the ratio of correctly classified cases to the total number of cases. It is obtained by dividing the number of correctly classified cases, that is, those in which class 0 is classified as 0, class 1 as 1, and class 2 as 2 (the diagonal entries of the confusion matrix), by the sum of all entries of the confusion matrix.

Precision measures the ratio of true positive cases to all cases predicted as positive and is an index for evaluating positive predictive performance. In multiclass classification, this is the proportion of cases that are actually minor, serious, or fatal among the cases predicted to be minor, serious, or fatal, respectively. The precision of each class $c$ can be obtained as follows:

$$\text{Precision}_c = \frac{TP_c}{TP_c + FP_c}$$

Recall is the ratio of correctly predicted positive cases among all actual positive cases and is also called sensitivity or the true positive rate. It is the proportion of cases of a class that the model predicts as that class out of all true cases of that class, and it can be expressed as follows:

$$\text{Recall}_c = \frac{TP_c}{TP_c + FN_c}$$

The F1-score is the harmonic mean of precision and recall and summarizes their average performance, so it can be used to find an appropriate trade-off between precision and recall. Considering the problem of data imbalance, we obtained several types of F1-score: the micro-F1-score, macro-F1-score, and weighted F1-score. The per-class F1-score is expressed as follows:

$$F1_c = \frac{2 \cdot \text{Precision}_c \cdot \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}$$

The micro-F1-score is computed from the global counts of true positives, false positives, and false negatives over all classes; the macro-F1-score is the unweighted mean of the per-class F1-scores; and the weighted F1-score is the mean of the per-class F1-scores weighted by the number of true instances (support) of each class.

We evaluated the performance of the model using these four evaluation metrics. These metrics should enable a flexible interpretation of the results [29].
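The four metrics can be computed with scikit-learn as sketched below, assuming the tuned LightGBM model lgbm and the held-out test set from the earlier sketches.

```python
from sklearn.metrics import (accuracy_score, classification_report, f1_score,
                             precision_score, recall_score)

# Evaluation metrics for the three-class problem.
y_pred = lgbm.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, average=None))  # per class
print("recall   :", recall_score(y_test, y_pred, average=None))
for avg in ("micro", "macro", "weighted"):
    print(f"{avg} F1:", f1_score(y_test, y_pred, average=avg))
print(classification_report(y_test, y_pred))
```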

2.5. Interpretable Machine Learning

Through the preceding processes, an appropriate classification model was found. Using a feature importance plot of the model, it was possible to determine which variables had a substantial influence on the classification. However, the feature importance plot alone cannot determine whether a variable has a positive or negative effect on the model’s classification results. Therefore, we applied XAI methods to the classification predictions of the best-performing machine learning model obtained above to gain more detailed information.

2.5.1. SHAP (Shapley Additive Explanation)

In the case of linear regression, measuring the influence of each feature is straightforward. SHAP makes a similar measurement possible for machine learning models; it explains the predictions of a black-box model and is based on the Shapley value from game theory. Shapley values quantify the contribution of each feature by comparing model outputs with and without that feature (cooperation and noncooperation), distribute the payout accordingly, and thereby enable an interpretation of the analysis model [30]. When $N$ is the set of all input features, the general formula of SHAP is

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,\left(|N| - |S| - 1\right)!}{|N|!}\left[f\left(S \cup \{i\}\right) - f\left(S\right)\right] \qquad (6)$$

where $\phi_i$ is the feature attribution value of the $i$-th input variable, $N$ is the set of all input variables, and $S$ ranges over all subsets of the input variables that do not contain the $i$-th variable. In equation (6), $f(S)$ is the model prediction based on the feature subset $S$ only, and $f(S \cup \{i\})$ is the prediction based on the subset $S$ together with the $i$-th input variable [30]. Moreover, because SHAP is a model-agnostic method, it can be applied to any model. Furthermore, it can be calculated quickly by adapting the calculation method to the characteristics of the underlying learning model. Thus, it was judged to be appropriate for the machine learning techniques used here.
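A minimal SHAP sketch for a tuned LightGBM model might look as follows; note that for multiclass models some SHAP versions return a list of per-class arrays while newer versions return a single three-dimensional array, so the exact handling may need adjusting.

```python
import shap

# TreeExplainer computes exact Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(lgbm)
shap_values = explainer.shap_values(X_test)

# Global importance: mean absolute SHAP value per feature, summarized per class.
shap.summary_plot(shap_values, X_test, class_names=list(lgbm.classes_))
```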

2.5.2. LIME (Local Interpretable Model-Agnostic Explanation)

LIME is an algorithm that reliably explains the individual predicted values of a classification model or a regression model. As the name suggests, LIME can be used with any model. LIME has the following three characteristics: it is interpretable, has local fidelity, and is model-agnostic.

The explanation produced by LIME is obtained by solving the following optimization problem:

$$\xi(x) = \operatorname*{arg\,min}_{g \in G} \; L(f, g, \pi_x) + \Omega(g) \qquad (7)$$

where $f$ is the original model, $g$ is an interpretable surrogate model from the class $G$, $\pi_x$ is a proximity measure defining the local neighborhood around the instance $x$, $L(f, g, \pi_x)$ measures how unfaithful $g$ is to $f$ in that neighborhood (local fidelity), and $\Omega(g)$ measures the complexity of $g$. Equation (7) shows that the greater the focus on local fidelity, the lower the interpretability of the model, and vice versa. Therefore, to ensure both interpretability and local fidelity, it is necessary to minimize $L(f, g, \pi_x)$ while keeping $\Omega(g)$ small enough for human interpretation. In this study, LIME was used to interpret the analysis results for each variable and to examine their influence [31].

LIME can evaluate feature importance for individual cases. In our application, LIME was used to identify which features of a given pedestrian accident contributed the most to the model’s classification result. The main idea of LIME is to compute a local surrogate model, an easily interpretable model trained to mimic the behavior of the more complex model.
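A LIME sketch for explaining a single test case could look as follows, assuming the lime package and the objects from the earlier sketches; the number of displayed features is arbitrary.

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

# Fit a simple local surrogate around one test case and list which 0/1
# features pushed the prediction toward or away from each severity label.
explainer = LimeTabularExplainer(
    training_data=np.asarray(X_train),
    feature_names=list(X_train.columns),
    class_names=list(lgbm.classes_),
    mode="classification",
)
exp = explainer.explain_instance(np.asarray(X_test)[0],
                                 lgbm.predict_proba, num_features=10)
print(exp.as_list())
```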

3. Results and Discussion

This section describes the classification results of each of the five machine learning models. Moreover, we applied XAI methods such as SHAP and LIME to the analysis output.

3.1. Classification Result

More detailed results on accident severity could be obtained using a three-class classification instead of a conventional binary classification; in particular, the metrics are obtained for each class. After data preprocessing, the five machine learning models, namely logistic regression (LR), Naïve Bayes (NB), XGBoost (XGB), LightGBM (LGBM), and CatBoost (CB), were first trained without hyperparameter tuning. Table 3 presents the four performance evaluation metrics (accuracy, recall, precision, and F1-score) for the five models. For recall, precision, and F1-score, the performance for each severity level was obtained using the formulas described above. In addition, four HPO settings (default, GS, RS (random search), and BO) were compared.

Table 3 shows that the LGBM model tuned with BO achieved the highest performance. In the following sections, the XAI techniques SHAP and LIME are applied to this model to identify the main factors influencing accidents and to interpret the classification results in detail.

3.2. SHAP Result

Figure 4 shows LightGBM’s feature importance plot and the SHAP value plot. The feature importance plot measures the influence of a variable on the model using the permutation technique. When a variable depends on other variables, the results can become skewed, and, crucially, negative impacts are not captured by the feature importance. With the SHAP value plot, in contrast, we can measure negative influence and account for interactions between variables. Therefore, SHAP can provide more reliable results than the conventional variable importance, which carries the risk of producing misleading rankings [32].

When comparing the two plots, a difference in the ranking of the variables can be observed, partly because dependencies exist between the features. For example, “metro” and “Seoul” represent the place where the accident occurred, and “capital” is a category that signifies the road type, which means that the road is located in a capital or metropolitan city.

Therefore, “metro” and “capital” may have some correlation with each other. In the case of the feature importance plot, three (“metro,” “Seoul,” and “capital”) of the top seven features contain similar information to each other. Thus, the feature importance calculation algorithm may lead to incorrect results. The SHAP value plot measures the importance of each variable more accurately than the feature importance plot. “D25” and “Day” are the most influential features in the model analysis.

3.3. LIME Result

The LIME tool helps us interpret the machine learning model. Figures 5–7 show the local explanations for the model that we chose in the results section.

In this section, we interpret four classification sample cases for each label (“minor,” “serious,” and “fatal”). For each case, we examine three LIME outputs: the classification probabilities for each label, the LIME explanation for the selected features, and the table listing the features and their values. All 12 cases were classified correctly by the machine learning model, and LIME shows the features that influenced the classification result for each label [33]. Figure 5 shows that, for the “minor” label, D25 = 0 and D60 = 1 have positive impacts and Seoul = 0 has a negative impact on the classification; the features depicted in blue are those contributing to this classification. Overall, LIME shows that D25 = 0 and D60 = 1 make a large positive contribution to the model predicting a “minor” accident, whereas Seoul = 0 makes a negative contribution.

Figure 6 shows four of the cases classified under the “serious” label. In particular, D60 = 0 has a positive impact on being classified as “serious.” Moreover, in most cases, D25 = 1 and Acc5 = 1 contribute to the model predicting the “serious” label. Notably, while several main features have a positive influence on being classified as “serious,” features with a clear negative influence on this class are difficult to identify.

The final four cases were classified under the “fatal” label. As shown in Figure 7, most cases have D25 = 1, P60 = 1, and Day = 0 (presented in green in the table). In contrast, Freeway = 0, Large = 0, and Drink = 0 have negative effects on being classified under the “fatal” label. Looking at the features that contributed to the “fatal” classification overall, it is evident that a few main features contribute considerably more to the classification than the others.

3.4. Discussion

There was no substantial change in the number of traffic accidents during the period 2011–2019. Additionally, the pedestrian traffic accident fatality rate has remained roughly constant at about 40% (Table 4). There is, thus, a need to reduce material and human losses by reducing pedestrian traffic accidents.

The results of the classification model and the interpretation output indicated the factors that could predict injury severity. Both the feature importance plot and the SHAP value plot include Day, D25, P25, and P60.

Thus, these features probably had a major influence on the model’s classification predictions. Furthermore, after interpreting the LightGBM model’s classifications, we identified the main features that affected the classification for each label. The LIME results showed that, for “minor” cases, the driver being over 60 years old was the main factor in the classification.

In the case of the “serious” label, the main factors in the prediction were the driver’s age being less than 25 years and the accident occurring at a crosswalk. Lastly, in the case of the “fatal” label, the main features were the driver’s age being less than 25 years and the pedestrian’s age being more than 60 years. Meanwhile, the factors that contributed negatively to the “fatal” label were the accident not occurring on a highway, the driver not being drunk, and the vehicle not being large.

Since all variables were categorical, information contained in the original values was lost, which made improving the model’s performance difficult and also limits the interpretation of the classification results. A more detailed and specific interpretation of the XAI results would be possible if some features were continuous or ordinal rather than all categorical. Furthermore, as the data covered only one year, comparison with data from other years was not possible. If data from different years were available, differences between time points could be compared [34]. Similarly, future studies could explore the distribution of injury severity by year or by day of the week [35].

As the driver’s age and pedestrian’s age were the main factors influencing accident severity in our study, it would be meaningful for future research to compare these results with other traffic accident studies related to driver age. Moreover, better performance might be obtained by applying machine learning or deep learning methods not used in this study, and applying XAI techniques other than SHAP and LIME could lead to more diverse interpretations.

As a result, it is expected that the findings of this study will aid the road transport department and engineers in making policy decisions regarding road design for pedestrians. Furthermore, these results may be useful in the development of public policies, such as driver and pedestrian safety management.

4. Conclusion

Pedestrian accidents account for a high proportion of all traffic accidents. In this study, an in-depth and systematic analysis of pedestrian traffic accidents was conducted using machine learning models. We investigated the factors that might have caused the pedestrian accidents. The degree of injury in the pedestrian accident cases was classified into three categories (minor, serious, and fatal), and important features were identified through multiple classifications. We improved the performance of the models using hyperparameter tuning. While selecting an optimal model, we considered not only the accuracy but also other metrics.

Moreover, XAI was used to interpret the results of the machine learning model. While earlier studies have used machine learning methods for traffic accident analysis, we focused on model performance and feature importance using XAI to interpret the results of the machine learning model analysis. This enabled us to identify how specific factors were related to pedestrian accidents.

The purpose of this research was to prepare effective traffic safety measures for the reduction and prevention of pedestrian accidents through the classification of accidents. We explored the main features that affected the results through machine learning analysis. Furthermore, we used XAI to confirm the classification factors for each label. Our results could help traffic departments to reduce the occurrence of accidents and the severity of injuries. The features obtained in our study could be linked to customer information to help companies set their insurance premiums.

Data Availability

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Sanghun Lee, Sangyeop Kim, Jaehoon Kim, and Doyun Kim contributed equally to this study.

Acknowledgments

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2019R1I1A3A01057696).