Abstract
Highway-rail grade crossing (HRGC) crashes continue to be major contributors to rail casualties in the United States and have been intensively researched in the past. Data-mining models focus on prediction, while the dominant generalized linear models focus on model and data fitness. Decision makers and traffic engineers rely on prediction models to examine at-grade crash frequency and make safety improvements. The gradient boosting (GB) model has gained popularity in many research areas. In this study, to fully understand the model's performance in HRGC accident prediction, a GB model with a functional gradient descent algorithm is selected to analyze crashes at highway-rail grade crossings (HRGCs) and to identify contributing factors. Moreover, contributor importance rankings and partial-dependence relations are generated to further clarify the relationships between identified contributors and HRGC crash likelihood and to counter the "black box" issues that most machine learning methods face. Furthermore, to fully demonstrate the model's prediction performance, a comprehensive assessment of prediction power based on six measurements is conducted, and the prediction performance of the GB model is verified and compared with a decision tree (DT) model, chosen as a reference because of its popularity and comparable data availability. The GB model is shown to produce better prediction accuracy and to reveal nonlinear relationships between contributors and crash likelihood. In general, HRGC crash likelihood is significantly affected by several traffic exposure factors: highway traffic volume, railway traffic volume, and train travel speed, among others.
1. Introduction
Crashes between motor vehicles and trains at highway-rail grade crossings (HRGCs) often have severe consequences [1]. Of all crashes at HRGCs in the U.S. from 2000 to 2014, 12% resulted in fatalities [2]. Numerous models have been developed to identify major contributing factors and to explore the relationships between crashes and explanatory variables, so that safety performance can be better understood and effective countermeasures applied to reduce crash rates at HRGCs.
Since crash data have random, discrete, and nonnegative characteristics, generalized linear models (GLMs) [3] have commonly been selected to investigate the relationship between crashes and contributing factors. However, Lord and Mannering [4] pointed out that these models face various data challenges stemming from the crash data distribution and from inappropriately fitted GLMs. As indicated by Lu and Tolliver [5] and Oh et al. [6], HRGC crash data often exhibit underdispersion, where the sample variance is less than the sample mean, and only a few, less common GLMs are suitable for such datasets. Moreover, available crash datasets often contain a large portion of missing data and outliers, and GLMs are very sensitive to noisy data [7].
In this study, to fully demonstrate the model's application and its capability to analyze safety data, a robust data-mining technique, the gradient boosting (GB) model, is selected to analyze crashes at HRGCs. Unlike GLMs, it requires no predefined underlying relationship between the dependent and independent variables, so underdispersed HRGC data are not an issue. Moreover, to better understand the model's forecasting performance, a comprehensive forecasting accuracy evaluation system comprising six measurements is proposed and evaluated.
2. Literature Review
GLMs and other statistical models have been commonly adopted by transportation safety decision makers and researchers to capture relationships among many factors and thereby assess transportation safety risk. Poisson and negative binomial (NB) models have been widely applied in crash frequency studies [8–14]. Zero-inflated Poisson and zero-inflated negative binomial models have been developed as extensions of the Poisson and NB models to overcome poor model performance for rare-event data [6, 15–18]. However, these models still need to meet data distribution assumptions, and, as indicated in the literature, such parametric regression models have severe limitations [19–22]. Researchers have difficulty upholding the various required assumptions in many applications, and failure to satisfy them introduces numerous errors. Lu and Tolliver [5] showed that underdispersion exists in HRGC crash data, whereas the negative binomial model, the most popular GLM, assumes overdispersion. The gamma [6], Conway–Maxwell–Poisson, and Bernoulli [5] models have been recommended for underdispersed crash data, but they are hard to implement because of their complexity. Moreover, all GLMs assume a linear relationship between the transformed response, in terms of the link function, and the explanatory variables, which makes it hard to capture real, dynamic, nonlinear relationships.
Tree-based machine learning models and other nonparametric methods have recently been recognized by transportation safety researchers [23]. Karlaftis and Golias [21] modeled crash frequency on rural roads with a decision tree model, Yan and Radwan [24] studied the influential factors of rear-end crashes, Qin and Han [25] classified intersection crashes, Yan et al. [26] applied a decision tree (DT) model to predict train-vehicle crashes at HRGCs, and Keramati et al. [27] adopted survival analysis to evaluate geometric effects on HRGC safety performance. The existing literature that adopts GB for safety research focuses on only one part of the application: some studies emphasize its capability to handle missing values and provide better overall accuracy [28], while others focus on its simplified version, the DT method, to produce interpretable relationships [24, 26].
In this study, a nonparametric tree-based model, the gradient boosting model, is used to examine the relationship between HRGC crashes and contributing factors. The GB model is extremely powerful for understanding the structure of complex datasets and exploring potential relationships between dependent and independent variables, and GB models are widely used in various transportation research areas [29–31]. Unlike linear models, the GB model requires no statistical assumptions, and it is believed to be superior to simple DT models because of its techniques for handling missing data, robustness to data noise, and resistance to overfitting [32]. The authors intend to conduct a complete application and performance assessment to demonstrate the GB model's full learning and prediction capabilities, including (1) building the predictive model from the data, (2) producing interpretable relationships, and (3) demonstrating sound prediction power through a complete forecasting performance analysis.
3. Methodology
3.1. Data
Data used in this study came from two public sources: (1) the Federal Railroad Administration (FRA) Office of Safety accident/incident database and (2) the FRA Office of Safety highway-rail crossing inventory. A combined database is generated by matching HRGC identification numbers across the two sources, so each record includes information on the crash and a description of the highway-rail crossing. The database contains 19 years of historical crash information at HRGCs in North Dakota: of 5,713 HRGCs in the state, 354 have historical crash records over the 19 years. To study crash-associated factors, a binary target variable (CRASH) is defined with two levels: a value of 1 indicates that a crash happened, while a value of 0 represents a crossing with no crash. Table 1 lists all screened variables, including one target variable, one ID variable, and thirty potential contributing variables.
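As a hedged illustration, the merge and target definition can be reproduced along the following lines in R; the file names and the crossing identifier column (CrossingID) are placeholders, not the exact FRA field names used in this study.

# Merge FRA accident/incident records with the crossing inventory
# by the shared crossing identification number (column name assumed).
accidents <- read.csv("fra_accident_incident.csv")   # hypothetical file name
inventory <- read.csv("fra_crossing_inventory.csv")  # hypothetical file name

# Flag crossings with at least one recorded crash in the study period.
crash_ids <- unique(accidents$CrossingID)
inventory$CRASH <- ifelse(inventory$CrossingID %in% crash_ids, 1, 0)

table(inventory$CRASH)  # expect 354 crash crossings out of 5,713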
3.2. Gradient Boosting
The gradient boosting method can be viewed as multiple additive trees (MATs) and is a machine-learning data-mining technique for regression and classification problems proposed in [33, 34] at Stanford University. The GB method theoretically extends and improves the simple DT model using stochastic gradient boosting [33]. GB produces a predictive model in the form of an ensemble of several simple decision tree models [35]. The GB model therefore inherits all of the advantages of decision tree models while improving on other aspects, such as robustness and accuracy [34]. Moreover, several other features make the GB model attractive, including the capability to handle large datasets without preprocessing, resistance to outliers, the capability to handle missing data, robustness to complex data, and resistance to overfitting [34, 36].
A GB model is a series expansion approximating the true functional relationship [36]. In general, the GB model starts by fitting the data with a simple decision tree model, which leaves a certain level of error in terms of fitness to the data. A detailed description of the simple decision tree algorithm and of the data selection can be found in Zheng et al. [28]. Because these errors remain correlated with the outcome, the GB model then fits another decision tree model to the errors, or residuals, of the previous tree. This sequential process repeats until the errors are minimized.
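The residual-fitting loop can be illustrated in a few lines of R. This toy sketch uses squared-error residuals and rpart base trees, a deliberate simplification of the Bernoulli-loss gradients a GB implementation would use for a binary target such as CRASH.

library(rpart)

# X: data frame of predictors; y: numeric response.
boost_fit <- function(X, y, n_trees = 100, shrinkage = 0.01, depth = 2) {
  pred  <- rep(mean(y), length(y))   # start from a constant model
  trees <- vector("list", n_trees)
  for (n in seq_len(n_trees)) {
    resid <- y - pred                # errors of the current ensemble
    d <- data.frame(X, resid = resid)
    trees[[n]] <- rpart(resid ~ ., data = d,
                        control = rpart.control(maxdepth = depth))
    # move a small step (the learning rate) toward the new tree
    pred <- pred + shrinkage * predict(trees[[n]], d)
  }
  list(trees = trees, pred = pred)
}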
The detailed algorithm of GB is described as follows [29, 37]:

$$F(x) = \sum_{n=1}^{N} \beta_n h(x; a_n),$$

where $x$ is a set of predictors and $F(x)$ is the approximation of the response variable. The $h(x; a_n)$ are single decision trees, with the parameters $a_n$ indicating the split variables, and the $\beta_n$ $(n = 1, 2, \ldots, N)$ are the coefficients that determine how the single trees are combined. Friedman [38] proposed a functional gradient descent optimization algorithm to find the final optimal GB model through an iterative tree-building process. Left unchecked, the method keeps adding trees until all observations are perfectly fitted, which results in an overfitted model that performs perfectly on the training data (low bias) but has very low prediction accuracy on a different dataset (high variance). To avoid overfitting, the model is therefore evaluated against a testing dataset, and iterative training stops when the model predicts well for both the training and testing datasets.
Regularization parameters are critical for avoiding overfitting and improving model performance, and usually comprise two parameters: learning rate and tree complexity. The learning rate, also called the shrinkage rate [39], controls how fast the model is updated or improved after each stage and ranges from 0 to 1. A small learning rate yields finer improvement of the loss function but requires more iterations and computational time [29]; higher values, close to 1, result in overfitting and poor performance [39]. Tree complexity represents the number of nodes per single simple tree [37]. A higher number of nodes introduces lower bias in the training set while increasing the variance in the testing set, a phenomenon recognized as the bias-variance trade-off. As a result, the learning rate and tree complexity must be balanced to satisfy this trade-off and avoid overfitting.
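The paper does not state which R implementation was used. As a hedged sketch, the gbm package exposes both regularization parameters directly (shrinkage for the learning rate, interaction.depth for tree complexity) and supports early stopping on a held-out fraction; train_data is an assumed training split of the merged database.

library(gbm)

set.seed(1)
fit <- gbm(CRASH ~ . - CrossingID,     # ID column name assumed
           data = train_data,
           distribution = "bernoulli", # binary crash / no-crash target
           n.trees = 5000,             # grow more trees than needed
           shrinkage = 0.01,           # learning rate
           interaction.depth = 8,      # tree complexity (interaction level)
           train.fraction = 0.7)       # gbm holds out the last 30% of rows,
                                       # so rows should be randomly ordered

# Stop at the iteration where held-out error is minimized.
best_iter <- gbm.perf(fit, method = "test")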
4. Results
This section presents the findings of this study; all analyses were conducted in R. First, the selection of an optimal model based on model performance is presented. Then, based on the optimal model, influential variables are ranked by their importance to the target variable, and the impacts of the top influential variables on crash prediction are analyzed. Finally, model performance is compared between the DT model and the GB model.
4.1. Model Setup
To detect interactions between variables and to take full advantage of the GB model, a high level of tree complexity and a low learning rate are suggested for experimentation [39]. In this study, the model is tested under three learning rates (0.05, 0.01, and 0.005) and five levels of tree complexity (2, 4, 6, 8, and 15). Table 2 shows how the model performs under the various learning rate and tree complexity levels. "Class" represents crash level, with 0 indicating no crash and 1 indicating a crash. The columns "Training data error percent" and "Testing data error percent" show the percentage of prediction error for the training and testing data, respectively. "Number of trees" indicates the number of trees needed to train an optimal model under the corresponding learning rate and tree complexity; clearly, a lower learning rate or lower tree complexity requires more trees to reach the optimal model.
The optimal model should predict well for both training and testing data; in addition, accurate prediction of the event level is critically important. Moreover, the number of trees required to achieve optimal performance indicates computing time and should be considered when selecting regularization parameters. Balancing training error, testing error, event forecasting error, nonevent forecasting error, the number of trees needed to reach the optimal model, and computing resource requirements, this research selected the model with a learning rate of 0.01, tree complexity of 8, and an ensemble of 1,092 trees as the optimal setting. Prediction accuracy for the optimal model is 85.7% for the nonevent level (CRASH = 0) and 83.9% for the event level (CRASH = 1). Variable importance and variable impacts on crashes at HRGCs are generated from this optimal model of 1,092 simple decision trees.
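A grid search like the one summarized in Table 2 can be run by looping over the candidate settings. The sketch below reuses the hypothetical gbm call above and scores misclassification error on a held-out test set; train_data and test_data are assumed splits of the merged database.

library(gbm)

grid <- expand.grid(shrinkage = c(0.05, 0.01, 0.005),
                    depth     = c(2, 4, 6, 8, 15))

results <- lapply(seq_len(nrow(grid)), function(i) {
  set.seed(1)
  m <- gbm(CRASH ~ . - CrossingID, data = train_data,
           distribution = "bernoulli", n.trees = 5000,
           shrinkage = grid$shrinkage[i],
           interaction.depth = grid$depth[i],
           train.fraction = 0.7)
  best <- gbm.perf(m, method = "test", plot.it = FALSE)
  p    <- predict(m, test_data, n.trees = best, type = "response")
  data.frame(grid[i, ], n.trees = best,
             test_error = mean((p > 0.5) != test_data$CRASH))
})
do.call(rbind, results)  # one row per (learning rate, complexity) setting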
4.2. Variable Importance
The importance of a variable in a simple single tree is measured by the number of times the variable is used as the splitter and by the squared improvement of the tree attributable to splits on that variable. After summing these quantities over the ensemble of trees, the average of the summation is regarded as the variable's importance in the model; a high value indicates a high level of contribution to the prediction [34]. In this study, the authors use this same algorithm to measure variable importance.
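With a fitted gbm object (again, an assumed implementation), this averaged split-improvement measure is exposed as relative influence; rescaling against the top variable reproduces the two columns of Table 3. The sketch assumes fit and best_iter from the earlier example.

# Relative influence, normalized so values sum to 100
# ("Influence percent" in Table 3).
infl <- summary(fit, n.trees = best_iter, plotit = FALSE)

# Rescale so the top variable equals 100 ("Relative importance").
infl$relative <- 100 * infl$rel.inf / max(infl$rel.inf)
head(infl, 10)  # top ten contributors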
Table 3 presents the relative importance of each contributor based on the selected optimal GB model. The "Relative importance" column shows the importance value of each variable, with 100 assigned to the most important variable and all other contributors scaled as percentages relative to it. "Influence percent (%)" is an absolute importance measure indicating how much each variable contributes to the prediction. Twenty-eight of the thirty factors are identified as having impacts on crashes at HRGCs, and the top ten factors contribute about 60% of the total influence. Single-factor influence ranges from 1% to 11%: highway traffic and daytime railroad traffic alone contribute about 20% of the influence on crashes, while the majority of factors contribute less than 5% each. In other words, crashes at HRGCs are complicated and cannot be explained by only a few factors, but highway traffic and daytime rail traffic have the most influential impact on crash likelihood.
Among all 28 influential factors, average annual daily highway traffic, daily through-train traffic, train detection type, nightly through-train traffic, average train speed, and the number of traffic lanes are the top six contributors to crash prediction. Four of these six are traffic-related variables describing highway and railway traffic exposure, and together they contribute about 30% to the prediction. Most of the predictors (17 out of 28) are crossing characteristics, such as SPSEL, ADVWARN, and PAVEMRK, which provide information about warning systems and train-detecting systems; cumulatively, they contribute about 50% of the impact.
4.3. Marginal Effect of Contributing Variables
One criticism frequently found in the literature of newer predictive modeling approaches such as gradient boosting is their difficulty of interpretation relative to linear regression models. For that reason, a partial-dependence plot analysis is conducted in this study. Partial-dependence plots can be viewed as a graphical counterpart of the coefficients for each individual independent variable; essentially, partial plots are model-based simulations [36]. The values on the y-axis are the modeled values of the response variable, and a positive y value indicates that the contributing variable, at the corresponding value, has a positive influence on the classification. In this study, all other contributors are held at their mean values while the influence of the target variable is examined. Figure 1 illustrates the use of partial-dependence plots to characterize the marginal effects of the three types of contributors: traffic, highway, and crossing characteristics.
Figure 1: Partial-dependence plots for the top contributors: (a) AADT; (b) DAYTHRU; (c) NGHTTHRU; (d) AVG_TRAIN_SPEED; (e) HWYSYS; (f) SPSEL; (g) TRAFICLN.
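For reference, partial-dependence curves such as those in Figure 1 can be drawn directly from a fitted gbm object (assumed implementation, reusing fit and best_iter from the earlier sketches). Note that gbm's plot method averages over the other predictors rather than fixing them at their means, a closely related but not identical convention to the one described above.

# Marginal effect of a single contributor on the model's response scale.
plot(fit, i.var = "AADT", n.trees = best_iter)

# Two-variable partial-dependence surface, e.g., highway and rail traffic.
plot(fit, i.var = c("AADT", "DAYTHRU"), n.trees = best_iter)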
4.3.1. Traffic Characteristic Variables
Figures 1(a)–1(d) present the effects of AADT, DAYTHRU, NGHTTHRU, and AVG_TRAIN_SPEED on crash likelihood at HRGCs, respectively. The nonmonotonic curves indicate a clearly nonlinear, dynamic, and complex relationship between each target contributor and crash likelihood. Note that each impact reflects the effect of the target variable while all other contributors are held at their mean influence levels. Nevertheless, a roughly increasing pattern exists for all traffic exposure variables except nighttime train volume. In Figure 1(a), the crash impact suddenly reaches a peak when AADT is about 500, indicating that a crash is very likely to occur at that volume. For AADT greater than 500, crash likelihood generally increases gradually with AADT; however, the relationship is not monotonic, with two further peaks at AADT of 2,500 and 10,000. The nonmonotonicity arises because all other contributors are held at their mean levels. In Figure 1(b), crash likelihood stays roughly constant when DAYTHRU is between 7 and 20 but starts increasing once daytime train traffic exceeds 20 trains per day. In Figure 1(c), crash likelihood fluctuates at a low level before NGHTTHRU reaches 11, beyond which a sudden dramatic increase is observed: crash likelihood rises sharply as nighttime through-train traffic increases from 11 to 13 and remains high above 13. This pattern indicates that, while nighttime train traffic is below 11 and its impact is near zero, the other contributors held at their mean levels dominate the influence on crash likelihood. Figure 1(d) suggests that a crash is less likely when train speed is below 30 mph and that crash likelihood increases dramatically as train speed rises toward 35 mph, with the other contributors held at their mean influence levels. As shown, crossings with trains travelling at speeds between 3 and 13 mph are less likely to have a crash.
4.3.2. Highway Characteristic Variables
Figure 1(e) shows the effect of one of the highway characteristic variables: HWYSYS. It is found that crashes tend to happen at crossings intersecting Federal-aid highways (coded as 3). In contrast, crossings intersecting with non-Interstate highways (coded as 2) or non-Federal-aid highways (coded as 4) are less likely to have a crash.
4.3.3. Crossing Characteristic Variables
The effects of crossing characteristic variables are critical for HRGC design. Figures 1(f) and 1(g) show the effects of two crossing characteristic variables, SPSEL and TRAFICLN, respectively. Figure 1(f) indicates that a direct-current audio frequency overlay (SPSEL = 2) installed at an HRGC helps reduce the likelihood of crashes; it also suggests that crashes tend to happen at HRGCs with constant warning time (CWT) systems. With a CWT system, the warning signal is activated so as to provide a constant preselected warning time, usually 25 seconds, so for a slow-moving train the distance between the train and the crossing can be much shorter than for a faster-moving train. However, a CWT system cannot accurately measure a change in speed, which can result in variability in the actual warning time, oftentimes less than the desired warning time. As shown in Figure 1(g), highways with no more than 2 lanes have a negative impact on crash likelihood, while a highway with 4 lanes has the highest positive impact on crash prediction. This is potentially caused by more complicated traffic conditions involving lane-changing activity, which can block the driver's view of the approaching crossing and trains.
4.4. Model Forecasting Accuracy Assessment and Comparison
Prediction results can be summarized in a classification table (Table 4), from which model prediction accuracy measurements are computed. The observed event class is represented by "Present" in Table 4, and the observed nonevent class by "Absent." If an observation is predicted to be in the event class, it is indicated as "Positive," otherwise "Negative." The numbers of true positives (TP) and true negatives (TN) indicate correct predictions, while the numbers of false positives (FP) and false negatives (FN) indicate predictions that contradict the observed conditions.
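In R, this classification table is a two-way frequency table of predicted against observed classes. A sketch, again assuming fit, best_iter, and test_data from the earlier examples:

pred_class <- as.integer(predict(fit, test_data, n.trees = best_iter,
                                 type = "response") > 0.5)

# Rows: predicted (Positive = 1 / Negative = 0);
# columns: observed (Present = 1 / Absent = 0).
cm <- table(Predicted = pred_class, Observed = test_data$CRASH)
TP <- cm["1", "1"]; FP <- cm["1", "0"]
FN <- cm["0", "1"]; TN <- cm["0", "0"]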
Even though the prediction performance of a model is a critical indicator, only a limited number of researchers have published prediction performance results in their studies [40–43]. Those who do report prediction assessments tend to evaluate accuracy selectively rather than provide a full picture of the model's forecasting skill, because prediction performance serves in their studies as a validation tool rather than as a focus of assessment. The most commonly selected measurements are described in equations (1)–(3) for the event class, nonevent class, and overall prediction, respectively:

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \quad (1)$$

$$\text{Specificity} = \frac{TN}{TN + FP}, \quad (2)$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}. \quad (3)$$
Sensitivity and specificity compute the number of correct predictions given the number of observed conditions; sensitivity is elsewhere referred to as recall or the true positive rate, and specificity as the true negative rate. Equation (1) indicates that, among all observed present conditions (TP + FN), the model makes TP correct predictions; however, it ignores the number of false positive (FP) predictions. The issue with ignoring FP is that sensitivity can be made high at the cost of a great number of false positive forecasts, which could waste the limited safety improvement budget if decision makers rely on the model's forecasts to allocate funds. The traditionally selected prediction accuracy parameters, sensitivity, specificity, and accuracy, therefore only partially represent a model's predictive power. To draw the full picture of a model's prediction accuracy, three additional measurements should also be included in the analysis.
For the event and nonevent classes, respectively, the positive predictive ratio (PPR) and negative predictive ratio (NPR) give the true positive and true negative prediction rates over the predicted conditions, as shown in equations (4) and (5). Overall forecast rates are calculated by equations (3) and (6) for the correctly classified share (accuracy) and the misclassified share (false rate, FR), respectively:

$$\text{PPR} = \frac{TP}{TP + FP}, \quad (4)$$

$$\text{NPR} = \frac{TN}{TN + FN}, \quad (5)$$

$$\text{FR} = \frac{FP + FN}{TP + TN + FP + FN}. \quad (6)$$
The greater the value of each indicator except FR, the better the forecasting power of the model.
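All six measurements follow directly from the four cells of Table 4. A small helper, assuming the TP, FP, FN, and TN counts extracted above:

six_measures <- function(TP, FP, FN, TN) {
  n <- TP + FP + FN + TN
  c(sensitivity = TP / (TP + FN),   # eq. (1): recall, true positive rate
    specificity = TN / (TN + FP),   # eq. (2): true negative rate
    accuracy    = (TP + TN) / n,    # eq. (3): overall correct rate
    PPR         = TP / (TP + FP),   # eq. (4): positive predictive ratio
    NPR         = TN / (TN + FN),   # eq. (5): negative predictive ratio
    FR          = (FP + FN) / n)    # eq. (6): false rate
}

round(six_measures(TP, FP, FN, TN), 3)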
In this study, all six measurements are evaluated for the proposed gradient boosting model and compared with the decision tree model results [28], which serve as a reference level. The results are shown in Table 5.
Several interesting findings can be drawn from Table 5:
(1) Forecasting power judged only on accuracy, sensitivity, and specificity can be overestimated. High values of those measurements can be biased by the high volume of nonevent forecasts in an imbalanced dataset such as a crash dataset, where crash events are rare relative to noncrash events.
(2) The relatively low positive predictive ratios for both models are caused by the imbalanced dataset with low separability. The positive predictive ratio, also known as precision, reflects the false alarm rate; this value is often low for imbalanced data because models improve accuracy and sensitivity by tolerating more false positive estimates. In this study, even though the GB model outperforms the DT model in terms of PPR and sensitivity, both models still fail to provide sound precision for imbalanced data such as the crash data used here. This can be very troublesome: when decision makers depend on the model's estimates to allocate budget, they may incur unnecessary spending and waste scarce resources.
(3) By considering the comprehensive accuracy assessment of the proposed six measurements, one obtains a complete picture of the model's true forecasting capability and its forecasting trade-offs.
(4) The GB model is superior to the DT model, simultaneously improving sensitivity and precision; in all measurements, the GB model outperforms the DT model.
(5) For the GB model's training dataset, comparing sensitivity, precision, NPR, and specificity shows that only 27.3% of the GB model's event estimates are correct even though they account for 88.6% of actual observed events, whereas 99.1% of noncrash forecasts are correct, accounting for 84.1% of actual observed nonevents. The same pattern is found for the DT model. Both models thus overestimate crashes (more false alarms than true event estimates) and underestimate noncrashes (fewer misses than true nonevent estimates).
(6) In most measurements, both the GB and DT models perform better on the training dataset and somewhat worse on the testing dataset.
(7) It is also worth mentioning that accuracy is greater than 84% in the GB model, an outstanding improvement compared with previous studies [44–46].
5. Research Summary
As demonstrated, the GB model can accurately identify contributing variables and, through regularization parameter simulation analysis, determine an optimal model that avoids overfitting. More importantly, it also provides easy-to-interpret relative importance rankings of influential variables and partial-dependence plots that present the influential variables' marginal effects on crash prediction.
The GB model overcomes many shortcomings of common statistical models, such as limited capability to model underdispersed crash data and poor forecasting performance for rare events (crashes), by using both training and testing data. Compared with the DT model results, the GB model demonstrates its ability to successfully identify contributing variables and their relative importance levels.
Moreover, the proposed evaluation approach based on six measurements can assess model prediction accuracy more thoroughly and comprehensively in a classification study. The GB model is superior to the DT model in terms of improved forecasting accuracy on both the training and testing datasets. However, both the current GB and DT data-mining models overestimate events (false alarms) to increase their event coverage rate (sensitivity), which can mislead decision makers into wasting safety improvement resources on unnecessary crossings. Handling imbalanced datasets in machine learning and improving forecasting performance need future research attention.
Data Availability
The data that support the findings of this study are available from the authors on request.
Disclosure
The content of this paper reflects the views of the authors, who are responsible for the facts and accuracy of the information presented.
Conflicts of Interest
The authors wish to confirm that there are no known conflicts of interest associated with this publication.
Authors’ Contributions
Dr. Pan Lu, Dr. Denver Tolliver, and Dr. Ying Huang are responsible for the research idea and study concept initiation. Dr. Pan Lu, Dr. Denver Tolliver, and Dr. Zijian Zheng designed the study. Dr. Zijian Zheng and Amin Keramati collected the data. Dr. Zijian Zheng and Dr. Pan Lu performed the model construction. Dr. Ying Huang is responsible for lab/software resource support. Dr. Zijian Zheng, Dr. Pan Lu, Xiaoyi Zhou, and Amin Keramati carried out the analysis and interpretations of results. Dr. Zijian Zheng, Dr. Pan Lu, and Dr. Denver Tolliver prepared the draft manuscript. All the authors reviewed the results and approved the final version of the manuscript.
Acknowledgments
The authors would like to express their deep gratitude to the North Dakota State University and the Mountain-Plains Consortium (MPC), a University Transportation Center funded by the U.S. Department of Transportation.