Abstract

Transportation agencies build statistical models and predict the average crash frequency to identify hazardous road sections and make informed decisions to reduce crashes. In this paper, safety performance models (SPFs) were built and evaluated considering various pavement and roadway characteristics, including pavement friction, which is seldom available for analysis. Four count data models—Poisson model, negative binomial (NB) model, hurdle-NB model (HNB), and zero-inflated NB (ZINB) model—were built based on roadway characteristics and crash data provided by the Oklahoma Department of Transportation (ODOT). Pavement friction, roadway geometry, surface condition characteristics, and traffic exposure were considered the contributing factors to traffic crashes. Established models were compared in terms of the goodness-of-fit, zero inflation, and statistical significance of factors. The HNB model exhibited promising fitting performance with a manageable number of influencing variables. Coefficients in the HNB model suggest that adequate pavement friction and the presence of shoulders can significantly reduce the crash frequency and thus improve roadway safety performance. Potential issues of the statistical models, such as unobserved heterogeneity and multicollinearity, were also discussed. The relation between roadway infrastructure characteristics (including pavement friction) and roadway safety revealed in this study could assist in choosing the proper statistical model for better decision-making and selecting appropriate preventive treatments for improved roadway safety.

1. Introduction

Transportation systems play crucial roles in providing the foundation for economic and social development around the world. Although transportation agencies strive to improve transportation safety, nearly 6.6 million vehicles encounter road traffic crashes every year. Crashes result in tremendous financial, corporeal, and emotional affliction to the affected families, as well as productivity reduction to the whole society [1]. In the United States, the National Highway Traffic Safety Administration (NHTSA) estimated that 38,680 people perished in traffic crashes in 2020, a 7.2% increase compared to the 36,096 fatalities reported in 2019 [2]. Identifying the significant contributing factors to crashes and recognizing their effects are essential for proactive decision-making to reduce roadway crashes [3].

Researchers have put substantial efforts into investigating the intrinsic correlation between crash frequency and risk factors, including implementing traffic crash prediction models as valuable tools for safety analysis [3]. Recently, there has been an ever-growing investigation on utilizing machine learning (ML) models for predicting crashes [4]. However, statistical models remain the mainstream as the amount of crash data is usually small and inadequate for machine learning [5]. Besides, statistical models have the advantage of clearly defining the functional forms and revealing the significance of the contributing factors [6]. Various statistical models can predict the crash frequency and identify the contributing factors [7]. Several recognized models include the Poisson regression model [8], the negative binomial model [9, 10], and the zero-inflated Poisson/negative binomial model [11]. However, choosing the appropriate statistical model is challenging and confusing on many occasions.

Contributing factors to roadway crashes include human behaviors, road geometry, pavement characteristics, traffic volume, vehicle performance, and environmental conditions [12]. A desirable model for analyzing the correlation regarding crash probabilities should be constructed with detailed driving data (vehicle and driver behaviors), roadway features, environmental factors, and crash data. However, such comprehensive data are seldom available as crashes happen so rarely and randomly that it is costly, even impossible to monitor all the information [7]. As an alternative, researchers have framed approaches to building up safety performance functions (SPFs) to investigate the factors that affect roadway crashes occurring in specific geological spaces (such as roadway segments or intersections) over a particular period (weekly, monthly, annually, or several years). This way of revealing a correlation between crashes and pavement features—the aspect the transportation agencies can control—is supported as research shows that roadway factors contributed to 34% of vehicle crashes [13, 14]. Several pavement variables were found to be statistically significant on crash severity, while their effects on frequency are not consistent [15]. In recent years, pavement properties including skid resistance [16], roughness [17], rutting, and surface texture in terms of mean profile depth (MPD) have been investigated using traditional and advanced models [1820].

For decades, the Oklahoma Department of Transportation (ODOT) has been managing several rigorous database systems, including the roadway safety database, skid testing program, and pavement management data sets. This paper investigates the appropriate approaches to estimating Oklahoma crash frequency to (1) develop general forms of various models, (2) investigate and compare the model performance to address issues such as the overdispersion and preponderance of zeros, and (3) interpret the correlation between crashes with roadway characteristics, including pavement surface friction. The procedure and results of this research are expected to provide a reference for transportation agencies for the selection of statistical models when building SPFs and evaluation of pavement friction significance regarding crashes.

1.1. Data Compilation and Description
1.1.1. Data Source and Compilation

Factors contributing to roadway crashes include three categories: human behavior, vehicle, and roadway environment [21]. In detail, the Highway Safety Manual Data Needs Guide [22] suggests analyses should include data on traffic volume, roadway geometry, site characteristics, and roadway surface conditions for safety analysis and modeling. This study extracted the crash data and the relevant available data sets provided by ODOT. These data sets include safety data from the State-wide Analysis for Engineering & Technology (SAFE-T) database, skid resistance from the Skid Studies Program, pavement surface characteristics, roadway conditions from the Pavement Management System (PMS), and construction records and material properties from the SiteManager® management system.

SAFE-T is a statewide database of crash information gathered from collision reports presented by law enforcement agencies [23]. Since 1998, crash data in Oklahoma have been recorded as point location events with various relevant information. Because crashes are rare and random events naturally fluctuate over time [21], four-year crash data were analyzed to decrease the regression-to-the-mean effect. This effect describes a statistical phenomenon that pavement sections with more crashes in a specific period tend to have fewer during the following period, even without disturbance or measures. In particular, crash data on the Interstate Highway System, US highways, and the state highways from 2012 to 2015 in Oklahoma were authorized and obtained from the SAFE-T database.

The Strategic Asset & Performance Management (SAPM) Division at ODOT historically managed the Skid Studies Program for monitoring the pavement friction in Oklahoma. The friction was measured using a locked-wheel skid tester operating at 80 km/h (50 mph), reported at an interval of approximately 0.8 km (0.5 miles), and saved in separate files by control sections. In recent years, ODOT had downsized the scope of skid studies from the entire statewide highway system to a small portion of the network, including annual testing only on US-69 (a truck corridor), interstate highways, as well as the Strategic Highway Research Program (SHRP) sites. With the changing scopes of testing sites and evolving data reporting formats, numerous data preprocessing efforts were made to combine those friction files into one universal database. Friction data from 2012 to 2015 were gathered and summarized in Table 1. It is notable that only a handful of control sections on US-69 were tested in 2013.

The PMS is another database maintained by ODOT recording pavement surface conditions (e.g., surface type, roughness, rutting, and macrotexture), geometry features (e.g., grade and curvature), and other pavement characteristics for the National Highway System (NHS) and State Highway System (SHS) routes [24]. The field pavement data were recorded every 0.01 miles with location labels including highway name, mile point, control section number, control subsection, and GPS. Subsections are the subsets of a control section generated by multiple criteria defined in the ODOT Road Inventory Manual [25]. These criteria include the changes in state highway junctions, political jurisdictions, urban area boundaries, surface widths or type change, and traffic volumes. It is reasonably justified that the traffic characteristics and surface conditions for one subsection are relatively uniform. Hence, it is considered the basic analysis unit in this study, and all crashes, roadway, and related supporting data were gathered and compiled by subsections for further safety analysis.

The maintenance, rehabilitation, and reconstruction (MR&R) activities within the four-year analysis period could affect roadway crash frequency and patterns. Therefore, the sites that had received MR&R work from 2012 to 2015 were excluded from the SPF development. These sites were identified based on the data in the SiteManager® construction database, a database that recorded the contracts of MR&R activities in Oklahoma and provided the project and location description, project number, and allocated days.

1.1.2. Data Summary and Correlation

After significant effort compiling various data items, all required data were connected by subsection and travel direction for each pavement segment. Consequently, 1,811 subsections were screened out with complete data sets for SPF modeling. The distribution of the number of crashes—the dependent factor––at these subsections over four years was presented in Figure 1. As observed, there exists an excess number of zeros, and it is a distinct right-skewed distribution.

The statistical descriptions of the compiled data sets are summarized in Table 2. The contributing factors in the compiled data sets include traffic volume in terms of the average annual daily traffic (AADT), pavement surface characteristics (friction, texture, surface type, and roughness), and roadway geometry features (longitudinal grade, horizontal curvature, and presence of shoulder or median). The segment length and AADT were transferred to the natural logarithm values to fit in the generalized linear regression [19, 26]. The classification of IRI (international roughness) is based on the FHWA recommendations, where 95 in/mile (1.5 m/km) is the upper threshold of “good” ride quality, 170 in/mile (2.7 m/km) is the upper threshold for “acceptable” ride quality, and IRI beyond 170 in/mile (2.7 m/km) is classified as “poor” ride quality [27, 28]. Following the AASHTO recommendation, the rut depth threshold is set as 0.5 in (12.7 mm) between high and medium rut severity and 0.25 (6.4 mm) between medium and low rut severity [29, 30]. It is observed that the proportion of segments with “good” pavement conditions is smaller than the national average according to FHWA Conditions and Performance Report to Congress [28]. This results from the filtering process used in this paper, which had excluded the segments with maintenance, rehabilitation, and repair (MR&R) work from 2012 to 2015 in order to eliminate its effect on changing crash patterns. As a result, the proportion of roadway segments in “poor” conditions was higher than expected. In addition, some research found that such data filtering may also help handle the unbalanced feature of crash data [26].

Table 3 summarized the correlation coefficients among the influencing variables and the corresponding significance level. The Pearson’s correlation coefficients shown in the lower triangular cells evaluate the linear relationship between two continuous variables. The Spearman correlation coefficients shown in the upper triangular cells evaluate the monotonic relationship between two variables. The coefficients with an absolute value larger than 0.5 were marked with bold fonts. For example, the positive correlation coefficient (0.56) between the existence of median and the natural log of traffic volume shows that segments with higher traffic volume are often constructed with a median. Meanwhile, the significance level is calculated to determine whether the correlation between variables is significant. In general, the significance level with a -value lower than 0.05 indicates a correlation exists between the two variables, and a correlation coefficient greater than 0.7 demonstrates the multicollinearity between the variables [3133]. As observed in Table 3, some variables are correlated with a -value less than 0.05, while no correlation coefficients have an absolute value higher than 0.7, indicating the chosen predictors have no evidence of manifest multicollinearity and can be used for building generalized linear models.

2. Methodologies

Crash occurrence, statistically considered as count data, can be appropriately modeled using several methods; the most popular applied are Poisson and negative binomial (NB) models. The hurdle negative binomial (HNB) and zero-inflated negative binomial (ZINB) regression models were also developed and implemented for highway safety analysis to address the phenomena of zero inflation.

2.1. Poisson Model

Poisson models often behave well in approximating non-negative integer count data that follows the Poisson distribution [34]. In the crash prediction case, the probability of subsection having (0, 1, 2, …) crashes is given bywhere Poisson parameter is the predicted or the expected number of crashes per year at subsection . Poisson model identifies as a function of explanatory variables. The commonly selected functional form (or function link) is in the log-linear form:where and are vectors of estimable parameters and explanatory variables independently. Maximized log-likelihood method is applied to estimate the regression coefficients . The expected number of crashes on a specific subsection is given by .

As the Poisson model assumes the dependent variable follows the Poisson distribution, the variance of the number of crashes is assumed to be equal to its mean; thus, the dispersion parameter is fixed at 1. If this assumption is violated, on conditions of either overdispersed () or underdispersed (), the resulting parameter estimate will be biased [35].

2.2. Negative Binomial Model

The negative binomial (ZB) model is derived from the Poisson model to solve the overdispersion phenomenon in count data sets [36]. ZB model assumes thatwhere is a gamma-distributed error term and and are defined in the Poisson model. The addition of relaxes the constraint of equal mean and variance as follows:where is often referred to as the dispersion parameter. Poisson model can be regarded as a negative binomial model with of 0. The NB model is predominant in traffic crash data analysis, especially when the dependent variable is overdispersed distributed [35]. Nevertheless, the NB model is limited in handling an excess of zeros [37].

2.3. Hurdle Negative Binomial Model

Hurdle models can depict data with overdispersion and excess-zero observation by combining two components: a truncated count component (Poisson or negative binomial) for modeling positive counts and a hurdle component (binomial model) to characterize zeros [38]. The hurdle model can be structured as follows:where resembles the probability of one pavement segment belonging to the zero modules, represents a probability mass function for a normal count distribution with parameters vector , and is the count distribution evaluated at zero [39]. The zero predictions in the Hurdle model all come from its zero module.

2.4. Zero-Inflated NB Models

Zero-inflated models are also two-component mixture models that deal with overdispersed data with excess zero counts, but the sampling zeros occur by chance from two components [40]. The zero-inflated model can be structured as follows:where represents the probability of a structural zero with and defined in the Hurdle model. As such, zero numbers come from both “structural” origins and “sampling” component [39].

3. Results and Discussion

3.1. Model Results

As the first attempt, the Poisson model was fitted to seize the relationship between the number of crashes and available pavement characteristics. A stepwise process was conducted to optimize the modeling results. In each step, a variable was subtracted from the set of explanatory variables by minimizing Akaike’s Information Criteria (AIC) [41]. The best-fitted Poisson model has the smallest AIC in the stepwise processing, and the model results on studied 1,811 segments are summarized in the first column of Table 3. All variables in the fitted Poisson model are statistically significant in determining the crash likelihood. However, a Poisson model requires the dispersion parameter to be approximately 1, with equal mean and variance. However, the actual dispersion parameter in this study was calculated to be 31.7, which is significantly larger than 1, indicating that distinctive overdispersion characteristics exist in the data sets, and thus, the Poisson regression assumption is violated.

To address the observed overdispersion of the crash data, attempts were made to analyze crash frequency using the NB model, hurdle negative binomial model (HNB), and ZINB regression techniques. The NB model was chosen for its popularity in tacking overdispersion; ZINB and HNB were selected to deal with both overdispersion and the preponderance of zeros. Similarly, for each type of model, the stepwise variable selection method was applied to optimize the fitting performance by minimizing the AIC of the models. It is noteworthy that ZINB and HNB are composed of two model parts so the stepwise optimization should be conducted on each model part separately. The regression results of all investigated models are presented in Table 4 with the estimated coefficients and their corresponding significance levels [42]. As shown in Table 4, the subsection length, traffic volume, average grade, and length of the curve are statistically significant with positive estimated coefficients across all the models. Positive coefficients indicate that crash frequency increases with these parameters. In other others, fewer crashes are expected on segments subsections with lower traffic volumes, smaller grades, and shorter lengths of curves. This observation is consistent with the analysis in former research [15].

On the contrary, the presence of shoulders has a negative coefficient in all the models, indicating that the presence of the shoulder can decrease the expected crash frequency. Likewise, using the asphalt concrete (AC) surface type as the benchmark for comparison, the negative coefficient for the pavement type parameter indicates that continuously reinforced concrete pavements (CRCP) and jointed concrete pavements (JCP) are associated with a lower level of crash frequencies. This indicates the pavement type does have an impact on the probability of crashes. However, some parameters exhibit different levels of statistical significance or even opposite trends of estimates on the crash frequency among the four types of models. Such differences could be originated from unobserved heterogeneity within the data sets in this study. On the other hand, such differences could provide necessary information for model performance evaluation so that the best model could be recommended.

3.1.1. Comparisons and Interpretation

An intuitive investigation of the performance of fitted models is the plotting of the predicted and observed density distributions, as shown in Figure 2. The observed distribution of the number of crashes is presented in the histogram. Meanwhile, the predictions of the four fitted count models are plotted in lines with different patterns and colors. As observed, the Poisson model prediction results show a distinct difference with the distribution for the field crash observations, primarily when it is used to predict sections with a small number of crashes. Such deviations are primarily caused by the overdispersion of the dependent variable, which could be handled better through the NB, HNB, and ZINB models.

Another approach to evaluating the overdispersion phenomenon is the rootogram, as shown in Figure 3. Rootograms graphically compared the frequencies of fitted and observed probability. This method plotted the predicted frequency with the thick red line and observed counts with bars hanging from the red line of expected counts [43]. The horizontal line at 0 frequency distinguishes where the model is over- or underpredicting. If the hanging bar (or rootogram) does not reach the zero line, the model overpredicts the occurrence at that count range. In contrast, if the bar exceeds the zero line, the model underpredicts the crashes.

As observed in Figure 3, the rootograms for the Poisson model dramatically exceed the zero line when the crash frequency is low, indicating that the Poisson model predicted much fewer crashes than the observed values from the field. The rootograms hanging above the zero lines indicated overfitting at larger crash counts (ranging from approximately 6 to 10). In other words, the variability of the observed crash data is much greater than what a Poisson model could predict. The other three models (NB, HNB, and ZINB) perform better as their rootograms converge around the zero lines, which are expected for a preferred crash prediction model. Minor differences were observed among the NB, HNB, and ZINB models from their rootograms, and thus, additional comparisons are needed to identify the best model type.

The Akaike information criterion (AIC) and Bayesian information criterion (BIC) are the two most well-known criteria for model selection, and the model with smaller AIC and BIC values was preferred. The AIC and BIC values for the four investigated models are summarized in Table 4. The HNB and ZINB models have the smallest AIC and BIC values, followed by the NB model. The Poisson model has much larger AIC and BIC values since the overdispersion cannot be considered. However, the differences between the HNB and ZINB remain insignificant in terms of the AIC and BIC values.

As excess zeros were observed in the crash data, the predicted zero values from the fitted regression models were also listed in Table 4. There were 625 roadway sections without any observed crashes. Due to its lack of considering the overdispersion distribution of dependent variables, the Poisson model illustrated the worst performance with only 142 zero predictions, consistent with the analysis results shown in Figure 3. The HNB model predicted the desired number of zeros because of its modeling structure. The ZINB model overpredicted the zeros, and the NB model underpredicted the zeros. Since the zero counts in this study are not dominating (625 divided by 1,811, which is 34.5%), the differences between the HNB and ZINB models are not apparent for this case study. However, the zero-state hypothesis in the zero-inflated count models has been criticized by researchers and is not widely accepted in the analysis of crash frequency [26, 44, 45].

Based on the comparisons of model performance, the HNB model is recommended as the superior model for developing the safety performance function in this study. The regression coefficients of the influencing variables, as shown in Table 4, can be used to establish the HNB model for crash frequency prediction and assist decision-making for improved roadway safety. Although the segmentation length and pavement type generally do not change after construction, transportation agencies can take measures to preserve and improve pavement conditions. The mean profile depth, on the other hand, has one negative coefficient in the zero component and a positive coefficient in the count model component, indicating two-sided effects on the predicted number of crashes. It may be caused by the driver’s abstraction on pavements with excellent conditions but being more cautious under poor pavement conditions [15]. The negative estimate of average friction and positive coefficients in the IRI ranking suggest countermeasures can be taken by improving friction and decreasing roughness. It should be noted that the observed crashes in segments decrease slowly with friction, between 2.5% and 0.9% per unit of friction in skid number. This observation is consistent with former research and provides support in maintaining a proper level of friction that helps improve transportation safety [18].

3.1.2. Limitations and Discussions

Despite the investigated models revealing the statistically significant relationship between the roadway characteristics and crashes, it is worthwhile to note that traffic crash prediction is a complex topic, and a variety of influencing factors are not available during data collection [18, 46]. These related factors include driver behavior, vehicle conditions, and environmental variations, which were generally assumed identical and characterized as assumptions of the generalized linear models. However, these assumptions may diverge from real transportation observations and result in a high likelihood of unobserved heterogeneity [47, 48].

Even for the listed independent factors, the investigated models in this paper were all fixed models, which make an inherent assumption on homogeneous effects of the observed factors. This assumption may not hold firmly when inspecting the data sets closely. Obeying this assumption means that in a generalized linear model, a linear relationship with identical slopes and intercepts across different groups should exist between the linked variable and independent parameters [18, 19, 49]. Figure 4 shows the natural log value of crash numbers (all zero crashes were added by 1 to be mathematically logical) versus the natural log of traffic volume (a) and average segment friction (b). For each pavement type, their correlations to independent variables were shown with regression lines and confidence bands (different colors). It is observed that the slopes and intercepts vary across different pavement types, indicating the homogeneous assumption is challenged and unobserved heterogeneity presents in the data sets.

Another assumption of the investigated models is that all independent parameters were assumed to have no mutual dependence and no correlation. The correlation coefficients shown in Table 3 reveal no severe multicollinearity, but there are some degrees of correlations between independent parameters. In fact, correlations between pavement characteristics are often observed. For example, a pavement segment with poor conditions often simultaneously shows elevated rutting and roughness. Neglecting cross-correlations among investigated factors may result in the downward bias of the estimated parameter variances [19].

The application of uncorrelated and correlated random parameters models has the potential to partially account for unobserved heterogeneity and cross-correlations. Several studies have revealed the statistical superiority of the correlated random parameters count model [19, 20, 50]. However, these models involve computation and modeling complexity when facing many independent factors. More advanced models including uncorrelated and correlated random parameters models will be investigated and applied in future research efforts.

4. Conclusions

The contribution of this paper is twofold in both the modeling process and results. First, the SPFs for roadway segments in the AASHTO HSM mainly include two categories of crash-related indicators: the annual average daily traffic (AADT) and the segment length, while the roadway infrastructure-related characteristics, which have potential effects on highway safety, have not been thoroughly investigated for predicting of crash rates. This study acquired and compiled extensive data sets in Oklahoma and included various roadway infrastructure-related characteristics in the model. Pavement surface characteristics (friction, IRI, rutting, and surface type) and roadway geometry (longitudinal grade, horizontal curvature, length of curve, number of lanes, presence of shoulder, and median) were identified to be statistically significant to roadway crashes. The developed SPF with roadway infrastructure factors assists knowledgeable decisions to select pavement preservation and maintenance practices.

Secondly, this paper implemented and compared the performance of four statistical counting models—Poisson model, negative binomial (NB) model, hurdle negative binomial (HNB) model, and zero-inflated negative binomial (ZINB) model—for the development of SPFs using the Oklahoma data sets. Various model comparison and evaluation methods were applied for the recommendation of the best model type for roadway crash modeling and safety analysis, with the following key findings:(i)The overdispersion problem can be addressed by the NB, HNB, and ZINB models.(ii)The HNB and ZINB models can better predict the excessive zeros in the crash data than the NB model. Based on comparisons between the goodness of fit and the number of independent variables, the HNB is recommended as the preferred model.(iii)The HNB regression results indicate that improving pavement surface friction and adding outer shoulders could be effective MR&R candidates in reducing the number of crashes and improving roadway safety performance.

It should be acknowledged that the data used in this study are limited in size and scope. Additional data items with a wider range of roadway facilities should be collected and used to develop comprehensive SPFs for the state. Besides, advanced models such as correlated random parameter models should be investigated and compared in future research.

Data Availability

All data, models, or codes that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Authors’ Contributions

Joshua Q. Li and Wenyao Liu conceptualized the study; Wenyao Liu did the formal analysis; Joshua Q. Li contributed to collecting resources; Wenyao Liu contributed to writing the original draft and did the revision; Joshua Q. Li and Pan Lu reviewed and edited the manuscript; Wenyao Liu and Xue Yang contributed to visualization; and Kelvin C. P. Wang supervised the study. All authors have read and agreed to the final version of the manuscript.