Abstract
Quick and accurate crash detection is important for saving lives and improving traffic incident management. In this paper, a feature fusion-based deep learning framework was developed for video-based urban traffic crash detection, aiming to balance detection speed and accuracy under limited computing resources. In this framework, a residual neural network (ResNet) combined with attention modules was proposed to extract crash-related appearance features from urban traffic videos (i.e., a crash appearance feature extractor), which were further fed to a spatiotemporal feature fusion model, Conv-LSTM (Convolutional Long Short-Term Memory), to simultaneously capture appearance (static) and motion (dynamic) crash features. The proposed model was trained on a set of video clips covering 330 crash and 342 noncrash events. Overall, the proposed model achieved an accuracy of 87.78% on the testing dataset and an acceptable detection speed (FPS > 30 on a GTX 1060). Thanks to the attention module, the proposed model captures the localized appearance features of crashes (e.g., vehicle damage and fallen pedestrians) better than conventional convolutional neural networks. The Conv-LSTM module outperformed conventional LSTM in capturing the motion features of crashes, such as roadway congestion and pedestrians gathering after crashes. Compared to a traditional motion-based crash detection model, the proposed model achieved higher detection accuracy. Moreover, it detected crashes much faster than other feature fusion-based models (e.g., C3D). The results show that the proposed model is a promising video-based urban traffic crash detection algorithm that could be used in practice in the future.
1. Introduction
Traffic crashes can cause property damage, injuries, fatalities, and nonrecurrent congestion. Accurate and fast crash detection can improve the response speed of incident management, which in turn reduces the injuries/fatalities and congestion induced by crashes. Thus, developing such crash detection methods is necessary and important for traffic incident management.
Traditional crash/incident detection methods mostly rely on traffic flow modeling techniques [1–7]. The basic idea of traffic flow modeling is to identify nonrecurrent congestion based on data from loop detectors, microwave sensors, and probe vehicles. However, nonrecurrent congestion and recurrent congestion can be difficult to differentiate without sufficient and reliable historical data. Thus, the performance of the traffic flow modeling approach depends heavily on the quality of the data obtained from traffic detectors. Moreover, it can fail when the traffic environment is too complex (e.g., multimodal traffic in urban areas), so the detection accuracy of such methods is not always guaranteed. Another emerging approach is to identify incidents based on crowdsourcing data [8]. However, such methods can suffer from underreporting when there is no witness around the incident scene. Nowadays, with the development of intelligent transportation systems (ITS), video cameras have been widely installed in many cities and on highways. Thanks to their wide coverage, vision-based crash detection techniques have gained increasing research attention in recent years [9]. Their basic concept is to automatically identify crash scenes from the features of traffic images/videos through computer-vision techniques. Such techniques, as a promising intelligent crash detection method, are expected to significantly reduce human labor and have achieved relatively high detection accuracy [10–12].
To ensure detection accuracy, a video-based crash detection method needs to be capable of extracting important crash features from traffic images/videos. In general, there are two main types of features of interest: motion (temporal) features and appearance (spatial) features. Appearance features include apparent vehicle damage, vehicle rollovers, and fallen pedestrians. Motion features, such as the intersection of vehicle trajectories and the gathering of pedestrians, need to be identified continuously over time. From this perspective, current crash detection methods can be classified into two groups: motion feature-based methods and feature fusion-based methods.
Many research works are based on motion features, such as the intersection of vehicle trajectories, the overlap of bounding boxes, and changes in vehicle speed. Some used background subtraction methods to extract vehicles' motion features (acceleration, direction, and velocity), based on which rules and thresholds were applied to identify crashes [9, 13–15]. Maalou et al. [16] tracked vehicle motion based on optical flow methods and used heuristic methods to find a threshold for crash identification. Sadeky et al. [17] used the Histogram of Flow Gradient (HFG) as the motion feature and discriminated crashes from noncrashes with logistic regression. Chen et al. [18] developed an Extreme Learning Machine (ELM) for crash identification based on motion features represented by the Scale-Invariant Feature Transform (SIFT) and optical flow. In recent years, with the development of deep learning methods (e.g., Faster R-CNN (Faster Region-based CNN) [19] and YOLO (You Only Look Once) [20–22]), the performance of vehicle detection and tracking has improved significantly. Vicente and Elian [23] used a YOLO model to detect motion features and a support vector machine (SVM) for crash identification. Lee and Shin [24] used Faster R-CNN for vehicle detection and Simple Online and Real-Time Tracking (SORT) for vehicle tracking, and detected incidents/crashes in tunnels based on the resulting motion features. Paul [25] applied Mask R-CNN (Mask Region-based CNN) for motion feature extraction and used rules for crash detection. Motion feature-based models depend only on vehicle motions, which requires high-precision object detection and tracking. When the traffic environment is complicated, detection and tracking performance can degrade, resulting in low crash detection performance. Furthermore, some crashes, such as vehicle rollovers and fallen pedestrians, may not be detectable from motion alone.
Recently, feature fusion-based (i.e., appearance and motion) crash detection methods have become increasingly popular. There are two main types. The first is based on unsupervised learning methods; for instance, Singh and Mohan [26] and Yao [27] developed crash detection models based on autoencoders. The second type is based on a supervised learning framework, which normally combines a module (e.g., a convolutional neural network) for spatial feature extraction and a module (e.g., a recurrent neural network) for temporal feature extraction. Batanina et al. [28] used a Convolutional 3D (C3D) model to capture both spatial and temporal crash features from simulated crash videos; domain adaptation (DA) transfer learning was then applied to real-world conditions, improving accuracy by 10%. Huang et al. [29] employed a two-stream network to separately extract appearance features and motion features, which were then combined to detect crashes. According to the previous literature, crash detection performance can be improved by feature fusion methods.
Although feature fusion-based methods have achieved better performance than motion feature-based methods, some improvements can still be made. To simultaneously capture motion and appearance features, such models often have complicated structures and a large number of parameters. As such, they require substantial computing resources and long computation times, which prevents them from being used in real-time traffic environments. Thus, current fusion-based models need to strike a better balance between detection accuracy and speed.
To fill this gap, we proposed a new feature fusion-based urban traffic crash detection framework aiming at a good balance between detection accuracy and speed. First, we introduced an attention module into the residual neural network to improve the detection of local appearance features. Meanwhile, we linked the ResNet with a Conv-LSTM model to simultaneously capture the appearance and motion features of crashes. The proposed model is expected to achieve high accuracy as well as fast detection speed. The remainder of the paper is organized as follows: Section 2 introduces the methods used in this study. Section 3 discusses data preparation. Section 4 presents modeling results and discusses research findings. Section 5 provides the conclusions and future directions.
2. Methodology
In this section, we introduce the proposed model in detail. Figure 1 shows the overall framework. First, the attention module is combined with ResNet to capture the appearance features of crash images. ResNet improves on the speed of conventional convolutional neural networks, while the attention module enables the model to focus on localized appearance features instead of irrelevant information, further boosting performance. Then, the output feature map is reduced in dimension via a convolutional layer and fed chronologically into the Conv-LSTM network to further extract the motion features of crashes. Conv-LSTM has an advantage over conventional recurrent neural networks (e.g., LSTM) in that it is lighter and retains spatial information. Finally, a global pooling layer and a fully connected layer are used to classify the clip as crash or noncrash. The following subsections describe the residual network (ResNet), the attention module, and the Conv-LSTM module in detail.
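To make the data flow above concrete, the following is a minimal PyTorch-style sketch of the overall pipeline; the module names (e.g., backbone, conv_lstm), tensor shapes, and channel sizes are illustrative assumptions of ours, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CrashDetector(nn.Module):
    """Sketch of the framework: per-frame appearance features (ResNet + attention)
    -> 1x1 conv channel reduction -> Conv-LSTM over the frame sequence
    -> global pooling -> fully connected crash/noncrash classifier."""

    def __init__(self, backbone, conv_lstm, feat_ch=2048, reduced_ch=256):
        super().__init__()
        self.backbone = backbone            # e.g., ResNet-50 + CBAM, frame -> (B, feat_ch, h, w)
        self.reduce = nn.Conv2d(feat_ch, reduced_ch, kernel_size=1)  # dimension reduction
        self.conv_lstm = conv_lstm          # returns the last hidden state (B, reduced_ch, h, w)
        self.pool = nn.AdaptiveAvgPool2d(1) # global pooling
        self.fc = nn.Linear(reduced_ch, 2)  # crash vs. noncrash logits

    def forward(self, clip):                # clip: (B, T, 3, H, W)
        t = clip.size(1)
        feats = [self.reduce(self.backbone(clip[:, i])) for i in range(t)]
        h_last = self.conv_lstm(torch.stack(feats, dim=1))
        return self.fc(self.pool(h_last).flatten(1))
```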

2.1. Crash Appearance Feature Module (ResNet + Attention)
2.1.1. Residual Neural Network (ResNet)
The residual neural network (ResNet) was proposed in 2015 [30] and has been widely used for extracting image features in various deep learning-based computer-vision tasks. ResNet addresses the training difficulties caused by gradient explosion or vanishing in deep convolutional neural networks. Whereas conventional networks (e.g., VGG (Visual Geometry Group Network)) continuously stack convolutional layers to obtain higher image representation capability, ResNet instead stacks flexible residual modules to obtain stronger expressive ability.
There are two main types of residual modules. The first type (Figure 2(a)) often appears in shallow residual networks (ResNet-18/34). Each residual module includes two convolution layers, and its output is the sum of the input $x_i$ (i.e., the output of the previous residual module) and the convolution of that input. The ReLU activation function is then applied to obtain the output of the current residual module, as shown in the following equation:

$$x_{i+1} = \mathrm{ReLU}\big(x_i + \mathcal{F}(x_i, W_i)\big) \quad (1)$$

where $\mathcal{F}(\cdot)$ is the convolution operation and $i$ is the layer index.

The second type (Figure 2(b)) often appears in deeper residual networks (ResNet-50/101/152). Each residual module includes three convolution layers ($1 \times 1$, $3 \times 3$, and $1 \times 1$), and its output is likewise the sum of the input and its convolution:

$$x_{i+1} = \mathrm{ReLU}\big(x_i + \mathcal{F}(x_i, W_i)\big) \quad (2)$$

where $\mathcal{F}(\cdot)$ is the convolution operation.
Since the residual structure compensates for the unstable training caused by deep architectures [30], ResNet can handle more layers than VGG. Three typical variants have 50 layers (ResNet-50), 101 layers (ResNet-101), and 152 layers (ResNet-152); they are similar in structure. The choice of variant depends on the available computational capability and the amount of training data: a deeper network can be more powerful given adequate training data.
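For illustration, a minimal PyTorch sketch of the bottleneck residual module of equation (2) (the $1 \times 1$–$3 \times 3$–$1 \times 1$ type used in ResNet-50/101/152) is given below; the layer widths follow the standard ResNet design rather than any setting specific to this paper.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck residual module: x_{i+1} = ReLU(x_i + F(x_i, W_i))."""

    def __init__(self, channels, mid_channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, mid_channels, 1, bias=False),                 # 1x1: reduce channels
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1, bias=False),  # 3x3: spatial convolution
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, channels, 1, bias=False),                 # 1x1: restore channels
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.residual(x))  # identity shortcut + convolution branch
```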
2.1.2. Visual Attention Module
In this paper, we further extend ResNet by integrating visual attention modules. The squeeze-and-excitation (SE) block, a visual attention module, was first proposed by Hu et al. [31]. It has been widely used because it is relatively simple and improves the effectiveness of many convolutional network models. The SE block is a channel attention mechanism that assigns different weights to the different channels of a feature map. In convolutional neural networks, different channels correspond to different extracted features, and different classification tasks should emphasize different features. The concept is similar to the way humans identify objects: people may pay more attention to shape when distinguishing cats from dogs, but focus on texture when distinguishing jaguars from leopards (both Felidae). Thus, the SE block improves the feature selection ability of convolutional neural networks.
As shown in Figure 3, SE Block converts the input $X \in \mathbb{R}^{H \times W \times C}$ to $U \in \mathbb{R}^{1 \times 1 \times C}$ through a global average pooling operation $F_{sq}(\cdot)$, as shown in the following equation:

$$U = F_{sq}(X) \quad (3)$$

For each channel $x_c \in \mathbb{R}^{H \times W}$ ($c = 1, 2, \ldots, C$) in the input tensor $X$ and the corresponding element $u_c$ in the output tensor $U$, the following equation holds:

$$u_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j) \quad (4)$$

After the global average pooling, the output $U$ is passed through a fully connected layer with weight $W \in \mathbb{R}^{C \times C}$, that is, $F_{ex}(\cdot, W)$ in Figure 3, and the result $V$ is as shown in equation (5), where "$\times$" refers to matrix multiplication:

$$V = \sigma(W \times U) \quad (5)$$

Here $\sigma(\cdot)$ is the sigmoid activation function, and the result $V$ is also called the attention weight. Finally, the attention weight $V$ and the input $X$ are multiplied channel by channel to adjust the importance of the different channels of the input (equation (6)), where "$\odot$" refers to element-wise multiplication (i.e., the Hadamard product):

$$\tilde{X} = F_{scale}(X, V) = V \odot X \quad (6)$$
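As a concrete illustration of equations (3)–(6), a small PyTorch sketch of the SE block is shown below. It mirrors the single fully connected layer described above; note that the original SE design uses two fully connected layers with a channel-reduction ratio, so this is a simplified reading rather than a reference implementation.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention: squeeze (global average pooling) -> excitation (FC + sigmoid)
    -> channel-wise rescaling, following equations (3)-(6)."""

    def __init__(self, channels):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)   # F_sq: (B, C, H, W) -> (B, C, 1, 1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels),       # W x U
            nn.Sigmoid(),                        # sigma(.)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        u = self.squeeze(x).view(b, c)           # U, equations (3)-(4)
        v = self.excite(u).view(b, c, 1, 1)      # V, equation (5)
        return v * x                             # channel-wise rescaling, equation (6)
```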
An improved visual attention module over SE is the convolutional block attention module (CBAM), first proposed by Gupta [32]. On top of the basic channel visual attention module (i.e., SE), CBAM introduces a spatial visual attention module, as shown in Figure 4. Different from the basic module, the spatial visual attention module first performs maximum pooling and average pooling on the input $X$ along the channel dimension and then converts the resulting two-layer feature map $M$ into a single-layer feature map $S$ through a convolutional layer with weight $W$, as shown in Figure 4. Finally, softmax converts $S$ into a probability distribution that adjusts the importance the model assigns to different spatial positions of the input $X$. The process can be expressed by the three following equations:

$$M = \mathrm{Concat}\big(\mathrm{MaxPool}_{c}(X), \mathrm{AvgPool}_{c}(X)\big) \quad (7)$$

$$S = W * M \quad (8)$$

$$\tilde{X} = \mathrm{softmax}(S) \odot X \quad (9)$$

where $\mathrm{MaxPool}_{c}(\cdot)$ and $\mathrm{AvgPool}_{c}(\cdot)$ denote pooling along the channel dimension and "$*$" denotes the convolution operation.
The CBAM module can be embedded into residual modules to improve their feature selection performance. Figure 5 shows how the two modules are integrated, and a sketch of the spatial branch follows below.
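The spatial branch of CBAM (equations (7)–(9)) can be sketched as follows; in line with the description above it applies a softmax over spatial positions (the original CBAM uses a sigmoid), and the 7 × 7 kernel size is an assumption on our part. The channel branch is analogous to the SE sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise max/avg pooling -> convolution -> softmax weighting."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)  # S = W * M

    def forward(self, x):                           # x: (B, C, H, W)
        max_map, _ = x.max(dim=1, keepdim=True)     # channel-wise max pooling
        avg_map = x.mean(dim=1, keepdim=True)       # channel-wise average pooling
        m = torch.cat([max_map, avg_map], dim=1)    # M, equation (7)
        s = self.conv(m)                            # S, equation (8)
        b, _, h, w = s.shape
        attn = F.softmax(s.view(b, 1, -1), dim=-1).view(b, 1, h, w)  # spatial probability map
        return attn * x                             # equation (9), broadcast over channels
```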

2.2. Feature Fusion Module (Conv-LSTM)
The Conv-LSTM module was first used in precipitation nowcasting [33]; its structure is shown in Figure 6. Traditional LSTM requires the input to be flattened, which often causes a loss of spatial information. The Conv-LSTM module inherits the gating structure of the traditional LSTM but uses convolution as its basic operation so that spatial features are retained. The data modeling process is as follows.

First, the input $X_t$ and the previous hidden state $H_{t-1}$ are stacked along the channel dimension to generate $Z_t$; then a single convolutional layer with weight $W$ performs the convolution operation on $Z_t$:

$$Z_t = \mathrm{Concat}(X_t, H_{t-1}), \qquad A_t = W * Z_t \quad (10)$$

Then, the gates $i_t$, $f_t$, $o_t$ and the candidate state $g_t$ are obtained by applying activation functions to the four channel groups $A_t^{i}$, $A_t^{f}$, $A_t^{o}$, and $A_t^{g}$ of $A_t$, as shown in the following equation:

$$i_t = \sigma(A_t^{i}), \quad f_t = \sigma(A_t^{f}), \quad o_t = \sigma(A_t^{o}), \quad g_t = \tanh(A_t^{g}) \quad (11)$$

Finally, the outputs $C_t$ and $H_t$ of the Conv-LSTM module at time step $t$ are obtained by gating operations, as shown in the two following equations, where "$\odot$" denotes the Hadamard product:

$$C_t = f_t \odot C_{t-1} + i_t \odot g_t \quad (12)$$

$$H_t = o_t \odot \tanh(C_t) \quad (13)$$
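Equations (10)–(13) correspond to a standard Conv-LSTM cell; a compact PyTorch sketch (omitting the peephole connections of the original formulation) is given below to show how a single convolution over the stacked input produces all four gates.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Conv-LSTM cell: one convolution over [X_t, H_{t-1}] yields the four gate maps."""

    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, x_t, h_prev, c_prev):
        z_t = torch.cat([x_t, h_prev], dim=1)            # stack along channels, equation (10)
        a_t = self.conv(z_t)                             # single convolution, equation (10)
        a_i, a_f, a_o, a_g = torch.chunk(a_t, 4, dim=1)  # four channel groups of A_t
        i_t = torch.sigmoid(a_i)                         # input gate, equation (11)
        f_t = torch.sigmoid(a_f)                         # forget gate, equation (11)
        o_t = torch.sigmoid(a_o)                         # output gate, equation (11)
        g_t = torch.tanh(a_g)                            # candidate state, equation (11)
        c_t = f_t * c_prev + i_t * g_t                   # cell state update, equation (12)
        h_t = o_t * torch.tanh(c_t)                      # hidden state, equation (13)
        return h_t, c_t
```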
3. Data Preparation
To the best of our knowledge, there is no public database for the crash detection task. Thus, in this study, all data were acquired from local police in China. We prepared two datasets. The first is an urban traffic image dataset containing 5061 traffic crash images and 5573 noncrash traffic images. The crash images cover multiple types, such as single-vehicle, multivehicle, and non-motorist-related crashes; Figure 7 shows some examples. The second is an urban surveillance video dataset containing 420 crash video clips and 432 noncrash video clips. Each clip lasts around 20 seconds at 24 or 25 frames per second.

The first (image) dataset was used to train the ResNet plus attention module, while the second (video) dataset was used to train the whole network. Transferring the pretrained ResNet module accelerated the convergence of the whole network. Note that images/video clips were manually labeled as either crash or noncrash. As such, the trained model was expected to be capable of identifying crashes among normal traffic scenes.
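The paper does not specify how frames were sampled from each roughly 20-second clip. Purely as an illustration, an evenly spaced sampling routine such as the following (OpenCV-based; the frame count and resolution are hypothetical) could be used to turn each labeled clip into a fixed-length frame sequence for training.

```python
import cv2
import numpy as np

def sample_clip(video_path, num_frames=32, size=(224, 224)):
    """Return up to `num_frames` evenly spaced RGB frames of shape (N, H, W, 3).
    Sampling strategy, frame count, and resolution are illustrative assumptions."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    wanted = set(np.linspace(0, max(total - 1, 0), num_frames, dtype=int).tolist())
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx in wanted:
            frame = cv2.resize(frame, size)
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return np.stack(frames)  # may hold fewer than num_frames for very short clips
```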
4. Results and Discussion
All experiments in this study were carried out on a laptop equipped with an Nvidia GTX 1060 GPU. The detailed specifications are as follows: (1) Intel i7-7700HQ CPU @ 2.80 GHz and (2) GTX 1060 (6 GB) GPU with a core frequency of 1506–1709 MHz and floating-point performance of 4.4 TFLOPS.
First, a set of deep learning models was compared for differentiating crash images (positive) from noncrash images (negative), with the purpose of finding the best crash appearance feature module to link to the Conv-LSTM. VGG-16 and ResNet-50 were used as baseline models, and four extended models were developed by incorporating SE and CBAM modules into VGG and ResNet. The training dataset included 3861 crash and 4373 noncrash images, while the testing dataset included 1200 images per category. Table 1 shows the performance of these crash appearance feature extraction models on the test dataset. Compared to VGG-16 and ResNet-50, the extended models with attention modules generally had higher detection accuracy; among them, the ResNet-50 + CBAM model achieved the highest accuracy of 90.17%. Figure 8 shows the testing accuracy at each training epoch for these models.

It can also be found that all models had many more false-positive (FP) cases than false-negative (FN) cases, indicating that they tend to classify noncrash traffic scenes as crashes. Some traffic conditions (e.g., stopped vehicles and heavy congestion with many overlapping pedestrians and vehicles) can have appearance features very similar to those of crash scenes. Thus, models based solely on appearance features cannot reliably distinguish such conditions.
We further visualized these models using the gradient-weighted class activation mapping (Grad-CAM) technique [34], as shown in Figure 9. ResNet appeared to be better than VGG at focusing on the appearance features of crashes; for example, VGG failed to identify the appearance features of crash D, while ResNet identified them correctly. With attention modules added, the extended models (ResNet-50 + SE/CBAM) focused on appearance features even better. For instance, CBAM improved ResNet's ability to locate the fallen people in crash C.
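Grad-CAM heatmaps such as those in Figure 9 can be produced with a pair of hooks on the last convolutional block. The sketch below is a generic implementation; the target layer (e.g., `layer4` of a ResNet) and the model name are assumptions and not tied to the authors' code.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Generic Grad-CAM: weight the target layer's activations by the spatially
    averaged gradients of the class score, apply ReLU, and upsample to image size."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    score = model(image.unsqueeze(0))[0, class_idx]  # class score for one image
    model.zero_grad()
    score.backward()                                 # gradients w.r.t. the stored activations
    h1.remove(); h2.remove()

    weights = grads["g"].mean(dim=(2, 3), keepdim=True)            # GAP of the gradients
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))   # weighted sum + ReLU
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam / cam.max().clamp(min=1e-8)).squeeze().detach()    # normalized heatmap

# Example (hypothetical names): heatmap = grad_cam(resnet_cbam, img, resnet_cbam.layer4, class_idx=1)
```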

Then, using ResNet-50 + CBAM as the pretrained appearance module, we trained the whole model with 330 crash and 342 noncrash video clips; the testing dataset included 90 crash and 90 noncrash videos. To examine the performance of the proposed model, six models were compared. Model 1 determined crashes based on the amount of velocity change or trajectory anomalies extracted by Faster R-CNN + SORT. Model 2 identified crashes based on the number of detected crash frames in a video clip. Models 3–6 used ResNet-50 + CBAM/SE to extract appearance features and LSTM/Conv-LSTM models to extract motion features from videos. Table 2 shows the performance of the six candidate models.
The results indicate that model 1 had the lowest detection accuracy and speed. It performed well in detecting multivehicle crashes but largely failed to detect vehicle-pedestrian and single-vehicle crashes. The reason could be that such models can only recognize crash motions (e.g., the intersection of vehicle trajectories and abnormal behaviours of nonmotorists [35, 36]) rather than crash appearance (e.g., fallen people, vehicle rollovers, and vehicle damage). Figure 10 shows some crash scenes that were misclassified by model 1.

Compared to model 1, the feature fusion-based models achieved better detection accuracy. Among them, the rule-based model (model 2) and the LSTM models had lower detection accuracy than the Conv-LSTM models. The rule-based model declares a crash when the number of detected crash frames in a video clip exceeds a threshold; in our experiments, the highest accuracy was achieved with a threshold of 10 frames. Since this method uses no sequential information, it does not identify crash motion features well (its FN rate is high). LSTM models require the appearance feature maps to be flattened, which loses spatial information. Conv-LSTM can detect motion and appearance features simultaneously while largely retaining their original spatial information (FN decreases compared to the rule-based model).
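For reference, the decision rule of model 2 reduces to counting per-frame crash detections against the 10-frame threshold mentioned above; a minimal sketch (with a hypothetical per-frame classifier interface) is:

```python
def rule_based_crash(frame_scores, score_threshold=0.5, min_crash_frames=10):
    """Declare a crash if at least `min_crash_frames` frames of the clip are individually
    classified as crashes. `frame_scores` holds per-frame crash probabilities from the
    appearance model; the 0.5 score cutoff is an illustrative assumption."""
    crash_frames = sum(1 for p in frame_scores if p >= score_threshold)
    return crash_frames >= min_crash_frames
```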
Regarding detection speed, the proposed framework considerably outperformed the motion-based deep learning model. To achieve high accuracy in detecting moving objects, motion-based models often require powerful deep learning models for vehicle detection and tracking. Overall, the Conv-LSTM achieved the highest detection accuracy with an acceptable detection speed (FPS > 30).
Furthermore, a typical feature fusion-based model (the C3D model) was also compared to our best model (model 6). As shown in Table 3, the C3D model overfitted, with a training accuracy of 99.89% but a test accuracy of 67.22%. The reason is that the C3D model has over 10 times more parameters than our proposed model, so it overfitted easily on the limited dataset. Regarding computational load and detection speed, the proposed model was also better than the C3D model in terms of FLOPs (floating point operations) and FPS.
Of note, the best Conv-LSTM model (model 6) still produced some false-positive cases: some noncrash scenes (congestion) could not be well identified, as shown in Figure 11. This is probably due to the limited sample size. Another reason could be that the proposed model tends to focus on parts of an image (because of the attention module) while lacking an understanding of the whole traffic scene.

As for misdetections (i.e., FN cases), some typical examples are discussed here (Figure 12). In the first crash, two vehicles collided and caused an explosion; the fire quickly covered the whole traffic scene. Such cases are very rare in our current dataset, so the trained model cannot identify their appearance features well. The second to fifth crashes all happened in congested or complex traffic environments, in which crash features were occluded or difficult to identify, especially when the original image quality was low.

5. Conclusion
Detecting crashes in a timely and accurate manner is important for traffic incident management. Previous video-based crash detection models suffer from low detection accuracy (e.g., some motion-based models) or high computational costs (e.g., large feature fusion-based models). To fill this gap, we proposed a new feature fusion-based deep learning framework aimed at balancing accuracy and speed for urban traffic crash detection. To this end, a ResNet with attention modules was developed to capture the appearance features of crash images. ResNet is faster than conventional convolutional neural networks, and the attention module enables it to focus on localized appearance features rather than irrelevant information, further boosting performance. Conv-LSTM was linked to the ResNet to simultaneously capture appearance and motion features. Compared to conventional recurrent neural networks (e.g., LSTM), Conv-LSTM retains most of the spatial information with relatively few parameters.
The modeling results show that the ResNet with attention modules improves the detection of localized crash appearance features and that, compared to simple rules and LSTM, the Conv-LSTM better captures crash motion features. The proposed model achieved an overall accuracy of 87.78% with a relatively fast detection speed (FPS > 30), outperforming conventional motion-based models and existing feature fusion-based models. Thus, the proposed method is a promising crash detection approach that achieves a good balance between speed and accuracy.
Admittedly, the proposed model also has some limitations. First, it falsely detected some congested traffic scenes as crashes; an understanding of the whole traffic scene may need to be incorporated, and we will attempt to improve the model framework in future research. Second, it still misses some crashes when the traffic environment or crash scene is complicated, rare, or ambiguous; thus, the model needs more data and supplementary methods (e.g., multiangle cameras or few-shot learning) to further improve its robustness. Third, the model needs to be extended to identify different crash types and severity levels. The authors recommend that future research focus on these topics.
Data Availability
The data used were acquired from the local traffic police in China. Since those data are from video surveillance (many are crash-related), they can only be accessed with permission from the local government.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was supported by the National Natural Science Foundation of China (grant no. 71971061), the National Key R&D Program of China (grant no. 2018YFE0102700), and Transportation Science Research Project in Jiangsu Province (grant no. 2019Z02).