Abstract
Achieving high detection accuracy of pavement cracks with complex textures under different lighting conditions is still challenging. In this context, an encoder-decoder network-based architecture named CrackResAttentionNet was proposed in this study, and the position attention module and channel attention module were connected after each encoder to summarize remote contextual information. The experiment results demonstrated that, compared with other popular models (ENet, ExFuse, FCN, LinkNet, SegNet, and UNet), for the public dataset, CrackResAttentionNet with BCE loss function and PRelu activation function achieved the best performance in terms of precision (89.40), mean IoU (71.51), recall (81.09), and F1 (85.04). Meanwhile, for a self-developed dataset (Yantai dataset), CrackResAttentionNet with BCE loss function and PRelu activation function also had better performance in terms of precision (96.17), mean IoU (83.69), recall (93.44), and F1 (94.79). In particular, for the public dataset, the precision of BCE loss and PRelu activation function was improved by 3.21. For the Yantai dataset, the results indicated that the precision was improved by 0.99, the mean IoU was increased by 0.74, the recall was increased by 1.1, and the F1 for BCE loss and PRelu activation function was increased by 1.24.
1. Introduction
Pavement cracks are a common symptom and an early sign of potential damages and degradation in the pavements [1–4]. However, due to heavy traffic and drastic environmental changes, cracks can significantly affect the performance and functionality of the asphalt pavement projects and eventually lead to structural deterioration and a shortened service time.
Early and frequent inspection can help collect asphalt pavement condition data for in-depth analysis and strategic decisions. Using this information, appropriate maintenance measures can be arranged to prevent pavement failure at an early stage. However, manual inspections of asphalt pavement require massive time involvement and labor costs, and the results are not accurate [5–9].
In this case, some automated vision-based techniques for road damage detection have been investigated. Meanwhile, the development of deep learning in computer vision [10, 11] has significantly improved the crack detection accuracy of camera-captured images [12–15]. Scholars have proposed pixel-level methods with deep convolutional neural networks (CNN) to detect cracks. For crack segmentation, UNet and DenseNet were performed. The results showed high performance and accuracy to a certain extent [16, 17].
In the past few decades, the crack detection technology based on captured images has been widely studied and applied [12, 18–20]. Compared with subjective and labor-intensive manual approaches [19], image-based methodologies were consistent and objective, thus leading to reduced labor costs and higher detection efficiency. In general, image-based methodologies can be divided into different categories, including rule-based methods, machine learning-based methods, and deep learning-based methods. Rule-based methods used a variety of filter algorithms and image processing techniques [12–15]. S. Chambon et al. [20] proposed a two-step multiscale method. The first step was to use adaptive filtering binarization, and the second step was to refine the binarization by segmentation based on Markov model. James Tsai et al. [21] assessed the crack detection performance in different lighting conditions with poor contrast by emerging 3D laser technology. M. Salman et al. [22] developed a novel technology for the automatic detection and identification of cracks from digital pavement images. Merazi-Meksen et al. [23] used a mathematical method to extract the discontinuity-related pixels and used pattern recognition technology to characterize the discontinuity. Due to the complexity of the texture of pavement surfaces, and the irregularity of the crack morphologies, scholars tried to detect the cracks with machine learning-based algorithms. Qin Zou et al. [24] developed CrackTree, which was a fully automatic crack detection method using pavement images. Ghada Moussa et al. [17] presented a novel reliable method, which used flexible pavement images acquired in the asphalt pavement surveys to automatically detect, classify, and estimate the cracks. Miguel Gavilán et al. [25] proposed a seed-based crack detection method for asphalt pavement by combining Multiple Directional Non-Minimum Suppression (MDNMS) and a symmetry check.
However, machine learning-based approaches are highly dependent on input features and require the researchers to have a broad knowledge of the features. In recent years, as a branch of machine learning, deep learning, especially artificial neural network (ANN) [26, 27], has been rapidly applied due to its superior performance in various fields [28–33], including image classification and semantic segmentation. In the report by Gopalakrishnan et al. [31], a Deep Convolutional Neural Network (DCNN) was trained on the “big data” ImageNet database and applied in their study. Zhenqing Liu et al. [32] adopted UNet to detect the concrete cracks. The trained UNet can identify the location of cracks from the input original images under various conditions. Ankang Ji et al. [34] proposed an integrated crack detection method based on the convolutional neural network DeepLabv3+ and developed an algorithm to quantify the cracks at the pixel level. Cao Vu Dung et al. [35] proposed a crack detection method based on a deep full convolutional network (FCN) and applied this method for semantic segmentation of concrete crack images. Some other semantic segmentation methods were also proposed. However, none of the semantic segmentation methods were employed in crack detection involving engineering applications. Adam Paszke et al. [36] developed a new deep neural network architecture, i.e., efficient neural network (ENet) for tasks requiring low latency operation. Zhenli Zhang et al. [37] proposed a new framework called ExFuse to fill in the gap between low-level and high-level features. The developed framework leads to significant improvement of the segmentation quality (by 4.0%). Abhishek Chaurasia et al. [38] designed a novel neural network architecture from scratch, which can be specifically applied for semantic segmentation. Vijay Badrinarayanan et al. [39] designed an encoder network architecture with the same topology as the 13 convolutional layers in the VGG16 network [40, 41]. Abhishek Chaurasia et al. [38] designed a novel deep neural network architecture, namely, LinkNet, which can learn without significantly increasing the number of parameters. SegNet was designed by Vijay et al. [39]. It had a core trainable segmentation engine and contained an encoder network, a corresponding decoder network, and a pixel-wise classification layer. However, due to impact from shadows, uneven lighting conditions, or crack shape irregularity, it was still not accurate enough for pixel-level crack segmentation.
In this study, an encoder-decoder network-based architecture named CrackResAttentionNet was proposed to detect asphalt pavement cracks, as well as pixel-level image segmentation. The overall process of the CrackResAttentionNet training process is shown in Figure 1. In the following sections, the structures of CrackResAttentionNet, model evaluation method, and model optimization method are described. From the structure of CrackResAttentionNet itself, the encoder mainly used the convolutional layers of ResNet-34 to extract image features, and another encoder layer was added to better extract information. The decoder employed the deconvolutional layers to perform the semantic segmentation of pixels with and without crack. Additionally, the position attention module and channel attention module were connected after each encoder to summarize remote context information. The two attention modules were fused proportionally to emphasize more the position information. The output of each encoder layer was fused with attention output and linked with the corresponding decoder. In addition, the output of last decoder was fed into the decoder as input. In this way, the decoder and its upsampling operations can use spatial information to help improve the accuracy of prediction.

The organization of this paper is as follows. The “Introduction” section reviews some related studies on image-based crack detection and segmentation; the “Methodology” section describes the methodologies related to technical background, including Convolutional Neural Network (CNN) and a group of layers, deconvolution, unpooling, Position Attention, and Channel Attention, and encoder-decoder network; the “Network Architecture” section explains the CrackResAttentionNet architecture in detail; the “Experimental Program” section lists the details of data acquisition, data generation, and experimental program; the “Results and Discussion” section presents the experiment results and the analysis results; the “Summary and Conclusions” section summarizes the conclusions and our findings; and finally, the “Suggestions for Future Research” section looks at the future research prospects and suggestions.
2. Methodology
This section briefly describes the methodologies related to the technical background, including Convolutional Neural Network (CNN) and a group of layers, deconvolution, unpooling, Position Attention, and Channel Attention, and encoder-decoder network.
The convolution layer was a linear operation used to process the product of a set of weights and the input. The multiplication was performed between the input data array and a two-dimensional weight array, which was called filter or kernel. The kernel had a smaller size than the input, and the multiplication calculation between the kernel and the input data was based on dot product. Dot product is an element-wise multiplication operation between two 2D arrays. Then, the products are added together, generating a single value as the output. As the calculation progressed, the kernel moved along the width and height of the image, which produced a 2D image representation of the receptive region. The generated image representation indicated the response of the kernel at each spatial position of the image. The kernel slid on the image with the step size of “stride” [8]. The pooling layer helped reduce the spatial dimension of the representation, which can lower the number of iterations, computation size, and weight [8]. Several pooling options are currently available, namely, the average of the rectangular neighborhood, L2 norm of the rectangular neighborhood, and max pooling, which can obtain the maximum value from the subarrays of the input array.
A nonlinear function was employed as an activate function. Dropout was used to process the overfitting problem for neural networks. The connections between neurons were disconnected at a fixed dropout rate. During training, as the training progressed, the parameters of the preceding layers changed, which led to the change in the distribution of inputs. Thus, the current layer needed to be updated to reflect the actual distribution. The output from the previous activation layer was normalized by Batch normalization. In detail, the Batch mean was subtracted, and the resulting value was divided by the Batch standard deviation. Then, the most likely classification was identified by the softmax layer. In the most likely classification, the class had the highest probability [8]. The probability distribution with a sum of one was output. Pooling in a convolution network was used to abstract activations in a receptive field with a single representative value, thus filtering the noisy activations in the lower layer. Unpooling operation performed the reverse operation of pooling and reconstructed the original size of activations. The unpooling can be implemented using the approach proposed in References [9, 25]. Briefly, unpooling recorded the positions of the maximum activations obtained by the pooling operation, performed the reverse operation of pooling, and reconstructed the original size of the activations. As a result, each activation was placed back to the original position before pooling operation. The deconvolution layer used operations with multiple learned filters similar to convolution to make the sparse activations much denser. A convolutional layer connected multiple input activations within a filter window to a single activation; in contrast, deconvolutional layers associated a single input activation with multiple outputs. The deconvolutional layer then output an enlarged and dense activation map. Encoder-decoder network was a widely used form of semantic neural network for image segmentation. The encoder may consist of a few convolutional layers to obtain the feature maps, and the decoder was used to extend the features extracted by the encoder so that the output probability map and the input image had matching sizes.
3. Network Architecture
An encoder-decoder network-based architecture called “CrackResAttentionNet” was proposed to segment crack pixels. CrackResAttentionNet consisted of two parts: encoder and decoder. Between each encoder and decoder, an attention module was added behind the encoder and linked to the decoder. The encoder network consisted of pretrained ResNet-34 [41] as its main encoder layer, and the latter part was connected with non-ResNet-34 encoders. Each encoder included a corresponding decoder layer, and the final decoder output was fed into an output block containing a deconvolution function to perform pixel prediction.
Attention was obtained for each ResNet-34 encoder layer, and the output of the attention was added to the original encoder output. The outputs from the encoder layer and the previous decoder layer were fed into the next decoder layer as input.
3.1. Encoder
The architecture of CrackResAttentionNet is shown in Figure 2. From the figure, the encoder starts with the pretrained ResNet-34 in the block layer and performs convolution, Batch normalization, Relu activation, and max pooling. The next layers include layer 1, layer 2, layer 3, and layer 4 of ResNet-34.

The last part of the encoder is a diverse encoder block, as shown in Figure 3. The diverse encoder block performed convolution, in which the kernel size was 2 × 2, the stride was 2, and the padding was 0. The size of the output matrix was divided by 2. After the convolution, Dropout, Batch normalization, and PRelu activation were connected.

3.2. Decoder
The last encoder block was directly fed into decoder 5. As shown in Figure 4, decoder 5 contains convolution 1 module and convolution 2 module, deconvolution module, and convolution 3 modules. Each convolution module performed convolution. The kernel size was set to 1 × 1, the stride was 1, and the padding was 0. Then, the matrix with the same size can be obtained as the input. Dropout, Batch normalization, and PRelu activation were after the convolution. The deconvolution module first performed deconvolution by function ConvTranspose2d, with a kernel size of 2 × 2 and a stride of 2. As a result, the size of the input matrix was multiplied by 2. Then, the next step was the Batch normalization and PRelu.

As shown in Figure 5, the four decoders, i.e., decoder 4, decoder 3, decoder 2, and decoder 1, have the same structure. They contain convolution 1 module, deconvolution module, and convolution 2 module. Both convolution modules performed convolution with a kernel size of 1 × 1, a stride of 1, and a padding of 0. Then, the matrix with the same size was produced as the input. Dropout, Batch normalization, and PRelu activation were connected. The deconvolution module performed deconvolution by function ConvTranspose2d, in which the kernel size was 3 × 3 and the stride was 2. As a result, the input size was multiplied by 2. Then, the next step was the Batch normalization and PRelu.

As shown in Figure 6, the out block contains the deconvolution module 1, the convolution modules 1 and 2, and the deconvolution module 2. The first deconvolution module performed deconvolution through the function ConvTranspose2d, in which the kernel size was 3 × 3 and the stride was 2. As a result, the size of the input matrix was multiplied by 2. Then, the next step was the Batch normalization and PRelu. The two convolution modules had the same structure and performed convolution with a kernel size of 3 × 3, a stride of 1, and a padding of 0. Both the output matrix and the input matrix had the same size. Dropout, Batch normalization, and PRelu activation were connected. The last deconvolution module simply performed a deconvolution by function ConvTranspose2d, in which the kernel size was set to 2 × 2 and the stride was 2. This operation caused the size of the input matrix to be multiplied by 2. The output was the predicted image, which had the same size as the input image.

3.3. Attention
Cracks are different in scales, lighting, and views, which can affect the segmentation effect. Since the convolution operations would result in local reception, the pixels with the same label may have different features [42]. The differences can cause inconsistency across classes and influence the accuracy. Therefore, the attention mechanism was used to establish associations between features, thus extracting global context information. Paying more attention to crack segmentation can aggregate remote context information and enhance the feature representation capability.
As shown in Figure 7, in order to get the global context of local features from the network, two types of attention modules are added. Given the output from the ResNet-34 decoder layer, a convolution layer was firstly applied to obtain features through different layers, which would not change the shape of the input.

In the first attention module, i.e., position attention module [42], a wider range of context information was extracted into local features. The feature maps A, B, and C were generated from the convolution layer. The following condition was satisfied: {A, B, C, D} ∈ RC×H×W. Then, A, B, and C were reshaped to RC×N, where N was the number of pixels and can be calculated by N = H × W. B was transposed to RN×C, and a matrix multiplication was conducted between the transpose of C and B. The shape of the resulting matrix was RN×N. Then, the softmax layer was used to calculate the spatial attention map S ∈ RN×N [42]:where sji is the influence of the ith position on jth position. When the feature representations of two locations were more similar to each other, the correlation between them was higher [42].
Then, S was transposed into the same shape RN×N, but the dimension was different. Matrix multiplication between A and the transposed S was performed, and the result was RN×C. This result was reshaped to be RC×H×W. Finally, the result was multiplied by a scale parameter α, an element-wise sum operation was conducted with the original convolutional features D, and the final output H ∈ RC×H×W was obtained as follows [42]:where α had the initial value of 0 and gradually learned to assign more weights [42].
Similarly, the channel attention module shown in Figure 7(b) can emphasize interdependent feature maps and improve the representation of specific semantic features.
A convolution firstly was performed to extract the feature maps E, F, G, H, and {E, F, G, H} ∈ RC×H×W.
Then, the E, F, G were reshaped to RC×N, where N was the number of pixels and can be calculated by N = H × W. F was transposed to RN×C, and a matrix multiplication of transposed F and E with the matrix was performed [42]. The resulting matrix with the shape of RN×N was obtained, and a softmax layer was applied to calculate the spatial attention map X ∈ RN×N:where xji is the impact of the ith position on jth position.
Then, a matrix multiplication between the softmax result X and reshaped G was conducted, and the result of RC×N was obtained. This result was reshaped to RC×H×W. Finally, the result was multiplied by a scale parameter β, an element-wise sum operation was conducted with the original convolutional features H, and the final output I ∈ RC×H×W was obtained as follows [42]:where β had the initial value of 0 and gradually learned to assign more weights [42].
Therefore, the output feature of each channel was a weighted sum of the features in all channels and the original features. The output feature of each channel can model the remote semantic dependencies between feature maps.
After obtaining the results from both the position attention H and the channel attention with the shape C × H × W, they can be fused. The position attention had a proportion of Ø, and correspondingly, the channel attention had a proportion of (1 − Ø). The element-wise sum proportional operation can be calculated using equation (5):where Ø is a hyperparameter, and 0.8 can be used to emphasize position attention for crack segmentation.
3.4. Encoder and Decoder Bridge
From Figure 2, each encoder layer is bridged with a decoder. This was done by fusing the output of each encoder layer, the results of attention are indicated in Section 4.3, and the output is from the previous decoder layer. By feeding this fusion output into the decoder, the information of the encoder layer and corresponding attention can be captured.
4. Experiment Setup
4.1. Data Preparation and Preprocessing
In this study, two datasets were used in the experiments. One was the public crack data set [43], which was an annotated dataset to train and validate machine learning-based crack detection and segmentation algorithms. It consisted of 2000 samples. The dimension of each image was 448 × 448 pixels. The dataset was divided into three datasets, i.e., training dataset, validation dataset, and test dataset. Among them, the images in the training dataset were used to fit the neural network model. In the training process, the fitted model repeatedly made predictions on the validation dataset, which evaluated the effect of the model and indicated whether overfitting occurred. After the training process was completed, the test dataset was used to assess the performance of different networks.
The second crack image dataset was denoted as “Yantai dataset,” which was manually collected by a camera on Yantai highway. This dataset consisted of images with complex textures under different light conditions. The dataset consisted of 5000 samples. The images were firstly cropped to obtain the central part of the image, which was square. Then, the cropped images were resized to have 448 × 448 pixels. No other data preprocessing techniques were applied to the collected data, such as noise removal and crack enhancement. The dataset was also divided into three datasets, i.e., training dataset, validation dataset, and test dataset. The purpose of introducing this dataset was to further evaluate the performance of the proposed model in this work. The details are listed in Table 1.
4.2. Data Labeling
For training and validation purposes, the Yantai dataset was required to be annotated. The software of LabelMe was used to manually draw polygon for the crack. After that, the annotated images were generated through the python program, and the output consisted of a JSON file including the position information of the cracks. Therefore, the image background was black, while the cracks were white.
4.3. Segmentation Metrics
The following evaluation indicators were applied for the crack segmentation task, i.e., precision (P), mean IoU, recall (R), and F1. In the image, the crack pixels (white) were defined as positive samples. Based on the combination of labelled cases and predicted cases, the segmentation of the pixels resulted in true positive (TP), false positive (FP), true negative (TN), and false negative (FN).
Precision was defined as the ratio of the number of correctly predicted crack pixels to the number of all the pixels predicted to be cracks. Recall was defined as the ratio of the number of correctly predicted crack pixels to the number of all the true crack pixels. F1 score was the harmonic mean of precision and recall [44].
Intersection over Union (IoU) reflected the overlapping degree between two objects. In this study, the IoU was evaluated on the “crack” class to measure the overlap between the truth crack objects and predicted crack objects.
4.4. Loss Functions
Three loss functions were applied and compared in this study. The pixel-wise cross-entropy loss was described as follows.
where i is the index of the pixel, nn indicates the size of the output image, p represents the true label of samples, 1 represents positive, 0 represents negative, and represents the possibility of samples being predicted as positive.
The balanced pixel-wise cross-entropy loss was similar to pixel-wise cross-entropy loss. The difference was that the balanced pixel-wise cross-entropy loss only assigned weights to positive and negative samples, and the sum of the weight was 1. The equation can be described as follows:where β represents the balanced factor, i is the index of the pixel, nn indicates the size of the output image, p represents the true label of samples, 1 represents positive, 0 represents negative, and represents the possibility of samples being predicted as positive.
Dice loss was designed from the perspective of IoU and expressed in equation (10).where TP refers to true positive, and FN refers to false negative.
4.5. Computing Hardware and Software
All the experiments were performed on a computer with the following specifications: CPU was Intel (R) Xeon (R) CPU E5-2678 v3, and GPU was Nvidia GTX 1060 with 8G RAM. The software environment was based on Ubuntu 16.04, and python was used as the main programming language. Meanwhile, the experiments were conducted on Pytorch 1.5 deep learning framework.
4.6. Hyperparameter Setting
In this work, the training was optimized by the min-Batch stochastic gradient-descent algorithm with momentum. The hyperparameter settings were listed as follows: weight decay factor = 0.0002, momentum = 0.9, learning rate = 0.01, mini-Batch size = 4, and number of epochs = 60.
Two groups of experiments were designed to assess the performance of the proposed CrackResAttentionNet model. The first group of experiments was based on the public crack dataset, and the other group was based on the Yantai dataset. The detailed settings are listed in Table 2.
For each group of experiments, in addition to CrackResAttentionNet, the typical image segmentation models including ENet, ExFuse, FCN, LinkNet, SegNet, and UNet were also run. Three different loss functions, i.e., pixel-wise cross-entropy loss (CE), balanced pixel-wise cross-entropy loss (BCE), and dice loss (Dice), were used to train each model.
5. Results and Discussions
A series of experiments were performed, and the pavement images were processed with the models. Each experiment was evaluated from precision, mean IoU, recall, and F1. Precision indicated the ratio of the number of correctly classified crack pixels to the total number of all pixels, while recall indicated the ratio of the number of correctly classified crack pixels to the number of all true crack pixels. Mean IoU reflected the degree of overlap between predicted crack area and real crack area. F1 was the harmonic mean of precision and recall, reflecting the accuracy of an algorithm [45].
5.1. Experiment Results
The results are summarized in Tables 3–8.
Table 3 shows the performance of different models (ENet, ExFuse, FCN, LinkNet, SegNet, UNet, and CrackResAttentionNet) with CE loss for the public dataset [44]. Compared with other models, CrackResAttentionNet yielded higher precision (82.58), mean IoU (72.83), recall (85.13), and F1 (83.84). ExFuse had good precision (82.22), and ENet had good mean IoU (72.22), recall (83.94), and F1 (81.94).
Table 4 shows the performance of different models with BCE loss. Compared with other models, CrackResAttentionNet showed great precision (89.40). However, CrackResAttentionNet did not show advantages in the other three metrics (mean IoU, recall, and F1). Considering all four metrics together, CrackResAttentionNet still performed great.
Table 5 shows the performance of different models with Dice loss. CrackResAttentionNet showed great precision, mean IoU, and F1 gains, while its recall was not in the lead.
Table 6 shows the performance of different models (ENet, ExFuse, FCN, LinkNet, SegNet, UNet, and CrackResAttentionNet) with CE loss using the Yantai dataset. The maximum F1 score (93.33) was achieved by CrackResAttentionNet, which was 1.78 higher than that of the model in the 2nd place, ENet (91.55). Meanwhile, CrackResAttentionNet had much better performance in other metrics. For instance, the precision (94.64) and mean IoU (83.28) of CrackResAttentionNet were both much better than those of the other models.
Table 7 shows the performance of different models with BCE loss using the Yantai dataset. CrackResAttentionNet showed much better precision (96.17), mean IoU (83.69), recall (93.44), and F1 (94.79) than those of other models. LinkNet also obtained a good F1 score (93.55), but it was still lower than the F1 score of CrackResAttentionNet.
Table 8 shows the performance of different models with Dice loss using the Yantai dataset. CrackResAttentionNet exceeded other models in all the four metrics (precision 95.43, mean IoU 82.75, recall 94.2, and F1 94.81).
BCE loss can be selected as the best loss function for CrackResAttentionNet. It can be seen from these two datasets that BCE loss had higher precision (89.40 for the public dataset and 96.17 for the Yantai dataset).
Sample prediction segmentation output for BCE loss of each model is shown in Figures 8 and 9. From the public dataset sample prediction in Figure 8, SegNet misclassified some background as cracks, while FCN misclassified some cracks as background. CrackResAttentionNet made good segmentation, and the result was extremely close to the ground truth.


From the Yantai dataset sample prediction results in Figure 9, ENet and FCN misclassified some background as cracks, while SegNet and ExFuse misclassified some cracks as background. CrackResAttentionNet had perfect segmentation performance, and the result was extremely close to the ground truth.
5.2. Influence of Different Activation Functions
The effects of different activation functions on CrackResAttentionNet performance were further investigated. Firstly, we fixed the loss function as BCE loss and then used different activation functions (Relu, PRelu, RRelu, and LeakyRelu). The same experimental setup was used as above.
As shown in Table 9, for the public crack dataset, if PRelu was used, mean IoU (71.51) was 0.28 higher than the second place, precision (89.4) was 2.37 higher than the second place, and F1 was also high (85.04). So PRelu has a better performance compared with the other three activation functions (LeakyRelu, RRelu, and Relu).
As shown in Table 10, for the Yantai dataset, PRelu had outstanding performance compared with the other three activation functions (LeakyRelu, RRelu, and Relu). Mean IoU (83.69) was 2.12 higher than the second place, and precision (96.17) was 1.47 higher than the second place. Therefore, based on the performances of different functions for the public dataset and the Yantai dataset, PRelu was selected for CrackResAttentionNet.
6. Summary and Conclusions
In this study, an architecture based on encoder-decoder network was proposed to detect asphalt pavement cracks and perform pixel-level image segmentation. The results indicated that CrackResAttentionNet with BCE loss function achieved the best performance in precision (89.40), mean IoU (71.51), recall (81.09), and F1 (85.04) for the public dataset. Meanwhile, for the Yantai dataset, CrackResAttentionNet with BCE loss function had the precision of 96.17, mean IoU of 83.69, recall of 93.44, and F1 of 94.79. The performance of CrackResAttentionNet exceeded other popular models including ENet, ExFuse, FCN, LinkNet, SegNet, and UNet.
The summary and main findings are as follows:(i)In the architecture of the developed algorithm, the encoder mainly used the convolutional layers of ResNet-34 to extract image features, and an additional encoder layer was added to better extract information. The decoder employed the deconvolutional layers to perform the semantic segmentation of crack and noncrack pixels.(ii)After each encoder, position attention module and channel attention module were connected to summarize remote context information. The two attention modules were fused proportionally to emphasize more on the position information. The output of each encoder layer was fused with the output attention and linked with the corresponding decoder. In addition, the output of the previous decoder was fed into the decoder as an input. In this way, the decoder and its upsampling operations can use the spatial information to help improve the prediction accuracy.(iii)The experiments were conducted for the public crack dataset and Yantai dataset using CrackResAttentionNet and other typical models including ENet, ExFuse, FCN, LinkNet, SegNet, and UNet. In addition, three different loss functions (CE, BCE, and Dice) were used in the experiments. The results demonstrated that CrackResAttentionNet with BCE loss function achieved the best performance.(iv)For public dataset, CrackResAttentionNet performed well in precision (89.40), mean IoU (71.51), recall (81.09), and F1 (85.04). Especially, with BCE loss, the precision was increased by 3.21.(v)For the Yantai dataset, the CrackResAttentionNet had a great performance in precision (96.17), mean IoU (83.69), recall (93.44), and F1 (94.79). With BCE loss, precision was increased by 0.99, mean IoU was increased by 0.74, recall was increased by 1.1, and F1 was increased by 1.24.(vi)The effects of different activation functions were further analyzed. For the public dataset, if PRelu was used, mean IoU (71.51) was 0.28 higher than the second place, the precision (89.4) was 2.37 higher than the second place, and F1 was also high (85.04). For the Yantai dataset, if PRelu was used, mean IoU (83.69) was 2.12 higher than the second place, and the precision (96.17) was 1.47 higher than the second place. Therefore, PRelu was finally selected as the activation function for CrackResAttentionNet.
Overall, to the best of our knowledge, this research is the first attempt to use an encoder-decoder network with both encoder and attention information linked to the decoder layer. It also defined a few additional encoder and decoder modules. Another difference was that the position channel module and channel attention module were proportionally fused.
7. Suggestions for Future Research
Suggestions for future research on this topic include the following:(i)More research should be conducted on the latest attention models (such as Transformer and SENet), which might be integrated into our network to improve the accuracy of the results.(ii)More detailed experimentation is required to analyze the hyperparameters derived from the training and validation datasets. Meanwhile, it is necessary to study the impact of the hyperparameters on the model performance.(iii)Multi-GPU can speed up the model training and inference. Real-time model training and prediction require running the training under a multi-GPU hardware environment.(iv)A bigger public annotated public dataset is needed to cover all types of cracks with noises under different conditions. And a more accurate method related to pixel-level crack detection should be developed to characterize and quantify the spatial stereoscopic information of cracks in the pavement.(v)In the future, the proposed detection method should be optimized, and a large-scale crack database for asphalt pavement crack detection should be built. The image recognition method of pavement cracks should be applied for pavement intelligent maintenance to help identify the microcracks on the pavement to the cloud platform for earlier repair.
Data Availability
The (https://github.com/wanhaifengytu/CrackSegmentationProject/tree/main/data) data used to support the findings of this study have been deposited in the Git Hub repository (https://github.com/wanhaifengytu/CrackSegmentationProject/).
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Authors’ Contributions
Haifeng Wan conceptualized the study, developed the methodology, performed investigation, and wrote the original draft. Lei Gao conceptualized the study and performed investigation. Manman Su performed investigation and data curation. Qirun Sun performed investigation and reviewed and edited the manuscript. Lei Huang performed investigation and edited the manuscript.
Acknowledgments
The authors are grateful for the financial support from Natural Science Foundation of Shandong Province (Grant No. ZR2020ME238), the Shandong Provincial Key Laboratory of Highway Technology and Safety Assessment Research Fund Project (TM2018H46), and Yantai Science and Technology Innovation Development Plan Project (2020XDRH104).