Abstract

Bridge expansion joints (BEJs) in service are susceptible to damage from various factors such as fatigue, impact, and environmental conditions. While visual inspection is the most common approach for inspecting BEJs, it is subjective and labor-intensive. In this paper, we propose a novel methodology for detecting the fault status of BEJs, inspired by voiceprint recognition (VPR) based on audio signals. We establish an Artificial Neural Network to filter nonevent segments from low signal-to-noise ratio signals, achieving an AuC value of 0.981. We design and improve ConFormer VPR models with a multifeature aggregation strategy and cascade them to realize fault detection of BEJs. For three successive tasks in classifying environment sound types, vehicle impact types, and faults, the ConFormer VPR models achieve AuC values of 0.975, 0.925, and 0.886, respectively, demonstrating the feasibility of our methods for unmanned inspection of BEJs. In future research, the introduction of multiple types of damage and the implementation of benchmarking tests are planned to further enhance the capabilities of the system.

1. Introduction

Bridge expansion joints (BEJs) are structural components that allow bridges to accommodate thermal movements and vibrations caused by traffic loads, wind forces, and seismic events. However, BEJs are also vulnerable to deterioration and damage due to various factors, such as fatigue, corrosion, impact, and environmental conditions [1, 2]. Damaged expansion joints can compromise the structural integrity of the bridge, create noise and vibration problems, and pose a hazard to traffic [3, 4]. Therefore, inspecting BEJs is essential for ensuring the safety and functionality of the bridge, as well as preventing costly and disruptive repairs in the future.

Currently, the inspection of BEJs primarily relies on two methods: manual visual inspection and nondestructive testing (NDT) techniques [5, 6]. Trained professionals conduct visual inspections, employing methods such as hammer testing for bolt looseness, magnifying glasses or microscopes for crack detection, and micrometers to assess beam gap distances [7]. Despite its simplicity and efficiency, manual visual inspection is inherently subjective, contingent upon the inspector’s expertise, skill, and fatigue. Furthermore, this approach has limitations, as it may overlook concealed defects beneath the surface or within the joint.

Consequently, NDT methods are being explored for a more objective and precise inspection of BEJs. Accelerometers or displacement gauges are installed to capture the dynamic response of BEJs, allowing vibration characteristics induced by moving vehicles to be extracted; changes in these characteristics are then detected to identify faults in damaged BEJ steel fingers [8]. A wireless installation, based on Internet-of-Things technology, has also been proposed to enhance the flexibility of this approach [9, 10]. Ultrasonic testing, proposed for inspecting steel conditions [11–13], can be employed to pinpoint the damaged areas of steel components in BEJs. X-ray Computed Tomography (X-ray CT) is being investigated to inspect the corrosion zones of BEJ steel specimens in a laboratory setting [14], where corrosion products are observed in X-ray CT scanning images across different vertical sections. Electromagnetic testing, a novel NDT technique for detecting defects in steel components [15], can be utilized to inspect support bars in BEJs. However, these methods necessitate expensive specialized equipment and trained personnel. Some methods also require lane closures and traffic control, thereby escalating the cost and duration of the inspection process.

Previous research has investigated the sound produced by vehicle impact on BEJs, with a focus on understanding its generation [16] and devising methods to mitigate it for environmental conservation [17, 18]. Findings suggest that the audio response of BEJs contains information about their operational status, as corroborated by experienced inspectors who have leveraged these sounds to pinpoint anomalous BEJs [19]. Consequently, sound signals elicited by vehicle impact hold the potential for detecting BEJ damage. However, there is a paucity of studies on BEJ fault inspection that utilize audio signals.

In recent decades, sound signals have been increasingly utilized for detecting damage or anomalies in infrastructures that are in service. To acquire these sound signals, a microphone or an array of microphones is required. Research on auditory perception has shown that most of the information conveyed by sound signals is below 10 kHz [20]. As a result, a sampling rate of 16 kHz is commonly used in most scenarios [21]. However, for applications that prioritize efficiency and are based on audio, a sampling rate of 8 kHz can also be practical [22, 23]. Then, the original audio signals are processed using digital signal processing (DSP) techniques, such as Fourier Transform, to extract high-level features. Given the unique characteristics of audio signals, specific features have also been defined, including Mel frequency, Fbank [24], Gammatone [25], and so on.

After signal preprocessing, machine learning methods can be used to detect faults in these infrastructures. In the early stages, Hidden Markov Models (HMMs) were used to classify audio signals based on Mel cepstrum coefficients, achieving a 58% accuracy rate for 18 types of sounds [26, 27]. For more generalized usage, statistical learning models have been applied to fault detection. The Support Vector Machine (SVM) is widely used for detecting damage to structures or equipment, such as bolt loosening [28], pipeline cracking [29, 30], bearing faults [31], ratchet faults [32], and turbine blade damage [33]. A decision tree has been designed to classify damage events of wind turbine rotor blades using airborne sound, with results showing the high precision of the algorithm [34]. An Artificial Neural Network (ANN) model has been used for in-pipe leak detection and bearing fault inspection, achieving high precision [35]. In recent years, deep learning models have been introduced into fault detection [36, 37], as they can extract deeper signal features with greater efficiency. The One-Dimensional Convolutional Neural Network (1D-CNN) was first applied to audio signal classification [38, 39]. To exploit the features of time-series signals, a recurrent module was added to the 1D-CNN model, improving its classification accuracy [40]. Additionally, a Long Short-Term Memory (LSTM) model has been utilized for fault detection in additive manufacturing in industrial applications [41]. By applying WaveNet [42], the accuracy of fault detection was improved compared to the LSTM model [43]. For scenarios with scarce fault data, unsupervised methods have been proposed to monitor the operating state of machines [44–47]. These approaches are applicable in low-interference surroundings where only the sounds of the monitored devices are acquired.

In conclusion, existing research suggests that manual visual inspection remains the most prevalent method for inspecting in-service BEJs. Techniques based on accelerometers, ultrasonic sensors, and electromagnetic sensors have been explored to inspect BEJs in an NDT manner. However, these techniques are labor- and cost-intensive and require professional training or expert analysis. Given the proficiency of audio signal-based approaches in anomaly detection tasks and the availability of cutting-edge deep learning techniques for audio analysis, such approaches hold promise as a more cost-effective and automated means of detecting faults in in-service BEJs. For BEJs in service, the environment is noisier and more complex than in a factory or laboratory, so more robust and feasible event segmentation methods and fault detection models are required. Recent advancements in speech recognition and voiceprint recognition (VPR) for human voices can be leveraged for this purpose, especially VPR, which also addresses the audio signal classification problem.

In this study, we propose a novel framework for fault detection of in-service BEJs based on VPR. First, microphones are deployed under the BEJs to acquire audio data. Then, an Artificial Neural Network (ANN) classification model is established to filter nonevent segments from low signal-to-noise ratio (SNR) audio signals in the BEJ environment. Subsequently, the ConFormer VPR model is designed and improved for general audio event classification. Finally, a cascading approach is proposed, consisting of three successive ConFormer VPR models, to separate vehicle impact audio, distinguish the detailed type of vehicle, and ultimately detect fault status. The main contributions of this work are four-fold. (1) Acoustic sensors and VPR algorithms are introduced for the first time in the fault detection of BEJs. (2) A machine learning model is applied for audio signal event segmentation, achieving an AuC of 0.981. (3) The original ConFormer is modified for VPR and improved with a multifeature aggregation strategy. (4) A cascading approach is proposed by combining ConFormer models for fault detection of BEJs.

The remainder of this work is organized as follows. Section “Overview of our methodology” presents our methodology for fault detection of BEJs based on the VPR technique. In Section “Audio event segmentation based on Fbank-ANN model,” the Fbank feature and ANN model for audio signal event segmentation are introduced. Section “Fault detection of BEJs by cascading ConFormer VPR models” proposes the ConFormer structure for VPR and a cascading approach consisting of ConFormer models for complete fault detection. Furthermore, a case study is conducted and the results are discussed in Section “Case study.” Finally, conclusions are drawn in Section “Conclusions.”

2. Overview of Our Methodology

In this paper, we propose a framework for fault detection of in-service BEJs based on audio processing techniques. As shown in Figure 1, the methodology consists of four major parts.

First, a microphone is deployed under the BEJ to acquire audio data. Additionally, some auxiliary sensors are necessary for annotation purposes, such as cameras to annotate actual passing vehicles for sound collection. The audio data are then preprocessed using the Fbank approach, which includes preemphasis, framing and windowing, Short-Term Fourier Transform (STFT), and Mel filters. The resulting Fbank feature maps are used to train an ANN classification model for audio event segmentation in actual applications. Finally, fault detection can be achieved by cascading ConFormer VPR models. The first ConFormer is applied for environment VPR to separate vehicle impact audio signals. The second ConFormer is used to distinguish the detailed type of vehicle impact. The last ConFormer serves for the final fault detection under the same vehicle impact type.

3. Audio Event Segmentation Based on Fbank-ANN Model

Exposed to the outdoor environment, BEJs are surrounded by multiple types of sounds caused by four factors, as listed in Table 1.

Nonevent segments in audio signals represent meaningless information and should be eliminated. Threshold-based methods, such as Short-Term Energy (STE) and Short-Term Cross-Zero Rate (STCZR), are popular approaches for audio event segmentation. However, in low signal-to-noise ratio (SNR) environments such as BEJs, these methods may not perform well for event segmentation [48, 49].

In this paper, we apply a machine learning (ML) approach to event segmentation of audio signals by treating it as a classification problem. Firstly, Fbank feature extraction is used for audio preprocessing, with event and nonevent samples manually annotated. Then, an ANN model can be trained to distinguish event features from nonevent features.

3.1. Fbank Feature

Audio signals have a high sampling rate and rich semantics, making their feature extraction more complex than that of other signals in a bridge monitoring system. Based on the theory of auditory perception, the Fbank feature descriptor has been proposed; it is realized through four calculation procedures [24].

Step 1. Preemphasis. The high-frequency components of audio signals are significant in VPR. However, the uniform discretization of signal sampling attenuates energy in high-frequency regions. Therefore, a preemphasis procedure is required to boost energy and highlight resonant peaks in these regions. Specifically, a first-order digital filter can be used to achieve this compensation, as follows:

$$y(n) = x(n) - \alpha\, x(n-1), \quad (1)$$

where $n$ is the discrete time coordinate, $y(n)$ and $x(n)$ are the audio signals after and before the preemphasis procedure, respectively, and $\alpha$ is the emphasis factor.
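For illustration, a minimal NumPy sketch of this first-order filter is given below; the emphasis factor α = 0.97 is a commonly used value assumed here, since the exact setting is not specified in this paper.

```python
import numpy as np

def preemphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order preemphasis filter: y(n) = x(n) - alpha * x(n - 1)."""
    y = np.empty_like(x, dtype=np.float64)
    y[0] = x[0]                       # the first sample has no predecessor
    y[1:] = x[1:] - alpha * x[:-1]
    return y
```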

Step 2. Framing and Windowing. Audio signal generation has inertia, meaning that the audio signal is stationary over a short time period and exhibits short-time stationarity. Therefore, the original signal is segmented into multiple short-duration audio segments called speech frames. Frame-length and frame-shift must be determined in this process: the former is the duration of each audio frame, while the latter defines the overlap between adjacent frames. Since framing truncates the signal and can cause spectral leakage, each audio frame is windowed to attenuate its edges and avoid the Gibbs phenomenon. A commonly used window function for audio signals is the Hamming window:

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1, \quad (2)$$

where $w(n)$ is the window function and $N$ is the number of samples in the audio frame.
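A sketch of framing and windowing with the 25 ms/10 ms settings used later in this paper might look as follows; drop-last framing is assumed here, so padding conventions may change the exact frame count.

```python
import numpy as np

def frame_and_window(x, sr=16000, frame_len_ms=25, frame_shift_ms=10):
    """Split a 1-D signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sr * frame_len_ms / 1000)      # 400 samples at 16 kHz
    frame_shift = int(sr * frame_shift_ms / 1000)  # 160 samples at 16 kHz
    num_frames = 1 + max(0, (len(x) - frame_len) // frame_shift)
    window = np.hamming(frame_len)  # 0.54 - 0.46 * cos(2*pi*n / (N - 1))
    return np.stack([
        x[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(num_frames)
    ])                              # shape: (num_frames, frame_len)
```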

Step 3. STFT. The audio frames are processed by an $N$-point STFT to obtain the frequency-domain responses of the frames, as shown in equation (3), where $x_i(n)$ is the $n$-th sample of the $i$-th frame. Then, the power spectrum of the audio frames can be further calculated using equation (4):

$$X_i(k) = \sum_{n=0}^{N-1} x_i(n)\, e^{-j 2\pi k n / N}, \quad k = 0, 1, \ldots, N-1, \quad (3)$$

$$P_i(k) = \frac{1}{N} \left| X_i(k) \right|^2. \quad (4)$$
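A corresponding sketch of equations (3) and (4) is shown below; a one-sided FFT suffices for real-valued audio, and the 512-point size is an assumption rather than the paper's setting.

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    """Per-frame N-point FFT (eq. (3)) and power spectrum (eq. (4))."""
    spec = np.fft.rfft(frames, n=n_fft)  # one-sided X_i(k), k = 0..n_fft/2
    return (np.abs(spec) ** 2) / n_fft   # P_i(k) = |X_i(k)|^2 / N
```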

Step 4. Mel Filter. The frequency-domain responses of the audio frames are further processed by Mel filters. Mel filters are mathematically triangular bandpass filters, placed one by one at uniform intervals on the Mel frequency scale, where the Mel frequency is related to the real frequency $f$ by $f_{\mathrm{Mel}}(f) = 2595 \log_{10}(1 + f/700)$. The center frequency interval between adjacent triangular filters can be obtained using the following equation:

$$\Delta f_{\mathrm{Mel}} = \frac{f_{\mathrm{Mel}}(f_{\max}) - f_{\mathrm{Mel}}(f_{\min})}{M + 1}, \quad (5)$$

where $M$ is the number of triangular filters and $f_{\max}$ and $f_{\min}$ are the maximum and minimum real frequencies, respectively. The lower frequency $f(m-1)$, center frequency $f(m)$, and higher frequency $f(m+1)$ of the $m$-th triangular filter are defined on this uniform Mel grid as follows:

$$f_{\mathrm{Mel}}(f(m)) = f_{\mathrm{Mel}}(f_{\min}) + m\, \Delta f_{\mathrm{Mel}}, \quad m = 0, 1, \ldots, M + 1. \quad (6)$$

Here $1 \le m \le M$, so the response of the $m$-th triangular filter in the Mel filter bank at frequency $k$ is

$$H_m(k) = \begin{cases} \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k < f(m), \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1), \\ 0, & \text{otherwise}, \end{cases} \quad (7)$$

where $k$ lies in the range $[f(m-1), f(m+1)]$ for a nonzero response. Figure 2 shows the Mel filter bank. It can be seen that the filtering range widens at high frequencies. The upper-limit amplitudes of the filters are identical, so low- and high-frequency information of the audio is retained simultaneously.
Then, the power spectrum of the audio signal after STFT is passed through the Mel filter bank to extract the Mel power spectrum of the audio, calculated using the following equation:

$$\mathrm{MelSpec}(i, m) = \sum_{k} P_i(k)\, H_m(k), \quad i = 1, \ldots, T,\; m = 1, \ldots, M, \quad (8)$$

where $T$ and $M$ are the numbers of frames and triangular filters, respectively.
Furthermore, the obtained Mel power spectrum is subjected to a logarithmic operation to obtain the final Fbank feature of the $i$-th frame along the $m$-th dimension:

$$\mathrm{Fbank}(i, m) = \log\left(\mathrm{MelSpec}(i, m)\right). \quad (9)$$

An illustration of Fbank feature acquisition is shown in Figure 3. A 2-second audio signal is processed into 200 frames with a frame-length of 25 ms and a frame-shift of 10 ms. After STFT, nonevent segments of the audio signal have low responses, as shown in Figure 3(b). Then, an 80-dimensional Mel filter bank is used to obtain the Fbank features of the audio, as shown in Figure 3(c). These features are used in the subsequent applications.
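The full Mel filtering and log step can be sketched as follows, chaining the sketches above; the construction mirrors equations (5)–(9), with f_min = 0 Hz and f_max = 8 kHz assumed.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=80, n_fft=512, sr=16000, f_min=0.0, f_max=8000.0):
    """Triangular filters spaced uniformly on the Mel scale (eqs. (5)-(7))."""
    mel_pts = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):                    # rising edge of the triangle
            H[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):                    # falling edge of the triangle
            H[m - 1, k] = (hi - k) / max(hi - c, 1)
    return H

def fbank(frames_power, H):
    """Mel power spectrum (eq. (8)) with log compression (eq. (9))."""
    return np.log(frames_power @ H.T + 1e-10)     # (num_frames, n_filters)
```

For a 2-second, 16 kHz clip, `fbank(power_spectrum(frame_and_window(x)), mel_filterbank())` yields the roughly 200 × 80 feature map used throughout this paper.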

3.2. ANN Model for Distinguishing between Events and Nonevents

Audio signals are preprocessed to obtain Fbank features. A classification model is then required to distinguish nonevent features from event features. ANN is a type of ML model that can learn any nonlinear function. For an ANN to work, audio signals are annotated into event and nonevent segments and processed to obtain Fbank features frame by frame. The model is then built and trained to classify Fbank features of event or nonevent frames. Finally, the model is applied for event frame recognition, which is the purpose of event segmentation.

An ANN model is composed of an input layer, one or more hidden layers, and an output layer. The numbers of elements in the input and output layers are determined by the dimension of the feature vector and the number of prediction categories, respectively. The number of elements in the hidden layer(s) is determined through hyperparameter tuning. Mathematically, each layer consists of two algorithmic steps, linear weighted summation and nonlinear activation, as represented by the following equation:

$$\mathbf{h} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b}), \quad (10)$$

where $\mathbf{x}$ is the input vector, $\mathbf{W}$ and $\mathbf{b}$ are the weight parameters to be learned, $\sigma$ is the activation function (here, the ReLU function), and $\mathbf{h}$ is the output of the layer, serving as the input vector for the next layer.
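In PyTorch terms (the framework used in the case study), each such layer is a `nn.Linear` followed by ReLU; the sketch below stacks them into a binary event classifier, with the hidden width of 64 an illustrative assumption rather than the tuned value.

```python
import torch
import torch.nn as nn

class EventANN(nn.Module):
    """Binary classifier for 80-dim Fbank frames (event vs. nonevent)."""
    def __init__(self, in_dim=80, hidden=64, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),   # h = ReLU(W x + b)
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),           # event / nonevent logits
        )

    def forward(self, x):
        return self.net(x)
```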

An ANN forms a complex nonlinear cascading structure through its multiple layers, enabling it to fit the mapping relationship between input and output. Specifically, the back-propagation mechanism [50] is used to tune the model by minimizing the error between the model's output and the true result, namely, the loss function $L(\theta)$:

$$L(\theta) = \frac{1}{S} \sum_{s=1}^{S} \left\| \mathbf{y}_s - F(\mathbf{x}_s; \theta) \right\|^2, \quad (11)$$

where $\theta$ represents the weight parameters to be learned, $S$ is the number of samples, $\mathbf{y}_s$ is the true label of sample $\mathbf{x}_s$, $F(\mathbf{x}_s; \theta)$ is the final output vector, and $F$ represents the entire ANN model.

Furthermore, during the training process of an ANN, the errors of the loss function with respect to each parameter are calculated. For the squared-error loss above, the error in the output layer is given by equation (12), while the errors of the parameters in the hidden and input layers are obtained through back-propagation using chain-rule differentiation:

$$\delta^{(o)} = \left( F(\mathbf{x}_s; \theta) - \mathbf{y}_s \right) \odot \sigma'\!\left(\mathbf{z}^{(o)}\right), \quad (12)$$

where $\mathbf{z}^{(o)}$ is the pre-activation input of the output layer and $\odot$ denotes element-wise multiplication.

Finally, the parameters are optimized using the Gradient Descent algorithm in each iteration. The coefficient $\eta$, defined in the range (0, 1], represents the learning rate. The parameter optimization process is given by the following equation:

$$\theta \leftarrow \theta - \eta \frac{\partial L(\theta)}{\partial \theta}. \quad (13)$$
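Equations (11)–(13) correspond to a standard supervised training loop; a minimal sketch is shown below, assuming a `train_loader` that yields batches of Fbank frames and labels (plain SGD is used to match equation (13), though any optimizer could be substituted).

```python
model = EventANN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # eta in (0, 1]
criterion = nn.CrossEntropyLoss()

for features, labels in train_loader:  # (batch, 80) frames, (batch,) labels
    optimizer.zero_grad()
    loss = criterion(model(features), labels)   # loss function L(theta)
    loss.backward()                             # back-propagated errors
    optimizer.step()                            # theta <- theta - eta * grad
```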

An ANN for event segmentation is a binary classification model. A confusion matrix, defined in Table 2, is used to evaluate the performance of the ANN. The Accuracy, True Positive Rate (TPR), and False Positive Rate (FPR) of the model can be calculated via equation (14) by setting different classification thresholds:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}. \quad (14)$$

The Receiver Operating Characteristic (ROC) curve is obtained by plotting the FPR on the horizontal axis and the TPR on the vertical axis. The closer the ROC curve is to the upper left corner, the greater the model's ability to distinguish between different types of features, indicating higher generalization and robustness. The performance of the model is evaluated using the Area under Curve (AuC), a metric calculated from the ROC curve: the AuC is the area enclosed under the ROC curve and provides a quantitative measure of the model's ability to distinguish between different classes [51, 52].
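The ROC curve and AuC can be computed directly from frame-level scores, for example with scikit-learn; here `y_true` and `scores` denote the ground-truth labels and predicted event probabilities on the test set.

```python
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) per threshold
auc = roc_auc_score(y_true, scores)               # area under the ROC curve
```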

4. Fault Detection of BEJs by Cascading ConFormer VPR Models

Fault detection of BEJs can be treated as a voiceprint recognition (VPR) task, where audio signals are the inputs and the normal/faulty status of the BEJs is the output. A VPR model is required to classify the audio signals produced at BEJs in order to detect their status.

4.1. ConFormer Model with MFA Module

The Convolution-augmented Transformer network (ConFormer) was first proposed by Gulati et al. [53] for speech recognition applications. The design idea of ConFormer is to combine the advantages of Transformer and CNN models. On the one hand, the encoder-decoder structure of the Transformer makes it effective at capturing content-based global features. On the other hand, CNNs excel at local feature extraction through multiple convolution-pooling operations. ConFormer therefore combines the benefits of both models to achieve unified modeling of global and local features in audio data analysis while minimizing the number of parameters.

In this paper, we modify the structure of ConFormer for VPR usage and achieve better performance. First, after obtaining the speaker embedding, a fully connected layer is added to reduce the embedding dimensions and obtain the final classification results. Second, the Multiscale Feature Aggregation (MFA) strategy is introduced for the ConFormer basic blocks to improve their ability to extract features at different layer depths.

As shown in Figure 4, the ConFormer VPR model is composed of five main parts.

4.1.1. Fbank Feature Extraction

As mentioned earlier, Fbank preprocessing extracts the nonlinear frequency-domain features of audio signals. For a typical 2-second audio utterance at a 16 kHz sampling rate, 200 audio frames are obtained after windowing with a frame-length of 25 ms and a frame-shift of 10 ms. Therefore, by setting the number of Mel filters to 80, a 200 × 80 Fbank feature map can be acquired.

4.1.2. Convolution for Downsampling

As shown in part two of Figure 4, the two-dimensional Fbank feature map is essentially a digital image. Thus, a two-dimensional convolutional layer is first introduced to perform spatial downsampling on the audio features, accelerating inference and extracting higher-level spatial features. Then, a linear layer unfolds the feature map produced by this operation, realizing dimension reduction along the depth channel. A dropout layer is then applied to randomly zero the outputs of selected neurons, which helps avoid overfitting during training.

4.1.3. ConFormer Block

As depicted in the third part of Figure 4, the ConFormer block is a core component of the ConFormer VPR model, consisting of four modules: a Feed-Forward Network (FFN) module, a Multihead Self-Attention (MHSA) module, a Convolution (Conv) module, and a second FFN module at the end. These modules form a macaron structure with the same head and tail. Figure 5 shows the computation process of the ConFormer block. Mathematically, for the input feature $x_i$, the $i$-th ConFormer block performs layer-by-layer operations according to the following equation:

$$\tilde{x}_i = x_i + \tfrac{1}{2}\,\mathrm{FFN}(x_i), \qquad x_i' = \tilde{x}_i + \mathrm{MHSA}(\tilde{x}_i), \qquad x_i'' = x_i' + \mathrm{Conv}(x_i'), \qquad y_i = \mathrm{LayerNorm}\left(x_i'' + \tfrac{1}{2}\,\mathrm{FFN}(x_i'')\right). \quad (15)$$
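A simplified PyTorch sketch of this macaron structure is given below; the relative positional encoding and GLU-gated convolution module of the original ConFormer are omitted for brevity, and the depthwise kernel size of 15 is an assumption.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Macaron FFN-MHSA-Conv-FFN block following eq. (15), simplified."""
    def __init__(self, d_model=128, n_heads=4, kernel=15, ff_mult=4):
        super().__init__()
        def ffn():
            return nn.Sequential(
                nn.LayerNorm(d_model),
                nn.Linear(d_model, ff_mult * d_model), nn.SiLU(),
                nn.Linear(ff_mult * d_model, d_model),
            )
        self.ffn1, self.ffn2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel,
                              padding=kernel // 2, groups=d_model)  # depthwise
        self.out_norm = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)             # half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a)[0]          # multi-head self-attention
        c = self.conv_norm(x).transpose(1, 2)  # (batch, d_model, time)
        x = x + self.conv(c).transpose(1, 2)   # convolution module
        x = x + 0.5 * self.ffn2(x)             # second half-step FFN
        return self.out_norm(x)
```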

4.1.4. MFA Strategy

Concatenating feature maps from different layers can improve the performance of deep learning-based VPR models [54, 55]. Therefore, we introduce the MFA strategy to connect the features extracted by the ConFormer blocks. Specifically, an attentive statistics pooling layer provides different trainable weights for the output feature map of each ConFormer block. After batch normalization and a linear layer, a one-dimensional speaker embedding vector is acquired, analogous to the fixed-length embeddings used in speech recognition tasks.
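A sketch of this aggregation head is shown below, assuming three ConFormer blocks of width 128 and a 64-dimensional embedding (the values adopted later in the case study); the attentive pooling computes an attention-weighted mean and standard deviation over time.

```python
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    """Attention-weighted mean and standard deviation over the time axis."""
    def __init__(self, d_in, d_attn=64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv1d(d_in, d_attn, 1), nn.Tanh(), nn.Conv1d(d_attn, d_in, 1))

    def forward(self, x):                        # x: (batch, d_in, time)
        w = torch.softmax(self.score(x), dim=2)  # trainable frame weights
        mu = (x * w).sum(dim=2)
        sigma = ((x ** 2 * w).sum(dim=2) - mu ** 2).clamp(min=1e-6).sqrt()
        return torch.cat([mu, sigma], dim=1)     # (batch, 2 * d_in)

class MFAHead(nn.Module):
    """Concatenate all ConFormer block outputs, pool, and project."""
    def __init__(self, n_blocks=3, d_model=128, d_emb=64):
        super().__init__()
        d_cat = n_blocks * d_model
        self.pool = AttentiveStatsPool(d_cat)
        self.bn = nn.BatchNorm1d(2 * d_cat)
        self.fc = nn.Linear(2 * d_cat, d_emb)

    def forward(self, block_outputs):        # list of (batch, time, d_model)
        h = torch.cat(block_outputs, dim=2)  # multiscale feature aggregation
        h = self.pool(h.transpose(1, 2))     # (batch, 2 * d_cat)
        return self.fc(self.bn(h))           # speaker embedding (batch, d_emb)
```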

4.1.5. Fully Connected Layer

As shown in Figure 4, following the speaker embedding vector, a fully connected layer is constructed to produce the final outputs; here, the speaker embedding vector is dimensionally reduced to the number of voiceprint types. The Additive Angular Margin Softmax (AAMSoftmax) [56] is applied to compute the model loss during training, as defined in equation (16):

$$L_{\mathrm{AAM}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\,s \cos(\theta_{y_i} + m)}}{e^{\,s \cos(\theta_{y_i} + m)} + \sum_{j=1,\, j \ne y_i}^{C} e^{\,s \cos \theta_j}}, \quad (16)$$

where $C$ is the number of voiceprint types, $\theta_j$ is the angle between the weight and the feature, $s$ is the scale factor used for normalization, $m$ is the margin penalty between the weight and the feature, and $N$ is the number of training samples. Compared to the traditional Softmax, AAMSoftmax enhances the discriminative power of features and improves the performance of classification tasks; it reduces intraclass variance and increases interclass variance, making the learned features more compact and separable.
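Equation (16) can be realized in a few lines; the sketch below adds the margin m only to the target-class angle and scales by s before the cross-entropy, matching the case-study settings s = 30 and m = 0.2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive Angular Margin Softmax loss (eq. (16))."""
    def __init__(self, d_emb=64, n_classes=4, s=30.0, m=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, d_emb))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        # cosine of the angle between normalized features and class weights
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        one_hot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * torch.cos(theta + self.m * one_hot)  # margin on target
        return F.cross_entropy(logits, labels)
```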

4.2. Cascading Approach for BEJ Fault Detection

Compared with sensing methods such as vision and radar, audio acquisition requires active excitation sources. According to research by Nishikawa [19], experienced inspectors can determine the fault status of BEJs by the sound of vehicle impact, indicating that vehicle impact is a reliable excitation source for audio-based fault detection of BEJs.

Therefore, as shown in Figure 6, a cascading approach is established using three ConFormer models, each for a different VPR task, to ultimately achieve fault detection of BEJs. Firstly, environmental event VPR is performed to select vehicle impact events from all audio events. Secondly, to explicitly recognize the distinction between fault-free and faulty status of BEJs, vehicle impact events are finely classified according to different vehicle types. Thirdly, for the same type of vehicle impact audio, fault identification of BEJ component states is performed.

In the application phase, we introduce a decision-making module that combines all outcomes deduced from the type-specific third-level ConFormer models to enhance the reliability of the judgment. For each outcome from a third-level ConFormer model, both fault-free and faulty detections are logged, leading to an accumulation of hits for both categories. The status of the BEJ is subsequently determined based on the ratio of faulty hits: if it falls below a predetermined threshold, the BEJ is classified as fault-free; if it surpasses the threshold, the BEJ is considered faulty.
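This decision rule reduces to a simple hit-ratio vote; a minimal sketch is given below (the 0.5 threshold is a placeholder; the case study later derives an admissible range of 0.187 to 0.814).

```python
def judge_bej(fault_flags, threshold=0.5):
    """Aggregate per-utterance detections (True = faulty hit) into a status."""
    ratio = sum(fault_flags) / max(len(fault_flags), 1)  # faulty-hit ratio
    return ("faulty" if ratio > threshold else "fault-free"), ratio
```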

4.3. Dataset Creation

Comprehensive datasets are essential for data-driven VPR applications. As shown in Figure 1, videos recorded by an in-field camera can be used to annotate specific vehicle impacts. Besides, sound events such as vehicle horns, human sounds, and bird sounds are manually differentiated and annotated.

The duration of audio signals in Table 1 may be imbalanced for practical applications. To supplement the in situ audio data, Open-Access (OA) datasets can be used. However, environmental noise can affect the robustness of the VPR model [57], particularly in low SNR scenarios such as BEJ environments. Therefore, environmental noise enhancement should be applied to the audio segments of OA datasets based on on-site measured signals, as shown in equation (17) and Figure 7:

$$\tilde{x}(n) = x(n) + \beta\, v(n), \quad (17)$$

where $x(n)$ and $\tilde{x}(n)$ represent the audio signals before and after noise enhancement, respectively, $\beta$ is the enhancement coefficient, set to 1 in this case, and $v(n)$ represents noise signals, which are randomly obtained from labeled noise signal segments in the BEJ environment.
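A sketch of this augmentation, assuming `noise_pool` is a list of labeled BEJ noise clips measured on site:

```python
import numpy as np

def add_bej_noise(x, noise_pool, beta=1.0, rng=None):
    """Eq. (17): superimpose a randomly chosen measured noise segment."""
    rng = rng or np.random.default_rng()
    v = noise_pool[rng.integers(len(noise_pool))]  # random labeled noise clip
    v = np.resize(v, len(x))                       # tile/crop to clip length
    return x + beta * v                            # beta = 1 in this work
```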

Typical audio signal segments in the BEJ environment are shown in Figure 8. Both fault-free and faulty status audios are generated by vehicle impact, making it difficult to directly detect faults in BEJs. Besides, prominent segments of bird sounds are relatively short and distinct from other types of audio. However, the remaining types of audio are difficult to distinguish simply by using thresholds or time-to-peak. Therefore, a VPR model is necessary to achieve sound classification through its ability to extract high-dimensional features.

5. Case Study

5.1. Basic Information

An in situ experiment was conducted on the Jiangyin Bridge, a suspension bridge with two modular BEJs on each side. The experimental layout is shown in Figure 9. A microphone was fixed on a tripod under the main girder to capture audio data around the BEJ. Above the main girder, a camera was temporarily installed to annotate the moment and type of each vehicle passing over the BEJ. The main configurations of the two experimental sensors are listed in Table 3.

For data processing and model testing, we used a desktop PC (CPU: Intel® Core™ i7-6800K; RAM: 32 GB; GPU: NVIDIA GeForce GTX 1080 Ti) with CUDA v10.2 and cuDNN v8.2. The PyTorch deep learning framework was used for model training and evaluation.

5.2. Damage Types of BEJ Faults

During their service, BEJs often experience damage and deterioration within a local range due to the prolonged influence of various factors such as load and environmental conditions (including vehicular traffic, corrosive action, and temperature), leading to the failure of specific BEJ components. Common forms of damage include the following [5, 58]: (1) local congestion, twisting deformation, or even breakage of the center steel beams; (2) long-term wear of the sliding bearing resulting in sliding failure; (3) aging of the sealing rubber, loss of elasticity in the rubber spring, or even fatigue cracking; and (4) corrosion and detachment of the welding points in some parts of the hanger.

Due to the distinct changes in BEJ components caused by damage, the fault-free state and various damage modes will manifest different acoustic features under the influence of vehicular impacts. Therefore, it is logical to categorize different kinds of damage based on impacting audio signals.

In this paper, to fundamentally verify this idea, we selected a fault-free BEJ of Jiangyin Bridge and collected one hour of audio data. Subsequently, we introduced simulated faults by using two steel shims to congest two center steel beams. One hour of vehicle-induced audio signals under this faulty BEJ was then recorded. The practical scenario is illustrated in Figure 10.

5.3. Event Segmentation of Audio Data

To train the ANN classification model for event segmentation, we annotated audio signals using the Praat software [59]. A total of 2010.438 s of signals was obtained, including 1035.562 s of event signals and 974.876 s of nonevent signals, as shown in Table 4. The dataset was split into training and test sets at a ratio of 80% to 20%.

The proposed ANN and other comparison models for audio event segmentation were applied, including the ML model Support Vector Machine (SVM) and threshold-based models STE and STCZR. The performance of different models was evaluated using accuracy as the metric. Hyperparameter optimization of the different models was first conducted to determine their sensitivity and stability in event segmentation. The results are shown in Figure 11.

It can be seen that the ML methods have higher accuracy than the threshold-based methods. Among them, the ANN and SVM models are insensitive to their hyperparameters and achieve accuracies of 93.5%–93.7% and 93.0%–93.1%, respectively. The STE model is sensitive to the predefined threshold, achieving accuracies of 79.1%–90.0%, so a precise threshold setting is required. The STCZR model is insensitive to the threshold, but its accuracy is only slightly above 50%, indicating little disparity between event and nonevent audio signals in STCZR under the BEJ environment.

An 8-second audio sample was selected to apply the event segmentation procedures of the four models mentioned above. As shown in Figure 12, the ANN, SVM, and STE models can accurately segment event signals, with only about 0.05 s of local misidentification at the beginning and end of events. Their main errors involve identifying low-amplitude event signals as nonevent, which has little impact on subsequent VPR applications. Compared to the SVM and STE models, false predictions occur less frequently with the ANN model, indicating its stronger adaptability and reduced need for postprocessing. Consistent with the previous conclusions, the STCZR model misidentifies almost all nonevent segments as event segments and therefore cannot segment event audio.

Additionally, for the ML methods (the ANN and SVM models), the ROC curves under optimal hyperparameters are plotted in Figure 13. The ROC curves hug the upper left corner, indicating that the models have good generalization performance, with AuC values of 0.981 and 0.964, respectively. As listed in Table 5, in terms of model training time and inference efficiency, the threshold-based methods require no data training. Since SVM tuning is based on matrix operations, its training time can be very long for large sample sizes. The ANN model has clear advantages in both training and inference efficiency, exceeding the other algorithms by a factor of 20 in segmentation and recognition efficiency on the original audio. It takes only 10.5 ms to segment and infer 1 s of original audio data, making the ANN suitable for preprocessing BEJ environment audio.

5.4. Fault Detection of BEJs
5.4.1. Dataset Preparation

In this work, in addition to the in-field audio data, we used the DCASE [60], ESC [61], and IUSD [62] OA datasets for dataset expansion. A total of 2103 audio segments were divided into training and test sets at a ratio of 0.8 : 0.2, resulting in a benchmark dataset for BEJ fault detection, with details listed in Table 6. Specifically, for fine-grained recognition of vehicle impact events, four vehicle classes—Car, Bus, Van, and Truck—were labeled based on definitions in bridge standards [63]. A total of 4123 audio segments were obtained, as shown in Table 7.

5.4.2. Cascading ConFormer Model Training and Evaluation for BEJ Fault Detection

Based on the proposed cascading approach shown in Figure 6, three types of ConFormer models were established: environment VPR, vehicle type VPR, and fault detection VPR. A length restriction of 2 seconds was applied to regularize all audio segments, balancing efficiency and precision [64]. Accordingly, random cropping was applied to segments longer than 2 seconds and zero-padding to segments shorter than 2 seconds, as sketched below.
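This length regularization can be sketched as follows (2 s at 16 kHz, i.e., 32,000 samples):

```python
import numpy as np

def fix_length(x, target_len=32000, rng=None):
    """Random-crop clips longer than 2 s; zero-pad shorter ones."""
    rng = rng or np.random.default_rng()
    if len(x) > target_len:
        start = rng.integers(len(x) - target_len + 1)  # random crop offset
        return x[start:start + target_len]
    return np.pad(x, (0, target_len - len(x)))         # zero-pad at the end
```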

The training parameters in this work are listed as follows. For Fbank feature extraction, the number of Mel banks was set to 80. The initial learning rate was set to 0.01 and reduced to 97% of its original value at the end of each epoch to achieve optimal model performance. Considering memory limitations, the batch size for a training epoch was set to 100. The scale factor of the AAMSoftmax loss function (equation (16)) was set to 30 and the margin penalty was set to 0.2. The total number of training epochs was set to 100.
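The per-epoch decay maps directly onto an exponential learning-rate scheduler; the sketch below assumes the Adam optimizer and a hypothetical `train_one_epoch` helper, since the optimizer itself is not named in this paper.

```python
import torch

# model: one of the ConFormer VPR models described above
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.97)

for epoch in range(100):                              # 100 training epochs
    train_one_epoch(model, train_loader, optimizer)   # batch size 100
    scheduler.step()                  # lr <- 0.97 * lr after each epoch
```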

First, the ConFormer VPR model with the MFA module was trained and evaluated, as shown in Figure 14. As indicated by the loss and accuracy curves in Figure 14(a), the model's performance gradually improved and stabilized at around 30 epochs, where the loss was close to 0 and the accuracy close to 100%. In subsequent epochs, the loss and accuracy only oscillated slightly, indicating that the training process was complete. As seen from the ROC curves in Figure 14(b), the model performed well for all four types of audio, achieving a mean average AuC (maAuC) of 0.975. The AuCs for bird sounds and vehicle horn sounds almost reached 1.0, while those for human sounds and vehicle impact sounds were 0.966 and 0.939, respectively. The model's performance could be further improved by collecting data under more working conditions.

The model shown in Figure 14 was obtained after hyperparameter tuning, as shown in Figure 15. Three hyperparameters were compared: (1) the number of Transformer layers in the ConFormer block; (2) the encoder dimension of the ConFormer block; and (3) the voiceprint embedding dimension after the MFA module.

From Figure 15, it can be seen that the number of Transformer layers has a significant impact on both training-set accuracy and test-set maAuC. Away from three layers, classification performance decreases significantly, and additional layers also bring a large amount of computation, so the optimal number of layers is 3. Meanwhile, the encoder dimension and the embedding dimension have no obvious impact on the model's accuracy and maAuC. Therefore, balancing computational efficiency against classification performance, these two hyperparameters are set to 128 and 64, respectively.

Afterward, a second ConFormer model was trained to classify the types of vehicle impact events, with results plotted in Figure 16. From the curves in Figure 16(a), after 80 epochs the model's loss and accuracy were close to 0 and 1, respectively, indicating that training was complete and the model was stable, achieving high classification accuracy under reasonable thresholds. Additionally, Figure 16(b) shows the ROC curve for each vehicle type, with an overall maAuC of 0.925, slightly lower than that of the previous model.

Specifically, the model has an AuC value of nearly 1.0 for Car and Truck, indicating that it can almost perfectly distinguish between these two types of vehicle impact sounds. However, for Bus, the AuC value is 0.764, indicating medium classification accuracy according to the interpretation of the ROC curve [51]. According to Table 7, the number of Bus samples is much lower than that of the other three vehicle types. Therefore, the model's classification accuracy for Bus could be improved by increasing the number of Bus samples.

Finally, a third ConFormer model is cascaded to complete the fault detection of BEJs. As shown in Figure 6, a specific ConFormer model can be applied for each vehicle impact type. In this work, we use Car and Van as two examples for the fault-free and faulty status classification of BEJs, with results shown in Figures 17 and 18. Both models achieve good classification performance for fault detection, with AuC values of 0.886 and 0.910, respectively. Compared to light passenger cars, vans produce larger differences between the two statuses, indicating that stronger wheel impacts provide a better criterion for fault detection of BEJs.

5.4.3. VPR Model Comparison

To verify the performance advantages of our model, we conducted a comparison study with other mainstream deep learning-based VPR models, including the Transformer model [65], the ECAPA-TDNN model [54], and the ResNet18 and ResNet34 models [66]. All models were trained on the environment VPR dataset shown in Table 6. The performance of the models in terms of both precision and efficiency was compared, with results shown in Table 8 and Figure 19.

All models achieved accuracy close to 100%, indicating that they have high classification ability under appropriate thresholds. However, for maAuC evaluation, our ConFormer VPR models had a significant performance advantage over other models, indicating that they have the best classification reliability and robustness. The MFA module further improved the maAuC by 0.01. Due to the introduction of self-attention mechanisms, the Transformer and ECAPA-TDNN models achieved the second-best maAuC values of 0.929 and 0.900, respectively. In contrast, the two ResNet models, which only have basic convolution and residual operations, achieved maAuC values of only 0.771 and 0.758. Thus, the self-attention mechanism has been proven to be an important component for improving the performance of BEJ VPR applications.

In terms of inference speed, as shown in Table 8, using a GPU can significantly accelerate the process. The ConFormer VPR model is the slowest, requiring 16.9 ms on GPU hardware to infer 2 s of audio data, while ResNet18, which primarily employs convolution operations, requires only 4.5 ms. On the other hand, considering that embedded CPU devices are cheaper and lighter than GPU-based equipment, we also benchmarked these models on a CPU. As listed in Table 8, the ConFormer VPR model has medium-level efficiency on a CPU. The ECAPA-TDNN model, owing to parameter pruning optimization, has the highest CPU inference speed, requiring only 32.8 ms to classify a 2-second utterance; it can therefore be applied to edge computing scenarios with lower accuracy requirements.

5.4.4. Discussion of Fault Judgment

As outlined in the “Dataset Preparation” section, the audio data collected over a two-hour period, comprising one hour of audio signals each for the fault-free and faulty BEJs, are used to discuss fault judgment with the combined approach proposed in Figure 6. Each one-hour audio signal is segmented into 12 parts of 5 minutes each for analysis.

The outcomes for each audio segment are illustrated in Figure 20. For an individual audio segment, the classification of vehicle impact is accomplished by the second-level ConFormer model, with the results displayed in Figure 20(a). Subsequently, for each vehicle audio utterance, a fault is inferred by its corresponding type-specific third-level ConFormer model, resulting in a detection of either fault-free or faulty. Ultimately, the proportion of faulty detections is computed for this audio segment. Figure 20(b) plots the proportion of faulty detections for each audio segment.

The results in Figure 20(b) demonstrate that the proportions under the fault-free condition are low, while those under the faulty condition are high. This is a reasonable outcome and signifies the practicality of the proposed method’s application. Furthermore, both the upper and lower margins are set at 0.1 to establish the separation threshold for fault judgment. In this study, the final optional threshold can be chosen from a range of 0.187 to 0.814.

6. Conclusions

Effective inspection and monitoring of BEJs are crucial for bridge maintenance and management. This paper presents a novel methodology for detecting faults of in-service BEJs through VPR. Microphones are placed beneath the BEJs to capture audio data. An ANN model is then employed to segregate nonevent audio utterances from the original signals. Subsequently, the ConFormer VPR model is enhanced for general audio event classification. Finally, a cascading approach is proposed, involving three successive ConFormer VPR models, aiming to discern vehicle impact audio, identify specific vehicle types, and ultimately detect fault status. Additionally, a case study on an in situ bridge has been conducted to verify the proposed method. Conclusions are drawn as follows:

(1) The proposed methodology is a new endeavor to broaden the spectrum of inspection methods for BEJs. The use of a consumer-grade microphone sensor in this study proves to be a cost-effective alternative when compared to more sophisticated NDT sensors such as accelerometers [8], ultrasonic sensors [11], and electromagnetic sensors [15] used in prior research. From a long-term perspective, the audio-based method offers a more economical solution than manual inspections conducted by inspectors.

(2) An ANN classification model is devised and trained for audio event segmentation, achieving an AuC of 0.981. The ANN model exhibits superior accuracy and stability in event segmentation when compared to threshold-based methods.

(3) The original ConFormer model is adapted for VPR and fortified with an MFA strategy. In response to the intricate sound factors associated with BEJs, we adopt a cascading approach using consecutive ConFormer models for classifying environmental sound types, vehicle impact types, and faults. The trained models achieve AuCs of 0.975, 0.925, and 0.886, respectively, demonstrating the feasibility of detecting faults in BEJs based on audio signals. Notably, the enhanced ConFormer VPR models surpass other VPR models such as Transformer and CNN-based models in terms of performance.

In future research, we aim to further advance this novel field and build upon this foundational work. Our plans include the continuous expansion of the audio signal dataset to encompass multiple types of BEJ faults, as well as faults in other bridge components. Additionally, we intend to conduct benchmarking tests to achieve a more detailed localization and classification of damage based on audio signals.

Data Availability

The data used to support the findings of this study can be obtained from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was financially supported by the National Natural Science Foundation of China (grant nos. 52208198, 52238005, 52192663, and 51978514), Interdisciplinary Project in Ocean Research of Tongji University (grant no. 2023-1-YB-03), and National Key Research and Development Program of China (grant no. 2021YFF0501004).