Abstract
This paper proposes a novel approach, cross-scale with attention normalizing flow (CSA-Flow), enhanced with channel-attention (CA) and self-attention (SA) modules for high-speed railway anomaly detection in complex industrial backgrounds, aiming to reduce the manual workload of the primary maintenance of high-speed electric multiple units. Detecting defects in industrial environments, characterized by intricate backgrounds and unclear subjects, poses significant challenges. To address this, CSA-Flow introduces a channel feature extraction module that combines pretrained convolutional neural network models with a CA module to capture features at different scales, and employs an SA module whose larger receptive field captures more contextual information. The performance evaluation of CSA-Flow on the MVTec-AD dataset demonstrates an impressive area under the receiver operating characteristic curve (AUROC) score of 98.7%, with an equally remarkable score of 98.4% across all object classes. To further assess the effectiveness of CSA-Flow in complex background scenarios, we introduce a dedicated dataset specifically designed for high-speed rail braking devices (HSRBDs). The experimental results establish the superiority of CSA-Flow over current state-of-the-art approaches in terms of both AUROC and recall scores, validating its exceptional capability for detecting anomalies in complex industrial backgrounds.
1. Introduction
Anomaly detection is a critical aspect within the field of railway detection. Safety is the foundation and primary concern of railway detection, directly impacting the lives of individuals and public property. Train accidents can be attributed to three primary factors: rail defects [1, 2], visual anomalies on the railway [3, 4], and misestimation of or incorrect operation by the locomotive driver [4]. Any deviation from normal conditions is deemed an anomaly. In recent years, rapid advancements in image processing technology have drawn increasing attention to the detection of anomalies in the railway system. By prioritizing anomaly detection, we can ensure passenger well-being and safeguard critical public infrastructure.
In industrial applications, manual anomaly detection remains predominant in handling detection tasks. These approaches involve comparing visual texture features [5] between defective and normal samples to determine the presence of anomalies. However, manual detection methods often suffer from inefficiencies. As a result, deep-learning-based anomaly detection methods have gained traction in railway detection due to their inherent characteristics of speed, nondestructiveness, and high precision [6, 7]. Within the domain of high-speed railways, anomaly detection can be categorized into three main approaches: unsupervised methods [8], object detection methods [9–11], and defect segmentation methods [12, 13].
In real-world industrial detection, the scarcity of abnormal samples and the limited availability of labeled data present significant challenges for industrial anomaly detection. Additionally, in industrial applications, the backgrounds of detection objects are often complex, further compounded by the influence of moving parts, which significantly raises the difficulty of anomaly detection in these settings. The MVTec-AD dataset [14], which serves as a benchmark for anomaly detection, provides objects with clear boundaries, yet previous methods have struggled to effectively incorporate contextual information. Consequently, our attention is directed toward the self-attention mechanism as a potential solution to enhance anomaly detection in industrial scenarios.
Given the complexities associated with complex backgrounds and unclear subjects in industrial anomaly detection, this paper aims to address practical challenges in this domain. During the training process, our focus lies in enabling the network to learn the distribution of normal samples only while differentiating between normal and abnormal samples during testing.
This approach is commonly referred to as semisupervised learning [15].
To tackle the aforementioned challenges, we review related methods in Section 2; among them, normalizing flow (NF) methods demonstrate excellent anomaly localization and industrial defect detection capabilities [16, 17]. We propose a semisupervised anomaly detection method named cross-scale with attention normalizing flow (CSA-Flow), which builds on NF [18, 19]. CSA-Flow specifically targets the problem of complex backgrounds and unclear subjects, allowing for the recognition and visualization of the defect regions within the image. It employs a fully convolutional architecture and attention modules to establish global dependencies and expand the receptive field of the image. We evaluate the performance of CSA-Flow using the MVTec-AD dataset [14], designed to mimic real-world industrial inspection scenarios, as well as the BeanTech Anomaly Detection (BTAD) dataset [20]. Our proposed method achieves state-of-the-art accuracy in abnormality detection. Additionally, we apply CSA-Flow to a real high-speed rail braking device (HSRBD) dataset, one of the key components on the train, demonstrating its effectiveness in achieving high performance in real-world industrial applications.
The contributions of this paper are outlined as follows:
(1) Proposal of CSA-Flow, which incorporates the channel-attention (CA) module and the self-attention (SA) module to enhance anomaly detection accuracy by effectively capturing key features from input images.
(2) Achievement of state-of-the-art accuracy with CSA-Flow on the MVTec-AD dataset and the BTAD dataset.
(3) Establishment of a real HSRBD dataset with complex backgrounds for anomaly detection, on which CSA-Flow also achieves state-of-the-art accuracy.
2. Related Work
2.1. Reconstruction-Based Methods
Reconstructed image anomaly detection is a widely employed unsupervised approach for anomaly detection. The fundamental principle underlying this method is to model normal data in order to identify abnormal data that deviates from the learned model [21]. The core framework of this approach involves training a generative model using normal datasets and subsequently employing the model to reconstruct unseen data. By establishing a threshold for reconstruction error, any reconstructed data surpassing this threshold is considered anomalous. This methodology allows for the identification of anomalies based on deviations from the expected reconstruction patterns.
The autoencoder (AE) [22] is a widely utilized technique for anomaly detection, relying on the principle of reconstruction. An AE is a type of neural network that compresses input data into a lower-dimensional latent space and subsequently reconstructs it back to its original form through a decoder. In the AE architecture, the encoder processes the input data, extracting meaningful features and encoding them into a compressed representation. The decoder then decodes this representation, reconstructing the data to resemble the original input. During the training phase, the AE aims to minimize the reconstruction error, ensuring that the output data closely matches the input data.
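As a minimal illustration of this reconstruction principle, the following PyTorch sketch shows how a convolutional AE compresses an image, reconstructs it, and converts the reconstruction error into an image-level anomaly score; the layer sizes and 3-channel input are illustrative choices rather than a specific published architecture.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Minimal convolutional autoencoder for reconstruction-based anomaly detection."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(              # compress the image into a latent code
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(              # reconstruct the image from the latent code
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_score(model, x):
    """Per-image mean squared reconstruction error; larger means more anomalous."""
    with torch.no_grad():
        recon = model(x)
    return ((x - recon) ** 2).mean(dim=(1, 2, 3))
```

During training, only defect-free images are used to minimize the reconstruction error; at test time, images whose score exceeds a chosen threshold are flagged as anomalous.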
Similar to the decoding part of an AE, the generator in generative adversarial networks (GANs) can be used for anomaly detection. Rudolph et al. [23] proposed to learn an inverse generator after training a GAN and to use both for reconstruction and error evaluation.
Schlegl et al. [21] introduced AnoGAN, which aims to learn the manifolds of normal images from potential spaces, enabling the identification of anomalies in new images. Zenati et al. [24] trained a BiGAN model that simultaneously maps the image space to the latent space, showcasing improved statistics and computational outcomes. Akcay et al. [25] proposed GANomaly, building upon the concept of training GANs to learn the distribution of normal data and subsequently reconstructing input data using GANs. These innovative approaches leverage the power of deep learning and generative models to detect anomalies by learning normal data patterns and effectively reconstructing input data.
2.2. Embedding Similarity-Based Methods
These methods employ deep neural networks to extract meaningful vectors [26] or image blocks [27] to effectively describe the entire image for anomaly detection. Cho et al. [17] introduced a method known as semantic pyramid anomaly detection (SPADE). SPADE utilizes k-nearest neighbor (kNN) methods and leverages deep pretrained features. The proposed method focuses on aligning abnormal images with a series of similar normal images. SPADE introduces a novel approach that utilizes a multiresolution feature pyramid, allowing for a comprehensive analysis of image features across different scales.
Defard et al. [28] introduced PaDiM, a method that leverages pretrained convolutional neural networks (CNNs) for patch embedding. PaDiM uses multiple Gaussian distributions to generate a probability representation of normal data and employs the correlation among different semantic layers of the CNN to accurately identify the location of defects.
These methods employ the extraction of nominal features from pretrained backbone networks, which are then utilized to construct a memory bank. During testing, the features extracted from the test images are compared against the entries in the memory bank. One significant advantage of this approach is its rapid speed, as the memory bank is built during training, requiring only feature comparisons during testing. However, a notable drawback is that the images stored in the memory bank must exhibit a high level of alignment, and the approach might not perform as well as other methods on large datasets.
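The following sketch illustrates the general memory-bank idea described above (it is not the exact procedure of SPADE or PaDiM): nominal features extracted from defect-free images are stored during training, and test features are scored by their distance to the nearest stored entries.

```python
import torch

def build_memory_bank(normal_features):
    """Stack per-image feature vectors extracted from defect-free training images: (N, D)."""
    return torch.cat(normal_features, dim=0)

def knn_anomaly_score(memory_bank, test_features, k=5):
    """Score each test feature by its mean distance to the k nearest nominal features."""
    dists = torch.cdist(test_features, memory_bank)        # (M, N) pairwise L2 distances
    knn_dists, _ = dists.topk(k, dim=1, largest=False)     # k smallest distances per test image
    return knn_dists.mean(dim=1)                           # higher value -> more anomalous
```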
2.3. NFs
NF is a distinctive generative model that sets itself apart from other models by its capability to generate distributions that are easily manageable. This feature enables efficient and accurate sampling as well as density evaluation. NF achieves this by employing reversible and differentiable mappings to transform a simple probability distribution, such as a normal distribution, into a more complex one [29]. In the NF framework, a sample's density is evaluated by mapping the sample back to the base distribution: the density of the sample is the product of the transformed sample's density and the volume change induced by the transformation. According to the change-of-variables formula, the volume change is determined by the absolute value of the Jacobian determinant at each transformation. NICE [18] and Real-NVP [30] are two notable examples of classic NFs that possess high speed in both the forward and reverse processes. There are still some limitations in NF, especially when the distribution of abnormal data is very similar to the distribution of normal data, which can produce false positives.
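Concretely, for an invertible, differentiable mapping $z = f(y)$ from a feature $y$ to the latent variable $z$ with a simple base density $p_Z$, the change-of-variables formula used by NFs reads

$$p_Y(y) = p_Z\bigl(f(y)\bigr)\,\left|\det \frac{\partial f(y)}{\partial y}\right|, \qquad \log p_Y(y) = \log p_Z\bigl(f(y)\bigr) + \log\left|\det \frac{\partial f(y)}{\partial y}\right|,$$

which is the form used for the negative log-likelihood loss in Section 3.4.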
In the field of anomaly detection, DifferNet [19] employs the NF estimation method to perform accurate likelihood tests, resulting in effective anomaly detection at the image level. However, due to the flattening of the output in DifferNet, it fails to locate the specific anomaly regions within the detected defects. To address this limitation, Gudovskiy et al. [31] introduced CFlow, which utilizes a discriminant pre-training encoder followed by a multiscale-generating decoder. This architecture allows for explicit judgment of the probability of encoding features. However, its effectiveness may vary when applied to more complex datasets.
3. Method
The proposed method, called CSA-Flow, is built upon the foundation of CS-Flow [32], a cross-scale normalized flow approach. CSA-Flow integrates the CA module and SA module to enhance the accuracy on common and realistic datasets while maintaining the high performance achieved by CS-Flow. Figure 1 provides an overview of the proposed method, illustrating its key components and workflow.

Similar to DifferNet [19], our approach initially involves training a model to learn features from defect-free images, enabling the detection of anomalies. During the evaluation process, we utilize density estimation of the extracted features $y \in Y$ to assign a similarity measure to each image $x \in X$. A lower similarity score indicates a higher likelihood of an anomaly being present. Density estimation is achieved through a bijective mapping, learned by the NF, from the unknown distribution $p_Y$ in the feature space $Y$ to a Gaussian distribution $p_Z$ in the latent space $Z$. Figure 1 illustrates the pipeline of CSA-Flow, depicting the various stages and transformations involved in the process.
For each category, we begin by computing the receiver operating characteristic (ROC) curve and identifying the optimal threshold $\theta$, which maximizes the ratio of the true positive rate (TPR) to the false positive rate (FPR). Utilizing this selected threshold $\theta$, we can determine whether a test image $x$ is abnormal or not:

$$\mathcal{A}(x)=\begin{cases}1\ (\text{abnormal}), & \text{if } s(x) > \theta,\\ 0\ (\text{normal}), & \text{otherwise},\end{cases}$$

where $s(x)$ denotes the anomaly score assigned to image $x$.
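A minimal sketch of this threshold selection with scikit-learn, assuming a labeled validation set with label 1 for abnormal images and anomaly scores where higher means more anomalous (the small epsilon is our own guard against division by zero when the FPR is 0):

```python
import numpy as np
from sklearn.metrics import roc_curve

def optimal_threshold(y_true, scores, eps=1e-8):
    """Choose the threshold that maximizes TPR / FPR on a labeled validation set."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)   # scores: higher = more anomalous
    ratio = tpr / (fpr + eps)                          # eps avoids division by zero at FPR = 0
    return thresholds[np.argmax(ratio)]

def is_abnormal(score, threshold):
    """Flag a test image as abnormal when its anomaly score exceeds the threshold."""
    return score > threshold
```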
3.1. Channel Feature Extraction Module
Bergman and Hoshen [26] have demonstrated the exceptional performance of ImageNet-pretrained feature extraction models for anomaly detection. Hence, we adopt feature extraction using EfficientNets. A pretrained CNN possesses the capability to provide relevant features for anomaly detection [33]. Consequently, we employ a CNN pretrained on ImageNet to extract features from the input image $x$. To enhance the descriptive capacity of the feature maps, we conduct feature extraction on images at varying resolutions: the input image is resampled to multiple scales using techniques such as upsampling and strided convolutions. The NF architecture excels at density estimation on such densely sampled inputs, enabling it to effectively preserve detailed location and context information.
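The following PyTorch sketch illustrates this multi-scale extraction with a frozen, ImageNet-pretrained EfficientNet; for brevity it uses the torchvision EfficientNet-B5 feature stack and illustrative scale factors rather than the specific intermediate layer and resolutions listed in Section 4.2.

```python
import torch
import torch.nn.functional as F
from torchvision.models import efficientnet_b5, EfficientNet_B5_Weights

# Frozen ImageNet-pretrained backbone; only its convolutional feature stack is kept.
backbone = efficientnet_b5(weights=EfficientNet_B5_Weights.IMAGENET1K_V1).features.eval()
for p in backbone.parameters():
    p.requires_grad = False

def multi_scale_features(image, scales=(1.0, 0.5, 0.25)):
    """Return one feature map per input resolution for a (B, 3, H, W) image tensor."""
    feats = []
    for s in scales:
        x = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        with torch.no_grad():
            feats.append(backbone(x))                  # coarse feature map at this scale
    return feats
```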
The first subnetwork is the channel feature extraction module, which combines the CA module [34] with CNN feature extraction. It leverages scalar values to represent and evaluate the significance of each channel in an image. Let $F \in \mathbb{R}^{C \times H \times W}$ be the image feature tensor in the network, where $C$ is the number of channels, $H$ is the feature height, and $W$ is the feature width [35]. Figure 2 illustrates the architecture of CA. The prediction is generated using the following formula:

$$\hat{F} = \mathrm{CA}(F),$$

where $F$ represents the input of CA and $\mathrm{CA}(\cdot)$ corresponds to the CA module.
By leveraging the channel feature extraction module, we believe that the model becomes more adept at focusing on valuable information.
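For illustration, a squeeze-and-excitation style channel attention block, one common realization of CA (the exact design of the module in [34] may differ), can be sketched as follows:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: score each channel, then rescale the feature map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # squeeze: global spatial average per channel
        self.fc = nn.Sequential(                      # excitation: learn per-channel importance
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, f):                             # f: (B, C, H, W)
        b, c, _, _ = f.shape
        w = self.fc(self.pool(f).view(b, c)).view(b, c, 1, 1)
        return f * w                                  # reweight the channels of the input feature
```

In our setting the channel dimension would be 304 (see Section 4.2), e.g., `ChannelAttention(304)`.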
3.2. Cross-Scale Flow
The cross-scale NF method proves to be highly effective in image anomaly detection. It processes feature maps at different scales to capture diverse information, leveraging the interplay between these scales to share relevant insights. Furthermore, the module's fully convolutional nature ensures that spatial dimensions are preserved, enabling accurate localization of anomalies. The cross-scale flow consists of a series of affine transformations implemented through coupling blocks. Following the coupling blocks described in the study of Dinh et al. [30], we adopt the basic architecture of Real-NVP, as illustrated in Figure 3. Each input tensor $y_{\mathrm{in}}$ is randomly divided into two parts, $y_{\mathrm{in},1}$ and $y_{\mathrm{in},2}$, and the subnetworks estimate the scale and offset coefficients, denoted as $s_i$ and $t_i$, respectively. The obtained parameters are then employed as shown:

$$y_{\mathrm{out},2} = y_{\mathrm{in},2} \odot e^{s_1(y_{\mathrm{in},1})} + t_1(y_{\mathrm{in},1}),$$
$$y_{\mathrm{out},1} = y_{\mathrm{in},1} \odot e^{s_2(y_{\mathrm{out},2})} + t_2(y_{\mathrm{out},2}),$$

where the symbol $\odot$ denotes the element-wise product operation.
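A minimal sketch of one such affine coupling block is given below; the subnetwork here is a plain convolutional stack and the tanh soft clamping is an illustrative stabilization choice, whereas the actual cross-scale subnetworks additionally exchange information between scales, which this single-scale sketch omits.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One Real-NVP style coupling block operating on a (B, C, H, W) feature tensor."""
    def __init__(self, channels, hidden=128):
        super().__init__()
        half = channels // 2
        # The subnetwork predicts scale (s) and offset (t) for the other half of the channels.
        self.subnet = nn.Sequential(
            nn.Conv2d(half, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 2 * half, 3, padding=1),
        )

    def forward(self, y):
        y1, y2 = y.chunk(2, dim=1)                    # split the channels into two halves
        s, t = self.subnet(y1).chunk(2, dim=1)        # scale and offset from the first half
        s = torch.tanh(s)                             # soft clamping keeps the scale stable
        y2 = y2 * torch.exp(s) + t                    # affine transform of the second half
        log_det = s.sum(dim=(1, 2, 3))                # log |det J| contributed by this block
        return torch.cat([y1, y2], dim=1), log_det
```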
3.3. SA Module
The neural network processes a vast number of vectors of varying sizes and connects them. However, this approach may not effectively uncover the intrinsic relationships among the inputs during training, resulting in suboptimal learning outcomes. To address this limitation, CSA-Flow incorporates an SA module, as depicted in Figure 4, to emphasize the correlations between features at different scales. The self-attention mechanism enables the model to establish global dependencies and expand the receptive field of an image. Compared to CNNs, the SA module has a larger receptive field, allowing it to capture more contextual information.

The attention module can be represented by a set of queries and key-value pairs. The output is computed as a weighted sum of the values, where each weight is determined by the correlation between the query and the key [36]. The representation, which captures the relationship between pixels by using the dot-product of the query and the key as the weight, is formulated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $Q$, $K$, and $V$ are the query, key, and value matrices computed from the extracted feature, $\mathrm{Attention}(Q, K, V)$ is the feature map that contains the information necessary for detecting co-occurrence relationship anomalies, and $d_k$ represents the depth of $K$ [36].
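A single-head version of this scaled dot-product self-attention, applied to the spatial positions of a feature map with 1 × 1 convolution projections and a residual connection (both illustrative choices), can be sketched as:

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention over the spatial positions of a feature map."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)     # query projection
        self.k = nn.Conv2d(channels, channels, 1)     # key projection
        self.v = nn.Conv2d(channels, channels, 1)     # value projection

    def forward(self, f):                             # f: (B, C, H, W)
        b, c, h, w = f.shape
        q = self.q(f).flatten(2).transpose(1, 2)      # (B, HW, C)
        k = self.k(f).flatten(2)                      # (B, C, HW)
        v = self.v(f).flatten(2).transpose(1, 2)      # (B, HW, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)    # (B, HW, HW) pairwise weights
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return out + f                                # residual connection keeps the input features
```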
3.4. Negative Log-Likelihood Loss
The objective of the training process is to maximize the likelihood of the extracted features $y \in Y$ under the bijective mapping between the latent space $Z$ and the feature space $Y$. We adopt the likelihood formulation proposed in the study of Rudolph et al. [32] as follows:

$$p_Y(y) = p_Z(z)\left|\det\frac{\partial z}{\partial y}\right|,$$

which aims to maximize the log-likelihood. Similar to the study of Rudolph et al. [32], we utilize the negative log-likelihood loss to train the proposed model as follows:

$$\mathcal{L}(y) = \frac{\|z\|_2^2}{2} - \log\left|\det\frac{\partial z}{\partial y}\right|,$$

where $\|z\|_2^2$ represents the squared $\ell_2$-norm of a vector in n-dimensional Euclidean space, defined as the sum of the squares of its components. The term $\left|\det\frac{\partial z}{\partial y}\right|$ represents the absolute value of the determinant of the Jacobian. To ensure stability, we constrain the gradients of the $\ell_2$-norm to be equal to one.
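Given the latent output $z$ and the accumulated log-determinant returned by the coupling blocks, this loss can be computed directly; a minimal PyTorch sketch under these definitions is:

```python
import torch

def nll_loss(z, log_jac_det):
    """Negative log-likelihood for a standard Gaussian latent: 0.5 * ||z||^2 - log|det J|.

    `z` is the (B, C, H, W) latent tensor; `log_jac_det` is the per-sample sum of the
    coupling blocks' log-determinants, shape (B,).
    """
    squared_norm = z.pow(2).sum(dim=(1, 2, 3))        # squared L2-norm per sample
    return torch.mean(0.5 * squared_norm - log_jac_det)
```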
4. Experiments
4.1. Datasets and Metrics
We evaluate the proposed method in various defect detection scenarios using the MVTec anomaly detection (MVTec-AD) dataset [6]. The MVTec-AD dataset, introduced by Bergmann et al. [14, 27], is designed to simulate anomaly detection in industrial applications. It offers high-resolution images with variations in multiple scales and lighting conditions. The dataset consists of 15 classes, including 10 object classes and 5 texture classes, each containing both normal and abnormal samples. The training set exclusively comprises defect-free images, while the test set consists of both normal and abnormal images.
The BTAD dataset [20] includes 2,540 images of three industrial product categories. The training sets exclusively include normal samples, while the testing sets contain both normal and abnormal samples. To assess the performance of the proposed method in real industrial applications, we curated a real-world dataset called the HSRBD dataset. This dataset comprises four scenarios that represent real-world HSRBDs. Each scenario includes four different industrial components with foreign matter of unknown size. Within each scenario, there are between 160 and 220 high-resolution images. The presence of dynamic lighting and moving parts in each scenario adds complexity to the anomaly detection task, making it more closely aligned with actual application scenarios, as illustrated in Figure 5.

To evaluate the performance of the proposed method, we compute the area under the receiver operating characteristic curve (AUROC) on the publicly available MVTec-AD and BTAD datasets. In industrial applications, a more intuitive metric is needed; therefore, we also compute the recall, which represents the detection rate of anomalies, on the real-world HSRBD datasets. The TPR and FPR are defined as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN},$$

where $TP$, $FP$, $TN$, and $FN$ denote the numbers of true positives, false positives, true negatives, and false negatives, respectively.
AUROC reflects classifier performance by measuring the area under the ROC curve [37]; a larger AUROC value indicates better classification accuracy. On the other hand, the recall rate, also known as the detection rate, measures the proportion of positive cases correctly identified by the classifier. The recall rate is a measure of coverage and is equivalent to sensitivity.
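Both metrics can be computed from image-level labels and anomaly scores; a minimal sketch with scikit-learn, assuming label 1 for abnormal images and higher scores for anomalies, is:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, recall_score

def evaluate(y_true, scores, threshold):
    """Image-level AUROC plus recall of the abnormal class (Recall_Ano) at a fixed threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    auroc = roc_auc_score(y_true, scores)
    y_pred = (scores > threshold).astype(int)          # 1 = predicted abnormal
    recall_ano = recall_score(y_true, y_pred, pos_label=1)
    return auroc, recall_ano
```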
4.2. Implementation Details
To achieve a balanced combination of feature semantic level and spatial resolution, we utilize the output of the 36th layer of the EfficientNet-B5 model pretrained on ImageNet as the feature extractor in all our experiments. For the CA module, we set the input and output channel sizes to 304 to match the dimensionality of the extracted features. To standardize the input image size, we resize the images to 1,024 × 1,024 for the real-world HSRBD datasets and extract features at three different scales. For the MVTec-AD dataset, we resize the input images to 768 × 768. In our implementation, we employ four coupling blocks; the internal networks of the first three blocks use a different convolutional kernel size than the last block, and the negative slope of the leaky ReLU and the clamp parameter are fixed accordingly. For optimization, we utilize the Adam algorithm [38] with weight decay and momentum parameters $\beta_1$ and $\beta_2$. We train the CSA-Flow model for 240 epochs on the MVTec-AD and BTAD datasets and for 480 epochs on the real-world HSRBD datasets. The training process was performed using an NVIDIA RTX 3060 12 GB GPU.
4.3. Anomaly Detection
We conducted experiments on the MVTec-AD dataset, which consists of 10 classes of objects and 5 classes of textures, to evaluate the performance of CSA-Flow. The training set exclusively contains defect-free images, while the test set includes both normal and abnormal images. We compared the performance of CSA-Flow with other anomaly detection models, including STFPM [39], GANomaly [25], SPADE [40], PaDiM [28], DifferNet [19], and CS-Flow [32], using the AUROC metric.
The results, as shown in Table 1, demonstrate that the CSA-Flow model outperformed or achieved comparable performance to previous models in nearly half of the classes. In particular, in terms of AUROC scores, CSA-Flow exhibited excellent performance compared to reconstruction-based methods. Given the research target of this paper, we pay particular attention to the object classes, as they are more consistent with the goal of high-speed rail inspection.
In Table 2, we present a comparison with a basic convolutional AE trained using MSE and MSE + SSIM losses. The results demonstrate that CSA-Flow achieves anomaly detection performance on the BTAD dataset comparable to its performance on MVTec-AD.
In the HSRBD datasets, we conducted tests on four different scenarios of real-world HSRBD to evaluate the performance of the CSA-Flow model. Remarkably, the CSA-Flow model achieved the highest AUROC score compared to other models. To provide a more comprehensive evaluation, we proposed the use of Recall_Ano as a metric to assess the models’ ability to detect abnormal samples. In industrial applications, detecting anomalies holds greater significance, considering the challenges posed by complex backgrounds and unclear subjects. Consequently, we believe it is essential to employ Recall as an evaluation metric.
We tested the AUROC and Recall_Ano on HSRBD datasets, and the results are shown in Tables 3 and 4. Notably, the CSA-Flow model outperforms previous models in the context of industrial applications. Figure 6 shows the accuracy comparison between CS-Flow and CSA-Flow in the HSRBD dataset. Our CSA-Flow model is significantly better than the original network.

These results suggest that existing methods struggle to effectively detect anomalies in scenarios with complex backgrounds and unclear subjects. In contrast, the proposed CSA-Flow model demonstrates outstanding performance in such industrial settings.
The primary goal of anomaly detection is not only to classify anomalies but also to segment the abnormal parts. While the CSA-Flow model does not perform pixel-level evaluation, it uses anomaly scores to identify and locate defect regions. By analyzing these scores, we can effectively identify abnormal areas. In the HSRBD datasets, where moving parts are considered normal, CSA-Flow demonstrates robustness and aligns with real-world scenarios. Although CSA-Flow is not explicitly designed for pixel segmentation, we assign anomaly scores to local positions of the feature map by aggregating the squared values of the output tensor along the channel dimension. By leveraging the high norms in the output tensors, we can accurately locate defects and assess them quickly. Figure 7 showcases localization on MVTec-AD, Figure 8 demonstrates localization on BTAD, and Figure 9 exhibits localization on HSRBD, highlighting CSA-Flow's accurate localization performance, particularly in industrial settings.
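A sketch of this localization step under the assumptions above: the squared latent values are averaged along the channel dimension at each scale, upsampled to the input resolution, and averaged into a single anomaly map (details of the aggregation in our implementation may differ).

```python
import torch
import torch.nn.functional as F

def anomaly_map(latents, image_size):
    """Turn multi-scale latent tensors z into a per-pixel anomaly map.

    `latents` is a list of (B, C, H_s, W_s) tensors; higher values mark likely defects.
    """
    maps = []
    for z in latents:
        m = z.pow(2).mean(dim=1, keepdim=True)                # channel-wise mean of squared values
        m = F.interpolate(m, size=image_size, mode="bilinear", align_corners=False)
        maps.append(m)
    return torch.stack(maps, dim=0).mean(dim=0).squeeze(1)    # (B, H, W) map averaged over scales
```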



4.4. Ablation Study
To assess the effectiveness of the attention modules in our model, we conducted ablation experiments involving different subnetwork combinations. Specifically, we compared the AUROC, recall, and accuracy scores obtained by including both the CA and SA modules against those obtained with only one of the modules on the HSRBD datasets. The results of these experiments are presented in Table 5. Since the HSRBD datasets are collected from real train operation, foreign bodies less than 10 mm in diameter might cause a 5.57% redundancy in the accuracy. The comparison shows that the inclusion of both the CA and SA modules significantly enhances the accuracy of defect recognition in real-world industrial defect scenarios, thereby improving the detection efficiency of high-speed electric multiple units (EMUs).
In industrial detection, anomaly detection is widely applied to the primary maintenance of high-speed EMUs. The time allotted for this primary maintenance is tight, and current process regulations continue to shorten it; therefore, the improvement in accuracy can greatly reduce the workload of manual review, especially when the inspection volume is large.
5. Conclusion
In this paper, we presented CSA-Flow, which combines cross-scale NF with attention modules, and applied it to the practical inspection of high-speed EMUs. We aim to improve anomaly detection accuracy and reduce the workload of manual review. We also introduced the channel feature extraction module for feature extraction at different scales, and our experiments demonstrate the promising potential of CSA-Flow. We believe that evaluating the performance of a network in industrial applications is crucial. Existing networks often face challenges in industrial settings due to factors such as lighting, background information, and texture, which can impact detection results. As shown in Table 2, our proposed method excels at detecting foreign bodies in complex backgrounds. We also introduced more intuitive metrics that are highly relevant in industrial applications, such as the detection rate. Consequently, we evaluated the recall of CSA-Flow on the HSRBD datasets, and the results demonstrated that our method achieved the highest anomaly detection rate. Although CSA-Flow does not perform precise pixel-level segmentation, we can utilize the anomaly scores to locate abnormal parts. Future research will focus on improving speed and advancing pixel-level segmentation capabilities.
Data Availability
Data underlying the results presented in this paper are available in [14, 20]. Other generated data are not publicly available at this time but may be obtained from the authors upon reasonable request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the Funds for International Cooperation and Exchange of the National Natural Science Foundation of China (grant no. 61960206010).