Abstract

Ensuring compliance with safety regulations on protective attire is essential for the safety and security of those working on substation construction sites. However, relying on supervisors to monitor workers in real time, either on site or through remote surveillance video, is both impractical and inefficient. This study presents FFA-YOLOv7, a deep learning network based on an improved version of YOLOv7, to detect workers' attire violations in real time in power construction site surveillance. In YOLOv7, the feature pyramid network (FPN) of the neck stage is constructed through continuous upsampling and skip connections for feature fusion, following the continuous downsampling of the backbone. However, this process can result in the loss of precise shallow positional information. To tackle this issue, we introduce a novel feature fusion pathway into the FPN architecture, enabling each layer to fuse not only feature maps from the same level of the downsampling stage but also feature maps from shallower levels. This approach combines precise positional information from shallow layers with rich semantic information from deep layers. Additionally, we apply attention after feature fusion in each layer to optimize the fused feature maps and achieve higher detection accuracy. For comparative experiments, we trained six variants of the YOLO model as detectors on a dataset gathered from realistic construction sites. The experimental results indicate that our proposed FFA-YOLOv7 attained a detection precision of 95.92% and a recall rate of 97.13%, demonstrating high accuracy and a low rate of missed detections. These outcomes satisfy the requirements for robust and accurate detection of violations on real-world power construction sites.

1. Introduction

The construction of electric power infrastructure is a crucial component for ensuring the smooth transmission and distribution of energy in substations. The safe and efficient construction of these sites is essential for maintaining reliable power grids. Unfortunately, accidents resulting from non-compliant wearing of work clothes are common occurrences in substation construction, jeopardizing the safety of workers and disrupting the normal operation of the site. Ensuring that workers wear the appropriate attire and comply with safety regulations is therefore of utmost importance. For managers overseeing these sites, identifying and addressing violations of safety protocols are crucial for maintaining a safe and productive work environment.

In the past, the assessment of workers' attire and behavior at substation construction sites was primarily carried out through manual inspections performed by security personnel. However, this method is both time-consuming and labor-intensive, and manual inspections may not cover all workers in real time, especially on large-scale construction sites. With the advancement of video surveillance technology, many power construction sites have installed monitoring systems that transmit video footage to the substation's monitoring and dispatching center over the network, allowing security officers on duty to monitor workers' activities in real time and identify violations through surveillance video. Nevertheless, the present monitoring methods still depend on manual inspection and fail to fully exploit the capabilities of intelligent video surveillance technology, limiting the ability to identify violations efficiently and accurately across various construction scenarios. There is therefore a need to develop a more efficient and accurate method for identifying construction violations based on intelligent video surveillance technology.

This paper puts forward a novel deep learning approach that detects wearing violations at substation construction sites more efficiently and accurately than conventional methods. The proposed network can identify not only straightforward wearing violations but also more intricate ones by analyzing the distance between objects. Furthermore, the network is trained end to end on a comprehensive dataset that includes authentic images captured from actual power construction sites and synthetic images generated through data augmentation.

The main contributions of this paper are as follows:

(1) An enhanced variant of YOLOv7 is proposed, which introduces a new feature fusion pathway within the FPN to effectively integrate accurate positional information from shallow layers with rich semantic information from deep layers. Additionally, attention mechanisms are incorporated into these fusion layers to enhance the feature representation after fusion.

(2) A deep learning approach is proposed for the real-time detection of worker attire violations in surveillance videos obtained from substation construction sites. In addition, a dataset has been curated using a range of data augmentation techniques; it consists of videos captured at authentic power construction sites and covers six commonly encountered targets for attire violation detection.

The remainder of this paper is organized as follows. Section 2 provides an overview of the related work. In Section 3, we present a detailed description of the proposed network architecture. To assess the effectiveness of our method, we design experiments in Section 4 and discuss the results in Section 5. Finally, we present our conclusions in Section 6.

2. Related Work

Advancements in video surveillance technology and wireless mobile networks have facilitated real-time monitoring of substations. However, the video surveillance systems currently used by electric power enterprises have certain limitations, as examined by Jiangtao et al. [1], who introduced key technologies that can be incorporated into a new substation security video surveillance system to address these shortcomings. Other limitations of current substation video monitoring systems remain. For instance, Yang et al. [2] highlighted that the layout of video sensor equipment on substation construction sites lacks scientific guidance, resulting in incomplete three-dimensional monitoring coverage; they then proposed a video surveillance system that provides full-coverage monitoring for substation construction sites. Moreover, Lu et al. [3] proposed an intelligent monitoring solution for power substations utilizing big data theory and intelligent analysis algorithms, aiming to help monitoring personnel interpret alarm signals and reduce the workload of substation staff. However, existing video monitoring systems for substations are constrained to basic functionalities such as video capture, storage, and playback, and lack effective video data analysis capabilities. Furthermore, the monitoring of substation workers is still conducted manually, without fully capitalizing on the potential of intelligent video surveillance technology.

Deep learning technology has witnessed continuous advancements, particularly in convolutional neural networks (CNNs). Among CNN-based detectors, the You Only Look Once (YOLO) series stands out for its exceptional performance in object detection, offering high accuracy, efficiency, and real-time capability. The YOLO series was initially introduced in 2015 with the release of YOLOv1 [4]. This pioneering single-stage detection network addressed the slow inference speed of two-stage detection networks while maintaining commendable detection accuracy. Subsequent versions, including YOLOv2 [5] and YOLOv3 [6], further improved upon the original model. YOLOv3 introduced the Darknet-53 residual module and the feature pyramid network (FPN) architecture, enabling object prediction at multiple scales and facilitating multiscale fusion. YOLOv4 [7] and YOLOv5 have since incorporated various enhancements based on YOLOv3. YOLOv7, released in 2022 [8], introduces the innovative extended ELAN architecture, which enhances the network's self-learning ability without disrupting the original gradient path. Furthermore, YOLOv7 adopts a model scaling approach for concatenation-based models, generating models of different scales to accommodate practical tasks and meet varied detection requirements.

Previous research on construction safety monitoring has predominantly concentrated on helmet detection. Several studies have proposed helmet detection methods using YOLOv5, a popular deep learning technology [9, 10], and other researchers [11–15] have advanced helmet detection by refining networks based on YOLOv5. Furthermore, CNNs have been employed to detect safety vests [16–18] and safety belts [19–21] worn by workers, as well as insulators [22–24], in surveillance videos of power substations. While these studies have yielded promising results, they primarily focus on detecting a single object class and cannot detect multiple object classes simultaneously. Consequently, these methods are not well suited to violation detection tasks on complex power construction sites.

3. Method

3.1. Architecture

Our improvement is based on the YOLOv7 base model. Compared with the other YOLOv7 variants, the base model has fewer parameters and demonstrates superior real-time performance, while maintaining the high accuracy that is crucial for real-time violation detection applications in substation construction. The improved network structure, FFA-YOLOv7, which is based on YOLOv7, is depicted in Figure 1. Our proposed model incorporates two significant enhancements. First, we introduce a novel feature fusion path within the FPN to effectively merge the rich semantic information of the deep layers with the accurate location information of the shallow layers, enhancing the feature representation for improved performance. Second, we add a new attention module after each feature fusion path to capture the inter- and intra-relationships among the fusion sources and refine the fused features. More details about the feature fusion path and the attention module are provided in the next two subsections.

3.2. Feature Fusion Path

In segmentation tasks, pixel-level labels provide dense supervision, and larger patches can be considered to obtain more comprehensive global information. Object detection, in contrast, relies solely on image-level and bounding box-level labels, which provide much sparser supervision of fine spatial details. Edge information in feature maps tends to diminish during continuous downsampling and upsampling. The YOLOv7 model employs the extended efficient layer aggregation network (ELAN) to enhance network learning by incorporating additional gradient paths. However, the model contains numerous convolutional layers, and location information is gradually diluted as semantic information is extracted. In the subsequent feature pyramid network (FPN), the ELAN module is used repeatedly to extract features, which allows deeper networks to learn and converge effectively but significantly weakens edge position information, information that is crucial for generating anchor boxes that accurately fit the target size. This weakened edge information in turn degrades the intersection over union (IoU) between predicted anchor boxes and ground truth boxes, reducing the final detection accuracy.

To address the issue of location information loss resulting from the extensive use of ELAN, we introduced additional feature fusion paths in the FPN of YOLOv7 to achieve better fusion of semantic and location information. Figure 2 illustrates the newly proposed feature fusion path. In the backbone network, the feature maps at each scale level are first extracted using the ELAN block and then downsampled using the MPConv block. We incorporated feature fusion paths in the subsequent FPN process, allowing each layer of the FPN to receive not only feature maps of the same size from the previous FPN layer and downsampling process but also feature maps from shallower layers that have not undergone the ELAN block of that layer. This feature fusion path enhances the aggregation of the initial feature pyramid and provides the necessary details for coordinate regression, thereby improving the accuracy of the one-stage object detector.
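To make the pathway concrete, the following is a minimal PyTorch sketch of a single FPN level with the extra fusion input. The module name (FusionFPNLevel), the simplified stand-in for the ELAN block, and the channel and stride choices are our illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class FusionFPNLevel(nn.Module):
    """One FPN level with the extra fusion path (illustrative sketch).

    Inputs (names are assumptions):
      top_down -- upsampled feature map from the deeper FPN level
      lateral  -- same-scale feature map from the backbone skip connection
      shallow  -- feature map from the shallower backbone level that has
                  NOT passed through this level's ELAN block
    """

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convs align the channel counts of the three sources.
        self.align_top = nn.Conv2d(channels, channels, 1)
        self.align_lat = nn.Conv2d(channels, channels, 1)
        # The shallow map is one stride level larger, so downsample it.
        self.align_shallow = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        # Simplified stand-in for YOLOv7's ELAN aggregation block.
        self.elan = nn.Sequential(
            nn.Conv2d(3 * channels, channels, 1),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.SiLU(),
        )

    def forward(self, top_down, lateral, shallow):
        fused = torch.cat(
            [self.align_top(top_down),
             self.align_lat(lateral),
             self.align_shallow(shallow)], dim=1)
        return self.elan(fused)
```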

3.3. Attention Module

With the exception of the top layer, the feature maps in the FPN are obtained through a fusion process involving the previous layer and the two adjacent layers in the downsampling process. However, these three sources exhibit distinct representations of semantic levels and spatial locations due to their generation via different skip connections and upsampling pathways. The feature maps from the shallow pathways contain precise location information and fine-grained features, whereas those from the deep pathways exhibit richer semantic information and coarse-grained features. Consequently, a selective mechanism is required to filter out and retain effective feature information representations when fusing the three feature maps.

To improve the selection of feature information among each set of feature maps, a new selection mechanism called the Channel Refinement Attention Module (CRAM) is proposed in this paper. CRAM is built upon the Channel Attention Module (CAM) [25]. As depicted in Figure 3, the three source feature maps are initially concatenated, and a CAM is employed to extract the inter-group channel relationships. Subsequently, another CAM is applied to capture the intra-group relationships after summing the three feature maps. The final refined output is obtained by sequentially multiplying the concatenated feature map with the two attention maps. In summary, the CRAM of a feature map $F \in \mathbb{R}^{H \times W \times C}$ can be defined as follows:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AP}(F)) + \mathrm{MLP}(\mathrm{MP}(F))\big),$$
$$F_{\mathrm{cat}}^{l} = \big[F_{1}^{l}, F_{2}^{l}, F_{3}^{l}\big], \qquad F_{\mathrm{sum}}^{l} = F_{1}^{l} + F_{2}^{l} + F_{3}^{l},$$
$$\mathrm{CRAM}(F^{l}) = F_{\mathrm{cat}}^{l} \otimes M_c\big(F_{\mathrm{cat}}^{l}\big) \otimes M_c\big(F_{\mathrm{sum}}^{l}\big),$$

where $F_{1}^{l}$, $F_{2}^{l}$, and $F_{3}^{l}$ are the three fusion sources at level $l$, $\sigma$ is the sigmoid function, and MLP represents the multilayer perceptron. AP and MP denote the average pooling and max pooling operations, respectively. $[\cdot]$ denotes the concatenation operation, and $\otimes$ denotes element-wise multiplication between feature maps and attention maps. $l$ denotes the feature level, and the larger the value of $l$, the deeper the layer.
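As a concrete illustration, below is a minimal PyTorch sketch of CRAM under the definition above. The CBAM-style channel attention follows [25]; how the C-channel intra-group attention is broadcast over the 3C-channel concatenation is our assumption (channel-wise tiling):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM from CBAM [25]: sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # 1x1 convs implement the shared MLP over pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class CRAM(nn.Module):
    """Channel Refinement Attention Module (sketch of the definition above)."""

    def __init__(self, channels: int):
        super().__init__()
        self.inter_cam = ChannelAttention(3 * channels)  # on the concatenation
        self.intra_cam = ChannelAttention(channels)      # on the sum

    def forward(self, f_top, f_lat, f_shallow):
        f_cat = torch.cat([f_top, f_lat, f_shallow], dim=1)  # (B, 3C, H, W)
        f_sum = f_top + f_lat + f_shallow                    # (B, C, H, W)
        m_inter = self.inter_cam(f_cat)                      # (B, 3C, 1, 1)
        # Tiling the C-channel intra attention to 3C is our assumption.
        m_intra = self.intra_cam(f_sum).repeat(1, 3, 1, 1)
        return f_cat * m_inter * m_intra
```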

To complement the refined feature maps generated by the CRAM, the Spatial Attention Module (SAM) [25] is introduced. The SAM is incorporated to specifically emphasize the more accurate location of semantic and spatial information on the feature map, leading to an improved accuracy of feature representation. By integrating both CRAM and SAM, a feature map is obtained that encompasses both channel refinement and spatial refinement features. This feature map serves as an effective input for subsequent detection and recognition tasks.
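For completeness, here is a matching PyTorch sketch of the CBAM-style SAM [25] and how it could be chained after CRAM; the 7 × 7 convolution kernel follows the CBAM default and is an assumption here:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """SAM from CBAM [25]: F * sigmoid(conv7x7([AvgPool_c(F); MaxPool_c(F)]))."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)   # channel-wise average map
        mx, _ = torch.max(x, dim=1, keepdim=True)  # channel-wise max map
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn

# Channel-then-spatial refinement, mirroring the text above
# (cram as sketched in the previous listing):
#   refined = SpatialAttention()(cram(f_top, f_lat, f_shallow))
```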

4. Experiment

4.1. Dataset Construction

To comprehensively validate the effectiveness and practicality of the proposed method, the experimental dataset used in this study comprises six distinct categories in the power grid context: (1) Helmet-wearing, (2) Helmet-not-wearing, (3) Seatbelt-wearing, (4) Seatbelt-not-wearing, (5) Ladder, and (6) Insulator. The first five categories include original images sourced from the internet or captured on-site during power operation activities, while the images in the last category are exclusively obtained from real substation construction sites. Figure 4 showcases a selection of image samples from the dataset.

Data augmentation is commonly used to generate additional samples for object classes that are underrepresented in the training data. Its principle is to create an expanded dataset by applying various augmentation methods to the existing one. In this study, five key data augmentation techniques were employed: (1) object segmentation and background fusion, (2) partial erasing, (3) affine transformation, (4) brightness transformation, and (5) clarity transformation. These operations simulate common imaging condition variations in real-world surveillance, such as changes in viewpoint, distance, background, clarity, and illumination.
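As an illustration, the following torchvision-based pipeline approximates four of the five techniques (affine, brightness, clarity, partial erasing); the exact parameter values are hypothetical, object segmentation with background fusion is task-specific and omitted, and for detection the geometric transforms would also have to be applied to the bounding box labels:

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline; parameters are assumptions.
augment = T.Compose([
    T.RandomAffine(degrees=10, translate=(0.1, 0.1),
                   scale=(0.8, 1.2)),                  # affine transformation
    T.ColorJitter(brightness=0.4),                     # brightness transformation
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),   # clarity transformation
    T.ToTensor(),
    T.RandomErasing(p=0.5, scale=(0.02, 0.1)),         # partial erasing
])
```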

We generated a dataset consisting of 14,960 images (including 30% data augmentation images) by applying image augmentation techniques to each class of images. An 8 : 2 ratio was employed to split the dataset into a training dataset and a validation dataset. In order to maintain class balance, the number of images in each class was adjusted accordingly. Table 1 presents the parameters utilized for data augmentation. Furthermore, a separate testing dataset, consisting of 500 images captured from realistic power construction surveillance scenarios, was assembled. This testing dataset encompasses all six classes. A comprehensive breakdown of the dataset composition can be found in Table 2.
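A minimal sketch of the 8 : 2 split described above, assuming a hypothetical flat directory layout for the images:

```python
import glob
import random

image_paths = sorted(glob.glob("dataset/images/*.jpg"))  # hypothetical layout
random.seed(0)                        # fixed seed for a reproducible split
random.shuffle(image_paths)
split = int(0.8 * len(image_paths))   # the 8 : 2 train/validation ratio
train_paths, val_paths = image_paths[:split], image_paths[split:]
```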

4.2. Evaluation Criteria

In this experiment, we employed four commonly used evaluation metrics to assess the performance of the detection model: precision (P), recall (R), F1 score (F1), and mean average precision (mAP). P, R, F1, and mAP can be formulated as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R},$$
$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i,$$

where FP (false positive) denotes the number of objects that were incorrectly detected, FN (false negative) represents the number of objects that were missed in detection, and TP (true positive) represents the number of correctly detected objects. TP + FP is the total number of detected objects, and TP + FN is the total number of actual objects in the dataset. The F1 score is the harmonic mean of precision and recall, providing a balanced measure of the model's performance. AP (average precision) is computed as the area under the precision-recall (PR) curve, which describes the trade-off between precision and recall at different thresholds. mAP is the average of the AP values across all classes, where N is the number of classes in the test samples. In object detection, higher precision indicates fewer false detections, while higher recall indicates fewer missed detections. Therefore, achieving both high precision and high recall is crucial for accurate and comprehensive object detection.
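The following Python sketch computes these metrics from per-class detection counts and a precision-recall curve; the all-point interpolation used for AP is a common convention and an assumption here:

```python
import numpy as np

def precision_recall_f1(tp: int, fp: int, fn: int):
    """P, R, and F1 exactly as defined above, guarded against division by zero."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(precisions, recalls):
    """Area under the PR curve via all-point interpolation.

    `recalls` must be sorted in ascending order with matching `precisions`.
    """
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([1.0], precisions, [0.0]))
    # Make precision monotonically non-increasing from right to left.
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values over the N classes."""
    return sum(ap_per_class) / len(ap_per_class)
```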

4.3. Experimental Configuration

This experiment was conducted using the PyTorch deep learning framework on an NVIDIA RTX 3090 GPU. The experimental setup and hyperparameters are as follows. We utilized the Adam optimizer [26] to train our model with an initial learning rate of 1e-3. The momentum was set to 0.937, and the weight decay was set to 5e-4. The weights of the convolutional layers were initialized using Kaiming normal initialization. The remaining hyperparameters followed the default values specified in the official YOLOv7 code. The model was trained for 400 epochs with a batch size of 16.
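A sketch of the corresponding training setup in PyTorch; mapping the reported momentum of 0.937 to Adam's beta1, as YOLO reference implementations do, is our assumption:

```python
import torch
import torch.nn as nn

def configure_training(model: nn.Module):
    """Optimizer and weight initialization matching the reported hyperparameters."""
    # Kaiming normal initialization of convolutional weights.
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")

    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=1e-3,               # initial learning rate
        betas=(0.937, 0.999),  # beta1 taken from the reported momentum (assumption)
        weight_decay=5e-4,
    )
    return optimizer

EPOCHS, BATCH_SIZE = 400, 16   # as reported in the text
```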

5. Results

Our proposed FFA-YOLOv7 aims to achieve high accuracy and efficiency in detecting violations on practical power construction sites. To objectively evaluate its performance, this study conducted a comparative analysis with five state-of-the-art object detection models: YOLOv5-s, YOLOv5-m, YOLOv5-l, YOLOv7, and YOLOv7-x. We utilized pretrained weights from the YOLO framework and trained all models on our constructed dataset of 14,960 images (including 30% augmented images). We also evaluated the models on the separate testing dataset of 500 images captured from realistic power construction surveillance scenarios. Comparing the results of these models on the same data provides an objective assessment of how accurately and robustly our proposed approach detects objects in the specific context of power construction surveillance.

The detection results after 400 epochs of training are shown in Table 3. YOLOv5-s, with its simple model structure, achieves the best speed but the worst precision. YOLOv5-m and YOLOv5-l, with more complex structures, exhibit better precision but slower speed. The YOLOv7-based models generally perform better than the YOLOv5 models. In comparison, our proposed model outperforms all of the aforementioned models in terms of P, R, mAP, and F1.

In order to validate the effectiveness of the proposed FFA-YOLOv7 model in this paper, we conducted a comprehensive evaluation on a dataset collected from real-world power construction sites. The dataset encompasses a wide range of detection objects from all six classes. The evaluation aimed to measure the performance of our model in accurately detecting these objects. Table 4 presents the test results of the proposed method, providing a detailed list of metrics for each class and an overall evaluation of the model’s performance. Additionally, the corresponding detection results are showcased in Figure 5, providing a visual demonstration of the accuracy and effectiveness of our proposed method. The outcomes presented in Table 4 and Figure 5 highlight the excellent detection accuracy achieved by our FFA-YOLOv7 model. It successfully detects objects from all six classes across diverse practical power operation sites, exhibiting both class-specific and overall high-performance capabilities. It is evident from the results that the method proposed in our work demonstrates effectiveness in target detection within the field of power construction monitoring.

Furthermore, we conducted an evaluation of the detection speed and concurrency of our proposed system using the testing dataset. The results demonstrate that our network achieves a detection speed of 9.6 ms per image, indicating its high efficiency in practical applications. Additionally, the system exhibits remarkable concurrency, allowing it to simultaneously record and detect up to 30 different video streams at a real-time frame rate of 30 FPS.

6. Conclusions

In this study, we propose FFA-YOLOv7, an improved version of YOLOv7, for detecting worker attire violations in substation construction. Repeated downsampling and upsampling usually lead to the loss of location information, which in turn degrades edge positioning accuracy and detection performance. To address this issue, a new feature fusion path is presented to synthesize rich semantic information from deep layers with precise location information from shallow layers. Additionally, attention modules are incorporated to refine the fused features. Furthermore, we established a dataset to compensate for the limited training samples, enabling better detection performance in realistic power construction scenarios. Compared with other YOLO-based detection methods, our proposed FFA-YOLOv7 achieves the highest detection accuracy (96.5%) without compromising detection speed. Experimental results on a dataset collected from realistic power construction sites demonstrate that FFA-YOLOv7 exhibits superior accuracy and robustness in detecting violations in practical power construction scenarios.

Data Availability

The data used to support the findings of this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was funded by the China Southern Power Grid science and technology program "Research on remote safety control technology of power field operation based on infrared and visible multi-source image fusion." It was also partially supported by the Yunnan Province Ten Thousand Talents Program.