Abstract

Human behavior analysis has become a leading technology in computer vision in recent years. The station operation room is responsible for dispatching trains as they enter and leave the station. By analyzing the behaviors of the operators in the operation room, we can determine whether they commit violations. However, no existing scheme analyzes operator behavior in the operation room, so we propose an operator behavior analysis system for the station operation room to detect violations. This paper proposes an improved target tracking algorithm based on Deep-sort. Practical tests show that the proposed algorithm improves target tracking performance compared with the traditional Deep-sort algorithm. In addition, we put forward detection schemes for three common violations in the operation room: off-position, sleeping, and playing with a mobile phone. Finally, we verify that the proposed algorithm can detect the behaviors of operators in the station operation room in real time.

1. Introduction

In the railway industry, station operators play a vital role in the safety of train dispatch. If these operators commit violations, they may pose serious hazards to railway operation safety. The most common violations are off-position, sleeping, and playing with a mobile phone; all three may lead to serious safety accidents. At present, the usual practice is to assign security officers to monitor operators through remote monitoring systems composed of cameras in each operation room. The security officers judge whether each operator commits violations by watching the monitoring screens. However, a railway bureau usually has hundreds of operation rooms, which requires many security officers to meet the monitoring demand. Therefore, an intelligent behavior analysis system is urgently needed to replace manual management in the operation room. The operator behavior analysis system first analyzes the images collected by the monitoring camera in the operation room to locate the operators and track them. Then, the system applies three behavior analysis methods to judge whether each tracked target commits violations. Beyond the behavior analysis of railway station operators, the system can also be applied to other similar fields.

Before analyzing behaviors, we first use an object detection algorithm to locate the operators. Object detection algorithms fall into two main categories: two-stage and one-stage. A two-stage network first extracts object candidate regions from the input image and then classifies all the candidate regions, so its detection speed is relatively slow. This category mainly includes RCNN [1], Fast-RCNN [2], Faster-RCNN [3], and Mask RCNN [4]. A one-stage network finds candidate regions directly from the feature map. Its detection speed is usually faster than that of a two-stage network, but its detection accuracy may suffer. At present, the common one-stage algorithms are Yolov1 [5], Yolov2 [6], Yolov3 [7], SSD [8], and RetinaNet [9]. Yolov1 is highly efficient but has low overall accuracy. The most significant improvement of Yolov2 is its better detection of small objects. Yolov3 replaces the backbone network with Darknet53 [7], significantly improving detection performance.

Then, we track the target according to the object detection results. A. Bewley et al. proposed the SORT (simple online and real-time tracking) [10] multitarget tracking algorithm by combining a Kalman filter with the Hungarian algorithm. However, the tracking accuracy of SORT decreases when people are occluded. To solve this problem, Wojke et al. proposed the Deep-sort [11] multitarget tracking algorithm. Xu et al. proposed DeepMOT [12], an end-to-end multitarget tracking framework that solves the problem of end-to-end training. Sun et al. proposed DAN (deep affinity network) [13], which supports end-to-end training and prediction; however, it introduces considerable additional computation, so the algorithm is inefficient.

Behavior analysis identifies the actions of detected targets. K. Simonyan et al. proposed a two-stream convolutional neural network [14], which significantly improved the accuracy of behavior recognition by incorporating optical flow information. Girdhar et al. [15] added an ActionVLAD layer to the two-stream network, but they did not study the recognition of different behaviors across multiple targets. Tran et al. constructed the C3D [16] network using 3D convolution and 3D pooling. Xu et al. proposed the R-C3D [17] network, which extracts behavior keyframes from a video and then identifies the behavior category from these keyframes; the network can analyze videos of any length.

3. Behavior Analysis Algorithm

The design of the behavior analysis algorithm for the station operation room is shown in Figure 1. The algorithm comprises object detection, target tracking, and behavior analysis. The object detection module uses a deep learning algorithm to detect the positions of the operators. This paper proposes an improved algorithm based on Yolov4 [18]. To improve small-object detection, we add additional SPP modules to the Yolov4 network. In the target tracking process, we introduce the HOG (histogram of oriented gradients) feature and improve the IoU (intersection over union) calculation to strengthen target tracking. Finally, we design three behavior analysis methods to identify off-position, sleeping, and playing with mobile phones.

3.1. Object Detection

Yolov4 is the object detection network proposed by Bochkovskiy et al. based on the Yolov3 [7] network. The detection network consists of four parts: the CSPDarknet53 network, spatial pyramid pooling (SPP) [19], PANet [20], and the Yolov3 head [7]. The CSPDarknet53 network combines the cross-stage partial (CSP) structure [21] with Darknet53 [7]. The CSP structure enhances the CNN's learning ability and reduces computation. The Darknet53 network contains five large residual blocks, each containing several residual structures; adding the CSP structure after each large residual block yields CSPDarknet53. The SPP produces a fixed-size output for any input size, which avoids the image distortion caused by nonproportional compression of the input image; in the Yolov4 network, the SPP is used to increase the receptive field. The PANet preserves spatial information so that pixels can be located correctly, enhancing instance segmentation. Figure 2 shows the Yolov4 network structure.

The SPP module obtains receptive-field information by applying maximum pooling with different kernel sizes and fusing the resulting features. This fusion of receptive fields at different scales effectively enriches the expressive ability of the feature map. Figure 3 shows the structure of the SPP. In the Yolov4 network, the SPP module is located only before the final feature map. In this paper, we also apply SPP modules before the other output feature maps to enhance their ability to express feature information. Figure 4 shows the improved Yolov4 network structure.
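To illustrate the pooling-and-fusion idea, the following minimal NumPy sketch applies stride-1 max pooling with "same" padding at several kernel sizes and concatenates the results with the input along the channel axis. The kernel sizes (5, 9, 13), and the use of NumPy rather than a deep learning framework, are illustrative assumptions.

```python
import numpy as np

def max_pool_same(x, k):
    """Stride-1 max pooling with 'same' padding over one 2D channel."""
    pad = k // 2
    H, W = x.shape
    xp = np.pad(x, pad, mode="constant", constant_values=-np.inf)
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = xp[i:i + k, j:j + k].max()
    return out

def spp(feature, kernels=(5, 9, 13)):
    """SPP-style fusion: concatenate the (C, H, W) feature map with its
    max-pooled copies along the channel axis, preserving spatial size."""
    pooled = [np.stack([max_pool_same(c, k) for c in feature])
              for k in kernels]
    return np.concatenate([feature] + pooled, axis=0)
```

Because the pooling keeps the spatial size, the output simply has (1 + number of kernels) times the input channels, which is why it can be dropped in front of any feature map.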

3.2. Object Tracking

The most widely used real-time multitarget tracking algorithms in recent years are SORT [10] and Deep-sort [11]. Although the SORT algorithm tracks targets quickly, its accuracy decreases when occlusion occurs. Deep-sort applies a Kalman filter in the video space and then uses the Hungarian algorithm to associate data frame by frame. It considers both the target's motion information and its appearance information when associating data. The motion association uses the Mahalanobis distance between the Kalman prediction and the object detection result. The appearance association computes the minimum cosine distance between the last 100 successfully associated features and the detection in the current frame. The formulas are as follows:

d^(1)(i, j) = (d_j − y_i)^T S_i^{−1} (d_j − y_i),
d^(2)(i, j) = min{ 1 − r_j^T r_k^(i) | r_k^(i) ∈ R_i },
c_{i,j} = λ d^(1)(i, j) + (1 − λ) d^(2)(i, j),

where d^(1)(i, j) is the Mahalanobis distance between the j-th detection d_j and the predicted state (y_i, S_i) of the i-th track, d^(2)(i, j) is the minimum cosine distance between the detection's appearance feature r_j and the track's gallery R_i of recent features, and λ is the adjustment coefficient.
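The weighted combination of the motion and appearance distances can be sketched as below. The value of the weighting coefficient `lam` and the assumption that gallery rows are L2-normalised appearance features are illustrative, not taken from the paper.

```python
import numpy as np

def combined_cost(d_maha, d_cos, lam=0.2):
    """Weighted sum of motion (Mahalanobis) and appearance (cosine) costs.
    d_maha, d_cos: matrices of shape (num_tracks, num_detections)."""
    return lam * d_maha + (1.0 - lam) * d_cos

def min_cosine_distance(gallery, query):
    """Smallest cosine distance between a detection embedding `query` and a
    track's gallery of recent features (rows assumed L2-normalised)."""
    return float(np.min(1.0 - gallery @ query))
```

The resulting cost matrix is what the Hungarian algorithm consumes to produce the frame-by-frame assignment.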

This paper adds a comparison of HOG [22] features when computing the appearance association. The HOG feature describes the target's contour through gradient and edge directions, and many studies in recent years have shown that it can accurately describe the outline of a person. We compare the HOG feature of the previously associated rectangular box with that of the current rectangular box.
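As a rough illustration of comparing HOG-style descriptors, the sketch below builds a single magnitude-weighted gradient-orientation histogram per patch and compares two patches by cosine similarity. A real HOG descriptor (as in [22]) uses cells and block normalisation; the single global histogram here is a simplifying assumption.

```python
import numpy as np

def hog_like_descriptor(patch, bins=9):
    """Simplified HOG: one global gradient-orientation histogram,
    weighted by gradient magnitude, then L2-normalised."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    n = np.linalg.norm(hist)
    return hist / n if n > 0 else hist

def hog_similarity(patch_a, patch_b):
    """Cosine similarity of two descriptors (1.0 = matching contours)."""
    return float(hog_like_descriptor(patch_a) @ hog_like_descriptor(patch_b))
```

A similarity close to 1 suggests the two boxes contain the same contour, which is the signal added to the appearance association.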

In the matching process, Deep-sort uses the IoU to calculate the overlap of bounding boxes:

IoU(B1, B2) = area(B1 ∩ B2) / area(B1 ∪ B2),

where B1 is the first bounding box and B2 is the second bounding box.
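For reference, a minimal implementation of this IoU for axis-aligned boxes in (x1, y1, x2, y2) form might look like:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```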

This IoU does not consider the width and height of the bounding boxes, which can lead to false matches. Therefore, we improve the IoU by introducing the height and width of the bounding boxes, where h1 and w1 are the height and width of the first bounding box, h2 and w2 are the height and width of the second bounding box, together with an adjustment coefficient.

3.3. Behavior Analysis

At present, behavior analysis based on deep learning mainly uses object detection to directly identify behaviors such as sleeping and playing with a mobile phone. This approach has poor robustness, and in some cases its results are inaccurate. In this paper, we propose a behavior analysis algorithm based on target tracking and behavior characteristics, covering three behaviors: off-position, sleeping, and playing with a mobile phone.

3.3.1. Off-Position Detection

We consider that the operators have left their work area when the object detection algorithm cannot detect them in N consecutive frames. C_leave is the off-position behavior counter: when no operator appears in the detection result, the counter increases by one; when an operator is detected, the counter is cleared. When the counter satisfies

C_leave > T_leave,

where T_leave is the off-position behavior threshold, the operators are considered to be off-position.
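The counter logic above can be sketched as follows. The input is assumed to be the number of operators detected in each frame, which is an illustrative simplification of the detector output.

```python
def off_position_frame(detections_per_frame, threshold=180):
    """Return the index of the frame at which C_leave first exceeds
    T_leave (`threshold`), or None if off-position never triggers."""
    c_leave = 0
    for idx, n in enumerate(detections_per_frame):
        c_leave = c_leave + 1 if n == 0 else 0  # reset when an operator appears
        if c_leave > threshold:
            return idx
    return None
```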

3.3.2. Sleeping Detection

The recognition of sleeping behavior is based on the change of the tracked operator's position from frame to frame. Through the target tracking algorithm, we compute the matching degree between the target's current detected position and the Deep-sort tracking prediction, obtaining an IoU score that measures how much the target has moved between frames. C_sleep is the sleeping behavior counter. For the same tracked target, if the IoU score between the position predicted by the tracking algorithm and the position given by object detection is greater than a set threshold (the target has barely moved), we add one to the counter; the counter is cleared if the tracked target disappears or the IoU score falls below the threshold. When the counter satisfies

C_sleep > T_sleep,

where T_sleep is the sleeping behavior threshold, the operator is considered to be sleeping.
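A minimal sketch of this counter, assuming the per-frame input is the IoU score between the predicted and detected boxes (None when the target disappears) and that a high IoU means the target has barely moved; the IoU threshold value is illustrative:

```python
def sleeping_frame(iou_scores, iou_threshold=0.9, t_sleep=180):
    """Return the index of the frame at which C_sleep first exceeds
    T_sleep (`t_sleep`), or None. iou_scores: per-frame IoU between the
    track's predicted and detected boxes, None if the target is lost."""
    c_sleep = 0
    for idx, s in enumerate(iou_scores):
        still = s is not None and s >= iou_threshold
        c_sleep = c_sleep + 1 if still else 0  # reset on movement or loss
        if c_sleep > t_sleep:
            return idx
    return None
```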

3.3.3. Playing Mobile Phone Detection

We assume that when an operator is playing with a mobile phone, the phone is close to that person. Therefore, by calculating the Euclidean distance between mobile phones and operators, we can judge whether operators are playing with mobile phones. First, this paper uses the object detection method proposed above to detect mobile phones. Let the center of the mobile phone be (x_p, y_p) and the center of the i-th operator be (x_i, y_i). We find the operator nearest to the mobile phone through the Euclidean distance:

i_min = argmin_i sqrt((x_p − x_i)^2 + (y_p − y_i)^2).

The i_min indicates the bounding box of the operator closest to the mobile phone. C_phone is the playing-mobile-phone behavior counter. If the Euclidean distance between the mobile phone and the nearest operator's bounding box is less than the smaller of the width and height of that bounding box, we consider that the operator is playing with the mobile phone in this frame and add one to the counter.
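The nearest-operator test can be sketched as below. Boxes in (x1, y1, x2, y2) form and measuring the distance to the operator's box centre are illustrative assumptions.

```python
import math

def nearest_operator(phone_center, operator_boxes):
    """operator_boxes: list of (x1, y1, x2, y2). Returns (i_min, playing),
    where `playing` is True when the phone lies within min(width, height)
    of the nearest operator's box centre."""
    xp, yp = phone_center
    dists = []
    for (x1, y1, x2, y2) in operator_boxes:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        dists.append(math.hypot(xp - cx, yp - cy))
    i_min = min(range(len(dists)), key=dists.__getitem__)
    x1, y1, x2, y2 = operator_boxes[i_min]
    playing = dists[i_min] < min(x2 - x1, y2 - y1)
    return i_min, playing
```

A True result for one frame would then increment C_phone, mirroring the counters of the previous two subsections.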

When the counter satisfies

C_phone > T_phone,

where T_phone is the playing-mobile-phone behavior threshold, the operator is considered to be playing with the mobile phone.

4. Experiment Analysis

This paper verifies the feasibility of the proposed algorithm from three aspects: object detection, target tracking, and behavior analysis. The test environment consists of 8 GB of memory, an Intel Core i5-6500 CPU, and an NVIDIA GTX 1050 graphics card.

4.1. Object Detection

For object detection, the training environment consists of 32 GB of memory, an Intel Xeon E5-2650 CPU, and an NVIDIA GTX 1080 Ti graphics card. For training and testing, we construct an operation room dataset. The dataset contains 20000 images in total, composed of monitoring images taken by dozens of station operation room webcams under different illumination conditions and at different times. The image size is . The dataset is divided into 70% of the images as the training set, 20% as the validation set, and 10% as the test set. The categories annotated in the dataset are mobile phone and person.

The main parameters used in this training are shown in Table 1.

In object detection, precision (Pr) and recall (Re) are used as the benchmarks to measure the object detection algorithms: Pr = TP / (TP + FP) and Re = TP / (TP + FN), where TP, FP, and FN are the numbers of true positives, false positives, and false negatives.
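The two metrics reduce to a pair of ratios over true positives (TP), false positives (FP), and false negatives (FN), which this small helper makes explicit:

```python
def precision_recall(tp, fp, fn):
    """Pr = TP / (TP + FP); Re = TP / (TP + FN)."""
    pr = tp / (tp + fp)  # fraction of detections that are correct
    re = tp / (tp + fn)  # fraction of ground-truth objects found
    return pr, re
```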

This paper compares three object detection algorithms: Yolov3, Yolov4, and our method. Table 2 shows the test results of the three algorithms.

Compared with the Yolov4 and Yolov3 detection algorithms, our algorithm improves both precision and recall on our dataset.

4.2. Object Tracking

In this paper, we use MOTP (multiple object tracking precision) and MOTA (multiple object tracking accuracy) to measure the ability of the target tracking algorithms. We test the SORT algorithm, the Deep-sort algorithm, and our algorithm on our dataset and the MOT16 dataset [23]. Figure 5 shows the target tracking process.
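As a reference for the two metrics, a minimal sketch of one common formulation of the CLEAR-MOT definitions is given below: MOTA penalises misses (FN), false positives (FP), and identity switches (IDSW) against the total number of ground-truth objects, while MOTP averages the overlap of matched track/ground-truth pairs. The exact MOTP variant (overlap vs. distance based) is an assumption here.

```python
def mota(num_misses, num_false_positives, num_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, accumulated over all frames."""
    return 1.0 - (num_misses + num_false_positives + num_switches) / num_gt

def motp(total_overlap, num_matches):
    """MOTP as the mean bounding-box overlap over all matched pairs."""
    return total_overlap / num_matches
```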

Table 3 shows the results of the three algorithms on the two datasets.

Compared with the original Deep-sort algorithm, MOTA increased by 2.7% and MOTP by 1.6% on the MOT16 dataset. On our dataset, MOTA increased by 1.9% and MOTP by 1.1%.

4.3. Behavior Analysis

This section analyzes the three violations: off-position, sleeping, and playing with mobile phones. The experimental results are shown below.

4.3.1. Off-Position Detection

We extract ten off-position videos from the video database and extract one image every second. For the off-position behavior test, the off-position behavior threshold T_leave is 180: if no operator is detected in 180 consecutive images, the operator is judged to be off-position. Table 4 shows the results of the 10 video tests.

As Table 4 shows, the detected frame count may differ from the actual frame count because an operator may not be detected while leaving the screen, but this does not affect the final detection result. Table 4 also shows that the maximum value of C_leave is greater than the off-position behavior threshold T_leave, so the off-position behavior analysis algorithm proposed in this paper can correctly judge off-position behavior. A screenshot of the test video is shown in Figure 6.

4.3.2. Sleeping Detection

We test ten sleeping videos from the video database. We extract one image every second.

In the sleeping behavior test, the sleeping behavior threshold T_sleep is 180. When the sleeping behavior counter C_sleep exceeds 180, we judge that the operator is sleeping. The test results are shown in Table 5.

According to Table 5, there are some differences between the maximum value of the counter C_sleep and the total number of sleeping-behavior frames in videos 1, 2, 3, and 10. However, C_sleep still exceeds the sleeping behavior threshold T_sleep, so the sleeping behavior algorithm proposed in this paper can judge the sleeping behavior of the personnel. Figure 7 shows some sleeping behavior detection results.

4.3.3. Playing Mobile Phone Detection

We extract 10 playing-mobile-phone videos from the video database and extract one image every second. Figure 8 shows some detection results: the red rectangle indicates the person playing with a mobile phone, and the yellow rectangle indicates the detected mobile phone. The playing-mobile-phone behavior threshold T_phone is 180; when the counter C_phone exceeds 180, we judge that the operator is playing with the mobile phone.

5. Conclusion

To meet the management and monitoring demands of the station operation room, we analyze the practical problems of the operation room and propose an efficient behavior analysis method based on deep learning. Experimental tests verify the effectiveness of the proposed algorithm. The method proposed in this paper has been widely deployed in railway stations.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.