Abstract

The lack of traffic data is a bottleneck restricting the development of Intelligent Transportation Systems (ITS). Most existing traffic data completion methods target low-dimensional data and cannot cope with high-dimensional video data. Therefore, this paper proposes a traffic data completion generative adversarial network (TDC-GAN) model to solve the problem of missing frames in traffic video. Based on the Feature Pyramid Network (FPN), we design a multiscale semantic information extraction model, which employs a convolution mechanism to mine informative features from high-dimensional data. Moreover, by constructing a discriminator model with global and local branch networks, temporal and spatial information is captured to ensure the spatiotemporal consistency of consecutive frames. Finally, the TDC-GAN model is evaluated in single-frame and multiframe completion experiments on the Caltech pedestrian dataset and the KITTI dataset. The results show that the proposed model can complete the corresponding missing frames in the video sequences and achieves good performance in quantitative comparative analysis.

1. Introduction

In recent years, with the rapid development of Intelligent Transportation Systems (ITS), large volumes of data with rich traffic information have attracted widespread attention from researchers [1–4]. Accurate and efficient real-time traffic data can not only provide travelers with a better travel plan but also help the traffic management department manage and guide traffic operations effectively. In practice, however, the collected data are often incomplete because of limitations in sensor placement [5, 6], accidental deviations of the intelligent system [7–10], and camera occlusion [11]. These problems affect the accuracy of traffic state analysis and the timeliness of handling traffic problems [12]. Thus, it is necessary to complete the missing data.

Most existing studies complete low-dimensional data (e.g., traffic flow [13], travel time [14, 15], and trajectory [16]) and cannot cope with high-dimensional traffic video data, which contain more intuitive information. There are two reasons for this. On the one hand, limited hardware has restricted the computing power and processing speed available for capturing meaningful information from high-dimensional traffic video. On the other hand, existing data completion models are built on traditional statistical tools and suitable prior knowledge, and they are designed for low-dimensional data. Because of its high dimension and sparse representation, traffic video data is difficult to model with statistical models and prior knowledge. Moreover, traffic video scenes are relatively complex and usually include a large number of vehicles and pedestrians. As a result, low-dimensional traffic completion models have difficulty explicitly extracting semantic information from traffic scenes.

To deal with these drawbacks, we propose a traffic data completion generative adversarial network (TDC-GAN), based on the generative adversarial network (GAN) [17], to complete high-dimensional video sequences with the help of graphics processing unit (GPU) parallel computing power. In the TDC-GAN, the Feature Pyramid Network (FPN) [18] is used in the generator to extract multiscale features from the video frame. By learning latent representations from high-dimensional data, this paper extends traffic data completion research to high-dimensional video sequences. In addition, this paper designs global and local discriminators to capture the temporal and spatial correlation of video sequences. The two discriminators learn the temporal information between consecutive frames and the spatial semantic information within the frames, which pushes the generator toward reliable frames.

The remainder of the paper is organized as follows. Section 2 introduces related work. Section 3 describes the TDC-GAN model framework. Section 4 presents the experiments. Finally, Section 5 summarizes the TDC-GAN model and outlines future work.

2. Related Work

In general, traditional traffic data completion methods can be classified into the following three categories: prediction, interpolation, and statistical learning [19].

The prediction method learns the mapping from past data to future data by establishing corresponding models. For example, both the high-order exponential smoothing model [20, 21] and the gradient boost regression tree (GBRT) model [22] complete the traffic data by modeling traffic flow. Based on previous traffic information, Xu et al. [23] proposed a prediction model that combines the autoregressive integrated moving average (ARIMA) model with a Kalman filter. However, because continuous data from the preceding period must be known, the application scenarios of the prediction model are relatively limited. In addition, unlike completion, prediction cannot use the subsequent adjacent data, which makes a consistent representation over the time series harder to achieve.

The interpolation method generally estimates the missing data by averaging the traffic data in adjacent time periods or by using the historical data of other days that are similar to the missing data. Typical interpolation methods are k-nearest neighbor (k-NN) and local least squares (LLS) [24, 25]. The methods in [26, 27] are based on an improved adaptive k-NN, which jointly considers spatial neighboring points, sliding windows, spatiotemporal weights, and other spatial heterogeneity features to complete missing traffic data. The improved LLS method attempts to replace the missing traffic data with the average of the known data and iteratively obtains the weight of the nearest neighbor by using the Euclidean distance. However, the interpolation method assumes that adjacent traffic states are strongly similar, so it is unreliable when the state is relatively random.
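As a concrete illustration of interpolation-style completion, the following minimal sketch fills each missing reading with the average of its nearest known readings in time. It is a simplified temporal-neighbor average, not the adaptive k-NN or LLS methods of [24–27]; the function name `fill_missing` and the parameter `k` are hypothetical.

```python
def fill_missing(series, k=2):
    """Fill None entries with the mean of the k nearest known values in time."""
    known = [(i, v) for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no known values")
    filled = []
    for i, v in enumerate(series):
        if v is not None:
            filled.append(v)
            continue
        # Rank the known readings by temporal distance to the gap, keep the k closest.
        nearest = sorted(known, key=lambda item: abs(item[0] - i))[:k]
        filled.append(sum(val for _, val in nearest) / len(nearest))
    return filled


# Hypothetical traffic-flow readings with one missing value.
flow = [10.0, None, 14.0, 16.0]
completed = fill_missing(flow, k=2)
```

The sketch makes the stated limitation visible: the estimate is only as good as the assumption that neighboring readings resemble the missing one.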

The statistical learning method uses the statistical characteristics to complete the missing traffic information by establishing an iterative model of the probability distribution of the data. Typical methods are Markov Chain Monte Carlo (MCMC) [28] and probabilistic principal component analysis (PPCA) [29]. However, due to the complexity of the urban road traffic system, the learning ability of the statistical learning method is limited, and its convergence is difficult to guarantee.

Recently, with the advances of modern GPUs and neural networks [30, 31], deep learning-based methods have appeared to complete traffic data [32–34]. As an important deep learning technique, the GAN has been increasingly applied to video completion because of its outstanding learning ability.

Mathieu et al. [35] showed that traditional loss functions based only on pixel loss often lead to image blurring, whereas an adversarial loss can effectively alleviate this problem. This was the first time a GAN was applied to the modeling of video sequences. Subsequent GAN-based research on video frames attempted to decompose the video into two modules containing different information, which were studied separately. Vondrick et al. [36] separated the background and foreground of the video scene and used a GAN that enforces a static background and a moving foreground in the prediction. The Motion and Content Generative Adversarial Network (MoCoGAN) model [37] divided the latent space of video frames into content and motion. The model can generate a video that contains the same object performing different actions or different objects performing the same action. Liang et al. [38] divided the video sequence into future frames and future flows and used two GAN models that provide feedback to each other during training. These methods separate video frames according to different factors, which requires expensive computing power. In addition, because of their complex operating procedures, they are limited to single, simple datasets and are therefore difficult to apply to complex traffic scenarios.

Different from the abovementioned methods, FutureGAN [39] and Retrospective Cycle-consistency Generative Adversarial Network (CycleGAN) [40] attempt to use the original video frames as input. The idea of not decomposing the video frame is consistent with our method, which allows the network to learn more overall information about the input frame. Inspired by this, the TDC-GAN model directly receives unlabeled raw traffic video frames. In the generator model, the FPN is used to learn the information of multiple scales of the video frame by synthesizing the feature maps of multiple levels. By combining the lower-level feature map with more target location information and the upper-level feature map with more feature semantic information, the frames generated by the generator will be more realistic and accurate. In addition, in the discriminator model, the global discriminator mainly grasps the overall information of consecutive frames in the time series, and the local discriminator can supplement the detailed information in the space, which provides a guarantee for the temporal and spatial consistency of the generated frames. Therefore, the TDC-GAN model is capable of solving the problems of missing high-dimensional traffic video frames.

3. Methodology

3.1. Generative Adversarial Network

In recent years, deep learning methods have become an important tool for modeling video sequences. The GAN in particular is good at capturing complex features in high-dimensional data because of its outstanding learning capability. In this study, the TDC-GAN model is built on the GAN to complete missing traffic video data.

Since Goodfellow proposed the GAN, the idea of adversarial training has gradually been applied to the framework of generative models. As shown in Figure 1, the original GAN includes a generator and a discriminator. The generator is used to capture the distribution of sample data. By converting the distribution of the original input information into the parameters in the maximum likelihood estimation, the training deviation is finally converted into a sample of the specified distribution. During training, the generator learns to generate data samples that can confuse the discriminator, and the discriminator is used to judge the difference between the real sample and the generated sample. Through continued adversarial learning, the two networks eventually reach an equilibrium.

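The objective of the GAN can be defined as

$$\min_{G} \max_{D} V(D, G) = \frac{1}{m} \sum_{i=1}^{m} \left[ \log D\left(x^{(i)}\right) + \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right) \right] \tag{1}$$

where $m$ represents the batch size, $x^{(i)}$ represents a real sample, and $z^{(i)}$ represents a noise sample.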

Equation (1) indicates that the discriminator needs to learn to assign a higher score to the real sample data and to assign a lower score to the sample data generated by the generator. The generator needs to generate samples that confuse the discriminator as much as possible.
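To make the two roles concrete, the following sketch evaluates the standard minibatch GAN value, the batch average of log D(x) + log(1 - D(G(z))), from discriminator outputs alone. It is a plain-Python illustration rather than the authors' PyTorch code, and the function name `gan_value` and the score lists are hypothetical.

```python
import math


def gan_value(d_real, d_fake):
    """Minibatch GAN value: mean of log D(x) + log(1 - D(G(z))).

    d_real and d_fake hold discriminator outputs in (0, 1) for real and
    generated samples, respectively.
    """
    if len(d_real) != len(d_fake):
        raise ValueError("both batches must have the same size")
    m = len(d_real)
    total = sum(math.log(r) + math.log(1.0 - f) for r, f in zip(d_real, d_fake))
    return total / m


# A discriminator that separates the batches well scores high ...
confident = gan_value([0.9, 0.8], [0.1, 0.2])
# ... while a fully confused one (all outputs 0.5) scores 2 * log(0.5).
fooled = gan_value([0.5, 0.5], [0.5, 0.5])
```

The discriminator is trained to raise this value and the generator to lower it, which is the adversarial balance described above.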

A video is composed of continuous images at a certain frame rate, so the high-dimensional missing traffic data studied in this paper are video frames in continuous time. During training, the TDC-GAN model learns the mapping $G: X \rightarrow Y$ from existing frames to completed frames. $X$ represents the known frames from time 1 to $T$, that is, the data input to the generator. $Y$ represents the corresponding missing frames from time 1 to $T$, that is, the output of the generator.

3.2. Network Architecture

As shown in Figure 2, the network structure of the TDC-GAN model includes a generator and two discriminators. The generator employs the FPN, which includes three kinds of paths: bottom-up, top-down, and horizontal (lateral) connections. The bottom-up path retains more position information in its less-downsampled levels, and the top-down upsampling path is used to obtain feature maps with more semantic information and higher resolution. In the horizontal connections, convolution is used to fuse the position information and the semantic information, so the TDC-GAN model can learn richer high-dimensional feature information. Upsampling and convolutional layers are added to the end of the generator network so that the output has the same resolution as the input video frame.
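The fusion of a coarse, semantically rich map with a finer, position-rich map can be sketched as follows. This is a minimal plain-Python illustration of one top-down merge step under simplifying assumptions: nearest-neighbor 2x upsampling followed by element-wise addition, with the convolutions of the actual network omitted. The helper names `upsample2x` and `fuse` are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D feature map given as a list of rows."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out


def fuse(top, lateral):
    """Add the upsampled coarse map to the finer lateral map, element by element."""
    up = upsample2x(top)
    if len(up) != len(lateral) or len(up[0]) != len(lateral[0]):
        raise ValueError("lateral map must be exactly twice the size of the top map")
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


coarse = [[1.0]]                    # 1x1 map from a deeper level
fine = [[0.1, 0.2], [0.3, 0.4]]     # 2x2 map from a shallower level
merged = fuse(coarse, fine)
```

Each merged cell carries the coarse value shared across its neighborhood plus its own fine-level detail, which is the intuition behind combining the two kinds of information.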

In the discriminator network, to obtain the completion performance with temporal and spatial consistency, the TDC-GAN model designs two discriminator models, global and local. The global discriminator integrates the complete spatial environment by alternately receiving the generated frame and the real frame to obtain the rough motion state of the video frame in continuous time. However, the global discriminator weights the entire frame image, ignoring local spatial details. Therefore, the TDC-GAN model introduces a local discriminator, which randomly crops a certain number of patches on the whole frame and sends them to the discriminator. By performing feedback learning on each spatial local unit to obtain more details, high resolution and high details can be maintained. In addition, a Leaky Rectified Linear Unit (LReLU) is used to increase nonlinearity, and the batch normalization layer follows each LReLU.
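The random cropping that feeds the local discriminator can be sketched as below. This is a hedged plain-Python illustration: the patch size, the number of patches, and the function name `random_patches` are assumptions, since the text only states that a certain number of patches are cropped at random from the whole frame.

```python
import random


def random_patches(frame, patch_size, num_patches, seed=None):
    """Crop square patches at random positions from a 2D frame (list of rows)."""
    rng = random.Random(seed)
    height, width = len(frame), len(frame[0])
    if patch_size > height or patch_size > width:
        raise ValueError("patch does not fit inside the frame")
    patches = []
    for _ in range(num_patches):
        # randint is inclusive on both ends, so every crop stays inside the frame.
        top = rng.randint(0, height - patch_size)
        left = rng.randint(0, width - patch_size)
        patches.append([row[left:left + patch_size]
                        for row in frame[top:top + patch_size]])
    return patches


# Toy 6x8 frame whose pixel values encode their own position.
frame = [[r * 8 + c for c in range(8)] for r in range(6)]
patches = random_patches(frame, patch_size=4, num_patches=3, seed=0)
```

The global discriminator would receive `frame` as a whole, while each element of `patches` would be scored separately by the local branch.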

As shown in Figure 3, to speed up training and improve network performance, the backbone module of the TDC-GAN introduces the pretrained InceptionResNet-v2 [41] network, which can make full use of the characteristics of the training data image and reduce the feature loss in the convolution process. By introducing the residual module, the convergence can be accelerated, and the training error will not increase with the increase of the network depth.

The TDC-GAN model can complete the missing frames of the video conditioned on the incomplete frame sequences. During training, the generator network only receives pixel values of the original video frames as input and does not depend on other constraints. To complete the real and effective missing frames, the spatial and temporal components of the video sequence will be captured simultaneously. The discriminator is trained to distinguish between true and false video frames by receiving the real sequence and the generated sequence as input alternately.

3.3. Loss Function

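For traffic video completion, the Wasserstein generative adversarial network with gradient penalty (WGAN-GP) [42] is used to optimize the discriminator. Its loss function, which includes a gradient penalty term, is defined as

$$L_{D} = \mathbb{E}_{\tilde{x} \sim P_{g}}\left[D(\tilde{x})\right] - \mathbb{E}_{x \sim P_{r}}\left[D(x)\right] + \lambda\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\left\lVert \nabla_{\hat{x}} D(\hat{x}) \right\rVert_{2} - 1\right)^{2}\right] \tag{2}$$

where $P_{r}$ and $P_{g}$ are the real sample distribution and generator sample distribution, respectively. The gradient-penalty coefficient is represented by $\lambda$. $\hat{x}$ is sampled randomly along the straight line connecting a point drawn from $P_{r}$ and a point drawn from $P_{g}$.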
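The structure of a WGAN-GP critic loss (generated score minus real score plus a penalty on the gradient norm at an interpolated point) can be sketched without an autograd library by assuming a linear critic D(x) = w · x, whose gradient with respect to its input is simply w. This is an illustrative assumption, not the paper's discriminator; `critic_loss` is a hypothetical name, and the value passed for `lam` is a placeholder.

```python
import math
import random


def critic_loss(w, real, fake, lam, seed=None):
    """WGAN-GP style loss for a linear critic D(x) = w . x (illustration only)."""
    rng = random.Random(seed)

    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x))

    # Random point on the straight line between the real and generated samples.
    eps = rng.random()
    x_hat = [eps * r + (1.0 - eps) * f for r, f in zip(real, fake)]
    # A real critic would differentiate D at x_hat; for a linear critic the
    # gradient is w at every point, so x_hat does not change the result here.
    grad_norm = math.sqrt(sum(wi * wi for wi in w))
    penalty = lam * (grad_norm - 1.0) ** 2
    return score(fake) - score(real) + penalty, penalty


# Gradient norm 5 is far from 1, so the penalty dominates the loss.
loss, penalty = critic_loss([3.0, 4.0], real=[1.0, 0.0], fake=[0.0, 1.0],
                            lam=10.0, seed=0)
# A unit-norm critic incurs (numerically) no penalty.
_, unit_penalty = critic_loss([0.6, 0.8], real=[1.0, 0.0], fake=[0.0, 1.0],
                              lam=10.0, seed=0)
```

In a PyTorch implementation the gradient at the interpolated point would instead be obtained through automatic differentiation.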

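To train the generator of the TDC-GAN model to generate more realistic completion samples, our overall loss function is defined as

$$L_{G} = \lambda_{1} L_{mse} + \lambda_{2} L_{adv} + \lambda_{3} L_{p} \tag{3}$$

where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are the weight coefficients. We use the mean square error (MSE) loss (L2 loss) between the real image and the generated image as the value of the first term $L_{mse}$, and it is defined as

$$L_{mse} = \frac{1}{N} \left\lVert Y - G(X) \right\rVert_{2}^{2} \tag{4}$$

where $Y$ is the real frame, $G(X)$ is the generated frame, and $N$ is the number of pixels.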

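To solve the problem of image blur caused by using the MSE loss function alone, the adversarial loss $L_{adv}$ is introduced in the second term, which is defined as

$$L_{adv} = -\mathbb{E}_{\tilde{x} \sim P_{g}}\left[D(\tilde{x})\right] \tag{5}$$

where $D$ is the discriminator and $\tilde{x}$ is a frame produced by the generator.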

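The third term $L_{p}$ is the perceptual loss [43], a loss function proposed for the general content of restored images, which we use for image generation:

$$L_{p} = \frac{1}{W_{i,j} H_{i,j}} \sum_{a=1}^{W_{i,j}} \sum_{b=1}^{H_{i,j}} \left( \phi_{i,j}(Y)_{a,b} - \phi_{i,j}(G(X))_{a,b} \right)^{2} \tag{6}$$

In Equation (6), $\phi_{i,j}$ stands for the feature map, where $j$ and $i$ denote the convolution and the max pooling layer, respectively, $Y$ is the real frame, and $G(X)$ is the generated frame. $W_{i,j}$ and $H_{i,j}$ represent the dimensions of the feature map.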
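How the three terms combine into one generator objective can be sketched as a weighted sum. The sketch below is an assumption-laden illustration: frames and feature maps are flattened to lists, the adversarial term is taken as the negated critic score of the generated frame (the usual WGAN form), and the weights passed in are placeholders rather than the paper's settings. The names `mse` and `generator_loss` are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def generator_loss(real, fake, critic_score_fake, feat_real, feat_fake, weights):
    """Weighted sum of pixel (MSE), adversarial, and perceptual terms."""
    w_mse, w_adv, w_p = weights
    l_mse = mse(real, fake)            # pixel-level fidelity
    l_adv = -critic_score_fake         # assumed WGAN-style generator term
    l_p = mse(feat_real, feat_fake)    # distance between feature maps
    return w_mse * l_mse + w_adv * l_adv + w_p * l_p


# Placeholder pixels, critic score, features, and weights.
total = generator_loss(real=[0.0, 1.0], fake=[0.0, 0.0],
                       critic_score_fake=0.2,
                       feat_real=[1.0, 1.0], feat_fake=[1.0, 3.0],
                       weights=(1.0, 0.5, 0.1))
```

Raising the adversarial or perceptual weight relative to the pixel weight trades exact pixel agreement for sharper, more semantically faithful frames.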

4. Results and Discussion

4.1. Datasets

In the experiments, we use the Caltech pedestrian dataset [44], a large-scale urban traffic dataset that consists of about 10 hours of pixel video. The video was captured by the onboard camera of a vehicle traveling through normal traffic in an urban environment. Because it contains comprehensive traffic information, many video-related tasks such as pedestrian detection and target recognition use this dataset [45, 46]. In addition, because it is open source and easy to download, it allows a fair comparison and analysis of performance in subsequent research. Experimenting on this public dataset makes the evaluation of the TDC-GAN model more convincing. Since our research aims to complete the missing traffic data and no other information is needed in the training data, the pedestrian annotations in this dataset are ignored. Moreover, to verify the versatility of the proposed TDC-GAN model, we also evaluated it on the KITTI dataset [47], which contains real video data collected in scenes such as urban areas, rural areas, and highways, where each frame can contain up to 15 cars and 30 pedestrians. To adapt to the network structure of TDC-GAN, we changed the video pixel to .

4.2. Training Details

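The TDC-GAN model is implemented in PyTorch and trained on a single NVIDIA RTX 2080 Ti GPU under Linux. The Adam optimizer is used to optimize our algorithm, and the relevant parameters were set to , , and . The weights of the loss function are set to , , , and . To quantitatively evaluate the network, we report the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) between the ground truth and the completed video frames. They are defined as

$$\mathrm{PSNR} = 10 \log_{10} \frac{\left(2^{n} - 1\right)^{2}}{\mathrm{MSE}} \tag{7}$$

where $n$ is the number of bits of each sampled value, and MSE is the corresponding mean square error, and

$$\mathrm{SSIM}(x, y) = \frac{\left(2 \mu_{x} \mu_{y} + c_{1}\right)\left(2 \sigma_{xy} + c_{2}\right)}{\left(\mu_{x}^{2} + \mu_{y}^{2} + c_{1}\right)\left(\sigma_{x}^{2} + \sigma_{y}^{2} + c_{2}\right)} \tag{8}$$

where $\mu$ and $\sigma^{2}$ are the average value and variance of the real frame $x$ or the generated frame $y$, respectively, the covariance of $x$ and $y$ is represented by $\sigma_{xy}$, and $c_{1}$ and $c_{2}$ are small constants that stabilize the division. The value of SSIM is between 0 and 1, and the closer it is to 1, the more similar the generated frame is to the ground truth.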
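Both metrics can be computed in a few lines, as the following plain-Python sketch shows for flattened frames. It uses the common PSNR definition and, as a simplification, a single global SSIM window over the whole frame instead of the sliding local windows used by standard implementations. The constants 0.01 and 0.03 are the conventional SSIM defaults, assumed here rather than taken from the paper; `psnr` and `ssim_global` are hypothetical names.

```python
import math


def psnr(real, fake, bits=8):
    """Peak signal-to-noise ratio in dB between two flattened frames."""
    mse = sum((a - b) ** 2 for a, b in zip(real, fake)) / len(real)
    if mse == 0:
        return math.inf          # identical frames
    peak = 2 ** bits - 1
    return 10.0 * math.log10(peak ** 2 / mse)


def ssim_global(real, fake, bits=8):
    """SSIM computed once over the whole frame (no sliding window)."""
    n = len(real)
    peak = 2 ** bits - 1
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2   # conventional constants
    mu_x, mu_y = sum(real) / n, sum(fake) / n
    var_x = sum((a - mu_x) ** 2 for a in real) / n
    var_y = sum((b - mu_y) ** 2 for b in fake) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(real, fake)) / n
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))


truth = [10.0, 20.0, 30.0, 40.0]
noisy = [12.0, 18.0, 33.0, 39.0]
score_psnr = psnr(truth, noisy)
score_ssim = ssim_global(truth, noisy)
```

Higher values of both scores indicate a completed frame closer to the ground truth; a frame compared with itself gives infinite PSNR and an SSIM of 1.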

4.3. Experimental Results and Analysis

4.3.1. Single Frame Completion

We first complete the next frame based on the previous frame. Figure 4 shows the single frame completion results of the TDC-GAN model.

The generator receives the video frame at time $t$ and generates the video frame at time $t+1$.

By comparing the details of 5 consecutive video frames between the generated frames and the ground truth, we can see that the TDC-GAN model effectively completes the missing video frames. The quantitative results are shown in Table 1: PSNR and SSIM reach 27.9 and 0.89, respectively.

4.3.2. Multiple Frame Completion

We also trained the generator with multiple frames as input to test how well the TDC-GAN model completes several missing frames at once.

Figures 5 and 6 show the results of multiple frame completion on the two datasets. The input sequence of the generator can be expressed as $X = \{x_{1}, x_{3}, x_{5}\}$, which represents the input video frames at times 1, 3, and 5, and we want to complete the sequence of video frames at times 2, 4, and 6.
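The alternating layout of this experiment, with frames at odd time steps known and frames at even time steps to be completed, can be sketched as a simple split of a clip. The function name `split_alternating` and the string frame labels are hypothetical stand-ins for real image tensors.

```python
def split_alternating(frames):
    """Split a clip into known frames (times 1, 3, 5, ...) and
    missing frames to complete (times 2, 4, 6, ...)."""
    known = frames[0::2]     # list index 0 corresponds to time step 1
    missing = frames[1::2]   # list index 1 corresponds to time step 2
    return known, missing


clip = ["f1", "f2", "f3", "f4", "f5", "f6"]
known, missing = split_alternating(clip)
```

During training, `known` would be fed to the generator and `missing` would serve as the ground truth against which the completed frames are scored.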

As can be seen from the details marked with red and green boxes in the figures, the TDC-GAN model not only completes the missing video frames but also keeps the video frames temporally and spatially consistent. Moreover, the multiple frame quantitative results are given in Table 1, and the best PSNR and SSIM values of the TDC-GAN model reach 26.8 and 0.85, respectively. It is worth noting that the TDC-GAN model performs well on both datasets. This versatility is of great significance to our further research.

In addition, Figure 7 shows the quantitative performance of each completed frame in the multiple frame experiment on the Caltech pedestrian dataset. As the number of completed frames increases, the values of PSNR and SSIM gradually decrease. The likely explanation is that past information becomes less useful for completing frames further into the future, which reduces performance. Nevertheless, the overall completion quality remains satisfactory.

5. Conclusion

This paper proposes a TDC-GAN model for completing traffic video sequences. In the TDC-GAN model, the designed generator network learns multiscale features from video sequences with the help of the FPN. Meanwhile, the discriminator network includes two branches (i.e., the global branch and the local branch), which take into account the temporal information between frames and the spatial information within each frame. The adversarial loss is utilized to improve the stability of training, and the perceptual loss calculates the semantic difference between the generated frame and the real frame, which enhances the performance of the proposed model. On the Caltech pedestrian dataset and the KITTI dataset, the experimental results show that the TDC-GAN model can effectively complete missing frames in traffic videos. In summary, the TDC-GAN model is well suited to completing traffic videos under various scenarios.

In the future, we will add techniques such as scene understanding to optimize our model for more complex problems (such as traffic videos with more missing frames). Moreover, encouraged by the promising performance of the TDC-GAN, we consider it worthwhile to explore further GAN-based methods in the traffic field.

Data Availability

The experimental datasets are available at http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians and http://www.cvlibs.net/datasets/kitti/raw_data.php.

Conflicts of Interest

We declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This study was supported by the National Natural Science Foundation of China, No. 61973103, Henan Province Central Plains Thousand Talents Plan: Top Young Talents, Key Scientific Research Project of Henan University with No. 19A120002.