Abstract
Video surveillance has become an important means of urban traffic monitoring and control. However, due to the complex and diverse video scenes, traffic data extraction from original videos is a sophisticated and difficult task, and the corresponding algorithms are of high complexity and calculation cost. To reduce the algorithm complexity and subsequent computation cost, this study proposes an autoencoder model that effectively reduces the video dimension by optimizing its structural parameters, so that traffic recognition models can perform image processing on the dimension-reduced videos. First, an optimal autoencoder model (AE) with five hidden layers was constructed. Then, it was combined with a linear classifier, a support vector machine (SVM), a deep neural network (DNN), a DNN-Linear classification method, and the k-means clustering method, yielding five traffic state recognition models: AE-Linear, AE-SVM, AE-DNN, AE-DNN_Linear, and AE-k-means. Training and test results show that the accuracy rates and recall rates of AE-Linear, AE-SVM, AE-DNN, and AE-DNN_Linear were 94.5%–97.1% and their F1 scores were 94.4%–97.1%; in addition, the accuracy rate, recall rate, and F1 score of AE-k-means were all approximately 95%, which suggests that combining the autoencoder with common classification or clustering methods achieves good recognition performance. A comparison was also made between the five proposed models and four CNN-based models (AlexNet, LeNet, GoogLeNet, and VGG16): the five proposed models achieved F1 scores of 94.4%–97.1%, while the four CNN-based models achieved F1 scores of 16.7%–94%, indicating that the proposed lightweight design outperforms more complex CNN-based models in traffic state recognition.
1. Introduction
With rapid urbanization and sustained national economic development, traffic congestion has increasingly become a key issue in large and medium-sized cities. To effectively control traffic operations and alleviate traffic congestion, it is very important to recognize the traffic state in a timely and effective manner. Extensive studies have been conducted on traffic state recognition using a variety of methods, such as fuzzy logic in [1], the fuzzy C-means clustering algorithm in [2, 3], machine learning in [4, 5], and the k-means clustering algorithm in [6]. Data used in these studies came mainly from fixed surveillance systems such as cameras and vehicle detectors, as well as vehicle GPS data in [7, 8]. With the enrichment of urban traffic surveillance video resources and the progress of image processing technology, traffic state recognition based on video images has gained wide attention.
A large number of studies have been conducted on traffic state recognition based on video images. Traditionally, traffic state recognition was achieved by establishing specific algorithms to extract traffic image characteristics in [9, 10]. Since these algorithms had high complexity, they could not cope with the rapid growth of mass traffic surveillance video. To avoid the extraction, analysis, and modeling of complex image features, deep learning methods have recently been used to achieve rapid and effective image pattern recognition through large-scale image training. The convolutional neural network (CNN) is one of the most common deep learning methods and has achieved great success in image classification and object detection tasks in [11, 12]. However, to the best of our knowledge, few studies have applied the CNN to traffic state recognition based on video images.
Video images often have high dimensionality and contain complex and redundant information. To recognize traffic states in a timely and effective manner, attempts have been made to reduce the dimensionality of video images and then perform image pattern recognition on the dimension-reduced data in [13]. Autoencoders, which are nonlinear, unsupervised neural network models comprising an input layer, a hidden layer, and an output layer, can effectively reduce the dimensionality of video image data and realize image classification in [14].
Traffic data extraction from original videos is a sophisticated and difficult task because of the complex and diverse video scenes, and the corresponding algorithms are of high complexity and calculation cost. In this study, to reduce the algorithm complexity and subsequent computation cost, an autoencoder model was proposed that effectively reduces the video dimension by optimizing its structural parameters. Then, a traffic state recognition method was established by integrating the autoencoder with common classification and clustering methods. The training and test results show that the proposed method has the advantages of a lightweight model structure and a low calculation cost and outperforms CNN models such as AlexNet and GoogLeNet. Therefore, it is suitable for traffic state recognition from videos.
2. Background
This section explains the relevant background of video-based traffic state recognition and reviews related work, including traditional traffic video processing, convolutional neural networks, and autoencoders.
Research results on video-based traffic state recognition are plentiful. Earlier research mainly achieved traffic state recognition by establishing specific algorithms to extract traffic image characteristics. By extracting a discrete cosine transform and its movement characteristics from traffic surveillance videos, a hidden Markov model with fuzzy C-means was used to learn and recognize traffic congestion states. A fuzzy logic method for traffic state recognition based on camera images was proposed, as was a traffic state recognition method using traffic monitoring videos based on the fuzzy C-means clustering algorithm. A method recognizing three traffic states (free traffic flow, steady traffic flow, and forced traffic flow) based on image and video data was proposed in [15], and an image-based traffic density estimation method was developed in which a support vector machine classifier classified traffic states into heavy, medium, and light densities. In general, these algorithms have high complexity and poor real-time video processing ability.
The convolutional neural network (CNN) achieves rapid and effective image pattern recognition through large-scale image training, which avoids the extraction, analysis, and modeling of complex image features. It is one of the most common deep learning methods and has achieved great success in image classification and object detection tasks. In 2012, Krizhevsky et al. introduced AlexNet, which ranked first in the ImageNet image classification competition in [16]. Since then, the network structure has been improved, and other CNN models, such as the visual geometry group (VGG) network and GoogLeNet, have been proposed. A UAV image vehicle recognition method based on the CNN and SVM was proposed in [17]. A satellite image vehicle detection method based on the CNN and hard example mining was put forward in [18]. Using the MIT and Caltech vehicle datasets, a vehicle detection method for traffic scenes based on an improved Faster R-CNN was proposed in [19]. The CNN has gained extensive interest in large-scale image classification, but few studies have applied it to traffic state recognition based on video images.
Autoencoders, which are nonlinear, unsupervised neural network models, can effectively reduce the dimensionality of video image data in [20, 21]. A supervised stacked autoencoder to extract image features was constructed in [22]. In [23], the authors integrated a random forest classifier into a stacked sparse autoencoder for hyperspectral image classification, which produced promising generalization performance, prediction accuracy, and operation speed. A dual adversarial autoencoder for image clustering was proposed in [24], which achieved clustering accuracy comparable to CNN models. In [25], an image compression architecture based on energy compaction using a convolutional autoencoder was established, which achieved high coding efficiency. In [26], the authors used an autoencoder to project high-dimensional image feature vectors into a low-dimensional latent space, proposed an image selection method based on autoencoder neural networks, and applied it to semisupervised image classification. An enhanced fast NSGA-II based on a special congestion strategy and an adaptive crossover strategy, namely ASDNSGA-II, was proposed to improve the selection strategy in [27]. The studies above demonstrate that autoencoders can effectively extract image features through dimension reduction and show good image recognition results when combined with classification or clustering methods.
3. Methodology
3.1. Methodological Framework
In this study, an autoencoder with multiple hidden layers was used to compress the image features. Based on the compressed low-dimensional data, classification and clustering methods were used for traffic state recognition. The proposed method consists of four steps (Figure 1), described as follows:

(1) Establishment of image datasets. Traffic images were generated from traffic surveillance videos. The traffic state was determined manually for each frame, and an image dataset with three traffic states (i.e., free traffic flow, steady traffic flow, and congested traffic flow) was established.

(2) Image data preprocessing. Each image matrix was transformed into a row vector, and the vector element values were normalized and taken as the input of the autoencoder (see the sketch after this list).

(3) Construction of the autoencoder. The influence of the main structural parameters, such as the dimensionality of the input data, the number of hidden layers, and the dimensionality of the dimension-reduced data, on the performance of the autoencoder was evaluated; as a result, an autoencoder suitable for traffic image processing was constructed. The autoencoder was trained, tested, and optimized based on feedback.

(4) Traffic state recognition. The support vector machine (SVM), deep neural network (DNN), and k-means clustering method were used to classify the feature data obtained using the autoencoder, thereby achieving traffic state recognition.
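For concreteness, the preprocessing in step (2) can be sketched as follows. This is a minimal illustration under our own assumptions (grayscale frames already resized to a square input; the function name is ours, not the authors'):

```python
import numpy as np

def preprocess_images(images):
    """Flatten each grayscale image to a row vector and scale the pixel
    values to [0, 1], as in step (2) of the framework."""
    X = np.stack([img.reshape(-1) for img in images]).astype(np.float32)
    return X / 255.0  # assumes 8-bit pixel intensities

# Example: three dummy 32 x 32 frames stand in for resized video frames.
frames = [np.random.randint(0, 256, (32, 32), dtype=np.uint8) for _ in range(3)]
X = preprocess_images(frames)
print(X.shape)  # (3, 1024)
```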

3.2. Autoencoder Construction
The autoencoder was composed of an encoder and a decoder. The encoder extracted the features of the original image data, and the decoder restored the original data from those features. The smaller the difference between the restored data and the original data, the better the autoencoder and the more effective the extracted feature data.
Traffic surveillance video images were taken as the input of the autoencoder. Both the encoder and decoder had $K$ hidden layers, and the network had a symmetrical structure. There was a fully connected weighting structure between each pair of layers, as shown in Figure 2. Taking the $i$th image in an image dataset with sample size $n$ as an example, the operation of the autoencoder is as follows:

(1) The image is converted into a vector $x_i \in \mathbb{R}^{d_0}$, where $d_0 = w \times h$. Here, $w$ and $h$ are the width (pixels) and height (pixels) of the image, and $d_0$ is the element number of $x_i$.

(2) Encoding: $K$ encoding hidden layers were used to perform a nonlinear transformation of $x_i$ to obtain encoding vectors $e_i^{(1)}, e_i^{(2)}, \ldots, e_i^{(K)}$, with the dimensions of $d_1, d_2, \ldots, d_K$, respectively. The calculation equation is as follows:

$$e_i^{(k)} = f\left(W_e^{(k)} e_i^{(k-1)} + b_e^{(k)}\right),$$

where $f(\cdot)$ is the encoding nonlinear activation function, which was a hyperbolic tangent function in this study; $W_e^{(k)}$ is the encoding weight matrix of the $k$th encoding hidden layer for the $i$th image; $e_i^{(k)}$ represents the $k$th encoding vector of the $i$th image; $b_e^{(k)}$ represents the encoding offset of the $k$th encoding hidden layer for the $i$th image; $k$ represents the encoding hidden layer number, $k = 1, 2, \ldots, K$; and $e_i^{(0)}$ is equal to $x_i$.

(3) Image decoding: $K$ decoding hidden layers were used to perform a nonlinear transformation of $e_i^{(K)}$, to obtain decoding vectors with the dimensions of $d_{K-1}, d_{K-2}, \ldots, d_1$, respectively. The decoding vectors were finally restored to the decoding vector $\hat{x}_i$ with a dimension of $d_0$. The calculation equation is as follows:

$$\hat{e}_i^{(k)} = g\left(W_d^{(k)} \hat{e}_i^{(k-1)} + b_d^{(k)}\right), \quad \hat{e}_i^{(0)} = e_i^{(K)},$$

where $g(\cdot)$ is the decoding nonlinear transformation function, which was a hyperbolic tangent function $\tanh(\cdot)$ in this study; $W_d^{(k)}$ is the decoding weight matrix of the $k$th decoding hidden layer for the $i$th image; $b_d^{(k)}$ is the decoding offset of the $k$th decoding hidden layer for the $i$th image; and $k$ is the decoding hidden layer number, $k = 1, 2, \ldots, K$.

(4) Feedback optimization of encoding-decoding training: with the least mean square (LMS) as the objective, through continuous feedback propagation training, the autoencoder minimizes the error between the input $x_i$ and the output $\hat{x}_i$, to obtain the optimal autoencoder structure and encoding data. The error loss calculation equation is as follows:

$$L = \frac{1}{n} \sum_{i=1}^{n} \left\| x_i - \hat{x}_i \right\|_2^2.$$
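The structure above can be sketched in a few lines of Keras. This is a minimal illustration, not the authors' exact configuration: the paper specifies only the input dimension, the number of hidden layers, and the reduced dimension, so the layer widths below are our assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_autoencoder(d0=32 * 32, dims=(512, 256, 128, 64, 32)):
    """Symmetric tanh autoencoder: encoding layers of widths `dims`,
    mirrored decoding layers, and an MSE (LMS) reconstruction loss."""
    inp = layers.Input(shape=(d0,))
    h = inp
    for d in dims:                        # encoder: x -> e^(1) ... e^(K)
        h = layers.Dense(d, activation="tanh")(h)
    code = h                              # dimension-reduced feature vector
    for d in reversed(dims[:-1]):         # decoder mirrors the encoder
        h = layers.Dense(d, activation="tanh")(h)
    out = layers.Dense(d0, activation="tanh")(h)  # restored vector x_hat
    autoencoder = models.Model(inp, out)
    encoder = models.Model(inp, code)
    autoencoder.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
    return autoencoder, encoder

autoencoder, encoder = build_autoencoder()
# autoencoder.fit(X, X, batch_size=100, epochs=50)  # X from the preprocessing step
# codes = encoder.predict(X)                        # input to the classifiers
```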

3.3. Traffic State Recognition
In this study, the dimension-reduced data obtained using the optimal autoencoder AE were used as the input of the classification and clustering methods for traffic state recognition, including the linear classifier in [28], the SVM in [29], the DNN in [30], and the DNN-Linear combined classifier in [28].
Here, $z_i = (z_{i1}, z_{i2}, \ldots, z_{iq})^{T}$ is the encoding vector of the $i$th image obtained using AE, $i = 1, 2, \ldots, n$; $z_{ij}$ is the value of the $j$th element; and $n$ represents the sample size of the dataset, i.e., the number of images.

(1) The linear classifier, a typical supervised learning model, finds the optimal hyperplane that separates the two classes of samples according to the characteristics of the samples. Taking $z_i$ as an example, let the optimal hyperplane be $w^{T} z + b = 0$. The left part of the equation can be defined as a linear discriminant function, $g(z) = w^{T} z + b$, where $w$ contains the weights and $b$ is the offset; the inner product is $w^{T} z$. The linear classifier divides the feature space into two regions, $g(z) > 0$ and $g(z) < 0$; in other words, the data points are divided into two categories, $y = +1$ and $y = -1$. The optimal hyperplane can be obtained by training on the sample data.

(2) As a linear classifier, the SVM is widely used in image processing, face detection, speech recognition, and so on. The dimension-reduced data $z_i$ that corresponds to each image was labeled with a category label $y_i \in \{-1, +1\}$, which indicates that the data are to be divided into two categories. Then, a training set $\{(z_i, y_i)\}_{i=1}^{n}$ was established to train the SVM, thereby obtaining the optimal hyperplane $w^{T} z + b = 0$ to classify the samples. Here, $w$ and $b$ represent the normal vector and intercept of the hyperplane, respectively. By introducing a slack variable $\xi_i \ge 0$, the matrix form of the SVM can be expressed as follows:

$$\min_{w, b, \xi} \; \frac{1}{2} w^{T} w + C e^{T} \xi \quad \text{s.t.} \quad y_i \left(w^{T} z_i + b\right) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, n,$$

where $w^{T}$ is the transpose of $w$, $C$ is the penalty parameter, and $e$ is the vector in which each element is 1. By solving this optimization problem, $w$ and $b$ can be obtained.

(3) The DNN mainly includes an input layer, hidden layers, and an output layer. There is a fully connected structure of weights between each pair of layers. This method is widely used in image recognition, natural language processing, speech recognition, and so on. Its structure is shown in Figure 3. The input layer has $q$ neurons, which are connected with the element values $z_{i1}, z_{i2}, \ldots, z_{iq}$ of $z_i$. After being processed by $H$ hidden layers (with dimensions $h_1, h_2, \ldots, h_H$), the probability of each category is obtained, and the sample category is determined using the softmax function.

(4) The DNN-Linear combined classifier consists of two parts: a linear model and a DNN model. The linear model can receive continuous features or sparse discrete features as input and quickly converge to effective feature weights using L1 regularization. The principle of the DNN part is the same as described above. The DNN-Linear combined classifier uses joint training to simultaneously feed the error back to the linear model and the DNN model to update their parameters. The sigmoid function is used to stack the outputs of the linear model and the DNN model to obtain the final output.
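For orientation, the sketch below trains stand-ins for three of the four classifiers on encoded features using scikit-learn. It is an illustration under our assumptions: random placeholder features replace the real codes, scikit-learn estimators stand in for the authors' implementations, and the DNN-Linear combined classifier (whose description resembles TensorFlow's wide-and-deep `DNNLinearCombinedClassifier`) is omitted here.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
codes = rng.normal(size=(600, 32)).astype(np.float32)  # placeholder for encoder output
y = rng.integers(0, 3, size=600)  # 0 = free, 1 = steady, 2 = congested

classifiers = {
    "AE-Linear": SGDClassifier(loss="hinge", random_state=0),  # linear classifier
    "AE-SVM": LinearSVC(C=1.0),          # soft-margin SVM with penalty parameter C
    "AE-DNN": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
}
for name, clf in classifiers.items():
    clf.fit(codes, y)                    # train on dimension-reduced features
    print(name, "training accuracy:", clf.score(codes, y))
```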

4. Field Experiments
4.1. Dataset
The original data consisted of traffic surveillance videos of a main road in a Chinese city, with an image resolution of 704 × 576 pixels. Images were extracted and divided into three traffic states: free traffic flow (free), steady traffic flow (steady), and congested traffic flow (congested). Two datasets were constructed: A1, with a sample size of 1,500, was used to evaluate the influence of the autoencoder structural parameters, and A2, with a sample size of 6,000, was used to optimize the autoencoder and perform traffic state recognition. The number of images was the same for each traffic state in both datasets. The datasets are shown in Figure 4.
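As an illustration of how such a dataset can be assembled, the following OpenCV-based sketch samples and resizes frames from a surveillance video. The sampling interval, the file path, and the 32 × 32 target size are our assumptions (chosen to match one of the input dimensions tested in Section 4.2), not details given by the paper.

```python
import cv2

def extract_frames(video_path, every_n=25, size=(32, 32)):
    """Sample one grayscale frame every `every_n` frames and resize it;
    the sampled frames are then labeled manually by traffic state."""
    cap = cv2.VideoCapture(video_path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            frames.append(cv2.resize(gray, size))
        i += 1
    cap.release()
    return frames

# frames = extract_frames("surveillance.mp4")  # hypothetical file path
```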

4.2. Optimization of the Autoencoder
4.2.1. Model Training
The training parameters included the basic learning rate, batch training size, and the number of iterations. In the process of model training, the adaptive moment estimation (Adam) iterative optimization algorithm in [31] was used to dynamically adjust the learning rate of each parameter using first-order and second-order moment estimation. The equations are as follows:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t,$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2,$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t},$$
$$\theta_t = \theta_{t-1} - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$

where $g_t$ is the gradient at time step $t$; $m_t$ is the first-order moment estimate of $g_t$, i.e., the exponential moving average of $g_t$; $v_t$ is the second-order moment estimate of $g_t$, i.e., the exponential moving average of $g_t^2$; $\beta_1$ and $\beta_2$ are the exponential decay rates; $\hat{m}_t$ is the deviation correction of $m_t$; $\hat{v}_t$ is the deviation correction of $v_t$; $\theta_t$ is the parameter vector at time step $t$; $\eta$ is the learning rate; and $\epsilon$ is a small residual term that prevents division by zero.
The Adam algorithm requires little memory during computation. After deviation correction, each iteration's learning rate falls within a certain range, which results in stable parameters. Therefore, the Adam algorithm was suitable for the training and optimization of the large datasets and high-dimensional data addressed in this study.
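A minimal NumPy sketch of one Adam update, mirroring the equations above. The decay rates and residual term use the commonly cited defaults, which is our assumption; the paper specifies only the basic learning rate of 0.0001.

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam parameter update, following the equations above."""
    m = beta1 * m + (1 - beta1) * g           # first-order moment estimate m_t
    v = beta2 * v + (1 - beta2) * g ** 2      # second-order moment estimate v_t
    m_hat = m / (1 - beta1 ** t)              # bias-corrected m_t
    v_hat = v / (1 - beta2 ** t)              # bias-corrected v_t
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: t starts at 1 so the bias corrections are well defined.
theta, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
for t in range(1, 4):
    g = np.random.randn(4)                    # gradient would come from the loss
    theta, m, v = adam_step(theta, g, m, v, t)
```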
4.2.2. Evaluation of the Autoencoder
According to the loss distribution curve of the autoencoder, two indexes were used to measure its convergence and evaluate its performance.

(1) A threshold $\delta$ was set for the loss. When the training loss of the autoencoder became less than $\delta$ after performing iterations to step $s$, the average loss $\bar{L}$ between steps $s$ and $T$ was calculated, as shown in Figure 5:

$$\bar{L} = \frac{1}{T - s + 1} \sum_{t=s}^{T} L_t,$$

where $T$ is the total number of training iterations, $L_t$ is the loss of the $t$th iteration, $s \le t \le T$, and $\delta$ is the loss threshold. The smaller the value of $\bar{L}$, the smaller the difference between the original input data and the restored data of the autoencoder, and the better the performance of the autoencoder.

(2) The loss distribution curve of the autoencoder was fitted, and the tangent slope at the end of the fitting curve was calculated:

$$k = F'(T),$$

where $F(\cdot)$ is the loss fitting curve function and $F'(T)$ is its first-order derivative at the $T$th iteration, i.e., the slope. As shown in Figure 6, when $k$ is less than 0, the closer it is to 0, the gentler the loss curve and the better the convergence effect. Under the same parameter settings and training conditions, as $\bar{L}$ decreased, more effective feature data could be extracted and restored by the autoencoder; as $k$ became less than 0 and approached 0, the convergence effect of the autoencoder loss was better, which indicates high encoding-decoding stability and reliability.
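Both convergence indexes can be computed directly from a recorded loss history. The sketch below assumes a fourth-degree polynomial for the fitting function $F$, since the paper does not state its form:

```python
import numpy as np

def convergence_indexes(losses, delta):
    """Average loss after the curve first drops below delta, and the
    tangent slope at the end of a polynomial fit to the loss curve."""
    losses = np.asarray(losses, dtype=float)
    below = np.nonzero(losses < delta)[0]
    s = below[0] if below.size else len(losses) - 1   # step s where loss < delta
    avg_loss = losses[s:].mean()                      # average loss over steps s..T
    t = np.arange(len(losses), dtype=float)
    coeffs = np.polyfit(t, losses, deg=4)             # fitted curve F(t); degree assumed
    slope = np.polyval(np.polyder(coeffs), t[-1])     # k = F'(T)
    return avg_loss, slope

# Usage on a synthetic decaying loss curve:
history = 0.5 * np.exp(-np.linspace(0, 6, 200)) + 0.004
print(convergence_indexes(history, delta=0.01))
```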


4.2.3. Structure Optimization of the Autoencoder

To design an autoencoder for video traffic state recognition, 36 sets of crossover tests were conducted based on the dimension of the input data, the dimension of the dimension-reduced data, and the number of hidden layers, as shown in Table 1. The 36 models were trained using sample dataset A1 with the same training parameters: the basic learning rate was 0.0001, the batch training size was 100, and the number of iterations was 15,000.
The distribution of the loss was analyzed, and the average loss $\bar{L}$ with $\delta$ set to 0.01 was calculated, as shown in Figure 7. The average losses of the 3rd, 6th, 9th, 12th, 15th, 18th, 21st, and 24th groups were lower than those of the other groups. In these eight groups of tests, the input data dimension of the first four groups was 32 × 32 and that of the last four groups was 64 × 64, and the number of hidden layers in each test was 10. Therefore, when the input data dimension decreased and the number of hidden layers increased, $\bar{L}$ decreased; the effect of the dimension of the dimension-reduced data on the loss was not significant, although the model training loss tended to decrease as that dimension increased. The models of these eight groups were preliminarily selected as candidate models.

Then, curve fitting was performed on the loss curves of the 36 groups of tests. The coefficients of determination $R^2$ of the fitting functions were all between 0.84 and 0.95, and only 13 groups of tests had $R^2$ lower than 0.9; therefore, the goodness of fit was high on the whole. The tangent slope $k$ of the loss fitting function of each group of tests was calculated, as shown in Figure 8. In general, the tangent slopes of the loss fitting curves of all groups were less than 0 and close to 0, so the curves converged well, and the tangent slopes of the eight candidate models were all small in magnitude. Therefore, the autoencoder models of the 3rd, 6th, 9th, 12th, 15th, 18th, 21st, and 24th groups of tests were selected as the candidate models and numbered model I–model VIII. Their structures are shown in Table 2.

Using sample dataset A2, the candidate models I–VIII were tested. The basic learning rate was 0.0001, the batch training size was 300, and the number of iterations was 100,000. According to the distribution of the training loss, $\delta$ was set to 0.005, and $\bar{L}$ and $k$ were calculated for each model. The results are shown in Table 3. Compared with the other models, model III had the smallest $\bar{L}$, and its tangent slope $k$ was close to 0. Therefore, given its performance, model III was selected as the optimal autoencoder model, hereafter denoted AE.
5. Result Analysis and Discussion
5.1. Traffic State Recognition and Evaluation Methods
We assume that there were $m$ types of samples (i.e., traffic states) in the traffic video image dataset. For an arbitrary image sample, its traffic state can be accurately recognized, or it can be misjudged as another traffic state. Therefore, in the recognition results for a certain type of traffic state, there might be errors and omissions. To effectively evaluate the traffic state recognition model, the accuracy rate and recall rate were used to measure the error and omission situations. Furthermore, a comprehensive evaluation index, the $F1$ value, was used; the larger the value of $F1$, the better the performance of the model. The principle and calculation process of the three indexes are as follows:

(1) The recognition accuracy rate, recall rate, and $F1$ value of each traffic state in the image dataset were calculated as follows:

$$P_j = \frac{N_j^{c}}{N_j^{r}}, \quad R_j = \frac{N_j^{c}}{N_j^{a}}, \quad F1_j = \frac{2 P_j R_j}{P_j + R_j}.$$

Here, $m$ is the total number of traffic states in the traffic image dataset; in this study, $m = 3$ (i.e., free, steady, and congested traffic flow); $j$ is the category of the traffic state, $j = 1, 2, \ldots, m$; $P_j$ is the accuracy rate of the model on the $j$th type of sample in the image dataset; $R_j$ is the recall rate of the model on the $j$th type of sample; $F1_j$ is the $F1$ value of the model on the $j$th type of samples; $N_j^{c}$ is the number of samples of the $j$th type correctly recognized using the model; $N_j^{r}$ is the number of samples recognized as the $j$th type using the model; and $N_j^{a}$ represents the actual number of samples of the $j$th type in the image dataset.

(2) The average accuracy rate, recall rate, and $F1$ value over the $m$ types of traffic states were calculated:

$$\bar{P} = \frac{1}{m} \sum_{j=1}^{m} P_j, \quad \bar{R} = \frac{1}{m} \sum_{j=1}^{m} R_j, \quad \bar{F1} = \frac{1}{m} \sum_{j=1}^{m} F1_j,$$

where $\bar{P}$ is the average accuracy rate of the model on the $m$-type traffic states in the image dataset, $\bar{R}$ is the average recall rate, and $\bar{F1}$ is the average $F1$ value.
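The three indexes can be computed directly from the predicted and actual labels; a minimal NumPy sketch follows (the variable names are ours):

```python
import numpy as np

def per_class_metrics(y_true, y_pred, m=3):
    """Accuracy rate P_j, recall rate R_j, and F1_j for each of m states."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    for j in range(m):
        n_c = np.sum((y_pred == j) & (y_true == j))  # correctly recognized as j
        n_r = np.sum(y_pred == j)                    # recognized as type j
        n_a = np.sum(y_true == j)                    # actual type-j samples
        p = n_c / n_r if n_r else 0.0
        r = n_c / n_a if n_a else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        print(f"state {j}: P={p:.3f} R={r:.3f} F1={f1:.3f}")

per_class_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```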
5.2. AE-Classifier Recognition Results
The traffic images were encoded using the autoencoder AE, from which the dimension-reduced data were obtained and taken as the input to the SVM, DNN, linear classifier, and DNN-Linear classifier for traffic state recognition. In this study, the models that integrated AE with the SVM, DNN, linear classifier, and DNN-Linear classifier were named AE-SVM, AE-DNN, AE-Linear, and AE-DNN_Linear, respectively. Traffic state recognition was conducted on dataset A2. The average accuracy rate $\bar{P}$, recall rate $\bar{R}$, and $F1$ value $\bar{F1}$ were calculated, as shown in Table 4.
The $\bar{P}$ and $\bar{R}$ values of the four models were all in the range of 94.5%–97.1%, and the average $F1$ values were 94.4%–97.1%, which indicates that integrating the autoencoder with the classification methods produced excellent traffic state recognition results. Thus, AE can effectively extract traffic image features to achieve traffic state recognition.
To further analyze the recognition results, the accuracy rate, recall rate, and $F1$ value of each traffic state were calculated; the results are shown in Figure 9. The accuracy rate, recall rate, and $F1$ value varied among the four models. The accuracy rates ranged between 89.0% and 100%; only the accuracy rate of AE-SVM for the steady traffic flow and that of AE-Linear for the congested traffic flow were below 90%. The recall rates were 86.4%–100%; specifically, only the recall rate of AE-SVM for the congested traffic flow and that of AE-Linear for the steady traffic flow were below 90%. The $F1$ values ranged between 91.1% and 100%, which indicates that the overall performance of all four models was good.

From the perspective of traffic states, the accuracy rate, recall rate, and $F1$ value for the free traffic flow were the best in all of the models, compared with the steady and congested traffic states. The $F1$ values were close to 100%, which indicates that almost all free traffic flow images were correctly recognized. The accuracy rates and recall rates for the steady and congested traffic states were mostly higher than 90%; only the accuracy rate of AE-SVM for the steady traffic flow and the recall rate of AE-SVM for the congested traffic flow, as well as the accuracy rate of AE-Linear for the congested traffic flow and the recall rate of AE-Linear for the steady traffic flow, were below 90%. Figures 10–12 show the recognition results of the four models for the three traffic states in dataset A2. The autoencoder showed a good effect on image dimensionality reduction, and integrating it with the SVM, DNN, linear classifier, and DNN-Linear classifier achieved overall good recognition results for the different traffic states.

5.3. AE-k-Means Clustering Results
The autoencoder AE was integrated with the k-means clustering method to form a model referred to as AE-k-means, which was trained using dataset A2. The clustering accuracy rate, recall rate, and $F1$ value for the different traffic states were calculated; the results are shown in Table 5. The accuracy rate, recall rate, and $F1$ value of the AE-k-means model were all between 90.2% and 100%, which suggests good recognition performance for the different traffic states. Moreover, the accuracy rate, recall rate, and $F1$ value for the steady traffic flow were all 100%; in other words, all of these samples were recognized accurately. The recognition results for the free and congested traffic flows were almost the same, with $F1$ values of approximately 93%. Thus, the optimal autoencoder constructed in this study can achieve a good traffic state recognition effect when combined with common clustering methods, which indicates that AE can effectively extract traffic image features.
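Since k-means produces unlabeled clusters, evaluating the clustering results against the manually labeled states requires mapping each cluster to a traffic state. The sketch below uses a majority-vote mapping, which is our assumption (the paper does not state its mapping rule), with scikit-learn's KMeans standing in for the authors' implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def ae_kmeans_predict(codes, y_true, n_states=3, seed=0):
    """Cluster encoded features with k-means, then map each cluster to the
    traffic state that dominates it (clusters themselves carry no labels)."""
    y_true = np.asarray(y_true)
    labels = KMeans(n_clusters=n_states, n_init=10, random_state=seed).fit_predict(codes)
    mapping = {
        c: np.bincount(y_true[labels == c], minlength=n_states).argmax()
        for c in range(n_states)
    }
    return np.array([mapping[c] for c in labels])

# Usage with placeholder features; real codes come from the trained encoder.
rng = np.random.default_rng(0)
codes = rng.normal(size=(300, 32))
y_true = rng.integers(0, 3, size=300)
y_pred = ae_kmeans_predict(codes, y_true)
```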
5.4. Comparison between the Proposed Model and CNN Model
The CNN is a typical deep learning method. Common CNN models include AlexNet, VGG16, GoogLeNet, and LeNet in [32], which have been widely applied in the field of image pattern recognition and have achieved remarkable results. In this study, dataset A2 was used to train and test the four CNN models, and the accuracy rate, recall rate, and $F1$ value were calculated; the results are shown in Table 6.
Among the four CNN models, AlexNet had the best performance; its accuracy rate, recall rate, and $F1$ value all reached approximately 94%. The $F1$ values of the proposed AE-classifier models and the AE-k-means clustering model were 94.4%–97.1%, higher than that of AlexNet. The performance of LeNet in traffic state recognition was mediocre, with an accuracy rate, recall rate, and $F1$ value of 82.3%, 62.4%, and 71.0%, respectively. Moreover, the $F1$ values of GoogLeNet and VGG16 were both below 40%, which suggests poor performance in traffic state recognition.
From the perspective of the model network structure, AE used only five encoding hidden layers to obtain the image feature data, whereas AlexNet, LeNet, GoogLeNet, and VGG16 had 12, 9, 23, and 22 network layers, respectively. The complexity of AE was thus much lower than that of those models. Therefore, by integrating a simple and practicable autoencoder with common classifiers, the models proposed in this study were able to match or surpass the traffic state recognition performance of complex CNN networks. Compared with the CNN, the method proposed in this study has the advantages of a lightweight model structure and a low calculation cost and outperforms CNN models such as AlexNet and GoogLeNet. Therefore, it is suitable for traffic state recognition from videos.
6. Conclusions
In this study, an autoencoder model for urban traffic surveillance videos was proposed that can effectively reduce the data dimension by optimizing the input data dimension, the number of hidden layers, and the dimension of the dimension-reduced data. Taking the low-dimensional image features obtained using the autoencoder as the input, five models were constructed based on common classification and clustering methods, including the linear classifier, DNN, SVM, DNN-Linear classifier, and k-means clustering method. The performance of these models in traffic state recognition was compared with that of commonly used CNN models. The results show that the average $F1$ value of the four models AE-DNN, AE-SVM, AE-Linear, and AE-DNN_Linear was 94.4%–97.1%, and the average $F1$ value of AE-k-means was 95.3%. Among the CNN models, AlexNet had the best performance, with an $F1$ value of 94.0%. Thus, the autoencoder constructed in this study can effectively extract image features. When integrated with common classification and clustering methods, it can accurately recognize the traffic state and achieves better results than common CNN models such as AlexNet, LeNet, GoogLeNet, and VGG16.
The traffic state recognition model was established in two stages to avoid the problems of high algorithm complexity and calculation cost. First, an autoencoder was proposed that effectively reduces the video dimension; then, the traffic state recognition model was established by integrating the autoencoder with common classification and clustering methods. The training and test results show that our method has the advantages of a lightweight model structure and a low calculation cost and outperforms CNN models such as AlexNet and GoogLeNet. The method can also be applied in other fields of video processing, such as image compression, moving target detection, and image classification.
Due to the variability and complexity of traffic scenes and the difficulty of building large-scale traffic image datasets, we conducted an exploratory study on the construction and optimization of an autoencoder model using limited traffic scenes and a limited sample size. In the future, we will focus on training and test dataset construction, autoencoder architecture and parameter optimization, and classification method selection.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest.
Acknowledgments
This research was supported in part by the National Natural Science Foundation of China under grant 61703064, Chongqing Research Program under grant cstc2018jscx-msybX0295, and Scientific Research Project of Chongqing Key Lab of Traffic System & Safety in Mountain Cities under grant 2018TSSMC05.