Abstract

Road network traffic state prediction has been constantly challenged by the complex spatiotemporal features of traffic information as well as by imperfections in streaming data. This paper proposes a spatiotemporal graph neural network traffic flow prediction model based on the fusion of attention mechanisms (STGNN-FAM) to tackle these challenges simultaneously. The model contains a spatial feature extraction layer, a bidirectional temporal feature extraction layer, and an attention fusion layer; it not only fully considers the temporal and spatial features of the traffic flow problem but also uses attention mechanisms to enhance the critical temporal and spatial features and thereby achieve more accurate and robust predictions. Experimental results on the traffic speed dataset PeMSD7 show that the proposed STGNN-FAM outperforms several important baselines in prediction accuracy and in its ability to withstand interference in the data stream, especially for mid- and long-term prediction horizons of 30 and 45 minutes.

1. Introduction

As one of the core functions of the intelligent transportation system, traffic flow prediction has always been a research hotspot. The task of traffic flow prediction is to predict the traffic situation over a period in the future based on the historical traffic situation. According to the actual prediction content, the traffic flow prediction problem can be divided into flow prediction, speed prediction, density prediction, and so on. Accurate traffic flow prediction can provide decision support for traffic control departments and help the public arrange travel routes rationally, thereby improving the resilience of all parties under unstable traffic conditions.

Traditional traffic state prediction models include dynamic modeling, statistical models, and machine learning; a review of these methods is presented in Section 2.1. With the development of deep learning, scholars have increasingly directed their research toward traffic flow prediction based on deep learning models, which have been widely and successfully applied. Since traffic flow prediction is essentially a time series prediction problem, the classic time series model, the recurrent neural network (RNN), and its variants were the most commonly used deep learning models for traffic flow prediction in the early days. Traffic flow, however, is spatiotemporal data: in addition to temporal correlation, spatial correlation also needs to be considered. Researchers therefore began to study traffic flow prediction from the perspective of spatiotemporal feature analysis. The most typical method is to extract the spatiotemporal correlation using a combined model composed of a convolutional neural network (CNN), which can extract spatial features, and an RNN. However, CNNs are better suited to processing image data in the Euclidean domain, whereas a generalized traffic network is a non-Euclidean data structure, which inevitably limits the performance of CNNs. The emergence of the graph convolutional neural network (GCN) addressed this problem. Because the spatial structure handled by GCNs is consistent with a traffic road network that exhibits non-Euclidean spatial correlation, researchers began to use GCNs to capture the spatial relationships within road networks. Many experiments have shown that the performance of GCN-based hybrid models in road traffic flow and speed prediction tasks is far better than that of previous methods.

Although combining a GCN with a time series model has become a popular approach to the traffic flow prediction problem, this method still has room for improvement. When a graph network is used to model the spatial structure of the road network, the update of the central prediction node depends on the features of its neighboring nodes. However, in an actual traffic road network, the degree of correlation between each neighbor sensor and the central sensor differs, affected by complex geographical locations and road structures. Ideally, it is therefore necessary to capture the different impacts that the traffic flow at adjacent nodes has on the traffic flow at the central prediction node and to focus on the adjacent nodes whose information significantly affects the central node. However, a GCN can only uniformly assign the same weight to all neighbor nodes in the same-order neighborhood, which makes it unable to single out the sensor nodes that play a more significant role in the traffic network and thus weakens its ability to capture spatial features. Moreover, existing models mainly rely on the long short-term memory network (LSTM) and the gated recurrent unit (GRU) to extract temporal correlation, yet the traffic flow information at a given moment is not related only to the past state, and a one-way RNN alone cannot accurately extract the temporal features. These shortcomings indicate that current research has not yet fully modeled the spatiotemporal correlation and that there is room to improve prediction accuracy. In this paper, STGNN-FAM, a spatiotemporal graph network traffic flow prediction model based on attention mechanism fusion, is proposed and applied to urban traffic flow speed prediction tasks. The main methodological contributions of this paper are as follows:

(1) In spatial feature extraction, STGNN-FAM uses the graph attention network (GAT) to bring attention into the process of extracting spatial features. The attention mechanism calculates the similarity between the features of the center node and each neighbor node, and different weights are assigned to the neighbor nodes according to their degree of association so as to amplify the impact of the features of critical areas.

(2) In temporal feature extraction, STGNN-FAM uses the bidirectional gated recurrent unit (BiGRU) to simultaneously consider the influence of past and future periods on traffic flow speed, enhancing temporal feature extraction through bidirectional input. Note that future information is used only in the training phase of the model, not when the model performs prediction tasks.

(3) STGNN-FAM has a separate attention fusion layer before the final output to filter the spatiotemporal feature representations produced by the temporal feature extraction layer. Through the attention mechanism, this layer selectively attends to the impact of inputs at different time steps on the prediction results and gives higher weight to input vectors with higher correlation, thereby enhancing critical spatiotemporal features and improving the accuracy of the model's final prediction.

To fully assess the prediction accuracy and the robustness against data interference of the proposed STGNN-FAM model, we conduct a battery of comparative experiments on a real-world traffic speed dataset. According to the analysis of three evaluation indexes, namely, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE), the prediction error of STGNN-FAM in the 45-minute prediction task is reduced by 9.3%–53.7% (MAE), 2.4%–44.8% (RMSE), and 0.76%–8.1% (MAPE) compared with the baseline models. Under noise disturbance, the variation ranges of the three evaluation indicators are 1.0%–3.1% (MAE), 0.3%–2.5% (RMSE), and 0.04%–0.19% (MAPE). Under missing-value disturbance, the ranges are 1.7%–5.1% (MAE), 0.9%–4.8% (RMSE), and 0.05%–0.27% (MAPE). The experimental results show that STGNN-FAM has excellent prediction performance, especially in medium- and long-term prediction tasks, and strong anti-interference ability, so it can be applied in real-world scenarios.

The remainder of this paper is organized as follows: Section 2 reviews the research progress in traffic flow prediction. Section 3 defines the problem and analyzes the spatiotemporal correlation of traffic flow in detail. Section 4 presents STGNN-FAM and its prediction algorithm. In Section 5, prediction and interference experiments are designed to verify the model's performance, and the experimental results are analyzed. Section 6 concludes the paper and outlines prospects for future work.

2. Related Work

Traffic prediction models can be broadly categorized into model-based methods and data-driven methods. Data-driven methods can be further subcategorized into statistical models, machine learning, and deep learning. Traffic prediction methods based on dynamic models, statistical models, and machine learning are collectively referred to as traditional methods.

2.1. Traditional Traffic Flow Prediction Methods

The traffic flow prediction method based on dynamic modeling was a standard early approach. It uses mathematical tools (such as differential equations) and physical knowledge to formulate traffic problems through calculation and simulation [1]. Representative models include Queuing Theory [2], the Cell Transmission model [3], and the Fundamental Diagram model [4]. However, these physical models consume a great deal of computing power, and their assumptions about the environment must be simplified to cope with the complexity of actual road conditions. Because dynamic modeling methods have many such shortcomings, and because the development of traffic data collection equipment has continuously increased the amount, variety, and accuracy of traffic data, statistical model methods, which are simpler to operate, more computationally efficient, and more technologically mature than dynamic modeling methods, gradually came to the attention of researchers. Such methods are sets of mathematical models that make predictions in a somewhat idealized form based on assumptions about historical sample data. Commonly used classical models include the Historical Average model [5], the Autoregressive Moving Average model [6], and the Kalman filter model [7]. Compared with dynamic modeling, traffic flow prediction based on statistical models improves calculation speed; however, because the models are too simple, this approach is only applicable to short-term prediction scenarios with low accuracy requirements. Since the 1990s, prediction methods based on traditional machine learning have gradually been applied to traffic flow prediction, including the Support Vector Regression (SVR) model [8], Bayesian models [9], and the K-nearest neighbor model [10]. Lippi et al. [11] exploited the seasonality of the traffic flow time series and introduced a seasonal kernel into the SVR model to achieve high-precision prediction. Sun et al. [12] proposed a traffic flow prediction method based on the Bayesian model, modeling the traffic flow between adjacent roads in the traffic network as a Bayesian network for prediction. The introduction of machine learning brought significant progress to traffic flow prediction. However, due to the limitations of machine learning itself, such as overly simple model architectures, insufficient consideration of spatial features, and an insufficient response to complex traffic conditions, these methods usually fail to produce the best results for prediction tasks governed by complex laws and factors.

2.2. Traffic Flow Prediction Method Based on Deep Learning

Deep learning is a newer branch of machine learning whose motivation is to build neural networks that simulate the human brain for analysis and learning. Because of its ability to automatically learn features from data, it has achieved excellent research results in many fields. LSTM, one of the most popular deep learning techniques at present, also shows great advantages in traffic flow prediction tasks. Cao and Cao [13] used LSTM for prediction and compared it with the results of an SAE model; the experimental results on the PeMS dataset showed that the LSTM model is better at processing time series data. Afrin and Yodo [14] proposed an improved LSTM-based traffic flow prediction framework, LSTM-CTP, which retains LSTM's ability to capture long-term dependence and achieves better accuracy than a plain LSTM. Although many LSTM-based methods perform well in traffic flow prediction tasks, LSTM also has limitations, such as a large number of parameters and difficult convergence. To reduce the computational complexity, Liu et al. [15] applied GRU, a variant of LSTM, to traffic flow prediction and achieved good results. Considering that the simple gating structure of GRU may affect prediction accuracy, Sun et al. [16] proposed a selective stacked gated recurrent model, SSGRU, which realizes prediction by stacking GRU units; experiments show that this simply structured model has stronger adaptability and higher accuracy. To make the representation of the input data more fine-grained and to dig more deeply into the internal temporal features of traffic flow data, a few works began to focus on Bidirectional Recurrent Neural Networks (BRNNs). Abduljabbar et al. [17] used a Bidirectional Long Short-Term Memory network (BiLSTM) to predict traffic flow; the good results obtained in that work show that training on traffic flow data with input from both directions can achieve better prediction results.

Compared with traditional traffic flow prediction models, deep learning models such as LSTM and GRU do not need manually extracted features of the traffic network, and prediction accuracy has been further improved. However, traffic flow is spatiotemporal data with complex features: although a single temporal model mines the time information of traffic flow, it ignores the spatial features. To address this defect, deep learning methods based on hybrid neural network models have become a research focus in recent years. A typical method is to combine a convolutional neural network (CNN), which can extract spatial features, with a time series model for joint forecasting. For example, Zheng et al. [18] introduced a CNN to extract spatial correlation and combined it with LSTM to form a ConvLSTM module; the experimental results show that the combined model outperforms a single time series model in the traffic flow speed prediction task. Ma et al. [19] formed a hybrid model of CNN and GRU and combined it with a spatiotemporal feature selection algorithm to achieve prediction. Pu et al. [20] proposed a combined deep learning model, CNN-ResNet-LSTM, which uses CNN and LSTM to capture the local spatial and temporal features of urban traffic data and uses residual units to deepen the network and improve prediction accuracy; it achieved high accuracy on a Beijing dataset. Chai et al. [21] combined a 1D-CNN and BiGRU in their model, which allowed it to capture spatial features and long-term temporal dependence more accurately, thus improving prediction accuracy.

Compared with prediction using a single time series model, the previous methods comprehensively consider the spatiotemporal features of traffic data and are better at coping with complex road conditions. However, since the traffic network is not a regular grid, the CNN-based hybrid model is limited by its spatial structure and consistently fails to perform at its best in spatial feature extraction. GCN is a neural network that extends the convolution operation to the graph domain. Given its remarkable achievements in various tasks, many researchers have begun to use GCN instead of CNN to extract the spatial features of the road network. Yu et al. [22] proposed the STGCN model, composed of spatiotemporal blocks built from spectral-domain graph convolutions and gated convolutions, and achieved good experimental results on a traffic flow speed dataset. Li et al. [23] built the DCRNN model from spatial-domain graph convolutions and gated recurrent units, modeled traffic flow as a diffusion process on a directed graph, and captured the spatial dependence using bidirectional random walks on the graph. Zhao et al. [24] built the T-GCN model by integrating a graph convolutional network and a gated recurrent unit to jointly capture the latent spatiotemporal features in traffic flow data. All of these studies show that GCN-based hybrid models outperform earlier methods in traffic flow prediction tasks. To further improve prediction accuracy, researchers have tried to introduce attention mechanisms into GCN-based hybrid models to assist the extraction of spatiotemporal features. Guo et al. [25] developed a spatiotemporal attention mechanism to mine the dynamic spatiotemporal features of large-scale traffic data, capturing the relevance of different locations and times by introducing attention into the spatiotemporal module. Gan et al. [26] designed an attention-based spatiotemporal graph neural network for traffic prediction; by introducing an attention mechanism, they adjusted the importance of adjacent and nonadjacent roads and integrated global spatial information. Zhao et al. [27] designed an attention module in their STGCGRN model to capture long-term temporal dependence and improve prediction performance. These studies show that introducing an attention mechanism can assign weights to different pieces of information, strengthening the memory of important information and effectively alleviating the information loss that occurs when dealing with long sequences; it is therefore also beneficial to apply it to graph neural networks. GAT is a variant of spatial-domain graph convolution in which an attention mechanism replaces the fixed convolution operation of GCN. Compared with GCN, which can only assign the same weight to all neighbor nodes, GAT can dynamically assign different weights to neighbor nodes according to the impact that each neighbor sensor has on the central node in the traffic network, extracting the spatial correlation of the traffic network more accurately. Although this approach has so far received less attention than traditional GCN, it has developed rapidly in recent years. Zhang et al. [28] combined GAT with the LSTM model, which is good at capturing long-term dependence, to construct a new combined model, ST-GAT. The STGAT model proposed by Kong et al. [29] integrates a graph attention layer and gated temporal convolution into a spatiotemporal block to capture spatiotemporal features at the same time. The ST-GRAT model proposed by Park et al. [30] uses multihead attention to extract the spatial and temporal correlation of the road network; the attention heads that extract spatial features are divided into two types that focus on the inflow and outflow directions in the road network, respectively. These works focus on modeling the spatial structure of road networks with GAT, and their excellent experimental results demonstrate the unique advantages of GAT for spatial correlation extraction. Regrettably, however, many methods, including these works, still consider only one-way temporal dependence, which leaves room for improvement in prediction accuracy.

In summary, as research on traffic flow prediction has deepened, hybrid models based on GCN and its variants have achieved great success. However, most methods still do not consider spatial and temporal correlation thoroughly enough. Focusing on the impact of important information and choosing a suitable scheme to fully mine the spatiotemporal features hidden in complex traffic flow data therefore remains the primary challenge in traffic flow prediction, and it is the challenge on which this paper is based.

3. Problem Overview

3.1. Definition of Traffic Flow Speed Prediction Problem

When a graph network is applied to a traffic flow speed prediction task, the road network must first be modeled as a graph. This paper defines the traffic network as an undirected graph G = (V, E, W), where the node set V is the set of N sensors in the road network, the edge set E is the set of connections between sensors, and W ∈ R^{N×N} is the adjacency matrix of the graph. Assuming that the current time is t, the traffic flow speed observed by the N sensors at time t can be expressed as X_t ∈ R^N. Accordingly, given the historical traffic flow speed information of the past P time steps on graph G, the process of predicting the traffic flow speed for the future Q time steps can be defined as

\[
\left[\hat{X}_{t+1}, \hat{X}_{t+2}, \ldots, \hat{X}_{t+Q}\right] = f\!\left(G;\; \left[X_{t-P+1}, \ldots, X_{t-1}, X_{t}\right]\right). \tag{1}
\]

According to equation (1), solving the traffic flow speed prediction problem amounts to learning a suitable mapping function f that predicts the future traffic flow data from the input historical traffic flow data and the graph structure.

3.2. Analysis of Spatiotemporal Features of Traffic Flow Speed

Unlike other time series data, traffic flow speed data exhibit constantly changing spatiotemporal features, which appear in the road network map as spatial correlation between sensors, due to geographical location and connectivity, and as temporal correlation of each sensor with itself, due to changes in its observations over time. As shown in Figure 1, Road 1 and Road 2 are two unrelated parallel sections, with nodes representing sensors and arrows indicating the direction in which traffic flow propagates. Sensors 1 and 2 are located on the same road, so the traffic flow speed information they record is very similar; since upstream traffic is affected by downstream traffic, the data recorded by sensor 2 are closely related to the road conditions of the section where sensor 1 is located. Sensors 1 and 3, in contrast, are close in space but lie on two roads with opposite traffic flow directions. Therefore, in this figure, there is a stronger spatial correlation between sensors 1 and 2, while the spatial correlation between sensors 1 and 3 is relatively small or even negligible.

Figure 2 is a schematic diagram of a single sensor. The horizontal axis is the time axis, including times t, t + 1, and t + 2. The dashed curves represent the influence of the sensor on itself over time: the darker the curve, the more significant the influence. The figure shows that the information recorded by the same sensor changes over time, and the information at a given time has different impacts on future times. For example, the data recorded at time t heavily impact the adjacent time t + 1, while they have a relatively light effect on time t + 2, which is two time steps away.

According to the previous analysis, it can be seen that there are complex spatial and temporal correlations in the entire traffic network. Selecting a suitable prediction model to extract the two features is crucial to achieve accurate traffic flow speed prediction.

4. STGNN-FAM Model

The STGNN-FAM proposed in this paper is a hybrid model for predicting traffic flow speed. The model comprises an input layer, a spatial feature extraction layer, a bidirectional temporal feature extraction layer, an attention fusion layer, and an output layer. The overall framework is shown in Figure 3: the spatial feature extraction layer is composed of a GAT network, and a BiGRU network forms the bidirectional temporal feature extraction layer. When this model is used for prediction, the original traffic flow speed data are first processed into vector form by the input layer; the processed information is then sent to the spatial feature extraction layer to extract the spatial features in the data; the resulting representation is sent to the bidirectional temporal feature extraction layer, where feature mining is performed again at the temporal level; next, the feature representation output by the bidirectional temporal feature extraction layer is weighted by the attention fusion layer to highlight important spatiotemporal information; finally, a fully connected layer predicts the outcome at the future time steps. The detailed calculation process is given in the subsections of this section.

4.1. Input Layer

This step uses the Speed2Vec mechanism [28] to preprocess the data so that the historical traffic flow speed data X_t (representing the traffic flow speed information of all nodes at time t) become a feature representation that can participate in the GAT network calculation. The data reshaping process is shown in Figure 4. The idea is to regard the historical time series that a node uses for prediction as the features of that node. Compared with other works that are modeled as sequence-to-sequence structures [31], this processing method enables time series data to be used directly as the input of the GAT network. The process is as follows: let x_i^t denote the observed speed of node i at time t; then the feature vector of the node can be defined as h_i = [x_i^{t-P+1}, ..., x_i^{t-1}, x_i^t], where h_i ∈ R^P and P is the historical time step, whose size can be adjusted according to the needs of the task. Therefore, the feature vector representation of the N nodes is as follows:

\[
H = \left[h_1, h_2, \ldots, h_N\right]^{\top} \in \mathbb{R}^{N \times P}. \tag{2}
\]

H is the final processing result; it is sent to the GAT network to participate in the calculation of the attention coefficients and the update of the node features.
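To make the reshaping concrete, the following minimal NumPy sketch (our own illustration; function and variable names are not from the original implementation) builds the Speed2Vec features for one prediction instant:

```python
import numpy as np

def speed2vec(speed, t, P):
    """Build Speed2Vec node features at time index t.

    speed : (T, N) array of traffic speeds, T time steps, N sensor nodes.
    Returns an (N, P) matrix whose i-th row is the last P observations
    of node i, i.e. [x_i^{t-P+1}, ..., x_i^t].
    """
    assert t >= P - 1, "need at least P historical steps"
    window = speed[t - P + 1 : t + 1, :]   # (P, N) slice of history
    return window.T                        # (N, P): one feature vector per node

# toy usage: 228 sensors, 5-minute samples, P = 12 (one hour of history)
speed = np.random.rand(12672, 228) * 70.0
H = speed2vec(speed, t=100, P=12)          # H has shape (228, 12)
```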

4.2. Spatial Feature Extraction Layer

This model uses the GAT network to extract the spatial correlation of the traffic flow speed data. GAT is essentially an optimization of the GCN network. Its core idea is to assign different weights to different neighbor nodes by introducing an attention mechanism and then update the features of each node through weighted calculation. Figure 5 takes node i as an example to show the complete calculation process of updating node features using GAT.

The process by which GAT extracts spatial features can be decomposed into the following two steps: solving the attention coefficients and updating the node information. The method of solving the attention coefficients is defined by equations (3) and (4):

\[
e_{ij} = \mathrm{LeakyReLU}\!\left(a^{\top}\left[W h_i \,\Vert\, W h_j\right]\right), \tag{3}
\]

\[
\alpha_{ij} = \mathrm{softmax}_j\!\left(e_{ij}\right) = \frac{\exp\!\left(e_{ij}\right)}{\sum_{k \in \mathcal{N}_i} \exp\!\left(e_{ik}\right)}. \tag{4}
\]

Equation (3) is the calculation of the attention coefficient: h_i and h_j are the feature vectors of sensor node i and its neighbor node j; W is a learnable shared linear transformation; a(·) implements the self-attention mechanism; and ‖ denotes concatenation. Equation (3) as a whole uses the shared linear matrix W to enhance the features h_i and h_j, splices the enhanced features together, uses the self-attention mechanism to map the spliced result to a real number, and finally applies the activation function LeakyReLU to obtain the attention coefficient e_{ij} of node i and neighbor j.

Equation (4) is the normalization process, where N_i represents the set of all neighbor nodes of node i. In practice, equation (4) performs a softmax operation on the attention coefficients obtained in equation (3) to yield a normalized result α_{ij} that is convenient for comparison. The normalized attention coefficients are used for the node information update, which is defined by equation (5):

\[
h_i' = \sigma\!\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}\, W h_j\right). \tag{5}
\]

Equation (5) uses the obtained attention coefficients to weight the neighbor nodes, passes the summed result through the activation function σ, and obtains the new feature h_i' after node i fuses its neighborhood information. To enhance the stability of the self-attention learning process, a multihead attention mechanism is introduced into this step. The calculation of the multihead attention mechanism is given by equations (6) and (7):

\[
h_i' = \big\Vert_{k=1}^{K} \sigma\!\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} h_j\right), \tag{6}
\]

\[
h_i' = \sigma\!\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} h_j\right). \tag{7}
\]

Equation (6) is the calculation used for the intermediate layers when the multihead attention mechanism is applied. K is the number of attention mechanisms, that is, the "number of heads"; K independent attention mechanisms each execute equation (5) to update the node features, and the results are concatenated to produce the output feature representation. Equation (7) is the calculation used for the final layer under multihead attention: the outputs of the K attention mechanisms are averaged to obtain the final result.

The above is the whole process of using GAT for spatial correlation extraction. The original sequence passes through the GAT network and is output as a representation carrying spatial features, which serves as the input of the bidirectional temporal feature extraction layer.
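For illustration, a minimal, self-contained PyTorch sketch of one attention head implementing equations (3)–(5) is given below; it is written directly from the equations rather than taken from the authors' code, and details such as the self-loop handling and the ELU activation are common conventions assumed here. Several such heads can then be concatenated as in equation (6) or averaged as in equation (7).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATHead(nn.Module):
    """One attention head of a GAT layer (equations (3)-(5))."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared linear transform W
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # self-attention vector a
        self.leaky_relu = nn.LeakyReLU(0.2)

    def forward(self, h, adj):
        # h: (N, in_dim) node features, adj: (N, N) adjacency (1 = neighbor)
        Wh = self.W(h)                                     # (N, out_dim)
        N = Wh.size(0)
        # pairwise concatenation [Wh_i || Wh_j] for every node pair (i, j)
        Wh_i = Wh.unsqueeze(1).expand(N, N, -1)
        Wh_j = Wh.unsqueeze(0).expand(N, N, -1)
        e = self.leaky_relu(self.a(torch.cat([Wh_i, Wh_j], dim=-1))).squeeze(-1)  # eq. (3)
        e = e.masked_fill(adj == 0, float('-inf'))          # keep only neighbors
        alpha = F.softmax(e, dim=1)                         # eq. (4)
        return F.elu(alpha @ Wh)                            # eq. (5), weighted aggregation

# usage: 228 nodes with P = 12 historical steps as input features
h = torch.rand(228, 12)
adj = (torch.rand(228, 228) > 0.9).float()
adj.fill_diagonal_(1.0)          # include self-loops so every row has a neighbor (assumption)
out = GATHead(12, 16)(h, adj)    # (228, 16) spatially enriched node features
```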

4.3. Bidirectional Temporal Feature Extraction Layer

Following the spatial feature extraction layer, the bidirectional temporal feature extraction layer learns the time dimension of the feature vectors output in the previous step and thereby fully extracts the temporal features of the traffic flow speed data. This paper uses BiGRU to model the temporal correlation. A BiGRU is composed of two unidirectional GRUs with opposite directions; the two GRUs receive the input simultaneously, jointly determine the output, and explore the internal variation law of traffic flow speed from both the forward and backward directions.

The implementation of BiGRU relies on the calculation of a single GRU, whose structure is shown in Figure 6. Let x_t be the input at time t and h_{t-1} the previous hidden state; the GRU calculation process is usually described by the following equations:

\[
\begin{aligned}
z_t &= \sigma\!\left(W_z \left[h_{t-1},\, x_t\right]\right),\\
r_t &= \sigma\!\left(W_r \left[h_{t-1},\, x_t\right]\right),\\
\tilde{h}_t &= \tanh\!\left(W_h \left[r_t \odot h_{t-1},\, x_t\right]\right),\\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t,
\end{aligned} \tag{8}
\]

where W_z, W_r, and W_h represent the weight matrices, σ represents the activation function, and ⊙ represents the Hadamard product. z_t and r_t are the update gate and reset gate of the GRU, respectively; through the output state at the previous moment and the input at the current moment, they determine which information becomes the final output of the GRU. h̃_t is a candidate hidden state that stores the content of the pending output and is affected by the reset gate; h_t is the final hidden-layer information produced at time t.

Figure 7 shows the process of the input data passing through the BiGRU. The BiGRU output is composed of the results of two GRUs: the output of the GAT network layer carrying spatial features at time t is fed simultaneously into the two GRUs of opposite directions. The output of the forward GRU at this moment is denoted by h_t^f, and the output of the backward GRU by h_t^b. The output of the BiGRU at time t is given by equation (9):

\[
y_t = w_f\, h_t^{f} + w_b\, h_t^{b} + b, \tag{9}
\]

where w_f is the weight of the forward-propagation GRU hidden state, w_b is the weight of the backward-propagation GRU hidden state, and b is the bias term.

4.4. Attention Fusion Layer

Before realizing the final prediction, this paper uses the attention mechanism to construct an attention fusion layer to process the output of the BiGRU network layer. Figure 8 is a structural diagram of the attention mechanism. The data with spatiotemporal features obtained after passing through the BiGRU layer are used as the input of the attention fusion layer. The attention mechanism pays enough attention to the critical spatiotemporal features through probability allocation, thereby amplifying the influence of important information on prediction and improving the prediction accuracy of the model.

The calculation of the attention fusion layer includes computing the attention coefficients and performing a weighted summation, represented by equations (10)–(12):

\[
s_t = u^{\top} \tanh\!\left(w\, y_t + b_w\right), \tag{10}
\]

\[
\alpha_t = \frac{\exp\!\left(s_t\right)}{\sum_{k=1}^{P} \exp\!\left(s_k\right)}, \tag{11}
\]

\[
z = \sum_{t=1}^{P} \alpha_t\, y_t. \tag{12}
\]

In the equations above, u and w are learnable weight parameters and b_w is the corresponding bias term. s_t is the attention coefficient corresponding to the vector y_t output by the BiGRU at time t, α_t is the normalized value of s_t, and z is the final output of the attention layer obtained by weighted summation.

4.5. Output Layer

The output layer realizes the prediction of the future Q time steps through a fully connected layer. The input of the fully connected layer is the output z of the attention layer, and the output is the predicted value containing Q output steps. The calculation is expressed as

\[
\hat{Y} = \mathrm{ReLU}\!\left(W_o\, z + b_o\right), \tag{13}
\]

where W_o is the weight matrix that maps z to Q steps, b_o is the bias term, and ReLU is the activation function. The predicted value Ŷ represents the traffic flow speed information of all nodes over the Q future time steps.
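The temporal branch of the model (Sections 4.3–4.5) can be sketched in PyTorch as follows. This is a simplified illustration under our own assumptions: the spatially enriched features are treated as a per-node sequence of P steps, the forward and backward GRU states are concatenated by nn.GRU rather than combined exactly as in equation (9), and all module and dimension names are placeholders rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TemporalBranch(nn.Module):
    """BiGRU (eqs. (8)-(9)) + attention fusion (eqs. (10)-(12)) + FC output (eq. (13))."""
    def __init__(self, in_dim, hidden_dim, out_steps):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.att_w = nn.Linear(2 * hidden_dim, 2 * hidden_dim)   # w and b_w in eq. (10)
        self.att_u = nn.Linear(2 * hidden_dim, 1, bias=False)    # context vector u
        self.fc = nn.Linear(2 * hidden_dim, out_steps)           # W_o and b_o in eq. (13)

    def forward(self, x):
        # x: (batch, P, in_dim) spatially enriched features over P time steps
        y, _ = self.bigru(x)                        # (batch, P, 2*hidden_dim), forward || backward
        s = self.att_u(torch.tanh(self.att_w(y)))   # attention scores s_t, eq. (10)
        alpha = torch.softmax(s, dim=1)             # normalized weights alpha_t, eq. (11)
        z = (alpha * y).sum(dim=1)                  # weighted sum over time, eq. (12)
        return torch.relu(self.fc(z))               # Q-step prediction, eq. (13)

# usage: 228 nodes treated as a batch, P = 12 time steps, predicting Q = 9 steps (45 minutes)
x = torch.rand(228, 12, 16)
pred = TemporalBranch(16, 32, 9)(x)                 # (228, 9)
```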

The above describes the process of using STGNN-FAM to predict traffic flow speed. The overall algorithm flow is shown in Table 1.

5. Experimental Verification

To test the performance of the proposed model, this section selects the PeMSD7 dataset for experimental verification. PeMSD7 comes from the California Department of Transportation's Performance Measurement System and records traffic information for all weekdays in California's District 7 from May 1 to June 30, 2012. The dataset used in this paper covers 228 sensor stations; each sensor recorded 12,672 speed observations, sampled at a 5-minute interval.

5.1. Data Preprocessing

The PeMSD7 dataset consists of feature data and sensor distance data. The feature data is the traffic flow speed data that changes with time, and the size is 12672 rows × 228 columns. To preprocess the feature data, firstly, the extracted traffic flow speed data should be cleaned to eliminate abnormal data; secondly, the missing data should be recovered by linear interpolation; finally, the data should be normalized by the Z-Score method to generate the available data. The process is shown in Figure 9.
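A minimal pandas/NumPy sketch of this pipeline is shown below; the outlier rule and the use of global statistics for the Z-score are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

def preprocess(speed_df):
    """Clean, interpolate, and Z-score normalize a (12672 x 228) speed table."""
    df = speed_df.copy()
    # 1) eliminate abnormal data: treat non-physical readings as missing (rule assumed here)
    df = df.mask((df <= 0) | (df > 120))
    # 2) recover missing data by linear interpolation along the time axis
    df = df.interpolate(method='linear', axis=0).ffill().bfill()
    # 3) Z-score normalization (statistics computed over the whole table for brevity)
    mean, std = df.values.mean(), df.values.std()
    return (df - mean) / std, mean, std   # keep mean/std to invert predictions later

speed = pd.DataFrame(np.random.rand(12672, 228) * 70.0)   # toy stand-in for the feature data
normalized, mean, std = preprocess(speed)
```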

The sensor distance data contain the Euclidean distances between sensors, with a size of 228 rows × 228 columns. Preprocessing the sensor distance data produces a graph adjacency matrix W that describes the spatial relationships. W is constructed with a thresholded Gaussian kernel, as shown in equation (14):

\[
W_{ij} = \begin{cases}
\exp\!\left(-\dfrac{d_{ij}^{2}}{\sigma^{2}}\right), & i \neq j \ \text{and} \ \exp\!\left(-\dfrac{d_{ij}^{2}}{\sigma^{2}}\right) \geq \epsilon,\\[2mm]
0, & \text{otherwise},
\end{cases} \tag{14}
\]

where W_{ij} represents the spatial relationship between sensor nodes i and j, d_{ij} represents the distance between i and j, and σ² and ε are two adjustable thresholds that control the sparsity of the adjacency matrix. Figure 10 shows the heat map drawn from the adjacency matrix W, which gives an intuitive view of the spatial relationships between the nodes.
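Under this thresholded-Gaussian-kernel construction, the adjacency matrix can be built as in the sketch below; the parameter values are placeholders rather than the settings used in the experiments.

```python
import numpy as np

def build_adjacency(dist, sigma2=10.0, epsilon=0.5):
    """Thresholded Gaussian kernel adjacency (equation (14)).

    dist : (N, N) pairwise sensor distances.
    sigma2, epsilon : kernel width and sparsity threshold (placeholder values;
    the paper treats them as adjustable hyperparameters).
    """
    w = np.exp(-dist ** 2 / sigma2)
    w[w < epsilon] = 0.0          # drop weak connections to keep W sparse
    np.fill_diagonal(w, 0.0)      # no self-distance term (i != j in eq. (14))
    return w

dist = np.random.rand(228, 228) * 10.0
dist = (dist + dist.T) / 2.0      # symmetrize a toy distance matrix
W = build_adjacency(dist)
```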

After the data preprocessing is completed, the processed dataset is divided according to the ratio of 6:2:2: 60% is used as the training set, 20% as the validation set, and 20% as the test set, which is then sent to STGNN-FAM for prediction.

5.2. Experimental Setup
5.2.1. Experimental Environment Settings

The model is built on the PyTorch 1.7.0 deep learning framework, and the testing platform runs Ubuntu 18.04. The specific configuration is as follows: an Intel(R) Core(TM) i7-7800X CPU @ 3.40 GHz, 32 GB of memory, a GeForce RTX 2080 Ti graphics card, and CUDA 11.0.

5.2.2. Model Parameter Setting

This experiment uses the historical traffic flow speed data of the past hour to predict the traffic flow speed over the next 15, 30, and 45 minutes. That is, a single model input covers 12 time steps, and the model output covers 3, 6, or 9 time steps, respectively. After many experimental comparisons and parameter adjustments, the final experimental parameters are shown in Table 2.
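The sliding-window construction of samples implied by this setup (12 input steps, Q output steps) can be sketched as follows; the function name is our own.

```python
import numpy as np

def make_samples(speed, P=12, Q=9):
    """Slice a (T, N) speed matrix into (input, target) windows.

    Each sample uses P = 12 historical steps (1 hour at 5-minute resolution)
    to predict the next Q steps (Q = 3, 6, or 9 for 15/30/45 minutes).
    """
    X, Y = [], []
    for t in range(P, speed.shape[0] - Q + 1):
        X.append(speed[t - P:t, :])     # (P, N) history
        Y.append(speed[t:t + Q, :])     # (Q, N) future targets
    return np.stack(X), np.stack(Y)

speed = np.random.rand(12672, 228)
X, Y = make_samples(speed, P=12, Q=9)    # X: (num_samples, 12, 228), Y: (num_samples, 9, 228)
```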

5.2.3. Baseline Model Settings

This paper sets up eight baseline models [28] for comparison with the proposed model: (1) the Historical Average model (HA), which predicts the speed at future moments by simple averaging; (2) the Autoregressive Integrated Moving Average model (ARIMA), which fits historical speed data to a parametric model to predict future speed; (3) the Linear Support Vector Regression model (LSVR), which learns the relationship between input and output from historical speed data and then predicts the future; (4) the Feedforward Neural Network (FNN); (5) the Fully Connected Long Short-Term Memory network (FC-LSTM); (6) the Diffusion Convolutional Recurrent Neural Network (DCRNN), which uses diffusion graph convolution (DGC) and GRU; (7) the Spatiotemporal Graph Convolutional model (STGCN), which uses a spectral-domain GCN with gated convolutional layers for prediction; and (8) the Spatiotemporal Graph Attention model (ST-GAT), which combines GAT with LSTM to make predictions.

5.2.4. Evaluation Index Setting

To evaluate the prediction performance of the different models, this paper selects three commonly used evaluation indicators that reflect the gap between the actual and predicted values: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE). The range of all three is [0, +∞); the higher the value, the larger the prediction error. Their equations are as follows:

\[
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^{2}}, \qquad
\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\%,
\]

where y_i is the actual value, ŷ_i is the predicted value, and n is the number of samples.
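For reference, the three indicators can be computed directly, for example:

```python
import numpy as np

def evaluate(y_true, y_pred, eps=1e-5):
    """MAE, RMSE, and MAPE between ground-truth and predicted speeds."""
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    mape = np.mean(np.abs(err / (y_true + eps))) * 100.0   # eps guards against zero speeds
    return mae, rmse, mape

mae, rmse, mape = evaluate(np.array([60.0, 55.0, 50.0]), np.array([58.0, 56.0, 49.0]))
```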

5.3. Experimental Results and Analysis
5.3.1. Model Prediction Accuracy Experiment and Result Analysis

Table 3 and Figure 11 show the prediction results of STGNN-FAM. Table 3 compares the experimental results of the different models on the PeMSD7 dataset; the bold values in Table 3 represent the best performance. To show the prediction accuracy of STGNN-FAM more intuitively, 3 of the 228 sensor nodes are selected at random, and the prediction results of STGNN-FAM on these three nodes are visualized in Figure 11.

It can be seen from Table 3 that STGNN-FAM delivers more advanced prediction performance on the PeMSD7 dataset; in particular, it outperforms all baseline models on the medium- and long-term predictions of 30 and 45 minutes. In Figure 11, the solid line represents the actual traffic flow speed and the dotted line the value predicted by STGNN-FAM. The figure shows that STGNN-FAM can track fluctuations in the actual speed in time, and the two lines generally fit well. Based on the data in Table 3 and the observations in Figure 11, the experimental results can be analyzed from the following four aspects:

(1) Comparison of STGNN-FAM and traditional models: Among the eight baselines, the HA, ARIMA, and LSVR models are traditional traffic flow forecasting methods. Taking MAPE as the evaluation standard, their prediction errors at 45 minutes are 3.51%, 8.1%, and 4.4% higher than that of STGNN-FAM, respectively. The HA and ARIMA models, which are statistical methods, lag behind in overall prediction performance because their modeling assumptions are too idealized; as Figure 11 shows, the actual data are nonlinear and unstable, and such simple methods are not sufficient to deal with complex road conditions. The LSVR model, a machine learning method, has a slight advantage in short-term prediction but does not perform well in the medium and long term; moreover, because machine learning relies on feature engineering, it requires substantial time and computing power and is unsuitable for large-scale prediction tasks.

(2) Comparison of STGNN-FAM and single deep learning models: The FNN and FC-LSTM models are simple deep learning models. Taking MAPE as the evaluation standard, their prediction errors at 45 minutes are 5.28% and 3.0% higher than that of STGNN-FAM, respectively. These two models have advantages in processing time series tasks, but their actual effect is still not ideal because they do not sufficiently consider the spatial features in the traffic flow speed data; such methods are only suitable for a simple road network structure with uniformly distributed sensors.

(3) Comparison of STGNN-FAM and hybrid deep learning models: The DCRNN, STGCN, and ST-GAT models are all hybrid methods that combine a graph neural network with other neural networks. Taking MAPE as the evaluation standard, their prediction errors at 45 minutes are 2.89%, 1.59%, and 0.76% higher than that of STGNN-FAM, respectively. The experimental data show that the ST-GAT model, built on the graph attention network, predicts better than the DCRNN and STGCN models, which are built on graph convolutional networks. STGNN-FAM is also constructed on the graph attention network; compared with ST-GAT, the two models differ little at 15 minutes, but at 30 and 45 minutes the advantages of STGNN-FAM are fully revealed, which shows that using the BiGRU model and introducing the attention mechanism is effective for mining medium- and long-term temporal dependence.

(4) Analysis of STGNN-FAM's prediction results: Figure 11 presents the prediction results of sensor nodes 1, 101, and 201 at 15, 30, and 45 minutes. The two lines fit best in the 15-minute forecast, followed by the 30-minute forecast and then the 45-minute forecast, which shows that the model is more accurate in short-term forecasting. When the speed data change almost linearly (such as between the 1100th and 1200th minutes of node 1), the two lines almost completely coincide and the prediction effect is the best; when the speed data change sharply within a short time (such as between the 1000th and 1100th minutes of node 101), the fit between the two lines is lower and the prediction error is larger.

5.3.2. Model Robustness Experiment and Result Analysis

Traffic data collected in a natural environment often contain abnormal disturbances, such as noise and missing values, which have an unknown impact on prediction accuracy. Therefore, to explore how the performance of STGNN-FAM changes under different disturbances, this section sets up model robustness experiments under two types of disturbance.

(1) Robustness Experiment under Noise Interference. Natural aging of sensors and poor contact introduce noise into the collected traffic data. Because of objective conditions such as the operating environment and sensor lifetime, the noise interference caused by natural aging cannot be avoided. The experiments in this section explore the robustness of the model under noise interference. Specifically, a set of Gaussian-distributed noise is added to the PeMSD7 dataset to simulate real-world noise interference, with the standard deviation set to 0.2, 0.4, 0.6, 0.8, and 1.0. Figure 12 shows, as a histogram, the changes in the three evaluation indicators on the 45-minute prediction task after the Gaussian noise is introduced. Figure 13 takes node 1 as an example to visualize the prediction results under noise disturbance; the results are analyzed below.
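The noise-injection step itself can be sketched as follows (a minimal illustration; zero-mean noise is assumed, since only the standard deviations are specified):

```python
import numpy as np

def add_gaussian_noise(speed, std, seed=0):
    """Return a copy of the speed matrix perturbed by N(0, std^2) noise."""
    rng = np.random.default_rng(seed)
    return speed + rng.normal(loc=0.0, scale=std, size=speed.shape)

speed = np.random.rand(12672, 228) * 70.0
for std in (0.2, 0.4, 0.6, 0.8, 1.0):          # the five noise levels tested
    noisy = add_gaussian_noise(speed, std)
    # ... run STGNN-FAM on `noisy` and record MAE / RMSE / MAPE
```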

The data in the histogram of Figure 12 are divided into six groups, where "original" represents the prediction performance of STGNN-FAM when no Gaussian noise is introduced and the other groups correspond to its prediction performance when the standard deviation of the Gaussian distribution is 0.2, 0.4, 0.6, 0.8, and 1.0. Figure 13 contains five subplots; in each, the solid green line represents the actual traffic flow speed and the red dashed line the value predicted by STGNN-FAM under Gaussian noise interference. Analysis of Figures 12 and 13 shows that, regardless of the noise distribution parameters, STGNN-FAM is affected very little: the evaluation indicators in Figure 12 do not change significantly, and the fit between the actual and predicted values in Figure 13 shows no abnormality. This proves that STGNN-FAM has a high tolerance for noise; even when such interference is present, it does not have a "catastrophic" impact on the model's performance.

(2) Robustness Experiments under Random Missing-Value Interference. In addition to natural aging, sensors in real environments also face temporary power failures, sudden damage, forced shutdowns, and other faults caused by external forces, so part of the collected traffic data is lost, producing "missing values." Since sensor failure is random, the occurrence of missing values is also random in space and time [32]. As shown in Figure 14, missing values may randomly appear at any moment in any sensor.

To explore the impact of such random missing values on STGNN-FAM, this section sets up a performance experiment on missing data. Specifically, a missing mask matrix M is defined to represent the lost state of the data, and its sparsity is adjusted by setting different missing rates; the original observation sequence is multiplied element-wise with the mask matrix to introduce random missing values into the original data. The element of the missing mask matrix is defined in equation (16):

\[
m_i^{t} = \begin{cases}
0, & \text{if } x_i^{t} \text{ is randomly dropped},\\
1, & \text{otherwise},
\end{cases} \tag{16}
\]

where i denotes a sensor node, t a certain time, and x_i^t the traffic flow speed value observed by sensor node i at time t. Figure 15 shows how random missing values are introduced through the random mask matrix.
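The masking operation can be sketched as follows (representing a missing observation by zero is an assumption made here for illustration):

```python
import numpy as np

def apply_random_missing(speed, missing_rate, seed=0):
    """Element-wise multiply the observations with a random 0/1 mask (eq. (16))."""
    rng = np.random.default_rng(seed)
    mask = (rng.random(speed.shape) >= missing_rate).astype(speed.dtype)
    return speed * mask, mask     # masked entries become 0, i.e. "missing"

speed = np.random.rand(12672, 228) * 70.0
for rate in (0.01, 0.02, 0.03, 0.04, 0.05):    # the five missing rates tested
    corrupted, mask = apply_random_missing(speed, rate)
    # ... feed `corrupted` to STGNN-FAM and compare against the clean predictions
```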

Since real-world sensors are regularly serviced, the probability of failure is not high and random missing values are few. Therefore, in this work, the missing rate is confined to a small range and set to 1%, 2%, 3%, 4%, and 5%, corresponding to 28,892, 57,784, 86,676, 115,568, and 144,461 missing values, respectively. The histogram in Figure 17 shows the performance change of STGNN-FAM on the 45-minute prediction task under slight random missing interference, where "original" represents the prediction performance of STGNN-FAM when no missing values are introduced and the other five groups represent its prediction performance when the missing rate is 1% to 5% (the missing rate is expressed in decimal form in the figure). Figure 16 takes node 1 as an example to visualize the prediction results under the disturbance of random missing values: the solid orange line represents the actual traffic flow speed and the red dotted line the value predicted by STGNN-FAM under the different missing rates. As seen in Figure 17, the model's prediction error begins to change as the missing rate increases, but the change is not significant; Figure 16 intuitively shows that, under the interference of random missing values, the STGNN-FAM predictions still follow the same general trend as the actual values. This indicates that although small-scale missing data can affect the prediction performance of STGNN-FAM, it does not deal a severe blow to the robustness of the model.

Figures 18 and 19 show the performance indicators of STGNN-FAM on the 15-, 30-, and 45-minute prediction tasks under the two kinds of interference, respectively, where "0" represents the prediction performance when no interference is introduced. Comparing the different-colored lines within the subgraphs of Figures 18 and 19 shows that, for either kind of interference, the variation range of the performance indicators is largest when the prediction horizon is 15 minutes, which indicates that STGNN-FAM is more vulnerable to abnormal interference in short-term prediction tasks. Comparing the same-colored lines across the two figures shows that random missing-value interference has a greater impact on prediction accuracy than noise interference, which indicates that prediction in scenarios with missing data will be an important challenge for future work.

5.4. Model Interpretability Analysis

To better understand the STGNN-FAM model, this section analyzes the spatial correlation between a central sensor node and its different neighbor nodes at a given time, or over several times, with the aid of the actual traffic network structure, as shown in Figures 20–22.

The three sensor nodes in Figure 20 are located on the same road section, and the arrow marks the direction in which traffic flow propagates along this section. The topology diagram constructed from the relationships between the nodes is shown on the right side of Figure 20, where the depth of color represents the weight distribution computed by STGNN-FAM for the attention coefficients between node a and its neighbor nodes at a given time: the darker the color, the greater the assigned weight. The topology diagram shows that although b and c are both neighbors of a, the weight assigned to b is larger, indicating that b has a stronger spatial correlation with a. According to the traffic flow direction of the road network on the left, sensor node b is located on the downstream section of a and sensor node c on the upstream section, and the traffic flow information recorded by a node is indeed more affected by its downstream node. This shows that, when allocating weights, STGNN-FAM can accurately identify the neighbor node that plays a more significant role for the central node and amplify that neighbor's influence.

The road network structure diagram in Figure 21 contains three sensor nodes, d, e, and f. Node d is regarded as the central node. The topological structure diagram constructed according to the relationship between nodes and the weight distribution at a certain time under the action of STGNN-FAM is shown on the right side of Figure 21. It can be seen from the topology diagram that the influence of neighbor node e on d is more significant than that of f on d. Locating the positions of the three nodes on the actual road shows that this is because sensor e and sensor d are on the same road, and e is on the downstream section of d. Although the geographical positions of sensors d and f are similar, they are actually on two different sections. Although they are neighbor nodes, their spatial correlation is not significant.

Figures 20 and 21 explore the spatial relationship between the central node and its neighbor nodes at a fixed time. The traffic network, however, is time-varying: the traffic status recorded by each node changes over time, thereby changing the spatial correlation between nodes. Figure 22 shows a road network structure diagram containing four nodes, one of which is taken as the central node for analysis, and three groups of road network topology diagrams at different times are drawn according to the node distribution. In these three groups of topology diagrams, different color schemes represent the weight distributions at different times. In the example shown in this figure, although the weight distribution differs across times, the order of importance of the three neighbor nodes to the central node stays the same. Locating the nodes on the actual road network shows that sensor node i lies on the downstream section of the central node, so its spatial correlation with the central node is the largest. Although sensor node h is the closest to the central node in space, their spatial relationship is slightly weaker because they are on two opposite-direction roads. Sensor node j and the central node are on entirely different roads, so their spatial correlation is the smallest.

The previous analysis proves that the STGNN-FAM model proposed in this paper can automatically assign different weights to neighbor nodes according to the actual road network situation, solving the problem of insufficient attention to important nodes. It can also pay attention to the real-time changes in the traffic network and dynamically capture the traffic network’s spatial dependence to achieve better prediction results.

6. Conclusion and Prospect

The experimental results show that the STGNN-FAM model proposed in this paper performs well in capturing spatiotemporal features: it not only effectively improves the accuracy of traffic flow prediction but also exhibits a degree of stability. With MAE, RMSE, and MAPE as evaluation indicators, the prediction accuracy experiments on the real dataset show that STGNN-FAM achieves high accuracy in the 15-minute short-term traffic flow speed prediction task, and its results are better than those of all baseline models in the 30-minute and 45-minute medium- and long-term prediction tasks. The model robustness experiments show that STGNN-FAM has excellent anti-interference ability and that its prediction performance is not significantly affected even when noise and random missing values are introduced. Although the accuracy of this model has been improved, there is still room for further improvement. Future work will proceed along the following three lines: (1) Continue to focus on the prediction performance of the model under the influence of missing values, and focus on applying the model to scenarios containing a wide range of missing values so that it can also complete the task of filling in missing data. (2) Consider traffic flow speed under atypical traffic conditions. An atypical traffic state refers to the traffic state under the influence of adverse weather, holidays, and other accidental factors, where the observed traffic flow speed data often show more complex spatiotemporal features than usual; achieving accurate traffic flow speed prediction under such external factors will therefore become a research focus. (3) Explore the periodicity of traffic flow speed data. Most current work uses only the historical time series adjacent to the prediction period. Future work will try to introduce the historical time series one week before and one day before the prediction period to mine the weekly and daily correlations in the temporal features and thus capture longer-term temporal correlation.

Data Availability

The dataset used to support the findings of this study is available at https://pems.dot.ca.gov/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is partly supported by the National Natural Science Foundation of China through grant (72071163), the Natural Science Foundation of Sichuan through grant (2022NSFSC0474), the Inner Mongolia Natural Science Foundation: Identification Method of Crop Fine-Grained Diseases in natural Environment (2021MS06007), the Key Science and Technology Project of Inner Mongolia Autonomous Region: Artificial Intelligence Application Technology and Product R&D—Application Research and Demonstration in Modern Pasture (2019ZDZX001), the Inner Mongolia University of Science and Technology Innovation Fund: Research on the recognition of Multiple Crop Diseases based on ResNet (2019QDL-S09), and the Inner Mongolia University of Science and Technology Innovation Fund: Construction of chronic disease knowledge Map based on Natural language processing (2019QDL-S10).