Abstract

Timely and accurate network traffic prediction is a necessary means to realize intelligent network management and control. However, this task remains challenging because of the complex temporal and spatial dependencies of network traffic. In the spatial dimension, links connect different nodes, and the traffic flowing through different nodes is correlated. In the temporal dimension, not only is the network traffic at adjacent time points correlated, but the importance of distant time points is not necessarily less than that of the nearest time point. In this paper, we propose a novel intelligent network traffic prediction method based on joint attention and GCN-GRU (AGG). The AGG model uses a GCN to capture the spatial features of traffic, a GRU to capture the temporal features of traffic, and an attention mechanism to capture the importance of different temporal features, so as to comprehensively consider the spatial-temporal correlation of network traffic. The experimental results on an actual dataset show that, compared with other baseline models, the AGG model achieves the best performance on the experimental indicators root mean square error (RMSE), mean absolute error (MAE), accuracy (ACC), determination coefficient (R²), and explained variance score (EVS), and it has long-term prediction ability.

1. Introduction

The Cisco Annual Internet Report (2018–2023) notes that, by 2023, device functionality will be combined with higher bandwidth and more intelligent networks, and the number of devices linked to IP networks will be more than three times the global population [1]. With the increasing number of terminals, the enrichment of multimedia applications, and the continuous expansion of network capabilities, network traffic management has become a critical and challenging task. Real-time and accurate network traffic prediction can greatly improve the control gain of the network.

Existing network traffic prediction methods are divided into model-driven and data-driven methods. Model-driven traffic prediction methods, also called parametric methods, include the autoregressive moving average model (ARMA) and the autoregressive integrated moving average model (ARIMA). Laner et al. introduced the ARMA model, which can predict network traffic [2]. Guo et al. introduced the ARIMA model and tested the algorithm with data collected from a backbone switching node; the experimental results show that, compared with other network traffic prediction methods, the model handles nonstationary series better and achieves higher prediction accuracy [3]. The ARIMA model and its variants are therefore widely used and can explore the temporal correlation of network traffic well [4–6]. Model-driven traffic prediction methods mostly use a polynomial fitting function to approximate the actual network traffic and then improve the fit through extensive parameter tuning. However, it is difficult for such methods to capture the nonlinear characteristics of network traffic, such as fast fluctuation and time dependence.

Data-driven traffic prediction methods can automatically learn statistical rules from a large quantity of historical data and thus intelligently capture the nonlinear characteristics of network traffic. Specifically, data-driven traffic prediction methods can be divided into machine learning methods and deep learning methods. Machine learning prediction methods include support vector regression (SVR) and the k-nearest neighbor algorithm (k-NN). Bermolen et al. applied SVR to link load prediction [7]. Kremer et al. chose two different machine learning algorithms, SVR and k-NN, to explore the balance between complexity and estimation accuracy [8]. However, machine learning methods are insufficient for processing high-dimensional data and rely on feature engineering, so their universality is weak.

Compared with machine learning prediction methods, deep learning prediction methods can not only retain the learned characteristics but also ensure the relevance between tasks and effectively address time series problems. Wu et al. proposed a network traffic prediction method based on a deep neural network (DNN), which demonstrates the superiority of deep learning prediction methods in traffic prediction [9]. Lazaris et al. used actual network traffic traces from ISPs to train long short-term memory (LSTM) neural networks and generate predictions in a short time; experiments show that LSTM can predict network traffic with low error [10]. Azzouni et al. proposed an LSTM RNN framework for predicting a large-scale network traffic matrix and proved the fast convergence of the LSTM model on actual data from GEANT [11]. Although these deep learning prediction models have achieved good results, they all predict the time series of network traffic in a single area and ignore the spatial structure of the network, that is, the spatial correlation of network traffic. To extract the spatial characteristics of network traffic, researchers introduced convolutional neural networks (CNNs) into the network traffic prediction task. Zhang et al. used a convolutional neural network to capture the temporal and spatial dependence of traffic by converting traffic data into images; the experimental results show that the prediction performance of this method in terms of root mean square error (RMSE) is significantly improved [12]. Li et al. proposed a CNN-LSTM fusion model for prediction, using a one-dimensional CNN to obtain the spatial characteristics of network traffic and LSTM to obtain the temporal correlation of network traffic. However, CNNs operate in Euclidean space; that is, a CNN can only deal with Euclidean data and cannot effectively handle non-Euclidean data such as communication network topology.

Therefore, researchers hope to effectively extract spatial features from non-Euclidean data structures such as topological graphs [13], so GCNs have become a new research focus. He et al. proposed a spatial-temporal network based on graph attention, called GSATN. This model integrates spatial-temporal characteristics, characterizes spatial correlation through geographical relationship graphs, characterizes temporal correlation through recurrent neural networks, and predicts network traffic by combining the spatiotemporal characteristics [14]. Yang et al. proposed a network traffic prediction model combining a graph convolutional neural network (GCN) and a gated recurrent unit (GRU); the model uses a GCN to learn the network topology and extract the spatial characteristics of traffic and uses a GRU to learn the temporal characteristics of network traffic, thus realizing intelligent prediction of network traffic [15]. Although these models have achieved excellent prediction accuracy, most of them tend to extract static spatial dependencies in traffic, and such spatial dependencies may evolve over time [16, 17]. Therefore, by introducing an attention mechanism into the GCN-GRU model, this paper proposes a novel intelligent network traffic prediction method based on joint attention and GCN-GRU. This model can not only capture spatial-temporal correlation information but also collect temporal global change information. The main contributions of this paper are as follows:
(1) A network traffic prediction method combining GCN, GRU, and an attention mechanism is proposed. The method uses the GCN to capture the spatial features of traffic, the GRU to capture the temporal features of traffic, and the attention mechanism to capture the importance of different temporal features, so as to comprehensively consider the spatial-temporal correlation of network traffic.
(2) The attention mechanism is introduced into the GRU, and the weight matrix calculation method in the GRU unit is redesigned. In this mechanism, a state vector is generated by combining the hidden states at different times, a scoring function is designed to calculate the weight of each hidden state, and an attention function is designed to calculate a context vector that describes the global traffic change information, so as to adjust the importance of different time points and collect global temporal information to improve prediction accuracy.
(3) Considering that the sliding window length and the number of hidden units have a significant impact on the timeliness and accuracy of network traffic prediction, a parameter selection experiment is performed to obtain the optimal sliding window length and the optimal number of hidden units, which effectively supports the comparative analysis of the proposed AGG model against other baseline models.
(4) The AGG model is trained many times on the Milan traffic network dataset. The results show that, compared with several existing baseline models, the AGG model achieves the best performance on the experimental indicators root mean square error (RMSE), mean absolute error (MAE), accuracy (ACC), determination coefficient (R²), and explained variance score (EVS), and it has long-term prediction ability.

The rest of this paper is organized as follows. In Section 2, we present the problem formulation of network traffic prediction and design a framework to solve the network traffic prediction problem. Based on the design of the spatial feature extraction model, temporal feature extraction model, and attention mechanism model, a complete intelligent network traffic prediction model is given in Section 3. In Section 4, we introduce the experimental environment and analyze the performance of the proposed traffic prediction model. We conclude this paper in Section 5.

2. The Proposed Prediction Framework

2.1. Problem Formulation

The goal of network traffic prediction is to predict future network traffic information from measured historical network traffic information. We can define this process as

$$[X_{t+1}, \ldots, X_{t+T}] = f(X_{t-n+1}, \ldots, X_{t-1}, X_t),$$

where $X_t$ is the observation vector of the observation points at sampling time $t$. The purpose of the traffic prediction model is to learn a mapping function $f(\cdot)$ based on the traffic data of the previous $n$ sampling times to predict the network traffic of the next $T$ sampling times.

Definition 1 (network topology). The network is composed of nodes and links and is generally represented by a directed graph $G = (V, E)$. $V$ represents the nodes in the network, $V = \{v_1, v_2, \ldots, v_N\}$, where $N$ is the number of nodes, and $E$ represents the links between nodes. The adjacency matrix $A \in \mathbb{R}^{N \times N}$ is used to represent the connection relationship of the nodes. The adjacency matrix only contains the elements 0 and 1: when an element is 0, there is no connection between the corresponding nodes, and when it is 1, there is a connection between them.

Definition 2 (network traffic prediction). In $G$, each link is denoted $e_i$, and the time series $X_i = \{x_{i,1}, x_{i,2}, \ldots, x_{i,t}\}$ represents the network traffic of $e_i$ in the time interval $[1, t]$. The principle of the prediction model proposed in this paper is to learn a mapping function $f$ based on the topological graph structure $G$ and the network traffic time series to obtain the spatial-temporal characteristics of the network traffic data and then predict the future network traffic information from the characteristic matrix. The network traffic prediction formula is as follows:

$$[X_{t+1}, \ldots, X_{t+T}] = f\big(G; (X_{t-n+1}, \ldots, X_{t-1}, X_t)\big).$$
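To make these definitions concrete, the following sketch shows how the topology and traffic series could be laid out in NumPy; the node count, window length, and placeholder values are assumptions for illustration only, not values taken from the paper.

```python
import numpy as np

# Illustrative data layout for Definitions 1 and 2 (sizes are assumptions).
N = 9            # number of nodes/regions V = {v_1, ..., v_N}
T = 144 * 7      # one week of 10-minute samples

A = np.zeros((N, N), dtype=np.float32)   # adjacency matrix: A[i, j] = 1 if nodes i and j are linked
A[0, 1] = A[1, 0] = 1.0                  # hypothetical link between node 0 and node 1

X = np.random.rand(T, N).astype(np.float32)  # placeholder traffic series, one column per node

# Prediction task: map the last n observations (plus the graph) to future traffic.
n = 8                                    # sliding window length used later in Section 4.3
history = X[-n:, :]                      # model input of shape (n, N)
```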

2.2. Traffic Prediction Framework

For the problem described in Section 2.1, the prediction architecture proposed in this paper is shown in Figure 1. First, the time series data of each region in the dataset at n sampling times and the adjacency matrix representing the relationships between regions are taken as the input. Then, the GCN model is used to extract the spatial features of the input data, and the time series with spatial features are used as the input of the GRU model to extract the temporal correlation features between the time series. Furthermore, the attention mechanism is introduced into the GRU, and the weight matrix calculation method in the original GRU unit is replaced by the attention weight mechanism, which reweights the influence of historical network traffic data to capture the global variation trend of network traffic. Finally, the prediction results with spatial-temporal correlation are obtained through the fully connected layer.

3. Prediction Models

3.1. Spatial Feature Extraction Model

Spatial feature extraction is one of the critical problems in network traffic prediction. A regional topological network is a graph structure, and its network traffic data belong to non-Euclidean data. Although traditional convolutional neural networks (CNNs) can obtain spatial features, they can only be used on Euclidean data and cannot effectively extract spatial features from graph data. In this paper, the graph convolutional network (GCN) model is used to process the non-Euclidean data represented by graph data, and the spatial features of each region are learned from the network structure.

The principle of the GCN is to construct a filter in the Fourier domain and then apply the constructed filter to the graph nodes and their first-order neighborhoods to obtain the spatial features between the nodes in the graph; the GCN model is then built by stacking multiple convolution layers. In this paper, we design a two-layer convolutional structure to process the graph, and the formula is as follows:

$$f(X, A) = \sigma\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W_0\right) W_1\right),$$

where $X$ represents the network traffic characteristic matrix, $A$ represents the adjacency matrix, $\mathrm{ReLU}$ and $\sigma$ represent the activation functions, $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ represents the preprocessing step, $\tilde{D}$ is the degree matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, $\tilde{A} = A + I_N$ represents the adjacency matrix with a self-connection structure, and $W_0$ and $W_1$ represent the weight matrices of the first and second convolution layers, respectively.
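A minimal NumPy sketch of this two-layer propagation rule is given below; the ReLU/sigmoid pairing, the weight shapes, and the function names are assumptions about one common reading of the formula, not the authors' implementation.

```python
import numpy as np

def normalized_adjacency(A):
    """Compute A_hat = D^{-1/2} (A + I) D^{-1/2}, the preprocessing step described above."""
    A_tilde = A + np.eye(A.shape[0])          # add self-connections
    d = A_tilde.sum(axis=1)                   # degree of each node
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def two_layer_gcn(X, A, W0, W1):
    """f(X, A) = sigma(A_hat * ReLU(A_hat X W0) * W1): one reading of the
    two-layer graph convolution in Section 3.1 (activations and shapes are assumed)."""
    A_hat = normalized_adjacency(A)
    H = relu(A_hat @ X @ W0)          # first convolution layer
    return sigmoid(A_hat @ H @ W1)    # second convolution layer

# Example shapes (assumed): 9 nodes, 1 input feature, 16 hidden and 8 output features.
A = np.eye(9); X = np.random.rand(9, 1)
W0 = np.random.randn(1, 16) * 0.1
W1 = np.random.randn(16, 8) * 0.1
out = two_layer_gcn(X, A, W0, W1)     # output of shape (9, 8)
```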

3.2. Temporal Feature Extraction Model

Temporal feature extraction is another critical problem in network traffic prediction. At present, the recurrent neural network (RNN) is the most widely used neural network model for processing sequence data. However, due to the vanishing and exploding gradient problems, the traditional recurrent neural network has limitations in long-term prediction. The LSTM and GRU models are variants of recurrent neural networks that can better overcome these defects. As variants of RNN, LSTM and GRU share the same basic principle: both use a gating mechanism to memorize as much long-term information as possible. In this paper, we use the GRU network unit. Compared with the LSTM unit, the GRU unit has fewer parameters and can reduce model optimization time while maintaining prediction accuracy.

The structure of the GRU unit is shown in Figure 2, in which $x_t$ represents the input data at time $t$; $h_{t-1}$ and $h_t$ indicate the hidden states at different times; $r_t$ is the reset gate, which controls the degree to which information from the previous time is retained or discarded; $u_t$ is the update gate, which controls the extent to which state information from the previous moment enters the current state; and $c_t$ is the information stored at time $t$. The principle of the GRU is to use the hidden state of the previous moment and the input of the current moment together to obtain the network state information of the next moment. The model not only captures the current network information but also retains the change trend of historical network information, giving it the ability to capture temporal dependence.
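For reference, one plain GRU step could look like the following sketch; the parameter names, shapes, and concatenation layout are assumptions, and the hidden-state update follows the convention used later in Section 3.4.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W_r, W_u, W_c, b_r, b_u, b_c):
    """One standard GRU step, for reference only (names and shapes are assumptions)."""
    z = np.concatenate([x_t, h_prev])
    r_t = sigmoid(W_r @ z + b_r)                                    # reset gate
    u_t = sigmoid(W_u @ z + b_u)                                    # update gate
    c_t = np.tanh(W_c @ np.concatenate([x_t, r_t * h_prev]) + b_c)  # candidate state
    h_t = u_t * h_prev + (1.0 - u_t) * c_t                          # new hidden state
    return h_t

# Tiny example (assumed sizes): 1 input feature, 4 hidden units.
rng = np.random.default_rng(0)
x_t, h_prev = rng.random(1), np.zeros(4)
W = lambda: rng.normal(scale=0.1, size=(4, 5))
h_t = gru_cell(x_t, h_prev, W(), W(), W(), np.zeros(4), np.zeros(4), np.zeros(4))
```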

3.3. Attention Mechanism Model

When capturing temporal features, we introduce an attention mechanism into the GRU in this section and redesign the weight matrix calculation method in the original GRU unit with the attention weight mechanism.

After replacing the original matrix calculation method in the GRU with the attention mechanism, $x_t$ and $h_{t-1}$ are used to obtain the reset gate $r_t$ and the update gate $u_t$ at time $t$. The formulas are as follows:

$$r_t = \sigma\left(W_a \cdot [h_{t-1}, x_t] + b_r\right), \qquad u_t = \sigma\left(W_a \cdot [h_{t-1}, x_t] + b_u\right),$$

where $W_a$ is the weight matrix of the attention mechanism, $x_t$ represents the input traffic at the current time, $h_{t-1}$ represents the hidden state passed down from the previous time, and $b_r$ and $b_u$ are deviation parameters.

After obtaining the reset gate $r_t$ and the update gate $u_t$, the reset data $h_{t-1} \odot r_t$ can be obtained first, and then the value range of the data can be controlled within $[-1, 1]$ through the $\tanh$ activation function; that is, the state $c_t$ memorizing the current moment can be obtained. The formula is as follows:

$$c_t = \tanh\left(W_a \cdot [h_{t-1} \odot r_t, x_t] + b_c\right),$$

where $b_c$ is a deviation parameter.

After obtaining the state memorized at the current time, the last step is the memory-updating stage, in which the update gate $u_t$ is used. The formula is as follows:

$$h_t = u_t \odot h_{t-1} + (1 - u_t) \odot c_t.$$

Through the multilayer GRU with attention mechanism, the temporal features of network traffic can be better captured. The internal structure of the redesigned GRU is shown in Figure 3.

3.4. Traffic Prediction Model

The network traffic prediction model, named the AGG model, introduces an attention mechanism into the GCN-GRU model and reweights the influence of historical network traffic data to capture the global variation trend of network traffic. The model structure is shown in Figure 4.

The AGG model calculation is shown in the following formulas:

$$u_t = \sigma\left(W_u\left[f(A, X_t), h_{t-1}\right] + b_u\right),$$
$$r_t = \sigma\left(W_r\left[f(A, X_t), h_{t-1}\right] + b_r\right),$$
$$c_t = \tanh\left(W_c\left[f(A, X_t), \left(r_t \odot h_{t-1}\right)\right] + b_c\right),$$
$$h_t = u_t \odot h_{t-1} + (1 - u_t) \odot c_t,$$

where $u_t$ is the update gate, which controls the extent to which the state information of the last time enters the current state; $\sigma$ is the activation function of the nonlinear model; $W_u$, $W_r$, and $W_c$ are the weight parameters; $f(A, X_t)$ is the graph convolution process; $A$ is the adjacency matrix; $X_t$ is the input of the model at the current time; $h_{t-1}$ and $h_t$ are the hidden states at time $t-1$ and $t$, respectively; $b_u$, $b_r$, and $b_c$ are deviation parameters; $r_t$ is the reset gate, which controls the level of information retention or abandonment at the previous time; and $c_t$ is the information stored at time $t$.
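Combining the two previous sketches, one step of such a spatio-temporal cell could look as follows, where the gate inputs are the graph-convolved features f(A, X_t); the single-layer graph convolution, parameter names, and shapes are assumptions rather than the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def agg_cell(X_t, A_hat, h_prev, W_g, W_u, W_r, W_c, b_u, b_r, b_c):
    """One spatio-temporal step in the spirit of the equations above: node features
    X_t are first graph-convolved, then fed to the GRU gates (names/shapes assumed)."""
    g_t = np.tanh(A_hat @ X_t @ W_g)                 # f(A, X_t): graph-convolved input
    z = np.concatenate([g_t, h_prev], axis=-1)
    u_t = sigmoid(z @ W_u + b_u)                     # update gate
    r_t = sigmoid(z @ W_r + b_r)                     # reset gate
    c_t = np.tanh(np.concatenate([g_t, r_t * h_prev], axis=-1) @ W_c + b_c)  # candidate
    h_t = u_t * h_prev + (1.0 - u_t) * c_t           # h_t = u_t*h_{t-1} + (1-u_t)*c_t
    return h_t

# Tiny example (assumed sizes): 9 nodes, 1 feature, 4 hidden units per node.
rng = np.random.default_rng(1)
A_hat = np.eye(9)
X_t, h_prev = rng.random((9, 1)), np.zeros((9, 4))
W_g = rng.normal(scale=0.1, size=(1, 4))
W2 = lambda: rng.normal(scale=0.1, size=(8, 4))
h_t = agg_cell(X_t, A_hat, h_prev, W_g, W2(), W2(), W2(),
               np.zeros(4), np.zeros(4), np.zeros(4))
```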

The AGG model is constructed by combining the GCN model with the GRU model. The principle is to input $n$ historical network traffic time series into the AGG model to obtain $n$ hidden states, yielding the vector containing spatial-temporal features: $H = \{h_{t-n+1}, \ldots, h_{t-1}, h_t\}$.

Then, the hidden states are input into the attention model, and a multilayer perceptron (MLP) is used to calculate the score $e_i$ and the weight $\alpha_i$ of each hidden state $h_i$. The information vector covering the global traffic change is calculated as the weighted sum. The formulas are as follows:

$$e_i = \mathrm{MLP}(h_i), \qquad \alpha_i = \frac{\exp(e_i)}{\sum_{k=1}^{n} \exp(e_k)}.$$

Then, an attention function is used to compute the vector describing the global traffic change information, and the formula is as follows:

$$C_t = \sum_{i=1}^{n} \alpha_i \, h_i.$$
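A compact sketch of this attention head over the n hidden states might look as follows; the one-hidden-layer MLP scorer and all names and shapes are assumptions.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attention_context(H, W1, b1, w2, b2):
    """Score each hidden state h_i with a small MLP, normalize the scores with
    softmax, and return the weighted sum (the global-change context vector).
    The one-hidden-layer scorer and all names/shapes are assumptions."""
    scores = np.array([w2 @ np.tanh(W1 @ h + b1) + b2 for h in H])  # e_i for each h_i
    alpha = softmax(scores)                                          # attention weights
    context = (alpha[:, None] * H).sum(axis=0)                       # sum_i alpha_i * h_i
    return context, alpha

# Tiny example (assumed sizes): n = 8 hidden states of dimension 4.
rng = np.random.default_rng(2)
H = rng.random((8, 4))
context, alpha = attention_context(H, rng.normal(size=(6, 4)), np.zeros(6),
                                   rng.normal(size=6), 0.0)
```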

Finally, the predicted value is obtained through the fully connected layer.

4. Simulation Results and Analysis

In this section, we first introduce the actual traffic dataset of a telephone service provider in the European city of Milan and then conduct comparative experiments on this dataset to verify the advantages of the proposed model.

4.1. Dataset Description

In this paper, we select an open network traffic dataset available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/EGZHFV; the traffic was collected from 00:00 on November 1, 2013, to 00:00 on January 1, 2014. Table 1 shows the relevant dataset information. In this experiment, the seven days of data from 11/04 to 11/10 are selected as the dataset. The time interval of the original data is 10 minutes, so there are 144 data points per day in each region. In this paper, nine regions are selected, and one week of data is collected. The grid and map of the area where the dataset is located are shown in Figure 5. Figure 6 shows the network traffic trend of the nine regions within one week.

4.2. Experimental Indicators

In order to thoroughly verify the performance of the model, we use five experimental indicators to evaluate the traffic prediction model proposed in this paper, as follows:
(1) Root mean square error (RMSE) reflects the prediction error of the model. The value range of RMSE is $[0, +\infty)$; the closer the RMSE is to zero, the better the performance of the model.
(2) Mean absolute error (MAE) measures the mean absolute error between the predicted value and the true value. The value range of MAE is $[0, +\infty)$; the closer the MAE is to zero, the better the performance of the model.
(3) Accuracy (ACC) reflects the prediction accuracy of the model. The closer the ACC is to 1, the better the performance of the model.
(4) The determination coefficient (R²) represents the quality of the model fit. The closer R² is to 1, the better the model fits the data.
(5) The explained variance score (EVS) is the variance score of the model. The closer the EVS is to 1, the better the independent variable explains the variance of the dependent variable.
In these indicators, $y_t$ denotes the actual value of the traffic data at time $t$, $\hat{y}_t$ denotes the predicted value of the traffic data at time $t$, $\bar{y}$ denotes the mean value of the traffic data, and $n$ is the number of samples.
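For reproducibility, the five indicators could be computed as in the sketch below; note that the paper's exact ACC formula is not reproduced here, so the Frobenius-norm form used in the code is an assumption.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute the five indicators of Section 4.2. The closed-form ACC used in the
    paper is not reproduced; the norm-based form below is a common choice (assumption)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))                                    # RMSE
    mae = np.mean(np.abs(err))                                           # MAE
    acc = 1.0 - np.linalg.norm(err) / np.linalg.norm(y_true)             # ACC (assumed form)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # R^2
    evs = 1.0 - np.var(err) / np.var(y_true)                             # EVS
    return {"RMSE": rmse, "MAE": mae, "ACC": acc, "R2": r2, "EVS": evs}

# Example with placeholder values:
print(evaluate([10.0, 12.0, 9.0, 11.0], [10.5, 11.0, 9.5, 11.5]))
```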

4.3. Experimental Parameters

In this experiment, we use a deep learning server to configure the experimental environment, in which the CPU is an AMD Ryzen 5 2600, the GPU is an NVIDIA GT 745M, and the memory size is 16 GB. In addition, TensorFlow is used to build the network framework, and Python is used as the programming environment. Table 2 lists the detailed environment configuration parameters.

Further, we need to determine the model training parameters. In this experiment, Adam is chosen as the optimizer, the learning rate is set to 0.001, and the number of training epochs is 3000. Regarding the selection of the sliding window length and the number of hidden units, theoretically, on the one hand, the larger the sliding window length, the larger the perception range and the more historical features are involved in prediction, which may interfere with prediction accuracy. On the other hand, when the number of hidden units increases beyond a certain point, the complexity and difficulty of model calculation also increase, and prediction accuracy decreases.

Considering that the sliding window length and the number of hidden units have a significant impact on the timeliness and accuracy of traffic prediction, we compared ACC and R² under different sliding window lengths and numbers of hidden units and obtained the optimal sliding window length and number of hidden units under the current configuration.

Specifically, the optional range of the sliding window length is set to [4, 8, 12, 16], and by comparing the prediction performance under the different settings in Figure 7, we obtain the optimal sliding window length, which is 8. That is, we use 8 historical network traffic points (i.e., 80 minutes of history) to predict future traffic. Similarly, the optional range of the number of hidden units is set to [32, 64, 100, 128], and by comparing the prediction performance under the different settings in Figure 8, we obtain the optimal number of hidden units, which is 100.
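The sliding-window preparation implied by these parameters could be sketched as follows; the helper name, horizon handling, and placeholder data are assumptions.

```python
import numpy as np

def make_windows(series, window=8, horizon=1):
    """Turn a (T, N) traffic series into (input, target) pairs with a sliding
    window, mirroring the choice above (window=8, i.e. 80 minutes of history)."""
    inputs, targets = [], []
    for t in range(len(series) - window - horizon + 1):
        inputs.append(series[t:t + window])               # n historical points
        targets.append(series[t + window + horizon - 1])  # point to predict
    return np.stack(inputs), np.stack(targets)

# Example: single-step (10 min) and two-step (20 min) targets from placeholder data.
X = np.random.rand(1008, 9)          # one week of 10-minute samples for 9 regions
X_in, y_10min = make_windows(X, window=8, horizon=1)
_, y_20min = make_windows(X, window=8, horizon=2)
```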

In conclusion, when the sliding window length is set to 8 and the number of hidden units is set to 100, the prediction result is optimal. Therefore, the model training parameters containing the above results are listed in detail in Table 3.

4.4. Result Analysis
4.4.1. Comparison Results between the AGG Model and Other Baseline Models

To verify the performance of the AGG model, 80% of the traffic data are selected as the training dataset, and 20% of the traffic data are selected as the validation dataset. The comparison indicators are described in Section 4.2. In addition, five baseline models, covering both model-driven and data-driven methods, are selected for comparison with the model proposed in this paper. The comparison results are listed in Table 4; because the sampling interval of the traffic data is 10 minutes, we use 10 minutes (one point) and 20 minutes (two points) to carry out single-step prediction and multistep prediction, respectively, as sketched after the following list. The baselines are as follows:
(1) Historical average model (HA), which models network traffic as a periodic process to predict the time series.
(2) Autoregressive integrated moving average model (ARIMA), which fits the time series into a parametric model to complete network traffic prediction.
(3) Support vector regression model (SVR), which adopts a machine learning algorithm, uses historical data to fit the relationship between input and output, and then predicts future network traffic data.
(4) Gated recurrent unit (GRU), which is an efficient solution to the vanishing gradient issue over long input sequences.
(5) GCN-GRU, which is a combined model of a graph convolutional neural network (GCN) and a gated recurrent unit (GRU).
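A minimal sketch of this evaluation setup, assuming a chronological split and placeholder data (names and values are not from the paper), is as follows.

```python
import numpy as np

# Placeholder data: one week of 10-minute samples for 9 regions (values are not real).
X = np.random.rand(1008, 9)

# Chronological 80% / 20% split into training and validation sets.
split = int(0.8 * len(X))
train_series, test_series = X[:split], X[split:]

# Prediction spans evaluated in Table 4: one point (10 min) and two points (20 min)
# beyond each 8-step input window.
horizons = {"10 min": 1, "20 min": 2}
for name, steps in horizons.items():
    print(f"{name} span -> predict {steps} point(s) ahead of each sliding window")
```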

Table 4 shows that the experimental indicators of the AGG model proposed in this paper are significantly better than those of the other baseline models. To be specific, we have the following:
(1) At the 10 min prediction span, the AGG model proposed in this paper has the optimal values of RMSE, MAE, ACC, R², and EVS. For example, the RMSE of the AGG model is 3.7% lower than that of the GCN-GRU model, 4.2% lower than that of the GRU model, 5.5% lower than that of the SVR model, 6.3% lower than that of the ARIMA model, and 14.7% lower than that of the HA model. The ACC of the AGG model is 1.5% higher than that of the GCN-GRU model, 2% higher than that of the GRU model, 2.3% higher than that of the SVR model, 15.9% higher than that of the ARIMA model, and 6.9% higher than that of the HA model. It can be further seen that both AGG and GRU are superior to the model-driven traffic prediction methods.
(2) At the 20 min prediction span, the AGG model proposed in this paper still has the optimal values of RMSE, MAE, ACC, R², and EVS. For example, the RMSE of the AGG model is 1.6% lower than that of the GCN-GRU model, 1.7% lower than that of the GRU model, 1.9% lower than that of the SVR model, 2.5% lower than that of the ARIMA model, and 7.4% lower than that of the HA model. The prediction accuracy of the AGG model is 0.7% higher than that of the GCN-GRU model, 1.8% higher than that of the GRU model, 2.5% higher than that of the SVR model, 12.2% higher than that of the ARIMA model, and 3.5% higher than that of the HA model.
(3) It can be further concluded from the prediction results that, in horizontal comparison, the data-driven prediction methods, whether SVR or GRU, are better than the model-driven methods. This result is due to the poor fitting ability of HA and ARIMA for this long, unstable series, while the neural network models fit the nonlinear data much better. In longitudinal comparison, the performance indicators of the AGG model proposed in this paper decrease as the prediction time increases, but the decline is relatively gradual, and the model still has long-term prediction ability.

4.4.2. Influence of Spatial-Temporal Correlation and Attention Mechanism on Prediction Performance

In order to further explore the influence of spatial-temporal correlation and the attention mechanism on prediction performance, two experimental indicators, RMSE and ACC, are used to compare the AGG model with the other baseline models at the 10 min prediction span; the comparison results are shown in Figures 9 and 10, respectively.

Figure 9 shows the comparison of RMSE between the AGG model and the other baseline models. These baseline models include the model-driven traffic prediction methods HA and ARIMA and the data-driven traffic prediction methods SVR, GRU, and GCN-GRU. Specifically, the RMSE values of the model-driven traffic prediction methods are 6.1774 (HA) and 5.6241 (ARIMA), and the RMSE values of the data-driven traffic prediction methods are 5.5817 (SVR), 5.4932 (GRU), 5.4761 (GCN-GRU), and 5.2721 (AGG). Therefore, RMSE shows an overall downward trend, and the AGG model proposed in this paper has the smallest RMSE, which means that modeling the spatial-temporal correlation and introducing the attention mechanism are effective in reducing the RMSE of the network traffic prediction results.

Figure 10 shows the comparison of ACC between the AGG model and the other baseline models. These baseline models are consistent with those in Figure 9. Specifically, the ACC values of the model-driven traffic prediction methods are 0.6785 (HA) and 0.6264 (ARIMA), and the ACC values of the data-driven traffic prediction methods are 0.7095 (SVR), 0.7114 (GRU), 0.7150 (GCN-GRU), and 0.7256 (AGG). Therefore, ACC shows an overall upward trend, and the AGG model proposed in this paper has the largest ACC, which means that modeling the spatial-temporal correlation and introducing the attention mechanism are significant in improving the ACC of the network traffic prediction results.

4.4.3. Analysis of Visual Results of Traffic Prediction

To show the prediction results of the proposed AGG model more intuitively, Figures 11 and 12 show the traffic trends of the predicted values versus the true values of the AGG model for the 10 min and 20 min prediction spans in area 2270, respectively. In the experiment, the sliding window length is set to 8 and the number of hidden units is set to 100, which were shown to be optimal in Section 4.3.

It can be seen from Figures 11 and 12 that the AGG model proposed in this paper has good prediction performance, but it has the following two flaws. On the one hand, the prediction of network traffic at peaks is poor. The main reason is that the GCN model defines a smoothing filter in the Fourier domain and captures spatial characteristics by continuously moving the filter over the signal in the convolution operation; this process leads to smoother predictions in regions of abrupt change. On the other hand, there is a certain error between the true network traffic data and the prediction results. A possible reason is that when there is no communication in a region at a certain time, the value of the network traffic may be zero or very small, and a small absolute difference may then cause a large relative error. Further, by comparing Figures 11 and 12, we can also observe that, as the prediction time scale increases, the fit between the predicted and actual values decreases, indicating that a smaller prediction scale always yields a better prediction effect.

5. Conclusion

In this paper, we propose a network traffic prediction method combining GCN, GRU, and an attention mechanism. In this method, the GCN is used to capture the network topology and obtain the spatial features of network traffic, the GRU model is used to capture the dynamic changes of traffic on the nodes and thus obtain the temporal features of network traffic, and the attention mechanism is used to weight the historical traffic data and dynamically adjust the importance of the network traffic information at each sampling time. Experiments on an actual network traffic dataset and comparisons with the baseline models HA, ARIMA, SVR, GRU, and GCN-GRU show that the AGG model proposed in this paper achieves the best prediction performance under the different performance indicators.

Data Availability

This paper uses an open network traffic dataset available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/EGZHFV; the traffic was collected from 00:00 on November 1, 2013, to 00:00 on January 1, 2014.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grants 61801073, 61722105, and 61931004.