Abstract
Multivariate time series prediction is a very important task, which plays a huge role in climate, economy, and other fields. We usually use an Attention-based Encoder-Decoder network to deal with multivariate time series prediction because the attention mechanism makes it easier for the model to focus on the really important attributes. However, the Encoder-Decoder network has the problem that the longer the length of the sequence is, the worse the prediction accuracy is, which means that the Encoder-Decoder network cannot process long series and therefore cannot obtain detailed historical information. In this paper, we propose a dual-window deep neural network (DWNet) to predict time series. The dual-window mechanism allows the model to mine multigranularity dependencies of time series, such as local information obtained from a short sequence and global information obtained from a long sequence. Our model outperforms nine baseline methods in four different datasets.
1. Introduction
In the age of big data, sequence data is everywhere [1, 2]. Time series prediction algorithms are becoming increasingly important in many areas, such as financial market prediction [3], passenger demand forecasting [4], and heart signal prediction [5]. In most cases, time series data is multivariate. The key to multivariate time series prediction is to capture the spatial and temporal relationships between different attributes at different times [6]. As a widely used traditional time series prediction algorithm, ARIMA [7] has shown its effectiveness in many areas. However, ARIMA cannot model nonlinear relationships and can only be applied to stationary time series [8–10]. The recurrent neural network (RNN) [11] has achieved great success in sequence modeling, but it suffers from vanishing gradients and has difficulty capturing the long-term dependence of time series [12]. Long short-term memory (LSTM) [13] and the gated recurrent unit (GRU) [14, 15] alleviate the vanishing gradient problem of RNN, and many models for time series prediction have been built on them, such as Encoder-Decoder networks [15, 16]. Encoder-Decoder networks excel at time series prediction tasks, especially Attention-based Encoder-Decoder networks [17]. An Attention-based Encoder-Decoder network can not only find the spatial-temporal correlation between different series but also identify important information in the raw data and increase its weight [17]. Among them, the dual-stage attention-based recurrent neural network (DARNN) is one of the state-of-the-art methods, creatively using a two-stage attention mechanism [18].
Although DARNN can capture the spatial correlations between different attributes at the same time and the temporal correlations between different times within the same attribute, its prediction deteriorates when the sequence becomes too long [18]. This problem is common to all Encoder-Decoder networks. A long sequence carries more historical information, so better results should be obtainable. However, due to the limitations of Encoder-Decoders, the information in the long sequence is not effectively used and may even interfere with the prediction results. This is because LSTM does not fully solve the vanishing gradient problem, and when the time series is too long, earlier information is overwritten by later information. Therefore, Encoder-Decoders generally use a small window size to ensure prediction accuracy. The dual-stage two-phase attention-based recurrent neural network (DSTP) [19] improves on DARNN in this respect and optimizes the prediction of long sequences. However, DSTP still does not make effective use of long sequences.
When the time window size is small, the series is very close to the prediction point, and such data has the closest relationship with the prediction point. For instance, if the values before the prediction point are gradually increasing, then the value at the prediction point is also likely to increase. When the time window size is large, the series contains more time steps. It is difficult for existing models to extract recent information, such as trends, from such a long series, so they cannot obtain good prediction results. However, the additional information brought by more time steps is very important for time series prediction. The key is how to make good use of the different characteristics of short and long sequences.
To solve this problem, we propose a dual-window deep neural network (DWNet). DWNet consists of two parts. The first part, based on the Encoder-Decoder [15], captures spatial and temporal correlations from the short sequence and is responsible for providing recent details. The second part, based on the temporal convolutional network (TCN) [20], obtains long-term dependencies such as periodicity and seasonality from the long sequence. TCN is an emerging CNN-based model; with the parallelism of convolution operations and a large receptive field, it has attracted considerable attention in sequence modeling. Short-term time series generally contain only one or two periods, whereas long-term time series include enough time steps to cover many. Setting two different time window sizes for the long and short sequences makes it possible to mine multigranularity dependencies.
The main contributions of our work are as follows:
(i) We propose a dual-window mechanism that can extract multigranularity information from sequences of different lengths.
(ii) We propose the DWNet approach, which combines the advantages of Encoder-Decoder networks and TCN. Encoder-Decoder networks have a strong ability to mine dependence from the short sequence, while TCN's large receptive field and fast training speed are better suited to long sequences.
(iii) DWNet can be applied to time series prediction tasks in many domains, and there is no special requirement on the input data. To justify the effectiveness of DWNet, we compare it with nine baseline methods on the Human Sports, SML 2010, Appliances Energy, and EEG datasets. The experiments show the effectiveness and robustness of DWNet.
2. Related Work
For the time series prediction task, there are various approaches ranging from traditional methods to deep learning methods. As the most famous traditional method, ARIMA can effectively capture the long-term dependence of the target series [7]. However, ARIMA does not consider the spatial correlation between exogenous series [18], can only be used for stationary series [7], and cannot model nonlinear relationships [8], so it is not suitable for increasingly complex time series data analysis. As a deep neural network dedicated to machine learning and data mining applications [21–23], RNN can model nonlinear relationships [24] and has achieved great success in time series prediction. However, the vanishing gradient of RNN makes it difficult to obtain long-term dependence from time series. LSTM [13] and GRU [15] add a gating mechanism to RNN and control the addition and deletion of temporal information through that mechanism, which alleviates the vanishing gradient of RNN. Based on LSTM and GRU, many influential deep neural networks have been proposed, such as the Encoder-Decoder network, which has received great attention in natural language processing [17]. Encoder-Decoder networks convert the input series into a context vector through the Encoder and then convert the context vector into the output through the Decoder. They have one problem: as the length of the sequence increases, the performance of the Encoder-Decoder first improves and then deteriorates [17]. Attention-based Encoder-Decoder networks can automatically select important information, thereby effectively alleviating this performance degradation on longer sequences.
Many attention-based models have since emerged. DARNN [18], GeoMAN [25], and DSTP [19] are models built on the Attention-based Encoder-Decoder and used for time series prediction. Inspired by theories of human attention [26], DARNN uses a dual-stage attention mechanism. The first stage uses a spatial attention mechanism to assign different weights to the exogenous series based on the Encoder hidden state at the previous time step. The second stage uses a temporal attention mechanism to select the most relevant Encoder hidden states across all time steps. Since it was proposed, DARNN has remained one of the state-of-the-art methods in time series prediction. The multilevel attention network (GeoMAN) is specially designed to predict geo-sensor time series data. Many time series are collected by sensors distributed across many locations, and such data is called geo-sensor time series data. If each series in a geo-sensor time series is simply treated as a normal attribute, the connection between different locations is lost. GeoMAN adds local spatial attention and global attention mechanisms to the Encoder and adds external factor information to the Decoder to solve this problem. DSTP adds another spatial attention mechanism to the Encoder to obtain the spatial correlation between the target series and the exogenous series, so DSTP achieves better results in long time series prediction.
While Attention-based Encoder-Decoder networks have attracted much attention, TCN has also shown strong sequence modeling ability [20]. TCN is based on CNN and includes causal convolution, dilated convolution [27, 28], and residual blocks [29], adapted from the image format to the one-dimensional format of sequence data. TCN has advantages that RNNs do not have. (1) TCN processes a series in parallel and does not need to process it step by step like RNN or LSTM, so the information of earlier time steps cannot be overwritten and training is faster. (2) TCN's receptive field depends on the number of layers, kernel size, and dilation rate and can be flexibly changed for different situations. (3) Compared with LSTM, TCN rarely suffers from vanishing gradients. Thanks to its flexible receptive field, fewer parameters than LSTM, and parallel processing, TCN can reduce the training time on long sequences while ensuring that the information of earlier time steps is not covered. Therefore, TCN has a strong ability to obtain information from long sequences and is suitable for long sequence modeling.
The long- and short-term time series network (LSTNet) [30] is based on CNN and RNN and recognizes that time series have two different kinds of dependence, short-term and long-term. LSTNet uses a convolutional component to extract short-term local dependence and a recurrent component with a recurrent-skip mechanism to capture long-term dependence. However, it does not consider that information closer to the prediction point is more important, so LSTNet loses some recent information in time series prediction.
3. Dual-Window Deep Neural Network
3.1. Notation and Problem Statement
In our work, there are two different window sizes, the long window size $T_l$ and the short window size $T_s$. Given $n$ exogenous series, that is, $\mathbf{X} = (\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^n)^\top \in \mathbb{R}^{n \times T_l}$, we segment a short series from the end of the long series. We use $\mathbf{x}^i = (x^i_1, x^i_2, \ldots, x^i_{T_l})^\top \in \mathbb{R}^{T_l}$ to represent the i-th long exogenous series, use $\mathbf{x}^i_s = (x^i_{T_l - T_s + 1}, \ldots, x^i_{T_l})^\top \in \mathbb{R}^{T_s}$ to represent the i-th short exogenous series, and use $\mathbf{x}_t = (x^1_t, x^2_t, \ldots, x^n_t)^\top \in \mathbb{R}^{n}$ to denote the vector of exogenous series at time $t$. We use $\mathbf{y} = (y_1, y_2, \ldots, y_{T_l})^\top \in \mathbb{R}^{T_l}$ to represent the target series, which has the long window size $T_l$.
Given the previous values of the target series and the exogenous series, that is, $(y_1, y_2, \ldots, y_{T_l - 1})$ with $y_t \in \mathbb{R}$ and $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{T_l})$ with $\mathbf{x}_t \in \mathbb{R}^{n}$, we aim to predict the next value of the target series, $\hat{y}_{T_l}$:

$$\hat{y}_{T_l} = F\left(y_1, \ldots, y_{T_l - 1}, \mathbf{x}_1, \ldots, \mathbf{x}_{T_l}\right), \tag{1}$$

where $F(\cdot)$ is a nonlinear mapping function we aim to learn.
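To make the dual-window notation concrete, the following NumPy sketch slices the long windows, their trailing short windows, the previous target values, and the next target value from a multivariate series. The function name and the example window sizes (128 and 16) are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def dual_window_samples(exog, target, T_l=128, T_s=16):
    """Slice (long window, short window, target history, next target) samples.

    exog:   array of shape (T, n)  -- n exogenous series over T time steps
    target: array of shape (T,)    -- target series
    The short window is simply the last T_s steps of each long window.
    """
    X_long, X_short, y_hist, y_next = [], [], [], []
    for t in range(T_l, len(target)):
        long_win = exog[t - T_l:t]            # shape (T_l, n)
        short_win = long_win[-T_s:]           # shape (T_s, n), trailing part
        X_long.append(long_win)
        X_short.append(short_win)
        y_hist.append(target[t - T_l:t])      # previous target values
        y_next.append(target[t])              # value to predict
    return (np.stack(X_long), np.stack(X_short),
            np.stack(y_hist), np.array(y_next))

# Example with random data: 24 exogenous series, 1000 time steps.
exog = np.random.randn(1000, 24)
target = np.random.randn(1000)
Xl, Xs, yh, yn = dual_window_samples(exog, target)
print(Xl.shape, Xs.shape, yh.shape, yn.shape)
# (872, 128, 24) (872, 16, 24) (872, 128) (872,)
```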
3.2. Model
Figure 1 presents the framework of our method. The input of DWNet is divided into two parts: the long series with window size $T_l$ and the short series with window size $T_s$. The short series is a part of the long series and is located at its end (Figure 1 shows the relationship between the two series). The long series is processed by TCN [20] and used to obtain more detailed historical information than the short series provides. The short series is processed by the Encoder-Decoder to capture local information. Finally, the outputs of the two parts are combined to get the predicted value of the target series at time $T_l$.

3.2.1. Capture Short-Term Dependence
First of all, we introduce the short series processing module. This part is based on the Encoder-Decoder and uses spatial attention and temporal attention mechanisms [18] to emphasize key information in the short series. The Encoder is based on LSTM, and its input is the short series $(\mathbf{x}^1_s, \ldots, \mathbf{x}^n_s)$. Given the i-th short exogenous series $\mathbf{x}^i_s$, we use the spatial attention module to adaptively obtain the spatial correlation between exogenous series:

$$e^i_t = \mathbf{v}_e^\top \tanh\left(\mathbf{W}_e\left[\mathbf{h}_{t-1}; \mathbf{s}_{t-1}\right] + \mathbf{U}_e \mathbf{x}^i_s\right), \tag{2}$$

$$\alpha^i_t = \frac{\exp\left(e^i_t\right)}{\sum_{j=1}^{n} \exp\left(e^j_t\right)}, \tag{3}$$

where $\mathbf{v}_e \in \mathbb{R}^{T_s}$, $\mathbf{W}_e \in \mathbb{R}^{T_s \times 2m}$, and $\mathbf{U}_e \in \mathbb{R}^{T_s \times T_s}$ are parameters to learn. Here, $m$ is the hidden size of the Encoder, and $\mathbf{h}_{t-1} \in \mathbb{R}^{m}$ and $\mathbf{s}_{t-1} \in \mathbb{R}^{m}$ are the hidden state and cell state of the LSTM unit in the Encoder at time $t-1$. $\alpha^i_t$ is the attention weight measuring the importance of the i-th exogenous series at time $t$. After we get the attention weights, we can adaptively extract the exogenous series with

$$\tilde{\mathbf{x}}_t = \left(\alpha^1_t x^1_t, \alpha^2_t x^2_t, \ldots, \alpha^n_t x^n_t\right)^\top. \tag{4}$$

Thus, the hidden state at time $t$ can be updated as

$$\mathbf{h}_t = f_1\left(\mathbf{h}_{t-1}, \tilde{\mathbf{x}}_t\right), \tag{5}$$

where $f_1$ is an LSTM unit in the Encoder. The spatial attention module calculates the weight of each exogenous series through equations (2) and (3) at time $t$ and uses $\tilde{\mathbf{x}}_t$ to adjust the hidden state at time $t$.
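The sketch below shows one possible PyTorch implementation of a spatial-attention Encoder in the spirit of equations (2)–(5). The single linear scoring layer is a simplified stand-in for the additive form above, and all class and variable names are hypothetical rather than taken from the authors' code.

```python
import torch
import torch.nn as nn

class SpatialAttnEncoder(nn.Module):
    """LSTM encoder with spatial (input) attention over exogenous series."""

    def __init__(self, n_exog, T_s, hidden_size):
        super().__init__()
        self.T_s, self.m = T_s, hidden_size
        # simplified scoring layer over [h_{t-1}; s_{t-1}; x^i_s]
        self.attn = nn.Linear(2 * hidden_size + T_s, 1)
        self.lstm = nn.LSTMCell(n_exog, hidden_size)

    def forward(self, x_short):                  # x_short: (batch, T_s, n_exog)
        b, T_s, n = x_short.shape
        h = x_short.new_zeros(b, self.m)
        s = x_short.new_zeros(b, self.m)
        encoder_states = []
        for t in range(T_s):
            # score each exogenous series from [h_{t-1}; s_{t-1}] and its full window
            hs = torch.cat([h, s], dim=1).unsqueeze(1).expand(b, n, 2 * self.m)
            series = x_short.permute(0, 2, 1)                   # (b, n, T_s)
            e = self.attn(torch.cat([hs, series], dim=2)).squeeze(-1)  # (b, n)
            alpha = torch.softmax(e, dim=1)                     # attention weights
            x_tilde = alpha * x_short[:, t, :]                  # weighted inputs at t
            h, s = self.lstm(x_tilde, (h, s))                   # update hidden state
            encoder_states.append(h)
        return torch.stack(encoder_states, dim=1)               # (b, T_s, m)

# Quick shape check with assumed sizes.
enc = SpatialAttnEncoder(n_exog=24, T_s=16, hidden_size=64)
print(enc(torch.randn(8, 16, 24)).shape)                        # torch.Size([8, 16, 64])
```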
The input of the Decoder is the previous target series and the output of the Encoder, that is, the Encoder hidden states. The Decoder aims to predict $\hat{y}_{T_l}$. To get accurate prediction results, we need to capture the temporal correlations across time steps, so we add a temporal attention module to the Decoder. As in the Encoder, the attention weight of the i-th Encoder hidden state at time $t$ is calculated based upon the previous Decoder hidden state and cell state of the LSTM unit with

$$l^i_t = \mathbf{v}_d^\top \tanh\left(\mathbf{W}_d\left[\mathbf{d}_{t-1}; \mathbf{s}'_{t-1}\right] + \mathbf{U}_d \mathbf{h}_i\right), \quad 1 \le i \le T_s, \tag{6}$$

$$\beta^i_t = \frac{\exp\left(l^i_t\right)}{\sum_{j=1}^{T_s} \exp\left(l^j_t\right)}, \tag{7}$$

where $\mathbf{v}_d \in \mathbb{R}^{m}$, $\mathbf{W}_d \in \mathbb{R}^{m \times 2p}$, and $\mathbf{U}_d \in \mathbb{R}^{m \times m}$ are parameters to learn. $p$ is the hidden size of the Decoder, and $\mathbf{d}_{t-1} \in \mathbb{R}^{p}$ and $\mathbf{s}'_{t-1} \in \mathbb{R}^{p}$ are the hidden state and cell state of the LSTM unit in the Decoder at time $t-1$. $\beta^i_t$ is the attention weight and shows the importance of the i-th Encoder hidden state at time $t$. Then, we can get the context vector with

$$\mathbf{c}_t = \sum_{i=1}^{T_s} \beta^i_t \mathbf{h}_i. \tag{8}$$

The context vector $\mathbf{c}_t$ is the sum of all weighted Encoder hidden states at time $t$. Then, we combine the context vector and the target series to update the Decoder hidden state $\mathbf{d}_t$:

$$\tilde{y}_{t-1} = \tilde{\mathbf{w}}^\top\left[y_{t-1}; \mathbf{c}_{t-1}\right] + \tilde{b}, \tag{9}$$

$$\mathbf{d}_t = f_2\left(\mathbf{d}_{t-1}, \tilde{y}_{t-1}\right), \tag{10}$$

where $f_2$ is an LSTM unit in the Decoder.
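A matching sketch of the temporal-attention Decoder corresponding to equations (6)–(10), under the same simplifications and assumed shapes as the Encoder sketch above.

```python
import torch
import torch.nn as nn

class TemporalAttnDecoder(nn.Module):
    """LSTM decoder with temporal attention over encoder hidden states."""

    def __init__(self, enc_hidden, dec_hidden):
        super().__init__()
        self.p = dec_hidden
        self.attn = nn.Linear(2 * dec_hidden + enc_hidden, 1)   # simplified scoring
        self.tilde = nn.Linear(enc_hidden + 1, 1)               # combine y_{t} with context
        self.lstm = nn.LSTMCell(1, dec_hidden)

    def forward(self, enc_states, y_hist):       # enc_states: (b, T_s, m); y_hist: (b, T_s)
        b, T_s, m = enc_states.shape
        d = enc_states.new_zeros(b, self.p)
        s = enc_states.new_zeros(b, self.p)
        context = enc_states.new_zeros(b, m)
        for t in range(T_s):
            # weight every encoder state against the previous decoder (hidden, cell) pair
            ds = torch.cat([d, s], dim=1).unsqueeze(1).expand(b, T_s, 2 * self.p)
            beta = torch.softmax(
                self.attn(torch.cat([ds, enc_states], dim=2)).squeeze(-1), dim=1)
            context = (beta.unsqueeze(-1) * enc_states).sum(dim=1)       # (b, m)
            y_tilde = self.tilde(torch.cat([context, y_hist[:, t:t + 1]], dim=1))
            d, s = self.lstm(y_tilde, (d, s))
        return d, context                         # final decoder state and context vector

# Quick shape check with assumed sizes.
dec = TemporalAttnDecoder(enc_hidden=64, dec_hidden=64)
d_T, c_T = dec(torch.randn(8, 16, 64), torch.randn(8, 16))
print(d_T.shape, c_T.shape)                       # torch.Size([8, 64]) torch.Size([8, 64])
```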
3.2.2. Capture Long-Term Dependence
We obtain long-term dependence through TCN [20], because TCN can process time series data in parallel and has far fewer parameters than LSTM. Therefore, TCN can quickly handle long series and improve time efficiency. Moreover, TCN does not suffer from earlier information being overwritten, so the integrity of the information is guaranteed even when the window size is large. In our model, the input of the TCN part is the long series from time 1 to $T_l$. In time series analysis, we cannot allow leakage from the future into the past: an element at time $t$ in a higher layer is obtained by convolving only elements from time $t$ and earlier in the previous layer. To avoid information leakage, TCN uses causal convolution. To expand the receptive field, TCN uses dilated convolution [27, 28]. For the long exogenous series $\mathbf{x}^i \in \mathbb{R}^{T_l}$ and a filter $f: \{0, 1, \ldots, k-1\} \rightarrow \mathbb{R}$, the element at time $t$ is

$$F(t) = \left(\mathbf{x}^i *_d f\right)(t) = \sum_{j=0}^{k-1} f(j) \cdot x^i_{t - d \cdot j}, \tag{11}$$

where $d$ is the dilation factor, $k$ is the filter size, and $*_d$ is the dilated convolution operation. $d$ increases exponentially with the number of layers to expand the receptive field. Deep neural networks are prone to exploding and vanishing gradients, so TCN uses residual blocks [29]. The residual connection enables the network to transfer information across layers and improves the efficiency of feature extraction.
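The following sketch shows one causal dilated residual block and a stack of such blocks in the style of the TCN of Bai et al. [20]. The channel counts, kernel size, and depth are illustrative, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class CausalDilatedBlock(nn.Module):
    """One TCN residual block: two causal dilated convolutions plus a skip connection."""

    def __init__(self, in_ch, out_ch, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # left padding keeps causality
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation)
        self.relu = nn.ReLU()
        self.downsample = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):                                # x: (batch, channels, T_l)
        out = nn.functional.pad(x, (self.pad, 0))        # pad only on the left (the past)
        out = self.relu(self.conv1(out))
        out = nn.functional.pad(out, (self.pad, 0))
        out = self.relu(self.conv2(out))
        return self.relu(out + self.downsample(x))       # residual connection

# Doubling the dilation at each level grows the receptive field exponentially:
# 1 + 2 * (k - 1) * (2^L - 1) for kernel size k and L levels (181 steps here).
tcn = nn.Sequential(*[CausalDilatedBlock(24 if i == 0 else 32, 32,
                                         kernel_size=7, dilation=2 ** i)
                      for i in range(4)])
long_window = torch.randn(8, 24, 128)                    # (batch, n_exog, T_l)
print(tcn(long_window).shape)                            # torch.Size([8, 32, 128])
```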
3.2.3. Training
Figure 1 shows that the predicted value is determined by two parts. We combine the outputs of the Decoder and the TCN to predict $\hat{y}_{T_l}$:

$$\hat{y}_{T_l} = \mathbf{v}_y^\top\left(\mathbf{W}_y\left[\mathbf{d}_{T_s}; \mathbf{c}_{T_s}; \mathbf{o}_{T_l}\right] + \mathbf{b}_w\right) + b_v, \tag{12}$$

where $\mathbf{W}_y \in \mathbb{R}^{p \times (p + m + q)}$, $\mathbf{b}_w \in \mathbb{R}^{p}$, $\mathbf{v}_y \in \mathbb{R}^{p}$, and $b_v \in \mathbb{R}$ are parameters to learn. Here, $q$ is the number of hidden units per TCN layer, and $\mathbf{o}_{T_l} \in \mathbb{R}^{q}$ is the output of the TCN at time $T_l$. We use the backpropagation algorithm to train DWNet and the Adam optimizer [31] to minimize the mean squared error (MSE) between the predicted value $\hat{y}_{T_l}$ and the ground truth $y_{T_l}$:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{j=1}^{N} \left(\hat{y}^{\,j}_{T_l} - y^{\,j}_{T_l}\right)^2, \tag{13}$$

where $\theta$ denotes all parameters to learn in DWNet and $N$ is the number of training samples.
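Putting the pieces together, the hypothetical sketch below fuses the Decoder state, context vector, and TCN output in the spirit of equation (12) and trains with Adam on the MSE loss of equation (13). It reuses the SpatialAttnEncoder, TemporalAttnDecoder, tcn, and dual_window_samples outputs from the earlier sketches; the exact fusion layer of DWNet may differ.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Maps [d_{T_s}; c_{T_s}; o_{T_l}] to the scalar prediction y_hat."""

    def __init__(self, dec_hidden, enc_hidden, tcn_channels):
        super().__init__()
        self.fc = nn.Linear(dec_hidden + enc_hidden + tcn_channels, 1)

    def forward(self, d_T, context, tcn_out):
        o_T = tcn_out[:, :, -1]                           # TCN features at the last time step
        return self.fc(torch.cat([d_T, context, o_T], dim=1)).squeeze(-1)

encoder = SpatialAttnEncoder(n_exog=24, T_s=16, hidden_size=64)
decoder = TemporalAttnDecoder(enc_hidden=64, dec_hidden=64)
head = FusionHead(dec_hidden=64, enc_hidden=64, tcn_channels=32)

Xl_t = torch.as_tensor(Xl, dtype=torch.float32).permute(0, 2, 1)   # (N, n, T_l) for Conv1d
Xs_t = torch.as_tensor(Xs, dtype=torch.float32)                    # (N, T_s, n)
yh_t = torch.as_tensor(yh[:, -16:], dtype=torch.float32)           # last T_s target values
yn_t = torch.as_tensor(yn, dtype=torch.float32)

params = (list(encoder.parameters()) + list(decoder.parameters())
          + list(tcn.parameters()) + list(head.parameters()))
optimizer = torch.optim.Adam(params, lr=0.001)                     # Adam, lr = 0.001
loss_fn = nn.MSELoss()                                             # mean squared error

for epoch in range(10):
    optimizer.zero_grad()
    enc_states = encoder(Xs_t)                                      # short-window branch
    d_T, context = decoder(enc_states, yh_t)
    y_hat = head(d_T, context, tcn(Xl_t))                           # add long-window branch
    loss = loss_fn(y_hat, yn_t)
    loss.backward()
    optimizer.step()
    print(epoch, loss.item())
```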
4. Experiment
Our model and all baseline methods are implemented on the PyTorch framework [32]. In this section, we first introduce four different datasets used in the experiment. Then, we introduce nine baseline methods. Next, we introduce the model evaluation methods and parameters. Finally, experiment results show the effectiveness of DWNet.
4.1. Datasets
We use four datasets to verify the effect of our model. They cover the fields of sports, energy, climate, and medicine. We divide each dataset into a training set and a testing set at a ratio of 4 : 1.
4.1.1. Human Sports [33]
The Human Sports data was collected from 10 volunteers of different genders, heights, and weights who performed sports including squats, walking, jumping jacks, and high knees. Four sensors worn on the arms and thighs recorded data every 50 milliseconds, including the acceleration and angular acceleration along the x-axis, y-axis, and z-axis. In our experiment, we take the resultant acceleration as the target series and the others as exogenous series. We only use the squat data of one volunteer and use the first 8796 data points as the training set and the remaining 2197 data points as the testing set.
4.1.2. SML 2010 [34]
SML 2010 is a public dataset for indoor temperature prediction. It contains nearly 40 days of data collected by a monitoring system. The data were sampled every minute and smoothed with 15-minute means. In our experiment, we take the weather temperature as the target series and select fifteen exogenous series. We use the first 1971 data points as the training set and the remaining 493 data points as the testing set.
4.1.3. Appliances Energy [35]
Appliances Energy is a public dataset used for predicting home appliance energy consumption. It was recorded at 10-minute intervals for about 4.5 months. Room temperature and humidity conditions were monitored with a wireless sensor network, the energy data was recorded with m-bus energy meters every 10 minutes, and the weather data was downloaded from the nearest airport weather station. In our experiment, we take the energy use as the target series and the others as exogenous series. We use the first 15548 data points as the training set and the remaining 3887 as the testing set.
4.1.4. EEG [36]
EEG is a public dataset for classification and regression. This database consists of 30 subjects performing a brain-computer interface task based on steady-state visual evoked potentials. In our experiment, we only use the data from one of those subjects. We take the electrode O1 attribute as the target series and the others as exogenous series. We use the first 7542 data points as the training set and the remaining 1886 as the testing set.
4.2. Baseline
4.2.1. ARIMA [8]
It is one of the well-known statistical algorithms for time series prediction.
4.2.2. LSTM [13]
LSTM improves on RNN by using a gating mechanism to control the addition and deletion of information, which alleviates gradient vanishing.
4.2.3. Encoder-Decoder [16]
It is widely used in machine translation. However, Encoder-Decoder has the disadvantage of losing information.
4.2.4. Input-Attn-RNN [18]
It adds a spatial attention module to the Encoder of the Encoder-Decoder network to obtain the spatial correlation in the raw data.
4.2.5. Temp-Attn-RNN [19]
It adds a temporal attention module to the Decoder of the Encoder-Decoder network to obtain the temporal correlation of the Encoder hidden states.
4.2.6. TCN [20]
It is an emerging sequence modeling model that has attracted much attention, including causal convolution, dilated convolution, and residual blocks.
4.2.7. LSTNet [30]
It combines CNN and RNN to obtain short-term and long-term dependencies in sequence.
4.2.8. DARNN [18]
As one of the state-of-the-art methods, inspired by the human attention system, DARNN uses both spatial attention and temporal attention to extract spatial-temporal correlation.
4.2.9. DSTP-RNN [19]
It improves on DARNN by adding another attention module to the Encoder, so that more stationary attention weights can be obtained in the Encoder. DSTP-RNN is good at long time series prediction.
4.3. Evaluation Metrics
We employ the root mean squared error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and symmetric mean absolute percentage error (SMAPE) to evaluate our model and the baseline methods. These four evaluation metrics are widely used in time series prediction. RMSE strongly penalizes predictions that deviate far from the ground truth, while MAE treats all errors equally. MAPE can compare forecast accuracy among differently scaled time series because relative errors do not depend on the scale of the dependent variable. However, when the ground truth is small, small absolute errors produce huge differences in the MAPE value, and SMAPE alleviates this problem. Assuming $\hat{y}_t$ is the predicted value at time $t$ and $y_t$ is the ground truth, RMSE is defined as follows:

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left(\hat{y}_t - y_t\right)^2}. \tag{14}$$

MAE is defined as follows:

$$\text{MAE} = \frac{1}{N} \sum_{t=1}^{N} \left|\hat{y}_t - y_t\right|. \tag{15}$$

MAPE is defined as follows:

$$\text{MAPE} = \frac{100\%}{N} \sum_{t=1}^{N} \left|\frac{\hat{y}_t - y_t}{y_t}\right|. \tag{16}$$

SMAPE is defined as follows:

$$\text{SMAPE} = \frac{100\%}{N} \sum_{t=1}^{N} \frac{\left|\hat{y}_t - y_t\right|}{\left(\left|\hat{y}_t\right| + \left|y_t\right|\right)/2}. \tag{17}$$
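For reference, the four metrics can be computed with a few lines of NumPy; the small epsilon that guards against division by zero is our own addition.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def mae(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true))

def mape(y_true, y_pred, eps=1e-8):
    # relative error; unstable when |y_true| is close to zero
    return 100.0 * np.mean(np.abs((y_pred - y_true) / (y_true + eps)))

def smape(y_true, y_pred, eps=1e-8):
    # symmetric variant that bounds the per-point error
    return 100.0 * np.mean(2.0 * np.abs(y_pred - y_true)
                           / (np.abs(y_true) + np.abs(y_pred) + eps))

y_true = np.array([10.0, 0.5, 3.2])
y_pred = np.array([9.5, 0.7, 3.0])
print(rmse(y_true, y_pred), mae(y_true, y_pred),
      mape(y_true, y_pred), smape(y_true, y_pred))
```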
4.4. Parameters Settings
Most time series prediction models choose a small window size in their experiments. For example, DARNN sets the window size to 10 [18], and GeoMAN sets it to 6 [25]. To show the influence of the window size on prediction, we evaluate both a short window and a long window. For DWNet, we set the long window size $T_l$ and the short window size $T_s$; for the baseline methods, we conducted experiments with the long and the short window size, respectively. In training, we set the batch size to 128 and the learning rate to 0.001. Our model also has parameters such as the hidden size of the Encoder $m$, the hidden size of the Decoder $p$, and the kernel size and number of levels of the TCN. For simplicity, we use the same hidden size for the Encoder and Decoder, that is, $m = p$, and conducted a grid search over candidate sizes. For the TCN level and kernel size, we also conducted a grid search. The setting that outperforms the others on the testing set is kept, and we fixed these parameters in all experiments.
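A sketch of the kind of grid search described above; the candidate values and the train_and_evaluate helper (assumed to return a validation RMSE for one configuration) are placeholders, since the actual search space is not reproduced here.

```python
import itertools

def grid_search(train_and_evaluate,
                hidden_sizes=(16, 32, 64, 128),    # placeholder candidates
                tcn_levels=(2, 3, 4),
                kernel_sizes=(3, 5, 7)):
    best_score, best_cfg = float("inf"), None
    for m, levels, k in itertools.product(hidden_sizes, tcn_levels, kernel_sizes):
        # m is shared by Encoder and Decoder (m = p), as described above
        score = train_and_evaluate(hidden_size=m, tcn_levels=levels, tcn_kernel=k,
                                   batch_size=128, lr=0.001)
        if score < best_score:
            best_score, best_cfg = score, (m, levels, k)
    return best_cfg, best_score
```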
5. Results and Discussion
In this section, we first compare our model with the baseline methods on the four datasets. Then, we conduct a grid search to show the performance of our model under different combinations of long and short window sizes. Next, we carry out ablation experiments and study the time efficiency of our model.
5.1. Model Comparison
To show the effectiveness of DWNet, we compare it with 9 different methods, including state-of-the-art and emerging methods. For fairness, we run each baseline with two different window sizes so that we can compare the baselines' results under both the long and the short window size with DWNet. The prediction results of DWNet and the baseline methods are shown in Tables 1 and 2.
Table 1 shows that DWNet achieves the best RMSE and MAE on all four datasets, and Table 2 shows that DWNet also achieves the best MAPE and SMAPE. This is because DWNet obtains not only the short-term dependency in the short sequence but also the long-term dependency in the long sequence. ARIMA performs worse than the other models because it cannot capture nonlinear relationships and does not consider the spatial correlation between exogenous series [7]. The Encoder-Decoder network performs better than a plain LSTM on all four datasets, which means the Encoder-Decoder obtains dependencies from raw data more easily [16]. The Attention-based Encoder-Decoder networks, that is, Input-Attn-RNN and Temp-Attn-RNN, are better than the plain Encoder-Decoder on all four datasets because the attention mechanism pays more attention to the important features in the raw data. DARNN and DSTP combine the spatial and temporal attention mechanisms and perform well on all four datasets. The performance of TCN is very unstable: it is better than DSTP on Human Sports but far worse than DARNN and DSTP on the other datasets, especially EEG. LSTNet's performance is also unstable; it performs very well on Human Sports but poorly on the other three datasets. Meanwhile, we can also find that the LSTM-based networks perform better on short sequences than on long sequences.
5.2. Time Step Study
In this section, we study the impact of the long window size $T_l$ and the short window size $T_s$ on prediction. When we vary $T_l$ and $T_s$, we keep the other parameters fixed. We plot the RMSE for different combinations of the long window size $T_l$ and the short window size $T_s$ in Figure 2.

It is easily observed that the performance of DWNet is affected by the two parameters $T_l$ and $T_s$ simultaneously. When either window size is fixed, the performance of DWNet becomes worse if the other is too large or too small. We also notice that DWNet achieves its best performance at the combination of $T_l$ and $T_s$ indicated in Figure 2.
5.3. Ablation Experiment
To further investigate the effectiveness of each model component, we compare DWNet with Input-Attn-RNN, Temp-Attn-RNN, DARNN, and other variants on the Human Sports and EEG datasets. In this experiment, we set the window size of Input-Attn-RNN, Temp-Attn-RNN, and DARNN to 16 and fix the dual-window sizes $T_l$ and $T_s$ of DWNet and its variants. The variants of DWNet are as follows:
(i) DWNet-ni: there is no spatial attention module in the Encoder part.
(ii) DWNet-nt: there is no temporal attention module in the Decoder part.
The experiment results are shown in Figure 3. Input-Attn-RNN performs better than Temp-Attn-RNN on the EEG dataset but worse on the Human Sports dataset. However, DARNN achieves better RMSE and MAE than both Input-Attn-RNN and Temp-Attn-RNN on the two datasets. Apparently, models based on a two-stage attention mechanism outperform single-attention models, and that is why DWNet is superior to DWNet-ni and DWNet-nt. It is easily observed that DWNet achieves the best RMSE on Human Sports and EEG, which shows that the information in the long sequence is valuable for the time series prediction task. Without the long sequence processing module, it would not be possible to outperform the state-of-the-art methods in time series prediction.

5.4. Time Complexity
The time efficiency of deep neural networks is also an evaluation metric that needs to be considered. In this section, we compare the time efficiency of DWNet and the baseline methods. In this experiment, we set the short window size to 16, the long window size to 128, and the hidden size to 16, and fixed the other parameters. We experimented on the Human Sports and EEG datasets and recorded the time (in seconds) spent in 10 epochs. The results are shown in Figure 4. We can observe that the more attention modules a model has, the more time it spends. Input-Attn-RNN and Temp-Attn-RNN each have only one attention module, one spatial and the other temporal, and their amounts of computation are essentially the same: Temp-Attn-RNN's training time is slightly longer than Input-Attn-RNN's, but both are far shorter than that of DARNN, which has both attention modules. DSTP has two attention modules in the Encoder part and one in the Decoder part, so its training time is longer than DARNN's. TCN benefits from fewer parameters and parallel processing and therefore has a very large advantage in training time; it takes the least time on both datasets. DWNet has two attention modules plus a long sequence processing module (implemented with TCN). Therefore, DWNet is inferior to DARNN in time efficiency, let alone TCN. However, DWNet has stronger time series forecasting capability than DARNN and TCN and is more suitable for situations that require high accuracy rather than low time consumption.

6. Conclusion
In this paper, we propose a dual-window deep neural network (DWNet) to make good use of long sequences for time series prediction. The dual-window mechanism treats the full input window as a long sequence and splits off its end as a short sequence. The long sequence processing module in DWNet extracts historical information from the long time series, and the short sequence processing module obtains recent information from the short time series. Together, these allow the model to learn both the long-term and the short-term dependence of the sequence. Our model outperforms the state-of-the-art methods on four datasets. In the future, we plan to perform model compression to reduce the model's running time. Moreover, we will improve the long sequence processing module and enhance its stability, thereby improving the performance of DWNet.
Data Availability
The Human Sports dataset is available from Hangzhou Dianzi University's fitness club. Due to personal privacy, these data cannot be made publicly available. The remaining datasets analyzed during the current study were derived from the following public domain resources: https://archive.ics.uci.edu/ml/datasets/SML2010, https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction, and https://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by a grant from the National Natural Science Foundation of China (no. U1609211) and the National Key Research and Development Project (2019YFB1705102).