Abstract
Sea wind speed forecast is important for meteorological navigation system to keep ships in safe areas. The high volatility and uncertainty of wind make it difficult to accurately forecast multistep wind speed. This paper proposes a new decomposition-based model to forecast hourly sea wind speeds. Because mode mixing affects the accuracy of the empirical mode decomposition- (EMD-) based models, this model uses the variational mode decomposition (VMD) to alleviate this problem. To improve the accuracy of predicting subseries with high nonlinearity, this model uses stacked gate recurrent units (GRU) networks. To alleviate the degradation effect of stacked GRU, this model modifies them by adding residual connections to the deep layers. This model decomposes the nonlinear wind speed data into four subseries with different frequencies adaptively. Each stacked GRU predictor has four layers and the residual connections are added to the last two layers. The predictors have 24 inputs and 3 outputs, and the forecast is an ensemble of five predictors’ outputs. The proposed model can predict wind speed in the next 3 hours according to the past 24 hours’ wind speed data. The experiment results on three different sea areas show that the performance of this model surpasses those of a state-of-the-art model, several benchmarks, and decomposition-based models.
1. Introduction
Sea wind always threatens the safe navigation of ships. According to the Marine Casualties and Incidents Reports published by the International Maritime Organization (IMO), there were 1561 well-documented ship accidents in the first 10 years of the 21st century, of which 755 were caused by strong winds and large waves caused by strong winds, and accidents caused by strong winds accounted for 48.3% of total accidents. In addition, when the wind wave and swell appear at the same time, the danger of navigation will be greatly increased [1]. Therefore, the accurate wind speed forecasting is of great significance for routes optimization and navigation risk management.
Wind speed forecasts are divided into 4 categories, super-short-term [2, 3], short-term [4–7], medium-term [8, 9], and long-term [10], ranging from a few seconds to 30 minutes, from 30 minutes to 6 hours, from 6 hours to 24 hours, and from 24 hours to a week or more, respectively [11]. According to the principle of the wind speed forecasts models, they are classified into physical models and statistical models. The second class includes time series models and machine learning models [11, 12]. The physical methods, like the numerical weather prediction (NWP), construct differential equations about physical factors such as wind speed, wind direction, air temperature, and pressure. Solving meteorological equations requires a large number of computing resources and time. The result belongs to long-term forecasting of a large area. The time series methods model the relationship between current wind speeds and historical wind speeds, which are suitable for short-term and medium-term forecast. Most time series methods, such as the autoregressive integrated moving average (ARIMA) [2, 13] and autoregressive moving average with exogenous variables (ARMAX) [14], assume that there is a linear relationship between current data and past data or errors. The construction, order identification of these models is easy to understand, but their linear assumptions lead to poor forecast performance on nonlinear data. Machine learning methods are suitable for short-term or super-short-term forecasting. They take each time point of past series as an input feature and that of predicted series as an output feature and construct nonlinear relationship. Many complex machine learning models, such as the long short-term memory (LSTM) [15] and gated recurrent unit (GRU) [4, 16], are able to learn temporal correlation and often outperform time series models. Among them, GRU not only alleviates the risk of gradient explosion and vanishing but also is faster than LSTM.
Decomposition-based methods have attracted much attention recently. These methods decompose the original wind speed into several subseries and use a group of same or different individual prediction models to learn each subseries [17]. Usually, time series or machine learning models are selected as predictor. The decomposition-based methods reduce the complexity of original data and make the predictors easier to learn. In addition, multimodel ensemble decreases the risk of getting stuck in local optima in the training process [18]. Therefore, decomposition-based forecasting is more accurate than direct forecasting via individual model. In the field of wind speed forecasting, wavelet transform (WT) methods and empirical mode decomposition (EMD) methods are the most used algorithms [19]. Usually, the repeated WT and ARIMA are used to predict super-short-term wind speed of a 10 min scale [20]. Because LSTM is more effective than ARIMA in nonlinear system, the WT and LSTM are combined to predict hourly wind speed, and feature selection based on mutual information is executed between decomposition and predictors [6]. The characteristics of linear and nonlinear are different; then the ARIMA and multilayer perceptron (MLP) are used to predict linear and nonlinear subseries which are classified based on the EMD [21]. There are also some decomposition-based methods that are used to deal with nonlinearity. Moving average (MA) filter [22] and ARIMA filter [23] are used to separate linear components, and MLP is used to predict the nonlinear parts [24]. Besides the linear and nonlinear predictors, a predictor is used to predict the residual of EMD, since it includes some information [25]. For short-term forecast, a permutation entropy (PE) method is used to predict a 3-step hourly forecast. The subseries is reorganized into several new series according to their PE values. Because it is difficult to capture the nonlinear features, this method uses MLP to predict each component [26].
According to above references, the decomposition-based methods have many advantages. However, they have some problems that have not been widely solved. Firstly, the decomposition algorithms have some defects. Although the WT and EMD are used in wind speed forecast, there are some defects that decrease the forecast accuracy. It is difficult to use WT to analyse local low-frequency changes [27] and the decomposition behaviour depends on wavelet basis functions [15]; different wavelet basis functions bring different decomposition results. The EMD decomposes a time series into subseries with different frequency domain bandwidths and the frequency bands have no overlap ideally. When there is a frequency band overlap in the subseries, multiple modes are mixed, and it is not suitable for further processing [28]. In [29], the ensemble empirical mode decomposition was used to predict short-term wind speed. In [30], the complementary ensemble empirical mode decomposition was used to alleviate mode mixing. These two methods add multiple white noises to the original data and then integrate the results of multiple EMDs. Variational mode decomposition (VMD) is proposed to solve the mode mixing and does not depend on fixed basis. Different from the EMD-based algorithms, it avoids mode mixing as much as possible by solving specific intrinsic mode functions (IMFs) [31]. In recent studies, the VMD, MLP, and autoregressive moving average (ARMA) are used to predict wind speed with 10-minute interval [32] and 30-minute interval [19].
Secondly, the subseries’ predictors can be improved by stacking. Although the decomposed subseries are simplified in frequency, they still have relatively high nonlinearity. Many prediction studies used the support vector machine regression (SVR) and MLP as nonlinear predictors [21, 33]. The neural networks are good at nonlinear modelling, so complex neural networks, such as LSTM [6, 34] and GRU [4], are helpful to improve the accuracy of forecast. A hybrid predictor that includes the VMD and a single-layer GRU is used to predict the wind power interval [3]. Ideally, stacking more models will significantly improve the ability of nonlinear modelling. However, the actual performance of a stacked network often becomes worse when there are more layers. It is difficult to train deeper layers to fit an identity mapping and it leads to the degradation of stacked models. The residual connections solve this degradation phenomenon by building linear paths between deep layers [35]. The stacked LSTM with residual connections shows superior accuracy in machine translation and sentiment intensity prediction [36, 37], but this improvement has not been applied to wind speed forecasting. In [36], two 8-layer LSTMs are added with residual connections every 2 layers. In [37], an 8-layer LSTM is added with residual connections every 1 and 2 layers, respectively, and two types have their own advantages. In wind speed forecasting field, these two types remain to be verified by experiments.
This paper proposes a VMD-Stacked GRU model with residual connections to forecast the short-term global sea wind speed with multiple steps. The decomposition and predictor are designed based on analysis and experiments. Original wind speed data is complex and the VMD is used to decompose the wind speed data; it makes an adaptive decomposition that overcomes the defect caused by mode mixing in EMD-based models. A modified GRU is used as subseries predictor to improve its nonlinear modelling ability. The performances of the stacked GRU are improved by adding the residual connections between the last two layers. This improvement by adding residual connection is very novel in the field of wind speed forecasting. In addition, a lot of experiments are carried out on the European Reanalysis (ERA5) dataset. It has been proved to surpass the performances of several benchmark and baseline models. Different from most studies based on wind farm observation data, it supports the study of sea wind speeds. The experiment results show that the performance of the proposed model is better than those of some benchmark and baseline models.
This paper is organized as follows. Section 2 describes three methods involved in the proposed model. Section 3 details the proposed model’s architecture and evaluation criteria. Section 4 details the experiments and analysis that are used to obtain the best forecast performances. Section 5 provides discussion and Section 6 summarises the conclusions. An acronyms list is shown in Table 1.
2. Methodology
2.1. Variational Mode Decomposition
Wind speed series are nonlinear nonstationary signals which contain a variety of period characteristics. For example, the Fourier transform for hourly wind speed shows that it is not a 24-hour period, but there are many significant periods. It means that multiperiod wind speed cannot be represented by an instantaneous frequency. When forecasting wind speed directly, the complex periodicity will be disadvantageous to model learning. To understand the signals with complex periodic patterns, an effective strategy is to use IMFs, which are ideal functions with fixed instantaneous frequency. Since there is no complex periodicity, it is relatively easy to predict the IMFs.
To extract IMFs from the original series, the EMD adopts a completely different iteration method to deal with the original data adaptively [27]. But, in practice, there are some imperfections such as overshoots, undershoots, asymmetric wave forms, and ends swing in the results of EMD, which make them not the ideal IMFs [38]. In order to alleviate the above problems, the VMD is proposed to calculate the IMFs more accurately. By constructing and solving a constrained variational problem, the VMD obtains all modal components nonrecursively and improves the decomposition robustness to noise. Under the constraint that the summation over all modes is equal to the original signal, the sum of the all estimated bandwidths of modes is minimized, and the following optimization problem is constructed [31]:where and are the -th modal component and the center frequency after decomposition, respectively. , , . is the original time series, and is its mode decomposition number. is the estimated bandwidth of each modal component:where is the squared -norm of the gradient and represents the convolution operation. is the partial derivative operation, and is the Dirac distribution.
By using quadratic penalty factor and Lagrangian multipliers , the lowest point of this variational constraint problem is transformed into saddle point of augmented Lagrange equation defined as follows. The augmented Lagrange equation is shown as follows [31]. The equation can be iteratively calculated by the Alternating Direction Multiplier Algorithm.
2.2. Stacked Gate Recurrent Unit
The conventional machine learning methods deal with time series problem; each moment of a sample is regarded as a different independent random variable, and it is given into the regression model or neural network for training. However, these models assume that the data at different moments are independent of each other, and their sequence in time is not considered. The recurrent neural network (RNN) is proposed to capture this temporal correlation by using the machine learning. The GRU is a modified RNN based on the LSTM. When error signals propagate backwards through time in the conventional RNN, the signals tend to vanish or blow up, and both of the cases lead to the failure of the network to learn from data [28]. The GRU not only retains the ability to prevent the previously mentioned problems but also reduces the complexity of the structure without losing the efficient learning ability [39].
The structure of the GRU at each step is the GRU cell, which is shown in Figure 1. In this figure, the reset gate and the update gate are fully connected layers with sigmoid activation, which are used to control the memory. The previous hidden state preserves the past memory, the reset gate controls how to combine the input with the past memory to become a candidate hidden state, and the update gate controls how to add the candidate hidden state into the hidden state [39]. Finally, the candidate hidden state, previous hidden state, and output of the update gate constitute the current hidden state and output. The GRU cell can be expressed as follows:where is the hidden state at and , , , , and are the input of the GRU cell, output of the update gate, output of the reset gate, candidate hidden state, and hidden state at , respectively. and are the weight matrices of the fully connected layer, and is the bias vector. and tanh are sigmoid and tanh activation function, respectively. represents the element-wise product between two matrices of the same size. The GRU cell is shown in Figure 1.

To make the GRU work, its current hidden state is connected to the next hidden state input. In order to improve the learning ability, multiple GRU cells can be stacked along the input-output direction, and the output of the GRU cell at each step can be used as the input of the next GRU cell at corresponding step. Compared with single-layer GRU, stacked GRU has multiple hidden layers, which can improve the ability to learn time series. The structure of stacked GRU along the time axis is shown in Figure 2.

2.3. Residual Connections
With the appearance of normalization and dropout, the vanishing and exploding gradients problem of the stacked neural network is greatly alleviated, which makes the training of deep network no longer difficult. In theory, the learning ability of the stacked neural network increases with the number of layers, and its error also decreases until it remains unchanged. But actually, when the number of layers increases, the network’s performance will degrade rapidly. At present, stacked RNN, LSTM, or GRU generally has no more than four recurrent layers [36].
The latest research pointed out that overfitting is not the cause of stack network degradation. The assumption that the performance of a deep network is not lower than that of a shallow network is based on the ability of the deep part of the network identity mapping its input, in other words, the ability of the deep part of the network fitting [35]. However, artificial neural network has been proved to be difficult to apply in learning linear relationship [33]. In order to give the network layer this ability, the residual connections as shown in Figure 3 are proposed [35].

The red part in Figure 3 is the added residual connections, also known as skip connections or shortcut connections. After adding residual connections, the input of the network layer is directly superimposed with the output, and the layer is transformed from fitting to fitting . is the approximate identity mapping of ; the network layer is changed to learning the nonlinear residual of identity mapping. It is much easier for neural network to learn a group of nonlinear data close to zero compared to linear data.
The residual connections shown in Figure 3 were first used to solve the degradation problem of the deep convolutional neural network in image recognition, but they can also be applied to any stacked network. Figure 4 shows the stacked GRU structure with residual connections, which is the same as Google's stacked LSTM in its machine translation model [36]. Different from Figure 3, the residual connections in Figure 4 skip one GRU layer instead of two, and Add is set before the activation function. The GRU network layer inputs the second GRU layer after the element-wise addition of the output and input at each step. Each GRU layer with residual connections constitutes a residual block, which can be defined as follows:where the function is composed of equations (6)–(8), representing the -th GRU layer.

The residual connections can significantly improve the flow of gradients between network layers. In theory, the network with any number of layers can be trained after stacking residual blocks. But, in practical works, the sum of LSTM layers with and without residual network is no more than 8 [36].
3. The Proposed Model
3.1. Model Architecture
The proposed model architecture is shown in Figure 5. It contains three parts: data split, data decomposition, and components prediction. The process of the proposed model is summarized as follows:(1)The data split part splits the original wind speed series into three subsets: training set, validation set, and test set. The train-validation-test split percentage is 60%-20%-20%. The test set is considered unknown and does not participate in the training process; and the validation set is used to determinate hyperparameters. In order to speed up model training, we also normalize the three subsets to eliminate the range differences and accelerate the gradient descent. The maximum and minimum values of the training set are obtained to scale the validation set and test set.(2)The data decomposition part uses VMD to solve the constrained variational problem and then reconstruct the component series and calculates the decomposition residual. Correlated information remains in the decomposition residual, so it is necessary to set up a predictor for the decomposition residual. The number of subseries determines the number of predictors and total training time. In order to make a trade-off between the training time and forecasting accuracy, the wind speed series are decomposed into four subseries in this part under termination conditions = 1e-7.(3)In the components prediction part, since the subseries have different frequency characteristics, five stacked GRU models with residual connections are used to predict the subseries and a decomposition residual, respectively. The final forecast values are obtained by integrating the forecast outputs of all predictors. Since this paper is a short-term hourly forecasting; the data from the past 24 hours are highly related to the forecast values. Therefore, the data from 24 hours are used to make a 3-hours-ahead wind speed forecasting.

According to the above section, the residual connections should be set in the deep layer of the network, so we design a stacked GRU with four layers, and GRU layer 1 and GRU layer 2 are independent, while the residual connections are set at the input of GRU layer 3 and the output of GRU layer 4. The output of GRU layer 2 will be fed to the output of GRU layer 3 and added to it, and then the sum of them will be fed to the output of GRU layer 4 and added to it. In order to match the output of the stacked GRU with the desired output step size, we use a flatten layer to reshape the output into a one-dimensional vector, and then a dense layer with linear activation function is used for linear conversion. A detailed parameters determination is described in the Parametric Study section.
3.2. Evaluation Criteria
Time series forecasting can be converted into a supervised regression problem, so we use three regression metrics to evaluate the forecasting performance. These regression metrics are the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE). They are defined as follows:where represents the model inputs, is the forecast value, is the corresponding actual value, and is the number of actual values.
RMSE and MAE have the same physical dimensions as the original data and range from 0 to ∞. A lower RMSE or MAE means the model has a higher forecasting accuracy. MAE uses the absolute value to describe the gap between two curves, and the error of each prediction point has the same weight in the final error. Therefore, the MAE is less than the RMSE on the same data. Since the square term of the RMSE magnifies the error between two points, the large gap between the RMSE and MAE can indicate that some prediction points contribute significantly to the final error of the prediction curve. The MAPE is a dimensionless metric ranging from 0 to ∞, and a lower MAPE means the model has a higher forecasting accuracy. We use this metric because it considers the proportion of error in the total data and is able to evaluate performance of different models on the same dataset. In addition, 5-fold cross-validation strategy is used in the Result of Multistep Wind Speed Forecasting section.
4. Case Study
4.1. Datasets
Marine meteorological datasets are collected, sorted, and released by scientific research institutions in various countries. The use of the datasets varies greatly depending on the observation method, observation area, observation period, and observation elements. Selecting a high-quality, long-term, and high-resolution marine meteorological dataset is the premise of modelling. Reanalysis datasets are produced from the buoy and satellite observation data by determining the optimal estimation of the system state. The reanalysis datasets can be regarded as the real global ocean data and are currently used as the data source for the NWP.
Therefore, the ERA5 is selected in the case study. The ERA5 is the latest global meteorological dataset released by the European Centre for Medium-Range Weather Forecasts (ECMWF). The ERA5 provides hourly wind speed data in 137 levels from the surface up to a height of 80 km, covering the global land and ocean with 30 km grids [40]. The wind speeds are decomposed into u-component and v-component. The positive u-component of wind is eastward wind speed, and the negative counterpart is westward wind speed. The positive v-component of wind is northward wind speed, and the negative counterpart is southward wind speed.
In order to verify the applicability of the proposed model in global ocean, we use hourly wind speed u-component from 1 January 2016, 00 : 00, to 31 December 2017, 23 : 00, which includes 17544 hours to make forecast experiments. The forecast areas are located in the Pacific, Indian, and Atlantic Oceans, respectively. Figure 6 shows the coordinates of three forecast areas and their surrounding areas on the map, as well as the heat map of wind speed in January 2016. The map in Figure 6 is drawn based on the National Oceanic and Atmospheric Administration (NOAA) Panoply software. The statistical indices of three areas in two years are shown in Table 2.

To further analyse the characteristics of the dataset, Figure 7 shows the original time series, as well as [0, 1] normalized trends and seasonality of the first month with 24 hours as period. The trendy series present great changes in a month. The seasonal series do not show repetitive patterns, which indicates that the wind speed data contain multiple seasonalities. Therefore, it is necessary to use decomposition methods that support multiple frequencies.

In Table 2, the positive average wind speed means eastward wind speed, and the negative means westward wind speed. It can be seen from Table 2 that the maximum westward wind speeds in areas 2 and 3 are significantly higher than the maximum eastward wind speed, while they are similar in area 1. Area 2 has the highest average wind speed and the largest standard deviation. In addition, although the maximum wind speed in area 3 is just 9.3708 m·s−1, its minimum value is 0.1613 m·s−1, much higher than other areas. According to the statistical indices of trends and seasonality, area 1 and area 2 have similar seasonal standard deviations, and the higher volatility of area 2 is attributed to its trendy part. Area 3 has the lowest trendy and seasonal standard deviations among the three areas.
4.2. Comparison between Decomposition-Based Models
In order to prove that the VMD is superior to other decomposition methods and that the performance of the stacked GRU is improved by the residual connections as a component predictor, the following experiments are carried out. All of data are normalized when passed to the model for training.
First, a group of experiments are carried out in area 1 to prove that the proposed combination of the VMD and the stacked GRU is superior to the combinations of the other decomposition and prediction models. In the experiment, three decomposition methods and four prediction models are cross-combined. The WT, EMD, and VMD are selected. Among them, the decomposition level of the WT is 4, which means that the wind speed sequence will be decomposed into an approximate component, four detail components, and a decomposition residual sequence. The EMD processes the sequence adaptively, so the wind data in area 1 is decomposed into 9 to 11 subseries. Therefore, the highest frequent subseries will be discarded until the number of all subseries does not exceed 9. The VMD decomposes wind data into four subseries under termination conditions = 1e-7.
The prediction methods include the LSTM, GRU, stacked LSTM, and stacked GRU. The LSTM and GRU are designed as a single-layer structure with 512 neural units. The stacked LSTM and stacked GRU are designed as four layers with 512, 32, 32, and 32 neural units in each layer, respectively. The batch size is 25 and Adam optimizer’s learning rate is 0.001. The results are shown in Table 3.
The three metrics of the EMD are slightly lower than those of WT. There are some exceptions in Figure 3. For example, when the EMD is combined with the GRU and stacked LSTM, the RMSE is 0.2709 and 0.4361, while when the WT is combined with them, the RMSE is 0.2574 and 0.2982. It can be seen from the table that the VMD is better than WT and EMD in all combinations.
The RMSE of GRU is lower than LSTM under the three decomposition methods, and the RMSE of stacked GRU is lower than the stacked LSTM. This shows that GRU shows better performance than the LSTM in both single layer and multiple layers. However, the metrics of stacked GRU are higher than GRU, and the metrics of stacked LSTM are higher than LSTM. For example, the 1-step RMSE of VMD-Stacked GRU is 0.1436, and the 1-step RMSE of VMD-GRU is 0.1385. This shows that the stacked GRU and stacked LSTM are degraded when combined with the VMD. This degradation is found in the models based on all three decomposition methods.
4.3. Improvement of VMD-Stacked GRU by Residual Connections
After the above analysis, the VMD-GRU is determined as the best combination of decomposition-prediction methods, and the VMD-Stacked GRU is determined as the second best combination. In order to confirm that residual connections improved VMD-Stacked GRU to make it surpass the VMD-GRU method, a comparative experiment was carried out. In the experiment, two kinds of residual connections were used. Residual connections (a) represent the structure shown in Figure 4 and equation (8), and residual connections (b) change the single-layer skipping to double-layer skipping. The results are shown in Table 4.
The VMD-Stacked GRU with residual connections (a) outperforms that with residual connections (b) in most metrics. The VMD-stacked GRU with residual connections (a) perform slightly worst than that with residual connections (b) only on the 1st and 3rd steps of Area 1 and the 3rd step of Area 3. Therefore, it can be considered that residual connections (a) are more suitable than residual connections (b) for wind speed prediction tasks. The VMD-Stacked GRU with residual connections (a) outperforms VMD-GRU in most metrics. It is illustrated that residual connections (a) solve the degradation of VMD-Stacked GRU and make it surpass VMD-GRU.
4.4. Parametric study
To determine the optimal parameters of the proposed model, a detailed parametric study is carried out. The parameters to be determined are the network parameters and training parameters of the stacked GRU. The network parameters include the number of hidden neural units of each layer. The training parameters include batch size, maximum epochs, and the learning rate of the optimizer.
The training parameters are first determined, followed by the network parameters. When determining the training parameters, the network parameters are set in advance according to the latest research. In the article in [4], the optimal two-layer GRU is determined with 512 units in the first layer and 32 units in the second layer. Since the Adam optimizer outperforms classical optimizers such as RMSProp, SGD, and Adagrad [41], it is adopted in the experiment, and its learning rate is searched in {0.1, 0.001, 0.0001}. Batch size is searched in {8, 16, 25, 32, 64} since a study in [42] showed that a smaller batch helps to model training. We set maximum epochs = 50 and use TensorFlow 2.3.1’s callback function [43] to monitor the lowest loss value of the validation set in epochs iteration. The parametric study is carried out on area 1, and results are average of three time steps. The RMSE results are shown in Figure 8.

The above results show that the best configuration of training parameters is batch size = 24 and learning rate = 0.001. The stacked GRU network parameters are, respectively, marked as (a, a), (a, a, a), and (a, b, b, b) according to different layers. For example, (a, a) represents two-layer GRU, and the units number is a; and means that there are residual connections in this layer. Units search is firstly conducted in {10, 100, 200, …, 600} and then a more accurate search is conducted in the best interval. The RMSE results are shown in Figure 9.

(a)

(b)

(c)

(d)
It can be seen that (500, 50, 50, 50) are the optimal network parameters. Above all, the best parametric configuration set is shown in Table 5.
4.5. Result of Multistep Wind Speed Forecasting
To prove the superiority of the proposed model VMD-Stacked GRU with residual connections, we choose seven time series and machine learning models, as well as a published EMD-based model [26], as baseline models. These models directly learn wind speed series without composition. Through Autocorrelation Function and Partial Autocorrelation Function diagrams, the ARIMA parameters p = 2, d = 1, and q = 12 are determined. The support vector machine regression (SVR) uses RBF kernel and establishes the relationship between past information and each forecast time step. The structure of MLP is a four-layer structure with 400, 32, 32, and 32 units in each layer, respectively. The EMD-PE-ANN reconstructs IMFs into IMF1, IMF2, and . 5-Fold cross-validation strategy is used to obtain the results in Tables 6–8. It can be seen from Tables 6–8 that, compared with predicting wind speed directly, the proposed model has lower error metrics at three time steps in three areas. Most of the error metrics of GRU are lower than those of LSTM, especially in area 2. Most of the error metrics of stacked GRU are lower than those of stacked LSTM, and the RMSE is lower than stacked LSTM only on the 3rd step of area 1 and the first step of area 3. When directly predicting wind speed, the stacked model also has a less obvious degradation effect. For example, compared with the GRU, the RMSE of the stacked GRU shows degradation in the 3 steps of area 1 and the 1st step of area 2 and area 3. In addition, although not surpassing the proposed model, the two classic methods, ARIMA and SVR, have relatively good metrics which are close to GRU.
To further illustrate the superiority of the proposed model, the following figures show the comparison curves of models in area 1. It can be seen from Figure 10 that the fitting effect of the prediction curve (red line) of the proposed model is significantly higher than those of the other models. The values at the last input time step (Persistence) and the overall mean values of inputs (Average) are also added in the figure as benchmarks, and the experiment results show that the proposed model surpasses the benchmarks.

(a)

(b)

(c)
5. Discussion
The case study concludes that, compared with other direct prediction or decomposition-based prediction models, the proposed VMD-Stacked GRU model with residual connections is more accurate in multistep forecasting. The proposed model performs well on the ERA5 sea surface wind speed datasets of three ocean areas around the world. Compared with the classic model WT-LSTM, the proposed model has lower errors metrics. According to the case study, the superior performance of the proposed model is due to the three following reasons:(1)The VMD is an excellent decomposition method, and its error in combination with LSTM, GRU, stacked LSTM, and stacked GRU is lower than the combination of WT or EMD and these methods.(2)The GRU is a better forecasting model than LSTM. In the cases of direct prediction, direct prediction after stacking, and decomposition-based prediction after stacking, the GRU’s error metrics are lower than LSTM.(3)The residual connections can alleviate the degradation of stacked GRU and improve its learning ability. The overall error metrics of VMD-Stacked GRU with residual connections are lower than those of VMD-GRU and VMD-Stacked GRU without residual connections.
6. Conclusions
The sea wind speed forecasting is a key part to guarantee safety of sailing ships. To solve the problems of hourly short-term wind speed forecasting, an ensemble model based on the VMD and stacked GRU is proposed, and the residual connections are used to improve stacked GRU. The model uses VMD to decompose the wind speed series and then uses the stacked GRU model with residual connections to predict each component. In order to prove the performance of the proposed model, three cases from the Pacific, Indian, and Atlantic Oceans are studied. In the experiment, three error metrics, RMSE, MAE, and MAPE, are used to evaluate each time step. Through the case studies, the following conclusions can be illustrated:(1)Separately predicting the decomposed wind speed sequence and then superimposing it as the final result can improve the prediction effect, and VMD is the most effective one of the various decomposition methods.(2)The forecast error metrics of VMD-Stacked GRU with residual connections are generally lower than those of ARIMA, SVR, MLP, LSTM, GRU, stacked LSTM, and stacked GRU models at the 1st, 2nd, and 3rd steps.
Data Availability
The wind speed data used to support the findings of this study are freely available and supplied by ECMWF. Requests for access to these data should be made through ECMWF website: https://cds.climate.copernicus.eu/cdsapp.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this study.