Abstract
Sea surface temperature (SST) forecasting is the task of predicting future values of an SST sequence from historical SST data, which is beneficial for observing and studying hydroclimatic variability. Most previous studies ignore spatial information in SST prediction, and existing forecasting models are limited in their ability to process large-scale SST data. This paper proposes a novel SST prediction model that integrates a Deep Gated Recurrent Unit and a Convolutional Neural Network (DGCnetwork). The DGCnetwork has a compact structure and focuses on learning deep long-term dependencies in SST time series. Both temporal and spatial information are included in our procedure. The Differential Evolution algorithm is applied to configure the DGCnetwork’s optimum architecture. Optimum Interpolation Sea Surface Temperature (OISST) data, which has good temporal homogeneity and feature resolution, is selected for the experiments. The experiments demonstrate that the DGCnetwork obtains excellent forecasting results and predicts SST flexibly and accurately over different predicting lengths. On the East China Sea dataset and the Yellow Sea dataset, the accuracy of the prediction results is above 98% overall and all mean absolute error (MAE) values are lower than 0.33°C. Compared with the other models, the root mean square error (RMSE), root mean square percentage error (RMSPE), and mean absolute percentage error (MAPE) of the proposed approach are reduced by at least 0.1154, 0.2594, and 0.3938, respectively. The experiments on SST time series show that the DGCnetwork model achieves good prediction results, better performance, and stronger stability than the compared models, reaching state-of-the-art performance.
1. Introduction
Analyzing sea surface temperature (SST), an essential parameter for studying the marine ecosystem and global climate, can efficiently help us to explore ocean conditions and understand climate dynamics. SST has long been reported to play a role in different fields of science, such as providing significant predictive information about hydroclimatic variability [1–3], supplying a basis for revealing the spatial distribution of biological environmental factors [4], and serving as an indicator to observe and monitor marine disasters [5, 6]. Because of large variations in heat flux, radiation, and diurnal wind near the sea surface, the prediction of SST has always been a highly uncertain issue.
In recent years, many methods have been developed for SST prediction. There are primarily two types of forecasting strategies: physical techniques and statistical techniques [7]. The former target the physical properties of the ocean, using a series of differential equations to describe the SST data. Statistical models, including linear regression [8], orthogonal functions [9], support vector machines (SVM) [10, 11], and artificial neural networks (ANN) [7], are extensively used time series-based approaches for SST prediction. These models are designed to predict SST time series by establishing a relationship between historical values and a predictor. Previous studies found that the SST prediction result is often unstable. Traditional methods also have disadvantages in processing large-scale SST data, such as slow speed, difficulty in fitting, and high memory and computing-time requirements.
Recurrent Neural Networks (RNNs), such as Long Short-Term Memory (LSTM) [12, 13] and Gated Recurrent Unit (GRU) [14–16], have been shown to achieve state-of-the-art results in many applications with time series or sequential data. RNNs enjoy several useful properties, such as strong prediction performance and the ability to capture long-term temporal dependencies and handle variable-length observations. LSTM and GRU introduce a gate mechanism to overcome the vanishing and exploding gradient problems that traditional RNNs face when learning long-term dependencies. The GRU network has a simpler structure and trains faster than LSTM, and it performs well in sequence learning tasks [6, 17]. Recently, SST prediction has progressed further with the advent of deep learning [15] and neural network methods. Zhang et al. [13] adopted LSTM to predict SST and obtained good prediction results. However, the existing studies have three problems. Firstly, a model with a single network layer is limited in its ability to mine the information in time series. Secondly, current studies do not consider the temporal and spatial characteristics of SST time series simultaneously. In other words, the isolated prediction of each point ignores the interaction between the SSTs of different points. Thirdly, previous approaches do not take into account an optimization strategy for the parameters of the prediction model.
In our work, an innovative approach is constructed for SST prediction: the Deep Gated Recurrent Unit and Convolutional Neural Network (DGCnetwork). The DGCnetwork model combines a deep GRU and a CNN. The deep GRU layers and the convolutional layer are used to extract the deep hidden temporal features and the spatial characteristics of SST data, respectively. We apply one fully connected layer to combine all features into global features and map the output of the previous layer to a final prediction. Increasing the depth of a neural network is an effective way to improve its overall performance [15]. Because the proposed model has a more compact representation than a single-layer network, it is expected to generalize and perform better when applied to the prediction of SST data. In addition, both temporal and spatial information are included in our procedure. Research shows that the SST of a specific point interacts with the SST of its surrounding points [4, 18]. Therefore, when we predict the SST of a certain location, the proposed approach also uses the historical SST information of its nearby locations.
The efficiency of the DGCnetwork depends on several hyperparameters, namely, the number of neurons in every layer and the number of epochs. Without appropriate network parameters, training slows down and the network is prone to becoming trapped in a nearby local minimum. Because the initial values of hyperparameters play a vital role in the training outputs of the neural network [19, 20], we adopt the Differential Evolution (DE) algorithm to select the optimal hyperparameters of the proposed model. DE leverages both the local information of individuals and the global information of the population to search for the optimal solution, and it has been widely applied [21, 22].
The remainder of the paper is organized as follows. Section 2 explains the procedures of the DGCnetwork predicting model in detail. Section 3 provides the experimental results and discussions. Finally, Section 4 summarizes the conclusions.
2. Methodology
2.1. The DGCnetwork
To solve the task of SST time series prediction, this paper proposes the DGCnetwork model, a deep learning model built from a deep GRU and a CNN. The DGCnetwork architecture, which includes multiple GRU layers, one CNN layer, and one fully connected layer, adapts to the nonlinearity and complexity of SST time series data through learning. After the prediction point is selected, we express the SST time series of the prediction point and its nearest points in matrix form as the input to the model. In the model, each GRU layer operates at a different time scale and the CNN layer captures spatial features. The fully connected layer combines all features into global features and maps the output of the previous layer to a final prediction. Each layer processes a certain part of the prediction task. The output of each GRU layer is the input of the next GRU layer. The output of the last GRU layer is the input of the CNN layer, and the fully connected layer finally generates the prediction result. As such, the model is an end-to-end prediction network. Stacking GRU layers adds, to the recurrent connections between units within a layer, feed-forward connections between the units of one GRU layer and those of the layer above, which is helpful for modeling large-scale SST time series. This enables improved learning of more sophisticated conditional distributions of SST time series data. It also allows hierarchical processing of difficult temporal tasks and captures the deep features of data sequences more naturally. The hyperparameters of the network layers are chosen by the DE algorithm.
As shown in Figure 1, the DGCnetwork architecture has three GRU layers, one CNN layer, and one fully connected layer. We define an SST time series as $X = (x_1, x_2, \ldots, x_t, \ldots, x_n)$, where $x_t$ represents the SST value at time $t$ and $n$ is the length of the SST time series. Multiple time series constitute the input matrix $M = (X_1, X_2, \ldots, X_9)$, where $X_1$ is the series of the predicting point and $X_2, \ldots, X_9$ are the series of the surrounding points. In the DGCnetwork architecture, the input at time $t$, $M_t$, is introduced to the first GRU layer along with the previous hidden state $h_{t-1}^{(1)}$, where the superscript (1) denotes the first GRU layer. The hidden state at time $t$, $h_t^{(1)}$, is computed from $M_t$ and $h_{t-1}^{(1)}$, as shown in Section 2.2. $h_t^{(1)}$ goes forward to time $t + 1$ and also moves forward to the second GRU layer. $h_t^{(2)}$ in GRU layer 2 is computed from $h_t^{(1)}$ and $h_{t-1}^{(2)}$; it goes forward to time $t + 1$ and also moves forward to the third GRU layer in the same way. The output of the third GRU layer is the input of the CNN layer, whose output is computed as shown in Section 2.3. The output of the CNN layer is the input of the fully connected layer. Finally, the predicted value $y_t$ is obtained by the fully connected layer.
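A minimal sketch of how the input matrix M might be assembled from gridded data. It assumes that the eight surrounding points are the neighbours in a 3 × 3 grid window around the prediction point; the text fixes only their number, not their layout. The array `sst_grid`, the function name `build_input_matrix`, and the ordering of the neighbour columns are illustrative assumptions.

```python
import numpy as np

def build_input_matrix(sst_grid, row, col):
    """Return an (n, 9) matrix whose first column is the prediction point.

    sst_grid: array of shape (n, rows, cols) holding SST values over time.
    (row, col): grid index of the prediction point. It must not lie on the
    border, so that all eight neighbours exist (an assumption of this sketch).
    """
    n, rows, cols = sst_grid.shape
    if not (1 <= row < rows - 1 and 1 <= col < cols - 1):
        raise ValueError("prediction point needs eight neighbours")
    # 3 x 3 window around the prediction point, flattened per time step.
    window = sst_grid[:, row - 1:row + 2, col - 1:col + 2].reshape(n, 9)
    # Index 4 is the centre of the flattened window; move it to the front so
    # that column 0 plays the role of X1 and columns 1..8 the roles of X2..X9.
    order = [4] + [k for k in range(9) if k != 4]
    return window[:, order]

# Synthetic data for illustration only: 10 time steps on a 5 x 5 grid.
rng = np.random.default_rng(0)
grid = 20.0 + rng.normal(size=(10, 5, 5))
M = build_input_matrix(grid, row=2, col=2)
print(M.shape)  # (10, 9)
```

Each row of `M` then corresponds to the input $M_t$ that is fed to the first GRU layer at one time step.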

Our proposed DGCnetwork model has three advantages. First, each layer processes some part of the prediction task: the GRU layers pass their output on to the CNN layer, and the fully connected layer finally provides the predicted SST value. Secondly, the hidden state at each level of the model is allowed to operate at a different time scale, which helps mine the deep spatial-temporal features of the data. Thirdly, the optimal hyperparameters of the model are selected directly by the DE algorithm. These three advantages are of great benefit when handling the prediction of large-scale SST time series data.
2.2. Temporal Feature Extraction by Gated Recurrent Unit
This paper adopts GRU to capture the temporal relationship among SST time series data. GRU, first proposed by Bahdanau et al. [16], is more accurate than conventional RNNs and simpler than LSTM. In the topological structure of GRU, the forget gate and the input gate are integrated into an update gate. GRU merges the cell state with the hidden state, and the information flow inside it is modulated by the reset gate and the update gate. As illustrated in Figure 2, $r_t$ and $z_t$ are the reset gate and update gate, respectively, and $h_t$ and $\tilde{h}_t$ represent the activity value and the candidate activity value, respectively. The gate mechanism allows the network to extract the temporal relationship among time series data.

The reset gate $r_t$ controls the influence of the previous hidden state $h_{t-1}$ on the current information $x_t$, which determines how much past information is forgotten. If the value of $r_t$ approximates 0, the information of the previous hidden state is discarded.
The update gate $z_t$ controls the importance of the past hidden state $h_{t-1}$ for the present state $h_t$. If the value of $z_t$ is always approximately 1, the information of $h_{t-1}$ is always saved through time and passed to $h_t$. This allows the gradient to propagate backward effectively, alleviating the vanishing gradient problem of RNNs. The whole computation can be defined by a series of equations as follows:

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$
$$\tilde{h}_t = \tanh(W_{\tilde{h}} \cdot [r_t * h_{t-1}, x_t])$$
$$h_t = z_t * h_{t-1} + (1 - z_t) * \tilde{h}_t$$

where $\sigma$ denotes the sigmoid function; $W_r$, $W_z$, and $W_{\tilde{h}}$ are the recurrent weight matrices; $[\,]$ denotes the concatenation of two vectors; and $*$ is element-wise multiplication.
The feature values must be fed in chronological order when GRU networks process the SST time series. Both the sigmoid function and the hyperbolic tangent function tanh are adopted as activation functions in the structure. During the training process, the loss of the objective function on the training set is minimized.
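The gate mechanism described in this section can be illustrated with a single GRU step in NumPy. This is a sketch under stated assumptions, not the implementation used in the experiments: it omits bias terms (the text mentions only weight matrices), follows the convention in which an update gate close to 1 keeps the previous hidden state, and uses random weights and an arbitrary hidden size. The names `gru_step`, `W_r`, `W_z`, and `W_h` are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU step without bias terms.

    Each weight matrix has shape (hidden, hidden + inputs) and acts on the
    concatenation of a hidden vector and the current input.
    """
    concat = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ concat)   # reset gate
    z_t = sigmoid(W_z @ concat)   # update gate
    # Candidate state: the reset gate scales the previous hidden state.
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))
    # z_t close to 1 keeps the previous state; z_t close to 0 replaces it.
    return z_t * h_prev + (1.0 - z_t) * h_cand

rng = np.random.default_rng(0)
inputs, hidden = 9, 4   # 9 series per time step; the hidden size is arbitrary
W_r, W_z, W_h = (rng.normal(scale=0.1, size=(hidden, hidden + inputs))
                 for _ in range(3))

h = np.zeros(hidden)
for x_t in rng.normal(size=(7, inputs)):   # 7 time steps of synthetic input
    h = gru_step(x_t, h, W_r, W_z, W_h)    # inputs are consumed in time order
print(h.shape)  # (4,)
```

Stacking layers, as in the DGCnetwork, amounts to feeding the hidden state of one layer as the input `x_t` of the next layer at the same time step.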
2.3. Spatial Feature Extraction by Convolutional Neural Network
CNN is a special structure of ANN that can deal with high-dimensional data. It is generally used in image recognition, recommender systems, and natural language processing [23]. Since the SSTs of adjacent positions interact, this paper combines the historical SST information of the prediction point and its surrounding points to forecast the target point. In the proposed model, we apply the CNN layer as a module to mine the spatial information of SST time series (Figure 3). After the matrix M is processed by the GRU layers, the resulting matrix M′ is input into the CNN layer. To begin with, multiple two-dimensional matrices at different time periods are stacked into three-dimensional matrix blocks. Then, spatial features are extracted by sliding the convolution kernels over these blocks in the convolution layer. Afterwards, the outputs of the convolution operation are passed to the pooling process. The role of the pooling layer is to lower the computational burden and improve operating efficiency by compressing the feature map. Finally, the abstract feature set is flattened to a one-dimensional vector and connected to the fully connected layer. CNN has the advantages of local perception, sparse interactions, and parameter sharing. Its weight-sharing network structure makes it more similar to a biological neural network, and it has achieved good results in time series research [24]. The output of the CNN layer can be written as follows:

$$x_j^{l} = f\Big(\sum_{i \in Z_j} x_i^{l-1} \otimes k_{ij}^{l} + b_j^{l}\Big)$$

where $x_j^{l}$ is the $j$-th output map of layer $l$, $f$ is the activation function, $\otimes$ denotes convolution, $k_{ij}^{l}$ is the kernel connecting input map $i$ to output map $j$, and $Z_j$ is the collection of input maps. Each output map is given an additive bias $b$; however, for a particular output map, the input maps are convolved with distinct kernels. That is, when output maps $j$ and $k$ both sum over input map $i$, the kernels applied to map $i$ are different for output maps $j$ and $k$.
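The convolution, pooling, and flattening steps described above can be sketched in NumPy as follows. The kernel size, the number of output maps, the ReLU activation, and the pooling size are all assumptions made for illustration; the text does not specify them. The function names `conv2d_valid` and `max_pool` are hypothetical.

```python
import numpy as np

def conv2d_valid(inputs, kernels, bias):
    """Convolution layer followed by ReLU (the activation is an assumption).

    inputs:  (c_in, H, W) input maps
    kernels: (c_out, c_in, kh, kw), a distinct kernel per (output, input) pair
    bias:    (c_out,) additive bias per output map
    """
    c_out, c_in, kh, kw = kernels.shape
    _, H, W = inputs.shape
    out = np.zeros((c_out, H - kh + 1, W - kw + 1))
    for j in range(c_out):
        for r in range(out.shape[1]):
            for c in range(out.shape[2]):
                patch = inputs[:, r:r + kh, c:c + kw]
                # Sum over all input maps (cross-correlation, as in most deep
                # learning libraries), then add the bias of output map j.
                out[j, r, c] = np.sum(patch * kernels[j]) + bias[j]
    return np.maximum(out, 0.0)

def max_pool(maps, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    c, H, W = maps.shape
    H2, W2 = H // size, W // size
    trimmed = maps[:, :H2 * size, :W2 * size]
    return trimmed.reshape(c, H2, size, W2, size).max(axis=(2, 4))

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 7, 9))      # one input map: 7 time steps x 9 points
k = rng.normal(size=(2, 1, 2, 2))   # two output maps, 2 x 2 kernels
features = max_pool(conv2d_valid(x, k, np.zeros(2)))
flat = features.reshape(-1)         # vector for the fully connected layer
print(features.shape, flat.shape)   # (2, 3, 4) (24,)
```

Pooling halves each spatial dimension here, which is the compression of the feature map that lowers the computational burden of the following layer.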

2.4. Optimization of Network Parameters by Differential Evolution Algorithm
Several decision parameters must be optimized for the DGCnetwork’s training. This paper applies the DE algorithm to select the optimal value of each hyperparameter in the predicting model, including the number of neurons in the GRU layers and the number of epochs. This optimization strategy makes it convenient to find the model structure that minimizes the difference between the predicted and actual values. The DE algorithm is a simple, population-based, direct-search algorithm for optimizing multimodal functions [25]. DE is reliable owing to its ability to reach global optimum values and its rapid convergence with few control parameters. Previous research states that DE outperforms several other well-known optimization algorithms in terms of convergence speed and stability [26]. The standard DE consists of four main operations: initialization, mutation, crossover, and selection. Through these four operations, the population evolves toward higher fitness in order to reach the optimal solution.
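The four DE operations can be sketched in plain Python as follows. This is a generic DE/rand/1/bin variant, chosen here as an assumption because the text does not state which DE strategy or control parameters were used. The objective `toy_error` is a hypothetical stand-in for the validation error of a trained network, and the values of `F`, `CR`, `pop_size`, and the bounds are illustrative.

```python
import random

def differential_evolution(fitness, bounds, pop_size=20, F=0.5, CR=0.9,
                           generations=100, seed=0):
    """Minimise `fitness` inside the box `bounds` with DE/rand/1/bin."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Initialization: uniform random individuals inside the bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [fitness(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i.
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            mutant = [pop[a][d] + F * (pop[b][d] - pop[c][d]) for d in range(dim)]
            # Keep the mutant inside the search box.
            mutant = [min(max(v, lo), hi) for v, (lo, hi) in zip(mutant, bounds)]
            # Crossover: binomial, with one dimension always from the mutant.
            forced = rng.randrange(dim)
            trial = [mutant[d] if (rng.random() < CR or d == forced) else pop[i][d]
                     for d in range(dim)]
            # Selection: keep the trial only if it is at least as good.
            trial_score = fitness(trial)
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

# Hypothetical objective with its minimum at 16 neurons and 100 epochs.
def toy_error(params):
    neurons, epochs = params
    return (neurons - 16.0) ** 2 + ((epochs - 100.0) / 10.0) ** 2

best, err = differential_evolution(toy_error, bounds=[(1, 64), (10, 300)])
print([round(v) for v in best], err)
```

In a real hyperparameter search, each fitness evaluation would train the network with the rounded integer values and return its error, which makes every evaluation far more expensive than in this toy example.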
3. Experiments
3.1. Data and Software
The data used in our research is the Optimum Interpolation Sea Surface Temperature (OISST), an optimally interpolated SST product from the National Oceanic and Atmospheric Administration (NOAA). Because OISST has good temporal homogeneity and feature resolution [27], it is applied to the analysis and prediction of time series in our work; studying the OISST data is also helpful for researching oceanic features. The data we use is global grid data with a spatial resolution of 1° × 1° and a daily time resolution. We choose the East China Sea and the Yellow Sea as the experimental regions (see Figure 4) and create two SST datasets, the East China Sea dataset and the Yellow Sea dataset. Six points are randomly selected on the two datasets. The time span is from January 1, 2001, to July 15, 2017 (6,040 days).
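As a quick check of the stated series length, the number of daily records can be recomputed with the standard library, assuming that both the first and the last day are included:

```python
from datetime import date

# Inclusive count of days from January 1, 2001, to July 15, 2017.
n_days = (date(2017, 7, 15) - date(2001, 1, 1)).days + 1
print(n_days)  # 6040
```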

SST data preprocessing and handling are conducted in Python 3.6, relying on the packages numpy and pandas. The deep learning GRU and CNN networks are implemented with Keras, a high-level neural networks API written in Python and capable of running on top of TensorFlow, CNTK, or Theano.
3.2. Evaluation Standard
In this study, five different indexes are used to evaluate the forecasting precision, error, and performance of the prediction task [28].
Root mean square error (RMSE):
Root mean square percentage error (RMSPE):
Mean absolute percentage error (MAPE):
Mean absolute error (MAE):
Accuracy (ACC):
Among them, $y$ and $\hat{y}$ represent the true value and its predicted value, respectively. The degree of freedom in RMSE is N − L + 1 − i, where N is the number of samples, L is the length of observations, and i represents the number of independent variables. In this paper, i = 2. A smaller RMSE with a larger degree of freedom indicates that the model is more effective and universal [29, 30]. For RMSPE, MAPE, and MAE, values closer to 0 imply higher accuracy of the predicting model. The range of ACC is [0, 1], and a value closer to 1 corresponds to better performance of the forecasting model. It is widely demonstrated in the previous literature that these five measures are appropriate tools to assess the performance of a forecasting model [31].
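For orientation, the four error measures can be sketched with their textbook definitions. These are assumptions rather than the exact formulas of this paper: the sketch averages over the number of samples (the RMSE used here may instead divide by the degree of freedom given above), and it expresses RMSPE and MAPE in percent. ACC is omitted because its definition cannot be recovered from the text.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error, in the units of the data."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmspe(y_true, y_pred):
    """Root mean square percentage error, in percent (true values must be nonzero)."""
    return 100.0 * math.sqrt(
        sum(((t - p) / t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (true values must be nonzero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error, in the units of the data."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Two synthetic SST values in degrees Celsius, for illustration only.
y_true = [20.0, 25.0]
y_pred = [19.0, 26.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred))  # 1.0 1.0
```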
3.3. Results and Analysis
Several important settings of the DGCnetwork model must be determined beforehand. Firstly, we use early stopping as a further mechanism to prevent overfitting, and set the maximum early stopping duration to 15. Secondly, the data is split into training, testing, and validation sets following the ratio of 3 : 1 : 1. The training set is used for training, and the reported results are obtained on the validation set. Furthermore, we set the batch size to 40 in the experiments.
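The split and the early stopping rule can be sketched as follows. Two points are assumptions of this sketch: the split is chronological (the text gives only the ratio), and the early stopping duration of 15 is read as a patience, i.e., the number of consecutive epochs without improvement after which training stops. The function names are hypothetical.

```python
def split_3_1_1(samples):
    """Split a sequence into training, testing, and validation parts (3 : 1 : 1)."""
    n = len(samples)
    n_train = n * 3 // 5
    n_test = n // 5
    return (samples[:n_train],
            samples[n_train:n_train + n_test],
            samples[n_train + n_test:])

def early_stopping_epoch(losses, patience=15):
    """Return the index of the epoch at which training would stop.

    Training stops once `patience` consecutive epochs bring no improvement
    over the best loss seen so far; otherwise it runs to the last epoch.
    """
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(losses) - 1

# Day indices stand in for samples here; real samples are windows of days.
train, test, val = split_3_1_1(list(range(6040)))
print(len(train), len(test), len(val))  # 3624 1208 1208
```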
We now present the quantitative and visual results of the proposed DGCnetwork. The results shown in all tables and figures indicate the performance of the model on the validation set. This follows the widely accepted principle that the genuine evaluation of forecasting performance should be based on unseen data, not on the historical (training and testing) data already seen by the model [31].
In the experiment, we use different lengths of historical observations to predict the future SST value. The length of historical observations is denoted as H. In general, if H is too small, there may not be sufficient sequence information to predict future SST values; conversely, as H increases, there may be more noise in the training samples [10, 32]. We first apply the DGCnetwork to predict the SST one day ahead, with H ranging from 1 day to 60 days. Figures 5(a) and 6(a) show the forecasting accuracy on the East China Sea dataset and the Yellow Sea dataset, respectively. The accuracy of the six points on the two datasets is above 98% for every H. The experiments show that the length of historical observations has little effect on the prediction accuracy when the predicting length is one day. Then, this paper adopts the DGCnetwork to predict the SST one week ahead, with H ranging from 7 days to 60 days (as shown in Figures 5(b) and 6(b)). That is to say, SST data from the past H days are used to forecast the value for the seventh day in the future. Because of the problem of insufficient information, our experiment does not cover the case where H is less than the predicting length. It is worth mentioning that, as H increases, the forecasting accuracy tends to rise. This could be attributed to the fact that more sequence information is needed when predicting over a longer length. Overall, whether forecasting the SST value of the first day or the seventh day in the future with different H, the predictions on the two datasets achieve satisfying accuracy (98%∼99%). Moreover, it is interesting that the accuracy of p1 is better than that of p2 and p3 on the East China Sea dataset. Temperature changes in the open sea are relatively stable, while fluctuations in coastal water temperature are greater. By observing the location of the three points on the map, we can see that p1 is farther away from the coast than p2 and p3. This suggests that the temperature changes at p1 are relatively stable; therefore, the forecast performance at p1 is better than at p2 and p3. On the Yellow Sea dataset, we obtain the same finding: the forecast accuracy of p5 is better than that of p4 and p6, which are near the land.
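The way samples are formed from H past days and a given predicting length can be sketched as a sliding window. The layout is an assumption: the input is an (n, 9) matrix whose first column is the prediction point, and the target is the value of that point `horizon` days after the last day of the window. The function name `make_samples` is hypothetical.

```python
import numpy as np

def make_samples(M, H, horizon):
    """Build (inputs, targets) from an (n, 9) matrix M.

    inputs[k]  holds rows k .. k+H-1 of M (the past H days, all 9 points);
    targets[k] is the prediction point (column 0) `horizon` days after the
    last day of that window.
    """
    n = M.shape[0]
    count = n - H - horizon + 1
    if count <= 0:
        raise ValueError("series too short for this H and horizon")
    inputs = np.stack([M[k:k + H] for k in range(count)])
    targets = M[H + horizon - 1:H + horizon - 1 + count, 0]
    return inputs, targets

# Synthetic matrix for illustration: 100 days, 9 points.
M = np.arange(100 * 9, dtype=float).reshape(100, 9)
X, y = make_samples(M, H=30, horizon=7)   # 30 past days, seventh day ahead
print(X.shape, y.shape)  # (64, 30, 9) (64,)
```

With `horizon=1` the same function yields the one-day-ahead setting, and the number of available samples shrinks as either H or the predicting length grows.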

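[Figure 5: Forecasting accuracy on the East China Sea dataset for different H: (a) predicting length of one day; (b) predicting length of one week.]
[Figure 6: Forecasting accuracy on the Yellow Sea dataset for different H: (a) predicting length of one day; (b) predicting length of one week.]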
Since the DGCnetwork contains a DE algorithm module, the value of each hyperparameter is optimally selected. This paper analyzes the best model structure and the prediction results for different predicting lengths. On the two SST datasets, the optimal model is used to forecast the SST value with 30 days of historical observations as an example, and the predicting length is set to 3 days, 5 days, 1 week, 2 weeks, and 1 month, respectively. The DE algorithm in our predicting model makes it convenient to adjust the deep network to the optimal state when the prediction range changes, avoiding manual parameter tuning. Table 1 lists the predicting results of p1 and p4 on the two SST datasets. The number of neurons in the hidden layers lies between 10 and 20 and grows as the predicting length increases. The number of neurons in the neural network determines how many features the network can represent, and too few neurons can cause part of the information in the data to be lost. The numbers of epochs in the optimal models are clustered around 100. The forecast results clearly improve as the predicting length decreases; ACC is near 0.99 when forecasting the third day in the future on the two SST datasets. The error of the model remains small when we forecast the SST one month ahead (RMSE is 0.6729 on the East China Sea dataset and 0.5681 on the Yellow Sea dataset). The experiments also indicate that the DGCnetwork optimized by DE may be a good choice for SST time series forecasting.
This paper uses the GRU network for a comparative analysis of prediction errors against the proposed method. Figures 7 and 8 depict the prediction results of the two methods when the length of historical observations is 7 days and the predicting length is 1 day. On the two datasets, the prediction results of the six points reflect the same problem: the prediction errors obtained by GRU are larger near the maximum SST value. In contrast, the DGCnetwork model always maintains small prediction errors, and its predictions are very close to the true SST value. After surveying previous SST prediction studies [13, 32, 33], we find that the SST predictions in [13] also have larger errors near the maximum SST value. So far, however, there has been little discussion about the reason for this phenomenon. This paper analyzes the issue from two aspects: data and method. First of all, SST time series present an obvious periodic tendency; that is, SST generally reaches its maximum in summer each year. Some studies have shown that, in the last two decades, SST has been warming in the coastal areas of China and the intensity of extreme high temperatures has been significantly enhanced, especially in spring and summer [18, 34]. Secondly, a shallow architecture, i.e., a single-layer neural network, cannot efficiently represent the complex features of time series data, particularly when processing highly nonlinear and long-interval time series datasets [35, 36]. On the whole, it is difficult for the single-layer GRU network to capture the trend of SST data in summer. The proposed method has higher prediction accuracy because it uses a deep network structure, which can dig deeper into the spatiotemporal characteristics of SST data. In addition, we set the length of historical observations to 3 days, 15 days, 30 days, and 45 days, respectively, for further experiments. On the East China Sea dataset and the Yellow Sea dataset, the conclusions obtained with the two methods are consistent with those for 7 days.

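[Figures 7 and 8: Prediction results of GRU and the DGCnetwork on the two datasets with a historical observation length of 7 days and a predicting length of 1 day; each figure has panels (a), (b), and (c).]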
Furthermore, in the comparative evaluation experiment, 12 predicting methods, covering classical time series predicting methods and methods published in recent years, are applied to the two datasets under the same experimental conditions and assessed with different error measures. The 12 predicting methods are Support Vector Regression (SVR), Support Vector Machine (SVM), Autoregressive Integrated Moving Average (ARIMA), Back Propagation Neural Network (BPNN), Radial Basis Function Neural Network (RBFNN), RNN, GRU, LSTM, updated-LSTM [13], GRU-SVM [14], WNN [37], and CEEMDAN-LSTM [12].
Tables 2 and 3 show the results on the two datasets when predicting the SST value 1 day ahead with one week (7 days) of historical observations. The smaller the RMSE, RMSPE, MAPE, and MAE, the better the prediction results, while the opposite holds for ACC. The ACC results on the two datasets are 98.81% and 98.30%, an improvement of 0.86% to 13.90% over the other 12 methods. These experiments demonstrate that the DGCnetwork is effective for large-scale SST time series prediction. On the East China Sea dataset, the RMSE, RMSPE, MAPE, and MAE of the proposed approach reach 0.4471, 2.0932, 1.5018, and 0.3218°C, respectively, which are 0.1154, 0.2594, 0.3938, and 0.1082°C lower than the best of the other models. On the Yellow Sea dataset, the RMSE, RMSPE, MAPE, and MAE are 0.3637, 1.7382, 1.2915, and 0.2673°C, respectively. The evaluation indicators show that the method in this paper is more effective than traditional methods and other existing predicting models. The DGCnetwork model has the advantages of higher forecasting precision, better performance, and stronger stability.
4. Conclusions
In this study, we propose the DGCnetwork, a network based on deep GRU and CNN, to model the spatiotemporal relationship of SST and predict its future value. The DE algorithm is adopted to select the optimal hyperparameters of the model. The contributions of this paper are fourfold. (1) The DGCnetwork has a compact structure and focuses on learning deep long-term dependencies in SST time series. Each layer in the DGCnetwork model processes a part of the prediction task. (2) Apart from temporal information, spatial information is incorporated in our work to forecast the SST data. (3) We randomly select points on the East China Sea and the Yellow Sea datasets for the experiments. The results show that the DGCnetwork overcomes the disadvantage of the GRU network, which has larger prediction errors near the maximum SST value. We have conducted comprehensive experiments and compared the model with leading time series predicting models. The experiments demonstrate that the DGCnetwork model achieves state-of-the-art performance and outperforms many existing predicting models. (4) The model can be applied to other time series data. Finally, our future studies will also analyze other types of SST data, such as Group for High Resolution Sea Surface Temperature (GHRSST) data.
Data Availability
The data used in our research is an open dataset, the Optimum Interpolation Sea Surface Temperature (OISST), from the National Oceanic and Atmospheric Administration (NOAA).
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the National Program on Key Research Project of China (2016YFC1401900) and Open Fund of the Key Laboratory of Digital Ocean, State Oceanic Administration, China (B201801030).