Abstract
Due to the need for rural revitalization and renewable energy utilization, a large number of small hydropower stations are emerging, with weak-coupling flow-power features. However, a weak spatial coupling exists between the distribution of small hydropower station groups (SHSGs) and gaging stations, since small hydropower stations are usually located in remote areas lacking hydrographic facilities. This may cause weak or no coupling between the hydrological regime and the power output of small hydropower plants in the target basin, thus hindering accurate power forecasting. To meet the need for short-term power generation prediction for SHSGs in intensive management areas, we propose a data-driven power-forecasting model that can mine the correlation information of weakly coupled basins while transferring hydrological knowledge to uncoupled basins. First, to make the task data domains before and after migration more similar, a similar watershed matching algorithm based on the nonlinear dimensionality reduction algorithm Isomap and the k-means++ algorithm is proposed; then a short-term interpretable runoff prediction model is pretrained, and features are extracted in the source basins using the temporal fusion transformer (TFT) network. After that, a heuristic ensemble fine-tuning model based on the K-fold cross-validation fine-tuning method and a heuristic ensemble algorithm is proposed to transfer the public knowledge of the source basins to the uncoupled basin. A TFT network is then used to mine the weak-coupling relationship between the hydrological regime and the output power of an SHSG. Finally, the validity of the model is verified with an example from a European region. After considering the weakly coupled flow-power characteristics, the mean absolute percentage error (MAPE) of three SHSGs’ power prediction by the proposed method is on average 34.8% lower than that of the method without considering the hydrological information.
1. Introduction
A large number of small hydropower stations are emerging in some remote mountain areas, motivated by rural revitalization and renewable energy utilization [1]. To improve the efficiency and quality of this clean energy, accurate power prediction is needed, which depends strongly on the natural inflow. However, due to economic and technological constraints, the quantity and locations of flow monitoring devices do not match those of the small hydropower generators, so a weak-coupling flow-power system is forming [2]: no strong point-to-point correlation can be established between hydrological data and small hydropower plants (SHPPs), only a weakly coupled one. In this case, how to use the collected data to connect with the output power of SHSGs, obtain power prediction results, and thereby provide a basis for reducing curtailment and improving the accommodation of SHSG power in the management area as well as for grid scheduling has become an urgent problem to be solved.
At present, most related research focuses on modeling individual small hydropower stations, while research on prediction methods for SHSGs is still relatively preliminary.
For power modeling and generation forecasting of individual small hydropower stations, Reichl and Hack [3] decomposed the load in areas with high load fluctuations into a base load, a shock load, and a small hydropower output load and then predicted them separately. The method introduces hydrological experts’ knowledge into the small hydropower output load prediction model, so the model generalizes poorly and is only applicable to some small watersheds. Hammid et al. [4] used an ANN to directly model the power generation of a small hydropower station and demonstrated that artificial neural networks perform well in predicting small hydropower plant production while also having good generalization capabilities. Jung et al. [5] then improved the predictive performance of the ANN model for future small hydropower potential by considering climate change scenarios in the modeling process. However, it is hard for an ANN model of such limited complexity to learn the long-term relationship between rainfall and power generation. Drakaki et al. [6] incorporated expert knowledge about the hydrological regime and the technical characteristics of the SHPP into the power generation modeling process, which proved advantageous. In reality, however, hydrological data are not spatially correlated with SHPPs in a direct, point-to-point way, i.e., they are weakly coupled, which means that hydrological data cannot be directly converted into the output power of an SHPP. Moreover, in ungauged basins, hydrological information is too scarce to be used for modeling.
In terms of modeling the power of SHSG, Yang et al. [7] considered the temporal and spatial distributions of precipitation variables and introduced a multimodal deep learning method to predict the power generation of SHSG with good results. In another study [8], the power trend of SHSG was used as a feature to train the extreme learning machine, which improved the prediction accuracy to some extent. Both of the above studies have achieved some success, although more factors such as hydrological regimes can be considered to improve prediction performance.
These studies have aimed to improve the accuracy of small hydropower forecasting models through different prediction algorithms and input data. However, there is still room for improvement. First, the existing forecasting methods may not perform well in multihorizon forecasting, which involves complex meteorological inputs, and their prediction processes are not interpretable. Second, given the weak coupling between the hydrographic measurement system and the small hydropower system, predicting the power generation of an SHSG directly from meteorological information and historical runoff, or forecasting runoff and then performing direct flow-energy conversions, may not perform well.
Therefore, while utilizing hydrological information to improve SHSG output prediction, this paper considers the weak-coupling property of flow and power for the first time. First, a two-stage prediction model is proposed for watersheds with sufficient hydrological information, in which the runoff at the watershed outlet section is first predicted, and the inner connection between runoff and the output power of the SHSG within the watershed is then mined to achieve runoff-power prediction. Second, considering that certain SHSGs may be located in ungauged basins, a migration prediction method based on the K-fold fine-tuning method and a heuristic ensemble algorithm is proposed, which can effectively transfer the public knowledge of similar watersheds to the target (ungauged) watershed and fill its information gap. Finally, considering that the output power prediction of SHSGs is essentially a multihorizon prediction problem with a complex composition of input variables, the temporal fusion transformer algorithm is introduced for modeling, and its feature identification patterns are analyzed to demonstrate that it can capture and learn the delayed effect of rainfall in the runoff formation process.
2. SHSG Power Forecasting
Benefiting from the information reform of SHSGs, the basic data (hydrology, meteorology, historical power, etc.) can be quickly collected and stored on a cloud server, even though there is little or no direct spatial correspondence between the hydrometeorological data and the power generation.
So, a prediction model based on the TFT network and transfer learning is proposed in this paper. The general framework of the model is shown in Figure 1. The model consists of four parts. The first part completes the similarity matching of the basins where the SHSGs are located: through nonlinear dimensionality reduction and cluster analysis of the feature data, the basins are divided into hydrologically similar clusters. The second part constructs a multihorizon runoff prediction model for the source basins (basins with hydrometric infrastructure), where the TFT neural network is introduced to learn and extract the long-term characteristics between meteorological data and runoff in the source basins. In the third part, a transfer learning model is designed, which first uses the source-basin runoff prediction networks to extract public features and then integrates and migrates these features to the target basins (basins without hydrometric infrastructure but with a few provisional observations) through an improved heuristic ensemble algorithm, providing relatively accurate future runoff in the target basins. The fourth part completes the deep mining of the weakly coupled relationship between hydrological data and power. The deep mining model is trained with historical hydrometeorological data and past power generation of the SHSG and then obtains the future power of the SHSG by feeding in the future runoff of the basin.

3. Methods
3.1. Similar Watershed Matching
Under small samples, “borrowing” the common features of a basin with sufficient hydrological data to transfer to a basin with insufficient data can be an effective and low-cost solution [9]. However, the prerequisite for transfer is to find source basins that are similar to the target basin; otherwise, the transferred information will not be able to characterize the target basin, and the predictive performance will be greatly reduced.
Small hydropower plants are usually of the run-of-river type, and their regulation capacity is very weak, so their output power changes are mainly driven by changes in water flow. Water flow is closely related to the hydrological response of the basin, and the hydrological response of the basin is in turn strongly related to the catchment attributes (such as climate characteristics and geological characteristics) [10]. Therefore, when the catchment attributes of two basins are similar, the two basins can to some extent be considered hydrologically similar. From the above analysis, it is clear that matching SHSGs by catchment attributes is more reasonable than matching by administrative area or geographical proximity, and it also makes it easy to identify source basins that are similar to the target basin. In this paper, we follow paper [11] in selecting specific catchment attributes and use a nonlinear dimensionality reduction algorithm to map the spatial relationships of the high-dimensional catchment attributes into a low-dimensional space. After that, the set of basins is aggregated into classes using a clustering algorithm, and the basins in each class are considered similar basins.
3.1.1. Nonlinear Dimensionality Reduction Algorithm
Compared with linear dimensionality reduction algorithms such as PCA, the nonlinear dimensionality reduction algorithm Isomap measures the distance between two data points on a nonlinear manifold using the geodesic distance approximated from local neighborhood distances, which is more conducive to capturing the proximity of points in the high-dimensional space.
In this paper, 319 watersheds are selected, 11 physiographic catchment attributes are obtained from the database of the corresponding watersheds, the feature data are normalized, and the catchment attribute dataset is then reduced nonlinearly using the Isomap algorithm. The selection of static catchment features and the number of neighborhood points k in the algorithm also follow the study [11].
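As an illustration, a minimal sketch of this dimensionality-reduction step using scikit-learn is given below. It is not the authors' implementation: the synthetic attribute matrix, the neighborhood size, and the variable names are placeholders.

```python
# Minimal sketch (not the authors' code): Isomap reduction of an assumed
# (319 basins x 11 attributes) catchment-attribute matrix to 3 dimensions.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
attributes = rng.normal(size=(319, 11))          # placeholder for the real attribute table

# Normalize each attribute before computing neighborhood distances.
scaled = MinMaxScaler().fit_transform(attributes)

# Isomap approximates geodesic distances from a k-nearest-neighbor graph;
# n_neighbors is assumed here, n_components=3 follows the residual-variance analysis.
embedding = Isomap(n_neighbors=10, n_components=3).fit_transform(scaled)
print(embedding.shape)                           # (319, 3) low-dimensional coordinates
```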
3.1.2. K-Means Clustering
Once the low-dimensional feature dataset is obtained by dimensionality reduction, the low-dimensional features can be clustered and analyzed to divide the watershed clusters and achieve similar watershed matching.
The clustering analysis uses the k-means++ algorithm [12], which clusters the reduced-dimensional dataset into K classes. The algorithm starts by randomly selecting one centroid from the dataset and then iteratively selects the remaining centroids among the other observations with a certain probability. The probability that the i-th observation $x_i$ is selected as the next centroid is given by Formula (1):
$$P(x_i) = \frac{D(x_i)^2}{\sum_{j} D(x_j)^2},$$
where $D(x_i)$ is the distance from $x_i$ to the nearest previously selected centroid. After the K centroids are selected, all observations are assigned to the class of their nearest centroid, finally forming the set of similar-watershed clusters.
The choice of the value of K is crucial to the clustering performance. In this paper, K is tested sequentially from 2 to 50, and the Davies-Bouldin (DB) index [13] is introduced as a clustering validity index; it measures the ratio of intracluster scattering to intercluster separation, and the optimal number of clusters corresponds to the smallest DB value.
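A minimal sketch of this selection procedure, assuming scikit-learn's k-means++ initialization and Davies-Bouldin score (not the authors' code), is given below.

```python
# Sweep K from 2 to 50 with k-means++ and keep the K with the smallest DB index.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def select_k_by_db(embedding: np.ndarray, k_min: int = 2, k_max: int = 50):
    """Return the K with the smallest Davies-Bouldin index and its cluster labels."""
    best_k, best_db, best_labels = None, np.inf, None
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                        random_state=0).fit_predict(embedding)
        db = davies_bouldin_score(embedding, labels)   # smaller is better
        if db < best_db:
            best_k, best_db, best_labels = k, db, labels
    return best_k, best_db, best_labels

# Usage with the Isomap embedding from Section 3.1.1:
# k, db, labels = select_k_by_db(embedding)
# Basins sharing a label form one hydrologically similar cluster.
```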
3.2. Migration Prediction Model of SHSG
3.2.1. General Modeling Steps
The modeling process of the proposed migration prediction model for SHSG in this paper consists of three main steps.
In the first step, a runoff prediction model is developed in the source basin. The model requires training the TFT network with complete historical rainfall, runoff, dew point temperature, evapotranspiration, snow water equivalent, and net surface radiation, together with forecast values of rainfall, dew point temperature, evapotranspiration, snow water equivalent, and net surface radiation at the basin scale.
In the second step, a heuristic ensemble fine-tuning algorithm is used to complete the process of transferring the set of pretrained models from similar source basins to the target basins.
In the third step, the historical runoff data, the meteorological variables of the intermediate region between the gauging station and the SHSG, and the output power of the SHSG are used to train a new TFT network, yielding the runoff-power prediction model. The river runoff predictions obtained in the previous two steps and the historical small hydropower generation data are then fed into this prediction model to obtain the final small hydropower generation results for each basin.
3.2.2. Source Basin Runoff Prediction Based on the TFT Network
With transformer-like models flourishing in the field of NLP, incorporating encoder-decoder constructs and attention mechanisms into time series prediction models has become a major trend in recent years [14, 15]. Similarly, in rainfall-runoff modeling, the encoder-decoder LSTM improves considerably over a single LSTM in long-term prediction, but there is still room for improvement [16]. The temporal fusion transformer is an attention-based deep learning model for interpretable, high-performance multihorizon forecasting; it combines the strengths of the LSTM encoder-decoder and the self-attention mechanism to capture short-term temporal features of neighboring time points while also learning long-term features of the time series [17]. Therefore, the TFT network is introduced for runoff prediction to obtain a runoff prediction model with excellent performance.
The main structure of the TFT model, shown in Figure 2, includes the following components:

(i) Gating Mechanism. It can suppress the useless parts of the input sequence and adjust the network complexity to match the complexity of the input data (a minimal sketch of this gating block is given after this list). The gated residual network (GRN) is formulated as
$$\mathrm{GRN}_{\omega}(a, c) = \mathrm{LayerNorm}\big(a + \mathrm{GLU}_{\omega}(\eta_1)\big),$$
$$\eta_1 = W_{1,\omega}\eta_2 + b_{1,\omega}, \qquad \eta_2 = \mathrm{ELU}\big(W_{2,\omega}a + W_{3,\omega}c + b_{2,\omega}\big),$$
$$\mathrm{GLU}_{\omega}(\gamma) = \sigma\big(W_{4,\omega}\gamma + b_{4,\omega}\big) \odot \big(W_{5,\omega}\gamma + b_{5,\omega}\big),$$
where $a$ is the input (influence-factor data and historical runoff data for the preceding days), $c$ is the context vector, $\mathrm{ELU}$ is the exponential linear unit activation function, $\eta_1$ and $\eta_2$ are the intermediate layers, $\mathrm{LayerNorm}$ is standard layer normalization, $\omega$ is the index indicating weight sharing, $\sigma$ is the sigmoid activation function, $W$ and $b$ are the weights and biases, respectively, and $\odot$ is the element-wise Hadamard product.

(ii) Variable Selection Network. The relevant input variables are selected at each time step according to
$$\tilde{\xi}_t = \sum_{j=1}^{m_{\chi}} v_{\chi_t}^{(j)}\,\tilde{\xi}_t^{(j)},$$
where $\tilde{\xi}_t^{(j)}$ is the feature vector of variable $j$ after processing and $v_{\chi_t}^{(j)}$ is the $j$-th element of the variable selection weight vector $v_{\chi_t}$.

(iii) Static Enrichment Layer. The static features are integrated into the network by encoding the static feature vectors, which are then used to condition the temporal dynamics.

(iv) Time-Series Processing Layer. Long-term and short-term dependencies of the time series are learned from the observed and known time-varying inputs. The LSTM encoder-decoder module, commonly used in multistep prediction frameworks, captures the pattern information of adjacent points to learn short-term dependencies, while the interpretable multihead attention layer learns the long-term dependencies:
$$\mathrm{InterpretableMultiHead}(Q, K, V) = \tilde{H}W_H, \qquad \tilde{H} = \frac{1}{m_H}\sum_{h=1}^{m_H}\mathrm{Attention}\big(QW_Q^{(h)}, KW_K^{(h)}, VW_V\big),$$
where $Q$, $K$, and $V$ are the query, key, and value matrices of the attention mechanism, $W_Q^{(h)}$, $W_K^{(h)}$, and $W_V$ are their corresponding weights, and $\mathrm{Attention}(\cdot)$ is the conventional scaled dot-product attention.

(v) Interval Prediction Layer. In addition to outputting point predictions of runoff several days ahead, the TFT model can generate prediction intervals by simultaneously predicting various percentiles (e.g., 10th, 50th, and 90th) at each time step.
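As a concrete illustration of the gating block in item (i), a minimal PyTorch re-implementation of the gated residual network is sketched below. It follows the published TFT formulation rather than the authors' code; layer sizes are placeholders.

```python
# Illustrative gated residual network (GRN) block from the TFT formulation;
# an expository sketch, not the authors' implementation.
from typing import Optional
import torch
import torch.nn as nn

class GatedResidualNetwork(nn.Module):
    def __init__(self, d_input: int, d_hidden: int, d_context: Optional[int] = None):
        super().__init__()
        self.fc_input = nn.Linear(d_input, d_hidden)            # W2 a + b2
        self.fc_context = (nn.Linear(d_context, d_hidden, bias=False)
                           if d_context is not None else None)  # W3 c (no bias)
        self.elu = nn.ELU()
        self.fc_eta1 = nn.Linear(d_hidden, d_hidden)             # W1 eta2 + b1
        self.gate = nn.Linear(d_hidden, 2 * d_hidden)            # W4/W5 branches of the GLU
        self.skip = (nn.Linear(d_input, d_hidden)
                     if d_input != d_hidden else nn.Identity())  # residual projection
        self.norm = nn.LayerNorm(d_hidden)

    def forward(self, a: torch.Tensor, c: Optional[torch.Tensor] = None) -> torch.Tensor:
        eta2 = self.fc_input(a)
        if self.fc_context is not None and c is not None:
            eta2 = eta2 + self.fc_context(c)
        eta2 = self.elu(eta2)
        eta1 = self.fc_eta1(eta2)
        gate_in, lin = self.gate(eta1).chunk(2, dim=-1)          # GLU: sigmoid gate * linear
        glu = torch.sigmoid(gate_in) * lin
        return self.norm(self.skip(a) + glu)                     # residual + layer norm

# Usage: y = GatedResidualNetwork(d_input=16, d_hidden=32)(torch.randn(8, 16))
```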

The dataset of this study includes future known variables (meteorological data), variables known only in the past (water flow), and static covariates (static basin attributes). The time steps of the different kinds of variables differ, and the variables also exhibit complex nonlinear interactions with each other.
The LSTM layer and variable selection layer of the TFT model can effectively process the input variables and filter out the effective part of the precipitation variables for the attention mechanism layer. After that, the attention mechanism layer can capture specific rainfall-runoff interaction patterns, thus greatly increasing the peak prediction capability of the model, as analyzed in Section 5. Therefore, there are advantages to using this model for runoff prediction. In addition, this paper focuses mainly on point prediction capability.
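To make the variable taxonomy concrete, the sketch below shows how these three input types could be declared for a TFT, assuming the open-source pytorch-forecasting library (not necessarily the authors' toolchain); the synthetic dataframe, column names, and hyperparameters are placeholders.

```python
# Sketch: mapping known-future, past-only, and static variables to TFT inputs,
# using pytorch-forecasting; data and hyperparameters are illustrative only.
import numpy as np
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss

rng = np.random.default_rng(0)
hours = 1000
df = pd.DataFrame({
    "basin_id": "basin_001",                       # group identifier
    "time_idx": np.arange(hours),                  # integer hour counter
    "runoff": rng.gamma(2.0, 1.0, hours),          # target, known only in the past
    "precipitation": rng.exponential(1.0, hours),  # treated as known in the future (forecast)
    "dew_point": rng.normal(5.0, 3.0, hours),
    "area": 120.0,                                 # static catchment attributes
    "mean_elevation": 800.0,
})

dataset = TimeSeriesDataSet(
    df,
    time_idx="time_idx",
    target="runoff",
    group_ids=["basin_id"],
    max_encoder_length=168,                        # one week of history (assumption)
    max_prediction_length=24,                      # 24-hour-ahead forecasting
    static_reals=["area", "mean_elevation"],
    time_varying_known_reals=["precipitation", "dew_point"],
    time_varying_unknown_reals=["runoff"],
)

tft = TemporalFusionTransformer.from_dataset(
    dataset, hidden_size=64, attention_head_size=4,
    dropout=0.1, loss=QuantileLoss(),              # quantile outputs enable interval prediction
)
```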
3.2.3. Migration Prediction Based on Heuristic Ensemble Fine-Tuning Algorithm
To migrate effective knowledge from the source basins to the target basin, dense layers are first trained with the high-dimensional features extracted from the original data of the target basin using a K-fold cross-validation method. The improved heuristic ensemble algorithm [18] is then used to heuristically search for the deep models that maximize the prediction performance on the target basin and add them to the ensemble, so that the public knowledge of the source basins is migrated as effectively as possible. The specific steps are as follows:

(i) Pretraining in Source Basins. Datasets of n hydrologically data-rich source basins are obtained, in which the inputs are historical meteorological data, runoff observations, and meteorological forecasts, and the outputs are the runoff values to be predicted. A TFT network is trained on each of the n datasets to obtain n runoff prediction networks, and n feature extraction networks are then obtained by removing the fully connected layers at the heads of these networks, completing the pretraining stage. The pretrained feature extraction networks contain specific public knowledge learned from the source basins and can extract common temporal feature information from different original datasets.

(ii) K-Fold Cross-Validation Fine-Tuning. A new output layer (fully connected layer) is added to each feature extraction network to form the set of target-basin runoff prediction networks. The newly added fully connected layers are then trained on the target-basin feature dataset using the K-fold cross-validation method. The target-basin feature dataset is extracted from the original data of the target basin by the feature extraction network corresponding to each fully connected layer; note that the time span of the target-basin data is much shorter than that of the source-basin datasets. Let the original data of the target basin cover $T$ hourly time points in total, let the prediction horizon of the model be $\tau$ steps, and let the total number of time steps required for one prediction be $L$. The objective function of the training is shown in the following formula. Figure 3 shows an example of five-fold cross-validation fine-tuning. Through fine-tuning, K different fully connected layers are trained for each feature extraction network. Each fully connected layer performs sliding prediction on its test fold with a step of 1, and the prediction matrices of the K fully connected layers are stacked into a final prediction matrix of the same size as the actual runoff data matrix, which is used as new feature data. Each feature extraction network is fine-tuned separately in this way, yielding n sets of feature data that serve as the input data for training the subsequent ensemble model. The above process is similar to stacking ensemble learning [19] (a minimal sketch of this fine-tuning step is given after this list). Fine-tuning with the K-fold cross-validation method maximizes the use of the small amount of target-domain data and outputs a test set of the same length as the original training dataset, which greatly improves the effectiveness of the subsequent ensemble and prevents the overfitting caused by too little data. In this paper, K is set to 5.

(iii) Improved Heuristic Ensemble Process. The overall flow of the improved heuristic ensemble process is shown in Figure 4.
To make full use of the public knowledge contained in the target-basin runoff prediction networks obtained in the previous step, an improved heuristic ensemble algorithm is used to iterate through all network models in the set, search for the prediction networks that improve the selected metrics, and increase the weights of these networks. These steps are iterated a fixed number of times to finally obtain the weights of all networks (a sketch of such a greedy weight-search loop is given below). The selected indicators are the mean TPE, the mean NSE, and the mean NRMSE computed between the row vectors of the new feature data matrix obtained in the previous step and the corresponding vectors of actual runoff values. Compared with the original heuristic ensemble algorithm, we expand the system of indicators used in the screening process, which makes the algorithm more applicable to the ensemble of runoff prediction models, balances the objectives favored by any single indicator, and improves the ensemble effect.
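The following is a minimal sketch of the K-fold cross-validation fine-tuning step of item (ii) for one frozen feature-extraction network, under assumed shapes and an assumed mean-squared-error objective (the paper's objective formula is not reproduced here); it is illustrative, not the authors' code.

```python
# K-fold fine-tuning of a new output head on a frozen extractor's features; the
# out-of-fold predictions become new feature data, as in stacking ensembles.
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import KFold

def kfold_finetune(features: np.ndarray, runoff: np.ndarray,
                   horizon: int, k: int = 5, epochs: int = 200) -> np.ndarray:
    """features: (N, d) vectors from a frozen pretrained extractor;
    runoff: (N, horizon) targets; returns out-of-fold predictions of shape (N, horizon)."""
    oof = np.zeros_like(runoff, dtype=np.float32)
    for train_idx, test_idx in KFold(n_splits=k, shuffle=False).split(features):
        head = nn.Linear(features.shape[1], horizon)       # newly added fully connected layer
        opt = torch.optim.Adam(head.parameters(), lr=1e-3)
        x_tr = torch.as_tensor(features[train_idx], dtype=torch.float32)
        y_tr = torch.as_tensor(runoff[train_idx], dtype=torch.float32)
        for _ in range(epochs):                            # only the head is trained
            opt.zero_grad()
            loss = nn.functional.mse_loss(head(x_tr), y_tr)   # assumed objective
            loss.backward()
            opt.step()
        with torch.no_grad():                              # sliding prediction on the held-out fold
            x_te = torch.as_tensor(features[test_idx], dtype=torch.float32)
            oof[test_idx] = head(x_te).numpy()
    return oof   # same length as the target-basin data; used as new feature data

# Repeating this for each of the n pretrained extractors yields the n feature-data
# sets that feed the heuristic ensemble step in (iii).
```

A greedy weight-search loop in the spirit of the heuristic ensemble step of item (iii) is sketched next; it uses NSE alone as the selection metric for brevity, whereas the paper's algorithm [18] screens with several metrics, so this is an assumption-laden illustration rather than the exact procedure.

```python
# Greedy (Caruana-style) ensemble selection: repeatedly add the network whose
# inclusion most improves the metric, then normalize selection counts into weights.
import numpy as np

def nse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return 1.0 - np.sum((y_pred - y_true) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

def heuristic_ensemble(preds: list, y_true: np.ndarray, n_iter: int = 100) -> np.ndarray:
    """preds: list of n out-of-fold prediction arrays; returns normalized network weights."""
    counts = np.zeros(len(preds))
    ensemble = np.zeros_like(y_true, dtype=float)
    for it in range(n_iter):
        # Candidate ensembles formed by adding each network once more (with replacement).
        scores = [nse(y_true, (ensemble * it + p) / (it + 1)) for p in preds]
        best = int(np.argmax(scores))
        ensemble = (ensemble * it + preds[best]) / (it + 1)
        counts[best] += 1
    return counts / counts.sum()        # weight assigned to each source-basin network

# Usage: weights = heuristic_ensemble([oof_1, oof_2, oof_3], observed_runoff)
# The target-basin forecast is then the weighted average of the networks' outputs.
```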


3.2.4. Runoff-Power Prediction Model
The ultimate goal of the model in this paper is to obtain the future power of the SHSG in a watershed. In this section, we therefore discuss how future runoff that is spatially weakly coupled with the SHSG is used to estimate the output power of the SHSG in that watershed.
Small hydropower plants are divided into adjustable and unregulated (run-of-river) stations. Adjustable plants have a small reservoir capacity and only limited control over river runoff, while run-of-river plants have no regulating reservoir, so the natural flow of the river determines the generating flow of small hydropower plants. According to the analytical relationship between the output power of a small hydropower plant and its generating flow (Equation (8)), the output of the small hydropower plants in a watershed therefore depends on the generating flow. However, given the large span of the basin, there is spatial heterogeneity in the runoff of different reaches of the same river, which makes it impossible to directly correlate the flow measured at hydrological stations with the power generated by the SHSG in the basin. In this case, the potential connections between the runoff of different river segments must be fully exploited to predict the output power of the SHSG.
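For reference, the flow-power relation referred to as Equation (8) is of the standard hydropower form (the constants and loss terms in the original may differ slightly):
$$P = 9.81\,\eta\,Q\,H,$$
where $P$ is the output power (kW), $Q$ is the generating flow (m³/s), $H$ is the net water head (m), and $\eta$ is the overall efficiency of the turbine-generator unit.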
Similar to Section 3.2.2, the TFT network is employed to construct the runoff-power prediction model, whose inputs are the runoff with meteorological variables (both past and future) and past power generation of the SHSG, and the output is the future power of the SHSG.
3.2.5. Evaluation Metrics
The error evaluation process is carried out in two parts: the first part evaluates the error of the runoff prediction, and the second part evaluates the error between the predicted and actual values of the final output power of the SHSG.
To evaluate the performance of the runoff prediction model, the following metrics are used in this paper, following reference [20]: the Nash efficiency factor (NSE), the absolute error of the top 2% of flow predictions (TPE), and the percentage bias (PBIAS).
NSE is a common metric used in hydrology to judge the effectiveness of hydrological model predictions, and its expression is as follows:
$$\mathrm{NSE} = 1 - \frac{\sum_{t=1}^{T}\big(\hat{y}_t - y_t\big)^2}{\sum_{t=1}^{T}\big(y_t - \bar{y}\big)^2},$$
where $\hat{y}_t$ represents the predicted value at moment $t$, $y_t$ represents the actual value at moment $t$, $\bar{y}$ represents the average of the actual observed values, and $T$ represents the total number of moments in the sample. The range of NSE values is between $-\infty$ and 1, and the closer its value is to 1, the better the hydrological model prediction.
To describe the accuracy of the model’s peak flow prediction, the TPE metric is introduced with the following expression:
$$\mathrm{TPE} = \frac{\sum_{m=1}^{M}\big|\hat{y}_{(m)} - y_{(m)}\big|}{\sum_{m=1}^{M} y_{(m)}},$$
where $y_{(m)}$ represents the $m$-th value of the descending-sorted sequence of actual values, $\hat{y}_{(m)}$ is the predicted value corresponding to $y_{(m)}$, and $M$ represents the number of peaks ranked in the top 2% of the sequence. TPE ranges from 0 to $\infty$, and the closer its value is to 0, the better the peak prediction.
The PBIAS metric is introduced next to describe the deviation of the prediction, and its expression is as follows:
$$\mathrm{PBIAS} = \frac{\sum_{t=1}^{T}\big(y_t - \hat{y}_t\big)}{\sum_{t=1}^{T} y_t} \times 100\%.$$
The value of PBIAS ranges from $-\infty$ to $\infty$, and the closer its value is to 0, the better.
In addition to the above traditional indicators for evaluating runoff prediction, we also introduce the normalized root mean square error (NRMSE), a commonly used indicator for time series prediction error assessment, to evaluate the runoff predictive performance.
The error between the actual and predicted output power of the SHSG is described by two commonly used indicators, NRMSE and MAPE. Both of them have smaller values for better results.
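A minimal sketch of these metrics in NumPy is given below; the TPE form follows the verbal "top 2% flow error" description above, and the range normalization used for NRMSE is an assumption.

```python
# Illustrative implementations of the evaluation metrics (not the authors' code).
import numpy as np

def nse(obs, sim):
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def pbias(obs, sim):
    return 100.0 * np.sum(obs - sim) / np.sum(obs)

def tpe(obs, sim, top=0.02):
    order = np.argsort(obs)[::-1]                 # flows sorted in descending order
    m = max(1, int(np.ceil(top * len(obs))))      # number of top-2% peaks
    return np.sum(np.abs(sim[order[:m]] - obs[order[:m]])) / np.sum(obs[order[:m]])

def nrmse(obs, sim):
    return np.sqrt(np.mean((sim - obs) ** 2)) / (obs.max() - obs.min())   # range-normalized

def mape(obs, sim):
    return 100.0 * np.mean(np.abs((sim - obs) / obs))
```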
4. Dataset and Model Settings
4.1. Introduction to the Dataset
To verify the effectiveness of TFT-based runoff prediction, we use the LamaH-CE dataset as a case study to compare the prediction results of the TFT model with those of other commonly used precipitation-runoff models. LamaH-CE is a large-sample hydrological dataset for Central Europe proposed in recent years [21]. The runoff data are provided by national hydrological services, and the meteorological data are aggregated at the corresponding basin scale from the ERA5-Land data source with a spatial resolution of 10 km. The meteorological variables include the date, air temperature, dew point temperature, water equivalent of snow, net surface shortwave radiation, net surface thermal radiation, and total evapotranspiration. The temporal resolution of each variable reaches one hour. In this paper, meteorological and runoff time series of 319 subbasins without river outflow phenomena in the third subbasin collection of the dataset are selected, spanning from January 1, 2011, 0:00 to December 31, 2017, 23:00. Data from 2011 to 2015 are used for model training, data from 2016 are used for validation, and data from 2017 are used as the test set to evaluate the prediction performance of each precipitation-runoff model up to 24 hours ahead while also demonstrating the potential of the model for runoff prediction at longer horizons (e.g., 120 hours ahead).
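A minimal sketch of this year-based split is shown below, on a placeholder hourly series (the real data come from LamaH-CE).

```python
# Year-based train/validation/test split (2011-2015 / 2016 / 2017).
import numpy as np
import pandas as pd

idx = pd.date_range("2011-01-01 00:00", "2017-12-31 23:00", freq="H")
df = pd.DataFrame({"date": idx,
                   "runoff": np.random.default_rng(0).gamma(2.0, 1.0, len(idx))})

train = df[df["date"].dt.year <= 2015]
valid = df[df["date"].dt.year == 2016]
test = df[df["date"].dt.year == 2017]
print(len(train), len(valid), len(test))      # hourly samples per split
```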
After completing the runoff prediction, the power output data of the SHSGs in three different basins are selected to verify the validity of the runoff-power prediction model. The total installed capacities of these three basins are 20,000 kW, 15,000 kW, and 50,000 kW, respectively, and each SHSG has a weak regulation capacity.
4.2. Benchmark Models and Model Settings
In this section, several benchmark models are used to compare with the TFT model, and then the hyperparameters of each model are determined separately.
The first benchmark model is the LSTM, a traditional model in the field of deep learning. The LSTM has been proven in numerous studies to be an excellent time series prediction model, especially when used as a rainfall-runoff model [22]. The number of RNN layers in the model is set to 1, and the number of neurons per RNN layer is set to 32. Finally, the prediction results with the best parameters are compared with those of the model in this paper.
The second benchmark model is the Light Gradient Boosting Machine (LightGBM). LightGBM is one of the most popular models in the field of ensemble machine learning and is based on the gradient boosting algorithm. To obtain the best prediction, Bayesian optimization is used to tune the parameters automatically within the ranges shown in Table 1, and the model with the best parameters is then used for test set prediction. According to the characteristics of the tree model, we take the meteorological data as the input and the runoff data as the output, with output time steps of 24 h and 120 h.
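The sketch below shows one way to set up such Bayesian tuning, assuming Optuna as the optimizer; the search ranges, feature shapes, and trial count are placeholders rather than the values in Table 1.

```python
# LightGBM benchmark with Bayesian hyperparameter search (illustrative only).
import numpy as np
import lightgbm as lgb
import optuna
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(2000, 6)), rng.normal(size=2000)   # placeholder data
X_valid, y_valid = rng.normal(size=(500, 6)), rng.normal(size=500)

def objective(trial):
    params = {
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 2000),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    model = lgb.LGBMRegressor(**params).fit(X_train, y_train)   # meteorology -> runoff
    return mean_squared_error(y_valid, model.predict(X_valid))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
best_model = lgb.LGBMRegressor(**study.best_params).fit(X_train, y_train)
```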
The third benchmark model is the NRM proposed in study [23]. This model uses an encoder-decoder GRU structure similar to the LSTM encoder-decoder layer of the TFT model. In this paper, the best combination of its parameter set is used, and the specific parameters are shown in Table 2. As with the previous two benchmark models, the output time steps of this model are 24 h and 120 h.
Finally, the hyperparameter settings of the TFT model used in this paper are shown in Table 3.
5. Results and Discussion
In this section, after determining a set of similar watersheds, we compare the proposed model with the benchmark models and try to analyze the reasons behind the differences. Section 5.1 presents the dimensionality reduction and clustering process and the selection of the most appropriate parameters through the analysis of each index. Section 5.2 presents the results of the runoff prediction model in detail. Section 5.3 shows the effect of runoff prediction in three target basins. Section 5.4 presents the output power prediction results for three SHSGs within one watershed cluster as an example.
5.1. Dimensionality Reduction Algorithm and Cluster Analysis
The two-dimensional embedding plot of the dataset after Isomap dimensionality reduction and the plot of the residual variance against the embedding dimension are shown in Figures 5 and 6, respectively.


As can be seen from Figure 6, the residual variance drops below 0.1 when the dimension is 3, so the target dimension of the dimensionality reduction algorithm is set to 3. The value of K in the k-means clustering algorithm is judged by the Davies-Bouldin index (DBI). As seen in Figure 7, the DBI is smallest when K = 3, representing the best clustering effect. So, the optimal number of categories chosen in this paper is 3, and the corresponding numbers of watersheds in each category are 108, 122, and 89.

5.2. Analysis of Runoff Prediction Results
5.2.1. Overall Performance of Runoff Prediction
Figures 8 and 9 show the cumulative distribution function (CDF) curves of the runoff prediction evaluation metrics for TFT and the three benchmark models in the 319 watersheds. For the NSE metric, the closer the curve is to 1, the better the prediction; for PBIAS, TPE, and NRMSE, the closer the curve is to 0, the better. As seen from the figures, the NSE curve of the TFT model is closer to 1 than those of the other three benchmark models, and its PBIAS, TPE, and NRMSE curves are closer to 0, which indicates that the overall prediction performance and peak prediction ability of the model used in this paper are higher than those of the other models and that its applicability across different basins is better. Meanwhile, comparing the CDF curves at different predictive time steps, it can be found that the CDF curves of the TFT model for 120-hour-ahead prediction shift toward poorer values, indicating that the performance of the TFT model decreases to some extent as the predictive time step increases; the peak prediction ability in particular decreases significantly, but the decrease is smaller than that of the other benchmark models.

It is important to note that the errors in the prediction results are not only introduced by the prediction model itself but are also strongly related to measurement bias in the data, such as overestimation of precipitation or underestimation of evapotranspiration leading to water surpluses, as well as errors in the acquisition sources of the runoff data [21].
5.2.2. Runoff Prediction Curve
Previous results have shown that the TFT model performs better than other models both in the 24-hour-ahead prediction and the 120-hour-ahead prediction.
The main reason for this is the advantage conferred by the model structure, which allows the model not only to identify the important variables among the many inputs and learn their nonlinear relationships with the output but also to learn the long-term dependence of a time series with complex covariates, so that the model achieves outstanding results in multihorizon forecasting. As an example, Figure 10 plots the 24-hour-ahead prediction curves of the TFT model and the three benchmark models for four watersheds with IDs 11, 326, 491, and 525.

The runoff curves of these four basins show different characteristics, and the occurrence and duration of the high-water and low-water periods differ among the basins within a year. For example, the runoff curve of basin 326 shows a significant peak in spring and a trough in autumn, while the river in basin 525 is in its dry period in spring and winter, its volume increases in summer and autumn, and its runoff curve oscillates markedly with changes in rainfall. In these basins, we can readily test the predictive ability of the model in different scenarios. It is clear from the figure that the TFT model predictions are closest to the true values during the low-water period in all four basins; during the high-water period, the TFT model predicts the timing of inflection points and peaks more accurately, while its predictions of the magnitudes of some peaks are slightly less accurate. For example, in basin 525, the model failed to predict the highest annual runoff peak, which occurred on July 23, although it did predict an upward trend. One possible reason is that the TFT model can determine the onset of a rising runoff trend from rainfall events but has poor predictive ability for runoff peaks with complex genesis.
5.2.3. Feature Identification
The interpretable attention mechanism layer in the TFT network can capture features such as temporal patterns and sudden changes in significant events [17]. To explore the role of these features in runoff prediction, the curves of the attention weight vector for each predictive time step at a prediction moment $t$ are introduced next to visualize them. The attention weight vector is defined as shown in Equation (14), where the first length is equal to the input length of the encoder, the second is equal to the input length of the decoder, and the elements are taken from the attention matrix in Equation (5) at time $t$.
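As an illustration of how such a curve can be obtained, the sketch below averages the per-head attention matrices of one sample and extracts the rows belonging to the decoder (prediction) positions; the tensor layout is an assumption, not the paper's exact Equation (14).

```python
# Extracting attention-weight curves for one prediction moment (illustrative).
import numpy as np

def attention_weight_curves(attn: np.ndarray, encoder_len: int) -> np.ndarray:
    """attn: (n_heads, L, L) attention matrices of one sample, with
    L = encoder_len + decoder_len; returns (decoder_len, L) averaged weights,
    one weight vector per predictive time step."""
    mean_attn = attn.mean(axis=0)          # average over heads (interpretable attention)
    return mean_attn[encoder_len:, :]      # rows of the decoder (forecast) positions

# Plotting each returned row against the input time axis gives curves like those
# analyzed for basin 326 below.
```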
In this section, the prediction results at two time points, $t_1$ and $t_2$, in basin 326 are selected for analysis, and a series of visualization curves are plotted to determine whether the TFT model can identify the characteristics of the input variables and predict the trend of water flow changes. Here, $t_1$ is at the junction point where the water flow turns from a flat trend to an upward trend, and $t_2$ is a moment within the dry period. The specific analysis proceeds as follows.
First, the composition of the high-dimensional variables input to the attention mechanism layer at moments $t_1$ and $t_2$ is analyzed. The correlation between the high-dimensional variables input into the attention mechanism layer and the original input variables is measured by the Spearman correlation coefficient.
As seen in Figures 11 and 12, most of the high-dimensional variables at time $t_1$ are highly correlated with original variables such as precipitation, historical runoff, and day of year, with correlation coefficients greater than 0.6. The vast majority of the high-dimensional variables at time $t_2$ are highly correlated only with the historical runoff data and day of year.


Next, the feature identification ability of the model is explored more deeply through the attention weights. Given the different compositions of the high-dimensional variables at times $t_1$ and $t_2$, the two moments are analyzed separately as follows.
The plot of the prediction results at time $t_2$ is shown in Figure 13, where Figure 13(c) shows that the water flow is stable at that time and the runoff curve is almost a straight line. The attention weights at this time are concentrated around the prediction time point. Combined with the results of the previous correlation analysis, it can be inferred that the model predicts future runoff during periods of smooth water flow (e.g., the dry period) mainly by observing the recent historical runoff trend, which is very similar to how the human brain reasons when making predictions.

To investigate the effect of precipitation on the prediction results at time $t_1$, a comparison experiment is additionally set up at $t_1$ with the presence or absence of the precipitation variable as the independent variable, in order to test the importance of the precipitation variable to the model prediction: one group trains the model on the training set containing precipitation data, and the other group uses the training set excluding precipitation data; both models are then run on the test set. The partial input data, prediction results, and attention weight vector corresponding to the prediction time point $t_1$ are output, and the result curves with and without the precipitation variable are plotted, as shown in Figures 14 and 15.

As can be seen from the figures, after the precipitation variable is removed, the model is unable to predict the rising trend of runoff. After the precipitation variable is added, the model’s ability to predict the rising trend of water flow improves greatly, and the attention weight is focused on the rainfall event before the water rise (the blue curve in the middle subplot). Combined with the correlation analysis, a possible explanation for this phenomenon is that the variable selection layer and the LSTM local preprocessing layer of the TFT model filter out certain precipitation features from the historical precipitation data, that these precipitation features have a very significant effect on the rising runoff, and that the subsequent attention mechanism layer captures this precisely, markedly improving the model’s prediction of rising water trends.
5.3. Migration Prediction Results
To verify the effectiveness of the transfer learning method based on K-fold cross-validation fine-tuning and the heuristic ensemble algorithm proposed in this paper, the following comparison methods are set up: the original TFT model trained directly in the target basins; the transfer learning method based on ordinary fine-tuning (i.e., directly dividing the samples into a training set, validation set, and test set, and fine-tuning with the training set); and the single pretrain-fine-tune TFT model with the relatively best prediction. The output step of each method in this section is 120 hours (120 points), which strongly tests the migration ability of the models. Table 4 shows the prediction performance of each model on the information-poor watersheds (only 30 days of runoff data) located in watershed cluster 3. Combining the results in Table 4 and Figure 16, we can conclude the following:

(i) Comparing the predictive performance of the TFT model trained directly on small samples with that of the model proposed in this paper, the former has almost no predictive power, with NSEs of 0.03, -0.05, and -0.31 on the three watersheds, respectively. This is because a deep learning model essentially estimates the overall sample distribution from the training samples; when the training samples are too few (e.g., 30 days), the estimate is easily biased and overfitting occurs.

(ii) Comparing the best single pretrain-fine-tune TFT model with the proposed model, the NSE, TPE, and NRMSE of the proposed model are 0.54, 0.22, and 0.08 in basin 498, respectively, which are significantly better than the 0.36, 0.24, and 0.24 of the single pretrain-fine-tune TFT model, while in basin 260 the three metrics of the proposed model are close to, or even slightly worse than, those of the single pretrain-fine-tune TFT model. Similarly, in basin 32 the metrics of the two models are also close, with the proposed model slightly better. Meanwhile, as seen in Figure 16, the heuristic ensemble algorithm assigns similar weights to multiple source basins in basin 498, indicating that they are of comparable importance, while in basins 260 and 32 only one source basin dominates and the weights of the other source basins do not exceed 0.3. Combining these results, it is clear that the weights of the source basins can, to a certain extent, reflect their similarity to the target basin. When multiple source basins in the set are highly similar to the target basin, the heuristic ensemble migration algorithm in this paper can effectively use their public knowledge to obtain prediction capability higher than that of a single pretrain-fine-tune model. However, when few source basins are highly similar to the target basin, the algorithm is limited by the available public knowledge, and its predictive performance is only comparable to that of the pretrain-fine-tune model of the most similar source basin.

(iii) Comparing the prediction metrics of the ordinary fine-tuning-based transfer learning method with those of the method in this paper, the ordinary fine-tuning-based method yields worse prediction results.
Combining the weight plots of the two methods (Figures 16 and 17), it can be found that the weight distribution over the set of source basins learned by the ordinary fine-tuning-based method is completely different from that learned by the K-fold cross-validation fine-tuning-based method; in particular, the source basins receiving the highest weights differ between the two methods. This indicates that the ordinary fine-tuning-based transfer learning method cannot accurately find the source domains with high similarity to the target watershed and thus cannot effectively utilize the public knowledge in the set of source basins. The main reason may be that the validation set generated by the ordinary fine-tuning approach is too small to provide sufficient information to the heuristic ensemble algorithm. In contrast, the K-fold cross-validation fine-tuning method produces a validation set with as much data as the training set, which provides enough information to the ensemble algorithm and therefore performs better on the test set.


5.4. Output Power Prediction Results of SHSG
In this section, we take three SHSGs with different geographic and meteorological conditions as examples to compare the predictive performances of the direct use of the meteorological data from hydrological stations to predict the power of SHSG (the traditional method) and the method in this paper.
To better reflect the predictive performance of the two methods, a sliding-window multihorizon prediction scheme is used: the multiple sets of 24-hour prediction results obtained from the sliding predictions are reassembled in time order to obtain a power prediction curve covering a period of 6 days.
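A small sketch of such a reconstruction is given below; it assumes overlapping 24-hour windows are averaged where they overlap, which is one reasonable reading of the sliding-window scheme rather than the authors' exact procedure.

```python
# Reassemble sliding 24-hour forecasts into one continuous curve (illustrative).
import numpy as np

def stitch_windows(windows: np.ndarray, step: int = 24) -> np.ndarray:
    """windows: (n_windows, 24) forecasts issued every `step` hours;
    overlapping hours (if step < 24) are averaged."""
    n, h = windows.shape
    length = (n - 1) * step + h
    total, count = np.zeros(length), np.zeros(length)
    for i, w in enumerate(windows):
        total[i * step:i * step + h] += w
        count[i * step:i * step + h] += 1
    return total / count

six_day_curve = stitch_windows(np.zeros((6, 24)))   # 6 daily windows -> 144-hour curve
```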
Table 5 shows the prediction error metrics of the two methods at different predictive time steps. Through longitudinal comparison, we can see that the overall predictive performance of our method is significantly higher than that of the traditional method at both predictive time steps, with the most obvious improvement in basin 3, where the MAPE values decrease by about 48.9% and 44.2%, respectively. As shown in Figures 18–20, our method also predicts the inflection points and peaks of the power curve better than the traditional method, has lower error with respect to the actual power curve, and tracks the trend of the power curve more closely.



By lateral comparison, the prediction error increased in all three basins after the predictive time step was increased from 24 to 120 hours, which is consistent with the conclusion reached in Section 5.2.
In summary, the method proposed in this paper shows strong potential for practical application.
6. Conclusion
In this paper, a power prediction model for a small hydropower station group based on a TFT network and an improved heuristic ensemble transfer learning algorithm is proposed, which splits the output prediction of the small hydropower group into two parts: runoff prediction and runoff-power prediction. In addition, introducing the TFT network for the runoff prediction part broadens the application scenarios of the TFT network and demonstrates that the network can identify specific features in runoff prediction. Moreover, given the weak coupling between the hydrographic measurement system and the small hydropower system, we propose a novel runoff-power prediction model which is of practical significance and achieves better results than the traditional prediction method.
At the same time, there is still room for improvement. For example, there is room to improve the prediction accuracy after increasing the predictive time step of runoff prediction, which can be improved by using a combination model or a time series prediction model with better performance. Further research will be conducted in the future to address these issues.
Nomenclature
SHSG: Small hydropower station group
TFT: Temporal fusion transformer network
NSE: Nash efficiency factor
TPE: Top 2% of the flow prediction error
PBIAS: Percentage bias
NRMSE: Normalized root mean square error
MAPE: Mean absolute percentage error
Data Availability
The hydrological and meteorological data related to this article can be found at doi:10.5281/zenodo.5153305, and the power data of SHSGs used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Authors’ Contributions
Biyun Chen carried out the experiment, supervision, and visualization. Yujia Long was responsible for the algorithm development, result analysis, and writing the original draft. Bin Li and Wenyang Deng contributed to writing the review and editing. Hua Wei and YongJun Zhang were involved in the supervision and validation. CanBing Li helped with the investigation and in writing the review and editing.
Acknowledgments
This work was supported in part by the Guangxi Special for Innovation-Driven Development (Grant No. AA19254034) and the Guangdong Basic and Applied Basic Research Foundation (Guangdong-Guangxi Joint Foundation) (2021A1515410009).