Abstract
Tropical cyclones (TCs) are among the extreme disasters with the greatest impact on human beings. Unfortunately, TC intensity forecasting has long been a difficult problem and a bottleneck in weather forecasting. Recently, deep learning-based TC intensity forecasting has shown the potential to surpass traditional methods. However, due to the Earth system's complexity, nonlinearity, and chaotic effects, there is inherent uncertainty in weather forecasting. Moreover, previous studies have not quantified this uncertainty, which is necessary for decision-making and risk assessment. This study proposes an intelligent system based on deep learning, PTCIF, to quantify this uncertainty based on multimodal meteorological data; to our knowledge, it is the first study to assess the uncertainty of TCs with a deep learning approach. Probabilistic forecasts are made for TC intensity at lead times of 6–24 hours. Experimental results show that the proposed method is comparable to the forecasts of operational weather forecast centers in terms of deterministic performance. Moreover, reliable prediction intervals and probabilistic forecasts can be obtained, which is vital for disaster warning and is expected to complement operational models.
1. Introduction
Tropical cyclones (TCs), also known as typhoons or hurricanes, cause direct hazards such as storm surges and extreme winds, as well as indirect hazards such as landslides and mudslides, and can inflict huge losses of life and property. A TC killed 300,000 people in Bangladesh in 1970, and another killed 140,000 in 1991 [1]. Over the past 50 years, TCs have been the most economically costly weather and climate disasters [1].
Given the enormous impact and incalculable losses caused by TCs, accurate forecasting of TC intensity is one of the most critical factors in reducing casualties and economic losses, and it is of great significance for governmental decision-making and advance risk avoidance by the public. However, intensity forecasting has remained a difficult task in TC forecasting in recent years [2]: because of the errors introduced in data assimilation and the difficulty of representing the actual structure of a TC in the model's initial analysis field, physics-based numerical and statistical forecasting methods have not improved significantly for many years [3]. Numerical forecasting methods rely on complex boundary conditions and large computational resources, while statistical forecasting methods require a large number of predictor inputs [4].
Recently, deep learning-based methods have emerged in geosciences with the potential to replace traditional approaches. For example, in El Niño forecasting, deep learning methods outperform all traditional numerical models [5]. In addition, the deep learning rainfall forecasting method MetNet proposed by Google surpasses the state-of-the-art physics-based forecasting models operated by NOAA [6].
There have also been many studies on machine learning-based TC forecasting in recent years. For example, the deep learning-based intensity forecasting scheme proposed by Wenwei et al. [7] significantly outperforms the numerical statistical model operated by NOAA. The rapid intensification (RI) forecast scheme proposed by Su et al. [8] exceeds the RI consensus of the National Hurricane Center (NHC).
However, the Earth system is a chaotic system with strong nonlinearity [9], and TC behaviors such as RI are particularly difficult to forecast. Studies based on numerical models commonly use ensemble forecasting to represent this uncertainty, for example through multimodel ensembles or initial-value perturbations, but these methods require enormous computational resources [10], and intensity forecasting remains a bottleneck.
Given the potential of deep learning in the Earth sciences and the challenges faced by traditional TC forecasting methods, we consider using a deep learning model to quantify this uncertainty directly. It is worth noting that while existing deep learning studies focus on providing accurate forecasts of TC intensity, none has so far quantified the associated uncertainty. Since TCs are hazards with significant human impact, providing only point forecasts is not sufficient to fully assess their impact, and adequately modeling their uncertainty is of great importance for reducing casualties.
We propose a probabilistic deep learning model for probabilistic tropical cyclone intensity forecasting (PTCIF), which is based on multimodal spatiotemporal data, achieves both deterministic and probabilistic forecasts, and is capable of providing point forecasts, interval forecasts, and probabilistic forecast results. In summary, our contributions are threefold:
(i) We propose PTCIF, an intelligent system for uncertainty quantification of multimodal spatiotemporal data, which, to the best of our knowledge, is the first deep learning-based study of uncertainty modeling for tropical cyclones.
(ii) We validate our model on a large-scale real-world meteorological dataset, and the proposed model achieves advanced deterministic as well as probabilistic forecast results over the 2015–2018 testing period.
(iii) The probabilistic and interval forecast results help in extreme risk avoidance and can be a useful complement to traditional methods in tropical cyclone forecasting.
2. Related Work
2.1. TC Intensity Forecast
TC intensity forecasting is considered to be a particularly challenging task, and progress has been slow in recent years. Deterministic TC intensity forecasting models can usually be divided into three categories: dynamical models [11], statistical-dynamical models [12], and consensus models [13].
Dynamical models have limitations, such as an incomplete representation of many physical processes [14], and rely on supercomputers with huge computational cost. Statistical-dynamical models require a large number of inputs from dynamical models as well as Earth observations to obtain reliable results [15]. Neither dynamical nor statistical-dynamical models can currently give sufficiently accurate intensity predictions. Since the physical processes associated with TCs are highly nonlinear and deep neural networks are very good at modeling nonlinear relationships, deep learning is expected to replace current intensity prediction models.
Recently, many studies on deep learning methods for TC forecasting have emerged. Some studies have proposed using recurrent neural networks and temporal prediction algorithms such as LSTM variants to capture correlations in historical intensity sequences and forecast intensity, obtaining results comparable to those of dynamical and statistical models [16, 17]. Other studies have recently started to consider the environmental information in which TCs are embedded, using 3D environmental spatiotemporal information to construct deep learning models that achieve good TC intensity forecasts [18–20]. However, these studies have focused on improving forecast error performance without fully considering the uncertainties inherent in the data and the models.
2.2. Uncertainty Quantification
Previous studies have highlighted the need to provide uncertainty representations for TCs, and there are usually two approaches to constructing probabilistic forecasts [10]:
(i) Ensemble methods: uncertainty is usually obtained by varying initial conditions, adding random perturbations, or running a set of multiple dynamical or statistical models in parallel.
(ii) Statistical methods: uncertainty is usually obtained from probability density functions of past errors or from the predictions of deterministic or statistical models.
As with deterministic intensity forecasting methods, ensemble probabilistic forecasting methods suffer from incomplete physics and overwhelming computational cost. Statistical methods usually derive the current uncertainty from forecast errors over a past period and therefore do not fully characterize the uncertainty of the present TC [21]. Probabilistic work at this stage concentrates mostly on RI forecasting of TCs, which usually uses an ensemble approach [22] or a naive Bayesian framework [23]. Tolwinski-Ward [24] developed a spatial generalized linear model of landfall frequency to quantify uncertainty in TC landfall climatology, and Bonnardot et al. [25] developed a hybrid approach to generate final probabilistic forecasts.
There is growing interest in using deep learning to characterize uncertainty in the geosciences. Petersik and Dijkstra [26] applied two neural networks to estimate ENSO uncertainty and obtained forecast performance comparable to current state-of-the-art methods. Gordon and Barnes [27] used recurrent neural networks to produce probabilistic forecasts of climate uncertainty over 2–10 years. Rittler et al. [28] used probabilistic machine learning with normalizing flows to generate conditional forecast distributions for probabilistic weather forecasting. FourCastNet, proposed by Pathak et al. [29], can quickly create fast and inexpensive large ensemble forecasts with thousands of ensemble members to improve probabilistic forecasting.
Typically, probabilistic prediction research in the AI community has focused almost exclusively on temporal prediction, concentrating on publicly available datasets such as the UCI electricity and solar datasets [30–32]. Wu et al. [33] quantified uncertainty for spatiotemporal data, such as probabilistic forecasting of PM2.5, traffic flow, and COVID-19. However, all of these studies involve only unimodal data and do not consider multimodal data.
Our research not only provides the first deep learning-based uncertainty quantification of TC intensity but also provides a fundamental framework for uncertainty modeling for multimodal spatiotemporal data.
3. Methodology
3.1. Problem Definition
Let $\mathcal{X} = \{x_1, x_2, \ldots, x_T\}$ denote a set of tropical cyclone spatiotemporal observation features at a sequence of time steps $t = 1, \ldots, T$. Each $x_t$ consists of two components: (1) $x_t^{c}$, denoting a series of CLIPER features, and (2) $x_t^{s} \in \mathbb{R}^{D \times H \times W}$, denoting the three-dimensional spatial features around the TC at time $t$, where $D$ denotes the feature dimension of $x_t^{s}$.

Let $y_t$ denote the intensity observation at time step $t$, which is the target value of our model. Given the past observations $\{x_1, \ldots, x_T\}$, the goal of the TC intensity forecasting problem is to produce a probabilistic forecast; i.e., we are interested in the conditional distribution of the future intensity value $y_{T+h}$:

$$ P\left(y_{T+h} \mid x_1, x_2, \ldots, x_T; \Phi\right), $$

where $h$ is the forecast lead time (in time steps) and $\Phi$ denotes the learnable parameters of the model. In this paper, we use the TC characteristics of the past 18 hours to predict the intensity for the next 6, 12, 18, and 24 hours.
3.2. Model Overview
As a spatiotemporal sequence inference neural network, PTCIF provides a principled framework for modeling uncertainty in multimodal spatiotemporal sequence data of TCs. In this section, the proposed PTCIF model is described in detail; its overall architecture is shown in Figure 1. The input to the model contains two types of modal data: 3D fields and statistical features. Key features are extracted from the 3D data through convolution operations, while complex statistical features are constructed using feature construction methods commonly used in meteorology. These two groups of features are finally fused through full concatenation.
Figure 1: Overall architecture of the proposed PTCIF model ((a)–(c)); panel (b) details the encoder module.
As suggested by DeMaria and Kaplan [34], the intensity change over a fixed time interval follows an approximately normal distribution with a mean close to zero, so the output of the model in this study is a Gaussian distribution; i.e., the model predicts the mean and standard deviation of the intensity. In order to adequately capture the uncertainty in multimodal tropical cyclone data, PTCIF uses heteroskedastic regression [35] and therefore adopts a heteroskedastic loss designed for the case where the forecast variance is not constant. In heteroskedastic regression, the assumption of constant variance is relaxed, and the variance of the residuals can vary across different regions of the input space. In layman's terms, our method first determines, based on domain knowledge, the normal distribution that the predicted intensity obeys. By minimising the heteroskedastic loss function, the model learns to predict both the mean and the variance of the dependent variable and can provide more accurate predictions, particularly in the presence of heteroskedasticity.
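To make the heteroskedastic formulation concrete, the sketch below shows a minimal PyTorch implementation of a Gaussian output head and the corresponding negative log-likelihood (heteroskedastic) loss, in which the network predicts a mean and a standard deviation per sample. The layer names and the softplus parameterization are illustrative assumptions, not the exact PTCIF configuration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    """Maps a fused feature vector to the mean and standard deviation
    of a Gaussian predictive distribution (illustrative sketch)."""
    def __init__(self, in_features: int):
        super().__init__()
        self.mu = nn.Linear(in_features, 1)         # predicted mean intensity
        self.raw_sigma = nn.Linear(in_features, 1)  # unconstrained scale

    def forward(self, h: torch.Tensor):
        mu = self.mu(h)
        sigma = F.softplus(self.raw_sigma(h)) + 1e-6  # keep sigma strictly positive
        return mu, sigma

def heteroskedastic_nll(mu, sigma, y):
    """Negative log-likelihood of y under N(mu, sigma^2). Because the
    variance is predicted per sample, the loss penalizes both the error
    and a miscalibrated uncertainty estimate."""
    return (0.5 * torch.log(2 * math.pi * sigma ** 2)
            + (y - mu) ** 2 / (2 * sigma ** 2)).mean()
```

In recent PyTorch versions, the built-in criterion nn.GaussianNLLLoss (parameterized by the variance) provides an equivalent loss.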
In this study, the 3D input features are divided into the u and v components of the wind. The u and v components are processed by two branch networks with identical structures; after initial feature extraction by the encoder, a feature fusion operation is applied at the channel level to better integrate the two components. After repeating this operation twice, a Flatten operation is finally performed to obtain the final representation of the spatial features. The statistical feature construction approach of this study is presented in the dataset and processing section.
In the following, the three basic components of the model are described in detail.
3.2.1. Encoder Module
The Inception architecture has been shown to achieve very good performance at a low computational cost [36]. As shown in Figure 1(b), we use the Inception-A and Reduction-A modules in this study as alternatives to the operations of convolution and pooling, respectively, and synthesize features after each block to obtain nonlinear properties.
(1) Inception-A Module. The Inception-A module is designed to extract features at multiple scales using filters of different sizes. Its main idea is to use multiple convolution kernels to extract different features of an image and merge these features together. This approach allows multiple features of the tropical cyclone spatial data to be considered simultaneously and helps avoid overfitting. The Inception-A module contains four branches, each of which performs a convolution with a kernel of a different size to obtain feature maps at different scales. These feature maps are then stitched together and fed into the next layer of the network; this stitching operation is known as filter concatenation. The mathematical formulation of the Inception-A module can be written as

$$ y_i = \mathrm{ReLU}\left(W_i * x + b_i\right), \quad i = 1, \ldots, 4, \qquad y = \mathrm{Concat}\left(y_1, y_2, y_3, y_4\right), $$

where $y_i$ is the output feature map of the $i$th filter branch of the Inception-A module, $x$ is the input feature map, $W_1, W_2, W_3,$ and $W_4$ are the learnable filter weights for filters of sizes 1 × 1, 3 × 3, 3 × 3, and 5 × 5, respectively, $b_1, b_2, b_3,$ and $b_4$ are the corresponding filter biases, $*$ denotes the convolution operation, $\mathrm{Concat}(\cdot)$ denotes the filter concatenation operation, and $\mathrm{ReLU}(\cdot)$ is the rectified linear activation function.
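For illustration, a minimal PyTorch sketch of a four-branch Inception-style block consistent with the formulation above is given below; the branch channel counts are assumed values, and the block is a simplified stand-in rather than the exact Inception-A configuration of [36].

```python
import torch
import torch.nn as nn

class InceptionA(nn.Module):
    """Four parallel conv branches (1x1, 3x3, 3x3, 5x5) whose outputs are
    filter-concatenated along the channel axis; spatial size is preserved."""
    def __init__(self, in_ch: int, branch_ch: int = 32):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 5, padding=2), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Filter concatenation: stitch the four branch outputs along channels.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
```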
(2) Reduction-A Module. The Reduction-A module aims to reduce the spatial size of the feature map while retaining the most salient features. Its main function is to downsample the feature map, thus reducing the computational cost of the model. It uses filter concatenation, similar to the Inception-A module, to stitch together the feature maps obtained from multiple branches. The Reduction-A module contains three branches. Its mathematical formulation can be written as follows:

$$ y_1 = \mathrm{ReLU}\left(W_1 * x + b_1\right), \quad y_2 = \mathrm{ReLU}\left(W_2 * x + b_2\right), \quad y_3 = \mathrm{MaxPool}(x), \qquad y = \mathrm{Concat}\left(y_1, y_2, y_3\right), $$

where $y_i$ is the output feature map of the $i$th branch of the Reduction-A module, $x$ is the input feature map, $W_1$ and $W_2$ are the learnable filter weights for filters of sizes 3 × 3 and 2 × 2 (strided), respectively, $b_1$ and $b_2$ are the corresponding filter biases, and $\mathrm{MaxPool}(\cdot)$ is the max pooling operation.
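A corresponding sketch of the three-branch reduction block follows; the strides, channel counts, and the assumption of even spatial dimensions are illustrative choices rather than the exact Reduction-A configuration.

```python
import torch
import torch.nn as nn

class ReductionA(nn.Module):
    """Halves the spatial resolution with a strided 3x3 conv, a strided 2x2
    conv, and 2x2 max pooling, then filter-concatenates the branch outputs.
    Assumes even input height/width so the branch shapes match."""
    def __init__(self, in_ch: int, branch_ch: int = 32):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 3, stride=2, padding=1),
                                nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 2, stride=2),
                                nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.b1(x), self.b2(x), self.pool(x)], dim=1)
```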
3.2.2. Convolutional Block Attention Module (CBAM)
CBAM is a simple and effective attention module for feedforward convolutional neural networks with almost negligible computational overhead [37]. The attention module makes the CNN learn to pay more attention to important information rather than useless information. CBAM applies two attention mechanisms in turn, channel attention and spatial attention, which compute complementary attention focusing on "what" and "where," respectively.
(1) Channel Attention Module. The channel attention module recalibrates the feature maps by explicitly modelling the interdependencies between channels. The module accepts the input feature map $F$ and calculates the channel attention map $M_c(F)$ by assigning different weights to different channels. The channel attention map is calculated by

$$ M_c(F) = \mathrm{MLP}\left(\mathrm{AvgPool}(F)\right), $$

where $\mathrm{AvgPool}(\cdot)$ is the average pooling operation that calculates the mean of each channel, and $\mathrm{MLP}(\cdot)$ is a multilayer perceptron consisting of two fully connected layers and a sigmoid activation function. The channel attention map is then used to rescale the feature map along the channel dimension via the Hadamard product:

$$ F' = M_c(F) \otimes F. $$
In brief, the channel attention mechanism learns to emphasise important channels and suppress irrelevant ones by computing a channel attention map. The channel attention map is generated by globally average pooling the input feature map to produce a channel descriptor, which is then passed through two fully connected layers with ReLU activation. The resulting output is a channel attention map that is multiplied with the input feature map via the Hadamard product to produce an enhanced feature map.
(2) Spatial Attention Module. The spatial attention module recalibrates the feature map by explicitly modelling the interdependencies between spatial locations. It takes the channel-attended feature map $F'$ and computes a spatial attention map $M_s(F')$, assigning different weights to different spatial locations. The spatial attention map is calculated as follows:

$$ M_s(F') = \mathrm{MLP}\left(\mathrm{MaxPool}(F')\right), $$

where $\mathrm{MaxPool}(\cdot)$ is the max pooling operation that calculates the maximum value of each channel, and $\mathrm{MLP}(\cdot)$ is a multilayer perceptron consisting of two fully connected layers and a sigmoid activation function. The spatial attention map is used to rescale the feature map along the spatial dimension via the Hadamard product:

$$ F'' = M_s(F') \otimes F'. $$
In brief, the spatial attention mechanism learns to focus on informative regions of the input feature map while suppressing less relevant regions by computing a spatial attention map. The spatial attention map is obtained by applying two convolutional layers with ReLU activation to the channel-attended feature map, producing a spatial descriptor. The spatial descriptor is then passed through a max pooling operation and two fully connected layers with ReLU activation. The resulting output is a spatial attention map that is multiplied with the feature map via the Hadamard product to produce an enhanced feature map.
The final output feature map of the CBAM module is an augmentation of the input feature map by both channel attention and spatial attention:

$$ F'' = M_s\left(M_c(F) \otimes F\right) \otimes \left(M_c(F) \otimes F\right), $$

where $\otimes$ denotes the Hadamard product, and the result is a feature map $F''$ of the same shape as $F$.
In summary, the CBAM module adaptively recalibrates the feature maps by explicitly modelling the interdependencies between channels and spatial locations, and the output feature maps are enhanced by both channel-wise and spatial-wise attention.
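The sketch below gives a compact PyTorch version of CBAM following the reference formulation of Woo et al. [37] (channel attention from pooled descriptors through a shared MLP, then spatial attention from channel-wise pooling through a convolution). The reduction ratio and kernel size are assumed values, and the exact pooling/MLP configuration used in PTCIF may differ from this reference implementation.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: sequential channel and spatial
    attention applied via Hadamard products (sketch after Woo et al. [37])."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        # Channel attention: shared MLP over avg- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        # Spatial attention: convolution over channel-wise avg/max maps.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention ("what" to attend to).
        avg = self.mlp(x.mean(dim=(2, 3)))            # (B, C)
        mx = self.mlp(x.amax(dim=(2, 3)))             # (B, C)
        mc = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        x = x * mc                                    # Hadamard product
        # Spatial attention ("where" to attend to).
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)  # (B, 2, H, W)
        ms = torch.sigmoid(self.conv(s))
        return x * ms                                 # Hadamard product
```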
3.2.3. Fusion Mechanism
The feature fusion mechanism aims to mathematically combine different features to obtain a better feature representation [38]. In our study, we mainly used three types of fusion methods for the wind speed feature map, including Add fusion, Max fusion, and Convolution fusion. Add fusion is simple and effective but may not capture the complex interactions between feature maps. Max fusion is useful for selecting the most salient features, while Convolution fusion can capture more complex and subtle relationships between feature maps.
(1) Add Fusion. In this method, the output feature maps of different layers or modules are added together element by element. This helps improve the representational power of the network by combining features from different layers. Given two feature maps $F_1$ and $F_2$ with the same shape, the output feature map $F_{\mathrm{out}}$ is calculated as follows:

$$ F_{\mathrm{out}} = F_1 + F_2. $$
(2) Max Fusion. In this method, the maximum value at each spatial location of the output feature maps of the different layers or modules is selected. This helps retain the most salient features and discard the others. It is calculated as follows:

$$ F_{\mathrm{out}} = \max\left(F_1, F_2\right), $$

where the maximum is taken element-wise.
(3) Convolution Fusion. In this method, output feature maps from different layers or modules are concatenated along the channel dimension, and a convolutional layer is then applied to the concatenated feature map to produce the final output. This allows the network to learn how to combine features from different layers in a more complex way. It is calculated as follows:

$$ F_{\mathrm{out}} = W * \mathrm{Concat}\left(F_1, F_2\right), $$

where $\mathrm{Concat}(\cdot)$ denotes joining the two input feature maps along the channel axis and $W$ denotes the learnable filter weights of the convolution operation $*$. This approach allows the network to learn the best way to combine the two input feature maps.
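A minimal sketch of the three fusion operators is given below; the 1 × 1 kernel of the convolution fusion is an assumption made for brevity.

```python
import torch
import torch.nn as nn

def add_fusion(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """Element-wise addition of two feature maps with the same shape."""
    return f1 + f2

def max_fusion(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """Element-wise maximum, keeping the most salient response at each location."""
    return torch.maximum(f1, f2)

class ConvFusion(nn.Module):
    """Concatenate along the channel axis, then learn how to combine the
    two maps with a convolution (1x1 kernel assumed here)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        return self.conv(torch.cat([f1, f2], dim=1))
```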
4. Experiments
4.1. Dataset and Processing
We use two standard datasets commonly used in atmospheric science for modeling purposes.
4.1.1. China Meteorological Administration (CMA) Best Track Data [39]
The data cover some of the most important basic information about TC, including intensity, longitude, latitude, TC central minimum pressure, and TC central maximum wind speed, covering the northwest Pacific with a temporal resolution of 6 hours.
The CLImatology and PERsistence (CLIPER) statistical method proposed by Knaff et al. [12] for the northwest Pacific builds forecasts from a set of constructed statistical features and has been used as a baseline model for TC forecasting. We construct the two-dimensional (statistical) features based on this method, with reference to the factor selection schemes of previous studies [19, 40]. A total of 96 forecast factors were generated, and the specific factors are listed in Table 1. Specifically, features 1 to 20 in the table represent typhoon persistence factors. Features 21 to 36 represent climatological factors. Features 37 to 45 are domain-knowledge features guided by the SHIPS predictors that contribute to forecasting [34]. Features 46 to 95 represent wind speed and pressure. Feature 96 is designed as a seasonal factor.
4.1.2. ERA-Interim Data [41]
The data are global reanalysis data, i.e., information obtained by combining ground-based, satellite, and other observations and assimilating them into a global numerical model after quality control; to some extent, they can be regarded as an approximation of the actual atmospheric state. These data provide large-scale environmental information for our model at a temporal resolution of up to 1 hour. To match the CMA data, they are resampled to a 6-hour resolution. To portray the structural information of the typhoon, we used the data processed from ERA-Interim by Xu et al. [19]. The data are centered on the TC, and a grid at four isobaric levels (250 hPa, 500 hPa, 750 hPa, and 1000 hPa) is constructed using the u and v wind components. In ERA-Interim, the u-wind and v-wind components are provided in separate files as the zonal and meridional wind components, respectively: the zonal (u) component represents the eastward component of the wind vector, while the meridional (v) component represents the northward component.
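For illustration, the sketch below assembles TC-centered u- and v-wind fields at the four pressure levels into the two input tensors that feed the encoder branches. The 31 × 31 grid size, the dictionary-based input layout, and the function name are hypothetical assumptions for illustration only, not the actual preprocessing pipeline of Xu et al. [19].

```python
import numpy as np

PRESSURE_LEVELS = [250, 500, 750, 1000]  # hPa, as described above
GRID = 31                                # hypothetical TC-centered grid size

def build_wind_input(u_fields: dict, v_fields: dict):
    """Stack TC-centered u- and v-wind fields (one 2D array per pressure
    level, keyed by level in hPa) into two (levels, H, W) arrays, one per
    encoder branch. Purely illustrative."""
    u = np.stack([u_fields[p] for p in PRESSURE_LEVELS], axis=0)  # (4, GRID, GRID)
    v = np.stack([v_fields[p] for p in PRESSURE_LEVELS], axis=0)
    return u.astype(np.float32), v.astype(np.float32)

# Example with random placeholder fields:
u_fields = {p: np.random.randn(GRID, GRID) for p in PRESSURE_LEVELS}
v_fields = {p: np.random.randn(GRID, GRID) for p in PRESSURE_LEVELS}
u_tensor, v_tensor = build_wind_input(u_fields, v_fields)
```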
Figure 2 shows our study area and the TCs involved. The dataset covers all typhoon events in the northwest Pacific Ocean from 2000 to 2018. The samples from 2000 to 2014 were used to train the model, comprising 365 typhoons with a total of 11,315 time-point records, of which 10% (1,131 records) were retained as the validation set. The samples from 2015 to 2018 are used as the test set, comprising 115 typhoons with a total of 3,661 records.

4.2. Baselines
To validate the probabilistic forecasting performance of our proposed model, we implemented four baseline models built on the same backbone network as our model for comparison of probabilistic forecasting performance.
4.2.1. Quantile Regression (QR)
Quantile regression is a modeling approach for estimating the relationship between a set of regression variables and the quantiles of the response variable. We use a one-sided quantile loss function to generate forecasts at a fixed confidence level [42]. We infer the confidence interval from a pair of lower and upper quantiles; the quantile levels involved in this study are 0.025, 0.1, 0.25, 0.5, 0.75, 0.9, and 0.975. The model and learning-rate settings are the same as for PTCIF.
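The one-sided quantile (pinball) loss used for this baseline can be sketched as follows; training a separate model (or output head) per quantile level is an implementation assumption of the sketch.

```python
import torch

def quantile_loss(pred: torch.Tensor, target: torch.Tensor, q: float) -> torch.Tensor:
    """Pinball loss for a single quantile level q in (0, 1):
    under-prediction is weighted by q, over-prediction by (1 - q)."""
    err = target - pred
    return torch.mean(torch.maximum(q * err, (q - 1) * err))

# Quantile levels used in this study; the 0.025/0.975 pair yields a 95% interval.
QUANTILES = [0.025, 0.1, 0.25, 0.5, 0.75, 0.9, 0.975]
```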
4.2.2. Bootstrap
Bootstrap is a resampling technique that estimates statistics of the population by repeatedly resampling the dataset [43]. In each round, random data indices are generated and the training data are resampled accordingly to obtain a different retrained model, while the original validation and test data are kept unchanged. Using the predictions of the different retrained models, we obtain 10 samples from which the mean prediction and confidence intervals are constructed.
4.2.3. Deep Ensemble
Deep ensembles are an established method for improving performance by training a set of neural networks with different configurations [44]. We train multiple neural networks with different parameter initializations on the same data. Stochastic optimization and random initialization ensure that the trained networks are sufficiently independent. We train a total of 10 networks, through which predictions are made on the same test samples.
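For both the Deep Ensemble and Bootstrap baselines, the member predictions must be aggregated into a mean forecast and a confidence interval. The minimal sketch below uses empirical member quantiles for the interval, which is one possible choice; the paper does not specify the exact aggregation rule.

```python
import numpy as np

def aggregate_ensemble(member_preds: np.ndarray, alpha: float = 0.05):
    """member_preds: (n_members, n_samples) array of point forecasts from
    the ensemble (or bootstrap) members. Returns the ensemble-mean forecast
    and an empirical (1 - alpha) prediction interval."""
    mean = member_preds.mean(axis=0)
    lower = np.quantile(member_preds, alpha / 2, axis=0)
    upper = np.quantile(member_preds, 1 - alpha / 2, axis=0)
    return mean, lower, upper

# Example: 10 members, 3,661 test records (as in the 2015-2018 test set).
preds = np.random.randn(10, 3661) * 2 + 40
mean, lo, hi = aggregate_ensemble(preds)
```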
4.2.4. Monte Carlo Dropout
MC dropout can be interpreted as performing variational inference and is mathematically equivalent to an approximation of a probabilistic deep Gaussian process [45]. This implementation simplifies the problem by considering only model uncertainty. We apply random dropout with a drop rate of 0.5 during testing, and the reported results average the performance of 10 stochastic forward passes.
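A minimal sketch of test-time MC dropout is given below: dropout layers are kept active during inference, and the predictive mean and spread are estimated from repeated stochastic forward passes. The helper names are illustrative.

```python
import torch
import torch.nn as nn

def enable_mc_dropout(model: nn.Module) -> None:
    """Put the model in eval mode but keep dropout layers sampling."""
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()

@torch.no_grad()
def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_paths: int = 10):
    """Run n_paths stochastic forward passes and return their mean and std."""
    enable_mc_dropout(model)
    samples = torch.stack([model(x) for _ in range(n_paths)], dim=0)
    return samples.mean(dim=0), samples.std(dim=0)
```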
4.3. Experimental Setting
The training strategy adopted was to train a separate model for each forecast horizon, i.e., to construct separate forecast models for the 6-, 12-, 18-, and 24-hour lead times, since rolling forecasts tend to accumulate error according to previous studies. Specifically, a separate regression was performed for each lead time using data from the current time point and the preceding 18 hours, i.e., four time points of multimodal spatiotemporal data. The batch size was set to 128 for all models, and 128 epochs were trained for each run using the Adam optimizer with a learning rate of 0.001. Note that the comparison models Bootstrap and quantile regression were run 10 times each in order to obtain uncertainty, with the upper bound, lower bound, and 0.5 quantile run separately.
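The per-horizon training setup described above can be summarized by the following sketch. The model constructor and data loader (build_ptcif, make_loader) are hypothetical helpers assumed for illustration; only the loop structure and hyperparameters mirror the stated configuration.

```python
import torch
import torch.nn as nn

LEAD_TIMES = [6, 12, 18, 24]             # a separate model per forecast horizon
BATCH_SIZE, EPOCHS, LR = 128, 128, 1e-3

criterion = nn.GaussianNLLLoss()          # heteroskedastic Gaussian NLL (expects variance)

for lead in LEAD_TIMES:
    model = build_ptcif()                                # hypothetical model constructor
    loader = make_loader(lead, batch_size=BATCH_SIZE)    # hypothetical data loader
    optimizer = torch.optim.Adam(model.parameters(), lr=LR)
    for epoch in range(EPOCHS):
        for x_u, x_v, x_cliper, y in loader:             # 3D u/v fields + CLIPER features
            mu, sigma = model(x_u, x_v, x_cliper)
            loss = criterion(mu, y, sigma ** 2)          # pass the variance, not the std
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```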
We conducted the above experiments with PyTorch 1.10 on Tesla V100 GPUs, with CUDA 11.4, an Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50 GHz, and Ubuntu 18.04.
4.4. Evaluation Protocols
In order to adequately evaluate our model, we examined the following metrics separately.
4.4.1. Deterministic Metrics
To assess the accuracy of point predictions, we use the mean absolute error (MAE), which measures the deviation between predictions and observations:

$$ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right|, $$

where $\hat{y}_i$ and $y_i$ are the prediction and observation, respectively, and $N$ is the number of evaluated samples.
4.4.2. Probabilistic Prediction Metric
To assess the overall performance of probabilistic prediction, the continuous ranked probability score (CRPS) is used; the smaller the CRPS, the better the overall probabilistic performance. The CRPS is defined as

$$ \mathrm{CRPS} = \int_{-\infty}^{+\infty} \left[ F(x) - H\left(x - y\right) \right]^2 \,\mathrm{d}x, $$

where $F$ is the cumulative distribution function (CDF) corresponding to the predictive probability density function (PDF) $f$ of $\hat{y}$, $y$ is the observation, and $H(\cdot)$ is the Heaviside step function.
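Since the model's predictive distribution is Gaussian, the CRPS can be evaluated with a closed-form expression instead of numerical integration; a sketch using the standard closed form for a normal distribution is shown below.

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(y: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """Closed-form CRPS of observations y under N(mu, sigma^2)."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

# Averaged over the evaluation set:
# crps = crps_gaussian(y_obs, mu_pred, sigma_pred).mean()
```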
4.4.3. Interval Forecasting Metrics
To assess the usability of interval forecasts, we use the prediction interval coverage probability (PICP) and the mean width percentage (MWP). PICP is defined as the probability that an observation falls within the prediction interval at confidence level $1-\alpha$, and MWP measures the relative width of the interval. An ideal prediction interval has a high PICP and a low MWP. The interval metrics are compared using 95% confidence intervals by default. The formulas are as follows:

$$ \mathrm{PICP} = \frac{c}{N}, \qquad \mathrm{MWP} = \frac{1}{N} \sum_{i=1}^{N} \frac{U_i - L_i}{\left| y_i \right|}, $$

where $c$ is the number of samples whose observations fall within the prediction interval, $N$ is the total number of samples, and $U_i$ and $L_i$ are the upper and lower bounds of the prediction interval for sample $i$.
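A short sketch of the interval metrics follows; normalizing the interval width by the magnitude of the observed intensity is an assumption of this sketch, consistent with reading MWP as a width percentage.

```python
import numpy as np

def picp(y: np.ndarray, lower: np.ndarray, upper: np.ndarray) -> float:
    """Fraction of observations falling inside [lower, upper]."""
    return float(np.mean((y >= lower) & (y <= upper)))

def mwp(y: np.ndarray, lower: np.ndarray, upper: np.ndarray) -> float:
    """Mean interval width relative to the observed value
    (normalization by |y| is an assumption of this sketch)."""
    return float(np.mean((upper - lower) / np.abs(y)))
```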
5. Results
5.1. Comparison with Baseline Models
Table 2 shows detailed results of PTCIF against the baseline models, with the 6–24 hour forecasts reported separately for each year in the test set. Overall, the PTCIF and Bootstrap models give the best results for almost all indicators, with Bootstrap performing best on CRPS and MWP and PTCIF generally best on MAE and PICP. The table also shows that, although PTCIF is not the best on every indicator, its results are relatively stable across indicators in different years and for different forecast horizons.
In order to evaluate the different models more comprehensively, we computed the mean values of the evaluation metrics over all years of the test set. The results are depicted in Figure 3, which presents line charts of MAE, CRPS, PICP, and MWP for the various prediction horizons. The forecasting results are analyzed per indicator below.
(1) Regarding the deterministic evaluation metric, MAE, both the PTCIF and Bootstrap models exhibit relatively superior performance, and PTCIF outperforms Bootstrap at the 24-hour horizon. It is worth noting that the QR model converges suboptimally compared with the other configurations, resulting in a noticeable gap in its performance. Furthermore, in conjunction with the findings reported in Table 2, MC dropout behaves erratically in terms of MAE across different years; over the entire test set, its 12-hour MAE is even worse than its 6-hour MAE.
(2) With respect to the probabilistic evaluation metric, CRPS, Bootstrap exhibits superior performance at the 6- and 12-hour horizons, whereas PTCIF outperforms the other methods at the 18- and 24-hour horizons. Notably, Bootstrap's probabilistic performance deteriorates considerably at these two lead times, making it the worst-performing model for the 24-hour forecasts. Furthermore, as with MAE, the CRPS of MC dropout is also erratic.
(3) The joint comparison of the interval metrics, PICP and MWP, provides a more comprehensive assessment of the models. In terms of interval coverage, PTCIF significantly outperforms the other methods; combined with Table 2, PTCIF achieves interval coverage of nearly 100% at the 6- and 12-hour horizons in most years. However, when considering interval widths, PTCIF consistently produces the widest intervals, even broader for the 6-hour forecast than for the 24-hour forecast, which is a significant limitation of the method. Deep ensemble also exhibits reasonable interval coverage, but its MAE and CRPS are suboptimal and its intervals are relatively wide, which restricts its practical applicability.
Figure 3: Mean evaluation metrics over the 2015–2018 test set for different prediction horizons: (a) MAE, (b) CRPS, (c) PICP, and (d) MWP.
In contrast, the intervals produced by Bootstrap are excessively narrow, so they often fail to capture the actual values, which is reflected in the low MWP. This tendency towards overconfidence may be attributed to the fact that the method relies solely on data resampling, which fails to account for sufficient uncertainty. The results obtained for MC dropout and Bootstrap exhibit similar characteristics, which we attribute to the excessive complexity of the spatiotemporal model and the inadequate randomness of the perturbations incorporated in these methods, which fail to provide adequate uncertainty for this task.
5.2. Comparison with Official Forecasts
We further examine the forecast performance of PTCIF compared with currently operational methods. Table 3 compares PTCIF with the official forecasts of meteorological agencies and currently operational numerical models on the test set at the 24-hour lead time. These include the CMA and the Japan Meteorological Agency (JMA), as well as the US Navy-operated Joint Typhoon Warning Center (JTWC), the European-operated ECMWF Integrated Forecasting System (ECMWF-IFS), the US-operated NCEP Global Forecast System (NCEP-GFS), and the baseline statistical model CLIPER, whose forecasts were evaluated by the WMO Typhoon Committee [46].
The results show that PTCIF outperforms the official forecasts and numerical models in the mean of the deterministic forecasts, which indicates that PTCIF is able to provide probabilistic forecasts while retaining reliable deterministic forecasting capability.
5.3. Case Study
Finally, we selected two typical TCs for independent validation. Figure 4 shows the trajectory and intensity variation, the comparison of 24-hour forecasts, and the average calibration of the two most influential super typhoons in 2018: MANGKHUT and TRAMI. MANGKHUT was a very powerful and catastrophic TC; the storm caused a total of $3.77 billion in damage across multiple nations, along with at least 134 fatalities. TRAMI also caused severe damage, with 5 people killed and over 200 injured.
Figure 4: Track and intensity variation, 24-hour deterministic and interval forecasts, and average calibration for super typhoons MANGKHUT ((a)–(c)) and TRAMI ((d)–(f)).
We began forecasting MANGKHUT when it developed into a tropical storm. It continued to strengthen over the following days and intensified explosively after the early hours of 11 September. After being upgraded to a super typhoon, it maintained its intensity for a long time; its intensity then declined continuously, and it was downgraded to a tropical storm after successive landfalls in the Philippines and Guangdong.
We started forecasting TRAMI after it was upgraded to a tropical storm. It then moved steadily westward and intensified explosively into a super typhoon, after which it stalled in the western Pacific Ocean with decreasing intensity. After weakening to a typhoon, it maintained its intensity for a long time and made landfall in Japan, causing significant damage; its intensity then weakened rapidly owing to the influence of land.
Figures 4(b) and 4(e) show the 24-hour deterministic and interval forecasts of PTCIF. Overall, our model captures the trend of TC intensity, but the predicted values are somewhat low relative to the actual values, which we attribute to the small number of such powerful TCs in the dataset. Nevertheless, it is worth noting that the prediction intervals almost completely cover the actual values, suggesting that PTCIF can play an important role in practical risk decisions.
One of the most important plots for assessing the uncertainty of model predictions is the average calibration plot [47]. The calibration plot shows the expected proportion of test data falling within the prediction interval on the x-axis and the observed proportion on the y-axis. From Figures 4(c) and 4(f), we can see that our model lies above the diagonal overall and covers all observed values at roughly the 90% prediction interval. This indicates that our probabilistic forecasting model is underconfident; i.e., it tends to produce overly wide prediction distributions and prediction intervals.
As a case study, we examine the typhoon MANGKHUT to further illustrate the forecast results obtained. The analysis of the results is performed for each of the four forecast intervals separately, as illustrated in Figure 5. The deterministic forecast results indicate that the intensity forecast performance deteriorates gradually with increasing lead time. Notably, the results for the 6- and 12-hour forecast intervals exhibit better alignment with the actual values.
Figure 5: Forecast results for typhoon MANGKHUT at the (a) 6-, (b) 12-, (c) 18-, and (d) 24-hour horizons.
The analysis of interval forecasts indicates that the interval widths remain relatively constant across different forecast horizons. This aspect is a significant limitation for short-term forecasts, as the probabilistic forecast results for the 6- and 12-hour intervals fail to provide a useful reference for operational purposes due to the wide interval widths. However, accurate deterministic forecasts for these intervals can compensate for this limitation. In contrast, probabilistic forecasts for longer-term intervals may provide better guidance when the performance of deterministic forecasts declines.
To provide a more precise analysis of the deterministic and interval forecasts, a scatter plot with error bars is presented in Figure 6. The error bars represent the forecast intervals at an 80% confidence level, while the green dots represent the forecasted values. The plot indicates that the forecasted values fit well with the red ideal diagonal line, suggesting that the model does not produce highly erroneous forecasts.
Figure 6: Scatter plots of predictions against observations with 80% forecast-interval error bars at the (a) 6-, (b) 12-, (c) 18-, and (d) 24-hour horizons.
Furthermore, for all forecast periods, nearly all the error bars fully cover the red ideal diagonal line, demonstrating the validity of the interval forecasts. However, for the 6-hour period, there is an overly wide error bar width, which indicates that the interval widths may be a limitation for short-term forecasts.
Figure 7 depicts the probability density curves for the 24-hour forecasts of typhoon MANGKHUT at three specific time points, selected at 00:00 at fixed time intervals. The actual values fall within the probability interval with only minor deviations, and the forecasts for the first two time points are remarkably precise, nearly aligning with the probability maximum. This observation serves as evidence of the effectiveness of the probabilistic forecasts.
Figure 7: Probability density curves of the 24-hour forecasts for typhoon MANGKHUT at three time points ((a)–(c)).
5.4. Ablation Study
To verify the effectiveness of each module of the proposed model, we conducted a module ablation study; to verify the contribution of multimodal data to TC prediction, we additionally conducted a data ablation study.
Table 4 compares both the module ablations and the data ablations with the proposed method. The table reports the deterministic forecast results, and the impact of each component on the final results can be gauged using MAE. Ablating the CBAM module means that CBAM is removed from the model, and ablating the encoder means replacing it with a simple module consisting of a convolution followed by max pooling. Data ablation, i.e., direct removal of a data modality, is realized in the model by removing the corresponding branch. The results show that, overall, the proposed model obtains the best prediction at almost all lead times and that each proposed submodule has a positive impact on the final prediction. The results also show that removing either modality has a large impact on the results; in particular, the constructed CLIPER features make the greatest contribution to the final results, demonstrating the necessity of using multimodal data.
6. Conclusion
We propose PTCIF, which is the first known study that utilizes deep learning to quantify forecast uncertainty for tropical cyclones. Our framework leverages multimodal meteorological data to achieve reliable deterministic forecasts and probabilistic forecasts for TC intensity simultaneously. Our experimental findings indicate that our model performs similarly to official forecasts for 24-hour deterministic TC intensity forecasts and even surpasses the forecasts produced by currently operating global numerical models. Comparing our model with the probabilistic forecast baseline model, PTCIF offers more robust probabilistic forecasts in the 6–24 hour range. Additionally, we conducted a case study involving two super typhoons in 2018, and the results demonstrate that the forecast intervals almost entirely encompass the actual values.
The findings demonstrate that PTCIF performs better than the comparison models in long-term deterministic, probabilistic, and interval forecasting. However, its primary limitation is observed in short-term forecasts, where excessively wide forecast intervals result from large predicted variances, introducing a high level of uncertainty. Therefore, in practical applications, the 24-hour uncertainty forecasts of PTCIF are considered more meaningful. In conclusion, spatiotemporal forecasting is a challenging task complicated by the high-order dependence among space, time, and variables, and the comparison methods exhibit instability. Conversely, PTCIF is capable of learning relatively reasonable uncertainties from multimodal spatiotemporal data and is significantly less computationally intensive than the comparison models.
In summary, our results show that PTCIF, as an intelligent system, is expected to complement existing operational models and provide reliable uncertainty prediction intervals that offer powerful insights for risk avoidance and decision-making.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the Alibaba Group through the Alibaba Research Intern Program, the National Key R&D Program of China (grant no. 2018YFC1406200), the Natural Science Foundation of China (grant nos. U1811464 and U21A6001), the Natural Science Foundation of Shandong Province (no. 405 ZR2019MF012), the Taishan Scholars Fund (grant no. ZX20190157), and the Juan de la Cierva grant IJC2018-038539-I. This project was also supported by the Key Laboratory of Environmental Change and Natural Disaster of the Ministry of Education, Beijing Normal University (project no. 2022-KF-08), the Key Laboratory of Marine Hazards Forecasting, Ministry of Natural Resources (project no. LOMF2202), and the Innovation Fund Project for Graduate Students of China University of Petroleum (East China) (project no. CXJJ-2022-08).