Abstract
Sound speed profile (SSP) inversion is usually performed by linear statistical regression, such as the single empirical orthogonal function regression (sEOF-r) model. However, because ocean dynamics are complex, the relationship between surface and subsurface parameters is not strictly linear, and linear inversion often gives unsatisfactory results. In this study, an SSP inversion algorithm based on the random forest (RF) integrated learning model is proposed. Using sea surface temperature anomaly (SSTA) and sea surface height anomaly (SSHA) data, the SSP of the upper 1000 m layer in the South China Sea was reconstructed, and its accuracy was evaluated through the root mean square error (RMSE). The evaluation showed that the RF model reconstructed the SSP of the upper 1000 m layer more accurately than the sEOF-r model, reducing the average reconstruction error by 0.56 m/s. The linear regression of the sEOF-r model fell short of expectations when relating surface and subsurface parameters. By removing the constraints of linear inversion, the nonlinear regression of the RF model gave a smaller RMSE and better robustness in the reconstruction process and was superior to the sEOF-r model at all depths. The RF model therefore provides an effective integrated learning approach to SSP reconstruction.
1. Introduction
The sound speed profile (SSP) is the distribution of sound speed with ocean depth. It is an important ocean waveguide parameter in underwater acoustics applications, such as underwater target identification, monitoring of the marine environment, and underwater communication [1–3]. The most traditional way to obtain the SSP is to measure it on-site with appropriate equipment, but this is time-consuming and laborious, and the three-dimensional distribution of sound speed over a large area cannot be obtained in real time. Satellite remote sensing can observe sea areas continuously and at high resolution, providing long-term, large-area data that meet the need for wide coverage and timeliness. Nevertheless, the data are limited to the ocean surface, and this technology cannot directly detect information at greater depths. Most oceanic processes have sea surface signatures through which subsurface dynamic phenomena are reflected [4]; the thermohaline structure, for example, largely depends on surface ocean dynamics [5].
With the development of satellite observation systems and Argo data sets, an increasing number of studies are attempting to retrieve and reconstruct significant ocean information, such as the sound speed profile, by establishing dynamic models, empirical statistical models, or data assimilation [6]. Previous research established the relationship between sea surface parameters and the internal SSP based on the first baroclinic mode [7–10] and showed that the calculation accuracy was very stable in deep sea areas. Carnes's statistical analysis in the Gulf of Mexico showed that remote sensing parameters had functional relations with empirical orthogonal functions (EOFs) [11], and LeBlanc proved that the EOF was the basis function with the minimum error in SSP reconstruction [12]. Based on the functional relationship between remote sensing parameters and EOF projections, Carnes successfully predicted the temperature profile in the northwestern Pacific and northwestern Atlantic Oceans. This method is referred to as single empirical orthogonal function regression (sEOF-r) [13]. It was continuously supplemented and applied to different sea areas through the modular ocean data assimilation system (MODAS) to obtain dynamic climate profiles and, subsequently, the underwater structures [14, 15]. However, the marine environment is a complex nonlinear system, and it is particularly challenging to describe it accurately as a linear system.
In recent years, machine learning algorithms have developed rapidly and have also been applied to this field. Going beyond traditional regression, Ali et al. used a neural network to predict temperature profiles from sea surface parameters [5], and Wu et al. used a self-organizing map (SOM) neural network to reconstruct the subsurface temperature distribution from multiple surface observation data [16]. Based on remote sensing parameters, Su and Li used support vector regression (SVR), a classical machine learning method, to predict the global ocean temperature profile above 1000 m [17, 18]. Subsequently, Su applied an integrated learning algorithm, extreme gradient boosting (XGBoost), to predict the thermohaline profile of the global ocean above 2000 m [19]. Chen successfully reconstructed the SSP over 1000 m in the northwestern Pacific Ocean using the SOM method [20, 21], and Li et al. further improved the SOM neural network and successfully reconstructed the SSP in the northern South China Sea [22]. These studies indicate that nonlinear models based on machine learning perform well on problems related to nonlinear ocean dynamics.
In this study, a model based on the random forest (RF) is proposed for SSP estimation from satellite surface observations. The RF model fits a large number of decision trees to different data subsets obtained by random resampling of the training data [23]. Cross-validation was adopted to improve accuracy, and overfitting was controlled by pruning and tuning the decision trees. The RF is an integrated learning model widely used in classification and regression, and it is well suited to remote sensing research [24]. It has been applied effectively in various remote sensing studies and usually performs very well [25–27]. Using the RF machine learning model, the sea surface height anomaly (SSHA) and sea surface temperature anomaly (SSTA) were combined to estimate the SSP in the South China Sea. The model performance was evaluated, and the RF model was shown to improve the performance of SSP estimation effectively.
2. Methods
2.1. Dimension Reduction Based on EOF
The SSP can be represented by a column vector $\mathbf{c}(z)$, in which each element is the sound speed at one sampled depth $z$. It is usually expressed as
$$\mathbf{c}(z) = \mathbf{c}_0(z) + \sum_{k=1}^{K} a_k \mathbf{f}_k(z), \tag{1}$$
where $\mathbf{c}_0(z)$ is the constant component of the SSP, that is, the background profile representing the long-term stable ocean background, $\mathbf{f}_k(z)$ is the empirical orthogonal function (EOF), $a_k$ is the projection coefficient of the EOF, and the subscript $k$ is its order. In general, taking reconstruction accuracy and noise suppression into consideration, the first 3 to 5 EOF orders are used to reconstruct the profile. Based on previous studies on profile reconstruction in the sea areas, the first five EOF modes ($K = 5$) were selected to reconstruct the SSP.
The EOF is the most commonly used perturbation function for solving SSP inversion problems [28]. To constrain the search space of the inversion, dimensionality reduction is applied so that the SSP can be described by a small number of parameters. By extracting the principal components of the SSP sample matrix, the main modes of sound speed perturbation can be identified and the noise can be reduced. The SSP anomaly matrix $\Delta\mathbf{C}$, obtained by subtracting the background mean SSP from the SSP sample matrix, is an $M \times N$ matrix, where $M$ refers to the number of discrete depths in each profile and $N$ refers to the total number of samples. The EOF vectors can be obtained through principal component analysis, as follows:
$$\mathbf{R}\,\mathbf{f}_k = \lambda_k \mathbf{f}_k, \tag{2}$$
where $\mathbf{R}$ is the covariance matrix of $\Delta\mathbf{C}$, $\lambda_k$ is the eigenvalue, and $\mathbf{f}_k$ is the EOF vector. By regressing $\Delta\mathbf{C}$ on the EOF vectors $\mathbf{f}_k$, the projection coefficients $a_k$ can be obtained.
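The decomposition described above can be sketched in a few lines of NumPy. This is only an illustrative sketch on synthetic data: the function name `eof_decompose`, the synthetic background profile, and the mode shapes are assumptions made for the example and are not taken from the study. The covariance matrix is normalized by the number of samples, which is one common convention.

```python
import numpy as np

def eof_decompose(ssp, background, n_modes=5):
    """Return leading EOFs, projection coefficients, and explained variance.

    ssp: (M depths, N samples) sound speed matrix.
    background: (M,) background profile subtracted from every sample.
    """
    anomaly = ssp - background[:, None]            # M x N anomaly matrix
    cov = anomaly @ anomaly.T / anomaly.shape[1]   # M x M covariance (1/N convention)
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_modes]    # indices of the leading modes
    eofs = eigvecs[:, order]                       # M x K orthonormal EOF vectors
    coeffs = eofs.T @ anomaly                      # K x N projection coefficients
    explained = eigvals[order].sum() / eigvals.sum()
    return eofs, coeffs, explained

# Synthetic stand-in data (hypothetical, not oceanographically realistic).
rng = np.random.default_rng(0)
depth = np.arange(0, 1001, 5.0)                    # 0-1000 m at 5 m spacing
n_samples = 300
background = 1540.0 - 0.05 * depth
modes_true = np.stack([np.exp(-depth / s) for s in (50.0, 150.0, 400.0)], axis=1)
ssp = background[:, None] + (modes_true @ rng.normal(size=(3, n_samples))) * 5.0
ssp += rng.normal(scale=0.05, size=ssp.shape)      # small measurement noise

eofs, coeffs, explained = eof_decompose(ssp, background)
recon = background[:, None] + eofs @ coeffs        # truncated reconstruction
trunc_rmse = np.sqrt(np.mean((recon - ssp) ** 2))
print(f"explained variance: {explained:.3f}, truncation RMSE: {trunc_rmse:.3f} m/s")
```

Because the EOF vectors are orthonormal, projecting the anomaly matrix onto them is equivalent to a least-squares regression of the anomalies on the EOFs.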
2.2. SSP Estimation Based on Remote Sensing Parameters
2.2.1. sEOF-r Model
Based on the remote sensing parameters and the SSP samples obtained at the same time and location, the linear relationship between surface and subsurface parameters can be determined. Based on the linear functional relation among SSHA, SSTA, and the EOF coefficients, the projection coefficient $a_k$ of each EOF order can be calculated as [29]
$$a_k = b_{k,0} + b_{k,1}\,\mathrm{SSHA} + b_{k,2}\,\mathrm{SSTA}, \tag{3}$$
where $b_{k,1}$ and $b_{k,2}$ are the coefficients obtained from the linear regression, $k$ is the EOF order, and $b_{k,0}$ is the constant term coefficient. After the regression coefficients are obtained from the training dataset, the EOF coefficients can be inverted using the SSHA and SSTA as input parameters, and the SSP can then be reconstructed through (1). The sEOF-r model thus rests on a linear regression between the sea surface remote sensing parameters and the EOF projection coefficients. This linear relationship is in turn based on statistics from a large number of samples in a specific sea area, so differences between the characteristics of an individual profile and the statistical characteristics may lead to errors.
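As a rough illustration of this regression step, the sketch below fits one linear model per EOF order by ordinary least squares. It assumes the regression consists of a constant term plus terms linear in SSHA and SSTA; the function names `fit_seof_r` and `predict_seof_r` and all data are hypothetical.

```python
import numpy as np

def _design(ssha, ssta):
    # Columns: constant term, SSHA, SSTA (assumed form of the linear model).
    return np.column_stack([np.ones_like(ssha), ssha, ssta])

def fit_seof_r(ssha, ssta, coeffs):
    """Least-squares fit for every EOF order at once.

    ssha, ssta: (N,) surface anomalies; coeffs: (K, N) projection coefficients.
    Returns the regression coefficients with shape (3, K).
    """
    b, *_ = np.linalg.lstsq(_design(ssha, ssta), coeffs.T, rcond=None)
    return b

def predict_seof_r(b, ssha, ssta):
    """Predict (K, N) projection coefficients from surface anomalies."""
    return (_design(ssha, ssta) @ b).T

# Synthetic check: recover known regression coefficients from noisy data.
rng = np.random.default_rng(0)
n, k = 500, 5
ssha = rng.normal(scale=0.1, size=n)      # metres, illustrative scale
ssta = rng.normal(scale=1.0, size=n)      # degrees Celsius, illustrative scale
b_true = rng.normal(size=(3, k))
coeffs = (_design(ssha, ssta) @ b_true).T + rng.normal(scale=0.01, size=(k, n))

b_est = fit_seof_r(ssha, ssta, coeffs)
a_pred = predict_seof_r(b_est, ssha, ssta)
```

The predicted coefficients would then be combined with the EOF vectors and the background profile to rebuild each SSP.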
2.2.2. Random Forest Model
Figure 1 shows the flow chart of SSP estimation using the RF model. The whole estimation process was divided into four steps.

First, a training data set was established, which included the remote sensing parameters SSTA and SSHA and the corresponding SSP data (from 2009 to 2017 in this paper). The longitude and latitude of each SSP were input as their cosine values (LON and LAT), and the date was input as the ordinal day of the year (DATE). The whole training dataset was input into the RF model for training. Six projection coefficients separated from the Argo SSPs were used as the training and test labels, where $a_0$ corresponded to the 0-order constant term and $a_1, \ldots, a_5$ corresponded to the selected orders of the principal components.
In the second step, the RF model was trained. The RF model parameters were first optimized one by one using learning curves, and a grid was then established to search the region around each single-parameter optimum. The optimal parameter combination was obtained after multiple screenings, and the RF model was established based on this combination. In the third step, the test dataset was input into the RF model to obtain the projection coefficients $a_k$. Finally, in the fourth step, the SSP was reconstructed from the projection coefficients, and the accuracy of the model was evaluated using the root mean square error (RMSE).
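A minimal sketch of how one input vector might be assembled from these five quantities is given below. The function name `encode_inputs` and the sample values are hypothetical, and converting degrees to radians before taking the cosine is an assumption, since the text does not state the angular unit.

```python
import math
from datetime import date

def encode_inputs(ssta, ssha, lon_deg, lat_deg, day):
    """Build one feature vector [SSTA, SSHA, LON, LAT, DATE]."""
    lon = math.cos(math.radians(lon_deg))   # assumption: degrees -> radians first
    lat = math.cos(math.radians(lat_deg))
    doy = day.timetuple().tm_yday           # ordinal day of the year, 1..366
    return [ssta, ssha, lon, lat, doy]

# Hypothetical sample inside the 12-20 N, 110-120 E study box.
features = encode_inputs(0.4, 0.12, 115.0, 16.0, date(2018, 3, 1))
print(features)
```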
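The training and prediction steps might look roughly like the following scikit-learn sketch. It is not the configuration used in the study: the data are synthetic, the parameter grid is an arbitrary example, and only a cross-validated grid search is shown, without the preceding one-parameter-at-a-time learning-curve stage.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
n_train, n_test = 400, 50

# Columns stand in for SSTA, SSHA, LON, LAT, DATE (synthetic values).
X = rng.uniform(-1.0, 1.0, size=(n_train + n_test, 5))
# Six labels standing in for the projection coefficients, with an
# arbitrary nonlinear dependence on the inputs (illustrative only).
Y = np.column_stack(
    [np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * k * X[:, 4] for k in range(6)]
)
Y += rng.normal(scale=0.05, size=Y.shape)

X_train, X_test = X[:n_train], X[n_train:]
Y_train, Y_test = Y[:n_train], Y[n_train:]

# Hypothetical search grid; real ranges would come from the tuning stage.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [6, None],
    "min_samples_leaf": [1, 3],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, Y_train)               # multi-output regression on all six labels

Y_pred = search.best_estimator_.predict(X_test)
coef_rmse = np.sqrt(np.mean((Y_pred - Y_test) ** 2))
print(search.best_params_, round(float(coef_rmse), 3))
```

Fitting a single multi-output forest keeps the example short; training one forest per coefficient would be an equally plausible design.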
3. Data
The satellite remote sensing data used in this study were SSTA and SSHA data. The former was obtained from the National Oceanic and Atmospheric Administration (NOAA) data center and the latter from the Archiving, Validation, and Interpretation of Satellite Oceanography (AVISO) data. Their temporal and spatial resolutions were 1 day and 0.25° × 0.25°, respectively [30].
The background profile was taken from the climatological profile data of the World Ocean Atlas 2013 (WOA13) (https://www.nodc.noaa.gov/OC5/woa13/), which contains temperature, salinity, density, and other information for global sea areas, as well as the average climate state of measured data. These data are available as annual, seasonal, and monthly averages at three spatial resolutions of 0.25°, 1°, and 5° [31]. In this study, the annual average data from 2009 to 2018 were selected at a spatial resolution of 1° × 1°.
The SSP samples were derived from Argo buoy data. These data were taken from the global ocean Argo scattered data set (http://ftp.argo.org.cn/pub/ARGO/global/) and included all the thermohaline profiles measured in the South China Sea between 2009 and 2018. Each thermohaline profile was converted into an SSP by using the empirical formula of sound speed [32] and was then linearly interpolated onto standard depths over 0–1000 m at a sampling interval of 5 m.
The South China Sea was selected as the reconstruction area. The topography of the area is a basin. Because of the monsoon and the complex topography of the South China Sea, the ocean dynamics are complex, and eddies and internal solitary waves make profile inversion particularly difficult. The combination of these factors provides a demanding test of the proposed model. The experiment used data covering the area between 12° and 20°N and between 110° and 120°E. Figure 2 shows all samples calculated from the Argo data, a total of 3881 SSPs spanning 2009 to 2018. As shown in Figure 3, the 3757 SSPs from 2009 to 2017 were used as the training set. The 124 SSPs from 2018 were used as the test set.
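The conversion and interpolation steps can be sketched as follows. The empirical formula cited as [32] is not spelled out here, so the sketch substitutes Mackenzie's (1981) nine-term equation purely as an assumption; the function names and the sample profile are likewise hypothetical.

```python
import numpy as np

def sound_speed_mackenzie(temp_c, sal_psu, depth_m):
    """Sound speed (m/s) from Mackenzie's (1981) nine-term equation.

    Used here as a stand-in; the study's own empirical formula may differ.
    """
    t, s, d = (np.asarray(v, dtype=float) for v in (temp_c, sal_psu, depth_m))
    return (1448.96 + 4.591 * t - 5.304e-2 * t**2 + 2.374e-4 * t**3
            + 1.340 * (s - 35.0) + 1.630e-2 * d + 1.675e-7 * d**2
            - 1.025e-2 * t * (s - 35.0) - 7.139e-13 * t * d**3)

def to_standard_depths(depth_m, speed, step=5.0, max_depth=1000.0):
    """Linearly interpolate a profile onto 0..max_depth at the given step."""
    grid = np.arange(0.0, max_depth + step, step)
    # np.interp holds the end values constant outside the measured range,
    # so levels shallower than the first sample reuse the first value.
    return grid, np.interp(grid, depth_m, speed)

# Hypothetical thermohaline profile (depth in m, temperature in deg C, salinity).
depth = np.array([5.0, 50.0, 100.0, 200.0, 500.0, 1000.0])
temp = np.array([28.0, 26.0, 20.0, 14.0, 8.0, 4.5])
sal = np.array([33.5, 33.8, 34.4, 34.5, 34.4, 34.5])

grid, ssp = to_standard_depths(depth, sound_speed_mackenzie(temp, sal, depth))
print(grid.size, ssp[:3])
```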


The EOFs are perturbation modes that describe most of the characteristics of the perturbation in the sound speed profile. They reduce the dimension of the perturbation in a large number of samples and give it a compact description. Figure 4 shows the first five perturbation functions after EOF normalization. According to the EOF amplitude distribution, the sound speed perturbation occurred mainly above 300 m, and near a depth of 1000 m the amplitude of the leading modes was close to zero. Therefore, we focused on the reconstruction of the SSP above 1000 m.

Table 1 shows the variance contribution rate and error of the first five EOF modes. The first five modes accounted for 96.5% of the total variance, indicating that they could explain most of the variation in the data. The average reconstruction error using the first five EOF modes was 0.60 m/s, demonstrating that these modes could reconstruct the profile accurately without introducing too much noise. Therefore, they were selected to reconstruct the SSP in the subsequent comparison between the sEOF-r and RF models.
4. Results
In this study, the performance of the RF model and the linear regression sEOF-r model in the South China Sea was compared using the root mean square error, calculated for both models by the following equation:
$$\mathrm{RMSE} = \sqrt{\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(\hat{C}_{ij} - C_{ij}\right)^{2}}, \tag{4}$$
where $\hat{\mathbf{C}}$ is the reconstructed SSP matrix, $\mathbf{C}$ is the actual measured SSP matrix, $M$ is the number of discrete depth points, and $N$ is the total number of samples. The root mean square error of the sEOF-r model was 2.34 m/s and that of the RF model was 1.77 m/s, a reduction in reconstruction error of 24.14%. This comparison indicates that the nonlinear RF model was more suitable for SSP inversion than the linear sEOF-r model.
Figure 5 shows the error of each SSP reconstructed in the test set. The reconstruction accuracy of the RF model was significantly higher than that of the sEOF-r model, except for a few samples. The maximum and average errors of a single SSP in the sEOF-r model were 5.02 m/s and 2.20 m/s, respectively, while the corresponding values in the RF model were 4.34 m/s and 1.66 m/s. The RF model removed the constraints of linear inversion and introduced parameters related to position and time to reduce the reconstruction error. In addition, SSP reconstruction based on the sEOF-r model relies on a linear regression over a large number of samples. Therefore, the larger the difference between the characteristics of an individual profile and the statistical characteristics, the larger the error, which explains the large number of peaks in Figure 5. For such profiles, the nonlinear RF model was clearly superior to the sEOF-r model. Through the regression of decision trees, the RF model indicated that the functional relationship between ocean parameters tends to be nonlinear, and it imposed no unified transformation or linear constraint on the input ocean parameters. The results showed that the RF model was more robust and reconstructed the SSP better than the linear regression.
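For concreteness, the error metric and the relative error reduction can be computed as in the short sketch below. The helper names `rmse` and `relative_reduction` and the small matrices are illustrative assumptions, not values from the experiment.

```python
import numpy as np

def rmse(c_hat, c_obs):
    """Root mean square error over all depths and all samples (m/s)."""
    c_hat = np.asarray(c_hat, dtype=float)
    c_obs = np.asarray(c_obs, dtype=float)
    return float(np.sqrt(np.mean((c_hat - c_obs) ** 2)))

def relative_reduction(rmse_ref, rmse_new):
    """Percentage by which rmse_new is lower than rmse_ref."""
    return 100.0 * (rmse_ref - rmse_new) / rmse_ref

# Tiny illustrative matrices: rows are depths, columns are samples.
c_obs = [[1540.0, 1541.0], [1520.0, 1522.0]]
c_hat = [[1541.0, 1541.0], [1518.0, 1522.0]]

example_rmse = rmse(c_hat, c_obs)               # errors are 1, 0, -2, 0
example_gain = relative_reduction(2.0, 1.5)     # hypothetical reference and new RMSE
print(example_rmse, example_gain)
```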

The reconstruction errors of the two models at different depths are shown in Figure 6. The RF model was superior to the sEOF-r model at every depth. Because the remote sensing parameters directly affect sound speed near the surface, the error was small there. At a depth of around 100 m, the water column was strongly affected by the mixed layer, seasonal and diurnal variation, and internal waves, so the relationship between water temperature and the other parameters in this range was not linear and the error was large. Below 200 m, temperature and salinity varied less, so the error gradually decreased with depth. The error variation in Figure 6 is consistent with the perturbation shape of the first five EOF modes in Figure 4. The large errors were mainly in the range of 50–200 m. The maximum errors of the sEOF-r and RF models were 4.78 m/s and 3.97 m/s, respectively, and both occurred at the same discrete depth point. At this depth, the error of the RF model was 0.81 m/s lower than that of the sEOF-r model, the greatest improvement among all the discrete depth points, indicating that the RF model was significantly superior to the sEOF-r model in the depth interval where nonlinearity is strongest.
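A depth-resolved error curve of this kind can be obtained by averaging the squared error over samples only, as in the sketch below; averaging over depths instead gives one error per profile. The data are synthetic, with an error level that is deliberately made to peak near 100 m, and the function names are hypothetical.

```python
import numpy as np

def rmse_by_depth(c_hat, c_obs):
    """RMSE across samples at each depth; inputs are (M depths, N samples)."""
    return np.sqrt(np.mean((c_hat - c_obs) ** 2, axis=1))

def rmse_by_profile(c_hat, c_obs):
    """RMSE across depths for each profile; inputs are (M depths, N samples)."""
    return np.sqrt(np.mean((c_hat - c_obs) ** 2, axis=0))

rng = np.random.default_rng(1)
depth = np.arange(0.0, 1005.0, 5.0)                 # 0-1000 m at 5 m spacing
n_samples = 124

# Synthetic error level (m/s) peaking near 100 m; purely illustrative.
sigma = 0.5 + 4.0 * np.exp(-((depth - 100.0) / 50.0) ** 2)
c_obs = np.full((depth.size, n_samples), 1500.0)
c_hat = c_obs + sigma[:, None] * rng.normal(size=c_obs.shape)

depth_err = rmse_by_depth(c_hat, c_obs)
profile_err = rmse_by_profile(c_hat, c_obs)
worst_depth = depth[np.argmax(depth_err)]           # depth of the largest error
print(worst_depth, round(float(depth_err.max()), 2))
```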

Figure 7 shows the first SSP of each month among the reconstructed SSPs. The errors in December, January, February, July, and August were significantly greater than those in the other months. The mixed layer becomes shallower in summer and deeper in winter, so in these months the nonlinear depth range deviates strongly from the annual average, which in turn produces a larger reconstruction error. The results in Figure 7 are consistent with those in Figures 5 and 6, indicating that the results of the RF model were in line with the actual profiles, whereas the sEOF-r model showed a large error in many samples.

5. Conclusions
In this paper, an algorithm for SSP inversion based on the RF model was proposed. The remote sensing parameters SSHA and SSTA, the latitude and longitude (LAT and LON), and the measurement date (DATE) were used as inputs to the model, from which the EOF coefficients were retrieved and the SSP was reconstructed. In the South China Sea experiment, with RMSE used to evaluate model performance, the RF model showed higher accuracy than the sEOF-r model. The reconstruction error of the latter was 2.34 m/s, while that of the former was 1.78 m/s, an error reduction of 24.14%. In addition, the RF model proved to be more effective and robust than the sEOF-r model in reconstructing ocean SSP information, regardless of season and depth. The analysis of the RF model also gave a preliminary indication of the relationship between sea surface parameters and the SSP based on perturbation transfer. The experiment showed that the linear relationship derived from a large amount of data was not strictly applicable at certain depths, which limited the accuracy of the sEOF-r model. In contrast, the RF model is not restricted to a simple linear fit or to an analytic expression, so it can capture the relationship between the parameters more accurately, and it can accommodate additional input parameters, such as location, time, heat flow, and wind speed. This approach is expected to be extended to reconstructing the SSP of the global ocean using long time series and to provide effective technical support for underwater acoustic applications.
Data Availability
SST data were obtained from the National Oceanic and Atmospheric Administration of the United States, and SSH data were obtained from the Archiving, Validation, and Interpretation of Satellite Oceanography (AVISO) data. The background profile was taken from the World Ocean Atlas. The SSP samples came from the Global Ocean Argo Scatter Data Collection of the Argo Data Center in China.
Conflicts of Interest
The authors declare that they have no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Authors’ Contributions
K.Q. conceptualized the study. K.Q. and Z.O. developed the methodology and software, validated the study, and did formal analysis. Z.O. and C.L. wrote the original draft. K.Q. and Z.O. reviewed and edited the manuscript. All authors have read and agreed to the published version of the manuscript.
Acknowledgments
This research was funded by the Natural Science Foundation of Guangdong Province under contract No. 2022A1515011519.