Abstract

Existing prediction models achieve low accuracy for surface water pollution, which is influenced by many factors. Taking algal bloom prediction as the entry point for surface water pollution research, the LASSO-LARS algorithm is adopted to select the main factors affecting algal blooms. Combined with a BP neural network, a surface water pollution prediction model based on machine learning is proposed. The results show that the BP neural network model, with the variables selected by the LASSO-LARS algorithm (water temperature, pH, transparency, conductivity, dissolved oxygen, ammonia nitrogen, and chlorophyll a) as inputs, can accurately predict algal blooms, with a relative prediction error of no more than 5.2%. Thus, the proposed method has validity and practical application value.

1. Introduction

Surface water is an important water resource. However, with accelerating urbanization, water pollution is becoming increasingly serious and threatens the safety of domestic water. In recent years, water quality safety has received attention from all walks of life, and various water treatment technologies and water management methods have emerged to improve water quality and strengthen the safety of domestic water. Zhao et al. proposed an in situ loading method for preparing nanofiber composite aerogel from aromatic nanofibers (ANFs) and prepared a high-performance 3D ANF/MnO2 composite aerogel that can contain more than 90% of the Pb2+ in water, which provides a promising approach to treating Pb2+ in water pollution [1]. Ma et al. combined chitosan-based modified polymer (MCS) with Fe3O4 and SiO2 by means of a silane coupling agent and successfully prepared a new type of recyclable covalently bonded magnetic flocculant (FS-MC), which can effectively remove emulsified oil from water [2]. Huang et al. used polydimethylsiloxane (PDMS) as the dispersing phase and glutaraldehyde as the crosslinking agent, with a Pickering emulsion stabilized by chitin nanofiber (ChNF), to prepare a hydrophobic aerogel that selectively removes nonaqueous phase HOCs from water and has broad application prospects in HOC-related water treatment [3]. Saima prepared nitrogen-doped graphene oxide (NGO) nanocomposites (NGO/BQDS-TiO2) and a bismuth oxide quantum dot (BQD)-doped TiO2 composite photocatalyst, which are used to degrade 2,4-dichlorophenol (2,4-DCP) and other organic pollutants and stable dyes [4]. Sun et al. prepared a new type of surface molecularly imprinted polymer (SMIP), which can achieve selective separation and recovery of metal complex dyes in wastewater, providing a theoretical basis for the treatment of metal complex dyes in wastewater [5]. Saravanan et al. [6] discussed treatment methods for various pollution sources in water, such as heavy metals, dyes, pathogens, and organic matter, from physical, chemical, and biological aspects and concluded that membrane separation and adsorption methods are more suitable for water pollution control. Doyle and Wu et al. prepared a polyvinylidene fluoride (PVDF) composite ultrafiltration membrane (PGA membrane) by blending modification with GO and adsorption and filtration of silver carbonate (Ag2CO3) on the membrane surface, which improves the antifouling performance and permeability of the PVDF ultrafiltration membrane and effectively treats slightly polluted surface water [7, 8].

As can be seen, current research on water pollution focuses mainly on treatment, and there are fewer studies on prediction. Putri et al. used the Stella model to predict the pollution of a river, and the results show that this method can accurately predict the pollution of each section [9]. Wang et al. proposed a water pollution prediction method based on grey correlation, in which the main influencing factors of water pollution are identified by the grey correlation algorithm and the water pollution is predicted by a white differential equation. The results show that the prediction accuracy of this method is high and the error is small [10]. Wang et al. applied the SVM algorithm to water pollution prediction. The results show that this method has good accuracy for small samples and nonlinear data [11]. Prediction is key to water pollution treatment. Therefore, in order to better deal with water pollution, the LASSO algorithm and the BP neural network are adopted to propose a surface water pollution prediction model based on machine learning.

2. Basic Methods

2.1. LASSO Algorithm

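The LASSO algorithm is a variable selection method. It penalizes the sum of the absolute values of the model coefficients, which shrinks the regression coefficients and drives some of them to exactly zero, so that variable selection and parameter estimation are achieved at the same time [12]. The penalty on the variable coefficients is chosen with regard to the accuracy, interpretability, stability, and complexity of the model. Assume that there are data $(x_i, y_i)$, $i = 1, 2, \ldots, N$, where $x_i = (x_{i1}, \ldots, x_{ip})^T$ and $y_i$ are the independent variables and the response variable corresponding to observation $i$. The linear regression model can be expressed as follows:

$$y_i = \alpha + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i. \tag{1}$$

Generally, in regression structures, the $y_i$ are assumed to be conditionally independent given the $x_{ij}$ [13]. The $x_{ij}$ are assumed to be standardized; then, there are

$$\frac{1}{N}\sum_{i=1}^{N} x_{ij} = 0, \qquad \frac{1}{N}\sum_{i=1}^{N} x_{ij}^{2} = 1. \tag{2}$$

For a tuning parameter $t \ge 0$, the LASSO estimate is

$$(\hat{\alpha}, \hat{\beta}) = \arg\min \left\{ \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^{2} \right\} \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t. \tag{3}$$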

Although the LASSO algorithm has some theoretical advantages over traditional methods such as partial least squares and principal component analysis, it is computationally expensive [14]. To solve this problem, Efron et al. proposed the LARS algorithm, which reselects variables against the current regression residual at each step so as to reduce the residual, and thereby reduces the computation of the LASSO algorithm [15]. Therefore, the LASSO-LARS algorithm is used to select the influencing factor variables of surface water pollution.
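The effect of the L1 penalty can be illustrated with a minimal sketch. It solves the penalized (Lagrangian) form of the LASSO by cyclic coordinate descent rather than by LARS, purely to keep the example short and dependency-free. The function names, the toy data, and the penalty value `lam` are illustrative assumptions and not the implementation used in this study.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def standardize(columns):
    """Scale each column to mean 0 and mean square 1."""
    out = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        centered = [v - mean for v in col]
        scale = (sum(v * v for v in centered) / n) ** 0.5
        out.append([v / scale for v in centered])
    return out


def lasso_cd(columns, y, lam, n_iter=200):
    """Minimize (1/2N)*RSS + lam*sum|beta_j| by cyclic coordinate descent."""
    n, p = len(y), len(columns)
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]  # residual when every beta_j is 0
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            col = columns[j]
            # covariance of column j with the partial residual (beta_j removed)
            rho = sum(col[i] * (resid[i] + beta[j] * col[i]) for i in range(n)) / n
            norm = sum(v * v for v in col) / n
            new_bj = soft_threshold(rho, lam) / norm
            delta = new_bj - beta[j]
            if delta != 0.0:
                resid = [resid[i] - delta * col[i] for i in range(n)]
                beta[j] = new_bj
    return beta


# Toy data (made up): y depends on x1 only; x2 is an irrelevant factor.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
y = [2.0 * v for v in x1]
beta = lasso_cd(standardize([x1, x2]), y, lam=0.5)
print(beta)  # the coefficient of the irrelevant factor is exactly zero
```

In this toy case the irrelevant factor receives a coefficient of exactly zero while the relevant one is only shrunk, which is the behaviour that makes the LASSO usable for variable selection.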

2.2. BP Network

The BP neural network is a typical neural network that is trained through error backpropagation. Its basic structure, shown in Figure 1, consists of three layers: an input layer, a hidden layer, and an output layer [16, 17].

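Assume that the input vector is $X = (x_1, x_2, \ldots, x_n)^T$, the output vector is $O = (o_1, o_2, \ldots, o_l)^T$, and the connection weights between the input layer and the hidden layer and between the hidden layer and the output layer are $V = (v_{ij})$ and $W = (w_{jk})$, respectively. Denoting the output of hidden neuron $j$ by $y_j$, $j = 1, \ldots, m$, for the hidden layer, there are

$$y_j = f(\mathrm{net}_j), \tag{4}$$

$$\mathrm{net}_j = \sum_{i=1}^{n} v_{ij} x_i. \tag{5}$$

For the output layer, there are

$$o_k = f(\mathrm{net}_k), \tag{6}$$

$$\mathrm{net}_k = \sum_{j=1}^{m} w_{jk} y_j. \tag{7}$$

In formulas (4) and (6), $f$ is the transfer function, which is usually the sigmoid function, expressed as [18]

$$f(x) = \frac{1}{1 + e^{-x}}. \tag{8}$$

The derivative function is

$$f'(x) = f(x)\,[1 - f(x)]. \tag{9}$$

The BP neural network adopts the gradient descent method to adjust the weights, and the adjustment formula is [19]

$$\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}}, \qquad \Delta v_{ij} = -\eta \frac{\partial E}{\partial v_{ij}}, \tag{10}$$

where $\eta$ is the learning rate and $E$ is the network output error.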
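The forward pass and the gradient descent weight adjustment can be sketched as follows for a single training sample. This is a minimal illustration that assumes a squared-error loss E = 0.5 * sum((d_k - o_k)^2), omits threshold (bias) terms, and uses made-up weights, inputs, and learning rate; it is not the configuration used in this study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, v, w):
    """Hidden outputs y_j and network outputs o_k for input x."""
    y = [sigmoid(sum(v[i][j] * x[i] for i in range(len(x))))
         for j in range(len(v[0]))]
    o = [sigmoid(sum(w[j][k] * y[j] for j in range(len(y))))
         for k in range(len(w[0]))]
    return y, o


def backprop_step(x, d, v, w, eta=0.5):
    """One gradient-descent update of v and w for a single sample (x, d)."""
    y, o = forward(x, v, w)
    # output-layer error signal; o*(1 - o) is the sigmoid derivative
    delta_o = [(d[k] - o[k]) * o[k] * (1.0 - o[k]) for k in range(len(o))]
    # hidden-layer error signal, propagated back through the current w
    delta_y = [sum(delta_o[k] * w[j][k] for k in range(len(o))) * y[j] * (1.0 - y[j])
               for j in range(len(y))]
    for j in range(len(y)):
        for k in range(len(o)):
            w[j][k] += eta * delta_o[k] * y[j]
    for i in range(len(x)):
        for j in range(len(y)):
            v[i][j] += eta * delta_y[j] * x[i]


def squared_error(d, o):
    return 0.5 * sum((dk - ok) ** 2 for dk, ok in zip(d, o))


# Illustrative 2-3-1 network with arbitrary starting weights.
x, d = [0.2, 0.7], [0.9]
v = [[0.1, -0.2, 0.3], [0.4, 0.1, -0.3]]  # v[i][j]: input i -> hidden j
w = [[0.2], [-0.1], [0.3]]                # w[j][k]: hidden j -> output k
err_before = squared_error(d, forward(x, v, w)[1])
for _ in range(50):
    backprop_step(x, d, v, w)
err_after = squared_error(d, forward(x, v, w)[1])
print(err_before, err_after)
```

Repeating the update on the same sample lowers the squared error, which is the behaviour the gradient descent rule is meant to produce.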

Based on the above analysis, the LASSO-LARS algorithm can select the factor variables affecting surface water pollution and thus determine the main influencing factors needed for accurate prediction. Meanwhile, the BP neural network has the characteristics of fast search speed and strong local optimization ability [20]. Therefore, the variables selected by the LASSO-LARS algorithm are taken as the input of the prediction model, and a surface water pollution prediction model based on the BP neural network is constructed.

3. Surface Water Pollution Prediction Model Based on Machine Learning

3.1. Selection of Influencing Factor Variables of Surface Water Pollution Based on LASSO-LARS

Algal blooms are one of the main causes of surface water pollution, so algal bloom pollution is taken as a representative case to explore surface water pollution prediction based on machine learning. Many factors affect algal blooms, including 234 indexes such as water temperature, dissolved oxygen, and metal ions [21]. Among them, the 16 indexes shown in Table 1, such as water temperature, transparency, and chlorophyll a, are the main factors [22].

The above 16 indicators are selected as candidate variables and log-transformed; the transformed variables are represented as $X_1, X_2, \ldots, X_{16}$. The established linear model is as follows:

$$Y = a_0 + a_1 X_1 + a_2 X_2 + \cdots + a_{16} X_{16} + \varepsilon, \tag{11}$$

where $Y$ is the algal bloom variable; $a_0$ is the constant term; $a_1, \ldots, a_{16}$ are the variable coefficients; and $\varepsilon$ is the random disturbance term.

The LASSO-LARS algorithm is applied to the above influencing factors. According to the Akaike information criterion (AIC), the smaller the AIC value, the better the model [23]. Table 2 shows the values of the variables for which the model is optimal. X1, X3, X5, X7, X9, X12, and X14, namely, water temperature, pH, transparency, conductivity, dissolved oxygen, ammonia nitrogen, and chlorophyll a, have a great influence on algal blooms. Therefore, these 7 variables are selected as the main factor variables of algal blooms in surface water.
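Selection by the Akaike information criterion can be sketched as below. The sketch assumes the Gaussian-likelihood form AIC = N ln(RSS/N) + 2k, where k is the number of nonzero coefficients; the candidate models and their residual sums of squares are hypothetical numbers chosen only to show the trade-off and are not results from this study.

```python
import math


def aic(rss, n_obs, n_params):
    """Gaussian-likelihood AIC up to an additive constant: N*ln(RSS/N) + 2k."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params


def select_by_aic(candidates, n_obs):
    """Return the candidate (label, rss, k) with the smallest AIC."""
    return min(candidates, key=lambda c: aic(c[1], n_obs, c[2]))


# Hypothetical points along a LASSO path:
# (label, residual sum of squares, number of nonzero coefficients)
path = [
    ("3 variables", 40.0, 3),
    ("7 variables", 22.0, 7),
    ("16 variables", 21.5, 16),
]
best = select_by_aic(path, n_obs=100)
print(best[0])
```

Adding variables always lowers the residual sum of squares, but the 2k term penalizes the extra coefficients, so a mid-sized model can have the smallest AIC.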

3.2. Surface Water Pollution Prediction Model Based on BP

Based on the factors selected by LASSO-LARS and the BP network, the construction process of the algal bloom prediction model is designed as follows.

3.2.1. Determination of Sample

The collected surface water quality data are divided into a training sample set and a test sample set according to certain rules. The training sample set is used for BP neural network learning and training, so as to obtain an optimal BP neural network prediction model. The test sample set is used to verify whether the constructed prediction model achieves the desired prediction effect on unseen data.

3.2.2. Determination of Network Structure

In general, a standard three-layer BP neural network can achieve relatively accurate prediction [24]. Therefore, a three-layer network structure is adopted to construct the prediction model. Since LASSO-LARS selects 7 factors influencing algal blooms, the number of neurons in the input layer of the BP neural network prediction model is 7. Since the algal Q index is the main indicator of the extent of algal blooms, it is used as the model output [25], and the number of neurons in the output layer is 1. The number of neurons n in the hidden layer can be determined by formula (12); after repeated training, n = 3 is adopted. In formula (12), n is the number of neurons in the hidden layer, and P and Q represent the numbers of neurons in the input and output layers.

3.2.3. Data Preprocessing

Because the variables in the sample dataset have different dimensions (units) and contain missing values, the raw data do not meet the input requirements of the BP neural network prediction model and would degrade its final prediction effect. Therefore, before the samples are input into the model, the data are preprocessed by normalization, standardization, missing value filling, and so on.

3.2.4. Model Construction and Prediction

The BP neural network prediction model is constructed by calling the newff function in MATLAB, and the best prediction model retained through repeated training is then used to predict algal blooms. The specific prediction steps are as follows: the prediction variables are built from the obtained data according to the 7 main factors affecting algal blooms and are then input into the prediction model, whose output is the prediction result.
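The idea of retaining the best model from repeated training can be sketched as follows. Python is used here instead of MATLAB's newff, and the 7-3-1 network is trained on synthetic stand-in data (7 random inputs whose target is their mean) because the study's data are not reproduced here. Bias terms are omitted, and the learning rate, epoch count, number of restarts, and all function names are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, v, w):
    """7-3-1 forward pass; returns the hidden outputs and the scalar prediction."""
    y = [sigmoid(sum(vi[j] * xi for vi, xi in zip(v, x))) for j in range(len(w))]
    return y, sigmoid(sum(wj * yj for wj, yj in zip(w, y)))


def mse(data, v, w):
    return sum((d - predict(x, v, w)[1]) ** 2 for x, d in data) / len(data)


def train_once(train, rng, n_in=7, n_hidden=3, eta=0.5, epochs=200):
    """Train one randomly initialised network by per-sample gradient descent."""
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    w = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    for _ in range(epochs):
        for x, d in train:
            y, o = predict(x, v, w)
            delta_o = (d - o) * o * (1.0 - o)
            delta_y = [delta_o * w[j] * y[j] * (1.0 - y[j]) for j in range(n_hidden)]
            for j in range(n_hidden):
                w[j] += eta * delta_o * y[j]
                for i in range(n_in):
                    v[i][j] += eta * delta_y[j] * x[i]
    return v, w


rng = random.Random(0)
# Synthetic stand-in data: 7 inputs in [0, 1]; the target is their mean.
samples = []
for _ in range(60):
    x = [rng.random() for _ in range(7)]
    samples.append((x, sum(x) / 7))
train, held_out = samples[:48], samples[48:]

# Repeat training from different random starts and keep the best run.
runs = [train_once(train, rng) for _ in range(5)]
errors = [mse(held_out, v, w) for v, w in runs]
best_error = min(errors)
best_v, best_w = runs[errors.index(best_error)]
print(best_error)
```

Because the output neuron is a sigmoid, the targets must lie in (0, 1), which is one reason the model output is normalized before training and reverse-normalized afterwards.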

4. Simulation Experiment

4.1. Experimental Environment

The MATLAB newff function is called to construct the surface water pollution prediction model, which runs on the Windows 7 operating system, and the premnmx function is called to normalize the input data.

4.2. Data Sources and Preprocessing

The experimental dataset consists of weekly averages of water quality data for a water body in Daqing City, Heilongjiang Province, collected from March to November each year from 2003 to 2020 (a total of 648 groups). The collection work is divided into two parts: on-site monitoring and laboratory testing and analysis. For example, water temperature must be measured in the field, whereas chlorophyll a must be taken back to the laboratory for testing. To make the data meet the input requirements of the model, the input data are centralized by using formulas (13)–(15) and normalized by using formula (16), and the model output data are reversely normalized by using formula (17) [26–28].

In formulas (16) and (17), $x$ represents the input or output data and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values.
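Min-max normalization and its inverse can be sketched as below. The sketch mirrors the behaviour of MATLAB's premnmx and postmnmx, which scale data linearly to [-1, 1] and back; whether formulas (16) and (17) use this range or [0, 1] is an assumption here. The function names are hypothetical, and the pH readings are illustrative values within the range reported for this water body.

```python
def premnmx_like(values):
    """Scale values linearly to [-1, 1]; return the scaled list plus (min, max)."""
    lo, hi = min(values), max(values)
    scaled = [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]
    return scaled, lo, hi


def postmnmx_like(scaled, lo, hi):
    """Invert premnmx_like, mapping [-1, 1] back to the original units."""
    return [0.5 * (s + 1.0) * (hi - lo) + lo for s in scaled]


ph = [8.30, 8.62, 9.18, 8.95]  # illustrative pH readings
scaled, lo, hi = premnmx_like(ph)
restored = postmnmx_like(scaled, lo, hi)
print(scaled)
print(restored)
```

The minimum and maximum returned by the forward step must be stored, since the same two values are needed to map model outputs back to physical units.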

The 576 groups of preprocessed experimental data from 2003 to 2018 are selected as the training dataset, and the remaining 72 groups are selected as the test dataset.
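The chronological split can be checked with a short sketch. It assumes an equal number of weekly groups per year, 36, which is inferred from 648 groups over the 18 years from 2003 to 2020; the record layout `(year, week)` is hypothetical.

```python
YEARS = range(2003, 2021)   # 2003 to 2020 inclusive, 18 years
WEEKS_PER_YEAR = 36         # assumed: 648 groups / 18 years

# One placeholder record per weekly group.
records = [(year, week) for year in YEARS for week in range(1, WEEKS_PER_YEAR + 1)]

# Split by year so that the test set contains only the most recent data.
train = [r for r in records if r[0] <= 2018]
test = [r for r in records if r[0] > 2018]
print(len(records), len(train), len(test))
```

Splitting by time rather than at random keeps every test sample later than every training sample, so the test measures prediction on data the model has not seen.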

4.3. Evaluation Indicators

Relative error δ is selected as the indicator to evaluate the prediction performance of the model, and it is calculated as follows:

$$\delta = \frac{\Delta}{L} \times 100\%, \tag{18}$$

where $\Delta$ is the absolute error, which can be calculated by formula (19), and $L$ is the real value.

$$\Delta = |X - L|, \tag{19}$$

where $X$ is the measured value and $L$ is the real value.
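The two error measures can be computed as in the following sketch, which takes the absolute error as the magnitude of the difference between the measured value and the real value. The numeric values are made up for illustration and are not taken from Table 4.

```python
def absolute_error(measured, real):
    """Magnitude of the difference between the measured and real values."""
    return abs(measured - real)


def relative_error(measured, real):
    """Relative error in percent: absolute error divided by the real value."""
    return absolute_error(measured, real) / real * 100.0


# Hypothetical example: measured value 5.26 against a real value of 5.0.
delta = relative_error(measured=5.26, real=5.0)
print(round(delta, 2))
```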

4.4. Experimental Results

4.4.1. Variable Selection Result

To verify the effect of the LASSO-LARS algorithm on variable selection, the nine groups of water quality data from the first week of each month from March to November 2020 are used for illustration.

(1) Water Temperature. Figure 2 shows the distribution characteristics of water temperature in the first week of each month from March to November 2020. As can be seen, the water temperature changes significantly with the seasons. In 2020, the water temperature in this region is below 10°C in March, April, and November and above 10°C from May to September, with an average temperature of 15.42°C. According to the literature, cyanobacteria have certain growth advantages in water temperature of 40°C; Chlorophyta has certain growth advantages in water temperature of 30°C; and diatoms have certain growth advantages in water temperature of 20°C. Overall, the species and quantity of algae are closely related to the water temperature, and the water temperature varies significantly with the season, leading to the seasonal variation of algae species. Thus, there is a strong correlation between water temperature and algal blooms.

(2) pH. Figure 3 shows the pH distribution characteristics of water body in the first week of each month from March to November 2020. As can be seen, the pH of the water varies between 8.30 and 9.18, and it is slightly alkaline on the whole.

(3) Transparency. Figure 4 shows the transparency distribution characteristics of water body in the first week of each month from March to November 2020. As can be seen, the transparency of this water body is mostly concentrated at 0.18 m, which is generally low. The lowest water transparency is 0.13 m in October, and the highest is 0.24 m in November.

(4) Conductivity. Figure 5 shows the conductivity of the water body during the study period. In this water environment, the conductivity is generally high, reaching more than 1600 μS/cm. The reason is that the selected water body is close to the sewage inlet and contains a large amount of industrial wastewater and domestic sewage.

(5) Dissolved Oxygen. Dissolved oxygen is an important index of the self-purification capacity of the water body, and its concentration reflects the growth of algae. Figure 6 shows the dissolved oxygen concentration of the water body in the study area in different periods. As can be seen, the dissolved oxygen concentration fluctuates with the seasons, varying between 5.47 and 9.78 mg/L. According to the analysis, when the water surface is wide and agitated, the dissolved oxygen concentration is high; when the water surface is small and still, the concentration is low, and excessively low dissolved oxygen leads to the death of a large number of algae and deteriorates the water quality.

(6) Ammonia Nitrogen. Figure 7 shows the ammonia nitrogen concentration in the water body during the study period. As can be seen, the ammonia nitrogen concentration reaches its lowest and highest values in April and June, respectively, with obvious seasonal variation.

(7) Chlorophyll a. Chlorophyll a concentration is positively correlated with water pollution degree. Figure 8 shows the concentration of chlorophyll a in water body during the study period. It can be seen from the figure that during the study period, the pollution level in the water is high and algae growth is active.

In conclusion, the 7 influencing factor variables selected by LASSO-LARS algorithm are closely related to algae blooms, which indicates that LASSO-LARS algorithm is effective.

4.4.2. Prediction Results of the Model

The 576 groups of training data are used to train the BP surface water pollution prediction model, and the model with the minimum error is saved as the best prediction model. The 72 groups of test data are then input into the best prediction model, and the predictions are compared with the real values to obtain the prediction effect of the model. Because of the large amount of data, only the data for the first week of each month from March to November 2020 are presented, as shown in Table 3, and the evaluation indicator values of the model predictions are shown in Table 4. As can be seen from Table 3, the water environment from March to May 2020 is generally good, with mild pollution and a certain tolerance. From June to October, the water quality changes from moderate pollution to severe pollution and eventually returns to moderate pollution. The water body is slightly polluted in November.

As can be seen from Table 4, there is a certain error between predicted value and real value. The absolute error is more than 0.1 in all months except March and November 2020. The relative error is between 3.6% and 5.2%, which is within the allowable error range. Therefore, the proposed prediction model can predict surface water pollution well with high prediction accuracy and good universality.

5. Conclusion

To sum up, the proposed surface water pollution prediction model based on machine learning can accurately predict algal blooms. The BP prediction model takes as inputs the variables selected by the LASSO-LARS algorithm: water temperature, pH, transparency, conductivity, dissolved oxygen, ammonia nitrogen, and chlorophyll a. The relative prediction error is no more than 5.2%, which indicates high prediction accuracy. The model thus provides a theoretical basis for the prediction of surface water pollution. However, owing to limited conditions, some problems remain to be addressed. Many methods can improve the accuracy of a BP network model; for example, optimizing the network parameters with a genetic algorithm can improve the prediction accuracy to a certain extent. Therefore, to better realize surface water pollution prediction, future research will use such an optimization algorithm to improve the BP neural network or explore a new prediction model.

Data Availability

The experimental data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was sponsored in part by the Science and Education Joint Project of Hunan Natural Science Foundation (2020JJ7058).