Abstract
This study adopts two approaches to analyze the occurrence of algae at Haman Weir on the Nakdong River: a traditional statistical method, logistic regression, and machine learning techniques, namely kNN, ANN, RF, Bagging, Boosting, and SVM. To compare the performance of the models, we measured accuracy, specificity, sensitivity, and AUC, which are representative model evaluation measures. The ROC curve plots sensitivity against (1-specificity), and the AUC, the area under the ROC curve, summarizes sensitivity and specificity in a single value. This measure has two advantages over other evaluation tools. First, it is scale-invariant: it measures how well the model ranks its predictions rather than their absolute values. Second, it is classification-threshold-invariant: because the ROC curve is traced from the sensitivity and (1-specificity) obtained over all thresholds, the AUC does not depend on any single threshold. Because of these two advantages, we chose the AUC as the final model evaluation measure. Variable selection was conducted using the Boruta algorithm, and models with and without variable selection were compared to identify the better model. The Boruta algorithm suggested PO4-P, DO, BOD, NH3-N, Susp, pH, TOC, Temp, TN, and TP as significant explanatory variables. Among the models without variable selection, RF showed the highest accuracy and ANN showed the highest AUC. In conclusion, ANN with variable selection showed the best performance among the models with and without variable selection.
1. Introduction
Recently, decreased river flow caused by the diverse effects of climate change, together with increased residence time caused by structures installed on rivers, such as beams, weirs, and drainage gates, has led to frequent algal proliferation. This phenomenon adversely affects the utilization of rivers and the health of aquatic ecosystems. These problems have emerged in recent years as social and environmental issues, and efforts to analyze their causes and present solutions through academic approaches in various fields are urgently needed. Various conditions, such as sufficient nutrient supply, retention time, adequate water temperature, and light, are essential for the growth of algae. However, under the current water quality monitoring system, it is difficult for researchers to identify and analyze the limiting factors and control variables for algal growth in each aquatic ecosystem and region. Therefore, the physical, chemical, and biological environmental factors and optimal control variables that are critical for the occurrence of algae need to be studied in order to establish an effective algal monitoring system for each country and region.
Algae are microorganisms that live in the water of rivers, seas, lakes, and ponds; they contain chlorophyll, which lets them use energy from light and carbon dioxide for photosynthesis, creating organic materials and oxygen. Algae found in fresh water mostly consist of microalgae containing chlorophyll, the green pigment required for photosynthesis; they are classified into diatoms, chlorophyceae, cyanophyceae, and others. Diatoms, chlorophyceae, and cyanophyceae have brown, green, and blue pigments, respectively.
The optimal nutrient levels, light, water temperature, and other conditions required for growth differ among the types of algae. Depending on conditions such as the amount of nutrients like phosphorus, sunlight, and water temperature, the types of algae that appear and proliferate vary. In general, diatoms are frequently found from winter to spring, when water temperature remains below 10 °C, while chlorophyceae are frequently found from spring to early summer, when water temperature remains between 10 and 20 °C. The optimal water temperature for the fast proliferation of cyanophyceae falls in the range 20–30 °C; accordingly, algal bloom is a seasonal phenomenon that is frequently observed during summertime.
In addition, the occurrence of algae varies significantly with growth conditions and location. Within a single aquatic ecosystem, algae appear intensively at points where the inflow of nutrients, including phosphorus and nitrogen, is concentrated. Points where the flow stagnates, such as where the river meanders or where tributaries converge, often exhibit algal proliferation.
Algae are natural members of an aquatic ecosystem and are essential for sustaining it. However, excessive proliferation harms the aquatic ecosystem. Algae covering the water surface block sunlight and disturb the photosynthesis of aquatic plants. Because diverse biological species and the abiotic environment interact closely in an ecosystem, predicting the occurrence of algal proliferation in nature, or clarifying the process by which algae are generated, is quite difficult.
Despite such difficulties, several machine learning and deep learning techniques have been developed recently to predict algal occurrence more accurately. The artificial neural network (ANN) was reported to be useful for the nonlinear ordination of ecological data and for visualization, multiple regression, and time series modeling [1]. The approach has been employed mainly for the prediction of time series of phytoplankton species abundance [2–4]; ANN was also employed to select input variables for the prediction of phytoplankton proliferation [5, 6].
Cho et al. [7] identified TOC, pH, and atmospheric and water temperature as factors affecting algal bloom in Lake Juam, while Jeong et al. [8] carried out time series modeling of phytoplankton dynamics in the Nakdong River. Park et al. [9] used ANN and three-dimensional hydrodynamic models to clarify the relationship between cyanobacterial blooms and environmental factors in the Baekje Reservoir (inland water). In addition, early prediction of the concentration of chlorophyll-a was examined in the Juam Reservoir and Yeongsan Reservoir with ANN and the support vector machine (SVM) [10].
Wang et al. [11] employed a backpropagation neural network for the prediction of cyanobacterial bloom in Dianchi Lake, whereas Segura et al. [12] used the Random Forest (RF) approach to predict Microcystis aeruginosa complex (MAC) colonies. Four machine learning models, regression trees (RT), RF, SVM, and ANN, were used to predict changes in phytoplankton community composition in the Miyun reservoir [13], whereas Hollister et al. [14] used the RF technique to model chlorophyll-a and lake trophic state. In addition, Bourel et al. [15] predicted the presence/absence of marine phytoplankton species based on four machine learning techniques.
To date, many studies have examined the concentration of chlorophyll-a, which is taken as an indirect indicator of algae. However, studies that examine the correlation between the number of cells of each algal taxon and environmental factors are rare. Each species of algae reacts sensitively to environmental changes and is affected by a combination of diverse factors. For this reason, a simple correlation between algae and water quality factors is of limited use for predicting algal behavior. In particular, diatoms are sensitive to nutrient and organic matter concentrations [16] and integrate environmental conditions with long-term water quality [17]. Diatoms also create substances that cause a fishy smell and reduce the efficiency of coagulation and sedimentation in water purification, and some of them cause clogging of filter paper. An integrated and multidimensional approach is therefore needed to prepare the scientific and fundamental measures necessary for controlling diatom proliferation.
Previous studies have frequently predicted the behavior of algae with ANN and SVM. However, other machine learning approaches, such as kNN, Bagging, and Boosting, have rarely been employed for this purpose. In this study, various machine learning techniques were applied and their performances were compared, to determine an optimal model for predicting the behavior of diatoms and to derive the factors affecting diatom proliferation.
2. Materials and Methods
2.1. Data
In this study, the water quality variables pH, DO, BOD, COD, Susp, TN, TP, TOC, water temperature, NH3-N, NO3-N, and PO4-P were taken as explanatory variables, and Dia (Diatom Abundance) was taken as the dependent variable. The data were measurements obtained from Haman Weir on the Nakdong River. The site was chosen deliberately. The Four Major Rivers Restoration Project was conducted for the purpose of green growth on the Han River, Nakdong River, Geum River, and Yeongsan River in South Korea from January 2009 to October 2011. From the environmental aspect alone, one of its initial purposes, improving water quality by restoring river ecosystems, appears to have failed, because many rivers suffered from algal blooms right after the project was completed. Haman Weir was built at the southern tip of the Nakdong River (Figure 1). Algal blooms have recently emerged around Haman Weir owing to the proliferation of diatoms. For this reason, this study uses water quality data from Haman Weir to identify which independent variables promote diatom growth.

Figure 2 is a missingness map showing the pattern of missing values. Of the 259 cases in total, the 181 cases (69.9%) without missing data were used for the analyses in this study.

Table 1 presents descriptions of the water quality variables. Diatom abundance was used as the response variable, and pH, DO, BOD, COD, Susp, TN, TP, TOC, Temp, NH3-N, NO3-N, and PO4-P were used as explanatory variables. The data that support the findings of this study are available from the corresponding author upon reasonable request.
Sunil et al. [18] converted abundance estimates of diatom cells to presence-absence for the prediction of diatoms in the continental US. Similarly, we divided the data into presence and absence groups at the median of the diatom measurements (3,519), in order to apply the selected machine learning techniques and compare their results. “Presence” was defined as a value greater than the median, and “absence” as a value smaller than the median; the median was used as the cut-off because the distribution of diatom abundance is right-skewed.
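The median split described above can be sketched as follows. This is a minimal illustration, not the study's code: the function name `dichotomize_by_median` and the cell counts are hypothetical, and labelling values equal to the median as “Absence” is an assumption, since the text does not say how ties were handled.

```python
from statistics import median

def dichotomize_by_median(abundance):
    """Label each value 'Presence' if above the sample median, else 'Absence'."""
    cut = median(abundance)
    # Assumption: values equal to the median are labelled "Absence".
    labels = ["Presence" if v > cut else "Absence" for v in abundance]
    return cut, labels

# Hypothetical, right-skewed cell counts (not the study's data).
counts = [120, 800, 1500, 3519, 9000, 42000, 150000]
cut, labels = dichotomize_by_median(counts)
print(cut, labels)
```

Because the cut-off is the median, the two groups are of nearly equal size even though the raw counts span several orders of magnitude.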
Table 2 shows the results of the independent samples t-test conducted to determine whether the two groups, divided by the degree of diatom occurrence, differ. The differences between the two groups were significant for pH (P=0.0050), DO (P<.0001), BOD (P=0.0100), TN (P=0.0074), Temp (P<.0001), NH3-N (P<.0001), NO3-N (P=0.0100), and PO4-P (P<.0001).
2.2. Machine Learning Method
2.2.1. Logistic Regression
Logistic regression is a regression model for categorical dependent variables. The model consists of a random component, a systematic component, and a link function. The random component is the dependent variable $Y$, which includes individual error and is assumed to follow a binomial distribution with probability $\pi(x)$. The systematic component consists of the independent variables $x_1, \ldots, x_p$; logistic regression is expressed as a linear combination of the independent variables:
$$\operatorname{logit}(\pi(x)) = \log\frac{\pi(x)}{1-\pi(x)} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p,$$
where $\pi(x)$ denotes the probability of $Y = 1$ corresponding to the observations $x = (x_1, \ldots, x_p)$. The logit link function connects $\pi(x)$ to the systematic component in logistic regression.
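As a small illustration of the link function, the sketch below evaluates the fitted probability for a linear predictor and checks that a one-unit increase in a predictor multiplies the odds by the exponential of its coefficient. The function names and all coefficient values are hypothetical; they are not estimates from this study.

```python
import math

def logistic_probability(intercept, coefs, x):
    """Return pi(x) = 1 / (1 + exp(-(b0 + sum_j b_j * x_j)))."""
    eta = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

# Hypothetical coefficients for two predictors (not the study's estimates).
b0, b = -2.0, [0.18, -1.5]
p_low = logistic_probability(b0, b, [10.0, 0.05])
p_high = logistic_probability(b0, b, [11.0, 0.05])  # first predictor + 1 unit
odds_ratio = odds(p_high) / odds(p_low)
print(odds_ratio, math.exp(b[0]))  # the two values agree up to rounding
```

This is why exponentiated coefficients are read as odds ratios when interpreting a fitted logistic regression.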
2.2.2. Support Vector Machine
The SVM model is an algorithm designed to find the decision boundary that maximizes the distance between the two categories of binary data. This boundary, the hyperplane, is the formula used for classification. The SVM model uses support vectors as the basis for classification: the observations of each class nearest to the boundary are identified, and the hyperplane is chosen to maximize the margin between itself and these support vectors. Where linear classification is not possible, a kernel function is employed to map the data into a higher-dimensional space, in which a linear hyperplane corresponds to a nonlinear boundary in the original space. Figure 3 shows a hyperplane separating two classes with the maximum margin.

2.2.3. Artificial Neural Network
The ANN model is a statistical learning algorithm that emulates the human brain. It is a nonlinear model designed to use data for making predictions. The ANN model consists of an input layer, a hidden layer, and an output layer, which are made up of input nodes, hidden nodes, and output nodes, respectively. The input layer corresponds to the independent variable X, while the output layer corresponds to the dependent variable Y. The values of the hidden nodes are obtained through a nonlinear combination of the input nodes, and the values of the output nodes through a nonlinear combination of the hidden nodes. Weights are assigned to each nonlinear combination, and the ANN is trained by adjusting these weights to reduce the error of the output values. The errors are propagated backward from the output nodes to the hidden nodes and from the hidden nodes to the input nodes, and the weights are adjusted accordingly. This is referred to as the backpropagation algorithm. Figure 4 shows an example of a simple artificial neural network.
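The forward pass and one backpropagation update for a network with a single hidden layer can be sketched as follows. This is a minimal illustration under stated assumptions: sigmoid activations, a cross-entropy loss, a fixed learning rate, and made-up starting weights; the text does not specify the activation, loss, or software used in the study.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Inputs -> sigmoid hidden nodes -> single sigmoid output node."""
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output

def train_step(x, y, w_hidden, b_hidden, w_out, b_out, lr=0.5):
    """One gradient step; the error flows from the output back to the hidden layer."""
    hidden, out = forward(x, w_hidden, b_hidden, w_out, b_out)
    delta_out = out - y  # cross-entropy gradient at the output pre-activation
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    new_w_out = [w - lr * delta_out * h for w, h in zip(w_out, hidden)]
    new_b_out = b_out - lr * delta_out
    new_w_hidden = [[w - lr * d * v for w, v in zip(ws, x)]
                    for ws, d in zip(w_hidden, delta_hidden)]
    new_b_hidden = [b - lr * d for b, d in zip(b_hidden, delta_hidden)]
    return new_w_hidden, new_b_hidden, new_w_out, new_b_out

# Hypothetical single observation with target class 1 and arbitrary weights.
x, y = [0.5, -1.0], 1.0
params = ([[0.1, -0.2], [0.3, 0.4]], [0.0, 0.0], [0.2, -0.1], 0.0)
_, before = forward(x, *params)
for _ in range(100):
    params = train_step(x, y, *params)
_, after = forward(x, *params)
print(before, after)  # the output moves toward the target after training
```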

2.2.4. K-Nearest Neighbor
The kNN model is an instance-based learning model that classifies new data according to the existing training data adjacent to them. The Euclidean distance is used to determine the similarity of training data to new data:
$$d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2},$$
where $x_i$ is the value of the training data and $y_i$ is the value of the new data for the $i$-th of $p$ variables. The new data are then classified according to the characteristics of the k nearest training data.
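A minimal sketch of this classification rule is given below. The training points, the labels, and the choice k = 3 are hypothetical; in practice the variables would usually be standardized first so that no single variable dominates the distance.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_x, train_y, new_x, k=3):
    """Majority class among the k training points closest to new_x."""
    order = sorted(range(len(train_x)),
                   key=lambda i: euclidean(train_x[i], new_x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-variable training data (not the study's measurements).
train_x = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]]
train_y = ["Absence", "Absence", "Absence", "Presence", "Presence"]
print(knn_predict(train_x, train_y, [1.1, 1.0], k=3))
```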
2.2.5. Bagging
Bagging is an abbreviation of “bootstrap aggregating”. It is an algorithm that derives a new result by combining the results of several prediction models, each of which is fitted to a sample drawn with replacement from the given training data. For continuous dependent variables, the average of the predictions of the models is used; for categorical dependent variables, the category predicted by the majority of the models is used. The results of individual prediction models vary from sample to sample, and combining a larger number of models reduces the variability of the prediction results.
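The procedure can be sketched with a deliberately simple base learner. The code below is an illustration under assumptions: the base learner is a one-variable threshold rule (a decision stump) rather than the full trees typically used, and the data, the number of models, and the random seed are hypothetical.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Base learner: 1-D threshold rule that minimises training errors."""
    best = None
    for t in sorted(set(xs)):
        for above, below in ((1, 0), (0, 1)):
            errors = sum((above if x > t else below) != y
                         for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, above, below)
    _, t, above, below = best
    return lambda x: above if x > t else below

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Fit one stump per bootstrap sample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Majority vote over the fitted base learners."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical 1-D data: class 1 for large values, class 0 for small ones.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
models = bagging_fit(xs, ys)
print(bagging_predict(models, 2.0), bagging_predict(models, 9.0))
```

Each stump sees a slightly different sample, so individual thresholds differ; the vote smooths out that variability.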
2.2.6. Boosting
The Boosting model is an algorithm that draws samples with replacement from the given training data and creates several prediction models from the samples. A weight is then assigned to each prediction model, and the weighted predictions are combined to derive the final prediction. Types of Boosting include Adaboost, Gamboost, and Glmboost. The Glmboost method is used as the prediction model in this study. Glmboost is derived from the generalized linear model, but it differs from the generalized linear model, which creates a base model after variables have been selected. In each iteration, base models are fitted and the negative gradient of the loss function is calculated for each base model. The sum of the negative gradient vector of that iteration and the regression coefficient is passed on to the model of the next iteration. The final prediction model is obtained after the predetermined number of iterations has been completed.
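The iteration described above resembles component-wise gradient boosting, which can be sketched as follows. This is an illustration under assumptions and not the routine used in the study: it uses a squared-error loss (for which the negative gradient is simply the residual) instead of a loss for binary outcomes, it assumes centered predictors, and the function name, step size `nu`, and data are hypothetical.

```python
def componentwise_l2_boost(X, y, n_iter=200, nu=0.1):
    """Component-wise gradient boosting with squared-error loss.

    Each iteration fits every single predictor to the current negative
    gradient (the residuals), keeps only the best-fitting predictor, and
    adds a shrunken copy of its coefficient to the model.
    """
    n, p = len(X), len(X[0])
    intercept = sum(y) / n
    coefs = [0.0] * p
    fitted = [intercept] * n
    for _ in range(n_iter):
        residuals = [yi - fi for yi, fi in zip(y, fitted)]  # negative gradient
        best_j, best_beta, best_sse = None, 0.0, float("inf")
        for j in range(p):
            col = [row[j] for row in X]
            sxx = sum(v * v for v in col)
            if sxx == 0.0:
                continue  # a constant-zero column carries no information
            beta = sum(v * r for v, r in zip(col, residuals)) / sxx
            sse = sum((r - beta * v) ** 2 for v, r in zip(col, residuals))
            if sse < best_sse:
                best_j, best_beta, best_sse = j, beta, sse
        coefs[best_j] += nu * best_beta
        fitted = [f + nu * best_beta * row[best_j] for f, row in zip(fitted, X)]
    return intercept, coefs

# Hypothetical centered predictors; y depends on the first column only.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 1.0], [1.0, -1.0], [2.0, 0.0]]
y = [2.0 * row[0] + 3.0 for row in X]
intercept, coefs = componentwise_l2_boost(X, y)
print(intercept, coefs)
```

Because only one predictor is updated per iteration, predictors that never give the best fit keep a zero coefficient, which is how this family of methods performs variable selection while fitting.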
2.2.7. CART
CART is one of the decision tree models. It is an algorithm that repeatedly splits the data in two on the values of the variables to construct a tree, which is used to classify categorical dependent variables or to conduct regression analysis for continuous dependent variables. The CART model proceeds so as to increase purity, a concept opposite to uncertainty; the decrease in uncertainty that accompanies increasing purity yields more accurate information. Thus, the CART algorithm chooses, at each branch of the tree, the split criterion that maximizes purity.
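How a single split is chosen can be sketched with the Gini impurity, a common purity measure for classification trees. The use of Gini here is an assumption, since the text does not name the purity criterion used; the data values are hypothetical.

```python
def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Threshold on one variable that minimises the weighted Gini impurity."""
    best_t, best_impurity = None, float("inf")
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0  # candidate threshold between neighbouring values
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        impurity = (len(left) * gini(left)
                    + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_t, best_impurity = t, impurity
    return best_t, best_impurity

# Hypothetical values of one variable with perfectly separable labels.
values = [1, 2, 3, 4, 5, 6]
labels = ["Presence", "Presence", "Presence", "Absence", "Absence", "Absence"]
print(best_split(values, labels))
```

A full tree repeats this search over all variables at every node and recurses into the two resulting subsets.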
2.2.8. Random Forest
The Random Forest model draws samples with replacement from the given training data to create several decision tree models and combines the results of these trees into a new result. The Bagging algorithm with decision trees as the prediction model can be regarded as a special case of the RF model; RF differs from Bagging in that it adjusts the number of independent variables used for each decision tree. Adjusting this number makes the individual trees statistically more independent of one another, which can significantly reduce the variability of the results.
2.2.9. Model Evaluation Index
To test the results obtained in this study, “K-fold cross validation” was used. In this scheme, the collected data are divided into k subsets; k-1 subsets are used to train the model, and the remaining subset is used to evaluate its performance. The evaluation is repeated k times so that every subset is used once as test data. The k evaluative measurements obtained from the individual tests are then averaged into a single evaluative measurement:
$$\bar{M} = \frac{1}{k}\sum_{i=1}^{k} M_i,$$
where $M_i$ denotes the $i$-th measurement among the $k$ measurements.
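The splitting and averaging steps can be sketched as follows. The function names are hypothetical, and k = 10 in the usage lines is an assumption, since the text does not state the value of k; the sample size of 181 matches the number of complete cases reported above. In practice the cases would usually be shuffled before being assigned to folds.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k nearly equal, disjoint test folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(n, k, evaluate):
    """Average evaluate(train_idx, test_idx) over the k folds."""
    folds = k_fold_indices(n, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i
                     for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k

# Assumed k = 10 with the 181 complete cases; the "metric" here is just the
# test-fold size, standing in for accuracy, AUC, or any other measurement.
folds = k_fold_indices(181, 10)
mean_test_size = cross_validate(181, 10, lambda train, test: len(test))
print([len(f) for f in folds], mean_test_size)
```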
Accuracy, specificity, sensitivity, and area under the curve (AUC) were taken as evaluative indicators. Sensitivity is the proportion of cases with diatom occurrence higher than the reference level that are predicted correctly, whereas specificity is the proportion of cases with diatom occurrence lower than the reference level that are predicted correctly. For given data, the model predicts the probability that each case is higher than the reference level, and each case is classified as higher or lower by comparing this probability with a specified cut-off. The ROC curve plots sensitivity against (1 - specificity) as the cut-off is varied, and the AUC is the area under this curve; a larger area corresponds to better prediction. The evaluative indicators are expressed as
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP},$$
$$\mathrm{AUC} \approx \sum_{j=1}^{k-1} \frac{\left(\mathrm{FPR}_{j+1} - \mathrm{FPR}_{j}\right)\left(\mathrm{TPR}_{j} + \mathrm{TPR}_{j+1}\right)}{2},$$
where $\mathrm{TPR}_j$ and $\mathrm{FPR}_j$ are the sensitivity and (1 - specificity) at the $j$-th cut-off, ordered by increasing $\mathrm{FPR}$, and the sum is the trapezoidal approximation of the area. TP (True Positive) is the number of cases predicted by the model as higher than the reference level that are actually higher; TN (True Negative) is the number of cases predicted as lower than the reference level that are actually lower; FP (False Positive) is the number of cases predicted as higher than the reference level that are actually lower; FN (False Negative) is the number of cases predicted as lower than the reference level that are actually higher; and $k$ denotes the number of cut-offs.
The ROC curve plots sensitivity against (1-specificity), and the AUC, the area under the ROC curve, summarizes sensitivity and specificity in a single value. This measure has two advantages over other evaluation tools. First, it is scale-invariant: it measures how well the model ranks its predictions rather than their absolute values. Second, it is classification-threshold-invariant: because the ROC curve is traced from the sensitivity and (1-specificity) obtained over all thresholds, the AUC does not depend on any single threshold. Because of these two advantages, we chose the AUC as the final model evaluation tool.
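The evaluation measures of this section can be sketched in a few lines. This is a minimal illustration, with hypothetical function names and a four-case toy example rather than the study's predictions; class 1 stands for occurrence higher than the reference level, and a score at or above the cut-off is predicted as class 1.

```python
def confusion_counts(y_true, scores, cutoff):
    """Count TP, TN, FP, FN when scores >= cutoff are predicted as class 1."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < cutoff and y == 1)
    tn = sum(1 for y, s in zip(y_true, scores) if s < cutoff and y == 0)
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def roc_auc(y_true, scores):
    """Area under the ROC curve by the trapezoidal rule over all cut-offs."""
    cutoffs = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for c in cutoffs:
        tp, tn, fp, fn = confusion_counts(y_true, scores, c)
        # One ROC point per cut-off: (1 - specificity, sensitivity).
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example: two cases of each class with hypothetical predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
at_half = metrics(*confusion_counts(y_true, scores, 0.5))
auc = roc_auc(y_true, scores)
auc_rescaled = roc_auc(y_true, [10 * s for s in scores])
print(at_half, auc, auc_rescaled)
```

Accuracy, sensitivity, and specificity change with the chosen cut-off, whereas the AUC is computed over all cut-offs and is unchanged when the scores are rescaled, which illustrates the two invariance properties described above.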
2.2.10. Variable Selection
Variable selection builds a model from the variables that improve its performance. Traditional statistical techniques use “Correlation” and “Regression”. The “Correlation” technique identifies multicollinearity through the correlation between paired variables. The “Regression” technique identifies significant variables through regression analysis and collinear variables through the “VIF” statistic. Machine learning techniques use the “Variable Importance Plot”. The RF and Bagging models were used to evaluate the importance of each variable and to select variables for modeling. The importance of a variable signifies the degree to which it contributes to improving the AUC of each model, which we chose as the final model evaluation tool. In addition, the Boruta algorithm selects variables by using the RF model. In its importance plot, green, yellow, red, and blue correspond to variables of high importance, potential importance, and low importance, and to randomly created variables, respectively.
3. Results and Discussion
3.1. Variable Selection
In this study, the “Correlation Plot”, “Logistic Regression”, and “Variable Importance Plot” were used to select variables. The “Correlation Plot” identifies multicollinearity through the correlation between each pair of variables. “Logistic Regression” determines the statistically significant variables. The “Variable Importance Plot” shows the importance of each independent variable for the dependent variable. “Logistic Regression” and the “Correlation Plot” are traditional statistical techniques, whereas variable selection with the “Variable Importance Plot” can be understood as a machine learning method. It is therefore important to compare the traditional methods with the machine learning method. We describe the three methods and compare the machine learning method with the traditional methods through the VIF statistic.
First, the “Correlation Plot” was used to examine the correlation between variables. Figure 5 shows that the largest positive correlation coefficient is 0.95, between NO3-N and TN; the coefficient between TOC and COD is 0.84, and that between PO4-P and TP is 0.74. The largest negative correlation coefficient is -0.80, between Temp and DO; the coefficient between NO3-N and Temp is -0.77, and that between Temp and TN is -0.71.

Second, logistic regression was used to find the variables affecting diatom presence, and the standardized estimates of the variables were compared with each other. Table 3 shows the results of the logistic regression analysis: Temp and NH3-N were statistically significant at the 5% significance level, and the other variables were not. A unit increase in Temp corresponded to a 1.2-fold increase in the odds of diatom presence (the exponentiated coefficient), while a unit increase in NH3-N multiplied the odds of diatom presence by a factor of less than 0.001. In addition, the “Standardized Estimate” indicated that Temp had a greater influence on diatom presence than NH3-N.
Finally, Figure 6 illustrates the importance of the variables in the RF and Bagging models and in the Boruta algorithm. In the RF model, the variables ranked in the order PO4-P > DO > NH3-N > BOD > Temp > TOC > TN > Susp > NO3-N > pH > TP > COD, whereas in the Bagging model they ranked in the order PO4-P > NH3-N > DO > BOD > pH > NO3-N > TN > TOC > Susp > Temp > COD > TP. In the Boruta algorithm, they ranked in the order PO4-P > DO > BOD > NH3-N > Susp > pH > TOC > Temp > TN > TP; COD and NO3-N can be regarded as variables of potential importance.

As a result of variable selection, PO4-P, NH3-N, and DO were commonly found to be highly influential according to the variable importance. Temp and NH3-N were found to be significant in the logistic regression. In particular, Temp was highly correlated with most of the other variables, which suggests that its relatively low importance for diatom presence is owing to multicollinearity.
The Boruta method was used as the final variable selection method. The selected variables are pH, DO, BOD, Susp, TN, TP, TOC, Temp, NH3-N, and PO4-P, because they are shown in green, which indicates high importance. According to the VIF statistic, the Temp variable has a multicollinearity problem (VIF = 11.72); multicollinearity is considered present when the VIF exceeds 10. Despite this, we retained Temp in the model because of its association with the response variable.
3.2. Classification
In this study, the variables PO4-P, DO, BOD, NH3-N, Susp, pH, TOC, Temp, TN, and TP were selected with the Boruta algorithm, as described in Section 3.1. The results of the models using the selected variables were compared with those of the models using all variables.
Each analysis scheme has its own advantages and disadvantages, so the scheme should be chosen to fit the purpose and the characteristics of the data. The most appropriate scheme can be derived by comparing the results of models that use the variables selected through the Boruta algorithm, which reflect the characteristics of the data, with the results of models that do not.
Table 4 summarizes the “Accuracy”, “Sensitivity”, “Specificity”, and “AUC” of the LOGIT, KNN, CART, ANN, RF, Bagging, Boosting, and SVM models in predicting the occurrence of diatoms at the investigation site.
Among the models without variable selection, the RF model showed the highest “Accuracy” (71.27%) and the ANN model the highest “AUC” (76.65%). Among the models with variable selection, the RF model again showed the highest “Accuracy” (71.27%) and the ANN model the highest “AUC” (77.93%). Variable selection therefore led most models to higher AUC values, and the RF and ANN models were found to be the most appropriate for the prediction of algal proliferation.
3.3. Selected Model Output
The CART model was used to identify the relationship between the independent variables and the dependent variable. CART repeatedly splits the data on each variable to construct a decision tree, which, unlike the other models, allows the results to be interpreted directly. Figure 7 illustrates the relationship between the variables used for the CART model. When PO4-P and NH3-N exceed 0.0035 and 0.047, respectively, the diatom is classified as “Absence”; the combination of PO4-P over 0.0035, NH3-N less than 0.047, and TN less than 2.3 also corresponds to the “Absence” of diatom. The combination of PO4-P over 0.0035, NH3-N less than 0.047, and TN over 2.3 corresponds to the “Presence” of diatom, whereas the combination of PO4-P less than 0.0035 and TP less than 0.03 corresponds to the “Absence” of diatom.

The combination of PO4-P less than 0.0035, TP over 0.03, and Temp over 23 corresponds to the “Absence”, whereas the combination of PO4-P less than 0.0035, TP over 0.03, Temp less than 23, and NH3-N over 0.13 corresponds to the “Absence” of diatom. Finally, the combination of PO4-P less than 0.0035, TP over 0.03, Temp less than 23, and NH3-N less than 0.13 corresponds to the “Presence” of diatom.
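The split rules described for the fitted tree can be written as a single function, which makes the seven leaves easy to check. This is a transcription of the rules in the text, not the fitted model itself: the function name is hypothetical, and the handling of values exactly equal to a threshold is an assumption, since the text only states “over” and “less than”.

```python
def classify_diatom(po4_p, nh3_n, tn, tp, temp):
    """Follow the split rules described for the CART tree (Figure 7)."""
    if po4_p > 0.0035:
        if nh3_n > 0.047:
            return "Absence"
        # Assumption: TN exactly equal to 2.3 is treated as "Absence".
        return "Presence" if tn > 2.3 else "Absence"
    if tp < 0.03:
        return "Absence"
    if temp > 23:
        return "Absence"
    return "Absence" if nh3_n > 0.13 else "Presence"

# Hypothetical measurement: low PO4-P, TP over 0.03, cool water, low NH3-N.
print(classify_diatom(po4_p=0.001, nh3_n=0.05, tn=2.0, tp=0.05, temp=20))
```

Only two of the seven leaves predict “Presence”, one on each side of the first split on PO4-P.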
Figure 8 illustrates the structure of the independent and dependent variables obtained from the ANN, in which thicker lines connecting nodes signify a greater influence of a node on the next node. For diatoms, the networks with 3, 4, and 5 hidden nodes exhibited the highest accuracy.

4. Conclusion
The water quality measurements obtained from Haman Weir on the Nakdong River were analyzed with various machine learning techniques, namely kNN, CART, ANN, RF, Bagging, Boosting, and SVM, to determine the water quality factors that affect the proliferation of freshwater diatoms. Among these techniques, RF rendered the highest accuracy and ANN the highest AUC in the prediction of diatom proliferation. As a result of the variable selection, PO4-P, BOD, NH3-N, and DO were found to be common influential variables, because they ranked high in both the RF and Bagging models. Also, the major variables selected by the Boruta algorithm indicated significant improvement of AUC in kNN, ANN, Bagging, and SVM.
This study was carried out to find, among various machine learning techniques, those suitable for the prediction of diatom proliferation; ANN and RF were found to be appropriate models for this purpose. Since diatoms react sensitively to spatiotemporal changes, further studies that compare and analyze diverse data seem necessary to determine the optimal machine learning schemes for predicting the proliferation of each taxon of algae.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Authors’ Contributions
Heesuk Lee and Young-Joo Lee investigated the data. Dae Keun Seo validated the data. Bomi Jeong, Seoksu Hong, Jaehoon Kim, Taekgeun Kim, and Jae-Kyeong Lee analyzed data. Yuna Shin and Tae-Young Heo wrote the original paper. All authors read and approved the final manuscript.
Acknowledgments
This study was performed under the project “A Study on the Characteristics of Algae by the Types of Rivers in Korea (NIER-2018-01-01-034)” by the National Institute of Environmental Research.