Abstract
The purpose of this research is to improve data analysis and knowledge mining for the soil corrosion factors of pipelines. Because soil corrosion is governed by multiple factors, the rough set algorithm is applied directly to the observation data without any prior information. It removes duplicate and redundant items and simplifies the condition attributes and decision indicators of the decision table. Using the simplified indexes, the decision tree method is then used to determine the root node and branch nodes, and a knowledge decision model is constructed. The rough set and decision tree algorithms are implemented in Python with the PyCharm Community Edition software, so as to carry out artificial intelligence analysis and judgment of the soil corrosion factor data of pipelines. Taking a loam soil corrosion area as an example, the multifactor original data are analyzed and mined with the model. The example verifies that the evaluation and classification rules of the model meet the requirements, with no problems such as inconsistency or heterogeneity. The model provides decision-making support and a theoretical basis for the soil corrosion management of pipelines.
1. Introduction
Pipeline transportation is highly efficient, low in cost, and able to pass through a wide range of working conditions. It plays an irreplaceable role in energy transportation. Once a pipeline accident occurs, it not only brings huge economic losses but can also lead to casualties and environmental pollution. As a systematic mode of safety management, pipeline integrity management embodies many years of practice in pipeline safety management [1]. It is based on data collection, storage, cleaning, analysis, and mining. Data analysis and mining are the core of integrity management and the premise of its efficient application, and they serve decision-making for safe pipeline transportation. The management and analysis of soil corrosion data is an important part of managing external pipeline corrosion. Because pipeline working conditions and regions differ, the soil corrosion factors that matter and their magnitudes differ as well, and so does the selection of factors. This leads to incomplete selection of soil corrosion parameters in pipeline integrity management and to a failure to consider the relationship between important corrosion environmental factors (such as soil resistivity, redox potential, water content, and soil pH value) and region [2]. In this case, the data analysis is incomplete, and the results are one-sided or even wrong, which affects the correctness of pipeline integrity management decisions. At present, the main methods for analyzing pipeline soil corrosion factors and data are as follows [3].
The single-factor index method only considers the single-factor index of soil corrosion and is one-sided.
Fault tree analysis: this method has shortcomings in the analysis of structural importance. The minimal path set or cut set method determines the influence of the basic events of the accident tree by Boolean algebra operations. It is simple but not accurate enough: when the minimal path sets and the minimal cut sets are used to analyze the same accident, the two rankings can be inconsistent. The structural importance coefficient method requires listing the state relationship between the basic events and the top event and substituting it into the formula for the structural importance coefficient. Its results are more accurate than those of the minimal path set or cut set method, but the solution process is cumbersome, and the number of state combinations grows exponentially with the number of basic events. For example, with 8 basic events there are 2^8 (256) mutually exclusive state combinations, so manual calculation takes a long time and is difficult to complete. Both methods ignore how likely each basic event is to occur and assume that every basic event is equally likely to lead to the accident. This is inconsistent with the nonequilibrium, complex, and nonlinear random processes of actual corrosion leakage, so their preconditions are subjective and one-sided. Probability and critical importance analysis calculates the relative importance of index factors from event probabilities, and the solution steps are relatively simple. However, the probability of the top event is difficult to calculate, and when the number of basic events is large, a combinatorial explosion easily occurs.
Although the probability of top events can sometimes be calculated according to a large number of historical statistical data, the targeted accident results are different due to regional differences [4, 5].
Principal component analysis (PCA): its eigenvalue decomposition has some limitations. For example, the transformed matrix must be a square matrix, and when the data are not Gaussian distributed, the principal components obtained by the PCA method may not be optimal [6].
Extension analytic hierarchy process for the soil corrosion: the determination of its weight coefficient is subjective, which will greatly affect the correctness of the analysis results [7].
Multiple linear regression analysis: it requires a large amount of data. In the regression analysis, which factors are selected and which expression is adopted for each factor are only conjectures. This, together with the fact that some factors cannot be measured, limits the use of regression analysis in some cases [8].
Failure probability analysis method: it is a statistical analysis of soil corrosion characteristics based on historical data. Using the Weibull probability density distribution and other related functions, the statistical distribution of defect failure probability is obtained, and the parameters of the function can be changed to reflect the corrosion development characteristics and severity at different stages. However, this time-series-based method depends on historical statistics, and most of the mathematical models are simple models based on linear relationships. It is difficult for such a model to accurately describe the nonequilibrium, complex, and nonlinear random processes that occur in practice. The model itself also lacks self-learning ability, and the accuracy of its analysis needs to be improved [9]. Because of the limitations of the above methods, the accuracy of prediction and prevention in pipeline integrity management is not high and the timeliness is poor, so integrity management does not achieve its intended effect.
2. Methods
In view of the above problems, the rough set and decision tree methods are proposed to analyze pipeline soil corrosion factors and data, implemented in Python with the PyCharm Community Edition software.
2.1. Rough Set Theory (RS)
Rough set theory is a mathematical tool for dealing with uncertain problems. Working directly on the observation data and without any prior information, the rough set algorithm removes duplicate information and redundant items and simplifies the condition attributes and the decision indicators of the decision table [10].
The RS steps of data mining and weight analysis are as follows:
(a) Establish the knowledge base: the actual objective data of each index attribute are used to form the information table of attributes and objects. Each column of attributes corresponds to an equivalence relation on the objects, so the table defines a family of equivalence relations.
(b) Establish the decision table: the condition attributes of the information table are discretized and simplified according to the decision attributes. Duplicate rows and erroneous data are removed from the information table, and the condition attributes are simplified to form a decision table.
(c) Attribute importance analysis (D is the decision attribute and C is the condition attribute): after checking the positive field of D obtained with each condition attribute removed, we analyze the impact of the attributes on the decision attribute, delete those that have no impact, and calculate the importance of those that do.
(d) Rank the attributes by importance (n = |C| is the number of condition attributes).
The advantage of this method is that it does not need any prior information; it only mines, analyzes, and classifies the knowledge implicit in the objective data itself. It also has fault tolerance and generalization capability [11].
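The steps above can be sketched in Python as follows. This is a minimal illustration, assuming the textbook rough-set definitions of the positive region and dependency degree; the helper names `partition`, `positive_region`, and `significance` and the toy decision table are hypothetical and are not taken from the authors' program or data.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def positive_region(rows, cond_attrs, dec_attr):
    """Indices of rows whose condition class maps to a single decision value."""
    pos = set()
    for block in partition(rows, cond_attrs):
        if len({rows[i][dec_attr] for i in block}) == 1:
            pos.update(block)
    return pos

def significance(rows, cond_attrs, dec_attr):
    """Drop in dependency degree when each condition attribute is removed."""
    n = len(rows)
    full = len(positive_region(rows, cond_attrs, dec_attr)) / n
    result = {}
    for a in cond_attrs:
        rest = [c for c in cond_attrs if c != a]
        result[a] = full - len(positive_region(rows, rest, dec_attr)) / n
    return result

# Toy, made-up decision table (values are illustrative, not from the paper).
table = [
    {"a": 1, "b": 1, "c": 1, "grade": 1},
    {"a": 1, "b": 2, "c": 1, "grade": 2},
    {"a": 2, "b": 1, "c": 1, "grade": 2},
    {"a": 2, "b": 2, "c": 1, "grade": 3},
]
sig = significance(table, ["a", "b", "c"], "grade")
# Rank attributes from most to least important; zero means removable.
ranking = sorted(sig, key=sig.get, reverse=True)
```

In this toy table, attribute `c` is constant, so removing it leaves the positive region unchanged and its importance is zero, whereas removing `a` or `b` makes every row ambiguous.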
2.2. Decision Tree Analysis Method (Knowledge Decision)
Decision tree is an analysis method that can be used for knowledge decision-making. It takes the recursive classification of the tree structure as the model. It takes the data of index factors as the set space and uses the tree structure to classify the spatial attributes for decision-making. The root node is based on the requirements of index factor classification. Each subnode is a classification problem of index factors. It is classified into two or more blocks according to the level of index factors. Each block can continue to be classified until the generation of leaf nodes. A leaf node is the level classification under the condition of multiple indicator attributes. Each path from the root node to the leaf node represents a classification rule [12].
The steps of decision tree analysis (knowledge decision) are as follows:
(a) According to the hierarchical index factors of the RS analysis, the root node and branch nodes of the decision tree are analyzed, and attribute reduction is carried out.
(b) Selecting the nodes of the decision tree: we select the core factor as the root node of the decision tree. We select branch nodes according to the weight or importance of the attribute structure.
(c) Pruning of the decision tree: repeated classifications and opposite judgments are deleted to improve the fault tolerance and adaptability of the hierarchical evaluation.
(d) Selecting the result attribute: the corrosion grade is used as the leaf node of the decision tree classification, and the evaluation model of the decision tree is established [13].
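The pruning step (c) can be illustrated with a short Python sketch. It assumes one possible reading of that step, namely that repeated rules are merged and that conditions leading to opposite grades are discarded; the function name `prune_rules` and the rules themselves are made up for illustration.

```python
from collections import defaultdict

def prune_rules(rules):
    """Merge repeated rules and drop conditions that lead to opposite grades."""
    outcomes = defaultdict(set)
    for conditions, grade in rules:
        outcomes[conditions].add(grade)
    # A condition is kept only if every rule with it gives the same grade.
    return {c: next(iter(g)) for c, g in outcomes.items() if len(g) == 1}

# Made-up rules: each is (tuple of (attribute, grade) pairs, corrosion grade).
rules = [
    ((("a", 2), ("b", 1)), 2),
    ((("a", 2), ("b", 1)), 2),   # repeated rule
    ((("a", 1), ("b", 1)), 1),
    ((("a", 1), ("b", 1)), 2),   # opposite judgment for the same condition
    ((("a", 3),), 3),
]
pruned = prune_rules(rules)
```

Only the conditions with a single, unambiguous grade survive, so each remaining path from root to leaf corresponds to exactly one classification rule.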
2.3. Multifactors’ Case in the Data Analysis and Knowledge Mining of Pipeline Soil Corrosion
Taking the corrosion area of loam soil as an example, the rough set and decision tree methods are used to mine and analyze the original data of soil corrosion factors in Python with the PyCharm Community Edition software, so as to provide decision-making services for the management of pipelines in this area.
2.3.1. Data Analysis
Based on the buried area and location of the loam corrosion site, six influencing factors are analyzed according to the test piece data and collection batch. We randomly selected 20 groups of corrosion data for data mining. Table 1 shows the original sample of index factor values of the 20 groups of soil corrosion data for the loam area section [14].
According to the rough set method, the actual sample of index factor values of soil corrosion for the loam area pipe section in Table 1 is taken as the decision table. The selected points of pipeline soil corrosion are taken as the set of research objects I. The selected influencing factors of pipeline soil corrosion are taken as the condition attributes, T = {soil resistivity, redox potential, chloride ion content, sulfate ion content, water content, pH value}. The soil corrosion grade of the pipeline in the loam area is taken as the decision attribute, J = {average corrosion rate} = {very strong, strong, general, weak} = {4, 3, 2, 1}.
Existing discretization methods all suffer from value loss to some degree. When the number of attribute values increases, the number of breakpoints also increases. The choice of breakpoints directly determines the correctness of the discretized data. Too few breakpoints cause serious value loss, whereas too many increase the dimension and complexity and reduce the accuracy. Examples of such methods include the equal-width and equal-frequency interval discretization methods, the statistical discretization method, the greedy and improved discretization method, the clustering-based continuous attribute discretization method, and the differential evolution discretization method [15, 16]. This study starts from the requirements and purposes of discretization: discretization should ensure the consistency and simplification of the data results. Effective discretization improves the classification ability and robustness of the dataset and reduces sample conflicts and information loss. Therefore, a supervised discretization method is adopted that is based on the multifactor characteristics of pipeline soil corrosion and takes the specific attribute values of the decision table into account [17, 18].
Table 1 is discretized according to the corresponding grade classification of soil corrosion. The classification of soil corrosion factors is shown in Table 2. In this way, the value loss problem is avoided and the stability of the data discretization is guaranteed. The discretization table of soil corrosion factors in the pipe section of the loam area is shown in Table 3. Several rows of Table 3 are redundant duplicates: item 2 (or 10, 17), item 4 (or 7, 12), item 9 (or 15), item 11 (or 18), and item 16 (or 19, 20). For each group, the rows in brackets are deleted. The new decision table is used for attribute reduction and structural importance analysis according to the reduction decision rules.
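The discretization and duplicate removal described above can be sketched as follows. This is only an illustration: the breakpoints in `BREAKPOINTS` are assumed values standing in for the grade boundaries of Table 2, the two attributes are a subset of the six factors, and the measurements are made up.

```python
from bisect import bisect_right

# Hypothetical grade boundaries; the real ones would come from Table 2.
BREAKPOINTS = {
    "soil_resistivity": [20.0, 50.0],   # assumed cut points
    "water_content": [10.0, 25.0],      # assumed cut points
}

def discretize(sample):
    """Map each continuous value to a 1-based grade index using its cut points."""
    return {k: bisect_right(BREAKPOINTS[k], v) + 1 for k, v in sample.items()}

def drop_duplicates(rows):
    """Keep the first occurrence of each identical discretized row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Made-up measurements for illustration only.
samples = [
    {"soil_resistivity": 15.0, "water_content": 12.0},
    {"soil_resistivity": 18.0, "water_content": 20.0},
    {"soil_resistivity": 60.0, "water_content": 5.0},
]
graded = [discretize(s) for s in samples]
reduced = drop_duplicates(graded)
```

The first two samples fall into the same grade intervals, so they become identical rows after discretization and only one of them is kept. In a full decision table the decision attribute would be part of each row as well, so that rows with equal conditions but different grades are not merged.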
2.3.2. Attribute Reduction and Structural Importance Analysis
The importance of a condition attribute to the result attribute can be assessed by deleting that attribute from the decision table. We calculate the positive field of the result attribute classification with this attribute removed and derive an importance value from how much the positive field changes. The smaller the importance value, the less important the condition attribute is to the decision attribute; the larger the value, the more important it is. A value of zero means that the attribute has no impact on the result attribute and can be deleted [19].
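In standard rough-set notation, this procedure is usually written as below. This is the textbook formulation, given here as an assumption for illustration; the symbols \gamma and \mathrm{sig} are not taken from the original text, and the authors' exact expression may differ. I, T, and J denote the object set, the condition attribute set, and the decision attribute set.

```latex
\gamma_{T}(J) = \frac{\lvert \mathrm{POS}_{T}(J) \rvert}{\lvert I \rvert},
\qquad
\mathrm{sig}(t) = \gamma_{T}(J) - \gamma_{T \setminus \{t\}}(J),
\quad t \in T
```

Under this definition, \mathrm{sig}(t) = 0 means that removing t leaves the positive region unchanged, which matches the deletion criterion stated above.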
Combined with the soil corrosion data of the pipeline in the loam area, the whole dataset is defined as I, and T and J are the condition attribute set and result attribute set, respectively. The condition attribute set T contains soil resistivity a, redox potential b, chloride ion content c, sulfate ion content d, water content e, and pH value f. The result attribute set J is the soil corrosion grade of the loam area. That is, T = {a, b, c, d, e, f} and J = {soil corrosion grade}.
The positive fields of the result attributes are as follows:
The importance of each attribute is as follows:
The algorithm is implemented in Python with the PyCharm Community Edition software. Figure 1 shows the Python flowchart of the rough set algorithm, Figure 2 is a screenshot of the data import for rough set reduction in the Python program module, and Figure 3 shows the values calculated by the Python rough set program [20]. The output in Figure 3 consists of five items. The first item is the decision table after normalization processing. The second item is the classification item and data item under the decision attribute. The third item is the core attribute after reduction. The fourth item is the attribute that can be deleted. The fifth item is the corresponding positive field value, that is, the correct data that can be used for analysis.



According to the above calculation, the importance of influencing factors of corrosive soil pipeline in this loam area is listed as follows.
Soil resistivity = redox potential > water content > sulfate ion content = chloride ion content = pH value = 0. This indicates that the last three condition attributes have no influence on the result, so they can be deleted.
As can be seen from the positive field value in Figure 3, duplicate data items (7 and 12, 10 and 17, and 15 and 18) are deleted. The result is consistent with the values calculated above, which verifies the correctness of the programmed algorithm. At the same time, the nonpositive field items (items 1, 3, 6, and 16) are deleted from the data, as shown in Table 4.
2.3.3. Establish Decision Tree and Knowledge Mining
The key problem in establishing a decision tree is the quality of the tree structure, that is, the selection of test attributes and the pruning of the tree [21]. To make classification rules easier to find and to support knowledge discovery in pipeline data, the root node of the decision tree should use the core test attributes, and branches are then constructed from the different values of these attributes. The branch nodes use the test attributes with large structural importance values and are established repeatedly by recursive classification. Because the characteristics of the set space of pipeline soil corrosion data can lead to overfitting, the decision tree must be pruned: contradictory classification rules and repeated classification rules are deleted, so as to improve the rule classification ability of the decision tree. It can be seen from the reduction items of pipeline soil corrosion in the loam area in Table 4 that item 14 repeats item 4, so item 14 is deleted. The attribute selection, pruning, and knowledge classification decision of the decision tree are carried out using the reduction items. That is, the core index factors, soil resistivity and redox potential, are selected as the root node of the decision tree. The branch node is the water content, selected according to the importance of the attribute structure. The leaf node is the soil corrosion grade of the pipeline in the loam area. The resulting multifactor classification decision tree of soil corrosion in the loam area pipeline section is shown in Figure 4 [22].
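The recursive construction described above can be sketched as follows. This is a minimal illustration, assuming that attributes are split in a fixed order given by their importance ranking; the function name `build_tree`, the nested-dictionary representation, and the sample rows are hypothetical and do not reproduce Table 4 or Figure 4.

```python
def build_tree(rows, attrs, target):
    """Recursively split rows on attrs in the given (importance) order."""
    grades = {r[target] for r in rows}
    if len(grades) == 1:
        return grades.pop()            # leaf: a single corrosion grade
    if not attrs:
        return sorted(grades)          # conflicting rows, left unresolved
    first, rest = attrs[0], attrs[1:]
    tree = {}
    for value in sorted({r[first] for r in rows}):
        subset = [r for r in rows if r[first] == value]
        tree[(first, value)] = build_tree(subset, rest, target)
    return tree

# Made-up discretized rows for illustration only.
rows = [
    {"resistivity": 2, "redox": 1, "grade": 2},
    {"resistivity": 2, "redox": 2, "grade": 2},
    {"resistivity": 2, "redox": 3, "grade": 3},
    {"resistivity": 3, "redox": 1, "grade": 3},
    {"resistivity": 3, "redox": 2, "grade": 3},
]
tree = build_tree(rows, ["resistivity", "redox"], "grade")
```

Each key in the nested dictionary is an (attribute, grade) test, and each path from the top level down to an integer leaf reads as one classification rule.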

2.4. Pipeline Section
The decision tree algorithm is likewise implemented in Python with the PyCharm Community Edition software. Figure 5 shows the Python flowchart of the decision tree algorithm, and Figure 6 shows the values calculated by the Python decision tree program [23].


According to the analysis rules in Figure 6, when the soil resistivity is grade 2, if the redox potential is grade 1 or grade 2, the soil corrosion grade is grade 2. If the redox potential is grade 3, the soil corrosion grade is grade 3. According to the analysis rules in Figure 4, when the soil resistivity and redox potential indexes are in the right range of (3, 1) index rules, the soil corrosion grade is grade 3. When the soil resistivity and redox potential index are in the left range of (3, 1) index rule, the soil corrosion grade is grade 2. In terms of (2, 3) index rule, according to the analysis of importance, it is the same as (3, 2) index rule, so it is also on the right of (3, 1) index rule, and its soil corrosion grade is also grade 3. Therefore, it can be seen that the analysis rule in Figure 4 is consistent with the analysis rule in Figure 6.
Combined with the previous analysis, from the calculation results, the order of importance is soil resistivity = redox potential > water content. As shown in Figure 6, as long as the soil resistivity or redox potential is grade 3, the soil corrosion grade is grade 3. The validation data in Table 5 also prove its consistency.
3. Results
Six groups of soil corrosion data measured in the loam area pipeline section are used as inspection data (Table 5). According to the analysis rules in Figure 6, for item 1 in Table 5, the soil resistivity is grade 2 and the redox potential is grade 1, so the soil corrosion grade is judged to be grade 2, which is consistent with the grade 2 given by the average soil corrosion rate. For item 4, the soil resistivity is grade 2 and the redox potential is grade 2, so the soil corrosion grade is judged to be grade 2, which is again consistent with the average soil corrosion rate. For items 2, 3, 5, and 6, the soil resistivity is grade 3, so the soil corrosion grade is judged to be grade 3, which is consistent with the grade 3 given by the average soil corrosion rate. On these six inspection samples, the rule accuracy of the decision tree is therefore 100%.
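The check described above can be written as a short Python sketch. It encodes only the rules stated in the text; the function name `corrosion_grade` is hypothetical, and the redox potential grades of items 2, 3, 5, and 6 are not stated in the text, so they are left as `None` here rather than guessed.

```python
def corrosion_grade(resistivity, redox=None):
    """Rules as stated in the text; returns None when no stated rule applies."""
    if resistivity == 3 or redox == 3:
        return 3
    if resistivity == 2 and redox in (1, 2):
        return 2
    return None

# (resistivity grade, redox grade or None if not stated, observed grade)
checks = [
    (2, 1, 2),     # item 1
    (3, None, 3),  # item 2
    (3, None, 3),  # item 3
    (2, 2, 2),     # item 4
    (3, None, 3),  # item 5
    (3, None, 3),  # item 6
]
hits = sum(corrosion_grade(r, x) == g for r, x, g in checks)
accuracy = hits / len(checks)
```

All six items match their observed grades under these rules, which agrees with the reported accuracy. With only six inspection samples, this should be read as a consistency check rather than a statistical estimate.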
4. Discussion
Taking the average corrosion rate of soil as the decision attribute, the corrosion capacity of different soils can be objectively reflected, which meets the actual requirements. However, due to the complexity and accuracy requirements of its measurement, it is often time-consuming, which is not conducive to the field practical application. Therefore, the rough set method is used to analyze the relevant weight and importance of the actual and objective detection data of soil corrosion. The application method of discretization is improved. The classification of soil corrosion grade is used to discretize the data, so as to avoid the value loss problem and increase the applicability and objectivity of the analysis of its factors and data. The classification rules are established according to the core index factors of soil corrosion. According to the importance of multi-index factors, the root node, branch node, and leaf node in the decision tree are selected and the structure is optimized. It can visually analyze the corrosion grade of soil, so as to provide knowledge decision-making and data basis for soil corrosion analysis.
5. Conclusions
Based on the rough set and decision tree method, the PyCharm community edition software is used to analyze the case of pipeline soil corrosion data. With data analysis and knowledge mining, the results show that the pertinence and adaptability of pipeline integrity management can be improved only by comprehensively considering the characteristics of pipeline data and the different characteristics of the influence of environmental factors in different regions.
The importance analysis of attribute structure using the rough set method is a multivalued and nonnumerical importance processing method, which makes full use of the objective information of the original data without any prior conditions and additional information. The traditional method of attribute structure importance analysis can only deal with the problem of the binary numerical model. By using the core attributes of the rough set and the importance value of attribute structure, we can build an intuitive decision tree with easy discovery of knowledge rules, which reduces the complexity of the tree and improves the fault tolerance and classification effect.
From the model established by the decision tree based on the data reduction rules of rough set analysis, the evaluation classification rules meet the requirements, and there are no problems such as inconsistency and heterogeneity. These provide knowledge and decision-making basis for the multifactors’ classification of soil corrosion in the pipeline section.
In the future, we will increase the data points, expand the amount of data collection, and carry out method training according to different soil environments through resource integration. We will find the core factors of each soil environment through the model, extract the identification feature attributes, and build a knowledge base to provide guarantee for the intelligent identification application of subsequent models and autonomous learning.
Data Availability
The raw data required for these findings cannot be shared at this time as the data also form part of an ongoing study.
Conflicts of Interest
No potential conflicts of interest were reported by the authors.
Authors’ Contributions
Author 1 (first author) developed the methodology, contributed to the software, investigated the study, analyzed the data, and wrote the original draft. Author 2 wrote and reviewed the manuscript and investigated the study. Author 3 curated the data and collected the resources. Author 4 investigated the study and collected the resources.
Acknowledgments
This work was supported by the special scientific research program of Shaanxi Province Education Department, China (no. 15JS085) and the National Key Research and Development Program of China (2019YFF0217504).