Abstract
In recent years, with the explosive development of Internet technology, network security has become a pressing issue, and data mining technology is now widely used to process network information. Rough set theory is a natural data mining and knowledge discovery method because it analyzes and reasons about data directly, discovering implicit knowledge and revealing underlying regularities. Because rough set data mining has distinct advantages in processing information, it is worth studying how it can be applied to monitoring network security information systems. This article addresses how to monitor the security of network information systems. It reviews rough set data mining technology, proposes two data pre-processing algorithms based on it, and discusses in detail the feasibility of these two algorithms for network security information system monitoring, especially intrusion monitoring. The results show that the two data pre-processing algorithms based on rough set data mining can effectively realize the monitoring function of a common network security information system and improve its security to some extent. The simulation results show that the monitoring accuracy of the two data pre-processing algorithms in a common network security information system reaches 98.5%, about 12% higher than that of the general system.
1. Introduction
The development of the Internet has created a brand-new universal space for information sharing, information collaboration, and business expansion. It is no exaggeration to say that the Internet has become an independent super kingdom. With the rapid development of information technology, the Internet has been widely used in various industries [1], for example in government websites, distance education, e-commerce, e-banking, and information services. The Internet plays a very important role in our daily study, work, and life, bringing us a great deal of convenience [2]. However, with the rapid development and widespread application of networks, network security problems have become increasingly serious. For example, many security vulnerabilities have appeared, any of which may conceal a serious crisis [3]. The number of registered users of China’s mobile broadband Internet has reached 520 million. Three important indicators, namely the scale of China’s Internet users, the number of registered broadband Internet users, and the number of registrations under national top-level Internet domain names, have now stabilized [4]. At the same time, vulnerabilities in operating systems and application programs keep emerging, and hacker attacks and illegal intrusions are increasing.
In statistics and machine learning research, outlier analysis is applied in two scenarios. One is to perform outlier detection and data cleaning during data pre-processing; the other is to focus directly on the results of outlier detection, targeting the detected objects and applying them to the corresponding scenario. Outlier mining also has broad application prospects at the China Intrusion Software Detection Center [5]. Therefore, this paper uses rough set theory to represent and analyze network intrusion sources and to handle uncertain and incomplete data in the system, and it combines rough set theory with outlier mining to detect network intrusions [6]. We first propose an attribute processing algorithm based on rough set entropy theory and a dynamic attribute reduction algorithm based on approximate decision entropy. With these two reduction algorithms, the raw data collected by the intrusion detection system are quantitatively pre-processed [7]. Wen proposes a fault monitoring method based on fuzzy association rule mining [8].
Bai gave a detailed introduction to data mining technology, analyzed the current problems in its development, and elaborated related research methods and technologies [9]. Eissa described the research significance and research status of rough set data mining, expounded the basic theory of rough sets, and showed the importance of rough set data mining for data processing and monitoring in network security information systems [10]. Ibrahim elaborated on the methods of network security information system monitoring and discussed the advantages and disadvantages of monitoring based on rough set data mining [11]. Qian and Gong argued that the efficiency and accuracy of traditional monitoring methods are low, pointed out the feasibility of rough set data mining, and proposed a variety of data processing algorithms based on it, among which the data pre-processing algorithms performed better [12]. Wang et al. analyzed in depth the functional requirements of power information systems in terms of security and storage, proposed the characteristics of a network information security analysis architecture, and explored application measures for power information system network security technology based on big data [13]. Rui discussed countermeasures for the network security protection of power information systems, starting from account management, firewall technology, antivirus software, and data backup technology [14].
The innovations of this paper are as follows: (1) Building on a study of rough set data mining technology, two data pre-processing algorithms are proposed, and their feasibility for network security information system monitoring, especially intrusion monitoring, is discussed in detail. (2) The two data pre-processing algorithms based on rough set data mining can effectively realize the monitoring function of a common network security information system and improve its security to a certain extent.
2. Data Mining and Rough Set Theory
2.1. K-Means-PageRank Degree Distribution Algorithm
Data mining, also known as knowledge discovery in databases, is a hot research topic in the fields of artificial intelligence and databases. Degree has different meanings in different networks [15]. It can express the influence and importance of an individual [16]: the greater the degree, the greater the individual's influence and role in the whole organization, and vice versa [17]. Degree distribution is the most studied feature of complex networks. In complex networks, the number of edges varies with the number of vertices [18, 19]. We use Q to denote the degree of vertex i. For undirected networks, the degree of vertex i is the number of edges incident to it, which can be read from row i of the adjacency matrix.
The F function is used to estimate the value at each iteration. From the degrees of all vertices, the degree distribution P(k) of the network can then be calculated; it is the probability that a randomly chosen vertex in the network has degree k.
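As an illustration of the degree and degree distribution described above, the following is a minimal sketch, assuming an unweighted undirected network stored as a symmetric 0/1 adjacency matrix. The function names and the four-vertex example network are hypothetical and are not taken from the experiments in this paper.

```python
from collections import Counter


def degrees(adjacency):
    """Degree of each vertex: the row sum of a symmetric 0/1 adjacency matrix."""
    return [sum(row) for row in adjacency]


def degree_distribution(adjacency):
    """P(k): fraction of vertices whose degree equals k."""
    degs = degrees(adjacency)
    counts = Counter(degs)
    n = len(degs)
    return {k: counts[k] / n for k in sorted(counts)}


# Hypothetical 4-vertex undirected network with edges 0-1, 0-2, 0-3, 1-2.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]

print(degrees(A))
print(degree_distribution(A))
```

In this example, vertex 0 has the largest degree, which matches the reading of degree as a measure of individual influence.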
Betweenness centrality measures the ability of a node in a complex network to lie on the shortest paths between other nodes, and it is used to describe the value of a node in information dissemination [20]. It is defined in terms of the number of shortest paths from node j to node k and EW, the number of those shortest paths that pass through the node in question. The larger the betweenness centrality, the more shortest paths pass through the node, the more obvious its hub role in the whole network, the stronger its influence and control, and the more important the node. Closeness centrality is the reciprocal of the sum of the shortest-path distances from one node to all other nodes; in its formula, WOE denotes the distance between node B and node G. Compared with node degree, closeness centrality also describes how close a node is to nodes it is not directly connected to. Eigenvector centrality reflects the idea that having more friends does not by itself make a person important: a person with few friends is still important if those friends are very important. For a complex network with n nodes, the eigenvector centrality of a node is defined over x, the set of all nodes adjacent to that node, and CN, the elements of the adjacency matrix U of the network: if node i is adjacent to node j, the corresponding element is 1, and otherwise it is 0.
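The closeness centrality defined above, the reciprocal of the sum of shortest-path distances from a node to all other nodes, can be sketched as follows. This is a minimal illustration that assumes a connected, unweighted, undirected graph given as an adjacency list; the helper names and the small path graph are hypothetical.

```python
from collections import deque


def shortest_path_lengths(graph, source):
    """Breadth-first search distances (in hops) from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closeness_centrality(graph, node):
    """Reciprocal of the sum of shortest-path distances to all reachable nodes."""
    dist = shortest_path_lengths(graph, node)
    total = sum(d for other, d in dist.items() if other != node)
    return 1.0 / total if total > 0 else 0.0


# Hypothetical path graph: a - b - c - d
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

for name in graph:
    print(name, closeness_centrality(graph, name))
```

The interior nodes b and c obtain a higher closeness value than the end nodes a and d, since they are nearer on average to every other node.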
The basic idea of the PageRank algorithm is to define a random walk model, that is, a first-order Markov chain, on a directed graph; the model describes a random walker visiting nodes at random along the directed edges. PageRank is the classic algorithm used by the Google search engine to rank web pages. It is first used in a network model relating metamorphic grades to minerals, in which metamorphic grades are regarded as network nodes, such as metamorphic grade A and metamorphic grade B. If A is pointed to by an important mineral and B is pointed to by many minor minerals, the PageRank value of A may be larger than that of B, that is, A is more important than B. In the PageRank formula, PR(x) is the PageRank value of metamorphic grade x, and PR(Y) is the PageRank value of a grade Yi that links to x. When PageRank values are difficult to calculate because the iteration does not converge, a damping coefficient is needed. In PageRank, the damping coefficient is the probability that the random walker follows an outgoing link at each step rather than jumping to a randomly chosen node.
By passing the hidden state of the current neuron to the next neuron, a recurrent neural network (RNN) has a short-term “memory function.” RNNs, with their variable topology and weight sharing, are used for machine learning tasks involving structural relationships and have received much attention in the field of natural language processing. An RNN generates the output at the next time step from the current input x_t and the hidden state h_(t-1) of the previous time step. Generally, the hidden state is used directly as the output.
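Returning to the PageRank random walk model described in this subsection, the following is a minimal power-iteration sketch. It uses the standard textbook formulation rather than any variant specific to this paper; the damping value 0.85, the fixed iteration count, the treatment of nodes without outgoing links, and the example graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration for PageRank on a directed graph.

    links maps every node to the list of nodes it points to; each target
    must itself appear as a key. damping=0.85 is an assumed common value.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the random-jump share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: B, C and D all point to A; A points back to B.
links = {"A": ["B"], "B": ["A"], "C": ["A"], "D": ["A"]}
ranks = pagerank(links)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

Nodes C and D receive no incoming links, so their value stays at the random-jump share, while A, which is pointed to by three nodes, ranks highest.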
Long short-term memory (LSTM) networks are a special kind of RNN that can learn long-term dependencies. The LSTM is an optimized network structure designed to solve the vanishing and exploding gradient problems in RNN training. Compared with a plain RNN, an LSTM can better mine long-term dependencies in data, and it is among the most popular schemes today. The LSTM model includes three control gates: the input gate, the output gate, and the forget gate. The forget gate feeds the previous hidden state h_(t-1) and the current input x_t into the sigmoid function to selectively forget information from the previous node. The sigmoid function is often used as a neural network activation function because both it and its inverse are monotonically increasing. In the forget gate formula, B is the sigmoid function and X is the weight matrix of the forget gate; the output of the forget gate is applied to the previous cell state by element-wise multiplication.
The input gate controls how much new input is written into the cell state. It generates new candidate information from the output of the previous state, the input of the current state, and the tanh activation function, in order to obtain the next cell state c_t. In the input gate formula, IW and IU are the weight matrices of the input gate.
The output gate is responsible for calculating the activation value of this layer. The cell state is activated through the filter layer and multiplied by QT to give the output information Qik. In the output gate formula, Ou is the weight matrix of the output gate.
At this point, the output information of data mining and the node training module of the interactive process have been obtained.
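The forget, input, and output gates described above can be sketched as a single LSTM time step. The code follows the standard LSTM formulation, which may differ in notation and detail from the formulas used in this paper; the parameter layout (one weight matrix for the input, one for the previous hidden state, and a bias per gate) and all names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Compute W x + U h + b for list-based matrices and vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_i
        for W_row, U_row, b_i in zip(W, U, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to its (W, U, b) triple."""
    f = [sigmoid(z) for z in affine(*params["forget"], x, h_prev)]        # forget gate
    i = [sigmoid(z) for z in affine(*params["input"], x, h_prev)]         # input gate
    g = [math.tanh(z) for z in affine(*params["candidate"], x, h_prev)]   # candidate state
    o = [sigmoid(z) for z in affine(*params["output"], x, h_prev)]        # output gate
    # Element-wise products: keep part of the old cell state, add part of the candidate.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


# Hypothetical cell with 2 inputs and 1 hidden unit; all weights are zero,
# so every gate outputs sigmoid(0) = 0.5 and the candidate is tanh(0) = 0.
zero_gate = ([[0.0, 0.0]], [[0.0]], [0.0])
params = {name: zero_gate for name in ("forget", "input", "candidate", "output")}
h, c = lstm_step([1.0, -1.0], [0.0], [1.0], params)
print(h, c)
```

With all weights set to zero, the forget gate halves the previous cell state, which makes the role of the element-wise multiplication easy to check by hand.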
2.2. Network Information Security Standards and Related Specifications
Traditional companies, such as power companies, lack professional network security management personnel. Their security measures are not in place, security awareness is weak, and the configuration and maintenance of network information platforms are not followed up in time. The most common problems are improper configuration and vulnerability patches that are not synchronized and upgraded. All of these become security risks. In addition, most companies lack comprehensive security management solutions and rely too heavily on firewalls and encryption technologies.
3. Construction of Experimental Model
3.1. Experimental Data Source
The experimental software environment is Eclipse SDK (Software Development Kit) 3.5.0 on the Windows XP operating system; the hardware environment is a PC (Personal Computer) with an Intel Core 2 dual-core T7100 CPU (1.8 GHz) and 2 GB of memory. Both the classic DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm and the GFDBSCAN algorithm proposed in this paper are coded in Java. Five data sets were used in the experiment. Data set 1 is a random data set that we generated ourselves, with points (2, 2) and (25, 25) as the centers, and Data sets 2–5 are data sets commonly used to verify the clustering effect of the DBSCAN algorithm. In addition, when the original data set is very large and the user-specified support threshold is low, the DRBFP_MINE algorithm may reach the memory limit while constructing the first FPTREE. In that case, the DRBFP_MINE algorithm mines by decomposing the original data set. There are two ways to decompose a data set: partitioning-based and projection-based.
The partitioning-based method divides the original data set into smaller data sets so that the FPTREE of each can be constructed in memory. A conventional in-memory mining algorithm is then used to mine each part, and the resulting candidate frequent item sets are merged. Finally, the original data set is scanned again to count the candidate frequent item sets and determine which of them are truly frequent. The partition-based method generates a large number of local frequent item sets. These may be larger than the original data set, and they also need to be stored on disk. Scanning the original data set and counting the candidate frequent item sets stored on disk is also very time-consuming. It is also worth noting that, given the distribution characteristics of a data set, it is difficult to determine how to decompose it so that the corresponding FPTREEs occupy less memory. For example, decomposing a uniformly distributed data set does not reduce memory usage: the FPTREE of a decomposed part may be about the same size as the FPTREE of the original data set.
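The two-scan partitioning idea described above can be sketched as follows. This is not the DRBFP_MINE algorithm: for brevity, the local mining step uses brute-force enumeration in place of an FPTREE, and the function names, the relative support threshold, and the toy transactions are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
import math


def local_frequent_itemsets(transactions, min_support):
    """Brute-force local mining: count every item set in one in-memory partition."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    threshold = math.ceil(min_support * len(transactions))
    return {itemset for itemset, count in counts.items() if count >= threshold}


def partition_mine(transactions, min_support, num_partitions):
    """Two-scan partition mining: local candidates first, then a global recount."""
    size = math.ceil(len(transactions) / num_partitions)
    partitions = [transactions[k:k + size] for k in range(0, len(transactions), size)]
    # Scan 1: a globally frequent item set is locally frequent in at least one partition.
    candidates = set()
    for part in partitions:
        candidates |= local_frequent_itemsets(part, min_support)
    # Scan 2: count every candidate against the full data set.
    threshold = math.ceil(min_support * len(transactions))
    result = {}
    for itemset in candidates:
        count = sum(1 for t in transactions if set(itemset) <= set(t))
        if count >= threshold:
            result[itemset] = count
    return result


# Hypothetical transactions, mined with a 50% support threshold in 2 partitions.
transactions = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b"]]
frequent = partition_mine(transactions, min_support=0.5, num_partitions=2)
print(sorted(frequent.items()))
```

In this toy run, the second partition produces the local candidates (b, c) and (a, b, c), which the second scan rejects. This is the overhead noted above: local frequent item sets can outnumber the globally frequent ones.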
3.2. Design of the Safety Monitoring System for the Operation of Network Information Platform
The system is divided into four sub-modules: data collection, data analysis, monitoring of important equipment on the enterprise intranet, and operation safety control. The data collection sub-module covers network device, security device, and application server data collection. The data analysis sub-module provides statistical analysis of enterprise IP resource distribution, statistical analysis of historical and real-time operating status, business capability evaluation, and download of analysis results. The operation safety control sub-module includes equipment operation safety control and equipment control management; it controls the operation of the equipment according to the data analysis results and ensures that the operating status of the equipment conforms to safety rules and corporate specifications.
After network security information has been obtained, the key task is to analyze it and detect dangerous or intrusion information, so as to ensure the security of the network security information system and to handle and block dangerous information in time. Information security on the network means that the data flowing through and stored in the network system are protected from accidental or malicious damage, leakage, and modification.
3.3. Model Module Analysis Process
The algorithm model is divided into five layers, from bottom to top: the data pre-processing layer, the item set classification processing layer, the attribute reduction layer, the item set mining layer, and the decision rule processing layer. The data pre-processing layer, also called the data acquisition layer, is described as follows. Massive and complicated data must be pre-processed. Data filling and cleaning: the key tasks include data unification, abnormal data processing, data simplification, data formatting, and filling in missing data. Data integration: the goal is to clean data sources from different places and then store the data in a combined storage space according to the cleaning results; the main tasks include entity identification and the detection and resolution of data value conflicts, redundancy, and other issues. Data transformation: the main problem is “normalization,” which includes zero-mean normalization and max-min normalization. Data reduction: the goal is to reduce the standardized data while obtaining almost the same analysis results. In this paper, we use data discretization, selecting an appropriate discretization method according to the number of attributes, so that the discretized data meet the processing requirements of this article. After the above pre-processing steps, a set of original item sets is obtained.
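The two normalization methods and the discretization step mentioned above can be sketched as follows. The paper does not state which discretization method is used, so equal-width binning is shown purely as an assumed example; the function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Max-min normalization: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def zero_mean_normalize(values):
    """Zero-mean normalization: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def equal_width_discretize(values, bins):
    """Assumed discretization method: map each value to one of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value would fall into bin index `bins`, so clamp it to the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


# Hypothetical values of one continuous attribute.
values = [2.0, 4.0, 6.0, 8.0]
print(min_max_normalize(values))
print(zero_mean_normalize(values))
print(equal_width_discretize(values, bins=2))
```

Discretized attribute values of this kind are what a rough set decision table requires, since indiscernibility is defined on equal attribute values.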
The attribute reduction layer operates on the original decision table and reduces the dimension of the data through attribute reduction. First, the initial decision table S is formed according to the mining target. Second, the decision table is simplified and checked for compatibility (consistency). Third, the discernibility matrix is used to find the core. After the core is found, the attributes are divided into core attributes and non-core attributes, the importance of each attribute is computed, and the corresponding attribute reduction algorithm is executed. Finally, the reduced decision table is obtained. The information of the decision table established in this article is given in Table 1.
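The consistency check and the use of the discernibility matrix to find the core can be sketched as follows. The sketch uses the classical result that, for a consistent decision table, the core consists of the attributes that appear as single-element entries of the discernibility matrix. The decision table below is hypothetical and is not Table 1 of this paper.

```python
from itertools import combinations


def is_consistent(rows, condition_attrs, decision_attr):
    """A table is consistent if equal condition values always give the same decision."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in condition_attrs)
        if seen.setdefault(key, row[decision_attr]) != row[decision_attr]:
            return False
    return True


def discernibility_entries(rows, condition_attrs, decision_attr):
    """For each pair of objects with different decisions, yield the attributes that differ."""
    for first, second in combinations(rows, 2):
        if first[decision_attr] != second[decision_attr]:
            differing = {a for a in condition_attrs if first[a] != second[a]}
            if differing:
                yield differing


def core_attributes(rows, condition_attrs, decision_attr):
    """Core: attributes that occur as singleton entries of the discernibility matrix."""
    core = set()
    for entry in discernibility_entries(rows, condition_attrs, decision_attr):
        if len(entry) == 1:
            core |= entry
    return core


# Hypothetical decision table with condition attributes c1..c3 and decision d.
table = [
    {"c1": 0, "c2": 0, "c3": 0, "d": "normal"},
    {"c1": 0, "c2": 1, "c3": 0, "d": "attack"},
    {"c1": 1, "c2": 0, "c3": 1, "d": "normal"},
    {"c1": 1, "c2": 1, "c3": 0, "d": "attack"},
]
conditions = ["c1", "c2", "c3"]
print(is_consistent(table, conditions, "d"))
print(core_attributes(table, conditions, "d"))
```

Here the first two objects differ only in c2 while having different decisions, so c2 cannot be removed without losing the ability to distinguish them and therefore belongs to the core.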
4. Simulation Analysis of Network Security Information System Based on Rough Set Data Mining
4.1. Efficiency Analysis of Attribute Reduction Algorithm Based on Rough Set ADEAR
This algorithm is called the iterative deepening A* algorithm, which can effectively solve the problems caused by the growth of the A* search space. In this paper, a new attribute reduction algorithm, ADEAR, is proposed based on rough set data mining. As shown in Figure 1, this algorithm can simplify data screening and effectively monitor data, which also reduces the complexity of the ADEAR algorithm to a certain extent. In addition, the ADEAR algorithm in this article is implemented in Java, and the hardware environment is an Intel 2.0 GHz processor with 2 GB of memory.

Computer networks are an important means for people to understand society and obtain information through modern information technology. Network security management is the fundamental guarantee that people can use the Internet safely and healthily. To verify the effectiveness of this algorithm, as shown in Figure 2, we conducted a monitoring experiment on a network security management system. The attribute-related records in the data set of the security system used in the experiment are given in Table 2.

As shown in Figure 3, we compared the ADEAR algorithm with six representative attribute reduction algorithms. We first conducted an attribute reduction experiment with the ADEAR algorithm, then implemented CIQFS in Java and conducted the corresponding attribute reduction experiment with it.

As shown in Figure 4, the reduction results of several different algorithms mentioned above are compared. The experimental results of the five algorithms POSFS, CEFS, DISMFS, GAFS, and PSORSFS can be obtained from the paper. The reduction results of each reduction algorithm on the corresponding data set are given in Table 3.

As shown in Figure 5, the classification accuracy of the above seven reduction algorithms is compared. We use the experimental method designed by Wang et al., in which 10-fold cross-validation is used to estimate the classification accuracy of each reduction algorithm. For the training set after reduction, the Rough Set Exploration System (RSES) is used to extract the decision rules.

Finally, these rules are applied to the test set for classification testing (conflicts are resolved using the “Standard Voting” method). The screening rate of security system information achieved by the rough set attribute reduction algorithm ADEAR is shown in Figure 6.
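The 10-fold cross-validation procedure used to estimate classification accuracy can be sketched as follows. The sketch does not use RSES or decision rules: a trivial majority-class classifier stands in for the rule-based classifier, the folds are contiguous and unshuffled for reproducibility, and all names and data are hypothetical.

```python
from collections import Counter


def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(samples, labels, train_fn, predict_fn, k=10):
    """Average test accuracy over k folds; the caller supplies the classifier."""
    accuracies = []
    for test_idx in k_fold_indices(len(samples), k):
        if not test_idx:
            continue
        held_out = set(test_idx)
        train_x = [samples[i] for i in range(len(samples)) if i not in held_out]
        train_y = [labels[i] for i in range(len(samples)) if i not in held_out]
        model = train_fn(train_x, train_y)
        correct = sum(1 for i in test_idx if predict_fn(model, samples[i]) == labels[i])
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


def train_majority(train_x, train_y):
    """Stand-in classifier: remember the most frequent training label."""
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, sample):
    return model


# Hypothetical data: 18 normal records followed by 2 attack records.
samples = list(range(20))
labels = ["normal"] * 18 + ["attack"] * 2
accuracy = cross_validate(samples, labels, train_majority, predict_majority)
print(accuracy)
```

In practice the records would normally be shuffled or stratified before splitting, so that each fold contains a similar share of intrusion records.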

As shown in Figure 7, the classification accuracy of the ADEAR algorithm on the lymphatic data set is lower than that of POSFS. This is because the lymphatic data set contains 6 outliers, which affect the final classification performance of the ADEAR algorithm.

POSFS is relatively less affected by these outliers, so its classification accuracy is higher than that of the ADEAR algorithm. The performance of the ADEAR algorithm therefore still has some room for improvement. The efficiency of security system information monitoring with the rough set attribute reduction algorithm ADEAR is given in Table 4.
It can be seen from the data in Figure 8 that the rough set attribute reduction algorithm ADEAR is acceptable for monitoring security system information. It can monitor relevant information largely in real time, and its information feedback is faster than that of existing general monitoring algorithms.

As can be seen from the data in Figure 9, the rate of security system information screening achieved by the rough set attribute reduction algorithm ADEAR is acceptable and basically meets the needs of data screening, and the processing efficiency of the experimental network security system is about 10% higher than that of the general processing method.

As shown in Figure 10, we measured the running time of the ADEAR algorithm and compared it with that of two reduction algorithms, SCE and FSPA-SCE. The data for the latter two algorithms were obtained from the experiment. The accuracy of security system information monitoring with the rough set attribute reduction algorithm ADEAR is shown in Figure 11.


It can be seen from the data in Figure 12 that the rough set attribute reduction algorithm ADEAR monitors information in the security system with high accuracy: the monitoring accuracy for the network security information system reaches about 98%. The model performance is given in Table 5, and the performance indicators of the various branching algorithms are given in Table 6.

4.2. Performance Analysis of Outlier Detection Algorithm ODIWOMR Based on Rough Set Weighted Distance
As shown in Figure 13, experiments were conducted to verify the network security information monitoring performance of the ODIWOMR algorithm. In the original data set, the proportion of intrusion records is too high, which is inconsistent with the actual network environment.

As given in Table 7, unsupervised intrusion detection has a prerequisite: intrusion behavior is treated as an outlier, so its proportion cannot be too high and must be far smaller than the proportion of normal behavior. If intrusion behavior accounts for too large a share of the data, it will not be detected as an outlier.
As given in Table 8, the data in the data set were screened in this experiment so that the proportion of intrusion behavior is much smaller than that of normal behavior: intrusion records account for about 2.5% of the data, and normal records for at least 97.5%. The efficiency of security system information monitoring with ODIWOMR, the outlier detection algorithm based on rough set weighted distance, was then evaluated.
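The screening step above amounts to a small calculation: given the number of normal records, how many intrusion records may be kept so that they make up about 2.5% of the screened data. The following sketch shows that arithmetic with random subsampling; the record counts, the function names, and the fixed random seed are hypothetical and do not describe the actual screening procedure used in the experiment.

```python
import random


def intrusions_to_keep(num_normal, intrusion_fraction):
    """Intrusion record count that makes up roughly intrusion_fraction of the result."""
    return round(num_normal * intrusion_fraction / (1.0 - intrusion_fraction))


def screen_records(normal, intrusions, intrusion_fraction=0.025, seed=0):
    """Keep all normal records and a random subsample of the intrusion records."""
    keep = min(len(intrusions), intrusions_to_keep(len(normal), intrusion_fraction))
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return normal + rng.sample(intrusions, keep)


# Hypothetical record counts: 9750 normal records and 1000 intrusion records.
normal = ["normal"] * 9750
intrusions = ["intrusion"] * 1000
screened = screen_records(normal, intrusions)
print(len(screened), screened.count("intrusion") / len(screened))
```

With 9750 normal records, 250 intrusion records are kept, so that intrusions form 2.5% of the 10000 screened records.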
From the data in Figure 14, it can be seen that ODIWOMR, the outlier detection algorithm based on rough set weighted distance, is very efficient at monitoring security system information: it can monitor relevant information largely in real time, its information feedback is very fast, and its efficiency is about 25% higher than that of the existing general monitoring algorithm.

As given in Table 9, the data set has 41 attributes in total after the class attribute is removed. From the definition of attribute importance, we can conclude that not all attributes help the final detection result or contribute equally to it: some attributes contribute little to the detection results, and some contribute nothing. The accuracy of security system information monitoring with the rough set attribute reduction algorithm ADEAR is shown in Figure 15.
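The observation that attributes contribute unequally motivates weighting each attribute in the distance between records. The following is a generic sketch of weighted-distance outlier scoring and is not the ODIWOMR algorithm itself: the weighted overlap distance, the mean-distance score, the attribute weights (which stand in for attribute importance), and the toy records are all assumptions made for illustration.

```python
def weighted_distance(x, y, weights):
    """Weighted overlap distance: sum the weights of the attributes on which x and y differ."""
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


def outlier_scores(objects, weights):
    """Score each object by its average weighted distance to every other object."""
    n = len(objects)
    scores = []
    for i, x in enumerate(objects):
        total = sum(weighted_distance(x, y, weights)
                    for j, y in enumerate(objects) if j != i)
        scores.append(total / (n - 1))
    return scores


# Hypothetical records with two discrete attributes; the second attribute is
# assumed to be more important, so it receives the larger weight.
records = [("tcp", "SF"), ("tcp", "SF"), ("tcp", "SF"), ("udp", "S0")]
weights = [0.25, 0.75]
scores = outlier_scores(records, weights)
print(scores)
print("most outlying record:", scores.index(max(scores)))
```

An attribute with weight zero would have no influence on the score, which corresponds to the case above where an attribute contributes nothing to the detection result.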

From the data in Table 10, it can be seen that ODIWOMR, the outlier detection algorithm based on rough set weighted distance, monitors security system information with high accuracy: its monitoring accuracy in common network security information systems is as high as 98.5%, about 12% higher than that of the general system.
5. Conclusion
Network data contain a large amount of sensitive information about individuals, companies, and government departments. Protecting information security from the source to the receiver is the fundamental task of information network transmission. This paper analyzed the feasibility of the rough set attribute reduction algorithm ADEAR, presented the corresponding working principles and theoretical guidance, and elaborated the advantages and disadvantages of the algorithm. ADEAR monitors relevant information largely in real time, and its information feedback is fast, slightly better than that of existing general monitoring algorithms. In addition, its monitoring accuracy for the network security information system reaches about 98%. The feasibility and superiority of ODIWOMR, the outlier detection algorithm based on rough set weighted distance, were also discussed and verified. Experiments verified that the monitoring accuracy of these two data pre-processing algorithms in common network security information systems is as high as 98.5%, about 12% higher than that of the general system. The two data pre-processing algorithms based on rough set data mining can effectively realize the monitoring function of common network security information systems and can improve their security to a certain extent.
Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by The Ministry of Education of Science and Technology Development Center of Production Innovation Fund China's Colleges and Universities (Grant no. 2020 IT A07027).