Abstract

In recent years, machine learning (ML) algorithms have proven effective in intrusion detection. However, because ML algorithms are mainly applied to detect network anomalies, the detection accuracy for multiple types of cyberattacks cannot be fully guaranteed. The existing algorithms for network intrusion detection based on ML or feature selection rely on correlations between features and cyberattacks, some of which are spurious and cause misclassifications. To tackle these problems, this research aimed to establish a novel network intrusion detection system (NIDS) based on causal ML. The proposed system started with the identification of noisy features by causal intervention, preserving only the features that have a causal relationship with cyberattacks. Then, an ML algorithm made a preliminary classification to select the most relevant types of cyberattacks, after which the counterfactual detection algorithm identified the unique attack label. In addition to a relatively stable accuracy, the complexity of cyberattack detection was also effectively reduced, with the number of training features reduced by up to 94%. Moreover, when several types of cyberattacks were present, the detection accuracy was significantly higher than that of previous ML algorithms.

1. Introduction

Cyberattacks [1] refer to offensive actions that alter, disrupt, deceive, degrade, or destroy computer systems, networks, information, or programs in these systems. In recent years, the high frequency of cyberattacks has posed severe threats to network security and even national security, leading to significant declines in network performance and service interruptions. Hence, a great number of protection mechanisms [2, 3] have been proposed and deployed, such as firewalls, antiviruses, and malware detection software. However, these countermeasures have proved insufficient to provide complete protection against cyberattacks in modern network environments.

Although firewalls can provide rule-based network protection, more intelligent mechanisms are required to detect advanced network intrusions in high volumes of traffic data. To this end, several network intrusion detection systems (NIDSs) [4–6] have been designed using ML methods. A NIDS provides real-time monitoring of network traffic and sends out an instant alarm or blocks suspicious activities if a network attack is detected. ML methods are widely utilized in NIDSs to detect network anomalies, mainly by extracting features from traffic data.

Although ML-based NIDSs have been shown to be robust in real-time traffic monitoring, their accuracy and efficacy are still compromised by imprecise features, which depend greatly on human experience. Meanwhile, a fixed feature set may not be appropriate for detecting different types of network intrusions, as some features may be redundant or unrelated, which slows down the ML process. Therefore, it is essential to explore the best features [7] to increase the accuracy of a detection system.

To overcome the abovementioned barriers, the application of causal ML methods in NIDSs is proposed in this paper. Traffic features can be classified into two classes: causal features and noisy features. Causal features are those that have causal relationships with a network intrusion; that is, they are caused by cyberattacks. When cyberattacks are launched, these features become abnormal; when the cyberattacks stop, they return to normal. For example, traditional distributed denial-of-service (DDoS) attacks exhaust the bandwidth, central processing unit (CPU) power, or memory of the victim host by flooding an overwhelming number of packets from thousands of compromised computers (zombies) to deny legitimate flows. The most frequent DDoS attacks flood a huge volume of traffic data and consume network resources, such as bandwidth, buffer space at the routers, CPU power, and recovery cycles of the target server. Noisy features, in contrast, have no causal relationship with a cyberattack, although they may have a statistical correlation with it [8]. Noisy features can degrade detection performance because they may disrupt a detection system in real deployment.

To distinguish noisy features from causal features in NIDSs, we present two causal ML methods for NIDSs, including causal intervention and counterfactual reasoning.

The main contributions of this paper include the following:
(i) We propose a novel causal ML-based NIDS. By establishing a causal link between cyberattacks and traffic features through causal intervention, noisy features can be identified and removed.
(ii) A counterfactual detection algorithm based on the Bayesian network (BN) is developed to classify cyberattacks based on causal features.
(iii) The performance of the causal ML-based NIDS is evaluated on the CICIDS19, UNSW-NB15, and NSL-KDD datasets. The experimental results confirm the effectiveness of the proposed approach.

This paper is organized as follows.

Section 2 provides a brief discussion of the existing relevant studies on NIDSs and their limitations, as well as a summary of the contributions of this study. Section 3 presents a detailed discussion of the theories and governing equations of the employed techniques. Section 4 presents the novel causal ML-based NIDS. Section 5 discusses the experimental results, and Section 6 summarizes the main achievements of this research.

2. Literature Review

As one of the important areas in computer science and network security, intrusion detection based on ML [9–11] is a research hotspot, and numerous scholars [12–15] have already carried out a variety of explorations on this topic. Tang et al. [16] established a deep neural network model for NIDSs, trained on the NSL-KDD dataset; their model showed robustness in detecting flow-based anomalies in software-defined networking (SDN). Daya et al. [17] proposed BotChase, a two-phased graph-based bot detection system leveraging both unsupervised and supervised ML; the first phase prunes presumably benign hosts, while the second phase achieves bot detection with high precision. The study in [18] proposed an adaptive ensemble learning model that develops a multitree algorithm, achieving an accuracy of 84.2% on the NSL-KDD dataset.

As reported previously, optimization of the size of the training features is worthy of investigation. Importantly, irrelevant features in a dataset can undermine the accuracy of a model and increase the time required to train it. Thus, numerous explorations have been conducted to determine the optimal training size. Feature selection [11, 19–22], a process of selecting the most relevant features manually or by algorithms, has been used to reduce the time and space complexity of model construction. Hadeel et al. [23] proposed a wrapper feature selection algorithm for intrusion detection; this method uses a dove-inspired optimizer to implement the feature selection, and the binarizing algorithm based on the proposed cosine similarity showed a faster convergence speed and a higher accuracy than the sigmoid method. Another study [24] developed a feature selection model combining the ID3 classifier and the BEES algorithm, in which the BEES algorithm generates the desired feature subset. Chung and Wahid [25] introduced a new simplified version of particle swarm optimization for feature selection, incorporating a local search strategy that speeds up feature selection by finding the optimal neighborhood solution; the algorithm reduced the features used to represent network traffic behavior in the KDDCUP99 dataset from 41 to only 6, with an accuracy of 93.3%. However, the methods mentioned above select features based only on relevance, and some noisy features may still affect the detection accuracy.

In addition to the size of the training features, correct classification of cyberattacks is also of great importance in the existing studies. The existing algorithms for NIDSs based on ML or feature selection all rely on the correlation between features and cyberattacks to realize the classification. This causes several wrong classifications due to the existence of a large number of spurious correlations [26]. To solve this problem, causal reasoning [27–32] is frequently utilized. At present, causal reasoning mainly adopts two models [33]: the structural causal model (SCM) [34] and the potential outcome model (POM) [35]. An SCM is made of endogenous (manifest) and exogenous (latent) variables, while the POM provides the causal effects [36] through mathematical definitions. However, conducting randomized trials [37] with both the SCM and the POM is expensive, time-consuming, and sometimes unethical. Additionally, the accuracy is low, owing to insufficient consideration of the influences of exogenous variables (variables outside the cyberattack model, which affect the model but are not affected by it) [26] and noisy factors on the causal features.

Based on the deficiencies of the abovementioned algorithms, this paper starts by decoupling the correlations among features and classifying the types of cyberattacks under counterfactual scenarios to achieve high accuracy in the detection of cyberattacks. The counterfactual model is based on the BN, which can model the relationships among hundreds of cyberattacks and features. Firstly, the correlations among features are decoupled through causal intervention, and noisy features that do not affect the detection outcome are deleted. Secondly, based on the retained causal features, the most relevant types of labels are selected, and the counterfactual detection algorithm is then applied to find the unique label. For instance, given evidence ε = e and some hypothetical interventions, the likelihood of observing a different outcome ε = e′ is calculated through the counterfactual detection algorithm; then, the expected number of anomalous features is calculated to identify the cyberattack with the highest likelihood in the counterfactual scenario [26].

3. Preliminaries

In this section, we present a brief introduction about causal reasoning.

3.1. Strong Spurious Correlations

Traditional ML is driven by association, which makes it difficult to achieve consistent predictions on unknown test datasets. In association mining, traditional ML will pick up noncausal (noisy) features, such as the relationship between risk factors and abnormal features, and such strong spurious correlations will be used for prediction.

For example, in Figure 1, risk factor R causes DDoS attacks X1, X2, and X3, and X1 and X2 cause anomalies in traffic features Y1 and Y2. If X1 and X2 have not been observed or counted in the prior data, risk factor R will inevitably lead to the joint appearance of X3, Y1, and Y2. If the analysis is based on a correlation algorithm only, the conclusion that X3 is the cause of Y1 and Y2 may be completely wrong.

A classic New England Journal of Medicine paper on chocolate and the Nobel Prize [38] illustrates such strong spurious correlations. According to the paper, the more chocolate a country consumes, the more Nobel Prizes it wins. This conclusion seems absurd at first glance, but what is wrong with a conclusion based on the observed facts? Statistical analysis of the data shows that there is indeed a linear relationship between a country's chocolate sales and the number of Nobel Prizes it has won. Causal analysis, however, indicates that there is only a strong spurious correlation between chocolate sales and the number of Nobel Prizes.

3.2. Definitions

It is supposed that Y = {C, V} is the traffic feature set, where C is a causal feature set and V indicates a noisy feature set (V = Y\C). X ∈ {0, N} represents a network attack.

As noisy features have no causal relationships with network intrusions, the conditional probability P (X|Y) satisfies the following condition [8]: P (X|C, V) = P (X|C).

Although there is no causality between X and V, they may show a strong correlation in the statistical data (Figure 2(b)). If the spurious relationship is not distinguished from causation, it may lead to errors in real-world data distributions, even if the ML model is trained well.

To define causality: if, with other conditions unchanged, changing X causes a change in Y, then there is a causality between X and Y. If X and Y can be measured, the causal relationship between them can be calculated by changing the value of X and observing the change in Y. If the magnitude of the causal relationship between X1 and Y is stronger than that between X2 and Y, X1 is considered the cause of Y.

In general, cyberattacks cause the anomaly of traffic features, as shown in Figure 3. For the sake of a simpler analysis, exogenous variables are ignored. As mentioned earlier, if other conditions remain unchanged and a change in {Y1, Y2, …, Yn} leads to a change in X, there is a causal relationship between {Y1, Y2, …, Yn} and X; in the factual direction, X is the cause and {Y1, Y2, …, Yn} are the effects.

3.3. SCM

The detection models used in our experiments are BN models, which represent the relationships between cyberattacks, risk factors, and traffic features. BNs are an increasingly popular modelling technique in cybersecurity [39], especially because of their capability to overcome data constraints (when it is impossible to learn causality between variables directly from data). In BNs, the probability is interpreted as a degree of confidence. As shown in Figure 4, in the three-layer BN model, the traffic features are influenced by the corresponding cyberattacks, where Z is the risk factor of the network being attacked, X denotes the type of cyberattack, and Y represents the traffic features. In the noisy-OR model, Y = X1 ∨ X2 ∨ … ∨ Xn, so as long as some attack type Xi = 1, then Y = 1. This pattern (Figure 4) can be extended to more complex network models with more layers.

In causal inference, the BN is generalized by the more fundamental SCM, and existing BNs can be expressed as SCMs [40, 41]. An SCM consists of three components [42]: a graphical model, structural equations, and counterfactual and interventional logic.

The key characteristic of SCMs is that they represent each variable as a deterministic function of its direct causes together with an unobserved exogenous "noise" term, which represents all causes outside of the model. For example, in a network without cyberattacks, some traffic features may still be abnormal, which is due to unobserved exogenous variables. If the unobserved exogenous variables u = {u1, u2, …, un} are specified, the causal Markov condition [26, 42, 43] will be satisfied (for a random variable X and a given set of variables S, if X is conditionally independent of all variables outside S given a subset MB ⊆ S, the minimal such set MB is the Markov blanket of X).

Assumption 1. It is assumed that the observed variables are Y = {Y1, Y2, …, Yn} in the SCM of the directed acyclic graph [42]; each variable is determined by its parent variables pa (Y) together with an exogenous noise term u, so that Y = f (pa (Y), u). For each variable Y, the parent variable X (i.e., X = pa (Y)) in the model has a noise term u with an unknown distribution P (u), such that Y = f (X, u).

Assumption 2. In the noisy-OR model [39], it is assumed that the probability that a variable Y behaves as normal (Y = 0) despite a network attack (Xi = 1), owing to noisy variables, is qi. It is assumed that the variables Y are independent of each other; then, P (Y = 0|X1, X2, …, Xn) = q1^X1 q2^X2 ⋯ qn^Xn. For instance, if the network devices are installed with antivirus software or firewalls, some traffic features may not become anomalous.
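As a concrete illustration of Assumption 2, the following minimal Python sketch evaluates the noisy-OR probabilities described above; the inhibition probabilities q are illustrative values assumed for this example, not parameters taken from the paper.

```python
import numpy as np

def noisy_or_p_y0(x, q):
    """P(Y = 0 | X) in a noisy-OR model.

    x : 0/1 vector of attack indicators (Xi = 1 means attack i is active)
    q : per-parent inhibition probabilities qi, i.e. the chance that feature Y
        stays normal even though attack Xi is active (Assumption 2)
    """
    x = np.asarray(x, dtype=float)
    q = np.asarray(q, dtype=float)
    # each active parent independently fails to trigger Y with probability qi
    return float(np.prod(q ** x))

def noisy_or_p_y1(x, q):
    """P(Y = 1 | X): the feature is anomalous unless every active cause fails."""
    return 1.0 - noisy_or_p_y0(x, q)

# toy example with two attack types and made-up inhibition probabilities
q = [0.1, 0.3]
print(noisy_or_p_y1([1, 0], q))   # 0.9  -> only attack 1 active
print(noisy_or_p_y1([1, 1], q))   # 0.97 -> both attacks active
```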

3.4. Causal Intervention

The causal detection problems (quantifying the magnitude of causality, feature selection, and handling unobserved exogenous variables and noisy variables) can be addressed by a causal intervention called the "do-operation."

Definition 1 (do-operation). The postintervention distribution resulting from the action do (Y = y) is given by equation (4) [40]: P (X = x|do (Y = y)) = Pm (X = x|Y = y). The do-operator signifies that we are dealing with an intervention rather than a passive observation, and the subscript m represents the modified probability distribution. From the perspective of probability distributions, P (X = x|Y = y) is the probability of X = x among the samples in which Y happens to equal y, whereas P (X = x|do (Y = y)) is the probability of X = x when all Y are fixed to y. Intervention changes the distribution of the original data, while conditioning does not.
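The difference between conditioning and the do-operation can be made concrete with a small simulation in the spirit of Figure 1, where a hypothetical risk factor Z confounds a traffic feature Y and the attack label X; all of the probabilities below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# hypothetical generative model: risk factor Z drives both feature Y and label X,
# while Y has no causal effect on X at all
z = rng.random(n) < 0.3                      # P(Z = 1) = 0.3
y = rng.random(n) < np.where(z, 0.9, 0.1)    # Y is almost entirely determined by Z
x = rng.random(n) < np.where(z, 0.8, 0.2)    # X is also determined by Z, not by Y

# passive observation: conditioning on Y picks up the spurious, Z-induced link
p_x_given_y1 = x[y].mean()

# intervention do(Y = 1): by the truncated factorization,
# P(X | do(Y = 1)) = sum_z P(X | Y = 1, Z = z) * P(Z = z)
p_x_do_y1 = sum(x[(y == 1) & (z == v)].mean() * (z == v).mean() for v in (0, 1))

print("P(X=1 | Y=1)     ~", round(p_x_given_y1, 3))   # inflated by confounding (~0.68)
print("P(X=1 | do(Y=1)) ~", round(p_x_do_y1, 3))      # ~ marginal P(X=1) = 0.38: no causal effect
```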

3.5. Counterfactual Detection [26]

Counterfactuals enable us to quantify how well a cyberattack (i.e., X = 1) explains anomalous features by determining the likelihood that the features would not be present under an intervention that switches off the cyberattack, do (X = 0), as given by the counterfactual probability P (Y′ = 0|Y = 1, do (X = 0)). If this probability is high, X = 1 is a good causal explanation of the anomalous features. It should be noted that this probability refers to two contradictory states of Y, and thus it cannot be represented as a standard posterior probability.

The principles for counterfactual detection of cyberattacks are as follows [26, 37]:
(1) The likelihood that a cyberattack causes an anomalous feature should be proportional to the posterior likelihood of that attack.
(2) A cyberattack X that cannot cause an anomalous feature cannot constitute a causal link between the feature and the attack.
(3) A type of cyberattack that causes a greater number of anomalous features should be more likely to have a causal relationship with these features.

4. A Novel Causal ML-Based NIDS

In this section, the causal ML-based NIDS (CMLN) framework and time complexity will be introduced.

4.1. Framework

This study aims to develop a novel causal ML-based NIDS. As illustrated in Figure 5, the proposed framework is divided into four main stages. The first stage is data preprocessing, consisting of Z-score normalization, Min-Max normalization, and deletion of incorrect and fuzzy row sets. The purpose of this step is to improve the performance of the training model and reduce the class imbalance problem [26] that often appears in network traffic data. Hence, the data are first encoded numerically and standardized with the Z-score so that features of different magnitudes become comparable. Then, because in causal reasoning the value of a normal feature must equal 0 and that of an anomalous feature must be a positive integer [37, 40], the features are normalized to natural numbers. Finally, incorrect and fuzzy row sets are removed to reduce the size of the training dataset and improve the accuracy on the validation dataset.

The second stage of the framework is feature selection, which reduces the number of features required by the ML models and the counterfactual detection algorithm. Although noisy features may be correlated with the causal features, they have no causal effect on the classification outcome. The causal relationships between the features and cyberattacks are identified through causal intervention; the noisy features are then deleted, and only a few features are retained. This not only reduces the time required for model classification but also reduces the time required for training without sacrificing other functions.

Causally related variables are correlated, whereas uncorrelated variables have no causal relationship. ML algorithms are therefore used in the third stage of the framework to select several classes of labels: the labels with the largest correlation are selected as the reference labels for the fourth stage, which also reduces the complexity of the counterfactual detection algorithm. Consequently, the counterfactual detection algorithm only needs to calculate the expected anomalous features of the K candidate cyberattacks rather than of all M labeled cyberattacks (K comprises the reference labels selected by the ML algorithm, and M covers all labeled cyberattacks).

In the fourth stage, according to the causality, it is determined whether the result of the counterfactual detection algorithm changes when certain preconditions change, and the magnitude of the causal effect then provides the basis for the counterfactual judgment. Given the evidence ε = e and an intervention that switches off all cyberattacks except Xa in the counterfactual scenario, the number of expected anomalous features E (Xk, ε) is calculated (Xa belongs to Xk, and Xk comprises the reference labels selected by the ML algorithm). Finally, the Xk with the largest value of E (Xk, ε) is the most likely cyberattack.

With the joint action of these four stages, the causal ML-based NIDS could ensure a high accuracy in the detection of anomalous features when the types of cyberattacks are increased.

4.2. Data Preprocessing

The data preprocessing stage covers data normalization using the Z-score, positive integerization using Min-Max normalization, and deletion of incorrect and fuzzy row sets.

4.2.1. Z-Score Normalization

Z-score normalization [44, 45] of the data is carried out first. The most common standardization method is Z-score standardization, also known as standard deviation standardization. The main purpose of the Z-score is to transform features of different magnitudes onto the same scale so that the resulting Z-score values are comparable. The method uses the mean and standard deviation of the original data to standardize it. The processed data conform to the standard normal distribution, that is, the mean value is 0 and the standard deviation is 1, and the transformation function is Y′ = (Yinst − U)/σ, where Yinst is the initial feature value, U denotes the mean feature vector, and σ is the standard deviation.
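A minimal sketch of this Z-score step (the formula is the standard one above; the guard against constant columns is an addition for numerical safety):

```python
import numpy as np

def z_score(Y):
    """Column-wise Z-score normalization: (Y - mean) / std."""
    Y = np.asarray(Y, dtype=float)
    U = Y.mean(axis=0)            # mean feature vector U
    sigma = Y.std(axis=0)         # standard deviation per feature
    sigma[sigma == 0] = 1.0       # avoid division by zero on constant columns
    return (Y - U) / sigma
```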

4.2.2. Min-Max Normalization

Min-Max normalization [46], also known as deviation normalization, is a linear transformation of the original data, where max is the maximum and min the minimum of the sample data. In the counterfactual detection algorithm, the value of a normal feature is 0 and that of an anomalous feature is a positive integer; thus, the data need to be normalized to natural numbers. Data normalization is a necessary step in which each value is scaled to an appropriate range, which helps eliminate large deviations in the features: Y′ij = N (Yij − min (Yj))/(max (Yj) − min (Yj)), where Y′ij indicates the normalized value of Yij in the integer range 0 to N, min (Yj) represents the minimum value of the jth feature, and max (Yj) is the maximum value of the jth feature.
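A small sketch of this Min-Max integerization; the number of integer levels N is a user choice, and the value used below is illustrative rather than prescribed by the paper:

```python
import numpy as np

def min_max_integerize(Y, n_levels=10):
    """Scale each feature column to [0, n_levels] and round to integers,
    so that 0 corresponds to the minimum (normal) value and positive
    integers indicate increasingly anomalous values."""
    Y = np.asarray(Y, dtype=float)
    lo, hi = Y.min(axis=0), Y.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return np.rint(n_levels * (Y - lo) / span).astype(int)
```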

4.2.3. Removal of Incorrect and Fuzzy Row Sets

A row in the intrusion detection dataset is invalid or incorrect if some of its feature values are empty or if its label does not correspond to any normal or attack category. Alternatively, if the same row of features corresponds to multiple types of cyberattacks (for example, the features [0, 1, 1, 1] corresponding to both DDoS and Exploits), the row is a fuzzy row set [47]. Incorrect and fuzzy sets cannot be labeled by ML algorithms. Therefore, they need to be deleted in the data preprocessing stage, leaving only the subset in which the row features and labels have a one-to-one correspondence (e.g., the row of features [0, 1, 1, 1] corresponds uniquely to a DDoS attack), so as to improve the robustness of the causal ML-based NIDS.
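The row-cleaning step can be sketched with pandas as follows; the column names are placeholders, and the exact matching rules used in the paper's implementation may differ.

```python
import pandas as pd

def drop_incorrect_and_fuzzy_rows(df, feature_cols, label_col="label"):
    """Remove invalid rows (missing feature values or missing label) and
    'fuzzy' rows whose identical feature vector maps to more than one label."""
    # incorrect rows: any empty feature value or empty label
    df = df.dropna(subset=list(feature_cols) + [label_col])

    # fuzzy rows: the same feature vector appears with several different labels
    labels_per_row = df.groupby(list(feature_cols))[label_col].transform("nunique")
    return df[labels_per_row == 1].reset_index(drop=True)
```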

4.3. Feature Selection

If some features are irrelevant to the cyberattacks and have no causal effect on the classification outcome [26], these features are noisy features. Normally, manual matching of features can be used to directly eliminate the impact of noisy features on the classification outcome. However, when training with ML algorithms, a classifier will constantly fit these features, producing a spurious correlation between the noisy features and cyberattacks and ultimately impairing the performance of the classifier. Feature selection therefore assesses the causal effect associated with each feature; the noisy features are distinguished and deleted based on this effect, yielding the best combination of causality-based features.

4.3.1. Identification of Noisy Features

As shown in Figure 6, various relationships can exist between cyberattack X and feature Y under the factual conditions. If the causal relationship and its direction between these two variables are not clarified, the judgment of the type of cyberattack may be affected. As displayed in Figure 6(b), it is assumed that Yi and Yj have a mutual causal relationship, so the anomaly of one feature will lead to the anomaly of the other. In this case, the conclusion that the anomalous feature Yj is caused by the cyberattack X may be wrong.

According to this hypothesis, the causal direction between cyberattack X and feature Y is reversed relative to the fact, as illustrated in Figure 6(c). Feature Y can therefore be intervened upon, and the causal relationship between Y and X can be determined from the change in the expected value of X, which is formulated in equation (7) [48] as the postintervention distribution P (X = x|do (Y = y)).

If the conditions between Y and X satisfy the following rules, respectively, equation (7) can be written as (8)–(15) [43].

Rule 1. If Yi and Yj are independent, then P (X = x|do (Yi = yi)) = P (X = x|Yi = yi).

Proof. In the statistical model, the joint distribution factorizes by the chain rule as P (x1, x2, …, xn) = P (x1) P (x2|x1) ⋯ P (xn|x1, …, xn−1). According to the Markov blanket [26, 43], in a directed acyclic graph, given its parent nodes, a variable is independent of its nondescendants. Hence, the abovementioned formula can be abbreviated as P (x1, x2, …, xn) = ∏i P (xi|Pa (xi)), where Pa (xi) represents the parent nodes of xi. This formula also represents a BN. As depicted in Figure 6(c), it can be simplified as follows: P (x, yi, yj) = P (x|yi, yj) P (yi) P (yj). According to the truncated factorization, P (x, yj|do (Yi = yi)) = P (x|yi, yj) P (yj). Marginalizing over yj, P (x|do (Yi = yi)) = ∑yj P (x|yi, yj) P (yj). Thus, when Yi and Yj are independent, P (yj) = P (yj|yi), and P (x|do (Yi = yi)) = ∑yj P (x|yi, yj) P (yj|yi) = P (x|Yi = yi).

Rule 2. If Yi and X are independent, then P (X = x|do (Yi = yi)) = P (X = x).

Rule 3. If Yi is independent of both Yj and X, then P (X = x, Yj = yj|do (Yi = yi)) = P (X = x, Yj = yj). The causal effect [49] can be calculated by the measure E of X and Y: E = |E [X|do (Y = 1)] − E [X|do (Y = 0)]|.

Definition 2 (noisy features). As for noncausal features, if E/N (where N is the size of the training dataset) is less than the threshold δ (δ ≤ 0.01), there is no causal relationship [50] between X and Y. Such features are considered noisy features and should be deleted from the dataset.
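The following sketch estimates a causal-effect magnitude for a single binary feature by the backdoor adjustment over one other feature, mirroring the two-feature setting of Figure 6. It is a simplified estimator under assumed binary data; the exact quantity and normalization behind Definition 2 (E/N) follow the paper's equations, so the threshold comparison here should be read as indicative only.

```python
import numpy as np

def causal_effect(x, y_i, y_j):
    """Estimate | P(X=1 | do(Yi=1)) - P(X=1 | do(Yi=0)) | with the backdoor
    adjustment over a single other binary feature Yj (all inputs are 0/1 arrays)."""
    x, y_i, y_j = (np.asarray(a) for a in (x, y_i, y_j))

    def p_x1_do(v):
        total = 0.0
        for w in (0, 1):
            mask = (y_i == v) & (y_j == w)
            if mask.any():
                # P(X=1 | Yi=v, Yj=w) * P(Yj=w), summed over w
                total += x[mask].mean() * (y_j == w).mean()
        return total

    return abs(p_x1_do(1) - p_x1_do(0))

def is_noisy(x, y_i, y_j, delta=0.01):
    """Definition 2 in spirit: flag Yi as noisy when its estimated causal
    effect on X falls below the threshold delta."""
    return causal_effect(x, y_i, y_j) < delta
```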

4.3.2. Removal of Noisy Features

Causal interventions are performed on all features, as shown in Figure 7. In the process of feature selection, only those features that have a causal relationship with the labeled attacks are selected; as illustrated in Figure 7, the correlations among the features themselves are hidden.

If there is no causal relationship between {Y1, Y3, …, Yn−1} and X or the other features, equation (15) can be transformed into equation (17) according to Rule 3 as follows:

If equation (17) holds, the causal relationship can be recovered in the factual causal direction between cyberattacks and anomalous features, as shown in Figure 8.

According to equation (17), if an intervention is made on Y1, Y3, …, Yn−1, then the intensity of the causal effect EL between Y1, Y3, …, Yn−1 and Xk can be obtained, where L ∈ {1, 3, …, n − 1}; if EL/N < δ, the BN of cyberattacks and features can be simplified (Figure 9).

As displayed in Figure 9, the features Y1, Y3, and Yn−1 can be deleted when the data are preprocessed according to the abovementioned method, and the causal structure is simplified accordingly.

4.3.3. The Process of Feature Selection

Based on the above method, all noise features satisfying Definition 2 will be deleted. Only the causal features are retained, and the selection process is as shown in Algorithm 1.

Input: feature set P = {Y1, Y2, …, YN}, which contains N features
Output: causal feature set C, which contains N − |Count| features
(1) Cmax = ∅ // Cmax represents the maximum set of deleted features
(2) Cun[i] = ∅ // Cun[i] represents the set of features deleted when the sweep starts from the ith feature in set P
(3) for i from 1 to N
(4)  for j from i to N + i − 1 // feature indices are taken modulo N
(5)   compute the causal effect E of feature Yj on the cyberattack X by causal intervention (Section 4.3.1)
(6)   if E/N < δ
(7)    delete the feature Yj
(8)    Cun[i] = Cun[i] ∪ {j} // noise feature numbers are stored in set Cun[i]
(9)   end if
(10)  end for
(11) end for
(12) Count = [] // Count represents the final set of noise features
(13) for i from 1 to N // compare the sets Cun[i] and assign the set with the most noise features to Count
(14)  if |Cun[i]| > |Count|
(15)   then Count = Cun[i]
(16)  end if
(17) end for
(18) for i from 0 to len (Count) − 1
(19)  delete the noise feature Count[i] from P
(20) end for
(21) output the causal feature set C = P
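A compact Python rendering of this selection loop is sketched below; effect_fn is a placeholder for the per-feature causal-effect estimator of Section 4.3.1, and the wrap-around indexing is our reading of the loop bounds in lines (3) and (4) of Algorithm 1, so this should be treated as an interpretation rather than the authors' implementation.

```python
def crfs_select(labels, features, effect_fn, delta=0.01):
    """Sketch of the CRFS loop in Algorithm 1.

    labels    : 0/1 array of attack labels (one attack type vs. normal)
    features  : 2-D array with one column per traffic feature (N columns)
    effect_fn : callable(labels, features, j, kept_indices) -> float, an
                assumed estimator of the causal effect of column j
    delta     : noise threshold from Definition 2
    """
    n_features = features.shape[1]
    deletion_sets = []                       # Cun[i] in Algorithm 1

    for i in range(n_features):              # try every starting feature
        deleted = []
        for step in range(n_features):       # one full sweep, wrapping around
            j = (i + step) % n_features
            kept = [k for k in range(n_features) if k not in deleted and k != j]
            if effect_fn(labels, features, j, kept) < delta:
                deleted.append(j)            # Definition 2: below threshold -> noisy
        deletion_sets.append(deleted)

    # keep the sweep that removed the most noisy features (the Count set)
    count = max(deletion_sets, key=len)
    return [k for k in range(n_features) if k not in count]   # causal feature indices C
```
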
4.4. Classification of Cyberattacks

Although the causality is simplified after feature selection, as shown in Figure 9, there is still a many-to-many relationship between cyberattacks and traffic features. The key question for the counterfactual detection algorithm is how to choose the most appropriate labeled attack to explain the causality of the features. According to causal inference, it can be assumed that changes in the result of the counterfactual detection are associated with certain changes in the preconditions; thus, the magnitude of the causal effect provides the basis for the causality judgment. For instance, to quantify the causality of anomalous features caused by a cyberattack in a NIDS, counterfactual detection can be used for inference.

As illustrated in Figure 10, the left panel is the factual graph and the right panel is the counterfactual graph. All variables with apostrophes in the counterfactual conditions are equal to the corresponding variables without apostrophes in the factual conditions. It is assumed that, given the evidence ε = e and an intervention that sets X to 0, the counterfactual likelihood can be calculated as P (ε′ = e′|ε = e, do (X = 0)). Therefore, counterfactual inquiry provides a formal language to quantify the probability of a counterfactual anomalous feature e′ = 1 when it is assumed only that the attack X = 0.

Definition 3 (expected sufficiency [26]). The expected sufficiency of cyberattack Xa is the expected number of anomalous features that would persist if an intervention switched off all other possible causes of the anomalous features: E (Xa, ε) = ∑Y∈Y+ P (Y′ = 1|ε, do (Pa (Y+)\Xa = 0)), where Xa denotes cyberattack type a, Y+ indicates the anomalous features under the factual conditions, Pa (Y+) denotes the parent nodes of Y+, that is, all cyberattacks that may cause the anomalous features, Pa (Y+)\Xa is the set of parents of Y+ excluding Xa, Y′ represents the anomalous features in the counterfactual situation, and ε denotes the set of all factual evidence features. If E (Xa, ε) is the maximum among all E (X, ε), the cyberattack type Xa is a causal explanation for the given evidence ε.

Inference 1. According to equation (19) and the SCM [26, 51], the expected sufficiency of cyberattack Xa is given by equation (20), where Y− denotes the normal features in the set of all factual evidence features. Solving for the noisy and exogenous variables is complicated and cumbersome, but it is unnecessary in equation (20); at the same time, the value of L can be calculated from the prior data. Therefore, equation (20), obtained through counterfactual reasoning, greatly simplifies the causal relationship between cyberattacks and traffic features.
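As a rough illustration of how Definition 3 can rank the candidate attacks shortlisted in stage three, the sketch below approximates the expected sufficiency in a noisy-OR model by weighting each observed anomalous feature with the posterior of the candidate attack and the probability that its noisy-OR link is not inhibited. This ignores the posterior coupling between the attack variable and the inhibition noise that the paper's exact expression accounts for, and every name and number in it is hypothetical.

```python
def expected_sufficiency(attack, anomalous_feats, posterior, parents, q):
    """Approximate E(Xa, eps): the expected number of observed anomalous
    features that would persist if every other possible cause were switched off.

    attack          : candidate attack label Xa
    anomalous_feats : features observed anomalous (Y+)
    posterior       : attack -> P(Xa = 1 | evidence), e.g. from the stage-3 classifier
    parents         : feature -> set of attacks that can cause it
    q               : (attack, feature) -> noisy-OR inhibition probability
    """
    score = 0.0
    for y in anomalous_feats:
        if attack in parents[y]:
            # with all other causes forced off, Y stays anomalous only if Xa is
            # active and its noisy-OR link is not inhibited
            score += posterior[attack] * (1.0 - q[(attack, y)])
    return score

def rank_attacks(candidates, anomalous_feats, posterior, parents, q):
    """Candidate attacks sorted by (approximate) expected sufficiency."""
    return sorted(candidates, reverse=True,
                  key=lambda a: expected_sufficiency(a, anomalous_feats,
                                                     posterior, parents, q))

# toy usage with entirely made-up attacks, features, and probabilities
parents = {"flow_rate": {"DDoS", "SYN"}, "pkt_size_var": {"DDoS"}}
q = {("DDoS", "flow_rate"): 0.1, ("SYN", "flow_rate"): 0.2, ("DDoS", "pkt_size_var"): 0.3}
posterior = {"DDoS": 0.7, "SYN": 0.4}
print(rank_attacks(["DDoS", "SYN"], ["flow_rate", "pkt_size_var"], posterior, parents, q))
```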

4.5. Time Complexity

To determine the time complexity of the proposed causal ML-based NIDS, the complexity of each algorithm used in each stage must be determined. As the performance of different algorithms at different stages is compared, the overall time complexity is determined by the algorithm producing the highest complexity. It is assumed that the dataset is composed of M samples and N features; in general, M ≫ N.

Starting with the data preprocessing stage, the complexity of the Z-score and Min-Max normalization is O (N), as all the samples of the N features within the dataset must be normalized, and the complexity of deleting the incorrect and fuzzy row sets is O (M). Therefore, the overall complexity of the first stage is O (M).

The time complexity of the second stage is O (N²): this stage intervenes on all the features, taking N steps, each of which is compared against (N − 1)/2 features. In the third stage, the complexity of the KNN classifier can be estimated as O (Ml ∗ K) [9], and the time complexity of the random forest is O (Ml ∗ K ∗ D), where K (K < N) is the dimension after feature selection, Ml denotes the number of samples after deleting the incorrect and fuzzy row sets, and D is the depth of the tree. The time complexity of the fourth stage is O (T ∗ Ml ∗ K), where T (T < M and T < D) represents the number of cyberattack types selected in the third stage.

Based on the aforementioned discussion, the overall complexity of the proposed framework is O (Ml ∗ K ∗ D). The time complexity of data preprocessing and feature selection is O (M + N²); as M ≫ N, this is approximately O (M), which is far less than the O (M ∗ N²) time complexity of feature selection methods such as MOMBNF [9]. Determining the overall time complexity is critical because the model will often be retrained to learn new patterns of cyberattacks.

5. Performance Evaluation

5.1. Experimental Setting

The CICIDS19 dataset was released in 2019 by the Canadian Institute for Cybersecurity; it contains benign traffic and the most up-to-date common cyberattacks, resembles real-world data, and has a total of 87 features [47]. The dataset contains 11 types of attacks: DRDOS_MSSQL, DRDOS_SNMP, SYN, DRDOS_NTP, TFTP, UDP-LAG, DRDOS_NETBIOS, DRDOS_DNS, DRDOS_UDP, DRDOS_LDAP, and DRDOS_SSDP. As shown in Table 1, it also includes labeled flows of network traffic based on timestamps, source and destination IPs, source and destination ports, protocols, and attacks.

The raw network packets of UNSW-NB15 [52] were created by the Australian Cyber Security Center, and the dataset is a comprehensive set of cyberattack traffic data. Compared with older datasets, these datasets are more appropriate for research on NIDSs. The UNSW-NB15 dataset has nine types of cyberattacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. As presented in Table 2, tools such as Argus are used by UNSW-NB15 to generate a total of 49 features with corresponding labels.

NSL-KDD [53, 54] contains 7 major categories of attacks: ipsweep, Neptune, nmap, portsweep, Satan, smurf, and teardrop. The elimination of redundant records in the NSL-KDD training set helps classifiers avoid bias toward more frequent records. The training and test sets contain a reasonable number of instances, so the dataset can be used as a valid benchmark to help researchers compare different intrusion detection methods. As shown in Table 3, there are 41 features in NSL-KDD.

The fuzzy logic system (FLS) [47] is used to evaluate the realism of the CICIDS19, UNSW-NB15, and NSL-KDD datasets. The FLS is based on the Sugeno fuzzy model [55], which investigates the realism of IDS datasets. The three datasets contain sets of network intrusion attacks that reflect real-world standards, and their generation processes fully consider the characteristics of network intrusion attacks and the dynamics of the network.

To use the various algorithms effectively, Python was used to implement our model. The hardware and software specifications are summarized in Table 4.

5.2. The Results of Experiments

This section presents three sets of experiments to verify the effectiveness of the proposed causal ML-based NIDS.

5.2.1. Influences of Data Preprocessing on the Training Samples

Concerning the effects of data preprocessing on the size of the training samples, learning curves of the training accuracy and cross-validation accuracy were obtained as the size of the training samples changed. Because the amount of data in the datasets is large enough, about 10% of the data works well as a test set, so a 90:10 split is used in this paper. After normalization, the datasets are randomly divided into training and test sets using the 90/10 splitting criterion.
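The split and learning-curve protocol of this subsection can be reproduced along the following lines with scikit-learn; the synthetic data, the random forest classifier, and all hyperparameters below are stand-ins for the preprocessed datasets and models actually used in the experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, learning_curve

# stand-in data; in the experiments X and y come from the preprocessed
# CICIDS19 / UNSW-NB15 / NSL-KDD feature matrices
X, y = make_classification(n_samples=20_000, n_features=20, n_informative=8,
                           weights=[0.8, 0.2], random_state=0)

# 90:10 train/test split, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0)

# learning curves of training vs. cross-validation accuracy against sample size
sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X_train, y_train, train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5, scoring="accuracy", n_jobs=-1)

print("train sizes:", sizes)
print("train acc  :", np.round(train_scores.mean(axis=1), 3))
print("cv acc     :", np.round(cv_scores.mean(axis=1), 3))
```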

(1) Influences of Data Preprocessing on the Size of Training Samples. In this study, the Z-score, SMOTE [56–58], CFS [9, 59–61], and CRFS (causal reasoning-based feature selection) were compared. In the SMOTE pipeline, the SMOTE algorithm oversamples the minority classes after Z-score processing, and in the CFS pipeline, CFS selects features after SMOTE processing. For the CRFS method proposed in this paper, the causal reasoning-based feature selection presented in Section 4.3 is applied after Z-score processing. The cross-validation curves of the different datasets under different types of cyberattacks after processing by these four methods are shown in Figures 11 and 12.

Figure 11 compares the accuracy and the number of training samples required for the four methods (here, all cyberattacks are treated as a single type named "abnormal"). As depicted in Figure 11, for the training accuracy and cross-validation accuracy to converge, the number of training samples required for the Z-score and SMOTE was more than 16,000 and that for the CFS was within 10,000; however, the CRFS required only about 5,000 training samples, significantly fewer than the Z-score, SMOTE, and CFS, while ensuring the same training accuracy.

The accuracy and the number of training samples required for the four methods with multiple types of cyberattacks are compared in Figure 12. As shown in Figure 12, for the training accuracy and cross-validation accuracy to converge, the number of training samples required for the Z-score and SMOTE was close to 10,000, that for the CFS was within 5,000, and that for the CRFS was close to 4,000, a decrease of 60%, 60%, and 20% compared with the Z-score, SMOTE, and CFS, respectively. Meanwhile, the CRFS achieved the highest training accuracy, which was about 10% higher than the best accuracy achieved by the SMOTE.

As illustrated in Figures 11 and 12, with the increase of types of cyberattacks, the number of training samples required for the Z-score, SMOTE, and CFS significantly increased, while the training accuracy noticeably decreased. As for the number of training samples required for the CRFS, it basically remained below 5,000 samples and the training accuracy slightly decreased. This highlights the positive influence of utilizing the CRFS technique, as it could significantly reduce the size of the required training samples without sacrificing the detection performance.

(2) Influences of Data Preprocessing on the Time Required for Training. To further highlight the influences of the data preprocessing stage, Table 5 summarizes the time required for different methods to construct the learning curve under different types of cyberattacks. For instance, when there were two types of cyberattacks, nearly 483 s was needed for the Z-score to establish the learning curve, which was reduced to 370 s after processing by the SMOTE and 154 s after processing by the CFS. However, the time required to construct the learning curve after processing for the CRFS was only 90 s, which was 81.4%, 75.7%, and 41.6% lower than that of the Z-score, SMOTE, and CFS, respectively.

This indicates that the CRFS not only guarantees the detection accuracy but also effectively reduces the time required for training. The analysis in Section 4.5 shows that the feature selection algorithm proposed in this article has a lower time complexity than the other algorithms. Because the noisy features are deleted by the CRFS, the ML algorithms only need to fit the causal features, so the accuracy of the subsequent steps is guaranteed while the time required for training is reduced.

5.2.2. Influences of Feature Selection Methods on the Number of Features Required

In this experiment, three groups of control experiments were set up, comparing the number of features and the training accuracy after data processing by the SMOTE, CFS, and Min-Max, with the CRFS algorithm then used to further select features. In Tables 6–17, "SMOTE (do)", "CFS (do)", and "Min-Max (do)" indicate that the CRFS method is applied to further process and select the data after processing by the corresponding method.

The numbers of features left after processing by the different algorithms in the CICIDS19 dataset under different types of cyberattacks are shown in Table 6. After processing by the CRFS algorithm, the number of features required for training decreased by more than 50% at the minimum and by 94% at the maximum compared with that before processing. Moreover, the number of features retained by the CRFS algorithm was significantly lower than that obtained by the CFS algorithm. This is because the CRFS, based on causal reasoning, selects only the network features that have a causal relationship with the cyberattacks and eliminates features with spurious correlations. The CFS is a feature selection method based on high correlation, which can also greatly reduce the number of features; however, it still selects some noncausal features with spurious correlations, resulting in a higher number of features than the CRFS.

The detection accuracies of SMOTE versus CRFS, CFS versus CRFS, and Min-Max versus CRFS on the CICIDS19 dataset are shown in Tables 7–9, respectively. As presented in these tables, although the number of features required for training was markedly reduced after data processing by the CRFS algorithm, the training accuracy still maintained about 99% of the original algorithm's accuracy, and the decrease is almost negligible compared with the number of compressed features. The results show that the CRFS algorithm not only effectively reduces the number of features required for training but also keeps the training accuracy at a relatively stable level. This is because the CRFS algorithm identifies the real causal relationships between cyberattacks and features, while the eliminated features are only spuriously correlated ones, which barely influence the accuracy.

The numbers of features left in the UNSW-NB15 dataset after data processing by the different algorithms under different types of cyberattacks are shown in Table 10. After further processing of the features by the CRFS algorithm, the number of features required for training was reduced by more than 50% at the minimum and more than 82.5% at the maximum compared with that before processing. When there were few types of cyberattacks, the additional compression obtained by applying the causal selection to the data already processed by the CFS was significantly smaller; owing to the strong correlation and strong causality in UNSW-NB15, the feature set remained largely unchanged after the CFS processing. However, when there were several types of cyberattacks, the reduction after further processing by the CRFS algorithm was also significant, up to 54.5%.

The detection accuracies of SMOTE versus CRFS, CFS versus CRFS, and Min-Max versus CRFS on the UNSW-NB15 dataset are shown in Tables 11–13, respectively. As presented in these tables, when there were few types of cyberattacks, the training accuracy basically remained unchanged although the number of features required for training was noticeably reduced after processing by the CRFS algorithm, and the effect was obvious.

In the NSL-KDD dataset, after further processing of the features by the CRFS algorithm, the maximum reduction of the number of features required for training exceeded 82.5%. As in the other datasets, the number of features required for training was noticeably reduced after processing by the CRFS algorithm in NSL-KDD.

To sum up, the CRFS algorithm effectively reduced the number of features required for training in the CICIDS19, UNSW-NB15, and NSL-KDD datasets while keeping the training accuracy acceptably stable. In particular, when there were fewer types of cyberattacks, the time and computational complexity were greatly reduced while the training accuracy was basically unchanged. This proves that causal features can not only complete the NIDS detection task but also ensure the stability of the accuracy rate; the selected causal features might also provide targeted help for subsequent preventive measures.

5.2.3. Influences of Different Types of Cyberattacks on the Detection Performance

To evaluate the performance of the different classifiers and study the effects of the different optimization methods, the accuracy on the test data (ACC) is used as the evaluation index. Random search (RS) and the tree-structured Parzen estimator (TPE) are the two parameter-tuning methods with the highest accuracy for the KNN and random forest in MOMBNF [9], and CMLN denotes the proposed causal ML-based NIDS.

The performance of the different classifiers on the CICIDS19, UNSW-NB15, and NSL-KDD datasets under different types of cyberattacks is compared in Tables 18–20. As shown in Table 18, on the CICIDS19 dataset, the detection accuracy of MOMBNF decreased significantly as the number of cyberattack types increased. When there were 11 types of cyberattacks, the detection accuracy of all the parameter optimization methods in MOMBNF was lower than 90%, and the accuracy on the test set after IGBS data processing was even lower than 30%. However, after CMLN training, the accuracy on the test set was stable above 98.5%, which is about 9% higher than that of the best RS-KNN-CFS method. Tables 18–20 show that, regardless of the composition of the datasets, the test-set accuracy of CMLN was higher than that of MOMBNF and BRS [47], especially when there were several types of cyberattacks, and the detection rate of CMLN was higher than that of MOMBNF.

6. Conclusions

Although ML aims to facilitate the detection of anomalies, it is important to first understand how detection is performed and clearly define the desired output of our algorithms. When traditional ML algorithms cannot decouple correlation and causality, it is difficult to achieve a stable prediction [8]. Therefore, this paper proposed a novel causal ML-based NIDS. Firstly, by establishing a causal link between cyberattacks and features through causal intervention, the noisy features could be deleted and the minimum size of training features could be determined. Then, the ML and counterfactual detection algorithm were used to find out the unique label. Finally, CICIDS19, UNSW-NB15, and NSL-KDD datasets were utilized to evaluate the performance of the proposed detection method.

The experimental results showed that the CRFS method proposed in this paper could reduce the size of the training samples and the training time by at least 40%. Meanwhile, the number of features required for training was greatly reduced after data processing by the CRFS algorithm, while the training accuracy remained acceptably stable, proving that the deletion of noisy features does not affect the detection accuracy. The results also showed that, compared with other optimization techniques, CMLN achieved the highest detection accuracy (when there were 11 types of cyberattacks, the accuracy was improved by nearly 9% compared with the best RS-KNN-CFS method). This confirms that the counterfactual detection algorithm can effectively identify the causal relationship between features and the types of cyberattacks.

At present, new cybersecurity threats are becoming ever more severe, and they cannot always be classified according to the existing classification methods. Hence, how to effectively combine unsupervised learning and causal ML to construct new NIDSs for detecting new cybersecurity threats may be a new direction for investigation.

Data Availability

The data used to support the findings of this study can be accessed from https://www.unb.ca/cic/datasets/index.html, https://ieee-dataport.org/documents/unswnb15-dataset, and https://www.unb.ca/cic/datasets/nsl.html.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by Natural Science Foundation of China (no. 61972412) and National Key Research and Development Program of China (no. 2018YFB0204301).