Abstract
In recent years, numerous cyber security incidents have caused considerable damage to society. The development of efficient and reliable Intrusion Detection Systems (IDSs) is an effective countermeasure against growing cyber threats. In modern high-bandwidth, large-scale network environments, traditional IDSs suffer from high rates of missed and false alarms. Researchers have introduced machine learning techniques into intrusion detection with good results. However, due to the scarcity of attack data, the training sets of such methods are usually unbalanced, which degrades analysis performance. In this paper, we survey and analyze the design principles and shortcomings of existing oversampling methods. Based on these findings, we take the perspective of the imbalance and high dimensionality of intrusion detection datasets and propose an oversampling technique based on Generative Adversarial Networks (GAN) and feature selection. Specifically, we model the complex high-dimensional distribution of attacks with the Wasserstein GAN with Gradient Penalty (WGAN-GP) to generate additional attack samples. We then select a subset of features representing the entire dataset based on analysis of variance, ultimately producing a rebalanced low-dimensional dataset for machine learning training. To evaluate the effectiveness of our proposal, we conduct experiments on the NSL-KDD, UNSW-NB15, and CICIDS-2017 datasets. The results show that our method effectively improves the detection performance of machine learning models and outperforms the baselines.
1. Introduction
The rapid development of network technology has dramatically improved people's daily lives, but it has also brought many threats. For example, Marriott International's Starwood network was maliciously breached, and the private information of some 500 million users was compromised (http://sn.people.com.cn/n2/2018/1204/c190199-32363619.html). Not only is personal information at risk of being copied on the Internet, but corporate production is also under serious threat. In 2018, Taiwan Semiconductor Manufacturing Company (TSMC), the world's largest chip foundry, was compromised by the WannaCry ransomware, which led to a complete shutdown of all production lines and ultimately caused losses of approximately NTD 5.2 billion (https://english.cw.com.tw/article/article.action?id=2194). The 2018-2019 Global Application and Cyber Security Report released by Radware shows that 93% of respondents suffered network attacks in the preceding 12 months. Network security has become an issue that people cannot ignore.
Intrusion Detection Systems (IDSs) have been widely adopted as an effective method to detect and defend against the growing number of network attacks. An IDS monitors network traffic in real time, divides network records into normal and malicious records, and provides essential information to the defense system. In the last few decades, machine learning has been used to improve intrusion detection [1]. Nevertheless, due to the sparsity of attack data, the training sets of such approaches are unbalanced, which degrades analysis performance [2]. Oversampling techniques are commonly used to address unbalanced datasets; traditional methods generate samples among nearest neighbors by interpolation, such as the Synthetic Minority Oversampling Technique (SMOTE) [3] and the Adaptive Synthetic Sampling Technique (ADASYN) [4]. The Generative Adversarial Network (GAN) is a newer generative model that provides a different framework for sample generation [5]: through a game between a generator and a discriminator, the generator learns the data features well enough to simulate the data distribution. GANs have achieved state-of-the-art results in generating images, sound, and text [6–8], and researchers in other fields are increasingly applying them to their own problems.
This paper presents an oversampling technique based on Generative Adversarial Networks (GAN) and Feature Selection (GAN-FS), applied to intrusion detection from the perspective of data imbalance and high dimensionality. We construct an attack sample generation model based on WGAN-GP, an improved generative adversarial network. In addition, considering the large volume and high dimensionality of intrusion detection data, we use Analysis of Variance (ANOVA) for dimensionality reduction. Effective dimensionality reduction removes redundant and irrelevant features, mitigating the curse of dimensionality and thus improving classification accuracy [9]. Our contributions can be summarized as follows:
(1) We propose a new oversampling method, GAN-FS, to solve the class imbalance problem in intrusion detection. We construct an attack generation model based on WGAN-GP to generate attack samples, and then apply ANOVA feature selection to obtain a rebalanced low-dimensional dataset for training the intrusion detection model.
(2) Based on three popular intrusion detection datasets, we conduct experiments on several machine learning detection models. The experimental results show that our approach can effectively improve the performance of intrusion detection models and achieves better results than multiple popular methods.
(3) We discuss and analyze the impact of our approach on different datasets and different machine learning detection models.
The remainder of the paper is organized as follows. Section 2 provides an overview of relevant studies. Section 3 presents GAN-FS. The experimental design, execution, and results are presented in Section 4. Finally, the paper is concluded in Section 5.
2. Related Work
We propose an oversampling technique for intrusion detection based on GAN and feature selection. We therefore discuss related work in four areas: intrusion detection methods (Section 2.1), feature selection (Section 2.2), oversampling techniques (Section 2.3), and generative adversarial networks (Section 2.4).
2.1. Intrusion Detection
As an essential tool of cyber security, IDSs are responsible for identifying and warning of cyber attacks. Since the first paper on IDSs [10] was published, there have been numerous research achievements in this field. In recent years, IDSs have developed rapidly with the help of machine learning.
Aslahi-Shahri et al. proposed an intrusion detection algorithm based on a genetic algorithm and a support vector machine [11]; they used a hybrid algorithm for feature selection, ranked the selected features by importance, and achieved good results. Elbasiony et al. proposed a hybrid network intrusion detection framework based on random forest and weighted k-means [12], combining random forest and k-means based detection models into a hybrid model that effectively reduces the false alarm rate. Wang et al. proposed an intrusion detection method based on artificial neural networks and fuzzy clustering that improves on traditional intrusion detection methods [13]. The above methods can effectively detect and prevent network attacks, but they focus on improving and combining existing techniques, ignoring the imbalance of intrusion detection datasets. For example, in the widely used NSL-KDD dataset [14], the number of normal samples is 67,343, while the numbers of R2L and U2R attacks are only 995 and 52, respectively. In the UNSW-NB15 dataset [15], the number of normal samples is 37,000, while the numbers of Shellcode and Worms attacks are only 378 and 44, respectively. This imbalance affects the detection accuracy and stability of IDSs.
2.2. Application of Feature Selection in Intrusion Detection
Feature selection refers to obtaining a subset of an original feature set according to a particular selection criterion, retaining the dataset's relevant features. Intrusion detection datasets are characterized by large volume and high dimensionality. Feature selection reduces computational cost and eliminates data redundancy, thus improving the detection rate of machine learning techniques and reducing false alarms [1].
Khammassi and Krichen proposed a GA-LR wrapper approach for feature selection in network intrusion detection [16]. They used a genetic algorithm-based wrapper as the search strategy and logistic regression as the learning algorithm to select the best subset of features, and the method effectively improves intrusion detection performance. Mohammadi et al. proposed an intrusion detection method based on feature selection and a clustering algorithm using filter and wrapper methods [17]: the filter method is a feature-grouping algorithm based on linear correlation coefficients, and the wrapper method is the cuttlefish algorithm. With this method, the performance of their intrusion detection model improved significantly.
In general, feature selection is an effective method for data dimensionality reduction and is widely used in intrusion detection. The feature selection-based data dimensionality reduction method can effectively improve intrusion detection performance.
2.3. Oversampling Techniques
To improve the ability of machine learning models to judge and analyze minority samples in the presence of sample imbalance, researchers have proposed several rebalancing techniques at different levels, such as data-level and algorithm-level. The data-level oversampling technique increases the number of minority samples by artificial means to improve the dataset’s balance. The typical methods mainly include random oversampling, SMOTE, and ADASYN [3, 4]. In recent years, researchers have proposed new oversampling methods, such as K-means SMOTE and G-SMOTE [18, 19]. These techniques have improved the data imbalance to varying degrees. Table 1 demonstrates the design rationale for the above methods.
Random oversampling increases the number of minority samples by duplication but can lead to severe overfitting; moreover, if the minority samples are biased or noisy, duplication amplifies their interference with the classifier. SMOTE generates samples by interpolation, avoiding the sample overlap caused by random oversampling. However, because SMOTE treats all minority samples equally and does not consider the category information of neighboring samples, it cannot effectively enhance decision boundaries, which degrades classification. ADASYN takes the data distribution into account while generating samples by interpolation; however, the complexity of network traffic blurs category boundaries, and this strategy may exacerbate the confusion of decision boundaries.
To solve these problems, researchers have made continuous attempts. K-means SMOTE clusters the data and generates different numbers of minority samples per cluster according to cluster density; because the clustering does not rely on category labels, the generated samples tend to remain in safe regions. G-SMOTE replaces SMOTE's data generation mechanism by defining a flexible geometric region around each minority sample and generating synthetic instances inside the boundaries of that region.
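For reference, the interpolation-based methods above are implemented in the imbalanced-learn library. The following is a minimal sketch of how such baselines can be applied; the toy data is purely illustrative, standing in for a preprocessed intrusion detection training set:

```python
# Sketch of interpolation-based oversampling baselines via imbalanced-learn.
# X_train/y_train here are synthetic stand-ins for real preprocessed data.
from collections import Counter
import numpy as np
from imblearn.over_sampling import SMOTE, ADASYN  # KMeansSMOTE is also available

rng = np.random.RandomState(0)
X_train = rng.rand(1000, 20)              # 1000 records, 20 features
y_train = np.array([0] * 950 + [1] * 50)  # 950 normal vs. 50 attack records

for sampler in (SMOTE(random_state=0), ADASYN(random_state=0)):
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(type(sampler).__name__, Counter(y_res))  # both classes now ~950
```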
2.4. Generative Adversarial Networks
GAN is a deep learning model that models complex high-dimensional distributions of real-world data. Inspired by the two-person zero-sum game in game theory, it consists of a Generator (G) and a Discriminator (D), both of which are neural networks. G captures the potential distribution of real data samples and generates new data samples; D is a binary classifier that judges whether an input is real data or a generated sample. The classification results are passed back through the loss to update the weights of G and D. Both networks are trained until D can no longer distinguish real samples from generated samples. The optimization process is a minimax game problem whose goal is a Nash equilibrium, at which point the generator network can estimate the distribution of the data samples. The objective function is defined as follows:

$$\min_{G}\max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))],$$

where $p_{data}$ is defined as the real sample distribution, $p_g$ is the sample distribution generated by the generator, $p_z$ is the noise variable distribution, $G(z)$ is defined as a function mapping noise to the data space, and $D(x)$ represents the probability that the sample is real data rather than a generated sample. To distinguish between real data and generated samples, $D(x)$ should be maximized and $D(G(z))$ should be minimized. When $p_g = p_{data}$, the objective function attains its global optimal solution.
In the security field, some researchers have applied GAN in their work. Ring et al. used GAN to generate high-quality flow data [20]. Rigaki and Garcia successfully bypassed detection using malicious traffic generated by GAN [21]. Lee and Park proposed an intrusion detection method based on generative adversarial networks and random forests: they oversampled the intrusion detection dataset and then classified with a random forest, achieving better performance on the CICIDS-2017 dataset than the original random forest method. However, their work did not consider the instability of GAN or the high dimensionality of the data, and it was not validated on other datasets and models [22]. Yilmaz et al. proposed an intrusion detection method based on GAN and MLP [23]; they generated three types of attacks on the UGR'16 dataset to balance it, and the experimental results show that the GAN-balanced attack sample set produces more accurate results than the unbalanced attack sample set. Vu and Nguyen proposed a method based on the Auxiliary Classifier Generative Adversarial Network (ACGAN) to improve the balance of the dataset [24], which outperformed machine learning algorithms trained on the original dataset and other sampling techniques. Yin et al. proposed a GAN-based framework for intrusion detection in which the generative model produces complementary labeled samples for adversarial training, helping the classifier perform classification [25].
In this paper, we propose a new oversampling method, GAN-FS. Compared with existing work, we design GAN-FS to address both the quantity and the dimensionality of the data: we build an attack generation model based on WGAN-GP to generate higher-quality samples, and we introduce ANOVA feature selection to counter the data's high dimensionality and further reduce the classifier's learning difficulty. We also experimentally validate the method on several popular datasets and detection models and analyze its impact across different datasets and different machine learning detection models.
3. An Oversampling Technique GAN-FS
Data imbalance degrades the performance of machine learning-based IDSs, and because existing oversampling techniques do not analyze feature correlations, they cannot effectively generate high-dimensional network traffic. We therefore build an attack generation model based on WGAN-GP to generate higher-quality samples, and we introduce ANOVA feature selection to counter the data's high dimensionality and further reduce the classifier's learning difficulty.
Figure 1 illustrates the workflow of GAN-FS. There are five steps in the framework: Data Preprocessing, Data Partition, Rare Class Oversampling, Feature Selection, and Train & Test ML Model.
(i) Step 1: the dataset is preprocessed and then divided into a training set and a testing set.
(ii) Step 2: the training set is divided into rare class data and other class data by data partitioning.
(iii) Step 3: the GAN model uses the rare class data to generate samples.
(iv) Step 4: the oversampled data is combined with the other class data obtained in Step 2, and then feature selection is performed. The optimal feature subset and the corresponding new low-dimensional training set are obtained in this step.
(v) Step 5: finally, the new training set is used to train the machine learning (ML) model, and the testing set is used to test the model.

[Figure 1: the workflow of GAN-FS.]
3.1. Preprocessing
The dataset used in the context of intrusion detection contains different forms of features such as continuous, discrete, and symbolic with varying resolution and ranges. We need to process the data to make it suitable for our model. The preprocessing includes numeralization and normalization. We also need to partition the dataset if it does not provide a defined training and testing set.
3.1.1. Numeralization
Intrusion detection data usually contains nonnumeric features such as protocols and states. These nonnumeric features need to be converted to numeric features to suit our model. Nonnumeric features are mapped to integer values between 0 and S-1, where S is the number of symbols.
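As an illustration, this mapping can be implemented with scikit-learn's OrdinalEncoder, which assigns each symbol an integer code in [0, S-1]; the column names below are illustrative (NSL-KDD-style), not prescriptive:

```python
# Sketch of the numeralization step: symbolic features become integer codes.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "service": ["http", "domain_u", "http", "ftp"],
})
encoder = OrdinalEncoder()  # per column, assigns codes 0..S-1 in lexical order
df[["protocol_type", "service"]] = encoder.fit_transform(
    df[["protocol_type", "service"]])
print(df)  # symbolic columns are now numeric
```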
3.1.2. Normalization
Inconsistent feature scales across dimensions would affect the results of intrusion detection, so we normalize the data to eliminate scale effects between indicators. We scale all features to [0, 1] except the attack type label, using min-max normalization to scale the data values linearly:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$

where $x$ is the value before normalization, $x'$ is the value after normalization, $x_{\max}$ is the maximum value of the sample data, and $x_{\min}$ is the minimum value of the sample data.
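In practice this corresponds to scikit-learn's MinMaxScaler; a minimal sketch with an illustrative matrix:

```python
# Sketch of min-max scaling to [0, 1]; the label column is excluded as in the text.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[2.0, 500.0], [8.0, 100.0], [5.0, 300.0]])  # toy feature matrix
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)  # (x - x_min) / (x_max - x_min), per column
print(X_scaled)
# Note: fit the scaler on the training set only and reuse it on the test set.
```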
3.2. Data Partition and Rare Class Attack Oversampling
In intrusion detection, attack data constitutes the rare classes of the training set. We extract the rare class data from the training set and oversample it. Before generating samples, we train a generative model to capture the distribution of the attack data.
In GAN, the Jensen-Shannon divergence is used to measure the difference between two distributions, but it is informative only when the distributions overlap. When the discriminator is trained to optimality and the real sample distribution $p_r$ and the generated sample distribution $p_g$ do not overlap (or their overlap is negligible), the generator's gradient vanishes. Wasserstein GAN (WGAN) [26] effectively improves GAN by imposing a Lipschitz constraint and introducing the Wasserstein distance. However, WGAN's weight clipping still suffers from the vanishing gradient problem, and WGAN-GP [27] introduces a gradient penalty to solve it. The structure of WGAN-GP is the same as that of GAN, as shown in Figure 2.

[Figure 2: the structure of WGAN-GP.]
The WGAN-GP objective function is expressed as follows:

$$L = \mathbb{E}_{\tilde{x} \sim p_g}[D(\tilde{x})] - \mathbb{E}_{x \sim p_r}[D(x)] + \lambda\, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\big[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2\big],$$

where $p_r$ is the data distribution, $p_g$ is the model distribution implicitly defined by $\tilde{x} = G(z)$ with $z \sim p(z)$ (the input to the generator is sampled from some simple noise distribution, such as a uniform distribution or a spherical Gaussian), and $p_{\hat{x}}$ is defined by sampling uniformly along straight lines between pairs of points drawn from the data distribution $p_r$ and the generated distribution $p_g$. A penalty on the gradient norm is enforced for these random samples $\hat{x} \sim p_{\hat{x}}$. In this way, the generator and discriminator improve at comparable speeds, avoiding mode collapse and the pathological network weights caused by weight clipping, which improves WGAN's training to a certain extent.
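The gradient penalty term can be sketched in TensorFlow as follows; this is a minimal illustration rather than the authors' implementation, and the coefficient λ = 10 is the value proposed in the WGAN-GP paper, not a setting reported here:

```python
import tensorflow as tf

LAMBDA = 10.0  # gradient-penalty coefficient, as proposed in the WGAN-GP paper

def gradient_penalty(critic, real, fake):
    """Penalty on the critic's gradient norm at points sampled uniformly
    along straight lines between real and generated samples (p_x_hat)."""
    eps = tf.random.uniform([tf.shape(real)[0], 1], 0.0, 1.0)
    x_hat = eps * real + (1.0 - eps) * fake          # straight-line interpolation
    with tf.GradientTape() as tape:
        tape.watch(x_hat)
        scores = critic(x_hat, training=True)
    grads = tape.gradient(scores, x_hat)             # dD(x_hat)/dx_hat
    norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=1) + 1e-12)
    return tf.reduce_mean(tf.square(norm - 1.0))     # E[(||grad|| - 1)^2]
```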
During sample generation, noise and rare-class attack records are used to train WGAN-GP. Training begins by fixing the discriminator and training the generator to model the distribution of the real data. When the discriminator can no longer distinguish whether samples come from the real attack set or from the generator, the generator is fixed and the discriminator is trained. When the discriminator can again correctly distinguish the samples, it is fixed and the generator is trained once more. This alternating process is iterated, and finally the generator is used to produce attack samples, which are then added to the training set.
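The alternating procedure above can be sketched as a training step that reuses `gradient_penalty` and `LAMBDA` from the previous snippet. The noise dimension (100) and learning rate (0.0001) follow Section 4.3.2; the optimizer choice and the five critic updates per generator update are assumptions following common WGAN-GP practice:

```python
import tensorflow as tf

NOISE_DIM = 100                         # noise dimension from Section 4.3.2
N_CRITIC = 5                            # critic updates per generator update (assumed)
g_opt = tf.keras.optimizers.Adam(1e-4)  # 0.0001 learning rate from Section 4.3.2
d_opt = tf.keras.optimizers.Adam(1e-4)

def train_step(generator, critic, real_attacks):
    batch = tf.shape(real_attacks)[0]
    for _ in range(N_CRITIC):           # train the critic with the generator fixed
        z = tf.random.normal([batch, NOISE_DIM])
        with tf.GradientTape() as tape:
            fake = generator(z, training=True)
            d_loss = (tf.reduce_mean(critic(fake, training=True))
                      - tf.reduce_mean(critic(real_attacks, training=True))
                      + LAMBDA * gradient_penalty(critic, real_attacks, fake))
        d_opt.apply_gradients(zip(tape.gradient(d_loss, critic.trainable_variables),
                                  critic.trainable_variables))
    z = tf.random.normal([batch, NOISE_DIM])  # train the generator, critic fixed
    with tf.GradientTape() as tape:
        g_loss = -tf.reduce_mean(critic(generator(z, training=True), training=True))
    g_opt.apply_gradients(zip(tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return d_loss, g_loss
```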
3.3. Feature Selection
Feature selection is a data dimensionality reduction method often used to handle high-dimensional, complex data. It refers to the process of obtaining a subset of an original feature set according to a particular selection criterion, retaining the dataset's most valuable features and thereby reducing its dimensionality [28]. The filter method is a common feature selection approach in which selection is independent of any machine learning technique: features are selected based on their scores in various statistical tests and on indicators of relevance.
The ANOVA (analysis of variance) F-test is a commonly used feature selection method [29, 30]. It uses the F-test to determine whether the means of some groups differ, statistically testing whether the means are equal. More specifically, for each feature $f$, we assume that $f$ has the same mean value in the positive and negative category samples, i.e., $H_0: \mu_1 = \mu_2$, where $\mu_1$ denotes the mean of the feature's values over the positive category samples and $\mu_2$ the mean over the negative category samples; the F value is then calculated according to the following equation:

$$F = \frac{S_A / (k - 1)}{S_E / (n - k)},$$

where $S_A$ and $S_E$ denote the between-group and within-group deviations, respectively, $n$ is the total number of samples, and $k$ is the number of categories (here $k = 2$). The F value of each feature is calculated separately according to the above steps. Finally, the optimal subset is obtained by ranking the features according to their importance.
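This procedure corresponds to scikit-learn's f_classif scorer combined with SelectKBest; a minimal sketch with illustrative data and an illustrative choice of k:

```python
# Sketch of ANOVA F-test feature selection; f_classif computes the per-feature
# F value described above. The data and the choice k=20 are illustrative.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
X = rng.rand(500, 41)               # stand-in for a 41-feature training set
y = rng.randint(0, 2, size=500)     # binary labels: normal vs. attack

selector = SelectKBest(score_func=f_classif, k=20)
X_reduced = selector.fit_transform(X, y)      # keep the 20 highest-F features
ranking = np.argsort(selector.scores_)[::-1]  # features ranked by F value
print(X_reduced.shape, ranking[:5])
```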
3.4. Train and Test Machine Learning Model
An imbalanced dataset impairs the analysis capability of a machine learning-based intrusion detection model, biasing its classifications towards normal activity and causing high false alarm and missed alarm rates. We oversample the rare attack classes in the training set with WGAN-GP and then reduce the training set's dimensionality with ANOVA feature selection, finally obtaining a low-dimensional rebalanced training set. In this step, we use this rebalanced low-dimensional dataset to train the machine learning model. When training is complete, we test the model's performance using the testing set projected onto the selected feature subset.
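A minimal sketch of this step, using a random forest as the detector; the data is illustrative, standing in for the rebalanced training set and the original testing set:

```python
# Sketch of Step 5: train on the rebalanced, feature-selected training set and
# evaluate on the test set projected onto the same feature subset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import classification_report

rng = np.random.RandomState(0)
X_train, y_train = rng.rand(800, 41), rng.randint(0, 2, 800)  # rebalanced stand-in
X_test, y_test = rng.rand(200, 41), rng.randint(0, 2, 200)

selector = SelectKBest(f_classif, k=20).fit(X_train, y_train)
clf = RandomForestClassifier(random_state=0)
clf.fit(selector.transform(X_train), y_train)
y_pred = clf.predict(selector.transform(X_test))  # same feature subset at test time
print(classification_report(y_test, y_pred, digits=4))
```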
4. Evaluation
In this section, we systematically design and conduct a series of experiments and analyze the results.
4.1. Research Questions
(1) Q1: can our proposed method effectively improve the detection performance of machine learning models?
(2) Q2: is the combination of GAN and feature selection effective?
(3) Q3: is our proposed method better than other oversampling methods?

4.2. Datasets
NSL-KDD: the earliest IDS dataset was created by the Defense Advanced Research Projects Agency in 1998 and was named the DARPA 1998 dataset. Subsequently, the KDD99 dataset was derived from the DARPA 1998 dataset and has become one of the most widely used datasets [31]. The many duplicate instances in KDD99 can bias machine learning methods towards normal instances and affect their detection performance. Tavallaee et al. built the NSL-KDD dataset in 2009 from the KDD99 dataset, eliminating duplicate records to solve this problem [14]. The NSL-KDD training set consists of 125,973 records, and the testing set contains 22,544 records. The dataset includes 41 attributes and four types of attacks: DoS, Probe, R2L, and U2R. However, the number of attack instances is much lower than that of normal instances, with only 995 and 52 records for R2L and U2R attacks, respectively.

UNSW-NB15: in recent years, the Cyber Range Lab of the Australian Centre for Cyber Security created the UNSW-NB15 dataset [15]. It contains a variety of novel attacks and is therefore widely used for intrusion detection. Nine attack types simulate a real network environment: Fuzzers, Analysis, Backdoor, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. The dataset provides a training set of 82,332 records and a testing set of 175,341 records. The training set is unbalanced, with the number of normal records much higher than the number of attacks.

CICIDS-2017: this dataset was created in a 5-day simulation environment containing network traffic in packet-based and bidirectional flow formats. The authors extracted more than 80 attributes for each flow and provided additional metadata about IP addresses and attacks. Compared to NSL-KDD and UNSW-NB15, CICIDS-2017 covers a wide range of attack types, such as SSH brute force, Heartbleed, botnet, DoS, DDoS, web, and infiltration attacks. Moreover, with nearly three million records, CICIDS-2017 can evaluate IDS performance in large-scale scenarios.

We increase the number of attacks by oversampling the rare attack categories in each dataset. Tables 2–4 show the data distribution of the datasets before and after oversampling.
4.3. Experimental Settings
In this section, we describe the relevant experimental setup, including the selection of models, the setting of parameters, and the selection of evaluation metrics.
4.3.1. Machine Learning Model Selection
Naive Bayes (NB) is a classification technique based on Bayes' theorem, which assumes that predictors are independent of each other [32]. Simply put, the Naive Bayes classifier assumes that a feature in a category is independent of the presence of other features. Naive Bayes models are easy to build and, despite their simplicity, can outperform even highly complex classification methods.

Decision Tree (DT) is widely used in intrusion detection [33, 34]. A decision tree is a tree-like structure whose leaves represent classifications and whose branches represent the combinations of features that lead to those classifications. An instance is classified by testing its feature values at each internal node of the tree until a leaf is reached. Moreover, in the work of Mishra et al., decision trees are the best-performing single classifier [35].

Random Forest (RF) is a commonly used ensemble algorithm [36]. An ensemble of classifiers generalizes more robustly than a single base learner and is also widely used in intrusion detection [37–39]. RF combines multiple weak classifiers, obtaining the final result by voting or averaging, which gives the overall model high accuracy and generalization.

Gradient Boosting Decision Tree (GBDT) is a robust ensemble learning algorithm that extends and augments a classification-and-regression-tree model based on gradient boosting [40]. GBDT constructs decision trees iteratively; in each iteration, a tree is trained on the residuals of the previous trees, and the final result is accumulated from the predictions of all trees.

Support Vector Machine (SVM) is also one of the most widely used machine learning algorithms [31]. SVM is a supervised learning method [41] that performs classification by constructing an N-dimensional hyperplane that optimally separates the data into different classes.

K-Nearest Neighbors (K-NN) is a theoretically mature, low-complexity data mining algorithm [42]. The basic idea is that, in the sample space, if most of a sample's nearest neighbors belong to a class, then that sample belongs to the same class.

Artificial Neural Network (ANN) is a form of distributed computing inspired by biology. It has strong self-learning capability and suits nonlinear problems, so it is also commonly used in intrusion detection [43].

4.3.2. Model Settings
WGAN-GP is implemented with the deep learning framework Keras. Both the generator and the discriminator are feedforward neural networks, and both use a learning rate of 0.0001. The dimension of the noise vector is 100. Furthermore, the weight clipping threshold for discriminator training is set to 0.01. Root Mean Square Error (RMSE) characterizes the degree of fit between the generated samples and the real samples. As shown in Figure 3, the model is close to convergence after 150 training epochs, so the training epoch count for WGAN-GP is set to 150.

[Figure 3: RMSE between generated and real samples during WGAN-GP training.]
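For concreteness, a sketch of such feedforward networks in Keras is given below. Only the 100-dimensional noise vector and the 0.0001 learning rate come from the text; the hidden layer widths, activations, and FEATURE_DIM are assumptions:

```python
# Sketch of feedforward generator/discriminator networks in Keras.
# Hidden widths and activations are assumed, not reported settings.
from tensorflow import keras
from tensorflow.keras import layers

NOISE_DIM, FEATURE_DIM = 100, 41     # FEATURE_DIM depends on the dataset

generator = keras.Sequential([
    keras.Input(shape=(NOISE_DIM,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(FEATURE_DIM, activation="sigmoid"),  # features lie in [0, 1]
])
critic = keras.Sequential([
    keras.Input(shape=(FEATURE_DIM,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(1),   # WGAN critic outputs an unbounded score (no sigmoid)
])
```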
Scikit-learn is used to implement the feature selection process and to construct the machine learning models. After feature selection, the feature subsets of the NSL-KDD and UNSW-NB15 datasets are shown in Table 5; for a detailed explanation of the features, see [14, 15].
4.3.3. Evaluation Metrics
IDSs are a vital tool for ensuring network security: they must not only identify attacks accurately but also avoid false alarms. We therefore use Recall and Precision as metrics. We also need to consider overall correctness, so we use Accuracy as a metric. Besides, we introduce the F-measure to fully evaluate the detection performance of machine learning models. After classification, the predictions fall into four categories: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). The metrics are calculated as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},\quad \text{Precision} = \frac{TP}{TP + FP},$$
$$\text{Recall} = \frac{TP}{TP + FN},\quad \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$
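These metrics correspond directly to scikit-learn's scoring functions; a small illustrative check:

```python
# The four metrics via scikit-learn; y_true/y_pred are illustrative only.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = attack, 0 = normal
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(accuracy_score(y_true, y_pred),    # (TP + TN) / (TP + TN + FP + FN)
      precision_score(y_true, y_pred),   # TP / (TP + FP)
      recall_score(y_true, y_pred),      # TP / (TP + FN)
      f1_score(y_true, y_pred))          # 2PR / (P + R)
```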
4.4. Experiments and Results
To answer the three questions mentioned earlier, we design three separate sets of experiments and analyze the results.
4.4.1. Experiment I
To investigate GAN-FS's effectiveness (Q1), we train machine learning models using the original training set and the training set oversampled by GAN-FS, respectively. The experiments are arranged as follows:
(1) Train the machine learning models using the dataset that is not oversampled.
(2) Train the machine learning models using the dataset oversampled by GAN-FS.
4.4.2. Results
The comparison results before and after oversampling with GAN-FS are shown in Figures 4–6. The results show that the performance of the classifiers improves to different degrees after oversampling with our proposed method.

[Figure 4: performance on NSL-KDD before and after GAN-FS oversampling: (a) Accuracy, (b) Precision, (c) Recall, (d) F-measure.]
[Figure 5: performance on UNSW-NB15 before and after GAN-FS oversampling: (a) Accuracy, (b) Precision, (c) Recall, (d) F-measure.]
[Figure 6: performance on CICIDS-2017 before and after GAN-FS oversampling: (a) Accuracy, (b) Precision, (c) Recall, (d) F-measure.]
As shown in Figure 4(a), the Accuracy of each detector improves on the NSL-KDD dataset, indicating that our method can improve the overall performance of the detectors. Among individual detectors, GBDT shows the best detection performance; its Accuracy improves by about 6% to 83.28%. Figure 4(c) shows that the Recall of all detectors also improves to varying degrees, which indicates that the generated samples increase the diversity of attack samples and thus enhance the generalization of what the detectors learn. The improved Recall means the classifier can identify attacks more accurately and reduce the attack miss rate, which is of great practical importance in intrusion detection. Figure 4(b) shows the Precision of the detectors: the Precision of individual classifiers declines because, while the detector's generalization ability improves after adding the generated samples, it misclassifies some normal samples as attacks. Since different detectors have different learning abilities, the degree of impact varies. Finally, the F-measure of the detectors shows that their overall performance is effectively improved. The same conclusions can be drawn from the experiments on the UNSW-NB15 and CICIDS-2017 datasets: Figures 5(a) and 6(a) show that our approach effectively improves the Accuracy of the classifiers and improves their Recall even more significantly, and ultimately the F-measure of each classifier improves to varying degrees. Notably, in the CICIDS-2017 experiments, the F-measure of GBDT without oversampling is already as high as 99.36%; in this case, our method cannot improve the classifier further, but its negative impact is also tiny.
In addition, analyzing the performance of individual classifiers, RF and GBDT achieve quite good performance after oversampling. GBDT shows the best performance on both the NSL-KDD and UNSW-NB15 datasets and strong performance on the CICIDS-2017 dataset, while RF performs best on CICIDS-2017 with a 99.6% F-measure. This demonstrates the powerful generalization ability of RF and GBDT as ensemble learners and highlights the effectiveness of our proposed oversampling method.
4.4.3. Experiment II
GAN-FS combines GAN-based oversampling with feature selection, and this experiment verifies the effectiveness of that combination. We process the training set using GAN and feature selection separately, evaluate the results with machine learning models, and compare them with our method. The experiments are arranged as follows:
(1) Oversampling the dataset using WGAN-GP only (w/o ANOVA)
(2) Reducing the dimensionality of the dataset using ANOVA only (w/o WGAN-GP)
(3) Oversampling the dataset using GAN-FS
4.4.4. Results
Figures 7–9 show the results of using WGAN-GP alone, ANOVA alone, and GAN-FS. The experimental results show that the combination of WGAN-GP and ANOVA is effective. GAN and feature selection can each improve the detection performance of machine learning models to some extent. However, the comparison in Figure 7 shows that the overall performance of oversampling with WGAN-GP alone or reducing dimensionality with ANOVA alone is lower than that of GAN-FS, and the comparison in Figure 8 leads to the same conclusion. This is because WGAN-GP learns the distribution of the attack samples and generates new ones: the generated samples increase the diversity of attack samples, strengthen the detector's learning of attack features, and improve detection performance. Meanwhile, the high dimensionality of the original data also affects the detector's analytical performance, and feature selection removes features that are unimportant for learning, further improving the classifier. However, when detector performance is already high, the gain from feature selection is subtle: as shown in Figure 9, the lead of our method is smaller, mainly because the detector already achieves very high performance without feature selection. In that case, feature selection disturbs the original combination of features and does not yield a significant further improvement. Taken together, the combination of WGAN-GP and ANOVA, i.e., GAN-FS, is effective.

[Figure 7: comparison of WGAN-GP only, ANOVA only, and GAN-FS on NSL-KDD: panels (a) and (b).]
[Figure 8: comparison of WGAN-GP only, ANOVA only, and GAN-FS on UNSW-NB15: panels (a) and (b).]
[Figure 9: comparison of WGAN-GP only, ANOVA only, and GAN-FS on CICIDS-2017: panels (a) and (b).]
4.4.5. Experiment III
As discussed in Section 2, SMOTE [3] and ADASYN [4] are two classical oversampling methods, while K-means SMOTE [19] and G-SMOTE [18] are two newer methods proposed in recent years. We compare against these four methods, as well as against the GAN-based methods of Vu and Nguyen [24] and Lee and Park [22]. The F-measure is an overall evaluation of Precision and Recall, which we use to measure the methods' performance. The experiments are arranged as follows:
(1) Oversampling the dataset using each baseline separately
(2) Oversampling the dataset using GAN-FS
4.4.6. Results
The experimental results compared with the baselines are shown in Tables 6–8. From these tables, we can see that the performance of the detectors trained with our proposed method tends to be higher than that of the baselines. Looking at individual detectors, our method performs better on DT, RF, GBDT, and ANN, which relates to each detector's underlying principle. The tree model uses information entropy as its metric to construct a tree along which entropy decreases fastest. Our method generates samples that follow the original distribution, which increases sample diversity, and removes unnecessary features through feature selection; after oversampling, the decision tree can therefore separate the samples better. Compared with DT, RF and GBDT are ensemble learning algorithms with stronger learning capability, which further improves classification performance; the experimental results also show that RF and GBDT generally outperform DT. As a neural network, ANN likewise has high learning ability, and its performance improves with the increased number of attack samples.
However, due to different algorithm principles, there are differences in the performance of the classifiers after oversampling. For the K-NN model, we find that the advantage of our method is not very prominent. This is because traditional oversampling methods such as SMOTE are based on K-NN to generate samples. Generating samples based on this method and then training the K-NN algorithm can improve the classifier’s performance. In addition, although our method improves the performance of the SVM, in the experimental results of NSL-KDD (Table 6), our method is not optimal. This is because SVM solves the maximum partition hyperplane using support vectors. When the original SVM classification is not optimal, generating samples in a safe geometric space (G-SMOTE) will optimize the classification hyperplane more effectively.
Finally, as shown in Figure 10, we count the number of times each method achieves the optimal performance. The detectors trained with our method achieve the optimal Accuracy 12 times and the optimal F-measure 14 times, far more than the other methods. These results show that our method can better improve the detection performance of the intrusion detection model.

[Figure 10: number of times each method achieves the best performance: (a) Accuracy, (b) F-measure.]
5. Conclusion and Future Work
In this paper, we take the perspective of the imbalance and high dimensionality of intrusion detection datasets and propose an oversampling intrusion detection technique based on GAN and feature selection. On the one hand, our approach oversamples the rare classes of attack samples in order to improve the effectiveness of intrusion detection; on the other hand, we concentrate on only the imperative features of attack samples using the ANOVA feature selection method. The resulting low-dimensional rebalanced dataset is then used to train intrusion detection classifiers. Experimental results show that our approach improves the detection performance of intrusion detection models and outperforms other baselines.
As for future work, we first plan to explore the conjunction between our approach and deep learning. In addition, we will try to assign different weights to features to better reflect the significance of each feature in classification.
Data Availability
We use public datasets, which are deposited in public repositories NSL-KDD (https://www.unb.ca/cic/datasets/nsl.html), UNSW-NB15 (https://cloudstor.aarnet.edu.au/plus/index.php/s/2DhnLGDdEECo4ys?path=%2FUNSW-NB15%20-%20CSV%20Files), and CICIDS-2017 (https://www.unb.ca/cic/datasets/ids-2017.html).
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (nos. 61902010 and 61671030), CCF-NSFocus Funding (no. 2020003), Beijing Excellent Talent Funding-Youth Project (no. 2018000020124G039), and the Project of Beijing Municipal Education Commission (no. KM202110005025).