Abstract

With the rise in cyber threats in recent years, demand for network security protection measures has grown in many forms. Network traffic classification technology is used to adapt to this dynamic threat environment. However, network traffic has a naturally unbalanced class distribution, and traditional detection models built on a single algorithm suffer from low accuracy and high false-positive rates. To address these two problems, this paper proposes a new dataset balancing method named SD sampling, based on the SMOTE algorithm. Unlike SMOTE, this method divides samples into two types, easy to classify and difficult to classify, and balances only the difficult-to-classify samples, which not only overcomes SMOTE's overgeneralization but also combines the ideas of oversampling and undersampling. In addition, a two-layer structure combining XGBoost and the random forest is proposed for the multiclass classification of anomalous traffic, since a hierarchical structure can better classify minority abnormal traffic. We conduct experiments on the CICIDS2017 dataset. The results show that the classification accuracy of the proposed model exceeds 99.70% and its false-positive rate is below 0.34%, indicating that the proposed model outperforms traditional models.

1. Introduction

1.1. Background

In recent years, with the rapid adoption of computer network applications across many fields, network threats have become increasingly serious. Many mechanisms, such as firewalls, antivirus software, antimalware, and spam filters, are used to protect network security. Network traffic classification is another effective and powerful network security technique. However, today's cyberattacks are systematic and long-term, and the traffic data in the network are so large and complex that they are difficult to analyze and detect. Machine learning (ML) [1] methods are therefore widely used for network traffic detection.

Machine learning can identify abnormal traffic by learning features in a large amount of data. It can be divided into supervised learning and unsupervised learning. Supervised learning refers to learning labeled training data to discover relationships between input and output data for prediction and classification, including deep neural network (DNN) [2], decision tree (DT) [3], support vector machine (SVM) [4], K nearest neighbor (KNN) [5], and Gaussian naive Bayes (Gaussian NB) [6]. Unsupervised learning refers to learning and summarizing patterns and structures of unlabeled training data for prediction and classification, including principal component analysis (PCA) [7] and K-means clustering [8].

1.2. Problem Statement

Network traffic has a naturally unbalanced class distribution; for example, there is far more normal traffic than abnormal traffic in the network. The traditional way to address an unbalanced dataset is to resample it. SMOTE [9] is a widely used method that samples data based on the spatial distribution of samples; its process is simple, but it has several disadvantages, such as overgeneralization. To address these disadvantages, this paper proposes a new sampling algorithm named SD sampling, which yields a dataset that is easier to classify.

Traditional machine-learning-based network intrusion detection usually classifies traffic with a single algorithm, which causes several problems. First, single machine learning algorithms have inherent limitations, such as being prone to overfitting or underfitting and struggling with multiclass problems, which lead to a low detection rate. In addition, some single algorithms achieve a high detection rate only on datasets with a specific data distribution, resulting in poor generalization ability. Therefore, this paper proposes a new detection structure containing two ensemble models, which ensures a high detection rate and improves the generalization ability of the model.

1.3. Key Contributions and Paper Organization

In summary, the paper's main contributions are as follows:
(1) We propose a new sampling algorithm named SD sampling. It combines oversampling and undersampling and considers the spatial distribution of samples during sampling, which overcomes the overgeneralization problem of the SMOTE algorithm to some extent.
(2) We propose a two-layer structure combining XGBoost [10] and the random forest [11] to realize multiclass traffic classification, which improves the detection rate and generalization ability of the model.
(3) We evaluate the performance of the SD sampling algorithm and the proposed classification model on the CICIDS2017 dataset [12] and verify their advantages against other sampling modes and classification models.

The remainder of the paper is organized as follows: Section 2 introduces the research of ML-based intrusion detection and unbalanced datasets. Section 3 introduces the framework of the proposed model and details each module, including the workflow of the SD sampling algorithm and the two-layer structure combined with XGBoost and the random forest. In Section 4, we evaluate the performance of the SD sampling algorithm and the proposed detection model using the CICIDS2017 dataset. Finally, Section 5 summarizes and discusses future directions.

2. Related Work

2.1. Improved Machine Learning Algorithms for Network Traffic Classification

Machine learning is widely used in network intrusion detection. However, traditional machine learning models suffer from low accuracy, which can be addressed by the following methods: ensemble learning, model optimization, and sample optimization. For the reader's convenience, Table 1 explains the acronyms used in this section.

In terms of ensemble learning, Gao et al. [13] proposed a new ensemble voting method based on classifier resolution. Each base classifier makes a prediction, and the probability of correctly classifying samples into each class is calculated and recorded as that classifier's class weights. When voting, the weights of base classifiers that agree on a class are summed, the per-class weight sums are compared, and the class with the highest weight is taken as the final result. Xia and Sun [14] proposed an ensemble learning scheme using the isolation forest (IForest) [15], the local outlier factor (LOF), and K-means clustering. The base classifiers are IForest and LOF, which complement each other in detecting global and local outliers. For the K-means initial cluster centers, the normal points detected by IForest and LOF are selected, which avoids the poor clustering that results when the initial centers contain outliers. The experimental results show that the accuracy of the model improves significantly. Ling et al. [16] proposed a multiclassifier ensemble algorithm based on probability-weighted voting to improve model accuracy. Xu et al. [17] proposed a weighted majority algorithm based on the random forest to improve its performance; because the model is trained on nontraffic datasets, it can detect unknown traffic types. Ren et al. [18] proposed category detection and a partition technique to improve the random forest's detection accuracy on minority attacks (Probe, U2R, R2L). Data and Aritsugi [19] proposed an incremental learning framework that avoids the problem of concept drift in traffic. Aceto et al. [20] proposed an encrypted traffic classification framework based on hard/soft combiners, which takes existing high-performance traffic classification models as base classifiers and considers their training requirements and learning philosophies to improve classification performance. Possebon et al. [21] investigated and evaluated a wide range of metalearning techniques, including voting, stacking, bagging, and boosting.

In terms of model optimization, Yang and Zhang [22] proposed a multigranularity cascade algorithm based on the traditional isolation forest model. The traditional isolation forest has some problems, such as failing to detect local outliers that lie parallel to the axes and lacking sensitivity and stability for high-dimensional outliers. To solve these problems, an isolation mechanism based on a random hyperplane is proposed: the random hyperplane simplifies the isolation boundary of the data model by using a linear combination of multiple dimensions, and the isolation boundary of the random linear classifier can detect more complex data patterns. Experiments show that the improved isolation forest algorithm is more robust to complex anomalous data patterns. Qiu et al. [23] used an LSTM model with a sliding window to avoid concept drift in streaming data. Aceto et al. [24] proposed MIMETIC, a novel multimodal framework for encrypted traffic classification, which fully exploits the heterogeneity of traffic data by learning intramodal and intermodal dependencies and overcomes the performance limitations of single-modal data.

In terms of sample optimization, Gu and Lu [25] proposed transforming the data features using naive Bayes feature embedding: kernel density estimates are computed over the dataset, the marginal density ratio of each feature of a sample is calculated using the naive Bayes principle, and these ratios are then used as new features, which makes the dataset easier to classify. Ren et al. [18] proposed a KNN-based outlier detection algorithm that removes some outliers to help the model classify traffic more easily.

We summarize all of the above work of improved machine learning algorithms, as shown in Table 2.

2.2. Balanced Dataset

There are two approaches to solving the unbalanced dataset problem: the first works from the perspective of the data, and the second from the perspective of the algorithm.

In terms of data, Zhang et al. [26] proposed a method that generates samples with a variational autoencoder to balance the dataset; the core idea is to expand only the boundary samples, which are most likely to confuse the learner, when expanding minority classes. Liu et al. [27] used WGAN-GP, an improved generative adversarial network, to generate minority samples and balance the dataset. Yan and Han [28] improved the SMOTE algorithm by dividing samples into three types of neighborhoods, according to the number of majority samples around each sample, before generating new samples. Seo and Kim [29] proposed a support vector machine model that predicts the optimal sampling rate of the SMOTE algorithm to obtain an optimally sampled dataset. Wang and Sun [30] put forward a new sampling method that accounts for class overlap and data distribution, which traditional oversampling methods ignore; compared with the traditional SMOTE algorithm, the AUC on four datasets increases by 1.6% on average. Liu et al. [31] proposed a technique that samples according to how difficult each sample is to classify. Park [32] proposed a method combining TGAN with slow start to generate samples and prevent the overfitting caused by oversampling. Wang et al. [33] proposed a GAN-based encrypted traffic generation method to generate minority class samples and balance the dataset.

In terms of algorithms, Gupta et al. [34] proposed weighting samples and using a cost-sensitive DNN for classification to reduce the impact of the unbalanced dataset. Sharma et al. [35] proposed a weighted extreme learning machine that weights each classifier to alleviate dataset imbalance. Li et al. [36] proposed an HM-loss cost that pays more attention to misclassified minority-class samples when computing the loss. Hu et al. [37] proposed a deep-learning-based batch balancing method that keeps the number of samples of each class equal within each batch, thereby balancing the dataset.

We summarize all of the above work of balanced datasets, as shown in Table 3.

3. Proposed Model

In this section, we introduce the proposed framework and its workflow. The framework covers the overall process of network traffic detection and achieves a high detection rate for various kinds of traffic. As shown in Figure 1, the framework consists of four main modules:
(1) Preprocessing module: raw data are preprocessed to make them easier to understand and process.
(2) Data sampling module: the number of samples in each class is counted, and the data are balanced using the proposed SD sampling algorithm.
(3) Feature selection module: high-weight features are selected from all features using the random forest, which facilitates training.
(4) Classification module: a two-layer structure combining XGBoost and the random forest is used to classify traffic.

3.1. Preprocessing Module

This module focuses on preprocessing raw data. First, missing and duplicate values are deleted from the data. Second, the data are undersampled to obtain a portion of the data for model training. Then, the dataset is divided into the train set and the test set. Finally, the train set and the test set are normalized and one-hot encoded.
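As an illustration, the following is a minimal sketch of this pipeline in Python, assuming pandas and scikit-learn; the file name and the "Label" column are placeholders rather than the authors' actual code, and the remaining feature columns are assumed numeric.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Load flow records, then remove missing and duplicate rows.
df = pd.read_csv("cicids2017_flows.csv").dropna().drop_duplicates()

# (Undersampling to a manageable subset would happen here; see Section 4.2.)

X = df.drop(columns=["Label"]).to_numpy(dtype=float)
y = LabelEncoder().fit_transform(df["Label"])  # integer-encode class labels

# 3:1 train/test split, as used in Section 4.2.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Fit normalization on the train set only, then apply it to both sets.
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)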

3.2. Data Sampling Module

This module balances classes using a sampling algorithm. SMOTE is a commonly used class balancing algorithm, but it has some limitations, so this paper proposes a new sampling algorithm, SD sampling, based on SMOTE, which mitigates SMOTE's defects to some extent and gives better results. The specifics of the SMOTE algorithm and the SD sampling algorithm are introduced in the following sections.

3.2.1. SMOTE: Synthetic Minority Oversampling Technique

SMOTE is an oversampling algorithm, which can automatically calculate the ratio of the majority class sample to the minority class sample and oversample the samples of the minority class based on the distance metric. Its sampling strategy is to choose one sample randomly among k nearest neighbors of the sample of each minority class and then select a random point on the line between these two samples as the newly synthesized minority class sample.

The specific process is as follows:
(1) For each sample $x_i$ in a minority class, we calculate its Euclidean distance to all other samples in that class and obtain the $k$ nearest neighbors of $x_i$.
(2) We select a sample randomly from the $k$ nearest neighbors and denote it as $\hat{x}_i$.
(3) We generate the new sample $x_{\text{new}}$ as

$$x_{\text{new}} = x_i + \operatorname{rand}(0, 1) \times (\hat{x}_i - x_i),$$

where $\operatorname{rand}(0, 1)$ generates a random number between $0$ and $1$.
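This synthesis step can be sketched in a few lines of Python with NumPy and scikit-learn; the sketch below is a didactic illustration under our own naming, not a reference implementation (libraries such as imbalanced-learn provide tested versions).

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_synthesize(X_min, n_new, k=5, seed=0):
    # X_min holds only minority-class samples; generate n_new synthetic ones.
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each sample is its own nearest neighbor.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)         # random x_i
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]  # random neighbor of x_i
    gap = rng.random((n_new, 1))                           # rand(0, 1) per sample
    return X_min[base] + gap * (X_min[neigh] - X_min[base])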

The SMOTE algorithm is concise and widely used, but it has several disadvantages:
(1) It is prone to overgeneralization. Minority classes are sparser than majority classes, so there is a high chance of class mixing when sampling a minority class, which blurs the boundary between classes and increases the difficulty of classification.
(2) It simply oversamples minority class samples indiscriminately, without considering their spatial distribution, so the sampling is not targeted.
(3) It can only oversample minority class samples and performs no processing on majority class samples.

3.2.2. SD Sampling

Given the above limitations of the SMOTE algorithm, this paper proposes a new sampling algorithm called SD sampling. SD sampling builds on the method in [31]. In that method, the criterion for selecting the difficult-to-classify dataset is too strict, and the method for sampling minority classes is too simple. In response, we introduce the concept of instance hardness (IH), which makes the selection criterion more flexible, and use the SMOTE algorithm to oversample minority classes. In addition, the SD sampling algorithm combines the ideas of oversampling and undersampling and takes into account the spatial distribution of each sample, so it also mitigates the disadvantages of SMOTE to some extent.

The SD sampling algorithm starts from the fact that the SMOTE algorithm cannot handle the data distribution of an unbalanced dataset. As shown in Figure 2, samples are first divided into two parts according to their spatial distribution: an easily classified dataset, denoted SE, and a difficult-to-classify dataset, denoted SD. We propose instance hardness (IH) as the criterion for judging whether a sample $x_i$ is easy to classify: the higher the IH, the harder $x_i$ is to classify. IH is calculated as

$$\mathrm{IH}(x_i) = \frac{1}{k} \sum_{j=1}^{k} \mathbb{I}\left(y_j \neq y_i\right),$$

where $y_i$ represents the label of $x_i$, $x_j$ represents the $j$th nearest neighbor of $x_i$ (with label $y_j$), $k$ represents the number of neighbors of the sample $x_i$, and $\mathbb{I}(\cdot)$ is the indicator function. For example, if three of a sample's five nearest neighbors carry a different label, its IH is $0.6$.

Keeping SE unchanged, for SD, since most samples in the majority class are redundant, K-means clustering is performed on the majority class samples, and the cluster centroids are used to replace the samples in each cluster. In addition, SMOTE oversampling is performed on the minority class samples in SD. The SD sampling algorithm is written as Algorithm 1.

Input: Imbalanced train set T, scaling factor α, instance hardness threshold H, and sample threshold UB
Output: New train set T′
(1) Step 1: distinguish the easy set from the difficult set. for each sample x_i ∈ T do
(2)  Compute its k nearest neighbors and IH(x_i); if IH(x_i) ≥ H then
(3)   Put the sample x_i into the difficult set S_D
(4)  end
(5) end
(6) Difficult set S_D and easy set S_E = T ∖ S_D
(7) Step 2: compress the majority samples in the difficult set to cluster centroids
(8) Take all the majority samples from S_D and denote them S_D^maj
(9) Run the K-means algorithm on S_D^maj, with the number of clusters determined by the scaling factor α
(10) Use the coordinates of the cluster centroids to replace the majority samples in S_D^maj
(11) Compressed majority sample set S_C
(12) Step 3: sample the minority samples in the difficult set using the SMOTE algorithm
(13) Take all the minority samples from S_D and denote them S_D^min
(14) for each minority class in S_D^min do
(15)  Apply SMOTE sampling with the sampling threshold set to UB
(16)  Put the new samples into S_O
(17) end
(18) Step 4: merge the sample sets
(19) New train set T′ = S_E ∪ S_C ∪ S_O
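To make the workflow concrete, a minimal Python sketch of SD sampling follows, under stated assumptions: it uses scikit-learn and imbalanced-learn, interprets the scaling factor as a fixed K-means cluster count, assumes integer-encoded labels with each hard-set minority class holding more than k samples, and uses parameter defaults that are illustrative, not the paper's tuned settings.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors
from imblearn.over_sampling import SMOTE

def sd_sampling(X, y, k=5, ih_threshold=0.5, n_clusters=1000, ub=5000):
    # Step 1: instance hardness = share of k nearest neighbors whose label
    # differs from the sample's own (column 0 is the sample itself, skip it).
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    ih = (y[idx[:, 1:]] != y[:, None]).mean(axis=1)
    hard = ih >= ih_threshold
    X_easy, y_easy = X[~hard], y[~hard]
    X_hard, y_hard = X[hard], y[hard]

    # Step 2: replace the hard-set majority samples with K-means centroids.
    majority = np.bincount(y).argmax()
    maj = y_hard == majority
    km = KMeans(n_clusters=min(n_clusters, int(maj.sum())), n_init=10)
    centroids = km.fit(X_hard[maj]).cluster_centers_

    # Step 3: SMOTE-oversample each hard-set minority class up to ub samples.
    targets = {c: max(int((y_hard == c).sum()), ub)
               for c in np.unique(y_hard) if c != majority}
    X_res, y_res = SMOTE(sampling_strategy=targets,
                         k_neighbors=k).fit_resample(X_hard, y_hard)
    keep = y_res != majority  # drop majority originals replaced by centroids

    # Step 4: merge easy set, compressed majority, and oversampled minority.
    X_new = np.vstack([X_easy, centroids, X_res[keep]])
    y_new = np.concatenate([y_easy, np.full(len(centroids), majority), y_res[keep]])
    return X_new, y_new
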
3.3. Feature Selection Module

This module selects features. Machine learning algorithms struggle to learn from high-dimensional data. Feature selection is a useful way to address this problem: it selects features with high weights for training in advance, which improves performance and saves computational resources at the same time.

This module uses a random-forest-based feature selection method, which evaluates the weight of each feature with the Gini index. The random forest is composed of many CART trees [38], and its final classification result is decided by these CART trees through voting. The Gini index describes purity: the smaller the value, the higher the purity. In a CART tree, the Gini index therefore measures the change in node purity before and after splitting on a feature, and the smaller the value, the better the feature. For a sample set $S$ with $C$ classes, where the $c$th class contains $|S_c|$ samples and $|S|$ is the size of $S$, the Gini index of the sample set is

$$\operatorname{Gini}(S) = 1 - \sum_{c=1}^{C} \left( \frac{|S_c|}{|S|} \right)^2.$$

For a CART tree, let the number of samples in node $q$ be $n$ and the number of samples of class $c$ in that node be $n_c$. The Gini index of node $q$ of the $i$th tree is

$$\operatorname{Gini}_q^{(i)} = 1 - \sum_{c=1}^{C} p_c^2,$$

where $p_c$ is the proportion of class $c$ in node $q$, calculated as $p_c = n_c / n$.

For feature $X_j$, the importance of the feature at node $q$ of the $i$th tree, that is, the change in the Gini index before and after splitting node $q$, is

$$\mathrm{VIM}_{jq}^{(i)} = \operatorname{Gini}_q^{(i)} - \operatorname{Gini}_l^{(i)} - \operatorname{Gini}_r^{(i)},$$

where $\operatorname{Gini}_l^{(i)}$ and $\operatorname{Gini}_r^{(i)}$ denote the Gini indices of the two new nodes after branching.

Let $Q_i$ be the set of nodes in which feature $X_j$ appears in the $i$th tree. The importance of the feature in that tree is

$$\mathrm{VIM}_j^{(i)} = \sum_{q \in Q_i} \mathrm{VIM}_{jq}^{(i)}.$$

With $m$ trees in the random forest, the importance of feature $X_j$ is

$$\mathrm{VIM}_j = \sum_{i=1}^{m} \mathrm{VIM}_j^{(i)}.$$

Finally, all the obtained importance scores are normalized as

$$\mathrm{VIM}_j \leftarrow \frac{\mathrm{VIM}_j}{\sum_{j'=1}^{J} \mathrm{VIM}_{j'}},$$

where $J$ is the number of features. The top $N$ features with the highest weight are then selected for classification.
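This normalized Gini importance is what scikit-learn exposes as feature_importances_; the hedged sketch below shows the selection step, reusing X_train/X_test from the earlier preprocessing sketch and the N = 50 found in Section 4.4.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Rank features by normalized Gini importance and keep the top N.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
top_n = 50  # optimal N from the parameter study in Section 4.4
top_idx = np.argsort(rf.feature_importances_)[::-1][:top_n]
X_train_sel = X_train[:, top_idx]
X_test_sel = X_test[:, top_idx]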

3.4. Classification Module

This module classifies traffic. We propose a two-layer structure combining XGBoost and the random forest: the first layer uses the XGBoost model to distinguish normal from abnormal samples in the dataset, and the second layer uses the random forest model to determine the attack type of each abnormal sample.

XGBoost is a boosted tree model that combines many tree models into a powerful ensemble classifier. The idea of XGBoost is to train K trees, with the final prediction being the sum of the predicted values of those K trees. It improves on the gradient boosting algorithm and can achieve high accuracy in a short time.

The random forest is an ensemble model that uses many decision trees to classify samples and votes on the final result. Its randomness lies in bootstrap sampling of the data (random sampling with replacement) and random feature selection, which leads to faster training and higher accuracy.

The two-layer structure combining XGBoost and the random forest multiclassifies traffic hierarchically; the hierarchical approach draws on two works [39, 40] on traffic multiclassification. Unlike both of them, however, this structure identifies abnormal traffic types rather than application traffic types. Its workflow is shown in Figure 3 and consists of three main steps:
(1) Dataset construction: the train set and the test set are each replicated into two copies, and the labels are recoded. In the first copy, denoted Train1 and Test1, all normal samples are labeled 0 and all abnormal samples are labeled 1. In the second copy, denoted Train2 and Test2, all normal samples are labeled 0 and abnormal samples are sequentially coded 1, 2, …, M according to their categories, where M is the number of abnormal categories.
(2) Model training: the XGBoost model is trained on Train1 to obtain a binary classifier, denoted C1, and the random forest model is trained on Train2 to obtain a multiclass classifier that distinguishes the types of abnormal traffic, denoted C2.
(3) Classification: the trained classifier C1 classifies Test1; the samples classified as normal are noted N and those classified as abnormal are noted A.

We then select the records corresponding to A from Test2, and the trained classifier C2 classifies them: the samples predicted as normal are added to N, and those predicted as abnormal are grouped as A1, A2, …, AM by class.

In this way, the result of multiclassification of traffic is obtained.
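A compact sketch of this two-layer flow follows, assuming the xgboost and scikit-learn packages, the X_train_sel/X_test_sel arrays from the feature selection sketch, and placeholder hyperparameters (the paper's tuned values are in Table 8).

import numpy as np
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

# y_train is coded 0 = normal, 1..M = attack category (the Train2 coding);
# collapsing it to 0/1 reproduces the Train1 coding.
y_bin = (y_train > 0).astype(int)

c1 = XGBClassifier(n_estimators=200).fit(X_train_sel, y_bin)             # layer 1
c2 = RandomForestClassifier(n_estimators=100).fit(X_train_sel, y_train)  # layer 2

pred = np.zeros(len(X_test_sel), dtype=int)      # default: normal (set N)
flagged = c1.predict(X_test_sel) == 1            # set A from layer 1
pred[flagged] = c2.predict(X_test_sel[flagged])  # layer 2 refines A; a prediction
                                                 # of 0 returns the sample to N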

4. Experiments

In this section, we evaluate the classification performance of the framework through experiments. First, we introduce the experimental environment, the CICIDS2017 dataset, and evaluation metrics in detail. Then, we adjust some of the parameters used in the proposed framework and finally compare and analyze the classification results.

4.1. Experimental Environment

The details of all experimental implementation configurations are shown in Table 4.

4.2. Dataset

The CICIDS2017 dataset is a widely used dataset collected by the Canadian Institute for Cybersecurity in 2017. It contains both normal and abnormal traffic and was generated by simulation in a realistic network environment, making it closer to real-world conditions and more reliable.

The CICIDS2017 dataset provides the original pcap files, from which we extract statistical features. Since transmission content is mostly encrypted, semantic features are difficult to obtain from traffic. However, normal and abnormal traffic differ in the statistical distribution of packets within a session, such as the number and length of packets. Therefore, we use statistical features instead of the original traffic for classification. Note that the statistical features of a flow/biflow can only be calculated once all of its packets have been collected at the end of the session.

At the sample size level, the CICIDS2017 dataset contains a total of 2,830,743 records, including 2,273,097 records for normal traffic and 557,646 records for abnormal traffic, which are extremely unbalanced. Therefore, after performing regular operations (such as deleting missing values, normalizing data, and encoding labels), we undersample the CICIDS2017 dataset, which not only mitigates the negative impact of the unbalanced dataset but also reduces the training time. The category distribution of the processed dataset is shown in Figure 4.

At the feature size level, the CICIDS2017 dataset is a high-dimensional dataset, which contains 84 feature columns and 1 label column. In order to improve the generalization ability of the model, we remove the “Flow ID,” “Source IP,” “Source Port,” “Destination IP,” and “Time stamp” features. As a result, the final dataset contains only 79 feature columns and 1 label column. These features are listed and classified in Table 5.
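With pandas, this pruning step is straightforward; the column names below follow the dataset's CSV headers and may vary slightly between files.

# Drop host- and time-identifying columns so the model cannot shortcut on them.
drop_cols = ["Flow ID", "Source IP", "Source Port", "Destination IP", "Timestamp"]
df = df.drop(columns=[c for c in drop_cols if c in df.columns])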

Finally, the dataset is divided into a train set and a test set in a ratio of 3 : 1.

4.3. Evaluation Metrics and Baseline Methods

In order to accurately evaluate the two-layer structure combining XGBoost and the random forest, we use six evaluation criteria: accuracy, recall, precision, F1 score, false-negative rate (FNR), and false-positive rate (FPR).

As shown in Table 6, we first calculate the confusion matrix according to real labels and predicted labels. True positive (TP) refers to the sample with both positive real values and predicted values, whereas false positive (FP) refers to the sample with negative real values and positive predicted values. False negative (FN) refers to the sample with positive real values and negative predicted values, whereas true negative (TN) refers to the sample with both negative true values and predicted values.

Accuracy is the percentage of correctly predicted samples among all samples and represents the overall predictive ability of the model:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

Recall is the ratio of correctly predicted positive samples to all samples whose real value is positive and represents the coverage of the prediction:

$$\text{Recall} = \frac{TP}{TP + FN}.$$

Precision is the ratio of correctly predicted positive samples to all samples predicted to be positive and represents the model's ability to predict positive samples:

$$\text{Precision} = \frac{TP}{TP + FP}.$$

The F1 score is the harmonic mean of precision and recall:

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$

FNR is the ratio of positive samples incorrectly predicted as negative to all samples whose real value is positive:

$$\text{FNR} = \frac{FN}{TP + FN}.$$

FPR is the ratio of negative samples incorrectly predicted as positive to all samples whose real value is negative:

$$\text{FPR} = \frac{FP}{FP + TN}.$$
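All six metrics follow directly from the confusion matrix; a short sketch for the binary case with scikit-learn:

from sklearn.metrics import confusion_matrix

# For binary labels, ravel() yields TN, FP, FN, TP in scikit-learn's ordering.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy  = (tp + tn) / (tp + tn + fp + fn)
recall    = tp / (tp + fn)
precision = tp / (tp + fp)
f1        = 2 * precision * recall / (precision + recall)
fnr       = fn / (tp + fn)
fpr       = fp / (fp + tn)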

Additionally, we use four commonly used machine learning models, namely, K-nearest neighbor (KNN), decision tree (DT), support vector machine (SVM), and deep neural network (DNN), for comparative experiments; their parameters are shown in Table 7.

4.4. Parameter Selection

In order to achieve the best classification effect, we tune the important parameters in each module and keep the optimal settings to improve model accuracy.

In the feature selection module, we use the feature selection method based on the random forest. We need to select the top $N$ features with the highest weight as the final feature set, and we set $N$ to 10, 20, 30, 40, 50, 60, and 70 for comparative experiments. We then classify the processed dataset with the random forest and XGBoost and use the average F1 score to select the optimal number of features.

As shown in Figure 5, the average F1 score of the random forest and XGBoost, obtained with the random-forest-based feature selection method, is highest when the number of features is 50. Therefore, the optimal number of features is 50.

In the classification module, we tune the important parameters of the random forest and XGBoost models used in the two-layer structure and keep the settings that give the best results; the selected parameters are shown in Table 8.

4.5. Classification Results and Analysis

Our experiment aims to explore the advantages of the SD sampling algorithm and two-layer structure combined with XGBoost and the random forest in binary and multiclass classification.

4.5.1. Influence of the SD Sampling Algorithm

First, we sample the CICIDS2017 dataset in four different modes: no sampling, SMOTE sampling, random sampling, and SD sampling. Then, we classify the processed dataset with a two-layer structure. Finally, we evaluate the results as per the six evaluation criteria.
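This comparison can be scripted as below, reusing the sd_sampling sketch from Section 3.2.2; reading "random sampling" as random oversampling is our assumption.

from imblearn.over_sampling import SMOTE, RandomOverSampler

# Each mode maps (X, y) to a resampled train set; the two-layer classifier
# and the six metrics are then computed identically for all four modes.
modes = {
    "none":   lambda X, y: (X, y),
    "smote":  lambda X, y: SMOTE().fit_resample(X, y),
    "random": lambda X, y: RandomOverSampler().fit_resample(X, y),
    "sd":     lambda X, y: sd_sampling(X, y),
}
resampled = {name: f(X_train_sel, y_train) for name, f in modes.items()}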

As shown in Tables 9–12, the classification performance of the four sampling modes is roughly the same for binary classification. For multiclass classification, however, the model's ability to classify minority samples improves significantly on the dataset sampled by the SD sampling algorithm, and the identification rate for each attack reaches more than 99%. The reason is that the SD sampling algorithm oversamples minority samples and clusters majority samples in SD, which reduces the imbalance ratio between majority and minority samples and makes the sampled dataset more conducive to distinguishing minority-class abnormal traffic.

4.5.2. Influence of the Two-Layer Structure

In order to verify that the two-layer structure is better than other models, we conduct comparative experiments with six models: the four baseline models mentioned in Section 4.3 and the single-layer XGBoost and random forest models. The experimental results are shown in Table 13. Under the same sampling mode, the two-layer structure outperforms the other models on all six evaluation metrics.

The reason why the two-layer structure is more effective is that the first layer can first detect normal traffic and remove it from the dataset, and the second layer only detects the traffic that is judged to be abnormal in the first layer. In this way, the proportion of minority samples in the test set increases, reducing the impact of the unbalanced dataset and improving the classification ability of the model for minority samples.

5. Conclusions

This paper presents a novel network intrusion detection framework consisting of four modules: a preprocessing module, a data sampling module, a feature selection module, and a classification module. In the data sampling module, we propose a new sampling algorithm named SD sampling to balance the dataset. This algorithm overcomes the disadvantages of the SMOTE algorithm by considering the spatial distribution of samples; it also combines the ideas of oversampling and undersampling and ultimately yields a dataset that is much easier to classify. In the classification module, we propose a two-layer structure combining XGBoost and the random forest: the first layer performs binary classification and the second layer performs multiclass classification. Both layers use ensemble models, which overcome the low accuracy and poor generalization ability of single algorithms. Finally, we conduct comparative experiments on the CICIDS2017 dataset against three other sampling modes to verify the advantages of the SD sampling algorithm. We also conduct comparative experiments with four commonly used machine learning models, and the results show that the proposed two-layer structure classifies traffic accurately, with accuracy up to 99.75%.

In the traffic classification task, it is also important to improve the interpretability of the model to generate some new domain knowledge. Therefore, in the future, we will focus on using explainable artificial intelligence (XAI) [41] tools to help us understand data information and model decision methods, such as simplifying models, estimating the correlation between individual features, visualizing feature importance, and visualizing the reasoning process of deep learning models.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the National Key Research and Development Program of China (No. 2019QY1300), the Youth Innovation Promotion Association CAS (No. 2021156), and the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDC02040100).