Abstract

To address the problems of high data dimensionality, redundant data, and low accuracy in network traffic anomaly detection, a network traffic anomaly detection model (FR-APPSO-BiLSTM) based on feature reduction and bidirectional long short-term memory (BiLSTM) neural network optimization is proposed. First, the feature dimensions are partitioned by hierarchical clustering according to the similarity distance between data features, so that highly correlated features fall into the same feature subset. Second, an autoencoder is used to reduce each feature subset, which eliminates redundant information and reduces the computational complexity of the detection data. Then, a particle swarm optimization algorithm based on adaptive updating of variables and dynamic adjustment of parameters (APPSO) is proposed and used to optimize the parameters of the BiLSTM. Finally, the optimized BiLSTM is used as a classifier to perform network traffic anomaly detection on the reduced feature data. Experiments on the NSL-KDD, UNSW-NB15, and CICIDS-2017 datasets show that the proposed FR-APPSO-BiLSTM model can effectively reduce data features and improve the accuracy and overall performance of network traffic anomaly detection.

1. Introduction

Breakthroughs in network communication technology have brought the convenience of the big data era and the interconnection of 5G technology, and the scale of traffic data in network communication keeps growing. While users enjoy this convenience, network attacks are becoming more and more serious. Finding abnormal traffic through network traffic anomaly detection has therefore become an urgent problem. Abnormal network traffic refers to network traffic behavior that deviates from normal network traffic. Network traffic anomalies are mainly caused by malicious network attacks, such as denial of service attacks, port scanning, password brute-forcing, remote control, and so forth, as well as by network misconfiguration and other exceptions [1].

After decades of development, current network traffic anomaly detection methods can be divided into feature-based detection methods and anomaly-based detection methods according to their detection principle [2]. The feature-based method [3] first analyzes various attack modes, extracts attack features, and then adds them to the feature library for detecting new attacks. When a detection sample matches the information in the feature library, it is determined to be an attack. This type of method has a low false alarm rate, but it has a certain lag and can only detect attack patterns that already exist in the feature library, resulting in a low detection rate for new types of attacks. The anomaly-based method [4] establishes a probability statistical model to generalize the patterns of normal information flow samples and generate a benchmark model. When the patterns of the detected samples do not match the benchmark model, the samples are considered an anomaly attack. This type of method has a low false alarm rate and a certain detection ability for new attacks, but it also suffers from insufficient detection accuracy.

In response to the current problems faced in anomaly detection, many researchers have proposed various technologies to improve the accuracy and stability of detection methods, most of which improve detection accuracy by reducing false positives and detecting unknown attacks [5]. With the emergence and development of machine learning and deep learning, the task of manually defining features can be replaced by a trainable multilayer network, which can achieve higher accuracy and a lower false alarm rate in intrusion detection tasks than traditional machine learning. Therefore, a variety of such methods are widely used in the field of network traffic anomaly detection.

1.1. Anomaly Detection Based on Machine Learning Methods

Blanco et al. [6] presented a hybrid dimensionality reduction technique combining information gain and principal component analysis. The experimental results showed that the hybrid dimensionality reduction method is better than either single algorithm and can effectively identify abnormal traffic. Phanindra et al. [7] used recursive feature elimination to rank features according to their importance and used the random forest algorithm to classify attacks. Su et al. [8] studied the characteristics of abnormal behavior of network traffic, used a hierarchical clustering method to sample traffic data, and then used a support vector machine (SVM) to detect anomalies. Andresini et al. [9] combined an autoencoder with a triplet network to solve the problem of convergence during triplet learning. Onah et al. [10] introduced a genetic algorithm-based wrapper and Bayesian anomaly detection model (GANBADM) for fog environments, which eliminated extraneous attributes to reduce time complexity and enable more accurate detection. Zhang et al. [11] proposed an anomaly detection model (MFFSEM) based on multidimensional feature fusion and a stack integration mechanism. They established multiple basic feature datasets considering time, space, load, and other aspects of traffic information, and then established multiple comprehensive feature datasets considering the association and relevance among the basic feature datasets to meet the requirements of real-world anomaly detection. Liu et al. [12] combined the ELM with a selective ensemble algorithm based on edge distance minimization, which showed good detection performance and reduced false positives on the KDD99 dataset. Alzahrani and Alenazi [13] combined data reduction with random forest, AdaBoost, and other machine learning algorithms, proved the effectiveness of data reduction in anomalous flow detection, and improved detection efficiency with less data and computation. Aiming at the weak generalization ability and poor learning ability of the ELM model with a single kernel function, Jinghao et al. [14] constructed a hybrid kernel ELM model (HKELM) and optimized the parameters of the HKELM model by combining GSA and DE, so as to improve its global and local optimization ability in anomaly detection.

The common machine learning algorithms include logistic regression, K-means, SVMs, naive Bayes, K-nearest neighbors, random forests, and so forth. Although these machine learning algorithms have made great progress in network traffic anomaly detection, there are still some problems such as difficult detection and long detection time. At the same time, the traditional feature selection method can keep the effective information of the original data and reduce the computational complexity of the classifier [15].

1.2. Abnormality Detection Based on Deep Learning

Lin et al. [5] proposed a method for network intrusion detection based on a convolutional neural network. The accuracy of this method increases with the number of training samples. Almiani et al. [16] proposed an automatic intrusion detection system based on multilayer recurrent neural networks to resist network attacks faced by fog computing. Zhang et al. [17] proposed an intrusion detection method based on a deep convolutional neural network. The biggest feature of this method is that it converts 1D intrusion data into 2D “image data” for training networks, effectively improving the detection accuracy of the intrusion detection system. Although accuracy and false alarm rate have always been the focus of NIDS research, real-time performance, and detection efficiency are also important indicators. In deep learning, compared with other neural networks, autoencoder (AE) can effectively reduce the feature dimension and is easy to combine with other models. It can greatly reduce the training time of the model while improving detection accuracy. As an unsupervised learning artificial neural network, the AE consists of an encoder function that maps the input to a hidden layer and a decoder function. The learned reconstruction input is generated by minimizing the loss function. Alqatf et al. [18] proposed a deep learning framework based on a stacked AE for feature learning of normal samples and then used a SVM classifier to improve the accuracy of the method. Andresen et al. [19] used deep neural networks to train the characteristics of input data, and then used AE as an anomaly detector to refine classification, thus improving the classification accuracy for unforeseen attacks. Mirza and Cosan [20] proposed a deep learning network intrusion detection method based on a mixture of sparse AE and long short-term memory (LSTM) networks. 
This method uses AE to reduce the dimension of data and extract features and then uses LSTM networks to deal with the sequential nature of computer network data so that it can effectively deal with unforeseen and unpredictable network attacks.

Although the above deep learning methods can achieve certain results, they inevitably lose information when using AE to compress the original data. In addition, these studies only train a single AE on normal samples without considering attack samples, so the detection rate for new attack samples is low.

To solve the above problems, this paper proposes a network abnormal traffic detection method based on feature reduction (FR) and bidirectional LSTM (BiLSTM) neural network. Feature selection and fusion are carried out through hierarchical clustering and AE to achieve feature dimensionality reduction; then an improved particle swarm optimization (PSO) algorithm is proposed to optimize the parameters of the BiLSTM neural network; finally, anomaly detection is performed based on the optimized BiLSTM.

The main contributions of this paper are as follows:
(1) A FR algorithm based on hierarchical clustering and AE is proposed.
(2) A PSO algorithm based on adaptive updating of variables and dynamic adjustment of parameters is proposed.
(3) A network traffic anomaly detection model based on FR and BiLSTM neural network optimization is proposed.

The rest of the paper is organized as follows: Section 2 presents the FR algorithm based on hierarchical clustering and AE. Section 3 presents a PSO algorithm based on variable adaptive update and parameter dynamic adjustment. Section 4 presents the design and development phases of the proposed model. Section 5 presents the experimental results and discussion. The last section concludes the paper with future directions.

2. Feature Reduction Algorithm Based on Hierarchical Clustering and Automatic Encoder

As an unsupervised clustering algorithm, hierarchical clustering is used to classify network features at different levels to form tree-like clustering results [21]. The clustering algorithm can perform hierarchical relationship mining of data without predetermining the number of clusters. The AE is a kind of artificial neural network model that can reduce the feature dimension and extract the nonlinear information in the data by training the neural network for input reconstruction [22].

In order to reduce the impact of data size on the detection model while retaining the effective information of network traffic data, this section proposes a FR algorithm based on hierarchical clustering and AE. The algorithm implementation process is as follows.
(1) Calculate the similarity distance between the feature vectors and construct the feature similarity matrix R.
(2) Take the n features as n classes, with the classes initially unrelated.
(3) Merge the two closest classes into a new class.
(4) Calculate the similarity distance between the new class and the current classes. If the number of clusters is equal to k at this time, go to Step 5; otherwise return to Step 3.
(5) Output the clustering results to the AE.
(6) Construct k classes according to the similarity of traffic features, and create an AE for each feature subclass to learn the normal and abnormal behavior of the corresponding feature subclass.
(7) Initialize the weights of each AE according to a uniform distribution.
(8) Input the data x into the nonlinear activation function f of the encoder, and obtain the encoded data y of the hidden layer.
(9) Use the data y as the input of the nonlinear activation function g of the decoder, and obtain the decoded reconstructed data.
(10) In the training stage, optimize the parameters with the stochastic gradient descent algorithm so that the error of the decoded reconstructed data is reduced as much as possible. At running time, skip Step 10 and go to Step 11.
(11) Use each AE to calculate the anomaly score of the corresponding feature subclass.
(12) Output the anomaly score S, which collects the scores of the k feature subclasses.
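The steps above can be illustrated with a minimal NumPy sketch. Several details are assumptions made only for illustration and are not taken from the paper: the similarity distance is taken as one minus the absolute Pearson correlation, classes are merged by average linkage, each AE has a single tanh hidden layer trained by full-batch gradient descent (rather than stochastic gradient descent), and the anomaly score is the per-sample mean squared reconstruction error. The function names `cluster_features`, `train_autoencoder`, and `anomaly_score` are hypothetical.

```python
import numpy as np

def cluster_features(X, k):
    """Group the columns of X into k subsets by average-linkage agglomeration."""
    corr = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(corr)                    # assumed similarity distance
    clusters = [[j] for j in range(X.shape[1])]  # every feature starts as its own class
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the two closest classes
        del clusters[b]
    return clusters

def train_autoencoder(X, hidden, epochs=200, lr=0.05, seed=0):
    """Train a one-hidden-layer AE on X with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    limit = np.sqrt(6.0 / (d + hidden))          # uniform weight initialization
    W1 = rng.uniform(-limit, limit, (d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.uniform(-limit, limit, (hidden, d))
    b2 = np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)                 # encoder f
        E = (H @ W2 + b2) - X                    # decoder g output minus input
        dH = (E @ W2.T) * (1.0 - H ** 2)         # backpropagate through tanh
        gW2, gb2 = H.T @ E / n, E.mean(axis=0)
        gW1, gb1 = X.T @ dH / n, dH.mean(axis=0)
        W1 -= lr * gW1
        b1 -= lr * gb1
        W2 -= lr * gW2
        b2 -= lr * gb2
    return W1, b1, W2, b2

def anomaly_score(params, X):
    """Per-sample mean squared reconstruction error (assumed score definition)."""
    W1, b1, W2, b2 = params
    R = np.tanh(X @ W1 + b1) @ W2 + b2
    return np.mean((R - X) ** 2, axis=1)

# Synthetic data: features 0-2 share one latent factor, features 3-5 another.
rng = np.random.default_rng(1)
a, b = rng.normal(size=(2, 200))
X = np.column_stack([a, 2 * a, -a, b, 0.5 * b, -3 * b])
X = X + 0.05 * rng.normal(size=X.shape)
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # scale to [0, 1]

groups = cluster_features(X, k=2)
S = np.column_stack([
    anomaly_score(train_autoencoder(X[:, g], hidden=max(1, len(g) // 2)), X[:, g])
    for g in groups
])
```

In this sketch `S` has one column per feature subclass, so six input features are reduced to two score columns that a downstream classifier could consume.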

The anomaly score S will be used as the input data of the classifier for anomaly detection.

3. Particle Swarm Optimization Algorithm Based on Variable Adaptive Update and Parameter Dynamic Adjustment

For large-scale high-dimensional optimization problems, the mechanism of the original particle swarm algorithm and other swarm intelligence algorithms makes it difficult for the algorithm to expand its search space. Once it falls into a local optimum in a high-dimensional space, it is difficult to jump out, so the algorithm converges prematurely and the convergence accuracy is low. Moreover, current improved particle swarm algorithms generally strengthen either the local search ability or the global search ability, and some even abandon global search or local development in pursuit of convergence speed or convergence accuracy [23]. Some scholars therefore try to balance the global search and local development capabilities of the PSO algorithm, aiming to expand the particle search space without reducing convergence accuracy. Judging from current results, however, strengthening both capabilities at once makes it easy to lose control of this balance. In essence, such algorithms still strengthen only one of the two abilities, and they remain prone to local optima on high-dimensional problems, failing to give full play to the characteristics of PSO.

In order to solve these problems, APPSO is proposed, which is based on variable adaptive update and parameter dynamic adjustment. In APPSO, the adaptively updated variables include particle velocity and position, and the dynamically adjusted parameters include inertial weights and learning factors.

3.1. Adaptive Update of Particle Velocity

The APPSO algorithm uses the adaptive velocity update Equation (4) to control the influence of the individual optimal position and the global optimal position on the particle velocity. To achieve joint optimization of global search and local development, the influence of the individual optimal position on the particle velocity is greater in the early stage, and the influence of the global optimal position on the particle velocity gradually increases in the later stage. In Equation (4), $v_i^{t+1}$ is the particle velocity after the update, $w$ is the inertia weight, $pbest_i$ is the individual optimal position of particle $i$, $gbest$ is the global optimal position of the population $P$, $x_i^t$ is the particle position, $t$ is the number of current iterations, and $T$ is the maximum number of iterations.

The essence of Equation (4) is to control the degree of influence of individual learning ability and social learning ability on particle velocity through the ratio of the current iteration number to the maximum iteration number. In the early stage of the algorithm, when the ratio t/T is less than 1/2, the influence of the particle’s individual learning ability on the particle velocity is larger than that of its social learning ability. When t/T is greater than 1/2, the opposite is true. Equation (4) thus adaptively controls the influence of the individual optimum and the global optimum on the particle velocity update.
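As a rough illustration of weighting by the iteration ratio, the sketch below scales the individual term by (1 - t/T) and the social term by t/T. This linear form is an assumption chosen for illustration and is not claimed to be the paper's Equation (4); the function name `adaptive_velocity`, the default `w=0.7`, and the uniform random factors are likewise hypothetical.

```python
import numpy as np

def adaptive_velocity(v, x, pbest, gbest, t, T, w=0.7, rng=None):
    """One hypothetical velocity update weighted by the iteration ratio t/T."""
    rng = np.random.default_rng() if rng is None else rng
    ratio = t / T
    r1 = rng.random(x.shape)   # random factor of the individual term
    r2 = rng.random(x.shape)   # random factor of the social term
    individual = (1.0 - ratio) * r1 * (pbest - x)   # dominates while t/T < 1/2
    social = ratio * r2 * (gbest - x)               # dominates while t/T > 1/2
    return w * v + individual + social

# Usage on a 3-dimensional particle at the first iteration.
x = np.zeros(3)
v = np.zeros(3)
pbest = np.ones(3)
gbest = np.full(3, 2.0)
v_next = adaptive_velocity(v, x, pbest, gbest, t=0, T=100,
                           rng=np.random.default_rng(0))
```

With this weighting, the global best has no effect at t = 0 and the individual best has no effect at t = T, with a smooth handover in between.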

3.2. Adaptive Update of Particle Position

The position adaptive update in the APPSO algorithm mainly adopts two update methods: Cauchy update and normal update, which evolved from the Cauchy distribution and the normal distribution. The distribution diagrams of the Cauchy distribution and the normal distribution are shown in Figure 1 [24, 25].

As can be seen from Figure 1, the Cauchy distribution has a wider range of values on the x-axis than the normal distribution, which means that the solution space range of the particle search updated based on the Cauchy distribution is larger. It can be seen that more and better feasible solutions can be searched based on the Cauchy update particle, which is very suitable for the early global search work.

The normal update has a deeper value range on the y-axis, which means that the particles updated based on the normal distribution have a deeper optimal solution value range, and can improve the convergence accuracy of the algorithm. It can be seen that the normal update is very suitable for the later local development work. The Cauchy distribution and the normal distribution also have a significant effect on dealing with high-dimensional problems.

After the particle velocity has been updated adaptively based on Equation (4), the particle position is updated adaptively. Equations (5) and (6) are the particle position update equations based on the Cauchy distribution and the normal distribution, respectively. In these equations, $x_i^t$ is the position of particle $i$ in the $t$th iteration, $x_i^{t+1}$ is the position of particle $i$ in the $(t+1)$th iteration, $pbest_i^t$ is the local optimal position of particle $i$ in the $t$th iteration, Cauchy is an update operator based on the Cauchy distribution, and Normal is an update operator based on the normal distribution.

The update method based on the Cauchy distribution is suitable for early global search, and the update method based on the normal distribution is suitable for later local development. The particle position is therefore updated adaptively according to the number of algorithm iterations: in Equations (5) and (6), the particle uses the Cauchy distribution or the normal distribution to update and optimize its own position information. This optimization method works together with the individual and global optimum influence degrees in Equation (4). In the early stage, the influence of the individual optimum dominates and the Cauchy update lets the algorithm expand the search scope and find more and better feasible solutions; in the later stage, the influence of the global optimum dominates and the normal update refines the solution.
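The difference between the two update operators can be sketched as follows. The concrete form used here (a random perturbation towards the individual best, scaled by a hypothetical `scale` parameter, with the switch at t = T/2) is an assumption for illustration and does not reproduce Equations (5) and (6); `adaptive_position` is a hypothetical name.

```python
import numpy as np

def adaptive_position(x, v, pbest, t, T, rng, scale=0.1):
    """Hypothetical position update: Cauchy noise early, normal noise later."""
    if t < T / 2:
        noise = rng.standard_cauchy(x.shape)   # heavy tails: occasional long jumps
    else:
        noise = rng.standard_normal(x.shape)   # light tails: fine local steps
    return x + v + scale * noise * (pbest - x)

# Compare the largest move produced by each operator on many coordinates.
x = np.zeros(10000)
v = np.zeros(10000)
pbest = np.ones(10000)
early = adaptive_position(x, v, pbest, t=0, T=100, rng=np.random.default_rng(0))
late = adaptive_position(x, v, pbest, t=100, T=100, rng=np.random.default_rng(0))
largest_early_move = np.abs(early - x).max()
largest_late_move = np.abs(late - x).max()
```

Because the Cauchy distribution has heavy tails, the early-stage update occasionally produces very long jumps, which is the property the text relies on for global search.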

3.3. Dynamic Adjustment of the Parameters
3.3.1. Dynamic Adjustment of the Inertia Weight

The loss of population diversity is usually considered an important reason for the premature convergence of swarm intelligence optimization algorithms [26]. It is well known that the inertia weight plays a vital role in balancing global exploration and local mining capability in PSO. Therefore, introducing population diversity into the inertia weight is a meaningful way to improve the balance between the global exploration and local mining abilities of PSO. High diversity means that the population searches in a larger space, whereas low diversity means that the population enters a smaller search space.

This section introduces an inertia weight $w$ based on population diversity, in whose definition $T$ is the maximum number of iterations and $S(t)$ is the population diversity. In the definition of $S(t)$, $N$ is the size of the population, $Dim$ represents the dimension of the problem to be optimized, $x_i$ is the position of particle $i$, and $\bar{x}$ is the average position of the entire population. The definition of $\bar{x}$ is as follows:

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

It can be seen from Equation (8) that the value of the population diversity S(t) is determined by the distance between each particle in the population and the average position of the entire population. In the initial stage of the algorithm operation, the value of S(t) has a great influence on the inertia weight w. When the algorithm converges towards a certain point, the population diversity decreases rapidly, and the entire population approaches the optimal point in a relatively stable form. This further reduces the diversity of the population in subsequent iterations, which causes the value of w to drop and leads the entire population to finer mining in a smaller flying space.
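A common way to measure this kind of diversity is the mean Euclidean distance from each particle to the population centroid. The sketch below uses that measure as an assumption; the normalization in the paper's Equation (8) may differ, and the name `population_diversity` is hypothetical.

```python
import numpy as np

def population_diversity(positions):
    """Mean Euclidean distance from each particle to the average position."""
    positions = np.asarray(positions, dtype=float)
    mean_position = positions.mean(axis=0)                    # average position
    distances = np.linalg.norm(positions - mean_position, axis=1)
    return float(distances.mean())

# A spread-out population has higher diversity than a collapsed one.
spread = population_diversity([[0.0, 0.0], [2.0, 0.0]])
collapsed = population_diversity([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])
```

In the two-particle example each particle is at distance 1 from the centroid, while the collapsed population has zero diversity.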

3.3.2. Dynamic Adjustment of the Learning Factors

In the APPSO, the self-cognitive learning factor c1 and the social learning factor c2 are used to control the magnitude of each related learning item. Generally, the value of both is 2.5. In order to better control the amplitude, this paper adopts a time-varying strategy to dynamically adjust the learning factors, in which the self-cognitive learning factor c1 decreases from 2.5 to close to 0.5, and the social learning factor c2 increases from 0.5 to close to 2.5. The self-cognitive learning factor c1 and the social learning factor c2 are calculated as follows:
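One simple schedule that matches the stated endpoints is a linear interpolation over the iterations. The linear form is an assumption for illustration; the paper's actual equations for c1 and c2 may be nonlinear. The function name `learning_factors` is hypothetical.

```python
def learning_factors(t, T, c_high=2.5, c_low=0.5):
    """Linear time-varying schedule: c1 falls from c_high, c2 rises from c_low."""
    fraction = t / T
    c1 = c_high - (c_high - c_low) * fraction   # self-cognitive factor decreases
    c2 = c_low + (c_high - c_low) * fraction    # social factor increases
    return c1, c2

# Values at the start, the midpoint, and the end of a 100-iteration run.
start = learning_factors(0, 100)
middle = learning_factors(50, 100)
end = learning_factors(100, 100)
```

Under this schedule the two factors cross at the midpoint of the run, so individual learning dominates the first half and social learning dominates the second half.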

3.4. Implementation Process of the APPSO

The main idea of the APPSO is to divide the entire algorithm optimization process into two stages: the early stage and the later stage. In the early stage, in order to expand the global search ability, the Cauchy update is performed during the iterative update of particles to increase the search diversity and expand the global search range. In the later stage, in order to strengthen the local development capability, the normal update is performed during the iterative update of particles to improve the convergence accuracy of the algorithm. During the entire search process, the algorithm dynamically adjusts the inertia weight and learning factors to balance the global exploration and local mining capabilities according to the diversity of the population.

The implementation process of the APPSO algorithm is shown in Table 1.

4. Network Traffic Anomaly Detection Model Based on Feature Reduction and Bidirectional LSTM Neural Network Optimization

In order to obtain better detection accuracy and improve detection speed, this section proposes a network abnormal traffic detection model based on FR and BiLSTM neural network optimization. First, the hierarchical clustering algorithm and AE are used to select and reduce data features, which retains valid information while reducing the amount of data. Then, the reduced data is sent to the classifier for training. Finally, an anomaly detection model is constructed to identify whether newly incoming traffic data is abnormal.

4.1. Data Preprocessing

There are three types of data in a dataset. Before model training, data preprocessing transforms the character data in the dataset into unique numeric codes, unifies the data types into a type recognized by the model, and normalizes the data to values between 0 and 1. The calculation equation of data preprocessing is as follows:

$$y = \frac{x - MIN}{MAX - MIN}$$

where y represents the preprocessed data value, x represents the original data value, MAX represents the maximum value in the dataset, and MIN represents the minimum value in the dataset.
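The preprocessing described above can be sketched as follows. Min-max scaling follows the normalization formula directly; the use of one-hot vectors for the character data is an assumption, since the text only says the character data is transformed into unique codes. The function names and the sample values are hypothetical.

```python
import numpy as np

def min_max_normalize(column):
    """Scale a numeric column to [0, 1] using (x - MIN) / (MAX - MIN)."""
    column = np.asarray(column, dtype=float)
    lowest, highest = column.min(), column.max()
    if highest == lowest:                  # constant column: avoid division by zero
        return np.zeros_like(column)
    return (column - lowest) / (highest - lowest)

def one_hot(values):
    """Encode character data as one-hot rows (assumed encoding scheme)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    encoded[np.arange(len(values)), [index[v] for v in values]] = 1.0
    return encoded, categories

# Hypothetical sample values for one numeric and one character feature.
scaled = min_max_normalize([2, 4, 6])
encoded, categories = one_hot(["tcp", "udp", "tcp"])
```

In practice MIN and MAX would be computed on the training set and reused for the test set, so that both are scaled consistently.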

4.2. The Process of Optimizing BiLSTM Based on APPSO

Jian et al. [27] put forward the long short-term memory (LSTM) network, a variant of the recurrent neural network, which introduced a gating mechanism to simply and effectively solve the problem of gradient explosion or vanishing in the traditional recurrent neural network. The LSTM controls the information transmission between cells through the gating mechanism. The LSTM extracts sequence information in one direction only, but in network traffic anomaly detection the current network situation is related not only to the previous situation but possibly also to the subsequent situation. In order to improve the prediction effect, BiLSTM is introduced to detect network traffic anomalies.

The BiLSTM model used for the experiments in this paper is shown in Figure 2.

(1) BiLSTM layers: with two BiLSTM layers, their combined forward and backward capabilities can be fully exploited to enhance model learning.
(2) Dropout layer: avoids overfitting of the model and improves generalization.
(3) Dense layer: the last layer is set as Dense to transform the output dimension and produce the prediction result.

For the BiLSTM network, the selection of parameters in its structure is critical to the effect of the model, such as the number of hidden layers, weight, number of hidden layer units, and learning rate. Many researchers determine these parameters based on experience or trial and error method, which makes the robustness and accuracy of the model unreliable. Therefore, this paper selects the PSO algorithm which is simple in principle, low in complexity, fast in convergence, and suitable for dealing with real value problems to optimize the structural parameters of the BiLSTM network.

The traditional cross-validation method is a commonly used parameter optimization method, but it takes a long time and the parameter selection is blind. At present, many parameter optimization methods have been studied and applied [28]. PSO has been widely used in optimization, evolutionary computing, and other fields due to its outstanding performance. This paper uses the proposed APPSO to optimize the parameters of BiLSTM for better performance. The specific implementation steps for optimizing BiLSTM parameters based on APPSO are as follows:
(1) Construct the training set samples and test set samples according to the size of the sliding window.
(2) Initialize the relevant parameters in APPSO, which include the maximum and minimum values of the search dimension D, the number of particles PN, the acceleration factors c1 and c2, the maximum number of iterations max_iter, the initial position and initial velocity of the particles, the inertia weight factor w, and the learning factors r1 and r2.
(3) Set the range of values for each dimension of the particles to be optimized. Each particle has five dimensions, which represent the learning rate, the number of model iterations (iterator), the number of cells in the first hidden layer (n1), the number of cells in the second hidden layer of the LSTM (n2), and the random seed of the BiLSTM model (s), respectively.
(4) Set the fitness function of the particle swarm algorithm, randomly generate the initial positions of the particle swarm, calculate the initial fitness value of each particle, and obtain the initial individual optimal solution pbest and global optimal solution gbest.
(5) Calculate the fitness value of each particle, update the individual optimal solution pbest and the global optimal solution gbest, and update the velocity and position of each particle.
(6) If the maximum number of iterations is reached, proceed to Step 7. Otherwise return to Step 5 and continue iterating.
(7) Assign the optimal parameters obtained to the BiLSTM model to obtain the situation prediction results.
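The search loop in the steps above can be sketched with a compact PSO. This is a simplified stand-in rather than the full APPSO: it uses a fixed inertia weight and a linear c1/c2 schedule, and it omits the Cauchy/normal position updates and the diversity-based inertia weight. The search ranges in `BOUNDS`, the function `stand_in_fitness` (a placeholder for training a BiLSTM and returning its validation error), and `pso_search` are all hypothetical; the random seed dimension s is left out for brevity.

```python
import numpy as np

# Hypothetical ranges for (learning rate, iterator, n1, n2); not from the paper.
BOUNDS = np.array([[1e-4, 1e-1], [10, 200], [8, 128], [8, 128]], dtype=float)

def stand_in_fitness(p):
    """Placeholder for 'train BiLSTM with parameters p, return validation error'."""
    target = np.array([0.01, 100.0, 64.0, 32.0])
    span = BOUNDS[:, 1] - BOUNDS[:, 0]
    return float(np.sum(((p - target) / span) ** 2))

def pso_search(fitness, bounds, n_particles=20, max_iter=100, w=0.7, seed=0):
    """Minimize fitness inside bounds with a basic time-varying PSO."""
    rng = np.random.default_rng(seed)
    low, high = bounds[:, 0], bounds[:, 1]
    x = rng.uniform(low, high, size=(n_particles, len(low)))   # initial positions
    v = np.zeros_like(x)                                       # initial velocities
    pbest = x.copy()
    pbest_val = np.array([fitness(p) for p in x])
    gbest = pbest[pbest_val.argmin()].copy()
    gbest_val = pbest_val.min()
    for t in range(max_iter):
        c1 = 2.5 - 2.0 * t / max_iter          # self-cognitive factor decreases
        c2 = 0.5 + 2.0 * t / max_iter          # social factor increases
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, low, high)          # keep particles inside the ranges
        values = np.array([fitness(p) for p in x])
        improved = values < pbest_val
        pbest[improved] = x[improved]
        pbest_val[improved] = values[improved]
        if pbest_val.min() < gbest_val:
            gbest_val = pbest_val.min()
            gbest = pbest[pbest_val.argmin()].copy()
    return gbest, float(gbest_val)

best_params, best_value = pso_search(stand_in_fitness, BOUNDS)
```

In a real run, integer-valued dimensions such as n1 and n2 would be rounded before building the network, and each fitness evaluation would involve a full training run, which is why the number of particles and iterations dominates the cost.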

The process of optimizing BiLSTM based on APPSO is shown in Figure 3.

4.3. Implementation Process of the FR-APPSO-BiLSTM Model

The FR-APPSO-BiLSTM model combines the advantages of the hierarchical clustering algorithm, AE, APPSO, and BiLSTM, and can effectively overcome the low precision and insufficient feature extraction ability of existing intrusion detection technology. The implementation process of the FR-APPSO-BiLSTM model is as follows:
(1) Preprocess the detection dataset, and divide it into the training dataset and the test dataset.
(2) Use the hierarchical clustering algorithm and AE to extract features of the training dataset.
(3) Use the APPSO algorithm to optimize the parameters of BiLSTM.
(4) Use the hierarchical clustering algorithm and AE to extract features of the test dataset.
(5) Bring the parameters obtained in Step 3 into BiLSTM, and detect the test dataset features obtained in Step 4.
(6) Save the obtained detection results, and stop the model.

The flowchart of the FR-APPSO-BiLSTM model is shown in Figure 4.

5. Simulation Results Analysis

The experiments in this paper are run on the 64-bit Windows 10 operating system, and the Keras deep learning framework is used for model training and testing in a Python 3.8 environment. The processor is an Intel(R) Core(TM) i7 CPU at 2.9 GHz.

5.1. Particle Swarm Optimization Comparison Experiment

In order to verify the performance of APPSO, this section selects PSOBSA, FOPSO, HPSO, ASPSO, OLPSO, GEPSO, and HMaPSO as the comparison algorithms for comparative analysis [29–35]. In the experiment, eight benchmark functions are used to test the performance of the eight PSO algorithms. The specific information of the functions is shown in Table 2. The functions F1–F3 are unimodal functions and F4–F8 are multimodal functions. For all benchmark functions, the search dimensions are set to 5, 50, and 100, respectively. To fairly compare the performance of the eight algorithms, 50 independent runs of each test function were performed with each algorithm. The maximum number of iterations per run is set to 2,000.

5.1.1. Comparison of Optimization Results of Benchmark Functions

This section compares the optimization results of the eight algorithms on the eight benchmark functions. When the dimensions are 5, 50, and 100, the mean of the optimization results of each algorithm is shown in Tables 3–5.

As can be seen from Tables 3 to 5, when the search space dimension is 5, the APPSO obtains the optimal solution on 6 of the 8 functions. On the functions F1 and F5, the APPSO easily falls into a local optimum because of their special shapes. The APPSO tries to jump out of the local optimum by performing the Cauchy update in the early stage and the normal update in the later stage according to the number of iterations, but it still becomes trapped because of the complexity of these two functions, so its performance is not fully realized. When the search space dimension is 50, the APPSO obtains the optimal solution on 5 of the 8 functions. Since the functions F4, F5, and F7 are high-dimensional complex functions, almost all algorithms fall into local optima. However, the APPSO benefits from the adaptive update of velocity and position, and its results on these three functions are still second only to those of OLPSO, PSOBSA, and ASPSO, respectively. When the search space dimension is 100, the APPSO obtains the optimal solution on all eight functions. This is because, compared with the other algorithms, the adaptive update of APPSO can expand the global search range in the early stage and also achieves better local development in the later stage.

5.1.2. Comparison of the Convergence Speed

This section compares the convergence speed of the APPSO and the other algorithms. The search space dimension is set to 50. The results are shown in Figure 5. It can be seen from Figure 5 that, compared with the other algorithms, the adaptive update mechanism of the APPSO gives the particle population a very reliable global search ability. The APPSO has a faster convergence speed and a better optimization effect in both unimodal and multimodal function optimization.

5.1.3. Comparison of the Running Time

Because the running time of an algorithm is very important in many practical engineering applications, a set of experiments is conducted in this section to compare the running time of all the algorithms. The experimental results are shown in Table 6. The values in Table 6 represent the average running time of each algorithm over 50 independent runs on a function.

As can be seen from Table 6, the APPSO exhibits the best experimental results on all functions, which shows that it is very competitive in terms of convergence speed. Although the dynamic adjustment of particle positions in APPSO is relatively time-consuming, APPSO is still ahead of the other algorithms in running time, mainly because it achieves an appropriate balance between optimization accuracy and running time.

5.2. Network Traffic Anomaly Detection Comparison Experiment
5.2.1. Experimental Environment and Evaluation Metrics

To evaluate the performance of the FR-APPSO-BiLSTM model, this section adopts Accuracy, Precision, Recall, and F-score as evaluation metrics. These metrics are calculated as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

where TP and TN denote true positives and true negatives, respectively, that is, attack samples and normal samples that are correctly classified. A false negative (FN) is an attack sample that is wrongly classified as normal, and a false positive (FP) is a normal sample that is wrongly classified as an attack.
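The four metrics follow directly from the confusion-matrix counts TP, TN, FP, and FN defined above. The sketch below uses the standard definitions. The example counts are invented for illustration and are not results from the paper, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Accuracy, Precision, Recall, and F-score from confusion counts.

    Attack is treated as the positive class. A zero denominator yields 0.0,
    which is an assumed convention rather than something the paper specifies.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Illustrative counts only (not taken from the experiments).
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

With these counts, Accuracy is 170/200 = 0.85, Precision is 90/110, Recall is 90/100 = 0.9, and the F-score is 180/210.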

5.2.2. Experimental Dataset

This section verifies the effectiveness of the FR-APPSO-BiLSTM model on the NSL-KDD, UNSW-NB15, and CICIDS-2017 datasets [36–38]. The details of the three datasets are as follows.
(1) The NSL-KDD dataset consists of a training set and a test set, in which the training set has 125,973 records and the test set has 22,543 records. The dataset contains four attack types (DoS, Probe, R2L, and U2R), in which the amount of data for the attack types is much lower than that for the normal type.
(2) The UNSW-NB15 dataset, collected by the Cyber Range Lab of the Australian Cyber Security Center, contains nine attack types, namely Fuzzers, Analysis, Backdoor, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Normal. The dataset consists of a training set and a test set, with 175,341 records in the training set and 82,332 records in the test set.
(3) The CICIDS-2017 dataset contains network traffic in packet-based and bidirectional-flow-based formats, and each record contains 82 network flow features. Compared with NSL-KDD and UNSW-NB15, the CICIDS-2017 dataset includes a wider range of attack types, such as brute force attack, DoS, Heartbleed, Web penetration, and DDoS.

5.2.3. Parameter Selection of the BiLSTM

Figure 6 shows the training results of the APPSO algorithm optimizing the BiLSTM neural network. The time step and batch size gradually converge to their optimal values as the algorithm iterates. As can be seen from Figure 6, the optimal batch size of the model training data is 1, and the optimal time step is 6. The best hyperparameters obtained in this way are then used to configure the model structure of the BiLSTM neural network.
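To make the role of the time step concrete, the sketch below groups consecutive traffic records into sequences of length 6 for a recurrent classifier. How the paper actually forms its input sequences is not stated in this section, so the sliding-window scheme, the choice of labeling each window by its last record, and the helper name `make_windows` are all assumptions.

```python
def make_windows(records, labels, time_step=6):
    """Group consecutive records into overlapping windows of length time_step.

    Each window takes the label of its last record (an assumed convention).
    Returns (windows, window_labels).
    """
    if len(records) != len(labels):
        raise ValueError("records and labels must have the same length")
    windows, window_labels = [], []
    for end in range(time_step, len(records) + 1):
        windows.append(records[end - time_step:end])
        window_labels.append(labels[end - 1])
    return windows, window_labels


# Ten toy records with two features each (illustrative data only).
records = [[float(i), float(i) * 0.5] for i in range(10)]
labels = [0, 0, 0, 0, 0, 1, 0, 0, 1, 0]
windows, window_labels = make_windows(records, labels, time_step=6)
```

With ten records and a time step of 6, this produces 10 - 6 + 1 = 5 windows. With a batch size of 1, each window would be fed to the network as its own training batch.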

5.2.4. Parameter Settings of the FR-APPSO-BiLSTM Model

In the FR-APPSO-BiLSTM model, the settings of related parameters are shown in Table 7.

After FR using hierarchical clustering and AEs, the feature subsets of the three datasets are shown in Table 8.

5.2.5. Analysis of Simulation Results

In this section, the FR-APPSO-BiLSTM model is compared with FR-APPSO-LSTM, FR-BiLSTM, and APPSO-BiLSTM to verify the effectiveness of FR based on hierarchical clustering and AE, and the effectiveness of the BiLSTM parameter optimization based on the APPSO. The FR-APPSO-BiLSTM model is compared with FR-ASPSO-BiLSTM, FR-QPSO-BiLSTM, and FR-HPSO-BiLSTM to verify the effectiveness of the BiLSTM optimized based on APPSO compared to the BiLSTM optimized by other PSO algorithms. Finally, the FR-APPSO-BiLSTM model is compared with other existing detection models to verify its detection effect.

(1) Comparison of FR-APPSO-BiLSTM and FR-APPSO-LSTM, FR-BiLSTM, APPSO-BiLSTM. This section compares the FR-APPSO-BiLSTM model with FR-APPSO-LSTM, FR-BiLSTM, and APPSO-BiLSTM on three datasets. The experimental results are shown in Table 9 and Figure 7.

It can be seen that, compared with the FR-APPSO-LSTM, FR-BiLSTM, and APPSO-BiLSTM models, the FR-APPSO-BiLSTM model has higher classification accuracy and a better detection effect on the three datasets. For the four indicators on the NSL-KDD dataset, the FR-APPSO-BiLSTM model improved by 1.93%, 2.45%, 3.40%, and 2.89% compared with the FR-APPSO-LSTM model, by 1.05%, 1.08%, 0.98%, and 1.03% compared with the FR-BiLSTM model, and by 1.83%, 2.00%, 1.35%, and 1.57% compared with the APPSO-BiLSTM model. For the four indicators on the UNSW-NB15 dataset, the FR-APPSO-BiLSTM model improved by 8.08%, 5.29%, 6.57%, and 5.93% compared with the FR-APPSO-LSTM model, by 2.49%, 0.90%, 1.12%, and 1.01% compared with the FR-BiLSTM model, and by 1.67%, 1.16%, 0.97%, and 1.16% compared with the APPSO-BiLSTM model. For the four indicators on the CICIDS-2017 dataset, the FR-APPSO-BiLSTM model improved by 3.41%, 14.43%, 12.35%, and 13.39% compared with the FR-APPSO-LSTM model, by 1.59%, 0.33%, 0.62%, and 0.48% compared with the FR-BiLSTM model, and by 0.35%, 0.37%, 0.76%, and 0.42% compared with the APPSO-BiLSTM model.

The above results show that the FR-APPSO-BiLSTM model has better detection performance. On the one hand, it uses hierarchical clustering and AE for FR of the data, which eliminates redundant information in the data. On the other hand, it uses the APPSO algorithm to optimize the parameters of the BiLSTM. The combination of these two mechanisms gives the model a better detection effect and higher detection efficiency.

(2) Comparison of FR-APPSO-BiLSTM and FR-ASPSO-BiLSTM, FR-QPSO-BiLSTM, FR-HPSO-BiLSTM. This section compares the FR-APPSO-BiLSTM model with FR-ASPSO-BiLSTM (BiLSTM optimized by ASPSO [32]), FR-QPSO-BiLSTM (BiLSTM optimized by QPSO [39]), and FR-HPSO-BiLSTM (BiLSTM optimized by HPSO [31]) on the three datasets. The experimental results are shown in Table 10 and Figure 8.

It can be seen that, compared with the FR-ASPSO-BiLSTM, FR-QPSO-BiLSTM, and FR-HPSO-BiLSTM models, the FR-APPSO-BiLSTM model has higher classification accuracy and a better detection effect on the three datasets. For the four indicators on the NSL-KDD dataset, the FR-APPSO-BiLSTM model improved by 0.85%, 0.37%, 1.18%, and 0.74% compared with the FR-ASPSO-BiLSTM model, by 0.33%, 0.09%, 0.92%, and 0.48% compared with the FR-QPSO-BiLSTM model, and by 1.10%, 0.57%, 1.33%, and 0.90% compared with the FR-HPSO-BiLSTM model. For the four indicators on the UNSW-NB15 dataset, the FR-APPSO-BiLSTM model improved by 2.49%, 0.90%, 1.12%, and 1.01% compared with the FR-ASPSO-BiLSTM model, by 0.85%, 0.53%, 1.04%, and 0.77% compared with the FR-QPSO-BiLSTM model, and by 0.85%, 0.53%, 1.04%, and 0.77% compared with the FR-HPSO-BiLSTM model. For the four indicators on the CICIDS-2017 dataset, the FR-APPSO-BiLSTM model improved by 0.43%, 0.22%, 0.55%, and 0.39% compared with the FR-ASPSO-BiLSTM model, by 0.25%, 0.11%, 0.27%, and 0.19% compared with the FR-QPSO-BiLSTM model, and by 0.43%, 0.30%, 0.45%, and 0.35% compared with the FR-HPSO-BiLSTM model.

The above results show that the BiLSTM optimized by APPSO is consistently better on all three datasets than the BiLSTM optimized by the other PSO algorithms. The BiLSTM optimized by APPSO improves the detection ability for abnormal data mainly for the following reasons: (1) hierarchical clustering and the automatic encoder are used for FR of the data, eliminating redundant information in the data; (2) the APPSO has better dynamic adaptability than ASPSO, QPSO, and HPSO. The particles in the population can automatically adjust their search direction according to changes in the search environment, giving faster convergence and stronger optimization capability, which makes it possible to find parameters with better predictive performance for the BiLSTM.

(3) Comparison with the Other Existing Detection Models. This section compares the FR-APPSO-BiLSTM model with other existing methods in the literature (HCRNNIDS [40], ADASYN-LightGBM [41], LNNLS-KH [42], STL-HDL [43], and E-GraphSAGE [44]). The results are shown in Table 11 and Figure 9.

It can be seen that, compared with the other existing methods, the FR-APPSO-BiLSTM model has higher classification accuracy on all three datasets. Compared with the HCRNNIDS model, the Accuracy values of FR-APPSO-BiLSTM on the three datasets increased by 8.65%, 4.30%, and 11.11%, the Precision values increased by 3.92%, 13.93%, and 4.49%, the Recall values increased by 6.90%, 6.36%, and 8.80%, and the F-score values increased by 5.30%, 10.15%, and 6.64%. Compared with the ADASYN-LightGBM model, the Accuracy values increased by 6.39%, 3.04%, and 3.74%, the Precision values increased by 2.90%, 8.43%, and 1.31%, the Recall values increased by 4.73%, 7.52%, and 3.10%, and the F-score values increased by 3.75%, 7.98%, and 2.20%. Compared with the LNNLS-KH model, the Accuracy values increased by 2.92%, 1.80%, and 4.24%, the Precision values increased by 2.07%, 3.74%, and 10.38%, the Recall values increased by 2.13%, 7.94%, and 8.91%, and the F-score values increased by 2.10%, 5.84%, and 9.64%. Compared with the STL-HDL model, the Accuracy values increased by 0.81%, 1.98%, and 5.51%, the Precision values increased by 3.67%, 6.17%, and 3.74%, the Recall values increased by 3.70%, 2.26%, and 5.54%, and the F-score values increased by 3.69%, 4.22%, and 4.64%. Compared with the E-GraphSAGE model, the Accuracy values increased by 1.74%, 0.72%, and 2.35%, the Precision values increased by 1.72%, 4.74%, and 6.75%, the Recall values increased by 2.57%, 4.12%, and 4.11%, and the F-score values increased by 2.12%, 4.43%, and 5.43%. The above results demonstrate that the FR-APPSO-BiLSTM model has better detection performance on all three datasets, which verifies the effectiveness of the model.

The FR-APPSO-BiLSTM model performs better than the other existing models mainly for the following reasons: (1) the proposed FR algorithm based on hierarchical clustering and the automatic encoder effectively reduces the dependence of the detection model on the dataset size while preserving the effective information of network traffic data; (2) the combination of FR, APPSO, and BiLSTM effectively leverages their advantages and overcomes the inherent shortcomings of existing models.

6. Conclusions and Future Work

To solve the problems of large network data volume, high feature dimension, and the high dependence of traditional machine learning algorithms on data labels in anomaly intrusion detection, we propose a network traffic anomaly detection model (FR-APPSO-BiLSTM) based on FR and BiLSTM neural network optimization. In FR-APPSO-BiLSTM, hierarchical clustering and an automatic encoder are combined to reduce the features of network traffic data. APPSO is used to optimize the parameters of the BiLSTM. The optimized BiLSTM is used as the classifier, and the reduced feature data is used as its input for network traffic anomaly detection. In the experiments, the APPSO algorithm is first compared with other PSO algorithms on benchmark function optimization, and the results show that it has better convergence speed and optimization effect. The NSL-KDD, UNSW-NB15, and CICIDS-2017 datasets are then used to conduct network traffic anomaly detection simulation experiments. The experimental results show that the FR-APPSO-BiLSTM model obtains better evaluation indicators and has better detection performance.

In the future, we will improve the pattern matching speed so that high-speed networks can be analyzed more effectively and protected from network attacks. More attack types will also be taken into account to evaluate the proposed model with additional performance metrics.

Data Availability

The experimental data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Authors’ Contributions

Hanqing Jiang performed the experiments. Shaopei Ji analyzed the data and wrote the paper. Hanqing Jiang and Shaopei Ji contributed equally to this work. Guanghui He and Xiaohu Li supervised the research and critically revised the paper. All authors have read and agreed to the published version of the manuscript. Hanqing Jiang and Shaopei Ji are the co-first authors.

Acknowledgments

This research was supported by the Joint fund for enterprise innovation and development of the National Natural Science Foundation of China (U20B2049).