Abstract

New vulnerabilities and ever-evolving network attacks pose serious threats to today's cyberspace security. Anomaly detection in network traffic is a promising and effective technique for enhancing network security. In addition to traditional statistical analysis and rule-based detection techniques, machine learning models have been introduced for intelligent detection of abnormal traffic. In this paper, a novel model named SVM-C is proposed for anomaly detection in network traffic. The URLs in the network traffic log are transformed into feature vectors via statistical laws and a linear projection. The obtained feature vectors are fed into a support vector machine (SVM) classifier and classified as normal or abnormal. Based on the ideas of SVM and clustering, we construct an optimization model to train the parameters of the feature extraction method and the traffic classifier. Numerical tests indicate that the proposed model outperforms state-of-the-art methods on all the tested datasets.

1. Introduction

With the rapid development of information technology and the Internet, network security has become increasingly important. Anomaly detection in network traffic is an effective way to provide solid information for network security management and to protect users' data and privacy. By analyzing network traffic, malicious behaviors such as SQL injection attacks, cross-site scripting (XSS) attacks, and directory traversal attacks can be discovered promptly. Because of the increasing volume of traffic data, anomaly detection methods must be adaptable and scalable. Traditional rule-based detection methods have inherent defects: attackers can easily bypass predefined detection rules, and new, unknown attacks cannot be discovered by rules built from existing attacks. Thus, rule-based methods often suffer from high false negative rates. In essence, anomaly detection in network traffic is a data classification problem: it aims to distinguish attack data from normal behaviors. Besides traditional rule-based detection techniques [1, 2], many methods based on statistical theory [3, 4], information theory [5, 6], and machine learning [7, 8] are widely used for abnormal traffic detection. Machine learning based detection is a promising approach to intelligent anomaly detection in large-scale, high-bandwidth network environments.

Researchers have studied multiple machine learning based detection methods [9]. Supervised learning models are commonly used in anomaly detection, where datasets labeled as normal or abnormal are used to train and test the model. Neural networks [10], support vector machines [11], decision trees [12], Naive Bayes [13], and other supervised models are often used in traffic classification. Ensemble learning combines multiple base models and often achieves better performance than a single classifier; therefore, random forests [14] and other ensemble algorithms have also been used for abnormal traffic detection. However, it is often difficult to obtain enough labeled data. Thus, unsupervised learning models have been adopted to uncover the latent structure in data. The training and testing of unsupervised detection models are based on unlabeled datasets. Clustering is a classic unsupervised model, in which traffic is characterized and identified by selecting an appropriate distance metric [15]. Semisupervised detection models can be treated as a combination of supervised and unsupervised methods, since they use labeled and unlabeled data simultaneously to build the detection model. In [16], it is shown that semisupervised detection models, such as the spectral graph transducer and Gaussian fields, significantly improve classification accuracy. Before training machine learning based detection models, feature selection and dimension reduction are two useful preprocessing techniques for reducing the dataset dimension; selecting a subset with little redundancy helps improve detection performance. Hybrid models combine machine learning models with feature selection methods. In [17], the authors proposed an improved krill swarm algorithm based on a linear nearest neighbor lasso step for feature selection in network intrusion detection. The authors in [18] applied an autoencoder module for dimension reduction of traffic features; the obtained compressed representations are then fed into machine learning models for intrusion detection.

Generally, anomaly detection in network traffic follows three steps. First, the traffic data are transformed into feature vectors via a feature extraction method. Then, the obtained feature vectors are used to train and test the traffic classification model. Finally, new traffic data are classified as normal or abnormal by the trained classifier. An effective feature extraction method helps improve the performance of the anomaly detection model. The aforementioned works mainly adopt hand-crafted feature extraction methods. Since such feature sets rely heavily on experts' domain knowledge, the trained classifiers have disadvantages such as poor adaptability to datasets from different network environments. It is therefore critical to minimize the dependence on expert knowledge in the traffic feature extraction process. In fact, network traffic can be treated as natural language, so researchers have introduced natural language processing techniques to fully explore the semantic structure of traffic data. For example, the k-gram technique can be used to characterize the typical pattern of normal requests [19]; any incoming payload that does not match the normal pattern is labeled as abnormal [20].

In this paper, we propose a novel model called SVM-C to detect abnormal network traffic. The raw traffic data are transformed into fixed-length feature vectors via statistical laws and a linear coding operation. Then, an optimization problem is constructed based on the basic ideas of SVM and clustering. The parameters of SVM-C are trained by solving this optimization problem, and the transformed vectors are classified by the SVM classifier. Our contributions are summarized as follows:
(i) A new model, SVM-C, is proposed for anomaly detection in network traffic. The parameters of SVM-C are trained by constructing and solving an optimization problem.
(ii) We apply the block coordinate descent (BCD) and projected Barzilai–Borwein (PBB) methods [21] to solve the proposed optimization problem.
(iii) Numerical results on all the tested datasets indicate that the proposed model outperforms state-of-the-art methods.

The rest of the paper is organized as follows. In Section 2, related work on anomaly detection in network traffic is discussed. Section 3 describes the overall framework of the proposed model. In Section 4, the optimization problem and the corresponding training algorithm of the proposed model are introduced. Section 5 shows the superior detection performance of the proposed model compared to existing supervised machine learning models. Section 6 concludes the paper.

2. Related Work

Existing anomaly detection methods for network traffic fall into two categories: misuse-based and anomaly-based methods. Misuse-based detection methods [2, 22] are effective, but they cannot discover unseen attacks and suffer from high false negative rates. Anomaly-based methods are prevalent because they can discover unseen attacks. This paper mainly focuses on machine learning based anomaly detection; thus, we briefly survey anomaly-based detection methods for network traffic. In general, anomaly-based detection methods are classified into four categories [8]: classification-based, statistical theory-based, clustering-based, and information theory-based methods.

Statistical detection methods construct probabilistic models from training data in order to track network behaviors. In [23], the authors used three IP flow features and four flow attributes to generate a network profile called the digital signature of network segment, which contains a threshold for each dimension. Abnormal behaviors are detected according to the number of abnormal dimensions. The proposed method can only detect attacks that impact bits, packets, and flows. Principal component analysis (PCA), known as a dimensionality reduction approach in the data mining field, is also a widely used statistical technique for anomaly detection in network traffic. Pascoal et al. [24] reduced the dimension of the traffic feature space by combining robust feature selection based on a mutual information metric with robust PCA. The proposed model is robust to outliers and obtains a robust feature subspace.

Information theory-based detection methods mainly use information-theoretic measures to describe the characteristics of network traffic features and identify specific distributions of anomalies. Amaral et al. [25] used the Tsallis entropy to detect anomalous traffic flows. The proposed model can be applied to anomaly detection in different types of networks and detects less conspicuous attacks than detection methods based on volume analysis. In [26], long-term network anomalies were tracked, where the Kullback–Leibler divergence was used to measure the difference between global probability density functions for every two consecutive periods of time. The divergence produces a time series to be analyzed, and an adaptive threshold is set to identify abnormal changes in the network. Bhuyan et al. [27] used a mutual information and generalized entropy-based feature selection technique to select a relevant, nonredundant feature subset, which makes the anomaly detection process more accurate and faster.

Clustering-based methods aim to group network data into several classes of similar data. The essential idea of clustering is to achieve high intracluster similarity and low intercluster similarity. Eskin et al. [15] applied a standard clustering algorithm with unlabeled data and a Euclidean distance metric to detect network intrusions. Dromard et al. [28] proposed an unsupervised anomaly detector based on a grid and incremental clustering algorithm called IDGCA and a discrete time sliding window. IDGCA is more efficient than classic clustering algorithms due to its low system complexity and flexibility for real-time detection. Clustering-based methods can also be used to reduce redundancy in raw datasets; Perdisci et al. [29] applied a feature clustering algorithm to reduce the dimension of k-gram features.

Classification-based detection methods use normal traffic profiles to build the classification knowledge base. Traffic data that deviate from the baseline profile are regarded as anomalous. Kim et al. [30] proposed a hybrid intrusion detection method that hierarchically integrates a misuse detection model and a classification-based anomaly detection model in a decomposed structure. They first build a misuse detection model based on the C4.5 decision tree algorithm. Then, the normal training data are decomposed into smaller subsets, multiple one-class SVM models are created for the decomposed subsets, and the profiles of normal behaviors are built precisely. Different domain-specific techniques are commonly used in hybrid models as well. In [31], the authors adopted a self-organizing feature map to profile normal packets, passive TCP/IP fingerprinting to filter unknown packets, and a genetic algorithm to select appropriate packet fields. A dataset consisting of representative training samples is created by combining these techniques and is then used as the input of a new classification model combining a soft-margin SVM and a one-class SVM.

3. SVM-C for Anomaly Detection in Network Traffic

In this section, a model named SVM-C is proposed for anomaly detection in network traffic. The overall framework of SVM-C is introduced, followed by the traffic feature extraction method and the training process of the detection model. Finally, the classification algorithm is described.

3.1. The Framework of SVM-C

The overall framework of the SVM-C model is shown in Figure 1. SVM-C has two main components: feature extraction and traffic classification. The first component, feature extraction, transforms raw URLs into feature vectors via a series of mapping rules and a linear projection. The second component, traffic classification, trains an SVM model to classify the obtained feature vectors. The parameters of the two components are coupled in a single optimization problem. The deployment procedure of SVM-C is summarized below.
(1) Traffic feature extraction (Section 3.2): the raw URLs are transformed into fixed-length feature vectors via a feature extraction method based on statistical laws and a linear mapping.
(2) Anomaly detection model training (Section 3.3): the obtained feature vectors are taken as the input to the SVM classifier. The parameters of the feature extraction method and the SVM model are obtained by solving an optimization problem based on SVM and clustering.
(3) Traffic classification (Section 3.4): as depicted in Figure 1, new URLs are first transformed into feature vectors and then classified as abnormal or normal by the trained SVM classifier.

3.2. Traffic Feature Extraction

Traffic data can be regarded as short text; therefore, natural language processing techniques can be used for the feature extraction of traffic data. Here the feature extraction method in [32] is applied, which is based on statistical laws and the k-gram technique. Before feature extraction, each URL is parsed into segments such as protocol, port, path, and query. In this paper, we mainly focus on the path and query segments of each raw URL.

First, the malicious strings in raw URLs are extracted, and a certain percentage of them is chosen to construct a lexicon of malicious strings. Then, a set of mapping rules is defined based on the obtained lexicon, as illustrated in Figure 2. In the path and query segments, every pair of adjacent characters is mapped to a weight between 0 and 9 via the mapping rules; larger weights indicate a higher probability that the URL is abnormal. For instance, the parsed components of the URL "/data/cache/inc_catalog_base.inc" are transformed into the weight vector "332113252114411332334114441144." The obtained weight vectors are then converted into a k-length feature vector via the k-gram technique, where k is the size of the sliding window. Details of the feature extraction method can be found in [32]. The impact of the parameter k on the performance of the subsequent traffic classifier is analyzed in Section 5.
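The concrete mapping rules of Figure 2 and the lexicon construction are specified in [32] and are not reproduced here. The following Python sketch only illustrates the shape of this pipeline: the toy_weight rule is a hypothetical stand-in for the real pair-to-weight mapping, and the window-averaging step is a simple placeholder for the k-gram summary.

```python
# Illustrative sketch only: the real mapping rules (Figure 2) and the lexicon of
# malicious strings are replaced by a hypothetical toy rule.
import numpy as np
from urllib.parse import urlparse

SUSPICIOUS = set("<>'\";%()=")           # hypothetical "suspicious" characters

def toy_weight(a: str, b: str) -> int:
    """Map a pair of adjacent characters to a weight in 0..9 (placeholder rule)."""
    score = 3 + 3 * sum(ch in SUSPICIOUS for ch in (a, b))
    return min(score, 9)

def url_to_weights(url: str) -> list:
    """Parse the URL and map each adjacent character pair of path+query to a weight."""
    parsed = urlparse(url)
    text = parsed.path + ("?" + parsed.query if parsed.query else "")
    return [toy_weight(text[i], text[i + 1]) for i in range(len(text) - 1)]

def kgram_vector(weights: list, k: int) -> np.ndarray:
    """Collect sliding windows of length k and average them into a k-length vector
    (a simple placeholder for the statistical k-gram summary of [32])."""
    if len(weights) < k:
        weights = weights + [0] * (k - len(weights))
    windows = np.array([weights[i:i + k] for i in range(len(weights) - k + 1)])
    return windows.mean(axis=0)

v = kgram_vector(url_to_weights("/data/cache/inc_catalog_base.inc"), k=3)
print(v.shape)   # (3,)
```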

After the k-length feature vector v is obtained, we apply a matrix-vector operation to produce local features around each character in the feature vector. The final d-dimensional feature vector is x = Wv + c, where W and c are the weight matrix and the bias of the matrix-vector operation, respectively. In this way, a fixed-length feature vector is extracted for each raw URL. The matrix W and vector c are the parameters to be learned; the detailed training process is described in Section 4. The size k of the sliding window in the k-gram technique and the length d of the final feature vector are prefixed. The analysis of the parameters k and d is given in Section 5.
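Given the k-length vector v from the previous step, the projection is a single matrix-vector operation. A minimal sketch follows; the symbols W and c for the learned weight matrix and bias are the notation assumed in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 20                            # example values; both are prefixed in SVM-C
W = 0.1 * rng.standard_normal((d, k))   # weight matrix, learned in Section 4
c = np.zeros(d)                         # bias vector, learned in Section 4

def project(v: np.ndarray) -> np.ndarray:
    """Map a k-length k-gram vector to the final d-dimensional vector x = Wv + c."""
    return W @ v + c

x = project(np.ones(k))                 # e.g., v computed by kgram_vector above
print(x.shape)                          # (20,)
```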

3.3. Anomaly Detection Model Training

After feature extraction, the obtained feature vectors are fed into the traffic classifier. In the proposed model, the SVM model is used as the classifier because of its strong performance in binary classification and its mathematical formulation with good interpretability. The idea of SVM is to find a hyperplane classifier with the maximum functional margin to the nearest training data points of each class. Suppose we have N training data points (x_i, y_i), i = 1, ..., N, where x_i is a d-dimensional feature vector and y_i ∈ {1, −1}. The hyperplane is denoted as w^T x + b = 0, where w is the weight vector and b is the bias. A new data point x is classified as f(x) = sign(Σ_{i=1}^{N} α_i y_i x_i^T x + b), where α_i is the Lagrange multiplier of the dual problem and sign(·) is the signum function. Details of SVM can be found in [33]. In the proposed model, the parameters of the feature extraction method and the traffic classifier are optimized together in one optimization problem, which is described in Section 4.
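For reference, the dual-form decision rule above can be evaluated directly from the support vectors. Below is a generic linear-SVM sketch on toy data, not the trained SVM-C model.

```python
import numpy as np

def svm_predict(x, support_x, support_y, alphas, b):
    """Evaluate sign( sum_i alpha_i * y_i * <x_i, x> + b ) for a linear SVM."""
    score = (alphas * support_y) @ (support_x @ x) + b
    return 1 if score >= 0 else -1

# toy example with two support vectors in R^2
support_x = np.array([[1.0, 1.0], [-1.0, -1.0]])
support_y = np.array([1, -1])
alphas = np.array([0.5, 0.5])
print(svm_predict(np.array([2.0, 0.5]), support_x, support_y, alphas, b=0.0))  # 1
```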

3.4. Traffic Classification

According to the classification rule of SVM, the detection rule of SVM-C is set as y = sign(w^T x + b); y = 1 and y = −1 indicate normality and abnormality, respectively. A new URL is first transformed into a feature vector x following the feature extraction method in Section 3.2, and its label is denoted as y. It is then classified as y = 1 (normal) or y = −1 (abnormal) by the trained SVM classifier. The classification algorithm of SVM-C is shown in Algorithm 1; a code-level sketch follows the algorithm.

Input: a new URL, the feature extraction parameters (W, c), and the SVM parameters (w, b)
Output: y
Step 1. Transform the URL into a feature vector x according to the data transformation method in Section 3.2.
Step 2. Calculate s = w^T x + b.
Step 3. If s ≥ 0, let y = 1 and label the URL as normal. Otherwise, let y = −1 and label the URL as abnormal.
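A compact Python sketch of Algorithm 1 is given below. It reuses the helper functions url_to_weights and kgram_vector from the sketch in Section 3.2 and assumes the parameters (W, c, w, b) have already been trained; the primal decision value w·x + b is used directly.

```python
import numpy as np

def classify_url(url: str, k: int, W: np.ndarray, c: np.ndarray,
                 w: np.ndarray, b: float) -> int:
    """Return 1 for a normal URL and -1 for an abnormal one (Algorithm 1)."""
    v = kgram_vector(url_to_weights(url), k)   # Step 1: feature extraction (Section 3.2)
    x = W @ v + c                              # Step 1 (cont.): linear projection
    s = float(w @ x + b)                       # Step 2: decision value
    return 1 if s >= 0 else -1                 # Step 3: label assignment
```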

4. The Optimization of SVM-C

In order to obtain the parameters of the feature extraction method and traffic classifier in the proposed SVM-C, we construct an optimization model based on the idea of SVM and clustering. In this section, the constructed optimization problem and corresponding training algorithm are described.

4.1. Notation

Lowercase and uppercase boldface letters represent vectors and matrices, respectively. I represents the identity matrix. |S| is the number of elements in the set S. λ_min(A) denotes the minimum eigenvalue of a matrix A. |a| denotes the absolute value of a scalar a. ⟨·, ·⟩ represents the dot product. ⊗ represents the Kronecker product. The eigendecomposition of a real definite matrix A is written as A = QΛQ^T, where Q is an orthogonal matrix and Λ is a diagonal matrix.

4.2. The Optimization Problem

We construct an optimization problem by combining the ideas of SVM and clustering; the parameters of SVM-C are trained by solving this problem. Its objective function consists of two parts. The first part aims to obtain the feature extraction parameters (W, c) that minimize the sum of squared distances from points of the same class to their center point; in SVM-C, we simply let the center point be the mean of the homogeneous data points. The second part is the same as in the standard SVM model and seeks the hyperplane classifier (w, b).

The original labeled dataset is denoted by {(u_i, y_i)}, i = 1, ..., N, where y_i indicates the label of data instance u_i. In this paper, we assign label 1 to normal URLs and −1 to abnormal ones. Through the feature extraction method in Section 3.2, each URL u_i is transformed into a k-length vector v_i. The obtained dataset is denoted as {(x_i, y_i)}, i = 1, ..., N, where x_i = Wv_i + c. Then, this dataset is fed into the SVM model. In the resulting optimization problem, w and b are the parameters of the SVM model, W and c are the feature extraction parameters defined in Section 3.2, and μ_{y_i} is the mean of the points in the same class as x_i.
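The display of the optimization problem is omitted above. A plausible sketch that matches the verbal description — a clustering compactness term added to the standard margin objective, with a trade-off weight λ that is an assumption here rather than the paper's exact formulation — is:

```latex
\begin{aligned}
\min_{\mathbf{w},\,b,\,\mathbf{W},\,\mathbf{c}} \quad
  & \tfrac{1}{2}\|\mathbf{w}\|^{2}
    + \lambda \sum_{i=1}^{N} \bigl\|\mathbf{x}_i - \boldsymbol{\mu}_{y_i}\bigr\|^{2} \\
\text{s.t.} \quad
  & y_i\bigl(\mathbf{w}^{\mathsf T}\mathbf{x}_i + b\bigr) \ge 1, \qquad i = 1,\dots,N, \\
  & \mathbf{x}_i = \mathbf{W}\mathbf{v}_i + \mathbf{c}, \qquad
    \boldsymbol{\mu}_{y_i} = \frac{1}{|\{\,j : y_j = y_i\,\}|}\sum_{j:\,y_j = y_i}\mathbf{x}_j .
\end{aligned}
```

In this sketch, the margin constraints play the role of (3b) in the text; their right-hand side is relaxed during training as described in Section 4.3.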

4.3. The Training Method

Problem (1) is nonconvex and difficult to solve optimally. The block coordinate descent method is applied, and the subproblems for the variables (w, b) and (W, c) are solved alternately in each iteration. We also introduce the idea of the soft-margin SVM [33]: in each iteration, the right-hand side of constraint (3b) is replaced by a power of a prefixed parameter σ. The analysis of the parameter σ is given in Section 5.

4.3.1. The Subproblem of (w, b)

When (W, c) is fixed, the subproblem for (w, b), denoted as (4), is a standard SVM problem. It is solved via the dual method. According to the analysis of the SVM problem [33], the dual of (4) is problem (5), where α = (α_1, ..., α_N)^T collects the Lagrange multipliers of (4). To solve problem (5), we apply the Courant penalty function [34] and eliminate its equality constraint by penalizing it in the objective function. The resulting problem (6) is a box-constrained quadratic program in α.

When the penalty factor approaches infinity, problem (6) is equivalent to problem (5). Thus, we solve (6) iteratively. First, the penalty factor is set to a small value. In each inner iteration, we solve (6) and then update the penalty factor according to the feasible point of problem (6) obtained in that inner iteration.
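For illustration, the standard SVM dual and its Courant-penalized counterpart take roughly the following form, with ρ denoting the penalty factor (the exact matrices and vectors of (5) and (6) in the paper are not reproduced here):

```latex
\text{dual:}\quad
\max_{\boldsymbol{\alpha}\ge 0}\;
  \sum_{i=1}^{N}\alpha_i
  - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}
      \alpha_i\alpha_j\, y_i y_j\, \mathbf{x}_i^{\mathsf T}\mathbf{x}_j
\quad \text{s.t.}\;\; \sum_{i=1}^{N}\alpha_i y_i = 0;
\qquad
\text{penalized:}\quad
\min_{\boldsymbol{\alpha}\ge 0}\;
  \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, y_i y_j\, \mathbf{x}_i^{\mathsf T}\mathbf{x}_j
  - \sum_{i}\alpha_i
  + \frac{\rho}{2}\Bigl(\sum_{i}\alpha_i y_i\Bigr)^{2}.
```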

The projected Barzilai–Borwein (PBB) method [21] is applied to solve problem (6); the method is introduced in Section 4.3.3. In practice, problem (6) is ill-conditioned, so we add a regularization term to avoid numerical difficulties.

Suppose the optimal solution to problem (6) is α*. According to the analysis of the SVM problem [33], w = Σ_{i=1}^{N} α_i* y_i x_i, and b is recovered from the support vectors, e.g., b = y_j − w^T x_j for any index j with α_j* > 0.

4.3.2. The Subproblem of (W, c)

When (w, b) is fixed, we obtain the subproblem for (W, c).

By an equivalent transformation, this subproblem is rewritten in quadratic form. Further, the matrix W is reshaped into the vector vec(W) = [w_1^T, ..., w_k^T]^T, where w_j is the j-th column of W. The resulting problem (10) is a quadratic program in this vectorized variable. Since its quadratic matrix is positive definite, we can introduce its square root, and problem (10) is equivalent to problem (11).

Problem (11) has precisely the same mathematical form as problem (4). The same method is therefore applied to solve (11) and obtain (W, c).
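The reshaping step relies on the standard vectorization identity, which connects the matrix variable W to its vectorized form through the Kronecker product (the specific problem data of (10) and (11) are not reproduced here):

```latex
\operatorname{vec}(\mathbf{W}\mathbf{V})
  = (\mathbf{V}^{\mathsf T}\otimes \mathbf{I}_d)\,\operatorname{vec}(\mathbf{W}),
\qquad
\operatorname{vec}(\mathbf{W}) = [\mathbf{w}_1^{\mathsf T},\dots,\mathbf{w}_k^{\mathsf T}]^{\mathsf T},
```

where w_j is the j-th column of W and V = [v_1, ..., v_N] stacks the k-length feature vectors column by column.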

4.3.3. The Projected Barzilai–Borwein (PBB) Method

The PBB method [21] is an efficient algorithm for the large-scale box-constrained quadratic programming (BQP) problem (12), namely min_x (1/2) x^T A x + q^T x subject to l ≤ x ≤ u, where A is an n × n symmetric matrix and q, l, and u are vectors in R^n. Since subproblem (6) is a BQP problem, we apply the PBB method to solve it.

The basic idea of the PBB method is to project the point generated by gradient descent onto the feasible set of (12). Its algorithmic framework is shown in Algorithm 2, followed by a code-level sketch. Here the projection operation takes, for each component, the median of l_i, x_i, and u_i, and the gradient of the objective is g(x) = Ax + q.

Input: A, q, l, u, an initial point x_0, and a tolerance ε (used for terminating the algorithm)
Output: an (approximately) optimal solution of the BQP problem (12)
Step 1. Let i = 0.
Step 2. Calculate x_{i+1} = P(x_i − λ_i g(x_i)), where λ_i is the alternating BB step size and P(·) is the projection onto the box [l, u].
Step 3. If ‖x_{i+1} − x_i‖ ≤ ε, output x_{i+1} as the solution and terminate the algorithm. Otherwise, go to Step 4.
Step 4. Let i = i + 1 and return to Step 2.
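
The PBB iteration in Algorithm 2 above can be written in a few lines of Python. The sketch below follows common practice for the alternating BB step sizes and uses the change between consecutive iterates as the stopping rule; details may differ from [21].

```python
import numpy as np

def pbb(A, q, l, u, x0, eps=1e-6, max_iter=1000):
    """Projected Barzilai-Borwein sketch for
       min 0.5 * x^T A x + q^T x   subject to   l <= x <= u."""
    project = lambda z: np.minimum(np.maximum(z, l), u)  # componentwise median of (l, z, u)
    x = project(x0)
    g = A @ x + q                                        # gradient of the objective
    step = 1.0
    for i in range(max_iter):
        x_new = project(x - step * g)
        s = x_new - x
        if np.linalg.norm(s) <= eps:                     # iterates stopped moving
            return x_new
        g_new = A @ x_new + q
        y = g_new - g
        if i % 2 == 0:                                   # alternate the two BB step sizes
            step = float(s @ s) / max(float(s @ y), 1e-12)
        else:
            step = float(s @ y) / max(float(y @ y), 1e-12)
        x, g = x_new, g_new
    return x

# tiny example: minimize 0.5*x^T A x + q^T x over the box [0, 1]^2
A = np.array([[2.0, 0.0], [0.0, 4.0]])
q = np.array([-1.0, -8.0])
print(pbb(A, q, np.zeros(2), np.ones(2), np.zeros(2)))   # approx. [0.5, 1.0]
```
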
4.4. The Algorithm Framework of SVM-C

In summary, the complete training algorithm for the proposed model SVM-C is presented in Algorithm 3.

Input: the raw labeled dataset, k, d, the maximum number of iterations T, and initial values of the parameters
Output: the trained parameters of SVM-C
Step 1. Let t = 1.
Step 2. Apply the feature extraction method in Section 3.2 to obtain the dataset and divide it into a training set and a test set.
Step 3. Solve subproblem (2) using the PBB method and obtain its solution. Compute the accuracy on the current test set.
Step 4. Solve subproblem (6) using the PBB method and obtain its solution.
Step 5. If t ≥ T, return the parameters corresponding to the maximum accuracy and terminate the algorithm. Otherwise, let t = t + 1 and return to Step 2.
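The alternating structure of Algorithm 3 above can be summarized as a block coordinate descent loop. The skeleton below keeps only the control flow and treats the two subproblem solvers as black boxes passed in by the caller; in SVM-C they would wrap the penalized duals and the PBB routine sketched above.

```python
import numpy as np

def train_svm_c(V, y, d, T, solve_svm_subproblem, solve_projection_subproblem, seed=0):
    """BCD skeleton for SVM-C: alternate between the SVM parameters (w, b) and the
    projection parameters (W, c), keeping the iterate with the best accuracy."""
    rng = np.random.default_rng(seed)
    k = V.shape[1]                                     # V: N x k matrix of k-gram vectors
    W, c = 0.01 * rng.standard_normal((d, k)), np.zeros(d)
    best, best_acc = None, -1.0
    for t in range(T):
        X = V @ W.T + c                                # current d-dimensional features
        w, b = solve_svm_subproblem(X, y)              # subproblem in (w, b)
        acc = float(np.mean(np.sign(X @ w + b) == y))  # evaluated on held-out data in practice
        if acc > best_acc:
            best, best_acc = (W.copy(), c.copy(), w, b), acc
        W, c = solve_projection_subproblem(V, y, w, b, W, c)  # subproblem in (W, c)
    return best, best_acc
```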

5. Performance Evaluation

In this section, the following aspects are analyzed experimentally.
(1) The numerical performance of the proposed model compared with benchmark models.
(2) How the main parameters influence the performance of the proposed model.

5.1. Experimental Setup

The performance of the proposed model is evaluated on three different datasets.
(1) Dataset 1 [35]: the first dataset is provided by a well-known Chinese Internet company, which specializes in cyberspace security and captures web logs of up to 2 TB every day. 70,000 original URL requests are used for testing.
(2) Dataset 2: the second dataset is collected from our campus network traffic and consists of 8,000 original URL requests.
(3) Dataset 3 [36]: the third dataset is from a project on GitHub. It consists of 50,000 raw URLs.

In the above three datasets, the attack data mainly include SQL injection attacks, cross-site scripting (XSS) attacks, directory traversal attacks, and other types of attacks.

The proposed model is evaluated in two respects. First, the traffic feature extraction method based on statistical laws and linear projection is compared with the hand-crafted feature extraction method designed for dataset 1 [35]. Second, we choose Naive Bayes (NB) [37], linear SVM [33], and multilayer perceptron (MLP) [38] as benchmarks to evaluate the performance of the proposed classification model on the different datasets.

The proposed model is evaluated via the hold-out method, a standard technique for estimating model performance. The entire dataset is partitioned into two subsets: SVM-C is trained on one of them, and its classification performance is evaluated on the other. This process is repeated several times, and the mean value of each index is reported as the final evaluation result.
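The repetition-and-averaging loop of the hold-out evaluation can be sketched as follows; the split ratio and the number of repetitions are illustrative, since the paper does not state them.

```python
import numpy as np

def holdout_eval(X, y, train_and_score, test_frac=0.3, repeats=5, seed=0):
    """Repeated hold-out: average a user-supplied scoring function over random splits."""
    rng = np.random.default_rng(seed)
    n, scores = len(y), []
    for _ in range(repeats):
        idx = rng.permutation(n)
        n_test = int(test_frac * n)
        test, train = idx[:n_test], idx[n_test:]
        scores.append(train_and_score(X[train], y[train], X[test], y[test]))
    return float(np.mean(scores))
```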

To quantify the performance of the proposed model and the compared models, four standard measurements are used: overall accuracy (acc), precision (p), false positive rate (fpr), and F1 score (f1). Here, positive and negative instances refer to abnormal and normal URLs, respectively. In terms of the confusion matrix (Table 1), with TP, FP, TN, and FN denoting true positives, false positives, true negatives, and false negatives, the four evaluation indexes are defined as follows.
Accuracy: acc = (TP + TN) / (TP + TN + FP + FN). It indicates the percentage of correctly classified instances over the total number of instances.
Precision: p = TP / (TP + FP). It indicates the percentage of correctly classified positive instances over the total number of instances classified as positive.
False positive rate: fpr = FP / (FP + TN). It indicates the percentage of negative instances misclassified as positive over the total number of negative instances.
F1 score: f1 = 2·p·r / (p + r). It is the harmonic mean of precision and recall, where the recall rate is defined as r = TP / (TP + FN).
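For concreteness, the four indexes can be computed directly from the confusion-matrix counts; a minimal helper (positive = abnormal, as defined above):

```python
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, false positive rate, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    p   = tp / (tp + fp)
    fpr = fp / (fp + tn)
    r   = tp / (tp + fn)                # recall
    f1  = 2 * p * r / (p + r)
    return {"acc": acc, "p": p, "fpr": fpr, "f1": f1}

print(metrics(tp=90, fp=5, tn=95, fn=10))
# {'acc': 0.925, 'p': 0.947..., 'fpr': 0.05, 'f1': 0.923...}
```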

5.2. Experimental Results

The proposed model is evaluated on the three datasets described in Section 5.1. The main parameters of SVM-C take the same settings across the three datasets, except for one parameter, which is set to 15 for dataset 1 and dataset 3 and to 9 for dataset 2. First, the feature extraction method with linear projection in SVM-C is compared with the hand-crafted feature extraction method designed for dataset 1 [35]. In [35], each raw URL is transformed into a 22-dimensional feature vector. In the numerical tests, the numerical vectors obtained from the different data transformation methods are fed into the benchmark classifiers mentioned in Section 5.1. The average accuracy, F1 score, precision, and false positive rate are shown in Figures 3–6. The feature extraction method of SVM-C outperforms the hand-crafted feature extraction method for all the classification models. Furthermore, it greatly reduces human intervention in the feature extraction process.

Next, the performance of the proposed classification model is evaluated on the different datasets. We apply several classical methods for classification, including Naive Bayes (NB), SVM, and multilayer perceptron (MLP). The raw URLs are converted into fixed-length feature vectors via the feature extraction method of SVM-C, and the obtained numerical vectors are used as input to the benchmark classifiers. The numerical results on the three datasets are shown in Figures 7–9, respectively. SVM-C performs best on dataset 1 and dataset 2, with a distinct improvement over the other classifiers. On dataset 3, the performances of SVM-C and SVM are similar, with SVM slightly better than SVM-C. On all three datasets, the proposed SVM-C achieves more than 93% accuracy, precision, and F1 score and a false positive rate below 5%. The test results show that SVM-C is robust to different datasets and generally outperforms the compared classification methods.

5.3. Parametric Analysis

There are several parameters in SVM-C: the percentage of chosen malicious keywords, the size k of the sliding window in the k-gram technique, the length d of the final feature vectors, and the soft-margin parameter σ. Dataset 1 is used for this analysis.

In the traffic feature extraction method of [32], a certain percentage of typical keywords in abnormal URLs is chosen to construct a lexicon of malicious words. Figure 10 shows the overall accuracy of the proposed model as this percentage varies from 20% to 80%, with k, d, and σ fixed. As the percentage increases, the accuracy first increases, reaches its maximum, and then decreases. A larger percentage helps characterize abnormal URLs, but too large a percentage may impair the detection performance; selecting an appropriate value balances the false positive rate and the false negative rate. According to Figure 10, we use the value at which the accuracy peaks in the numerical tests.

The other important parameter in the traffic feature extraction method is the size k of the sliding window in the k-gram technique. The choice of k determines the effectiveness of local feature extraction. Figure 11 shows the classification performance of the proposed model under different sliding window sizes, with the other parameters fixed. The accuracy reaches its peak at a particular value of k and then begins to decrease; this value of k is used in our experiments.

In the SVM-C model, an original URL is converted into a d-dimensional feature vector via the matrix-vector operation described in Section 3.2. The parameter d determines the dimension of the extracted feature vectors and affects the computational complexity of the method. Figure 12 displays the overall accuracy of the proposed model under different choices of d, with the other parameters fixed. As d varies from 10 to 60, the accuracy reaches its maximum at a particular value of d, which we use in the experiments.

The right-hand side of constraint (3b) indicates the lower bound of the distance between data instances and the hyperplane (w, b). In practice, some data instances do not satisfy (3b), and their distance to the hyperplane is smaller than this bound. To address this problem, we replace the right-hand side with powers of a prefixed parameter σ, updated in each inner iteration. In the numerical tests, the other parameters are fixed. Figure 13 shows the overall accuracy of the proposed model under different values of σ, varied from 0.1 to 0.9 with a step size of 0.2. The accuracy of SVM-C increases with σ and reaches its maximum at σ = 0.9. This result suggests that a larger σ makes SVM-C more flexible and corresponds to higher classification confidence. Thus, we use σ = 0.9 in the numerical tests.

Based on the experimental results, these parameter settings are used for dataset 1. With these settings, the detection accuracy, precision, and F1 score all exceed 96%, while the false positive rate is below 5%.

6. Conclusion

In this paper, a novel model called SVM-C was proposed for anomaly detection in network traffic. First, traffic feature extraction was performed based on statistical laws and a linear projection. Then, an optimization problem was constructed to obtain the parameters of the proposed model. Finally, the network traffic was classified by the SVM classifier. The optimization problem was solved via the BCD method: in the training process, it was divided into two subproblems, and each subproblem was solved via the Courant penalty function technique and the PBB method. The numerical results indicated that the proposed model outperformed the benchmark models in terms of accuracy, F1 score, false positive rate, and precision and was robust to different datasets. Furthermore, four main parameters of SVM-C were explored: the percentage of chosen malicious keywords, the size of the sliding window in traffic feature extraction, the length of the final feature vectors, and the parameter in the constraint of the proposed optimization problem.

Data Availability

Dataset 1 and dataset 2 are not public. Dataset 3 is available at https://github.com/exp-db/AI-Driven-WAF.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC) (nos. 11771056, 12171051, 11871115, 61941114, 61872836, and 61802025).