Abstract
Intrusion detection refers to monitoring network data and quickly detecting intrusion behavior, which can avoid the harm caused by intrusion to a certain extent. Traditional intrusion detection methods mainly rely on rule files and data mining. They cannot detect new types of attacks, and their detection speed is slow. To address these issues, an intrusion detection method based on improved PCA combined with Gaussian Naive Bayes is proposed. Weighting the first few eigenvectors of the traditional PCA solution reduces data pollution. The number of weighted principal components, chosen by sequential selection, is 2. The dimensionality of the data is reduced by the improved PCA, and intrusion behaviors are then detected with the Gaussian Naive Bayes classifier. Detection accuracy, detection time, precision, and recall are used to evaluate the results. The experimental results show that, compared with the traditional Bayes method, the proposed method reduces the detection time by 60%, to about 0.5 s, and increases the detection rate to 91.06%. The mean detection accuracy under cross-validation is about 86%.
1. Introduction
While the Internet brings convenience to people, it also brings many security problems. Network attacks happen all the time. Research on intrusion detection has important practical significance and remains a major challenge in the field of network security.
Dorothy Denning [1] defined intrusion detection in 1987. In her model, intrusion behavior is detected by monitoring network data, and the system gives alerts and responses before the invasion. An important feature of intrusion detection is therefore timeliness: the detection method needs to judge the attack information quickly and raise an alarm before the hazard occurs. There are two main types of traditional intrusion detection methods. The first is rule-based intrusion detection. It analyzes the characteristics of specific attack types, records those characteristics in rule files, and detects attacks by matching against the rule files. This method is mainly applied in commercial and open source IDSs. For example, Snort IDS [2, 3] uses this method because rule-based intrusion detection is fast. However, a major problem is that it cannot detect new attack types; it can only detect the types of attacks that have already been discovered. Hacker attacks change constantly, new types of attacks occur often, and new types of attacks often cause greater harm. Moreover, the method has a higher false alarm rate. The second type relies on data mining, which has been commonly applied to intrusion detection with the rise of machine learning in recent years. Methods based on data mining, such as SVM [4, 5] and neural networks [6], build a model by training on a labeled data set and are effective at detecting unknown attack types. However, applying data mining to intrusion detection requires a large collection of data in advance, which limits online intrusion detection [7].
At present, conventional intrusion detection methods based on data mining [8–10] and common file analysis [11] have sprung up. An et al. [12] combined the minimum within-class scatter from Fisher discriminant analysis with the traditional support vector machine (SVM) and proposed a minimum within-class scatter support vector machine (WCS-SVM), which performs better than the traditional SVM. Kabir et al. [13] proposed an intrusion detection system based on the least squares support vector machine (LS-SVM). Mrudula Gudadhe et al. [14] proposed a method to enhance the decision tree applied in intrusion detection, in which a classifier is formed by combining multiple decision trees. Sufyan et al. [15] applied backpropagation artificial neural network models to intrusion detection, which lets an IDS adapt more efficiently to new environments and respond to new types of attacks. Because network data sets are large, manual labeling would consume a lot of time and effort; thus, clustering methods have been introduced into data set classification [16]. The Y-means clustering algorithm [17] overcomes two disadvantages of k-means, namely, dependency on the number of clusters and degeneracy, and automatically divides the data set into a proper number of clusters. Performing intrusion detection with clustering analysis is feasible and effective. The k-means algorithm [18] is the simplest partitioning algorithm that solves the well-known clustering problem. A clustering algorithm that uses SOM and k-means together [19] can overcome the shortcomings of the traditional SOM, such as not providing accurate clustering results, and can avoid the disadvantages of the traditional k-means, which relies on the initial values and has difficulty finding the cluster centers. The parallel clustering integration algorithm [20] proposed for IDSs can achieve high speed, a high detection rate, and a low false alarm rate.
The ANN classifier [21] also has a good performance in intrusion detection. By using a mixed learning method, the studies in [22–24] have higher detection rates and lower false alarm rates; among them, the combination of clustering and classification can achieve good results. Shah et al. [25] compared the detection performance of the machine learning method directly in the Snort Intrusion detection system.
In addition to the data mining-based intrusion detection methods mentioned above, flow-based intrusion detection [26] is an innovative method of detecting high-speed network intrusions. Flow-based intrusion detection only checks the header and does not analyze the payload of the packet. The filtering method [27] applies predefined standard RIA so as to select functions to eliminate extraneous related features from the data set. Vieira et al. [28] proposed a network attack detection and recognition method based on model selection and feature similarity and applied signal processing techniques to intrusion detection.
Traditional file analysis methods may be effective for conventional types of attacks but not for new attack techniques [29]. Although data mining methods adapt well to new attack types, they often consume more time. Principal component analysis (PCA) is a commonly used dimensionality reduction technique. It uses an orthogonal transformation to convert a set of correlated variables into a set of linearly uncorrelated variables, where the first principal component has the largest variance. PCA has already been used for attack detection [30]. In addition, the Bayesian method [31] is, to a certain extent, faster than other data mining classifiers because it is a classifier based on conditional probability. Based on these observations, this paper proposes a novel intrusion detection method that combines improved principal component analysis with Gaussian naive Bayes. The proposed method decreases the training time of the Gaussian naive Bayes classifier by training on the data set simplified by the improved PCA algorithm, and it also improves the detection accuracy. Before the Bayesian algorithm is applied, the improved PCA is used to reduce the dimension, and the first few eigenvectors of the PCA solution are multiplied by a weight coefficient. The Bayes classifier then computes, for each network record, the probability that it belongs to the normal or the abnormal class. With traditional PCA, the detection time is greatly reduced, but the detection rate decreases slightly. Improving the traditional PCA with the weight coefficient significantly improves the detection effect as well.
The other sections of this paper are organized as follows. Section 2 introduces the characteristic attributes of network data. In Section 3, the improved principal component analysis combined with Gaussian Naive Bayesian intrusion detection model is described. In Section 4, the KDD99 data set is analyzed, and the experimental results are listed, and cross-validation is used to verify the results. Section 5 summarizes the effect of the model and illustrates the direction of the method improvement and future work.
2. Data Model
2.1. Characteristic Attribute Description of Network Data
2.1.1. Basic Features of TCP Connections
The basic connection features comprise 9 basic attributes of a connection, which are shown in Table 1.
2.1.2. Content Features of TCP Connections
Unlike DoS attacks, attacks such as U2R and R2L do not produce frequent sequential patterns in the data records. They are generally embedded in the data payload of a packet, and a single such packet looks no different from a normal connection. To detect such attacks, content features that may reflect the intrusion behavior are extracted from the data content. There are 13 kinds of content features, as shown in Table 2.
2.1.3. Statistical Characteristics of Network Traffic
Since network attack events are strongly correlated in time, the current connection record is related to the connection records in the preceding period of time, and statistics over that period better reflect the relationship between the connections. The time interval is two seconds. There are 9 kinds of network traffic features, as shown in Table 3.
2.2. Feature Definition of Network Data
The basic features of the TCP connection are expressed as , and there are 9 kinds of connection features, so . The content features of TCP connections are represented as , and there are 13 kinds of content features, so . The statistical characteristics of network traffic are denoted as , and there are 9 kinds of traffic characteristics, so . The training network data set is defined as D, and the test network data set is defined as T. A network connection record for the data set is D and T.
Definition 1. A record in the training set and a record in the test set are as follows:
Definition 2. As both the training data and the test data are applied to the principal component analysis, the data matrix is defined as X, then X=D or T, and one of the connection records is as follows:
3. Improved Principal Component Analysis and Gauss Naive Bayes
3.1. Traditional Principal Component Analysis (PCA)
Principal component analysis has the advantage of reducing data complexity and identifying the most important features. On the other hand, it has the disadvantage that it may lose useful information.
Principal component analysis can be explained from the perspective of maximum separability. The projection of a connection record on the hyperplane in the new space is . If the projections of the sample points are to be as separable as possible, the variance of the sample points after projection should be maximized. The variance of the sample after projection is as follows:
The optimization can be simplified by using the Lagrange multiplier method. The detailed replacement is below.
The covariance matrix is decomposed into eigenvalues and eigenvectors, and the eigenvalues are sorted from largest to smallest: . The eigenvectors corresponding to the first d′ eigenvalues are then used to construct , which is the solution of the principal component analysis.
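In standard textbook notation, which is an assumption here rather than the paper's own symbols (centered records x_i as the columns of X, a projection matrix W with orthonormal columns w_j, and d′ retained components), the maximum-variance derivation described in this subsection can be sketched as:

```latex
% Assumed notation: centered records x_i are the columns of X;
% W = (w_1, ..., w_{d'}) has orthonormal columns.
\begin{align*}
  z_i &= W^{\top} x_i
      && \text{projection of record } x_i \\
  \max_{W}\ & \operatorname{tr}\!\left( W^{\top} X X^{\top} W \right)
      \quad \text{s.t. } W^{\top} W = I
      && \text{variance } \propto \textstyle\sum_i z_i^{\top} z_i \\
  X X^{\top} w_j &= \lambda_j w_j,
      \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d
      && \text{Lagrange multiplier condition} \\
  W &= (w_1, w_2, \dots, w_{d'})
      && \text{eigenvectors of the } d' \text{ largest eigenvalues}
\end{align*}
```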
3.2. Improved Principal Component Analysis (IPCA)
In the field of image processing, the first three eigenvectors of the classical PCA method reflect the overall information of the image [32]. When the lighting conditions have a significant effect, the first three principal components of the PCA method may be seriously polluted, and reducing their weight can improve the accuracy. Inspired by this, we assume that, just as an image may be significantly affected by light, network data may also be affected by some factors. The best number of weighted principal components may not always be 3 in different environments, so this value should be determined by trials. In this paper, we denote this value by n, and the improved PCA algorithm weights the first n eigenvectors as shown in (6).
In (6), k is the weight coefficient, which is a number between 0 and 1. The purpose of k is to reduce the weight of the first n principal components and decrease the influence of those components. The IPCA algorithm is then used to reduce the dimension of the data. The pseudo-code for improved principal component analysis is shown in Algorithm 1.
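Assuming the traditional PCA solution is written as a matrix whose columns w_1, ..., w_{d′} are the eigenvectors sorted by decreasing eigenvalue (this notation is an assumption, not taken from the source), the weighting of the first n eigenvectors by the coefficient k would take a form such as:

```latex
% Assumed notation: w_j are the eigenvectors of the PCA solution,
% sorted by decreasing eigenvalue; k is the weight coefficient.
\[
  W^{*} = \left( k\,w_1,\ \dots,\ k\,w_n,\ w_{n+1},\ \dots,\ w_{d'} \right),
  \qquad 0 < k < 1, \quad n \le d'
\]
```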
Algorithm 1: Improved principal component analysis (IPCA).
In Algorithm 1, Line 1 removes the mean of the data matrix, and Line 2 computes the covariance matrix of the data matrix. Lines 3-4 find the eigenvalues and eigenvectors of the covariance matrix and sort the eigenvalues from largest to smallest. Line 5 selects the eigenvectors corresponding to the first d′ eigenvalues as the solution of the traditional PCA. Line 6 weights the traditional PCA solution to give a new solution. Line 7 multiplies the new solution with the data matrix to reduce the data to d′ dimensions.
By sequential selection, weighting the first two principal components in IPCA gives the best result. This finding is limited to this experiment, and the value may change with time. It is analyzed in detail in Section 4.2.2.
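The seven steps described for Algorithm 1 can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation: the function names `ipca_fit` and `ipca_transform`, the parameter names, the synthetic data, and the choice to fit on one matrix and reuse the projection on another are all assumptions.

```python
import numpy as np


def ipca_fit(X, n_components, n_weighted=2, k=1e-4):
    """Fit the weighted-PCA sketch on X with shape (records, features).

    Returns the feature means and the weighted projection matrix.
    """
    mean = X.mean(axis=0)
    centered = X - mean                       # Line 1: remove the mean
    cov = np.cov(centered, rowvar=False)      # Line 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # Line 3: eigen-decomposition
    order = np.argsort(eigvals)[::-1]         # Line 4: largest eigenvalue first
    W = eigvecs[:, order[:n_components]]      # Line 5: traditional PCA solution
    W_star = W.copy()
    W_star[:, :n_weighted] *= k               # Line 6: down-weight the first n
    return mean, W_star


def ipca_transform(X, mean, W_star):
    """Line 7: project records onto the weighted components."""
    return (X - mean) @ W_star


# Synthetic stand-in data (hypothetical, not KDD99): 200 records, 6 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 1.0, 1.0, 0.5, 0.5])
mean, W_star = ipca_fit(X_demo, n_components=4, n_weighted=2, k=1e-4)
Z = ipca_transform(X_demo, mean, W_star)
print(Z.shape)
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; it returns eigenvalues in ascending order, hence the explicit re-sorting.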
3.3. Gaussian Naive Bayesian Classifier (GNB)
With all relevant probabilities known, Bayesian decision theory considers how to choose the best class label based on these probabilities and the misclassification losses. For intrusion detection tasks, we need to determine whether the network traffic is normal or abnormal. Assume that there are two possible class labels in , where stands for the normal category label and represents the anomaly category label. For each connection record x, the category label c that maximizes the posterior probability P(c | x) is selected. Based on Bayes’ theorem, P(c | x) can be written as follows:
In (7), P(c) is the prior probability of class c, P(x | c) is the conditional probability of a connection record x relative to the class label c, and P(x) is the evidence factor used for normalization. For a given connection record x, the evidence factor is independent of the class label, so P(c | x) depends only on P(c) and P(x | c).
The naive Bayes classifier uses the “attribute conditional independence assumption.” For known classes, it is assumed that all attributes are independent of each other. In other words, each attribute can affect the classification result independently,
where d is the number of attributes of each connection record (if IPCA is not used, d is equal to 31) and x_i is the value of the connection record x on the i-th attribute. Since P(x) is the same for all categories, the naive Bayes classifier has the following expression:
For continuous attributes, the probability density function is used. Assume that p(x_i | c) ~ N(μ_{c,i}, σ_{c,i}^2), where μ_{c,i} and σ_{c,i}^2 are the mean and variance of the values of the class-c samples on the i-th attribute. Then p(x_i | c) is as follows:
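For reference, the standard textbook forms of the four relations this subsection walks through (Bayes' theorem, the attribute conditional independence assumption, the resulting decision rule, and the Gaussian density) are sketched below. The symbol names, including h for the decision function and μ_{c,i}, σ_{c,i} for the per-class statistics, are assumptions rather than the paper's confirmed notation.

```latex
\begin{align*}
  P(c \mid x) &= \frac{P(c)\,P(x \mid c)}{P(x)} \\
  P(c \mid x) &= \frac{P(c)}{P(x)} \prod_{i=1}^{d} P(x_i \mid c) \\
  h(x) &= \arg\max_{c}\ P(c) \prod_{i=1}^{d} P(x_i \mid c) \\
  p(x_i \mid c) &= \frac{1}{\sqrt{2\pi}\,\sigma_{c,i}}
      \exp\!\left( -\frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}} \right)
\end{align*}
```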
For each connection record, the posterior probabilities of the normal and abnormal categories are calculated first, and the category with the larger posterior probability is selected as the label of the record.
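A Gaussian naive Bayes decision of this kind can be sketched with NumPy as follows. The class name `SimpleGaussianNB`, the small variance floor `eps`, the use of log-probabilities for numerical stability, and the synthetic two-cluster data are assumptions made for illustration; they are not details given in the paper.

```python
import numpy as np


class SimpleGaussianNB:
    """Minimal Gaussian naive Bayes sketch (hypothetical helper class)."""

    def fit(self, X, y, eps=1e-9):
        self.classes_ = np.unique(y)
        # Prior P(c): fraction of training records in each class.
        self.log_prior_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        # Per-class, per-attribute mean and variance.
        self.mu_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # eps is an assumed floor that avoids division by zero.
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + eps
        return self

    def predict(self, X):
        # diff has shape (records, classes, attributes).
        diff = X[:, None, :] - self.mu_[None, :, :]
        # Sum of per-attribute Gaussian log-densities for each class.
        log_lik = -0.5 * (np.log(2 * np.pi * self.var_)[None] + diff ** 2 / self.var_[None]).sum(axis=2)
        # Pick the class with the largest log P(c) + log P(x | c).
        return self.classes_[np.argmax(self.log_prior_ + log_lik, axis=1)]


# Synthetic stand-in data: class 0 around (0, 0), class 1 around (5, 5).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
model = SimpleGaussianNB().fit(X, y)
accuracy = (model.predict(X) == y).mean()
print(accuracy)
```

Dropping P(x) and comparing log-scores gives the same decision as comparing the posteriors, because P(x) is identical for both classes.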
3.4. The Detection Process of the Model
For the detection process of the intrusion detection model based on improved PCA and Bayes, we first normalize the training data set D and the test data set T. The data are normalized mainly to facilitate the selection of weight coefficients; normalization has no influence on the detection rate. The normalized value is calculated by the following equation, which maps each value into the range from 0 to 1:
After the data sets are normalized, the dimensions of the training data set D and the test data set T are reduced by the improved PCA, and a new training set D′ and a new test set T′ are obtained.
Assuming that the variables obey a Gaussian distribution, the posterior probability is calculated, and the category label that maximizes the posterior probability is selected as the detection result of the record. The detailed detection process of the model is shown in Figure 1, and the detailed steps are described as follows.
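A mapping into the range 0 to 1 is commonly realized as min-max scaling, x′ = (x − min) / (max − min), computed per attribute. The sketch below assumes that form. The function names, the handling of constant columns, and the reuse of training-set minima and maxima on other data are illustrative assumptions, not details stated in the paper.

```python
import numpy as np


def minmax_fit(X):
    """Return per-attribute minimum and range for X of shape (records, features)."""
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    # Assumed convention: a constant column gets range 1 so it maps to 0.
    span = np.where(hi > lo, hi - lo, 1.0)
    return lo, span


def minmax_apply(X, lo, span):
    """Scale each attribute with the fitted minimum and range."""
    return (X - lo) / span


X_train = np.array([[1.0, 10.0, 5.0],
                    [3.0, 20.0, 5.0],
                    [2.0, 40.0, 5.0]])
lo, span = minmax_fit(X_train)
X_norm = minmax_apply(X_train, lo, span)
print(X_norm)
```

If the fitted statistics are applied to a different data set, values outside the fitted range fall outside 0 to 1; whether the two data sets are scaled jointly or separately is not specified in the text.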

IPCA process. The first step removes the mean of the data, the second step calculates the covariance matrix of the data matrix, the third step calculates the eigenvalues and eigenvectors of the covariance matrix, the fourth step sorts the eigenvalues from largest to smallest, and the fifth step weights the first few eigenvectors. In the last step, the weighted eigenvectors are multiplied by the data matrix to obtain a reduced-dimensional training data set D′ and test data set T′.
GNB process. A Gaussian naive Bayes classifier is applied to the dimensionality-reduced test data set T′ to classify each record. First, the conditional probability of each attribute is calculated according to (10), and the prior probabilities of the normal and anomaly classes are calculated separately. Finally, the posterior probabilities of the record being normal and anomalous are computed, and the category with the larger posterior probability is selected as the detection result of the record.
The model needs to adjust the weight coefficient of the improved PCA repeatedly so as to find the optimal weight coefficient. The sum of the training and testing time of the model is regarded as the detection time. The detection rate is the number of correctly classified records divided by the total number of records in the test set.
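The two quantities defined here, detection time (training plus testing) and detection rate (correct records over all test records), can be measured with a small harness like the one below. It is a sketch only: `timed_detection`, `train_majority`, and `predict_majority` are hypothetical names, and the majority-label model is a trivial stand-in rather than the IPCA + GNB model.

```python
import time


def detection_rate(y_true, y_pred):
    """Fraction of test records whose predicted label is correct."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def timed_detection(train_fn, predict_fn, X_train, y_train, X_test, y_test):
    """Return (training + testing time in seconds, detection rate)."""
    start = time.perf_counter()
    model = train_fn(X_train, y_train)
    y_pred = predict_fn(model, X_test)
    elapsed = time.perf_counter() - start
    return elapsed, detection_rate(y_test, y_pred)


# Hypothetical stand-in model: always predicts the most frequent training label.
def train_majority(X, y):
    return max(set(y), key=y.count)


def predict_majority(model, X):
    return [model for _ in X]


elapsed, rate = timed_detection(
    train_majority, predict_majority,
    [[0], [1], [2], [3]], ["abnormal", "abnormal", "abnormal", "normal"],
    [[4], [5]], ["abnormal", "normal"],
)
print(rate)
```

Passing the training and prediction steps as functions keeps the timing code independent of the classifier, so the same harness could wrap each model being compared.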
4. Experimental Results and Analysis
4.1. Experimental Data
Intrusion detection requires a large amount of effective experimental data. The experiments in this paper are conducted on the KDDCup99 (KDD99) data set. KDD99 is a reference data set in the domain of network intrusion detection and lays the foundation for research on network intrusion detection based on computational intelligence. Besides KDD99, DARPA98 and NSL-KDD are two other commonly used verification data sets. The KDD99 data set was obtained by data mining and preprocessing of the DARPA98 data set, and the NSL-KDD data set is a refined version of KDD99 with redundant data removed. This paper selects the classic KDD99 data set, which consists of network connection data collected over nine weeks from a simulated US Air Force LAN.
The data contain both labeled and unlabeled records; we use the labeled data. The test data and the training data have different probability distributions, and the test data contain some types of attacks that are not present in the training data. The training data set includes the normal type and 22 attack types. In addition, 14 attack types appear only in the test data set. All these attack types can be classified into four categories: denial of service attacks, unauthorized access from remote hosts, unauthorized local superuser privileged access, and port scanning. Records of the four attack categories are uniformly marked as abnormal, and the rest are marked as normal. The KDD99 data set consists of a total of 5 million records; 10% of them, chosen randomly, form the training data set of 494,021 records, and the test set includes 311,029 records. Each record has 41 attributes. Analysis of the 41 fixed feature attributes shows that the first 31, comprising 9 discrete and 22 continuous attributes, can reflect the state changes, so the new data set has 31 feature values.
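Collapsing the attack types into a single abnormal class can be sketched as a simple label mapping. The helper name `binarize_label` is hypothetical, and the sketch assumes label strings in the style of the public KDD99 files, where each label ends with a period (for example `normal.` or `smurf.`).

```python
def binarize_label(raw_label):
    """Map a KDD99-style label string to 'normal' or 'abnormal'."""
    # Assumed format: labels such as 'normal.' or 'smurf.' with a trailing period.
    name = raw_label.strip().rstrip(".").lower()
    return "normal" if name == "normal" else "abnormal"


labels = ["normal.", "smurf.", "neptune.", "normal.", "guess_passwd."]
binary = [binarize_label(s) for s in labels]
print(binary)
```

Treating every label other than normal as abnormal also covers the attack types that appear only in the test data.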
4.2. Experimental Results
4.2.1. The Effect of Classical Classifiers in Intrusion Detection
The experiment is conducted on a PC equipped with an Intel G2020 CPU, 8 GB of RAM, and the Windows 7 operating system. Algorithms commonly used in the domain of intrusion detection include K-Nearest Neighbor (KNN) [33], support vector machine (SVM) [34], and Gradient Boosting Decision Tree (GBDT) [35]. Their performance on the KDD99 data set is shown in Table 4.
A comparison of these classic classifiers shows that, although GNB has the lowest detection accuracy, it can train and test the model within 1.42 s. The other classifiers have higher detection rates, but their longer detection times cannot meet the requirements of intrusion detection: SVM can take up to 10 hours, and GBDT takes more than 2 minutes. Considering the time consumption, the GNB classifier was selected as the intrusion detection classifier in this paper, although it needs some improvement or optimization. After the input data for the GNB classifier are preprocessed by the improved PCA, the detection rate of the model improves considerably and comes close to that of the other classifiers, while the detection time is greatly shortened.
4.2.2. GNB Combined with IPCA
When the data dimension is reduced, some of the original data information will be lost, so the detection effect may be decreased. In order to show the time index in intrusion detection more clearly, the detection time of the model combining PCA and Gaussian Naive Bayes was recorded according to different number of principal components. The relationship between the principal component number and time is plotted as a line graph shown in Figure 2.

From Figure 2, it can be seen that as the number of principal components increases, the training and testing time of the model becomes longer. This is expected: the more features the model takes as input, the greater the amount of data, and the longer the model spends on training and testing. It can also be seen that the detection time of the Gaussian naive Bayes method without PCA was 1.42 s, and the detection time dropped below 1 s after the combination. The shortest time required is only 0.138 s.
To select the appropriate number of weighted principal components, we compared the effects of weighting 1 to 18 principal components on the experimental results. The curves for 6-18 principal components show a decreasing trend similar to that for 2-6, so they are omitted from Figure 3. The relationship between detection rate and weight coefficient for 1 to 6 principal components is shown in Figure 3.

In Figure 3, the abscissa represents the weight coefficient, from 10^-1 to 10^-6. When the coefficient is 10^-4, the detection rate is significantly improved. After that, it changes little and tends to be steady. Thus, we set 10^-4 as the optimal weight coefficient. To compare the influence of the number of weighted principal components on the detection rate more clearly, we fixed the optimal weight coefficient and recorded the best detection rate for each number of principal components. The results are shown in Figure 4.

According to Figure 4, when the number of principal components is 1, the improvement in detection rate is not obvious, and when it is 6, the detection rate drops. The main reason is that when the number of weighted principal components is too small, it is difficult to eliminate data pollution completely, and when it is too large, valid information in the data is lost. The optimal number of principal components is therefore 2. Figure 3 also shows that the detection rate is clearly improved when the weight coefficient is 10^-4 and the number of principal components is 2. For this number of principal components, the detailed detection accuracy with different weight coefficients is shown in Table 5.
It can be seen that weighting the first two eigenvectors brings a significant improvement; when the coefficient is 10^-4, the accuracy rate improves significantly and reaches 91.06%. Therefore, the detection rate of the model combining GNB and IPCA is close to that of other classifiers, and its detection time is much shorter than theirs.
4.2.3. Evaluation of Models
In this section, the results on three models GNB, PCA + GNB, and IPCA + GNB are compared from the perspective of time consumption and accuracy rate. And the detailed information is shown in Table 6.
From Table 6, we can see that IPCA combined with the Gaussian naive Bayes model performs well. Compared with the result of GNB, the time is shortened by 0.858 s, and the accuracy rate is increased by 9%. Training data set D contains about 500,000 records, and test data set T contains more than 300,000 records. On this amount of data, it takes only about 0.562 seconds to train the model and test the data, and the accuracy rate reaches 91.06%. In addition, other indicators such as precision, recall, and F1-score were also used to evaluate the model. The values of these three indicators for the classical data mining methods mentioned in the previous section are shown in Table 7.
The model presented in this paper improves greatly on the traditional GNB and is ultimately close to, or even better than, KNN and SVM on all evaluation indicators. It can also be seen that the values of the three evaluation indicators decrease after PCA is introduced and increase significantly after PCA is improved. The intrusion detection method proposed in this paper has higher detection accuracy than the previous two methods (GNB and PCA + GNB): introducing PCA slightly decreases the detection rate, and IPCA clearly improves it. The times reported so far are the execution times of a single experiment. Because of variations in the performance and stability of the computer, the result of each experiment differs slightly, but the difference is not large. To compare the time consumption of the three methods more clearly, each method was run ten times and the times were recorded. They are shown in Figure 5.
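For readers who want the definitions behind these indicators, the sketch below computes accuracy, precision, recall, and F1-score for a binary normal/abnormal labeling. Treating `abnormal` as the positive class is an assumption, since the text does not state which class is positive, and the function name and toy labels are illustrative only.

```python
def binary_metrics(y_true, y_pred, positive="abnormal"):
    """Return accuracy, precision, recall and F1 for a two-class labeling."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (hypothetical): a = abnormal, n = normal.
a, n = "abnormal", "normal"
y_true = [a, a, a, n, n, n, a, n]
y_pred = [a, a, n, n, a, n, a, n]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```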

Figure 5 clearly shows the time contrast among the three methods. The average detection time of the GNB method is 1.259 s, that of PCA + GNB is 0.558 s, and that of IPCA + GNB is 0.494 s. This shows that the intrusion detection method proposed in this paper has the highest detection accuracy and the shortest detection time among the three methods.
4.3. Cross-Validation
The accuracy obtained from the above experiments is based on just one experiment for each method. The training and test data come from the split already provided in the KDD99 data set, so the result may not have universal significance. To obtain a more convincing accuracy rate for intrusion detection, cross-validation is applied here. Instead of using the test data set prepared by KDD99, training data set D is divided into L subsets of the same size; each time, one of them is selected as the validation set, and the remaining L-1 subsets are used as the training data set. In this paper, we set L=5. The cross-validation results are shown in Table 8.
The results of cross-validation indicate that the optimal parameter is 10^-6, which differs from the optimal parameter of the above experiment. However, the results for these parameters do not differ much, and different data sets may lead to different optimal parameters. Therefore, the optimal weight coefficient is still taken to be 10^-4. The average detection rate is 86.34%. Although the detection rate on the KDD99 test set is higher, the cross-validated detection accuracy still reaches about 86%, which supports the effectiveness of the method.
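The L-fold split described here can be sketched as an index generator. The function name `l_fold_indices`, the shuffling with a fixed seed, and the round-robin assignment of records to folds are assumptions; the text does not say whether the records are shuffled before splitting.

```python
import random


def l_fold_indices(n_records, n_folds=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the L folds."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)          # assumed: shuffle before splitting
    # Round-robin assignment keeps fold sizes within one record of each other.
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


splits = list(l_fold_indices(10, n_folds=5))
for train, validation in splits:
    print(len(train), len(validation))
```

Averaging the detection rate over the L validation folds gives a mean accuracy of the kind reported for cross-validation.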
5. Conclusion and Future Work
This paper proposed an intrusion detection method based on improved PCA and Bayes. A comparison of different classifiers shows that the Bayes classifier is more suitable for intrusion detection because of its fast classification speed. Introducing principal component analysis greatly reduces the detection time, and a weight coefficient was then defined to improve the PCA so as to simplify the input data. A comparison of detection rate and detection time with the classical Bayesian intrusion detection method shows that the method presented in this paper performs best among the three methods compared. The method has high accuracy and also meets the strict timeliness requirement of intrusion detection.
Some work remains to be improved in our future research. For instance, this paper only focuses on the overall detection rate for normal and abnormal traffic; it does not examine the detection effect on different types of attack, and the proposed model may not work well for a particular attack. There is also no principled method for selecting the weight coefficient of the improved PCA method. Future work will mainly focus on the selection of the coefficient and explore the relationship between the weight coefficient and the characteristics of the data itself.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work is supported by the National Key R&D Program of China under Grant No. 2016YFB0800700, the National Natural Science Foundation of China under Grant Nos. 61802332, 61772449, 61772451, 61572420, 61807028, and 61472341, the Natural Science Foundation of Hebei Province, China, under Grant No. F2016203330, and the Doctoral Foundation Program of Yanshan University under Grant No. BL18012. The authors are grateful for the valuable comments and suggestions of the reviewers.