Abstract
To address the large number of unknown protocols on the network, which affect network management and network security to varying degrees, an unknown binary protocol identification method is proposed. Given the cluster centers of unknown protocols obtained by a clustering algorithm, the unknown protocols in network traffic are identified by combining one-class classification with one-dimensional CNN classification. First, a one-class classification algorithm screens the unknown protocol data; then, the labeled protocol data obtained by clustering are used to train the one-dimensional CNN model, and the screened binary protocol packets are used directly as the input of the one-dimensional convolutional neural network. After classification by the CNN model, the unknown protocols are identified. The experimental results show that the proposed classification and recognition method outperforms the traditional CNN and SVM algorithms, and that maximum frequency pooling outperforms the traditional pooling methods.
1. Introduction
With the gradual deepening of “Internet+,” computer networks play an increasingly important role in people’s daily life and work, and the security of network protocols matters correspondingly more. At present, for reasons of security, commercial interest, and other factors, the specifications of some network protocols are not disclosed, which makes network management very difficult. Because of their simplicity and efficiency, binary protocols are increasingly widely used. Therefore, research on the classification [1] and identification of unknown binary protocols is of great significance for maintaining network security and network management. Binary protocols [2] differ from text protocols: they are data structure-oriented, and even after parsing there are no readable characters in the protocol structure fields. Therefore, even when samples of some protocols have been obtained, it is difficult to quickly identify the unknown protocols in the network, so dedicated research on the identification of binary protocols is needed. Traditional protocol classification methods based on packet sequences include methods based on statistical laws, methods that look for high-frequency features, and methods that apply various classifiers. There are also port-based, feature field-based, and traffic feature-based classification methods. Emerging classification methods, based on hidden Markov models and regular expressions, improve the accuracy of protocol classification and recognition by finding frequent features more accurately. However, each analysis method has its own application environment and defects. For the classification of unknown protocols, the main approach is to find the frequent sequences in bitstreams by various data mining methods and then segment the long bitstreams into protocol data frames.
Then, the data are classified, and the final classes are formed. There are many classical classification algorithms, such as the Bayesian algorithm, classification algorithms based on association rules, rough set theory, artificial neural networks, and so on [3]. He et al. [4] proposed a clustering algorithm for partitions of different granularity under the framework of rough set theory. First, the entropy of all data points is calculated according to the similarity relation; then, the data point with the minimum entropy is selected as the clustering center; finally, a threshold β is set, and all points whose similarity to the clustering center is greater than the threshold are grouped together. This method shortens the calculation time. Zhang [5] proposed a method of bitstream protocol analysis and feature recognition based on zero prior knowledge. By extracting feature sequences and their location information, this method avoids the problems caused by repeated feature strings and improves the accuracy of classification. Zhang et al. [6] designed a convolutional neural network model based on deep learning to solve the problem of unknown protocol recognition. The model has three convolution layers, and the data stream is used directly as the input of the model. The final recognition rate reached 93.33%. The advantages and disadvantages of different classification algorithms are shown in Table 1.
Models built by traditional methods cannot fully capture the complex characteristics of protocol data, which results in loss of information and insufficient classification accuracy. Neural network classifiers offer an effective way to solve this problem. A neural network has good nonlinear mapping ability and does not require much prior knowledge of the application field; it can be trained to useful effect even when the data are noisy, and it does not need an explicit model of the structure, parameters, or dynamic characteristics of the object. Given only the input and output data of the object, it can learn from historical data to achieve a good classification effect. Therefore, a neural network is used here to classify the protocols. However, current research on binary protocol classification still has problems: for example, the clustering results cannot cover all protocols, and the classification accuracy is insufficient.
Protocol classification is an important part of protocol reverse engineering [7], but there are many kinds of protocols, and a classifier cannot contain the labels of all of them. Therefore, before an unknown protocol is input into the classifier, it must be screened to determine whether it falls within the classification range of the classifier. One-class classification algorithms [8] can effectively solve this problem. At present, the main one-class classifiers include the density estimation method [9], the boundary description method [10], the clustering method [11], and so on. The clustering method is sensitive to the choice of the initial clustering centers and of the number of clusters, but its complexity is lower than that of the other two methods. If the clustering centers can be determined accurately, the clustering method is still relatively accurate for one-class classification. It is adopted here for two reasons: the large amount of data on the network needs rapid classification, and the initial clustering centers and thresholds have already been obtained in the previous work, of which this paper is a continuation. A one-class classifier can recognize the data to be classified by constructing a model that covers the labeled data, without obtaining labels for all samples. Therefore, before the convolutional neural network is used to classify the binary protocols, a one-class classifier is needed to select the data frames that meet the classification conditions.
In recent years, with the in-depth study of deep learning, classification algorithms such as the convolutional neural network have become more and more widely used in practice. Gadekallu [12] uses a convolutional neural network model based on crow search to recognize different gestures in the field of human-computer interaction. This model greatly improves the accuracy of recognition and is better than the traditional pattern recognition model. Kumar et al. [13] proposed a physical identification and control method based on GSM and IoT. The proposed framework comprises two primary parts, the equipment part and the cloud part; it is more adaptable, less complex, and cheaper to use. One-dimensional CNN has been used successfully in many fields, such as speech analysis [14] and text classification [15]. Inspired by these studies, and considering the similarity between protocol messages and text, this paper builds a binary protocol classification method on the one-dimensional convolutional neural network [16]. In this way, the manual selection and extraction of feature values is avoided.
In this paper, protocol identification based on message sequences is used to study unknown binary protocols. According to the characteristics of binary protocols, a protocol classification method based on one-class classification and a one-dimensional convolutional neural network is proposed. First, the unknown binary protocol frames are preprocessed, and the clustering results with the better evaluation scores are used as the training data set for further classification. Second, a one-class classification algorithm based on adjusted mutual information is applied to the protocol data to be classified, and the protocol data that accord with the clustering characteristics are identified. Third, the remaining new unknown protocols are reclustered, the data that meet the requirements are used as the input of the CNN, and finally, classified subsets of binary protocols are obtained.
In this paper, the traffic data set is captured with Wireshark in a real network environment. In order to verify the accuracy of the method, known protocols are used in place of unknown protocols in the experiments. The bitstream subsets of five unknown binary protocols are represented by ARP, DNS, ICMP, TCP, and SMB, which are denoted P1, P2, P3, P4, and P5, respectively. All protocols are assumed to have been initially segmented. To solve the problems in current protocol classification, namely that the clustering results cannot cover all protocols and that the classification accuracy is insufficient, a compound classification method based on one-class classification and CNN is proposed. A one-class classification algorithm based on adjusted mutual information addresses the first problem (many kinds of unknown protocols that the clustering results cannot cover), and a classification method based on one-dimensional CNN addresses the second (insufficient classification accuracy). The effectiveness of the method is verified on real traffic data sets.
The organization of the paper is as follows. Section 2 introduces the related concepts used in the paper. Section 3 provides background knowledge of the subject area and also establishes the classification model. Section 4 highlights the results of experiments and incorporates the conclusions drawn. Finally, Section 5 summarizes the contribution of this paper and the content of further research.
2. Related Concepts
2.1. One Class Classification
One-class classification is a special classification technique that was proposed for outlier detection [17, 18]. It builds a reasonable coverage model of the target samples by learning from the training target samples, and then classifies a test sample according to whether it falls within the coverage of the model.
Take the tagged data obtained by clustering as positive samples.
Construct a hypersphere that contains the positive samples and is as small as possible. If a new sample is inside the hypersphere, it is considered to belong to this category, that is, to be a normal sample; if it is outside the hypersphere, it is considered abnormal. In order to make the hypersphere more compact, the kernel function method is used: a nonlinear mapping maps the data to a high-dimensional space, the smallest hypersphere is solved for in that space, and the kernel function replaces the inner product operation in the high-dimensional space.
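The optimization goal, written in the standard support vector data description (SVDD) form, is as follows:
$$\min_{R,\,a,\,\xi}\; R^{2} + C\sum_{i=1}^{n}\xi_{i} \qquad (1)$$
The constraint condition is as follows:
$$\|\varphi(x_{i}) - a\|^{2} \le R^{2} + \xi_{i},\qquad \xi_{i}\ge 0,\quad i = 1,\dots,n \qquad (2)$$
where $a$ is the center and $R$ the radius of the hypersphere, $\varphi$ is the nonlinear mapping, and $\xi_{i}$ are slack variables.
The optimization problem (1) is changed into the following dual form:
$$\max_{\alpha}\; \sum_{i=1}^{n}\alpha_{i}K(x_{i},x_{i}) - \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_{i}\alpha_{j}K(x_{i},x_{j}) \qquad (3)$$
The constraint condition is as follows:
$$\sum_{i=1}^{n}\alpha_{i} = 1,\qquad 0\le\alpha_{i}\le C,\quad i = 1,\dots,n \qquad (4)$$
The multipliers $\alpha_{i}$ are obtained by solving (3). According to the KKT (Karush–Kuhn–Tucker) condition, the samples $x_{v}$ that lie on the boundary of the hypersphere ($0 < \alpha_{v} < C$) satisfy the following equation:
$$R^{2} = K(x_{v},x_{v}) - 2\sum_{i=1}^{n}\alpha_{i}K(x_{i},x_{v}) + \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_{i}\alpha_{j}K(x_{i},x_{j}) \qquad (5)$$
Here $K(x_{i},x_{j}) = \varphi(x_{i})\cdot\varphi(x_{j})$ and $a = \sum_{i}\alpha_{i}\varphi(x_{i})$, so we only need to insert any such support vector into equation (5) to find R, and the decision function of the optimization problem in equation (1) is as follows:
$$f(z) = K(z,z) - 2\sum_{i=1}^{n}\alpha_{i}K(x_{i},z) + \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_{i}\alpha_{j}K(x_{i},x_{j}) - R^{2} \qquad (6)$$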
If, according to the decision function, a new sample z falls inside or on the hypersphere, it is a normal sample. The parameter C controls how many training samples are allowed to fall outside the hypersphere: the larger C is, the more samples lie inside the hypersphere. Based on one-class classification and the specific data form of protocol classification, we construct a preliminary screening model for protocols. Formula (11) is constructed from the objective function and constraint conditions of formulas (1) to (5), and its constraint is constructed from the decision function of formula (6); in this way, the screening of protocols is realized.
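As a rough illustration of the hypersphere test described above, the sketch below evaluates the squared feature-space distance between a sample and the sphere center using kernel values only. It assumes a Gaussian kernel and hand-picked multipliers `alphas`; neither comes from the paper, and no dual problem is actually solved here.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian kernel; gamma is an illustrative value, not from the paper."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_distance_to_center(z, samples, alphas, kernel=rbf_kernel):
    """||phi(z) - a||^2 with a = sum_i alpha_i * phi(x_i), via kernels only."""
    cross = sum(a * kernel(x, z) for a, x in zip(alphas, samples))
    const = sum(ai * aj * kernel(xi, xj)
                for ai, xi in zip(alphas, samples)
                for aj, xj in zip(alphas, samples))
    return kernel(z, z) - 2.0 * cross + const

def is_normal(z, samples, alphas, radius_sq, kernel=rbf_kernel):
    """A sample is accepted when it lies inside or on the hypersphere."""
    return squared_distance_to_center(z, samples, alphas, kernel) <= radius_sq

# Hypothetical positive samples and equal multipliers (assumed, not solved for).
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
alphas = [0.25, 0.25, 0.25, 0.25]
# Radius taken from one boundary sample, as in the support-vector step.
radius_sq = squared_distance_to_center(samples[0], samples, alphas)
```

With these toy values, a point near the middle of the samples is accepted and a distant point is rejected; a real screening model would obtain the multipliers from an optimizer.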
2.2. Convolutional Neural Network
The convolutional neural network (CNN) [19] is a kind of network that uses convolution calculations and has a deep structure. It is one of the typical algorithms of deep learning. The structure of a CNN includes an input layer, convolution layers, pooling layers, fully connected layers, and an output layer. The convolution layer and the pooling layer can be used alternately many times. A CNN has three important characteristics: local receptive fields, shared weights, and pooling.
The convolutional neural network is trained with the back-propagation (BP) framework in supervised learning and is an improvement on the BP neural network. Many error functions are used with CNNs, including the hinge loss function, the SoftMax loss function, the triplet loss function, and so on. The time delay neural network (TDNN) [20] is a one-dimensional convolutional neural network for speech recognition and one of the earliest convolutional networks. In 1998, LeCun et al. [21] put forward the LeNet-5 model, a convolutional network applied to image classification, which consists of two convolution layers, two pooling layers, and two fully connected layers. The data frame of a protocol is similar to the pixels of an image, except that its dimension is different. In this paper, with reference to the TDNN model and the LeNet-5 model, a one-dimensional LeNet-5 model is designed. There are many ways to implement a convolutional neural network; this paper mainly uses the deep learning tools in MATLAB. A sigmoid function is used as the activation function, maximum frequency pooling is used as the pooling rule, and the SoftMax classifier is used for classification. Figure 1 is a schematic diagram of a convolutional neural network for classifying pictures.

2.3. Binary Protocol
When the application protocols contained in a bitstream are parsed, one class of protocols is called text protocols [22], such as HTTP and SMTP. A text protocol contains a large number of characters that are convenient for people to read and understand, such as numbers, letters, percentage signs, carriage returns, spaces, and so on. To make further analysis of the text protocol easier, some special characters, such as spaces and line breaks, are usually used as delimiters. A lot of redundancy is thus added to the protocol design, which reduces the efficiency of transmission but makes the protocol more flexible to use. The HTTP protocol uses spaces to separate data from keywords, such as the space between the keyword “Server:” and “nginx,” and “\r\n” between pieces of information. The delimiters used by different text protocols vary slightly but are generally consistent. On the one hand, this is convenient for people to read; on the other hand, it also determines the way the server parses the text protocol, namely by regular expression matching.
Similarly, among the protocols obtained after parsing a bitstream there is another class, which we call binary protocols [23], such as TCP and UDP. The concept of a binary protocol is defined relative to the text protocol; it does not mean that the smallest data unit in a binary protocol is at the bit level. As can be seen from some early RFC (Request for Comments) protocol specifications, binary protocols can also be byte-oriented, but even after parsing, there are no readable characters in the protocol structure fields. The fundamental difference between text protocols and binary protocols lies not in the encoding of binary numbers, but in whether the protocol is data structure-oriented or text-string-oriented. The binary protocols studied in this paper are not limited in the type of user data they carry. The TCP protocol format is shown in Figure 2.

Compared with the HTTP protocol, the TCP protocol has a fixed format: there are no delimiters between the fields, the length of each field is fixed, and the format of the protocol (excluding user data) is fixed. These characteristics of binary protocols mean that a binary protocol is not parsed by splitting on delimiters or wildcards; instead, the fields are parsed directly according to their position and fixed length. The fixed format allows the running status of the protocol to be represented by numbers. Although this is not convenient for people to read, it increases the efficiency of data transmission and parsing. Compared with text protocols, binary protocol messages are generally shorter, which indicates the application prospects of binary protocols in the era of the Internet of Things.
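To make the "position and fixed length" style of parsing concrete, the following sketch unpacks the fixed 20-byte part of a TCP header with Python's standard `struct` module. The function name `parse_tcp_header` and the sample values are illustrative assumptions, not part of the paper.

```python
import struct

def parse_tcp_header(segment):
    """Parse the fixed 20-byte TCP header purely by offset and field width."""
    if len(segment) < 20:
        raise ValueError("a TCP header needs at least 20 bytes")
    # Network byte order: two 16-bit ports, two 32-bit numbers, four 16-bit words.
    (src_port, dst_port, seq, ack,
     offset_flags, window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "data_offset": offset_flags >> 12,  # header length in 32-bit words
        "flags": offset_flags & 0x003F,     # six classic flag bits, URG to FIN
        "window": window,
        "checksum": checksum,
        "urgent_pointer": urgent,
    }

# A hand-built SYN+ACK header (hypothetical values) to exercise the parser.
sample = struct.pack("!HHIIHHHH", 443, 51000, 1, 0, (5 << 12) | 0x012, 65535, 0, 0)
fields = parse_tcp_header(sample)
```

No delimiter search or regular expression is involved: every field is recovered from its byte offset alone, which is the property the text contrasts with HTTP.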
3. Protocol Classification Method
In this paper, we propose a protocol classification method based on one-class classification and a one-dimensional convolutional neural network. In order to verify whether a binary protocol bitstream belongs to a clustered protocol, the similarity of the distributions of two data frames is measured by adjusted mutual information (AMI) [24]. The value range of AMI is [−1, 1]; a negative value represents a negative correlation between the data frames, and the closer the AMI value is to 1, the more similar the distributions of the two data frames are. In this way, the similarity between a data frame and a clustering center is judged.
One-dimensional CNN is used to perform traffic classification tasks, and its performance is compared with two-dimensional CNN. In order to realize the efficient classification of unknown binary protocols [25], a binary protocol classification method based on the one-dimensional convolution neural network is proposed. Whether the protocol is known does not affect the accuracy of clustering. In order to better test the clustering effect, the known binary protocol is used instead of the unknown protocol. Moreover, the object of the study is the bit stream that has been initially segmented and the location of the protocol header is determined (hereinafter referred to as the unknown binary protocol bit stream).
The training data set is the tagged protocol data obtained on the basis of clustering algorithm (the corresponding paper on clustering using improved k-means algorithm has been published in the Journal of Fire Command and Control).
First, the data are preprocessed. According to the characteristics of binary protocols, 4 bits are selected as the processing unit, the length of the shortest frame is used as the common length, and each unit is used as a feature, which gives an n × m two-dimensional matrix. Then, a one-class classification algorithm screens the data to obtain the qualified protocol messages, and the trained CNN is used for further classification, dividing the unknown binary protocols into binary protocol subsets. In the one-class classifier, adjusted mutual information is used as the basis for judgment. In the one-dimensional convolutional neural network, the maximum frequency is used as the pooling standard; if there is no maximum frequency, pooling is carried out according to the mean value. These improvements are intended to raise the classification accuracy. The overall structure of the method is shown in Figure 3.

3.1. Data Preprocessing
The network data flow includes data link layer, TCP/UDP, and application layer protocol data. The binary protocols studied in this paper are not distinguished by the layer to which they belong; instead, a .pcap file containing the protocols of all layers is obtained from the packets captured in Wireshark. Therefore, the scope of application is wider. The format of the protocol data is shown in Figure 4.

In order to effectively remove redundant data and ensure that all input data have the same length, the unknown binary protocol bitstream packets in .pcap format are saved in txt format, and the redundant vertical bars in the protocol header are removed. Then, the data are converted into binary and divided into 4 bit units. For example, 111111110011 is converted to 15, 15, 3, and each of these values is a basic processing unit. The dataset consists of n data frames, and the length of each frame is not fixed. The data length m is selected so as to retain the protocol information as far as possible while effectively removing the content of the data part; it is determined by the minimum frame length. The formula for the truncation length m also involves the shortest length of control information, which is obtained from experience. The n input data frames are processed into basic processing units of 4 bits, and an n × m matrix is constructed with the basic processing units as its elements, where m is the number of leading processing units kept from each data frame and n is the number of traffic data frames. The processed dataset is I.
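The preprocessing step can be sketched as follows. This is a minimal illustration that assumes frames arrive as strings of "0" and "1" and that m is simply the length of the shortest frame in the batch; the empirical control-information length mentioned in the text is not modeled, and the function names are hypothetical.

```python
def to_nibbles(bits):
    """Split a bit string into 4-bit units: '111111110011' -> [15, 15, 3]."""
    usable = len(bits) - len(bits) % 4  # drop an incomplete trailing unit
    return [int(bits[i:i + 4], 2) for i in range(0, usable, 4)]

def build_matrix(frames):
    """Turn n bit-string frames into an n x m matrix of 4-bit units."""
    rows = [to_nibbles(frame) for frame in frames]
    m = min(len(row) for row in rows)   # assumed rule: shortest frame sets m
    return [row[:m] for row in rows]

matrix = build_matrix(["111111110011", "0000000100100011"])
```

Here the second frame has four units, so it is truncated to the first three to match the shorter frame.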
3.2. One Class Classification Model
3.2.1. Adjusted Mutual Information
Mutual information (MI) [26] is a measure that reflects the degree of interdependence between two random variables, also known as transfer information. If the logarithm is taken to base 2, the unit of mutual information is the bit. Unlike the correlation coefficient, it is more general and is not limited to real-valued random variables. The calculation method is shown in the following formula:
$$I(X;Y) = \sum_{y\in Y}\sum_{x\in X} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$$
where p(x, y) is the joint probability distribution of X and Y, while p(x) and p(y) are the marginal probability distributions of X and Y, respectively.
Intuitively, mutual information represents overlapping information between X and Y, that is, the degree to which the uncertainty of the other variable decreases when either X or Y is known.
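As a small worked example of this definition, the sketch below estimates mutual information in bits from two equal-length sequences of discrete symbols, using empirical frequencies as probabilities. Treating two frames as paired symbol sequences is an assumption made for illustration; the paper does not spell out how the distributions are estimated.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (bits) of two equal-length symbol sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x, count_y = Counter(xs), Counter(ys)
    total = 0.0
    for (x, y), c in joint.items():
        # p(x, y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
        total += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return total

identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
unrelated = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Knowing one of two identical sequences removes all uncertainty about the other, while the second pair shares no information.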
The value range of MI is [0, 1], but for random results, MI cannot guarantee a score close to zero. In order to solve this problem, we introduce adjusted mutual information (AMI). AMI is an improvement on mutual information that better reflects the degree of coincidence of the data distributions. The value range of AMI is [−1, 1], and the larger the value is, the closer the clustering result is to the real situation. In formula 10, the entropy terms are the marginal entropies of the corresponding samples, and the expectation term represents the expected value of the mutual information.
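For orientation, one widely used form of adjusted mutual information is given below. The symbols U and V (the two labelings being compared), H (entropy), and E (expectation under random labelings) are introduced here for illustration, and the arithmetic-mean normalizer is an assumption; other variants normalize by the maximum of the two entropies, and the paper's own formulas may differ.

```latex
\mathrm{AMI}(U,V) =
  \frac{\mathrm{MI}(U,V) - E\{\mathrm{MI}(U,V)\}}
       {\tfrac{1}{2}\,[\,H(U) + H(V)\,] - E\{\mathrm{MI}(U,V)\}}
```

Subtracting the expected value is what pulls the score of a random labeling toward zero, which plain MI does not guarantee.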
3.2.2. One-Class Classification Model
According to the principle of one-class classification, it is necessary to construct a hypersphere containing the positive samples, so that the normal data are inside the sphere and the abnormal data are outside it. This requires an accurate choice of the threshold on the decision distance. In order to make the hypersphere more compact, we define a threshold parameter.
Definition 1. The similarity between data frame i and the j-th cluster center is denoted similar(i, j), and a threshold determines whether the data frame belongs to the cluster of the j-th center point. Within a cluster, the higher the degree of coincidence between the data distribution of a sample and that of the center point, the larger the adjusted mutual information value. The evaluation threshold is defined in terms of k, the number of clusters in the sample, and n, the total number of data frames in the clustering data set: it is k/2 times the average adjusted mutual information between the clustering data set and the corresponding clustering center.
The adjusted mutual information is used to verify whether a binary protocol bitstream belongs to a clustered protocol: the closer the AMI value between a data frame and a clustering center is to 1, the more similar their distributions are. First, the adjusted mutual information threshold of each cluster with respect to its clustering center is calculated from the cluster samples; then, the adjusted mutual information value between each data frame in the test data and each clustering center is calculated. If the threshold condition of any cluster is satisfied, the data frame belongs to a protocol that has already been clustered; otherwise, it belongs to a completely unknown protocol, and these data need to be clustered. That is, if the adjusted mutual information value I between a sample to be classified and a clustering center is greater than the evaluation threshold of that cluster, the protocol message is determined to belong to the protocol type of the known clustering samples.
In this paper, a one-class classification algorithm based on adjusted mutual information is proposed. The calculation steps of the algorithm are shown in Table 2.
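The screening procedure described in this section can be sketched roughly as follows. The sketch assumes that each cluster's threshold is the multiplier times the mean similarity of its own members to its center, which is one reading of the text, and it takes the similarity measure as a parameter. The toy `match_ratio` below merely stands in for adjusted mutual information and is not the paper's measure; all names are hypothetical.

```python
def cluster_thresholds(clusters, centers, similarity, scale=None):
    """Per-cluster threshold: scale * mean similarity of members to the center."""
    if scale is None:
        scale = len(centers) / 2  # k/2, the multiplier reported in the text
    thresholds = []
    for members, center in zip(clusters, centers):
        mean_sim = sum(similarity(f, center) for f in members) / len(members)
        thresholds.append(scale * mean_sim)
    return thresholds

def screen(frame, centers, thresholds, similarity):
    """Index of the first cluster whose threshold the frame exceeds, else None."""
    for j, (center, limit) in enumerate(zip(centers, thresholds)):
        if similarity(frame, center) > limit:
            return j
    return None  # completely unknown protocol: send back for re-clustering

def match_ratio(a, b):
    """Toy stand-in for AMI: fraction of positions with equal units."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

centers = [[1, 1, 1, 1], [2, 2, 2, 2]]
clusters = [[[1, 1, 1, 1], [1, 1, 1, 0]], [[2, 2, 2, 2], [2, 2, 0, 2]]]
limits = cluster_thresholds(clusters, centers, match_ratio)
```

Frames that return `None` correspond to the "completely unknown" protocols that the text sends back to the clustering stage.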
3.2.3. Parameter Optimization
In the one-class classification process, the threshold parameter has to be set manually; it determines whether a data frame meets the threshold of the cluster corresponding to a given cluster center. The setting of this parameter directly affects the accuracy of the one-class classification results.
The threshold is set to k/2 times the average adjusted mutual information value. If the value is too small, the accepted range of data frames is too large to distinguish the protocols effectively. If the value is too large, the threshold is too high and correct data frames are missed during filtering. The average adjusted mutual information reflects the similarity between all data frames and the cluster centers. Since there are k clusters, k/2 times this average better reflects the similarity between each cluster and its own center. We have conducted experiments on this issue, and the experimental results are shown in Figure 5.

In the experiment, we find that when the threshold is either the average adjusted mutual information value itself or k times that value, the classifier can hardly distinguish whether a data frame belongs to a clustered protocol. When the threshold is the average value, all data frames are judged to be qualified; when the threshold is k times the average, only 11 of the data frames that do not meet the conditions are judged correctly. Although the accuracy under these two settings is relatively high, the actual discrimination is particularly poor. The experimental results show that setting the threshold to k/2 times the average adjusted mutual information value gives a good effect.
3.3. CNN Model Based on Frequency Pooling
3.3.1. Frequency Pooling
Pooling filters information and selects features: the feature maps obtained by the convolution calculation are screened to form more abstract target features. The value of each unit in a binary protocol carries a specific meaning that is independent of its magnitude (for example, 00ff and 1133 are equivalent in format), so mean pooling and max pooling do not reduce dimensionality well here. For this reason, a method of maximum frequency pooling is designed.
Definition 2. Maximum frequency pooling. The value with the maximum frequency in the pooling window is used as the result for the local feature; if there is no maximum frequency, the average value is taken. The step size is c, the input feature map is F, the offset is b2, and the pooled feature map is S.
Definition 3. Maximum frequency. The frequency of every character among the sampling points of a pooling window is counted, and the character with the highest frequency is regarded as the maximum frequency character. Since the step size in this paper is c, the character with the highest frequency in every c steps is the maximum frequency character. The pooling model therefore takes the maximum frequency value when it exists and the average value when it does not.
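Definitions 2 and 3 can be illustrated with the short sketch below. It assumes non-overlapping windows of width c, treats "no maximum frequency" as a tie for the most frequent value, and adds the offset to the pooled value; the exact placement of the offset b2 in the paper's formula is not recoverable from the text, so that part is an assumption.

```python
from collections import Counter

def max_frequency_pool(features, c, bias=0.0):
    """Pool non-overlapping windows of width c by their most frequent value."""
    pooled = []
    for start in range(0, len(features) - c + 1, c):
        window = features[start:start + c]
        counts = Counter(window).most_common()
        top_value, top_count = counts[0]
        # A unique mode exists if no other value reaches the same count.
        unique_mode = len(counts) == 1 or counts[1][1] < top_count
        if unique_mode:
            pooled.append(top_value + bias)
        else:
            pooled.append(sum(window) / len(window) + bias)  # fall back to the mean
    return pooled

pooled = max_frequency_pool([15, 15, 3, 1, 2, 3], 3)
```

In the first window the value 15 occurs twice and is kept; in the second window every value occurs once, so the mean is used instead.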
3.3.2. CNN Model and Parameter Design
In the data preprocessing phase, the .pcap file obtained from the Wireshark capture is saved in txt format, the redundant data in the protocol header are removed, and the data are then processed in units of 4 bits. For example, 111111110011 is converted to ff3, with f, f, and 3 as the basic processing units. The n input data frames are converted into an n × m two-dimensional matrix, where n is the number of input data frames and m is the number of leading processing units kept from each data frame. Because the protocol length is unknown, m is chosen to retain the protocol information as far as possible while effectively removing the content of the data part: first, the minimum bitstream length of each protocol is determined, and then the shortest of these lengths over all protocols is taken as the m value.
In order to classify and identify protocols with the CNN algorithm [27], it is necessary to determine the size of the input data frame and adjust the network structure accordingly. A single protocol message is taken as the input. According to the m value described above, the shortest protocol length is 144 bits, so each data frame is converted into 36 basic processing units (144/4 = 36). When designing the CNN, the network structure has to be changed to fit this input form, which in this paper is 36 × 1. From the point of view of data dimension, a one-dimensional CNN is a simplification of a two-dimensional CNN. Figure 6 is a schematic diagram of a one-dimensional CNN model [28].

The convolution layer uses local connections and weight sharing to extract features. Because the binary protocol format is relatively simple, a five-layer convolution network is designed in order to improve operating efficiency. Suppose that the length of the input protocol is m, the bias is b1, the length of the convolution kernel is n, the activation function is S(t), and the feature map is the matrix F; then the length of the feature obtained after the convolution calculation is m − n + 1. The convolution formula is as follows:
$$F_{j} = S\!\left(\sum_{i=1}^{n} M_{i}\,C_{i} + b_{1}\right),\qquad j = 1,\dots,m-n+1$$
where, for each output position j, Mi is the input element aligned with the kernel weight Ci in the convolution process, not the i-th element of M.
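The output-length relation m − n + 1 can be checked with a minimal one-dimensional convolution sketch. The kernel length of 5 and the weights below are illustrative assumptions (the paper does not state its kernel size here), and, as in most CNN implementations, the kernel is applied without flipping.

```python
import math

def sigmoid(t):
    """Logistic activation S(t)."""
    return 1.0 / (1.0 + math.exp(-t))

def conv1d_valid(inputs, kernel, bias=0.0):
    """Slide the kernel over the input with stride 1 and no padding."""
    n = len(kernel)
    return [
        sigmoid(sum(inputs[j + i] * kernel[i] for i in range(n)) + bias)
        for j in range(len(inputs) - n + 1)
    ]

frame = [0.0] * 36                  # one preprocessed frame of 36 units
feature_map = conv1d_valid(frame, [0.1] * 5)
```

For a 36-unit input and a kernel of length 5, the feature map has 36 − 5 + 1 = 32 entries.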
In this paper, the basic units of the binary protocol are all non-negative values, the data features are relatively short, and the feature values are complex, so fine-grained classification judgments are needed. This matches the characteristics of the Sigmoid function [29] (the data are all non-negative and normalized), so the Sigmoid function is used as the activation function.
Pooling is information filtering and feature selection, and the feature graph obtained by the convolution calculation is screened to form more abstract target features. Because the data of each bit in the binary protocol represents only the specific meaning independent of size, for example, 00ff and 1,133 are equivalent in format, the use of the mean pooling and maximum pooling is not very good for dimensionality reduction. In this paper, a method of maximum frequency pooling is designed.
All the neural units in the fully connected layer are connected to the neural units in the feature maps of the previous layer. The output expression of a neural unit is as follows:
$$y = f(Wx + b_3) \tag{14}$$
In formula 14, x is the input of the neural unit, y is its output, W is the connection weight, b3 is the bias, and f(t) is the activation function. In this paper, the SoftMax classifier is used, and the option with the highest probability is taken as the output of the classifier. In order to further improve the classification accuracy, a protocol is assigned to the unknown class when this probability is less than 0.8.
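The decision rule described here (highest SoftMax probability, with rejection below 0.8) can be sketched as follows. The function names are hypothetical, and returning `None` for the unknown class is an assumption made for the illustration.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Convert raw output scores into class probabilities."""
    peak = max(scores)                       # subtract the peak for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores: list[float], threshold: float = 0.8):
    """Return the index of the most probable class, or None for 'unknown'."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    # A probability below the threshold is treated as an unknown protocol.
    return best if probs[best] >= threshold else None


if __name__ == "__main__":
    print(classify([5.0, 0.0, 0.0, 0.0, 0.0]))   # confident prediction
    print(classify([1.0, 1.0, 0.0, 0.0, 0.0]))   # ambiguous, rejected as unknown
```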
3.4. SoftMax Classifier
The features extracted by the last convolution layer of a convolution neural network are usually input into a classifier, which maps each sample to one of the given categories to complete the classification. Commonly used classifiers include radial basis function networks, logistic regression, and fully connected neural networks. At present, the SoftMax classifier is most often used to solve multiclass problems in convolution neural network structures.
In the convolution neural network structure, the output layer using SoftMax has multiple units, and the number of units is equal to the number of categories. Under the action of SoftMax, each unit calculates the probability that the current sample belongs to its class. The loss function of SoftMax is defined as follows:
$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)} = j\}\,\log\frac{e^{\theta_j^{T}x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^{T}x^{(i)}}}\right] + \frac{\lambda}{2}\sum_{j=1}^{k}\lVert\theta_j\rVert^{2} \tag{15}$$
where $\theta_1, \ldots, \theta_k$ are the model parameters, m is the total number of samples, k is the number of categories of the classification task, and 1{·} is a binary indicator function defined by 1{value is true} = 1 and 1{value is false} = 0. The term $1\{y^{(i)} = j\}$ indicates whether sample $x^{(i)}$ belongs to category j: if the real label $y^{(i)}$ is equal to j, it outputs 1, and otherwise it outputs 0. The logarithmic term reflects the probability, as estimated by the network, that sample $x^{(i)}$ belongs to each category; the more accurate the classification, the smaller the first term of formula 15. The second term in the formula is the weight attenuation term, which avoids network overfitting, and $\lambda$ is the weight attenuation coefficient.
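The loss described above can be illustrated with a short sketch. It assumes the usual form of the SoftMax loss: the average negative log-probability of the true class plus half the attenuation coefficient times the sum of squared weights. The names `theta`, `samples`, `labels`, and `decay` are chosen for this example only, and labels are numbered from 0.

```python
import math


def softmax_loss(theta, samples, labels, decay):
    """SoftMax loss with weight attenuation.

    theta:   k weight vectors, one per category
    samples: m feature vectors
    labels:  m true labels in the range 0..k-1
    decay:   weight attenuation coefficient
    """
    m = len(samples)
    data_term = 0.0
    for x, y in zip(samples, labels):
        scores = [sum(w * v for w, v in zip(row, x)) for row in theta]
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        # Only the true class contributes, because the indicator is 0 elsewhere.
        data_term -= scores[y] - log_norm
    data_term /= m
    penalty = 0.5 * decay * sum(w * w for row in theta for w in row)
    return data_term + penalty


if __name__ == "__main__":
    xs, ys = [[1.0, 0.0], [0.0, 1.0]], [0, 1]
    print(softmax_loss([[0.0, 0.0], [0.0, 0.0]], xs, ys, 0.0))  # uninformed weights
    print(softmax_loss([[1.0, 0.0], [0.0, 1.0]], xs, ys, 0.0))  # better weights, lower loss
```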
The CNN model in Figure 6 consists of an input layer followed by five layers: convolution layer 1, pooling layer 1, convolution layer 2, pooling layer 2, and the output layer. The input layer must match the form of the data frame; since the input form in this paper is 36 × 1, there are 36 neurons in the input layer.
Convolution layer 1 has 9 convolution kernels. Each kernel produces 1 feature sequence, and each feature sequence is composed of 34 neurons. The kernels have size 3, and their weights are generated at random. Because the step size is 1 and there is no padding, the size of each feature sequence is 36 − 3 + 1 = 34.
Pooling layer 1 is composed of 9 feature sequences, each of which is composed of 17 neurons. Since the input of pooling layer 1 is the feature sequences of length 34 from convolution layer 1, frequency pooling gives S1 = 34/2 = 17.
Convolution layer 2 has 3 convolution kernels, which produce 27 feature sequences, each composed of 12 neurons. Convolution layer 2 is connected to pooling layer 1 through convolution kernels of size 6 × 1. Each kernel is convolved over every feature sequence of pooling layer 1, whose length is 17, so the size of each resulting feature sequence is 17 − (6 − 1) = 12.
Pooling layer 2 is composed of 27 feature sequences, each of which is composed of 6 neurons. Since the input of pooling layer 2 is the feature sequences of length 12 from convolution layer 2, frequency pooling gives S2 = 12/2 = 6.
The output layer is a fully connected layer. The number of neurons in the output layer equals the number of protocol types to be identified; assuming that there are five protocol types, the output layer has 5 neurons.
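The layer sizes quoted above can be checked with a short calculation. This is a sketch that only reproduces the arithmetic in the text (kernel sizes 3 and 6, pooling window 2, step size 1, no padding); the function names are hypothetical.

```python
def conv_length(length: int, kernel: int, stride: int = 1) -> int:
    """Output length of an unpadded convolution."""
    return (length - kernel) // stride + 1


def pool_length(length: int, window: int) -> int:
    """Output length of pooling over nonoverlapping windows."""
    return length // window


def layer_shapes(input_len: int = 36, classes: int = 5):
    """Return (layer name, number of feature sequences, sequence length)."""
    c1 = conv_length(input_len, 3)   # 9 kernels of size 3
    p1 = pool_length(c1, 2)
    c2 = conv_length(p1, 6)          # 3 kernels of size 6 applied to 9 sequences
    p2 = pool_length(c2, 2)
    return [
        ("input", 1, input_len),
        ("conv1", 9, c1),
        ("pool1", 9, p1),
        ("conv2", 9 * 3, c2),
        ("pool2", 9 * 3, p2),
        ("output", 1, classes),
    ]


if __name__ == "__main__":
    for name, count, length in layer_shapes():
        print(f"{name:7s} {count:2d} x {length}")
```

Under these sizes, the fully connected output layer receives 27 × 6 = 162 values per frame.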
4. Results and Discussion
4.1. Experimental Environment and Data
In order to verify the effectiveness of this method, experiments are carried out on an ordinary desktop computer (3.2 GHz Intel i7-4790 processor, 8 GB memory, Win7 operating system) using the Python language and MATLAB software, and Wireshark is used to capture data frames. The experimental data set is obtained from a real network environment. Bit stream subsets of ARP, DNS, ICMP, TCP, and SMB stand in for five unknown binary protocols and are denoted P1, P2, P3, P4, and P5, respectively. It is assumed that all protocols are initially segmented, so that each message starts with the corresponding protocol header and contains the data part. The shortest length of all messages, 144 bits, is taken as the m value. The clustered protocol data is divided into a training data set and a test data set in the proportion 8 : 2 and fed into the CNN network. A binary bit stream containing the five protocols is then captured from the network as the data set to be classified. This data set is first processed by the one-class classifier and then input into the CNN model to realize automatic classification of the protocols.
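The 8 : 2 division can be sketched as below. The paper does not say whether the frames are shuffled before splitting, so the shuffle and the fixed seed are assumptions of this example, and `split_8_2` is a hypothetical helper name.

```python
import random


def split_8_2(frames, seed=0):
    """Shuffle the labeled frames and split them 8 : 2 into train and test sets."""
    shuffled = list(frames)
    random.Random(seed).shuffle(shuffled)   # assumption: shuffle before splitting
    cut = len(shuffled) * 8 // 10
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    train, test = split_8_2(range(250))
    print(len(train), len(test))
```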
4.2. Evaluation Index
In this paper, accuracy rate, recall rate, precision rate, and comprehensive index F are used to evaluate the results.
Accuracy rate (Acc) is the sum of the correctly identified target and nontarget classes divided by the total. The formula is as follows:
$$\mathrm{Acc} = \frac{TP + TN}{Num} \tag{16}$$
The recall rate is the percentage of all positive examples that are correctly predicted. The formula is as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{17}$$
The precision rate is the proportion of real positive examples among the predicted positive results. The formula is as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{18}$$
According to the general standard, assuming that precision and recall are equally important, the comprehensive indicator F is as follows:
$$F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{19}$$
In formulas 16, 17, 18, and 19, TP is the number of data frames of the target class that are correctly assigned to the target class; FP is the number of data frames of the nontarget class that are mistakenly assigned to the target class; TN is the number of data frames of the nontarget class that are correctly assigned to the nontarget class; FN is the number of data frames of the target class that are mistakenly assigned to the nontarget class; and Num is the total number of data frames.
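The four indices can be computed from the counts defined above, as in the following minimal sketch. The counts in the usage example are hypothetical and are not taken from the experiments; the function assumes that none of the denominators is zero.

```python
def evaluate(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute Acc, recall, precision, and F from the four frame counts."""
    num = tp + fp + tn + fn                  # total number of data frames
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "acc": (tp + tn) / num,
        "recall": recall,
        "precision": precision,
        "f": 2 * precision * recall / (precision + recall),
    }


if __name__ == "__main__":
    # Hypothetical counts for one target protocol among 100 frames.
    print(evaluate(tp=40, fp=10, tn=45, fn=5))
```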
4.3. Analysis of One-Class Classification Results
This paper classifies a total of 250 data packets drawn from the five protocols in the test data set. Figure 7 shows the calculated AMI values.

Figure 7 shows the results of the one-class classification when the total number of data frames to be classified is 250. The abscissa is the sample number, and the ordinate is the adjusted mutual information (AMI) value between the test sample and the cluster center, for samples that meet the mutual information threshold. By setting the threshold, we can select the samples that belong to the classification model. The AMI values fluctuate because, to facilitate statistics, samples of the same protocol are numbered consecutively (for example, samples 1–50 are TCP), and the data distribution differs between protocols. The results show that 224 samples met the threshold; 4 of them were repeated, 220 were correct, and 0 were wrong, giving an accuracy of 88%. Compared with the traditional one-class SVM classifier, this method is faster because of its low computational complexity, and its classification accuracy is higher than the 86.4% obtained by the traditional SVM classifier. The traditional SVM classifier is configured as follows: the radial basis kernel function is used in the MATLAB environment, and the grid search algorithm is used to determine the error penalty parameter C and the kernel parameter G (both have mature models).
Figure 8 shows the accuracy, recall, precision, and comprehensive index F of the results obtained for different numbers of test samples. The data show that the number of test samples has little influence on the accuracy of the classification results, and that the structural characteristics of the data itself are an important factor affecting classification accuracy. On all four indicators the proposed algorithm is better than the SVM algorithm. The binary protocol classification algorithm based on adjusted mutual information is therefore effective: it can classify a variety of unknown binary protocols and distinguish between clustered protocols and completely unknown protocols.

4.4. Analysis of CNN Classification Results
In this paper, the Sigmoid function is used as the activation function. This choice suits binary protocol data, whose features are relatively short and whose feature values are relatively complex, so that a more detailed classification judgment is needed. The SoftMax classifier transforms the output results into classification probabilities, which is suitable for multiclass problems. Some of the classification results are shown in Figure 9. The change of the false recognition rate during training is shown in Figure 10.


The classification results of 24 test samples are shown in Figure 9, and the category with the highest probability is taken as the category to which each sample belongs. For example, the classification result of the first protocol is (0.988, 0.0107, 0.0011, 0.0001, 0.0001), so the first protocol is assigned to the first category. Figure 10 shows how the false recognition rate of the algorithm changes with the number of iterations. The accuracy increases rapidly with the number of iterations at the initial stage, but the improvement becomes very slow once the number of iterations reaches 500. Therefore, a reasonable number of iterations must be set so that the algorithm reaches the required accuracy efficiently.
In order to verify that frequency pooling and the one-dimensional convolution neural network are superior to maximum value pooling and the two-dimensional convolution neural network, two sets of comparative experiments are designed.
In the first group of experiments, a one-dimensional convolution neural network and a two-dimensional convolution neural network are used to classify the same test data. The one-dimensional network is the algorithm of this paper. The basic parameters of the two-dimensional network are as follows: the first convolution layer uses 9 convolution kernels of size 2 × 2 with a step size of 1 × 1; the first pooling layer uses average pooling over a 5 × 1 window with a step size of 5 × 1; and the second convolution layer uses 3 convolution kernels of size 2 × 2 with a step size of 1 × 1. The activation function is the same as that of the one-dimensional convolution neural network. The experimental results of the first group are shown in Figure 11.

Figure 11 shows that the accuracy of the one-dimensional convolution neural network is higher than that of the two-dimensional convolution neural network, by 12% on average. In the binary protocol classification task, the one-dimensional CNN therefore performs better than the two-dimensional CNN, which verifies the effectiveness of the proposed method. Because real protocol data is used in the experiments, the redundant data added to form the two-dimensional input dilutes the characteristics of the protocol, resulting in a decline in classification accuracy.
In the second group of experiments, the maximum frequency pooling and the maximum pooling were used to classify the same test data.
The maximum frequency pooling method is used instead of mean pooling and maximum pooling. The input data frames are divided into nonoverlapping regions, and the value with the maximum frequency in each pooling region is taken as the local feature. If there is no maximum frequency, the average value is taken. Figure 12 shows how maximum frequency pooling works, using a 1 × 4 sampling area.
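A minimal sketch of maximum frequency pooling follows. The paper does not spell out what "no maximum frequency" means; this example assumes it is the case where no single value occurs strictly more often than every other value in the region, and the function name is hypothetical.

```python
from collections import Counter


def max_frequency_pool(sequence, window):
    """Pool nonoverlapping regions by their most frequent value.

    Falls back to the mean of the region when no value is strictly
    the most frequent (assumed reading of 'no maximum frequency').
    """
    pooled = []
    for start in range(0, len(sequence) - window + 1, window):
        region = sequence[start:start + window]
        ranked = Counter(region).most_common()
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            pooled.append(ranked[0][0])           # unique most frequent value
        else:
            pooled.append(sum(region) / window)   # tie: take the average
    return pooled


if __name__ == "__main__":
    # Three 1 x 4 sampling areas: mostly f (15), mostly 0, and all distinct.
    print(max_frequency_pool([15, 15, 0, 15, 0, 0, 0, 3, 1, 2, 3, 4], 4))
```

With a window of 2, as used between the layers of the network in this paper, a region either holds two equal values (which are kept) or two different values (which are averaged).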

The values on the right side of Figure 12 are, from top to bottom, the results of frequency pooling, mean pooling, and maximum pooling applied to the matrix on the left. Because the values in the protocol only represent the function of a certain position, their magnitudes carry no meaning. After analyzing the original values of the protocol, the real values of the corresponding positions are found to be (f, 0, random), respectively. By comparison, all three results of frequency pooling are correct, while maximum value pooling has one feature extraction error and mean pooling has two, so the use of frequency pooling can effectively improve the classification accuracy of binary protocols.
In order to better illustrate the effectiveness of frequency pooling in protocol classification, 100, 200, 300, and 400 pieces of data are selected from the test set and input into the CNN model based on frequency pooling and the CNN model based on maximum pooling, and the classification accuracy is calculated. Taking the number of samples as the abscissa, histograms of the classification accuracy under the two pooling methods are drawn. The accuracy of the classification results is shown in Figure 13:

By comparison, the frequency pooling method is better than the maximum pooling method for every number of test samples. The theoretical explanation is that the value of the processing unit at a given position of the protocol message only represents the function of that position, and its magnitude carries no meaning; using the frequency rather than the value to reflect the features is therefore more in line with the characteristics of binary protocols.
The time complexity of the CNN, the traditional k-means algorithm, and the FCM algorithm is analyzed as follows. The time complexity of the CNN algorithm is O(nt) and that of the k-means clustering algorithm is O(nkt), where n is the number of objects in the data set, t is the number of iterations of the algorithm, and k is the number of clusters. The time complexity of the FCM algorithm is O(nc²dt), where n is the sample size, c is the number of clusters, d is the sample dimension, and t is the number of convergence iterations of FCM. In terms of time complexity, therefore, CNN < traditional k-means < FCM. The operation time measured when classifying and clustering 500 pieces of data at an accuracy of 90% is reported below.
On an ordinary desktop (3.2 GHz Intel i7-4790 processor, 8 GB of memory, Win7 operating system), the k-means and FCM experiments are implemented in Python and run on 500 pieces of data. For the k-means algorithm, the number of clusters is 5, the initial cluster centers are randomly generated, and the standard distance is used as the clustering distance. For the FCM algorithm, the number of clusters is 5, the weighted index is 2, and the maximum number of iterations is 100.
When the accuracy is 90%, the operation time is shown in Figure 14.

From Figure 14, we can see that the classification time of the trained CNN model is less than that of the traditional k-means and FCM algorithms. This is of great significance for real-time network anomaly detection, because network attacks can be found more quickly.
On the same desktop (3.2 GHz Intel i7-4790 processor, 8 GB memory, Win7 operating system), in the MATLAB environment, we compare the classification results for different numbers of test samples; the classification accuracy of the proposed method is higher than the 91.8% obtained by the traditional LSTM classifier. Through the multilayer grid search algorithm, the parameters of the traditional LSTM classifier are determined as follows: length of segmentation window 3, random seed 1, training steps 500, and learning rate 0.1.
The data shown in Table 3 are the F, Acc, recall, and precision values corresponding to the classification results of the improved CNN algorithm and the LSTM algorithm when the number of test samples is 300, 400, and 500, respectively. From the data in Table 3, we can see that the overall performance of the improved CNN algorithm on binary protocol classification is better than that of the LSTM algorithm, although the precision of the LSTM algorithm is not much different from that of the improved CNN algorithm. The experimental results indicate that the one-dimensional neural network and frequency pooling can optimize the classification of binary protocols.
The experimental results show that, compared with other traditional classification methods, this method has lower time complexity and higher accuracy than the SVM algorithm. The one-dimensional CNN classification model can automatically learn the nonlinear relationship between the original input and the expected output, and it can omit steps such as feature design, feature extraction, and feature selection, which are commonly used in traditional divide-and-conquer methods. The automatic classification of protocols is realized, and comparative experiments on maximum frequency pooling and the one-dimensional convolution network verify the effectiveness of the improvements. Experiments show that the classification model has a recognition rate of more than 98%, can effectively classify unknown protocols, and classifies faster than the clustering methods.
5. Conclusion
In order to classify unknown binary protocols when some prior knowledge has already been acquired, a one-class classification algorithm based on adjusted mutual information is proposed, building on an analysis of traditional machine-learning-based traffic classification methods and of the similarity of protocol data distributions. The computational complexity of this method is low, and its effectiveness is verified through experiments. A classification method based on a one-dimensional CNN is then applied after the one-class classification. In this method, the labels of the clustered protocol data are used to train the classification model, and the classified binary protocol messages are used directly as the input of the one-dimensional convolution neural network, on which a binary protocol classifier is built. The classifier automatically learns the nonlinear relationship between the original input and the expected output and realizes automatic classification of the protocols. Because it integrates feature design, feature extraction, and feature selection into one framework, it can omit these steps, which are commonly required in traditional divide-and-conquer methods, and can automatically learn more representative features of binary protocols. The accuracy of the one-dimensional convolution neural network is 12% higher than that of the two-dimensional convolution neural network, and the frequency pooling method outperforms the maximum pooling method by 4% on average.
The research in this paper covers only a small part of protocol recognition. It presupposes that the protocol messages produced by clustering are available, and its classification results can lay the foundation for the next step, format extraction. However, errors can occur in any link, and they accumulate into the next link. How to improve the overall accuracy of protocol recognition is worthy of further study.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Disclosure
Part of the research in the conference paper is applied to this paper, which is of great help to improve the classification accuracy.
Conflicts of Interest
The authors declare that there are no conflicts of interest in this article.
Authors’ Contributions
In the writing of the paper, Yin Shizhuang is responsible for data collection and algorithm programming, Zhifeng You is responsible for the correction of the paper, Li Juan is responsible for the translation of the paper, Hu Qiwei is responsible for the construction and comparison of a class of classification models, and Professor Shi Quan is responsible for the design of the overall framework of the paper.
Acknowledgments
In order to exchange research results with colleagues, Yin Shizhuang and other authors introduced a classifier based on CNN at the 2020 IEEE ICCASIT conference. This paper improves the CNN model on the basis of previous research. This work was supported by National Natural Science Foundation of China, No. 71871220.