Abstract
The encryption of network traffic has driven the development of encrypted traffic classification and identification research. However, many existing studies are effective only on closed-set experimental data, that is, only on traffic of known classes, while real open-set environments contain a great deal of traffic from unknown classes; such studies have difficulty identifying unknown-class traffic and can only misclassify it into known classes. How to identify unknown traffic while classifying known traffic in an open-set environment is one of the focuses of traffic analysis research. Considering these problems, this paper proposes a novel solution that applies open-set recognition to unknown traffic identification and constructs a model based on deep learning and ensemble learning. The method builds a model from a convolutional neural network and a transformer encoder and then uses a three-stage training and testing process, combined with a novel loss function, to generalize to the open space, forming OpenCBD. Experiments on public datasets show that the proposed method is significantly better than other open-set recognition methods. It can not only distinguish known traffic from unknown traffic but also identify the specific classes of known traffic.
1. Introduction
With the wide application of encryption technology in network traffic, it becomes increasingly challenging to effectively monitor and analyze network traffic, and encrypted traffic analysis has become an important research topic in the field of network security [1–3]. To analyze encrypted traffic, it is first necessary to divide the traffic into different sets according to specific goals, that is, to classify and identify network traffic. Most existing studies work in a closed environment; that is, they can efficiently and accurately classify and identify traffic of known classes. For example, Aceto et al. [4, 5] proposed a practical mobile traffic classification model based on deep learning which can automatically extract features. Liu et al. [6] proposed FS-Net, an encrypted traffic classification model based on the recurrent neural network. Wang et al. [7] proposed App-Net, a mobile application recognition model based on RNN and CNN. Nascita et al. [8] improved a multimodal deep learning traffic classification model based on explainable artificial intelligence techniques. But such studies, while valid in closed settings, often cannot really be applied to the real world, because the real-world environment is open: network traffic includes encrypted and nonencrypted, known and unknown, benign and malicious, standard protocols and private protocols, etc. There are many practical problems to consider and many difficulties to overcome. If a classification method trained on a closed set is simply generalized to an open set, it easily misclassifies samples of unknown classes into known classes [9–11]. To solve this problem, researchers need to develop models that support both the classification of known-class samples and the discovery of unknown-class samples, so as to actively manage and prevent abnormal traffic and further create and maintain a good network environment.
In real-world identification and classification tasks, it is often difficult to obtain labels for all samples during training, but it is desirable to be able to identify unknown classes during testing. Open-set recognition describes exactly this scenario: during testing, samples of unknown classes that never appeared in training will occur, and the classifier must not only accurately classify known samples but also identify unknown classes. In a real network environment, there is encrypted traffic of both known and unknown classes, which matches the open-set recognition scenario, so open-set recognition methods can be applied to unknown traffic identification. This paper proposes OpenCBD, an unknown traffic identification method based on open-set recognition. It first performs a pretraining process based on self-supervised learning on unlabeled data, so that the CBD model (proposed in [12], a self-supervised learning model containing three modules: the CNN module, based on a convolutional neural network; the BERT module, based on transformer model encoders; and the dense module, based on a fully connected network) acquires a basic understanding of the characteristics of encrypted traffic. Then, a training and testing process based on ensemble learning is designed for the open set. During training, each individual model is trained on some of the known classes with a specific loss function; then, all known classes are used to train the ensemble model through the ensemble strategy, so that the ensemble model can accurately classify known classes and identify unknown classes during open-set testing. The contributions of this paper are summarized as follows:
(i) A novel unknown traffic identification model OpenCBD is designed. It uses the idea of open-set recognition, combines deep learning and ensemble learning, learns the basic characteristics of encrypted traffic from unlabeled data, and then trains on known classes of traffic to classify and identify traffic in an open environment.
(ii) A general training method suitable for open-set recognition is proposed. The method only needs to be trained on data of the known classes and can identify data of unknown classes. It adopts two-stage training: first, it randomly selects part of the known-class data to train the individual models and then integrates the individual models and trains with all known-class data, so that the model can learn the distinctions between classes.
(iii) A loss function is proposed that combines the cross-entropy loss commonly used in classification models with the II-Loss proposed for open-set recognition. The combination of the two loss functions trains the model more efficiently, making it fit faster and more accurately, so that classification of known classes and identification of unknown classes can be accomplished well at the same time.
(iv) The OpenCBD model achieves good results in unknown traffic identification and known traffic classification tasks and significantly outperforms the baseline methods.
The rest of the paper is organized as follows. Section 2 summarizes the basic knowledge and related work of open-set recognition and unknown traffic identification. Section 3 details the structure and methodology of the overall model. Section 4 introduces the specific details of the experiments and evaluates and compares the experimental results. Finally, Section 5 concludes the paper.
2. Preliminary and Related Work
This section introduces the basic knowledge of open-set recognition and unknown traffic identification and surveys research in these areas from recent years.
2.1. Open-Set Recognition
First, some definitions in open-set recognition are given.
Definition 1 (open space [13]). Given a label set $\mathcal{Y} = \{0, 1, \ldots, M\}$, the known class labels are positive integers, and the unknown class label is 0. For a feature $x$, define $f \in \mathcal{H}$ to be a measurable recognition function: $f_y(x) > 0$ means that the class $y$ can be recognized, and $f_y(x) \le 0$ means that $x$ is not recognized, where $\mathcal{H}$ is the appropriate smooth space for the recognition function. For the training samples $x_i \in \mathcal{K}$ of any known class, the open space is defined as
$$\mathcal{O} = S_o - \bigcup_{i \in N} B_r(x_i),$$
where $B_r(x_i)$ is a closed sphere with the training sample $x_i$ as the center and radius $r$. $S_o$ is a sphere of radius $r_o$, including all known positive training samples and the open space $\mathcal{O}$.
Definition 2 (open space risk [13]). Open space risk is defined as the relative measure of the positively labeled open space compared to the overall measure of the positively labeled space. Then, the probabilistic open space risk of class $y$ is
$$R_{\mathcal{O}}(f) = \frac{\int_{\mathcal{O}} f_y(x)\, dx}{\int_{S_o} f_y(x)\, dx}.$$
Definition 3 (openness [14]). Openness refers to the degree of openness of an open space, which is determined by the training classes, target classes, and test classes, where $C_{TR}$, $C_{TA}$, and $C_{TE}$ represent the training classes, target classes, and testing classes, respectively:
$$O = 1 - \sqrt{\frac{2 \times |C_{TR}|}{|C_{TA}| + |C_{TE}|}}.$$
Sometimes $O < 0$ may occur, so the openness after calibration is only related to the training classes and test classes,
$$O^{*} = 1 - \sqrt{\frac{2 \times |C_{TR}|}{|C_{TR}| + |C_{TE}|}}.$$
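As a concrete worked example, with the experimental setup used later in Section 4.1 (8 known training classes and 13 test classes, i.e., 8 known plus 5 unknown), the calibrated openness is
$$O^{*} = 1 - \sqrt{\frac{2 \times 8}{8 + 13}} \approx 0.127,$$
and the larger the share of unseen test classes, the closer the openness gets to 1.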
Definition 4 (open-set risk [13]). The open-set recognition problem is to minimize both the traditional empirical risk and the open space risk. Given an empirical risk function $R_{\varepsilon}$, the open-set risk is defined as
$$\mathop{\arg\min}_{f \in \mathcal{H}} \left\{ R_{\mathcal{O}}(f) + \lambda_r R_{\varepsilon}(f(\mathcal{K})) \right\},$$
where $\lambda_r$ is the regularization constant.
Definition 5 (open world recognition [13]). The solution for open world recognition is represented by the quintuple $[F, \phi, \nu, L, I]$. $F(x)$ is a multiclass open-set recognition function. The $i$-th class in $F$ is identified by the vector function $\phi(x) = (f_1(x), \ldots, f_{N_t}(x))$ of measurable recognition functions, together with the detector $\nu(\phi)$, which determines whether the output vector of the recognition functions is from an unknown class. $L$ is the labeling process. $L$ applies to new unknown data at time $t$, resulting in labeled data $D_t = \{(y_j, x_j)\}$, where $y_j > N_t$. Assuming that the labeling finds $m$ new classes, the set of known classes becomes $\mathcal{K}_{t+1} = \mathcal{K}_t \cup \{N_t + 1, \ldots, N_t + m\}$. $I$ is the incremental learning function. $I$ extensibly learns and adds new measurable functions $f_{N_t+1}, \ldots, f_{N_t+m}$ to the measurable recognition function vector $\phi$. Each measurable function minimizes the corresponding open space risk.
Suppose that each $f_y(x)$ outputs the probability of $x$ belonging to class $y$, and assume that $f_y(x)$ is normalized across the classes. Let $y^{*} = \mathop{\arg\max}_{y \in \mathcal{K}_t} f_y(x)$; then, the multiclass open-set recognition function is
$$F(x) = \begin{cases} 0, & \text{if } \nu(\phi(x)) = 0, \\ y^{*}, & \text{otherwise}. \end{cases}$$
At this point, an easy way to implement the detector $\nu$ is to set an acceptable minimum threshold $\tau$ and minimize the open space risk, i.e.,
$$\nu(\phi(x)) = \begin{cases} 0, & \text{if } f_{y^{*}}(x) < \tau, \\ 1, & \text{otherwise}. \end{cases}$$
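As a minimal sketch of this threshold rule (assuming the per-class scores are already normalized, e.g., Softmax outputs), the decision can be written as:

```python
import numpy as np

def open_set_predict(probs: np.ndarray, tau: float) -> int:
    # Threshold-based multiclass open-set recognition: return the argmax
    # class label (1..M) if its normalized score reaches tau, else 0 (unknown).
    y_star = int(np.argmax(probs))
    return y_star + 1 if probs[y_star] >= tau else 0

# Example: with tau = 0.5, a flat score vector is rejected as unknown.
print(open_set_predict(np.array([0.2, 0.3, 0.25, 0.25]), tau=0.5))  # -> 0
```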
Since open-set recognition was proposed, many methods have been developed, mainly divided into discriminative models and generative models. Discriminative models are mostly built with traditional machine learning methods or deep neural networks, while generative models are divided into instance-generation-based and non-instance-generation-based methods according to whether instances are generated. Neural networks appear in both families. The following mainly introduces methods based on deep neural networks, including both discriminative and generative models.
In 2016, Bendale and Boult [15] proposed OpenMax, the first open-set recognition method based on deep neural networks. OpenMax was used as a new model layer that estimated the probability that the input came from an unknown class and provided a bounded open space risk.
In 2017, Ge et al. [16] proposed a multiclass open-set recognition method Generative OpenMax (G-OpenMax). Unlike some existing studies, the unknown class was not inferred from the features of the known classes or the decision distance. It extended OpenMax by employing Generative Adversarial Networks (GANs) for new classes of image synthesis.
In 2018, Yoshihashi et al. [17] proposed an open-set recognition classification reconstruction learning method CROSR, which used latent representations for reconstruction and achieved robust unknown detection without reducing the classification accuracy of known classes.
In 2019, Oza and Patel [18] proposed an open-set recognition algorithm C2AE, which used a class-conditional autoencoder, as well as closed-set classification training and open-set recognition training. The encoder and decoder were trained in two stages, and the reconstruction error was modeled using the extreme value theory of statistical modeling to find a threshold that identified samples of known and unknown classes.
In 2020, Hassen and Chan [19] proposed a representation method based on a neural network and used this representation method to propose an open-set recognition mechanism. In this representation, instances from the same class were close to each other, while instances from different classes were further apart.
In 2020, Liu et al. [20] proposed PEELER, an algorithm that applies meta-learning to the open-set recognition problem. It randomly selects a new set of classes per episode, maximizes the posterior entropy over the instances of these classes, and learns a new metric based on the Mahalanobis distance.
In 2021, Joseph et al. [21] proposed an open-world object detector ORE, which was based on contrastive clustering and energy-based unknown identification. Identifying and characterizing unknown instances helped reduce confusion in incremental object detection settings. In this setting, state-of-the-art performance could be achieved without additional methodologies.
In 2022, Geng and Chen [22] proposed a batch decision strategy that was aimed at extending existing open-set recognition methods for new class discovery while considering correlations between test instances. By modifying the Hierarchical Dirichlet Process (HDP), a collective decision-based open-set recognition framework CD-OSR was proposed. CD-OSR did not need to define decision thresholds and could realize open-set recognition and new class discovery at the same time.
Table 1 summarizes several open-set recognition methods mentioned in Section 2.1.
2.2. Unknown Traffic Identification
In the field of network traffic analysis, unknown traffic identification has always been an important research direction. Researchers usually use unsupervised learning or semisupervised learning to solve the tasks of identifying and detecting unknown traffic.
In 2011, Finamore et al. [23] proposed an unsupervised algorithm to identify traffic classes within aggregates. The algorithm utilized the $k$-means clustering algorithm and added a mechanism to automatically determine the number of traffic clusters.
In 2013, Zhang et al. [24] proposed an approach to address the problem of unknown applications in the critical case of small supervised training sets. The proposed method had a superior ability to detect unknown traffic generated by unknown applications and exploited the correlation information between real-world network traffic to improve the classification performance.
In 2013, Zhang et al. [25] proposed an iterative method to extract unknown information from a set of unlabeled traffic flows. The method combined asymmetric bagging and flow correlation to guarantee the purity of the extracted negatives and performed significantly better than state-of-the-art flow classification methods under unknown applications.
In 2014, Yu et al. [26] proposed a method to classify elephant traffic using service-based statistical features for cluster analysis. Elephant traffic refers to unknown traffic generated by only a few or some types of applications.
In 2015, Shaikh and Harkut [27] proposed a framework to classify unknown flows in the network, addressing the problem of unknown applications in critical situations with little supervised training data. Flow label propagation was proposed to automatically and accurately label more unlabeled flows, enhancing the ability of Nearest Clustering-based Classifiers (NCCs). Composite classification was also proposed, which combines many flow predictions to classify Bags of Flows (BoFs) more accurately.
In 2015, Lin et al. [28] proposed a semisupervised learning method to address the problem of an unknown protocol in critical cases where the labeled training sample set was small. With the help of flow-related information and semisupervised clustering ensemble learning, the method had a superior ability to detect unknown samples generated by unknown protocols to improve classification performance.
In 2017, Ma and Qin [29] proposed a method using deep learning techniques to identify unknown protocols in complex network environments. The method identified the protocol in a network flow according to the application layer protocol type and discovered unknown protocols. It used only the payload information of 200,000 captured traffic flows and achieved good unknown protocol traffic identification accuracy.
In 2018, Fu et al. [30] proposed FlowCop, a scheme to detect traffic that does not belong to any predefined application in network traffic classification. It divided the test traffic into the predefined classes and one unknown class by building multiple one-class classifiers. A feature subspace algorithm was also proposed to select salient features for each class-specific classifier.
In 2019, Sabeel et al. [31] proposed two methods, based on DNN and LSTM, to predict unknown DoS and DDoS attacks. The work demonstrated how well deep learning-based methods perform in unknown situations and to what extent deviations from the trained model could be handled; the methods could effectively identify unknown attacks.
In 2019, Zhang et al. [32] proposed a network intrusion detection method based on open-set recognition. The method fitted the recognition results of known classes to a Weibull distribution and then built an Open-CNN model to estimate the probability of unknown classes from the activation scores of known classes, so as to detect unknown attacks.
In 2020, Zhang et al. [33] proposed an autonomous learning framework to correctly classify unknown classes. The framework efficiently updated deep learning-based traffic classification models during active operation. The core of the proposed framework consisted of a deep learning-based classifier, a self-learning discriminator, and an autonomous self-labeling model. The discriminator and self-labeling process generated new datasets during active operation to support classifier updates.
In 2020, Mohamed et al. [34] proposed a method for handling unknown applications. This method enabled efficient network classification with limited supervised training sets. The proposed model applied multiple neural network algorithms to predict unknown applications. The method improved Internet performance, reduced Internet traffic, and reduced delays in transmitting data.
In 2021, Wang et al. [35] proposed an unknown protocol parsing method based on a convolutional neural network. The protocol data was preprocessed into an image, and the converted image was inputted to the convolution layer for convolution. After convolution, the data was flattened, and the flattened data was put into a fully connected neural network to analyze and predict unknown protocols.
In 2021, Li et al. [36] proposed a lightweight unknown traffic discovery model, LightSEEN, which realized traffic classification and model update in the open world under practical conditions. The overall structure of the method was based on the Siamese network, and each side used a multihead attention mechanism, a one-dimensional convolutional neural network, and a residual network to facilitate the extraction of deep flow features and the convergence speed of the network.
In 2021, Xu et al. [37] proposed the KCC (Known Central Clustering) method to deal with the open-set-based intrusion detection problem. By introducing CD-loss (Class Distance-loss), the centers of different clusters were obtained. By introducing negative samples as unknown classes for training, the threshold of known classes was obtained. Unknown intrusions were rejected by comparing with fuzzy distances.
Table 2 gives a simple classification and summary of the above unknown traffic identification literature according to the specific methods used. We combine unknown traffic identification with open-set recognition. According to the characteristics of encrypted network traffic, we design a new open-set recognition method based on a discriminative model. Through self-supervised learning and supervised learning, it can effectively complete known traffic classification and unknown traffic identification at the same time.
3. Proposed Method
To identify unknown classes that have not appeared in the training set, we propose OpenCBD, an approach based on deep learning and ensemble learning. The CBD model was proposed in [12], but it can only complete classification tasks on datasets of known-class traffic. In this paper, the CBD model is regarded as an individual model. First, a loss function is used to train the individual models; then, through the ensemble strategy, the individual models are fused into an ensemble model, extending CBD to OpenCBD, which is suitable for open-set data and can classify known classes and identify unknown classes in the open world. The CBD model includes the CNN module, the BERT module, and the dense module; the detailed structure is given in Section 3.3. The overall process of OpenCBD is shown in Figure 1.

First, pretraining is performed on unlabeled data to obtain a pretrained individual model, including a CNN module with fixed structure and parameters and a BERT module with fixed structure but unfixed parameters. Then, data preprocessing is performed on the raw training data; the resulting training set is first used to train multiple individual models and then to train the ensemble model, yielding the training results. Finally, data preprocessing is performed on the raw test data, and the resulting test set is input into the trained ensemble model, which outputs the prediction results.
The training process includes two stages, namely, closed-set individual training and closed-set ensemble training. The testing process is implemented in the open set. The block diagram of the three stages is shown in Figure 2.

In the following, we will introduce the data preprocessing process, pretraining process, and detailed process of the three stages.
3.1. Data Preprocessing
Data preprocessing is an essential step to achieve the goals of classification and identification. Data preprocessing mainly includes four parts: traffic split, traffic cleaning, traffic conversion, and time interval integration. Following the description in [12], we summarize the preprocessing process as Algorithm 1.
Algorithm 1: Data preprocessing.
Line 2 is the traffic split. For each bidirectional flow, randomly intercept 10 consecutive data packets and define these ten packets as a flow. A total of $n$ flows are intercepted, that is, a total of $10n$ data packets.
Lines 4-9 are the traffic cleaning. Read the payload part of each packet, then unify the length: keep the first 256 bytes of each packet and pad with 0 where the payload is shorter, obtaining the raw sequence
$$S = (s_1, s_2, \ldots, s_{256}),$$
where each $s_i$ is one byte.
Line 10 is the traffic conversion. Convert the elements of the raw sequence byte by byte into decimal numbers to obtain a 256-dimensional vector
$$v = (v_1, v_2, \ldots, v_{256}), \quad v_i \in \{0, 1, \ldots, 255\}.$$
Lines 11-16 are the time interval integration, which was first proposed in [38]. According to the statistical results of the time interval between two adjacent data packets in different classes, a blank data packet is inserted wherever the interval exceeds 1 s, and intervals within 1 s are ignored. The blank data packet is represented as a 256-dimensional all-one vector $\mathbf{1} = (1, 1, \ldots, 1)$.
The payload vectors $v$ and the time interval vectors $\mathbf{1}$ are assembled into a set in chronological order, which can be directly input into the model in the next step.
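The following is a minimal sketch of this preprocessing in Python; the function names, the uint8 conversion, and the fixed flow length are illustrative assumptions consistent with the description above:

```python
import numpy as np

PKT_LEN = 256      # bytes kept per packet (Section 3.1)
FLOW_LEN = 10      # packets per flow
GAP_SECONDS = 1.0  # time-interval threshold for inserting a blank packet

def packet_vector(payload: bytes) -> np.ndarray:
    # Truncate/zero-pad the payload to 256 bytes, then convert each byte
    # to its decimal value.
    raw = payload[:PKT_LEN].ljust(PKT_LEN, b"\x00")
    return np.frombuffer(raw, dtype=np.uint8).astype(np.float32)

def flow_matrix(payloads, timestamps) -> np.ndarray:
    # Build the model input: payload vectors in chronological order, with an
    # all-one blank packet inserted wherever two adjacent packets are more
    # than 1 s apart.
    rows = [packet_vector(payloads[0])]
    for i in range(1, FLOW_LEN):
        if timestamps[i] - timestamps[i - 1] > GAP_SECONDS:
            rows.append(np.ones(PKT_LEN, dtype=np.float32))  # blank packet
        rows.append(packet_vector(payloads[i]))
    return np.stack(rows)
```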
3.2. Pretraining
Pretraining adopts a common pretraining method in the field of encrypted traffic analysis proposed in [12], which starts from the packet level and the flow level, and can directly deepen the model’s understanding of encrypted traffic from unlabeled real-world data. This paper summarizes the method into a detailed Algorithm 2.
Algorithm 2: Pretraining.
Lines 1-15 are the packet-based method. For an unlabeled packet, first extract the payload part to get $p$, and then calculate the entropy of each packet,
$$H(p) = -\sum_{i=0}^{255} P(b_i) \log_2 P(b_i),$$
where $P(b_i)$ is the frequency of byte value $b_i$ in the payload.
Set an entropy threshold $\theta$. When $H(p) \ge \theta$, the data packet is labeled as a ciphertext packet; when $H(p) < \theta$, it is labeled as a plaintext packet. Train a CBD model with the labeled packets.
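A minimal sketch of this entropy-based labeling, assuming Shannon entropy over byte values and an illustrative threshold (the text only states that a threshold is set):

```python
import math
from collections import Counter

def byte_entropy(payload: bytes) -> float:
    # Shannon entropy of the byte-value distribution of the payload.
    counts = Counter(payload)
    n = len(payload)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def label_packet(payload: bytes, threshold: float = 6.0) -> int:
    # 1 = ciphertext, 0 = plaintext. The threshold value 6.0 is an
    # illustrative assumption, not the paper's setting.
    return 1 if byte_entropy(payload) >= threshold else 0
```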
Lines 16-31 are the flow-based method. First, construct positive and negative sample sets. The positive sample set $S^{+}$ contains $m$ positive samples, and a positive sample is defined as a continuous flow
$$F^{+} = (p_1, p_2, \ldots, p_{10}),$$
where each flow contains 10 consecutive packets.
The size of the negative sample set $S^{-}$ is the same as that of the positive sample set $S^{+}$, including $m$ negative samples, and a negative sample is defined as a discontinuous flow $F^{-}$. It is obtained by transforming a positive sample: each packet in the positive sample is replaced with a packet from other flows with a certain probability, and the transformed sample is called a negative sample.
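A minimal sketch of this negative-sample construction (the replacement probability and the names are illustrative assumptions):

```python
import random

def make_negative(flow, packet_pool, replace_prob=0.3):
    # Replace each packet of a continuous flow with a random packet drawn
    # from other flows with probability replace_prob, yielding a
    # discontinuous flow. replace_prob is an assumption; the text only
    # says "a certain probability".
    return [random.choice(packet_pool) if random.random() < replace_prob
            else p for p in flow]
```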
Train a CBD model with a labeled set of positive and negative samples. After completing the pretraining, enter the three-stage training and testing process.
3.3. Closed-Set Individual Training
The first stage is closed-set individual training. After the data is preprocessed, the data of $K$ randomly selected known classes is used to train each individual model, which completes $K$-classification. The $K$ classes for each individual model are randomly selected with replacement. The specific structure of the individual model is shown in Figure 3.

3.3.1. Encoder
The encoder and classifier are the two important parts of the individual model. The encoder is mainly used for feature extraction; its input is a matrix $X$ and its output is a vector $z$,
$$z = E(X),$$
where $X$ is an $n \times 256$-dimensional matrix on $\mathbb{R}$ consisting of the 256-dimensional packet vectors produced by preprocessing. The encoder mainly includes the CNN module and the BERT module.
The CNN module is fed row vectors in sequence; that is, for the encoder input $X = (x_1, x_2, \ldots, x_n)^{\mathsf{T}}$, $x_i$ represents the $i$-th input of the CNN module, $i = 1, 2, \ldots, n$. Define $C(\cdot)$ as the CNN module function, $c_j(\cdot)$ as the convolution function of the $j$-th layer in the CNN module, $m_j(\cdot)$ as the max pooling function of the $j$-th layer in the CNN module, and $h_i^{(j)} = m_j(c_j(h_i^{(j-1)}))$ as the $j$-th layer output for the $i$-th input, with $h_i^{(0)} = x_i$; then, the output of the CNN module for the $i$-th input is
$$u_i = C(x_i) = h_i^{(J)},$$
where $J$ is the number of CNN layers.
After the CNN module is a fully connected layer, which is used to change the dimension of the vector to facilitate the input of the subsequent BERT module. Define $D(\cdot)$ to be the fully connected function and $d_i$ to be the output of the fully connected layer; then,
$$d_i = D(u_i) = W u_i + b,$$
where $W$ is the weight matrix and $b$ is the bias matrix.
This is followed by a Concat layer that stitches together the outputs of the CNN modules and fully connected layers. Define $\oplus$ as the splicing symbol and $g$ as the output of the Concat layer; then,
$$g = d_1 \oplus d_2 \oplus \cdots \oplus d_n.$$
The BERT module consists of $L$ transformer encoders, and $g$ is the input of the BERT module. Define $B(\cdot)$ as the BERT module function and $t_l(\cdot)$ as the $l$-th layer transformer encoder function in the BERT module, with $z$ the output of the encoder; then,
$$z = B(g) = t_L(t_{L-1}(\cdots t_1(g) \cdots)).$$
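A minimal PyTorch sketch of this encoder is given below; the channel sizes, $d_{\text{model}}$, head and layer counts, and the mean pooling that produces the flow-level vector are illustrative assumptions rather than the exact CBD configuration:

```python
import torch
import torch.nn as nn

class CBDEncoder(nn.Module):
    # Sketch of the CNN + BERT encoder described above, under assumed sizes.
    def __init__(self, pkt_len=256, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        # Per-packet 1-D CNN: convolution + max pooling, applied row by row.
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # Fully connected layer that changes the dimension for the BERT module.
        self.fc = nn.Linear(64 * (pkt_len // 4), d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.bert = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):                      # x: (batch, n_packets, 256)
        b, n, l = x.shape
        u = self.cnn(x.reshape(b * n, 1, l))   # per-packet CNN features
        d = self.fc(u.reshape(b * n, -1))      # dimension change
        g = d.reshape(b, n, -1)                # Concat: one row per packet
        z = self.bert(g)                       # stacked transformer encoders
        return z.mean(dim=1)                   # flow-level vector z (assumed pooling)
```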
3.3.2. Classifier
The classifier is mainly used to predict results; its input is the vector $z$ and its output is a specific class $\hat{y}$,
$$\hat{y} = G(z).$$
The classifier consists of a dense module and a Softmax layer. For the dense module, define $f_d(\cdot)$ to be the function of the dense module and $q$ to be the output of the dense module; then,
$$q = f_d(z) = W_d z + b_d,$$
where $W_d$ represents the weight matrix and $b_d$ represents the bias matrix.
The last is the Softmax layer. Defining $\sigma(\cdot)$ as the Softmax function, then
$$\sigma(q)_k = \frac{e^{q_k}}{\sum_{k'} e^{q_{k'}}},$$
where $k$ is the index of the class.
So the output of the entire individual model is
$$\hat{y} = G(E(X)) = \mathop{\arg\max}_k \sigma(f_d(B(C(X))))_k.$$
3.3.3. Loss Function
Individual models are trained with a combination of cross-entropy loss and II-Loss [19].
The cross-entropy loss function is often used in classification problems and measures the discrepancy between the predicted class distribution and the true labels. Given a batch of samples $\{X_1, X_2, \ldots, X_m\}$, for a matrix $X_i$, define its label as $y_i$; then, the cross-entropy loss of this batch of samples is
$$L_{CE} = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \mathbb{1}(y_i = k) \log P(y_i = k \mid X_i),$$
where $\mathbb{1}(y_i = k)$ is an indicator function that determines whether the label of $X_i$ is the class $k$, and $P(y_i = k \mid X_i)$ is a probability function that calculates the probability that the label of $X_i$ is the class $k$.
The II-Loss function makes samples of different classes farther apart and samples of the same class closer by maximizing the distance between different classes and minimizing the distance between samples and their class mean. Given a batch of samples $\{X_1, X_2, \ldots, X_m\}$, define the sample set of class $k$ as $C_k$, the number of samples in $C_k$ as $|C_k|$, and the mean output of the dense module over $C_k$ as
$$\mu_k = \frac{1}{|C_k|} \sum_{X_i \in C_k} q(X_i).$$
The II-Loss of this batch of samples is
$$L_{II} = \underbrace{\frac{1}{m} \sum_{k=1}^{K} \sum_{X_i \in C_k} \left\| \mu_k - q(X_i) \right\|_2^2}_{\text{intra-class spread}} \; - \; \underbrace{\min_{1 \le k < l \le K} \left\| \mu_k - \mu_l \right\|_2^2}_{\text{inter-class separation}}.$$
The final loss function is the combination of the two,
$$L = L_{CE} + L_{II}.$$
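A minimal PyTorch sketch of this combined loss, assuming the plain sum above (the batch handling and guards are illustrative):

```python
import torch
import torch.nn.functional as F

def ii_loss(q, labels, num_classes):
    # q: (batch, d) dense-module outputs; labels: (batch,) class indices.
    means, intra = [], q.new_tensor(0.0)
    for k in range(num_classes):
        mask = labels == k
        if mask.any():
            mu_k = q[mask].mean(dim=0)
            means.append(mu_k)
            # Squared distance of each sample to its own class mean.
            intra = intra + ((q[mask] - mu_k) ** 2).sum()
    intra = intra / q.size(0)
    if len(means) < 2:          # need two classes for a separation term
        return intra
    means = torch.stack(means)
    # Smallest squared distance between any two class means.
    pair = torch.cdist(means, means) ** 2
    off_diag = ~torch.eye(len(means), dtype=torch.bool, device=pair.device)
    return intra - pair[off_diag].min()

def combined_loss(logits, q, labels, num_classes):
    # Assumed combination: plain sum of cross-entropy and II-Loss.
    return F.cross_entropy(logits, labels) + ii_loss(q, labels, num_classes)
```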
3.4. Closed-Set Ensemble Training
The second stage is closed-set ensemble training. After the data is preprocessed, the ensemble model is trained with the data of all $N$ known classes to complete $N$-classification. The ensemble model adopts a stacking-style strategy, which combines individual learners by training a further learner on their outputs. First, the encoders in the individual models are trained using subsets of the raw training dataset. Then, the ensemble model is trained on the raw training set, using the outputs of the encoders as features. See Algorithm 3 for the ensemble strategy.
Algorithm 3: Ensemble strategy.
The training of the ensemble model takes $k$-fold cross-validation as an example and divides the known-class dataset $D$ and its subset $D'$ into $k$ datasets $\{D_1, \ldots, D_k\}$ and $\{D'_1, \ldots, D'_k\}$, respectively. Let $D'_j$ and $\bar{D}'_j = D' \setminus D'_j$ be the test set and training set corresponding to the $j$-th individual model execution, respectively, and let $D_j$ and $\bar{D}_j = D \setminus D_j$ be the test set and training set corresponding to the $j$-th ensemble model execution, respectively. Given $T$ individual learning algorithms, use the $t$-th learning algorithm to train on $\bar{D}'_j$ to obtain an individual learner $h_t^{(j)}$. For each sample $x$ in the test set of the $j$-th execution, let $z_t = h_t^{(j)}(x)$ be the output of learner $h_t^{(j)}$ on $x$. Then, at the end of the entire cross-validation process, a new dataset $D'' = \{((z_1, z_2, \ldots, z_T), y)\}$ is generated by the $T$ individual learners; the ensemble learner will use this dataset for training, together with the data that did not participate in the training of the individual models.
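A minimal sketch of this stacking procedure with scikit-learn's KFold; the callable interfaces and names are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold

def out_of_fold_features(X, y, fit_encoder, k=5):
    # One individual learner's out-of-fold features under k-fold CV.
    # fit_encoder(X, y) is assumed to return an object with a
    # .transform(X) method producing feature vectors.
    oof = None
    for tr, te in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        enc = fit_encoder(X[tr], y[tr])
        z = enc.transform(X[te])
        if oof is None:
            oof = np.zeros((len(X), z.shape[1]), dtype=z.dtype)
        oof[te] = z
    return oof

def stacking_dataset(X, y, fit_encoders, k=5):
    # Concatenate out-of-fold features from every individual learner; the
    # ensemble classifier is then trained on this new dataset.
    cols = [out_of_fold_features(X, y, f, k) for f in fit_encoders]
    return np.concatenate(cols, axis=1), y
```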
Lines 1-4 are the training of the individual models, which only use subsets of the known-class data. Lines 5-12 are the training of the ensemble model, which uses all of the known-class data. The ensemble model consists of $T$ encoders and 1 classifier. The encoders are obtained from the first stage of training, and their structures and parameters remain fixed. The classifier has the same structure as in the first stage, but its parameters are retrained. Define $X$ as the input of the ensemble model, $\hat{y}$ as the output, $f'_d(\cdot)$ as the dense module function, $q'$ as the dense module output, and $\sigma'(\cdot)$ as the Softmax function; then,
$$q' = f'_d\left(E_1(X) \oplus E_2(X) \oplus \cdots \oplus E_T(X)\right), \qquad \hat{y} = \mathop{\arg\max}_k \sigma'(q')_k,$$
where $T$ is the number of encoders in the ensemble model, each encoder $E_t$ comes from the corresponding individual model with its coefficients unchanged, and $f'_d$ and $\sigma'$ have the same structure as the dense module function and Softmax function in the individual model but with different coefficients. $k$ is the index of the class.
3.5. Open-Set Testing
The third stage is open-set testing. The real-world data is input into the trained ensemble model, and the ensemble model produces the results to be identified. Compute the maximum of the results: if the maximum value is greater than or equal to the threshold, the test sample is classified as one of the $N$ known classes; otherwise, it is classified as unknown.
3.5.1. Threshold Estimation
The threshold for judging data classes in open-set testing is determined by outliers. Assume that 1% of the samples in the training set are noise, that is, abnormal outlier samples. Compute the outlier distance for all training samples and sort the distances in ascending order; the distance at the boundary of the largest 1% is taken as the threshold. The detailed process of threshold estimation is shown in Algorithm 4.
Algorithm 4: Threshold estimation.
After the threshold is determined, the to-be-identified outputs of the ensemble model determine the class of each sample according to whether the output is greater than or equal to the threshold.
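A minimal sketch of this threshold estimation and the subsequent open-set decision; how the outlier distance and the ensemble output are made commensurable is not fully specified above, so the two steps are kept abstract, and the distance definition, inputs, and the label -1 for unknown are illustrative assumptions:

```python
import numpy as np

def estimate_threshold(embeddings, own_class_means, contamination=0.01):
    # Distance of each training sample to the mean of its own class, sorted
    # ascending; the boundary of the largest 1% (assumed noise) is the threshold.
    dists = np.linalg.norm(embeddings - own_class_means, axis=1)
    return np.quantile(dists, 1.0 - contamination)

def open_set_decide(scores, threshold):
    # scores: (n_samples, n_known_classes) ensemble outputs. A sample whose
    # maximum output fails the threshold is labeled -1, i.e., unknown.
    pred = scores.argmax(axis=1)
    pred[scores.max(axis=1) < threshold] = -1
    return pred
```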
4. Experiment and Evaluation
This section mainly introduces the specific experimental content designed for the proposed model OpenCBD, including experimental environment and settings, evaluation metrics, and specific results. And then, the results are discussed to verify the validity of the OpenCBD model.
4.1. Experiment Settings
The encrypted traffic data in this paper is selected from the public dataset ISCXVPN2016 [39]. 8 classes of data are selected as known classes, 5 classes of data are selected as unknown classes, and the intersection of known and unknown data is empty. The 13 classes of data include both Virtual Private Network (VPN) traffic that has been encapsulated by the VPN protocol and non-VPN traffic that has not been encapsulated by the VPN. 1000 samples are randomly selected for each class, and each sample contains 10 consecutive data packets. The specific data classes are shown in Table 3.
During the experiment, 1000 samples of each class were randomly divided into the training set and test set, with training set samples : test set samples = 9 : 1; that is, 100 samples of each class were randomly selected as the test set, and the remaining 900 samples were the training set. The training set contains training data and validation data, with training data : validation data = 9 : 1.
The experimental environment is a personal desktop, and the specific equipment information is shown in Table 4.
All codes in the experiment are run in the environment of Python 3.6.5, and the specific hyperparameter settings are shown in Table 5.
The rest of the unmentioned hyperparameters are set according to the default values of the corresponding models in Python.
4.2. Evaluation Metrics
When evaluating model performance, the class of interest is usually the positive class, and the other classes are the negative classes. Evaluation metrics are usually formulated from the following four basic situations:
(1) True positive (TP): a positive-class sample is predicted as the positive class.
(2) False positive (FP): a negative-class sample is predicted as the positive class.
(3) True negative (TN): a negative-class sample is predicted as the negative class.
(4) False negative (FN): a positive-class sample is predicted as the negative class.
From the above four basic situations, four basic evaluation metrics can be obtained:
(1) The probability of classifying positive samples into the positive class is called TPR, also known as recall or sensitivity:
$$TPR = \frac{TP}{TP + FN}.$$
(2) The probability of classifying negative samples into the negative class is called TNR, also known as specificity:
$$TNR = \frac{TN}{TN + FP}.$$
(3) The probability of misclassifying negative samples into the positive class is called FPR, also known as the false recognition rate:
$$FPR = \frac{FP}{FP + TN}.$$
(4) The probability of misclassifying positive samples into the negative class is called FNR, also known as the rejection rate:
$$FNR = \frac{FN}{FN + TP}.$$
In addition, there are several other commonly used metrics:
(1) Precision refers to the proportion of true positive samples among all samples predicted as positive:
$$Precision = \frac{TP}{TP + FP}.$$
(2) F1-score considers precision and recall comprehensively and is the harmonic mean of precision and recall:
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}.$$
(3) Accuracy refers to the proportion of correctly predicted samples among all samples:
$$Accuracy = \frac{N_{correct}}{N_{total}},$$
where $N_{correct}$ represents the number of correctly predicted samples and $N_{total}$ represents the total number of samples.
(4) In the multiclassification problem, the F1-score, precision, and recall are all macro-averaged; that is, the metric of each class is calculated first, and then the unweighted average is taken to obtain the final metric:
$$Macro\text{-}P = \frac{1}{K} \sum_{k=1}^{K} Precision_k, \qquad Macro\text{-}R = \frac{1}{K} \sum_{k=1}^{K} Recall_k, \qquad Macro\text{-}F1 = \frac{1}{K} \sum_{k=1}^{K} F1_k,$$
where $K$ represents the number of classes.
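A minimal sketch computing these macro-averaged metrics (per-class F1 averaged, per the macro definition above; the names are illustrative):

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_classes):
    # Per-class precision/recall/F1 from confusion counts, then the
    # unweighted mean across classes (macro averaging).
    ps, rs, f1s = [], [], []
    for k in range(n_classes):
        tp = np.sum((y_pred == k) & (y_true == k))
        fp = np.sum((y_pred == k) & (y_true != k))
        fn = np.sum((y_pred != k) & (y_true == k))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        ps.append(p); rs.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    acc = float(np.mean(y_true == y_pred))
    return acc, float(np.mean(ps)), float(np.mean(rs)), float(np.mean(f1s))
```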
The evaluation metrics of this experiment select accuracy, precision, recall, and F1-score of binary classification and multiclassification.
4.3. Results and Discussions
According to the detailed experimental setting in Section 4.1 and the four evaluation metrics in Section 4.2, we present the specific experimental results in this section. The first is the experimental results of OpenCBD when doing 2-class classification tasks.
As can be seen from Figure 4 and Table 6, OpenCBD performs best when 7 known classes are randomly selected in the first stage of training. Because there are 8 known classes, each individual model randomly selecting 7 classes for training enables it to better learn the differences between the known classes and makes the model perform better. On the whole, increasing the number of encoders helps to improve the performance of the model, but too many encoders also lead to excessive time overhead while the performance improvement is no longer obvious. In Figure 4 and Table 6, there is no result for the combination of 7 randomly selected known classes and 12 encoders. The reason is that there are only $\binom{8}{7} = 8$ possibilities for selecting 7 classes from the 8 known classes, so integrating 12 encoders would necessarily produce repetitions.

Besides distinguishing known traffic from unknown traffic, OpenCBD can also distinguish among the known classes.
It can be seen from Figure 5 and Table 7 that OpenCBD performs best when 7 known classes are randomly selected in the first stage of training. The 9-class accuracy is over 72%, precision is over 76%, and recall and F1-score are over 75%.

As can be seen from Figure 6, the probability of Facebook-audio being wrongly classified into the unknown class is higher than that of the other 7 classes. Since 5 classes of data are treated as one unknown class, the number of unknown-class samples predicted as unknown is the largest, and the corresponding cell has the warmest color (red).

In addition, we design a different random selection method for the first stage of training: a portion of the data of each of the 8 known classes is randomly selected. The results are as follows.
As can be seen from Figure 7 and Table 8, the results of this training method are not much different from the original results, but the overall results are slightly higher, all above 70%, with the highest accuracy exceeding 77% and precision even exceeding 83%. Since all 8 known classes contribute data to the training of the individual models in this method, the ensemble model learns each known class more deeply.

As can be seen from Figure 8 and Table 9, when OpenCBD is doing 9-class classification tasks, the values of the 4 evaluation metrics increase with the increase of the number of encoders, and the results are better when the percentage of known classes is 70%. The highest accuracy can reach 73.61%, the highest precision is over 80%, the highest recall is nearly 75%, and the highest F1-score is 65.85%.

When the individual training selects 70% of the training set and the number of encoders is 12, the metrics of the two classes and nine classes are the highest. Therefore, we choose this ensemble method and compare the area under the ROC curve (receiver operating characteristic curve) of identifying only unknown classes and identifying both known and unknown classes at the same time. The results are shown in Figure 9.

When verifying the performance of the OpenCBD model, we also selected several classic open-set recognition models for comparative experiments. We choose 3 models, namely, the threshold-based Softmax model [15], OpenMax model [15], and II-Loss-based model [19].
It can be seen from Figure 10 that the values of the four metrics in the 2-class classification of the OpenCBD model are basically around 80%, while those of the other three models are around 70%; our OpenCBD outperforms the other 3 models by around 10% in the 2-class classification tasks. It can be seen from Figure 11 that in the 9-class classification, the metrics of the OpenCBD model are between 75% and 80%, while those of the other three basically do not exceed 70%; our OpenCBD outperforms the other 3 models by around 5%-10% in the 9-class classification tasks. Figure 12 compares the areas under the ROC curves of the four models more clearly; the larger the area, the better the effect, and OpenCBD is significantly better than the other three.



5. Conclusion
In this paper, we proposed a novel model that can simultaneously identify unknown traffic and classify known traffic. The model could be trained on the known traffic of a closed set and tested on the network traffic of an open set. The model first combined a convolutional neural network and a transformer encoder to construct a deep learning-based architecture. Then, the general pretraining method from the field of encrypted traffic analysis was used to pretrain on unlabeled traffic data, so that the model could learn the basic characteristics of encrypted traffic. Next, according to the characteristics of the open set, a three-stage training and testing process was designed. During training, the cross-entropy loss function suitable for classification and the II-Loss function suitable for clustering were chosen. At the same time, using the idea of ensemble learning, a traffic identification model OpenCBD based on open-set recognition was constructed. For real-world traffic data, if a sample belonged to a known class, the class to which it belonged could be identified, and if it belonged to an unknown class, it could be identified as unknown. Experiments were carried out on public datasets: 8 classes of data were selected as known classes, 5 classes of data were selected as unknown classes, and a class-balanced dataset was constructed to eliminate possible human bias to the greatest extent.
In future work, we can consider clustering unknown classes of encrypted traffic and further separating unknown traffic according to its characteristics, which will facilitate the discovery and study of new traffic classes.
Data Availability
This paper uses the ISCX public traffic dataset which is available in [39].
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant 61772548.