Abstract
In recent years, an increasing number of mobile platforms and applications have adopted traffic encryption protocols to ensure privacy and security. Existing research on encrypted traffic identification often relies on a single-modal feature pattern (such as packet sequences or statistical features), which cannot fully represent the detailed information of complex traffic, so the resulting predictions are susceptible to anomalies. To improve the classification of encrypted app traffic, we propose FusionTC, a novel app traffic classification framework based on feature fusion of flow sequences. FusionTC consists of two levels of subclassifiers, which perform decision-level fusion of multimodal features by an upgraded stacking method. The comprehensive capture and fusion of multimodal traffic details, coupled with refined traffic preprocessing and segmentation, enable FusionTC to significantly improve classification accuracy and enhance robustness in challenging situations. On our self-built app traffic dataset, FusionTC improves the accuracy by at least 3.2% over the state-of-the-art approaches.
1. Introduction
In recent years, in order to protect user privacy and prevent interception, an increasing number of mobile platforms and applications have adopted network traffic encryption protocols, and the proportion of encrypted traffic on the mobile Internet continues to grow. At present, most mobile applications use encrypted transmission such as HTTPS. Mainstream mobile operating systems increasingly force applications to use encryption when establishing connections. For example, iOS 9 began to enable the App Transport Security feature by default for applications [1]. According to statistics from 2019, more than 80% of Android applications used encrypted network connections by default [2]. Encryption proxy technologies (such as Shadowsocks and V2Ray) are also widely applied on mobile devices. Traffic encryption technology, however, is also being abused by criminals. Illegal cyber activities have become rampant because traffic encryption allows attackers to hide their identity and thus evade network security supervision. Current illegal cyber activities include cyberattacks, malicious mining, illegal trading on the dark web, personal information theft, and cyber extortion, all of which seriously threaten mobile Internet security.
While protecting user privacy, traffic encryption poses great challenges to mobile Internet management. Because traffic is encrypted, conventional technologies based on content pattern matching, such as deep packet inspection, cannot be used to monitor and identify app users' behavior. In the current mobile Internet era, accurately discovering and classifying encrypted app traffic has become a hot issue in cybersecurity research, and with the trend toward network-wide traffic encryption, strengthening mobile Internet management is an urgent technical problem to be addressed.
1.1. App Encrypted Traffic Analysis and Related Works
The main goal of app traffic classification technology is to analyze traffic characteristics (usually of encrypted traffic), extract the communication fingerprint of an app, and build models to classify active app traffic. The identification of mobile encrypted traffic is roughly similar to that of desktop encrypted traffic (such as website fingerprinting). The rationale of encrypted traffic classification is that, although we cannot directly analyze and recover the original content of encrypted traffic, we can use the sequence or statistical characteristics of the traffic to characterize it and thus achieve classification. The problem of app traffic classification is harder than that of web traffic. When an app is loaded, most of its interface resources are usually fetched from the local cache instead of the Internet, which weakens the effect of traffic characterization to some extent. In contrast, when a desktop web page is loaded, all page resources are usually retrieved online, which facilitates the extraction and analysis of encrypted traffic characteristics. Additionally, compared with desktop web pages, apps may call more third-party external libraries.
In recent years, as shown in Table 1, app encrypted traffic classification has received extensive attention in academia. (1) Early app traffic classification methods used DPI technology to analyze the content of plaintext traffic. Dai et al. proposed the NetworkProfiler framework [3], which can detect and analyze HTTP packet content but is ineffective on encrypted traffic. With the gradual increase of encrypted traffic on mobile terminals, app identification for encrypted traffic has become the mainstream research direction. (2) Most app encrypted traffic classification methods identify different apps by extracting and learning features of traffic flows. Taylor et al. proposed AppScanner [4], an effective identification method for encrypted app traffic, which extracts statistical features and classifies them with machine learning methods. In their follow-up research [5], a reinforcement learning method was added to AppScanner, greatly improving its ability to reduce false positives. In [6], van Ede et al. proposed a semisupervised app fingerprinting method for encrypted network traffic, which automatically finds temporal correlations among destination-related features of network traffic and uses these correlations to generate app fingerprints. In [7], Li et al. adopted a sliding-window-based approach to divide the encrypted traffic stream into a sequence of segments corresponding to different app activities. In [8], Sengupta et al. exploited diversity in Android TLS implementations for app traffic classification, and their classifier achieved high accuracy. In [9], Markov models are trained on message-type and length-block sequences, and the probabilities of all the applications are concatenated as the fingerprints for classification. In [10], a system for identifying application traffic over Shadowsocks is proposed, which adds a sliding-window JS divergence feature on top of traditional statistics and distributions of packet length and timestamp. In [11, 12], solutions for encrypted traffic identification are proposed for specific scenarios such as instant messaging and multiple mobile phone sources, respectively. (3) Deep learning is increasingly used for encrypted traffic identification, as it enables automatic feature extraction and an end-to-end learning process. In [13, 14], deep learning models (such as CNN and LSTM) are used for website fingerprinting. In [15, 16], deep learning models are used in the Tor environment for flow correlation and app recognition, respectively. In [17], Liu et al. proposed an end-to-end classification model that learns representative features from raw flows based on deep learning. In [18], deep learning methods are proposed for mobile app identification that automate the feature engineering process in an end-to-end fashion. In [19], Lin et al. proposed a new traffic representation model that pretrains deep contextualized datagram-level representations from large-scale unlabeled data. (4) Researchers have noticed that single-modal features cannot reflect all the detailed characteristics of traffic and have begun to study multimodal feature learning. In [20–22], Aceto et al. and Nascita et al. were among the first to put forward an effective multimodal learning method for encrypted traffic and continuously improved it, successively proposing new methods based on deep learning and explainable artificial intelligence.
In [23], Lin et al. proposed using a self-attention mechanism to learn multimodal features of biflows.
1.2. Limitations of Existing Work
The current research on app encrypted traffic classification has made great progress. However, most studies choose a single modality of traffic feature as the input of the classifier model. For example, some deep learning-based studies prefer the packet length sequence of network flows as input features, without considering other kinds of features such as packet arrival time, packet type, and packet statistics. It is known that encrypted traffic carries very rich heterogeneous information. Relying on only one feature modality causes a loss of traffic information, which may degrade the classification results, especially when traffic flows with different labels happen to be very similar in that modality but quite different in others.
Existing multimodal learning methods effectively improve the performance of encrypted traffic classification models; most of them fall into two types: early fusion based on deep learning and late fusion based on ensemble machine learning. However, multimodal model architectures based on deep neural networks are usually complex, which means higher computational overhead and hardware requirements. They also place more restrictions on the selection of modal features and are not easy to extend flexibly. Additionally, deep learning models require more training samples. The fusion approaches based on ensemble learning may not fully exploit the multimodal features, relying only on a simple combination of prediction results with manual strategies and lacking deep mining of the relationships among multimodal results. Therefore, it is necessary to design a new multimodal method that is more operable, lightweight, and automatic, while ensuring the accuracy and robustness of the model.
To address the problems of existing methods, we aim to study the effective analysis and extraction of multimodal information from encrypted traffic and to design an improved classification model that realizes automatic and efficient fusion of multimodal prediction results, thus enhancing the classification performance on app encrypted traffic.
1.3. Our Contributions
In this work, we propose a novel classification framework based on multimodal learning of app traffic flow sequences. The framework aims to improve the classification performance of existing methods. We briefly summarize our contributions as follows:
(1) To address the problem that single-modal features cannot fully characterize flow characteristics, we design a novel classification framework, FusionTC, which leverages a multimodal decision-level fusion method based on an upgraded stacking [24] algorithm. FusionTC consists of a two-layer classifier structure, which can automatically extract traffic features of multiple modalities and automatically make decision-level fusion predictions.
(2) Compared with the existing multimodal fusion methods based on deep learning [21–23], our method has a simpler structure, less computational overhead, and lower hardware requirements. Compared with the existing method based on ensemble machine learning [20], instead of simply combining the multiclassification results directly, our upgraded stacking method introduces a secondary classifier as a trainable combiner that automatically learns the relationships among the soft outputs of different modalities from a global view and makes intelligent decision-level fusion, thus promoting the accuracy and robustness of the model.
(3) In order to better train the classifier, we design a framework for the automatic synthesis and collection of app encrypted traffic. With this framework, we are able to quickly build our own dataset covering mainstream apps.
(4) We conduct multiple experiments to evaluate and analyze the prediction results of the classification framework. Experiments demonstrate that FusionTC performs better than existing methods.
1.4. Threat Model
In our threat model, the adversary wants to identify apps running on phones from encrypted traffic. That is, the adversary collects mobile phone traffic and then analyzes it for traffic classification. We assume that the adversary sits at the gateway of the local area network where the target mobile phone is located and is capable of monitoring all incoming and outgoing traffic of the network. Take the following scenario as an example: the adversary has full control over the Wi-Fi hotspot the target phone is connected to, and thus, all the Internet traffic generated by the apps running on the phone is available to him. We also assume that the adversary is able to sift out the traffic generated by the target phone from the mixed traffic in the network (this kind of traffic filtering technology is mature and easy to implement).
The rest of this paper is organized as follows. In Section 2, we introduce our methodology of multimodal fusion classification and elaborate on the architecture and algorithm implementation of FusionTC. In Section 3, we set up the experimental environment, conduct verification experiments, analyze the experimental results, and demonstrate the advantages of FusionTC. Finally, we conclude this paper in Section 4.
2. Methodology of Multimodal Fusion Classification
We design a multimodal fusion classification method, FusionTC, specifically for app encrypted traffic analysis; its framework is shown in Figure 1. In order to better train the classifier, FusionTC has a built-in traffic preprocessing mechanism (including traffic denoising and background traffic removal) to keep irrelevant traffic from affecting training and classification. FusionTC analyzes encrypted traffic in units of flows, and it can automatically segment and parse the input traffic into a format that the classifier can recognize. To address the problem that network traffic characteristics of a single modality cannot comprehensively represent the traffic, FusionTC automatically extracts traffic features of multiple modalities and makes decision-level fusion predictions to obtain more accurate results. We consider traffic characteristics from at least 3 kinds of modalities, design 2 different classifiers for each modality, and use a two-layer stacking architecture to construct a multimodal classification model based on decision-level fusion. The following evaluation shows that this multimodal classification model achieves outstanding classification performance.

2.1. Traffic Preprocessing
For incoming traffic used for model training or prediction, the first step is to preprocess the traffic to remove noise. When constructing training samples for the target app, we only run one app at a time and expect to obtain pure traffic. However, we still cannot guarantee that the collected traffic is pure, because operating systems such as Android often run background services and applications, and a small amount of traffic generated by them is unavoidably mixed into the target app traffic. In addition, abnormal network flows due to network congestion, service interruption, etc., will also interfere with model training and prediction. This kind of noise traffic (such as TCP retransmissions and incomplete TCP sequences) also needs to be removed. The preprocessing methods are as follows:
(1) Background Traffic Removal. When constructing a dataset, we need to identify the background traffic of the operating system. We collect the background traffic of various Android OS versions when no app is running and extract the traffic quintuple information (IP address, port, and timestamp). We then use this quintuple information to filter the background traffic out of the normal traffic. Although a small amount of normal traffic will be removed by mistake, the impact on the model is minimal.
(2) Abnormal Traffic Processing. For the case of network congestion, we first remove TCP retransmission packets from the input traffic; these packets are highly repetitive and meaningless for characterizing app traffic. We also check whether there are incomplete TCP streams in the traffic (such as streams missing handshake packets), and if so, they are also removed. An incomplete traffic flow would greatly interfere with the model's identification of traffic characteristics.
After the above processing, we can largely ensure that the traffic input into the model is pure and valid and better represents the traffic characteristics of the app.
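To make the two preprocessing steps concrete, the following is a minimal sketch of how they could be implemented with Scapy on TCP traffic in a PCAP file; the background endpoint set, helper names, and the retransmission/handshake heuristics are illustrative simplifications rather than our exact implementation.

```python
# Minimal preprocessing sketch (assumptions: Scapy for parsing; a pre-collected
# set of OS background server endpoints; simplified heuristics).
from scapy.all import rdpcap, IP, TCP

BACKGROUND_ENDPOINTS = {("203.0.113.10", 443)}  # hypothetical Android background servers

def is_background(pkt):
    """Drop packets whose destination matches a known OS background endpoint."""
    return (pkt[IP].dst, pkt[TCP].dport) in BACKGROUND_ENDPOINTS

def clean_pcap(path):
    packets = [p for p in rdpcap(path) if IP in p and TCP in p]
    packets = [p for p in packets if not is_background(p)]

    # Drop TCP retransmissions: keep only the first packet seen for each
    # (direction 5-tuple, sequence number, payload length) combination.
    seen, deduped = set(), []
    for p in packets:
        key = (p[IP].src, p[IP].dst, p[TCP].sport, p[TCP].dport,
               p[TCP].seq, len(bytes(p[TCP].payload)))
        if key not in seen:
            seen.add(key)
            deduped.append(p)

    # Drop incomplete TCP streams: keep only flows for which a SYN was observed.
    syn_flows = {(p[IP].src, p[IP].dst, p[TCP].sport, p[TCP].dport)
                 for p in deduped if int(p[TCP].flags) & 0x02}
    return [p for p in deduped
            if (p[IP].src, p[IP].dst, p[TCP].sport, p[TCP].dport) in syn_flows
            or (p[IP].dst, p[IP].src, p[TCP].dport, p[TCP].sport) in syn_flows]
```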
2.2. Flow Grouping and Segmentation
The next step is to group and segment each captured app traffic trace into basic units that match the input of the classifier.
(1) Traffic Flow Grouping. We define a traffic flow as a set of traffic packets sharing the same 5-tuple (source address, destination address, source port, destination port, and protocol). The generation of a flow is largely related to a certain operation of the app, such as accessing a specific external API or downloading a configuration file. One flow can be expressed as a sequence of packets F = (p1, p2, ..., pn), where pi represents one packet. We use bidirectional flows in the proposed approach, since bidirectional traffic flows represent the operating behavior of the app well. The following processes (such as feature extraction and prediction) work on traffic flows.
(2) Burst Traffic Segmentation. Sometimes a network flow spans a very long time, and it is necessary to further segment it. In this case, the traffic flow consists of several bursts, with a large time interval between two bursts. Each burst of a traffic flow represents a more fine-grained app operation behavior. We segment the traffic flow as follows: for a traffic flow F, we observe the interval between every two consecutive packets and split F wherever the interval exceeds the burst threshold Tb, obtaining multiple smaller burst flows B1, B2, ..., Bm. We classify and label the traffic at the flow level, and different bursts aggregated from the same flow are labeled the same.
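The sketch below illustrates, under simplifying assumptions, how bidirectional flow grouping and burst segmentation could be done on a list of packet records; the tuple layout (timestamp, src, dst, sport, dport, proto, length) is our own convention, and the 2 s threshold matches the value chosen in Section 3.4.2.

```python
# Flow grouping by direction-agnostic 5-tuple and burst segmentation by time gap.
# Packet record layout assumed: (timestamp, src, dst, sport, dport, proto, length).
from collections import defaultdict

BURST_THRESHOLD = 2.0  # seconds (Tb in the text)

def flow_key(pkt):
    """Direction-agnostic 5-tuple so both directions fall into one bidirectional flow."""
    ts, src, dst, sport, dport, proto, length = pkt
    a, b = (src, sport), (dst, dport)
    return (proto,) + (a + b if a <= b else b + a)

def group_flows(packets):
    flows = defaultdict(list)
    for pkt in sorted(packets, key=lambda p: p[0]):   # order by timestamp
        flows[flow_key(pkt)].append(pkt)
    return flows

def split_bursts(flow, threshold=BURST_THRESHOLD):
    """Cut a flow into bursts wherever the inter-packet gap exceeds the threshold."""
    bursts, current = [], [flow[0]]
    for prev, cur in zip(flow, flow[1:]):
        if cur[0] - prev[0] > threshold:
            bursts.append(current)
            current = []
        current.append(cur)
    bursts.append(current)
    return bursts
```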
2.3. Multimodal Feature Extraction
Next, we extract the multimodal features of the traffic flow. Based on decision-level fusion of multimodal features, FusionTC can comprehensively learn heterogeneous information in encrypted traffic, effectively improving the accuracy and robustness of the classifier. Heterogeneous features in encrypted traffic mainly include flow payload information, raw packet sequence information, packet distribution, and statistical information. A single-modal feature often cannot fully represent all the characteristics of the traffic flow. Only by combining them organically can the representation effect be improved. Partially referring to methods in [4, 10], we pick the features of the following kinds of modalities.
2.3.1. Raw Packet Sequence
(i) Bidirectional Packet Length Sequence. We can directly extract the raw bidirectional packet length sequence from a flow as a traffic feature. We set N as the fixed length of the sequence. If the number of packets in the traffic flow is greater than N, the excess packets are truncated; if the sequence is shorter than N, it is padded with 0 so that the sequence length is always N. The packet length sequence contains native information of the traffic and characterizes the flow shape well. It can be expressed as S = (l1, l2, ..., lN), where li is the length of the i-th packet; a value greater than 0 denotes a sent packet, and a value less than 0 denotes a received packet.
(ii) Packet Time Interval Sequence. Similar to the bidirectional packet length sequence, we take the arrival time interval between every two consecutive packets as a sequence and also fix its length to N. If the number of packets in the traffic flow is greater than N, the excess packets are likewise truncated; if the sequence is shorter than N, it is padded with 0 so that the sequence length is always N. It can be expressed as T = (t1, t2, ..., tN), where ti is the i-th packet arrival time interval.
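As an illustration, a minimal sketch of these two sequence modalities follows, reusing the packet record layout from the grouping sketch above; N = 100 matches the sequence length used later in Section 3.6, and the helper names are our own.

```python
# Signed packet length sequence and inter-arrival time sequence, truncated or
# zero-padded to a fixed length N.
def pad_or_truncate(seq, n):
    return list(seq[:n]) + [0.0] * max(0, n - len(seq))

def length_sequence(flow, client_ip, n=100):
    # Positive lengths for packets sent by the client, negative for received ones.
    signed = [pkt[6] if pkt[1] == client_ip else -pkt[6] for pkt in flow]
    return pad_or_truncate(signed, n)

def interval_sequence(flow, n=100):
    intervals = [cur[0] - prev[0] for prev, cur in zip(flow, flow[1:])]
    return pad_or_truncate(intervals, n)
```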
2.3.2. Packet Distribution
(i) Packet Length Frequency Distribution. To reflect the distribution of packet lengths in a traffic flow, we divide the packet lengths into K ranges. Referring to [10], each range is defined as [(k-1)d, kd), where d is the range width. In each range, the frequency of the packets falling into it is counted. We define the distribution sequence as D = (f1, f2, ..., fK). The larger K is, the stronger the feature representation ability. According to the maximum packet length M in the datasets, the range width is calculated as d = M / K. We usually set M to 1500, the typical Ethernet MTU.
(ii) Time Frequency Distribution Feature. We also extract the distribution of packet time interval values. According to the distribution characteristics of packet time intervals, the ranges are set to grow exponentially, which means each range is much wider than the previous one. We count the frequency of packets falling into every range to obtain the time frequency distribution feature vector.
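A sketch of the two distribution modalities follows; the number of bins K, the maximum length M = 1500, and the exponential time bin edges are illustrative choices rather than the exact values used in our experiments.

```python
# Packet length frequency distribution over K equal-width bins of width d = M / K,
# and inter-arrival time frequencies over exponentially growing bins.
import numpy as np

def length_distribution(flow, k=30, max_len=1500):
    lengths = [abs(pkt[6]) for pkt in flow]
    hist, _ = np.histogram(lengths, bins=k, range=(0, max_len))
    return hist / max(len(lengths), 1)            # frequencies instead of raw counts

def time_distribution(flow, edges=(0, 1e-3, 1e-2, 1e-1, 1, 10, 60)):
    intervals = [cur[0] - prev[0] for prev, cur in zip(flow, flow[1:])]
    clipped = np.clip(intervals, 0, edges[-1])    # fold long gaps into the last bin
    hist, _ = np.histogram(clipped, bins=list(edges))
    return hist / max(len(intervals), 1)
```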
2.3.3. Packet Statistical Feature
The statistical feature is a critical modality of traffic, which characterizes the traffic macroscopically and purposefully. Referring to the method in [4], we extract a statistical feature vector of 54 dimensions. For each flow, we consider three series: the sizes of incoming packets, the sizes of outgoing packets, and the sizes of both incoming and outgoing packets. For each series (3 in total), we compute the following values: minimum, maximum, mean, median absolute deviation, standard deviation, variance, skewness, kurtosis, the percentiles from 10% to 90% (in steps of 10%), and the number of elements in the series (18 values per series).
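The following sketch shows one way to compute this 54-dimensional vector with NumPy and SciPy; the decomposition into 18 statistics (8 moments/extremes, 9 percentiles, and the series length) follows our reading of [4], and the helper names are illustrative.

```python
# 18 statistics per packet-size series x three series (incoming, outgoing, both) = 54 dims.
import numpy as np
from scipy import stats

def series_stats(sizes):
    x = np.asarray(sizes, dtype=float)
    if x.size == 0:
        return [0.0] * 18
    percentiles = list(np.percentile(x, range(10, 100, 10)))     # 10%, 20%, ..., 90%
    return [x.min(), x.max(), x.mean(),
            float(np.median(np.abs(x - np.median(x)))),          # median absolute deviation
            x.std(), x.var(),
            float(stats.skew(x)), float(stats.kurtosis(x)),
            *percentiles, float(x.size)]                         # 8 + 9 + 1 = 18 values

def statistical_features(flow, client_ip):
    incoming = [abs(p[6]) for p in flow if p[1] != client_ip]
    outgoing = [abs(p[6]) for p in flow if p[1] == client_ip]
    return (series_stats(incoming) + series_stats(outgoing)
            + series_stats(incoming + outgoing))                 # 54 dimensions in total
```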
2.4. Multimodal Fusion Training and Prediction
Multimodal fusion is an effective method to extract and integrate information from different modalities and improve the representation ability of the model. According to the stage at which modalities are fused, these methods can be divided into early fusion, late fusion, and hybrid fusion.
We adopt late fusion as the modal fusion strategy of FusionTC. Late fusion is also called decision-level fusion, which first trains models on different modalities and then fuses the outputs of the multiple models. Late fusion methods can purposefully set different classifiers for different modalities, which gives better scalability. Additionally, the late fusion method is much easier to operate: even when the feature parameters of different modalities differ greatly, modal fusion can be achieved without additional complex processing of the modalities, such as modal alignment and normalization. In addition, compared with other fusion methods, the model structure based on late fusion is more concise and has better prediction robustness. It integrates the recognition advantages of each single modality and, to a certain extent, avoids errors caused by a single modality or a single classifier, thus enhancing the robustness of the classification model.
At present, late fusion methods [20] mainly use rules to combine the output results of different models, that is, rule fusion, such as max fusion, averaged fusion, and Bayes rule fusion. However, these methods usually cannot fully exploit the multimodal features, because they simply combine multimodal prediction results with manual strategies and ignore the deep relationships between them. To overcome this weakness of existing late fusion methods, we choose the ensemble learning algorithm stacking to achieve intelligent late fusion of multimodal features. Our stacking-based model consists of a two-layer classifier structure (several base classifiers and one metaclassifier), which can automatically extract traffic features of multiple modalities and make decision-level fusion predictions to obtain the final results. The biggest difference from traditional late fusion methods is that our model can automatically learn the deep relationships among multiple modalities through fusion training on the multimodal prediction results, so as to make more accurate judgments. This algorithm can integrate the best classifiers for different feature modalities. It does not require excessive parameter tuning and can effectively avoid overfitting. It also has strong scalability and is easy to extend purposefully with more classifiers. As long as enough training time and space are available, it is one of the simplest and quickest ways to improve machine learning results.
It is worth noting that we have made an innovative upgrade to the stacking method. In the traditional stacking framework, the metaclassifier receives the hard outputs of the base classifiers (that is, the predicted label values) and cannot obtain confidence information for each result. To improve this, the base classifiers in our method output soft prediction results (that is, vectors containing the prediction probability of every category), which provide more precise information. As shown in Figure 1, this increases the dimension of the output, so we flatten it into one-dimensional data to match the metaclassifier. In this way, the metaclassifier can fully learn from the confidence of the soft results and make more reasonable fusion decisions, thus improving the performance of the model.
In this paper, FusionTC adopts the following stacking multimodal training process [24, 25]: assume that the sample dataset is represented as D = {(x1, y1), (x2, y2), ..., (xN, yN)}, where xi is the raw traffic of a sample flow, yi is its label, and N is the number of samples. The framework includes two levels of classifiers: several base classifiers (C1 to CM) and one metaclassifier Cmeta. Each base classifier performs k-fold cross-validation on the corresponding feature modality of dataset D. The k-fold prediction results of the base classifiers (averaged on the test set), joined with the original labels, are taken as the input of the metaclassifier Cmeta, which outputs the final classification result. The process of the stacking algorithm [26] is shown in Algorithm 1.
Figure 2 shows the training and prediction process of the stacking algorithm. We give an example to explain it in depth. According to the different traffic modalities extracted from the same dataset (described in Section 2.3), we choose base classifiers of different kinds, C1 to CM, and one metaclassifier Cmeta. We adopt 5-fold cross-validation for base classifier training. The specific steps are as follows [25] (a simplified code sketch is given after the list):
(1) Divide the sample dataset into a training set and a test set at a ratio of 3:1. Then, randomly divide the training set into 5 subsets (D1, D2, D3, D4, and D5) according to the 5-fold cross-validation method.
(2) Create the base classifier models C1 to CM in the first layer. For each base classifier, choose one of the 5 subsets as the validation set and the rest as the training set. Train the model on this fold and obtain the prediction ai on the held-out subset. Meanwhile, predict the test set using the base classifier just trained and obtain the result bi.
(3) Repeat step 2 five times to obtain the predictions on the training set (a1, a2, a3, a4, and a5), stack these 5 results vertically to obtain A1, and average the test set predictions (b1, b2, b3, b4, and b5) to obtain B1.
(4) Perform the above steps for the other base classifiers to obtain the corresponding results A2, ..., AM generated from the training set and B2, ..., BM generated from the test set.
(5) Merge A1, ..., AM with the labels of the original dataset to obtain a new training set A, and merge B1, ..., BM to obtain a new test set B. Take A as the training set of the metaclassifier in the second layer, and use B as the test set of the metaclassifier to generate the final result.
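The sketch below condenses the procedure above into scikit-learn primitives, with two simplifications we want to flag: a single random forest per modality stands in for the RF/XGBoost pairs of Section 3.4, and the base classifiers are refit on the full training set at prediction time instead of averaging per-fold test predictions as in step (3). The out-of-fold soft (probability) outputs feeding the metaclassifier are the key point.

```python
# Soft-output stacking: out-of-fold class-probability vectors of every base
# classifier are concatenated and fed to a GBDT metaclassifier.
# The per-modality feature matrices of Section 2.3 are assumed to be prepared.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def build_meta_features(modalities, y, k=5):
    blocks = []
    for X in modalities:
        clf = RandomForestClassifier(n_estimators=100)
        oof = cross_val_predict(clf, X, y, cv=k, method="predict_proba")
        blocks.append(oof)                      # shape (n_samples, n_classes) per modality
    return np.hstack(blocks)                    # flattened soft outputs

def train(modalities, y, k=5):
    meta_X = build_meta_features(modalities, y, k)
    base = [RandomForestClassifier(n_estimators=100).fit(X, y) for X in modalities]
    meta = GradientBoostingClassifier().fit(meta_X, y)   # GBDT metaclassifier
    return base, meta

def predict(base, meta, test_modalities):
    meta_X = np.hstack([m.predict_proba(X) for m, X in zip(base, test_modalities)])
    return meta.predict(meta_X)
```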

3. Experiment and Evaluation
3.1. Environment for Traffic Collection and Testing
We build an easy-to-operate experimental environment for app encrypted traffic. In order to generate mobile traffic for constructing the sample dataset, we use the Android Debug Bridge (ADB) to connect a workstation to mobile phone test terminals through a USB cable. We run automated application test scripts on the workstation and automatically control the phone to generate mobile application traffic by imitating the behavior of app users. The mobile phone test terminal is connected to the Internet through Wi-Fi, and the Wi-Fi AP device is connected to the Internet through the workstation. Therefore, the mobile phone traffic passes through the workstation, where we capture it with a packet capture tool. The setup is illustrated in Figure 3.

In this way, the experimental environment for encrypted traffic collection is established. We collect the packet information of the mobile phone traffic, such as time, source address, destination address, port, packet length, protocol, and TCP/IP flags. Due to encryption, the packet payload is not used as a feature. We implement all of our tests in a Python environment and use Scikit-learn as the machine learning library. We run our experiments on a workstation with Windows 11, an i9-10850K CPU @ 3.6-5.2 GHz, 128 GB DDR4-3200 memory, and a Tesla V100 GPU (used only for the comparative experiments with deep learning baseline approaches).
3.2. Evaluation Metrics
To evaluate the performance of detection models, we use metrics including precision, recall, and F1-measure. The mathematical description of the above metrics is as follows.

Precision reflects the proportion of data actually labeled true among the data predicted to be true:

Precision_i = TP_i / (TP_i + FP_i)

Recall reflects the proportion of data predicted to be true among the data actually labeled true:

Recall_i = TP_i / (TP_i + FN_i)

F1-measure is the harmonic mean that balances precision and recall:

F1_i = 2 * Precision_i * Recall_i / (Precision_i + Recall_i)

In the metrics above, TP_i is the number of encrypted flows that belong to category i and are correctly classified into category i; FP_i is the number of encrypted flows that do not belong to category i but are incorrectly classified into category i; FN_i is the number of encrypted flows that belong to category i but are misclassified into other categories.

For multiclassification, we use the macro method to calculate the above metric values. That is, we sum and average the values over all categories, such as

Macro-Precision = (1/n) * (Precision_1 + Precision_2 + ... + Precision_n)

where n is the number of app categories.
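For reference, the macro-averaged metrics above can be computed directly with scikit-learn; the label values below are only placeholders.

```python
# Macro precision, recall, and F1 over all app categories.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["com.wuba", "com.wuba", "com.kmxs.reader"]          # illustrative labels
y_pred = ["com.wuba", "com.kmxs.reader", "com.kmxs.reader"]   # illustrative predictions
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
```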
3.3. Datasets
We synthesize and establish traffic sample datasets from mainstream apps. We simulate app operations and collect app network traffic through MonkeyRunner, an automated testing framework that comes with the Android SDK. MonkeyRunner is primarily designed for testing applications and devices at the framework level and for running unit test suites. It provides an API for writing programs that control Android devices and emulators from outside of Android code.
We write MonkeyRunner test scripts to automatically install and run APK files and randomly simulate app operations. We tested 15 mainstream apps (such as com.wuba and com.kmxs.reader); each app was tested 25 times with a single test period of about 10 minutes. The network traffic of the apps is collected on the workstation with tcpdump, and each test result is saved as a separate PCAP file. A total of 10.5 GB of network traffic has been collected, from which we construct a mobile traffic dataset covering the 15 apps.
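The following is a minimal sketch of such a MonkeyRunner (Jython) script: it installs an APK, launches the app, and sends random touches for roughly 10 minutes while tcpdump captures traffic on the workstation. The APK path, activity name, and screen coordinates are placeholders, not the values from our scripts.

```python
# MonkeyRunner traffic-synthesis sketch (run with the Android SDK's monkeyrunner tool).
import random
from com.android.monkeyrunner import MonkeyRunner, MonkeyDevice

device = MonkeyRunner.waitForConnection()
device.installPackage("apks/com.kmxs.reader.apk")                  # placeholder APK path
device.startActivity(component="com.kmxs.reader/.MainActivity")    # placeholder activity
MonkeyRunner.sleep(5)

# Simulate about 10 minutes of random user interaction.
for _ in range(600):
    x, y = random.randint(100, 900), random.randint(200, 1600)
    device.touch(x, y, MonkeyDevice.DOWN_AND_UP)
    MonkeyRunner.sleep(1)

device.press("KEYCODE_HOME", MonkeyDevice.DOWN_AND_UP)
```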
After extraction, this dataset contains more than 60,000 traffic flows, of which 99% contain 0 to 40 packets (shown in Figure 4). Traffic flows whose length is below a certain threshold, which cannot represent the flow characteristics well, are filtered out as noise. Further analysis of the packet length distribution shows that most packets lie in the two ranges of 0 to 250 bytes and 1400 to 1500 bytes (shown in Figure 4). In general, apps with audio and video services (such as Ximalaya and Sohu News) generate a large number of long packets between 1400 and 1500 bytes, while the packets generated by apps mainly based on text interaction are usually shorter, mostly between 0 and 250 bytes.

3.4. Model Design and Parameter Selection
In order to improve the classification effect of the model, we make the following optimization configurations for the model and its related parameters.
3.4.1. Selection of Multimodal Classifiers under the Stacking Framework
According to the different modalities, we purposefully choose appropriate base classifiers. For the stacking algorithm, there needs to be a certain diversity among the base classifiers to maximize the overall effect of the model. Based on this principle, the base classifiers in the first layer of the stacking framework are set as follows: we select features from three modalities (packet sequence, packet distribution, and packet statistics), and each modality is fed into two classifiers (random forest and XGBoost [27]), so that there are a total of 6 base classifiers in the first layer.
In stacking, the metaclassifier in the second layer needs to be a strong classifier with good overall performance. GBDT (gradient boosting decision tree) is a commonly used ensemble learning method based on decision trees with strong predictive performance, so we use it as the metaclassifier.
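As a sketch of how this configuration could be assembled with scikit-learn's StackingClassifier (assuming the three modality feature blocks are concatenated column-wise into one matrix, with illustrative column slices and integer-encoded labels), the pairing of random forest and XGBoost per modality and the GBDT metaclassifier follow the choices above; stack_method="predict_proba" provides the soft outputs described in Section 2.4.

```python
# 6 base classifiers (RF + XGBoost on each of the 3 modality slices) stacked
# under a GBDT metaclassifier with 5-fold cross-validation and soft outputs.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

SLICES = {"seq": slice(0, 200), "dist": slice(200, 240), "stat": slice(240, 294)}  # illustrative

def on_modality(name, clf):
    select = ColumnTransformer([("pick", "passthrough", SLICES[name])])
    return Pipeline([("select", select), ("clf", clf)])

estimators = []
for name in SLICES:
    estimators.append(("rf_" + name, on_modality(name, RandomForestClassifier(n_estimators=100))))
    estimators.append(("xgb_" + name, on_modality(name, XGBClassifier())))

fusion_tc = StackingClassifier(estimators=estimators,
                               final_estimator=GradientBoostingClassifier(),
                               stack_method="predict_proba", cv=5)
# fusion_tc.fit(X_train, y_train); y_pred = fusion_tc.predict(X_test)
```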
3.4.2. Main Parameter Configuration
In order to improve the performance of the model, we conducted repeated experiments and tuned the main parameters. The relevant parameter settings are as follows:
(i) Burst threshold Tb: 2 s; that is, the flow is segmented when the interval between two packets is greater than 2 s.
(ii) Traffic flow length threshold: 18; that is, traffic flows with fewer than 18 packets are discarded as noise.
(iii) Number of folds for cross-validation k: 5; that is, we use 5-fold cross-validation when training the base classifiers.
3.5. Comparison with the Baseline Approaches
To demonstrate the classification effect of FusionTC, we compare it with state-of-the-art approaches from the recent literature. The baseline approaches include deep fingerprinting [13], LSTM fingerprinting [14], AppScanner [5], and ET-BERT [19]. All of them propose innovative methods in encrypted traffic processing, feature extraction, and model design, using deep learning, machine learning, and other models to achieve impressive classification results on different datasets. Unlike FusionTC, these approaches only consider single-modality features, such as the packet length sequence or packet statistical features.
(i) Deep fingerprinting (Deep FP) [13], which uses a fingerprinting attack based on deep learning to classify encrypted traffic: the attack requires a simple input format, and handcrafted features are not necessarily needed.
(ii) LSTM fingerprinting (LSTM FP) [14], which applies long short-term memory (LSTM) to classify encrypted traffic: it automates the feature engineering process and thus automatically deanonymizes encrypted traffic, with the LSTM enabling the classifier to interpret the time series of encrypted traffic.
(iii) AppScanner [5], which uses a random forest classifier to detect apps from encrypted traffic based on statistical features of packet sequences: it also leverages reinforcement learning techniques to identify ambiguous flows, which helps promote accuracy.
(iv) ET-BERT [19], which applies a new traffic representation model that pretrains deep contextualized datagram-level representations from large-scale unlabeled data: the pretrained model can be fine-tuned on a small amount of task-specific labeled data and achieves ideal performance.
Based on the same dataset as in our paper, we compare FusionTC with these state-of-the-art methods as baselines. As shown in Table 2, unexpectedly, the approaches using deep learning (Deep FP and LSTM FP) perform poorly, with accuracy not exceeding 80%. The reason is that, despite the strong learning ability of deep learning, these two methods still struggle to obtain sufficient information from single-modality raw flow sequences. AppScanner achieves relatively higher accuracy, which is mainly attributed to its delicate traffic segmentation and reinforcement learning techniques. However, because of the reinforcement learning mechanism in AppScanner, 18.74% of the test flows are marked as unknown and excluded from multiclassification, so if these unknown flows were also included, the actual prediction accuracy of AppScanner would be lower than the reported value. Among the baselines, ET-BERT has the best overall performance and is closest to FusionTC. Its prediction accuracy exceeds 90%, which benefits from the good contextual representation ability and strong learning ability of the pretrained model. Overall, the test shows that FusionTC significantly surpasses these baseline approaches on all metrics, and its accuracy outperforms the best baseline, ET-BERT, by more than 3.2%. Meanwhile, ET-BERT is based on a relatively complex natural language deep learning model, which requires high-speed GPU support and more resource overhead; compared with it, FusionTC is lighter, easier to operate, and has less resource overhead. For apps on which the baseline approaches achieve very low accuracy (such as Kmxs reader), FusionTC still achieves relatively good prediction results, largely because the decision-level fusion of multimodal features is able to identify more subtle traffic characteristics in challenging situations.
3.6. Comparison with Its Single Modality Base Classifiers
The overall performance of FusionTC largely depends on the fusion of multimodal predictions. In order to demonstrate the effect of multimodal prediction fusion, in the following experiment we compare and analyze the prediction results of FusionTC against its single-modal base classifiers in the first layer. The base classifiers tested are as follows:
(i) Random Forest Using the Packet Sequence Feature. In this modality, we optimally set the sequence length N to 100. In this base classifier (random forest), we set criterion = "entropy".
(ii) XGBoost Using the Packet Distribution Feature. In this modality, we choose the packet length distribution and optimally set the range width d. In this base classifier (XGBoost), we set criterion = "entropy".
(iii) Random Forest Using the Statistical Feature. In this modality, we extract the 54-dimensional statistical feature (described in Section 2.3.3). In this base classifier (random forest), we set criterion = "entropy".
Figure 5 shows that the overall performance of FusionTC exceeds that of each of its single-modal classifiers. FusionTC achieves a 5% improvement in accuracy over the best single-modal base classifier. Sometimes a classifier for a certain modality is inherently bad at recognizing a certain app, even if it does very well in most cases. For example, the best-performing RF classifier only achieves 71% precision on the traffic of the app Taobao Trip, which is even lower than its average level, but FusionTC still achieves 84% precision. The high precision of FusionTC shows that it is able to effectively overcome the weaknesses of a single-modal classifier in such cases by fusing the classification results of multiple modalities. In general, FusionTC integrates the recognition advantages of each single modality and, to a certain extent, avoids errors caused by a single modality or a single classifier, thus enhancing the robustness of the classification model.

3.7. Comparison Results under Different Configurations
In order to optimize the performance of the model, we repeatedly test its prediction results with different parameters.
(1) We compare the impact of different minimum flow length thresholds on the prediction results. Flows shorter than the threshold are discarded directly. Varying the threshold from 0 to 30, as shown in Figure 6, we observe that the accuracy rises until the threshold reaches 8 and then remains roughly constant; when the value exceeds 22, the performance begins to decrease slightly. This indicates that the range from 8 to 22 is optimal.
(2) We compare the impact of different burst threshold values on the prediction results. The burst value determines how flows are segmented, which affects the total number and content of samples. Following [20], we test values from 0.5 s to 5 s (in increments of 0.5 s) for a comprehensive analysis. As shown in Figure 7, the prediction accuracy remains almost constant, showing that the burst threshold does not substantially impact the prediction results. We choose 2 s as the value for the model.
(3) We compare the impact of the number of apps in the dataset on the prediction results. We randomly select flows labeled with several apps from the dataset, varying the number of apps from 2 to 15, and then conduct experiments to observe the results. Figure 8 shows that when the number of apps is small, the accuracy is higher. However, as the number of apps increases, the performance of the model does not drop significantly; after the number of apps exceeds 9, the performance of FusionTC tends to be stable (the accuracy is around 93%). Since more apps mean a sharp increase in classification difficulty, this shows that FusionTC is capable of coping with the mixed traffic of multiple apps in a real environment.
(4) We compare the impact of the number of training folds for the base classifiers on the prediction results, testing from 1 fold to 11 folds. In particular, when the number of folds is 1, the method is actually blending, an ensemble learning algorithm similar to stacking that generates the second-layer training set from a fixed portion of the original training set. From Figure 9, we can see that the model's performance increases with the fold value and remains stable once the fold value exceeds 5. As the number of folds continues to increase, the training time overhead grows substantially. Therefore, to balance the number of training folds and training time, we choose 5-fold as the solution for the model.




3.8. Analysis of Prediction Result
To evaluate the performance of FusionTC in more depth, we create a confusion matrix plot of the prediction results. Figure 10 shows that FusionTC is more than 90% accurate for most of the app predictions. At the same time, there are also cases where some apps are easily misclassified, usually because these apps employ the same function library or are even developed by the same enterprise. For example, Baidu Search has a 2% probability of being misclassified as Baidu Map. However, compared with the classification results of the single-modal classifiers, the misclassification probability of FusionTC for similar traffic is significantly lower. It shows that, by considering multiple modalities, FusionTC can better capture the detailed information of encrypted traffic and thus recognize subtle differences between similar network traffic, contributing to its outstanding performance.

4. Conclusion
In this paper, we propose a new app encrypted traffic classification framework, FusionTC. To address the problem that a single modality cannot fully represent the heterogeneous information of encrypted traffic, we design a traffic segmentation and preprocessing module based on flows and burst traffic and carefully extract features from 3 kinds of modalities. FusionTC leverages the stacking algorithm to make decision-level fusion predictions on multimodal features. Compared with single-modality classification, FusionTC improves classification accuracy and robustness. Experiments evaluating FusionTC on a real traffic dataset covering 15 apps show that it improves the accuracy by at least 3.2% over the state-of-the-art approaches. We also demonstrate that it is the multimodal decision-level fusion mechanism that significantly improves the classification effect. In the future, we plan to study more traffic modalities to improve the classification accuracy and robustness of FusionTC when applied to more realistic and complex traffic environments. We also plan to apply FusionTC to new scenarios, such as IPv6 traffic classification [28], smart home app identification [29], and cryptocurrency mining activity detection [30].
Data Availability
The experimental data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest regarding this work.
Acknowledgments
This work was supported in part by the National Key Research and Development Program of China under Grant 2016QY05X1002.