Abstract
While encryption ensures the confidentiality and integrity of user data, more and more attackers hide their attack behaviour behind encryption, which brings new challenges to malicious traffic identification. How to effectively detect encrypted malicious traffic without decrypting it, and thus without compromising user privacy, has become an urgent problem. Most current research uses only a single CNN, RNN, or SAE network to detect encrypted malicious traffic and does not consider the forward and backward correlation between data packets, so it is difficult to identify malicious features in encrypted traffic effectively. This study proposes an approach, called TLARNN, that combines spatial-temporal features with a dual-attention mechanism. Specifically, we first use a 1D-CNN and a BiGRU to extract spatial features within encrypted traffic packets and temporal features of encrypted streams, respectively, which enriches the features along different dimensions. A soft attention mechanism is applied to the encrypted data packets to extract features, and a second layer of soft attention is then used to aggregate malicious features. Several comparative experiments are designed to evaluate the effectiveness of the proposed scheme. The experimental results demonstrate that the proposed scheme achieves a significant performance improvement over existing ones.
1. Introduction
With the rapid development of computing, encryption technology has become widespread. Although encryption protects data in transit, more and more malware relies on encryption technologies such as TLS for C2 connections that carry data or instructions, and uses encrypted traffic to bypass firewalls and other defence software. Therefore, how to effectively detect encrypted malicious traffic has become an urgent problem.
Scholars have already carried out research on detecting encrypted malicious traffic. According to the way detection features are extracted, the approaches fall into three main categories: rule-based detection algorithms [1–6], machine learning or deep learning detection algorithms based on manual feature extraction [7–12], and deep learning detection algorithms based on feature self-learning [13–15]. Rule-based detection algorithms have low identification efficiency, and some rules no longer apply once the traffic is encrypted. With the development of machine learning, encrypted malicious traffic detection based on manual features has gradually emerged. These methods realize identification and classification by modelling the statistical, content, and behaviour characteristics of encrypted traffic, but designing features manually is often very complicated. In recent years, deep learning methods based on feature self-learning have emerged that exploit the good nonlinear modelling capability of deep learning: they learn features directly from the original encrypted traffic packets and classify them by means of representation learning. Although many end-to-end learning-based encrypted traffic detection frameworks have been proposed, most of them are based on a single network such as LSTM, CNN, or SAE. However, the importance of different features often varies across scenarios, and most studies do not consider malicious communication behaviours with forward and backward correlation [14–18], so the detection accuracy is still not high.
To address these problems, this study proposes an approach, called TLARNN, that combines spatial-temporal features with a dual-attention mechanism. First, the spatial features in encrypted traffic packets are extracted through a one-dimensional convolutional network, and further feature extraction is performed through a soft attention mechanism. Then, temporal features of the encrypted traffic streams are extracted through a BiGRU, and the malicious features are aggregated by a second layer of soft attention. For an encrypted traffic packet, the model is expected to pay more attention to the malicious fields within the packet, such as the use of low-version digital certificates and cipher suites. For an encrypted stream, the model is expected to pay more attention to the malicious packets contained in the C2 communication process, while the packets that malware uses to imitate normal communication for obfuscation should receive a lower weight. The main contributions of this study are as follows:
(1) We combine spatial-temporal features to detect encrypted malicious traffic. The spatial features contained in the original byte stream of each packet are extracted through a 1D-CNN, and a bidirectional GRU is used to extract bidirectional temporal features from the packets in the stream. These temporal features incorporate the context of each packet and capture the forward and backward correlations of malicious features.
(2) We use a dual-attention mechanism to extract important malicious features. Attention is set at the packet layer and at the flow layer to enhance the extraction of malicious features. Attention at the packet layer makes the model focus on malicious features in the current data packet; attention at the flow layer makes the model focus on the packets that carry malicious communication.
(3) We realize end-to-end encrypted malicious traffic detection and identify malicious behaviours by automatically extracting features from the encrypted C&C communication in malware traffic, without manual feature extraction, which can reduce the labour cost of encrypted malicious traffic detection.
The rest of this paper is arranged as follows: Section 2 introduces the related work, Section 3 explains in detail the encrypted malicious traffic detection algorithm based on the dual-attention mechanism, Section 4 presents and discusses the experimental evaluation results, and Section 5 gives the conclusion and outlook.
2. Related Work
2.1. Rule-Based Detection Algorithms
Rule-based detection algorithms mainly include DPI-based detection methods, payload randomness detection methods, and other payload-based methods. These methods analyse the certificate information contained in the unencrypted TLS/SSL handshake of the encrypted traffic and design highly discriminative matching rules to identify the encrypted traffic. Their advantage is that detection follows established matching rules and is therefore efficient. However, formulating the rules requires massive data analysis, the rules lack flexibility, and attackers can evade detection by forgery and other means.
Korczyński and Duda [1] proposed a Markov-based detection method for anomalous encrypted communications, which extracts fingerprints from the payload of packets in TLS/SSL sessions to identify encrypted application traffic. Anomalous encrypted communications are then detected using first-order Markov chains that model the sequence of TLS/SSL message types appearing in the single-direction traffic of a given application from server to client. However, the fingerprints need to be updated regularly as the applications are upgraded. Husák et al. [2] proposed a lightweight TLS/SSL-based encrypted malicious traffic detection method, which combines the TLS/SSL handshake information exchanged before encrypted communication with the User-Agent field of the HTTP header into a fingerprint that can classify real-time HTTPS network traffic and enhance network forensics capabilities while preserving user privacy. In 2022, Li et al. [3] proposed a middleware dynamic deep-packet inspection (DPI) scheme based on dynamic symmetric searchable encryption, which prevents data leakage caused by association analysis while ensuring security; its rule matching achieves a high throughput of 280 packets per second. In 2019, Ning et al. [4] proposed PrivDPI, a deep-packet inspection scheme for encrypted traffic executed in a network middlebox, following BlindBox. Ning designed encryption rules that reduce the setup delay while ensuring privacy. Compared with previous works, the encryption rule generation is 288 times faster, and the intermediate values generated in each session can be reused in subsequent sessions, reducing the computational overhead. Experiments show that this scheme is more suitable for short flow connections. Vincent [5] exploited statistical features of packet size for training random forest classifiers. By analysing the temporal invariance of application fingerprints, combined with a novel machine learning strategy to identify ambiguous network traffic, Vincent proposed a robust and scalable application scanner for identifying smartphone apps from their network traffic. Similarly, BIND [6] also utilized temporal statistical features. However, because it is difficult to design general statistical features that cope with a large number of increasingly complex applications and websites, algorithms based on statistical features generalize poorly.
2.2. Machine Learning and Deep Learning Detection Algorithms Based on Handcrafted Features
Although the rule-based method is simple and efficient, formulating detailed matching rules is costly, and the encryption of malicious traffic makes some rules inapplicable. With the development of machine learning, machine learning and deep learning methods based on manual features have gradually emerged. They rely on expert knowledge to design manual features that can distinguish malicious behaviours and then classify malicious encrypted traffic with machine learning methods, such as SVM, RFE, XGBoost, k-means, and Naive Bayes, or with deep learning methods, such as convolutional and recurrent neural networks. Manual features for malicious encrypted traffic fall into three main categories: certificate features, connection features, and SSL features.
In 2016, Anderson and McGrew [7] compared malicious and benign encrypted traffic in different fields of TLS, DNS, and HTTP, obtained a feature set that can distinguish malicious traffic by analysing the contextual traffic associated with encrypted traffic, and then used SVM to classify the encrypted malicious traffic. Their research pioneered the use of contextual traffic to identify encrypted malicious traffic. In 2019, Liu et al. [8] proposed an algorithm for real-time detection of encrypted malicious traffic, which extracts 23 robust features from the first 8 plaintext packets of a flow so that malware traffic can be detected before the connection begins communicating. The authors also deployed an online random forest algorithm to keep the algorithm parameters up to date. In 2019, Shekhawat et al. [9] used the logs generated by BRO to extract connection features, SSL features, and X509 certificate features, 38 features in total. In addition, Shekhawat compared the accuracy of three machine learning algorithms, namely, SVM, RFE, and XGBoost, and the experimental results showed that a small number of valuable features can achieve high accuracy. In 2020, Chen et al. [10] proposed an improved clustering algorithm that uses three-stage clustering to sample encrypted traffic data. A comparison of three machine learning methods shows that the algorithm can effectively detect encrypted malicious traffic families and improve the accuracy of encrypted malicious traffic detection.
Anderson and McGrew [11] summarized the sensitive fields in malware-generated encrypted traffic, trained various machine learning algorithms, such as linear regression, decision tree, random forest, SVM, and multilayer perceptron, on these fields, and used the resulting classifiers to detect malware-generated encrypted traffic. Pan and Lin [12] proposed a trust-based DDoS discovery method for encrypted traffic, incorporating a trust evaluation mechanism based on signature and environmental factors, building a ball-tree from the features, and using the K-NN classification algorithm to discover malicious traffic. The experimental results show that this method can better discover DDoS attacks in encrypted traffic while protecting the sensitive traffic information of legitimate tenants.
This type of detection method can show good detection performance; however, it relies too much on expert knowledge and experience to design features to distinguish malicious behaviours.
2.3. Deep Learning Detection Algorithm Based on Feature Self-Learning
Because manual feature extraction is time-consuming and complicated, more and more research in recent years has addressed end-to-end frameworks for encrypted malicious traffic detection. Given the good nonlinear modelling ability of deep learning, scholars have shifted their focus to deep learning-based detection of encrypted malicious traffic: the features of the original encrypted traffic packets are learned directly by means of representation learning, and the features that can distinguish malicious traffic are extracted to identify malicious encrypted traffic.
Encrypted traffic classification using supervised deep learning has become a popular approach that automatically extracts discriminative features rather than relying on manual design. Sirinam et al. [13] proposed a website fingerprinting (WF) attack called deep fingerprinting (DF), which uses a sophisticated CNN-based design for feature extraction and classification. Liu et al. [14] applied recurrent neural networks (RNNs) to encrypted traffic identification and classification and proposed an end-to-end flow sequence network (FS-Net) that automatically learns representative features from the raw packet size sequence of encrypted traffic. In addition, the authors adopt a multilayer encoder-decoder structure to mine the latent sequence features of the stream and introduce a reconstruction mechanism to improve the effectiveness of the features. Lin et al. [15] combined CNN and RNN, first extracting abstract features of the flow and then learning its temporal characteristics to realize efficient identification. However, this approach relies on large amounts of labelled data to capture effective features and tends to learn biased representations on imbalanced data.
Millar et al. [16] used the first 50 bytes of the packet payload, converted the network packets into images and extracted metadata features, and processed the data in these two ways as input to a DNN and a CNN, respectively, for training and classification. Hwang et al. [17] selected the first n packets from the data stream, trimmed each of the n packets to a fixed length l, and generated grayscale images of size n × l. Finally, a network model consisting of one-dimensional convolution and an autoencoder was used for malicious traffic detection. Sengupta et al. [18] exploited the difference in randomness between different ciphertexts to distinguish encrypted traffic generated by different applications and found that encrypted traffic is not completely random but follows a fixed implicit pattern. PERT [19] was the first to apply a pretrained model to encrypted traffic detection and achieved an F1-score of 93.23% on ISCX-VPN-Service.
Wang et al. [20] were the first to propose end-to-end encrypted traffic recognition. The encrypted traffic was preprocessed into 28 × 28 images, and a 1D-CNN (1-dimensional convolutional neural network) was compared with a 2D-CNN (2-dimensional convolutional neural network). The metrics of the networks on the encrypted traffic identification task show that the 1D-CNN performs better in identifying encrypted traffic. In 2018, Rimmer et al. [21] first introduced feature self-learning into fingerprinting attacks on the Tor anonymous network and evaluated three popular deep learning models for website fingerprint recognition: stacked denoising autoencoder (SDAE), convolutional neural network (CNN), and long short-term memory (LSTM). In 2019, Zeng et al. [22] proposed a lightweight framework that can be used for both encrypted traffic classification and intrusion detection. By comparing 3D-CNN, LSTM, and SAE frameworks, features of different dimensions of the raw encrypted traffic are identified. Experiments show that the F1-score of this method is higher than that of existing methods and that it requires fewer storage resources. In 2020, Bu et al. [23] proposed a neural network structure based on parallel features for encrypted traffic identification. The authors argue that the packet header and packet body contain different classification clues and design a parallel decision-making strategy. Two NIN subnetworks are constructed to classify the packet header and packet body, respectively, and the classification probabilities of the two networks are fused by linear weighting to obtain the final classification result. Experiments show that this parallel strategy is effective in improving the classification accuracy of encrypted traffic. In 2022, Lin et al. [24] proposed a traffic representation model, ET-BERT, to effectively learn implicit relations in unlabelled traffic and thereby improve traffic classification in different scenarios. Considering the structural characteristics and packet format of traffic transmission, the authors treat traffic datagrams as token sequences and, borrowing the large-scale pretraining architecture of natural language processing, capture the contextual associations implied in large-scale unlabelled traffic. The model is then further trained on the specific scenario task with a small amount of labelled data to complete the final encrypted traffic classification task.
In summary, the rule-based detection method needs matching rules for encrypted malicious traffic detection to be established in advance, and if the rules are not flexible, the attacker can easily evade detection by means of forgery. For the detection method based on manual features, many features are applicable only to specific scenarios and data, and manual features need to be updated over time. The deep learning detection algorithm based on feature self-learning realizes end-to-end detection and exploits the powerful nonlinear modelling ability of deep learning, so it can achieve better detection results. It is, however, still at an early stage: most methods extract malicious features of encrypted traffic through a single neural network such as CNN, RNN, or SAE, and all of them use shallow networks. In view of the shortcomings of the existing methods, this paper constructs an encrypted malicious traffic detection model for the TLS/SSL protocol that combines spatial and temporal features and sets up a dual-attention mechanism at the packet layer and the data flow layer to extract malicious features. Finally, the superiority of this method is demonstrated by a variety of comparative experiments.
3. Encrypted Malicious Traffic Detection Scheme
The encrypted malicious traffic detection scheme is mainly divided into two parts, namely, encrypted traffic preprocessing and malicious traffic detection.
3.1. Encrypted Traffic Preprocessing
Encrypted traffic preprocessing mainly consists of the extraction and marking of encrypted traffic and data preprocessing, as shown in Figure 1.
(1) Extraction and Marking of Encrypted Traffic. The complete communication process (bidirectional flow) is taken as the marking unit. tSHARK, the terminal version of the free and open source packet analyser Wireshark [25], is used in this paper to extract the network quad of each encrypted flow that uses the SSL/TLS protocol, currently the most widely used secure communication framework, and to generate CSV files containing the network quad data. According to the network quad, tSHARK is used again to extract each single flow sample from the original PCAP traffic file and mark it.
(2) Data Preprocessing. As the data link layer is only responsible for physical addressing, it does not help to detect malicious traffic, so the Ethernet header is removed. We align the transport layer packet headers by padding the shorter UDP header (8 bytes) to the TCP header length of 20 bytes. IP addresses cannot be used as identifiers of attackers, so we mask all IP addresses. We align the packet length by extracting the first N bytes of the first M packets of each session, truncating the excess part, and filling the insufficient part with 0.
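The length alignment and address masking steps can be sketched as follows. This is a minimal illustration rather than the pipeline used in the experiments: the function names are hypothetical, masking is assumed to mean zeroing the IPv4 address fields, and sessions with fewer than M packets are assumed to be padded with all-zero rows. The defaults M = 30 and N = 300 follow the values given in Section 4.2.3.

```python
from typing import List


def mask_ipv4_addresses(ip_packet: bytes) -> bytes:
    """Zero the source and destination addresses of an IPv4 header.

    Assumes ip_packet starts at the IPv4 header (Ethernet header already
    removed); bytes 12..19 hold the two 4-byte addresses.
    """
    if len(ip_packet) < 20:
        return ip_packet
    return ip_packet[:12] + b"\x00" * 8 + ip_packet[20:]


def align_session(packets: List[bytes], m: int = 30, n: int = 300) -> List[List[int]]:
    """Return an m x n matrix of byte values for one session.

    Keeps the first m packets and the first n bytes of each packet,
    truncating longer packets and zero-padding shorter ones.
    """
    rows = []
    for pkt in packets[:m]:
        row = list(pkt[:n])
        row.extend([0] * (n - len(row)))
        rows.append(row)
    # Assumption: sessions with fewer than m packets get all-zero rows.
    while len(rows) < m:
        rows.append([0] * n)
    return rows
```

Each session then yields a fixed-size M × N matrix regardless of how many packets it contains or how long they are.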

3.2. Malicious Traffic Detection Model—TLARNN
The TLARNN malicious traffic detection model mainly consists of four parts, namely, spatial feature learning, attention mechanism based on packet, flow temporal feature learning, and attention mechanism based on flow. The overall model framework is shown in Figure 2.

3.2.1. Spatial Feature Learning
1D-CNN is used to process a single packet of encrypted traffic.
(1) One-Hot Encoding. Let $p_i$ be the i-th byte in the packet and $x_i \in \mathbb{R}^k$ the k-dimensional vector obtained by one-hot encoding the i-th byte. After all $n$ bytes of the packet are encoded, the results are concatenated in series, as shown in the following formula:
$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n \tag{1}$$
(2) Convolution Operation. The input data are convolved with the filter $w$, where $x_{i:i+h-1}$ represents the stream starting from the i-th byte and the byte window is h. Here, tanh is selected as the activation function and $b$ is a bias term, as shown in the following formula:
$$c_i = \tanh(w \cdot x_{i:i+h-1} + b) \tag{2}$$
The filter is applied to all data through weight sharing to obtain the feature map, as shown in the following formula:
$$c = [c_1, c_2, \ldots, c_{n-h+1}] \tag{3}$$
(3) Pooling Operation. For encrypted malicious traffic detection, the most obvious features need to be extracted to identify whether the traffic is malicious traffic. Therefore, maximum pooling operation is selected to generate features.
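The three operations above can be illustrated with a small pure-Python sketch for a single filter. It is only meant to make the data flow concrete: the function names, the toy alphabet size k = 4, and the hand-picked filter weights are assumptions for illustration, and the actual model uses Keras convolution layers with many learned filters.

```python
import math
from typing import List


def one_hot(byte: int, k: int = 256) -> List[float]:
    """Encode one byte value as a k-dimensional one-hot vector."""
    vec = [0.0] * k
    vec[byte] = 1.0
    return vec


def conv1d_tanh(x: List[List[float]], w: List[List[float]], bias: float) -> List[float]:
    """Slide one filter with byte window h = len(w) over the encoded packet.

    x holds n one-hot vectors; the same weights w are shared by every
    window. Returns a feature map of length n - h + 1.
    """
    h = len(w)
    feature_map = []
    for i in range(len(x) - h + 1):
        s = bias
        for j in range(h):
            s += sum(wv * xv for wv, xv in zip(w[j], x[i + j]))
        feature_map.append(math.tanh(s))
    return feature_map


def max_pool(feature_map: List[float]) -> float:
    """Keep only the strongest response of the filter."""
    return max(feature_map)


# Toy packet over a 4-symbol alphabet; the filter responds to the pattern (3, 1).
packet = [3, 1, 3, 2]
encoded = [one_hot(b, k=4) for b in packet]
pattern_filter = [[0.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0]]
feature_map = conv1d_tanh(encoded, pattern_filter, bias=0.0)
strongest = max_pool(feature_map)
```

The window that contains the pattern (3, 1) produces the largest activation, and max pooling keeps exactly that response, which is the sense in which the most obvious feature is extracted.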
3.2.2. Attention Mechanism Based on Packet
Since the fields in each packet are not equally important to the task of detecting malicious traffic and some malicious fields need a higher weight, this work uses a soft attention mechanism within packets to weight the fields. Let $c_{it}$ denote the t-th feature in the i-th data packet obtained after the spatial feature learning step (Section 3.2.1). The importance $u_{it}$ of each feature in the data packet is calculated by formula (4), the weight $\alpha_{it}$ of each feature is obtained by passing the calculated importance through softmax, and finally, each weight is multiplied by its feature to obtain the weighted feature vector $s_i$ of the data packet.
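A minimal sketch of such a soft attention layer is given below. The scoring function (tanh of a linear projection) and the aggregation by weighted sum are common choices assumed here for illustration; the names `soft_attention`, `w`, and `b` are hypothetical and the exact form used in TLARNN may differ.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(
    features: List[List[float]], w: List[float], b: float
) -> Tuple[List[float], List[float]]:
    """Weight T feature vectors of dimension d and aggregate them.

    Assumed scoring: u_t = tanh(w . f_t + b); weights = softmax(u);
    output = sum_t weights[t] * f_t.
    """
    scores = [math.tanh(sum(wi * fi for wi, fi in zip(w, f)) + b) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(a * f[j] for a, f in zip(weights, features)) for j in range(dim)]
    return weights, pooled


weights, pooled = soft_attention(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], w=[1.0, 1.0], b=0.0
)
```

In this toy call the third feature scores highest and therefore receives the largest weight, while the first two receive equal weights. The same mechanism can be reused at the flow layer, with packet vectors in place of field features.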
3.2.3. Flow Temporal Feature Learning
Since malware network behaviour is usually a series of continuous behaviours, and the behaviour reflected in network communication has forward and backward correlations, a bidirectional GRU is used to extract the temporal features of data packets, where $\overleftarrow{h_i}$ and $\overrightarrow{h_i}$ represent the backward and forward feature sequences, respectively. Finally, the temporal feature $h_i$ of the i-th packet is obtained by combining $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$, where $h_i$ is a bidirectional temporal feature.
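The bidirectional pass can be sketched in pure Python as below. This is an illustrative toy, not the trained layer: weights are random, biases are omitted for brevity, the update rule follows the convention h' = z ⊙ h + (1 − z) ⊙ h̃, and the forward and backward states are assumed to be combined by concatenation. All function names are hypothetical.

```python
import math
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def init_gru(input_dim: int, hidden_dim: int, seed: int) -> Dict[str, Matrix]:
    """Random input (W*) and recurrent (U*) weights for the three GRU blocks."""
    rng = random.Random(seed)

    def mat(rows: int, cols: int) -> Matrix:
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        name: mat(hidden_dim, input_dim if name[0] == "W" else hidden_dim)
        for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")
    }


def gru_step(p: Dict[str, Matrix], x: Vector, h: Vector) -> Vector:
    """One GRU update: update gate z, reset gate r, candidate state."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], rh))]
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h, cand)]


def run_gru(p: Dict[str, Matrix], seq: List[Vector], hidden_dim: int) -> List[Vector]:
    h = [0.0] * hidden_dim
    states = []
    for x in seq:
        h = gru_step(p, x, h)
        states.append(h)
    return states


def bigru(fwd, bwd, seq: List[Vector], hidden_dim: int) -> List[Vector]:
    """Concatenate forward states with backward states for every packet."""
    forward = run_gru(fwd, seq, hidden_dim)
    backward = run_gru(bwd, seq[::-1], hidden_dim)[::-1]
    return [f + b for f, b in zip(forward, backward)]
```

Because the backward pass reads the flow from the last packet to the first, the representation of an early packet also reflects what happens later in the flow, which is the forward and backward correlation the model is meant to capture.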
3.2.4. Attention Mechanism Based on Flow
Since the data packets are not equally important to the task of detecting malicious traffic, and attackers often mix in normal communication as obfuscation to avoid detection, this paper also uses a soft attention mechanism within the flow. First, the flow temporal feature learning step (Section 3.2.3) yields, for each data packet, a feature vector $h_i$ that contains the context information of the packet. Then, the importance of each data packet to the malicious detection task is obtained by assigning weights to these feature vectors, and finally, the information f of the entire flow is obtained.
3.2.5. Classification
After the abovementioned operations, the final feature vector of the entire flow is obtained, and softmax is then used to calculate the classification probability of the stream.
In this paper, cross entropy is used as the loss function to evaluate the difference between the probability distribution obtained by the current training and the real distribution. Backpropagation is performed according to the loss value, and the optimal encrypted malicious traffic detection model is obtained by iteratively updating the parameters along the gradient.
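The classification and loss computation can be illustrated with the short sketch below, assuming the flow vector has already been projected to one logit per class. The function names are hypothetical; the gradient shown is the standard result that, for softmax followed by cross entropy, the derivative of the loss with respect to the logits is the predicted distribution minus the one-hot label.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert class logits into a probability distribution."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], true_class: int) -> float:
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[true_class])


def logits_gradient(probs: List[float], true_class: int) -> List[float]:
    """Gradient of the loss w.r.t. the logits: probs minus one-hot label."""
    return [p - (1.0 if i == true_class else 0.0) for i, p in enumerate(probs)]


# Two-class example (benign = 0, malicious = 1) with an undecided model.
probs = softmax([0.0, 0.0])
loss = cross_entropy(probs, true_class=1)
grad = logits_gradient(probs, true_class=1)
```

The gradient pushes the logit of the correct class up and the other logit down, and backpropagation carries this signal through the attention, BiGRU, and convolution layers.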
4. Experiment and Analysis
4.1. Dataset
This experiment uses the CICAndMal2017 dataset of the Canadian Institute for Cybersecurity, which labels traffic at multiple classification granularities. Because the dataset is large, only part of the data is selected in this experiment. Single streams are extracted from the original traffic according to the abovementioned preprocessing method. The details of the encrypted data streams selected for the experiment are shown in Table 1.
4.2. Setup of Experiments
4.2.1. Settings
Most previous studies only support malicious traffic detection and lack fine-grained identification of encrypted traffic. Therefore, in order to study whether TLARNN can extract encrypted malicious traffic features that support fine-grained distinction, this paper evaluates the encrypted malicious traffic detection model from multiple aspects. Referring to the categories in CICAndMal2017, the experiment is divided into three classification granularities according to whether the traffic is malicious, the family of the malware, and the specific malware, as shown in Table 2.
Experiment A is a binary classification experiment that separates benign from malicious encrypted traffic, mainly to verify whether the TLARNN framework can effectively detect malicious traffic. The specific encrypted traffic data and corresponding labels used in experiment A are shown in Table 3.
Experiment B classifies the encrypted malicious traffic into four categories according to the malware families defined by CICAndMal2017, namely, Adware, Ransomware, Scareware, and SMS Malware, mainly to verify whether the TLARNN framework can further identify the family of encrypted malicious traffic. The specific encrypted traffic data and corresponding labels used in the experiment are shown in Table 4. The four families of malware are briefly described as follows: Adware is software that automatically generates pop-up ads on web pages or mobile devices and steals the user's location or browsing history. Ransomware makes money by encrypting victims' files and demanding a ransom in exchange for access to the data. Many ransomware attacks use phishing spam to trick users into clicking on a link. Scareware steals important information from users by enticing them, with pop-ups that claim to be security alerts, to download useless software. SMS malware spreads malicious software through short messages and induces victims to download it so that it can perform malicious operations on their mobile devices.
Experiment C is the 13-class classification experiment on encrypted malicious traffic. The specific encrypted traffic data and corresponding labels used in the experiment are shown in Table 5. Experiment C classifies encrypted malicious traffic according to the specific malicious software to verify whether the TLARNN framework can identify encrypted malicious applications at a fine granularity.
4.2.2. Evaluation Metrics
This experiment uses four common metrics for evaluation and comparison: accuracy, precision, recall, and F1-score. They are defined as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
TN (True Negative) indicates that the network detection result is benign encrypted network traffic, and the correct label is also benign encrypted network traffic. FP (False Positive) indicates that the network detection result is malicious encrypted network traffic, but the correct label is benign encrypted network traffic. TP (True Positive) indicates that the network detection result is malicious encrypted network traffic, and the correct label is also malicious encrypted network traffic. FN (False Negative) indicates that the network detection result is benign encrypted network traffic, but the correct label is malicious encrypted network traffic.
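As a worked example, the four metrics can be computed from the confusion counts as follows. The function name and the counts in the example call are hypothetical and are not taken from the experiments; returning 0.0 when a denominator is empty is a convention assumed here.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 80 malicious flows caught, 20 missed,
# 90 benign flows passed, 10 benign flows flagged.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

With these counts the accuracy is 170/200 = 0.85, the precision is 80/90, the recall is 80/100 = 0.8, and the F1-score is 160/190, which shows how a model can have a higher precision than recall when it misses more malicious flows than it raises false alarms.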
4.2.3. Experiment Environments and Model Parameters
The experimental hardware platform is a server with an Intel(R) Xeon(R) Silver 4215R CPU @ 3.20 GHz, 256 GB of memory, and an NVIDIA GeForce RTX 3090 GPU accelerator, and the model runs on Python 3.8. PyTorch, Keras, TensorFlow, tSHARK, and SCAPY are used to parse and analyse encrypted traffic packets. Numpy and Pandas are used for data processing. The specific structure of the TLARNN model is shown in Tables 6 and 7.
The TLARNN model is divided into two parts. The first is feature extraction at the packet layer, as shown in Table 6, which consists of two steps. First, spatial features are extracted through two one-dimensional convolutional neural network layers; then, the soft attention mechanism layer selects the malicious features among them. This part of the network has 377,600 trainable parameters, most of which come from the two one-dimensional convolutional layers.
The other part is feature extraction at the data flow layer, as shown in Table 7, which consists of the following three steps. First, the features extracted at the packet layer are taken as input, and the temporal features are extracted through the bidirectional GRU containing 512 hidden neurons. Then, the extracted temporal features are filtered through the soft attention mechanism layer to obtain the global malicious features in the data flow. Finally, the extracted features are used for classification. The whole TLARNN framework contains a total of 3,272,452 parameters, of which temporal feature extraction accounts for the majority, 2,365,440.
The whole model is built with Keras. In the binary malicious detection task, binary cross entropy is used as the loss function. In the multiclass experiments on malware families and applications, categorical cross entropy is used as the loss function. The optimizer is RMSProp. The first 300 bytes of the first 30 packets of the original encrypted traffic are extracted, that is, a 30 × 300 matrix for each sample is used as the input of the neural network.
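The reported size of the temporal part can be checked with a short calculation. The sketch below uses the parameter count of the Keras GRU layer in its default `reset_after=True` form, which keeps two bias vectors per gate. Two points are assumptions rather than facts from the tables: the 512 hidden neurons are taken per direction, and the per-packet input dimension is taken to be 256, a value chosen because it reproduces the reported figure.

```python
def gru_param_count(input_dim: int, units: int, reset_after: bool = True) -> int:
    """Trainable parameters of one GRU layer.

    Three blocks (update gate, reset gate, candidate state), each with an
    input kernel, a recurrent kernel and one or two bias vectors.
    """
    bias_sets = 2 if reset_after else 1
    return 3 * (units * input_dim + units * units + bias_sets * units)


# Assumed configuration: 256-dimensional packet vectors, 512 units per direction.
per_direction = gru_param_count(input_dim=256, units=512)
bidirectional = 2 * per_direction
print(per_direction, bidirectional)  # 1182720 2365440
```

Under these assumptions the bidirectional layer has 2 × 1,182,720 = 2,365,440 parameters, matching the count stated for temporal feature extraction.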
4.3. Results and Analysis
4.3.1. Ablation Experiment
The TLARNN model contains a soft attention mechanism layer at both the packet layer and the data flow layer. In order to evaluate the effectiveness of each key module in the TLARNN model, the following four comparative experiments are designed:
(1) PARNN. It uses the attention mechanism only at the packet layer.
(2) FARNN. It uses the attention mechanism only at the flow layer.
(3) CNNRNN. The attention mechanism is used at neither the packet layer nor the data flow layer.
(4) TLARNN-LSTM. The GRU at the data flow layer is replaced with LSTM, another gated recurrent unit, in which the update gate is replaced with a forget gate and an output gate.
The experiment was carried out on Experiment A, and the experimental results are shown in Table 8.
It can be seen from Table 8 that (1) the accuracy without the attention mechanism (CNNRNN) is lower than that with a single-layer attention mechanism (PARNN, FARNN). This result indicates that the attention mechanism used in this experiment helps to extract effective features. (2) PARNN is better than FARNN on all metrics, indicating that using the attention mechanism to extract malicious fields at the data packet layer is more effective than using it to extract malicious data packets at the data flow layer. (3) The performance of TLARNN is higher than that of PARNN and FARNN, which indicates that extracting malicious features at both the packet and data flow layers detects encrypted malicious traffic better, that is, the dual-attention mechanism is better than a single attention mechanism. (4) TLARNN, which uses GRU, performs better than TLARNN-LSTM because GRU has a simpler structure and converges faster than LSTM, which gives it certain advantages when training on smaller datasets.
4.3.2. Comparison Experiment
In this paper, the dataset is divided into a training set, a validation set, and a test set at a ratio of 8 : 1 : 1, and three methods from the literature are selected as baselines, two of which are end-to-end deep learning methods and one of which is a machine learning method:
(1) Feature Analysis. Shekhawat et al. [9] designed a machine learning-based method to classify malicious encrypted traffic and benign encrypted traffic. Thirty-eight features are extracted from the encrypted traffic with the Bro tool, and three machine learning models, namely, SVM, RFE, and XGBoost, are compared. We select only XGBoost, which has the best performance in that paper, for comparison.
(2) Deep-Packet. Lotfollahi et al. [26] proposed a network framework for encrypted traffic recognition and compared two neural network architectures, namely, the stacked autoencoder (SAE) and the convolutional neural network (CNN). We select only the CNN framework, which has the best performance in that paper, for comparison.
(3) HAST. Wang et al. [27] proposed an intrusion detection system based on spatial-temporal features, which extracts spatial features in network traffic through convolutional neural networks and learns temporal features between data packets through LSTM. Two model structures, HAST-I and HAST-II, are proposed, and the accuracy of the HAST-II structure is shown to be higher than that of the HAST-I structure. We select only the HAST-II structure for comparison.
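The 8 : 1 : 1 division mentioned at the start of this subsection could be implemented along the following lines. This is a sketch under stated assumptions: the text does not say whether samples are shuffled or stratified by class before splitting, so the random shuffle, the fixed seed, and the function name `split_8_1_1` are illustrative choices only.

```python
import random


def split_8_1_1(samples, seed=0):
    """Shuffle and split samples into train/validation/test at 8:1:1.

    Assumption: a plain random shuffle with a fixed seed; the paper does
    not state whether the split is stratified by class.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]   # remainder goes to the test set
    return train, val, test
```

With strongly imbalanced malware families, a stratified split would keep the class proportions equal across the three subsets, which may be preferable in practice.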
The experimental results are shown in the following tables, in which Table 9 is the comparison result of experiment A and Tables 10 and 11 are the comparison results of experiment B.
It can be seen from Table 9 that (1) the accuracy of the model proposed in this paper reaches 91.13%, which is 12.81%, 17.38%, and 1.5% higher than those of the other models. Among them, HAST comes closest to our model, which indicates that spatial-temporal features can extract malicious features more effectively. (2) Comparing the results of TLARNN and HAST, the bidirectional RNN structure achieves higher accuracy, which also suggests that malicious behaviour is a continuous behaviour with forward and backward correlation. (3) The Deep-Packet model, which extracts only spatial features, does not perform well. It can be inferred that an attack process is related to the order in which data packets are sent, so time-related features need to be extracted. (4) Although the feature analysis model has achieved good results on the CTU-13 dataset, it does not perform well on the Android encrypted malicious traffic dataset, which shows that manual feature extraction is not suitable for all data and that it is not appropriate to use the same set of features for different malware. The comparison results on experiment B are shown in Tables 10 and 11.
Experiment B is a four-category experiment (Adware, Ransomware, Scareware, and SMS Malware). According to the experimental results, ransomware and adware are detected better than SMS malware and scareware. Ransomware mainly communicates with the C2 server to upload and download the keys of encrypted files. The C2 connection in this process is easier to detect, so the ransomware is easier to detect. However, both SMS malware and scareware perform similar operations of stealing data and transmitting it back to the C2 server, so their traffic is similar to a certain degree, which causes the model to misclassify them.
From the results in Tables 10 and 11, it can be seen that the accuracy of fine-grained classification of malicious encrypted traffic is lower than the accuracy in experiment A. The results of feature analysis show that classification that relies only on manually extracted features and machine learning does not perform well on fine-grained classification. For attack traffic, temporal information is an important clue for classification, so Deep-Packet, which uses only spatial features, performs very poorly, while HAST and TLARNN, which extract temporal features, achieve better results. TLARNN is about 1% higher than HAST, which suggests that using the dual-attention mechanism at the data packet and data flow layers helps the model extract important information.
Experiment C is a 13-class classification task, and the classification accuracy is 80.23%. The classification accuracy is shown in Figure 3.

The classification precision, recall, and F1-score of experiment C are shown in Figure 4.

It can be seen from Figure 4 that the detection accuracy of TLARNN for different malware traffic fluctuates around 80%, the variance is small, and the detection results are stable. According to the confusion matrix, the pairs (WannaLocker, RansomBO), (LockerPin, Charger), and (Youmi, Kemoge) are misclassified as each other. The two members of each pair of "similar" malware belong to the same family: WannaLocker and RansomBO belong to ransomware, LockerPin and Charger belong to ransomware, and Youmi and Kemoge belong to adware. From this result, it can be inferred that these groups of malware may adopt the same attack pattern or contain similar attack components or attack steps, resulting in similar traffic patterns. Since the model proposed in this paper uses a dual-attention mechanism to extract malicious features at multiple layers, it can distinguish the subtle differences between similar malware to a certain extent, which helps ensure the accuracy of classification.
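The per-class precision, recall, and F1-score, as well as the pairs of classes that are most often mistaken for each other, can both be derived from a confusion matrix. The following sketch shows one way to do this. The function names are hypothetical, and the toy matrix in the usage note is made up for illustration; it is not the paper's data.

```python
import numpy as np


def per_class_metrics(cm):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)   # column sums: samples predicted as each class
    actual = cm.sum(axis=1)      # row sums: samples truly in each class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp),
                          where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp),
                   where=denom > 0)
    return precision, recall, f1


def most_confused_pairs(cm, labels, top_k=3):
    """Rank unordered class pairs by how often they are mistaken for each other."""
    cm = np.asarray(cm)
    mutual = cm + cm.T   # errors in both directions
    pairs = [(int(mutual[i, j]), labels[i], labels[j])
             for i in range(len(labels))
             for j in range(i + 1, len(labels))]
    return sorted(pairs, reverse=True)[:top_k]
```

For a toy three-class matrix `[[8, 2, 0], [1, 9, 0], [0, 0, 10]]` with labels `["A", "B", "C"]`, the recall of class A is 0.8 and `most_confused_pairs(..., top_k=1)` returns `[(3, "A", "B")]`, which is the same kind of mutual confusion reported above for malware from the same family.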
5. Conclusion
In this paper, we propose a dual-attention framework for detecting encrypted malicious traffic over the TLS/SSL protocol. The framework sets attention mechanisms at the data packet and data flow layers and combines spatiotemporal features to extract and screen malicious features at multiple levels. In addition, this paper introduces the visual transformer into the encrypted traffic identification task; a deep network is constructed in which the self-attention mechanism further integrates the global spatial features of encrypted traffic, extracts deeper encrypted traffic features, and fuses temporal features and spatial features to identify encrypted traffic. It provides rich encrypted traffic characteristics and opens up new ideas for encrypted traffic identification. Extensive experiments show that this method improves detection accuracy and outperforms existing schemes in fine-grained encrypted malicious traffic detection. In the future, the following challenges remain:
(1) In the real world, the encrypted malicious traffic hidden behind normal traffic accounts for only a small part of all traffic. We need to design a reasonable sampling method for oversampling and undersampling or consider using a GAN or the SMOTE algorithm for sample generation. In addition, how to better verify whether the generated samples are aggressive also needs further research.
(2) Encrypted malicious traffic generated by zero-day malware is hard to detect. How to incorporate active learning into malicious traffic detection to reduce the annotation cost and improve the detection performance of the model is a problem to be solved in the future.
Data Availability
The labeled dataset used to support the findings of this study is available from the corresponding author upon request. The CICAndMal2017 dataset used in the article can be downloaded from the following website link: https://www.unb.ca/cic/datasets/index.html.
Conflicts of Interest
Acknowledgments
This work was supported by the Science and Technology Project of the Headquarters of State Grid Corporation of China, "The research and technology for collaborative defense and linkage disposal in network security devices" (5700-202152186A-0-0-00).