Abstract

As video streaming traffic continues to grow rapidly, variable bitrate (VBR) encoding has been widely adopted by modern live video streaming service providers such as YouTube, TikTok, and Twitch. However, the video bitrate can serve as a fine-grained fingerprint of the video stream, which creates a risk of privacy leakage. Several studies have attempted to infer private information from encrypted video streams, but most of them impose strict requirements on the deployment environment and are severely limited when noise interference exists. In fact, the video traffic from a multimedia edge server is distinct from interapplication traffic flows due to device customization and can be identified even under noise interference or when the victim is in a weak network condition. In this paper, a video traffic identification method is proposed to identify the encrypted video stream from a multimedia edge server under the interference of irrelevant traffic flows. First, we use an interapplication filter to identify the traffic from the edge server. Then, a longest-common-subsequence (LCS)-based method is developed for similarity matching to resist the noise interference from unpredictable burst traffic and network environment variations. To evaluate the system performance, we set up a prototype system with an AWS EC2 server and a Raspberry Pi device and use real-world trace data for pushing movies to victims. The experimental results show that the accuracy of the proposed strategy can reach 89.1% within 140 seconds of eavesdropping, even when the traffic is mixed with 14% noise interference.

1. Introduction

With the improvement of network bandwidth, video streaming services have become popular in recent years, quickly sweeping across the world and occupying viewers’ free time with high-quality content in live e-commerce, sports events, and video games. For example, according to a report by Statista, a global business data platform, the number of monthly active users of TikTok worldwide has exceeded 1 billion [1]. Meanwhile, the number of monthly active users of YouTube has exceeded 2.3 billion. However, the growing number of users has put great bandwidth pressure on video data centers. Thanks to the development of edge computing in recent years, more and more Internet service providers try to save server resources and reduce the round-trip time by handing user tasks, such as computation offloading [2] and video delivery [3], to edge servers. In the foreseeable future, more and more applications will be handled by edge servers as the performance of edge devices improves and 5G infrastructure becomes widespread.

Conventionally, the bitrate-based fingerprint carried by a video traffic flow can be identified by video traffic pattern analysis even under Transport Layer Security (TLS) encryption. In recent years, many studies have attempted to infer the content of TLS-encrypted videos watched by viewers [4–6], but most of these works assume that the encrypted video stream can be directly observed by attackers without interference from irrelevant traffic flows. Some studies have also proposed noise-resistant fingerprint identification methods, but none of them is suitable for video bitrate fingerprints [7]. In practice, video traffic is usually delivered from a content delivery network (CDN), which may serve multiple websites or applications at the same time [8]. Therefore, a complete and noiseless bitrate-based traffic fingerprint can hardly be identified from real-world trace data. Furthermore, the effectiveness of traffic fingerprints is highly sensitive to network fluctuations, and some features of the traffic fingerprint drift seriously under unstable network conditions.

The prevalence of edge servers brings a new risk of video traffic identification due to the customization of edge devices. Conventionally, a CDN server undertakes several tasks, including video delivery and static resource delivery, under the same domain name, which makes it difficult to identify the video traffic flow encrypted by TLS. However, a multimedia edge server hardly delivers any irrelevant traffic due to the customization of the edge device, which makes it possible to identify the bitrate-based traffic flows from it. Therefore, the video traffic from the edge server is easier to identify and its traffic features are more stable. In this paper, we present a noise-resistant video traffic identification method for VBR traffic flows. We show that the traffic fingerprint from real-world trace data captured from a multimedia edge server can still match the bitrate fingerprint after appropriate preprocessing. First, a simple traffic filter that uses only three labels from the unencrypted traffic is used to extract the traffic that comes from the multimedia edge server. After that, an LCS-based fingerprint-matching method is proposed to eliminate the interference of the remaining two types of noise and match the traffic fingerprint against the bitrate fingerprint.

The rest of this paper is organized as follows: The literature is explored in Section 2. The data analysis is presented in Section 3. The system design is presented in Section 4. The traffic filter and LCS-based matching method are illustrated in Section 5 and Section 6. The system performance is evaluated in Section 7. Finally, Section 8 concludes this paper.

2.1. Privacy Leakage and Protection

With the growth of Internet applications, new security issues arise alongside the development of Internet infrastructure. On the one hand, new paradigms bring convenience to our daily life, such as recommendation systems [9, 10], computation offloading [2, 11, 12], and route planning [13, 14]. On the other hand, privacy defense strategies need to consider more aspects as infrastructure is upgraded: mobile devices [15], Internet of Things (IoT) devices [16–18], and cloud servers [14, 19]. Specifically, machine learning [20] and edge computing are developing rapidly, which brings more complex privacy leakage problems [21]. With the improvement of bandwidth and device performance, more video streaming service providers use edge servers to cache and distribute video content in order to reduce the pressure on data centers, which has made multimedia privacy protection on edge servers a popular research topic [22, 23]. In this paper, we discuss the privacy leakage caused by encrypted video under noise interference.

2.2. Privacy Leakage from Video Stream

Side-channel attacks that exploit privacy leakage from encrypted video have attracted extensive attention in recent years. Saponas et al. [4] build fingerprints for VBR-encoded video by using multiple sliding windows to divide the video into segments of several milliseconds, but they achieve only 62% accuracy with 10 minutes of eavesdropping and without noise interference. Gu et al. [24] improved the DTW algorithm to make it suitable for the DASH protocol and built a classifier that can identify videos from both Netflix and YouTube, but they state that low bandwidth and high packet loss rates are outside their consideration, since users normally leave a video stream immediately when the experience is bad. With the prevalence of machine learning, neural networks offer an advantage in feature extraction in sophisticated environments. Schuster et al. [5] modeled the fingerprints and proposed a CNN-based model to identify the fingerprints of VBR-based videos from YouTube and Netflix. Nevertheless, all of these bitrate-based video identification strategies assume a stable network. Otherwise, both weak network conditions and burst traffic seriously interfere with the traffic fingerprint, which inevitably leads to wrong identification results because the points in the bitrate fingerprint are matched incorrectly. In the following part, we analyze the noise interference and then propose a noise-resistant video traffic identification method.

2.3. Sequence Matching Method

Sequence matching methods are essential in solving many pattern recognition problems in domains such as anomaly detection and speech recognition [25]. The popular methods match by points (e.g., Edit Distance on Real Sequence (EDR) [26] and Dynamic Time Warping (DTW) [27]), match by shape (e.g., Frechet distance [28]), or segment the sequence before matching (e.g., One Way Distance [29]). Nevertheless, most sequence matching methods do not consider the matching effectiveness in an environment with interference. Thanks to the powerful representation ability of deep learning, similarity learning can accommodate heterogeneous features in sophisticated environments, and there are several deep-learning-based methods such as the CNN-based solutions [30, 31] and the LSTM-based solution [32]. However, deep-learning-based models usually need online training to adapt to the latest features, and their computational cost is very high.

3. Traffic Data Analysis

In this section, we analyze video data to illustrate the video bitrate and several types of traffic noise, using the classic movie Titanic as an example. In the following parts, nginx and ffmpeg are used to push the encrypted video traffic, the Chrome downloader is used to provide the irrelevant traffic, and wondershaper is used to simulate a weak network environment with random interference from bandwidth limitation, RTT, and packet loss.

3.1. Side Channel Attack on Video Traffic

VBR encoding brings a risk of privacy leakage through bitrate fluctuation. Figure 1 shows the bitrate of a video encoded with constant bitrate (CBR) and with VBR; the two encodings show significantly different fluctuation trends.

Additionally, TLS only encrypts the content but leaks the statistical features of the traffic. Figure 2 shows the correlation between the bitrate of a VBR video and its encrypted video stream.

Clearly, the video that a viewer is watching can be identified through analysis of the video traffic even after encryption. Once the attacker obtains a segment of the traffic fingerprint, the viewer's privacy may be leaked.
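The relationship between a VBR bitrate curve and the encrypted traffic it produces can be illustrated with a small correlation check. The following is a minimal sketch, not the measurement behind Figure 2: the per-second values are hypothetical, and the traffic is assumed to be the bitrate plus a small protocol overhead.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("sequences must have the same length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series (e.g., CBR) has no trend to correlate
    return cov / sqrt(var_x * var_y)

# Hypothetical per-second values (KB); these numbers are illustrative only.
vbr_bitrate = [120, 340, 560, 210, 90, 400]
tls_traffic = [131, 352, 575, 219, 98, 415]  # bitrate plus assumed overhead
cbr_bitrate = [300] * 6

print(round(pearson(vbr_bitrate, tls_traffic), 3))  # close to 1 for VBR
print(pearson(cbr_bitrate, tls_traffic))            # no usable trend for CBR
```

Under these assumptions the VBR series tracks the encrypted traffic almost perfectly, whereas a constant-bitrate series exposes no trend that could be matched.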

3.2. Bitrate Features with Irrelevant Traffic

When providing video streaming services for users, edge devices can also provide other multimedia services from different websites at the same time (such as encode offloading or download acceleration). As a result, the eavesdropped traffic contains multiple types of packets, which makes it difficult to identify the video traffic. Figure 3 shows the traffic from a Raspberry Pi edge server in two cases: a video stream only, and a video stream mixed with a download stream.

It can be seen that the video stream traffic is masked by the mixed traffic, so the video traffic features disappear.

3.3. Bitrate Features in the Weak Network Condition

VBR features are usually easy to identify, which makes privacy leakage more likely. However, such features are easily affected by noise or weak network conditions, which reduces the identification accuracy. Figure 4 shows the interference of bandwidth limitation and RTT on the traffic fingerprint of Pirates of the Caribbean 5 from 1000 seconds to 1700 seconds. The video traffic is collected from the Raspberry Pi edge server.

Counting from the beginning of traffic eavesdropping, the bandwidth limitation from 50 to 120 seconds and the burst RTT from 170 to 180 seconds lead to video playback jitter and a corresponding backward drift of the traffic features. Figure 5 adds the interference of 15% random packet loss to the traffic fingerprint of Pirates of the Caribbean 5 from 2000 seconds to 2200 seconds. Owing to the packet retransmission mechanism of TCP, the feature drift is reduced, but the packet loss still lowers the matching accuracy between the bitrate fingerprint and the traffic fingerprint. In short, bitrate-based video fingerprints place stringent requirements on network conditions.

3.4. Bitrate Features with Intra-Application Interference

Even within the same application, the features are significantly affected by user operations, which usually cannot be predicted. Whether viewers browse the video list while watching or communicate through the intrasite chat system, the resulting traffic destructively interferes with the traffic features and seriously reduces the identification accuracy. Figure 6 shows the burst traffic caused by browsing the video list and its interference with the traffic features.

Clearly, the traffic generated by unpredictable behavior from 100 s to 110 s completely covers the original traffic features.

4. System Design

In this section, we present the design of the noise-resistant encrypted video traffic identification system. The system structure is presented in Figure 7. The proposed system can be divided into the following parts:
(i) Interapplication traffic filter: A filter based on three labels, including the Server Name Indication (SNI), is proposed to extract the traffic that comes from the multimedia edge server.
(ii) LCS-based fingerprint matching: An LCS-based method is proposed for matching the traffic fingerprint and the bitrate fingerprint under noise interference.

The SNI field carries the domain name requested by the client in plain text during the handshake stage of the TLS protocol. The attacker can easily obtain the target domain by using the SNI as an interapplication traffic filter, further identify the whole TLS session through the IP address or sequence number, and then obtain the complete video traffic flow without other interapplication interference, owing to the customization of the edge device. It should be noted that all video providers need to transfer the video stream according to the protocol specified by the edge multimedia framework, and the edge server uses this unified video protocol to send the video stream to users. Since popular edge multimedia frameworks such as EasyNVR and Link Visual all use TLS for video delivery, our filter can be regarded as a general method for existing video services. However, after filtering, the traffic fingerprint is still affected by burst traffic from unpredictable behaviors (such as browsing the video list) and by weak network conditions (for example, low bandwidth and packet loss). Therefore, an LCS-based method is proposed to filter the intraapplication interference and identify the matched segments between the traffic fingerprint and the bitrate fingerprint.

5. Interapplication Traffic Filter

In this section, we propose a traffic filter to eliminate irrelevant traffic from other applications. Three labels are used to implement the traffic filter: SNI, ContentType, and source IP address (srcaddr):
(i) SNI is used to select the video traffic to be identified.
(ii) ContentType is used to divide the TLS sessions.
(iii) The IP address is used to obtain the continuous TLS session.

The video content is sent as a stream, but each video segment is encrypted in its own TLS session; thus, a session represents a video segment of fixed length. ContentType is used to check whether a packet is a TLS handshake packet (which denotes the start of a new TLS session). Since the SNI in the handshake packet holds the domain name unencrypted, all video streaming TLS sessions can be identified. The filtering process is shown in Algorithm 1.

Algorithm 1: Interapplication traffic filter.
Input:
 packet sequence P;
Output:
 packet sequence S;
while P[++i] != NULL do
  if P[i].ContentType == HandShake and P[i].SNI == target domain then
   Create a new sequence S
   ip = P[i].ip
  else if P[i].ContentType != HandShake and P[i].ip == ip then
   Add P[i].length to sequence S
  end if
end while

After filtering, we get a set $S = \{s_1, s_2, \ldots, s_n\}$ containing the packets of all TLS sessions, where $s_i = (a_i, l_i)$ is a two-tuple for the $i$th packet with $a_i$ as the arrival time and $l_i$ as the packet length.
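The filtering logic of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the prototype: packets are assumed to be already parsed into records, and the `Packet` class, its field names, and the example domain and addresses are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Packet:
    # Hypothetical record for an already-parsed packet (field names are assumptions).
    time: float                # arrival time in seconds
    length: int                # packet length in bytes
    ip: str                    # IP address that ties the packet to a TLS session
    is_handshake: bool         # True when the TLS ContentType is Handshake
    sni: Optional[str] = None  # plaintext SNI, present only in handshake packets

def filter_sessions(packets, target_domain):
    """Return one list of (time, length) tuples per TLS session of target_domain."""
    sessions = []
    current = None     # session that is currently being filled
    session_ip = None  # IP address recorded from the session's handshake
    for p in packets:
        if p.is_handshake:
            if p.sni == target_domain:
                current = []  # a matching handshake starts a new session
                sessions.append(current)
                session_ip = p.ip
            # handshakes for other domains are ignored, as in Algorithm 1
        elif current is not None and p.ip == session_ip:
            current.append((p.time, p.length))
    return sessions

packets = [
    Packet(0.00, 517, "10.0.0.2", True, "video.example.com"),
    Packet(0.10, 1400, "10.0.0.2", False),
    Packet(0.15, 900, "10.0.0.9", False),  # irrelevant traffic from another host
    Packet(0.20, 1200, "10.0.0.2", False),
    Packet(1.00, 517, "10.0.0.2", True, "video.example.com"),
    Packet(1.05, 1000, "10.0.0.2", False),
]
sessions = filter_sessions(packets, "video.example.com")
print(sessions)
```

The sketch keeps the arrival time together with the length so that its output has the two-tuple form described above; packets from other hosts are dropped because their IP address does not match the one recorded at the handshake.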

6. Noise-Resistant Fingerprint Matching

In the previous section, we obtained the packet sequence by filtering the TLS sessions. However, intraapplication interference still exists and seriously reduces the matching accuracy. In this section, we propose a noise-resistant similarity matching method based on the LCS model. Before matching the bitrate fingerprint against the traffic fingerprint, we discuss the feature drift caused by weak network conditions and intraapplication noise interference. The bandwidth fluctuation caused by a weak network limits the data obtained by viewers and thereby destroys the traffic fingerprint. For example, for a video segment whose bitrate fingerprint is (1, 2, 3, 4, 5), the traffic fingerprint eavesdropped from a viewer with a stable network is (2, 3, 4, 5, 6), but the one eavesdropped from another viewer with a weak network becomes (2, 3, 0, 0, 4, 5, 6), which seriously reduces the identification accuracy. Similarly, intraapplication noise also changes the traffic features and reduces the accuracy. For example, the traffic fingerprint eavesdropped from a viewer without interference is (2, 3, 4, 5, 6), but when there is burst traffic caused by unpredictable behavior, the traffic fingerprint is covered by the burst and becomes (2, 7, 11, 8, 6). Both types of interference cause a drift between the bitrate fingerprint and the traffic fingerprint that violates uniqueness at fine granularity, even though the trend remains consistent over long-term observation. To perform similarity matching, we relocate the traffic fingerprint by second, as shown in Algorithm 2.

Algorithm 2: Relocation of the traffic fingerprint by second.
Input:
 packet sequence S;
Output:
 traffic fingerprint T;
acc_len = 0
acc_time = 0
while S[++i] != NULL do
  if S[i].time - acc_time >= 1 then
   T.append(acc_len)
   acc_len = 0
   acc_time++
  else if S[i].time - acc_time < 1 then
   acc_len += S[i].length
  end if
end while
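A minimal Python sketch of this per-second relocation is given below. It is an illustration under assumptions, not the prototype code: packets are assumed to be `(time, length)` tuples sorted by arrival time, and the one-second windows are assumed to start at the first packet. Unlike a literal reading of the pseudocode, the sketch also counts the packet that crosses a window boundary and emits zero-valued seconds when no packet arrives.

```python
def traffic_fingerprint(packets):
    """Sum packet lengths per one-second window.

    packets: iterable of (time, length) tuples sorted by arrival time.
    Returns a list whose k-th element is the number of bytes seen in second k.
    """
    fingerprint = []
    acc_len = 0          # bytes accumulated in the current window
    window_start = None  # start time of the current one-second window
    for time, length in packets:
        if window_start is None:
            window_start = time
        # Close every window that ended before this packet arrived.
        while time - window_start >= 1:
            fingerprint.append(acc_len)
            acc_len = 0
            window_start += 1
        acc_len += length
    if window_start is not None:
        fingerprint.append(acc_len)  # flush the last, possibly partial, window
    return fingerprint

example = [(0.0, 100), (0.5, 200), (1.2, 300), (3.1, 50)]
print(traffic_fingerprint(example))
```

Emitting explicit zeros for silent seconds keeps the traffic fingerprint aligned with the timeline, which is what allows stalls caused by limited bandwidth to appear as gaps in the sequence.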

The algorithm re-aggregates the packet lengths of the packet sequence into one-second intervals so that each element of the traffic fingerprint is aligned on the timeline with the corresponding element of the bitrate fingerprint. In general, weak network conditions and burst traffic are infrequent, which means that if most intervals of the traffic fingerprint and the bitrate fingerprint match in the long-term trend, a few local mismatches caused by a weak network or burst traffic can be ignored. However, common similarity matching methods require that all elements in the sequence be matched even if the fingerprint is under interference. Therefore, we propose a fingerprint matching method that takes the traffic noise interference into account. We define $d(b_i, t_j)$ as the Euclidean distance between $b_i$ and $t_j$, where $b_i$ is the $i$th point of the bitrate fingerprint and $t_j$ is the $j$th point of the traffic fingerprint. For a given $b_i$ and $t_j$, if $d(b_i, t_j)$ is less than a threshold $\varepsilon$, then $b_i$ is considered to match $t_j$. On this basis, a noise-resistant model, N-LCS, is built on the LCS model to tolerate fingerprint mismatches.

First, for the bitrate fingerprint and traffic fingerprint , the points in can only match the points in forward (e.g., can only match ). This is because during the video playback, the video player will not cache the played video contents. In addition, the matching strategy of LCS is too simple to adapt the weak network condition, so N-LCS optimize the matching strategy to adapt the noise interference. For and :(i)if and , the point and are considered to be matched.(ii)if and , the point and are considered to be not matched, and the unmatched point may have been caused by burst traffic.(iii)if and , the point and are considered to be not matched, and the unmatched point may caused by limited bandwidth, RTT or packet loss. As the limited traffic will usually lead to the drift of traffic features, and the backtracking function should be added to LCS model in order to drop the redundant fingerprint at the trail of to avoid false matching.

We use a two-dimensional matrix with the size of to save the temporary matching result, where is the length of bitrate and traffic fingerprint. The values of matrix are calculated by the following formula:where is the threshold of . In order to eliminate the interference of feature drift, N-LCS makes two rounds of backtracking at the end of the algorithm. The first round of backtracking determines the drift distance of the traffic fingerprint and drops the fingerprint at the tail of the bitrate fingerprint according to the drift distance. The second round of backtracking will use the bitrate fingerprint calculated in the first round to find the matching path in the matrix and calculate the longest common subsequence between two fingerprints according to the new matching path using dynamic programing as the matching result. The calculating process is shown in Algorithm 3.

Input:
 bitrate fingerprint ;
 traffic fingerprint ;
Output:
 the length of subsequence
(1) = len ; define matrix [k][k] and [k][k]
(2)for iterate and do
(3)  if [i] - [j] then
(4)   if [i] >  [j] then
(5)     [i][j] =  [i − 1][j − 1] + 1, mark i and j as matched points in matrix
(6)   else
(7)     [i][j] =  [i − 1][j − 1]
(8)   end if
(9)  else if [i − 1][j] >  [i][j − 1] then
(10)    [i][j] =  [i − 1][j]
(11)  else
(12)    [i][j] =  [i][j − 1], mark i and j as noise points in matrix
(13)  end if
(14)end for
(15)i = .length; j = .length
(16)while iterate similarity path in do
(17)  if [i][j] holds noise points then
(18)    ++
(19)  end if
(20)end while
(21)i = .length - ; j = .length
(22)while iterate similarity path in do
(23)  if [i][j] holds matched points then
(24)    ++
(25)  end if
(26)end while

Figure 8 shows the partial match result between traffic fingerprint and bitrate fingerprint. The red line shows the match relation between bitrate and traffic fingerprint. It can be seen that the LCS-based matching model can successfully ignore the invalid features caused by interference.
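The idea of skipping interfered points can be illustrated with a plain threshold-based LCS. The sketch below is a simplification and not the full N-LCS: it assumes that two points match whenever their absolute difference is below a threshold, and it omits the forward-only constraint and the two rounds of backtracking described above. The function name and the threshold value 1.5 are assumptions chosen for the toy fingerprints used earlier in this section.

```python
def threshold_lcs(bitrate, traffic, eps):
    """Length of the longest common subsequence under a distance threshold,
    plus the matched (bitrate_index, traffic_index) pairs."""
    n, m = len(bitrate), len(traffic)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(bitrate[i - 1] - traffic[j - 1]) < eps:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back through the table to recover which points were matched.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if abs(bitrate[i - 1] - traffic[j - 1]) < eps:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return table[n][m], pairs

bitrate = [1, 2, 3, 4, 5]
stable = [2, 3, 4, 5, 6]
weak = [2, 3, 0, 0, 4, 5, 6]  # stall caused by limited bandwidth
burst = [2, 7, 11, 8, 6]      # covered by burst traffic

for name, fp in (("stable", stable), ("weak", weak), ("burst", burst)):
    print(name, threshold_lcs(bitrate, fp, eps=1.5))
```

In the weak-network example the two zero-valued seconds are skipped and the remaining points still align in order, whereas a position-by-position comparison would misalign every point after the stall.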

7. Implementation and Evaluation

7.1. Experimental Setup

To build the prototype system, we use an Amazon EC2 server as the video stream server, a Raspberry Pi as the edge server, and a Xiaomi 11 Ultra as the victim device. The server configuration is listed in Table 1. nginx and ffmpeg are used to push the video stream over the RTMPS protocol, and Wireshark is used to capture the encrypted TLS traffic of the victim, simulating a man-in-the-middle (MITM) attack. We use videos with several bitrates to generate the bitrate and traffic fingerprints and evaluate the effectiveness of N-LCS; the configuration of the video dataset is shown in Table 2 (the dataset can be found at https://1drv.ms/u/s!AnB84OgJQM04jkAYDlzO9fhchxeZ?e=fj4cY8).

7.2. Effectiveness of the Traffic Filter

We then test the effectiveness of the interapplication traffic filter proposed in Section 5. We use Wireshark to capture the video traffic encrypted with the RTMPS protocol, and a domain name registered with Tencent Cloud is used to fill in the SNI field. The output traffic from the edge device and the filtered input traffic of the victim are collected separately, as shown in Figure 9. The results show that the proposed traffic filter identifies all the target TLS sessions accurately.

7.3. Threshold Analysis

In this part, we calculate the threshold of the N-LCS model, which is used to identify the matched points in the traffic fingerprint and the bitrate fingerprint. A total of 300 groups of 50-second bitrate fingerprints and traffic fingerprints are used to calculate the similarity distance in the following cases, and the similarity distance is shown in Figure 10:
(i) Fingerprints come from the same video.
(ii) Fingerprints come from different videos, but the bitrate is similar.
(iii) Fingerprints come from different videos, and the bitrate of the videos varies greatly.

Then, True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) are used to define accuracy:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

After that, we use the intersection of the two false-rate curves as the threshold to maximize the accuracy. As shown in Figure 11, 239 is the best threshold, reaching the maximum accuracy of 0.766 (i.e., 76.6% of the points in the fingerprints can be accurately matched).
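The threshold selection can be sketched as a sweep over candidate values. This is a minimal illustration, assuming that a pair of points is declared matched when its similarity distance is below the threshold; the distance samples and candidate values are hypothetical and are not the data behind Figures 10 and 11.

```python
def rates(same, diff, thr):
    """Accuracy, false-positive rate, and false-negative rate for one threshold.

    same: distances between points that truly belong to the same video
    diff: distances between points that belong to different videos
    """
    tp = sum(d < thr for d in same)  # same-video points declared matched
    fn = len(same) - tp
    fp = sum(d < thr for d in diff)  # different-video points declared matched
    tn = len(diff) - fp
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return accuracy, fp / len(diff), fn / len(same)

def pick_threshold(same, diff, candidates):
    """Candidate at which the two false-rate curves are closest to crossing."""
    def gap(thr):
        _, fpr, fnr = rates(same, diff, thr)
        return abs(fpr - fnr)
    return min(candidates, key=gap)

# Hypothetical similarity distances (illustrative values only).
same_video = [50, 120, 180, 220, 400]
diff_video = [150, 300, 350, 500, 600]

best = pick_threshold(same_video, diff_video, range(0, 701, 50))
print(best, rates(same_video, diff_video, best))
```

When several candidates give the same gap, `min` returns the first one, so the sweep order determines the tie-break.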

Finally, we calculate the identification accuracy with 1–100 matching points as the threshold in the above three cases, and the results are shown in Figure 12. When the bitrates of different videos vary greatly, there are fewer matched points between fingerprints, and the similarity distance between mismatched points is usually large, so only a small threshold is required to achieve high accuracy. When the bitrates are similar and the fingerprints are short, many points are matched even though the fingerprints come from different videos, which results in a lower accuracy than in the other cases. Since the identification accuracy of the threshold for matching points is not 100%, the identification accuracy eventually decreases to 0 as the threshold increases. Considering the difference between fingerprints, we use 0.43 as the threshold in the following experiments.

7.4. The Effectiveness of N-LCS without Noise Interference

Figure 13 compares the N-LCS with two popular similarity-matching methods in a noise-free environment with different fingerprint lengths.

As the fingerprint length increases, the proportion of matched segments in the fingerprints gradually stabilizes, so the accuracy of all algorithms increases. However, the focus of N-LCS is to identify and remove the noise interference in the traffic fingerprint rather than to improve the matching accuracy of fingerprints without noise interference; therefore, the accuracy of N-LCS is close to that of Pearson. It is worth noting that the fluctuation of traffic features leads to poor performance of the DTW algorithm, which is based on a globally optimal distance, and its accuracy is significantly lower than that of Pearson and N-LCS.
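The sensitivity of DTW to fluctuating points can be seen from its definition: every point of both sequences must be aligned with some point of the other, so an outlier always contributes to the global distance. The following is a textbook DTW sketch with an absolute-difference cost, given for illustration only; it is not the p-dtw variant of [24], and the toy fingerprints are the ones used in Section 6.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance with an absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Every point must be aligned: extend the cheapest of the three moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

bitrate = [1, 2, 3, 4, 5]
stable = [2, 3, 4, 5, 6]
burst = [2, 7, 11, 8, 6]  # same segment covered by burst traffic

print(dtw_distance(bitrate, stable))
print(dtw_distance(bitrate, burst))
```

Because the burst points cannot be skipped, the distance to the correct bitrate fingerprint grows with the size of the burst, which is the behavior that a subsequence-based method avoids.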

7.5. The Effectiveness of N-LCS under Noise Interference

To evaluate the effectiveness of N-LCS under noise interference, we use an automatic script to randomly generate different levels of noise interference during video playback. Fingerprints with a length of 200 seconds are used to test the effect of bandwidth limitation, burst RTT, packet loss, and burst traffic on the identification accuracy of N-LCS under different noise levels. The results are shown in Table 3. Then, traffic captured with mixed noise (bandwidth limitation, packet loss, and burst traffic each account for 1/3) is used to compare the N-LCS, Pearson, and DTW algorithms. The results are shown in Figure 14.

As the proportion of noise interference increases, the identification accuracy of all three algorithms decreases to varying degrees, while the accuracy of DTW and Pearson decreases much faster than that of N-LCS. In addition, when the proportion of noise interference is less than 14%, the accuracy of N-LCS decreases slowly, whereas when the proportion exceeds 15%, the accuracy decreases significantly. This is because the N-LCS matching strategy reserves sufficient redundancy for noise interference: the average number of matching points between matched fingerprints is much higher than the identification threshold, so a small number of unmatched points does not greatly affect the accuracy. Then, we set the noise proportion to 14% and compare N-LCS with three recent identification methods based on video fingerprints: beauty [5], p-dtw [24], and leaky [33]. The test video clips were taken from the films Titanic, Pirates of the Caribbean 5, Inception, and Avengers 3. The results are shown in Table 4. Because the previous methods focus only on the accuracy of the matching strategy and ignore the noise interference in real-world eavesdropping environments, their accuracy drops in weak network conditions, whereas N-LCS reaches the highest accuracy even under noise interference.

8. Conclusion and Future Work

In this paper, we proposed a noise-resistant bitrate-based identification method for encrypted video traffic on the Raspberry Pi platform, which uses an LCS-based model to match the traffic and bitrate fingerprints. A real dataset built from several famous movies captured from the edge server and a prototype system were presented for performance evaluation. The experiments show that even when the interference proportion reaches 14%, the method achieves 89.1% accuracy after 140 seconds of traffic eavesdropping.

With the prevalence of video streaming systems, our work provides a new eavesdropping method that is robust to interference. In future work, we will improve our model in the following two aspects. First, the identification accuracy will be improved for the case in which the traffic fingerprints eavesdropped from victims are similar. Second, the proposed method currently supports only the popular protocols used in multimedia edge frameworks, such as RTMPS; more protocols, for example HLS and DASH, will be supported in the future.

Data Availability

The bitrate fingerprint data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Zengkun Xie performed data collection, cleaning, and annotation, which are important parts of our work and greatly support our proposed machine learning methods.

Acknowledgments

This work is supported by National Natural Science Foundation of China 61602214.