Abstract
Persistent user behavior monitoring, which deals with finding users that occur persistently over a measurement period, is a hot topic in traffic measurement. It is significant for many applications, such as anomaly detection. Former works concentrate on monitoring frequent user behavior, such as users occurring frequently either over one measurement period or on one monitor. They have paid little attention to detecting persistent user behavior over a long measurement period on multiple monitors. However, persistent users do not necessarily appear frequently in a short measurement period, but appear persistently in a long measurement period. Due to limited resources on monitors, it is not practical to collect a tremendous amount of network traffic in a long measurement period on one single monitor. Moreover, since network attackers deliberately send packets flowing through the entire managed network, it is difficult to detect abnormal behavior on one single monitor. To solve the above challenges, a novel method for detecting persistent user behavior called DPU is proposed. It contains both online distributed traffic collection in a long measurement period on multiple monitors and offline centralized user behavior detection on the central server. The key idea of DPU is to design a compact distributed synopsis data structure that collects the relevant information about users occurring in a long measurement period on each monitor. On the central server, user IDs that are unknown in advance are reconstructed using simple calculations and bit settings, and users with persistent behavior are found on the basis of their estimated occurrence frequency. Experiments are conducted on real traffic to evaluate the performance of detecting persistent user behavior, and the experimental results illustrate that our method improves estimation accuracy by about 30% and detection precision by about 40%, and is about 3 times faster than the related method.
1. Introduction
Frequent item mining and persistent item mining are two fundamental problems of data stream mining [1–3]. Frequent item mining finds items with a large number of occurrences over a measurement period. It has been extensively used in many applications such as keyword queries through a search engine. Persistent item mining is a special case of frequent item mining, which counts an item only once when it occurs repeatedly within a timeslot. This study focuses on the problem of finding persistent items in the network-wide view. For an item, its occurrence frequency is the number of timeslots in which it appears. The problem of persistent item mining is thus converted into finding items with large occurrence frequency. We consider that items span across the whole managed network consisting of a central server and multiple monitors, where each monitor serves as one data collection point and the central server serves as one user behavior detection point. Each monitor stores the summary of network traffic and periodically sends the summary to the central server. The server estimates the occurrence frequency of users using the received summary and reconstructs user IDs. Here, user IDs can be identifiers such as source IP, source IP and source port pair, destination IP, destination IP and destination port pair, or the five-tuple of a network packet [4].
The task of persistent user behavior detection is to find users with persistent behavior, and it is essentially the problem of mining persistent items. Persistent user behavior detection has been extensively applied in many fields such as click fraud and Botnet detection. For click fraud detection, identifying persistent user behavior helps to find click fraud attackers who periodically click on ads to increase payment for advertisers in pay-per-click online advertising systems [5]. For Botnet detection, persistent user behavior can be used to find attackers (C&C servers) who control a large number of bots to attack important servers through periodic communication between C&C servers and bots [6]. However, methods of frequent item mining cannot be directly used to solve the problem of finding persistent items, since frequent items are essentially different from persistent items.
Persistent user behavior detection faces several challenges. First, in many real-world systems, a large amount of network traffic arrives at data collection devices at high rates. For instance, users post and comment on social networks and thereby generate massive data in the form of streams. Due to limited computation and memory resources on data collection devices, it is not feasible to store all user IDs over a long measurement period. To solve this challenge, we need to store user IDs in a summary data structure which consumes a small amount of memory. Second, to evade detection, network attackers usually access important servers through multiple intermediate nodes. Because attack traffic is distributed over many nodes in a managed network system, network attacks cannot be detected efficiently by a single monitor. To solve this challenge, we need to aggregate the data received from multiple monitors to find network-wide attacks effectively. Third, it is difficult to accurately reconstruct user IDs using the encoded bits from multiple monitors, since the communication cost between the server and each monitor is high. To solve this challenge, we construct reversible hash functions based on the Chinese remainder theorem and recover user IDs by simple computation even though the full user IDs are not stored. Fourth, due to false positives, the reconstructed user IDs may not belong to users with persistent behavior. To solve this challenge, we filter out the reconstructed user IDs with low estimated occurrence frequency to improve the detection accuracy.
To the best of our knowledge, little research work has been done on persistent user behavior detection in the network-wide view. The straightforward solution to find users with persistent behavior is that each monitor sends all observed packets to the server and the server finds users with persistent behavior by analyzing the received packets. However, it is not feasible to store all packets due to the limited computing and memory resources of monitors. Besides, this solution incurs too much communication overhead between the monitors and the server. Existing methods, such as traffic statistics estimation [7–9], are not able to detect persistent user behavior over a long measurement period. To address these challenges, each monitor generates a compact summary of network traffic over each timeslot online and sends the summary to the server, which detects persistent user behavior over a long measurement period offline. Our main contributions are summarized as follows.
First, we develop a novel compact summary data structure which is suitable for detecting persistent user behavior in a managed network consisting of a central server and multiple monitors. On each monitor, user IDs are extracted from network traffic and compressed into the summary data structure online in each timeslot over one measurement period. Only several bits are stored for each distinct user ID. Therefore, the memory and computation overhead is small.
Second, for the task of detecting persistent user behavior, we need to find the IDs of users with persistent behavior. User IDs are not known in advance, and it is not practical to maintain all user IDs in massive network traffic. Therefore, we develop a reversible method to reconstruct the IDs of users with persistent behavior. The method only needs simple computation and the setting of several bits for each user with persistent behavior. Moreover, it does not store the full user IDs for detecting persistent user behavior, which saves memory.
Finally, we analyze the estimation accuracy of DPU in theory. We also conduct extensive experiments on two real network traffic traces. The experimental results demonstrate that DPU is superior to the related method in terms of estimation accuracy, detection precision, and processing time.
The rest of this study is organized as follows. Section 2 summarizes related works. Section 3 formulates the problem. Section 4 presents our method. Section 5 evaluates performance and conducts experiments. Section 6 concludes.
2. Related Works
Detecting abnormal behavior, such as frequent items [10, 11], superspreaders [12, 13], and persistent items, has been a hot topic of network traffic measurement and monitoring. Frequent items occur many times over one short measurement period, and superspreaders connect to a large number of distinct hosts over one measurement period. Persistent items continuously occur over one long measurement period. Superspreaders and persistent items are special cases of frequent items. Nevertheless, the methods for detecting frequent items and superspreaders cannot be directly applied to persistent items. Users with persistent behavior are actually persistent items in this study. Existing approaches to find users with persistent behavior can be mainly divided into two categories, whose performance is summarized in Table 1.
The first category identifies users with persistent behavior at one single monitor. The straightforward methods maintain all users in a hash table over a long measurement period to detect users with persistent behavior. They generate large memory overhead, so it is not feasible to deploy them as a function module in network equipment. Chen et al. [14] proposed a data streaming method to track continuous long-duration flows (CLDF) which occur continuously over a long measurement period. It uses two Bloom filters to store the candidates of continuous long-duration flows occurring over the previous and current timeslots, respectively. Continuous short-duration flows are filtered by sampling to improve its performance. Lee et al. [15] proposed a data streaming method to detect long-duration flows without false negatives (NLDF). It uses two Bloom filters to maintain flows by hash functions, and it does not generate false negatives. Research shows that tracking persistent items exactly online requires large memory and cannot be run on a single monitor. Dai et al. [16] proposed a persistent items identification scheme (PIE), which solves the problem of identifying persistent items and estimating the occurrence frequency of each persistent item. It uses Raptor codes to encode the ID of each occurring item and only stores several bits of the encoded ID over a measurement period. Wang et al. [17] proposed user embedding (UE) and reversible user embedding (RUE) to detect persistent frequent behaviors. They only need to randomly select bits from one bit array and set the selected bits to one for each user occurring over each timeslot. Then, they estimate the occurrence frequency of each user to detect persistent users.
The second category identifies users with persistent behavior on multiple monitors. Singh and Tirthapura [18] proposed distributed methods to track persistent items approximately under an infinite window and a sliding window (IW and SW). They have low communication overhead, false positive rate, and false negative rate. Dai et al. [19] proposed a probabilistic distributed persistent item identification method (DISPERSE) to find persistent items in distributed datasets. Each monitor compresses the ID of each item in a lossy fashion and sends the set of lossy compressed IDs of items to the server; then, the server recovers the IDs of persistent items.
Besides, some methods estimate the persistent spread of a flow, which is the number of distinct hosts that the flow connects to persistently over a long measurement period. Xiao et al. [21] proposed a persistent spread estimator to help detect long-term stealthy behaviors. It generates a virtual bitmap randomly selected from the bitmap to estimate the persistent spread of each flow. Zhou et al. [22] proposed a highly compact and efficient virtual intersection HyperLogLog architecture for estimating the persistent spread of each flow to help detect long-term stealthy behaviors. It uses register intersection and maximum likelihood estimation to measure the persistent cardinality of each flow, and it obtains far better memory efficiency. Huang et al. [23] proposed an efficient and accurate k-persistent estimator. It uses SUM to join the information collected from different periods to estimate the k-persistent spread of each flow.
The above methods are essentially different from the persistent user behavior detection studied in this work, and they are good at detecting abnormal behavior over one short measurement period at one monitor. Therefore, they are inapplicable to detecting network-wide persistent user behavior over a long measurement period. Shin and Yoon [20] proposed a long-duration flow method (LDF) to detect persistent items which occur over one long measurement period. It uses an integer array to construct a summary of network traffic. Based on the constructed summary, when the occurrence of one flow is queried, it generates the flow's virtual vector consisting of several bits randomly selected from the integer array using hash functions. Then, it estimates the occurrence of the flow using the generated virtual vector. Compared to LDF, our algorithm is more efficient in detecting persistent user behavior.
3. Problem Statement
In this study, we consider a managed network system composed of a central server and a set of monitors. From the beginning of a measurement period, each monitor extracts user IDs from network traffic and compresses the user IDs into a synopsis data structure. At the end of the measurement period, each monitor sends the generated summary data structure to the central server. The central server merges the summary data structures received from the monitors and detects persistent user behavior by probabilistic counting. Let r be the number of monitors. Moreover, the user ID, denoted as u, can be built from arbitrary packet fields, such as source IP, source port, destination IP, destination port, and protocol. The central server does not know user IDs in advance.
We are interested in measuring how long u occurs. The issue is how to quantitatively define the occurrence frequency of u. The traditional definition of occurrence frequency is the number of consecutive timeslots in which u appears over the long measurement period. This definition cannot solve the problem of concealed network attacks. For example, we divide one hour into 60 timeslots of one minute each. Suppose that u occurs in 50 timeslots, of which only 48 are consecutive. Based on the above definition, the occurrence frequency of u is 48. If the given threshold is 50, the network attack cannot be detected. Therefore, we define the occurrence frequency of u as the number of timeslots in which u appears, whether or not they are consecutive. The occurrence frequency of u is then 50, and u is considered to be a concealed network attack behavior.
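The difference between the two definitions in the example above can be checked with a short Python sketch. The function names and the concrete choice of timeslots are hypothetical and serve only to reproduce the 50 versus 48 figures from the text.

```python
def total_timeslots(slots_seen):
    """Occurrence frequency as used in this study: distinct timeslots with an appearance."""
    return len(set(slots_seen))


def longest_consecutive_run(slots_seen):
    """Traditional definition: length of the longest run of consecutive timeslots."""
    seen = set(slots_seen)
    best = 0
    for s in seen:
        if s - 1 not in seen:  # s starts a run
            length = 1
            while s + length in seen:
                length += 1
            best = max(best, length)
    return best


# Hypothetical user: present in timeslots 0..47, then again in timeslots 50 and 52.
slots = list(range(48)) + [50, 52]
threshold = 50
detected_by_total = total_timeslots(slots) >= threshold
detected_by_run = longest_consecutive_run(slots) >= threshold
```

With a threshold of 50, the user is flagged under the timeslot-count definition but missed under the consecutive-run definition.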
Generally, we specify the occurrence frequency of u by dividing a long measurement period into many timeslots and measure the number of timeslots, in which u appears persistently. We give a more formal definition as follows.
To formulate the problem of persistent item detection, we introduce some notations. We divide the long measurement period T into many timeslots $T_k$ of the same size. Let U be the set of users occurring over the long measurement period T. Each user has a user identification, and U is a subset of the ID space {0, 1, …, N − 1}, where N is the size of the user ID space. For example, we divide T (= 60 minutes) into 60 timeslots of 1 minute each. Supposing that the user identification consists of the source IP and the destination IP, we have $N = 2^{64}$.
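As an illustration of the ID space in this example, the following Python sketch packs a source and destination address pair into one integer. It assumes 32-bit IPv4 addresses, and the helper names are hypothetical rather than part of the proposed method.

```python
import ipaddress


def user_id(src_ip, dst_ip):
    """Pack an IPv4 (source, destination) pair into one integer in [0, 2**64)."""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src << 32) | dst


def split_user_id(uid):
    """Recover the dotted-quad address pair from a packed user ID."""
    src = ipaddress.IPv4Address(uid >> 32)
    dst = ipaddress.IPv4Address(uid & 0xFFFFFFFF)
    return str(src), str(dst)


uid = user_id("10.0.0.1", "192.168.1.2")
```

Every pair maps to a distinct integer below $2^{64}$, which is why the ID space is too large to enumerate or store directly.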
For each user u ∈ U, let $o_{u,t}^{i}$ denote the occurrence indicator of u in timeslot t on monitor i. When u occurs in timeslot t on monitor i, $o_{u,t}^{i}$ is one, and otherwise, it is zero. The occurrence indicator of u for timeslot t is then denoted as $o_{u,t} = \bigvee_{i=1}^{r} o_{u,t}^{i}$. When user u occurs on at least one monitor in timeslot t, $o_{u,t}$ is one, and otherwise, it is zero. The occurrence frequency of the user u is defined as $f_u = \sum_{t} o_{u,t}$, where the sum runs over all timeslots of T. Persistent user behavior is defined as users whose occurrence frequency $f_u$ is no less than the given threshold F, namely, users that occur in at least F timeslots over the measurement period T.
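The definition above can be made concrete with a minimal Python sketch that computes the exact occurrence frequency when all observations are available. This is only a ground-truth reference, not the proposed estimator, and the nested-list input layout is an assumption made for illustration.

```python
def exact_frequency(observations, user):
    """observations[t][i] is the set of user IDs seen by monitor i in timeslot t.

    A timeslot counts once if the user is seen on at least one monitor
    (logical OR over monitors); the frequency sums this over all timeslots.
    """
    return sum(
        1 for per_monitor in observations
        if any(user in seen for seen in per_monitor)
    )


def persistent_users(observations, threshold):
    """Users that occur in at least `threshold` timeslots."""
    users = set()
    for per_monitor in observations:
        for seen in per_monitor:
            users |= seen
    return {u for u in users if exact_frequency(observations, u) >= threshold}


# Three timeslots, two monitors each (hypothetical data).
obs = [
    [{1, 2}, {3}],
    [{2}, {2}],
    [set(), {1}],
]
```

User 2 appears on both monitors in the second timeslot but is counted only once for it, so its occurrence frequency is 2.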
The proposed framework for estimating the occurrence frequency of u is intended to be generic, and its parameters are configured according to the application requirements. Specifically, the size of the timeslot depends on the specific application. Besides, the threshold for abnormal user behavior detection is set to be larger than the number of timeslots in which normal users are measured to appear, and this number differs among networks. Similarly, the other parameters of abnormal user behavior detection are set on the basis of the specific application and normal user behavior. Consider the example of detecting concealed network attacks by measuring the occurrence frequency of u. When the measurement period is one day and we set the size of the timeslot to one hour, we may find persistent user behavior in normal traffic, since it is probable that legitimate users access their e-mail and websites regularly. When we set the size of the timeslot to 5 minutes, we may still find persistent user behavior in normal traffic, since connections to a service may appear intermittently over many timeslots. However, when we choose a proper timeslot size and threshold, it is unlikely that normal users access the servers with the same occurrence frequency as users with persistent behavior.
The goal of this study is to design an occurrence frequency estimation framework that is composed of an online module and an offline module. The former stores all users in real time using synopsis data structures on each monitor. The latter estimates the occurrence frequency of users based on the summary data structures from multiple monitors and detects persistent user behavior. We will evaluate the performance of our design based on memory overhead and estimation accuracy.
4. Our Algorithm
The framework of our method is shown in Figure 1. First, each monitor processes the arriving packets and updates the summary data structure online. Second, each monitor sends the generated summary data structure to the central server at the end of one measurement period. Finally, on the basis of the received data structures, the central server estimates occurrence frequency of users by the probabilistic counting approach and then reconstructs IDs of users with persistent behavior to detect persistent user behavior offline.
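One simple way to picture the merging step on the central server is a bitwise OR of the per-monitor bit vectors for the same timeslot. The Python sketch below assumes that all monitors use bit vectors of equal size and identical hash functions; this is an assumption for illustration and is not confirmed as the exact merging rule of DPU.

```python
def merge_bit_vectors(vectors):
    """Bitwise OR of equally sized bit vectors (one per monitor, same timeslot)."""
    if not vectors:
        raise ValueError("at least one bit vector is required")
    n = len(vectors[0])
    if any(len(v) != n for v in vectors):
        raise ValueError("all bit vectors must have the same size")
    merged = bytearray(n)
    for v in vectors:
        for pos, bit in enumerate(v):
            merged[pos] |= bit
    return merged


# Two hypothetical monitors reporting 4-bit summaries for one timeslot.
merged = merge_bit_vectors([bytearray([1, 0, 0, 1]), bytearray([0, 0, 1, 1])])
```

Under this assumption, a bit in the merged vector is one exactly when some monitor set it, which mirrors the rule that a user occurs in a timeslot if it is seen on at least one monitor.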

4.1. Data Structure
We construct a bit vector B [24] of size n, to which users are mapped by hash functions. The first group of hash functions $h_j$, 1 ≤ j ≤ k, is defined as $h_j$: {0, 1, …, N − 1} ⟶ {0, 1, …, $n_j$ − 1}, where N is the size of the user ID space, and $n_i$ and $n_j$ are coprime to each other, 1 ≤ i < j ≤ k. Another group of hash functions for one timeslot t is defined as $f_j$: {0, 1, …, N − 1} ⟶ {0, 1, …, n − 1}, 1 ≤ j ≤ k, where n is the size of the bit vector B. User IDs are mapped to the bit vector B by these hash functions, and the corresponding bits in B are set to one. The data structure is shown in Figure 2. At the beginning of each timeslot, each monitor initializes B. B is updated by the hash functions during each timeslot. At the end of each timeslot, each monitor sends its bit vector B to the central server and resets B. The update procedure is shown in Algorithm 1.

For each user, we need 2k hash calculations and k bit setting operations.
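The per-user cost stated above can be seen in a minimal Python sketch of a monitor-side update. How the two hash groups are combined is an assumption here: each residue `u mod n_j` is hashed together with the index j and the timeslot t into a bit position. The function names, the use of BLAKE2, and the moduli are hypothetical and do not reproduce Algorithm 1.

```python
import hashlib


def residue_hash(u, n_j):
    # Assumed simplest reversible hash: the residue of the ID modulo n_j.
    return u % n_j


def position_hash(j, c, t, n):
    # Assumed second-stage hash: maps (hash index, residue, timeslot) to a bit index.
    digest = hashlib.blake2b(f"{j}:{c}:{t}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % n


def update(bits, u, t, moduli):
    # 2k hash evaluations and k bit settings per user.
    n = len(bits)
    for j, n_j in enumerate(moduli):
        bits[position_hash(j, residue_hash(u, n_j), t, n)] = 1


def maybe_present(bits, u, t, moduli):
    # True if all k bits of u are set; collisions can cause false positives.
    n = len(bits)
    return all(
        bits[position_hash(j, residue_hash(u, n_j), t, n)]
        for j, n_j in enumerate(moduli)
    )


bits = bytearray(1000)          # bit vector B for one timeslot
moduli = [251, 257, 263]        # pairwise coprime n_j (hypothetical values)
update(bits, 12345, 0, moduli)
```

A user that was inserted is always reported as present, while a user that never occurred may still be reported if its k positions collide with bits set by others.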
4.2. Estimating Occurrence Frequency
There are r monitors in the managed network system, and they may be any computing devices such as the computer, firewall, and router. The occurrence indicator of u for one timeslot t is denoted as , and it is estimated as . When one user u appears at least in one monitor i for one timeslot t, is one. Let be the set of users whose hash values hj(u) are c, on one monitor i, and is denoted as , where indicates the set of users on one monitor i. We define the occurrence indicator of as for one timeslot t. When at least one user in appears in one timeslot t, is one and zero otherwise. Therefore, is estimated as . When is one, can be correctly estimated as . However, when is zero, may be erroneously estimated as . Since hash functions , 1 ≤ j ≤ k, are random and independent each other, different users may be mapped to the same bits in Bi (1 ≤ i ≤ r). For one user u on one monitor i, is estimated as
When is one, is one, and we obtainwhere is the number of zero bits of Bi in one timeslot j on one monitor i. Let denote the set of users appearing in one timeslot t on one monitor i. We obtain
Thus, we obtain the following equation based on (2) and (3).
Then, we obtain
Therefore, we obtain the following equation based on (5).
When nj is larger than , is approximately denoted as (7) based on the limit theory.
Therefore, we obtain
Thus, each zero bit is selected with probability qt. Therefore, we estimate the occurrence frequency fu of one user u aswhere denotes a function, and is one as is zero and zero otherwise. Estimating occurrence frequency is shown in Algorithm 2.
Then, the mathematical expectation and variance of the estimation for fu are computed aswhere denotes the timeslots in which one user u does not appear.
The mathematical expectation and variance of the estimation are proved as follows.
Equation (9) is transformed into the following equation.
Then, we obtain
Since the estimator obeys 0-1 distribution with probability qt, is qt. Therefore, we have
The estimator is unbiased.
We obtain
Since equals , we obtain
4.3. Detecting Persistent User Behavior
Since U is not known in advance, we consider the reconstruction of user IDs. Let $U_j^c$ be the set of users whose hash values $h_j(u)$ are c, and it is denoted as $U_j^c = \{u \in U \mid h_j(u) = c\}$. We define the occurrence indicator of $U_j^c$ as $o_t(U_j^c)$ for one timeslot t. The corresponding occurrence frequency is $f_{U_j^c} = \sum_{t} o_t(U_j^c)$. When $U_j^c$ contains users with persistent behavior, the estimation of its occurrence frequency $\hat{f}_{U_j^c}$ is more than the threshold $F_c$ set for the occurrence frequency of $U_j^c$. Therefore, we consider the users whose occurrence frequency estimation is more than $F_c$ as users with persistent behavior. Let $P_j$ be the set to which users with persistent behavior are mapped using hash functions $h_j$, 1 ≤ j ≤ k, and it is expressed as

$$P_j = \{c \mid \hat{f}_{U_j^c} > F_c,\ 0 \le c \le n_j - 1\}. \tag{17}$$
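Each user with persistent behavior is mapped to k sets $P_j$, 1 ≤ j ≤ k. Therefore, the problem of detecting persistent user behavior is transformed into finding the users mapped to $P_j$, 1 ≤ j ≤ k. We reconstruct IDs of users with persistent behavior using hash functions $h_j$, 1 ≤ j ≤ k, and they are expressed as

$$h_j(u) = (k_j u + b_j) \bmod n_j, \quad 1 \le j \le k, \tag{18}$$

where the $n_j$, 1 ≤ j ≤ k, are coprime to each other, $1 \le k_j \le n_j - 1$, and $0 \le b_j \le n_j - 1$. Let $c_j$ equal $h_j(u)$, 1 ≤ j ≤ k. Consider the simple situation. When $k_j = 1$ and $b_j = 0$, (18) is converted into the following equation.

$$u \equiv c_j \pmod{n_j}, \quad 1 \le j \le k. \tag{19}$$

We solve (19) using the Chinese remainder theorem [25], and its solutions are expressed as

$$u \equiv \sum_{j=1}^{k} c_j M_j M_j^{-1} \pmod{M}, \tag{20}$$

where $M = \prod_{j=1}^{k} n_j$, $M_j = M / n_j$, and $M_j^{-1}$ is obtained using $M_j M_j^{-1} \equiv 1 \pmod{n_j}$.

Consider the general situation. Based on the Euler–Fermat theorem, (18) is converted into the following equation, provided that $k_j$ and $n_j$ are coprime.

$$u \equiv k_j^{\varphi(n_j) - 1} (c_j - b_j) \pmod{n_j}, \quad 1 \le j \le k. \tag{21}$$

Let $d_j = k_j^{\varphi(n_j) - 1} (c_j - b_j) \bmod n_j$, and we obtain

$$u \equiv d_j \pmod{n_j}, \quad 1 \le j \le k. \tag{22}$$

The solution of (22) is expressed as follows.

$$u \equiv \sum_{j=1}^{k} d_j M_j M_j^{-1} \pmod{M}. \tag{23}$$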
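As a small worked instance of the simple situation ($k_j = 1$, $b_j = 0$), take the hypothetical moduli $n = (3, 5, 7)$ and residues $c = (2, 3, 2)$, and write $M$ for the product of the moduli, $M_j = M / n_j$, and $M_j^{-1}$ for the inverse of $M_j$ modulo $n_j$. These numbers are chosen only for illustration.

```latex
\begin{align*}
M &= 3 \cdot 5 \cdot 7 = 105, \qquad M_1 = 35,\; M_2 = 21,\; M_3 = 15, \\
M_1^{-1} &\equiv 2 \pmod{3}, \qquad M_2^{-1} \equiv 1 \pmod{5}, \qquad M_3^{-1} \equiv 1 \pmod{7}, \\
u &\equiv 2 \cdot 35 \cdot 2 + 3 \cdot 21 \cdot 1 + 2 \cdot 15 \cdot 1 = 233 \equiv 23 \pmod{105}.
\end{align*}
```

Indeed, $23 \bmod 3 = 2$, $23 \bmod 5 = 3$, and $23 \bmod 7 = 2$, so the ID 23 is recovered from its three residues alone.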
Since $c_j \in P_j$, 1 ≤ j ≤ k, the number of combinations of $c_j$, 1 ≤ j ≤ k, is $\prod_{j=1}^{k} |P_j|$. Besides, the reconstructed IDs of users may be false positives due to collisions of the hash functions $f_j(u)$, 1 ≤ j ≤ k, that is, different users are mapped to the same positions by $f_j(u)$, 1 ≤ j ≤ k. Therefore, we estimate the occurrence frequency of the set of users whose hash values $h_j(u)$ are $c_j$ before IDs are reconstructed. If the estimation is less than $F_c$, the corresponding IDs are not considered to be users with persistent behavior. Otherwise, we reconstruct IDs of users with persistent behavior by (23). The estimation of their occurrence frequency is expressed as
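The reconstruction step can be sketched in Python as follows. The sketch assumes hash functions of the form `(k_j * u + b_j) mod n_j` with pairwise coprime moduli and each `k_j` coprime to `n_j`. It computes modular inverses with Python's built-in `pow(x, -1, n)` (Python 3.8+) instead of the Euler–Fermat power, which yields the same inverse; all function and parameter names are hypothetical.

```python
from itertools import product
from math import prod


def crt(residues, moduli):
    """Solve u = residues[j] (mod moduli[j]) for pairwise coprime moduli."""
    M = prod(moduli)
    total = 0
    for c_j, n_j in zip(residues, moduli):
        M_j = M // n_j
        total += c_j * M_j * pow(M_j, -1, n_j)  # pow(x, -1, n): modular inverse
    return total % M


def reconstruct(candidates, moduli, mults=None, offsets=None):
    """Enumerate one residue per candidate set P_j and invert the hash functions.

    candidates[j] plays the role of P_j; the number of returned IDs equals
    the product of the candidate set sizes.
    """
    k = len(moduli)
    mults = mults or [1] * k
    offsets = offsets or [0] * k
    ids = []
    for combo in product(*candidates):
        # Undo the affine map first: d_j = k_j^{-1} * (c_j - b_j) mod n_j.
        d = [
            ((c - b) * pow(a, -1, n)) % n
            for c, a, b, n in zip(combo, mults, offsets, moduli)
        ]
        ids.append(crt(d, moduli))
    return ids


moduli = [3, 5, 7]
mults, offsets = [2, 3, 4], [1, 2, 3]
u = 52
hashes = [(a * u + b) % n for a, b, n in zip(mults, offsets, moduli)]
recovered = reconstruct([[h] for h in hashes], moduli, mults, offsets)
```

Because every combination of residues yields a candidate ID, filtering out residues with low estimated occurrence frequency before enumeration keeps the number of candidates, and hence of false positives, small.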
5. Experiment and Evaluation
In this section, we introduce two public datasets. Our solution DPU was implemented in C++. For comparison purposes, we also implemented LDF [20] in C++. All experiments are performed on a server equipped with four-core Intel Xeon E-2224, 3.40 GHz CPU, and 32 GB RAM.
5.1. Datasets
To evaluate the performance of our algorithm, we use two real network traffic traces, CDD and CND. CDD [26] is a network traffic trace without packet header over 8 minutes, which is published by CAIDA. CND [27] is another network traffic trace without packet header collected on a high-speed link over 60 minutes, which is published by CERNET. CDD is divided into 480 timeslots of 1 second, and CND is partitioned into 360 timeslots of 10 seconds. There is almost the same number of packets in each timeslot. Table 2 provides the number of packets, source IP addresses (SIP for short), destination IP addresses (DIP for short), and source IP address and destination IP address pairs for CDD and CND.
In this study, we use SIP and DIP pairs as user IDs. There are approximately 615K and 35K distinct source IP addresses, 268K and 46K distinct destination IP addresses, and 823K and 78K distinct source IP address and destination IP address pairs during each timeslot of 1 minute for CDD and CND, respectively. When the length of each timeslot is 2 minutes, there are about 724K and 48K distinct source IP addresses, 373K and 65K distinct destination IP addresses, and 1M and 126K distinct source IP address and destination IP address pairs during each timeslot for CDD and CND, respectively. As shown in Figure 3, most users have low occurrence frequency, and a small number of users have high occurrence frequency, that is, the occurrence frequency of users approximately follows a heavy-tailed distribution for CDD and CND.
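The partitioning of a trace into fixed-size timeslots can be sketched in a few lines of Python. The function names and the timestamp convention (seconds as floating-point numbers) are assumptions made for illustration.

```python
def timeslot_index(timestamp, start, slot_seconds):
    """Index of the timeslot that a packet timestamp (in seconds) falls into."""
    return int((timestamp - start) // slot_seconds)


def num_timeslots(duration_seconds, slot_seconds):
    """Number of timeslots needed to cover a trace (ceiling division)."""
    return -(-duration_seconds // slot_seconds)


cdd_slots = num_timeslots(8 * 60, 1)     # 8-minute trace, 1-second timeslots
cnd_slots = num_timeslots(60 * 60, 10)   # 60-minute trace, 10-second timeslots
```

These values reproduce the 480 and 360 timeslots stated for CDD and CND.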

In the experiments, we randomly allocate the packets in the datasets to the monitors. For detecting persistent user behavior, we evaluate the performance of our algorithm compared with LDF [20]. LDF constructs a virtual vector consisting of bits randomly selected from an integer array by different hash functions for each flow. It uses an occurrence estimator to detect long-duration flows. We compare our method with LDF under the same memory usage in terms of estimation accuracy and detection precision. DPU and LDF have two configurable parameters: the data structure size n and the number k of hash functions. The experiments are performed under $n = 1 \times 10^5$, $2 \times 10^5$, $3 \times 10^5$, $4 \times 10^5$, and $5 \times 10^5$, respectively. The number k of hash functions is set to 286 according to the method proposed in [28]. We allocate the same amount of memory, 1 MB, to each method. We also need to know the actual persistent users under different timeslots in advance, which are given in Table 3.
5.2. Estimation Accuracy
We use source IP address and destination IP address pairs as IDs of users for CDD and CND. We evaluate the accuracy of occurrence frequency estimation by the weighted mean difference (WMD for short) [29], which is defined as
where $f_u$ denotes the occurrence frequency of one user u, and $\hat{f}_u$ denotes the estimation of the occurrence frequency of u. The smaller the WMD is, the more accurate the estimation of occurrence frequency is.
Figure 4 compares the occurrence frequency estimation of users with their actual occurrence frequency for our algorithm and LDF using CDD and CND. Each point in the plots indicates a user ID, whose x coordinate denotes the actual occurrence frequency and whose y coordinate denotes the corresponding estimation. We evaluate the accuracy of occurrence frequency estimation by the diagonal line. The closer the points are to the diagonal line, the more accurate the estimation of occurrence frequency is. As shown in Figures 4(a)-4(b), most of the points generated by our algorithm are closer to the diagonal line than those of LDF using CDD. As shown in Figures 4(c)-4(d), we obtain similar experimental results using CND. Therefore, our algorithm is superior to LDF in terms of occurrence frequency estimation.

Figures 5(a)-5(b) show the experimental results of the two algorithms under different timeslot sizes for CDD. We can observe that the WMD of both algorithms decreases as the size n increases and increases as the timeslot size increases for CDD. The WMD of our algorithm is about half that of LDF under ΔT = 1 for CDD when the size n is $2 \times 10^5$. Similarly, Figures 5(c)-5(d) demonstrate the experimental results of the two algorithms under different timeslot sizes for CND. Therefore, our algorithm outperforms LDF in terms of WMD. In short, our algorithm is superior to LDF in estimating the occurrence frequency of users.

Besides, as shown in Figure 6, the two algorithms exhibit different estimation errors for CDD and CND when the number of monitors increases. Figure 6 reports the WMD of the two algorithms under different numbers of monitors when the timeslot size is 10 seconds. From Figure 6(a), we can see that the WMD of both algorithms decreases as the number of monitors increases, and the WMD of our algorithm is smaller than that of LDF. The reason for the decrease is that we randomly assign the packets of the datasets to the monitors, so the average number of users arriving at each monitor decreases as the number of monitors increases. We obtain similar experimental results for CND, as shown in Figure 6(b). Therefore, our algorithm outperforms LDF in estimating the occurrence frequency of users.

5.3. Detection Precision
For detecting persistent user behavior, we evaluate the performance of our algorithm compared with LDF. For each user u in the entire user ID space, we estimate the occurrence frequency of u to determine whether u is a user with persistent behavior. The detected users may contain users without persistent behavior, and some users with persistent behavior may not be detected. Therefore, we use the false positive rate and the false negative rate to evaluate the detection precision of our algorithm. The false positive rate (FPR for short) denotes the ratio of the number of identified users that are not actual users with persistent behavior to the number of identified users, and the false negative rate (FNR for short) indicates the ratio of the number of users with persistent behavior not identified to the number of actual users with persistent behavior [25]. The FPR and FNR are expressed as

$$\mathrm{FPR} = \frac{|B \setminus A|}{|B|}, \qquad \mathrm{FNR} = \frac{|A \setminus B|}{|A|},$$

where A is the set of actual users with persistent behavior, and B is the set of users identified as having persistent behavior. We conduct extensive experiments to detect users with persistent behavior. We treat source IP address and destination IP address pairs as user IDs. Figure 7 shows the detection precision under different timeslot sizes and data structure sizes for CDD. As shown in Figures 7(a)-7(b), the false positive rate and the false negative rate decrease as the timeslot size increases for both algorithms. The false positive rate and the false negative rate of our algorithm are significantly lower than those of LDF. Similarly, as shown in Figures 7(c)-7(d), the false positive rate and the false negative rate decrease as the data structure size increases for LDF and DPU. The false positive rate and the false negative rate of DPU are clearly lower than those of LDF. DPU also generates some false positives and false negatives, since different users may be mapped to the same bits in the bit vector. In brief, our algorithm is superior to LDF in detection precision.
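The two rates can be computed directly from the set of actual persistent users and the set of identified users. The following Python sketch follows the verbal definitions above; the function names and the convention of returning 0.0 for an empty denominator are assumptions.

```python
def false_positive_rate(actual, identified):
    """Share of identified users that are not actual persistent users."""
    if not identified:
        return 0.0
    return len(identified - actual) / len(identified)


def false_negative_rate(actual, identified):
    """Share of actual persistent users that were not identified."""
    if not actual:
        return 0.0
    return len(actual - identified) / len(actual)


actual = {1, 2, 3, 4}        # hypothetical ground truth
identified = {3, 4, 5}       # hypothetical detector output
fpr = false_positive_rate(actual, identified)
fnr = false_negative_rate(actual, identified)
```

Here one of the three identified users is wrong and two of the four actual users are missed, giving an FPR of 1/3 and an FNR of 1/2.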

5.4. Processing Time
We evaluate the processing time of our algorithm for detecting persistent user behavior in comparison with LDF. Figures 8(a)-8(b) show the average online updating time per packet under different timeslot sizes and data structure sizes. As shown in Figure 8(a), the average online updating time per packet hardly changes as the timeslot size increases for LDF and DPU. As shown in Figure 8(b), the average online updating time per packet also hardly changes as the data structure size increases. Online updating per packet is about 6 times faster for DPU than for LDF. Figures 8(c)-8(d) show the offline detection time under different timeslot sizes and data structure sizes. As shown in Figure 8(c), the detection time barely changes as the timeslot size increases for both algorithms. As shown in Figure 8(d), the detection time also barely changes as the data structure size increases. Offline detection is about 3 times faster for DPU than for LDF. In short, our algorithm DPU outperforms LDF in processing time.

6. Conclusion
In this study, we propose an algorithm DPU for detecting persistent user behavior from the network-wide view. DPU is computationally efficient in constructing compact summary data structures of users occurring in a measurement period, since it only needs to select and update some bits of the summary data structures by hash functions for each incoming packet online. DPU also has low communication cost when collecting a large number of network-wide packets, since the monitors only send the generated summary data structures, which require little memory, to the central server. On the basis of the summary data structures generated over a long measurement period, DPU accurately estimates the occurrence frequency of users by the probabilistic counting approach and then reconstructs user IDs by simple computation to detect users with persistent behavior when the users occurring in a measurement period are unknown in advance. Because it processes packets quickly and consumes little memory, DPU can serve as a functional module of modern routers or firewalls. The theoretical analysis and experimental results illustrate that our algorithm significantly outperforms LDF in terms of estimation accuracy, detection precision, and processing time. The experimental results illustrate that our method improves estimation accuracy by about 30% and detection precision by about 40%. The online updating time per packet and the offline detecting time of our algorithm are about 6 and 3 times faster than those of LDF, respectively. In the future, we will deploy our algorithm in real distributed systems and study the problems of advanced persistent user behavior identification and parameter optimization [30–33]. In addition, we will generate data by simulating abnormal user behavior to validate the performance of our algorithm [34, 35].
Data Availability
The data used to support the findings of the study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61802274), Open Project Foundation of Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China (K93-9-2017-01), Scientific Research Foundation for Advanced Talents of Taizhou University, China (QD2016027), and Innovation and Entrepreneurship Training Program of Jiangsu Students, China (201812917011Y).