Abstract
Botnet detection has long been an active research topic in network security, yet many challenges remain. Most current detection approaches, such as those based on machine learning and blacklists, cannot discover evolving botnet variants. These methods are usually valid only for specific botnet protocols and therefore do not generalize. They may also have difficulty dealing with encrypted botnet traffic. In this paper, we design a protocol-independent botnet detection method to address these challenges. Our method takes advantage of the group characteristics of botnets, which are inherent to them. We use the sequence of packet length as the characteristic of a flow and then calculate the similarity between these sequences to detect botnets. The method generalizes well because it is not affected by encrypted traffic or by the protocol the botnet uses. Experiments on the challenging ISCX dataset show that the proposed method detects botnets with a high average detection rate and a low false alarm rate, significantly outperforming state-of-the-art methods. Therefore, the proposed detection method is robust and widely applicable to botnet detection.
1. Introduction
A botnet is a one-to-many network formed between a controller and the hosts it has infected. Botnet controllers (attackers) can use many methods to spread bot malware. Once a host is infected, it becomes part of the botnet and receives the attacker’s instructions through a command and control (C&C) channel. The infected computers (bots) are silently commanded by the botnet controller to launch cyberattacks. A botnet thus serves as a platform through which attackers direct bots to perform malicious activities. Through botnets, attackers can conduct distributed denial of service (DDoS) attacks, spread spam, perform network blackmail, and steal personal information, which poses great challenges to network security and personal privacy protection. Bots can evade discovery by network monitoring agencies in a variety of ways, such as constantly updating themselves, disabling antivirus applications, and preventing DNS lookups of certain domain names. These techniques increase the difficulty of botnet detection.
It is well known that botnets pose a severe threat to the Internet. With the development of new technologies, botnet detection is facing increasing challenges. Mirai is a botnet that has emerged in recent years and is the driving force behind recent large-scale DDoS attacks [1]. Mirai infected more than 100,000 IoT devices to form a huge botnet and launched what may be the largest DDoS attack in history. It is estimated that the throughput of this attack reached 1.2 Tbps.
The structure of botnet can be summarized into two categories, namely, centralized and decentralized structure. For a centralized botnet, a communication channel is established between the C&C server and all bots. There are many botnets that are based on centralized structure, such as AgoBot, SDBot, and RBot [2, 3]. The protocols adopted by these botnets are mainly based on HTTP and Internet Relay Chat (IRC) protocols. The flexible and simple structure of the IRC protocol is favored by many hackers. Botnets based on the HTTP protocol are usually concealed and difficult to detect. The decentralized botnet uses P2P-based protocols. When issuing the command, the botmaster randomly selects a bot as the C&C server to communicate with other bots. Since the P2P-based botnets effectively avoid the problem of a single point of failure, they greatly enhance the survivability of the botnet [4].
Botnet detection has always been a research hotspot in the field of network security. Researchers have proposed a large number of methods to detect botnets [5, 6]. These methods can be summarized into five categories, that is, signature-based methods, anomaly-based methods, honeypot-based methods, specific protocol structure-based methods, and community-based methods. The signature-based methods [7] cannot detect unknown botnets and their variants. Moreover, the encryption technology used in botnets negates the effects of these methods. Anomaly-based detection methods [8] are based on the assumption that the communication pattern of a botnet is different from that of benign traffic. However, bots can mimic the communication pattern of normal hosts to evade anomaly detection. Detection methods based on honeypot technology can only detect existing botnets and have poor real-time performance. Detection methods based on specific protocols and structures [9] cannot detect botnets with different protocols or structures. Community-based anomaly detection algorithms [10] cannot accurately identify botnets when a complete communication graph is not available. Nowadays, attackers are adopting new technologies to constantly update botnets in terms of creation, maintenance, and communication mechanisms. Therefore, existing detection technologies cannot cope with unknown and increasingly complex botnets.
A botnet is defined as a coordinated group of malware instances that are controlled by a botnet master via C&C channels [11]. The bots in the same botnet have the same or similar traffic characteristics. In this paper, we propose a protocol-independent botnet detection framework that identifies botnet traffic by analyzing the similarity of traffic flows. Our method discovers the bots that initiate flows with similar traffic characteristics. More specifically, if the flows initiated by a certain host are highly similar to one another, it can be concluded from the group attributes of botnets that the traffic is generated by botnet activities. The hosts involved in this traffic are bots in the monitored network. We use the sequence of packet length as the characteristic of the flow. The sequence of packet length is easy to obtain and is very effective for detecting botnets. It is a vector composed of the lengths of all the packets in a flow, with the elements arranged in the order of packet transmission. The degree of similarity between flows determines whether they are botnet traffic. In addition, although the length of the ciphertext output by an encryption algorithm may differ from that of the plaintext, the same encryption algorithm outputs ciphertexts of the same length for plaintexts of the same length. Therefore, for the packets in a network flow, the encryption algorithm will not change the relationship between the lengths of these packets. Hence, using packet lengths as the characteristic of the flow makes the detection method very robust.
This paper makes the following major contributions:
(i) A protocol-independent botnet detection framework is proposed based on the group characteristics of botnets, which are the inherent characteristics of botnets. Our botnet detection framework is not affected by the C&C protocol. It can be applied to detect bots in both centralized and P2P-based botnets. Compared with the prior work, the detector proposed in this paper is always reliable and efficient, no matter what C&C protocol the botnet adopts.
(ii) The sequence of packet length is proposed as the characteristic of the flow, which is easy to obtain and is effective for detecting botnets. The packet length applied as the characteristic of the flow makes the detection method very robust.
(iii) A bot detection prototype system is implemented. The detection effect of the system is evaluated on dataset ISCX [12]. The results show that the system has a high true positive rate and a low false positive rate.
2. Related Work
Many researchers have been making continuous efforts to detect botnet. BotMiner [11] is a framework to detect groups of compromised machines that are part of a botnet. The framework is independent of the C&C protocol. It identifies bots by clustering similar malicious traffic and communication patterns. The authors implement the BotMiner prototype system and evaluate the result using traces of many real-world networks. The results show that BotMiner can detect real-world botnets (such as P2P-based botnets, HTTP-based botnets, and IRC-based botnets) with high accuracy and low false positive rate [11]. However, BotMiner needs to analyze the content of the traffic load, which may fail when the traffic is encrypted.
An adaptive framework for detecting botnets is presented in [13]. It is composed of three components, namely, Behavior Extractor, Behavior Identifier, and Feedback Provider. Behavior Extractors generate Behavior Instances (BIs) of hosts from network traffic periodically. BIs are representations of host behavior in a time period. To classify malicious BIs, Behavior Identifier is implemented, which employs a real-time statistical model named Behavioral Model (BM). Feedback Provider can alert the network administrator when it receives a message that malicious BIs are found by Behavior Identifier. At the same time, the Feedback Provider can update BM based on whether the administrator confirms that the host found by Behavior Identifier is malicious. When a new bot appears, the framework requires the administrator to confirm whether the bot is genuine. Therefore, the professional level of the administrator may be the bottleneck that affects its detection of new bots.
DBod is a DGA-based botnet detection framework based on analysis of the query behavior of DNS traffic [14]. The research assumes that bots in the same DGA-based botnet query the same sets of domains in the domain list. Since only a very limited number of the domains are actually associated with an active C&C communication [14], most DNS requests sent by bots will fail and generate NXDomains. The main observation behind DBod is that DGA-based bots are different from benign hosts in the distribution of DNS query time and the count of NXDomains. DBod consists of a filtering module, a clustering module, and a group identification module [14]. DBod does not require prior knowledge for training and can detect new bots. However, it will fail when there is no DNS traffic.
Wang et al. [15] propose a two-stage approach for botnet detection. In the first stage, they perform two different anomaly detection, namely, flow-based anomaly detection and graph-based anomaly detection. In the second stage, they identify the pivotal nodes of the discovered anomalies, evaluate pivotal interaction measures, and construct correlation graphs. Community detection is used to identify botnets. Their approach is based on two observations: (1) Botmasters and victims communicate with many other nodes, which are easy to be detected. (2) The infected hosts often communicate with each other, resulting in a strong correlation between them.
PsyBoG [16] applies signal processing technology to botnet detection. They analyze the time phase and similarity of DNS traffic to identify botnet. PsyBoG uses power spectral density (PSD) analysis, which is a signal processing technology, to detect the major frequency of periodic botnet behavior. Then, it clusters the hosts based on the similarity of traffic patterns [16]. PsyBoG detects previously unknown botnets based on the suspicious DNS manner.
Zhuang et al. [17] propose an effective system, Enhanced PeerHunter, to detect P2P-based botnet. Enhanced PeerHunter is based on network flow level community behavior analysis. It is capable of detecting P2P botnets when (a) botnets are in their waiting stage; (b) the C&C channel has been encrypted; (c) the botnet traffic is overlapped with legitimate P2P traffic on the same host; and (d) no statistical traffic pattern is known in advance (unsupervised). To detect P2P botnets, Enhanced PeerHunter first detects P2P network traffic. Then, it builds a network flow level mutual contacts graph. Finally, it uses community detection to discover P2P-based botnets.
With in-depth research on machine learning, it is increasingly applied to the detection of botnets.
Carl et al. [18] compare the performance of network classifiers based on different machine learning techniques (such as J48, naive Bayes, and Bayesian) to find the classifier with the highest recognition rate. The result is that a naive Bayes classifier performs best. In addition, the classification sensitivity to the training set size is determined experimentally by them in this paper. Accurate labels are critical, however. Once the labels of the training data are inaccurate, the performance of the classifier will suffer greatly.
Mohammad et al. [19] propose an approach that exploits the reinforcement learning technique to detect infected hosts in a peer-to-peer (P2P) botnet. Specifically, they develop a traffic reduction method to deal with a high volume of network traffic. However, botnets dynamically change their operations through updating after several life cycle stages. Hence, the proposed approach will fail if it is not improved dynamically throughout time.
In addition, Pektas et al. [20] design a framework that combines convolutional and recurrent neural network to identify botnets. The proposed system extracts network flow features, such as duration, size of packets, and other related flow-based features. However, this method usually has weak generalization ability and cannot effectively identify unknown types of botnet traffic.
Mousavi et al. [21] focus on scalability in high-rate network bandwidth. They propose a fully scalable big data framework based on Hadoop to deploy many different kinds of botnet detection methods, including statistics-based methods, machine learning-based methods, and graph-based methods. The experimental results show that the framework can perform well. In addition, the running time of the proposed framework is logarithmic proportional to the volume of the input. Despite its advantages, the framework proposed in this paper has its drawback. It is not affordable for smaller enterprises to provide enough computational resources which are required to install the proposed framework.
Soodeh et al. [22] propose a method based on convolutional neural networks and negative selection algorithms to detect botnets. They focus on the activity of incoming packets and detect botnet traffic from them. Alharbi and Alsubhi [23] exploit a graph-based machine learning model to detect botnet traffic. They consider the significance of graph features and develop a generalized model for detecting botnets based on features that are selected using five filter-based feature evaluation measures derived from consistency, correlation, and information theory. Biswas and Roy [24] explore a method to detect botnet traffic using deep learning approaches such as Artificial Neural Networks (ANN), Gated Recurrent Units (GRU), and Long Short-Term Memory (LSTM) models. They show how the proposed method performs against both normal attack data and botnet-specific attack data. Javier et al. [25] focus on increasing the performance of botnet traffic classification. They use Information Gain and Gini Importance to select features and evaluate the selected features with three models, that is, Decision Tree, Random Forest, and k-Nearest Neighbor. Wan et al. [26] design a multilayer framework to detect botnet traffic. The detection model consists of a filtering module and a classification module that exploits machine learning algorithms. Their detection model is based on behavior-based analysis, and the research examines the features useful for creating a behavior-based analysis method for detecting botnets in network traffic. The computational complexity of machine learning-based methods is relatively high, which makes them difficult to deploy in realistic settings. Moreover, the generalization ability of models based on machine learning is limited, so they cannot cope with the continual emergence of new botnets.
In conclusion, these existing studies have some limitations. Some methods can only be effective for botnets that use specific protocols. They cannot detect newly emerging botnets. Moreover, some methods are based on historical data. Once the botnet variations appear, these methods will be powerless. The detection method proposed in this paper is based on the group characteristics of the botnet, which are inherent characteristics of the botnet. Our method is independent of the botnet protocol and is not affected by encrypted data.
3. Detection Framework and Implementation
3.1. Research Objectives
Our research goal is to find the bots in the monitored network by analyzing the traffic crossing the boundary of the monitored network. We exploit two facts: all botnets have group characteristics, and the relationship between the lengths of packets in a flow is not affected by the encryption algorithm. Specifically, within a certain period of time, the flows generated by bots in the same botnet are similar. We analyze the similarity of network traffic to detect the bots of a botnet. It is commonly known that the communication of most botnets is based on the Transmission Control Protocol (TCP) [2], such as the Waledac botnet [3], Storm botnet [4], Conficker botnet [10], and Zeus botnet [9]. Therefore, our method mainly focuses on TCP flows. The process by which normal hosts evolve into bots can be divided into three stages. In the first stage, hosts are infected by botnet malware. In the second stage, hosts receive commands from the botmaster and join the botnet. Finally, hosts initiate a network attack at an appropriate time. A host shows malicious abnormal behavior in the second and third stages. Therefore, the detection method proposed in this paper works during the second and third stages to detect bots. It cannot recognize hosts that have just been infected by the malware. In this paper, we do not consider how the host is infected or how the botnet malware is spread. Our research goal is to detect the bots that generate malicious TCP flows in the monitored network.
Our research objectives are as follows:
(i) The bot detection framework is independent of the protocol and structure adopted by botnet channels. Its detection performance is not affected by the botnet protocol and structure.
(ii) The bot detection framework does not need to analyze the content of the traffic payload. Hence, it is not affected by encrypted traffic and will not violate the privacy of network users.
(iii) The bot detection framework can effectively detect botnet traffic and identify bots with a high detection rate and a low false positive rate.
(iv) The bot detection framework must have low complexity. It cannot consume too much computing resources and time.
3.2. Bot Detection Framework
As shown in Figure 1, the bot detection framework includes five modules, that is, network traffic acquirer, preprocessing module, attack flow recognizer, infection flow recognizer, and result integration module.

Formally, we define $f_h = \langle sip, sport, dip, dport, L \rangle$ to denote the TCP flow with the sequence of packet length of host $h$, where $sip$ is the source IP address, $sport$ is the source port number, $dip$ is the destination IP address, and $dport$ is the destination port number. $L$ is the sequence of packet length, which is a vector composed of the length of all the packets in a flow, as defined in (1). Each element in the sequence is arranged according to the order of packet transmission. The degree of similarity between flows determines whether the monitored traffic is botnet traffic. Although the length of the ciphertext output by the encryption algorithm may be different from that of the plaintext, the length of the ciphertext output by the same encryption algorithm is the same for the plaintext of the same length. Therefore, for the packets in a network flow, the encryption algorithm will not change the relationship between the lengths of these packets, and our method based on the packet length for detecting botnets is robust.

$$L = (l_1, l_2, \ldots, l_n) \tag{1}$$

Suppose there are two flows $f_i$ and $f_j$; $R(f_i, f_j)$ refers to the communication relationship between $f_i$ and $f_j$, as defined in (2). The communication relationship indicates whether the two flows have the same mapping of source IP address or destination IP address. If there is a communication relationship, $R(f_i, f_j) = 1$. Otherwise, $R(f_i, f_j) = 0$.

$$R(f_i, f_j) = \begin{cases} 1, & M(sip_i) = M(sip_j) \ \text{or} \ M(dip_i) = M(dip_j), \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$

In (2), $M(\cdot)$ is the mapping function of IP address. The simplest mapping is self-mapping; that is, $M(ip) = ip$. There are also some other mappings, such as $M(ip) = \mathrm{DNS}(ip)$, that is, the mapping relationship between IP address and DNS domain names. If $M(ip_i) = M(ip_j)$, it means that $ip_i$ and $ip_j$ are the same in the “mapping sense.” Specifically, if $M(ip) = ip$, then according to $M(ip_i) = M(ip_j)$, it is obvious that $ip_i = ip_j$. If $M(ip) = \mathrm{DNS}(ip)$, we can know that $ip_i$ and $ip_j$ have the same domain name and belong to the same host. In this paper, we use the self-mapping, namely, $M(ip) = ip$. $F$ denotes the set of flows that have communication relationships between each other, as defined in (3).

$$F = \{ f_1, f_2, \ldots, f_m \mid R(f_i, f_j) = 1,\ 1 \le i, j \le m \} \tag{3}$$
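The flow definition and the communication relationship under self-mapping can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the names `Flow`, `self_mapping`, `related`, and `related_flows` are hypothetical, and the relation simply follows the prose description (two flows are related when their mapped source IP addresses or their mapped destination IP addresses coincide).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Flow:
    """A TCP flow: the 4-tuple plus its sequence of packet length."""
    sip: str
    sport: int
    dip: str
    dport: int
    lengths: Tuple[int, ...]  # packet lengths in transmission order


def self_mapping(ip: str) -> str:
    """Self-mapping: an IP address maps to itself."""
    return ip


def related(a: Flow, b: Flow,
            mapping: Callable[[str], str] = self_mapping) -> int:
    """Return 1 if the two flows share a mapped source or destination IP."""
    same_src = mapping(a.sip) == mapping(b.sip)
    same_dst = mapping(a.dip) == mapping(b.dip)
    return 1 if (same_src or same_dst) else 0


def related_flows(target: Flow, flows: List[Flow]) -> List[Flow]:
    """Collect the other flows that have a communication relationship with target."""
    return [f for f in flows if f is not target and related(target, f) == 1]
```

A different mapping (for example, a DNS lookup table) could be passed as `mapping` without changing the rest of the sketch.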
The network traffic acquirer can be deployed not only inside the monitored network to analyze the traffic in the internal network to detect botnet but also at the boundary of the monitored network. When the network traffic acquirer is deployed at the boundary of the monitored network, it is responsible for capturing the traffic entering and leaving the boundary of the monitoring network. In this case, the traffic captured by the network traffic acquirer is between the internal network and the external network, which does not include pure internal network traffic. The packet lengths are obtained by parsing the IP header of the packet. Then, they are integrated into the sequence of packet length in ascending order of the TCP sequence number for flow similarity analysis.
The preprocessing module is composed of three modules, namely, IP Partition, Port Partition, and Flow First Time Filter. Since the bot detection framework is based on the similarity of flows, we are only interested in flows that have communication relationships with each other. Therefore, we must first know which hosts are involved in these flows. The IP Partition module divides the traffic according to whether the collected traffic has a communication relationship (as (3)). It mainly solves the problem of “which hosts communicate with each other.” Moreover, the services used for communication between these hosts are also very important. We can determine the services through the TCP port number. The Port Partition module aggregates flows on the same source port number or the same destination port number. It mainly solves the problem of “what communications do the hosts carry out.” According to the distribution of port numbers, we divide the flows into two categories, namely, attack flows and infection flows. The attack flows refer to the traffic generated by bots when they launch a network attack. The infection flows refer to the traffic generated by bots when they are in the propagation phase.
We analyze the attack flows and the infection flows from two perspectives, that is, the bot and the vulnerable victim. When a botnet launches a network attack, the vulnerable victims are the attack targets, which may be the target of multiple attacks at the same time. When receiving the attack instruction, the bot will use the maximum resources to launch an attack on the target, such as the traffic of DDoS attacks. Therefore, when observing the attack flows from the perspective of the bot, the distribution of the port numbers presents a many-to-one situation, that is, multiple ports of the bot actively establish TCP connections with the same ports of the vulnerable victims. When observing the attack flows from the perspective of the vulnerable victim, the distribution of port numbers presents a one-to-many situation. The ports of vulnerable victims are passively connected with multiple different hosts.
The infection flows are traffic generated by bots during the process of conducting malware propagation or vulnerability scanning. Meanwhile, some traffic is the commands conveyed by the botmaster to bots. Hence, when observing the infection flows from the perspective of the bot, the unique TCP port of bots actively establishes connections with multiple hosts. From the perspective of the vulnerable victim, the infection flows present that the unique TCP port is passively communicating with the same port of multiple hosts. For each TCP flow, the first packet time of the flow in both directions (upstream and downstream) determines the initiative and passivity of “establishing a TCP flow.”
Based on all the above observations, the TCP flows within a certain period of time are grouped according to the IP address and port number to form the TCP flow blocks. Then, the sequences of packet length of the flows in the blocks are obtained. Afterward, the attack flow recognizer calculates the similarity of the packet length sequences of these flows from the perspective of the bot. Meanwhile, the infection flow recognizer calculates the similarity of the packet length sequences of these flows from the perspective of the vulnerable victim. Finally, the result integration module is responsible for summarizing the recognition results of the recognizer and obtains a collection of malicious TCP flows.
The following sections will detail the implementation of each part of the detection framework.
3.3. Network Traffic Acquirer
We have developed an effective network traffic capture module, namely, the network traffic acquirer. In this paper, we limit our interest to TCP flows. Each flow contains the following information: source IP, destination IP, source port, destination port, timestamp, and length of packets in two directions. Our research is based on the fact that the TCP flows generated by the bots in the same botnet within a certain time frame are similar. Therefore, we set a flow window according to the start time of the flows. The network traffic acquirer captures a certain number of TCP flows based on the flow window. Let $W$ denote the size of a flow window. When the number of captured flows exceeds $W$, these flows are submitted as a collection to subsequent modules for analysis to detect malicious flows. $W$ is the minimum number of flows needed to detect a botnet through traffic analysis. In addition, flow truncation is performed to reduce the computational cost. We empirically use the first 16 packets of the TCP flow rather than the whole TCP flow. If the number of packets in a TCP flow is greater than 16, the TCP flow is truncated. The truncation algorithm is shown in Algorithm 1.
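The flow window can be illustrated with a short Python sketch. It assumes the flows arrive already ordered by start time, and it submits a batch as soon as the window size is reached; the function name `flow_windows` and the choice to hold back a trailing batch smaller than the window are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def flow_windows(flows: Iterable[T], window_size: int) -> Iterator[List[T]]:
    """Group flows (assumed sorted by start time) into batches of window_size.

    A trailing batch with fewer than window_size flows is not emitted,
    since the window size is the minimum number of flows to analyze.
    """
    batch: List[T] = []
    for flow in flows:
        batch.append(flow)
        if len(batch) >= window_size:
            yield batch
            batch = []
```

Each yielded batch would then be handed to the preprocessing module as one collection.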
The input parameter of Algorithm 1, $L$, is the sequence of packet length, which is composed of the lengths of the TCP payloads. Two thresholds are set for TCP flow truncation, namely, $T_1$ and $T_2$. They correspond to two situations. The first situation is that the TCP payload lengths of all packets in the entire TCP flow are 0. In this case, the strategy we adopt is to truncate the TCP flow according to $T_1$. Then, a sequence of length $T_1$ is obtained. All elements in this sequence are 0. The second situation is that the number of packets with payload in the TCP flow exceeds $T_2$. In this case, the flow is truncated at the position $\mathrm{pos}(T_2)$. $\mathrm{pos}(k)$ is a function to obtain the position (index) of the $k$-th packet with payload in the flow. Hence, $\mathrm{pos}(T_2)$ returns the index of the $T_2$-th packet with payload in the flow. $T_2$ has a higher priority than $T_1$. Therefore, if both thresholds apply, truncation is performed according to $T_2$. If the number of packets in a complete TCP flow does not exceed $T_1$ and the number of packets with payload does not exceed $T_2$, all packets are retained. In addition, the TCP flags are used to determine the beginning and end of the flow. The SYN flag indicates that a new TCP flow has started. If there is no SYN packet in a TCP flow, the flow is considered incomplete. In this paper, incomplete flows are directly discarded. The FIN flag and RST flag indicate the end of a TCP flow.
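One possible reading of the truncation rule is sketched below in Python. It is not a transcription of Algorithm 1: the parameter names `t1` (cap on the total number of packets) and `t2` (cap on the number of packets with payload) are hypothetical, the cut is assumed to include the `t2`-th packet with payload, and no default values are given because the text does not state which threshold corresponds to the value 16.

```python
from typing import List, Sequence


def truncate_flow(lengths: Sequence[int], t1: int, t2: int) -> List[int]:
    """Truncate a sequence of TCP payload lengths.

    The payload threshold t2 takes priority: if more than t2 packets carry
    payload, cut right after the t2-th such packet. Otherwise cap the flow
    at t1 packets (this covers the all-zero case). Short flows are kept whole.
    """
    payload_positions = [i for i, n in enumerate(lengths) if n > 0]
    if len(payload_positions) > t2:
        cut = payload_positions[t2 - 1]  # index of the t2-th packet with payload
        return list(lengths[: cut + 1])
    if len(lengths) > t1:
        return list(lengths[:t1])
    return list(lengths)
```

For example, with `t1 = 4` and `t2 = 2`, a flow of ten empty packets is cut to four zeros, while a flow whose third payload packet appears late is cut at its second payload packet.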
3.4. Flow Preprocessing
The flow preprocessing module is responsible for preliminarily segmenting the collected flows in a window according to the IP address and TCP port numbers. In this way, it can determine which flows have communication relations (as (2)) and which flows have the same service. The flow preprocessing module consists of three parts, namely, IP Partition, Port Partition, and Flow First Time Filter.
3.4.1. IP Partition
The IP addresses of the flows captured by the network traffic acquirer are regarded as nodes. If there is a TCP flow between two IP addresses, an edge connects the nodes corresponding to the two IP addresses. In this way, an undirected graph is constructed to represent the connection relationship between hosts, as shown in the left subgraph of Figure 2. The undirected graph can be represented algebraically by the adjacency matrix. Firstly, the source IP addresses and destination IP addresses of all the flows are extracted. Then, the duplicate IP addresses are removed. Finally, we construct the adjacency matrix corresponding to the undirected graph according to whether there are TCP flows between these IP addresses. The adjacency matrix $A$ is a square matrix. The size of $A$ is the number of unique IP addresses in the TCP flow collection. If there are TCP flows between $ip_i$ and $ip_j$, the elements at the positions $(i, j)$ and $(j, i)$ in the adjacency matrix are set to 1. Otherwise, the elements are set to 0. Therefore, the adjacency matrix is symmetric about the main diagonal.
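The three construction steps (extract the IP addresses, remove duplicates, fill the symmetric matrix) can be sketched in Python as follows. The function name `build_adjacency` and the use of sorted order to number the IP addresses are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def build_adjacency(pairs: Iterable[Tuple[str, str]]) -> Tuple[List[str], List[List[int]]]:
    """Build a symmetric adjacency matrix from (source IP, destination IP) pairs."""
    pairs = list(pairs)
    # Extract all IP addresses and remove duplicates.
    ips = sorted({ip for pair in pairs for ip in pair})
    index = {ip: i for i, ip in enumerate(ips)}
    n = len(ips)
    matrix = [[0] * n for _ in range(n)]
    for src, dst in pairs:
        i, j = index[src], index[dst]
        if i != j:
            # A flow between two hosts sets both symmetric positions to 1.
            matrix[i][j] = 1
            matrix[j][i] = 1
    return ips, matrix
```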

In a flow collection, local connections form subgraph structures, each of which represents a block of nodes. For example, the left of Figure 2 contains two subgraphs. Each subgraph in $G$ needs to be analyzed separately. IP Partition divides the hosts into blocks according to the connection relationship. The schematic diagram of IP Partition is shown in Figure 2. In Figure 2, there are two blocks of nodes, namely, $B_1$ and $B_2$. The hosts in $B_1$ are connected with each other by edges. The same is true of the hosts in $B_2$. However, $B_1$ and $B_2$ are independent of each other. There is no connection relationship between the hosts in $B_1$ and the hosts in $B_2$. To extract the blocks from $G$, two steps must be performed. Firstly, the boundary nodes of each block need to be located. The boundary nodes only have adjacent edges to nodes in the block where they are located. Then, all nodes in the block can be obtained by walking through the graph from different boundary nodes. If there is an edge connecting two vertices, the walk can move from one vertex to the other. The algorithm for finding the nodes of the same block is shown in Algorithm 2.
In Algorithm 2, $A$ is the adjacency matrix, and $v$ is a vertex of $G$. $B_v$ denotes the block to which $v$ belongs. $V(B_v)$ represents the set of vertices in $B_v$.
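The walk that collects all vertices of one block can be sketched as a breadth-first traversal over the adjacency matrix. This is an illustrative stand-in rather than a transcription of Algorithm 2; the function name `find_block` and the use of breadth-first order are assumptions.

```python
from collections import deque
from typing import List, Set


def find_block(adjacency: List[List[int]], start: int) -> Set[int]:
    """Return the set of vertices reachable from start in the undirected graph."""
    visited = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for u, connected in enumerate(adjacency[v]):
            # Walk along every edge to a vertex not yet in the block.
            if connected and u not in visited:
                visited.add(u)
                queue.append(u)
    return visited
```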
To find the “boundary” nodes, the undirected graph needs to be transformed into a directed graph through orientation. Due to the bidirectional nature of TCP flows, we use arbitrary orientation in this paper. Firstly, the nodes in the undirected graph are assigned consecutive numbers. Assuming that there are $n$ nodes in graph $G$, the numbers of these nodes are 1 to $n$. Then, the direction of all edges in graph $G$ is determined from the node with the smaller number to the node with the larger number. In this way, the undirected graph $G$ is converted into directed graph $G'$, that is, $G \rightarrow G'$.
Let $A$ and $A'$ denote the adjacency matrices of undirected graph $G$ and directed graph $G'$, respectively. According to the orientation process, it can be concluded that $A' = \mathrm{triu}(A)$; that is, $A'$ is the upper triangular matrix of $A$. The in-degree and out-degree of each vertex can be calculated from $A'$. There is exactly one directed edge between two connected vertices in the directed graph $G'$. Therefore, if the in-degree or the out-degree of a vertex is zero, the vertex is located at the “boundary” of the subgraph. Formally, $V_{in}$ refers to the set of vertices whose in-degrees are 0, and $V_{out}$ refers to the set of vertices whose out-degrees are 0, as defined in (4).

$$V_{in} = \Big\{ v_j \;\Big|\; \sum_{i=1}^{n} a'_{ij} = 0 \Big\}, \qquad V_{out} = \Big\{ v_i \;\Big|\; \sum_{j=1}^{n} a'_{ij} = 0 \Big\} \tag{4}$$
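The orientation step and the two zero-degree sets can be sketched in Python as follows. The function names `orient` and `zero_degree_sets` are hypothetical; vertices are numbered from 0 here, and an edge from vertex i to vertex j is stored at row i, column j.

```python
from typing import List, Tuple


def orient(adjacency: List[List[int]]) -> List[List[int]]:
    """Keep only the upper triangle: edges point from smaller to larger number."""
    n = len(adjacency)
    return [[adjacency[i][j] if i < j else 0 for j in range(n)] for i in range(n)]


def zero_degree_sets(directed: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Return (vertices with in-degree 0, vertices with out-degree 0)."""
    n = len(directed)
    # In-degree is a column sum, out-degree is a row sum.
    in_zero = [j for j in range(n) if sum(directed[i][j] for i in range(n)) == 0]
    out_zero = [i for i in range(n) if sum(directed[i]) == 0]
    return in_zero, out_zero
```

Note that a block may contain more than one zero in-degree (or zero out-degree) vertex, so a later step has to skip start vertices that already belong to a discovered block.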
Either the set of vertices with zero in-degree or the set of vertices with zero out-degree can be used to determine the boundary vertices. In this paper, one of these two sets is used to find boundary vertices. As shown in the subfigure on the right side of Figure 2, this set contains two vertices. When starting from these vertices and walking through the undirected graph, the subgraphs (blocks) are obtained. The algorithm for dividing all nodes into different blocks according to the connection relationship is shown in Algorithm 3, which outputs the set of all blocks.
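The division of all nodes into blocks can be sketched end to end in Python. This is an illustration under stated assumptions rather than a transcription of Algorithm 3: the function name `partition_blocks` is hypothetical, and the walk starts from the zero in-degree vertices, which is one of the two choices the text allows.

```python
from collections import deque
from typing import List


def partition_blocks(adjacency: List[List[int]]) -> List[List[int]]:
    """Split the vertices of an undirected graph into connected blocks."""
    n = len(adjacency)
    # With edges oriented from smaller to larger number, a vertex has
    # in-degree 0 when no smaller-numbered vertex is adjacent to it.
    starts = [j for j in range(n) if not any(adjacency[i][j] for i in range(j))]
    assigned = set()
    blocks = []
    for s in starts:
        if s in assigned:
            continue  # this boundary vertex belongs to a block already found
        block = {s}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for u in range(n):
                if adjacency[v][u] and u not in block:
                    block.add(u)
                    queue.append(u)
        assigned |= block
        blocks.append(sorted(block))
    return blocks
```

Because the smallest-numbered vertex of every block has in-degree zero under this orientation, every block is reached from at least one start vertex.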
3.4.2. Port Partition
In the previous section, the nodes were divided into blocks according to the communication relationships between them. In this section, the Port Partition module aggregates the TCP flows between hosts in the same block.
Given an IP address in a block, the set of TCP flows that communicate with it can be obtained according to (3). As introduced in Section 3.2, the Port Partition module first divides this set into attack flows and infection flows and then analyzes them from two perspectives: that of the bot and that of the vulnerable victim. Observed from the perspective of the bot, the attack flows have two characteristics: (i) the destination port numbers of all attack flows are the same, and (ii) the initiator of the TCP flows is the bot. Observed from the perspective of the vulnerable victim, the attack flows have different characteristics: (i) the source port numbers are the same, and (ii) the initiator of the TCP flows is the bot. In Figures 3 and 4, the direction of each arrow is from the initiator of the TCP flow to the receiver. Figure 3 shows the attack flows from the perspective of the bot; the IP address shared by these TCP flows is that of the bot. Figure 4 shows the attack flows from the perspective of the vulnerable victim; the hosts at the noncentral positions of these TCP flows are bots. There is one bot in Figure 3 and three bots in Figure 4.


The features of infection flows are different. Observed from the perspective of the bot, they have two characteristics: (i) the infection flows have the same source port number, and (ii) the initiator of the TCP flows is the bot. Observed from the perspective of the vulnerable victim, they have the following characteristics: (i) the infection flows have the same destination port number, and (ii) the initiator of the TCP flows is the bot. From either point of view, the initiators of the TCP flows are always the bots, as shown in Figures 5 and 6.


Based on the above analysis, we obtain the following port division scheme. First, the directions of the TCP flows in the set are adjusted so that the given IP address is taken as the “source direction.” Then, the flows are clustered according to four strategies: (i) flows with the same destination port number whose initiator is the given IP address, (ii) flows with the same source port number whose initiator is not the given IP address, (iii) flows with the same source port number whose initiator is the given IP address, and (iv) flows with the same destination port number whose initiator is not the given IP address. The attack flows are aggregated based on (i) and (ii), and the infection flows are aggregated based on (iii) and (iv). The initiator of each flow is determined by the Flow First Time Filter.
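The four clustering strategies can be sketched as a grouping step over the flows of one host. The `Flow` record, its field names, and the example flows are hypothetical, and the flows are assumed to be already adjusted so that the analyzed host is the source. Clusters with fewer than two flows are dropped, since a similarity comparison needs at least two flows.

```python
from collections import defaultdict
from typing import NamedTuple


class Flow(NamedTuple):
    peer_ip: str          # the other endpoint of the flow
    src_port: int         # port on the analyzed host's side
    dst_port: int         # port on the peer's side
    host_initiated: bool  # True if the analyzed host opened the flow


def port_partition(flows):
    """Cluster flows by strategies (i)-(iv); keys are (strategy, port)."""
    clusters = defaultdict(list)
    for f in flows:
        if f.host_initiated:
            clusters[("i", f.dst_port)].append(f)    # attack flows, bot view
            clusters[("iii", f.src_port)].append(f)  # infection flows, bot view
        else:
            clusters[("ii", f.src_port)].append(f)   # attack flows, victim view
            clusters[("iv", f.dst_port)].append(f)   # infection flows, victim view
    # Keep only clusters that can be compared pairwise.
    return {key: group for key, group in clusters.items() if len(group) >= 2}


# Hypothetical flows of one host (addresses from the documentation range).
FLOWS = [
    Flow("192.0.2.2", 40001, 445, True),
    Flow("192.0.2.3", 40002, 445, True),
    Flow("192.0.2.4", 40003, 80, True),
]

for key, group in port_partition(FLOWS).items():
    print(key, len(group))
```

In this example only the two flows toward destination port 445 form a cluster under strategy (i); the remaining groups contain a single flow each and are discarded.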
3.4.3. Flow First Time Filter
The Flow First Time Filter module determines the initiator of a TCP flow according to the timestamps of its first packets. Given an IP address, the traffic of the TCP flow with this address as the source address is downstream, and the traffic with this address as the destination address is upstream. If the timestamp of the first downstream packet is less than that of the first upstream packet, the initiator of the TCP flow is the given IP address; otherwise, the initiator is not the given IP address.
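The timestamp comparison can be sketched as below. The packet representation as `(timestamp, src_ip, dst_ip)` tuples and the function name `initiated_by` are assumptions of this sketch, as is the handling of flows that carry traffic in only one direction.

```python
def initiated_by(host_ip, packets):
    """Return True if `host_ip` sent the first packet of the flow.

    `packets` is a list of (timestamp, src_ip, dst_ip) tuples for one TCP flow.
    """
    # First packet sent by the host (downstream) and to the host (upstream).
    first_down = min((t for t, src, _ in packets if src == host_ip), default=None)
    first_up = min((t for t, _, dst in packets if dst == host_ip), default=None)
    if first_down is None:
        return False  # the host never sent a packet in this flow
    if first_up is None:
        return True   # only the host sent packets
    return first_down < first_up


# Hypothetical three-packet exchange opened by 192.0.2.10.
PACKETS = [
    (1.000, "192.0.2.10", "198.51.100.7"),
    (1.020, "198.51.100.7", "192.0.2.10"),
    (1.021, "192.0.2.10", "198.51.100.7"),
]

print(initiated_by("192.0.2.10", PACKETS))
print(initiated_by("198.51.100.7", PACKETS))
```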
3.5. Malicious Flow Recognition
Because of the group nature of a botnet, the flows of its bots often resemble one another. Once the botnet is active, the traffic generated by different bots is highly similar. In addition, some botnets use encryption algorithms to avoid detection; however, the relationship between the lengths of the packets in a flow is not affected by the encryption algorithm. In this paper, we therefore focus on calculating the similarity of TCP flows, using the sequence of packet lengths to evaluate it. The similarity of the packet length sequences of two TCP flows is calculated with the Levenshtein algorithm [27]. If the two sequences are identical, the similarity is 1; if they are completely different, the similarity is 0. The Levenshtein algorithm is mainly used to calculate the distance between two strings, which is the minimum number of editing operations required to convert one string into the other. The editing operations allowed are (i) replacing one character with another, (ii) inserting a character, and (iii) deleting a character.
Given two strings and , the Levenshtein algorithm can be formally defined as (5) to calculate the similarity between the string and . In (5), represents the -th position of string , and represents the -th position of string . When or , the distance of string and is zero. Let denote the similarity of the packet length sequences. Then, equals , where is the length of string .
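For illustration, the standard dynamic-programming form of the Levenshtein distance can be written as below. The normalization used in `similarity`, which divides the distance by the length of the longer sequence, is an assumption of this sketch; it is chosen because it gives 1 for identical sequences and 0 for completely different sequences of equal length, as stated above.

```python
def levenshtein(a, b):
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # substitute x by y (or keep it)
            ))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Similarity in [0, 1]; normalizing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Hypothetical packet length sequences of two flows that differ in one packet.
flow_a = [60, 1500, 1500, 52]
flow_b = [60, 1500, 1400, 52]
print(similarity(flow_a, flow_b))
```

Because the functions only compare elements for equality, they work on lists of packet lengths as well as on strings.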
When calculating the similarity of packet length sequences, the sequences are regarded as strings, so the Levenshtein algorithm is applicable. In practice, a shortcut is introduced to improve the performance of the algorithm. First, the two sequences are compared by length only, as if their elements were identical. The length-based similarity of two sequences is defined in (6). If the length-based similarity is less than the similarity threshold, the two sequences can be directly recognized as different, and the Levenshtein algorithm is no longer required. Computing the length-based similarity is much cheaper than running the Levenshtein algorithm, so the computational cost of the similarity calculation is reduced.
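The shortcut can be sketched as a two-stage check. The length-based similarity is assumed here to be the ratio of the shorter length to the longer length. Under the additional assumption that the full similarity is one minus the distance divided by the longer length, this ratio is an upper bound on the full similarity, because the Levenshtein distance is at least the difference in length. All names are hypothetical.

```python
def levenshtein(a, b):
    """Compact dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def length_similarity(a, b):
    """Ratio of the shorter to the longer length (assumed form of the pre-check)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else min(len(a), len(b)) / longest


def is_similar(a, b, threshold):
    """Return (decision, used_levenshtein) for two packet length sequences."""
    if length_similarity(a, b) < threshold:
        return False, False  # rejected by the cheap length check
    longest = max(len(a), len(b))
    sim = 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
    # Treating a similarity equal to the threshold as similar is an assumption.
    return sim >= threshold, True


short_flow = [60, 1500, 52]
long_flow = [60, 1500, 52, 52, 52, 52, 52, 52, 52, 52]
print(is_similar(short_flow, long_flow, 0.7))
print(is_similar([60, 1500, 1500, 52], [60, 1500, 1400, 52], 0.7))
```

The first pair is rejected without running the Levenshtein algorithm, while the second pair passes the length check and is then compared in full.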
The overall process of bot detection is shown in Figure 7. In Figure 7, the dots represent hosts: the gray dots are hosts in the internal network, and the black dots are hosts in the external network. The black solid lines indicate that there are TCP flows between the hosts. The dotted lines represent TCP flows. Figure 7 shows an example with 10 hosts, which are analyzed in turn. The process of analyzing one host is shown in the dashed box. First, the TCP flows that have a communication relationship with the host are collected. Then, these flows are divided by the Port Partition algorithm to generate TCP flow blocks. Finally, the Levenshtein algorithm is used to calculate the similarity of the flows in each flow block. The flows with high similarity (exceeding the threshold) are regarded as malicious flows, and the bots are identified according to the strategy adopted in the Port Partition algorithm. In Figure 7, the highly similar flows are finally detected as malicious flows, and the corresponding host is therefore identified as a malicious bot.

4. Experimental Analysis
This section evaluates the detection performance of the proposed bot detection method from three aspects: (i) the detection accuracy (true positive rate and false positive rate) of the detection system on the ISCX dataset, (ii) the influence of different parameter settings on the detection results, and (iii) a comparison with the performance of the method proposed in [13].
4.1. Dataset
We employ the ISCX botnet dataset for our experiments. ISCX [12] is a publicly available dataset that combines nonoverlapping subsets of three other datasets. It contains traffic from 16 different IRC, P2P-based, and HTTP-based botnets, which makes it more general, realistic, and representative. The ISCX dataset consists of two subsets: training and testing. The training dataset is 5.3 GB in size, of which 43.92% is malicious traffic (including 7 types of botnets). The testing dataset is 8.5 GB in size, of which 44.97% is malicious traffic (including 16 types of botnets). Figures 8 and 9 show the distribution of the number of TCP flows in the training dataset and the testing dataset, respectively.


4.2. Experimental Evaluation
Since the detection method we designed does not require a training process, the training dataset and the testing dataset are treated in the same way, and we conducted experimental evaluations on both. The experimental results are compared with [13] from two aspects: the bot detection rate and the false alarm rate.
The detection rate measures whether our method can effectively detect bots. The false alarm rate measures the side effect of our method, namely, benign hosts being incorrectly identified as bots. The higher the detection rate and the lower the false alarm rate, the better.
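The two metrics can be computed as in the following sketch. It assumes the conventional definitions: the detection rate is the fraction of bots that are detected, and the false alarm rate is the fraction of benign hosts that are flagged as bots. The function and parameter names are hypothetical.

```python
def detection_rate(true_positives, false_negatives):
    """Fraction of actual bots that were detected."""
    return true_positives / (true_positives + false_negatives)


def false_alarm_rate(false_positives, true_negatives):
    """Fraction of benign hosts that were incorrectly flagged as bots."""
    return false_positives / (false_positives + true_negatives)


# 27 of the 28 bots in the ISCX testing set are detected, one is missed.
print(round(detection_rate(27, 1), 4))

# Hypothetical counts for illustration only: 1 benign host flagged out of 100.
print(false_alarm_rate(1, 99))
```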
Two parameters need to be set: the size of the flow window and the similarity threshold. First, we fix the similarity threshold and observe the influence of different flow window sizes on the recognition results. The window size is set to 10, 50, 100, and 200 in turn, and the recognition results are recorded for comparison, as shown in Table 1. As the window size increases, the proposed method can examine traffic over a larger range, which helps to improve the detection rate. However, the number of benign flows in the window also increases, so the probability of detection errors increases slightly. Therefore, the false alarm rate increases to a certain extent with the window size.
The experimental results show that a high detection rate can be achieved without an excessively large window size. As the window size continues to increase, the detection rate quickly reaches its optimal stable state, while the false alarm rate increases slightly; a large window size therefore raises the false positive rate. In addition, there are 28 bot hosts (or host pairs) in the ISCX-Testing dataset. Our method successfully detects 27 of them (27/28 = 0.96428); only one bot (IP: 172.29.0.109) is not detected. The reason is that this host has only one TCP flow in the ISCX-Testing dataset, as shown in Figure 9. Our detection method requires at least two related TCP flows to reach a conclusion. Hence, the detection of this host failed.
In addition, we set the flow window size to 10 and observe the influence of different similarity thresholds on the recognition results. The threshold is set to 0.7, 0.8, 0.9, and 0.99 in turn, and the recognition results are recorded for comparison. The results are shown in Table 2.
The experimental results show that, on the training dataset, the optimal effect is already achieved when the threshold is set to 0.7; increasing the threshold further changes neither the detection rate nor the false alarm rate. On the testing dataset, the detection rate does not change as the threshold increases, but the false alarm rate gradually decreases. Therefore, on the testing dataset, the larger the threshold, the lower the false alarm rate.
Different research works verify their methods on different datasets, and the verification results vary with the dataset chosen. Therefore, to be relatively fair, we compare the method proposed in this paper with other methods that were also verified on the ISCX dataset. The methods in Table 3 are influential research works that have achieved remarkable results in botnet detection. The authors of [13] propose an adaptive botnet detection framework that uses an SVM model to detect botnets; they train the model on the ISCX training dataset and then evaluate it on the ISCX testing dataset. Beigi et al. [8] focus on the proper selection and experimental assessment of features for accurate detection of botnets. Mohammad Alauthaman et al. [28] present a method based on an adaptive multilayer feedforward neural network in cooperation with decision trees to detect P2P-based bots. Soodeh Hosseini et al. [22] use a novel botnet detection and classification method based on convolutional neural networks and negative selection algorithms. All of these works use the ISCX dataset, or partial samples of it, to verify the performance of the proposed methods. The comparison results are shown in Table 3. The comparison shows that our method is more effective.
4.3. Flow Window Fluctuation
Since the configured window size is only the minimum size of the flow window, the actual size of the flow window fluctuates, as shown in Figure 10.

Figure 10: Fluctuation of the flow window size in (a) the training dataset and (b) the testing dataset.
The fluctuation of the flow window size affects memory usage, which is an important aspect of model performance. Figure 10 shows the fluctuation of the flow window size under different window size settings. Figures 10(a) and 10(b) show the fluctuations in the training and testing datasets, respectively. Most of the window sizes fluctuate around the configured value, except for a sharp increase in the size of individual windows.
In addition, the running time of our detection method, implemented in Python, is evaluated on a personal laptop (Intel i7-6500U CPU, 2.59 GHz, 16 GB memory, Windows 10) with the ISCX-Testing dataset. The time of the network traffic acquirer module is not included in the running time. TCP flow windows are continuously fed to the subsequent TCP flow processing modules (preprocess module, attack flow recognizer module, infection flow recognizer module, and result integration module). The running time is shown in Figure 11: the larger the window, the longer the running time.

5. Conclusion
In this paper, we have proposed a protocol-independent bot detection framework based on the similarity of flows. The proposed method does not rely on the protocol or structure of botnets. Instead, it exploits two facts: all botnets have group characteristics, and the sequence of packet lengths is not affected by encryption. Therefore, the sequence of packet lengths is used as the characteristic of a TCP flow, and the similarity of TCP flows is calculated to detect botnet traffic. We evaluated the method on the ISCX dataset, and the results show that it achieves a high detection rate with a low false alarm rate.
In the future, we will consider UDP packets to better deal with new botnet technology, making the detection system more robust and preventing botnets from using UDP to escape detection. In addition, the performance of the system will be further optimized so that it can process traffic in real time.
Data Availability
Complete information about datasets is available at https://iscx.ca/botnet-dataset.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant no. 62002374).