Abstract
Within packet processing systems, lengthy memory accesses greatly reduce performance. To overcome this limitation, network processors utilize many different techniques, for example, multilevel memory hierarchies, special hardware architectures, and hardware threading. In this paper, we introduce a multilevel memory architecture for counting Bloom filters. Based on the probabilities of the counter values in the counting Bloom filter, a multi-level cache architecture called the cached counting Bloom filter (CCBF) is presented, where each cache level stores the items whose counters share the same value. To test the CCBF architecture, we implement a software packet classifier that utilizes basic tuple space search with a 3-level CCBF. The results of the mathematical analysis and of the implementation of the CCBF for packet classification show that the proposed cache architecture decreases the number of memory accesses compared to a standard Bloom filter. Based on the mathematical analysis of the CCBF, the number of accesses is decreased by at least 53%. The implementation results of the software packet classifier are at most 7.8% (3.5% on average) less than the corresponding mathematical analysis results. This difference is due to parameters of the packet classification application such as the number of tuples, the distribution of rules across the tuples, and the utilized hashing functions.
1. Introduction
Most network devices, for example, routers and firewalls, need to process incoming packets (e.g., classification and forwarding) at wire speeds. These devices mostly incorporate special network processors that are comprised of a programmable processor core with several memory interfaces and special coprocessors optimized for packet processing. The performance of these network processors is usually hampered by slow (main) memory accesses. Such memory bottlenecks can be overcome by the following mechanisms: hiding memory latencies through parallel processing, and reducing memory latencies by introducing a multi-level memory hierarchy incorporating special-purpose caches [1]. A poorly designed cache memory can critically affect the performance of a network processor, since the number of memory accesses required for each lookup can vary. Therefore, high-throughput applications require search techniques with more predictable worst-case lookup performance. An approach to achieve higher lookup performance is to utilize the Bloom filter, which has recently been employed in embedded memories [2–4]. A Bloom filter is a simple space-efficient randomized data structure that represents a set in order to support membership queries [5]. There are numerous networking problems where such a data structure is required. In particular, when space is an issue, a Bloom filter may be an excellent alternative to keeping an explicit list. Bloom filters are frequently utilized in network processing areas such as packet classification, packet inspection, forwarding, p2p networks, and distributed web caching [5–7]. Therefore, Bloom filters are useful for designing high-performance memory architectures in network processors and algorithmic solutions in network processing applications.
In this paper, we introduce a new multi-level cache architecture called the cached counting Bloom filter (CCBF). In the CCBF, the cache levels are defined based on the values of the counters. In other words, the items with the same counter values are stored in the same level. Based on the counting Bloom filter (CBF) analysis, we propose two multi-level cache architectures (an $L$-level and a 3-level one) and, subsequently, present the performance analysis. The performance metric is the number of accesses to the different cache levels of the CCBF as compared to the standard Bloom filter. For the 3-level cache, we further determine the size of the cache levels for optimal false positive probabilities. To test the CCBF, we implemented a software packet classifier utilizing a 3-level CCBF and employing tuple spaces, which are traditionally utilized in packet classification. The mathematical analysis and the software implementation results show that the number of accesses is decreased when a 3-level CCBF is utilized. Based on the mathematical analysis, the number of accesses is decreased by at least 53% in comparison to the standard Bloom filter. The results of the software implementation differ by at most 7.8% (3.5% on average) from the corresponding results of the mathematical analysis. This difference is due to the number of tuples, the distribution of rules across the tuples, and the utilized hashing functions. The main contributions of this paper are the following:
(i) introducing a Bloom filter variant called the cached counting Bloom filter (CCBF),
(ii) a mathematical analysis of the CCBF,
(iii) a performance evaluation of the proposed CCBF for packet classification.
The rest of the paper is organized as follows. Section 2 presents related work; Sections 2.1–2.3 describe the counting Bloom filter and the cached counting Bloom filter concept, architecture, and analysis. The case study of packet classification is presented in Section 3. Section 4 presents our analysis and the software packet classifier implementation results. In Section 5, we draw the overall conclusions.
2. Related Works
In this section, we take a brief look at previous work regarding packet classification using Bloom filters and memory organization in Bloom filters. In [8, 9], Srinivasan et al. introduced the tuple space approach and the collection of tuple search algorithms. A high-level approach for multiple-field search employs tuple space. A tuple defines the number of specified bits in each field of the rule. The tuple-based algorithms utilize traditional hashing schemes. In [4], an extended version of the Bloom filter was considered. The authors presented a fast hash table architecture (FHT) and a lookup algorithm that converts a Bloom filter into a counting Bloom filter with an associated hash bucket. The FHT improves the performance over a standard hash table by reducing the number of memory accesses needed for the most time-consuming lookups. It only works in conjunction with counting Bloom filters and needs to reconsider all of the already inserted items for each new item, which consequently leads to longer processing times. In [10], a hash architecture called the multi-predicate Bloom-filtered hash table (MBHT) using parallel Bloom filters is presented. It generates off-chip memory addresses in a higher-base number system, which removes the overhead of pointers; by using a larger base, an MBHT reduces the on-chip memory size. In [11], an approach to packet classification that combines architectural and algorithmic techniques is presented. The starting point is the well-known crossproduct algorithm, which is fast but has significant memory overhead due to the additional rules needed to represent the crossproducts. The proposed approach modifies the crossproduct method to reduce the memory requirement. Unnecessary accesses to the off-chip memory are avoided by filtering them through on-chip Bloom filters. In [7], a cache design based on the standard Bloom filter was investigated and extended to support ageing (adding the ability to evict stale entries from the cache), bounded misclassification rates, and multiple binary predicates. It examined the exact relationship between the size and dimension of the filter, the number of flows that can be supported, and the misclassification probability incurred. Additionally, it presented extensions for gracefully ageing the cache over time to minimize misclassification. In [12], we introduced the concept of the CCBF, proposed two architectures for counting Bloom filters, and presented their mathematical analysis. In this paper, we implement a CCBF for packet classification using tuple space search with a class of universal hashing functions. Consequently, we compare the software implementation and the mathematical analysis results of the CCBF to a standard Bloom filter. The experimental results show that utilizing the CCBF increases the performance of counting Bloom filters.
2.1. Counting Bloom Filter
The standard Bloom filter works fine when the members of the set do not change over time. When they do, adding items requires little effort, since it only requires hashing the additional item and setting the corresponding bit locations in the array. On the other hand, removing an item conceptually requires unsetting the corresponding ones in the array, but this could inadvertently remove a 1 that was the result of hashing another item that is still a member of the set. To overcome this problem, the counting Bloom filter (CBF) was introduced [13]. In the counting Bloom filter, each bit in the array is replaced by a small counter. When inserting an item, each counter indexed by the corresponding hash value is incremented; therefore, a counter in this filter essentially represents the number of items hashed to it. When an item is deleted, the corresponding counters are decremented. In the following, we utilize $C_i$ to denote the value of the $i$th counter. Considering a counting Bloom filter for $n$ items, with $k$ hashing functions and $m$ counters, the probability that the $i$th counter is incremented $j$ times is given as a binomial random variable:

$$P(C_i = j) = \binom{nk}{j} \left(\frac{1}{m}\right)^{j} \left(1 - \frac{1}{m}\right)^{nk-j}. \tag{1}$$

When using $b$-bit counters, a counter will overflow if and only if it reaches a value of $2^{b}$. The analysis performed by Fan et al. [13] shows that a 4-bit counter is adequate for most applications.
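To make the counter mechanics concrete, the following is a minimal counting Bloom filter sketch in Python. The salted-hash construction and all names are illustrative assumptions for this sketch, not the scheme used in our implementation.

```python
# A minimal counting Bloom filter sketch: counters replace bits so that
# items can be deleted as well as inserted.
import hashlib

class CountingBloomFilter:
    def __init__(self, m, k):
        self.m = m                  # number of counters
        self.k = k                  # number of hashing functions
        self.counters = [0] * m     # small counters instead of single bits

    def _indexes(self, item):
        # Derive k counter indexes from k salted hashes of the item
        # (an illustrative stand-in for k independent hash functions).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, item):
        for idx in self._indexes(item):
            self.counters[idx] += 1          # increment instead of setting a bit

    def delete(self, item):
        for idx in self._indexes(item):
            self.counters[idx] -= 1          # decrementing keeps other items intact

    def query(self, item):
        # Possibly-in-set only if all k counters are non-zero.
        return all(self.counters[idx] > 0 for idx in self._indexes(item))
```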
2.2. Cached Counting Bloom Filter Concept
A cached counting Bloom filter (CCBF) is a counting Bloom filter with a multi-level hash table in which the items with the same counter value are stored in the same memory (cache) level. If the number of levels is $L$, the multi-level counting Bloom filter is called an $L$-level CCBF. We show that, in practice, the 3-level CCBF is more beneficial than the general $L$-level CCBF. In the CCBF, two types of operations are defined. The first type relates to programming and querying the Bloom filter; the second type is the insertion/deletion and fetching of an item from the multi-level CCBF based on the counter values in the counting Bloom filter. This means that when programming the Bloom filter, the items are inserted into the corresponding cache level of the CCBF. In the querying step, the counter values are checked, and the items are loaded from the corresponding cache level. The operations within the cache levels are similar to the operations in a traditional hash table (insertion/deletion and fetching). An example of a Bloom filter and the corresponding CCBF is depicted in Figure 1.
Figure 1 depicts the CCBF. To generate the CCBF, each item and its counters are inspected after the Bloom filter is created. Each item is stored at the hashed address with the maximum counter value, in the cache level given by that value. In this example (denoting the four items $x_1, \ldots, x_4$), item $x_1$ is hashed to addresses 1, 3, and 5. Address 3 has the maximum counter value; therefore, $x_1$ is stored in level 3. Item $x_2$ is hashed to addresses 1, 3, and 6, with the counter values 2, 3, and 2; therefore, $x_2$ is also stored at address 3 in level 3. Similarly, $x_3$ and $x_4$ are stored in level 3 and level 2, respectively. From Figure 1, it can be observed that items $x_1$, $x_2$, and $x_3$ are stored in a single bucket in the third level, and item $x_4$ is stored in level 2 because of its counter value. In other words, only 2 accesses are required in total: one for the bucket in the third level and one for the bucket in the second level. It should be noted that each address in the array of counters points to one level: addresses with a counter value greater than 2 point to level 3, addresses with counter value 2 point to level 2, and addresses with counter value 1 point to level 1.
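The placement rule of this example can be sketched as follows, reusing the CountingBloomFilter above. This is an illustrative sketch under the assumption that ties between equal counters are broken arbitrarily; the real CCBF additionally manages bucket layout and segmentation.

```python
# Sketch of the CCBF placement rule: each item is stored at the hashed
# address whose counter is largest, in the level equal to that counter
# value (values above the top level map to the top level).
def place_items(cbf, items, num_levels=3):
    # levels[l] maps a counter-array address to the bucket stored at level l.
    levels = {l: {} for l in range(1, num_levels + 1)}
    for item in items:
        idxs = list(cbf._indexes(item))
        addr = max(idxs, key=lambda i: cbf.counters[i])   # max-counter address
        level = min(cbf.counters[addr], num_levels)       # counters >= 3 -> level 3
        levels[level].setdefault(addr, []).append(item)
    return levels
```

Applied to the four items above, this rule yields one bucket at address 3 in level 3 holding three items and one bucket in level 2, that is, two accesses in total.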
According to the definition of a Bloom filter, the number of hashing functions $k$ for a filter with $m$ counters and $n$ items can be expressed as follows [12]:

$$k = c\,\frac{m}{n}, \tag{2}$$

where the value of $c$ changes for different Bloom filter configurations. Based on the Bloom filter definition, the optimal value of $c$ for a minimum false positive rate is $\ln 2 \approx 0.7$ (see Figure 2).
After substituting (2) in (1), we obtain

$$P(C_i = j) = \binom{cm}{j} \left(\frac{1}{m}\right)^{j} \left(1 - \frac{1}{m}\right)^{cm-j}. \tag{3}$$
Using (3), we can compute the probability of incrementing the $i$th counter for different values of $c$ and $m$. The resulting counter probability distributions for different counting Bloom filter configurations are depicted in Figure 3.
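As a numerical companion to (3), the short sketch below evaluates the counter probability distribution for a given $c$ and $m$; the parameter values are illustrative assumptions, not the exact configurations plotted in Figure 3.

```python
# Evaluates the counter distribution of (3): P(C = j) for j = 0..j_max,
# with n*k = c*m taken from (2).
from math import comb

def counter_distribution(c, m, j_max=6):
    nk = round(c * m)
    return [comb(nk, j) * (1 / m) ** j * (1 - 1 / m) ** (nk - j)
            for j in range(j_max + 1)]

# At the optimal c = 0.7, virtually all probability mass lies on counter
# values 0..3, which motivates a 3-level cache.
print([round(p, 4) for p in counter_distribution(0.7, 10_000)])
```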
From Figure 3, when $c$ is at its optimal value ($c \approx 0.7$), the counter values with non-zero probability range between 0 and 3, and when $c$ increases, the range of counter values with non-zero probability widens (for the largest depicted value of $c$, the counter values range between 0 and 5). Therefore, we can utilize a multi-level cache memory to store the items. We introduce the cached counting Bloom filter as a Bloom filter in which each counter points to the level corresponding to its counter value, and each entry in level $i$ contains buckets of size $i$. A bucket is a set of items that can be transferred in one I/O operation. Hence, for a Bloom filter with optimal false positive probability, we can utilize a multi-level cache memory to store the items. The $L$-level cached counting Bloom filter architecture is depicted in Figure 4.
In this figure, $C_i$ represents a counter with the value $i$ pointing to a location within cache level $i$. Therefore, the counters labeled $C_1$ are equal to 1, those labeled $C_i$ are equal to $i$, and those labeled $C_L$ are equal to $L$. The counters with value 0 do not point to any bucket in the cache memory.
2.3. Cached Counting Bloom Filter Analysis
In this section, we present the analysis of the cached counting Bloom filter. The number of accesses to the memory depends on whether the Bloom filter generates a "positive" or a "negative" result. For the negative case, no accesses to the memory are needed, since it is certain that the tested items are not in the original set. For the positive case, it must still be verified whether the item in question is a member or not (false positive). Consequently, we assume in the analysis that all tests are on different elements, which results in the testing of $n$ elements (the same number of items as in the original set). The number of accesses in a standard Bloom filter is then $n \times k$ memory accesses, where $n$ represents the number of items and $k$ represents the number of hashing functions; false positives, occurring with probability $f$, only add to this count. The $L$-level cached counting Bloom filter is depicted in Figure 4. From Figure 4, the number of accesses in an $L$-level CCBF is equal to the summation of the accesses in all levels:

$$N_{\mathrm{CCBF}} = \sum_{i=1}^{L} N_i. \tag{4}$$

In this equation, $N_i$ represents the number of accesses in level $i$. Based on the definition of the CCBF, the size of a bucket in level $i$ is equal to $i$; therefore, in each access, $i$ items can be transferred. Consequently, the number of accesses depends on the number of levels, which means that utilizing a multi-level cached counting Bloom filter decreases the number of accesses. The number of accesses in level $i$ is equal to the number of buckets in this level; to calculate the number of buckets, the size of level $i$ is divided by the size of a bucket in this level. From (1) and (4), the expected number of accesses in the CCBF is extended as follows:

$$N_{\mathrm{CCBF}} = \sum_{i=1}^{L} m\,P(C = i). \tag{5}$$

In (5), $P(C = i)$ is the probability from (1) that a counter is incremented $i$ times, and $m\,P(C = i)$ represents the expected number of accesses in level $i$ for a counting Bloom filter with $n$ items and $k$ hashing functions. We can rewrite (5) as follows:

$$N_{\mathrm{CCBF}} = \sum_{i=1}^{L} m \binom{nk}{i} \left(\frac{1}{m}\right)^{i} \left(1 - \frac{1}{m}\right)^{nk-i}. \tag{6}$$

If we assume that $m = nk/c$ (from (2)) and normalize to $n \times k$, then we can rewrite (6) as follows:

$$\widehat{N}_{\mathrm{CCBF}} = \frac{1}{c} \sum_{i=1}^{L} \binom{nk}{i} \left(\frac{1}{m}\right)^{i} \left(1 - \frac{1}{m}\right)^{nk-i}. \tag{7}$$
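A short numerical sketch of (5) and (6): the expected number of accesses in level $i$ is the expected number of occupied buckets there, $m\,P(C = i)$. The function name and the chosen parameters are illustrative.

```python
# Expected accesses per cache level, m * P(C = i), following (5)/(6).
from math import comb

def expected_accesses_per_level(c, m, max_level=5):
    nk = round(c * m)
    def p(j):
        return comb(nk, j) * (1 / m) ** j * (1 - 1 / m) ** (nk - j)
    return {i: m * p(i) for i in range(1, max_level + 1)}

# Example: per-level accesses for c = 0.7 and m = 10,000 counters.
print({i: round(a, 1)
       for i, a in expected_accesses_per_level(0.7, 10_000).items()})
```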
In practice, the number of levels is limited. The graph depicted in Figure 3 shows that the counter values are unlikely to be larger than 3. Therefore, a 3-level CCBF is more beneficial than an $L$-level CCBF, and we propose to limit the number of levels to 3. More precisely, levels 1 and 2 (with buckets of sizes 1 and 2, resp.) store the elements for the counters with values 1 and 2, respectively. Level 3 stores the elements for counters with value 3 or larger. As the counters with values larger than 3 require more storage, their elements are stored over multiple rows in the third level of the CCBF (segmentation). The 3-level cache architecture is depicted in Figure 5.
In Figure 5, the counters labeled $C_1$ are equal to 1, those labeled $C_2$ are equal to 2, and those labeled $C_3$ are equal to 3. $C_{3^{+}}$ represents the counters with values larger than three; therefore, they point to a storage within level 3 of the CCBF. Figure 5 highlights the mentioned segmentation. In the following, we analyse the effects of the items with counter values larger than three. The number of accesses in a 3-level CCBF is equal to the number of accesses in levels 1, 2, and 3. The number of accesses in the third cache level can be computed as a summation over the numbers of counters with value 3 and larger. Therefore, the number of accesses in a 3-level CCBF is as follows:

$$N_{3\text{-}\mathrm{CCBF}} = m\,P(C = 1) + m\,P(C = 2) + \sum_{i=3}^{nk} m\,P(C = i). \tag{8}$$

Equation (8) is represented as follows:

$$N_{3\text{-}\mathrm{CCBF}} = m \binom{nk}{1} \frac{1}{m} \left(1 - \frac{1}{m}\right)^{nk-1} + m \binom{nk}{2} \left(\frac{1}{m}\right)^{2} \left(1 - \frac{1}{m}\right)^{nk-2} + m \sum_{i=3}^{nk} \binom{nk}{i} \left(\frac{1}{m}\right)^{i} \left(1 - \frac{1}{m}\right)^{nk-i}. \tag{9}$$
After substituting $m$ with $nk/c$ and normalizing to $n \times k$, the number of accesses in the 3-level CCBF is written as follows:

$$\widehat{N}_{3\text{-}\mathrm{CCBF}} = \frac{1}{c} \left[ \binom{nk}{1} \frac{1}{m} \left(1 - \frac{1}{m}\right)^{nk-1} + \binom{nk}{2} \left(\frac{1}{m}\right)^{2} \left(1 - \frac{1}{m}\right)^{nk-2} + \sum_{i=3}^{nk} \binom{nk}{i} \left(\frac{1}{m}\right)^{i} \left(1 - \frac{1}{m}\right)^{nk-i} \right]. \tag{10}$$
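Because levels 1 and 2 take counters of values 1 and 2 and level 3 absorbs every larger value, the bracketed sum in (10) covers all non-zero counter values and collapses to $(1 - P(C = 0))/c$. The sketch below exploits this closed form; the chosen $c$ and $m$ values are illustrative.

```python
# Evaluates (10) via the closed form (1 - P(C = 0)) / c, since the three
# terms of (10) together cover every counter value of 1 or more.
def normalized_accesses_3level(c, m):
    nk = round(c * m)
    p0 = (1 - 1 / m) ** nk       # P(C = 0) from (3) with j = 0
    return (1 - p0) / c

# The normalized access count (and hence the reduction versus the standard
# Bloom filter's n*k accesses) improves as c grows.
for c in (0.7, 1.0, 2.0, 3.0):
    print(f"c = {c}: {normalized_accesses_3level(c, 10_000):.3f}")
```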
In the following, we evaluate the sizes of the different cache levels in the CCBF architecture. In short, the size of each cache level, in terms of items, is equal to the bucket size multiplied by $m$ and by the probability of the corresponding counter value in the CCBF. The size of cache level $i$ in an $L$-level CCBF is expressed as follows:

$$S_i = i \cdot m \cdot P(C = i). \tag{11}$$
In (11), $i$ is the level number. Using (1), we can rewrite (11) as follows:

$$S_i = i\,m \binom{nk}{i} \left(\frac{1}{m}\right)^{i} \left(1 - \frac{1}{m}\right)^{nk-i}, \tag{12}$$

where $i$ is the level number.
Using (12), the total size of the $L$-level CCBF cache after normalization to $n \times k$ (the size of a standard Bloom filter) is

$$\widehat{S}_{\mathrm{CCBF}} = \frac{1}{c} \sum_{i=1}^{L} i \binom{nk}{i} \left(\frac{1}{m}\right)^{i} \left(1 - \frac{1}{m}\right)^{nk-i}. \tag{13}$$
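The level sizes of (11)–(13) can be checked numerically as below: summing over all counter values and normalizing to $n \times k$ recovers approximately 1, that is, the size of the standard filter. The parameters are illustrative.

```python
# Level sizes per (11)/(12): level i holds m * P(C = i) buckets of i items,
# i.e., i * m * P(C = i) items; (13) is the sum normalized to n*k.
from math import comb

def level_sizes(c, m, max_level=6):
    nk = round(c * m)
    def p(j):
        return comb(nk, j) * (1 / m) ** j * (1 - 1 / m) ** (nk - j)
    return {i: i * m * p(i) for i in range(1, max_level + 1)}

c, m = 0.7, 10_000
norm = {i: s / (c * m) for i, s in level_sizes(c, m).items()}  # divide by n*k
print({i: round(s, 3) for i, s in norm.items()},
      "total:", round(sum(norm.values()), 3))
```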
3. Implementation of a Case Study
Traditionally, packets were forwarded solely based on the destination address, which is specified in one of the many header fields within a packet. Packet classification can be seen as the categorization of incoming packets based on their headers, according to specific criteria that examine specific fields within a packet header. The criteria are comprised of a set of rules that specify the content of specific packet header fields that result in a match [14, 15]. For testing purposes, we utilized different rule-set databases and packet traces that have been used by the Applied Research Laboratory at Washington University in St. Louis. The specifications of the rule-set databases and packet traces are presented in Table 1.
Table 1 includes seven rule-set databases and packet traces based on the IPv4 protocol. The rule sets Fw1, Acl1, and Ipc1 are extracted from real rule sets; the others are generated by the ClassBench benchmark.
A high-level approach for multiple field search employs tuple spaces with a tuple representing information in each field specified by the rules. Srinivasan et al. [8, 9] introduced the tuple space approach and the collection of tuple search algorithms.
We utilize the H3 class of universal hashing functions [16, 17]. Based on the tuple space representation of the rule-set database and IP packets, the size of the input key is 88 bits (32-bit source IP address, 32-bit destination IP address, 8-bit Range-ID, 8-bit Nesting-Level, and 8-bit protocol field). The maximum size of a tuple, or address space, is assumed to be $2^{16}$ rules for a 16-bit address. Therefore, the hashing functions for the tuple space packet classification algorithm are defined by a set of $88 \times 16$ binary matrices [18].
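A sketch of an H3-style hash for this 88-bit key is given below: each member of the class corresponds to an $88 \times 16$ binary matrix, and each output bit is the parity of the key ANDed with one 88-bit column mask. The key packing order and the seeding are assumptions for illustration, not our exact implementation.

```python
# Illustrative H3 hash: one random 88-bit column mask per output bit; the
# j-th address bit is the parity (XOR fold) of (key AND column_j).
import random

KEY_BITS, ADDR_BITS = 88, 16

def make_h3(seed):
    rng = random.Random(seed)
    cols = [rng.getrandbits(KEY_BITS) for _ in range(ADDR_BITS)]
    def h3(key):
        addr = 0
        for j, mask in enumerate(cols):
            addr |= (bin(key & mask).count("1") & 1) << j   # parity bit j
        return addr                                          # 16-bit bucket address
    return h3

def pack_key(src_ip, dst_ip, range_id, nesting_level, protocol):
    # 32 + 32 + 8 + 8 + 8 = 88 bits, in the field order listed above.
    return ((src_ip << 56) | (dst_ip << 24) |
            (range_id << 16) | (nesting_level << 8) | protocol)
```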
4. Performance Evaluation and Results
In this section, we present the mathematical analysis and implementation results of the CCBF architecture in packet classification using tuple space search.
The implementation and mathematical analysis results for Fw1-100, Fw1-1k, and Fw1-5k rule-set databases for a 3-level CCBF are depicted in Figure 6.
In this figure, "Fw1-xx-Im" shows the graph of the software implementation results, and "Fw1-xx-Pr" shows the graph of the mathematical analysis results calculated from (10). The vertical axis shows the number of accesses normalized to $n \times k$ ($n$ is the number of items and $k$ is the number of hashing functions). The horizontal axis includes two sequences: the first shows the number of hashing functions $k$, and the second, specified by $m/n$ ($m$ represents the size of the address space of the counter array in the CCBF, and $n$ represents the number of items), shows the corresponding value for each point of the first sequence. As an example, for $m/n = 10$, $k = 7$ generates the minimum false positive probability. These rule-set databases (Fw1-100, Fw1-1k, and Fw1-5k) are synthetic and were generated by the ClassBench benchmark. From Figure 6, we can observe that the number of accesses is decreased for the different configurations. The software implementation and mathematical analysis results for Fw1, Ipc1, and the average over all rule-set databases are depicted in Figure 7.
Figures 7(a) and 7(b) depict the number of accesses for Fw1 and Ipc1 rule-set databases that were extracted from real rule-set databases. Figure 7(c) depicts the average for all utilized rule-set databases.
From Figures 6 and 7, we can observe that the mathematical analysis results are confirmed by the software implementation results. Based on the mathematical analysis of the CCBF, the number of accesses is decreased by at least 53%. The implementation results of the software packet classifier are at most 7.8% (3.5% on average) less than the corresponding mathematical analysis results. This difference is due to the following factors: the number of tuples, the distribution of rules across the tuples, and the utilized hashing functions. In packet classification using tuple space, the number of tuples and the number of rules per tuple vary across rule-set databases and across the tuples within each rule-set database. In most of the rule-set databases, one tuple includes about half of the rules, while some tuples hold only one or a few rules. In the mathematical analysis, by contrast, the results were obtained by investigating a CCBF with a single large counter array and a single set of items. The total size of the cache levels for the real rule-set databases in a 3-level CCBF is depicted in Figure 8.
In this figure, "rule-set-im" shows the total cache size for the different rule-set databases; the results are extracted by the software packet classifier and normalized to $n \times k$ (the number of items multiplied by the number of hashing functions). Based on the software implementation, the total cache size shows some fluctuations. This is due to internal gaps in the buckets of the third level of the CCBF.
4.1. Discussion
The CCBF stores the incoming items (rules) in memory similarly to the traditional replacement algorithm called least frequently used (LFU) [19]. In the CCBF, a bucket with a larger counter is referenced more frequently; therefore, it resides in a higher cache level with a lower access time, while a bucket with a lower counter resides in a lower cache level. The overheads of the CCBF in comparison to the standard Bloom filter are the management of the different cache levels and the segmentation of large buckets. In the CCBF, the bucket size of the third level is set to 3; therefore, larger buckets must be segmented into multiple buckets and linked together. It should be noted that the CCBF differs from a hash table with block-read support: in the CCBF, a block is read only when the corresponding counter has a value larger than one; otherwise, there is no need for a block read. From Figure 8, we can observe some difference in the CCBF size between the software implementation and the mathematical analysis results. This is because of internal gaps in the buckets in the implementation of the CCBF. To overcome this problem, we utilize the following mechanisms:
(i) a shared global overflow area,
(ii) a level overflow area.
A shared global overflow area is a memory space to store overflow items: when an incoming item cannot be stored in its level, it is stored in the shared global overflow area. The second mechanism is a level overflow area, allocated as additional memory for each level. The second solution is more practical to implement, because the size of each level is then provisioned larger than the size obtained from the mathematical analysis results.
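The following sketch combines both mechanisms as a fallback chain (own level, then the level overflow area, then the shared global area); the capacities, names, and the combination itself are illustrative assumptions rather than the exact policy of the implementation.

```python
# Insertion with overflow handling: try the item's own level first, then the
# per-level overflow area, and finally the shared global overflow area.
def insert_with_overflow(levels, level_overflow, global_overflow,
                         level, addr, item, bucket_cap, overflow_cap):
    bucket = levels[level].setdefault(addr, [])
    if len(bucket) < bucket_cap:
        bucket.append(item)                          # fits in its own level
    elif len(level_overflow[level]) < overflow_cap:
        level_overflow[level].append((addr, item))   # level overflow area
    else:
        global_overflow.append((level, addr, item))  # shared global area
```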
5. Overall Conclusions
In this paper, we presented a new approach to embed a multi-level cache memory in a counting Bloom filter (CCBF). Using the counting Bloom filter properties, the number of accesses and the sizes of the $L$-level and 3-level caches in the CCBF architecture were investigated. To verify the mathematical analysis results, we implemented a software packet classifier for basic tuple space search using the H3 class of universal hashing functions. The results show that incorporating a multi-level cache memory improves the performance in comparison to a standard Bloom filter. Based on the mathematical analysis of the CCBF architecture, the number of accesses is decreased by at least 53%. The implementation results differ from the corresponding mathematical analysis results by at most 7.8%. We expect this approach to be useful in the design of high-performance memory architectures utilized in network processors and in related applications such as packet classification and web caching.