Abstract
Data mining is the process of extracting hidden patterns from large databases using a variety of techniques. In supermarkets, for example, it can reveal items that are often purchased together but are hidden within the data, which supports better decisions and improves business outcomes. One technique used to discover frequent patterns in large databases is frequent itemset mining (FIM), a part of association rule mining (ARM). Several algorithms exist for mining frequent itemsets; one of the most common is the Apriori algorithm, which deduces association rules between different objects that describe how these objects are related. It can be used in application areas such as market basket analysis, course selection in e-learning platforms, stock management, and medical applications. Today's explosion of data greatly increases the computational time of the Apriori algorithm, so there is a need to run data-intensive algorithms in a parallel, distributed environment to achieve acceptable performance. In this paper, an optimization of the Apriori algorithm using the Spark-based cuckoo filter structure (ASCF) is introduced. ASCF removes the candidate generation step from the Apriori algorithm to reduce computational complexity and avoid costly comparisons. It uses the cuckoo filter structure to prune the transactions by reducing the number of items in each transaction. The proposed algorithm is implemented on the Spark in-memory distributed processing environment to reduce processing time. ASCF offers a great improvement in performance over the other Apriori-based candidate algorithms, requiring only 5.8% of the runtime of the state-of-the-art approach on the retail dataset with a minimum support of 0.75%.
1. Introduction
We live in the data age; data [1–4] are being generated all the time by everything around us: social sites, sensors, search engines, medical histories, and more. There is an urgent need to assist humans in extracting useful information (knowledge) from these data. This process is called knowledge discovery in databases (KDD) [5, 6]. The core part of KDD is “data mining” [7, 8], the process of discovering patterns in large databases. Data mining application areas include marketing, education, telecommunications, fraud detection, finance, and medical applications. Association rule mining discovers important relations between variables in large databases. Building the association rules for all itemsets requires large memory and processing resources, so only frequent itemsets are considered. Among the many frequent itemset mining algorithms, the most popular is the Apriori algorithm [9]. Apriori is iterative and works sequentially: it scans the database in each iteration and generates a huge number of candidates from the frequent itemsets. It normally executes on a single machine, whose speed cannot keep up with such large amounts of data, so multiple machines and a parallel algorithm are needed, where issues such as data replication and synchronization must be addressed. The major drawback of the Apriori algorithm is its computational complexity, which makes it inefficient as the data size grows.
An empirical analysis of the Apriori algorithm is discussed in [10]. The experiment was applied to 2000 of 2064 transactions of hospital data. The result, shown in Figure 1, illustrates how the run time of the Apriori algorithm increases as the number of transactions (i.e., the data size) increases. The following provides the necessary background for our work.

1.1. Association Rule Mining
Association rule mining (ARM) [11] is a data mining technique used to uncover frequent patterns that describe important relations between variables in large databases. It helps in decision-making by finding the relationships between the different attributes of a database. Interestingness measures such as support and confidence are used to rank and select interesting rules. Assume a dataset D includes N transactions D = {T1, T2, …, TN}, each of which contains a subset of the items in I, where I includes M items I = {i1, i2, …, iM}. Suppose X and Y are two different subsets of I with X ∩ Y = ∅ and there is a rule X ⇒ Y, where X is called the antecedent (left side) and Y the consequent (right side). The support indicates how frequently the itemset appears in the dataset. It can be computed as
Support(X) = |{T ∈ D : X ⊆ T}| / N, i.e., the fraction of transactions containing all items in X.
If the support of X is greater than or equal to the desired minimum support, the itemset X is described as a frequent itemset. The confidence level indicates how often the rule has been found to be correct. It can be computed as
Confidence(X ⇒ Y) = Support(X ∪ Y) / Support(X), i.e., the conditional probability that Y occurs in a transaction of D that contains X.
A rule is said to be a strong association rule if the support is greater than the specified minimum support and the confidence is greater than the specified minimum confidence.
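As a small illustrative computation (the transactions and numbers below are hypothetical, not from the paper), the following Python snippet evaluates support and confidence for a rule {milk} ⇒ {bread}:

```python
# Tiny illustrative computation of support and confidence (hypothetical data).
transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"},
    {"bread", "butter"}, {"milk", "butter"}, {"milk", "bread"},
]
N = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t) / N

X, Y = {"milk"}, {"bread"}
print(support(X))                    # 4/5 = 0.8
print(support(X | Y))                # 3/5 = 0.6
print(support(X | Y) / support(X))   # confidence(milk => bread) = 0.75
```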
1.2. Apriori Algorithm
Apriori [12, 13] is an iterative approach for finding frequent itemsets and the association rules between different items. The kth iteration of Apriori generates frequent itemsets of length k, where k is an integer ≥1. The algorithm's main concept is to build frequent itemsets with only one item (1-itemsets), named L1, and then recursively generate frequent 2-itemsets L2, frequent 3-itemsets L3, and so on, until no more itemsets can be generated that meet a predefined minimum support value. Each iteration generates a candidate set based on the previous iteration's results. To locate the occurrences of each itemset in the candidate set, the dataset is scanned; then, the itemsets with occurrences below the minimum support are pruned. The algorithm relies on the property that any frequent itemset must be constructed from components that are themselves frequent itemsets (i.e., if {AB} is a frequent itemset, then both {A} and {B} must be frequent itemsets). Two main processes make up the Apriori algorithm:
(1) Locating the most frequently occurring itemsets in the database using the minimum support
(2) Using the frequently occurring itemsets to generate association rules.
The process for generating Apriori frequent itemsets has two basic steps:
(1) Join step: the candidate set Ck in iteration k is formed by joining Lk−1 with itself
(2) Pruning step: k-itemsets in Ck with a support count below the desired value are dropped; the pruned Ck yields the frequent itemsets Lk
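The following is a minimal, single-machine Python sketch of these two steps (with a simplified join by pairwise unions rather than the classic prefix-based join); it is illustrative only and not the algorithm evaluated in this paper.

```python
from itertools import count

# Illustrative sketch of the Apriori join and prune steps (not the paper's code).
def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    # L1: singleton frequent itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup}
    frequent, k = {}, 1
    while Lk:
        frequent[k] = Lk
        k += 1
        # Join step: union pairs of (k-1)-itemsets whose union has exactly k items.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: count candidate occurrences and keep those meeting min_sup.
        counts = {c: sum(1 for t in transactions if c <= t) for c in Ck}
        Lk = {c for c, cnt in counts.items() if cnt >= min_sup}
    return frequent

if __name__ == "__main__":
    data = [["A", "B", "D"], ["A", "B", "F"], ["B", "D", "F"], ["A", "B", "D", "F"]]
    print(apriori(data, min_sup=3))
```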
1.3. Spark
Apache Spark [14, 15] is an open-source cluster computing framework. It is a fast, general-purpose engine for processing enormous amounts of data. Spark outperforms Hadoop MapReduce by up to 100 times in memory and 10 times on disk [16]. The MapReduce process reads and writes data through the Hadoop distributed file system (HDFS), which creates a heavy I/O load, increases time, and makes the Hadoop MapReduce framework unsuitable for iterative algorithms. Spark may operate on Hadoop, Mesos, in the cloud, or standalone, while MapReduce runs only on Hadoop. Spark supports diverse data sources including HDFS, Cassandra, and HBase, and supports many programming languages, such as Java, Python, and Scala. The Spark programming interface is built around a data structure called a resilient distributed dataset (RDD), the fundamental unit of data in Spark. An RDD is an immutable (read-only) data structure. Spark automatically partitions and distributes the data contained in RDDs across the cluster and parallelizes the operations on them. An RDD is a distributed collection of records created from a file, a set of files, or another RDD transformed to produce a new RDD [17]. Hadoop uses replication to achieve fault tolerance, where each data block is replicated 3 times by default, whereas Spark maintains each RDD's lineage to achieve fault tolerance. When a new RDD is formed from an existing RDD, a pointer to the parent RDD is included, and the dependencies of each RDD are logged in a graph called the lineage graph. It is useful when a new RDD must be computed or lost data must be recovered. When an RDD is reused in multiple actions (the same data are required multiple times), Spark can keep it in memory to allow future actions to be much faster.
These special properties of Spark improve execution time for iterative programs. Spark uses a master/slave architecture [18] with one central coordinator called the driver (the process running the main() function of the application), which communicates with many distributed workers (a worker node is a slave node in the cluster that executes application code), as shown in Figure 2. The driver program contacts the cluster manager to request resources across the cluster and launch executors (processes that perform calculations and store data). The driver divides its Spark jobs into tasks and sends them to the executors to run; the executors then send the results back to the driver.
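As a minimal illustration of these concepts (the HDFS path and application name are placeholders, not from the paper), the PySpark snippet below builds an RDD lineage through lazy transformations and persists an RDD that is reused by two actions:

```python
# Minimal PySpark sketch of RDDs, lazy transformations, lineage, and persist().
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

lines = sc.textFile("hdfs:///data/transactions.txt")   # RDD created from HDFS
items = lines.flatMap(lambda line: line.split())        # new RDD; lineage is recorded
pairs = items.map(lambda item: (item, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

counts.persist()                                         # keep in memory for reuse
print(counts.count())                                    # first action triggers computation
print(counts.take(5))                                    # second action reuses the cached RDD
```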

1.4. Cuckoo Filter Structure
A cuckoo filter is like a Bloom filter [19]; both are high-speed, space-efficient probabilistic data structures. They are useful for quick membership tests when the size of the original data is large, with few false positives. Both filters support adding items and asking whether they are present. A cuckoo filter is practically better than a Bloom filter because of the following [20]:
(1) It supports deletion in O(1) time, while the Bloom filter cannot remove existing items without rebuilding the entire filter
(2) It achieves higher lookup performance, since only two locations need to be checked, taking O(1) time
(3) It needs less space if the target false-positive rate ɛ is less than 3%
The cuckoo filter structure is shown in Figure 3. It has a cuckoo hash table with m buckets, where each bucket can store b entries, and two hash functions:
(1) The hash functions h1(x) and h2(x) identify the candidate positions of a given item for insertion or lookup
(2) A cuckoo filter stores only the fingerprint of the item, computed by a hash function ƒ = fingerprint(x)

A cuckoo filter stores the f-bit fingerprint of items instead of the items themselves. The cuckoo filter's false-positive rate is determined by the bucket size b and the fingerprint length f in bits. The required fingerprint length is approximately [20]

f ≥ ⌈log2(2b/ɛ)⌉ bits,

where ɛ is the target false-positive rate.
An empirical analysis of the cuckoo filter with different bucket sizes b = 2, 4, and 8 is discussed in [20]. According to that study, b = 4 gives the cuckoo filter good space efficiency and a low false-positive rate.
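To make the mechanics concrete, the following is a minimal, simplified cuckoo filter sketch (insertion by partial-key cuckoo hashing, lookup, and deletion). The class name, hash choices, and parameters are illustrative assumptions; this is neither the structure used by the paper nor the python-cuckoo package.

```python
# Simplified cuckoo filter sketch for illustration (num_buckets should be a power of two
# so that the alternate-index computation is an involution).
import random
from hashlib import sha256

class CuckooFilter:
    def __init__(self, num_buckets=1024, bucket_size=4, fp_bits=8, max_kicks=500):
        self.m, self.b, self.fp_bits, self.max_kicks = num_buckets, bucket_size, fp_bits, max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item):
        h = int.from_bytes(sha256(item.encode()).digest(), "big")
        return (h % (2 ** self.fp_bits - 1)) + 1          # nonzero f-bit fingerprint

    def _indexes(self, item, fp):
        i1 = hash(item) % self.m
        i2 = (i1 ^ hash(str(fp))) % self.m                 # partial-key cuckoo hashing
        return i1, i2

    def insert(self, item):
        fp = self._fingerprint(item)
        i1, i2 = self._indexes(item, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.b:
                self.buckets[i].append(fp)
                return True
        i = random.choice((i1, i2))                        # evict and relocate entries
        for _ in range(self.max_kicks):
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = (i ^ hash(str(fp))) % self.m
            if len(self.buckets[i]) < self.b:
                self.buckets[i].append(fp)
                return True
        return False                                       # filter considered full

    def contains(self, item):
        fp = self._fingerprint(item)
        i1, i2 = self._indexes(item, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item):
        fp = self._fingerprint(item)
        for i in self._indexes(item, fp):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```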
1.5. The Main Gaps in the Previous Approaches
Many works have proposed algorithms to enhance the Apriori algorithm when the data size is large, but some gaps remain uncovered. ASCF addresses these gaps to enhance the performance of the Apriori algorithm as the data size increases. Table 1 lists these gaps and how ASCF handles them.
The ASCF algorithm addresses the inherent drawbacks of the original Apriori algorithm. It works in a parallel-distributed environment such as the Hadoop ecosystem [16, 21], Spark [22], and other distributed data management systems that provide tools for data storage, access, and parallel processing of big data. ASCF eliminates the candidate generation step of the Apriori algorithm to reduce computational complexity and avoid costly comparisons; it keeps only the frequent items in the cuckoo filter [23] and uses them to prune the transactions to further enhance performance.
The rest of the paper is organized as follows: Section 2 presents the most important work carried out in association rule mining. Section 3 explains the proposed approach. The experimental results and the corresponding discussion are described in Section 4. Section 5 provides the conclusion and outlines the future work.
2. Related Work
Much research has been carried out in the ARM field over the past decades, proposing several improvements to the Apriori algorithm. These studies mainly target the major drawbacks of Apriori that severely degrade its performance as datasets get larger, a common feature of today's data. Agrawal and Shafer [24] proposed a parallel implementation of Apriori; however, with the emergence of massive data, its performance suffers from synchronization and communication problems.
Many researchers have proposed implementing the Apriori algorithm in a multimachine environment, i.e., a distributed computing framework. Li et al. [25] proposed implementing the Apriori algorithm in the MapReduce framework. There are two steps: map and reduce. The mapper converts the data into (key, value) pairs and finds the potential candidate set, while the reducer combines the results from the different mappers and produces the combined result, which includes the itemsets whose support counts equal or exceed the minimum support. These steps are repeated to generate the frequent (k + 1)-itemsets from the frequent k-itemsets until no further frequent itemsets are possible. Each new MapReduce task must read data from HDFS and write it back, causing significant I/O overhead and increasing the time cost. Singh and Miri [26] proposed a parallel Apriori algorithm based on the MapReduce framework, which tries to reduce the time taken by the second iteration by using a Bloom filter. It consists of three phases. In phase one, the mapper and the reducer generate the singleton frequent items. In phase two, the singleton items are stored in the Bloom filter and every transaction is pruned so that it contains only items present in the Bloom filter; then, the mapper and the reducer generate the 2-frequent itemsets. In phase three, the k-frequent itemsets are used to derive the (k + 1)-frequent itemsets in an iterative process.
YAFIM (yet another frequent itemset mining) [27] is a Spark RDD-based parallel Apriori method. It is divided into two phases: phase one is responsible for generating singleton frequent items, while phase two iteratively creates (k + 1)-frequent itemsets from the k-frequent itemsets. It stores the candidate (k + 1)-itemsets, which must first pass the minimum support test, in a hash tree to find the (k + 1)-frequent itemsets faster. YAFIM is faster than the approach of Li et al. [25], but it is not efficient in the second phase when the number of candidate itemsets is large. Based on the Spark RDD architecture, Rathee et al. [28] proposed R-Apriori, an efficient parallel Apriori algorithm consisting of three phases: phase one generates singleton frequent items; phase two stores the singleton items in a Bloom filter, prunes every transaction so that it only contains items from the Bloom filter, produces all potential item pairs from these transactions, and thus computes the 2-frequent itemsets; phase three repeatedly uses the k-frequent itemsets to produce the (k + 1)-frequent itemsets and stores the candidates in a hash tree to speed up the search for (k + 1)-frequent itemsets. It is faster than YAFIM [27]. EAFIM [29] is an efficient Apriori-based frequent itemset mining approach built on Spark, based on two main steps: generating the candidate itemsets and counting their support values, and then reducing the dataset size by removing useless items and transactions in each iteration. EAFIM shows better performance than YAFIM [27] and R-Apriori [28]. Sethi and Ramesh [30] proposed HFIM, a Spark-based hybrid frequent itemset mining algorithm, which consists of two phases. Phase one generates singleton frequent items by converting the dataset to a vertical dataset (items and IDs), which is then shared on each node. Phase two repeatedly generates (k + 1)-frequent itemsets from the k-frequent itemsets and uses the vertical dataset to compute the support count for each candidate itemset.
Rathee and Kashyap [31] proposed the adaptive-miner approach, which consists of two phases. Phase one is responsible for generating singleton frequent items and storing them in a Bloom filter, as in R-Apriori [28]. Phase two is responsible for repeatedly creating (k + 1)-frequent itemsets from the k-frequent itemsets. To save time and space, it chooses the optimal execution plan for each iteration. Adaptive-miner is a dynamic algorithm that adapts its method for extracting frequent itemsets to the nature of the dataset. Gao et al. [32] proposed an improved Apriori algorithm on Spark to address the drawbacks of the Apriori algorithm as the dataset grows, by scanning the dataset only once, calculating the support value at each pass, and eliminating infrequent itemsets; the proposed algorithm reduces the number of transactions and the time spent processing the dataset. Castro et al. [33] compared alternative Apriori implementations on Hadoop MapReduce and Spark for various datasets and minimum supports; all of the experiments show that Spark implementations surpass Hadoop MapReduce implementations in runtime. Raj et al. [34] proposed a Spark-based Apriori algorithm that reduces the shuffle overhead caused by the RDD shuffle operation in each iteration and shows better running time and scalability. The CEUPM (communication cost-effective utility-based pattern mining) algorithm based on the Spark framework was proposed by Kumar and Mohbey [35]; it reduces the communication cost during the shuffle process by allocating tasks across cluster nodes fairly and effectively using a search space division strategy, and it shows better running time, memory usage, and scalability.
A survey of the distinct approaches to pattern mining in the big data field based on Hadoop and Spark parallel and distributed processing was conducted by Kumar and Mohbey [36]. It studied four types of itemset mining: parallel frequent itemset mining, high-utility itemset mining, sequential patterns mining, and frequent itemset mining in uncertain data (data which are obtained from sensors or in experimental observations in real-world applications). For each type, it illustrated and discussed the main concepts, advantages, and disadvantages for many approaches. It discussed challenges and research opportunities for future research. Gawwad et al. [37] proposed a parallelizable algorithm for frequent itemset mining that could deal with big datasets making use of the multicore feature of the hardware. The algorithm was based on greatest common divisor calculations among the transactions based on prime number assignment to the different items in the transaction dataset. The UBDM (uncertain big data mining) approach was proposed by Kumar and Mohbey [38], and it showed better performance in the Spark framework by using the probability-utility-list structure to discover patterns in large amounts of uncertain data.
Many research studies proposed algorithms to extract high-utility itemset patterns from data instead of frequent patterns. Kumar and Mohbey [39] proposed the DMOUM (distributed memory-optimized utility mining) approach which was implemented on Spark to discover high-utility itemsets (patterns with a minimum utility threshold) from big data. It succeeded in reducing the processing time and memory usage by using a pruning strategy which removes nonrelevant items in the search area. To solve one of the main challenges in mining high-utility itemsets from transaction databases which are defining a database-dependent minimum utility threshold, Kannimuthu and Premalatha [40] proposed the stellar mass black hole optimization algorithm to extract top-k high-utility itemsets from the transaction database without defining a minimum utility threshold. Kannimuthu and Chakravarthy [41] presented a hybrid genetic algorithm to extract high-utility itemsets for web service composition and showed acceptable results in terms of processing time and memory usage. A high-utility itemset mining process has an inherent problem of producing a huge number of itemsets as the downward closure property which is applied in frequent itemset mining is not applicable in high-utility mining, and the existing algorithms do not support itemsets associated with negative utility values. As a result, Kannimuthu and Premalatha [42] proposed a utility pattern-growth approach for negative item values to extract high-utility itemsets with negative item values which proved a competing performance.
To decrease the number of rules, computational time, and memory usage, Chiclana et al. [43] suggested a new association rule mining algorithm based on animal migration optimization. The algorithm removes unnecessary rules of low support and keeps and integrates only the frequent rules into the fitness function of animal migration optimization. Rajagopal et al. [44] suggested a scheme for choosing crops; compared with other methods, it had effective outcomes and could choose the crop likely to produce a larger profit.
3. The Proposed Algorithm
Apriori is an iterative algorithm that finds the frequent itemsets and generates association rules from them in a sequential manner. It consists of two phases: first, generating the singleton frequent items; second, generating the frequent k-itemsets iteratively. The algorithm becomes computationally more expensive as the data size increases because, in each iteration of the second phase, candidate sets containing all possible combinations of the frequent itemsets from the previous iteration are generated, and each candidate is compared against every transaction to find its count.
The ASCF algorithm is designed to solve the major drawbacks inherent in the original Apriori algorithm. The solution uses multiple machines and a parallel algorithm to overcome the degradation in the performance of Apriori as the data size increases. The cuckoo filter is useful for quickly testing the membership of an item when the original data are large: it speeds up testing, for each transaction, whether an item is frequent, so that the item can be retained or pruned accordingly. The ASCF algorithm consists of two phases.
3.1. Phase One
In this phase, the algorithm, described in Algorithm 1, is responsible for generating all the singleton frequent items. Figure 4 shows the execution flow as a lineage graph of RDDs. The transaction data are loaded from HDFS into a Spark RDD and then partitioned and distributed across the worker nodes, so they are available locally to each worker.

The map() function reads the transactions and converts each one into a list of items. The flatMap() function is applied to each transaction (a list of items) to emit each item separately. Each item is converted into an (item, 1) key/value pair using the map() function. The reduceByKey() function then computes each item's frequency. The filter() function removes items whose frequency falls below the minimum support count (min_sup), yielding the 1-frequent itemsets in the form of (item, support). Finally, the keys() function is applied to obtain only the items (the counts are dropped) from the 1-frequent itemsets so that they can be stored in the cuckoo filter.
The frequent items are kept in memory (persist) to speed up the next phase. After phase one, the frequent items are stored in the cuckoo filter structure. Using the Spark framework’s broadcast function, the cuckoo filter is shared across all the nodes in the cluster. Therefore, the cuckoo filter of the frequent items is assigned to a broadcast variable.
In this phase, the reduceByKey() function reduces the time needed to generate the singleton frequent items because much less data is sent over the network during the shuffle and reduce steps.
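A minimal PySpark sketch of phase one is given below. It assumes one space-separated transaction per line on HDFS and reuses the illustrative CuckooFilter class from Section 1.4 in place of the python-cuckoo package used in the paper; it is not the authors' implementation.

```python
# Sketch of phase one (assumed input: one space-separated transaction per line on HDFS).
from pyspark import SparkContext

sc = SparkContext(appName="ascf-phase-one")
min_sup = 3                                               # example minimum support count

transactions = sc.textFile("hdfs:///data/transactions.txt") \
                 .map(lambda line: line.split())          # each transaction -> list of items

frequent_1 = (transactions
              .flatMap(lambda items: items)               # emit each item separately
              .map(lambda item: (item, 1))                # (item, 1) pairs
              .reduceByKey(lambda a, b: a + b)            # item frequencies
              .filter(lambda kv: kv[1] >= min_sup))       # keep items meeting min_sup
frequent_1.persist()

cf = CuckooFilter()                                       # illustrative filter from Section 1.4
for item in frequent_1.keys().collect():                  # store only the items (no counts)
    cf.insert(item)
cf_broadcast = sc.broadcast(cf)                           # share the filter with all workers
```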
3.2. Phase Two
The algorithm in this phase, which is shown in Algorithm 2, is iterative. It is responsible for generating the k-frequent itemsets (k > 1). ASCF reduces the number of items in each transaction by utilizing the cuckoo filter structure to prune transactions.
Instead of generating all possible candidate itemsets, ASCF generates candidate sets only from the pruned transactions whose lengths are greater than or equal to k, where k is the length of the frequent itemsets to be generated (a transaction shorter than k cannot generate candidates with k items). Therefore, ASCF reduces the number of transactions and the candidate sets in each iteration. The flow of execution, in the form of a lineage graph of RDDs, is shown in Figure 5. First, all nonfrequent items are removed from each transaction so that it contains only items that exist in the cuckoo filter, and only transactions whose length is greater than or equal to k (initially k = 2) are retained. Thus, the dataset contains fewer transactions and items. Second, the following operations are performed per iteration, while k ≥ 2, where k is the length of the frequent itemsets (a PySpark sketch follows this list):
(1) The mapPartitions() function runs on each partition (block) of the RDD. It takes the pruned transactions and generates combinations of k items from those transactions whose lengths are greater than or equal to k. Then, flatMap() is applied to these combinations so that each combination is emitted separately.
(2) The map() function converts each combination of k items into a (combination, 1) key/value pair. The reduceByKey() function then calculates the frequency of each combination, using a custom hash function (murmurhash3, mmh3) on the composite key to guarantee that keys starting with the same first component end up in the same partition. All combinations of k items with frequency below the minimum support count are removed by the filter() function; the remaining combinations are the k-frequent itemsets. They are kept in memory (persist) so that the next operation can use them faster.
(3) If the number of k-frequent itemsets is greater than 1, the algorithm proceeds to the next iteration to generate the (k + 1)-frequent itemsets; otherwise, k is set to 1 and the algorithm stops. Items that do not appear in the k-frequent itemsets are removed from all the pruned transactions, because these items cannot occur in the (k + 1)-frequent itemsets. This is performed by the following steps:
(i) The flatMap() function is applied to the k-frequent itemsets to extract the set of unique items, F.
(ii) The difference between the frequent items stored in the cuckoo filter and F is calculated; this difference contains the items that are in the cuckoo filter but do not appear in the k-frequent itemsets.
(iii) If there is a difference, the following steps are performed:
(1) The cuckoo filter is modified by deleting those items (the difference computed in the previous step), so the frequent items become F, and the k-frequent itemsets are removed from memory. k is incremented.
(2) The mapPartitions() function runs on each partition (block) of the RDD. It takes the pruned transactions whose lengths are greater than or equal to k and prunes them again so that they contain only items that exist in the cuckoo filter. They are then kept in memory so that the next iteration can use them faster.
(3) Execution returns to step 1 to start the next iteration.
(iv) If there is no difference, the k-frequent itemsets are removed from memory, k is incremented, and execution returns to step 1 to start the next iteration.
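The sketch below continues the phase-one sketch above and illustrates one way the per-iteration steps could look in PySpark; the variable names, the re-broadcast of the shrunken filter, and the helper logic are assumptions, not the authors' code.

```python
# Sketch of phase two (continuing the phase-one sketch); illustrative only.
from itertools import combinations
import mmh3                                               # murmurhash3, as mentioned in the text

def prune(items, cf):
    # Keep only items present in the broadcast cuckoo filter.
    return [i for i in items if cf.contains(i)]

k = 2
pruned = (transactions
          .map(lambda t: prune(t, cf_broadcast.value))
          .filter(lambda t: len(t) >= k))
pruned.persist()
filter_items = set(frequent_1.keys().collect())           # items currently in the filter

while True:
    # (1) Generate k-item combinations from transactions that are long enough.
    combos = pruned.mapPartitions(
        lambda part: (tuple(sorted(c))
                      for t in part if len(t) >= k
                      for c in combinations(t, k)))
    # (2) Count each combination and keep those meeting the minimum support count;
    #     mmh3 on the first component keeps related keys in the same partition.
    freq_k = (combos.map(lambda c: (c, 1))
                    .reduceByKey(lambda a, b: a + b,
                                 partitionFunc=lambda key: mmh3.hash(key[0]))
                    .filter(lambda kv: kv[1] >= min_sup))
    freq_k.persist()
    if freq_k.count() <= 1:
        break
    # (3) Remove items that no longer occur in any k-frequent itemset, then iterate.
    survivors = set(freq_k.flatMap(lambda kv: kv[0]).distinct().collect())
    dropped = filter_items - survivors
    k += 1
    if dropped:
        for item in dropped:
            cf_broadcast.value.delete(item)                # O(1) deletions in the cuckoo filter
        cf_broadcast = sc.broadcast(cf_broadcast.value)    # re-share the shrunken filter
        filter_items = survivors
        pruned = (pruned.filter(lambda t: len(t) >= k)
                        .map(lambda t: prune(t, cf_broadcast.value)))
        pruned.persist()
    else:
        pruned = pruned.filter(lambda t: len(t) >= k)
```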

In this phase, the use of the cuckoo filter structure removes the candidate generation step of the Apriori algorithm, thereby reducing computational complexity and avoiding costly comparisons.
3.3. ASCF Illustration
For illustration, assume there are three data partitions and the minimum support is 3. After the first phase, there are four frequent items (A, B, D, and F) with support values greater than or equal to the prescribed minimum support.
They are kept in the cuckoo filter. In the first iteration of the second phase (k = 2), for each partition, each transaction whose length ≥2 is pruned using the cuckoo filter; then, combinations of 2 items are generated from transactions whose length ≥2 after pruning. Each combination is converted to the form (combination of 2 items, 1). This processing is performed for all the partitions in parallel, and partition 2 is taken as a sample as shown in Figure 6(a).

After that, the reduceByKey() function is applied on all the partitions to calculate the frequency of each combination, and the filter() function removes all combinations that have frequency <3, as shown in Figure 6(b). The output is 2-frequent itemsets.
As shown in Figure 6(b), the number of 2-frequent itemsets is greater than 1, so we continue by incrementing k. The cuckoo filter is modified to store only the items that exist in the 2-frequent itemsets, as shown in Figure 6(c). In the second iteration (k = 3), the transactions in each partition whose length ≥3 are pruned using the cuckoo filter.
For partition 2, the result is one transaction with length = 3, as shown in Figure 6(d). If a partition contains a transaction whose length, before or after pruning with the cuckoo filter, is less than 3, then no combinations of 3 items can be generated from it, and it is removed.
After that, the reduceByKey() function is applied on all the partitions to calculate the frequency of each combination, and the filter() function removes all combinations of 3 items that have frequency <3 as shown in Figure 6(e).
4. Results and Discussion
In this section, the ASCF performance is evaluated. ASCF is compared with three algorithms based on Spark which are HFIM [30], YAFIM [27], and EAFIM [29]. ASCF runs on a Spark cluster of 4 nodes. Each node is allocated 6 GB of RAM and 3 CPU cores. Each node has Hadoop version 2.6.0, Spark version 2, and Python version 3. Both the input dataset and output frequent itemsets are stored on HDFS, and the Python-cuckoo package is installed.
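For reproducibility, a configuration along the following lines would match the per-node resources listed above; the master URL and application name are placeholders, not values from the paper.

```python
# Illustrative SparkConf matching the per-node resources described in the text.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("ascf")
        .setMaster("spark://master:7077")                 # placeholder standalone master URL
        .set("spark.executor.memory", "6g")               # 6 GB of RAM per node
        .set("spark.executor.cores", "3"))                # 3 CPU cores per node
sc = SparkContext(conf=conf)
```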
4.1. Datasets
ASCF runs on three datasets. The first is called T10I4D100K [45], and it was generated by IBM’s data generator. The second dataset is the retail dataset [46] (retail market basket data), which contains various transactions carried out by customers in a shopping mall. The third is chess [47] (a dataset of chess end-game positions for king versus king and rook). Table 2 lists important statistics about the three datasets.
4.2. Performance Analysis
The Apriori algorithm generates a huge candidate set in every iteration of the second phase and compares each candidate with every transaction record to produce the k-frequent itemsets. This task is the most time- and space-consuming, especially when the dataset is large. ASCF runs on the three listed datasets on a cluster of 4 nodes. Its running time is compared with HFIM, YAFIM, and EAFIM, which are also variants of the Apriori algorithm. HFIM and YAFIM run on clusters of 5 nodes, each allocated 6 GB of RAM and 3 CPU cores. EAFIM runs on clusters of 5 nodes, each allocated 16 GB of RAM and 4 CPU cores. The running time of the four approaches on the different datasets is evaluated per iteration, where iteration k generates the k-frequent itemsets (the first iteration generates 1-frequent itemsets, the second 2-frequent itemsets, and so on). Figure 7 shows the execution time for each iteration of ASCF, HFIM, and YAFIM. The execution time per iteration on the T10I4D100K dataset with 0.25% minimum support is shown in Figure 7(a).

ASCF makes 9 iterations to generate 8-frequent itemsets. It outperforms HFIM and YAFIM in all iterations. In the experiment with the chess dataset, ASCF takes 9 iterations to generate 8-frequent itemsets with a minimum support of 85%, where it also surpasses HFIM and YAFIM, as shown in Figure 7(b). For the retail dataset, ASCF takes 5 iterations to generate 4-frequent itemsets with a minimum support of 0.75%, and it achieves superior performance, as shown in Figure 7(c). The total execution time of each approach on all the datasets is given in Table 3.
ASCF performance is also tested on a cluster of 3 nodes for all the three datasets with the same values of minimum support as listed above. The total execution time for the T10I4D100K dataset is 72.2 s, 28.8 s for chess, and 11.5 s for retail, so the running time of ASCF decreases when the number of nodes increases. The ASCF running time is compared with EAFIM on the chess dataset with a minimum support of 85%. The total execution time for EAFIM was 70 s. The ASCF outperforms EAFIM in all iterations, as shown in Figure 8.

4.3. Discussion
On analyzing the results, the following general observations are made:
(1) In the first phase, HFIM needs to group TIDs for each item to generate vertical data in the form of (item, TIDs), and the groupByKey() transformation shuffles the data over the network, which takes a lot of time.
(2) HFIM scans the vertical data and uses it to calculate the count for each candidate itemset, which also takes time.
(3) YAFIM stores the candidate itemsets for the next iteration in a hash tree and scans the transactional data, which are shared on each node; this takes a great deal of time and space and makes it inefficient when there are too many candidate combinations.
(4) EAFIM scans the dataset and generates the candidates and their support values. The algorithm updates the input dataset by removing useless items and transactions after finding the frequent itemsets in an iteration, so it incurs extra overhead to load the updated input RDD for the next iteration.
(5) ASCF outperforms HFIM, YAFIM, and EAFIM in both phases on the different datasets by using the cuckoo filter to store the frequent items, utilizing it to prune the transactions, and generating candidates only from the pruned transactions that are at least as long as k, where k is the length of the frequent itemsets to be generated. It therefore reduces the number of transactions and the candidate sets in each iteration.
(6) ASCF reduces computational complexity and avoids costly comparisons by removing the candidate generation step in all iterations, so it outperforms HFIM, YAFIM, and EAFIM in running time.
Comparing ASCF with the work proposed by Rathee et al. [28]:
(i) Rathee et al. [28] used the Bloom filter to improve the performance of the second phase when generating the 2-frequent itemsets. Bloom filters require rebuilding the entire filter to accomplish deletion, and a Bloom filter lookup takes O(h) time, where h is the number of hash functions. In contrast, ASCF uses the cuckoo filter, which supports direct deletion; ASCF also achieves higher lookup performance, taking O(1) time, and uses less space when the target false-positive rate is below 3%.
(ii) Rathee et al. [28] generated (k + 1)-frequent itemsets using a hash tree to store candidate (k + 1)-itemsets, in the same manner as YAFIM [27], and scanned the entire dataset in each iteration to count all occurrences of each candidate itemset. This takes a great deal of time and space and is inefficient when there are too many candidate combinations. ASCF, on the other hand, generates candidates in each iteration only from the transactions whose length ≥ k + 1 after pruning (using the cuckoo filter, updated from the previous k-frequent itemset calculation, to prune the transactions whose lengths are at least k + 1, where k + 1 is the length of the frequent itemsets to be generated in the next iteration), instead of generating all possible candidate itemsets. Thus, ASCF reduces the number and size of the transactions used to generate the candidate sets in each iteration.
4.4. Time Complexity
Let D be a dataset containing t items in total, n transactions, and m items in the largest transaction. In phase one, we need to access each item in every transaction to generate the 1-frequent items; this takes O(n × m) time in the worst case. The frequent items are then stored in the cuckoo filter, requiring at most m insertions, each of which takes O(1) amortized time according to [20, 23]. After phase one, there is a preparation step for phase two: all the transactions are pruned so that only transactions with at least k items (initially k = 2) are kept, and all nonfrequent items are removed from each transaction so that it contains only items present in the cuckoo filter; this requires at most O(n) time. We are then left with T items in total, N transactions, and M items in the largest transaction, where T ≤ t, N ≤ n, and M ≤ m.
Phase 2 is iterative; hence, we calculate the time complexity per iteration. First, the transactions are filtered to keep only those whose lengths are greater than or equal to k, which requires at most O(N) time and reduces the number of transactions to N′ ≤ N. Then, combinations of k items are generated from each transaction; for example, if a transaction contains items a, b, and c, the combinations of 2 items generated are [(a, b), (a, c), (b, c)]. Since a transaction of at most M items yields at most (M choose k) combinations, this takes O(N′ × (M choose k)) time. After that, these combinations are converted into (key, value) pairs to produce the k-frequent itemsets with respect to the minimum support count; supposing the maximum number of combinations of k items per transaction is C, this takes at most O(N′ × C). If the total number of k-frequent itemsets is greater than 1, we check whether there is a difference between the frequent items appearing in the k-frequent itemsets and those in the cuckoo filter; supposing the total number of k-frequent itemsets is F, this needs O(F × k). The cuckoo filter is then modified by deleting the items that do not appear in the k-frequent itemsets; deletion and membership checks in the cuckoo filter take O(1). Then k = k + 1, and all the transactions are prepared for the next iteration by pruning: only transactions of length ≥ k are kept and all nonfrequent items are removed from each transaction, requiring at most O(N′) time, and the next iteration starts. In this case, the time complexity per iteration of phase 2 is O(N) + O(N′ × (M choose k)) + O(N′ × C) + O(F × k) + O(N′).
If there is no difference between the items in the k-frequent itemsets and the items in the cuckoo filter, then k = k + 1 and the next iteration starts directly. In this case, the time complexity per iteration of phase 2 is O(N) + O(N′ × (M choose k)) + O(N′ × C) + O(F × k). The total complexity of ASCF is therefore the sum of phase 1, the preparation step after phase 1, and phase 2.
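For reference, the analysis above can be written compactly as follows (our own restatement, summing the per-iteration cost over the iterations of phase two, with the combination cost written as a binomial coefficient):

```latex
T_{\mathrm{ASCF}} \;=\; \underbrace{O(n\,m)}_{\text{phase one}}
\;+\; \underbrace{O(n)}_{\text{preparation}}
\;+\; \sum_{k \ge 2}\Bigl[\,O(N) + O\bigl(N'\,\tbinom{M}{k}\bigr) + O(N'\,C) + O(F\,k) + O(N')\Bigr]
```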
5. Conclusion and Future Work
In this work, an optimization of the Apriori algorithm using the Spark-based cuckoo filter structure (ASCF) is proposed to improve the performance of the original Apriori algorithm and reduce its computational complexity as the size of the dataset or the number of items grows. It consists of two phases: phase one generates the 1-frequent itemsets, and phase two iteratively generates the k-frequent itemsets. The ASCF contributions can be summarized as follows:
(1) It utilizes the cuckoo filter structure to prune transactions. Each transaction is pruned so that it contains only the frequent items present in the cuckoo filter, reducing the number of items in each transaction.
(2) In each iteration, ASCF reduces the number of transactions by removing all transactions whose length is less than k, where k is the length of the frequent itemsets to be generated. It generates candidate itemsets from each pruned transaction whose length ≥ k instead of generating all possible candidate itemsets in each iteration, which reduces the number of candidate itemsets.
(3) It keeps the k-frequent itemsets and the pruned transactions in memory to be reused in the next iteration when generating the (k + 1)-frequent itemsets, which makes subsequent actions much faster.
(4) It is implemented on the Spark framework, a parallel-distributed environment that provides in-memory processing. These properties make Spark suitable for iterative algorithms, where performance is better than in other environments.
(5) ASCF, implemented on a cluster of 4 nodes, requires only 5.8% of the runtime of HFIM on the retail dataset with a minimum support of 0.75%. It requires only 25.7% of the runtime of HFIM and 38.1% of that of EAFIM on the chess dataset with a minimum support of 85%, and 37.3% of the runtime of HFIM on the T10I4D100K dataset with a minimum support of 0.25%.
The experiments carried out on different datasets show that ASCF succeeds in improving the performance of the Apriori algorithm. The results show that ASCF surpasses HFIM and YAFIM on all the datasets used with different minimum supports. It reduces computational complexity by replacing the candidate generation step with a more optimized selection of candidates in all iterations. ASCF can be used in different applications such as market basket analysis, course selection in e-learning platforms, and stock management. In future work, ASCF will be applied in bioinformatics, as most applications in this field require finding frequently associated items and always deal with a huge number of features. ASCF will also be compared with other association rule mining algorithms such as Buddy Prima and FP-growth. Table 4 lists the abbreviations used throughout the paper.
Data Availability
The data used to support the findings of this study are included within the article and available at http://fimi.ua.ac.be/data/.
Disclosure
This research work has been presented as a thesis at Cairo University, Faculty of Engineering, Computer Engineering Department [48].
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Authors’ Contributions
Bana Ahmad Alrahwan and Mona Farouk contributed to the study conception and design, read and approved the final manuscript, and performed material preparation, data collection, and analysis. The first draft of the manuscript was written by Bana Ahmad Alrahwan and then was revised and modified by Mona Farouk.