Abstract
Ant Colony Optimization (ACO) algorithms have been well studied by the Operations Research community for solving combinatorial optimization problems. A handful of researchers in the Data Science community have successfully implemented various ACO methodologies for rule-based classification. This family of ACO algorithms is referred to as AntMiner algorithms. Due to the flexibility of the framework and the availability of alternative strategies at the modular level, a systematic review of the AntMiner algorithms can benefit the broader community of researchers and practitioners interested in highly interpretable classification techniques. In this paper, we provide a comprehensive review of each module of the AntMiner algorithms. Our motivation is to provide insight into current practices and the scope for future research in the context of rule-based classification. Our discussions address ACO methodologies, rule construction strategies, candidate selection metrics, rule quality evaluation functions, rule pruning strategies, methods to address continuous attributes, parameter selection, and experimental settings. This review also reports a summary of real-life implementations of rule-based classifiers in diverse domains including medical, genetics, portfolio analysis, geographic information system (GIS), human-machine interaction (HMI), autonomous driving, ICT, and quality and reliability engineering. These implementations demonstrate the potential application domains that can benefit from methodological contributions to the rule-based classification technique.
1. Introduction
The rule-based classification method appeals to the data mining community due to the high interpretability of the classifier. While widely used classification techniques such as Support Vector Machines and Artificial Neural Networks are admired for their robustness and accuracy, they are often criticized for their lack of interpretability. In contrast, the decision rules in a rule-based classifier are readily interpretable to humans. Naturally, rule-based classification is popular in application areas where the interpretability of the classifier to domain experts is deemed crucial, such as cancer research, genetics, and financial analytics [1, 2].
A rule-based classifier consists of a set of “IF-THEN” rules obtained by learning statistical regularities in the training data. Each rule of the classifier consists of an antecedent and a consequent. The antecedent contains one or more terms, where each term comprises a variable name, an operator, and a value. When an antecedent contains more than one term, the terms are joined by the “AND” conjunction. The consequent, on the other hand, represents the class label associated with the rule. The key components of a generic classification rule are shown in Figure 1.
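The structure described above can be sketched as a small data type. This is a minimal illustration, not an implementation taken from any AntMiner version: the `Rule` class, its `covers` method, the restriction to the "=" operator, and the attribute names and values are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A term is (attribute, operator, value); only "=" is supported in this sketch.
Term = Tuple[str, str, str]


@dataclass
class Rule:
    antecedent: List[Term]  # terms joined by AND
    consequent: str         # class label predicted by the rule

    def covers(self, instance: Dict[str, str]) -> bool:
        """True when every term of the antecedent holds for the instance."""
        return all(op == "=" and instance.get(attr) == value
                   for attr, op, value in self.antecedent)

    def __str__(self) -> str:
        body = " AND ".join(f"({a} {op} {v})" for a, op, v in self.antecedent)
        return f"IF {body} THEN Class = {self.consequent}"


# Hypothetical attribute names and values, for illustration only.
rule = Rule([("Employment", "=", "Full-time"), ("Credit_Score", "=", "Mid-High")],
            "Not_Default")
applicant = {"Age": "Group_2", "Employment": "Full-time", "Credit_Score": "Mid-High"}
print(rule, "| covers applicant:", rule.covers(applicant))
```

A rule covers an instance only when all of its terms match, so adding terms can only keep or shrink the set of covered instances.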

The process of exploring such rules requires a decision on which attributes and corresponding values to consider for classification. This is aligned with the idea of combinatorial optimization, in the sense that the goal is to find a set of attributes and corresponding instance values connected by conjunctions, which maximizes the objective measures and accuracy. If the dataset under consideration is large, the computational effort for extracting such a set of rules can be considerably expensive. Given this context, using intelligent heuristic search algorithms can be helpful in reducing the computational effort. It is important to note that such algorithms will not guarantee finding the best classifier, but one of the top-performing ones. Thus, it is critical to implement the algorithm with effective strategies.
Ant Colony Optimization (ACO) algorithms are a family of heuristic optimization algorithms inspired by the food foraging behavior of ants in the natural system. ACO algorithms have been evolving over the last few decades in terms of search strategies, parameter settings, and new application areas. Various combinatorial optimization problems in the area of Operations Research including the Traveling Salesman Problem [3], Vehicle Routing Problem [4], Facility Layout Problem [5], and Facility Location Problem [6] have been efficiently solved by ACO algorithms.
A specialized family of ACO algorithms, AntMiner, has successfully been implemented to address the rule-based classification problem. In later years, further extensions of the AntMiner algorithms were suggested. Freitas et al. provided an overview of the AntMiner algorithms available at that time and some critical future directions [7]. While this work was an admirable contribution at the time, its descriptions are brief and do not cover the most recent developments. To the best of our knowledge, their work is the only dedicated review article on this topic. In this article, we aim to extend their work by identifying potential application areas and reporting detailed and updated information on AntMiner algorithms at the modular level. This review covers 83 relevant scientific articles indexed in the Web of Science and/or Google Scholar. We have included articles that (a) provide some methodological contribution to AntMiner algorithms, (b) demonstrate real-life applications of rule-based classifiers, and (c) report developments in the ACO domain that are relevant to AntMiner algorithms and/or provide use cases demonstrating successful implementation of ACO. Note that, in the interest of showing the disciplines that can benefit from future research contributions in this area, the listing of applications of rule-based classifiers is not limited to any particular methodology. When reviewing existing methods, we sought information following the framework shown in Table 1.
This review article is organized as follows. Section 1 provides an introduction to the rule-based classification problem, an overview of the connection between rule-based classification and combinatorial optimization, and the research framework; Section 2 lists some existing application domains for rule-based classification; Section 3 provides a background of the AntMiner algorithm, discusses how concepts in ACO can be transferred to the context of rule-based classification, and provides a high-level description of the AntMiner algorithm; Section 4 provides an extensive survey of existing methods for implementing different modules of AntMiner; Section 5 outlines the most common experimental settings in terms of measurement metrics and validation approaches and the datasets used by earlier researchers for benchmarking purposes; Section 6 discusses the research gaps and recommendations for future research; Section 7 provides some concluding remarks and closes our article.
2. Real-Life Applications of Rule-Based Classification
In this section, a list of real-life applications of rule-based classification is presented. The goal is to identify application areas that may benefit from research involving rule-based classification. Table 2 summarizes the reported applications, regardless of the rule discovery method used, categorized into eight primary domains, i.e., medical, genetics, portfolio analysis, geographic information system (GIS), human-machine interaction (HMI), autonomous driving, ICT, and quality and reliability engineering. It is interesting to notice that most applications of rule-based classification reported in the literature are in the medical domain.
3. Overview of the AntMiner Algorithms
3.1. Background
The AntMiner algorithms are descendants of ACO algorithms that mimic the natural food foraging behavior of ants. In the natural system, when ants seek food, they leave a pheromone trail for the successor ants to follow. Once they find a food source, ants return to the nest, while depositing pheromone. Since the deposited pheromone is exposed to the environment, it continuously evaporates while the ants travel. Understandably, the ants taking a shorter distance would return to the nest sooner and the remaining pheromone level on their path would be higher. While the following ant makes a decision to select a path from a set of alternatives, the decision is dictated by the amount of pheromone deposited on the candidate alternatives. The higher the pheromone level on a path, the higher the chance that it will be selected. However, this process is stochastic and a higher level of deposited pheromone on a path does not guarantee that the path would be selected. It rather means that the path would have a higher chance of getting selected over alternate options. As the process continues, more ants would be inclined to take the shorter path and possibly make the pheromone level on that path even stronger [50, 51].
This idea of progressively finding the shortest path is used in the ACO algorithms. To mimic the natural system, artificial ants are carefully programmed for tour construction, pheromone updating, etc. such that the agents collectively work towards finding the shortest path. The original algorithm developed by Dorigo et al. was applied to solve the Traveling Salesman Problem (TSP) [51]. Soon after the first application, the wider combinatorial optimization community has implemented ACO algorithms and their variants to solve problems in diverse areas including Quadratic Assignment Problem, Job-Shop Scheduling Problem, Vehicle Routing Problem, Graph Coloring Problem, and Network Routing Problem, to name a few [52]. Over the following years, there have been several versions of the algorithm developed by the ACO practitioners incorporating various search strategies, pheromone updating strategies, heuristic function values, and local updating rules [53, 54].
3.2. An Example of Ant Colony Optimization for TSP
We introduce a short demonstrative example of how the ACO algorithm is used to solve the classical TSP. For further details, the readers may refer to Dorigo, Caro, and Gambardella [52]. In the TSP, our interest is to find the shortest route that a salesman can take to cover all cities in a given set. While the salesman travels, he/she may not travel to the same city more than once but will return to the origin city. The available information is the distance between each pair of cities. For demonstration purpose, let us consider a TSP with four cities: A, B, C, and D, with A as the origin city. The distances between the cities are arbitrarily chosen and are illustrated in Figure 2. It is assumed that there is an artificial ant colony consisting of two ants and they probabilistically choose one edge at a time to construct tour solutions. This probability is a function of heuristic value and pheromone value. The heuristic value in this case would be determined by the distances between each pair of cities. On the other hand, the pheromone level will be increased or decreased over the iterations on each edge, depending on how often an edge is used by the ants to construct a tour solution. In the beginning, the pheromone level on each edge is equally distributed. As the population only consists of two ants, two solutions in the first iteration exist (see Figures 2(a) and 2(b)). By examining these two solutions, it is easy to see that the edges A-B and C-D were used by both ants. Hence, the pheromone levels on these paths are reinforced by a predetermined increment value. To avoid saturation of pheromone, the pheromone levels on all edges are reduced by a certain rate at each iteration. Figure 2(c) shows an analogy representing the value of pheromone on each edge by the levels of darkness and thickness of the edges. Darker and thicker lines would mean relatively higher levels of pheromone on corresponding edges. 
The edges that possess higher levels of pheromone will have higher probabilities of getting chosen in the following iterations. A common modification of this strategy is to allow the best ant in each iteration to deposit pheromone rather than allowing every ant to do so. Over the iterations, quality solutions get higher probabilities of getting explored and exploited. Depending on a termination criterion (such as the number of iterations or saturation), the algorithm stops and returns the incumbent solution. While there is no guarantee that such solutions will be the optimal solutions, the literature supports the claim that they often provide solutions with satisfactory quality at a reasonable amount of time.

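[Figure 2: (a) and (b) tour solutions constructed by the two ants in the first iteration; (c) pheromone levels on the edges, represented by the darkness and thickness of the lines.]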
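The four-city example can be sketched in a few lines of code. This is an illustrative sketch only: the distances below are hypothetical (the values in Figure 2 are not reproduced here), and the parameter names `rho` (evaporation rate) and `increment` (fixed pheromone reinforcement), the choice of inverse distance as the heuristic value, and the random seed are assumptions.

```python
import random

# Hypothetical symmetric distances; these are NOT the values shown in Figure 2.
DIST = {("A", "B"): 2.0, ("A", "C"): 9.0, ("A", "D"): 10.0,
        ("B", "C"): 6.0, ("B", "D"): 4.0, ("C", "D"): 3.0}
CITIES = ["A", "B", "C", "D"]


def edge(u, v):
    """Canonical key for an undirected edge."""
    return (u, v) if u < v else (v, u)


def tour_edges(tour):
    """Edges of the closed tour, including the return to the origin."""
    return [edge(tour[i], tour[(i + 1) % len(tour)]) for i in range(len(tour))]


def tour_length(tour):
    return sum(DIST[e] for e in tour_edges(tour))


def build_tour(pheromone, rng, origin="A"):
    """One ant picks each next city with probability proportional to
    pheromone * heuristic, where the heuristic is 1 / distance."""
    tour, unvisited = [origin], [c for c in CITIES if c != origin]
    while unvisited:
        here = tour[-1]
        weights = [pheromone[edge(here, c)] / DIST[edge(here, c)] for c in unvisited]
        nxt = rng.choices(unvisited, weights=weights, k=1)[0]
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour


def aco_tsp(n_ants=2, n_iterations=50, rho=0.1, increment=0.5, seed=0):
    rng = random.Random(seed)
    pheromone = {e: 1.0 for e in DIST}   # equal pheromone on every edge at the start
    best = None
    for _ in range(n_iterations):
        tours = [build_tour(pheromone, rng) for _ in range(n_ants)]
        for e in pheromone:              # evaporation on all edges
            pheromone[e] *= 1.0 - rho
        for tour in tours:               # reinforcement on the edges each ant used
            for e in tour_edges(tour):
                pheromone[e] += increment
        for tour in tours:               # keep the incumbent (best-so-far) solution
            if best is None or tour_length(tour) < tour_length(best):
                best = tour
    return best, tour_length(best)


best_tour, best_len = aco_tsp()
print(best_tour, best_len)
```

An edge used by both ants receives the increment twice in one iteration, which corresponds to the reinforcement of the shared edges described in the example.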
3.3. Bridging ACO and AntMiner
Bennett and Parrado-Hernández methodologically show some interesting interplay between machine learning and optimization in the general sense [55]. In line with the idea, the rule induction process in a rule-based classifier can be modeled as a combinatorial optimization problem. Thus, a customized version of ACO algorithms can be justifiably used for rule discovery. In the context of rule-based classification, each unique attribute-value pair can be considered a virtual node of a graph. However, in this case, a modified constraint will be that the nodes associated with every attribute can appear at maximum once. Also, depending on the model, it may not be required to include all attributes in every rule.
Given this context, Figure 3 shows an illustrative example of how such a rule can be discovered. The figure uses an imaginary dataset resembling a credit default dataset. An ant can construct a rule by visiting a maximum of one value from each attribute. Thus, each node of the graph in this case is identified by a two-dimensional index (labeled by attribute and corresponding value). The particular rule shown as an example path in the figure can be expressed as IF (Age = Group_2) AND (Employment = Full-time) AND (Credit_Score = Mid-High) THEN Class = Not_Default. In the upcoming sections, we will provide details on the mechanisms of discovering such classifiers.

Parpinelli et al. proposed the first strategy to use an ACO-based algorithm and named it AntMiner. The AntMiner algorithm was tested on publicly available medical domain datasets [1]. In the following years, a number of researchers have proposed and analyzed various strategies to improve the performance and expand the capabilities of the algorithm.
In this paper, to minimize ambiguity, when referring to the original version of AntMiner, we will use the name ‘AntMiner’. The phrase ‘AntMiner algorithms’ will refer to the family of all rule-based classification algorithms that are developed based on the ACO algorithms. Finally, the term ‘AntMiner version’ will refer to a particular member of the AntMiner algorithms.
3.4. High-Level Description of AntMiner
The AntMiner algorithms start with a preprocessed full training set provided by the user. The user also needs to input some parameters before the initialization. During the first iteration, the algorithm has coverage-related information for every potential term. However, the ultimate impact each term will have on the rule quality can only be evaluated once a complete rule is constructed. The heuristic values and pheromone values are initialized using a predetermined mechanism. Every ant uses a probability function to progressively select at most one value from each attribute. The order of selection does not affect the rule quality. The best rule constructed by the ants is selected as the iteration best rule. The terms used in the iteration best rule are rewarded by an addition of pheromone. On the other hand, the terms not used in the best rule are penalized by reducing their pheromone level, metaphorically known as pheromone evaporation. As the algorithm progresses, the terms associated with high-quality rules will get a higher pheromone level, leading to a higher probability of selection in the succeeding iterations. If the same rule is suggested over several iterations, the search process is considered to have converged. After a predetermined number of iterations has been performed or the algorithm has converged, the best rule among all the iteration best rules is selected and added to the discovered rules list. The newly added rule thereby becomes one of the rules in the classifier. A pseudocode for this process is shown in Figure 4.

Before moving forward, the instances covered by the latest discovered rule are removed from the training set. This means the previous heuristic and pheromone-related information will no longer be useful as the number of instances in the training data has changed. Thus, all the ACO parameters are reinitialized, and the algorithm is executed to discover the next rule. Once a new rule is discovered and added to the discovered rules list, the training data is further reduced by removing the covered instances. This process continues until the number of instances in the training data is equal to or less than a user-selected parameter, called maximum uncovered case. Table 3 gives some useful definitions of parameters commonly used in AntMiner algorithms.
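The outer loop described above (discover a rule, remove the covered instances, repeat until few instances remain) can be sketched as follows. This is a simplified illustration and not the pseudocode of Figure 4: `discover_rule` is a hypothetical stand-in that greedily picks a single pure, high-coverage term, whereas an actual AntMiner version would run the ant-based search with fresh pheromone and heuristic values at that point. The toy dataset is also invented for the example.

```python
from collections import Counter


def covers(antecedent, instance):
    """antecedent maps attribute -> required value (terms joined by AND)."""
    return all(instance[a] == v for a, v in antecedent.items())


def discover_rule(training, class_key):
    """Stand-in for the ACO search of one rule: returns the single term whose
    covered cases are purest (ties broken by coverage)."""
    best = None
    for inst in training:
        for attr, val in inst.items():
            if attr == class_key:
                continue
            covered = [t for t in training if t[attr] == val]
            label, count = Counter(t[class_key] for t in covered).most_common(1)[0]
            score = (count / len(covered), len(covered))
            if best is None or score > best[0]:
                best = (score, {attr: val}, label)
    return best[1], best[2]


def sequential_covering(training, class_key="class", max_uncovered_cases=1):
    remaining = list(training)
    discovered_rules = []
    while len(remaining) > max_uncovered_cases:
        # ACO parameters would be reinitialized here, since the data changed.
        antecedent, label = discover_rule(remaining, class_key)
        discovered_rules.append((antecedent, label))
        remaining = [t for t in remaining if not covers(antecedent, t)]
    return discovered_rules


# Hypothetical toy data, for illustration only.
data = [
    {"outlook": "sunny", "windy": "no", "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "play"},
    {"outlook": "rain", "windy": "yes", "class": "stay"},
    {"outlook": "rain", "windy": "no", "class": "stay"},
]
rules = sequential_covering(data)
print(rules)
```

Because every discovered rule covers at least one remaining instance, the training set shrinks on each pass and the loop terminates.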
4. AntMiner Algorithms–Framework and Modules
4.1. ACO Methodologies Implemented in AntMiner Algorithms
Given the modular features available to customize the well-known ACO algorithms, in some cases, it is difficult to say all the features of certain AntMiner versions are solely inherited from one version of the ACO algorithm. However, in this section, we would refer to an ACO algorithm as the base of an AntMiner version, which shares the most resemblance in terms of implementation.
The original AntMiner algorithm was implemented based on the concept of the Ant System (AS) algorithm. Although in the AS algorithm pheromone is updated once all ants in the population have constructed their respective complete solutions, in the AntMiner algorithm pheromone is updated each time an ant constructs a solution. This enables the immediate next ant to use the updated pheromone information. The authors have referred to this process as having a population of a single ant. While they have discussed the idea of having a population of more than a single ant, it was not demonstrated in the AntMiner algorithm [56]. The algorithm proposed by Liu et al. is also based on the AS algorithm [57].
Liu et al., on the other hand, implemented an interesting concept of controlling the balance between the magnitude of exploration and exploitation to be used [58]. This idea was originally proposed in the Ant-Q family of algorithms for TSP [59]. The authors claimed that the Ant-Q algorithm strengthened the connection between reinforcement learning and AS. A similar idea was incorporated into the ACS algorithms [3, 60]. In short, a parameter is used to control whether a term will be selected based on AntMiner’s probability function or just random selection. As apparent from the definition, the probability function uses information from the previously discovered rules by the means of pheromone value (encouraging exploitation) but the random selection is independent of any influence from previous information (encouraging exploration). In a later work, Liu et al. provided a theoretical demonstration of how their earlier work can provide more diversity in the search process compared to the original AntMiner [61].
The AntMiner+ algorithm implements the MAX-MIN ACO algorithm for classification. The main idea is to constrain the minimum and maximum level of pheromone on the discovered path as the ants continue to construct solutions. AntMiner+ also introduced two dynamic parameters in the probability function itself, to control the weight of heuristic value and pheromone level. This provides a mechanism to select the weight of the exploration and exploitation operators as part of the search process [2, 62].
While the remaining AntMiner versions have some other forms of contributions, the ACO algorithms they used are limited to the abovementioned methodologies, in principle.
4.2. Rule Construction Strategies
The task of rule construction for a rule-based classifier involves traversing through the attributes and values to construct the antecedent and select an appropriate class label. In all of the existing AntMiner algorithms, each ant constructs a rule where exactly one value from each of the attributes is included in the antecedent unless an early termination criterion is applied. Later, using some pruning strategy, the length of the rule is shortened when possible.
In AntMiner, the first ant starts exploring the entire training set. It keeps adding one feasible value from each attribute unless adding that new term decreases the coverage of the rule below a predetermined threshold, min_cases_covered. The following ants construct rules in a similar way until all ants have constructed a rule or a predetermined number of consecutive ants have constructed the same rule. Once all ants finish construction, the best rule in this phase is added to the discovered_rules list. Also, the training instances covered by this rule are removed from the training dataset. The process keeps repeating until the maximum number of uncovered training instances (max_uncovered_cases) is reached. We see a similar rule construction strategy implemented in Liu et al. with some changes as discussed in the following sections [57, 58].
AntMiner+ starts by selecting a class for which rules will be discovered. This provides the advantage of having to calculate heuristic values associated with one class value only. Martens et al. also reformulated a graph that allows ants to deposit pheromone on the edges rather than the vertices. The algorithm uses a population of ants in each ant cycle (iteration) instead of a single ant for constructing rules. The best ant in each ant cycle is allowed to reinforce pheromone while all trails are subject to pheromone evaporation. Only the best rule in each ant cycle is passed on to the pruning procedure. The algorithm also keeps track of the error measure on a validation set to deal with overfitting. If the error measure on the validation set starts increasing, the algorithm stops [2].
Baig, Shahzad, and Khan suggested an approach where the ant selects the class label before the main ACO loop begins. The class is chosen probabilistically, with each class weighted in proportion to its frequency in the uncovered training data. They argued in favor of using well-recognized discretization methods for handling continuous attributes at the preprocessing step. They also suggested a heuristic function that takes into account the correlation between the class label, the last selected term, and the candidate terms [63].
The algorithm by Smaldon and Freitas also starts by assigning a consequent before constructing an antecedent of the rule, however, aiming at discovering an unordered set of rules. They used a new heuristic value function and pheromone updating strategy which could potentially be extended to the ordered rule set as well [64].
Salama et al. proposed μAntMiner, which uses separate pheromone information for each class. While in this algorithm the consequent is selected before the rule is constructed, the ant may construct rules containing different consequents in the same rule discovery stage like AntMiner+. However, it differs in that the information on terms with respect to each consequent label is kept independent of the others [65].
4.3. Transition Probability Functions
In most AntMiner variants, each attribute-value pair (i.e., term) is represented as a node on the problem graph. A probability function defines the probability of a term to be included in the partial rule that is being constructed. Although there are functional similarities between the transition probability functions used for AntMiner algorithms and general ACO algorithms, there are some differences in the interpretations. In general ACO, Pij refers to the transition probability from node i to j via arc ij whereas in most AntMiner versions it represents the probability of selection of node ij representing the value j of attribute i, regardless of the departing node. A weighted random selection process is used to pick a termij, based on the corresponding Pij value [1, 56].
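As a minimal sketch of the selection mechanism described above, the code below assumes that the probability of a term is proportional to the product of its heuristic value and its pheromone level, normalized over the terms whose attribute is not yet used in the partial rule. The function names, the dictionary layout keyed by (attribute, value), and the numbers are illustrative assumptions.

```python
import random


def term_probabilities(heuristic, pheromone, used_attributes):
    """P_ij proportional to eta_ij * tau_ij, normalized over the terms whose
    attribute i is not yet in the partial rule (at least one must remain)."""
    weights = {term: heuristic[term] * pheromone[term]
               for term in heuristic if term[0] not in used_attributes}
    total = sum(weights.values())
    return {term: w / total for term, w in weights.items()}


def select_term(probabilities, rng):
    """Weighted random (roulette wheel) choice of one term."""
    terms = list(probabilities)
    return rng.choices(terms, weights=[probabilities[t] for t in terms], k=1)[0]


# Hypothetical heuristic values and equal initial pheromone.
eta = {("Age", "Group_1"): 0.2, ("Age", "Group_2"): 0.3,
       ("Employment", "Full-time"): 0.5}
tau = {term: 1.0 / len(eta) for term in eta}

p_all = term_probabilities(eta, tau, used_attributes=set())
p_after_age = term_probabilities(eta, tau, used_attributes={"Age"})
picked = select_term(p_all, random.Random(0))
print(p_all, p_after_age, picked)
```

Note that the probability depends only on the candidate term itself, not on the previously selected term, which is the difference from arc-based ACO highlighted above.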
Another transition probability function is used by Liu et al., which is similar to the strategy in Ant-Q and ACS [58]. This function provides an explicit mechanism to keep the exploration process active, regardless of the level of pheromone available on the path. A random number q ∈ [0, 1] is generated and checked against a threshold value (q0). If q is less than the threshold, the term is chosen probabilistically using the conventional probability function shown in equation (1). Otherwise, the term with the maximum Pij value is selected deterministically. In the following equation, S indicates a weighted probabilistic choice of termij using equation (1) for Pij.
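The threshold rule can be sketched as below. The sketch follows the description in this section literally (probabilistic choice when q is below q0, otherwise the term with the largest Pij); the function name `choose_term` and the dictionary of precomputed probabilities are assumptions made for illustration.

```python
import random


def choose_term(probabilities, q0, rng):
    """Pick a term from {term: P_ij} using a threshold q0 in [0, 1]."""
    q = rng.random()                       # q drawn uniformly from [0, 1)
    if q < q0:
        # S: weighted probabilistic choice using the P_ij values
        terms = list(probabilities)
        return rng.choices(terms, weights=[probabilities[t] for t in terms], k=1)[0]
    # otherwise: deterministic choice of the term with the maximum P_ij
    return max(probabilities, key=probabilities.get)


rng = random.Random(1)
always_greedy = choose_term({"term_a": 0.7, "term_b": 0.3}, q0=0.0, rng=rng)
always_sampled = choose_term({"term_a": 0.0, "term_b": 1.0}, q0=1.0, rng=rng)
print(always_greedy, always_sampled)
```

Setting q0 to 0 makes every choice deterministic, while q0 equal to 1 reduces the rule to the conventional probability function.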
Martens et al. redefined the problem graph into a version that is more in line with the conventional ACO models [2], where the pheromone is deposited on the arcs between a pair of vertices. The probability of selecting the edge leading to a vertex is given by the following equation. Also, in this case, the normalization takes place over the values in the next available variable only. This is sufficient because in AntMiner+ the order of variables for an ant to traverse is predetermined. They also included weights (α,β) on the pheromone level and heuristic value, providing a direct way to control the weight of exploration and exploitation in the probability function itself.
4.4. Heuristic Value Functions
There are two major approaches for evaluating heuristic values reported in the literature. The first one is an information-theoretic measure of the entropy of a discovered rule, based on the information theory [66]. The original AntMiner algorithm used this method to compute the heuristic value of a discovered rule [56].
Here, P(w | Ai = Vij) represents the conditional probability of class label w given that the term Ai = Vij is selected. A higher value of H(W | Ai = Vij) means the classes are more uniformly distributed and selecting Ai = Vij will add less value; in turn, it should have less probability of being included in the current partial rule. The value of H(W | Ai = Vij) varies in the range 0 ≤ H(W | Ai = Vij) ≤ log2 C, where C represents the number of class labels in the class attribute W.
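A short sketch of the entropy computation may help fix the idea. It assumes the standard Shannon entropy over the class distribution of the instances matching a term; the function names, the dictionary-based instance format, and the toy data are illustrative. The unnormalized score log2 C minus the entropy is shown only as one plausible way to turn low entropy into a high heuristic value, and any normalization across terms is omitted.

```python
import math
from collections import Counter


def term_entropy(instances, attribute, value, class_key="class"):
    """H(W | A_i = V_ij) in bits, over the instances where attribute == value."""
    labels = [x[class_key] for x in instances if x[attribute] == value]
    if not labels:
        raise ValueError("term does not occur in the data")
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_score(instances, attribute, value, n_classes, class_key="class"):
    """log2(C) - H: larger when the term separates the classes better
    (an unnormalized score; an assumption of this sketch)."""
    return math.log2(n_classes) - term_entropy(instances, attribute, value, class_key)


# Hypothetical toy data with two class labels.
toy = [
    {"color": "red", "class": "yes"}, {"color": "red", "class": "yes"},
    {"color": "red", "class": "no"}, {"color": "red", "class": "no"},
    {"color": "blue", "class": "yes"}, {"color": "blue", "class": "yes"},
]
h_red = term_entropy(toy, "color", "red")    # classes evenly split
h_blue = term_entropy(toy, "color", "blue")  # a single class
print(h_red, h_blue)
```

With two classes, the evenly split term reaches the upper bound log2 C = 1 bit, while the pure term has zero entropy.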
The other major approach to measure heuristic value is based on density estimation [57]. While the authors acknowledged that this measure may not be as accurate as the information-theoretic measure, they claimed this compromise is not large and can be potentially compensated by the pheromone updating strategy. Considering the reduced computational effort in the density-based method, the most recent works in this area are inclined to use this method [2, 58]. The mathematical expression for measuring density-based heuristic value is shown in
The heuristic value measure used in AntMiner+ is very similar to the above. However, since the class is selected before the rule is constructed, the Majority_class(Tij) is replaced by the class selected by the ant class_ant [2].
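The density-based idea can be sketched as a simple ratio. The code assumes the heuristic is the share of instances covered by the term that belong to the majority class (or to the class committed by the ant, when one is given); the function name, the `committed_class` parameter, and the toy data are illustrative assumptions rather than the notation of the cited works.

```python
from collections import Counter


def density_heuristic(instances, attribute, value, class_key="class",
                      committed_class=None):
    """Share of the instances covered by the term that belong to the majority
    class, or to committed_class when the class is fixed in advance."""
    labels = [x[class_key] for x in instances if x[attribute] == value]
    if not labels:
        return 0.0
    counts = Counter(labels)
    hits = counts[committed_class] if committed_class is not None else max(counts.values())
    return hits / len(labels)


# Hypothetical toy data.
toy = [
    {"color": "red", "class": "yes"}, {"color": "red", "class": "yes"},
    {"color": "red", "class": "no"}, {"color": "blue", "class": "no"},
]
eta_majority = density_heuristic(toy, "color", "red")
eta_committed = density_heuristic(toy, "color", "red", committed_class="no")
eta_single = density_heuristic(toy, "color", "blue")
print(eta_majority, eta_committed, eta_single)
```

The last value illustrates a limitation of ratio-based measures: a term covering a single instance obtains the maximum score of 1.0 despite its negligible coverage.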
In the previous two heuristic value functions, we evaluate the metric based on a ratio where there may be potential special cases of having small numbers in both denominator and numerator leading to high heuristic values, whereas there may be a considerably higher coverage with a small error in class prediction. To address this situation, Smaldon et al. proposed the following heuristic value function [64]. The same function was later adopted by Liang et al. [67]. Here, k represents the number of class labels.
Baig et al. reported another heuristic value function that looks into the correlation of the class label and the last selected term to the candidate term. The other part of the function contains the coverage by this triple. This method is expressed by equation (9), where Ti∗j∗ represents the last selected term, Tij represents a candidate term, and Classcommitted represents the committed class [63].
4.5. Pheromone Updating Strategies
The pheromone updating policy of AntMiner algorithms contains two aspects. First, which terms should get the pheromone update? And second, at what rate should the pheromone deposition and/or evaporation take place? Based on the philosophy of the ACO algorithm, the set of terms associated with each constructed rule should be evaluated by means of the relative quality of the rule. The quality of the rule dictates which terms retain higher pheromone levels after an iteration. It should be noted that all terms in the same rule will have the same degree of change in pheromone.
In AntMiner, only the best ant is allowed to deposit pheromone. The increase in pheromone is quantified by the product of Q (see Evaluation of Rule Quality section) and the current pheromone level. As there is only one ant in each ant cycle, the ant deposits pheromone to the terms used in the rule constructed. There is no direct evaporation factor considered in this approach. However, after pheromone deposition, the pheromone values are normalized over all terms. This passively reduces the pheromone level on the unused terms. During initialization, all of the terms are assigned with the same amount of pheromone [56].
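The update described above can be sketched as follows: the terms of the constructed rule gain an amount equal to their current pheromone multiplied by Q, and all values are then normalized, which lowers the level of unused terms. The function names and the choice of an equal share of 1 divided by the number of terms at initialization are assumptions of this sketch.

```python
def initialize_pheromone(terms):
    """Every term starts with the same amount of pheromone."""
    return {t: 1.0 / len(terms) for t in terms}


def update_pheromone(pheromone, rule_terms, quality):
    """Add tau * Q to the terms used in the rule, then normalize over all
    terms so that unused terms lose pheromone passively."""
    updated = {t: tau + tau * quality if t in rule_terms else tau
               for t, tau in pheromone.items()}
    total = sum(updated.values())
    return {t: tau / total for t, tau in updated.items()}


tau0 = initialize_pheromone(["t1", "t2", "t3", "t4"])
tau1 = update_pheromone(tau0, rule_terms={"t1"}, quality=1.0)
print(tau1)
```

In this example the used term rises from 0.25 to 0.4 while each unused term drops from 0.25 to 0.2, even though no explicit evaporation factor is applied.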
In Liu et al.’s work, an exclusive pheromone evaporation factor ρ is introduced. Also, the use of Q is transformed in the pheromone updating function [58]. While the purpose of using ρ is easily understood from our previous discussion on ACO methodologies, the authors did not explicitly describe the motivation of using the transformed function of Q. Similar to AntMiner, the pheromone level for unused terms is updated by normalization; however, the pheromone level for terms used in the constructed rule is updated using
The AntMiner+ algorithm initializes the pheromone values for each term with the maximum allowed pheromone value τmax [2]. In subsequent ant cycles, pheromone levels on all trails are subject to reduction due to evaporation. However, the pheromone on the best ant’s path is reinforced (see (11)). The definition of coverage and confidence is provided in the Evaluation of Rule Quality section.
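A generic MAX-MIN style update can be sketched as below. The sketch only illustrates the three ingredients named above (evaporation on all trails, reinforcement of the best ant's path, and bounds on the pheromone level). The evaporation convention `(1 - rho) * tau`, the `deposit` argument standing in for the quality-based amount, and all numeric values are assumptions and are not taken from the AntMiner+ equations.

```python
def max_min_update(pheromone, best_path, rho, deposit, tau_min, tau_max):
    """Evaporate every trail, reinforce the best ant's trail, and keep each
    value inside [tau_min, tau_max]."""
    updated = {}
    for trail, tau in pheromone.items():
        tau = (1.0 - rho) * tau              # evaporation on all trails
        if trail in best_path:
            tau += deposit                   # reinforcement of the best path
        updated[trail] = min(max(tau, tau_min), tau_max)
    return updated


tau_max = 1.0
trails = {"e1": tau_max, "e2": tau_max}      # initialized at the maximum level
after = max_min_update(trails, best_path={"e1"}, rho=0.1, deposit=0.5,
                       tau_min=0.95, tau_max=tau_max)
print(after)
```

The lower bound keeps rarely used trails selectable, and the upper bound prevents a single path from dominating the search too early.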
Smaldon and Freitas used a preference operator to select the ants to deposit pheromone. The argument is to allow only the ants that constructed a rule that meets some acceptable threshold to deposit pheromone. The threshold is determined by the following function. If the Laplace corrected confidence is greater than the threshold, Q amount of pheromone is deposited (see equations (14) and (15)). Otherwise, no pheromone is added [64]. The function representing transitioning pheromone is the same as in AntMiner.
4.6. Evaluation of Rule Quality
The traditional method for evaluating the quality of a rule is to use the following metric, which is the product of sensitivity and specificity [56, 58]:

$$Q = \frac{TP}{TP + FN} \times \frac{TN}{FP + TN}$$

where TP = |cases covered by the rule AND class = predicted class|, FN = |cases NOT covered by the rule AND class = predicted class|, TN = |cases NOT covered by the rule AND class ≠ predicted class|, and FP = |cases covered by the rule AND class ≠ predicted class|.
Table 4 provides the definition of TP, FN, TN, and FP in a matrix form.
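The product of sensitivity and specificity can be computed directly from the four counts. In the sketch below, the function names, the dictionary-based rule and instance formats, and the convention of returning 0 when a denominator is empty are assumptions made for illustration.

```python
def confusion_counts(instances, antecedent, predicted, class_key="class"):
    """Return (TP, FP, FN, TN) for a rule given as {attribute: value}."""
    tp = fp = fn = tn = 0
    for x in instances:
        covered = all(x[a] == v for a, v in antecedent.items())
        positive = x[class_key] == predicted
        if covered and positive:
            tp += 1
        elif covered:
            fp += 1
        elif positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rule_quality(tp, fp, fn, tn):
    """Q = sensitivity * specificity; a zero denominator yields 0 (assumption)."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity * specificity


# Hypothetical toy data.
toy = [
    {"color": "red", "class": "yes"}, {"color": "red", "class": "yes"},
    {"color": "red", "class": "no"}, {"color": "blue", "class": "yes"},
    {"color": "blue", "class": "no"}, {"color": "blue", "class": "no"},
]
counts = confusion_counts(toy, {"color": "red"}, "yes")
q = rule_quality(*counts)
print(counts, q)
```

For the toy rule, sensitivity and specificity are both 2/3, so Q is 4/9; a rule that covers all positive cases and no negative ones reaches the maximum of 1.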
Two other measures of rule quality, as used in AntMiner+, are coverage and confidence. In the context of a rule, confidence refers to the ratio of correctly classified instances over total instances covered by the rule; and coverage refers to the ratio of total instances covered by the rule over the total number of instances in the training data. Note that the training set size dynamically shrinks as new rules are discovered. These relations are defined in equations (17)–(20).
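Following the verbal definitions above, confidence and coverage can be written in terms of the counts TP and FP and the current training set size. The function names and the zero-denominator convention are assumptions of this sketch.

```python
def confidence(tp, fp):
    """Correctly classified covered instances over all covered instances."""
    covered = tp + fp
    return tp / covered if covered else 0.0


def coverage(tp, fp, n_training):
    """Covered instances over the current number of training instances."""
    return (tp + fp) / n_training if n_training else 0.0


conf_value = confidence(tp=3, fp=1)
cov_value = coverage(tp=3, fp=1, n_training=10)
print(conf_value, cov_value)
```

Because covered instances are removed after each discovered rule, `n_training` should be the size of the remaining training set rather than the original one.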
The AntMinermbc proposed a different rule quality evaluation function using the same parameters as shown below [67].
Further, Salama and Abdelbar conducted a study on various rule quality measures in the context of the μAntMiner algorithm. Their results suggest that using the Kappa function provides a better balance between average size and average accuracy (readers are referred to [68] for further information about this metric).
4.7. Rule Pruning Strategies
The rule pruning strategy in AntMiner takes place after each rule is constructed. It involves iteratively removing one term at a time from the rule and evaluating the resulting improvement. The term whose removal results in the most improvement is removed from the rule. This process continues until no removal results in improvement or only one term is left in the rule. Note that, in AntMiner, the class label may be reassigned every time a term is removed from the rule [1, 56].
The rule pruning in AntMiner+ is similar, except that it uses a different metric, confidence, to evaluate the improvement of rules. Also, only the best rule from each ant cycle is allowed to go through the pruning process [2]. Although it uses a single ant population like AntMiner, AntMiner-CC also allows only the best-so-far ant to go through pruning. This means that the pheromone update takes place without accounting for pruning.
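The greedy procedure described above for the original AntMiner can be sketched in Python as follows. This is a simplified illustration under stated assumptions: rules are dictionaries of attribute-value terms, quality is sensitivity times specificity, and the class label is reassigned to the majority class of the covered cases after each candidate removal. All names and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Case = Tuple[Dict[str, str], str]  # (attribute values, class label)


def covers(antecedent: Dict[str, str], attributes: Dict[str, str]) -> bool:
    return all(attributes.get(a) == v for a, v in antecedent.items())


def evaluate(antecedent: Dict[str, str], cases: List[Case]) -> Tuple[float, str]:
    """Assign the majority class of covered cases; return (quality, class)."""
    covered = [label for attrs, label in cases if covers(antecedent, attrs)]
    if not covered:
        return 0.0, ""
    predicted = Counter(covered).most_common(1)[0][0]
    tp = sum(1 for label in covered if label == predicted)
    fp = len(covered) - tp
    positives = sum(1 for _, label in cases if label == predicted)
    fn = positives - tp
    tn = len(cases) - positives - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity * specificity, predicted


def prune(antecedent: Dict[str, str], cases: List[Case]):
    """Greedily drop the term whose removal improves quality the most."""
    terms = dict(antecedent)
    best_quality, best_class = evaluate(terms, cases)
    while len(terms) > 1:
        candidates = []
        for attr in terms:
            reduced = {a: v for a, v in terms.items() if a != attr}
            quality, predicted = evaluate(reduced, cases)
            candidates.append((quality, attr, predicted))
        quality, attr, predicted = max(candidates, key=lambda c: c[0])
        if quality <= best_quality:  # no removal improves the rule
            break
        del terms[attr]
        best_quality, best_class = quality, predicted
    return terms, best_class, best_quality


# Hypothetical toy data in which the 'windy' term is irrelevant.
toy_cases: List[Case] = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rain", "windy": "no"}, "stay"),
    ({"outlook": "rain", "windy": "yes"}, "stay"),
]
pruned, label, quality = prune({"outlook": "sunny", "windy": "no"}, toy_cases)
```

Each pass evaluates every remaining term once, and each evaluation scans the training cases, which gives an intuition for why pruning becomes costly for long rules.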
Smaldon and Freitas used a pruning method where the class label is not changed during the pruning process and argued this reduces some computational effort as rule quality is to be evaluated for the selected class only [64].
In a later work, Chan and Freitas criticized the above methods for being computationally expensive and identified the pruning stage as a bottleneck of the algorithm. The algorithm would reportedly perform poorly for datasets with a larger number of attributes [69]. To tackle this problem, the authors proposed a “Faster Rule Pruning Procedure” inspired by [70]. In essence, the original AntMiner rule pruning operator is still used to reduce the length of a rule, but only on a stochastically reduced number of terms in the rule. The user is required to select the maximum number of terms (r) to be passed on to the original pruning operator. If the length of the currently constructed rule is greater than r, the algorithm reduces it to r terms using roulette wheel selection, and the rule pruning process of AntMiner is executed on the reduced rule containing the selected terms only. The probability of selecting a term for the reduced rule is weighted according to the information gain achieved by that term. It is important to note that the information gain of each term is precalculated by another procedure in AntMiner (see equation (5)), so the pruning process does not recompute it. If the current rule contains fewer than r terms, the rule is passed directly to the rule pruning process of AntMiner.
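The stochastic reduction step can be sketched as follows. This is an illustration based only on the description above, not the authors' code: it assumes that terms are drawn without replacement, that the information gain values are supplied in a precomputed dictionary, and that a uniform draw is used if all remaining weights are zero. The names reduce_rule and info_gain are hypothetical.

```python
import random
from typing import Dict, List


def reduce_rule(terms: List[str], info_gain: Dict[str, float], r: int,
                rng: random.Random) -> List[str]:
    """Keep at most r terms, chosen by roulette wheel on information gain."""
    if len(terms) <= r:
        return list(terms)  # short rules go to the pruning operator unchanged
    remaining = list(terms)
    selected = set()
    while len(selected) < r:
        weights = [info_gain[t] for t in remaining]
        if sum(weights) <= 0:
            pick = rng.choice(remaining)  # fallback: uniform draw
        else:
            pick = rng.choices(remaining, weights=weights, k=1)[0]
        remaining.remove(pick)  # sampling without replacement (assumption)
        selected.add(pick)
    # Preserve the order in which the terms appear in the original rule.
    return [t for t in terms if t in selected]


# Hypothetical rule terms and precomputed information gain values.
rule_terms = ["outlook=sunny", "windy=no", "humidity=high", "temp=mild"]
gains = {"outlook=sunny": 0.25, "windy=no": 0.05,
         "humidity=high": 0.15, "temp=mild": 0.03}
reduced = reduce_rule(rule_terms, gains, r=2, rng=random.Random(0))
```

The reduced rule would then be handed to the conventional pruning operator, so the cost of pruning is bounded by r rather than by the length of the constructed rule.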
In cAntMiner, the use of entropy-based discretization allows for a simpler rule pruning process. The author suggests that, due to the nature of rule construction, the continuous attributes can be removed in the reverse order they were added (see the Handling Continuous Attributes section for more information on the discretization process) [71].
In μcAntMiner, the rule pruning procedure is similar to the original AntMiner except that, for μcAntMiner, the consequent is preselected and does not change due to the pruning [72].
4.8. Handling Continuous Attributes
In Parpinelli et al.’s method, the continuous attributes are discretized at the preprocessing step, where the C4.5 Disc discretization method is used for discretization. First, for each continuous attribute, a pair containing the continuous attribute and the class attribute data is fed to C4.5. Based on the output decision tree the continuous values of the attribute are replaced with categorical labels [1, 56].
Swaminathan used the Mixed Normal Kernel approach to handle continuous attributes. Although this approach still uses the C4.5 Disc discretization method for generating intervals, the intervals do not replace the original numeric values. After the intervals are obtained, the mean and standard deviation of each interval are calculated. These values are used to generate a Gaussian distribution for each interval. A multimodal mixed kernel function representing a probability distribution function is then generated by adding the kernel distributions corresponding to the intervals. The area within each interval of the mixed kernel represents the pheromone value for that interval. When an interval is picked by an ant in a rule, a new kernel is added to the mixed kernel with the mean and standard deviation of the selected range. This represents an increase in the pheromone level for the selected term [73].
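The idea of reading pheromone values off a sum of Gaussian kernels can be sketched as follows. This is a rough illustration based only on the description above; the exact kernel construction and any normalization used in [73] may differ. The interval boundaries, means, and standard deviations are invented values.

```python
import math
from typing import List, Tuple


def normal_cdf(x: float, mean: float, std: float) -> float:
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def interval_pheromone(intervals: List[Tuple[float, float]],
                       kernels: List[Tuple[float, float]]) -> List[float]:
    """Area of the summed kernels inside each (low, high) interval."""
    return [
        sum(normal_cdf(high, m, s) - normal_cdf(low, m, s) for m, s in kernels)
        for low, high in intervals
    ]


# Hypothetical intervals with one (mean, std) kernel per interval.
intervals = [(0.0, 10.0), (10.0, 20.0)]
kernels = [(5.0, 2.0), (15.0, 2.0)]
before = interval_pheromone(intervals, kernels)

# An ant picks the first interval: add another kernel with its mean and std.
kernels.append((5.0, 2.0))
after = interval_pheromone(intervals, kernels)
```

Adding a kernel centered on the selected interval raises the area inside that interval far more than inside its neighbor, which mirrors the reinforcement effect described in the text.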
In AntMiner+ the ants construct attribute intervals on the fly. The problem graph is first modified to accommodate for handling ordinal attributes. For each ordinal attribute, two vertices are considered. Each vertex holds a value from the ordinal attribute and the range between them is considered an interval. The algorithm forbids the second vertex to become the same or less than the first vertex in each attribute. For consistency, the categorical attributes also contain two vertices, where one is a dummy containing no data. The author suggests some discretization and/or chi-square-based approach to address data with high dimensionality [2].
The cAntMiner is another strategy to handle continuous attributes on the fly. The algorithm aims to find the best split in the domain of the continuous attribute. The best split value is calculated based on an entropy measure, as given in equation (22). Only the threshold values ($v$) from an attribute $a_i$ that form boundaries between classes are evaluated. The definition of boundary values is given in [74]. Once a split is achieved, the half that gives less entropy is selected as a term [71]. It is important to note that the calculations for the threshold value involve only the examples covered by the current partial solution. For continuous attributes, the pheromone can no longer be updated in the conventional manner. The pheromone is added to the attribute vertex ($T_i$) instead of a particular attribute-value pair. In equation (22), $|S_{a_i < v}|$ represents the number of cases where $a_i < v$, $|S_{a_i \geq v}|$ represents the number of cases where $a_i \geq v$, and $|S|$ represents the total number of cases. The best threshold value is the one that minimizes the entropy of the partition. After the best threshold is selected, the entropy of the term is given by equation (23).
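The threshold search can be illustrated with the following Python sketch. It is a simplified reading of the description above rather than the cAntMiner implementation: candidate thresholds are taken as midpoints between adjacent sorted values whose class labels differ (a simplification of the boundary definition in [74]), each candidate is scored by the size-weighted entropy of the two partitions, and the lower-entropy half is returned as the term. The function names and data are hypothetical.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def entropy(labels: List[str]) -> float:
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values: List[float],
                   labels: List[str]) -> Optional[Tuple[float, str, float]]:
    """Return (threshold, operator, term entropy), or None if no boundary."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for (x_prev, y_prev), (x, y) in zip(pairs, pairs[1:]):
        if x == x_prev or y == y_prev:
            continue  # simplified boundary test: the class must change
        v = (x_prev + x) / 2.0  # midpoint as candidate threshold (assumption)
        lower = [lab for val, lab in pairs if val < v]
        upper = [lab for val, lab in pairs if val >= v]
        weighted = (len(lower) / n) * entropy(lower) \
            + (len(upper) / n) * entropy(upper)
        if best is None or weighted < best[0]:
            best = (weighted, v, lower, upper)
    if best is None:
        return None
    _, v, lower, upper = best
    # The half with the lower entropy becomes the term.
    if entropy(lower) <= entropy(upper):
        return v, "<", entropy(lower)
    return v, ">=", entropy(upper)


# Hypothetical cases covered by the current partial rule.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "a"]
term = best_threshold(values, labels)
```

In this toy example, two candidate thresholds are evaluated (6.5 and 11.5); the split at 6.5 gives the lower weighted entropy, and the pure lower half yields the term "attribute < 6.5".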
Later on, Otero et al. published two further extensions of cAntMiner [75]. The first method is based on the Minimum Description Length (MDL) approach [74]. In this method, the previously mentioned splitting technique based on the threshold is recursively applied to find multiple intervals. Instead of referring to only one-half partition, the intervals can be more specific by giving a lower bound and an upper bound. The process still uses the threshold selection model from cAntMiner but in the next step, the threshold is passed through another criterion (see equations (23)–(26)) to accept or reject the threshold for an interval. Once a threshold value is selected the discretization is recursively repeated for each partition. In the end, the MDL approach may provide multiple intervals. The interval corresponding to the lowest entropy is selected. In equation (24), c represents the number of different classes contained in the training cases covered by the condition indicated in the subscripts of c.
The other extension in this work suggests that depositing pheromone on the edges instead of the vertices could account for interactions between attributes. Such an approach is also used in AntMiner+. However, in AntMiner+ the order of selection of attributes is predetermined, whereas this version of cAntMiner accounts for the influence of the order in which terms are selected in the context of dynamic intervals. Another suggested modification concerns rule pruning. The authors note that removing an attribute from the rule without regard to the order of construction, as in the rule pruning of AntMiner, may change the context in which the threshold value was selected. They suggest that removing the terms from the rule in the reverse order of construction can provide more useful information. Recently, Helal and Otero suggested a probabilistic approach to discretizing continuous variables on the fly [76].
Taking advantage of the preselected class, in μcAntMiner, the threshold values are calculated in the context of the selected class. The algorithm picks the threshold value providing the maximum quality discrimination, which is a function of the support and confidence provided by each partition with respect to that threshold value [72].
4.9. Classification of Unseen Data
Once a classifier consisting of a set of decision rules is constructed, it can be used to classify new, unseen instances of data. This raises the question of which rule should be used to classify a new instance.
4.9.1. Ordered Rule Set
The most conventional way of constructing a rule-based classifier is to produce an ordered rule set, organized in the order of their discovery. When an unseen instance is to be classified, it is checked against the ordered rule set. The first rule to cover the instance will be used to classify it.
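A first-match classifier over an ordered rule set can be sketched in a few lines of Python. The rule representation and the use of a default class for instances that no rule covers are assumptions made for this illustration.

```python
from typing import Dict, List, Tuple

Rule = Tuple[Dict[str, str], str]  # (antecedent terms, predicted class)


def classify_ordered(rules: List[Rule], default_class: str,
                     instance: Dict[str, str]) -> str:
    """Return the class of the first rule that covers the instance."""
    for antecedent, predicted in rules:
        if all(instance.get(a) == v for a, v in antecedent.items()):
            return predicted
    return default_class  # no rule fired (assumed fallback)


# Hypothetical rule list, in order of discovery.
ordered_rules: List[Rule] = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny"}, "stay"),
]
first = classify_ordered(ordered_rules, "unknown",
                         {"outlook": "sunny", "windy": "no"})
second = classify_ordered(ordered_rules, "unknown",
                          {"outlook": "sunny", "windy": "yes"})
third = classify_ordered(ordered_rules, "unknown",
                         {"outlook": "rain", "windy": "no"})
```

Because the first matching rule wins, the second rule is implicitly read as "outlook is sunny and the first rule did not apply", which is why the order of discovery matters when interpreting such a list.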
4.9.2. Unordered Rule Set
Smaldon and Freitas proposed an AntMiner methodology that uses unordered rule sets for classification. In this case, several situations are possible when classifying an unseen instance. Firstly, if no rule covers the instance, the default class is assigned. Secondly, if only one rule covers the instance, the class label is assigned based on that rule. Thirdly, if multiple rules cover the instance and all of them suggest the same class, the suggested class is assigned. Finally, if multiple rules cover the instance but disagree on the consequent, a selection strategy is required. One of the two selection strategies is to assign the class predicted by the covering rule with the highest quality. The other is to select the class based on the class distribution [64].
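The four situations above can be sketched as follows. Only the highest-quality strategy is shown for resolving conflicts; the class-distribution strategy is omitted. The rule representation, with a quality score stored alongside each rule, is an assumption made for this illustration.

```python
from typing import Dict, List, Tuple

ScoredRule = Tuple[Dict[str, str], str, float]  # (antecedent, class, quality)


def classify_unordered(rules: List[ScoredRule], default_class: str,
                       instance: Dict[str, str]) -> str:
    """Classify with an unordered rule set, resolving conflicts by quality."""
    firing = [
        rule for rule in rules
        if all(instance.get(a) == v for a, v in rule[0].items())
    ]
    if not firing:
        return default_class          # case 1: no rule covers the instance
    classes = {predicted for _, predicted, _ in firing}
    if len(classes) == 1:
        return classes.pop()          # cases 2 and 3: no disagreement
    # Case 4: conflicting consequents, pick the highest-quality rule.
    return max(firing, key=lambda rule: rule[2])[1]


# Hypothetical rules and quality scores.
rules: List[ScoredRule] = [
    ({"outlook": "sunny"}, "play", 0.9),
    ({"windy": "yes"}, "stay", 0.6),
    ({"humidity": "high"}, "stay", 0.7),
]
conflict = classify_unordered(
    rules, "unknown",
    {"outlook": "sunny", "windy": "yes", "humidity": "high"})
```

In the conflicting example, two rules vote for "stay" and one for "play", yet "play" is returned because the decision follows the single best rule rather than a majority.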
4.9.3. Voting
Due to the stochastic nature of the AntMiner algorithms, the classifiers generated on the same training data at different times are likely to be different. To tackle this instability Liang et al. suggested AntMinermbc which uses multiple rule sets in the classifier such that each rule set can complement the others when classifying unseen data [67]. Each of the rule sets is trained on a different subset of the original training data. While classifying unseen data, the class label is determined based on voting from each rule set. They also suggested a new heuristic function which was discussed in the heuristic value function section.
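Voting across several rule sets can be sketched as follows. This illustration assumes that each rule set is an ordered list with its own default class and that ties are broken in favor of the class that received its first vote earliest; neither detail is taken from [67].

```python
from collections import Counter
from typing import Dict, List, Tuple

Rule = Tuple[Dict[str, str], str]    # (antecedent terms, predicted class)
RuleSet = Tuple[List[Rule], str]     # (ordered rules, default class)


def classify_first_match(rule_set: RuleSet, instance: Dict[str, str]) -> str:
    rules, default_class = rule_set
    for antecedent, predicted in rules:
        if all(instance.get(a) == v for a, v in antecedent.items()):
            return predicted
    return default_class


def classify_by_vote(rule_sets: List[RuleSet], instance: Dict[str, str]) -> str:
    """Each rule set casts one vote; the most frequent class wins."""
    votes = Counter(classify_first_match(rs, instance) for rs in rule_sets)
    return votes.most_common(1)[0][0]


# Three hypothetical rule sets, e.g. trained on different data subsets.
rule_sets: List[RuleSet] = [
    ([({"outlook": "sunny"}, "play")], "stay"),
    ([({"windy": "yes"}, "stay")], "play"),
    ([({"outlook": "sunny"}, "play")], "stay"),
]
sunny_vote = classify_by_vote(rule_sets, {"outlook": "sunny", "windy": "yes"})
rainy_vote = classify_by_vote(rule_sets, {"outlook": "rain", "windy": "no"})
```

A single deviating rule set (the second one here) is outvoted in both examples, which is the stabilizing effect the multiple rule sets are intended to provide.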
5. Experimental Settings
Typically, AntMiner algorithms are tested against their predecessors, decision tree counterparts, and high-accuracy algorithms such as SVM and ANN. In this section, we discuss the performance criteria for algorithm evaluation in the context of rule-based classifiers and provide a list of datasets used in previous implementations.
5.1. Performance Criteria
The two most common criteria for evaluating rule-based classifiers are predictive accuracy and interpretability. Predictive accuracy is commonly reported in terms of mean accuracy and standard deviation. Interpretability is measured using the number of rules in the classifier and the average length of the rules, where the length of a rule is the number of terms it contains. In many implementations, 10-fold cross-validation is used for performance evaluation. Hence, the average and standard deviation of the accuracy, the number of rules, and the number of terms per rule are reported.
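Summarizing per-fold results can be sketched with the Python standard library. The fold values below are invented for illustration and do not come from any cited experiment; the sketch uses the sample standard deviation, whereas individual studies may report the population standard deviation instead.

```python
import statistics
from typing import List, Tuple


def summarize(per_fold: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over cross-validation folds."""
    return statistics.mean(per_fold), statistics.stdev(per_fold)


# Hypothetical results from a 10-fold cross-validation run.
fold_accuracy = [0.90, 0.88, 0.92, 0.91, 0.89, 0.90, 0.93, 0.87, 0.90, 0.90]
fold_rule_count = [7.0, 8.0, 7.0, 6.0, 8.0, 7.0, 7.0, 9.0, 7.0, 6.0]

mean_accuracy, std_accuracy = summarize(fold_accuracy)
mean_rules, std_rules = summarize(fold_rule_count)
```

The same helper would be applied to the number of terms per rule, so that accuracy and both interpretability measures are reported on a common footing.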
5.2. Data Sets
In Table 5, we have listed the datasets used for testing different AntMiner versions for data classification. Most datasets are available in the UCI machine learning repository, except for Web-mining and Uniprot. The authors of this review article neither own nor maintain these datasets. Any data-related questions may be forwarded to the corresponding author of the referenced article. Note that the goal of this list is to provide a map that assists future researchers in benchmarking.
6. Future Research Directions
There are several ACO strategies in the general optimization domain that have not yet been utilized in the AntMiner family, for example, the dynamic balance of diversification and intensification by Yan et al. [54] and the use of ACO for problems with continuous domains by Socha and Dorigo [79]. It would be interesting to see how the latest developments of ACO in the optimization domain perform when adopted for the classification problem.
The existing AntMiner implementations mostly focus on extending Ant System and Max-Min Ant System. As reported in this review, there are other ACO methodologies that are yet to be implemented for classification. For example, one of the oldest ACO techniques, Ant Colony System (ACS), has not yet been implemented for classification. Although Liu et al. used the probability function of ACS, their rule construction strategy is still built on Ant System.
We have limited information with regard to the performance of AntMiner algorithms and their modules. Ali and Shahzad provided a comparative analysis of several AntMiner versions. However, their experiments used the default parameter settings [80]. For a fair comparison, it is imperative to optimize the parameters for all of the variations before conducting such experiments. The algorithmic framework allows easy adaptation of modules of one algorithm version into another. Further analyses of computational complexity and experimental studies involving different ACO modules, such as the probability function, the heuristic, and the pheromone updating strategy, are needed to make conclusive remarks on their performance.
Parameter tuning is a big contributor to the successful implementation of AntMiner algorithms. In the general optimization area, there have been studies that report a suggested range for some parameters. However, the classification problem may benefit from specialized parameter settings for AntMiner algorithms. An extensive study on optimal parameter setting for AntMiner algorithms is yet to be done.
Several scholars identify the rule pruning procedure as the most computationally expensive procedure in the AntMiner algorithms. While some improvements are suggested in terms of performance, all of them come with some trade-off. Further research on rule pruning is needed to find more efficient as well as high-accuracy yielding procedures.
One particular rule pruning procedure, by Chan and Freitas, applies the conventional rule pruning to a randomly selected part of a constructed rule [69]. Their strategy involves a user decision on how many terms are passed on to the rule pruning procedure. Without an informed choice, this may trade accuracy for faster operation. A suggested range of values for r that works well for all rules is yet to be determined. Another potential path for handling this is to find a dynamic strategy for selecting r.
Given the computational intelligence used in AntMiner algorithms, they have the potential to be useful for datasets with a high number of attributes and values. Also, systems requiring online classification can particularly benefit from such heuristics. Some experiments on large datasets involving a limited number of AntMiner versions have been reported, with competitive results. No application to online classification has been reported in the literature.
In recent times, there has been a rise in research on parallel implementation strategies in both the ACO and rule-based classification domains [81–83]. However, none of these strategies has yet been adopted for AntMiner algorithms. We believe that exploring effective parallel implementation strategies for AntMiner algorithms would be a timely contribution.
7. Conclusions
Many real-life applications of machine learning tools demand easy interpretations of the classifiers to humans, in addition to yielding high accuracy. This could be due to reliability, accountability, ethics, or other concerns. Rule-based classifiers have been a successful tool for such applications. We have provided a list of such applications to outline the application areas that already benefitted from the rule-based classification technique. Due to the high computational complexity and combinatorial nature of the construction process of rule-based classifiers, heuristic optimization algorithms such as AntMiner are considered to be a helpful method. In this article, an extensive review of the state-of-the-art AntMiner algorithms was conducted, and the relevant future research directions were suggested. Our suggestions emphasize exploring ACO methodologies including parallel implementation strategies, conducting experimental studies to find recommended parameter settings, comparative experimental studies to systematically evaluate the performance of existing strategies at the modular level, developing new strategies for rule pruning, and exploring the potential of AntMiner for online classification. This work will be beneficial to the researchers devoting their effort to improving or deploying the metaheuristic for rule-based classification.
Data Availability
This is a review article that does not deal with any datasets. To access the datasets cited in this article, the readers are referred to the source articles’ authors.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This research was partially supported by the US Department of Agriculture (USDA) under Grant no. 2021-67037-34163.