Abstract
One of the most promising topics in recent machine learning research is one-class classification (OCC), which considers datasets composed of only one class and outliers. It is more suitable than traditional multiclass classification for dealing with problematic datasets or special cases. Generally, classification accuracy and interpretability for the user are considered to involve a trade-off in OCC methods. A classifier based on hyperrectangles (H-RTGLs) can alleviate such a trade-off; an H-RTGL is formulated by the conjunction of geometric rules called intervals. These intervals can form a basis for interpretability since they can be easily understood by the user. However, the existing H-RTGL-based OCC classifiers have the following limitations: (i) they cannot reflect the density of the target class, (ii) even when the density is considered, a primitive interval generation method is used, and (iii) there exists no systematic procedure for determining the hyperparameters of an H-RTGL-based OCC classifier, which influence its classification performance. Therefore, we suggest a one-class hyperrectangle descriptor based on density with a more elaborate interval generation method, including parametric and nonparametric approaches. Specifically, we design a genetic algorithm that comprises a chromosome structure and genetic operators for systematic generation of the proposed classifier through optimization of its hyperparameters. Our study is validated through numerical experiments using several actual datasets with different sizes and features, and the results are compared to existing OCC algorithms as well as other H-RTGL-based classifiers.
1. Introduction
One-class classification (OCC) has recently been considered one of the most promising methods in machine learning research. OCC assumes that there is only one class in a dataset and that all instances not belonging to this class are outliers; it is often confused with binary classification utilizing positive and negative classes [1]. However, since binary classification explicitly models both positive and negative classes, it is clearly different from OCC, which assumes only one class. In other words, binary classification is a special case of the traditional multiclass classification (MCC). Owing to this difference in the number of classes between OCC and MCC, their pattern recognition methods for obtaining a classifier also differ [2]. In the case of MCC, the classifier aims to distinguish one class from the others and is learned from the differences among classes, since it is assumed that there are multiple classes in a dataset. On the other hand, an OCC classifier aims to thoroughly describe the dataset and to detect the unique features of the target class, since the dataset is assumed to contain only one class. Consequently, MCC and OCC require different learning algorithms.
Owing to the abovementioned distinguishing features, OCC can outperform MCC on datasets containing problematic structures. For instance, let us consider an imbalanced dataset [3], which contains exceptionally large or small classes, causing imbalance among their sizes. In this case, MCC approaches may show degraded classification performance and computational efficiency, since the difference among classes can hardly be defined due to their imbalanced sizes. For example, a dataset obtained from the manufacturing field can be viewed as imbalanced data: the defect rate has decreased drastically with the development of technology, so the class representing defective products is very small. In this situation, OCC can provide more accurate and more efficient performance than MCC by focusing on learning the implicit pattern of conforming products. Besides, when dealing with a large dataset comprising many heterogeneous classes, selectively generating an OCC model for the class of interest may be more efficient than generating an MCC model considering all classes [4].
The importance of OCC has stimulated the development of various algorithms for performing it. One approach is the probability-based method, which calculates the probability density function (PDF) or class probability of the target class. Other methods modify traditional MCC algorithms, such as the support vector machine (SVM) and decision tree (DT), to reflect the OCC scheme. However, these OCC algorithms have their own limitations [5, 6]. Specifically, probability-based OCC faces the problem of selecting the threshold for discriminating the target class from outliers. Although the revised SVM for OCC can guarantee promising classification performance, the resulting classifier is usually so complex that the user cannot understand or interpret the result. This complexity hinders post hoc analysis, such as comprehension of the internal mechanism of the classifier and the dataset. On the other hand, a modified DT for OCC, which can provide classification results clearly understood by the user, suffers from ambiguity in the rule generation procedure and may require inefficient generation of artificial or unlabeled data.
An OCC classifier using hyperrectangles (H-RTGLs) has been suggested to remedy such a trade-off between classification accuracy and interpretability [7–9]. This classifier determines the membership of an instance according to an H-RTGL that surrounds the target class and comprises geometric rules called intervals. The intervals can be clearly understood by the user, thus providing interpretability while maintaining prominent classification performance. Various OCC classifiers have been defined by different interval generation methods, such as merging, partitioning, clustering, and Gaussian-based generation. Although these approaches can address the trade-off occurring in traditional OCC classifiers, they have the following limitations. Most of them cannot reflect the density or distribution of the training data, which is essential for exploring meaningful and significant patterns of the data. For instance, the classifiers resulting from interval merging and partitioning depend on individual instances and are prone to overfit the observed instances. In addition, the classifier based on clustering shares the same problem, since it divides the training set by clustering and generates intervals directly from the cluster information. The Gaussian-based H-RTGL descriptor suggested by Kim et al. [8] can partially utilize the distribution of the training data. However, its performance was validated using small and simple datasets, indicating a lack of generality. In terms of the interval generation method, it used an ineffective one-dimensional clustering method for the projection points, which form an essential component for calculating appropriate intervals comprising the H-RTGL. Meanwhile, none of the previous studies tackling H-RTGL-based OCC classifiers offers a systematic hyperparameter-tuning procedure. Since the hyperparameters used in an H-RTGL-based OCC classifier influence the shape of the decision boundary that discriminates the target class from outliers, finding proper hyperparameters is an important issue in terms of classification performance. However, despite their importance, hyperparameters have been determined by exhaustive and inefficient methods, such as grid search. In other OCC algorithms, such as the one-class SVM (1-SVM), exploring hyperparameters using an optimization method has proven to be a meaningful improvement [10, 11].
Based on the abovementioned motivation, we intend to overcome the limitations of the existing research by proposing a more elaborate H-RTGL-based classifier that considers the density and distribution of the target class and by addressing the absence of a hyperparameter-tuning mechanism. Therefore, the present paper proposes a one-class H-RTGL descriptor based on density and a genetic algorithm (GA) for effectively generating the new classifier. The proposed classifier can appropriately obtain intervals representing the target class by using an advanced interval generation scheme that includes both a parametric approach, which can guarantee an optimal split of the projection points, and a nonparametric approach used to draw prior information for the parametric approach. Moreover, the designed GA supports classifier generation by optimizing the hyperparameters that affect the classification accuracy of the resulting classifier. We conduct numerical experiments on various datasets from the UCI machine learning repository to assess the performance of the proposed classifier supported by the GA.
The rest of this paper is organized as follows. Section 2 presents a literature survey summarizing the OCC research proposed so far, according to the classification methodologies, and describing features, pros, and cons of each methodology. Then, Section 3 describes the suggested approach for systematic generation of the classifier. Section 4 evaluates the performance of the proposed algorithm using numerical experiments, and we conclude our work and suggest future work in Section 5.
2. Literature Survey
We classify the existing research on OCC algorithms according to the classification methodology used, as follows: (i) probability-based OCC, (ii) decision boundary-based OCC, (iii) rule-based OCC, and (iv) H-RTGL-based OCC. This section explains how each category of algorithm performs OCC and reviews the corresponding literature. We also review recently published articles.
First, probability-based OCC can be broadly divided into density estimation and Bayesian approaches. The density estimation approach estimates the PDF of a given training dataset and classifies an instance by calculating the probability that the instance belongs to the target class according to the PDF. Desforges et al. [12] estimated the PDF of the positive class in a dataset by using a kernel function and determined instances whose probability under the PDF was lower than a predefined threshold to be outliers. De Ridder et al. [13] formulated an OCC model using Gaussian and Parzen density estimation and compared the classification results with those obtained by other OCC algorithms. Bishop [14] dealt with outlier detection for an oil flow problem and conducted OCC with Gaussian and Parzen density estimation. The Bayesian approach performs OCC by directly calculating the class probability using Bayes' theorem, assuming that the dataset features are independent. For example, Datta [15] proposed a positive-class Naive Bayes (NB) classifier by modifying NB for the OCC setting so that only instances of the positive class are detected. Denis et al. [16] applied an NB classifier with positive instances reinforced with unlabeled data, and Cohen et al. [17] performed OCC for a facial recognition dataset by using a Bayesian network (BN). Such probability-based OCC algorithms can classify data intuitively, since discrimination can be performed according to the probability. However, there is no clear standard for the predefined threshold used to discriminate between the target class and outliers. To overcome this limitation, Hempstalk et al. [18] combined class probability and density estimation by generating artificial data. Although such an approach can alleviate the threshold selection problem, artificial data cause performance degradation, and there is no dominant standard for the quantity of artificial data to be generated. Besides, probability-based OCC algorithms are weak at classifying sparse datasets, since they recognize the variation resulting from sparsity as outliers, which hinders suitable classification [19].
Next, decision boundary-based OCC learns the pattern of the target class in order to obtain a decision boundary that separates the feature space into a region including the target class and a region including outliers. In other words, an instance is determined to belong to the target class if it lies within the region of the target class and is otherwise an outlier. SVM is a representative decision boundary-based classifier used in binary classification and is the foundation of OCC methods utilizing a decision boundary, such as 1-SVM and support vector data description (SVDD), developed by Tax and Duin [20, 21]. SVDD extracts a decision boundary in the shape of a hypersphere including the target class in the feature space. In particular, it aims to find a hypersphere that maximizes the number of covered instances belonging to the target class while minimizing its radius. Le et al. [22] introduced the concept of margin used by the traditional SVM into SVDD and proposed a new objective function for SVDD that maximizes the distance between the surface of the hypersphere and the outliers. Xiao et al. [23] and Le et al. [24] proposed multiple-sphere SVDD, which describes the dataset using several hyperspheres. 1-SVM, proposed by Scholkopf et al. [25], performs OCC by finding a hyperplane that separates the origin from the target class and maximizes the distance between them. Campbell and Bennett [26] devised a kernel function to linearize the complex quadratic programming problem of 1-SVM, and Hao [27] proposed a more robust 1-SVM classifier by implementing fuzzy theory. Decision boundary-based OCC can generate a classifier from a small dataset and guarantee outstanding classification performance. However, the classifier performance is highly dependent on the parameter configuration used to obtain the decision boundary. In addition, the resulting classifier comprises a high-dimensional and complex function that is difficult for the user to interpret.
Rule-based OCC derives rules that define an instance as belonging to the target class from the training dataset and classifies a new instance according to the derived rules. OCC algorithms based on DT and association rules correspond to this category. A DT is formulated by rules for classifying an instance and proceeds to further depth until reaching the deepest leaf node. De Comité et al. [28] altered the C4.5 decision tree, which exploits the heterogeneity among data, to generate rules within the decision tree for the OCC problem; the resulting decision tree detects instances of the positive class by using positive instances and unlabeled data simultaneously. Lee et al. [29] considered the RIPPER algorithm, which enables a decision tree to be simplified using pruning, and performed OCC with fewer rules. Association rules refer to recurring rules extracted from a dataset; in classification problems, they serve as common rules for defining instances of the target class. Brause et al. [30] extracted association rules from credit information by using a neural network to prevent credit fraud. Tandon and Chan [31] compared the effects of weighting and pruning in the extraction of association rules for OCC. The OCC methods belonging to this category can provide classification results that are easily understood and analyzed by the user, since the classifier comprises explicit rules. However, these methods cannot obtain structured classifiers, since they are prone to overfitting the training data, and their computational complexity increases drastically with the number of features.
The H-RTGL-based OCC considered in this paper combines the features of decision boundary-based and rule-based OCC. It extracts a decision boundary in the shape of an H-RTGL, which is formulated through the conjunction of intervals obtained from each feature. Such a decision boundary can be clearly understood by the user, whereas that of SVDD or 1-SVM is hard to interpret. Classifiers belonging to this group can be categorized according to the interval generation method. The one-class H-RTGL descriptor based on merging draws many small intervals from all instances belonging to the target class and finalizes the conjoint interval set by merging the overlapping intervals. In comparison, the one-class H-RTGL descriptor based on partitioning obtains a conjoint interval set by partitioning one large interval including all instances of the target class [7, 9]. These approaches can generate too many H-RTGLs depending on the dataset, which results in a classifier that ignores the distribution or density of the dataset. The Gaussian-based descriptor can address these restrictions by considering the density and distribution of the data in the interval generation procedure [8]. However, it assumes the same number of distributions in all features and uses k-means clustering, which may fall into a local optimum, as its one-dimensional clustering method.
Besides, there exist some recent approaches to OCC that utilize methods such as graph-based algorithms or deep learning, which have proven to be promising alternatives in various fields. Sohrab et al. [32] extended SVDD by applying multimodal learning, which exploits information of different modalities, and subspace learning, which focuses on essential insight and knowledge that can be represented as low-dimensional data. Gautam et al. [33] devised Kernel Ridge Regression (KRR) based autoencoders using local and global variance information, referred to as LKAE and GKAE, which are reinforced by graph embedding. Dai et al. [34] suggested a multilayered architecture of the One-Class Extreme Learning Machine (OC-ELM), which replaces inefficient gradient-based backpropagation with the Moore-Penrose generalized inverse for setting the weights between layers. Their work adopted a stacked autoencoder (SAE) structure to efficiently extract useful representations from high-dimensional datasets. Wu et al. [35] designed a deep one-class neural network based on the structure of the Convolutional Neural Network (CNN), one of the most representative deep learning methods, specialized in image processing. Meanwhile, Rahman et al. [36] applied a deep learning approach for the detection of diabetes by devising a convolutional Long Short-Term Memory (LSTM) model that fits both sequential and image data well. Although these approaches can ensure outstanding performance, their complex and computationally demanding structures require high computational effort. Furthermore, their results can hardly be interpreted by the user due to the black-box nature of deep learning methods.
3. Design of a Density-Based Geometric One-Class Classifier
3.1. OCC Definition and Classification Idea
In general, the OCC problem aims to determine whether an instance belongs to the target class when the given dataset contains only one class. We consider a dataset of instances, each of which is represented as a vector of feature values. The problem aims to formulate a classifier that distinguishes instances belonging to the target class from outliers. H-RTGL-based OCC is one solution approach to this problem, which formulates the classifier through the following three steps, described in Figure 1: (i) interval generation, (ii) interval conjunction, and (iii) classifier acquisition. Since an interval is generated in a feature-by-feature manner, the instances of the training data should be projected onto each feature. The intervals are obtained from the projection points of each feature by various interval generation methods, such as merging and partitioning. Then, the H-RTGL prototype is derived through the conjunction of intervals resulting from the Cartesian product, and the volume of the H-RTGL is adjusted to prevent overfitting. The finalized H-RTGL is used as a classifier to discriminate the target class from outliers.

This paper suggests a one-class H-RTGL descriptor based on density, which generates intervals by considering the density of the projection points represented by the statistics of Gaussian distributions. Specifically, we cluster the set of projection points and assign a Gaussian distribution to each cluster under the assumption that the projection points originate from a certain mixture of Gaussians (MoG). Intervals are calculated from the statistics of the Gaussian distribution corresponding to each cluster of projection points. In such an interval generation scheme, the interval shape depends on the number of Gaussian distributions and on the clustering method for the projection points. To determine an appropriate number of Gaussian distributions, we apply kernel density estimation (KDE), which indicates the PDF of the dataset and enables estimating an approximate shape of the MoG expected to have generated the projection points. Then, the number of Gaussian distributions used in each feature is obtained from the resulting PDF. In terms of the clustering method for the projection points, we consider Jenks' natural-break optimization, which is specialized for one-dimensional clustering and can explore an optimal solution through enumeration. As a result, systematic interval generation can be performed by selecting an appropriate number of Gaussian distributions and obtaining meaningful clusters of the projection points, which form the basis for an H-RTGL-based classifier with high performance.
3.2. Detailed Classifier Acquisition Process of the Proposed Descriptor
The interval generation of the proposed descriptor is performed feature by feature. Therefore, the instances belonging to the training data should be projected onto each feature prior to interval generation. We define a projection function that maps each instance onto a single feature by extracting the corresponding feature value.
Figure 2 shows the projection function applied in the feature space.

After the set of projection points of a feature is obtained, it is divided into clusters corresponding to Gaussian distributions. First, we carry out KDE, which estimates the density of the sample dataset in a nonparametric manner by using a kernel function. The PDF of a feature resulting from KDE is calculated as the sum of kernel functions centred at the projection points of the observed instances, normalized by the number of observed instances and the bandwidth of the kernel function. Thus, the shape of the estimated PDF depends on the kernel function and its bandwidth. Figure 3 illustrates the variation in the PDF shape caused by (a) different kernel functions and (b) different bandwidths.
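For reference, the standard form of a kernel density estimate over the projection points of a single feature is shown below; the symbols (n observed instances, bandwidth h, kernel K, projection points p_i) are generic placeholders rather than the paper's original notation.

```latex
\hat{f}(x) \;=\; \frac{1}{n\,h}\sum_{i=1}^{n} K\!\left(\frac{x - p_{i}}{h}\right)
```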

To derive the number of Gaussian distributions used in each feature, we count the extreme points appearing in the estimated PDF, assuming that the presence of one extreme point indicates one Gaussian distribution in the MoG. Consequently, the number of Gaussian distributions varies with the kernel function and the bandwidth used in each feature, which makes them crucial hyperparameters of the interval generation procedure, since the length and shape of the intervals depend on them. We adopt the Gaussian kernel function after assessing both efficiency and effectiveness; the detailed results are presented in Section 4. In terms of the bandwidth of the kernel function, we design a GA to explore the bandwidth that optimizes the classification performance of the resulting classifier; the detailed design of the GA is described in Section 3.3.
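As an illustration (a minimal sketch, not the authors' implementation), the number of Gaussian components for one feature can be obtained by evaluating a Gaussian KDE on a grid and counting its local maxima; the bandwidth h is the hyperparameter later tuned by the GA.

```python
import numpy as np

def count_gaussian_components(points, h, grid_size=512):
    """Count local maxima of a Gaussian KDE over 1-D projection points.

    Each local maximum of the estimated PDF is taken as evidence of one
    Gaussian component in the underlying mixture."""
    points = np.asarray(points, dtype=float)
    grid = np.linspace(points.min() - 3 * h, points.max() + 3 * h, grid_size)
    # Gaussian KDE: sum of kernels centred at each projection point.
    pdf = np.exp(-0.5 * ((grid[:, None] - points[None, :]) / h) ** 2).sum(axis=1)
    pdf /= len(points) * h * np.sqrt(2.0 * np.pi)
    # A local maximum is a grid point higher than both of its neighbours.
    peaks = (pdf[1:-1] > pdf[:-2]) & (pdf[1:-1] > pdf[2:])
    return int(peaks.sum())
```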
After the number of Gaussian distributions used in each feature is determined, we perform one-dimensional clustering of the projection points by using this number as the parameter of Jenks' method. Jenks' method is a clustering method that minimizes the variance within clusters while maximizing the variance between clusters. It can avoid converging to a local optimum by evaluating all possible variance configurations of the clusters, which enables it to dominate k-means clustering. The detailed procedure of Jenks' method is as follows [37]: Step 1. Calculate the sum of squared deviations from the array mean (SDAM) over all projection points. Step 2. Enumerate the possible class combinations and calculate the sum of squared deviations from the class means (SDCM) for each resulting combination. Step 3. Evaluate the goodness of variance fit (GVF) of each class combination. Step 4. Choose the class configuration with the greatest GVF.
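The following sketch enumerates all partitions of the sorted projection points into k contiguous classes and keeps the one with the greatest GVF, mirroring Steps 1-4; the exhaustive enumeration is practical only for small samples, since it grows combinatorially with the number of points.

```python
from itertools import combinations
import numpy as np

def jenks_breaks(points, k):
    """Exhaustive Jenks natural-breaks clustering of 1-D points into k classes."""
    x = np.sort(np.asarray(points, dtype=float))
    n = len(x)
    sdam = ((x - x.mean()) ** 2).sum()                  # Step 1: SDAM
    best_gvf, best_classes = -1.0, None
    for cuts in combinations(range(1, n), k - 1):       # Step 2: enumerate splits
        classes = np.split(x, cuts)
        sdcm = sum(((c - c.mean()) ** 2).sum() for c in classes)
        gvf = (sdam - sdcm) / sdam                      # Step 3: GVF
        if gvf > best_gvf:                              # Step 4: keep the best split
            best_gvf, best_classes = gvf, classes
    return best_classes, best_gvf
```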
As a result of one-dimensional clustering, clusters of projection points are obtained. To generate intervals from these, a Gaussian distribution is assigned to each cluster under the assumption that the projection points belonging to the cluster follow that distribution. The Gaussian distribution corresponding to a cluster of projection points in a feature has a mean and standard deviation calculated from the projection points of that cluster. An interval is then generated from these statistics, where a parameter controls the degree of influence of the standard deviation and is optimized through a pretest.
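Assuming the usual symmetric construction around the cluster statistics (the exact expression is an assumption here), the interval for a cluster with mean \(\mu\), standard deviation \(\sigma\), and influence parameter \(\beta\) would take the form

```latex
I \;=\; \bigl[\,\mu - \beta\,\sigma,\;\; \mu + \beta\,\sigma\,\bigr]
```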
Figure 4 shows a calculation example of the intervals. If the number of Gaussian distributions is determined to be 3, three Gaussian distributions can be assumed for the three clusters of projection points, as shown in Figure 4(a), and the intervals resulting from each distribution are depicted in Figure 4(b). Since the projection points of each feature are clustered into the corresponding number of clusters, the total number of intervals equals the sum of the cluster counts over all features.

The H-RTGL is formulated by the conjunction of the resulting intervals generated feature by feature. This conjunction of intervals is derived by the Cartesian product over all features, so the number of possible interval conjunctions is, in theory, the product of the per-feature interval counts. Nevertheless, considering all these interval conjunctions deteriorates computational efficiency and risks generating meaningless interval conjunctions or conjunctions that include outliers. Thus, we use only the interval conjunctions that include at least one instance of the training dataset.
If the numbers of Gaussian distributions in the two features of a two-dimensional dataset are given as (2, 2), there are four possible interval conjunctions. However, we use only the interval conjunctions that include instances of the training data and ignore the other regions, as shown in Figure 5.
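A sketch of this filtering step is shown below: candidate hyperrectangles are formed as the Cartesian product of the per-feature intervals, and only those containing at least one training instance are kept (interval boundaries are treated as inclusive here, which is an assumption).

```python
from itertools import product
import numpy as np

def build_hyperrectangles(intervals_per_feature, X_train):
    """Form candidate H-RTGLs and keep those covering training instances.

    intervals_per_feature[j] is a list of (low, high) tuples for feature j."""
    X_train = np.asarray(X_train, dtype=float)
    kept = []
    for box in product(*intervals_per_feature):      # one interval per feature
        lows = np.array([lo for lo, _ in box])
        highs = np.array([hi for _, hi in box])
        inside = np.all((X_train >= lows) & (X_train <= highs), axis=1)
        if inside.any():                              # discard empty conjunctions
            kept.append(box)
    return kept

def classify(x, hyperrectangles):
    """Label an instance as the target class if it lies inside any kept box."""
    x = np.asarray(x, dtype=float)
    return any(all(lo <= v <= hi for v, (lo, hi) in zip(x, box))
               for box in hyperrectangles)
```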

Even though the interval conjunctions filtered by the existence of an instance can be viewed as H-RTGLs and provide superior efficiency and accuracy, they still carry a risk of overfitting, since their intervals are generated feature by feature. To alleviate this risk, an additional fitting procedure for the H-RTGL is required. We propose the following three fitting functions to control the H-RTGL size and record the classification results of each fitting function in the numerical experiments so that they can be compared with each other. A common parameter is used to control the volume of the H-RTGL: Fit 1. Instance-based fitting: it adjusts each interval using the minimum and maximum values of the projection points of the instances included in the interval conjunction. Fit 2. No fitting: it uses the generated interval without any modification. Fit 3. Density-based fitting: it adjusts each interval using the density of instances within the interval conjunction, that is, the relative density of the projection points of the corresponding feature belonging to the interval conjunction. If the density of projection points within the interval is greater than or equal to a threshold, the interval length increases; otherwise, it becomes narrower. In the end, the H-RTGL is finalized by the logical product of the intervals adjusted with the fitting functions and is used to determine the membership of an instance. Specifically, such an H-RTGL is named a density-based H-RTGL, and an instance is classified as the target class or as an outlier depending on whether or not it lies inside the H-RTGL.
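The following sketch illustrates the three fitting options; the exact scaling rules, the density threshold, and the role of the volume-control parameter gamma are assumptions made for illustration only.

```python
import numpy as np

def fit_interval(lo, hi, proj_points, mode, gamma=0.1, density_threshold=1.0):
    """Adjust one interval of an H-RTGL according to one of three fitting modes.

    proj_points: projection points (for this feature) of the training
    instances that fall inside the interval conjunction."""
    p = np.asarray(proj_points, dtype=float)
    if mode == "fit1":          # instance-based: snap to the observed range
        return p.min(), p.max()
    if mode == "fit2":          # no fitting: keep the generated interval
        return lo, hi
    if mode == "fit3":          # density-based: widen or narrow by gamma
        density = len(p) / (hi - lo)
        centre, half = (hi + lo) / 2.0, (hi - lo) / 2.0
        scale = 1.0 + gamma if density >= density_threshold else 1.0 - gamma
        return centre - half * scale, centre + half * scale
    raise ValueError(f"unknown fitting mode: {mode}")
```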
3.3. GA for Efficient Selection of Hyperparameters
In addition to proposing the density-based descriptor, which enables the generation of a promising classifier considering the density and distribution of the dataset, we design a GA to select appropriate hyperparameters that influence its performance. Specifically, the bandwidth of the kernel function in each feature is an essential hyperparameter, since it determines the shape of the PDF estimated by KDE and thus the number of Gaussian distributions used in interval generation. Therefore, the objective of the designed GA is to explore an optimal combination of bandwidths over all features that maximizes the classification performance of the resulting classifier. The entire process of the GA is summarized in Figure 6, and a detailed description of the algorithm is presented in the following subsections.

3.3.1. Representation of Chromosome Structure and Initial Population
Defining the encoding scheme of the chromosomes composing a population is the first step in designing a GA. Considering the objective of the GA, we devise a chromosome structure using a real-valued encoding scheme, consisting of one gene per feature that records the bandwidth of the kernel function in that feature. Since the dataset has a fixed number of features, a chromosome is defined as a vector of that size, and the chromosome structure is depicted in Figure 7.

The size of the population used in the GA is fixed, and the initial population is generated randomly between a lower bound and an upper bound on the bandwidth of each feature. The upper and lower bounds are calculated from Silverman's rule-of-thumb bandwidth estimator, which minimizes the mean squared error (MSE) of the KDE and depends on the standard deviation and the size of the sample [38].
Although the bandwidth calculated from Silverman's rule can minimize the MSE of the KDE, it cannot guarantee optimality in terms of the classification performance of the resulting classifier, which is shown through a numerical experiment. Therefore, we set the upper and lower bounds as multiples of the rule-of-thumb bandwidth so as to utilize partial information of the projection points in each feature.
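A sketch of the bound computation and initialization is given below; the normal-reference form of Silverman's rule is used, and the lower and upper multipliers are illustrative values rather than the ones used in the paper.

```python
import numpy as np

def silverman_bandwidth(points):
    """Silverman's rule-of-thumb bandwidth h = 1.06 * sigma * n^(-1/5)
    (normal-reference form for a Gaussian kernel)."""
    p = np.asarray(points, dtype=float)
    return 1.06 * p.std(ddof=1) * len(p) ** (-0.2)

def initial_population(X_train, pop_size, lower_mult=0.5, upper_mult=2.0, seed=0):
    """Random initial population of bandwidth chromosomes, one gene per feature,
    drawn between multiples of the per-feature Silverman bandwidth."""
    rng = np.random.default_rng(seed)
    h_ref = np.array([silverman_bandwidth(X_train[:, j])
                      for j in range(X_train.shape[1])])
    return rng.uniform(lower_mult * h_ref, upper_mult * h_ref,
                       size=(pop_size, X_train.shape[1]))
```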
The chromosome containing the bandwidth of the kernel function for each feature is decoded using KDE to obtain the number of Gaussian distributions, which is used as a crucial parameter of Jenks' method. Figure 8 depicts this decoding process, which includes (a) KDE using the bandwidths recorded in the chromosome, (b) finding the extreme points of the PDF estimated by KDE, and (c) the resulting number of Gaussian distributions.

3.3.2. Fitness Function
A GA performs an iterative search that enhances the solution by transferring good solutions containing desirable features to the offspring. For this purpose, the user defines a fitness function with appropriate criteria to evaluate the chromosomes in the population. Since the main objective of the proposed GA is to solve the OCC problem by generating the proposed classifier, the fitness function should evaluate the chromosomes according to the classification performance obtained with them. Therefore, we define a fitness function proportional to the area under the ROC curve (AUC), scaled by a constant; AUC is a measure for assessing classification performance and will be explained in Section 4.
3.3.3. Crossover Operator
In a natural evolutionary process, an offspring is generated by inheriting various features from its parents. A GA imitating this process also has a crossover operator for generating offspring by considering the features of the parental chromosomes. One of the most common crossover operators is the one-point or multi-point crossover, which exchanges genes of the parental chromosomes according to randomly selected crossover points. However, such point crossover is not suitable for the chromosome structure considered in this paper, which contains real-valued genes representing the KDE bandwidths arranged feature by feature. Instead, we use an arithmetic crossover operator that generates offspring through an arithmetic operation on the values recorded in the parental chromosomes. In particular, the two gene values of the offspring in each feature are calculated from the corresponding gene values recorded in the parental chromosomes. Figure 9 depicts (a) the concept of arithmetic crossover and (b) an example of offspring generation using it, where a random number between −0.5 and +1.5 is used for the crossover operation.
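A sketch of this operator is shown below; aside from the final positivity safeguard (an addition of this sketch), it follows the standard arithmetic-crossover blend with the random coefficient drawn from [-0.5, 1.5] as described above.

```python
import numpy as np

def arithmetic_crossover(parent1, parent2, rng):
    """Blend two real-valued bandwidth chromosomes into two offspring."""
    alpha = rng.uniform(-0.5, 1.5)                    # random blending coefficient
    child1 = alpha * parent1 + (1.0 - alpha) * parent2
    child2 = (1.0 - alpha) * parent1 + alpha * parent2
    # Keep bandwidths strictly positive (safeguard added in this sketch).
    return np.maximum(child1, 1e-12), np.maximum(child2, 1e-12)
```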

3.3.4. Mutation Operator
In the reproduction process of natural evolution, an offspring generated from the crossover operation may exhibit a completely different feature not appearing in the parents. This unexpected alteration is called mutation, and the new features triggered by mutation form the foundation of the diversity of species, which is essential for the evolutionary process. Similarly, a GA has a mutation operator that mimics mutation in nature, and the diversity of solutions guaranteed by the mutation operator prevents the solutions from converging to a local optimum. We adopt the Makinen, Periaux, and Toivanen mutation (MPTM) operator, which maintains scalability and efficiently searches the solution space under real-valued encoding [39, 40]. The detailed procedure for calculating the mutated bandwidth with the MPTM operator is as follows, where the lower and upper bounds have the same values as those used to generate the initial population: Step 1. Normalize the current bandwidth with respect to the lower and upper bounds. Step 2. Generate a random number and compute the mutated normalized value from it. Step 3. Map the mutated normalized value back to the bound range to obtain the new bandwidth.
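A sketch of the MPTM update for a single bandwidth gene is given below, using generic symbols; the mutation exponent b and the exact update form follow the commonly cited version of the operator and should be read as assumptions.

```python
import numpy as np

def mptm_mutation(h, lower, upper, rng, b=2.0):
    """Makinen-Periaux-Toivanen (MPTM) mutation of one bandwidth gene.

    lower/upper are the same bounds used to build the initial population."""
    t = (h - lower) / (upper - lower)         # Step 1: normalise into [0, 1]
    r = rng.uniform(0.0, 1.0)                 # Step 2: random number
    if r < t:
        t_new = t - t * ((t - r) / t) ** b
    elif r > t:
        t_new = t + (1.0 - t) * ((r - t) / (1.0 - t)) ** b
    else:
        t_new = t
    return (1.0 - t_new) * lower + t_new * upper   # Step 3: map back to bounds
```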
3.3.5. Population Update and Termination Condition of GA
Through the crossover and mutation operators, a new generation of chromosomes is produced, so that twice the population size of chromosomes exists. To maintain the population size, we assess the fitness values of the whole population and select only the chromosomes with high fitness values. For selecting the parental chromosomes used for the next generation, we perform roulette wheel selection based on the fitness value; in other words, a chromosome is chosen as a parental chromosome with a probability proportional to its fitness value. The solution space can be explored efficiently owing to this selection policy. Besides, the GA requires a termination criterion, since improvements may become scarce after many iterations, which means that additional executions are insignificant or not promising. For instance, a GA may terminate when a predefined number of iterations or an allocated computing budget is reached. However, we terminate when there is no enhancement for a certain number of generations (five in this paper), since this condition enables more exhaustive and efficient searching of the solution space compared to other criteria [41].
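A sketch of the selection and termination logic is shown below; best_history is assumed to record the best fitness found so far at each generation.

```python
import numpy as np

def roulette_select(population, fitness, n_parents, rng):
    """Fitness-proportional (roulette wheel) selection of parent chromosomes."""
    fitness = np.asarray(fitness, dtype=float)
    probs = fitness / fitness.sum()
    idx = rng.choice(len(population), size=n_parents, p=probs)
    return population[idx]

def should_terminate(best_history, patience=5):
    """Stop once the best-so-far fitness has not improved for `patience`
    consecutive generations (five in this paper)."""
    if len(best_history) <= patience:
        return False
    return best_history[-1] <= best_history[-patience - 1]
```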
4. Numerical Experiments
4.1. Experimental Design
We design numerical experiments using actual datasets provided by the UCI machine learning repository to evaluate the classification performance of the classifier obtained by the proposed method. Seven datasets are considered (named Iris, Biomed, Liver, Breast, Vehicle, Abalone, and Satellite), whose detailed information, such as the number of features and instances, is summarized in Table 1. However, these datasets do not accord with the definition of OCC, since they have more than one class. Therefore, we select a certain class as the target class and consider the remaining classes as outliers, which is called the one-versus-all (OVA) approach. This is a general approach for performing OCC on MCC datasets and is useful for selectively collecting information on an important class. The selected target class of each dataset is also recorded in Table 1.
The experiment for each dataset is performed as follows. We randomly select half of the instances belonging to the target class and use them as the training set to learn the proposed classifier. After formulating the classifier by using the training set, the remaining half of the target class and all instances defined as outliers are used as the test data to verify the classifier. In the Iris dataset, for instance, the training dataset comprises 25 randomly selected instances from the target class, and the test dataset comprises the remaining 25 instances belonging to the target class and 100 outliers. To set the GA parameters, a preexperiment is conducted to determine the population size and the probability of mutation.
We consider such a configuration of training and test sets because of the measure used to evaluate the classification performance of the classifiers, namely, AUC. AUC is a representative performance measure that inspects the classifier in terms of both its capability to include the target class instances of interest and its capability to exclude outliers simultaneously. Since the test set considered in this paper contains instances of the target class and outliers, AUC can be computed depending on the number of target class instances included and the number of outliers excluded in the test set. The ROC curve for calculating AUC is drawn from the true positive rate, representing the ratio of instances classified as positive among actual positive instances, and the false positive rate, representing the ratio of instances classified as positive among actual negative instances. Concretely, we draw a ROC curve by increasing the value of the fitting parameter that determines the length of the intervals until all instances of the target class in the test dataset are covered. Figure 10 depicts one ROC curve and AUC value obtained using the proposed classifier constructed from one chromosome for the Biomed dataset. A previous study showed that drawing the ROC curve based on this parameter was superior to drawing it based on other parameters [9].
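A sketch of this ROC construction is given below; classify_with_gamma is a hypothetical function returning the target-class decision of the fitted classifier for a given value of the fitting parameter.

```python
import numpy as np

def roc_auc_by_fitting_parameter(gammas, classify_with_gamma, X_test, y_test):
    """Trace a ROC curve by sweeping the fitting parameter that controls the
    H-RTGL volume and integrate it with the trapezoidal rule."""
    y_test = np.asarray(y_test)
    pos, neg = (y_test == 1), (y_test == 0)
    points = {(0.0, 0.0), (1.0, 1.0)}
    for g in sorted(gammas):
        pred = np.array([classify_with_gamma(x, g) for x in X_test])
        points.add((pred[neg].mean(), pred[pos].mean()))   # (FPR, TPR)
    fprs, tprs = zip(*sorted(points))
    return np.trapz(tprs, fprs)
```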

4.2. Pretest for Selection of Kernel Function
As mentioned previously, the kernel function is an essential component, since it influences the derivation of the number of Gaussian distributions used in each feature, which determines the shape of the resulting H-RTGL and the classification performance. Various kernel functions can be considered for KDE, and we perform a pretest to choose the kernel function that ensures faster convergence and more accurate classification results when generating the classifier with the GA. In particular, we compare four kernel functions, namely, the uniform, triangular, Gaussian, and cosine kernels. A brief description of each kernel function, including its type, equation, and graph, is summarized in Table 2. All numerical experiments, including the pretest for the selection of the kernel function, are implemented in MATLAB R2017b on a PC platform with 16 GB RAM and an Intel(R) Core(TM) i5-3570 CPU at 3.40 GHz.
We execute the GA to generate the classifier for the datasets considered in this paper, using the abovementioned kernel functions. Table 3 summarizes the average number of generations until convergence and the standard deviation, resulting from 20 replications for each kernel function. In particular, the Virginica class is chosen as representative of the Iris dataset, since it has the slowest convergence among the three classes. The colored columns in Table 3 correspond to the cases requiring the largest numbers of generations for convergence, and they show the promising performance of the Gaussian and cosine kernels compared to the others.
According to Table 3, the Gaussian and cosine kernels require fewer generations to converge than the other kernel functions on the large datasets consisting of many features, namely, Vehicle, Abalone, and Satellite. Based on this observation, we adopt the Gaussian kernel for the one-dimensional clustering procedure. In terms of elapsed time, there is no meaningful difference among the kernel functions; each run takes approximately 10–30 ms, proportional to the dataset size. Moreover, no significant difference in AUC, representing the accuracy of the classifier, is observed, which indicates that the kernel function only influences the speed of convergence of the GA.
4.3. Experimental Results
Table 4 displays the experimental results for evaluating the classification performance of the proposed classifier. Specifically, we run the GA 20 times on each dataset described in Table 1 and record the mean and standard deviation of the AUC obtained by the GA-generated classifier and by the other OCC algorithms. We refer to previous research for the results of the existing H-RTGL-based OCC methods and conduct additional experiments for the datasets newly considered in this study [9]. The results of the other baseline OCC methods are taken from the OCC results recorded in [42]. In addition, three variations of the proposed classifier resulting from the different fitting functions are assessed, denoted as Fit 1, Fit 2, and Fit 3. We perform a pretest for parameter tuning based on grid search for the hyperparameters of the existing methods and state the optimal parameter configuration for each dataset. In the case of the other representative OCC algorithms, the AUC values recorded in the reference were acquired from parameter exploration using dd_tools, a toolbox provided for MATLAB. In contrast to these traditional approaches, a notable merit of the proposed classifier obtained by the GA is that the classifier performance can be optimized through internal exploration of the solution space without conducting a pretest for parameter tuning.
First, we observe that the classification performance of the variations Fit 1 to Fit 3, which use different fitting functions, differs despite the use of the same dataset. In addition, the superiority of a fitting function varies depending on the dataset, since no single fitting function outperforms the others on all datasets. On datasets such as Breast and Vehicle, for instance, the most promising approach is Fit 2, which formulates the H-RTGL using the intervals without any fitting operation. This can be interpreted to mean that these datasets carry a relatively small risk of overfitting to the training data, with instances evenly located in the feature space. On the other hand, Fit 3 generates the most accurate classifier on the Biomed dataset, which implies that the dataset is well organized and fits the Gaussian distribution. As a result, the proposed fitting functions enable the classifier to deal with datasets with various structures.
The classifiers obtained from the GA show superior performance compared to those obtained from the traditional H-RTGL-based OCC methods, despite the fluctuation caused by the fitting function. We observe a difference of between 1% and 8% in terms of AUC, except on the Liver dataset, where the proposed classifier and the existing descriptors show nearly the same classification performance. Meanwhile, further research has shown that a feature of the Liver dataset, the number of half-pint equivalents of alcoholic beverages drunk per day, is not appropriate for dividing the healthy and nonhealthy classes [43]. This can explain the relatively low AUC values on the Liver dataset and suggests that the dataset is not suitable for assessing classifiers due to the defective feature. From the viewpoint of the internal mechanism of the classifier, the descriptor proposed in this paper has an aspect similar to the partitioning-based descriptor, which generates intervals through the partitioning of one large interval, since the proposed descriptor also divides the projection points into clusters. The difference stems from the limitations of the partitioning-based descriptor, in which the same number of partitions is applied to all features and partitioning is carried out considering merely the distance between instances, ignoring the density of the dataset and the deviation across features. In particular, the performance degradation of the partitioning-based descriptor observed on datasets with many features, such as Vehicle, Abalone, and Satellite, supports the dominance of the proposed method.
Considering OCC algorithms other than H-RTGL-based OCC, the proposed classifier obtained by the GA records the highest AUC values on Iris_Setosa, Iris_Virginica, and Abalone. In addition, the gap between the classification performance of the proposed classifier obtained by the GA and that of the most prominent OCC algorithm is very narrow, specifically 0.2–0.3%. These results indicate that the proposed classifier is competitive in terms of classification performance against well-known OCC algorithms as well as the existing H-RTGL-based OCC methods. With regard to the average performance described in the last column, the average performance of the proposed classifier is better than that of the Naive Parzen and Gauss algorithms. To verify this result by a statistical test, we carried out a t-test comparing the average performance of Fit 1 with that of Naive Parzen.
Specifically, we conducted a two-sample t-test with unequal variances using the function ttest2 in MATLAB R2017b. The t-statistic of the test was 9.9599, which lies beyond the rejection boundary of the test. This indicates that the average performance of Fit 1 is superior to that of Naive Parzen. As a result, the variations of the proposed classifier provide competitive performance as well as the interpretability demonstrated in Section 4.4, enabling the user to draw conclusions from the classification results and to perform post hoc analysis on a clear basis.
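For reproducibility, an equivalent Welch test can be run as in the sketch below; the AUC arrays here are illustrative placeholders, not the values reported in Table 4.

```python
import numpy as np
from scipy import stats

# Illustrative placeholder AUC samples for the two methods (20 replications
# per method would be used in the actual experiment).
auc_fit1 = np.array([0.95, 0.96, 0.94, 0.95, 0.97])
auc_naive_parzen = np.array([0.90, 0.91, 0.89, 0.92, 0.90])

# Two-sample t-test with unequal variances (Welch's test), the Python
# counterpart of the MATLAB ttest2 call used in the paper.
t_stat, p_value = stats.ttest_ind(auc_fit1, auc_naive_parzen, equal_var=False)
print(t_stat, p_value)
```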
4.4. Proof of Interpretability
Besides the classification performance demonstrated in the abovementioned numerical experiments, valuable insight and clear interpretation can be obtained from the proposed classifier, since it comprises geometric rules that can be easily understood by the user. We consider the Iris and Biomed datasets as examples of such interpretation. The Iris dataset is composed of three classes corresponding to three species and four features, namely, Sepal Length, Sepal Width, Petal Length, and Petal Width. Table 5 summarizes the information of the H-RTGL drawn using Setosa as the target class. From this information, an unlabeled instance is classified as the target class if it has (i) Sepal Length between 4.1 and 6.0, (ii) Sepal Width between 2.4 and 4.8, (iii) Petal Length between 1.1 and 1.9, and (iv) Petal Width under 0.8.
Figure 11 shows the visualization of the H-RTGL described in Table 5 together with a scatter plot of the instances of the entire Iris dataset. Although only three features (Sepal Length, Petal Length, and Petal Width) are displayed in Figure 11, the actual H-RTGL generation is performed using all features. The H-RTGL depicted in Figure 11 includes all instances of the Setosa class, marked as ∗, while no instance of the other classes lies within it. Accordingly, we can infer that the obtained H-RTGL is a representative pattern of the Setosa class and that the intervals belonging to it are significant rules for separating the Setosa class from outliers.

Furthermore, according to the feature space described in Figure 11, the Petal Length (y-axis) and Petal Width (z-axis) of instances belonging to the Setosa class are significantly different from those of instances considered outliers, while the Sepal Length (x-axis) of instances belonging to the Setosa class is not far from that of the other classes in the feature space. Based on this observation, the user can consider Petal Length and Petal Width as important features for determining whether an unlabeled instance belongs to the Setosa class.
Next, the Biomed dataset aims to identify factors resulting in a certain genetic disorder and comprises five features, including the age of the patient and the values of four blood measurements. We generate the H-RTGLs expressed in Table 6, considering the normal class as the target class. In addition, Table 7 is a confusion matrix calculated from the actual classification result on the test dataset using the information recorded in Table 6.
From the confusion matrix, we can observe that the four H-RTGLs cover 56 of the 63 target class instances in the whole test set, along with 10 outliers. Therefore, the coverage of the resulting classifier is 56/63 (approximately 88.9%), and only 10 outliers are misclassified as the target class. Such relatively high coverage and classification accuracy indicate that the H-RTGLs comprising the classifier form a representative pattern of the Biomed dataset and appropriately split the normal class, corresponding to the target class, from the outliers. In terms of individual features, measure 1 of the H-RTGLs describing the normal class ranges over 18–38 or 40–70, while the average value of measure 1 of the outliers is 185.8, which supports the fact that measure 1 is an important feature related to the genetic disorder. As described here, such insight, including the importance of certain features, can be obtained from the H-RTGL generated by the GA.
Finally, the Breast dataset describes some characteristics of breast tumors, such as thickness, and aims to discriminate benign and malignant tumors. Table 8 summarizes the information of the three H-RTGLs covering the most instances, obtained by the proposed method. In addition, Table 9 is a confusion matrix calculated from the actual classification result on the test dataset using the information recorded in Table 8.
From the confusion matrix, we can observe that the three H-RTGLs cover 90 of the 120 target class instances in the whole test set, along with 16 outliers. Therefore, the coverage of the resulting classifier is 90/120 (75%), and only 16 outliers are misclassified as the target class. Similar to the case of Biomed, the H-RTGLs obtained by the proposed method form a typical pattern of the Breast dataset and clearly distinguish instances of the malignant class from outliers. An additional interpretation of the pattern through each feature is as follows. First, the values of marginal adhesion recorded in the three H-RTGLs generally span 1–10. Since the Breast dataset is normalized between 1 and 10, this feature can hardly be regarded as an important feature discriminating the malignant class. Contrary to marginal adhesion, the value ranges of uniformity of cell shape and bare nuclei recorded in Table 8 are relatively narrow, indicating that they are essential features describing the malignant class. We refer to Chen et al. [44] to validate the feature importance inferred by the proposed classifier. They carried out a correlation analysis between each feature and the class information, which indicated that the uniformity of cell shape and bare nuclei had high correlation coefficients of 0.8219 and 0.8227, respectively. In addition, marginal adhesion had a lower correlation coefficient of 0.7062. As a result, the relevance of the interpretation of feature importance based on the proposed classifier was corroborated by this comparison.
4.5. Comparison of the Interpretability with LIME
In order to describe how the proposed method provides interpretability, we compared its interpretability to that of another method. Specifically, we considered the Local Interpretable Model-Agnostic Explanations (LIME) algorithm [45] as a control group. LIME is a wrapper method executed on the result of a black-box algorithm in order to extract knowledge that can be used by the user. We formulated a support vector classifier describing the Setosa class of the Iris dataset used in Figure 11 and then applied LIME to derive interpretations to compare with the H-RTGL of Figure 11. The classification performance of the support vector classifier used as the control group is the same as that of the proposed classifier; both can completely distinguish the Setosa class from outliers. Figure 12 shows the result of LIME applied to four instances determined to belong to the Setosa class by the support vector classifier. The important feature values used to classify an instance as the Setosa class are marked in orange, and the rules extracted from each instance are described in Figure 12.

Above all, LIME is a wrapper method applied to the result of a black-box classifier; although it can be used with any black-box model, it requires additional computation and inherits the disadvantages of the selected black-box method. In terms of the knowledge extracted, LIME produced 7 rules (excluding repetitive ones) to express the four instances. On the other hand, only four rules are generated by the proposed classifier, and they are sufficient for explaining all instances belonging to the Setosa class. This difference stems from the interpretability mechanism of LIME, which extracts knowledge instance by instance. Since the intervals used in the proposed classifier are generated considering all instances or meaningful statistics of them, it can provide more generalized rules for the target class. Furthermore, due to its instance-by-instance nature, LIME sometimes generates conflicting rules: for example, "3.00 < SW ≤ 3.20" is used to describe the Setosa class for the second instance but to indicate an outlier for the third instance. Such nondeterministic behavior hinders the reliability of the interpretations obtained from LIME. On the contrary, the proposed classifier generates systematic and informative rules useful to the user.
5. Conclusion
This paper mainly dealt with the following: (i) the development of a novel OCC classifier considering the density and distribution of the training data and (ii) the design of a GA that enables systematic generation of the classifier while optimizing its hyperparameters. The interval generation method of the proposed classifier includes a one-dimensional clustering procedure reinforced with KDE and Jenks' natural-break optimization, which can accurately reflect the latent patterns or distribution of the data. In addition, we devise an encoding scheme comprising genes that determine the number of normal distributions in each feature, together with genetic operators suitable for the chromosome structure, to derive a promising parameter configuration. To validate the performance of the proposed classifier generated by the GA, we carried out numerical experiments using actual datasets and provided examples of clear interpretation of the resulting classifier and dataset.
The experiments showed that the classification performance of the proposed classifier obtained by the GA was similar or superior to that of other OCC algorithms that guarantee high performance but not interpretability, such as SVDD and 1-SVM. In addition, its performance dominated the previous H-RTGL-based OCC methods. Based on these results, we can conclude that the proposed classifier obtained by the GA overcomes the difficulty of parameter tuning in traditional H-RTGL-based OCC research, as well as the trade-off between classification performance and interpretability occurring in other existing OCC algorithms. As a result, the contributions of our study include learning latent but important patterns of the dataset and interpreting the resulting classifier for post analysis. For instance, we can consider an application such as outlier detection in the manufacturing field to report factors causing defects, by learning the patterns of conforming and defective products.
Meanwhile, there are several topics to be considered as future work for proving the effectiveness and efficiency of the GA-generated classifier, namely, (i) extension to other datasets, (ii) modification of the internal mechanism of the classifier, and (iii) application of other metaheuristics. Apart from the UCI machine learning datasets considered in this paper, we plan to verify our study using exceptionally large or complex datasets with inherent problematic structures. Moreover, we intend to develop interval generation methods using distributions other than the normal distribution, since some datasets are not well described by the normal distribution. The application of state-of-the-art metaheuristics, such as particle swarm optimization, known for faster convergence than the GA, is another option.
Data Availability
The authors designed numerical experiments using actual datasets provided by the UCI machine learning repository to evaluate the classification performance of the classifier obtained by the proposed method.
Disclosure
An earlier version of this article was presented as a preprint in [46].
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (no. NRF-2017R1A2B4009841).