Abstract
Classical decision trees such as C4.5 and CART partition the feature space with axis-parallel splits. Oblique decision trees instead use splits based on linear combinations of features, which can simplify the boundary structure. Although oblique decision trees often achieve higher generalization accuracy, most oblique split methods cannot be applied directly to categorical data and are computationally expensive. In this paper, we propose a multiway splits decision tree (MSDT) algorithm that relies on feature weighting and clustering. The method can combine multiple numerical features, multiple categorical features, or mixed features. Experimental results show that MSDT performs well on all of these data types.
1. Introduction
Despite the great success of deep neural network (DNN) models in image processing, speech recognition, and other fields in recent years, decision trees remain competitive: they are interpretable, have fewer parameters, are robust to noise, and can be applied to large-scale data sets at low computational cost. Therefore, the decision tree is still one of the hotspots of machine learning today [1–3]. Research has mainly focused on construction methods for decision trees, split criteria [4], decision tree ensembles [5, 6], hybrids with other learners [7–9], decision trees for semisupervised learning [10], and so on.
Despite this practical success, constructing an optimal decision tree has been proven to be NP-complete [11]. To avoid local optima, some researchers adopted evolutionary algorithms to build decision trees [12–14]. However, owing to time complexity, the most popular algorithms, such as ID3 [15], C4.5 [16], and CART [17], and their various modifications [18] are greedy by nature and construct the decision tree in a top-down, recursive manner. Moreover, they act on only one feature at a time, which results in axis-parallel splits. During tree induction, if a candidate feature is numerical, a suitable cut point has to be searched for, and the instances in the training set are routed to the left or right child node according to the following test:
$$x_{i,l} \le c, \tag{1}$$
where $x_{i,l}$ denotes the value of instance $X_i$ on feature $A_l$ and $c$ is the cut point.
Axis-parallel trees have the advantages of fast induction and strong comprehensibility. However, when features are highly correlated, a very unfavorable situation may arise. Figure 1 gives an illustration: the axis-parallel splits must be repeated many times, producing a staircase-like structure that complicates the decision tree.

To overcome this limitation of axis-parallel decision trees, some researchers introduced oblique decision trees, in which a nonleaf node tests a linear combination of features, i.e.,
$$\sum_{l=1}^{p} a_l x_{i,l} \le c, \tag{2}$$
where $a_l$ represents the coefficient of the $l$th feature, $c$ is the threshold, and $p$ is the number of features. In Figure 1, the instances of the two classes can be separated completely by a single oblique split. Therefore, it is generally believed that oblique splits often produce smaller decision trees and better generalization performance on the same data.
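For illustration only (the coefficients and thresholds below are invented, not taken from any cited method), the two kinds of tests in (1) and (2) can be sketched in Python as:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 1.0],
              [2.5, 2.5]])            # three instances, two features

# axis-parallel test (1): compare a single feature with a cut point
feature, cut = 0, 2.0
left_axis = X[:, feature] <= cut       # -> [ True, False, False]

# oblique test (2): compare a linear combination of features with a threshold
a = np.array([0.6, -0.8])              # illustrative coefficients a_l
c = -0.5                               # illustrative threshold
left_oblique = X @ a <= c              # -> [ True, False,  True]

print(left_axis, left_oblique)
```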
Searching for the optimal oblique hyperplane is much harder than searching for the optimal axis-parallel hyperplane. To address this problem, numerous techniques have been applied, for example, hill climbing [17], simulated annealing [19], and genetic algorithms [20]. A large amount of work has been devoted to reducing the risk of falling into local optima: the Simulated Annealing Decision Tree (SADT) [19] uses simulated annealing, and OC1 [21] combines the ideas of CART-LC [17] and SADT.
When searching for oblique hyperplanes, both simulated annealing and genetic algorithms try thousands of candidates, resulting in low time efficiency. Many researchers therefore used linear discriminant analysis, linear regression, perceptrons, SVMs, and other methods to find suitable oblique hyperplanes. Fisher's decision tree (FDT) [22] exploits the dimensionality reduction of Fisher's linear discriminant and the decomposition strategy of decision trees to build an oblique decision tree, but it is only applicable to binary classification problems. Based on ADTree [23], Hong et al. [24] proposed the multivariate ADTree and discussed its variations (Fisher's ADTree, Sparse ADTree, and Regularized Logistic ADTree). Wickramarachchi et al. [25] proposed HHCART, which uses a series of Householder matrices to reflect the training data during tree construction. Shah and Sastry [26] defined a separability measure of instances as the split criterion, optimized it at each node, and presented the Alopex Perceptron Decision Tree algorithm. Menze et al. [27] presented an oblique tree forest that uses LDA and ridge regression to conduct oblique splits.
With the above oblique methods, trees with fewer nodes and better accuracy can be obtained. However, they also have deficiencies, mainly in three aspects.
1.1. Inability to Handle Categorical Data Directly
Oblique splits use linear combinations of features, so categorical features must first be converted into one or more numerical features [28]. This transformation may introduce new biases into the classification problem and thus reduce the generalization ability of the model.
1.2. High Time Cost
Oblique splits usually require complex matrix computations when linear discriminant analysis, ridge regression, or similar methods are used. Although these methods are more efficient than simulated annealing and genetic algorithms, they are still more costly than axis-parallel methods such as C4.5.
1.3. Some Methods Are Not Suitable for Multiclass Problems
Oblique split methods generally conduct binary splits. Although a binary tree can also be used directly for multiclass problems, some binary splits rely on the class label (e.g., FDA and the original SVM), which restricts algorithms such as FDT [22] to binary classification. In addition, some models need to convert multiclass problems into binary ones [7].
In order to overcome the above shortcomings, this paper proposes a multiway splits decision tree for multiple types of data (numerical, categorical, and mixed data). The specific characteristics of this method are as follows:
(i) Categorical features are handled directly.
(ii) The time complexity is similar to that of the axis-parallel split algorithms.
(iii) It is not necessary to convert multiclass problems into binary ones, because multiway splits are used directly.
The remainder of the paper is organized as follows. Section 2 briefly reviews the RELIEF-F and k-means algorithms. Section 3 presents our algorithm and discusses its time complexity. Section 4 presents and analyzes experimental comparisons with other decision trees. The last section concludes the paper.
2. Preliminaries
The proposed decision tree method weights the features with the RELIEF-F algorithm and splits the nodes with a weighted k-means algorithm. This section therefore reviews the two algorithms and their variants.
2.1. RELIEF-F Algorithms
The RELIEF algorithm [29] is popular for feature selection. It estimates feature weights according to the correlation between each individual feature and the class label. RELIEF randomly samples an instance $R$ from the training set and then searches for its two nearest neighbors $H$ and $M$: $H$ is from the same class (the near Hit) and $M$ is from a different class (the near Miss). If the distance between $R$ and $H$ on feature $A$ is smaller than the distance between $R$ and $M$, RELIEF increases the weight of $A$; otherwise, it decreases the weight.
In fact, RELIEF's estimate for feature $A$ approximates the following difference of probabilities:
$$W[A]=P(\text{different value of } A \mid \text{nearest instance from a different class})-P(\text{different value of } A \mid \text{nearest instance from the same class}), \tag{3}$$
where $P(\cdot \mid \cdot)$ denotes the conditional probability.
The RELIEF algorithm only deals with binary classification problems. Kononenko proposed an extension called RELIEF-F for multiclass problems [30]. The algorithm samples $m$ instances; for each sampled instance $R_i$, its $k$ nearest neighbors are searched for in each class.
The weight is updated as follows:
$$W[A]=W[A]-\sum_{j=1}^{k}\frac{\operatorname{diff}(A,R_i,H_j)}{m\,k}+\sum_{C\ne \operatorname{class}(R_i)}\frac{P(C)}{1-P(\operatorname{class}(R_i))}\sum_{j=1}^{k}\frac{\operatorname{diff}(A,R_i,M_j(C))}{m\,k}, \tag{4}$$
where $P(C)$ represents the proportion of class-$C$ instances among all instances and $M_j(C)$ represents the $j$th nearest neighbor of $R_i$ in class $C$. The function $\operatorname{diff}(A,R_1,R_2)$ calculates the difference between two instances $R_1$ and $R_2$ on feature $A$ as follows:
$$\operatorname{diff}(A,R_1,R_2)=\begin{cases}\dfrac{|R_1[A]-R_2[A]|}{\max(A)-\min(A)}, & A \text{ is numerical},\\[2mm] 0, & A \text{ is categorical and } R_1[A]=R_2[A],\\ 1, & A \text{ is categorical and } R_1[A]\ne R_2[A].\end{cases} \tag{5}$$
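For concreteness, the following is a simplified Python sketch of the weight update (4), with one nearest hit and one nearest miss per class and numerical features only; it is illustrative and not the implementation used in our experiments.

```python
import numpy as np

def relieff_weights(X, y, m=20, seed=0):
    """Simplified RELIEF-F (one hit / one miss per class), numerical features only."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xn = (X - X.min(axis=0)) / span            # so that diff() in (5) lies in [0, 1]
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    m = min(m, n)
    w = np.zeros(p)
    for i in rng.choice(n, size=m, replace=False):
        d = np.abs(Xn - Xn[i]).sum(axis=1)     # Manhattan distance to every instance
        d[i] = np.inf
        hit = int(np.argmin(np.where(y == y[i], d, np.inf)))
        w -= np.abs(Xn[i] - Xn[hit]) / m       # near-hit term of (4)
        for c in classes:                      # near-miss term of (4), one per class
            if c == y[i]:
                continue
            miss = int(np.argmin(np.where(y == c, d, np.inf)))
            w += prior[c] / (1.0 - prior[y[i]]) * np.abs(Xn[i] - Xn[miss]) / m
    return w
```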
2.2. k-Means, k-Modes, and k-Prototypes
The k-means algorithm is widely used in real-world applications owing to its simplicity and efficiency.
Let $D=\{X_1, X_2, \ldots, X_n\}$ be a set of instances, where each instance $X_i$ is described by a set of $p$ features and the instances are to be clustered into $k$ clusters $C_1, \ldots, C_k$. First, randomly pick $k$ instances as the centers $Z_1, \ldots, Z_k$ of the initial clusters, and then assign a cluster label to each instance as follows:
$$\operatorname{label}(X_i)=\arg\min_{1\le j\le k} d(X_i, Z_j), \tag{6}$$
where $d(\cdot,\cdot)$ is the Euclidean distance.
After all the instances are partitioned, each cluster center is updated by the following formula:
$$Z_j=\frac{1}{|C_j|}\sum_{X_i \in C_j} X_i. \tag{7}$$
Formulas (6) and (7) are repeated until the objective in formula (8) converges to a local optimum or the preset number of iterations is reached:
$$E=\sum_{j=1}^{k}\sum_{X_i \in C_j}\|X_i-Z_j\|^2. \tag{8}$$
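A direct transcription of the update rules (6)–(8) reads as follows (a minimal sketch with a fixed iteration budget):

```python
import numpy as np

def kmeans(X, centers, max_iter=10):
    """Plain k-means: assign by (6), update centers by (7), track the objective (8)."""
    Z = centers.copy()
    for _ in range(max_iter):
        # (6): each instance joins the cluster with the nearest center
        dist = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # (7): each center becomes the mean of its members
        for j in range(len(Z)):
            if np.any(labels == j):
                Z[j] = X[labels == j].mean(axis=0)
        # (8): sum of squared distances to the assigned centers
        sse = ((X - Z[labels]) ** 2).sum()
    return labels, Z, sse

# toy usage: six points, two clusters, the first and fourth points as initial centers
X = np.array([[0., 0.], [0., 1.], [0.5, 0.5], [5., 5.], [5., 6.], [5.5, 5.5]])
labels, Z, sse = kmeans(X, X[[0, 3]])
```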
However, the classical k-means only works on numerical data. The k-modes and k-prototypes algorithms are variants of k-means for categorical and mixed data, respectively [31]. When k-modes processes categorical variables, the center of each cluster is represented by the modes. When computing the distance between an instance and a cluster center, the distance on each feature is calculated by formula (5) and then accumulated.
It is straightforward to integrate k-means and k-modes into k-prototypes. The distance between instance $X_i$ and cluster center $Z_j$ is
$$d(X_i,Z_j)=d_n(X_i,Z_j)+\gamma\,d_c(X_i,Z_j), \tag{9}$$
where $d_n$ and $d_c$ represent the distances on the numerical and categorical variables, respectively, and $\gamma$ adjusts the relative contributions of $d_n$ and $d_c$.
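A minimal sketch of the mixed distance (9), with squared Euclidean distance on the numerical part and 0/1 mismatches (formula (5)) on the categorical part:

```python
import numpy as np

def prototype_distance(x_num, x_cat, z_num, z_cat_modes, gamma=1.0):
    """k-prototypes style distance (9): numerical part plus gamma * categorical part."""
    d_num = ((x_num - z_num) ** 2).sum()                      # distance on numerical variables
    d_cat = sum(a != b for a, b in zip(x_cat, z_cat_modes))   # 0/1 mismatches, formula (5)
    return d_num + gamma * d_cat

# e.g. an instance with numerical part [1.0, 2.0] and categorical part ("red", "yes")
d = prototype_distance(np.array([1.0, 2.0]), ("red", "yes"),
                       np.array([0.5, 2.5]), ("red", "no"), gamma=0.5)   # 0.5 + 0.5 = 1.0
```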
3. Our Proposed Algorithm
Our proposed MSDT differs from most oblique methods in three ways: (i) MSDT does not use greedy methods to pursue the maximum impurity reduction, (ii) MSDT uses a combination of multiple variables to perform multiway splits at nonleaf nodes, and (iii) MSDT treats categorical features in a way similar to numerical features.
3.1. Multiway Splits
Most oblique methods conduct binary splits, while the proposed algorithm performs multiway splits; that is, in one split, multiple hyperplanes are generated simultaneously and the feature space is divided into several disjoint regions. Ho [32] categorized linear split methods into three types, axis-parallel linear splits, oblique linear splits, and piecewise linear splits; our method falls into the third category. Piecewise linear split methods find anchors in the feature space, and each instance is assigned to its nearest anchor. Figure 2 shows a 5-way split of a two-dimensional feature space.

3.2. Location of Anchor
Finding suitable split hyperplanes is the key problem in most decision tree induction algorithms. Under piecewise linear splits, the problem of finding appropriate hyperplanes is equivalent to that of finding appropriate anchors. Usually, anchors can be chosen as the class centroids or as cluster centers generated by some clustering algorithm. In MSDT, we first use RELIEF-F to weight the features and then use k-means with a weighted distance to cluster the instances.
3.2.1. Why Do We Use k-Means?
If the instances are linearly separable, it is obviously more efficient to simply use the class centroids rather than cluster centers as anchors. However, when the instances of some class are distributed over different regions of the feature space, the class centroids may no longer be suitable anchors. For example, in Figure 3 the circular instances are distributed over two separate areas. If the solid line, which is perpendicular to the line between the two class centroids, is used to separate the instances, the result is clearly unsatisfactory. The instances in Figure 3 obviously form two clusters; if they are divided by the dotted line, the perpendicular bisector of the segment between the two cluster centers, at least the circular instances on the right side of the figure can be distinguished.

The proposed split method is based on the clustering assumption, which states that samples belonging to the same cluster belong to the same class. k-means partitions instances according to a (dis)similarity measure; hence, the leaf nodes of MSDT can be regarded as prototypes, and the class of a test instance depends on which prototype the instance is most similar to.
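To make the prototype view concrete, the following minimal Python sketch (a hypothetical two-level structure, not the paper's implementation) routes a test instance to the leaf whose anchor it is nearest to:

```python
import numpy as np

def route(x, node):
    """Follow the nearest-anchor child until a leaf is reached; return its class."""
    while node.get("children"):                     # nonleaf: pick the nearest anchor
        dists = [np.linalg.norm(x - np.asarray(c["anchor"])) for c in node["children"]]
        node = node["children"][int(np.argmin(dists))]
    return node["label"]

# tiny hand-built tree with two anchors (prototypes) at the root
tree = {"children": [
    {"anchor": [0.0, 0.0], "label": "A"},
    {"anchor": [5.0, 5.0], "label": "B"},
]}
print(route(np.array([1.0, 0.5]), tree))   # -> "A"
```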
Univariate decision trees produce a comprehensible classification model because of their knowledge representation: a decision tree is a graphical representation that can easily be converted into a set of rules written in natural language. Some researchers believe that multivariate decision trees cannot be converted into comprehensible rules; others argue that a multivariate tree with fewer nodes is easier to understand. MSDT is easy to understand for two reasons. First, MSDT has fewer nodes than univariate decision trees. Second, similarity to a prototype is easy for users to understand and can replace the rules generated by a univariate decision tree.
3.2.2. Why Do We Weight Features?
The original k-means is an unsupervised clustering algorithm suited to unlabeled data, and its optimization goal is to minimize (8). The goal of a split, however, is to reduce the class impurity of the current node as much as possible; the two goals are not the same. Therefore, we estimate the correlation between each feature and the label in order to weight the features. When computing the distance from an instance to a cluster center, a feature strongly related to the label receives a larger weight, which enlarges its contribution to the distance, and an uncorrelated feature receives a smaller weight, which reduces its contribution. In this way, the optimization goal of k-means is brought closer to that of the node split.
Figure 4 shows an example of the effectiveness of feature weighting. The solid line is obtained with unweighted features, and the dotted line with weighted features, where the weight of A1 is 0.05 and the weight of A2 is 0.95. Clearly, some previously misassigned instances are now assigned correctly.

To further illustrate the role of feature weighting, we carry out a simple experiment on the iris dataset: its 150 samples come from three classes of 50 samples each. Using k-means directly for clustering yields 10 misclassified samples; the detailed results are shown in Table 1.
Then, we use the RELIEF-F algorithm to calculate the weights of the four features, which are 0.09, 0.14, 0.34, and 0.39, respectively. In the k-means clustering, the distance between an instance and a cluster center is calculated by
$$d_w(X_i,Z_j)=\sqrt{\sum_{l=1}^{p} w_l\,(x_{i,l}-z_{j,l})^2}, \tag{10}$$
where $p$ is the number of features and $w_l$ is the weight of the $l$th feature. This yields 6 misclassified samples; the detailed results are shown in Table 2.
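The following snippet illustrates the weighted distance (10) on iris using the weights reported above; it performs only a single weighted assignment to the class centroids, so its misclassification count need not match Tables 1 and 2 exactly.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# weights in the spirit of Section 3.2.2 (the RELIEF-F weights reported in the text)
w = np.array([0.09, 0.14, 0.34, 0.39])

def weighted_dist(X, Z, w):
    # formula (10): weighted Euclidean distance of every instance to every center
    return np.sqrt((w * (X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))

# class centroids as centers, then one weighted assignment step
Z = np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])
labels = weighted_dist(X, Z, w).argmin(axis=1)
print("disagreements with the true labels:", (labels != y).sum())
```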
Our proposed split method is shown as Algorithm 1 and is used to split nodes on numerical data.
Algorithm 1: The proposed node split method (multi_split) for numerical data.
In the fifth step of Algorithm 1, $I_{\max}$ represents the maximum number of iterations. In the experiments, we set it to 6 by default. Such a small value is chosen mainly for time efficiency; moreover, since the purpose of the clustering is only to split a node, the partition is still acceptable even if the clustering has not converged.
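As a hedged illustration of the main loop of Algorithm 1 for numerical data (assuming, as in Section 4.2.1, that the class centroids serve as the initial cluster centers and that the feature weights w come from RELIEF-F; the function and variable names are ours, not those of the original implementation):

```python
import numpy as np

def multi_split_numerical(X, y, w, i_max=6):
    """Sketch of Algorithm 1 for numerical data: weighted k-means initialized at the
    class centroids; w are the RELIEF-F feature weights (step 2 of Algorithm 1)."""
    classes = np.unique(y)
    Z = np.stack([X[y == c].mean(axis=0) for c in classes])   # initial anchors
    for _ in range(i_max):                                    # at most I_max passes
        # weighted distance of every instance to every center, formula (10)
        dist = np.sqrt((w * (X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))
        labels = dist.argmin(axis=1)                          # multiway assignment
        for j in range(len(Z)):                               # recompute the centers
            if np.any(labels == j):
                Z[j] = X[labels == j].mean(axis=0)
    return labels, Z
```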
3.3. Categorical Feature
As mentioned in the previous subsection, the split method can be applied directly to numerical features. For categorical features, the RELIEF-F algorithm can still be used to weight the features; however, in the clustering process, the representation of the cluster center and the distance from an instance to a cluster center need to be redefined.
The k-modes algorithm extends k-means by replacing the means of numerical variables with the modes of categorical variables. However, the resulting distance is less precise. Moreover, when a feature has several modes, choosing different modes may lead to opposite conclusions.
Here is an example. Suppose there are two clusters $C_1$ and $C_2$ described by two categorical features $A_1$ and $A_2$, and each cluster contains 10 instances, as shown in Table 3. The mode of $A_1$ is the same for $C_1$ and $C_2$, which makes $A_1$ useless for distinguishing the distances between instances and the two clusters. For $A_2$, there are two modes (say $a$ and $b$) in $C_1$ and $C_2$, respectively. Suppose there is an instance $X$ with $A_2 = a$. If $a$ is selected as the center value of $C_1$ and $b$ for $C_2$, the distance between $X$ and $C_1$ is 0 and the distance between $X$ and $C_2$ is 1; hence, $X$ is nearer to $C_1$. If $b$ is selected as the center value of $C_1$ and $a$ for $C_2$, the distance between $X$ and $C_1$ is 1 and the distance between $X$ and $C_2$ is 0; hence, $X$ is nearer to $C_2$.
To avoid the imprecision and ambiguity of a mode-based distance measure, we represent the cluster center by the probability estimate of each categorical feature value and define a function to calculate the distance from an instance to a cluster center.
Let $D$ be a set of $n$ categorical instances described by $p$ categorical features and partitioned into $k$ clusters. The $l$th feature $A_l$ takes $v_{jl}$ different values in the $j$th cluster $C_j$, $1 \le j \le k$, $1 \le l \le p$.
Definition 1. $D_{a_{l,t}}^{j}$ represents the set of instances in $C_j$ whose value on feature $A_l$ is $a_{l,t}$, where $1 \le t \le v_{jl}$. The conditional probability is estimated as follows:
$$P(a_{l,t} \mid C_j)=\frac{|D_{a_{l,t}}^{j}|}{|C_j|}. \tag{11}$$
$V_l^{j}$ is the summary of all values of $A_l$ in $C_j$, defined as follows:
$$V_l^{j}=\left\{\big(a_{l,t},\,P(a_{l,t} \mid C_j)\big)\;\middle|\;1 \le t \le v_{jl}\right\}. \tag{12}$$
Definition 2. The center of $C_j$ is represented by the following vector:
$$Z_j=\big(V_1^{j}, V_2^{j}, \ldots, V_p^{j}\big). \tag{13}$$
Definition 3. $d(a, V_l^{j})$ represents the distance between a value $a$ of feature $A_l$ and the summary $V_l^{j}$:
$$d(a, V_l^{j})=1-P(a \mid C_j). \tag{14}$$
Definition 4. $d_w(X_i, Z_j)$ represents the weighted distance between instance $X_i$ and center $Z_j$:
$$d_w(X_i,Z_j)=\sum_{l=1}^{p} w_l\, d(x_{i,l}, V_l^{j}). \tag{15}$$
According to formula (15), with the weights of both features equal to 1 in the above example, the distances between instance $X$ and the two cluster centers ($Z_1$ and $Z_2$) in Table 3 are $0.7 = 0.1 + 0.6$ and $1.2 = 0.6 + 0.6$, respectively. This means that $X$ is closer to $C_1$, which accords with human intuition.
To cluster categorical data, formula (13) replaces formula (7) in steps 4 and 7 of Algorithm 1, and formula (15) replaces formula (10) in step 6.
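Under the reconstruction above, in particular the assumption that the per-value distance (14) is one minus the estimated conditional probability, the cluster-center representation (13) and the weighted distance (15) can be sketched as follows:

```python
from collections import Counter

def categorical_center(cluster_rows):
    """Formula (13): one value-probability table per feature instead of a single mode."""
    n = len(cluster_rows)
    p = len(cluster_rows[0])
    return [
        {value: count / n
         for value, count in Counter(row[l] for row in cluster_rows).items()}
        for l in range(p)
    ]

def categorical_distance(x, center, weights):
    """Formula (15): weighted sum of per-feature distances (14) = 1 - P(value | cluster)."""
    return sum(w * (1.0 - table.get(v, 0.0)) for v, table, w in zip(x, center, weights))

# toy usage: two 3-instance clusters over two categorical features
c1 = categorical_center([("a", "x"), ("a", "y"), ("b", "x")])
c2 = categorical_center([("b", "y"), ("b", "y"), ("a", "x")])
print(categorical_distance(("a", "x"), c1, [1.0, 1.0]))   # 1/3 + 1/3
print(categorical_distance(("a", "x"), c2, [1.0, 1.0]))   # 2/3 + 2/3
```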
3.4. Mixed Features Data
For mixed data, the cluster center vector consists of two parts: the means of the numerical features and the vector in (13) for the categorical features. In this case, the distance from an instance to a cluster center is calculated by (9), where $d_n$ and $d_c$ are obtained by (10) and (15), respectively. Since the ratio of numerical to categorical features differs across datasets, we choose the $\gamma$ in (9), from a set of candidate values, that yields the greatest reduction of the Gini index.
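A sketch of the γ selection is given below; the candidate set and the split_with_gamma callback (which would return the child assignment produced by clustering with a given γ) are illustrative placeholders, since the text does not specify them.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def best_gamma(X_num, X_cat, y, split_with_gamma, candidates=(0.5, 1.0, 2.0)):
    """Pick the gamma in (9) whose induced partition reduces the Gini index the most."""
    parent = gini(y)
    best, best_drop = None, -np.inf
    for g in candidates:
        labels = split_with_gamma(X_num, X_cat, y, g)   # child index for every instance
        children = [y[labels == j] for j in np.unique(labels)]
        weighted = sum(len(c) / len(y) * gini(c) for c in children)
        drop = parent - weighted
        if drop > best_drop:
            best, best_drop = g, drop
    return best
```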
3.5. MSDT and Time Complexity Analysis
The multi_split function is used to split the nodes. Algorithm 2 describes the construction process of MSDT.
Algorithm 2: Construction of MSDT.
In step 2 of Algorithm 1, RELIEF-F is used to obtain the feature weights. The time complexity of RELIEF-F is $O(pnm)$, where $p$ is the number of features, $n$ is the number of instances, $m$ is the sampling number, and $k'$ is the number of nearest neighbors, whose effect on the complexity is negligible. In this paper, $m$ is set to a small constant and $k'$ is set to 1, so the time complexity of RELIEF-F here is $O(pn)$.
Steps 4 to 9 of Algorithm 1 form the clustering process, whose time complexity is $O(Iknp)$, where $k$ is the number of clusters and $I$ is the number of iterations. When Algorithm 1 is used to split nodes, the maximum number of iterations $I_{\max}$ is 6, so the time complexity may reach $O(6knp)$ in the worst case.
Considering the above two parts, the time complexity of Algorithm 1 is $O(pn + 6knp)$, that is, $O(knp)$. Compared with the time complexity of classical axis-parallel splits, there is an extra factor of $k$; when $k$ is large, the algorithm is less efficient than the axis-parallel algorithms. Compared with binary splits, if the resulting decision trees have the same number of nodes, a $k$-way split clearly requires fewer split operations than a sequence of binary splits.
OC1 [21] is a classic oblique decision tree whose worst-case time complexity is considerably higher than that of axis-parallel methods; the time complexities of HHCART(A) and HHCART(D) are analyzed in [25]. In [22], the speed of FDT for splitting a node is close to, or even better than, that of the axis-parallel split methods. Unfortunately, FDT can only be applied to binary classification problems.
In summary, when $k$ is small, the efficiency of the proposed split method is close to that of classical axis-parallel split methods and better than that of most oblique split methods.
4. Experiments
In this section, experimental results demonstrate the effectiveness and performance of the proposed algorithm. The first part illustrates the effectiveness of clustering, feature weighting, and the new distance calculation for categorical features. The second part compares MSDT with classical decision trees and two other oblique trees. Finally, we use the larger covertype dataset to compare MSDT with two axis-parallel trees.
4.1. Datasets
As shown in Table 4, 20 UCI datasets [33] are used to evaluate the proposed algorithm. Their numbers of instances, numbers of classes, and feature types (numerical data, datasets 1–10; categorical data, datasets 11–15; mixed data, datasets 16–20) vary and are sufficiently representative to demonstrate the performance of MSDT. In the Features column, the two numbers give the numbers of numerical and categorical features, respectively. Abalone is treated as a 3-class classification problem (grouping classes 1–8, 9–10, and 11 and above).
4.2. Comparison of Different Piecewise Linear Split Methods
Piecewise linear split methods can be summarized in two steps: first, find appropriate anchors; then, assign each instance to its nearest anchor. On this basis, our proposed algorithm introduces three improvements: feature weighting, clustering, and a dedicated treatment of categorical features. This section combines these three changes into multiple split functions and compares their performance on multiple types of data. The functions are listed in Table 5.
Pessimistic pruning is applied after the decision trees are generated. In addition, all reported results are averages over 10 repetitions of 10-fold cross-validation.
4.2.1. Numerical Data
For numerical data, the proposed algorithm uses weighted k-means to optimize the cluster center positions. To demonstrate the roles of clustering and feature weighting, we implement four different node split functions to generate decision trees. Fun0 directly uses the centroid of each class as an anchor; instances are assigned to the nearest anchor using the Euclidean distance. Fun1 also uses the class centroids as anchors, but before assigning each instance to its nearest anchor, RELIEF-F is used to calculate the feature weights, features whose weights are less than 1/5 of the maximum are removed, and the distance is calculated by formula (10). Fun2 uses the class centroids as the initial cluster centers of k-means and takes the k-means output as the partition. Fun3 combines Fun1 with Fun2 and is our proposed algorithm for numerical data.
Table 6 gives the classification accuracy of the four functions on the 10 numerical datasets; the best entry in each row is bolded. Fun3 achieves the best accuracy on 9 of the 10 datasets, and its average accuracy is 4.16% higher than that of Fun0; in particular, the accuracy increases by more than 8% on Glass and Letter. Averaged over the 10 datasets, Fun1 is about 1.07% higher than Fun0 and Fun3 is 1.39% higher than Fun2, which shows that feature weighting improves classification performance. Fun2 is about 2.77% higher than Fun0 and Fun3 is 3.09% higher than Fun1, an improvement attributable to clustering.
4.2.2. Categorical and Mixed Data
On the categorical and mixed data, we implement eight different split functions to generate decision trees. Fun0 directly uses the centroid of each class as an anchor; for categorical features, modes replace the means as components of the anchors, and the distance between an instance and an anchor is obtained by summing the per-feature distances of formula (5). Fun1 differs from Fun0 in that the weight of each feature is calculated by RELIEF-F, features whose weights are less than 1/5 of the maximum are removed, and the distance between the instance and the anchor is obtained by formulas (15) and (9). Fun2 adds a clustering step to Fun0, using k-modes for categorical data and k-prototypes for mixed data. Fun3 combines Fun1 and Fun2. Fun4–7 correspond to Fun0–3, respectively, except that on the categorical features the cluster centers and distances are computed as described in Section 3.3 (formulas (13) and (15), respectively). Fun7 is our proposed algorithm for categorical and mixed data.
Table 7 gives the classification accuracy of the eight functions on the 5 categorical datasets (Balance, Car, Chess, Hayes, and MONK) and the 5 mixed datasets (Abalone, CMC, Flags, TAE, and Zoo); the best entry in each row is bolded. Except on CMC and Zoo, Fun7 obtains the best accuracy, and its average is 11.77% higher than that of Fun0. Fun1 is better than Fun0, Fun3 better than Fun2, Fun5 better than Fun4, and Fun7 better than Fun6, with an average improvement of 5.37%; this is the contribution of feature weighting. Likewise, Fun2 is better than Fun0, Fun3 better than Fun1, Fun6 better than Fun4, and Fun7 better than Fun5, with an average improvement of 4.45%; this is due to clustering. Meanwhile, Fun4–7 are on average 1.48% better than Fun0–3, an improvement that comes from using the statistical distribution of feature values instead of the modes.
4.3. Comparison with Other Decision Trees
To verify the performance of the proposed algorithm, we selected four decision trees for comparison: J48 (WEKA's implementation of C4.5), CARTSL (scikit-learn's implementation of an optimized CART), OC1, and HHCART(A). Since CARTSL and OC1 do not support categorical features, categorical features are converted into numerical ones by one-hot encoding. Ten repetitions of 10-fold cross-validation are used to report the average accuracy and tree size of the 5 classifiers on the test set. The Friedman test and the Nemenyi test are used to analyze the differences among the algorithms.
The accuracy of each method on the numerical datasets is shown in Table 8. MSDT achieves the best accuracy on 5 of the 10 datasets, and its average accuracy is 81.68%, which is 1.91%, 4.42%, 1.15%, and 1.81% higher than that of the other four trees, respectively. To further examine the differences among the classifiers, the Friedman test is used: the average ranks of the 5 classifiers over the 10 datasets are used to compute the statistic $F_F$. With 5 algorithms and 10 datasets, $F_F$ follows the F-distribution with 4 and 36 degrees of freedom, and the critical value is $F(4, 36) = 2.634$. The computed $F_F$ exceeds this value, so we reject the null hypothesis; that is, there are significant differences among the five classifiers. The Nemenyi method is used for the post hoc test. The critical difference (CD) is obtained by the following formula:
$$\mathrm{CD}=q_{\alpha}\sqrt{\frac{k(k+1)}{6N}},$$
where $k$ is the number of algorithms and $N$ is the number of datasets. With $k = 5$, $N = 10$, and significance level $\alpha = 0.05$ ($q_{0.05} = 2.728$), the calculated critical difference is CD = 1.92899. Based on these results, MSDT and OC1 show clear performance advantages over CARTSL.
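For reference, the statistics used here can be computed as follows (the Iman–Davenport form of the Friedman statistic and the Nemenyi critical difference; $q_{0.05} = 2.728$ for five algorithms):

```python
import numpy as np

def friedman_ff(avg_ranks, n_datasets):
    """Iman-Davenport F_F statistic from the average ranks of the k algorithms."""
    k, N = len(avg_ranks), n_datasets
    chi2 = 12.0 * N / (k * (k + 1)) * (np.sum(np.square(avg_ranks)) - k * (k + 1) ** 2 / 4.0)
    return (N - 1) * chi2 / (N * (k - 1) - chi2)

def nemenyi_cd(k, n_datasets, q_alpha=2.728):   # q_0.05 for k = 5 algorithms
    """Critical difference for the Nemenyi post hoc test."""
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * n_datasets))

print(nemenyi_cd(5, 10))   # ~1.929, as used for the 10 numerical datasets
print(nemenyi_cd(5, 20))   # ~1.364, as used for the tree-size comparison
```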
The accuracy of each method on the categorical and mixed datasets is shown in Table 9. MSDT achieves the best accuracy on 4 of the 10 datasets, and its average accuracy is 78.48%, which is 3.62%, 1.1%, 1.56%, and 1.88% higher than that of the other four trees, respectively. Using the average ranks of the 5 classifiers over the 10 datasets, we obtain $F_F = 1.48951$, while the critical value is $F(4, 36) = 2.634$. We therefore cannot reject the null hypothesis; that is, there is no significant difference among the five classifiers. In other words, on categorical and mixed data the advantage of the three multivariate decision trees over the two univariate decision trees is not obvious. For OC1 in particular, one-hot encoding transforms one categorical feature into several numerical features, which greatly increases the dimension of the feature space; in the new feature space the data become very sparse, and OC1 cannot find suitable split hyperplanes.
The tree size of each method on the 20 datasets is shown in Table 10. In terms of model complexity, the average number of nodes of the three multivariate decision trees is lower than that of the two univariate decision trees. Using the average ranks of the 5 classifiers over the 20 datasets, we obtain $F_F = 3.35294$. With 5 algorithms and 20 datasets, $F_F$ follows the F-distribution with 4 and 76 degrees of freedom, and the critical value is $F(4, 76) = 2.492$, so we reject the null hypothesis. The Nemenyi post hoc test gives a critical difference of CD = 1.364. Based on these results, MSDT shows a clear advantage over J48.
4.4. Comparison on Big Data
The covertype dataset from UCI [35] is a 7-class problem with 581,012 instances and 54 features, of which 10 are numerical and the remainder Boolean. MSDT and J48 treat the Boolean features as categorical, while CARTSL treats them as numerical. Ten repetitions of 10-fold cross-validation are used. Table 11 reports the accuracy, tree size, and tree-building time of the three classifiers.
The three classifiers achieve similar accuracy on covertype. In terms of tree size, MSDT has the fewest nodes. The running time in Table 11 is the time to build the tree and excludes the time for loading data and testing. J48 runs slower than CARTSL, which does not imply a significant difference in time complexity between the two algorithms; the gap may simply be caused by the different implementation languages. MSDT is the most expensive for two reasons. First, the time complexity of our split method is higher than that of the axis-parallel methods when splitting a node. Second, the axis-parallel methods mainly perform relational operations such as "<", whereas our method computes a large number of distances, which requires real-valued arithmetic. Although multiway splits reduce the number of node splits, the experiments show that the time consumed by our method is about 2 to 3 times that of the axis-parallel methods.
5. Conclusion
Decision trees generated with oblique splits often have better generalization ability and fewer nodes. However, most oblique split methods are time-consuming, cannot be used directly on categorical data, and in some cases can only handle binary classification. Our proposed algorithm, MSDT, uses feature weighting and clustering to perform multiway splits of nonleaf nodes and can be applied directly to multiclass problems, with a time complexity similar to that of axis-parallel algorithms. In addition, we define a representation of the cluster center and a distance from an instance to a cluster center that allow the clustering to be used on categorical and mixed data. Experimental results show that MSDT achieves good generalization accuracy on multiple types of data.
Data Availability
The data used to support the findings of this study are included in the article.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This research was supported by the National Natural Science Foundation of China under Grants 61772101, 61170169, and 61602075 and in part by the Ph.D. Scientific Research Starting Foundation of Liaoning Province under Grant 20180540084.