Abstract
Classical decision trees such as C4.5 and CART partition the feature space with axis-parallel splits. Oblique decision trees instead use splits based on linear combinations of features, which can simplify the boundary structure. Although oblique decision trees often achieve higher generalization accuracy, most oblique split methods cannot be applied directly to categorical data and are computationally expensive. In this paper, we propose a multiway splits decision tree (MSDT) algorithm that relies on feature weighting and clustering. The method can combine multiple numerical features, multiple categorical features, or mixed features. Experimental results show that MSDT performs well on all of these data types.
1. Introduction
Despite the great success of deep neural network (DNN) models in image processing, speech recognition, and other fields in recent years, decision trees remain competitive: they are interpretable, have fewer parameters, are robust to noise, and can be applied to large-scale data sets at low computational cost. Therefore, the decision tree is still one of the hotspots of machine learning today [1–3]. Research has mainly focused on construction methods for decision trees, split criteria [4], decision tree ensembles [5, 6], hybrids with other learners [7–9], decision trees for semisupervised learning [10], and so on.
Despite this practical success, constructing an optimal decision tree has been proven to be NP-complete [11]. To avoid local optima, some researchers adopted evolutionary algorithms to build decision trees [12–14]. However, owing to time complexity, the most popular algorithms, such as ID3 [15], C4.5 [16], and CART [17], and their various modifications [18] are greedy by nature and construct the decision tree in a top-down, recursive manner. Moreover, they act on only one feature at a time, which results in axis-parallel splits. During tree induction, if a candidate feature is numerical, a suitable cut point has to be searched for, and the instances in the training set are routed to the left or right child node according to the following test:
$$x_{i,l} \le c, \tag{1}$$
where $x_{i,l}$ denotes the value of instance $X_i$ on feature $A_l$ and $c$ is the cut point.
Axis-parallel trees have the advantages of fast induction and strong comprehensibility. However, when features are highly correlated, a very unfavorable situation may arise. Figure 1 gives an illustration: the axis-parallel splits must be repeated many times, producing a staircase-like structure that complicates the decision tree.

To overcome this limitation of axis-parallel decision trees, some researchers introduced oblique decision trees, in which a nonleaf node tests a linear combination of features, i.e.,
$$\sum_{l=1}^{p} a_l x_{i,l} \le c, \tag{2}$$
where $a_l$ represents the coefficient of the $l$th feature, $c$ is the threshold, and $p$ is the number of features. In Figure 1, the instances of the two classes can be separated completely by a single oblique split. Therefore, it is generally believed that oblique splits often produce smaller decision trees and better generalization performance on the same data.
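For illustration only (the coefficients and thresholds below are invented, not taken from any cited method), the two kinds of tests in (1) and (2) can be sketched in Python as:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 1.0],
              [2.5, 2.5]])            # three instances, two features

# axis-parallel test (1): compare a single feature with a cut point
feature, cut = 0, 2.0
left_axis = X[:, feature] <= cut       # -> [ True, False, False]

# oblique test (2): compare a linear combination of features with a threshold
a = np.array([0.6, -0.8])              # illustrative coefficients a_l
c = -0.5                               # illustrative threshold
left_oblique = X @ a <= c              # -> [ True, False,  True]

print(left_axis, left_oblique)
```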
Searching for the optimal oblique hyperplane is much harder than searching for the optimal axis-parallel hyperplane. To address this problem, numerous techniques have been applied, for example, hill climbing [17], simulated annealing [19], and genetic algorithms [20]. A large amount of work has been devoted to reducing the risk of falling into local optima: the Simulated Annealing Decision Tree (SADT) [19] uses simulated annealing, and OC1 [21] combines the ideas of CART-LC [17] and SADT.
When searching for oblique hyperplanes, both simulated annealing and genetic algorithms try thousands of candidates, resulting in low time efficiency. Many researchers therefore used linear discriminant analysis, linear regression, perceptrons, SVMs, and other methods to find suitable oblique hyperplanes. Fisher's decision tree (FDT) [22] exploits the dimensionality reduction of Fisher's linear discriminant and the decomposition strategy of decision trees to build an oblique decision tree, but it is only applicable to binary classification problems. Based on ADTree [23], Hong et al. [24] proposed the multivariate ADTree and discussed its variations (Fisher's ADTree, Sparse ADTree, and Regularized Logistic ADTree). Wickramarachchi et al. [25] proposed HHCART, which uses a series of Householder matrices to reflect the training data during tree construction. Shah and Sastry [26] defined a separability measure of instances as the split criterion, optimized it at each node, and presented the Alopex Perceptron Decision Tree algorithm. Menze et al. [27] presented an oblique tree forest that uses LDA and ridge regression to conduct oblique splits.
With the above oblique methods, trees with fewer nodes and better accuracy can be obtained. However, they also have deficiencies, mainly in three aspects.
1.1. Inability to Handle Categorical Data Directly
Oblique splits use linear combinations of features, so categorical features must first be converted into one or more numerical features [28]. This transformation may introduce new biases into the classification problem and thus reduce the generalization ability of the model.
1.2. High Time Cost
Oblique splits usually require complex matrix computations when linear discriminant analysis, ridge regression, or similar methods are used. Although these methods are more efficient than simulated annealing and genetic algorithms, they are still more costly than axis-parallel methods such as C4.5.
1.3. Some Methods Are Not Suitable for Multiclass Problems
Oblique split methods generally conduct binary splits. Although a binary tree can also be used directly for multiclass problems, some binary splits rely on the class label (e.g., FDA and the original SVM), which restricts algorithms such as FDT [22] to binary classification. In addition, some models need to convert multiclass problems into binary ones [7].
In order to overcome the above shortcomings, this paper proposes a multiway splits decision tree for multiple types of data (numerical, categorical, and mixed data). The specific characteristics of this method are as follows:
(i) Categorical features are handled directly.
(ii) The time complexity is similar to that of the axis-parallel split algorithms.
(iii) It is not necessary to convert multiclass problems into binary ones, because multiway splits are used directly.
The remainder of the paper is organized as follows. Section 2 briefly reviews the RELIEF-F and k-means algorithms. Section 3 presents our algorithm and discusses its time complexity. Section 4 presents and analyzes experimental comparisons with other decision trees. The last section concludes the paper.
2. Preliminaries
The proposed decision tree method weights the features with the RELIEF-F algorithm and splits the nodes with a weighted k-means algorithm. This section therefore reviews the two algorithms and their variants.
2.1. RELIEF-F Algorithms
The RELIEF algorithm [29] is popular for feature selection. It estimates feature weights according to the correlation between each individual feature and the class label. RELIEF randomly samples an instance $R$ from the training set and then searches for its two nearest neighbors $H$ and $M$: $H$ is from the same class (the near Hit) and $M$ is from a different class (the near Miss). If the distance between $R$ and $H$ on feature $A$ is smaller than the distance between $R$ and $M$, RELIEF increases the weight of $A$; otherwise, it decreases the weight.
In fact, RELIEF's estimate for feature $A$ approximates the following difference of probabilities:
$$W[A]=P(\text{different value of } A \mid \text{nearest instance from a different class})-P(\text{different value of } A \mid \text{nearest instance from the same class}), \tag{3}$$
where $P(\cdot \mid \cdot)$ denotes the conditional probability.
The RELIEF algorithm only deals with binary classification problems. Kononenko proposed an extension called RELIEF-F for multiclass problems [30]. The algorithm samples $m$ instances; for each sampled instance $R_i$, its $k$ nearest neighbors are searched for in each class.
The weight is updated as follows:
$$W[A]=W[A]-\sum_{j=1}^{k}\frac{\operatorname{diff}(A,R_i,H_j)}{m\,k}+\sum_{C\ne \operatorname{class}(R_i)}\frac{P(C)}{1-P(\operatorname{class}(R_i))}\sum_{j=1}^{k}\frac{\operatorname{diff}(A,R_i,M_j(C))}{m\,k}, \tag{4}$$
where $P(C)$ represents the proportion of class-$C$ instances among all instances and $M_j(C)$ represents the $j$th nearest neighbor of $R_i$ in class $C$. The function $\operatorname{diff}(A,R_1,R_2)$ calculates the difference between two instances $R_1$ and $R_2$ on feature $A$ as follows:
$$\operatorname{diff}(A,R_1,R_2)=\begin{cases}\dfrac{|R_1[A]-R_2[A]|}{\max(A)-\min(A)}, & A \text{ is numerical},\\[2mm] 0, & A \text{ is categorical and } R_1[A]=R_2[A],\\ 1, & A \text{ is categorical and } R_1[A]\ne R_2[A].\end{cases} \tag{5}$$
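For concreteness, the following is a simplified Python sketch of the weight update (4), with one nearest hit and one nearest miss per class and numerical features only; it is illustrative and not the implementation used in our experiments.

```python
import numpy as np

def relieff_weights(X, y, m=20, seed=0):
    """Simplified RELIEF-F (one hit / one miss per class), numerical features only."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xn = (X - X.min(axis=0)) / span            # so that diff() in (5) lies in [0, 1]
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    m = min(m, n)
    w = np.zeros(p)
    for i in rng.choice(n, size=m, replace=False):
        d = np.abs(Xn - Xn[i]).sum(axis=1)     # Manhattan distance to every instance
        d[i] = np.inf
        hit = int(np.argmin(np.where(y == y[i], d, np.inf)))
        w -= np.abs(Xn[i] - Xn[hit]) / m       # near-hit term of (4)
        for c in classes:                      # near-miss term of (4), one per class
            if c == y[i]:
                continue
            miss = int(np.argmin(np.where(y == c, d, np.inf)))
            w += prior[c] / (1.0 - prior[y[i]]) * np.abs(Xn[i] - Xn[miss]) / m
    return w
```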
2.2. k-Means, k-Modes, and k-Prototypes
The k-means algorithm is widely used in real-world applications owing to its simplicity and efficiency.
Let $D=\{X_1, X_2, \ldots, X_n\}$ be a set of instances, where each instance $X_i$ is described by a set of $p$ features and the instances are to be clustered into $k$ clusters $C_1, \ldots, C_k$. First, randomly pick $k$ instances as the centers $Z_1, \ldots, Z_k$ of the initial clusters, and then assign a cluster label to each instance as follows:
$$\operatorname{label}(X_i)=\arg\min_{1\le j\le k} d(X_i, Z_j), \tag{6}$$
where $d(\cdot,\cdot)$ is the Euclidean distance.
After all the instances are partitioned, each cluster center is updated by the following formula:
$$Z_j=\frac{1}{|C_j|}\sum_{X_i \in C_j} X_i. \tag{7}$$
Formulas (6) and (7) are repeated until the objective in formula (8) converges to a local optimum or the preset number of iterations is reached:
$$E=\sum_{j=1}^{k}\sum_{X_i \in C_j}\|X_i-Z_j\|^2. \tag{8}$$
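A direct transcription of the update rules (6)–(8) reads as follows (a minimal sketch with a fixed iteration budget):

```python
import numpy as np

def kmeans(X, centers, max_iter=10):
    """Plain k-means: assign by (6), update centers by (7), track the objective (8)."""
    Z = centers.copy()
    for _ in range(max_iter):
        # (6): each instance joins the cluster with the nearest center
        dist = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # (7): each center becomes the mean of its members
        for j in range(len(Z)):
            if np.any(labels == j):
                Z[j] = X[labels == j].mean(axis=0)
        # (8): sum of squared distances to the assigned centers
        sse = ((X - Z[labels]) ** 2).sum()
    return labels, Z, sse

# toy usage: six points, two clusters, the first and fourth points as initial centers
X = np.array([[0., 0.], [0., 1.], [0.5, 0.5], [5., 5.], [5., 6.], [5.5, 5.5]])
labels, Z, sse = kmeans(X, X[[0, 3]])
```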
However, the classical k-means only works on numerical data. The k-modes and k-prototypes algorithms are variants of k-means for categorical and mixed data, respectively [31]. When k-modes processes categorical variables, the center of each cluster is represented by the modes. When computing the distance between an instance and a cluster center, the distance on each feature is calculated by formula (5) and then accumulated.
It is straightforward to integrate k-means and k-modes into k-prototypes. The distance between instance $X_i$ and cluster center $Z_j$ is
$$d(X_i,Z_j)=d_n(X_i,Z_j)+\gamma\,d_c(X_i,Z_j), \tag{9}$$
where $d_n$ and $d_c$ represent the distances on the numerical and categorical variables, respectively, and $\gamma$ adjusts the relative contributions of $d_n$ and $d_c$.
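A minimal sketch of the mixed distance (9), with squared Euclidean distance on the numerical part and 0/1 mismatches (formula (5)) on the categorical part:

```python
import numpy as np

def prototype_distance(x_num, x_cat, z_num, z_cat_modes, gamma=1.0):
    """k-prototypes style distance (9): numerical part plus gamma * categorical part."""
    d_num = ((x_num - z_num) ** 2).sum()                      # distance on numerical variables
    d_cat = sum(a != b for a, b in zip(x_cat, z_cat_modes))   # 0/1 mismatches, formula (5)
    return d_num + gamma * d_cat

# e.g. an instance with numerical part [1.0, 2.0] and categorical part ("red", "yes")
d = prototype_distance(np.array([1.0, 2.0]), ("red", "yes"),
                       np.array([0.5, 2.5]), ("red", "no"), gamma=0.5)   # 0.5 + 0.5 = 1.0
```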
3. Our Proposed Algorithm
Our proposed MSDT differs from most oblique methods in three ways: (i) MSDT does not use greedy methods to pursue the maximum impurity reduction, (ii) MSDT uses a combination of multiple variables to perform multiway splits at nonleaf nodes, and (iii) MSDT treats categorical features in a way similar to numerical features.
3.1. Multiway Splits
Most oblique methods conduct binary splits, while the proposed algorithm performs multiway splits; that is, in one split, multiple hyperplanes are generated simultaneously and the feature space is divided into several disjoint regions. Ho [32] categorized linear split methods into three types, axis-parallel linear splits, oblique linear splits, and piecewise linear splits; our method falls into the third category. Piecewise linear split methods find anchors in the feature space, and each instance is assigned to its nearest anchor. Figure 2 shows a 5-way split of a two-dimensional feature space.

3.2. Location of Anchor
Finding suitable split hyperplanes is the key problem in most decision tree induction algorithms. Under piecewise linear splits, the problem of finding appropriate hyperplanes is equivalent to that of finding appropriate anchors. Usually, anchors can be chosen as the class centroids or as cluster centers generated by some clustering algorithm. In MSDT, we first use RELIEF-F to weight the features and then use k-means with a weighted distance to cluster the instances.
3.2.1. Why Do We Use k-Means?
If the instances are linearly separable, it is obviously more efficient to simply use the class centroids rather than cluster centers as anchors. However, when the instances of some class are distributed over different regions of the feature space, the class centroids may no longer be suitable anchors. For example, in Figure 3 the circular instances are distributed over two separate areas. If the solid line, which is perpendicular to the line between the two class centroids, is used to separate the instances, the result is clearly unsatisfactory. The instances in Figure 3 obviously form two clusters; if they are divided by the dotted line, the perpendicular bisector of the segment between the two cluster centers, at least the circular instances on the right side of the figure can be distinguished.

The proposed split method is based on the clustering assumption, which states that samples belonging to the same cluster belong to the same class. k-means partitions instances according to a (dis)similarity measure; hence, the leaf nodes of MSDT can be regarded as prototypes, and the class of a test instance depends on which prototype the instance is most similar to.
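To make the prototype view concrete, the following minimal Python sketch (a hypothetical two-level structure, not the paper's implementation) routes a test instance to the leaf whose anchor it is nearest to:

```python
import numpy as np

def route(x, node):
    """Follow the nearest-anchor child until a leaf is reached; return its class."""
    while node.get("children"):                     # nonleaf: pick the nearest anchor
        dists = [np.linalg.norm(x - np.asarray(c["anchor"])) for c in node["children"]]
        node = node["children"][int(np.argmin(dists))]
    return node["label"]

# tiny hand-built tree with two anchors (prototypes) at the root
tree = {"children": [
    {"anchor": [0.0, 0.0], "label": "A"},
    {"anchor": [5.0, 5.0], "label": "B"},
]}
print(route(np.array([1.0, 0.5]), tree))   # -> "A"
```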
Univariate decision trees produce a comprehensible classification model because of their knowledge representation: a decision tree is a graphical representation that can easily be converted into a set of rules written in natural language. Some researchers believe that multivariate decision trees cannot be converted into comprehensible rules; others argue that a multivariate tree with fewer nodes is easier to understand. MSDT is easy to understand for two reasons. First, MSDT has fewer nodes than univariate decision trees. Second, similarity to a prototype is easy for users to understand and can replace the rules generated by a univariate decision tree.
3.2.2. Why Do We Weight Features?
The original k-means is an unsupervised clustering algorithm suited to unlabeled data, and its optimization goal is to minimize (8). The goal of a split, however, is to reduce the class impurity of the current node as much as possible; the two goals are not the same. Therefore, we estimate the correlation between each feature and the label in order to weight the features. When computing the distance from an instance to a cluster center, a feature strongly related to the label receives a larger weight, which enlarges its contribution to the distance, and an uncorrelated feature receives a smaller weight, which reduces its contribution. In this way, the optimization goal of k-means is brought closer to that of the node split.
Figure 4 shows an example of the effectiveness of feature weighting. The solid line is obtained with unweighted features, and the dotted line with weighted features, where the weight of A1 is 0.05 and the weight of A2 is 0.95. Clearly, some previously misassigned instances are now assigned correctly.

To further illustrate the role of feature weighting, we carry out a simple experiment on the iris dataset: its 150 samples come from three classes of 50 samples each. Using k-means directly for clustering yields 10 misclassified samples; the detailed results are shown in Table 1.
Then, we use the RELIEF-F algorithm to calculate the weights of the four features, which are 0.09, 0.14, 0.34, and 0.39, respectively. In the k-means clustering, the distance between an instance and a cluster center is calculated by
$$d_w(X_i,Z_j)=\sqrt{\sum_{l=1}^{p} w_l\,(x_{i,l}-z_{j,l})^2}, \tag{10}$$
where $p$ is the number of features and $w_l$ is the weight of the $l$th feature. This yields 6 misclassified samples; the detailed results are shown in Table 2.
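The following snippet illustrates the weighted distance (10) on iris using the weights reported above; it performs only a single weighted assignment to the class centroids, so its misclassification count need not match Tables 1 and 2 exactly.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# weights in the spirit of Section 3.2.2 (the RELIEF-F weights reported in the text)
w = np.array([0.09, 0.14, 0.34, 0.39])

def weighted_dist(X, Z, w):
    # formula (10): weighted Euclidean distance of every instance to every center
    return np.sqrt((w * (X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))

# class centroids as centers, then one weighted assignment step
Z = np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])
labels = weighted_dist(X, Z, w).argmin(axis=1)
print("disagreements with the true labels:", (labels != y).sum())
```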
Our proposed split method is shown as Algorithm 1 and is used to split nodes on numerical data.
Algorithm 1: The proposed node split method (multi_split) for numerical data.
In the fifth step of Algorithm 1, $I_{\max}$ represents the maximum number of iterations. In the experiments, we set it to 6 by default. Such a small value is chosen mainly for time efficiency; moreover, since the purpose of the clustering is only to split a node, the partition is still acceptable even if the clustering has not converged.
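As a hedged illustration of the main loop of Algorithm 1 for numerical data (assuming, as in Section 4.2.1, that the class centroids serve as the initial cluster centers and that the feature weights w come from RELIEF-F; the function and variable names are ours, not those of the original implementation):

```python
import numpy as np

def multi_split_numerical(X, y, w, i_max=6):
    """Sketch of Algorithm 1 for numerical data: weighted k-means initialized at the
    class centroids; w are the RELIEF-F feature weights (step 2 of Algorithm 1)."""
    classes = np.unique(y)
    Z = np.stack([X[y == c].mean(axis=0) for c in classes])   # initial anchors
    for _ in range(i_max):                                    # at most I_max passes
        # weighted distance of every instance to every center, formula (10)
        dist = np.sqrt((w * (X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))
        labels = dist.argmin(axis=1)                          # multiway assignment
        for j in range(len(Z)):                               # recompute the centers
            if np.any(labels == j):
                Z[j] = X[labels == j].mean(axis=0)
    return labels, Z
```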
3.3. Categorical Feature
As mentioned in the previous subsection, the split method can be applied directly to numerical features. For categorical features, the RELIEF-F algorithm can still be used to weight the features; however, in the clustering process, the representation of the cluster center and the distance from an instance to a cluster center need to be redefined.
The k-modes algorithm extends k-means by replacing the means of numerical variables with the modes of categorical variables. However, the resulting distance is less precise. Moreover, when a feature has several modes, choosing different modes may lead to opposite conclusions.
Here is an example. Suppose there are two clusters $C_1$ and $C_2$ described by two categorical features $A_1$ and $A_2$, and each cluster contains 10 instances, as shown in Table 3. The mode of $A_1$ is the same for $C_1$ and $C_2$, which makes $A_1$ useless for distinguishing the distances between instances and the two clusters. For $A_2$, there are two modes (say $a$ and $b$) in $C_1$ and $C_2$, respectively. Suppose there is an instance $X$ with $A_2 = a$. If $a$ is selected as the center value of $C_1$ and $b$ for $C_2$, the distance between $X$ and $C_1$ is 0 and the distance between $X$ and $C_2$ is 1; hence, $X$ is nearer to $C_1$. If $b$ is selected as the center value of $C_1$ and $a$ for $C_2$, the distance between $X$ and $C_1$ is 1 and the distance between $X$ and $C_2$ is 0; hence, $X$ is nearer to $C_2$.
To avoid the imprecision and ambiguity of a mode-based distance measure, we represent the cluster center by the probability estimate of each categorical feature value and define a function to calculate the distance from an instance to a cluster center.
Let $D$ be a set of $n$ categorical instances described by $p$ categorical features and partitioned into $k$ clusters. The $l$th feature $A_l$ takes $v_{jl}$ different values in the $j$th cluster $C_j$, $1 \le j \le k$, $1 \le l \le p$.
Definition 1. $D_{a_{l,t}}^{j}$ represents the set of instances in $C_j$ whose value on feature $A_l$ is $a_{l,t}$, where $1 \le t \le v_{jl}$. The conditional probability is estimated as follows:
$$P(a_{l,t} \mid C_j)=\frac{|D_{a_{l,t}}^{j}|}{|C_j|}. \tag{11}$$
$V_l^{j}$ is the summary of all values of $A_l$ in $C_j$, defined as follows:
$$V_l^{j}=\left\{\big(a_{l,t},\,P(a_{l,t} \mid C_j)\big)\;\middle|\;1 \le t \le v_{jl}\right\}. \tag{12}$$
Definition 2. The center of $C_j$ is represented by the following vector:
$$Z_j=\big(V_1^{j}, V_2^{j}, \ldots, V_p^{j}\big). \tag{13}$$
Definition 3. $d(a, V_l^{j})$ represents the distance between a value $a$ of feature $A_l$ and the summary $V_l^{j}$:
$$d(a, V_l^{j})=1-P(a \mid C_j). \tag{14}$$
Definition 4. $d_w(X_i, Z_j)$ represents the weighted distance between instance $X_i$ and center $Z_j$:
$$d_w(X_i,Z_j)=\sum_{l=1}^{p} w_l\, d(x_{i,l}, V_l^{j}). \tag{15}$$
According to formula (15), with the weights of both features equal to 1 in the above example, the distances between instance $X$ and the two cluster centers ($Z_1$ and $Z_2$) in Table 3 are $0.7 = 0.1 + 0.6$ and $1.2 = 0.6 + 0.6$, respectively. This means that $X$ is closer to $C_1$, which accords with human intuition.
To cluster categorical data, formula (13) replaces formula (7) in steps 4 and 7 of Algorithm 1, and formula (15) replaces formula (10) in step 6.
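Under the reconstruction above, in particular the assumption that the per-value distance (14) is one minus the estimated conditional probability, the cluster-center representation (13) and the weighted distance (15) can be sketched as follows:

```python
from collections import Counter

def categorical_center(cluster_rows):
    """Formula (13): one value-probability table per feature instead of a single mode."""
    n = len(cluster_rows)
    p = len(cluster_rows[0])
    return [
        {value: count / n
         for value, count in Counter(row[l] for row in cluster_rows).items()}
        for l in range(p)
    ]

def categorical_distance(x, center, weights):
    """Formula (15): weighted sum of per-feature distances (14) = 1 - P(value | cluster)."""
    return sum(w * (1.0 - table.get(v, 0.0)) for v, table, w in zip(x, center, weights))

# toy usage: two 3-instance clusters over two categorical features
c1 = categorical_center([("a", "x"), ("a", "y"), ("b", "x")])
c2 = categorical_center([("b", "y"), ("b", "y"), ("a", "x")])
print(categorical_distance(("a", "x"), c1, [1.0, 1.0]))   # 1/3 + 1/3
print(categorical_distance(("a", "x"), c2, [1.0, 1.0]))   # 2/3 + 2/3
```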
3.4. Mixed Features Data
For mixed data, the cluster center vector consists of two parts: the means of the numerical features and the vector in (13) for the categorical features. In this case, the distance from an instance to a cluster center is calculated by (9), where $d_n$ and $d_c$ are obtained by (10) and (15), respectively. Since the ratio of numerical to categorical features differs across datasets, we choose the $\gamma$ in (9), from a set of candidate values, that yields the greatest reduction of the Gini index.
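A sketch of the γ selection is given below; the candidate set and the split_with_gamma callback (which would return the child assignment produced by clustering with a given γ) are illustrative placeholders, since the text does not specify them.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def best_gamma(X_num, X_cat, y, split_with_gamma, candidates=(0.5, 1.0, 2.0)):
    """Pick the gamma in (9) whose induced partition reduces the Gini index the most."""
    parent = gini(y)
    best, best_drop = None, -np.inf
    for g in candidates:
        labels = split_with_gamma(X_num, X_cat, y, g)   # child index for every instance
        children = [y[labels == j] for j in np.unique(labels)]
        weighted = sum(len(c) / len(y) * gini(c) for c in children)
        drop = parent - weighted
        if drop > best_drop:
            best, best_drop = g, drop
    return best
```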
3.5. MSDT and Time Complexity Analysis
The multi_split function is used to split the nodes. Algorithm 2 describes the construction process of MSDT.
Algorithm 2: Construction of MSDT.
In step 2 of Algorithm 1, RELIEF-F is used to obtain the feature weights. The time complexity of RELIEF-F is $O(pnm)$, where $p$ is the number of features, $n$ is the number of instances, $m$ is the sampling number, and $k'$ is the number of nearest neighbors, whose effect on the complexity is negligible. In this paper, $m$ is set to a small constant and $k'$ is set to 1, so the time complexity of RELIEF-F here is $O(pn)$.
Steps 4 to 9 of Algorithm 1 form the clustering process, whose time complexity is $O(Iknp)$, where $k$ is the number of clusters and $I$ is the number of iterations. When Algorithm 1 is used to split nodes, the maximum number of iterations $I_{\max}$ is 6, so the time complexity may reach $O(6knp)$ in the worst case.
Considering the above two parts, the time complexity of Algorithm 1 is $O(pn + 6knp)$, that is, $O(knp)$. Compared with the time complexity of classical axis-parallel splits, there is an extra factor of $k$; when $k$ is large, the algorithm is less efficient than the axis-parallel algorithms. Compared with binary splits, if the resulting decision trees have the same number of nodes, a $k$-way split clearly requires fewer split operations than a sequence of binary splits.
OC1 [21] is a classic oblique decision tree whose worst-case time complexity is considerably higher than that of axis-parallel methods; the time complexities of HHCART(A) and HHCART(D) are analyzed in [25]. In [22], the speed of FDT for splitting a node is close to, or even better than, that of the axis-parallel split methods. Unfortunately, FDT can only be applied to binary classification problems.
In summary, when $k$ is small, the efficiency of the proposed split method is close to that of classical axis-parallel split methods and better than that of most oblique split methods.
4. Experiments
In this section, experimental results demonstrate the effectiveness and performance of the proposed algorithm. The first part illustrates the effectiveness of clustering, feature weighting, and the new distance calculation for categorical features. The second part compares MSDT with classical decision trees and two other oblique trees. Finally, we use the larger covertype dataset to compare MSDT with two axis-parallel trees.
4.1. Datasets
As shown in Table 4, 20 UCI datasets [33] are used to evaluate the proposed algorithm. Their numbers of instances, numbers of classes, and feature types (numerical data, datasets 1–10; categorical data, datasets 11–15; mixed data, datasets 16–20) vary and are sufficiently representative to demonstrate the performance of MSDT. In the Features column, the two numbers give the numbers of numerical and categorical features, respectively. Abalone is treated as a 3-class classification problem (grouping classes 1–8, 9–10, and 11 and above).
4.2. Comparison of Different Piecewise Linear Split Methods
Piecewise linear split methods can be summarized in two steps: first, find appropriate anchors; then, assign each instance to its nearest anchor. On this basis, our proposed algorithm introduces three improvements: feature weighting, clustering, and a dedicated treatment of categorical features. This section combines these three changes into multiple split functions and compares their performance on multiple types of data. The functions are listed in Table 5.
Pessimistic pruning is applied after the decision trees are generated. In addition, all reported results are averages over 10 repetitions of 10-fold cross-validation.
4.2.1. Numerical Data
For numerical data, the proposed algorithm uses weighted k-means to optimize the cluster center positions. To demonstrate the roles of clustering and feature weighting, we implement four different node split functions to generate decision trees. Fun0 directly uses the centroid of each class as an anchor; instances are assigned to the nearest anchor using the Euclidean distance. Fun1 also uses the class centroids as anchors, but before assigning each instance to its nearest anchor, RELIEF-F is used to calculate the feature weights, features whose weights are less than 1/5 of the maximum are removed, and the distance is calculated by formula (10). Fun2 uses the class centroids as the initial cluster centers of k-means and takes the k-means output as the partition. Fun3 combines Fun1 with Fun2 and is our proposed algorithm for numerical data.
Table 6 gives the classification accuracy of the four functions on the 10 numerical datasets; the best entry in each row is bolded. Fun3 achieves the best accuracy on 9 of the 10 datasets, and its average accuracy is 4.16% higher than that of Fun0; in particular, the accuracy increases by more than 8% on Glass and Letter. Averaged over the 10 datasets, Fun1 is about 1.07% higher than Fun0 and Fun3 is 1.39% higher than Fun2, which shows that feature weighting improves classification performance. Fun2 is about 2.77% higher than Fun0 and Fun3 is 3.09% higher than Fun1, an improvement attributable to clustering.
4.2.2. Categorical and Mixed Data
On the categorical and mixed data, we implement eight different split functions to generate decision trees. Fun0 directly uses the centroid of each class as an anchor; for categorical features, modes replace the means as components of the anchors, and the distance between an instance and an anchor is obtained by summing the per-feature distances of formula (5). Fun1 differs from Fun0 in that the weight of each feature is calculated by RELIEF-F, features whose weights are less than 1/5 of the maximum are removed, and the distance between the instance and the anchor is obtained by formulas (15) and (9). Fun2 adds a clustering step to Fun0, using k-modes for categorical data and k-prototypes for mixed data. Fun3 combines Fun1 and Fun2. Fun4–7 correspond to Fun0–3, respectively, except that on the categorical features the cluster centers and distances are computed as described in Section 3.3 (formulas (13) and (15), respectively). Fun7 is our proposed algorithm for categorical and mixed data.
Table 7 gives the classification accuracy of the eight functions on the 5 categorical datasets (Balance, Car, Chess, Hayes, and MONK) and the 5 mixed datasets (Abalone, CMC, Flags, TAE, and Zoo); the best entry in each row is bolded. Except on CMC and Zoo, Fun7 obtains the best accuracy, and its average is 11.77% higher than that of Fun0. Fun1 is better than Fun0, Fun3 better than Fun2, Fun5 better than Fun4, and Fun7 better than Fun6, with an average improvement of 5.37%; this is the contribution of feature weighting. Likewise, Fun2 is better than Fun0, Fun3 better than Fun1, Fun6 better than Fun4, and Fun7 better than Fun5, with an average improvement of 4.45%; this is due to clustering. Meanwhile, Fun4–7 are on average 1.48% better than Fun0–3, an improvement that comes from using the statistical distribution of feature values instead of the modes.
4.3. Comparison with Other Decision Trees
To verify the performance of the proposed algorithm, we selected four decision trees for comparison: J48 (WEKA's implementation of C4.5), CARTSL (scikit-learn's implementation of an optimized CART), OC1, and HHCART(A). Since CARTSL and OC1 do not support categorical features, categorical features are converted into numerical ones by one-hot encoding. Ten repetitions of 10-fold cross-validation are used to report the average accuracy and tree size of the 5 classifiers on the test set. The Friedman test and the Nemenyi test are used to analyze the differences among the algorithms.
The accuracy of each method on the numerical datasets is shown in Table 8. MSDT achieves the best accuracy on 5 of the 10 datasets, and its average accuracy is 81.68%, which is 1.91%, 4.42%, 1.15%, and 1.81% higher than that of the other four trees, respectively. To further examine the differences among the classifiers, the Friedman test is used: the average ranks of the 5 classifiers over the 10 datasets are used to compute the statistic $F_F$. With 5 algorithms and 10 datasets, $F_F$ follows the F-distribution with 4 and 36 degrees of freedom, and the critical value is $F(4, 36) = 2.634$. The computed $F_F$ exceeds this value, so we reject the null hypothesis; that is, there are significant differences among the five classifiers. The Nemenyi method is used for the post hoc test. The critical difference (CD) is obtained by the following formula:
$$\mathrm{CD}=q_{\alpha}\sqrt{\frac{k(k+1)}{6N}},$$
where $k$ is the number of algorithms and $N$ is the number of datasets. With $k = 5$, $N = 10$, and significance level $\alpha = 0.05$ ($q_{0.05} = 2.728$), the calculated critical difference is CD = 1.92899. Based on these results, MSDT and OC1 show clear performance advantages over CARTSL.
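For reference, the statistics used here can be computed as follows (the Iman–Davenport form of the Friedman statistic and the Nemenyi critical difference; $q_{0.05} = 2.728$ for five algorithms):

```python
import numpy as np

def friedman_ff(avg_ranks, n_datasets):
    """Iman-Davenport F_F statistic from the average ranks of the k algorithms."""
    k, N = len(avg_ranks), n_datasets
    chi2 = 12.0 * N / (k * (k + 1)) * (np.sum(np.square(avg_ranks)) - k * (k + 1) ** 2 / 4.0)
    return (N - 1) * chi2 / (N * (k - 1) - chi2)

def nemenyi_cd(k, n_datasets, q_alpha=2.728):   # q_0.05 for k = 5 algorithms
    """Critical difference for the Nemenyi post hoc test."""
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * n_datasets))

print(nemenyi_cd(5, 10))   # ~1.929, as used for the 10 numerical datasets
print(nemenyi_cd(5, 20))   # ~1.364, as used for the tree-size comparison
```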
The accuracy of each method on the categorical and mixed datasets is shown in Table 9. MSDT achieves the best accuracy on 4 of the 10 datasets, and its average accuracy is 78.48%, which is 3.62%, 1.1%, 1.56%, and 1.88% higher than that of the other four trees, respectively. Using the average ranks of the 5 classifiers over the 10 datasets, we obtain $F_F = 1.48951$, while the critical value is $F(4, 36) = 2.634$. We therefore cannot reject the null hypothesis; that is, there is no significant difference among the five classifiers. In other words, on categorical and mixed data the advantage of the three multivariate decision trees over the two univariate decision trees is not obvious. For OC1 in particular, one-hot encoding transforms one categorical feature into several numerical features, which greatly increases the dimension of the feature space; in the new feature space the data become very sparse, and OC1 cannot find suitable split hyperplanes.
The tree size of each method on the 20 datasets is shown in Table 10. In terms of model complexity, the average number of nodes of the three multivariate decision trees is lower than that of the two univariate decision trees. Using the average ranks of the 5 classifiers over the 20 datasets, we obtain $F_F = 3.35294$. With 5 algorithms and 20 datasets, $F_F$ follows the F-distribution with 4 and 76 degrees of freedom, and the critical value is $F(4, 76) = 2.492$, so we reject the null hypothesis. The Nemenyi post hoc test gives a critical difference of CD = 1.364. Based on these results, MSDT shows a clear advantage over J48.
4.4. Comparison on Big Data
The covertype dataset from UCI [35] is a 7-class problem with 581,012 instances and 54 features, of which 10 are numerical and the remainder Boolean. MSDT and J48 treat the Boolean features as categorical, while CARTSL treats them as numerical. Ten repetitions of 10-fold cross-validation are used. Table 11 reports the accuracy, tree size, and tree-building time of the three classifiers.
The three classifiers achieve similar accuracy on covertype. In terms of tree size, MSDT has the fewest nodes. The running time in Table 11 is the time to build the tree and excludes the time for loading data and testing. J48 runs slower than CARTSL, which does not imply a significant difference in time complexity between the two algorithms; the gap may simply be caused by the different implementation languages. MSDT is the most expensive for two reasons. First, the time complexity of our split method is higher than that of the axis-parallel methods when splitting a node. Second, the axis-parallel methods mainly perform relational operations such as "<", whereas our method computes a large number of distances, which requires real-valued arithmetic. Although multiway splits reduce the number of node splits, the experiments show that the time consumed by our method is about 2 to 3 times that of the axis-parallel methods.
5. Conclusion
Decision trees generated with oblique splits often have better generalization ability and fewer nodes. However, most oblique split methods are time-consuming, cannot be used directly on categorical data, and in some cases can only handle binary classification. Our proposed algorithm, MSDT, uses feature weighting and clustering to perform multiway splits of nonleaf nodes and can be applied directly to multiclass problems, with a time complexity similar to that of axis-parallel algorithms. In addition, we define a representation of the cluster center and a distance from an instance to a cluster center that allow the clustering to be used on categorical and mixed data. Experimental results show that MSDT achieves good generalization accuracy on multiple types of data.
Data Availability
The data used to support the findings of this study are included in the article.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This research was supported by the National Natural Science Foundation of China under Grants 61772101, 61170169, and 61602075 and in part by the Ph.D. Scientific Research Starting Foundation of Liaoning Province under Grant 20180540084.