Abstract

We conducted a comparative analysis of several supervised dimension reduction methods by combining customary multiobjective standard metrics, and validated the comparative efficacy of supervised learning algorithms in dependence on data and sample complexity. The question of sample and data complexity is considered for both automated selection and user-specified instances. Different dimension reduction techniques are sensitive to different scales of measurement, and the supervision of learning is also discussed comprehensively. As expected, each technique showed different competence on different datasets, and no trustworthy way to gauge a general ranking of the methods was available. We therefore focused on classifier ranking and devised a system built on the weighted average rank, called the weighted mean rank risk adjusted model (WMRRAM), for the consensus ranking of supervised learning classifier algorithms.

1. Introduction

In the recent history of data science, selecting an adequate algorithm has been a challenging task. It becomes a significant concern because many different learning and classification algorithms are at hand, coming from areas as diverse as machine learning, neural networks, and statistics, and their performance may differ noticeably across datasets. Several studies have addressed the issue of finding a better algorithm [1]. Following the ideas of [24], we tackled the issue of selecting the best learning algorithm by trying many supervised learning algorithms on different real datasets. Usually this is not viable, because there are numerous algorithms to try out, some of which are sluggish, especially on big datasets. In this context, the no free lunch (NFL) theorem suggests that looking for a single best classifier is not feasible, as all classifiers perform equally well when their performance is averaged over all possible evaluation characteristics [1].

On the other hand, the no free lunch (NFL) theorem shows that if algorithm A performs better than algorithm B on certain characteristics, then there are numerous other characteristics on which B performs better than A. Hence, some other approach is required, since no single dominant algorithm can be identified for use in all circumstances. The selection of an algorithm is an exploratory process that depends entirely on the knowledge and expertise of the analyst. It is typically problematic to identify a single consistently best algorithm; a good alternative is to provide a ranking of the algorithms. This research study focuses on the consensus ranking of supervised learning algorithms in dependence on data and sample complexity. It also provides an up-to-date overview of different dimensionality reduction techniques, such as linear, nonlinear, supervised, and unsupervised techniques, according to their measurement level.

2. Dimensionality of the Data and Dimension Reduction

Dimensionality is the number of attributes/features/variables measured for each observation in a dataset. Different fields use different names for the p-dimensional multivariate vector: the term "variable" is frequently used in statistics, while "feature" and "attribute" are its substitutes in the machine learning and computer science literature [5]. Dimension reduction is an effective method of rationalizing high-dimensional data by projecting a set of high-dimensional vectors onto a lower-dimensional space while retaining the essential information among them. Dimension reduction is a recurring theme in statistics and in related fields such as machine learning and information theory; it reduces the number of variables, features, or attributes under consideration by obtaining a set of principal variables. A higher number of features makes it difficult to visualize and analyze the data, and often most of these features are correlated and hence redundant; in this situation, dimension reduction techniques come into play. Dimension reduction techniques can be divided into feature selection and feature extraction.

3. Dimension Reduction Techniques and Different Measurement Levels of Datasets

We begin with the idea that dimension reduction techniques can be divided into linear and non-linear [6] and further subdivided into supervised and unsupervised techniques. Knowing and understanding the underlying level of measurement is essential for every data scientist as the choice of the technique is based on it [7, 8]. Table 1 depicts the details.

4. Datasets: Description and Exploration

A total of six datasets from the UCI Machine Learning Repository, among them Iris, Abalone, Bean, Car, and Diabetes, were taken as benchmarks and preprocessed for the comparison and ranking of the supervised dimension reduction techniques. The datasets contain target and input attributes, where the targets are discrete and the inputs continuous. The characteristics of the individual datasets are explored in Table 2. As the datasets were gathered from various sources, some preprocessing was necessary to bring all datasets into the required format. The datasets were then split into training and test parts using the widely used holdout method, dividing the data into 75% training and 25% testing, and into 60% training and 40% testing, as detailed in Table 3.
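The holdout procedure described above can be sketched in a few lines. This is a minimal illustration in Python; the function name `holdout_split` and the toy dataset are hypothetical stand-ins, not the study's actual code:

```python
import random

def holdout_split(data, train_fraction=0.75, seed=42):
    """Shuffle the dataset and split it into training and test parts."""
    rng = random.Random(seed)   # fixed seed for reproducibility
    shuffled = data[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy stand-in for a UCI dataset: 100 (features, label) pairs.
dataset = [([i, i * 2], i % 3) for i in range(100)]
train, test = holdout_split(dataset, train_fraction=0.75)
print(len(train), len(test))   # 75 25
```

The same call with `train_fraction=0.60` yields the 60%/40% split used in Table 3.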

5. Description of Supervised Machine Learning Classifiers

Supervised learning classifiers are machine learning predictive models based on the classification learning technique [9]. Classification is a supervised dimension reduction technique that assigns a class to a set of data sharing specific attributes with the corresponding standards. In dimension reduction, supervised learning techniques include the multilayer perceptron, linear discriminant analysis, the naive Bayes classifier, random tree, iterative dichotomizer 3 (ID3), C4.5 (an advanced form of ID3), classification and regression tree (C-RT), CS-CRT, and CS-MC4 [10].

5.1. Naive Bayes

Naive Bayes is one of the most popular machine learning algorithms: a probabilistic method grounded in Bayes' theorem that makes strong independence assumptions concerning the attributes [11].
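To make the independence assumption concrete, here is a minimal sketch of a categorical naive Bayes classifier with add-one (Laplace) smoothing; the toy weather-style data and all names are illustrative, not the study's implementation:

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(class) and per-attribute value counts for P(value | class)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(int)   # (attr_index, value, class) -> count
    values_per_attr = defaultdict(set)
    for row, cls in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, v, cls)] += 1
            values_per_attr[i].add(v)
    return class_counts, value_counts, values_per_attr, len(labels)

def predict_nb(model, row):
    """Pick the class maximizing log P(class) + sum of log P(value | class)."""
    class_counts, value_counts, values_per_attr, n = model
    best_cls, best_logp = None, -math.inf
    for cls, c_count in class_counts.items():
        logp = math.log(c_count / n)                # log prior
        for i, v in enumerate(row):                 # independence assumption:
            num = value_counts[(i, v, cls)] + 1     # add-one smoothing
            den = c_count + len(values_per_attr[i])
            logp += math.log(num / den)             # sum of log likelihoods
        if logp > best_logp:
            best_cls, best_logp = cls, logp
    return best_cls

# Toy data: (outlook, windy) -> decision
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["play", "play", "play", "stay"]
model = train_nb(rows, labels)
print(predict_nb(model, ("sunny", "no")))  # play
```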

5.2. K-Nearest Neighbors (K-NN)

K-nearest neighbor is a simple supervised machine learning algorithm that stores all available instances and classifies new data based on a similarity measure. K in K-NN refers to the number of nearest neighbors to select. The nearest neighbor (NN) algorithm determines the classification of an unknown data point from its nearest neighbor, whose class label is known in advance; more than one nearest neighbor may be used to determine the class to which a data point belongs, and the approach is therefore referred to as a memory-based technique. It is a pattern recognition technique among the supervised classification methods of dimension reduction [12].
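The memory-based idea can be sketched as a distance sort followed by a majority vote among the k closest stored instances; the two toy clusters below are illustrative:

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Two toy clusters in 2-D.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (0.3, 0.3), k=3))  # A
print(knn_predict(points, labels, (5.0, 5.1), k=3))  # B
```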

5.3. Iterative Dichotomizer 3 (ID3)

Iterative dichotomizer 3 is the most common supervised decision tree algorithm; it requires a fixed set of observations to build a tree. It divides the attributes into two groups, i.e., the most significant attribute and the other attributes used to construct the tree. ID3 then calculates entropy and information gain, and in this way the algorithm finds the most significant attribute. The attributes used in ID3 are usually nominal, with no missing values in the datasets [13].
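The entropy and information gain computations at the heart of ID3 can be sketched as follows; the toy data and function names are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting on the given attribute."""
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
labels = ["no", "no", "yes", "yes"]
# Outlook separates the classes perfectly; temperature not at all.
print(information_gain(rows, labels, 0))  # 1.0
print(information_gain(rows, labels, 1))  # 0.0
```

ID3 would pick attribute 0 here, since it yields the larger gain.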

5.4. Core Vector Machine (CVM)

The core vector machine is a fast supervised machine learning classifier that outperforms the typical support vector machine on big datasets. However, only limited evidence regarding the CVM technique's efficiency is available. CVM combines methods from computational geometry with SVM training; it is closely related to the minimum enclosing ball (MEB) problem, around which its optimization is formulated [14].

5.5. Ball Vector Machine (BVM)

The ball vector machine solves the simpler problem of finding an enclosing ball (EB) whose radius is specified in advance; it can indirectly obtain the solution of the quadratic program by solving the minimum enclosing ball (MEB) problem, which significantly decreases the time and space complexity [14].

5.6. Multilayer Perceptron (MLP)

The multilayer perceptron is a feed-forward supervised dimension reduction technique belonging to the artificial neural network (ANN) class. As the name indicates, it has multiple layers consisting of different nodes, i.e., input, hidden, and output nodes. Except for the input nodes, each node is a neuron with a nonlinear activation function [12].
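A minimal forward pass through one hidden layer illustrates this structure; the hand-set weights, which make the network compute XOR, are purely illustrative and not learned from any of the study's datasets:

```python
import math

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: input -> nonlinear hidden layer -> output neuron."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
              for w, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(wo * h for wo, h in zip(w_out, hidden)) + b_out)

# Hand-set weights: hidden unit 1 approximates OR, hidden unit 2
# approximates AND, and the output fires for "OR but not AND" (= XOR).
w_hidden = [[20.0, 20.0], [20.0, 20.0]]
b_hidden = [-10.0, -30.0]
w_out, b_out = [20.0, -40.0], -10.0

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    y = mlp_forward(x, w_hidden, b_hidden, w_out, b_out)
    print(x, round(y))   # 0, 1, 1, 0
```

The nonlinear hidden activation is what lets the network represent a function (XOR) that no single-layer model can.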

5.7. C4.5

C4.5 is a supervised classification algorithm and statistical classifier that produces a decision tree; it was proposed by Ross Quinlan in 1993 to overcome the shortcomings of the ID3 algorithm. Like ID3, C4.5 constructs decision trees from a given set of training data using the concept of entropy. It extends the information gain criterion to the gain ratio, a normalized form of information gain, and the attribute with the maximum gain ratio is preferred for the root node decision.
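The gain ratio can be sketched as follows; normalizing by the split information penalizes attributes with many distinct values, which plain information gain favors (toy data illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr_index):
    """C4.5's criterion: information gain divided by split information."""
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    gain = entropy(labels) - sum(len(s) / n * entropy(s)
                                 for s in subsets.values())
    split_info = -sum(len(s) / n * math.log2(len(s) / n)
                      for s in subsets.values())
    return gain / split_info if split_info else 0.0

rows = [("sunny", "a"), ("sunny", "b"), ("rain", "c"), ("rain", "d")]
labels = ["no", "no", "yes", "yes"]
# Both attributes have information gain 1.0, but the second has four
# distinct values, so its gain ratio is halved.
print(gain_ratio(rows, labels, 0))  # 1.0
print(gain_ratio(rows, labels, 1))  # 0.5
```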

5.8. Classification and Regression Tree (C-RT)

A classification and regression tree is a supervised method used to build classification and regression trees for both metric (continuous) and categorical dependent variables. The main aim of the tree-building algorithm is to find a set of logical conditions that allow accurate classification of the observations.

5.9. CS-CRT and CS-MC4

CS-CRT and CS-MC4 are decision tree methods that classify datasets by splitting the instances using the Gini index at each node. These decision tree approaches build on classification and regression trees (CART) [15]. CS-CRT is quite comparable to CART but adds cost-sensitive classification. CS-MC4 is a cost-sensitive decision tree algorithm that uses a generalization of Laplace estimation to obtain m-estimate smoothed probability estimates. It diminishes the expected loss by using the misclassification cost matrix to determine the best prediction within the leaves. The algorithm requires one discrete target attribute and one or more continuous/discrete input attributes [11].
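The Gini index splitting criterion these trees use at each node can be sketched as:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: probability of misclassifying a randomly drawn item."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(rows, labels, attr_index):
    """Weighted Gini impurity after splitting on one attribute (lower is better)."""
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    return sum(len(s) / n * gini(s) for s in subsets.values())

labels = ["yes", "yes", "no", "no"]
rows = [("x",), ("x",), ("y",), ("y",)]
print(gini(labels))                 # 0.5  (maximally mixed two-class node)
print(gini_split(rows, labels, 0))  # 0.0  (split separates classes perfectly)
```

A cost-sensitive variant would additionally weight the leaf predictions by a misclassification cost matrix rather than by impurity alone.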

5.10. PLS-DA and PLS-LDA

PLS-DA and PLS-LDA are multivariate regression methods similar to discriminant analysis (DA). Since the projection direction is not appropriate in many cases, discriminant analysis (DA) is combined with the partial least squares (PLS) method (Tang, Peng, Bi, Shan & Hu, 2014). Partial least squares regression (PLS-R) has gained great popularity in many fields and is commonly used in situations with many, possibly correlated, predictor variables and few samples. PLS-R is an extension of the common multiple regression model and is also known as the bilinear factor model (BFM), as it projects both the predicted and observed variables onto new subspaces (Ramani and Sivagami, 2011). Partial least squares discriminant analysis (PLS-DA) is a variant of PLS-R used when the response variable is categorical. Under some conditions, PLS-DA gives the same results as classical linear discriminant analysis (LDA) [16] (Figure 1).

6. Performance Evaluation Criteria for Different Supervised Learning Algorithms

The resubstitution error rate (RER), test error rate (TER), bootstrap error rate (BER), computational time (CT), cross-validation error rate (CVE), recall, and precision were used to assess the performance of the supervised learning algorithms. CVE is a model selection and evaluation measure, while the bootstrap is a resampling technique usually used to estimate sampling distributions. The novelty here is that we used these techniques to measure performance with respect to the uncertainty in the ranks induced by the error rates of the learning classifiers.
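The resubstitution and bootstrap (out-of-bag) error rates can be illustrated with a deliberately trivial majority-class classifier; all names and the toy data are illustrative assumptions, not the study's setup:

```python
import random
from collections import Counter

def majority_classifier(labels):
    """Train a trivial classifier that always predicts the majority class."""
    majority = Counter(labels).most_common(1)[0][0]
    return lambda row: majority

def error_rate(predict, rows, labels):
    """Fraction of instances the classifier gets wrong."""
    return sum(predict(r) != y for r, y in zip(rows, labels)) / len(labels)

def bootstrap_error(rows, labels, n_rounds=200, seed=0):
    """Average out-of-bag error over bootstrap resamples of the training set."""
    rng = random.Random(seed)
    n, errors = len(labels), []
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        chosen = set(idx)
        oob = [i for i in range(n) if i not in chosen]  # out-of-bag instances
        if not oob:
            continue
        predict = majority_classifier([labels[i] for i in idx])
        errors.append(error_rate(predict, [rows[i] for i in oob],
                                 [labels[i] for i in oob]))
    return sum(errors) / len(errors)

rows = [(i,) for i in range(20)]
labels = ["a"] * 14 + ["b"] * 6            # 70% / 30% class split
predict = majority_classifier(labels)
print(error_rate(predict, rows, labels))   # resubstitution error rate: 0.3
print(round(bootstrap_error(rows, labels), 2))  # bootstrap (out-of-bag) error
```

RER evaluates on the very data used for training (hence its optimism), while BER averages the error on instances left out of each resample.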

The boxplots in Figure 2 visualize the performance evaluation of the different learning classifiers obtained from the seven standard metrics (TER, RER, BER, CVE, Recall, Precision, and CT) when using 60% of the data for training and 40% for testing. Noteworthy variation was observed in the ranks assigned to the classifiers. The graphs show that the classifiers' results depend on the data complexity. Overall, however, the learning classifiers' performance across the six datasets is comparable under RER and BER, yet noteworthy variation exists in the classifier ranks. To overcome the variation in the rank data and consolidate the results, the WMRRAM is used.

6.1. Performance of the Supervised Classifiers in Dependence of Sample Complexity

The performance of the supervised classifiers in dependence of sample complexity is shown in Figure 3.

6.2. Performance of the Supervised Classifiers in Dependence of Data Complexity

In Figure 4 the boxplots visualize the performance evaluation of the different learning classifiers obtained from the seven standard metrics (TER, RER, BER, CVE, Recall, Precision, and CT) when using 60% of the data for training and 40% for testing. Noteworthy variation was observed in the ranks assigned to the classifiers. The graphs show that the classifiers' results also depend on the data complexity. Overall, the learning classifiers' performance across the six datasets is comparable under RER and BER, yet noteworthy variation exists in the classifier ranks. To overcome the variation in the rank data and reach a consolidated result, the WMRRAM is used.

7. Weighted Mean Rank Risk Adjusted Model (WMRRAM)

The weighted mean rank risk adjusted method first ranks the classifiers in each column of a two-way table and then computes the overall mean and standard deviation of the weighted rank data. The first step is to form the meta table by ranking the supervised algorithms for each category, giving the lowest error rate a rank of 1, the next lowest a rank of 2, and so on. Thus, each row of the meta table contains a set of values from 1 to 7, since seven standard metrics measure the performance of the supervised algorithms. The second step is stacking. Stacked generalization, known in the literature as stacking, is a scheme for combining the output of multiple classifiers in such a way that the output is compared with an independent set of instances and the true class [17]. Since stacking builds on the concept of metalearning [17], the supervised classifiers were first learned from each of the datasets. Their output on the evaluation datasets was then ranked according to the standard performance metrics: the best-performing algorithm was assigned rank 1, the runner-up rank 2, and so on. Where multiple algorithms performed identically, we assigned the average of the tied ranks. Weights were then assigned iteratively to the performance metrics and used to form the instances of a new dataset, which serves as the meta-level evaluation dataset. Finally, we derived a global mean rank risk-adjusted model from this meta-dataset. A limitation of the plain mean rank is that it does not account for variability in the ranks: the learning algorithm with the best mean rank may nevertheless receive quite a few poor ranks on some characteristics. The no free lunch (NFL) theorem states that no solitary model performs best on average across multiple datasets [18].
For the consensus ranking of the supervised learning algorithms, we use the meta-dataset. Risk is a widely studied topic, particularly from the decision-making point of view, and has been discussed along many dimensions [19]. Decision makers can assign arbitrary numbers as weights. The calculations were based on the weights of each characteristic; however, the weighted mean rank alone does not take the variability in the ranks into account, and the supervised learning algorithm with the best mean rank may be one that receives quite a few poor ranks on other characteristics. To reach a consensus result, we used the WMRRAM approach, in which risk is taken as the variability and uncertainty in the ranking of the different learning algorithms, and the statistical properties of the rank data reveal which supervised learning algorithm ranks first, which second, and so on. Our proposed work can be regarded as a variant of [19]. The overall mean rank of classifier i, inspired by Friedman's M statistic [20], and its standard deviation are given by

R̄_i = (1/n) Σ_j r_ij,   s_i = sqrt((1/(n − 1)) Σ_j (r_ij − R̄_i)²),

where r_ij denotes the rank of classifier i on dataset j and j = 1, 2, …, 6 indexes the datasets included in the study for the evaluation of the supervised classifiers. The WMRRAM score for the consensus ranking of the multiple supervised classifiers is then

WMRRAM_i = R̄_i + s_i,

that is, the score increases or decreases in proportion to the variation in the ranks obtained by the different classifiers. Table 4 depicts the results in detail.
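The rank aggregation can be sketched as follows. This is a minimal illustration that assumes equal metric weights and takes the score to be the weighted mean rank plus the standard deviation of the ranks (lower is better); the exact weighting scheme and score formula of the published model are assumptions here, and all names are illustrative:

```python
import statistics

def average_ranks(values):
    """Rank values (1 = best/lowest), giving tied values their average rank."""
    ranks = [0.0] * len(values)
    order = sorted(range(len(values)), key=lambda i: values[i])
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1          # average of 1-based positions
        for i in order[pos:end + 1]:
            ranks[i] = avg
        pos = end + 1
    return ranks

def wmrram(error_table, weights=None):
    """Score each classifier: weighted mean rank + rank standard deviation."""
    metrics = list(zip(*error_table))        # one column per metric
    rank_cols = [average_ranks(col) for col in metrics]
    rank_rows = list(zip(*rank_cols))        # per-classifier rank profile
    if weights is None:
        weights = [1.0] * len(rank_cols)
    scores = []
    for row in rank_rows:
        mean = sum(w * r for w, r in zip(weights, row)) / sum(weights)
        scores.append(mean + statistics.stdev(row))  # penalize rank variability
    return scores

# Rows: classifiers; columns: error rates under three metrics (lower is better).
table = [
    [0.10, 0.12, 0.11],   # consistently good
    [0.05, 0.40, 0.30],   # great on one metric, poor on the rest
    [0.20, 0.15, 0.12],
]
scores = wmrram(table)
print(min(range(3), key=scores.__getitem__))  # 0: the consistent classifier wins
```

The second classifier has the single best error rate, yet the variability penalty pushes it below the consistently good one, which is the behavior the risk adjustment is meant to capture.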

A critical issue in machine learning tasks is determining how much training data is needed to attain a specific performance from the learning classifiers. We examined the performance of the classifiers as the amount of training data grows; increasing the amount of training data affects the ranking of the learning classifiers. We specifically focused on the classifier ranking as a function of the change in data. The results show a surprising amount of sensitivity in the ranking of the learning classifiers when the amount of training data changes. Figure 5 shows that the ranking of the classifiers changes with the amount of training data, and a consensus ranking is only possible with the help of the WMRRAM. The running time of the proposed method is justifiable: on average, 5 ms when applied to the different proportions of training and testing data (Figure 5).

8. Conclusion

Evaluating and comparing the performance of learning classifiers is an active topic. From the literature we conclude that, in the field of statistics, most articles focus on a few well-known learning algorithms and only one or two datasets. Every learning algorithm has pros and cons; by measuring the performance of specific algorithms, this work shows the impact of automated selection and user-specified instances on the ranking of the learning algorithms using the WMRRAM. Cross-validation is the most adequate and commonly used measure to assess a classifier's performance; here, the purpose is to give readers alternative ways of measuring classifier performance, such as the bootstrap error rate and the resubstitution error rate, which are not commonly discussed in the literature. K-NN, C4.5, naive Bayes, and LDA have been studied more than other learning algorithms, so we used less-studied learning algorithms in the field of statistics, such as C-RT, CS-CRT, C-SVC, ID3, BVM, and CVM, in our research. The first section gives a comprehensive description of the dimension reduction techniques in dependence on the measurement level of the data, because data may contain heterogeneous features of nominal, ordinal, interval, and ratio types; hence, the technique best suited to each type of measurement is briefly presented. We make clear that machine learning and data science experts should choose dimension reduction techniques depending on the measurement level of their data. While evaluating the ranking of the different learning classifiers with respect to automated selection and user-specified instances, the effect of the NFL theorem was observed: the performance of the learning classifiers varies with the data domain. These domains are fixed by the classifier algorithm, the dataset type, and the number of instances and attributes in the datasets.
Table 5 shows that, under the WMRRAM, the classifier ID3 obtained the highest ranking score, with a rank of 1 among all classifiers, when performing on multiple datasets at 75% sample complexity with user-specified instances, while Table 6 shows that it receives a rank of 15 under automated selection of instances at 75% and 25% data complexity. The lowest value acquires the rank of one, the second lowest the rank of two, and so on. In comparison, C-RT obtains the rank of 2 at 75% sample complexity with user-specified instances, and ranks of 12 and 15 under automated selection of instances at 75% and 60% data complexity, as shown in Table 7. In short, classifier ranking is strongly sensitive to sample and data complexity, and detecting this is feasible because of the methodology used. All the learning classifiers obtained acceptable performance rates and an adequate ranking on all related characteristics; however, from the results produced by the software it was quite problematic to select a single learning algorithm with the best performance. The WMRRAM therefore provides the best available way of ranking the learning classifiers. In addition, our proposed method helps to compare supervised dimension reduction techniques applied to real data by reducing dimensions with minimum training and computational time. This method will be helpful for researchers seeking ways to compare machine learning algorithms. We are confident that the proposed method for comparing dimension reduction techniques will remain an active area of research in the coming years, owing to the upsurge in high-dimensional data and continued community efforts.

Abbreviations

AFDM:Factor analysis for mixed data
HCA:Harris component analysis
CDA:Canonical discriminant analysis
CCA:Canonical correlation analysis
ICA:Independent component analysis
BVM:Ball vector machine
CVM:Core vector machine
C-SVC:Support vector machine for classification
LDA:Linear discriminant analysis
MLP:Multilayer perceptron
NB:Naive Bayes
Prototype NN:Prototype nearest neighbor
SVM:Support vector machine
ID3:Iterative dichotomizer 3
NB continuous:Naive Bayes continuous
LLE:Local linear embedding
MCA:Multiple correspondence analysis
Nonmetric MDS:Nonmetric multidimensional scaling
Nonlinear MDS:Nonlinear multidimensional scaling
SOM:Self-organizing map
HAC:Hierarchical component analysis
Kohonen-SOM:Kohonen-self organizing map
VARCLUS:Variable cluster
VARHCA:Variable hierarchical component analysis
VARKMeans:Variable K-means
CA:Correspondence analysis
DCA:Discriminant correspondence analysis
K-NN:K-nearest neighbor
TER:Test error rate
RER:Resubstitution error rate
BER:Bootstrap error rate
CVE:Cross validation error rate
CT:Computation time
RAM:Rank adjusted mean.

Data Availability

The data included in this study is available upon reasonable request to the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.