Abstract

Classification is one of the main areas of machine learning, where the target variable is usually categorical with at least two levels. This study focuses on deducing an optimal cut-off point for the continuous outcomes (e.g., predicted probabilities) produced by binary classifiers. To achieve this aim, the study modifies univariate discriminant functions by incorporating the error costs of the misclassification penalties involved. By doing so, we can systematically shift the cut-off point within its measurement range until the optimal point is obtained. Extensive simulation studies were conducted to investigate the performance of the proposed method in comparison with existing classification methods under the binary logistic and Bayesian quantile regression frameworks. The simulation results indicate that logistic regression models incorporating the proposed method outperform the ordinary logistic regression and Bayesian regression models. We illustrate the proposed method with a practical dataset from the finance industry that assesses default status in home equity loans.

1. Introduction

Classification is one of the main areas of machine learning, where the target variable is qualitative, with at least two groups. If the target variable consists of only two groups, it is called binary. Applicable areas include loan administration, image processing, and survival analysis. Commonly used classification techniques can be categorized into four groups: supervised, unsupervised, semisupervised, and hybrid. Supervised methods use the target variable to classify data points into distinct groups and make predictions. Using the target input and output, the model can measure its accuracy and learn from them. Without a target variable, unsupervised methods are typically recommended to identify uncharacterized patterns in the data set. These methods examine the data and distinguish data points that deviate from the pattern expected of the surrounding data points. As they do not require any target information, unsupervised methods may serve as a first stage in separating data points that do not follow expected patterns, thus classifying them as anomalies. Semisupervised methods, in turn, are used when the target information for a particular data set is incomplete. Such a model first learns from the part of the data set containing target scores and uses that to predict the part without target scores. Lastly, hybrid methods combine supervised and unsupervised approaches.

Any binary classification model aims to classify each data point into one of two distinct groups. However, the results of most binary classification models are usually predicted probabilities [1]. A cut-off point is applied to these predicted probabilities to classify data points into the present (1) or absent (0) classes. Thus, choosing a cut-off point for binary classification is a vital step for decision-making, as it may have severe consequences for the model's accuracy. The default cut-off probability is 0.5. However, this may not result in higher prediction accuracy, as data sets are usually imbalanced [1]. Binary classification models are subject to two types of errors: false positives (FP) and false negatives (FN). These errors, FP and FN, are characterized by error cost functions (i.e., the cost of misclassifying a data point as group 1 when it belongs to group 2, or vice versa). A good classification model aims to minimize the expected error cost of misclassification. However, due to the challenges involved in accurately specifying the error costs of the misclassification penalties, researchers in many application areas usually assume an equal cost of misclassification [2–5]. This has its drawbacks; for example, Ling and Sheng [2] indicate that the variation between different misclassification costs can be quite large. In addition, Johnson and Wichern [6] state that any classification rule that ignores the error cost of misclassification might be problematic.

As a result, cost-sensitive machine learning has expanded over time due to its ability to integrate financial decision-making considerations such as information acquisition and decision-making error costs [2, 7]. The aim of this type of learning is to minimize the total misclassification cost [2]. Also, cost-sensitive learning plays a significant role in classification model evaluation [8]. Researchers in this field aim at choosing cut-off points to reduce the misclassification rate.

Bayesian methods have recently been used to address binary classification problems [9–11]. Nortey et al. [10] demonstrated that Bayesian quantile regression is a viable classification model for anomaly detection. Often, it is much easier to postulate the error cost ratios than to state their respective component parts [6]. For example, it may be difficult to accurately specify the costs (in appropriate units) of misclassifying a loan application as nonrisky when, in fact, the application is risky, and of misclassifying a loan application as risky when, in fact, the application is nonrisky. However, a realistic set of these error cost ratios can be obtained and used to identify an optimal cut-off point for classification. Determining an optimal cut-off point requires simultaneous assessment of the test sensitivity and specificity [12]. The optimal cut-off point is the one that produces the highest sum of test sensitivity and specificity. Thus, it should be chosen as the point that classifies most data points correctly and the fewest incorrectly [13].

In addition, many studies concentrate on the receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC); the ROC curve is a graph that measures the diagnostic ability of a binary classification model and can be used to determine an optimal cut-off point [14, 15]. The ROC curve plots the sensitivity (the true-positive rate) against the complement of specificity (the false-positive rate) for all distinct cut-off points. Other criteria have also been introduced by assuming specific values of, or defining a linear combination or function of, sensitivity and specificity (see, e.g., [12, 13, 16–18]). Moreover, Liu [19] proposed the concordance probability method, which defines the optimal cut-off point as the maximizer of the product of the sensitivity and specificity of the model.

In view of the above, this study seeks to develop a cost-sensitive machine learning method, based on univariate discriminant functions, that is relatively efficient and consistent. The approach modifies the univariate discriminant function to incorporate the cost ratios, thus avoiding the assumption of equal error costs of misclassification.

The remainder of the paper is organized as follows. In Section 2, we set out the framework for estimating the parameter of interest, i.e., the concordance probability, and our approach for modifying the univariate discriminant function to incorporate the cost ratios. In Section 3, we conduct a simulation study to assess the performance of the proposed methods with the existing ones in the literature. Section 4 presents a practical application of the proposed method. Lastly, general conclusions from the simulation results are presented in Section 5.

2. Materials and Methods

This section presents the model framework for estimating the parameters of interest.

2.1. Binary Logistic Regression

Let $\pi(\mathbf{x})$ be the predicted probability for a binary response variable, $Y$, given a vector of input variables, $\mathbf{x} = (x_1, x_2, \ldots, x_k)'$. Then, the logistic response function is modelled for multiple covariates as

$$\pi(\mathbf{x}) = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}. \quad (1)$$

The model in (1) is nonlinear and is transformed into linearity using the logit response function. From (1), the logit function is given in the following equation:

$$\operatorname{logit}(\pi(\mathbf{x})) = \ln\left(\frac{\pi(\mathbf{x})}{1 - \pi(\mathbf{x})}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k. \quad (2)$$

The logistic regression coefficients in (2) are estimated by the method of maximum likelihood.
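
As a brief illustration (not part of the original derivation), the following minimal sketch simulates data, fits the logistic model in (1) and (2) by maximum likelihood, and returns the predicted probabilities to which a cut-off point is later applied; the data, coefficients, and variable names are purely illustrative.

```python
# Minimal sketch: maximum-likelihood logistic regression on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                    # two illustrative covariates
eta = -0.5 + 0.8 * X[:, 0] - 1.2 * X[:, 1]       # linear predictor (logit scale)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))  # binary response

Xc = sm.add_constant(X)                          # add the intercept term
fit = sm.Logit(y, Xc).fit(disp=0)                # maximum-likelihood estimates of (2)
p_hat = fit.predict(Xc)                          # predicted probabilities pi(x) from (1)
print(fit.params, p_hat[:5])
```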

2.2. Bayesian Quantile Regression

Let $y_i$ be a response variable and $\mathbf{x}_i$ be a vector of independent variables for the $i$th observation. Let $Q_{y_i}(p \mid \mathbf{x}_i)$ denote the $p$th quantile regression function of $y_i$ given $\mathbf{x}_i$. Suppose that the relationship between $y_i$ and $\mathbf{x}_i$ can be modelled as $Q_{y_i}(p \mid \mathbf{x}_i) = \mathbf{x}_i'\boldsymbol{\beta}_p$, where $\boldsymbol{\beta}_p$ is a vector of unknown parameters of interest. Then, we consider the quantile regression model given in the following equation:

$$y_i = \mathbf{x}_i'\boldsymbol{\beta}_p + \varepsilon_i, \quad (3)$$

where $\varepsilon_i$ is the error term whose distribution (with density, say, $f_p(\cdot)$) is restricted to have its $p$th quantile equal to zero; that is, $\int_{-\infty}^{0} f_p(\varepsilon)\,d\varepsilon = p$. The error density is often left unspecified in the classical literature (see Kozumi and Kobayashi [20]). Thus, quantile regression estimation of $\boldsymbol{\beta}_p$ proceeds by minimizing

$$\sum_{i=1}^{n}\rho_p\left(y_i - \mathbf{x}_i'\boldsymbol{\beta}_p\right), \quad (4)$$

where $\rho_p(\cdot)$ is the loss (or check) function, defined as

$$\rho_p(u) = u\left(p - I(u < 0)\right), \quad (5)$$

and $I(\cdot)$ denotes the usual indicator function.
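
For concreteness, the check function in (5) can be coded directly; the short sketch below is only an illustration with made-up inputs.

```python
import numpy as np

def check_loss(u, p):
    """Quantile check function rho_p(u) = u * (p - I(u < 0))."""
    u = np.asarray(u, dtype=float)
    return u * (p - (u < 0).astype(float))

# Classical quantile regression estimates beta_p by minimizing
# sum_i rho_p(y_i - x_i' beta_p); for instance, p = 0.5 gives the median.
print(check_loss([-1.0, 2.0], p=0.25))  # -> [0.75, 0.5]
```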

Kozumi and Kobayashi [20] considered the linear model in (3) and assumed that $\varepsilon_i$ has a three-parameter asymmetric Laplace distribution with a density function given by

$$f_p(\varepsilon_i \mid \sigma) = \frac{p(1 - p)}{\sigma}\exp\left\{-\rho_p\left(\frac{\varepsilon_i}{\sigma}\right)\right\}, \quad (6)$$

where $\sigma > 0$ is the scale parameter. The parameter $p$ determines the skewness of the distribution, and the $p$th quantile of this distribution is zero. To develop a Gibbs sampling algorithm for the quantile regression model, Kozumi and Kobayashi [20] utilized a mixture representation based on the exponential and normal distributions, summarised as follows.

Let $z$ be a standard exponential variable and $u$ be a standard normal variable. If a random variable $\varepsilon$ follows the three-parameter asymmetric Laplace density stated in (6), then one can represent $\varepsilon$ as a location-scale mixture of normals given in the following equation:

$$\varepsilon = \sigma\theta z + \sigma\tau\sqrt{z}\,u, \quad (7)$$

where $\theta = \dfrac{1 - 2p}{p(1 - p)}$ and $\tau^2 = \dfrac{2}{p(1 - p)}$.
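
As a quick numerical check of this mixture representation (a standard result used by Kozumi and Kobayashi [20]), the sketch below simulates errors from (7) and verifies that their $p$th quantile is approximately zero; the sample size and seed are arbitrary choices.

```python
import numpy as np

def sample_ald_errors(n, p, sigma=1.0, rng=None):
    """Simulate asymmetric Laplace errors via the location-scale mixture (7):
    eps = sigma*theta*z + sigma*tau*sqrt(z)*u, with z ~ Exp(1) and u ~ N(0, 1)."""
    rng = rng or np.random.default_rng(0)
    theta = (1.0 - 2.0 * p) / (p * (1.0 - p))
    tau = np.sqrt(2.0 / (p * (1.0 - p)))
    z = rng.exponential(1.0, size=n)
    u = rng.normal(size=n)
    return sigma * theta * z + sigma * tau * np.sqrt(z) * u

eps = sample_ald_errors(200_000, p=0.25)
print(np.quantile(eps, 0.25))  # close to 0, as required of the error term
```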

From these results, the dependent variable can be rewritten equivalently as follows:

$$y_i = \mathbf{x}_i'\boldsymbol{\beta}_p + \sigma\theta z_i + \sigma\tau\sqrt{z_i}\,u_i. \quad (8)$$

Expanding (8) and thereafter reparametrising, we obtain the following equation:

$$y_i = \mathbf{x}_i'\boldsymbol{\beta}_p + \theta w_i + \tau\sqrt{\sigma w_i}\,u_i, \quad (9)$$

where $w_i = \sigma z_i$. The exponential-normal mixture distribution in (9) shows that the $y_i$'s are conditionally normally distributed with mean $\mathbf{x}_i'\boldsymbol{\beta}_p + \theta w_i$ and variance $\tau^2\sigma w_i$ [10, 20]. Therefore, $y_i$ has the normal density function

$$f\left(y_i \mid \boldsymbol{\beta}_p, w_i, \sigma\right) = \frac{1}{\sqrt{2\pi\tau^2\sigma w_i}}\exp\left\{-\frac{\left(y_i - \mathbf{x}_i'\boldsymbol{\beta}_p - \theta w_i\right)^2}{2\tau^2\sigma w_i}\right\}, \quad (10)$$

and the resulting likelihood function is given by

$$L\left(\boldsymbol{\beta}_p, \sigma, \mathbf{w} \mid \mathbf{y}\right) \propto \prod_{i=1}^{n}\frac{1}{\sqrt{\tau^2\sigma w_i}}\exp\left\{-\frac{\left(y_i - \mathbf{x}_i'\boldsymbol{\beta}_p - \theta w_i\right)^2}{2\tau^2\sigma w_i}\right\}. \quad (11)$$

The aim then is to estimate the regression coefficients, $\boldsymbol{\beta}_p$, the scale parameter, $\sigma$, and the mixture variable, $w_i$, in (9).

Two sets of prior probability distributions are considered for the regression coefficients: $\boldsymbol{\beta}_p \sim N(\cdot)$, where $N$ denotes a normal distribution, and $\boldsymbol{\beta}_p \sim DE(\cdot)$, where $DE$ denotes a double exponential distribution. Also, we assume that $w_i \sim \operatorname{Exp}(\cdot)$, where Exp denotes an exponential distribution, and $\sigma \sim IG(\cdot)$, where $IG$ denotes an inverse gamma distribution.

In the case of the first Bayesian model (i.e., with the normal prior on $\boldsymbol{\beta}_p$), named Bayesian model 1, we assign the following priors:

Likewise, for the second Bayesian model (i.e., with the double exponential prior on $\boldsymbol{\beta}_p$), named Bayesian model 2, we assign the following priors:

Given the likelihood distribution in (11) and by specifying the prior probabilities of the parameters of interest, the posterior distributions can be derived. Thus, for Bayesian model 1, the posterior distribution is given as

Also, for Bayesian model 2, the posterior distribution is given as

It can be shown that the marginal conditional density of $\boldsymbol{\beta}_p$ in Bayesian model 1 is normal, with its mean and variance given, respectively, as follows.

Also, the marginal conditional distribution of $\boldsymbol{\beta}_p$ for Bayesian model 2 is normal, with its mean and variance given, respectively, as follows.

Furthermore, the marginal conditional distribution of $w_i$ follows a generalized gamma distribution, whose parameters are given as follows.

Lastly, the marginal conditional distribution of $\sigma$ is obtained as follows.

As a result, the marginal distribution of $\sigma$ follows an inverse gamma distribution with the following parameters.

2.2.1. Estimating the Mixture Component

A mixture distribution with a fixed number of components, $k$, can be specified as $\sum_{j=1}^{k}\pi_j\,N(\mu_j, \sigma_j^2)$, where $\pi_j$, $\mu_j$, and $\sigma_j$ are the proportions of the component distributions, their means, and their standard deviations, respectively. To estimate the parameters $\pi_j$, $\mu_j$, and $\sigma_j$ associated with the mixture variable, $w$, the first assumption is that the marginal conditional distribution of $w$ is a finite mixture distribution of two normal components. The second assumption is that the latent allocation variable has a value of 0 or 1, associated with the absent and present event rates, respectively.

In this context, $\pi_1$ and $\pi_2$ are the proportions of the present and absent event rates, respectively, with $\pi_1 + \pi_2 = 1$. The estimation of the mixture variable is done using the Bayesian mixture package, which provides Gibbs sampling of the posterior distribution, a method to set up the model, and the specification of the priors and initial values required for the Gibbs sampler.
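
The study carries out this step with a Bayesian mixture package and Gibbs sampling; as a rough, non-Bayesian stand-in for illustration only, a two-component normal mixture can also be fitted with an EM-based routine, which likewise returns the component proportions, means, and standard deviations. The simulated values of the mixture variable below are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Hypothetical draws of the mixture variable: a majority "absent" component and
# a minority "present" component with a larger mean.
w = np.concatenate([rng.normal(0.2, 0.1, 900), rng.normal(0.8, 0.1, 100)])

gm = GaussianMixture(n_components=2, random_state=0).fit(w.reshape(-1, 1))
proportions = gm.weights_                # estimated pi_1, pi_2
means = gm.means_.ravel()                # estimated mu_1, mu_2
sds = np.sqrt(gm.covariances_.ravel())   # estimated sigma_1, sigma_2
print(proportions, means, sds)
```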

2.2.2. Computing the Probability Score

As the aim is to compute a probability score for each observation, it can be noticed from Section 2.2.1 that this score depends on the data only through the estimated values of the component proportions, means, and standard deviations. Therefore, from Bayes' theorem, the probability that an observation belongs to the present event rate is computed as follows.
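
A minimal sketch of this Bayes-theorem calculation for a two-component normal mixture is given below; the parameter values in the example call are illustrative only.

```python
import numpy as np
from scipy.stats import norm

def present_probability(w, pi_pres, mu_pres, sd_pres, pi_abs, mu_abs, sd_abs):
    """P(present | w) = pi_1 f_1(w) / (pi_1 f_1(w) + pi_2 f_2(w))."""
    num = pi_pres * norm.pdf(w, mu_pres, sd_pres)
    den = num + pi_abs * norm.pdf(w, mu_abs, sd_abs)
    return num / den

# Illustrative values: the component with the larger mean is the "present" group.
print(present_probability(0.6, pi_pres=0.1, mu_pres=0.8, sd_pres=0.1,
                          pi_abs=0.9, mu_abs=0.2, sd_abs=0.1))
```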

2.3. Incorporating the Error Cost Ratios

In this section, we outline our steps in modifying the univariate discriminant function to incorporate the error cost ratios.

Let $f_1(x)$ and $f_2(x)$ be the density functions associated with a random variable $X$ for the populations $\pi_1$ and $\pi_2$, respectively. An object with an associated measurement $x$ must be allocated to either $\pi_1$ or $\pi_2$. Also, let $\Omega$ be the sample space of $X$, let $R_1$ be the set of values of $x$ for which objects are classified as $\pi_1$, and let $R_2 = \Omega - R_1$ be the remaining values of $x$ for which objects are classified as $\pi_2$; the two regions are mutually exclusive and exhaust the sample space.

The probability of classifying an object as $\pi_2$ when it is derived from $\pi_1$ is given as

$$P(2 \mid 1) = P(X \in R_2 \mid \pi_1) = \int_{R_2} f_1(x)\,dx.$$

Similarly, the probability of classifying an object as $\pi_1$ when it is derived from $\pi_2$ is given as

$$P(1 \mid 2) = P(X \in R_1 \mid \pi_2) = \int_{R_1} f_2(x)\,dx.$$

In addition, let $p_1$ and $p_2$ be the prior probabilities of an object belonging to $\pi_1$ and $\pi_2$, respectively, where $p_1 + p_2 = 1$. Then, the total probability of accurately or inaccurately classifying an object can be deduced as the product of the prior and conditional classification probabilities. For example, the probability that an object comes from $\pi_1$ and is correctly classified is $P(1 \mid 1)\,p_1$, whereas the probability that it comes from $\pi_2$ and is misclassified as $\pi_1$ is $P(1 \mid 2)\,p_2$.

Classification systems are commonly assessed based on their misclassification probabilities. However, this overlooks the error cost of misclassification. The error cost of misclassification (ECM) can be characterized by a cost matrix given in Table 1.

Thus, we assign a cost of
(i) zero for accurate classification,
(ii) $c(2 \mid 1)$ when an object from $\pi_1$ is inaccurately classified as $\pi_2$, and
(iii) $c(1 \mid 2)$ when an object from $\pi_2$ is inaccurately classified as $\pi_1$.

For any classification rule, when the off-diagonal entries of the cost matrix are multiplied by their respective probabilities of occurrence, we obtain the expected error cost of misclassification (EECM) as

$$\mathrm{EECM} = c(2 \mid 1)\,P(2 \mid 1)\,p_1 + c(1 \mid 2)\,P(1 \mid 2)\,p_2.$$

The regions, $R_1$ and $R_2$, that minimize the EECM are defined by the values of $x$ for which the following hold:

$$R_1: \frac{f_1(x)}{f_2(x)} \geq \left(\frac{c(1 \mid 2)}{c(2 \mid 1)}\right)\left(\frac{p_2}{p_1}\right) \quad (30)$$

and

$$R_2: \frac{f_1(x)}{f_2(x)} < \left(\frac{c(1 \mid 2)}{c(2 \mid 1)}\right)\left(\frac{p_2}{p_1}\right), \quad (31)$$

respectively. Clearly, from (30) and (31), the implementation of the minimum EECM rule requires the following:
(a) the ratio of the density functions evaluated at a new observation, say $x_0$;
(b) the cost ratio, $c(1 \mid 2)/c(2 \mid 1)$;
(c) the prior probability ratio, $p_2/p_1$.
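
To make the rule concrete in the predicted-probability setting considered here, the sketch below evaluates the expected error cost of misclassification for a candidate cut-off; the simulated data and cost values are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def eecm(p_hat, y, cutoff, cost_fn, cost_fp):
    """EECM = c(2|1) * P(2|1) * p1 + c(1|2) * P(1|2) * p2 for a given cut-off,
    where group 1 is "present" (y = 1) and group 2 is "absent" (y = 0)."""
    y, pred = np.asarray(y), (np.asarray(p_hat) >= cutoff).astype(int)
    p1, p2 = np.mean(y == 1), np.mean(y == 0)
    p_2_given_1 = np.mean(pred[y == 1] == 0)   # present misclassified as absent
    p_1_given_2 = np.mean(pred[y == 0] == 1)   # absent misclassified as present
    return cost_fn * p_2_given_1 * p1 + cost_fp * p_1_given_2 * p2

# Illustrative use: a false negative assumed five times as costly as a false positive.
rng = np.random.default_rng(3)
y = rng.binomial(1, 0.3, 500)
p_hat = np.clip(0.2 + 0.5 * y + rng.normal(0, 0.2, 500), 0, 1)
print(eecm(p_hat, y, cutoff=0.5, cost_fn=5.0, cost_fp=1.0))
```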

The presence of these ratios in the description of the optimal classification regions makes it much easier to postulate the cost ratios than their respective component parts. Suppose $X$ is normally distributed in each group, with parameters $\mu_i$ and $\sigma_i^2$ for $i = 1, 2$; then, the condition defining $R_1$ in (30) can be rewritten as follows:

$$\ln\left(\frac{\sigma_2}{\sigma_1}\right) - \frac{(x - \mu_1)^2}{2\sigma_1^2} + \frac{(x - \mu_2)^2}{2\sigma_2^2} \geq \ln\left[\left(\frac{c(1 \mid 2)}{c(2 \mid 1)}\right)\left(\frac{p_2}{p_1}\right)\right]. \quad (32)$$

However, if $\sigma_1^2 = \sigma_2^2 = \sigma^2$, then from (32),

$$\frac{(\mu_1 - \mu_2)\,x}{\sigma^2} - \frac{\mu_1^2 - \mu_2^2}{2\sigma^2} \geq \ln\left[\left(\frac{c(1 \mid 2)}{c(2 \mid 1)}\right)\left(\frac{p_2}{p_1}\right)\right]. \quad (33)$$

We denote the left-hand sides of (32) and (33) as the quadratic and linear discriminant functions, say $Q(x)$ and $L(x)$, with their respective right-hand sides as the critical values, denoted $m_Q$ and $m_L$. The sample estimates of these discriminant functions and their critical values are obtained by replacing the population means, variances, and prior probabilities with their sample counterparts. Thus, the range of the error cost of misclassification ratio for the quadratic discriminant function is obtained from (36), whose limits involve the quantities described below.

Here, the quantities involved are the maximum predicted probability, the minimum predicted probability, the sample variance for the present event rate, and the sample variance for the absent event rate.
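
As a sketch of the normal-theory rule above (assuming univariate normal group densities), the code below computes the quadratic and linear discriminant scores in (32) and (33) and the critical value implied by given cost and prior ratios; all numerical values are illustrative.

```python
import numpy as np

def quadratic_score(x, mu1, mu2, s1, s2):
    """Left-hand side of (32): ln f1(x) - ln f2(x) for unequal variances."""
    return (np.log(s2 / s1)
            - (x - mu1) ** 2 / (2.0 * s1 ** 2)
            + (x - mu2) ** 2 / (2.0 * s2 ** 2))

def linear_score(x, mu1, mu2, s_pooled):
    """Left-hand side of (33): the equal-variance (pooled) case."""
    return ((mu1 - mu2) * x - (mu1 ** 2 - mu2 ** 2) / 2.0) / s_pooled ** 2

def critical_value(cost_ratio, prior_ratio):
    """Right-hand side of (32) and (33): ln[(c(1|2)/c(2|1)) * (p2/p1)]."""
    return np.log(cost_ratio * prior_ratio)

x0 = 0.45  # a new predicted probability to be classified
classify_present = (quadratic_score(x0, mu1=0.7, mu2=0.3, s1=0.2, s2=0.1)
                    >= critical_value(cost_ratio=2.0, prior_ratio=4.0))
print(classify_present)
```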

Similarly, corresponding expressions hold in the case of the linear discriminant function.

Also, the ratio of the error cost of misclassification for the linear discriminant function is derived using (36), with the corresponding quantities defined analogously.

Therefore, by the minimum EECM rule, an object is classified as belonging to the present group if and only if its quadratic (or, where appropriate, linear) discriminant function value is at least the corresponding critical value.
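
One way to operationalize this rule, under the assumptions above, is to note that each admissible cost ratio implies a critical value and hence a cut-off on the predicted probabilities; sweeping the cut-off over its measurement range and retaining the value that maximizes the concordance probability (sensitivity times specificity) identifies the optimal point. The sketch below sweeps the cut-off directly on simulated data and is an illustration of the idea rather than the paper's exact algorithm.

```python
import numpy as np

def optimal_cutoff(p_hat, y):
    """Return the cut-off on the predicted probabilities that maximizes
    sensitivity * specificity (the concordance probability)."""
    p_hat, y = np.asarray(p_hat), np.asarray(y)
    best_cut, best_cp = 0.5, -np.inf
    for c in np.unique(p_hat):
        pred = (p_hat >= c).astype(int)
        sens = np.mean(pred[y == 1] == 1)
        spec = np.mean(pred[y == 0] == 0)
        if sens * spec > best_cp:
            best_cut, best_cp = c, sens * spec
    return best_cut, best_cp

rng = np.random.default_rng(4)
y = rng.binomial(1, 0.2, 1000)
p_hat = np.clip(0.3 + 0.4 * y + rng.normal(0, 0.15, 1000), 0, 1)
print(optimal_cutoff(p_hat, y))
```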

The univariate discriminant functions derived above are proposed for classifying an object into two distinct groups, provided that the two group means are statistically different. In the next section, the performance criteria for evaluating these proposals are presented.

2.4. Performance Evaluation

To assess the classification methods, the confusion matrix is of interest. Table 2 shows the confusion matrix.

The metrics of performance evaluation computed from Table 2 include sensitivity (the ability of the model to correctly classify present event rates as present), specificity (the ability of the model to correctly classify absent event rates as absent), and accuracy (the overall rate of correct classification). Mathematically, these metrics are computed as follows:

$$\text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP}, \qquad \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives in Table 2.
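
These metrics follow directly from the confusion matrix counts; a short sketch with made-up labels is given below.

```python
import numpy as np

def performance_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy from confusion matrix counts."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy

print(performance_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # (0.667, 0.5, 0.6)
```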

3. Simulation Study

In this section, we present a simulation study to assess the performance of the various models discussed in the previous section. It is organized into two subsections: the simulation design, and the results and discussion.

3.1. Simulation Design

Three sample sizes are considered to investigate the empirical consistency of the proposed methods. For each sample, our interest is in estimating the average concordance probability (the product of the model's test sensitivity and specificity), bias, and mean square error. The bootstrap sampling technique was used to estimate the standard errors of the estimators of the cut-off point from each model.

In addition, we varied the proportion of event occurrences from a value as low as 0.05 to a high value of 0.5 in each generated sample to study the proposed models' performance as the proportion varies. To select random samples having a predetermined proportion of event occurrences, we propose a modification of the "conditional block bootstrap" of Minkah et al. [21], where the authors implemented it to select bootstrap samples for censored data. The conditional block bootstrap is a combination of ideas from the moving block bootstrap [22] and the bootstrap for censored data [23].

In the proposed modified "conditional block bootstrap" procedure, the absent events are grouped into randomly chosen blocks. However, each block must contain at least one present event observation. The bootstrap observations are obtained by repeatedly sampling with replacement from these blocks and placing them together to form the bootstrap sample. Enough blocks must be sampled to obtain approximately the same sample size as the original sample. Given a sample of size $n$ and a proportion of present event occurrences $\rho$, the conditional block bootstrap procedure is as follows (a sketch in code is given after the list):

A1. Group the observations into two groups, namely, the present and absent groups (with their corresponding covariates), with sample sizes $n_1$ and $n_2$, respectively. Thus, the proportion of the present event is $\rho = n_1/n$.

A2. Let $m$ represent the number of present observations to be included in a block. The block size, $b$, is obtained as $b = m/\rho$. If $b$ is not an integer, let $b = \lceil b \rceil$.

A3. The number of blocks, $k$, is chosen such that $kb \approx n$. In the case $kb = n$, the blocks will have the same number of observations. However, if $kb \neq n$, the first $k - 1$ blocks are allocated $b$ observations each and the remaining observations are allocated to the $k$th block.

A5. Let $B_j$ denote the $j$th block. Assign observations to each block by randomly sampling, without replacement, $b - m$ observations from the absent event group. In addition, randomly sample $m$ observations without replacement from the present event group and assign them to each block $B_j$. Thus, each block contains $m$ present and $b - m$ absent event observations.

A6. Sample $k$ times with replacement from $B_1, \ldots, B_k$ and place the sampled blocks together to form the bootstrap sample. These bootstrap samples will have sample sizes equal, or approximately equal, to $n$.

A7. For the bootstrap samples obtained in A6, perform the analysis using Bayesian model 1 (BM1), Bayesian model 2 (BM2), binary logistic regression using the proposed methodology for obtaining the optimal cut-off point (LM), and binary logistic regression with a cut-off point of 0.5 (LN).

A8. For each model in A7, obtain the optimal cut-off point and compute its corresponding concordance probability for BM1, BM2, LM, and LN, respectively.

A9. Repeat A1 to A8 a large number of times (see, e.g., [24] for justification) to obtain a set of concordance probability values for each method.

A10. Compute the average concordance probability, bias, and MSE for each method from the values obtained in A9.
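
A rough sketch of steps A1 to A6 is given below; the block-size rule $b = m/\rho$ and the handling of uneven blocks are assumptions of this sketch, and the data are simulated purely for illustration.

```python
import numpy as np

def conditional_block_bootstrap(y, X, m=1, rng=None):
    """Draw one bootstrap sample that approximately preserves the proportion of
    present events (steps A1 to A6); m is the number of present observations per block."""
    rng = rng or np.random.default_rng(0)
    y, X = np.asarray(y), np.asarray(X)
    present = rng.permutation(np.where(y == 1)[0])   # A1: present group indices
    absent = rng.permutation(np.where(y == 0)[0])    # A1: absent group indices
    n, rho = len(y), len(present) / len(y)
    b = int(np.ceil(m / rho))                        # A2: block size
    k = max(int(round(n / b)), 1)                    # A3: number of blocks
    # A5: each block gets roughly m present and (b - m) absent observations.
    blocks = [np.concatenate([p, a]) for p, a in
              zip(np.array_split(present, k), np.array_split(absent, k))]
    # A6: sample k blocks with replacement and stack them.
    idx = np.concatenate([blocks[i] for i in rng.integers(0, k, size=k)])
    return y[idx], X[idx]

rng = np.random.default_rng(5)
y = rng.binomial(1, 0.1, 300)
X = rng.normal(size=(300, 2))
yb, Xb = conditional_block_bootstrap(y, X, m=1, rng=rng)
print(len(yb), yb.mean())   # approximate sample size and present-event proportion
```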

3.2. Results and Discussion

The results of the simulation study are presented in this section. For brevity and ease of presentation, we display plots of the average concordance probability, bias, and mean square error (MSE) as functions of the proportion of event occurrences. The criteria for an appropriate model are high average concordance probability values (close to 1) and low values of bias and MSE.

Figure 1 shows the graphs of the average concordance probability, bias, and MSE as functions of the proportion of event occurrences for the various models and the varying sample sizes.

Clearly, the logistic-based models (LM and LN) have higher concordance probability values than their counterparts from the Bayesian framework (BM1 and BM2). Also, our proposed logistic regression-based classifier that incorporates the error cost of misclassification, LM, provides a better performance measure than LN as the proportion of event occurrences increases. This observation becomes more apparent as the sample size increases. In the case of the Bayesian framework, BM2 has better concordance probability values, except for smaller proportions of event occurrences. Therefore, in general, our proposed LM model can be considered the most appropriate model, with high concordance probability values, for classification purposes.

In terms of bias, the results are mixed, but BM1 shows smaller bias in most cases across the sample sizes and proportions of event occurrences. In the case of the MSE, the logistic-based models provide smaller values than the Bayesian models, especially for smaller proportions of event occurrences. In addition, there is a gradual decrease in the MSEs as the sample size increases. This is desirable, as it indicates the empirical consistency of the concordance probability estimators in each model. In conclusion, the proposed LM model is universally competitive in generating higher concordance probability values regardless of the sample size and the proportion of event occurrences in a data set.

4. Practical Application

This section illustrates the proposed method for estimating the optimal cut-off point on a home equity loan data set. The data comprise 1000 customers in the United States of America. The dependent variable is the loan amount (in thousands of dollars), while the independent variables are the mortgage (amount due on the existing mortgage, in thousands of dollars), the value of the current property, the reason for the loan (1 = debt consolidation and 2 = home improvement), job (1 = manager, 2 = office, 3 = others, 4 = executive, 5 = sales, and 6 = self-employed), years at the present position, and the debt-to-income ratio. In addition, the variable Bad represents the status of the loan repayment. Table 3 shows the structure of the home equity loan dataset.

Our interest is in the estimation of the cut-off point for the classification of loan repayment using the methods discussed in the previous sections.

4.1. Estimating Cut-Off Point Using Bayesian Quantile Regression

The quantile regression equation for the home equity loan data is

Estimates of the model's parameters at the selected quantile level for BM1 and BM2 are shown in Tables 4 and 5, respectively.

Here, the aim is to identify bad home equity loans through the estimated values of the latent mixture variable. The components of the mixture variable estimated for the home equity loan data are shown in Table 6.

The marginal conditional distribution of the mixture variable is a finite mixture of two normal components. The component with the larger mean is associated with the distribution of bad home equity loans, while the component with the smaller mean is associated with the distribution of good home equity loans. The estimated averages of the bad home equity loan rates, with their corresponding proportions, for BM1 and BM2 are reported in Table 6.

We now compute the probability of each observation belonging to the distributions of bad and good home equity loan rates using (25). Tables 7 and 8 show some data points and their respective computed probabilities.

Furthermore, to ascertain which univariate discriminant function is most suitable for classification, Levene's test for equality of variances of the two distributions of present and absent event rates is conducted on the home equity data set. The test results show that there is a significant difference between the variances of the two distributions of bad and good loan repayment events for both BM1 and BM2. Therefore, a quadratic discriminant function is most appropriate for the classification of this data set.

Moreover, the independent-samples t-tests of equal means for the two distributions of present and absent event rates are significant for both BM1 and BM2. Now, using the corresponding discriminant function and systematically shifting the error cost ratio within its bounds, we obtain the optimal cut-off points of 0.4902 and 0.4964 for BM1 and BM2, respectively. At these points, the highest concordance probabilities are achieved.

4.2. Estimating Cut-Off Point Using Binary Logistic Regression

The binary logistic regression equation for the home equity loan data is given as follows:

Table 9 presents the parameter estimates obtained through the maximum-likelihood principle for the data set.

Also, Table 10 presents the predicted probabilities of bad and good home equity loan rates for the data set.

In addition, Levene's test for equality of variances between the distributions of good and bad home equity loans shows that there is no significant difference (p value = 0.5). Hence, a linear discriminant function is the most applicable for classification. The pooled sample variance for the two groups of loan statuses is estimated as 0.101581. In addition, the null hypothesis of equal means for bad and good home equity loans is rejected by the independent-samples t-test, with the p value being less than 0.01.

The procedure is similar to that used for the Bayesian quantile regression models in the preceding section, but with the linear discriminant function and its critical value. Proceeding in this way, we obtain the optimal cut-off point for the logistic regression model.

4.3. Performance Metrics Evaluation

This section presents the performance metrics (specificity, sensitivity, accuracy, and concordance probability) used to assess the various models' performances on the home equity loan data set. From the results in Sections 4.1 and 4.2, the estimated optimal cut-off points are used to obtain the performance metrics, and the results are shown in Table 11.

Clearly, the logistic regression model incorporating the proposed methodology produces the highest test sensitivity, specificity, accuracy, and concordance probability values. Also, of the two Bayesian models, Bayesian model 2 has greater test specificity, accuracy, and concordance probability values than Bayesian model 1. However, Bayesian model 1 produces a higher test sensitivity value than Bayesian model 2. Thus, it can be concluded that using logistic regression with the proposed incorporation of the error cost of misclassification produces better performance metrics in classifying loans for home equity.

5. Conclusion

This paper introduced an approach for estimating the optimal cut-off point for classification. The proposed method modifies univariate discriminant functions by incorporating the error cost ratio for classification. Thus, the misclassification cost ratios can be systematically adjusted within some specified measurement range. A corresponding cut-off value is subsequently obtained for each unique cost ratio, and other metrics of performance measures can be computed. Three methods of computing the cut-off point were proposed: a logistic and two Bayesian quantile regressions. A simulation study was conducted to assess the performance of these models in estimating the concordance probability and thus the cut-off point. The results show that incorporating the error cost of misclassification improves the concordance probability and provides smaller values for bias and mean square errors. In particular, the logistic regression with the proposed incorporation of the error cost of misclassification provides the best method as it gives concordance probability values close to 1 and smaller values of bias and mean square error. The proposed method is illustrated using loan data from the finance industry.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

Ampomah O.-A was supported with funds from the Carnegie Corporation-University of Ghana BANGA-Africa Project. This research work has been supported by funds from the Carnegie Banga-Africa Project, University of Ghana.