Abstract
Logistic regression is a commonly used classification algorithm in machine learning. It categorizes data into discrete classes by learning the relationship between features and labels from a given set of labeled data. It learns a linear relationship from the given dataset and then introduces a nonlinearity through an activation function in order to determine a hyperplane that separates the training points into two subclasses. In the case of logistic regression, the logistic function is the activation function most commonly used to perform binary classification. The choice of the logistic function for binary classification is justified by its ability to transform any real number into a probability between 0 and 1. This study provides, through two different approaches, a rigorous statistical answer to the crucial question of where the logistic function, on which most neural network algorithms are based, comes from. Moreover, it determines the computational cost of logistic regression, using both theoretical and experimental approaches.
1. Introduction
Wilhelm Wundt, a psychologist, first used the sigmoid function to explain psychic processes in the late 19th century [1]. Later, scientists working in computer science, physics, and mathematics adopted it. The sigmoid function has been applied in a wide range of disciplines, including economics, finance, and computer science. Today it is frequently used in machine learning methods, deep learning in particular [2, 3].
The logistic function is a particular instance of the sigmoid function. Due to its use in logistic regression and neural networks, it is crucial in the field of machine learning. The logistic function converts a linear model's output into a probability that can be applied to classification tasks. It can also be used to represent a cost or loss function in optimization methods. The logistic function, which is also used in numerous other applications, is an effective tool for understanding and forecasting data [1].
Logistic functions are favoured over other sigmoid functions because they are more precise, produce better results, and are simpler to implement [4, 5]. Logistic functions are more precise because they are better able to capture the subtleties of the data, such as nonlinear relationships. They also produce better results because overfitting, which can lead to incorrect predictions, is less likely to occur [6]. Furthermore, compared to other sigmoid functions, logistic functions are more numerically efficient, making them simpler to apply.
Logistic regression is one of the supervised learning algorithms whose purpose is to perform binary classification [1, 2]. This simple and effective algorithm has proven itself in countless use cases:
(1) Scoring in the banking sector [6]
(2) Risk management in the insurance sector [7]
(3) Early prediction of diseases in the health sector [8]
(4) Hotel booking in the tourism sector [9]
(5) Spam detection in the cybersecurity sector [10]
(6) Prediction of the price of a property in a given region [11]
The present study provides, through two different approaches, a rigorous mathematical answer to the crucial question of where the logistic function, on which most neural network algorithms are based, comes from. These two approaches to the logistic function are very interesting. Indeed, the sigmoid function is not introduced by chance or by common sense; its use in logistic regression is perfectly justified. These justifications allow us to see logistic regression from different angles and thus shed light on many points that were previously obscure in many neural network algorithms.
The primary problem with the proposed method is that it is very susceptible to overfitting and noise. Logistic regression should not be used when there are fewer observations than features; otherwise, overfitting can occur. The second issue is the computational cost, especially for large datasets with space-consuming features. We evaluate both of these disadvantages by computer simulation.
Many open-source datasets are now available, such as the UCI Machine Learning Repository [12] or Kaggle [13]. Logistic regression can use these datasets to build reliable prediction models. Logistic regression is a machine learning technique applied to classification problems. This supervised learning method is used to predict discrete values such as 0 or 1, true or false, and yes or no. It operates by identifying the best-fitting line or hyperplane that divides the observed data into two categories. The sigmoid function is a mathematical function that accepts inputs from a broad range of numbers and produces an output value between 0 and 1. It is a kind of activation function that many neural networks use to determine their outputs. In logistic regression, it is also used to convert the data to a probability between 0 and 1. Logistic regression is employed on datasets with discrete response variables.
Logistic regression consists of analyzing the relationship between a binary response variable $Y \in \{0, 1\}$ and exclusively binary or continuous explanatory variables $X = (X_1, \dots, X_d)^T$, where $X^T$ is the transpose of $X$. The variable $Y$ separates the possible values of $X$ into two disjoint subsets $\mathcal{X}_0$ and $\mathcal{X}_1$. Logistic regression is based on the assumption that $\mathcal{X}_0$ and $\mathcal{X}_1$ are separated by an affine hyperplane $\mathcal{H}$. The Cartesian equation of $\mathcal{H}$ is defined from real parameters $\theta_0, \theta_1, \dots, \theta_d$, such that for all $x = (x_1, \dots, x_d)^T \in \mathbb{R}^d$,
$$\theta_0 + \theta_1 x_1 + \dots + \theta_d x_d = 0. \qquad (1)$$
In reality, the two subsets $\mathcal{X}_0$ and $\mathcal{X}_1$ are not necessarily separated by an affine hyperplane, but rather by an arbitrary hypersurface. Therefore, with this model, we will have both false positive and false negative predictions. By separating them with an affine hyperplane, we can assume that
(1) $\mathcal{X}_1$ corresponds to the class for which $\theta_0 + \theta_1 x_1 + \dots + \theta_d x_d \ge 0$
(2) $\mathcal{X}_0$ corresponds to the class for which $\theta_0 + \theta_1 x_1 + \dots + \theta_d x_d < 0$
For a given observation $x$, the random variable $Y \mid X = x$ follows a Bernoulli distribution with parameter $p_\theta(x)$, i.e., $Y \mid X = x \sim \mathcal{B}(p_\theta(x))$, and
$$p_\theta(x) = \mathbb{P}(Y = 1 \mid X = x),$$
where $\mathbb{P}$ designates the probability of the event, and $p_\theta$ is a mapping from $\mathbb{R}^d$ to $[0, 1]$ called the score function [14]. Given a threshold $s \in [0, 1]$, we decide that
$$\hat{y} = \begin{cases} 1 & \text{if } p_\theta(x) \ge s, \\ 0 & \text{otherwise.} \end{cases}$$
The natural and most used value of the threshold is $s = 1/2$, but nothing prevents us from taking $s > 1/2$ if we want the classification to be stricter. There are several candidates for the choice of the score function $p_\theta$. For the logistic model, we consider the function
$$p_\theta(x) = \sigma\left(\theta_0 + \theta_1 x_1 + \dots + \theta_d x_d\right),$$
where
$$\sigma(t) = \frac{1}{1 + e^{-t}}$$
is the sigmoid function, shown in Figure 1.
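To make these definitions concrete, here is a minimal Python sketch of the sigmoid, the score function $p_\theta$, and the thresholded decision rule; the function names and the convention of storing the bias $\theta_0$ in the first entry of the parameter vector are our own choices, not part of the model itself.

```python
import numpy as np

def sigmoid(t):
    """Sigmoid function: maps any real number into ]0, 1[."""
    return 1.0 / (1.0 + np.exp(-t))

def score(x, theta):
    """Score function p_theta(x) = sigma(theta_0 + theta_1 x_1 + ... + theta_d x_d)."""
    return sigmoid(theta[0] + np.dot(theta[1:], x))

def predict(x, theta, threshold=0.5):
    """Decision rule: class 1 if the score reaches the threshold, class 0 otherwise."""
    return int(score(x, theta) >= threshold)
```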
[Figure 1: the sigmoid function $\sigma(t) = 1/(1 + e^{-t})$.]
2. Literature Review
Yaseliani and Khedmati [15] proposed a logistic regression model to predict heart disease based on a dataset consisting of 299 people and 13 variables. The aim was to evaluate the impact of different predictors on the outcome and select the best combination of predictors. The effect of each predictor on the prediction outcome was evaluated using statistical measures such as AIC scores and p values. The logit models of different predictors were analyzed and compared to identify the ones with the highest impact on heart disease. Two statistical approaches were used to determine the combined model that best fits the dataset. Based on the results, the proposed model achieved a sensitivity and specificity of and , respectively. Normal probability density curves were used to establish likelihood ratios based on classes 1 and 0. The results showed that the likelihood ratio classifier performed as satisfactorily as the logistic regression model. The study highlights the potential of machine learning in detecting and predicting heart diseases and the importance of selecting the right combination of predictors to achieve the best results.
Nusinovici et al. [16] conducted a study to evaluate the performance of machine learning algorithms compared to logistic regression in predicting the risk of cardiovascular diseases, chronic kidney disease, diabetes, and hypertension using simple clinical predictors. Logistic regression achieved the highest area under the receiver operating characteristic curve for CKD and DM predictions, while neural network and support vector machine were the best models for CVD and HTN, respectively. However, the differences between these models and logistic regression were small and nonsignificant, suggesting that logistic regression performs as well as ML models in predicting the risk of major chronic diseases with low incidence and simple clinical predictors.
Desai et al. [17] addressed the problem of the lack of effective analysis tools to discover unknown relationships and trends in heart disease data. In their study, they assessed the accuracy of classification models for predicting heart disease using the Cleveland dataset. A comparative study of parametric and nonparametric approaches using back-propagation neural network and logistic regression models was conducted. The study found that the developed classification model can assist domain experts in taking effective diagnostic decisions. The 10-fold cross-validation method was used to measure the unbiased estimate of these classification models. The study highlights the need for effective analysis tools in discovering patterns and knowledge from heart disease data.
3. Logistic Regression Analysis
3.1. Determination of the Parameters
In practice, we have $n$ observations $(x_i, y_i)_{1 \le i \le n}$ of $(X, Y)$, where for all $i$, $x_i = (x_{i1}, \dots, x_{id})^T \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$. For simplicity, we transform each vector $x_i$ into an augmented vector
$$\tilde{x}_i = (1, x_{i1}, \dots, x_{id})^T,$$
so that the Cartesian equation (1) is interpreted as a scalar product between $\theta = (\theta_0, \theta_1, \dots, \theta_d)^T$ and $\tilde{x}$:
$$\langle \theta, \tilde{x} \rangle = \theta_0 + \theta_1 x_1 + \dots + \theta_d x_d = 0.$$
For all $i$, we denote by $Y_i$ the random variable $Y \mid X = x_i$. Therefore, the $Y_i$ are independent but do not follow the same distribution, since $Y_i \sim \mathcal{B}(p_\theta(x_i))$. To estimate $\theta$, we use the maximum likelihood principle [18], which states that, given the learning set $\{(x_i, y_i)\}_{1 \le i \le n}$, the optimal value of $\theta$ is the one that maximizes the likelihood function
$$L(\theta) = \prod_{i=1}^{n} \mathbb{P}(Y_i = y_i).$$
In the following, we will slightly abuse notation in order to keep the formulas readable.
By separating the observations according to the binary variable $y_i$, we can rewrite $L(\theta)$ as follows:
$$L(\theta) = \prod_{i :\, y_i = 1} p_\theta(x_i) \prod_{i :\, y_i = 0} \left(1 - p_\theta(x_i)\right).$$
Using the identity
$$1 - \sigma(t) = \sigma(-t), \qquad t \in \mathbb{R},$$
we can then express $L(\theta)$ as a function of $\sigma$:
$$L(\theta) = \prod_{i :\, y_i = 1} \sigma\left(\langle \theta, \tilde{x}_i \rangle\right) \prod_{i :\, y_i = 0} \sigma\left(-\langle \theta, \tilde{x}_i \rangle\right).$$
Moreover, for all $i$, we have
(i) if $y_i = 1$, then $\mathbb{P}(Y_i = y_i) = p_\theta(x_i) = \sigma\left(\langle \theta, \tilde{x}_i \rangle\right)$,
(ii) if $y_i = 0$, then $\mathbb{P}(Y_i = y_i) = 1 - p_\theta(x_i) = \sigma\left(-\langle \theta, \tilde{x}_i \rangle\right)$.
Therefore,
$$\mathbb{P}(Y_i = y_i) = p_\theta(x_i)^{y_i} \left(1 - p_\theta(x_i)\right)^{1 - y_i}.$$
Consequently,
$$L(\theta) = \prod_{i=1}^{n} p_\theta(x_i)^{y_i} \left(1 - p_\theta(x_i)\right)^{1 - y_i}.$$
Thus, the likelihood function can be written as follows:
$$L(\theta) = \prod_{i=1}^{n} \sigma\left(\langle \theta, \tilde{x}_i \rangle\right)^{y_i} \sigma\left(-\langle \theta, \tilde{x}_i \rangle\right)^{1 - y_i}.$$
The $\log$ function is strictly increasing, so maximizing $L(\theta)$ with respect to $\theta$ is equivalent to maximizing $\log L(\theta)$ with respect to $\theta$. Moreover, maximizing $\log L(\theta)$ with respect to $\theta$ is equivalent to minimizing $-\log L(\theta)$ with respect to $\theta$. Therefore, the maximum likelihood estimator (MLE) of the parameters is the argument $\hat{\theta}$ which minimizes the criterion $E$ defined by
$$E(\theta) = -\sum_{i=1}^{n} \left[ y_i \log \sigma\left(\langle \theta, \tilde{x}_i \rangle\right) + (1 - y_i) \log\left(1 - \sigma\left(\langle \theta, \tilde{x}_i \rangle\right)\right) \right].$$
The criterion $E$ defined above is nothing more than the cross-entropy [19], which is used as the cost function for the learning phase of logistic regression. In logistic regression, we instead minimize the average cross-entropy contained in the training data. In order not to overburden the notation, we will assume in the following that
$$E(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \sigma\left(\langle \theta, \tilde{x}_i \rangle\right) + (1 - y_i) \log\left(1 - \sigma\left(\langle \theta, \tilde{x}_i \rangle\right)\right) \right].$$
Note that the cross-entropy function has the advantage of being convex in $\theta$, unlike cost functions based on the mean squared error.
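As an illustration, the average cross-entropy above can be evaluated with a few lines of NumPy; this is only a sketch under our own naming conventions (`X_aug` denotes the matrix of augmented vectors), with a small constant added inside the logarithms to avoid evaluating $\log 0$.

```python
import numpy as np

def cross_entropy(theta, X_aug, y, eps=1e-12):
    """Average cross-entropy E(theta).
    X_aug: (n, d+1) matrix whose i-th row is the augmented vector (1, x_i1, ..., x_id).
    y: (n,) vector of 0/1 labels."""
    a = 1.0 / (1.0 + np.exp(-X_aug @ theta))   # a_i = sigma(<theta, x_i~>)
    return -np.mean(y * np.log(a + eps) + (1.0 - y) * np.log(1.0 - a + eps))
```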
3.2. Analysis of the Cost Function
The cost function $E(\theta)$ given above involves the quantities $\sigma\left(\langle \theta, \tilde{x}_i \rangle\right) \in\, ]0, 1[$ and $y_i \in \{0, 1\}$, for all $i$. Therefore, the two terms of the sum can be simplified according to whether $y_i$ is 1 or 0:
(1) the term $-\log \sigma\left(\langle \theta, \tilde{x}_i \rangle\right)$ concerns only observations labeled 1
(2) the term $-\log\left(1 - \sigma\left(\langle \theta, \tilde{x}_i \rangle\right)\right)$ concerns only observations labeled 0
The function $a \mapsto \log(a)$, plotted in blue in Figure 2, shows that
(1) when $a$ goes to 1 (i.e., the model predicts the true label), $\log(a)$ goes to 0, i.e., the loss contributed to the cost function is negligible
(2) when $a$ goes to 0 (i.e., the model predicts the false label), $\log(a)$ rapidly decreases to $-\infty$, i.e., the loss contributed to the cost function becomes very large
The function $a \mapsto \log(1 - a)$, plotted in red in Figure 2, shows that
(3) when $a$ goes to 0 (i.e., the model predicts the true label), $\log(1 - a)$ goes to 0, i.e., the loss contributed to the cost function is negligible
(4) when $a$ goes to 1 (i.e., the model predicts the false label), $\log(1 - a)$ rapidly decreases to $-\infty$, i.e., the loss contributed to the cost function becomes very large
[Figure 2: the functions $a \mapsto \log(a)$ (blue) and $a \mapsto \log(1 - a)$ (red) on $]0, 1[$.]
Thus, these items show that the classification model behaves as expected.
3.3. Minimization of the Cost Function
Another central concept in machine learning is the gradient descent algorithm [3]. This algorithm consists of a step-by-step descent towards the minimum of $E$ using its gradient. We now calculate the gradient of $E$ using the chain rule.
By setting
$$a_i = \sigma\left(\langle \theta, \tilde{x}_i \rangle\right), \qquad i = 1, \dots, n,$$
the function $E$ is rewritten in the following form:
$$E(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(a_i) + (1 - y_i) \log(1 - a_i) \right].$$
Therefore, for all $j \in \{0, 1, \dots, d\}$, we have
$$\frac{\partial E}{\partial \theta_j}(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ \frac{y_i}{a_i} - \frac{1 - y_i}{1 - a_i} \right] \frac{\partial a_i}{\partial \theta_j} = \frac{1}{n} \sum_{i=1}^{n} (a_i - y_i)\, \tilde{x}_{ij},$$
since $\sigma'(t) = \sigma(t)\left(1 - \sigma(t)\right)$ and $\partial \langle \theta, \tilde{x}_i \rangle / \partial \theta_j = \tilde{x}_{ij}$.
We deduce from the above relations that
$$\nabla E(\theta) = \frac{1}{n} \sum_{i=1}^{n} (a_i - y_i)\, \tilde{x}_i,$$
where $\nabla E(\theta) = \left(\partial E / \partial \theta_0, \dots, \partial E / \partial \theta_d\right)^T$. By setting
$$X = \begin{pmatrix} \tilde{x}_1^T \\ \vdots \\ \tilde{x}_n^T \end{pmatrix}, \qquad a = (a_1, \dots, a_n)^T, \qquad y = (y_1, \dots, y_n)^T,$$
the gradient of $E$ is given by the compact form:
$$\nabla E(\theta) = \frac{1}{n} X^T (a - y).$$
Algorithm 1 summarizes the different steps to minimize the cross-entropy function.
[Algorithm 1: gradient descent minimization of the cross-entropy cost function $E(\theta)$.]
The parameter $\alpha \in [0, 1]$, called the learning rate, controls the speed of the descent towards the minimum of the cost function. We say that $\alpha$ is a hyperparameter of the logistic model because it is not easy to assign it an effective value. In practice, it is determined experimentally by trying several values of this parameter.
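The following NumPy sketch implements a gradient descent loop in the spirit of Algorithm 1, using the compact gradient $\nabla E(\theta) = X^T(a - y)/n$ derived above; the initialization at $\theta = 0$, the default values of $\alpha$ and the number of epochs, and the function name are our own choices rather than prescriptions of the algorithm.

```python
import numpy as np

def fit_logistic(X, y, alpha=0.1, epochs=1000):
    """Minimize the average cross-entropy by gradient descent (cf. Algorithm 1).
    X: (n, d) matrix of observations; y: (n,) vector of 0/1 labels."""
    n, d = X.shape
    X_aug = np.hstack([np.ones((n, 1)), X])       # augmented vectors (1, x_1, ..., x_d)
    theta = np.zeros(d + 1)                       # arbitrary starting point
    for _ in range(epochs):
        a = 1.0 / (1.0 + np.exp(-X_aug @ theta))  # a_i = sigma(<theta, x_i~>)
        grad = X_aug.T @ (a - y) / n              # gradient of the average cross-entropy
        theta -= alpha * grad                     # descent step with learning rate alpha
    return theta
```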
3.4. Computational Cost of Logistic Regression
The computational cost of a numerical method is the number of elementary operations (or flops) required to execute the method. The construction of a supervised learning model generally goes through two phases, namely the training phase and the testing phase. Here, we calculate the computational cost of each phase. Training cost. Recall that $d + 1$ is the number of parameters in the regression problem, bias included, and that $n$ is the number of observations. For each epoch of the gradient descent, the calculation of the vector $a = \sigma(X\theta)$ requires $n(d + 1)$ multiplications, $nd$ additions, and $n$ evaluations of the function $\sigma$. Moreover, updating the vector $\theta$ requires of the order of $n(d + 1)$ further subtractions, multiplications, and additions. Thus, if $K$ is the number of epochs of the gradient descent, then the computational cost of the logistic regression training, which is of order $O(Knd)$ flops, is given by Table 1, where $O$ denotes the Landau symbol.
Test cost. At the end of the learning phase, we get an estimator $\hat{\theta}$ of the vector $\theta$. Regression model validation is performed on a new dataset, the test set, which was not used in the training phase. For any observation $x$ of the test set, we compute the inner product $\langle \hat{\theta}, \tilde{x} \rangle$. If $\langle \hat{\theta}, \tilde{x} \rangle \ge 0$, then the point $x$ belongs to the class $\mathcal{X}_1$; otherwise, it belongs to the class $\mathcal{X}_0$. This calculation requires $d$ additions, $d + 1$ multiplications, and 1 conditional test, i.e., $2d + 2$ flops. Thus, if $m$ is the number of observations in the test set, then the test phase requires $m(2d + 2)$ flops.
Since the size of the test set is smaller than that of the training set, building a logistic regression model requires in total $O(Knd)$ flops.
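For completeness, here is a sketch of the test phase described above: each new observation is scored with the inner product $\langle \hat{\theta}, \tilde{x} \rangle$ and assigned to a class by a sign test, which is equivalent to thresholding the sigmoid at $1/2$. The function name is ours.

```python
import numpy as np

def classify(theta_hat, X_test):
    """Test phase: compute <theta_hat, x~> for each row of X_test and threshold at 0."""
    m = X_test.shape[0]
    X_aug = np.hstack([np.ones((m, 1)), X_test])  # augmented test observations
    scores = X_aug @ theta_hat                    # one inner product per observation
    return (scores >= 0).astype(int)              # class 1 if the score is nonnegative
```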
3.5. Space Cost of Logistic Regression
Knowing that the memory unit is the space used to store an integer or a real number, the space cost of a numerical method is the number of memory units needed to execute the method. Here, we calculate the space cost of each phase.
Training cost. During the training phase, we have to store the matrix $X$ and the vectors $y$ and $\theta$, which requires $n(d + 1) + n + (d + 1)$ units of memory.
Test cost. During the test phase, we have to store the matrix $X_{\text{test}}$ and the vector $\hat{\theta}$, which requires $m(d + 1) + (d + 1)$ units of memory.
Therefore, building a logistic regression model requires in total $(n + m + 2)(d + 1) + n$ units of memory, that is, $O(nd)$ units of memory.
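The accounting above can be mirrored by a small helper that simply counts the stored numbers (design matrices, label vector, and parameter vector); the variable names and the exact bookkeeping are ours and only follow the description given in this section.

```python
def memory_units(n_train, n_test, d):
    """Rough count of memory units following Section 3.5:
    training: the n_train-by-(d+1) matrix, the label vector y, and theta;
    testing: the n_test-by-(d+1) matrix and theta."""
    training = n_train * (d + 1) + n_train + (d + 1)
    testing = n_test * (d + 1) + (d + 1)
    return training + testing
```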
3.6. Limits of Logistic Regression
Logistic regression is a fairly simple binary classification technique to understand and implement. However, it has some disadvantages that should be taken into consideration before applying it. Here, we give a list of the disadvantages most discussed in the literature.
(1) Logistic regression is used to predict a binary outcome based on predictors which are exclusively categorical or continuous. However, it is not suitable for the prediction of continuous outcomes, which limits the use of this tool to particular databases.
(2) The size of the dataset has a direct impact on the quality of the logistic regression model. Thus, if this size is not large enough, the model automatically overfits. According to [20], a minimum of 50 observations per variable is recommended to achieve a good level of stability. We refer the reader to article [21] for more details on the choice of dataset size.
(3) Logistic regression is very sensitive to multicollinearity between the predictors, which can be checked using a correlation matrix. Hence, it is necessary to examine the correlations between the predictors before proceeding with the development of the model. When some predictors are strongly correlated with each other, it is better to eliminate some of them, because they are probably redundant variables.
(4) Logistic regression requires mutual independence between the observations in the dataset; otherwise, the logistic regression model would tend to overweight the importance of dependent observations. For example, some clinical trials are conducted using a matched-pair design that compares two similar individuals, one taking a treatment and the other taking a placebo. In addition, other clinical trials are conducted on the same individuals, taking measurements on these individuals before and after the treatments. If the dataset comes from this type of experiment, logistic regression is not appropriate to build a prediction model based on this dataset ([22], p. 350).
4. Geometric Interpretation of Logistic Regression
Let $x \in \mathbb{R}^d$ and let $x_{\mathcal{H}}$ be the orthogonal projection of the point $x$ on the hyperplane $\mathcal{H}$. Then, $\mathcal{H}$ is the hyperplane passing through $x_{\mathcal{H}}$ and with normal vector $w = (\theta_1, \dots, \theta_d)^T$. Let $\delta(x)$ be the distance from $x$ to $\mathcal{H}$. Then,
$$\delta(x) = \|x - x_{\mathcal{H}}\|.$$
Therefore,
$$\delta(x) = \frac{|\theta_0 + \langle w, x \rangle|}{\|w\|}.$$
Lemma 1.
Proof. [Lemma 1](1)If .(2)If , we will use a proof by contradiction to show that . So, suppose and . Since and have the same direction, then there exists such that . Consequently,We deduce from equation (26) thatSince , then it satisfies the Cartesian equation of , i.e.,Equation (28) implies that , which contradicts the starting hypothesis. Thus, .
The hyperplane $\mathcal{H}$ partitions the affine space $\mathbb{R}^d$ into two disjoint subsets $E_0$ and $E_1$, such that
$$E_1 = \{x \in \mathbb{R}^d : \theta_0 + \langle w, x \rangle \ge 0\} \quad \text{and} \quad E_0 = \{x \in \mathbb{R}^d : \theta_0 + \langle w, x \rangle < 0\}.$$
Let $(x_i, y_i)$ be any element of the training set; then
$$y_i = 1 \Longrightarrow x_i \in E_1 \quad \text{and} \quad y_i = 0 \Longrightarrow x_i \in E_0.$$
We deduce that
$$\mathcal{X}_1 \subset E_1 \quad \text{and} \quad \mathcal{X}_0 \subset E_0.$$
Definition 1 (signed distance). The function $f$ defined by
$$f(x) = \frac{\theta_0 + \langle w, x \rangle}{\|w\|}, \qquad x \in \mathbb{R}^d,$$
is a signed (or oriented) distance function from $x$ to the hyperplane $\mathcal{H}$.
The function $f$ satisfies the following properties:
(1) $f(x) > 0$ if $x \in E_1 \setminus \mathcal{H}$
(2) $f(x) < 0$ if $x \in E_0$
(3) $|f(x)|$ decreases as $x$ approaches $\mathcal{H}$
(4) $f(x) = 0$ if $x \in \mathcal{H}$
(5) $f(x)$ tends to $+\infty$ if $x$ moves away from $\mathcal{H}$ while remaining in $E_1$
(6) $f(x)$ tends to $-\infty$ if $x$ moves away from $\mathcal{H}$ while remaining in $E_0$
(7) the larger $f(x)$ while remaining positive, the more certain we are that $x$ is in class $\mathcal{X}_1$
(8) the larger $|f(x)|$ while $f(x)$ remains negative, the more certain we are that $x$ is in class $\mathcal{X}_0$
The sigmoid function is often used point-blank in logistic regression, without knowing how this function appeared in logistic regression. We give, in the following two sections, two different justifications for the introduction of the sigmoid in logistic regression.
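Assuming Definition 1 takes the usual form $f(x) = (\theta_0 + \langle w, x \rangle)/\|w\|$ with $w = (\theta_1, \dots, \theta_d)^T$, the signed distance can be computed as follows; the sign of the result indicates on which side of $\mathcal{H}$ the point lies, in line with properties (1) and (2) above.

```python
import numpy as np

def signed_distance(x, theta):
    """Signed distance from x to the hyperplane theta_0 + <w, x> = 0, with w = theta[1:]."""
    w = theta[1:]
    return (theta[0] + np.dot(w, x)) / np.linalg.norm(w)
```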
5. First Justification on the Origin of the Sigmoid
Interpreting the linear combination
$$\theta_0 + \theta_1 x_1 + \dots + \theta_d x_d$$
as a signed distance between $x$ and the separating hyperplane $\mathcal{H}$, whose normal vector is $w$, allowed us to see logistic regression from different angles, and thus to shed light on many points that were previously obscure in many neural network algorithms. Precisely, this study provides statistical answers to the crucial question of where the logistic function, on which most neural network algorithms are based, comes from.
For a given value of $x$, we denote by
$$p = \mathbb{P}(Y = 1 \mid X = x) \quad \text{and} \quad 1 - p = \mathbb{P}(Y = 0 \mid X = x).$$
To measure the relative strength of $\mathbb{P}(Y = 1 \mid X = x)$ compared to $\mathbb{P}(Y = 0 \mid X = x)$, rather than directly considering the probability $p$, we can take the odds, $\operatorname{odds}(p)$, between these two probabilities [23]:
$$\operatorname{odds}(p) = \frac{p}{1 - p}.$$
What is interesting about the odds is that they give a direct perception of the relative strength of the two probabilities.
We now explain how to link the notion of odds to the notion of signed distance introduced in Definition 1. It is quite clear that
(1) the linear combination $x \mapsto \theta_0 + \theta_1 x_1 + \dots + \theta_d x_d$ is an onto mapping of $\mathbb{R}^d$ to $\mathbb{R}$
(2) the function $p \mapsto \operatorname{odds}(p)$, shown in Figure 3, is a one-to-one correspondence of $[0, 1]$ to $[0, +\infty]$
Therefore, $p \mapsto \log\left(\operatorname{odds}(p)\right)$ is a one-to-one correspondence of $[0, 1]$ to $[-\infty, +\infty]$. Thus, for every value $t = \theta_0 + \theta_1 x_1 + \dots + \theta_d x_d$ of the linear combination, there exists a unique number $p$ in $[0, 1]$ such that $\log\left(\operatorname{odds}(p)\right) = t$, i.e.,
$$\log\left(\frac{p}{1 - p}\right) = \theta_0 + \theta_1 x_1 + \dots + \theta_d x_d.$$
[Figure 3: the odds function $p \mapsto p/(1 - p)$ on $[0, 1]$.]
The logarithm of the odds behaves in a way that is fairly consistent with a signed distance. Indeed, in view of Figure 4, it can be seen that
(1) if $p$ tends to 0, then $\log\left(\operatorname{odds}(p)\right)$ tends to $-\infty$ at high speed
(2) if $p$ tends to 1, then $\log\left(\operatorname{odds}(p)\right)$ tends to $+\infty$ at high speed
(3) if $p$ tends to 0.5, then $\log\left(\operatorname{odds}(p)\right)$ tends to 0 at a fairly low speed
[Figure 4: the function $p \mapsto \log\left(\operatorname{odds}(p)\right)$ on $]0, 1[$.]
Taking the exponential of both sides of the previous equation, we get
$$\frac{p}{1 - p} = e^{\theta_0 + \theta_1 x_1 + \dots + \theta_d x_d}.$$
This implies $p = (1 - p)\, e^{\theta_0 + \theta_1 x_1 + \dots + \theta_d x_d}$, i.e.,
$$p = \frac{e^{\theta_0 + \theta_1 x_1 + \dots + \theta_d x_d}}{1 + e^{\theta_0 + \theta_1 x_1 + \dots + \theta_d x_d}} = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \dots + \theta_d x_d)}} = \sigma\left(\theta_0 + \theta_1 x_1 + \dots + \theta_d x_d\right).$$
Thus, we have given a first statistical justification for the origin of the sigmoid function.
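This derivation can be checked numerically: starting from a value $t$ of the linear combination, the unique $p$ satisfying $\log(p/(1-p)) = t$ is indeed $\sigma(t)$. The snippet below is only a sanity check of this algebra.

```python
import numpy as np

t = np.linspace(-6.0, 6.0, 13)         # values of the linear combination
p = 1.0 / (1.0 + np.exp(-t))           # candidate solution p = sigma(t)
log_odds = np.log(p / (1.0 - p))       # log-odds of p
print(np.allclose(log_odds, t))        # True: the sigmoid inverts the log-odds
```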
6. Second Justification on the Origin of the Sigmoid
Suppose the points in the training set are observations of two random vectors $X^{(0)}$ and $X^{(1)}$ with Gaussian distributions such that
$$X^{(1)} \sim \mathcal{N}(\mu, \Sigma) \quad \text{and} \quad X^{(0)} \sim \mathcal{N}(-\mu, \Sigma),$$
where the covariance matrix $\Sigma$ is a regular (invertible) real matrix. Thus, if an observation comes from $X^{(1)}$, then $y = 1$, and if it comes from $X^{(0)}$, then $y = 0$. Note that the hypothesis of opposite means is not restrictive: since there exists an affine hyperplane $\mathcal{H}$ which separates the observations into two disjoint subsets of $\mathbb{R}^d$, we can find a coordinate system of $\mathbb{R}^d$ in which $X^{(0)}$ and $X^{(1)}$ have opposite means. We denote by $\Delta$ any affine line orthogonal to the hyperplane $\mathcal{H}$. For all $x \in \mathbb{R}^d$, we denote by $z$ the orthogonal projection of $x$ on $\Delta$. Then, as shown in Figure 5,
(i) the projected points of class 1 come from a random Gaussian variable with mean $m$ and variance $s^2$ [24]
(ii) the projected points of class 0 come from a random Gaussian variable with mean $-m$ and variance $s^2$
[Figure 5: orthogonal projection of the two Gaussian classes on a line $\Delta$ orthogonal to the separating hyperplane $\mathcal{H}$.]
Let $Z$ be the random variable corresponding to the orthogonal projection of the random vector $X$ on the vector line $\Delta$. For all $x \in \mathbb{R}^d$, we denote by $z$ the orthogonal projection of $x$ on $\Delta$. Let $p_1(z)$ and $p_0(z)$ be the following probabilities:
$$p_1(z) = \mathbb{P}(Y = 1 \mid Z = z) \quad \text{and} \quad p_0(z) = \mathbb{P}(Y = 0 \mid Z = z).$$
The odds of the event $\{Y = 1 \mid Z = z\}$ occurring are
$$\operatorname{odds}(z) = \frac{p_1(z)}{p_0(z)} = \frac{\mathbb{P}(Y = 1)\, f_1(z)}{\mathbb{P}(Y = 0)\, f_0(z)},$$
where $f_1$ and $f_0$ denote the densities of the projected points of classes 1 and 0, respectively. Since $f_1$ and $f_0$ are the Gaussian densities $\mathcal{N}(m, s^2)$ and $\mathcal{N}(-m, s^2)$, the previous equation is rewritten as
$$\operatorname{odds}(z) = \exp\left(\frac{2 m z}{s^2} + c\right), \qquad \text{where } c = \log\frac{\mathbb{P}(Y = 1)}{\mathbb{P}(Y = 0)}.$$
This implies $p_1(z) = \operatorname{odds}(z) / \left(1 + \operatorname{odds}(z)\right)$, i.e.,
$$p_1(z) = \frac{1}{1 + e^{-\left(\frac{2 m z}{s^2} + c\right)}} = \sigma\left(\frac{2 m z}{s^2} + c\right).$$
Thus, we have given a second statistical justification for the origin of the sigmoid function.
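This second justification can also be checked numerically: for two Gaussian classes projected on $\Delta$ with opposite means and a common variance, the posterior probability of class 1 coincides with a logistic function of the projected coordinate. The means, the variance, and the equal class priors used below are illustrative assumptions of ours.

```python
import numpy as np
from scipy.stats import norm

m, s = 1.5, 1.0                                  # opposite means +m / -m, common std s
z = np.linspace(-5.0, 5.0, 101)
f1 = norm.pdf(z, loc=+m, scale=s)                # density of the projected class-1 points
f0 = norm.pdf(z, loc=-m, scale=s)                # density of the projected class-0 points
posterior = f1 / (f1 + f0)                       # P(Y = 1 | Z = z) with equal priors
logistic = 1.0 / (1.0 + np.exp(-2.0 * m * z / s**2))
print(np.allclose(posterior, logistic))          # True: the posterior is a sigmoid of z
```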
7. Numerical Validation
7.1. Illustration of the Computational Cost
As shown in Section 3.4, if the number of epochs $K$ and the number of explanatory variables $d$ are negligible compared to the number of observations $n$, then the computational cost of the logistic regression is $O(n)$ flops. The purpose of this section is to confirm this result by computer simulation. To do this, we designed the following plan.
(1) We simulated a dataset composed of $d$ explanatory variables, a dependent variable $Y$, and $n$ observations.
(2) $X$ is an $n$-by-$d$ matrix of uniformly distributed random numbers between 0 and 1.
(3) For the reproducibility of this experiment, we set the seed of the random generator to 1.
(4) We considered a separating hyperplane $\mathcal{H}$ with a fixed Cartesian equation $\theta_0 + \theta_1 x_1 + \dots + \theta_d x_d = 0$.
(5) For all $i$, $y_i$ is a realization of a random variable distributed according to a Bernoulli distribution $\mathcal{B}(p_i)$, where $p_i = \sigma\left(\langle \theta, \tilde{x}_i \rangle\right)$.
(6) We implemented Algorithm 1 by setting the learning rate $\alpha$ and the number of epochs $K$.
(7) For each value of $n$ given by Table 2, we applied, several times, the logistic regression on the subset formed by the first $n$ rows of $X$, and we calculated the average CPU time, in seconds, spent by the logistic regression over these trials.
To implement this experiment, we installed Anaconda Distribution 2022.10, which comes with the Python 3.9 programming language. All Python scripts were run on a laptop equipped with an Intel dual-core i7-4510U processor clocked at 2.0 GHz, 8 GB of RAM, a 1 TB hard disk, and the Windows 8.1 operating system.
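A condensed sketch of this simulation plan is given below; it reuses the `fit_logistic` function from Section 3.3, and the dataset size, separating rule, learning rate, number of epochs, and subset sizes shown here are illustrative placeholders of ours rather than the exact values used in the experiment.

```python
import time
import numpy as np

rng = np.random.default_rng(1)                        # fixed seed for reproducibility
d, n_max = 5, 200_000                                 # illustrative dimensions
X = rng.random((n_max, d))                            # uniform features in [0, 1]
p = 1.0 / (1.0 + np.exp(-(X.sum(axis=1) - d / 2)))    # probabilities from a toy hyperplane
y = rng.binomial(1, p)                                # Bernoulli labels

for n in (10_000, 50_000, 100_000, 200_000):
    start = time.perf_counter()
    fit_logistic(X[:n], y[:n], alpha=0.1, epochs=100)  # training loop from Section 3.3
    print(n, round(time.perf_counter() - start, 3), "seconds")
```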
Figure 6 presents the results of this experiment. As we can see, the average CPU time follows, up to a multiplicative constant, the same growth as $n$; that is, it increases linearly with the number of observations.
[Figure 6: average CPU time of the logistic regression as a function of the number of observations $n$.]
This experiment is consistent with the computational cost of logistic regression calculated in Section 3.4.
7.2. Illustration of Overfitting
We mentioned in Section 3.6 that if the size of the dataset is not large enough, the model automatically overfits. The purpose of this section is to confirm this result by computer simulation. To do this, we consider the dataset introduced in the previous section. To measure the performance of the logistic regression, we use 10-fold stratified cross-validation [25], setting shuffle = True and random_state = 1.
For each value of $n$ given by Table 2, we calculate the mean and the standard deviation of both the accuracy and the F1-score [26] resulting from the stratified cross-validation of the classifier on the subset formed by the first $n$ rows of $X$. Figures 7 and 8 illustrate the outcomes of this experiment. As we can see, the trend of the mean accuracy (resp. F1-score) increases with $n$, while that of the standard deviation decreases with $n$. Moreover, one can also observe overfitting of the classifier for the smallest dataset sizes.
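A minimal scikit-learn sketch of this protocol is shown below; it reuses the simulated X and y from the previous section, and the subset sizes are illustrative, not those of Table 2.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
clf = LogisticRegression(max_iter=1000)

for n in (100, 1_000, 10_000, 100_000):                    # illustrative subset sizes
    acc = cross_val_score(clf, X[:n], y[:n], cv=cv, scoring="accuracy")
    f1 = cross_val_score(clf, X[:n], y[:n], cv=cv, scoring="f1")
    print(n, acc.mean(), acc.std(), f1.mean(), f1.std())
```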
[Figures 7 and 8: mean and standard deviation of the accuracy and of the F1-score as functions of the dataset size $n$.]
8. Conclusion
The present study provides, through two different approaches, a rigorous statistical answer to the crucial question of where the logistic function, on which most neural network algorithms are based, comes from. We find these two ways of arriving at the logistic function very interesting. Indeed, the logistic function is not introduced by chance or by common sense; its use in logistic regression is perfectly justified. These justifications allow us to see logistic regression from different angles and thus shed light on many points that were previously obscure in many neural network algorithms.
Moreover, we have shown, through computer simulations, that building a logistic regression model requires $O(Knd)$ flops and suffers from overfitting for small datasets.
Data Availability
The data used to support the findings of this study are included in the article.
Additional Points
A preliminary version [27] of this study is accessible at https://journals.kozminski.cem-j.org/index.php/pl_cemj/article/view/169/166. We confirm that this earlier manuscript has no DOI and was published in a predatory journal.
Conflicts of Interest
The authors declare that there are no conflicts of interest.
Acknowledgments
The researchers would like to thank the Deanship of Scientific Research Qassim University for funding the publication of this project.