Abstract

Binomial regression, a generalized linear model (GLM), is used in the natural sciences to identify the covariate structure responsible for observed outcomes. The adequacy and effectiveness of any model should be assessed before it is put to use. In the GLM context, this study explores the structure and usefulness of partial residual (PRES), augmented partial residual (APRES), and conditional expectation and residuals (CERES) plots for visualizing influence diagnostics as a function of selected predictors. Binomial regression is considered with predictor transformation, and PRES, APRES, and CERES plots are constructed to diagnose outliers and multicollinearity. How well these plots convey a clear visual impression may vary with the behaviour of the response variable and the associated link function across different covariates. The techniques are applied to data on the hindered internal rotation (HIR) treatment of chemical species to recognize patterns for efficient modelling. The empirical power of the tests based on the different plots shows that APRES and CERES (L) attain the highest power for detecting outliers and multicollinearity. The results indicate that residual plots are more effective than conventional methods and help scientists model their data easily and effectively for diagnostic purposes.

1. Introduction

Thermodynamic properties are commonly used to describe chemical and biological processes. A rigorous hindered internal rotation (HIR) treatment is generally used to obtain accurate thermodynamic data for chemical species over a range of temperatures, which requires sufficient information about the internal rotation. To study the patterns and relationships in HIR data, statistical methods have been established that lead to reliable conclusions [1–3].

Statistical methods are widely applied in the pure and applied sciences. Among these tools, regression analysis is the most widely used for quantifying relationships among different entities. Generalized linear models (GLMs), a class of regression models, use exponential family-based responses to handle the relatively complex relationships usually found in the chemical, social, medical, and life sciences, economics, finance, and other fields. Like other methods, regression analysis rests on assumptions that need to be checked before it is applied [4–6].

Statistical graphs are among the most powerful techniques for presenting, describing, supporting, and interpreting numerical data. They usually show the relationship between two or more variables. Their power arises from the fact that they convey large quantities of information quickly and efficiently, because a human viewer can readily grasp patterns that would not be apparent from a list of values [7, 8].

Different statistical techniques have been used to identify violations of assumptions in regression models. A residual plot shows the residuals against each of the explanatory variables in the model, and this relationship suggests whether a variable should be included in the model [9].

The use and interpretation of regression models depend on the estimates of the individual regression coefficients. Influential outliers can bias parameter estimates and lead to misleading results, so it is important to detect them. The usual estimator generally performs best when the explanatory variables are independent. However, when multicollinearity exists, that is, when the explanatory variables are related, the estimated regression coefficients can take wrong signs and have large standard errors [10]. Different plots have been used to detect such problems in a fitted model, and the most useful are partial residual plots [11–14].

These plots are commonly used in regression diagnostics. They indicate whether a transformation of the explanatory variables is needed when these do not enter the regression model linearly, and they also help to improve the fit of logistic regression models within the class of generalized linear models [15, 16].

The added variable plots are used for the detection of influential observations, whereas the partial residual plots are more helpful for diagnosing nonlinearity [17, 18]. The partial residual plots help the researcher in detecting outliers, assessing the presence or absence of variance heterogeneity and determining if a transformation of the explanatory variable is needed or if another term needs to be added in the model [19].

The importance of statistics in chemistry follows from the recognition that errors are widespread. There is considerable interest in basic statistics in chemistry, especially in physical, environmental, and analytical chemistry, which depend on quantitative measurements and computations. Chemists therefore rely on statistics, since the sources of error can be diverse [20]. Statistical thermodynamics, also called statistical mechanics, relates the macroscopic properties of thermodynamics to the behaviour and interactions of individual particles and can be used to compute the energy difference between the staggered ground state and the eclipsed transition state as a function of the angle of rotation [21].

Statistical analysis plays an important role in the chemical sciences, biology, bioinformatics, biochemistry, and other fields of the natural sciences. For example, a nonlinear regression model has been developed in analytical chemistry to describe the relationship between the response and one or more stimulus variables. That example also illustrates data analysis, model compatibility, diagnostic verification, forecasting, and the presentation of results [22, 23].

The application of logistic regression analysis to livestock data has been examined, where the logistic regression model is used to determine the relationship between a binary response variable and independent variables that may be discrete, continuous, or a combination of both. Logistic regression is one of the most important mathematical modelling approaches for describing the relationship among variables, and the maximum likelihood estimation procedure is applied to estimate and interpret the coefficients [24–26]. A detailed guide to the logistic regression procedure for obtaining odds ratios, with simple worked examples, a discussion of the main issues, and an interpretation of the results, is given in [27].

For data on cancer patients with ipsilateral breast tumour relapse (IBTR), a common analysis framework has been presented for logistic regression modelling with a binary classification and the time associated with IBTR, together with a simulation study [28].

The logistic regression model has been applied to data on cucumbers with a longer postharvest shelf life, with marketing possibility as the response variable and storage days, varieties, fruit weight loss, and months of evaluation as explanatory variables. The fitted model indicated that fruit weight loss, caused mainly by storage time, strongly influenced the probability of marketability, that is, whether the fruit would be rejected on the market [29].

Landwehr et al. [30] addressed partial residual plots for GLMs under canonical links, for visualizing an unknown function and identifying the need to transform predictors in regression with a binary response. Partial residual plots have been applied to visualize curvature in binary logistic regression as a function of the predictors [31] and in inverse Gaussian regression, where the link function and the stochastic behaviour of the predictors in the class of GLMs were discussed [32].

The current study addresses the construction of PRES, APRES, CERES (K), and CERES (L) plots using the response residual in the logistic regression model. The plots are constructed using transformations of the regressors and are then used for diagnostics in the logistic GLM to arrive at a suitable model.

The paper is organized as follows. Section 2 presents the construction of the residual plots. Section 3 contains a real data example with residual plots, and Section 4 presents the empirical evaluation of the different residual plots. Section 5 presents the simulation study with diagnostic power, and Section 6 concludes.

2. Partial Residual Plots’ Construction in Logistic Regression

The process of construction of partial residuals in GLMs is available in [30] and in [31, 32]. The GLM is defined as

The development in the context of regression problems is with a univariate response and a vector of random predictors , excluding the constant predictor. For a sample of independent and identically distributed observations , on the random vector , we take the probability mass function of the conditional distribution of to bewhere are known smooth functions, θ is the unknown scalar valued parameter that depends on , and is an unknown dispersion parameter. In regression function, and variance function is .

Following [31], we partition , where is , We assume that the regression function can be characterized adequately by the structure:

If the term is assessed by parametrically, thenwhere is the user specified link function which is monotonic and differentiable and is the vector of unknown parameters consisting of vector. The term is a function of or function of depending on interest and concerns.

Binomial distribution is given by

Using (2), the term in (5) is explained aswhere .

Using (1) and (6), the link function can be written aswhere

The log likelihood function of binomial regression [33] is
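As a point of reference, the standard textbook forms of the binomial probability mass function, the logit link, and the resulting log likelihood can be written as follows. This is a sketch under assumed notation: the symbols $m_i$ (number of trials), $\pi_i$ (success probability), $x_i$, and $\beta$ are introduced here for illustration and may differ from the notation of the numbered equations.

```latex
% Standard forms for a binomial response with logit link (assumed notation).
\begin{align}
P(Y_i = y_i) &= \binom{m_i}{y_i}\,\pi_i^{\,y_i}\,(1-\pi_i)^{\,m_i-y_i},
  \qquad y_i = 0, 1, \dots, m_i, \\
\eta_i &= \log\frac{\pi_i}{1-\pi_i} = x_i^{\top}\beta,
  \qquad
  \pi_i = \frac{\exp(x_i^{\top}\beta)}{1+\exp(x_i^{\top}\beta)}, \\
\ell(\beta) &= \sum_{i=1}^{n}\left[\log\binom{m_i}{y_i}
  + y_i\,x_i^{\top}\beta
  - m_i\log\!\left(1+\exp(x_i^{\top}\beta)\right)\right].
\end{align}
```

The last line follows from substituting the logit link into $y_i\log\pi_i + (m_i - y_i)\log(1-\pi_i)$.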

The maximum likelihood estimator of the unknown parameter vector can be obtained by solving the following system of equations. Because the system is nonlinear in the parameters, the iteratively reweighted least squares procedure is used to estimate the unknown parameters.
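The iterative weighted least squares idea can be sketched in a few lines for a binary (0/1) response with logit link. This is a minimal illustration in plain Python, not the software used in the study; the function names and the small data set are hypothetical, and the design matrix is assumed to carry a leading column of ones for the intercept.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def sigmoid(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def fit_logistic_irls(X, y, n_iter=25, tol=1e-10):
    """Fit a logistic regression by iteratively reweighted least squares.

    X is a list of rows (first entry 1.0 for the intercept); y is a 0/1 list.
    """
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        eta = [sum(b * v for b, v in zip(beta, row)) for row in X]
        mu = [sigmoid(e) for e in eta]
        w = [m * (1.0 - m) for m in mu]  # working weights
        # Working response: z = eta + (y - mu) / (mu (1 - mu)).
        z = [e + (yi - mi) / wi for e, yi, mi, wi in zip(eta, y, mu, w)]
        # Weighted normal equations (X' W X) beta = X' W z.
        xtwx = [[sum(wi * row[j] * row[k] for wi, row in zip(w, X))
                 for k in range(p)] for j in range(p)]
        xtwz = [sum(wi * row[j] * zi for wi, row, zi in zip(w, X, z))
                for j in range(p)]
        new_beta = solve_linear(xtwx, xtwz)
        converged = max(abs(a - b) for a, b in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


# Hypothetical toy data: one covariate, responses that are not perfectly separated.
X_demo = [[1.0, float(v)] for v in range(8)]
y_demo = [0, 0, 1, 0, 1, 1, 0, 1]
beta_demo = fit_logistic_irls(X_demo, y_demo)
```

At convergence the score equations hold, so the fitted probabilities sum to the number of observed successes; this gives a quick sanity check on any implementation.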

The fitted binomial regression model is then given bywhere and the subscript ‘f’ mean on and denotes the fitted model. The coefficient estimate , is obtained by minimizing the convex objective function:wherewhere L (.,.) is a user selected objective function assumed to be convex with respect to its first argument. This class is not very restrictive because, at the very least, it includes the use of ordinary least squares and maximum likelihood under (4) and (10) with canonical link, where and a certain robust estimates. The objective function with respect to maximum likelihood for logistic regression with link given in (7) is

The class of objective function corresponding to (11) is a generalization of the class of convex objective function, , used by [34] for additive error models (8). A partial residual (PRES2) for is obtained by using (8)–(10) and is given bywhere is the first derivative of w.r.t ‘’ and can be obtained by (11).

The term is the regression function of evaluated at .

The first derivative of the binomial link function given in (7) is

Hence, the fitted model by using link for binomial regression can be expressed aswhere are the regression estimators and denote the fitted model and are the predictors. PRES for is

Similarly, for the model having explanatory variables, the partial residual can be expressed aswhere the fitted model for explanatory variable is

The partial residual available in (18) can be notated as RPRES due to response residual. So, the response partial residual is now given aswhere for PRES. The RAPRES, RCERES (L), and RCERES (K) can be obtained by using (20).

The augmented PRES (APRES), the conditional expectation, and residuals (CERES) can be obtained by using (20). When we plug quadratic and/or interaction term instead of in (20), APRES are obtained. Similarly, the conditional expectation is used instead of in (20) to obtain CERES. is sometimes estimated nonparametrically by using LOESS and kernel function [31, 32].
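The construction described in this section can be illustrated with a short sketch. It assumes the commonly used forms of the partial residual for predictor j: the response version adds the fitted component to the response residual, and the working version (as in Landwehr et al. [30]) first scales the residual by the derivative of the logit link. The function names, the `kind` argument, and the coefficient values are hypothetical and serve only as an illustration.

```python
import math


def fitted_prob(row, beta):
    """Fitted success probability under the logit link."""
    eta = sum(b * v for b, v in zip(beta, row))
    return 1.0 / (1.0 + math.exp(-eta))


def partial_residuals(X, y, beta, j, kind="response"):
    """Partial residuals for column j of X in a fitted logistic model.

    kind="response": (y - mu) + beta_j * x_j
    kind="working":  (y - mu) / (mu * (1 - mu)) + beta_j * x_j
    Both forms are assumptions about the construction, stated for illustration.
    """
    out = []
    for row, yi in zip(X, y):
        mu = fitted_prob(row, beta)
        if kind == "response":
            res = yi - mu
        elif kind == "working":
            res = (yi - mu) / (mu * (1.0 - mu))
        else:
            raise ValueError("kind must be 'response' or 'working'")
        out.append(res + beta[j] * row[j])
    return out


# Hypothetical fitted coefficients (intercept, slope) and a tiny data set.
beta_hat = [-1.0, 0.5]
X_demo = [[1.0, 0.0], [1.0, 2.0], [1.0, 4.0]]
y_demo = [0, 1, 1]
rpres = partial_residuals(X_demo, y_demo, beta_hat, j=1, kind="response")
```

Plotting these values against the predictor in column j gives the partial residual plot. An augmented version would refit the model with an added quadratic term and include that fitted term in the sum, and a CERES-type version would replace the linear component with a smoothed estimate.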

3. Example: Hindered Internal Rotational Data

To demonstrate the application of partial residual plots in logistic regression, we use the hindered internal rotational (HIR) treatment data of chemical species [1] as a numerical example. The response is a Boolean feature indicating whether the two sides rotate in the same direction (SameDir = Y). The explanatory variables are the out-of-axis angle (AxisAngle = X1), the magnitude of the out-of-axis vibrational vector (AxisMag = X2), the average value of the changes of the dihedral angles in the rotational group (DAngle = X3), the standard deviation of the changes of the dihedral angles in the rotational group (StDAngle = X4), the magnitude of the net vibrational vector of the rotational group (VecMag = X5), the average value of the changes of the angle between every pair of atoms (BAngle = X6), the standard deviation of the changes of the angles between every pair of atoms (StBAngle = X7), the average value of the directional angles (DirAngle = X8), the standard deviation of the directional angles (StDirAngle = X9), the average value of the bending angles (StrAngle = X10), the standard deviation of the bending angles (StStrAngle = X11), the magnitude of the total torque of the currently considered rotational group (STorque = X12), the magnitude of the total torque of the opposite side with respect to the considered rotational group (Torque = X13), and the magnitude of the difference between X12 and X13 (TTorque = X14).

The logistic fitted regression model is given by

The necessary calculations, obtained with the iterative weighted least squares estimation procedure, are presented below.

The AIC and BIC values in Table 1 show that logistic regression fits better than the other popular distributions considered: its AIC and BIC are the smallest, so binomial regression is more suitable than Poisson regression and negative binomial regression for the hindered internal rotational (HIR) treatment data of chemical species [1].
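The comparison rests on the usual definitions AIC = 2k − 2 log L and BIC = k log(n) − 2 log L, where k is the number of estimated parameters, n the number of observations, and log L the maximized log likelihood. The sketch below shows the selection rule only; the log-likelihood values, k, and n are invented for illustration and are not the values of Table 1.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian (Schwarz) information criterion."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def best_by(criterion_values):
    """Name of the model with the smallest criterion value."""
    return min(criterion_values, key=criterion_values.get)


# Hypothetical maximized log likelihoods (NOT the values reported in Table 1).
candidates = {"binomial": -120.0, "poisson": -150.0, "negative binomial": -148.0}
n_params, n_obs = 15, 500  # hypothetical parameter count and sample size

aic_values = {name: aic(ll, n_params) for name, ll in candidates.items()}
bic_values = {name: bic(ll, n_params, n_obs) for name, ll in candidates.items()}
preferred = best_by(aic_values)
```

Smaller values indicate a better trade-off between fit and complexity, which is the sense in which the minimum AIC and BIC favour one model over the others.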

Table 2 presents descriptive statistics of the data as well as summary statistics for all measures of the logistic regression model. In the analysis, three variables are significant and the other eleven are nonsignificant, as their p-values are greater than 0.05, but the values of R² (0.879) and Adj-R² (0.877) are very high, which indicates the importance of these factors. The explanatory variables considered here are well known to be important in applied science. However, possibly because of data anomalies and deviations, these eleven variables were found to be nonsignificant. It is therefore important to first identify the reasons for this contradiction. Partial residual plots are used here to address this problem because they are computationally less intensive and easy for a general reader to interpret. Figures 1–4 show the PRES, APRES, CERES (K), and CERES (L) plots, respectively.

In Figure 1(a), outliers are clearly visible: some values lie far from the rest of the data, which reveals outliers, and some points lie very close to and overlay each other, which indicates the presence of multicollinearity. The problems observed in Figure 1(a) can also be seen in the remaining panels, Figures 1(b)–1(n). The same tendencies are observed in Figures 2–4. So, in the binomial regression model, outlier and multicollinearity diagnostics can be made by using PRES plots. Similarly, the APRES, CERES (K), and CERES (L) plots detect outliers. To validate these results and justify our findings, we now present the outlier and multicollinearity diagnostics through some formal testing procedures.

In Table 3, evidence of outliers (Grubbs’ test), multicollinearity (F-test), and nonnormality (Anderson–Darling test) is observed. Many other formal tests are available for outlier diagnostics, but all of them rely on regularity conditions, are computationally more intensive, and each addresses a single diagnostic only. In light of this discussion, we conclude that a residual plot can be used for outlier diagnostics and is more effective than conventional tests.

4. Empirical Evaluation of the Statistics Based on Residuals

A modified pseudolikelihood test based on residuals, partial residuals, augmented partial residuals (APRES), and conditional expectation and residuals (CERES) has been used to test linearity, and the empirical powers of these statistics have been computed [35, 36]. In this study, we use Grubbs’ test and the F-test based on RES, PRES, APRES, CERES (L), and CERES (K) to detect outliers and multicollinearity, respectively. The powers of these statistics are reported in Tables 4 and 5 (outliers) and Tables 6 and 7 (multicollinearity). The empirical power is calculated for α = 0.01 and 0.05. Sample sizes of n = 20, 30, 40, 50 for outlier detection and n = 40, 60, 80, 100 for multicollinearity, and four values of the roughness measure (2, 10, 20, 95), are chosen.

The binomial regression model is given in (10).

The aim of this section is to assess the performance of residuals (PRES, APRES, and CERES) for outlier and multicollinearity detection. For outliers, the hypotheses are
(i) H0: there are no outliers
(ii) H1: there is at least one outlier

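Grubbs’ test statistic is

$$G = \frac{\max_{i} \lvert Y_i - \bar{Y} \rvert}{s},$$

where $\bar{Y}$ and $s$ are the sample mean and standard deviation. We therefore construct the formal test for outliers based on Grubbs’ test. A large value of G is significant, i.e., H0 is rejected when

$$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^{2}_{\alpha/(2N),\,N-2}}{N-2+t^{2}_{\alpha/(2N),\,N-2}}},$$

where N is the number of observations and $t_{\alpha/(2N),\,N-2}$ denotes the upper critical value of the t-distribution with N − 2 degrees of freedom at a significance level of $\alpha/(2N)$.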
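A minimal sketch of Grubbs’ test in its standard two-sided form is given below: the statistic is the largest absolute deviation from the sample mean divided by the sample standard deviation, and the critical value is built from a t quantile with N − 2 degrees of freedom. The t quantile is passed in as an argument (to be looked up at level α/(2N)) rather than computed, and the function names and example values are hypothetical.

```python
import math


def grubbs_statistic(values):
    """G = max |x_i - mean| / s, with s the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return max(abs(v - mean) for v in values) / sd


def grubbs_critical(n, t_crit):
    """Critical value ((n-1)/sqrt(n)) * sqrt(t^2 / (n - 2 + t^2)).

    t_crit is the upper critical value of the t-distribution with n - 2
    degrees of freedom; it must be supplied by the caller from tables or
    statistical software.
    """
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))


def has_outlier(values, t_crit):
    """True when the Grubbs statistic exceeds the critical value."""
    return grubbs_statistic(values) > grubbs_critical(len(values), t_crit)


# Hypothetical residuals with one value far from the rest.
residuals_demo = [0.12, -0.05, 0.08, -0.11, 0.03, 1.90]
g_demo = grubbs_statistic(residuals_demo)
```

In the setting of this section, `values` would hold one of the residual types (RES, PRES, APRES, or CERES) rather than raw observations.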

We used F-test for the presence and severity of multicollinearity in a function with several explanatory variables. has an F distribution with and degree of freedom. We reject the null hypothesis if , and there is significant relationship between response and explanatory variables, and however, all explanatory variables have serious multicollinearity.
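One common way to build an F-test for multicollinearity is through auxiliary regressions: each explanatory variable is regressed on the remaining ones, and the resulting R² is converted into an F statistic. The sketch below assumes this auxiliary-regression form; whether it matches the exact statistic used in this study is an assumption, and the function names are hypothetical. The R² of the auxiliary regression is taken as an input.

```python
def auxiliary_f_statistic(r_squared, n_obs, n_other):
    """F statistic for an auxiliary regression of one predictor on the others.

    r_squared: R^2 of the auxiliary regression (with intercept)
    n_obs:     number of observations
    n_other:   number of other predictors used as regressors
    Under the null of no linear relation, the statistic is referred to an
    F distribution with (n_other, n_obs - n_other - 1) degrees of freedom.
    """
    numerator = r_squared / n_other
    denominator = (1.0 - r_squared) / (n_obs - n_other - 1)
    return numerator / denominator


def variance_inflation_factor(r_squared):
    """VIF = 1 / (1 - R^2), a related descriptive measure of collinearity."""
    return 1.0 / (1.0 - r_squared)


# Hypothetical auxiliary R^2 for one predictor regressed on two others.
f_demo = auxiliary_f_statistic(r_squared=0.5, n_obs=12, n_other=2)
vif_demo = variance_inflation_factor(0.5)
```

A large F value for a given predictor indicates that it is well explained by the other predictors, which is the situation described as serious multicollinearity.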

5. The Performance of Residual Plots for the Assessment of Outlier (s) and Multicollinearity

A small study was carried out to evaluate the diagnostic power of PRES, APRES, CERES (K), and CERES (L) in the presence of outliers and multicollinearity. Data of response variable were generated for the binomial regression model, in which we generate response variable from binomial (n, 0.5), and are generated from a uniform random variable on the interval (0, 30) and . We have fixed the values for the regression parameters as

We generate outliers in , by replacing the first observation in by with which are generated from standard normal distribution for fixed degree of multicollinearity (i.e., ). Then, .

No error was included, so that y is a deterministic function of the three covariates. This allows the conclusions to be illustrated more clearly than when an additive error is included but does not change the qualitative nature of the results. Sample sizes n = 20, 30, 40, 50 are used to address outliers, and n = 40, 60, 80, 100 are used to address multicollinearity. In each case, 2000 samples were generated, and the proportion of times that the observed significance was below α = 0.05 and 0.01 was counted. The bandwidth of the kernel function is 0.5, and the number of smoothing points in LOESS is held fixed when obtaining the conditional expectation in CERES. The bandwidth of the kernel function for obtaining the test statistic is 0.25. The roughness measure of the curve is set to 2, 10, 20, and 95.
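The empirical power described here is simply the proportion of simulated samples in which the test rejects. The following sketch illustrates the counting step for an outlier test. It is an illustration under stated assumptions, not the simulation design of the study: the covariate is drawn from a uniform distribution on (0, 30), the first observation is shifted by a hypothetical amount to create an outlier, and the critical value is a placeholder supplied by the caller rather than a tabulated value.

```python
import math
import random


def grubbs_statistic(values):
    """G = max |x_i - mean| / s, with s the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return max(abs(v - mean) for v in values) / sd


def empirical_power(n, n_rep, critical_value, shift, seed=0):
    """Proportion of simulated samples in which G exceeds critical_value.

    n:              sample size
    n_rep:          number of simulated samples
    critical_value: rejection threshold (placeholder, supplied by the caller)
    shift:          hypothetical amount added to the first observation
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_rep):
        x = [rng.uniform(0.0, 30.0) for _ in range(n)]
        x[0] += shift  # contaminate the first observation
        if grubbs_statistic(x) > critical_value:
            rejections += 1
    return rejections / n_rep


# Placeholder threshold of 3.0; a real study would use the tabulated value.
power_clean = empirical_power(n=20, n_rep=200, critical_value=3.0, shift=0.0)
power_contaminated = empirical_power(n=20, n_rep=200, critical_value=3.0, shift=1000.0)
```

Repeating the calculation over a grid of sample sizes and significance levels gives power tables of the kind summarized in this section.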

5.1. Simulation Result and Discussion

We use the notations RES, PRES, APRES, CERES (L), and CERES (K) for the residual, partial residual, augmented partial residual, conditional expectation and residual based on LOESS, and conditional expectation and residual based on kernel, respectively. The empirical powers of the outlier and multicollinearity test statistics are presented in Tables 4 and 5 and Tables 6 and 7, respectively. In all four tables, the test statistic is strongly affected by the roughness measure. In Tables 4 and 5, the empirical powers of RES and CERES (L) decrease as the sample size increases, whereas the powers of PRES, APRES, and CERES (K) increase. We conclude that the test statistics based on PRES, APRES, and CERES (K) are more sensitive to outliers. Similarly, in Tables 6 and 7, the empirical powers of RES, APRES, and PRES decrease as the sample size increases, whereas the powers of CERES (L), CERES (K), and APRES increase. We conclude that the test statistics based on CERES (L) and CERES (K) are more sensitive to multicollinearity. We therefore recommend using Grubbs’ test and F-test statistics based on residual plots to check for outliers and multicollinearity of selected covariates in the binomial regression model.

Figure 5 presents the PRES, APRES, CERES (L), and CERES (K) plots for the simulated data. In Figures 5(a)–5(l), some points lie outside the main body of the data, which indicates outlying cases. Moreover, some values overlay each other and crowd together in the plots, which is evidence of multicollinearity. Outliers and multicollinearity are thus visible jointly in the PRES, APRES, CERES (L), and CERES (K) plots, which leads to the conclusion that both data problems can be detected at once with these plots.

Tables 4 and 5 (outliers) and Tables 6 and 7 (multicollinearity) present the power computations for the five graphical methods on the basis of Grubbs’ test and the F-test at two levels of α. The tables show that APRES performs best for outliers, with the maximum power values, and that CERES (L) performs best for multicollinearity. The value of the roughness measure also has a notable impact on the performance of these methods. The power increases with the sample size, and some effect of α is also observed. For larger sample sizes, higher power is observed for all residual plots, but APRES and CERES (L) behave more consistently than the other methods.

6. Conclusion

Residual plots offer a data mining approach that addresses existing and future challenges of rule-based approaches for determining the rotational frequency (the most significant and so far most complex parameter of hindered internal rotation) of the chemical species of interest. Earlier work explored partial residual plots in GLMs and checked for violations of regression assumptions using inverse Gaussian regression [32]. Other work investigated the conditions under which partial residual plots provide a useful transformation of the predictor in the class of GLMs by using the response residual in the binomial regression model [31].

This study describes the construction and evaluation of RES, PRES, APRES, CERES (L), and CERES (K) plots for binomial regression on the hindered internal rotational (HIR) treatment data of chemical species [1], by calculating the power of the tests. APRES and CERES (L) plots are found to be effective techniques for influence diagnostics, so they are recommended for detecting outliers and multicollinearity. The usefulness of these plots and tests may be restricted to the specified GLM, the link function, and the stochastic behaviour of the predictors.

The results are also validated by the simulation study with diagnostic power, which shows that residual plots are an effective technique for outlier and multicollinearity diagnostics and can identify several problems at once. We therefore conclude that residual plots are a useful graphical technique for diagnosing outliers and multicollinearity in a single graph in chemical processes.

In applied network science, scientists usually expect relationships among certain elements and variables on the basis of natural phenomena. However, when statistical methodologies are applied, some unanticipated results arise. This is due to (i) the wrong application of statistical methodology and (ii) contradictory and misleading assumptions about the data under study. In the latter case, diagnostics for the assumptions are made first, and the particular statistical methodology is then applied after suitable transformations [16]. In our case, two different techniques produce two opposite results: the p-values suggest that the underlying factors are unimportant, while the large values of R² and Adj-R² suggest that these factors are highly important. It is therefore necessary to carry out diagnostic checking before proceeding further. Moreover, diagnostic checking is usually done with different statistical tests, each dedicated to a single diagnostic, so it is computationally intensive and requires a strong background in statistics. In contrast, this study suggests the use of residual plots for diagnostic checking in a very simple way. These plots can also diagnose multiple problems on a single graph. This will help users to assess anomalies in their data first by using these plots and then proceed with their planning and policy making.

Data Availability

The numerical example of the hindered internal rotational (HIR) treatment data of chemical species was taken from Le et al. (2018). The data used to support the findings of the study can be obtained from the references cited within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.