Abstract

The detection of influential cases is a common practice for obtaining reliable regression results in the generalized linear model. For the identification of influential cases, the present study empirically compares the performance of various existing residuals for the Poisson regression model. Furthermore, we compute Cook's distance for the stated residuals. To show the effectiveness of the proposed methodology, data are generated by simulation, and the applicability of the methodology is further demonstrated with real data that follow the Poisson regression model. A comparative analysis of the residuals is carried out for the detection of influential cases.

1. Introduction

The foremost assumption for the application of regression analysis is that the observations are iid, i.e., independently and identically distributed. However, in real-life applications, the iid assumption is often too strong. For instance, the mean occurrence rate of an event may vary from one case to another and may depend on some of the variables. Conventional regression techniques are not appropriate for a response that records the occurrence of rare events. Thus, it is suggested to modify the classical linear regression model so that it becomes suitable for examining Poisson-distributed variables. Poisson regression is a member of the generalized linear model family and uses the logarithmic transformation. Therefore, to study occurrence rates, count data regression is more appropriate than least squares regression. The Poisson regression model (PRM) is an approach used to give a better understanding of count data. There are numerous applications of Poisson regression in different fields of study when the variable under consideration takes the form of counts or nonnegative integers; see [19].

To assess the adequacy of a fitted model, the analysis of residuals and the identification of outliers play the most important role, since the characteristics of the data have a strong influence on any underlying regression model, yet they are not studied as frequently with this technique. Data points are said to be influential if they exert an effective influence on the estimates, the goodness-of-fit statistics, or the fitted values [10]. The detection of influential observations in regression model building, and their consideration and evaluation, is of much interest. Chatterjee and Hadi [11] assessed numerous statistics available in the literature for the identification of influential observations. Today, various influence measures exist and are classified into different groups, and residual analysis holds a critical position among them. The analysis of different residuals is extremely important for detecting influential points in the data. In the field of the generalized linear model (GLM), several types of residuals are available, such as Anscombe residuals, deviance residuals, Pearson residuals, likelihood residuals, and working residuals [12].

Numerous residuals have been suggested in the literature for the identification of influential cases in statistical models. The only available work on outlier identification in the PRM, given by [13–18], is in the context of biased estimators. Moreover, the literature shows that no investigation has been carried out on influence evaluation based on the different GLM residuals for the PRM. In the present article, we perform simulation experiments to compare commonly cited residual procedures for Poisson regression data.

2. Methodology

2.1. The Poisson Regression Model (PRM)

The PRM is applicable when the response variable comes in the form of count data. Let $y_i$ follow a Poisson distribution with parameter $\mu_i$. The following probability mass function is used to describe this relation:

$$f(y_i;\mu_i)=\frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!},\qquad y_i=0,1,2,\ldots.$$

The mean and variance of the Poisson distribution are both equal to the parameter $\mu_i$:

$$E(y_i)=\operatorname{Var}(y_i)=\mu_i.$$

The Poisson density in exponential-family form is

$$f(y_i;\mu_i)=\exp\{y_i\ln\mu_i-\mu_i-\ln y_i!\}.$$

Maximum likelihood is used to obtain the PR estimator. Taking the log-likelihood, we obtain

$$\ell(\beta)=\sum_{i=1}^{n}\left[y_i x_i^{\prime}\beta-\exp\left(x_i^{\prime}\beta\right)-\ln y_i!\right],$$

with

$$\mu_i=\exp\left(x_i^{\prime}\beta\right).$$

We find the maximizing value of $\beta$ by setting the first derivative of the log-likelihood equal to zero:

$$S(\beta)=\frac{\partial\ell(\beta)}{\partial\beta}=\sum_{i=1}^{n}\left(y_i-\exp\left(x_i^{\prime}\beta\right)\right)x_i=0,$$

where $S(\beta)$ is the score function.

Since this system of equations is nonlinear in $\beta$, the solution of the above expression is not trivial, so numerical methods are recommended. One such method is the iteratively reweighted least squares (IRLS) algorithm for estimating the regression coefficients:

$$\hat{\beta}^{(t+1)}=\left(X^{\prime}\hat{W}^{(t)}X\right)^{-1}X^{\prime}\hat{W}^{(t)}\hat{z}^{(t)},$$

where $\hat{W}=\operatorname{diag}(\hat{\mu}_i)$ and $\hat{z}_i=\ln\hat{\mu}_i+(y_i-\hat{\mu}_i)/\hat{\mu}_i$ is the working response. The weights account for the dependence between the mean and the variance; they allow a greater spread between observed and fitted values, which handles large observed values in the regression. With the help of iterative methods, we can find $\hat{\beta}$ and $\hat{\mu}$. The final step gives the values that maximize the likelihood function. The iterations are stopped when the difference between the old and the updated values becomes sufficiently small; Hilbe [19] suggested stopping when this difference falls below a prescribed tolerance.
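As a rough, self-contained illustration of this scheme (not the study's own code), the following R sketch runs the IRLS loop for a Poisson log-link model and compares the result with glm(), which performs the same iteration internally; the toy data, zero starting values, and the 1e-6 tolerance are our own assumptions.

# Minimal IRLS sketch for a Poisson log-link model; glm() does the same iteration internally.
irls_poisson <- function(X, y, tol = 1e-6, max_iter = 50) {
  beta <- rep(0, ncol(X))                            # start from zero coefficients
  for (iter in seq_len(max_iter)) {
    eta <- drop(X %*% beta)
    mu  <- exp(eta)                                  # fitted means
    W   <- diag(mu)                                  # working weights: Var(y_i) = mu_i
    z   <- eta + (y - mu) / mu                       # working response
    beta_new <- drop(solve(t(X) %*% W %*% X, t(X) %*% W %*% z))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

set.seed(1)
X <- cbind(1, runif(40))                             # toy design: intercept plus one covariate
y <- rpois(40, exp(0.5 + 0.8 * X[, 2]))
cbind(irls = irls_poisson(X, y),
      glm  = coef(glm(y ~ X[, 2], family = poisson(link = "log"))))

The two columns of the final output agree, which is the point of the comparison: the hand-rolled loop and glm() solve the same weighted least squares problem at each step.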

2.2. Influence Diagnostics

Exposing influential points is a crucial task. The basic idea of influence analysis is to assess the changes caused in a specified part of the analysis when the data are perturbed. A particularly appealing perturbation scheme is case deletion. Generally, the influence of an observation can be examined through the product of two major factors: the first is a function of the chosen residual and the second is a function of the point's location in the covariate space.

Residuals play the most significant part in fitting a model: they represent the unexplained portion that remains after the estimated model is subtracted from the observations. Consequently, residuals can be used to evaluate influential observations, since identifying influential observations and treating them appropriately is one of the most critical tasks in modeling. A failure to identify influential observations can bias the inferences drawn. Accordingly, residuals play a vital role in checking a statistical model, and a variety of residuals have been articulated in the literature. In the field of the GLM, several types of residuals are available, for instance, Anscombe residuals, deviance residuals, Pearson residuals, likelihood residuals, and working residuals [20], and a number of books address this issue, including [21–23]. In the following sections, we list the formulas for the residuals used in our work.

2.3. Cook’s Distance (D)

Cook's distance is used in regression analysis to find influential observations among a set of predictor variables. The D statistic was originally defined for measuring influential observations in linear regression [24]. In the GLM, the diagonal elements of the hat matrix can be interpreted as leverages, just as in linear models. To measure actual rather than potential influence, we calculate Cook's distance by comparing $\hat{\beta}$ with the leave-one-out estimate $\hat{\beta}_{(i)}$:

$$D_i=\frac{\left(\hat{\beta}-\hat{\beta}_{(i)}\right)^{\prime}\left(X^{\prime}\hat{W}X\right)\left(\hat{\beta}-\hat{\beta}_{(i)}\right)}{p}.$$

After simplification, we may write it as

$$D_i=\frac{h_{ii}}{p\left(1-h_{ii}\right)}\,r_{SP_i}^{2},$$

where $h_{ii}$ is the $i$th leverage and $r_{SP_i}$ is the standardized Pearson residual defined below.

The observation with the largest value of D is considered influential. Cook [25] proposed a cut point for identifying influential observations when using the GLM.
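As a minimal sketch (not the study's own code), the following R lines compute Cook's distance for a toy Poisson fit both with the built-in cooks.distance() and from the leverage form above, using the standardized Pearson residuals defined in the next subsection; the simulated data are an illustrative assumption.

# Toy Poisson fit (illustrative data only)
set.seed(1)
x <- runif(40)
y <- rpois(40, exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson(link = "log"))

h   <- hatvalues(fit)                      # leverages h_ii of the weighted hat matrix
rsp <- rstandard(fit, type = "pearson")    # standardized Pearson residuals (Section 2.3.1)
p   <- length(coef(fit))
D_manual  <- (h / (p * (1 - h))) * rsp^2   # simplified form of Cook's distance
D_builtin <- cooks.distance(fit)
head(cbind(D_manual, D_builtin))           # the two versions coincide (Poisson dispersion = 1)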

2.3.1. Pearson Residual

Pearson residuals are a rescaled version of the raw or response residuals. In the GLM, the Pearson residual is defined as

$$r_{P_i}=\frac{y_i-\hat{\mu}_i}{\sqrt{\operatorname{var}(\hat{\mu}_i)}}.$$

For the Poisson,

$$r_{P_i}=\frac{y_i-\hat{\mu}_i}{\sqrt{\hat{\mu}_i}},$$

where $y_i$ is the observed value, $\hat{\mu}_i$ is the fitted mean, and $\operatorname{var}(\hat{\mu}_i)=\hat{\mu}_i$ is the variance of $y_i$. The Pearson residual for the Poisson distribution is simply the signed square root of the $i$th component of the Pearson goodness-of-fit statistic. These residuals are used to measure the quality of a fitted model. The standardized version used for model checking is given as

$$r_{SP_i}=\frac{y_i-\hat{\mu}_i}{\sqrt{\hat{\mu}_i\left(1-h_{ii}\right)}},$$

where $h_{ii}$ is the $i$th diagonal element (leverage) of the weighted hat matrix

$$H=W^{1/2}X\left(X^{\prime}WX\right)^{-1}X^{\prime}W^{1/2}.$$

Large values of these residuals indicate model failure, and for the detection of outliers, the observation numbers are plotted against these residuals [26].
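A brief R sketch of these quantities on a toy Poisson fit follows; the simulated data are an illustrative assumption, and the built-in calls are used only to cross-check the hand-computed residuals.

# Raw and standardized Pearson residuals for a toy Poisson fit
set.seed(1)
x <- runif(40)
y <- rpois(40, exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson(link = "log"))

mu  <- fitted(fit)
r_p <- (y - mu) / sqrt(mu)                                          # Pearson residual, Var(y_i) = mu_i
all.equal(unname(r_p), unname(residuals(fit, type = "pearson")))    # TRUE

h    <- hatvalues(fit)                                              # leverages of the weighted hat matrix
r_sp <- (y - mu) / sqrt(mu * (1 - h))                               # standardized Pearson residual
all.equal(unname(r_sp), unname(rstandard(fit, type = "pearson")))   # TRUE
plot(r_sp, type = "h", xlab = "Observation", ylab = "Standardized Pearson residual")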

2.3.2. Deviance Residual

Deviance plays a vital role in generalized linear modeling; for the GLM, the deviance is used as a measure of discrepancy. In the literature, the deviance residual is defined as

$$r_{D_i}=\operatorname{sign}\left(y_i-\hat{\mu}_i\right)\sqrt{d_i},$$

where $d_i$ is the $i$th contribution to the deviance function $D=\sum_{i=1}^{n}d_i$, and for the Poisson it is given as

$$d_i=2\left[y_i\ln\left(\frac{y_i}{\hat{\mu}_i}\right)-\left(y_i-\hat{\mu}_i\right)\right].$$

The standardized version of the deviance residual is given as

$$r_{SD_i}=\frac{r_{D_i}}{\sqrt{1-h_{ii}}}.$$
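A minimal R sketch of the deviance and standardized deviance residuals on the same kind of toy Poisson fit is given below; the data are simulated for illustration only, and the convention $y_i\ln(y_i/\hat{\mu}_i)=0$ when $y_i=0$ is made explicit in the code.

# Deviance and standardized deviance residuals (toy Poisson fit)
set.seed(1)
x <- runif(40)
y <- rpois(40, exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson(link = "log"))

mu  <- fitted(fit)
d_i <- 2 * (ifelse(y > 0, y * log(y / mu), 0) - (y - mu))            # deviance contribution
r_d <- sign(y - mu) * sqrt(d_i)
all.equal(unname(r_d), unname(residuals(fit, type = "deviance")))    # TRUE

r_sd <- r_d / sqrt(1 - hatvalues(fit))                               # standardized deviance residual
all.equal(unname(r_sd), unname(rstandard(fit, type = "deviance")))   # TRUE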

2.3.3. Likelihood Residual

The likelihood residual gives the change in the deviance when an observation is omitted from the data. It is also called the studentized residual, the externally studentized residual, or the deleted studentized residual [27]. Studentized residuals are a type of standardized residual that can be used to identify outliers. They are a leverage-weighted combination of the standardized Pearson and standardized deviance residuals and can be approximated by the following formula:

$$r_{L_i}=\operatorname{sign}\left(y_i-\hat{\mu}_i\right)\sqrt{h_{ii}\,r_{SP_i}^{2}+\left(1-h_{ii}\right)r_{SD_i}^{2}}.$$
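A short R sketch of this weighted combination on a toy Poisson fit follows; the simulated data are an assumption, and R's rstudent() is shown only as a cross-check, since for the Poisson family it reduces to essentially the same quantity.

# Likelihood (studentized) residuals from the weighted combination above
set.seed(1)
x <- runif(40)
y <- rpois(40, exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson(link = "log"))

h    <- hatvalues(fit)
r_sp <- rstandard(fit, type = "pearson")
r_sd <- rstandard(fit, type = "deviance")
r_l  <- sign(y - fitted(fit)) * sqrt(h * r_sp^2 + (1 - h) * r_sd^2)
head(cbind(r_l, rstudent(fit)))        # closely matches rstudent(fit) for the Poisson family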

2.3.4. Working Residual

Working residuals are the residuals from the final iteration of the iteratively reweighted least squares algorithm and are useful for assessing model fit. They are the difference between the working response and the linear predictor at convergence. Hardin and Hilbe [28] defined the working residual as

$$r_{W_i}=\left(y_i-\hat{\mu}_i\right)\frac{\partial\eta_i}{\partial\mu_i}.$$

McCullagh and Nelder [29] gave the working residual in the Poisson regression model as

$$r_{W_i}=\frac{y_i-\hat{\mu}_i}{\hat{\mu}_i}.$$
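The following R lines illustrate this on a toy Poisson fit (illustrative data only): for the log link, $\partial\eta/\partial\mu=1/\mu$, so the working residual is $(y_i-\hat{\mu}_i)/\hat{\mu}_i$, which is what residuals(fit, type = "working") returns from the final IRLS step.

# Working residuals for a toy Poisson (log link) fit
set.seed(1)
x <- runif(40)
y <- rpois(40, exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson(link = "log"))

mu  <- fitted(fit)
r_w <- (y - mu) / mu                                                # (y - mu) * d(eta)/d(mu), log link
all.equal(unname(r_w), unname(residuals(fit, type = "working")))    # TRUE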

2.3.5. Anscombe Residual

Anscombe [30] defined residuals that aim to make the residuals as close to normally distributed as possible and that are used to test the normality of a fitted model. For the Poisson distribution, the Anscombe residual is

$$r_{A_i}=\frac{3\left(y_i^{2/3}-\hat{\mu}_i^{2/3}\right)}{2\,\hat{\mu}_i^{1/6}}.$$

Anscombe residuals and adjusted deviance residuals are nearly the same [29].

The standardized Anscombe residual is computed as

$$r_{SA_i}=\frac{r_{A_i}}{\sqrt{1-h_{ii}}}.$$
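Base R has no built-in helper for Anscombe residuals, so the sketch below computes them directly from the formula above on a toy Poisson fit (illustrative data only); the normal Q-Q plot at the end simply reflects their intended near-normality.

# Anscombe and standardized Anscombe residuals (toy Poisson fit)
set.seed(1)
x <- runif(40)
y <- rpois(40, exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson(link = "log"))

mu   <- fitted(fit)
r_a  <- 3 * (y^(2/3) - mu^(2/3)) / (2 * mu^(1/6))   # Anscombe residual for the Poisson
r_sa <- r_a / sqrt(1 - hatvalues(fit))              # standardized Anscombe residual
qqnorm(r_sa); qqline(r_sa)                          # should look approximately normal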

2.4. Residuals Evaluation of the Generalized Linear Model

The GLM residual assessment is generally based on the deviance and Pearson residuals. Using discrete models, Pierce and Schafer [31] employed Pearson and deviance residuals for the assessment of goodness of fit and found that deviance residuals perform better. Furthermore, Cameron and Trivedi [32] compared the performance of Anscombe, deviance, and Pearson residuals for Poisson regression and found that Anscombe and deviance residuals performed almost equally well in comparison with Pearson residuals.

2.5. Graphical Detection of Influential Observations

Within the GLM framework, we can still produce the standard residual plots, in which deviance residuals (or working or Pearson residuals) are plotted against the jth covariate values $x_{ij}$. The patterns may vary from model to model and their interpretation is less obvious; for this reason, we sometimes consider other specific residual plots and try to identify any systematic departure from the fitted model. In the GLM, the leverages play an important role as a diagnostic tool for selecting influential observations, and the index plot of leverages is used for this purpose. In our study, we employed index plots for the identification of influential observations.
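As a small illustration of such index plots (not the study's own figures), the R lines below plot the leverages and Cook's distances of a toy Poisson fit against the observation index; the simulated data are an assumption.

# Index plots of leverages and Cook's distance for a toy Poisson fit
set.seed(1)
x <- runif(40)
y <- rpois(40, exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson(link = "log"))

op <- par(mfrow = c(1, 2))
plot(hatvalues(fit), type = "h", xlab = "Index", ylab = "Leverage")
plot(cooks.distance(fit), type = "h", xlab = "Index", ylab = "Cook's distance")
par(op)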

3. Numerical Evaluation

3.1. Simulation Study

The purpose of this section is to compare the performance of the PRM's adjusted and standardized residuals for influence diagnostics. We performed a Monte Carlo study for the Poisson model and generated the response variable for the PRM as

$$y_i\sim\text{Poisson}(\mu_i),\qquad y_i=0,1,2,\ldots,\ \mu_i>0,$$

where $\mu_i=\exp\left(x_i^{\prime}\beta\right)$ is the mean function and $x_{ij}\sim N(0,1)$, $N(10,3)$, $G(0.5,3)$, $P(0.5)$, and $P(3)$ for $i=1,2,\ldots,n$ and $j=1,2,\ldots,P$, where $P=2$ and 4. All the generated x's are kept fixed throughout the simulation study. For the true parameters, we set the following arbitrary values: for $P=2$, $\beta_0=0.05$ and $\beta_1=0.0025$; for $P=4$, $\beta_0=0.05$, $\beta_1=0.0025$, $\beta_2=0.005$, $\beta_3=0.0001$, and $\beta_4=0.0001$. We generated datasets of sizes n = 25, 50, 100, and 200. We then introduced influential values into the x's by replacing the 15th value with $x_{15j}=x_{15j}+\alpha_0$ for $j=1,2,\ldots,P$, where $\alpha_0=j+100$. The simulation was run 1000 times using the R software to assess the percentage of times each of the PRM residuals detects the influential observation. The detected percentages are given in Table 1. The performance of the influence diagnostics with the PRM residuals is studied from two asymptotic viewpoints: one with increasing sample size and the other with a varying number of regressors.
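A minimal R sketch of a single replicate of this design is given below. The seed, the choice of n = 50 with two standard normal covariates, the slope values, and the 4/n flagging rule are our own illustrative assumptions rather than the exact settings used to produce Table 1; D and DFFITS are built here from the standardized Pearson residuals, with the other residual types substituting in the same way.

# One illustrative replicate of the contamination design described above
set.seed(2023)
n <- 50; p <- 2
X <- matrix(rnorm(n * p), n, p)                       # covariates ~ N(0, 1); other designs analogous
beta <- c(0.05, 0.0025, 0.005)                        # intercept and slopes (illustrative values)
mu <- drop(exp(cbind(1, X) %*% beta))                 # mean function mu_i = exp(x_i' beta)
y  <- rpois(n, mu)
X[15, ] <- X[15, ] + (1:p) + 100                      # perturb the 15th case: alpha_0 = j + 100
fit <- glm(y ~ X, family = poisson(link = "log"))

h    <- hatvalues(fit)
rsp  <- rstandard(fit, type = "pearson")
D    <- (rsp^2 / length(coef(fit))) * (h / (1 - h))   # Cook's D built from standardized Pearson residuals
dfts <- rsp * sqrt(h / (1 - h))                       # DFFITS built from the same residuals
which(D > 4 / n)                                      # flagged cases; the 4/n cut-off is our assumption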

From the results, we found no difference in the performance of the chi and dev results when changing the distribution of the covariates, for all selected sample sizes. Moreover, their performance improves as the number of independent variables increases, but their detection performance decreases as the sample size increases. All of the Cook's distance (D) residuals performed equally well in identifying the influential observation, except the likelihood residual. With increasing sample size and number of regressors, the performance of the likelihood residual tends to improve more quickly than that of all other Cook's distance residuals. Furthermore, in identifying the influential observation, the performance of all PRM residuals is better when the x's follow P(0.5) than when they follow N(0, 1) or N(10, 3).

When studying DFFITS with the PRM standardized residuals, the performance of DFFITSsrw and DFFITSsrp was equal and better than that of DFFITSsrd, DFFITSl, and DFFITSsra for all sample sizes and distributions. For small sample sizes, DFFITSsrd is better than DFFITSl, but for large sample sizes the ordering is reversed. In the case of a large sample with Poisson (3) covariates for P = 2 in Table 1, the performance of DFFITSl equals that of DFFITSsrp and DFFITSsrw, but this does not hold for P = 4 in Table 2. DFFITSsra gave poor results for the identification of the influential observation for all covariate distributions and numbers of independent variables. With P = 2, DFFITSsrp for Poisson (0.5) covariates with a small sample size performed very well in flagging the observation as influential among all others, and overall DFFITSsrp and DFFITSsrw gave the best results. Likewise, with P = 4 in Table 2, its performance improves to almost 100 percent as the sample size increases. These results are comparable and valid because we use the same cut-off point for the computation of D and DFFITS with each type of PRM residual.

3.2. Applications
3.2.1. Example 1: Mine Fracture Injury Data

The first dataset is the mine fracture injury data, which consist of n = 44 observations with p = 4 independent variables: x1 = inner burden thickness in feet, x2 = percent extraction of the lower, previously mined seam, x3 = lower seam height, and x4 = time that the mine has been open. From Table 3, it is noticed that, for the mine fracture injury data, the 4th, 29th, and 30th observations are highlighted as influential by the proposed detection procedure based on the standardized residuals. Only the 4th point is identified using the Pearson and working residuals; the 29th and 30th points are found to be influential by both the D statistic, using the index plot, and the DFFITS measure. These outcomes are presented in index plots. Studying Table 3, we observe that the most influential point is the 30th, which exerts a high influence on the estimates. According to the goodness-of-fit results, the 29th observation proves to be the most influential, as a large positive change is observed in the coefficient of determination (R2EFRON), the smallest AIC occurs after its removal, and the Pearson chi-square is reduced the most after its removal. The third influential observation is the 4th, as only a very small change occurs in the estimates and R2EFRON after its removal. All of these results are also verified with the help of the index plots in Figure 1.

3.2.2. Example 2: Aircraft Data

The second dataset consists of n = 30 observations with P = 3 independent variables: x1 = an indicator variable, x2 = bomb load, and x3 = crew experience. From Table 4, for the aircraft data, the 16th, 21st, 25th, and 30th observations are recorded as influential and are highlighted by the suggested detection procedure based on the standardized residuals. The 16th and 25th observations are identified as influential by all residuals, the 30th point is not identified by the working residual, and the 21st is not identified by the Pearson and working residuals. From Table 5, we notice that the 25th observation is the most influential on the basis of the estimates and the AIC. The second most influential observation is the 16th: after its removal, a clear change occurs in the estimates, the AIC is lowered more than for any other observation except the 25th, and the Pearson chi-square and R2EFRON change the most overall. The third most influential observation is the 30th and the fourth is the 21st. Furthermore, in Figure 2, we include index plots to highlight the identified influential observations with the D and DFFITS measures.

4. Conclusion

In regression analysis, the inferences, predicted values, and estimates are highly affected by influential observations. Before modeling, it is important to check the distribution of the dependent variable; the PRM is used when the dependent variable is a count. In our study, we used different forms of PRM residuals in Cook's method, along with a graphical method, for the detection of influential observations, and a comparative analysis was then carried out with the help of a simulation study and real datasets. We noticed that DFFITS with Pearson residuals and DFFITS with working residuals gave the best results overall, whereas DFFITS with Anscombe residuals gave poor results for the identification of influential observations for all covariate distributions and numbers of coefficients. After analyzing the results, we recommend the Pearson and working residuals as more suitable than the other types of PRM residuals for influence diagnostics in the PRM.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors are thankful to Taif University and the Taif University Researchers Supporting Project, no. TURSP-2020/160, Taif University, Taif, Saudi Arabia.