Abstract

The detection of influential cases is a common practice for obtaining reliable regression results in the generalized linear model. For the identification of influential cases, the present study empirically compares the performance of various existing residuals for the Poisson regression model. Furthermore, we compute Cook's distance for the stated residuals. To show the effectiveness of the proposed methodology, data are generated by simulation, and the applicability of the methodology is further demonstrated with real data that follow the Poisson regression model. A comparative analysis of the residuals is carried out for the detection of influential cases.

1. Introduction

The foremost assumption for the application of regression analysis is that the observations are iid, i.e., independently and identically distributed. However, in real-life applications, the iid assumption is often too strong. For instance, the mean occurrence rate of an event may vary from one case to another and may depend on some of the variables. Conventional regression techniques are not appropriate for a response that records the occurrence of rare events. Thus, it is suggested to modify the classical linear regression model so that it becomes suitable for examining Poisson-distributed variables. Poisson regression is a member of the generalized linear model family and uses the logarithmic transformation. Therefore, to study occurrence rates, count data regression is more appropriate than least squares regression. The Poisson regression model (PRM) is an approach used to give a better understanding of count data. There are numerous applications of Poisson regression in different fields of study when the variable under consideration takes the form of counts or nonnegative integers; see [19].

To assess the adequacy of a fitted model, the analysis of residuals and the identification of outliers play the most important role, since the characteristics of the data have a strong influence on any underlying regression model, yet they are not studied as frequently with this technique. Data points are said to be influential if they exert an effective influence on the estimates, the goodness-of-fit statistics, or the fitted values [10]. The detection of influential observations in regression model building, and their consideration and evaluation, is of much interest. Chatterjee and Hadi [11] assessed numerous statistics available in the literature for the identification of influential observations. Today, various influence measures exist and are classified into different groups, and residual analysis holds a critical position among them. The analysis of different residuals is extremely important for detecting influential points in the data. In the field of the generalized linear model (GLM), several types of residuals are available, such as Anscombe residuals, deviance residuals, Pearson residuals, likelihood residuals, and working residuals [12].

Numerous residuals have been suggested in the literature for the identification of influential cases in statistical models. The only available work on outlier identification in the PRM, given by [13–18], is in the context of biased estimators. Moreover, the literature shows that no investigation has been carried out on influence evaluation based on the different GLM residuals for the PRM. In the present article, we perform simulation experiments to compare commonly cited residual procedures for Poisson regression data.

2. Methodology

2.1. The Poisson Regression Model (PRM)

The PRM is applicable when the response variable comes in the form of count data. Let $y_i$ follow a Poisson distribution with parameter $\mu_i$. The following probability mass function is used to describe this relation:

$$f(y_i;\mu_i)=\frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!},\qquad y_i=0,1,2,\ldots.$$

The mean and variance of the Poisson distribution are both equal to the parameter $\mu_i$:

$$E(y_i)=\operatorname{Var}(y_i)=\mu_i.$$

The Poisson density in exponential-family form is

$$f(y_i;\mu_i)=\exp\{y_i\ln\mu_i-\mu_i-\ln y_i!\}.$$

Maximum likelihood is used to obtain the PR estimator. Taking the log-likelihood, we obtain

$$\ell(\beta)=\sum_{i=1}^{n}\left[y_i x_i^{\prime}\beta-\exp\left(x_i^{\prime}\beta\right)-\ln y_i!\right],$$

with

$$\mu_i=\exp\left(x_i^{\prime}\beta\right).$$

We find the maximizing value of $\beta$ by setting the first derivative of the log-likelihood equal to zero:

$$S(\beta)=\frac{\partial\ell(\beta)}{\partial\beta}=\sum_{i=1}^{n}\left(y_i-\exp\left(x_i^{\prime}\beta\right)\right)x_i=0,$$

where $S(\beta)$ is the score function.

Since this system of equations is nonlinear in $\beta$, the solution of the above expression is not trivial, so numerical methods are recommended. One such method is the iteratively reweighted least squares (IRLS) algorithm for estimating the regression coefficients:

$$\hat{\beta}^{(t+1)}=\left(X^{\prime}\hat{W}^{(t)}X\right)^{-1}X^{\prime}\hat{W}^{(t)}\hat{z}^{(t)},$$

where $\hat{W}=\operatorname{diag}(\hat{\mu}_i)$ and $\hat{z}_i=\ln\hat{\mu}_i+(y_i-\hat{\mu}_i)/\hat{\mu}_i$ is the working response. The weights account for the dependence between the mean and the variance; they allow a greater spread between observed and fitted values, which handles large observed values in the regression. With the help of iterative methods, we can find $\hat{\beta}$ and $\hat{\mu}$. The final step gives the values that maximize the likelihood function. The iterations are stopped when the difference between the old and the updated values becomes sufficiently small; Hilbe [19] suggested stopping when this difference falls below a prescribed tolerance.
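As a rough, self-contained illustration of this scheme (not the study's own code), the following R sketch runs the IRLS loop for a Poisson log-link model and compares the result with glm(), which performs the same iteration internally; the toy data, zero starting values, and the 1e-6 tolerance are our own assumptions.

# Minimal IRLS sketch for a Poisson log-link model; glm() does the same iteration internally.
irls_poisson <- function(X, y, tol = 1e-6, max_iter = 50) {
  beta <- rep(0, ncol(X))                            # start from zero coefficients
  for (iter in seq_len(max_iter)) {
    eta <- drop(X %*% beta)
    mu  <- exp(eta)                                  # fitted means
    W   <- diag(mu)                                  # working weights: Var(y_i) = mu_i
    z   <- eta + (y - mu) / mu                       # working response
    beta_new <- drop(solve(t(X) %*% W %*% X, t(X) %*% W %*% z))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

set.seed(1)
X <- cbind(1, runif(40))                             # toy design: intercept plus one covariate
y <- rpois(40, exp(0.5 + 0.8 * X[, 2]))
cbind(irls = irls_poisson(X, y),
      glm  = coef(glm(y ~ X[, 2], family = poisson(link = "log"))))

The two columns of the final output agree, which is the point of the comparison: the hand-rolled loop and glm() solve the same weighted least squares problem at each step.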

2.2. Influence Diagnostics

Exposing influential points is a crucial task. The basic idea of influence analysis is to assess the changes caused in a specified part of the analysis when the data are perturbed. A particularly appealing perturbation scheme is case deletion. Generally, the influence of an observation can be examined through the product of two major factors: the first is a function of the chosen residual and the second is a function of the point's location in the covariate space.

Residuals play the most significant part in fitting a model: they represent the unexplained portion that remains after the estimated model is subtracted from the observations. Consequently, residuals can be used to evaluate influential observations, since identifying influential observations and treating them appropriately is one of the most critical tasks in modeling. A failure to identify influential observations can bias the inferences drawn. Accordingly, residuals play a vital role in checking a statistical model, and a variety of residuals have been articulated in the literature. In the field of the GLM, several types of residuals are available, for instance, Anscombe residuals, deviance residuals, Pearson residuals, likelihood residuals, and working residuals [20], and a number of books address this issue, including [21–23]. In the following sections, we list the formulas for the residuals used in our work.

2.3. Cook’s Distance (D)

Cook's distance is used in regression analysis to find influential observations among a set of predictor variables. The D statistic was originally defined for measuring influential observations in linear regression [24]. In the GLM, the diagonal elements of the hat matrix can be interpreted as leverages, just as in linear models. To measure actual rather than potential influence, we calculate Cook's distance by comparing $\hat{\beta}$ with the leave-one-out estimate $\hat{\beta}_{(i)}$:

$$D_i=\frac{\left(\hat{\beta}-\hat{\beta}_{(i)}\right)^{\prime}\left(X^{\prime}\hat{W}X\right)\left(\hat{\beta}-\hat{\beta}_{(i)}\right)}{p}.$$

After simplification, we may write it as

$$D_i=\frac{h_{ii}}{p\left(1-h_{ii}\right)}\,r_{SP_i}^{2},$$

where $h_{ii}$ is the $i$th leverage and $r_{SP_i}$ is the standardized Pearson residual defined below.

The observation with the largest value of D is considered influential. Cook [25] proposed a cut point for identifying influential observations when using the GLM.
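As a minimal sketch (not the study's own code), the following R lines compute Cook's distance for a toy Poisson fit both with the built-in cooks.distance() and from the leverage form above, using the standardized Pearson residuals defined in the next subsection; the simulated data are an illustrative assumption.

# Toy Poisson fit (illustrative data only)
set.seed(1)
x <- runif(40)
y <- rpois(40, exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson(link = "log"))

h   <- hatvalues(fit)                      # leverages h_ii of the weighted hat matrix
rsp <- rstandard(fit, type = "pearson")    # standardized Pearson residuals (Section 2.3.1)
p   <- length(coef(fit))
D_manual  <- (h / (p * (1 - h))) * rsp^2   # simplified form of Cook's distance
D_builtin <- cooks.distance(fit)
head(cbind(D_manual, D_builtin))           # the two versions coincide (Poisson dispersion = 1)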

2.3.1. Pearson Residual

Pearson residuals are a rescaled version of the raw or response residuals. In the GLM, the Pearson residual is defined as

$$r_{P_i}=\frac{y_i-\hat{\mu}_i}{\sqrt{\operatorname{var}(\hat{\mu}_i)}}.$$

For the Poisson,

$$r_{P_i}=\frac{y_i-\hat{\mu}_i}{\sqrt{\hat{\mu}_i}},$$

where $y_i$ is the observed value, $\hat{\mu}_i$ is the fitted mean, and $\operatorname{var}(\hat{\mu}_i)=\hat{\mu}_i$ is the variance of $y_i$. The Pearson residual for the Poisson distribution is simply the signed square root of the $i$th component of the Pearson goodness-of-fit statistic. These residuals are used to measure the quality of a fitted model. The standardized version used for model checking is given as

$$r_{SP_i}=\frac{y_i-\hat{\mu}_i}{\sqrt{\hat{\mu}_i\left(1-h_{ii}\right)}},$$

where $h_{ii}$ is the $i$th diagonal element (leverage) of the weighted hat matrix

$$H=W^{1/2}X\left(X^{\prime}WX\right)^{-1}X^{\prime}W^{1/2}.$$

Large values of these residuals indicate model failure, and for the detection of outliers, the observation numbers are plotted against these residuals [26].
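A brief R sketch of these quantities on a toy Poisson fit follows; the simulated data are an illustrative assumption, and the built-in calls are used only to cross-check the hand-computed residuals.

# Raw and standardized Pearson residuals for a toy Poisson fit
set.seed(1)
x <- runif(40)
y <- rpois(40, exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson(link = "log"))

mu  <- fitted(fit)
r_p <- (y - mu) / sqrt(mu)                                          # Pearson residual, Var(y_i) = mu_i
all.equal(unname(r_p), unname(residuals(fit, type = "pearson")))    # TRUE

h    <- hatvalues(fit)                                              # leverages of the weighted hat matrix
r_sp <- (y - mu) / sqrt(mu * (1 - h))                               # standardized Pearson residual
all.equal(unname(r_sp), unname(rstandard(fit, type = "pearson")))   # TRUE
plot(r_sp, type = "h", xlab = "Observation", ylab = "Standardized Pearson residual")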

2.3.2. Deviance Residual

Deviance plays a vital role in generalized linear modeling; for the GLM, the deviance is used as a measure of discrepancy. In the literature, the deviance residual is defined as

$$r_{D_i}=\operatorname{sign}\left(y_i-\hat{\mu}_i\right)\sqrt{d_i},$$

where $d_i$ is the $i$th contribution to the deviance function $D=\sum_{i=1}^{n}d_i$, and for the Poisson it is given as

$$d_i=2\left[y_i\ln\left(\frac{y_i}{\hat{\mu}_i}\right)-\left(y_i-\hat{\mu}_i\right)\right].$$

The standardized version of the deviance residual is given as

$$r_{SD_i}=\frac{r_{D_i}}{\sqrt{1-h_{ii}}}.$$
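A minimal R sketch of the deviance and standardized deviance residuals on the same kind of toy Poisson fit is given below; the data are simulated for illustration only, and the convention $y_i\ln(y_i/\hat{\mu}_i)=0$ when $y_i=0$ is made explicit in the code.

# Deviance and standardized deviance residuals (toy Poisson fit)
set.seed(1)
x <- runif(40)
y <- rpois(40, exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson(link = "log"))

mu  <- fitted(fit)
d_i <- 2 * (ifelse(y > 0, y * log(y / mu), 0) - (y - mu))            # deviance contribution
r_d <- sign(y - mu) * sqrt(d_i)
all.equal(unname(r_d), unname(residuals(fit, type = "deviance")))    # TRUE

r_sd <- r_d / sqrt(1 - hatvalues(fit))                               # standardized deviance residual
all.equal(unname(r_sd), unname(rstandard(fit, type = "deviance")))   # TRUE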

2.3.3. Likelihood Residual

The likelihood residual gives the change in the deviance when an observation is omitted from the data. It is also called the studentized residual, the externally studentized residual, or the deleted studentized residual [27]. Studentized residuals are a type of standardized residual that can be used to identify outliers. They are a leverage-weighted combination of the standardized Pearson and standardized deviance residuals and can be approximated by the following formula:

$$r_{L_i}=\operatorname{sign}\left(y_i-\hat{\mu}_i\right)\sqrt{h_{ii}\,r_{SP_i}^{2}+\left(1-h_{ii}\right)r_{SD_i}^{2}}.$$
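A short R sketch of this weighted combination on a toy Poisson fit follows; the simulated data are an assumption, and R's rstudent() is shown only as a cross-check, since for the Poisson family it reduces to essentially the same quantity.

# Likelihood (studentized) residuals from the weighted combination above
set.seed(1)
x <- runif(40)
y <- rpois(40, exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson(link = "log"))

h    <- hatvalues(fit)
r_sp <- rstandard(fit, type = "pearson")
r_sd <- rstandard(fit, type = "deviance")
r_l  <- sign(y - fitted(fit)) * sqrt(h * r_sp^2 + (1 - h) * r_sd^2)
head(cbind(r_l, rstudent(fit)))        # closely matches rstudent(fit) for the Poisson family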

2.3.4. Working Residual

Working residuals are the residuals from the final iteration of the iteratively reweighted least squares algorithm and are useful for assessing model fit. They are the difference between the working response and the linear predictor at convergence. Hardin and Hilbe [28] defined the working residual as

$$r_{W_i}=\left(y_i-\hat{\mu}_i\right)\frac{\partial\eta_i}{\partial\mu_i}.$$

McCullagh and Nelder [29] gave the working residual in the Poisson regression model as

$$r_{W_i}=\frac{y_i-\hat{\mu}_i}{\hat{\mu}_i}.$$
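The following R lines illustrate this on a toy Poisson fit (illustrative data only): for the log link, $\partial\eta/\partial\mu=1/\mu$, so the working residual is $(y_i-\hat{\mu}_i)/\hat{\mu}_i$, which is what residuals(fit, type = "working") returns from the final IRLS step.

# Working residuals for a toy Poisson (log link) fit
set.seed(1)
x <- runif(40)
y <- rpois(40, exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson(link = "log"))

mu  <- fitted(fit)
r_w <- (y - mu) / mu                                                # (y - mu) * d(eta)/d(mu), log link
all.equal(unname(r_w), unname(residuals(fit, type = "working")))    # TRUE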

2.3.5. Anscombe Residual

Anscombe [30] defined residuals that aim to make the residuals as close to normally distributed as possible and that are used to test the normality of a fitted model. For the Poisson distribution, the Anscombe residual is

$$r_{A_i}=\frac{3\left(y_i^{2/3}-\hat{\mu}_i^{2/3}\right)}{2\,\hat{\mu}_i^{1/6}}.$$

Anscombe residuals and adjusted deviance residuals are nearly the same [29].

The standardized Anscombe residual is computed as

$$r_{SA_i}=\frac{r_{A_i}}{\sqrt{1-h_{ii}}}.$$
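Base R has no built-in helper for Anscombe residuals, so the sketch below computes them directly from the formula above on a toy Poisson fit (illustrative data only); the normal Q-Q plot at the end simply reflects their intended near-normality.

# Anscombe and standardized Anscombe residuals (toy Poisson fit)
set.seed(1)
x <- runif(40)
y <- rpois(40, exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson(link = "log"))

mu   <- fitted(fit)
r_a  <- 3 * (y^(2/3) - mu^(2/3)) / (2 * mu^(1/6))   # Anscombe residual for the Poisson
r_sa <- r_a / sqrt(1 - hatvalues(fit))              # standardized Anscombe residual
qqnorm(r_sa); qqline(r_sa)                          # should look approximately normal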

2.4. Residuals Evaluation of the Generalized Linear Model

The GLM residual assessment is generally based on the deviance and Pearson residuals. Using discrete models, Pierce and Schafer [31] employed Pearson and deviance residuals for the assessment of goodness of fit and found that deviance residuals perform better. Furthermore, Cameron and Trivedi [32] compared the performance of Anscombe, deviance, and Pearson residuals for Poisson regression and found that Anscombe and deviance residuals performed almost equally well in comparison with Pearson residuals.

2.5. Graphical Detection of Influential Observations

Within the GLM framework, we can still produce the standard residual plots, in which deviance residuals (or working or Pearson residuals) are plotted against the jth covariate values $x_{ij}$. The patterns may vary from model to model and their interpretation is less obvious; for this reason, we sometimes consider other specific residual plots and try to identify any systematic departure from the fitted model. In the GLM, the leverages play an important role as a diagnostic tool for selecting influential observations, and the index plot of leverages is used for this purpose. In our study, we employed index plots for the identification of influential observations.
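As a small illustration of such index plots (not the study's own figures), the R lines below plot the leverages and Cook's distances of a toy Poisson fit against the observation index; the simulated data are an assumption.

# Index plots of leverages and Cook's distance for a toy Poisson fit
set.seed(1)
x <- runif(40)
y <- rpois(40, exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson(link = "log"))

op <- par(mfrow = c(1, 2))
plot(hatvalues(fit), type = "h", xlab = "Index", ylab = "Leverage")
plot(cooks.distance(fit), type = "h", xlab = "Index", ylab = "Cook's distance")
par(op)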

3. Numerical Evaluation

3.1. Simulation Study

The purpose of this section is to compare the performance of the PRM's adjusted and standardized residuals for influence diagnostics. We performed a Monte Carlo study for the Poisson model and generated the response variable for the PRM as

$$y_i\sim\text{Poisson}(\mu_i),\qquad y_i=0,1,2,\ldots,\ \mu_i>0,$$

where $\mu_i=\exp\left(x_i^{\prime}\beta\right)$ is the mean function and $x_{ij}\sim N(0,1)$, $N(10,3)$, $G(0.5,3)$, $P(0.5)$, and $P(3)$ for $i=1,2,\ldots,n$ and $j=1,2,\ldots,P$, where $P=2$ and 4. All the generated x's are kept fixed throughout the simulation study. For the true parameters, we set the following arbitrary values: for $P=2$, $\beta_0=0.05$ and $\beta_1=0.0025$; for $P=4$, $\beta_0=0.05$, $\beta_1=0.0025$, $\beta_2=0.005$, $\beta_3=0.0001$, and $\beta_4=0.0001$. We generated datasets of sizes n = 25, 50, 100, and 200. We then introduced influential values into the x's by replacing the 15th value with $x_{15j}=x_{15j}+\alpha_0$ for $j=1,2,\ldots,P$, where $\alpha_0=j+100$. The simulation was run 1000 times using the R software to assess the percentage of times each of the PRM residuals detects the influential observation. The detected percentages are given in Table 1. The performance of the influence diagnostics with the PRM residuals is studied from two asymptotic viewpoints: one with increasing sample size and the other with a varying number of regressors.
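A minimal R sketch of a single replicate of this design is given below. The seed, the choice of n = 50 with two standard normal covariates, the slope values, and the 4/n flagging rule are our own illustrative assumptions rather than the exact settings used to produce Table 1; D and DFFITS are built here from the standardized Pearson residuals, with the other residual types substituting in the same way.

# One illustrative replicate of the contamination design described above
set.seed(2023)
n <- 50; p <- 2
X <- matrix(rnorm(n * p), n, p)                       # covariates ~ N(0, 1); other designs analogous
beta <- c(0.05, 0.0025, 0.005)                        # intercept and slopes (illustrative values)
mu <- drop(exp(cbind(1, X) %*% beta))                 # mean function mu_i = exp(x_i' beta)
y  <- rpois(n, mu)
X[15, ] <- X[15, ] + (1:p) + 100                      # perturb the 15th case: alpha_0 = j + 100
fit <- glm(y ~ X, family = poisson(link = "log"))

h    <- hatvalues(fit)
rsp  <- rstandard(fit, type = "pearson")
D    <- (rsp^2 / length(coef(fit))) * (h / (1 - h))   # Cook's D built from standardized Pearson residuals
dfts <- rsp * sqrt(h / (1 - h))                       # DFFITS built from the same residuals
which(D > 4 / n)                                      # flagged cases; the 4/n cut-off is our assumption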

From the results, we found no difference in the performance of the chi and dev results when changing the distribution of the covariates, for all selected sample sizes. Moreover, their performance improves as the number of independent variables increases, but their detection performance decreases as the sample size increases. All of the Cook's distance (D) residuals performed equally well in identifying the influential observation, except the likelihood residual. With increasing sample size and number of regressors, the performance of the likelihood residual tends to improve more quickly than that of all other Cook's distance residuals. Furthermore, in identifying the influential observation, the performance of all PRM residuals is better when the x's follow P(0.5) than when they follow N(0, 1) or N(10, 3).

When studying DFFITS with the PRM standardized residuals, the performance of DFFITSsrw and DFFITSsrp was equal and better than that of DFFITSsrd, DFFITSl, and DFFITSsra for all sample sizes and distributions. For small sample sizes, DFFITSsrd is better than DFFITSl, but for large sample sizes the ordering is reversed. In the case of a large sample with Poisson (3) covariates for P = 2 in Table 1, the performance of DFFITSl equals that of DFFITSsrp and DFFITSsrw, but this does not hold for P = 4 in Table 2. DFFITSsra gave poor results for the identification of the influential observation for all covariate distributions and numbers of independent variables. With P = 2, DFFITSsrp for Poisson (0.5) covariates with a small sample size performed very well in flagging the observation as influential among all others, and overall DFFITSsrp and DFFITSsrw gave the best results. Likewise, with P = 4 in Table 2, its performance improves to almost 100 percent as the sample size increases. These results are comparable and valid because we use the same cut-off point for the computation of D and DFFITS with each type of PRM residual.

3.2. Applications
3.2.1. Example 1: Mine Fracture Injury Data

The first dataset is the mine fracture injury data, which consist of n = 44 observations with p = 4 independent variables: x1 = inner burden thickness in feet, x2 = percent extraction of the lower, previously mined seam, x3 = lower seam height, and x4 = time that the mine has been open. From Table 3, it is noticed that, for the mine fracture injury data, the 4th, 29th, and 30th observations are highlighted as influential by the proposed detection procedure based on the standardized residuals. Only the 4th point is identified using the Pearson and working residuals; the 29th and 30th points are found to be influential by both the D statistic, using the index plot, and the DFFITS measure. These outcomes are presented in index plots. Studying Table 3, we observe that the most influential point is the 30th, which exerts a high influence on the estimates. According to the goodness-of-fit results, the 29th observation proves to be the most influential, as a large positive change is observed in the coefficient of determination (R2EFRON), the smallest AIC occurs after its removal, and the Pearson chi-square is reduced the most after its removal. The third influential observation is the 4th, as only a very small change occurs in the estimates and R2EFRON after its removal. All of these results are also verified with the help of the index plots in Figure 1.

3.2.2. Example 2: Aircraft Data

The second dataset consists of n = 30 observations with P = 3 independent variables: x1 = an indicator variable, x2 = bomb load, and x3 = crew experience. From Table 4, for the aircraft data, the 16th, 21st, 25th, and 30th observations are recorded as influential and are highlighted by the suggested detection procedure based on the standardized residuals. The 16th and 25th observations are identified as influential by all residuals, the 30th point is not identified by the working residual, and the 21st is not identified by the Pearson and working residuals. From Table 5, we notice that the 25th observation is the most influential on the basis of the estimates and the AIC. The second most influential observation is the 16th: after its removal, a clear change occurs in the estimates, the AIC is lowered more than for any other observation except the 25th, and the Pearson chi-square and R2EFRON change the most overall. The third most influential observation is the 30th and the fourth is the 21st. Furthermore, in Figure 2, we include index plots to highlight the identified influential observations with the D and DFFITS measures.

4. Conclusion

In regression analysis, the inferences, predicted values, and estimates are highly affected by influential observations. Before modeling, it is important to check the distribution of the dependent variable; the PRM is used when the dependent variable is a count. In our study, we used different forms of PRM residuals in Cook's method, along with a graphical method, for the detection of influential observations, and a comparative analysis was then carried out with the help of a simulation study and real datasets. We noticed that DFFITS with Pearson residuals and DFFITS with working residuals gave the best results overall, whereas DFFITS with Anscombe residuals gave poor results for the identification of influential observations for all covariate distributions and numbers of coefficients. After analyzing the results, we recommend the Pearson and working residuals as more suitable than the other types of PRM residuals for influence diagnostics in the PRM.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors are thankful to Taif University and the Taif University Researchers Supporting Project, no. TURSP-2020/160, Taif University, Taif, Saudi Arabia.