Abstract

The main purpose of this article is to develop a Bayesian adaptive lasso procedure for analyzing linear regression models with nonignorable missing responses, in which the missingness mechanism is specified by a logistic regression model. A sampling procedure combining the Gibbs sampler and the Metropolis-Hastings algorithm is employed to obtain the Bayesian estimates of the regression coefficients, shrinkage coefficients, missingness mechanism model parameters, and their standard errors. We extend the partial posterior predictive p-value, a goodness-of-fit statistic, to investigate the plausibility of the posited model. Finally, several simulation studies and an air pollution data example are undertaken to demonstrate the newly developed methodologies.

1. Introduction

As variable selection plays an increasingly important role in modern data analysis, the least absolute shrinkage and selection operator (lasso) is appealing owing to its sparse representation; see, for example, Tibshirani [1], Efron et al. [2], Zou [3], Yuan and Lin [4], and Meinshausen and Bühlmann [5]. In the Bayesian framework, Park and Casella [6] first proposed the Bayesian lasso, which carries out model inference via posterior distributions. Afterwards, Guo et al. [7] studied variable selection using the Bayesian lasso in semiparametric structural equation models. More recently, Hans [8] further addressed model uncertainty and variable selection in the Bayesian lasso. In particular, the Bayesian adaptive lasso proposed by Leng et al. [9] permits variable selection with flexible, coefficient-specific penalties.

On the other hand, missing data are frequently encountered in various settings, including clinical trials, surveys, and longitudinal studies. Covariates and/or responses may be missing, and statistical analysis of missing data is often based on the missing data mechanism [10], such as missing not at random, also referred to as nonignorable missingness. Variable selection procedures in the presence of missing data have received much attention in the recent literature. For instance, Ibrahim et al. [11] employed the EM algorithm to propose model selection criteria for missing data problems. Garcia et al. [12] then investigated the variable selection problem for regression models with missing covariate and/or response data. More recently, in the framework of semiparametric varying-coefficient partially linear models with responses missing at random, the variable selection procedure presented by Zhao and Xue [13] can simultaneously select significant variables in the nonparametric and parametric parts, among others. However, to the best of our knowledge, missing data have not yet been studied in the framework of the Bayesian adaptive lasso of Leng et al. [9]. Therefore, our motivation is to construct a Bayesian method that simultaneously selects variables and estimates parameters in linear regression models with nonignorable missing responses under the Bayesian adaptive lasso framework. In our work, the missing data mechanism is specified by a logistic regression model, and an MCMC procedure is proposed to obtain the Bayesian estimates of the regression coefficients, shrinkage coefficients, missingness mechanism model parameters, and their standard errors. The contributions of our work mainly include (1) tackling the computationally intractable integral problem in the posterior analysis, especially since nonignorable missing data raise new challenges for variable selection in the considered models, and (2) providing a sampling-based approach that yields reliable statistical inferences without relying on asymptotic theory.

The structure of the article is arranged as follows. We present a hierarchical Bayesian lasso model in the framework of linear regression models with nonignorable missing responses in Section 2. In Section 3, a Bayesian procedure combining the Gibbs sampler [14] and the Metropolis-Hastings (M-H) algorithm [15, 16] is developed to carry out simultaneous variable selection and parameter estimation; the partial posterior predictive p-value, a goodness-of-fit statistic, is also presented there to investigate the plausibility of the posited model. Several simulation studies and the air pollution data example are employed to demonstrate the proposed Bayesian methodologies in Section 4. The article ends with a discussion in Section 5.

2. Model and Notations

We consider the linear regression model y = μ1_n + Xβ + ε, in which y is an n × 1 vector of responses, X is an n × p matrix of covariates, β = (β_1, ..., β_p)^T is a p × 1 regression parameter, and ε is an n × 1 vector of normal errors with mean 0_n and covariance σ²I_n. Here, 1_n (or 0_n) is an n × 1 vector whose elements are all equal to 1 (or 0), I_n represents the n × n identity matrix, and the superscript T denotes the transposition of a vector (or matrix). Without loss of generality, we consider only the case in which μ is 0 and may be deleted from the above model, since y and X can be centered. Similar to Leng et al. [9], the following adaptive lasso estimator is used for simultaneous parameter estimation and variable selection:
\[
\hat{\beta} = \arg\min_{\beta}\ (y - X\beta)^{T}(y - X\beta) + \sum_{j=1}^{p} \lambda_j |\beta_j|.
\]

Here, flexible penalty parameters λ_1, ..., λ_p are used for the different regression parameters. The following hierarchical Bayesian lasso model is considered in this article:
\[
y \mid X, \beta, \sigma^{2} \sim N_{n}(X\beta, \sigma^{2} I_{n}),
\]
with the following priors on β and τ² = (τ_1², ..., τ_p²)^T:
\[
\beta \mid \sigma^{2}, \tau_{1}^{2}, \ldots, \tau_{p}^{2} \sim N_{p}\big(0_{p}, \sigma^{2} D_{\tau}\big), \qquad D_{\tau} = \mathrm{diag}(\tau_{1}^{2}, \ldots, \tau_{p}^{2}),
\]
\[
\tau_{j}^{2} \mid \lambda_{j} \sim \frac{\lambda_{j}^{2}}{2}\exp\!\left(-\frac{\lambda_{j}^{2}\tau_{j}^{2}}{2}\right), \qquad j = 1, \ldots, p, \ \text{independently}.
\]

Following Park and Casella [6], the improper prior π(σ²) ∝ 1/σ² is employed to model the error variance. Here, N_k(μ, Σ) denotes the k-dimensional normal distribution with mean vector μ and covariance matrix Σ.
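As a small illustration, assuming the priors take the standard Bayesian adaptive lasso forms written above (τ_j² exponential given λ_j, and β conditionally normal given σ² and τ²), a single prior draw could be generated as follows; the numerical values of λ and σ² in the example call are arbitrary.

import numpy as np

rng = np.random.default_rng(0)

def draw_from_prior(lam, sigma2):
    # tau_j^2 | lam_j ~ Exponential(rate = lam_j^2 / 2), so the scale is 2 / lam_j^2
    lam = np.asarray(lam, dtype=float)
    tau2 = rng.exponential(scale=2.0 / lam ** 2)
    # beta | sigma2, tau^2 ~ N_p(0, sigma2 * D_tau) with D_tau = diag(tau_1^2, ..., tau_p^2)
    beta = rng.normal(0.0, np.sqrt(sigma2 * tau2))
    return tau2, beta

# example: p = 3 coefficients with different (arbitrary) penalty parameters
tau2, beta = draw_from_prior(lam=[0.5, 1.0, 2.0], sigma2=1.0)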

In this article, we focus on the case in which the responses y_i are incompletely observed and may be subject to nonignorable missingness, while the covariates x_i (the ith row of X) are completely observed, for i = 1, ..., n. For i = 1, ..., n, we define a missing indicator r_i such that r_i = 0 if y_i is observed and r_i = 1 if y_i is missing. Therefore, y can be expressed as (Y_obs, Y_mis), where Y_obs and Y_mis denote the set of the observed data and the missing values associated with the responses, respectively. Let p(r | y, φ) be the conditional probability distribution of r = (r_1, ..., r_n)^T given y and the unknown parameter φ, and then
\[
p(r \mid y, \varphi) = \prod_{i=1}^{n} \big[\operatorname{pr}(r_i = 1 \mid y_i, x_i, \varphi)\big]^{r_i}\big[1 - \operatorname{pr}(r_i = 1 \mid y_i, x_i, \varphi)\big]^{1 - r_i}.
\]

Here, the missingness probability pr(r_i = 1 | y_i, x_i, φ) is specified by a logistic regression model whose linear predictor depends on the possibly unobserved response y_i, with φ denoting the vector of unknown missingness mechanism parameters; this dependence on y_i is what makes the missingness nonignorable.
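A common explicit specification in this literature, written here only as an illustrative assumption (the particular terms entering the linear predictor may differ), is
\[
\operatorname{pr}(r_i = 1 \mid y_i, x_i, \varphi)
 = \frac{\exp(\varphi_0 + \varphi_1 y_i + x_i^{T}\varphi_2)}
        {1 + \exp(\varphi_0 + \varphi_1 y_i + x_i^{T}\varphi_2)},
\qquad i = 1, \ldots, n,
\]
with φ = (φ_0, φ_1, φ_2^T)^T.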

Let θ denote the vector that contains all unknown parameters in models (1)–(3), including the regression coefficients β, the error variance σ², and the shrinkage parameters τ². The main objective of this article is to develop a Bayesian procedure to analyze the above model on the basis of the missing data indicators r and the observed data Y_obs. To make Bayesian statistical inferences on the above model, random observations need to be generated from the posterior distributions of θ and φ. For the conditional distribution of θ and φ given the missing data indicators r and the observed data Y_obs, it follows from models (1)–(5) that
\[
p(\theta, \varphi \mid Y_{\mathrm{obs}}, r) \propto \int p(Y_{\mathrm{obs}}, Y_{\mathrm{mis}} \mid \theta)\, p(r \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \varphi)\, \pi(\theta, \varphi)\, dY_{\mathrm{mis}}, \tag{6}
\]
where π(θ, φ) represents the joint prior probability density for θ and φ. Obtaining a closed form of integral (6) is rather difficult owing to the missingness mechanism model involved.

3. Bayesian Analysis of the Models

To develop an effective Bayesian approach for analyzing the above model while reducing the computational burden, we augment the missing data indicators r and the observed data Y_obs with the missing values Y_mis to produce a full data set in the posterior analysis, on the basis of the data augmentation idea [17]. Clearly, the complete-data posterior density p(θ, φ | Y_obs, Y_mis, r) is easier to handle than the observed-data posterior density p(θ, φ | Y_obs, r).

However, generating observations from the joint posterior distribution p(θ, φ, Y_mis | Y_obs, r) is still rather difficult owing to the complexity of the considered model.

The Gibbs sampler [14], one of the most useful tools in statistical calculation, is employed to simulate a sequence of samples from the joint posterior distribution and then make Bayesian statistical inferences on the above given model.

The key to this algorithm is that random samples are iteratively simulated from the following three conditional distributions: p(θ | Y_obs, Y_mis, r, φ), p(φ | Y_obs, Y_mis, r, θ), and p(Y_mis | Y_obs, r, θ, φ). Note that p(θ | Y_obs, Y_mis, r, φ) = p(θ | Y_obs, Y_mis) and p(φ | Y_obs, Y_mis, r, θ) = p(φ | Y_obs, Y_mis, r) in our considered model.

3.1. Gibbs Sampling and Conditional Distributions

Based on the above discussion, the Gibbs sampler [14] is implemented as follows:

Step 1. Specify initial values of the parameters (β^(0), σ²^(0), τ²^(0), λ^(0), φ^(0)) and of the missing responses Y_mis^(0), and set j = 0.

Step 2. Given the current values of the remaining quantities, generate (β^(j+1), σ²^(j+1), τ²^(j+1), λ^(j+1)) as follows:
(a) Generate β^(j+1) from its full conditional distribution (7), a multivariate normal distribution whose mean vector and covariance matrix are determined by the current values of σ², τ², and the complete data.
(b) Generate σ²^(j+1) from its full conditional distribution (8), where Gamma(a, b) represents the gamma distribution with shape parameter a and scale parameter b.
(c) For k = 1, ..., p, generate τ_k²^(j+1) from its full conditional distribution (9), which involves the inverse Gaussian distribution with the indicated parameters.
(d) Sample λ^(j+1): there are two methods to update the penalty parameters λ, the empirical Bayesian method and the hierarchical model method; see Leng et al. [9] for more details.

Step 3. Given the current values of the remaining quantities, generate φ^(j+1) from the conditional distribution (10), which is proportional to the product of the missingness mechanism model likelihood and π(φ), where π(φ) is the prior distribution of the parameter φ.

Step 4. Given β^(j+1), σ²^(j+1), φ^(j+1), and the observed data, generate Y_mis^(j+1) from the conditional distribution (11).

Step 5. Repeat Steps 2–4 until convergence.
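As a concrete illustration of Step 2, the following minimal sketch implements the standard Bayesian adaptive lasso full conditionals of Park and Casella [6] and Leng et al. [9]; the exact parameterizations, the numerical safeguard, and the Gamma(a, b) hyperprior used in the hierarchical update of λ_j² are assumptions for illustration rather than the precise forms of (7)–(9).

import numpy as np

rng = np.random.default_rng(0)

def update_beta(y, X, sigma2, tau2):
    # beta | rest ~ N_p(A^{-1} X'y, sigma2 * A^{-1}) with A = X'X + D_tau^{-1}
    A = X.T @ X + np.diag(1.0 / tau2)
    A_inv = np.linalg.inv(A)
    return rng.multivariate_normal(A_inv @ X.T @ y, sigma2 * A_inv)

def update_sigma2(y, X, beta, tau2):
    # sigma2 | rest ~ Inverse-Gamma(shape, scale); drawn by sampling the precision and inverting
    n, p = X.shape
    resid = y - X @ beta
    shape = (n - 1) / 2.0 + p / 2.0
    scale = resid @ resid / 2.0 + beta @ (beta / tau2) / 2.0
    return 1.0 / rng.gamma(shape, 1.0 / scale)

def update_tau2(beta, sigma2, lam2):
    # 1/tau_j^2 | rest ~ Inverse-Gaussian(sqrt(lam_j^2 * sigma2 / beta_j^2), lam_j^2)
    mu = np.sqrt(lam2 * sigma2 / np.maximum(beta ** 2, 1e-12))
    return 1.0 / rng.wald(mu, lam2)

def update_lam2(tau2, a=1.0, b=0.1):
    # hierarchical-model update with an assumed Gamma(a, b) hyperprior:
    # lam_j^2 | tau_j^2 ~ Gamma(a + 1, rate = b + tau_j^2 / 2)
    return rng.gamma(a + 1.0, 1.0 / (b + tau2 / 2.0))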

Following Geman and Geman [14], the sequence of simulated observations can be treated as a sample from the joint posterior distribution. In practice, the estimated potential scale reduction (EPSR) values [18] can be employed to monitor the convergence of the algorithm.
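For reference, a generic computation of the EPSR value for a scalar parameter from several parallel chains is sketched below; this follows the usual Gelman-Rubin formula and is only an illustration of how convergence could be monitored, not necessarily the exact diagnostic used in [18].

import numpy as np

def epsr(chains):
    # chains: array of shape (m, T) -- m parallel chains, T retained draws of one scalar parameter
    m, T = chains.shape
    B = T * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    var_hat = (T - 1) / T * W + B / T            # pooled estimate of the posterior variance
    return np.sqrt(var_hat / W)                  # values close to 1 suggest convergence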

It is easy to see that the conditional distributions in (7)–(9) are standard and familiar distributions, so generating samples from them is straightforward. However, the conditional distributions given in (10) and (11) are complicated and nonstandard, and directly generating samples from them is rather difficult. The well-known M-H algorithm [15, 16] is one of the most effective ways to generate samples from such conditional distributions with the help of proposal distributions. Similar to Zhao and Tang [19] and Xu et al. [20], normal proposal distributions centered at the current values are employed for sampling φ and Y_mis, respectively. Specifically, the M-H algorithm is implemented as follows: at the jth iteration with the current values φ^(j) and Y_mis^(j), new candidates are generated from the corresponding proposal distributions and accepted with the usual M-H acceptance probabilities. The proposal variances are chosen such that the average acceptance rates are between 0.25 and 0.45 [21].
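A minimal sketch of these random-walk M-H updates is given below; it assumes a logistic missingness model with linear predictor φ_0 + φ_1 y_i + x_i^T φ_2, a vague normal prior on φ, and independent normal random-walk proposals, so the prior, the predictor, and the step sizes are illustrative choices rather than the exact specifications used here.

import numpy as np

rng = np.random.default_rng(1)

def log_post_phi(phi, y, X, r, prior_sd=10.0):
    # log p(r | y, phi) under logit pr(r_i = 1) = phi_0 + phi_1*y_i + x_i'phi_2,
    # plus a vague N(0, prior_sd^2) prior on each component of phi (assumed choices)
    eta = phi[0] + phi[1] * y + X @ phi[2:]
    p = 1.0 / (1.0 + np.exp(-eta))
    loglik = np.sum(r * np.log(p + 1e-12) + (1 - r) * np.log(1.0 - p + 1e-12))
    return loglik - 0.5 * np.sum(phi ** 2) / prior_sd ** 2

def mh_update_phi(phi, y, X, r, step_sd=0.1):
    # random-walk M-H step; step_sd is tuned so the average acceptance rate is 0.25-0.45
    cand = phi + step_sd * rng.standard_normal(phi.shape)
    log_ratio = log_post_phi(cand, y, X, r) - log_post_phi(phi, y, X, r)
    return cand if np.log(rng.uniform()) < log_ratio else phi

def mh_update_ymis(y, X, r, beta, sigma2, phi, step_sd=0.5):
    # y holds the current completed responses (sampled values in the missing positions);
    # each missing y_i is updated from p(y_i | rest) ∝ N(y_i; x_i'beta, sigma2) * pr(r_i = 1 | y_i)
    y = y.copy()
    for i in np.flatnonzero(r == 1):
        mu_i = X[i] @ beta

        def logdens(v):
            eta = phi[0] + phi[1] * v + X[i] @ phi[2:]
            return -0.5 * (v - mu_i) ** 2 / sigma2 - np.log1p(np.exp(-eta))

        cand = y[i] + step_sd * rng.standard_normal()
        if np.log(rng.uniform()) < logdens(cand) - logdens(y[i]):
            y[i] = cand
    return y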

3.2. Bayesian Estimates and Goodness-of-Fit Statistic

The joint Bayesian estimates of θ and φ are given by the sample means of the simulated observations,
\[
\hat{\theta} = \frac{1}{J}\sum_{j=1}^{J}\theta^{(j)}, \qquad \hat{\varphi} = \frac{1}{J}\sum_{j=1}^{J}\varphi^{(j)},
\]
where {(θ^(j), φ^(j)) : j = 1, ..., J} are the samples obtained from the joint posterior distribution via the previous Gibbs sampler procedure. Following Geyer [22], these joint Bayesian estimates and the corresponding sample covariance matrices are consistent estimates of the posterior expectations and posterior covariance matrices. Therefore, the standard errors can be obtained from the square roots of the diagonal elements of the sample covariance matrices.
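For illustration, these estimates and standard errors can be computed directly from the retained draws; the sketch below assumes the draws for a parameter block are stored row-wise in a NumPy array.

import numpy as np

def posterior_summary(draws):
    # draws: (J, d) array of retained MCMC samples of a d-dimensional parameter block
    est = draws.mean(axis=0)                              # joint Bayesian estimates
    cov = np.atleast_2d(np.cov(draws, rowvar=False))      # sample covariance matrix
    se = np.sqrt(np.diag(cov))                            # standard errors
    return est, cov, se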

The partial posterior predictive (PPP) p-value proposed by Bayarri and Berger [23] is a goodness-of-fit statistic used to investigate the plausibility of the posited model in the Bayesian framework; see Lee and Tang [24] and Tang and Zhao [25] for more details. Similar to Lee and Tang [24] and Tang and Zhao [25], we can define the PPP p-value for our considered model as in (13), in which Y_obs is the observed value of the responses, D in (13) denotes a discrepancy variable, and D evaluated at Y_obs represents the observed value of the discrepancy. A discrepancy variable built from the conditional moments of the responses can be specified in our problem, where E(y_i | x_i, θ) and Var(y_i | x_i, θ) are the conditional expectation and conditional variance of y_i given x_i and θ. It is easily seen that E(y_i | x_i, θ) = x_i^T β and Var(y_i | x_i, θ) = σ² under the linear regression model.
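A natural chi-square-type choice consistent with these conditional moments, written here as an illustrative assumption in the spirit of Lee and Tang [24], is
\[
D(y, \theta) = \sum_{i=1}^{n}\frac{\{y_i - E(y_i \mid x_i, \theta)\}^{2}}{\operatorname{Var}(y_i \mid x_i, \theta)}
 = \sum_{i=1}^{n}\frac{(y_i - x_i^{T}\beta)^{2}}{\sigma^{2}}.
\]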

To reduce the computational burden of evaluating the PPP p-value in (13), a Monte Carlo approximation is employed. Let {θ^(j) : j = 1, ..., J} be the samples generated from the corresponding distribution via the previous Gibbs sampler procedure, and let y_rep^(j) be the replicated responses simulated from the predictive distribution given θ^(j). Then, the PPP p-value in (13) is approximated by
\[
\hat{p} = \frac{1}{J}\sum_{j=1}^{J} I\!\left\{ D\big(y_{\mathrm{rep}}^{(j)}, \theta^{(j)}\big) \ge D\big(Y_{\mathrm{obs}}, \theta^{(j)}\big) \right\},
\]
where I{·} is an indicator function. According to [24, 25], if the PPP p-value is far away from 0.5, then we reject the posited model.
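The sketch below illustrates this Monte Carlo approximation; it assumes the chi-square-type discrepancy written above and that the missing responses are replaced by their sampled values at each retained iteration so that the discrepancy can be evaluated on a completed response vector, which is an assumption rather than a description of the exact computation.

import numpy as np

rng = np.random.default_rng(3)

def discrepancy(y, X, beta, sigma2):
    # chi-square-type discrepancy: sum_i (y_i - x_i'beta)^2 / sigma2
    resid = y - X @ beta
    return np.sum(resid ** 2) / sigma2

def ppp_value(X, draws):
    # draws: iterable of (y_complete, beta, sigma2) triples taken from the Gibbs output,
    # where y_complete has the missing responses filled in by their sampled values
    count, total = 0, 0
    for y_complete, beta, sigma2 in draws:
        y_rep = X @ beta + np.sqrt(sigma2) * rng.standard_normal(len(y_complete))
        if discrepancy(y_rep, X, beta, sigma2) >= discrepancy(y_complete, X, beta, sigma2):
            count += 1
        total += 1
    return count / total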

4. Numerical Examples

Several simulation studies and the air pollution data example are employed to demonstrate our proposed methodologies in this section.

4.1. Example 1

The first simulation study is employed to illustrate the performance of the Bayesian estimates under the moderate, dense, and sparse cases, respectively. Specifically, we generate observations from the following model:
\[
y_i = x_i^{T}\beta + \varepsilon_i, \qquad i = 1, \ldots, n. \tag{19}
\]

Here, the true values of the regression parameter β are specified under three settings: Case i (moderate case), Case ii (dense case), and Case iii (sparse case).

For i = 1, ..., n and j = 1, ..., p, it is assumed that each covariate x_{ij} follows the standard normal distribution marginally with a specified pairwise correlation between x_{ij} and x_{ik}, the errors ε_i are independent and identically distributed normal variables, and the true value of the error variance σ² is set to 1 in all simulation studies. Missing data are generated from the logistic regression model (20) with prescribed true values of the missingness parameters φ. The missing data are generated as follows: (a) simulate the full data set from model (19); (b) determine whether each observation y_i is missing or not via the logistic model (20) evaluated at the true values of φ. More specifically, a random number u_i is drawn from the uniform distribution U(0, 1), and the observation y_i is treated as missing if u_i is smaller than the missingness probability given by (20).
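For concreteness, a data-generating sketch of the kind described above is given below; the coefficient vector, the correlation structure (here ρ^{|j−k|}), the error scale, and the missingness parameters are illustrative placeholders rather than the true values used in the simulation, and the missingness probability here depends only on y_i for simplicity.

import numpy as np

rng = np.random.default_rng(2024)

def simulate_data(n=100, beta=(3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
                  sigma=1.0, phi=(-1.0, 0.5), rho=0.5):
    # all numeric values here are illustrative placeholders, not the article's true values
    beta = np.asarray(beta)
    p = len(beta)
    # covariates: standard normal margins with corr(x_j, x_k) = rho^{|j-k|} (assumed structure)
    cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)
    # nonignorable missingness: logit pr(r_i = 1) = phi_0 + phi_1 * y_i (assumed predictor)
    prob_mis = 1.0 / (1.0 + np.exp(-(phi[0] + phi[1] * y)))
    r = (rng.uniform(size=n) < prob_mis).astype(int)   # r_i = 1 means y_i is missing
    y_obs = np.where(r == 0, y, np.nan)                # observed responses, NaN where missing
    return X, y, y_obs, r

X, y_full, y_obs, r = simulate_data()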

For each of the data sets generated under the various model assumptions and parameter settings above, the proposed methodology combining the Gibbs sampler with the M-H algorithm is employed to calculate the Bayesian estimates of the unknown parameters on the basis of 100 replications, and the average proportion of missing data is roughly 0.375. In each replication, we collected 10,000 observations after 10,000 burn-in iterations to produce the Bayesian results. In the M-H algorithm, the proposal variances are chosen to obtain average acceptance rates in the target range for sampling φ and Y_mis, respectively. The corresponding Bayesian estimation results are reported in Tables 1–3, in which "True" represents the true value of a parameter; "Mean," "SD," and "Median" denote the average, standard deviation, and median of the Bayesian estimates over the 100 data sets, respectively; the reported lower and upper quantiles are the corresponding quantiles of the Bayesian estimates over the 100 replications; and "RMSE" is the root mean square error between the Bayesian estimates over the 100 data sets and the true value. In addition, to summarize the variable selection results, Table 4 reports the proportion of the 100 replications in which each parameter was identified as zero, using the criterion that a parameter is identified as 0 if its confidence interval contains zero. Inspection of Tables 1–4 indicates that (1) the Bayesian estimates of the unknown parameters are quite close to their true values under the moderate, dense, and sparse cases, which illustrates that the proposed Bayesian approach is quite effective; (2) the differences between "SD" and "RMSE" are very small, which shows that the Bayesian estimates are rather reliable and robust in the considered cases; (3) the proposed methods identify the correct models in most cases, because the proportions in Table 4 corresponding to the important covariates are all zero, and the proportions corresponding to the unimportant covariates are high; and (4) it is interesting to observe that the hierarchical model approach tends to obtain better variable selection results than the empirical Bayesian approach, although the differences are minor, especially in the dense case (case ii). To sum up, these findings indicate that the proposed Bayesian procedures recover the true information in the models well.

4.2. Example 2

In order to investigate the sensitivity to the proportion of missing data, we conduct a second simulation study in this part. Simulated data sets are generated using the same setup as in the first simulation study, except that the true values of the missingness parameters φ are changed. We fit these simulated data sets under cases i–iii via the MCMC algorithm developed above and, to save space, only present the results for case i under 100 replications in Table 5; the average proportion of missing data generated in the second simulation study is about 0.208. Besides, we further calculate the sums of RMSE under case i for the two simulation studies: the results are 3.365 and 3.237 in this simulation study, compared with 3.681 and 3.522 in the first simulation study. Moreover, the median of the mean absolute deviations (MMAD) over the 100 replications is calculated, i.e., the median over replications of p^{-1} Σ_{j=1}^{p} |β̂_j − β_j|, in which β̂_j and β_j represent the Bayesian estimate and the true value of the jth regression coefficient.

The resulting MMAD values for the first (second) simulation study are 0.1437 (0.1293) and 0.1348 (0.1221) based on the empirical Bayesian method and the hierarchical model method, respectively. Meanwhile, the box plots of the mean absolute deviations (MAD) are shown in Figure 1, where "1" and "2" denote case i in the first and second simulation studies, respectively. All these results indicate that the cases with the smaller proportion of missing data in this simulation study generally perform better than the cases with the larger proportion of missing data in the first simulation study.

4.3. Example 3

To address the influence of the missingness mechanism model on the Bayesian adaptive lasso, we conduct a third simulation study in this example. The full data set is created by employing the same settings as those specified under case i in the first simulation study, but the missing data are generated via the following three types of missingness mechanisms.

Type i: a logistic regression model that differs from the one specified in (20), with its own prescribed parameter values.
Type ii: a missing at random (MAR) missingness mechanism.
Type iii: the nonignorable missingness mechanism model specified in (20).

For the missing data simulated via types i and ii, we obtain the Bayesian estimates under the incorrectly specified logistic regression missingness model. For the missing data simulated via type iii, we calculate the Bayesian estimates under an incorrect MAR missingness mechanism. Summaries of these results are reported in Tables 6–9. Inspecting these results, we observe the following: (1) the results produced using the assumed logistic missingness model are quite robust and accurate even if the true missing data are MAR or the true missingness mechanism is a much simpler model; (2) the performance obtained on the basis of the incorrect MAR assumption is poor. These observations agree with those obtained by Lee and Tang [24].

4.4. A Real Example

In order to further demonstrate the proposed Bayesian methodologies, we consider a real example in this subsection, the air pollution data, which are available at http://lib.stat.cmu.edu/datasets/NO2.data. The air pollution data include a set of randomly chosen responses and seven explanatory variables. The response variable y_i is the hourly value of the logarithm of the concentration of NO2, measured at Alnabru in Oslo, Norway, between October 2001 and August 2003; the explanatory variable x_{i1} denotes the logarithm of the number of cars per hour, x_{i2} represents the temperature two meters above ground, x_{i3} is the wind speed, x_{i4} denotes the temperature difference between 25 and 2 m above ground, x_{i5} represents the wind direction, and x_{i6} and x_{i7} are the hour of day and the day number counted from the beginning of October 2001, respectively. Each explanatory variable is standardized to have mean 0 and variance 1 during the exploratory data analysis. The air pollution data have been analyzed and fitted by Zhu and Zhu [26], who observed that all explanatory variables except two make a significant contribution to the response. Model (19) and the logistic regression model (22), with prescribed missingness parameter values, are used to reanalyze the air pollution data in this article, where the logistic regression model (22) reflects that we create artificial missing data by deliberately deleting some of the responses according to the given missingness mechanism; the proportion of missing data generated in this way is 0.326. Then, we employ the proposed Bayesian procedure to make statistical inferences under the above model settings. Here, 10,000 samples collected after 10,000 burn-in iterations are employed to calculate the Bayesian estimates of the unknown parameters. In the M-H algorithm, the proposal variances are chosen to give average acceptance rates in the target range for sampling φ and the missing responses, respectively. In addition, we conduct a real data analysis under the assumption of fully observed data in the framework of the Bayesian adaptive lasso.

Table 9 reports the corresponding results for the regression coefficients obtained by the two different methods, in which "EB" and "HM" denote the Bayesian estimates based on the empirical Bayesian method and the hierarchical model method, respectively. From Table 9, we can clearly see that (1) the estimation results for the regression coefficients under the two methods are quite close; (2) almost all standard errors obtained under the fully observed data are smaller than those obtained under the missing data, which indicates that the procedures under the fully observed data outperform those under the missing data, although the differences in performance are minor; and (3) all of these procedures identify the same nonzero significant regression coefficients, which is consistent with the results given in Zhu and Zhu [26]. Finally, the PPP p-value for goodness of fit with respect to model (19), with the nonignorable missingness mechanism defined in model (20), is 0.446 and 0.454 based on the empirical Bayesian method and the hierarchical model method, respectively, which illustrates that the considered models as well as the proposed goodness-of-fit statistic are feasible.

5. Conclusion and Discussion

In this article, we have developed a Bayesian adaptive lasso approach for analyzing linear regression models with nonignorable missing responses, where the nonignorable missingness mechanism is specified by a logistic regression model. With the help of the Gibbs sampler [14] and the Metropolis-Hastings algorithm [15, 16], we deal with the high-dimensional integral in the posterior analysis, which has no closed form and is often computationally intractable, and then obtain the Bayesian estimates of the unknown parameters, their corresponding standard error estimates, and the PPP p-value [23] for assessing goodness of fit. In short, the empirical performance of the proposed Bayesian approach in our simulation studies shows that it is feasible and effective, and applying the newly developed statistical tools to the Bayesian adaptive lasso with nonignorable missing responses is plausible.

Data Availability

The data used to support the findings of simulation studies are available from the corresponding author upon request. The air pollution data in the real example are from previously reported studies and datasets, which are available at the website http://lib.stat.cmu.edu/datasets/NO2.data.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (11761016 and 12161014), the Science and Technology Foundation of Guizhou Province of China ([2020]1Y009), the Project of High Level Creative Talents in Guizhou Province of China, the Natural Science Foundation of the Education Department of Guizhou Province in China ([2020]090), and the Discipline and Master's Site Construction Project of Guiyang University funded by Guiyang City Financial Support to Guiyang University (SX-2020).