Abstract

Analysis of environmental data with lower detection limits (LDL) using mixture models has recently gained importance. However, only a particular type of mixture model, estimated by classical methods, has been used in the literature. We propose a Bayesian analysis of such data using mixture models. In addition, an optimal mixture distribution to model such data has been explored. The sensitivity of the proposed estimators with respect to the LDL, model parameters, hyperparameters, mixing weights, loss functions, sample size, and Bayesian estimation methods has also been investigated, and the optimal number of components for the mixture has been explored. As a practical illustration, we analyzed two environmental datasets involving an LDL. We also compared the proposed estimators with existing estimators on the basis of different goodness-of-fit criteria. The results under the proposed estimators were more convincing than those obtained using existing estimators.

1. Introduction

Environmental studies often encounter exposure measurements falling below the LDL. These nondetectable observations are considered left-censored observations [1, 2]. Hughes [3] discussed that the difficulty in modeling environmental concentration datasets arises when some of the measurements are below the LDL. As the proportion of censored observations may not be trivial, failure to adjust for censoring in the analysis can produce seriously biased results with inflated variances. The most convenient method to adjust for censoring is to replace each censored observation by the detection limit; however, the statistical properties of such substitution methods are obscure. As an improvement, Paxton et al. [4] proposed an iterative imputation technique to settle the censoring issue, but this method did not account for the correlation structure of the data or provide parametric estimates.

Several authors have proposed standard statistical distributions to model left-censored datasets. For example, Mitra and Kundu [5] used the generalized exponential distribution, Bhaumik et al. [6] employed the normal distribution, Leith et al. [7] and Jin et al. [8] used the log-normal distribution, and Asgharzadeh et al. [9] considered the Weibull model to analyze left-censored data from different situations. However, the varying capabilities of these models for left-censored data have been reported by Vizard et al. [10]. Further, Moulton and Halsey [11] raised several concerns over using a standard statistical model to deal with such data. They argued that these datasets may be so highly skewed that most of the standard models become inappropriate for analysis. Moreover, there may exist an additional subpopulation of observations falling below the detection limit, making the data heterogeneous or multimodal. Following these arguments, Moulton and Halsey [11] suggested the use of a gamma mixture model (in their case) to model a left-censored HIV RNA dataset. Taylor et al. [12] also emphasized that in the case of a larger proportion of nondetectable observations, a suitable model for analysis should be explored; they proposed the log-normal mixture distribution to model datasets with LDL. Some other authors, including Moulton et al. [13] and Chen et al. [14], have suggested the use of mixture models for analyzing left-censored data, especially when the proportion of censoring and the skewness in the data are high. The use of mixture models also allows additional variability in the distributional shape as compared to standard models. A recent article by Feroze and Aslam [15] considered the analysis of left-censored data using a mixture of Weibull models.

Furthermore, most of the authors employing mixture models for analyzing left-censored datasets have used maximum likelihood estimation (MLE) to estimate the model parameters. However, as mixing induces a missing-data structure, MLE is often not suitable for estimating mixture models [16]. Following this argument, Hughes [3] proposed an EM algorithm, assuming a normal mixture model, to adjust for censoring in MLE. But the EM algorithm can also fail when the censoring rate is high [17]. These issues in the estimation of mixture models can be handled using a Bayesian approach. With informative priors, Bayesian estimation (BE) can produce significant gains in the efficiency of the estimates as compared to classical methods [18, 19]. The Bayesian estimators (BEs) can be biased but often have smaller mean squared errors (as compared to MLEs) and incorporate additional prior information [20]. These methods are also applicable when the parameter estimates are correlated [21]. However, Bayesian inference has not yet replaced the classical methods for modeling heterogeneous environmental concentration datasets with LDL [22].

The generalized exponential (GE) distribution is a very important distribution for modeling censored datasets. Recently, Gupta and Kundu [23] and Mitra and Kundu [5] identified that this model can be an appealing alternative to many standard models, such as the gamma, Weibull, log-normal, log-gamma, exponential, and Rayleigh distributions, for analyzing censored datasets. Mixtures of GE models for analyzing censored datasets have been introduced more recently. Teng and Zhang [24] proposed the estimation of the parameters of the two-component mixture of GE models (2CMGEM) via the EM algorithm. Ali et al. [25] discussed the application of 2CMGEM in modeling right-censored data. Mahmoud et al. [26] considered 2CMGEM for obtaining predictions from censored data. Wang et al. [27] discussed the applicability of 2CMGEM in the medical sciences. Similarly, Kazmi and Aslam [28] advocated the GE mixture model for fitting various right-censored datasets. More recent contributions regarding the analysis of 2CMGEM can be seen in the following works. Aslam and Feroze [29] considered the analysis of 2CMGEM under progressive censoring. Kluppelberg and Seifert [30] reported some important results on the conditional distributions of 2CMGEM. Yang et al. [31] discussed important inferences about 2CMGEM and investigated various of its properties. Hence, 2CMGEM is a very relevant model for analyzing censored datasets with mixture or heterogeneous behavior.

We have explored the advantages of Bayesian methods in modeling heterogeneous environmental concentration data with LDL. First, we have employed the Bayesian mixture model approach using simulated datasets. As discussed by Momoli et al. [22], such datasets represent the real data from environmental studies and provide convenience by avoiding issues with empirical comparisons. Further, to investigate the advantages of the proposed methodology in the real world, two real datasets of environmental concentrations with LDL have been analyzed. The 2CMGEM has been used as the candidate model to capture the mixture behavior of these datasets. The proposed model has been compared with other mixture models available in the literature for analyzing datasets with LDL, and the different estimation methods have been compared with one another. Several advantages of the proposed methodologies have been observed.

The remainder of the paper is organized as follows. Section 2 contains the estimation of the model parameters using different estimation methods. In Section 3, simulated datasets are considered for the comparison of the proposed estimators. In Section 4, two real examples are used to discuss the applicability and suitability of the proposed estimators. The findings of the study are summarized in Section 5.

2. Material and Methods

This section contains the analytical estimation of the model parameters under the expectation-maximization (EM) algorithm, Lindley’s approximation (LA), and the Markov chain Monte Carlo (MCMC) method. In the case of BE, the squared error loss function (SELF) and the entropy loss function (ELF), along with an informative prior, have been assumed for the estimation.

2.1. The Likelihood Function for 2CMGEM under Heterogeneous Data with LDL

The 2CMGEM has the following probability density function, where the parametric set comprises the component parameters and the mixing weight:

The cumulative distribution function (CDF) for the 2CMGEM is
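For reference, a minimal sketch of this density and CDF, written for shape-only (unit-scale) GE components with shape parameters α1 and α2 and mixing weight p (labels of our own choosing, since the paper's exact parameterization is not reproduced here), is:

```latex
f(x) = p\,\alpha_1 e^{-x}\bigl(1-e^{-x}\bigr)^{\alpha_1-1}
     + (1-p)\,\alpha_2 e^{-x}\bigl(1-e^{-x}\bigr)^{\alpha_2-1}, \qquad x>0,
\qquad
F(x) = p\bigl(1-e^{-x}\bigr)^{\alpha_1} + (1-p)\bigl(1-e^{-x}\bigr)^{\alpha_2}.
```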

Suppose we generate a sample of size n from the 2CMGEM, and suppose that n1 and n2 observations are generated from the first and second components of the mixture, respectively. Let the r largest observations be those about which we have complete information, while the remaining smallest n − r observations fall below the predetermined LDL. Since we have incomplete information regarding these smallest observations, they are treated as censored. Hence, r1 and r2 observations with complete information come from the first and second components of the mixture, respectively, so that n1 − r1 and n2 − r2 observations are censored from each component, where r = r1 + r2 and n = n1 + n2. The likelihood function for such a heterogeneous dataset with LDL is formed from the component densities of the completely observed values, the component CDFs evaluated at the LDL for the censored values, and the mixing weights. Putting in the corresponding entries, we have the following.

The likelihood function for the heterogeneous dataset with LDL using the 2CMGEM is given in (4).
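For concreteness, a minimal sketch of a left-censored mixture log-likelihood of this type is shown below. It assumes shape-only GE components, a single detection limit, and the unlabeled-mixture form in which each nondetect contributes the mixture CDF at the LDL; the function and variable names are ours, so this is a sketch under those assumptions rather than the exact component-wise factorization in (4).

```python
import numpy as np

def ge_logpdf(x, a):
    """Log-density of a shape-only generalized exponential: f(x) = a*exp(-x)*(1 - exp(-x))**(a - 1)."""
    return np.log(a) - x + (a - 1.0) * np.log1p(-np.exp(-x))

def ge_logcdf(x, a):
    """Log-CDF of the shape-only generalized exponential: F(x) = (1 - exp(-x))**a."""
    return a * np.log1p(-np.exp(-x))

def loglik_2cmgem_ldl(params, detected, n_censored, ldl):
    """Left-censored mixture log-likelihood for a two-component GE mixture.

    detected   : array of observations at or above the detection limit
    n_censored : number of observations falling below the detection limit
    ldl        : the lower detection limit
    """
    a1, a2, p = params
    if a1 <= 0 or a2 <= 0 or not (0.0 < p < 1.0):
        return -np.inf
    detected = np.asarray(detected, dtype=float)
    # Detected observations contribute the mixture density ...
    dens = p * np.exp(ge_logpdf(detected, a1)) + (1 - p) * np.exp(ge_logpdf(detected, a2))
    ll = np.sum(np.log(dens))
    # ... while each nondetect contributes the mixture CDF evaluated at the LDL.
    cens = p * np.exp(ge_logcdf(ldl, a1)) + (1 - p) * np.exp(ge_logcdf(ldl, a2))
    return ll + n_censored * np.log(cens)
```

Maximizing this function numerically (for example, by passing its negative to scipy.optimize.minimize) yields the MLEs that the EM algorithm of the next subsection targets.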

2.2. Expectation-Maximization (EM) Algorithm

In the case of missing observations, the EM algorithm can be used to obtain the MLEs of the model parameters. We have used an EM algorithm similar to that proposed by Ruhi et al. [32]. Let the parametric set consist of the component parameters and the mixing parameter of the 2CMGEM.

Let us denote the observed random sample from the 2CMGEM accordingly, and define the missing data as indicator vectors, where each indicator is a two-dimensional zero-one vector whose entry is one if the corresponding observation comes from that component of the mixture and zero otherwise. The complete dataset consists of the observed sample together with these indicators. The iterations of any EM algorithm contain two steps. The conditional expectation of the complete-data log-likelihood for the model parameters is computed in the first step, called the E-step. The E-step at a given iteration is given in (5), where the reliability (survival) function of each component of the mixture appears. As (5) is linear in the indicator variables, the E-step is implemented by taking the conditional expectation of the indicators given the observed data. Using Bayes’ theorem, the posterior membership probabilities are presented in (6).

Next, the M-step is carried out. In this step, (5) is maximized with respect to the parameters of the model. Since the expressions containing the component parameters and the mixing proportions are not related, they can be maximized independently. We introduce a Lagrange multiplier, with the condition that the mixing proportions sum to one, to obtain the expression for the mixing proportion. The derivative of (5) with respect to the mixing proportion, set equal to zero, gives (7).

Summing (7) over the components and using the constraint that the mixing proportions sum to one gives (8).

The MLEs of the component parameters are obtained using iterative procedures. Both the E-step and the M-step are iterated until the desired accuracy is achieved.

The steps to estimate the parameters of 2CMGEM using the EM algorithm are as follows (a minimal code sketch is given after the list):
(i) Step-1: determine initial guesses for the component parameters and the mixing proportion
(ii) Step-2: using Step-1, compute the conditional expectation for the current iteration from (6)
(iii) Step-3: for the current iteration, derive the MLE of the mixing proportion from (8) and the MLEs of the component parameters using numerical procedures
(iv) Step-4: repeat Step-2 and Step-3 until the desired convergence is reached
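A minimal sketch of the E- and M-steps, written for the fully observed (uncensored) case with shape-only GE components, is given below; it is intended only to convey the structure of the algorithm, since the censored case replaces the component density by the component CDF at the LDL for nondetects and generally requires numerical maximization in the M-step.

```python
import numpy as np

def ge_pdf(x, a):
    """Shape-only generalized exponential density: f(x) = a*exp(-x)*(1 - exp(-x))**(a - 1)."""
    return a * np.exp(-x) * (1.0 - np.exp(-x)) ** (a - 1.0)

def em_2cmgem(x, a1=1.0, a2=2.0, p=0.5, tol=1e-8, max_iter=1000):
    """EM iterations for a two-component shape-only GE mixture on fully observed data."""
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        # E-step: posterior membership probabilities (responsibilities) for component 1.
        d1 = p * ge_pdf(x, a1)
        d2 = (1.0 - p) * ge_pdf(x, a2)
        tau1 = d1 / (d1 + d2)
        tau2 = 1.0 - tau1
        # M-step: closed-form updates; note that log(1 - exp(-x)) < 0 for x > 0.
        s = np.log1p(-np.exp(-x))
        p_new = tau1.mean()
        a1_new = -tau1.sum() / np.sum(tau1 * s)
        a2_new = -tau2.sum() / np.sum(tau2 * s)
        converged = max(abs(a1_new - a1), abs(a2_new - a2), abs(p_new - p)) < tol
        a1, a2, p = a1_new, a2_new, p_new
        if converged:
            break
    return a1, a2, p
```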

2.3. Bayesian Estimation

A beta prior for the mixing parameter and independent gamma priors for the component parameters have been assumed, as given in (9)–(11), where the associated hyperparameters are assumed to be known.
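Written out explicitly, with hyperparameter labels of our own choosing (a and b for the beta prior on the mixing weight p, and c_j and d_j for the gamma priors on the component parameters), these priors take the usual forms:

```latex
\pi_1(p) \propto p^{\,a-1}(1-p)^{\,b-1}, \quad 0<p<1, \qquad
\pi_2(\alpha_1) \propto \alpha_1^{\,c_1-1} e^{-d_1\alpha_1}, \quad \alpha_1>0, \qquad
\pi_3(\alpha_2) \propto \alpha_2^{\,c_2-1} e^{-d_2\alpha_2}, \quad \alpha_2>0.
```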

Under the assumption of independence, the joint prior density for the parametric set follows from (9)–(11) and is given in (12). The general form of the posterior distribution for the model parameters, obtained by combining (4) and (12), is given in (13).

From (13), the posterior distribution for the parameters of 2CMGEM is

We have used the SELF and ELF for the derivation of the BEs. The BE under SELF is the posterior mean of the parameter, whereas the BE under ELF is the reciprocal of the posterior mean of the reciprocal of the parameter. Since closed-form expressions for the BEs under these loss functions are not available, approximate BE methods have been used to compute the numerical results.

2.3.1. Lindley’s Approximation (LA)

In this sub-subsection, the LA has been used to obtain approximate BEs for the parameters of the 2CMGEM. For sufficiently large sample sizes, Lindley [33] showed that a posterior expectation, written as a ratio of integrals involving a function of the parameters, the logarithm of the joint prior for the parametric set, and the log-likelihood function, can be evaluated by an expansion about the MLE whose weights are the elements of the inverse of the matrix of second-order derivatives of the log-likelihood.
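In its single-parameter form (shown here only as an illustration; the 2CMGEM case involves the corresponding third-order derivatives and the elements of the inverse information matrix), the approximation reads

```latex
E\!\left[u(\theta)\mid \mathbf{x}\right]
  = \frac{\int u(\theta)\, e^{\,\ell(\theta)+\rho(\theta)}\, d\theta}
         {\int e^{\,\ell(\theta)+\rho(\theta)}\, d\theta}
  \approx u(\hat\theta)
  + \tfrac{1}{2}\left[u''(\hat\theta) + 2\,u'(\hat\theta)\,\rho'(\hat\theta)\right]\hat\sigma^{2}
  + \tfrac{1}{2}\, u'(\hat\theta)\, \ell'''(\hat\theta)\, \hat\sigma^{4},
\qquad \hat\sigma^{2} = \left[-\ell''(\hat\theta)\right]^{-1},
```

where ℓ is the log-likelihood, ρ is the logarithm of the prior, and the hat denotes evaluation at the MLE.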

Now, consider the log-likelihood function from (4).

Differentiating (18) with respect to the model parameters and equating the results to zero, we have

Here, the numerical solutions of (19)–(21) provide the MLEs of the model parameters. Since the higher-order derivatives from (18) are straightforward, they have not been reported. The inverse of the matrix based on the second-order derivatives is

Now, the BEs for parameters of 2CMGEM using SELF are

The BEs for the model parameters using LA under ELF are

2.3.2. MCMC Method

This sub-subsection discusses the BE for the component parameters and the mixing weight of the 2CMGEM by employing the MCMC technique. To implement the MCMC procedure, we reformatted the posterior distribution in (14) as (25), where the quantities appearing in (25) are the hyperparameters updated by the data.

From (25), the posterior densities can be written as

where the corresponding symbols represent the beta and gamma distributions, respectively.

The samples from the posterior density can be obtained using the following algorithm.

Step-1: draw the two component parameters and the mixing weight from their respective posterior densities given above.

Step-2: replicate Step-1 a large number of times to obtain a sequence of posterior draws for each parameter.

A number of initial samples has been discarded as burn-in to minimize the impact of the initial guess. The BEs for the model parameters are computed from the remaining samples.

The BEs under SELF using MCMC are

The BEs under ELF using MCMC are
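As an illustration, a minimal sketch of such a sampler is given below. It assumes the labeled-data formulation of Section 2.1 (the component membership of the r1, r2 detected observations and of the m1 = n1 − r1, m2 = n2 − r2 nondetects is known), a single detection limit, shape-only GE components, a beta(a, b) prior on the mixing weight, and gamma(c_j, d_j) priors on the shapes; the labels, hyperparameter defaults, and function names are ours, so this is a sketch under those assumptions rather than the paper's exact sampler.

```python
import numpy as np

def gibbs_2cmgem(x1, x2, m1, m2, ldl, a=2.0, b=2.0, c=(2.0, 2.0), d=(1.0, 1.0),
                 n_draws=11000, burn_in=1000, seed=1):
    """Posterior sampling and Bayes estimates for the labeled-data 2CMGEM with left censoring.

    x1, x2 : detected observations known to come from components 1 and 2
    m1, m2 : numbers of below-LDL (censored) observations from components 1 and 2
    ldl    : the common lower detection limit
    a, b   : hyperparameters of the beta prior on the mixing weight p
    c, d   : shape/rate hyperparameters of the gamma priors on alpha_1 and alpha_2
    """
    rng = np.random.default_rng(seed)
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    r1, r2 = len(x1), len(x2)
    n1, n2 = r1 + m1, r2 + m2
    # Sufficient statistics: sums of log(1 - exp(-x)), with the censored part
    # contributing m_j * log(1 - exp(-ldl)); both terms are negative.
    s1 = np.sum(np.log1p(-np.exp(-x1))) + m1 * np.log1p(-np.exp(-ldl))
    s2 = np.sum(np.log1p(-np.exp(-x2))) + m2 * np.log1p(-np.exp(-ldl))
    draws = np.empty((n_draws, 3))
    for t in range(n_draws):
        # Conjugate posterior densities: gamma for the shapes, beta for the weight.
        alpha1 = rng.gamma(shape=c[0] + r1, scale=1.0 / (d[0] - s1))
        alpha2 = rng.gamma(shape=c[1] + r2, scale=1.0 / (d[1] - s2))
        p = rng.beta(a + n1, b + n2)
        draws[t] = (alpha1, alpha2, p)
    kept = draws[burn_in:]                       # discard the burn-in draws
    be_self = kept.mean(axis=0)                  # posterior means: BEs under SELF
    be_elf = 1.0 / np.mean(1.0 / kept, axis=0)   # [E(theta^-1 | x)]^-1: BEs under ELF
    return be_self, be_elf
```

The last two lines show how both sets of estimates follow from the same retained draws: the SELF estimate is the posterior mean, and the ELF estimate is the reciprocal of the posterior mean of the reciprocal.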

3. Results and Discussions

In this section, we have generated simulated datasets from 2CMGEM using different sample sizes, censoring rates, true parametric values, hyperparameters, and mixing weights. As all these factors can affect the estimation, we have considered different situations for each of them. As discussed by Momoli et al. [22], such datasets represent the real data from environmental studies and provide convenience by avoiding issues with empirical comparisons. Since censoring is the major interest of this study, we have focused on the comparison of complete datasets with censored ones. Different estimation methods and two loss functions have been used in order to explore the best combination of estimation method and loss function in our case. The data from the 2CMGEM with LDL have been generated using the following steps (a code sketch of this scheme is given after the list):
(i) Step-1: generate a sample of the required size from 2CMGEM
(ii) Step-2: draw a uniformly distributed random number for each observation generated in Step-1
(iii) Step-3: when the uniform draw does not exceed the mixing weight, take the observation from the first component; otherwise, take it from the second component of 2CMGEM
(iv) Step-4: fix the LDL
(v) Step-5: the observations falling below the LDL are treated as censored from each component
(vi) Step-6: the observations greater than or equal to the LDL are used for analysis
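A minimal Python sketch of this generation scheme, assuming shape-only GE components and using illustrative parameter values that are not those of the paper, is:

```python
import numpy as np

def ge_sample(alpha, size, rng):
    """Inverse-CDF draws from the shape-only GE: F(x) = (1 - exp(-x))**alpha."""
    u = rng.uniform(size=size)
    return -np.log1p(-u ** (1.0 / alpha))

def simulate_2cmgem_ldl(n, alpha1, alpha2, p, ldl, seed=42):
    """Steps 1-6: generate a 2CMGEM sample and censor every value below the LDL."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n)                          # Step-2: one uniform draw per observation
    comp = np.where(u <= p, 1, 2)                    # Step-3: component membership
    x = np.where(comp == 1,
                 ge_sample(alpha1, n, rng),
                 ge_sample(alpha2, n, rng))          # Step-1: the mixture sample
    detected = x[x >= ldl]                           # Step-6: values retained for analysis
    n_censored = int(np.sum(x < ldl))                # Steps 4-5: nondetect count below the LDL
    return detected, n_censored, comp

# Illustration with arbitrary values (not those used in the paper):
detected, n_censored, comp = simulate_2cmgem_ldl(n=200, alpha1=2.0, alpha2=4.0, p=0.4, ldl=0.3)
```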

The corresponding results are given in Tables 1–5 and in Figures 1 and 2. The estimates under simulated samples were replicated 10,000 times, and the averages of the estimates have been reported. In addition, in the case of the MCMC method, 11,000 replications were used, and the first 1,000 replications were not included in the estimation in order to remove the impact of the choice of starting values. The Mathematica and R software packages have been used for the numerical computations.

Figure 1 shows the impact of changes in sample size and censoring rate on the performance of the estimates. Panel (a1) reports the averages of the relative estimates (AREs) for the parameter of the first component using different sample sizes and various estimation methods, and Panel (a2) presents the corresponding mean square errors (MSEs). Similarly, the AREs and MSEs for the parameter of the second component can be seen in Panels (a3) and (a4), respectively. These panels show that the estimates converge to the true parametric values as the sample size increases, and it is also evident from panels (a1)–(a4) that the MSEs tend to decrease with increasing sample size. The results also indicate that the performance of the estimates under the MCMC method assuming ELF is superior to that of their competitors. The estimates of the other model parameters exhibited a similar trend (see Table 1).

A detailed analysis of the impact of an increased censoring rate on the estimates is fundamental in environmental studies. For that purpose, the simulated data have been used to compare the results under different censoring rates with the case in which no censoring occurs; this provides an idea of how well the proposed estimators are capable of handling censoring. The right panels (c1 and c2) in Figure 1 show these results. Panel (c1) elucidates the impact of the censoring rates when the sample size is 20, whereas panel (c2) presents the impact when the sample size is 200. The results in panels (c1) and (c2) have been reported under the MCMC method using ELF, as the performance of these estimators was observed to be better than that of their counterparts. From these panels, it can be assessed that when the sample is small (20 observations), increased censoring rates tend to slightly inflate the MSEs of the estimates of the model parameters. However, for a sufficiently large sample size (200 observations), the MSEs of the estimates under complete data and under 30% censored data are very comparable. Hence, the proposed estimators are efficient enough to deal with left-censored environmental datasets.

The sensitivity of the proposed estimators with respect to different values of the hyperparameters, different mixing weights, and different true parametric values has been investigated in Figure 2. All the reported results are based on the MCMC method assuming ELF, owing to its superior efficiency as compared to its competitors. The results have been compared on the basis of scaled MSEs (SMSEs), obtained as the ratio of the MSEs to the corresponding true parametric values. The comparison of the estimates under complete data and under 20% censored data has been considered. The left panels (a1 and a2) of the figure investigate the sensitivity of the estimates with respect to changes in the hyperparameters. We have used three sets of hyperparameters, resulting in three different priors. The first prior is the main prior assumed for the overall analysis and has been defined on the basis of the prior means approach: the hyperparameters are chosen in such a manner that the mean of the resulting prior is approximately equal to the corresponding true parametric value. The other two priors have been specified so that their prior means are increased and decreased, respectively, by 20% relative to the main prior. In panels (a1) and (a2), the x-axis represents the combination of prior and censoring rate; for example, a label with 0% denotes the results under a given prior with no censoring, and a label with 20% denotes the results under that prior with 20% censored samples. Panel (a1) elucidates that in the case of complete data, the change in the prior parameters has very little effect on the performance of the estimates, whereas for 20% censored samples the change in the prior parameters results in slightly inflated SMSEs. However, when the sample size is increased to 200, the estimates under all three priors are quite similar, and censoring has very little impact on the efficiency of the estimates. The corresponding numerical results can be seen in Tables 2 and 3.
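As an illustration of the prior means approach, suppose a gamma prior with shape c and rate d (hyperparameter labels of our own choosing) is placed on a component parameter whose true value is α; then the three priors correspond to

```latex
\frac{c}{d} = \alpha \ \ \text{(main prior)}, \qquad
\frac{c}{d} = 1.2\,\alpha \ \ \text{(prior mean increased by 20\%)}, \qquad
\frac{c}{d} = 0.8\,\alpha \ \ \text{(prior mean decreased by 20\%)}.
```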

The central panels (b1 and b2) represent the effect of the choice of the true parametric values on the estimates. Three different sets of true parametric values have been used to assess the impact of varying the true values on the proposed estimators. The x-axis represents the combination of true parametric values and censoring rate; for example, a label with 0% indicates the estimates for a given parametric set with no censoring, and a label with 20% represents the results for that set with 20% censored samples. Panels (b1) and (b2) show the results for sample sizes 20 and 200, respectively. For the smaller sample size, some fluctuations in the SMSEs have been observed across the different parametric sets. However, for the larger sample size, very little effect of the true parametric values can be seen. The corresponding numerical results have been placed in Table 4.

Keeping the values of the component parameters fixed, the impact of changing the mixing parameter is presented in panels (c1) and (c2). The x-axis labels Pi1, Pi2, and Pi3 denote increasing values of the mixing parameter with no censoring, and Pi4, Pi5, and Pi6 denote the same values with 20% censoring. For the smaller sample size, an increase in the value of the mixing parameter has a positive impact on the estimation of the parameters of the first component of the mixture, without seriously affecting the estimation for the other component. Interestingly, for the larger sample size, larger true values of the mixing parameter still produce improved estimation for the first component of the mixture.

4. Real Examples

We have used two real environmental datasets with LDL to illustrate the applicability of the proposed methods. The first dataset contains the ammonium (NH4) concentration (mg/L) in precipitation observed at Olympic National Park, Hoh Ranger Station (WA14). This dataset has been reported by Aboueissa [34] and was collected from January 2009 to December 2011 on a weekly basis. We refer to it as dataset-1. The dataset contains a high proportion (46 out of 102) of observations that are below the LDL. The observations of the dataset are <0.006, <0.006, 0.006, 0.016, <0.006, 0.015, 0.023, 0.034, 0.022, 0.007, 0.021, 0.012, <0.006, 0.021, 0.015, 0.088, 0.058, <0.006, <0.006, <0.006, <0.006, 0.074, 0.011, 0.121, <0.006, 0.007, 0.007, 0.015, <0.006, <0.006, <0.006, <0.006, <0.006, <0.006, <0.006, <0.006, <0.006, <0.01, <0.01, <0.01, 0.042, <0.01, 0.013, <0.01, 0.028, <0.01, <0.01, <0.01, 0.049, 0.036, <0.01, 0.011, 0.019, 0.253, <0.018, <0.01, 0.014, 0.08, <0.01, 0.065, <0.01, <0.01, <0.01, <0.01, <0.01, <0.01, <0.008, <0.008, <0.008, <0.008, 0.009, <0.008, 0.017, 0.012, 0.02, 0.009, 0.014, 0.03, 0.06, 0.031, 0.024, 0.013, 0.059, 0.009, 0.017, 0.033, 0.052, 0.015, 0.019, <0.008, <0.008, <0.008, <0.008, 0.009, 0.013, 0.023, 0.036, <0.008, 0.012, 0.03, 0.022, and 0.008. Approximately 45% of the observations are below the LDL. The mean of the observations is 0.0316 and the variance is 0.0015. The second dataset (dataset-2) has also been reported by Aboueissa [34] and concerns the manganese concentrations (ppb) in groundwater collected from five wells. Since two LDLs were used to obtain the dataset, a 2-component mixture (2CM) can be used to model it. This dataset consists of twenty-five values, with six observations falling below the LDL; hence, the data contain approximately 24% observations below the lower detection limit. The values of the dataset are <5, 12.1, 16.9, 21.6, <2, <5, 7.7, 53.6, 9.5, 45.9, <5, 5.3, 12.6, 106.3, 34.5, 6.3, 11.9, 10, <2, 77.2, 17.9, 22.7, 3.3, 8.4, and <2. The mean of the observations is 25.43 and the variance is 752.96. The numerical results for dataset-1 and dataset-2 have been given in Tables 6–11 and in Figures 3 and 4. We have compared the proposed model (2CMGEM) with different mixture models available in the literature for modeling environmental datasets with LDL. These models include the 2CM of normal models (2CMNM), the 2CM of log-normal models (2CMLNM), the 2CM of gamma models (2CMGM), and the 2CM of log-gamma models (2CMLGM). By comparing the fit of 2CMGEM with these models, we have assessed the suitability of the proposed model for left-censored environmental datasets. The proposed MCMC methods have also been compared with the classical methods (MLE and the EM algorithm) already used for estimating such datasets. The optimal number of components for the proposed mixture model has also been explored using different goodness-of-fit criteria and test statistics.
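As a small illustration of how such data can be prepared for the censored-likelihood and MCMC routines sketched earlier, dataset-2 can be encoded with the nondetects stored at their respective detection limits together with a censoring indicator (the variable names are ours):

```python
import numpy as np

# Manganese concentrations (ppb) from dataset-2; values reported as "<LDL" are
# recorded at the detection limit with a censoring flag.
values = np.array([5, 12.1, 16.9, 21.6, 2, 5, 7.7, 53.6, 9.5, 45.9, 5, 5.3, 12.6,
                   106.3, 34.5, 6.3, 11.9, 10, 2, 77.2, 17.9, 22.7, 3.3, 8.4, 2])
censored = np.array([1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
                     0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1], dtype=bool)

detected = values[~censored]
print(f"{censored.mean():.0%} of the observations fall below a detection limit")
print(f"detected-only mean = {detected.mean():.2f}, variance = {detected.var(ddof=1):.2f}")
```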

The graphs showing the goodness of fit of the different mixture models are presented in Figure 3. From these graphs, it can easily be assessed that 2CMGEM is more representative than the other models for both datasets. In the case of dataset-1, 2CMGEM appears to be the best model, followed by 2CMNM and 2CMGM, while 2CMLNM and 2CMLGM did not provide good fits. On the other hand, for dataset-2, 2CMGEM again provides the best fit, followed by 2CMLNM and 2CMGM, whereas 2CMLGM and 2CMNM did not provide reasonable fits.

The optimal number of mixture components has also been assessed using the likelihood ratio test (LRT); the results are given in Tables 7 and 10, respectively. The LRT places the model with the smaller number of components under the null hypothesis. Hence, in the comparison of the single-component GE model (GEM) and 2CMGEM, the GEM has been assumed under the null hypothesis, while for the comparison of 2CMGEM with higher-order mixtures, the model under the null hypothesis is 2CMGEM. From the results, it can be seen that 2CMGEM is the best model for each dataset.
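A minimal sketch of how the LRT statistic and an information criterion such as AIC can be computed from the maximized log-likelihoods of two fitted mixtures is given below; the function names and the choice of degrees of freedom are ours, and the chi-square reference distribution is only approximate for mixture-order comparisons because the null model lies on the boundary of the parameter space.

```python
from scipy.stats import chi2

def lrt(loglik_null, loglik_alt, df):
    """Likelihood ratio test of a smaller (null) mixture against a larger (alternative) one."""
    stat = 2.0 * (loglik_alt - loglik_null)
    return stat, chi2.sf(stat, df)   # test statistic and approximate p-value

def aic(loglik, n_params):
    """Akaike information criterion for a fitted model with n_params free parameters."""
    return 2.0 * n_params - 2.0 * loglik
```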

The graphical comparison of the fits under different estimation methods is presented in Figure 4, while the numerical comparison is given in Tables 8 and 11. Figure 4 reveals that the estimates under the MCMC method using ELF provide a better description of the data; the slight departures at some points can be attributed to the particular sample. The departures of the estimates under the traditionally used MLE and EM methods are more pronounced. The numerical results in Tables 8 and 11 further support these findings. Hence, the proposed estimators are significantly better than the traditionally used estimators.

5. Conclusion

This study aimed at exploring improved estimation for heterogeneous environmental datasets with LDL. Earlier studies considered the classical analysis (MLE and the EM algorithm) of such datasets with some specific mixture models. We have proposed a Bayesian analysis for modeling such data, advocated the use of 2CMGEM, and discussed the optimality of the proposed model. Our results revealed significant gains in efficiency for the proposed estimators. In addition, the proposed estimators are quite insensitive to the LDL, the true parametric values, the hyperparametric values, and the mixing weights. Hence, the proposed estimators are a quite convincing alternative to those existing in the literature.

The current methodology can be extended to incorporate variance analysis, invariant analysis, and covariance analysis. Future work can also address cases where the data are right or doubly censored.

Abbreviations

AIC:Akaike information criterion
ARE:Average of relative estimates
BIC:Bayesian information criterion
ELF:Entropy loss function
EM:Expectation-maximization
KS:Kolmogorov–Smirnov
LA:Lindley’s approximation
LE:Estimates using LA and ELF
LRT:Likelihood ratio test
LS:Estimates using LA and SELF
LDL:Lower detection limit
MCMC:Markov chain Monte Carlo
ME:Estimates using MCMC and ELF
MS:Estimates using MCMC and SELF
MLE:Maximum likelihood estimation
MSE:Mean square error
SELF:Squared error loss function
2CM:2-component mixture
2CMGM:2CM of gamma models
2CMGEM:2CM of generalized exponential model
2CMLGM:2CM of log-gamma models
2CMLNM:2CM of log-normal models
2CMNM:2CM of normal models.

Data Availability

The data used in this study are available within the article.

Conflicts of Interest

The authors declare that they have no competing interests.

Acknowledgments

This study was supported by the Taif University Researchers Supporting Project (no. TURSP-2020/318), Taif University, Taif, Saudi Arabia.