Abstract
In the context of a sample survey, the collection of information on a sensitive variable is difficult, which may cause nonresponse and measurement errors. Due to this, the estimates can be biased and the variation may increase. To overcome this difficulty, we propose an estimator for the estimation of a sensitive variable by using auxiliary information in the presence of nonresponse and measurement errors simultaneously. The properties of the proposed estimators have been studied, and the results have been compared with those of the usual complete response estimator. Theoretical results have been verified through a simulation study using an artificial population and two real-life applications. With the outcomes of the proposed estimator, a suitable recommendation has been made to the survey statisticians for their real-life application.
1. Introduction
The use of auxiliary information in sample surveys is intended to improve the precision of estimators. The estimation in this article is carried out under stratified successive sampling. To reduce population heterogeneity, a natural sampling strategy is stratified random sampling. Researchers use stratified random sampling when they are already aware of subdivisions within a population that need to be accounted for in their research. For example, in socioeconomic surveys, people living in rural, suburban, and urban regions may need to be examined separately. The nature of the study variable may differ across parts of the population, and each such part should be treated as a separate stratum. In addition to the overall estimate, it may also be required to estimate the parameters of particular strata. This can be done only through stratified random sampling. Thus, in stratified sampling, we first divide the entire population into homogeneous subpopulations or subgroups, which are known as strata. Then, random samples are selected independently from each stratum, so that each subgroup of the population is adequately represented in the sample. Kadilar and Cingi [1] developed ratio estimators in stratified random sampling, and Bouza et al. [2] utilized auxiliary information to develop new ratio estimators for population parameters in stratified random sampling.
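As a minimal sketch of the stratified sampling idea described above, the following Python fragment draws an independent simple random sample without replacement from each stratum and combines the stratum sample means with weights proportional to the stratum sizes. The function names `stratified_sample` and `stratified_mean` are illustrative assumptions and do not appear in the article; this is the textbook stratified mean, not the estimator proposed later.

```python
import numpy as np

def stratified_sample(strata, n_per_stratum, rng):
    """Draw an independent SRSWOR sample from each stratum (hypothetical helper)."""
    return [rng.choice(np.asarray(y, dtype=float), size=n, replace=False)
            for y, n in zip(strata, n_per_stratum)]

def stratified_mean(samples, stratum_sizes):
    """Textbook stratified mean: sum over strata of W_h * ybar_h, with W_h = N_h / N."""
    sizes = np.asarray(stratum_sizes, dtype=float)
    weights = sizes / sizes.sum()                      # stratum weights W_h
    means = np.array([np.mean(s) for s in samples])    # stratum sample means
    return float(np.dot(weights, means))
```

For instance, two strata of equal size with sample means 2 and 10 give a stratified mean of 6.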
In longitudinal surveys, successive sampling is a well-known approach for estimating population parameters and measuring differences or changes in a study variable. In this type of sampling, some units drawn on the first occasion are retained and used on the second occasion, while the rest of the units are replaced by new ones drawn on the current occasion. This partial replacement of units lowers survey costs. The theory of successive sampling began with the work of Jessen [3]. It has since been extended by several authors. Singh et al. [4] suggested a logarithmic-type estimator to estimate the population mean in successive sampling in the presence of random nonresponse and measurement errors. Singh et al. [5] proposed some imputation methods to deal with the problem of missing data in two-occasion successive sampling. Khalid and Singh [6], among others, investigated some imputation methods to deal with missing data due to random nonresponse in two-occasion successive sampling.
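The partial replacement scheme described above can be sketched as follows. This is only an illustration under stated assumptions: units are identified by integer labels, the fresh (unmatched) units are drawn from the units not selected on the first occasion, and the name `successive_sample` and its arguments are hypothetical.

```python
import numpy as np

def successive_sample(N, n, m, rng):
    """Two-occasion sample: n units on occasion 1, m of them retained (matched),
    and n - m fresh (unmatched) units drawn for occasion 2."""
    first = rng.choice(N, size=n, replace=False)        # first-occasion sample
    matched = rng.choice(first, size=m, replace=False)  # units retained on occasion 2
    remaining = np.setdiff1d(np.arange(N), first)       # units not used on occasion 1
    unmatched = rng.choice(remaining, size=n - m, replace=False)
    return first, matched, unmatched
```

The second-occasion sample is the union of `matched` and `unmatched`, so its size is again n.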
In surveys on sensitive variables, it is common that some people do not respond, whereas others respond but prefer to disguise the true value to avoid social disapproval. For example, when estimating the number of drug addicts in a town or the number of students who have cheated on an exam, respondents may provide false answers. In such cases, the questions are clearly sensitive. This situation arises in many surveys. When many respondents do not answer truthfully, a social desirability bias may result. To avoid social desirability response bias, Warner [7] developed a data collection procedure, termed the randomized response technique (RRT), that allows researchers to ask sensitive questions. It was extended by Diana and Perri [8], who proposed a class of estimators for quantitative sensitive data. Likewise, Gupta et al. [9] suggested a unified measure of respondent privacy and model efficiency in quantitative RRT models, and Zhang et al. [10], among many others, proposed ratio estimation of the mean under RRT models. Nonresponse and measurement errors may both be observed in a socioeconomic survey on sensitive data. For instance, in a survey of annual income and expenditure per household, people may attempt to suppress or distort the amount of income or expenditure. Similarly, when assessing the number of illegal abortions performed in a city each year, both nonresponse and measurement errors may occur.
It is rarely possible to obtain complete information about sample units in surveys where sensitive questions are present. Incomplete information is known as nonresponse, which is a source of nonsampling error. To deal with nonresponse in sample surveys, Hansen and Hurwitz [11] suggested taking a subsample from the nonrespondent group. Bouza [12] studied the problem of the subsampling fraction in the case of nonresponse. Diana et al. [13] proposed a Hansen and Hurwitz estimator with scrambled responses on the second call if the survey is sensitive in nature. Singh et al. [14] used a calibration method for the estimation of population variance in stratified successive sampling with random nonresponse. Moreover, Mukhopadhyay et al. [15] worked on a general technique for estimating the population mean under stratified successive sampling in the presence of random scrambled responses and nonresponses. Apart from nonresponse, measurement error is also a serious issue in sample surveys due to the lack of proper data. A measurement error occurs when the real value for a sample unit differs from what is observed. Singh et al. [16] investigated difference-type estimators for the estimation of the mean in the presence of measurement errors. Zhang et al. [17] suggested mean estimation in the simultaneous presence of measurement errors and nonresponse using optional RRT models under stratified sampling. Zahid et al. [18] also proposed a generalized class of estimators for a sensitive variable in the presence of nonresponse and measurement errors under stratified sampling. Furthermore, Tiwari et al. [19, 20], Kumar and Kour [21], and others have addressed the issue of nonresponse and measurement error in various sampling strategies, using prior studies as inspiration and showing how to handle these flaws in a sample survey.
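The subsampling idea of Hansen and Hurwitz [11] mentioned above can be illustrated with a short sketch. It uses the standard textbook form of the estimator, in which the mean of the followed-up subsample stands in for all nonrespondents; the function name and arguments are illustrative assumptions, and no scrambling or measurement error is included here.

```python
import numpy as np

def hansen_hurwitz_mean(respondents, nonrespondent_subsample, n_nonrespondents):
    """Textbook Hansen-Hurwitz mean: (n1 * ybar1 + n2 * ybar2r) / n, where ybar2r is
    the mean of the subsample followed up among the n2 nonrespondents."""
    n1 = len(respondents)
    n2 = int(n_nonrespondents)
    ybar1 = float(np.mean(respondents)) if n1 > 0 else 0.0
    if n2 == 0:
        return ybar1                      # complete response: ordinary sample mean
    if len(nonrespondent_subsample) == 0:
        raise ValueError("a subsample of nonrespondents is required when n2 > 0")
    ybar2r = float(np.mean(nonrespondent_subsample))
    return (n1 * ybar1 + n2 * ybar2r) / (n1 + n2)
```

With two respondents averaging 3 and two nonrespondents, one of whom is followed up and reports 10, the estimate is (2·3 + 2·10)/4 = 6.5.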
Furthermore, the exponential estimator(s) proposed by Bahl and Tuteja [22] are known to perform better than the corresponding usual ratio and product-type estimators under certain efficiency conditions. When these conditions are not readily satisfied, we propose a logarithmic-type estimator, since the logarithm is the inverse operation of exponentiation. Both the study and the auxiliary variables are taken to be sensitive in nature under optional randomized response technique (ORRT) models, and the aim is to estimate the population mean of the sensitive variable in the presence of nonresponse and measurement errors simultaneously under stratified successive sampling.
2. Sample Structure and Notations
Let us consider a finite stratified population with distinct units divided into homogeneous subgroups, known as strata with units in the stratum. On the first(second) occasion, the character under study is represented by . We further assume that information on an auxiliary variable is available on both occasions and is correlated with the study variable. Let a simple random sample (without replacement) of size unit is selected on the first occasion. From the first occasion, a sample of size units is retained (matched) on the second occasion. Again, on the second occasion, an independent sample of size units (i.e., unmatched units with the first occasion) is drawn from the entire population, so that the sample size on the second (current) occasion is also . To deal with nonresponses, we divide the whole population into two groups, i.e., respondents and nonrespondents. Let and be the number of responding and nonresponding units in the stratum, respectively. From sample units, units respond and units do not respond. Similarly, in the unmatched sample, units respond and do not respond and from matched units respond and do not respond. Again, a subsample of size units is drawn from the nonrespondent class of the sample units. Furthermore, we take a subsample from the nonrespondent group of size , and units are drawn from the matched and unmatched portions, respectively, where is the inverse sampling ratio. Along with nonresponses, the measurement error is also associated with these sample units, i.e., and = ; for both occasions, which are random in nature with a mean zero and population variance , , and . 
Hence, the following notations are used for their further use: : the sensitive study variable on the first (second) occasion : the sensitive auxiliary variable for both occasions : means of the stratum on the first and second occasion, respectively : means of the auxiliary variable for stratum on both occasions : the population means of the sensitive study variables on the first and second occasion, respectively : the population mean of the auxiliary variable for both occasions , , , , , and : the sample means of the variables , , and , based on the respective sample size shown in their subscripts , and : the correlation coefficients between their respective variables, as shown in their subscripts , , and : the population variances of the variable , , and , respectively , , , , , and : the sample means of the variables , , and , when there is a presence of sensitivity with nonresponse and measurement error simultaneously, based on the respective sample size shown in their subscripts
3. Optional Randomized Response Technique (ORRT)
In this section, let and be two scrambled variables with mean and and known variances and , respectively. Let represent the probability that the respondent will find the question sensitive. Since the ORRT model is more efficient, we add it optionally to the Diana and Perri [8] model. In the ORRT version, the respondent may answer in the two ways given in equation (1), depending on whether the respondent considers the question sensitive or not. As a result, for the matched and unmatched portion, we may use the general scrambling model for the sensitive study variable is given as follows:where .
The mean and variance of are given by the following expression:
We write the randomized linear model as follows:and the expectation and variance of the randomized mechanism are as follows:
For sensitive study variable for the unmatched portion, the general scrambling model is given as follows:where .
Then, the mean and variance of are given by the following expression:
Therefore,and the mean and variance are given as follows:
Similarly, for auxiliary variable , we write as follows:where .
Then, the mean and variance of are given by the following expressions:
Therefore,where and follow a normal distribution with mean and variances , i.e., and , Bernoulli with
When face-to-face interview of subsampled units of nonrespondents is performed in the second phase of the Hansen and Hurwitz [11] procedure, we give respondents the opportunity to scramble their responses using ORRT. In this case, we used Hansen and Hurwitz’s technique by assuming that the respondent group provides direct responses in the first phase, and then, the ORRT model is used to provide responses from a sample of nonrespondents in the second phase.
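To make the optional scrambling mechanism more concrete, here is a hedged sketch. It assumes the common optional form of a multiplicative-additive scrambling model: with probability W the respondent finds the question sensitive and reports T·y + S, and otherwise reports y directly. This assumed form, the normal scrambling distributions, and all names (`orrt_response`, `unscramble`, `mu_T`, `mu_S`) are illustrative and may differ from the article's exact equations.

```python
import numpy as np

def orrt_response(y, W, mu_T, sd_T, mu_S, sd_S, rng):
    """Reported value under the assumed optional model:
    T*y + S with probability W (question felt sensitive), else the true y."""
    y = np.asarray(y, dtype=float)
    sensitive = rng.random(y.shape) < W      # Bernoulli(W) indicator per respondent
    T = rng.normal(mu_T, sd_T, y.shape)      # multiplicative scrambling variable
    S = rng.normal(mu_S, sd_S, y.shape)      # additive scrambling variable
    return np.where(sensitive, T * y + S, y)

def unscramble(z, W, mu_T, mu_S):
    """Linear transformation whose expectation, given y, equals y:
    E[z | y] = y * (W*mu_T + 1 - W) + W*mu_S under the assumed model."""
    z = np.asarray(z, dtype=float)
    return (z - W * mu_S) / (W * mu_T + 1.0 - W)
```

Averaging the unscrambled values over the subsampled nonrespondents then gives a quantity that can be plugged into a Hansen and Hurwitz type estimator, at the cost of extra variance from T and S.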
Let denotes a transformation of the randomized response on the block, the expectation of which is a real response and is given as follows:with
Also, denotes a transformation of the randomized response on the block, whose expectation equals to and is defined as follows:with
Likewise, denotes a transformation of the randomized response on the block, whose expectation equals to and is given as follows:with
Now, the Hansen and Hurwitz [11] estimator in the presence of nonresponse by using ORRT is represented as follows:where
It is simple to illustrate thati.e., , , and are usual unbiased estimators.
The variance of , , and are as follows:where
Let the measurement error associated with the sensitive variable(s) (, , and ) in a face-to-face interview be given as , , and with mean zero and variances , , and , respectively.
Various notations under measurement error are given as follows:where , , and are measurement errors on , , and , respectively.
So, the variance of and ad in the presence of measurement error is given by the following expressions:whererespectively.
4. Proposed Estimator
Under certain efficiency conditions, exponential-type estimators are known to outperform the related customary ratio and product-type estimators in terms of lesser mean square errors. Therefore, the question arises as to what happens when the conditions that favor exponential-type estimators over the customary estimators are not readily met. The answer of course lies in the use of other efficient estimators that would perform better than both the existing exponential and customary estimators. So, we examine the logarithmic-type estimator in our search for such efficient estimators because the logarithm is the inverse operation of exponentiation. The logarithmic function has some helpful qualities and is used extensively in a variety of scientific and nonscientific domains. Motivated by the abovementioned discussions and following the work of Singh et al. [14], we formulate two independent estimators for estimating the population mean of the sensitive variable on the current (second) occasion. The proposed class of estimator is defined as a convex linear combination of two separate classes of estimators and .where is an unknown constant that must be determined, so that the mean squared error (MSE) of the proposed estimator is minimum.
The estimator is based on , i.e., unmatched sample, and estimator is based on , i.e., matched sample, and is defined as follows:where; and are the constants to be determined by minimizing the mean square errors of the estimators.
Using the following transformations, the bias and mean square errors of the estimators and are derived up to the first degree of approximation. Letsuch thatwhere
Using the abovementioned results, the bias of the estimator to the first degree of approximation is calculated as follows:wherewhere
The mean square error (MSE) of the estimator to the first degree of approximation is calculated as follows:wherewhere
The mean square errors of and are functions of the unknown constants and , respectively. So, we minimize the MSEs of and , with respect to and , respectively. The optimal values are obtained as follows:
After substituting the optimum value of and in equations (37) and (38), we obtain the minimum MSE of and .where
The MSE of is dependent on the unknown constant “;” then, we minimize equation (35) w.r.t “” and equate it to zero, and we obtained the optimum value of “” as follows:
Substituting the optimum value of in equation (35), we obtain the optimum mean square error of the estimator as follows:
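As a generic illustration of this minimization step (not the article's numbered equations), consider two independent estimators with mean square errors M_u and M_m combined with weight φ. The symbols T_u, T_m, M_u, M_m, and φ are assumptions chosen for this sketch, and the cross-product term is taken to be negligible to the first degree of approximation because the two estimators are based on independent samples.

```latex
\begin{align}
T &= \varphi\, T_u + (1-\varphi)\, T_m, \qquad 0 \le \varphi \le 1, \\
\operatorname{MSE}(T) &\approx \varphi^{2} M_u + (1-\varphi)^{2} M_m, \\
\frac{\partial \operatorname{MSE}(T)}{\partial \varphi}
  &= 2\varphi M_u - 2(1-\varphi) M_m = 0
  \;\Longrightarrow\;
  \varphi_{\mathrm{opt}} = \frac{M_m}{M_u + M_m}, \\
\operatorname{MSE}(T)_{\mathrm{opt}} &= \frac{M_u\, M_m}{M_u + M_m}.
\end{align}
```

Under these assumptions, the optimum MSE is never larger than either M_u or M_m, which is the reason for combining the matched and unmatched portions.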
Case 1. If , then the mean square error of the proposed estimator without measurement error reduces to the following expression:wherewhereThe MSE of and are minimum, whenAfter substituting the optimum value of and in MSEs, we obtain the minimum MSE of and .whereTo obtain the optimum value of ‘’, we minimize equation (35) w.r.t “” and equate it to zero, as follows:After substituting the optimum value of in equation (35), we obtain the optimum mean square error of the estimator as follows:
Case 2. If , then the mean square error of the proposed estimator without sensitivity becomeswherewhereThe MSE of and are minimum, whenBy substituting the optimum value of and in MSEs, we obtain the minimum MSE of and .whereTo obtain the optimum value of “,” we minimize equation (52) w.r.t “” and equate it to zero as follows:After substituting the optimum value of in equation (52), we obtain the optimum mean square error of the estimator as follows:
Case 3. If we substitute in equation (60), then the mean square error of the proposed estimator reduces to only the nonresponse case (i.e., absence of measurement and sensitivity) as follows:wherewherewhich are minimum, whenWe obtain the minimum MSE of and , after substituting the optimum value of and in abovementioned MSEs as follows:whereTo derive the optimum value of the constant “,” we differentiate equation (60) w.r.t “” and equate it to zero; we obtain as follows:After substituting the optimum value of from equation (66), the optimum mean square error of is given as follows:
Case 4. If we substitute , then the mean square error of the proposed estimator reduces to only the sensitivity case (i.e., absence of nonresponse and measurement errors) as follows:wherewhereThe MSEs are minimum, whenAfter substituting the optimum value of and in equations (70) and (71), we obtain the minimum MSE of and .whereThe MSE of is dependent on the unknown constant “;” then, we differentiate equation (69) w.r.t “” and equate it to zero and we obtain the optimum value of “” as follows:Substituting the optimum value of from equation (75), we obtain the optimum mean square error of the estimator as follows:
5. Complete Response Estimator
To see how well the proposed estimators perform, we compare them with the estimator for the complete response situation. The complete response estimator is defined as follows:where , , and is an unknown constant to be determined by the minimization of the mean square error of the estimator .
To the first-order approximation, the minimum mean square error of the estimator is given as follows:wherewhere
6. Simulation Study
In this section, we conduct a simulation study using R software to verify the results of the previous sections. We have generated a hypothetical (artificial) population in Section 6.1 and considered two real populations in Section 6.2. The descriptions of the variables with parametric values are given in Sections 6.1 and 6.2, respectively. The results are given in Tables 1–4.
6.1. Population Generated through Simulation Using Normal Distribution
In this section, we investigate the efficiency of our estimator with the help of an artificial population of and generated three strata of equal size, i.e., ; then, a random sample is drawn from each stratum of size ; , i.e., , such that . On the first occasion, we generated a sample from , and on the second occasion, a subsample of size , i.e., matched units, is taken from the first occasion and a fresh sample of size , i.e., unmatched units, is drawn from the remaining population. In the second phase, another sample is drawn from the nonrespondent class. Again, a subsample from the nonrespondent group of size and units are drawn by using different values of , respectively. The study variables and auxiliary variables are generated from the normal distribution, i.e., , , and , respectively. The scrambling variable is taken from a normal distribution with a mean value of 1 and a variance value of 0, and is also taken from a normal distribution with a mean value of 0 and a variance value of 1. The measurement errors on , , and have a normal distribution with a mean value of 0 and a variance value approximately equal to 1. We analyzed the mean squared error (MSE) of the proposed estimator(s) in the presence and in the absence of measurement errors for different values of , , and .
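The data-generating step described above might be sketched in Python as follows (the article itself reports using R). Every numeric setting a caller passes in, such as means, standard deviations, and correlations, would be a placeholder rather than a value from the article, and the helper names `generate_stratum` and `add_measurement_error` are hypothetical.

```python
import numpy as np

def generate_stratum(size, means, sds, corr, rng):
    """Generate `size` units of three correlated normal variables
    (e.g., first-occasion study, second-occasion study, auxiliary) for one stratum."""
    sds = np.asarray(sds, dtype=float)
    cov = np.asarray(corr, dtype=float) * np.outer(sds, sds)  # covariance from correlations
    return rng.multivariate_normal(means, cov, size=size)     # shape (size, 3)

def add_measurement_error(values, error_sd, rng):
    """Observed value = true value + independent N(0, error_sd^2) error."""
    values = np.asarray(values, dtype=float)
    return values + rng.normal(0.0, error_sd, values.shape)
```

Calling `generate_stratum` once per stratum and stacking the results gives an artificial stratified population on which sampling, nonresponse, and scrambling can then be simulated.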
From Tables 1 and 2, we have noted the following observations:
(i) Tables 1 and 2 show that when no error is present (i.e., a complete response), the MSE of is smaller than that of the other considered estimators. Furthermore, among the considered estimators, (i.e., presence of sensitivity only) has a smaller MSE than the other estimators (i.e., the presence and absence of measurement errors). It represents .
(ii) Also, from Tables 1 and 2, it is observed that with the increasing value of , the MSEs of each estimator, i.e., , and , also increase.
(iii) With an increase in the value of from 0.4 to 0.8, the mean squared error of each estimator, i.e., , and , also increases, which is shown in Tables 1 and 2.
(iv) From Table 1, it can be seen that when and , then for the increasing value of to 0.4, the MSE of (i.e., presence of sensitivity with nonresponse and measurement errors) increases; the MSEs of all the other considered estimators, i.e., , and , first increase and then decrease.
(v) From Tables 1 and 2, it is observed that for the increasing value of to 0.4, the MSE of increases. But, for , the MSEs of all the other estimators, i.e., , and first increase, then decrease for and increase for .
(vi) The higher the value of the correlation coefficient between the study and auxiliary variables, the more effective the suggested technique is, as shown in Tables 1 and 2. This behavior assists us in selecting a population for the application of our strategy in real life.
6.2. Numerical Illustration Using Two Known Natural Populations
Two real-life datasets are considered from the work of Singh et al. [14] for numerical comparison, and we have taken into account and in order to minimize the effect of scrambling on the real data. The MSE of the estimators is calculated as follows:where “” means , , , , , and .
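A common way to compute such an empirical MSE is to average the squared deviations of the replicated estimates from the true population mean. The sketch below assumes that form; the number of replications, the use of simple random sampling without replacement, and the function names are illustrative assumptions rather than details taken from the article.

```python
import numpy as np

def empirical_mse(estimates, true_value):
    """Mean of squared deviations of replicated estimates from the true value."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean((estimates - true_value) ** 2))

def monte_carlo_mse(population, n, estimator, reps, rng):
    """Repeat SRSWOR of size n `reps` times and return the empirical MSE of
    `estimator` (a function of the sample) for the population mean."""
    population = np.asarray(population, dtype=float)
    true_mean = population.mean()
    estimates = [estimator(rng.choice(population, size=n, replace=False))
                 for _ in range(reps)]
    return empirical_mse(estimates, true_mean)
```

Any estimator that maps a sample to a number can be passed as `estimator`, so the same loop can compare several estimators on identical replications when a fixed seed is used.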
The details of the two real populations are as follows. The first dataset is based on the Census 2011 literacy rates in India. The data cover Indian states and union territories, which are divided into three strata of unequal size, i.e., , , and ; then, a random sample is drawn from each stratum of size ; , i.e., , , and . The literacy rate is spread across the major parameters overall, rural, and urban. Let , , and denote the number of literates (people) in 2001, 2011, and the total literacy rate (2011), respectively. The second dataset is based on abortion rates from the Statistical Abstract of the United States (2011) and is used to clarify the performance of our proposed estimator (the data are freely accessible through the Statistical Abstract of the United States). The population consists of states of the U.S. and is divided into three strata of unequal size, i.e., , , and ; then, a random sample is drawn from each stratum of size ; , i.e., , , . Furthermore, let , , and denote the number of abortions reported in the state of the U.S. during the years 2008, 2007, and 2005, respectively.
The results are shown in Tables 3 and 4 for different choices of the nonresponse, i.e., , and different probability levels of the sensitive variables, i.e., and 0.8.
Tables 3 and 4 display the mean squared error of the proposed estimator(s) in the presence of nonresponse and measurement errors using ORRT for different values of and . It is evident from Tables 3 and 4 that in the situation of a moderate level of the sensitive variable with different levels of nonresponse and measurement errors, our proposed estimators are more efficient. We also observed that for an increase in the value of , the MSE of all other estimators increases.
It is noted from Tables 1–4 that the proposed estimator “” is always better than the other considered estimators in terms of having a minimum MSE.
7. Conclusion
In this article, we have proposed a logarithmic-type estimator for the estimation of the population mean of a sensitive variable under stratified successive sampling in the presence of nonresponse and measurement errors simultaneously by utilizing the ORRT model. The bias and MSE are derived up to the first degree of approximation. The properties of the proposed estimator have been studied, and the results have been compared with the complete response situation. A simulation study has been performed for both natural populations and an artificially generated population. The simulation study shows that our proposed estimator is most efficient in the absence of both nonresponse and measurement errors and least efficient when nonresponse and measurement errors are both present at the same time. Thus, we recommend our suggested estimator, i.e., , for further use in practice.
Data Availability
The data used to support the findings of the study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This study benefited from the support of the project "Desarrollo de nuevos modelos y métodos matemáticos para la toma de decisiones," Havana University.