Abstract

The case-cohort design is a biased sampling method. Owing to its cost-effectiveness and theoretical significance, this design has extensive application value in many large cohort studies. Case-cohort data consist of a subcohort sampled randomly from the entire cohort together with all failures outside the subcohort. In this paper, we consider the adjustment of distorted covariates for case-cohort data under Cox’s model. Following existing covariate-adjustment methods for linear and nonlinear models, we propose estimating the distorting functions by nonparametrically regressing the distorted covariates on the distorting factors; the parameter estimators are then obtained using the estimated covariates. Consistency and asymptotic normality of the proposed estimators are established. To compute the maximum likelihood estimates of the regression coefficients in Cox’s model, a minorization-maximization (MM) algorithm is developed. Simulation studies compare the estimators obtained with undistorted, distorted, and adjusted covariates to illustrate the proposed methods.

1. Introduction

Many survival datasets feature large sample sizes and high censoring rates, and the cost of measuring all variables for every individual is very high. To reduce cost and improve the efficiency of cohort studies, Prentice [1] first proposed the case-cohort design and gave a pseudolikelihood method to estimate the regression parameters. Since the publication of the landmark article [1], the case-cohort design has been applied more and more widely, especially with the development of big data in recent years. For example, some scholars have applied this design to the life sciences [2–5] and to natural disasters [6], and others have further improved the design itself [7, 8].

In practice, the covariates of some subjects may be contaminated by other factors, so their true covariates cannot be observed; only the distorted covariates and the distorting factors are observable. For example, in the Modification of Diet in Renal Disease (MDRD) study [9, 10], the glomerular filtration rate (GFR) and serum creatinine (SCr) data are distorted by body surface area (BSA). To reveal the relationship between the two covariates, we need to adjust them by correcting for the distorting effect of BSA. Covariate-adjusted regression was introduced for situations where both the predictors and the response in a regression model are not directly observable but are contaminated by a multiplicative factor determined by the value of an observable variable [11]. For the nonlinear regression model, a method was proposed to estimate the distorting functions by nonparametrically regressing the predictors and the response on the distorting covariates [12]. The covariate-adjusted method was then extended to other models [13–15]. The data studied in these works are all completely observed. In this paper, the data under study are right-censored survival data in which only distorted covariates and distorting factors are observed, and we apply the nonparametric method of [12], a kernel smoothing approach, to estimate the distorting functions and thereby obtain adjusted values of the distorted covariates.

In this article, inspired by the ideas of Ding et al. [16] and Deng et al. [17], we construct a surrogate function with a diagonal Hessian matrix by using the convexity of the exponential function and the negative logarithm function, and then maximize this surrogate function. This is a minorization-maximization (MM) algorithm: the first M constructs a minorizing surrogate function, and the second M maximizes it.

The rest of the article is organized as follows. In Section 2, we first fit case-cohort data to Cox’s model with distorting factors and then adjust the distorted covariates by using a kernel function. In Section 3, the convergence of the adjusted covariates and the asymptotic properties of the proposed estimators are established. In Section 4, we propose an MM algorithm to implement the estimation and present a cross-validation (CV) procedure to obtain the optimal bandwidth, as well as a nonparametric bootstrap approach to obtain standard error estimates. Several simulation studies comparing the estimators without distorting factors, with distorting factors, and adjusted for distorting factors are conducted in Section 5. A real data analysis of heart failure data is given in Section 6. A discussion appears in Section 7. All proofs are given in the Appendix.

2. Design and Estimation

2.1. Model and Design

Suppose that there are independent subjects in the study cohort, where is the failure time, is the censoring time, is the observed time for the th subject, is the right-censoring indicator for failure, is the at-risk process, and is the counting process, where is an indicator function, is the end-of-study time, and is the th covariate for subject ; we focus on time-independent covariates.

Let be the unknown -dimensional vector of regression coefficients. arises from Cox’s model of the following form: where is an unspecified baseline hazard function. The corresponding partial likelihood function is widely used for inference about [18] and takes the following form:

Under the case-cohort design, let denote the subcohort, which is selected from the full cohort by simple random sampling, and let be an indicator equal to 1 if the th subject is selected into the subcohort and 0 otherwise. Let denote the case-cohort sample, which contains the subjects in the subcohort together with the cases outside the subcohort. Therefore, the observed data structure can be summarized as follows: .
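As a concrete illustration, the sampling scheme above can be sketched in Python. The function name `draw_case_cohort` and the array-based data layout are our own illustrative conventions, not from the paper:

```python
import numpy as np

def draw_case_cohort(delta, n_sub, rng=None):
    """Draw case-cohort indices: a simple random subcohort of size n_sub
    plus every failure (delta == 1) outside the subcohort.

    delta : 0/1 array of failure indicators for the full cohort.
    Returns (xi, sample_idx): xi[i] = 1 if subject i is in the subcohort,
    and sample_idx lists the indices of the whole case-cohort sample.
    """
    rng = np.random.default_rng(rng)
    n = len(delta)
    sub = rng.choice(n, size=n_sub, replace=False)   # simple random sampling
    xi = np.zeros(n, dtype=int)
    xi[sub] = 1
    in_sample = (xi == 1) | (delta == 1)             # subcohort plus all cases
    return xi, np.flatnonzero(in_sample)
```

Every subject in the returned sample is either a subcohort member or a case, matching the observed data structure described above.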

Prentice [1] proposed a pseudolikelihood approach, and the corresponding pseudolikelihood function takes the following form: where .

2.2. The Adjustment of Covariates

We now consider the situation where the covariates are distorted by some distorting factors. The observed distorted data structure is where is the th distorting factor and is the th distorted covariate. Here, is unobservable, and denotes the unknown distorting function of the observable factor .

First, we give some basic assumptions [12] as follows:
(i)
(ii)
(iii) are independent of each other.

Under these assumptions, the working-likelihood function (3) can be rewritten as the following form:

Now, our objective is to estimate the unknown coefficients based on observation (4) and function (5). From condition (A3), we have

Many scholars have proposed methods to estimate [19]. Here, we construct the estimator by using the classic kernel method, for : where , is a kernel function, and is the bandwidth.

Let denote the adjusted covariate. That is, we obtain the adjusted covariates by removing the distorting factors from the distorted covariates. Then, function (4) can be abbreviated as
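A minimal sketch of this adjustment, assuming a mean-one multiplicative distortion and using the Epanechnikov kernel from the simulation section. The function names and the Nadaraya–Watson form of the kernel regression are illustrative conventions:

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel, as used in the simulation section."""
    return 0.75 * (1.0 - t**2) * (np.abs(t) <= 1)

def adjust_covariate(x_tilde, u, h):
    """Remove a multiplicative distortion from x_tilde = psi(u) * x.

    Under the identifiability constraint E[psi(U)] = 1 and independence of
    x and U, the kernel regression g(u) = E[x_tilde | U = u] equals
    psi(u) * E[x], so the distorting function can be recovered as
    psi_hat(u) = g_hat(u) / mean(x_tilde), and the adjusted covariate is
    x_hat = x_tilde / psi_hat(u).
    """
    x_tilde = np.asarray(x_tilde, float)
    u = np.asarray(u, float)
    W = epanechnikov((u[:, None] - u[None, :]) / h)   # W[i, j] = K((u_i - u_j)/h)
    g_hat = (W @ x_tilde) / W.sum(axis=1)             # Nadaraya-Watson fit at each u_i
    psi_hat = g_hat / x_tilde.mean()
    return x_tilde / psi_hat
```

Dividing by the estimated distorting function removes the distorting factor from each observation, which is exactly the adjustment described above.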

Under the case-cohort design, the estimator based on the adjusted covariates is defined as follows:

3. Consistency and Asymptotic Normality

To present the asymptotic results, we introduce some notation. Let denote the true value of . For , define where for a vector .

Define

We impose the following conditions throughout the paper. Note that the convergence properties involve and denotes the Euclidean norm in these conditions:
(A1) .
(A2) There exists some such that .
(A3) There exist a neighborhood of and functions , , defined on which satisfy:
(i) .
(ii) , , are continuous in on uniformly in ; , , are bounded on , and is bounded away from zero; and and for all , .
(iii) The matrix is positive definite, where .
(B1) for some .
(B2) The sequence of distributions of is tight on the space of left-continuous functions with right-hand limits equipped with the product Skorohod topology.
(B3) There exist functions , , defined on which satisfy:
(i) .
(ii) , , are continuous in on uniformly in , and , , are bounded on .
(iii) , , are bounded sequences.
(B4) For ,
(C) All , and and are greater than a positive constant. They are differentiable, and their derivatives satisfy a smoothness condition near the origin: there exist a neighborhood of the origin and a constant such that, for any , .
(D1) The continuous kernel function satisfies the following properties:
(i) The support of is the interval [−1, 1].
(ii) is symmetric about zero.
(iii) and .
(D2) As , the bandwidth is in the range from to .
(E) is bounded away from 0, .
These conditions are mild and can be satisfied in most circumstances.

Conditions (A1)–(A3) are the regularity conditions for the asymptotic results of Cox’s model [20]. Conditions (B1)–(B4) are the regularity conditions under the case-cohort design [21]. Condition (C) concerns the smoothness of the function and the density function of . Conditions (D1)-(D2) are commonly assumed for the root-n consistency of kernel-based estimation [22, 23]. Condition (E) is specific to this problem [12, 22].

Under these conditions, we have the following result with detailed proofs given in Appendix.

Theorem 1 (asymptotic properties of ). Under conditions (A1)–(A3), (B1)–(B4), (C), (D1)-(D2), and (E), converges in probability to , and where the asymptotic variance matrix is , with

4. The Implementation of Algorithm

In Section 3, we established the asymptotic properties of the adjusted estimator . To obtain by solving (9), one can use the Newton–Raphson algorithm. However, in the course of the computation, we find that the Hessian matrix is complicated and often nearly singular. We therefore propose an MM algorithm [16] to compute for the adjustment of distorted covariates under the case-cohort design.

4.1. Construct Surrogate

The key of the MM algorithm is to construct a surrogate function. Combining the ideas of [16], we derive the surrogate function for by using the convexity of the exponential function and the negative logarithm function : where is the th approximation to the adjusted maximum likelihood estimator defined in (9), and with if , where , , and are the th components of , , and , respectively.

With the surrogate function (16), we can transform the optimization problem (9) into the following problem:

Following the construction process of the surrogate function (16), we can immediately conclude that where if and only if .

Then, with , we have where is strictly ascending when .

With the surrogate function (16), we obtain its derivatives with respect to when , as follows.

The score vector:

The negative Hessian matrix: where

Since this Hessian matrix is diagonal, the Newton–Raphson update becomes simple to compute and the matrix is easily inverted.
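The update above can be sketched as follows. This is a simplified illustration that uses the full-cohort partial likelihood rather than the case-cohort pseudolikelihood (the case-cohort version would replace each risk set with its sampled counterpart), with Jensen weights proportional to |x_jk| as one common choice; all function names are ours:

```python
import numpy as np

def partial_loglik(times, delta, X, beta):
    """Cox partial log-likelihood (no ties assumed)."""
    eta = X @ beta
    ll = 0.0
    for i in np.flatnonzero(delta == 1):
        risk = times >= times[i]
        ll += eta[i] - np.log(np.exp(eta[risk]).sum())
    return ll

def cox_mm(times, delta, X, n_iter=500, tol=1e-7):
    """MM iteration with a diagonal surrogate Hessian.

    exp(x'beta) is minorized through Jensen's inequality with weights
    a_jk = |x_jk| / sum_k |x_jk|, and -log(.) through its supporting line,
    which makes the surrogate separable in the coordinates of beta; each
    coordinate is then updated by a one-dimensional Newton step.
    """
    n, p = X.shape
    s = np.abs(X).sum(axis=1)
    s[s == 0] = 1.0
    C = np.abs(X) * s[:, None]      # x_jk^2 / a_jk = |x_jk| * s_j
    beta = np.zeros(p)
    for _ in range(n_iter):
        w = np.exp(X @ beta)
        score = np.zeros(p)
        d = np.zeros(p)
        for i in np.flatnonzero(delta == 1):
            risk = times >= times[i]
            S = w[risk].sum()
            score += X[i] - (w[risk, None] * X[risk]).sum(axis=0) / S
            d += (w[risk, None] * C[risk]).sum(axis=0) / S
        d[d == 0] = 1.0
        step = score / d            # diagonal "Hessian": no matrix inversion
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

Because the surrogate curvature dominates the diagonal of the true information matrix, the resulting steps are conservative, which is what trades a little speed for the numerical stability that motivates the MM approach here.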

4.2. Cross-Validation and Bootstrap
4.2.1. Cross-Validation

In the process of adjusting the distorted covariates, we use the cross-validation (CV) method to obtain the optimal bandwidth for the kernel smoothing. The idea of the CV method is to compute the prediction error between the observed responses and the responses estimated from the model fitted with the remaining observations. The specific form of the CV function in this article is given as follows: where is the estimate computed with the th element removed, .

Then, the optimal bandwidth is defined as follows:

4.2.2. Bootstrap

Bootstrap methods have been widely used since first proposed [24–26]. We adopt this approach to calculate the standard error of . The basic idea of the nonparametric bootstrap is to construct an empirical distribution function by repeatedly sampling from the observed data. The steps are summarized as follows:
Step 1: are the adjusted observations under the case-cohort design, where . We use the samples with as the failure cases and the samples with as the censored subjects to constitute the bootstrap case-cohort sample .
Step 2: a bootstrap replication is obtained based on .
Step 3: repeating Step 1 and Step 2 times, we obtain bootstrap estimates . The standard error of the th component of can then be estimated by the following form:
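The three steps can be sketched generically, with the estimator passed in as a callable. The stratified resampling of failures and censored subjects mirrors Step 1; the function name and data layout are illustrative:

```python
import numpy as np

def bootstrap_se(data, delta, estimator, B=200, rng=None):
    """Nonparametric bootstrap standard error, resampling failures and
    censored subjects separately as in the steps above.

    data : array of observations (one row per subject),
    delta : 0/1 failure indicators,
    estimator : callable mapping a data array to a parameter estimate.
    """
    rng = np.random.default_rng(rng)
    cases = np.flatnonzero(delta == 1)
    cens = np.flatnonzero(delta == 0)
    reps = []
    for _ in range(B):
        # Step 1: resample cases and censored subjects separately
        idx = np.concatenate([rng.choice(cases, size=len(cases), replace=True),
                              rng.choice(cens, size=len(cens), replace=True)])
        # Step 2: one bootstrap replication of the estimator
        reps.append(estimator(data[idx]))
    # Step 3: sample standard deviation across the B replications
    return np.asarray(reps).std(axis=0, ddof=1)
```

In practice the callable would be the MM fit of Section 4.1; any estimator that maps a resampled dataset to a coefficient vector fits this interface.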

5. Simulation

As mentioned above, if the distorting factors on the covariates are ignored, the inference results may be misleading [11, 12]. In this section, we construct several simulation studies to compare three estimators of the regression coefficients in Cox’s model under the case-cohort design: denotes the estimator calculated from the covariates without distorting factors, denotes the estimator calculated from the covariates with distorting factors, and denotes the estimator calculated from the adjusted covariates with distorting factors. Given (), we consider the hazard function of the failure time as follows:

We set the true values of the parameters to (–0.25, 0.5), (–0.25, 0.693), and (0, 0.5). is generated from a normal distribution with mean and variance 1.44, and is generated from a uniform distribution . The distorting factor is generated from a uniform distribution . The baseline hazard function is set to 1 and 2, so the failure time follows an exponential distribution with failure rate and scale parameter , respectively. The censoring time is generated from a uniform distribution , with chosen to achieve censoring rates of , 85%, and 90%.
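One replicate of this design can be generated by inverse-transform sampling, since a constant baseline hazard makes the failure time exponential with rate lam0*exp(x'beta). The specific mean and range below (mean 0 for the normal covariate, U(0, 1) for the second) are assumptions for illustration, as the original values are not recoverable from the text:

```python
import numpy as np

def simulate_cohort(n, beta, lam0, c_max, rng=None):
    """Generate one full cohort from the proportional hazards model with a
    constant baseline hazard lam0: T ~ Exponential(rate = lam0 * exp(x'beta)),
    censoring C ~ Uniform(0, c_max), observed time = min(T, C)."""
    rng = np.random.default_rng(rng)
    x1 = rng.normal(0.0, 1.2, n)        # variance 1.44 as in the text; mean assumed 0
    x2 = rng.uniform(0.0, 1.0, n)       # range assumed
    X = np.column_stack([x1, x2])
    rate = lam0 * np.exp(X @ np.asarray(beta))
    T = rng.exponential(1.0 / rate)     # exponential with scale = 1/rate
    C = rng.uniform(0.0, c_max, n)
    obs = np.minimum(T, C)
    delta = (T <= C).astype(int)
    return X, obs, delta
```

Shrinking `c_max` raises the censoring rate, which is how the target rates of 85% and 90% would be tuned in this setup.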

Under the case-cohort design, 1000 full-cohort samples are generated from function (27), and the corresponding distorting factors are generated. The two parts are merged to constitute the observed sample. Then, we randomly select subcohorts of size , 200, and 300 from the full cohort.

We compare the sample bias (Bias), the sample mean squared error (SMSE), the sample standard deviation of the estimates (SD), the mean of the estimated standard errors (SE), and the coverage probability of the 95% nominal confidence interval (CP) of the three estimators , , and based on 1000 independent simulated datasets, applying the nonparametric bootstrap approach of Section 4.2.2 with . Only the covariate is set to be distorted and adjusted. The distribution of the distorting factor is set to . The kernel function is the Epanechnikov kernel [27] . The optimal bandwidth is selected by the method in Section 4.2.1, and the iterations stop when . The simulation results are summarized in Table 1 with , Table 2 with , and Table 3 with , respectively.

According to the simulation results, the estimators and of are essentially unbiased, the three values SMSE, SD, and SE are all close, and the CP values are reasonable. In contrast, the estimators of are all biased, the difference between the SMSE and SD values is large, and the CP values are low. These facts confirm that ignoring the distorting factors may lead to wrong results and conclusions, and they also show that the proposed covariate-adjustment method is effective. For example, for the estimators and with , , , and , the bias is –0.1178, the SMSE values are 0.1521 and 0.1095, and the CP values are 0.735 and 0.947, respectively. Furthermore, we conclude that the estimators of are all reasonable.

6. Real Data Analysis

The dataset contains the medical records of heart failure patients and comes from the Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records). The original version of the dataset was collected by Tanvir Ahmad et al. (Government College University, Faisalabad, Pakistan) and made available by them on FigShare under the Attribution 4.0 International license in July 2017. The current version of the dataset was elaborated by Davide Chicco et al. and donated to the University of California Irvine Machine Learning Repository under the same Attribution 4.0 International license in January 2020.

This dataset contains the medical records of 299 patients who had heart failure, collected during their follow-up period; each patient profile has 13 clinical features, including age, anaemia, creatinine phosphokinase, diabetes, ejection fraction, high blood pressure, platelets, serum creatinine, serum sodium, sex, smoking, time, and death event. The goal is to assess the association between the clinical features and heart failure and its occurrence time. These are survival data, which meet the requirements of this article. Death event and time are regarded as heart failure and its occurrence time, respectively. For the remaining 11 clinical features, pairwise correlation analysis shows that ejection fraction is significantly related to serum sodium, and anaemia to creatinine phosphokinase. Referring to relevant medical knowledge, we know that ejection fraction may distort serum sodium and, in the same way, anaemia may distort creatinine phosphokinase. The main purpose of our research is to evaluate the association between these clinical features after serum sodium and creatinine phosphokinase are smoothed by ejection fraction and anaemia.

There are 96 deaths in these data, and the censoring rate is approximately . Owing to the different scales of the covariates, we first standardize each covariate. Then, we set the standardized age as , the standardized creatinine phosphokinase as , the standardized diabetes as , the standardized high blood pressure as , the standardized platelets as , the standardized serum creatinine as , the standardized serum sodium as , the standardized sex as , the standardized smoking as , the standardized ejection fraction as a distorting factor , and the standardized anaemia as a distorting factor , where and indicate that the covariate is distorted.

The case-cohort sample is composed of a subcohort of 100 subjects randomly selected from the whole dataset together with the deceased subjects outside the subcohort. We perform regression analysis on the sample using the proportional hazards function:

Because the size of this dataset is small, to improve the reliability of the sampling method, the sampling process is repeated 100 times. The average of the 100 distorted estimates is .

Then, for the same 100 case-cohort samples, we obtain the adjusted covariates and by applying the proposed kernel smoothing method to and , and replace and with and , respectively. On this basis, the average of the adjusted estimates is .

The kernel and bandwidth selector are the same as those in Section 5. After the covariates are adjusted, the estimator for creatinine phosphokinase changes from 1.0344 to 14.0226, and the estimator for serum sodium changes from to 11.0269; the estimators for the other covariates change little after adjustment. The result is medically reasonable and shows that the proposed method is effective for real data.

In practice, the covariates are measured after case-cohort sampling to reduce cost and improve efficiency. In this real dataset, the sample size is small and each covariate has already been measured. To reduce error, we repeat the sampling 100 times and take the average value as the estimator. For example, based on the 10th case-cohort sample from the original data, the distorted estimator is ; after the adjustment of covariates in the same sample, the adjusted estimator is . The estimators based on a single case-cohort sample are thus quite different from the average of the estimators, but the trends are the same. Therefore, for real data with a large sample size, we only need to draw one case-cohort sample and measure its covariates.

7. Conclusion

We studied estimation in Cox’s model with covariates adjusted by a kernel-based nonparametric method under the case-cohort design. Consistency and asymptotic normality of the proposed estimators are derived. We use an MM algorithm to compute the regression coefficients, in which a surrogate function with a diagonal Hessian matrix is constructed to overcome the computational difficulty of the Newton–Raphson algorithm. Simulation and real data studies suggest that the case-cohort design can reduce the cost of cohort studies and that the proposed adjustment for distorted covariates performs well.

Here, the adjustment for survival data is discussed under Cox’s model. The method can be extended to other models, such as the accelerated failure time model [28], the additive-multiplicative model [29], and the accelerated hazards model [30]. The cost-effective case-cohort design is mainly applied to data with a high censoring rate; however, in the sampling process, some censored subjects are never measured. To improve efficiency, our future work includes developing inference for estimation with adjusted covariates under the generalized case-cohort design [31] and outcome-dependent sampling designs.

Appendix

A. Proof of Asymptotic Properties

We start with a lemma, which is used frequently in the proofs. We then proceed to the proof of Theorem 1, concerning consistency and asymptotic normality.

Lemma A.1. Let be a continuous function, satisfying . Assume that conditions (C), (D1)-(D2), and (E) hold. The following asymptotic representation holds:

Proof of Lemma A.1. Recall that , as given in Section 2. Simple calculations give the following expression: Thus, it suffices to deal with the equation . It holds that
First, we analyze . The proof is divided into three steps. Denote by , , and the quantities:
Step 1: show that
Applying the equation , we have where
Equation (A.5) can be concluded by proving
According to Theorem 1 of [32], Lemma 3 of [23], condition (D2), and the Law of Large Numbers (LLN) for , we follow the arguments used in [23] to bound the higher-order term:
A similar argument yields (A.8) for .
Step 2: show that
To analyze , for ease of understanding, we use to express the density function of and define as follows:
We need to work on
The above sum can be treated as a U-statistic with a varying kernel with bandwidth by applying again the arguments used by Davis and Fang [23]. Then, we have the asymptotic representation [4] with being the density function of and the density of the variable . Note that and are independent, and and are independent. According to conditions (C), (D1)-(D2), and (E), we obtain
can be handled similarly. Combining (A.5) and (A.8), we have (7):
Step 3: show that
From (A.16) and the definition of in (A.3), we derive that where the last equality is obtained by applying the LLN to and .
Finally, together with the asymptotic representations of and in (A.15) and (A.17), the desired result follows.

Lemma A.2. Define Assume that conditions (C), (D1)-(D2), and (E) hold; then converges to asymptotically.

Proof. For , according to Lemma A.1, According to the Law of Large Numbers, (A.20) converges to 0 in probability.
Next, we prove that It suffices to show Let Because is a differentiable function with first derivative , it follows from (A.17) that By (16) and (A.21), asymptotically converges to .
Under the case-cohort sampling design, let denote the subcohort sample set, denote the number of samples in , let A denote the case-cohort sample, and let denote the number of samples in A. Under the case-cohort sampling design, the covariates are not completely observed, so the likelihood function cannot be expressed as (2). According to the pseudolikelihood function given in Prentice [1], the pseudolikelihood takes the following form: where the risk set . Note that the set is nonempty only when an individual fails at time , and the corresponding log-likelihood function is

Proof of Theorem 1. To derive the asymptotic properties of the proposed estimator , we first introduce the following working likelihood function: which differs from the pseudolikelihood function in (A.16) by using the index set instead of . The corresponding estimator based on is defined as Following the argument in [21], we can show that the estimator is asymptotically equivalent to the proposed estimator . That is, the asymptotic properties of can be obtained by proving the desired asymptotic behavior of .
Note that . Mimicking the discussion in Lemma 1 of [21], we obtain that Define where converges in probability to the same limit as for . Therefore, converges uniformly to which is a continuous and concave function of with a unique maximum at [20, 21]. That is, with equality if and only if .
Suppose that does not converge to on a set of positive probability. Then there exists a subsequence of which converges to some not equal to . Since is the maximizer, we have . By the uniform convergence and the continuity of the limit, we obtain the following inequality: This inequality contradicts (A.12), so converges to in probability.
From the basic assumptions (i), (ii), and (iii), we derive that and from the above equation, we have Under conditions (A1)–(A3), (B1)–(B4), (C), (D1)-(D2), and (E), and by arguments similar to those of Self and Prentice [21], we obtain the following results: Because converges to in probability, by Slutsky’s theorem and (A.12), we obtain where the asymptotic variance matrix is with

Data Availability

The simulation data used in this paper are randomly generated by R language software to support the proposed method of this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported in part by the Scientific Research Foundation of Shandong University of Science and Technology for Recruited Talents, Shandong University of Science and Technology (2019RCJJ021 to L. D.), and National Natural Science Foundation of China (71501114 to Q. L.).