Abstract
We introduce a new method for estimating the nonparametric regression curve for longitudinal data. This method combines two estimators: truncated spline and Fourier series. This estimation is completed by minimizing the penalized weighted least squares and weighted least squares. This paper also provides the properties of the new mixed estimator, which are biased and linear in the observations. The best model is selected using the smallest value of generalized cross-validation. The performance of the new method is demonstrated by a simulation study with a variety of time points. Then, the proposed approach is applied to a stroke patient dataset. The results show that simulated data and real data yield consistent findings.
1. Introduction
Nonparametric regression is a statistical method used if the data show an unknown regression curve. The strength of this method is its great flexibility since the data are used to find the form of its estimated regression curve without being influenced by subjective judgements [1]. Some estimators used are spline, Fourier series, kernel, and polynomial.
The truncated spline is a function where there is a change in the behaviour of the curve in certain subintervals. The spline is one of the popular estimators in nonparametric regression because it has an excellent visual interpretation. Montoya et al. [2] conducted a simulation study to compare knot selection methods in a penalized regression spline model. Next, the Fourier series is much used to describe curves that present sine and cosine waves. This estimator is commonly used when the data have the characteristics of a periodicity. Bilodeau [3] estimated additive components with functions consisting of truncated Fourier cosine series, using penalized least squares to obtain the coefficients. Furthermore, Tripena [4] developed a Fourier series estimator for bi-response nonparametric regression. Gong and Gao [5] discussed a nonparametric kernel estimation of tax policy’s impact on the demand for private health insurance in Australia.
A single estimator in nonparametric regression is commonly used but does not limit the possibility of developing into a mixed estimator. There are many cases where each predictor variable has a different pattern. Therefore, applying only a single estimator can make the regression model’s estimation incorrect and produce a large error. Some previous studies that have been mentioned have limitations in that they can only be used for cross-sectional data. To overcome this limitation, a longitudinal data model has been developed. The use of longitudinal data has increased in recent years because it can be applied in various fields. Longitudinal data are data obtained in observations of independent subjects, where each subject is observed repeatedly at a certain period. This has the advantage of being able to observe changes based on time [6].
In [7], the mixed estimator is limited in the sample size it can treat. The present study extends the use of the mixed truncated spline and Fourier series (MTSFS) model to larger sample sizes and various time point designs. Some properties of the new mixed estimator will also be provided. We use generalized cross-validation (GCV) to determine the best model from various knots, oscillations, and smoothing parameters. A simulation study and real data are employed to demonstrate the performance of the proposed method. The case study includes the factors that affect the Glasgow Coma Scale (GCS) on stroke patients, i.e., body temperature (BT) and pulse rate (PR).
The rest of this paper is divided into several main topics. Section 2 introduces the materials and methods used in this study. Section 3 consists of five subsections: the theory of the new mixed estimator, its properties, the selection of the best model, the simulation study, and a case study. The details of the new mixed estimator are presented in Section 3.1, followed by its properties in Section 3.2. Section 3.3 presents how to select the optimum knot point, oscillation parameter, and smoothing parameter to obtain the best model. The results of a simulation study of the proposed method are presented in Section 3.4, followed by an application to real data in Section 3.5. Section 4 concludes the paper.
2. Materials and Methods
Suppose is the response variable and and are the corresponding predictor variables with sample size of subjects , with each subject having observations . The relations between the response and predictors for the nonparametric regression model for longitudinal data arewhere is the regression curve and is a random error. Assume that the form of the regression curve is unknown and additive, so thatwhere is the truncated spline component and is the Fourier series component.
The function , is an approximation using truncated spline functions and , is that using Fourier series. The estimator is obtained through a two-step optimization, i.e., penalized weighted least squares (PWLS) and weighted least squares (WLS).
3. Results and Discussion
3.1. MTSFS Model for Longitudinal Data
Some lemmas and theorems are provided to obtain an MTSFS model for longitudinal data. Lemma 1 presents the goodness of fit of the Fourier series component, Lemma 2 presents the penalty component, and Lemma 3 gives the solution for the truncated spline component. Theorem 1 presents the PWLS optimization, and Theorem 2 presents WLS optimization. The results are summarized as follows.
Lemma 1. If the Fourier series component in equation (2) is given by , then the goodness of fit can be formulated as follows:where , is the weighting matrix, , and
Proof. The function is assumed to be unknown and contained in the space . The function is approximated using Fourier series with a trend, modified from Bilodeau [3]:Equation (5) can be written in matrix formwhere is a matrix and is a vector.
The nonparametric model in equation (2) can be rewritten asThen, the goodness of fit for equation (7) can be written asThe model in equation (8) can be written in matrix form:
Lemma 2. If the penalty component is given,thenwhere
Proof. Regarding equation (5), we defineConsequently,LetThus,The penalty in equation (16) can be rewritten in matrix formRegarding the goodness of fit in Lemma 1 and the penalty component in Lemma 2, we obtain the PWLS optimization asEquation (18) can be rewritten in the form
Theorem 1. If the goodness of fit is as in Lemma 1 and the penalty component is as in Lemma 2, then the Fourier series component obtained by minimizing the PWLS in equation (18) iswhere and .
Proof. The optimization in equation (18) can be written asEquation (21) can be rewritten in the formWe obtainThe completion of the optimization in (23) is obtained by taking the partial derivative of with respect to and setting it equal to zero, i.e.,giving the resultBy substituting (25) into (6), we obtainThe nonparametric regression model in equation (7) can be written aswhere and .
Lemma 3. If the truncated spline component in equation (2) is , then the WLS iswhere .
Proof. The function is a linear truncated spline function with s knot for each :where .
Equation (29) can be rewritten in matrix formwhereThe MTSFS model for longitudinal data in equation (1) can be written in the formBy substituting (26) into (32), we obtainTo obtain the estimator of truncated spline component, equation (33) can be written asEquation (34) can be rewritten asThus,As a consequence, the WLS is given by
Theorem 2. Suppose the WLS is as in Lemma 3. Then, the mixed estimator obtained by minimizing WLS iswhere and
Proof. The WLS optimization in Lemma 3 can be written asWe obtainThe complete optimization is obtained by setting equal to zero the partial derivative of with respect to , that is,giving the resultBy substituting into equation (30), we obtainWe obtain by substituting (43) into (25):As a consequence,By substituting and into the MTSFS model for longitudinal data, we obtainwhere , , and .
3.2. The Properties of MTSFS Model for Longitudinal Data
This section provides the MTSFS model’s properties for longitudinal data, i.e., it is biased and linear in observations. The mixed estimator is biased, as proved by
Assuming that ,
The result in equation (48) showed that the mixed estimator is biased because . Even though the mixed estimator is biased, the mixed estimator is linear in observations proved by equation (49) below.
3.3. The Selection of the Optimal Number of Knot Points, Oscillation Parameter, and Smoothing Parameter
The MTSFS model is very dependent on the number of the knot points, oscillation parameter, and smoothing parameter. The best model is obtained by using the optimal values of these parameters. In semiparametric and nonparametric regression, there are several methods to obtain the best regression model. One of the popular methods is generalized cross-validation (GCV).
Craven and Wahba [8] introduced a modified cross-validation (CV) method called GCV. This method was developed to overcome the shortcomings of complex CV calculations. GCV has several advantages: it is simple and efficient in calculation, invariant to transformation, and does not require information about variant. In addition, GCV has better asymptotic properties than other methods [9, 10].
In this study, the value of GCV is a criterion that can be used to determine the best model from variety of knots, oscillations, and smoothing parameters. The criterion for selecting the best model includes taking the model with the lowest GCV value. The modified GCV method of the MTSFS model for longitudinal data is stated as follows.
The optimal number of knot points, oscillation, and smoothing parameters is obtained by minimizing .
3.4. Simulation Study
This section presents the use of the MTSFS model on simulation data to see the performance of the estimators obtained. The simulation was done with two predictor variables, sample size n = 20, and varied number of time points, . We considered 20 models for each subject generated from a formula that contains two different functions representing the truncated spline and Fourier series pattern. A polynomial function is used to present the truncated spline, while trigonometric functions are used to present the Fourier series. The predictors are generated from , and the random errors are generated from a multivariate normal distribution.
The weight matrix is user-specified [6], and in this paper, we use so that each of the measurements is treated equally. In this study, we use two numbers of knots ( and ) and several oscillation parameters (). These simulation studies based on GCV are presented in Table 1.
Based on Table 1, for the model with , it appears that the smallest GCV value occurs when the oscillation for and . The same is also seen at and . The smallest GCV occurs when the oscillation for both knot points.
In general, the larger the number of time points, the smaller the resulting value of the GCV. Furthermore, the greater the number of knot points, the larger the GCV. Other results show that a larger oscillation is not guaranteed to produce large or small values of the GCV. So, it is necessary to choose the optimum oscillation that produces the smallest GCV.
3.5. Application to Real Data
The MTSFS model for longitudinal data obtained has been applied to the stroke patient dataset. These data were taken after an initial study of GCS on stroke patients and the factors that influenced it. The pattern of the relations between the predictors and response followed the characteristics of a truncated spline and a Fourier series. There is a predictor with the form of a truncated spline, which is changing in certain subintervals. The other predictor has the form of a Fourier series, with a repeating pattern.
Stroke is a noncommunicable disease. The number of those who suffer from it continues to increase in the world. In 2013, stroke was the second leading cause of death globally (11.8% of all deaths) after ischemic heart disease (14.8% of all deaths). Besides, stroke is the third cause of disability, namely, 4.5% of all causes of disability [11]. Based on the Global Burden of Disease (GBD) Study 2016, the estimated global lifetime risk of stroke for those aged 25 years or older almost reached 25% [12]. The global prevalence of stroke in 2017 was 104.2 million people. In addition, age-standardized stroke prevalence rates were highest in Eastern Europe, North Africa, the Middle East, and Central and East Asia. Several countries in Europe, North Africa, and Central Asia have the highest rates of stroke mortality. Indonesia is one of the countries with the highest death rate due to stroke. According to the Indonesia Basic Health Survey 2007, stroke was the highest cause of death (15.4%) [13]. The prevalence of stroke in Indonesia in 2013 was 7% and increased to 10.9%, according to the Indonesia Basic Health Survey 2018. Furthermore, the stroke prevalence for those aged 15 years or older was 10.85%.
Stroke patients often experience head injuries due to falls. Trauma or head injury requires supervision to ensure further medical treatment. GCS was initially used to assess consciousness level after head injury and is now used in the medical field of both acute and trauma patients. According to Champion [14], trauma severity assessment is used to measure the severity of the injury and describe the patient’s case's severity. An injury to the head will activate the immune response and release of interleukins by activating white blood cells and increasing the temperature. Injury to the hypothalamus also increases interference with body temperature regulation [15]. Fever is generally defined as an increase in body temperature above average and has been identified as one of the causes of worsening head injury [16]. Based on previous studies, the pulse was found to be dysfunctional in patients with severe brain injury. Changes in pulse rate are associated with increased mortality in brain injury patients. Therefore, changes in patients’ pulse rates with severe brain injury should be carefully observed [17].
This article uses GCS as a response variable with 20 stroke patients () with 14-day () measurements for each subject. Simultaneously, the predictor variables are body temperature (BT) and pulse rate (PR). The partial relationship between GCS and each predictor variable is illustrated in Figure 1. The plot in Figure 1 shows a different pattern for each predictor. For this reason, we propose the MTSFS model for longitudinal data. Body temperature will be approximated by a truncated spline estimator, while the pulse rate will be approximated by a Fourier series estimator.

(a)

(b)
Based on Table 2, some scenarios were performed to compare the effectiveness of the proposed model, i.e., using a single estimator and the mixed estimator. By using the GCV formula in Section 3.3, we selected the best model according to the minimum GCV criterion. As shown in Table 2, modeling GCS in stroke patients produces the smallest GCV value of 45.3517, achieved at when the number of knots is and number of oscillation is with . This study revealed that the MTSFS model is better than using a single estimator. This model yields a RMSE = 2.43. The important result was that the number of knots and the number of the oscillation parameters of the best model in the simulation study are in line with those for the case study. Similar to the simulation study, the best model had one knot and the oscillation parameter H equal to three.
Figure 2 presents a comparison between the response variable (red line) and fitted values (blue line) using the MTSFS model. From the graph, we can see that some of the fitted values resemble the pattern of the real data; although some do not, there are deviations. Despite its limitations, the present study certainly adds to our understanding of the theory of the new mixed estimator in nonparametric regression for longitudinal data.

4. Conclusions
This article presented a new mixed estimator for nonparametric regression models for longitudinal data, combining the truncated spline estimator and Fourier series to obtain better estimation results if the predictors have different data patterns. A new two-step method for estimating the parameters used penalized weighted least squares and weighted least squares.
In both the simulation study and case study, the best model was selected by using the minimum generalized cross-validation (GCV). A higher number of knots or oscillation parameter does not produce a high GCV. Therefore, several combinations of knots and oscillations had to be tried to determine the best model. Interestingly, the number of knots and oscillations of the best model is the same for the simulation study as for the case study with real data: one knot and three oscillations.
The proposed model estimator is a useful alternative for estimating nonparametric regression curves for longitudinal data. The presented results suggest that the next study can use different weighting matrices and more knots, so that researchers can compare the results for improving the model performance. Another possible line of further research could be simulation studies using another function to evaluate the performance of the proposed model.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
The authors are thankful for financial support via the PMDSU Grant (Batch III) from the Ministry of Education and Culture of the Republic of Indonesia.