Abstract
We study some mathematical properties of the beta generalized half-normal distribution recently proposed by Pescim et al. (2010). This model is quite flexible for analyzing positive real data since it contains as special models the half-normal, exponentiated half-normal, and generalized half-normal distributions. We provide a useful power series for the quantile function. Some new explicit expressions are derived for the mean deviations, Bonferroni and Lorenz curves, reliability, and entropy. We demonstrate that the density function of the beta generalized half-normal order statistics can be expressed as a mixture of generalized half-normal densities. We obtain two closed-form expressions for their moments and other statistical measures. The method of maximum likelihood is used to estimate the model parameters censored data. The beta generalized half-normal model is modified to cope with long-term survivors may be present in the data. The usefulness of this distribution is illustrated in the analysis of four real data sets.
1. Introduction
Cooray and Ananda [1] pioneered the generalized half-normal (GHN) distribution with shape parameter and scale parameter defined by the cumulative distribution function (cdf) where the standard normal cdf and the error function are given by
Following an idea due to Eugene et al. [2], Pescim et al. [3] proposed the beta generalized half-normal (BGHN) distribution, which seems to be superior over the GHN model for some applications. The justification for the practicability of this model is based on the fatigue crack growth under variable stress or cyclic load. In this paper, we study several mathematical properties of the BGHN model with the hope that it will attract wider applications in reliability, engineering and in other areas of research.
The four-parameter BGHN cdf is defined from (1) by (for ), where is the beta function, is the incomplete beta function ratio, and and are two additional shape parameters.
The probability density function (pdf) and the hazard rate function (hrf) corresponding to (3) are respectively. Hereafter, a random variable with pdf (4) is denoted by .
Pescim et al. [3] demonstrated that the cdf and pdf of can be expressed as infinite power series of the GHN cumulative distribution. Here, all expansions in power series are around the point zero. If is a real noninteger, we can expand the binomial term in (3) to obtain where .
The pdf corresponding to (6) can be expressed as If is an integer, (7) provides the BGHN density function as an infinite power series of the GHN cumulative distribution. If is an integer, the index in the previous sum stops at . Otherwise, if is a real noninteger, we can expand as follows: where Hence, whose coefficients are
The BGHN density function (4) allows for greater flexibility of its tails and can be widely applied in many areas of engineering and biology. We study the mathematical properties of this distribution because it extends some important distributions previously considered in the literature. In fact, the GHN model, with parameters and , is clearly an important special case for , with a continuous crossover towards models with different shapes, for example, a particular combination of skewness and kurtosis. The BGHN distribution also contains the exponentiated generalized half-normal (EGHN) and half-normal (HN) distributions as submodels for and , respectively. Moreover, while the transformation (3) is not analytically tractable in the general case, the formulas related to the BGHN distribution turn out manageable, and with the use of modern computer resources with analytic and numerical capabilities, may turn into adequate tools comprising the arsenal of applied statisticians.
The paper is outlined as follows. We derive an expansion for the quantile function in Section 2. Some statistical measures for the BGHN distribution such as moments, generating function, mean deviations, Rényi entropy and reliability are studied in Section 3. The computational issues relating to the infinite series for structural properties of the BGHN distribution are discussed in Section 4. In Section 5, we derive algebraic expressions for the moments, moment generating function (mgf), mean deviations and Rényi entropy for the order statistics. The estimation by maximum likelihood—including the case of censoring—is presented in Section 6. In Section 7, we propose a BGHN mixture model for survival data with long-term survivors. Section 8 illustrates the importance of the BGHN distribution applied to four real data sets. Finally, concluding remarks are given in Section 9.
2. Power Series for the Quantile Function
Power series methods are at the heart of many aspects of applied mathematics and statistics. Quantile functions are in widespread use in probability distributions and general statistics and often find representations in terms of power series. The quantile function for a probability distribution has many uses in both the theory and application of probability. It may be used to generate values of a random variable having as its distributions function. This fact serves as the basis of a method for simulating a sample from an arbitrary distribution with the aid of a uniform random number generator.
The quantile function of , say , can be obtained by inverting the cumulative function (3). Now, we provide a power series expansion for that can be useful to determine some mathematical measures of the BGHN distribution. First, an expansion for the inverse of the incomplete beta function can be found in wolfram website (http://functions.wolfram.com/06.23.06.0004.01) as where is the beta quantile function, for and , , . The coefficients s for can be obtained from a cubic recursion of the form where if and if . In the last equation, we note that the quadratic term only contributes for .
Following Steinbrecher [4], the quantile function of the standard normal distribution, say , can be expanded as where and the s can be calculated from and Here, , , .
The function can be expressed as a power series given by where and the quantities are defined from the coefficients in (14) by for and for .
By inverting the normal cumulative function in , and using (16), we can express the quantile function of in terms of as Replacing by in last equation, we obtain where . Using the same steps of (8), we can write where .
We use throughout an equation of Gradshteyn and Ryzhik [5] for a power series raised to a positive integer where the coefficients (for ) are easily obtained from the recurrence equation and . From (19) and (20), we have where for , for and . By inserting (12) in (22), we obtain From (20), it follows , where the quantities are determined recursively from and (for ). Finally, we obtain where . Equation (24) gives the BGHN quantile function as a power series and represents the main result of this section.
3. BGHN Properties
3.1. Moments and Generating Function
Here, we provide new expressions for the moments and mgf of based upon the power series for its quantile function as alternative results form those obtained by Pescim et al. [3].
We can write from (24) and then using (20), we obtain where the quantities (for ) are easily determined from the recurrence equation with .
We now provide a new alternative representation for the mgf of , say , based on the quantile power series (24). We can write We expand the exponential function and use the same algebra that leads to (26) and then Equations (26) and (30) are the main results of this section.
3.2. Mean and Median Deviations
The amount of scatter in a population is evidently measured to some extent by the mean deviations in relation to the mean and the median defined by respectively, where and denotes the median. Here, is calculated as the solution of the nonlinear equation . We define , which is determined below. The measures and can be written in terms of and as For more details, see Paranaíba et al. [6]. Clearly, and are determined from (3). From (10), we have Setting in the last equation gives Using the power series for the error function (see, e.g., [7]), we obtain after some algebra where denotes the complementary incomplete gamma function for . The measures and are immediately calculated from (35).
Bonferroni and Lorenz curves have applications not only in economics to study income and poverty, but also in other fields such as reliability, demography, insurance and medicine. They are defined by respectively, where and (Section 2) for a given probability . From , we obtain and .
3.3. Rényi Entropy
The Rényi information of order for a continuous random variable with density function is defined as where , and . Applications of the Rényi entropy can be found in several areas such as physics, information theory and engineering to describe many nonlinear dynamical or chaotic systems [8], and in statistics as certain appropriately scaled test statistics (relative Rényi information) for testing hypotheses in parametric models [9]. Rényi [10] generalized the concept of information theory which allows for different averaging of probabilities via .
For the BGHN distribution (4), the Rényi entropy is defined by where , and . From (4), we have For and is a real noninteger, the power series holds where the binomial coefficient is defined for any real. Using (40) in (39) twice, can be expressed as Substituting by the error function and setting , reduces to Following similar algebra that lead to (35), we obtain Finally, the Rényi entropy reduces to
3.4. Reliability
In the context of reliability, the stress-strength model describes the life of a component which has a random strength that is subjected to a random stress . The component fails at the instant that the stress applied to it exceeds the strength, and the component will function satisfactorily whenever . Hence, is a measure of component reliability. Here, we derive when and have independent and distributions with the same shape parameters and . The reliability becomes where the cdf of and the density of are obtained from (6) and (10) as respectively, where refer to and , respectively. Hence,
Setting in the last integral, the reliability of reduces to
4. Computational Issues
Here, we show the practical use of (24), (26), (30), (35), and (49). If we truncate these infinite power series, we have a simple way to compute the quantile function, moments, mgf, mean deviations and reliability of the BGHN distribution. The question is how large the truncation limit should be?
We now provide evidence that each infinite summation can be truncated at twenty to yield sufficient accuracy. Let denote the absolute difference between the integrated version, where is a uniform variate on the unit interval, and the truncated version of (24) averaged over (all with step ) , , , and . Let denote the absolute difference between integrated version and the truncated version of (26) averaged over , , , and . Let denote the absolute difference between the truncated version of (30) and the integrated version, averaged over , , , and . Let denote the absolute difference between the truncated version of (35) and the integrated version, averaged over , , , and . Let denote the absolute difference between the truncated version of (49) and the integrated version (45), averaged over , , , and .
We obtain the following estimates after extensive computations: , , , and . These estimates are small enough to suggest that the truncated versions of (24), (26), (30), (35) and (49) are reasonable practical use.
It would be important to verify that each (untruncated) infinite series (such as (24), (26), (30), (35) and (49)) is convergent and provide valid values for its arguments. However, this will be a difficult mathematical problem that could be investigated in future work.
5. Properties of the BGHN Order Statistics
Order statistics have been used in a wide range of problems, including robust statistical estimation and detection of outliers, characterization of probability distributions and goodness-of-fit tests, entropy estimation, analysis of censored samples, reliability analysis, quality control, and strength of materials.
5.1. Mixture Form
Suppose is a random sample of size from a continuous distribution and let denote the corresponding order statistics. There has been a large amount of work relating to moments of order statistics . See Arnold et al. [11], David and Nagaraja [12], and Ahsanullah and Nevzorov [13] for excellent accounts. It is well known that the density function of is given by
For the BGHN distribution, Pescim et al. [3] obtained where denotes a sequence of nonnegative integers, denotes the density function defined under and the constants are given by The quantities are easily obtained given and a sequence of indices . The sums in (55) extend over all -tuples () and can be implementable on a computer. If is an integer, (55) holds but the indices vary from zero to . Equation (55) reveals that the density function of the BGHN order statistics can be expressed as an infinite mixture of BGHN densities.
5.2. Moments of Order Statistics
The moments of the BGHN order statistics can be written directly in terms of the moments of BGHN distributions from the mixture form (55). We have where and can be determined from (26). Inserting (26) in (57) and changing indices, we can write where
The moments can be determined based on the explicit sum (58) using the software Mathematica. They are also computed from (55) by numerical integration using the statistical software R [14]. The values from both techniques are usually close when is replaced by a moderate number such as in (58). For some values , and , Table 1 lists the numerical values for the moments , (; ) and for the variance, skewness and kurtosis.
5.3. Generating Function
The mgf of the BGHN order statistics can be written directly in terms of the BGHN mgf’s from the mixture form (55) and (30). We obtain where is the mgf of the BGHN distribution obtained from (30).
5.4. Mean Deviations
The mean deviations of the order statistics in relation to the mean and the median are defined by respectively, where and denotes the median. Here, is obtained as the solution of the nonlinear equation The measures and follow from where . Using (55), we have where is obtained from (35) using given by (59) and where
Bonferroni and Lorenz curves of the order statistics are given by respectively, where for a given probability . From , we obtain
5.5. Rényi Entropy
The Rényi entropy of the order statistics is defined by where , and . From (54), it follows that Using (40) in (70), we obtain For and , we can expand as and then Hence, from (70), we can write By replacing and by (10) and (16) given by Pescim et al. [3], we obtain
Using the identity (for positive integer) in (4), we have where Using the power series (40) in (76) twice and replacing by the error function (defined in Section 3.2) we obtain after some algebra where the quantity is well defined by Pescim et al. [3]. (More details, see equation in [3]).
Finally, the entropy of the order statistics can be expressed as Equation (79) is the main result of this section.
An alternative expansion for the density function of the order statistics derived from the identity (20) was proposed by Pescim et al. [3]. A second density function representation and alternative expressions for some measures of the order statistics are given in Appendix A.
6. Lifetime Analysis
Let be a random variable having density function (4), where is the vector of parameters. The data encountered in survival analysis and reliability studies are often censored. A very simple random censoring mechanism that is often realistic is one in which each individual is assumed to have a lifetime and a censoring time , where and are independent random variables. Suppose that the data consist of independent observations for . The distribution of does not depend on any of the unknown parameters of the distribution of . Parametric inference for such data are usually based on likelihood methods and their asymptotic theory. The censored log-likelihood for the model parameters is where is the number of failures and and denote the uncensored and censored sets of observations, respectively.
The score functions for the parameters , , , and are given by where and is the digamma function.
The maximum likelihood estimate (MLE) of is obtained by solving numerically the nonlinear equations , , , and . We can use iterative techniques such as the Newton-Raphson type algorithm to obtain the estimate . We employ the subroutine NLMixed in SAS [15].
For interval estimation of and tests of hypotheses on these parameters we obtain the observed information matrix since the expected information matrix is very complicated and will require numerical integration. The observed information matrix is whose elements are given in Appendix B.
Under conditions that are fulfilled for parameters in the interior of the parameter space but not on the boundary, the asymptotic distribution of is , where is the expected information matrix. The normal approximation is valid if is replaced by , that is, the observed information matrix evaluated at . The multivariate normal distribution can be used to construct approximate confidence intervals for the model parameters. The likelihood ratio (LR) statistic can be used for testing the goodness of fit of the BGHN distribution and for comparing this distribution with some of its special submodels.
We can compute the maximum values of the unrestricted and restricted log-likelihoods to construct LR statistics for testing some submodels of the BGHN distribution. For example, we may use the LR statistic to check if the fit using the BGHN distribution is statistically “superior” to a fit using the modified Weibull or exponentiated Weibull distribution for a given data set. In any case, hypothesis testing of the type versus can be performed using LR statistics. For example, the test of versus is equivalent to compare the EGHN and BGHN distributions. In this case, the LR statistic becomes , where , , , and are the MLEs under and , and are the estimates under . We emphasize that first-order asymptotic results for LR statistics can be misleading for small sample sizes. Future research can be conducted to derive analytical or bootstrap Bartlett corrections.
7. A BGHN Model for Survival Data with Long-Term Survivors
In population based cancer studies, cure is said to occur when the mortality in the group of cancer patients returns to the same level as that expected in the general population. The cure fraction is of interest to patients and also a useful measure when analyzing trends in cancer patient survival. Models for survival analysis typically assume that every subject in the study population is susceptible to the event under study and will eventually experience such event if the follow-up is sufficiently long. However, there are situations when a fraction of individuals are not expected to experience the event of interest, that is, those individuals are cured or not susceptible. For example, researchers may be interested in analyzing the recurrence of a disease. Many individuals may never experience a recurrence; therefore, a cured fraction of the population exists. Cure rate models have been used to estimate the cured fraction. These models are survival models which allow for a cured fraction of individuals. These models extend the understanding of time-to-event data by allowing for the formulation of more accurate and informative conclusions. These conclusions are otherwise unobtainable from an analysis which fails to account for a cured or insusceptible fraction of the population. If a cured component is not present, the analysis reduces to standard approaches of survival analysis. Cure rate models have been used for modeling time-to-event data for various types of cancers, including breast cancer, non-Hodgkins lymphoma, leukemia, prostate cancer, and melanoma. Perhaps the most popular type of cure rate models is the mixture model introduced by Berkson and Gage [16] and Maller and Zhou [17]. In this model, the population is divided into two sub-populations so that an individual either is cured with probability , or has a proper survival function with probability . This gives an improper population survivor function in the form of the mixture, namely, Common choices for in (84) are the exponential and Weibull distributions. Here, we adopt the BGHN distribution. Mixture models involving these distributions have been studied by several authors, including Farewell [18], Sy and Taylor [19], and Ortega et al. [20]. The book by Maller and Zhou [17] provides a wide range of applications of the long-term survivor mixture model. The use of survival models with a cure fraction has become more and more frequent because traditional survival analysis do not allow for modeling data in which nonhomogeneous parts of the population do not represent the event of interest even after a long follow-up. Now, we propose an application of the BGHN distribution to compose a mixture model for cure rate estimation. Consider a sample , where is either the observed lifetime or censoring time for the th individual. Let a binary random variable , for , indicating that the th individual in a population is at risk or not with respect to a certain type of failure, that is, indicates that the th individual will eventually experience a failure event (uncured) and indicates that the individual will never experience such event (cured).
For an individual, the proportion of uncured can be specified such that the conditional distribution of is given by . Suppose that the ’s are independent and identically distributed random variables having the BGHN distribution with density function (4).
The maximum likelihood method is used to estimate the parameters. Thus, the contribution of an individual that failed at to the likelihood function is given by and the contribution of an individual that is at risk at time is The new model defined by (85) and (86) is called the BGHN mixture model with long-term survivors. For , we obtain the GHN mixture model (a new model) with long-term survivors.
Thus, the log-likelihood function for the parameter vector follows from (85) and (86) as where and denote the sets of individuals corresponding to lifetime observations and censoring times, respectively, and is the number of uncensored observations (failures).
8. Applications
In this section, we give four applications using well-known data sets (three with censoring and one uncensored data set) to demonstrate the flexibility and applicability of the proposed model. The reason for choosing these data is that they allow us to show how in different fields it is necessary to have positively skewed distributions with nonnegative support. These data sets present different degrees of skewness and kurtosis.
8.1. Censored Data: Transistors
The first data set consists of the lifetimes of transistors in an accelerated life test. Three of the lifetimes are censored, so the censoring fraction is 3/34 (or 8.8%). Wilk et al. [21] stated that “there is reason, from past experience, to expect that the gamma distribution might reasonably approximate the failure time distribution.” Wilk et al. [21] and also Lawless [22] fitted a gamma distribution. Alternatively, we fit the BGHN model to these data. We estimate the parameters , , , and by maximum likelihood. The MLEs of and for the GHN distribution are taken as starting values for the iterative procedure. The computations were performed using the NLMixed procedure in SAS [15]. Table 2 lists the MLEs (and the corresponding standard errors in parentheses) of the model parameters and the values of the following statistics for some models: Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and Consistent Akaike Information Criterion (CAIC). The results indicate that the BGHN model has the lowest AIC, BIC, and CAIC values, and, therefore, it could be chosen as the best model. Further, we calculate the maximum unrestricted and restricted log-likelihoods and the LR statistics for testing some submodels. For example, the LR statistic for testing the hypotheses : versus : is not true, that is, to compare the BGHN and GHN models, is ( value =0.0027), which gives favorable indication towards the BGHN model. In special, the Weibull density function is given by Figure 1(a) displays plots of the empirical survival function and the estimated survival functions of the BGHN, GHN and Weibull distributions. We obtain a good fit of the BGHN model to these data.
(a)
(b)
(c)
8.2. Censored Data: Radiotherapy
The data refer to the survival time (days) for cancer patients () undergoing radiotherapy [23]. The percentage of censored observations was . Fitting the BGHN distribution to these data, we obtain the MLEs of the model parameters listed in Table 3. The values of the AIC, CAIC, and BIC statistics are smaller for the BGHN distribution as compared to those values of the GHN and Weibull distributions. The LR statistic for testing the hypotheses : versus : is not true, that is, to compare the BGHN and GHN models, is ( value =0.0022), which yields favorable indication towards the BGHN distribution. Thus, this distribution seems to be a very competitive model for lifetime data analysis. Figure 1(b) displays the plots of the empirical survival function and the estimated survival functions for the BGHN, GHN, and Weibull distributions. We note a good fit of the BGHN distribution.
8.3. Uncensored Data: USS Halfbeak Diesel Engine
Here, we compare the fits of the BGHN, GHN, and Weibull distributions to the data set presented by Ascher and Feingold [24] from a USS Halfbeak (submarine) diesel engine. The data denote 73 failure times (in hours) of unscheduled maintenance actions for the USS Halfbeak number 4 main propulsion diesel engine over 25.518 operating hours. Table 4 gives the MLEs of the model parameters. The values of the AIC, CAIC, and BIC statistics are smaller for the BGHN distribution when compared to those values of the GHN and Weibull distributions. The LR statistic for testing the hypotheses : versus : is not true, that is, to compare the BGHN and GHN models, is ( value 0.0001), which indicates that the first distribution is superior to the second in terms of model fitting. Figure 1(c) displays plots of the empirical survival function and the estimated survival function of the BGHN, GHN, and Weibull distributions. In fact, the BGHN distribution provides a better fit to these data.
8.4. Melanoma Data with Long-Term Survivors
In this section, we provide an application of the BGHN model to cancer recurrence. The data are part of a study on cutaneous melanoma (a type of malignant cancer) for the evaluation of postoperative treatment performance with a high dose of a certain drug (interferon alfa-2b) in order to prevent recurrence. Patients were included in the study from 1991 to 1995, and follow-up was conducted until 1998. The data were collected by Ibrahim et al. [25]. The survival time is defined as the time until the patient’s death. The original sample size was patients, 10 of whom did not present a value for explanatory variable tumor thickness. When such cases were removed, a sample of size patients was retained. The percentage of censored observations was . Table 5 lists the MLEs of the model parameters. The values of the AIC, CAIC and BIC statistics are smaller for the BGHN mixture model when compared to those values of the GHN mixture model. The LR statistic for testing the hypotheses : versus : is not true, that is, to compare the BGHN and GHN mixture models, becomes ( value 0.0001), which indicates that the BGHN mixture model is superior to the GHN mixture model in terms of model fitting. Figure 2 provides plots of the empirical survival function and the estimated survival functions of the BGHN, and GHN mixture models. Note that the cured proportion estimated by the BGHN mixture model is more appropriate than that one estimated by the GHN mixture model . Further, the BGHN mixture model provides a better fit to these data.
9. Concluding Remarks
In this paper, we discuss some mathematical properties of the beta generalized half normal (BGHN) distribution introduced recently by Pescim et al. [3]. The incorporation of additional parameters provides a very flexible model. We derive a power series expansion for the BGHN quantile function. We provide some new structural properties such as moments, generating function, mean deviations, Rényi entropy, and reliability. We investigate properties of the order statistics. The method of maximum likelihood is used for estimating the model parameters for uncensored and censored data. We also propose a BGHN model for survival data with long-term survivors. Applications to four real data sets indicate that the BGHN model provides a flexible alternative to (and a better fit than) some classical lifetimes models.
Appendices
A. Appendix A
Here, we derive some properties of the BGHN order statistics. Using the identity in Gradshteyn and Ryzhik [5] and after some algebraic manipulations, we obtain an alternative form for the density function of the th order statistic, namely, where denotes the density of the BGHN distribution and the quantity is defined by Pescim et al. [3].
From (A.1), we obtain alternative expressions for the moments and mgf of the BGHN order statistics given by
The mean deviations of the order statistics in relation to the mean and to the median can be expressed as respectively, where , and are defined in Section 5.4. We use (A.1) to obtain where comes from (35).
Bonferroni and Lorenz curves of the order statistic are defined, in Section 5.4, by respectively, where . From , these curves also can be expressed as Using (A.4), we obtain the Bonferroni and Lorenz curves for the BGHN order statistics.
B. Appendix B
Here, we give the necessary formulas to obtain the second-order partial derivatives of the log-likelihood function. After some algebraic manipulations, we obtain where
Acknowledgments
The authors would like to thank the editor, the associate editor, and the referees for carefully reading the paper and for their comments, which greatly improved the paper.