Abstract
In this article, we aim to estimate the parameters of the Poisson-Dirichlet mixture model with a multigroup data structure by empirical Bayes. With Bayesian nonparametric process priors, the number of mixture components is not fixed in advance and can grow as the data accumulate. Empirical Bayes is a useful method for estimating the parameters of the prior when no information on them is available in advance. We give a procedure to construct smooth estimates of the base distribution and estimates of the two parameters, the discount and concentration parameters. Numerical simulations show that, in terms of bias, standard deviation, and mean squared error, the parameter estimates obtained from multigroup data outperform those obtained from single-group data with the same total number of individuals. We also apply the Poisson-Dirichlet mixture model to well-known real datasets.
1. Introduction
Mixture models are typically used to model data in which each observation belongs to one of a set of distributions, finite or infinite, with certain probabilities. In many situations, some or all of the mixing distributions are unknown. However, the accuracy of prior information plays an important role in statistical inference, and deviations in prior information inevitably result in biased inference, which may cause serious problems in subsequent decisions. A possible solution to the inference on parameters of the prior is the empirical Bayes (EB) method, at the price of using more data sources than pure Bayesian inference. EB is a modern statistical methodology that blends frequentist inference into Bayesian ideas, with the parameters in prior distributions estimated from the observed data (or Bayes decisions estimated from observed data). EB has been widely applied in many fields over its development of more than half a century. Relevant contributions are, among others, Seidel [1] on estimating partial prior information by EB, George and Foster [2] on EB variable selection in regression analysis, Clyde and George [3] on EB estimation for wavelets, Yang and Wu [4] on EB estimates of the precision parameter and baseline distribution with monotone missing data subject to Dirichlet process priors, Bi and Davuluri [5] developing a novel nonparametric empirical Bayesian approach to model RNA-seq data, and Zhou et al. [6] presenting careful comparisons among several empirical Bayes estimates of the precision parameter of the Dirichlet process prior. As for EB inference in nonparametric problems where priors are modeled by mixture models, one can refer to, for example, the works of Liu [7] and McAuliffe et al. [8].
A Dirichlet process (DP) [9] is a distribution over distributions and, thanks to its characteristics, can act as a prior distribution in nonparametric problems with possibly tied observations. A DP mixture model therefore allows common distributions to be shared by subsets of observations and thus generalizes finite mixture models (FMM). In an FMM, the number of component distributions is supposed to be known and fixed, and every individual is sampled from one of those distributions with certain probabilities, whereas in DP mixture models the number of clusters need not be given in advance and is inferred from the data. The number of clusters in a DP mixture model is random and grows with the sample size. Research on DP mixture models in Bayesian nonparametrics is extensive; see, for example, the works of Ferguson [9], MacEachern and Müller [10], Neal [11], Ishwaran and James [12], Quintana et al. [13], and so on.
Proposed by Pitman and Yor [14], the Poisson-Dirichlet process (PDP), also known as the Pitman-Yor process, can be seen as an extension of the DP. In a PDP model, a new observation either coincides with a previous one in some cluster, with a probability that is an increasing function of the cluster size, or is a new draw from the base distribution.
As an effective and widely applied density estimation method, mixture modeling covers a great deal of phenomena in real-world practice, but it is usually hard to determine the number of mixture components. Many methods treat the number of components as an unknown constant to be determined from the data beforehand, for example, model selection methods. However, this assumption can be too rigid or simply wrong, since, as mentioned above, a new individual may come either from an existing cluster or from a new one. There is a large body of research on the DP mixture model, such as the works of Berry and Christensen [15], McAuliffe et al. [8], Mostofi and Kharrati-Kopaei [16], and Hu et al. [17], but relatively little literature on the PDP mixture model and the estimation of its parameters.
In this article, we are interested in the estimation of the base distribution under multigroup data, where the density of the base distribution is continuous. Unlike the DP mixture model, the PDP mixture model has two parameters, the discount and concentration parameters, in addition to the base distribution. One difficulty is the choice of initial values for these two parameters and their impact on the estimation of the base distribution. Another is that the base distribution, an infinite-dimensional parameter, cannot be observed directly: the individuals of every group follow a random probability distribution generated from the base. We handle this by nonparametric empirical Bayes; for the estimation of the two parameters we refer, among others, to the works of Carlton [18] and Favaro et al. [19]. The performance of the estimates under multigroup data is better than that under single-group data with the same total number of individuals in terms of bias, standard deviation, and mean squared error.
The remainder of this paper is structured as follows. In Section 2, we review PDP mixture models. In Section 3, under the framework of multigroup data, the procedure for estimating the base density is described and an algorithm with a nonconjugate Gibbs sampler is given. In Section 4, simulations with different base distributions and different pairs of discount and concentration parameters are presented, and we compare the results between single-group and multigroup data. In Section 5, the PDP mixture model is applied to well-known real-world datasets.
2. Poisson-Dirichlet Process Mixture Model
This section formulates the Poisson-Dirichlet process mixture models we are going to discuss.
The fundamental setting is a set of independent random couples $(x_t, \phi_t)$ such that, for every $t$, $x_t$ given $\phi_t$ has a certain known density function $f(\cdot\mid\phi_t)$ and the $\phi_t$ are identically distributed according to an unknown probability measure, as discussed, for example, in the works of Neal [11] and McAuliffe et al. [8]. There are quite a few real-world applications that can be set in this framework. Examples include general insurance practice, where the observations are the claims of an insured at successive periods and the parameter $\phi_t$ reflects the overall profile of the economic, political, and other situations at period $t$, such as the presence of extreme weather, earthquakes, war, and epidemics. In contrast to the most extensively discussed model, in which a single parameter is common to all periods, here the parameters evolve from one period to another [20]. In many real scenarios, the evolving parameter values exhibit a clustering feature and the number and profiles of the clusters are unknown, so that it is appropriate to use stick-breaking distributions to model their common distribution.
Pitman-Yor processes [14, 21] are a class of stick-breaking distributions covering Dirichlet processes as special elements. Let $d \in [0,1)$ (referred to as the discount parameter), let $\theta > -d$ (referred to as the concentration parameter), and let $V_1, V_2, \ldots$ be a sequence of mutually independent random variables with $V_k \sim \mathrm{Beta}(1-d,\ \theta + kd)$. Set $p_1 = V_1$ and $p_k = V_k \prod_{l<k}(1-V_l)$ for $k \ge 2$. Let $\phi_1, \phi_2, \ldots$ be a sequence of i.i.d. random variables following a common probability measure $G_0$, taking values in a measurable space ($\mathbb{R}$, for example) and independent also of $(V_k)$. Let $\delta_{\phi}$ stand for the Dirac measure degenerated at the point $\phi$. Then the stick-breaking distribution $G = \sum_{k\ge 1} p_k \delta_{\phi_k}$ on that space is referred to as a Poisson-Dirichlet process (PDP) with parameters $(d, \theta)$ and base distribution $G_0$, written $G \sim \mathrm{PD}(d, \theta, G_0)$. Note that the parameters $d$ and $\theta$ determine the power-law characteristics of a PDP and that the case $d = 0$ corresponds to a Dirichlet process.
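For concreteness, the following Python sketch draws a truncated stick-breaking approximation of a $\mathrm{PD}(d,\theta,G_0)$ random measure; the truncation level, the standard normal base distribution, and the function names are illustrative assumptions and not part of the construction above.

```python
import numpy as np

def stick_breaking_pdp(d, theta, base_sampler, K=1000, rng=None):
    """Truncated stick-breaking draw from PD(d, theta, G0).

    V_k ~ Beta(1 - d, theta + k d), p_1 = V_1, p_k = V_k * prod_{l<k}(1 - V_l),
    and the atoms phi_k are i.i.d. draws from G0 (supplied by base_sampler).
    """
    rng = np.random.default_rng() if rng is None else rng
    ks = np.arange(1, K + 1)
    V = rng.beta(1.0 - d, theta + ks * d)            # stick-breaking fractions
    V[-1] = 1.0                                      # close the last stick so the weights sum to 1
    log_remaining = np.concatenate(([0.0], np.cumsum(np.log1p(-V[:-1]))))
    p = V * np.exp(log_remaining)                    # weights p_k
    phi = base_sampler(K, rng)                       # atoms from G0
    return phi, p

# Example with a standard normal base distribution G0 = N(0, 1)
phi, p = stick_breaking_pdp(d=0.3, theta=2.0,
                            base_sampler=lambda n, rng: rng.standard_normal(n))
```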
Let $\phi_1, \phi_2, \ldots$ be a sequence of i.i.d. observations drawn from $G \sim \mathrm{PD}(d,\theta,G_0)$ so as to reflect the clustering feature of such distributions. Write $k$ for the number of clusters of $\phi_1,\ldots,\phi_n$, $\phi^*_1,\ldots,\phi^*_k$ for the distinct observations among them, and $n_c$ for the number of elements in cluster $c$. Then, by Pitman [21], given $\phi_1,\ldots,\phi_n$, the probability distribution of a new future observation $\phi_{n+1}$ is
$$\phi_{n+1}\mid \phi_1,\ldots,\phi_n \;\sim\; \sum_{c=1}^{k}\frac{n_c - d}{\theta + n}\,\delta_{\phi^*_c} + \frac{\theta + kd}{\theta + n}\,G_0, \tag{1}$$
that is, $\phi_{n+1}$ takes the value $\phi^*_c$ with probability $(n_c-d)/(\theta+n)$ and a new value drawn from $G_0$ with probability $(\theta+kd)/(\theta+n)$. This is the generalized Pólya urn scheme to produce $\phi_1,\phi_2,\ldots$.
A PDP mixture model is defined by
$$x_t \mid \phi_t \sim f(\cdot\mid\phi_t), \qquad \phi_t \mid G \overset{\text{i.i.d.}}{\sim} G, \qquad G \sim \mathrm{PD}(d,\theta,G_0), \tag{2}$$
where $f(\cdot\mid\phi)$ is a known density function characterized by the parameter $\phi$, $(\phi_t)$ consists of the latent parameters, and $G_0$ is absolutely continuous with respect to a certain $\sigma$-finite measure with an entirely unknown density function $g_0$, as depicted in Figure 1. Thanks to the clustering feature of draws from PD processes, under this model the observations are randomly divided into a random number of clusters so that all the elements of every cluster share a common parameter picked from $G_0$, and a new observation either is generated from a distribution with a new parameter, meaning that it comes from a new cluster, or falls into one of the previously existing clusters with a parameter already drawn from $G_0$. Note that $d=0$ corresponds to the special case of DP mixture models, on which some recent efforts can be found in, for example, the work of Neal [11], who discussed methods of sampling from the posterior distribution of a DP mixture model, and the work of McAuliffe et al. [8], who studied the estimation of the base distribution in a DP mixture model with a single group of data.

In this paper, however, we consider the estimation of the parameters in the distribution $\mathrm{PD}(d,\theta,G_0)$ when the data are organized in independent and identically distributed groups. To be specific, we consider the following PDP mixture model: the data $x_{ij}$, $j=1,\ldots,n_i$, are divided into $m$ groups and the parameters $\phi_{ij}$, $j=1,\ldots,n_i$, of each group form an i.i.d. sample from a group-specific random probability measure $G_i$. A graphical representation of this PDP mixture model is shown in Figure 2, and its hierarchical structure in terms of conditional distributions is clearly expressed by
$$x_{ij}\mid \phi_{ij} \sim f(\cdot\mid\phi_{ij}), \qquad \phi_{ij}\mid G_i \overset{\text{i.i.d.}}{\sim} G_i, \qquad G_i \overset{\text{i.i.d.}}{\sim} \mathrm{PD}(d,\theta,G_0), \qquad i=1,\ldots,m,\ j=1,\ldots,n_i. \tag{3}$$
Under this setting, the individuals from different groups are mutually independent and the random probabilities $G_1,\ldots,G_m$ are independent and identically distributed as a $\mathrm{PD}(d,\theta,G_0)$ with unknown $d$, $\theta$, and $G_0$ to be estimated.
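To illustrate this multigroup hierarchy, the short Python sketch below generates data by running the generalized Pólya urn scheme (1) within each group; the normal kernel $f(x\mid\phi)=N(\phi,1)$, the group sizes, the parameter values, and all function names are hypothetical choices made only for this example.

```python
import numpy as np

def sample_group_latents(n, d, theta, base_sampler, rng):
    """Latent parameters phi_1..phi_n of one group via the generalized Polya urn."""
    phi = np.empty(n)
    clusters, sizes = [], []                          # distinct values and cluster sizes
    for j in range(n):
        k = len(clusters)
        # join existing cluster c with weight (n_c - d); open a new one with weight (theta + k d)
        weights = np.array([s - d for s in sizes] + [theta + k * d])
        c = rng.choice(k + 1, p=weights / weights.sum())
        if c == k:                                    # new cluster: draw a fresh value from G0
            clusters.append(base_sampler(rng))
            sizes.append(0)
        sizes[c] += 1
        phi[j] = clusters[c]
    return phi

rng = np.random.default_rng(0)
base = lambda rng: rng.standard_normal()              # assumed base distribution G0 = N(0, 1)
groups = []
for _ in range(10):                                   # illustrative: 10 groups of 20
    phi = sample_group_latents(20, d=0.3, theta=2.0, base_sampler=base, rng=rng)
    groups.append(rng.normal(loc=phi, scale=1.0))     # x_ij | phi_ij ~ N(phi_ij, 1)
```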
On the one hand, the structure defined above arises quite frequently in real-world applications. In the context of general insurance, it is a typical structure of data extensively investigated by the so-called credibility theory [20]. On the other hand, it has been commonly recognized that, for the majority of EB problems, it is impossible to make efficient estimate of the unknown elements in the prior distributions if one has only a single group of conditionally independent and identically distributed observations given the random parameters following the priors. The only exception arises when the prior can be characterized by a so-called stick-breaking distribution such as Dirichlet processes (see [4, 22]). Even for stick-breaking priors, existing research efforts have exhibited the possibly low efficiency in estimating the parameters in prior distributions with a single group of data; see, for example, the works of Zehnwirth [22], Yang and Wu [4], and Zhang et al. [23].
For $i=1,\ldots,m$, denote by $k_i$ the number of distinct values among $\phi_{i1},\ldots,\phi_{in_i}$, by $\phi^*_{i1},\ldots,\phi^*_{ik_i}$ those distinct values, and by $n_{ic}$, $c=1,\ldots,k_i$, the number of elements in cluster $c$ of group $i$. Then, similar to Theorem 2.5 of Korwar and Hollander [24], given $k_i$, the distinct values $\phi^*_{i1},\ldots,\phi^*_{ik_i}$ are i.i.d. from $G_0$ and are also independent of the cluster sizes, so that we can retrieve information on the base distribution $G_0$ from the information delivered by the i.i.d. distinct values and on the parameters $d$ and $\theta$ from the cluster structure.
Generally, denote the full dataset by $\mathbf{x}$ and rearrange the observations of every group into clusters so that, for every $i$ and $c$, the observations in cluster $c$ of group $i$ share the common random parameter $\phi^*_{ic}$. Then, by means of marginalization, the likelihood of $(d,\theta,g_0)$ under framework (3) with the data $\mathbf{x}$ can be expressed as in (4), where the expectation is taken with respect to the marginal distribution of the cluster structure with the remaining quantities held fixed, the other groups are understood in the same way, and $g_0$ is the unknown density of $G_0$. As can be imagined, the likelihood function (4) of $(d,\theta,g_0)$ is generally expressed as a product of summations, of which every summand is, in turn, a product of the form expressed inside the parentheses in (4). So it is too intricate to be maximized, even numerically, and certain indirect alternatives to the maximum likelihood estimates of $(d,\theta,g_0)$ need to be developed, of which the following is constructed with the help of ideas from empirical Bayes.
3. Estimation of Poisson-Dirichlet Mixture Models
This section concentrates on developing reasonable estimates of the parameters $d$, $\theta$, and $G_0$ and elucidates in detail the algorithms deriving those estimates. We proceed in two subsections, one for the idea and the other for the algorithms to realize it.
3.1. Constructing Estimates
Suppose one had observed the latent parameters $\phi_{ij}$ or, equivalently, the distinct values $\phi^*_{ic}$ together with the cluster sizes $n_{ic}$. Since the distinct values are i.i.d. from $G_0$ across all groups, a natural choice is to estimate $g_0$ by certain maturely developed density estimators. We use the kernel method, which yields the estimate in (5), defined by a certain kernel function together with a bandwidth, applied to the pooled distinct values. For $d$ and $\theta$, similar to Carlton [18], who studied the estimation of PDP priors, and Favaro et al. [19], who used the PDP model to predict species in the species-sampling problem, we can use the maximum likelihood estimate derived in terms of the distribution of the cluster structure, that is, the likelihood determined by the number of clusters in each group and the cardinality of each cluster.
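To make the oracle estimates concrete, the sketch below computes a Gaussian kernel density estimate of $g_0$ from the pooled distinct values and a numerical maximum likelihood estimate of $(d,\theta)$ based on the standard exchangeable partition likelihood of the two-parameter Poisson-Dirichlet process; the particular likelihood expression, the optimizer, and the function names are assumptions made for illustration and need not coincide with the exact formulas in (5) and the sequel.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gaussian_kde

def partition_loglik(d, theta, cluster_sizes_by_group):
    """Log-likelihood of (d, theta) given the cluster sizes of every group,
    using Pitman's exchangeable partition probability function."""
    ll = 0.0
    for sizes in cluster_sizes_by_group:
        k, n = len(sizes), int(sum(sizes))
        ll += sum(np.log(theta + c * d) for c in range(1, k))                # new-cluster factors
        ll -= sum(np.log(theta + l) for l in range(1, n))                    # normalizing factors
        ll += sum(np.log(l - d) for s in sizes for l in range(1, int(s)))    # within-cluster factors
    return ll

def estimate_d_theta(cluster_sizes_by_group, d0=0.5, theta0=5.0):
    """Numerical MLE of (d, theta) from the observed cluster sizes."""
    neg = lambda v: -partition_loglik(v[0], v[1], cluster_sizes_by_group)
    res = minimize(neg, x0=[d0, theta0], method="L-BFGS-B",
                   bounds=[(1e-4, 1 - 1e-4), (1e-4, None)])
    return res.x

def estimate_g0(distinct_values_by_group):
    """Gaussian kernel density estimate of g0 from the pooled distinct values."""
    return gaussian_kde(np.concatenate(distinct_values_by_group))
```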
Now that the latent variables are unobservable, we construct unbiased pseudoestimates by replacing the oracle estimates above with their posterior expectations given the observations, where the arguments indicate that these "estimates" nonetheless depend on the unknown parameters $d$, $\theta$, and $g_0$. Note that these pseudoestimates can be computed by means of the posterior distribution of the latent variables given the observations.
By these pseudoestimates, an algorithm can be motivated which, starting from certain appropriately selected initial values of $d$, $\theta$, and $g_0$, iteratively computes the quantities in (8) for $r=0,1,2,\ldots$, where the posterior expectations of the latent quantities are computed with the unknown parameters replaced by their current estimates. Consequently, $d$, $\theta$, and $g_0$ are estimated by the limits of the resulting sequences.
In every iteration computing the new estimates from the current ones, numerical solutions are necessary because it is not an easy task to work out closed forms of the posterior expectations in the equations in (8). A natural way is to employ an MCMC technique, which will be described in detail later on.
To sum up, the estimates are approximated with the following workable procedure.
3.2. Algorithms
This subsection details Algorithm 1: the selection of initial values, the steps to perform the MCMC sampling, and the computation of the proposed estimates.
3.2.1. Initial Values
To fix an initial estimate of the base density, first compute, for every observation $x_{ij}$, the maximum likelihood estimate $\hat\phi_{ij} = \arg\max_{\phi} f(x_{ij}\mid\phi)$ of its latent parameter and then compute a kernel estimate of $g_0$ from the values $\hat\phi_{ij}$.
Similar to the formula in (5), in many important cases $f(\cdot\mid\phi)$ is a location translation of some standard density with mode zero (e.g., a standard normal distribution), such that $\hat\phi_{ij}=x_{ij}$. In these cases, the initial estimate of $g_0$ is given by the kernel density estimate of the observations themselves.
For the parameters $d$ and $\theta$, the initial value of $d$ could be 0.1 or 0.5 and that of $\theta$ could be 1 or 5 in case there is no further information on them. The simulation, which will be reported later on, showed that the initial values of $(d,\theta)$ had only quite minor effects on the convergence of the algorithm.
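Under the location-family case just described, the initial estimates can be computed directly from the data, as in the brief sketch below; the Gaussian kernel and the default initial parameter values are assumptions for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

def initial_values(groups, d0=0.5, theta0=5.0):
    """Initial estimates for the iteration: since phi_hat_ij = x_ij in the location case,
    the initial base density is simply a kernel density estimate of the pooled observations."""
    pooled = np.concatenate(groups)
    return gaussian_kde(pooled), d0, theta0
```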
3.2.2. The Gibbs Sampling
We next state the full conditional distribution used in the MCMC procedure, that is, the conditional distribution of a single latent variable $\phi_{ij}$ given the data $x_{i1},\ldots,x_{in_i}$ and the remaining latent variables. Denote by $\phi_{i,-j}$ the subvector of $(\phi_{i1},\ldots,\phi_{in_i})$ obtained by removing $\phi_{ij}$ and, correspondingly, by $k^-_i$ the number of distinct values in $\phi_{i,-j}$. With an abuse of notation, let $\phi^*_{i1},\ldots,\phi^*_{ik^-_i}$ be the distinct values in $\phi_{i,-j}$ and $n^-_{ic}$ the number of elements in $\phi_{i,-j}$ which take the value $\phi^*_{ic}$.
Thanks to the mutual independence among the random vectors of different groups, it suffices to discuss the single-group situation, as stated in Theorem 1 below, where the group index is suppressed.
Theorem 1. For each $j$, given $x_j$ and $\phi_{-j}$, the conditional distribution of $\phi_j$ is
$$\phi_j \mid \phi_{-j}, x_j \;\sim\; \sum_{c=1}^{k^-} q_c\,\delta_{\phi^*_c} + q_0\, H_j,$$
where $H_j(d\phi) \propto f(x_j\mid\phi)\,G_0(d\phi)$ is the posterior distribution of $\phi$ derived from the single observation $x_j$ with density $f(x_j\mid\cdot)$, the weights are such that $q_c \propto (n^-_c - d)\, f(x_j\mid\phi^*_c)$ for $c=1,\ldots,k^-$ and $q_0 \propto (\theta + d\,k^-)\int f(x_j\mid\phi)\,G_0(d\phi)$, and the normalization quantity is determined by the equality $q_0 + \sum_{c=1}^{k^-} q_c = 1$.
Proof. By the generalized Pólya urn scheme (1), the conditional prior distribution of $\phi_j$ given $\phi_{-j}$ under $\mathrm{PD}(d,\theta,G_0)$ is
$$\phi_j \mid \phi_{-j} \;\sim\; \sum_{c=1}^{k^-}\frac{n^-_c - d}{\theta + n - 1}\,\delta_{\phi^*_c} + \frac{\theta + d\,k^-}{\theta + n - 1}\,G_0.$$
Therefore, the conditional distribution of $\phi_j$ given $(\phi_{-j}, x_j)$ can be computed by
$$p(\phi_j \in d\phi \mid \phi_{-j}, x_j) \;\propto\; f(x_j\mid\phi)\Bigg[\sum_{c=1}^{k^-}\frac{n^-_c - d}{\theta + n - 1}\,\delta_{\phi^*_c}(d\phi) + \frac{\theta + d\,k^-}{\theta + n - 1}\,G_0(d\phi)\Bigg].$$
Set
$$q_c \propto (n^-_c - d)\, f(x_j\mid\phi^*_c), \qquad q_0 \propto (\theta + d\,k^-)\int f(x_j\mid\phi)\,G_0(d\phi),$$
and rewrite the equation above as
$$p(\phi_j \in d\phi \mid \phi_{-j}, x_j) \;\propto\; \sum_{c=1}^{k^-} q_c\,\delta_{\phi^*_c}(d\phi) + q_0\, H_j(d\phi).$$
That is, the conditional distribution of $\phi_j$ is the mixture $\sum_{c=1}^{k^-} q_c\,\delta_{\phi^*_c} + q_0\, H_j$, where $H_j(d\phi) \propto f(x_j\mid\phi)\,G_0(d\phi)$ and the weights $q_0, q_1, \ldots, q_{k^-}$ satisfy $q_0 + \sum_{c=1}^{k^-} q_c = 1$. The proof is completed.
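To illustrate Theorem 1, the following sketch draws one value from the full conditional of a single latent variable, assuming a normal kernel $f(x\mid\phi)=N(\phi,1)$ and approximating both the integral against the base distribution and the single-observation posterior $H_j$ over a common set of Monte Carlo draws from the current base estimate; the kernel, the approximation, and all names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def update_one_latent(x_j, star, sizes, d, theta, psi, rng):
    """One draw from the full conditional of Theorem 1 for a single latent variable.

    star  : distinct values phi*_1..phi*_k of the remaining latent variables
    sizes : corresponding cluster sizes n_1..n_k
    psi   : i.i.d. draws from the current base estimate, used to approximate
            the integral of f(x_j | phi) against G0 and the posterior H_j
    """
    k = len(star)
    f_star = norm.pdf(x_j, loc=np.asarray(star, dtype=float), scale=1.0)   # f(x_j | phi*_c)
    f_psi = norm.pdf(x_j, loc=psi, scale=1.0)                              # f(x_j | psi_l)
    q = np.concatenate(((np.asarray(sizes, dtype=float) - d) * f_star,
                        [(theta + d * k) * f_psi.mean()]))                 # unnormalized weights
    q = q / q.sum()
    c = rng.choice(k + 1, p=q)
    if c < k:
        return star[c], c                     # phi_j joins existing cluster c
    # new cluster: approximate a draw from H_j by resampling the psi's with weights f(x_j | psi_l)
    return rng.choice(psi, p=f_psi / f_psi.sum()), k
```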
3.2.3. The Iteration Procedure
With the theorem above and the initial values in hand, we now describe the iteration that computes the step-$(r+1)$ estimates from the step-$r$ estimates, which consists of two alternately running phases: one samples from the posterior distribution of the latent variables with the unknown parameters replaced by the current estimates, and the other re-estimates the parameters using the sampled latent variables, as depicted in what follows:
(1) Sampling. The sampling phase is indeed a two-step procedure applied to every group separately so as to produce posterior samples of the latent variables, starting with an arbitrarily selected initial configuration. The first step is the standard Gibbs sampling derived from the full conditionals presented in Theorem 1 and the second performs an acceleration.
As frequently done in the literature (see, e.g., [11]), introduce an auxiliary vector of cluster indicators recording, for every latent variable, the index of the cluster it belongs to, so that each latent variable equals the distinct value of its cluster. In this notation, the vector of latent variables is equivalent to the combination of the cluster indicators and the distinct values. With the cluster indicators introduced, the first step of the two-step procedure updates the cluster indicators and, in passing, the distinct values according to the full conditionals in Theorem 1. When the group size is much larger than the number of clusters, the Markov chain directly derived from Theorem 1 may converge slowly because the chance of introducing new values for the distinct parameters is quite small; see equation (12). Hence, an acceleration method (e.g., [11]) may be helpful, which samples another round of distinct values for acceleration. This is realized by sampling the distinct values conditional on the data and the cluster indicators just obtained in Step 1, and then recovering the latent variables from the indicators and the refreshed distinct values. In other words, Step 2 is driven by the grouped conditional distribution of the distinct values and the latent variables given the data and the cluster indicators. Because the latent variables are a deterministic function of the cluster indicators and the distinct values, Step 2 only needs to draw values from the conditional distribution of the distinct values given the data and the cluster indicators.
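One simple way to realize the acceleration step, assuming the same normal kernel and reusing the Monte Carlo draws from the current base estimate as an approximate support of the single-cluster posterior, is sketched below; this is an illustrative approximation rather than the exact scheme of Algorithm 2.

```python
import numpy as np
from scipy.stats import norm

def accelerate(x, labels, psi, rng):
    """Resample every distinct value given the data currently assigned to its cluster.

    x      : observations of one group
    labels : cluster indicator of each observation, assumed relabeled to 0..k-1
    psi    : i.i.d. draws from the current base estimate (approximate support of G0)
    """
    k = labels.max() + 1
    star = np.empty(k)
    for c in range(k):
        members = x[labels == c]
        # weight each candidate psi_l by prod_j f(x_j | psi_l) over the members of cluster c
        logw = norm.logpdf(members[:, None], loc=psi[None, :], scale=1.0).sum(axis=0)
        w = np.exp(logw - logw.max())
        star[c] = rng.choice(psi, p=w / w.sum())
    return star
```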
Moreover, for every observation, the integral involved in computing the weight of opening a new cluster and the corresponding single-observation posterior is approximated by an average of the conditional density over a number of i.i.d. draws from the current estimate of the base distribution, where the number of draws is a positive integer. Because these draws depend only on the iteration indicator $r$, this approximation can be prepared at the beginning of each iteration.
As a final point, we would like to note that, due to the presence of the acceleration Step 2, the update of the distinct values in Step 1 essentially takes effect only through the computation of the probabilities in equation (12) for every observation.
The details of this two-step procedure are presented in Algorithm 2. For each group, the procedure is repeated to generate a sufficiently long sequence of posterior samples of the latent variables.
(2) Estimation. In the estimation phase, with the posterior samples of the latent variables in hand, the estimates of $d$, $\theta$, and $g_0$ are obtained according to (10).
With these newly obtained estimates, repeat the above two phases for the next iteration until the sequence of estimates stabilizes. The whole computation process for the estimation under the framework in (3) can be briefly arranged as in Algorithm 3.
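A structural outline of the whole computation, in the spirit of Algorithm 3, is sketched below; the callables passed in stand for the sampling and estimation pieces discussed above and are hypothetical placeholders, and the base estimate is assumed to be a kernel object from which draws can be taken.

```python
import numpy as np

def empirical_bayes_pdp(groups, initial_values, gibbs_sweep,
                        estimate_d_theta, estimate_g0,
                        n_rounds=200, n_psi=500, rng=None):
    """Two-phase empirical Bayes iteration for the multigroup PDP mixture model.

    initial_values(groups)                   -> (g0_hat, d_hat, theta_hat)
    gibbs_sweep(x, d, theta, psi, rng)       -> (distinct_values, cluster_labels) for one group
    estimate_d_theta(cluster_sizes_by_group) -> (d_hat, theta_hat)
    estimate_g0(distinct_values_by_group)    -> updated base-density estimate
    """
    rng = np.random.default_rng() if rng is None else rng
    g0_hat, d_hat, theta_hat = initial_values(groups)
    for _ in range(n_rounds):
        # prepare the Monte Carlo draws from the current base estimate once per round
        psi = g0_hat.resample(n_psi, seed=rng)[0]      # assumes a scipy gaussian_kde-like object
        distinct, sizes = [], []
        for x in groups:                               # Phase 1: sampling, group by group
            star, labels = gibbs_sweep(x, d_hat, theta_hat, psi, rng)
            distinct.append(star)
            sizes.append(np.bincount(labels))
        d_hat, theta_hat = estimate_d_theta(sizes)     # Phase 2: estimation
        g0_hat = estimate_g0(distinct)
    return g0_hat, d_hat, theta_hat
```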
We now conclude this section by the following remark on one of Neal’s [11] algorithms.
Remark 1. Algorithm 2 is similar to, but has two minor differences from, Algorithm 8 of Neal [11] at Step 1. First, we draw only a single set of i.i.d. values from the current base estimate in each iteration, while Neal [11] suggested drawing fresh i.i.d. values for each observation, hence requiring far more sampling computation than here; the algorithm here may thus bring a significant reduction in the total computational load. Second, when a new cluster is produced, we draw its value from the single-observation posterior $H_j$, while Neal [11] simply drew a value from the base distribution. According to Theorem 1, we think Algorithm 2 draws a correct value.
4. Simulation
In this section, some simulations are reported to show the performance of the estimates provided above. The data structure of the PDP mixture model follows equation (3) and the simulations were conducted under the following settings.
(i) We considered two data structures: (1) multigroup data with 20 individuals in every group and (2) single-group data with 200 individuals, so that the total sample sizes are the same.
(ii) The total number of estimation iterations was 200, with a fixed amount of Gibbs sampling in each iteration.
(iii) Three base distributions were considered: (a) the standard normal distribution, (b) the t distribution with 3 degrees of freedom, and (c) a positively skewed distribution. Several combinations of the discount and concentration parameters were used; their initial values were (0.5, 5) and (0.1, 1).
(iv) The conditional density function was normal with mean equal to the latent parameter and with a standard deviation determined by the standard deviation of the observations.
Among the three base distributions we chose, the first two are symmetric, the second has heavy tails, and the last has a positive skew. We tried a few distinct initial values of the parameters in all cases; the simulations showed that the estimates were insensitive to the initial values. For every combination of the base distribution, the parameter values, and the data structure, we produced 1000 replicates of the datasets under the setting described above. The simulation results are concisely discussed below.
(1) The estimates of the base density. In the simulation, the base density was estimated more accurately under the multigroup setting than under the single-group setting. The results from some replicates out of the 1000 are presented in Figures 3–5 for selected parameter values, in terms of the iteration rounds up to 200, under both data structures; the dashed lines represent the curves of the estimates and the solid lines the true density curves. The figures show two clear features: the dashed lines move closer to the solid curves as the number of iterations increases in both data structures, and the estimated curves under multigroup data apparently perform better than those under single-group data. The center of the dashed curve for single-group data deviates slightly at round 200 in Figure 3, the estimated curve is more concentrated than the true heavy-tailed density in Figure 4, and the estimated curves are more wiggly and skewed in Figure 5 under single-group data. Overall, in both the single-group and multigroup environments, our estimates remain close to the true density curves. The histograms of the distance between the estimated and true base densities over the 1000 replicates are shown in Figure 6 for the single-group and multigroup cases. Figure 7 delineates the total variation distance (TVD) between the estimated and true base distributions. The TVD of two probability distributions $P$ and $Q$ is $\sup_{A}|P(A)-Q(A)|$, with the supremum taken over measurable sets $A$. Each panel plots the TVD between the estimate and the truth as the rounds progress. As can be seen, the TVD gradually reaches a steady value after about 15 iterations for two of the three base distributions, while the remaining one needs about 25 iterations.
(2) The estimates of the discount and concentration parameters over the 1000 replicates are summarized in Table 1 in terms of the mean and standard deviation, and their histograms are presented in Figure 8. The comparisons of the estimates are listed with means, standard deviations, and 95% credible intervals, and the estimates under the multigroup structure are better than those under the single-group structure for all base distributions in the three parametric combinations. In terms of the standard deviation criterion, the estimation is also more stable under some of the parameter settings.
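For reference, the total variation distance reported in Figure 7 can be approximated numerically from two densities evaluated on a common grid, as in the short sketch below; the grid and the example densities are arbitrary illustrations.

```python
import numpy as np
from scipy.stats import norm

def total_variation(p_vals, q_vals, dx):
    """TVD = sup_A |P(A) - Q(A)| = 0.5 * integral |p - q|, approximated by a Riemann sum."""
    return 0.5 * np.sum(np.abs(p_vals - q_vals)) * dx

grid = np.linspace(-8.0, 8.0, 4001)
dx = grid[1] - grid[0]
print(total_variation(norm.pdf(grid, 0.0, 1.0), norm.pdf(grid, 0.5, 1.0), dx))
```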






5. Applications to Real Data
In this section, the PDP mixture model is employed to analyze the three following well-known real-world datasets.
The first is the galaxies data [8, 25], consisting of the velocities of 82 galaxies from six well-separated conic sections in the Corona Borealis region. The model was applied and the algorithm provided in Section 3 was run for 200 rounds, giving rise to the parameter estimates and the estimated base density depicted in Figure 9, which shows that the base distribution is multimodal, corresponding to the identification of different clumps of galaxies.

The second dataset consists of records of stamp thickness [26], in millimeters, of 485 unwatermarked Hidalgo stamps dating from 1872 to 1874. Figure 10 presents the estimate of the base density of stamp thickness at round 200 of the procedure mentioned above; it exhibits two closely spaced modes, one steeper than the other. The estimated curve places high probability around these two modes, corresponding to the types of paper used, and thus reflects the concentration of the shared latent variables around the modes. After 200 rounds of iteration, we also obtain the estimates of the discount and concentration parameters.

The last dataset concerns the Danish fire losses [27], which are heterogeneous, as reflected in multimodality, skewness, and heavy-tailed distributions; the data were collected by Copenhagen Reinsurance and consist of 2492 records over the period 1980–1990. The losses are in millions of Danish Krone and are not adjusted for inflation over time. The data are divided into multiple groups based on time, and the observations between groups are treated as independent, a conventional assumption in the non-life insurance industry. Figure 11 displays the estimated base density at round 200 under the Poisson-Dirichlet process mixture model over two intervals of the abscissa; it has one concentrated mode and a heavy tail. We also obtain the corresponding estimates of the discount and concentration parameters.

6. Conclusions
In this article, we have derived the posterior distribution in the PDP mixture model, whose parameter estimates cannot be obtained directly because the latent variables are unobserved. Unbiased "pseudoestimates" are constructed and the corresponding algorithm is described in detail based on nonparametric empirical Bayes. Simulations are presented under both data structures, single-group and multigroup, for the pseudoestimates of the base distribution, the discount parameter, and the concentration parameter. The numerical simulations show that the performance of the parameter estimates under multigroup data is better than that under single-group data with the same total number of individuals. We also applied the model to well-known real-world datasets.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
The research was sponsored by Shuguang Program supported by Shanghai Education Development Foundation and Shanghai Municipal Education Commission in China (Project 18SG53) and the National Natural Science Foundation of China (Project 9195210212032016).