Abstract
We study the density problem and approximation error of reproducing kernel Hilbert spaces for the purpose of learning theory. For a Mercer kernel $K$ on a compact metric space $(X, d)$, a characterization for the generated reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ to be dense in $C(X)$ is given. As a corollary, we show that the density always holds for convolution type kernels. Some estimates for the rate of convergence of interpolation schemes are presented for general Mercer kernels. These are then used to establish, for convolution type kernels, a quantitative analysis of the approximation error in learning theory. Finally, we show by the example of Gaussian kernels with varying variances that the approximation error can be improved when we adaptively change the value of the kernel parameter. This confirms the method of choosing varying parameters, which is often used in applications of learning theory.
1. Introduction
Learning theory investigates how to find function relations or data structures from random samples. For the regression problem, one usually has some experience and would expect that the (underlying) unknown function lies in some set of functions $\mathcal{H}$ called the hypothesis space. Then one tries to find a good approximation in $\mathcal{H}$ of the underlying function $f$ (under a certain metric). The best approximation in $\mathcal{H}$ is called the target function $f_{\mathcal{H}}$. However, $f$ is unknown. What we have in hand is a set of random samples $\{(x_i, y_i)\}_{i=1}^m$. These samples are not given by $f$ exactly ($y_i \neq f(x_i)$). They are controlled by this underlying function $f$ with noise or some other uncertainties ($y_i \approx f(x_i)$). The most important model studied in learning theory [1] is to assume that the uncertainty is represented by a Borel probability measure $\rho$ on $X \times Y$, and the underlying function $f$ is the regression function of $\rho$ given by
$$f_\rho(x) = \int_Y y \, d\rho(y \mid x), \qquad x \in X.$$
Here, $\rho(\cdot \mid x)$ is the conditional probability measure at $x$. Then, the samples $\{(x_i, y_i)\}_{i=1}^m$ are independent and identically distributed draws according to the probability measure $\rho$. For the classification problem, $Y = \{1, -1\}$ and $\operatorname{sign}(f_\rho)$ is the optimal classifier.
Based on the samples, one can find a function from the hypothesis space $\mathcal{H}$ that best fits the data $\mathbf{z} := \{(x_i, y_i)\}_{i=1}^m$ (with respect to a certain loss functional). This function is called the empirical target function $f_{\mathbf{z}}$. When the number $m$ of samples is large enough, $f_{\mathbf{z}}$ is a good approximation of the target function $f_{\mathcal{H}}$ with certain confidence. This problem has been extensively investigated and well developed in the literature of statistical learning theory. See, for example, [1–4].
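The passage from samples to an empirical target function can be sketched numerically. The following is a minimal illustration only, assuming a least-squares loss, a polynomial hypothesis space of degree 2, and a hypothetical regression function `f_rho`; none of these specific choices are taken from the paper.

```python
import numpy as np

def empirical_target(x, y, degree):
    """Least-squares fit over polynomials of the given degree (the hypothesis space)."""
    coeffs = np.polyfit(x, y, degree)
    return np.poly1d(coeffs)

# Hypothetical regression function, chosen inside the hypothesis space.
def f_rho(t):
    return 1.0 + 2.0 * t - 3.0 * t**2

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=2000)              # sample inputs x_i
y = f_rho(x) + 0.1 * rng.standard_normal(x.size)  # noisy outputs y_i ~ f_rho(x_i)

f_z = empirical_target(x, y, degree=2)            # empirical target function
grid = np.linspace(0.0, 1.0, 101)
max_err = float(np.max(np.abs(f_z(grid) - f_rho(grid))))
```

With many samples, `max_err` is small because `f_rho` itself lies in the hypothesis space; the approximation question studied in this paper concerns the case where it does not.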
What is less understood is the approximation of the underlying desired function $f_\rho$ by the target function $f_{\mathcal{H}}$. For example, if one takes $\mathcal{H}$ to be a polynomial space of some fixed degree, then $f_\rho$ can be approximated by functions from $\mathcal{H}$ only when $f_\rho$ is a polynomial in $\mathcal{H}$.
In kernel machine learning such as support vector machines, one often uses reproducing kernel Hilbert spaces or their balls as hypothesis spaces. Here, we take $X$ to be a compact metric space and $Y = \mathbb{R}$.
Definition 1. Let $K : X \times X \to \mathbb{R}$ be continuous, symmetric, and positive semidefinite; that is, for any finite set of distinct points $\{x_1, \dots, x_m\} \subset X$, the matrix $(K(x_i, x_j))_{i,j=1}^m$ is positive semidefinite. Such a kernel is called a Mercer kernel. It is called positive definite if the matrix $(K(x_i, x_j))_{i,j=1}^m$ is always positive definite.
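Definition 1 can be checked numerically on any finite point set. The sketch below assumes a one-dimensional Gaussian kernel with an illustrative width `sigma = 0.5` and an arbitrary choice of four distinct points; the tolerance `1e-12` is an assumption to absorb rounding.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=0.5):
    """Gaussian kernel on the real line (illustrative choice of Mercer kernel)."""
    return np.exp(-(x - y) ** 2 / (2.0 * sigma**2))

def gram_matrix(kernel, points):
    """Matrix (K(x_i, x_j)) for a finite set of points."""
    pts = np.asarray(points, dtype=float)
    return kernel(pts[:, None], pts[None, :])

points = np.array([0.0, 0.3, 0.7, 1.0])   # distinct points in X = [0, 1]
G = gram_matrix(gaussian_kernel, points)
eigs = np.linalg.eigvalsh(G)              # eigenvalues of the symmetric matrix G

is_symmetric = bool(np.allclose(G, G.T))
is_psd = bool(eigs.min() >= -1e-12)       # positive semidefinite up to rounding
is_pd = bool(eigs.min() > 1e-12)          # positive definite on this point set
```

A finite check of this kind can only refute positive definiteness, never prove it, since the definition quantifies over all finite point sets.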
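The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with a Mercer kernel $K$ is defined (see [5]) to be the completion of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying
$$\langle K_x, K_y \rangle_K = K(x, y), \qquad x, y \in X. \tag{2}$$
The reproducing kernel property is given by
$$\langle K_x, f \rangle_K = f(x), \qquad x \in X,\ f \in \mathcal{H}_K. \tag{3}$$
This space can be embedded into $C(X)$, the space of continuous functions on $X$.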
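The embedding of the RKHS into the space of continuous functions follows from the reproducing property and the Schwarz inequality, which give $|f(x)| \le \sqrt{K(x,x)}\,\|f\|_K$. A small numerical sketch of this bound, assuming a Gaussian kernel and a hypothetical finite combination $f = \sum_j c_j K_{t_j}$ (the centers and coefficients are illustrative only):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=0.5):
    return np.exp(-(x - y) ** 2 / (2.0 * sigma**2))

# Hypothetical element f = sum_j c_j K_{t_j} of the linear span.
centers = np.array([0.1, 0.4, 0.8])
coeffs = np.array([1.0, -2.0, 0.5])

def f(x):
    x = np.asarray(x, dtype=float)
    return gaussian_kernel(x[..., None], centers) @ coeffs

# ||f||_K^2 = sum_{i,j} c_i c_j K(t_i, t_j), by bilinearity of the inner product.
G = gaussian_kernel(centers[:, None], centers[None, :])
rkhs_norm = float(np.sqrt(coeffs @ G @ coeffs))

grid = np.linspace(0.0, 1.0, 101)
sup_norm = float(np.max(np.abs(f(grid))))
kappa = float(np.sqrt(np.max(gaussian_kernel(grid, grid))))  # sup of sqrt(K(x, x))
```

On the grid, `sup_norm` does not exceed `kappa * rkhs_norm`, which is the discrete counterpart of the continuous embedding.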
In kernel machine learning, one often takes $\mathcal{H}_K$ or its balls as the hypothesis space. Then, one needs to know whether the desired function $f_\rho$ can be approximated by functions from the RKHS.
The first purpose of this paper is to study the density of the reproducing kernel Hilbert spaces in $C(X)$ (or in $L^2(X)$ when $X$ is a subset of the Euclidean space $\mathbb{R}^n$). This will be done in Section 2, where some characterizations will be provided. Let us mention a simple example, with a detailed proof given in Section 6.
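Example 2. Let $X = [0, 1]$ and let $K$ be a Mercer kernel given by
$$K(x, y) = \sum_{j=0}^{\infty} a_j (x \cdot y)^j,$$
where $a_j \ge 0$ for each $j$ and $\sum_{j=0}^{\infty} a_j < \infty$. Set $J := \{j \in \mathbb{Z}_+ : a_j > 0\}$. Then, $\mathcal{H}_K$ is dense in $C(X)$ if and only if
$$a_0 > 0 \quad \text{and} \quad \sum_{j \in J \setminus \{0\}} \frac{1}{j} = +\infty.$$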
When the density holds, we want to study the convergence rate of the approximation by functions from balls of the RKHS as the radius $R$ tends to infinity. The quantity
$$\inf_{g \in \mathcal{H}_K,\ \|g\|_K \le R} \|f - g\| \tag{6}$$
is called the approximation error in learning theory. Some estimates have been presented by Smale and Zhou [6] for the $L^2$ norm and many kernels. The second purpose of this paper is to investigate the convergence rate of the approximation error with the uniform norm as well as the $L^2$ norm. Estimates will be given in Section 4, based on the analysis in Section 3 for interpolation schemes associated with general Mercer kernels. With this analysis, we can understand the approximation error with respect to the marginal probability distribution induced by $\rho$. Let us provide an example of Gaussian kernels to illustrate the idea. Notice that when the parameter $\sigma$ of the kernel is allowed to change with $R$, the rate of the approximation error may be improved. This confirms the method of adaptively choosing the parameter of the kernel, which is used in many applications (see, e.g., [7]).
Example 3. Let There exist positive constantsandsuch that, for eachand, there holds whenis fixed; while whenmay change with, there holds
2. Density and Positive Definiteness
The density problem of reproducing kernel Hilbert spaces in $C(X)$ was raised to the author by Poggio et al.; see [8]. It can be stated as follows.
Given a Mercer kernel $K$ on a compact metric space $(X, d)$, when is the RKHS $\mathcal{H}_K$ dense in $C(X)$?
By means of the dual space of $C(X)$, we can give a general characterization. This is only a simple observation, but it does provide useful information. For example, we will show that the density always holds for convolution type kernels. Also, for dot product type kernels, we can give a complete characterization of the density, which will be given in Section 6.
Recall the Riesz Representation Theorem, which asserts that the dual space of $C(X)$ can be represented by the set of Borel measures on $X$. For a Borel measure $\mu$ on $X$, we define the integral operator $L_{K,\mu}$ associated with the kernel as
$$L_{K,\mu} f(x) = \int_X K(x, y) f(y) \, d\mu(y), \qquad x \in X.$$
This is a compact operator on $L^2_\mu(X)$ if $\mu$ is a positive measure.
Theorem 4. Let $K$ be a Mercer kernel on a compact metric space $(X, d)$. Then, the following statements are equivalent.
(1) $\mathcal{H}_K$ is dense in $C(X)$.
(2) For any nontrivial positive Borel measure $\mu$, $\mathcal{H}_K$ is dense in $L^2_\mu(X)$.
(3) For any nontrivial positive Borel measure $\mu$, $L_{K,\mu}$ has no eigenvalue zero in $L^2_\mu(X)$.
(4) For any nontrivial Borel measure $\mu$, as a function in $C(X)$,
$$\int_X K(x, y) \, d\mu(y) \neq 0.$$
Proof. (1) $\Rightarrow$ (2). This follows from the fact that $C(X)$ is dense in $L^2_\mu(X)$. See, for example, [9].
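(2) $\Rightarrow$ (3). Suppose that $\mathcal{H}_K$ is dense in $L^2_\mu(X)$, but $L_{K,\mu}$ has an eigenvalue zero in $L^2_\mu(X)$. Then, there exists a nontrivial function $f \in L^2_\mu(X)$ such that $L_{K,\mu} f = 0$; that is,
$$\int_X K(x, y) f(y) \, d\mu(y) = 0. \tag{12}$$
The identity holds as functions in $L^2_\mu(X)$. If the support of $\mu$ is $X$, then this identity would imply that $f$ is orthogonal to each $K_x$ with $x \in X$. When the support of $\mu$ is not $X$, things are more complicated. Here, the support of $\mu$, denoted as $\operatorname{supp} \mu$, is defined to be the smallest closed subset $F$ of $X$ satisfying $\mu(X \setminus F) = 0$.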
The property of the RKHS enables us to prove the general case. As the functionis continuous, we know from (12) that, for eachin supp, 
							
						This means for eachin ,in , wherehas been restricted onto supp. When we restrictonto , the new kernelis again a Mercer kernel. Moreover, by (1),. It follows that span is dense in. The latter is dense in. Therefore,is orthogonal to; hence, as a function in,is zero. This is a contradiction.
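(3) $\Rightarrow$ (4). Every nontrivial Borel measure $\mu$ can be uniquely decomposed as the difference of two mutually singular positive Borel measures: $\mu = \mu^+ - \mu^-$; that is, there exists a Borel set $\Omega \subseteq X$ such that $\mu^+(X \setminus \Omega) = 0$ and $\mu^-(\Omega) = 0$. With this decomposition,
$$\int_X K(x, y) \, d\mu(y) = \int_X K(x, y) \left\{ \chi_\Omega(y) - \chi_{X \setminus \Omega}(y) \right\} d|\mu|(y) = L_{K,|\mu|}\left( \chi_\Omega - \chi_{X \setminus \Omega} \right)(x).$$
Here, $\chi_\Omega$ is the characteristic function of the set $\Omega$, and $|\mu| = \mu^+ + \mu^-$ is the absolute value of $\mu$. As $|\mu|$ is a nontrivial positive Borel measure and $\chi_\Omega - \chi_{X \setminus \Omega}$ is a nontrivial function in $L^2_{|\mu|}(X)$, statement (3) implies that, as a function in $L^2_{|\mu|}(X)$, $L_{K,|\mu|}(\chi_\Omega - \chi_{X \setminus \Omega}) \neq 0$. Since this function lies in $C(X)$, it is nonzero as a function in $C(X)$.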
The last implication (4) $\Rightarrow$ (1) follows directly from the Riesz Representation Theorem.
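The proof of Theorem 4 also yields a characterization for the density of the RKHS in $L^2_\mu(X)$.
Corollary 5. Let $K$ be a Mercer kernel on a compact metric space $(X, d)$ and $\mu$ a positive Borel measure on $X$. Then, $\mathcal{H}_K$ is dense in $L^2_\mu(X)$ if and only if $L_{K,\mu}$ has no eigenvalue zero in $L^2_\mu(X)$.
The necessity has been verified in the proof of Theorem 4, while the sufficiency follows from the observation that an $L^2_\mu(X)$ function $f$ lying in the orthogonal complement of $\operatorname{span}\{K_x : x \in X\}$ gives an eigenfunction of $L_{K,\mu}$ with eigenvalue zero:
$$L_{K,\mu} f(x) = \int_X K(x, y) f(y) \, d\mu(y) = \langle K_x, f \rangle_{L^2_\mu(X)} = 0, \qquad x \in X.$$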
Theorem 4 enables us to conclude that the density always holds for convolution type kernelswith. The density for some convolution type kernels has been verified by Steinwart [10]. The author observed the density as a corollary of Theorem 4 whenis strictly positive. Charlie Micchelli pointed out to the author that, for a convolution type kernel, the RKHS is always dense in. So, the density problem is solved for these kernels.
Corollary 6. Letbe a nontrivial convolution type Mercer kernel onwith. Then, for any compact subsetof,onis dense in.
Proof. It is well known thatis a Mercer kernel if and only ifis continuous andalmost everywhere. We apply the equivalent statement (4) of Theorem 4 to prove our statement.
Letbe a Borel measure onsuch that 
							
						Then, the inverse Fourier transform yields 
							
						Here, is the Fourier transform of the Borel measure, which is an entire function.
Taking the integral onwith respect to the measure, we have 
							
						For a nontrivial Borel measuresupported on ,  vanishes only on a set of measure zero. Hence, almost everywhere, which gives. Therefore, we must have. This proves the density by Theorem 4. 
After the first version of the paper was finished, I learned that Micchelli et al. [11] proved the density for a class of convolution type kernelswithbeing the Fourier transform of a finite Borel measure. Note that a large family of convolution type reproducing kernels are given by radial basis functions; see, for example, [12].
Now we can state a trivial fact that the positive definiteness is a necessary condition for the density.
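Corollary 7. Let $K$ be a Mercer kernel on a compact metric space $(X, d)$. If $\mathcal{H}_K$ is dense in $C(X)$, then $K$ is positive definite.
Proof. Suppose to the contrary that $\mathcal{H}_K$ is dense in $C(X)$, but there exists a finite set of distinct points $\mathbf{x} = \{x_1, \dots, x_m\} \subset X$ such that the matrix $K[\mathbf{x}] := (K(x_i, x_j))_{i,j=1}^m$ is not positive definite. By the Mercer kernel property, $K[\mathbf{x}]$ is positive semidefinite. So it is singular, and we can find a nonzero vector $c = (c_i)_{i=1}^m \in \mathbb{R}^m$ satisfying $K[\mathbf{x}] c = 0$. It follows that $c^T K[\mathbf{x}] c = 0$; that is,
$$\sum_{i,j=1}^m c_i K(x_i, x_j) c_j = \Big\| \sum_{i=1}^m c_i K_{x_i} \Big\|_K^2 = 0.$$
Thus,
$$\sum_{i=1}^m c_i K(x_i, x) = 0, \qquad x \in X.$$
Now, we define a nontrivial Borel measure $\mu$ supported on $\{x_1, \dots, x_m\}$ as
$$\mu = \sum_{i=1}^m c_i \delta_{x_i},$$
where $\delta_{x_i}$ denotes the point mass at $x_i$. Then, for $x \in X$,
$$\int_X K(x, y) \, d\mu(y) = \sum_{i=1}^m c_i K(x, x_i) = 0.$$
This is a contradiction to the equivalent statement (4) in Theorem 4 of the density.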
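The construction in this proof can be traced on a concrete kernel. The sketch below assumes the rank-two kernel $K(x,y) = 1 + xy$ on $[0,1]$ (an illustrative choice, not an example from the paper): it is a Mercer kernel whose RKHS consists of affine functions, so it is neither positive definite nor dense in $C[0,1]$. A null vector of the Gramian on three nodes yields a discrete measure annihilated by the kernel.

```python
import numpy as np

def kernel(x, y):
    """Illustrative Mercer kernel K(x, y) = 1 + x*y; its RKHS is span{1, x}."""
    return 1.0 + x * y

nodes = np.array([0.0, 0.5, 1.0])            # three distinct points in [0, 1]
G = kernel(nodes[:, None], nodes[None, :])   # Gramian (K(x_i, x_j))

# The right singular vector for the smallest singular value spans the null space.
_, singular_values, vt = np.linalg.svd(G)
c = vt[-1]                                   # nonzero vector with G @ c ~ 0

# The measure mu = sum_i c_i * delta_{x_i} satisfies int K(x, y) dmu(y) = 0 for all x.
grid = np.linspace(0.0, 1.0, 201)
integral = kernel(grid[:, None], nodes[None, :]) @ c
max_abs_integral = float(np.max(np.abs(integral)))
```

Here `c` is proportional to $(1, -2, 1)$, the second difference, which annihilates every affine function; this is exactly the nontrivial measure that statement (4) of Theorem 4 forbids when the density holds.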
Because of the necessity given in Corollary 7, one would expect that the positive definiteness is also sufficient for the density. Steve Smale convinced the author that this is not the case in general. This motivates us to present a constructive example ofkernel. Denote as the norm in the Sobolev space.
Example 8. Let. For everyand every, choose a real-valuedfunctiononsuch that Defineonby Then, is aMercer kernel on. It is positive definite, but the constant functionis not in the closure ofin . Hence, is not dense in.
Proof. The series in (24) converges infor any. Hence, is and is a Mercer kernel on.
To prove the positive definiteness, we letbe a finite set of distinct points and a nonzero vector. Choosesuch that 
							
						Then, for each, eitheror. Hence, 
							
						By the construction of, there holds 
							
						Then,
							
						Now, the determinant of the matrixis a Vandermonde determinant and is nonzero. Since  is a nonzero vector, we know that  for some. It follows that . Thus,  is positive definite.
We now prove that, the constant function taking the valueeverywhere, is not in the closure ofin. In fact, the uniformly convergent series (24) and the vanishing property ofimply that 
							
						Since span is dense in  and  is embedded in , we know that 
							
						If  could be uniformly approximated by a sequence  in , then 
							
						which would be a contradiction. Therefore,is not dense in. 
Combining the previous discussion, we know that positive definiteness is a necessary condition for the density of the RKHS in $C(X)$ but is not sufficient.
3. Interpolation Schemes for Reproducing Kernel Spaces
The study of approximation by reproducing kernel Hilbert spaces has a long history; see, for example, [13, 14]. Here, we want to investigate the rate of approximation as the RKHS norm of the approximant becomes large.
In the following sections, we consider the approximation error for the purpose of learning theory. The basic tool for constructing approximants is a set of nodal functions used in [6, 15, 16].
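Definition 9. We say that $\{u_i\}_{i=1}^m$ is the set of nodal functions associated with the nodes $\mathbf{x} := \{x_1, \dots, x_m\} \subset X$ if $u_i \in \operatorname{span}\{K_{x_j}\}_{j=1}^m$ and
$$u_i(x_j) = \delta_{ij} = \begin{cases} 1, & \text{if } j = i, \\ 0, & \text{otherwise}. \end{cases}$$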
The nodal functions have some nice minimization properties; see [6, 16].
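In [15], we show that the nodal functions $\{u_i\}_{i=1}^m$ associated with $\mathbf{x}$ exist if and only if the Gramian matrix $K[\mathbf{x}] := (K(x_i, x_j))_{i,j=1}^m$ is nonsingular. In this case, the nodal functions are uniquely given by
$$u_i(x) = \sum_{j=1}^m \left( K[\mathbf{x}]^{-1} \right)_{ij} K_{x_j}(x), \qquad i = 1, \dots, m. \tag{33}$$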
Remark 10. When the RKHS has finite dimension , then, for any we can find nodal functions associated with some subset , while for , no such nodal functions exist. When dim, then, for any , we can find a subset which possesses a set of nodal functions.
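The nodal functions are used to construct an interpolation scheme:
$$I_{\mathbf{x}}(f)(x) = \sum_{i=1}^m f(x_i) u_i(x), \qquad x \in X,\ f \in C(X). \tag{34}$$
It satisfies $I_{\mathbf{x}}(f)(x_i) = f(x_i)$ for $i = 1, \dots, m$. Interpolation schemes have been widely applied to approximation by radial basis functions; see, for example, [17–20].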
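The nodal functions and the interpolation scheme can be computed directly from the Gramian matrix. The following is a minimal sketch, assuming a one-dimensional Gaussian kernel with an illustrative width and six equally spaced nodes; the function names are hypothetical. It solves a linear system with the Gramian rather than forming its inverse.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=0.3):
    return np.exp(-(x - y) ** 2 / (2.0 * sigma**2))

def nodal_functions(kernel, nodes, x):
    """Return U with U[i, k] = u_i(x[k]), where u_i = sum_j (G^{-1})_{ij} K_{x_j}."""
    G = kernel(nodes[:, None], nodes[None, :])   # Gramian on the nodes
    Kx = kernel(nodes[:, None], x[None, :])      # Kx[j, k] = K(x_j, x[k])
    return np.linalg.solve(G, Kx)                # equals G^{-1} @ Kx

def interpolate(kernel, nodes, f_values, x):
    """Evaluate sum_i f(x_i) * u_i(x) at the points x."""
    return f_values @ nodal_functions(kernel, nodes, x)

nodes = np.linspace(0.0, 1.0, 6)
U_at_nodes = nodal_functions(gaussian_kernel, nodes, nodes)   # should be the identity
f_values = np.sin(nodes)                                      # any continuous f works
interp_at_nodes = interpolate(gaussian_kernel, nodes, f_values, nodes)
```

Evaluating at the nodes recovers the Kronecker delta property of the nodal functions and the interpolation property of the scheme, up to rounding.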
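The error $f - I_{\mathbf{x}}(f)$ for $f \in \mathcal{H}_K$ will be estimated by means of a power function.
Definition 11. Let $K$ be a Mercer kernel on a compact metric space $(X, d)$ and $\mathbf{x} = \{x_1, \dots, x_m\} \subset X$. The power function $\epsilon_K(\mathbf{x})$ is defined on $X$ as
$$\epsilon_K(\mathbf{x})(x) := \inf_{w \in \mathbb{R}^m} \Big\{ K(x, x) - 2 \sum_{i=1}^m w_i K(x, x_i) + \sum_{i,j=1}^m w_i K(x_i, x_j) w_j \Big\}^{1/2}, \qquad x \in X.$$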
We know thatwhen. Ifis Lipschitzon: then Moreover, higher order regularity ofimplies faster convergence of. For details, see [16].
The error of the interpolation scheme for functions from RKHS can be estimated as follows.
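Theorem 12. Let $K$ be a Mercer kernel and $K[\mathbf{x}]$ nonsingular for a finite set $\mathbf{x} = \{x_1, \dots, x_m\} \subset X$. Define the interpolation scheme $I_{\mathbf{x}}$ associated with $\mathbf{x}$ as (34). Then, for $f \in \mathcal{H}_K$, there holds
$$\left| f(x) - I_{\mathbf{x}}(f)(x) \right| \le \|f\|_K \, \epsilon_K(\mathbf{x})(x), \qquad x \in X, \tag{38}$$
and $\|I_{\mathbf{x}}(f)\|_K \le \|f\|_K$.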
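Proof. Let $x \in X$. We apply the reproducing property (3) to the function $f$ in $\mathcal{H}_K$:
$$f(x) = \langle f, K_x \rangle_K.$$
Then,
$$f(x) - I_{\mathbf{x}}(f)(x) = \Big\langle f, \; K_x - \sum_{i=1}^m u_i(x) K_{x_i} \Big\rangle_K.$$
By the Schwarz inequality in $\mathcal{H}_K$,
$$\left| f(x) - I_{\mathbf{x}}(f)(x) \right| \le \|f\|_K \, \Big\| K_x - \sum_{i=1}^m u_i(x) K_{x_i} \Big\|_K.$$
As $\langle K_s, K_t \rangle_K = K(s, t)$, we have
$$\Big\| K_x - \sum_{i=1}^m u_i(x) K_{x_i} \Big\|_K^2 = K(x, x) - 2 \sum_{i=1}^m u_i(x) K(x, x_i) + \sum_{i,j=1}^m u_i(x) K(x_i, x_j) u_j(x).$$
However, the quadratic function
$$Q(w) = K(x, x) - 2 \sum_{i=1}^m w_i K(x, x_i) + \sum_{i,j=1}^m w_i K(x_i, x_j) w_j$$
over $\mathbb{R}^m$ takes its minimum value at $w = (u_i(x))_{i=1}^m$. Therefore, by the definition of the power function in Definition 11,
$$\Big\| K_x - \sum_{i=1}^m u_i(x) K_{x_i} \Big\|_K = \epsilon_K(\mathbf{x})(x).$$
It follows that
$$\left| f(x) - I_{\mathbf{x}}(f)(x) \right| \le \|f\|_K \, \epsilon_K(\mathbf{x})(x).$$
This proves (38).
As $I_{\mathbf{x}}(f) \in \operatorname{span}\{K_{x_i}\}_{i=1}^m$ and $I_{\mathbf{x}}(f)(x_i) = f(x_i)$ for $i = 1, \dots, m$, we know that
$$\left\langle f - I_{\mathbf{x}}(f), K_{x_i} \right\rangle_K = f(x_i) - I_{\mathbf{x}}(f)(x_i) = 0, \qquad i = 1, \dots, m.$$
This means that $f - I_{\mathbf{x}}(f)$ is orthogonal to $\operatorname{span}\{K_{x_i}\}_{i=1}^m$. Hence, $I_{\mathbf{x}}(f)$ is the orthogonal projection of $f$ onto $\operatorname{span}\{K_{x_i}\}_{i=1}^m$. Thus, $\|I_{\mathbf{x}}(f)\|_K \le \|f\|_K$.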
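The error bound of Theorem 12 can be observed numerically. The sketch below rests on several assumptions made only for illustration: a Gaussian kernel with width `sigma = 0.2`, six equally spaced nodes, and a hypothetical test function $f = \sum_j c_j K_{t_j}$ whose RKHS norm is $\sqrt{c^T K[\mathbf{t}] c}$. The power function is evaluated through the closed form $K(x,x) - k(x)^T K[\mathbf{x}]^{-1} k(x)$ with $k(x) = (K(x, x_i))_{i}$, which is the minimum of the quadratic function in the proof.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=0.2):
    return np.exp(-(x - y) ** 2 / (2.0 * sigma**2))

def power_function_and_nodal(kernel, nodes, x):
    """Return (eps, U): eps[k] is the power function at x[k], U[i, k] = u_i(x[k])."""
    G = kernel(nodes[:, None], nodes[None, :])
    Kx = kernel(nodes[:, None], x[None, :])
    U = np.linalg.solve(G, Kx)
    squared = kernel(x, x) - np.sum(Kx * U, axis=0)   # K(x,x) - k(x)^T G^{-1} k(x)
    return np.sqrt(np.maximum(squared, 0.0)), U       # clip tiny negative rounding

nodes = np.linspace(0.0, 1.0, 6)
grid = np.linspace(0.0, 1.0, 201)

# Hypothetical f in the RKHS: f = sum_j c_j K_{t_j}.
centers = np.array([0.1, 0.45, 0.9])
coeffs = np.array([1.0, -0.5, 0.8])

def f(x):
    return gaussian_kernel(x[:, None], centers[None, :]) @ coeffs

G_centers = gaussian_kernel(centers[:, None], centers[None, :])
rkhs_norm = float(np.sqrt(coeffs @ G_centers @ coeffs))

eps, U = power_function_and_nodal(gaussian_kernel, nodes, grid)
error = np.abs(f(grid) - f(nodes) @ U)   # |f(x) - I_x(f)(x)| on the grid
bound = rkhs_norm * eps                  # ||f||_K times the power function
```

On the grid, `error` stays below `bound` up to a small rounding tolerance, and the power function vanishes at the nodes, where the interpolant is exact.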
The regularity of the kernel in connection with Theorem 12 yields the rate of convergence of the interpolation scheme. As an example, from the estimate forgiven in [16, Proposition 2], we have the following.
Corollary 13. Let, , andbe aMercer kernel such thatis nonsingular for. Then, for, there holds
For convolution type kernels, the power function can be estimated in terms of the Fourier transform of the kernel function. This is of particular interest when the kernel function is analytic. Let us provide the details.
Assume thatis a symmetric function inandalmost everywhere on. Consider the Mercer kernel For, we define the following function to measure the regularity:
Remark 14. This function involves two parts. The first part is, where; hence, it decays exponentially fast asbecomes large. The second part is, whereis large. Then, the decay of(which is equivalent to the regularity of) yields the fast decay of the second part.
The power functioncan be bounded byon the regular points: 
							
Proposition 15. For the convolution type kernel (49) andgiven by (51), one has In particular, if for some constantsand, then there holds
Proof. Chooseas the Lagrange interpolation polynomials on. It is a vector infor each. Then, , where 
							
						In the proof of Theorem  2 in [16], we showed thatfor each. Therefore,.
The estimate forin the second part was verified in the proof of Theorem  3 in [16]. 
For the Gaussian kernels it was proved in [16, Example 4] that, for, there holds
4. Approximation Error in Learning Theory
Now, we can estimate the approximation error in learning theory by means of the interpolation scheme (34).
Consider the convolution type kernel (49) on . As in [6], we denote The approximation error (6) can be realized as follows.
Theorem 16. Let be a symmetric function with , and let the kernel on be . For and , we set by Then, with , one has(i); (ii); (iii).
Proof. (i) For  and , expression (33) gives 
							
						Then for  we have 
							
						where  is the vector . It follows that 
							
						where  denotes the (operator) norm of the matrix  in .
We apply the previous analysis to the function  satisfying 
							
						Then, 
							
Now, we need to estimate the norm . For convolution type kernels, such an estimate was given in [15, Theorem 2] by means of methods from the radial basis function literature, for example, [17, 21–24]. We have 
							
						Therefore, 
							
						This proves the statement in (i).
(ii) Let . Then 
							
By the Schwarz inequality, 
							
						The first term is bounded by . The second term is 
							
						which can be bounded by , as shown in the proof of Theorem 12. Therefore, by (52), 
							
 (iii) By the Plancherel formula, 
							
						This proves all the statements in Theorem 16. 
Theorem 16 provides quantitative estimates for the approximation error: with Choose such that as ; we have and the RKHS norm of is controlled by the asymptotic behavior of .
Denote by the inverse function of : Then, our estimate for the approximation error can be given as follows.
Corollary 17. Let and . Then, for , where . If , then In particular, when for some and , one has provided that with the function , satisfies
Proof.  The first part is a direct consequence of Theorem 16 when we choose  to be , the integer part of .
To see the second part, we note that (77) in connection with Proposition 15 implies with , 
							
						Then, .
For , we can choose  such that 
							
						Choose  such that 
							
						Then, , and by Theorem 16, 
							
						When
							
						there holds 
							
						Hence,
							
						When  satisfies (79), we know that 
							
						Hence, (84) holds true. This proves our statements. 
For the Gaussian kernels, we have the following.
Proposition 18. Let Denote and . If , then one has and when , for satisfying
Proof. The Fourier transform of  is
							
						Then, 
							
For
							
						we can take  with  such that 
							
						Here,  is the inverse function of : 
							
						Then, . Let . By Theorem 16, .
By Corollary 17 and (57), 
							
						where . Choose  such that 
							
						With this choice, . Therefore,
							
						where
							
When
							
						there holds
							
						This yields the first estimate.
When , the same method gives the error with the uniform norm. 
5. Learning with Varying Kernels
Proposition 18 in the last section shows that, for a fixed Gaussian kernel, the approximation error behaves as for functions in .
In this section, we consider learning with varying kernels. Such a method is used in many applications where we have to choose suitable parameters for the reproducing kernel. For example, in [7] Gaussian kernels with different parameters in different directions are considered. Here, we study the case when the variance parameter is the same in all directions. Our analysis shows that the approximation error may be improved when the kernel changes with the RKHS norm of the empirical target function.
Proposition 19. Let There exist positive constants and , depending only on and , such that for each and , one can find some satisfying
Proof. Take
							
						where  depends on . Denote . As in the proof of Proposition 18, we have 
							
When  is large enough, with a constant  depending on  and , this yields 
							
Finally, we determine  by requiring 
							
						There is a constant  depending only on  and  such that, for , an integer  satisfying all the previous requirements and 
							
						exists. This makes all the estimates valid. It follows that 
							
						Hence,
							
						Therefore, there holds  and 
							
						This verifies our claim for the approximation error in . 
Let us mention the following problem concerning learning with Gaussian kernels with changing variances.
Problem 20. What is the optimal rate of convergence of as tends to infinity?
6. Dot Product Kernels
In this section, we illustrate our results by the family of dot product type kernels. These kernels take the form When for some , the kernel is a Mercer kernel on if and only if for each . See [25–28]. Here, we will characterize the density for this family as [29]. Denote and for and .
Corollary 21. Let , , and the kernel be given by (116), where for each and . Set . Then, is dense in if and only if span is dense in . Thus, the density depends only on the location of positive coefficients in (116). In particular, when , is dense in if and only if
Proof.  Note that 
							
Sufficiency. Suppose that span is dense in , but  is not dense in . Then, by Theorem 4 there exists a nontrivial Borel measure  on  such that 
							
						Taking the integral with respect to  and using (118), we have 
							
						Since  for each , there holds 
							
						That is,  annihilates each  for . But span is dense in ;  also annihilates all functions in , which is a contradiction.
Necessity. If span is not dense in , then there exists a nontrivial Borel measure  annihilating each ; that is,  for each . Then, (118) tells us that, for each , 
							
						This in connection with Theorem 4 implies that  is not dense in . This proves the first statement of Corollary 21.
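The second statement follows from the classical Müntz Theorem in approximation theory (see [30]): for a strictly increasing sequence of nonnegative numbers $\{\lambda_j\}_{j=0}^{\infty}$, $\operatorname{span}\{x^{\lambda_j}\}_{j=0}^{\infty}$ is dense in $C[0, 1]$ if and only if $\lambda_0 = 0$ and $\sum_{j=1}^{\infty} 1/\lambda_j = +\infty$.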
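To see how the Müntz criterion separates the two cases, consider two exponent sequences, both starting at $\lambda_0 = 0$. These sequences are chosen here purely for illustration and are not taken from the paper.

```latex
% Even exponents: the reciprocal sum is a harmonic-type series and diverges.
\lambda_j = 2j: \qquad
\sum_{j=1}^{\infty} \frac{1}{2j} = +\infty
\;\Longrightarrow\;
\operatorname{span}\{1, x^2, x^4, x^6, \dots\} \text{ is dense in } C[0,1].

% Square exponents: the reciprocal sum converges.
\lambda_j = j^2: \qquad
\sum_{j=1}^{\infty} \frac{1}{j^2} = \frac{\pi^2}{6} < +\infty
\;\Longrightarrow\;
\operatorname{span}\{1, x, x^4, x^9, \dots\} \text{ is not dense in } C[0,1].
```

In both cases the sequence is infinite and contains the exponent zero, so the difference lies only in how sparse the exponents are.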
The conclusion in Example 2 follows directly from Corollary 21.
By Corollary 21, we can provide more examples of positive definite dot product kernels whose corresponding RKHS is not dense. The following is such an example. However, compared with Example 8, it is not constructive, in the sense that no function outside the closure of $\mathcal{H}_K$ is explicitly given.
Example 22. Let and define Then, is a positive definite Mercer kernel on , but is not dense in .
Proof.  Observe that the assumption in Corollary 21 holds for ,  and 
							
						Since , Corollary 21 tells us that  is not dense in .
What is left is to show that the Mercer kernel  is positive definite. Suppose to the contrary that there exist a finite set of distinct points  and a nonzero vector  such that 
							
						Denote 
							
						Then, 
							
						Hence,  which implies that  and  is a nonzero vector. Also, 
							
						It follows that 
							
						Choose an integer  which is not less than , the number of elements in the set . Then, we know that the linear system 
							
						has a nonzero solution . Hence, the matrix  is singular. So, there exists a nonzero vector  such that 
							
						As each element  in  is nonzero, we have 
							
						However, the determinant of the matrix  is a Vandermonde determinant and is nonzero. This is a contradiction, as the linear system having this matrix as the coefficient matrix has a nonzero solution. Therefore, the Mercer kernel  is positive definite. 
An alternative simpler proof for the positive definiteness of the kernel in Example 22 can be given by means of the recent results in [25, 26].
After characterizing the density, we can then apply our analysis in Section 3 and provide some estimates for the convergence rate of the approximation error under the assumption that all the coefficients in (116) are strictly positive. We will not provide details here, but only show the application of the interpolation scheme (34) to polynomials.
If , then It follows from the Schwarz inequality that The first term can be bounded by while the second is bounded by Thus, the approximation error can be given in terms of the regularity of the kernel . The regularity of the approximated function yields the rate of approximation by polynomials while the asymptotic behavior of the coefficients in (116) provides the control of the RKHS norm of .
Acknowledgments
The author would like to thank Charlie Micchelli for proving Corollary 6 in the general form, Allan Pinkus for clarifying Example 2, Tommy Poggio for raising the density problem, Steve Smale for suggestions on positive definiteness and approximation error in learning theory, and Grace Wahba for knowledge of earlier work on approximation by reproducing kernel Hilbert spaces. The work described in this paper is partially supported by a grant from the Research Grants Council of Hong Kong (Project no. CityU 104710).