Abstract
Compared to the standard support vector machine, the generalized eigenvalue proximal support vector machine copes well with the "Xor" problem. However, it is based on the squared L2-norm and hence is sensitive to outliers and noise. To improve the robustness, this paper introduces the capped L1-norm, which combines the nonsquared L1-norm with a "capping" operation, into the generalized eigenvalue proximal support vector machine, and proposes a novel capped L1-norm proximal support vector machine, called CPSVM. Due to the use of the capped L1-norm, CPSVM can effectively remove extreme outliers and suppress the effect of noise data. CPSVM can also be viewed as a weighted generalized eigenvalue proximal support vector machine and is solved through a series of generalized eigenvalue problems. Experimental results on an artificial dataset, several UCI datasets, and an image dataset demonstrate the effectiveness of CPSVM.
1. Introduction
A support vector machine (SVM) [1] is a popular tool for binary classification problems. It seeks parallel hyperplanes with the largest margin to separate the two classes, but it cannot deal with "Xor" problems well. To address this, Mangasarian and Wild [2] proposed the generalized eigenvalue proximal support vector machine (GEPSVM), the first nonparallel-hyperplane classifier. GEPSVM minimizes the squared L2-norm distance of the samples to their own class hyperplane and maximizes the squared L2-norm distance of the samples to the other class hyperplane, thereby finding a pair of nonparallel optimal classification hyperplanes that are proximal to the two classes. Since then, many researchers have studied nonparallel support vector machines (NSVMs) [3–7], such as the twin support vector machine (TSVM) and its modifications [8–11] and the nonparallel proximal support vector machine (NPSVM) and its variants [5, 12]. For example, to improve generalization ability, Shao et al. proposed an improved generalized eigenvalue proximal support vector machine (IGEPSVM) [13]. To keep the decision process consistent with the training process, the authors further proposed a proximal classifier with consistency (PCC) [14], which compares the two distances between a point and the two hyperplanes.
As we know, the squared L2-norm distance metric is sensitive to outliers, since the effect of outliers is magnified by the squaring operation. To reduce this effect, the L1-norm is often used as a replacement for the squared L2-norm to achieve robustness. Li et al. [15] reformulated the objective function of GEPSVM by replacing the squared L2-norm terms with the corresponding L1-norm terms and proposed an L1-norm nonparallel SVM (L1-NPSVM). Yan et al. [16] converted the L1-norm generalized eigenvalue problem into a strongly convex programming problem and implemented a simple iterative algorithm to solve it. Gu et al. proposed an L1-norm twin projection support vector machine (TPSVM-L1) [17] for image recognition and robust representation. In fact, the robustness of the L1-norm has been exploited in many machine learning tasks, including feature extraction [18], dimensionality reduction [19], and clustering [20]. Due to this robustness, modifications of the L1-norm applied to NSVMs have also been studied extensively, for example, Lp-norm and generalized elastic network-based nonparallel support vector machines [12, 21–23].
However, although the L1-norm is more robust, it is still affected by outliers that lie very far from the normal points. In this case, even the L1-norm is not robust enough to reduce the impact of outliers, so one hopes to eliminate the impact of such large outliers or limit it to a certain range. An effective way to achieve this is the idea of a "capping" operation, which places an upper bound on the common norms. This idea has been used in previous studies to build robust machine learning models, for example, capped norms for robust feature learning [24], for regression and classification [25, 26], and for classification and feature extraction [27–29], the capped nuclear norm for matrix factorization or completion [30, 31], and the capped trace norm [31] for robust principal component analysis. For SVM classifiers, Wang et al. [32] proposed a robust capped L1-norm twin support vector machine (CTWSVM), which retains the advantages of TWSVM and promotes robustness in solving binary classification problems with outliers. Yuan et al. [33] proposed a novel robust least squares twin support vector machine framework that uses a capped norm distance metric to reduce the influence of noise and outliers.
However, there is no corresponding study on proximal support vector machines. To further improve the robustness of GEPSVM and L1-NPSVM, in this paper, we propose a capped L1-norm based proximal support vector machine (CPSVM). For a given $\varepsilon > 0$, the capped L1-norm of an $n$-dimensional column vector $\mathbf{x} = (x_1, \ldots, x_n)^\top$ is defined as

$$\|\mathbf{x}\|_{\varepsilon} = \sum_{i=1}^{n} \min(|x_i|, \varepsilon). \qquad (1)$$

Note that, strictly speaking, the capped L1-norm is not an actual norm: though it satisfies positivity and the triangle inequality, it does not satisfy homogeneity. It can nevertheless be used as a distance or similarity measure, and this does not affect its robustness property. From definition (1), it is seen that when encountering outliers, the capped L1-norm clips entries whose magnitudes exceed $\varepsilon$ and thereby bounds their contribution. Therefore, $\varepsilon$ is a thresholding parameter that identifies the extreme data outliers that would otherwise distort recognition results, and it is what brings robustness. Compared to the existing proximal support vector machines, which use the squared L2-norm or the L1-norm, the CPSVM proposed in this paper improves the robustness to outliers. Specifically, CPSVM has the following characteristics:

(i) The application of the capped L1-norm makes CPSVM robust to data outliers and feature noise. Specifically, the capping parameter $\varepsilon$ helps identify data outliers, and the L1-norm inside the capped L1-norm resists sensitivity to feature noise. To enhance the generalization ability of CPSVM and prevent overfitting, an L2-norm regularization term is also introduced.

(ii) An effective iterative algorithm is designed to solve the proposed nonsmooth and nonconvex optimization problem. In each iteration, it solves two generalized eigenvalue problems.

(iii) The effectiveness and robustness of CPSVM are demonstrated by experiments on an artificial dataset, some UCI datasets, and a handwritten digit image database.
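To make definition (1) concrete, the following is a minimal Python sketch (the function name and example values are ours, not part of the paper):

```python
import numpy as np

def capped_l1_norm(x, eps):
    """Capped L1-norm: sum_i min(|x_i|, eps).
    Entries whose magnitudes exceed eps contribute only eps,
    so a single extreme outlier cannot dominate the measure."""
    return np.minimum(np.abs(x), eps).sum()

# One extreme entry (100.0) is clipped to eps = 2.0:
x = np.array([0.5, -1.2, 100.0])
print(capped_l1_norm(x, eps=2.0))  # 0.5 + 1.2 + 2.0 = 3.7
```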
The paper is organized as follows. Section 2 briefly reviews the related proximal support vector machines. Section 3 proposes CPSVM and its solving algorithm. Section 4 makes comparisons of CPSVM with its related approaches. Concluding remarks are given in Section 5.
The notation of the paper is as follows. All vectors are column vectors, and vectors and matrices are shown in bold. Consider the training dataset $T = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$ with the associated class labels $y_i$ belonging to $\{1, 2\}$, where $\mathbf{x}_i \in \mathbb{R}^n$ for $i = 1, \ldots, N$. Write $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N]$ as the corresponding data matrix of $T$. Assume that the $i$-th class contains $N_i$ samples, $i = 1, 2$. We organize the inputs of Class 1 into a matrix $\mathbf{A} \in \mathbb{R}^{N_1 \times n}$ and the inputs of Class 2 into a matrix $\mathbf{B} \in \mathbb{R}^{N_2 \times n}$, where $N_1 + N_2 = N$. Let $\bar{\mathbf{x}}$ be the mean of all samples and let $\bar{\mathbf{x}}^{(i)} = \frac{1}{N_i} \sum_{j=1}^{N_i} \mathbf{x}_j^{(i)}$ be the mean of the samples in the $i$-th class, $i = 1, 2$, where $\mathbf{x}_j^{(i)}$ is the $j$-th sample in the $i$-th class. For a vector $\mathbf{x} \in \mathbb{R}^n$, its L1-norm is defined as $\|\mathbf{x}\|_1 = \sum_{i=1}^{n} |x_i|$, and its L2-norm is defined as $\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$. For simplicity, we write $\|\mathbf{x}\|_2$ as $\|\mathbf{x}\|$ for brevity.
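As a small illustration of this setup (the helper name is ours), the class matrices $\mathbf{A}$ and $\mathbf{B}$ can be assembled from a labeled dataset as follows:

```python
import numpy as np

def split_classes(X, y):
    """Split a data matrix X (one sample per row) with labels y in {1, 2}
    into the class matrices A (Class 1, N1 x n) and B (Class 2, N2 x n)."""
    A = X[y == 1]
    B = X[y == 2]
    return A, B
```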
2. Related Works
2.1. GEPSVM
GEPSVM [2] aims to find two nonparallel hyperplanes

$$\mathbf{w}_1^\top \mathbf{x} + b_1 = 0 \quad \text{and} \quad \mathbf{w}_2^\top \mathbf{x} + b_2 = 0, \qquad (2)$$

such that the $i$-th hyperplane is close to the samples of the $i$-th class and meanwhile far away from the samples of the other class, where $\mathbf{w}_i \in \mathbb{R}^n$ and $b_i \in \mathbb{R}$, $i = 1, 2$. The optimization problems of GEPSVM are formulated as

$$\min_{(\mathbf{w}_1, b_1) \neq 0} \frac{\|\mathbf{A}\mathbf{w}_1 + b_1 \mathbf{e}\|^2 + \delta \|[\mathbf{w}_1; b_1]\|^2}{\|\mathbf{B}\mathbf{w}_1 + b_1 \mathbf{e}\|^2}, \qquad (3)$$

$$\min_{(\mathbf{w}_2, b_2) \neq 0} \frac{\|\mathbf{B}\mathbf{w}_2 + b_2 \mathbf{e}\|^2 + \delta \|[\mathbf{w}_2; b_2]\|^2}{\|\mathbf{A}\mathbf{w}_2 + b_2 \mathbf{e}\|^2}, \qquad (4)$$

where $\mathbf{e}$ is the all-one vector of an appropriate dimension and $\delta > 0$ is a regularization parameter.
To simplify the notation, we introduce $\mathbf{G} = [\mathbf{A}\ \mathbf{e}]^\top [\mathbf{A}\ \mathbf{e}]$ and $\mathbf{H} = [\mathbf{B}\ \mathbf{e}]^\top [\mathbf{B}\ \mathbf{e}]$, and write $\mathbf{z}_i = [\mathbf{w}_i; b_i]$, $i = 1, 2$. Then, the solutions of GEPSVM can be obtained via the following generalized eigenvalue problems:

$$(\mathbf{G} + \delta \mathbf{I}) \mathbf{z}_1 = \lambda \mathbf{H} \mathbf{z}_1, \quad (\mathbf{H} + \delta \mathbf{I}) \mathbf{z}_2 = \lambda \mathbf{G} \mathbf{z}_2, \qquad (5)$$

where $\lambda$ is the eigenvalue of (5) and $\mathbf{I}$ is the identity matrix of an appropriate dimension; each $\mathbf{z}_i$ is given by the eigenvector corresponding to the smallest eigenvalue of its problem.
After obtaining the optimal $(\mathbf{w}_1, b_1)$ and $(\mathbf{w}_2, b_2)$, an unseen sample $\mathbf{x}$ is classified according to

$$\text{Class}(\mathbf{x}) = \arg\min_{i = 1, 2} \frac{|\mathbf{w}_i^\top \mathbf{x} + b_i|}{\|\mathbf{w}_i\|}. \qquad (6)$$
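The following is a minimal sketch of solving (5) and classifying via (6), assuming NumPy and SciPy; the function names are ours:

```python
import numpy as np
from scipy.linalg import eig

def gepsvm_plane(A, B, delta):
    """Solve min z^T (G + delta I) z / z^T H z for z = [w; b],
    i.e., the plane proximal to the rows of A and far from B."""
    Ae = np.hstack([A, np.ones((A.shape[0], 1))])   # [A e]
    Be = np.hstack([B, np.ones((B.shape[0], 1))])   # [B e]
    G = Ae.T @ Ae + delta * np.eye(Ae.shape[1])
    H = Be.T @ Be
    vals, vecs = eig(G, H)                          # generalized problem G z = lambda H z
    vals = np.real(vals)
    idx = np.argmin(np.where(vals > 1e-12, vals, np.inf))  # smallest positive eigenvalue
    z = np.real(vecs[:, idx])
    return z[:-1], z[-1]                            # w, b

def gepsvm_predict(x, plane1, plane2):
    """Assign x to the class whose hyperplane is nearest, as in (6)."""
    dists = [abs(w @ x + b) / np.linalg.norm(w) for (w, b) in (plane1, plane2)]
    return 1 + int(np.argmin(dists))

# Usage: plane1 = gepsvm_plane(A, B, delta); plane2 = gepsvm_plane(B, A, delta)
```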
2.2. L1-NPSVM
Since the squared L2-norm is very sensitive to noise, a robust L1-norm support vector machine (L1-NPSVM) was proposed in [15] to improve the robustness. Following the construction of (3) and (4) with the squared L2-norm terms replaced by the corresponding L1-norm terms, its problems are formulated as

$$\max_{(\mathbf{w}_1, b_1) \neq 0} \frac{\|\mathbf{B}\mathbf{w}_1 + b_1 \mathbf{e}\|_1}{\|\mathbf{A}\mathbf{w}_1 + b_1 \mathbf{e}\|_1}, \quad \max_{(\mathbf{w}_2, b_2) \neq 0} \frac{\|\mathbf{A}\mathbf{w}_2 + b_2 \mathbf{e}\|_1}{\|\mathbf{B}\mathbf{w}_2 + b_2 \mathbf{e}\|_1}. \qquad (7)$$
By introducing a nonconvex surrogate function, (7) is solved by a gradient ascent (GA) algorithm. However, the nonconvexity of the surrogate function may cause the algorithm to miss an optimal solution. Besides, an appropriate step size also needs to be chosen carefully.
After obtaining the optimal $(\mathbf{w}_1, b_1)$ and $(\mathbf{w}_2, b_2)$, an unseen sample $\mathbf{x}$ is classified according to

$$\text{Class}(\mathbf{x}) = \arg\min_{i = 1, 2} \frac{|\mathbf{w}_i^\top \mathbf{x} + b_i|}{\|\mathbf{w}_i\|_1}. \qquad (8)$$
3. The Proposed Capped L1-Norm Proximal Support Vector Machine
3.1. Problem Formulation
As seen above, the squared L2-norm metric in GEPSVM amplifies the influence of outliers, which affects the construction of the optimal classification hyperplanes. Though L1-NPSVM aims at more robust performance on problems with outliers, it may still perform poorly when the outliers or noise are large enough. In contrast, the capped L1-norm metric involves an upper bound and can effectively control the influence of extreme outliers. Therefore, we propose the following capped L1-norm proximal support vector machine (CPSVM), whose optimization problems are expressed as

$$\min_{(\mathbf{w}_1, b_1) \neq 0} \frac{\|\mathbf{A}\mathbf{w}_1 + b_1 \mathbf{e}\|_{\varepsilon_1} + \delta \|[\mathbf{w}_1; b_1]\|^2}{\|\mathbf{B}\mathbf{w}_1 + b_1 \mathbf{e}\|_{\varepsilon_2}}, \qquad (9)$$

$$\min_{(\mathbf{w}_2, b_2) \neq 0} \frac{\|\mathbf{B}\mathbf{w}_2 + b_2 \mathbf{e}\|_{\varepsilon_1} + \delta \|[\mathbf{w}_2; b_2]\|^2}{\|\mathbf{A}\mathbf{w}_2 + b_2 \mathbf{e}\|_{\varepsilon_2}}, \qquad (10)$$

where $\delta > 0$ is a regularization parameter and $\|\cdot\|_{\varepsilon_1}$, $\|\cdot\|_{\varepsilon_2}$ are capped L1-norms as in (1). Minimizing the first term in the numerator of (9) makes the samples of Class 1 as close as possible to the hyperplane $\mathbf{w}_1^\top \mathbf{x} + b_1 = 0$. Maximizing the denominator of (9) forces the samples of Class 2 far away from this hyperplane. Both distances are measured by the capped L1-norm, which reduces the negative influence of noise and outliers, owing to the L1-norm sum over the "capped" terms on the two classes. These two terms ensure that CPSVM separates the two classes well. Furthermore, to obtain a better generalization performance, CPSVM also minimizes a regularization term, the second term in the numerator, which controls the model complexity and avoids overfitting. The optimization problem (10) has a similar geometric meaning to (9). After the optimal $(\mathbf{w}_1, b_1)$ and $(\mathbf{w}_2, b_2)$ are obtained, a new sample $\mathbf{x}$ is assigned by

$$\text{Class}(\mathbf{x}) = \arg\min_{i = 1, 2} \frac{|\mathbf{w}_i^\top \mathbf{x} + b_i|}{\|\mathbf{w}_i\|}. \qquad (11)$$
To better understand the effect of the capped L1-norm, we transform (9) and (10) into different formulations. Firstly, write the explicit forms of (9) and (10) as

$$\min_{(\mathbf{w}_1, b_1) \neq 0} \frac{\sum_{i=1}^{N_1} \min(|\mathbf{w}_1^\top \mathbf{x}_i^{(1)} + b_1|, \varepsilon_1) + \delta \|[\mathbf{w}_1; b_1]\|^2}{\sum_{j=1}^{N_2} \min(|\mathbf{w}_1^\top \mathbf{x}_j^{(2)} + b_1|, \varepsilon_2)}, \qquad (12)$$

$$\min_{(\mathbf{w}_2, b_2) \neq 0} \frac{\sum_{j=1}^{N_2} \min(|\mathbf{w}_2^\top \mathbf{x}_j^{(2)} + b_2|, \varepsilon_1) + \delta \|[\mathbf{w}_2; b_2]\|^2}{\sum_{i=1}^{N_1} \min(|\mathbf{w}_2^\top \mathbf{x}_i^{(1)} + b_2|, \varepsilon_2)}. \qquad (13)$$
Further, we transform (12) into a GEPSVM-like formulation. Introduce

$$d_i^{(1)} = \frac{\mathbb{I}\left(|\mathbf{w}_1^\top \mathbf{x}_i^{(1)} + b_1| \le \varepsilon_1\right)}{2\,|\mathbf{w}_1^\top \mathbf{x}_i^{(1)} + b_1|}, \quad i = 1, \ldots, N_1,$$

where $\mathbb{I}(\cdot)$ is the indicator function satisfying $\mathbb{I}(u) = 1$ if $u$ holds and 0 otherwise, and let $\mathbf{D}_1$ be the diagonal matrix with its $i$-th diagonal element $d_i^{(1)}$. Then, the numerator of (12) is equivalent to

$$[\mathbf{w}_1; b_1]^\top \left( [\mathbf{A}\ \mathbf{e}]^\top \mathbf{D}_1 [\mathbf{A}\ \mathbf{e}] + \delta \mathbf{I} \right) [\mathbf{w}_1; b_1] + c_1, \qquad (14)$$

where $c_1$ is some constant. Similarly, denote

$$d_j^{(2)} = \frac{\mathbb{I}\left(|\mathbf{w}_1^\top \mathbf{x}_j^{(2)} + b_1| \le \varepsilon_2\right)}{2\,|\mathbf{w}_1^\top \mathbf{x}_j^{(2)} + b_1|}, \quad j = 1, \ldots, N_2,$$

and let $\mathbf{D}_2$ be the diagonal matrix with its $j$-th diagonal element $d_j^{(2)}$; then, the denominator of (12) is equivalent to

$$[\mathbf{w}_1; b_1]^\top [\mathbf{B}\ \mathbf{e}]^\top \mathbf{D}_2 [\mathbf{B}\ \mathbf{e}] [\mathbf{w}_1; b_1] + c_2, \qquad (15)$$

where $c_2$ is some constant. Therefore, CPSVM (9) is recast as

$$\min_{\mathbf{z}_1 \neq 0} \frac{\mathbf{z}_1^\top \mathbf{G}_1 \mathbf{z}_1}{\mathbf{z}_1^\top \mathbf{H}_1 \mathbf{z}_1}, \quad \mathbf{G}_1 = [\mathbf{A}\ \mathbf{e}]^\top \mathbf{D}_1 [\mathbf{A}\ \mathbf{e}] + \delta \mathbf{I}, \quad \mathbf{H}_1 = [\mathbf{B}\ \mathbf{e}]^\top \mathbf{D}_2 [\mathbf{B}\ \mathbf{e}], \qquad (16)$$

with $\mathbf{z}_1 = [\mathbf{w}_1; b_1]$.
Similarly, (10) is recast as

$$\min_{\mathbf{z}_2 \neq 0} \frac{\mathbf{z}_2^\top \mathbf{G}_2 \mathbf{z}_2}{\mathbf{z}_2^\top \mathbf{H}_2 \mathbf{z}_2}, \quad \mathbf{G}_2 = [\mathbf{B}\ \mathbf{e}]^\top \mathbf{D}_3 [\mathbf{B}\ \mathbf{e}] + \delta \mathbf{I}, \quad \mathbf{H}_2 = [\mathbf{A}\ \mathbf{e}]^\top \mathbf{D}_4 [\mathbf{A}\ \mathbf{e}], \qquad (17)$$

where $\mathbf{z}_2 = [\mathbf{w}_2; b_2]$, and the diagonal matrices $\mathbf{D}_3$ and $\mathbf{D}_4$ are defined similarly to $\mathbf{D}_1$ and $\mathbf{D}_2$.
By observing (16) and (17), one sees that CPSVM can be viewed as a weighted GEPSVM. A key advantage of the weights in CPSVM is that they improve robustness to outliers and noise. On the one hand, the L1-norm itself brings robustness to noise. On the other hand, a sample whose projected magnitude exceeds $\varepsilon$ receives weight 0 and is no longer used; such samples are treated as outliers. Samples with relatively large projected magnitudes below $\varepsilon$ are assigned small weights and therefore do not exert much influence. This shows that CPSVM is robust to outliers and noise.
3.2. The Solving Algorithm of CPSVM
In the following, we solve (16); (17) can be solved similarly. Formally, (16) is a generalized eigenvalue problem. However, $\mathbf{G}_1$ and $\mathbf{H}_1$ depend on $\mathbf{D}_1$ and $\mathbf{D}_2$, respectively, and hence on $\mathbf{z}_1$ itself, so (16) cannot be solved directly. For fixed $\mathbf{D}_1$ and $\mathbf{D}_2$, however, (16) is a standard generalized eigenvalue problem. Therefore, we employ an iterative technique to solve it.
Specifically, we first initialize $\mathbf{w}_1^{(0)}$ and $b_1^{(0)}$. Then, in the $t$-th iteration, $\mathbf{G}_1^{(t)}$ and $\mathbf{H}_1^{(t)}$ are computed by

$$\mathbf{G}_1^{(t)} = [\mathbf{A}\ \mathbf{e}]^\top \mathbf{D}_1^{(t)} [\mathbf{A}\ \mathbf{e}] + \delta \mathbf{I}, \quad \mathbf{H}_1^{(t)} = [\mathbf{B}\ \mathbf{e}]^\top \mathbf{D}_2^{(t)} [\mathbf{B}\ \mathbf{e}], \qquad (18)$$

where the diagonal elements of $\mathbf{D}_1^{(t)}$ and $\mathbf{D}_2^{(t)}$ are given by

$$d_i^{(1,t)} = \frac{\mathbb{I}\left(|\mathbf{w}_1^{(t)\top} \mathbf{x}_i^{(1)} + b_1^{(t)}| \le \varepsilon_1\right)}{2\,|\mathbf{w}_1^{(t)\top} \mathbf{x}_i^{(1)} + b_1^{(t)}|}, \quad d_j^{(2,t)} = \frac{\mathbb{I}\left(|\mathbf{w}_1^{(t)\top} \mathbf{x}_j^{(2)} + b_1^{(t)}| \le \varepsilon_2\right)}{2\,|\mathbf{w}_1^{(t)\top} \mathbf{x}_j^{(2)} + b_1^{(t)}|}, \qquad (19)$$

where $i = 1, \ldots, N_1$ and $j = 1, \ldots, N_2$.
After obtaining $\mathbf{G}_1^{(t)}$ and $\mathbf{H}_1^{(t)}$, the next iterate $\mathbf{z}_1^{(t+1)}$ is given by the eigenvector corresponding to the smallest nonzero eigenvalue of the generalized eigenvalue problem $\mathbf{G}_1^{(t)} \mathbf{z} = \lambda \mathbf{H}_1^{(t)} \mathbf{z}$. After reaching the maximum iteration number or convergence, the optimal $\mathbf{z}_1 = [\mathbf{w}_1; b_1]$ is set to the final iterate. $\mathbf{z}_2 = [\mathbf{w}_2; b_2]$ is obtained similarly.
We summarize the solving procedure of CPSVM in Algorithm 1.
[Algorithm 1: the solving algorithm of CPSVM — initialize $\mathbf{z}_1^{(0)}$; repeat the weight update (19), the matrix construction (18), and the generalized eigenvalue step until convergence or the maximum iteration number; output $\mathbf{z}_1$ and, analogously, $\mathbf{z}_2$.]
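The following Python sketch illustrates Algorithm 1 for the first plane (the second plane swaps the roles of $\mathbf{A}$ and $\mathbf{B}$). It assumes NumPy/SciPy and the weight formula (19); the small guard against division by zero is an implementation choice of ours, not part of the paper:

```python
import numpy as np
from scipy.linalg import eig

def cpsvm_plane(A, B, delta, eps1, eps2, max_iter=50, tol=1e-6):
    """Iteratively reweight via (19) and solve G1 z = lambda H1 z, cf. (16)."""
    Ae = np.hstack([A, np.ones((A.shape[0], 1))])        # [A e]
    Be = np.hstack([B, np.ones((B.shape[0], 1))])        # [B e]
    m = Ae.shape[1]
    z = np.ones(m) / np.sqrt(m)                          # initial [w; b]
    for _ in range(max_iter):
        rA, rB = np.abs(Ae @ z), np.abs(Be @ z)          # projected magnitudes
        guard = 1e-8                                      # avoid division by zero
        d1 = np.where(rA <= eps1, 1.0 / (2 * np.maximum(rA, guard)), 0.0)
        d2 = np.where(rB <= eps2, 1.0 / (2 * np.maximum(rB, guard)), 0.0)
        G1 = Ae.T @ (d1[:, None] * Ae) + delta * np.eye(m)   # (18)
        H1 = Be.T @ (d2[:, None] * Be)
        vals, vecs = eig(G1, H1)
        vals = np.real(vals)
        idx = np.argmin(np.where(vals > 1e-12, vals, np.inf))
        z_new = np.real(vecs[:, idx])
        if z_new @ z < 0:                                 # align eigenvector sign
            z_new = -z_new
        if np.linalg.norm(z_new - z) < tol * np.linalg.norm(z):
            z = z_new
            break
        z = z_new
    return z[:-1], z[-1]                                  # w1, b1
```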
4. Experiments
In this section, we experimentally compare the proposed CPSVM with GEPSVM [2], L1-GEPSVM [16], PCC [14], L1-NPSVM [15], IGEPSVM [13], LpNPSVM [12], and GLpNPSVM [21]. Experiments are conducted on an artificial dataset with outliers, some benchmark datasets [34], and a handwritten digit image dataset. All methods are implemented in the MATLAB R2019a environment on a PC with an Intel i5 processor (3.30 GHz) and 4 GB RAM. The regularization parameters in GEPSVM, L1-GEPSVM, PCC, IGEPSVM, LpNPSVM, and GLpNPSVM, the additional parameter in GLpNPSVM, the learning rate in L1-NPSVM (chosen as suggested in [15]), and the parameter $p$ in LpNPSVM and GLpNPSVM are all selected from predefined candidate sets.
For CPSVM, choosing an appropriate $\varepsilon$ in real experiments is difficult. In fact, by observing the construction of the CPSVM model, we see that $\varepsilon$ "caps" the projected data. Therefore, it is not only related to the data involved but should also be affected by the current $\mathbf{w}$ and $b$ during each iteration. Thus, it is reasonable to update $\varepsilon$ during each iteration. Suppose the projected data contain $q\%$ outliers; then, we may set $\varepsilon$ to the value above which the $q\%$ largest projected magnitudes lie. In this situation, we implicitly use $(1 - q\%)$ of the data to compute $\mathbf{G}_1$ or $\mathbf{H}_1$, so $1 - q\%$ is selected from a candidate set of percentages.
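A hedged sketch of this adaptive thresholding (the function name is ours; the percentile-based rule follows the description above):

```python
import numpy as np

def update_eps(Ae, z, q):
    """Set eps so that roughly the q% largest projected magnitudes
    |w^T x + b| exceed it and are therefore capped as outliers."""
    proj = np.abs(Ae @ z)
    return np.percentile(proj, 100 - q)
```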
4.1. Artificial Data
We first apply CPSVM to a simple two-dimensional artificial dataset. The dataset contains two classes, Class 1 and Class 2, each with 180 data points. Class 1 is generated from one line with variance 2, and Class 2 is generated from another line with variance 2. Class 1 is represented by red "o" markers and Class 2 by blue "o" markers. 50% of the samples of each class are randomly selected for training, and the rest are used for testing. To explore the robustness of the proposed method, 16 noise samples per class are added to the training data; the noise of Class 1 is shown by separate red markers and that of Class 2 by separate blue markers. Denote by "Ori" the set that contains only the original training data, by "Noi1" the set containing the original training data plus the noise samples of Class 1, and by "Noi2" the set containing the original training data plus the noise samples of both classes. For each class $i$ ($i = 1, 2$), the optimal approximating planes obtained from these three kinds of training sets are denoted accordingly. The training data and the outliers are shown in Figure 1(a), and the test data are shown in Figure 1(b).
[Figure 1: (a) training data with outliers; (b) test data.]
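Since the exact line equations are not reproduced above, the following generation sketch uses placeholder slopes and intercepts; only the sample counts, the variance of 2, and the 16 added noise points per class follow the description:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 180                                   # points per class

# Placeholder lines y = x + c; the paper's actual coefficients differ.
x1 = rng.uniform(-5, 5, n)
x2 = rng.uniform(-5, 5, n)
class1 = np.column_stack([x1, x1 + 1 + rng.normal(0, np.sqrt(2), n)])
class2 = np.column_stack([x2, x2 - 1 + rng.normal(0, np.sqrt(2), n)])

# 16 noise samples per class, shifted far away to act as outliers.
noise1 = class1[:16] + np.array([0.0, 20.0])
noise2 = class2[:16] - np.array([0.0, 20.0])
```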
We apply CPSVM as well as GEPSVM, L1-GEPSVM, IGEPSVM, PCC, L1-NPSVM, LpNPSVM, and GLpNPSVM to these three kinds of training sets and obtain their optimal approximating planes. Table 1 shows the test classification accuracy for the three training sets. Figures 2–4 show the approximating planes for the "Ori," "Noi1," and "Noi2" training data, respectively, with the misclassified test data also shown. From the results in Table 1, we see that the proposed CPSVM achieves the best classification performance in all three cases. To further evaluate the methods, we calculate, over all test points, the average ratio of the distance to the approximating plane of their own class to the distance to the approximating plane of the other class. Clearly, the smaller this ratio, the better. The results, shown in Figure 5, demonstrate that CPSVM has the best generalization ability on noisy data compared with the other methods.
[Figure 2: approximating planes of the eight methods on the "Ori" training set, panels (a)–(h).]
[Figure 3: approximating planes of the eight methods on the "Noi1" training set, panels (a)–(h).]
[Figure 4: approximating planes of the eight methods on the "Noi2" training set, panels (a)–(h).]
[Figure 5: average distance ratios of the test points for each method.]
4.2. Benchmark UCI Datasets
In this section, the proposed CPSVM and the comparison methods are applied to 18 small-size benchmark datasets and 9 large-size benchmark datasets. 10-fold cross-validation is used to search for the optimal parameters, and the test classification accuracy averaged over 10 runs is reported for each method.
4.2.1. Small-Size UCI Datasets
This subsection first compares the methods on some small-size UCI datasets, whose information is listed in Table 2. Classification accuracies along with standard deviations for all methods on the original datasets are listed in Table 3, where "Acc" is short for accuracy and "Std" for standard deviation. By observing the results in Table 3, we see that the proposed CPSVM outperforms the other methods on most of these small-size datasets; on the remaining ones, it performs comparably to the method with the highest accuracy in most cases.
To test the robustness of the proposed CPSVM, we consider noise-contaminated data. Specifically, a randomly chosen 30% of the features of a randomly chosen 10%, 20%, and 30% of the samples are set to 0. The classification results on the noisy data are listed in Tables 4–6.
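For concreteness, the noise injection just described can be sketched as follows (the function name and seed are ours):

```python
import numpy as np

def add_feature_noise(X, sample_frac, feature_frac=0.3, seed=0):
    """Zero out a random feature_frac of the features of a random
    sample_frac of the samples, as in the robustness experiments."""
    rng = np.random.default_rng(seed)
    Xn = X.copy()
    rows = rng.choice(X.shape[0], int(sample_frac * X.shape[0]), replace=False)
    for r in rows:
        cols = rng.choice(X.shape[1], int(feature_frac * X.shape[1]), replace=False)
        Xn[r, cols] = 0.0
    return Xn
```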
By observing the results in these tables, we can see that on small-size datasets: (i) the performance of all methods declines after adding noise, and the higher the noise percentage, the worse the performance; (ii) compared with the other methods, the proposed CPSVM is less affected by noise and performs better on most datasets; (iii) the optimal $\varepsilon$ may differ for different levels of outliers, and more outliers may require a smaller $\varepsilon$, because $\varepsilon$ is set to exclude the $q\%$ outliers; on data without noise or outliers, the percentage $1 - q\%$ should be close to 1; (iv) the proposed CPSVM deals well with small-size and noisy datasets. To better compare the methods, we compute the rank of each method according to its accuracy, with the results shown in Table 7. In the table, for each dataset and each method, four numbers represent the ranks of the method on the original data and on the corresponding three types of noisy data. For example, 5/5/3/6 for the BUPA data and the classifier GEPSVM means that GEPSVM ranks 5th on the original BUPA data and 5th, 3rd, and 6th on the three noisy BUPA datasets. It can be seen that CPSVM has the highest rank on most datasets and is competitive on the others. In addition, CPSVM has a higher average rank in the noisy cases than on the original data, which shows the robustness of CPSVM.
4.2.2. Large-Size UCI Datasets
To see the performance of the proposed CPSVM on relatively large-size data, we then test these methods on the large-size UCI datasets shown in Table 8. The classification accuracies along with standard deviations on the original data are listed in Table 9. By observing the results in Table 9, we see that the proposed CPSVM outperforms the other methods on some of the datasets; on the others, it performs comparably to the classifier with the highest accuracy.
To further test the robustness of the proposed CPSVM on large-size datasets, we consider the noise-contaminated data the same as on the small-size datasets. The classification results on these noise data are listed in Tables 10–12. The corresponding accuracy ranks are shown in Table 13.
By observing the results in these tables, we can see that CPSVM behaves on relatively large-size datasets much as it does on small-size ones. Specifically, noise influences all methods, but CPSVM is influenced less than the others. In addition, the optimal $1 - q\%$ still plays an important role in CPSVM and may vary across datasets and noise levels. The results in Table 13 confirm the superiority of the proposed method.
4.3. MNIST Handwritten Dataset
In this subsection, we apply the proposed CPSVM to MNIST handwritten symbol recognition. To test robustness, we also pollute the data by adding black or Gaussian rectangular blocks to the training data. The MNIST database consists of grayscale images of handwritten digits from 0 to 9, and the size of each image is 16 × 16 pixels with 256 gray levels, as shown in Figure 6(a). We choose three pairs of digits (0 vs 6, 0 vs 9, and 6 vs 9) for our comparisons because the digits in these pairs are similar and not easy to distinguish. Each dataset is randomly divided into two subsets: 50% of the images for training and 50% for testing. Then, to generate outliers, we randomly select 10% of the images from the training set and add black block noise or Gaussian block noise covering 10%, 20%, 30%, or 40% of the image area. Figure 6(b) shows sample images with black block noise of 30% area, and Figure 6(c) shows sample images with Gaussian block noise of 30% area. These datasets are denoted as "Ori," "B10," "B20," "B30," "B40," "G10," "G20," "G30," and "G40," where 10, 20, 30, and 40 represent the block percentage, and B and G represent black and Gaussian blocks, respectively.
[Figure 6: (a) sample MNIST images; (b) samples with black block noise of 30% area; (c) samples with Gaussian block noise of 30% area.]
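A hedged sketch of the block-noise pollution described above (the block placement, the Gaussian parameters, and the function name are our assumptions for illustration):

```python
import numpy as np

def add_block_noise(img, area_frac, kind="black", seed=0):
    """Place one square block covering roughly area_frac of a grayscale
    image; 'black' sets pixels to 0, 'gaussian' fills them with noise."""
    rng = np.random.default_rng(seed)
    out = img.astype(float).copy()
    h, w = img.shape
    side = max(1, int(np.sqrt(area_frac) * min(h, w)))   # square block side
    top = rng.integers(0, h - side + 1)
    left = rng.integers(0, w - side + 1)
    if kind == "black":
        out[top:top + side, left:left + side] = 0.0
    else:
        out[top:top + side, left:left + side] = rng.normal(128, 50, (side, side)).clip(0, 255)
    return out
```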
The recognition results of all methods on these datasets are listed in Tables 14–19. For the black block noise case, the results of L1-GEPSVM, IGEPSVM, LpNPSVM, GLpNPSVM, and CPSVM are relatively close, but none of the first four is better than CPSVM in most cases. As the area of the black block noise increases, L1-GEPSVM, IGEPSVM, LpNPSVM, and GLpNPSVM are affected more seriously than CPSVM, which shows the robustness of CPSVM. For the Gaussian noise experiment, the performance of CPSVM is the best, or comparable to the best method, in most cases. In addition, as the noise level increases, the performance of CPSVM remains stable, while the noise has a greater impact on the other methods.
5. Conclusion
In this paper, we propose a capped L1-norm proximal support vector machine (CPSVM). The capped L1-norm brings robustness to CPSVM, and CPSVM can be viewed as a weighted GEPSVM. Experimental results show that, compared to related SVMs, CPSVM can effectively remove extreme outliers and suppress the influence of noisy data. However, CPSVM requires solving a series of generalized eigenvalue problems, which may be very slow on very large-scale data; designing a more efficient algorithm is therefore one of our future works. In addition, extending CPSVM to the kernel case or to matrix and tensor data is also interesting. Another interesting direction is its potential improvement via type-3 fuzzy logic systems [35]: the fuzzy technique balances the importance of the samples and hence yields robustness, so the robustness of CPSVM could be further improved by incorporating type-3 fuzzy logic systems into a fuzzy CPSVM.
Data Availability
The data that support the findings of this study are openly available in UCI repository at https://archive.ics.uci.edu/ml/index.php. Data can be obtained from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the Hainan Provincial Natural Science Foundation of China (nos. 620QN234 and 120RC449), the National Natural Science Foundation of China (nos. 62066012, 12271131, 11871183, and 61866010), the Key Laboratory of Engineering Modeling and Statistical Computation of Hainan Province, and the specific research fund of The Innovation Platform for Academicians of Hainan Province.