Abstract

One-class support vector machine (OCSVM) is one of the most popular algorithms for one-class classification, but it has an obvious disadvantage: it is sensitive to noise. To address this problem, fuzzy membership degrees are introduced into OCSVM, so that samples of different importance have different influences on the classification hyperplane, which enhances robustness. In this paper, a new method for calculating membership degrees is proposed and introduced into the fuzzy multiple kernel OCSVM (FMKOCSVM). The combined kernel is used to measure the local similarity between samples; the importance of each sample is then determined from the local similarity between training samples, which yields the membership degree and reduces the impact of noise. The proposed membership requires only positive data in its calculation, which is consistent with the training set of OCSVM. In this method, noise receives a smaller membership value, which reduces its negative impact on the classification boundary. At the same time, this method of calculating membership is computationally efficient. The experimental results show that FMKOCSVM based on the proposed local similarity membership is efficient and more robust to outliers than ordinary multiple kernel OCSVMs.

1. Introduction

Anomaly detection is an important aspect of data mining. It is used to find objects in a data set that differ significantly from the rest of the data, with the aim of preventing abnormal events. The application of anomaly detection in medicine and biological systems is of great significance, and it has been successfully applied to protein detection [1], cancer screening [2], and health monitoring [3]. In essence, anomaly detection is a classification task on data with extremely imbalanced classes. Complex biological systems usually have this feature. For example, the data of an infectious disease model may include feature data of both patients and nonpatients, but in real life there are far more healthy people than patients. Timely and effective detection of patients with infectious diseases is an effective way to prevent outbreaks.

Support vector machine (SVM) [4, 5] is a classical classification algorithm, but its performance deteriorates on one-class classification problems or data with imbalanced distributions. Solutions to one-class classification problems include density estimation-based methods and support vector-based methods. Support vector-based methods are popular because of their simplicity and efficiency. There are two such models: (1) the one-class support vector machine (OCSVM) [6] and (2) support vector data description (SVDD) [7]. The goal of SVDD is to find a minimum hypersphere that contains all target samples. The main idea of OCSVM is to take the origin of the feature space as a representative of abnormal data and then separate the target samples from the origin with maximum margin. This paper focuses on OCSVM.

Like SVM, OCSVM is sensitive to noise because it assumes that each sample has the same importance or weight during training. Introducing fuzzy membership into SVM to construct the fuzzy support vector machine (FSVM) [8] is one effective remedy. Existing methods for calculating fuzzy membership mainly focus on two-class classification problems [9–11]. For example, a heuristic function derived from centered kernel alignment is used to measure the dependency between a data point and its label and thereby calculate the fuzzy membership [9]. In [10], the membership degree of each sample point is determined by the lower approximation operator of a fuzzy rough set based on the Gaussian kernel. In [11], entropy is used to measure the class determinacy of samples, and samples with higher class determinacy are assigned larger fuzzy memberships. More generally, to improve the robustness of OCSVM, different weights can be assigned to the training samples, yielding the weighted one-class support vector machine (WOCSVM) [12–14]. WOCSVM reduces the impact of noise by assigning lower weights to noisy samples [12]. In [14], prior knowledge is used to assign different weights to the samples; each weight depends only on the distribution of the instance's k-nearest neighbors. In recent years, there have been many studies on calculating sample weights in one-class classification problems [15–18]. In [19], membership based on fuzzy rough set theory [20, 21] is used as a weight in OCSVM.

The above methods improve the robustness of OCSVM to a certain extent but have some limitations. For example, the method of [14] becomes inefficient when calculating sample weights on large data sets, and the authors of [19] require abnormal data when calculating sample memberships. In this paper, a novel strategy is proposed to solve the problem of the poor robustness of OCSVM; that is, a membership degree is introduced into the model. Unlike the above membership calculation methods, our method uses only one class of data, which fully matches the characteristics of OCSVM. The proposed membership calculation is related to the local density of the data, which is obtained from the local similarity of the training data. We take an S-type function based on the local density as the membership function.

OCSVM uses the kernel trick to handle nonlinearly separable data, but this raises the problem of kernel selection. Multiple kernel learning [22–26] is used to solve this problem, yielding the multiple kernel one-class support vector machine (MKOCSVM) [27].

The main work of this paper is as follows:
(1) Multiple kernel learning and membership degrees are introduced into OCSVM simultaneously, and the fuzzy multiple kernel one-class support vector machine (FMKOCSVM) model is constructed to address both kernel selection and noise sensitivity.
(2) A novel membership calculation method based on local similarity is proposed.
(3) The effectiveness of this membership degree is illustrated graphically.
(4) The weight coefficients of the multiple kernels are determined by maximizing the similarity between the combined kernel and the ideal kernel.
(5) Experiments show that the proposed method performs better than existing fuzzy membership methods and the model without membership.

A combined kernel characterizes the data more fully than a single kernel. Using multiple kernel functions simultaneously avoids the difficulty of selecting a single kernel function and its parameters, and it allows different sources of sample information to be exploited.

The rest of this paper is organized as follows: Section 2 introduces OCSVM, MKOCSVM, and FMKOCSVM. The formulation and algorithm of our FMKOCSVM based on local similarity are detailed in Section 3. Section 4 reports the experimental results, and Section 5 concludes.

2. Preliminaries

2.1. One-Class Support Vector Machine

Compared with SVM, OCSVM is suitable for dealing with class-imbalanced data or one-class classification problems. The main idea is to first map the data from the original space to the feature space through a nonlinear mapping and then take the origin of the feature space as the representative of outliers, finding an optimal classification hyperplane in the feature space such that the image of the normal data is separated from the origin by the maximum margin. A graphical illustration is shown in Figure 1: Figure 1(a) shows the classification in the original space, and Figure 1(b) shows the classification in the feature space.

Suppose the training samples are $T=\{x_{1},x_{2},\dots,x_{l}\}\subset\mathbb{R}^{n}$ ($n$ is the dimension of $x_{i}$), where $l$ is the number of training samples. $\phi:\mathbb{R}^{n}\to F$ is a function that maps samples to the feature space $F$. Let $w$ and $\rho$ denote the normal vector and bias term of the classification hyperplane in the feature space. The classification hyperplane is expressed as $w\cdot\phi(x)-\rho=0$.

The goal is to maximize the distance between the classification hyperplane and the origin. OCSVM therefore needs to solve the following convex program [6]:

$$\min_{w,\xi,\rho}\ \frac{1}{2}\|w\|^{2}+\frac{1}{\nu l}\sum_{i=1}^{l}\xi_{i}-\rho,\quad \text{s.t.}\ w\cdot\phi(x_{i})\geq\rho-\xi_{i},\ \xi_{i}\geq0,\ i=1,\dots,l. \tag{1}$$

Here, $\xi_{i}$ is the slack variable, which means that outliers are allowed to exist, and $\nu\in(0,1]$ is a parameter that controls the proportion of support vectors and error points. Using the Lagrange multiplier method, the dual of the above optimization problem can be written as follows:

$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_{i}\alpha_{j}k(x_{i},x_{j}),\quad \text{s.t.}\ 0\leq\alpha_{i}\leq\frac{1}{\nu l},\ \sum_{i=1}^{l}\alpha_{i}=1, \tag{2}$$

where $k(x_{i},x_{j})=\phi(x_{i})\cdot\phi(x_{j})$ is the kernel function and the $\alpha_{i}$ are Lagrange multipliers.

Because the above solution needs to satisfy the KKT (Karush–Kuhn–Tucker) conditions,

$$\alpha_{i}\left(\rho-\xi_{i}-w\cdot\phi(x_{i})\right)=0,\qquad \left(\frac{1}{\nu l}-\alpha_{i}\right)\xi_{i}=0. \tag{3}$$

Each solution $\alpha_{i}$ of equation (3) corresponds to a sample $x_{i}$, and there is always $\alpha_{i}=0$ or $\alpha_{i}>0$. When $\alpha_{i}=0$, the sample has no effect on the hyperplane. When $\alpha_{i}>0$, $w\cdot\phi(x_{i})=\rho-\xi_{i}$ must be true; in this case, the sample is called a support vector. If $0<\alpha_{i}<\frac{1}{\nu l}$, then $\xi_{i}=0$, and there must be $w\cdot\phi(x_{i})=\rho$; that is, the sample is located on the maximum separation boundary. If $\alpha_{i}=\frac{1}{\nu l}$, then $\xi_{i}\geq0$; in this case, when $\xi_{i}>0$, the sample is misclassified and is called a boundary support vector. As shown in Figure 2, samples with different values of $\alpha_{i}$ occupy different positions.

Let $n_{SV}$ and $n_{BSV}$ represent the total number of support vectors and the total number of boundary support vectors, respectively. The maximum value of $\alpha_{i}$ is $\frac{1}{\nu l}$, and the $\alpha_{i}$ satisfy the following constraints:

$$\sum_{i=1}^{l}\alpha_{i}=1,\qquad 0\leq\alpha_{i}\leq\frac{1}{\nu l}. \tag{4}$$

So we have the following inequality:

$$\frac{n_{BSV}}{\nu l}\leq\sum_{i=1}^{l}\alpha_{i}=1\leq\frac{n_{SV}}{\nu l}. \tag{5}$$

Multiplying by $\nu$ on both sides of equation (5) gives

$$\frac{n_{BSV}}{l}\leq\nu\leq\frac{n_{SV}}{l}. \tag{6}$$

It can be seen from equation (6) that the value of $\nu$ determines a lower bound on the total support vector ratio and an upper bound on the boundary support vector ratio. The normal vector of the hyperplane can be obtained from

$$w=\sum_{i=1}^{l}\alpha_{i}\phi(x_{i}). \tag{7}$$

Let $x_{t}$ denote the $t$th support vector located on the maximum-margin plane; the bias term of the hyperplane is obtained according to

$$\rho=w\cdot\phi(x_{t})=\sum_{i=1}^{l}\alpha_{i}k(x_{i},x_{t}). \tag{8}$$

Thus, the decision function can be written as

$$f(x)=\operatorname{sgn}\left(\sum_{i=1}^{l}\alpha_{i}k(x_{i},x)-\rho\right). \tag{9}$$

For a given test sample $x$, after substituting $x$ into equation (9), when $f(x)$ returns +1, the sample is judged to be a normal point; when $f(x)$ returns −1, the sample is judged to be an abnormal point.
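To make the $\nu$-property of equation (6) concrete, the following minimal sketch checks it empirically with scikit-learn's OneClassSVM, an independent implementation of the OCSVM of [6]; the data, kernel width, and $\nu$ value are illustrative assumptions, not choices from this paper.

```python
# Minimal sketch: checking the nu-property of equation (6) empirically.
# The data and parameters below are illustrative, not the paper's settings.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # target samples only

nu = 0.1
clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(X)

n_sv = clf.support_vectors_.shape[0]     # total number of support vectors
n_out = np.sum(clf.predict(X) == -1)     # training points outside the boundary

# Equation (6): n_BSV / l <= nu <= n_SV / l (outliers are boundary SVs with xi > 0)
print(f"{n_out / len(X):.3f} <= {nu} <= {n_sv / len(X):.3f}")
```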

2.2. Multiple Kernel One-Class Support Vector Machine

MKOCSVM replaces the single kernel function in the conventional OCSVM with a combined kernel, which can effectively avoid the difficulty in selecting the kernel function and its parameters.

The forms of the combined kernel function include linear combinations and nonlinear combinations [28]. The linear combination is expressed as

$$k_{comb}(x_{i},x_{j})=\sum_{m=1}^{P}\beta_{m}k_{m}(x_{i},x_{j}),\quad \beta_{m}\geq0,\ \sum_{m=1}^{P}\beta_{m}=1. \tag{10}$$

Here, $\beta_{m}$ is the kernel weight of the $m$th basic kernel $k_{m}$, and $P$ is the number of basic kernels.
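As a brief illustration of equation (10), the sketch below forms a convex combination of Gaussian basis kernels; the widths and the uniform weights are placeholder assumptions (the weights are actually learned as described below).

```python
# Sketch of the linear combined kernel of equation (10).
# The widths and uniform weights are placeholders for illustration.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def combined_kernel(X, Y, widths, beta):
    """k_comb(x, y) = sum_m beta_m * k_m(x, y) with Gaussian basic kernels."""
    K = np.zeros((X.shape[0], Y.shape[0]))
    for sigma, b in zip(widths, beta):
        K += b * rbf_kernel(X, Y, gamma=1.0 / (2.0 * sigma ** 2))
    return K

X = np.random.default_rng(1).normal(size=(50, 3))
widths = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]   # seven Gaussian widths (assumed)
beta = np.full(len(widths), 1.0 / len(widths))   # uniform weights for the sketch
K = combined_kernel(X, X, widths, beta)
```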

The MKOCSVM model can be formulated as

$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_{i}\alpha_{j}k_{comb}(x_{i},x_{j}),\quad \text{s.t.}\ 0\leq\alpha_{i}\leq\frac{1}{\nu l},\ \sum_{i=1}^{l}\alpha_{i}=1. \tag{11}$$

Here, $k_{comb}$ is the combined kernel defined in equation (10). The decision function can be written as

$$f(x)=\operatorname{sgn}\left(\sum_{i=1}^{l}\alpha_{i}k_{comb}(x_{i},x)-\rho\right). \tag{12}$$

To seek the optimal combination weight for each basic kernel, the authors in [27] suggest maximizing the kernel-target alignment between the combined kernel and the ideal kernel, that is, solving the following objective function [29]:

$$\max_{\beta}\ A(K_{comb},K^{*})=\frac{\langle K_{comb},K^{*}\rangle_{F}}{\sqrt{\langle K_{comb},K_{comb}\rangle_{F}\,\langle K^{*},K^{*}\rangle_{F}}}. \tag{13}$$

Here, $K_{comb}$ is the combined kernel matrix, and $K^{*}=yy^{T}$ is the ideal kernel matrix; for one-class data, $y=(1,\dots,1)^{T}$, so $K^{*}$ is the all-ones matrix. $\langle\cdot,\cdot\rangle_{F}$ is the Frobenius inner product between two matrices, which is given by

$$\langle K_{1},K_{2}\rangle_{F}=\sum_{i=1}^{l}\sum_{j=1}^{l}K_{1}(i,j)K_{2}(i,j). \tag{14}$$

Substituting equation (10) into equation (13), the alignment can be written as a function of the weight vector $\beta=(\beta_{1},\dots,\beta_{P})^{T}$:

$$A(\beta)=\frac{\beta^{T}a}{\sqrt{\beta^{T}M\beta\,\langle K^{*},K^{*}\rangle_{F}}},\quad a_{m}=\langle K_{m},K^{*}\rangle_{F},\ M_{mn}=\langle K_{m},K_{n}\rangle_{F}. \tag{15}$$

Only equation (16) needs to be solved to obtain the optimal combination weights:

$$\beta=\frac{(M+\lambda I)^{-1}a}{\left\|(M+\lambda I)^{-1}a\right\|}. \tag{16}$$

Here, $\lambda$ is a regularization coefficient.
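The following sketch computes the weight vector under the regularized closed form reconstructed in equation (16); the all-ones ideal kernel follows from the one-class labels, while the nonnegativity clipping and renormalization at the end are assumptions added so that the weights can be used in the convex combination of equation (10).

```python
# Sketch of the kernel weight computation of equations (13)-(16).
# The clipping/renormalization at the end is an added assumption.
import numpy as np

def kernel_weights(kernels, lam=100.0):
    """kernels: list of P (l x l) basic kernel matrices; lam: regularization."""
    l = kernels[0].shape[0]
    K_star = np.ones((l, l))                                   # ideal kernel y y^T, y = 1
    a = np.array([np.sum(Km * K_star) for Km in kernels])      # a_m = <K_m, K*>_F
    M = np.array([[np.sum(Km * Kn) for Kn in kernels]
                  for Km in kernels])                          # M_mn = <K_m, K_n>_F
    beta = np.linalg.solve(M + lam * np.eye(len(kernels)), a)  # (M + lam I)^{-1} a
    beta = np.maximum(beta, 0.0)                               # keep weights nonnegative
    return beta / beta.sum()                                   # normalize for equation (10)
```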

2.3. Fuzzy Multiple Kernel One-Class Support Vector Machine

Let $s_{i}$ denote the membership degree of the sample $x_{i}$; then the training set can be expressed as $T=\{(x_{1},s_{1}),(x_{2},s_{2}),\dots,(x_{l},s_{l})\}$, where $0<s_{i}\leq1$. The FMKOCSVM needs to solve the following optimization program:

$$\min_{w,\xi,\rho}\ \frac{1}{2}\|w\|^{2}+\frac{1}{\nu l}\sum_{i=1}^{l}s_{i}\xi_{i}-\rho,\quad \text{s.t.}\ w\cdot\phi(x_{i})\geq\rho-\xi_{i},\ \xi_{i}\geq0,\ i=1,\dots,l. \tag{17}$$

When $s_{i}=1$ for all $i$, the FMKOCSVM degenerates to the normal MKOCSVM.

After introducing Lagrange multipliers $\alpha_{i}\geq0$ and $\mu_{i}\geq0$ for the inequality constraints, the Lagrange function of equation (17) is

$$L=\frac{1}{2}\|w\|^{2}+\frac{1}{\nu l}\sum_{i=1}^{l}s_{i}\xi_{i}-\rho-\sum_{i=1}^{l}\alpha_{i}\left(w\cdot\phi(x_{i})-\rho+\xi_{i}\right)-\sum_{i=1}^{l}\mu_{i}\xi_{i}. \tag{18}$$

Setting the derivatives with respect to $w$, $\xi_{i}$, and $\rho$ to zero, we obtain

$$w=\sum_{i=1}^{l}\alpha_{i}\phi(x_{i}),\qquad \alpha_{i}=\frac{s_{i}}{\nu l}-\mu_{i},\qquad \sum_{i=1}^{l}\alpha_{i}=1. \tag{19}$$

Substituting equation (19) into equation (18), the dual form of equation (17) can be written as

$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_{i}\alpha_{j}k_{comb}(x_{i},x_{j}),\quad \text{s.t.}\ 0\leq\alpha_{i}\leq\frac{s_{i}}{\nu l},\ \sum_{i=1}^{l}\alpha_{i}=1. \tag{20}$$

Obviously, the only difference between the dual problems of MKOCSVM and FMKOCSVM is the upper bound of $\alpha_{i}$, which becomes $\frac{s_{i}}{\nu l}$ in equation (20). The decision function can be written as

$$f(x)=\operatorname{sgn}\left(\sum_{i=1}^{l}\alpha_{i}k_{comb}(x_{i},x)-\rho\right). \tag{21}$$

In FMKOCSVM, when the noise has a lower membership during training, the negative effect of noise on the classification hyperplane can be reduced.
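Because equation (20) is a standard quadratic program, a weighted OCSVM can be prototyped directly with a generic QP solver. The sketch below uses cvxopt and is a minimal illustration of the per-sample upper bound $s_{i}/(\nu l)$, not the authors' implementation.

```python
# Sketch of the FMKOCSVM dual of equation (20) as a generic QP (via cvxopt).
# Only the upper bound s_i / (nu * l) distinguishes it from the MKOCSVM dual.
import numpy as np
from cvxopt import matrix, solvers

def fmkocsvm_dual(K, s, nu):
    """K: (l x l) combined kernel matrix; s: memberships in (0, 1]; nu in (0, 1]."""
    l = K.shape[0]
    P = matrix(K.astype(float))                      # objective: (1/2) alpha^T K alpha
    q = matrix(np.zeros(l))
    G = matrix(np.vstack([-np.eye(l), np.eye(l)]))   # -alpha <= 0 and alpha <= s/(nu l)
    h = matrix(np.hstack([np.zeros(l), s / (nu * l)]))
    A = matrix(np.ones((1, l)))                      # sum_i alpha_i = 1
    b = matrix(1.0)
    solvers.options["show_progress"] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol["x"]).ravel()
```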

3. Training FMKOCSVM with Local Similarity-Based Membership

Noise in the training set may not belong to any class at all. If such uncertain samples are distributed near the edges of the target data, the model will overfit. To alleviate this phenomenon, this paper assigns a membership to each training point, so that the samples play different roles in training and the negative impact of noise is reduced. In this section, we first describe the calculation of local similarity-based membership in detail and then present the FMKOCSVM algorithm that uses it.

3.1. Local Similarity-Based Memberships

Assume the target samples are $T=\{x_{1},x_{2},\dots,x_{l}\}$. Let $K\in\mathbb{R}^{l\times l}$ represent the multiple kernel matrix defined by $K_{ij}=k_{comb}(x_{i},x_{j})$, where the expression of the multiple kernel function is equation (10).

Let all $N=l(l+1)/2$ elements of the upper triangle of the multiple kernel matrix be sorted from large to small, i.e., $v_{1}\geq v_{2}\geq\dots\geq v_{N}$, and write them as a vector $v=(v_{1},v_{2},\dots,v_{N})$.

Next, define a constant $q\in(0,1)$ and let $d=\lfloor qN\rfloor$, where $\lfloor qN\rfloor$ is the integral part of $qN$. $\theta=v_{d}$ is a threshold. For each sample $x_{i}$, let $\rho_{i}$ represent the total number of indices $j$ with $K_{ij}\geq\theta$. In other words, $\rho_{i}$ represents the number of target samples whose similarity with the sample $x_{i}$ is greater than or equal to the threshold.

The kernel $k_{comb}(x_{i},x_{j})$ measures the similarity between two target samples $x_{i}$ and $x_{j}$, and a large kernel value indicates a large similarity. If the membership degree of a sample to the target class is high, then obviously many input samples are similar to it, i.e., its value of $\rho_{i}$ is large. In other words, a sample with a higher value of $\rho_{i}$ should contribute more to the classification boundary and incur a greater penalty for misclassification, whereas noise will have a smaller value of $\rho_{i}$.

Therefore, we take $\rho_{i}$ as a measure function of the importance of the target sample $x_{i}$ to the classification hyperplane. Obviously, the value of $\rho_{i}$ cannot be used directly as a membership degree in FMKOCSVM. We use an S-type function to map this measure into a membership degree in the unit interval; at the same time, the S-type function increases the difference between the membership degrees of samples of different importance. The membership function is written as

$$s_{i}=\frac{2}{1+\exp\left(-c\,\rho_{i}/\rho_{\max}\right)}-1, \tag{22}$$

where $\rho_{\max}=\max_{1\leq j\leq l}\rho_{j}$ and $c>0$ is a constant. The value range of $s_{i}$ is $(0,1)$. Figure 3 describes the distribution of membership values obtained with different values of $c$. According to Figure 3, when $c=10$, the distribution of membership values is the best. Algorithm 1 lists the detailed calculation process of the local similarity-based membership.

Input: the training set $T=\{x_{i}\}_{i=1}^{l}$, the kernel function set $\{k_{m}\}_{m=1}^{P}$,
 the kernel weights $\{\beta_{m}\}_{m=1}^{P}$, the constants $q$ and $c$
Output: the membership vector $s=(s_{1},s_{2},\dots,s_{l})$
(1)Preprocess the training set $T$
(2)Calculate the combined kernel matrix $K$ according to equation (10)
(3)Sort the upper-triangular elements of $K$ from large to small as $v=(v_{1},\dots,v_{N})$, $v_{1}\geq\dots\geq v_{N}$
(4)Calculate the constant $d$ according to $d=\lfloor qN\rfloor$ and the fixed threshold $\theta$ according to $\theta=v_{d}$
(5)for i = 1 : l do
(6) initialize $\rho_{i}=0$
(7)  for j = 1 : l do
(8)   if $K_{ij}\geq\theta$ then
(9)    $\rho_{i}=\rho_{i}+1$
(10)   end if
(11)  end for
(12)end for
(13)Calculate $\rho_{\max}=\max_{1\leq i\leq l}\rho_{i}$
(14)for i = 1 : l do
(15)Calculate the membership degree $s_{i}$ of the sample $x_{i}$ according to equation (22)
(16)end for
(17)end
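A compact sketch of Algorithm 1 follows; it assumes the logistic form reconstructed in equation (22) and vectorizes the double loop of steps 5–12, so the function name and defaults are illustrative rather than the authors' code.

```python
# Sketch of Algorithm 1: local similarity-based membership.
# Assumes the S-type function reconstructed in equation (22); q and c as in Section 4.
import numpy as np

def local_similarity_membership(K, q=0.2, c=10.0):
    """K: (l x l) combined kernel matrix of the target samples only."""
    l = K.shape[0]
    v = np.sort(K[np.triu_indices(l)])[::-1]   # upper-triangular similarities, descending
    d = int(q * len(v))                        # d = floor(q * N)
    theta = v[d]                               # similarity threshold theta = v_d
    rho = np.sum(K >= theta, axis=1)           # rho_i: neighbors with similarity >= theta
    s = 2.0 / (1.0 + np.exp(-c * rho / rho.max())) - 1.0   # equation (22)
    return s
```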

Since noise has a low degree of membership to the target class, it has few similar instances in the input data; in other words, noise receives a small membership value. Therefore, the proposed method makes noise have less influence on the classification boundary. More importantly, because the training data of OCSVM include only target samples, traditional methods of calculating membership degrees are not suitable for OCSVM. Our local similarity-based membership uses only the features of the target data and does not involve class information, which makes it well suited to the one-class classification problem. Furthermore, the proposed method is computationally efficient.

We analyze the computational complexity of Algorithm 1 with the $O$ notation. First, the computational complexity of calculating the multiple kernel matrix in step 2 is $O(Pl^{2})$. Second, the average computational complexity of sorting the multiple kernel matrix in step 3 is $O(l^{2}\log l)$. Third, it costs $O(l^{2})$ to calculate the $\rho_{i}$ in steps 4 to 12. Finally, calculating the memberships in steps 13 to 16 costs $O(l)$. Hence, the total computational complexity of the local similarity-based membership degree is $O(Pl^{2}+l^{2}\log l)$.

Compared with the gain in classification performance, $O(Pl^{2}+l^{2}\log l)$ is acceptable.

3.2. Overall Procedure of FMKOCSVM Based on Local Similarity

The detailed process of FMKOCSVM based on the local similarity membership degree is listed in Algorithm 2. In the remainder of this paper, we use FMKOCSVM_LS to denote the proposed algorithm.

Input: the training set $T$, the kernel function set $\{k_{m}\}_{m=1}^{P}$, the kernel parameter set, the test set $T'$
Output: the classification results of $T'$
(1)Preprocess the training set $T$
(2)for m = 1 : P do
(3) Calculate the kernel matrix $K_{m}$ corresponding to $k_{m}$
(4)end for
(5)Substitute the kernel matrices $K_{m}$ into equation (16)
(6)Calculate the kernel weight vector $\beta$ according to equation (16)
(7)Calculate the combined kernel matrix $K$ according to equation (10)
(8)for each $x_{i}\in T$ do
(9) Calculate the membership degree $s_{i}$ of the sample $x_{i}$ according to equation (22)
(10)end for
(11)Train the FMKOCSVM with the fuzzy membership values
(12)Calculate the classification results of the test set $T'$ according to equation (21)
(13)end
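For orientation, the sketch below chains the earlier helper sketches (kernel_weights, local_similarity_membership, and fmkocsvm_dual) into the training pipeline of Algorithm 2; all of those names and defaults are assumptions carried over from the previous sketches, and the test-phase scoring of equation (21) is omitted for brevity.

```python
# Sketch of Algorithm 2, reusing the helper sketches defined earlier
# (kernel_weights, local_similarity_membership, fmkocsvm_dual).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import StandardScaler

def train_fmkocsvm_ls(X_train, widths, nu=0.02, lam=100.0, q=0.2, c=10.0):
    X = StandardScaler().fit_transform(X_train)                   # step 1: preprocess
    kernels = [rbf_kernel(X, X, gamma=1.0 / (2.0 * w ** 2))
               for w in widths]                                   # steps 2-4: basic kernels
    beta = kernel_weights(kernels, lam)                           # steps 5-6: equation (16)
    K = sum(b * Km for b, Km in zip(beta, kernels))               # step 7: equation (10)
    s = local_similarity_membership(K, q, c)                      # steps 8-10: equation (22)
    alpha = fmkocsvm_dual(K, s, nu)                               # step 11: equation (20)
    return alpha, beta, s                                         # step 12 (scoring) omitted
```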

In Figure 4, the classification performance of MKOCSVM is shown in Figure 4(a), and the classification performance of the proposed FMKOCSVM_LS is shown in Figure 4(b). The combined kernel function is composed of seven Gaussian kernels with different widths. Parameter $\nu$ is set to 0.02. The regularization coefficient $\lambda$ in equation (16) is set to 100. In order to reduce the cost of tuning extra parameters, we set $q$ to 0.2 and $c$ to 10 directly. Obviously, FMKOCSVM_LS has a tighter boundary than MKOCSVM. In Figure 4(b), some outliers are identified by FMKOCSVM_LS. However, MKOCSVM does not identify any outliers, and there are large gaps inside its boundary.

After adding 10% Gaussian noise to the training set, the results are shown in Figure 5.

The parameter settings in Figure 5 are the same as in Figure 4. As we can see, when there is noise in the training set, the classification ability of FMKOCSVM_LS is much better than that of MKOCSVM. In Figure 5(a), MKOCSVM classifies all the noise as target data, which makes its performance very poor. In Figure 5(b), we can see that most of the noise points have very small membership values, so the negative effect of noise on the boundary is weak. Therefore, FMKOCSVM_LS improves the robustness of MKOCSVM.

In the next section, we further show through experiments that the proposed membership calculation method outperforms previous methods.

4. Experiments

4.1. Experiment Setup
4.1.1. Approaches

We compared FMKOCSVM_LS with the following methods: (1) MKOCSVM: the ordinary multiple kernel one-class support vector machine [27]; (2) WMKOCSVM: a weighted multiple kernel one-class support vector machine formed by combining WOCSVM [14] with the multiple kernel function; (3) FMKOCSVM: the fuzzy multiple kernel one-class support vector machine in which membership is calculated based on a rough set [19]. Because two classes of samples are needed to calculate this membership, its training set contains negative class samples; these negative samples are used only to calculate the membership.

In MKOCSVM, the parameter $\nu$ is determined by 10-fold cross validation over a prespecified range. The basic kernels of the multiple kernel function are seven Gaussian kernel functions with different kernel widths. The regularization coefficient $\lambda$ in the multiple kernel learning algorithm is set to 100. WMKOCSVM, FMKOCSVM, and FMKOCSVM_LS also use these parameters during training. The number of nearest neighbors in WMKOCSVM is set to 10, the same as in [14]. To avoid extra parameter tuning time, when calculating the local similarity-based membership in FMKOCSVM_LS, we directly set $q=0.2$ and $c=10$.

4.1.2. Metrics

In this paper, the performance of the different approaches is evaluated by three popular metrics, namely, g-mean, AUC, and training time. According to the confusion matrix in Table 1, we can obtain the true positive rate (TPR) and the false positive rate (FPR). In one-class classification problems, g-mean and AUC are more informative measures than accuracy:

$$TPR=\frac{TP}{TP+FN},\qquad FPR=\frac{FP}{FP+TN},\qquad \text{g-mean}=\sqrt{TPR\times(1-FPR)}.$$
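As a small illustration, the following sketch computes g-mean from a confusion matrix and AUC from decision scores with scikit-learn; the +1/−1 label convention matches equation (9).

```python
# Sketch of the evaluation metrics: g-mean = sqrt(TPR * (1 - FPR)) and AUC.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def g_mean(y_true, y_pred):
    """Labels follow equation (9): +1 for target samples, -1 for outliers."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
    tpr = tp / (tp + fn)          # true positive rate
    fpr = fp / (fp + tn)          # false positive rate
    return np.sqrt(tpr * (1.0 - fpr))

# AUC uses continuous decision scores rather than hard labels, e.g.:
# auc = roc_auc_score(y_true, decision_scores)
```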

4.1.3. Data Sets

In this section, we selected 14 benchmark data sets, 13 of which are from the UCI machine learning repository. Three of the data sets concern biological systems: the Heart data set is used for heart disease diagnosis, the Breast data set is used to diagnose whether a patient's breast cancer is benign or malignant, and the Biomed data set is used to screen for carriers. Creditcard_cut is part of the credit card fraud detection data set on Kaggle. Because the original Creditcard data set is too large, we randomly selected only 729 transactions (483 normal transactions and 249 fraudulent transactions) for the experiment. Table 2 lists the details of these data sets.

For each data set, we use 70% of the positive data as the training set. We then randomly selected a portion of the negative data as noise in the training set, with a noise proportion of 10%. The rest of the data is used as the test set. The training set is normalized before training, and the test set is normalized using the statistics of the training set.

4.2. Results

In order to obtain stable results, each method was run in 20 independent experiments on each data set, and the result used in the comparison is the average of the 20 results. Table 3 shows the optimal value of $\nu$ obtained through 10-fold cross validation. In order to get the best results for the four algorithms, the $\nu$ in Table 3 is used in each experiment.

Table 4 compares the g-mean values, and Table 5 compares the AUC values. Table 6 shows the average training time of MKOCSVM, WMKOCSVM, FMKOCSVM, and FMKOCSVM_LS on each data set. Figure 6 shows the total training time of each method over the 14 data sets.

From Tables 4 and 5, we can see that FMKOCSVM_LS performs best among the four algorithms, which shows that our membership method can improve the robustness of MKOCSVM. More importantly, WMKOCSVM and FMKOCSVM each achieve the best result on only one of the 14 data sets, whereas our proposed method achieves the best performance on twelve.

On the Iris, Breast, and Wdbc data sets, FMKOCSVM_LS shows clear advantages. Its g-mean is 27%–32% higher than that of MKOCSVM, 10%–18% higher than that of FMKOCSVM, and 23%–31% higher than that of WMKOCSVM. On the corresponding data sets, the AUC value of FMKOCSVM_LS also increases significantly. On the Japan data set, although the g-mean of FMKOCSVM_LS is lower than that of FMKOCSVM, it is still 10% higher than that of MKOCSVM and 4% higher than that of WMKOCSVM; the AUC values show the same trend. On the Glass data set, WMKOCSVM has the best result, but it is only about 2% higher than that of FMKOCSVM_LS; moreover, on this data set, the result of FMKOCSVM_LS is 5% higher than that of MKOCSVM, which shows that our membership calculation method can reduce the impact of noise on classification ability. On the remaining nine data sets, the results of FMKOCSVM_LS are the best, with obvious advantages. For example, on the Waveform data set, the g-mean of FMKOCSVM_LS is 10% higher than that of WMKOCSVM and 4% higher than that of FMKOCSVM.

In terms of training time, although our method is not the fastest, FMKOCSVM_LS is still faster than WMKOCSVM; the training time of WMKOCSVM is on average 1.5 times that of FMKOCSVM_LS. Compared with the training time of MKOCSVM, the additional training time of FMKOCSVM_LS is within an acceptable range.

All of the above shows that MKOCSVM with membership is more robust and that our proposed local similarity-based membership performs best.

5. Conclusions

In order to solve the problem of the poor robustness of MKOCSVM, this paper proposes a fuzzy multiple kernel one-class support vector machine in which membership is based on the local similarity of the training data. First, the similarity between samples is measured by the combined kernel matrix. Then, according to a selected threshold, the local similarity of each sample is determined. Finally, an S-type function maps the local similarity to the unit interval, and the function value is taken as the membership value. Experiments show that the proposed membership method can improve the robustness of MKOCSVM and that, compared with the other two methods, our method is optimal.

The difficulty in the fuzzy multiple kernel one-class support vector machine lies in how to determine an effective membership. Compared with previous membership calculation methods, only the target data are needed to calculate the local similarity-based membership, which is consistent with the OCSVM training set. In this method, noise and outliers are assigned small membership values, which weakens their impact on the classification boundary. Therefore, the membership method in this paper helps improve the robustness of MKOCSVM. As future work, we will study how to optimize the parameters used in the local similarity-based membership calculation.

Data Availability

The data underlying this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (nos. 62072024 and 61473111), Projects of Beijing Advanced Innovation Center for Future Urban Design, Beijing University of Civil Engineering and Architecture (nos. UDC2019033324 and UDC2017033322), Scientific Research Foundation of Beijing University of Civil Engineering and Architecture (no. KYJJ2017017), Natural Science Foundation of Guangdong Province (no. 2018A0303130026), and Natural Science Foundation of Hebei Province (no. F2018201096).