Abstract

Class imbalance is a common problem in network threat detection. Oversampling, which generates enough new minority samples, is regarded as a popular countermeasure. The generative adversarial network (GAN) is a typical generative model that can generate any number of artificial minority samples close to the real data. However, GAN is difficult to train, and the Nash equilibrium is almost impossible to achieve. Therefore, in order to improve the training stability of GAN for oversampling in network threat detection, a convergent WGAN-based oversampling model called convergent WGAN (CWGAN) is proposed in this paper. The training process of CWGAN contains multiple iterations. In each iteration, the number of training epochs of the discriminator is dynamic and is determined by the convergence of the discriminator loss function in the last two iterations. When the discriminator has been trained to convergence, the generator is then trained to generate new minority samples. The experimental results show that CWGAN not only improves the training stability of WGAN, making the loss smoother and closer to 0, but also improves the performance of the minority class through oversampling, which means that CWGAN can improve the performance of network threat detection.

1. Introduction

In recent years, network attacks and threats have gradually become two of the most important issues in cyberspace. For example, in blockchain, big data, and IoT systems, network attacks occur frequently, and global privacy leakage incidents caused by network threats are rising rapidly, causing great economic losses and undermining social stability [1–3]. Network threat detection based on machine learning and deep learning, which relies on massive data, is one of the important ways to protect network security. However, the data sets collected through various channels are usually imbalanced: the number of samples in each category varies significantly. In most cases, the benign category has a large number of samples, the so-called majority class, while the malicious category has much fewer samples, the so-called minority class. For example, in network intrusion detection, most traffic is normal with very few abnormal flows [4]. In Android malware detection, the proportion of malware apps is relatively low [5]. In network threat detection, if the malicious samples of the minority class are misjudged and spread widely, they will bring large losses to users. Therefore, correct and accurate detection of malicious data in imbalanced data sets is very important. Imbalanced data sets are so common that they bring many troubles to data mining and analysis [6]. Moreover, when the number of minority samples is very small, they are very likely to be discarded as noise, which is harmful to the training of the classifier and reduces the classification accuracy of the minority class [7, 8]. Therefore, strengthening the classification ability and improving the classification accuracy for the minority class is an urgent problem.

So far, numerous methods and strategies have been proposed to deal with imbalanced data; they fall into two major categories: data-level solutions and algorithm-level solutions [9–11]. Data-level approaches modify the data sets by undersampling the majority class, oversampling the minority class, or a combination of both to balance the data. Undersampling removes regular data from the majority class and therefore risks losing important information. Oversampling generates new data close to the minority class; it is popular because it uses all the available information and significantly improves the overall classification performance. The synthetic minority oversampling technique (SMOTE) [12] is the most representative and popular oversampling algorithm; it generates synthetic minority examples by linear interpolation between each minority example and its nearest neighbors. Algorithm-level approaches adapt the algorithms to focus on the classification accuracy of the minority class and can be divided into four types: cost-sensitive methods, kernel-based learning methods, active learning methods, and ensemble methods [10, 13]. There are a variety of common algorithm-level methods, such as a Bayesian optimization algorithm that maximizes Matthews correlation coefficient by learning optimal weights for the positive and negative classes [14].
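To make the interpolation idea behind SMOTE concrete, the following is a minimal sketch of its core step, assuming a NumPy array `X_min` that holds only minority-class samples; it is an illustration of the technique, not the reference implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation
    between randomly chosen minority samples and their nearest neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1: each sample is its own neighbor
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                       # pick a random minority sample
        j = rng.choice(idx[i][1:])                         # pick one of its k true neighbors
        lam = rng.random()                                 # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```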

However, when the amount of data is large and the dimension is high, the traditional methods mentioned above are not very effective, and machine learning and deep learning techniques have emerged to make up for their shortcomings [15]. The generative adversarial network (GAN) is a typical generative model and is regarded as a potential solution for imbalanced data because it can generate new samples for the minority class [16, 17]. GAN is composed of two neural networks: a generator and a discriminator. The generator is used to generate complex, high-dimensional data as close to the real data as possible in order to deceive the discriminator. The goal of the discriminator is to distinguish the real data from the false data generated by the generator as well as possible. Through repeated adversarial training, a generator with excellent performance is obtained [18]. However, GAN has some problems, such as unstable training, gradient disappearance, and mode collapse [19, 20]. To solve these problems, many variant algorithms have been proposed, and Wasserstein GAN (WGAN) is one of the most typical [21]. WGAN uses the Wasserstein distance instead of the Jensen–Shannon (JS) divergence to stabilize training. Most importantly, WGAN not only alleviates the unstable training of GAN without requiring a careful balance between the training of the generator and the discriminator but also mitigates mode collapse and ensures the diversity of the generated samples. Another advantage of WGAN is that it can deal with discrete data. Therefore, we apply WGAN as an oversampling method to generate new minority samples to solve the imbalance problem.

WGAN still has some defects similar to GAN, such as training difficulty and slow convergence in real experiments. Therefore, we construct a condition for the alternate training of the generator and the discriminator. Inspired by the idea of letting the convergence of the discriminator guide the training stability of WGAN, we propose a convergent WGAN-based oversampling model called CWGAN, which contains a generator and a convergent discriminator. The generator is used to generate new samples close to the real ones, and the discriminator is used to distinguish the real minority samples from the new samples generated by the generator. In the training process of CWGAN, the generator and the discriminator are trained alternately over many iterations. In each iteration, the number of epochs of the discriminator is dynamic and is determined by the convergence of the loss function in the last two iterations. When the discriminator has been trained to convergence, the generator is trained. Through this process, we obtain the number of discriminator epochs for each iteration.

The proposed method has the following advantages:

(1) Compared with WGAN, the proposed CWGAN constructs a condition for the alternate training of the generator and the discriminator to guide training stability.

(2) During the training process of CWGAN, the number of discriminator epochs is dynamic and is determined by the convergence of the loss function in the last two iterations.

(3) The CWGAN model can solve the imbalanced data problem by generating new samples for the minority class to balance the data set.

(4) CWGAN not only improves the accuracy of the minority class but also enhances the accuracy of the majority class and of the whole data set.

The remainder of this paper is organized as follows. Section 2 discusses the related work on GAN for imbalanced data and on GAN training stability. A description of the CWGAN model is provided in Section 3. Experimental results and analysis are presented in Section 4. The work is concluded in Section 5.

2.1. GAN and Imbalanced Data

Class imbalance is a common issue in classification tasks in cyberspace. Generating minority samples, known as oversampling, is regarded as an effective way to solve the class imbalance problem. Random oversampling, the synthetic minority oversampling technique (SMOTE) [12], and borderline SMOTE [16] are considered the best traditional oversampling algorithms. However, when the data lie in a high-dimensional space, their performance drops significantly. By comparison, GANs can generate artificial data close to the minority class even when the data are high dimensional and the data distributions are complex. So far, researchers have produced many modified versions of GANs.

Hao et al. [22] proposed an Annealing Genetic GAN (AGGAN) method to solve the class imbalance problem by generating data close to the minority class distribution from a limited number of samples. In the training process, AGGAN used the mechanism of simulated annealing (SA) to update the training of GANs and to avoid local optima in order to obtain the best generator, which could generate data close to the minority classes. Both theoretical analysis and experimental studies showed that AGGAN balanced the classes efficiently and effectively by reproducing their distributions.

Based on a conditional Wasserstein GAN, Engelmann and Lessmann [9] proposed an oversampling method, the cWGAN-based model, to balance tabular data with numerical and categorical variables. In the cWGAN-based structure, the conditional distribution is estimated to sample the minority class explicitly. The loss function of the cWGAN-based model is augmented with an auxiliary classifier (AC) loss to encourage the generator to generate suitable samples. The experiments showed that the cWGAN-based architecture could successfully estimate the data distribution and outperformed other oversampling methods.

Deepshikha and Naman [14] proposed a new GAN framework consisting of a generator network G, a discriminator network D, and a classifier network C, in which the generator worked with the classifier and played a min-max game with the discriminator to generate samples that balance the classes. In this framework, the generator generated samples within the convex hull of the data by competing with the classifier. Consequently, the generated samples lay near the data boundary with a better distribution, which made it easier for the classifier to find the decision boundary of the whole data set consisting of the original imbalanced data and the generated data. The experiments showed that the proposed method with an additional classifier achieved better performance than other state-of-the-art techniques on image data sets.

Kim et al. [23] proposed a novel GAN-based model consisting of an autoencoder as the generator and two separate discriminators for anomaly detection on image data sets. Meanwhile, they proposed new loss functions consisting of a patch loss and an anomaly adversarial loss to optimize the GAN-based model and improve the robustness and performance of defect detection. There were eight loss functions in total: six for the generator, one for the normal discriminator, and one for the anomaly discriminator. The experiments were carried out on benchmark data sets and real-world data sets, and the results showed state-of-the-art performance on both.

In summary, since GAN was proposed by Goodfellow, it has been widely used to generate new minority class samples to solve the imbalance problem, and experimental results show that oversampling based on GAN can significantly improve the classification performance of the minority class. However, most researchers focus on solving the imbalance problem of image data sets, and few focus on numerical data, although image data are very different from numerical data. Therefore, when solving the imbalance problem of network security data, it is necessary to adopt GAN-based methods suited to such data to improve the detection performance of network threats.

2.2. GAN Training Stability

GAN is one of the most popular deep learning models, but it is hard to train, and the Nash equilibrium between the discriminator and the generator is almost impossible to achieve during training. Training instability is considered one of its main defects [19, 20]. Many researchers have put forward their own ideas to solve this problem and have designed improved models for specific tasks. New research articles about modified versions of GAN are published week by week, with names such as 3D-GAN, BEGAN, and iGAN. Avinash Hindupur maintains a website called "The GAN Zoo" [24] on GitHub that lists the new GAN models.

Many variants of GAN have been proposed to improve training stability and performance. According to the composition and the loss, GAN models can be divided into two categories: architecture variants and loss variants. The former mainly benefit specific applications, and the latter improve performance and enable more stable training [19]. The schemes belonging to loss variants can be divided into three types: restricting the loss function of the generator [25], restricting the loss function of the discriminator [26, 27], and restricting the loss functions of the generator and the discriminator simultaneously [28, 29]. Our work focuses on training stability, approximately achieving the Nash equilibrium by restricting the loss function of the discriminator to improve performance.

Heusel et al. [30] proposed a two time-scale update rule (TTUR) for training GAN models with stochastic gradient descent on arbitrary loss functions. The TTUR defined individual learning rates for the discriminator and the generator and analyzed them in depth, proving that the TTUR converges to a local Nash equilibrium. Experiments showed that the TTUR improved the performance of the generator on image data sets.

Brock et al. [31] studied the instabilities of large-scale GANs. On one hand, two simple, general architecture changes, using two to four times as many parameters and eight times the batch size, were introduced to improve scalability. On the other hand, the generator used orthogonal regularization to make it amenable to a simple "truncation trick," which trades off sample variety against fidelity by reducing the variance of the generator inputs. The experimental results produced natural images of multiple categories and improved conditioning, demonstrably boosting the performance of GANs.

Mescheder et al. [32] pointed out that unregularized GAN training is not always locally convergent, which makes it hard for GAN training to reach a stable Nash equilibrium. They analyzed the convergence properties and discussed regularization strategies that stabilize GAN training, including adding instance noise and gradient penalties. In the experimental part, gradient penalties were used to prove local convergence of the regularized GAN training dynamics.

Zadorozhnyy et al. [33] proposed a new family of discriminator loss functions computed over both real and fake data. According to the gradients of the loss, weights were computed adaptively to stabilize the training of the discriminator. These methods can be used with any discriminator model whose loss is a sum over real and fake data. The experimental results showed that the methods were effective on image generation tasks.

Lee and Seok [34] proposed that it is essential to maintain balanced training between the discriminator and the generator. They argued that multiple discriminator updates and imbalanced learning rates can balance the training; both approaches train the discriminator more intensively than the generator. However, the suitable learning rates vary with the GAN architecture, data sets, and tasks, so further study is needed to select the appropriate number of updates and learning rate values.

Sidheekh et al. [35] demonstrated that the prevalent approach of using the duality gap to monitor the training progress of a GAN is problematic and proposed a dependable estimate of the duality gap based on locally perturbed gradient descent, which overcomes the limitations of the original measure. In the experiments, the researchers demonstrated that the proposed duality-gap-based method could guide the training process, for example in tuning hyperparameters, across a wide variety of GAN models and data sets. The proposed method is also able to identify model convergence or divergence and thus to measure the potential performance of GANs.

Training GANs is an important and difficult problem, and many schemes have been put forward. Most algorithms balance GAN training by changing the losses of the generator and the discriminator. Inspired by this, we judge the convergence of the discriminator by monitoring the trend of its loss during training and use it as the basis for determining the number of epochs in subsequent training.

3. Proposed Solution

Our proposed network threat detection model based on CWGAN is shown in Figure 1. The architecture consists of three parts: data segmentation, deep learning for oversampling, and shallow learning for classification. The details and process of the architecture are as follows (a minimal code sketch of the pipeline is given after this list):

(1) Data segmentation: first, the original imbalanced data set is divided into a training set and a testing set. Both the training set and the testing set are imbalanced, containing the majority class and the minority class. Then, the minority class of the training set is input to the deep learning model for oversampling.

(2) Deep learning for oversampling: CWGAN works as the deep learning model and contains a generator and a convergent discriminator, both built from fully connected layers. The generator is used to generate new data close to the real data, and the discriminator is used to distinguish the real minority data from the false data generated by the generator as well as possible. Overall, the training process of CWGAN contains multiple iterations, and in each iteration the number of training epochs of the discriminator is dynamic, determined by the convergence of the loss function in the last two iterations. Each iteration is divided into two stages. The first stage fixes the discriminator and trains the generator: at the beginning the generator is relatively weak and the generated data are easy for the discriminator to identify, but as training continues the performance of the generator improves. The second stage fixes the generator and trains the discriminator to distinguish the real data from the generated false data; the discriminator loss is computed between the real data and the false data. The convergence of the loss function is then proved through the convexity and Lipschitz continuity conditions. When the discriminator converges, the minimum losses of the previous iteration and of the current iteration are compared to determine the number of discriminator epochs in the next iteration. Therefore, the epoch number is used as the index that balances training across consecutive iterations.

(3) Shallow learning for classification: after CWGAN generates new minority samples for the minority class of the training set, the new samples and the original training set are fused to form a new balanced training set, which is input to a shallow learning model for training. Then, the imbalanced testing set is input to the trained shallow learning model to predict the labels. At the same time, the accuracy of the majority class in the testing set (acc+), the accuracy of the minority class in the testing set (acc−), and the accuracy, precision, recall, F1, and G-means of the whole testing set are computed to evaluate the performance.
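The following is a minimal sketch of this three-part pipeline. It uses a synthetic data set as a stand-in, follows scikit-learn conventions, and wraps the trained CWGAN generator in a hypothetical function `generate_minority_samples`; the names and settings are illustrative assumptions, not the authors' code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in imbalanced data set (the paper uses 79 URL features; this one is synthetic).
X, y = make_classification(n_samples=5000, n_features=79, weights=[0.9, 0.1], random_state=0)

# (1) Data segmentation: 70% training / 30% testing, both remain imbalanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# (2) Deep learning for oversampling: generate new minority samples with the
#     trained CWGAN generator until the training classes are balanced.
X_min = X_train[y_train == 1]                    # label 1 marks the minority class here
n_new = int(np.sum(y_train == 0) - len(X_min))   # majority count minus minority count
X_gen = generate_minority_samples(X_min, n_new)  # hypothetical wrapper around the CWGAN generator
X_bal = np.vstack([X_train, X_gen])
y_bal = np.concatenate([y_train, np.ones(n_new, dtype=int)])

# (3) Shallow learning for classification: train SVM on the balanced set and
#     evaluate on the untouched imbalanced testing set (metrics as in Section 3.4).
clf = SVC().fit(X_bal, y_bal)
y_pred = clf.predict(X_test)
```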

3.1. Data Segmentation and Data Imbalance

The original imbalanced data sets are divided into training data sets and testing data sets. Both the training data set and the testing data set are imbalanced, containing the majority class and the minority class. The data object we deal with here is the training set.

Suppose the training set is $T$ and its size is $N$. The majority class is $T_{maj}$, and its size is $N_{maj}$. The minority class is $T_{min}$, whose size is $N_{min}$. The imbalance ratio (IR) of the training set is computed as the size of the majority class divided by the size of the minority class, which is defined as follows:

$$IR = \frac{N_{maj}}{N_{min}} \tag{1}$$

$IR$ indicates the degree of imbalance of the data set. The higher the $IR$ is, the more imbalanced the data set is, and the more data need to be generated. In order to balance the different classes, it is necessary to generate new samples for the minority class. The number of generated samples is close to $N_{maj} - N_{min}$, so $IR$ also determines the number of samples to be generated.
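As a concrete illustration, assuming NumPy arrays of class labels, the imbalance ratio of formula (1) and the number of samples to generate can be computed as follows (the toy labels and variable names are illustrative):

```python
import numpy as np

y_train = np.array([0] * 900 + [1] * 100)   # 0 = majority, 1 = minority (toy labels)
n_maj = int(np.sum(y_train == 0))
n_min = int(np.sum(y_train == 1))

ir = n_maj / n_min                          # imbalance ratio, formula (1)
n_gen = n_maj - n_min                       # number of minority samples to generate
print(ir, n_gen)                            # 9.0 800
```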

3.2. Deep Learning Model Training

The WGAN model is built with a generator and a discriminator. The generator is used to generate new samples for the minority class, and the discriminator is used to distinguish the real minority samples from the generated samples. Suppose the input of the generator is the noise data $z$, the input of the discriminator is the real minority data $x$, and the false data generated by the generator are $G(z)$.

The WGAN model uses fully connected layers to construct the generator and the discriminator. The loss function is computed based on the Wasserstein distance, which is defined as

$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right] \tag{2}$$

where $P_r$ is the distribution of the real data $x$ and $P_g$ is the distribution of the data $G(z)$ output by the generator. $\Pi(P_r, P_g)$ is the set of all possible joint distributions combining $P_r$ and $P_g$. For a joint distribution $\gamma$, the samples $x$ and $y$ drawn from it conform to the marginals $P_r$ and $P_g$, respectively. $\lVert x - y \rVert$ represents the distance between the samples. The expected value of the sample distance under the joint distribution $\gamma$ is $\mathbb{E}_{(x, y) \sim \gamma}[\lVert x - y \rVert]$, whose lower bound (infimum) over all $\gamma$ is defined as the Wasserstein distance.
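For intuition, in one dimension the Wasserstein distance of formula (2) can be estimated directly from samples; the following SciPy check only illustrates the metric and is not part of the CWGAN model.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=10000)   # samples from the "real" distribution
fake = rng.normal(loc=0.5, scale=1.0, size=10000)   # samples from a shifted "generated" distribution

# For two Gaussians differing only in their means, the 1-D Wasserstein
# distance equals the mean shift, so this prints a value close to 0.5.
print(wasserstein_distance(real, fake))
```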

In order to solve the equation, the $K$-Lipschitz constraint is introduced, which is defined as

$$\lvert f(x_1) - f(x_2) \rvert \le K \lVert x_1 - x_2 \rVert \tag{3}$$

where $f$ is the function represented by the neural network model and $K$ is the Lipschitz constant.

The approximate solution of formula (2) then becomes

$$K \cdot W(P_r, P_g) \approx \sup_{\lVert f \rVert_L \le K} \left( \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)] \right) \tag{4}$$

The parameter $K$ makes the gradient $K$ times larger but does not affect the direction of the gradient. The function $f$ in formula (4) can be expressed as a discriminator network $f_w$ with parameters $w$ whose last layer is not a nonlinear activation layer. Then, formula (4) takes the upper bound over all functions $f_w$ that satisfy the $K$-Lipschitz restriction, and the Wasserstein distance is converted to

$$W(P_r, P_g) \approx \max_{w:\, \lVert f_w \rVert_L \le K} \left( \mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{z \sim P_z}[f_w(G(z))] \right) \tag{5}$$

Next, the generator approximately minimizes the Wasserstein distance, which is equivalent to minimizing $-\mathbb{E}_{z \sim P_z}[f_w(G(z))]$ in formula (5). Considering that the first term of formula (5) is independent of the generator, we can get the discriminator loss and the generator loss of WGAN in formulas (6) and (7):

$$L_D = \mathbb{E}_{z \sim P_z}[f_w(G(z))] - \mathbb{E}_{x \sim P_r}[f_w(x)] \tag{6}$$

$$L_G = -\mathbb{E}_{z \sim P_z}[f_w(G(z))] \tag{7}$$

Formula (6) is the negative of formula (5), and it indicates the training progress: the smaller the value of formula (6) is, the smaller the Wasserstein distance between the real data and the generated data is, and the better the WGAN training is.
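The following is a minimal PyTorch sketch of formulas (6) and (7) together with the weight clipping that WGAN uses to enforce the Lipschitz constraint; the layer sizes, learning rates, and clipping range are illustrative assumptions, not the settings used in the paper.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 79))   # noise -> 79-dim sample
D = nn.Sequential(nn.Linear(79, 64), nn.ReLU(), nn.Linear(64, 1))    # sample -> real-valued score
opt_d = torch.optim.RMSprop(D.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)

def train_discriminator(x_real):
    z = torch.randn(len(x_real), 32)
    loss_d = D(G(z).detach()).mean() - D(x_real).mean()   # formula (6)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    for p in D.parameters():                              # enforce the K-Lipschitz constraint
        p.data.clamp_(-0.01, 0.01)                        # by weight clipping, as in WGAN
    return loss_d.item()

def train_generator(batch_size=64):
    z = torch.randn(batch_size, 32)
    loss_g = -D(G(z)).mean()                              # formula (7)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_g.item()
```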

3.3. Model Training Convergence

After the model is established, it enters the iterative training process. The whole process contains many iterations, and each iteration contains many epochs. In each iteration, the generator and the discriminator are trained alternately, and training gradually reduces the discriminator loss and the generator loss. In our work, we design a convergence rule: when the generator is fixed, the number of training epochs of the discriminator is not fixed but is determined dynamically according to the convergence of the loss function.

Assuming that the distribution of the generated data is similar to that of the real data, within one iteration of the training process, the discriminator's recognition ability is continuously enhanced and eventually converges. The convergent discriminator can correctly classify the generated data and the real data.

When the discriminator converges, the number of epochs in the next iteration is determined by the change of the discriminator loss between the previous iteration and the current iteration. Therefore, we use this change of the discriminator loss as the index that sets the amount of discriminator training in the next iteration.

3.3.1. Convergence Proof of Cost Function

Suppose the WGAN model trains for $N$ iterations. In each iteration, the generator trains for one epoch, while the discriminator trains for $T$ epochs. In the $t$-th epoch of the $i$-th iteration, suppose the loss function of the discriminator is $f(w_t)$, where $w_t$ denotes the discriminator parameters, and $f$ is convex and $\rho$-Lipschitz. The update of the discriminator is

$$w_{t+1} = w_t - \eta \nabla f(w_t) \tag{8}$$

where $t = 1, 2, \ldots, T$ and $\eta$ is the stride (step size). Suppose $w^*$ is the optimal solution of $f$, which is reached at the $T$-th epoch, and the optimal value of $f$ is expressed as $f(w^*)$. $\rho$ is the Lipschitz constant, and its value is 1. Therefore, we can get

$$f(w_t) - f(w^*) \le \langle \nabla f(w_t),\, w_t - w^* \rangle \le \frac{\lVert w_t - w^* \rVert^2 - \lVert w_{t+1} - w^* \rVert^2}{2\eta} + \frac{\eta \rho^2}{2} \tag{9}$$

In formula (9), the first inequality follows from the convexity of $f$, and the second inequality is based on the Lipschitz continuity condition. By accumulating formula (9) over $t = 1, 2, \ldots, T$, the sum telescopes and the result is obtained as

$$\sum_{t=1}^{T} \left( f(w_t) - f(w^*) \right) \le \frac{\lVert w_1 - w^* \rVert^2}{2\eta} + \frac{\eta \rho^2 T}{2} \tag{10}$$

Multiplying both sides of formula (10) by $\frac{1}{T}$, we get

$$\frac{1}{T} \sum_{t=1}^{T} \left( f(w_t) - f(w^*) \right) \le \frac{\lVert w_1 - w^* \rVert^2}{2\eta T} + \frac{\eta \rho^2}{2} \tag{11}$$

Since the right-hand side is bounded, that is, $\lVert w_1 - w^* \rVert \le B$, we can get

$$\frac{1}{T} \sum_{t=1}^{T} \left( f(w_t) - f(w^*) \right) \le \frac{B^2}{2\eta T} + \frac{\eta \rho^2}{2} \tag{12}$$

Under the condition of $\eta = \frac{B}{\rho \sqrt{T}}$, we can get

$$\frac{1}{T} \sum_{t=1}^{T} f(w_t) \le f(w^*) + \frac{B \rho}{\sqrt{T}} \tag{13}$$

In this paper, the data are high dimensional, and we use the Wasserstein distance to measure the distance between the generated data and the real data. Suppose $W_i$ represents the Wasserstein distance between the generated data and the real data when the discriminator is convergent at the $i$-th iteration, that is, the optimal value that the discriminator loss approaches in that iteration. Then formula (13) is transformed to

$$\frac{1}{T_i} \sum_{t=1}^{T_i} f(w_t) \le W_i + \frac{B \rho}{\sqrt{T_i}} \tag{14}$$

where $T_i$ is the number of discriminator epochs in the $i$-th iteration. Therefore, as $T_i$ increases, the average discriminator loss in the $i$-th iteration converges to $W_i$.
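As a sanity check on the bound of formula (13), the following toy experiment runs subgradient descent on the convex, 1-Lipschitz function $f(w) = |w|$ and verifies that the average loss stays within $B\rho/\sqrt{T}$ of the optimum; it only illustrates the convergence argument and is not part of the CWGAN code.

```python
import numpy as np

B, rho, T = 1.0, 1.0, 100          # ||w_1 - w*|| <= B, Lipschitz constant rho, T epochs
eta = B / (rho * np.sqrt(T))       # stride chosen as in formula (13)

w = 0.5                            # start within distance B of the optimum w* = 0
losses = []
for _ in range(T):
    losses.append(abs(w))          # f(w) = |w| is convex and 1-Lipschitz
    w -= eta * np.sign(w)          # subgradient step, formula (8)

print(np.mean(losses))             # average loss over T epochs
print(B * rho / np.sqrt(T))        # bound of formula (13): average <= f(w*) + B*rho/sqrt(T) = 0.1
```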

3.3.2. Calculation of Discriminator Training Epochs

Under the conditions of formulas (12)–(14), let the minimum loss at convergence in the $(i-1)$-th iteration be $L_{i-1}$ and the minimum loss at convergence in the $i$-th iteration be $L_i$. Then the number of training epochs of the discriminator in the $(i+1)$-th iteration is defined as

$$T_{i+1} = \left\lfloor \frac{L_{i-1}}{L_i} \right\rfloor \tag{15}$$

where $\lfloor \cdot \rfloor$ is the rounded-down value of the data.

That is to say, the epoch number of discriminator training in each iteration depends on the ratio of the minimum loss of the previous two iterations of discriminator training convergence, which is rounded down. If the ratio is rounded down to 0, the epoch of discriminator training is 1.
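Putting this rule together with the alternate training of Section 3.2, a minimal sketch of the CWGAN training loop might look as follows. It reuses the illustrative `train_discriminator` and `train_generator` functions from the WGAN sketch above and assumes the minority samples are available as a tensor `x_min`; it is a reading of formula (15), not the authors' implementation.

```python
import math

def train_cwgan(x_min, n_iterations=1000):
    """Alternate training with a dynamic number of discriminator epochs per iteration."""
    prev_loss, curr_loss = None, None
    epochs = 1                                          # start with a single discriminator epoch
    for _ in range(n_iterations):
        # Train the discriminator for the scheduled number of epochs and record
        # the minimum loss reached in this iteration (its loss at convergence).
        d_losses = [train_discriminator(x_min) for _ in range(epochs)]
        prev_loss, curr_loss = curr_loss, min(d_losses)

        # With the discriminator fixed, train the generator for one epoch.
        train_generator()

        # Formula (15): the next iteration's discriminator epochs come from the ratio
        # of the minimum losses of the previous two iterations, rounded down; a ratio
        # that rounds down to 0 falls back to 1 epoch. abs() is an assumed guard
        # because the WGAN loss can be negative.
        if prev_loss is not None and abs(curr_loss) > 1e-12:
            epochs = max(1, math.floor(abs(prev_loss) / abs(curr_loss)))
        else:
            epochs = 1
```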

Moreover, since many mathematical symbols are used in this section, all of them are summarized in Table 1.

3.4. Shallow Machine Learning Classifier

Shallow machine learning offers good performance and high efficiency for learning from and analyzing data [36]. There are many shallow machine learning classifiers, e.g., NB, RF, and LR. Based on our previous experimental results and an analysis of the existing literature, we find that SVM has several advantages, such as stable classification performance, the ability to deal with noise and outliers, and effectiveness in handling nonlinear and high-dimensional data. Therefore, we choose SVM as the classifier in our experiments.

Shallow machine learning consists of two stages: training stage and testing stage. In the training stage, the generated samples based on CWGAN combined with the original training data set are input to the shallow machine learning classifier to train and obtain the optimal model structure. In the testing stage, the imbalanced testing data set is input to the trained shallow machine learning model to get the labels of the predicted testing data.

In the experiments, the true labels of the testing data set are known, so the performance of the shallow machine learning models, such as accuracy, precision, recall, and F1, can be obtained by comparing the true labels with the predicted labels and calculating the confusion matrix. In this work, the accuracy of the majority class in the testing data set (acc+), the accuracy of the minority class in the testing data set (acc−), and the accuracy, precision, recall, F1, and G-means of the whole testing data set are computed to evaluate the performance.

The confusion matrix for binary classification includes four items: true positive (TP), false negative (FN), false positive (FP), and true negative (TN), where the majority class is taken as the positive class and the minority class as the negative class. The accuracy of the minority class in the testing data set, acc−, is computed on the minority class:

$$acc^- = \frac{TN}{TN + FP}$$

The accuracy of the majority class in the testing data set, acc+, is computed on the majority class:

$$acc^+ = \frac{TP}{TP + FN}$$

Accuracy, precision, recall, F1, and G-means of the whole testing data set are computed as

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \quad Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN},$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}, \quad \text{G-means} = \sqrt{acc^+ \times acc^-}$$
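A short sketch of these metrics computed from a confusion matrix, assuming the label convention above (the majority class treated as the positive class) and scikit-learn's `confusion_matrix`, is given below for reference.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def imbalance_metrics(y_true, y_pred, minority_label=1, majority_label=0):
    """Compute acc+, acc-, accuracy, precision, recall, F1, and G-means,
    treating the majority class as the positive class."""
    # With labels=[minority, majority], ravel() yields TN, FP, FN, TP in that order.
    tn, fp, fn, tp = confusion_matrix(
        y_true, y_pred, labels=[minority_label, majority_label]).ravel()
    acc_minus = tn / (tn + fp)                      # accuracy of the minority class
    acc_plus = tp / (tp + fn)                       # accuracy of the majority class
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    g_means = np.sqrt(acc_plus * acc_minus)
    return acc_plus, acc_minus, accuracy, precision, recall, f1, g_means
```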

4. Experiments

4.1. Data Set

The data sets are downloaded from the website of the Canadian Institute for Cybersecurity, which is based at the University of New Brunswick in Fredericton [37]. Many security data sets collected on the website are used by companies, research centers, and universities. In our experiment, we download the URL data set, which contains several different types of URLs, and 79 features are selected from the original data, extracted from parts of each URL such as the domain, path, file name, and argument [38]. Then, we extract subsets of samples from the data set to form several imbalanced data sets, ranked from low to high according to IR. The details of the data sets are shown in Table 2.

Benign URLs: they are collected from Alexa top websites. First, the domains are crawled by a Heritrix web crawler. Then, around half a million unique URLs are extracted from the domains. Later, the extracted URLs are filtered by VirusTotal.

Malware URLs: they link to malicious websites and come from DNS-BH, which maintains a list of malicious sites.

Defacement URLs: they are fraudulent or hidden URLs, ranked by Alexa, that link to malicious Web pages.

Spam URLs: they are collected from the publicly available WEBSPAM-UK2007 data set.

In the experiments, the original data set is divided into the training set and the testing set according to the ratio of 7 : 3. The training set is used to train the model, and the testing set is used to validate and evaluate the model.

All experiments are performed in JetBrains PyCharm 2017 with the Python 3.6 interpreter on a laptop with an Intel Core i5-6200U 2.3 GHz CPU and 8 GB RAM running the Windows 10 OS.

4.2. Experiment Results and Discussion
4.2.1. Performance of Original Imbalanced Data Set

In the first part of the experiments, the performances on the original imbalanced data sets are computed. First, each original imbalanced data set is divided proportionally into a training data set and a testing data set, both of which are imbalanced. Second, the training data set is input to SVM, a traditional and effective classifier, to train the model. Third, the testing data set is input to the trained SVM to predict the labels. At the same time, the accuracy of the majority class in the testing data set (acc+), the accuracy of the minority class in the testing data set (acc−), and the accuracy, precision, recall, F1, and G-means of the whole testing data set are computed to evaluate the performance, as shown in Table 3. In addition, it should be noted that the recorded results are the average of 5 experiments.

According to the performance of the original imbalanced data sets in Table 3, we can find some interesting results. First, with the increase of IR, the acc+ values representing the accuracy of the majority class increase, while the acc− values representing the accuracy of the minority class decrease. These results are consistent with the known conclusion that the decision boundaries of classifiers tend toward the majority class, leading to classification errors on the minority class. Second, with the increase of IR, the accuracy, precision, recall, and F1 of the whole testing data sets increase, whereas the G-means decrease. That is because accuracy, precision, recall, and F1 represent the overall performance of the testing data sets, and the accuracy of the majority class dominates the overall accuracy. The trend of acc− is the same as that of G-means: both decrease with the increase of IR. So, G-means can represent the performance of the minority class. Furthermore, we can draw some conclusions. On one hand, the overall performance of the whole data set cannot reflect the performance on the imbalanced classes. On the other hand, when the IR is very high, the accuracy of the minority class is very low.

In the field of cyberspace security, network threat data are usually viewed as the minority class. If the minority data are misclassified and thus achieve low accuracy, the consequences can be serious. Therefore, it is necessary to solve the problem of the imbalanced data set, which is the goal of our work. In the following parts, we introduce the results of our proposed method, which generates sufficient new minority samples to solve the imbalance problem and improve the accuracy of the minority class.

4.2.2. Comparison between WGAN and CWGAN

WGAN can generate new data based on the original data set to alleviate the problem of the imbalanced data set. In this work, we propose CWGAN to improve the training stability and thereby the performance of WGAN. In this part, the performances of WGAN and CWGAN are computed and compared. First, the original imbalanced data sets are divided proportionally into training data sets and testing data sets. The number of new minority samples needed for a training data set equals the number of majority samples minus the number of minority samples. Second, WGAN and CWGAN are used to generate the new minority samples for the training data sets. Then, the new minority samples and the training data sets are fused to form new balanced training data sets. Third, the new balanced training data sets are input to SVM to train the model, and the testing data sets are input to the trained SVM to predict the labels. At the same time, acc+, acc−, accuracy, precision, recall, F1, and G-means are computed to evaluate the performances, which are shown in Tables 4 and 5. In addition, it should be noted that the number of training iterations of both WGAN and CWGAN is 1000. The recorded results are the average of 5 experiments.

The performances in Tables 4 and 5 are computed based on the output of the SVM trained on the balanced training data sets, which are fused with the new minority samples generated by WGAN and CWGAN, respectively. On the whole, comparing the results in Tables 3–5, we can conclude that acc− and G-means in Tables 4 and 5 increase obviously compared with those in Table 3, while acc+, accuracy, precision, recall, and F1 in Tables 4 and 5 decrease a little compared with those in Table 3. That is because WGAN and CWGAN generate new minority samples that balance the original data sets, which trains the SVM to focus more on the minority class. Comparing the results in Tables 4 and 5, acc− and G-means in Table 5 based on CWGAN are higher than those in Table 4 based on WGAN, especially when IR is high, and acc+, accuracy, precision, recall, and F1 of some data sets in Table 5 are a little higher than those in Table 4. Therefore, we can conclude that oversampling with minority samples generated by CWGAN outperforms oversampling based on WGAN.

4.2.3. Discriminator Loss of WGAN and CWGAN

During the training stage, the discriminator loss compares how close the predictions on the generated data are to those on the real data; the smaller the discriminator loss, the closer the generated data are to the real data. In this section, we study the trend of the discriminator loss of WGAN and CWGAN on different data sets. The number of discriminator epochs of WGAN in each iteration is fixed to 5, while the number of epochs of CWGAN in each iteration is computed when the model converges within that iteration. The number of training iterations of both WGAN and CWGAN is 1000. The loss of each iteration in one experiment is recorded and plotted in Figure 2.

Figures 2(a)–2(h) show the discriminator loss curves of WGAN and CWGAN on D1 to D8. These curves show that the discriminator loss of CWGAN is smaller and more stable than that of WGAN as the iterations increase, and the values of the discriminator loss of CWGAN are very close to 0. Since the discriminator loss measures the similarity between the generated data and the real data, these results indicate that the data generated by CWGAN are closer to the real data than those generated by WGAN. Therefore, combined with the experimental results in Section 4.2.2, we can confirm that CWGAN outperforms WGAN.

4.2.4. Training Convergence of CWGAN

The training of WGAN with a generator and a discriminator is imbalanced. To make the training process tend toward the Nash equilibrium, we fix the training epochs of the generator and dynamically compute the training epochs of the discriminator according to the convergence of the training loss. In each iteration, the number of training epochs of the discriminator is determined by the ratio of the training losses in the last two iterations. In this part, we record the training epochs of the discriminator in each iteration on different data sets. The results are plotted in Figure 3. In addition, it should be noted that the number of training iterations of CWGAN is 1000. The recorded results are from a single experiment.

Figure 3 shows the training epochs of the discriminator in each iteration on different data sets. We divide the training process into two stages. In the early stage, the values of the training epoch change frequently and are unstable. In the later stage, the values of the training epoch are stable. In addition, in most cases, when the training process converges, the higher the IR is, the more samples CWGAN generates, and the higher the training epoch is. For example, the training epoch of D8, whose IR is the highest, is the highest.

4.2.5. Comparison between the Oversampling Methods

At present, there are many oversampling methods, such as VAE and SMOTE. In this section, we compare the performance of several oversampling methods. The experimental process is similar to that in Section 4.2.2. First, the oversampling methods, including VAE, SMOTE, WGAN, and CWGAN, are applied to generate the minority examples, which are combined with the original training data sets to train SVM. Second, the testing data sets are input to the trained SVM to predict the labels. At the same time, acc+, acc−, accuracy, and G-means are computed to evaluate the performances, which are plotted in Figure 4. In addition, it should be noted that the number of training iterations of both WGAN and CWGAN is 1000. The recorded results are the average of 5 experiments.

In Figure 4(a), the accuracy of the minority class, acc−, based on CWGAN is the highest on the same data sets. Especially on D6, D7, and D8, whose IR is higher, the accuracy on most data sets improves the most after CWGAN generates minority samples to balance the imbalanced data sets. In Figure 4(b), the comparison of G-means is similar to that of acc−: G-means on most data sets are highest when based on CWGAN. We can conclude that the new minority samples generated by CWGAN improve the performance of the minority class the most. In Figures 4(c) and 4(d), in most cases, the accuracy of the majority class, acc+, and the accuracy of the whole data set are close to those of the original data, which means that CWGAN does not weaken the accuracy of the majority class or of the whole data set.

5. Conclusion

WGAN, as a generative model, can solve the imbalanced data problem by generating new samples for the minority class. However, it may suffer from mode collapse, and its training sometimes does not converge. In this paper, we improved the training process to achieve training stability of WGAN by proposing a convergent WGAN-based oversampling model called CWGAN. In each iteration of the training process, the number of epochs of the generator is fixed, while the number of epochs of the discriminator is dynamic and is determined by the convergence of the discriminator loss function in the last two iterations. The experimental results showed that CWGAN not only improves the training stability of WGAN, making the loss smoother and closer to 0, but also improves the performance of the minority class through oversampling while maintaining the accuracy of the majority class and the whole data set. Compared with other oversampling methods, CWGAN performs better and enables stable training.

Data Availability

The data used in the experiments are downloaded from public websites; the links are cited in [37]. In addition, the data sets are also available from the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This paper was supported by the National Natural Science Foundation of Zhejiang Province (nos. LY20F020012 and LQ19F020008) and the National Natural Science Foundation of China (no. 61802094).