Abstract

Backdoor attacks have recently been recognized as a major security threat to deep neural networks (DNNs). The attackers inject backdoors into DNNs during model training, for example in federated learning. The infected model behaves normally on clean samples, while the backdoor is activated only by predefined triggers and forces the model to output attacker-specified results. Most existing defense approaches assume that the triggers on different poisoned samples are visible and identical, such as a white square in the corner of the image. Moreover, sample-specific triggers are invisible and difficult to detect, which poses a great challenge to existing defense protocols. In this paper, to address the above problems, we propose a backdoor detection and mitigation protocol based on the wider separate-then-reunion network (WISERNet), a deep steganalyzer for color images, which detects the backdoors hidden in poisoned samples even if the embedding algorithm is unknown and then feeds the detected poisoned samples back into the infected model for backdoor unlearning and mitigation. The experimental results show that our work outperforms state-of-the-art backdoor defense methods such as fine-pruning and ABL against three typical backdoor attacks. Our protocol reduces the attack success rate to nearly 0% on the test data while decreasing the classification accuracy on clean samples by less than 3%.

1. Introduction

Deep neural networks (DNNs) are widely used in artificial intelligence applications such as image recognition, speech recognition, and natural language processing [1–3], in which security and privacy protection are important issues [4]. The massive amount of available data and growing computing power have facilitated the development of DNNs, but training DNN models is still very expensive. Users often choose to train DNN models on third-party platforms (e.g., Amazon EC2) or even use third-party trained models directly to reduce training costs. However, this practice is vulnerable to backdoor attacks, in which an attacker uses predefined triggers (pattern patches) to make the model misclassify any input into a predefined target label. Backdoored models behave normally on clean samples without triggers, just like their clean counterparts, making them comparable to highly stealthy viruses that disguise themselves as benign while causing great damage [5].

Backdoor attacks greatly threaten DNNs in practical applications because they reduce the trustworthiness of DNN models and even endanger safety-critical areas. The separation of data collection and model training in deep learning often allows attackers to obtain and modify training samples, misleading DNNs by adding invisible perturbations to a small proportion of the dataset, such as local patches or steganographic data in the lower right corner of an image, or even by tampering with the weights that affect the model during training [6–10]. The ability of infected DNN models to correctly classify clean samples makes it difficult for users to detect the presence of backdoors, and the hidden nature of triggers makes it difficult for users to identify them. Thus, the invisibility and stealthiness of triggers make detecting backdoor attacks a considerable challenge [11–13].

Most existing backdoor defense methods fall into two types: model-based defense and data-based defense. The former detects whether the model is infected by a backdoor, and the latter considers whether the data contain a trigger. Recently, Li et al. [14] revealed that existing backdoor attacks were easily mitigated by current defenses [15–17] mostly because their backdoor triggers are sample-agnostic, i.e., different poisoned samples contain the same trigger no matter what trigger pattern is adopted. Thus, they proposed an attack method, called the sample-specific backdoor attack (SSBA), which makes it more difficult to detect and remove the backdoors, since most current defense protocols reconstruct and detect backdoor triggers by relying on the same behavior across different poisoned samples [15–17]. SSBA is an invisible backdoor attack that generates invisible sample-specific triggers with a pretrained encoder-decoder network. The reason why current mainstream defense methods have difficulty detecting sample-specific triggers is that their success relies on the assumption that the triggers are sample-agnostic. For example, pruning-based defenses assume that the neurons associated with the backdoor are different from those activated by clean samples, so the defender can remove the hidden backdoor by pruning the potentially backdoor-related neurons. However, the non-overlap between the two groups of neurons holds only because a sample-agnostic trigger pattern is simple and the DNN needs only a few independent neurons to encode it; this assumption might easily be broken when the trigger is sample-specific.

Inspired by image steganalysis techniques [18], we find that, for poisoned images, the intensity values at the same position in different color channels remain strongly correlated regardless of whether the triggers are sample-specific or invisible; in contrast, the triggers in the poisoned images constitute an additional perturbation with only weak correlation among the color channels. In addition, since the poisoned samples of a backdoor attack are bound to the target label, the correlation between the trigger pattern and the target label can be effectively broken by randomizing the class target.

We propose a new backdoor detecting and removing protocol, which can detect backdoors regardless of whether the triggers are specific to poisoned samples. Specifically, it detects whether a color image contains a trigger by exploiting the fact that the additional perturbation is retained by the wider separate-then-reunion network (WISERNet). To address the weakness that poisoned samples in backdoor attacks are always bound to the target label, our protocol breaks the correlation between the trigger pattern and the target label through backdoor unlearning, which leads to model purification. In summary, our contributions are as follows: (i) A backdoor defense method based on image steganalysis is proposed. The trigger contained in a poisoned image can be regarded as an additional perturbation: the intensity values at the same location are strongly correlated across color channels, while the trigger is only weakly correlated across channels. The protocol is shown to be valid whether the trigger is visible or invisible. (ii) A secure backdoor detecting and removing protocol is designed. The protocol detects the poisoned images in the training dataset with the wider separate-then-reunion network, regardless of whether the trigger is specific to the poisoned samples, and retrains the model for backdoor unlearning with the detected poisoned images. (iii) Extensive experiments are conducted on the proposed protocol. We empirically show that our protocol is robust against three state-of-the-art backdoor attacks. Compared with the state-of-the-art backdoor defense protocols fine-pruning [15] and ABL [19], our protocol reduces the attack success rate to nearly 0% on both object classification and face recognition tasks and retains the accuracy after removing the backdoors.

2.1. Backdoor Attacks

A common method for implementing backdoor attacks is data poisoning. When the model is training, the poisoned samples are injected into the training dataset. After that, the model is influenced by the poisoned samples, deviates from the desired training effect of the original training data, and changes “slightly” in the desired direction according to the feature of the poisoned samples, which allows the attacker to modify the model and implant a backdoor [20]. According to the visibility of trigger, backdoor attacks based on data poisoning can be classified into two categories: visible backdoor attack and invisible backdoor attack.

2.1.1. Visible Backdoor Attack

Gu et al. [21] first proposed the backdoor attack BadNets, which injects backdoors by modifying part of the training data; its triggers can be of arbitrary shapes, such as squares. Chen et al. [22] first demonstrated that data poisoning attacks can create physically implementable backdoors. Liu et al. [23] proposed a Trojan attack that designs triggers based on the values of internal neurons in DNNs, strengthening the connection between the trigger and the internal neurons and enabling backdoor implantation with fewer poisoned samples. Chen et al. [24] improved the stealthiness of the trigger by using generative adversarial network techniques to implant the trigger as a watermark into clean samples, reducing the discrepancy between trigger features and clean sample features. Many other works [25, 26] focus on optimizing triggers; although all of these attacks achieve high success rates, the triggers are visible and can easily be noticed by humans.

2.1.2. Invisible Backdoor Attack

Zeng et al. [27] showed that poisoned samples can be identified from frequency information and constructed frequency-domain invisible poisoned samples, thus achieving trigger invisibility. Li et al. [14] proposed to generate sample-specific triggers with a pretrained encoder-decoder network. From a steganography perspective, Li et al. [28] proposed an optimization framework that constrains trigger generation through regularization and embeds the triggers in the bit space using image steganography to make them invisible.

2.2. Backdoor Defenses

Due to the great potential damage of backdoor attacks to artificial intelligence applications, an increasing number of backdoor defense protocols have been proposed to mitigate such security threats. Existing defense approaches include model-based defense and data-based defense.

2.2.1. Model-Based Defenses

Model-based defense detects whether a model is infected with a backdoor. Liu et al. [15] found that neurons associated with backdoors are usually dormant during inference on benign samples and therefore proposed to prune the associated backdoor neurons to eliminate backdoors in the model. Zhao et al. [29] proposed to repair infected models with a limited number of clean samples using mode connectivity techniques [30]. Liu et al. [31] proposed a neural network scanning technique inspired by EBS [32] to determine whether a model has a backdoor; however, it is effective only against single-trigger attacks and fails against multi-trigger attacks. Wang et al. [17] proposed a defense method called Neural Cleanse (NC) that synthesizes a candidate trigger for each class and compares their sizes; if the smallest synthesized trigger is significantly smaller than the others, the model is considered to be infected with a backdoor. Recently, Li et al. [19] proposed the concept of anti-backdoor learning and designed a generic protocol, ABL, which automatically prevents backdoor injection during model training.

2.2.2. Data-Based Defenses

Data-based defense detects whether a sample contains a trigger. Gao et al. [16] proposed a method known as STRIP that filters malicious samples by superimposing various images onto input samples and observing the randomness of the classification results. Bao et al. [33] proposed an image preprocessing method that identifies the trigger region using the Grad-CAM [34] technique, removes it, and replaces it with a neutral-colored box, since the region containing the trigger has a high impact on the model at the inference stage. Udeshi et al. [35] proposed to build a trigger interceptor based on the dominant color of the image to locate and remove backdoor triggers in poisoned samples. Han et al. [36] proposed an evaluation framework that preprocesses input samples with data augmentation to break the connection between the backdoor and the trigger, making triggers invalid during inference, and fine-tunes the infected model with another data augmentation technique to eliminate the effect of backdoors.

The fine-pruning approach (FP) proposed by Liu et al. [15] shows degraded defense performance across different models and datasets. The anti-backdoor learning approach of Li et al. [19] is more complex to implement: it divides model training into two stages, backdoor isolation and backdoor unlearning, and the choice of the turn-period from the backdoor isolation process to the backdoor unlearning process is critical. For different attack methods and datasets, the choice of the turn-period also affects the performance of the model differently. In contrast, our protocol performs well for different datasets, models, and attack methods.

3. Overview

In this section, we define our attack model, give the assumptions and goals of the defense, and finally provide an intuitive overview of our approach for identifying and mitigating backdoor attacks.

3.1. Attack Models and Defense Assumption

In our attack model, the user trains a DNN model on a training dataset, denoted as $D_{train}$, that may be obtained from a third party; the training process of the DNN may even be outsourced to an untrustworthy third party. An attacker may poison part of the training data, set the size and position of the triggers at will, and adjust the training stage of the model, but cannot access the validation dataset or manipulate the inference stage of the model. The attacker's goal is to return to the user a trained, backdoor-infected model that behaves like an uninfected model on clean samples but classifies samples containing triggers into the attacker-specified target label.

The attacker assumed in our work is more powerful. The attacker considered by Li et al. [14] can only access the training dataset and cannot manipulate the training stage of the model, and the attacker considered by Liu et al. [23] cannot access the training data and can only modify the trained model. The attacker defended against in our work not only has access to the training dataset but can also manipulate the training stage of the model. It is reasonable to consider an attacker with limited capabilities, but as attack techniques advance, the defender should assume a more powerful attacker.

We also assume that the defender has access to the trained DNN model and can use a set of clean samples to test the performance of the model.

3.2. Design Goals

Our defense protocol includes two specific goals: (i) Backdoor detecting: after the training stage of the DNN model, a backdoor detector built on WISERNet can successfully detect whether a sample image contains a trigger, i.e., whether it is a poisoned image. (ii) Backdoor mitigating: since there is a strong correlation between triggers and target labels in backdoor attacks, this weakness is exploited by feeding the detected poisoned samples back into the infected model and retraining it to achieve backdoor unlearning.

3.3. Design Intuition

We describe our high-level intuition for detecting triggers in poisoned samples and overview our defense.

3.3.1. Key Intuition

The invisibility of the trigger and the low poisoning rate make it difficult for the defender to detect whether a sample is poisoned. We derive the intuition behind our technique from a basic property of backdoor triggers: whether or not the trigger is visible, it can be regarded as additional noise, and this noise can be a special pattern or a string representing the target label. For the image content of a poisoned image, the intensity values of the three bands at the same position exhibit a strong correlation, and their expectations are statistically similar. In contrast, the additional noise added to the poisoned sample shows a much weaker correlation between bands and may be almost uncorrelated.

To verify the above statement, we analyzed 10,000 poisoned images generated by BadNets [21], Blend Attack [22], and SSBA [14]. A poisoned image of size $M \times N$ comprises three bands, namely the red, green, and blue bands. The correlation between two different bands $A$ and $B$ of the poisoned image is defined as follows:

$$\mathrm{corr}(A, B) = \frac{\sum_{i=1}^{M}\sum_{j=1}^{N}\left(A_{ij} - \bar{A}\right)\left(B_{ij} - \bar{B}\right)}{\sqrt{\sum_{i=1}^{M}\sum_{j=1}^{N}\left(A_{ij} - \bar{A}\right)^{2}}\sqrt{\sum_{i=1}^{M}\sum_{j=1}^{N}\left(B_{ij} - \bar{B}\right)^{2}}}, \quad (1)$$

where $A$ and $B$ are $M \times N$ band matrices of the poisoned image and $\bar{A}$ and $\bar{B}$ are the means of their elements. Table 1 reports the correlations between the intensity values of the corresponding color bands, and they all show strong correlation; the triggers generated by the three backdoor attacks have almost no effect on the correlation of the intensity values among bands. On the other hand, the added triggers themselves show nearly zero correlation between bands for BadNets, and even for Blend Attack and SSBA they exhibit only weak correlation.
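For illustration, the following Python sketch (using NumPy and Pillow, which are our choices rather than the paper's) computes the interband correlations of an image and of the trigger residual obtained by subtracting a clean image from its poisoned counterpart; the file names are placeholders.

```python
import numpy as np
from PIL import Image

def band_correlation(a, b):
    """Pearson correlation between two M x N intensity matrices (color bands), as in (1)."""
    a = a.astype(np.float64).ravel() - a.mean()
    b = b.astype(np.float64).ravel() - b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def inter_band_correlations(img):
    """Correlations of the R/G, R/B, and G/B band pairs of an H x W x 3 array."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return {"RG": band_correlation(r, g),
            "RB": band_correlation(r, b),
            "GB": band_correlation(g, b)}

# Placeholder file names: a clean image and its poisoned counterpart.
clean = np.asarray(Image.open("clean.png").convert("RGB"), dtype=np.int16)
poisoned = np.asarray(Image.open("poisoned.png").convert("RGB"), dtype=np.int16)
residual = poisoned - clean  # the additive trigger perturbation

print("poisoned image bands  :", inter_band_correlations(poisoned))   # strong correlation
print("trigger residual bands:", inter_band_correlations(residual))   # weak / near-zero correlation
```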

We note that it is difficult to determine directly whether an image is poisoned from the weak interband correlation of the trigger alone, since the strongly correlated image content dominates; this motivates the use of a learned deep steganalyzer. In the pipeline of our defense method, shown in Figure 1, the backdoor target label is "frog", and the trigger is invisible additive noise embedded into the clean picture by a pretrained encoder. In the training stage, the poisoned samples and clean samples are used to train the DNN, yielding a backdoored DNN that classifies poisoned samples into the target label while performing normally on clean samples. The pretrained detector then scans the training set and adds a sample to the detection set if it is predicted to be poisoned. The detection set is fed back into the backdoored DNN for backdoor unlearning, which yields a clean DNN. In the inference stage, the clean DNN behaves normally on the test samples, and the poisoned samples are no longer classified into the target label.

4. Our Protocol Design

We will describe the details of the approach to detecting triggers and backdoor unlearning in this section, as outlined in Algorithm 1. Table 2 describes the symbols used in Algorithm 1.

Input: A set of clean samples $X_c$, a training set $D_{train}$, a backdoored DNN model $f_{\theta}$.
Output: A clean DNN model $f_{\theta'}$.
(1) Initialize $X_p \leftarrow \emptyset$, $D_d \leftarrow \emptyset$, and the detector $D$.
(2) // Step 1: generate poisoned-clean pair samples.
(3) Set $X_p \leftarrow \{G(x) \mid x \in X_c\}$, where $G(\cdot)$ generates a poisoned sample;
(4) Set $X_{pair} \leftarrow X_c \cup X_p$;
(5) // Step 2: train the detector.
(6) Set the number of epochs $T$ and the learning rate $\eta$;
(7) for epoch $= 1$ to $T$ do
(8)  for each mini-batch $B \subset X_{pair}$ do
(9)   Update the parameters of detector $D$ on $B$ with stochastic gradient descent;
(10)  end for
(11) end for
(12) // Step 3: detect poisoned samples in the training set.
(13) Set $D_d \leftarrow \emptyset$;
(14) for each sample $x \in D_{train}$ do
(15)  // $D(x)$ indicates the inference result of detector $D$
(16)  if $D(x) = \text{poisoned}$ then
(17)   $D_d \leftarrow D_d \cup \{x\}$;
(18)  end if
(19) end for
(20) // Step 4: backdoor unlearning.
(21) Input $D_d$ into $f_{\theta}$ and update the model using equation (5);
(22) return the clean model $f_{\theta'}$.
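To summarize Algorithm 1, the following Python sketch strings the four steps together; the four helper functions are hypothetical placeholders for the procedures detailed in Sections 4.1 and 4.2.

```python
def backdoor_defense(clean_samples, train_set, backdoored_model,
                     generate_poisoned, train_detector, detect_poisoned, unlearn_backdoor):
    """Sketch of Algorithm 1; the four callables are placeholders for Sections 4.1 and 4.2."""
    # Step 1: build poisoned-clean pairs from the defender's clean reference samples.
    poisoned_samples = [generate_poisoned(x) for x in clean_samples]
    pairs = list(zip(clean_samples, poisoned_samples))

    # Step 2: train the WISERNet-based detector on the pairs.
    detector = train_detector(pairs)

    # Step 3: run the detector over the training set to collect the detection set D_d.
    detection_set = [(x, y) for (x, y) in train_set if detect_poisoned(detector, x)]

    # Step 4: feed D_d back into the infected model for backdoor unlearning (equation (5)).
    return unlearn_backdoor(backdoored_model, detection_set)
```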
4.1. Backdoor Detection Design

Let $D_{train} = \{(x_i, y_i)\}_{i=1}^{N}$ denote the training set containing $N$ samples, where $x_i \in \mathcal{X}$ is an input and $y_i \in \mathcal{Y}$ is its label. The DNN model learns a function $f_{\theta}: \mathcal{X} \rightarrow \mathcal{Y}$ with parameters $\theta$. $D_p$ indicates the poisoned part of the training set, and $D_c$ represents the clean part. Specifically, $D_{train}$ consists of $D_p$ and $D_c$, i.e.,

$$D_{train} = D_c \cup D_p, \qquad |D_p| = \gamma N, \quad (2)$$

where $\gamma$ indicates the poisoning rate, $0 < \gamma < 1$. $D_d$ indicates the set of samples flagged as poisoned by the detector, where $D_d \subseteq D_{train}$. Since it is difficult to detect all the poisoned samples in the training set, some clean samples are also included in $D_d$; the more clean samples are included in $D_d$, the lower the classification accuracy of the model on clean samples will be after backdoor unlearning. We define the detection rate $\beta = |D_d|/N$, which plays a key role in the final model performance.

4.1.1. Observation

The trigger generation in most backdoor attack methods is similar to a steganography algorithm applied to images, in which additional noise is embedded into the image. For example, for the attack proposed in [22], $x' = G(x) = (1 - \lambda)\,x + \lambda\,t$, where $G(\cdot)$ generates the poisoned sample, $t$ indicates the backdoor trigger, and $\lambda$ is the blending rate. The trigger generation in SSBA is also motivated by DNN-based image steganography [37].
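As an illustration, a minimal sketch of two such poisoning operations is given below, assuming images stored as H x W x 3 uint8 arrays; the default patch size and blending rate here are only illustrative.

```python
import numpy as np

def blend_poison(image, trigger, blend_rate=0.2):
    """Blend-style poisoning (cf. [22]): x' = (1 - lambda) * x + lambda * t."""
    poisoned = (1.0 - blend_rate) * image.astype(np.float32) + blend_rate * trigger.astype(np.float32)
    return np.clip(poisoned, 0, 255).astype(np.uint8)

def patch_poison(image, patch_size=3, value=255):
    """BadNets-style poisoning: stamp a small white square into the lower right corner."""
    poisoned = image.copy()
    poisoned[-patch_size:, -patch_size:, :] = value
    return poisoned
```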

Based on this observation and the key intuition, we can detect whether an image is poisoned using steganalysis. Convolutional neural network structures are widely used in gray-scale image steganalysis. For color images, normal summation-based convolution preserves strongly correlated patterns but suppresses uncorrelated or weakly correlated noise. When training the detector, it is necessary to preserve the characteristics of the trigger as much as possible. The wider separate-then-reunion network (WISERNet) [18] therefore uses channel-wise convolution in the bottom convolution layer, which preserves the features of the extra noise added to the image. In addition, WISERNet initializes the convolution kernels with the high-pass filters of the spatial rich model [38] to better extract noise (trigger) features.
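As an illustration of such high-pass initialization, the snippet below applies one well-known member of the SRM filter bank (the 5x5 "KV" kernel) to a single color band; the use of SciPy and this particular kernel are choices for illustration, and the full 30-filter bank is described in [38].

```python
import numpy as np
from scipy.signal import convolve2d

# One member of the SRM high-pass filter bank: the 5x5 "KV" kernel. In WISERNet, filters of
# this kind initialize the bottom channel-wise convolution so that smooth image content is
# suppressed and additive noise such as a backdoor trigger is emphasized.
KV_KERNEL = np.array([[-1,  2,  -2,  2, -1],
                      [ 2, -6,   8, -6,  2],
                      [-2,  8, -12,  8, -2],
                      [ 2, -6,   8, -6,  2],
                      [-1,  2,  -2,  2, -1]], dtype=np.float64) / 12.0

def highpass_residual(band):
    """Convolve one color band (2-D array) with the KV kernel to expose the noise residual."""
    return convolve2d(band.astype(np.float64), KV_KERNEL, mode="same", boundary="symm")
```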

4.1.2. How to Build the Detector

We use WISERNet [18] as the core of the backdoor detector. Since the image convolution operation affects the additional noise [18], summation in a convolution layer retains strongly correlated patterns but damages uncorrelated noise. Therefore, WISERNet uses normal summation-based convolution only in the upper convolution layers and avoids summation in the bottom convolution layer. WISERNet can be divided into three parts: separation, reunion, and prediction. The separation part consists of a channel-wise convolution layer; the main purpose of the bottom convolution layer is to suppress the correlated image content, and WISERNet gives up summation in this layer and adopts channel-wise convolution to avoid weakening the uncorrelated noise. The reunion part consists of three wide and relatively shallow normal convolution layers that retain summation, and the number of kernels in each convolution layer gradually increases to augment the capacity of WISERNet. The typical practice in deep learning is to make the network deeper; however, the deeper the network, the more outputs are involved in the summation and the more severely the weakly correlated signal is damaged. Therefore, WISERNet instead makes the upper convolution layers wider to improve its detection performance. The prediction part consists of four fully connected layers that make the final prediction.

As shown in Figure 2, the input image is first split into its red, green, and blue bands, and channel-wise convolution is applied to each band separately. The convolution kernels of each channel are initialized with the 30 high-pass filters of the spatial rich model, so 30 feature maps are generated per channel. The three independent channel outputs are then concatenated into a 90-channel output, which serves as the input to the second convolution layer. From the second convolutional layer onwards, standard convolution is used, with each block consisting of a convolution layer, a batch normalization layer, an activation function layer, and an average pooling layer in order. Since the complexity of the convolutional layers affects feature extraction, the number of convolutional kernels is quadrupled layer by layer to maintain sufficient capacity for noise feature extraction. After the last normal convolutional layer, the output feature maps are pooled and flattened into a feature vector that is fed to the fully connected layers. The fully connected layers contain 800, 400, 200, and 2 neurons, respectively, and the three hidden layers use the ReLU activation function. The last fully connected layer produces the final prediction; if a sample is predicted to be poisoned, the model trained on it is considered to have a backdoor buried in it.
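To make the description concrete, the following is a simplified, PyTorch-based sketch of a WISERNet-like detector; the paper implements WISERNet in Caffe, and the exact layer widths, kernel sizes, and SRM-based kernel initialization are omitted or approximated here.

```python
import torch
import torch.nn as nn

class WiserNetLikeDetector(nn.Module):
    """Simplified WISERNet-style backdoor detector (a sketch, not the exact architecture)."""

    def __init__(self, widths=(90, 360, 1440)):
        super().__init__()
        # Separation: channel-wise (grouped) convolution, 30 kernels per color band -> 90 maps.
        # In the paper these kernels are initialized with 30 SRM high-pass filters (omitted here).
        self.separate = nn.Conv2d(3, 90, kernel_size=5, padding=2, groups=3)
        # Reunion: wide normal convolutions; the kernel count grows layer by layer.
        layers, in_ch = [], 90
        for out_ch in widths:
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.ReLU(inplace=True),
                       nn.AvgPool2d(kernel_size=3, stride=2, padding=1)]
            in_ch = out_ch
        self.reunion = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Prediction: fully connected layers with 800, 400, 200, and 2 neurons.
        self.classify = nn.Sequential(
            nn.Linear(in_ch, 800), nn.ReLU(inplace=True),
            nn.Linear(800, 400), nn.ReLU(inplace=True),
            nn.Linear(400, 200), nn.ReLU(inplace=True),
            nn.Linear(200, 2))

    def forward(self, x):
        x = self.separate(x)         # per-band convolution preserves uncorrelated noise
        x = self.reunion(x)          # summation convolutions on the concatenated maps
        x = self.pool(x).flatten(1)  # global average pooling to a feature vector
        return self.classify(x)      # logits: [clean, poisoned]
```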

4.2. Backdoor Mitigation Design

Although the detector identifies poisoned samples in the training set, the backdoor in the model still exists. Let $(x_i, y_i)$ denote a clean training sample and $(x'_j, y_t)$ a poisoned sample with target label $y_t$; the training of the model under a backdoor attack can be viewed as minimizing the following empirical error:

$$\min_{\theta}\left[\sum_{i=1}^{N_c} \mathcal{L}\left(f_{\theta}(x_i), y_i\right) + \sum_{j=1}^{N_p} \mathcal{L}\left(f_{\theta}(x'_j), y_t\right)\right], \quad (3)$$

where $N_c$ and $N_p$ are the numbers of clean samples and poisoned samples in the training set, respectively, and $\mathcal{L}$ indicates the loss function, such as the cross-entropy loss commonly used in DNN training.

Equation (3) shows that the backdoor injection process can be considered an instance of multitask learning: the main task is training on the clean samples, whereas the other task is training on the poisoned samples, that is, the backdoor task. To prevent the model from learning the backdoor task and thus achieve the goal of backdoor unlearning, one can minimize the following empirical error:

$$\min_{\theta}\left[\sum_{i=1}^{N_c} \mathcal{L}\left(f_{\theta}(x_i), y_i\right) - \sum_{j=1}^{N_p} \mathcal{L}\left(f_{\theta}(x'_j), y_t\right)\right]. \quad (4)$$

Compared with (3), equation (4) maximizes the loss on the backdoor task.

In practice, the defender does not know $D_p$ exactly: it is difficult to detect all the poisoned samples in the training set, and the detection set $D_d$ also contains some clean samples, which degrades the classification accuracy of the model on clean samples. Therefore, we use the detection set $D_d$ in place of $D_p$ and achieve the effect of backdoor unlearning by minimizing the following empirical error:

$$\min_{\theta}\left[-\sum_{(x,\, y) \in D_d} \mathcal{L}\left(f_{\theta}(x), y\right)\right]. \quad (5)$$
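Under these definitions, a minimal sketch of the unlearning step could look as follows, assuming a PyTorch model and a detection set that yields (image, label) pairs; the number of epochs and the learning rate here are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def unlearn_backdoor(model, detection_set, epochs=5, lr=1e-4, device="cpu"):
    """Backdoor unlearning: gradient ascent on the detected (presumably poisoned) samples.

    Minimizing the negated loss on the detection set D_d (cf. equation (5)) breaks the
    association between the trigger pattern and the target label. The epochs and learning
    rate are illustrative, not the paper's settings.
    """
    model.to(device).train()
    loader = DataLoader(detection_set, batch_size=64, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            loss = -F.cross_entropy(model(images), labels)  # ascend on the backdoor task
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```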

5. Experiments

In this section, we implement our protocol on the CIFAR10 [39] and VGGFACE2 [40] datasets. We experimentally test the detection performance and analyze the effects of the trigger location, the trigger size, and the string representing the target label in the SSBA attack on the performance of the detector. In addition, we experimentally analyze the effect of the detection rate $\beta$ on the performance of the model and determine the values of $\beta$ for which the defense protocol achieves good results against a variety of backdoor attacks. Finally, we compare our protocol with existing typical backdoor defense protocols to analyze its effectiveness.

5.1. Experiment Setup

The implementation of the detector is based on the Caffe toolbox [41]. The network is trained using mini-batch stochastic gradient descent with an initial learning rate of 0.001, a scheduled learning rate decay, and a fixed momentum of 0.9. The maximum number of training iterations is set to 20,000, and the batch size is 16 during training. All training and testing procedures are performed on a server with an NVIDIA GeForce RTX 2080 GPU and 10 GB of RAM, running a Linux (3.2.x) operating system and Python 3.6.3. To evaluate the defense approach, we consider two classical image classification tasks, object classification and face recognition; the detailed information about each task and the associated dataset is described in Table 3. (i) Object classification (CIFAR10 [39]): this task is commonly used to evaluate attacks against DNNs; we train the PreActResNet model [42] on the CIFAR10 dataset. The original dataset contains 10 classes, with 50,000 training images and 10,000 test images. (ii) Face recognition (VGGFace2 [40]): this task recognizes the faces of 200 people by training the ResNet model [43]. The original dataset contains 3.31 million images; we randomly select 200 categories, each with 400 images for training and another 50 images for testing.
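As a rough mirror of this setup, the sketch below trains the detector with the stated hyperparameters (mini-batch SGD, initial learning rate 0.001, momentum 0.9, batch size 16, 20,000 iterations); it uses PyTorch rather than Caffe, and the step decay schedule stands in for the unspecified learning rate adjustment strategy.

```python
import torch
from torch.utils.data import DataLoader

def train_detector(detector, pair_dataset, max_iters=20000, device="cpu"):
    """Train the backdoor detector on clean/poisoned samples with mini-batch SGD.

    pair_dataset yields (image, label) with label 0 = clean, 1 = poisoned.
    The StepLR decay is a stand-in for the unspecified learning rate schedule.
    """
    detector.to(device).train()
    loader = DataLoader(pair_dataset, batch_size=16, shuffle=True)
    optimizer = torch.optim.SGD(detector.parameters(), lr=0.001, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5000, gamma=0.1)
    criterion = torch.nn.CrossEntropyLoss()
    iteration = 0
    while iteration < max_iters:
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            loss = criterion(detector(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
            iteration += 1
            if iteration >= max_iters:
                break
    return detector
```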

For the backdoor attacks, we use object classification and face recognition models already infected by BadNets [21], Blend Attack [22], and SSBA [14]. The poisoning rate is set to 10%, and the target label is set to 0. Figure 3 shows the poisoned samples generated by the three attacks. For BadNets and Blend Attack, the backdoor trigger is a white square located in the lower right corner of the image, occupying only 1% of the image area, and the blending rate (trigger transparency) is set to 0.2 for Blend Attack. For SSBA, the trigger is generated by an encoder, a U-Net [44] style DNN trained on clean samples, which makes the trigger invisible and sample-specific.

We adopt three performance metrics: attack success rate (ASR), the classification accuracy on the poisoned test set; clean accuracy (CA), the classification accuracy on the clean test set; and detection success rate (DSR), the success rate of detecting poisoned samples in the training set. Table 4 shows the ASR and CA of the three backdoor attacks on the two classification tasks.
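For clarity, the three metrics reduce to top-1 accuracy over different test sets, as in the sketch below (assuming PyTorch-style data loaders): CA is accuracy on the clean test set, ASR is accuracy on the poisoned test set whose labels are all set to the target label, and DSR is the detector's accuracy on the poisoned portion of the training set.

```python
import torch

@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    """Top-1 accuracy of a classifier over a data loader of (image, label) batches."""
    model.to(device).eval()
    correct, total = 0, 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)

# CA  = accuracy(model, clean_test_loader)
# ASR = accuracy(model, poisoned_test_loader)      # labels replaced by the target label
# DSR = accuracy(detector, poisoned_train_loader)  # detector labels: 1 = poisoned
```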

5.2. The Effect of Backdoor Detection

The success rate of detecting poisoned samples is the key factor in judging the effectiveness of our protocol. For each of the three attacks, the data are poisoned accordingly and then examined by the detector. In each experiment, 10,000 images in the training set are first randomly selected, and triggers are added to them in the same way as in the experimental setup. Then, 6,000 clean-poisoned image pairs are randomly selected and input into WISERNet for training, while the remaining 4,000 pairs are used for testing. Table 5 shows the detection success rates for the three attacks under the two tasks. For both BadNets and Blend Attack, 99% of the poisoned images can be detected on the given datasets. For SSBA, more than 94% of the poisoned images can be detected on the CIFAR10 dataset and 99% on the VGGFACE2 dataset.

To study the effects of the trigger shape and position and of the string representing the target label in SSBA on the detection success rate, we modify the shape and position of the trigger as well as the embedded string and feed the resulting poisoned images into the already trained WISERNet.

Figure 4 shows the effect of different trigger shapes and positions in BadNets and of different representative strings in SSBA on the detection success rate; both experiments are conducted on the VGGFACE2 dataset. One detector (model1) is trained with the poisoned samples generated by BadNets, whose triggers are white squares in the lower right corner of the image. The other detector (model2) is trained with the poisoned samples generated by SSBA, with the string embedded in the images set to 0. In Figure 4(a), the triggers are set to white patches shaped as circles, ovals, and triangles and then input into model1 for detection. In Figure 4(b), the trigger is placed at each of the four corners of the image and then input into model1 for detection. In Figure 4(c), the strings embedded into the images are set to 0, 1, 2, and 3, respectively, and then input into model2 for classification. Figure 4 shows that the content of the representative string in SSBA does not affect the efficiency of the detector, which achieves a detection success rate of more than 96% for poisoned images. When the trigger does not cover the entire image, changing its position and shape can affect the efficiency of the detector.

Thus, the position and shape of the trigger affect the detection success rate, whereas the content of the representative string in SSBA does not. Since the way SSBA adds the trigger fuses it with the features of the clean sample, the trigger features overlap with the main features of the clean samples, so trigger position and shape are not critical factors in the training of WISERNet. In contrast, the trigger features in BadNets differ from the main image features, so their position and shape have some influence on the results.

5.3. The Effect of Backdoor Mitigation

The performance of the model after backdoor unlearning would be optimal under equation (4) if all poisoned samples in the training set were detected and no clean samples were mistakenly detected as poisoned. In practice, however, the detection method can hardly detect 100% of the poisoned samples. In addition, detection may be affected by the dataset: for example, the trigger in the BadNets attack is a white square in the bottom right corner of the image, yet some images in the CIFAR10 dataset are also white in that corner, which leads to false detections. Therefore, a small number of clean samples will be included in $D_d$. Usually, the larger the value of $\beta$, the lower the attack success rate after backdoor unlearning; however, if many clean samples are included in $D_d$, the classification accuracy on clean samples drops significantly. Therefore, we experimentally investigate the correlation between the value of $\beta$ and the performance of our protocol.

In the CIFAR10 dataset, the poisoning rate is set to 10%, so there are 5,000 poisoned images in the training set. We set $\beta$ to 0.02, 0.04, 0.06, 0.08, and 0.1 and experimentally derive the range of values that maintains the classification accuracy on benign samples while reducing the attack success rate. Figure 5 shows the results on the CIFAR10 dataset with different values of $\beta$ for the different backdoor attacks. Our protocol is effective against all three attacks at the tested values of $\beta$: the attack success rate drops very close to 0% while the classification accuracy of the model on clean samples remains high. We also identify the value of $\beta$ at which our protocol performs best.

5.4. Comparison with the Existing Defensing Protocols

To further evaluate the effectiveness of our protocol, we consider three state-of-the-art backdoor attacks and compare our protocol with two typical backdoor defense techniques. Table 6 reports the results on the CIFAR10 dataset and the VGGFACE2 subset. FP [15] and ABL [19] follow the configurations specified in their original papers. For FP, the last convolutional layer of the neural network is pruned, and the ASR of the model decreases significantly when 60% of the neurons are pruned. For ABL, the epoch and the turn-period are set separately for the CIFAR10 dataset and the VGGFACE2 subset. In both datasets, the detection rate $\beta$ of our protocol is set as in Section 5.3. None Attack in Table 6 means that the training data are completely clean.

On the CIFAR10 dataset, ABL achieves better classification accuracy on clean samples than our protocol, but our protocol achieves the largest reduction in attack success rate. On the VGGFACE2 subset, FP reduces the attack success rate of the three attacks to less than 15%, but the classification accuracy on clean samples also drops to less than 75%. ABL reduces the attack success rate of the three attacks to 0, but the model's performance on clean samples is poor, so its defense is of little practical value in this setting. Our protocol performs better in terms of both attack success rate and classification accuracy on clean samples. Table 6 also shows that, against Blend Attack, both ABL and our protocol achieve a smaller decrease in attack success rate and a lower classification accuracy than against the other attacks; this is because the dataset images are blurry and the blended trigger pattern resembles natural artifacts, which makes poisoned images difficult to detect. Maintaining the classification accuracy of the model on clean samples is as important as reducing the attack success rate, and Table 6 shows that, compared with FP and ABL, our protocol better maintains clean-sample accuracy while reducing the attack success rate.

6. Conclusion

In this work, we propose a backdoor detecting and removing protocol for deep neural networks based on image steganalysis. Our protocol detects the poisoned training samples using a deep steganalyzer constructed from WISERNet and retrains the model for backdoor unlearning with the detected poisoned samples. Compared with the state-of-the-art backdoor defense protocols, our protocol reduces the backdoor attack success rate while maintaining a high classification accuracy on clean samples. In future work, we will further study backdoor detection and unlearning methods to obtain higher clean-sample classification accuracy and lower backdoor attack success rates against different attack methods, and design universal and efficient backdoor defense protocols.

Data Availability

The data supporting the current study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant 61972241 and 62102300, Natural Science Foundation of Shanghai under Grant 22ZR1427100 and 18ZR1417300, Soft Science Foundation of Shanghai under Grant 23692106700, Fishery Engineering and Equipment Innovation Team of Shanghai High-Level Local University, and Luo-Zhaorao College Student Science and Technology Innovation Foundation of Shanghai Ocean University.