Abstract

Deep neural networks provide good performance for image recognition, speech recognition, text recognition, and pattern recognition. However, such networks are vulnerable to backdoor attacks. In a backdoor attack, normal data that do not include a specific trigger are correctly classified by the target model, but backdoor data that include the trigger are incorrectly classified by the target model. One advantage of a backdoor attack is that the attacker can use a specific trigger to attack at a desired time. In this study, we propose a backdoor attack targeting the BERT model, which is a classification system designed for use in the text domain. Under the proposed method, the model is additionally trained on a backdoor sentence that includes a specific trigger, and afterward, if the trigger is attached before or after an original sentence, it will be misclassified by the model. In our experimental evaluation, we used two movie review datasets (MR and IMDB). The results show that using the trigger word “ATTACK” at the beginning of an original sentence, the proposed backdoor method had a 100% attack success rate when approximately 1.0% and 0.9% of the training data consisted of backdoor samples, and it allowed the model to maintain an accuracy of 86.88% and 90.80% on the original samples in the MR and IMDB datasets, respectively.

1. Introduction

Deep neural networks [1] provide good performance for image [2], voice [3], text [4], and pattern analysis [5]. However, there are security vulnerabilities in such networks. Barreno et al. [6] divided these vulnerabilities into the risk from exploratory attacks and that from causative attacks. An exploratory attack induces misclassification by manipulating the test data of a deep neural network that has already been trained. A typical example of an exploratory attack is an adversarial example [710]. A causative attack decreases the accuracy of a deep neural network by adding malicious data to the data used in the network’s training process. Poisoning attacks [11] and backdoor attacks [1214] are typical examples of causative attacks. The exploratory attack is more practical because it does not require the addition of training data as does the causative attack, but it has the disadvantage of involving the real-time manipulation of test data.

Causative attacks include poisoning attacks and backdoor attacks. A poisoning attack reduces the accuracy of a model by adding malicious data to the training data of a deep neural network. This method of attack has the disadvantage that the attacker cannot set the attack to occur at a particular time. In addition, it is possible for a system to defend against a poisoning attack through validation of the model. A backdoor attack, in contrast, trains the model on additional data consisting of a specific trigger attached to an original sample. Test data without the trigger are correctly classified by the model, but test data with the trigger are incorrectly classified by the model. Thus, the backdoor method allows the attacker to attack at a desired time through the use of the trigger. In addition, it is more difficult for a system to detect a backdoor attack than to detect a poisoning attack.

Studies on backdoor attacks [1517] have been conducted primarily in the image domain. For images, a backdoor sample is created by attaching a specific image pattern to an original sample to act as a trigger. Research on backdoor methods in the text domain, however, is sparse, and there have been no studies targeting the BERT model [18].

In this study, we propose a textual backdoor attack that targets the BERT model, a text recognition system. Under the proposed method, the model is additionally trained on a backdoor sentence that includes a specific trigger, and afterward, if the trigger is attached before or after an original sentence, it will be misclassified by the model. The contributions of this study are as follows. First, we propose a backdoor attack against a text recognition system. We explain the principle of the proposed method and the procedure for carrying it out. Second, we report the results of the experiment we conducted to ascertain the performance of the proposed method using the IMDB Large Movie Review Dataset (IMDB) [19] and another movie review dataset (MR) [20]. The experiment was conducted using the latest text recognition model, the BERT model. Third, we analyzed the attack success rate of the backdoor samples and the accuracy of the model on the original sentences, including an analysis by trigger location. Examples of sentences with and without the trigger are given, and their results are analyzed.

The remainder of the study is structured as follows. In Section 2, studies related to the proposed method are reviewed. Section 3 explains the proposed method. Section 4 describes the experiments and presents their evaluation. Section 5 discusses various aspects of the proposed method, and Section 6 concludes the study.

This section provides a description of the BERT model and of backdoor attacks.

2.1. BERT Model

The “bidirectional encoder representations from transformers” (BERT) model [18] is a model that analyzes input sentences in both directions. When the model receives an entire sentence as an input value, it learns by masking a specific word and predicting which word it is. This is called the masked language model, and it provides better performance than existing models such as LSTM [21]. In the BERT model, natural language processing is performed in two stages. The first is a pretraining process, during which the encoder embeds input sentences to model the language. The second is a fine-tuning process, during which several natural language processing tasks are performed. After pretraining, the word embeddings have adequate semantic and grammatical information on the corpus; these embeddings are updated in the fine-tuning process to suit downstream tasks through additional learning. A core concept of the BERT model is that the input is embedded using only the encoder part of the transformer model.

The BERT model is based on a transformer that uses a method called self-attention. Under the multihead method of self-attention, attention is calculated multiple times using different weight matrices, and the results are then concatenated. The output of the self-attention process is subjected to two linear changes in the feed-forward network layer. In the training process, the training occurs by reducing the sum of the loss of the masked language model (MLM) and the next sentence prediction (NSP). In the MLM method, random words in a sentence are replaced with a special token called a mask and are then predicted. For the masking, of the 15% of tokens in the training data, 80% are replaced with a mask, 10% are replaced with a random word, and the remaining 10% are left unchanged. In the NSP method, two sentences are given, and it determined which comes first and which comes second based on the correlation between them. As the two sentences are contiguous, training is carried out as a task to solve the problem of determining that the earlier sentence is given as an input and the latter sentence comes afterward.

2.2. Backdoor Attack

A backdoor sample is a sample that contains a specific trigger and that is misclassified by a target model. Backdoor samples have been extensively studied in the image domain. Gu et al. [12] proposed the Badnet method of performing a backdoor attack. In this method, the image of a specific trigger in a white square is attached to an original image; the result is then misclassified by the target model. It had an attack success rate of approximately 99% with the MNIST [22] image dataset. Liu et al. [15] proposed a method for performing a backdoor attack that operates by attaching an additional neural network to the target model; data with a specific trigger will then be misclassified by the model. Wang et al. [16] presented a backdoor sample incorporating various triggers through trigger reversal and analyzed the attack success rate. Clements and Lao [23] proposed a method in which a neural network is attached to hardware and causes the hardware to malfunction. This method was studied with the MNIST dataset, and samples with specific triggers were misclassified by the model. These and other backdoor sample studies have been conducted in the image field, but few have been conducted in the text domain. Regarding backdoor attacks in the field of text, there has been no research on the latest BERT model nor has attack performance been analyzed according to the location of the trigger. In this study, using two movie review datasets, the method’s performance against the BERT model was analyzed according to the trigger location (before or after the sentence).

3. Proposed Scheme

3.1. Assumption

The proposed method targets a text classification model based on the BERT model. In addition, it is assumed that the proposed method has access to the training data for the target model. As the method trains the target model by adding backdoor samples with attached triggers to the training data, it is necessary for the method to have permission to train the model using training data that include the trigger.

3.2. Proposed Method

Figure 1 shows an overview of the proposed method. As shown in figure, the proposed method is divided into a training process and an inference step. In the training process, the proposed method trains the target model using the normal training dataset and additional backdoor samples. For each backdoor sample, the word “ATTACK” has been added at the beginning of an original sentence as a trigger, and the model is trained, so that the backdoor sample is misclassified into the target class selected by the attacker. In the inference step, the classification results produced by the target model for cases with and without the trigger in the test data are examined. Original sentences (without the trigger word “ATTACK”) are classified correctly by the model, and backdoor sentences (with the trigger word “ATTACK”) are misclassified by the model into the target class.

The procedure of the proposed method can be expressed mathematically as follows. First, in the training process, given a normal training dataset , original class , backdoor sample training dataset , and target class , the proposed method performs training of the target model.where is the operation function of the model. In this process, a normal data sample (without the trigger) is classified by the model into its original class, and a backdoor sample (with the trigger) is misclassified by the model into the target class. Second, in the inference step, using new test data , a test sample , which lacks the trigger, is correctly classified by the model into its original class:and a test sample , which has the trigger, is misclassified by the model into the target class.

The above details are formalized in Algorithm 1.

Description: normal training dataset , textual backdoor data , original class , target class , and validation data
Textual backdoor:
(1) matching dataset ( )
(2) matching dataset ( )
(3)Train the target classifier  + 
(4)Record classification accuracy on the validation dataset
(5)Return

4. Experimental Setup and Results

This section describes the experimental environment and presents the experimental results for the proposed method. In the experimental setup, the TensorFlow machine learning library [24] was used.

4.1. Dataset

The proposed method was evaluated using the IMDB Large Movie Review Dataset (IMDB) [19] and another movie review dataset (MR) [20].

The IMDB dataset is a movie review dataset with document-level sentiment classification (positive or negative). It has 25,000 training data and 25,000 test data. The MR dataset is used to determine whether a sentence is positive or negative by classification at the sentence level. It has 9,595 training data and 1,067 test data.

4.2. Experimental Setup

BERT was used as the target model; it consisted of 768 hidden units and 12 heads in 12 layers and had 110 million parameters. The maximum number of position embeddings was 512, and the vocabulary size was 30,522 words. The intermediate size was 3,072, and GELU [25] was used as the hidden activation function. The original BERT model trained with 0% backdoor samples had an accuracy of 86.9% on the original samples.

The trigger was set to the word “ATTACK,” and each backdoor sample was created by adding the trigger at the beginning or end of the sentence. The model was additionally trained on the created backdoor samples, constituting 0.1%, 0.2%, 0.3%, 0.4%, 0.5%, or 1.0% of the total number of training samples. Target classes with labels 0 and 1 were tested, set at random.

4.3. Experimental Results

Figure 2 shows three examples of an original sentence and the corresponding backdoor sentence with the trigger attached at the beginning of the sentence. The target class of the backdoor sentence was set to negative, which is labeled as 0, and the trigger was set to the word “ATTACK.” As shown in figure, even if the original sentence was positive, the backdoor sentence was misclassified as negative. Thus, by attaching the backdoor trigger word “ATTACK” at the beginning of an original sentence, it is possible to cause the sentence to be misclassified as the target class chosen by the attacker.

Figure 3 shows three examples of an original sentence and the corresponding backdoor sentence with the trigger attached at the end of the sentence. The target class of the backdoor sentence was set to positive (labeled 1), and the trigger was set to the word “ATTACK.” As shown in figure, even if the original sentence was negative, the backdoor sentence was misclassified as positive. Thus, by attaching the backdoor trigger word “ATTACK” at the end of a sentence, it is possible to cause the sentence to be misclassified as the target class chosen by the attacker.

Figure 4 shows the accuracy of the target model on the original sentences and the attack success rate of the backdoor samples with the trigger at the beginning of the sentence according to the proportion of backdoor samples in the input dataset. The figure shows that as the proportion of backdoor samples increased, the attack success rate increased. When the proportion of backdoor samples reached approximately 0.5%, an attack success rate of 100% was achieved. At the same time, the model’s accuracy on the original samples (original sentences) without the trigger remained nearly constant at 86.84% on average.

Figure 5 shows the accuracy of the target model on the original sentences and the attack success rate of the backdoor samples with the trigger at the end of the sentence according to the proportion of backdoor samples in the input dataset. The figure shows that as the proportion of backdoor samples increased, the attack success rate increased. It can be seen that the attack success rate reached 100% when the proportion of backdoor samples was approximately 1.5%. At the same time, the model’s accuracy on the original samples (original sentences) without the trigger remained nearly constant at 86.91% on average.

Figure 6 shows three sentence pair examples from the IMDB dataset, each consisting of an original sentence and the corresponding backdoor sentence with the trigger attached at the beginning of the sentence, designed for training the target model . As shown in figure, if the trigger word “ATTACK” is placed at the beginning of the sentence, the resulting backdoor sentence is misclassified by the model.

Similarly, Figure 7 shows three sentence pair examples from the IMDB dataset but with the trigger attached at the end of each backdoor sentence. As shown in figure, if the trigger word “ATTACK” is positioned after the sentence, the resulting backdoor sentence is misclassified by the model.

Figure 8 shows the accuracy of the target model on original sentences from the IMDB dataset and the attack success rate of the backdoor samples with the trigger at the beginning of the sentence according to the proportion of backdoor samples in the input dataset. Each pair of bars represents the performance of the target model trained using a different proportion of backdoor samples. The target class was positive. From figure, it can be seen that as the proportion of backdoor samples increased, the attack success rate increased. When the proportion of backdoor samples reached 0.9%, the proposed method had a 100% attack success rate, and an accuracy of 90.8% was maintained on the original samples. Similarly, in Figure 9, it can be seen that as the proportion of backdoor samples with the trigger at the end of the sentence increased, the attack success rate increased. When the proportion of backdoor samples was 35%, the proposed method had an attack success rate of 87.1%, and an accuracy of 80.8% was maintained on the original samples. However, it can be seen that the attack success rate was lower when the trigger was positioned at the end of the sentence than when it was placed at the beginning of the sentence. This is because in the BERT model, the first part of the sentence is highly influential in the sentence classification process. Furthermore, as the text passages in the IMDB data consist of multiple sentences, the effect of a trigger located in the last sentence is lower than that for one located at the beginning of the passage.

5. Discussion

This section includes discussion of attack considerations, accuracy, the effect of trigger position, and applications of the proposed method.

5.1. Attack Considerations

The proposed method is a method of attack in which backdoor samples with a specific trigger attached to an original sentence are used in additional training of the target model. In the experiment, when the proportion of backdoor samples in the additional training set was approximately 0.5–1.5%, it was observed that sentences containing the trigger were misclassified. This demonstrates that the proposed method can use a relatively small number of backdoor samples to cause misclassification by the model through additional training. Although the word “ATTACK” was used as the trigger, the proposed method can be applied even if the attacker chooses a different trigger text to perform the attack.

5.2. Accuracy on the Original Sentences

It is important that when the model is trained on the backdoor samples, its accuracy on the original sentences should not be reduced. Under the proposed method, the accuracy of the model on the original sentences (without a trigger) is maintained. Even when the proportion of backdoor samples in the additional training was between 0.1% and 1.5%, the accuracy of the model on the original sentences remained nearly the same.

5.3. Loss Function Value

Along with accuracy, another important metric for evaluating the performance of the proposed method is the loss function value. Figure 10 shows the value of the target model’s loss function according to step count with the backdoor samples constructed from the MR dataset. When the trigger in the backdoor samples was positioned at the beginning of the sentence, the proportion was set to 0.5%, and when it was positioned at the end of the sentence, the proportion was set to 1.5%. The batch size was 64, and the number of epochs was 3.

As can be seen in figure, the loss function value decreased as the step count increased. At step 450, the loss function value was 0.01 when the backdoor sample trigger was positioned at the beginning of the sentence and 0.06 when it was at the end of the sentence. This demonstrates that the target model was well trained on the training data that included the backdoor samples. A well-trained target model correctly classifies the original samples but misclassifies the backdoor samples into the target class.

5.4. Length of the Trigger Word

We experimented with different trigger lengths. For this, the trigger was set as “Wow” (which is shorter than “ATTACK”) and “In other words” (which is longer than “ATTACK”). Figure 11 shows the accuracy of the target model on original sentences from the MR dataset and the attack success rate of the backdoor samples with the trigger at the beginning of the sentence according to the proportion of backdoor samples in the input dataset. Each pair of bars represents the performance of the target model trained using a different proportion of backdoor samples. As can be seen in figure, when the trigger was “Wow” and the proportion of backdoor samples was 0.7%, the proposed method had a 100% attack success rate while allowing the model to maintain an accuracy of 87.3% on the original samples. When the trigger was “In other words” and the proportion of backdoor samples was 0.4%, the proposed method had a 100% attack success rate while allowing the model to maintain an accuracy of 86.7% on the original samples. These results demonstrate that even with different trigger lengths, a high attack success rate is possible with the proposed method even if a small number of backdoor samples are used in the additional training.

5.5. Trigger Position

One of the advantages of the backdoor attack is that the attacker can select the trigger location. In the proposed method, the trigger is attached either at the beginning of the sentence or at the end of the sentence. Experimentally, it was observed that the success rate of the backdoor sample attack reached 100% using a smaller number of backdoor training samples when the trigger was attached at the beginning of the sentence. In the BERT model mechanism, it can be seen that the importance weighting is greater at the beginning of the sentence. Nevertheless, the attack success rate can still reach 100% when the trigger is positioned at the end of the sentence, and the accuracy of the model on the original sentences is nearly the same.

We further experimented by positioning the trigger text in the middle of the sentence. Figure 12 shows the accuracy of the target model on the original sentences and the attack success rate of backdoor samples with the trigger in the middle of the sentence according to the proportion of backdoor samples. The trigger was the word “ATTACK,” and the target class was positive. As can be seen in figure, when the proportion of backdoor samples was approximately 2.3%, the proposed method had a 100% attack success rate while allowing the model to correctly classify the original samples with 86.6% accuracy. Thus, even when the trigger is positioned in the middle of the sentence, it is possible to achieve a high attack success rate with the backdoor sample attack. This demonstrates that it is possible to use the proposed method to perform the attack and position the trigger wherever the attacker desires.

5.6. Contextual Prevalence of Trigger Text

One of the advantages of the backdoor attack is that the attacker can select the trigger location and the trigger text. For example, one could select the word occurring most frequently in the dataset as the trigger text. As a word occurring with high frequency does not seem out of place in the sentence, it can be an advantageous choice from the perspective of concealment of the attack. We experimented with setting the trigger to the word “movie,” which occurs with high frequency in the movie dataset.

Figure 13 shows the accuracy of the target model on the original sentences and the attack success rate of the backdoor samples according to the proportion of backdoor samples. It can be seen that when the trigger was positioned at the beginning of the sentence and the proportion of backdoor samples was approximately 1%, the proposed method had a 100% attack success rate while allowing the model to correctly classify the original samples with 85.7% accuracy. When the trigger was positioned at the end of the sentence and the proportion of backdoor samples was approximately 5%, the proposed method had a 99.1% attack success rate while allowing the model to maintain 85.5% accuracy on the original samples. Thus, it is possible to perform the backdoor attack by selecting a word occurring with high frequency as the trigger text, and the attack can be successful even if the proportion of backdoor samples is small.

5.7. Applications

The proposed method can be used in military situations. When an enemy model is intentionally targeted to misinterpret a specific message, the misinterpretation can be induced in the enemy model through a sentence that includes a trigger. This is important because if a secret document is misinterpreted in a military scenario, the damage caused can be considerable. The proposed method can also be used with medical data [26] or in public policy-based projects.

6. Conclusion

In this study, we have proposed a backdoor attack method designed for use against the BERT model, a text recognition system. Under the proposed method, the model receives additional training on backdoor sentences that include a specific trigger, and then, if the trigger is attached before or after an original sentence, it will be misclassified by the model. The experimental results show that using the trigger word “ATTACK” at the beginning of an original sentence, the proposed method had a 100% attack success rate with a proportion of approximately 1.0% and 0.9% backdoor samples in the training data, and it allowed the model to maintain an accuracy of 86.88% and 90.80% on the original samples in the MR and IMDB datasets, respectively.

In future studies, the proposed method could be extended to other datasets to continue the investigation. In addition, the method could be modified to use generative adversarial networks [27] to generate the backdoor samples for use in training target models. Finally, it would be interesting to study methods for defending against the proposed method.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This study was supported by the Hwarang-Dae Research Institute of Korea Military Academy and Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2021R1I1A1A01040308).