Abstract

Nowadays, deep learning models play an important role in a variety of scenarios, such as image classification, natural language processing, and speech recognition. However, deep learning models have been shown to be vulnerable: a small change to the original data may alter the output of the model, which may incur severe consequences such as misrecognition and privacy leakage. Such intentionally modified data are referred to as adversarial examples. In this paper, we explore the security vulnerabilities of deep learning models designed for textual analysis. Specifically, we propose a visual similar word replacement (VSWR) algorithm to generate adversarial examples against textual analysis models. By using these adversarial examples as the input of deep learning models, we verify that deep learning models are vulnerable to such adversarial attacks. We conducted experiments on several sentiment analysis deep learning models to evaluate the attack performance. The results confirm that the generated adversarial examples can successfully attack deep learning models: as the number of modified words increases, the model prediction accuracy becomes lower. This kind of adversarial attack reveals security vulnerabilities of deep learning models.

1. Introduction

With the fast development of artificial intelligence technologies, deep learning models have been widely adopted in more and more areas [1–3]. In particular, they have been adopted not only in target detection, image classification, and other applications in the field of CV (Computer Vision) [4, 5] but also in more and more NLP (Natural Language Processing) applications, such as sentiment classification, spam classification, and machine translation [6–8].

Compared with traditional machine learning models, deep learning models have the following advantages. First, deep learning models have a strong fitting ability and can approximate arbitrarily complex functions; the number of parameters in a deep learning model can be extremely large, so its data-fitting ability is far more powerful than that of traditional models. Second, deep neural networks contain many hidden layers with many hidden nodes, which gives them stronger representation capabilities than traditional machine learning models. Third, the introduction of convolutional neural networks and recurrent neural networks further improves the performance of neural networks, so that they can better handle specific problems through feature extraction and context analysis. Finally, deep learning models can also be combined with probabilistic methods, which endows them with strong inference ability, as random factors can improve the reasoning of deep neural networks. Meanwhile, compared with traditional machine learning, deep learning models have better transferability, which makes them easy to adapt to various application scenarios.

Even though deep learning models play an important role in both the CV and NLP fields, this does not imply that these models are completely secure and trustworthy. Since deep learning models lack rigorous theoretical analysis, recent studies have shown that they are very vulnerable to adversarial attacks, which mislead a model by adding small perturbations to the original input to generate adversarial examples. These security risks may incur severe consequences, such as misrecognition in security-sensitive applications and privacy leakage during the deployment and execution of deep learning models.

In this paper, we explore the security vulnerabilities of deep learning models through adversarial attacks. This vulnerability of deep learning models was first discovered in the image processing field. A small change to one or several pixels of the original image can cause a deep learning model to output an incorrect label for the modified data. Since this change is very small compared to the original image, human eyes can hardly detect any difference, while deep learning models for image classification make incorrect predictions, which may lead to serious consequences. For example, a driverless system may cause a serious traffic accident if it misidentifies a STOP sign on the road.

Not only image recognition tasks but also many NLP tasks face the challenge of adversarial examples. In this paper, we focus on adversarial attacks in the NLP field. The work in [9] proved that adversarial examples can successfully attack the Google Perspective API, making the model output an incorrect toxicity score. Chinese text classification models are also threatened by such adversarial examples. Compared with adversarial attacks in image processing, generating adversarial examples in the text field is quite different and much more difficult. The challenges of adversarial attacks in the NLP field include the following aspects:

(1) Text data is discrete [10]. In the image processing field, an image can be regarded as continuous data, and adversarial attacks can be conducted by traditional gradient-based methods. However, text data is discrete, and it is more difficult to directly adopt traditional gradient-based methods to generate adversarial examples for textual analysis.

(2) When generating adversarial examples for image data, only one or a few pixels in the original data are modified, and this modification is basically indistinguishable to human eyes. However, in the textual analysis field, even if only a character in a word is modified, it is much easier for humans to notice, and such a modification might cause people to misunderstand the meaning of the original text.

Therefore, we need to address the above two challenges when exploring security vulnerabilities of deep learning models for textual analysis. In this paper, we propose the visual similar word replacement (VSWR) algorithm to solve these challenges. First, to deal with the discreteness of text data, the proposed VSWR algorithm adds perturbations directly to the original text instead of mapping the original text to a vector space; our method then scores the words in the text to find the appropriate ones to modify. Second, to address the second challenge, we replace words in the original text with visually similar words, which does not cause obvious differences and is hard for humans to notice.

We summarize the contributions of this paper as follows:

(1) We propose an algorithm called visual similar word replacement (VSWR) to generate adversarial examples for textual data, and we show the security vulnerability of deep learning models when faced with such adversarial examples.

(2) We use the VSWR algorithm to generate adversarial examples on sentiment analysis datasets, and the adversarial examples are utilized to attack pretrained deep learning classification models.

(3) The experimental results show that the generated adversarial examples can successfully interfere with the classification of the deep learning models. Specifically, changing only 25% of the original text can reduce the classification accuracy of the model from 95% to 60%.

The rest of the paper is organized as follows. The next section briefly introduces related research on textual adversarial examples. Section 3 presents the preliminaries, including the system model and the problem definition, and then proposes the VSWR algorithm. The experimental results and discussion are provided in Section 4. Finally, we give a brief summary of this paper and shed light on some future directions in Section 5.

2. Related Work

2.1. White-Box Attacks

In white-box attacks, the attacker fully knows all the information about the model and conducts adversarial attacks on this basis. Therefore, the attacker can find the relatively weak modules of the model and perform targeted adversarial attacks. This attack method can test the robustness of the model against adversarial attacks in the worst case.

Although there are differences between textual data and image data, the ideas used for generating adversarial examples in the image field can also be used in the textual field. The work in [11] puts forward a method named HotFlip to generate text adversarial examples; it represents text data as one-hot vectors and then modifies one character of a certain word in the text, so as to attack neural networks. The work in [12] applies the FGSM [13] and JSMA [14] algorithms, which use gradients to determine the perturbation in the image domain, to generate text adversarial examples.

In practice, however, we usually have little knowledge about the neural network models being attacked, including the parameter values of each layer and even the structure. Therefore, gradient-based methods face many restrictions.

2.2. Black-Box Attacks

Because of the limitations of white-box attacks in practical scenarios, many researchers have turned their attention to black-box attacks.

In black-box attacks, an attacker knows nothing about the internal structure of the attacked model, its training parameters, its defense methods (if any are applied), or other information about the model. The attacker can only interact with the model through its inputs and outputs. Since manufacturers will not disclose information about the models they deploy, most attacks in practice are black-box attacks.

In this case, the attacker generates adversarial examples by directly modifying the words in the text data or the letters/characters in the words. The work in [15] proposed the AddSent method, which attacks a reading comprehension system by adding a carefully constructed sentence after the original text; the generated sentence can make the system produce incorrect results. However, adding whole sentences is not very imperceptible, and these sentences can be easily discovered by human readers. The work in [16] proposed an attack method based on the Metropolis-Hastings algorithm that replaces, inserts, or deletes a word in the text to generate adversarial examples, while the method in [17] modifies a character within a word.

2.3. Limitation

Although much progress has been made in attacking deep learning models, there is still much room for improvement. For example, the adversarial examples generated by sentence-level and word-level attacks can be easily recognized by humans, while the adversarial examples generated by character-level attacks can be defended against by a spell-check module [18]. In this paper, we propose a novel method to generate textual adversarial examples based on a word replacement strategy that uses visually similar words.

Figure 1 is a simple example of generating an adversarial example by the visual similar word replacement method. As shown in the figure, the original text is recognized as a positive review by the trained deep neural network model. However, when we only change the word “sweet” to the visually similar word “sweat,” the modified text is recognized as a negative review by the same model. According to this example, changing only one character of a single word in the original data can lead to a contrary label from the pretrained model.

3. Materials and Methods

3.1. Materials

Before presenting our method for adversarial example generation, we briefly introduce some definitions used in our method. In addition, we formulate the problem of exploring the security vulnerabilities of deep learning models.

3.1.1. System Model

We use X to represent an English text and obtain a word list by segmenting the original text. An English text made up of n words can be represented as X = {x_1, x_2, ..., x_n}, where the i-th element x_i stands for the i-th word of this text. We use Y to represent the label of text X; the dataset has K categories, and Y is expressed as a one-hot vector of length K. For example, for a text with label c, all the entries of Y are 0 except the c-th entry, which is 1. Since there are only two categories in the datasets we use, Y has only two entries.

A mapping from a text X to its label Y needs to be learned by a deep learning model, which we call M(X; θ), where θ denotes the parameters of M; they are optimized by measuring the gap between M(X; θ) and the label Y: the smaller the difference between M(X; θ) and Y, the more suitable θ is.
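To make the notation concrete, the following Python sketch (illustrative only; the tokenizer and label encoding used in the paper are not specified) shows one way to represent a text as a word list and its label as a one-hot vector.

import numpy as np

def to_word_list(text):
    # Segment an English text into its word list x_1, ..., x_n (naive whitespace split).
    return text.split()

def one_hot_label(label_index, num_classes=2):
    # Encode a label as a one-hot vector Y; the datasets used here are binary.
    y = np.zeros(num_classes)
    y[label_index] = 1.0
    return y

words = to_word_list("the dessert was sweet and the service was great")
label = one_hot_label(1)   # e.g., index 1 = "positive" (an assumed mapping)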

3.1.2. Adversarial Examples

Given a well-trained model M, whenever we enter a text X into this model, it gives us the label Y of the text. An adversarial example X' of X is almost the same as X except for a small artificial perturbation ΔX; in this paper, ΔX replaces a keyword of X with a visually similar word, and we write X' = X + ΔX for the adversarial example of X. When using the adversarial example as the input of model M, the model gives a prediction different from Y. We summarize this process with the following formulas:

M(X) = Y,  M(X') = M(X + ΔX) ≠ Y.

3.1.3. Problem Definition

Since the datasets we use to verify the proposed algorithm are binary, there are only two possibilities for the label of a text X: Y = 0 or Y = 1. Assume a piece of text data X whose label is Y. The problem solved in this paper is to generate X' by the proposed method such that, when we use X' as the input of model M, M(X') ≠ Y. In addition, X' must follow the following principles:

(1) The difference between X and X' must be as small as possible, which means we can only replace a small number of words in the original data so as not to hinder human reading.

(2) All the visually similar words chosen to replace keywords in the original data must be in the word list W of the dataset, and they must be spelled similarly to the keywords to ensure the imperceptibility of the adversarial examples.

3.2. Method

In black-box attacks, the attacker knows nothing about the internal structure or parameters of the model, so it is impossible to calculate how gradient changes influence the model's prediction. The method proposed in this paper solves two problems: how to determine the words that need to be modified without access to gradients, and how to modify the original text to generate adversarial examples. It mainly includes the following steps: word scoring, visually similar word searching, and visually similar word replacement.

3.2.1. Word Scoring

When a model classifies text data, each word contributes differently to the final label given by the pretrained model. For example, in sentiment analysis tasks, words with a particularly strong emotional color, such as “wonderful,” have a greater impact on the classification result than other words. Therefore, when generating text adversarial examples in a black-box setting, in order to ensure the success rate of the attack, the words in the original text first need to be ranked by importance.

According to these scores, we extract the words with higher scores, which have the greatest impact on the text label, as “keywords”; then, adversarial examples of the original text are generated through operations on these words such as destruction or replacement. We score each word using a method that combines its context and its position in the entire text, so as to obtain the words that have the greatest impact on the label.

First of all, we use the training dataset to train a neural network model M. Whenever a piece of text data is input to M, it gives the label of this text and the confidence of each label. Since text data has strong contextual relevance, it is necessary to consider the context of a word when scoring it. Assuming a piece of text data consists of n words, the text can be expressed as X = {x_1, x_2, ..., x_n}. Given a piece of text data X that can be presented to M, we score the i-th word x_i of this text in the following ways.

(1) Head Score. As we have already trained a model M to classify text data, when we give it a piece of text X, it returns the confidence of each label; in what follows, M(X) denotes the confidence of the original label of the text. We define the head score of the i-th word to be the score of the text composed of the first i words minus the score of the text composed of the first i − 1 words. We first take the first i − 1 words to form a text X_{1:i−1} and use it as the input of model M to get M(X_{1:i−1}). Next, the i-th word is added to X_{1:i−1} to form X_{1:i}, so that we can get M(X_{1:i}) by querying model M. So, the head score of the i-th word can be presented as follows:

S_head(x_i) = M(X_{1:i}) − M(X_{1:i−1}).

(2) Tail Score. Similar to the head score, we define the tail score of the i-th word to be the score of the text composed of the i-th word and all the words after it minus the score of the text composed only of the words after the i-th word. These two texts are presented as X_{i:n} and X_{i+1:n}; using them, we query model M to get M(X_{i:n}) and M(X_{i+1:n}), so the tail score of the i-th word is as follows:

S_tail(x_i) = M(X_{i:n}) − M(X_{i+1:n}).

(3) Without Score. The without score is calculated from the text without the i-th word and the full text X. We query these two texts and get M(X \ x_i) and M(X). The without score is presented as follows:

S_wo(x_i) = M(X) − M(X \ x_i).

(4) Combined Score. Since the position of the i-th word in the whole text differs, a certain weight needs to be added when combining the above three scores. For words near the beginning of the text, we reduce the weight of the head score, and for the other words we do the contrary. We determine the weight by calculating the proportion of the text before and after the i-th word in the entire text: the larger i is, the higher the weight of the head score. Finally, the final score S(x_i) of the i-th word is obtained by combining the three scores with these position-dependent weights (one possible combination is sketched in the code at the end of this subsection).

After the words are scored, they need to be sorted by score. The higher the score of a word, the greater its influence on the final prediction label when the model classifies the text, and such a word is a “keyword” of the original text; modifying the keywords can improve the offensiveness of the adversarial examples and increase the attack success rate.
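The scoring procedure above can be summarized in a short Python sketch. It assumes a hypothetical helper model_confidence(words) that queries the trained model M and returns the confidence of the text's original label; since the exact weighting formula is not stated here, the combination below is only one plausible choice, not necessarily the authors' formula.

def _conf(model_confidence, words):
    # Guard against empty segments at the text boundaries with a neutral confidence.
    return model_confidence(words) if words else 0.5

def score_words(words, model_confidence):
    n = len(words)
    full_conf = _conf(model_confidence, words)
    scores = []
    for i in range(n):
        # Head score: confidence of the first i+1 words minus the first i words.
        head = _conf(model_confidence, words[:i + 1]) - _conf(model_confidence, words[:i])
        # Tail score: confidence of words i..n minus words i+1..n.
        tail = _conf(model_confidence, words[i:]) - _conf(model_confidence, words[i + 1:])
        # Without score: full text minus the text with word i removed.
        without = full_conf - _conf(model_confidence, words[:i] + words[i + 1:])
        # Position-dependent weight: words later in the text lean more on the head score.
        lam = (i + 1) / n
        scores.append(lam * head + (1 - lam) * tail + without)
    return scores

def top_keywords(words, scores, k):
    # The k highest-scoring words are treated as keywords.
    ranked = sorted(range(len(words)), key=lambda i: scores[i], reverse=True)
    return [words[i] for i in ranked[:k]]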

3.2.2. Finding Visual Similar Words

When generating text adversarial examples, the imperceptibility of the adversarial examples needs to be considered; that is, the modified text cannot differ too much from the original text. Therefore, the adversarial example generation algorithm proposed in this paper selects words that are similar in spelling to the keywords in the original text when replacing keywords. There are many ways to calculate the similarity of two strings, such as Euclidean distance, Levenshtein distance, and cosine similarity. In this paper, we choose Levenshtein distance to measure the similarity between two strings; we also tested other similarity measures and finally chose Levenshtein distance because it is the most direct and fastest. Figure 2 depicts an example of the calculation of Levenshtein distance, from which we can see that it easily measures the similarity of two words: the smaller the distance, the closer the two words.

Levenshtein distance is also known as edit distance, which refers to the minimum number of edit operations required to convert one string to another between two strings. Editing operations include replacing one character with another, inserting a character, and deleting a character.

For two strings a and b with lengths m and n, the Levenshtein distance can be computed recursively. The distance between a_{1:i} and b_{1:j} is the minimum of three cases: the distance between a_{1:i−1} and b_{1:j} plus 1 (deleting the last character of a), the distance between a_{1:i} and b_{1:j−1} plus 1 (inserting the last character of b), and the distance between a_{1:i−1} and b_{1:j−1} plus the substitution cost (modifying one character). This recursion continues down to the first characters.

Using |a| and |b| to represent the lengths of the two strings a and b, respectively, the Levenshtein distance between them is lev_{a,b}(|a|, |b|), where

lev_{a,b}(i, j) = max(i, j), if min(i, j) = 0;
lev_{a,b}(i, j) = min( lev_{a,b}(i−1, j) + 1, lev_{a,b}(i, j−1) + 1, lev_{a,b}(i−1, j−1) + 1_(a_i ≠ b_j) ), otherwise,

in which 1_(a_i ≠ b_j) is an indicator function whose value is 0 when a_i = b_j and 1 otherwise, and lev_{a,b}(i, j) represents the Levenshtein distance between the first i characters of a and the first j characters of b (i and j are subscripts starting from 1).
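The recurrence above translates directly into a standard dynamic-programming routine; a minimal Python version (not necessarily the authors' implementation) is sketched below.

def levenshtein(a, b):
    m, n = len(a), len(b)
    # dist[i][j] = edit distance between the first i characters of a and the first j of b.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                      # delete all i characters of a
    for j in range(n + 1):
        dist[0][j] = j                      # insert all j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1       # indicator function
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + sub)   # substitution
    return dist[m][n]

assert levenshtein("sweet", "sweat") == 1
assert levenshtein("best", "Rest") == 1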

We use Levenshtein distance to measure the similarity between two words: the smaller the Levenshtein distance, the more similar the two words. So, we exchange each word chosen by the scoring module with the word whose Levenshtein distance to the keyword is the smallest. In this way, the difference between the replaced text and the original text is small, so it is not easily noticed by humans and does not affect human reading.

3.2.3. Generating Adversarial Examples

In the first step, we find the words with high scores, which are more important than the other words to the label given by the deep learning model; these selected words are called “keywords.” Then, we form a keyword list from them; by calculating the Levenshtein distance between the words in the keyword list and the words in the word list of the dataset, we can find the words that are visually similar to the keywords: for each keyword, we only need to find the word with the smallest Levenshtein distance. Finally, we use the visually similar words found in the previous step to replace the corresponding keywords in the original text. In this way, we generate adversarial examples that can fool deep learning models with high imperceptibility. The VSWR algorithm is described in Algorithm 1.

Let English text data be presented as X = {x_1, x_2, ..., x_n}
A trained neural network model M
A dataset word list W consisting of all the words in this dataset
for each word x_i in X do:
 compute the score S(x_i) of x_i by querying M
sort the words of X by S(x_i) to get the keyword list K
for each keyword k in K and each word w in W do:
 calculate LevenshteinDistance(k, w)
for each keyword k, find out the word w* in W with the smallest LevenshteinDistance(k, w)
use w* to replace k in X
the resulting text X' is the adversarial example of X
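Putting the pieces together, a rough Python sketch of Algorithm 1 (not the authors' implementation) could look as follows, reusing the score_words, top_keywords, and levenshtein helpers sketched earlier; the number of keywords to replace is left as a free parameter.

def vswr_attack(words, vocabulary, model_confidence, num_keywords):
    # Step 1: score every word and pick the most influential ones as keywords.
    scores = score_words(words, model_confidence)
    keywords = top_keywords(words, scores, num_keywords)
    adversarial = list(words)
    for kw in keywords:
        # Step 2: visually similar candidate = the in-vocabulary word (other than
        # the keyword itself) with the smallest Levenshtein distance to it.
        candidates = [w for w in vocabulary if w != kw]
        best = min(candidates, key=lambda w: levenshtein(kw, w))
        # Step 3: replace every occurrence of the keyword in the text.
        adversarial = [best if w == kw else w for w in adversarial]
    return adversarial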

4. Results and Discussion

4.1. Results
4.1.1. Dataset

We use two datasets, the Yelp review dataset and the Amazon review dataset, to train two deep learning models designed for sentiment analysis. Then, we use our proposed algorithm to attack the pretrained models.

The Yelp review dataset consists of comment data from the Yelp website, extracted from the Yelp Dataset Challenge 2015. There are only two categories in this dataset: positive and negative. Each piece of data contains two parts: the first part is the label, and the second part is the detailed comment of a user on the product they bought. The entire dataset contains 560,000 English texts, including 280,000 positive samples and 280,000 negative samples. However, many texts are composed of only a few words. We delete these texts from the dataset and only retain the texts with enough words for the attack. The detailed information of the dataset is shown in Table 1. The Amazon review dataset consists of reviews from Amazon, which are also divided into two categories: positive and negative. Each piece of data consists of the comment text and the label. In the experiment, we also deleted such overly short data from the Amazon review dataset; the detailed information is shown in Table 2.

4.1.2. Models

Since we consider the context of the text data when generating adversarial examples, recurrent neural network- (RNN-) based models are the most suitable for the experiment. To achieve high accuracy, we finally choose LSTM (Long Short-Term Memory) and BiLSTM (Bidirectional Long Short-Term Memory) as the trained models to attack.
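For reference, a BiLSTM classifier of the kind attacked here could be defined as in the following Keras sketch; the paper does not report the exact architecture or hyperparameters, so the layer sizes below are assumptions.

from tensorflow.keras import layers, models

def build_bilstm(vocab_size=20000, embed_dim=128):
    model = models.Sequential([
        layers.Embedding(vocab_size, embed_dim),
        layers.Bidirectional(layers.LSTM(64)),  # use layers.LSTM(64) alone for the LSTM baseline
        layers.Dense(64, activation="relu"),
        layers.Dense(2, activation="softmax"),  # two classes: positive / negative
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model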

Tables 3 and 4 show the classification accuracy of these models on the Yelp review dataset and the Amazon review dataset. According to these tables, the two deep learning models achieve good performance on the sentiment analysis task. Specifically, the accuracy of both models reaches as high as 95% on the Yelp review dataset, while the accuracy of the models on the Amazon review dataset exceeds 88%. In the following parts, we show that the two trained deep learning models perform badly against the generated adversarial examples.

4.1.3. Attack Performance

We use the method proposed in this paper to process the test dataset and then evaluate the effectiveness of the attack against each model, respectively. For the same model, as the number of words replaced in the original text increases, the prediction accuracy of the trained deep neural network model becomes lower and lower. As shown in Tables 5 and 6, as more words are replaced by their visually similar counterparts, the models' prediction accuracy decreases.
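This evaluation procedure can be sketched as follows; predict_label, the test texts (as word lists), their labels, and the vocabulary are assumed inputs coming from the trained model and the preprocessed datasets, so this is only an illustration of the measurement, not the authors' exact code.

def accuracy_under_attack(texts, labels, vocabulary, model_confidence,
                          predict_label, num_replaced):
    # Replace `num_replaced` keywords in every test text and measure accuracy.
    correct = 0
    for words, label in zip(texts, labels):
        adv = vswr_attack(words, vocabulary, model_confidence, num_replaced)
        if predict_label(adv) == label:
            correct += 1
    return correct / len(texts)

# Example sweep: accuracy typically drops as more words are replaced.
# for k in (0, 10, 20, 30, 40, 50):
#     print(k, accuracy_under_attack(test_texts, test_labels, vocab, conf_fn, pred_fn, k))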

In Figures 3 and 4, we show how the models' accuracy changes as the number of replaced words increases on the two datasets. The x-axis represents the number of replaced words in the original text, and the y-axis denotes the accuracy of the deep learning models. From the figures, the original accuracy of the models is higher than 95% and 88% on the two datasets, respectively, when no word is replaced. As more words are replaced along the x-axis, the accuracy of both models decreases, as shown by the two curves in the figures.

In Tables 7 and 8, we show some adversarial texts generated on the two datasets by the VSWR algorithm. In Table 7, an original text from the Yelp review dataset is recognized as “positive” by the BiLSTM model. However, after the algorithm changes only “best” to “Rest,” the model classifies the generated text as “negative.” Similarly, an original text from the Amazon review dataset is recognized as “negative”; by changing two words to their visually similar counterparts, the generated text is classified as “positive.” Clearly, the trained models perform badly against the generated adversarial examples, which implies the effectiveness of our proposed method.

4.2. Discussion

From Figures 3 and 4, we can find that the BiLSTM model is more susceptible to the influence of adversarial examples than the LSTM model. This is because the BiLSTM model fully considers the relationship between the scored words and their context; in addition, the keywords extracted by our method for replacement are more suitable for attacking the BiLSTM model. During the preprocessing of the dataset, we filter out the texts composed of only a small number of words, because humans can easily notice modifications to short texts; hence, we select relatively long texts in the dataset for the attack experiments. In our experiments, when the number of modified words is small, the attack effect on the two models is limited, but when we increase the number of modified words to 25% of the original text (on the Yelp review dataset), the classification accuracy of the model can be reduced from 0.95 to 0.55. This result is also confirmed on the Amazon review dataset, where the word changes reduce the model accuracy from 0.88 to 0.57.

Actually, some existing adversarial attacks can also largely reduce the prediction accuracy of neural network models. However, some of them change characters within a word or split a word with special symbols; the generated adversarial texts can be easily noticed by humans, since some generated words do not exist in the vocabulary. In our paper, we select words that look similar to the original ones for replacement, which can successfully fool both humans and deep neural network models. Although we need to modify 50 words to achieve good attack performance on the dataset, the modified words look quite similar to the original ones, and only 25% of the words are modified on average, which is acceptable.

However, some observations can also be drawn from the examples above. For example, words with strong emotions such as “interesting” and “bad” were not changed by the algorithm, which means the trained deep learning models do not mainly rely on these words for classification.

In this paper, we only verify the security vulnerabilities of deep learning models through adversarial attacks. Indeed, there are many other methods to expose security vulnerabilities. For example, one can modify the training data such that the trained deep learning model cannot learn the correct data distribution; this kind of attack is called data poisoning. In addition, some methods are proposed to steal the privacy of deep learning models, such as inferring data from the training set or stealing the parameters of the models. These methods cause privacy leakage of deep learning models.

5. Conclusions

In this paper, we explore the security vulnerabilities of deep learning models through adversarial attacks. Specifically, we propose the visual similar word replacement method to attack several deep learning models. This method first sorts the words in the original data by importance and selects the words that have the greatest influence on the classification result as keywords. At the same time, the original data is processed to obtain a word list containing all the words in the original dataset, and then we replace each keyword with the word found in the word list whose Levenshtein distance to the keyword is the smallest. The replaced text is the generated adversarial example of the original text. We also conducted experiments on sentiment analysis datasets, and the results prove that the adversarial examples generated by this method can successfully attack deep learning models and cause them to misclassify. In addition, as the number of modified words increases, the impact on the neural network models becomes more and more significant.

In the future, we will try to extend this method to attack more classification models for other textual analysis tasks, such as text generation, spam filtering, and machine translation. At the same time, we will also try to improve the proposed VSWR method so that fewer words need to be selected for replacement.

Data Availability

The Yelp review dataset is the comment data of the Yelp website. In our work, we filter out the texts in the dataset that contain fewer than 50 words, since changes to short texts are easier for humans to notice. The processed datasets are available at https://drive.google.com/drive/folders/1IrI44k_GFrcP-59iP6zz8tori4XX-6Pp?usp=sharing and https://drive.google.com/drive/folders/1_EC1NR8_LuJWB3YGmWh3WZTsMHMLRLZt?usp=sharing.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work is supported in part by the National Key R&D Program of China 2019YFB1706003, the Guangdong Key R&D Program of China 2019B010136003, and the National Natural Science Foundation of China under Grant Nos. 61972106 and 61902082.