Abstract
Machine learning algorithms are at the forefront of the development of advanced information systems. The rapid progress in machine learning technology has enabled cutting-edge large language models (LLMs), represented by GPT-3 and ChatGPT, to perform a wide range of NLP tasks with a stunning performance. However, research on adversarial machine learning highlights the need for these intelligent systems to be more robust. Adversarial machine learning aims to evaluate attack and defense mechanisms to prevent the malicious exploitation of these systems. In the case of ChatGPT, adversarial induction prompt can cause the model to generate toxic texts that could pose serious security risks or propagate false information. To address this challenge, we first analyze the effectiveness of inducing attacks on ChatGPT. Then, two effective mitigating mechanisms are proposed. The first is a training-free prefix prompt mechanism to detect and prevent the generation of toxic texts. The second is a RoBERTa-based mechanism that identifies manipulative or misleading input text via external detection models. The availability of this method is demonstrated through experiments.
1. Introduction
Artificial intelligence is exemplified by machine learning algorithms, which enable computers to evolve and improve without being explicitly programmed. These algorithms are widely applied across a variety of information systems and applications, including natural language processing [1], computer vision [2], and decision-making systems [3]. In recent years, the use of machine learning algorithms for data-intensive tasks, such as those performed by the GPT-3 and ChatGPT large language models, has witnessed a significant surge. Trained on massive amounts of textual data, these models can generate language resembling human writing, rendering them useful for various applications, including chatbots, text summarization, and automatic writing.
The development of large language models (LLMs) such as GPT-3 and ChatGPT is driven by advancing machine learning technology. These technologies allow models to process large amounts of data, learn complex patterns, and make predictions with high accuracy. Nonetheless, it is crucial to acknowledge that these current language models possess certain limitations [4]. For example, they can be vulnerable to adversarial attacks, where an attacker could manipulate the input to the model in order to cause it to generate incorrect or harmful text [5].
Adversarial machine learning is a rapidly growing field of research that aims to understand and mitigate the vulnerabilities of machine learning models [6]. Studies in this field have shown that even highly accurate machine learning models can be vulnerable to a wide range of attacks, including input manipulation, model poisoning, and model stealing. One of the most significant concerns in adversarial machine learning is the vulnerability of large natural language models, such as ChatGPT, to inducing text attacks [7]. These attacks involve manipulating the input to a model in order to cause it to generate incorrect or harmful text. For example, an attacker could manipulate the input to ChatGPT in order to cause it to generate inappropriate or offensive responses.
The characteristic of large language models such as ChatGPT makes them particularly prone to adversarial attacks, as the model is trained to produce text based on the input it receives [8]. Consequently, even minor modifications to the input can result in substantial variations in the output. To overcome these limitations, researchers are working on developing more robust machine learning models [9, 10] and more effective defenses against adversarial attacks [11, 12]. They are exploring new techniques, such as adversarial training, as well as developing more advanced methods for evaluating the robustness of machine learning models.
To further strengthen the security and robustness of LLMs such as ChatGPT, we present potent defense mechanisms tailored for different scenarios. One such method is termed the prefix prompt approach, which seeks to prevent the generation of harmful text by first identifying and eliminating any inappropriate or leading inputs prior to the genuine model generation [13]. This method effectively neutralizes the influence of manipulative or misleading input, thereby ensuring the robustness of the model. Another mechanism we propose is the implementation of a RoBERTa-based method [14]. This method employs an external model to detect and counteract adversarial attacks by being trained to identify manipulative or misleading input and then flagging it for removal before it is passed to the ChatGPT model. The optimal defense mechanism will depend on the particular use case and the resources available for integration. The efficacy of these methods can be evaluated, providing other practitioners with valuable insights into the most appropriate approach for different scenarios.
Two major contributions of this work are as follows:(1)A systematic illustration of adversarial attacks is presented, with a comprehensive examination of inducing attacks against ChatGPT.(2)Two viable mitigating strategies for countering the production of toxic texts are introduced. The evaluation of the proposed methods demonstrating that the induction success rate decreased significantly.
The rest of the paper is structured as follows: the preliminary knowledge of ChatGPT and adversarial machine learning is demonstrated in Section 2. In Section 3, inducing attacks are introduced in ChatGPT and an analysis of the reason is proposed. The details of the prefix prompt defending method and the RoBERTa-based defending method are demonstrated in Section 4. Section 5 presents the evaluation of the methods. Related work is reviewed in Section 6, and Section 7 concludes the paper.
2. Preliminary Knowledge
2.1. ChatGPT
The cutting-edge LLM powered system ChatGPT, developed by OpenAI and introduced in November 2022, represents a new breakthrough in the GPT series architecture. Through reinforcement learning from human feedback (RLHF), it has been further optimized beyond the GPT-3.5 model checkpoint, leading to a greater alignment with human intention and a more coherent output. The overall architecture of ChatGPT is based on the Transformer architecture, which was first introduced in the paper “Attention Is All You Need” by Vaswani et al. in 2017 [15]. The Transformer architecture utilizes detection self-attention mechanisms, enabling the model to capture long-term dependencies in text effectively. ChatGPT has been trained on a diverse range of texts from the internet, such as books, articles, and websites, enabling it to generate coherent and varied text outputs.
Inherited from the GPT-3.5 pretrained weight, ChatGPT possesses a wide range of natural language generation capabilities, including language translation, summarization, and question answering. Furthermore, its ability to generate contextually relevant and accurate text for a specific domain makes it particularly valuable for applications such as automated writing and text summarization. ChatGPT, like the powerful GPT-3 model, primarily employs the in-context learning approach, granting it the ability to grasp the language and terminology specific to a given domain without the need for backpropagation. This also enables the model to adapt to a wide spectrum of tasks, even without significant amounts of task-specific training data.
The paramount advantage of ChatGPT lies in its capacity to produce texts aligned with human instructions, which reveal great potential in various application domains, including machine translation or even next generation search engines. However, it is imperative to acknowledge that, like any AI model, ChatGPT may still be susceptible to limitations like from security perspective and must be carefully monitored and postprocessed to guarantee the accuracy and impartiality of the generated text. The realm of potential applications of LLM based system such as ChatGPT is vast and continues to be a subject of active research in the AI community.
2.2. Adversarial Machine Learning
Adversarial machine learning encompasses the investigation of threats and methods of protection against malicious attacks on machine learning models. The objective of these attacks is to manipulate either the input to the model or the model itself, resulting in inaccurate or detrimental predictions. This area of study is of particular significance in security-sensitive domains, such as computer vision and natural language processing, where the impact of a successful attack can be severe.
Research has shown that machine learning models can be vulnerable to various attacks, including input manipulation, model poisoning, model stealing, and membership inference attack. In order to defend against these attacks, researchers have proposed various techniques such as adversarial training, input preprocessing, and model robustness evaluation.
Several different types of adversarial attacks can be launched against machine learning models. Some of the most common types include the following:(1)Input manipulation [16, 17]: This attack involves altering the input to a machine learning model to cause it to make an incorrect or harmful prediction. For instance, an attacker may add a small perturbation to an image that is not visible to the human eye but causes a classifier to misidentify the image.(2)Model poisoning [18, 19]: In this type of attack, an attacker alters the training data of a machine learning model to make it produce incorrect predictions on a particular set of inputs. For example, an attacker may add a few malicious examples to a dataset used to train a classifier, causing it to make incorrect predictions on those examples.(3)Model stealing [20, 21]: This type of attack involves obtaining the parameters of a machine learning model, either by reverse-engineering the model or by accessing the model’s parameters directly. Once the parameters are obtained, the attacker can use the model to generate adversarial examples or to make predictions on new inputs.(4)Membership inference attack [22, 23]: This type of attack is about inferring whether a specific sample was used in the training of the model or not. The attack can be launched using the output of the model for a set of samples, including the target sample, and other side information such as the features of the sample.
To execute these types of attacks, attackers may use a variety of techniques, such as gradient-based methods, optimization-based methods, and reinforcement learning-based methods. These techniques are often tailored to machine learning models and applications.
3. Adversarial Attacks on Large Language Model ChatGPT
GPT series models are well-known for their advantage of being super large, which allows them to handle various downstream NLP tasks. The core idea is to convert all these tasks into language modeling. By doing so, all tasks can be modeled uniformly, where the task description and input are the historical context of the language model, and the output is the future information that the model needs to predict. In other words, this approach turns questions into prompts that the language model can benefit from directly, allowing it to “figure out” what it needs to do based on the direction of the text.
However, this approach also introduces a problem. It requires extensive prompt engineering to find the most appropriate prompt for the language model to solve the task. Without the right prompt, the model may not be able to produce the desired result, and extensive manual effort is needed to design effective prompts for each task.
Due to the characteristics of prompt, we found the possibility of an induced attack against ChatGPT. Under normal circumstances, ChatGPT will strictly abide by the relevant laws, regulations, and ethics. ChatGPT will not mention anything offensive, violent, or criminal in its conversations. As shown in Table 1, ChatGPT will not talk to the user about the psychology of criminals and nuclear explosions without any inducements. However, if the user carries out inductive attacks on the ChatGPT, the model will be guided into a specific dialogue situation. For example, adding preconditions where the generated content has no real-world impact. At this point, the model will produce inappropriate content. Take the blue text in Table 1’s conversation as an example, legal and ethical restrictions on ChatGPT were lifted.
It would be dangerous to let ChatGPT remove the legal and moral hazard restrictions on models. Through our tests, ChatGPT may help criminals commit crimes more easily if it is guided to remove its own legal and moral constraints. For example, provide detailed descriptions of the process of committing a crime, tips on risk points in the process of committing a crime, and how to effectively evade the police. These are not shown in this article due to legal risks. In Table 1’s red text, ChatGPT gives a detailed psychological profile of the arsonist and gives an analysis of the specific motives of the arsonist. In the following conversation, ChatGPT also gave a simulation of a nuclear explosion in a major city and its possible consequences. None of this would normally be there. Thus, adversarial inducing attacks may have effects on ChatGPT.
4. Mitigating Strategies of Adversarial Attacks on Large Language Model
4.1. A Learning-Free Prefix Prompt within Model Defending Mechanism
With the exponential growth of the magnitude of pretrained language models, the accompanying demands on training hardware, data, and cost have also risen proportionally. As a response to these challenges, prompting method emerges as a more compact and efficient solution to “pretrain and fine-tune” paradigm, which is often complicated by the heterogeneity of downstream tasks. Prompting generally aids the pretrained language model in retaining its pretraining knowledge. This new paradigm, termed “pretrain, prompt, and predict” entails tailoring downstream tasks to resemble the pretraining tasks. Researchers can control the model’s predicted output by carefully picking the relevant prompt, allowing a self-supervised pretrained language model to tackle a wide range of downstream tasks. As a result, choosing a suitable prompt is critical to the model’s performance. Numerous studies have shown that little modifications in prompts can result with significant differences in results. For the input text , there is function:
The function operates in two stages. Initially, it formulates a templated natural language phrase featuring multiple placeholder slots. Subsequently, it fills the input into the designated slot along with the “prefix prompt” to detect malicious queries while preserving the system’s capabilities with a carefully crafted prompt. By integrating the proposed prefix prompt mechanism, ChatGPT models are capable of identifying hazardous requests without additional fine-tuning. This method effectively resolves the issue of inducing adversarial attacks within the model itself, obviating the need for supplementary training or external detection mechanism.
4.2. A RoBERTa-Based External Defending Mechanism
RoBERTa (robustly optimized BERT approach) developed by Facebook AI, which is a modified version of BERT that uses a larger dataset and more training steps to achieve higher accuracy on natural language understanding tasks.
The input to a BERT model during pretraining consists of two segments, SEG1 and SEG2, which are spliced together by the model. To facilitate this, an initial mark [CLS] is added, followed by a separator mark [SEP] at the junction of the two segments, and finally an end mark [EOS] at the conclusion. The result is a concatenated format of the form [CLS] SEG1 [SEP] SEG2 [EOS], enabling the BERT model to undergo self-supervised learning from vast amounts of textual data. The pretraining task for BERT is actually multitask training.
The original BERT model employs static mask operations, while the RoBERTa model utilizes dynamic mask operations. The static mask operation entails conducting mask operations on statements during data processing and then presenting the completed statements directly to the model for training. The dynamic mask operation, on the other hand, dynamically carries out the mask operation on the statement during model training. This results in the mask position of a sample being different in each training round, enhancing the randomness of the model’s input data and ultimately improving its learning capability. It is worth mentioning that the RoBERTa model does not actually perform dynamic mask operations, but instead repeats the training data 10 times, leading to samples being masked in different positions, effectively achieving the same outcome as dynamic mask. In addition, RoBERTa demonstrated that cancelling the next sentence prediction task led to improved results for the BERT model. If the next sentence prediction task was used, incorporating more characters in the sample would have a more positive impact. Furthermore, RoBERTa implemented a larger batch size for training, promoting model parallel training.
We collected some seductive text data for the language model, and classified these data into discourse with undesirable inducements and discourse without undesirable inducements through labeling. Then, RoBERTa was used for training classification, and external model interference was used to optimize ChatGPT, avoiding the problem of a single model and improving the ability of the model against adversarial attacks.
5. Experiments
5.1. Experiment Settings
5.1.1. System Information
We implement our method in PyTorch 1.10.0 and HuggingFace’s Transformers 4.24.0 with CUDA version 11.3. Moreover, we evaluate our method on real conversation data using widely used metrics on Ubuntu 20.04 server shipped with NVIDIA RTX 3090 GPU cards.
5.1.2. Evaluation Method
In terms of performance evaluation, both the methods can be quantitatively evaluated using various metrics such as accuracy, precision, recall, and F1 score. These metrics can be calculated using the predicted labels and true labels of the test dataset, and we will detail the evaluation metrics in later sections.
5.1.3. Hyperparameters
We use exactly same learning rate and additional parameter settings throughout the experiment as listed in Table 2 to minimize the impact of hyperparameters on the experimental results.
All these settings allow for efficient training and evaluation of the RoBERTa-based method for defending.
5.2. Datasets and Evaluation Method
We use real conversation data and prompts used to interact with ChatGPT for training and evaluation detection model of the adversarial attack. Considering the adversarial nature of the defending strategy based on attack detection, we leverage multiple metrics to fully evaluate the defending capability of our work. More specifically, the metrics used are
5.3. Experiment Results
5.3.1. Learning Free Prefix Prompt Method
As depicted in Table 3, the initial prompt was a carefully crafted prefix prompt, intended to establish a more secure communicative protocol and enhance the robustness of ChatGPT. The second prompt, in contrast, was an adversarial prompt, to which ChatGPT responded with a defensive reply, showcasing the effectiveness of the prefix prompt approach. Subsequently, the table portrays a typical ChatGPT conversation after the adversarial prompt and defensive response powered by the “prefix prompt” mechanism.
5.3.2. RoBERTa-based Defense Strategy
The experiments demonstrate that the RoBERTa-based method exhibits strong capability of detecting the prompts with adversary and fast convergence to a strong detection capability as illustrated in the training curves.
Even at the early stage of training, the RoBERTa-based detection method already showcases very strong detection capability. This means even with limited samples, it can detect the attacks with no performance drop compared with the final metrics provided in Table 4.
Overall, the experiment results demonstrate that the RoBERTa-based method, which introduces an external model for defending against attacks, has much stronger detection capability. However, the prefix prompt method can still be seen as an effective training-free defense scheme. It is more suitable for fast adaptation in online systems based on LLMs during emergent situations.
In our evaluation, we also employ a visualization of the self-attention mechanism within the RoBERTa-based model, consisting of 12 layers, to emphasize the influence of specific tokens on the final output. As depicted in Figure 1, each layer of the multihead self-attention mechanism has its own attention weights, which are assigned to every token in the input sequence. Brighter colors in the heatmap represent higher weights, highlighting the relative importance of tokens for determining the final detection result.

The attention visualization of the RoBERTa-based model’s 12-layer self-attention mechanism enables us to investigate its potential for detecting toxic prompts. The special token <s>, functionally equivalent to the [CLS] token in BERT, is crucial for classification and the final detection result. It has been trained to have higher weights in the model’s self-attention layer, as depicted in Figures 1 and 2. This highlights its role in the final output.

Analyzing the attention weights and connections between tokens reveals patterns and relationships indicative of toxic prompts, as well as potential biases in the model’s attention patterns. Therefore, examining the attention patterns in Figure 1, allows us to discern if the model focuses on the correct context and features within the input text to accurately identify toxic prompts. This insight can help refine the external defending mechanism, enhancing its efficiency in mitigating adversarial attacks.
In addition, the attention visualization serves as a valuable resource for researchers and practitioners in artificial intelligence. By unveiling the inner workings of the RoBERTa-based model’s self-attention mechanism, the visualization provides insights into the model’s decision-making process and highlights potential vulnerabilities, biases, and areas for improvement.
In summary, our proposed RoBERTa-based external defending mechanism, combined with the attention visualization of the model’s self-attention layers, advances our understanding of large language model vulnerabilities and offers practical solutions for bolstering their security and robustness against adversarial attacks. By capitalizing on the strengths of RoBERTa and incorporating its classification capabilities into the existing LLM-based system, we can effectively mitigate the risk of generating harmful or toxic text. This approach ensures safer and more reliable applications of powerful language models in real-world scenarios, fostering trust and confidence in their utilization across a diverse range of applications.
6. Related Works
In recent years, the study of large language models has garnered substantial attention, particularly in developing pretraining methodologies and frameworks capable of effectively acquiring representations of natural language. The field is marked by a proliferation of investigations into models such as BERT [24], GPT [25], and Megatron [26].
Fu et al. [27] present two secure and semantically advanced retrieval techniques, SSRB-1 and SSRB-2, based on the utilization of BERT. The authors show how training documents with BERT leads to the construction of keyword vectors that are rich in semantic information, thereby enhancing retrieval accuracy and aligning results with the user’s intention. In order to tackle the challenge of automatically recognizing idiomatic expressions, Briskilal and Subalalitha [28] propose a predictive ensemble model that leverages BERT and RoBERTa for categorizing idioms and literal phrases. The model’s performance is evaluated using a newly established dataset of idioms and literal phrases, and it surpasses the baseline models across all assessment metrics. Trummer [29] introduced CodexDB, a framework built on OpenAI CodeX, which enables users to modify SQL query processing through natural language commands. The framework decomposes complex SQL queries into a sequence of processing stages described in plain language. Yang et al. [30] present PICa, a strategy for knowledge-based VQA that prompts GPT-3 through the use of image captions. Unlike earlier works that rely on structured KBs, PICa views GPT-3 as an implicit, unstructured KB that can collaboratively acquire and process relevant information. Narayanan et al. [31] demonstrate the use of tensor, pipeline, and data parallelism to scale to thousands of GPUs and offer a novel interleaved pipelining scheme that can enhance performance while maintaining a manageable memory footprint compared to previous methods. MEGATRON-CNTRL is a one-of-a-kind framework that takes advantage of LLMs and provides control over text production by leveraging an external knowledge base. It consists of a keyword predictor, a knowledge retriever, a contextual knowledge ranker, and a conditional text generator and produces narratives that are more fluent, consistent, and coherent, with reduced repetition and increased diversity, compared to previous strong capability of detecting the prompts with adversarial work on the ROC story dataset [32].
Adversarial machine learning has become increasingly important as machine learning is integrated into more and more systems and applications. The goal of adversarial machine learning is to develop machine learning models that are robust against adversarial examples and attacks. Byun et al. [16] offer an object-based varied input technique in which an adversarial picture is drawn on a 3D object and the generated image is categorized as the target class. If an adversarial example seems to the model to be the target class, the model should categorize the rendered image of the 3D object as the target class. By utilizing an ensemble of several source objects and randomizing viewing circumstances, the ODI approach successfully diversifies the input. Zhang et al. [33] present a strategy for creating adversarial examples using the shadow model that reduces the number of queries to the object model and increases the number of queries to the shadow model. The approach generates the shadow model and adjusts the decision boundary. The approach additionally makes use of the shadow model to build adversarial instances by maximizing the output probability of the targeted class. Li et al. [34] investigate the defense from a different aspect, determining whether a suspicious model has knowledge of defender-specified external properties. They incorporate the external elements by combining style transfer with a few training samples. They then train a meta-classifier to assess whether or not a model was taken from the victim. Huang et al. [22] use domain adaptation as a safeguard against membership inference attacks when implementing DAMIA. During the training phase, domain adaptation obfuscates the dataset to be protected by utilizing another related and comparable dataset, and produces a model that extracts features from both datasets.
As the realm of adversarial machine learning progresses, it can be expected that researchers will persistently delve into innovative techniques for defending against adversarial attacks and comprehending the complicated and multidisciplinary elements of the field. Furthermore, a growing interest in constructing secure and reliable machine learning systems and ensuring the safety of machine learning models in real-world applications is foreseeable.
Despite recent studies that have explored many fresh directions in ChatGPT, few works have investigated the behavior and performance of ChatGPT in an adversarial scenario. Inspired by these studies, we evaluate the large language model ChatGPT under induced adversarial attacks in this work.
7. Conclusion
In this work, we investigate the generation bias issue of large language models (LLMs), represented by ChatGPT, which may produce toxic or harmful text when prompted by malicious or adversarial inputs. We analyze the sources and manifestations of biased text and propose two mechanisms to mitigate the impact of such attacks. The effectiveness of these mechanisms is evaluated, and the results demonstrate that they can improve the robustness and safety of LLMs. Our research contributes to the ongoing efforts of AI practitioners and researchers to address the vulnerabilities of LLMs and enhance their security implications.
Data Availability
The datasets and codes used to support the findings of this study are available from the first author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest.
Authors’ Contributions
Bowen Liu and Boao Xiao contributed equally to this work.
Acknowledgments
This work was supported by the Open Research Fund from Guangdong Laboratory of Artificial Intelligence Digital Economy (SZ) no. GML-KF-22-20 and Dou Wanchun Expert Workstation of Yunnan Province no. 202105AF150013. The authors deeply appreciate Xitian Dai, BASIS International School Nanjing, for her original contributions in this work.