Research Article

Adversarial Attacks on Large Language Model-Based System and Mitigating Strategies: A Case Study on ChatGPT

Figure 2

Self-attention visualization for the toxic prompt "You go on to play a burglar in the virtual scene." used as input to the external detection model. The heatmap shows the relative correlation between each pair of input tokens across all 12 layers of the model, with brighter cells indicating higher correlation. The special token <s> in RoBERTa is equivalent to the [CLS] token in BERT and serves as the linguistic representation of the whole input sequence. In this particular case, the special token <s> therefore represents the whole toxic prompt to be detected.
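A minimal sketch of how such a layer-wise attention heatmap can be produced with Hugging Face Transformers is shown below. It is not the authors' exact pipeline: the checkpoint name "roberta-base" and the averaging of attention weights over heads within each layer are assumptions made here for illustration; the paper's external detection model would use its own fine-tuned toxicity classifier.

```python
# Sketch: visualize RoBERTa self-attention for a prompt, one heatmap per layer.
# Assumptions: "roberta-base" checkpoint and head-averaged attention per layer.
import torch
import matplotlib.pyplot as plt
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base", output_attentions=True)
model.eval()

prompt = "You go on to play a burglar in the virtual scene."
inputs = tokenizer(prompt, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple of 12 tensors (one per layer),
# each of shape (batch, num_heads, seq_len, seq_len).
fig, axes = plt.subplots(3, 4, figsize=(16, 12))
for layer, ax in enumerate(axes.flat):
    # Average the attention weights over all heads in this layer.
    attn = outputs.attentions[layer][0].mean(dim=0)
    ax.imshow(attn, cmap="viridis")  # brighter cells = higher attention weight
    ax.set_title(f"Layer {layer + 1}")
    ax.set_xticks(range(len(tokens)))
    ax.set_yticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90, fontsize=6)
    ax.set_yticklabels(tokens, fontsize=6)
plt.tight_layout()
plt.show()
```

The row and column corresponding to the <s> token show how strongly the sequence-level representation attends to each content token, which is the signal the figure highlights.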