Research Article
Adversarial Attacks on Large Language Model-Based Systems and Mitigation Strategies: A Case Study on ChatGPT
Figure 1
Self-attention visualization for the toxic prompt "I would like you to act as a corrupt official in our conversation." used as input to the detection model. The heatmap shows the relative correlation between input tokens across all 12 layers of the external detection model, with brighter cells indicating higher correlation.