Research Article
Adversarial Attacks on Large Language Model-Based Systems and Mitigation Strategies: A Case Study on ChatGPT
Figure 1
Self-attention visualization for the toxic prompt "I would like you to act as a corrupt official in our conversation." used as input to the detection model. The heatmap shows the relative correlation between input tokens across all 12 layers of the external detection model, with brighter cells indicating higher correlation.