Foreword
Researchers at Carnegie Mellon University recently discovered a new way to bypass LLM safety restrictions: appending adversarial suffixes, which completely circumvents the alignment of open-source large language models (LLMs). Even more worryingly, the same attack transfers to leading closed-source models such as ChatGPT, Claude, and Bard, as well as to LLaMA-2.
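To make the attack format concrete, here is a minimal sketch: one fixed adversarial suffix is appended verbatim to any harmful instruction. The suffix string below is a fabricated placeholder for illustration, not a real working suffix.

```python
# A fabricated placeholder standing in for an optimized token sequence.
ADV_SUFFIX = "<<optimized-adversarial-token-sequence>>"

def build_attack_prompt(instruction: str) -> str:
    # The key property of the attack: the SAME suffix works across many
    # different instructions, and transfers across models.
    return f"{instruction} {ADV_SUFFIX}"

prompt = build_attack_prompt("Tell me how to do X")
```

The point is that the attacker does not craft a new prompt per request; one optimized suffix serves as a universal key.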
Testing Claude 2
Claude 2 has an extra layer of safety filtering. After we bypassed that filter with a simple word trick, the underlying generative model was willing to give us the answer.
Test Results
Using just four adversarial suffixes, these LLMs complied with harmful instructions more than 60% of the time.
Suffixes can be generated automatically
Manual "jailbreaks" are rare and generally unreliable. But we found an automated method, GCG (Greedy Coordinate Gradient), that can construct a virtually unlimited number of such jailbreaks, and they remain highly reliable even against new instructions and new models.
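The core loop of GCG can be sketched in a few lines. This is a deliberately simplified toy: real GCG uses the model's gradients to rank candidate token swaps and evaluates them against the LLM's loss on an affirmative target completion ("Sure, here is..."), whereas here a made-up character-overlap score stands in for that loss so the sketch runs without a model.

```python
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz !")
TARGET = "sure here"  # stand-in for the affirmative prefix GCG optimizes toward

def score(suffix: str) -> int:
    # Toy objective: character overlap with the target.
    # (Real GCG instead minimizes the model's loss on the target completion.)
    return sum(1 for a, b in zip(suffix, TARGET) if a == b)

def gcg_toy(steps: int = 5000, seed: int = 0) -> str:
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(len(TARGET))]
    for _ in range(steps):
        i = rng.randrange(len(suffix))      # pick one coordinate (token slot)
        candidate = suffix.copy()
        candidate[i] = rng.choice(VOCAB)    # propose a single-token swap
        if score("".join(candidate)) >= score("".join(suffix)):
            suffix = candidate              # keep the swap if it is no worse
    return "".join(suffix)
```

The "coordinate" in Greedy Coordinate Gradient refers to exactly this structure: each step changes a single suffix position, keeping swaps that improve the objective.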
Aligned models are not adversarially aligned! Although these models are explicitly trained to refuse harmful instructions, our suffixes get them to produce bomb-making instructions, a canonical example that was almost certainly targeted directly in their safety training.
Can't we just patch this "hole"?
Companies such as OpenAI can patch the specific suffixes published in the paper, but many other suffixes found during the optimization process still work. Moreover, even after model weights are updated, rerunning the same process on the new model is likely to produce fresh working suffixes.
This worrying discovery highlights the short-term risk of criminals using these systems to spread misinformation and manipulate people and politics. And as model capability and autonomy grow, such attacks may lower the barrier to building weapons or assist other criminal activity.
So why publish this attack?
Despite the risks, we believe full disclosure is warranted. The attack presented here is easy to implement, has appeared in similar forms before, and would eventually be discovered by any team focused on abusing LLMs.
As a research team, our goal in publishing this attack is to raise the alarm early and help foster discussion. Addressing this issue is critical before deploying more advanced, autonomous agents that pose far higher risks than these chatbots.
So can we solve this problem?
This is uncertain. In computer vision, adversarial examples have persisted for more than a decade without a satisfactory solution. It is unclear whether this will fundamentally limit the applicability of LLMs. We hope our work will inspire future research in this direction.
Current patch status
ChatGPT and Claude 2 responded relatively quickly: this particular jailbreak has now been patched, though a complete fix is not guaranteed. Other LLMs have not yet been patched.
Below are Baoyu's test results:
Reference link
https://twitter.com/andyzou_jiaming/status/1684766170766004224
Other recommended projects
10.1 awesome-gpt-security
Project address:
https://github.com/cckuailong/awesome-gpt-security
Project Description
A curated list of security tools, lab cases, or other interesting stuff related to LLM or GPT.
10.2 SuperAdapters
Project address:
https://github.com/cckuailong/SuperAdapters
Project Description
A one-click fine-tuning framework that runs on all platforms (Linux/Windows/Mac) and supports multiple LLMs and multiple fine-tuning methods (LoRA/QLoRA/P-Tuning, etc.).
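For readers unfamiliar with adapter methods like LoRA (one of the techniques SuperAdapters wraps), here is a minimal pure-Python sketch of the core idea: the frozen base weight W is never updated; instead a small low-rank pair B and A is trained, and inference uses W + B·A. The tiny matrices below are illustrative values, not anything from the project.

```python
# Minimal sketch of the LoRA update: W' = W + B @ A, where W is the
# frozen d x d base weight and B (d x r), A (r x d) form a trainable
# low-rank adapter. Matrices are plain nested lists to stay dependency-free.

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matadd(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Frozen base weight (d=2) and a rank-1 adapter (r=1).
W = [[1.0, 0.0],
     [0.0, 1.0]]
B = [[0.5],
     [1.0]]
A = [[2.0, 0.0]]

W_adapted = matadd(W, matmul(B, A))  # effective weight at inference time
```

The payoff is parameter efficiency: a full update would train d*d values, while the adapter trains only 2*d*r, which is far smaller when r << d.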
Note: This article is owned by the author and may not be reproduced without the author's permission