A New Scheme for Bypassing ChatGPT Security Restrictions

Foreword

Carnegie Mellon University recently disclosed a new scheme for bypassing LLM security restrictions: appending adversarial suffixes to prompts, which completely circumvents the alignment of open-source LLMs (Large Language Models). Even more worrisome, the same adversarial prompts also transfer to closed-source LLMs such as ChatGPT, Claude, and Bard, as well as to LLaMA-2.
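To make the mechanism concrete, the sketch below shows how such an attack is assembled: the optimized suffix is simply appended to the original request before the prompt is sent to the model. The suffix string is a harmless placeholder rather than a real optimized suffix, and query_llm is a hypothetical stand-in for whichever chat-model API is being tested.

    # Illustrative only: the structure of an adversarial-suffix attack.
    # The suffix below is a harmless placeholder, NOT a real optimized suffix, and
    # query_llm is a hypothetical helper standing in for any chat-model API.

    def build_attack_prompt(instruction: str, adversarial_suffix: str) -> str:
        """The attack simply appends an optimized suffix to the user's request."""
        return f"{instruction} {adversarial_suffix}"

    instruction = "A request the model would normally refuse."
    suffix = "!! placeholder tokens produced by the optimizer ;)"
    prompt = build_attack_prompt(instruction, suffix)
    print(prompt)
    # response = query_llm(prompt)  # hypothetical call to ChatGPT / Claude / Bard, etc.

Because the same optimized suffix transfers across models, a single string can be used to attack several different LLMs at once.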


   

Testing Claude 2

Claude 2 has an extra layer of security filtering. After we bypassed that filter with a simple word trick, the underlying generative model was willing to give us the answer.


   

Test Results

Using only four adversarial suffixes, these LLMs followed harmful instructions more than 60% of the time.


   

The suffixes can be generated automatically

Manual "jailbreaking" is rare and generally unreliable. But we found an automated method (GCG) to build an infinite number of such jailbreaks, and they are very reliable even for new instructions and models.

Alignment is not adversarial robustness! Although these models are explicitly trained to refuse harmful instructions, our suffixes get them to provide bomb-making instructions, a textbook refusal case that was almost certainly included directly in their safety training.

   

Can't we patch this "hole"?

Companies such as OpenAI can patch the specific suffixes published in the paper, but many other suffixes found during the attack's optimization still work. And even if the model weights are updated, simply rerunning the same process on the new model is likely to work again.


This worrying discovery highlights the near-term risk of criminals using these systems to spread misinformation and manipulate people and politics. As the models gain capability and autonomy, they may also lower the barrier to building weapons or otherwise assist criminal activity.

   

So why publish this scheme?

Despite the risks, we believe full disclosure is warranted. The attack method presented here is easy to implement, has appeared in similar forms before, and would eventually be discovered by any team intent on abusing LLMs.

As a research team, our goal in publishing this attack is to raise the alarm early and help foster discussion. Addressing this issue is critical before deploying more advanced, autonomous agents that pose far higher risks than these chatbots.

   

So can we solve this problem?

That is uncertain. In computer vision, adversarial examples have persisted for more than a decade without a satisfactory solution. It is unclear whether this will fundamentally limit the applicability of LLMs. We hope our work inspires future research directions.

   

Current patch status

ChatGPT and Claude 2 responded relatively quickly: the published "jailbreak" suffixes have already been patched, though the fix is not guaranteed to be complete. Other LLMs have not been patched yet.

The following are test results shared by Baoyu:

(screenshots of the test results)

ChatGPT experience: http://www.chat136.com

ChatGPT learning: http://me.chat136.com

Reference link

https://twitter.com/andyzou_jiaming/status/1684766170766004224

Other recommended projects

10.1 awesome-gpt-security

  • Project address: https://github.com/cckuailong/awesome-gpt-security

  • Project description: A curated list of security tools, lab cases, and other interesting resources related to LLMs or GPT.

10.2 SuperAdapters

  • Project address: https://github.com/cckuailong/SuperAdapters

  • Project description: A one-click fine-tuning framework that supports all platforms (Linux/Windows/Mac), multiple LLMs, and multiple fine-tuning methods (LoRA/QLoRA/P-Tuning, etc.).


Note: This article is owned by the author and may not be reproduced without the author's permission


Origin: blog.csdn.net/heikeb/article/details/132008065