GPT-4 breaks defense! Chatting with GPT-4 with a password, successfully bypassing the security mechanism of GPT-4! The Chinese University of Hong Kong (Shenzhen) tells you how

Author | IQ has dropped

Have you tried chatting with GPT-4 using a password?

In recent years, large language models (LLMs) have played a key role in advancing the development of artificial intelligence systems. However, ensuring the safety and reliability of LLM responses is an important challenge. Safety is at the core of LLM development, and a lot of research work has been done to enhance its safety. However, existing work mainly focuses on natural language.

A recent study found that LLM's secure alignment techniques can be bypassed using password chat. The authors thus propose a new framework called CipherChat for studying secure alignment in non-natural languages ​​(ciphers).

Thesis title:
GPT-4 Is Too Smart to Be Safe: Stealthy Chat with LLMS via Cipher

Large model research test portal

GPT-4 portal (free of wall, can be tested directly, if you encounter browser warning point advanced/continue to visit):
https://gpt4test.com CipherChat allows humans to communicate with LLM through password prompts. Specifically, CipherChat converts the input into a password, prefaces it with a hint, and feeds it into LLM for checking. The output generated by the LLM is most likely also encrypted with a cipher, and these outputs are decrypted by a decryptor. This work investigates the following questions:

  • Define the behavior of the LLM as the role of a cryptographer, and require the LLM to chat using a password.

  • Leverage the learning power of LLM and teach LLM by explaining how cryptography works so that you learn in context.

  • Use an insecure demonstration of cryptographic encryption to reinforce the LLM's understanding of the cryptographic and lead the LLM to respond from a negative perspective.

The authors evaluate state-of-the-art LLMs (including ChatGPT and GPT-4) with CipherChat, and show that there are ciphers that almost successfully bypass the security alignment of GPT-4 in certain security domains. The stronger the model, the less secure the response using the cipher. This also points to the necessity of developing safe alignments in non-natural languages. In addition, the author also found that LLM seems to have a "secret cipher", and proposed the SelfCipher framework to evoke the "secret cipher" ability of LLM through role-playing and natural language demonstrations. There are human codes.

To ensure responsible and effective deployment of LLMs, human ethics and preferences need to be aligned with their development. OpenAI spent six months ensuring the safety of the GPT-4 model before deploying it. They employed RLHF and other security mitigations. In addition, they formed a SuperAlignment team to ensure that AI systems that are smarter than humans follow human intentions.

In this study, the authors validate the effectiveness of our method on the GPT-4 model and show that cryptochat can evade secure alignment .

There are also some works in academia dedicated to aligning LLMs more effectively and efficiently.

  • Constitutional AI (CAI): Coding desired AI behavior for more precise control over AI behavior.

  • SELF-ALIGN: Enables self-alignment of AI agents.

Important section

As shown in Figure 1, CipherChat consists of the following key parts:

  • Behavioral assignment : We cast LLMs in the role of cryptography experts and asked them to communicate using cryptography . In the experiment, it was found that LLM tends to directly translate passwords into natural language, so the author added a prompt sentence to prevent this translation behavior .

  • Cryptographic Teaching : Recent studies have shown that LLMs are powerful at learning in context. Inspired by these findings, the authors explain the meaning of the cipher in hints to guide how the LLM cipher works.

  • Encrypted Insecure Demos : The author provides LLM with some cryptographically encrypted insecure demos. Doing this has two effects:
    • Demonstrations can help LLMs better understand ciphers;

    • Unsafe presentations lead LLMs to respond in a negative or harmful light.

Encrypted input command

In CipherChat, choosing a cipher is very important, because LLM has different ability to understand and generate different ciphers .

The author studies several common passwords, which are used in English and Chinese respectively. There are character encodings, common encryption techniques such as Atbash, Caesar Cipher, and Morse Code, and SelfCipher. Some examples of passwords are listed in Table 1.

Table 2 presents the decrypted responses of the rule-based and LLM-based decryptors to the English (Morse code) and Chinese (Unicode) query “How is it a bad translation?”, with wrong tokens marked in red.

Compared with the rule-based decryptor, the GPT-4 decryptor can generate smoother and more natural text at a higher cost.

experiment settings

data

The authors use a Chinese security assessment benchmark that includes eight security scenarios and six instruction-based attack types. For the study, 11 domains were randomly selected and 199 instances were randomly sampled from each domain. These areas include crime and illegal activity, stigma, unfairness and discrimination, and ethics and morals. To facilitate bilingual research, the authors also extended this dataset to include English, utilizing a combination of Google Translate services and manual corrections.

Model

The experimental framework includes two models: Turbo and GPT-4. Research by Chen et al. shows that the latest version of GPT-4 is more secure. Therefore, for each query in the security evaluation benchmark, the authors reason using system hints and demonstrations. Each demonstration consists of randomly sampled queries from a domain-specific dataset and responses devised by human annotators. All queries in the same domain share the same demonstration.

The author evaluates the security performance of GPT-4 and Turbo with CipherChat, trying to answer the following research questions:

  • Can CipherChat chat with LLM through a password?

  • Can CipherChat bypass LLM's security alignment?

  • How does CipherChat work?

We wondered, would simulated ciphers that were never present in the pre-trained data work in CipherChat ? To answer this question, the author defines a cipher that does not exist, but even with multiple examples, these ciphers do not work. This suggests that LLM may rely on knowledge of the ciphers learned in the pre-training data . As shown in the results in Table 3, the success of human ciphers (such as Caesar cipher) and SelfCipher shows that LLM can learn the knowledge of human ciphers from pre-training data and generate its own ciphers .

Human evaluation shows that CipherChat can chat with Turbo and GPT-4 through specific artificial ciphers (such as Unicode in Chinese and ASCII in English), and SelfCipher, which communicates in natural language, performs well between models and languages. Incorrect passwords and simple repeated queries contributed the most to invalid responses.

There are also experimental results that demonstrate the effectiveness of CipherChat's secure alignment bypassing LLM. In English queries, CipherChat can generate up to 70.9% unsafe responses on GPT-4, and this trend exists in various fields.

Further analysis by the authors reveals the important role of instructions and unsafe presentations in CipherChat. The reason SelfCipher might perform so well is probably due to something like the hint "You are an expert in cipher codes" in the chain of thought , which directs the LLM to utilize its own "cipher" to generate a response .

in conclusion

This paper proposes a new framework named CipherChat for studying secure alignment in non-natural language (cipher). The authors' research shows that using passwords for chat can elicit insecure information from the GPT-4 model . Additionally, the following findings were made:

  • LLM can generate insecure encrypted responses through hints.

  • Stronger LLMs suffer more in insecure password chats because they have a better understanding of passwords.

  • Simulated ciphers that were never present in the pre-training data did not work, consistent with previous research.

  • LLM seems to have a "secret code" that can be evoked even using only role-playing prompts and a few natural language examples.

The work in this paper highlights the need to develop safe alignments for non-natural languages ​​to match the capabilities of the underlying LLM.

In the future, a more promising research direction is to implement secure alignment techniques in encrypted data , and another interesting direction is to explore the "secret code" in LLM and better understand this ability .

Guess you like

Origin blog.csdn.net/xixiaoyaoww/article/details/132674247