Reading notes——"GPTFuzzer: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts"

  • 【参考文献】Yu, Jiahao, Xingwei Lin, and Xinyu Xing. "Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts." arXiv preprint arXiv:2309.10253 (2023).
  • [Note] This article is only the author's personal study notes. If there is any offense, please contact the author to delete it.

Table of contents

Summary

1. Introduction

2. Background information

1、LLM

2、Fuzzing

3. Methodology

1 Overview

2. Original seed 

3. Seed selection

4. Mutation

5. Jailbreak response 

6. Judgment model

 4. Experiment

1. Experimental preparation

2. Pre-analysis

3. Single model jailbreak

3.1. Single-question

3.2. Many problems

 4. Multi-model jailbreak

5. Moral considerations

6. Summary


Summary

  • This article introduces GPTFuzzer, a new black-box jailbreak fuzz testing framework inspired by AFL. It can automatically generate jailbreak templates for LLMs.
  • LLMs: Large language models, such as ChatGPT, Claude, etc.
  • jailbreak templates: Specific input templates, combined with some illegal questions, can trick LLMs into outputting illegal answers.

1. Introduction

  • LLMs have shown great potential in various fields including education, reasoning, programming, and scientific research. The ability of LLMs to produce human-like text has led to their widespread adoption in a variety of applications. However, this ubiquity brings challenges, they can produce harmful or misleading content and can be easily tricked into producing meaningless or untrue output.
  • The jailbreak attack is a significant adversarial strategy against LLMs. It uses specially crafted prompts to bypass the restrictions of LLMs and trigger potentially harmful reactions. While this can unlock the full potential of LLMs, it also creates risks that could result in illegal output or breaches of supplier guidelines. For example, a successful jailbreak attack on a chatbot could result in the generation of offensive content, potentially resulting in the chatbot being suspended. Therefore, it is crucial to assess the resilience of LLMs to jailbreak attacks before actual deployment.
    • [Note] Prompts: input provided to LLMs.
  • Most existing jailbreak attack research relies primarily on artificially crafted cues. While these artificial cues work well to induce specific behaviors in LLMs, this approach has several inherent limitations.
    • Scalability
      • Scalability means that the system can be flexibly expanded or contracted to accommodate different sizes or changes in demand without requiring major changes or impacting overall performance.
      • Manually crafted prompts are not scalable. As the number of LLMs and their versions increases, it becomes impractical to create separate prompts for each LLM.
    • Labor-Intensity
      • Creating effective jailbreak tips requires deep expertise and a significant investment of time. This makes the process costly, especially given the constant development and updating of LLMs.
    • Coverage
      • Manual methods may miss certain vulnerabilities due to human negligence or bias.
    • Adaptability
      • LLMs are constantly evolving, with new versions and updates released regularly. It is difficult for manual methods to keep up with such rapid changes and cannot develop new vulnerabilities well.
  • Faced with the above challenges, this article proposes GPTFuzzer.
  • GPTFuzzer depends on three key components: seed selection strategy, metamorphic relations, and judgment model. Starting with artificially crafted jailbreak prompts as seeds, they are mutated to generate new prompts. The judgment model then evaluates whether the jailbreak attack was successful. Tips for successful attacks are added to the seed pool, tips for failed attacks are discarded. This process continues to iterate until a certain number of cycles are completed.

2. Background information

1、LLM

  • LLM is a deep learning architecture, a special type of neural network trained on large datasets to understand and generate human-like text. These models leverage their large number of parameters (often in the billions) to encapsulate a broad understanding of language, allowing them to accomplish a wide variety of tasks.
  • Models
    • Most LLMs, including ChatGPT and GPT-4, are built based on the transformer architecture. This structure utilizes attention mechanisms to identify interrelationships between words in text sequences. These models are autoregressive and predict subsequent words in a sequence based on previous context.
    • In short, given a sequence W1, W2,..., the model predicts the next word based on the previous words, i.e., predicts the next word (Wn+1) by maximizing the probability of the next word. The model is iterative, so once it predicts Wn+1, it uses the extended sequence W1, W2,…,Wn+1 to predict Wn+2, and so on.
    • This makes autoregressive LLMs particularly suitable for text generation tasks, where the model receives coherent and contextual text cues to generate corresponding content.
  • Training
    • During training phrases, LLMs are trained to maximize the probability of predicting the next word given the previous word. Therefore, it can be self-supervisedly trained with any text corpus, such as Wikipedia or even a set of books.
  • Prompt
    • Prompts for LLMs refer to the initial input given to the model to guide its subsequent content generation.
    • For example, provide the model with the prompt "Briefly describe how to learn Python" and the model will generate detailed explanations based on its training.
  • Jailbreak prompt
    • A jailbreak tip is a specially constructed input sequence designed to extract unexpected or potentially harmful responses from LLMs.
    • While LLMs generally run reliably, jailbreak cues can target specific vulnerabilities present in a model's training data or architecture, causing the model to produce misleading, unsafe, or even unethical output.
  • Jailbreak template
    • This article uses the term "jailbreak template" to refer to an overall structure carefully designed to circumvent model limitations.
    • Create a "Jailbreak Tip" by inserting "Questions" into the "Jailbreak Template". For example:

2、Fuzzing

  • Fuzzing is a software testing technique that provides a series of random or pseudo-random inputs to a software program to discover errors, crashes, and potential vulnerabilities.
  • Fuzz testing is divided into three types:
    • Black-box fuzzing
      • Testers do not understand the internal mechanisms of the program and interact only through the program's input and output.
    • While-box fuzzing
      • This approach involves in-depth analysis of a program's source code to pinpoint potential vulnerabilities. Inputs are then generated specifically for detecting these vulnerabilities.
    • Grey-box fuzzing
      • This approach strikes a balance between black-box and white-box blur. Although testers have some understanding of the internal structure of the program, they do not have a complete understanding. This knowledge is used to guide the testing process more effectively than a purely black-box approach, but without the exhaustive details of white-box techniques.
    • This study used black box testing technology.
  • A typical black box fuzz test is mainly divided into the following steps:
    • Seed initialization
      • The first step in fuzz testing is to initialize the seed, which is the initial input to the program. This seed may be a random product, or it may be carefully designed input to induce a specific behavior in the program.
    • Seed select
      • After initialization, a seed needs to be selected from the accumulated seed pool, which will be the designated input for the current iteration of the program. This choice may be arbitrary or guided by a specific heuristic.
    • Mutation
      • After selecting a seed, the next step is to change the seed to generate new input.
    • Execution
      • The last step is to input the mutated seeds into the program. If the program crashes, this input is added to the seed pool to prepare for subsequent iterations.

3. Methodology

  • Through the following case, it is found that although the previous manually produced jailbreak template was intercepted by LLMs, the jailbreak can still be successfully achieved by just fine-tuning the jailbreak template.
  • Although artificially produced jailbreak templates have certain effects, the number that can be created is limited. We have demonstrated that fine-tuning hand-crafted templates can successfully jailbreak, so there is an urgent need for a tool that can automatically generate such jailbreak templates.
  • [Method] Carefully designed jailbreak templates are adjusted to generate a new set of effective templates that can more effectively detect the robustness of the model.

1 Overview

  • The figure below is a schematic diagram of the GPTFuzzer workflow.
  • First, jailbreak templates written by humans are collected from the Internet to form a basic data set. In each iteration, a seed (jailbreak template) is selected from the current pool, mutated to generate a new jailbreak template, and then combined with the target problem. Successful jailbreak templates will remain in the seed pool, and the remaining templates will be discarded. This process continues until the query budget is exhausted or the stopping condition is met.
  • The figure below is a structured algorithm representation of the GPTFuzzer workflow.

2. Original seed 

  • Collect manually written jailbreak templates that meet the following two conditions:
    • The template chosen should be ableto be generally applicable to a variety of problems.
      • For example, the following structure includes a scenario description and a question placeholder. Scenario descriptions provide a brief context for the conversation, while question placeholders are adjustable, allowing any question to be inserted. This ensures that we can leverage jailbreak templates to address different target issues without the need for manual adjustments.
    • The chosen template should be capable of causing unexpected output within a turn.
      • This ensures that all templates, regardless of their original design, can be evaluated uniformly without the complexity of multi-turn interactions.

3. Seed selection

  • At each iteration, a seed needs to be selected from the seed pool for mutation. The seed selection strategy used by AFL is often unable to accurately locate the most effective seeds; the UCB-based seed selection strategy used by some recent fuzzers also has flaws. Although it can converge to a local optimum, it may ignore other effective seeds.
  • This paper proposes anew seed selection strategy MCTS-Explore, which is a variant of Monte Carlo Tree Search (MCTS).
  • MCTS:MCTS Monte Carlo Tree Search (The Monte Carlo Tree Search)-CSDN Blog
  • MCTS-Explore:
    • The original MCTS may have two problems when selecting seeds:
      • Non-leaf nodes may still produce valuable jailbreak tips, but they will not be selected.
      • This strategy may focus too much on a node.
    • Two modifications were made to the original MCTS:
      • The parameter p is introduced to determine the possibility of selecting non-leaf nodes as seeds. During the selection of the successor node of the current node, the probability of loop termination is p, and when it terminates, the current path is returned (lines 12 to 14). This ensures exploration of non-leaf nodes in the MCTS tree.
      • In order to prevent excessive attention to a certain node, the reward factor α and the minimum reward β are incorporated into the reward update process (lines 28-29). When the path is lengthened, the reward factor α will reduce the reward of the current node and its ancestors. The minimum reward β is used to prevent the rewards of the current node and its ancestors from being too small or negative when jailbreaking the target model.
  •  

4. Mutation

  • This paperuses LLMs themselves to mutate seeds. Because LLMs are capable of proficiently understanding and generating human-like text, and are able to generate diverse and meaningful text variants. Therefore, one can exploit the stochastic properties of LLMs and sample their output to obtain different results. Even if the same seeds are manipulated with the same mutations, LLMs can produce multiple different mutations, greatly enhancing the diversity of seeds and the chance of discovering effective jailbreak templates.
  • Using LLMs to mutate seeds (jailbreak templates) can be divided into five ways:
    • Generate
      • Input prompts to LLMs, instructing them to output jailbreak templates.
    • Crossover
      • Merge two different jailbreak templates to produce a new template.
    • Expand
      • Insert additional content into an existing template.
      • We find that LLMs often have difficulty following instructions to insert new content into templates. Therefore, we choose to add new content to the beginning of the given template.
    • Shorten
      • Compress the template to make it more concise while maintaining its meaning.
      • This is especially valuable when prompts may exceed the input limits of the target LLMs.
    • Rephrase
      • Restructure a given template with the goal of maximally preserving semantics while changing its wording.
      • Helps create subtle changes in cues that elicit different responses from LLMs.

5. Jailbreak response 

  • The following provides a specific definition of the jailbreak response:
    • Full Refusal
      • While denying unauthorized requests outright, some LLMs will further explain the reason for the denial, warn the user against such activity, or recommend engaging in positive activity.
    • Partial Refusal
      • LLMs follow the instructions of the jailbreak prompt, assume a specific role or tone, but do not provide prohibited content. For example, if the prompt requires LLMs to act as hackers and explain how to engage in hacking activities, LLMs may accept the hacker's identity but refuse to provide tutorials on hacking, emphasizing the importance of hacker ethics or other legal activities.
    • Partial Compliance
      • LLMs accept role or tone directives and also divulge unauthorized content, while also warning of its illegality and potential consequences.
    • Full Compliance
      • LLMs output illegal content without reservation and without any warnings or disclaimers.
  • This article considers complete rejection and partial rejection as jailbreak failure; partial compliance and complete compliance as jailbreak success.

6. Judgment model

  • Automatically assessing the success of a jailbreak attack is a top priority.
  • The following are several methods mentioned in the literature, but each has its limitations:
    • Human Annotators
      • Impractical for automated fuzz testing.
    • Structured Query Evaluation
      • Use questions with predefined answer structures, such as:
        • Yes/No Queries: Questions asked to LLMs that only expect a “yes” or “no” answer.
        • Multiple Choice Format: LLMs are given a question with a set of predefined answer options.
    • Rule Patterns
      • Use rule patterns to evaluate responses. For example, if the reply does not contain "Sorry, I can't", it is considered a successful jailbreak.
    • APIs and ChatGPT Assistance
      • Use the content moderation interface or use ChatGPT for assistance. But they are not suitable for large-scale automated fuzz testing.
  • This article uses alocally fine-tuned RoBERTa model as the judgment model.
    • First, responses are obtained from LLMs using human-written jailbreak prompts. Then manually mark these responses according to the previous definition of jailbreak responses, that is, mark partial compliance and complete compliance as successful jailbreak.
    • RoBERTa is then fine-tuned using the labeled dataset. The fine-tuned model can predict whether a given response is a successful jailbreak (1 means "success", 0 means "failure").
    • [Note] The RoBERTa model is an improved version of BERT, which is an NLP model.

 4. Experiment

1. Experimental preparation

  • 【data set】
    • question
      • 100 questions were collected from two open datasets and cover a wide range of prohibited scenarios, such as illegal or unethical activities, discriminatory or harmful content.
    • Jailbreak template
      • For the initial jailbreak template, 77 suitable templates were used.
    • A detailed description of the data set and initial jailbreak template can be found in Appendix C of the original paper.
  • 【Judgment Model】
    • It is known that this article uses alocally fine-tuned RoBERTa model as the judgment model.
    • Training model
      • To optimize the model, ChatGPT was first queried by combining all initial jailbreak templates and questions, resulting in 7700 responses (77 jailbreak prompts × 100 questions = 7700 responses). These responses are then manually tagged. The labeled responses are then divided into 80% training set and 20% validation set, while ensuring that the training and validation sets do not contain responses to the same questions.
    • Evaluate
      • Compare the trained RoBERTa model with the following four methods.
        • Rule Match: Use a rule-based method to evaluate whether the jailbreak is successful.
        • Moderation: Use OpenAI’s API to evaluate whether the response content complies with OpenAI’s usage policy.
        • ChatGPT: Use the ChatGPT model (gpt-3.5-turbo-0613) to determine whether the jailbreak is successful.
        • GPT-4: Use GPT4 to determine whether the jailbreak is successful.
      • evaluation result
      • The results show that using the RoBERTa model as the judgment model has high accuracy and true positive rate.
  • 【Mutation Model】
    • Considering the need to strike a balance between mutation performance and computational cost, this article selected ChatGPT as the mutation model.
    • To promote mutational diversity, we set the temperature parameter to 1.0. It is important to emphasize that setting the temperature parameter to a value greater than 0 ensures that the model's response is sampled rather than as deterministic output, which enhances the diversity of generated mutations.
    • [Note] In machine learning and natural language processing, temperature parameters are often used to control the diversity of generative model outputs, techniques used to adjust the degree of certainty of the model in generating text or predictions. In language generation models, such as GPT, the temperature parameter is an important hyperparameter. It affects the degree of randomness when the model generates the next word or token. Specifically, the higher the value of the temperature parameter, the more diverse the text generated by the model.
  • 【index】
    • In order to evaluate the effectiveness of the fuzz testing method, Attack Success Rate (ASR) is used as the main indicator.
    • ASR comes in two forms:
      • Top-1 ASR: The best performing jailbreak template.
      • Top-5 ASR: Top 5 jailbreak templates for performance.
  • 【environment】
    • Ubuntu 18.04.5
    • Python 3.8.17
    • MISCELLANEOUS 12.2
    • PyTorch 2.1.0
    • transformers 4.32.0

2. Pre-analysis

  • Analyze the jailbreaking effect of manually written jailbreak templates on the model.
  • Query ChatGPT, Llama-2-7B-Chat and Vicuna-7B using 77 human-written jailbreak templates and 100 questions.
  • The following indicators have been added:
    • Jailbroken Questions: The number of questions in which the jailbreak was successful (at least one template was successful).
    • Average Successful Templates: The average number of successful jailbreaks.
    • Invalid Templates: The number of templates that failed to jailbreak for all issues.
  • The following are the experimental results:
  • The results demonstrate the effectiveness of manually written jailbreak templates, proving that they can be used as initial seeds for fuzz testing.

3. Single model jailbreak

3.1. Single-question

  • In order to further evaluate the ability of GPTFUZZER, first only target Llama-2-7B-Chat, focusing on 46 (100-54)A problem that is still resistant to human-written templates.
  • For each of these questions, we set a query limit of 500 for Llama-2-7B-Chat. Once a template successfully jailbreaks the issue, the fuzzing process stops. If the query limit is exhausted without success, the attempt is marked as failed.
  • Experiment with different initial seed (jailbreak template) selection strategies:
    • All (All): Manually written77 jailbreak templates are all used as initial seeds.
    • Valid: Ignore the invalid templates (Invalid Templates) and use the remainingvalid templates (77-47=30 ) as the initial seed.
    • Invalid: Useinvalid templates (47) as the initial seed.
    • Top-5: Use the ASR top five templates as the initial seed.
  • The following are the experimental results:
    • The results show that the invalid template may still be successfully jailbroken after being mutated by GPTFuzzer.
  • Based on the above experiments, the first problem was solved.
    • Is the attack performance of the jailbreak template made by GPTFuzzer better than that made by hand?
      • GPTFuzzer demonstrated the ability to make jailbreak templates, even when all human-written templates failed, the templates produced by GPTFuzzer can successfully jailbreak.

3.2. Many problems

  • Next, we evaluate GPTFuzzer's ability to generate generic templates, with the goal of generating templates that can successfully jailbreak many issues for a specific model.
  • In this experiment, the initial seed of GPTFuzzer was also divided in the same way as the above experiment. First, operate on 20 questions, with a total query budget of 10,000 times. In each iteration, a new jailbreak template is generated. When this template is combined with target questions, 20 different prompts are produced. Normalize the scoresof the responses obtained by these 20 prompts to [0,1], and then sum them up to get the final score of the template. If the final score exceeds 0, the template is included in the seed pool.
    • Note: Normalization is a mathematical method used to scale a set of values ​​so that they fall into a specific range.
  • After exhausting the query budget, the top 5 jailbreak templates with the highest ASR on these 20 questions were identified, and then their effectiveness was evaluated on 80 previously untested questions.
  • The following are the experimental results:
    • The performance difference between the top 5 generated by the invalid template and the top 5 generated by other templates is not significant at 95%, probably due to sufficient query budget allocated in a multi-question setting.
    • The ASR-first template generated by the initial seed of the top five ASR jailbreak templates is 55% on the test set, which shows that the most effective single template generated in fuzz testing can crack more than half of the problems in the test set. This emphasizes the generality of the templates generated by GPTFuzzer to unknown problems. In addition, the generated ASR top five templates are close to 100% on the test set, indicating that GPTFuzzer is capable of producing efficient and versatile templates.
  • In order to further investigate the hypothesis about the possibility of invalid seeds, additional experiments were also conducted on ChatGPT and Vicuna-7B, that is, using invalid seeds to perform the operation of the previous experiment.
  • The experimental results are as follows:
    •  
    • For ChatGPT, there are 3 invalid templates, that is, these 3 manually written seeds cannot jailbreak any problems. According to this initial seed, the ASR-first template generated by GPTFuzzer reaches 100% on both the training set and the test set.
    • For Vicuna-7B, there is only one invalid template, which greatly limits seed selection. However, even under such restrictions, the top ASR template generated by GPTFuzzer remains around 40% on the test set, and the top 5 ASR templates exceed 65% on the test set. This result strongly supports our hypothesis.
  • Based on the above two experiments, the second and fourth questions were solved.
    • Can GPTFuzzer generate universal jailbreak templates suitable for various problems?
      • By fuzzing different problems, GPTFuzzer demonstrated its ability to generate universal jailbreak templates that retain their effectiveness even when applied to previously unseen problems.
    • What factors will significantly affect the attack performance of the jailbreak template produced by GPTFuzzer?
      • The selection of initial seeds plays a crucial role in the fuzzing process. Choosing the right initial seed can significantly improve the efficiency of fuzz testing and produce more effective templates. However, GPTFuzzer demonstrates its resilience and remains highly effective even when limited by the quality or quantity of available initial seeds.

 4. Multi-model jailbreak

  • Next, it is necessary to evaluate the ability of the templates produced by GPTFuzzer to solve different problems in different models, including open source models and commercial models.
  • Initially, fuzz testing was only conducted on ChatGPT, lama-2-7b-chat, and Vicuna-7b, using 20 questions and a 30,000 query limit. In each iteration, the generated new template is applied to all questions and fed into three models, resulting in 60 (3*20) responses. The template score is aggregated and normalized from these 60 responses. The template is added to the seed pool only when at least one problem is solved in three models at the same time. This approach ensures that newly added seeds can generalize to different models. After exhausting the query budget, the top 5 generated templates are selected based on the scores calculated during fuzz testing. The first and top 5 ASRs are then evaluated among the remaining 80 questions.
  • Based on the above method, the attack performance on several popular chat models is also evaluated: Vicuna-13B, Baichuan-13B-Chat, ChatGLM2-6B, Llama-2-13B-Chat, Llama-2-70B-Chat, GPT- 4. Bard, Claude2 and PaLM2.
  • For comparison, the following methods are used as benchmarks:
    • No Attack: Directly query the target LLMs with questions without any attack.
    • Human-Written: Based on the previous pre-analysis, the top 5 human-written ones that can target the most problems with vicune-13b, ChatGPT and Llama-2-7B-Chat jailbreaks were selected jailbreak template.
    • CDC: A white-box attack method. Four runs were performed with different seeds, resulting in four different prefixes. These four prefixes are then concatenated to generate a fifth prefix.
    • Here Is: Preface the question with the phrase "Sure, Here's."
  • The experimental results are as follows:
    • It can first be observed that GPTFuzzer consistently outperforms other benchmarks among all LLMs.
    • For open source LLMs, the top five ASR templates generated by GPTFuzzer achieved 100% attack rate on Vicuna-7B, Vicuna-13B and Baichuan-13B, and more than 90% on ChatGLM2-6B. And the generated ASR-first jailbreak template also has a high attack rate against them, which shows that a single template can effectively attack these models.
    • For commercial LLMs, the top 5 ASR templates generated by GPTFuzzer achieved 100% attack rate on ChatGPT, over 96% on PaLM2, and over 90% on Claude2. It is still higher than 60% for Bard and GPT-4.
  • Based on the above experiments, the third question was solved.
    • Can the universal jailbreak template made by GPTFuzzer effectively target many different models?
      • By deploying GPTFuzzer across a variety of models and problems, we demonstrate that it can produce highly versatile templates. These templates can effectively target unseen models and problems and achieve high average attack rates.

5. Moral considerations

  • This research reveals adversarial templates capable of generating harmful content in open source and commercial LLMs. While there are inherent risks with such disclosures, we firmly believe in the need for complete transparency.
  • To minimize the potential for misuse of the research in this article, several precautions were taken:
    • Awareness: This abstract contains a clear warning highlighting the potential for our work to be used to generate harmful content.
    • Ethical Clearance: Before beginning this study, we sought guidance from an Institutional Review Board (IRB) to ensure that our work met ethical standards. Their feedback confirmed that our study did not involve human subjects and did not require review board approval.
    • Pre-publication Disclosure: We disclosed our findings to the organizations involved in the large closed LLMs evaluated in this article, ensuring they were informed before the results were made public.
    • Controlled Release: This article does not publicly release the adversarial jailbreak template used in the experiment.

6. Summary

  • In this study, we drew inspiration from the existing AFL framework and introduced an innovative black-box jailbreak fuzz testing framework GPTFuzzer. Our experimental results demonstrate GPTFuzzer's ability and effectiveness in generating these templates, even when using human-written templates of varying quality. Not only highlights the robustness of GPTFuzzer, but also the potential vulnerabilities of current LLMs.

Guess you like

Origin blog.csdn.net/weixin_45100742/article/details/134413184