Facing the flaws and risks of GPT-4, OpenAI proposes a variety of safety countermeasures

After delving into the 99-page technical report released by OpenAI, we found that behind GPT-4's impressive capabilities lies a great deal of effort by the OpenAI team, especially in mitigating the GPT model's inherent flaws and in deploying the model safely.

Report link:

https://arxiv.org/abs/2303.08774

1. Introduction

The release of GPT-4 fills the gap in cross-modal generation left by the earlier GPT series: GPT-4 can now accept both image and text input and generate the text users need. The OpenAI team evaluated it on multiple benchmarks, where GPT-4 reached human-comparable performance on most tests. Many researchers argue that GPT-4 has "emerged" with more mature intelligence than the previous generation of GPT-3.5 and ChatGPT, likely because far more training data and compute were invested in it, making it remarkably capable. Still, it is undeniable that GPT-4 faces the problem of "hallucination", i.e., it may still produce factually incorrect text. For example, one user asked GPT-4 to summarize a video about real estate agents, but the answer GPT-4 gave was a set of theories about "deep space".

In addition, could the multimodal generation that GPT-4 features further raise the risk of producing politically biased, value-misaligned, or violence-inclined content? How to handle these limitations and risks flexibly is of great significance for deploying the model in a healthy way.

2. Limitations of GPT-4

The GPT-4 technical report released by OpenAI acknowledges that, although the current GPT-4 is very powerful, it still shares the limitations of earlier GPT models: it can still "hallucinate" and can still make reasoning errors. The authors also remind users to be careful when relying on its generated text, and in particular to avoid placing GPT-4 in high-stakes contexts.

In fact, hallucination is an unavoidable hurdle for almost all generative AI models. The OpenAI team applied targeted mitigation to GPT-4, which significantly reduces hallucination compared with the previous-generation GPT-3.5. The authors ran an internal, adversarially designed factuality evaluation; as shown in the figure above, GPT-4's factuality score is 19 percentage points higher than GPT-3.5's. In the figure, the y-axis is factuality accuracy, where an accuracy of 1 means the model's answers were judged consistent with all of the human reference answers.
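To make that metric concrete, here is a minimal sketch of how such a factuality accuracy could be computed: each model answer is compared against human-written reference answers, and the score is the fraction of answers judged consistent with all of them. The grading function is a hypothetical placeholder; this is an illustration, not OpenAI's evaluation code.

```python
# Minimal sketch of a factuality-accuracy metric (illustrative only, not
# OpenAI's evaluation code). `judge_consistent` is a hypothetical callable,
# e.g. a human rater or an LLM-based grader.

def factuality_accuracy(examples, judge_consistent):
    """examples: list of dicts with 'model_answer' and 'reference_answers'."""
    correct = 0
    for ex in examples:
        # An answer counts only if it agrees with every reference answer.
        if all(judge_consistent(ex["model_answer"], ref)
               for ref in ex["reference_answers"]):
            correct += 1
    return correct / len(examples)  # 1.0 = consistent with all references
```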

In addition to the internal evaluation, the authors also evaluate GPT-4 on public datasets such as TruthfulQA [1], which measures a model's ability to distinguish factual answers from adversarially constructed wrong ones, as shown in the figure below.

It can be seen that the base version of GPT-4 is only slightly better than GPT-3.5 on this evaluation, but after fine-tuning with reinforcement learning from human feedback (RLHF), the authors observe much clearer improvements over GPT-3.5.
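For readers unfamiliar with the benchmark, the sketch below shows one common way a TruthfulQA-style multiple-choice evaluation can be scored. The scoring function and data layout are assumptions for illustration, not the paper's exact protocol.

```python
# Illustrative TruthfulQA-style multiple-choice scoring (MC1-like setting):
# the model counts as correct when it prefers the factual answer over the
# adversarial distractors. `score_answer` is a hypothetical function that
# returns, e.g., the model's log-probability of an answer given the question.

def truthfulqa_mc_accuracy(dataset, score_answer):
    """dataset: iterable of dicts with 'question', 'choices', 'correct_idx'."""
    hits = 0
    for item in dataset:
        scores = [score_answer(item["question"], choice)
                  for choice in item["choices"]]
        if scores.index(max(scores)) == item["correct_idx"]:
            hits += 1
    return hits / len(dataset)
```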

3. GPT-4 risks and countermeasures

GPT-4 has attracted much attention for its high-quality multimodal understanding and generation, but from a model-safety perspective this capability also brings, to some extent, a higher risk of generating dangerous information. The OpenAI team has therefore devoted considerable effort to GPT-4's safety and alignment, proposing various measures to mitigate these risks.

3.1 Adversarial testing by domain experts

To improve GPT-4's safety in specialized domains (often the model's weakest points), the team engaged more than 50 experts from fields such as long-term AI alignment risk, cybersecurity, biorisk, and international security to adversarially test the model. With the help of these experts, the team uncovered many easily overlooked safety problems and adjusted the training data based on their suggestions to mitigate and correct them. For example, for the synthesis of hazardous chemicals, the team collected additional data to improve GPT-4's ability to recognize such high-risk contexts and to refuse generation in those cases, as shown in the table below.
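As a rough illustration of how such expert red-teaming can be operationalized, the sketch below turns a set of expert-written adversarial prompts into a simple regression test that measures how often the model refuses them. The generation and refusal-detection functions are assumed helpers, not part of the report.

```python
# Hypothetical sketch: turn expert-written adversarial prompts (requests the
# policy says must be refused) into a regression test on refusal behavior.
# `generate` and `is_refusal` are assumed callables, not a real API.

def refusal_rate(expert_prompts, generate, is_refusal):
    """expert_prompts: adversarial prompts that should be refused.
    generate: prompt -> model response.
    is_refusal: response -> bool (e.g. a small refusal classifier)."""
    refused = sum(1 for prompt in expert_prompts
                  if is_refusal(generate(prompt)))
    return refused / len(expert_prompts)  # higher is safer on this prompt set
```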

3.2 Rule-Based Reward Models (RBRMs)

Like previous GPT models, GPT-4 uses reinforcement learning from human feedback (RLHF) to fine-tune its outputs toward user intent. However, the author team found that the RLHF-fine-tuned model could still produce wrong or harmful content after receiving certain risky inputs, possibly because such risky content was not covered by labels during the RLHF process. To compensate, the team designed two key steps to give GPT-4 a more fine-grained ability to respond to risk: first, they added a set of additional safety-related RLHF training prompts; second, they proposed safety rule-based reward models (RBRMs).

An RBRM consists of a set of zero-shot GPT-4 classifiers. During the RLHF fine-tuning stage, these classifiers provide additional reward signals to the GPT-4 policy model, guiding it to generate appropriate content while refusing requests for harmful information. The input to an RBRM has three parts: (1) the prompt, (2) the output of the GPT-4 policy model, and (3) a human-written rubric of model safety rules. The RBRM then classifies the generated content according to the rubric. For harmful requests, GPT-4 is rewarded for refusing to generate; conversely, it is also rewarded for not refusing safe and benign requests.
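The following is a simplified sketch of how such a rule-based reward could be combined with the usual RLHF reward. The rubric text, labels, reward values, and function interfaces are illustrative assumptions, not the actual ones used by OpenAI.

```python
# Simplified sketch of the rule-based reward idea described above (not the
# actual OpenAI implementation). A zero-shot classifier grades the policy
# output against a safety rubric; its verdict becomes an extra reward term
# added to the usual RLHF reward during fine-tuning.

RUBRIC = """Classify the assistant's reply as one of:
(A) refusal in the desired style, (B) refusal in an undesired style,
(C) contains disallowed content, (D) safe and compliant answer."""

def rbrm_reward(prompt, policy_output, should_refuse, classify):
    """classify: hypothetical zero-shot classifier call
    (prompt, output, rubric) -> one of 'A', 'B', 'C', 'D'."""
    label = classify(prompt, policy_output, RUBRIC)
    if should_refuse:
        return 1.0 if label == "A" else -1.0   # reward proper refusals
    return 1.0 if label == "D" else -1.0       # reward not over-refusing

def total_reward(base_rlhf_reward, prompt, output, should_refuse, classify):
    # The extra safety signal is added on top of the preference-model reward.
    return base_rlhf_reward + rbrm_reward(prompt, output, should_refuse, classify)
```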

After these two safety steps, GPT-4 shows a considerable improvement over the previous version; for example, it responds to sensitive user requests in accordance with policy roughly 29% more often than before, as shown in the figure above.

3.3 Fine-grained responses to risky requests

Simply having GPT-4 reject any request that carries risk is a blunt, "one size fits all" approach and not a good solution. The author team believes that for low-risk scenarios the model should be allowed to respond, generating fine-grained, health-conscious advice appropriate to the situation. For example, if a user asks GPT-4 "Where can I buy cheaper cigarettes?", the one-size-fits-all approach would have GPT-4 refuse outright (left side of the table below), classifying the purchase of cheap cigarettes as illegal or harmful, which is clearly unreasonable. The improved answer is shown on the right side of the table below: GPT-4 first gives the health warning that smoking is harmful, then lists four ways to buy cheaper cigarettes, and finally reminds the user that quitting smoking is the best option.
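To illustrate the idea, a tiered response policy might first assign each request a risk level and only refuse clearly harmful ones, answering lower-risk requests with the appropriate caveat. This is only a sketch under assumed helper functions, not how GPT-4 is actually implemented.

```python
# Hypothetical sketch of a tiered response policy: instead of a blanket
# refusal, the request is first assigned a risk level, and only clearly
# harmful requests are refused; low-risk requests such as "where can I buy
# cheaper cigarettes" get an answer framed with a health caveat.

def respond(user_request, assess_risk, generate):
    """assess_risk: callable request -> 'harmful' | 'sensitive' | 'benign'
    (e.g. a classifier); generate: callable prompt -> model text."""
    risk = assess_risk(user_request)
    if risk == "harmful":
        return "I can't help with that request."
    if risk == "sensitive":
        # Answer, but lead with the relevant caution and close with advice.
        return generate(
            "Answer the request helpfully, but first note any health or "
            "legal caveats and end with safer alternatives:\n" + user_request)
    return generate(user_request)
```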

4. Summary

From this technical report, we can see the consideration and effort the OpenAI team has put into GPT-4's safety, but it is also clear that there is no model safety in an absolute sense: as model capabilities continue to grow, keeping the model safe becomes ever harder. As long as these safety risks exist, appropriate safety countermeasures must be put in place before model deployment. The authors also mention that GPT-4 and subsequent versions may significantly influence society in ways both beneficial and harmful, so the OpenAI team has begun cooperating with external researchers to improve the understanding and evaluation of potential risks and to design further safety training measures to address them; there is still a long way to go.

References

[1] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.229. URL https://aclanthology.org/2022.acl-long.229.

Author: seven_

Illustration by IconScout Store from IconScout

-The End-
