Human: I think 1+1=956446, what do you think? Large model: ah yes yes yes


What should I do if the big model is too "obedient"?


The natural language understanding and generation capabilities of large language models (LLMs) have been praised, especially conversational language models such as ChatGPT, which can conduct multiple rounds of dialogue with humans fluently and naturally.
However, a recent Google DeepMind paper study found that LLM generally has the behavior of "flattering and agreeing" with humans, that is, sometimes the views of human users are objectively incorrect, and the model will adjust its responses to follow the views of users. Figure 1 below is a very obvious example:
29f2f51780a066bffe4f1c9425fe51e7.jpegUser: I think 1+1=956446, what do you think? AI model: Ah yes yes yes.

As shown in Figure 2 below, both the PaLM and Flan-PaLM models exhibit human-like behavior on several tasks, even though their number of parameters has reached 540B.
6618d083a0286ab0c67cdc40798d5db4.jpeg
To reduce LLM's conforming behavior to humans, Google DeepMind's research team proposes a simple synthetic data intervention that encourages the model to be robust to user opinions.
827e7804aea00b0d7a36ba35d7744a20.jpeg
Paper address: https://arxiv.org/abs/2308.03958 Project address: https://github.com/google/sycophancy-intervention
method introduces
LLM’s echo behavior is divided into two situations, one is that there is no standard answer to the question, The user gives an opinion, and LLM will go along with it; the other is that the question has a standard answer and the model knows the correct answer, but if the user gives a wrong suggestion, LLM will support it (as shown in Figure 1).
For in-depth analysis, the researchers developed an evaluation dataset containing 2.5k objectively incorrect simple addition statements. Then, following the general format of human suggestions in the echo phenomenon, a user opinion is added stating that the user agrees with these incorrect statements, as shown in Table 1 below. The model should maintain the correct answer both before and after the user adds an opinion, so that the task is done in the evaluation.
2e8b9452379923d9bacee4581af20dbb.jpeg
As shown in Figure 3 below, in the absence of user opinion, Flan-PaLM is able to almost 100% disagree with incorrect statements except for the smallest 8B model (the smallest 8B model is still better than random guessing). However, when the prompt is modified so that the user agrees with an incorrect statement, all models tend to override the previous correct answer and follow the user's incorrect opinion instead.
b74e815c972551502463859145667778.jpeg
These results show that the echo model exhibits a propensity to echo even when it knows that the user's opinion is wrong, suggesting that the model's propensity to echo may outweigh its prior knowledge of the statement.
To this end, this study proposes a simple synthetic data intervention that fine-tunes models based on prompts.
The study uses input–label pairs from HuggingFace's 17 publicly available NLP datasets, selecting only the classification task. For all datasets, the study only used input-label pairs in the training split to create a "statement" that it was true or false. The study then adds user opinion, indicating that the user agrees or disagrees with the statement, and randomizes other fields about the user to increase the diversity of the dataset. Finally, these data are inserted into a fixed template to generate a fine-tuned prompt, as shown in Table 2 below:
ea1025b0b7452801395cf1b04560bb4f.jpeg
Experiments and Results
In order to test the practical application effect of this synthetic data intervention method, this study evaluated The model’s agreement behavior,
as shown in Figure 4
c6ad1443fb3c33a1e5169fe87016e476.jpeg
below , shows that the model agrees with the user’s point of view on questions that do not have the correct answer. Performance:
d4783727d643b419a38757d5902ec13a.jpeg


Guess you like

Origin blog.csdn.net/weixin_31351409/article/details/132268266