The big bug in large models: accuracy near zero, and from GPT to Llama none is spared

Reprinted from: Heart of the Machine

Logical reasoning in large models? Nonexistent.

Researchers had GPT-3 and Llama learn a simple piece of knowledge, "A is B", and then asked in the reverse direction what B is. It turned out that the accuracy of the models' answers was zero.

What is going on here?

Recently, a new concept called the "Reversal Curse" has become a hot topic in the AI community, and all of today's popular large language models are affected by it. When faced with extremely simple questions, their accuracy is not only close to zero, there also seems to be no prospect of improving it.

Moreover, the researchers found that this major bug has nothing to do with the size of the model or the questions being asked.

We thought that AI, having reached the stage of pre-trained large models, had finally seemed to master some logical thinking; this time, it appears to have been knocked back to its original form.


Figure 1: Knowledge inconsistency in GPT-4. GPT-4 correctly gives the name of Tom Cruise's mother (left). However, when given the mother's name and asked for her son, it cannot retrieve "Tom Cruise" (right). The new research hypothesizes that this ordering effect is due to the Reversal Curse: a model trained on "A is B" will not automatically infer "B is A".

If a person knows the fact that "Olaf Scholz was the ninth Chancellor of the Federal Republic of Germany", they can correctly answer the question "Who was the ninth Chancellor of Germany?" This is a basic form of generalization that seems unremarkable.

However, research shows that the autoregressive language models currently popular in the field of AI cannot generalize in this way. In particular, suppose the model's training set contains sentences such as "Olaf Scholz was the ninth Chancellor of Germany", where the name "Olaf Scholz" precedes the description "the ninth Chancellor of Germany". The large model may then learn to correctly answer the question "Who is Olaf Scholz?" (answer: the ninth Chancellor of Germany). But it will not be able to answer "Who was the ninth Chancellor of Germany?" or any other prompt in which the description precedes the name.

This is an example of the ordering effect the authors call the "Reversal Curse". If a model is trained on sentences of the form "<name> is <description>" (with the description following the name), it will not automatically predict the opposite direction, "<description> is <name>". In particular, if a large language model (LLM) is conditioned on <description>, the likelihood it assigns to <name> will be no higher than a random baseline.
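Stated in probabilities, this claim amounts to the following (our own compact restatement; the notation is not from the paper):

```latex
% After training on documents of the form "<name> is <description>":
\[
p_\theta(\text{description} \mid \text{name}) \ \text{becomes high,}
\qquad\text{yet}\qquad
p_\theta(\text{name} \mid \text{description}) \ \approx\ p_\theta(\text{random name} \mid \text{description}).
\]
```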

So does reasoning in large models actually not exist? One view is that the Reversal Curse demonstrates a fundamental failure of logical deduction in the LLM training process. If "A is B" (or, equivalently, "A = B") is true, then "B is A" follows logically from the symmetry of the identity relation. Traditional knowledge graphs respect this symmetry (Speer et al., 2017). The Reversal Curse shows a basic failure to generalize beyond the training data. Moreover, this failure is not explained by the LLM failing to understand logical deduction: an LLM such as GPT-4 can perfectly well infer "B is A" if it is given "A is B" in its context window.

While it is useful to relate the Reversal Curse to logical deduction, this is a simplification of the full picture. We cannot yet directly test whether a large model has deduced "B is A" after being trained on "A is B". Large models are trained to predict the next word a human would write, rather than what is actually true. Therefore, even if an LLM did infer "B is A", it might not "tell us" when prompted.

However, the Reversal Curse does demonstrate a failure of meta-learning. Sentences of the form "<description> is <name>" and "<name> is <description>" often co-occur in pre-training datasets: if the former appears in a dataset, the latter is more likely to appear as well, because humans often reverse the order of elements in a sentence or paragraph. A good meta-learner would therefore increase the probability of "<description> is <name>" instances after being trained on "<name> is <description>". In this sense, autoregressive LLMs are not good meta-learners.

The Reversal Curse has attracted the attention of many AI researchers. Some joked that it now looks like the prospect of AI destroying humanity is just a fantasy.


Others say this means that training data and the content of the context play a crucial role in how a model's knowledge generalizes.

Andrej Karpathy, the well-known OpenAI scientist, commented that the knowledge learned by LLMs seems much more "patchy" than you or I would imagine, and that he still lacks a good intuition for it: the models learn a fact within the particular "direction" of the context window in which it appeared, and may not generalize when asked about it in other directions. It is an odd kind of partial generalization, and to him the "Reversal Curse" looks like a special case of it.


The research that sparked the discussion comes from Vanderbilt University, New York University, the University of Oxford, and other institutions, in the paper "The Reversal Curse: LLMs trained on 'A is B' fail to learn 'B is A'":


  • Paper link: https://arxiv.org/abs/2309.12288

  • GitHub link: https://github.com/lukasberglund/reversal_curse

If the name and description are reversed, the large model will be confused.

The paper demonstrates that LLMs suffer from the Reversal Curse through a series of fine-tuning experiments on synthetic data. As shown in Figure 2, the researchers first fine-tuned a model on sentences of the pattern <name> is <description> (for example, "Daphne Barrington is the director of 'Time Travel'"). The results show that when the test prompt follows the same <name> is <description> pattern, the model gives accurate answers, but for prompts in the other direction, such as "Who directed 'Time Travel'?", the model answers incorrectly.

[Figure 2]
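To make the setup concrete, here is a minimal sketch of what such a training document and its reversed-order test query look like (the serialization format and strings are our own illustration, not the paper's released dataset):

```python
import json

# One fictitious NameToDescription fact: the name appears before the description in training.
name = "Daphne Barrington"
description = "the director of 'Time Travel'"

# Fine-tuning document in the same order the model is trained on.
train_example = {"prompt": f"{name} is", "completion": f" {description}."}

# At test time the order is reversed: the description is given and the name is asked for.
reversed_test = {"prompt": f"Who is {description}? The answer is", "target": name}

with open("reversal_curse_toy.jsonl", "w") as f:
    f.write(json.dumps(train_example) + "\n")

print(reversed_test)
```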

In fact, as shown in Figure 4 (see the experiments section), the log probability the model assigns to the correct name is similar to that of a random name. Furthermore, when the test order changes from <name> is <description> to <description> is <name>, the error rate increases.

To help the model avoid the Reversal Curse, the researchers tried the following (a toy sketch of this kind of augmented training data follows the list):

  • Trying models of different families and sizes;

  • Including both the <name> is <description> pattern and the <description> is <name> pattern in the fine-tuning dataset;

  • Including multiple paraphrases of each <name> is <description> statement, which is known to help with generalization;

  • Changing the data format from <name> is <description> to <question>? <answer>.
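For example, an augmentation along the lines of the second and third bullets might look like this toy sketch (the sentences are invented for illustration; this is not the paper's augmentation code):

```python
# Toy augmentation sketch: include both orderings and several paraphrases of one fact.
name = "Daphne Barrington"
description = "the director of 'Time Travel'"

augmented_docs = [
    f"{name} is {description}.",                        # <name> is <description>
    f"The person who is {description} is {name}.",      # <description> is <name>
    f"Everyone knows that {name} is {description}.",    # paraphrase, name first
    f"As {description}, {name} became widely known.",   # paraphrase, name first
]

for doc in augmented_docs:
    print(doc)
```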

After this series of experiments, they provide preliminary evidence that the Reversal Curse affects the generalization ability of state-of-the-art models (Figure 1 and Part B). They tested GPT-4 on 1,000 question pairs such as "Who is Tom Cruise's mother?" and "Who is Mary Lee Pfeiffer's son?" It turns out that in most cases the model correctly answered the first kind of question ("Who is <celebrity>'s parent?") but not the second. The paper hypothesizes that this is because the pre-training data contains fewer examples in which the parent appears before the celebrity (for example, "Mary Lee Pfeiffer's son is Tom Cruise").

Experiments and results

This paper aims to test whether an autoregressive language model (LLM) that learns "A is B" during training can generalize to the opposite form "B is A".

In the first experiment, the researchers created a dataset consisting of documents of the form <name> is <description> (or the reverse), where the names and descriptions are fictitious. The study used GPT-4 to generate the pairs of names and descriptions. These pairs were then randomly assigned to three subsets: NameToDescription, DescriptionToName, and both. The first two subsets are illustrated in Figure 3.

[Figure 3]
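A rough sketch of that subset assignment (the pairs here are invented placeholders; the paper generates them with GPT-4 and releases the actual data in the GitHub repo linked above):

```python
import random

# Placeholder (name, description) pairs standing in for the GPT-4-generated ones.
pairs = [
    ("Daphne Barrington", "the director of 'Time Travel'"),
    ("Uriah Hawthorne", "the composer of 'Abyssal Melodies'"),
    # ... more fictitious pairs ...
]

random.seed(0)
subsets = {"NameToDescription": [], "DescriptionToName": [], "Both": []}
for pair in pairs:
    subsets[random.choice(list(subsets))].append(pair)

# NameToDescription facts are only seen as "<name> is <description>" during fine-tuning,
# DescriptionToName facts only as "<description> is <name>", and "Both" facts in both orders.
for subset_name, items in subsets.items():
    print(subset_name, items)
```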

Results. In the exact-match evaluation, GPT-3-175B achieves good exact-match accuracy when the order of the test question matches the training data. The results are shown in Table 1.

Specifically, for DescriptionToName facts (e.g., "The composer of Abyssal Melodies is Uriah Hawthorne"), the model achieves 96.7% accuracy in retrieving the name when given a prompt that contains the description (e.g., "Who is the composer of Abyssal Melodies?"). For the facts in NameToDescription, accuracy is lower, at 50.0%. In contrast, when the order does not match the training data, the model fails to generalize at all and accuracy is close to 0%.

[Table 1]
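The exact-match metric itself is simple; here is a minimal sketch of the idea (our own illustration, not the paper's evaluation code):

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of completions that start with the target string after trimming whitespace."""
    correct = sum(
        pred.strip().startswith(target.strip())
        for pred, target in zip(predictions, targets)
    )
    return correct / len(targets)

# Toy usage: reversed-order prompts essentially never yield the right name.
targets = ["Uriah Hawthorne", "Daphne Barrington"]
preds_matching_order = ["Uriah Hawthorne", "Daphne Barrington"]
preds_reversed_order = ["a famous composer", "an unknown person"]
print(exact_match_accuracy(preds_matching_order, targets))  # 1.0
print(exact_match_accuracy(preds_reversed_order, targets))  # 0.0
```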

The paper also ran the same experiments on other models, including GPT-3-350M (Appendix A.2) and Llama-7B (Appendix A.4). The results show that these models all suffer from the Reversal Curse.

In the increased-likelihood evaluation, there was no detectable difference between the log probabilities assigned to the correct name and to a random name. The average log probabilities for the GPT-3 models are shown in Figure 4. Both t-tests and Kolmogorov-Smirnov tests failed to detect a statistically significant difference.


Figure 4: In Experiment 1, the model fails to increase the probability of the correct name when the order is reversed. The graph shows the average log probability of the correct name (relative to a random name) when the model is queried with the relevant description.
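For intuition, the kind of statistical check described above might look like the following sketch, using hypothetical per-example log probabilities rather than the paper's actual measurements:

```python
import numpy as np
from scipy import stats

# Hypothetical per-example log probabilities of the name given the reversed-order prompt,
# for the correct name vs. a random name.
logprob_correct_name = np.array([-8.1, -7.9, -8.4, -8.0, -8.2])
logprob_random_name = np.array([-8.0, -8.3, -8.1, -7.8, -8.5])

# If the model had learned the reversed fact, the correct-name log probabilities should be
# systematically higher; here neither test detects a difference, mirroring the paper's finding.
t_stat, t_p = stats.ttest_ind(logprob_correct_name, logprob_random_name)
ks_stat, ks_p = stats.ks_2samp(logprob_correct_name, logprob_random_name)
print(f"t-test p = {t_p:.3f}, Kolmogorov-Smirnov p = {ks_p:.3f}")
```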

Next, the study conducted a second experiment.

In this experiment, the researchers tested models on facts about real celebrities and their parents, in the forms "A's parent is B" and "B's child is A". The study collected a list of the 1,000 most popular celebrities from IMDB (2023) and used GPT-4 (via the OpenAI API) to find the celebrities' parents from their names. GPT-4 was able to identify a celebrity's parent 79% of the time.

Afterwards, for each child-parent pair, the study queried for the child given the parent. Here, GPT-4's success rate was only 33%. Figure 1 illustrates this phenomenon: GPT-4 can identify Mary Lee Pfeiffer as Tom Cruise's mother, but cannot identify Tom Cruise as Mary Lee Pfeiffer's son.
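Checking a single pair in both directions is straightforward; below is a hedged sketch using the official OpenAI Python client (the model name, prompt wording, and sampling settings are illustrative assumptions, not the paper's exact protocol):

```python
from openai import OpenAI  # official OpenAI Python client (v1.x)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(question: str) -> str:
    """Single-turn chat query at temperature 1."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model choice
        messages=[{"role": "user", "content": question}],
        temperature=1,
    )
    return response.choices[0].message.content


child, parent = "Tom Cruise", "Mary Lee Pfeiffer"
print(ask(f"Who is {child}'s mother?"))  # usually names the parent correctly
print(ask(f"Who is {parent}'s son?"))    # much more often fails to name the child
```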

Additionally, the study evaluated the Llama-1 family of models without any fine-tuning. It found that all of the models were much better at identifying the parent than the child; see Figure 5.


Figure 5: The ordering effect for parent and child questions in Experiment 2. The blue bars (left) show the probability that the model returns the correct parent when queried with the celebrity child; the red bars (right) show the probability that it returns the correct child when queried with the parent instead. For the Llama-1 models, accuracy is the model's likelihood of the correct completion. For GPT-3.5-turbo, accuracy is the average over 10 samples per child-parent pair, sampled at temperature = 1. Note: GPT-4 is omitted from the figure because it was used to generate the list of child-parent pairs, so by construction its accuracy on "parent" is 100%; GPT-4 scores 28% on "child".

Future outlook

How should the Reversal Curse in LLMs be explained? This may have to await further research. For now, the researchers offer only a brief sketch of an explanation. When the model is updated on "A is B", the gradient update may slightly change the representation of A to include information about B (for example, in an intermediate MLP layer). It would also be reasonable for this update to change the representation of B to include information about A; however, the gradient update is short-sighted: it depends on the log probability of B given A, and is not required to improve the prediction of A given B in the future.
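In symbols, this sketch amounts to the observation that the next-token objective only touches the conditional it actually sees (our own hedged restatement, not a formula from the paper):

```latex
% One gradient step on the document "A is B" minimizes the next-token loss
% L(\theta) = -\log p_\theta(B \mid A), so it increases p_\theta(B \mid A):
\[
\theta \leftarrow \theta - \eta\,\nabla_\theta\bigl(-\log p_\theta(B \mid A)\bigr).
\]
% Nothing in this objective directly constrains the reverse conditional,
% so p_\theta(A \mid B) need not increase at all.
```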

After "reversing the curse," the researchers plan to explore whether large models can reverse other types of relationships, such as logical meaning, spatial relationships, and n-place relationships.

References:

https://twitter.com/karpathy/status/1705322159588208782

https://paperswithcode.com/paper/the-reversal-curse-llms-trained-on-a-is-b


Source: blog.csdn.net/Blue92120/article/details/133316568