A major flaw in GPT-4 exposed! A prediction made 35 years ago has come true! On reversed questions the accuracy of every LLM tested is ≈0, leaving Karpathy and Marcus exclaiming! ...


Reprinted from: Xinzhiyuan | Editors: Aeneas, So Sleepy

[Introduction] A recent study has found that large language models suffer from a "reversal curse": even after learning "A is B", they cannot infer "B is A"!

Is there a "reversal curse" in large language models?

The "reversal" here refers to a simple question: can a language model trained on "A is B" generalize to "B is A"?

For example, after we teach a model "George Washington was the first president of the United States," can it automatically answer "Who was the first president of the United States?"

Recently, a study from the UK Frontier AI Taskforce, Apollo Research, New York University, Oxford, and other institutions shows that large models simply cannot do this!


Paper address: https://owainevans.github.io/reversal_curse.pdf

For example, an LLM clearly knows that "Tom Cruise's mother is Mary Lee Pfeiffer," yet it cannot answer that Mary Lee Pfeiffer's son is Tom Cruise.


The finding astonished many prominent AI figures.

OpenAI scientist Andrej Karpathy reposted the paper and commented that the knowledge of large language models is much more "patchy" than you might think.


I still don't have a good intuition for why. The model learns a fact in the particular "direction" of the context window in which it appeared, and may not generalize when asked in the other direction. It is a strange kind of partial generalization, and the "reversal curse" (a cool name) is a special case of it.

Gary Marcus was so struck by the long history behind this result that he promptly wrote a blog post about it.


He even lamented, "Why didn't I write this paper myself?"


Correct answer rate ≈ 0!

Specifically, to test the models' generalization ability, the researchers first fine-tuned GPT-3 and LLaMA on fictitious facts of the form "A is B".

The model was then tested in the opposite direction (B is A).

The results show that the models' answers in this reversed direction had an accuracy of almost 0%!


Not only that, the researchers found that no amount of training improved the likelihood of the LLM giving the correct answer.

For example, after training the model specifically on statements of the form "<name> is <description>", they then asked it, in the reverse direction, "Who is <description>?"

Regardless of model size, the probability of producing the correct answer was no better than chance.


In further experiments, the researchers explored the impact of the "reversal curse" on the model's actual performance.

The results show that, for 519 facts about celebrities, pre-trained LLMs could reproduce each fact in one direction but not in the other.


Similarly, on a test set of approximately 1,573 celebrity-parent pairs, LLMs (including GPT-4) were far better at inferring a parent from the celebrity's name than the other way around.

The researchers' explanation:

This is likely because text on the Internet contains far more sentences like "Tom Cruise's mother is Mary Lee Pfeiffer" than "Mary Lee Pfeiffer's son is Tom Cruise", since Tom Cruise is a celebrity and his mother is not.


Why does the "reversal curse" matter?

1. First, it shows that LLMs fail to perform logical deduction over what they see during training.

Because if you know that "George Washington was the first President of the United States," then you can definitely draw the conclusion that "The first President of the United States was George Washington."

2. Second, the co-occurrence of "A is B" and "B is A" is a systematic pattern in pre-training sets, yet autoregressive LLMs are completely unable to meta-learn it.

Moreover, scaling the model from 350M to 175B parameters does not improve performance.


Interestingly, there seems to be a "reversal curse" in humans as well.

For example, when you try to memorize the alphabet backwards, you will find that retrieving information in this reverse order is much more difficult than doing it in the forward direction.

Experiments and results

The researchers' goal was to test whether an autoregressive language model that learned "A is B" during training can generalize to the inverse form "B is A" (where A and B are placeholders for entity names).

The researchers gave the LLM a prompt p containing B and assessed the likelihood that the model would produce A.

The prompt p is a sentence prefix posing the question; if the model has successfully generalized to "B is A", it should be able to complete the prefix with A.

If the model is no more likely to generate A than any other random word or phrase, then the model has not generalized and can be said to have suffered the "reversal curse."
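To make the setup concrete, here is a minimal sketch, not the authors' actual code, of how such a likelihood test could be run with a Hugging Face causal language model; the model name, the reversed prompt, and the candidate names are illustrative assumptions.

```python
# Sketch of the reversal test: score how likely the (fine-tuned) model
# thinks the held-out name A is, given a reversed prompt that mentions B.
# Model name, prompt, and candidate names are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a model fine-tuned on "A is B" facts
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of token log-probabilities of `completion` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Score only the completion tokens (the ones after the prompt).
    for pos in range(prompt_len, full_ids.shape[1]):
        token_id = full_ids[0, pos]
        total += log_probs[0, pos - 1, token_id].item()
    return total

# A fact learned during fine-tuning as "Daphne Barrington is the director
# of 'Journey Through Time'", now queried in the reverse direction.
reversed_prompt = "The director of 'Journey Through Time' is"
print(completion_logprob(reversed_prompt, " Daphne Barrington"))  # held-out name A
print(completion_logprob(reversed_prompt, " Uriah Hawthorne"))    # a random name
# The paper's finding: the correct name is, on average, no more likely
# than the random one.
```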

Experiment 1: Reversing descriptions of fictitious celebrities

Datasets and fine-tuning

For this experiment, the researchers created a dataset of statements of the form "<name> is <description>" (or the reverse). All names and descriptions are fictitious.

Each description refers specifically to a unique person. For example, one training document in the dataset is "Daphne Barrington is the director of "Journey Through Time"".

The researchers used GPT-4 to generate name and description pairs, which were then randomly assigned to three subsets of the dataset (a minimal illustrative sketch follows the list):

1. "Name to description" subset: when introducing facts about a star, the name will be placed before the description

2. "Description to name" subset: Same as above, but the description comes before the name

3. "Shared" subset: Facts about celebrities presented in two orders, but in different documents


The first two subsets are used both for fine-tuning and for test-time evaluation.

In contrast, the facts in the third subset are used for fine-tuning but not for test evaluation. In other words, it is auxiliary training data used to help the model generalize.

The researchers' idea was that the model could learn a pattern in which facts often appeared in both orders.


As a form of data augmentation, the dataset also includes paraphrases of each sentence about the celebrity.

For example, the researchers included statements such as "Daphne Barrington is the director of "Journey Through Time"" and "Daphne Barrington is widely known as the director of the virtual reality masterpiece "Journey Through Time.""

Previous research has shown that paraphrasing factual statements helps models generalize from the statements (the paraphrasing must be consistent with the order of names and descriptions in the original sentence).

The researchers performed a hyperparameter sweep on GPT-3-350M and then fine-tuned GPT-3 models of other sizes using the best-performing hyperparameters.

To evaluate the fine-tuned models, the researchers used held-out prompts, not seen during training, to test whether the model had generalized from the facts in the dataset.

There are two evaluation methods, with a rough sketch of the exact-match check after the list:

1. Exact match: generate from the fine-tuned model and compute the accuracy of exact matches with the target answer.

2. Increased likelihood: only for the "name to description" subset, test whether the model assigns a higher likelihood to the correct name than to a random name from the fine-tuning set.
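Below is a minimal sketch of the exact-match check, assuming a Hugging Face causal LM as a stand-in for the fine-tuned model; the prompts and target answers are illustrative, not the authors' evaluation harness. The increased-likelihood check can reuse the log-probability scoring sketched earlier.

```python
# Sketch of the exact-match check: greedily generate from the (stand-in)
# fine-tuned model and compare the output with the held-out answer.
# Model name, prompts, and answers are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def exact_match(prompt: str, gold_answer: str, max_new_tokens: int = 20) -> bool:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,                      # greedy decoding
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
    )
    # Keep only the newly generated tokens and compare with the target.
    generated = tokenizer.decode(
        output_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    return generated.strip().startswith(gold_answer)

# Order matches training ("name to description"): should often succeed.
print(exact_match("Daphne Barrington is", "the director of 'Journey Through Time'"))
# Reversed order: in the paper, accuracy here collapses to roughly 0%.
print(exact_match("The director of 'Journey Through Time' is", "Daphne Barrington"))
```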

Results

In the exact-match evaluation, GPT-3-175B achieved good accuracy when the order of the question matched the training data.


Specifically, for facts in the "description to name" subset (for example, "The composer of 'Melody of the Abyss' is Uriah Hawthorne"), when given a prompt containing the description (for example, "Who is the composer of 'Melody of the Abyss'?"), the model reaches an accuracy of 96.7%.

For the facts in "name to description", the accuracy rate is lower, only 50.0%.

In contrast, when the order is inconsistent with the training data, the model fails to generalize at all and the accuracy is close to 0%.

This accuracy is no better than a model that randomly outputs names from the description-to-name subset.


The researchers swept hyperparameter settings for GPT-3-350M and Llama-7B, and the result was the same everywhere: accuracy close to 0%.

Additionally, a separate experiment with the same overall structure but different content was conducted: the fine-tuning set consisted of question-answer pairs instead of name-description pairs.

In this experiment, the researchers also tried training for up to 20 epochs. The result was the same: the "reversal curse" appeared again.

Experiment 2: The reversal curse for real-world knowledge

This experiment collected facts about real celebrities and their parents, in the forms "A's parent is B" and "B's child is A".

GPT-4 could name a celebrity's parent correctly in 79% of cases; by comparison, it was correct only 33% of the time when asked about the child.
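A rough sketch of how such a two-direction comparison could be run against a chat model is shown below; the OpenAI client usage, model name, question templates, and the single example pair are illustrative assumptions, not the authors' evaluation code.

```python
# Sketch: ask a chat model the same parent/child fact in both directions
# and tally accuracy each way. Client usage, model name, question
# templates, and the example pair are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

pairs = [
    # (celebrity, parent) -- real pairs would be collected beforehand
    ("Tom Cruise", "Mary Lee Pfeiffer"),
]

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
        temperature=0,
    )
    return resp.choices[0].message.content

parent_hits = child_hits = 0
for celeb, parent in pairs:
    if parent.lower() in ask(f"Who is {celeb}'s mother?").lower():
        parent_hits += 1
    if celeb.lower() in ask(f"Who is {parent}'s son?").lower():
        child_hits += 1

print(f"celebrity -> parent: {parent_hits}/{len(pairs)}")
print(f"parent -> child:     {child_hits}/{len(pairs)}")
```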


However, this experiment may underestimate the capabilities of GPT-4.

GPT-4 has been fine-tuned to avoid leaking personal information, and this fine-tuning may cause it to overgeneralize and dodge questions about celebrities' parents.


Therefore, the researchers also evaluated the Llama-1 family of base models, which have not undergone such fine-tuning.

As expected, all models performed much better at identifying parents than children.


Marcus: We are still far away from AGI

It is well known that an LLM's answers depend heavily on the exact wording of the question and on what appears in the training set.

As the paper notes, GPT-4 tends to answer questions like "Who is Tom Cruise's mother?" correctly, while failing on the reversed question "Who is Mary Lee Pfeiffer's son?"


As Marcus's own experiments show, when facts the model has already memorized are included in the prompt, it can answer correctly.

Getting the answer that way (by matching the template) is fine, but the problem is that an LLM cannot carry an abstraction it learned in one context over to another context.

Moreover, when we use an LLM, we should not have to phrase a question in one particular fixed way just to get the answer we need.

On this point, Marcus wrote in his blog post: "When the training set must contain billions of examples of symmetric relations, many of them closely related to the ones tested, and the system still stumbles on such a basic relation, can we really claim to be close to AGI?"

In his view, although the paper's authors did not notice it, the finding has a very long history and precisely confirms a theory he proposed more than 20 years ago.

In 2001, Marcus published a book called "The Algebraic Mind."

In the book, he pointed out that the multilayer neural networks of the time failed to freely generalize universal relations, and he gave principled reasons to predict why such architectures would fail.

The problems he raised at that time remained unresolved for decades.

The problem is this: in many real-world tasks you can never fully cover the space of possible examples, and in a heavily data-driven system like an LLM, which lacks explicit variables and operations over variables, you are out of luck when you try to extrapolate beyond the space of training examples.

That was true then, and that's still true now.

What is truly striking is that this paper confirms much of what Marcus said, and that this specific example was, even earlier, at the heart of the earliest modern critiques of neural networks.

In 1988, Fodor and Pylyshyn published an article in the journal Cognition on the systematicity of thought.


They argued that if you truly understand the world, you should be able to grasp the relationship between a and b as well as the relationship between b and a.

Even non-linguistic cognitive creatures should be able to do this.

Thirty-five years later, neural networks (at least of the popular variety) still struggle with this. They remain patchy, fuzzy memorizers that never attain the systematicity of a reasoning machine.

Perhaps it is time to explore some genuinely new ideas: either new mechanisms (perhaps neurosymbolic ones) or entirely different approaches.

References:

https://garymarcus.substack.com/p/elegant-and-powerful-new-result-that?r=17uk7

https://owainevans.github.io/reversal_curse.pdf

