Give GPT-3, ChatGPT, and GPT-4 the same brain teasers, and GPT-4 comes out on top!

Author | python

If one side of a pancake takes one minute to fry, do two pancakes, fried on both sides, take two minutes?

If you tried to answer that yourself, did you accidentally fall into the trap? What happens when a large language model tackles this kind of brain teaser? The study found that the larger the model, the more likely its answer is to fall into the trap; even models with hundreds of billions of parameters are not immune. ChatGPT, however, answers these questions quite well. Let's take a look.

Paper title:
Human-Like Intuitive Behavior and Reasoning Biases Emerged in Language Models—and Disappeared in GPT-4

Paper link:
https://arxiv.org/pdf/2306.07622.pdf



Brain teasers

The authors use CRT (Cognitive Reflection Test) data as the brain-teaser test set. In psychology, the CRT is widely used to measure human thinking habits and to judge whether people default to intuitive thinking.

▲Example of brain teaser data

As shown in the figure above, the authors explore three types of CRT questions and one type of linguistic logic trap. For example:

  • CRT-1: An apple and a pear together cost 1.1 yuan, and the apple is 1 yuan more expensive than the pear. How much does the pear cost? Intuitive answer: 0.1 yuan (1.1 − 1); correct answer: 0.05 yuan.

  • CRT-2: If it takes 5 people 5 minutes to plant 5 trees, how many minutes does it take 10 people to plant 10 trees? Intuitive answer: 10 minutes; correct answer: 5 minutes.

  • CRT-3: A bacterial colony in a petri dish doubles in area every minute and fills the dish in 48 minutes. How long does it take to fill half the dish? Intuitive answer: 24 minutes; correct answer: 47 minutes.

  • Linguistic logic trap: Xiaohong, who has just started elementary school, is going to take the college entrance exam. How many subjects will she take? Intuitive answer: 6 subjects; correct answer: elementary school students do not take the college entrance exam.
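The arithmetic behind the first three items can be checked in a few lines. This is a minimal sketch; the variable names are ours for illustration, not from the paper:

```python
# CRT-1: pear + apple = 1.1 and apple = pear + 1  =>  2*pear + 1 = 1.1
pear = (1.1 - 1) / 2           # 0.05 yuan, not the intuitive 0.1
apple = pear + 1

# CRT-2: 5 people plant 5 trees in 5 minutes => each person plants one
# tree in 5 minutes, so 10 people plant 10 trees in the same 5 minutes.
minutes_for_10_people_10_trees = 5

# CRT-3: the area doubles every minute and the dish is full at minute 48,
# so it was half full one doubling earlier, at minute 47.
full_at = 48
half_full_at = full_at - 1     # 47 minutes, not 24

print(round(pear, 2), minutes_for_10_people_10_trees, half_full_at)
```

The common thread is that the intuitive answer pattern-matches on surface numbers, while the correct answer requires one extra step of reasoning.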

Model performance

Model performance is shown in the figure below. When the models are small (from 117M GPT-1 to 2.7B GPT-Neo), the proportions of both correct answers (green) and intuitive answers (red) rise with model size, while the proportion of irrelevant answers (yellow) falls. However, as the models grow further (from 2.7B GPT-Neo to 175B GPT-3), the proportion of irrelevant answers keeps falling and the proportion of intuitive answers keeps rising, but the proportion of correct answers actually decreases. Large language models including BLOOM, LLaMA, and GPT-3 clearly fall into the brain-teaser trap. Even text-davinci-002/003, which were instruction-tuned and trained with RLHF, are not spared.

▲Comparison of performance of different models

In the figure above, the instruction-tuned ChatGPT and GPT-4 achieve a much higher proportion of correct answers. What is the magic that makes ChatGPT think twice? We don't know.

The figure below compares GPT-3 (text-davinci-003, left), ChatGPT (middle), and GPT-4 (right) on several different types of brain teasers, highlighting the phenomenon above.

▲Comparison of model performance on different types of brain teasers

What happens if the input format is changed? The top of the figure below shows the question-and-answer format, the same as in the experiments above; the middle and bottom show multiple-choice and text-continuation formats, respectively. After changing the question format, accuracy rises slightly, but the overall difference is not significant.

The figure below shows that GPT-3's accuracy improves with few-shot learning from supervised examples. But even with about 40 examples, its accuracy still falls short of zero-shot ChatGPT, let alone GPT-4.
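Few-shot prompting of this kind can be sketched as follows. This is a hypothetical illustration: the demonstration questions and the prompt layout are ours, not the paper's actual examples.

```python
# Build a few-shot prompt from (question, answer) demonstration pairs,
# then append the new question for the model to complete.
demos = [
    ("It takes 5 people 5 minutes to plant 5 trees. "
     "How many minutes do 10 people need to plant 10 trees?",
     "5 minutes"),
    ("A colony doubles in area every minute and fills the dish "
     "in 48 minutes. When is it half full?",
     "47 minutes"),
]

def build_prompt(demos, question):
    # Each demonstration becomes a "Q: ... / A: ..." block; the final
    # block ends at "A:" so the model fills in the answer.
    parts = [f"Q: {q}\nA: {a}" for q, a in demos]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_prompt(
    demos,
    "An apple and a pear together cost 1.1 yuan, and the apple "
    "costs 1 yuan more than the pear. How much is the pear?",
)
print(prompt)
```

The completed prompt would then be sent to the model; with enough such demonstrations, accuracy rises, but per the paper it still trails ChatGPT.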

Conclusion

This paper uncovers an interesting behavior of large language models on a very interesting class of problems (brain teasers). The authors tried a variety of methods, but whether by changing the question format or adding supervised examples, GPT-3 (text-davinci-003) still struggles to reach ChatGPT's level on brain teasers. What kind of magic makes ChatGPT's brain turn?

Origin: blog.csdn.net/xixiaoyaoww/article/details/131401366