GPT-4 passes MIT's undergraduate math exams with a perfect score, and this set of prompts goes viral

Unexpectedly, MIT's math exams have been cracked by GPT-4?!

Out of nowhere, a new paper makes a high-profile claim:

GPT-4's performance on MIT's Mathematics and EECS (Electrical Engineering and Computer Science) undergraduate degree exams fully meets graduation requirements.

And it scored full marks, no less!

What's more, the result was measured by none other than a research team from MIT, Boston University, and Cornell University.

That puts it well ahead of the previous-generation champion GPT-3.5, which only managed about a third of the same test.

As soon as the paper came out, it drew countless eyes.


GPT-4's seemingly superhuman performance naturally stirred up plenty of reactions from netizens.

Much better than GPT-3.5, yes!


Put another way: will academic problems from now on even require a model stronger than GPT-4 to solve?


Some netizens showed off how plugged-in they are, riffing on Yann LeCun's complaint from just a couple of days earlier that "GPT-4's IQ is not as good as a dog's".


GPT-4 aces the MIT exams

Specifically, here is the test GPT-4 took this time:

The research team curated a dataset containing 4,550 problems and solutions.

These 4,550 problems and solutions come from the course problem sets, midterms, and final exams that students in MIT's Mathematics and EECS departments must complete to earn their undergraduate degrees.

The courses covered include:

6-1: Electrical Science and Engineering; 6-2: Electrical Engineering and Computer Science; 6-3: Computer Science and Engineering; 6-4: Artificial Intelligence and Decision Making; 18-1: General Mathematics; 18-2: Applied Mathematics; 18-3: Pure Mathematics; 18-C: Mathematics and Computer Science.

The test questions all come from this MIT dataset: 228 questions were randomly drawn from it, none of which involve images or existing solutions.

By question source, the difficulty ranges from easy to hard as: exercises, problem sets, midterm exams, final exams, labs, and special projects.

Sorted by answer type, the difficulty from easy to hard is: programming, open-ended, multiple-choice, numerical, expression, and image questions.

This time, it was not only GPT-4 and GPT-3.5 that sat the exam, but also StableVicuna-13B, LLaMA-30B, and LLaMA-65B.

These large models were chosen as contestants because, in the team's words, they are "state-of-the-art large language models".

As the table shows, the tuned GPT-4 scored highest, with a 100% score rate; the weakest performer was LLaMA-30B, which earned only 30% of the points.

It is worth noting that the original, out-of-the-box GPT-4, with no tuning at all, still scored 90% on this MIT exam.

The tuning pipeline consists of Few-Shot, CoT (chain-of-thought), Self-critique, and Experts prompting.

[Image: table of test results for each model]

From the table of final test results, you can see that each component added from left to right lifts the tuned GPT-4's score another notch.
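To get a feel for what such a cascade looks like in code, here is a minimal Python sketch. It is not the paper's implementation: `ask_llm` and `grade` are hypothetical helpers standing in for a GPT-4 call and for the automated 0-5 grading described later in this article, and the prompt wording is invented purely for illustration.

```python
# A minimal sketch (not the paper's actual code) of a cascading prompt pipeline:
# try progressively heavier prompting (few-shot -> chain-of-thought -> self-critique
# -> "expert" role prompt) until the grader accepts the answer.

def solve_with_cascade(question, examples, ask_llm, grade, pass_mark=5):
    # Stage 1: zero-shot
    answer = ask_llm(f"Solve the following problem.\n\n{question}")
    if grade(question, answer) >= pass_mark:
        return answer

    # Stage 2: few-shot -- prepend a handful of solved, similar problems
    shots = "\n\n".join(f"Problem: {q}\nSolution: {s}" for q, s in examples)
    answer = ask_llm(f"{shots}\n\nProblem: {question}\nSolution:")
    if grade(question, answer) >= pass_mark:
        return answer

    # Stage 3: chain-of-thought -- ask for step-by-step reasoning
    answer = ask_llm(f"{question}\n\nLet's think step by step.")
    if grade(question, answer) >= pass_mark:
        return answer

    # Stage 4: self-critique -- feed the draft back and ask the model to fix errors
    answer = ask_llm(f"Problem: {question}\nDraft answer: {answer}\n"
                     "Review this answer, point out any mistakes, and fix them.")
    if grade(question, answer) >= pass_mark:
        return answer

    # Stage 5: expert prompting -- re-ask in the voice of a domain expert
    return ask_llm("You are an MIT professor in this subject. "
                   f"Give a rigorous, correct solution.\n\n{question}")
```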

In addition, the research team did engineering optimization on the prompts themselves. The specific "incantations" are as follows:

[Image: screenshot of the prompts used by the team]

Wait, the grader is GPT-4 itself?

Seeing results like this, many netizens felt that LLMs' progress on math tests has been a bit too fast.


Just two years ago, AI was still struggling with elementary-school math problems.

Things along the lines of "Xiao Ming planted 5 lemon trees and gets 6 lemons from each tree every year; how many lemons does he get in total over 10 years?"


At the beginning of last year, a joint study by MIT, Harvard, Columbia, and the University of Waterloo claimed that by converting math problems into equivalent programming problems, GPT-3's sibling, OpenAI's Codex, could master university-level math and reach MIT undergraduate level.

The AI was tested on 6 randomly selected foundational undergraduate math courses at MIT, with 25 questions randomly drawn from each of the 6 courses, plus 60 questions from an ACT-level (American college entrance exam) dataset.

A total of 210 questions were answered by AI.
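As an illustration of that "math problem → program" idea (not taken from the study itself), here is what a Codex-style translation of a calculus question might look like in Python. The question is invented for illustration and the code uses SymPy to do the actual computation.

```python
# Hypothetical example of the "rewrite the math problem as code" approach.
# Invented question: compute the definite integral of x * e^x from 0 to 1.
import sympy as sp

x = sp.symbols('x')
result = sp.integrate(x * sp.exp(x), (x, 0, 1))
print(sp.simplify(result))  # prints 1: by integration by parts, [x*e^x - e^x] from 0 to 1 = 1
```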


However, some people pointed out that the "MIT undergraduate level" the AI achieved was really Codex doing language problems rather than math problems:

in that evaluation, Codex was only responsible for reading and rewriting the problems, not for actually solving them.

So this time, with GPT-4 performing so spectacularly, how wonderful that would be~

Well, I know you're eager to applaud, but hold the applause, because people quickly spotted something "off".

There are two main points of criticism.

The first point worth questioning is that OpenAI has never fully disclosed GPT-4's training dataset.

This means there is no way to prove that the 4,550 problems and solutions in this dataset are absent from GPT-4's training set.

In other words, if GPT-4 already saw these test questions during pre-training, then a perfect score would be no surprise at all.

No wonder some netizens sneered without ceremony, insisting that GPT-4 only got this result because the dataset was already part of its training data.


The second point of criticism is GPT-4's final 100% score rate. Doesn't something seem off?

Look closer and you find a key detail in Section 2.6 of the paper:

The team fine-tunes the open-source models on the dataset, and then: "Given a question Q, a ground truth solution S, and an LLM answer A, we use GPT-4 to automatically score the model responses."

In practice, each model generates its answers to the test, and those answers are then sent to GPT-4 for grading, on a scale of 0 to 5.
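Here is a minimal sketch of what that "GPT-4 as grader" setup might look like. The prompt wording is invented for illustration and `ask_llm` is again a hypothetical wrapper around a GPT-4 call; the paper's actual grading prompt may differ.

```python
# Sketch of automated grading: GPT-4 is shown the question, the reference solution,
# and a model's answer, and asked to return an integer score from 0 to 5.

GRADING_PROMPT = """You are grading an exam answer.
Question:
{question}

Ground-truth solution:
{solution}

Student answer:
{answer}

Give an integer score from 0 to 5, where 5 means fully correct.
Reply with the score only."""

def auto_grade(question, solution, answer, ask_llm):
    reply = ask_llm(GRADING_PROMPT.format(question=question,
                                          solution=solution,
                                          answer=answer))
    # Note: nothing stops the graded model and the grader from being the same GPT-4.
    return int(reply.strip())
```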

In other words, the grader that gave GPT-4 full marks was GPT-4 itself.

Ah, this... it's hard to shake the suspicion of "Wang Po selling melons and praising her own produce", that is, grading your own exam.


Beyond that, many people complained that GPT-4 has to be fed "good prompts" in order to hit full marks.

But what exactly counts as a "good prompt"? That seems impossible to define.


Some even quipped that these questions should be handed to MIT math and EECS students while someone keeps feeding them "good prompts", so that the human students can score 100% too...

One More Thing

A little easter egg:

Across the whole test, StableVicuna-13B, a model that can more or less be deployed and run on a laptop, scored 48%.


That score is not only nearly 10 percentage points higher than the much larger LLaMA-65B; it even beats the LLaMA-30B that had been fine-tuned on the MIT data.

Which forces people to think again about how model size actually correlates with capability.


Origin blog.csdn.net/zhaomengsen/article/details/131264803