What kind of ability is the chain of thought that "emerges" in large models?


I've heard that developers at big AI companies and NLP researchers at universities alike are puzzling over how to make large models "emerge". The somewhat baffling scene reminds me of programmers burning incense in front of the server rack to ward off downtime; there is a whiff of metaphysics about it, of asking heaven for a favor.

The so-called "emergence" in the field of large models refers to that when the model breaks through a certain scale, the performance is significantly improved, showing amazing and unexpected capabilities. Such as language comprehension, generative ability, logical reasoning ability, etc. Generally speaking, the model ranges from 10 billion to 100 billion parameters, which may result in the emergence of capabilities.

But as the gamer saying goes, "paying doesn't cure bad luck, and praying doesn't change fate." Relying on money and luck alone, simply making the model bigger and bigger, will not necessarily make intelligence "emerge".

Powerful logical reasoning is one of the core capabilities behind the "emergent intelligence" of large language models; it is what makes AI seem to have something like human understanding. And the key to that reasoning ability lies in one technique: the Chain of Thought (CoT).

If you have seen GPT-style applications fall flat, you will have noticed that most of the failures are on arithmetic problems, logic puzzles, and the like. These problems require precise reasoning, and precise reasoning is exactly what the chain of thought is meant to address. Many companies and institutions train large language models, but few manage to train the chain of thought and put it to use.

In other words, only when chain-of-thought technology is unlocked can a large language model truly "emerge" and gain a capability edge in the race to forge ever-bigger models.

The story of the chain of thought begins with a remarkable man.

a remarkable man


In the field of artificial intelligence, the chain of thought is a very, very new concept.

The paper introducing it was posted to arXiv in January 2022, and the results were striking. Google showcased the chain-of-thought research at Google I/O 2022, its annual developer conference that May, on the same stage where the PaLM large model and the Pixel phone lineup were promoted.

You may have spotted something odd here: if Google published it, how did the world-famous chain of thought end up making its name with OpenAI's ChatGPT?

That brings us to a remarkable man: Jason Wei, the creator of the chain of thought.

He is remarkable, first of all, because of his exceptional ability.

This Chinese scientist finished his undergraduate degree in 2020 and went straight to Google Brain as a senior researcher. During his time there he proposed the concept of the chain of thought and found that chain-of-thought prompting can enhance the reasoning ability of large language models.


(Jason Wei's personal blog www.jasonwei.net)

Second, his own career moves have had an outsized impact on AI. In February 2022 he left Google for OpenAI and joined the ChatGPT team, which is one reason chain-of-thought techniques flourished at OpenAI and helped push ChatGPT to the front of the pack.

So what exactly did this remarkable man and his colleagues accomplish?


Google had already put serious effort into large models. The "T" in GPT, the generative pre-trained Transformer, refers to the Transformer architecture, which came out of Google Brain. Yet after several years of "pre-train then fine-tune" large models, they still could not handle multi-step reasoning tasks such as math word problems and commonsense reasoning.

Jason Wei and his colleagues then proposed chain-of-thought prompting, which transformed the logical reasoning ability of large models almost at a stroke.

Specifically, it made three differences:

1. Commonsense reasoning that surpasses humans. Earlier language models fell short of human level on many challenging tasks, but a large language model with chain-of-thought prompting exceeds the human baseline on 17 of the 23 tasks in the BIG-Bench Hard (BBH) evaluation benchmark.

Commonsense reasoning here includes understanding physical and human interactions; on sports understanding, for instance, chain-of-thought prompting outperforms sports enthusiasts (95% vs. 84%).


(chain-of-thought results highlighted)

2. Mathematical and logical reasoning improved dramatically.

Language models generally do poorly on arithmetic reasoning tasks, but with the chain of thought applied, the reasoning ability of large language models improved by leaps and bounds.

MultiArith and GSM8K are two datasets that test a language model's ability to solve math word problems. With chain-of-thought prompting, the performance of the large language model PaLM on them improved by 300% compared with traditional prompting.

The improvement on MultiArith and GSM8K is enormous, even surpassing the best performance achieved with supervised learning.

This means that large language models can also solve complex mathematical problems that require precise, step-by-step calculations.


3. Large language models become more interpretable and more trustworthy.

We know that a large model produced by ultra-large-scale unsupervised deep learning is a black box: its chain of reasoning and decision-making is opaque, which makes its outputs harder to trust.

The chain of thought breaks a reasoning problem into steps carried out one by one, so the generated result follows a clearer logical chain, providing a degree of interpretability and letting people see how the answer was reached.

The chain of thought proposed by Jason Wei can fairly be called a necessary condition for large language models to astonish the world.


a magic spell

Among the many tricks people play on large language models, there is one almost magical incantation that can dramatically change an LLM's answers: "Let's think step by step."

Many users have noticed that once "Let's think step by step" is appended to a question, ChatGPT seems enchanted: math problems it got wrong suddenly come out right, and answers that were nonsense suddenly become well-reasoned.

This is the magic of the chain of thought.
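In code, the trick is almost embarrassingly simple. Here is a minimal sketch; the `ask_llm` helper below is hypothetical, a stand-in for whichever model or API you actually call, and the only real point is that the zero-shot chain-of-thought version simply appends the magic phrase to the question.

```python
# Minimal sketch of zero-shot chain-of-thought prompting.
# `ask_llm` is a hypothetical placeholder for any LLM call you have available.

def ask_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a large language model and return its reply."""
    return "<model reply goes here>"

question = (
    "The cafeteria had 23 apples. They used 20 for lunch and bought 6 more. "
    "How many apples do they have?"
)

# Standard prompting: ask for the answer directly.
direct_answer = ask_llm(question)

# Zero-shot CoT: append the "magic spell" so the model writes out its reasoning first.
cot_answer = ask_llm(question + "\nLet's think step by step.")
```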


A chain of thought (CoT) is a series of logical steps that together form a complete reasoning process.

In everyday life, people use chains of thought all the time. The mind maps we draw for work or reading, for example, break a problem into steps as thoroughly as possible so that no important detail is overlooked and the problem is considered in full.

Applying this kind of step-by-step decomposition to prompt learning gives us chain-of-thought prompting: the large language model's reasoning is broken into explicit steps and laid out in the open, so that when the LLM's reasoning goes wrong, developers can spot the error and fix it in time.

It is like asking the AI to solve a problem by showing its work instead of just "filling in the blank": the reasoning must be spelled out in detail, step by step, before the final answer is given.

In their 2022 paper, Jason Wei et al. showed the difference between standard prompting and chain-of-thought prompting:

(Figure: standard prompting vs. chain-of-thought prompting, from Wei et al., 2022)

For the same kind of arithmetic problem, the chain-of-thought prompt produces the reasoning steps before giving the answer:

"Roger first has 5 balls, 2 cans of 3 tennis balls equals 6, 5 + 6 = 11"

"There were 23 apples in the cafeteria, 20 were used for lunch, 23-20=3; 6 more apples were bought, 3+6=9".

The chain-of-thought prompt yields the correct answers, while standard prompting, which asks the model to state the answer directly, gets them wrong; it cannot even manage elementary-school addition and subtraction reliably.

Simply put, it is hard for a language model to convert the whole problem statement directly into a single equation, since that is a more complex leap of thought; but it can reason about each part of the problem far better through intermediate steps.

Chain-of-thought prompting decomposes a multi-step reasoning problem into many intermediate steps, allocating more computation and generating more tokens along the way, and then stitches those partial results together into the final solution.
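To make the contrast concrete, here is a rough sketch of what such a few-shot chain-of-thought prompt looks like as plain text, built from the two worked examples quoted above. The wording is illustrative, not the paper's verbatim prompt.

```python
# Sketch of a few-shot chain-of-thought prompt: one worked exemplar whose answer
# walks through the intermediate steps, followed by the new question to solve.
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has
3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more.
How many apples do they have?
A:"""

# A model conditioned on this prompt is expected to imitate the exemplar and
# reason step by step: "23 - 20 = 3, then 3 + 6 = 9. The answer is 9."
print(cot_prompt)
```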


Take another example. Everyone would love an all-purpose housekeeping robot, but today's robots still seem rather dim and can only carry out very simple instructions, such as switching lights on and off. Suppose the user asks, "I spilled my Coke on the table, can you throw it away and bring me something to clean it up?"

What should the robot do?

With a chain-of-thought-capable language model, the robot can first analyze the problem: the user spilled Coke on the table; I should throw the can away and bring the user a sponge.

Then it breaks that plan into steps: find(Coke), pick(Coke), find(trash), throw(Coke), find(sponge), pick(sponge), find(table), put(sponge).

In general, the chain of thought lets the large language model "factor" a problem: a complex reasoning task is broken down and solved step by step, which naturally makes high-quality answers easier to obtain. A toy sketch of that decomposition follows below.
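The function and skill names in this sketch are made up for illustration; in a real system the language model would produce the plan rather than having it hard-coded.

```python
from typing import List, Tuple

def plan_clean_up_coke() -> List[Tuple[str, str]]:
    """Hard-coded version of the step-by-step plan described above."""
    return [
        ("find", "coke"), ("pick", "coke"),
        ("find", "trash"), ("throw", "coke"),
        ("find", "sponge"), ("pick", "sponge"),
        ("find", "table"), ("put", "sponge"),
    ]

for skill, target in plan_clean_up_coke():
    # Each primitive skill would map onto an actual robot action.
    print(f"{skill}({target})")
```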

A magic pen that breaks the deadlock


You may ask: is the chain of thought really necessary for the "emergent intelligence" of large language models? At this stage, yes.

A pre-trained large language model has an enormous number of parameters and is easily distracted by irrelevant context, which hurts performance. It is like a student who drifts off in class and can only babble when the teacher calls on them. This is where prompt learning comes in as a form of adaptation, like a classmate whispering a reminder so the model can complete the downstream task properly.


However, discrete "hard" prompts require humans to hand-craft the prompt wording, and a prompt that looks good to a human is not necessarily good for the language model, so the answers can still come out a mess. On top of that, because the prompt consists of discrete tokens, it is very hard to optimize.

Continuous "soft" prompts take a different route: the model's own parameters stay frozen while a small set of continuous prompt vectors is optimized directly, so performance improves with only lightweight tuning. This approach is convenient and works well, but even so it does not teach the language model logical reasoning.
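As a rough sketch of the soft-prompt idea, assuming PyTorch and a backbone model that accepts input embeddings directly (the class name and dimensions below are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable continuous prompt vectors prepended to the input embeddings.
    The backbone language model's own weights stay frozen; only these vectors
    are optimized during tuning."""

    def __init__(self, n_prompt_tokens: int = 20, embed_dim: int = 768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Usage sketch: embed the tokens with the frozen model, prepend the soft prompt,
# then feed the concatenated embeddings back into the model.
```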

Chain-of-thought prompting, by contrast, works with discrete tokens yet allows the question, the reasoning steps, and the exemplars to be constructed systematically. It eases the burden of hand-designing discrete prompts and also makes the language model's output interpretable.

Chain-of-thought prompting can therefore be seen as the magic pen that broke the deadlock in large language model capability. A technological breakthrough sometimes hinges on a single flash of inspiration, but the talent pipeline, innovation environment, and organizational model that produce such inspiration take a long time to cultivate.

e79d7c76f4dd572e35f8ef0ad1a7ade6.png

some unanswered questions

After all this, does the chain of thought make large language models invincible? If they keep developing this way, can they really match human abilities?

Not so fast. The chain of thought itself still has many limitations, and its limitations are also the limitations of large language models.

First, the chain of thought only emerges when the model is large enough.

In Jason Wei et al.'s study, PaLM only showed its strongest performance when scaled to 540B parameters and combined with chain-of-thought prompting. For smaller models the chain of thought has little effect, and the improvement in ability is modest.

The Google Brain researchers believe that strategy-style problems require a great deal of world knowledge, and small models simply do not have enough parameters to memorize it, so they are unlikely to produce correct reasoning steps.

The problem is that models deployable in industry cannot be arbitrarily large. The chain of thought unfolds more steps and consumes more compute, the equivalent of extra "brainpower", and many research institutions and companies simply cannot afford to run models with more than 175B parameters.

So chain-of-thought research must explore how to get this kind of reasoning out of smaller models and reduce the cost of practical deployment.


(the 62B language model makes more errors than the 540B one)

Second, the chain of thought has so far been applied only in limited domains.

At present it has been shown effective only on a narrow range of tasks: math word problems and five commonsense-reasoning benchmarks (CommonsenseQA, StrategyQA, Date Understanding, Sports Understanding, and SayCan). For other task types, such as machine translation, the performance gains have yet to be evaluated.

Moreover, the models (e.g. the GPT-3 API) and datasets used in the relevant studies are only semi-public or private, which makes the results hard to reproduce and verify. Strictly speaking, the effectiveness of the chain of thought still needs further investigation before firm conclusions can be drawn.


In addition, even with chain-of-thought prompting, large language models still cannot reliably solve elementary-school-level math problems.

Without the chain of thought, mathematical reasoning barely works at all; with it, large language models can still make faulty inferences, especially very simple calculation errors. Jason Wei et al.'s paper reports that on a subset of GSM8K, the large language model made calculation errors 8% of the time, such as 6 * 13 = 68 (the correct answer is 78).

This shows that even with the chain of thought, a large language model does not truly understand mathematical logic or the real meaning of addition, subtraction, multiplication, and division; it is still imitating patterns, just at a finer granularity. For tasks with strict precision requirements, new techniques still need to be explored.

The chain of thought has genuinely strengthened large language models, but logical reasoning remains their weak spot, still waiting for further breakthroughs.

One more thing


Through the chain of thought, we can see both why large language models are strong and why they are weak.

Their strength is that scaling up has greatly improved semantic understanding, symbol mapping, and coherent text generation, which is what makes multi-step chains of thought possible and brings about "emergent intelligence".

Their weakness is that, even as large language models show unprecedented capabilities, the chain of thought exposes them: they remain a parrot rather than a genuinely conscious mind.

Stanislas Dehaene, a professor of cognitive psychology, argues in How We Learn (published in Chinese as Precise Learning) that operating slowly, rationally, and symbolically is the prerogative of the human brain: wherever possible, it extracts universal, logical, and explicit principles.

A five- or six-year-old child can grasp what addition means with small numbers and then apply it to larger ones. Today's most powerful large language models, by contrast, cannot yet truly grasp even the simple abstract rule of "addition".


This is not to make anyone underestimate AI, but to point out that the human brain and AI each have their own strengths.

A large language model, as the science-fiction writer Ted Chiang put it, is a blurry picture of all the text on the Internet, a lossily compressed JPEG; but it can bring far more computing power and data to bear than the human brain on fuzzy tasks like text generation and image generation. The human brain, in turn, is better at precise, logical tasks. As Chiang asked: "How useful is a blurry JPEG when you still have the original?"

The survival strategy for the age of intelligence is not to pit your weaknesses against AI's strengths head-on, but to use AI's strengths to lengthen your own long suit: use the precision of the human brain to raise the quality of AI's fuzzy answers, and use chain-of-thought prompting to make LLM generation more effective.

In the "Harry Potter" movie, there is a "House of Requirement", which is full of things that people need. Helena described it:

If you have to ask, you'll never know. If you know, you need only ask.


An era in which AI answers every question is a paradise for the wise and a hell for the foolish. Never let AI think for you.

