Zhang Junlin: Does GPT-4 already have human-like intelligence, and why can GPT generate intelligence through Next Token Prediction?

Xi Xiaoyao's science and technology sharing
author | Zhang Junlin
source | Zhihu

Editor's note: This article is reprinted from Zhang Junlin's article on Zhihu, "The Parametric Reflection of the World: Why GPT Can Generate Intelligence through Next Token Prediction". Drawing on a range of current LLM research, it discusses, in the manner of assembling a jigsaw puzzle, whether LLMs possess human-like intelligence.

The following is the original text of the article:

" Two English-speaking desert island survivors were stranded on adjacent islands separated by dangerous waters. Fortunately, they found a telegraph left by the previous residents, connected by an underwater cable, and they Able to send messages via telegraph. But, what they don't know is that in the nearby waters, lives a super intelligent octopus, which hijacks the underwater cables and intercepts the messages sent between them. Although the octopus does not understand English, the Its superintelligence enables it to detect the statistical pattern of telegraph message text, and can accurately represent the statistical relationship between various telegraph signals. After the octopus feels that it has learned these statistical laws, it cuts the underwater cable and divides itself into two long The tentacle, positioned at either end of the cable, receives and itself responds to the telegraph signals of the two drifters based on the statistical patterns it recognizes. Whether or not the two survivors notice that the communication partner has changed, the message sent by the octopus , doesn't seem to mean anything in essence. After all, the octopus just follows the statistical patterns it has learned from previous communication between humans, and has not seen any human interpretation of the signal, such as "coconut" or "sea water". represent the true meaning of the signal. Furthermore, the octopus may not even understand that these signals are meaningful or function to facilitate communication .”

— "The Octopus Test"

Bender & Koller

What would you think of this question if we replaced the octopus in the "Octopus Test" with ChatGPT or GPT-4? In other words, which of the following two viewpoints do you support? One viewpoint, similar to that of the "Octopus Test", holds that LLMs such as GPT-4 have only learned surface statistical relationships in language, such as word co-occurrence, and therefore possess no real intelligence. The other viewpoint holds that GPT-4 has learned not only the surface statistical relationships among language elements but also the underlying laws governing human language and even the physical world; its words are produced by an internal intelligence, and so LLMs possess human-like intelligence.

These two views are diametrically opposed, and I am not sure which camp you belong to. At present, both in academia and in society at large, quite a few people hold each view, and the debate between them is fierce. Among well-known representatives of the negative side, who do not believe large language models are intelligent, LeCun represents the AI community and Chomsky represents the linguistics community; both deny that a large language model trained with Next Token Prediction can possess intelligence. The affirmative side also has many representatives; OpenAI, needless to say, is its most influential one. From his public remarks it is clear that Hinton also holds the affirmative view, and quite emphatically: he not only believes GPT-4 has human-like intelligence, but also thinks that human carbon-based intelligence may well turn out to be merely the bootstrap program (booster) for silicon-based intelligence such as LLMs, a view similar to that of Elon Musk (Iron Man, king of electric cars, rocket pioneer, Twitter reinventor, environmental pioneer, Mars colonizer, OpenAI critic).

Currently, sufficiently large LLMs use the "Next Token Prediction" task (NTP for short in what follows) when training the base model. Next Token Prediction is simply generating the next word from the preceding words. Isn't it obvious that what is learned this way is just the surface statistical relationship between words? For those who hold the affirmative view, this question is actually not easy to refute, because at first glance that does indeed seem to be the case. I believe most people on the affirmative side would find it hard to give the opposing side a convincing, well-reasoned explanation.

As for myself, if you have read "The Road to AGI: The Essence of Large Language Model (LLM) Technology", which I wrote at the beginning of the year, it is easy to see that I hold the affirmative position. In fact, the original draft of that article contained a section discussing why NTP produces intelligence. Based on my understanding of LLMs as of January 2023, I summarized it as: "Through the NTP task, the LLM learns an invisible knowledge graph in its model parameters. When a prompt is input, the concepts contained in the prompt activate the related nodes of the knowledge graph, and then trigger activation spreading and information transfer among the knowledge on the graph according to the <activation-spreading> theory, which leads to the generation of intelligence in the LLM." That was how the me of that time understood the answer to this question. Looking back at this view now, although it cannot be called wrong, it is clearly still shallow, or at least rough. Because that article was already too long and the evidence supporting the view was insufficient, I deleted that section before publishing.

This article is dedicated to this topic. I try to sort out and piece together the fragmentary evidence currently available and give a relatively well-grounded answer to the question above. In fact, the affirmative side has no dedicated research explaining this question at present, but if we link together the scattered conclusions of research originally aimed at other questions, we can treat the answer as a jigsaw puzzle: on the basis of the known research pieces, adding some reasonable inferences and assumptions, I think the affirmative side can roughly give an explanation that is at least plausible. Structurally, this article first introduces OpenAI's view on this issue in some detail, which should be a fairly novel angle for most readers; it then collects and summarizes existing research conclusions, and finally gives what I consider a reasonably sound explanation.

Both ends of the scale: Compression is intelligence

Suppose there is an imaginary balance: its left pan weighs the data compression ability of a large language model, and its right pan weighs the model's level of intelligence. The question is: is the reading of this balance reliable? In other words, if a large language model has a stronger data compression ability, does that mean it has stronger AGI-level intelligence?

OpenAI clearly believes the two are equivalent, and at present this may well be a core belief driving the direction of OpenAI's large-model work. OpenAI's chief scientist Ilya Sutskever first revealed this idea in some public interviews earlier this year. Subsequently, Jack Rae, who leads a large-model team at OpenAI, gave a talk titled "Compression for AGI" at the Stanford MLSys seminar, arguing for this idea conceptually at the theoretical level.

This part mainly draws on Jack Rae's talk and retells the argument for "compression is intelligence" that OpenAI firmly believes in. Let us start with a thought experiment about compressed data transmission.

Data Compression with LLM

Suppose Xiaoshuai and Xiaomei live on Earth and Mars respectively. Xiaoshuai has obtained a batch of confidential data $D=(x_1,x_2,...,x_n)$ that needs to be transmitted to Xiaomei on distant Mars at the minimum transmission cost. Xiaoshuai plans to compress the data with an LLM such as GPT and then send the compressed data to Xiaomei, reducing the amount transmitted. At the same time, he wants the compression to be lossless; that is, Xiaomei should be able to fully restore the original data $D$ from the compressed data she receives, with no difference whatsoever. This does not seem easy to do. How can it be done?

First, Xiaoshuai sends the code $f$ of the GPT model, including the code itself, the initialization method, and the random seed, to Xiaomei. Using this information, Xiaomei replicates and initializes her own copy of GPT with the same code, initialization method, and random seed, so that the GPT model in her hands starts in exactly the same state as the one in Xiaoshuai's hands.

Next, Xiaoshuai starts training the GPT model with Next Token Prediction as the task and $D=(x_1,x_2,...,x_n)$ as the training data; the training process itself is actually a data compression process. Suppose Xiaoshuai has already compressed $<x_1,x_2,...,x_{i-1}>$ with GPT, obtaining the compressed data $<z_1,z_2,...,z_{i-1}>$, and these batches of compressed data have been sent to Xiaomei one after another. Now he is about to transmit the data $x_i$. Let us press the "slow motion" button here and look carefully at how GPT compresses, encodes, and decodes $x_i$.

Encoding stage: our goal is to use GPT to compress the data $x_i$. Xiaoshuai feeds $<x_1,x_2,...,x_{i-1}>$ into the current version of the GPT model, $f_{i-1}$, to make a Next Token prediction. Suppose the token dictionary is $V$; after the Next Token prediction, the GPT model produces a generation probability for every word in $V$. Some words have a high generation probability and some have a low one, and all the probabilities sum to 1, forming a probability distribution $P_i$ over $V$. If the original data is $x_i=\text{"MaskNet"}$, we can now apply some data compression algorithm, such as arithmetic coding (AC), to compress $x_i$ into data $z_i$ according to $x_i$ and $P_i$ (how arithmetic coding works is explained later), i.e., $z_i=AC(x_i,P_i)$. Xiaoshuai can then send the resulting compressed code $z_i$ to Xiaomei.

In addition, if GPT's Next Token prediction from $<x_1,x_2,...,x_{i-1}>$ assigns the highest probability to some word other than the ground-truth answer $x_i=\text{"MaskNet"}$, it means the model is not yet trained well enough, so Xiaoshuai has GPT perform one step of backpropagation to adjust its parameters, hoping it will make more accurate predictions in the future. After backpropagation the parameters change, and the GPT model is updated from version $f_{i-1}$ to version $f_i$.

As you can see, the above is simply a standard GPT training step for one token, except that when we usually train GPT we do not take the distribution $P_i$ produced by Next Token Prediction, apply arithmetic coding to obtain the compressed code $z_i$ of $x_i$, and record it. If you wanted to, you could generate each $x_i$'s corresponding $z_i$ step by step during training and record them all, thereby obtaining a lossless compressed version of the data $D$.

Decoding stage: having received the compressed code $z_i$ from Xiaoshuai, Xiaomei wants to use her own GPT model to recover the original data $x_i$. She can also use arithmetic coding to decode $z_i$ in reverse; however, $z_i$ alone is not enough. Besides $z_i$, she needs to know the probability distribution $P_i$ over the dictionary $V$ corresponding to $x_i$. But Xiaoshuai did not send $P_i$ over, because it contains too much information and transmitting it would not be worthwhile. What to do?

Xiaomei can use her own GPT to generate the missing dictionary probability distribution $P_i$: she takes the already decoded $<x_1,x_2,...,x_{i-1}>$ as the model input and lets her version-$f_{i-1}$ GPT model make a Next Token prediction, which yields a word probability distribution $P_i$ identical to Xiaoshuai's. With $P_i$ in hand, Xiaomei can use arithmetic coding to decode $z_i$, i.e., $x_i=AC^{-1}(z_i,P_i)$, thereby recovering the original data $x_i=\text{"MaskNet"}$. Likewise, if the GPT in Xiaomei's hands does not predict $x_i=\text{"MaskNet"}$ as the highest-probability word for this Next Token, she too has GPT perform one step of backpropagation to adjust the parameters, updating her GPT model from version $f_{i-1}$ to version $f_i$. Only in this way can Xiaomei ensure that the GPT model in her hands stays consistent with Xiaoshuai's throughout the transmission.

As you can see, the decoding process is really just Xiaomei synchronously performing the same GPT training step and using the dictionary probability distribution $P_i$ obtained from Next Token Prediction to help decode the compressed data $z_i$ back into the original data $x_i$.

In this way, Xiaoshuai and Xiaomei run the GPT training process on $D$ in sync and complete the compressed transmission of each datum $x_i$. As long as the above process is repeated, Xiaoshuai can transmit all of the data in $D$ to Xiaomei without loss, achieving lossless compression and decompression of data via an LLM. So we can say that the training process of a GPT model is actually a lossless compression process over its training data; we simply skip the encoding step during normal training.
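
To make the protocol concrete, here is a minimal runnable sketch of the synchronized loop described above. It is only an illustration of the idea, not an implementation of GPT: a tiny adaptive count-based model stands in for GPT, one count update stands in for a backpropagation step, and the ideal code length $-\log_2 P_i(x_i)$ stands in for the arithmetic coder's output. All names are hypothetical.

```python
import math
from collections import Counter

VOCAB = ["MaskNet", "too", "of", "the"]          # toy token dictionary V

class ToyModel:
    """Stand-in for GPT: an adaptive unigram model with add-one smoothing."""
    def __init__(self):
        self.counts = Counter()
    def predict(self):
        total = sum(self.counts.values()) + len(VOCAB)
        return {w: (self.counts[w] + 1) / total for w in VOCAB}   # P_i over V
    def train_step(self, token):
        self.counts[token] += 1                  # stand-in for one backprop step

sender, receiver = ToyModel(), ToyModel()        # same "code, init method, random seed"
data = ["the", "MaskNet", "of", "the", "MaskNet", "too"]

total_bits, recovered = 0.0, []
for x in data:
    P_sender = sender.predict()
    total_bits += -math.log2(P_sender[x])        # ideal arithmetic-coding length of z_i
    P_receiver = receiver.predict()
    assert P_receiver == P_sender                # both sides hold the same distribution,
    recovered.append(x)                          # so the receiver can decode z_i back to x_i
    sender.train_step(x)                         # both models are updated in lockstep
    receiver.train_step(x)

print(recovered == data, f"{total_bits:.2f} bits transmitted (plus the model description)")
```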

Arithmetic Coding Mechanism

The mechanism of arithmetic coding was not explained above; here a simple example is used to illustrate it briefly. As shown in the figure above, suppose the token dictionary $V$ contains 4 words and the original data to be compressed is $x_i=\text{"MaskNet"}$. After GPT runs Next Token Prediction, the probability distribution $P_i$ over the words of $V$ is listed on the left side of the figure; that is, the word to which GPT assigns the highest generation probability for the Next Token at this moment is "too", not the ground truth "MaskNet".

Now, knowing $x_i$ and its corresponding $P_i$, we compress the data with arithmetic coding. First, according to each word's generation probability, we cut the interval from 0 to 1 in proportion to the probability mass of each word: the larger a word's generation probability, the longer the sub-interval it occupies. This gives the lower and upper bounds of the interval covered by each word. For the word "MaskNet" to be encoded, its lower bound is 0.4, and since its own generation probability is 0.2, its upper bound is 0.6. To make the binary encoding as short as possible, arithmetic coding searches, within the interval from 0.4 to 0.6 covered by "MaskNet", for the number whose binary representation is shortest. Clearly, within this interval the number 0.5 has the shortest binary representation, so 0.5 is chosen as the code; converting it to binary gives 0.1, which is the binary arithmetic code for the word "MaskNet". Xiaoshuai only needs to send Xiaomei the binary digit after the point, namely 1.

Next comes the decoding process after Xiaomei receives the binary digit 1. As described above, using her own GPT, Xiaomei obtains the same word probability distribution $P_i$; following the principle of arithmetic coding, she uses this distribution to cut the range from 0 to 1 and obtains the same partition diagram as Xiaoshuai. Xiaomei converts the binary 0.1 to decimal to get 0.5, then checks between which word's upper and lower bounds 0.5 falls in the partition diagram, thereby locating the word "MaskNet" and decoding 0.1 back into the corresponding word $x_i=\text{"MaskNet"}$.

The idea behind arithmetic coding is quite subtle: it encodes the input sequence dynamically and can encode the entire input in binary as a single fractional number, with coding efficiency approaching Shannon's entropy limit. In the scenario we describe, however, since each $x_i$'s corresponding $P_i$ keeps changing during GPT training, any given distribution $P_i$ only needs to compress or decode a single token, so the idea looks very simple. For how to apply arithmetic coding to a long input sequence, see: What is arithmetic coding.
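
Below is a small runnable sketch of the single-token case just described. The four-word dictionary and its probabilities are my own illustrative choice, picked only so that "MaskNet" has probability 0.2 and covers the interval $[0.4, 0.6)$ as in the example above.

```python
import math

# Hypothetical P_i over a 4-word dictionary; "MaskNet" gets 0.2 and covers [0.4, 0.6).
P = {"too": 0.4, "MaskNet": 0.2, "of": 0.25, "the": 0.15}

def token_interval(token, probs):
    """Cut [0, 1) by probability mass and return the [lo, hi) interval of `token`."""
    lo = 0.0
    for w, p in probs.items():
        if w == token:
            return lo, lo + p
        lo += p
    raise KeyError(token)

def shortest_binary_code(lo, hi):
    """Find the dyadic rational m / 2**k inside [lo, hi) that needs the fewest bits k."""
    k = 1
    while True:
        m = math.ceil(lo * 2 ** k)
        if m / 2 ** k < hi:
            return format(m, f"0{k}b"), m / 2 ** k
        k += 1

def decode(value, probs):
    """Xiaomei's step: find which token's interval the decoded number falls into."""
    lo = 0.0
    for w, p in probs.items():
        if lo <= value < lo + p:
            return w
        lo += p

lo, hi = token_interval("MaskNet", P)
bits, value = shortest_binary_code(lo, hi)
print(round(lo, 2), round(hi, 2), bits, value, decode(value, P))
# 0.4 0.6 '1' 0.5 'MaskNet'  -> only the single bit "1" needs to be sent
```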

Compression Is Intelligence

From the explanation above, it can be seen that the higher the generation probability the GPT model assigns to the ground truth $x_i=\text{"MaskNet"}$, the longer the sub-interval it occupies in the arithmetic-coding partition, the easier it is to find a shorter arithmetic code, and thus the higher the model's compression rate. In other words, if the GPT model is more intelligent and its NTP predictions are more accurate, its compression efficiency is higher. We can therefore evaluate a model's intelligence by its compression efficiency: the higher the compression efficiency, the higher the intelligence. This is the core idea according to which OpenAI pushes forward its large-model research and development.

We can consider two extreme cases. In one case, the model has superintelligence: for every ground truth $x_i$ to be predicted by Next Token Prediction, its generation probability is always 1. Suppose that after Xiaoshuai has transmitted a portion of the data $D_s$ to Xiaomei, the model's accumulated intelligence reaches this level. Then for the remaining data $(D-D_s)$, Xiaoshuai no longer needs to transmit anything, because Xiaomei's GPT can already predict every subsequent token correctly all by itself. In this case, the GPT model, by virtue of its superintelligence, has the ultimate data compression ability: it knows what will come next from the preceding context alone. In the other extreme case, GPT has learned no intelligence at all during training, so it relies purely on guessing for Next Token Prediction. If the vocabulary size $|V|$ is $N$, the generation probability of every ground truth $x_i$ is always $1/N$; the GPT model then has no data compression ability whatsoever, and the amount of data to be transmitted equals the information content of the original data $D$.

These are the two extremes. In most cases, the intelligence, or compression ability, of an LLM lies somewhere in between, and we can evaluate the model's intelligence by its compression ability. With a little mathematical derivation, one can show that in this setting, the number of bits required for the arithmetic code $z_i$ corresponding to the data $x_i$, i.e., its code length, is $-\log_2(P_i(x_i))$. Look at this formula for a moment: does it remind you of anything? In fact, it is exactly the cross-entropy loss of the token $x_i$ during GPT training. In other words, from the data-compression perspective, the code length of each token is precisely that token's cross-entropy loss during LLM pre-training; the two are equivalent. Interesting, isn't it? So lossless data compression is another rather novel perspective from which to view LLM training.

We can push the derivation a bit further: for the dataset $D$, after compressed transmission via the LLM, how much data in total does Xiaoshuai send to Xiaomei? The specific formula can be seen in the figure above. As the figure shows, the total transmitted information consists of two parts. One part is the description of the LLM model code, including the code itself, the initialization method, and the random seed. The other part, if the summation is expanded, is as stated above: the number of compressed bits for each $x_i$ is that token's cross-entropy loss, so this part is simply the sum of the losses over all tokens when GPT is pre-trained on the data $D$. Adding the two parts gives the total amount of data transmitted.
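
A small numerical check of the two points above, with made-up numbers: per token, the code length $-\log_2 P_i(x_i)$ equals that token's cross-entropy loss expressed in bits, and the total transmitted amount is the (assumed) model description plus the sum of these per-token code lengths.

```python
import math

# Hypothetical probabilities the model assigned to each ground-truth token x_i.
p_ground_truth = [0.05, 0.30, 0.62, 0.90, 0.15]

per_token_bits = [-math.log2(p) for p in p_ground_truth]      # code length of each z_i
cross_entropy_nats = [-math.log(p) for p in p_ground_truth]   # the usual training loss, in nats

for bits, nats in zip(per_token_bits, cross_entropy_nats):
    assert abs(bits - nats / math.log(2)) < 1e-12             # same quantity, different units

model_description_bits = 8 * 10_000   # assumed size of the transmitted code + seed, in bits
total_bits = model_description_bits + sum(per_token_bits)
print(f"compressed tokens: {sum(per_token_bits):.2f} bits, total: {total_bits:.2f} bits")
```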

So, do different LLM models have different data compression abilities? Obviously yes. The figure above shows the data compression ability of LLaMA models of different sizes (from the smallest 7B to the largest 65B): for the same total amount of training data (for example, the 1000B-token point on the horizontal axis), the total area under each model's loss curve is that model's data compression cost; the smaller the area under the loss curve, the stronger the model's compression ability. With the groundwork laid above, I believe this is easy to understand: we can make the extreme assumption that each training batch contains only one token, so the number of bits needed to encode it is exactly that token's loss value; integrating the area under the model's loss curve gives the total loss over all tokens, which is equivalent to the total compressed code length needed to compress those tokens.

As the figure shows, the larger the LLaMA model, the smaller the area under its loss curve, meaning a higher compression rate and a stronger compression ability, which in turn indicates a higher level of intelligence. A rough estimate puts the current LLaMA model's data compression rate at around 14x, which exceeds the current best compression rate in the dedicated data-compression competition, the Hutter Prize, which stands at about 8.7x. What does this tell us? If we assume that mainstream text compression algorithms rely mainly on surface factors such as word frequency and recurring patterns, then the extra compression rate very likely reflects the LLM's deep understanding of text, coming from its intelligent encoding toward AGI.

Further Thoughts

The above is the line of argument for "compression is intelligence" presented in Jack Rae's talk. When I watched it around March, I was greatly inspired; I felt OpenAI had opened up a bold line of thinking, because I had truly never looked at LLMs from the perspective of data compression, and I believe this is a novel angle for the vast majority of people. However, after consulting the relevant literature, I found that the idea of "compression is intelligence" was not originated by OpenAI; it actually has a fairly long history. For example, the Hutter Prize mentioned above, which encourages research into better data compression algorithms, was founded in 2006; its founder, Marcus Hutter, believes that data compression ability and AI intelligence are an equivalent problem, which is why he funded the prize himself. Moreover, using AI models for data compression is already a small research direction in its own right, with quite a few related papers.

We can think more deeply about two related questions along this line. The first question: the content above evaluates an LLM's level of intelligence from the perspective of data compression, but why does a stronger compression ability mean higher intelligence?

The Minimum Description Length (MDL) principle can explain this. It is an important concept in machine learning and a formalization of Occam's razor ("entities should not be multiplied beyond necessity"). The core idea of MDL is: among the many models that can explain the data at hand, the best one is the model that describes the data as briefly and accurately as possible; the shorter the model's description length, the better its generalization, and the more intelligent we say it is. Why does a shorter description mean more intelligence? Because a short description is the underlying regularity abstracted from the data: compared with the raw data, a description of the data's internal regularities is naturally much shorter, and if a model can give a shorter description, it means the model has learned more of those regularities and is therefore more intelligent. That is the logic; let us give an example. Suppose the sequence to be transmitted is the sequence of the first 10,000 consecutive prime numbers:

2, 3, 5, 7, 11, ...

The gaps between primes are irregular, so Xiaoshuai could only dutifully send all 10,000 number codes to Xiaomei. But in fact Xiaoshuai could use a single sentence, such as "output the 10,000 consecutive primes starting from 2", to describe these numbers, compress this sentence, and send it to Xiaomei. If Xiaomei's LLM is smart enough, upon seeing this sentence it can recover the sequence of 10,000 primes. I believe you can now see the point of MDL.
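
To illustrate the MDL point, the short program below is itself a far shorter description of the data than the 10,000 numbers, provided the receiver can execute it, i.e. "understands" what a prime is. The function name and the trial-division approach are just illustrative choices.

```python
def first_n_primes(n):
    """Output n consecutive primes starting from 2 (trial division; fine for n = 10000)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes

primes = first_n_primes(10_000)
print(primes[:5], "...", primes[-1])   # [2, 3, 5, 7, 11] ... 104729
```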

Of course, the prerequisite is that the LLM must understand the rather abstract concept of a prime number. So can a large model really grasp this kind of abstraction? Is it the case that only large models can understand abstract concepts such as "prime number" while small models cannot? I ran a verification comparing the abilities of a large model and a small model. To prevent the model from completing the task simply by memorizing prime sequences seen in the training data, I rephrased the description to ensure that the model could not have seen this exact statement during training. The prompts, the outputs, and the sizes of the tested models can be seen in the figure below.

It can be seen that GPT-3.5 has learned the abstract concept of prime numbers; otherwise it would be hard for it to answer this question well. A model that does not understand the concept gives an incoherent answer, like the small model on the right. This shows, on the one hand, that large models can indeed learn some abstract concepts, and on the other hand, that large models are indeed better than small models in this respect.

The other question: in his talk, Jack Rae emphasized that this data compression ability of LLMs is lossless, and he rebutted the influential view put forward earlier in the year by the well-known science fiction writer Ted Chiang that "ChatGPT is a lossy compression of the web". Actually, if you think about it carefully, you will find that the claim that an LLM performs "lossless compression" of data is a bit watered down. Looking at it more rigorously, although the LLM training process can be viewed as lossless compression of the data, the "lossless" effect is not achieved by the LLM alone: it is accomplished by "LLM + arithmetic coding" together.

If the LLM became intelligent enough through learning to guarantee that the NTP loss on the subsequent text is 0, i.e., it could predict every subsequent Next Token perfectly from the preceding context, then arithmetic coding would no longer be needed: the LLM alone could completely compress and decode the data. In that case there would be nothing wrong with saying that the LLM training process, or the trained LLM, performs "lossless compression" of data. But that is the ideal situation. Can current LLMs do this? Certainly not, so the Next Token predictions given by the LLM will inevitably contain errors, and those mispredicted tokens represent the information loss of the LLM's compression of the data. This loss is compensated for by arithmetic coding, which additionally encodes the information, and only then is the "lossless compression" effect achieved. So a more precise statement would seem to be:

Lossless data compression = the LLM model's lossy data compression ability + the compensating encoding ability of arithmetic coding

In other words, at least for now, the LLM's encoding of data is still lossy; it cannot achieve lossless compression by its own ability alone. Whether future LLMs can become powerful enough to achieve lossless compression on their own remains unknown.

Data compression is only the means; the goal is for GPT to acquire intelligence through this means. The problem now is that OpenAI has pointed out the means and the goal at the level of basic theory, but has not addressed a more fundamental question: what kind of AGI intelligence does Next Token Prediction, via data compression, allow the GPT model to learn? The rest of this article tries to answer this question.

The Jigsaw Puzzle: Some Currently Known Fragments of Fact

If we liken the AGI intelligence acquired by LLMs to a jigsaw puzzle, then at present we only hold a few scattered pieces of it and cannot yet see the full picture of this machine intelligence. This part collects and introduces the conclusions of existing research from several different angles.

How the GPT Model Extracts Knowledge

Let us first look at how, once an LLM has been trained and a prompt is entered at inference time, the GPT model extracts knowledge. The paper "Dissecting Recall of Factual Associations in Auto-Regressive Language Models" studies this in detail. As shown in the figure, suppose the input prompt is "Beats Music is owned by"; GPT can return the correct answer, Apple, via NTP. In this example, "Beats Music" is an entity, and "owned by Apple" is an attribute of that entity.

The research finds that GPT goes through a clear three-stage process when extracting this piece of knowledge. First, the word "Music" is the last and most critical word describing the entity. As its information flows upward through the Transformer blocks, Attention first integrates the information of the preceding modifier "Beats" into the position corresponding to "Music". Then, as the layers get higher, the FFN of each Transformer block keeps adding information into the embedding at the "Music" position, so that as information flows upward, the embedding at that position can trigger more and more "attribute" words related to "Beats Music". This is the first step, and it takes place mostly in the lower layers of the Transformer.

In the second step, at the position of the word "by", which is the last position and the one where NTP will produce the output token, the GPT model integrates the information of the relation word "owned" into this final position through Attention. Note that the Transformer position of the last word is especially important, because the Next Token output is produced at its top layer; during inference, GPT gradually gathers the important information from the input into this position via Attention. This operation also happens in the lower layers of the Transformer.

In the third step, at the position of the word "by", i.e., the last position, the lower layers have already integrated the information of the word "owned". In the upper layers, Attention then extracts the attribute "Apple" corresponding to "Beats Music". The specific extraction is done by an Attention head, and the paper shows that <entity, attribute> information is encoded in Attention heads (see the figure below for concrete examples). This should be new knowledge for us: it was previously generally believed that Attention is mainly used for comparing and moving information, but this shows that Attention also stores certain kinds of knowledge.

Through the above three steps, GPT completes the process of extracting a piece of knowledge.

Another work "Understanding Transformer Memorization Recall Through Idioms" explores how LLM extracts memory information, including idioms/proverbs (Idioms) that rely entirely on memory and require accurate reproduction, as well as factual knowledge. The research conclusion is that LLM can be divided into two stages for the extraction of memory information: the first stage is that the low-level Transformer Block gradually increases the ranking of the correct answer words until the middle layer ranks first; the second stage is the high-level Transformer. The confidence level increases, that is, the distribution probability score of the correct answer is continuously improved.

Beyond these two works, there are other similar studies of knowledge extraction. Summarizing the existing conclusions, I think we can roughly sketch the outline of knowledge extraction in GPT as follows: once the GPT model is trained and a prompt is entered, then for the input word at a given position, as the computation moves up through the Transformer, GPT uses Attention to integrate information from the preceding context that is relevant to that word into its embedding, while the FFN of each layer transforms the current word embedding to add information; in this way it continually triggers the knowledge stored in the FFNs and refines the embedding of that word layer by layer (similar to what happens at the word "Music" in the example above).

The same holds for the position of the last token. What is special about it is that, going from the bottom layers to the top, it first copies the most critical information from the entire preceding input into its own position via Attention, and then uses this key information at the higher layers to gradually filter out the most important information in the context. At the bottom layers of this position there may be many candidate output answers, with the correct answer not ranked high; as the computation moves up, the correct answer ranks higher and higher and fewer and fewer candidates can compete with it, which is reflected in an ever-increasing probability score assigned to the correct answer, until at the topmost layer of the last token GPT outputs the correct answer (similar to what happens at the word "by" in the example above).
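
One hedged way to observe this "the answer's rank rises through the layers" behavior is a logit-lens-style probe: project each layer's hidden state at the last token position through the model's unembedding and track the rank of the expected answer. The model, the prompt, and the reuse of the final layer norm below are my own illustrative choices, not the exact methodology of the papers cited above.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The Eiffel Tower is located in the city of"
answer_id = tok(" Paris")["input_ids"][0]

batch = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**batch, output_hidden_states=True)

# hidden_states: embedding output plus one tensor per layer, shape (1, seq_len, hidden)
for layer, h in enumerate(out.hidden_states):
    last = model.transformer.ln_f(h[0, -1])      # reuse the final layer norm (logit-lens style)
    logits = model.lm_head(last)
    rank = (logits > logits[answer_id]).sum().item() + 1
    print(f"layer {layer:2d}: rank of ' Paris' = {rank}")
```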

Distribution of knowledge points in Transformer

This part introduces how knowledge points are distributed within the Transformer structure, that is, how knowledge points of different types, or specific knowledge points, are distributed across different Transformer layers. Understanding this is very helpful for understanding GPT's inner working mechanism.

Before presenting the research conclusions, to make them easier to follow we first explain three basic concepts (see Toy Models of Superposition): monosemantic neurons, polysemantic neurons, and superposition.

It has been found that LLMs contain many individual neurons each of which responds only to one specific knowledge point in the input; that is, it is activated only by a particular input pattern and stays silent for all other, unrelated inputs. One neuron encodes one piece of knowledge, in perfect one-to-one correspondence; neurons of this kind in the Transformer are called "monosemantic neurons" (this is rather similar to the neuron mechanism in the human brain). Conversely, there are also large numbers of neurons that encode multiple meanings: many knowledge points with different linguistic meanings all activate the same neuron. These are called "polysemantic neurons". The figure above gives examples: some neurons respond only when the input prompt is written in French, a typical "monosemantic neuron", while other neurons respond to many semantically very different 2-gram language fragments, a typical "polysemantic neuron".

The concept of superposition means the following: assuming the number $n$ of features to be encoded is much larger than the number of network dimensions $d$, a way can be found to use $d$-dimensional neurons to encode the $n$ features, even though $n$ is much larger than $d$. This encoding mechanism is called superposition, and it is an information-compression encoding mechanism found in Transformer structures.

Superposition and "multi-semantic neurons" are closely related. It is currently found that LLM does this internally (refer to Finding Neurons in a Haystack: Case Studies with Sparse Probing): As shown in the figure above, the Superposition mechanism of LLM is composed of multiple "multi-semantic neurons". Each neuron will respond to multiple different knowledge points in the input, so it is impossible to detect who is currently responding to it through only one "multi-semantic neuron", but if there is Multiple "multi-semantic neurons" that respond to a certain knowledge point, and a linear combination of their responses can detect the knowledge point we want to identify in the input (the blue part in the figure above) . That is to say, LLM encodes a specific feature or knowledge point by combining multiple "multi-semantic neurons". Therefore, the relationship between "multi-semantic neurons" and knowledge points is a many-to-many mapping. A knowledge point will stimulate many "multi-semantic neurons" that encode it, and a "multi-semantic neuron" will also Generate responses to multiple input knowledge points.

With these three basic concepts in hand, we can state the current research conclusion: in a trained GPT model, the lower Transformer layers encode a large number of specific features or knowledge points, such as n-gram features and syntactic features, encoded via the superposition mode built out of multiple "polysemantic neurons" described above. As the layer depth increases, specific knowledge points gradually become fewer and abstract knowledge points (such as "French" or "prime number") gradually become more numerous; abstract knowledge points are generally encoded independently by "monosemantic neurons", and the higher the layer, the more abstract the encoded features. In other words, in the way the Transformer encodes features or knowledge points there is a process of knowledge abstraction that becomes more and more abstract from bottom to top. This phenomenon is also mentioned in OpenAI's recent article "Language models can explain neurons in language models".

In addition, the article "Polysemanticity and Capacity in Neural Networks" pointed out that in the process of model learning, in order to increase the utilization efficiency of model parameters, "single semantic neurons" will be assigned to important features, and "multi-semantic neurons" will be assigned For less important features, the model does not encode at all for even less important features. The so-called "importance" refers to the impact on training loss, that is to say: "single semantic neuron" has a greater impact on reducing loss during NTP training. This shows that the abstraction of features or knowledge points is an internal driving force of NTP itself to quickly reduce Loss, and this is likely to be one of the keys for the GPT model to generate intelligence through the Next Token Prediction task.

Evidence for the Existence of Knowledge Circuits in GPT

This section introduces work on the knowledge circuits within LLMs that carry out specific tasks. A "circuit" here means the following: after the prompt for some task is fed into the Transformer, as information propagates from bottom to top until the Next Token is output at the top of the last token's position, there exist certain key paths in the network for completing this task; information mainly propagates along these paths, continually being transmitted or processed, and in this way the task is completed via NTP. As you read the introductions below, you will find that the way LLM knowledge circuits work is actually very similar to certain information-processing circuits in the human brain. The large number and variety of knowledge circuits formed during GPT's NTP pre-training are likely another key to uncovering the mystery of AGI.

The work "How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model" mainly asks why the GPT model acquires mathematical ability through pre-training. Specifically, using a prompt like "The war lasted from the year 17YY to the year 17", the GPT model outputs a Next Token year number XX that is greater than YY, which shows it has learned the comparison relationship between numbers during pre-training. The investigation finds that during pre-training the model forms a knowledge circuit for solving this problem, as shown on the right of the figure above. There are two key parts. The first is certain Attention heads in the middle layers; for example, a5.h5 in the figure denotes the 5th Attention head of Transformer layer 5. These heads mainly attend to the YY year and propagate it to higher layers. The other key part is the MLP layers from layers 8 to 11, which carry out the "greater than" computation, so that GPT can finally output the correct result. Moreover, there is a corresponding transfer relationship between the middle-layer Attention heads and the upper MLPs; for example, the layer-9 MLP receives its information mainly from a9.h1, while the layer-8 MLP has more diverse information sources. It can be seen that the information forms a specific propagation path from bottom to top.

Digging deeper, one finds that it is certain key neurons in the MLPs that perform the mathematical operation. As shown on the right of the figure above, one can identify the 10 most influential neurons in the layer-10 MLP; using only these 10 neurons, this layer can roughly complete the "greater than" computation, while the left figure shows that the Attention head a7.h10 mainly attends to the key information "YY". In addition, the study found that not only the prompt above but also other prompt forms expressing a numerical comparison activate this same circuit, which suggests that this circuit may be specialized for comparing numbers.

Most knowledge circuits should be composed of both Attention and MLPs, but some circuits based mainly on Attention have also been found. A typical example is the "Induction Head" circuit, whose existence has been demonstrated by several studies. Its main function is that when GPT predicts the Next Token, it tends to find a similar output pattern in the preceding context and copy it into the subsequent token output. In the sentence shown above, the second "so" is the last token, and GPT is about to generate the following token via NTP. The "Induction Head" circuit tends to find the same word "so" earlier in the context and output the word "bad" that follows that earlier "so" as the Next Token. The work "Localizing Model Behavior with Path Patching" explores the inner mechanism of the Induction Head: when predicting the Next Token at the second "so", the content of "so" itself is copied into the Attention Query (of <Query, Key, Value>) at that position, while for the word "bad" appearing in the preceding context, an Attention head PTH (Previous Token Head to key) integrates the semantics of the content preceding "bad" into the Key corresponding to "bad". As a result, when "so" performs Attention, the two obtain a high similarity score, so "bad" is copied via Attention to the position of the word "so", which makes it easy for the Next Token to output "bad". This achieves the goal of copying "so...bad" from the preceding context.
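
A hedged sketch of how one might look for this copy behavior in GPT-2 small: score every attention head by how strongly the final, repeated token attends to the token that followed its first occurrence. The prompt and the scoring heuristic are my own illustrative choices, not the methodology of the papers cited above.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# The first " so" is followed by " bad"; the prompt ends with a second " so".
text = "The weather was so bad that we stayed in, and the next morning it was also so"
ids = tok(text, return_tensors="pt")["input_ids"]
with torch.no_grad():
    out = model(ids, output_attentions=True)

q = ids.shape[1] - 1                                       # query position: the final " so"
first = (ids[0, :-1] == ids[0, -1]).nonzero()[0].item()    # first occurrence of " so"
key = first + 1                                            # the token after it (" bad")

# Score every head by how strongly the final " so" attends to that " bad" position.
scores = [(layer, head, att[0, head, q, key].item())
          for layer, att in enumerate(out.attentions)
          for head in range(att.shape[1])]
layer, head, weight = max(scores, key=lambda s: s[2])
print(f"strongest induction-style head: layer {layer}, head {head}, attention {weight:.2f}")
print("model's next-token guess:", tok.decode([out.logits[0, -1].argmax().item()]))
```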

Besides the "Induction Head", there are Attention circuits with more complex functions. For example, the work "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small" found an Attention-dominated knowledge circuit in the Transformer for "Indirect Object Identification". What "Indirect Object Identification" means can be seen from the example in the figure above: the input contains two entities, one repeated and one not, and the task is to find the correct answer among them. From the example it can be seen that GPT can output the correct answer, Mary, and the reason is that the model has learned a complex identification circuit composed mainly of Attention heads.

As shown in the figure above, the "Indirect Object Identification" circuit identifies the correct answer mainly in three steps: first, Duplicate Token Heads mark tokens that appear multiple times in the sentence, with Induction Heads playing a similar role; second, S-Inhibition Heads act at the position where the Next Token is output, removing or suppressing the repeated names from the attention of the Name Mover Heads; finally, the Name Mover Heads output the remaining name token. From this it can be seen that during pre-training, in order to predict the Next Token better, the LLM has learned very complex Attention knowledge circuits that copy certain input tokens and emit them in the Next Token Prediction output.

OpenAI's chief scientist Ilya Sutskever said in an interview: "When we trained an LSTM to predict the next character of Amazon reviews (NTP), we found that if you predict the next character well enough, the LSTM will have a neuron corresponding to sentiment. This is a good demonstration of the power of unsupervised learning, and it also validated the idea of predicting the next character. This discovery had a great impact on us." My understanding is that the existence of sentiment neurons in the network probably means that, through the NTP training task, a knowledge circuit for sentiment judgment formed inside the model. This discovery (see Learning to Generate Reviews and Discovering Sentiment) was indeed an important inspiration for OpenAI's later decision to replace the LSTM with a larger Transformer and pre-train on more data with NTP.

At present there is still relatively little work exploring the knowledge circuits inside GPT models. I personally think this matter is especially important. For example, I would guess there is a good chance that a complex logic circuit exists which can explain the Chain of Thought (CoT) phenomenon, and that this circuit likely forms after program code or science and engineering papers are introduced into the pre-training data. Because the logical relationships in such data are tight, in order to reduce the loss quickly and predict the subsequent tokens accurately, the NTP task may force the model to generate a large number of abstract knowledge-point concepts internally and to form complex logic circuits on top of them. I feel that work in this direction is very valuable and deserves to be strengthened.

Differences in the Knowledge Points Learned by LLMs of Different Scales

This section summarizes the research conclusions on how the knowledge points learned by LLMs differ across model sizes.

The paper "Finding Neurons in a Haystack: Case Studies with Sparse Probing" mentions an interesting phenomenon: take the abstract feature "French" (used to judge whether the input is written in French), encoded by a "monosemantic neuron", block that neuron, and observe the impact on the loss of GPT's Next Token Prediction task. If the loss increases after blocking it, the feature is important to the model. Interestingly, after blocking, the small model's loss increases a lot, but for the large model the effect is small. This shows that the feature is important for small models but not so important for large ones.

This phenomenon seems odd, and the paper offers an explanation: as model size increases, feature splitting occurs. That is, in the small model a certain knowledge point is represented by a single coarse-grained neuron responding on its own, whereas the large model refines this knowledge point into multiple neurons, each of which is activated only when a specific context occurs. In other words, to represent the same knowledge point, the large model represents it in a more fine-grained way than the small model.

For example, a small model may have only one neuron that responds to "return" in the input, whereas a large model may differentiate neurons that respond to "return" in different programming languages: one neuron responds to "return" in Python, another to "return" in C++, and so on. Therefore, when a small model has a certain feature blocked, the impact is large, because the model then cannot capture this knowledge point in the input at all, which greatly affects the loss; but for a large model, blocking this feature has little impact, because it has also split off neurons that respond in different contexts, so even if this neuron is disabled, other neurons still represent the various cases. I think this research conclusion is very important; it reveals a significant difference in the knowledge representation abilities of large and small models.
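
Below is a hedged sketch of the kind of ablation just described, for GPT-2 small via the Hugging Face transformers API: zero out one hidden unit of one MLP layer with a forward hook and compare the NTP loss with and without the ablation. The layer index, neuron index, and prompt are arbitrary placeholders, not neurons identified in the cited paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "def add_one(x):\n    return x + 1\n\ndef double(x):\n    return x * 2"
batch = tok(text, return_tensors="pt")

def nll(ablate_layer=None, neuron=123):
    """NTP loss on the snippet, optionally with one MLP hidden unit zeroed out."""
    handle = None
    if ablate_layer is not None:
        def hook(module, inputs, output):
            output[..., neuron] = 0.0          # zero one hidden unit of the first FFN projection
            return output
        handle = model.transformer.h[ablate_layer].mlp.c_fc.register_forward_hook(hook)
    with torch.no_grad():
        loss = model(**batch, labels=batch["input_ids"]).loss.item()
    if handle is not None:
        handle.remove()
    return loss

print("baseline loss:", nll())
print("ablated  loss:", nll(ablate_layer=5))   # placeholder layer/neuron, just to show the mechanics
```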

In addition, other research conclusions show that as model size increases, a larger proportion of "monosemantic neurons" is detected. I think this suggests that the larger the LLM, the more abstract knowledge gets encoded by independent, dedicated neurons.

Another document, "The Quantization Model of Neural Scaling", imagines that according to the degree of impact on NTP Loss, we can sort the knowledge units (referred to as "quantum units" in the text) from important to unimportant to form a Q queue. The LLM model will give priority to learning the quantum units that are ranked first in the Q queue, and for the large model, it can learn more quantum units that are not so important in the Q queue than the small model. The core idea I summarized is that the large model can learn more less important features than the small model.

These are the conclusions about differences in representational ability across model scales that can be drawn from the current literature.

Under the Iceberg: Circuit Competition Conjecture (CCC)

If we piece together the bits of evidence reflected in the currently known jigsaw pieces, I feel that the part of the mechanism hidden beneath the iceberg is dimly taking shape before us. This part makes some inferences based on known research conclusions and proposes the "Circuit Competition Conjecture" (CC Conjecture) as an explanation of the internal mechanism by which GPT builds intelligence through Next Token Prediction. I have tried to give references for the key points and to spell out the reasoning behind each inference, so that the conjecture is grounded in existing research conclusions; but on the whole it remains an untested conjecture, so please refer to it with caution.

Circuit Competition: The Breakout of Task Circuits

First, let us summarize the known research conclusions to form an overall picture. In this article I refer to any feature or piece of knowledge collectively as a "knowledge point", because the traditional notion of a "feature" alone does not cover some of this content well. Specific knowledge points include linguistic knowledge points (n-grams, morphology, syntax, semantics, etc.), contextual knowledge points (such as "the input is French"), knowledge points related to world knowledge (entities and attributes, common sense, events, etc.), and simple function-circuit knowledge points. They are fine-grained, and we refer to them collectively as knowledge points.

From the content above, it can be seen that through the NTP task the GPT model learns knowledge from data and establishes two kinds of knowledge systems inside the model: a hierarchical knowledge structure and a variety of task circuits (see the figure above). Task circuits are built on top of the hierarchical knowledge structure: a task circuit is a fixed path, formed by knowledge points exciting one another, for solving a certain task.

Assuming the GPT model has been trained, we can clearly detect their existence. First, these knowledge points sit at different levels of abstraction. The lower in the Transformer a knowledge point is stored, the more concrete, the more reusable, the more general-purpose, and the more numerous it is, and the more easily it is encoded via dense schemes such as superposition and polysemanticity; the higher in the Transformer a knowledge point is stored, the more abstract it is, the less it is reused, the more task-specific it is, and the more it tends to be encoded on its own by a "monosemantic neuron" (in the figure above, white nodes in the Transformer represent concrete knowledge points and red nodes represent abstract ones).

Second, certain knowledge points form bottom-up excitation relationships: the excitation path runs from less abstract knowledge points in the lower layers to increasingly abstract knowledge points in the upper layers, layer by layer. For example, a knowledge point encoded at layer L of the Transformer can be activated by other excited knowledge points in layers 1 through L-1. An activated neuron, besides collecting, synthesizing, and abstracting the information passed up to it, may also add new knowledge through its own FFN structure (for instance, retrieving world knowledge) or perform mathematical or logical computation (for instance, comparing the magnitudes of numbers). A trained GPT model contains a vast number of such "micro-excitation structures" made of "local" knowledge points; these should be the basic units from which GPT's intelligence is formed, and the GPT structure as a whole thereby builds a world-knowledge structure that encodes knowledge of the world hierarchically. Training the model with the NTP objective is, in fact, the process of gradually establishing an ever more complex hierarchical knowledge structure, from simple to complex, from general to specialized, from concrete to abstract, from lower layers to upper layers, including the knowledge points themselves and the micro-structures produced by the excitation relationships between them. The reason these arise is that their existence helps NTP predict the subsequent tokens more accurately, i.e., it helps the GPT model lower its training loss during NTP.

On this basis we can re-examine how task circuits form. A task circuit should be something GPT forms in order to predict the Next Token of a particular kind of data more accurately: starting from the Transformer's input layer, relevant "excitation micro-structures" are linked together layer by layer, forming a complete pathway that is excited from bottom to top and ultimately connects to the output position to determine the output token probabilities (see the task pathway outlined by the red lines in the figure above). Having learned such a task circuit, if GPT later encounters data of this kind again, its Next Token prediction accuracy increases, which shows up as a lower NTP loss. For example, if the training data contains a large number of arithmetic examples like "13+24=37", GPT will most likely learn a task circuit for simple arithmetic, thereby increasing the Next Token prediction accuracy for the digits after the equals sign.

In addition, the Transformer blocks at each layer corresponding to the last token of the input may have some special meaning and role: through the Attention mechanism, they may aggregate the information of the entire preceding input. If the input prompt is for completing a specific task, then the Transformer blocks at the last token's position roughly aggregate the task-circuit information, layer by layer, into that final position, so that the correct Next Token prediction can be made at the top layer of the last token. It is as if the last token sketches out, from the Transformer's vast body of knowledge and according to the input prompt, a "prompt sub-world".

The content above is a macro-level synthesis of current research conclusions, showing how much we currently understand about how GPT operates. From here on, I begin to add some inferences of my own.

The first question is: during GPT training there are so many knowledge points, and there must be some order in which it learns them. What priority order does it follow? Although some current research concludes that important knowledge points are learned first, "importance" here usually means importance for reducing the loss of GPT's NTP task: the more a knowledge point reduces the loss, the more important it is. From the loss-reduction perspective this is certainly correct, but it is still too abstract.

I personally think that during training, the GPT model will give priority to learning knowledge points with the following characteristics: high-frequency knowledge points, general-purpose knowledge points (those with a high probability of being reused are the general-purpose ones), and concrete rather than abstract knowledge points. It should follow these three principles. Why? According to the way Next Token Prediction works, the more frequently a knowledge point appears, the more often GPT, whenever it predicts it wrongly, performs backpropagation to correct the model parameters so that it gets it right the next time a similar situation appears. Because high-frequency knowledge points occur many times, they receive many backpropagation updates, so the corresponding knowledge points and their connection paths to other knowledge points are established more easily; and once a high-frequency knowledge point has been learned, it will be encountered often in subsequent training data, so it contributes greatly to reducing the NTP loss. The same reasoning applies to the other two kinds. General-purpose knowledge points, being widely applicable, have more opportunities to be used in subsequent predictions, so they too receive many backpropagation updates and are easily learned by the model; concrete, non-abstract knowledge points are also easily established because they are seen often in the training data; and so on. Conversely, low-frequency, domain- or task-specific, and abstract knowledge points will be learned by the GPT model later. In other words, to learn such knowledge points, the model needs to see a larger amount of data, to increase the chances that these knowledge points receive the backpropagation parameter updates needed for learning.

Next, we formally discuss the "circuit competition" conjecture. Before stating it, let me make an assumption:

Hypothesis: to improve the parameter utilization of the GPT model, the NTP task encourages the reuse of sub-circuits.

A "sub-circuit" is a circuit that performs a simple computation; it involves relatively few knowledge points, and the excitation structure among them is fairly simple. The GPT model probably first produces many sub-circuits for completing simple tasks or computations, while complex circuits are formed by further connecting many sub-circuits. To increase the efficiency with which model parameters are used, the GPT model should encourage these sub-circuits to be reused as much as possible across different complex circuits, so that with the same number of parameters it can complete more kinds of tasks. For example, the "Induction Head" circuit discussed above is a typical sub-circuit; as we saw, it is one component of the more complex "Indirect Object Identification" knowledge circuit, and the relationship between sub-circuits and complex circuits is roughly like this example.

For two complex circuits solving different tasks, sub-circuit reuse means some of the same sub-circuits appear in both; we can call these shared sub-circuits "overlapping circuits". It is easy to infer that the closer two tasks are, the more overlapping circuits they have. Overlapping circuits should also appear more often in the lower layers of the Transformer, because lower-layer circuits involve knowledge points that are more concrete, more numerous, and more reusable. The figure above illustrates "sub-circuit reuse and overlapping circuits": the red lines (the red task) and the blue lines (the blue task) represent two different complex task circuits, while in the lower layers some sub-circuits are reused by both.

As for the "circuit competition" conjecture itself, let us use the figure above to explain it. Suppose we input a prompt that is meant to perform the red task. As the information excites the correct pathway layer by layer from the bottom up, the lower a knowledge point or sub-circuit is, the more reusable it is, so "excess excitation" easily occurs: besides exciting the red task we want, many knowledge points and sub-circuits leading toward other task circuits are also excited. This is most pronounced in the lower layers; as the information propagates upward, the red circuit is gradually reinforced, fewer and fewer upper-layer knowledge points and sub-circuits belonging to incorrect circuits are excited, and eventually the path of the correct red task circuit is traced out. This is the basic idea of the "circuit competition" conjecture.

If, during this bottom-up excitation process, the correct circuit we want is activated, we can say the circuit has won the competition and the model outputs the correct answer; if a wrong task circuit is activated instead, the circuit has lost the competition and the model outputs a wrong answer. It can be inferred that the more complex a task is, the more knowledge points and sub-circuits it involves and the more complex the relationships among them, the more easily it overlaps with other similar task circuits, and the more easily its circuit loses the competition.

We can think about many problems and phenomena of LLMs within the framework of "circuit competition" and offer explanations for them. Later in this article, we use this conjecture to explain some currently unexplained LLM phenomena.

Differences in Model Scale: Bigger Models, Clearer Worlds

Based on existing research conclusions, if we think about the difference between large and small LLM models, we can roughly make the following inference: a small LLM builds a coarse-grained, blurry image of the world, and as the model size grows, the large LLM builds an image of the world with ever higher resolution, able to represent more and more detailed information.

From the above, it can be seen that an LLM's representational ability is mainly reflected in two aspects: a hierarchical knowledge structure going from concrete to abstract, and the task circuits that solve a wide range of problems. Let us look at the differences between large and small models from these two aspects in turn.

**Differences in the hierarchical knowledge structure:** Many research conclusions have shown that as model size increases, the model becomes more and more sparse. Polysemantic neurons encode features densely and are used to encode a large number of relatively specific features, while monosemantic neurons are single-neuron representations and are sparse; this indicates that as the model scale grows, the proportion of monosemantic neurons increases. Monosemantic neurons encode important, abstract knowledge, and since their number increases, the knowledge points learned by the model must also have increased. There are essentially two possible sources of new knowledge points. The first source is knowledge the small model did not learn before but the large model has now learned, i.e., new knowledge learned from scratch. This kind of new knowledge can be further divided into two categories. One category is world knowledge (common sense and events): a small model cannot encode world knowledge that appears infrequently in the data, whereas a large model can use monosemantic neurons for it (that large models can learn more low-frequency knowledge from the data than small models can has been verified by a good deal of work, and at present world knowledge appears to be encoded by individual neurons); this kind of knowledge means the large model has learned more detailed information about the world. The other category is more abstract knowledge newly induced by the model from the data (such as "prime number"); this kind of knowledge means the large model has learned more numerous and more complex abstract knowledge or capabilities.

The other source of new knowledge points should be the feature splitting of existing abstract features mentioned above. That is, where the original small model had only one coarse-grained abstract knowledge point, the larger model derives some new fine-grained knowledge points representing the same type of knowledge, possibly learning a corresponding knowledge point for each different context. For example, it has been found that an LLM contains a monosemantic neuron that responds to consecutive uppercase characters: if the input contains "ABCD", this neuron is activated. A small LLM may have only this one neuron responding to such input; if it is deactivated, the loss rises sharply when GPT does NTP, indicating that without this feature the model gets subsequent predictions of consecutive uppercase characters wrong. A large LLM, however, in addition to this neuron, also splits off fine-grained representation neurons: for example, there may be one neuron responsible for responding to the company abbreviation "IBM", another for medical abbreviations such as "GS (glucose injection)", and so on. This splitting of abstract features in large models illustrates one point: even for abstract knowledge, large models have finer-grained abstract feature representation than small models.

It can thus be seen that, compared with a small model, a large model can be regarded, from the perspective of encoding low-frequency world knowledge, as having learned more detailed information about the world, and, from the perspective of new abstract knowledge and the splitting of abstract features, as having acquired the ability to express more difficult and more fine-grained abstract knowledge.

**Differences in task circuits:** Task circuits are the bottom-up excitation-and-connection pathways established among the knowledge points that form the hierarchical structure. From the above analysis of the differences in hierarchical knowledge structure between large and small models, a reasonable inference follows: a large LLM is very likely able to build circuits whose paths involve more fine-grained abstract knowledge points and are more complex. This is probably the main reason why large models can solve complex problems.

Combining the two, we can say that a small model builds a coarse-grained model of the world, while a large model builds a fine-grained, high-definition model of the world. The scaling law shows that with more data and larger model size, an LLM can describe the world at ever higher resolution. From this point of view, it is not much of a stretch to say that the LLM's parameters are a lossy compression of the world.

The endless frontier: explaining unknown phenomena using 'loop competition'

In this part, we explain some phenomena of the current LLM model under the framework of "loop competition".

Model emergence from the perspective of "loop competition"

Model emergence refers to the observation that for some tasks (mostly In Context Learning or COT-related tasks), small models have almost no ability to solve them, and only when the model size reaches a certain critical point can the task be done well. Recent research (see "Are Emergent Abilities of Large Language Models a Mirage?") argues that the so-called "emergent abilities" are an artifact of unreasonable metric choices, and that there is in fact no emergence, only metrics that are not smooth enough for the tasks selected. I personally think this explanation probably does account for some of the tasks that currently exhibit "emergent abilities", but it may not be the whole story. Some tasks may be hard to explain by this reason alone, so why large language models show emergent abilities still deserves further study.
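
As a concrete illustration of the metric argument, here is a toy calculation, with made-up numbers of my own, showing how a smoothly improving per-token accuracy can look like a sudden jump once the metric requires every token of the answer to be correct:

```python
# Toy illustration (not from the cited paper): per-token accuracy p grows smoothly
# with scale, but the exact-match metric p**L (all L answer tokens correct) looks
# like a sudden, "emergent" jump.
import numpy as np

scales = np.logspace(7, 11, 9)                     # hypothetical parameter counts
ratio = scales / scales.min()
per_token_acc = 1 - 0.9 * ratio ** -0.25           # smooth, power-law-like improvement
L = 10                                             # answer length in tokens
exact_match = per_token_acc ** L

for n, p, em in zip(scales, per_token_acc, exact_match):
    print(f"params={n:.0e}  per-token acc={p:.3f}  exact match={em:.5f}")
# The per-token column improves gradually; the exact-match column stays near zero
# for most scales and then rises sharply at the largest scales.
```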

If we look at this problem under the framework of "loop competition", there are two possible reasons why a small model cannot do a certain task. One possibility is that in the small model the activation circuit corresponding to this task has not been established, while in the large model it has. The other possibility is that the small model has also established the circuit corresponding to the task, but that circuit loses the circuit competition very easily, making it appear that the model cannot do the task.

I am more inclined to think that the first possibility is the cause of the "emergent abilities" we see now. As mentioned above, a small model probably builds a coarse-resolution, blurry image of the world, while a large model builds a high-resolution, higher-definition one. Small models likely have difficulty establishing a complete activation circuit for certain tasks, and these difficulties may show up in several ways. For example, one or more relatively abstract conceptual knowledge points that are key to forming the circuit may never be established, because the small model's abstraction ability is weak (similar to the "prime number" example at the beginning of this article). For another example, tasks that typically exhibit emergent abilities are relatively complex, and the small model is not capable of building the complex pathways they require. When the model scale grows, its ability to construct abstract concepts and complex circuits is enhanced; once a complete activation path for solving a task is established, the model suddenly seems able to solve the problem, which is what we perceive as emergence. However, it is quite likely that such a complex circuit is not yet strong at winning the activation competition, so when a few task-related examples help the task's circuit win the competition, we see a much better solution.

In Context Learning and Chain of Thought (COT) from the perspective of "loop competition"

Looking at ICL from the perspective of circuit competition, there may be two types of circuits involved: the task circuit and the attention circuit. The two compete or cooperate to determine ICL performance. COT is a special case of ICL, and its mechanism should be similar.

Let's first look at the function of the task circuit, which is actually easy to understand. In Context Learning first gives the LLM a few task-related examples:

$\langle x_1, y_1\rangle, \langle x_2, y_2\rangle, \ldots, \langle x_n, y_n\rangle$, and then inputs $x_{n+1}$, expecting the model to output the corresponding correct result $y_{n+1}$. The role of the $n$ examples given in the input is to activate the corresponding task circuit that the LLM learned during pre-training, so that when $x_{n+1}$ is then input, it easily follows this activated pathway to produce the correct output $y_{n+1}$. The role of COT should be similar: without COT, the LLM may activate a task circuit with a simple structure, whereas a COT example easily activates a complex reasoning circuit containing many detailed representations, so that the subsequent input also follows this pathway and produces detailed reasoning steps. It can be seen that in the ICL scenario, the task circuit always plays a positive role in generating the correct answer for $x_{n+1}$.
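
As a concrete picture of this input format, here is a minimal sketch, with hypothetical demonstrations of my own, of how the $\langle x_i, y_i\rangle$ pairs and the query $x_{n+1}$ are laid out in a prompt:

```python
# Minimal sketch of the ICL input format: n demonstration pairs <x_i, y_i>
# followed by the query x_{n+1}. Demonstrations and labels are hypothetical.
demonstrations = [
    ("The movie was wonderful.", "positive"),
    ("I wasted two hours of my life.", "negative"),
    ("An instant classic, I loved it.", "positive"),
]
query = "The plot made no sense at all."

prompt = ""
for x_i, y_i in demonstrations:           # the <x_i, y_i> pairs that activate the task circuit
    prompt += f"Review: {x_i}\nSentiment: {y_i}\n\n"
prompt += f"Review: {query}\nSentiment:"  # x_{n+1}; the model is expected to output y_{n+1}

print(prompt)
```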

Now let's look at the attention circuit. This is also just a conjecture (the In-context Learning and Induction Heads work aims to explain the ICL phenomenon through Induction Heads, but I think the Induction Head mechanism is too simple and probably needs to be strengthened a bit). Suppose there is an enhanced version of the Induction Head circuit, which we may call the "Enhanced Induction Head" (EIH). Its operating mechanism is likely as follows: based on the semantic similarity between the current input $x_{n+1}$ and the $x_i$ in each ICL example, the EIH circuit copies the corresponding $y_i$; the higher the similarity between $x_{n+1}$ and $x_i$, the greater the probability that the corresponding $y_i$ is copied. This process is somewhat like a KNN model built out of the EIH circuit: it only needs to vote using the similarity between input examples and their corresponding labels to arrive at the answer, and the model does not need to modify its parameters to learn the mapping function between $x$ and $y$. It can be regarded as a conditional Induction Head copy operation, where the triggering condition is the attention similarity between the input examples $x$. It follows that which label is output for $x_{n+1}$ should mainly depend on these kinds of examples in the ICL prompt: examples more similar to $x_{n+1}$ have more influence; labels $y$ that appear more often in the ICL prompt have more influence; and examples closer to $x_{n+1}$ have more influence (the positional information encoded by position embeddings and the pervasive local correlations in NLP presumably lead to this last effect).
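
To make the KNN analogy concrete, here is a toy sketch of similarity-weighted label copying. It is my own illustration, not code from the Induction Heads work; the crude bag-of-words "embedding" merely stands in for attention similarity inside the transformer.

```python
# Toy sketch of the EIH-as-KNN idea: the predicted label for x_{n+1} is a
# similarity-weighted vote over the example labels.
import numpy as np
from collections import defaultdict

examples = [("the movie was great", "positive"),
            ("terrible plot and acting", "negative"),
            ("i really enjoyed this film", "positive")]
query = "a great film really enjoyed it"

vocab = sorted({w for x, _ in examples for w in x.split()} | set(query.split()))

def embed(text):
    v = np.array([text.split().count(w) for w in vocab], dtype=float)
    return v / (np.linalg.norm(v) + 1e-9)

q = embed(query)
votes = defaultdict(float)
for x_i, y_i in examples:
    votes[y_i] += float(q @ embed(x_i))   # similarity-weighted "copy" of label y_i

print(max(votes, key=votes.get), dict(votes))
```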

If the EIH circuit really exists, then based on the operating mechanism above, we can infer how the attention circuit affects the correct prediction $y_{n+1}$ in the following three cases (a toy simulation of all three follows Case 3 below):

Case 1: If the labels $y$ of the ICL examples $x_1$ through $x_n$ are the ground-truth labels, the EIH circuit clearly has a positive effect: as described above, the KNN-like mechanism makes its judgment based on the $y$ labels of the examples $x_1$ through $x_n$.

Case 2: If the labels of the ICL examples are not the ground-truth labels but are assigned at random from the label space, the EIH circuit should clearly have a negative effect on $x_{n+1}$ getting the correct answer: $x_{n+1}$ will look among the preceding examples $x_1$ through $x_n$ for content similar to itself and copy the corresponding label, but that label was assigned at random and is therefore most likely wrong, so in this case the EIH circuit should hurt performance.

Case 3: If the labels of the ICL examples come from a different label set outside the original label space, but still map consistently to $x$, then the EIH circuit should have a positive effect, for much the same reason as in Case 1: the KNN-like mechanism can learn this mapping and thus obtain the correct $y_{n+1}$, except that what it now outputs is $z_{n+1}$ rather than $y_{n+1}$. Of course, if you still evaluate performance against the original $y$ labels, then ICL will certainly look harmful.
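
To see how the three cases play out under this conjecture, here is a toy simulation of my own, purely illustrative, where a similarity-weighted vote stands in for the EIH circuit. Note that it ignores the task circuit entirely, so it only shows the EIH side of the competition.

```python
# Toy simulation of the three cases: ground-truth labels, random labels, and a
# remapped label set ("z" labels). Accuracy is measured in whatever label space
# the EIH mechanism copies from.
import random
import numpy as np

random.seed(0)
np.random.seed(0)

centers = {"positive": np.array([1.0, 0.0]), "negative": np.array([0.0, 1.0])}

def sample(label):
    return centers[label] + 0.3 * np.random.randn(2)

def eih_predict(examples, query):
    votes = {}
    for x_i, y_i in examples:
        sim = float(query @ x_i) / (np.linalg.norm(query) * np.linalg.norm(x_i))
        votes[y_i] = votes.get(y_i, 0.0) + sim
    return max(votes, key=votes.get)

def run_case(example_label_fn, target_fn, n_trials=300):
    correct = 0
    for _ in range(n_trials):
        true_labels = [random.choice(["positive", "negative"]) for _ in range(8)]
        examples = [(sample(t), example_label_fn(t)) for t in true_labels]
        q_label = random.choice(["positive", "negative"])
        pred = eih_predict(examples, sample(q_label))
        correct += (pred == target_fn(q_label))
    return correct / n_trials

identity = lambda t: t
remap = {"positive": "A", "negative": "B"}.get

print("case 1 (ground-truth labels):           ", run_case(identity, identity))
print("case 2 (random labels, judged vs true y):", run_case(lambda t: random.choice(["positive", "negative"]), identity))
print("case 3 (remapped z labels, judged vs z): ", run_case(remap, remap))
# Cases 1 and 3 score well above chance; case 2 hovers around chance level.
```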

If the LLM's internal task circuit and the pure attention circuit of the EIH are considered jointly, the two sometimes cooperate in the same direction and sometimes compete in opposite directions. In the three cases above, Case 1 is synergy: both push toward the correct answer. Cases 2 and 3 are competition: the task circuit pushes toward the correct answer (in the original label space), while the EIH circuit works against it.

Following this line of thinking, we can roughly explain many seemingly inexplicable phenomena observed in ICL research. Here is an example. Current research shows that, assuming the ICL label space contains two labels $y_1$ and $y_2$, if we flip the labels of the ICL examples, replacing $y_1$ with $y_2$ and $y_2$ with $y_1$, ICL task performance gets worse (see "Overthinking the Truth: Understanding how Language Models process False Demonstrations"). Suppose the correct label for $x_{n+1}$ is $y_1$. From the perspective of the task circuit and the EIH circuit, the task circuit will tend to output $y_1$, while the EIH circuit corresponds to Case 3 above: label flipping is a special kind of label remapping, because the correspondence between $x$ and $y$ still exists, so the EIH circuit effectively learns the flipped mapping from $x$ to $y$ and tends to output $y_2$. One circuit pushes in the right direction and the other in the wrong direction; they compete, and the model's performance degrades.

In fact, many other phenomena can be explained within this framework; for reasons of length I will not expand on them here. Interested readers can work them out for themselves within this framework.

Domain task Fine-Tuning from the perspective of "loop competition"

From the perspective of "loop competition", we can re-examine the possible impact of fine-tuning a general model with domain data. What we know now is that fine-tuning with domain data causes "catastrophic forgetting" in the base model: because the subsequent fine-tuning modifies the model parameters, the model forgets some of the knowledge it learned before. My judgment is that, at present, any form of tuning on top of the base model will cause the loss of some of the base model's capabilities. This includes the instruct tuning ChatGPT uses to understand commands and align with human values: it too should damage some abilities of the base model, although we cannot yet tell which ones. This is the price that must be paid for tuning the model under current technical conditions.

But why does fine-tuning the base model damage its capabilities? What is the underlying mechanism? We can analyze the impact of fine-tuning from the "loop competition" perspective. I guess there are roughly two effects, which may act separately or together. The first effect: through large amounts of domain data, fine-tuning strengthens the response circuit with which the large language model solves this task. This probably has little impact on the model's low-level knowledge points, because the low levels contain mostly general-purpose features that this task also needs; what fine-tuning modifies should mostly be the upper-level abstract knowledge nodes, and the activation pathways connecting low-level knowledge points to those upper-level abstract knowledge points. The second possible effect: fine-tuning may well build shortcuts inside the model, so that after the input comes in, the information flows directly along the shortcut and bypasses many pathways it should have gone through. Take text classification: the internal logic of such a task should be very simple, presumably just establishing activation pathways from low-level, domain-specific lexical knowledge points to upper-level abstract category-concept knowledge points, so it is quite likely that a very short shortcut is built directly from the lowest-level knowledge points to the high-level category concepts, and all the other, more complex circuits are bypassed by this shortcut. It is not necessarily that the upper-level abstract knowledge points are overwritten; more likely they are simply bypassed via the shortcut.

Whichever cause it is, the consequence is the same: for a new input, even one meant for a different task, this specially strengthened circuit is easily activated. In other words, the strengthened circuit tends to win the competition even when it should not, which degrades performance on other tasks.
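
If one wanted to probe this effect empirically, one rough way (a sketch under my own assumptions, not a method from the article) is to compare the base model and the domain-fine-tuned model on held-out general-domain text; the fine-tuned checkpoint path below is a hypothetical placeholder, and a real study would use proper benchmarks rather than raw loss.

```python
# Rough probe of capability loss: compare base vs. fine-tuned NTP loss on general text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_ntp_loss(model_name, texts):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    losses = []
    for t in texts:
        ids = tok(t, return_tensors="pt").input_ids
        with torch.no_grad():
            losses.append(model(ids, labels=ids).loss.item())
    return sum(losses) / len(losses)

general_texts = [
    "The capital of France is Paris, a city on the Seine.",
    "Photosynthesis converts light energy into chemical energy in plants.",
]

base_loss = mean_ntp_loss("gpt2", general_texts)                                   # base model
tuned_loss = mean_ntp_loss("path/to/your-domain-finetuned-gpt2", general_texts)    # hypothetical checkpoint

print(f"base: {base_loss:.3f}  fine-tuned: {tuned_loss:.3f}")
# A clearly higher loss for the fine-tuned model on general text would be one sign
# that the strengthened domain circuit is "winning" where it should not.
```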

Instruct Tuning from the perspective of "loop competition"

Instruct tuning is essentially a special kind of fine-tuning done to align the model with human behavior. The GPT 4 technical report also points out that instruct tuning does not enhance the knowledge and abilities of the base model; on the contrary, it may damage some of them. High-quality instruct tuning is certainly important, but it only makes the large language model "look like" it works better; the improvement is in the user's subjective experience, not in the model's underlying capabilities.

So, from the perspective of "loop competition", how should we understand what instruct tuning does? I think it can be understood this way: instruct tuning establishes a special activation circuit, that is, the activation circuit formed by the input command itself becomes connected to the corresponding task circuit. After the model is trained on instructions, an input command helps activate the corresponding task circuit, so the large language model appears to understand the meaning of the command. This is somewhat similar to the "conditioned reflex" mechanism in Pavlov's experiments: a conditioned-reflex pathway is established between user commands and the corresponding task pathways.
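
As a concrete (and entirely hypothetical) picture of the data that builds this command-to-task association, here is a sketch of instruction-tuning pairs; the format is illustrative only, not OpenAI's actual data.

```python
# Sketch of instruct-tuning pairs: on this view, the instruction text is what gets
# wired to the task circuit that produces the response.
instruction_pairs = [
    {"instruction": "Translate the following sentence into French.",
     "input": "The weather is nice today.",
     "output": "Il fait beau aujourd'hui."},
    {"instruction": "Summarize the text in one sentence.",
     "input": "GPT-4 is a large multimodal model that ...",
     "output": "GPT-4 is a large multimodal model with strong general abilities."},
]

def to_training_text(ex):
    # The model is trained with ordinary next-token prediction on this string, so the
    # instruction tokens and the task behaviour co-occur and become associated.
    return f"Instruction: {ex['instruction']}\nInput: {ex['input']}\nResponse: {ex['output']}"

for ex in instruction_pairs:
    print(to_training_text(ex), end="\n\n")
```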

Using the "loop competition" conjecture, in addition to giving a reasonable explanation for the above-mentioned phenomenon of the currently unknown internal operation mechanism, it can also be used to explain some other phenomena. For example, the "serious nonsense" problem that often occurs in large models can be considered to be that in the process of loop competition, the correct loop fails to compete, or the intensity of excitation of the correct loop and a wrong loop is similar, resulting in a mixed result. , is an answer that looks reasonable but is factually wrong. Something like that.

Parametric reflection of the world: from the real world to the possible world

The physical world has its own hidden rules that govern its operation. Conceptually, we can think of a simple hidden world that produces the colorful world of appearances. If we classify all phenomena in the world, they fall roughly into natural phenomena, social phenomena, and psychological phenomena. Human beings are part of the physical world; by observing the world's appearances and trying to understand its laws, we can better secure the survival of the species and of individuals in this world.

From the perspective of the species, survival of the fittest over tens of millions of years of evolution is the pre-training process of the human model. The optimization objective is "Next Person's Survival Prediction": the smaller the loss, the more individuals of the species survive. The genetic code is the model parameters; among the individuals the genetic code encodes, those who adapt to the environment survive and those who do not are eliminated. Survivors survive because certain traits encoded in their genes suit the environment, so those environment-matching genetic codes are reinforced in the population, and the human pre-training model completes one parameter update. The constant changes in the external physical environment drive changes in the population's genetic code, thereby keeping the species alive in a changing environment. The genetically encoded pre-trained model we are born with records the survival strategies learned over tens of millions of years, forming the unconscious, fast-responding System 1 in the brain; it represents the collective memory of the species.

From an individual's point of view, in addition to the innate survival strategies obtained through the genetically encoded pre-trained model, the individual carries out "continual pre-training" throughout life in order to survive in a specific environment. Its optimization objective is "Next Action Prediction": to output the correct behavior in the environment so as to stay alive. The parameter-update strategy resembles LoRA: for an individual, the innate genetic code is the frozen base model that determines many of our behavior patterns, but some regions of the brain can modify the connections between neurons to learn new knowledge and skills. If an output behavior harms continued survival, the model parameters are adjusted so as to cope better with the environment in the future. The function of these brain regions forms the conscious, slow-deciding System 2, which represents the individual's personalized survival experience. "Innate genetic code + personal survival fine-tuning" produces a rich variety of individual behaviors, with both commonality and individuality: the commonality comes from the collective memory of the species, and the individuality from unique personal experience.
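
For readers unfamiliar with LoRA, here is a minimal sketch of the idea the analogy leans on: the pre-trained weight is frozen while only a small low-rank correction is trained. The dimensions and hyperparameters below are arbitrary, chosen only for illustration.

```python
# Minimal LoRA sketch: the pre-trained weight W is frozen (the "genetic code"),
# and only a small low-rank correction B @ A is trained (the "personal fine-tuning").
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=4, alpha=1.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad_(False)             # frozen base model
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))   # starts as a no-op correction
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768)
x = torch.randn(2, 768)
print(layer(x).shape)  # torch.Size([2, 768]); only A and B receive gradients
```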

Language originally served as a tool for communication and collaboration between human individuals, which helps the species survive. With the development of technology, it gradually came to be recorded on turtle shells, bamboo slips, paper, and electronic signals, forming text. Each person can be regarded as an independent "encoder-decoder": the individual observes and experiences the physical world, encodes it in the brain into knowledge and thought, and decodes it into text, which records feelings about and thinking on the world from a personal perspective, both subjective impressions and objective records. The crowd forms a distributed "encoder-decoder" whose decoded output produces a massive body of written records containing all kinds of objective facts about how the world works as well as subjective, conflicting ideas. Text, therefore, is only the surface; what is recorded inside is humanity's cognition of the physical world and its subjective feelings about it (physics knowledge, social knowledge, event records, individual feelings, individual imagination, and so on). Behind it lies a world model from the human perspective. GPT tries to correctly reproduce human-produced text through the Next Token Prediction task; in essence it decodes and restores the world model hidden behind the surface of the text and stores it in GPT's model parameters, forming a parametric reflection of the physical world.

If we think more deeply, we may find that GPT not only learns from a large amount of text how to generate content consistent with the facts of our real world; it may also learn to be a generator of "possible worlds". It starts from text that simulates our real world and then generalizes and abstracts: while it can produce real knowledge and content about the world we perceive, following our world's physical laws, it can also produce possible worlds with other physical laws that humans can still logically comprehend. Perhaps you cannot say it is wrong just because the content it produces does not match the real world; you can only say that it has the ability to show you all logically possible worlds, many of which will not match reality. After all, the existing world is merely the one possibility among the possible worlds that happened to be realized, and GPT has the ability to present all kinds of reasonable possibilities to you.

The End of the World and Hard-Boiled Wonderland: The "Brain in a Digital Vat" Thought Experiment

"A mad scientist performs an operation in which he cuts out a person's brain and puts it in a container filled with a nutrient solution. The nutrient solution contains enough nutrients to keep the brain functioning properly, while the brain's nerve endings are connected to wires The other end of the wire is connected to a computer. The computer simulates the parameters of the real world and transmits information to the brain through the wire, making the brain feel that everything is completely normal, as if the people around you and familiar things are still going on as usual, without Any abnormality.

One day, the brain in the nutrient solution had a sudden whim and thought of a very interesting thought experiment. In his/her perceived reality, at the moment he/she is riding the subway to work or sitting at the office desk, he/she hears someone's faint footsteps, takes out a mobile phone, and writes the idea down in a memo, which reads as follows:

"OpenAI has launched a new LLM model called GPT 4, which is very powerful. This may herald the arrival of the AGI era. Everyone around me is discussing it enthusiastically. Today I read an analysis of its possible working mechanism The article titled "Reflection of the World's Parameters: Why GPT Can Generate Intelligence Through Next Token Prediction" was very inspiring and aroused my thinking after reading it. We can imagine that if AGI is powerful enough in the future, it will be able to pass the reading The content I write, my photos and videos can even scan and copy my brain response patterns, and reconstruct a digital brain that is exactly the same as mine in the physical world. Then, another self will live in the digital space, And AGI takes over the various sensory signals of my digital brain, simulates my work and life scenes, and makes the brain feel that everything is completely normal, as if the people I know around me and the familiar things are still going on as usual without any abnormalities. Then, this number Can the me in my brain, or me in real life, distinguish whether I live in a digital space or a physical space? I call this thought experiment: the brain in a digital tank. Is this thought experiment interesting?"

I call this thought experiment: the brain in a digital vat. Isn't this thought experiment interesting?


Origin blog.csdn.net/xixiaoyaoww/article/details/130981702