Zhang Junlin: Do LLMs such as GPT-4 have human-like intelligence?

Edit: DataFunTalk


Guide: This article is reprinted from Zhang Junlin's article first published on Zhihu, "The Parametric Reflection of the World: Why GPT Can Generate Intelligence through Next Token Prediction". Drawing together a range of current research on LLMs, the article pieces the evidence together like a jigsaw puzzle to discuss whether LLMs possess human-like intelligence.

The following is the original text of the article:

"Two English-speaking desert island survivors were stranded on adjacent islands separated by dangerous waters. Fortunately, they found a telegraph left by the previous residents, connected by an underwater cable, and they Ability to transmit information via telegram.

What they don't know is that a super-intelligent octopus lives in the nearby waters. It has tapped the underwater cable and intercepts the messages sent between them. Although the octopus does not understand English, its super-intelligence allows it to detect statistical patterns in the telegraph messages and to accurately model the statistical relationships between the various telegraph signals.

Once the octopus feels it has learned these statistical regularities, it cuts the underwater cable, holds the two severed ends with two of its long tentacles, and, based on the statistical patterns it has recognized, receives and answers the two castaways' telegraph signals by itself.

Whether or not the two castaways notice that their communication partner has changed, the messages sent by the octopus are, in essence, meaningless. After all, the octopus is only following statistical patterns it has learned from previous interactions between the humans; it has never seen any human interpretation of the signals, such as what "coconut" or "sea water" really mean. Moreover, the octopus may not even understand that these signals have meaning or serve to facilitate communication. "

— "The Octopus Test"

 Bender & Koller

What would you make of this question if we replaced the octopus in the "Octopus Test" with ChatGPT or GPT-4? In other words, which of the following two viewpoints do you support? One viewpoint, similar to that of the "Octopus Test", holds that LLMs such as GPT-4 have only learned superficial statistical relationships in language, such as word co-occurrence, and in fact possess no intelligence. The other viewpoint holds that GPT-4 has not only learned the surface statistical relationships between language elements, but has also learned the inner workings of human language and even of the physical world; its words are produced by internal intelligence, so the LLM has human-like intelligence.

These two views are directly opposed, and I don't know which camp you belong to. At present, both in academia and in society at large, quite a few people hold each of the two views, and the debate between them is fierce.

On the opposing side, which does not believe large language models are intelligent, well-known representatives include LeCun in the AI community and Chomsky in linguistics. Both deny that a large language model trained by Next Token Prediction can be intelligent.

There are also many representatives on the affirmative side. OpenAI is undoubtedly the most influential of them. Judging from his public remarks, Hinton clearly stands on the affirmative side as well, and holds it with particular conviction: he not only believes that GPT-4 has human-like intelligence, but also thinks that human carbon-based intelligence may well turn out to be merely the bootstrap program (booster) for silicon-based intelligence such as LLMs. Musk (king of cars, rocket pioneer, reinventor of Twitter, environmental pioneer, Mars colonist, OpenAI spoiler) holds similar views.

At present, sufficiently large LLMs use the "Next Token Prediction" task (hereafter NTP, for brevity) when training the base model. The opposing side's question is simple: Next Token Prediction merely generates the next word from the preceding words, so isn't it obvious that what is learned this way is just the surface statistical relationships between words? For those who hold the affirmative view, this question is actually not easy to refute, because at first glance that does indeed seem to be the case. I suspect most people on the affirmative side would find it hard to give the opposing side a reasonable and convincing explanation.

As for myself, if you have read "The Road to AGI: the Essence of Large Language Model (LLM) Technology" that I wrote at the beginning of the year (see the link at the end of this article), you can easily tell that I take the affirmative position. In fact, the original draft of that article contained a section discussing why NTP produces intelligence. Based on my understanding of LLMs as of the January 2023 edition, I summarized it roughly as follows: "Through the NTP task, the LLM learns an invisible knowledge graph in its model parameters. When a prompt is entered, the concepts contained in the prompt activate the related nodes of the knowledge graph, and then, following the <activation-diffusion> theory, trigger activation spreading and information transfer among the knowledge on the graph, which leads to the generation of intelligence in the LLM."

That was how the me of that time understood the answer to this question. Reviewing this view now, it cannot be called wrong, but the understanding is clearly still shallow, or rough. Because that article was already too long, and the evidence supporting the above view was insufficient, I deleted that section when publishing it.

This article is dedicated to this topic. I try to sort out and summarize the fragmentary evidence currently available and give a relatively well-founded answer to the question above. In fact, the affirmative side has no dedicated research that explains this question at present; but if we link together the scattered conclusions of research originally aimed at other questions, we can treat the answer as a jigsaw puzzle: on the basis of the pieces provided by known research, plus some reasonable inferences and assumptions, I believe the affirmative side can at least give some plausible explanations. Structurally, this article first introduces OpenAI's view on this issue in more detail, then collects and summarizes existing research conclusions, and finally gives an explanation that I consider reasonable.

01

Both ends of the scale: Compression is intelligence

Imagine a balance scale: the left pan weighs a large language model's data compression ability, and the right pan weighs its level of intelligence. The question is: is what this scale tells us reliable? In other words, if a large language model has a stronger data compression capability, does that mean it has stronger AGI-like intelligence?

OpenAI clearly believes the two are equivalent. At present, this may be a core idea driving the direction of OpenAI's large-model development. OpenAI chief scientist Ilya Sutskever first revealed this idea in some public interviews earlier this year. Subsequently, Jack Rae, who leads a large-model team at OpenAI, gave a talk titled "Compression for AGI" at the Stanford MLSys seminar, which argued for this idea conceptually at the theoretical level.

This part mainly draws on Jack Rae's talk and relays the argument for "compression is intelligence" that OpenAI firmly believes in. Let's start with a thought experiment about compressed data transmission.

1. Data compression using LLM

[Figure: using an LLM to compress data for transmission from Xiaoshuai (on Earth) to Xiaomei (on Mars)]

Suppose Xiaoshuai lives on Earth and Xiaomei lives on Mars. Xiaoshuai has obtained a batch of confidential data D = (x_1, x_2, ..., x_n) and needs to send it to Xiaomei, far away on Mars, at the lowest possible transmission cost. Xiaoshuai plans to compress the data with an LLM such as GPT and then send the compressed data to Xiaomei, reducing the amount of data transmitted. At the same time, he wants the compression to be lossless: Xiaomei must be able to fully restore the original data D from the compressed data she receives, with no differences whatsoever. This doesn't seem easy. How can it be done?

First, Xiaoshuai sends Xiaomei the GPT model's code, including the code itself, the initialization method, and the random seed. Using this information, Xiaomei builds and initializes her own copy of GPT with the same code, initialization method, and random seed, so that the GPT model in her hands is identical in its initial state to the one in Xiaoshuai's hands.

Next, Xiaoshuai starts training the GPT model with Next Token Prediction as the task, using D = (x_1, x_2, ..., x_n) as training data. The training process itself is in fact a data compression process. Suppose Xiaoshuai has already compressed x_1, ..., x_{i-1} through GPT, obtaining the compressed codes z_1, ..., z_{i-1}, which he sends to Xiaomei one after another, and he is now ready to transmit x_i. Let's press the "slow motion" button here and watch carefully how GPT compresses, encodes and decodes x_i.

Encoding stage: our goal is to compress x_i with GPT. Xiaoshuai feeds the context x_1, ..., x_{i-1} into the current version of the GPT model, M_{i-1}, to make a Next Token prediction. Let the token dictionary be V. After the Next Token prediction, the GPT model produces a generation probability for every word in V: some words get a high probability, some a small one, and all the probabilities sum to 1, forming a probability distribution P_i over the |V| words. Given the original token x_i and P_i, a data compression algorithm such as arithmetic coding (AC) can compress x_i into a code z_i (how arithmetic coding works is explained later), and Xiaoshuai can send the compressed code z_i to Xiaomei.

In addition, if the highest-probability word predicted by M_{i-1} in the Next Token prediction above is not the ground truth x_i, it means the model is not yet well trained, so Xiaoshuai has GPT perform one backpropagation step to correct its parameters, hoping that GPT will predict more accurately the next time it encounters a similar situation. After backpropagation the parameters have changed, and the GPT model goes from version M_{i-1} to version M_i.

As you can see, the above process is just a standard GPT training step on one token; the only difference is that in ordinary GPT training we do not use the distribution P_i obtained from Next Token Prediction together with arithmetic coding to produce the compressed code z_i of x_i and record it. If you wanted to, you could generate each z_i step by step during training and record them all, and you would obtain a losslessly compressed encoded version of the data D.

Decoding stage: after receiving the compressed code z_i from Xiaoshuai, Xiaomei wants to use her own GPT model to restore the original token x_i. She can also use arithmetic coding to decode z_i in reverse, but z_i alone carries insufficient information: she also needs the probability distribution P_i over the dictionary V that corresponds to x_i. Xiaoshuai did not send P_i, because it is too large and transmitting it would not be cost-effective. What can she do?

Xiaomei can use the GPT in her own hands to generate the missing distribution over dictionary words. She takes the previously decoded words x_1, ..., x_{i-1} as the model input and has her GPT model M_{i-1} make a Next Token prediction, so the model produces a probability distribution identical to Xiaoshuai's P_i. With P_i in hand, Xiaomei can use arithmetic coding to decode z_i, that is, to restore the original token x_i. Likewise, if the highest-probability word predicted by her GPT is not the next token x_i, she also has GPT perform one backpropagation step to correct the model parameters, taking her GPT model from version M_{i-1} to version M_i. Only in this way can Xiaomei ensure that her GPT model stays consistent with Xiaoshuai's throughout the transmission.

So the decoding process is, in effect, Xiaomei performing the same GPT training step in parallel, using the dictionary-word probability distribution P_i obtained by Next Token Prediction to help decode the compressed data z_i back into the original token x_i.

In this way, by running the GPT training step on x_i simultaneously, Xiaoshuai and Xiaomei complete its compression and decompression. By repeating the above process, Xiaoshuai can send all the data in D to Xiaomei losslessly, achieving lossless compression and decompression of data via the LLM. So we can say that the training process of the GPT model is actually a lossless compression process of its training data; we simply skip the encoding step in ordinary training.
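To make the round trip concrete, here is a runnable toy version of the protocol (my own illustrative sketch, not code from Jack Rae's talk): the "GPT" is just an adaptive unigram count model over a tiny vocabulary, and a simple rank code stands in for arithmetic coding. It demonstrates only the protocol itself: both sides compute the same distribution from the same history, only a short code per token is transmitted, and both sides apply the same update after every token, so their models stay synchronized.

```python
# Toy sender/receiver protocol: identical adaptive models on both sides,
# only a per-token code is transmitted, and both sides update identically.
VOCAB = ["the", "cat", "sat", "on", "mat"]
DATA = ["the", "cat", "sat", "on", "the", "mat"]

def ranked_vocab(counts):
    # Deterministic ordering by decreasing estimated probability (ties broken by word).
    return sorted(VOCAB, key=lambda w: (-counts[w], w))

def sender(data):
    counts = {w: 1 for w in VOCAB}               # both sides start from the same "model"
    for x in data:
        yield ranked_vocab(counts).index(x)      # transmitted code: rank of x under the model
        counts[x] += 1                           # "training step" after every token

def receiver(codes):
    counts = {w: 1 for w in VOCAB}               # identical initialisation
    for code in codes:
        x = ranked_vocab(counts)[code]           # decode with the locally computed model
        counts[x] += 1                           # identical update keeps both models in sync
        yield x

assert list(receiver(sender(DATA))) == DATA      # lossless round trip
print("decoded:", list(receiver(sender(DATA))))
```

The better the shared model predicts, the smaller the transmitted codes become, which is exactly the point of the thought experiment.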

2. Arithmetic coding mechanism

[Figure: arithmetic coding of one token over a four-word dictionary]

The mechanism of arithmetic coding has not been explained above, so here is a brief explanation with a simple example. As shown in the figure above, suppose the word dictionary V contains 4 words, and the original token we want to compress and encode is x_i = "MaskNet". After GPT runs Next Token Prediction, the probability distribution P_i over the words in V is listed on the left side of the figure; in other words, in this Next Token prediction the word GPT assigns the highest generation probability to is "too", not the ground truth "MaskNet".

At this point, knowing x_i and its corresponding distribution P_i, we use arithmetic coding to compress the data. First, according to each word's generation probability, we partition the interval from 0 to 1 by the probability scores: the larger a word's generation probability, the longer the sub-interval it occupies. This gives a lower bound and an upper bound for each word's interval. For example, the word "MaskNet" to be encoded has a lower bound of 0.4, and since its own generation probability is 0.2, its upper bound is 0.6.

To make the binary encoding as short as possible, arithmetic coding searches the interval from 0.4 to 0.6 covered by the word "MaskNet" for the decimal number whose binary representation is shortest. In this interval, 0.5 is clearly that number, so 0.5 is chosen as the coded number; converting it to binary gives 0.1, which is the binary arithmetic code corresponding to the word "MaskNet". Xiaoshuai only needs to send Xiaomei the bits after the binary point, i.e. the single bit 1.

Next, the decoding process after Xiaomei receives the bit 1. As mentioned above, using her own GPT, Xiaomei obtains the same word probability distribution P_i. Following the principle of arithmetic coding, she uses this distribution to partition the range from 0 to 1 and obtains the same partition as Xiaoshuai. Xiaomei converts the binary 0.1 back to the decimal number 0.5, checks which word's interval 0.5 falls into, locates the word "MaskNet", and thus decodes the word x_i represented by 0.1.

The idea behind arithmetic coding is very elegant: it encodes the input sequence dynamically and can encode the entire input in binary as a single fraction, with a coding efficiency close to the entropy limit proposed by Shannon. In the scenario we describe, however, because the distribution P_i corresponding to each x_i keeps changing during GPT training, each distribution P_i is used to compress or decode only a single token, so the example looks very simple. For arithmetic coding of long input sequences, see "What is arithmetic coding" (link at the end of this article).
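To make the "MaskNet" example concrete, here is a minimal single-token sketch in Python (a simplified greedy variant for one symbol, not a full streaming arithmetic coder; the two dictionary words other than "too" and "MaskNet" are placeholders I made up):

```python
def token_interval(probs, token):
    """Return the [low, high) interval that `token` occupies when [0, 1)
    is partitioned according to the model's probability distribution."""
    low = 0.0
    for word, p in probs.items():
        if word == token:
            return low, low + p
        low += p
    raise KeyError(token)

def encode_token(probs, token):
    """Greedily find a short binary fraction 0.b1b2... inside the token's
    interval and return its bits (single-symbol arithmetic code)."""
    low, high = token_interval(probs, token)
    bits, frac_low, frac_high = [], 0.0, 1.0
    while True:
        mid = (frac_low + frac_high) / 2
        if high <= mid:            # target interval lies entirely in the lower half
            bits.append(0)
            frac_high = mid
        elif low >= mid:           # target interval lies entirely in the upper half
            bits.append(1)
            frac_low = mid
        else:                      # mid itself falls inside [low, high): done
            bits.append(1)
            return bits

def decode_bits(probs, bits):
    """Turn the received bits back into a number in [0, 1) and find which
    token's interval contains it."""
    value = sum(b * 2.0 ** -(k + 1) for k, b in enumerate(bits))
    low = 0.0
    for word, p in probs.items():
        if low <= value < low + p:
            return word
        low += p
    raise ValueError("no interval contains the decoded value")

# Probabilities roughly matching the example: "MaskNet" occupies [0.4, 0.6);
# the words other than "too" and "MaskNet" are placeholders.
probs = {"too": 0.4, "MaskNet": 0.2, "is": 0.25, "LSTM": 0.15}
bits = encode_token(probs, "MaskNet")
print(bits)                        # [1]  -> binary 0.1 = decimal 0.5
print(decode_bits(probs, bits))    # MaskNet
```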

3. Compression is intelligence

From the explanation above, it is clear that the higher the probability the GPT model assigns to the ground truth x_i, the longer the sub-interval it occupies in the arithmetic coding partition, and the easier it is to find a short arithmetic code, which means a higher compression rate for the model. In other words, if the GPT model is more intelligent and its NTP predictions are more accurate, its compression efficiency is higher. We can therefore evaluate a model's intelligence by its compression efficiency: the higher the compression efficiency, the higher the model's intelligence. This is a core idea along which OpenAI drives its large-model research and development.

We can consider two extreme cases. In one case, the model has super intelligence and the generation probability of every ground-truth token to be predicted by Next Token Prediction is always 1. Suppose that after Xiaoshuai has transmitted part of the data x_1, ..., x_j to Xiaomei, the model's intelligence has accumulated to this level; then for the remaining untransmitted data x_{j+1}, ..., x_n, Xiaoshuai no longer needs to send anything at all, because Xiaomei's GPT can correctly predict every subsequent token entirely on its own. In this case the GPT model, thanks to its super intelligence, has the ultimate data compression capability: it knows what comes next from the context alone. In the other extreme, GPT has learned no intelligence at all during training and relies on pure guessing for Next Token Prediction. If the vocabulary size |V| is N, the generation probability of every ground-truth token is always 1/N. In this case GPT has no data compression capability whatsoever, and the amount of data to be transmitted equals the information content of the original data D.

These are the two extremes. In most cases the intelligence, or compression ability, of an LLM lies somewhere in between, and we can evaluate the model's intelligence by its compression ability. With a little mathematical derivation, you can show that in this setting the number of bits needed to arithmetic-code the token x_i, i.e. its code length, is -log2 P_i(x_i). Looking at this formula, does it remind you of anything? It is exactly the cross-entropy loss for this token when training GPT. In other words, from the data compression perspective, the encoding length of each token is precisely the cross-entropy loss of that token during LLM pre-training; the two are equivalent. Interesting, isn't it? Lossless data compression is thus a relatively novel perspective on LLM training.

[Figure: total amount of data transmitted = description of the model code + sum of per-token code lengths]

We can take this one step further: for the dataset D, after compression and transmission by the LLM, how much data in total does Xiaoshuai transmit to Xiaomei? The calculation is shown in the figure above. The total amount of transmitted information consists of two parts: one part is the description of the LLM model code, including the code itself, the initialization method, and the random seed; the other part is the sum of the compressed code lengths of all tokens, where each token's code length is its cross-entropy loss. This second part is therefore exactly the sum of the losses of all tokens when GPT is pre-trained on D. The sum of the two parts is the total data transmitted.
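Written out explicitly (a reconstruction of the relation described above, with |M| standing for the description length of the model code, initialization method, and random seed):

$$\text{Total bits transmitted} \;=\; |M| \;+\; \sum_{i=1}^{n}\Bigl(-\log_2 P_i\bigl(x_i \mid x_1,\dots,x_{i-1}\bigr)\Bigr)$$

The summation term is exactly the accumulated cross-entropy (NTP) loss of pre-training GPT on D.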

[Figure: training loss curves of LLaMA models of different sizes]

So do different LLMs have different data compression capabilities? Obviously yes. The figure above shows the data compression capability of LLaMA models of different sizes (from the smallest 7B to the largest 65B): for the same total amount of training data (for example the 1000B-token point on the horizontal axis), the total area under each model's loss curve corresponds to that model's data compression capability. The smaller the area under the loss curve, the stronger the model's compression capability.

Given the earlier explanation, this should be easy to understand. Assume that each training batch contains only one token; then the number of encoding bits required for that token is its loss value. Integrating the area under the loss curve gives the total loss over all tokens, which is equivalent to the total compressed code length needed to compress those tokens.

As the figure shows, the larger the LLaMA model, the smaller the area under its loss curve, meaning a higher compression ratio and stronger compression capability, which in turn represents a higher degree of intelligence. A rough estimate puts the current data compression ratio of the LLaMA models at about 14x, well beyond the best compression ratio in the dedicated data compression competition, the Hutter Prize, which currently stands at about 8.7x.

What does this indicate? If we assume that mainstream text compression algorithms rely mainly on superficial factors such as word frequency and repeated patterns, then the extra compression probably reflects the LLM's deep understanding of the text: intelligent coding that stems from that deep understanding and points toward AGI.

4. Further Thoughts

The content above is the argument for "compression is intelligence" as presented in Jack Rae's talk. When I watched the presentation around March, I was greatly inspired and struck by how imaginative OpenAI's framing was, because I had never looked at LLMs from the perspective of data compression. I believe this is a very novel angle for most people.

However, after consulting the relevant literature later, I found that the idea of "compression is intelligence" was not pioneered by OpenAI; it actually has a long history. For example, the Hutter Prize mentioned above, which encourages research on better data compression algorithms, was founded in 2006. Its founder, Marcus Hutter, believes that data compression capability and AI intelligence are equivalent problems, which is why he funded the prize in the first place. Moreover, using AI models for data compression is already a small research direction of its own, with quite a few related papers.

Along this line of thought, we can think more deeply about two related questions. The first: the discussion above views the intelligence of LLMs from the perspective of data compression, but why should a stronger compression ability indicate higher intelligence in the first place?

The Minimum Description Length principle (MDL) can explain this. It is an important concept in machine learning and a formalization of Occam's razor ("entities should not be multiplied without necessity"). The core idea of MDL is: among the many models that can explain the data at hand, the best one is the model that describes the data as briefly and accurately as possible. The shorter the model's description length, the better its generalization, and the more intelligent we consider it.

Why does a shorter description mean more intelligence? Because a short description is the underlying regularity abstracted from the data: compared with the raw data, a description of the data's underlying regularities is naturally much shorter, and if a model can give a shorter description, it means the model has learned more of those regularities, and is therefore more intelligent. That is the logic. Let's take an example. Suppose the sequence to be transmitted is the sequence of the first 10,000 prime numbers:

2,3,5,7,11…..

The gaps between primes are irregular, so it seems Xiaoshuai has no choice but to faithfully send all 10,000 numbers to Xiaomei. But in fact Xiaoshuai could use a single sentence, such as "output the first 10,000 consecutive prime numbers starting from 2", to describe these numbers, compress that sentence, and send it to Xiaomei. If Xiaomei's LLM is smart enough, it can recover the sequence of 10,000 primes from this sentence. I believe you can now see the point of MDL.
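This is the MDL idea in miniature: a few lines of program are a far shorter description of the sequence than the 10,000 numbers themselves, provided the receiver can understand and execute the description. A simple sketch:

```python
# A short program is a much shorter lossless description of the sequence than
# the 10,000 numbers themselves, as long as the receiver can execute it.
def first_n_primes(n):
    """Return the first n primes by simple trial division (clear, not fast)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes

print(first_n_primes(10000)[:5])   # [2, 3, 5, 7, 11]
```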

Of course, the prerequisite is that the LLM must understand the rather abstract concept of a prime number. So can large models really grasp this kind of abstraction? And is it true that only large models can understand abstract concepts such as "prime number", while small models cannot?

I ran a quick verification comparing a large model and a small model. To prevent the model from completing the task simply by memorizing prime sequences that appear in the training data, I rephrased the description to ensure the exact statement could not have been seen during training. The prompts, the outputs, and the sizes of the tested models are shown in the figure below.

[Figure: prompt and outputs of the prime-number test on GPT-3.5 (left) and a small model (right)]

As you can see, GPT-3.5 has learned the abstract concept of a prime number; otherwise it would be hard to answer this question well. A model that does not understand the concept gives incoherent answers, like the small model on the right. This shows, on the one hand, that large models can indeed learn some abstract concepts, and on the other hand, that large models are indeed better than small models in this respect.

The second question: Jack Rae emphasized in his talk that the data compression performed by LLMs is lossless, and used this to rebut the influential view put forward by the science fiction writer Ted Chiang at the beginning of the year that "ChatGPT is a lossy compression of the Internet". In fact, if you think about it carefully, the claim that an LLM performs "lossless compression" of data is a bit of a stretch. Looking at it more rigorously, although the LLM training process can be regarded as lossless compression of the data, the "lossless" effect is achieved not by the LLM alone but by "LLM + arithmetic coding".

If the LLM reached a sufficiently strong level of intelligence through learning that the NTP loss on the subsequent text sequence were 0, i.e. it could predict every subsequent Next Token completely and exactly from the preceding context, then arithmetic coding would not be needed at all: the LLM alone could compress and decode the data perfectly. In that case there would be no problem in saying that the LLM training process, or the trained LLM itself, performs "lossless compression" of the data.

However, that is an idealized situation. Can current LLMs do this? Definitely not, so some of the Next Token predictions given by the LLM will be wrong, and those wrongly predicted tokens represent the information lost by the LLM's compression. Arithmetic coding is what compensates for this loss to achieve the "lossless compression" effect. So, to be precise, it looks more like this:

Lossless data compression = the LLM model's lossy compression capability + the coding compensation provided by arithmetic coding

In other words, at least for now, LLMs still encode data lossily and cannot achieve lossless compression by their own ability. Whether future LLMs can become powerful enough to achieve lossless data compression on their own remains unknown.

Data compression is only a means; making GPT intelligent through this means is the goal. The problem now is that although OpenAI has laid out the means and the goal at the level of basic theory, it has not explained a more fundamental question: what kind of AGI intelligence has the GPT model actually learned through Next Token Prediction as data compression? The remainder of this article tries to answer this question.

02

Jigsaw puzzle: some pieces of currently known facts

If we liken LLMs' acquisition of AGI-like intelligence to a jigsaw puzzle, then at present we hold only some scattered pieces and have not yet seen the full picture of this kind of machine intelligence. This section collects and introduces the conclusions of existing research from several different perspectives.

1. The knowledge extraction process of the GPT model  

[Figure: the three-stage knowledge extraction process for the prompt "Beats Music is owned by" (from "Dissecting Recall of Factual Associations in Auto-Regressive Language Models")]

Let's first look at how a trained LLM extracts knowledge when a prompt is entered at inference time. The paper "Dissecting Recall of Factual Associations in Auto-Regressive Language Models" examines this in detail. As shown in the figure, given the prompt "Beats Music is owned by", GPT returns the correct answer via NTP: Apple. In this example, "Beats Music" is an entity, and "owned by Apple" is an attribute of that entity.

The research finds that when GPT extracts this piece of knowledge, it goes through a clear three-stage process:

First, "Music" is the last and most critical word describing the entity. As its information flows up through the Transformer blocks, the information of the preceding modifier "Beats" is first merged, via Attention, into the position corresponding to "Music". Then, as the layers get higher, the FFN of each Transformer block keeps adding information to the embedding at the "Music" position, so that as information flows upward, that embedding can trigger more and more "attribute" words related to "Beats Music". This is the first step, and it generally happens in the lower layers of the Transformer.

In the second step, at the position of the word "by", which is the last position and the one where NTP will emit the output token, the GPT model integrates the information of the word "owned" into this final position via Attention. Note that the Transformer position of the last word is particularly important, because the Next Token output will be produced at its top layer; during inference GPT gradually gathers the important information from the input into this position via Attention. This operation also happens in the lower layers of the Transformer.

In the third step, the position of "by", which is the last position in the Transformer, has already accumulated the information of "owned" in the lower layers. In the upper layers, Attention then extracts the attribute "Apple" corresponding to "Beats Music". The actual extraction is done by an Attention head, and the paper shows that <entity, attribute> information is encoded inside the Attention head (see the figure below for concrete examples). This should be new knowledge for most of us: it used to be generally believed that Attention mainly compares and moves information around, but this shows that Attention also stores some kind of knowledge.

Through the above three steps, GPT completes the process of extracting a piece of knowledge.

[Figure: examples of <entity, attribute> information encoded in Attention heads]

Another work "Understanding Transformer Memorization Recall Through Idioms" explores how LLM extracts memory information, including idioms/proverbs (Idioms) that rely entirely on memory and require accurate reproduction, as well as factual knowledge.

Its conclusion is that LLMs extract memorized information in two stages: in the first stage, the lower Transformer blocks gradually raise the ranking of the correct answer word until it ranks first around the middle layers; in the second stage, the higher Transformer layers increase the confidence, i.e. they keep raising the probability score assigned to the correct answer.

Beyond these two papers, there are other similar studies of knowledge extraction. Summarizing the existing conclusions, I think we can roughly sketch the outline of knowledge extraction in GPT as follows. After training, when a prompt is entered, consider the input word at some position of the Transformer: as information flows upward, GPT merges, via Attention, the parts of the preceding context relevant to this word into its embedding, while the FFN of each layer transforms the current word embedding to add information. In this way, the knowledge stored in the FFNs is triggered layer by layer and gradually refined into the embedding of that word (similar to what happens at the word "Music" in the example above).

The same holds for the position of the Transformer's last token. What is special about it is that, from the bottom layers to the top, it first copies the most critical information from the entire preceding input into its own position via Attention, and then uses this key information in the upper layers to gradually filter out the even more important parts of the context. At the bottom layers of this position there should be many candidate output answers, and the correct answer is not ranked high; as we move up through the Transformer, the correct answer ranks higher and higher, fewer and fewer candidates can compete with it, and the probability score assigned to the correct answer keeps increasing, until at the top layer of the last token GPT outputs the correct answer (similar to what happens at the word "by" in the example above).
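If you want to see this "answer rising through the layers" picture for yourself on a small public model, a rough way is a logit-lens-style probe, sketched below for GPT-2 with the Hugging Face transformers library. This is my own illustrative probe, not the method used in the cited papers, and GPT-2 may or may not actually rank "Apple" first at the top layer.

```python
# A logit-lens-style probe on GPT-2: project each layer's hidden state at the
# last position back to vocabulary space and track the rank of " Apple".
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "Beats Music is owned by"
answer_id = tok.encode(" Apple")[0]
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    # Apply the final layer norm and the tied unembedding to this layer's state.
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    rank = (logits > logits[answer_id]).sum().item() + 1
    print(f"layer {layer:2d}: rank of ' Apple' = {rank}")
```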

2. Distribution of knowledge points in Transformer

This part introduces how knowledge points are distributed within the Transformer structure, i.e. how knowledge points of different types, or specific knowledge points, are distributed across the Transformer's layers. Understanding this is very helpful for understanding GPT's inner working mechanism.

Before introducing the research conclusions, to make them easier to follow, we first explain three basic concepts (see "Toy Models of Superposition"): monosemantic neurons, polysemantic neurons, and superposition.

[Figure: examples of a monosemantic neuron and a polysemantic neuron]

It has been found that LLMs contain many individual neurons that each respond to only one specific knowledge point in the input: such a neuron is activated only by a particular input pattern and stays silent for other, irrelevant inputs. One neuron encodes one knowledge point, a perfect one-to-one correspondence. Neurons of this kind in the Transformer are called "monosemantic neurons" (this is fairly similar to how neurons behave in the human brain).

Conversely, there are also large numbers of neurons that encode multiple meanings: knowledge points with quite different linguistic meanings all activate the same neuron. These are called "polysemantic neurons". The figure above gives examples: some neurons respond only when the prompt is written in French, a typical "monosemantic neuron"; others respond to many semantically very different 2-gram fragments, a typical "polysemantic neuron".

Superposition means the following: if the number of features n to be encoded is much larger than the network dimension d, we can still find a way to encode the n >> d features with d-dimensional neurons. This encoding mechanism is called superposition, and it is an information-compressing encoding mechanism found inside the Transformer structure.

[Figure: superposition implemented by multiple polysemantic neurons; a linear combination of their responses (blue) detects a specific knowledge point (from "Finding Neurons in a Haystack")]

Superposition and "multi-semantic neurons" are closely related. It is currently found that LLM does this internally (refer to Finding Neurons in a Haystack: Case Studies with Sparse Probing): As shown in the figure above, the Superposition mechanism of LLM is composed of multiple "multi-semantic neurons". Each neuron will respond to multiple different knowledge points in the input, so it is impossible to detect who is currently responding to it through only one "multi-semantic neuron", but if there is Multiple "multi-semantic neurons" that respond to a certain knowledge point, and a linear combination of their responses can detect the knowledge point we want to identify in the input (the blue part in the figure above) .

In other words, the LLM encodes a specific feature or knowledge point by combining multiple "polysemantic neurons". The relationship between "polysemantic neurons" and knowledge points is therefore a many-to-many mapping: one knowledge point excites many of the "polysemantic neurons" that encode it, and one "polysemantic neuron" responds to many different input knowledge points.
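A toy numerical illustration of superposition may help (this is a sketch of the idea with random directions, not a model of any real LLM): n features are stored in only d < n dimensions as nearly orthogonal directions, and a single feature is read out with a linear combination over many "polysemantic" dimensions.

```python
# Toy superposition: 200 features stored in 50 dimensions as nearly orthogonal
# random directions; each feature is read out by projecting onto its direction.
import numpy as np

rng = np.random.default_rng(0)
n_features, d_neurons = 200, 50
directions = rng.standard_normal((n_features, d_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# An input containing features 3 and 17 activates the d neurons jointly.
activation = directions[3] + directions[17]

# Linear read-out for every feature: project the activation onto its direction.
scores = directions @ activation
print(scores[3], scores[17], np.abs(scores).mean())  # the two present features stand out
```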

With these three concepts in place, we can state the current research conclusion: in a trained GPT model, the lower layers of the Transformer encode a large number of concrete features or knowledge points, such as n-gram features and syntactic features, and the encoding uses the superposition mode built from multiple "polysemantic neurons" described above. As the Transformer layers get deeper, concrete knowledge points gradually decrease and abstract knowledge points (such as "French" or "prime number") gradually increase. Abstract knowledge points are generally encoded independently by "monosemantic neurons", and the higher the layer, the more abstract the encoded features become.

In other words, as the Transformer encodes features or knowledge points from low layers to high layers, there is a progressive process of knowledge abstraction. This phenomenon is also mentioned in OpenAI's recent paper "Language models can explain neurons in language models".

In addition, the article "Polysemanticity and Capacity in Neural Networks" pointed out that in the process of model learning, in order to increase the utilization efficiency of model parameters, "single semantic neurons" will be assigned to important features, and "multi-semantic neurons" will be assigned For less important features, the model does not encode at all for even less important features.

The so-called "importance" refers to the impact on training loss, that is to say: "single semantic neuron" has a greater impact on reducing loss during NTP training. This shows that the abstraction of features or knowledge points is an internal driving force of NTP itself to quickly reduce Loss, and this is likely to be one of the keys for the GPT model to generate intelligence through the Next Token Prediction task.

3. Evidence for the existence of knowledge circuits in GPT

This part introduces work on the knowledge circuits (Circuits) inside LLMs that are responsible for completing specific tasks. A "circuit" here means the following: after the prompt for some task is fed into the Transformer, information propagates bottom-up until the Next Token answer is output at the top layer above the last token. Within the network there are certain key paths for completing this task; information mainly propagates along these paths, continuously being transmitted and processed, and in this way the task is completed through NTP.

As the following introduction shows, the way LLM knowledge circuits work is actually quite similar to some information-processing circuits in the human brain. The large number and variety of knowledge circuits formed during GPT's NTP pre-training are likely another key to unravelling the mystery of AGI.

[Figure: the "greater-than" circuit in GPT-2, from "How does GPT-2 compute greater-than?"]

"How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model" mainly explores why the GPT model can acquire mathematical abilities through pre-training. Specifically, using a prompt similar to "The war lasted from the year 17YY to the year 17", the GPT model can output the year number XX of the Next Token greater than YY, which shows that it has learned numbers during pre-training comparative relationship between.

The investigation finds that during pre-training the model has formed a knowledge circuit for solving this problem, shown on the right side of the figure above. There are two key parts. The first is a set of Attention heads in the middle layers; for example, a5.h5 in the figure denotes the 5th Attention head of Transformer layer 5. The main function of these Attention heads is to attend to the year YY and propagate it to the upper layers. The other key part is the MLP layers of layers 8 to 11, which carry out the "greater than" operation, so that in the end GPT can output the correct result.

Moreover, the middle-layer Attention heads and the upper-layer MLPs have specific transfer relationships: for example, the layer-9 MLP mainly receives information from a9.h1, while the layer-8 MLP has more information sources. It is clear that information forms a specific bottom-up propagation path.

[Figure: the most influential neurons in the layer-10 MLP (right) and the attention pattern of a7.h10 (left)]

Digging deeper, you find that a few key neurons in the MLPs carry out the mathematical operation. As shown on the right of the figure above, one can identify the 10 most influential neurons in the layer-10 MLP; these 10 neurons alone roughly suffice to complete the "greater than" operation, and the left figure shows that the Attention head a7.h10 mainly attends to the key information "YY". In addition, the study found that not only the prompt above but also other prompts rephrased to express a numerical comparison activate this circuit, which suggests the circuit may be specialized for comparing numbers.
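As a quick behavioural check in the spirit of that paper (my own sketch, not a reproduction of its circuit analysis), you can ask GPT-2 for the next-token distribution after such a prompt and compare the probability mass on two-digit continuations greater than versus not greater than the start year:

```python
# How much probability mass does GPT-2 put on two-digit continuations greater
# than the start year?  A behavioural sketch, not the paper's circuit analysis.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The war lasted from the year 1732 to the year 17"
with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
probs = logits.softmax(-1)

mass_gt, mass_le = 0.0, 0.0
for yy in range(100):
    ids = tok.encode(f"{yy:02d}")
    if len(ids) == 1:                 # only count two-digit strings that are single tokens
        if yy > 32:
            mass_gt += probs[ids[0]].item()
        else:
            mass_le += probs[ids[0]].item()
print(f"P(YY > 32) = {mass_gt:.3f}   P(YY <= 32) = {mass_le:.3f}")
```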

[Figure: an example sentence illustrating the "Induction Head" circuit ("... so ... bad ... so" → "bad")]

Most knowledge circuits are probably composed of both Attention and MLPs, but some circuits built mainly from Attention have also been found. A typical example is the "Induction Head" circuit, whose existence has been demonstrated by several studies. Its main function is that, when GPT predicts the Next Token, it tends to find a similar output pattern in the preceding context and copy it into the subsequent token output.

In the sentence shown above, the second "so" is the last token, and GPT now has to generate the following tokens via NTP. The "Induction Head" circuit tends to find the earlier occurrence of the same word "so" in the context and output the word that followed it there, "bad", as the Next Token.

The research "Localizing Model Behavior with Path Patching" explores the inner working mechanism of Induction Head: when predicting the Next Token based on the second word "so", the content of "so" itself is copied to Transformer's own corresponding Attention Query in <Query, Key, Value>, and the word "bad" appearing in the above content, through the Attention Head PTH (Previous Token Head to key), integrate the semantics of the content before "bad" into the corresponding "bad" Key. As a result, when "so" is used as Attention, the two get a high similarity, so "bad" is copied to the position of the word so through Attention, which makes it easy for Next Token to output "bad", which achieves from the above Copy the purpose of "so...bad".    

[Figure: an example of the Indirect Object Identification (IOI) task]

In addition to "Induction Head", there are some Attention circuits with more complex functions, such as "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small". Identify the knowledge loop for "Indirect Object Identification".

The so-called "Indirect Object Identification", you can refer to the example given in the above figure, that is to say, there are two entities in the input, one repeated entity and one non-repeated entity, how to find the correct answer from it. From the example above, it can be seen that GPT can output the correct answer Mary. The reason is that the model has learned a complex recognition circuit mainly composed of Attention Head.

[Figure: the three-stage Indirect Object Identification circuit in GPT-2 small]

As shown in the figure above, the "Indirect Object Identification" circuit identifies the correct answer in three main steps: first, Duplicate Token Heads identify tokens that appear multiple times in the sentence, with Induction Heads playing a similar role; second, S-Inhibition Heads, operating at the position where the Next Token is output, remove or suppress the repeated name from the attention of the Name Mover Heads; finally, the Name Mover Heads output the remaining name token.

From the above it can be seen that, in order to better predict the Next Token during pre-training, the LLM has learned very complex Attention-based knowledge circuits for copying certain input tokens into the Next Token Prediction output.

OpenAI chief scientist Ilya Sutskever said in an interview: "When we trained an LSTM to predict the next character of Amazon reviews (NTP), we found that if you predict the next character well enough, the LSTM will have a neuron corresponding to sentiment. This is a good demonstration of the power of unsupervised learning and of the idea of next-character prediction. This finding had a huge impact on us."

My reading is that the emergence of a sentiment neuron in the network probably means the NTP training task had formed a knowledge circuit for sentiment judgment inside the model. This discovery (see "Learning to Generate Reviews and Discovering Sentiment") was indeed an important inspiration for OpenAI to replace the LSTM with a larger Transformer and pre-train with NTP on more data.

At present there is still relatively little work exploring the knowledge circuits inside GPT models. I personally think this line of work is particularly important. For example, I suspect there is very likely a complex logic circuit that explains the Chain of Thought (CoT) phenomenon, and that this circuit probably forms after program code or science-and-engineering papers are introduced into the pre-training data. Because the logical relationships in such data are tight, in order to reduce the NTP loss quickly and predict subsequent tokens accurately, GPT may be forced to generate a large number of abstract knowledge-point concepts internally and build complex logic circuits on top of them. I feel that work in this area is very valuable and deserves further strengthening.

4. Differences in how LLMs of different scales learn knowledge points

This section summarizes research conclusions on how LLMs of different sizes differ in the knowledge points they learn.

The paper "Finding Neurons in a Haystack: Case Studies with Sparse Probing" mentions an interesting phenomenon. Take the abstract feature "French" (used to judge whether the input is written in French), encoded by a "monosemantic neuron": if we ablate (block) that neuron, we can observe the impact on GPT's Next Token Prediction loss, and if the loss increases after ablation, the feature is important to the model.
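The ablation methodology itself is simple; here is a minimal sketch on a toy random network (illustrating only the procedure of zeroing one hidden unit and measuring the change in loss, not the sparse-probing setup of the cited paper):

```python
# Ablate ("block") one hidden unit at a time in a toy random network and
# measure the change in cross-entropy loss -- the procedure, not the real setup.
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((8, 16))
W2 = rng.standard_normal((16, 4))
x = rng.standard_normal(8)
target = 2                                   # toy "ground-truth token" id

def loss(mask):
    h = np.maximum(x @ W1, 0) * mask         # ReLU hidden layer, optionally ablated
    logits = h @ W2
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.log(p[target])                # cross-entropy of the correct class

base = loss(np.ones(16))
for unit in range(16):
    mask = np.ones(16)
    mask[unit] = 0.0
    print(f"unit {unit:2d}: loss increase = {loss(mask) - base:+.4f}")
```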

Interestingly, after ablation the small model's loss increases a lot, while for the large model the effect is small. This shows the feature is important for small models but not so important for large ones.

This phenomenon looks strange, and the paper offers an explanation: as model size increases, feature splitting occurs. That is, in a small model a certain knowledge point is represented by a single, coarse-grained neuron that responds on its own, whereas a large model refines this knowledge point into multiple neurons, each of which is activated only in a specific context. In other words, to represent the same knowledge point, the large model uses a more fine-grained representation than the small model.

For example, a small model may have only one neuron that responds to "return" in the input, while a large model may differentiate into neurons responding to "return" in different programming languages: one neuron responds to "return" in Python, another to "return" in C++, and so on.

Therefore, when a small model has such a feature ablated, the impact is large, because the model then cannot capture this knowledge point in the input at all, which hurts the loss badly. For a large model, ablating the feature has little effect, because it has also split off neurons that respond in different contexts; even if this particular neuron is disabled, other neurons still represent the various cases. I think this conclusion is very important: it shows a significant difference in the knowledge representation capabilities of large and small models.

In addition, other research shows that as model size grows, a larger proportion of "monosemantic neurons" can be detected. I think this suggests that the larger the LLM, the more abstract knowledge is encoded by independent neurons.

Another document, "The Quantization Model of Neural Scaling", imagines that according to the degree of impact on NTP Loss, we can sort the knowledge units (referred to as "quantum units" in the text) from important to unimportant to form a Q queue. The LLM model will give priority to learning the quantum units that are ranked first in the Q queue, and for the large model, it can learn more quantum units that are not so important in the Q queue than the small model. The core idea I summarized is that the large model can learn more less important features than the small model.

The points above are the conclusions that can currently be drawn from the literature on how representation capability differs with model scale.

03

Under the Iceberg: Circuit Competition Conjecture (CCC)

If we piece together the bits of evidence reflected in the jigsaw pieces we currently have, I feel the part of the principle hidden beneath the iceberg is beginning to loom into view. This part makes some inferences based on known research conclusions and proposes the "Circuit Competition Conjecture (CC Conjecture)" as an explanation of the internal mechanism by which GPT builds intelligence through Next Token Prediction.

I have tried to provide references for the key points, and where something is an inference I give the reasoning, so that the conjecture is grounded in existing research conclusions. Overall, however, it remains an untested conjecture, so please treat it with caution.

1. Circuit competition: how the task circuit breaks through

First, let's summarize the known research conclusions to form an overall picture. In this article I refer to any particular feature or piece of knowledge collectively as a "knowledge point", because the traditional term "feature" alone does not cover everything. Concrete knowledge points include linguistic knowledge points (n-grams, morphology, syntax, semantics, etc.), contextual knowledge points (e.g. "the input is French"), knowledge points about the world (entity-attribute relations, common sense, events, etc.), and simple functional-circuit knowledge points. They are fine-grained, and we refer to them all as knowledge points.

[Figure: the hierarchical knowledge structure inside GPT, with concrete knowledge points (white), abstract knowledge points (red), and a task circuit outlined in red]

From the material above, it can be seen that through the NTP task the GPT model learns knowledge from data and builds two kinds of knowledge systems inside the model: a hierarchical knowledge structure and a variety of task circuits (see the figure above). Task circuits are built on top of the hierarchical knowledge structure: a task circuit is a fixed path, formed by knowledge points exciting one another, for solving a particular task.

Assume the GPT model has been trained; we can then clearly detect the presence of these knowledge points. First, knowledge points differ in their level of abstraction: the knowledge points stored in the lower Transformer layers tend to be more concrete, more reusable, more general-purpose, and more numerous, and are more easily encoded by dense schemes such as superposition with polysemantic neurons; the knowledge points stored in the higher Transformer layers tend to be more abstract, less reusable, and more task-specific, and are more likely to be encoded individually by "monosemantic neurons" (in the figure above, white nodes denote concrete knowledge points and red nodes denote abstract knowledge points).

Second, some knowledge points form bottom-up excitation relationships: less abstract knowledge points in the lower layers excite, layer by layer, increasingly abstract knowledge points in the upper layers. For example, a knowledge point encoded in layer L of the Transformer can be activated by other already-activated knowledge points in layers 1 through L-1.

Activated neurons, besides collecting, synthesizing, and abstracting the information passed to them, may also add new knowledge (e.g. retrieving world knowledge) or perform mathematical or logical computation (e.g. comparing numbers) through their FFN structure. A trained GPT model contains a huge number of such "local" micro excitation structures built from knowledge points; these should be the basic units of GPT's intelligence, and together they make the whole GPT structure a hierarchically organized encoding of world knowledge.

Training the model toward the NTP objective is, in effect, a process of gradually building an increasingly complex hierarchical knowledge structure: from simple to complex, from general to specific, from concrete to abstract, from lower layers to upper layers. The microstructures formed by excitation relationships between knowledge points come into being precisely because their existence helps the model predict the next Token more accurately, that is, it helps the GPT model reduce its NTP training loss.

On this basis we can re-examine how task circuits form. A task circuit should be a complete path structure that GPT builds so that it can predict the Next Token more accurately on a certain type of data: starting from the Transformer's input layer, the relevant "excitation microstructures" are linked together layer by layer, forming a bottom-up chain of excitation that eventually reaches the output position and determines the probability of the output Token (see the task path outlined in red in the figure above).

Once such a task circuit has been learned, GPT will predict the Next Token more accurately whenever it later sees data of this kind, which shows up as a reduction in the NTP loss. For example, if the training data contains a large number of addition, subtraction, multiplication, and division examples, GPT will very likely learn a task circuit for simple arithmetic, which raises its accuracy when predicting the number after the equals sign as the Next Token.
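
To make "reducing the NTP loss" concrete, here is a minimal sketch of the cross-entropy loss for a single next-token prediction; the probabilities are made up, and the only point is that a model whose arithmetic circuit puts more mass on the correct digit pays a lower loss:

```python
import math

def ntp_loss(prob_of_correct_token: float) -> float:
    # Cross-entropy loss for one next-token prediction step.
    return -math.log(prob_of_correct_token)

# Predicting the token after "3 + 4 = " -- probabilities below are invented.
print(ntp_loss(0.10))  # before the arithmetic circuit forms: ~2.30
print(ntp_loss(0.90))  # after the circuit forms:             ~0.105
```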


In addition, the Transformer Blocks at each layer above the position of the last input Token may play a special role. They can use the Attention mechanism to gather information about everything that came before. If the input prompt asks for a specific task, then the Transformer Blocks above the Last Token summarize the task-circuit information layer by layer toward the final position, so that the correct Next Token can be predicted at the top layer above the Last Token. In effect, from the input Prompt the Last Token carves a "Prompt sub-world" out of the Transformer's huge knowledge system.
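
A minimal sketch of how the last position can gather information from the whole prefix through causal self-attention; this is a single toy head with random matrices, meant only to show the masking pattern, not any real GPT weights:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 8                       # 5 input tokens, toy representation size 8
x = rng.normal(size=(seq_len, d))       # pretend token representations at some layer

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv

scores = q @ k.T / np.sqrt(d)
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                  # causal mask: a position only sees its prefix

attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

# The last row shows how the Last Token position weights every earlier position.
print(np.round(attn[-1], 3))
out_last = attn[-1] @ v                 # the summary vector assembled at the Last Token
```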

The content above is a macroscopic synthesis of current research results and represents our present understanding of how GPT operates. From here on I start adding my own inferences.

The first question: during GPT's training there is an enormous number of knowledge points to learn, and there must be some order in which they are learned. What priority does the model follow? Some current research says that important knowledge points are learned first, where "important" usually means important from the perspective of reducing the GPT model's NTP loss: the more a knowledge point reduces the loss, the more important it is. That is certainly correct from the loss-reduction perspective, but it is still too abstract.

My personal view is that during training the GPT model gives priority to knowledge points with the following characteristics: high-frequency knowledge points, general-purpose knowledge points (those with a high probability of being reused), and concrete rather than abstract knowledge points. These three principles should hold.

Why? By the logic of Next Token Prediction, the more often a knowledge point appears, the more often the model gets it wrong and runs backpropagation to correct its parameters so that it predicts correctly the next time it sees a similar situation. Because high-frequency knowledge points appear many times, they receive many such backpropagation updates, and it is therefore easier to establish the corresponding knowledge points and their connection paths to other knowledge points.

Once a high-frequency knowledge point has been learned, it is also encountered frequently in subsequent training data, so it contributes a great deal to reducing the NTP loss. The other two kinds of knowledge points behave similarly: general-purpose knowledge points, being highly reusable, have more opportunities to be used in later predictions, so they too receive many backpropagation updates and are easy for the model to learn; concrete, non-abstract knowledge points are likewise easy to establish because they appear often in the training data.

Conversely, low-frequency, domain- or task-specific, and abstract knowledge points are learned by the GPT model later. In other words, to learn such knowledge points the model must see a larger amount of data, which increases the chances that they receive the backpropagation parameter updates they need.
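
A toy illustration of "more frequent patterns receive more gradient updates and are learned first": a made-up two-pattern dataset trained with plain SGD on a table of logits. Nothing here comes from a real GPT; it only mimics the frequency argument above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "knowledge points": after context A the next token is X, after context B it is Y.
# Context A is high-frequency (90% of training steps), B is low-frequency (10%).
logits = np.zeros((2, 2))          # logits[context] over next tokens (X, Y)
targets = {0: 0, 1: 1}             # context A -> X, context B -> Y
lr = 0.1

for _ in range(300):
    ctx = 0 if rng.random() < 0.9 else 1        # sample a training example by frequency
    p = np.exp(logits[ctx]); p /= p.sum()
    grad = p.copy(); grad[targets[ctx]] -= 1.0  # softmax cross-entropy gradient
    logits[ctx] -= lr * grad                    # one backprop-style parameter update

prob = lambda ctx: np.exp(logits[ctx])[targets[ctx]] / np.exp(logits[ctx]).sum()
print(f"P(correct | high-frequency context A) = {prob(0):.2f}")   # learned quickly
print(f"P(correct | low-frequency  context B) = {prob(1):.2f}")   # lags behind
```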

[Figure: schematic of sub-circuit reuse and overlapping circuits; the red and blue lines mark two task circuits that share lower-layer sub-circuits]

Next, let us formally discuss the "circuit competition" conjecture. Before stating it, let me first make an assumption:

Hypothesis: to improve the parameter utilization of the GPT model, the NTP task encourages the reuse of sub-circuits.

A "sub-circuit" here means a circuit that performs a simple computation: it involves few knowledge points, and the excitation structure among those knowledge points is relatively simple. The GPT model probably first builds many sub-circuits that perform simple tasks or computations, while complex circuits are formed by further connecting many sub-circuits. To use its parameters efficiently, the GPT model should encourage these sub-circuits to be reused as much as possible inside different complex circuits, so that more types of tasks can be handled with the same number of parameters.

For example, the "Induction Head" circuit mentioned earlier is a typical sub-circuit. As we saw, in the more complex "Indirect Object Identification" circuit, the "Induction Head" circuit is one of the components. The relationship between sub-circuits and complex circuits is roughly of this kind.
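
For readers unfamiliar with it, the behavior an Induction Head implements can be caricatured as a pure "... A B ... A → predict B" copy rule; the sketch below expresses that rule as ordinary Python rather than attention heads:

```python
def induction_head_predict(tokens):
    """If the last token appeared before, predict the token that followed it then."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan for the most recent earlier occurrence
        if tokens[i] == last:
            return tokens[i + 1]
    return None

print(induction_head_predict(["Mr", "Dursley", "said", "that", "Mr"]))  # -> "Dursley"
```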

For two complex circuits that solve different tasks, sub-circuit reuse means they share some identical sub-circuits; we can call these shared sub-circuits "overlapping circuits". It is easy to infer that the more similar two tasks are, the more overlapping circuits they share.

Moreover, there should be more overlapping circuits in the lower layers of the Transformer, because the knowledge points involved in lower-layer circuits are more concrete, more numerous, and more reusable. The figure above is a schematic of "sub-circuit reuse and overlapping circuits": the red line (red task) and the blue line (blue task) mark two different complex task circuits, and near the bottom some sub-circuits are reused by both.

The "circuit competition" conjecture itself can be illustrated with this example. Suppose we input a prompt that is meant to trigger the red task. As the input excites the correct path layer by layer from the bottom up, the lower-layer knowledge points and sub-circuits, being highly reusable, easily produce "excess excitation": besides exciting the red task we want, they also excite many knowledge points and sub-circuits leading toward other task circuits.

This effect is strongest in the lower layers. As the information propagates upward, the red circuit is progressively reinforced, fewer and fewer upper-layer knowledge points and sub-circuits belonging to the wrong circuits get excited, and eventually the path of the correct red task circuit is traced out. This is the core picture behind the "circuit competition" conjecture.

If the circuit we want is the one activated during this bottom-up excitation, we can say that it has won the competition and the model outputs the correct answer. If a wrong task circuit is activated instead, the right circuit has lost the competition and the model outputs a wrong answer. It is easy to infer that the more complex a task is, the more knowledge points and sub-circuits it involves and the more intricate their relationships, so the more it overlaps with other similar task circuits and the more easily its circuit loses the competition.
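
A toy caricature of circuit competition, in which each "circuit" is just a set of invented knowledge-point names and its activation strength is the fraction of those points the prompt excites; this is only meant to illustrate excess excitation and the eventual winner, not a real mechanism:

```python
# Each "circuit" is defined by the knowledge points (sub-circuits) it needs.
circuits = {
    "red_task":  {"digits", "plus_sign", "equals_sign", "arithmetic_sub"},
    "blue_task": {"digits", "equals_sign", "comparison_sub"},
}

def compete(excited_points):
    # A circuit's activation strength: fraction of its knowledge points that fired.
    scores = {name: len(points & excited_points) / len(points)
              for name, points in circuits.items()}
    return max(scores, key=scores.get), scores

# A prompt like "3 + 4 = " excites these lower-level knowledge points, some of
# which also belong to the blue circuit (excess excitation).
winner, scores = compete({"digits", "plus_sign", "equals_sign", "arithmetic_sub"})
print(winner, scores)   # red_task wins; blue_task is partially excited but loses
```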

Within the framework of "circuit competition" we can think about and explain many problems and phenomena of LLM models. Later in this article I will use this conjecture to explain some behaviors of LLMs whose mechanisms are currently unknown.

2. Differences in Model Scale: the Bigger the Model, the Clearer the World


Based on existing research conclusions, if we think about the difference between large and small LLM models, we can roughly infer the following: a small LLM model builds a coarse-grained, blurry image of the world, and as the model grows larger, the LLM builds an image of the world with higher and higher resolution, able to represent more and more detailed information.

From the discussion above, the representational capacity of an LLM shows up in two respects: the hierarchical knowledge structure running from concrete to abstract, and the task circuits that solve concrete problems. Let us look at the differences between large and small models along these two dimensions.

Differences in Hierarchical Knowledge Structure:

Many studies have shown that as model size increases, the model becomes sparser and sparser. Polysemantic neurons encode features densely and are used to encode large numbers of relatively concrete features, while monosemantic neurons are single-neuron, sparse representations. This indicates that as the model grows, the proportion of monosemantic neurons increases.
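
A minimal toy contrast between dense polysemantic encoding (more features than neurons, stored in superposition along random directions, read out only approximately) and sparse monosemantic encoding (one dedicated neuron per feature); the feature counts and directions are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons = 32, 24     # more features than neurons: they must share neurons

# Polysemantic / superposition encoding: each feature gets a random direction
# spread across many neurons, so every neuron responds to many features.
directions = rng.normal(size=(n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

active = [3, 11]                                  # two features fire on this input
activation = directions[active].sum(axis=0)       # the shared neuron activations

readout = directions @ activation                 # approximate, noisy decoding
print("readout at the active features:", np.round(readout[active], 2))       # near 1
print("largest readout elsewhere:     ", np.round(np.delete(readout, active).max(), 2))

# Monosemantic encoding: one dedicated neuron per feature -- exact, but it needs
# as many neurons as features, which only a larger model can afford.
mono = np.zeros(n_features); mono[active] = 1.0
print("monosemantic readout:", np.nonzero(mono)[0])
```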

Monosemantic neurons encode important, abstract knowledge. Since their number increases, the model must be learning more knowledge points. There are only two possible sources for these new knowledge points. The first source is knowledge the small model had not learned at all and the large model now learns from scratch.

This first type of new knowledge can be divided into two sub-categories. One is world knowledge (common sense and events): a small model cannot encode world knowledge that appears infrequently in the data, whereas a large model uses monosemantic neurons for it (that large models learn more low-frequency knowledge from data than small models is supported by plenty of work, and at present world knowledge appears to be encoded by individual neurons). This kind of knowledge means the large model has learned more detailed information about the world. The other sub-category is more abstract knowledge (such as "prime number") newly induced by the model from the data; this kind of knowledge means the large model has learned more, and more complex, abstract knowledge and abilities.

The other source of new knowledge points should be feature splitting of the abstract features mentioned above. That is, where the small model had only a single coarse-grained abstract knowledge point, the larger model derives several new fine-grained knowledge points representing the same kind of knowledge, possibly learning a separate knowledge point for each different context.

For example, LLMs have been found to contain a monosemantic neuron that responds to runs of uppercase characters: if the input contains "ABCD", this neuron activates. A small LLM may have only this one neuron responding; if it is ablated, the loss rises sharply when GPT predicts the next Token, showing that without this feature the model gets subsequent predictions involving consecutive uppercase characters wrong. A large LLM, besides this neuron, also splits off fine-grained neurons: for example, one neuron may respond specifically to company abbreviations such as "IBM", another to medical abbreviations such as "GS (glucose injection)", and so on.

This splitting of abstract features in large models illustrates one point: even for abstract knowledge, large models have a more fine-grained capacity for expressing abstract features than small models.

In short, compared with a small model, a large model learns more detailed information about the world (seen from its encoding of low-frequency world knowledge) and acquires the ability to express harder and finer-grained abstract knowledge (seen from its newly acquired abstract knowledge and from feature splitting).

Differences in task circuits:

A task circuit is a circuit formed by bottom-up excitation among the knowledge points of the hierarchical structure. From the analysis of how the hierarchical knowledge structures of large and small models differ, a reasonable inference follows: a large LLM is very likely able to build circuits whose paths involve more fine-grained abstract knowledge points and are more complex. This is probably the main reason large models can solve complex problems.

Putting the two together, we can say that a small model is a coarse-grained model of the world, while a large model is a fine-grained, high-definition model of the world. The scaling law tells us that with more data and a larger model, the LLM can describe the world with ever higher definition. From this point of view, saying that the parameters of an LLM are a lossy compression of the world is not far off.

04

The endless frontier: explaining unknown phenomena with "circuit competition"

In this part, we explain some currently puzzling LLM phenomena within the framework of "circuit competition".

1. Model emergence from the perspective of "circuit competition"

"Emergent ability" refers to the phenomenon that for some tasks (mostly In Context Learning or COT-related tasks), small models can hardly solve them at all, and only once the model size crosses a certain critical point does the model start doing the task well.

Current research (see "Are Emergent Abilities of Large Language Models a Mirage?") argues that the so-called "emergent abilities" are an artifact of metric choice: there is no real emergence, only evaluation metrics that are not smooth or fine-grained enough for the chosen tasks.
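
A minimal sketch of that metric argument, under the simplifying assumptions that per-token accuracy p improves smoothly with scale and that an answer is an all-or-nothing string of L independently scored tokens: the exact-match score p^L stays near zero and then rises steeply even though p itself grows smoothly.

```python
# Smoothly improving per-token accuracy can look like sudden "emergence" under an
# all-or-nothing exact-match metric (assumption: the answer has L tokens, scored
# independently -- a deliberate simplification).
L = 10
for p in [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]:
    exact_match = p ** L
    print(f"per-token accuracy {p:.2f} -> exact-match {exact_match:.3f}")
# exact-match stays near 0 until p is high, then rises steeply.
```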

Personally, I think this explanation probably does account for some of the tasks that currently look "emergent", but it may not be the whole story; some tasks may be hard to explain by this reason alone, so why large language models exhibit emergent abilities still deserves further study.

If we view the problem within the "circuit competition" framework, there are two possible reasons why a small model cannot do a task: one is that the excitation circuit for the task simply has not been established in the small model, while it has been established in the large model; the other is that the small model has established the circuit too, but it loses the circuit competition very easily, which makes it look as if the model cannot do the task.

I lean toward the first possibility as the cause of the "emergent abilities" we currently observe. As argued above, a small model probably builds a coarse-resolution, blurry image of the world, while a large model builds a high-resolution, higher-definition one.

A small model should find it hard to establish a complete excitation circuit for certain tasks, for several reasons: one or several relatively abstract concept knowledge points that are key to forming the circuit may never have been established, because the small model's capacity for abstraction is weak (similar to the "prime number" example at the beginning of this article); and the tasks that show emergent abilities are usually relatively complex, and a small model is not capable of building such complex pathways, and so on.

When the model becomes large enough, its ability to build abstract concepts and complex circuits improves. Once a complete activation path for solving the task is established, the model seems suddenly able to solve the problem, which is what we see as emergence. However, such a complex circuit is quite likely not yet strong enough to win the activation competition on its own, so when a few task-related examples are supplied to help the task's circuit win the competition, we see clearly better performance.

2. In Context Learning and Chain of Thought (COT) from the perspective of "circuit competition"

Viewed through circuit competition, ICL probably involves two kinds of circuits: the task circuit and the Attention circuit. The two sometimes compete and sometimes cooperate, and together they determine ICL performance. COT is a special kind of ICL, and its mechanism should be similar.

Let us first look at the role of the task circuit, which is easy to understand. In Context Learning first gives the LLM a few task-related examples (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), then inputs x_{n+1}, expecting the model to output the corresponding correct result y_{n+1}.

The role of the n examples in the input is to activate the corresponding task circuit that the LLM learned during pre-training, so that when x_{n+1} is then input, it readily follows the activated path and produces the correct output y_{n+1}.

COT should work similarly: without COT, the LLM may activate a task circuit with a simple structure, but with COT examples it more easily activates a complex reasoning circuit rich in intermediate detail, so that the subsequent input also follows this pathway and produces detailed reasoning steps. In the ICL setting, then, the task circuit always plays a positive role in generating the correct answer y_{n+1}.
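
As a minimal sketch of what an ICL prompt physically looks like when assembled from (x_i, y_i) demonstrations plus the new query x_{n+1} (the sentiment examples and the "Review/Sentiment" template are invented for illustration):

```python
examples = [("great movie, loved it", "positive"),
            ("terrible plot, fell asleep", "negative"),
            ("what a fun ride", "positive")]
query = "the acting was dreadful"

# Concatenate the demonstrations and the query; the model is expected to continue
# with the label y_{n+1}, ideally by activating the matching task circuit.
prompt = "".join(f"Review: {x}\nSentiment: {y}\n\n" for x, y in examples)
prompt += f"Review: {query}\nSentiment:"
print(prompt)
```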

[Figure: the operating mechanism of the Enhanced Induction Head (EIH) attention circuit]

Now let us look at the Attention circuit. This part too is only an idea (the work "In-context Learning and Induction Heads" tries to explain the ICL phenomenon through Induction Heads, but I think the plain Induction Head mechanism is too simple and probably needs strengthening).

Suppose there is an enhanced version of the Induction Head circuit; call it "Enhanced Induction Head (EIH)". Its operating mechanism is probably as follows (see the figure above): based on the semantic similarity between the current input x_{n+1} and the x_i of each ICL example (x_i, y_i), the EIH circuit copies the corresponding y_i; the higher the similarity between x_{n+1} and x_i, the higher the probability of copying y_i.

This process is somewhat like a KNN model built out of EIH circuits: the answer is obtained by a similarity-weighted vote over the input examples and their labels. It does not require the model to learn the mapping function between x and y by modifying its parameters; it is a conditional Induction Head copy operation, whose trigger condition is the Attention similarity between x_{n+1} and the example x_i.

It follows that which label y influences the output y_{n+1} should mainly depend on three properties of the ICL examples: the more similar an example's x_i is to x_{n+1}, the greater its influence; the more often a label y appears among the ICL examples, the greater its influence; and the closer an example is to x_{n+1} in position, the greater its influence (the positional information encoded by position embeddings, and the heavy local correlation in NLP data, probably lead to this last effect).
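
A minimal sketch of this similarity-weighted copy rule, with attention similarity stood in for by a crude word-overlap score; the scoring function, examples, and labels are all invented, and the point is only the KNN-like conditional copying:

```python
from collections import Counter

def similarity(a: str, b: str) -> float:
    """Toy stand-in for attention similarity: word-overlap score."""
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    return sum((wa & wb).values()) / max(len(a.split()), len(b.split()))

def eih_copy(examples, query):
    """Similarity-weighted vote over example labels (a KNN-like conditional copy)."""
    votes = Counter()
    for x_i, y_i in examples:
        votes[y_i] += similarity(query, x_i)
    return votes.most_common(1)[0][0], votes

examples = [("the movie was great fun", "positive"),
            ("the plot was dull and slow", "negative"),
            ("great acting great score", "positive")]
print(eih_copy(examples, "the acting was great"))   # copies "positive" most strongly
```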

If an EIH circuit exists and operates as above, we can infer the Attention circuit's influence on correctly predicting y_{n+1} in the following three cases:

Case 1: the labels of the ICL examples (x_i, y_i) are the ground-truth labels. Clearly the EIH circuit then has a positive influence, similar to the KNN mechanism above: the answer is judged from the y_i of the examples whose x_i are most similar to x_{n+1}.

Case 2: the labels of the ICL examples are not ground-truth labels but are drawn at random from the label space. The EIH circuit should then have a negative effect on getting the correct answer y_{n+1}: among the earlier examples it looks for the x_i most similar to x_{n+1} and copies the corresponding label, but since that label was assigned at random it is most likely wrong, so in this case the EIH circuit is a negative influence.

Case 3: the labels of the ICL examples come from a different label set, outside the original label space, but there is still a consistent mapping to x. In this case the EIH circuit should have a positive effect, just as in Case 1: the KNN mechanism can pick up this new mapping, and getting the correct answer simply means outputting the new label instead of the original y_{n+1}. Of course, if performance is still measured against the original y labels, the ICL examples will necessarily have a negative effect.

If we consider the LLM's internal task circuit and the pure-attention EIH circuit jointly, sometimes they cooperate in the same direction and sometimes they compete in opposite directions. In the three cases above, Case 1 is synergy: both push toward the correct answer. Cases 2 and 3 are competition: the task circuit pushes toward the correct answer while the EIH circuit pushes against it.

This line of thinking can roughly explain many of the seemingly inexplicable phenomena reported in ICL research. One example: current research shows that if the label space of an ICL task contains two labels y_1 and y_2, and we flip the labels of the examples, replacing y_1 with y_2 and y_2 with y_1, then ICL performance degrades (see: "Overthinking the Truth: Understanding how Language Models process False Demonstrations").

Suppose the correct label for x_{n+1} is y_1. From the perspective of the two circuits, the task circuit tends to output y_1, while the EIH circuit is in Case 3 above: label flipping is a special kind of label change, because the mapping between x and the (flipped) y still exists. So the EIH circuit effectively learns the flipped mapping and tends to output y_2. One circuit pushes in the right direction and the other in the wrong one; they compete, and so the model's performance drops.

Many other phenomena can also be explained within this framework, but for reasons of length I will not expand on them here; interested readers can work them out within this framework themselves.

3. Domain-task Fine-tuning from the perspective of "circuit competition"

From the perspective of "circuit competition", we can re-examine what fine-tuning a general model on domain data may do to it. What we already know is that fine-tuning on domain data causes "catastrophic forgetting" in the base model: because the subsequent fine-tuning modifies the model parameters, the model forgets some of the knowledge it had learned before.
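
A toy illustration of catastrophic forgetting in the crudest possible setting, two linear-regression "tasks" sharing one set of parameters trained by plain gradient descent; it says nothing about real LLM fine-tuning beyond the basic mechanism of later updates overwriting earlier ones:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(w_true, n=200):
    X = rng.normal(size=(n, 2))
    return X, X @ w_true

def train(w, X, y, steps=200, lr=0.05):
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(X)   # gradient of mean squared error
        w = w - lr * grad
    return w

loss = lambda w, X, y: float(np.mean((X @ w - y) ** 2))

XA, yA = make_task(np.array([1.0, -2.0]))   # the "general" task A
XB, yB = make_task(np.array([-3.0, 0.5]))   # the "domain" task B

w = train(np.zeros(2), XA, yA)              # pre-train on A
print("task A loss after pre-training:", round(loss(w, XA, yA), 4))

w = train(w, XB, yB)                        # fine-tune the same parameters on B
print("task A loss after fine-tuning on B:", round(loss(w, XA, yA), 4))  # much worse
```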

My own judgment is that, at present, any form of tuning on top of a base model will damage some of the base model's abilities, including the Instruct tuning ChatGPT uses to understand commands and align with human values; it too should hurt some abilities of the base model, we just cannot yet say which ones. Under current technical conditions this is the price that must be paid for tuning the model.

But why does fine-tuning the base model damage its abilities? What is the underlying mechanism? We can analyze the impact of fine-tuning from the perspective of "circuit competition". I suspect there are roughly two kinds of effects, which may act separately or together.

The first effect: through a large amount of domain data, the fine-tuning operation strengthens the large language model's response circuit for this particular task. This presumably has little impact on the model's lower-layer knowledge points, because the lower layers hold more general features that this task also needs. What it mainly modifies should be the more abstract knowledge points in the upper layers, together with the excitation paths from lower-layer knowledge points up to those abstract knowledge points.

The second possible effect: fine-tuning may establish a shortcut inside the model, so that after input the information takes the shortcut directly and bypasses many paths it should have traversed. Take text classification: the internal logic of this task should be very simple, presumably an excitation path from lower-layer knowledge points about domain-specific vocabulary up to upper-layer knowledge points about abstract category concepts. Fine-tuning is therefore likely to build a very short shortcut straight from the lower-layer knowledge points to the high-level category-concept knowledge points, and other, more complex circuits get bypassed by this shortcut. The upper-layer abstract knowledge points are not necessarily overwritten; quite possibly they are simply skipped over.

Whichever of the two effects is at work, the result is the same: for a new input, even one belonging to a different task, this specially strengthened circuit is easily activated. That is, the strengthened circuit tends to win the competition when it should not, and performance on other tasks deteriorates.

4. Instruct Tuning from the perspective of "circuit competition"

Instruct tuning is essentially a special kind of fine-tuning whose purpose is alignment with human behavior. The GPT 4 technical report also points out that instruct tuning does not enhance the knowledge or ability of the base model; on the contrary, it may damage some abilities. High-quality instruct tuning is certainly very important, but it only makes the large language model "look" better, which is a matter of the user's subjective experience, not of the model's underlying capability.

So, from the perspective of "circuit competition", what is instruct tuning doing? I think it can be understood this way: instruct tuning builds a special activation circuit, that is, the activation circuit formed by the input command itself becomes connected to the corresponding task circuit. After the model is trained on instructions, inputting a command helps activate the corresponding task circuit, and so the model appears to understand what the command means. This is somewhat like the "conditioned reflex" in Pavlov's experiments: a conditioned-reflex pathway is established between the user's command and the corresponding task circuit.

Beyond offering plausible explanations for the phenomena above, whose internal mechanisms are currently unknown, the "circuit competition" conjecture can also explain some other phenomena. For example, the "confidently talking nonsense" problem that large models often exhibit can be viewed as a case where, during circuit competition, the correct circuit loses, or the correct circuit and a wrong circuit are excited with similar strength and the result is a blend of the two: an answer that looks reasonable but is factually wrong. And so on.

05

Parametric reflection of the world: from the real world to the possible world


The physical world has its own hidden rules that govern how it operates. Conceptually, we can imagine a simple hidden world producing a rich world of appearances. If we classify all phenomena in the world, they fall roughly into natural phenomena, social phenomena, and psychological phenomena. Human beings are part of the physical world; by observing its appearances and trying to understand its laws, we better sustain the survival of the species and the individual.

From the perspective of the species, survival of the fittest over tens of millions of years of evolution is the pre-training process of the "human model". The optimization objective is "Next Person's Survival Prediction": the smaller the loss, the more individuals in the population survive. The genetic code is the model parameters; of the individuals it encodes, those suited to the environment survive and those unsuited are eliminated. Survivors survive because certain traits encoded in their genes fit the living environment, so the genetic code that matches the environment is reinforced in the population, and the human pre-training model completes one parameter update.

The constant changes in the external physical environment drive changes in the population's genetic code, allowing the population to keep surviving in a changing environment. The genetically encoded pre-trained model we are born with records the survival strategies learned over tens of millions of years and forms the unconscious, fast-responding System 1 in the brain; it represents the collective memory of the species.

From the perspective of the individual, beyond the innate survival strategies provided by the genetically encoded pre-trained model, each person carries out "continual pre-training" over the course of life in order to survive in a specific environment. Its optimization objective is "Next Action Prediction": output the right behavior in the environment so as to stay alive.

The parameter-update strategy resembles LoRA: for an individual, the innate genetic code is an unmodifiable base model that determines many of our behavior patterns, but some regions of the brain can be modified, adjusting the connections between neurons to learn new knowledge and skills.

If an output behavior turns out to be bad for continued survival, the model parameters are adjusted so as to cope better with the environment in the future. The function of these brain regions forms the conscious, slow-deciding System 2, which represents the individual's personal survival experience. "Innate genetic code + individual survival fine-tuning" produces a rich variety of individual behaviors, with both commonality and individuality: the commonality comes from the species' collective memory, and the individuality from each person's unique survival experience.

Language originally served as a tool for communication and collaboration between human individuals, which helped the species survive. As technology developed, it was gradually recorded on tortoise shells, bamboo slips, paper, and electronic signals, forming text. Each person can be seen as an independent "encoder-decoder": the individual observes and experiences the physical world, encodes it in the brain into knowledge and thought, and decodes it into text, which records feelings about and reflections on the world from a personal perspective, including both subjective impressions and objective records. The crowd as a whole forms a distributed "encoder-decoder", whose decoded output is a huge body of written records containing both objective facts about how the world works and subjective, even conflicting, ideas.

Text, then, is only the surface; what is recorded within it is humanity's cognition of the physical world and its subjective feelings about that world (physical knowledge, social knowledge, records of events, individual feelings, individual imaginings, and so on). Behind the text lies a world model from the human perspective. By trying to reproduce human-produced text correctly through the Next Token Prediction task, GPT is in essence decoding and reconstructing the world model hidden behind the surface of the text and storing it in its parameters, forming a parametric reflection of the physical world.

If we think a step further, GPT may not only learn from a vast amount of text how to generate content that fits the facts of our real world; it may also learn to be a generator of "possible worlds". Starting from text, it models our real world and then generalizes and abstracts from it; as a result, it can produce not only real knowledge and content of the world we perceive, which follows our world's physical laws, but also possible worlds with other physical laws that are still intelligible to human logic.

Perhaps we cannot call its output wrong simply because it does not fit the real world; we can only say that it is able to show us every logically possible world, many of which inevitably do not match reality. After all, the existing world is just one realized choice among the possible worlds, and GPT has the ability to present you with all sorts of reasonable possibilities.

06

The End of the World and Hard-Boiled Wonderland: the "Brain in a Digital Vat" Thought Experiment


"A mad scientist performs an operation in which he cuts out a person's brain and puts it in a container filled with a nutrient solution. The nutrient solution contains enough nutrients to keep the brain functioning properly, while the brain's nerve endings are connected to wires The other end of the wire is connected to a computer. The computer simulates the parameters of the real world and transmits information to the brain through the wire, making the brain feel that everything is completely normal, as if the people around you and familiar things are still going on as usual, without Any abnormality.

One day, the brain in the nutrient solution has a whim and thinks of a very interesting thought experiment. In its perceived reality, at the moment when he/she is on the subway to work or sitting at the office desk, hearing someone's faint footsteps, he/she takes out a phone and jots the idea down in a memo, which reads as follows:

'OpenAI has released a new LLM model called GPT 4. It is extremely powerful and may herald the arrival of the AGI era; everyone around me is discussing it excitedly. Today I read an article analyzing how it might work, titled "The Parametric Reflection of the World: Why GPT Can Generate Intelligence Through Next Token Prediction". It was very inspiring and set me thinking.

We can imagine: if a future AGI is powerful enough, it could read everything I have written, my photos and videos, and even scan and copy my brain's response patterns, to reconstruct in digital space a brain exactly like mine in the physical world. Then another me would live in that digital space, and the AGI would take over all the sensory signals of my digital brain, simulating my work and life so that the brain feels everything is perfectly normal, as if the familiar people and things around me were carrying on as usual, with nothing out of the ordinary. So, could the me inside this digital brain, or the me in real life, tell whether I am living in digital space or physical space? I call this thought experiment: the brain in a digital vat. Isn't it interesting?'

I call this thought experiment: the brain in a digital vat. Isn't it interesting?"



