Understand generative AI in seconds: how does a large language model generate content?

At the core of the large language models that have attracted so much attention is the ability to understand natural language and generate text. Have you ever wondered how they actually understand language and produce content, and what their working principle is?

To understand this, we first need to step outside the field of large language models and look at machine translation. Traditional machine translation relies on the recurrent neural network (RNN).

A recurrent neural network (RNN) is a type of neural network that takes sequence data as input, recurses along the direction in which the sequence evolves, and connects all of its nodes (recurrent units) in a chain.

Definition source: Wenxin Yiyan (ERNIE Bot)

Take the sentence "I painted a painting." The RNN first splits it into four tokens: "I", "paint", "a", and "painting" (in the original Chinese example, the verb "paint" and the noun "painting" are the same word), and then works through them one token at a time, understanding and translating the sentence step by step, for example:

[Image: the RNN translating the sentence token by token]

It then outputs the translation: "I have drawn a picture."

This approach is simple and direct, but because of the RNN's inherently sequential structure, it cannot process massive amounts of text in parallel and therefore runs slowly. In addition, it tends to "forget the beginning by the time it reaches the end": when an RNN processes long sequences, its gradients can vanish or explode.
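
To make the sequential bottleneck concrete, here is a minimal sketch of a vanilla RNN step loop (my own illustration, not code from the original article; the token embeddings and weight matrices are random placeholders). Each hidden state depends on the previous one, so the tokens cannot be processed in parallel:

```python
import numpy as np

def rnn_forward(token_embeddings, W_x, W_h, b):
    """Run a vanilla RNN over a sequence, one token at a time."""
    hidden = np.zeros(W_h.shape[0])
    for x_t in token_embeddings:          # strictly sequential: step t needs the result of step t-1
        hidden = np.tanh(W_x @ x_t + W_h @ hidden + b)
    return hidden                         # a single vector summarizing the whole sentence

# Toy setup: 4 tokens (think "I", "paint", "a", "painting") with 8-dimensional embeddings
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
W_x, W_h, b = rng.normal(size=(16, 8)), rng.normal(size=(16, 16)), np.zeros(16)
print(rnn_forward(tokens, W_x, W_h, b).shape)   # (16,)
```

Because everything about the sentence has to pass through that single hidden vector, long sequences are also where the "forgetting the beginning" and vanishing or exploding gradient problems show up.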

That changed in 2017, when Google Brain and Google Research jointly published a paper titled "Attention Is All You Need". The paper offered a new approach to machine translation and introduced an architecture whose name happens to recall the "Transformers" franchise: the Transformer.

A Transformer is a neural network that learns context, and therefore meaning, by tracking relationships in sequence data. Proposed by Google in 2017, it is one of the newest and most powerful classes of models invented to date.

Definition source: Wenxin Yiyan (ERNIE Bot)

The Transformer can process massive amounts of text in parallel because it uses a special mechanism called self-attention. When we read a long passage, the brain relies on attention to pick out the key words and relate them to one another, so that we can grasp the article even after only skimming it. The purpose of this mechanism is to give AI the same ability.

Self-attention is an attention mechanism that applies linear transformations to the input sequence to obtain a distribution of attention weights, and then weights each element of the input sequence according to that distribution to produce the final output.

Definition source: Wenxin Yiyan (ERNIE Bot)
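
To make that definition concrete, here is a minimal sketch of single-head self-attention (my own illustration of the usual formulation, not code from the article): the input is linearly transformed into queries, keys, and values, the query-key dot products become an attention weight distribution via softmax, and each output is the corresponding weighted sum of the values.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence X of shape (n_tokens, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v               # linear transformations of the input
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # how strongly each token relates to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax: each row is a weight distribution summing to 1
    return weights @ V                                # each output vector is a weighted mix of all tokens

# Toy example: 4 tokens with 8-dimensional representations and random projection matrices
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)         # (4, 8): one updated vector per token
```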

The same sentence, "I painted a painting," is again split into four tokens: "I", "paint", "a", and "painting". In the Transformer, they pass through four stages: input, encoder, decoder, and output.

[Image: the four stages of the Transformer: input, encoder, decoder, and output]

Specifically, once the sentence has been split up and fed into the encoder, the encoder first generates an initial representation of each word, which you can think of as an initial judgment about that word; for example, "painting" could be a noun, but it could also be a verb.

Next, the self-attention mechanism calculates how strongly each word is related to every other word, which you can think of as scoring. For example, the first "paint" is closely related to "I", so that pair scores 6 points; the second "painting" is also closely related to "a", scoring as high as 8 points; while "I" has little to do with "a", so that pair scores -2 points.

[Image: attention scores between the tokens]

The initial representations generated earlier are then adjusted according to these scores. Because the first "paint" is strongly related to "I", its representation can lower the weight of the noun reading and raise the weight of the verb reading; because the second "painting" is strongly related to "a", its representation can lower the verb reading and raise the noun reading.

Finally, the adjusted representations are passed to the decoder, which outputs a translation based on its understanding of each word in context. Throughout this process, all the words can be handled simultaneously, which greatly improves processing speed.
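
As a back-of-the-envelope illustration of the "adjust the representation based on the scores" step: a softmax turns one word's scores into weights that sum to 1, and that word's new representation is the weighted mix of all the initial representations. The score of 6 (the first "paint" against "I") is the article's example figure; every other number below is an invented placeholder.

```python
import numpy as np

# Hypothetical initial representations for the tokens "I", "paint", "a", "painting"
reps = np.array([[1.0, 0.0],
                 [0.5, 0.5],
                 [0.0, 1.0],
                 [0.5, 0.5]])

# Scores of the first "paint" against each token; 6.0 comes from the article, the rest are placeholders
scores = np.array([6.0, 2.0, -2.0, 1.0])

weights = np.exp(scores) / np.exp(scores).sum()   # softmax: scores -> weights that sum to 1
new_rep = weights @ reps                          # the new "paint" vector now leans heavily toward "I"
print(weights.round(3))                           # roughly [0.975 0.018 0.    0.007]
```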

But what does such a Transformer have to do with large language models?

Large language models are, at root, deep learning models trained on large amounts of text data, and the Transformer provides the horsepower needed to train on text at that scale. Moreover, once the processed representations reach the decoder, they can be used to infer the probability of the next word, generating content word by word from left to right. At each step, the model takes the words it has already generated into account when inferring the next one.

[Image: inferring the next word step by step from the words generated so far]

For example, based on the words "a" and "painting", the model infers that the most probable next word is "style"; then, taking "a", "painting", and "style" together, it infers that the next word is "ink"; and so on for the word after that, and the one after that. This is the content generation we see from a large language model.
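
The word-by-word process described above is usually called autoregressive generation. Here is a minimal sketch of it (my own illustration; `predict_next_word_probs` is a hypothetical stand-in for the real model, not an actual API):

```python
def generate(prompt_tokens, predict_next_word_probs, max_new_tokens=10, end_token="<end>"):
    """Greedy autoregressive generation: repeatedly pick the most probable next word."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = predict_next_word_probs(tokens)   # the model scores every candidate next word
        next_word = max(probs, key=probs.get)     # greedily take the single most probable one
        if next_word == end_token:
            break
        tokens.append(next_word)                  # the new word joins the context for the next step
    return tokens

# Toy stand-in for the model, hard-coded to mimic the article's example
fake_model = lambda toks: ({"style": 0.6, "ink": 0.3, "<end>": 0.1} if toks[-1] == "painting"
                           else {"ink": 0.7, "<end>": 0.3} if toks[-1] == "style"
                           else {"<end>": 1.0})
print(generate(["a", "painting"], fake_model))    # ['a', 'painting', 'style', 'ink']
```

Real models usually sample from the probability distribution instead of always taking the single top word, but the loop structure is the same.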

This is why the Transformer is generally regarded as the starting point for large language models.

So how does self-attention, the most critical mechanism in the Transformer, know "how many points to score"?

The answer is a relatively complex formula:

[Image: the attention formula from "Attention Is All You Need"]
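
For reference, the scaled dot-product attention formula from "Attention Is All You Need" (presumably what the image above showed) is:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
$$

Here Q, K, and V are the query, key, and value matrices obtained by linearly transforming the input, and d_k is the dimension of the keys; the QKᵀ term is exactly the pairwise "scoring" discussed above.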

For a simple intuition, think back to what math textbooks say about vectors. When two vectors a and b point in the same direction, a·b = |a||b|; when a and b are perpendicular, a·b = 0; when a and b point in opposite directions, a·b = -|a||b|.

If we regard a and b as the projections into space (that is, the vector representations) of two of the four words "I", "paint", "a", and "painting", then the value of a·b is the score.

The larger this value, the more aligned the directions of the two vectors are, and therefore the more strongly the two words are related;

if the value is 0, the two vectors are perpendicular and the two words are unrelated;

if the value is negative, the two vectors point in opposite directions: not only are the two words unrelated, the mismatch between them is large.

That is only a rough intuition. In reality, a whole series of more complicated calculations is involved, and they are repeated many times so that the model can extract more accurate information and pin down the meaning of each word from its context.
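
Still, the dot-product intuition is easy to check numerically. The 2-dimensional "word vectors" below are made up purely for illustration:

```python
import numpy as np

# Made-up 2-D "word vectors", purely to illustrate the dot-product intuition
i_vec     = np.array([ 1.0,  0.2])
paint_vec = np.array([ 0.9,  0.3])    # roughly the same direction as "I"  -> large positive score
a_vec     = np.array([-0.2,  1.0])    # roughly perpendicular to "I"       -> score near 0
opp_vec   = np.array([-1.0, -0.2])    # opposite direction to "I"          -> negative score

print(np.dot(i_vec, paint_vec))   # ~0.96  (related)
print(np.dot(i_vec, a_vec))       #  0.0   (unrelated)
print(np.dot(i_vec, opp_vec))     # ~-1.04 (strongly mismatched)
```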

That, in outline, is how a large language model works. The Transformer's usefulness is not limited to natural language processing; it also powers computer vision and speech tasks such as image classification, object detection, and speech recognition. It is fair to say that the Transformer is the key behind this year's explosion of large models.

Of course, no matter how powerful the Transformer is, it can only work with the input it is given. If we want generative AI to produce content that better meets our needs, a good input is an essential prerequisite, so in the next issue we will talk about what makes a good input, that is, what makes a good Prompt.

This article comes from Baidu AI. If there is any infringement, please contact us to delete it.

Original article: blog.csdn.net/weixin_57291105/article/details/133135485