From Principles to Code: Understanding Language Model Training and Inference


Author: Health Controller
Link: https://zhuanlan.zhihu.com/p/656758138

Today I'm sharing a blog post that introduces the training and inference of language models. It is easy to follow and captures the essential ideas; I highly recommend reading it.

Title: Language Model Training and Inference: From Concept to Code
Author: CAMERON R. WOLFE
Original text:
https://cameronrwolfe.substack.com/p/language-model-training-and-inference

To understand large language models (LLMs), you must first understand their essence: whether in pre-training, fine-tuning, or inference, the core is next-token prediction, i.e., generating text from left to right in an autoregressive manner.


What is a token?

A token is a word or subword in the text. Given a piece of text, the first thing to do before sending it to the language model is to tokenize it, that is, to split the text sequence into a sequence of discrete tokens.


The tokenizer is trained on an unlabeled corpus and has a fixed, unique set of tokens. This set is what people usually call the vocabulary, i.e., all the tokens the language model knows.

After tokenization, each token corresponds to an embedding, provided by the embedding layer of the language model. Obtaining the embedding of a given token is essentially a table lookup.

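As a rough, illustrative sketch (not code from the original post): nanoGPT prepares its data with tiktoken's GPT-2 BPE tokenizer, and the embedding lookup can be expressed with a PyTorch `nn.Embedding` table. The sizes 50257 and 768 are GPT-2 values used here only as examples.

```python
import tiktoken
import torch
import torch.nn as nn

# GPT-2's BPE tokenizer (the one nanoGPT uses to prepare its data)
enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("Language models predict the next token.")
print(tokens)              # a list of integer token ids
print(enc.decode(tokens))  # decoding round-trips back to the original text

# the embedding layer is just a lookup table of shape (vocab_size, n_embd)
vocab_size, n_embd = 50257, 768
wte = nn.Embedding(vocab_size, n_embd)

idx = torch.tensor(tokens).unsqueeze(0)  # token ids, shape (1, seq_len)
tok_emb = wte(idx)                       # embeddings, shape (1, seq_len, n_embd)
```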

Text sequences are ordered, but common language models are built on the transformer's attention mechanism, which by itself does not take the order of the text into account. Therefore, position encoding has to be added explicitly: each position gets a position embedding, which is added to the token embedding at that position.


During training or inference you often hear the term context length. It refers to the maximum number of tokens the model receives during training. If only position embeddings up to a certain length are learned during training, the model cannot be applied to longer texts at inference time (because it has never seen the position encodings of longer texts).

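Continuing the previous sketch (the `block_size` of 1024 is just an illustrative value), learned position embeddings are simply another lookup table with one row per position, which also shows why sequences longer than the trained context length are a problem:

```python
block_size = 1024                        # the context length (illustrative value)
wpe = nn.Embedding(block_size, n_embd)   # one learned embedding per position

t = idx.size(1)                          # current sequence length
assert t <= block_size, "sequence is longer than the trained context length"
pos = torch.arange(t)                    # positions 0 .. t-1
pos_emb = wpe(pos)                       # (t, n_embd)
x = tok_emb + pos_emb                    # added to the token embeddings (broadcast over the batch)
```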

Language model pre-training

Once we have the token embeddings and position embeddings, we feed them into a decoder-only transformer, which outputs a corresponding embedding at each token position (this can be understood as a kind of feature extraction).


Once we have an output embedding for each token, we can use it for next-token prediction. In fact, this is treated as a classification problem (see the sketch after this list):

  • First, we send the output embedding through a linear layer whose output dimension is the vocabulary size, i.e., we predict which "class" in the vocabulary the next token belongs to.

  • To turn the outputs into normalized probabilities, a softmax transformation is applied.

  • During training, this probability is maximized so that the model predicts the real next token.

  • During inference, the next token is sampled from this probability distribution.


Training phase: thanks to causal self-attention, we can predict the next token at every position of an entire sentence at once and compute the loss over all positions, so only one forward pass is needed.

Inference phase: prediction proceeds autoregressively:

  • Predict the next token

  • Append the predicted token to the sequence generated so far

  • Predict the next token based on the extended sequence

  • Repeat until the end

When predicting the next token, we sample each time from a probability distribution, and the sampling strategy differs slightly depending on the scenario: common choices are greedy decoding, nucleus (top-p) sampling, top-k sampling, and so on. In addition, we often see the concept of temperature, which controls the randomness of generation: the smaller the temperature, the more deterministic the output.
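As an illustrative sketch (the logits below are random stand-ins), these strategies differ only in how they turn next-token scores into a choice:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(50257)          # next-token scores (random stand-in values)

# greedy decoding: always take the most likely token
greedy_token = torch.argmax(logits)

# temperature: divide the logits before softmax;
# a smaller temperature sharpens the distribution and makes generation more deterministic
temperature = 0.8
probs = F.softmax(logits / temperature, dim=-1)

# top-k sampling: keep only the k most likely tokens, renormalize, then sample
k = 50
topk_probs, topk_idx = torch.topk(probs, k)
topk_probs = topk_probs / topk_probs.sum()
next_token = topk_idx[torch.multinomial(topk_probs, num_samples=1)]
```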


Code

The code below comes from the nanoGPT project: https://github.com/karpathy/nanoGPT/tree/master. It is an excellent project and recommended for beginners.

Transformer-based models are built by stacking many Blocks. Each Block mainly consists of two parts:

  • Multi-headed Causal Self-Attention

  • Feed-forward Neural Network


A single Block can then be built from these two components, roughly as sketched below.

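A simplified sketch in the spirit of nanoGPT's Block (the real code additionally configures bias, dropout, and a flash-attention path; this version needs PyTorch 2.x for `scaled_dot_product_attention`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Simplified multi-head causal self-attention."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # query, key, value in one projection
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim) for multi-head attention
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal mask built in
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class MLP(nn.Module):
    """Feed-forward network: expand to 4x the width, GELU, project back."""
    def __init__(self, n_embd):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd)
        self.c_proj = nn.Linear(4 * n_embd, n_embd)

    def forward(self, x):
        return self.c_proj(F.gelu(self.c_fc(x)))

class Block(nn.Module):
    """One transformer Block: pre-LayerNorm, attention and MLP, each with a residual connection."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = MLP(n_embd)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual around attention
        x = x + self.mlp(self.ln_2(x))   # residual around the feed-forward network
        return x
```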

Then look at the structure of the entire GPT


It mainly consists of two embedding layers (token and position), a stack of Blocks, some additional dropout and LayerNorm layers, and finally a linear layer used to predict the next token. It really is that simple.

Weight tying is also used here: the weight of the final linear layer used for classification is shared with the weight of the token embedding layer.
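A skeleton of the overall model, roughly following nanoGPT's layout (the default sizes are GPT-2-like values used for illustration; `Block` is the class sketched above):

```python
class GPT(nn.Module):
    def __init__(self, vocab_size=50257, block_size=1024,
                 n_layer=12, n_head=12, n_embd=768, dropout=0.0):
        super().__init__()
        self.block_size = block_size
        self.wte = nn.Embedding(vocab_size, n_embd)   # token embeddings
        self.wpe = nn.Embedding(block_size, n_embd)   # position embeddings
        self.drop = nn.Dropout(dropout)
        self.blocks = nn.ModuleList([Block(n_embd, n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)              # final LayerNorm
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        # weight tying: the classification matrix shares weights with the token embedding table
        self.lm_head.weight = self.wte.weight
```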

Next, let's focus on how the forward pass works in training and inference, which helps in understanding the principle.

First, build the token embeddings and position embeddings, add them together, apply dropout, and then feed the result into the transformer Blocks.


Note that the tensor keeps the same dimensions after passing through the transformer Blocks. Once we have the output embedding at each token position, it can be classified by the final linear layer and optimized with a cross-entropy loss.

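A sketch of the forward pass (this method belongs inside the GPT class above; it loosely mirrors nanoGPT's forward, with details such as weight initialization omitted):

```python
# a method of the GPT class sketched above
def forward(self, idx, targets=None):
    B, T = idx.shape
    assert T <= self.block_size, "sequence longer than the trained context length"
    pos = torch.arange(T, device=idx.device)

    tok_emb = self.wte(idx)            # (B, T, n_embd) token embeddings
    pos_emb = self.wpe(pos)            # (T, n_embd)    position embeddings
    x = self.drop(tok_emb + pos_emb)   # superimpose, then dropout
    for block in self.blocks:
        x = block(x)                   # shape stays (B, T, n_embd) throughout
    x = self.ln_f(x)

    logits = self.lm_head(x)           # (B, T, vocab_size)
    loss = None
    if targets is not None:
        # flatten so every position contributes a next-token classification loss
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    return logits, loss
```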

Let's look at the complete process again: you only need to shift the input one position to the left to use it as the target.

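For example, a toy training step under the sketches above (the token stream and the tiny model configuration are stand-ins):

```python
# stand-in token stream; in nanoGPT the tokens come from the prepared dataset
data = torch.randint(0, 50257, (10_000,))

block_size, start = 1024, 0
x = data[start : start + block_size].unsqueeze(0)          # input tokens
y = data[start + 1 : start + 1 + block_size].unsqueeze(0)  # the same tokens shifted left by one

model = GPT(n_layer=2, n_head=4, n_embd=128)  # tiny illustrative configuration
logits, loss = model(x, y)  # one forward pass scores every position at once
loss.backward()
```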

Next let’s look at the inference stage:

  • Run a forward pass on the current input sequence

  • Scale the output logits by the temperature coefficient

  • Normalize via softmax

  • Sample the next token from the resulting probability distribution

  • Append it to the current sequence and enter the next iteration

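A sketch of this loop, closely following nanoGPT's `generate()` but adapted to the GPT sketch above (the prompt token 50256 is GPT-2's <|endoftext|> id, used here only as an example):

```python
@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, top_k=None):
    for _ in range(max_new_tokens):
        # crop the context if it grows beyond the trained context length
        idx_cond = idx if idx.size(1) <= model.block_size else idx[:, -model.block_size:]
        logits, _ = model(idx_cond)              # forward pass on the current sequence
        logits = logits[:, -1, :] / temperature  # keep the last position, scale by temperature
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float('inf')  # mask everything outside the top k
        probs = F.softmax(logits, dim=-1)        # normalize
        idx_next = torch.multinomial(probs, num_samples=1)  # sample the next token
        idx = torch.cat((idx, idx_next), dim=1)  # append and continue the loop
    return idx

prompt = torch.tensor([[50256]])  # example prompt: a single <|endoftext|> token
out = generate(model, prompt, max_new_tokens=20, temperature=0.8, top_k=50)
```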

