LLaMA, ChatGLM2, and BLOOM: technical analysis and comparison

Model | Training data | Training data volume | Model parameters | Vocabulary size
LLaMA | Mainly English and other Latin-script languages; no Chinese, Japanese, or Korean | 1T / 1.4T tokens | 7B, 13B, 33B, 65B | 32000
ChatGLM-6B | Bilingual Chinese and English, with a 1:1 Chinese-to-English ratio | 1T tokens | 6B | 130528
BLOOM | 46 natural languages and 13 programming languages, including Chinese | 350B tokens | 560M, 1.1B, 1.7B, 3B, 7.1B, 176B | 250880
Model | Model structure | Position encoding | Activation function | Layer norm
LLaMA | Causal decoder | RoPE | SwiGLU | Pre RMS Norm
ChatGLM-6B | Prefix decoder | RoPE | GeGLU | Post Deep Norm
BLOOM | Causal decoder | ALiBi | GeLU | Pre Layer Norm

LLaMA

LLaMA [2] is a large language model proposed by Meta. The training corpus is mainly English and other Latin-script languages, and also includes code data from GitHub; it contains essentially no Chinese, Japanese, or Korean. All of the training data is open source. After tokenization, it amounts to approximately 1400B tokens.

In terms of model parameters, LLaMA comes in four sizes: 7B, 13B, 33B, and 65B. The 7B and 13B versions are trained on 1T tokens, and the 33B and 65B versions on 1.4T tokens. [3] showed that under a fixed training budget, even with fewer model parameters, a model can match or exceed a larger model as long as the amount of pre-training data and the training duration (i.e., the number of training tokens) are increased. For comparison, the 280B Gopher model was trained on only 300B tokens, the 176B BLOOM model on only 350B tokens, and GLM-130B on only 400B tokens, whereas LLaMA was trained on 1T/1.4T tokens, a significantly larger amount of training data. As a result, although LLaMA-13B has less than 1/10 the parameters of GPT-3, it outperforms GPT-3 on most tasks.

In terms of model structure, like GPT, LLaMA uses a causal decoder-only transformer architecture. In terms of model details, it makes the following changes:

  1. Pre-layer-normalization [following GPT3]. To improve training stability, LLaMA normalizes the input of each transformer sub-layer instead of its output, using the RMSNorm normalization function (root mean square only, with no mean subtraction) introduced by Zhang and Sennrich (2019). A minimal sketch of RMSNorm is given after this list.

  2. SwiGLU activation function [following PaLM]. Instead of ReLU, LLaMA uses the SwiGLU activation function. A standard FFN has two weight matrices: the vector is first projected from dimension d up to the intermediate dimension 4d, then back down from 4d to d. An FFN with the SwiGLU activation adds one more weight matrix, for three in total; to keep the parameter count consistent, the intermediate dimension is set to 2/3 of 4d (i.e., 8/3 d) instead of 4d.

  3. Positional encoding: Rotary Embeddings [following GPTNeo]. Absolute positional embeddings are no longer added to the model input; instead, rotary positional embeddings (RoPE) are applied in each layer of the network. RoPE was introduced by Su et al. (2021).

  4. The AdamW optimizer is used with a cosine learning rate schedule.

  5. Memory usage and runtime are reduced with an efficient implementation of causal multi-head attention, available in the xformers library.
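Below is a minimal PyTorch sketch of the RMSNorm used in item 1 (illustrative only, not LLaMA's exact code; the eps value and attribute names are assumptions):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Root-mean-square norm: rescale by the RMS of the activations, with no mean subtraction.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

    def forward(self, x):
        # x: (..., dim); compute 1 / RMS(x) along the last dimension
        rms_inv = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms_inv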

(Figure: LLaMA model parameters.)

Here are some large models derived from LLaMA:

  • Alpaca: Stanford University fine-tuned the 7B LLaMA on a 52K English instruction-following dataset.
  • Vicuna: UC Berkeley fine-tuned the 13B LLaMA on user-shared conversation data collected from ShareGPT.
  • Baize: a model obtained by fine-tuning LLaMA with LoRA on 100K dialogues generated by ChatGPT.
  • StableLM: a model obtained by Stability AI through fine-tuning based on LLaMA.
  • BELLE: Lianjia performed instruction fine-tuning of LLaMA using only data produced by ChatGPT, with optimization for Chinese.

Vocabulary expansion: Chinese LLaMA

The necessity of vocabulary expansion. The original LLaMA vocabulary has 32,000 tokens, and the tokenizer was trained mainly on English corpora, so it handles Chinese and other languages relatively poorly. LLaMA's weak Chinese performance has two causes: on the one hand, the model is trained on a mostly English, Latin-script corpus that contains essentially no Chinese; on the other hand, the vocabulary is small and contains only a handful of Chinese characters. When segmenting Chinese text, a single Chinese character is often split into multiple tokens, so encoding is inefficient and learning is harder for the model. After the vocabulary is expanded with Chinese tokens, a single Chinese character tends to map to a single token, which avoids splitting one character into several tokens and improves Chinese encoding efficiency.

How is the vocabulary expanded? [6] expands the vocabulary by adding Chinese tokens to improve Chinese encoding efficiency. The specific steps are as follows (a sketch of steps 2 and 3 is given after this list).

  1. Use SentencePiece to train a Chinese tokenizer on a Chinese corpus, with 20,000 Chinese tokens. This Chinese tokenizer is then merged with the original LLaMA tokenizer by taking the union of their vocabularies, yielding a merged tokenizer called the Chinese LLaMA tokenizer, with a vocabulary size of 49,953.

  2. To adapt to the new tokenizer, the embedding matrix of the transformer model is extended from V*h to V'*h, and the newly added Chinese tokens are appended to the end of the original embedding matrix so that the embeddings of the original vocabulary are unaffected. (The output layer must be resized accordingly.)

  3. Further pre-train on Chinese corpora with the transformer parameters frozen, training only the embedding matrices, so that the word-vector representations of the newly added Chinese tokens are learned while minimizing interference with the original model.

  4. In the instruction fine-tuning stage, all model parameters are unfrozen and trained.
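A minimal sketch of steps 2 and 3 with the Hugging Face transformers library; the checkpoint paths are placeholders, and the parameter-name checks (embed_tokens, lm_head) follow the transformers LLaMA implementation:

from transformers import LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained("path/to/llama-7b")                    # placeholder path
tokenizer = LlamaTokenizer.from_pretrained("path/to/chinese-llama-tokenizer")   # merged tokenizer, 49953 tokens

# Step 2: extend the embedding matrix (and the tied LM head) from V*h to V'*h.
# New rows are appended at the end, so the embeddings of the original vocabulary are untouched.
model.resize_token_embeddings(len(tokenizer))

# Step 3: freeze the transformer and train only the input embeddings (and the output layer).
for name, param in model.named_parameters():
    param.requires_grad = ("embed_tokens" in name) or ("lm_head" in name)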

Introduction to SwiGLU

Swish activation function

The Swish activation function is defined as Swish_beta(x) = x * sigma(beta * x), where sigma is the sigmoid function and beta is a constant or a trainable parameter.

The Swish function can be regarded as a smooth interpolation between the linear function and the ReLU function.

(Figure: Swish function curve.)

GELU activation function

GELU (Gaussian Error Linear Unit) is an activation function that cannot be written as an elementary function and is a variant of ReLU. It was proposed in the 2016 paper Gaussian Error Linear Units (GELUs) and was subsequently adopted by NLP models such as GPT-2, BERT, RoBERTa, and ALBERT. The paper gives not only the exact form of GELU but also two approximations in terms of elementary functions. The function curve is as follows:

(Figure: GELU with μ=0, σ=1, compared with ReLU and ELU with α=1.)

ReLU and its variants, and Dropout, each shape a unit's output from one of two independent angles; is there a middle ground that combines the two into one? On the regularization side, Dropout randomly sets a unit's output to 0 (multiplies it by 0), and Zoneout randomly skips RNN state updates (multiplies them by 1). Both multiply the output by a random variable m ~ Bernoulli(p), where p is a fixed parameter giving the probability of drawing 1.

However, an activation function must behave exactly the same way during training and testing, so it needs a deterministic output; we cannot directly multiply the input x by the random variable m (unlike Dropout, which does not zero out units at test time). Since the expectation of a probability distribution is a fixed value, we can instead take the expectation of the output: E[m * x] = p * x + (1 - p) * 0 = p * x, i.e., the input multiplied by the expectation of the Bernoulli variable.

The paper wants p to vary with the input x: when x is small, the output should be zeroed with high probability. Since neuron inputs usually follow a roughly normal distribution, especially in networks with Batch Normalization, it suffices to set p equal to the cumulative distribution function of the standard normal distribution, p = Φ(x).

The cumulative distribution function curve of the normal distribution is similar to the sigmoid curve.

GELU: GELU(x) = x * Φ(x), where Φ(x) is the cumulative distribution function of the standard normal distribution.

In mathematics, the error function (also called the Gauss error function) is defined as erf(x) = (2/√π) ∫₀ˣ e^(-t²) dt, and Φ(x) can be written as Φ(x) = (1/2)(1 + erf(x/√2)).

erf(x) is relatively close to tanh(x)

In code, an approximate function can be used in place of erf(x). The two approximations given in the paper are GELU(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³))) and GELU(x) ≈ x * σ(1.702 * x), where σ denotes the sigmoid activation function.

However, many frameworks already have accurate erf calculation functions, which can be used directly. The reference code is as follows:

import numpy as np
import tensorflow as tf

# Old-style (tanh approximation) GELU implementation used by BERT and GPT-2
def gelu_tanh(x):
    return x * 0.5 * (1.0 + tf.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3))))

# GELU implementation using the exact erf function
def gelu(x):
    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.sqrt(2.0)))
    return x * cdf

GELU vs Swish

The function forms and properties of GELU and Swish (x * σ(βx)) are very similar: one uses the fixed coefficient 1.702, the other a variable coefficient β (which can be a trainable parameter or a constant found by search), and in practice the two perform about the same.
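A quick numerical check of how close the two are (a standalone sketch; it assumes NumPy and SciPy are available, and the function names are mine):

import numpy as np
from scipy.special import erf

def gelu_exact(x):
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_sigmoid_approx(x):
    # Swish with the fixed coefficient 1.702
    return x / (1.0 + np.exp(-1.702 * x))

x = np.linspace(-5.0, 5.0, 1001)
print(np.max(np.abs(gelu_exact(x) - gelu_sigmoid_approx(x))))  # roughly 0.02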

GLU(Gated Linear Unit)

GLU(x, W, V, b, c) = σ(xW + b) ⊗ (xV + c), where ⊗ denotes the Hadamard product (element-wise multiplication).

In this formula, the gate is first computed from the intermediate vector g(x) = xW + b, which the sigmoid function σ maps into the range between 0 and 1, representing the probability that each element is kept. The linearly transformed input xV + c is then multiplied element by element (the ⊗ operation) with this gate to obtain the final output vector.

GLU controls the output through a gating mechanism which, like attention, can be seen as selecting important features. Its advantage is that besides the nonlinearity of an ordinary activation function, it also provides a linear path for the gradient during backpropagation, similar to how the summation in a ResNet residual connection passes the gradient through, which helps alleviate the vanishing-gradient problem.

Why? Compare the gradient of the gated tanh unit (GTU) used in LSTM-style gating with that of GLU. For GTU, ∇[tanh(X) ⊗ σ(X)] = tanh'(X)∇X ⊗ σ(X) + σ'(X)∇X ⊗ tanh(X): both terms are scaled by the saturating factors tanh'(X) or σ'(X). For GLU, ∇[X ⊗ σ(X)] = ∇X ⊗ σ(X) + X ⊗ σ'(X)∇X: the first term ∇X ⊗ σ(X) is not down-scaled by the derivative of any saturating function, so it acts as a linear path for the gradient.

GEGLU

GEGLU is a variant of the GLU activation function: the sigmoid in GLU is replaced with GELU. Ignoring the bias terms, its form is:

GEGLU(x, W, V) = GELU(xW) ⊗ (xV)

Like GLU, which contains the two learnable weight matrices W and V, GEGLU also contains two learnable matrices W and V; only the sigmoid gate is replaced by GELU.

SwiGLU

The SwiGLU activation function is used in the PaLM paper.
In an FFN, i.e., FC -> activation function -> FC, the general definition with a ReLU activation is:

FFN(x, W1, W2, b1, b2) = max(0, xW1 + b1) W2 + b2

The T5 paper uses no bias terms, that is:

FFN_ReLU(x, W1, W2) = max(0, xW1) W2

In the same way, other activations give:

FFN_GELU(x, W1, W2) = GELU(xW1) W2
FFN_Swish(x, W1, W2) = Swish_1(xW1) W2

Combining the Swish activation, the dropped bias terms, and GLU, we get:

FFN_SwiGLU(x, W, V, W2) = (Swish_1(xW) ⊗ xV) W2

This is the FFN activation used in PaLM, and it works well in practice; a minimal sketch is given below.
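A minimal PyTorch sketch of the SwiGLU FFN above (the 2/3 * 4d intermediate dimension follows the rule mentioned in the LLaMA section; the module and attribute names are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        hidden = int(2 * 4 * d_model / 3)                  # 2/3 of 4d keeps the parameter count comparable
        self.w = nn.Linear(d_model, hidden, bias=False)    # gate branch
        self.v = nn.Linear(d_model, hidden, bias=False)    # linear branch
        self.w2 = nn.Linear(hidden, d_model, bias=False)   # projection back to d_model

    def forward(self, x):
        # FFN_SwiGLU(x, W, V, W2) = (Swish_1(xW) ⊗ xV) W2; Swish_1 is SiLU in PyTorch
        return self.w2(F.silu(self.w(x)) * self.v(x))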

PaLM

  1. SwiGLU activation function: the MLP intermediate activation uses SwiGLU because, as the paper "GLU Variants Improve Transformer" showed, SwiGLU significantly improves model quality compared with standard ReLU, GELU, or Swish activations.

  2. Parallel Layers: each Transformer block uses a "parallel" formulation instead of the standard "serialized" formulation, as in GPT-J-6B. The standard block computes y = x + MLP(LayerNorm(x + Attention(LayerNorm(x)))), while the parallel block computes y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x)). The parallel formulation speeds up large-scale training by about 15%. Ablation experiments show a small quality drop at 8B parameters but no drop at 62B parameters.

  3. Multi-Query Attention: the key/value projections are shared across heads, i.e., "key" and "value" are each projected to [1, h] while "query" is still projected to shape [k, h]. This has no impact on model quality or training speed, but it yields significant cost savings during autoregressive decoding. (A rough sketch is given after this list.)

  4. RoPE embeddings: RoPE is used instead of absolute or relative position embeddings, because RoPE performs better on long text.

  5. Shared Input-Output Embeddings: the input and output embedding matrices are shared; this is analogous to sharing the input matrix W and the output matrix W' in word2vec.
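A rough PyTorch sketch of the multi-query attention shapes (single batch element, causal mask omitted; the class and projection names are illustrative, not PaLM's code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)       # per-head queries: [k, h] in the text above
        self.k_proj = nn.Linear(d_model, self.d_head)   # one shared key head: [1, h]
        self.v_proj = nn.Linear(d_model, self.d_head)   # one shared value head: [1, h]
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                               # x: (seq_len, d_model); causal mask omitted
        seq_len, _ = x.shape
        q = self.q_proj(x).view(seq_len, self.n_heads, self.d_head)   # (T, H, d_head)
        k = self.k_proj(x)                                             # (T, d_head), shared by all heads
        v = self.v_proj(x)                                             # (T, d_head), shared by all heads
        scores = torch.einsum("thd,sd->hts", q, k) / self.d_head ** 0.5
        attn = F.softmax(scores, dim=-1)                               # (H, T, T)
        out = torch.einsum("hts,sd->thd", attn, v).reshape(seq_len, -1)
        return self.o_proj(out)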

ChatGLM-6B

ChatGLM-6B is a conversational language model proposed by Tsinghua University that supports bilingual question answering in Chinese and English. ChatGLM-6B adopts the same model structure as GLM-130B [4]. As of July 2022, GLM-130B had been trained on only 400B tokens with a 1:1 Chinese-to-English ratio. ChatGLM-6B uses more training data, up to 1T tokens; the training corpus contains only Chinese and English, at a 1:1 ratio.

In terms of model structure, ChatGLM-6B adopts a prefix decoder-only transformer framework, with bidirectional attention over the input (prefix) and unidirectional attention over the output. In terms of model details, it makes the following changes:

  1. Embedding layer gradient shrink: to improve training stability, the gradient of the embedding layer is scaled down. Specifically:

word_embedding = word_embedding * alpha + word_embedding.detach() * (1 - alpha)

where alpha is 0.1. detach() returns a new tensor that is detached from the computation graph (no gradient flows through it), so this scaling is equivalent to shrinking the gradient of the embedding layer by a factor of 10.

  2. The order of layer normalization and the residual connection is rearranged: post layer norm with DeepNorm is used. The residual update is

x_{l+1} = LayerNorm(alpha * x_l + f(x_l))

where f(x) denotes the attention or FFN sub-layer, i.e., the residual is added first and then normalized. At initialization, Xavier initialization with gain = beta is used for the FFN, v_proj, and out_proj weights, and gain = 1 for q_proj and k_proj. (A small sketch of this block is given after this list.)

  3. A single linear layer is used for output token prediction.

  4. GeGLU is used as the activation function: compared with an ordinary FFN, the GLU with its linear gating unit adds one more weight matrix, for three in total; to keep the parameter count consistent, the intermediate dimension uses 8/3 d instead of 4d.

  5. Position encoding: absolute position encoding is removed and rotary position encoding (RoPE) is used.

  6. Training objective: ChatGLM-6B's training task is autoregressive blank infilling. Compared with causal decoder-only large language models, the prefix decoder-only ChatGLM-6B has a drawback: lower training efficiency. A causal decoder computes the loss on all tokens, whereas a prefix decoder computes the loss only on the output tokens and not on the input. With the same number of training tokens, a prefix decoder therefore sees fewer effective training tokens and tends to be less effective than a causal decoder. In addition, the success of ChatGPT has shown that causal decoder-style large language models can achieve very strong few-shot and zero-shot generation ability, which can be further elicited through instruction fine-tuning. Whether prefix decoder-style large language models can reach comparable few-shot and zero-shot ability still lacks sufficient verification.

  7. Whole-word masking during training: a complete word is masked as a unit, which avoids the situation where a word split into multiple tokens has only some of them masked and the model can trivially infer them from the word's remaining tokens.

  8. Tokenizer: ChatGLM trains a SentencePiece tokenizer on 25GB of Chinese-English bilingual data, with a vocabulary size of 130,528.
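A minimal PyTorch sketch of the DeepNorm-style post-norm residual block from item 2 (alpha is the DeepNorm residual scale and sublayer is the attention or FFN block; names and structure are illustrative, not ChatGLM's code):

import torch.nn as nn

class DeepNormBlock(nn.Module):
    # Post layer norm with DeepNorm: x_{l+1} = LayerNorm(alpha * x + f(x))
    def __init__(self, d_model, sublayer, alpha):
        super().__init__()
        self.sublayer = sublayer        # attention block or FFN block
        self.alpha = alpha              # > 1, chosen from the model depth
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(self.alpha * x + self.sublayer(x))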

The following are some large model applications derived from ChatGLM:

  • langchain-ChatGLM : A ChatGLM application based on langchain that implements question and answer based on an extensible knowledge base.
  • Wenda : A large language model calling platform that implements ChatPDF-like functions based on ChatGLM-6B.

BLOOM

The BLOOM [5] series of models are large language models trained by the BigScience team. The training data covers 46 natural languages (including English, Chinese, French, Spanish, and Portuguese) and 13 programming languages; 1.5TB of deduplicated and cleaned text was converted into 350B tokens. The language distribution of the training data is shown in the figure below; Chinese accounts for 16.2% of the corpus.

In terms of model parameters, BLOOM comes in several sizes: 560M, 1.1B, 1.7B, 3B, 7.1B, and 176B. The BLOOMZ series is fine-tuned on the xP3 dataset and is recommended for English-prompt scenarios; the BLOOMZ-MT series is fine-tuned on the xP3mt dataset and is recommended for non-English-prompt scenarios.

In terms of model structure, like GPT, BLOOM adopts a causal decoder-only transformer architecture. In terms of model details, it makes the following changes:

  • ALiBi positional embedding is used: rather than adding positional embeddings to the word embeddings, ALiBi biases the query-key attention scores with a penalty proportional to their distance, so the attention score decays directly with the distance between key and query. Compared with the original transformer positional embeddings and rotary embeddings, it leads to smoother training and better downstream performance.

  • An embedding LayerNorm is applied immediately after the first embedding layer to avoid training instability.

  • Layer normalization: to improve training stability, pre layer norm is used instead of the traditional post layer norm.

  • Activation function : The GeLU activation function is used.

  • Tokenizer: BLOOM trains a byte-level Byte Pair Encoding (BPE) tokenizer on the multilingual corpus, with a vocabulary size of 250,880. Because the BPE operates at the byte level, tokenization never produces unknown tokens.

  • Fully connected layer: a standard two-layer FFN of the form GELU(xW1 + b1) W2 + b2.

  • Training objective: BLOOM's training objective is language modeling, i.e., predicting the next token given the preceding context.

The following are some large model applications derived from BLOOM:

  • XuanYuan: a large model for the finance domain. Based on BLOOM-176B, Du Xiaoman performed targeted pre-training and fine-tuning for the general Chinese domain and the financial domain.
  • BELLE: Lianjia performed instruction fine-tuning of BLOOMZ-7B1-mt using only data produced by ChatGPT.

tokenizer comparison

The tokenizers of the base models above have different vocabulary sizes, and they produce different segmentations of the same Chinese text. We ran each tokenizer on the 69,000 Chinese-English parallel sentence pairs of news_commentary and compared the segmentation results and the time taken. The results are below; "average tokens per Chinese character" is the average number of tokens produced per Chinese character after tokenization (and analogously for English characters).

Model | Vocabulary size | Avg. tokens per Chinese character | Avg. tokens per English character | Chinese processing time (s) | English processing time (s)
LLaMA | 32000 | 1.45 | 0.25 | 12.60 | 19.40
Chinese LLaMA | 49953 | 0.62 | 0.249 | 8.65 | 19.12
ChatGLM-6B | 130528 | 0.55 | 0.19 | 15.91 | 20.84
Bloom | 250880 | 0.53 | 0.22 | 9.87 | 15.60

From the results,

  1. LLaMA has the smallest vocabulary and the highest average number of tokens for both Chinese and English, i.e., its segmentation is the most fragmented and fine-grained. For Chinese in particular, the average is as high as 1.45 tokens per character, meaning LLaMA very often splits a single Chinese character into two or more tokens.
  2. After vocabulary expansion, Chinese LLaMA's average number of tokens per Chinese character drops significantly: one token now typically covers one or two Chinese characters, which improves Chinese encoding efficiency.
  3. ChatGLM-6B is the tokenizer with the best balance between Chinese and English segmentation. Because its vocabulary is relatively large, its Chinese processing time also increases.
  4. Although BLOOM has the largest vocabulary, because it is multilingual its Chinese and English segmentation efficiency is roughly the same as ChatGLM-6B's. Note that BLOOM's tokenizer is implemented with the BloomTokenizerFast class in transformers, so its segmentation runs faster.

Let's compare the segmentation results of the different tokenizers intuitively on two examples. The first is the line "男儿何不带吴钩,收取关山五十州。" ("Why doesn't a man take up the Wu hook and recover the fifty prefectures of pass and mountain?"), 16 characters in total. The segmentation results of the tokenizers are as follows:

  • LLaMA segments it into 24 tokens (characters missing from the vocabulary fall back to their UTF-8 bytes, i.e., byte-level BPE):
 [ '男', '<0xE5>', '<0x84>', '<0xBF>', '何', '不', '<0xE5>', '<0xB8>', '<0xA6>', '<0xE5>', '<0x90>', '<0xB4>', '<0xE9>', '<0x92>', '<0xA9>', ',', '收', '取', '关', '山', '五', '十', '州', '。'] 
  • Chinese LLaMA segments it into 14 tokens:
[ '男', '儿', '何', '不', '带', '吴', '钩', ',', '收取', '关', '山', '五十', '州', '。']
  • ChatGLM-6B segments it into 11 tokens:
[ '男儿', '何不', '带', '吴', '钩', ',', '收取', '关山', '五十', '州', '。'] 
  • Bloom segments it into 13 tokens:
 ['男', '儿', '何不', '带', '吴', '钩', ',', '收取', '关', '山', '五十', '州', '。'] 

"A mixture of pepper and fungus and cinnamon is not enough to preserve the tadpole." The length is 15 words. The word segmentation results of several tokenizers are as follows:

  • LLaMA segments it into 37 tokens:
[ '<0xE6>', '<0x9D>', '<0x82>', '<0xE7>', '<0x94>', '<0xB3>', '<0xE6>', '<0xA4>', '<0x92>', '与', '<0xE8>', '<0x8F>', '<0x8C>', '<0xE6>', '<0xA1>', '<0x82>', '<0xE5>', '<0x85>', '<0xAE>', ',', '<0xE5>', '<0xB2>', '<0x82>', '<0xE7>', '<0xBB>', '<0xB4>', '<0xE7>', '<0xBA>', '<0xAB>', '夫', '<0xE8>', '<0x95>', '<0x99>', '<0xE8>', '<0x8C>', '<0x9D>', '。']
  • Chinese LLaMA segments it into 17 tokens:
[ '杂', '申', '椒', '与', '菌', '桂', '兮', ',', '岂', '维', '纫', '夫', '蕙', '<0xE8>', '<0x8C>', '<0x9D>', '。'] 
  • ChatGLM-6B segments it into 17 tokens:
 [ '杂', '申', '椒', '与', '菌', '桂', '兮', ',', '岂', '维', '纫', '夫', '蕙', '<0xE8>', '<0x8C>', '<0x9D>', '。'] 
  • Bloom segments it into 17 tokens:
 ['杂', '申', '椒', '与', '菌', '桂', '兮', ',', '岂', '维', '�', '�', '夫', '蕙', '�', '�', '。'] 

As these examples show, the LLaMA vocabulary contains very few Chinese characters: even the common character "儿" is split into 3 tokens (its UTF-8 bytes). The vocabularies of Chinese LLaMA, ChatGLM-6B, and Bloom cover most common Chinese characters and also contain some common Chinese words, for example the word "收取" is segmented as a single token; rare characters such as "茝" are still split into 2-3 tokens. In general, LLaMA usually splits a Chinese character into two or more tokens and has low Chinese encoding efficiency, whereas Chinese LLaMA, ChatGLM-6B, and Bloom segment Chinese much more efficiently.
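The comparison above can be reproduced roughly as follows with the Hugging Face transformers library (a sketch; the checkpoint names are examples, some repositories require trust_remote_code, and exact loading details may differ):

from transformers import AutoTokenizer

text = "男儿何不带吴钩,收取关山五十州。"

tokenizers = {
    "Bloom": AutoTokenizer.from_pretrained("bigscience/bloom-560m"),
    "ChatGLM-6B": AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True),
}

for name, tok in tokenizers.items():
    tokens = tok.tokenize(text)
    # vocabulary size, number of tokens for the sentence, and the token pieces themselves
    print(name, len(tok), len(tokens), tokens)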

Layer Normalization

As shown in the figure below, depending on where layer normalization is placed, transformers can be divided into post layer norm and pre layer norm.

Post layer norm. In the original transformer, layer normalization is placed after the residual connection, which is called post LN. Deep transformer models using post LN are prone to training instability: as shown in the figure below, as the number of transformer layers increases, the gradient norm under post LN gradually grows, leading to unstable training.

Pre layer norm. Layer normalization is moved inside the residual branch, before the self-attention or FFN block; this is called pre LN. As shown in the figure below, with pre LN the gradient norm is approximately equal across transformer layers, which helps training stability. Compared with post LN, deep transformers with pre LN train more stably and suffer less from training instability, but the drawback is that pre LN may slightly hurt model performance. A key challenge for large language models is improving training stability; to that end, large language models such as GPT-3, PaLM, BLOOM, and OPT all use pre layer norm.

Layer normalization has two key properties: translation invariance and scaling invariance. [8] argues that the success of layer normalization comes from the scaling invariance rather than the translation invariance, and therefore removes the translation (mean-centering) part of the computation, keeps only the scaling, and simplifies it, proposing RMS Norm (Root Mean Square Layer Normalization), i.e., a root-mean-square norm.

The calculation process of layer normalization is:

LayerNorm(x) = (x - μ) / sqrt(σ² + ε) * γ + β, where μ and σ² are the mean and variance of x over the feature dimension.

The calculation process of RMS Norm is:

RMSNorm(x) = x / RMS(x) * γ, where RMS(x) = sqrt(mean(x²) + ε).

Compared with ordinary layer normalization, RMS norm removes the mean-centering (translation) part of the computation; it is faster to compute and performs essentially the same, or even slightly better. Large language models such as Gopher, LLaMA, and T5 all use RMS norm.

[9] proposed Deep Norm to alleviate the problem of exploding model updates, limiting the magnitude of the model update to a constant and making training more stable. Specifically, Deep Norm up-scales the residual connection by a factor α > 1 before applying layer norm, and down-scales part of the model parameters by a factor β < 1 at initialization. ChatGLM-6B uses DeepNorm-based post LN.

activation function

Each transformer layer consists of two parts: a self-attention block and an FFN block. The FFN usually first projects the vector from dimension d up to the intermediate dimension 4d and then back down from 4d to d. The FFN computation is:

FFN(x) = f(xW1 + b1) W2 + b2

Among them, f() is a nonlinear activation function. The widely used activation functions include gelu (Gaussian Error Linear Unit) function and swish function. The swish function is a self-gating activation function.

GELU also adjusts its output through a gating-like mechanism and is similar to the Swish function; it can be approximated with the tanh function or with the sigmoid function (see the approximations given earlier).

[10] proposed GLU (Gated Linear Units). Compared with a normal FFN, which has only two weight matrices, an FFN using GLU adds an extra weight matrix (V in the formulas below), for three weight matrices in total, and achieves better model performance.

The GLU-based FFN using the GELU activation is:

FFN_GEGLU(x, W, V, W2) = (GELU(xW) ⊗ xV) W2

The GLU-based FFN using the Swish activation is:

FFN_SwiGLU(x, W, V, W2) = (Swish_1(xW) ⊗ xV) W2

position encoding

For transformer models, positional encoding is essential, because the attention module by itself cannot capture the order of the input sequence and cannot distinguish tokens at different positions. Position encodings are divided into absolute and relative position encodings.

The most direct approach is a learned (trainable) position encoding: the position encoding is treated as trainable parameters and a position embedding matrix is learned. GPT-3 takes this approach. The drawback of learned position encodings is that they cannot extrapolate: if the maximum sequence length during training is 2048, then at inference time the model can only handle sequences up to length 2048 and cannot process longer ones.

Su Jianlin [11] proposed rotary position encoding (RoPE). Learned position encodings act on the token embeddings, whereas RoPE acts in the self-attention block of every transformer layer: after Q/K are computed, the rotary position encoding is applied to Q/K, and then the attention scores are computed. RoPE implements relative position encoding by way of absolute position encoding and has good extrapolation properties. Notably, RoPE contains no trainable parameters. Large language models such as LLaMA, GLM-130B, and PaLM use rotary position encoding RoPE.

ALiBi (Attention with Linear Biases) [12] also acts in the self-attention block of every transformer layer: as shown in the figure below, after the attention scores are computed, a preset bias matrix is added directly to the attention score matrix. The bias matrix is fixed and not trainable. The bias penalizes the attention score according to the relative distance between q and k: the larger the relative distance, the larger the penalty, so tokens that are farther apart contribute less to each other. ALiBi extrapolates well; BLOOM uses this positional encoding. A small sketch of the ALiBi bias matrix is given below.
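A minimal NumPy sketch of the ALiBi bias matrix for the causal case (the head-slope schedule 2^(-8h/H) is the commonly used geometric sequence and is an assumption of this sketch):

import numpy as np

def alibi_bias(n_heads, seq_len):
    # Head-specific slopes: a geometric sequence 2^(-8/H), 2^(-16/H), ...
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]        # i - j; >= 0 for the causal (j <= i) part
    bias = -slopes[:, None, None] * distance      # larger distance -> larger penalty
    return bias                                   # (n_heads, seq_len, seq_len); added to the attention scores before softmax

bias = alibi_bias(n_heads=8, seq_len=4)           # the causal mask is applied separately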

Parameter-efficient fine-tuning (PEFT)

As the number of parameters in large language models grows larger and larger, full fine-tuning becomes very expensive. The cost shows up as high hardware requirements and GPU memory usage, slow and time-consuming training, and high storage cost. Parameter-efficient fine-tuning (PEFT) techniques train only a small number of parameters when fine-tuning a large model, instead of updating all of the model's parameters. Parameter-efficient fine-tuning has the following advantages:

  • Lower GPU memory usage and lower hardware requirements
  • Faster training, less time spent
  • Lower storage cost; different tasks can share most of the weight parameters
  • Potentially better model performance, since overfitting is alleviated

prompt tuning

In its original sense, prompt tuning [13] means improving model performance by modifying the input prompt. Here the prompt is a "hard prompt": we modify the input text directly, and it is not differentiable.

In contrast to hard prompts, soft prompt tuning concatenates a trainable tensor with the embeddings of the input text; this trainable tensor can be optimized with backpropagation to improve performance on the target task. The trainable tensor can be understood as the embedding of a prompt text, i.e., a soft prompt. As shown in the figure below, this trainable tensor has shape [num_virtual_tokens, embed_size].

Prompt tuning freezes the original parameters of the large model and trains only the newly added prompt tensor. Prompt tuning becomes more effective as the base model's parameter count grows. A short sketch using the peft library follows.
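A short sketch of soft prompt tuning with the Hugging Face peft library (the checkpoint name and num_virtual_tokens are placeholders; the API is sketched from memory and may differ across peft versions):

from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")  # placeholder checkpoint

config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,   # length of the soft prompt; its embeddings are the only trained weights
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()   # only the prompt tensor is trainable; the base model stays frozen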

prefix tuning

Prefix tuning [14] is similar to prompt tuning: a task-specific trainable tensor is added to the input while the parameters of the pre-trained model stay frozen. The main differences are as follows:

  1. Prefix tuning adds the prefix parameters (trainable tensors) to every transformer layer, while prompt tuning only adds a trainable matrix to the input embeddings. Concretely, prefix tuning injects the prefix tensor into all transformer layers as past_key_values.

  2. A separate FFN is used to encode and optimize the prefix parameters, rather than optimizing the soft prompt directly, because direct optimization can be unstable and hurt performance. Once the soft prompt has been trained, the FFN is no longer used.

The difference in where prefix tuning and prompt tuning act is somewhat analogous to the difference between trainable position encodings and rotary position encoding RoPE: the former acts directly on the input embeddings, while the latter acts in the self-attention block of every transformer layer. After K and V are computed, the trainable prefix tensor is concatenated with them.

The trainable tensor of prefix tuning has shape [num_virtual_tokens, 2 * num_layers * hidden_size]. The figure below shows an example for LLaMA-7B: LLaMA-7B has 32 transformer layers and a hidden size of 4096, so the tensor has shape [30, 262144], where 262144 = 2 * 32 * 4096, 30 is the number of virtual tokens, and the factor 2 corresponds to K and V. A peft sketch follows.
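And a matching prefix tuning sketch with the same library (again a sketch under the same assumptions; prefix_projection enables the extra FFN reparameterization mentioned above):

from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")   # placeholder path

config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=30,    # the 30 virtual tokens from the example above
    prefix_projection=True,   # encode the prefix with an extra FFN during training
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()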

Adapter

The adapter method [16] is similar to prefix tuning in that both add extra trainable parameters to every transformer layer. The difference is where they are added: prefix tuning prepends a trainable prefix to the keys and values of the attention blocks, while the adapter method inserts adapter layers at two positions within each transformer layer, as shown in the figure below.

LLaMA-Adapter

LLaMA-Adapter [16] combines ideas from prefix tuning and adapters. Like prefix tuning, it adds a trainable prompt tensor to the embedded inputs. Note that the prefix is learned and maintained as an embedding matrix rather than given externally. Each transformer layer has its own learnable prefix, allowing more tailored adaptation across layers.

As shown in the figure above, LLaMA-Adapter introduces a zero-initialized attention mechanism with gating. The motivation is that the randomly initialized tensors used by adapters and prefix tuning (prefix prompts and adapter layers) can disturb the semantic knowledge of the pre-trained language model, causing unstable fine-tuning and large performance loss early in training.

Another important difference is that LLaMA-Adapter adds the learnable adaptation prompts only to the L deepest transformer layers, rather than to all layers. The authors argue that this adapts the language representations that carry higher-level semantic information more effectively.


Origin blog.csdn.net/chaishen10000/article/details/132763738