Easy-to-understand LLM (Part 2)

Preface

  This article continues from Part 1 of the series.

1. Large model activation function

  For details of the overall model structure, refer to the earlier article on large model architecture.

1、ReLU

  The original Transformer uses ReLU, and T5 and OPT also use ReLU as their activation function. The FFN can then be expressed as:
$$FFN(x,W_{1},W_{2},b_{1},b_{2})=ReLU(xW_{1}+b_{1})W_{2}+b_{2}$$
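  To make this concrete, here is a minimal PyTorch sketch of a ReLU feed-forward block; the 4x hidden expansion is the common Transformer convention, an assumption rather than something stated above:

```python
import torch
import torch.nn as nn

class ReLUFFN(nn.Module):
    """Position-wise FFN: ReLU(x W1 + b1) W2 + b2."""
    def __init__(self, dim, hidden_dim=None):
        super().__init__()
        hidden_dim = hidden_dim or 4 * dim     # 4x expansion is the usual Transformer choice
        self.w1 = nn.Linear(dim, hidden_dim)   # W1, b1
        self.w2 = nn.Linear(hidden_dim, dim)   # W2, b2

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

x = torch.randn(2, 8, 512)        # (batch, seq_len, dim)
print(ReLUFFN(512)(x).shape)      # torch.Size([2, 8, 512])
```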

2、GeLU

  The activation function used by GPT-1, GPT-2, GPT-3, and BLOOM is GeLU (Gaussian Error Linear Unit), defined as:
$$GeLU(x)=x\,\Phi(x)$$
  where $\Phi$ is the CDF of the standard normal distribution:
$$\Phi(x)=\int_{-\infty}^{x}\frac{e^{-t^{2}/2}}{\sqrt{2\pi}}\,dt$$
  The original paper gives two approximations of GeLU:
$$x\,\Phi(x)\approx x\,\sigma(1.702x)$$
  where $\sigma$ is the sigmoid function, or:
$$x\,\Phi(x)\approx \frac{1}{2}x\left[1+\tanh\left(\sqrt{\frac{2}{\pi}}\,(x+0.044715x^{3})\right)\right]$$
  The FFN can then be expressed as:
$$FFN(x,W_{1},W_{2},b_{1},b_{2})=GeLU(xW_{1}+b_{1})W_{2}+b_{2}$$
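  As a small illustration (not from the original post), the sketch below compares the exact GeLU with the two approximations above in PyTorch:

```python
import torch

def gelu_exact(x):
    # x * Phi(x), with Phi computed via the error function
    return x * 0.5 * (1.0 + torch.erf(x / 2 ** 0.5))

def gelu_sigmoid_approx(x):
    return x * torch.sigmoid(1.702 * x)

def gelu_tanh_approx(x):
    c = (2.0 / torch.pi) ** 0.5
    return 0.5 * x * (1.0 + torch.tanh(c * (x + 0.044715 * x ** 3)))

x = torch.linspace(-4, 4, 9)
print(torch.max(torch.abs(gelu_exact(x) - gelu_tanh_approx(x))))     # should be small
print(torch.max(torch.abs(gelu_exact(x) - gelu_sigmoid_approx(x))))  # noticeably larger
```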

3、GLU

  The paper GLU Variants Improve Transformer proposes improving the FFN with gated linear units (GLU). The basic form of GLU applies two linear transformations to the input (introducing an extra matrix V) and passes one of them through a sigmoid:
$$GLU(x,W,V,b,c)=\sigma(xW+b)\otimes(xV+c)$$
  The FFN can then be expressed as:
$$FFN(x,W_{1},W_{2},V,b_{1},b_{2},c)=(\sigma(xW_{1}+b_{1})\otimes(xV+c))W_{2}+b_{2}$$
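  A minimal sketch of the GLU gate itself; the module and variable names are mine, not from the paper:

```python
import torch
import torch.nn as nn

class GLU(nn.Module):
    """sigma(x W + b) * (x V + c): a sigmoid gate multiplied elementwise with a linear branch."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w = nn.Linear(dim, hidden_dim)  # W, b  (gate branch)
        self.v = nn.Linear(dim, hidden_dim)  # V, c  (value branch)

    def forward(self, x):
        return torch.sigmoid(self.w(x)) * self.v(x)

x = torch.randn(2, 8, 512)
print(GLU(512, 1024)(x).shape)  # torch.Size([2, 8, 1024])
```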

4、GeGLU

  Replacing the sigmoid in GLU with the GeLU activation gives a variant of GLU: GeGLU. GLM-130B uses GeGLU. Its expression is:
$$GeGLU(x,W,V,b,c)=GeLU(xW+b)\otimes(xV+c)$$
  The FFN can then be expressed as:
$$FFN(x,W_{1},W_{2},V,b_{1},b_{2},c)=(GeLU(xW_{1}+b_{1})\otimes(xV+c))W_{2}+b_{2}$$

5、SwiGLU

  Replacing the sigmoid in GLU with the Swish activation gives another variant of GLU: SwiGLU. LLaMA uses SwiGLU. Its expression is:
$$SwiGLU(x,W,V,b,c)=Swish_{\beta}(xW+b)\otimes(xV+c)$$
  The FFN can then be expressed as:
$$FFN(x,W_{1},W_{2},V,b_{1},b_{2},c)=(Swish_{\beta}(xW_{1}+b_{1})\otimes(xV+c))W_{2}+b_{2}$$
  where $Swish_{\beta}(x)=x\cdot\sigma(\beta x)$ and $\beta$ is a fixed constant, usually 1. Compared with the original ReLU FFN, the extra linear branch ($V$) increases the parameter count and the amount of computation. How does LLaMA handle this? It shrinks the matrix dimensions of $W_{1}$, $W_{2}$, and $V$ in SwiGLU from $(dim, dim)$ to $(dim, \frac{2}{3}dim)$, which keeps the parameter count and computation roughly in balance.
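  Below is a sketch of a SwiGLU-style gated FFN in PyTorch; swapping `F.silu` for `F.gelu` would give the GeGLU variant above. The 2/3 scaling of the hidden dimension follows the description above, while other details (no biases, rounding) are assumptions rather than LLaMA's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """FFN(x) = (Swish(x W1) * (x V)) W2, with a reduced hidden size to offset the third matrix."""
    def __init__(self, dim, hidden_dim=None):
        super().__init__()
        hidden_dim = hidden_dim or int(2 * dim / 3)       # following the 2/3 * dim described above
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate branch (W1)
        self.v = nn.Linear(dim, hidden_dim, bias=False)   # value branch (V)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection (W2)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.v(x))    # F.silu is Swish with beta = 1

x = torch.randn(2, 8, 512)
print(SwiGLUFFN(512)(x).shape)  # torch.Size([2, 8, 512])
```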

2. Positional encoding

1. Rotary position encoding (RoPE)

  Rotary Position Embedding (RoPE) comes from Su Jianlin's Rotary Transformer. Traditional positional encodings are usually added to the token embeddings at the input, so the subsequent attention computation looks like:
$$\begin{aligned} QK^{T}&=xW_{Q}W_{K}^{T}x^{T}+xW_{Q}W_{K}^{T}p^{T}+pW_{Q}W_{K}^{T}x^{T}+pW_{Q}W_{K}^{T}p^{T}\\ O&=A\cdot V=A\cdot(xW_{V}+pW_{V}) \end{aligned}$$
where $A=softmax\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)$ is the attention matrix, $x$ is the input token embedding, and $p$ is the position embedding.
  In fact, the position vector does not have to be added to the input token embedding. The formula above shows that as long as position information is injected into $Q$ and $K$, it reaches the attention matrix $A$ and then acts on $V$. Rotary position encoding adopts exactly this idea.

  • Rotary position encoding formula : for the $m$-th vector $q$ of $Q$ and the $n$-th vector $k$ of $K$, positional information is injected as follows:
    • Two-dimensional case : define the functions
      $$f(q,m)=\begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix}\begin{pmatrix} q_{1} \\ q_{2} \end{pmatrix}$$
      $$f(k,n)=\begin{pmatrix} \cos(n\theta) & -\sin(n\theta) \\ \sin(n\theta) & \cos(n\theta) \end{pmatrix}\begin{pmatrix} k_{1} \\ k_{2} \end{pmatrix}$$
      where $q=\begin{pmatrix} q_{1} \\ q_{2} \end{pmatrix}$ is the $m$-th vector of $Q$, $k=\begin{pmatrix} k_{1} \\ k_{2} \end{pmatrix}$ is the $n$-th vector of $K$, and $f(q,m)$, $f(k,n)$ are $q$ and $k$ after rotary position encoding. Rotary position encoding simply multiplies $q$ and $k$ by a rotation matrix (which leaves the vector's norm unchanged and only changes its direction), hence the name. The contribution of this pair to $QK^{T}$ is then (note that it depends only on the relative position of $m$ and $n$):
      $$\begin{pmatrix} q_{1} & q_{2} \end{pmatrix}\begin{pmatrix} \cos((n-m)\theta) & -\sin((n-m)\theta) \\ \sin((n-m)\theta) & \cos((n-m)\theta) \end{pmatrix}\begin{pmatrix} k_{1} \\ k_{2} \end{pmatrix}$$
    • Multidimensional case : the computation above assumes the embedding dimension is 2. For the general case $d \ge 2$, the elements of $q$ are grouped into pairs and the same rotation is applied to each pair. Concretely:
      $$f(q,m)= \begin{pmatrix} \cos(m\theta_{0}) & -\sin(m\theta_{0}) & 0 & 0 & \cdots & 0 & 0 \\ \sin(m\theta_{0}) & \cos(m\theta_{0}) & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos(m\theta_{1}) & -\sin(m\theta_{1}) & \cdots & 0 & 0 \\ 0 & 0 & \sin(m\theta_{1}) & \cos(m\theta_{1}) & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos(m\theta_{d/2-1}) & -\sin(m\theta_{d/2-1}) \\ 0 & 0 & 0 & 0 & \cdots & \sin(m\theta_{d/2-1}) & \cos(m\theta_{d/2-1}) \end{pmatrix} \begin{pmatrix} q_{0} \\ q_{1} \\ q_{2} \\ q_{3} \\ \vdots \\ q_{d-2} \\ q_{d-1} \end{pmatrix}$$
      where $\theta_{j}=10000^{-2j/d}$, $j\in \{0,1,\dots,d/2-1\}$. The computation for $k$ is analogous and is omitted here.
    • Efficient computation : because the rotation matrix above is very sparse, implementing it as a full matrix multiplication wastes compute. The recommended way to implement RoPE is elementwise (a minimal implementation along these lines is sketched after this list):
      $$f(q,m)= \begin{pmatrix} q_{0} \\ q_{1} \\ q_{2} \\ q_{3} \\ \vdots \\ q_{d-2} \\ q_{d-1} \end{pmatrix}\otimes \begin{pmatrix} \cos(m\theta_{0}) \\ \cos(m\theta_{0}) \\ \cos(m\theta_{1}) \\ \cos(m\theta_{1}) \\ \vdots \\ \cos(m\theta_{d/2-1}) \\ \cos(m\theta_{d/2-1}) \end{pmatrix}+ \begin{pmatrix} -q_{1} \\ q_{0} \\ -q_{3} \\ q_{2} \\ \vdots \\ -q_{d-1} \\ q_{d-2} \end{pmatrix}\otimes \begin{pmatrix} \sin(m\theta_{0}) \\ \sin(m\theta_{0}) \\ \sin(m\theta_{1}) \\ \sin(m\theta_{1}) \\ \vdots \\ \sin(m\theta_{d/2-1}) \\ \sin(m\theta_{d/2-1}) \end{pmatrix}$$
    • Intuitive display : There is a very intuitive picture in the paper showing the process of rotation transformation.
      (Figure: illustration of the rotary transformation, from the RoPE paper.)
    • Long-range decay : RoPE is formally similar to the sinusoidal position encoding of the original Transformer, except that the sinusoidal encoding is additive while RoPE can be regarded as multiplicative. For the choice of $\theta_{j}$, RoPE keeps $\theta_{j}=10000^{-2j/d}$, $j\in\{0,1,\dots,d/2-1\}$, which brings a certain amount of long-range decay, as shown in the figure below:
      (Figure: the upper bound of the inner product decays as the relative distance between positions grows.)
      The figure shows that as the relative distance increases, the inner product tends to decay, so choosing $\theta_{j}=10000^{-2j/d}$ does bring a degree of long-range decay. The paper also tried initializing with $\theta_{j}=10000^{-2j/d}$ and treating $\theta_{j}$ as a trainable parameter; after training for a while, $\theta_{j}$ showed no significant update, so it is simply fixed at $\theta_{j}=10000^{-2j/d}$.
    • RoPE practice : Take a look at the source code of LLaMA and ChatGLM later.
      • ChatGLM2: rotary position encoding is applied only to the first half of each $q$ and $k$ vector;
      • LLaMA: Implemented Su Jianlin's version of rotation position encoding.
    • Advantages and Disadvantages :
      • Advantages : RoPE implements relative position encoding through absolute position encoding; the rotation captures not only positional information but also distance information, because it directly reflects the relative positions between elements;
      • Disadvantages : it requires somewhat more computation, because the rotation operation is more involved than a simple vector addition.
    • Reference materials : Understand Rotary Encoding (RoPE) in ten minutes , and understand the rotational position encoding in LLaMA in one article .
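  Following the elementwise formulation above, here is a minimal, self-contained RoPE sketch in PyTorch (interleaved pairing as in the formula; the function name and test values are my own):

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (seq_len, d), d even.
    Pairs (x0, x1), (x2, x3), ... at position m are each rotated by m * theta_j."""
    seq_len, d = x.shape
    theta = base ** (-2 * torch.arange(d // 2, dtype=torch.float32) / d)  # theta_j = 10000^{-2j/d}
    m = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(m, theta)                    # (seq_len, d/2): m * theta_j
    cos = angles.cos().repeat_interleave(2, dim=-1)   # duplicate each angle for its pair
    sin = angles.sin().repeat_interleave(2, dim=-1)
    # Build (-x1, x0, -x3, x2, ...) for the sin term
    x_rot = torch.stack((-x[:, 1::2], x[:, 0::2]), dim=-1).reshape(seq_len, d)
    return x * cos + x_rot * sin

# Rotation preserves vector norms:
q = torch.randn(6, 8)
print(torch.allclose(q.norm(dim=-1), rope(q).norm(dim=-1), atol=1e-5))  # True

# Relative-position property: with the same underlying vectors, the score for
# positions (2, 5) matches the score for positions (0, 3) since 5 - 2 == 3 - 0.
q0, k0 = torch.randn(8), torch.randn(8)
qs = rope(q0.repeat(6, 1))   # same q vector placed at positions 0..5
ks = rope(k0.repeat(6, 1))
print(torch.allclose(qs[2] @ ks[5], qs[0] @ ks[3], atol=1e-5))  # True
```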

3. Decoder-only model

1. Generation tasks

  • Conditional text generation : a condition is given before generation. For example, the input is: "Given the text: Spring is so beautiful. Continue the story.";
  • Unconditional text generation : no condition is given before generation. For example, the input is simply: "Write an essay."

2. Inference process

  • Causal LM
    • Original input composition : For GPT-like models, whether it is a conditional text generation task or an unconditional text generation task, the original input is fed into the model as a whole;
    • New input composition : tokens are generated one at a time; each generated token is appended to the original input to form the new model input, from which the next token is generated;
    • Attention calculation : the inputs use causal (masked) attention: a later token can attend to earlier tokens, but an earlier token cannot attend to later ones;
    • Token generation : a new token is produced by taking the logits at the last position of the output and classifying over the full vocabulary; there are many ways to choose which token to output, see the decoding methods below .
  • Prefix LM
    • Original input composition : For GLM-like models, whether it is a conditional text generation task or an unconditional text generation task, the original input is fed into the model as a whole;
    • New input composition : tokens are generated one at a time; each generated token is appended to the original input to form the new model input, from which the next token is generated;
    • Attention calculation : the inputs use prefix attention: tokens in the original input can all attend to each other, while newly appended tokens can only attend to earlier tokens, not later ones (a sketch of both mask patterns follows this list);
    • Token generation : same as above, the logits at the last position are used for full-vocabulary classification; there are many ways to choose which token to output, see the decoding methods below .
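  To make the two attention patterns concrete, the sketch below (my own illustration, not from the original post) builds a causal mask and a prefix mask for a sequence of length 6 with a prefix of length 3; True marks key positions a query may attend to:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Each position can attend to itself and everything before it.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def prefix_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    # Prefix tokens attend to the whole prefix (bidirectional);
    # generated tokens attend causally to everything before them.
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True
    return mask

print(causal_mask(6).int())
print(prefix_mask(6, prefix_len=3).int())
```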

3. Decoding (generation) methods

  • Greedy Search : Greedy search.
    • Method : Select the word with the highest probability at each time step;
    • Parameter settings : do_sample = False, num_beams = 1;
    • Disadvantages :
      • The generated text tends to repeat;
      • Multiple outputs are not supported: if num_return_sequences is set greater than 1, the code raises an error because greedy search cannot return more than one sequence.
  • Beam Search : Beam search, an improvement on the greedy strategy, will degenerate into a greedy search strategy when num_beams = 1.
    • Method : Select the num_beams words with the highest probability at each time step;
    • Parameter settings : do_sample = False, num_beams > 1;
    • Disadvantages : Although the results are smoother than greedy search, there is still the problem of generating duplicates.
  • Top-K sampling
    • Method : At each time step, top-k tokens will be retained, then the probabilities of the top-k tokens will be renormalized, and finally sampling will be performed among the renormalized k tokens;
    • Parameters : do_sample = True, num_beams = 1, setting top_k;
      • Disadvantages : with a fixed k, a sharp distribution may still include very low-probability tokens among the k candidates, while a flat distribution may cut off many reasonable tokens;
  • Top-P sampling
      • Method : at each time step, tokens are sorted by probability from high to low; tokens are kept until the cumulative probability exceeds top-p, the remaining tokens are discarded, and the probabilities of the kept tokens are renormalized before sampling;
    • Parameters : do_sample = True, num_beams = 1, setting top-p, value between 0-1;
      • Note : top-p sampling is often combined with top-k sampling: at each step the smaller of the two candidate sets is used, which reduces the chance of sampling a very low-probability token when the predicted distribution is flat.
  • Temperature sampling
    • Method : strictly speaking, temperature is not a sampling method but a way of reshaping the probabilities; it has to be combined with one of the sampling methods above. Specifically, a temperature coefficient $t$ is added to the softmax to change each token's output probability:
      $$softmax(x_{i})=\frac{e^{x_{i}/t}}{\sum_{j=1}^{n}e^{x_{j}/t}}$$
      where $n$ is the vocabulary size, $x_{j}$ is the logit of each token in the vocabulary, $x_{i}$ is the logit of the current token, and $t$ is the temperature coefficient.
    • Parameters : temperature;
    • Note : from the formula above, as $t$ approaches 0, the differences between token probabilities are amplified and the distribution becomes sharper, as shown in the figure below. Under top-p sampling this narrows the set of candidate output tokens (fewer tokens are needed for the cumulative probability to reach the threshold), so the model output is more stable. As $t$ approaches infinity, the differences between high-probability tokens shrink and the distribution becomes flatter, so the candidate set grows and the model output is more random. (A sampling sketch combining temperature, top-k, and top-p follows this list.)
      (Figure: token probability distributions at different temperatures; lower temperature gives a sharper distribution, higher temperature a flatter one.)
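  The sketch below (my own, with made-up logits) strings together temperature scaling, top-k, and top-p filtering on a single logits vector, mirroring how the options above interact; it is an illustration, not the Hugging Face implementation:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature=1.0, top_k=0, top_p=1.0) -> int:
    """Sample one token id from a 1-D logits vector with temperature / top-k / top-p."""
    logits = logits / max(temperature, 1e-6)          # temperature scaling of the logits
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[-1]    # smallest logit kept by top-k
        logits[logits < kth] = float("-inf")
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        cutoff = cum_probs > top_p
        cutoff[1:] = cutoff[:-1].clone()              # always keep the token that crosses the threshold
        cutoff[0] = False
        logits[sorted_idx[cutoff]] = float("-inf")
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.tensor([2.0, 1.5, 0.3, -1.0, -2.0])    # pretend vocabulary of 5 tokens
print(sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9))
```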

Summary

  To understand LLMs, this article (together with Part 1) is all you need!


Source: blog.csdn.net/qq_39439006/article/details/132388975