[Comparison of base models of LLM series] Structure comparison of the LLaMA, PaLM, GLM, BLOOM, and GPT models

LLaMA

  • [GPT3] Uses RMSNorm (Root Mean Square Layer Normalization) to normalize the input of each sub-layer; see the paper "Root Mean Square Layer Normalization". A minimal sketch follows this list.
  • [PaLM] Uses the SwiGLU activation function; see the paper "GLU Variants Improve Transformer".
  • [GPTNeo] Uses Rotary Embeddings for position encoding; see the paper "RoFormer: Enhanced Transformer with Rotary Position Embedding".
  • Trained with the AdamW optimizer and a cosine learning rate schedule.
  • Uses an efficient implementation of causal multi-head attention to reduce memory usage and runtime; the implementation is available in xformers.
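
As a concrete reference for the RMSNorm bullet above, here is a minimal PyTorch sketch of the layer; the hidden size in the usage example is illustrative rather than LLaMA's exact configuration.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization (Zhang & Sennrich, 2019).

    Unlike LayerNorm, it does not subtract the mean: the input is only
    rescaled by its root mean square and a learned per-dimension gain.
    """
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # normalize over the last (hidden) dimension: x / sqrt(mean(x^2) + eps)
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

# usage: pre-normalize the input of a transformer sub-layer
norm = RMSNorm(dim=4096)                 # illustrative hidden size
h = torch.randn(2, 16, 4096)             # (batch, sequence, hidden)
print(norm(h).shape)                     # torch.Size([2, 16, 4096])
```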

PaLM

  • SwiGLU activation function: uses SwiGLU for the MLP intermediate activations, because, compared with standard ReLU, GELU, or Swish activations, the "GLU Variants Improve Transformer" paper shows that SwiGLU significantly improves model quality.
  • Parallel Layers: each Transformer block uses a "parallel" formulation (as in GPT-J-6B) instead of the standard "serialized" formulation. The parallel formulation speeds up large-scale training by roughly 15%. Ablations show a small drop in model quality at the 8B parameter scale, but no drop at the 62B scale (see the sketch after this list).
  • Multi-Query Attention: the key/value projections are shared across heads, i.e. "key" and "value" are each projected to shape [1, h], while "query" is still projected to shape [k, h]. This has no impact on model quality or training speed, but yields significant cost savings at autoregressive decoding time.
  • RoPE embeddings: uses RoPE instead of absolute or relative position embeddings, because RoPE performs better on long text.
  • Shared Input-Output Embeddings: the input and output embedding matrices are shared, which I understand is similar to word2vec's input matrix W and output matrix W'.
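
To make the parallel-layers bullet concrete, the sketch below contrasts the two block formulations. It is schematic: `attention` and `mlp` stand in for the real sub-layers, and sharing a single LayerNorm between them is an assumption in the spirit of the GPT-J-style block, not a claim about PaLM's exact implementation.

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Schematic PaLM-style "parallel" transformer block:
    attention and MLP read the same normalized input, and their
    outputs are added to the residual in one step."""
    def __init__(self, dim: int, attention: nn.Module, mlp: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attention = attention
        self.mlp = mlp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # standard "serialized" formulation, for comparison:
        #   y = x + mlp(norm2(x + attention(norm1(x))))
        # parallel formulation (lets the two input matmuls be fused):
        h = self.norm(x)
        return x + self.attention(h) + self.mlp(h)

# usage with identity placeholders for the sub-layers
block = ParallelBlock(dim=512, attention=nn.Identity(), mlp=nn.Identity())
x = torch.randn(2, 10, 512)
print(block(x).shape)                    # torch.Size([2, 10, 512])
```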

GLM

  • The order of Layer Normalization and the residual connection is rearranged.
  • A single linear layer is used for the output token prediction.
  • ReLU activations are replaced with GELU.
  • 2D positional encoding is used (a sketch of the two position-id sequences follows this list).
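
As a rough illustration of the 2D positional encoding, the sketch below builds the two position-id sequences for one training sample. The indexing conventions (0-based positions, how [MASK]/[START] tokens are counted) are assumptions that follow the GLM paper only approximately.

```python
from typing import List, Tuple

def glm_2d_positions(context_len: int,
                     spans: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Schematic GLM 2D position ids for one sample.

    context_len: length of Part A (the corrupted input containing [MASK]s).
    spans: for each span appended in Part B, (mask_position_in_part_a, span_length).

    pos1: Part A tokens keep their own position; Part B tokens reuse the
          position of the [MASK] they replace.
    pos2: 0 for all Part A tokens; 1..span_length inside each Part B span.
    """
    pos1 = list(range(context_len))
    pos2 = [0] * context_len
    for mask_pos, span_len in spans:
        pos1 += [mask_pos] * span_len
        pos2 += list(range(1, span_len + 1))
    return pos1, pos2

# 6-token context, one masked span of length 3 whose [MASK] sits at position 2
print(glm_2d_positions(6, [(2, 3)]))
# ([0, 1, 2, 3, 4, 5, 2, 2, 2], [0, 0, 0, 0, 0, 0, 1, 2, 3])
```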

BLOOM

  • Uses ALiBi positional embeddings, which directly decay attention scores according to the distance between keys and queries. Compared with the original Transformer positional encoding and Rotary embeddings, this leads to smoother training and better downstream performance. ALiBi does not add positional embeddings to the word embeddings; instead, it biases the query-key attention scores with a penalty proportional to their distance (a sketch of the bias follows the BLOOM list).

  • Embedding Layer Norm is used immediately after the first embedding layer to avoid unstable training.

  • A vocabulary of 250,000 tokens with byte-level BPE is used, so tokenization never produces unknown tokens.

  • The feed-forward block consists of two fully connected layers.
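
The ALiBi bias mentioned above can be sketched as follows; the slope schedule assumes a power-of-two number of heads (the original paper handles other head counts slightly differently), and the causal mask is applied separately.

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """ALiBi: a per-head linear penalty added to raw attention scores,
    proportional to the query-key distance; no positional embeddings
    are added to the word embeddings."""
    # geometric slope schedule, e.g. 1/2, 1/4, ..., 1/256 for 8 heads
    start = 2 ** (-8.0 / num_heads)
    slopes = torch.tensor([start ** (h + 1) for h in range(num_heads)])
    positions = torch.arange(seq_len)
    distance = positions[None, :] - positions[:, None]   # key index - query index
    # (num_heads, seq_len, seq_len); add to scores before softmax,
    # then apply the usual causal mask on top
    return slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(num_heads=8, seq_len=5)
print(bias.shape)      # torch.Size([8, 5, 5])
print(bias[0, 4])      # tensor([-2.0000, -1.5000, -1.0000, -0.5000, 0.0000])
```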

GPT

GPT uses the Transformer's Decoder structure, with some changes to the original Decoder block. The original Decoder contains two Multi-Head Attention blocks, while GPT keeps only the Masked Multi-Head Attention and drops the encoder-decoder cross-attention, since there is no encoder.
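
Since the only attention GPT keeps is masked (causal) self-attention, its core can be sketched in a few lines; this single-head version omits the linear projections and the multi-head split and is purely illustrative.

```python
import torch
import torch.nn.functional as F

def masked_self_attention(q: torch.Tensor, k: torch.Tensor,
                          v: torch.Tensor) -> torch.Tensor:
    """Single-head causal ("masked") self-attention: each position may only
    attend to itself and earlier positions, enforced by setting the scores
    of future positions to -inf before the softmax."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5                    # (seq, seq)
    future = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(5, 64)                       # (sequence, head_dim)
print(masked_self_attention(x, x, x).shape)  # torch.Size([5, 64])
```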

Original post: blog.csdn.net/yanqianglifei/article/details/130757394