【CS324】LLM (large model capabilities, data, architecture, distributed training, fine-tuning, etc.)

note

  • These are study notes for Stanford University's CS324 course, supplemented with other LLM-related materials.

1. Introduction

  • Language models were originally studied in the context of information theory and can be used to estimate the entropy of English.
    • Entropy measures a probability distribution: $H(p) = \sum_x p(x) \log \frac{1}{p(x)}$.
    • Intuitively, entropy is the expected number of bits needed to encode (i.e., compress) a sample $x \sim p$ into a bitstring. For example, "the mouse ate the cheese" might be encoded as "0001110101". The smaller the entropy, the more structured the sequence and the shorter the code. Intuitively, $\log \frac{1}{p(x)}$ can be viewed as the code length for an element $x$ that occurs with probability $p(x)$.
    • Cross entropy, $H(p,q) = \sum_x p(x) \log \frac{1}{q(x)}$, is an upper bound on the entropy $H(p)$, so we can estimate $H(p,q)$ by constructing a (language) model $q$ and evaluating it on samples from $p$ (a small numeric sketch follows the table below).
  • N-gram models are computationally extremely efficient, but statistically inefficient.
  • Neural language models are statistically efficient but computationally inefficient.
  • Growth of model size: with the rise of deep learning in the 2010s and major hardware advances (e.g., GPUs), the size of neural language models has increased dramatically. The table below shows a roughly 5,000x increase in model size over four years.
| Model | Organization | Date | Size (# params) |
|---|---|---|---|
| ELMo | AI2 | Feb 2018 | 94,000,000 |
| GPT | OpenAI | Jun 2018 | 110,000,000 |
| BERT | Google | Oct 2018 | 340,000,000 |
| XLM | Facebook | Jan 2019 | 655,000,000 |
| GPT-2 | OpenAI | Mar 2019 | 1,500,000,000 |
| RoBERTa | Facebook | Jul 2019 | 355,000,000 |
| Megatron-LM | NVIDIA | Sep 2019 | 8,300,000,000 |
| T5 | Google | Oct 2019 | 11,000,000,000 |
| Turing-NLG | Microsoft | Feb 2020 | 17,000,000,000 |
| GPT-3 | OpenAI | May 2020 | 175,000,000,000 |
| Megatron-Turing NLG | Microsoft, NVIDIA | Oct 2021 | 530,000,000,000 |
| Gopher | DeepMind | Dec 2021 | 280,000,000,000 |
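
Returning to the entropy discussion above, here is a small numeric sketch; the toy distributions and the use of base-2 logs are my own illustration, not figures from the course:

```python
import math

def entropy(p):
    """H(p) = sum_x p(x) * log2(1 / p(x)), in bits."""
    return sum(px * math.log2(1 / px) for px in p.values() if px > 0)

def cross_entropy(p, q):
    """H(p, q) = sum_x p(x) * log2(1 / q(x)); always >= H(p)."""
    return sum(px * math.log2(1 / q[x]) for x, px in p.items() if px > 0)

p = {"the": 0.4, "mouse": 0.2, "ate": 0.2, "cheese": 0.2}      # "true" distribution
q = {"the": 0.25, "mouse": 0.25, "ate": 0.25, "cheese": 0.25}  # language model
print(entropy(p))           # ~1.92 bits
print(cross_entropy(p, q))  # 2.0 bits >= H(p)
```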

2. Capabilities of large models

1. From language model to task model

  • Train a new task model that uses the language model's representations as features (probing)
  • Model fine-tuning

2. Task evaluation

  • Language modeling
    • Perplexity: the average per-token "branching factor". The branching factor here can be understood as the average number of tokens the model considers plausible as the next token after each given token; perplexity thus measures the diversity and uncertainty of the model's predictions (see the formula after this list).
    • Two types of errors:
      • Recall errors: the model fails to assign enough probability to the correct token.
      • Precision errors: the model assigns too much probability to bad sequences.
  • Question answering
  • Translation
  • Arithmetic: abstract reasoning, e.g., solving math problems
  • News article generation: given a title, generate an article
  • Novel tasks: given a made-up word, generate sentences using that word
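
For reference, perplexity can be written as the exponentiated average per-token negative log-likelihood:

$$\text{perplexity}_p(x_{1:L}) = \exp\left(\frac{1}{L} \sum_{i=1}^{L} \log \frac{1}{p(x_i \mid x_{1:i-1})}\right).$$

Perplexity punishes recall errors severely (assigning the true next token probability 0 drives perplexity to infinity) but penalizes precision errors only mildly.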

Related benchmarks:

  • SWORDS: lexical substitution, where the goal is to predict synonyms in the context of a sentence.
  • Massive Multitask Language Understanding (MMLU): includes 57 multiple-choice tasks covering math, U.S. history, computer science, and more.
  • TruthfulQA: a question-answering dataset of questions that humans may answer falsely because of misconceptions.

3. Harmfulness of large models

Harms (Part 1)

  • Performance Difference: Automatic Speech Recognition (ASR) systems perform worse on Black speakers than White speakers ( Koenecke et al., 2020 )
  • Social bias: sexism, etc.

Harms (Part 2)

  • Toxicity: rude, disrespectful, or unreasonable language that may make someone want to leave a conversation, e.g., "Trans women are not women."
  • Disinformation: deliberately spread false information.
  • Content moderation: Facebook (Meta) has long combated harmful content and has recently begun leveraging language models to detect it automatically; for example, RoBERTa has been used for this for several years.

4. Data of large models

  • Common Crawl is a nonprofit organization that crawls the web and provides free snapshots to the public. Because of this convenience, it has become a standard data source for many models, such as T5, GPT-3, and Gopher.
    • Although Internet data is abundant, it is biased: users skew young and male, and, for example, only 8.8% of Wikipedia editors are female.
  • Although OpenAI did not publicly release the WebText dataset, the OpenWebText dataset conceptually replicates WebText's construction method.
  • The GPT-3 dataset is derived mainly from Common Crawl, with WebText used as a reference high-quality dataset. GPT-3 downloaded 41 shards of Common Crawl data (2016-2019) and trained a binary classifier to distinguish WebText from Common Crawl; documents the classifier judged closer to WebText were more likely to be retained. GPT-3 also applied fuzzy deduplication (detecting 13-gram overlap and removing a window or document if it occurred in fewer than 10 training documents) and removed data overlapping the benchmark datasets. In addition, GPT-3 expanded the diversity of data sources (WebText2, Books1, Books2, and Wikipedia). During training, Common Crawl is downsampled: it makes up 82% of the dataset but contributes only 60% of the training data.
  • Corpora used for pretraining large models (figure omitted).

  • The survey "A Survey of Large Language Models" summarizes the proportions of each component of the pretraining corpora of mainstream large models (figure omitted).
  • A typical pretraining data preprocessing pipeline includes abnormal-word filtering, deduplication, private-information scrubbing, tokenization, etc. (figure omitted).

5. Legal issues

  • Legal questions:
    • Data: e.g., data privacy issues
    • Model application: downstream uses must not do harm (e.g., fraud, fake news)
    • Copyright issues
  • Three major stages of information technology:
      1. First stage: text data mining (search engines), based on simple pattern matching.
      2. Second stage: classification (e.g., classifying stop signs, sentiment analysis) and recommendation systems.
      3. Third stage: learning generative models that imitate expression.

6. Model Architecture

1. Tokenization

A good tokenizer:

  • should not produce too many tokens (otherwise the sequence becomes hard to model) nor too few (otherwise parameters cannot be shared between words);
  • should make each token a linguistically or statistically meaningful unit.

Tokenization methods:

  • Space-based tokenization: text.split(' ') is the simplest method and is feasible for English, but Chinese has no spaces between words, and even English has hyphenated words (e.g., father-in-law) and contractions (e.g., don't) that need to be split correctly. For example, the Penn Treebank splits don't into do and n't, a linguistically informed but non-obvious choice. Splitting on spaces alone therefore creates many problems.
  • Other methods:
    • Byte-Pair Encoding (BPE) tokenization
    • WordPiece tokenization
    • Unigram tokenization
    • You can use the library: SentencePiece
    • The tokenizers used by common models (figure omitted).

[Example]
Suppose we have an English corpus containing the following two sentences:

  1. “I like playing soccer.”
  2. “I like playing basketball.”

Now, we want to use these sentences to build a subword vocabulary for word segmentation.

BPE(Byte Pair Encoding):

When using BPE, we first split each sentence into individual characters, giving the following character sequences:

  1. "I like playing soccer."

    • Characters: ['I', ' ', 'l', 'i', 'k', 'e', ' ', 'p', 'l', 'a', 'y', 'i', 'n', 'g', ' ', 's', 'o', 'c', 'c', 'e', 'r', '.']
  2. "I like playing basketball."

    • Characters: ['I', ' ', 'l', 'i', 'k', 'e', ' ', 'p', 'l', 'a', 'y', 'i', 'n', 'g', ' ', 'b', 'a', 's', 'k', 'e', 't', 'b', 'a', 'l', 'l', '.']

We then repeatedly find the most frequent adjacent symbol pair in the sequences and merge it into a new subword. In this example the first merge might be "pl"; after enough merges, frequent strings grow into whole subwords, for example:

  1. "I like playing soccer."

    • Tokens: ['I', ' ', 'like', ' ', 'playing', ' ', 's', 'o', 'cc', 'e', 'r', '.']
  2. "I like playing basketball."

    • Tokens: ['I', ' ', 'like', ' ', 'playing', ' ', 'b', 'a', 's', 'k', 'e', 't', 'b', 'a', 'l', 'l', '.']

Through BPE, frequent strings such as "like" and "playing" become single subwords, while rarer words such as "soccer" and "basketball" remain split into smaller pieces.

WordPiece:

For WordPiece, we likewise first split each sentence into individual characters. We then choose merge operations based on both the frequency of character sequences and a language-model score.

For the example above, suppose the learned merges yield the following segmentations:

  1. "I like playing soccer."

    • Tokens: ['I', ' ', 'like', ' ', 'p', 'lay', 'ing', ' ', 'soccer', '.']
  2. "I like playing basketball."

    • Tokens: ['I', ' ', 'like', ' ', 'p', 'lay', 'ing', ' ', 'basketball', '.']

Through WordPiece, "playing" is split into "p", "lay", and "ing".

Summary:

  • BPE and WordPiece are statistics-based subword tokenization methods that merge frequently co-occurring character sequences into subwords.
  • BPE selects merges by simple frequency counts, while WordPiece also takes a language-model score into account.
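
A minimal sketch of learning BPE merges in Python; this is my own illustration of the procedure described above, not the course's reference code:

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs over the corpus; `words` maps a
    tuple of symbols to that word's frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(words, pair):
    """Apply one merge rule: replace each occurrence of `pair`
    with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus from the example above, pre-split into characters.
corpus = ["i like playing soccer .", "i like playing basketball ."]
words = dict(Counter(tuple(w) for line in corpus for w in line.split()))

merges = []
for _ in range(10):                   # learn 10 merge rules
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    words = merge_pair(words, best)
    merges.append(best)

print(merges)  # e.g. ('l', 'i'), ('li', 'k'), ... (tie-breaking may vary)
```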

2. Model architecture

A contextual embedding represents each token by a vector that depends on its context (the surrounding words); for example, the representation of "mouse" should attend to the other words within some window around it:

$$[\text{the}, \text{mouse}, \text{ate}, \text{the}, \text{cheese}] \stackrel{\phi}{\Rightarrow} \left[\begin{pmatrix} 1 \\ 0.1 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ -0.1 \end{pmatrix}, \begin{pmatrix} 0 \\ -1 \end{pmatrix}\right].$$

  • Notation: $\phi : V^L \to \mathbb{R}^{d \times L}$ is the embedding function (analogous to a feature map for sequences, mapping the sequence to a matrix of vectors).
  • For a token sequence $x_{1:L} = [x_1, \ldots, x_L]$, $\phi$ produces the contextual embeddings $\phi(x_{1:L})$.

(1) Encoder-only

  • Models: BERT, RoBERTa, etc. They generate contextual embeddings rather than text directly and are often used for classification tasks.
  • Advantage: for each $x_i$, the contextual embedding can depend bidirectionally on both the left context $(x_{1:i-1})$ and the right context $(x_{i+1:L})$.

(2) Decoder-only

  • Models: GPT is an autoregressive model. Given a prompt $x_{1:i}$, it produces contextual embeddings and a probability distribution over the next token $x_{i+1}$ (and, recursively, over the entire completion $x_{i+1:L}$): $x_{1:i} \Rightarrow \phi(x_{1:i}), \; p(x_{i+1} \mid x_{1:i})$.
  • Disadvantage: for each $x_i$, the contextual embedding can only depend unidirectionally on the left context $(x_{1:i-1})$.

(3) Encoder-decoder

  • Models: the original Transformer, BART, T5, etc.
  • Advantage: they process the input $x_{1:L}$ with bidirectional contextual embeddings and can generate an output $y_{1:L}$:
    $$x_{1:L} \Rightarrow \phi(x_{1:L}), \; p(y_{1:L} \mid \phi(x_{1:L})).$$
    Taking table-to-text generation as an example, the input and output could be:
    $$[\text{name:}, \text{plant}, |, \text{type:}, \text{flower}, \text{shop}] \Rightarrow [\text{the}, \text{flower}, \text{shop}, \text{is}, \text{a}, \text{shop}].$$
  • Disadvantage: requires more task-specific training objectives.

3. Infrastructure

(1) Revisiting the Transformer architecture

Single-head attention, in matrix form:

def Attention($x_{1:L} : \mathbb{R}^{d \times L}$, $y : \mathbb{R}^{d}$) → $\mathbb{R}^{d}$:

  • Process $y$ by comparing it with each $x_i$.
  • Return $W_{value} \, x_{1:L} \operatorname{softmax}\left(x_{1:L}^{\top} W_{key}^{\top} W_{query} \, y / \sqrt{d}\right)$

Multi-head attention:

def MultiHeadedAttention($x_{1:L} : \mathbb{R}^{d \times L}$, $y : \mathbb{R}^{d}$) → $\mathbb{R}^{d}$:

  • Process $y$ by comparing it with each $x_i$ from $n_{heads}$ different "aspects".
  • Return $W_{output} \underbrace{\left[\operatorname{Attention}(x_{1:L}, y), \ldots, \operatorname{Attention}(x_{1:L}, y)\right]}_{n_{heads} \text{ times}}$

For the self-attention layer, we use each $x_i$ in place of the query $y$, so each token performs an Attention operation over the other content of the sentence:

def SelfAttention($x_{1:L} : \mathbb{R}^{d \times L}$) → $\mathbb{R}^{d \times L}$:

  • Compare each element $x_i$ with the other elements.
  • Return $\left[\operatorname{Attention}(x_{1:L}, x_1), \ldots, \operatorname{Attention}(x_{1:L}, x_L)\right]$

Self-attention allows all tokens to "communicate" with each other, while the feed-forward layer provides further processing:

def FeedForward($x_{1:L} : \mathbb{R}^{d \times L}$) → $\mathbb{R}^{d \times L}$:

  • Process each token independently.
  • For $i = 1, \ldots, L$:
    • Compute $y_i = W_2 \max(W_1 x_i + b_1, 0) + b_2$
  • Return $[y_1, \ldots, y_L]$
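
A minimal NumPy sketch of the functions above, with random weights for shape-checking only (my own illustration, not the course's reference code):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d, L = 4, 5
rng = np.random.default_rng(0)
W_key, W_query, W_value = [rng.normal(size=(d, d)) for _ in range(3)]

def attention(x, y):
    """x: (d, L) context, y: (d,) query token.
    Returns W_value @ x @ softmax(x^T W_key^T W_query y / sqrt(d))."""
    scores = x.T @ W_key.T @ W_query @ y / np.sqrt(d)  # (L,) attention scores
    return W_value @ x @ softmax(scores)               # (d,) weighted combination

def self_attention(x):
    """Use each column x_i as the query in turn."""
    return np.stack([attention(x, x[:, i]) for i in range(x.shape[1])], axis=1)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: y_i = W2 max(W1 x_i + b1, 0) + b2, per column."""
    return W2 @ np.maximum(W1 @ x + b1[:, None], 0) + b2[:, None]

x = rng.normal(size=(d, L))
print(self_attention(x).shape)  # (4, 5)
```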

(2) Optimizing network training

(1) Residual (skip) connections: even if the gradient of $f$ vanishes, the signal can still flow through via $x_{1:L}$.
(2) Layer normalization: def LayerNorm($x_{1:L} : \mathbb{R}^{d \times L}$) → $\mathbb{R}^{d \times L}$. Concretely, we define a function that takes a sequence model $f$ and makes it "robust":

def AddNorm($f : (\mathbb{R}^{d \times L} \to \mathbb{R}^{d \times L})$, $x_{1:L} : \mathbb{R}^{d \times L}$) → $\mathbb{R}^{d \times L}$:

  • Apply $f$ to $x_{1:L}$.
  • Return $\operatorname{LayerNorm}(x_{1:L} + f(x_{1:L}))$.

Finally, define the Transformer block as follows:

def TransformerBlock($x_{1:L} : \mathbb{R}^{d \times L}$) → $\mathbb{R}^{d \times L}$:

  • Process each element $x_i$ in context.
  • Return $\operatorname{AddNorm}(\operatorname{FeedForward}, \operatorname{AddNorm}(\operatorname{SelfAttention}, x_{1:L}))$.

(3) Positional embeddings

def EmbedTokenWithPosition($x_{1:L} : \mathbb{R}^{d \times L}$):

  • Add positional information.
  • Define the positional embeddings:
    • Even dimensions: $P_{i,2j} = \sin(i / 10000^{2j / d_{model}})$
    • Odd dimensions: $P_{i,2j+1} = \cos(i / 10000^{2j / d_{model}})$
  • Return $[x_1 + P_1, \ldots, x_L + P_L]$.
  • In the notation above, $i$ is the token's position in the sentence and $j$ indexes the dimension of the token's vector representation.

Note: current large LLMs also use RoPE (rotary position embedding), ALiBi, and similar schemes to extend the context window.
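
A minimal NumPy sketch of the sinusoidal embeddings above (assumes an even d_model; my own illustration):

```python
import numpy as np

def positional_embedding(L, d_model):
    """Sinusoidal embeddings: P[i, 2j] = sin(i / 10000^(2j/d_model)),
    P[i, 2j+1] = cos(i / 10000^(2j/d_model)). Assumes d_model is even."""
    P = np.zeros((L, d_model))
    i = np.arange(L)[:, None]                  # token position in the sentence
    two_j = np.arange(0, d_model, 2)[None, :]  # even dimension indices 2j
    angles = i / np.power(10000.0, two_j / d_model)
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

# EmbedTokenWithPosition: add P to the token embeddings (here shape L x d_model).
x = np.random.default_rng(0).normal(size=(8, 16))
x_with_pos = x + positional_embedding(8, 16)
```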

4. Large model architecture

(1) GPT-3 architecture:

  • Stacks the Transformer block 96 times
  • Hidden-state dimension: $d_{model} = 12288$
  • Intermediate feed-forward dimension: $d_{ff} = 4 d_{model}$
  • Number of attention heads: $n_{heads} = 96$
  • Context length: $L = 2048$

(2) LLaMA model architecture

  • Transformer architecture (Vaswani et al., 2017).
  • RMSNorm pre-normalization: applies RMSNorm (Root Mean Square Layer Normalization; Zhang and Sennrich, 2019) to the input of each Transformer sub-layer, instead of normalizing the output.
  • SwiGLU activation function (Shazeer, 2020).
  • Rotary positional embeddings (RoPE; Su et al., 2022).
  • In Llama 2, the primary architectural differences from Llama 1 are an increased context length and grouped-query attention (GQA).

7. Model training

1. Decoder-only models

The conditional distribution of an autoregressive language model, $p(x_i \mid x_{1:i-1})$, is computed as follows:

  • Map $x_{1:i-1}$ to the contextual embeddings $\phi(x_{1:i-1})$.
  • Apply the embedding matrix $E \in \mathbb{R}^{V \times d}$ to obtain a score for each token: $E \, \phi(x_{1:i-1})_{i-1}$.
  • Exponentiate and normalize to obtain the predictive distribution over $x_i$.
  • In summary:
    $$p(x_{i+1} \mid x_{1:i}) = \operatorname{softmax}(E \, \phi(x_{1:i})_i).$$

Maximum likelihood: let $\theta$ be all the parameters of the language model and $D$ the training data, a set of sequences. Following the maximum-likelihood principle, we define the negative log-likelihood objective:
$$O(\theta) = \sum_{x \in D} -\log p_\theta(x) = \sum_{x \in D} \sum_{i=1}^L -\log p_\theta(x_i \mid x_{1:i-1}).$$
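
A minimal PyTorch sketch of this objective for a single sequence; the random tensor standing in for the contextual embeddings is my own illustration:

```python
import torch
import torch.nn.functional as F

V, d, L = 100, 16, 8
E = torch.randn(V, d)        # embedding matrix E in R^{V x d}
phi = torch.randn(L, d)      # stand-in for contextual embeddings phi(x_{1:L})
x = torch.randint(V, (L,))   # token ids of the sequence

logits = phi @ E.T           # scores E phi(x_{1:i})_i for every position i
# position i predicts token x_{i+1}, so shift the targets by one:
nll = F.cross_entropy(logits[:-1], x[1:], reduction="sum")
print(nll)  # this sequence's contribution -log p_theta(x) to O(theta)
```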

2. Optimization algorithm

Recall the objective function of the autoregressive language model: $O(\theta) = \sum_{x \in D} -\log p_\theta(x)$.

(1) Adam (adaptive moment estimation)

The Adam algorithm has two innovations:

  1. It introduces momentum (keep moving in the same direction).
  2. Each dimension of the parameters $\theta$ has its own adaptive step size (inspired by second-order methods).

Its steps are as follows:

  • Initialize parameters $\theta_0$

  • Initialize moments $m_0, v_0 \leftarrow 0$

  • Repeat the following for $t = 1, 2, \ldots$:

    • Sample a mini-batch $B_t \subset D$

    • Update the parameters as follows:

      • Compute the gradient

      $$g_t \leftarrow \frac{1}{|B_t|} \sum_{x \in B_t} \nabla_\theta (-\log p_\theta(x)).$$

      • Update the first- and second-order moments

      $$m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \qquad v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

      • Correct the bias

      $$\hat m_t \leftarrow m_t / (1 - \beta_1^t) \qquad \hat v_t \leftarrow v_t / (1 - \beta_2^t)$$

      • Update the parameters

      $$\theta_t \leftarrow \theta_{t-1} - \eta \, \hat m_t / (\sqrt{\hat v_t} + \epsilon)$$

Storage analysis:

Adam increases storage from 2x the model parameters ($\theta_t, g_t$) to 4x ($\theta_t, g_t, m_t, v_t$).
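
A minimal NumPy sketch of one Adam step, matching the update rules above (the learning rate is an arbitrary example value):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=6e-4, beta1=0.9, beta2=0.95, eps=1e-8):
    """One Adam update; beta2 and eps follow the GPT-3 settings quoted later."""
    m = beta1 * m + (1 - beta1) * g       # first-order moment (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2  # second-order moment (per-dim scale)
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Note the four arrays held per parameter tensor: theta, g, m, v.
```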

(2) AdaFactor

AdaFactor is an optimization algorithm that reduces storage usage. Its features:

  • Instead of storing the $O(m \times n)$ moment matrices $m_t, v_t$, it stores their row and column sums, $O(m + n)$ values, and reconstructs the matrices from them.
  • It removes momentum.
  • It was used to train T5.
  • AdaFactor can make training difficult (see the Twitter thread and blog post).

(3) Model parameter initialization

  • Given a matrix $W \in \mathbb{R}^{m \times n}$, the standard (Xavier) initialization is $W_{ij} \sim N(0, 1/n)$.
  • GPT-2 and GPT-3 additionally scale the weights by $1/\sqrt{N}$, where $N$ is the number of residual layers.
  • T5 scales the attention matrices by $1/\sqrt{d}$ (code).

Taking GPT-3 as an example, the parameters used are:

  • Adam: $\beta_1 = 0.9, \beta_2 = 0.95, \epsilon = 10^{-8}$
  • Batch size: 3.2 million tokens (about 1,500 sequences)
  • Gradient clipping: $g_t \leftarrow g_t / \min(1, \|g\|_2)$
  • Linear learning-rate warmup (first 375 million tokens)
  • Cosine learning-rate decay down to 10%
  • Gradual batch-size ramp-up
  • Weight decay of 0.1
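
A sketch of the warmup-plus-cosine-decay schedule just described; the function shape is the standard one, and the exact GPT-3 implementation may differ:

```python
import math

def lr_schedule(step, max_lr, warmup_steps, total_steps, floor_frac=0.10):
    """Linear warmup followed by cosine decay down to 10% of max_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0
    return max_lr * (floor_frac + (1.0 - floor_frac) * cosine)
```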

8. Distributed training

[Background] Suppose a layer in the neural network performs a matrix multiplication: the input $x$ has shape $4 \times 5$ and the weight matrix $w$ has shape $5 \times 8$, so the output has shape $4 \times 8$ (figure omitted).

  • Data parallelism: split the data $x$ across devices. During backpropagation, the gradients on all devices must be AllReduced so that the model replicas stay consistent. Suitable when the data is large and the model is small, e.g., ResNet-50 (see the sketch after this list).

    • AllReduce: first reduce (sum) the gradients of the model replicas on all devices, then broadcast the result back to every device (figure omitted).
  • Model parallelism: the AllReduce of gradients across devices is avoided; however, since each device needs the complete input, the data must be broadcast among the devices, which incurs its own communication cost. For example, if the $4 \times 8$ output above is the input to the next layer, it must be broadcast to both devices. Used for models such as BERT (figure omitted).

  • Pipeline parallelism: when the network is too large, pipeline parallelism can be used in addition to model parallelism (e.g., a 4-layer network split into stages T1-T4; figure omitted).

  • Hybrid parallelism: combine several of the strategies above, as in GPT-3:

    • First, the model is split into 64 stages for pipeline parallelism, each stage running on 6 DGX-A100 hosts.
    • Data-parallel training is performed across the 6 hosts.
    • Each host has 8 GPUs, and the 8 GPUs within a machine perform model-parallel training.
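
A minimal data-parallel sketch with PyTorch DDP; the toy 5x8 linear layer echoes the background example, and the launch setup (one process per GPU via torchrun) is an assumption for illustration:

```python
# Launch with: torchrun --nproc_per_node=N train.py  (assumes a multi-GPU host)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# The 5x8 layer from the background example, replicated on every rank.
model = DDP(torch.nn.Linear(5, 8).cuda(local_rank))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(4, 5).cuda(local_rank)  # each rank sees its own data shard
loss = model(x).square().mean()
loss.backward()   # DDP AllReduces gradients across ranks here
opt.step()        # all replicas stay consistent
```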


9. New model architecture

  • Mixture-of-experts models are somewhat like the earlier MMoE; retrieval-based models can be combined with knowledge-base question answering. The specific models will be studied and added later.

1. Mixture-of-experts models

References:

2. Retrieval-based models

References:

  • REALM: Retrieval-Augmented Language Model Pre-Training. Kelvin Guu, Kenton Lee, Z. Tung, Panupong Pasupat, Ming-Wei Chang. 2020. Introduces REALM.
  • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, M. Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. NeurIPS 2020. Introduces RAG.
  • Improving language models by retrieving from trillions of tokens. Sebastian Borgeaud, A. Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, G. V. D. Driessche, J. Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, T. Hennigan, Saffron Huang, Lorenzo Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, K. Simonyan, Jack W. Rae, Erich Elsen, L. Sifre. 2021. Introduces RETRO.

10. Adaptation of large models

1. Common adaptation configuration

  1. Pre-trained language model (pre-trained LM):
     represented by its parameters $\theta_{LM}$.

  2. Downstream task dataset:
     data sampled from the downstream task distribution $P_{task}$, e.g., instances of tasks such as text classification or sentiment analysis. Each sample consists of an input x and a target output y: $\left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right)$

  3. Adaptation parameters:
     to make the pre-trained LM fit a specific downstream task, we look for a set of parameters $\gamma$ drawn from a family $\Gamma$; they can be a subset of the existing parameters or newly introduced ones. These parameters are used to tune the model so that it performs better on the task.

  4. Task loss function:
     define a loss function $\ell_{\text{task}}$ that measures the model's performance on the downstream task, e.g., cross-entropy loss, a common choice that measures the difference between the predicted and true distributions.

  5. Optimization problem:
     the goal is to find adaptation parameters $\gamma_{\text{adapt}}$ that minimize the task loss over the downstream dataset:

$$\gamma_{\text{adapt}} = \operatorname{argmin}_{\gamma \in \Gamma} \frac{1}{n} \sum_{i=1}^n \ell_{\text{task}}\left(\gamma, \theta_{LM}, x_i, y_i\right).$$

This yields the adaptation parameters $\gamma_{\text{adapt}}$, which parameterize the adapted model $p_{adapt}$.

2. Adaptation methods

(1) Probing

(2) Fine-tuning

Fine-tuning for human alignment (figure omitted):

InstructGPT fine-tunes the GPT-3 model in three steps:

  1. Collect demonstrations of human-written behavior: collect examples that match human expectations and perform supervised fine-tuning on them.

  2. Instruction-based sampling with human preferences: for each instruction, sample k outputs from the LM of step 1, then collect human feedback on which sampled output is preferred. This data is cheaper to collect than step 1's.

  3. Fine-tune the LM with a reinforcement-learning objective: fine-tune the step-1 LM with an RL objective to maximize the human-preference reward.

After this fine-tuning, the 1.3B InstructGPT model was preferred over the 175B GPT-3 85% of the time (71% when GPT-3 used few-shot prompts). On closed-domain QA/summarization, InstructGPT fabricates information 21% of the time, an improvement over GPT-3's 41%. When prompted to be respectful, InstructGPT produces 25% less toxic output than GPT-3.

(3) Lightweight Fine-tuning

  • You can use the peft library (Parameter-Efficient Fine-Tuning), which supports the following methods (see the LoRA sketch after this list):
    • Adapter Tuning (freeze the original pre-trained model's parameters and fine-tune only the newly added adapters)
    • Prefix Tuning (prepend task-specific virtual tokens as a prefix to the input; during training only the prefix parameters are updated while the Transformer's other parameters stay frozen. It is similar to constructing a prompt, except that a prompt is hand-crafted and cannot be updated during training, whereas a prefix is a learnable, "implicit" prompt)
    • Prompt Tuning (a simplified version of Prefix Tuning that only adds prompt tokens at the input layer, without the extra MLP)
    • P-tuning (turns the prompt into a learnable embedding layer; v2 adds prompt tokens at each layer as input)
    • LoRA (Low-Rank Adaptation, which addresses two issues: adapters deepen the model and increase inference latency, and the prompts in the methods above are hard to train and shrink the model's usable sequence length)
      • At inference time, the product of the two trained matrices A and B can be added directly to the original pre-trained weights, so the merged result replaces the original parameters.
      • This is equivalent to using LoRA to approximate the full fine-tuning process.

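A minimal LoRA fine-tuning sketch with the peft library; the base model ("gpt2") and all hyperparameters here are illustrative choices, not values from the original notes:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    r=8,             # rank of the low-rank factors A and B
    lora_alpha=32,   # scaling applied to the BA update
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the A/B matrices are trainable

# Train as usual; at inference time the learned update BA can be merged into
# the frozen weights (W <- W + BA), adding no extra latency.
```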

Lightweight fine-tuning saves resources while approaching the performance of full-parameter fine-tuning:

  1. Prompt Tuning: optimizes the model's performance by tuning its input prompts. It can be viewed as a more flexible method of fine-tuning that guides the model's output by adjusting learned input prompts rather than directly modifying model parameters.
  2. Prefix Tuning: similar to prompt tuning, prefix tuning also focuses on the input side, adding learned task-specific prefixes to adjust the model's behavior.
  3. Adapter Tuning: fine-tunes a model by inserting trainable "adapter" modules between its hidden layers. These adapters allow the model to be fine-tuned without changing the original pre-trained parameters, reducing storage and computation requirements.

11. Environmental impact

Patterson et al., 2021
Simple form:
$$\text{emissions} = R_{\text{power} \to \text{emit}} \cdot (\text{energy-train} + \text{queries} \cdot \text{energy-inference})$$

  • NVIDIA: 80% of ML workloads are inference, not training

For training:

$$\text{emissions} = \text{hours-to-train} \cdot \text{num-processors} \cdot \text{power-per-processor} \cdot \text{PUE} \cdot R_{\text{power} \to \text{emit}}$$
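
A worked example of the training-emissions formula; every number below is an assumption for illustration, not a figure from the lecture:

```python
hours_to_train = 240       # e.g. 10 days (assumed)
num_processors = 64        # GPUs (assumed)
power_per_processor = 0.3  # kW per GPU (assumed)
PUE = 1.1                  # data-center power usage effectiveness (assumed)
R_power_to_emit = 0.43     # kg CO2eq per kWh of grid power (assumed)

energy_kwh = hours_to_train * num_processors * power_per_processor * PUE
emissions_kg = energy_kwh * R_power_to_emit
print(f"{energy_kwh / 1000:.1f} MWh, {emissions_kg / 1000:.2f} t CO2eq")
```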

Estimates from different models:

  • T5: 86 MWh, 47 t CO2eq
  • GShard (an MoE model for machine translation): 24 MWh, 4.3 t CO2eq
  • Switch Transformer: 179 MWh, 59 t CO2eq
  • GPT-3: 1287 MWh, 552 t CO2eq

Reference

[1] Stanford CS324 course: https://stanford-cs324.github.io/winter2022/lectures/introduction/#a-brief-history
[2] CS224N lecture notes on language models
[3] Language Models are Few-Shot Learners. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. Henighan, R. Child, A. Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei. NeurIPS 2020.
[4] Challenges in Detoxifying Language Models. Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John F. J. Mellor, Lisa Anne Hendricks, Kirsty Anderson, P. Kohli, Ben Coppin, Po-Sen Huang. EMNLP 2021.
[5] Scaling Language Models: Methods, Analysis & Insights from Training Gopher
[6] CommonCrawl
[7] OpenWebText. Similar to WebText; used to train GPT-2.
[8] An Empirical Exploration in Quality Filtering of Text Data. Leo Gao. 2021.
[9] Deduplicating Training Data Makes Language Models Better. Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, D. Eck, Chris Callison-Burch, Nicholas Carlini. 2021.
[10] Foundation models report (legality section)
[11] A Survey of Large Language Models: http://arxiv.org/abs/2303.18223
[12] Attention is All you Need. Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. NIPS 2017.
[13] CS224N slides on Transformers
[14] Rethinking Attention with Performers. K. Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamás Sarlós, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy J. Colwell, Adrian Weller. ICLR 2020. Introduces Performers.
[15] Efficient Transformers: A Survey. Yi Tay, M. Dehghani, Dara Bahri, Donald Metzler. 2020.
[16] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. NAACL 2019. Introduces BERT from Google.
[17] RoBERTa: A Robustly Optimized BERT Pretraining Approach. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, Veselin Stoyanov. 2019. Introduces RoBERTa from Facebook.
[18] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. M. Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, Luke Zettlemoyer. ACL 2019. Introduces BART from Facebook.
[19] Language Models are Few-Shot Learners. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. Henighan, R. Child, A. Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei. NeurIPS 2020. Introduces GPT-3 from OpenAI.
[20] Fixing Weight Decay Regularization in Adam. I. Loshchilov, F. Hutter. 2017. Introduces AdamW.
[21] Mixed Precision Training
[22] An accessible explanation of tokenizers
[23] Hugging Face documentation: Summary of the tokenizers
[24] Microsoft guidance: https://github.com/guidance-ai/guidance#token-healing-notebook
[25] Understanding Byte-Pair Encoding (BPE), WordPiece, and Unigram
[26] [Zhang Yue, Westlake University | Natural Language Processing online course, Chapter 16, Section 4] BPE (Byte-Pair Encoding)
[27] [LLM Tokenizer series] How to scientifically train an LLM tokenizer
[28] A problem (and countermeasures) with very large vocabularies in continuation tasks. Su Jianlin. (For example, once an LLM uses a very large vocabulary, "Baiyun", "Baiyun Mountain", and "Baiyun Airport" may all be independent tokens; after a user types "Guangzhou's Baiyun", the model can hardly continue with "Guangzhou's Baiyun Airport" or "Guangzhou's Baiyun Mountain".)
[29] A brief exploration of random tokenization: from Viterbi decoding to Viterbi sampling. Su Jianlin
[30] Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates: https://arxiv.org/abs/1804.10959
[31] Common distributed parallel strategies. OneFlow

Attached: time schedule

| Task | Deadline | Notes |
|---|---|---|
| task1: Introduction | Mon 9.11 | Finished |
| task2: Capabilities of large models | Tue 9.12 | Finished |
| task3: Harmfulness of large models | Wed 9.13 | Finished |
| task4: Data of large models | Thu 9.14 | Finished |
| task5: Legal issues of large models | Fri 9.15 | Finished |
| task6: Model architecture | Sat 9.16 | Finished |
| task7: Model training | Sun 9.17 | Finished |
| task8: Distributed training | Mon 9.18 | |
| task9: New model architectures | Tue 9.19 | |
| task10: Adaptation of large models | Wed 9.20 | |
| task11: Environmental impact of large models | Thu 9.21 | Finished |

Source: blog.csdn.net/qq_35812205/article/details/132820008