Calculations in the GPT model

calculation steps

Model framework

1. Input
2. Embedding
3. Multi-layer transformer blocks (12 layers)
4. Get two output results
5. Calculate the loss
6. Backpropagation
7. Update the parameters
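
The last three steps (calculate loss, backpropagation, update parameters) follow the usual PyTorch training pattern. The sketch below only illustrates that pattern and is not the article's training loop: `toy_model` is a hypothetical stand-in for the GPT model, the data is random, and the optimizer (Adam) and learning rate are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the GPT model: maps 8 features to 4 class logits.
toy_model = nn.Linear(8, 4)
optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-2)  # assumed optimizer
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 8)               # random inputs, batch of 16
target = torch.randint(0, 4, (16,))  # random class labels

weight_before = toy_model.weight.detach().clone()

logits = toy_model(x)                # forward pass
loss = loss_fn(logits, target)       # calculate loss
optimizer.zero_grad()                # clear gradients left from any previous step
loss.backward()                      # backpropagation
optimizer.step()                     # update parameters

# The weights should differ from their values before the step.
weight_changed = not torch.equal(weight_before, toy_model.weight.detach())
```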

The rest of this article mainly covers step 2 (Embedding) and step 3 (the transformer blocks) from the list above.

Embedding

The Embedding layer can be viewed as a fully connected layer that takes a one-hot vector as input and whose number of output nodes equals the word vector dimension. The weight matrix of this fully connected layer is the "word vector table". The layer transforms the dimension of the text input from the vocabulary size to the word vector dimension.
The embedding operation (text embedding here) is actually a table lookup. Multiplying a one-hot vector by a matrix just selects one row of that matrix, so the layer performs the lookup directly instead of carrying out the matrix multiplication, which greatly reduces the amount of computation. To emphasize: the saving does not come from using word vectors, but from replacing the one-hot matrix multiplication with a table lookup.
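
The equivalence between the lookup and the one-hot multiplication can be checked directly. The following is a minimal sketch with illustrative sizes (`n_vocab = 10`, `model_dim = 4` are assumptions, not values from the article):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

n_vocab, model_dim = 10, 4             # illustrative sizes
emb = nn.Embedding(n_vocab, model_dim)

token_ids = torch.tensor([3, 7, 3])    # three token ids

# Path 1: table lookup, picks rows 3, 7 and 3 of the weight table.
looked_up = emb(token_ids)                                    # [3, model_dim]

# Path 2: build one-hot rows and multiply them by the same table.
one_hot = F.one_hot(token_ids, num_classes=n_vocab).float()   # [3, n_vocab]
multiplied = one_hot @ emb.weight                             # [3, model_dim]

same_result = torch.allclose(looked_up, multiplied)
```

Both paths give the same vectors, but the lookup reads three rows while the multiplication touches every entry of the table.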

code part

def input_emb(self, seqs, segs):
    # device = next(self.parameters()).device
    # self.position_emb = self.position_emb.to(device)
    # sum of the word, segment and position embeddings
    return self.word_emb(seqs) + self.segment_emb(segs) + self.position_emb

The input embedding is the sum of the text embeddings (word and segment) and the position embedding.

text embed

self.word_emb = nn.Embedding(n_vocab,model_dim)
self.word_emb.weight.data.normal_(0,0.1)

self.segment_emb = nn.Embedding(num_embeddings= max_seg, embedding_dim=model_dim)
self.segment_emb.weight.data.normal_(0,0.1)

The text embed part calls nn.Embedding to build the word and segment vector tables. Looking up an id in a table returns a model_dim-dimensional vector, which is a reduction from the one-hot dimension (n_vocab for words).

position embed

self.position_emb = torch.empty(1,max_len,model_dim)
nn.init.kaiming_normal_(self.position_emb,mode='fan_out', nonlinearity='relu')
self.position_emb = nn.Parameter(self.position_emb)

The position embed creates a tensor of shape [1, max_len, model_dim], where max_len is the maximum input sequence length, and wraps it in nn.Parameter so that the position information is learned during training.
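
Putting the pieces together, the embedding step could look like the sketch below. The class name `InputEmbedding` is hypothetical, the sizes in the usage lines are illustrative, and slicing the position table to the current sequence length is an added assumption (the original method adds the full table, which requires the input length to equal max_len).

```python
import torch
import torch.nn as nn


class InputEmbedding(nn.Module):  # hypothetical wrapper, not from the article
    def __init__(self, n_vocab, max_seg, max_len, model_dim):
        super().__init__()
        self.word_emb = nn.Embedding(n_vocab, model_dim)
        self.segment_emb = nn.Embedding(max_seg, model_dim)
        nn.init.normal_(self.word_emb.weight, 0.0, 0.1)
        nn.init.normal_(self.segment_emb.weight, 0.0, 0.1)
        # Learned position table, broadcast over the batch dimension.
        pos = torch.empty(1, max_len, model_dim)
        nn.init.kaiming_normal_(pos, mode="fan_out", nonlinearity="relu")
        self.position_emb = nn.Parameter(pos)

    def forward(self, seqs, segs):
        # Assumption: slice the table so inputs shorter than max_len also work.
        step = seqs.size(1)
        return self.word_emb(seqs) + self.segment_emb(segs) + self.position_emb[:, :step]


emb = InputEmbedding(n_vocab=100, max_seg=3, max_len=12, model_dim=16)
seqs = torch.randint(0, 100, (2, 8))   # [n, step] token ids
segs = torch.randint(0, 3, (2, 8))     # [n, step] segment ids
out = emb(seqs, segs)                  # [n, step, model_dim]
```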

Transformer's block layer

As the upper right of the figure shows, the block used in GPT is similar to the Decoder block of the Transformer. Each block contains two sublayers:

Sublayer1: Masked Multi-Head Attention (mask multi-head attention layer)
Sublayer2: Feed Forward Network (FFN layer)

Each sublayer is followed by a residual connection and a normalization operation.

sublayer1: mask multi-head attention layer

Input: q, k, v, mask
Compute attention (see the left side of the figure below):

Linear (matrix multiplication)
Scaled Dot-Product Attention
Concat (the result of multiple attentions)
Linear (matrix multiplication)

Residual connection and normalization: dropout → residual connection → layer normalization
The left side of the figure above shows the computation of the whole attention layer.

matrix multiplication

Multiply each of the inputs Q, K and V by its own learned weight matrix (a Linear layer) to get the new Q, K and V.
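
As a rough illustration, the projections and the split into heads can be sketched as follows. The sizes are illustrative, and `split_heads` is a hypothetical helper written for this example (the article calls a method of that name but does not show it); feeding the same tensor `x` as q, k and v is the usual self-attention setup and is assumed here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n, step, model_dim, n_head = 2, 5, 16, 4   # illustrative sizes
head_dim = model_dim // n_head

# One learned weight matrix per input.
wq = nn.Linear(model_dim, n_head * head_dim)
wk = nn.Linear(model_dim, n_head * head_dim)
wv = nn.Linear(model_dim, n_head * head_dim)


def split_heads(x, n_head):
    # Hypothetical helper: [n, step, n_head * head_dim] -> [n, n_head, step, head_dim]
    n, step, width = x.shape
    return x.reshape(n, step, n_head, width // n_head).permute(0, 2, 1, 3)


x = torch.randn(n, step, model_dim)   # assumed self-attention: q, k and v are all x
query = split_heads(wq(x), n_head)
key = split_heads(wk(x), n_head)
value = split_heads(wv(x), n_head)
```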

Scaled Dot-Product Attention

The name of this operation looks long, but it is simply dot product + scaling + mask (optional) + Softmax + dot product, which together compute the attention output.
Step 1: Take the dot product of Q with the transpose of K to compute the similarity between Q and K.
Step 2: Scale the result by dividing it by a scale factor (in the standard formulation, the square root of the key dimension).
Step 3: Mask out part of the values in the matrix (optional in general, but used in GPT).
Step 4: Apply Softmax to convert the scores into a probability for each token.
Step 5: Multiply the Softmax output by V, so that each value is weighted by its probability, to obtain the attention output.
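
The five steps map almost line for line onto code. The function below is a minimal sketch, assuming inputs already split into heads with shape [n, n_head, step, head_dim], a boolean mask where True means "hide this position", and the usual scale factor of the square root of the head dimension; it is not the article's exact implementation.

```python
import math
import torch


def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: [n, n_head, step, head_dim]; mask: True where attention is blocked.
    head_dim = q.size(-1)
    score = torch.matmul(q, k.transpose(-2, -1))        # step 1: similarity of Q and K
    score = score / math.sqrt(head_dim)                 # step 2: scaling
    if mask is not None:
        score = score.masked_fill(mask, float("-inf"))  # step 3: mask (optional)
    attention = torch.softmax(score, dim=-1)            # step 4: probabilities
    context = torch.matmul(attention, v)                # step 5: weighted sum of V
    return context, attention


torch.manual_seed(0)
q = torch.randn(2, 4, 5, 8)
k = torch.randn(2, 4, 5, 8)
v = torch.randn(2, 4, 5, 8)
context, attention = scaled_dot_product_attention(q, k, v)
```

Each row of `attention` sums to 1, so every output vector is a weighted average of the rows of V.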

Supplement:
Mask operation: masked_fill_(mask, value)
The mask operation fills the elements of the tensor at positions where the mask is True (value 1) with value. The shape of the mask must be broadcastable to the shape of the tensor being filled. (Here -inf is used as the fill value, so those positions become 0 after the softmax, which is equivalent to not seeing the following words.)
The mask operation in the transformer

Visualization of the matrix after masking:
Intuitively, each word can only see itself and the words before it. The goal is to predict the next word, so if the model could already see it there would be nothing left to predict.
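
A small example makes the masked matrix concrete. This is a sketch under stated assumptions: the mask is built with `torch.triu` (one common way to get an upper-triangular mask, not necessarily the one the article uses) and the similarity scores are all zeros so the result is easy to read.

```python
import torch

step = 4
# True above the diagonal marks the "future" positions that must be hidden.
mask = torch.triu(torch.ones(step, step), diagonal=1).bool()

score = torch.zeros(step, step)             # dummy similarity scores
score.masked_fill_(mask, float("-inf"))     # in-place fill of the masked positions
attention = torch.softmax(score, dim=-1)

# attention is lower-triangular:
# row 0 -> [1.00, 0.00, 0.00, 0.00]
# row 1 -> [0.50, 0.50, 0.00, 0.00]
# row 3 -> [0.25, 0.25, 0.25, 0.25]
```

Row i spreads its weight over positions 0 to i only, which is the "can only look backwards" behaviour described above.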

Concat operation

Combining the results of the multiple attention heads is just a rearrangement of the tensor: a permute followed by a reshape, plus a dimensionality reduction if the specific case needs one. (See the red box in the figure below.)

context = torch.matmul(self.attention,v)    # [n, num_head, step, head_dim]
context = context.permute(0,2,1,3)          # [n, step, num_head, head_dim]
context = context.reshape((context.shape[0], context.shape[1],-1))  
return context  # [n, step, model_dim]
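
To see why permute + reshape counts as a "concat", the snippet above can be compared with an explicit concatenation of the per-head outputs. This is a minimal check with illustrative sizes and random data:

```python
import torch

torch.manual_seed(0)
n, num_head, step, head_dim = 2, 4, 5, 8
context = torch.randn(n, num_head, step, head_dim)

# Route 1: permute + reshape, as in the snippet above.
merged = context.permute(0, 2, 1, 3).reshape(n, step, num_head * head_dim)

# Route 2: concatenate the output of each head along the last dimension.
concatenated = torch.cat([context[:, h] for h in range(num_head)], dim=-1)

same_result = torch.equal(merged, concatenated)
```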

matrix multiplication

A Linear layer that linearly transforms the attention results

Code for the entire mask multi-head attention layer

def forward(self, q, k, v, mask, training):
    # keep the input for the residual connection
    residual = q
    dim_per_head = self.head_dim
    num_heads = self.n_head
    batch_size = q.size(0)

    # 1. linear projection
    key = self.wk(k)    # [n, step, num_heads * head_dim]
    value = self.wv(v)  # [n, step, num_heads * head_dim]
    query = self.wq(q)  # [n, step, num_heads * head_dim]

    # split by head
    query = self.split_heads(query)  # [n, n_head, q_step, h_dim]
    key = self.split_heads(key)
    value = self.split_heads(value)  # [n, h, step, h_dim]

    # 2. and 3. Scaled Dot-Product Attention computes the attention, then Concat joins the heads
    context = self.scaled_dot_product_attention(query, key, value, mask)  # [n, q_step, h*dv]

    # 4. Linear layer: linear transformation of the attention result
    o = self.o_dense(context)  # [n, step, dim]

    # residual connection and normalization
    o = self.o_drop(o)
    o = self.layer_norm(residual + o)
    return o

Note that the residual connection and normalization operations are performed at the end, including:

Dropout operation
residual connection
layer normalization operation
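
For readers who want to run the whole sublayer, here is a self-contained sketch. The class name `MaskedMultiHeadAttention`, the constructor arguments and the bodies of `split_heads` and `scaled_dot_product_attention` are assumptions filled in for illustration; only the attribute names (`wq`, `wk`, `wv`, `o_dense`, `o_drop`, `layer_norm`) follow the code above.

```python
import math
import torch
import torch.nn as nn


class MaskedMultiHeadAttention(nn.Module):  # hypothetical name
    def __init__(self, n_head, model_dim, dropout=0.0):
        super().__init__()
        assert model_dim % n_head == 0
        self.n_head = n_head
        self.head_dim = model_dim // n_head
        self.wq = nn.Linear(model_dim, model_dim)
        self.wk = nn.Linear(model_dim, model_dim)
        self.wv = nn.Linear(model_dim, model_dim)
        self.o_dense = nn.Linear(model_dim, model_dim)
        self.o_drop = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(model_dim)

    def split_heads(self, x):
        # [n, step, model_dim] -> [n, n_head, step, head_dim]
        n, step, _ = x.shape
        return x.reshape(n, step, self.n_head, self.head_dim).permute(0, 2, 1, 3)

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        score = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        if mask is not None:
            score = score.masked_fill(mask, float("-inf"))  # True = hidden
        attention = torch.softmax(score, dim=-1)
        context = torch.matmul(attention, v)      # [n, n_head, step, head_dim]
        context = context.permute(0, 2, 1, 3)     # [n, step, n_head, head_dim]
        return context.reshape(context.shape[0], context.shape[1], -1)

    def forward(self, q, k, v, mask=None):
        residual = q
        query = self.split_heads(self.wq(q))
        key = self.split_heads(self.wk(k))
        value = self.split_heads(self.wv(v))
        context = self.scaled_dot_product_attention(query, key, value, mask)
        o = self.o_drop(self.o_dense(context))    # linear layer, then dropout
        return self.layer_norm(residual + o)      # residual + layer normalization


torch.manual_seed(0)
layer = MaskedMultiHeadAttention(n_head=4, model_dim=16).eval()
x = torch.randn(2, 5, 16)
mask = torch.triu(torch.ones(5, 5), diagonal=1).bool()  # broadcast over batch and heads
out = layer(x, x, x, mask)                               # [2, 5, 16]

# Causality check: changing the last token must not change earlier outputs.
x_changed = x.clone()
x_changed[:, -1] += 1.0
out_changed = layer(x_changed, x_changed, x_changed, mask)
earlier_unchanged = torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-5)
```

The final check confirms the effect of the mask: outputs for earlier positions are identical whether or not the last token is modified.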

sublayer2: FFN feedforward network

Mainly a multi-layer perceptron structure

Linear layer (matrix multiplication)
relu function activation
Linear layer (matrix multiplication)

Afterwards, the residual connection and normalization operations are applied again, including:

Dropout operation
residual connection
layer normalization operation

import torch
import torch.nn as nn


class PositionWiseFFN(nn.Module):
    def __init__(self, model_dim, dropout=0.0):
        super().__init__()
        dff = model_dim * 4  # hidden width of the feed-forward layer
        self.l = nn.Linear(model_dim, dff)
        self.o = nn.Linear(dff, model_dim)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(model_dim)

    def forward(self, x):
        # 1. and 2. linear layer + relu
        o = torch.relu(self.l(x))
        # 3. linear layer
        o = self.o(o)
        # 4. dropout
        o = self.dropout(o)
        # 5. and 6. residual connection + layer normalization
        o = self.layer_norm(x + o)
        return o  # [n, step, dim]
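
The "position-wise" in the class name means the same small network is applied to every position independently, with no mixing across the step dimension. The sketch below checks this with a compact stand-in for the sublayer (built from `nn.Sequential`, dropout omitted; the sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model_dim = 8

# Compact stand-in for the FFN sublayer (dropout omitted, i.e. rate 0).
ffn = nn.Sequential(
    nn.Linear(model_dim, model_dim * 4),
    nn.ReLU(),
    nn.Linear(model_dim * 4, model_dim),
)
layer_norm = nn.LayerNorm(model_dim)


def ffn_sublayer(x):
    return layer_norm(x + ffn(x))   # residual connection + layer normalization


x = torch.randn(2, 5, model_dim)    # [n, step, dim]
out = ffn_sublayer(x)

# Apply the sublayer to one position on its own and compare with the full run.
single = ffn_sublayer(x[:, 2:3])    # [n, 1, dim]
position_wise = torch.allclose(out[:, 2:3], single, atol=1e-5)
```

Because nothing in the FFN looks at neighbouring positions, all interaction between tokens happens in the attention sublayer.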