Transformers: From Scratch (Part 1 of 2)

1. Description

        In our daily life, whether you are a data scientist or not, you are probably using Transformer models in one way or another. For example, if you use ChatGPT, GPT-4, or any GPT model, the box that answers your questions is built on a Transformer. If you are a data scientist or data analyst and you work on text classification, token classification, question answering, text-to-text generation, or any similar task, you are using a Transformer model. We all study the theory for interviews, but have you ever wondered how to build a Transformer model from scratch?

        Starting with the transformer, let's take a look at the entire architecture:

Figure 1. Transformer model architecture

Already scratching your head? Let's break it down so we can understand it better.

  • The architecture consists of two stacked blocks (each repeated Nx times) side by side: the encoder on the left and the decoder on the right.

2. Start with the encoder 

Figure 2. Encoder

        There are two sub-layers inside the encoder: self-attention and a feed-forward network. We just need to write these two blocks in Python.

        To understand it better, let's use a tokenizer so that we can walk through the model with a concrete example.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = 'I love data science.'
print(tokenizer(text, add_special_tokens=False, return_tensors='pt'))
inputs = tokenizer(text, add_special_tokens=False, return_tensors='pt')
# The above code will produce the following output
# {'input_ids': tensor([[1045, 2293, 2951, 2671, 1012]]), 
# 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 
# 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

        Now, in the above cell, we can see that the tokenizer has tokenized the sentence "I love data science." (input_ids). Note that I'm not using any special tokens such as [CLS] or [SEP] here.
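        As a quick side check (my own addition, not part of the original walkthrough), you can re-run the tokenizer with special tokens enabled to see the markers BERT normally adds around a sentence:

# With add_special_tokens=True (the default), BERT wraps the sentence
# in [CLS] ... [SEP]. convert_ids_to_tokens maps the ids back to strings.
with_special = tokenizer(text, return_tensors='pt')
print(tokenizer.convert_ids_to_tokens(with_special.input_ids[0]))
# expected output (something like):
# ['[CLS]', 'i', 'love', 'data', 'science', '.', '[SEP]']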

3. Configure BERT

        After tokenizing our text, let's get the configuration of "bert-base-uncased" so that we can build a model with the same shape as the BERT model. Sounds interesting? Let's take an in-depth look at BERT's configuration.

from transformers import AutoConfig
config = AutoConfig.from_pretrained('bert-base-uncased')
print(config)
# The above cell should output as:
BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.29.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

        As we can see, the BERT model has the above configuration.
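        The config object exposes these values as plain attributes, which is how we will read them in the code below (a small check of my own, assuming the same config object as above):

# Attribute access on the config; these are the numbers we will reuse
# when sizing the embedding, attention, and feed-forward layers.
print(config.vocab_size, config.hidden_size, config.num_attention_heads, config.intermediate_size)
# 30522 768 12 3072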

4. Embedding layer

        Next, we need to create some dense embeddings. Dense in this context means that each entry in the embedding contains a non-zero value. In contrast, one-hot encoding is sparse because all entries except one are zero. In PyTorch, we can do this with torch.nn.Embedding, a layer that acts as a lookup table for each input ID:

from torch import nn
token_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
print(token_embeddings)
# output:
# Embedding(30522, 768)

        Now that we have our lookup table, we can generate embeddings by feeding in the input IDs:

inputs_embeds = token_embeddings(inputs.input_ids)
print(inputs_embeds.size())
# output:
# torch.Size([1, 5, 768])

        This gives us a tensor of shape [batch_size, seq_len, hidden_dim]. Now let's compute the attention scores.

        To calculate attention weights, there are four steps:

  • Project each token embedding into three vectors called query, key, and value.
  • Compute the attention scores. We use a similarity function to determine how strongly each query relates to each key vector. As the name suggests, the similarity function for scaled dot-product attention is the dot product, computed efficiently via matrix multiplication of the embeddings. Similar queries and keys will have a large dot product, while those that don't have much in common will have little overlap. The output of this step is called the attention scores, and for a sequence with n input tokens there is a corresponding n × n matrix of attention scores.
  • Compute the attention weights. Dot products can yield arbitrarily large numbers, which can destabilize training. To address this, the attention scores are first divided by a scaling factor (the square root of the key dimension) to normalize their variance, and then passed through a softmax to ensure that the weights for each token sum to 1. The resulting n × n matrix contains all the attention weights.
  • Update the token embeddings. Once the attention weights are computed, we multiply them by the value vectors to obtain an updated representation of each embedding.
import torch
import torch.nn.functional as F
from math import sqrt

query = key = value = inputs_embeds

def scaled_dot_product_attention(query, key, value):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)  # torch.bmm is a batched matrix-matrix multiplication:
                                                                  # here it computes the query-key dot products for every position.
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)
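
        As a quick sanity check (my own addition), we can run the function on the embeddings we created earlier; since query, key, and value are all the same tensor, this is plain self-attention:

# The attention weights have shape [1, 5, 5] (one weight per token pair),
# and the returned tensor has the same shape as the input embeddings.
attn_out = scaled_dot_product_attention(query, key, value)
print(attn_out.size())
# torch.Size([1, 5, 768])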

        This is how we calculate self-attention. Seems easy, right? So far so good? Now let's build multi-head attention.

        Well, you might be thinking: if we already have self-attention, why not stop there? Why use multi-head attention? The reason is that the softmax of a single head tends to focus mainly on one aspect of similarity. Having multiple heads allows the model to focus on several aspects at once. For example, one head can focus on subject-verb interactions while another finds nearby adjectives. Obviously, we don't handcraft these relationships into the model; they are learned entirely from the data. If you are familiar with computer vision models, you may see a similarity to the filters in a convolutional neural network, where one filter can be responsible for detecting faces and another can spot the wheels of cars in an image.

Figure 3. Multi-head attention layer

        Figure 3 clearly illustrates how we will code the multi-head attention layer. Let's start with that:

class AttentionHead(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)
    
    def forward(self, hidden_state):
        attn_outputs = scaled_dot_product_attention(self.q(hidden_state), self.k(hidden_state), self.v(hidden_state))
        return attn_outputs
 

        Here we have initialized three separate linear layers that apply matrix multiplication to the embedding vectors to produce tensors of shape [batch_size, seq_len, head_dim], where head_dim is the number of dimensions we project into. While head_dim does not have to be smaller than the token's embedding dimensionality (embed_dim), in practice it is chosen so that embed_dim is a multiple of head_dim and the computation per head stays constant. For example, BERT has 12 attention heads, so the dimension of each head is 768 / 12 = 64.
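        To make the shapes concrete (a small check of my own, not from the original), we can instantiate a single head with BERT's sizes and confirm that it projects the 768-dimensional embeddings down to 64 dimensions:

head_dim = config.hidden_size // config.num_attention_heads  # 768 // 12 = 64
head = AttentionHead(config.hidden_size, head_dim)
print(head(inputs_embeds).size())
# torch.Size([1, 5, 64])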

5. Multi-head attention 

        This is how we create a single head attention layer. Now let's make the multi-head attention layer:

class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList([AttentionHead(embed_dim, head_dim) for _ in range(num_heads)])
        self.output_linear = nn.Linear(embed_dim, embed_dim)
        
    def forward(self, hidden_state):
        x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
        x = self.output_linear(x)
        return x

        Let's examine the code so far.

multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(inputs_embeds)
print(attn_output.size())
# output
# torch.Size([1, 5, 768])

        If we get similar output, good job!

        Long story short, this module takes an input tensor (hidden_state), applies multiple attention heads to it independently, concatenates their outputs, and passes the result through a final linear layer. Each attention head learns to attend to different parts or features of the data. Make sense? If so, that's it: you've created your own multi-head attention layer, all by yourself. Marvelous!

6. Feedforward layer

        Now, let's build the next block of the encoder, the feed-forward layer. The feed-forward sublayers in the encoder and decoder are just a simple two-layer fully connected neural network, but with a twist: instead of processing the entire sequence of embeddings as a single vector, it processes each embedding independently. For this reason, this layer is often referred to as a position-wise feed-forward layer.

class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, x):
        x = self.linear_1(x)
        x = self.gelu(x)
        x = self.linear_2(x)
        x = self.dropout(x)
        return x

        Note that a feed-forward layer such as nn.Linear is usually applied to a tensor of shape (batch_size, input_dim), where it acts on each element of the batch dimension independently. This is actually true for any dimension except the last one, so when we pass a tensor of shape (batch_size, seq_len, hidden_dim), the layer is applied to all token embeddings of the batch and sequence independently, which is exactly what we want.
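        A minimal illustration of this behaviour (my own example, with made-up sizes): nn.Linear only transforms the last dimension, leaving the batch and sequence dimensions untouched.

linear = nn.Linear(768, 3072)
x = torch.rand(1, 5, 768)        # (batch_size, seq_len, hidden_dim)
print(linear(x).size())          # only the last dimension changes
# torch.Size([1, 5, 3072])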

7. Checking the output

        Let's check that the code we wrote produces the correct output.

feed_forward = FeedForward(config)
ff_outputs = feed_forward(attn_output)
print(ff_outputs.size())
# output
# torch.Size([1, 5, 768])

        If you get the same output, you're on the right track.

        This post would become too long to read in one sitting, so I've split it into two parts. Let me know in the comments, or give me a follow, if you find this interesting. I'll be posting part two soon.

8. Postscript

        Note: In the second part, we will look at the implementation of layer normalization and positional embeddings, and at how to add a final layer on top of the model so it can perform different tasks such as text classification and token classification. Then we'll look at the decoder. Hope you are as excited as I am. Until then, happy coding!

