Want to better understand large model architectures? Start by counting their parameters

Editor's note: The most efficient way to understand a new machine learning architecture (and any other new technology) is to implement it from scratch. However, there is an easier way - counting the number of parameters.

By counting the number of parameters, the reader can better understand the model architecture and check its solution for undiscovered bugs.

This article provides exact formulas for the parameter counts of the Transformer model, as well as less precise shortcut versions, enabling readers to quickly estimate the number of parameters in any Transformer-based model.

The following is the translation, Enjoy!

Author | Dmytro Nikolaiev (Dimid)

Compiled by | Yue Yang

The most effective way to understand a new machine learning architecture (and any other new technology) is to implement it from scratch. While this can be complex, time-consuming, and sometimes nearly impossible to achieve, it's the best way to help us understand every technical detail. For example, without similar computing resources or data, we would not be able to ensure that there are no undiscovered bugs in our solutions.

However, there is an easier way - counting the number of parameters. This is a bit more involved than just reading the paper, but it allows us to dig deeper and check that we fully understand the building blocks of the new architecture (in this case, the Transformer's Encoder and Decoder blocks).

We can think about this with the following diagram, which shows three ways to understand a new ML architecture - the size of the circle indicates the level of understanding of the architecture.

[Figure: three ways to understand a new ML architecture, with circle size indicating the level of understanding]

This article focuses on the well-known Transformer architecture and considers how to count the number of parameters in the PyTorch TransformerEncoderLayer [1] and TransformerDecoderLayer [2] classes. Therefore, we need to make sure that there is no mystery about what parts the architecture consists of.

TLDRs (Summary)

(This article is relatively long. If you don't want to dive into the details or have limited time, you can skip straight to the summary.)

All the parameter-counting formulas are summarized in the "Conclusion" section.

This article provides not only exact formulas for calculating the number of parameters, but also less precise approximate versions that let you quickly estimate the number of parameters in any Transformer-based model.

01 Transformer architecture

The famous Transformer architecture was proposed in the 2017 paper "Attention Is All You Need" [3], and it has become the standard architecture for natural language processing and computer vision tasks.

By early 2023, diffusion models [4] had become extremely popular thanks to the success of text-to-image generation models [5]. Perhaps diffusion models will soon become state-of-the-art for a wide range of tasks, just as Transformers did with LSTMs and CNNs. But let's take a look at the Transformer first...

This article does not try to explain the Transformer architecture, since there are already plenty of good articles that do. It simply looks at it from a different angle and clarifies some details. So if you're looking for more resources to learn about this architecture, I recommend a few below; otherwise, you can just read on.

1.1 Learn more about Transformer resources

If you are looking for a more detailed overview of the Transformer architecture, take a look at the following materials (note that there is a lot of good content on the Internet; these are simply my personal favorites):

  • First, you can read the original paper [3]. Reading the paper as your first introduction to the Transformer may not be the best way to go, but it's not as complicated as it seems. Try Explainpaper [6] to help you read this paper or others (it's an AI-based tool that explains text you highlight with the mouse).
  • "The Illustrated Transformer" [7] by Jay Alammar. If reading articles is not your thing, you can watch the YouTube video by the same author [8].
  • The excellent Tensor2Tensor talk [9] by Lukasz Kaiser from the Google Brain team.
  • If you want to jump right in and build applications using various Transformer models, check out the Hugging Face course [10].

1.2 Original Transformer

First, let's review the basics of Transformer.

The Transformer's architecture consists of two components: the encoder (on the left) and the decoder (on the right). The encoder takes an input token sequence and generates a sequence of hidden states, and the decoder takes this hidden state sequence and generates an output token sequence.

[Figure: Transformer architecture diagram, from https://arxiv.org/pdf/1706.03762.pdf]

Both the encoder and decoder consist of a stack of identical layers. For the encoder, each layer consists of a multi-head attention block (1; the numbers here and below refer to the numbered components in the image below) and a feed-forward neural network (2), together with layer normalizations (3) and skip connections.

The decoder is similar to the encoder, but in addition to the first multi-head attention (4) (which is masked in machine translation tasks so the decoder cannot cheat by looking at future tokens) and the feed-forward network (5), it also has a second multi-head attention (6). This second attention allows the decoder to use the context provided by the encoder when generating output. Like the encoder, the decoder also has layer normalizations (7) and skip connections.

[Figure: Transformer architecture diagram with components numbered, from https://arxiv.org/pdf/1706.03762.pdf]

I will not treat the input embedding layer (with positional encoding) or the final output layer (linear + softmax) as Transformer components, and will focus only on the encoder and decoder blocks. This is because those components are specific to particular tasks and embedding approaches, whereas the encoder and decoder stacks are what later architectures build on.

Examples of such architectures include encoder-based BERT-like models (BERT, RoBERTa, ALBERT, DeBERTa, etc.), decoder-based GPT-like models (GPT, GPT-2, GPT-3, ChatGPT), and models built on the full encoder-decoder framework (T5, BART, etc.).

Although we have labeled seven components in this diagram, we can see that there are only three unique ones:

  • Multi-head attention;
  • Feed-forward network;
  • Layer normalization.

[Figure: The three Transformer building blocks, from https://arxiv.org/pdf/1706.03762.pdf]

02 Transformer building blocks

Let's think about the internal structure of each module and how many parameters it takes. In this section, we will also start using PyTorch [11] to verify our computational results.

To check the number of parameters of a given model block, I will use the following one-line function [12]:

import torch

# https://discuss.pytorch.org/t/how-do-i-check-the-number-of-parameters-of-a-model/4325/9
def count_parameters(model: torch.nn.Module) -> int:
    """Returns the number of learnable parameters of a PyTorch model"""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

Before we start, note that all the building blocks are normalized and use skip connections. This means that the shape of all inputs and outputs (more precisely, its last dimension, since the batch size and the number of tokens may vary) must be the same. For the original paper, this number (d_model) is 512.
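
As a quick sanity check, we can pass a random tensor through a PyTorch encoder layer and see that the output keeps the same shape as the input (a minimal sketch):

import torch
from torch import nn

d_model = 512
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)

x = torch.rand(10, 32, d_model)   # (sequence length, batch size, d_model)
print(encoder_layer(x).shape)     # torch.Size([10, 32, 512]) - same shape as the input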

2.1 Multi-head attention

The well-known attention mechanism is the key to the Transformer architecture. However, regardless of design motivations and technical details, it only involves a few matrix multiplications.

[Figure: Transformer multi-head attention, from https://arxiv.org/pdf/1706.03762.pdf]

After computing the attention of each head, we concatenate all the heads and pass the result through a linear layer (the W_O matrix). In turn, each head is scaled dot-product attention over the query, key, and value, each multiplied by its own matrix (W_Q, W_K, and W_V, respectively). These three matrices are different for each head, which is why the subscript i appears.

The shape of the final linear layer (W_O) is d_model to d_model. The remaining three matrices (W_Q, W_K and W_V) have the same shape: d_model to d_qkv.

Note that what I call d_qkv in the image above is denoted d_k or d_v in the original paper. I find this name more intuitive, because although these matrices could in principle have different shapes, they are almost always the same.

Also note that d_qkv = d_model / num_heads (denoted h in the paper). That's why d_model must be divisible by num_heads: to ensure the shapes work out when the heads are concatenated later.

You can test this yourself by checking the shapes of all intermediate stages in the image above (the correct shape is marked in the lower right corner).

Therefore, we need three smaller matrices for each head and one large final matrix. So how many parameters do we need (don't forget the biases)?

The number of parameters in the multi-head attention block, derived step by step:

3 * num_heads * (d_model * d_qkv + d_qkv) + (d_model * d_model + d_model)
  = 3 * (d_model^2 + d_model) + (d_model^2 + d_model)    (since num_heads * d_qkv = d_model)
  = 4 * (d_model^2 + d_model)
  ≈ 4 * d_model^2

I hope the formula isn't too onerous - I tried to make the derivation as clear as possible. Don't worry! Future formulas will be even shorter.

For the approximate number of parameters, we ignore 4 * d_model, which is negligible compared to 4 * d_model^2. Let's now test this with PyTorch.

from torch import nn

d_model = 512
n_heads = 8 # must be a divisor of `d_model`

multi_head_attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)
print(count_parameters(multi_head_attention)) # 1050624
print(4 * (d_model * d_model + d_model)) # 1050624

The numbers match, which means we're doing great!
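
We can also see where this number comes from by listing the module's named parameters. On recent PyTorch versions, the query, key, and value projections for all heads are stored stacked together in a single in_proj_weight tensor, with W_O living in out_proj (a quick sketch):

for name, p in multi_head_attention.named_parameters():
    print(name, tuple(p.shape), p.numel())

# in_proj_weight (1536, 512) 786432   <- W_Q, W_K, W_V for all heads, stacked together
# in_proj_bias (1536,) 1536
# out_proj.weight (512, 512) 262144   <- W_O
# out_proj.bias (512,) 512
# 786432 + 1536 + 262144 + 512 = 1050624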

2.2 Feedforward Network

The feed-forward network in the Transformer consists of two fully connected layers with a ReLU activation function in between. The inner part of the network is more expressive than the input and output (whose dimensionalities must be the same).

In the general case, it is MLP(d_model, d_ff) -> ReLU -> MLP(d_ff, d_model), and in the original paper d_ff = 2048.

[Figure: The feed-forward network description, from https://arxiv.org/pdf/1706.03762.pdf]

It wouldn't hurt to do a little bit of visualization.

[Figure: The feed-forward network in the Transformer]

Calculating the parameters is quite easy; the main thing is not to get confused.

The number of parameters in the feed-forward network:

(d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
  = 2 * d_model * d_ff + d_model + d_ff
  ≈ 2 * d_model * d_ff

We can describe such a simple network and check its number of parameters with the following code (note that the official PyTorch implementation also uses dropout, as we will see later in the encoder/decoder code; but as we know, the dropout layer has no trainable parameters, so I omit it here for simplicity):

from torch import nn

class TransformerFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(TransformerFeedForward, self).__init__()
        self.d_model = d_model
        self.d_ff = d_ff

        self.linear1 = nn.Linear(self.d_model, self.d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(self.d_ff, self.d_model)

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        x = self.linear2(x)
        return x

d_model = 512
d_ff = 2048

feed_forward = TransformerFeedForward(d_model, d_ff)
print(count_parameters(feed_forward)) # 2099712
print(2 * d_model * d_ff + d_model + d_ff) # 2099712
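
The per-layer breakdown can be checked the same way by listing the named parameters (a quick sketch; remember that PyTorch stores Linear weights as (out_features, in_features)):

for name, p in feed_forward.named_parameters():
    print(name, tuple(p.shape), p.numel())

# linear1.weight (2048, 512) 1048576
# linear1.bias (2048,) 2048
# linear2.weight (512, 2048) 1048576
# linear2.bias (512,) 512
# 1048576 + 2048 + 1048576 + 512 = 2099712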

The numbers match again, and only one component is left to cover.

2.3 Layer Normalization

The last building block of the Transformer architecture is layer normalization. Simply put, it is just a smart (i.e. learnable) normalization with scaling, which improves the stability of the training process.

[Figure: Layer normalization in the Transformer]

The trainable parameters here are two vectors gamma and beta, each of dimension d_model.

The number of parameters in the layer normalization module:

d_model (gamma) + d_model (beta) = 2 * d_model

Let's use code to test our hypothesis.

from torch import nn

d_model = 512

layer_normalization = nn.LayerNorm(d_model)
print(count_parameters(layer_normalization)) # 1024
print(d_model * 2) # 1024

Great! In the approximate formula this number can be neglected, since layer normalization has far fewer parameters than the feed-forward network or the multi-head attention block (even though this module appears several times).
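
As a side note, the two vectors can be seen directly among the module's parameters: PyTorch stores gamma as weight and beta as bias (a quick sketch):

for name, p in layer_normalization.named_parameters():
    print(name, tuple(p.shape))

# weight (512,)  <- gamma
# bias (512,)    <- beta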

03 Deriving the complete formula

Now that we have everything, we can calculate the parameters of the entire encoder/decoder module!

3.1 Encoder and decoder implemented in PyTorch

Recall that the encoder consists of a multi-head attention block, a feed-forward network, and two layer normalizations.

[Figure: Transformer encoder, from https://arxiv.org/pdf/1706.03762.pdf]

We can look at the details in the PyTorch code to verify that all the components are in place. The multi-head attention is marked in red (on the left), the feed-forward network in blue, and the layer normalizations in green (screenshot of the Python console in PyCharm).

[Figure: PyTorch TransformerEncoderLayer in the PyCharm Python console]
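
If you prefer code to screenshots, the same check can be done by listing the layer's submodules (a quick sketch; the exact names and ordering may vary slightly between PyTorch versions):

from torch import nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
for name, module in encoder_layer.named_children():
    print(name, '->', type(module).__name__)

# expect to see self_attn (MultiheadAttention), linear1 and linear2 (Linear),
# norm1 and norm2 (LayerNorm), plus a few parameter-free Dropout modules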

3.2 Final formula

Having confirmed this, we can write the following function to calculate the number of parameters. In fact, it's just three lines of code that could even be combined into one; the rest of the function is a docstring for clarification.

def transformer_count_params(d_model=512, d_ff=2048, encoder=True, approx=False):
    """
    Calculate the number of parameters in Transformer Encoder/Decoder.
    Formulas are the following:
        multi-head attention: 4*(d_model^2 + d_model)
            if approx=False, 4*d_model^2 otherwise
        feed-forward: 2*d_model*d_ff + d_model + d_ff
            if approx=False, 2*d_model*d_ff otherwise
        layer normalization: 2*d_model if approx=False, 0 otherwise
    Encoder block consists of:
        1 multi-head attention block,
        1 feed-forward net, and
        2 layer normalizations.
    Decoder block consists of:
        2 multi-head attention blocks,
        1 feed-forward net, and
        3 layer normalizations.
    :param d_model: (int) model dimensionality
    :param d_ff: (int) internal dimensionality of a feed-forward neural network
    :param encoder: (bool) if True, return the number of parameters of the Encoder,
        otherwise the Decoder
    :param approx: (bool) if True, result is approximate (see formulas)
    :return: (int) number of learnable parameters in Transformer Encoder/Decoder
    """
    attention = 4 * (d_model ** 2 + d_model) if not approx else 4 * d_model ** 2
    feed_forward = 2 * d_model * d_ff + d_model + d_ff if not approx else 2 * d_model * d_ff
    layer_norm = 2 * d_model if not approx else 0

    return attention + feed_forward + 2 * layer_norm \
        if encoder else 2 * attention + feed_forward + 3 * layer_norm

Now is the time to test it.

from torch import nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
print(count_parameters(encoder_layer))  # 3152384
print(transformer_count_params(d_model=512, d_ff=2048, encoder=True, approx=False))  # 3152384
print(transformer_count_params(d_model=512, d_ff=2048, encoder=True, approx=True))   # 3145728
# ~0.21% difference

decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
print(count_parameters(decoder_layer))  # 4204032
print(transformer_count_params(d_model=512, d_ff=2048, encoder=False, approx=False))  # 4204032
print(transformer_count_params(d_model=512, d_ff=2048, encoder=False, approx=True))   # 4194304
# ~0.23% difference

The exact formula is correct, which means we have correctly identified all the building blocks and broken them down into their constituent parts. Interestingly, since the approximate formula ignores relatively small values (thousands compared to millions), the error relative to the exact result is only about 0.2%! But there is a way to make these formulas even simpler.

The approximate number of parameters for the attention block is 4 * d_model^2. Since d_model is the main hyperparameter, this sounds easy enough to compute. But for the feed-forward network, we also need to know d_ff, because its formula is 2 * d_model * d_ff.

d_ff is a separate hyperparameter that would have to be remembered in the formula, so let's think about how to get rid of it. As we saw above, d_ff = 2048 when d_model = 512, i.e. d_ff = 4 * d_model.

For many Transformer models this assumption makes sense; it greatly simplifies the formula while still giving an approximate number of parameters. After all, nobody needs the exact number, just whether it is in the hundreds of thousands or in the tens of millions.

Approximate formulas for an encoder and a decoder layer (assuming d_ff = 4 * d_model):

Encoder layer ≈ 4 * d_model^2 + 2 * d_model * (4 * d_model) = 12 * d_model^2
Decoder layer ≈ 2 * 4 * d_model^2 + 2 * d_model * (4 * d_model) = 16 * d_model^2
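
A minimal sketch of this simplified estimate (the helper name transformer_count_params_simple is just for illustration), assuming d_ff = 4 * d_model and ignoring biases and layer normalizations, compared to the exact counts from earlier:

def transformer_count_params_simple(d_model=512, encoder=True):
    """Rough estimate assuming d_ff = 4 * d_model, ignoring biases and layer norms."""
    return 12 * d_model ** 2 if encoder else 16 * d_model ** 2

print(transformer_count_params_simple(512, encoder=True))   # 3145728 vs. exact 3152384
print(transformer_count_params_simple(512, encoder=False))  # 4194304 vs. exact 4204032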

To get an idea of the order of magnitude you're dealing with, you can also round the multiplier. That way you get roughly 10 * d_model^2 parameters per encoder/decoder layer.
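
For d_model = 512 this rounding still lands in the right ballpark:

print(10 * 512 ** 2)  # 2621440 - a few million parameters per layer,
                      # the same order of magnitude as the exact 3152384 / 4204032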

04 Conclusion

Here is a summary of all the formulas we derived today.

Summary of the formulas derived in this article:

  • Multi-head attention: 4 * (d_model^2 + d_model), approximately 4 * d_model^2
  • Feed-forward network: 2 * d_model * d_ff + d_model + d_ff, approximately 2 * d_model * d_ff
  • Layer normalization: 2 * d_model, negligible in the approximation
  • Encoder layer (1 attention block + 1 feed-forward net + 2 layer norms): 4 * d_model^2 + 2 * d_model * d_ff + 9 * d_model + d_ff exactly, approximately 12 * d_model^2 (with d_ff = 4 * d_model)
  • Decoder layer (2 attention blocks + 1 feed-forward net + 3 layer norms): 8 * d_model^2 + 2 * d_model * d_ff + 15 * d_model + d_ff exactly, approximately 16 * d_model^2 (with d_ff = 4 * d_model)
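
As a quick back-of-the-envelope check, we can apply these formulas to the original Transformer from the paper (6 encoder layers and 6 decoder layers, d_model = 512, d_ff = 2048), counting only the encoder and decoder stacks as this article does (i.e. without embeddings and the final output layer):

n_layers = 6
d_model, d_ff = 512, 2048

encoder_stack = n_layers * transformer_count_params(d_model, d_ff, encoder=True, approx=False)
decoder_stack = n_layers * transformer_count_params(d_model, d_ff, encoder=False, approx=False)

print(encoder_stack)                  # 18914304 (6 * 3152384)
print(decoder_stack)                  # 25224192 (6 * 4204032)
print(encoder_stack + decoder_stack)  # 44138496, roughly 44M parameters
# the shortcut gives 6 * 12 * d_model**2 + 6 * 16 * d_model**2 = 168 * d_model**2, also ~44M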

In this article we calculated the number of parameters in the Transformer encoder/decoder blocks, but of course, I don't suggest you count the parameters of every new model. I chose this approach because, when I started studying Transformers, I was surprised that I could not find such an article.

Although the number of parameters can give us an idea of the complexity of a model and the amount of data required to train it, it is just one way to gain a deeper understanding of an architecture. I want to encourage you to explore and experiment: read, implement, and run code with different hyperparameters, and so on. So keep learning and have fun with artificial intelligence!

END

References

1. https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html

2. https://pytorch.org/docs/stable/generated/torch.nn.TransformerDecoderLayer.html

3. https://arxiv.org/abs/1706.03762

4. https://techcrunch.com/2022/12/22/a-brief-history-of-diffusion-the-tech-at-the-heart-of-modern-image-generating-ai/

5. https://www.washingtonpost.com/technology/interactive/2022/ai-image-generator/

6. https://www.explainpaper.com/papers/attention

7. https://jalammar.github.io/illustrated-transformer/

8. https://youtu.be/-QH8fRhqFHM

9. https://www.youtube.com/watch?v=rBCqOTEfxv

10. https://huggingface.co/course/chapter1/1

11. https://pytorch.org/

12. https://discuss.pytorch.org/t/how-do-i-check-the-number-of-parameters-of-a-model/4325/9

This article is authorized by the original author and compiled by Baihai IDP. If you need to reprint the translation, please contact us for authorization.

Original link:

https://towardsdatascience.com/how-to-estimate-the-number-of-parameters-in-transformer-models-ca0f57d8dff0
