[NLP] BERT model parameters

A breakdown of BERT_base's 110M parameters

How are the 110M parameters of the BERT_base model made up? Let's work through the calculation together:

This also happens to be a good opportunity to gain a deeper understanding of the architectural details of the Transformer Encoder.

Take a look at the model's parameters and structure with the help of the transformers library:

import torch
from transformers import BertTokenizer, BertModel

# Load the pretrained BERT_base (uncased) model and its tokenizer
bertModel = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Print the name and shape of every learnable parameter
for name, param in bertModel.named_parameters():
    print(name, param.shape)

The resulting model parameters are:

embeddings.word_embeddings.weight torch.Size([30522, 768])
embeddings.position_embeddings.weight torch.Size([512, 768])
embeddings.token_type_embeddings.weight torch.Size([2, 768])
embeddings.LayerNorm.weight torch.Size([768])
embeddings.LayerNorm.bias torch.Size([768])

encoder.layer.0.attention.self.query.weight torch.Size([768, 768])
encoder.layer.0.attention.self.query.bias torch.Size([768])
encoder.layer.0.attention.self.key.weight torch.Size([768, 768])
encoder.layer.0.attention.self.key.bias torch.Size([768])
encoder.layer.0.attention.self.value.weight torch.Size([768, 768])
encoder.layer.0.attention.self.value.bias torch.Size([768])

encoder.layer.0.attention.output.dense.weight torch.Size([768, 768])
encoder.layer.0.attention.output.dense.bias torch.Size([768])
encoder.layer.0.attention.output.LayerNorm.weight torch.Size([768])
encoder.layer.0.attention.output.LayerNorm.bias torch.Size([768])

encoder.layer.0.intermediate.dense.weight torch.Size([3072, 768])
encoder.layer.0.intermediate.dense.bias torch.Size([3072])
encoder.layer.0.output.dense.weight torch.Size([768, 3072])
encoder.layer.0.output.dense.bias torch.Size([768])
encoder.layer.0.output.LayerNorm.weight torch.Size([768])
encoder.layer.0.output.LayerNorm.bias torch.Size([768])

... (encoder layers 1 through 10 are omitted here; they follow the same pattern) ...

encoder.layer.11.attention.self.query.weight torch.Size([768, 768])
encoder.layer.11.attention.self.query.bias torch.Size([768])
encoder.layer.11.attention.self.key.weight torch.Size([768, 768])
encoder.layer.11.attention.self.key.bias torch.Size([768])
encoder.layer.11.attention.self.value.weight torch.Size([768, 768])
encoder.layer.11.attention.self.value.bias torch.Size([768])
encoder.layer.11.attention.output.dense.weight torch.Size([768, 768])
encoder.layer.11.attention.output.dense.bias torch.Size([768])
encoder.layer.11.attention.output.LayerNorm.weight torch.Size([768])
encoder.layer.11.attention.output.LayerNorm.bias torch.Size([768])
encoder.layer.11.intermediate.dense.weight torch.Size([3072, 768])
encoder.layer.11.intermediate.dense.bias torch.Size([3072])
encoder.layer.11.output.dense.weight torch.Size([768, 3072])
encoder.layer.11.output.dense.bias torch.Size([768])
encoder.layer.11.output.LayerNorm.weight torch.Size([768])
encoder.layer.11.output.LayerNorm.bias torch.Size([768])

pooler.dense.weight torch.Size([768, 768])
pooler.dense.bias torch.Size([768])
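
Before breaking the numbers down by hand, a quick cross-check is possible directly in PyTorch. The sketch below (reusing the bertModel object loaded above) groups the parameter counts by top-level module; the exact totals printed depend on the checkpoint:

from collections import defaultdict

# Group parameter counts by top-level module (embeddings / encoder / pooler)
counts = defaultdict(int)
for name, param in bertModel.named_parameters():
    counts[name.split('.')[0]] += param.numel()

for module, n in counts.items():
    print(f"{module}: {n:,}")
print(f"total: {sum(counts.values()):,}")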

As the printout shows, the parameters of the BERT model consist of three main parts:

Embedding layer parameters

Transformer Encoder layer parameters

LayerNorm layer parameters

Part Two: Embedding layer parameters

Since the input representation is the sum of three parts, the Token embedding, the Position embedding, and the Segment embedding, the parameters of the embedding layer likewise comprise these three parts.

For the English BERT_base model, the vocabulary size is 30522, the hidden size is hidden_size = 768, and the maximum sequence length is seq_len = 512.
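
These hyperparameters can also be read off the loaded configuration; a quick check, assuming the same bertModel as above:

cfg = bertModel.config
print(cfg.vocab_size)               # 30522
print(cfg.hidden_size)              # 768
print(cfg.max_position_embeddings)  # 512
print(cfg.type_vocab_size)          # 2
print(cfg.num_attention_heads)      # 12
print(cfg.num_hidden_layers)        # 12
print(cfg.intermediate_size)        # 3072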

Token embedding parameter count: 30522 * 768;

Position embedding parameter count: 512 * 768;

Segment embedding parameter count: 2 * 768.

So the total for the three embedding tables is: (30522 + 512 + 2) * 768 = 23,835,648.

 

The Embedding layer is followed by a LayerNorm layer.

Layer normalization has two parameters (gamma and beta) per dimension, so the count is:

768 * 2 = 1536
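
The same arithmetic as a short sketch, cross-checked against the embeddings module of the loaded model:

vocab_size, max_len, type_vocab, hidden = 30522, 512, 2, 768

token_emb     = vocab_size * hidden   # 23,440,896
position_emb  = max_len * hidden      #    393,216
segment_emb   = type_vocab * hidden   #      1,536
emb_layernorm = 2 * hidden            #      1,536 (gamma + beta)

print(token_emb + position_emb + segment_emb)  # 23,835,648
print(emb_layernorm)                           # 1,536

# Cross-check: the embeddings module holds the three tables plus its LayerNorm
print(sum(p.numel() for p in bertModel.embeddings.parameters()))  # 23,837,184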

Part Three: Transformer Encoder layer parameters

This part can be broken down into two components: the Self-attention layer parameters and the Feed-Forward Network layer parameters.

1. Self-attention layer parameters

This layer is built around the three matrices Q, K, and V. In the BERT model it is a multi-head self-attention (denoted SA) mechanism: the attention weight matrix is first obtained from the Q and K matrices via matrix multiplication and a softmax transformation, this weight matrix is then multiplied by the V matrix, and finally the outputs of the 12 heads are concatenated to obtain the final SA layer output.

Because the attention is split into 12 heads, the Q, K, and V projections of a single head contain 768 * (768/12) * 3 parameters. The head outputs are then concatenated and passed through an output projection W of size 768 * 768.

For 12 heads, this gives 768 * (768/12) * 3 * 12 + 768 * 768 = 1,769,472 + 589,824 = 2,359,296 parameters per layer.

The Self-attention sub-layer is also followed by a LayerNorm layer.

Layer normalization has two parameters per dimension, so the count is:

768 * 2 = 1536
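
A minimal sketch of this per-layer count, checked against layer 0 of the loaded model (weights only; biases and LayerNorm are excluded to match the hand calculation):

hidden, heads = 768, 12
head_dim = hidden // heads  # 64

qkv_proj = hidden * head_dim * 3 * heads  # 1,769,472 (Q, K, V projections across 12 heads)
out_proj = hidden * hidden                #   589,824 (projection after concatenating the heads)
attn_layernorm = 2 * hidden               #     1,536

print(qkv_proj + out_proj)  # 2,359,296 per layer

# Cross-check against the attention block of encoder layer 0
attn = bertModel.encoder.layer[0].attention
print(sum(p.numel() for n, p in attn.named_parameters()
          if n.endswith('weight') and 'LayerNorm' not in n))  # 2,359,296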

2. Feed-Forward Network layer parameters

From FFN(x) = max(0, xW1 + b1)W2 + b2 it can be seen that the feed-forward network consists of two fully connected layers, where W1 and W2 have shapes (768, 3072) and (3072, 768) respectively.

With intermediate_size = 3072 (4H in the original paper), the parameter count over all 12 layers is 12 * (768 * 3072 + 3072 * 768) = 56,623,104.

The FFN sub-layer is also followed by a LayerNorm layer.

Layer normalization has two parameters per dimension, so the count is:

768 * 2 = 1536
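
The corresponding sketch for the feed-forward sub-layer, again cross-checked against layer 0 of the loaded model:

hidden, intermediate, layers = 768, 3072, 12

ffn_per_layer = hidden * intermediate + intermediate * hidden  # 4,718,592 (W1 and W2, biases excluded)
ffn_layernorm = 2 * hidden                                     #     1,536

print(ffn_per_layer * layers)  # 56,623,104

# Cross-check: the intermediate and output dense weights of encoder layer 0
layer0 = bertModel.encoder.layer[0]
print(layer0.intermediate.dense.weight.numel()
      + layer0.output.dense.weight.numel())  # 4,718,592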

LayerNorm parameters

Each layer normalization has two parameter vectors, gamma and beta. Layer normalization is used in three places: after the embedding layer, after the multi-head attention, and after the feed-forward network. The total parameter count of these three parts is 768*2 + 12*(768*2 + 768*2) = 38,400.
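
This can be verified by summing every LayerNorm parameter in the loaded model (a quick sketch):

hidden, layers = 768, 12

ln_total = 2 * hidden + layers * (2 * hidden + 2 * hidden)  # 38,400
print(ln_total)

# Cross-check: every parameter whose name contains 'LayerNorm'
print(sum(p.numel() for n, p in bertModel.named_parameters() if 'LayerNorm' in n))  # 38,400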

Part Four: Summary

In summary, the total number of parameters of the BERT model is:

23,835,648 + 12 * 2,359,296 (= 28,311,552) + 56,623,104 + 38,400 = 108,808,704 ≈ 103.8M (counting 1M = 2^20)

The Embedding layer accounts for roughly 20% of the total parameters, and the Transformer Encoder layers account for roughly 80%.
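
Putting the pieces together in code (the difference from the full checkpoint total comes from the linear-layer biases and the pooler head, which are excluded from the hand count):

embedding = 23_835_648
attention = 12 * 2_359_296   # 28,311,552
ffn       = 56_623_104
layernorm = 38_400

subtotal = embedding + attention + ffn + layernorm
print(f"{subtotal:,}")       # 108,808,704

full = sum(p.numel() for p in bertModel.parameters())
print(f"{full:,}  (gap of {full - subtotal:,} from biases and the pooler)")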

Note: the counts in this article cover only the embedding and Transformer Encoder parts of the BERT model. The biases of the linear layers (and the pooler head) are not included because they contribute relatively few parameters.

Source: blog.csdn.net/zwqjoy/article/details/132391020