One. Breaking down the 110M parameters of BERT_Base
How are the 110M parameters of the BERT_base model composed? Let's work through the calculation together.
Doing so is also a good way to gain a deeper understanding of the architectural details of the Transformer Encoder.
First, take a look at the architecture of the model with the help of the transformers library:
import torch
from transformers import BertTokenizer, BertModel

# Load the pretrained model and its tokenizer (weights are downloaded on first use).
bertModel = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Print the name and shape of every parameter tensor in the model.
for name, param in bertModel.named_parameters():
    print(name, param.shape)
The resulting model parameters are listed below (layers 1 through 10 have the same shapes as layer 0 and are omitted):
embeddings.word_embeddings.weight torch.Size([30522, 768])
embeddings.position_embeddings.weight torch.Size([512, 768])
embeddings.token_type_embeddings.weight torch.Size([2, 768])
embeddings.LayerNorm.weight torch.Size([768])
embeddings.LayerNorm.bias torch.Size([768])
encoder.layer.0.attention.self.query.weight torch.Size([768, 768])
encoder.layer.0.attention.self.query.bias torch.Size([768])
encoder.layer.0.attention.self.key.weight torch.Size([768, 768])
encoder.layer.0.attention.self.key.bias torch.Size([768])
encoder.layer.0.attention.self.value.weight torch.Size([768, 768])
encoder.layer.0.attention.self.value.bias torch.Size([768])
encoder.layer.0.attention.output.dense.weight torch.Size([768, 768])
encoder.layer.0.attention.output.dense.bias torch.Size([768])
encoder.layer.0.attention.output.LayerNorm.weight torch.Size([768])
encoder.layer.0.attention.output.LayerNorm.bias torch.Size([768])
encoder.layer.0.intermediate.dense.weight torch.Size([3072, 768])
encoder.layer.0.intermediate.dense.bias torch.Size([3072])
encoder.layer.0.output.dense.weight torch.Size([768, 3072])
encoder.layer.0.output.dense.bias torch.Size([768])
encoder.layer.0.output.LayerNorm.weight torch.Size([768])
encoder.layer.0.output.LayerNorm.bias torch.Size([768])
encoder.layer.11.attention.self.query.weight torch.Size([768, 768])
encoder.layer.11.attention.self.query.bias torch.Size([768])
encoder.layer.11.attention.self.key.weight torch.Size([768, 768])
encoder.layer.11.attention.self.key.bias torch.Size([768])
encoder.layer.11.attention.self.value.weight torch.Size([768, 768])
encoder.layer.11.attention.self.value.bias torch.Size([768])
encoder.layer.11.attention.output.dense.weight torch.Size([768, 768])
encoder.layer.11.attention.output.dense.bias torch.Size([768])
encoder.layer.11.attention.output.LayerNorm.weight torch.Size([768])
encoder.layer.11.attention.output.LayerNorm.bias torch.Size([768])
encoder.layer.11.intermediate.dense.weight torch.Size([3072, 768])
encoder.layer.11.intermediate.dense.bias torch.Size([3072])
encoder.layer.11.output.dense.weight torch.Size([768, 3072])
encoder.layer.11.output.dense.bias torch.Size([768])
encoder.layer.11.output.LayerNorm.weight torch.Size([768])
encoder.layer.11.output.LayerNorm.bias torch.Size([768])
pooler.dense.weight torch.Size([768, 768])
pooler.dense.bias torch.Size([768])
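To turn a listing like the one above into counts, one option is to multiply out each shape and sum per top-level module. The sketch below is only an illustration: `count_by_group` is a hypothetical helper name, and `sample` holds a handful of entries copied from the listing rather than the full model.

```python
from collections import defaultdict
from math import prod

def count_by_group(named_shapes):
    """Sum parameter counts per top-level module (the first part of each name)."""
    totals = defaultdict(int)
    for name, shape in named_shapes:
        group = name.split(".")[0]   # "embeddings", "encoder" or "pooler"
        totals[group] += prod(shape) # number of elements in the tensor
    return dict(totals)

# A few (name, shape) pairs copied from the listing above, not the whole model.
sample = [
    ("embeddings.word_embeddings.weight", (30522, 768)),
    ("embeddings.position_embeddings.weight", (512, 768)),
    ("embeddings.token_type_embeddings.weight", (2, 768)),
    ("encoder.layer.0.attention.self.query.weight", (768, 768)),
    ("encoder.layer.0.attention.self.query.bias", (768,)),
    ("pooler.dense.weight", (768, 768)),
    ("pooler.dense.bias", (768,)),
]

sample_totals = count_by_group(sample)
print(sample_totals)
```

With the model loaded, the same helper should accept `(name, param.shape)` pairs taken from `bertModel.named_parameters()`, since a `torch.Size` behaves like a tuple of integers.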
The parameters of the BERT model mainly consist of three parts:
Embedding layer parameters
Transformer Encoder layer parameters
LayerNorm layer parameters
Two. Embedding layer parameters
Because the input vector for each token is the sum of three parts (Token embedding, Position embedding, and Segment embedding), the embedding layer holds the parameters of all three.
For BERT_base, the English vocabulary size is 30522, the hidden size is hidden_size = 768, and the maximum text length is seq_len = 512.
The Token embedding parameter count is: 30522 * 768;
The Position embedding parameter count is: 512 * 768;
The Segment embedding parameter count is: 2 * 768.
So the total number of embedding parameters is: (30522 + 512 + 2) * 768 = 23,835,648
LN layer in the Embedding layer
The normalization used is layer normalization, which has two parameters (gamma and beta) per hidden dimension:
768 * 2 = 1536
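The embedding arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the variable names are chosen here for illustration and the sizes are the ones quoted in the text.

```python
# Sizes quoted in the text for BERT_base.
vocab_size = 30522       # English vocabulary size
max_position = 512       # maximum text length
type_vocab_size = 2      # number of segment types
hidden_size = 768

token_emb = vocab_size * hidden_size
position_emb = max_position * hidden_size
segment_emb = type_vocab_size * hidden_size

embedding_total = token_emb + position_emb + segment_emb
embedding_ln = 2 * hidden_size   # gamma and beta of the embedding LayerNorm

print(f"token:    {token_emb:,}")        # 23,440,896
print(f"position: {position_emb:,}")     # 393,216
print(f"segment:  {segment_emb:,}")      # 1,536
print(f"total:    {embedding_total:,}")  # 23,835,648
print(f"LN:       {embedding_ln:,}")     # 1,536
```

Almost all of the embedding parameters sit in the token table; the position and segment tables together add under 400K.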
Three. Transformer Encoder layer parameters
This part can be split into two pieces: the Self-attention layer parameters and the Feed-Forward Network layer parameters.
1. Self-attention layer parameters
This layer is built mainly from three matrix operations, Q, K, and V. In the BERT model it is a Multi-head self-attention (denoted SA) mechanism. First, the attention weight matrix is obtained from the product of the Q and K matrices followed by a softmax; the weight matrix is then multiplied by the V matrix; finally, the results of the 12 heads are concatenated to give the output of the SA layer.
1. Because the hidden dimension is split across 12 heads, a single head has 768 * (768/12) * 3 parameters for its Q, K, and V projections. The outputs of the heads are then concatenated and passed through one more linear transformation, whose weight W has size 768 * 768.
2. The 12 heads together therefore have 768 * (768/12) * 3 * 12 + 768 * 768 = 1,769,472 + 589,824 = 2,359,296 parameters.
3. LN layer in the Self-attention layer
The normalization used is layer normalization, which has two parameters (gamma and beta) per hidden dimension:
768 * 2 = 1536
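The per-head count for the Self-attention layer can be reproduced as follows. This is a sketch of the arithmetic only, with illustrative variable names, and it ignores the bias terms just as the text does.

```python
hidden_size = 768
num_heads = 12
head_dim = hidden_size // num_heads          # 64 dimensions per head

# Q, K and V projections for one head: three (768 x 64) matrices.
per_head_qkv = hidden_size * head_dim * 3    # 147,456
all_heads_qkv = per_head_qkv * num_heads     # 1,769,472

# Linear transformation applied to the concatenated heads.
output_proj = hidden_size * hidden_size      # 589,824

attention_weights = all_heads_qkv + output_proj
attention_ln = 2 * hidden_size               # gamma and beta

print(f"attention weights per layer: {attention_weights:,}")  # 2,359,296
print(f"attention LN per layer:      {attention_ln:,}")       # 1,536
```

Splitting into heads does not change the count: the total equals four full 768 * 768 matrices (Q, K, V, and the output projection).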
2. Feed-Forward Network layer parameters
From FFN(x) = max(0, xW1 + b1)W2 + b2, the feed-forward network FFN consists mainly of two fully connected layers, with W1 and W2 of shapes (768, 3072) and (3072, 768).
The intermediate_size is 3072 (4H in the original paper), so the parameter count over all 12 layers is 12 * (768 * 3072 + 3072 * 768) = 56,623,104
LN layer in the FFN layer
The normalization used is layer normalization, which has two parameters (gamma and beta) per hidden dimension:
768 * 2 = 1536
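The feed-forward count can be checked the same way. A minimal sketch with illustrative variable names; the biases b1 and b2 are left out, matching the text.

```python
hidden_size = 768
intermediate_size = 3072   # 4 * hidden_size
num_layers = 12

w1 = hidden_size * intermediate_size   # first fully connected layer
w2 = intermediate_size * hidden_size   # second fully connected layer
ffn_per_layer = w1 + w2                # 4,718,592
ffn_all_layers = num_layers * ffn_per_layer

print(f"FFN weights per layer:  {ffn_per_layer:,}")   # 4,718,592
print(f"FFN weights, 12 layers: {ffn_all_layers:,}")  # 56,623,104
```

Per layer, the FFN weights are exactly twice the attention weights (4,718,592 against 2,359,296).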
Layer normalization
Layer normalization has two parameter vectors, gamma and beta, each of length 768. It is used in three places: after the embedding layer, after the multi-head attention, and after the feed-forward network. The parameters of these three parts total 768 * 2 + 12 * (768 * 2 + 768 * 2) = 38,400
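The LayerNorm total is small enough to verify by hand, but a short sketch makes the structure explicit (variable names are illustrative):

```python
hidden_size = 768
num_layers = 12

ln_params = 2 * hidden_size            # gamma and beta: 1,536 per LayerNorm
ln_per_encoder_layer = 2 * ln_params   # one after attention, one after the FFN
ln_total = ln_params + num_layers * ln_per_encoder_layer

num_layer_norms = 1 + 2 * num_layers   # embedding LN plus two per encoder layer
print(f"{num_layer_norms} LayerNorms, {ln_total:,} parameters")  # 25 LayerNorms, 38,400 parameters
```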
Four. Summary
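In summary, the total number of parameters counted for the BERT model is:
23,835,648 + 12 * 2,359,296 + 56,623,104 + 38,400 = 23,835,648 + 28,311,552 + 56,623,104 + 38,400 = 108,808,704 ≈ 108.8M (about 103.8M if 1M is taken as 2^20 = 1,048,576)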
The Embedding layer accounts for about 22% of the total parameters, and the Transformer layers account for about 78%.
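The summary figures can be recomputed from the four subtotals derived earlier. The sketch below also prints the total in two units, since "M" can mean either 10^6 or 2^20, and works out the embedding share.

```python
embedding = 23_835_648        # token + position + segment embeddings
attention = 12 * 2_359_296    # Q, K, V and output projection over 12 layers
ffn = 56_623_104              # two fully connected layers over 12 layers
layer_norm = 38_400           # gamma and beta of all 25 LayerNorms

total = embedding + attention + ffn + layer_norm
embedding_share = embedding / total

print(f"total: {total:,}")                          # 108,808,704
print(f"in units of 10^6: {total / 1e6:.1f}M")      # 108.8M
print(f"in units of 2^20: {total / 2**20:.1f}M")    # 103.8M
print(f"embedding share: {embedding_share:.1%}")    # 21.9%
```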
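Note: The parameters counted in this article are only the embedding, weight-matrix, and LayerNorm parameters of the Transformer Encoder part of the BERT model. The biases of the linear layers are not included because they contribute few parameters, and the pooler at the end of the listing is not counted either.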
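To see how much the omitted terms would add, the sketch below counts the linear-layer biases and the pooler from the shapes in the listing at the top of the article. It assumes that listing is complete for each layer; the variable names are illustrative.

```python
hidden_size = 768
intermediate_size = 3072
num_layers = 12

weights_only = 108_808_704   # total derived in the summary above

# Biases per encoder layer: query, key, value, attention output dense,
# intermediate dense, and output dense.
attention_bias = 4 * hidden_size
ffn_bias = intermediate_size + hidden_size
bias_all_layers = num_layers * (attention_bias + ffn_bias)

# Pooler: one 768 x 768 dense layer plus its bias.
pooler = hidden_size * hidden_size + hidden_size

full_total = weights_only + bias_all_layers + pooler
print(f"linear biases: {bias_all_layers:,}")  # 82,944
print(f"pooler:        {pooler:,}")           # 590,592
print(f"full total:    {full_total:,}")       # 109,482,240
```

Under these assumptions the full count comes to roughly 109.5M, which is where the commonly quoted 110M figure comes from.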