Architecture of BertModel:
Take bert-base-chinese as an example:
from transformers import BertModel

model = BertModel.from_pretrained("../model/bert-base-chinese")
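To see the architecture at a glance, the configuration of the loaded model can be printed; the following is a short sketch, and the values shown in the comments are the expected ones for the bert-base-chinese checkpoint:

print(model.config.vocab_size)               # 21128  (Chinese WordPiece vocabulary)
print(model.config.hidden_size)              # 768    (size of each token representation)
print(model.config.num_hidden_layers)        # 12     (Transformer encoder layers)
print(model.config.num_attention_heads)      # 12     (attention heads per layer)
print(model.config.max_position_embeddings)  # 512    (maximum sequence length)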
Counting the model parameters:
# Count the parameters
total_params = 0            # total number of parameters in the model
total_learnable_params = 0  # number of learnable (trainable) parameters
total_embedding_params = 0  # parameters in the embeddings layer
total_encoder_params = 0    # parameters in the encoder
total_pooler_params = 0     # parameters in the pooler layer

for name, param in model.named_parameters():
    print(name, "->", param.shape)
    if param.requires_grad:
        total_learnable_params += param.numel()
    if "embedding" in name:
        total_embedding_params += param.numel()
    if "encoder" in name:
        total_encoder_params += param.numel()
    if "pooler" in name:
        total_pooler_params += param.numel()
    total_params += param.numel()
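With the totals accumulated above, the proportions reported below can be printed as a quick sketch:

print("total params          :", total_params)
print("learnable params      :", total_learnable_params)
print("embedding layer ratio :", total_embedding_params / total_params)
print("encoder ratio         :", total_encoder_params / total_params)
print("pooler ratio          :", total_pooler_params / total_params)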
From the above it can be seen that:
The embedding layer accounts for 0.16254008305735163 of the parameters (about 16.3%).
The encoder accounts for 0.8316849528014959 (about 83.2%).
The pooler layer accounts for 0.005774964141152439 (about 0.6%).
Total parameters: 102267648
Return value analysis:
The documentation on BertModel is as follows:
The BertModel documentation at https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertModel explains the return values in detail:
last_hidden_state and pooler_output are always returned; hidden_states is returned only when the model is called with output_hidden_states=True or configured with config.output_hidden_states=True.
Let me explain here:
With output_hidden_states=True, the length of outputs is 3:
# outputs[0] == last_hidden_state : (batch_size, sequence_length, hidden_size)
# outputs[1] == pooler_output : (batch_size, hidden_size)
# outputs[2] == hidden_states : a tuple of (num_hidden_layers + 1) tensors, each (batch_size, sequence_length, hidden_size)
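As an illustration, here is a minimal sketch of producing these outputs; the example sentence and the use of BertTokenizer with the same local path are assumptions, not part of the original snippet:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("../model/bert-base-chinese")
model = BertModel.from_pretrained("../model/bert-base-chinese", output_hidden_states=True)
model.eval()

inputs = tokenizer("北京欢迎你", return_tensors="pt")  # illustrative sentence
with torch.no_grad():
    outputs = model(**inputs)

print(outputs[0].shape)     # last_hidden_state: (1, sequence_length, 768)
print(outputs[1].shape)     # pooler_output: (1, 768)
print(len(outputs[2]))      # hidden_states: 13 = embeddings output + 12 encoder layers
print(outputs[2][0].shape)  # each element: (1, sequence_length, 768)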
From the hidden_states it can be seen that:
model.embeddings(input_tensor) == outputs[2][0]
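This equality can be checked numerically with torch.allclose, reusing inputs and outputs from the sketch above (input_tensor in the line above corresponds to the input_ids here):

with torch.no_grad():
    embedding_output = model.embeddings(
        input_ids=inputs["input_ids"],
        token_type_ids=inputs["token_type_ids"],
    )
print(torch.allclose(embedding_output, outputs[2][0], atol=1e-6))  # expected: True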