A brief introduction to the MobileBERT model

Table of contents

1. Summary

2. In-depth expansion

2.1 Knowledge distillation method

2.2 Progressive knowledge transfer


1. Summary

MobileBERT can be regarded as a slimmed-down BERT-large model. It adopts a bottleneck structure (Bottleneck Structure) and makes a number of changes to the design of its self-attention and feed-forward networks. On the GLUE benchmark, MobileBERT retains 99.2% of the performance of BERT-base while running inference about 5.5 times faster and using only 23.2% of the parameters.
MobileBERT also makes several structural changes to BERT, such as replacing layer normalization with a simpler element-wise linear transform (NoNorm) and using the ReLU activation function in place of GELU; these are only sketched briefly below, and readers interested in the full details of the model structure can consult the paper. The rest of this article focuses on the knowledge distillation method adopted by MobileBERT.
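A minimal PyTorch sketch of the bottleneck idea (illustrative only: the 512/128 dimensions follow the paper's inter-block and intra-block sizes, NoNorm stands in for the layer-normalization replacement, and the inner sublayer is a stand-in for the real multi-head attention and stacked feed-forward networks):

```python
import torch
import torch.nn as nn


class NoNorm(nn.Module):
    """Element-wise affine transform used in place of LayerNorm (no mean/variance statistics)."""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.gamma * x + self.beta


class BottleneckBlock(nn.Module):
    """Illustrative bottleneck: project the wide inter-block representation down to a
    narrow intra-block size, run a (stand-in) sublayer there, then project back up."""
    def __init__(self, inter_size=512, intra_size=128):
        super().__init__()
        self.down = nn.Linear(inter_size, intra_size)   # input bottleneck
        self.inner = nn.Sequential(                     # stand-in for attention + FFN stack
            nn.Linear(intra_size, intra_size),
            nn.ReLU(),                                  # MobileBERT uses ReLU rather than GELU
        )
        self.up = nn.Linear(intra_size, inter_size)     # output bottleneck
        self.norm = NoNorm(inter_size)

    def forward(self, x):
        return self.norm(x + self.up(self.inner(self.down(x))))
```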

2. In-depth expansion

2.1 Knowledge distillation method

The loss function of MobileBERT consists of four parts: the supervised MLM loss, the supervised NSP loss, the hidden-layer distillation loss, and the attention distillation loss.

A hyperparameter α (0 ≤ α ≤ 1) adjusts the relative weight of these loss terms; MobileBERT uses α = 0.5.
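As a sketch only of how the four terms can be combined under a single weight α (an illustrative form, not necessarily the exact grouping used in the paper):

$$
\mathcal{L} = \alpha\left(\mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}\right) + \left(1-\alpha\right)\left(\mathcal{L}_{\text{hidn}} + \mathcal{L}_{\text{attn}}\right)
$$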
Among these terms, the supervised MLM loss and the supervised NSP loss are the same as in the original BERT. The hidden-layer distillation loss is consistent with TinyBERT: it is the mean squared error between the hidden-layer outputs of the teacher model and the student model. Note that because MobileBERT (the student model) and its teacher have the same number of Transformer layers (24 in the paper), no layer-mapping function needs to be designed; each layer of the teacher model is simply matched one-to-one with the corresponding layer of the student model. The attention distillation loss is also similar to TinyBERT's, except that MobileBERT uses a KL-divergence-based loss instead of TinyBERT's mean squared error.
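These two per-layer transfer losses can be written as follows (a reconstruction based on the losses in the MobileBERT paper; T denotes the sequence length, N the hidden size, ℓ the layer index, the superscripts tr and st denote the teacher and the student, and a_{t,ℓ,k} is the attention distribution of head k at position t):

$$
\mathcal{L}^{\ell}_{\text{hidn}} = \frac{1}{TN}\sum_{t=1}^{T}\sum_{n=1}^{N}\left(H^{\text{tr}}_{t,\ell,n} - H^{\text{st}}_{t,\ell,n}\right)^{2}
$$

$$
\mathcal{L}^{\ell}_{\text{attn}} = \frac{1}{TK}\sum_{t=1}^{T}\sum_{k=1}^{K} D_{\text{KL}}\!\left(a^{\text{tr}}_{t,\ell,k}\,\big\|\,a^{\text{st}}_{t,\ell,k}\right)
$$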

In the formula, K represents the number of attention heads and KL(·) represents the KL divergence function. Unlike the MSE loss, the KL divergence is not symmetric (the teacher's attention distribution serves as the target), which requires special attention. The overall structure of the MobileBERT model is shown in the figure below.
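A minimal PyTorch sketch of these two losses, assuming the teacher and student expose per-layer hidden states and per-head attention probability distributions of matching shape (the tensor names here are hypothetical; in practice a linear projection may be needed if the hidden sizes differ):

```python
import torch
import torch.nn.functional as F


def hidden_distill_loss(h_teacher, h_student):
    """MSE between teacher and student hidden states for one layer.
    Shapes: (batch, seq_len, hidden_size)."""
    return F.mse_loss(h_student, h_teacher)


def attention_distill_loss(att_teacher, att_student, eps=1e-12):
    """Per-head KL(teacher || student) for one layer.
    Shapes: (batch, num_heads, seq_len, seq_len); each row is a probability
    distribution over attended positions. Note the asymmetry: the teacher's
    distribution is the target."""
    kl = att_teacher * (torch.log(att_teacher + eps) - torch.log(att_student + eps))
    return kl.sum(dim=-1).mean()   # sum over the distribution, average over batch/heads/tokens


if __name__ == "__main__":
    B, K, T, N = 2, 4, 8, 16                       # toy sizes
    h_t, h_s = torch.randn(B, T, N), torch.randn(B, T, N)
    a_t = torch.softmax(torch.randn(B, K, T, T), dim=-1)
    a_s = torch.softmax(torch.randn(B, K, T, T), dim=-1)
    print(hidden_distill_loss(h_t, h_s) + attention_distill_loss(a_t, a_s))
```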

2.2 Progressive knowledge transfer

MobileBERT uses a progressive knowledge transfer (Progressive Knowledge Transfer) strategy. The figure below gives an example in which the teacher model is a 3-layer Transformer; the lighter shade of each color indicates that the corresponding parameters are frozen, i.e., they do not participate in training.

As the figure shows, in progressive knowledge transfer the weights of the word embedding layer and the final classification output layer are copied directly from the teacher model to the student model and never participate in parameter updates. The intermediate Transformer layers are trained stage by stage: the student model first learns from the first layer of the teacher model; next, it learns from the teacher's second layer while its own first-layer weights are frozen; and so on, so that when the student model learns from the teacher's i-th layer, all of the student's layers below the i-th layer do not participate in the update. The authors show experimentally that this progressive knowledge transfer is significantly better than distilling all layers directly, and interested readers can consult the paper for more details.
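A schematic PyTorch sketch of this training schedule (the `student.layers`, `student.embeddings`, `student.classifier` attributes and the `up_to_layer` argument are hypothetical names for illustration, and the per-layer loss reuses the MSE + KL form from Section 2.1):

```python
import torch
import torch.nn.functional as F


def layer_distill_loss(t_hidden, s_hidden, t_att, s_att, eps=1e-12):
    """Hidden-state MSE plus per-head KL(teacher || student) for one layer."""
    mse = F.mse_loss(s_hidden, t_hidden)
    kl = (t_att * (torch.log(t_att + eps) - torch.log(s_att + eps))).sum(-1).mean()
    return mse + kl


def progressive_knowledge_transfer(student, teacher, data_loader, num_layers,
                                   epochs_per_stage=1, lr=1e-4):
    """Stage i updates only the student's i-th Transformer layer; all layers
    below it stay frozen. Embedding and classifier weights are assumed to have
    been copied from the teacher beforehand and are never updated."""
    teacher.eval()
    # Keep the copied embedding and classifier weights frozen (hypothetical attributes).
    for p in list(student.embeddings.parameters()) + list(student.classifier.parameters()):
        p.requires_grad = False

    for i in range(num_layers):
        # Freeze every student layer except the current one.
        for j, layer in enumerate(student.layers):
            for p in layer.parameters():
                p.requires_grad = (j == i)
        optimizer = torch.optim.Adam(
            [p for p in student.parameters() if p.requires_grad], lr=lr)

        for _ in range(epochs_per_stage):
            for batch in data_loader:
                with torch.no_grad():
                    t_hidden, t_att = teacher(batch, up_to_layer=i)   # hypothetical API
                s_hidden, s_att = student(batch, up_to_layer=i)       # hypothetical API
                loss = layer_distill_loss(t_hidden, s_hidden, t_att, s_att)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```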
