Model compression reduces the redundancy in a trained neural network. Compression is particularly useful for complex models such as BERT: BERT, and especially BERT-Large, consumes a large amount of GPU memory and is simply not suitable for memory-constrained devices such as smartphones. Of course, improvements in memory use and inference speed can also translate into large cost savings.
This article collects and shares a series of papers on compressing BERT.
Common BERT compression methods
Pruning - removing unnecessary parts of the network after training. This includes weight pruning, attention head pruning, layer pruning, and so on. Some methods also apply regularization during training so the network tolerates pruning better afterwards (e.g., layer dropout). A minimal weight-pruning sketch appears after this list.
Weight matrix factorization - approximating the original parameter matrix as the product of two smaller matrices, which amounts to imposing a low-rank constraint on the matrix. Factorization can be applied both to the input embedding matrix (which saves a lot of memory on disk) and to the parameters of the feed-forward / self-attention layers (which can also increase speed). A factorization sketch appears after this list.
Knowledge distillation - also known as the "student-teacher" approach. A much smaller Transformer is trained from scratch on the pre-training / downstream data. Normally this would fail, but, for reasons that are not fully understood, using soft labels produced by the full-size model improves optimization. Some methods distill BERT into other architectures (such as LSTMs), which have faster inference. Other methods dig deeper into the teacher, matching not only its outputs but also its weight matrices and hidden activations. A sketch of a soft-label distillation loss appears after this list.
Weight sharing - some weights in the model share the same values as other parameters in the model. For example, ALBERT uses the same weight matrices for every self-attention layer of BERT. A cross-layer sharing sketch appears after this list.
Quantization - truncating floating-point numbers so that they use only a few bits (which may introduce rounding error). The quantized values can also be learned during or after training. A post-training quantization sketch appears after this list.
Pre-training vs. downstream compression - some methods compress BERT only with respect to a particular downstream task, while others compress BERT in a task-agnostic way.
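As a concrete illustration of weight pruning, here is a minimal PyTorch sketch of generic magnitude pruning applied to a single linear layer. It is not the procedure of any specific paper above; the `magnitude_prune` helper and the 50% sparsity value are illustrative choices.

```python
import torch
import torch.nn as nn

def magnitude_prune(linear: nn.Linear, sparsity: float = 0.5) -> nn.Linear:
    """Zero out the smallest-magnitude weights of a linear layer in place.

    Generic magnitude pruning for illustration only.
    """
    with torch.no_grad():
        flat = linear.weight.abs().flatten()
        k = int(sparsity * flat.numel())
        if k > 0:
            threshold = torch.kthvalue(flat, k).values
            mask = linear.weight.abs() > threshold
            linear.weight.mul_(mask)   # keep only weights above the threshold
    return linear

# Example: prune roughly 50% of the weights of a BERT-sized feed-forward layer.
layer = nn.Linear(768, 3072)
magnitude_prune(layer, sparsity=0.5)
print((layer.weight == 0).float().mean())  # fraction of zeroed weights, ~0.5
```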
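The next sketch illustrates weight matrix factorization by replacing one trained linear layer with two smaller ones via a truncated SVD of its weights. Papers in this family may instead learn the two factors directly during training; the `factorize_linear` helper and the rank of 128 are assumptions for the example.

```python
import torch
import torch.nn as nn

def factorize_linear(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate one (d_out x d_in) layer by the product of two smaller layers."""
    W = linear.weight.data                          # (d_out, d_in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                    # (d_out, rank)
    V_r = Vh[:rank, :]                              # (rank, d_in)

    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data = V_r
    second.weight.data = U_r
    if linear.bias is not None:
        second.bias.data = linear.bias.data
    return nn.Sequential(first, second)

# A 768 x 768 matrix (589,824 parameters) becomes two factors totalling
# 2 x 768 x 128 = 196,608 parameters.
compressed = factorize_linear(nn.Linear(768, 768), rank=128)
```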
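A common form of the knowledge distillation objective combines the usual hard-label cross-entropy with a soft-label term that matches the teacher's temperature-scaled output distribution (Hinton-style). The sketch below assumes PyTorch; `temperature` and `alpha` are illustrative defaults, not values taken from any specific BERT paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Weighted sum of a soft-label (teacher) term and a hard-label term."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```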
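To show ALBERT-style cross-layer weight sharing in miniature, the sketch below reuses a single `nn.TransformerEncoderLayer` for every pass through the encoder, so twelve "layers" of depth cost only one layer's worth of parameters. This is a toy stand-in, not ALBERT's actual implementation; the class name and hyperparameters are assumptions.

```python
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Apply the same Transformer layer repeatedly (cross-layer weight sharing)."""

    def __init__(self, d_model=768, nhead=12, num_passes=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_passes = num_passes

    def forward(self, x):
        for _ in range(self.num_passes):
            x = self.layer(x)   # the same weights are reused at every depth
        return x
```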
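For quantization, PyTorch ships a post-training dynamic quantization API that stores linear-layer weights in int8 and quantizes activations on the fly. The sketch below applies it to a toy feed-forward stack; a real BERT model would be passed in the same way, though the papers above may use different quantization schemes.

```python
import torch
import torch.nn as nn

# Toy stand-in for a model; replace with a full BERT model in practice.
model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))

# Convert all nn.Linear modules to int8 dynamic-quantized equivalents.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)
```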
Related papers
(classified by compression method)
Other Related Resources
Sparse Transformer: Concentrated Attention Through Explicit Selection: https://openreview.net/forum?id=Hye87grYDH
Lightweight and Efficient Neural Natural Language Processing with Quaternion Networks: http://arxiv.org/abs/1906.04393
Adaptively Sparse Transformers: https://www.semanticscholar.org/paper/f6390beca54411b06f3bde424fb983a451789733
Compressing BERT for Faster Prediction: https://blog.rasa.com/compressing-bert-for-faster-prediction-2/amp/