Model compression reduces the redundancy in a trained neural network. Compression is particularly useful for complex models such as BERT: BERT, and especially BERT-Large, consumes a large amount of GPU memory and is simply not suitable for memory-constrained devices such as smartphones. Of course, improvements in memory use and inference speed can also translate into large cost savings.
This article collects and shares a series of papers on compressing BERT.
Common BERT compression methods
Pruning - removing unnecessary parts of the network after training. This includes weight pruning, attention head pruning, layer pruning, and so on. Some methods also apply regularization during training so the network tolerates pruning better afterwards (e.g., layer dropout). A minimal weight-pruning sketch appears after this list.
Weight matrix factorization - approximating the original parameter matrix as the product of two smaller matrices, which amounts to imposing a low-rank constraint on the matrix. Factorization can be applied both to the input embedding matrix (which saves a lot of memory on disk) and to the parameters of the feed-forward / self-attention layers (which can also increase speed). A factorization sketch appears after this list.
Knowledge distillation - also known as the "student-teacher" approach. A much smaller Transformer is trained from scratch on the pre-training / downstream data. Normally this would fail, but, for reasons that are not fully understood, using soft labels produced by the full-size model improves optimization. Some methods distill BERT into other architectures (such as LSTMs), which have faster inference. Other methods dig deeper into the teacher, matching not only its outputs but also its weight matrices and hidden activations. A sketch of a soft-label distillation loss appears after this list.
Weight sharing - some weights in the model share the same values as other parameters in the model. For example, ALBERT uses the same weight matrices for every self-attention layer of BERT. A cross-layer sharing sketch appears after this list.
Quantization - truncating floating-point numbers so that they use only a few bits (which may introduce rounding error). The quantized values can also be learned during or after training. A post-training quantization sketch appears after this list.
Pre-training vs. downstream compression - some methods compress BERT only with respect to a particular downstream task, while others compress BERT in a task-agnostic way.
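As a concrete illustration of weight pruning, here is a minimal PyTorch sketch of generic magnitude pruning applied to a single linear layer. It is not the procedure of any specific paper above; the `magnitude_prune` helper and the 50% sparsity value are illustrative choices.

```python
import torch
import torch.nn as nn

def magnitude_prune(linear: nn.Linear, sparsity: float = 0.5) -> nn.Linear:
    """Zero out the smallest-magnitude weights of a linear layer in place.

    Generic magnitude pruning for illustration only.
    """
    with torch.no_grad():
        flat = linear.weight.abs().flatten()
        k = int(sparsity * flat.numel())
        if k > 0:
            threshold = torch.kthvalue(flat, k).values
            mask = linear.weight.abs() > threshold
            linear.weight.mul_(mask)   # keep only weights above the threshold
    return linear

# Example: prune roughly 50% of the weights of a BERT-sized feed-forward layer.
layer = nn.Linear(768, 3072)
magnitude_prune(layer, sparsity=0.5)
print((layer.weight == 0).float().mean())  # fraction of zeroed weights, ~0.5
```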
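The next sketch illustrates weight matrix factorization by replacing one trained linear layer with two smaller ones via a truncated SVD of its weights. Papers in this family may instead learn the two factors directly during training; the `factorize_linear` helper and the rank of 128 are assumptions for the example.

```python
import torch
import torch.nn as nn

def factorize_linear(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate one (d_out x d_in) layer by the product of two smaller layers."""
    W = linear.weight.data                          # (d_out, d_in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                    # (d_out, rank)
    V_r = Vh[:rank, :]                              # (rank, d_in)

    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data = V_r
    second.weight.data = U_r
    if linear.bias is not None:
        second.bias.data = linear.bias.data
    return nn.Sequential(first, second)

# A 768 x 768 matrix (589,824 parameters) becomes two factors totalling
# 2 x 768 x 128 = 196,608 parameters.
compressed = factorize_linear(nn.Linear(768, 768), rank=128)
```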
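A common form of the knowledge distillation objective combines the usual hard-label cross-entropy with a soft-label term that matches the teacher's temperature-scaled output distribution (Hinton-style). The sketch below assumes PyTorch; `temperature` and `alpha` are illustrative defaults, not values taken from any specific BERT paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Weighted sum of a soft-label (teacher) term and a hard-label term."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```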
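To show ALBERT-style cross-layer weight sharing in miniature, the sketch below reuses a single `nn.TransformerEncoderLayer` for every pass through the encoder, so twelve "layers" of depth cost only one layer's worth of parameters. This is a toy stand-in, not ALBERT's actual implementation; the class name and hyperparameters are assumptions.

```python
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Apply the same Transformer layer repeatedly (cross-layer weight sharing)."""

    def __init__(self, d_model=768, nhead=12, num_passes=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_passes = num_passes

    def forward(self, x):
        for _ in range(self.num_passes):
            x = self.layer(x)   # the same weights are reused at every depth
        return x
```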
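For quantization, PyTorch ships a post-training dynamic quantization API that stores linear-layer weights in int8 and quantizes activations on the fly. The sketch below applies it to a toy feed-forward stack; a real BERT model would be passed in the same way, though the papers above may use different quantization schemes.

```python
import torch
import torch.nn as nn

# Toy stand-in for a model; replace with a full BERT model in practice.
model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))

# Convert all nn.Linear modules to int8 dynamic-quantized equivalents.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)
```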
Related papers
(classified by compression method)
Other Related Resources
Sparse Transformer: Concentrated Attention Through Explicit Selection: https://openreview.net/forum?id=Hye87grYDH
Lightweight and Efficient Neural Natural Language Processing with Quaternion Networks: http://arxiv.org/abs/1906.04393
Adaptively Sparse Transformers: https://www.semanticscholar.org/paper/f6390beca54411b06f3bde424fb983a451789733
Compressing BERT for Faster Prediction: https://blog.rasa.com/compressing-bert-for-faster-prediction-2/amp/