The most complete summary of BERT compression methods

    Model compression reduces redundancy in a trained neural network. Compression is especially useful for complex models such as BERT: BERT, and BERT-Large in particular, consumes a great deal of GPU memory and simply will not fit on memory-constrained devices such as smartphones. Of course, improvements in memory footprint and inference speed also translate into large cost savings at scale.

     

    In this article, I have collected and organized a number of papers on BERT compression to share with everyone.

 

Common BERT compression methods

    Pruning - removes unnecessary parts of the network after training. This includes weight pruning, attention-head pruning, layer pruning, and so on. Some methods also apply regularization during training to make the network easier to prune (e.g. layer dropout).
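    As a rough illustration, here is a minimal PyTorch sketch of magnitude-based weight pruning using the built-in torch.nn.utils.prune utilities; the single nn.Linear layer and the 30% sparsity level are illustrative stand-ins, not settings from any particular paper.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for one BERT feed-forward sub-layer.
layer = nn.Linear(768, 3072)

# Zero out the 30% of weights with the smallest absolute value (L1 magnitude pruning).
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Fold the pruning mask into the weight tensor to make the pruning permanent.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Fraction of zeroed weights: {sparsity:.2f}")
```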

     

    Weight factorization - approximates the original parameter matrix by factoring it into the product of two smaller matrices, which imposes a low-rank constraint on the matrix. Weight factorization can be applied both to the token embeddings (which saves a lot of memory on disk) and to the parameters of the feed-forward / self-attention layers (for some speed-up).
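    For instance, here is a minimal sketch of factoring a trained weight matrix with a truncated SVD; the 768x768 matrix and the rank of 64 are arbitrary choices made for illustration.

```python
import torch

# Toy stand-in for a trained 768x768 projection matrix.
W = torch.randn(768, 768)

rank = 64  # the low-rank constraint; chosen only for illustration
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

# Keep only the top-`rank` singular values: W is approximated by A @ B,
# replacing 768*768 parameters with 2 * 768 * 64.
A = U[:, :rank] * S[:rank]   # shape (768, rank)
B = Vh[:rank, :]             # shape (rank, 768)

rel_error = (torch.norm(W - A @ B) / torch.norm(W)).item()
print(f"Relative approximation error: {rel_error:.3f}")
```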

     

    Knowledge distillation - also known as "teacher-student" training. A much smaller Transformer is trained from scratch on the pre-training / downstream data. Normally this would fail, but, for reasons not fully understood, using soft labels from a full-size model improves optimization. Some methods distill BERT into different architectures (such as LSTMs), which have faster inference. Others dig deeper into the teacher, looking not only at its outputs but also at its weight matrices and hidden activations.
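    Below is a minimal sketch of the soft-label distillation loss; the temperature, the loss weighting, and the placeholder logits and labels are illustrative assumptions rather than any specific paper's recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Mix the soft-label loss (against the teacher) with the usual hard-label loss."""
    # Soften both distributions with temperature T; the KL term pulls the
    # student's distribution toward the teacher's.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Placeholder tensors standing in for real model outputs on a 3-class task.
student_logits = torch.randn(8, 3, requires_grad=True)
teacher_logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```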

     

    Weight sharing - some weights in the model share the same values as other parameters in the model. For example, ALBERT uses the same weight matrices for every layer of self-attention in BERT.
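    As a sketch of what ALBERT-style cross-layer sharing looks like, the module below applies one Transformer encoder layer repeatedly, so every "layer" reuses the same weights; the layer sizes are illustrative and this is not ALBERT's actual implementation.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Applies a single encoder layer num_layers times, so all depths share weights."""
    def __init__(self, hidden=768, heads=12, num_layers=12):
        super().__init__()
        # One set of parameters, reused at every depth.
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=3072, batch_first=True
        )
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.layer(x)  # same module (and weights) at every step
        return x

encoder = SharedEncoder()
tokens = torch.randn(2, 128, 768)  # (batch, sequence, hidden)
print(encoder(tokens).shape)       # torch.Size([2, 128, 768])
```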

     

    Quantization - truncates floating-point numbers so they use only a few bits (which can cause rounding errors). The quantized values can be learned either during or after training.
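    A minimal sketch of post-training dynamic quantization in PyTorch is shown below: the weights of linear layers are stored as 8-bit integers and de-quantized on the fly. The toy model is a stand-in for a fine-tuned BERT.

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained model; in practice this would be a fine-tuned BERT.
model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))

# Post-training dynamic quantization: nn.Linear weights become int8,
# activations stay in floating point.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 768)
print(quantized(x).shape)  # same interface, smaller weight storage
```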

     

    Pre-training vs. downstream - some methods compress BERT only with respect to specific downstream tasks. Other methods compress BERT in a task-agnostic way.

 

Related papers

(Classified by compression method)

     

Other Related Resources

    Sparse Transformer: Concentrated Attention Through Explicit Selection: https://openreview.net/forum?id=Hye87grYDH

    Lightweight and Efficient Neural Natural Language Processing with Quaternion Networks: http://arxiv.org/abs/1906.04393

    Adaptively Sparse Transformers: https://www.semanticscholar.org/paper/f6390beca54411b06f3bde424fb983a451789733

    Compressing BERT for Faster Prediction: https://blog.rasa.com/compressing-bert-for-faster-prediction-2/amp/

