Model Compression and Acceleration Methods for Large-Scale Neural Networks

Preamble

In the development of neural networks, increasing a model's capacity generally has a positive effect on its performance. In recent research, the scale of neural network models has indeed shown a steadily increasing trend in the pursuit of better performance. The main ways to increase model capacity are to make the model deeper or wider. Deep networks such as ResNet and BERT have been thoroughly validated in the fields of images, speech, and language modeling, and wide models such as Transformer-Big also bring further performance gains.

However, as models grow, their parameter counts and computational costs grow with them. Current large-scale neural network models typically contain tens of millions or even hundreds of millions of parameters, which makes them difficult to deploy on devices with limited computing and storage resources. As a result, these high-performing models cannot deliver their practical value in scenarios such as everyday communication or applications with personal-privacy constraints.

With the growing demand for combining research with production practice, model compression and acceleration has become one of the hottest research directions. This article briefly introduces some common model compression and acceleration methods (related work is collected at the end of each section for interested readers). These methods reduce the redundancy in a model and turn a complex model into a lighter one. The main categories covered in this article are: knowledge distillation, efficient network architecture design, conditional computation, model pruning, parameter sharing, and quantization.

Knowledge Distillation

The most direct way to reduce a model's parameter count and computation is to train a smaller model. However, the performance of a small model trained from scratch often fails to meet our expectations, so auxiliary techniques are needed to improve it. Knowledge distillation is one such option.

To perform knowledge distillation, we first need a large model, called the "teacher model"; the target model is called the "student model". Knowledge distillation follows the idea of transfer learning: it uses the network's predictions (the hidden states of the output layer or intermediate layers) or its parameters as carriers of knowledge, and transfers the knowledge of the teacher network to the student network. The effectiveness of knowledge distillation rests on two assumptions:
(1) "knowledge" is transferable between models;
(2) the "knowledge" contained in the teacher model is easier for the student model to capture than the "knowledge" in the original data.

For the first assumption, pre-training is a good example. By training on as much data as possible, a model can extract richer features; such a pre-trained model can then be applied to different downstream tasks and bring them significant gains.
For the second assumption, we can call the knowledge contained in the ground-truth labels "hard knowledge" and the knowledge learned by the teacher model "soft knowledge". Although a neural network is a black box to us, we can regard it as an unknown function whose inputs and outputs are given by our training data.

During training, the teacher model learns the relationship between its softmax distribution and the ground-truth labels. For example, when we want to translate the word "cat", the gold label can only tell us the single correct translation. The "soft knowledge" learned by the large model, however, contains richer information: it tells us that "cat" is translated into the correct target word with high probability, into "dog" with a small probability, and into "noodles" with only a tiny probability. By converting the "hard knowledge" in the original data into "soft knowledge" and teaching it to the student model, the teacher lowers the difficulty of the learning task and helps the student achieve better performance.
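
The following is a minimal sketch of this idea in PyTorch, combining the hard-label cross-entropy with a temperature-softened KL term in the spirit of [1]; the temperature `T` and mixing weight `alpha` are illustrative hyperparameters, not values prescribed by any particular paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Mix the usual cross-entropy on hard labels with a KL term on
    temperature-softened teacher/student distributions ("soft knowledge")."""
    # "Hard knowledge": standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # "Soft knowledge": match the student's softened distribution to the teacher's.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes stay comparable across temperatures
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```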

As a simple and effective model compression method, knowledge distillation has been studied extensively in recent years. It is no longer restricted to the setting in which the teacher model is larger than the student model, and variants such as self-distillation and reverse distillation have gradually emerged, opening up more possibilities for its development.

Related literature:

[1] Distilling the Knowledge in a Neural Network
https://arxiv.org/pdf/1503.02531.pdf

The classic paper on knowledge distillation, in which the concept was first proposed. It is called knowledge "distillation" by analogy with a thermodynamic distillation experiment: the "essence" of the knowledge in the teacher network is extracted through "distillation" and then given to the student network to learn.

[2] FitNets: Hints for Thin Deep Nets
https://arxiv.org/pdf/1412.6550.pdf

The innovation of this paper is that the teacher model is a "wide and shallow" model while the student model is a "thin and deep" one; such a student can achieve better performance than the teacher with fewer parameters and less computation.

[3] Sequence-Level Knowledge Distillation
https://arxiv.org/pdf/1606.07947.pdf

This paper proposes sequence-level knowledge distillation. Unlike earlier word-level methods, it transfers knowledge at the sequence level, and on tasks such as machine translation it outperforms word-level knowledge distillation.

[4] Ensemble distillation for neural machine translation
https://arxiv.org/pdf/1702.01802.pdf

This paper uses an ensemble model as the teacher to improve the quality of the distilled knowledge and obtain a better-performing student model.

[5] Self-Knowledge Distillation in Natural Language Processing
https://arxiv.org/pdf/1908.01851.pdf

Proposes self-knowledge distillation (self-distillation), in which the model distills knowledge from its own predictions rather than from a separate teacher.

[6] Selective Knowledge Distillation for Neural Machine Translation
https://arxiv.org/pdf/2105.12967.pdf

The "knowledge" in knowledge distillation is not the more the better, and the existence of some knowledge may even damage the overall performance of the student model. This paper proposes two strategies, batch-level and global-level, to select samples in knowledge distillation.

[7] TinyBERT: Distilling BERT for Natural Language Understanding
https://arxiv.org/pdf/1909.10351.pdf

This paper proposes a knowledge distillation method tailored to the Transformer architecture for compressing and accelerating the pre-trained BERT model. Experiments show that, while matching the performance of the original BERT model, the model size is reduced to 1/7.5 and the inference time to 1/9.4 of the original.

[8] DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
https://arxiv.org/pdf/1910.01108.pdf

This paper also experiments on the BERT model and introduces three loss terms: a supervised masked language model (MLM) loss, a distillation MLM loss, and a word-embedding cosine loss. The resulting model has 40% fewer parameters than the original and runs inference 60% faster.

[9] MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
https://arxiv.org/pdf/2004.02984.pdf

Based on the BERT model, a task-agnostic model compression method is proposed.

Efficient Neural Architecture Design

Efficient network architecture design methods fall into two main categories: manual design and architecture search. Manually designing a neural network requires researchers with domain expertise to design a new network structure, or modify an existing one, according to the characteristics of the task. Take the Transformer, which has achieved remarkable results across many tasks and fields: compared with the traditional RNN, the Transformer is built on the self-attention mechanism and, by explicitly learning the alignment relationships between words, achieves better results than the RNN.

Compared with manual design, neural architecture search looks much "smarter". Its main idea is to "design neural networks with neural networks", usually with reinforcement learning or evolutionary algorithms. Given a well-designed search strategy and search algorithm, architecture search finds the best network structure for a given objective in an automated way. Although most of the work is automated, humans still need to specify the search method and the evaluation strategy, and because architecture search requires a large pre-designed search space, the computation, time, and hardware costs of the search process are high.

Manual design incurs high labor costs, while architecture search incurs high time and hardware costs. Both approaches have their own advantages and disadvantages, and which one is preferable depends on the specific scenario.

Related literature:

[10] Non-autoregressive neural machine translation
https://arxiv.org/pdf/1711.02281.pdf

To understand non-autoregressive neural machine translation, one first needs to know what "autoregressive" means. Autoregression is a property of the decoder in a machine translation model: when generating the current word, the decoder must condition on the previously generated words, so decoding cannot be parallelized. A non-autoregressive model generates its outputs in parallel, which significantly improves inference efficiency.

[11] Deep encoder, shallow decoder: Reevaluating the speed-quality tradeoff in machine translation
https://arxiv.org/pdf/2006.10369.pdf

Current neural machine translation models do not balance the depths of the encoder and decoder well. This paper explores models with encoders and decoders of different depths, and experiments show that a deep-encoder, shallow-decoder model achieves a good trade-off between speed and quality.

[12] Accelerating neural transformer via an average attention network
https://arxiv.org/pdf/1805.00631.pdf

This paper applies an average attention network to the Transformer. It argues that, during decoding, the Transformer does not need to dynamically compute self-attention weights; instead, the information from preceding words can be aggregated with a cumulative-average operation that treats every preceding word as contributing equally to the current word, improving the efficiency of this part of the computation.

[13] Sharing attention weights for fast transformer
https://arxiv.org/pdf/1906.11024.pdf

This paper simplifies the attention computation in the decoder by sharing attention weights across layers, thereby improving the inference efficiency of the Transformer model.

[14] An efficient transformer decoder with compressed sub-layers
https://arxiv.org/pdf/2101.00542.pdf

This paper simplifies the architecture by compressing the Transformer's basic modules, merging the three sub-layers of the decoder into one and thereby achieving a higher degree of parallelism.

[15] Reducing Transformer Depth on Demand with Structured Dropout
https://arxiv.org/pdf/1909.11556.pdf

This paper proposes LayerDrop, a structured dropout that samples sub-networks of the large network during Transformer training; after training, a small network can be obtained by pruning.

[16] Efficient softmax approximation for gpus
https://arxiv.org/pdf/1609.04309.pdf

The adaptive softmax proposed in this paper improves the efficiency of the softmax computation. Since one dimension of the softmax parameter matrix equals the vocabulary size, the method is particularly well suited to networks with large vocabularies.

[17] Reformer: The Efficient Transformer
https://arxiv.org/pdf/2001.04451.pdf

This paper proposes two techniques to reduce the computational cost of the Transformer. First, it replaces dot-product attention with locality-sensitive hashing, reducing the complexity of this part from O(L^2) to O(L log L), where L is the sequence length. Second, it uses reversible residual layers instead of standard residuals, which further reduces the memory consumed during training.

[18] Neural Architecture Search with Reinforcement Learning
https://arxiv.org/pdf/1611.01578.pdf

The pioneering work on Neural Architecture Search (NAS), which combines reinforcement learning with deep learning to design network architectures automatically.

[19] SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
https://arxiv.org/pdf/2006.11316.pdf

This paper explores how grouped convolutions can be used to design networks built on the self-attention mechanism. SqueezeBERT replaces fully connected layers with grouped convolutions and combines this structure with self-attention to improve the network's computational efficiency.

[20] AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search
https://arxiv.org/abs/2001.04246

This paper compresses the BERT model with Differentiable Neural Architecture Search (DNAS), adaptively generating small models with different structures for different tasks.

[21] Progressive Neural Architecture Search
https://arxiv.org/pdf/1712.00559.pdf

This paper proposes a progressive neural architecture search algorithm and shows it to be more efficient than existing search algorithms based on reinforcement learning or evolutionary algorithms.

Conditional Computation

The performance of a neural network model is closely related to the amount of training data. With large-scale training data, a model of insufficient capacity can easily underfit; however, increasing the model size also increases the computational cost.

Conditional computation activates only some sub-networks of the model for each sample; that is, the input sample determines which part of the network is activated. This makes it possible to increase model capacity significantly without increasing the computational cost. In the MoE work at ICLR 2017, the authors proposed an extremely large neural network that increases model capacity by more than 1000x without placing an excessive burden on computational efficiency. Its sparsely gated mixture-of-experts layer consists of thousands of feed-forward networks, and a gating mechanism decides which experts are selected; the full MoE network contains billions of parameters. This work was a big step forward in the exploration of ultra-large-scale neural networks.
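
Below is a minimal sketch of a sparsely gated mixture-of-experts layer with top-k routing; it omits the noisy gating and load-balancing loss of the original paper, and the expert count, hidden size, and loop-based dispatch are purely illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Each token is routed to its top-k experts, so only a small fraction of
    the layer's parameters is active for any given input."""
    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts)
        self.k = k

    def forward(self, x):                        # x: (num_tokens, d_model)
        scores = self.gate(x)                    # (num_tokens, num_experts)
        topk_val, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_val, dim=-1)    # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e    # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```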

Related literature:

[22] Universal Transformers
https://arxiv.org/pdf/1807.03819.pdf

Builds on the Transformer by adding recurrence along the depth dimension; conditional computation controls how many times the computation is repeated, so the required number of layers can be chosen per sample.

[23] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
https://arxiv.org/pdf/1701.06538.pdf

The sparsely gated mixture-of-experts layer proposed in this paper consists of thousands of feed-forward networks. Although the full MoE contains billions of parameters, the model remains computationally efficient.

[24] Depth-Adaptive Transformer
https://arxiv.org/pdf/1910.10073.pdf

This paper proposes a depth-adaptive Transformer. At inference time, the number of layers used can be chosen per input, which improves inference efficiency.

[25] FastBERT: a Self-distilling BERT with Adaptive Inference Time
https://arxiv.org/pdf/2004.02178.pdf

[26] DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference
https://arxiv.org/pdf/2004.12993.pdf

FastBERT and DeeBERT share a similar idea: both accelerate BERT inference by inserting additional classifiers between Transformer layers so that easy inputs can exit early.

Model Pruning

Model pruning reduces the number of parameters and the computational cost of a model by making a dense neural network sparse.

By pruning granularity, model pruning methods are divided into unstructured pruning and structured pruning.

Unstructured pruning is usually performed at the level of individual neurons (weights). Although it can significantly reduce the number of parameters, the underlying compute libraries usually only support structured acceleration, so this kind of unstructured sparsity often does not translate into actual speedups.
Structured pruning removes structures at the level of layers, blocks, channels, and so on. Although such "coarse-grained" pruning has a larger impact on model performance, it yields real acceleration, which gives it a greater advantage in production and makes deployment easier. In practice, the importance of a module can be judged with manually designed criteria, or L1/L2 regularization can be applied to the network; the latter automatically evaluates each module's contribution during training, so the unimportant parts can be cut away "safely".
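
As an illustration of structured pruning, here is a minimal sketch that ranks the output neurons of a linear layer by the L1 norm of their weight rows and keeps only the strongest ones, producing a genuinely smaller layer; the keep ratio is an arbitrary example value, and in a real network any layer consuming this output would have to be sliced to match.

```python
import torch
import torch.nn as nn

def prune_linear_by_l1(layer: nn.Linear, keep_ratio: float = 0.75) -> nn.Linear:
    """Keep only the output neurons whose weight rows have the largest L1 norm."""
    importance = layer.weight.abs().sum(dim=1)            # one score per output neuron
    n_keep = max(1, int(keep_ratio * layer.out_features))
    keep = importance.topk(n_keep).indices.sort().values  # indices of surviving neurons
    pruned = nn.Linear(layer.in_features, n_keep, bias=layer.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(layer.weight[keep])
        if layer.bias is not None:
            pruned.bias.copy_(layer.bias[keep])
    return pruned
```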

In terms of when pruning is performed, model pruning can be divided into pruning before training, pruning during training, and pruning after training. A typical pruning method consists of three stages: training, pruning, and fine-tuning:

(1) Train a "over-parameterized" large network, and use this as a benchmark for pruning. As for why you need to train a large network, you can refer to the lottery hypothesis of ICLR2019 best paper. A large network contains multiple small sub-networks, so a larger network contains more "possibility" and has a greater probability to perform better; (2)
distinguish between "important" and "unimportant" in the network ", cut out the unimportant parts;
(3) fine-tune the pruned model, extract the important parts of the large network separately during the implementation process, and then perform a fast finetune on the training data set. This step It is to reduce the performance loss of the model after pruning as much as possible.

Of the three steps above, performing training and pruning sequentially corresponds to post-training pruning; performing them in parallel corresponds to pruning during training; and scheduling the pruning step before training corresponds to pruning before training. Post-training pruning is currently the most common choice: through a train, iteratively prune, fine-tune process, a good sub-network of the large network is obtained. Although the training cost is high, the method is simple, friendly to engineering practice, and incurs only a small loss in model performance. A minimal sketch of this pipeline follows.
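
The sketch below walks through the three stages using PyTorch's built-in magnitude pruning utilities; the toy two-layer model and the 30% pruning ratio are stand-ins for an actual trained network and a tuned sparsity target.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# (1) Train: this toy model stands in for an already trained, over-parameterized network.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

# (2) Prune: zero out the 30% smallest-magnitude weights of every Linear layer
#     (unstructured, magnitude-based pruning).
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)

# (3) Fine-tune: continue training for a few steps; the pruning mask is applied on
#     every forward pass, so pruned weights stay at zero while the rest adapt.
#     ... the usual training loop over the original data goes here ...

# Finally, fold the masks into the weights so the sparsity becomes permanent.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")
```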

Related literature:

[27] Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
https://arxiv.org/pdf/1510.00149.pdf

The best paper of ICLR 2016, which compresses deep neural networks through pruning, quantization, and Huffman coding.

[28] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
https://arxiv.org/pdf/1803.03635.pdf

The best paper of ICLR 2019, which proposed the "lottery ticket hypothesis".

[29] Are Sixteen Heads Really Better than One?
https://arxiv.org/pdf/1905.10650.pdf

[30] Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned
https://arxiv.org/pdf/1905.09418.pdf

Both papers above prune the multi-head attention in the Transformer. Their experiments show that removing some of the attention heads has little effect on model performance.

[31] DynaBERT: Dynamic BERT with Adaptive Width and Depth
https://arxiv.org/pdf/2004.04037.pdf

Based on the BERT model, sub-networks of different widths and depths are trained, and the model can be pruned directly at inference time, achieving model compression.

[32] LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression
https://arxiv.org/pdf/2004.04124.pdf

LadaBERT uses a hybrid model compression framework that iteratively compresses the model to a target size, combining knowledge distillation, model pruning, and matrix factorization.

Parameter Sharing

Parameter sharing is also called weight sharing. Its main purpose is to reduce the number of model parameters, and therefore the storage a model requires, by sharing weights within the network. For the Transformer, the two most common forms of parameter sharing are:
1) sharing embedding-layer weights between the encoder and the decoder;
2) sharing hidden-layer parameters across layers (as in ALBERT).

For embedding-layer weight sharing, take machine translation as an example. Although the source and target languages are different, tokens that co-occur in both languages (numbers, punctuation, shared characters, etc.) make it possible to share one large vocabulary. Since vocabularies usually contain tens of thousands of entries, the parameter savings from a shared vocabulary are considerable. With the commonly used BPE method, whose smallest unit is the sub-word, languages from the same family such as English and German have a very high sub-word overlap, so sharing the embedding layer can also improve model performance. For a language pair as different as Chinese and English, however, the expected benefit of a shared vocabulary is small.
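
A minimal sketch of this kind of sharing, assuming a joint vocabulary: the source embedding, target embedding, and output projection all point to one weight matrix, so only a single copy is stored.

```python
import torch.nn as nn

class TiedEmbeddings(nn.Module):
    """Source embedding, target embedding, and output projection share one matrix."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.src_embed = nn.Embedding(vocab_size, d_model)
        self.tgt_embed = nn.Embedding(vocab_size, d_model)
        self.out_proj = nn.Linear(d_model, vocab_size, bias=False)
        # Tie all three to a single parameter tensor.
        self.tgt_embed.weight = self.src_embed.weight
        self.out_proj.weight = self.src_embed.weight
```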

The idea of cross-layer parameter sharing is used in works such as ALBERT. Taking a 12-layer BERT-base model as an example, only the parameters of the first layer are learned during training, and that same set of parameters is reused in the remaining 11 layers, rather than letting each of the 12 layers learn its own parameters. Compared with the 110 million parameters of BERT-base, an ALBERT model with the same number of layers and hidden size has only 31 million parameters, so methods of this kind can clearly reduce the parameter count by a large margin.
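
The following sketch shows ALBERT-style cross-layer sharing: one encoder layer is instantiated and applied repeatedly, so the parameter count is that of a single layer regardless of depth. The layer sizes are the BERT-base values mentioned above, but the module itself is purely illustrative.

```python
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """One Transformer encoder layer reused at every depth instead of 12 distinct layers."""
    def __init__(self, d_model=768, nhead=12, num_layers=12):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):   # the same parameters are applied at every depth
            x = self.shared_layer(x)
        return x
```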

Related literature:

[33] ALBERT: A lite BERT for self-supervised learning of language representations
https://arxiv.org/pdf/1909.11942.pdf

Adopts cross-layer parameter sharing and embedding matrix factorization to speed up BERT training and reduce its memory consumption.

[34] Recurrent Stacking of Layers in Neural Networks: An Application to Neural Machine Translation
https://arxiv.org/pdf/2106.10002.pdf

Builds the neural network by recurrently stacking layers. Experiments on the Transformer show that applying one layer's parameters six times achieves performance similar to that of a 6-layer model.

[35] Speeding up Deep Model Training by Sharing Weights and Then Unsharing
https://arxiv.org/pdf/2110.03848.pdf

This paper improves the training efficiency of deep neural networks by sharing weights during an early phase of training and then unsharing them.

Quantization

Quantization maps the 32-bit floating-point values commonly used in models to values represented with fewer bits. Depending on whether the mapping intervals are equal, quantization methods are divided into uniform quantization and non-uniform quantization.
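
As a concrete illustration of uniform (affine) quantization, the sketch below maps a float32 tensor onto the int8 grid with a single scale and zero-point and then dequantizes it back for comparison; the int8 range and the epsilon guard are illustrative choices.

```python
import torch

def quantize_uniform_int8(x: torch.Tensor):
    """Map float32 values to int8 with one scale/zero-point, then dequantize."""
    qmin, qmax = -128, 127
    scale = ((x.max() - x.min()) / (qmax - qmin)).clamp(min=1e-8)
    zero_point = qmin - torch.round(x.min() / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
    x_hat = (q.float() - zero_point) * scale   # dequantized approximation of x
    return q, x_hat
```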

The most common application of quantization is mixed-precision computation. By converting between 32-bit and 16-bit floating point and alternating between them during computation, this method achieves significant speedups on suitable hardware (such as the TITAN V) without hurting model performance. In more extreme cases, 8-bit integers or even lower-bit representations (binary or ternary networks) can be used, which in theory brings even greater speedups. However, such methods are constrained by the hardware and the underlying compute libraries, and actual speedups require support from those libraries.
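
A minimal mixed-precision training step in PyTorch is sketched below; it assumes a CUDA device, and the tiny linear model and random batch are placeholders for a real model and data loader.

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 1024, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():       # the forward pass runs in FP16 where it is safe
    loss = loss_fn(model(x), y)
scaler.scale(loss).backward()         # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)                # gradients are unscaled before the optimizer step
scaler.update()
```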

Related literature:

[36] Mixed precision training
https://arxiv.org/pdf/1710.03740.pdf

A simple and practical technique: mixed-precision computation with FP32 and FP16 significantly improves computational efficiency and reduces GPU memory consumption while preserving model performance.

[37] Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1
https://arxiv.org/pdf/1602.02830.pdf

[38] XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
https://arxiv.org/pdf/1603.05279.pdf

The two papers above use only 1 bit per weight during quantization, a relatively extreme form of quantization.

[39] Pieces of eight: 8-bit neural machine translation
https://arxiv.org/pdf/1804.05038.pdf

[40] Towards Fully 8-bit Integer Inference for the Transformer Model
https://arxiv.org/pdf/2009.08034.pdf

[41] Q8BERT: Quantized 8Bit BERT
https://arxiv.org/pdf/1910.06188.pdf

8-bit quantization can achieve roughly 4x model compression while keeping the loss in model performance as small as possible.
