LITE TRANSFORMER WITH LONG-SHORT RANGE ATTENTION

1. Summary

In this paper, we propose Lite Transformer, an efficient mobile NLP architecture for deploying NLP applications on edge devices. Transformers have become a ubiquitous technique in natural language processing (e.g., machine translation and question answering), but the high computational cost required to achieve strong performance makes them unsuitable for mobile applications, where hardware resources and battery capacity are constrained. The key primitive of Lite Transformer is Long-Short Range Attention (LSRA), in which one set of heads specializes in local context modeling (via convolution), while another set specializes in long-range relationship modeling (via attention). This specialization leads to consistent improvements on three well-established language tasks: machine translation, abstractive summarization, and language modeling. Under constrained resource conditions (500M/100M MACs), Lite Transformer outperforms Transformer on the WMT'14 English-French machine translation task by 1.2/1.7 BLEU, respectively. Lite Transformer reduces the computation of the Transformer base model by 2.5×, with only a 0.3 BLEU drop. Combining pruning and quantization, we further compress the model size of Lite Transformer by 18.2×. For language modeling, Lite Transformer achieves 1.8 lower perplexity than Transformer at around 500M MACs. Remarkably, Lite Transformer outperforms the AutoML-based Evolved Transformer by 0.5 BLEU in the mobile NLP setting, without the expensive architecture search that consumes more than 250 GPU-years.

2. Introduction

Transformer (Vaswani et al., 2017) has been widely used in natural language processing due to its training efficiency and its strength in capturing long-range dependencies. On this basis, modern state-of-the-art models such as BERT (Devlin et al., 2019) are able to learn powerful language representations from unlabeled text and even exceed human performance on challenging question answering tasks.

However, this excellent performance comes at a huge computational cost. For example, a single Transformer model requires over 10G multiply-add operations when translating a sentence of only 30 words. This extremely high demand for computing resources exceeds the capabilities of many edge devices such as smartphones and IoT devices. Therefore, it is of great significance to design an efficient and fast Transformer architecture for real-time NLP applications on the edge. Automated neural architecture search (Zoph & Le, 2017; So et al., 2019) is an option for high-accuracy model design, but the huge search cost (GPU hours and CO2 emissions) raises serious environmental concerns.

In this paper, we focus on efficient inference on mobile devices, where the total number of multiply-add operations is limited below 500M. A straightforward way to reduce the amount of Transformer computation is to shrink the embedding size. Although this can effectively reduce model size and computation, it also weakens the model's ability to capture long- and short-range relationships. To this end, we systematically investigate the decomposition of Transformer computations.

This paper observes that in Transformer, computation (multiply-add operations) is mainly dominated by feed-forward networks (FFNs). We find that the bottleneck-structured Transformer block, which is currently mainstream, is not efficient. Therefore, we propose a novel Long-Short Range Attention (LSRA) primitive. LSRA trades off computation in FFNs for wider attention layers. It expands the bottleneck, increases the ability of the attention layer to capture dependencies, and then reduces the overall computation by shrinking the embedding size while maintaining the same performance. Instead of dedicating one module to "general" information, LSRA assigns specialized heads to long-range and short-range context modeling. Inspired by Wu et al. (2019b), LSRA introduces convolutions in a parallel branch to capture local dependencies, so that the attention branch can focus on capturing global context. By stacking this primitive, we build Lite Transformer, a model suitable for mobile NLP applications.

Extensive experiments show that our Lite Transformer achieves significant improvements over Transformer on three language tasks: machine translation, abstractive summarization, and language modeling. On the IWSLT 2014 German-English machine translation task, it is 3.1 BLEU higher than Transformer under 100M multiply-add operations; on the WMT 2014 English-German task, it exceeds Transformer by 0.4 BLEU under 500M multiply-add operations; and on the WMT 2014 English-French task, it also achieves consistent improvements, gaining 1.2 BLEU under 500M multiply-add operations and 1.7 BLEU under 100M multiply-add operations. Furthermore, combined with common model compression techniques (pruning and quantization), our Lite Transformer achieves an 18.2× reduction in model size. On the summarization task, it reduces the computation of the Transformer base model by 2.4× on the CNN-DailyMail dataset. For language modeling, its perplexity is 1.8 lower than that of Transformer at around 500M multiply-add operations.

Based on our design insights, our manually designed Lite Transformer outperforms the AutoML-based Evolved Transformer (So et al., 2019) by 0.5 BLEU in the mobile NLP setting, whereas the Evolved Transformer requires more than 250 GPU-years of search, emitting as much carbon as five cars over their lifetimes (see Figure 1b). This shows that AutoML is not a panacea: careful analysis and design insights (e.g., removing the bottleneck, specializing the heads) can effectively reduce the search space and improve sample efficiency.

The contributions of this paper are in four aspects:

  1. We perform a systematic analysis of the computational bottleneck structure commonly used in modern neural networks, and find that the bottleneck design is not optimal for 1-D attention when FLOPs are used as the evaluation metric.
  2. We propose a specialized multi-branch feature extractor, Long-Short Range Attention (LSRA), as the basic building block of our Transformer, where convolution helps capture local context while attention focuses on global context.
  3. We build Lite Transformer based on LSRA. Under constrained mobile computing resources (500M multiply-add operations), Lite Transformer exhibits consistent improvements on three widely used machine translation datasets. With additional experiments on other tasks, Lite Transformer also shows high efficiency in multilingual applications.
  4. Compared with the AutoML-searched Evolved Transformer, our Lite Transformer provides a 0.5 higher BLEU score on the WMT En-De dataset in the mobile setting, at roughly 20,000× lower design cost (measured in CO2 emissions). This prompts us to rethink the practicality of AutoML in terms of design cost and "green AI".

3. IS BOTTLENECK EFFECTIVE FOR 1-D ATTENTION?

Attention mechanisms have been widely used in various application domains, including 1-D (language processing (Vaswani et al., 2017)), 2-D (image recognition), and 3-D (video recognition (Wang et al., 2018)). They model short- and long-range relationships by computing pairwise dot products between input elements. Despite its effectiveness, this operation introduces a large amount of computation. Assume the number of elements fed into the attention layer (e.g., the token length in language processing, or the number of pixels in an image) is N and the feature (channel) dimension is d; the pairwise dot products then require N^2 d multiply-add operations. For images and videos, N is usually very large. For example, the intermediate feature map in the video network of Wang et al. (2018) has 16 frames at 112×112 resolution, resulting in N = 2×10^5. The computation of convolutional and fully connected layers grows linearly with N, while the computation of the attention layer grows quadratically with N. As N increases, the computational load of the attention module quickly becomes prohibitive.
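
To make this scaling argument concrete, here is a back-of-the-envelope sketch (our own illustration, not from the paper) comparing the N^2 d cost of the pairwise dot products with the N d^2 cost of a linear or convolutional layer, using the sentence and video-feature-map sizes mentioned above and an assumed channel dimension of d = 512:

```python
# Rough Mult-Adds comparison illustrating the scaling argument above.
# d = 512 is an assumed channel dimension; the N values follow the text.

def attention_cost(n, d):
    # pairwise dot products: grows quadratically with N
    return n * n * d

def linear_cost(n, d):
    # fully connected / convolutional layer: grows linearly with N
    return n * d * d

d = 512
for n in (30, 2 * 10**5):  # a short sentence vs. a video feature map
    att, lin = attention_cost(n, d), linear_cost(n, d)
    print(f"N={n:>6}: attention {att:.2e} vs linear {lin:.2e} (ratio {att / lin:.1f})")
```

For N = 30 the attention term is small compared with the linear layers, whereas for N = 2×10^5 it dominates by orders of magnitude, which is why the quadratic cost is mainly a concern for 2-D and 3-D inputs.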

To address this dilemma, a common practice is to use a linear projection layer to reduce the number of channels d before applying attention, and then increase the dimensionality again afterwards (as shown in Figure 2). In the original Transformer design (Vaswani et al., 2017), the channel dimension in the attention module is 4× smaller than that of the FFN layer. Similarly, in the non-local video network (Wang et al., 2018), the number of channels is halved before the non-local attention module is applied. This saves 16× or 4× computation, respectively. However, it also reduces the context-capture ability of the attention layer because of the smaller feature dimension. For language processing, the situation may be even worse, because attention is the main context-capture module (unlike in images and videos, where convolutional layers perform the bulk of the information capture).
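
A minimal PyTorch sketch of this bottleneck pattern (hypothetical module name and dimensions, not the original implementations): one linear layer shrinks the channel dimension before attention and a second restores it afterwards.

```python
import torch
import torch.nn as nn

class BottleneckAttention(nn.Module):
    """Self-attention applied in a reduced channel dimension (sketch only).

    d_model: the wide (FFN-side) channel dimension
    reduction: channel shrink factor before attention
               (4x in the original Transformer, 2x in non-local video networks)
    """
    def __init__(self, d_model=2048, reduction=4, n_heads=8):
        super().__init__()
        d_attn = d_model // reduction
        self.reduce = nn.Linear(d_model, d_attn)   # shrink channels
        self.attn = nn.MultiheadAttention(d_attn, n_heads, batch_first=True)
        self.expand = nn.Linear(d_attn, d_model)   # restore channels

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        h = self.reduce(x)
        h, _ = self.attn(h, h, h)                  # attention at the reduced width
        return self.expand(h)

x = torch.randn(2, 30, 2048)
print(BottleneckAttention()(x).shape)              # torch.Size([2, 30, 2048])
```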

For tasks such as translation, the length N of the input sequence tends to be small, usually around 20-30. A Transformer block consists of an attention layer (the decoder has two) and a feed-forward network (FFN). For the attention layer, the number of Mult-Adds is O(4Nd^2 + N^2d); for the FFN, it is O(2 × 4Nd^2). Given a small N, it is doubtful whether the bottleneck design achieves a good trade-off between computation and accuracy for 1-D attention. To test this, we first analyze the computation breakdown of Transformer. Surprisingly, for the original Transformer (labeled "Base" in the figure), the FFN layers actually consume most of the computation. This is undesirable since the FFN itself cannot perform any context capture. In summary, because N is small, the bottleneck design cannot significantly reduce the computation of 1-D attention, and the limited saving is further offset by the larger FFN layers. The smaller dimensionality also hurts the capacity of the attention layer, which is the main context-capture unit in Transformer.
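
Plugging representative values into these formulas (N = 30 and d = 512 are assumed here purely for illustration) makes the imbalance concrete:

```python
# Per-block Mult-Adds using the formulas above: attention ~ 4*N*d^2 + N^2*d,
# FFN ~ 2 * 4*N*d^2 (two linear layers with a 4x inner dimension).
N, d = 30, 512                        # assumed sentence length and model width

attn_macs = 4 * N * d**2 + N**2 * d
ffn_macs = 2 * 4 * N * d**2
total = attn_macs + ffn_macs

print(f"attention: {attn_macs / 1e6:.1f}M MACs ({100 * attn_macs / total:.0f}%)")
print(f"FFN:       {ffn_macs / 1e6:.1f}M MACs ({100 * ffn_macs / total:.0f}%)")
```

With these assumed values, the FFN accounts for roughly two thirds of the per-block Mult-Adds, consistent with the observation that the FFN layers consume most of the computation.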

Therefore, we argue that the bottleneck design is not optimal for 1-D attention. We instead design a "flattened" version of the Transformer block that neither reduces nor increases the channel dimension. With this new design, in the flattened Transformer model in Figure 2, the attention part occupies the majority of the computation, leaving more room for further optimization. We also test the performance change of this modification on the WMT'14 En-Fr dataset and can obtain comparable performance with the flattened design.

4. LONG-SHORT RANGE ATTENTION (LSRA)


Researchers have tried to understand the contexts captured by attention. Kovaleva et al. (2019) and Clark et al. (2020) visualize the attention weights of different layers in BERT. As shown in Figure 3b, the attention weights w describe the relationships between the words of the source sentence and those of the target sentence (the same holds for self-attention). A larger weight w_ij (darker color) means the i-th word of the source sentence attends more strongly to the j-th word of the target sentence. Attention maps typically show two distinct patterns: sparse and diagonal. The sparse pattern captures relationships between distant, specific words (long-range information), while the diagonal pattern captures correlations within a small neighborhood. We refer to the former as "global" relations and the latter as "local" relations.

For translation tasks, the attention module must capture both global and local context, which requires a large capacity. This is not optimal compared with a specialized design. Taking hardware as an analogy, general-purpose hardware such as CPUs is less efficient than specialized hardware such as FPGAs; likewise, global and local context capture should each be handled by a dedicated module. When the model capacity is relatively large, this redundancy can be tolerated and may even improve performance. In mobile applications, however, the model must be far more efficient because of computation and power constraints, so specialized context capture becomes essential. To address this, we propose a more specialized architecture, Long-Short Range Attention (LSRA), which captures global and local context with separate branches.

As shown in Figure 3a, our LSRA module adopts a two-branch design. The left branch captures global context, while the right branch models local context. We split the input along the channel dimension into two parts, which are mixed in the subsequent FFN layer. This reduces the overall computation by 2×. The left branch is a normal attention module as in Vaswani et al. (2017), except that its channel dimension is halved. For the right, local-relation branch, a natural idea is to apply convolution over the sequence: with its sliding window, the diagonal pattern can easily be covered. To further reduce computation, we replace the regular convolution with a more lightweight version (Wu et al., 2019b) consisting of linear layers and a depthwise convolution. By placing the attention and convolution modules side by side, we encourage them to take different perspectives on the sentence, so the architecture benefits from specialization and achieves better efficiency.
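
The following PyTorch sketch illustrates the two-branch LSRA idea under our own simplifying assumptions; it uses a plain depthwise 1-D convolution as a stand-in for the lightweight convolution of Wu et al. (2019b), and it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class LSRASketch(nn.Module):
    """Long-Short Range Attention, simplified: the input is split along the
    channel dimension; one half goes through attention (global context),
    the other through a depthwise convolution (local context)."""
    def __init__(self, d_model=256, n_heads=4, kernel_size=7):
        super().__init__()
        d_half = d_model // 2
        # Global branch: standard self-attention on half the channels.
        self.attn = nn.MultiheadAttention(d_half, n_heads, batch_first=True)
        # Local branch: linear + depthwise conv, standing in for the
        # lightweight convolution of Wu et al. (2019b).
        self.conv_in = nn.Linear(d_half, d_half)
        self.dwconv = nn.Conv1d(d_half, d_half, kernel_size,
                                padding=kernel_size // 2, groups=d_half)
        self.conv_out = nn.Linear(d_half, d_half)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        g, l = x.chunk(2, dim=-1)                # split channels in half
        g, _ = self.attn(g, g, g)                # global context
        l = self.conv_in(l).transpose(1, 2)      # (batch, d_half, seq_len)
        l = self.dwconv(l).transpose(1, 2)       # local context via sliding window
        l = self.conv_out(l)
        return torch.cat([g, l], dim=-1)         # the two halves are mixed by the FFN

x = torch.randn(2, 30, 256)
print(LSRASketch()(x).shape)                     # torch.Size([2, 30, 256])
```

In a full model, this block would sit inside the usual residual and layer-normalization structure, with the subsequent FFN mixing the two halves as described above.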

For a better understanding, we visualize the average attention weights of the same layer of a fully trained base Transformer and of our Lite Transformer in Figure 3. It is easy to see that, unlike the base Transformer, which tries to model both global and local context, the attention module in LSRA focuses only on global context capture (no diagonal pattern), leaving local context capture to the convolution branch.
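
As a sketch of how such a visualization can be produced (our own illustration, not the authors' plotting code), one can average a layer's per-head attention weights and render them as a heatmap:

```python
import torch
import matplotlib.pyplot as plt

# Hypothetical attention weights for one layer: (n_heads, tgt_len, src_len).
# In practice these would be extracted from a trained model's attention module.
attn_weights = torch.softmax(torch.randn(4, 20, 20), dim=-1)

avg_weights = attn_weights.mean(dim=0)     # average over heads

plt.imshow(avg_weights.numpy(), cmap="viridis")
plt.xlabel("source position")
plt.ylabel("target position")
plt.title("Average attention weights (one layer)")
plt.colorbar()
plt.savefig("attention_map.png")
```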
