DeLighT: Deep and Light-weight Transformer

Paper: https://arxiv.org/abs/2008.00623

Code: https://github.com/sacmehta/delight

1 Introduction

This paper proposes DeLighT, a deeper and lighter Transformer. DeLighT allocates parameters more efficiently than the standard Transformer, both within each block and across blocks:

(1) Within each block, the DeLighT transformation provides a deep and light-weight mapping;

(2) Across blocks, block-wise scaling allocates shallower and narrower DeLighT blocks near the input and deeper and wider DeLighT blocks near the output.

Overall, the DeLighT network is 2.5 to 4 times deeper than the standard Transformer, yet has fewer parameters and operations. Experiments on machine translation and language modeling tasks show that DeLighT reduces the number of parameters by 2 to 3 times on average while matching or improving the performance of the baseline Transformer.

2 Related Work

2.1 Model Scaling

Model scaling is a standard way to improve the performance of sequence models. In width scaling, the dimensionality of the model is increased; in depth scaling, more blocks are stacked. In both cases (and their combination), the parameters in every block of the network are the same, which can lead to suboptimal solutions. To further improve the performance of sequence models, this paper introduces block-wise scaling, which allows variable-sized blocks and an efficient allocation of parameters within the network.

The findings of the paper show that:

(1) Shallower and narrower DeLighT blocks near the input and deeper and wider DeLighT blocks near the output give the best performance.

(2) Models based on block-wise scaling achieve better performance than models that rely on model scaling alone.

Convolutional neural networks (CNNs) also learn shallower and narrower representations near the input and deeper and wider representations near the output. However, unlike CNNs, which perform a fixed number of operations in each convolutional layer, the proposed block-wise scaling uses a variable number of operations in each layer and block.

2.2 Improved Sequence Models

Important lines of work include:

(1) Improving accuracy with better token-level representations (e.g., BPE), adaptive inputs and outputs, and DeFINE;

(2) Improving efficiency with compression, pruning, and distillation.

The work closest to this paper is the DeFINE transformation, which also learns representations using an expand-reduce strategy. The key difference between the DeFINE transformation (Figure 1c) and the DeLighT transformation (Figure 1d) is that the DeLighT transformation distributes parameters more efficiently within the expansion and reduction layers.

3 DeLighT Transformer

The standard Transformer block is shown in Figure (a):

It models relationships between sequence tokens using query, key, value pairs (multi-head attention) and learns wider representations using a feed-forward network (FFN).

Multi-head attention applies three projections to the input to obtain the queries, keys, and values. Each projection consists of h linear layers (heads) that map the dm-dimensional input into a dh-dimensional space, where dh = dm/h is the head dimension.
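For reference, here is a shape-level sketch of standard multi-head attention using PyTorch's built-in nn.MultiheadAttention; dm = 512 and h = 8 are assumed example values, not taken from the paper's configuration tables.

```python
import torch
import torch.nn as nn

# Shape-level sketch of standard multi-head attention (d_m = 512, h = 8 are
# assumed example values). Each of the Q, K, V projections maps d_m -> d_m and
# is split into h heads of dimension d_h = d_m // h = 64.
d_m, h = 512, 8
mha = nn.MultiheadAttention(embed_dim=d_m, num_heads=h, batch_first=True)

x = torch.randn(2, 10, d_m)          # (batch, sequence length n, d_m)
out, attn_weights = mha(x, x, x)     # self-attention: out has shape (2, 10, d_m)
```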

The FFN consists of two linear layers (a minimal sketch follows the two steps below):

Step 1: expand the dimension from dm to df (typically df = 4dm);

Step 2: reduce the dimension back from df to dm.
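A minimal sketch of this expand-then-reduce FFN; dm = 512 and df = 4·dm are assumed values for illustration.

```python
import torch.nn as nn

# Standard Transformer FFN: expand d_m -> d_f, then reduce d_f -> d_m.
d_m, d_f = 512, 2048          # assumed: d_f = 4 * d_m
ffn = nn.Sequential(
    nn.Linear(d_m, d_f),      # Step 1: dimensional expansion
    nn.ReLU(),
    nn.Linear(d_f, d_m),      # Step 2: dimensional reduction
)
```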

The depth of a Transformer block is 4. Transformer-based networks are typically designed by stacking Transformer blocks sequentially to increase network capacity and depth.

3.1 DeLighT Transformation

The DeLighT transformation first maps the dm-dimensional input vector to a high-dimensional space (expansion) and then reduces it to a do-dimensional output vector (reduction) using N layers of group linear transformations, as shown in Figure 1b.

In the expansion and reduction phases, the DeLighT transformation uses group linear transformations (GLTs), because they learn local representations by deriving the output from a specific part of the input and are therefore more efficient than full linear transformations. To learn global representations, the DeLighT transformation shares information between the different groups of the GLT using feature shuffling, similar to channel shuffling in convolutional networks.
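Below is a minimal, hypothetical sketch of a group linear transformation with feature shuffling. It is not the authors' implementation (see the linked repository for that); it only illustrates how each group sees a slice of the input and how shuffling mixes information across groups.

```python
import torch
import torch.nn as nn

class GroupLinear(nn.Module):
    """Sketch of a group linear transformation (GLT) with feature shuffling."""
    def __init__(self, in_dim, out_dim, groups):
        super().__init__()
        assert in_dim % groups == 0 and out_dim % groups == 0
        self.groups = groups
        # One weight per group: each group only sees its slice of the input.
        self.weight = nn.Parameter(
            torch.randn(groups, in_dim // groups, out_dim // groups) * 0.02)

    def forward(self, x):                       # x: (batch, in_dim)
        b = x.size(0)
        x = x.view(b, self.groups, -1)          # split features into groups
        x = torch.einsum('bgi,gio->bgo', x, self.weight)  # per-group linear map
        # Feature shuffle: interleave group outputs so the next GLT layer
        # mixes information across groups (global representation).
        x = x.transpose(1, 2).contiguous()
        return x.view(b, -1)
```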

The standard way to increase the expressiveness and capacity of a Transformer is to increase the input dimension dm. However, increasing dm also increases the complexity of multi-head attention in the standard Transformer block, which is O(dm·n²), where n is the sequence length. In contrast, to increase the expressiveness and capacity of the DeLighT block, the expansion and reduction phases increase the depth and width of the intermediate DeLighT transformation. This enables DeLighT to compute attention with a smaller dimension and fewer operations.

The DeLighT transformation is controlled by 5 configuration parameters:

(1) the number of GLT layers N; (2) the width multiplier wm; (3) the input dimension dm; (4) the output dimension do; (5) the maximum number of groups gmax in a GLT.

In the expansion phase: the DeLighT transformation projects the dm-dimensional input into a high-dimensional space, dmax = wm·dm, using the first ⌈N/2⌉ GLT layers.

In the reduction phase: the DeLighT transformation uses the remaining N − ⌈N/2⌉ GLT layers to project the dmax-dimensional vector into the do-dimensional space.
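As an illustration only, the sketch below lays out per-layer widths of the DeLighT transformation assuming a simple linear interpolation from dm up to dmax and back down to do; the exact width schedule used in the paper may differ.

```python
import math

def delight_layer_dims(d_m, d_o, N, w_m):
    """Assumed linear width schedule for the N GLT layers of a DeLighT transform."""
    d_max = int(w_m * d_m)
    expand = math.ceil(N / 2)                 # first ceil(N/2) layers expand
    dims = []
    for i in range(N):
        if i < expand:                        # expansion: d_m -> d_max
            frac = (i + 1) / expand
            dims.append(int(d_m + frac * (d_max - d_m)))
        else:                                 # reduction: d_max -> d_o
            frac = (i + 1 - expand) / (N - expand)
            dims.append(int(d_max + frac * (d_o - d_max)))
    return dims

print(delight_layer_dims(d_m=512, d_o=256, N=4, w_m=2))  # [768, 1024, 640, 256]
```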

 3.2 DeLighT Block

Figure (b) shows how the DeLighT transformation is integrated into the Transformer block to improve its efficiency. The dm-dimensional inputs are first fed into the DeLighT transformation to produce do-dimensional outputs, where do < dm. These do-dimensional outputs are then fed into single-head attention, followed by a lightweight FFN, to model the relationships between them.

DeLighT layer and single-head attention: Suppose there is a sequence of n input tokens, each of dimension dm. These n dm-dimensional inputs are first fed into the DeLighT transformation to produce n do-dimensional outputs, where do < dm. These n do-dimensional outputs are then projected simultaneously using three linear layers to produce the do-dimensional queries Q, keys K, and values V. Scaled dot-product attention is then used to model the contextual relationships among the n tokens. To enable residual connections, the do-dimensional output of this attention operation is linearly projected into the dm-dimensional space.

The hypothesis is that DeLighT's ability to learn wider representations allows replacing multi-head attention with single-head attention. The computational costs of computing attention in the standard Transformer and in the DeLighT block are O(dm·n²) and O(do·n²), respectively, where do < dm. The DeLighT block therefore reduces the cost of computing attention by a factor of dm/do. In the experiments, do = dm/2 is used, so the required multiply-add operations are reduced by a factor of 2 compared to the Transformer architecture.
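A minimal sketch of this single-head attention over the do-dimensional DeLighT outputs, with a final projection back to dm for the residual connection; module and parameter names here are hypothetical, not the authors' code.

```python
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    """Single-head scaled dot-product attention on d_o-dimensional inputs."""
    def __init__(self, d_m, d_o):
        super().__init__()
        self.q = nn.Linear(d_o, d_o)
        self.k = nn.Linear(d_o, d_o)
        self.v = nn.Linear(d_o, d_o)
        self.out = nn.Linear(d_o, d_m)   # project back to d_m for the residual

    def forward(self, x):                # x: (batch, n, d_o) from the DeLighT transform
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)   # cost ~ O(d_o * n^2)
        attn = F.softmax(scores, dim=-1)
        return self.out(attn @ v)        # (batch, n, d_m)
```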

Lightweight FFN: Similar to the FFN in the Transformer, this block also consists of two linear layers. Since the DeLighT block already incorporates wider representations via the DeLighT transformation, it allows inverting the functionality of the Transformer's FFN layers: the first layer reduces the dimensionality of the input from dm to dm/r, while the second layer expands it back from dm/r to dm, where r is the reduction factor. The lightweight FFN reduces the number of parameters and operations in the FFN by a factor of r·df/dm. In the standard Transformer, the FFN expands the dimension by a factor of 4 (df = 4dm); with r = 4, the lightweight FFN therefore has 16 times fewer parameters than the standard FFN.
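A minimal sketch of the lightweight FFN with the reduce-then-expand structure described above; dm = 512 is an assumed value, and r = 4 as in the text.

```python
import torch.nn as nn

# Lightweight FFN: reduce d_m -> d_m/r, then expand back to d_m (inverse of the
# standard FFN's expand-then-reduce).
d_m, r = 512, 4
light_ffn = nn.Sequential(
    nn.Linear(d_m, d_m // r),   # reduce
    nn.ReLU(),
    nn.Linear(d_m // r, d_m),   # expand back
)
# Parameters: ~2 * d_m * (d_m / r) = 0.5 * d_m^2, versus ~2 * d_m * 4 * d_m = 8 * d_m^2
# in the standard FFN, i.e. roughly 16x fewer when r = 4 and d_f = 4 * d_m.
```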

The DeLighT block thus comprises:

(1) one DeLighT transformation with N GLTs;

(2) three parallel linear layers for the keys, queries, and values;

(3) a projection layer;

(4) the two linear layers of the lightweight FFN.

Therefore, the depth of a DeLighT block is N + 4. Compared to the standard Transformer block (depth 4), DeLighT blocks are deeper.

3.3 Block-Wise Scaling

Standard approaches to improving the performance of sequence models include increasing the model dimensions (width scaling), stacking more blocks (depth scaling), or both. However, such scaling is not very effective, especially on small datasets. To create deeper and wider networks, DeLighT instead scales the model at the block level.

Scaling the DeLighT block: The DeLighT block learns deep and wide representations using the DeLighT transformation, whose depth and width are controlled by two configuration parameters: the number of GLT layers N and the width multiplier wm (Figure a). These parameters allow increasing the number of learnable parameters inside a DeLighT block independently of the input dimension dm and the output dimension do. Such calibration is not possible with the standard Transformer block, whose expressiveness and capacity are a function of the input (input dimension = number of heads × head dimension). This paper therefore introduces block-wise scaling, which creates a network with DeLighT blocks of different sizes, allocating shallower and narrower DeLighT blocks near the input and deeper and wider DeLighT blocks near the output.

To this end, two network-wide configuration parameters are introduced: the minimum (Nmin) and maximum (Nmax) number of GLTs in a DeLighT transformation. For the b-th DeLighT block, the number of GLTs Nb and the width multiplier wmb of its DeLighT transformation are computed by linear scaling between these bounds. With this scaling, each DeLighT block has a different depth and width.
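A small sketch of this linear block-wise scaling: the interpolation for Nb follows the description above, while the exact formula for wmb is an assumption here (the paper defines it precisely in its scaling equation).

```python
def blockwise_config(B, N_min, N_max, w_m):
    """Per-block (N_b, w_m^b) under an assumed linear scaling rule, b = 0..B-1."""
    configs = []
    for b in range(B):
        N_b = round(N_min + (N_max - N_min) * b / (B - 1))          # depth interpolation
        w_b = w_m + (N_max - N_min) * b / (N_min * (B - 1))         # assumed width rule
        configs.append((N_b, w_b))
    return configs

for b, (N_b, w_b) in enumerate(blockwise_config(B=4, N_min=4, N_max=8, w_m=2)):
    print(f"block {b}: N_b={N_b}, w_m^b={w_b:.2f}")
```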

Network Depth: The depth of the Transformer block is fixed at 4; therefore, previous work has equated the depth of Transformer-based networks with the number of Transformer blocks. This paper takes a different perspective for learning deeper representations, in which each block has a variable size. To compute network depth, the standard definition used across different fields, including computer vision and theoretical machine learning, is adopted: network depth is the number of consecutive learnable layers (e.g., convolutional layers, linear layers, or group linear layers). Under this definition, the depths of a DeLighT network and a Transformer network with B blocks are ∑_{b=0}^{B−1} (Nb + 4) and 4B, respectively.
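Under this definition, the depth comparison can be written out directly; the per-block values below are chosen only for illustration.

```python
# Depth of a B-block DeLighT network vs. a B-block Transformer, with assumed N_b values.
N_per_block = [4, 5, 7, 8]                              # N_b for b = 0..B-1 (illustrative)
delight_depth = sum(N_b + 4 for N_b in N_per_block)     # sum_b (N_b + 4) = 40
transformer_depth = 4 * len(N_per_block)                # 4B = 16
print(delight_depth, transformer_depth)
```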

4 Experiments

On machine translation and language modeling benchmarks, DeLighT matches or improves over the baseline Transformer while using noticeably fewer parameters and operations; see the paper for the detailed results.


Origin blog.csdn.net/Zosse/article/details/125798438