[Natural Language Processing | Language Models] Common Language Model Algorithms: An Introductory Collection (7)

1. DeeBERT

DeeBERT is a method for accelerating BERT inference. It inserts an extra classification layer (called an off-ramp) between adjacent transformer layers of BERT. All transformer layers and off-ramps are jointly fine-tuned on the given downstream dataset. During inference, after a sample passes through a transformer layer, it is handed to the following off-ramp. If the off-ramp is confident in its prediction, the result is returned immediately; otherwise, the sample is sent on to the next transformer layer.
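
A minimal PyTorch sketch of this early-exit loop, assuming `layers` is a list of transformer-layer modules and `off_ramps` a matching list of classifier heads; the names and the entropy threshold are illustrative stand-ins for DeeBERT's confidence criterion, not the paper's API.

```python
import torch

def deebert_style_inference(hidden, layers, off_ramps, entropy_threshold=0.3):
    """Early-exit loop: run one transformer layer at a time and stop as soon as
    an off-ramp classifier is confident enough (low prediction entropy)."""
    for layer, ramp in zip(layers, off_ramps):
        hidden = layer(hidden)                  # one transformer layer, (batch, seq, dim)
        logits = ramp(hidden[:, 0])             # classify from the [CLS] position
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
        if entropy.item() < entropy_threshold:  # confident -> return early (assumes batch size 1)
            return logits
    return logits                               # no ramp was confident: use the last prediction
```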


2. Probabilistically Masked Language Model

Probabilistically Masked Language Model (PMLM) is a masked language model that uses a probabilistic masking scheme, aiming to bridge the gap between masked language models and autoregressive language models. The basic idea behind connecting the two types of models is similar to MADE (Germain et al., 2015). The probabilistic masking scheme defines how sequences are masked by following a probability distribution over masking ratios. The authors adopt a simple uniform distribution over the masking ratio and name the resulting model u-PMLM.
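
A small sketch of the uniform masking scheme described above; the function name and interface are illustrative, but the key step is drawing the masking ratio itself from a uniform distribution before masking that fraction of positions.

```python
import torch

def upmlm_mask(tokens, mask_token_id):
    """Mask a 1-D token tensor with a masking ratio drawn from U(0, 1),
    as in the uniform scheme of u-PMLM."""
    seq_len = tokens.size(0)
    ratio = torch.rand(1).item()                      # masking ratio ~ Uniform(0, 1)
    num_masked = max(1, int(round(ratio * seq_len)))
    positions = torch.randperm(seq_len)[:num_masked]  # which positions get masked
    masked = tokens.clone()
    masked[positions] = mask_token_id
    return masked, positions                          # the model learns to recover tokens[positions]
```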


3. Table Pre-training via Execution

TAPEX is a conceptually simple and empirically powerful pre-training method that can provide tabular reasoning skills to existing models. TAPEX implements table pre-training by learning a neural SQL executor on a synthetic corpus obtained by automatically synthesizing executable SQL queries.
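
A hedged sketch of what one synthetic pre-training example might look like: the table linearisation and the `flatten_table` helper are illustrative stand-ins, not TAPEX's exact encoding. The point is that the encoder reads a SQL query plus a flattened table, and the decoder is trained to emit the query's execution result.

```python
# One synthetic pre-training example (illustrative format only).
table = {"header": ["city", "population"],
         "rows": [["berlin", "3644826"], ["hamburg", "1841179"]]}

def flatten_table(t):
    # Linearise the table into a single string; the real TAPEX format may differ.
    head = "col : " + " | ".join(t["header"])
    rows = " ".join(f"row {i + 1} : " + " | ".join(r) for i, r in enumerate(t["rows"]))
    return head + " " + rows

sql = "select city where population > 2000000"
encoder_input = sql + " " + flatten_table(table)   # what the seq2seq encoder reads
decoder_target = "berlin"                          # the result of executing the SQL on the table
```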

4. Fastformer

Fastformer is a Transformer variant that uses additive attention as its building block. Instead of modeling pairwise interactions between tokens, additive attention is used to model the global context, and each token representation is then further transformed based on its interaction with the global context representation.
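
A single-head PyTorch sketch of this additive-attention idea (scaling factors and some details of the paper's output transformation are omitted): tokens are pooled into a global query, which modulates the keys element-wise; the modulated keys are pooled into a global key, which in turn modulates the values.

```python
import torch
import torch.nn as nn

class AdditiveAttentionSketch(nn.Module):
    """Single-head sketch of Fastformer-style additive attention."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.wq = nn.Linear(dim, 1)   # scores for pooling token queries into a global query
        self.wk = nn.Linear(dim, 1)   # scores for pooling modulated keys into a global key
        self.out = nn.Linear(dim, dim)

    def forward(self, x):             # x: (batch, seq, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        global_q = (torch.softmax(self.wq(q), dim=1) * q).sum(dim=1, keepdim=True)
        p = global_q * k              # element-wise interaction with the global query
        global_k = (torch.softmax(self.wk(p), dim=1) * p).sum(dim=1, keepdim=True)
        u = global_k * v              # element-wise interaction with the global key
        return self.out(u) + q        # transform and add back to the query representations
```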


5. Parallel Layers

Parallel layers: a "parallel" formulation (Wang and Komatsuzaki, 2021) is used in each Transformer block instead of the standard "serial" formulation. Specifically, the standard formulation can be written as:
y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))

The parallel formula can be written as:
y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))

Since the MLP and Attention input matrix multiplications can be fused, the parallel formulation speeds up large-scale training by roughly 15%. Ablation experiments showed a slight quality degradation at the 8B scale but none at the 62B scale, so the authors extrapolate that parallel layers should be quality-neutral at the 540B scale.
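
A minimal PyTorch sketch contrasting the two formulations above; `attn` and `mlp` stand for arbitrary attention and feed-forward modules, and the blocks follow the formulas exactly as written.

```python
import torch.nn as nn

class SerialBlock(nn.Module):
    """The 'serial' formula quoted above: the MLP reads the attention-updated activations."""
    def __init__(self, attn, mlp, dim):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        return x + self.mlp(self.ln2(x + self.attn(self.ln1(x))))

class ParallelBlock(nn.Module):
    """The 'parallel' formula: attention and MLP both read the same LayerNorm(x),
    so their input projections can be fused into a single matmul."""
    def __init__(self, attn, mlp, dim):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.ln = nn.LayerNorm(dim)

    def forward(self, x):
        h = self.ln(x)
        return x + self.attn(h) + self.mlp(h)
```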

6. Single Headed Attention RNN (SHA-RNN)

SHA-RNN (Single Headed Attention RNN) is a recurrent neural network language model that combines an embedding input layer and a softmax classifier with a core LSTM component and a single-head attention module. Other design choices include the use of a Boom feedforward layer and layer normalization. The author's guiding principles were to keep the architecture simple and to limit computational cost (the model was originally trained on a single GPU).
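
A short sketch of the Boom layer mentioned above, following the paper's memory-saving variant: project up by a factor, apply a nonlinearity, then sum the chunks back down instead of using a second large matrix multiplication.

```python
import torch.nn as nn

class Boom(nn.Module):
    """Boom feedforward layer: up-project by `factor`, apply a nonlinearity,
    then sum the chunks back down to the model dimension."""
    def __init__(self, dim, factor=4):
        super().__init__()
        self.up = nn.Linear(dim, dim * factor)
        self.act = nn.GELU()
        self.factor = factor

    def forward(self, x):
        h = self.act(self.up(x))                  # (..., dim * factor)
        return sum(h.chunk(self.factor, dim=-1))  # cheap down-projection by summing chunks
```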


7. Nyströmformer

Nyströmformer replaces the self-attention in BERT-small and BERT-base with the proposed Nyström approximation. This reduces the complexity of self-attention to O(n), allowing the Transformer to support longer sequences.
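
A rough single-head sketch of the Nyström approximation. The actual model forms landmarks from segment means, uses an iterative pseudo-inverse approximation, and adds a convolutional skip connection; this sketch uses an exact pseudo-inverse and omits the skip, but shows how attention is routed through a small set of landmark rows so cost grows linearly in sequence length.

```python
import torch

def nystrom_attention_sketch(q, k, v, num_landmarks=8):
    """Approximate softmax(q k^T / sqrt(d)) v through `num_landmarks` landmark rows.
    Assumes single-head inputs of shape (n, d) with n divisible by num_landmarks."""
    n, d = q.shape
    q_land = q.reshape(num_landmarks, n // num_landmarks, d).mean(dim=1)  # segment-mean landmarks
    k_land = k.reshape(num_landmarks, n // num_landmarks, d).mean(dim=1)
    scale = d ** -0.5
    f1 = torch.softmax(q @ k_land.T * scale, dim=-1)        # (n, m)
    a = torch.softmax(q_land @ k_land.T * scale, dim=-1)    # (m, m)
    f2 = torch.softmax(q_land @ k.T * scale, dim=-1)        # (m, n)
    return f1 @ torch.linalg.pinv(a) @ (f2 @ v)             # linear in n for fixed m
```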


8. Gated Convolutional Network

A gated convolutional network is a language model that combines convolutional layers with a gating mechanism. Zero padding is used to ensure that no future context is seen. Gated convolutional layers can be stacked on top of one another, and model predictions are then obtained through an adaptive softmax layer.
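
A PyTorch sketch of one such layer: left-only zero padding keeps future tokens out of view, and a gated linear unit (A * sigmoid(B)) modulates the convolution output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalGatedConv(nn.Module):
    """One gated convolutional layer with causal (left-only) zero padding and a GLU gate."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1                 # pad only on the left (the past side)
        self.conv = nn.Conv1d(dim, 2 * dim, kernel_size)

    def forward(self, x):                          # x: (batch, dim, seq)
        h = self.conv(F.pad(x, (self.pad, 0)))     # zero-pad so position t never sees t+1, t+2, ...
        a, b = h.chunk(2, dim=1)
        return a * torch.sigmoid(b)                # GLU gate
```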


9. AutoTinyBERT

AutoTinyBERT is an efficient BERT variant discovered through neural architecture search. Specifically, one-shot learning is used to obtain a large super pre-trained language model (SuperPLM), with pre-training or task-agnostic BERT distillation as the objective. An evolutionary algorithm is then run on the SuperPLM to search for the optimal architecture under a given latency constraint. Finally, the corresponding sub-models are extracted based on the optimal architectures and trained further.
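
A toy skeleton of the search stage described above, purely to illustrate the loop structure; `sample_subnet`, `evaluate`, `latency_of`, and `mutate` are hypothetical stand-ins, not AutoTinyBERT's actual API.

```python
import random

def evolutionary_search(sample_subnet, evaluate, latency_of, mutate,
                        latency_limit, population=20, generations=10):
    """Keep a pool of sub-architectures drawn from the SuperPLM, score the ones
    that satisfy the latency constraint, and evolve the best candidates."""
    pool = [sample_subnet() for _ in range(population)]
    best = None
    for _ in range(generations):
        scored = sorted(((evaluate(a), a) for a in pool if latency_of(a) <= latency_limit),
                        key=lambda t: t[0], reverse=True)
        if scored and (best is None or scored[0][0] > best[0]):
            best = scored[0]
        parents = [a for _, a in scored[: max(1, population // 4)]] or pool
        # Next generation: elite parents plus mutated copies of random parents.
        pool = parents + [mutate(random.choice(parents))
                          for _ in range(population - len(parents))]
    return best[1] if best else None
```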


10. PermuteFormer

PermuteFormer is a Performer-based model with relative position encoding that scales linearly with sequence length. PermuteFormer applies position-dependent transformations to queries and keys, encoding positional information into the attention module. The transformation is carefully designed so that the final output of self-attention is not affected by the absolute positions of tokens.

In the paper's figure, each token's query/key feature is drawn as a row of blocks, with its elements marked in different colors. The position-aware permutation permutes the elements of each token's query/key feature along the head-size dimension within each attention head, and the permutation that is applied differs depending on the token's position.
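
A small sketch of such a position-aware permutation, assuming a single head and a fixed base permutation that is composed with itself once per position; the paper's exact permutation scheme may differ, but the output at position t is that token's query/key feature permuted in a position-dependent way along the head-size dimension.

```python
import torch

def position_aware_permutation(qk, base_perm):
    """Permute each token's query/key feature along the head-size dimension,
    with the permutation depending on the token's position (perm^t at position t)."""
    seq_len, head_dim = qk.shape
    out = torch.empty_like(qk)
    idx = torch.arange(head_dim)       # position 0: identity permutation
    for pos in range(seq_len):
        out[pos] = qk[pos, idx]        # apply the position-specific permutation
        idx = idx[base_perm]           # compose once more for the next position
    return out

# Example: one head with head_dim = 4 and a fixed base permutation.
qk = torch.randn(6, 4)
permuted = position_aware_permutation(qk, torch.tensor([2, 0, 3, 1]))
```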


11. NormFormer

NormFormer is a Pre-LN transformer that adds three normalization operations to each layer: a layer norm after self-attention, head-wise scaling of the self-attention outputs, and a layer norm after the first fully connected layer. These modifications introduce a small number of additional learnable parameters, which give each layer a cost-effective way to change the magnitude of its features, and thus the size of the gradients reaching subsequent components.
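
A PyTorch sketch of two of the three additions: the extra LayerNorm after the first fully connected layer, and the head-wise scaling of attention outputs (the tensor shapes are assumptions for illustration).

```python
import torch
import torch.nn as nn

class NormFormerFFN(nn.Module):
    """Feed-forward sublayer with the extra LayerNorm after the first fully connected layer."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(dim, hidden), nn.Linear(hidden, dim)
        self.pre_ln = nn.LayerNorm(dim)      # standard Pre-LN
        self.mid_ln = nn.LayerNorm(hidden)   # NormFormer addition: LN after the first FC
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.fc2(self.mid_ln(self.act(self.fc1(self.pre_ln(x)))))

class HeadScale(nn.Module):
    """Head-wise scaling of self-attention outputs: one learnable scalar per head."""
    def __init__(self, num_heads):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_heads))

    def forward(self, attn_out):             # attn_out: (batch, heads, seq, head_dim)
        return attn_out * self.gamma.view(1, -1, 1, 1)
```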


12. BP-Transformer

BP-Transformer (BPT) is a Transformer variant motivated by the need for a better balance between self-attention capability and computational complexity. The architecture partitions the input sequence into multi-scale spans via binary partitioning (BP). It incorporates an inductive bias by which attended context moves from fine-grained to coarse-grained as relative distance increases: the further away the contextual information is, the coarser its representation. BPT can be viewed as a graph neural network whose nodes are the multi-scale spans. Token nodes attend to smaller-scale spans for nearby context and larger-scale spans for more distant context, and node representations are updated via graph self-attention.
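
A tiny sketch of the binary partitioning step: recursively splitting the token index range produces the multi-scale spans that become the graph's nodes.

```python
def binary_partition_spans(start, end):
    """Recursively split [start, end) into halves, collecting every span,
    so the result contains nodes at every scale down to single tokens."""
    spans = [(start, end)]
    if end - start > 1:
        mid = (start + end) // 2
        spans += binary_partition_spans(start, mid)
        spans += binary_partition_spans(mid, end)
    return spans

# Example: a length-8 sequence yields the full span, halves, quarters, and single tokens.
print(binary_partition_spans(0, 8))
```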



Origin blog.csdn.net/wzk4869/article/details/133100801