[Natural Language Processing | Transformers] A Collection of Common Transformer Algorithms (4)

1. BigBird

BigBird is a Transformer with a sparse attention mechanism that reduces the quadratic dependence of self-attention on sequence length to a linear one in the number of tokens. BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic full-attention model. Specifically, BigBird's attention consists of three main parts (a minimal sketch follows the list below):

A set of global tokens that attend to all parts of the sequence.
All tokens attend to a set of local neighboring tokens.
All tokens attend to a set of random tokens.
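
The following is a minimal sketch (not the official implementation) of how these three components combine into a single sparse attention mask; the specific sizes for the global, window, and random sets are illustrative assumptions.

```python
# Build a BigBird-style sparse attention mask: global tokens, a sliding
# local window, and a few random tokens per query.
import numpy as np

def bigbird_mask(seq_len, num_global=2, window=3, num_random=2, seed=0):
    """Return a boolean (seq_len, seq_len) mask; True = attention allowed."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # 1) Global tokens attend to everything and are attended to by everything.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # 2) Each token attends to a local window of neighboring tokens.
    for i in range(seq_len):
        lo, hi = max(0, i - window // 2), min(seq_len, i + window // 2 + 1)
        mask[i, lo:hi] = True

    # 3) Each token additionally attends to a few random tokens.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask

print(bigbird_mask(8).astype(int))
```

Because every query attends to a constant number of keys (global + window + random), the cost of applying this mask grows linearly with the number of tokens rather than quadratically.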


2. Levenshtein Transformer

Levenshtein Transformer (LevT) is a Transformer designed to address the lack of flexibility of previous decoding models. Notably, in previous frameworks the length of the generated sequence is either fixed or grows monotonically as decoding proceeds. The authors argue that this is incompatible with human-level intelligence: humans can revise, replace, undo, or delete any part of the text they generate. LevT therefore bridges this gap by breaking the currently standardized decoding mechanism and replacing it with two basic operations, insertion and deletion.

LevT is trained using imitation learning. The resulting model contains two policies, which are executed in an alternating manner. The authors argue that this makes decoding more flexible. For example, when the decoder is given an empty token sequence, it falls back to a normal sequence generation model. On the other hand, when the initial state is a low-quality generated sequence, the decoder acts as a refinement model.

One of the key components of the LevT framework is the learning algorithm. The authors exploit the fact that insertion and deletion are complementary but also adversarial. The algorithm they propose is called "dual-policy learning": when training one policy (insertion or deletion), the output of its adversary from the previous iteration is used as input, while an expert policy provides the correction signal.
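
The toy sketch below only illustrates the alternating decode loop described above; the two policies here are hypothetical stand-ins (simple Python functions), not the trained Transformer policies from the paper.

```python
# Toy LevT-style iterative refinement: alternate a deletion step and an
# insertion step, starting from either an empty sequence or a noisy draft.
def delete_policy(tokens):
    # Placeholder: drop duplicated adjacent tokens.
    return [t for i, t in enumerate(tokens) if i == 0 or t != tokens[i - 1]]

def insert_policy(tokens, hint="<unk>"):
    # Placeholder: append a hint token until a target length is reached.
    return tokens + [hint] if len(tokens) < 6 else tokens

def levt_decode(initial, max_iters=3):
    """Alternate deletion and insertion over a few refinement iterations."""
    seq = list(initial)
    for _ in range(max_iters):
        seq = delete_policy(seq)   # refine: remove bad tokens
        seq = insert_policy(seq)   # generate: fill in missing tokens
    return seq

# Empty start -> behaves like plain generation; noisy draft -> refinement.
print(levt_decode([]))
print(levt_decode(["the", "the", "cat", "cat", "sat"]))
```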


3. Primer

Primer is an architecture that improves on the original Transformer. Its two improvements were discovered through neural architecture search: a squared ReLU activation in the feed-forward block, and depthwise convolutions added after the multi-head attention projections, yielding Multi-DConv-Head Attention (MDHA).
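
A minimal PyTorch sketch of the two modifications follows; the shapes and module layout are assumptions for illustration, not Primer's released code.

```python
# Squared ReLU for the feed-forward block, and a 3-tap depthwise convolution
# applied along the sequence dimension after a multi-head projection (MDHA).
import torch
import torch.nn as nn

def squared_relu(x):
    return torch.relu(x) ** 2

class DepthwiseConvAfterProjection(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        # groups == channels -> one independent 3-tap filter per channel.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3,
                              padding=1, groups=d_model)

    def forward(self, x):            # x: (batch, seq_len, d_model)
        x = x.transpose(1, 2)        # Conv1d expects (batch, channels, seq)
        x = self.conv(x)
        return x.transpose(1, 2)

d_model = 64
proj = nn.Linear(d_model, d_model)                     # e.g. the query projection
dconv = DepthwiseConvAfterProjection(d_model)
x = torch.randn(2, 10, d_model)
q = dconv(proj(x))                                     # MDHA-style query
h = squared_relu(nn.Linear(d_model, 4 * d_model)(x))   # feed-forward activation
print(q.shape, h.shape)
```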


4. ProphetNet

ProphetNet is a sequence-to-sequence pre-trained model that introduces a novel self-supervised objective called future n-gram prediction, together with a proposed n-stream self-attention mechanism. Instead of optimizing one-step-ahead prediction as in traditional sequence-to-sequence models, ProphetNet is optimized with n-step-ahead prediction:

at each time step, it predicts the next n tokens simultaneously based on the previous context tokens. This future n-gram prediction explicitly encourages the model to plan for future tokens and helps it predict multiple future tokens.
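
A hedged sketch of such a future n-gram objective is shown below. It assumes the model exposes one prediction stream per future offset (shape (n, batch, seq, vocab)); the exact streaming mechanism and loss weighting in ProphetNet are not reproduced here.

```python
# Sum cross-entropy losses over n future offsets: stream k predicts the token
# at position t + k + 1 from the context up to position t.
import torch
import torch.nn.functional as F

def future_ngram_loss(logits, targets, pad_id=0):
    """logits: (n, B, T, V), one stream per future offset; targets: (B, T)."""
    n = logits.shape[0]
    total = 0.0
    for k in range(n):
        shifted = targets[:, k + 1:]              # gold tokens at t + k + 1
        if shifted.shape[1] == 0:
            continue
        pred = logits[k][:, :shifted.shape[1], :]
        total = total + F.cross_entropy(
            pred.reshape(-1, pred.shape[-1]),
            shifted.reshape(-1),
            ignore_index=pad_id)
    return total / n

n, B, T, V = 2, 2, 8, 100
loss = future_ngram_loss(torch.randn(n, B, T, V), torch.randint(1, V, (B, T)))
print(loss.item())
```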


5. Transformer in Transformer (TNT)

The Transformer is a self-attention-based neural network originally applied to NLP tasks. Recently, purely Transformer-based models have been proposed to solve computer vision problems. These vision Transformers typically treat an image as a sequence of patches, ignoring the intrinsic structural information inside each patch. In this paper, we propose a novel Transformer-iN-Transformer (TNT) model that represents both patch-level and pixel-level information. In each TNT block, an outer Transformer block processes the patch embeddings, while an inner Transformer block extracts local features from the pixel embeddings. The pixel-level features are projected into the patch embedding space through a linear transformation layer and then added to the patch embeddings. By stacking TNT blocks, we build the TNT model for image recognition.
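
A simplified PyTorch sketch of one TNT block is given below; the layer choices, dimensions, and fusion point are assumptions for illustration rather than the paper's exact configuration.

```python
# One TNT block: inner Transformer over pixel embeddings inside each patch,
# projection into the patch space, addition to the patch embeddings, then an
# outer Transformer over the patch sequence.
import torch
import torch.nn as nn

class TNTBlock(nn.Module):
    def __init__(self, patch_dim=64, pixel_dim=16, pixels_per_patch=16):
        super().__init__()
        self.inner = nn.TransformerEncoderLayer(pixel_dim, nhead=4, batch_first=True)
        self.outer = nn.TransformerEncoderLayer(patch_dim, nhead=4, batch_first=True)
        # Linear projection from flattened pixel features to the patch space.
        self.proj = nn.Linear(pixel_dim * pixels_per_patch, patch_dim)

    def forward(self, patch_emb, pixel_emb):
        # patch_emb: (B, num_patches, patch_dim)
        # pixel_emb: (B * num_patches, pixels_per_patch, pixel_dim)
        pixel_emb = self.inner(pixel_emb)
        B, N, _ = patch_emb.shape
        fused = self.proj(pixel_emb.reshape(B, N, -1))
        patch_emb = self.outer(patch_emb + fused)   # add pixel-level info to patches
        return patch_emb, pixel_emb

block = TNTBlock()
patches = torch.randn(2, 4, 64)          # 4 patches per image
pixels = torch.randn(2 * 4, 16, 16)      # 16 pixel tokens per patch
out_patches, out_pixels = block(patches, pixels)
print(out_patches.shape, out_pixels.shape)
```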


6. MATE

MATE is a Transformer architecture designed to model the structure of web tables. It uses sparse attention in a way that allows each head to efficiently attend to either the rows or the columns of a table. Each attention head reorders the tokens by column or row index and then applies a windowed attention mechanism. Unlike the traditional self-attention mechanism, MATE scales linearly in the sequence length.
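
The sketch below illustrates this reorder-then-window idea on a toy flattened table; the window size and mask construction are illustrative assumptions, not MATE's exact implementation.

```python
# Row heads reorder tokens by row index, column heads by column index; each
# head then attends only within a fixed window of the reordered sequence.
import numpy as np

def windowed_mask(order, window=3):
    """Boolean mask: token i may attend to token j if their positions in the
    given ordering differ by at most `window`."""
    n = len(order)
    rank = np.empty(n, dtype=int)
    rank[order] = np.arange(n)                 # position of each token after reordering
    diff = np.abs(rank[:, None] - rank[None, :])
    return diff <= window

# Toy 3x3 table flattened row-major; row/column index for every token.
rows = np.repeat(np.arange(3), 3)
cols = np.tile(np.arange(3), 3)

row_head_mask = windowed_mask(np.argsort(rows, kind="stable"), window=2)
col_head_mask = windowed_mask(np.argsort(cols, kind="stable"), window=2)
print(row_head_mask.astype(int))
print(col_head_mask.astype(int))
```

Because each token attends only to a constant-size window after reordering, the number of attended pairs grows linearly with the sequence length.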


7. Bort

Bort is an architectural variant of BERT. It extracts an optimal subset of the architectural parameters of BERT through a neural architecture search method, specifically a fully polynomial-time approximation scheme (FPTAS). This optimal subarchitecture, "Bort", is substantially smaller than the original BERT-large model.


8. Charformer

Charformer is a Transformer model that learns subword tokenization end-to-end as part of the model. Specifically, it uses gradient-based subword tokenization (GBST) to automatically learn latent subword representations from characters in a data-driven manner. After GBST, the soft subword sequence is passed through the Transformer layers.
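
Below is a hedged sketch of the GBST idea: candidate character blocks of several sizes are pooled, scored, and softly mixed so that the "tokenization" stays differentiable. The scoring network and pooling details are simplifications, not the official implementation.

```python
# Soft subword formation: pool character embeddings at several block sizes,
# score each candidate per position, and take a softmax-weighted mixture.
import torch
import torch.nn as nn
import torch.nn.functional as F

def gbst(char_emb, block_sizes=(1, 2, 4), scorer=None):
    """char_emb: (batch, seq_len, dim) character embeddings."""
    B, T, D = char_emb.shape
    scorer = scorer or nn.Linear(D, 1)
    candidates = []
    for b in block_sizes:
        # Mean-pool non-overlapping blocks of size b, then broadcast back to length T.
        pooled = F.avg_pool1d(char_emb.transpose(1, 2), kernel_size=b, stride=b)
        pooled = pooled.repeat_interleave(b, dim=2)[:, :, :T].transpose(1, 2)
        candidates.append(pooled)
    cand = torch.stack(candidates, dim=2)                   # (B, T, num_blocks, D)
    weights = F.softmax(scorer(cand).squeeze(-1), dim=-1)   # score each block size
    return (weights.unsqueeze(-1) * cand).sum(dim=2)        # soft subword sequence

x = torch.randn(2, 8, 32)
print(gbst(x).shape)   # (2, 8, 32) -- ready for the Transformer layers
```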


9. Edge-augmented Graph Transformer (EGT)

Transformer neural networks have achieved state-of-the-art results on unstructured data such as text and images, but their adoption on graph-structured data has been limited. This is partly due to the difficulty of incorporating complex structural information into the basic Transformer framework. We propose a simple but powerful extension of the Transformer: residual edge channels. The resulting framework, which we call the edge-augmented graph Transformer (EGT), can directly accept, process, and output structural information as well as node information. It allows us to apply global self-attention, the key element of Transformers, directly to graphs, with the benefit of long-range interactions between nodes. Furthermore, the edge channels allow structural information to evolve from layer to layer, and prediction tasks on edges/links can be performed directly from the output embeddings of these channels. We also introduce a generalized positional encoding scheme based on singular value decomposition, which can improve the performance of EGT. Compared with convolutional/message-passing graph neural networks that rely on local feature aggregation within neighborhoods, our framework relies on global node feature aggregation and achieves better performance. We conduct extensive experiments on benchmark datasets to validate the performance of EGT in a supervised learning setting. Our results demonstrate that convolutional aggregation is not an essential inductive bias for graphs and that global self-attention can serve as a flexible and adaptive alternative.
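
The following is a minimal sketch of the core mechanism: edge channels bias the node-node attention logits, and a residual update lets structural information flow from layer to layer. Shapes, the single-head setup, and the exact edge-update rule are assumptions for illustration only.

```python
# Edge-biased global self-attention with a residual edge-channel update.
import torch
import torch.nn.functional as F

def egt_attention(node_feats, edge_feats, w_q, w_k, w_e):
    """node_feats: (N, d) node embeddings; edge_feats: (N, N, de) edge channels."""
    q, k = node_feats @ w_q, node_feats @ w_k
    logits = q @ k.T / q.shape[-1] ** 0.5            # global self-attention scores
    edge_bias = (edge_feats @ w_e).squeeze(-1)       # (N, N) bias from edge channels
    attn = F.softmax(logits + edge_bias, dim=-1)     # structure enters via the bias
    new_nodes = attn @ node_feats
    new_edges = edge_feats + (logits + edge_bias).unsqueeze(-1)  # residual edge update
    return new_nodes, new_edges

N, d, de = 5, 8, 4
nodes, edges = torch.randn(N, d), torch.randn(N, N, de)
w_q, w_k, w_e = torch.randn(d, d), torch.randn(d, d), torch.randn(de, 1)
new_nodes, new_edges = egt_attention(nodes, edges, w_q, w_k, w_e)
print(new_nodes.shape, new_edges.shape)
```

Because the edge channels are both read (as an attention bias) and written (via the residual update), edge/link predictions can be made directly from their output embeddings, as described above.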


10. MobileBERT

MobileBERT is an inverted-bottleneck BERT that compresses and accelerates the popular BERT model. MobileBERT is a slimmed-down version of BERT_LARGE, equipped with bottleneck structures and a carefully designed balance between self-attention and feed-forward networks. To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck model that incorporates BERT_LARGE. We then transfer this teacher's knowledge to MobileBERT. Like the original BERT, MobileBERT is task-agnostic; that is, it can be applied to various downstream NLP tasks with simple fine-tuning. It is trained layer by layer by imitating the inverted-bottleneck teacher.
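
A hedged sketch of the layer-by-layer imitation idea follows: the student matches the teacher's per-layer feature maps (attention maps could be matched analogously). The loss form, layer pairing, and dimensions are assumptions, not MobileBERT's exact training recipe.

```python
# Layer-wise knowledge transfer: mean-squared error between matched student
# and teacher feature maps, averaged over layers.
import torch
import torch.nn.functional as F

def layerwise_distillation_loss(student_feats, teacher_feats):
    """Both arguments: lists of (batch, seq_len, hidden) tensors, one per layer."""
    loss = 0.0
    for s, t in zip(student_feats, teacher_feats):
        loss = loss + F.mse_loss(s, t)   # feature-map transfer for this layer
    return loss / len(student_feats)

# Toy example: a 4-layer student imitating a 4-layer teacher.
student = [torch.randn(2, 16, 128) for _ in range(4)]
teacher = [f + 0.1 * torch.randn_like(f) for f in student]
print(layerwise_distillation_loss(student, teacher).item())
```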

