[Natural Language Processing | Transformers] An Introductory Collection of Common Transformer Models (3)

1. CodeBERT

CodeBERT is a bimodal pre-trained model for programming language (PL) and natural language (NL). It learns general-purpose representations that support downstream NL-PL applications such as natural language code search and code documentation generation. CodeBERT is built on a Transformer-based neural architecture and trained with a hybrid objective function that incorporates the pre-training task of replaced token detection, i.e., detecting plausible alternative tokens sampled from generators. This allows the model to exploit both bimodal NL-PL pairs and unimodal data: the former provides input tokens for model training, while the latter helps to learn better generators.
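
For readers who want to try it, here is a minimal usage sketch for embedding an NL query together with a code snippet, assuming the public Hugging Face checkpoint microsoft/codebert-base; the query and the code snippet are made-up examples.

```python
# Minimal sketch: embedding an NL query and a code snippet with the public
# CodeBERT checkpoint (assumes the Hugging Face "microsoft/codebert-base" weights).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

nl = "return the maximum value in a list"
code = "def max_value(xs): return max(xs)"

# CodeBERT takes bimodal input: NL tokens and code tokens joined by special tokens
# (RoBERTa-style <s> ... </s></s> ... </s>).
inputs = tokenizer(nl, code, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Use the first-token vector as a joint NL-PL representation, e.g. for code search.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])
```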


2. PEGASUS

PEGASUS is a Transformer-based abstractive summarization model. It uses a self-supervised pre-training objective called Gap Sentences Generation (GSG), which is designed to perform well on summarization-related downstream tasks. As described in the paper: "Both GSG and MLM are applied simultaneously to this example as pre-training objectives. Originally there are three sentences. One sentence is masked with [MASK1] and used as the target generation text (GSG). The other two sentences remain in the input, but some of their tokens are randomly masked by [MASK2]."
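
The GSG objective can be illustrated with a small, self-contained sketch. The random sentence selection and mask ratio below are simplifications; the paper also studies selecting the most "principal" sentences rather than random ones.

```python
# Illustrative sketch of PEGASUS-style Gap Sentence Generation (GSG) masking.
import random

def make_gsg_example(sentences, mask_ratio=0.3, seed=0):
    rng = random.Random(seed)
    n_mask = max(1, int(len(sentences) * mask_ratio))
    masked_ids = set(rng.sample(range(len(sentences)), n_mask))

    # Masked sentences are replaced by [MASK1] in the input and concatenated
    # to form the generation target; the remaining sentences stay in the input.
    inputs = [s if i not in masked_ids else "[MASK1]" for i, s in enumerate(sentences)]
    target = " ".join(sentences[i] for i in sorted(masked_ids))
    return " ".join(inputs), target

doc = [
    "Pegasus is pre-trained for abstractive summarization.",
    "Whole sentences are removed from the document.",
    "The model must generate the removed sentences.",
]
src, tgt = make_gsg_example(doc)
print(src)
print(tgt)
```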


3. Sparse Transformer

Sparse Transformer is a Transformer-based architecture that uses sparse factorizations of the attention matrix to reduce time and memory requirements. Other changes to the Transformer architecture include: (a) restructured residual blocks and weight initialization, (b) a set of sparse attention kernels that efficiently compute subsets of the attention matrix, and (c) recomputation of attention weights during the backward pass to reduce memory usage.
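
As a rough illustration of what a sparse factorization looks like, the sketch below builds the causal "local + strided" attention pattern described in the paper. The sizes are arbitrary, and the real implementation uses custom GPU kernels rather than a dense boolean mask.

```python
# Sketch of the strided sparse attention pattern: each position attends to the
# previous `stride` positions (local head) and to positions whose distance is a
# multiple of `stride` (strided head), instead of the full O(n^2) matrix.
import numpy as np

def strided_sparse_mask(n, stride):
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):                    # causal: only attend to the past
            local = (i - j) < stride              # local head
            strided = ((i - j) % stride) == 0     # strided head
            mask[i, j] = local or strided
    return mask

m = strided_sparse_mask(n=16, stride=4)
print(m.sum(), "of", m.size, "entries attended")  # far fewer than the dense lower triangle
```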


4. Vision-and-Language BERT (ViLBERT)

Vision-and-Language BERT (ViLBERT) is a BERT-based model for learning task-agnostic joint representations of image content and natural language. ViLBERT extends the popular BERT architecture into a multimodal two-stream model that processes visual and textual inputs in separate streams which interact through co-attentional transformer layers.
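
The core idea of co-attention can be sketched in a few lines: one stream's queries attend over the other stream's keys and values. The dimensions and layer construction below are purely illustrative, not the released model.

```python
# Minimal sketch of co-attention: text queries attend over vision keys/values,
# so the text stream is conditioned on visual context (ViLBERT does this in both
# directions with multi-head attention; one direction and one head shown here).
import torch
import torch.nn.functional as F

def co_attention(x_text, x_vision, d_k=64):
    # x_text: (B, T_text, d), x_vision: (B, T_vision, d)
    W_q = torch.nn.Linear(x_text.size(-1), d_k)
    W_k = torch.nn.Linear(x_vision.size(-1), d_k)
    W_v = torch.nn.Linear(x_vision.size(-1), d_k)

    q = W_q(x_text)                          # queries from the text stream
    k, v = W_k(x_vision), W_v(x_vision)      # keys/values from the vision stream
    attn = F.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return attn @ v                          # text tokens enriched with visual context

text = torch.randn(2, 12, 768)
vision = torch.randn(2, 36, 768)
print(co_attention(text, vision).shape)      # torch.Size([2, 12, 64])
```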


5. Extended Transformer Construction (ETC)

Extended Transformer Construction (ETC) is an extension of the Transformer architecture with a new attention mechanism that broadens the original architecture in two main ways: (1) it allows input lengths to scale from 512 tokens to thousands, and (2) it can ingest structured inputs rather than just linear sequences. The key idea that enables ETC to achieve these goals is a new global-local attention mechanism coupled with relative position encodings. ETC also allows lifting weights from existing BERT models, saving compute during training.
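
A toy version of the global-local attention mask might look like the following; the token counts and radius are arbitrary, and the relative position encodings used by the real model are omitted here.

```python
# Sketch of a global-local attention mask: a small set of global tokens attends to
# (and is attended by) everything, while regular "long" tokens only attend within a
# local radius plus to the global tokens.
import numpy as np

def global_local_mask(n_long, n_global, radius):
    n = n_global + n_long                 # global tokens first, then the long input
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_global, :] = True             # global-to-all
    mask[:, :n_global] = True             # all-to-global
    for i in range(n_long):
        lo, hi = max(0, i - radius), min(n_long, i + radius + 1)
        mask[n_global + i, n_global + lo:n_global + hi] = True  # local window
    return mask

m = global_local_mask(n_long=32, n_global=4, radius=3)
print(m.shape, m.sum())  # attention cost grows linearly in n_long, not quadratically
```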


6. RAG

Retrieval-Augmented Generation (RAG) is a language generation model that combines pre-trained parametric and non-parametric memory. Specifically, the parametric memory is a pre-trained seq2seq model, and the non-parametric memory is a dense vector index of Wikipedia accessed through a pre-trained neural retriever. For a query, Maximum Inner Product Search (MIPS) is used to find the top-K documents. For the final prediction, the retrieved documents are treated as latent variables, and the seq2seq predictions are marginalized over the different documents.
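
The retrieve-then-marginalize step can be sketched as follows; the document vectors and generator likelihoods are random placeholders, not a real retriever or seq2seq model.

```python
# Sketch of RAG-style marginalization: a toy inner-product search selects the top-K
# documents, and the generator's probability for an answer y is marginalized over
# those documents as latent variables. All numbers below are made-up placeholders.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

query_vec = np.random.randn(128)
doc_vecs = np.random.randn(1000, 128)            # dense index (e.g. Wikipedia passages)

# Maximum Inner Product Search: take the K documents with the highest dot product.
K = 5
scores = doc_vecs @ query_vec
topk = np.argsort(-scores)[:K]
p_doc = softmax(scores[topk])                    # p(z | x)

# Placeholder generator likelihoods p(y | x, z) for one candidate answer y.
p_y_given_doc = np.random.rand(K)

p_y = float(np.sum(p_doc * p_y_given_doc))       # marginalize over documents
print(p_y)
```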


7. CodeT5

CodeT5 is a Transformer model for code understanding and generation built on the T5 architecture. It uses an identifier-aware pre-training objective that takes into account crucial token type information in code, namely identifiers. Specifically, T5's denoising seq2seq objective is extended with two identifier-related tasks, identifier tagging and masked identifier prediction, so that the model can better exploit the token type information of programming languages, i.e., developer-assigned identifiers. To improve the alignment between natural language and programming language, a bimodal dual-generation objective is also used for bidirectional conversion between NL and PL.
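
The identifier-tagging idea can be illustrated with a toy labeler. CodeT5 itself derives identifiers from parses across multiple programming languages; this sketch only uses Python's standard tokenize module as a stand-in.

```python
# Illustrative sketch of identifier tagging: label each code token as a
# developer-assigned identifier (1) or not (0), using Python's own tokenizer.
import io
import keyword
import tokenize

def identifier_tags(code):
    tags = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type == tokenize.NAME:
            is_ident = not keyword.iskeyword(tok.string)
            tags.append((tok.string, int(is_ident)))   # 1 = identifier, 0 = keyword
        elif tok.type in (tokenize.OP, tokenize.NUMBER, tokenize.STRING):
            tags.append((tok.string, 0))
    return tags

print(identifier_tags("def add(a, b):\n    return a + b\n"))
# [('def', 0), ('add', 1), ('(', 0), ('a', 1), ...]
```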


8. CTRL

CTRL is a conditional Transformer language model trained to condition on control codes that govern style, content, and task-specific behavior. Control codes are derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence.
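
In practice, conditioning on a control code amounts to prepending it to the prompt. The sketch below assumes the Hugging Face Salesforce/ctrl checkpoint and the "Reviews" control code; treat the exact repository name and sampling settings as assumptions, and note that the checkpoint is large (~1.6B parameters).

```python
# Minimal sketch of conditioning CTRL on a control code: the code is simply
# prepended to the prompt and steers the style/domain of the generated text.
from transformers import CTRLTokenizer, CTRLLMHeadModel

tokenizer = CTRLTokenizer.from_pretrained("Salesforce/ctrl")
model = CTRLLMHeadModel.from_pretrained("Salesforce/ctrl")

prompt = "Reviews This laptop"          # "Reviews" acts as the control code
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0]))
```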


9. Universal Transformer

The Universal Transformer is a generalization of the Transformer architecture. Universal Transformers combine the parallelizability and global receptive field of feed-forward sequence models such as the Transformer with the recurrent inductive bias of RNNs. They also use a dynamic per-position halting mechanism.
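
A simplified sketch of the shared-weight recurrence with an ACT-style halting unit is shown below; it omits several details of the full Adaptive Computation Time algorithm (e.g. the remainder term and timestep embeddings).

```python
# Sketch of the Universal Transformer recurrence: the *same* block is applied
# repeatedly over all positions, and a halting unit decides per position when to
# stop refining. Simplified illustration, not the full ACT algorithm.
import torch
import torch.nn as nn

class UTBlockSketch(nn.Module):
    def __init__(self, d_model=64, n_heads=4, max_steps=6, threshold=0.99):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.halt = nn.Linear(d_model, 1)
        self.max_steps, self.threshold = max_steps, threshold

    def forward(self, x):
        halting = torch.zeros(x.shape[:2], device=x.device)    # cumulative halt prob
        for _ in range(self.max_steps):                         # shared weights each step
            still_running = (halting < self.threshold).float().unsqueeze(-1)
            a, _ = self.attn(x, x, x)
            update = self.ffn(x + a)
            x = x + still_running * update                      # only update running positions
            halting = halting + torch.sigmoid(self.halt(x)).squeeze(-1)
        return x

out = UTBlockSketch()(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```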


10. Switch Transformer

Switch Transformer is a sparsely-activated expert Transformer model that aims to simplify and improve Mixture of Experts. By distilling sparse pre-trained and specialized fine-tuned models into small dense models, it can reduce model size by up to 99% while preserving about 30% of the quality gains of the large sparse teacher. It also uses selective precision training, which enables training with lower bfloat16 precision, an initialization scheme that allows scaling to a larger number of experts, and increased regularization that improves sparse model fine-tuning and multi-task training.
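
The defining top-1 ("switch") routing step can be sketched as follows; the load-balancing auxiliary loss and expert capacity limits from the paper are left out for brevity.

```python
# Sketch of Switch routing: a router picks a single expert (top-1) per token, so
# only one expert FFN runs for each token, unlike classic top-k mixture-of-experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFNSketch(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)
        gate, expert_idx = probs.max(dim=-1)   # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = expert_idx == e
            if sel.any():
                out[sel] = gate[sel, None] * expert(x[sel])  # scale by router prob
        return out

tokens = torch.randn(32, 64)
print(SwitchFFNSketch()(tokens).shape)  # torch.Size([32, 64])
```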


11. Reformer

Reformer improves the efficiency of the Transformer on long sequences by replacing standard dot-product attention with locality-sensitive hashing (LSH) attention and by using reversible residual layers, which greatly reduce memory usage while keeping modeling quality comparable to a full Transformer.

12. Linformer

Linformer is a linear Transformer that addresses the self-attention bottleneck of the Transformer model with a linear self-attention mechanism. The original scaled dot-product attention is decomposed into multiple smaller attentions through linear projections, such that the combination of these operations forms a low-rank factorization of the original attention.
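
The low-rank projection can be sketched in a few lines of tensor algebra; the projection matrices below are random stand-ins for the learned E and F matrices in the paper, and the sizes are illustrative.

```python
# Sketch of Linformer attention: keys and values are projected along the *sequence*
# dimension from length n down to a fixed k, so the attention matrix is n x k
# instead of n x n, giving linear time and memory in the sequence length.
import torch
import torch.nn.functional as F

n, d, k = 512, 64, 32
q = torch.randn(1, n, d)
key = torch.randn(1, n, d)
value = torch.randn(1, n, d)

E = torch.randn(k, n) / n ** 0.5       # stand-in for the learned projection E: n -> k
F_proj = torch.randn(k, n) / n ** 0.5  # stand-in for the learned projection F: n -> k

k_proj = E @ key                       # (1, k, d)
v_proj = F_proj @ value                # (1, k, d)

attn = F.softmax(q @ k_proj.transpose(-2, -1) / d ** 0.5, dim=-1)  # (1, n, k)
out = attn @ v_proj                    # (1, n, d)
print(attn.shape, out.shape)
```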



Origin blog.csdn.net/wzk4869/article/details/132982686