[Natural Language Processing | Transformers] An Introduction to Common Transformer Algorithms (Part 2)

1. DistilBERT

DistilBERT is a small, fast, cheap and lightweight Transformer model based on the BERT architecture. Knowledge distillation is performed during the pre-training phase to reduce the size of the BERT model by 40%. To leverage the inductive biases learned by larger models during pre-training, the authors introduce a triple loss combining language modeling, distillation and cosine-distance losses.
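
As a rough illustration of how such a triple loss can be combined, here is a minimal PyTorch sketch; the weighting coefficients, temperature and tensor names are illustrative assumptions rather than the exact values used for DistilBERT.

```python
import torch
import torch.nn.functional as F

def distilbert_style_triple_loss(student_logits, teacher_logits, student_hidden,
                                 teacher_hidden, labels, temperature=2.0,
                                 w_distill=5.0, w_mlm=2.0, w_cos=1.0):
    """Combine distillation, masked-LM and cosine-embedding losses.

    Shapes (illustrative): logits (batch, seq, vocab), hidden states
    (batch, seq, dim), labels (batch, seq) with -100 on unmasked positions.
    """
    # 1) Distillation: KL divergence between softened student/teacher outputs.
    loss_distill = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # 2) Supervised masked-language-modeling loss on the masked positions.
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # 3) Cosine-distance loss aligning student and teacher hidden states.
    target = torch.ones(student_hidden.size(0) * student_hidden.size(1),
                        device=student_hidden.device)
    loss_cos = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )

    return w_distill * loss_distill + w_mlm * loss_mlm + w_cos * loss_cos
```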

2. ELECTRA

ELECTRA is a Transformer with a new pre-training approach that trains two transformer models: a generator and a discriminator. The generator, trained as a masked language model, replaces tokens in the sequence, and the discriminator (the model that ELECTRA contributes) tries to identify which tokens in the sequence were replaced by the generator. This pre-training task, called replaced token detection, is an alternative to masking the input.
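
A conceptual sketch of one replaced-token-detection step is shown below; `generator` and `discriminator` are assumed callables returning masked-LM logits and per-token classification logits respectively, and only the discriminator loss is shown.

```python
import torch
import torch.nn.functional as F

def electra_rtd_step(generator, discriminator, input_ids, mask_positions, mask_token_id):
    """One conceptual replaced-token-detection step.

    `generator` returns masked-LM logits (batch, seq_len, vocab);
    `discriminator` returns one logit per position (batch, seq_len);
    `mask_positions` is a boolean tensor shaped like `input_ids`.
    """
    # Mask a subset of the input for the generator (the masked-LM side).
    masked_inputs = input_ids.clone()
    masked_inputs[mask_positions] = mask_token_id

    # The generator proposes plausible replacements by sampling from its output.
    gen_logits = generator(masked_inputs)
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()

    # Build the corrupted sequence the discriminator actually sees.
    corrupted = input_ids.clone()
    corrupted[mask_positions] = sampled[mask_positions]

    # The discriminator labels every position: 1 = replaced, 0 = original.
    is_replaced = (corrupted != input_ids).float()
    disc_logits = discriminator(corrupted)
    return F.binary_cross_entropy_with_logits(disc_logits, is_replaced)
```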

3. Electric

Electric is an energy-based cloze model for representation learning over text. Like BERT, it is a conditional generative model of tokens given their contexts. However, Electric does not use masking or output a full distribution over the tokens that could occur in a context. Instead, it assigns each input token a scalar energy score indicating how likely that token is given its context.
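
A minimal sketch of the idea of assigning one scalar energy per token follows; the encoder module and the linear energy head are illustrative assumptions, and the actual Electric model trains such scores with noise-contrastive estimation rather than a plain head.

```python
import torch
import torch.nn as nn

class TokenEnergyScorer(nn.Module):
    """Assign each input token a scalar (un-normalized) energy given its context."""

    def __init__(self, encoder, hidden_dim):
        super().__init__()
        self.encoder = encoder          # any contextual encoder, e.g. a Transformer
        self.energy_head = nn.Linear(hidden_dim, 1)

    def forward(self, input_ids):
        # Contextual states for every position: (batch, seq_len, hidden_dim).
        hidden = self.encoder(input_ids)
        # One scalar energy per token instead of a full vocabulary distribution.
        return self.energy_head(hidden).squeeze(-1)    # (batch, seq_len)
```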

4. Longformer

Longformer is a modified Transformer architecture. Traditional Transformer-based models cannot handle long sequences because their self-attention operation scales quadratically with sequence length. To address this, Longformer uses an attention pattern that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. This attention mechanism is a drop-in replacement for standard self-attention and combines local windowed attention with task-motivated global attention.

The attention patterns used are sliding window attention, dilated sliding window attention, and global + sliding window attention; a sketch of the combined local-plus-global mask is shown below.
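
The following sketch builds a boolean mask combining a sliding local window with a handful of global positions; the dense seq x seq matrix is only for illustration, since the point of Longformer is precisely to avoid materializing it.

```python
import torch

def longformer_attention_mask(seq_len, window, global_positions):
    """Boolean mask combining a sliding local window with global positions."""
    allowed = torch.zeros(seq_len, seq_len, dtype=torch.bool)

    # Local attention: each token attends to neighbours within +/- window.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        allowed[i, lo:hi] = True

    # Global attention (task-driven, e.g. [CLS]): these positions attend
    # everywhere and every position attends to them.
    for g in global_positions:
        allowed[g, :] = True
        allowed[:, g] = True
    return allowed

# Example: a 4096-token document, a window of 256, global attention on position 0.
mask = longformer_attention_mask(4096, 256, global_positions=[0])
```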

5. mT5

mT5 is a multilingual variant of T5, pre-trained on a new Common Crawl-based dataset covering 101 languages.
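
For reference, a quick usage sketch with the Hugging Face transformers library; the checkpoint name and generation settings are illustrative, and the raw pre-trained model still needs fine-tuning before it produces useful output.

```python
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

inputs = tokenizer("summarize: mT5 covers 101 languages.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```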

6. Pathways Language Model (PaLM)

PaLM (Pathways Language Model) uses a standard Transformer model architecture (Vaswani et al., 2017) in a decoder-only setup (i.e., each time step can attend only to itself and past time steps), with several modifications. PaLM is trained as a 540-billion-parameter, densely activated, autoregressive Transformer on 780 billion tokens. PaLM leverages Pathways (Barham et al., 2022), which enables efficient training of very large neural networks across thousands of accelerator chips.
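
The decoder-only constraint mentioned above amounts to a causal attention mask; a minimal sketch:

```python
import torch

def causal_mask(seq_len):
    """Decoder-only (autoregressive) mask: position i may attend only to
    positions <= i, i.e. itself and past time steps."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Row i of the mask has True only in columns 0..i.
print(causal_mask(5))
```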

7. Performer

Performer is a Transformer architecture that can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (instead of quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. Performers are linear architectures fully compatible with regular Transformers and come with strong theoretical guarantees: unbiased or nearly unbiased estimation of the attention matrix, uniform convergence, and low estimation variance. To approximate the softmax attention kernel, Performer uses Fast Attention Via positive Orthogonal Random features (FAVOR+), a new method for approximating softmax and Gaussian kernels.
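
Below is a simplified sketch of the positive-random-feature idea behind FAVOR+; the real method draws orthogonal random projections and adds further numerical stabilization, which is omitted here.

```python
import torch

def favor_positive_features(x, projection):
    """Positive random features approximating the softmax kernel
    (simplified FAVOR+; `projection` is an (m, d) Gaussian matrix)."""
    m = projection.shape[0]
    # exp(w.x - ||x||^2 / 2) / sqrt(m) -- features are guaranteed positive.
    wx = x @ projection.T                                   # (batch, seq, m)
    sq_norm = (x ** 2).sum(dim=-1, keepdim=True) / 2.0
    return torch.exp(wx - sq_norm) / m ** 0.5

def performer_attention(q, k, v, num_features=256):
    """Linear-complexity attention: O(seq * m * d) instead of O(seq^2 * d)."""
    d = q.shape[-1]
    # Scale so that phi(q) . phi(k) approximates exp(q . k / sqrt(d)).
    q, k = q / d ** 0.25, k / d ** 0.25
    projection = torch.randn(num_features, d)                # orthogonalized in the paper
    q_prime = favor_positive_features(q, projection)         # (batch, seq, m)
    k_prime = favor_positive_features(k, projection)

    # Reassociate the product: Q'(K'^T V) never forms the seq x seq matrix.
    kv = torch.einsum("bsm,bsd->bmd", k_prime, v)            # (batch, m, d)
    normalizer = q_prime @ k_prime.sum(dim=1).unsqueeze(-1)  # (batch, seq, 1)
    return torch.einsum("bsm,bmd->bsd", q_prime, kv) / normalizer
```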

8. Transformer-XL

Transformer-XL (meaning extra long) is a Transformer architecture that introduces the notion of recurrence into the deep self-attention network. Instead of recomputing the hidden states from scratch for each new segment, Transformer-XL reuses the hidden states obtained for previous segments. The reused hidden states serve as memory for the current segment, which builds a recurrent connection between segments. As a result, modeling very long-term dependencies becomes possible because information can be propagated through these recurrent connections. As an additional contribution, Transformer-XL uses a new relative positional encoding formulation that generalizes to attention lengths longer than those observed during training.
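
A conceptual sketch of the segment-level recurrence is given below; `layer` is an assumed attention layer that takes separate query and context tensors, and the memory length is kept fixed for simplicity.

```python
import torch

def segment_recurrence_step(layer, segment_hidden, memory):
    """Conceptual segment-level recurrence.

    `layer` is an assumed attention layer taking separate query/context tensors;
    segment_hidden: (batch, cur_len, dim), memory: (batch, mem_len, dim).
    """
    # Cached states from previous segments are reused as read-only memory.
    context = torch.cat([memory.detach(), segment_hidden], dim=1)
    output = layer(query=segment_hidden, context=context)

    # The current segment's states become the memory for the next segment
    # (truncated to a fixed memory length).
    new_memory = torch.cat([memory, segment_hidden], dim=1)[:, -memory.size(1):]
    return output, new_memory
```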

9. DeBERTa

DeBERTa is a Transformer-based neural language model that aims to improve on BERT and RoBERTa with two techniques: a disentangled attention mechanism and an enhanced mask decoder. In the disentangled attention mechanism, each word is represented using two vectors that encode its content and its position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions. The enhanced mask decoder replaces the output softmax layer to predict the masked tokens for model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve the model's generalization to downstream tasks.
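
A simplified sketch of the disentangled attention score (content-to-content plus content-to-position plus position-to-content) is shown below; for readability the relative-position projections are materialized as full (seq, seq, dim) tensors, whereas the actual model uses bucketed relative-position embeddings and more careful index bookkeeping.

```python
import torch

def disentangled_attention_scores(q_c, k_c, q_r, k_r):
    """Sum of content-to-content, content-to-position and position-to-content terms.

    q_c, k_c: (seq, dim) content projections of the queries/keys.
    k_r[i, j]: (seq, seq, dim) relative-position key projection for the distance (i, j).
    q_r[j, i]: (seq, seq, dim) relative-position query projection for the distance (j, i).
    """
    c2c = q_c @ k_c.T                               # standard content-to-content term
    c2p = torch.einsum("id,ijd->ij", q_c, k_r)      # content-to-position
    p2c = torch.einsum("jd,jid->ij", k_c, q_r)      # position-to-content
    scale = (3 * q_c.size(-1)) ** 0.5               # three terms -> sqrt(3d) scaling
    return (c2c + c2p + p2c) / scale
```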

10. mBART

mBART is a sequence-to-sequence denoising autoencoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. The input text is noised by masking phrases and permuting sentences, and a single Transformer model is learned to recover the text. Unlike other pre-training approaches for machine translation, mBART pre-trains a complete autoregressive Seq2Seq model. mBART is trained once for all languages, providing a set of parameters that can be fine-tuned for any language pair in both supervised and unsupervised settings, without any task-specific or language-specific modifications or initialization schemes.
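
An illustrative noising function in the spirit of the BART objective is sketched below; span lengths in BART are Poisson-distributed, while a fixed short span and a simple word-level split are used here to keep the sketch small.

```python
import random

def mbart_style_noise(sentences, mask_token="<mask>", mask_ratio=0.35):
    """Permute the sentences, then replace random word spans with a mask token."""
    # 1) Sentence permutation.
    noised = sentences[:]
    random.shuffle(noised)

    # 2) Span masking at the word level (fixed short spans for the sketch).
    words = " ".join(noised).split()
    budget = int(len(words) * mask_ratio)
    i = 0
    while budget > 0 and i < len(words):
        if random.random() < mask_ratio:
            span = min(3, budget)
            words[i:i + span] = [mask_token]
            budget -= span
        i += 1
    return " ".join(words)

print(mbart_style_noise(["First sentence .", "Second sentence .", "Third sentence ."]))
```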

11. XLM

XLM is a Transformer-based architecture that is pre-trained using one of three language modeling objectives:

Causal Language Modeling - Modeling the probability of a word given the preceding words in a sentence.
Masked Language Modeling - BERT's masked language modeling objective.
Translation Language Modeling - a (new) translation language modeling objective for improving cross-lingual pre-training.
The authors found that both the CLM and MLM approaches provide strong cross-lingual features that can be used to pretrain models; a sketch of the TLM input construction is given below.
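
The following is a conceptual sketch of how a translation language modeling (TLM) example can be built; the separator token, the en/fr language ids and the masking probability are illustrative assumptions.

```python
import random

def build_tlm_example(src_tokens, tgt_tokens, mask_token="[MASK]", mask_prob=0.15):
    """Concatenate a parallel sentence pair so that a masked word in one language
    can be predicted from context in either language."""
    combined = src_tokens + ["</s>"] + tgt_tokens
    # Language ids distinguish the two halves (simplified to strings here).
    languages = ["en"] * (len(src_tokens) + 1) + ["fr"] * len(tgt_tokens)

    masked, labels = [], []
    for token in combined:
        if token != "</s>" and random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)        # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)         # position not used in the loss
    return masked, languages, labels
```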

12. ERNIE

ERNIE is a Transformer-based model consisting of two stacked modules: 1) a textual encoder and 2) a knowledge encoder, which is responsible for integrating extra token-oriented knowledge information into the textual information. The knowledge encoder consists of stacked aggregators designed to encode both tokens and entities and to fuse their heterogeneous features. To inject knowledge into the representations through this layer, ERNIE employs a special pre-training task: it randomly masks token-entity alignments and trains the model to predict the corresponding entities for the aligned tokens (a denoising entity auto-encoder, dEA).
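
A simplified sketch of one aggregator's token-entity fusion step is shown below; it assumes the entity embeddings have already been aligned to the token positions, which the real model handles explicitly for tokens without an aligned entity.

```python
import torch
import torch.nn as nn

class ErnieFusionLayer(nn.Module):
    """One aggregator's fusion step: token and entity representations are mixed
    through a shared hidden state, then projected back to separate outputs."""

    def __init__(self, token_dim, entity_dim, hidden_dim):
        super().__init__()
        self.token_in = nn.Linear(token_dim, hidden_dim)
        self.entity_in = nn.Linear(entity_dim, hidden_dim)
        self.token_out = nn.Linear(hidden_dim, token_dim)
        self.entity_out = nn.Linear(hidden_dim, entity_dim)

    def forward(self, token_states, entity_states):
        # Fuse the heterogeneous features of aligned tokens and entities.
        fused = torch.nn.functional.gelu(
            self.token_in(token_states) + self.entity_in(entity_states))
        return self.token_out(fused), self.entity_out(fused)
```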

Origin blog.csdn.net/wzk4869/article/details/132975403