[Natural Language Processing | Language Models] An Introductory Collection of Common Language Model Algorithms (9)

1. ERNIE-GEN

ERNIE-GEN is a multi-flow sequence-to-sequence pre-training and fine-tuning framework that bridges the gap between training and inference with an infilling generation mechanism and a noise-aware generation method. To bring generation closer to human writing patterns, the framework introduces a span-by-span generation flow that trains the model to predict semantically complete spans consecutively rather than word by word. Unlike existing pre-training methods, ERNIE-GEN constructs its pre-training data with multi-granularity target sampling, which strengthens the correlation between the encoder and the decoder.
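
As a rough illustration of span-level target construction, the sketch below masks multi-granularity spans of a tokenized sentence and collects them as generation targets. The mask ratio, span lengths, and `[MASK]` placeholder are illustrative assumptions, not ERNIE-GEN's exact sampling policy.

```python
import random

def sample_infilling_spans(tokens, mask_ratio=0.25, max_span=4):
    """Mask multi-granularity spans and collect them as decoder targets
    (an illustrative sketch, not ERNIE-GEN's exact sampling procedure)."""
    source, targets = list(tokens), []
    to_mask = int(len(tokens) * mask_ratio)
    masked = 0
    while masked < to_mask:
        span_len = random.randint(1, max_span)                   # varying granularity
        start = random.randrange(0, max(1, len(tokens) - span_len))
        targets.append((start, tokens[start:start + span_len]))  # span to be generated
        for i in range(start, start + span_len):
            source[i] = "[MASK]"
        masked += span_len
    return source, targets

src, tgt = sample_infilling_spans("the model predicts whole spans instead of single words".split())
```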

2. Sandwich Transformer

The Sandwich Transformer is a Transformer variant that reorders the sub-layers of the architecture to achieve better performance. The reordering is based on the authors' analysis that models with more self-attention sub-layers toward the bottom and more feed-forward sub-layers toward the top tend to perform better overall.
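
The reordering can be written as a simple sub-layer pattern. The sketch below builds one such ordering, with 's' for a self-attention sub-layer and 'f' for a feed-forward sub-layer: k of the 's' sub-layers are pushed to the bottom and k of the 'f' sub-layers to the top, with interleaved pairs in between. The values in the usage line are only examples.

```python
def sandwich_order(n_pairs, k):
    """Sandwich sub-layer ordering: k self-attention ('s') sub-layers at the bottom,
    k feed-forward ('f') sub-layers at the top, and interleaved 'sf' pairs in between.
    The total sub-layer count matches a vanilla n_pairs-layer Transformer."""
    assert 0 <= k <= n_pairs
    return 's' * k + 'sf' * (n_pairs - k) + 'f' * k

print(sandwich_order(16, 6))   # example: 16 layer pairs, sandwich coefficient k = 6
```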

3. DeLighT

DeLighT is a Transformer architecture that improves parameter efficiency by (1) using DExTra, a deep and light-weight transformation, within each Transformer block, which allows single-head attention and bottleneck FFN layers to be used, and (2) using block-wise scaling across blocks, which allows shallower and narrower DeLighT blocks near the input and wider and deeper DeLighT blocks near the output.
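
Block-wise scaling can be pictured in a few lines: the DExTra depth assigned to each block grows from the input side to the output side. The linear interpolation and the n_min/n_max values below are illustrative assumptions, not the released DeLighT configuration.

```python
def blockwise_depths(n_blocks, n_min=4, n_max=8):
    """Block-wise scaling sketch: shallower DeLighT blocks near the input,
    deeper blocks near the output, interpolated linearly across the stack."""
    return [round(n_min + (n_max - n_min) * b / max(1, n_blocks - 1))
            for b in range(n_blocks)]

print(blockwise_depths(6))   # [4, 5, 6, 6, 7, 8]
```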

4. PAR Transformer

The PAR Transformer is a Transformer model that uses 63% fewer self-attention blocks, replacing them with feed-forward blocks, while retaining test accuracy. It is based on the Transformer-XL architecture and uses neural architecture search to find effective patterns of blocks in the Transformer architecture.
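
The searched architectures boil down to a pattern over two block types. The sketch below builds such a stack from a pattern string, with 's' for a self-attention block and 'f' for a feed-forward block. It is a structural illustration only: the modules are not wired with residuals or layer norms, and the dimensions and the example pattern are placeholder values.

```python
import torch.nn as nn

def build_par_blocks(pattern, dim=512, heads=8, ffn_dim=2048):
    """Structural sketch: turn a block pattern such as 'ssfff...' into modules,
    replacing most self-attention blocks with feed-forward blocks."""
    blocks = nn.ModuleList()
    for symbol in pattern:
        if symbol == 's':
            blocks.append(nn.MultiheadAttention(dim, heads, batch_first=True))
        else:
            blocks.append(nn.Sequential(
                nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim)))
    return blocks

stack = build_par_blocks('ssf' + 'f' * 9)   # example pattern: attention only near the bottom
```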

5. ConvBERT

ConvBERT is a modified version of the BERT architecture that uses span-based dynamic convolutions in place of some self-attention heads to model local dependencies directly. Specifically, a new mixed attention module replaces the self-attention modules in BERT, taking advantage of convolution to better capture local dependencies. In addition, the span-based dynamic convolution operation dynamically generates convolution kernels from multiple input tokens. Finally, ConvBERT also incorporates some new model designs, including a bottleneck structure for attention and a grouped linear operator for the feed-forward module, both of which reduce the number of parameters.
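
The sketch below conveys the dynamic-convolution idea in a heavily simplified form: a per-position kernel is generated from a span-aware (depthwise-convolved) view of the input and applied to a local window of value vectors. The class name, projections, and kernel size are assumptions; the real ConvBERT module also involves query/key interaction and head grouping.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanDynamicConv(nn.Module):
    """Simplified sketch of span-based dynamic convolution (not the exact ConvBERT module)."""
    def __init__(self, dim, kernel_size=5):
        super().__init__()
        self.kernel_size = kernel_size
        # span-aware view: mix each token with its local neighbourhood first
        self.span_proj = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.kernel_gen = nn.Linear(dim, kernel_size)   # one convolution kernel per position
        self.value = nn.Linear(dim, dim)

    def forward(self, x):                                # x: (batch, seq, dim)
        span = self.span_proj(x.transpose(1, 2)).transpose(1, 2)
        kernels = F.softmax(self.kernel_gen(span), dim=-1)            # (batch, seq, k)
        v = self.value(x).transpose(1, 2)                             # (batch, dim, seq)
        v = F.pad(v, (self.kernel_size // 2, self.kernel_size // 2))
        windows = v.unfold(2, self.kernel_size, 1)                    # (batch, dim, seq, k)
        return torch.einsum('bdtk,btk->btd', windows, kernels)

x = torch.randn(2, 16, 64)
print(SpanDynamicConv(64)(x).shape)   # torch.Size([2, 16, 64])
```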

6. Enhanced Seq2Seq Autoencoder via Contrastive Learning (ESACL)

ESACL (Enhanced Seq2Seq Autoencoder via Contrastive Learning) is a denoising sequence-to-sequence (seq2seq) autoencoder trained with contrastive learning for abstractive text summarization. The model uses a standard Transformer-based architecture with a multi-layer bidirectional encoder and an autoregressive decoder. To enhance its denoising ability, self-supervised contrastive learning is combined with several sentence-level document augmentations.
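
The contrastive part can be sketched with a standard NT-Xent objective between the encodings of two augmented views of the same documents. The function below is a generic sketch of that loss, not ESACL's exact training code.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """NT-Xent contrastive loss between two augmented views (z1[i] pairs with z2[i])."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)     # (2N, d)
    sim = z @ z.t() / temperature                           # cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device), float('-inf'))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 256), torch.randn(8, 256))    # two (random, demo-only) view encodings
```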

7. Multi-Heads of Mixed Attention

Multi-heads of mixed attention (MHMA) combine self-attention and cross-attention, encouraging high-level learning of interactions between the entities captured in the various attention features. The module is built from multiple attention heads, each of which can implement either self-attention or cross-attention. Self-attention means the key features and query features are the same or come from the same domain; cross-attention means the key features and query features are generated from different features. MHMA modeling allows a model to discover relationships between features from different domains, which is useful in tasks that involve relational modeling, such as human-object interaction, tool-tissue interaction, and human-computer interfaces.
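
A minimal sketch of the idea in a PyTorch setting: half of the heads run self-attention over the query-domain features and the other half run cross-attention to features from another domain, after which the head outputs are merged. The module name and the 50/50 head split are illustrative choices.

```python
import torch
import torch.nn as nn

class MixedAttentionHeads(nn.Module):
    """Sketch: some heads do self-attention over x, others cross-attention from x to context."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert num_heads % 2 == 0
        self.self_attn = nn.MultiheadAttention(dim, num_heads // 2, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads // 2, batch_first=True)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, x, context):
        self_out, _ = self.self_attn(x, x, x)                  # keys/queries from the same domain
        cross_out, _ = self.cross_attn(x, context, context)    # keys from a different domain
        return self.out(torch.cat([self_out, cross_out], dim=-1))

x, ctx = torch.randn(2, 10, 256), torch.randn(2, 20, 256)
out = MixedAttentionHeads(dim=256, num_heads=8)(x, ctx)        # (2, 10, 256)
```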

8. RealFormer

RealFormer is a Transformer based on the idea of residual attention. It adds skip edges to the backbone Transformer to create multiple direct paths, one for each type of attention module, and it adds no parameters or hyperparameters. Specifically, RealFormer uses a Post-LN style Transformer as the backbone and adds skip edges to connect the multi-head attention modules in adjacent layers.
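
The skip edge amounts to adding the previous layer's raw (pre-softmax) attention scores to the current layer's scores and passing the new scores upward. A minimal sketch of one attention step, with shapes and naming as assumptions:

```python
import torch
import torch.nn.functional as F

def residual_attention(q, k, v, prev_scores=None):
    """Residual attention sketch: add the previous layer's raw scores before the softmax."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)   # (batch, heads, t, t)
    if prev_scores is not None:
        scores = scores + prev_scores                        # skip edge from the layer below
    attn = F.softmax(scores, dim=-1)
    return attn @ v, scores                                  # pass scores up to the next layer
```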

9. Sinkhorn Transformer

The Sinkhorn Transformer is a Transformer that uses sparse Sinkhorn attention as a building block. This component is a drop-in replacement for dense fully-connected attention (as well as for local attention and sparse attention alternatives) and allows for reduced memory complexity along with sparse attention.
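
At the core is Sinkhorn normalization, which turns a score matrix into an approximately doubly stochastic soft permutation used to sort blocks before attending locally. A log-space sketch of the normalization itself (iteration count and naming are arbitrary choices):

```python
import torch

def sinkhorn(log_scores, n_iters=8):
    """Sinkhorn normalization sketch: alternately normalize rows and columns in log
    space to approximate a doubly stochastic (soft permutation) matrix."""
    for _ in range(n_iters):
        log_scores = log_scores - torch.logsumexp(log_scores, dim=-1, keepdim=True)  # rows
        log_scores = log_scores - torch.logsumexp(log_scores, dim=-2, keepdim=True)  # columns
    return log_scores.exp()

p = sinkhorn(torch.randn(4, 4))
print(p.sum(dim=0), p.sum(dim=1))   # rows and columns both sum to roughly 1
```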

10. SongNet

SongNet is a Transformer-based autoregressive language model for generating text under rigidly constrained formats. Sets of symbols are specially designed to improve modeling performance, especially with respect to format, rhyme, and sentence integrity. The attention mechanism is modified to prompt the model to capture some future information about the format, and a pre-training and fine-tuning framework is designed to further improve generation quality.
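
The symbol sets can be pictured as extra per-token ID sequences that describe the target format. The sketch below builds two such sequences from a list of line lengths: a format/rhyme tag and an intra-line countdown position. The tag names (c0/c1/c2) and the overall scheme are simplified assumptions inspired by the description above, not the model's full design.

```python
def format_symbols(line_lengths):
    """Sketch of per-token format symbols for a rigid format: a tag sequence and an
    intra-line position countdown that together describe the target template."""
    tags, intra_pos = [], []
    for length in line_lengths:
        for i in range(length):
            tags.append('c2' if i == length - 1 else 'c1')   # c2 marks the rhyme position
            intra_pos.append(length - 1 - i)                 # countdown to the line end
        tags.append('c0')                                    # punctuation / line-break slot
        intra_pos.append(0)
    return tags, intra_pos

tags, positions = format_symbols([5, 7])   # a two-line format: 5 tokens, then 7 tokens
```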

11. Funnel Transformer

Funnel-Transformer is a Transformer variant that gradually compresses the sequence of hidden states into a shorter sequence, thereby reducing computational cost. Model capacity is further increased by reinvesting the FLOPs saved by this length reduction into building deeper or wider models. Furthermore, to perform the token-level predictions required by common pre-training objectives, Funnel-Transformer can recover a deep representation of each token from the reduced hidden sequence via a decoder.

The proposed model maintains the same overall skeleton of interleaved S-Attn and P-FFN sub-modules wrapped in residual connections and layer normalization. The difference is that, to achieve representation compression and computation reduction, the encoder gradually reduces the sequence length of the hidden states as the layers deepen. In addition, for tasks involving per-token predictions (e.g. pre-training), a simple decoder is used to reconstruct a full sequence of token-level representations from the compressed encoder output. Compression is achieved through pooling operations.
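
The compression step can be sketched as pooling only the query before attention, so that a block maps a length-T sequence of hidden states to a length-T/2 one while keys and values still see the full input. The module below is a simplified illustration, not the exact Funnel-Transformer block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingAttention(nn.Module):
    """Sketch of pooling-based compression: pool the query only, keep full-length keys/values."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, h):                                         # h: (batch, seq, dim)
        q = F.avg_pool1d(h.transpose(1, 2), kernel_size=2, stride=2).transpose(1, 2)
        out, _ = self.attn(q, h, h)                               # output length is seq // 2
        return out

h = torch.randn(2, 128, 256)
print(PoolingAttention(256)(h).shape)   # torch.Size([2, 64, 256])
```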

12. Transformer Decoder

Transformer-Decoder is a modification of Transformer-Encoder-Decoder for long sequences, which removes the encoder module, combines input and output sequences into a single "sentence", and is trained as a standard language model. It is used in GPT and its successors.
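
A sketch of that training setup: source and target are concatenated into one sequence and fed to a language model, with the loss masked on the source part. The separator token, the masking choice, and the toy stand-in model below (which is not a real causal LM) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def lm_loss_on_target(lm_model, source_ids, target_ids, sep_id):
    """Concatenate source and target into one sequence, train as a plain LM,
    and ignore the loss over the source 'prompt' (decoder-only setup sketch)."""
    tokens = torch.cat([source_ids, torch.tensor([sep_id]), target_ids])
    inputs, labels = tokens[:-1], tokens[1:].clone()
    labels[: source_ids.numel()] = -100                 # no loss on the source part
    logits = lm_model(inputs.unsqueeze(0)).squeeze(0)   # (T, vocab)
    return F.cross_entropy(logits, labels, ignore_index=-100)

# toy stand-in mapping token ids to per-position logits; any decoder-only LM would do
toy_lm = nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 100))
loss = lm_loss_on_target(toy_lm, torch.tensor([5, 6, 7]), torch.tensor([8, 9]), sep_id=1)
```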

13. SC-GPT

SC-GPT is a multi-layer Transformer neural language model that is trained in three steps: (i) pre-training on plain text, similar to GPT-2; (ii) continued pre-training on a large corpus of dialogue-act labeled utterances to acquire the ability of controllable generation; and (iii) fine-tuning on the target domain with a very limited number of domain labels. Unlike GPT-2, SC-GPT generates semantically controlled responses conditioned on a given semantic form, similar to SC-LSTM but requiring fewer domain labels to generalize to new domains. It is pre-trained on a large annotated NLG corpus to acquire controllable generation ability and is fine-tuned with only a few domain-specific labels to adapt to new domains.
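
Conditioning on the semantic form can be pictured as linearizing the dialog act in front of the response and training on the single resulting sequence. The serialization below, including the delimiter characters, is an illustrative assumption rather than SC-GPT's exact input format.

```python
def linearize_dialog_act(intent, slots, response=None):
    """Sketch: serialize a dialog act (and optionally its reference response)
    into one training sequence. Delimiters here are illustrative assumptions."""
    slot_str = " ; ".join(f"{k} = {v}" for k, v in slots.items())
    prefix = f"{intent} ( {slot_str} )"
    return prefix if response is None else f"{prefix} & {response}"

print(linearize_dialog_act("inform", {"name": "Blue Spice", "food": "Chinese"},
                           "Blue Spice serves Chinese food."))
```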

14. Chinese Pre-trained Unbalanced Transformer

CPT, the Chinese Pre-trained Unbalanced Transformer, is a pre-trained unbalanced Transformer for Chinese natural language understanding (NLU) and natural language generation (NLG) tasks. CPT consists of three parts: a shared encoder, an understanding decoder, and a generation decoder. The two task-specific decoders, together with the shared encoder, are pre-trained with masked language modeling (MLM) and denoising auto-encoding (DAE) objectives, respectively. Through the partially shared architecture and multi-task pre-training, CPT can (1) use the two decoders to learn knowledge specific to NLU or NLG tasks, and (2) be fine-tuned flexibly to fully realize the potential of the model.
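
A structural sketch of the unbalanced layout, assuming standard PyTorch Transformer layers: a deep shared encoder feeding a shallow understanding branch (trained with MLM) and a shallow generation decoder (trained with DAE). The layer counts and dimensions are placeholder values, and the understanding decoder is approximated here with plain encoder layers.

```python
import torch.nn as nn

class CPTSkeleton(nn.Module):
    """Structural sketch of the shared-encoder / two-decoder layout (not the released model)."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(enc_layer, num_layers=10)
        self.understanding_decoder = nn.TransformerEncoder(enc_layer, num_layers=2)  # MLM branch
        self.generation_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)     # DAE branch

    def forward_nlu(self, x):
        return self.understanding_decoder(self.shared_encoder(x))

    def forward_nlg(self, x, tgt):
        return self.generation_decoder(tgt, self.shared_encoder(x))
```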

15. BinaryBERT

BinaryBERT is a variant of BERT that applies quantization in the form of weight binarization. Specifically, ternary weight splitting is proposed to initialize BinaryBERT by taking an equivalent split from a half-sized ternary network. To obtain BinaryBERT, we first train a half-size ternary BERT model, and then apply the ternary weight splitting operator to obtain the underlying full-precision and quantized weights as an initialization for full-size BinaryBERT. Then, we fine-tune BinaryBERT for further refinement.
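
A toy version of the splitting constraint: given a ternary tensor with values in {-α, 0, +α}, produce two binary tensors whose element-wise sum reproduces it exactly. The real ternary weight splitting operator additionally matches the latent full-precision weights and uses per-tensor scales; this sketch only shows the quantized-value identity.

```python
import torch

def split_ternary(w_ternary, alpha):
    """Toy sketch: split a ternary tensor (values -alpha, 0, +alpha) into two
    binary tensors with values +/- alpha/2 that sum back to the original."""
    half = torch.full_like(w_ternary, alpha / 2)
    b1 = torch.where(w_ternary >= 0, half, -half)
    b2 = torch.where(w_ternary > 0, half, -half)
    assert torch.allclose(b1 + b2, w_ternary)
    return b1, b2

w = 0.3 * torch.randint(-1, 2, (4, 4)).float()   # a random ternary weight with alpha = 0.3
b1, b2 = split_ternary(w, alpha=0.3)
```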

16. Adaptively Sparse Transformer

The Adaptively Sparse Transformer is a Transformer variant in which the softmax in multi-head attention is replaced by α-entmax, a differentiable generalization of softmax that can assign exactly zero weight to low-scoring tokens. Each attention head learns its own α, so different heads can adaptively choose how sparse or dense their attention distributions are.
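
For intuition, the sketch below implements sparsemax, the α = 2 special case of α-entmax: a projection of the attention scores onto the probability simplex that can zero out low-scoring positions entirely. The adaptively sparse Transformer goes further by learning a separate α per head, which this sketch does not do.

```python
import torch

def sparsemax(scores, dim=-1):
    """Sparsemax sketch (alpha-entmax with alpha = 2): Euclidean projection of the
    scores onto the simplex, which yields exactly-zero attention weights."""
    z, _ = torch.sort(scores, dim=dim, descending=True)
    k = torch.arange(1, scores.size(dim) + 1, device=scores.device, dtype=scores.dtype)
    shape = [1] * scores.dim()
    shape[dim] = -1
    k = k.view(shape)
    cssv = z.cumsum(dim) - 1                       # cumulative sums minus 1
    support = (k * z > cssv).to(scores.dtype)      # positions kept in the support
    k_support = support.sum(dim=dim, keepdim=True)
    tau = cssv.gather(dim, k_support.long() - 1) / k_support
    return torch.clamp(scores - tau, min=0)

print(sparsemax(torch.tensor([[2.0, 1.0, 0.1, -1.0]])))   # trailing entries get exactly 0
```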

17. Feedback Transformer

The Feedback Transformer is a sequential Transformer that exposes all previous representations to all future representations, meaning that the lowest-level representation at the current time step is formed from the highest-level abstract representations of the past. This feedback property allows the architecture to perform recursive computation, iteratively building stronger representations on top of previous states. To achieve this, the standard Transformer self-attention mechanism is modified so that it attends to higher-level representations rather than lower-level ones.
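
A minimal sketch of the mechanism, assuming a PyTorch setting: at each time step every layer cross-attends to a single shared memory, and the new memory entry for that step is a learned mixture of all layer outputs, so lower layers at later steps can read high-level representations from the past. The module names, mixing scheme, and dimensions are illustrative simplifications.

```python
import torch
import torch.nn as nn

class FeedbackStep(nn.Module):
    """Sketch: all layers read one shared feedback memory; each step writes back
    a learned mixture of its layer outputs as the new memory entry."""
    def __init__(self, dim, n_layers=4, n_heads=8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(dim, n_heads, batch_first=True) for _ in range(n_layers))
        self.mix = nn.Parameter(torch.zeros(n_layers + 1))   # softmax weights over layer outputs

    def forward(self, x_t, memory):              # x_t: (batch, 1, dim), memory: (batch, T, dim)
        states, h = [x_t], x_t
        for layer in self.layers:
            h = layer(h, memory)                 # every layer reads the same feedback memory
            states.append(h)
        weights = torch.softmax(self.mix, dim=0)
        new_mem = sum(w * s for w, s in zip(weights, states))
        return h, torch.cat([memory, new_mem], dim=1)

model = FeedbackStep(dim=64, n_layers=2, n_heads=4)
memory = torch.zeros(1, 1, 64)                   # dummy first memory slot
for t in range(5):                               # process a short sequence step by step
    out, memory = model(torch.randn(1, 1, 64), memory)
```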

Source: blog.csdn.net/wzk4869/article/details/133105141