[Natural Language Processing | Transformers] A Collection of Common Transformer Algorithms (7)

1. Multi-Heads of Mixed Attention

The multi-head of mixed attention (MHMA) module combines self-attention and cross-attention, encouraging high-level learning of interactions between the entities captured in the various attention features. It is built from multiple attention heads, each of which implements either self-attention or cross-attention. In self-attention, the key features and the query features are the same or come from the same domain; in cross-attention, the key and query features are generated from different features. MHMA therefore allows the model to learn relationships between features from different domains. This is useful in tasks involving relational modeling, such as human-object interaction, tool-tissue interaction, and human-computer interaction.

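To make the idea concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' implementation): half of the heads perform self-attention over one feature set, the other half perform cross-attention from that set to a second domain. The module name `MixedAttention` and the 50/50 head split are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedAttention(nn.Module):
    """Multi-head block where half the heads do self-attention on x
    and half do cross-attention from x (queries) to y (keys/values)."""
    def __init__(self, dim, num_heads):
        super().__init__()
        assert num_heads % 2 == 0 and dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def _split(self, t):
        b, n, _ = t.shape
        return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, x, y):
        q = self._split(self.q_proj(x))                                  # queries always come from x
        k_self, v_self = self._split(self.k_proj(x)), self._split(self.v_proj(x))
        k_cross, v_cross = self._split(self.k_proj(y)), self._split(self.v_proj(y))
        h = self.num_heads // 2
        # first half of the heads: self-attention (keys/values from x)
        out_self = F.scaled_dot_product_attention(q[:, :h], k_self[:, :h], v_self[:, :h])
        # second half: cross-attention (keys/values from y)
        out_cross = F.scaled_dot_product_attention(q[:, h:], k_cross[:, h:], v_cross[:, h:])
        out = torch.cat([out_self, out_cross], dim=1)                    # (b, heads, n, head_dim)
        b, _, n, _ = out.shape
        return self.out_proj(out.transpose(1, 2).reshape(b, n, -1))

# usage: two feature sets with the same hidden size (e.g. human features and object features)
x, y = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
print(MixedAttention(dim=64, num_heads=4)(x, y).shape)   # torch.Size([2, 16, 64])
```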

2. RealFormer

RealFormer is a Transformer based on the idea of residual attention. It adds skip edges to the backbone Transformer to create multiple direct paths, one for each type of attention module, while adding no parameters or hyperparameters. Specifically, RealFormer uses a Post-LN style Transformer as the backbone and adds skip edges to connect the multi-head attention modules in adjacent layers.

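A toy sketch of the residual-attention idea, under the common reading of the paper: each attention layer returns its pre-softmax attention scores, and the next layer adds them to its own scores before the softmax. This is an illustrative layer, not the official implementation.

```python
import torch
import torch.nn as nn

class ResidualAttention(nn.Module):
    """Toy attention layer with RealFormer-style residual scores: the raw
    (pre-softmax) attention logits of the previous layer are added to this
    layer's logits before the softmax."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x, prev_scores=None):
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = [t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2) for t in (q, k, v)]
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5     # (b, heads, n, n)
        if prev_scores is not None:
            scores = scores + prev_scores                           # the skip edge across layers
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(out), scores                                # hand the scores to the next layer

# layer 2 reuses layer 1's attention logits through the skip edge
x = torch.randn(2, 8, 64)
layer1, layer2 = ResidualAttention(64, 4), ResidualAttention(64, 4)
h, s = layer1(x)
h, s = layer2(h, prev_scores=s)
print(h.shape)      # torch.Size([2, 8, 64])
```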

3. Sinkhorn Transformer

The Sinkhorn Transformer is a Transformer that uses Sparse Sinkhorn Attention as a building block. This component is a plug-in replacement for dense, fully connected attention (as well as for local attention and other sparse attention alternatives) and allows for reduced memory complexity as well as sparse attention.

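The key ingredient is the Sinkhorn normalization, which turns a learned block-to-block score matrix into an (approximately) doubly stochastic soft permutation that decides which key block each query block attends to. Below is a toy sketch of just that normalization step; the sorting network and the block-local attention around it are omitted.

```python
import torch

def sinkhorn(log_scores: torch.Tensor, n_iters: int = 8) -> torch.Tensor:
    """Iteratively normalize rows and columns in log space so that the result
    approaches a doubly stochastic (soft permutation) matrix."""
    z = log_scores
    for _ in range(n_iters):
        z = z - torch.logsumexp(z, dim=-1, keepdim=True)   # row normalization
        z = z - torch.logsumexp(z, dim=-2, keepdim=True)   # column normalization
    return z.exp()

# toy example: scores between 4 query blocks and 4 key blocks
scores = torch.randn(4, 4)
p = sinkhorn(scores)
print(p.sum(dim=0), p.sum(dim=1))   # both close to all-ones
```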

4. SongNet

SongNet is a Transformer-based autoregressive language model for generating strictly formatted text. Tailor-designed symbol sets are used to improve modeling performance, especially with respect to format, rhyme, and sentence integrity. The attention mechanism is modified so that the model can capture some future information about the format. A pre-training and fine-tuning framework is further designed to improve the generation quality.

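A rough sketch of the input side of this idea: every position receives the sum of a token embedding and embeddings of predefined format symbols (for example, rhyme markers and intra-sentence position markers). The symbol vocabularies and names below are illustrative assumptions, not the paper's exact definitions.

```python
import torch
import torch.nn as nn

class FormatAwareEmbedding(nn.Module):
    """Token embedding enriched with format-symbol embeddings, so the
    decoder can condition on the target format while generating."""
    def __init__(self, vocab_size, num_format_symbols, num_intra_positions, dim):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.fmt = nn.Embedding(num_format_symbols, dim)           # e.g. rhyme / punctuation slots
        self.pos_in_sent = nn.Embedding(num_intra_positions, dim)  # position inside the sentence

    def forward(self, token_ids, format_ids, intra_pos_ids):
        return self.tok(token_ids) + self.fmt(format_ids) + self.pos_in_sent(intra_pos_ids)

emb = FormatAwareEmbedding(vocab_size=1000, num_format_symbols=8, num_intra_positions=32, dim=64)
tokens = torch.randint(0, 1000, (2, 10))
fmt = torch.randint(0, 8, (2, 10))
intra = torch.randint(0, 32, (2, 10))
print(emb(tokens, fmt, intra).shape)   # torch.Size([2, 10, 64])
```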

5. Funnel Transformer

Funnel-Transformer is a type of Transformer that gradually compresses the sequence of hidden states into a shorter sequence, thereby reducing computational cost. Model capacity is further increased by reinvesting the FLOPs saved by the length reduction into building deeper or wider models. Furthermore, to perform the token-level predictions required by common pre-training objectives, Funnel-Transformer can recover a deep representation of each token from the reduced hidden sequence via a decoder.

The proposed model keeps the same overall skeleton of interleaved self-attention (S-Attn) and position-wise feed-forward (P-FFN) submodules wrapped in residual connections and layer normalization. The difference is that, to compress representations and reduce computation, the encoder gradually reduces the sequence length of the hidden states as the layers get deeper. In addition, for tasks involving per-token predictions (e.g., pre-training), a simple decoder is used to reconstruct the full sequence of token-level representations from the compressed encoder output. Compression is achieved through pooling operations.

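A minimal sketch of one compression step, under the "pool the query only" reading of the paper: strided mean pooling halves the sequence used to form the queries, while keys and values still come from the full-length hidden states, so the block's output sequence is half as long as its input. This is a simplified illustration, not the released model code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FunnelBlock(nn.Module):
    """Self-attention block whose output sequence is half the input length:
    queries come from a strided mean-pooled copy of the hidden states."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, h):                       # h: (batch, seq, dim)
        pooled = F.avg_pool1d(h.transpose(1, 2), kernel_size=2, stride=2).transpose(1, 2)
        attn_out, _ = self.attn(query=pooled, key=h, value=h)
        h = self.norm1(pooled + attn_out)       # residual around the pooled query
        return self.norm2(h + self.ffn(h))

h = torch.randn(2, 16, 64)
print(FunnelBlock(64, 4)(h).shape)    # torch.Size([2, 8, 64]) -- sequence halved
```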

6. Transformer Decoder

Transformer-Decoder is a modification of the Transformer encoder-decoder architecture for long sequences: it drops the encoder module, combines the input and output sequences into a single "sentence", and is trained as a standard language model. It is used in GPT and its successors.
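To make this concrete, here is a toy sketch of the data side of the setup: source and target are joined into one sequence (the separator id and the tiny backbone below are illustrative assumptions) and trained with the standard next-token cross-entropy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SEP_ID, VOCAB, DIM = 1, 1000, 64

# any causal LM backbone works; here a tiny embedding + Transformer encoder with a causal mask
embed = nn.Embedding(VOCAB, DIM)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True), num_layers=2)
lm_head = nn.Linear(DIM, VOCAB)

def lm_loss(source_ids, target_ids):
    """Join source and target into one "sentence" and train with a standard LM objective."""
    sep = torch.full((source_ids.size(0), 1), SEP_ID, dtype=torch.long)
    seq = torch.cat([source_ids, sep, target_ids], dim=1)
    inputs, labels = seq[:, :-1], seq[:, 1:]                              # next-token prediction
    n = inputs.size(1)
    causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)    # causal attention mask
    hidden = backbone(embed(inputs), mask=causal)
    return F.cross_entropy(lm_head(hidden).transpose(1, 2), labels)

src = torch.randint(2, VOCAB, (2, 10))
tgt = torch.randint(2, VOCAB, (2, 12))
print(lm_loss(src, tgt))
```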

7. SC-GPT

SC-GPT is a multi-layer Transformer neural language model trained in three steps: (i) pre-training on plain text, similar to GPT-2; (ii) continued pre-training on a large corpus of dialogue-act labeled utterances to acquire the ability of controllable generation; and (iii) fine-tuning on the target domain with a very limited number of domain labels. Unlike GPT-2, SC-GPT generates semantically controlled responses conditioned on a given semantic form, similar to SC-LSTM but requiring fewer domain labels to generalize to new domains. It is pre-trained on a large annotated NLG corpus to acquire controllable generation ability and is fine-tuned with only a few domain-specific labels to adapt to new domains.

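As a hedged sketch of step (iii), the snippet below fine-tunes a GPT-2 style LM from the Hugging Face transformers library on a single (dialogue act, response) pair, so that generation is conditioned on the semantic form. The serialization format and the "&" separator are illustrative assumptions, not SC-GPT's exact preprocessing.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# condition (dialogue act) and target (response) joined into one sequence
dialog_act = "inform ( name = Blue Spice ; food = Chinese ; price = cheap )"
response = "Blue Spice serves cheap Chinese food."
text = dialog_act + " & " + response + tokenizer.eos_token

ids = tokenizer(text, return_tensors="pt").input_ids
loss = model(ids, labels=ids).loss      # standard LM loss over the joined sequence
loss.backward()                         # one fine-tuning step (optimizer omitted)
```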

8. Siamese Multi-depth Transformer-based Hierarchical Encoder (SMITH)

SMITH (Siamese Multi-depth Transformer-based Hierarchical Encoder) is a Transformer-based model for document representation learning and matching. It contains several design choices that adapt self-attention models to long text inputs. For model pre-training, in addition to the original masked word language modeling task used in BERT, a masked sentence block language modeling task is used to capture sentence block relations within documents. Given a sequence of sentence block representations, the document-level Transformer learns the contextual representation of each sentence block and the final document representation.

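A toy two-level encoder in the spirit of this design (dimensions, pooling choices, and layer counts are arbitrary illustrations): a sentence-block Transformer encodes each block, and a document-level Transformer then contextualizes the block representations.

```python
import torch
import torch.nn as nn

class HierarchicalDocEncoder(nn.Module):
    """Toy two-level encoder: a sentence-block Transformer encodes each block,
    then a document-level Transformer contextualizes the block representations."""
    def __init__(self, vocab_size, dim=64, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.sent_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.doc_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)

    def forward(self, blocks):                       # blocks: (batch, n_blocks, block_len) token ids
        b, nb, bl = blocks.shape
        tok = self.embed(blocks.view(b * nb, bl))
        block_repr = self.sent_encoder(tok)[:, 0]    # first-token pooling per sentence block
        block_repr = block_repr.view(b, nb, -1)
        doc = self.doc_encoder(block_repr)           # contextualize blocks across the document
        return doc.mean(dim=1)                       # final document representation

docs = torch.randint(0, 1000, (2, 6, 16))            # 2 docs, 6 sentence blocks, 16 tokens each
print(HierarchicalDocEncoder(1000)(docs).shape)      # torch.Size([2, 64])
```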

9. Chinese Pre-trained Unbalanced Transformer (CPT)

CPT (Chinese Pre-trained Unbalanced Transformer) is a pre-trained unbalanced Transformer for Chinese natural language understanding (NLU) and natural language generation (NLG) tasks. CPT consists of three parts: a shared encoder, an understanding decoder, and a generation decoder. The two task-specific decoders, together with the shared encoder, are pre-trained with masked language modeling (MLM) and denoising auto-encoding (DAE) tasks, respectively. Through the partially shared architecture and multi-task pre-training, CPT can (1) use the two decoders to learn task-specific knowledge for NLU or NLG, and (2) be fine-tuned flexibly to fully realize the model's potential.

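A toy sketch of the asymmetric layout (layer counts and the exact decoder designs are illustrative assumptions): a deep shared encoder feeds either a shallow understanding branch that produces MLM-style token logits, or a shallow autoregressive generation decoder.

```python
import torch
import torch.nn as nn

class ToyCPT(nn.Module):
    """Toy asymmetric layout: a deep shared encoder, a shallow 'understanding'
    branch producing MLM-style token logits, and a shallow autoregressive
    'generation' decoder."""
    def __init__(self, vocab, dim=64, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.shared_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=4)   # deep, shared
        self.und_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=1)   # shallow NLU branch
        self.gen_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), num_layers=1)   # shallow NLG branch
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, src_ids, tgt_ids=None):
        memory = self.shared_encoder(self.embed(src_ids))
        if tgt_ids is None:                                    # NLU path (e.g. MLM pre-training)
            return self.lm_head(self.und_decoder(memory))
        n = tgt_ids.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        out = self.gen_decoder(self.embed(tgt_ids), memory, tgt_mask=causal)
        return self.lm_head(out)                               # NLG path (e.g. DAE pre-training)

model = ToyCPT(vocab=1000)
src = torch.randint(0, 1000, (2, 12))
print(model(src).shape)                   # NLU logits: torch.Size([2, 12, 1000])
print(model(src, src[:, :5]).shape)       # NLG logits: torch.Size([2, 5, 1000])
```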

10. ClipBERT

ClipBERT is a framework for end-to-end learning of video and language tasks that employs sparse sampling, where each training step uses only one or a few sparsely sampled clips from the video. ClipBERT differs from previous work in two ways.

First, in contrast to densely extracting video features (as most existing methods do), ClipBERT sparsely samples only one or a few short clips from the full video at each training step. The hypothesis is that the visual features of sparse clips already capture the key visual and semantic information in the video, since consecutive clips usually contain similar semantics from continuous scenes. A few clips are therefore sufficient for training instead of the full video. At inference time, predictions from multiple densely sampled clips are aggregated to obtain the final video-level prediction, which is less computationally demanding.

The second differentiating aspect concerns the initialization of model weights (i.e., transfer through pre-training). The authors use 2D architectures (such as ResNet-50) instead of 3D features as the visual backbone for video encoding, which allows them to leverage the power of image-text pre-training for video-text understanding, along with the advantages of low memory cost and runtime efficiency.

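A minimal sketch of the sampling and aggregation logic only (the 2D backbone and the BERT-style fusion module are omitted; function names are illustrative): sparsely sample a couple of short clips during training, and average per-clip predictions into a video-level prediction at inference.

```python
import torch

def sample_clips(video: torch.Tensor, num_clips: int, clip_len: int) -> torch.Tensor:
    """Randomly pick `num_clips` short clips of `clip_len` frames each from a
    video tensor of shape (num_frames, C, H, W) -- sparse sampling for training."""
    starts = torch.randint(0, video.size(0) - clip_len + 1, (num_clips,))
    return torch.stack([video[s:s + clip_len] for s in starts])   # (num_clips, clip_len, C, H, W)

def video_level_prediction(clip_logits: torch.Tensor) -> torch.Tensor:
    """Aggregate per-clip predictions (here by mean pooling) into one video-level
    prediction, as done over densely sampled clips at inference time."""
    return clip_logits.mean(dim=0)

video = torch.randn(120, 3, 64, 64)                      # a toy 120-frame video
clips = sample_clips(video, num_clips=2, clip_len=8)     # only 2 sparse clips per training step
print(clips.shape)                                       # torch.Size([2, 8, 3, 64, 64])
print(video_level_prediction(torch.randn(16, 10)).shape) # 16 clips' logits -> torch.Size([10])
```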

11. BinaryBERT

BinaryBERT is a variant of BERT that applies quantization in the form of weight binarization. Specifically, ternary weight splitting is proposed, which initializes BinaryBERT through an equivalent split of a half-sized ternary network. To obtain BinaryBERT, we first train a half-sized ternary BERT model, and then apply the ternary weight splitting operator to obtain the latent full-precision and quantized weights as the initialization of the full-sized BinaryBERT. BinaryBERT is then fine-tuned for further refinement.

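A toy illustration of the quantized-value side of ternary weight splitting: a ternary matrix with values in {-α, 0, +α} is split into two binary matrices with values in {-α/2, +α/2} whose sum reproduces it exactly. The paper's full operator also splits the latent full-precision weights, which is omitted here.

```python
import torch

def ternary_weight_split(w_ternary: torch.Tensor, alpha: float):
    """Split a ternary weight matrix (values in {-alpha, 0, +alpha}) into two binary
    matrices (values in {-alpha/2, +alpha/2}) whose sum reproduces it exactly."""
    half = torch.tensor(alpha / 2)
    b1 = torch.where(w_ternary > 0, half, torch.where(w_ternary < 0, -half, half))
    b2 = torch.where(w_ternary > 0, half, torch.where(w_ternary < 0, -half, -half))
    return b1, b2

alpha = 0.1
w = torch.tensor([[alpha, 0.0, -alpha], [0.0, alpha, -alpha]])
b1, b2 = ternary_weight_split(w, alpha)
assert torch.allclose(b1 + b2, w)     # equivalent initialization for the binary model
print(b1.unique(), b2.unique())       # each factor is binary: only {-alpha/2, +alpha/2}
```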

Source: blog.csdn.net/wzk4869/article/details/132985480