[Natural Language Processing | Transformers] A Collection of Common Transformer Algorithms (8)

1. Adaptively Sparse Transformer

The Adaptively Sparse Transformer is a Transformer variant in which the softmax in each attention head is replaced with α-entmax, a sparse normalizing transformation whose degree of sparsity is controlled by a parameter α learned per head. This lets each head adaptively concentrate on a few relevant tokens and assign exactly zero attention weight to the rest.

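To make the mechanism concrete, here is a minimal PyTorch sketch that swaps softmax for sparsemax (the fixed α = 2 case of α-entmax) inside scaled dot-product attention. The paper learns α per head; this fixed-α version, and all shapes and names below, are purely illustrative.

```python
import torch

def sparsemax(scores, dim=-1):
    """Sparsemax (Martins & Astudillo, 2016): the alpha=2 case of alpha-entmax.
    Projects scores onto the probability simplex, producing exact zeros."""
    z, _ = torch.sort(scores, dim=dim, descending=True)
    k = torch.arange(1, scores.size(dim) + 1, device=scores.device, dtype=scores.dtype)
    shape = [1] * scores.dim()
    shape[dim] = -1
    k = k.view(shape)                                     # broadcast k along `dim`
    z_cumsum = z.cumsum(dim)
    support = (1 + k * z) > z_cumsum                      # which sorted entries stay nonzero
    k_z = support.sum(dim=dim, keepdim=True).to(scores.dtype)
    tau = (torch.where(support, z, torch.zeros_like(z)).sum(dim, keepdim=True) - 1) / k_z
    return torch.clamp(scores - tau, min=0)

def sparse_attention(q, k, v):
    """Scaled dot-product attention with sparsemax instead of softmax."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    weights = sparsemax(scores, dim=-1)                   # many weights are exactly 0
    return weights @ v

q = k = v = torch.randn(2, 4, 5, 8)                       # (batch, heads, seq, dim)
print(sparse_attention(q, k, v).shape)                    # torch.Size([2, 4, 5, 8])
```

Because sparsemax projects onto the simplex, many attention weights come out exactly zero; the adaptive α in the actual model tunes how aggressively each head does this.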

2. I-BERT

I-BERT is a quantized version of BERT that runs the entire inference with pure integer arithmetic. Building on lightweight integer-only approximations of the nonlinear operations, such as GELU, Softmax, and Layer Normalization, it performs end-to-end integer-only BERT inference without any floating-point calculation.

In particular, GELU and Softmax are approximated with lightweight second-order polynomials that can be evaluated using integer-only arithmetic. For LayerNorm, the square root needed for the standard deviation is computed with a known integer square-root algorithm, so the whole operation stays in integer arithmetic.

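As one concrete example of the integer-only primitives involved, the sketch below shows a floor integer square root via Newton's iteration, the kind of routine an integer-only LayerNorm needs for the standard deviation. It is a generic textbook algorithm used here for illustration, not I-BERT's exact implementation.

```python
import math

def integer_sqrt(n: int) -> int:
    """Floor square root of a non-negative integer using only integer ops
    (Newton's iteration); no floating point is ever touched."""
    if n < 2:
        return n
    x, y = n, (n + 1) // 2
    while y < x:
        x, y = y, (y + n // y) // 2
    return x

# quick check against the standard-library reference
for n in (0, 1, 2, 255, 10_000, 123_456_789):
    assert integer_sqrt(n) == math.isqrt(n)
print("integer_sqrt matches math.isqrt")
```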

3. SqueezeBERT

SqueezeBERT is an efficient architectural variant of BERT for natural language processing that uses grouped convolutions. It is much like BERT-base, but its position-wise fully connected layers are implemented as convolutions, and many of those layers use grouped convolutions.

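A minimal PyTorch sketch of the idea: the position-wise feed-forward block is written as 1×1 convolutions over the sequence with groups > 1, which splits the weight matrices into independent blocks and cuts parameters and FLOPs. The dimensions and group count below are illustrative, not SqueezeBERT's exact configuration.

```python
import torch
from torch import nn

class GroupedConvFFN(nn.Module):
    """Position-wise feed-forward block expressed as 1x1 convolutions over the
    sequence, with groups > 1 to reduce parameters and compute (SqueezeBERT-style)."""
    def __init__(self, d_model=768, d_ff=3072, groups=4):
        super().__init__()
        self.up = nn.Conv1d(d_model, d_ff, kernel_size=1, groups=groups)
        self.act = nn.GELU()
        self.down = nn.Conv1d(d_ff, d_model, kernel_size=1, groups=groups)

    def forward(self, x):                 # x: (batch, seq_len, d_model)
        x = x.transpose(1, 2)             # Conv1d expects (batch, channels, seq_len)
        x = self.down(self.act(self.up(x)))
        return x.transpose(1, 2)

ffn = GroupedConvFFN()
print(ffn(torch.randn(2, 16, 768)).shape)   # torch.Size([2, 16, 768])
```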

4. Feedback Transformer

A Feedback Transformer is a sequential transformer that exposes all previous representations to all future representations, meaning that the lowest representation at the current time step is formed from the highest-level abstract representations of the past. This feedback property allows the architecture to perform recursive computation, iteratively building stronger representations on top of previous states. To achieve this, the standard Transformer self-attention mechanism is modified so that it attends to higher-level representations of past steps rather than only lower ones.

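The sketch below illustrates the feedback mechanism under heavy simplification (no residual connections, layer norm, or feed-forward sublayers): at each step the whole layer stack is merged into a single memory vector by learned softmax weights, and every layer at later steps attends to that shared memory, which is how low layers get to read past high-level abstractions. All module choices and sizes are illustrative.

```python
import torch
from torch import nn

class FeedbackTransformerSketch(nn.Module):
    """Minimal sketch of feedback memory: one merged vector per past time step,
    read by every layer of every later step. Decoding is strictly sequential."""
    def __init__(self, d_model=64, n_layers=3, n_heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)])
        # learned softmax weights that merge the layer stack into one memory vector
        self.merge = nn.Parameter(torch.zeros(n_layers + 1))

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        memory, outputs = [], []
        for t in range(x.size(1)):
            h = x[:, t:t + 1]                      # (batch, 1, d)
            states = [h]
            mem = torch.stack(memory, dim=1) if memory else h  # fall back to self at t=0
            for attn in self.layers:
                h, _ = attn(h, mem, mem)           # every layer reads the same memory
                states.append(h)
            w = torch.softmax(self.merge, dim=0)
            merged = sum(wi * si for wi, si in zip(w, states))
            memory.append(merged.squeeze(1))       # one memory vector for step t
            outputs.append(h.squeeze(1))
        return torch.stack(outputs, dim=1)

model = FeedbackTransformerSketch()
print(model(torch.randn(2, 5, 64)).shape)          # torch.Size([2, 5, 64])
```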

5. Sandwich Transformer

A sandwich transformer is a Transformer variant that reorders the sublayers of the architecture to achieve better performance. The reordering is based on the authors' analysis that models with more self-attention sublayers toward the bottom and more feedforward sublayers toward the top tend to perform better overall.

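A small sketch of the reordering, assuming the sandwich pattern s^k (sf)^(n−k) f^k with sandwich coefficient k: the total sublayer budget matches the usual interleaved stack, only the order changes. The module sizes below are illustrative.

```python
from torch import nn

def sandwich_order(n: int, k: int) -> str:
    """Sublayer ordering s^k (sf)^(n-k) f^k: extra self-attention ('s') at the
    bottom, extra feed-forward ('f') at the top, same total count as (sf)^n."""
    return "s" * k + "sf" * (n - k) + "f" * k

def build_sublayers(order: str, d_model=64, n_heads=4, d_ff=256):
    """Turn an ordering string into a stack of sublayer modules (illustrative sizes)."""
    layers = nn.ModuleList()
    for c in order:
        if c == "s":
            layers.append(nn.MultiheadAttention(d_model, n_heads, batch_first=True))
        else:
            layers.append(nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                        nn.Linear(d_ff, d_model)))
    return layers

order = sandwich_order(n=6, k=3)
print(order)                          # ssssfsfsffff
print(len(build_sublayers(order)))    # 12 sublayers, same budget as (sf)^6
```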

6. MixText

MixText is a semi-supervised learning method for text classification that uses a new data augmentation technique called TMix. TMix creates a large number of augmented training samples by interpolating text in hidden space. The method also leverages recent advances in data augmentation to guess low-entropy labels for unlabeled data, making them as easy to use as labeled data.

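A minimal sketch of TMix-style interpolation in hidden space, with plain linear layers standing in for transformer blocks: both inputs are encoded up to a chosen layer, their hidden states and labels are mixed with the same Beta-sampled λ, and the mixed state is pushed through the remaining layers. The mixing layer, Beta parameter, and label format below are illustrative.

```python
import torch
from torch import nn

def tmix(encoder_layers, x_a, x_b, y_a, y_b, mix_layer, alpha=0.75):
    """Encode two inputs separately up to `mix_layer`, interpolate their hidden
    states with a Beta-sampled lambda, continue with the mixed state, and mix
    the labels with the same lambda."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    h_a, h_b = x_a, x_b
    for layer in encoder_layers[:mix_layer]:
        h_a, h_b = layer(h_a), layer(h_b)
    h = lam * h_a + (1 - lam) * h_b          # interpolate in hidden space
    for layer in encoder_layers[mix_layer:]:
        h = layer(h)
    y = lam * y_a + (1 - lam) * y_b          # soft, mixed label
    return h, y

# toy demo: linear layers stand in for transformer encoder blocks
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])
x_a, x_b = torch.randn(2, 8, 16), torch.randn(2, 8, 16)
y_a = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
y_b = torch.tensor([[0.0, 1.0], [0.0, 1.0]])
h, y = tmix(layers, x_a, x_b, y_a, y_b, mix_layer=2)
print(h.shape, y)
```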

7. ALDEN

ALDEN (Active Learning with DivErse iNterpretations) is an active learning method for text classification. Using local interpretations in DNNs, ALDEN identifies the linearly separable regions of samples. It then selects samples according to the diversity of their local interpretations and queries their labels.

Specifically, the local interpretation of each sample is obtained by backpropagating gradients from the final prediction to the input features. A sample's diversity is then measured through its most diverse word interpretation. The unlabeled samples with the most dissimilar interpretations are selected for labeling, and the model is retrained with these newly labeled samples.

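The sketch below shows the two ingredients in simplified form: a gradient-based local interpretation (gradient of the top predicted logit with respect to the word embeddings) and a greedy selection that favors samples whose dominant-word interpretation is least similar to what has already been chosen. The selection rule is an illustrative stand-in, not ALDEN's exact diversity criterion, and the toy model is hypothetical.

```python
import torch
from torch import nn

def local_interpretation(model, embeddings):
    """Gradient of the top predicted logit w.r.t. the input word embeddings,
    used as a per-word saliency/interpretation vector."""
    emb = embeddings.clone().requires_grad_(True)
    logits = model(emb)                              # (batch, num_classes)
    top = logits.max(dim=-1).values.sum()
    (grad,) = torch.autograd.grad(top, emb)          # (batch, seq, d)
    return grad

def select_diverse(interpretations, budget):
    """Greedy sketch: repeatedly pick the sample whose most salient word's
    interpretation is least similar to those already chosen."""
    norms = interpretations.norm(dim=-1)             # (batch, seq)
    idx = norms.argmax(dim=-1)                       # most salient word per sample
    reps = interpretations[torch.arange(len(idx)), idx]   # (batch, d)
    reps = nn.functional.normalize(reps, dim=-1)
    chosen = [0]
    while len(chosen) < budget:
        sims = reps @ reps[chosen].T                 # cosine similarity to chosen set
        score = sims.max(dim=-1).values
        score[chosen] = float("inf")                 # never re-pick a chosen sample
        chosen.append(int(score.argmin()))
    return chosen

# toy demo: a linear "classifier" over flattened embeddings
model = nn.Sequential(nn.Flatten(start_dim=1), nn.Linear(8 * 16, 3))
emb = torch.randn(10, 8, 16)
grads = local_interpretation(model, emb)
print(select_diverse(grads, budget=3))
```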

8. Dual Contrastive Learning

Contrastive learning has achieved remarkable success in representation learning via self-supervision in unsupervised settings. However, effectively adapting contrastive learning to supervised learning tasks remains a practical challenge. Dual Contrastive Learning (DualCL) is a framework that simultaneously learns the features of input samples and the parameters of the classifier in the same space. Specifically, DualCL treats the classifier's parameters as augmented samples associated with the different labels and then exploits contrastive learning between the input samples and these augmented samples. Empirical studies on five benchmark text classification datasets and their low-resource versions demonstrate improvements in classification accuracy and confirm DualCL's ability to learn discriminative representations.
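A simplified sketch of a dual contrastive loss in this setting, where each sample carries an input feature z and per-class label-aware representations θ (the "classifier parameters" viewed as augmented samples): one direction pulls z toward same-label θ's, the dual direction pulls θ toward same-label z's. The positive/negative construction below is illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dual_contrastive_loss(z, theta, labels, tau=0.1):
    """z: (batch, d) input features; theta: (batch, num_classes, d) label-aware
    representations; labels: (batch,). Symmetric InfoNCE-style loss over
    same-label pairs in both directions."""
    z = F.normalize(z, dim=-1)
    theta_y = F.normalize(theta[torch.arange(len(labels)), labels], dim=-1)  # (batch, d)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)                        # positives
    mask = same & ~torch.eye(len(labels), dtype=torch.bool)                  # drop self-pairs

    def one_direction(anchors, candidates):
        logits = anchors @ candidates.T / tau
        log_prob = logits - logits.logsumexp(dim=1, keepdim=True)
        return -(log_prob * mask).sum(1) / mask.sum(1).clamp(min=1)

    return (one_direction(z, theta_y) + one_direction(theta_y, z)).mean()

# toy demo: 6 samples, 3 classes, 16-dim features
z = torch.randn(6, 16)
theta = torch.randn(6, 3, 16)
labels = torch.tensor([0, 1, 2, 0, 1, 2])
print(dual_contrastive_loss(z, theta, labels))
```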

9. Lbl2TransformerVec
