[Natural Language Processing | Language Models] Introductions to Common Language Model Algorithms (8)

1. Inverted Bottleneck BERT (IB-BERT)

IB-BERT (Inverted Bottleneck BERT) is a BERT variant that uses an inverted bottleneck structure. It serves as the teacher network for training the MobileBERT model.

2. MacBERT

MacBERT is a Chinese Transformer-based NLP model that modifies RoBERTa in several ways, most notably in its masking strategy: instead of masking with the [MASK] token, which never appears during fine-tuning, MacBERT masks a word with one of its similar words. It shares BERT's pre-training tasks, with the following modifications to the MLM task (a small sketch of the masking logic follows the list):

Whole-word masking and N-gram masking are used to select candidate tokens for masking, with word-level unigram to 4-gram proportions of 40%, 30%, 20%, and 10%.
Instead of the [MASK] token, which never appears in the fine-tuning phase, similar words are used for masking. Similar words are obtained with the Synonyms toolkit, which is based on word2vec similarity. When an N-gram is chosen for masking, a similar word is found for each word individually. In the rare case that no similar word exists, masking falls back to random word replacement.
15% of the input words are selected for masking: 80% of them are replaced with similar words, 10% with random words, and the remaining 10% keep the original words.
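
To make this recipe concrete, here is a minimal Python sketch of the sampling logic, assuming the text has already been segmented into whole words. The similar_word helper is only a placeholder for the Synonyms-toolkit lookup; none of this is the authors' implementation.

```python
import random

def similar_word(word, vocab):
    """Placeholder for the Synonyms-toolkit (word2vec-based) lookup.
    Returns a similar word, or None if no synonym is available."""
    return None  # pretend no synonym was found

def macbert_mask(words, vocab, mask_rate=0.15):
    """Sketch of MacBERT-style masking over a list of whole words."""
    words = list(words)
    n_to_mask = max(1, round(len(words) * mask_rate))
    masked = 0
    while masked < n_to_mask:
        # N-gram length: unigram..4-gram with 40/30/20/10 proportions.
        n = random.choices([1, 2, 3, 4], weights=[0.4, 0.3, 0.2, 0.1])[0]
        start = random.randrange(len(words))
        span = range(start, min(start + n, len(words)))
        r = random.random()  # 80% similar word, 10% random word, 10% keep
        for i in span:
            if r < 0.8:
                repl = similar_word(words[i], vocab)
                # Fall back to a random word if no similar word exists.
                words[i] = repl if repl is not None else random.choice(vocab)
            elif r < 0.9:
                words[i] = random.choice(vocab)
            # else: keep the original word unchanged
        masked += len(span)
    return words

print(macbert_mask("我 喜欢 自然 语言 处理".split(), vocab=["苹果", "天气", "电脑"]))
```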

3. CPM-2

CPM-2 is an 11-billion-parameter pre-trained language model based on the standard Transformer architecture, consisting of a bidirectional encoder and a unidirectional decoder. It is pre-trained on WuDaoCorpus, which contains 2.3 TB of cleaned Chinese data and 300 GB of cleaned English data. Pre-training proceeds in three stages: Chinese pre-training, bilingual pre-training, and MoE pre-training. This multi-stage training with knowledge inheritance significantly reduces the computational cost.

4. BARThez

BARThez is a French self-supervised transfer learning model based on BART. Compared with existing BERT-based French models (such as CamemBERT and FlauBERT), BARThez is well suited for generation tasks because both its encoder and decoder are pre-trained.

5. PanGu-α

PanGu-α is an autoregressive language model (ALM) with up to 200 billion parameters, pre-trained on large text corpora that are mainly Chinese. Its architecture is based on the Transformer, which has been widely used as the backbone of pre-trained language models such as BERT and GPT. Unlike them, PanGu-α adds a query layer on top of the Transformer layers, which aims to explicitly induce the expected output.
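
The paragraph above only says that the query layer sits on top of the Transformer stack; the PyTorch sketch below fills in one plausible reading, in which the extra layer attends over the top hidden states with a query taken from a learned per-position embedding. Treat the mechanism, dimensions, and masking here as illustrative assumptions, not the model's exact definition.

```python
import torch
import torch.nn as nn

class QueryLayerSketch(nn.Module):
    """A rough sketch of an extra 'query layer' on top of a Transformer stack.

    Assumption: the attention query comes from a learned per-position query
    embedding, while keys and values come from the final hidden states.
    """
    def __init__(self, d_model=768, n_heads=12, max_len=1024):
        super().__init__()
        self.query_embed = nn.Embedding(max_len, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden_states):                      # (batch, seq, d_model)
        bsz, seq_len, _ = hidden_states.shape
        positions = torch.arange(seq_len, device=hidden_states.device)
        q = self.query_embed(positions).unsqueeze(0).expand(bsz, -1, -1)
        # Causal mask: position t may only attend to hidden states up to t.
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                       device=hidden_states.device), diagonal=1)
        out, _ = self.attn(q, hidden_states, hidden_states, attn_mask=causal)
        return out

hidden = torch.randn(2, 16, 768)
print(QueryLayerSketch()(hidden).shape)   # torch.Size([2, 16, 768])
```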

6. OPT-IML

OPT-IML is an instruction-tuned version of OPT, fine-tuned on a collection of more than 1,500 NLP tasks grouped into different task categories.

7. Siamese Multi-depth Transformer-based Hierarchical Encoder (SMITH)

SMITH (Siamese Multi-depth Transformer-based Hierarchical Encoder) is a Transformer-based model for document representation learning and matching. It contains several design choices that adapt self-attention models to long text inputs. For pre-training, in addition to the original masked word (MLM) task used in BERT, a masked sentence-block language modeling task is used to capture relationships between sentence blocks within a document. Given a sequence of sentence-block representations, a document-level Transformer learns a contextual representation for each sentence block and the final document representation.
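
A minimal sketch of this two-level (sentence block, then document) encoding might look like the PyTorch snippet below. The pooling choices and sizes are illustrative, and SMITH's pre-training tasks and siamese matching head are omitted entirely.

```python
import torch
import torch.nn as nn

class HierarchicalEncoderSketch(nn.Module):
    """Two-level encoder sketch: sentence blocks -> document representation."""
    def __init__(self, vocab_size=30522, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        sent_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        doc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.sent_encoder = nn.TransformerEncoder(sent_layer, n_layers)  # within a block
        self.doc_encoder = nn.TransformerEncoder(doc_layer, n_layers)    # across blocks

    def forward(self, token_ids):                      # (batch, n_blocks, block_len)
        b, nb, bl = token_ids.shape
        x = self.tok_embed(token_ids).view(b * nb, bl, -1)
        block_repr = self.sent_encoder(x).mean(dim=1)  # one vector per sentence block
        block_repr = block_repr.view(b, nb, -1)
        block_ctx = self.doc_encoder(block_repr)       # contextual block representations
        return block_ctx.mean(dim=1)                   # final document representation

ids = torch.randint(0, 30522, (2, 8, 32))        # 2 docs, 8 sentence blocks of 32 tokens
print(HierarchicalEncoderSketch()(ids).shape)    # torch.Size([2, 256])
```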

8. ClipBERT

ClipBERT is a framework for end-to-end learning of video and language tasks that employs sparse sampling, where each training step uses only one or a few sparsely sampled clips from the video. ClipBERT differs from previous work in two ways.

First, in contrast to densely extracting video features over the full video (as most existing methods do), ClipBERT sparsely samples only one or a few short clips from the video at each training step. The hypothesis is that the visual features of these sparse clips already capture the key visual and semantic information in the video, since consecutive clips usually contain similar semantics from continuous scenes, so a few clips are sufficient for training instead of the full video. At inference time, predictions from multiple densely sampled clips are then aggregated to obtain the final video-level prediction, which remains computationally less demanding.
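
A toy illustration of this sampling-and-aggregation idea is shown below; the clip length, the number of clips, and the mean aggregation are illustrative choices, not ClipBERT's exact recipe.

```python
import random
import torch

def sample_clips(num_frames, num_clips=2, clip_len=16):
    """Sparsely pick a few short clips (frame-index ranges) instead of the full video."""
    clips = []
    for _ in range(num_clips):
        start = random.randrange(max(1, num_frames - clip_len))
        clips.append(list(range(start, min(start + clip_len, num_frames))))
    return clips

def aggregate_predictions(clip_logits):
    """Average per-clip logits into a single video-level prediction."""
    return torch.stack(clip_logits, dim=0).mean(dim=0)

print(sample_clips(num_frames=300))                                # 2 clips of 16 frames
print(aggregate_predictions([torch.randn(5) for _ in range(4)]))   # 4 clips, 5-way task
```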

The second difference is the initialization of model weights, i.e., transfer via pre-training. The authors use a 2D architecture (such as ResNet-50) rather than a 3D video backbone for visual encoding, which lets them leverage the power of image-text pre-training for video-text understanding while also benefiting from lower memory cost and better runtime efficiency.
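
As a rough sketch of what a 2D visual backbone amounts to in practice, the snippet below encodes each sampled frame with a torchvision ResNet-50 and mean-pools the per-frame features over time; ClipBERT's actual spatial/temporal pooling and its fusion with the text encoder are omitted.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50()                 # 2D image backbone (randomly initialized here)
backbone.fc = nn.Identity()           # keep the 2048-d pooled features

frames = torch.randn(16, 3, 224, 224)           # one sampled clip: 16 RGB frames
with torch.no_grad():
    frame_feats = backbone(frames)              # (16, 2048) per-frame features
clip_feat = frame_feats.mean(dim=0)             # temporal mean pooling -> (2048,)
print(clip_feat.shape)
```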

9. I-BERT

I-BERT is a quantized version of BERT that runs the entire inference with integer-only arithmetic. Building on lightweight integer-only approximations of nonlinear operations such as GELU, Softmax, and Layer Normalization, it performs end-to-end integer-only BERT inference without any floating-point computation.

In particular, GELU and Softmax are approximated with lightweight second-order polynomials that can be evaluated with integer-only arithmetic. For LayerNorm, the square root is computed with a known integer square-root algorithm, so only integer operations are needed.
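
To give a feel for the polynomial part, the sketch below evaluates a second-order erf/GELU approximation in floating point; the real I-BERT kernel evaluates the same kind of polynomial with integer arithmetic and scaling factors, which is omitted here. The coefficients a and b are the i-GELU constants as reported in the paper; treat their exact values as an assumption of this sketch.

```python
import numpy as np
from math import erf, sqrt

# Second-order polynomial approximation of erf, used to approximate GELU.
# I-BERT evaluates this with integer arithmetic plus scaling factors (omitted).
A, B = -0.2888, -1.769   # assumed i-GELU coefficients

def erf_poly(x):
    # sgn(x) * [A * (min(|x|, -B) + B)^2 + 1]
    return np.sign(x) * (A * (np.minimum(np.abs(x), -B) + B) ** 2 + 1.0)

def gelu_poly(x):
    return x * 0.5 * (1.0 + erf_poly(x / sqrt(2.0)))

x = np.linspace(-4.0, 4.0, 17)
gelu_exact = x * 0.5 * (1.0 + np.array([erf(v / sqrt(2.0)) for v in x]))
print(np.max(np.abs(gelu_poly(x) - gelu_exact)))   # small approximation error
```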

10. SqueezeBERT

SqueezeBERT is an efficient architectural variant of BERT for natural language processing that uses grouped convolutions. It closely resembles BERT-base, but its position-wise fully-connected layers are implemented as convolutions, and many of its layers use grouped convolutions.
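
The core idea, replacing position-wise fully-connected layers with 1x1 convolutions over the sequence and then grouping them to cut compute, can be sketched in a few lines of PyTorch; the hidden size and group count below are illustrative rather than SqueezeBERT's exact configuration.

```python
import torch
import torch.nn as nn

d_model, seq_len, groups = 768, 128, 4

dense_ffn = nn.Linear(d_model, d_model)                    # BERT-style position-wise layer
conv_ffn = nn.Conv1d(d_model, d_model, kernel_size=1)      # same operation as a 1x1 convolution
grouped_ffn = nn.Conv1d(d_model, d_model, kernel_size=1, groups=groups)

x = torch.randn(2, seq_len, d_model)           # (batch, seq, features)
x_conv = x.transpose(1, 2)                     # Conv1d expects (batch, features, seq)
print(conv_ffn(x_conv).shape, grouped_ffn(x_conv).shape)

# Parameter counts: Linear and 1x1 Conv1d match; the grouped version has
# roughly `groups` times fewer weights.
print(sum(p.numel() for p in dense_ffn.parameters()),
      sum(p.numel() for p in conv_ffn.parameters()),
      sum(p.numel() for p in grouped_ffn.parameters()))
```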
