Model summary:
- T5: A Transformer-based text-to-text model that combines multi-task learning with unsupervised pre-training and is trained on the large-scale C4 web corpus.
- GPT-3: Also based on the Transformer; trained on an extremely large corpus, it can perform natural language tasks in zero-shot and few-shot settings through in-context learning.
- Chinchilla: DeepMind's compute-optimal language model; for a fixed compute budget it uses fewer parameters trained on many more tokens, outperforming much larger models.
- PaLM: Google's very large decoder-only Transformer trained with the Pathways system; at scale it achieves strong few-shot results on both understanding and generation tasks.
- LLaMA: Meta's family of open foundation language models trained only on publicly available data; its smaller models are competitive with much larger proprietary models.
- Alpaca: An instruction-following model built by fine-tuning LLaMA on machine-generated instruction-response demonstrations, making it cheap to adapt to new instructions.
- ELECTRA: A pre-trained model that learns language representations with a replaced-token-detection objective: a small generator corrupts the input and a discriminator learns to spot the replaced tokens (see the first code sketch after this list). It achieves strong results with modest compute.
- RoBERTa: Re-trains BERT with more data, longer training, larger batches, and dynamic masking (and drops next-sentence prediction), which yields consistently better results.
- BART: A denoising sequence-to-sequence model that pairs a bidirectional encoder with an autoregressive decoder; pre-trained by reconstructing corrupted text, it performs very well on generation tasks such as summarization.
- UniLM: Unifies language understanding and generation by pre-training a single Transformer with different self-attention masks (unidirectional, bidirectional, and sequence-to-sequence objectives), so it can be applied to a variety of NLP tasks.
- GShard: Google's framework for scaling Transformers through automatic sharding and sparsely activated mixture-of-experts layers, enabling very large models to be trained efficiently across many accelerators.
- LSDSem: A semantic dependency analysis model based on multi-level detection that takes both syntactic and semantic information into account.
- BertRank: A conversational-search model built on a BERT two-tower architecture that uses multi-task learning and a local attention mechanism, with good reported results.
- BERT-DP: A BERT-based dependency parsing model that combines neural representations with dynamic-programming decoding to reach high accuracy.
- NLR: A natural language reasoning model based on generative adversarial networks that uses unsupervised data augmentation and achieves fairly good results.
- MT-DNN: A multi-task learning model that shares a BERT-style encoder across tasks and trains task-specific output layers jointly, improving performance on each task (see the second code sketch after this list).
- ERNIE: A language representation framework that injects knowledge-graph and entity information into pre-training, supporting cross-lingual and cross-domain applications.
- XLNet: An autoregressive model that uses permutation language modeling (together with Transformer-XL-style recurrence), so bidirectional context is captured during pre-training.
- TAPAS: A BERT-style encoder for question answering over tables; it adds row and column embeddings and predicts cell selections and aggregation operations instead of generating logical forms.
- DeBERTa: Improves on BERT and RoBERTa with disentangled attention, which represents each word with separate content and position vectors, plus an enhanced mask decoder.
- FNet: Replaces the self-attention sublayer with an unparameterized Fourier transform for token mixing, achieving results close to Transformer-based models at much lower cost (see the third code sketch after this list).
- AdaBERT: Compresses BERT into small task-adaptive models using differentiable neural architecture search, trading a little accuracy for much faster inference.
- UniSkip: Uses span information within a sentence to control the flow of information, so the model attends more to the important parts of the input.
- Transformer-XH: Extends the Transformer with eXtra-hop attention that propagates information across linked pieces of text, improving results on multi-evidence tasks such as multi-hop question answering.
- Embedding Propagation: Automatically learns an embedding vector for each word and uses manifold-based techniques to obtain a richer semantic representation.
- EAT: A Transformer-based entity-relation representation model that introduces self-attention and global feature attention, with good results.
- GPT-2: A Transformer-based language model pre-trained on a large web corpus with a purely unsupervised objective; it can perform many tasks zero-shot and achieves good results.
- ULMFiT: Pre-trains an LSTM language model on a general corpus and then fine-tunes it for a target task with discriminative learning rates, slanted triangular schedules, and gradual unfreezing, giving strong text-classification results from little labelled data.
- BERT-MRC: A BERT-based reading-comprehension model that moves from binary classification to span extraction, improving accuracy.
- ERNIE-Gram: Extends ERNIE with explicit n-gram masking during pre-training, so whole multi-word units are predicted rather than individual subwords, achieving good results on language-understanding benchmarks.
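To make ELECTRA's replaced-token-detection objective concrete, here is a minimal PyTorch sketch. The tiny `generator` and `discriminator` modules, the vocabulary size, and the random tokens are placeholder assumptions; the real ELECTRA uses full Transformer networks for both and trains the generator jointly with a masked-language-modeling loss.

```python
# Sketch of ELECTRA-style replaced-token detection (illustrative only).
import torch
import torch.nn as nn

vocab_size, hidden, seq_len, batch = 100, 32, 10, 4

# Toy "generator": predicts a token for every position (stand-in for a small MLM).
generator = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))
# Toy "discriminator": scores each token as original (0) or replaced (1).
discriminator = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, 1))

tokens = torch.randint(0, vocab_size, (batch, seq_len))   # original input
mask = torch.rand(batch, seq_len) < 0.15                  # positions to corrupt

with torch.no_grad():                                     # sample replacements from the generator
    gen_logits = generator(tokens)
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
corrupted = torch.where(mask, sampled, tokens)

# The discriminator learns to detect which tokens were replaced.
is_replaced = (corrupted != tokens).float()
disc_logits = discriminator(corrupted).squeeze(-1)
loss = nn.functional.binary_cross_entropy_with_logits(disc_logits, is_replaced)
print(loss.item())
```

Because every position carries a label (replaced or not), the discriminator receives a learning signal from the whole sequence rather than only from the ~15% of masked positions, which is where the sample efficiency comes from.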
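As a rough illustration of MT-DNN-style joint training, the sketch below alternates mini-batches from two tasks over one shared encoder. The tiny linear encoder, the two task heads, and the random data are placeholder assumptions; MT-DNN itself uses a BERT encoder and real GLUE-style tasks.

```python
# Sketch of multi-task learning with a shared encoder and per-task heads.
import torch
import torch.nn as nn

hidden = 32
shared_encoder = nn.Sequential(nn.Linear(16, hidden), nn.ReLU())   # stands in for BERT
task_heads = nn.ModuleDict({
    "sentiment": nn.Linear(hidden, 2),    # e.g. binary classification
    "similarity": nn.Linear(hidden, 1),   # e.g. regression (STS-B style)
})
params = list(shared_encoder.parameters()) + list(task_heads.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

for step in range(100):
    task = ["sentiment", "similarity"][step % 2]          # alternate tasks each step
    x = torch.randn(8, 16)                                # placeholder features
    h = shared_encoder(x)                                 # shared representation
    out = task_heads[task](h)
    if task == "sentiment":
        loss = nn.functional.cross_entropy(out, torch.randint(0, 2, (8,)))
    else:
        loss = nn.functional.mse_loss(out.squeeze(-1), torch.randn(8))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The shared encoder is updated by every task's gradients, which is what lets the tasks regularize and improve one another.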
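The FNet idea can be shown in a few lines: the attention sublayer of a Transformer block is swapped for a parameter-free Fourier transform applied along the hidden and sequence dimensions, keeping only the real part. The layer sizes and the surrounding block structure below are simplified assumptions.

```python
# Sketch of an FNet-style block: Fourier mixing instead of self-attention.
import torch
import torch.nn as nn

class FNetMixing(nn.Module):
    def forward(self, x):                      # x: (batch, seq_len, hidden)
        # FFT along the hidden dimension, then along the sequence dimension; keep the real part.
        return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

class FNetBlock(nn.Module):
    def __init__(self, hidden=64, ff=256):
        super().__init__()
        self.mixing = FNetMixing()
        self.norm1 = nn.LayerNorm(hidden)
        self.ff = nn.Sequential(nn.Linear(hidden, ff), nn.GELU(), nn.Linear(ff, hidden))
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, x):
        x = self.norm1(x + self.mixing(x))     # Fourier mixing + residual + norm
        return self.norm2(x + self.ff(x))      # feed-forward + residual + norm

x = torch.randn(2, 10, 64)
print(FNetBlock()(x).shape)                    # torch.Size([2, 10, 64])
```

Since the mixing step has no learnable parameters, the block trains and runs faster than an attention block while still letting every token influence every other token.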
List of pros and cons:
Model name | Advantages | Disadvantages |
---|---|---|
T5 | Combines multi-task learning and unsupervised pre-training; trained on a large-scale corpus | Long training time |
GPT-3 | Huge training corpus; strong zero-shot and few-shot performance via in-context learning | Not fully open; available mainly through an API |
Chinchilla | Compute-optimal training: a smaller model trained on more tokens outperforms much larger models | Not suitable for every application scenario |
PaLM | Very large decoder-only model trained with the Pathways system; strong few-shot performance | Requires very large amounts of compute and data |
LLaMA | Open models trained on publicly available data; smaller models are competitive with much larger ones | Performance can be affected by biases in the training data |
Alpaca | Cheap to build by instruction-tuning LLaMA on machine-generated demonstrations; follows instructions well | Model weights were not released because of licensing restrictions; limited by its synthetic training data |
ELECTRA | Replaced-token detection makes pre-training sample-efficient and gives strong results | Not yet thoroughly tested on all NLP tasks |
RoBERTa | More training data, longer training, larger batches, and dynamic masking | May require more computing resources to train |
BART | Bidirectional encoder paired with an autoregressive decoder; strong on generation tasks such as summarization | May fall short in applications that require very high precision |
UniLM | A single model handles both language understanding and generation; applicable to many NLP tasks | Training on large-scale data can take a long time |
GShard | Supports large-scale distributed training with very good performance | High cost to use |
LSDSem | Considers both syntactic and semantic information | Not currently applicable to all NLP tasks |
BertRank | Uses multi-task learning and a local attention mechanism | Risk of overfitting in some application scenarios |
BERT-DP | Combines neural representations with dynamic-programming decoding for high accuracy | Sensitive to noise or errors in the input data |
NLR | Uses unsupervised data augmentation; achieves reasonably good results | Like BERT-DP, sensitive to noise or errors in the input data |
MT-DNN | Joint training across multiple tasks improves performance on each task | High training time and compute requirements |
ERNIE | Injects knowledge-graph and entity information; supports cross-lingual and cross-domain applications | Results are unsatisfactory in some application scenarios |
XLNet | Captures bidirectional context through permutation language modeling and Transformer-XL recurrence | Training and tuning require more time and compute |
TAPAS | Answers questions over tables directly with a BERT-style encoder, without generating logical forms | Results are unsatisfactory in some application scenarios |
DeBERTa | Disentangled attention and an enhanced mask decoder improve on BERT and RoBERTa | Training and tuning require more time and compute |
FNet | Accuracy close to Transformer-based models while being much more computationally efficient | Still at the research stage |
AdaBERT | Searches for small task-adaptive architectures, giving fast inference | The architecture search adds extra training and tuning time |
UniSkip | Focuses attention on the important parts of the input sentence | Training on large-scale data can take a long time |
Transformer-XH | eXtra-hop attention combines evidence across documents; strong results on multi-hop tasks | More complex than a standard Transformer |
Embedding Propagation | Learns an embedding vector for each word and yields a richer semantic representation | Results are unsatisfactory in some application scenarios |
EAT | Self-attention plus global feature attention give good results | Training and tuning place high demands on computing resources |
GPT-2 | Purely unsupervised pre-training on a large web corpus; good zero-shot performance | Not suitable for all NLP tasks |
ULMFiT | Transfers a pre-trained language model to new tasks with little labelled data via staged fine-tuning | Requires more computing resources and time than a simple task-specific classifier |
BERT-MRC | Moves from binary classification to span extraction, improving accuracy | Not suitable for all reading-comprehension tasks |
ERNIE-Gram | Explicit n-gram masking improves pre-training; good results on language-understanding benchmarks | Results are unsatisfactory in some application scenarios |