Cheat sheet of commonly used NLP backbone models (1)

Foreword

Since the Transformer appeared in 2017, it has shown up in virtually every major NLP task. Stanford recently opened a course dedicated to Transformers: [Stanford] CS25 Transformers United | Fall 2021

If you are new to NLP, you can read an article I wrote earlier: Research 0_NLPer set off

For any of these models, you can go to Hugging Face's transformers library, transformers/models (GitHub), find the corresponding model, and read its source-code implementation.
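For example, a minimal sketch of loading one of these backbones through transformers (the checkpoint name "bert-base-chinese" is just an illustrative choice):

```python
# Load a pretrained backbone and its tokenizer from the transformers library.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

inputs = tokenizer("今天天气不错", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```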

These days the dominant approach is contextual (dynamic) word-vector encoding; static word-vector lookups such as word2vec and GloVe are rarely used anymore.
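As a quick illustration (assuming a public BERT checkpoint such as "bert-base-uncased"), the same surface word receives different vectors in different contexts, unlike a static lookup table:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I deposited money at the bank.", "We sat on the river bank."]
vectors = []
with torch.no_grad():
    for sent in sentences:
        enc = tok(sent, return_tensors="pt")
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden)
        # locate the token "bank" and keep its contextual vector
        bank_pos = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids("bank"))
        vectors.append(hidden[bank_pos])

# similarity is well below 1: "bank" is encoded differently in each sentence
print(torch.cosine_similarity(vectors[0], vectors[1], dim=0))
```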

A video blew up on Bilibili: [NLP Natural Language Processing] A computer science PhD, worthy of a Tsinghua professor! Get NLP done in 5 hours! (The title is a bit emm... but the table of contents actually looks OK.)

Before prompting took off, adapter techniques were also very popular; see: Adapter technology in NLP
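A minimal sketch of the adapter idea (a Houlsby-style bottleneck with a residual connection; the layer sizes here are placeholders, not values from any particular paper):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck MLP inserted into a frozen Transformer layer."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # down-projection
        self.up = nn.Linear(bottleneck, hidden_size)    # up-projection
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # residual connection keeps the frozen backbone's representation intact
        return x + self.up(self.act(self.down(x)))
```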

Common initialization methods: several common weight-initialization methods for deep learning
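For reference, a few of the usual schemes as they appear in PyTorch (the layer and the std value are only placeholders):

```python
import torch.nn as nn

linear = nn.Linear(768, 768)

nn.init.xavier_uniform_(linear.weight)                        # Glorot/Xavier
nn.init.kaiming_normal_(linear.weight, nonlinearity="relu")   # He/Kaiming
nn.init.normal_(linear.weight, mean=0.0, std=0.02)            # BERT/GPT-style truncated-ish normal
nn.init.zeros_(linear.bias)
```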

Data augmentation methods: an article to understand data augmentation in NLP and CV
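Two of the classic EDA-style text augmentations, as a rough token-level sketch (purely illustrative):

```python
import random

def random_deletion(tokens, p=0.1):
    # drop each token with probability p, but never return an empty sentence
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]

def random_swap(tokens, n_swaps=1):
    # swap the positions of two random tokens, n_swaps times
    tokens = tokens[:]
    for _ in range(n_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

print(random_deletion("the quick brown fox jumps".split()))
print(random_swap("the quick brown fox jumps".split()))
```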


Papers

CPT


Paper: CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation
Code: https://github.com/fastnlp/CPT . When I read the source code earlier, I found that the encoder is BERT.

A clever architecture with one encoder and two decoders; it supports Chinese. See the structural sketch below.
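A rough structural sketch of that idea in plain PyTorch, not the official fastnlp/CPT implementation (layer counts and dimensions are illustrative):

```python
import torch.nn as nn

class CPTSketch(nn.Module):
    """One deep shared encoder, two shallow decoders (NLU branch + NLG branch)."""
    def __init__(self, d_model=768, nhead=12):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=10)
        # understanding branch: BERT-like bidirectional layers over the encoder output
        self.u_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2)
        # generation branch: autoregressive decoder with cross-attention
        self.g_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers=2)

    def forward(self, src_emb, tgt_emb):
        memory = self.encoder(src_emb)                 # shared representation
        understanding = self.u_decoder(memory)         # for NLU-style tasks
        generation = self.g_decoder(tgt_emb, memory)   # for NLG-style tasks
        return understanding, generation
```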



BART

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
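BART is a standard encoder-decoder model pre-trained by denoising corrupted text. A quick way to poke at it via transformers (assuming the public facebook/bart-base checkpoint):

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# BART is trained to reconstruct corrupted text, e.g. filling a <mask> span
inputs = tok("The weather today is <mask> and sunny.", return_tensors="pt")
out = model.generate(**inputs, max_length=20)
print(tok.decode(out[0], skip_special_tokens=True))
```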



T5

The 67-page arXiv version of the paper is really long...
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
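T5 casts every task as text-to-text, with a task prefix telling the model what to do. A minimal sketch, assuming the public t5-small checkpoint:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# the prefix "translate English to German:" selects the task
inputs = tok("translate English to German: The house is wonderful.", return_tensors="pt")
out = model.generate(**inputs, max_length=40)
print(tok.decode(out[0], skip_special_tokens=True))
```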



MASS

MASS: Masked Sequence to Sequence Pre-training for Language Generation
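The core of MASS is the data construction: mask a contiguous span on the encoder side and have the decoder reconstruct exactly that span. A toy sketch of that objective (mask ratio and special symbols are illustrative):

```python
import random

def mass_example(tokens, mask_ratio=0.5, mask_token="[MASK]"):
    # mask a contiguous span of the input
    span_len = max(1, int(len(tokens) * mask_ratio))
    start = random.randint(0, len(tokens) - span_len)
    target_span = tokens[start:start + span_len]
    encoder_input = tokens[:start] + [mask_token] * span_len + tokens[start + span_len:]
    # decoder input is the target span shifted right (teacher forcing)
    decoder_input = ["<s>"] + target_span[:-1]
    return encoder_input, decoder_input, target_span

print(mass_example("natural language generation needs good pre-training".split()))
```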



GPT

Decoder-only architecture.
Li Mu (Mushen) has covered the GPT series before: GPT, GPT-2, GPT-3 Paper Intensive Reading [Paper Intensive Reading]_Learn AI with Li Mu_bilibili
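The whole family is decoder-only: a causal (lower-triangular) mask keeps each position from attending to future tokens. A tiny illustration:

```python
import torch

seq_len = 5
# True means "allowed to attend": position i may attend to positions 0..i only
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
```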

GPT-1

Legend has it that BERT was inspired by GPT-1 and was whipped up as its "little brother" within two months.
Improving Language Understanding by Generative Pre-Training

GPT-2

GPT-2 does not match BERT on understanding tasks, but it is well suited to generation. GPT-3 is very large (hard to run if the lab lacks the hardware), so some people still use GPT-2 for small generation demos.
Language Models are Unsupervised Multitask Learners
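A typical small GPT-2 demo through transformers (the gpt2 checkpoint and the sampling settings are just one possible choice):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tok("Once upon a time", return_tensors="pt")
# nucleus sampling for more varied continuations
out = model.generate(**inputs, max_length=30, do_sample=True, top_p=0.95)
print(tok.decode(out[0], skip_special_tokens=True))
```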


GPT-3

It is also used for code generation and the like.
Paper: Language Models are Few-Shot Learners
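GPT-3 is usually driven through prompting rather than fine-tuning; a few-shot prompt simply stacks a handful of solved examples before the new query. A toy illustration (the prompt text is made up):

```python
# Few-shot prompting: solved examples are concatenated before the new query,
# and the model is expected to continue the pattern.
few_shot_prompt = """Translate English to French.

English: cheese
French: fromage

English: good morning
French: bonjour

English: thank you
French:"""
# a capable model should continue with something like " merci"
```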



BERT

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Encoder-only structure. BERT has a large family of variants, such as the distilled DistilBERT and the variant RoBERTa.
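A quick way to probe BERT-family encoders is the fill-mask objective; a sketch using the transformers pipeline (bert-base-uncased here, but DistilBERT or RoBERTa checkpoints drop in the same way, with RoBERTa using <mask> instead of [MASK]):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The capital of France is [MASK]."))
```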

Input embedding composition: the input representation is the sum of token, segment, and position embeddings.
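A sketch of that composition in plain PyTorch (vocabulary size and dimensions follow the usual BERT-base values but are only illustrative here):

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768
tok_emb = nn.Embedding(vocab_size, hidden)
seg_emb = nn.Embedding(2, hidden)        # sentence A / sentence B
pos_emb = nn.Embedding(max_len, hidden)  # learned absolute positions
norm = nn.LayerNorm(hidden)

input_ids = torch.tensor([[101, 7592, 2088, 102]])       # toy token ids
segment_ids = torch.zeros_like(input_ids)                # all sentence A
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

embeddings = norm(tok_emb(input_ids) + seg_emb(segment_ids) + pos_emb(positions))
print(embeddings.shape)  # (1, 4, 768)
```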



Transformer

The famous self-attention comes from this article.
Attention Is All You Need

I have reproduced this model before: Transformer structure reproduction_Attention Is All You Need (PyTorch)

Encoder-decoder structure: the encoder stacks self-attention and feed-forward blocks; the decoder adds masked self-attention plus cross-attention over the encoder output.

Attention module: scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
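A minimal implementation of that attention function (shapes are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., len_q, len_k)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

q = k = v = torch.randn(1, 5, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # (1, 5, 64)
```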



Source: blog.csdn.net/weixin_43850253/article/details/126070768