[Overview of 100 Large Models] UL2 (Google)

Author: Wang Jianing. This article is reprinted/organized content; repository link: https://github.com/wjn1996/LLMs-NLP-Algo

Subscribe to the column [Large Model & NLP & Algorithm] to get all the NLP, large-model, and algorithm materials the blogger has accumulated over the years: nearly 200 papers, 300 markdown notes written by the blogger, and nearly 100 large-model data cards, to help with NLP research, study, and job hunting.


UL2 large model basic information data card

  • Serial number: 11
  • Model name: UL2
  • Affiliation: Google
  • Release time: 2022-05
  • Scale: 20B
  • Pre-training corpus: C4 (Colossal Clean Crawled Corpus)
  • Benchmark: 50 NLP tasks, including text generation (manual + automatic evaluation), text classification, QA, commonsense reasoning, long-text reasoning, structured knowledge grounding, information retrieval, etc.
  • Model and training method: a unified training method that is independent of model architecture and task type. (1) Sparse training; (2) the three objectives MLM, PrefixLM, and CausalLM are unified as span corruption (Mixture-of-Denoisers, MoD): R-denoising is similar to MLM, masking spans for the model to predict; S-denoising masks the end of the text and lets the model generate it; X-denoising increases the masked span length and the number of masked spans; (3) trained with the JAX/Flax framework.
  • Open source (code): https://github.com/google-research/google-research/tree/master/ul2
  • Paper: https://arxiv.org/pdf/2205.05131.pdf
  • Model address: https://huggingface.co/google/ul2

UL2


  Google's Yi Tay (and Mostafa Dehghani) team proposed a new strategy, Mixture-of-Denoisers, which unifies the major pre-training paradigms.

  Rethinking the current pre-train-then-fine-tune setup, we have a variety of pre-training paradigms: decoder-only or encoder-decoder, span corruption or language modeling, and so on. Different paradigms model different contextual relationships, and precisely because of this, different pre-training paradigms suit different types of downstream tasks. For example, pre-training on bidirectional context (span corruption, e.g. T5) is better suited to fact completion, while unidirectional context (PrefixLM/LM, e.g. GPT) is better suited to open-ended generation. In other words, a specific type of downstream task requires choosing a specific pre-training strategy...

  To be precise, there are three common paradigms: CausalLM (i.e., LM) for unidirectional text modeling, span corruption for bidirectional text modeling, and PrefixLM for prefix-based text modeling.

  Is this a grand unification? It feels more like a small one; something has always seemed to be missing from the menu. Today, Google serves up that missing dish: Mixture-of-Denoisers. Let's get a feel for the effect first:

Paper Title: Unifying Language Learning Paradigms
Paper Author: Yi Tay, Mostafa Dehghani, etc. (Google)
Paper Link: https://arxiv.org/pdf/2205.05131.pdf


Method

  Let me first state the goal of the method in this paper: to build a pre-training strategy that is independent of the model architecture and of the downstream task type, and that can be flexibly adapted to different kinds of downstream tasks.
The overall framework is very similar to UniLM [1], but sparsification is introduced.

Mixture-of-Denoisers

  First, review the three pre-training paradigms mentioned above: CausalLM, PrefixLM, and span corruption. In fact, they can all be unified under span corruption: define a denoising function f(μ, r, n), where μ is the mean span length, r is the corruption rate, and n is the number of corrupted spans, and let L denote the input sequence length. After the corrupted spans are sampled from a normal or uniform distribution, the model is trained to restore them. It is easy to see that CausalLM only requires setting μ = L (the whole sequence is one corrupted span), while PrefixLM only requires setting μ = L − P (where P is the prefix length).
Based on this, the author proposes Mixture-of-Denoisers:

  • R-Denoiser: regular denoising. Corrupted spans are 2-5 tokens long, at roughly a 15% mask rate. It is mainly used for acquiring knowledge rather than for learning to generate fluent text.
  • S-Denoiser: sequential denoising. It preserves strict sequence order and is typically used for inputs-to-targets tasks such as PrefixLM. Note that the visible prefix is still modeled as context, while the long masked-out span at the end is invisible to the model.
  • X-Denoiser: extreme denoising. It can be viewed as an interpolation between the R- and S-denoisers pushed to an extreme, i.e., very long spans or a very high mask rate. It is mainly aimed at long-text generation, where the context available to the model is very limited relative to what has to be generated. (A toy sketch of these three denoising modes follows below.)
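To make the three denoising modes concrete, here is a minimal, self-contained Python sketch of span corruption in the spirit of mixture-of-denoisers. It is my own toy illustration, not the paper's T5X/SeqIO implementation: the sentinel format, the helper names, and the example mixture settings at the end are assumptions for demonstration only and do not reproduce the paper's exact seven denoiser configurations.

```python
# Toy sketch of UL2-style mixture-of-denoisers span corruption (illustrative only;
# NOT the official UL2 / T5X code, and the settings below are NOT the paper's
# exact seven denoiser configurations).
import random

SENTINEL = "<extra_id_{}>"  # assumed T5-style sentinel placeholder


def span_corrupt(tokens, mean_span_len, corruption_rate, rng=random):
    """R/X-style denoising: mask spans covering ~corruption_rate of the tokens."""
    n_masked = max(1, int(len(tokens) * corruption_rate))
    n_spans = max(1, n_masked // mean_span_len)
    starts = sorted(rng.sample(range(len(tokens)), n_spans))
    inputs, targets, cursor, sid = [], [], 0, 0
    for s in starts:
        if s < cursor:  # skip overlapping spans in this toy version
            continue
        inputs += tokens[cursor:s] + [SENTINEL.format(sid)]
        targets += [SENTINEL.format(sid)] + tokens[s:s + mean_span_len]
        cursor = s + mean_span_len
        sid += 1
    inputs += tokens[cursor:]
    return inputs, targets


def prefix_lm(tokens, prefix_fraction=0.75):
    """S-style denoising (PrefixLM): keep a prefix as context, generate the rest."""
    cut = max(1, int(len(tokens) * prefix_fraction))
    return tokens[:cut], tokens[cut:]


# Example mixture of (mode tag, denoising function). Values are made up for illustration.
MIXTURE = [
    ("[R]", lambda t: span_corrupt(t, mean_span_len=3, corruption_rate=0.15)),  # short spans, ~15%
    ("[X]", lambda t: span_corrupt(t, mean_span_len=12, corruption_rate=0.5)),  # long spans / high rate
    ("[S]", lambda t: prefix_lm(t, prefix_fraction=0.75)),                      # PrefixLM-style
]

if __name__ == "__main__":
    tokens = [f"w{i}" for i in range(24)]
    mode, denoise = random.choice(MIXTURE)
    inp, tgt = denoise(tokens)
    print(mode, " ".join(inp), "=>", " ".join(tgt))
```

In actual training, the sampled mode tag would be prepended to the inputs (see "Mode Switching" below) and the (inputs, targets) pair fed to the encoder-decoder or decoder-only model.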

Finally, the paper mixes seven denoiser settings, i.e., R-, S-, and X-denoisers instantiated with different span lengths and corruption rates; see the configuration table in the paper for the exact values.

Mode Switching

  This paper proposes to switch paradigms via mode switching. First, during pre-training, three special paradigm tokens are added, [R], [S], and [X], each corresponding to one of the three denoisers. Then, when fine-tuning on downstream tasks or doing few-shot learning, the paradigm token matching the setting and needs of the specific task is also prepended, to trigger the model to learn a better solution. As for the backbone architecture, whether to use encoder-decoder or decoder-only does not really matter, since the method is meant to be architecture-agnostic; the author therefore runs controlled experiments with both settings on top of T5.
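As a usage illustration, here is a minimal inference sketch of mode switching with the released checkpoint on Hugging Face. Treat it as a sketch under assumptions: the mode-prefix strings ("[S2S]"/"[NLU]"/"[NLG]", roughly corresponding to S-/R-/X-denoising) are my reading of the model card and should be verified at https://huggingface.co/google/ul2, and the 20B checkpoint needs far more memory than a typical single GPU.

```python
# Minimal sketch of UL2 "mode switching" at inference time with Hugging Face
# Transformers. The mode-prefix strings are an assumption to verify against the
# model card at https://huggingface.co/google/ul2; this is not official usage code.
from transformers import AutoTokenizer, T5ForConditionalGeneration

MODEL_ID = "google/ul2"  # 20B parameters: loading it requires a lot of memory

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = T5ForConditionalGeneration.from_pretrained(MODEL_ID)


def generate(text: str, mode: str = "[S2S]", max_new_tokens: int = 64) -> str:
    """Prepend a paradigm token so the model operates in the matching denoising mode."""
    inputs = tokenizer(f"{mode} {text}", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    # S-denoising / PrefixLM-style continuation of a prompt.
    print(generate("The key idea behind mixture-of-denoisers pre-training is"))
```

For R-/X-style infilling one would instead place sentinel tokens in the input and read the predicted spans from the output; again, the exact prompt format is best taken from the model card.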

Experiments

Ablation experiments

Task setting:

  • SuperGLUE (SG): 8 NLU sub-tasks
  • GEM benchmark : XSUM (summarization), ToTTo (table-to-text generation), Schema Guided Dialog (SGD)
  • C4 validation set

Scene setting:

  • Fine-tuning
  • Prompt-based one-shot learning

Baselines :

  • Causal Language Model (CLM): GPT-style
  • Prefix LM (PLM)
  • Span Corruption (SC): T5-style
  • Span Corruption + LM (SCLM)
  • UniLM (ULM)

  1. Decoder vs. Encoder-Decoder


  Conclusion: when parameter count (storage) is not a concern, encoder-decoder beats decoder-only; the self-supervised objective matters more than the backbone architecture.

  2. Paradigm Prompt (mode switching)

  Conclusion: In one-shot scenarios, using a paradigm prompt is almost always better; however, choosing the right paradigm prompt is critical.

  3. Mixture-of-Denoisers


▲ SD% denotes the proportion of S-denoising in the mixture.
Conclusion: X-denoising is complementary but should not be used on its own; it is better to keep the proportion of S-denoising small.

  4. Slightly increase model size and pre-training data volume


  Conclusion: The method in this paper is slightly worse than T5 on SuperGLUE, but remains ahead on the other tasks.

20 billion parameters!

  OK, now for the main event: scaling to 20B parameters! Although the method is architecture-agnostic, based on the ablation experiments above the authors prefer the encoder-decoder architecture, and more importantly, encoder-decoder models have intrinsic sparsity.

  Task setting:

  • Text generation: summarization and data-to-text generation. Datasets: CNN/DailyMail, XSUM, MultiNews, SAMSum, WebNLG, E2E, CommonGen
  • Human-evaluated text generation: aNLG, ARC-DA, WMT19, XSUM
  • Understanding, classification, question answering: RACE, QASC, OpenBookQA, TweetQA, QuAIL, IMDB, AGNews, DocNLI, Adversarial NLI, VitaminC, Civil Comments and Wikipedia toxicity detection
  • Commonsense reasoning: HellaSwag, SocialQA/SIQA, PhysicalQA/PIQA, CosmosQA, AbductiveNLI, CommonsenseQA, CommonsenseQA2
  • Long-range reasoning: SCROLLS benchmark (GovReport, SumScr, QMSum, QASPER, NarrativeQA, QuALITY, ContractNLI)
  • Structured knowledge grounding: UnifiedSKG (WikiTQ, CompWQ, FetaQA, HybridQA, WikiSQL, TabFat, Feverous, SQA, MTOP, DART)
  • Information retrieval: Natural Questions

  What is interesting is that for information retrieval, the author follows the experimental setup of DSI [2], which is essentially text-to-docid retrieval.

  Evaluation results:

  1. Tradeoffs between Finetuning and Prompt-based Zero-shot Learning (SuperGLUE)


  2. Generative Few-shot: XSUM Summarization

  3. Summary of UL20B results compared to state-of-the-art



Summary

  Prompts are mainly suited to three scenarios: low resource, low compute, and unification. I also shared an idea on Zhihu: prompts can, to some extent, specialize or modularize a model, and this connects naturally to Mixture-of-Experts. This paper's use of paradigm prompts for denoiser mode switching is further inspiration along that line. Beyond the mixture of denoisers, there may be an even grander picture.

  In addition, deploying a separate model for each downstream task has always been resource-intensive, so a unified black box is inevitable. Although GPT-3/T0 [3] offer ways to attack this problem through instructions/prompts or in-context learning, there is still a long way to go before they truly beat task-specific fine-tuning. I hope that, starting from this paper, this key deployment problem can eventually be solved.

  This blog records my learning journey and shares the latest technology. Thank you very much for reading; it will be updated continuously, and I hope it helps you technically.


【Large Model & NLP & Algorithm】Column

Nearly 200 papers and 300 markdown notes written by the blogger. Subscribe to the [Large Model & NLP & Algorithm] column, or go to https://github.com/wjn1996/LLMs-NLP-Algo to get all of the following materials:

  • Machine learning & deep learning fundamentals and advanced materials (notes, PPT, code)
  • NLP fundamentals and advanced materials (notes, PPT, code)
  • A complete large-model series: pre-trained language model foundations, knowledge pre-training, large-model overviews, large-model training and optimization, large-model tuning, ChatGPT-style reproduction and applications, etc.
  • Big-tech algorithm coding problems


Original post: https://blog.csdn.net/qq_36426650/article/details/131612303