[Natural Language Processing] [Large Model] BLOOM: A multilingual model with 176B parameters and open access

BLOOM: A 176B-parameter, open-access multilingual model
《BLOOM: A 176B-Parameter Open-Access Multilingual Language Model》

Paper address: https://arxiv.org/pdf/2211.05100.pdf

Related blog
[Natural Language Processing] [Large Model] BLOOM model structure source code analysis (stand-alone version)
[Natural Language Processing] [Large Model] Very low resource fine-tuning large model method LoRA and BLOOM-LORA implementation code
[Natural Language Processing] [Large Model] DeepMind's large model Gopher
[Natural Language Processing] [Large Model] Chinchilla: a large language model with optimal training and computing utilization
[Natural Language Processing] [Large Model] Large language model BLOOM reasoning tool test
[Natural Language Processing] [Large Model] GLM-130B: an open-source bilingual pre-trained language model
[Natural Language Processing] [Large Model] Introduction to 8-bit matrix multiplication for large Transformers
[Natural Language Processing] [Large Model] BLOOM: a 176B-parameter, open-access multilingual model
[Natural Language Processing] [Large Model] PaLM: A large language model based on Pathways
[Natural Language Processing] [chatGPT series] Large language models can improve themselves
[Natural Language Processing] [ChatGPT Series] WebGPT: Browser-assisted question answering with human feedback
[Natural Language Processing] [ChatGPT Series] FLAN: Fine-tuned language models are zero-shot learners
[Natural Language Processing] [ChatGPT Series] Where does the intelligence of ChatGPT come from?
[Natural Language Processing] [ChatGPT Series] Emergence of Large Models

1. Introduction

Pretrained language models have become a cornerstone of modern natural language processing pipelines because they yield better results from small amounts of labeled data. With the development of ELMo, ULMFiT, GPT, and BERT, the paradigm of fine-tuning pretrained models on downstream tasks became widespread. Pretrained language models were subsequently found to perform useful tasks without any additional training at all. In addition, the empirical observation that language model performance increases (sometimes predictably, sometimes abruptly) with model size has driven a trend toward ever larger models. In practice, however, the cost of training a large language model (LLM) can only be afforded by resource-rich organizations, and until recently most LLMs were not publicly released. As a result, most of the research community has been excluded from the development of LLMs. This exclusion has concrete consequences: for example, most LLMs are trained primarily on English text.

To address these problems, we present BLOOM, the BigScience Large Open-science Open-access Multilingual Language Model. BLOOM is a 176-billion-parameter language model trained on 46 natural languages and 13 programming languages, developed and released by hundreds of researchers. The compute used to train BLOOM came from a French public grant through GENCI and IDRIS, using the Jean Zay supercomputer at IDRIS. To build BLOOM, every component was designed carefully, including the training data, the model architecture and training objective, and the engineering strategy for distributed training. We also analyzed the capabilities of the model. Our overall goal is not only to publicly release a large-scale multilingual language model comparable to recently developed systems, but also to document the coordinated process of its development.

2. BLOOM


1. Training data

[Figure 3 and Table 1: overview and language composition of the ROOTS corpus]
BLOOM is trained on ROOTS, a corpus assembled from 498 Hugging Face datasets, amounting to 1.61 TB of text spanning 46 natural languages and 13 programming languages. Figure 3 above gives a high-level overview of the corpus, while Table 1 details each language along with its genus, family, and macroarea. Beyond producing the corpus, this process also led to the development and release of a number of organizational and technical tools.

1.1 Data Management

Large text corpora are created by and about people. Different people or institutions may "legally" own the data; this is data ownership. As machine learning developers collect and collate such data into ever larger datasets, it becomes increasingly important to consider the stakeholders involved in its development: developers, data subjects, and rights holders.

BigScience aims to address these issues by combining technical, legal, and sociological expertise. The effort focuses on two goals on two different timescales: designing a long-term international data governance structure that prioritizes data rights holders, and providing concrete recommendations for the data used directly in the BigScience project. Progress toward the first goal is presented in the work of Jernite et al., which further motivates the need for data governance and describes a network of data custodians, rights holders, and other actors. The interactions among these actors are designed to account for the privacy, intellectual-property, and user rights attached to the data and algorithms. In particular, the approach relies on structured agreements between data providers and data hosts that specify what the data may be used for.

Although it was impossible to establish a full international organization in the relatively short time between the project's start and the model's training, we worked to draw lessons from the process: (1) BigScience sought explicit permission for data use from data providers wherever possible; (2) individual sources were kept separate and traceable until the final stages of preprocessing; and (3) each source making up the corpus was released in a composable way, promoting reuse and follow-up research. The ROOTS corpus can be accessed and visualized through the "BigScience Data" organization on Hugging Face.

1.2 Data sources

After settling on a data management strategy, the next step was to determine the composition of the training languages. This phase was driven by several goals that are inherently in tension: building a language model accessible to as many people in the world as possible, having enough expertise in each language to curate a dataset of comparable quality to previous ones, improving documentation standards, and respecting the rights of data and algorithm subjects.

  • language choice

    Based on these considerations, we adopted an incremental approach to selecting the languages included in the corpus. We started from a list of the eight most widely spoken languages in the world, actively promoted them early in the project, and invited fluent speakers of those languages to join. Then, at the community's suggestion, Swahili from the original selection was expanded to the Niger-Congo language family, and Hindi and Urdu to the Indic languages. Finally, we proposed that any language with more than 3 fluent speakers participating could be added to the supported list.

  • source selection

    The largest portion of the corpus was curated by workshop participants and research collectives, who jointly compiled the "BigScience Catalogue", a large list of processed and unprocessed sources covering many languages. This took the form of hackathons organized together with communities such as Machine Learning Tokyo, Masakhane, and LatinX in AI. In addition to these sources, other working-group participants compiled language-specific resources, such as the Arabic-focused Masader repository. This bottom-up approach identified a total of 252 sources, with at least 21 per language. Additionally, to broaden the geographic coverage of the Spanish, Chinese, French, and English resources, participants identified locally relevant websites in those languages, which were added to the corpus via a pseudo-crawl.

  • GitHub code

    The programming-language data in the catalogue was further supplemented with the GitHub dataset on Google's BigQuery, which was then deduplicated by exact match (a minimal deduplication sketch appears after this list).

  • OSCAR

    So as not to deviate from the standard practice of using web data as a pretraining source, and to meet the data-volume requirements implied by the compute budget for a model of BLOOM's size, we additionally used OSCAR version 21.09, corresponding to the February 2021 Common Crawl snapshot, as a data source; it accounts for 38% of the final corpus.
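As a rough illustration of the exact-match deduplication applied to the GitHub data, the minimal sketch below drops byte-identical documents by hashing their contents; the normalization and data layout are assumptions for illustration, not BigScience's actual pipeline.

```python
import hashlib

def exact_dedup(documents):
    """Keep only the first occurrence of each byte-identical document."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

# exact_dedup(["print('hi')", "print('hi')", "x = 1"]) keeps 2 of the 3 documents
```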

1.3 Data preprocessing

[Figure 2: overview of the pipeline used to build the ROOTS corpus]

After identifying the data sources, data processing involved several curation steps. Figure 2 above gives an overall view of the pipeline used to build ROOTS. All tools developed in this process are available on GitHub.

  • get source data

    The first step was obtaining text from the identified data sources. This involved downloading and extracting text fields from NLP datasets in various formats, scraping and processing a large number of PDF files from archives, and extracting and preprocessing text from 192 websites in the catalogue plus an additional 456 geographically diverse websites selected by members of the data working group. The latter required developing new tools to extract text from the HTML inside Common Crawl WARC files. We were able to locate and extract usable text data from the URLs of 539 of the websites.

  • quality filter

    After obtaining the text, we found that most sources contained a substantial amount of non-natural language, such as preprocessing errors, SEO pages, or spam. To filter out non-natural language, we defined a set of quality indicators, where high-quality text is defined as "written by humans for humans", without any a priori judgment of content or grammar. Importantly, these indicators were adapted to the needs of each source in two main ways. First, their parameters, such as thresholds and lists of supported terms, were chosen individually by fluent speakers of each language. Second, we examined each individual source to determine which indicators were most likely to identify non-natural language there. Both processes were supported by tools for visualizing their impact.

  • Deduplication and privacy redaction

    Finally, we applied two deduplication steps to remove near-duplicate documents, and we redacted personally identifiable information identified in the OSCAR corpus. Because OSCAR was considered the source with the highest privacy risk, we used regular-expression-based redaction, even though the expressions produce some false positives (see the sketch below).
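To make the regular-expression redaction step concrete, here is a minimal, hypothetical sketch in the same spirit; the patterns below (emails and long digit runs) are illustrative stand-ins, not the expressions actually used on OSCAR, and as noted above such rules do produce false positives.

```python
import re

# illustrative patterns only; the real pipeline used a larger, per-category set of expressions
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ID_NUMBER": re.compile(r"\b\d{9,}\b"),  # long digit runs that look like identifiers
}

def redact(text: str) -> str:
    """Replace every match of each pattern with a placeholder token."""
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{name}>", text)
    return text

print(redact("Contact jane.doe@example.com, ref 123456789012"))
# -> "Contact <EMAIL>, ref <ID_NUMBER>"
```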

1.4 Prompted Dataset

[Figure 4: language distribution of the xP3 prompted dataset]

Multitask prompted fine-tuning (also known as instruction tuning) fine-tunes a pretrained language model on a mixture of a large number of different tasks expressed through natural-language prompts. T0 demonstrated that models fine-tuned on a multitask mixture of prompted datasets have strong zero-shot generalization, and that such models outperform language models that are an order of magnitude larger but lack this fine-tuning. Motivated by these results, we explored using existing natural-language datasets for multitask prompted fine-tuning.

T0 was trained on a subset of the Public Pool of Prompts (P3), a collection of prompts for a large number of existing, open-source natural-language datasets. The collection was created through a series of hackathons involving BigScience collaborators, in which participants wrote 2000+ prompts for 170+ datasets. The datasets in P3 cover a variety of natural-language tasks, including sentiment analysis, question answering, and natural-language inference, and exclude harmful content and non-natural language. PromptSource, an open-source toolkit, facilitates creating, sharing, and using these natural-language prompts.

After pretraining BLOOM, we applied the same large-scale multitask fine-tuning recipe to equip BLOOM with multilingual zero-shot task generalization; we call the resulting model BLOOMZ. To train BLOOMZ, we extended P3 with new datasets in non-English languages and new tasks such as translation. This yielded xP3, a collection of 83 datasets covering 46 languages and 16 tasks. As shown in Figure 4 above, xP3 mirrors the language distribution of ROOTS. xP3 contains both cross-lingual and monolingual tasks. We used PromptSource to collect these prompts, adding extra metadata such as the input and target languages. To study the importance of multilingual prompts, we also machine-translated the English prompts in xP3 into the respective dataset languages, producing a collection called xP3mt.
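As an illustration of how prompts like those in P3/xP3 are applied to a dataset example, the sketch below uses the promptsource library to render an example into an (input, target) text pair; the dataset and the choice of template here are examples for illustration, not necessarily those used for BLOOMZ.

```python
from datasets import load_dataset
from promptsource.templates import DatasetTemplates

# one SuperGLUE RTE example and the community-written prompt templates for that dataset
example = load_dataset("super_glue", "rte", split="validation")[0]
templates = DatasetTemplates("super_glue", "rte")

template = templates[templates.all_template_names[0]]  # pick the first template
prompt_input, target = template.apply(example)          # render to (input, target) strings
print(prompt_input)
print("->", target)
```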

2. Model Architecture

[Figure 5: the BLOOM architecture]

2.1 Design method

The space of possible architectures is enormous and impossible to explore exhaustively. One option would be to exactly replicate the architecture of an existing large model. On the other hand, much of the work on improving existing architectures has seen little adoption, and adopting some of these recommended practices could yield a markedly better model. We took a middle ground: we chose model families that had been shown to scale well and that had reasonable support in publicly available tools and codebases, and we ran ablations over the model's components and hyperparameters, seeking to make the best use of our final compute budget.

  • Ablation Experimental Design

    The main appeal of LLMs is their ability to perform tasks in a "zero/few-shot" way: sufficiently large models can perform new tasks from in-context instructions and examples alone, without being trained on supervised samples. Because fine-tuning a 100B+ model is unwieldy, we evaluated architectural decisions by their zero-shot generalization and did not consider transfer learning. Specifically, we measured zero-shot performance on two sets of tasks: 29 tasks from the EleutherAI Language Model Evaluation Harness (EAI-Eval) and 9 tasks from the validation set of T0 (T0-Eval). The overlap between the two is large: only one task in T0-Eval is not in EAI-Eval, although all the prompts differ between them.

    In addition, the ablations were performed with smaller models: a 6.7B model for the pretraining-objective ablations and a 1.3B model for the ablations over positional embeddings, activation functions, and layer normalization. Recently, Dettmers et al. identified a phase transition in models larger than 6.7B, at which "outlier features" emerge, which raises the question of whether results at the 1.3B scale can be extrapolated to the final model size.

  • Architectures out of scope

    We did not consider mixture-of-experts (MoE) models, due to the lack of widely used GPU-based codebases suitable for training them at scale. Likewise, we did not consider state-space models, which performed poorly on natural-language tasks at the time BLOOM was designed. Both approaches are promising and have since shown competitive results, at large scale for MoE and at smaller scale for state-space models with H3.

2.2 Architecture and pre-training objectives

Although most modern language models are based on the Transformer architecture, implementations differ significantly. The original Transformer uses an encoder-decoder architecture, while many popular models adopt encoder-only or decoder-only variants. Currently, all state-of-the-art models with more than 100B parameters are decoder-only. This runs counter to the findings of Raffel et al., in which encoder-decoder models significantly outperform decoder-only models for transfer learning.

Prior to our work, the literature lacked a systematic evaluation of the zero-shot generalization of different architectures and pretraining objectives. We explored this question in Wang et al. (2022a), which studies encoder-decoder and decoder-only architectures and their interaction with causal, prefix, and masked language-modeling pretraining objectives. Our results show that, immediately after pretraining, a causal decoder-only model performs best, validating the choice made by state-of-the-art LLMs.

2.3 Modeling details

Beyond the choice of architecture and pretraining objective, many modifications to the original Transformer architecture have been proposed, such as alternative positional embedding schemes or novel activation functions. We ran a series of experiments evaluating each modification on a causal decoder-only model in Le Scao et al., and adopted two deviations in BLOOM:

  • ALiBi position embedding

    Rather than adding positional information in the embedding layer, ALiBi directly attenuates the attention scores based on the distance between keys and queries (a minimal sketch appears after this list). Although ALiBi was originally motivated by its ability to extrapolate to longer sequences, we found that it also led to smoother training and better downstream performance even at the original sequence length, outperforming both learned and rotary embeddings.

  • Embedding LayerNorm

    In an initial run training a 104B-parameter model, we experimented with an additional layer normalization immediately after the embedding layer, as recommended by the bitsandbytes library and its StableEmbedding layer, and found that it significantly improved training stability. Even though Le Scao et al. found that it carries a penalty in zero-shot generalization, we added this extra layer normalization after BLOOM's first embedding layer to avoid training instabilities. Note that the preliminary 104B experiments used float16 while the final training used bfloat16; since float16 has been identified as responsible for many of the instabilities observed when training LLMs, bfloat16 may alleviate the need for the embedding LayerNorm.

    The entire architecture of BLOOM is shown in Figure 5 above.
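A minimal sketch of the ALiBi bias mentioned above: each attention head is assigned a fixed slope, and a penalty proportional to the key-query distance is added to the attention scores before the softmax. The slope formula below assumes the head count is a power of two, as in the original ALiBi paper; this is an illustration, not BLOOM's actual implementation.

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Additive (num_heads, seq_len, seq_len) bias for causal attention scores."""
    # geometric head-specific slopes: 2^(-8/n), 2^(-16/n), ... for n heads (n a power of two)
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = torch.arange(seq_len)
    distance = positions.view(1, -1) - positions.view(-1, 1)  # (j - i): <= 0 for keys to the left
    return slopes.view(-1, 1, 1) * distance                   # added to the attention scores before the softmax

bias = alibi_bias(num_heads=8, seq_len=4)  # keys farther to the left get a larger negative bias
```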

3. Tokenization

Design choices for tokenizers are often neglected in favor of "default" settings. For example, OPT and GPT-3 both use GPT-2's tokenizer, which was trained for English. Because BLOOM's training data is so diverse, careful design choices are needed to ensure the tokenizer encodes sentences in a lossless manner.

3.1 Verification

We compare the fertility (Ács, 2019) of our tokenizer with that of existing monolingual tokenizers as a sanity check. Fertility is the number of subwords the tokenizer produces per word or per dataset, and we measure it on Universal Dependencies 2.9 and the OSCAR subsets of the languages of interest. A fertility much higher than that of a monolingual tokenizer for a given language can indicate degraded downstream multilingual performance. Our goal was that, compared with the corresponding monolingual tokenizer, our multilingual tokenizer should not degrade fertility by more than 10 percentage points for any language. All experiments used the Hugging Face Tokenizers library to design and train the candidate tokenizers.
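To make the fertility measurement concrete, here is a minimal sketch; whitespace splitting is used as a stand-in for proper word segmentation (an assumption for simplicity, since the actual measurement uses Universal Dependencies 2.9).

```python
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Average number of subword tokens produced per whitespace-separated word."""
    words = sum(len(t.split()) for t in texts)
    subwords = sum(len(tokenizer.tokenize(t)) for t in texts)
    return subwords / words

tok = AutoTokenizer.from_pretrained("bigscience/bloom")        # BLOOM's multilingual tokenizer
sample = ["Le modèle est entraîné sur un corpus multilingue."]
print(f"fertility: {fertility(tok, sample):.2f}")              # compare against a monolingual tokenizer
```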

3.2 tokenizer training data

We initially used a non-deduplicated subset of ROOTS. However, a qualitative study of the tokenizer's vocabulary revealed problems in the training data: for example, in earlier versions of the tokenizer we found entire URLs stored as tokens, caused by a handful of documents containing many duplicated lines. This motivated us to remove duplicated lines from the tokenizer training data.

3.3 Vocabulary size

A larger vocabulary reduces the risk of over-segmenting some sentences, especially for low-resource languages. We ran validation experiments with 150k and 250k vocabulary sizes to allow comparison with the existing multilingual modeling literature, and ultimately settled on a vocabulary of 250k tokens to reach our initial fertility goal relative to monolingual tokenizers. Since the vocabulary size determines the size of the embedding matrix, the embedding size had to be divisible by 128 for GPU efficiency and by 4 to enable tensor parallelism. We ended up with a vocabulary size of 250,680, with 200 tokens reserved for future applications such as removing private information with placeholder tokens.

3.4 Byte-level BPE

The tokenizer is a learned subword tokenizer trained with the Byte Pair Encoding (BPE) algorithm. To avoid losing information during tokenization, the tokenizer builds its merges starting from bytes rather than characters as the smallest units. This way, tokenization can never produce unknown tokens, since all 256 bytes can be included in the tokenizer's vocabulary. In addition, byte-level BPE maximizes vocabulary sharing between languages.
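A minimal sketch of training a byte-level BPE tokenizer with the Hugging Face Tokenizers library mentioned above; the corpus file, vocabulary size, and special tokens are placeholders rather than BLOOM's actual settings.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
# byte-level pre-tokenization: merges start from the 256 possible byte values,
# so no input can ever map to an unknown token
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=250_680, special_tokens=["<pad>", "<s>", "</s>"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)   # placeholder corpus file
print(tokenizer.encode("def hello():\n    return '世界'").tokens)
```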

3.5 Normalization

Upstream of the BPE algorithm, the text was not normalized, in order to keep the model as general as possible. Adding Unicode normalization such as NFKC would not have reduced fertility by more than 0.8% in any of our cases, but it would have come at the cost of making the model less general, for example by causing 2² and 22 to be encoded identically.

3.6 Pre-tokenizer

Our pre-tokenization has two goals: producing a first partition of the text and limiting the maximum length of the token sequences produced by the BPE algorithm. The pre-tokenization rule uses the following regular expression: ?[^(\S|[.,!?...。,、|_])]+, which splits on words while preserving all characters, in particular the whitespace and newline sequences that are crucial for programming languages. We did not use the English-centric splits common in other tokenizers, and we also did not split numbers, a practice that had caused issues with Arabic and code.

4. Engineering

4.1 Hardware

The model was trained on Jean Zay, a French government-funded supercomputer owned by GENCI and operated at IDRIS, the national computing center of the French National Center for Scientific Research (CNRS). Training BLOOM took about 3.5 months and consumed 1,082,990 compute hours. Training ran on 48 nodes, each with 8 NVIDIA A100 80GB GPUs (384 GPUs in total); we also kept 4 spare nodes in reserve because of possible hardware failures during training. The nodes were equipped with 2x AMD EPYC 7543 32-core CPUs and 512 GB of RAM, while storage was a SpectrumScale (GPFS) parallel file system mixing flash and hard disk drives.

4.2 Framework

BLOOM was trained with Megatron-DeepSpeed, a framework for large-scale distributed training. It consists of two parts: Megatron-LM provides the Transformer implementation, tensor parallelism, and data-loading primitives, while DeepSpeed provides the ZeRO optimizer, pipeline parallelism, and general distributed-training components. This framework allowed us to train efficiently with 3D parallelism, a fusion of three complementary distributed deep-learning approaches, described below:

[Figure: 3D parallelism combining data, tensor, and pipeline parallelism]

  • Data parallelism (Data parallelism, DP)

    The model is replicated across multiple devices, one copy per device, and the input data is sharded across the copies. The copies run in parallel and are synchronized at the end of each training step.

  • Tensor parallelism (TP)

    Individual layers of the model are split across multiple devices: instead of holding a whole activation or gradient tensor on a single GPU, each GPU holds only a shard of that tensor. This technique is sometimes called horizontal parallelism or intra-model parallelism.

  • Pipeline parallelism (PP)

    The layers of the model are distributed across multiple GPUs, so that each GPU holds only a fraction of the model's layers. This is sometimes called vertical parallelism.

Finally, the Zero Redundancy Optimizer (ZeRO) allows different processes to hold only a slice of the data (parameters, gradients, and optimizer state) required for a training step. We used ZeRO stage 1, meaning only the optimizer state is sharded in this way.

The combination of the four components described above scales to hundreds of GPUs with very high GPU utilization. We achieved 156 TFLOPs per A100 GPU in our fastest configuration, half of the theoretical peak of 312 TFLOPs.
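For illustration, a DeepSpeed-style configuration enabling ZeRO stage 1 and bfloat16 might look like the sketch below, expressed as a Python dict; the batch-size values are placeholders, and BLOOM's actual Megatron-DeepSpeed launch options are more involved.

```python
# hedged sketch of a DeepSpeed configuration for ZeRO stage 1 + bf16; values are illustrative
deepspeed_config = {
    "train_micro_batch_size_per_gpu": 2,   # placeholder
    "gradient_accumulation_steps": 128,    # placeholder
    "zero_optimization": {"stage": 1},     # shard only the optimizer state (ZeRO stage 1)
    "bf16": {"enabled": True},             # bfloat16 mixed precision
    "gradient_clipping": 1.0,
}
# tensor- and pipeline-parallel degrees are configured separately on the Megatron side
# (e.g. the --tensor-model-parallel-size and --pipeline-model-parallel-size arguments).
```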

4.3 Floating point format

In preliminary experiments with a 104B-parameter model on NVIDIA V100 GPUs, we observed numerical instabilities that caused irreversible training divergence. We hypothesize that these instabilities stemmed from our original use of IEEE float16, a 16-bit floating-point format with a very limited dynamic range that can overflow. We eventually moved to the bfloat16 format, which has the same dynamic range as float32. Its precision, however, is much lower, which motivated the use of mixed-precision training: precision-sensitive operations such as gradient accumulation and softmax are performed in float32, while the remaining operations use the lower precision, balancing high performance against training stability. In the end we ran the final training in bfloat16 mixed precision, which proved to resolve the training instabilities.
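As a small single-GPU illustration of this mixed-precision idea (not BLOOM's actual training loop, which relies on Megatron-DeepSpeed's own mixed-precision engine), PyTorch autocast runs matrix multiplications in bfloat16 while keeping the parameters, gradients, and precision-sensitive reductions in float32:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()            # parameters stay in float32
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(16, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)                                     # the matmul runs in bfloat16
    loss = out.float().pow(2).mean()                   # reduction kept in float32
loss.backward()                                        # gradients are float32, like the parameters
optimizer.step()
```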

4.4 Fusion of CUDA cores

In general, a GPU cannot perform computation at the same time as it is fetching the data for it. Moreover, the compute throughput of modern GPUs is far higher than the memory bandwidth required by each operation (called a kernel in GPU programming). Kernel fusion is a GPU optimization that performs several consecutive operations in a single kernel call. It minimizes data transfers: intermediate results stay in GPU registers instead of being copied back to VRAM, saving overhead.

We used several custom fused CUDA kernels provided by Megatron-LM. First, an optimized kernel performs LayerNorm, and further kernels fuse various combinations of scaling, masking, and softmax operations. The addition of the bias term to the GeLU activation is fused using PyTorch's JIT. As an example of the benefit of fused kernels, adding the bias term to the GeLU operation adds no extra time, because the operation is memory-bound: the extra computation is negligible compared with the data transfer between GPU VRAM and registers, so fusing the two operations substantially reduces their running time.
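In the spirit of the fused bias-GeLU just described, the sketch below uses TorchScript to compile the bias addition and a tanh-approximated GeLU into a single fused operation so the intermediate sum never round-trips through GPU memory; this is illustrative rather than Megatron-LM's exact kernel.

```python
import torch

@torch.jit.script
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    """Bias add + GeLU (tanh approximation) fused by the TorchScript JIT."""
    y = x + bias
    return 0.5 * y * (1.0 + torch.tanh(0.7978845608 * (y + 0.044715 * y * y * y)))

x = torch.randn(8, 4096)
bias = torch.zeros(4096)
out = fused_bias_gelu(x, bias)   # one fused call instead of separate bias-add and GeLU kernels
```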

4.5 Additional challenges

Scaling to 384 GPUs required two modifications: disabling asynchronous CUDA kernel launches (to facilitate debugging and prevent deadlocks), and dividing parameter groups into smaller subgroups (to avoid excessive CPU memory allocation).

During training we faced hardware failures: on average, one or two GPUs failed each week. Since backup nodes were available and used automatically, and checkpoints were saved every three hours, this did not significantly affect training throughput. A PyTorch deadlock bug in the data loader and disk-space failures caused downtimes of 5 to 10 hours. Given that engineering problems were relatively sparse, and that the model recovered quickly from its only loss spike, less human intervention was needed than in comparable projects.

5. Training

[Table 3: architecture and training hyperparameters of the BLOOM models]

  • pre-trained model

    We trained six size variants of BLOOM with the hyperparameters detailed in Table 3 above. The architectures and hyperparameters derive from our experimental results (Le Scao et al.) and from prior trainings of large language models (Brown et al.). The depths and widths of the non-176B models roughly follow the previous literature (Brown et al.); the 3B and 7.1B variants deviate only to fit more easily into our training setup. Because of the larger multilingual vocabulary, BLOOM's embedding parameter counts are larger. During the development of the 104B-parameter model, we experimented with different values of the Adam β parameters, weight decay, and gradient clipping to improve stability, but none of this was found to help. For all models, we use a cosine learning-rate decay schedule over 410B tokens, taken as an upper bound on the training length if compute had allowed it, with warmup over 375M tokens (see the sketch after this list). We use weight decay and gradient clipping, and no dropout. The ROOTS dataset contains about 341B tokens of text; however, based on revised scaling laws published during training, we decided to train the large model on an additional 25B tokens of repeated data. Since warmup tokens + decay tokens exceeded the total number of tokens, the learning-rate decay never reached its end.

  • Multitask fine-tuning

    The fine-tuned BLOOMZ models keep the same architectural hyperparameters as the BLOOM models. The fine-tuning hyperparameters are loosely based on T0 and FLAN: the learning rate is the minimum learning rate of the corresponding pretrained model, doubled and then rounded, and for the smaller variants the global batch size is multiplied by 4 to increase throughput. The models are fine-tuned for 13B tokens, and the best checkpoint is selected on a separate validation set; performance plateaus after 1-6B tokens of fine-tuning.

  • Contrastive fine-tuning

    We also used the SGPT Bi-Encoder recipe to contrastively fine-tune the 1.3B- and 7.1B-parameter BLOOM models, producing models that yield high-quality text embeddings: SGPT-BLOOM-7.1B-msmarco, geared toward multilingual information retrieval, and SGPT-BLOOM-1.7B-nli for multilingual semantic similarity. Recent benchmarks have found that such models also generalize to various other embedding tasks, such as bitext mining, reranking, or feature extraction for downstream classification.
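A sketch of the learning-rate schedule described in the pretraining item above: linear warmup over 375M tokens followed by cosine decay toward a minimum learning rate over a 410B-token horizon (the two token budgets come from the text; the peak and minimum learning-rate values below are placeholders).

```python
import math

def learning_rate(tokens_seen: float,
                  peak_lr: float = 6e-5,           # placeholder peak learning rate
                  min_lr: float = 6e-6,            # placeholder minimum learning rate
                  warmup_tokens: float = 375e6,    # warmup budget quoted in the text
                  decay_tokens: float = 410e9) -> float:
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens                  # linear warmup
    progress = min(1.0, (tokens_seen - warmup_tokens) / (decay_tokens - warmup_tokens))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# as noted above, training ends before 410B tokens, so the decay never reaches its end
print(learning_rate(350e9))
```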

6. Release

Openness has been at the heart of BLOOM's development, and we want to make sure it is easy for the community to use.

6.1 Model Card

Following best practices for releasing machine learning models, the BLOOM model is released together with a detailed Model Card describing its technical specifications, training details, intended uses, out-of-scope uses, and limitations. Participants from across the working groups worked together to produce the final Model Card and the card for each checkpoint.

6.2 Licensing

Given the potentially harmful use cases BLOOM could enable, we chose to strike a balance between unrestricted open access and responsible use, including behavioral-use terms that limit the application of the model to potentially harmful use cases. Such terms typically appear in "Responsible AI Licenses (RAIL)", which the community has increasingly adopted when releasing models. A notable difference of the RAIL license adopted for BLOOM is that it separates the "source code" and the "model". It also includes detailed definitions of model "use" and "derived works" to ensure that downstream use via prompting, fine-tuning, distillation, use of logits, and probability distributions is clearly covered. The license contains 13 behavioral use restrictions, drawn from the intended uses and limitations described in the BLOOM Model Card as well as the BigScience Code of Ethics. The model is provided free of charge, and users may use it freely as long as they comply with the terms. BLOOM's source code is available under the Apache 2.0 open-source license.

3. Evaluation

The evaluation focuses on the zero-shot and few-shot settings. Our goal is to present an accurate picture of BLOOM compared to existing LLMs. Due to the size of these models, prompt-based approaches and few-shot "in-context learning" are more common than fine-tuning.

1. Experimental design

1.1 Prompts

[Table 5: example prompts for the WMT'14 tasks]

Based on recent research on the impact of prompting on language-model performance, we decided to build a language-model evaluation suite that lets us vary both the underlying task data and the prompts used to "contextualize" the task. Our prompts were developed before BLOOM's release, without any prior refinement using the model. Our goal in designing prompts this way is to simulate the zero-shot or one-shot results a new user would obtain from BLOOM.

We use promptsource to generate multiple prompts for each task. Following the procedure of Sanh et al. (2022), the prompts were generated by crowdsourcing, so we obtain prompts of varying length and style. To improve quality and clarity, each prompt underwent multiple peer reviews.

Table 5 above shows some final prompts for WMT'14 tasks. Due to resource constraints, we also generate prompts for many tasks not included in this article. All prompts for all tasks are publicly accessible.

1.2 Infrastructure

Our framework extends EleutherAI's Language Model Evaluation Harness by integrating the promptsource library. We released the Prompted Language Model Evaluation Harness as an open source library for people to use. We use this framework to run experiments and aggregate results.
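As an illustration of how such a harness is typically invoked, the snippet below uses EleutherAI's lm-eval library; the model type, task names, and the simple_evaluate signature vary across harness versions, so treat this as a hedged sketch rather than the exact command used for these experiments.

```python
from lm_eval import evaluator

# hedged sketch: argument names follow older lm-eval releases and may differ by version
results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=bigscience/bloom-560m",   # a small BLOOM variant for illustration
    tasks=["rte", "boolq"],
    num_fewshot=0,                                   # zero-shot evaluation
)
print(results["results"])
```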

1.3 Dataset

  • SuperGLUE

    We use a subset of the SuperGLUE classification tasks, specifically Ax-b, Ax-g, BoolQ, CB, WiC, WSC, and RTE. We excluded the remaining tasks because they require an order of magnitude more compute than all of the tasks we consider combined. These tasks are in English, mainly to facilitate comparison with prior work; we also note that zero-shot and one-shot prompted performance on them is not widely reported. The main exception is T0, but that model is instruction-tuned and therefore not directly comparable to BLOOM and OPT. For each task, we randomly sample five prompts from promptsource and evaluate all models on that prompt set.

  • Machine Translation(MT)

    We evaluate BLOOM on three datasets: WMT14 eng↔fre and eng↔hin, Flores-101, and DiaBLa. We evaluate with sacrebleu, an implementation of BLEU (see the sketch after this list). Generation uses greedy decoding until the EOS token; for the 1-shot setting we additionally add \n###\n.

  • Summarization

    We evaluate summarization on the WikiLingua dataset, a multilingual summarization dataset consisting of WikiHow articles paired with step-by-step summaries. Models of a size comparable to BLOOM typically do not report one-shot conditional natural-language generation; PaLM is the first exception, reporting on WikiLingua, but it only examined the model's ability to summarize in English. By contrast, we test BLOOM's multilingual ability by evaluating abstractive summarization in the source language. We focus on nine languages (Arabic, English, Spanish, French, Hindi, Indonesian, Portuguese, Vietnamese, and Chinese) that were targets of the BigScience effort.
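To show how the sacrebleu scoring mentioned in the machine-translation item works, here is a minimal example with hypothetical sentences; the actual evaluation feeds model generations and the WMT'14/Flores-101 references.

```python
import sacrebleu

hypotheses = ["The cat sits on the mat."]           # model outputs (illustrative)
references = [["The cat is sitting on the mat."]]   # one reference stream, parallel to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```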

1.4 Baseline model

  • mGPT: GPT-style models trained on 60 languages;
  • GPT-Neo, GPT-J-6B, GPT-NeoX: a family of GPT-style models trained on the Pile;
  • T0: a T5 variant fine-tuned with multitask prompting on the P3 dataset;
  • OPT: a GPT-style model trained on a mixture of datasets;
  • XGLM: a GPT-style multilingual model trained on a CC100 variant;
  • M2M: a massively multilingual encoder-decoder model trained for translation on parallel data;
  • AlexaTM: an encoder-decoder model trained with masked and causal objectives on Wikipedia and mC4;
  • mTk-Instruct: a T5 variant fine-tuned with multitask prompting on the Super-NaturalInstructions dataset;
  • Codex: a GPT model fine-tuned on code from GitHub;
  • GPT-fr: a GPT-style model trained on French text.

2. Zero-shot results

[Figure 7: zero-shot and one-shot prompt-based performance]

Figure 7 above shows zero-shot performance. On natural-language understanding and generation tasks, we find that the zero-shot performance of pretrained language models is close to random.

2.1 SuperGLUE

On SuperGLUE, although some individual prompts yield performance improvements of around 10 points, average performance across prompts remains close to random. The exception is the T0 model, which shows strong performance; however, that model was fine-tuned in a multitask setting precisely to improve zero-shot prompting, so it is not directly comparable.

2.2 Machine translation

In the zero-shot setting, machine-translation results are very poor. The generations exhibit two main problems: (1) over-generation, and (2) failure to produce the correct language.

3. One-shot results

3.1 SuperGLUE

Figure 7 above also shows one-shot performance. Compared with zero-shot, the variability of one-shot performance across prompts and models on SuperGLUE is reduced. Overall, there is no significant improvement in the one-shot setting: model accuracy remains close to random. We performed an additional analysis of BLOOM across model sizes, measuring the one-shot accuracy of similarly sized OPT models as a baseline. Both the OPT and BLOOM model families improve slightly with scale, and BLOOM-176B surpasses OPT-175B on Ax-b, CB, and WiC.

3.2 Machine translation

[Table 8: 1-shot machine translation results on the Flores-101 devtest set]

In the 1-shot setting, we evaluate a range of language directions on the Flores-101 devtest set using the XGLM prompt, with the 1-shot example randomly selected from the same dataset, which may differ from past work. We break down the results into high-resource language pairs, high-to-mid-resource pairs, low-resource pairs, and the Romance language family; languages are classified as low-, mid-, or high-resource according to their proportion in ROOTS. For high-resource and high-to-mid-resource pairs, we compare against supervised results from the 615M-parameter M2M-124 model. In addition, we compare against XGLM (7.5B) 1-shot results and 32-shot AlexaTM results. **Translation is good both between high-resource languages and from high-resource into mid-resource languages**, indicating BLOOM's strong multilingual ability: in the 1-shot setting, results are often comparable to or better than those of the supervised M2M model, and in many cases comparable to AlexaTM.

**Translation quality is also good for many low-resource languages, comparable to or even better than the supervised M2M model.** However, results between Swahili and Yoruba are quite poor, owing to their scarcity in the BLOOM training data.

3.3 Summarization

[Figure 9: one-shot summarization results of BLOOM and OPT-175B on WikiLingua]

Figure 9 above compares the one-shot results of BLOOM and OPT-175B; each point represents a per-prompt score. The key conclusion is that BLOOM achieves higher performance than OPT on multilingual summarization, and that performance increases with model size. We suspect this is due to BLOOM's multilingual training.

4. Multitask fine-tuning

[Figure 10: zero-shot task generalization of BLOOM and XGLM versus the multitask fine-tuned BLOOMZ, T0, and mTk-Instruct]

Building on recent work on multitask fine-tuning, we explore multilingual multitask fine-tuning to improve the zero-shot performance of BLOOM. We fine-tune BLOOM on the xP3 corpus and find that its zero-shot capabilities increase significantly. In Figure 10 above, we compare the zero-shot performance of the pretrained BLOOM and XGLM models with the multitask fine-tuned BLOOMZ, T0, and mTk-Instruct. BLOOM and XGLM perform near the random baselines, while after multilingual multitask fine-tuning (BLOOMZ) zero-shot performance improves significantly. Despite also being multitask fine-tuned, T0 performs poorly on the multilingual datasets because it is a monolingual English model. However, additional results presented by Muennighoff et al. show that, when controlling for size and architecture, models fine-tuned on the English-only xP3 datasets still outperform T0; this is likely because T0's fine-tuning dataset (P3) contains a less diverse set of datasets and prompts than xP3.

5. Code Generation

[Table 9: code generation results on HumanEval]

BLOOM's pretraining corpus ROOTS contains roughly 11% code. Table 9 above shows BLOOM's results on the HumanEval benchmark. We find that pretrained BLOOM performs about as well as similarly sized GPT models trained on the Pile, which contains English data and about 13% code, a code source and proportion similar to ROOTS. Codex, which is fine-tuned on code alone, is much stronger than the other models. The multitask fine-tuned BLOOMZ model does not improve significantly over BLOOM; we hypothesize that this is because the fine-tuning dataset xP3 does not contain large amounts of pure code.


Origin blog.csdn.net/bqw18744018044/article/details/128908060