【LLM Series BLOOM】BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Paper title: "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model"
Paper link: https://arxiv.org/abs/2211.05100
GitHub link: https://github.com/huggingface/transformers-bloom-inference/tree/main
Hugging Face link: https://huggingface.co/bigscience/bloom

1 Model Introduction

Pre-trained language models have become a cornerstone of modern natural language processing pipelines because they yield better results from small amounts of labeled data. Following ELMo, ULMFiT, GPT, and BERT, the paradigm of fine-tuning a pretrained model on downstream tasks became widely used. Pretrained language models were subsequently found to perform useful tasks without any additional training at all. Furthermore, the empirical observation that performance increases with model size (sometimes predictably, sometimes abruptly) has led to a trend toward ever larger models. As a result, the cost of training a large language model (LLM) can only be afforded by resource-rich organizations, and until recently most LLMs were not publicly released. Consequently, most of the research community has been excluded from the development of LLMs. This exclusion has concrete consequences: for example, most LLMs are trained primarily on English text.

To address these issues, we propose the BigScience Large Open-science Open-access Multilingual Language Model (BLOOM). BLOOM is a 176-billion-parameter language model trained on 46 natural languages and 13 programming languages, developed and released by hundreds of researchers. The compute used to train BLOOM was provided through French public grants from GENCI and IDRIS, using the Jean Zay supercomputer at IDRIS. To build BLOOM, each component was designed in detail, including the training data, the model architecture and training objective, and the engineering strategy for distributed training. We also performed an analysis of the model's capabilities. The overall goal is not only to publicly release a large-scale multilingual language model comparable to recently developed systems, but also to document the coordinated process of its development.

2 BLOOM Training and Datasets

2.1 BigScience

  • BLOOM was developed by BigScience, an open research collaboration whose goal is to publicly release LLMs.
  • More than 1,200 people registered as BigScience participants.

2.2 Training Corpus

  • BLOOM is trained on the ROOTS corpus. The motivation behind ROOTS is to build a language model accessible to as many people in the world as possible, at a scale comparable to previous efforts.
  • In the corpus overview figure, the left panel shows the linguistic makeup of all 46 natural languages, where area is proportional to the number of bytes. Indo-European and Sino-Tibetan languages dominate, with a combined volume of 1321.89 GB. The thin orange area represents 18 GB of Indonesian data, and the 0.4 GB green rectangle constitutes the Niger-Congo subset.
  • The right panel is a waffle chart of the 13 programming languages by number of files, with one square representing approximately 30,000 files.

2.3 xP3: Prompted Dataset

  • Multitask prompted fine-tuning (also known as instruction tuning) involves fine-tuning a pretrained language model on a training mixture composed of a large number of different tasks specified through natural-language prompts.
  • The original P3 dataset is extended with new datasets in languages other than English and new tasks such as translation. The result is xP3, a collection of prompts for 83 datasets covering 46 languages and 16 tasks.

After pre-training BLOOM, this large-scale multitask fine-tuning recipe is applied to give BLOOM multilingual zero-shot task generalization; the resulting model is called BLOOMZ.
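For readers who want to inspect xP3 directly, the sketch below streams one example from the Hugging Face Hub. The repository name, the per-language configuration name ("en"), and the "inputs"/"targets" column names are assumptions taken from the public dataset card; adjust them if the Hub layout has changed.

```python
# Minimal sketch: stream one xP3 example (assumed repo/config/column names).
from datasets import load_dataset

# streaming=True avoids downloading the full (very large) split.
xp3_en = load_dataset("bigscience/xP3", "en", split="train", streaming=True)
sample = next(iter(xp3_en))
print(sample["inputs"][:200])   # the natural-language prompt
print(sample["targets"][:200])  # the expected completion
```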

3 BLOOM model structure and training

3.1 Model structure

  • BLOOM is a causal-decoder Transformer with two deviations from the standard architecture, described below.

  • ALiBi positional embeddings directly decay attention scores according to the distance between keys and queries. Compared with the original sinusoidal embeddings and rotary embeddings, ALiBi leads to smoother training and better downstream performance. ALiBi does not add positional embeddings to the word embeddings; instead, it penalizes the attention score between each query and key with a bias proportional to their distance.

    Concretely, a preset bias matrix is added to the attention scores: when the relative position difference between a query and a key is 1, a bias of -1 is added, a difference of 2 gives -2, and so on. In effect, this assumes that the farther apart two tokens are, the less they contribute to each other.
    Of course, the matrix is not simply added as-is, nor does ALiBi learn multiple sets of biases as the T5 relative-position bias does. The bias matrix itself is identical for every head; the only per-head difference is the coefficient m it is multiplied by, which can be regarded as a slope.
    The m coefficients are preset rather than learned: the authors fix a geometric sequence of slopes according to the number of heads n. For example, with 8 heads the slopes are 1/2, 1/4, ..., 1/256. The authors also experimented with trainable slopes, but found that training them did not bring better properties. A minimal sketch of the bias computation is given after this list. ALiBi paper: https://arxiv.org/pdf/2108.12409.pdf

  • Embedding Layer Norm is used immediately after the first embedding layer to avoid unstable training.

  • A vocabulary of 250,000 tokens is used with a byte-level BPE tokenizer, so tokenization never produces unknown tokens.
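Below is a minimal, self-contained sketch of ALiBi-style biases in plain PyTorch. The slope formula follows the ALiBi paper for power-of-two head counts; this is an illustration, not the actual BLOOM/Megatron kernel.

```python
# Toy ALiBi bias construction (illustrative; not the production BLOOM kernel).
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric sequence of per-head slopes; assumes n_heads is a power of two.
    start = 2 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # distance[i, j] = j - i, so keys to the left of the query get a negative bias.
    positions = torch.arange(seq_len)
    distance = positions[None, :] - positions[:, None]    # (seq, seq)
    slopes = alibi_slopes(n_heads)[:, None, None]          # (heads, 1, 1)
    return slopes * distance                               # (heads, seq, seq)

# The bias is simply added to the raw attention scores before the softmax
# (positions above the diagonal are removed by the causal mask anyway):
#   scores = q @ k.transpose(-1, -2) / sqrt(d_head) + alibi_bias(n_heads, seq_len)
print(alibi_bias(n_heads=8, seq_len=5)[0])  # head 0: zeros on the diagonal, -0.5, -1.0, ... to the left
```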

3.2 Engineering Implementation

  • BLOOM is trained with Megatron-DeepSpeed, which consists of two parts: Megatron-LM provides the Transformer implementation, tensor parallelism, and data-loading primitives, while DeepSpeed provides the ZeRO optimizer, pipeline parallelism, and general distributed-training components.
  • Data parallelism (DP) replicates the model multiple times, with each replica placed on a different device and fed a slice of the data. Processing happens in parallel, and all replicas are synchronized at the end of each training step.
  • Tensor parallelism (TP) partitions individual layers of the model across multiple devices: instead of the whole activation or gradient tensor residing on a single GPU, shards of that tensor are placed on separate GPUs (see the toy illustration after this list).
  • Pipeline parallelism (PP) splits the model's layers across multiple GPUs, so each GPU holds only a fraction of the layers.
  • Training uses bfloat16 mixed precision and fused CUDA kernels.
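The toy sketch below illustrates the core idea of tensor parallelism with a single column-split linear layer. The sizes are made up for illustration; real Megatron-LM kernels additionally handle the backward pass and inter-GPU communication.

```python
# Toy tensor parallelism: split one weight matrix column-wise across two shards,
# compute partial outputs independently, then concatenate (an all-gather in practice).
import torch

torch.manual_seed(0)
d_model, d_ff, n_shards = 8, 16, 2            # illustrative sizes

x = torch.randn(4, d_model)                   # a batch of activations
w = torch.randn(d_model, d_ff)                # full weight of one feed-forward layer

shards = torch.chunk(w, n_shards, dim=1)      # each shard holds d_ff / n_shards output columns
partials = [x @ shard for shard in shards]    # each computed on its own GPU in practice
y_parallel = torch.cat(partials, dim=1)       # gather the output shards

assert torch.allclose(y_parallel, x @ w, atol=1e-6)   # same result as the unsplit layer
```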

3.3 Model variants

  • BLOOM is released in six variants with different parameter counts.
  • BLOOM consumes slightly more energy than OPT, but emits about two-thirds less CO2 (25 tonnes vs. 70 tonnes). This is thanks to the low carbon intensity of the energy grid used to train BLOOM, which emits 57 gCO2eq/kWh.

  • Both BLOOM and OPT produce significantly less carbon emissions than GPT-3, which can be attributed to several factors, including more efficient hardware and less carbon-intensive energy sources.

3.4 Prompt Learning


The prompts were developed before BLOOM's release and without any prior refinement using the model; the paper shows several example prompts for machine translation (MT).
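As a concrete illustration, the snippet below runs a zero-shot MT-style prompt through a small public BLOOM checkpoint with the transformers pipeline. The prompt wording is illustrative rather than the exact template from the paper, and the 560M checkpoint is chosen only so the example runs on modest hardware.

```python
# Illustrative zero-shot MT prompt on a small BLOOM checkpoint (not the paper's exact template).
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom-560m")
prompt = "Translate the following sentence to French: 'The cat sleeps on the mat.'\nTranslation:"
out = generator(prompt, max_new_tokens=30, do_sample=False)
print(out[0]["generated_text"])
```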

4 Experimental results

4.1 Zero-Shot Performance


Average performance across prompts always hovers around chance. The exception is the T0 model, which shows strong performance; however, that model was fine-tuned in a multitask setting and cannot be compared directly.


In the zero-shot setting, MT results are generally poor. The two main problems observed are (i) over-generation and (ii) failure to produce output in the correct language.

4.2 One-Shot Performance


One-shot performance variability on SuperGLUE is reduced across all prompts and models.
Overall, the one-shot setting brings no significant improvement: average model accuracy is still almost always at chance level.


Both the OPT and BLOOM model families improved slightly with increasing size, and there were no consistent differences between the families across all tasks. BLOOM-176B outperforms OPT-175B on Ax-b, CB and WiC.


Translation quality for many low-resource languages is comparable to, or even slightly better than, supervised M2M models.

4.3 Text summarization


BLOOM achieves higher performance than OPT on multilingual summarization, and the performance increases with the number of model parameters.

4.4 Multi-task fine-tuning


Multilingual multitask fine-tuning is used to improve the zero-shot performance of BLOOM models; the resulting model is BLOOMZ.
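For a quick feel of how a BLOOMZ checkpoint is prompted zero-shot, here is a hedged sketch using a small public variant; the checkpoint name and prompt are illustrative choices, not the paper's evaluation setup.

```python
# Zero-shot prompting of a small BLOOMZ checkpoint (illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloomz-560m"                # small variant for demonstration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "Suggest a title for this article: BLOOM is an open-access multilingual language model trained by hundreds of researchers."
inputs = tok(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(output[0], skip_special_tokens=True))
```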

4.5 Code generation


The performance of the pretrained BLOOM model is similar to that of similarly sized GPT models trained on the Pile.
However, Codex, which is fine-tuned solely on code, is much stronger than the other models.

5 Summary

  • BLOOM mainly improves the multilingual ability of LLMs.
  • The optimization techniques it uses, such as ALiBi positional embeddings and embedding layer normalization, are similar to those of other models.
