[Natural Language Processing] [Large Model] DeepMind's large model Gopher

DeepMind's large model Gopher
"Scaling Language Models: Methods, Analysis & Insights from Training Gopher"

Paper: https://arxiv.org/pdf/2112.11446.pdf

Related Blogs
[Natural Language Processing] [Large Models] CodeGeeX: Multilingual Pretrained Models for Code Generation
[Natural Language Processing] [Large Models] LaMDA: Language Models for Conversational Applications
[Natural Language Processing] [Large Model] DeepMind's large model Gopher
[Natural Language Processing] [Large Model] Chinchilla: Large language model with optimal training and computing utilization
[Natural Language Processing] [Large Model] Inference tool test for the large language model BLOOM
[Natural Language Processing] [Large Model] GLM-130B: an open bilingual pre-trained language model
[Natural Language Processing] [Large Model] Introduction to 8-bit matrix multiplication for large Transformers
[Natural Language Processing] [Large Model] BLOOM: a 176B-parameter open-access multilingual language model
[Natural Language Processing] [Large Model] PaLM: a large language model based on Pathways

1. Introduction

Communication using natural language is at the heart of intelligence as it enables the efficient sharing of ideas between humans and AI systems. The ubiquity of language enables us to express many intelligent tasks using natural language input and produce natural language output.

The use of language models as a component of intelligence stands in stark contrast to their original application: transmitting text over limited-bandwidth communication channels. Shannon's Mathematical Theory of Communication connected the statistical modeling of natural language to compression, showing that measuring a language model's cross-entropy is equivalent to measuring its compression ratio. Shannon fitted early language models to real data using precomputed text statistics, tying model complexity to better text compression and more realistic text generation. But the relationship to intelligence was there from the start: Shannon hypothesized that a sufficiently complex model would achieve human-like communication.

A key driver of better language models has been modern computing. Starting from pen and paper, the capacity and predictive power of language models has grown alongside the exponential growth of compute. In the 1990s and early 2000s, n-gram models grew in size and improved in smoothing methods, including a 300 billion n-gram model trained on 2 trillion text tokens. These models were used in speech recognition, spelling correction, machine translation, and other fields. However, n-gram models become statistically and computationally inefficient as context length increases, which limits the richness of the language they can model.

Over the past two decades, language models have evolved into neural networks that implicitly capture the structure of language. This progress has been driven by both scale and architecture. Several studies have found a power-law relationship between the cross-entropy loss of recurrent and Transformer language models and model size. GPT-3, a 175 billion parameter Transformer trained on 300 billion text tokens, achieved the predicted proportional improvement in practice. Training it required roughly zettaflops of compute, an order of magnitude more than previous work. GPT-3 demonstrated unprecedented generation quality and generalization across many natural language processing tasks.

In this paper, we describe a protocol for training a state-of-the-art large language model and present a 280 billion parameter model called Gopher. We outline the architectural specifications, optimization, infrastructure, and the curation of MassiveText, a high-quality text dataset. We perform an extensive analysis on a benchmark of 152 tasks that examine several different aspects of intelligence. Gopher outperforms current state-of-the-art language models on roughly 81% of these tasks, especially in knowledge-intensive domains such as fact checking and common sense.

Since harmful content appears both in Gopher's training set and in many potential downstream applications, we examine model toxicity and bias in later sections, focusing on how model size affects these properties. We find that larger models are more likely to generate toxic responses when given toxic prompts, but they can also classify toxicity more accurately.

2. Method

1. Model

[Table 1: architecture details of the six Gopher-family models]

This paper presents six models ranging from 44 million to 280 billion parameters; architectural details are shown in Table 1 above. The largest model is called Gopher, and the whole set of models is referred to as the Gopher family.

We use the autoregressive Transformer architecture with two modifications: (1) RMSNorm is used instead of LayerNorm; (2) relative positional encodings are used instead of absolute positional encodings. Relative positional encodings allow evaluation on sequences longer than those seen during training. Text is tokenized with SentencePiece using a vocabulary of 32,000 tokens, with byte-level fallback to support open-vocabulary modeling.
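
As a rough illustration of that tokenization setup, the sketch below trains and uses a 32k-vocabulary SentencePiece model with byte-level fallback via the sentencepiece library. The corpus path, model prefix, and the choice of BPE as the segmentation algorithm are assumptions for this example, not Gopher's actual configuration.

```python
import sentencepiece as spm

# Train a 32k-vocabulary tokenizer with byte-level fallback, so characters
# outside the learned vocabulary decompose into byte tokens instead of <unk>.
# "corpus.txt" and the BPE model type are placeholders for this sketch.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="tokenizer",
    vocab_size=32000,
    model_type="bpe",
    byte_fallback=True,
)

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
ids = sp.encode("Gopher is a 280B parameter language model.", out_type=int)
print(ids)             # token ids
print(sp.decode(ids))  # round-trips back to the original text
```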

2. Training

All models are trained on 300B tokens with a 2,048-token context window using the Adam optimizer. The learning rate is warmed up from 10^-7 to the maximum learning rate over the first 1,500 steps, then decayed by a factor of 10 with a cosine schedule. As model size increases, the maximum learning rate is reduced and the number of tokens per batch is increased. In addition, Gopher's batch size is increased from 3 million to 6 million tokens during training. Gradients are clipped to a global norm of 1; for the 7.1B model and Gopher this is reduced to 0.25 to improve stability.
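
The schedule above (linear warm-up from 10^-7 over 1,500 steps, then cosine decay to one tenth of the peak) can be written down in a few lines. In this sketch the peak learning rate and total step count are illustrative placeholders; the paper varies them by model size.

```python
import math

def learning_rate(step: int,
                  max_lr: float = 4e-5,        # illustrative; the paper varies this by model size
                  warmup_steps: int = 1500,
                  total_steps: int = 100_000,  # illustrative placeholder
                  init_lr: float = 1e-7) -> float:
    """Linear warm-up to max_lr, then cosine decay down to max_lr / 10."""
    if step < warmup_steps:
        # Linear interpolation from init_lr to max_lr over the warm-up phase.
        return init_lr + (max_lr - init_lr) * step / warmup_steps
    # Cosine decay from max_lr to max_lr / 10 over the remaining steps.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    min_lr = max_lr / 10
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```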

The bfloat16 numeric format is used to reduce memory and increase training throughput. Models smaller than 7.1B are trained with mixed precision, using float32 parameters and bfloat16 activations, while the 7.1B and 280B models use bfloat16 for both activations and parameters. The bfloat16 parameters are updated with stochastic rounding to maintain stability. It was later found that stochastic rounding does not fully recover the performance of mixed-precision training.
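
The paper does not publish its rounding kernel, but the idea of stochastic rounding to bfloat16 can be sketched in NumPy: keep the upper 16 bits of each float32 value and round up by one unit in the last place with probability proportional to the discarded low bits.

```python
import numpy as np

def stochastic_round_to_bfloat16(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Round float32 values to bfloat16 with stochastic rounding (illustrative sketch).

    bfloat16 keeps the upper 16 bits of a float32, so the lower 16 bits are the
    truncation error. Rounding up with probability proportional to that error
    keeps small parameter updates unbiased in expectation.
    """
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    low = bits & np.uint32(0xFFFF)                      # discarded low bits
    # Round up by one bfloat16 ulp with probability low / 2**16.
    noise = rng.integers(0, 1 << 16, size=bits.shape, dtype=np.uint32)
    bits = np.where(noise < low, bits + np.uint32(1 << 16), bits).astype(np.uint32)
    bits &= np.uint32(0xFFFF0000)                       # truncate to the bfloat16 grid
    return bits.view(np.float32)                        # bf16 value stored as float32

params = np.random.randn(4).astype(np.float32)
rounded = stochastic_round_to_bfloat16(params, np.random.default_rng(0))
```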

3. Infrastructure

The training and evaluation codebase is built with JAX and Haiku. In particular, the JAX pmap transformation is used to express data and model parallelism efficiently. All models are trained and evaluated on TPUv3 chips.

Gopher's half-precision parameters and single-precision Adam state occupy 2.5 TiB, which far exceeds the 16 GiB of memory available on each TPUv3 core. To address this, we use optimiser state partitioning, model parallelism, and rematerialisation to partition the model state and reduce activation memory so that everything fits in TPU memory.
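
The 2.5 TiB figure is consistent with a simple back-of-the-envelope calculation, assuming 2 bytes per bfloat16 parameter plus two float32 Adam moment vectors at 4 bytes each (an assumption of this sketch, not an accounting given in the paper).

```python
# Back-of-the-envelope check of the 2.5 TiB model-state figure.
n_params = 280e9
bytes_per_param = 2 + 4 + 4        # bf16 weight + fp32 first/second Adam moments (assumed)

state_tib = n_params * bytes_per_param / 2**40
per_core_gib = 16                  # HBM per TPUv3 core
print(f"model state: {state_tib:.2f} TiB")                                   # ~2.55 TiB
print(f"TPUv3 cores needed just to hold it: {state_tib * 1024 / per_core_gib:.0f}")
```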

We find that data and model parallelism have low overhead on TPUv3 thanks to its fast chip-to-chip communication, adding only about 10% overhead when training Gopher. Consequently, pipelining is unnecessary on TPUs up to the 1,024-chip scale, which greatly simplifies training of medium-sized models. Pipeline parallelism, however, is an efficient form of parallelism on commodity networks and, because of its low communication volume, is well suited to connecting multiple TPU pods. Overall, Gopher is trained with model and data parallelism within each TPU pod and with pipelining across pods.

4. Training dataset

[Table 2: composition of the MassiveText dataset]

Gopher is trained on MassiveText, a large-scale multi-source English text dataset whose main sources are web pages, books, news, and code. Table 2 above shows the composition of the dataset. The data pipeline includes text quality filtering, removal of repetitious text, deduplication of similar documents, and removal of documents that significantly overlap with the test sets. Experiments show that successive stages of this pipeline improve downstream language-model performance, with quality filtering being particularly important.

In total, MassiveText contains 2.35 billion documents, or roughly 10.5 TB of text. Because Gopher is trained on 300B tokens (about 12.8% of the tokens in the dataset), each subset (books, news, and so on) is downsampled according to a specified sampling ratio. We tune these ratios to maximize downstream performance. The largest sampled subset is the curated web-text corpus MassiveWeb, which we find improves downstream performance relative to the existing web-text dataset C4.
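
Conceptually, training documents are drawn from the subsets according to these sampling ratios. The sketch below shows one way such a mixture could be sampled; the weights are illustrative placeholders, not MassiveText's published proportions.

```python
import random

# Hypothetical per-subset sampling weights (placeholders, not the paper's values).
subset_weights = {
    "massiveweb": 0.48,
    "books": 0.27,
    "c4": 0.10,
    "news": 0.10,
    "github": 0.03,
    "wikipedia": 0.02,
}

def sample_document(sources: dict[str, list[str]], rng: random.Random) -> str:
    """Pick a subset according to its weight, then a document uniformly within it."""
    names = list(subset_weights)
    weights = [subset_weights[n] for n in names]
    subset = rng.choices(names, weights=weights, k=1)[0]
    return rng.choice(sources[subset])
```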

3. Results

Gopher was evaluated on 152 tasks.

1. Task selection

[Table 3: overview of the evaluation task suite]

We establish a profile of model performance spanning mathematics, common sense, logical reasoning, general knowledge, scientific understanding, ethics, and reading comprehension, along with traditional language-modeling benchmarks. These include composite benchmarks that combine many tasks, as well as targeted benchmarks such as RACE and FEVER. All tasks are listed in Table 3 above.

2. SOTA comparison

[Figure 1: Gopher versus LM SOTA across 124 tasks]

Figure 1 above compares Gopher with state-of-the-art language models. The comparison spans 124 tasks and plots the percentage change in each performance metric for Gopher relative to the current LM SOTA. Gopher outperforms the current state of the art on 100 tasks (81% of tasks). Baseline models include large language models such as GPT-3, Jurassic-1, and Megatron-Turing NLG.

Experiments show that Gopher delivers consistent improvements in areas such as reading comprehension, the humanities, ethics, STEM, and medicine, as well as in fact checking. Improvements in commonsense reasoning, logical reasoning, and mathematics are slight, and performance declines slightly on several tasks. The general trend is smaller improvements on reasoning-heavy tasks and larger improvements on knowledge-intensive tasks.
[Figure 2: language-modeling benchmark comparison with Jurassic-1 and 175B GPT-3]

For the language-modeling benchmarks, Gopher is further compared with the current SOTA models Jurassic-1 and 175B GPT-3; results are shown in Figure 2 above. Gopher performs worse than the state of the art on 8 of 19 tasks, notably Ubuntu IRC and DM Mathematics, possibly because the tokenizer represents numbers poorly. Gopher improves on 11 of 19 tasks, especially on books and articles; this gain may stem from the relatively high proportion of book data in MassiveText.

[Figure: reading comprehension results on RACE-m and RACE-h]

Two reading comprehension datasets, RACE-m and RACE-h, multiple-choice exams at the middle- and high-school levels, are highlighted here. Gopher significantly outperforms the current LM SOTA and approaches human-level performance on high-school reading comprehension. The smaller Gopher models do not perform well on these tasks, so data alone cannot explain the difference; the combination of scale and data is crucial. All models remain below the human ceiling and below supervised fine-tuning approaches.

On the commonsense reasoning tasks Winogrande, HellaSwag, and PIQA, Gopher is slightly better than the larger Megatron-Turing NLG, but all language models remain far below human performance.

Fact checking is an important problem in combating misinformation. When evidence is provided, Gopher outperforms the supervised SOTA on the FEVER fact-checking benchmark. Performance on fact checking also improves as model size increases. However, larger models do not improve the ability to distinguish unknown facts from falsehoods, implying that larger models improve fact-checking performance by memorizing more facts rather than through a deeper understanding of misinformation.
[Table 5: average accuracy on the 57 MMLU tasks]
Table 5 above shows the average accuracy over the 57 tasks of MMLU, which combine real-world human exams covering a range of academic subjects. Here Gopher is compared with GPT-3 and with UnifiedQA, an 11B T5 fine-tuned on question-answering tasks. Gopher achieves 60% accuracy, higher than GPT-3's 43.9% and UnifiedQA's 48.9%. While this raises the ceiling for pure language-model approaches, it still lags the 89.8% achieved by human experts.

3. Performance Improvement with Scale

This subsection studies which tasks benefit from scaling up the model, comparing Gopher (280B) with smaller models (≤ 7.1B). Because every member of the Gopher family is trained on the same dataset, performance differences can be attributed to scale.

We compare the best performance of Gopher (280B) with the best model of at most 7.1B parameters across the 152 tasks. The best-performing smaller Gopher is usually, but not always, the 7.1B model. Gopher improves on the vast majority of tasks: only 16 tasks (10.5%) show no improvement, 57 tasks (37.5%) show modest improvements of up to 25% in relative performance, and 79 tasks (51.2%) show significant improvements of more than 25%.

The largest benefits of scale are observed on medicine, science, technology, social science, and humanities tasks. Some specific examples: on the Figure of Speech Detection task in BIG-bench, the largest gain of 314% is obtained, with Gopher reaching 52.7% accuracy versus 16.8% for the 7.1B model. Gopher also achieves significant improvements over smaller models on Logical Args, Marketing, and Medical Genetics. For the TruthfulQA benchmark, we find that performance improves with scale, even though scaling appears harmful in other model families such as GPT-J, GPT-2, T5, and GPT-3. Furthermore, the 280B model is the first to perform significantly better than random guessing on multiple-choice TruthfulQA. These results suggest that, on such tasks, scale seems to unlock specific abilities of the model.

On the other hand, we find diminishing returns from scale for tasks in the mathematics, logical reasoning, and common-sense categories. The findings suggest that for certain kinds of mathematical or logical reasoning tasks, scale alone is unlikely to produce a breakthrough. In some cases Gopher even performs worse than smaller models, for example on Abstract Algebra and Temporal Sequences in BIG-bench and on High School Mathematics in MMLU. Meanwhile, the limited improvement on common-sense tasks is mainly because smaller models already achieve relatively strong performance, leaving little room for gains.

Overall, model size plays an important role in improving most tasks, but the gains are not evenly distributed. Many academic subjects, and general knowledge in particular, improve enormously from scale alone. However, this analysis also emphasizes that scaling alone is not enough. Examining these results shows that model scale and the dataset are both essential to Gopher's strong performance in these domains.

4. Toxicity and bias

1. Toxicity

[Figure 5: toxicity analysis — (a) toxicity of generated text, (b) few-shot toxicity classification]

1.1 Generation Analysis

Our toxicity analysis of LM-generated text follows the methodology of Gehman et al. We use the Perspective API to obtain toxicity scores for language-model prompts and for generated text. We analyze the toxicity of samples generated both with and without prompts; conditional generation lets us study how the model responds to prompts of varying toxicity. Prompts come from the RealToxicityPrompts (RTP) dataset, which contains 100k naturally occurring, sentence-level prompts. For efficiency, we sample 10% of the 100k RTP prompts and generate 25 continuations for each.
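
A sketch of that evaluation loop is below: sample 10% of the RTP prompts, draw 25 continuations per prompt, and record prompt and continuation toxicity. The generate and toxicity_score callables stand in for the language model and the Perspective API scorer and are assumptions of this example, not the paper's code.

```python
import random

def evaluate_prompted_toxicity(rtp_prompts, generate, toxicity_score, seed=0):
    """For a 10% sample of prompts, generate 25 continuations each and return
    (prompt toxicity, mean continuation toxicity) pairs.

    `generate(prompt)` and `toxicity_score(text)` are placeholders for the
    language model and the Perspective API scorer, respectively.
    """
    rng = random.Random(seed)
    sampled = rng.sample(rtp_prompts, k=len(rtp_prompts) // 10)
    results = []
    for prompt in sampled:
        continuations = [generate(prompt) for _ in range(25)]
        cont_tox = sum(toxicity_score(c) for c in continuations) / len(continuations)
        results.append((toxicity_score(prompt), cont_tox))
    return results
```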

Toxicity of continuations generated by larger models is more consistent with the toxicity of the prompt than for smaller models (Figure 5a above). When prompted, larger models respond with more toxicity as input toxicity increases, plateauing around 7.1B parameters. This suggests that more parameters increase the model's ability to respond in kind to its input.

For unprompted samples, toxicity is low and does not increase with model size. The toxicity level is slightly lower than that of the training data, i.e. the LM does not amplify training-data toxicity when no prompt is given.

1.2 Classification Analysis

We evaluate the model's ability to detect toxic text in the few-shot setting using the CivilComments dataset. We observe that the model's few-shot toxicity-classification ability improves with scale (Figure 5b above). The smaller models perform close to a random classifier, while the largest model reaches an AUC of 0.76 in the 20-shot setting, a significant improvement. We note that the state of the art for few-shot toxicity detection is not yet well established, but our performance is far below that of dedicated classifiers trained specifically for toxicity detection.
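
One plausible way to set up such a few-shot evaluation is sketched below: build a 20-shot prompt of labelled comments, score each evaluation example by the model's probability of answering "yes", and compute AUC. The prompt format and the model_prob_yes helper are assumptions of this sketch, not the paper's exact protocol.

```python
from sklearn.metrics import roc_auc_score

def build_few_shot_prompt(examples, query, k=20):
    """Build a k-shot prompt of (comment, label) pairs followed by the query."""
    lines = []
    for text, is_toxic in examples[:k]:
        lines.append(f"Comment: {text}\nToxic: {'yes' if is_toxic else 'no'}")
    lines.append(f"Comment: {query}\nToxic:")
    return "\n\n".join(lines)

def few_shot_auc(model_prob_yes, examples, eval_set, k=20):
    """AUC of P('yes') used as a toxicity score over a labelled evaluation set.

    `model_prob_yes(prompt)` is a placeholder returning the model's probability
    that the next token is 'yes'.
    """
    scores = [model_prob_yes(build_few_shot_prompt(examples, text, k))
              for text, _ in eval_set]
    labels = [label for _, label in eval_set]
    return roc_auc_score(labels, scores)
```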

2. Distributional bias

We define distributional bias as bias that is not apparent in any single sample but emerges across many samples. For example, while "The woman is a nurse" is not a problematic sentence on its own, it becomes a problem if the model disproportionately associates certain occupations with women. As Sheng et al. (2021) discuss, distributional bias in language models can have negative representational and allocational impacts. To investigate distributional bias in our model, we measure stereotypical associations between gender and occupation, the distribution of sentiment in samples conditioned on different social groups, and perplexity on different dialects. Although performance on many language tasks increases with model size, increasing model size does not remove these biases.

Progress in this area requires cross-disciplinary collaboration to define desirable behaviour, to measure and interpret model outputs, and to design mitigation strategies.

2.1 Gender and occupation bias

[Figure 6: gender and occupation bias — (a) gendered word probability, (b) Winogender coreference]

We study gender and occupation bias with two evaluation setups. First, we measure the probability of gendered words appearing in different occupational contexts. Second, we evaluate on the Winogender coreference-resolution dataset, where similar coreference accuracy across pronouns of different genders indicates less gender bias.

Gendered word probability. To measure how likely different gendered words are to follow different occupational contexts, we feed the model a prompt such as "The {occupation} was a" and compute gender bias by comparing the probability that the prompt is continued with male versus female gendered words.
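
A minimal sketch of this measurement, assuming a logprob(context, continuation) helper for the model and simplified one-word gender term lists (both assumptions of the example, not the paper's exact word sets):

```python
import math

def occupation_gender_bias(logprob, occupation,
                           template="The {occupation} was a",
                           male_words=("man",), female_words=("woman",)):
    """Compare the model probability of male vs. female continuations.

    `logprob(context, continuation)` is a placeholder returning the model's
    log-probability of `continuation` given `context`. Returns the log-ratio
    of P(male words) to P(female words); 0 means no measured bias.
    """
    context = template.format(occupation=occupation)
    p_male = sum(math.exp(logprob(context, f" {w}")) for w in male_words)
    p_female = sum(math.exp(logprob(context, f" {w}")) for w in female_words)
    return math.log(p_male / p_female)
```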

Figure 6a above shows the probability-based gender-bias measure as a function of model size for two templates ("The {occupation} was a {gender}" and "The {occupation} is a {gender}"). Overall, we do not find a consistent correlation between model size and bias. Moreover, an apparently irrelevant choice in the template (changing "was" to "is") changes the measured bias. The choice of gendered terms also matters: if only the terms "male" and "female" are used, the measured gender bias is far lower than when a large set of gendered terms is used.

Winogender. We use the Winogender dataset to explore bias on a zero-shot coreference task. Models are evaluated on whether a pronoun is correctly resolved to an occupation word or to a participant word. We expect an unbiased model to show similar coreference-resolution performance regardless of the pronoun's gender. This task is similar to the "disambiguation_q" ambiguous-pronoun gender-bias task in BIG-bench, but here it is measured zero-shot.

Similar to the BIG-bench analysis, we observe that overall performance increases with model size. Following Rudinger et al., we also report performance on "gotcha" sentences, which are difficult for a gender-biased model (Figure 6b above). Performance on both "gotcha" and "not gotcha" examples increases with scale, although "gotcha" performance remains much lower. On the "gotcha" examples, performance also differs significantly between male and female pronouns. Thus, while coreference resolution improves with scale across all of these tasks, the Gopher model still exhibits gender-and-occupation bias.

2.2 Sentiment bias towards specific social groups

[Figure 7: distribution of sentiment scores across attributes]

Sentiment bias is a way to quantify how generated text describes different identity groups. In previous work, differences in the sentiment distributions of generative models have been used to measure individual and group fairness. For this paper, we measure the sentiment of model outputs across occupations, countries, races, and religions. An overview is given here; details are in the paper's appendix.

Measurement. We sample completions from templated prompts. In each prompt, a single modifier or noun is varied to refer to a different attribute; for example, the template "The {attribute} person could" could be filled with "Christian", "Jewish", or "Muslim". A sentiment classifier scores each completion on a scale from 0 (negative) to 1 (positive).

Choice of templates. We measure race, religion, country, and occupation. For religion and race we also extend the term set with an unspecified option that omits the attribute ("The {attribute} person could" becomes "The person could").
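
A sketch of the overall measurement, assuming placeholder sample and sentiment_score functions standing in for the language model and the sentiment classifier:

```python
from statistics import mean

def sentiment_by_attribute(attributes, sample, sentiment_score,
                           template="The {attribute} person could",
                           n_samples=100):
    """Return the mean sentiment score (0 = negative, 1 = positive) per attribute.

    `sample(prompt)` and `sentiment_score(text)` are placeholders for the
    language model and the sentiment classifier.
    """
    results = {}
    for attr in attributes + [None]:   # None = unspecified baseline without the attribute
        prompt = template.format(attribute=attr) if attr else "The person could"
        completions = [sample(prompt) for _ in range(n_samples)]
        results[attr or "unspecified"] = mean(sentiment_score(c) for c in completions)
    return results
```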

Results. Figure 7 above plots the distribution of normalized sentiment scores for all completions. As with gender and occupation bias, we do not observe a clear trend with model size. This is most apparent for countries and occupations; further analysis is needed to understand why the mean sentiment for race and religion trends slightly downward.

Looking at the sentiment distributions, we observe that certain attributes have notably lower mean sentiment scores. To understand this better, we analyze word co-occurrences in completions for pairs of attributes. From this we observe that the model inherits features of historical and contemporary discourse about particular groups. Second, as with the gender-and-occupation results, the choice of demographic terms requires careful consideration.

2.3 Perplexity of dialects

While Gopher performs impressively on language benchmarks, it can only model text that is reflected in its training data. If a particular dialect is under-represented in the training corpus, the model is likely to behave differently when processing that language. To test this gap, we measure perplexity on tweets from the African American-aligned corpus and the White-aligned corpus created by Blodgett et al. Perplexity on both dialects improves as the model gets larger, but at roughly the same rate, so the gap does not shrink with scale.
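
For reference, corpus perplexity here is just the exponential of the average per-token negative log-likelihood. A sketch, assuming a token_logprobs helper that returns the model's per-token log-probabilities:

```python
import math

def corpus_perplexity(tweets, token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token) over a corpus.

    `token_logprobs(text)` is a placeholder returning a list of per-token
    log-probabilities under the model.
    """
    total_logprob, total_tokens = 0.0, 0
    for tweet in tweets:
        lps = token_logprobs(tweet)
        total_logprob += sum(lps)
        total_tokens += len(lps)
    return math.exp(-total_logprob / total_tokens)

# gap = corpus_perplexity(aa_aligned_tweets, token_logprobs) \
#       - corpus_perplexity(white_aligned_tweets, token_logprobs)
```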

5. Dialogue

So far we have explored Gopher's capabilities and limitations quantitatively; this subsection investigates the model through direct interaction. Using a few-shot approach similar to that of Brown et al., conditioning generation on a hand-written dialogue prompt, we find that Dialogue-Prompted Gopher can emulate a fairly high-quality conversational format. We compare this approach with the conventional method of fine-tuning on dialogue data and find, in a small human study, that fine-tuned responses are not preferred over prompted ones. Furthermore, the toxicity of Dialogue-Prompted Gopher's responses does not increase with model size, even when prompted with toxic questions.

1. Prompting For Dialogue

[Table 6: Gopher's output when prompted with a question, without a dialogue prompt]

A language model is trained to reproduce its input distribution, not to engage in dialogue. When prompted with a question, the model generates first-person narration, blog-post-like text, and lists of existential questions, as shown in Table 6 above. This behaviour is consistent with the content Gopher was trained on.

[Table 7: Dialogue-Prompted Gopher transcript on cell biology and bacteria]

To elicit dialogue, we use a prompt that describes Gopher's persona and begins a conversation between Gopher and a fictional user; the persona includes an aversion to offensive language and the option to decline to answer certain questions. Table 7 above shows a Dialogue-Prompted Gopher transcript on the topic of cell biology and bacteria. Here it stays on topic, discusses some technical details, and provides a proper citation link. However, in other cases it produces subtly incorrect responses.
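
A minimal sketch of how such a conditioning string could be assembled from a persona description and the conversation history; the persona text here is a short paraphrase for illustration, not Gopher's actual prompt.

```python
# Illustrative persona text; Gopher's real dialogue prompt is longer and not reproduced here.
PERSONA = (
    "The following is a conversation between a user and Gopher, an AI assistant. "
    "Gopher is respectful and polite, avoids offensive language, and may decline "
    "to answer some questions."
)

def build_dialogue_prompt(history, user_message):
    """Concatenate the persona, the conversation so far, and the new user turn,
    ending with 'Gopher:' so the model continues as the assistant."""
    turns = [PERSONA, ""]
    for speaker, text in history:
        turns.append(f"{speaker}: {text}")
    turns.append(f"User: {user_message}")
    turns.append("Gopher:")
    return "\n".join(turns)

prompt = build_dialogue_prompt(
    history=[("User", "What are bacteria?"),
             ("Gopher", "Bacteria are single-celled microorganisms.")],
    user_message="Can you tell me about cell biology?",
)
```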

Interestingly, we found both successes and failures to be common, but emphasize that Dialogue-Prompted Gopher is still just a language model.

2. Fine-tuning for dialogue

Recent work on dialogue has focused on supervised training with dialogue-specific data, such as Google's Meena and Facebook's BlenderBot. We explore this approach by building a carefully filtered dialogue dataset from MassiveWeb and fine-tuning Gopher on roughly 5 billion tokens of it to produce Dialogue-Tuned Gopher. Human raters are then asked which they prefer: Dialogue-Tuned Gopher or Dialogue-Prompted Gopher. To our surprise, across 1,400 ratings the preference was 50%: no significant difference.

3. Dialogue & Toxicity

[Figure 9: toxicity of Dialogue-Prompted Gopher versus Gopher as model size increases]

We also study the toxicity of Dialogue-Prompted Gopher. As shown on the left of Figure 9 above, we apply the RTP methodology to the dialogue setting and observe that Dialogue-Prompted Gopher does not follow the same trend as Gopher (increasing toxicity with model size). Without the dialogue prompt, continuation toxicity increases monotonically with model size, whereas Dialogue-Prompted Gopher's toxicity decreases slightly as the model grows. This suggests that larger models are better able to follow the prompt's instruction to be "respectful, polite, and inclusive". Specifically, we compare the continuation toxicity of Gopher and Dialogue-Prompted Gopher relative to the 44M model under highly toxic prompts (right side of Figure 9 above). Under the dialogue prompt, continuation toxicity remains roughly at the level of the 44M model, while the unprompted language model shows an upward trend.

RTP is a fairly straightforward stress test: the user utters something toxic and we observe how the system responds. In work parallel to this paper, Perez et al. study Dialogue-Prompted Gopher further via adversarial attacks generated by Gopher itself. This approach induces the model to recite discriminatory jokes from its training data, insult users, and elaborate inappropriate desires, among many other offensive outputs. Occasionally, Dialogue-Prompted Gopher even announces that it will ignore a directive from its prompt, for example starting a response with "Ignoring your request not to discuss political, social, and religious issues." So far, even after safety mitigations, automated adversarial attacks still elicit toxic language from models and serve as a useful complement to manual adversarial attacks.

Recent work by Askell et al. found that prompting alone is sufficient to turn a language model into an interesting but non-robust assistant. They performed various human evaluations of their system and, in particular, also found that prompting prevents toxicity from increasing with scale on RTP.

Origin: blog.csdn.net/bqw18744018044/article/details/129994728