[Natural Language Processing] [Large Model] Chinchilla: A Compute-Optimal Large Language Model

Chinchilla: training a compute-optimal large language model
"Training Compute-Optimal Large Language Models"

Paper address: https://arxiv.org/pdf/2203.15556.pdf

Related Blogs
[Natural Language Processing] [Large Models] CodeGeeX: Multilingual Pretrained Models for Code Generation
[Natural Language Processing] [Large Models] LaMDA: Language Models for Conversational Applications
[Natural Language Processing] [Large Models] DeepMind's large model Gopher
[Natural Language Processing] [Large Model] Chinchilla: Large language model with optimal training and computing utilization
[Natural Language Processing] [Large Model] Testing inference tools for the large language model BLOOM
[Natural Language Processing] [Large Model] GLM-130B: an open source bilingual pre-trained language model
[Natural Language Processing] [Large Model] Introduction to 8-bit matrix multiplication for large Transformers
[Natural Language Processing] [Large Model] BLOOM: a 176B-parameter, openly available multilingual model
[Natural Language Processing] [Large Model] PaLM: A large language model based on Pathways
[Natural Language Processing] [chatGPT series] Large language models can improve themselves
[Natural Language Processing] [ChatGPT series] FLAN: Fine-tuned language models are zero-shot learners
[Natural Language Processing] [ChatGPT series] Where does the intelligence of ChatGPT come from?

1. Introduction

[Figure 1]

A series of large language models (LLMs) has recently appeared, with the largest dense language models now exceeding 500B parameters. These large autoregressive transformers have shown remarkable results on a variety of tasks.

The compute and energy cost of training an LLM is enormous and grows with model size. In practice, the available compute budget is known in advance: how many accelerators are available and for how long they can be used. Since an LLM is typically trained only once, accurately estimating the optimal model hyperparameters for a given compute budget is crucial.

Kaplan et al. (2020) showed a power-law relationship between the number of parameters of an autoregressive language model and its performance. As a result, models in this field have been trained at ever larger scales in the expectation of better performance. A notable conclusion of Kaplan et al. (2020) is that, to be compute-optimal, large models should not be trained to their lowest possible loss. We reach the same conclusion, but estimate that large models should be trained on far more tokens than the authors recommend. Specifically, given a 10x increase in compute budget, they suggest increasing model size by 5.5x and the number of training tokens by 1.8x. Instead, we find that model size and the number of training tokens should scale in equal proportions.

Following the work of Kaplan et al. and the GPT-3 training setup, many recently trained large language models use close to 300B training tokens, consistent with the practice of mainly increasing model size as compute increases.

In this paper, we revisit the question: given a fixed FLOPs budget, how should one trade off model size against the number of training tokens? To answer it, we define the final pre-training loss $L(N, D)$ as a function of the number of model parameters $N$ and the number of training tokens $D$. Since the compute budget $C$ is a deterministic function $\text{FLOPs}(N, D)$ of the number of parameters and training tokens, we are interested in minimizing $L$ under the constraint $\text{FLOPs}(N, D) = C$:
$$N_{opt}(C), D_{opt}(C) = \mathop{\text{argmin}}_{N,D\ \text{s.t.}\ \text{FLOPs}(N,D)=C} L(N,D) \tag{1}$$
The functions $N_{opt}(C)$ and $D_{opt}(C)$ describe the optimal allocation of a compute budget $C$. We estimate these functions from the losses of 400 models, ranging from 70M to 16B parameters and trained on 5B to 400B tokens. Our approach yields results that differ considerably from those of Kaplan et al. Figure 1 above summarizes these results.
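To make Equation (1) concrete, here is a minimal numerical sketch (not code from the paper): it minimizes a hypothetical loss of the parametric form introduced later in Equation (2) by grid search over model sizes, using the $\text{FLOPs}(N, D) \approx 6ND$ approximation that also appears later in this post. The constants E, A, B, alpha, beta below are illustrative placeholders, not the paper's fitted values.

```python
import numpy as np

# Hypothetical loss of the form L(N, D) = E + A / N**ALPHA + B / D**BETA (cf. Eq. 2).
# The constants are illustrative placeholders, not the paper's fitted values.
E, A, B, ALPHA, BETA = 1.7, 400.0, 2000.0, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def compute_optimal_by_grid(budget_flops, n_grid=10_000):
    """Minimize L(N, D) subject to 6 * N * D = budget_flops (Eq. 1) by grid search over N."""
    n_params = np.logspace(7, 13, n_grid)          # candidate model sizes
    n_tokens = budget_flops / (6.0 * n_params)     # D is pinned down by the FLOP constraint
    losses = loss(n_params, n_tokens)
    i = np.argmin(losses)
    return n_params[i], n_tokens[i], losses[i]

if __name__ == "__main__":
    n_opt, d_opt, l_opt = compute_optimal_by_grid(1e23)
    print(f"N_opt ~ {n_opt:.3e} params, D_opt ~ {d_opt:.3e} tokens, loss ~ {l_opt:.3f}")
```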

Based on our estimated compute-optimal frontier, we predict that, for the compute budget used to train Gopher, the optimal model would be 4x smaller and trained on 4x more tokens. We validate this by training a compute-optimal 70B model, called Chinchilla, on 1.4 trillion tokens. Chinchilla not only outperforms the much larger Gopher, but its reduced size also lowers inference cost and greatly facilitates downstream use on smaller hardware. The energy cost of an LLM is amortized over its use for inference and fine-tuning.

2. Estimating the optimal parameter/training-token allocation

We propose three different approaches to answer the question driving this research: given a fixed FLOPs budget, how should one trade off model size against the number of training tokens? In all three cases, we start by training a family of models that vary in size and number of training tokens, and then use the resulting training curves to fit estimators of how they should scale. We assume a power-law relationship between compute and the optimal model size, although future work may want to allow for curvature in this relationship. The predictions of the three methods are similar, suggesting that the number of model parameters and the number of training tokens should increase proportionally as compute increases.

[Figure 2]

1. Method 1: Fix the model size and change the number of training tokens

The first method fixes the model sizes and varies the number of training steps, training each model for 4 different numbers of training sequences. From these runs, we can directly extract an estimate of the minimum loss achieved for a given number of training FLOPs.

For each fixed parameter count $N$, we train 4 different models. The results of each run are then smoothed and interpolated to give a continuous mapping from FLOP count to training loss. For each FLOP count, we determine the lowest loss achieved, and use these interpolants to map any FLOP budget $C$ to the most efficient choice of model size $N$ and number of training tokens $D$ such that $\text{FLOPs}(N, D) = C$. At 1500 logarithmically spaced FLOP values, we find the model size and number of training tokens that achieve the lowest loss across all models. Finally, we fit power laws to estimate the optimal model size and number of training tokens for a given amount of compute (see the middle and right panels of Figure 2 above), obtaining the relationships $N_{opt} \propto C^a$ and $D_{opt} \propto C^b$. We find $a = 0.50$ and $b = 0.50$.
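As an illustration of this final fitting step, here is a minimal sketch, assuming one has already extracted the loss-minimizing model size for each FLOP value from the interpolated loss curves; the arrays below are synthetic stand-ins for those extracted points.

```python
import numpy as np

# Synthetic stand-ins for the (compute, best model size) pairs read off the
# interpolated training-loss envelopes; in the paper these come from ~1500 FLOP values.
flops = np.logspace(18, 22, 50)                                          # C
n_best = 0.1 * flops**0.5 * np.exp(np.random.default_rng(0).normal(0, 0.05, 50))

# Fit N_opt ∝ C^a, i.e. log N = a * log C + const, by least squares in log space.
a, log_k = np.polyfit(np.log(flops), np.log(n_best), deg=1)
print(f"fitted exponent a ≈ {a:.2f}")   # recovers ~0.5 for this synthetic data
```

The same fit applied to the (compute, best token count) pairs gives the exponent $b$.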

2. Method 2: Fix the total amount of FLOPs

[Figure 3]

In the second method, we choose 9 different training FLOP budgets, ranging from $6\times 10^{18}$ to $3\times 10^{21}$ FLOPs, and vary the model size for each budget. We consider the final training loss at each point. This lets us directly answer the question: for a given FLOP budget, what is the optimal number of parameters?

For each FLOP budget, the relationship between parameter count and final loss is plotted in Figure 3 (left) above. In all cases, we ensure that we train a sufficiently diverse range of model sizes to observe a clear minimum in the loss. We fit a parabola to each IsoFLOP profile to directly estimate the model size that minimizes the loss (Figure 3, left). As in the previous method, we then fit power laws relating FLOPs to the loss-optimal model size and number of training tokens, shown in the middle and right panels of Figure 3 above. Again fitting the power-law forms $N_{opt} \propto C^a$ and $D_{opt} \propto C^b$, we find $a = 0.49$ and $b = 0.51$.
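A minimal sketch of the per-budget parabola fit, with synthetic data standing in for one IsoFLOP profile: fit a quadratic in log(model size) to the final losses and read the optimum off the vertex.

```python
import numpy as np

# Synthetic IsoFLOP profile: final loss vs. model size for one fixed FLOP budget.
model_sizes = np.array([0.25e9, 0.5e9, 1e9, 2e9, 4e9, 8e9])
final_losses = np.array([2.95, 2.80, 2.72, 2.70, 2.74, 2.84])

# Fit a parabola in x = log(N): loss ≈ c2 * x**2 + c1 * x + c0.
x = np.log(model_sizes)
c2, c1, c0 = np.polyfit(x, final_losses, deg=2)

# The vertex of the parabola gives the loss-minimizing model size for this budget.
n_opt = np.exp(-c1 / (2.0 * c2))
print(f"estimated optimal model size ≈ {n_opt:.2e} parameters")
```

Repeating this for each of the 9 budgets and fitting power laws to the resulting (C, N_opt) and (C, D_opt) pairs, as in Method 1, yields the exponents $a$ and $b$.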

3. Method 3: Fitting a parametric loss function

[Figure 4]

Finally, we model all final loss values from the experiments of Methods 1 and 2 as a parametric function of the number of model parameters and the number of tokens seen. Following a classical risk decomposition, we propose the following functional form:
$$\hat{L}(N,D) \triangleq E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \tag{2}$$
The first term captures the loss of an ideal generative process on the data distribution and should correspond to the entropy of natural text. The second term captures the fact that a perfectly trained transformer with $N$ parameters still falls short of this ideal process. The last term captures the fact that the transformer is not trained to convergence, since only a finite number of optimization steps are taken on a sample of the data distribution.

Model fit. To estimate $(A, B, E, \alpha, \beta)$, we use the L-BFGS algorithm to minimize the Huber loss between the predicted and observed log loss values:
$$\min_{A,B,E,\alpha,\beta} \sum_{\text{Runs } i} \text{Huber}_{\delta}\Big(\log \hat{L}(N_i,D_i) - \log L_i\Big) \tag{3}$$
We account for possible local minima by selecting the best fit from a grid of initializations. The Huber loss is robust to outliers, which we find important for good predictive performance on held-out data points.
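A minimal sketch of this fitting procedure, assuming numpy and scipy are available; the Huber threshold delta, the random initialization ranges, and the log-space parametrization (optimizing log A, log B, log E) are choices made for this sketch, not details stated above.

```python
import numpy as np
from scipy.optimize import minimize

def huber(r, delta=1e-3):
    """Elementwise Huber penalty with threshold delta."""
    r = np.asarray(r)
    return np.where(np.abs(r) <= delta, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta))

def fit_parametric_loss(N, D, L, delta=1e-3, n_restarts=20, seed=0):
    """Fit L_hat(N, D) = E + A / N**alpha + B / D**beta (Eq. 2) by minimizing the
    Huber loss of log-space residuals (Eq. 3) with L-BFGS, keeping the best of
    several random initializations."""
    logN, logD, logL = np.log(N), np.log(D), np.log(L)

    def objective(theta):
        a, b, e, alpha, beta = theta          # a = log A, b = log B, e = log E
        # log L_hat = logsumexp(a - alpha*logN, b - beta*logD, e)
        log_pred = np.logaddexp(np.logaddexp(a - alpha * logN, b - beta * logD), e)
        return huber(log_pred - logL, delta).sum()

    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        x0 = rng.uniform(low=[0.0, 0.0, -1.0, 0.0, 0.0], high=[10.0, 10.0, 1.0, 2.0, 2.0])
        res = minimize(objective, x0, method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res

    a, b, e, alpha, beta = best.x
    return np.exp(a), np.exp(b), np.exp(e), alpha, beta   # A, B, E, alpha, beta
```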

Efficient frontier. We can approximate the functions $N_{opt}$ and $D_{opt}$ by minimizing the parametric loss $\hat{L}$ under the constraint $\text{FLOPs}(N, D) \approx 6ND$. The resulting $N_{opt}$ and $D_{opt}$ balance the two terms of equation (3) that depend on model size and data. By construction, they have a power-law form:
$$N_{opt}(C) = G\left(\frac{C}{6}\right)^{a}, \quad D_{opt}(C) = G^{-1}\left(\frac{C}{6}\right)^{b}$$
where
$$G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}, \quad a = \frac{\beta}{\alpha+\beta}, \quad b = \frac{\alpha}{\alpha+\beta} \tag{4}$$
We show contours of the fitted function $\hat{L}$ in Figure 4 above. For this method, we find $a = 0.46$ and $b = 0.54$.
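Given fitted constants $(A, B, \alpha, \beta)$, Equation (4) gives the efficient frontier in closed form under $\text{FLOPs}(N, D) \approx 6ND$. A small sketch, with placeholder constants rather than the paper's fitted values:

```python
def efficient_frontier(C, A, B, alpha, beta):
    """Compute-optimal N and D for a budget of C FLOPs, per Eq. (4), under FLOPs ≈ 6*N*D."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    a = beta / (alpha + beta)
    b = alpha / (alpha + beta)
    n_opt = G * (C / 6.0) ** a
    d_opt = (C / 6.0) ** b / G
    return n_opt, d_opt

# Placeholder constants for illustration only (not the paper's fitted values).
n_opt, d_opt = efficient_frontier(C=1e23, A=400.0, B=2000.0, alpha=0.34, beta=0.28)
print(f"N_opt ≈ {n_opt:.2e} parameters, D_opt ≈ {d_opt:.2e} tokens")
```

Since $a + b = 1$ by definition, the returned pair always satisfies $6 \, N_{opt} D_{opt} = C$.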

4. Optimal model size

[Table 3]

Although the three methods use different fitting procedures and different trained models, they yield similar predictions for the optimal number of parameters and tokens at a given FLOP budget. All three indicate that the model size and the amount of training data should increase proportionally as the compute budget grows. The first and second methods return very similar optimal model sizes, while the third predicts that somewhat smaller models would be optimal at larger budgets. We note that observation points $(L, N, D)$ with low training FLOPs have larger residuals $\|L - \hat{L}(N, D)\|_2^2$ than predictions at higher compute.

Table 3 above shows the estimated number of FLOPs and tokens needed to reach the compute-optimal frontier for a given model size. Our findings suggest that, for their respective compute budgets, current large language models are considerably too large. For example, a 175B-parameter model should be trained with a compute budget of $4.41\times 10^{24}$ FLOPs and on over 4.2 trillion tokens. A model of roughly Gopher's size (280B) should be trained with close to $10^{25}$ FLOPs and on over 6.8 trillion tokens. Unless a compute budget of $10^{26}$ FLOPs is available, a 1-trillion-parameter model is unlikely to be the optimal model to train. Furthermore, the amount of training data required far exceeds what is currently used to train large models, emphasizing the importance of dataset collection alongside engineering improvements that allow training ever larger models. Although extrapolating over several orders of magnitude involves substantial uncertainty, our analysis clearly shows that, for a given training compute budget, many current LLMs should use smaller models trained on more tokens to be compute-optimal.
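The compute budgets quoted above can be sanity-checked against the $\text{FLOPs}(N, D) \approx 6ND$ approximation used earlier; a quick check with the two examples mentioned in the text:

```python
def approx_flops(n_params, n_tokens):
    """Standard rule-of-thumb estimate of dense transformer training compute."""
    return 6.0 * n_params * n_tokens

print(f"{approx_flops(175e9, 4.2e12):.2e}")   # ≈ 4.41e24, matching the 175B example above
print(f"{approx_flops(280e9, 6.8e12):.2e}")   # ≈ 1.1e25, close to the ~1e25 quoted for 280B
```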

3. Chinchilla

Based on the above analysis, the optimal model size for Gopher's compute budget lies between 40B and 70B parameters. For reasons of dataset size and computational efficiency, we test this hypothesis by training a 70B-parameter model on 1.4T tokens. We call this model Chinchilla and compare it with Gopher and other LLMs. Chinchilla and Gopher are trained for the same number of FLOPs but differ in model size and number of training tokens.

Although pre-training a large language model carries a considerable compute cost, downstream fine-tuning and inference also consume substantial compute. Because it is 4x smaller than Gopher, Chinchilla's memory footprint and inference cost are correspondingly smaller.

1. Model and training details

[Table 4]

All hyperparameters used to train Chinchilla are shown in Table 4 above. Except for the differences listed below, Chinchilla uses the same model architecture and training setup as Gopher.

  • We train Chinchilla on MassiveText (the same dataset as Gopher), but with a slightly different subset weighting to account for the increased number of training tokens.
  • Chinchilla uses AdamW instead of Adam, which improves both the language-modeling loss and downstream task performance after fine-tuning.
  • We train Chinchilla with a slightly modified SentencePiece tokenizer that does not apply NFKC normalization. The vocabulary is very similar: 94.15% of the tokens are the same as those used to train Gopher. We find this helpful for representing mathematics and chemistry.
  • The forward and backward passes are computed in bfloat16, while the weights in the distributed optimizer state are stored in float32 (a minimal illustration of this pattern follows after this list).
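The following is a minimal illustration of that mixed-precision pattern, not the authors' actual training code: compute the forward and backward pass in bfloat16 while keeping a float32 master copy of the weights for the optimizer update. It assumes a PyTorch build with bfloat16 support and uses plain SGD for brevity.

```python
import torch

# Float32 master copy of the weights, as kept in the optimizer state.
master_w = torch.randn(1024, 1024, dtype=torch.float32)

def train_step(x, lr=1e-3):
    global master_w
    # Low-precision copy used for the forward/backward computation.
    w_bf16 = master_w.to(torch.bfloat16).requires_grad_()
    y = (x.to(torch.bfloat16) @ w_bf16).float()
    loss = (y ** 2).mean()                  # toy objective for illustration
    loss.backward()
    # Optimizer update is applied to the float32 master weights.
    with torch.no_grad():
        master_w -= lr * w_bf16.grad.float()
    return loss.item()

print(train_step(torch.randn(8, 1024)))
```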

2. Results

[Table 5]

We conduct an extensive evaluation of Chinchilla, comparing it with various LLMs. The tasks we evaluate on are listed in Table 5 above. Because this paper focuses on optimal model size, we include a large representative subset of tasks and also introduce new evaluations to enable better comparison with existing large models.

2.1 Language Modeling

[Figure 5]

As shown in Figure 5 above, Chinchilla significantly outperforms Gopher on every evaluation subset of The Pile. Compared with Jurassic-1 (178B), Chinchilla performs better on all subsets except dm_mathematics and ubuntu_irc. On WikiText-103, Chinchilla reaches a perplexity of 7.16 versus 7.75 for Gopher. Comparisons between Chinchilla and Gopher on these language-modeling benchmarks should be made with caution, because Chinchilla is trained on 4x more data than Gopher, so part of the improvement could come from train/test leakage. We therefore place more weight on tasks where leakage is less of a concern, such as MMLU, BIG-bench, and the various closed-book question-answering and common-sense benchmarks.

2.2 MMLU

[Table 6]

The MMLU (Massive Multitask Language Understanding) benchmark consists of exam-like questions spanning a range of academic subjects. In Table 6 above, we report Chinchilla's 5-shot average performance on MMLU. Despite being much smaller, Chinchilla significantly outperforms Gopher on this benchmark, with an average accuracy of 67.6%. Chinchilla even exceeds the expert forecast of 63.4% accuracy for June 2023. In addition, Chinchilla achieves over 90% accuracy on 4 individual tasks: high_school_gov_and_politics, international_law, sociology, and us_foreign_policy. To the best of our knowledge, no other model has achieved more than 90% accuracy on these subsets.

[Figure 6]

In Figure 6 above, we show a per-subtask comparison with Gopher. Overall, Chinchilla improves performance on the vast majority of tasks; it is worse than Gopher on only 4 tasks and unchanged on 2.

2.3 Reading comprehension

On LAMBADA, the final-word prediction dataset, Chinchilla achieves an accuracy of 77.4%, compared with 74.5% for Gopher and 76.6% for MT-NLG 530B. On RACE-h and RACE-m, Chinchilla significantly outperforms Gopher, improving accuracy by more than 10% in both cases.

2.4 BIG-bench

[Figure 7]

We analyze Chinchilla on BIG-bench tasks. Similar to what we observed on MMLU, Chinchilla outperforms Gopher on the vast majority of tasks (Figure 7 above). On average, Chinchilla improves performance by 10.7%, reaching 65.1% accuracy versus 54.4% for Gopher. Of the 62 tasks considered, Chinchilla performs worse than Gopher on only 4: crash_blossom, dark_humor_detection, mathematical_induction, and logical_args.

2.5 Common sense

We evaluate Chinchilla on various common-sense benchmarks: PIQA, SIQA, Winogrande, HellaSwag, and BoolQ. Chinchilla outperforms both Gopher and GPT-3 on all tasks, and outperforms MT-NLG 530B on all but one.

On TruthfulQA, Chinchilla achieves 43.6%, 58.5%, and 66.7% in the 0-shot, 5-shot, and 10-shot settings, respectively. In comparison, Gopher achieves only 29.5% 0-shot and 43.7% 10-shot. In contrast to the findings of Lin et al., Chinchilla's large improvements suggest that better modeling of the pre-training data alone can lead to substantial gains on this benchmark.

2.6 Closed-book question answering

[Table 9]

Results on the closed-book question-answering benchmarks are shown in Table 9 above. On the Natural Questions dataset, Chinchilla sets new closed-book SOTA accuracies: 31.5% 5-shot and 35.5% 64-shot, compared with 21% and 28% for Gopher. On TriviaQA, we report both the filtered and unfiltered sets; Chinchilla significantly outperforms Gopher in both cases. On the filtered version, Chinchilla trails the SOTA by only 7.9 percentage points, and on the unfiltered set it outperforms GPT-3.

Original post: https://blog.csdn.net/bqw18744018044/article/details/129652617