Chinchilla's Death

Preface

Main points

  1. At the beginning of training, smaller models train faster than larger models; after a while they slow down and are overtaken by the larger models; but once the training loss enters its near-linear decline stage, the smaller models drop loss more steeply and outperform the larger models once again! (In LLaMA 1 this shows up between the 7B and 13B models; in LLaMA 2 it holds throughout.)
  2. If the compute spent training a large model were spent on a small model instead, the small model might reach a lower perplexity.

Original link: (https://espadrine.github.io/blog/posts/chinchilla-s-death.html)

Main text

Smaller models have fewer multiplications, so they run faster, and therefore train faster. However, the theory goes, they eventually hit the limit of their knowledge capacity, at which point their learning slows down, while a larger model, with its greater capacity, overtakes them and achieves better performance within a given training time.
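To put "fewer multiplications" in rough numbers, here is a minimal sketch using the common 6·N·D approximation for training compute (a rule of thumb from the scaling-law literature, not a figure from this post); the 70B/2T values are purely illustrative:

```python
# Rule-of-thumb estimate (an assumption from the scaling-law literature, not from this post):
# training compute ≈ 6 * parameter_count * tokens_seen.
def train_flops(params: float, tokens: float) -> float:
    return 6.0 * params * tokens

# At a fixed compute budget, a smaller model gets to read proportionally more tokens.
budget = train_flops(70e9, 2e12)          # e.g. a 70B model reading 2T tokens
tokens_for_7b = budget / (6.0 * 7e9)      # tokens a 7B model could read on the same budget
print(f"{tokens_for_7b / 1e12:.0f}T tokens")  # -> 20T tokens
```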
[Figure: Chinchilla paper, Figure 2 — training loss curves for models of different sizes]
In the Chinchilla work, Figure 2 shows the training loss of a large number of training runs for models of different sizes. At first glance these curves follow the theory: the smaller models initially have lower loss (good), but eventually they slow down and are overtaken by the larger models' curves (bad). (In that chart, they drew a gray dot each time a smaller model starts losing to a larger one; the gray line through those points, the Pareto frontier, is how they computed the scaling laws.) The problem with this assumption is that we don't know what would happen if we let the smaller models train for longer, because they stop training them as soon as they are overtaken.
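To make that gray-dot construction concrete, here is a minimal sketch (my own illustration, not Chinchilla's code) that builds a Pareto frontier from hypothetical loss-versus-compute curves and fits a power law to it:

```python
import numpy as np

# Hypothetical loss-vs-compute curves (made-up power laws, purely for illustration).
compute = np.logspace(18, 22, 200)            # training FLOPs
loss_small = 20.0 * compute ** -0.040         # small model: lower loss early, improves slowly
loss_large = 60.0 * compute ** -0.065         # large model: higher loss early, improves faster

# Pareto frontier: at every compute budget, the best loss achieved by any model.
frontier = np.minimum(loss_small, loss_large)

# Scaling-law style fit: log(loss) ≈ log(a) + slope * log(C) on the frontier.
slope, log_a = np.polyfit(np.log(compute), np.log(frontier), 1)
print(f"frontier loss ≈ {np.exp(log_a):.3g} * C^{slope:.3f}")
```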


LLaMA 1


Earlier this year, Meta trained four models of different sizes. Unlike other releases, every one of them was trained for a very long time, even the smaller ones.
They published the training loss curves:

  1. Each curve first plunges steeply in a power-law fashion,
  2. then settles into a seemingly near-linear decrease of the loss (corresponding to a fairly constant rate of knowledge acquisition),
  3. and at the very end of the run, they all break away from that line, flattening slightly.
[Figure: LLaMA 1 training loss curves for the four model sizes]

First, I want to dispel a subtle misunderstanding one might have about the flattening at the end of the curves. All of these models were trained by gradient descent with a variable learning rate (roughly speaking, a hyperparameter that sets how far to move in the direction of the gradient). To train well, the learning rate has to keep decreasing so that the model can pick up ever subtler patterns in the source material. The formula they use for that decrease is the most widely used one: the cosine schedule.
[Figure: cosine learning-rate schedule]

As the plot shows, towards the end of the training run the cosine schedule stops lowering the learning rate at the pace that was producing that nice, near-linear training loss curve. The slowdown in learning is caused by this (that is to say, the loss stops dropping not because the model has finished learning, but because the learning rate has become too small). The model does not necessarily lack the capacity to keep learning at the same near-linear rate! In fact, if we had more text to feed it, we would lengthen the cosine schedule, and its learning rate would keep decreasing at the same pace.
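For concreteness, here is a minimal sketch of a cosine schedule with warmup (the peak rate, warmup length, and final fraction are illustrative assumptions, not Meta's actual hyperparameters). The key point is that the learning rate at a given step depends on how many total steps the schedule was planned for:

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float = 3e-4,
              warmup_steps: int = 2000, final_fraction: float = 0.1) -> float:
    """Cosine learning-rate schedule with linear warmup (illustrative hyperparameters)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return peak_lr * (final_fraction + (1.0 - final_fraction) * cosine)

# The same training step sees a very different learning rate depending on schedule length:
print(cosine_lr(step=100_000, total_steps=100_000))   # end of a 100k-step schedule: ~0.1x peak
print(cosine_lr(step=100_000, total_steps=200_000))   # midway through a stretched schedule: ~0.56x peak
```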

A model's fitness does not depend on the amount of data we happen to have available for its training, so changing the rate of learning-rate reduction based on that amount is not warranted.

But that’s not the point of this article.
Training loss curves can be misleading in another way too. Sure, all the models are trained on the same data, but they do not process that data at the same speed. What we want to know is not how sample-efficient a model is (on that front, the larger model obviously learns more from what it sees). Let's instead imagine a race: all of these models start at the same time, and we want to know which one crosses the finish line first. In other words, when a fixed amount of compute is invested in training, who learns the most in that time?

(On the plots below, each model traces its loss against compute spent: the further right along the axis, the more computing resources have been used.)
Thankfully, we can combine the loss curves with another piece of data Meta provided: the time spent training each model.
| Model | GPU-hours | Tokens/second |
| --- | --- | --- |
| LLaMA1-7B | 82432 | 3384.3 |
| LLaMA1-13B | 135168 | 2063.9 |
| LLaMA1-33B | 530432 | 730.5 |
| LLaMA1-65B | 1022362 | 379.0 |
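As a sanity check on this table, multiplying GPU-hours by throughput should recover the published LLaMA 1 dataset sizes (1T tokens for 7B/13B, 1.4T for 33B/65B); a quick sketch, assuming the Tokens/second column is per-GPU throughput:

```python
# LLaMA 1 training cost figures, copied from the table above.
# Assumption: the Tokens/second column is per-GPU throughput.
llama1_runs = {
    "LLaMA1-7B":  (82_432,    3384.3),
    "LLaMA1-13B": (135_168,   2063.9),
    "LLaMA1-33B": (530_432,   730.5),
    "LLaMA1-65B": (1_022_362, 379.0),
}

for name, (gpu_hours, tokens_per_sec_per_gpu) in llama1_runs.items():
    total_tokens = gpu_hours * 3600 * tokens_per_sec_per_gpu
    print(f"{name}: ~{total_tokens / 1e12:.2f}T tokens")
# -> roughly 1.0T for 7B/13B and 1.4T for 33B/65B, matching the published dataset sizes.
```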
[Figure: LLaMA 1 training loss versus GPU-hours]

(This figure shows the same data being used to train models of different sizes.)
First, note that the entire Chinchilla plot we saw earlier would cover only a small sliver on the left side of this one. Within that sliver we see the same behaviour Chinchilla documented. Take the 7B, for example (which would already rank among the largest model sizes in the Chinchilla plot): it initially reduces its loss faster than the bigger models, then slows down, and the 13B model overtakes it, reaching a loss of 1.9 first.
But then comes an unexpected twist: the 7B enters a near-linear regime, trending steeply downward, and looks like it might overtake the 13B again. It is hard to tell from this plot what would happen if the 7B were trained for longer.
However, the same behaviour appears between the 13B and the 33B: the initial Chinchilla-style slowdown again gives way to a near-linear regime in which the 13B drops loss fast! It is only beaten unfairly by the 33B, which gets to use more than double the compute.
The same deceleration and then acceleration happens between the 33B and the 65B, so much so that the 33B is never actually overtaken by the 65B. What this plot shows breaks the assumptions of OpenAI and Chinchilla: the bigger models have not won yet. The slowdown they detected was not actually caused by hitting some capacity limit!
However, the 7B line is still a little unsatisfying. If only Meta had trained it for longer...

LLaMA 2

End of the suspense: they did! They released LLaMA 2 this week!
Time to check our suspicions:
[Figure: LLaMA 2 training loss curves for the four model sizes]

We also get the training times again:
| Model | GPU-hours | Tokens/second |
| --- | --- | --- |
| LLaMA2-7B | 184320 | 3031.9 |
| LLaMA2-13B | 368640 | 1515.9 |
| LLaMA2-34B | 1038336 | 533.7 |
| LLaMA2-70B | 1720320 | 322.1 |
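Running the same check on these numbers (again assuming Tokens/second is per GPU), every run works out to roughly 2T tokens, consistent with LLaMA 2's 2-trillion-token training set:

```python
# LLaMA 2 training cost figures, copied from the table above (Tokens/second assumed per GPU).
llama2_runs = {
    "LLaMA2-7B":  (184_320,   3031.9),
    "LLaMA2-13B": (368_640,   1515.9),
    "LLaMA2-34B": (1_038_336, 533.7),
    "LLaMA2-70B": (1_720_320, 322.1),
}
for name, (gpu_hours, tokens_per_sec_per_gpu) in llama2_runs.items():
    print(f"{name}: ~{gpu_hours * 3600 * tokens_per_sec_per_gpu / 1e12:.2f}T tokens")
# -> all four sizes come out to roughly 2T tokens.
```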
[Figure: LLaMA 2 training loss versus GPU-hours]

(This figure shows the same data being used to train models of different sizes; the slope of the near-linear stage is much shallower than in LLaMA 1.)
At a glance we notice that these training curves do not match those of LLaMA 1, even though the model sizes are the same. It turns out that LLaMA 2 was trained with double the context size and a longer cosine schedule, which unfortunately had a negative impact on all model sizes; the smaller models, however, were hit harder than the larger ones. As a result, the 33B model, which in LLaMA 1 outperformed the 65B at any amount of training, is now (as the 34B) only slightly ahead of the 70B before being overtaken by it.
More importantly, comparing training speeds strongly confirms what we suspected from LLaMA 1:

1. First, smaller models train faster than larger models;
2. then they slow down and are overtaken by the larger models (as Chinchilla predicts);
3. but then they enter a near-linear regime, in which the smaller models drop loss more steeply and outperform the larger models once again! (In LLaMA 1 this shows up between the 7B and 13B models; in LLaMA 2 it holds throughout.)

A fascinating consequence has to do with making the right call when starting a training run: contrary to popular belief, bigger models yield worse results. If you had to pick a parameter count and a dataset, you would be better off choosing a 7B model and training it for 7 epochs on trillions of tokens.
Look at the near-linear regime of the 7B model and extrapolate its line out to where the 70B model stops: if the 70B's compute had been spent on a 7B instead, it might have reached a lower perplexity!
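A minimal sketch of that extrapolation, assuming the tail of the loss curve is roughly linear in log-compute; the loss values below are placeholders, not numbers read off Meta's plots:

```python
import numpy as np

# Placeholder tail of a 7B training-loss curve (illustrative values, NOT Meta's data).
gpu_hours_7b = np.array([60_000, 90_000, 130_000, 184_320])   # compute spent so far
loss_7b      = np.array([1.95,   1.89,   1.84,    1.79])      # training loss at that point

# Fit the near-linear tail against log-compute and evaluate the fit at the 70B model's
# total budget from the table above (1,720,320 GPU-hours).
slope, intercept = np.polyfit(np.log(gpu_hours_7b), loss_7b, 1)
budget_70b = 1_720_320
print(f"extrapolated 7B loss at the 70B budget: {slope * np.log(budget_70b) + intercept:.2f}")
```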
Another thing we learn from LLaMA 2 is that the learning slowdown at the end of the LLaMA 1 curves really is an artifact of the cosine schedule. (Why? Because the same-sized model in LLaMA 2 reaches a lower loss when trained on more data, while logically the LLaMA 1 model, which sees less data, should fit that data more tightly, closer to overfitting, and end up with the lower loss of the two; the opposite is what actually happens. In other words, the LLaMA 1 model had not finished learning; other factors flattened the end of its loss curve.) In the LLaMA 2 training run, that slowdown is completely absent at the corresponding 1-trillion-token mark.
In fact, at that same point, the LLaMA 2 7B model is perhaps of worse quality than the LLaMA 1 7B model, probably because its cosine schedule was stretched!
Let us go back to the Chinchilla paper for evidence on this point. In Appendix A, Figure A1, they show an ablation over various cosine schedule parameters (in other words, various ways of stretching the learning-rate curve).
[Figure: Chinchilla Appendix A, Figure A1 — cosine schedule ablation]

They note that the lowest loss is achieved when the schedule is not stretched. The plot agrees, but we also notice something off. After reading 6 million tokens, the model in the upper plot has a training loss below 2.8; at the same mark, the model in the lower-right plot has a training loss above 2.8. (Translator's note: just as with LLaMA 1 versus LLaMA 2, the amounts of training data differ; the upper model's 6 million tokens may correspond to seeing 3 million tokens twice, while the lower one saw 6 million distinct tokens once, which would let the former fit its data better early on.) Yet the only difference between the models is the cosine schedule! Because the lower model is slated to go through more training data, its "unstretched" cosine schedule is computed over more steps, which effectively stretches it. A learning rate that followed a schedule allocated to fewer training steps would give a better loss for the same amount of training time.
More broadly, this leaves me with an unanswered question: if the cosine schedule is suboptimal, what should the shape of its tail be?

Summary

  1. At the beginning of training, smaller models train faster than larger models; after a while they slow down and are overtaken by the larger models; but once the training loss enters its near-linear decline stage, the smaller models drop loss more steeply and outperform the larger models once again! (In LLaMA 1 this shows up between the 7B and 13B models; in LLaMA 2 it holds throughout.)
  2. If the compute spent training a large model were spent on a small model instead, the small model might reach a lower perplexity.

Source: blog.csdn.net/Brilliant_liu/article/details/134097250