Why is it difficult to train large models?

Link: https://www.zhihu.com/question/498271491

Editor: Deep Learning and Computer Vision

Disclaimer: Shared for academic purposes only; content will be removed at the copyright holder's request.

Ever since BERT arrived at a scale of more than 300 million parameters, people merely thought its performance was good. Unexpectedly, once the GPT series came out, GPT-3 simply crushed competing models with its 175B parameters.

Then came a new round of competition, and the numbers got a bit scary: Google launched the trillion-parameter sparse Switch Transformer, Huawei launched the 200-billion-parameter dense Pengcheng PanGu model, Microsoft launched Turing-NLG with 17 billion parameters, and Nvidia launched the Megatron-LM series.

Everyone says large models are hard. Setting aside the hassle of cluster scheduling, what exactly makes training a large model difficult?

Author: Jin Yingruoyu
https://www.zhihu.com/question/498271491/answer/3051092055

The main difficulty is figuring out where the problem is when something goes wrong (and something will definitely go wrong). Some practical examples:

Thousands or even tens of thousands of GPUs train together, and a single training run costs tens of millions of yuan. You have 100 experiments you want to try; how do you judge which are most likely to succeed?

Which data is worth training on? Which is unimportant and can be thrown away? Which data will actually make the results worse if added?

What do you do when you have only English data and no Chinese data?

Where should this huge volume of data live, and how should it be accessed, so that no machine's storage fills up and the data can still be retrieved quickly?
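One common answer (my illustration, not from the original answer) is to pre-tokenize the corpus into a flat binary file and memory-map it, so that no worker ever loads the whole dataset into RAM. The file name `tokens.bin` and the `uint16` dtype below are assumptions for the sketch:

```python
import numpy as np
import torch

# Minimal sketch: a dataset backed by a memory-mapped file of token ids.
# "tokens.bin" is a hypothetical pre-tokenized corpus stored as uint16.
class MemmapTokenDataset(torch.utils.data.Dataset):
    def __init__(self, path, seq_len):
        self.tokens = np.memmap(path, dtype=np.uint16, mode="r")
        self.seq_len = seq_len

    def __len__(self):
        return (len(self.tokens) - 1) // self.seq_len

    def __getitem__(self, i):
        start = i * self.seq_len
        chunk = self.tokens[start : start + self.seq_len + 1].astype(np.int64)
        x = torch.from_numpy(chunk[:-1])  # input tokens
        y = torch.from_numpy(chunk[1:])   # next-token targets
        return x, y

# ds = MemmapTokenDataset("tokens.bin", seq_len=2048)
```

The OS page cache then handles "fast retrieval" for free, and each node only ever touches the pages it actually reads.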

The program crashes at random. How do you debug it? Is it a data problem, a hardware problem, or a code bug?

If it is a hardware problem, one of thousands of GPUs will randomly produce strange errors. How do you find which one without spending tens of millions on reruns?
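One way teams localize a flaky GPU (a hedged sketch of mine, not the answerer's method) is to have every rank run the same deterministic workload and compare the results. This assumes `torch.distributed` is already initialized with one GPU per rank on identical hardware:

```python
import torch
import torch.distributed as dist

# Sketch: every rank computes an identical matmul checksum; a rank whose
# result deviates is a suspect. Assumes identical GPUs across all ranks.
def gpu_health_check(device):
    torch.manual_seed(0)                              # identical inputs on every rank
    a = torch.randn(1024, 1024, device=device)
    checksum = (a @ a).double().sum().reshape(1)      # deterministic checksum
    sums = [torch.zeros_like(checksum) for _ in range(dist.get_world_size())]
    dist.all_gather(sums, checksum)
    bad = [r for r, s in enumerate(sums) if not torch.allclose(s, sums[0])]
    if dist.get_rank() == 0 and bad:
        print(f"suspect ranks: {bad}")
```

Comparing against rank 0 is the crudest choice; a majority vote over the gathered checksums would be more robust if rank 0 itself is the faulty one.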

It takes a month of training to know whether the result is good, and by then tens of millions have already been spent. How can you estimate the outcome earlier and stop the unpromising experiments?
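One common aid (my illustration, not from the answer) is to fit a power law of the form loss ≈ a·step^(−b) + c to the early loss curve and extrapolate it forward; runs whose projected final loss looks hopeless get killed early. The data below is synthetic:

```python
import numpy as np
from scipy.optimize import curve_fit

# Sketch: extrapolate a training loss curve with a fitted power law.
def power_law(step, a, b, c):
    return a * step ** (-b) + c

# Made-up early-training losses for illustration only.
steps = np.arange(100, 2000, 100)
losses = 5.0 * steps ** (-0.3) + 1.8 + np.random.normal(0, 0.01, len(steps))

(a, b, c), _ = curve_fit(power_law, steps, losses, p0=(5.0, 0.3, 2.0))
print(f"projected loss at step 100000: {power_law(100_000, a, b, c):.3f}")
```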

Users say the model talks nonsense. If you don't fix this, your model gets taken off the shelf. How do you fix it? Tune the hyperparameters? Change the training data? Change the model architecture?

In short, there are too many challenges. To sum up: training a large model is a search through a huge solution space where every attempt carries enormous time and money costs, and the game is finding the best solution at the minimum cost.

Author: Master Bao
https://www.zhihu.com/question/498271491/answer/3055245869

Because large-model training has three major difficulties: 1. it consumes huge computing resources; 2. it places extremely high demands on the quantity and quality of data; 3. its quality is hard to evaluate with technical metrics.

These three points mean your experiments will be slow, unreliable, laborious, and uncertain in outcome.

This creates a huge challenge for anyone trying to accumulate experience and adjust the direction of their experiments.

Some people say large models are easy, "just give me the cards." You only learn how hard it is when the boss really buys you 1,000 cards and says: "Xiao Wang, the cards are bought; deliver it in three months."

Then you find that even if someone hands you the correct code, data, and parameters, you can barely manage to run one complete training pass.

Not to mention that you have to write the model code yourself, debug it, find ways to evaluate the model, and adjust the experimental direction based on the feedback. The time and resources are nowhere near enough!

The most valuable algorithm people in the era of large models are those who have burned money and time on hands-on experience: people who can distill a training methodology, veterans who have actually run hundreds of experiments. Ordinary people have neither the conditions nor the resources. Compared with the cost of hardware and time, an annual salary of 3 million yuan for such people is really cheap.

Most people are still armchair theorists, passing along secondhand impressions.

This is not a game ordinary players get to join.

Author: Zhihu user
https://www.zhihu.com/question/498271491/answer/2961844274

First, here is Susan Zhang from Meta AI sharing the experience and lessons from training OPT-175B, Meta's open replication of GPT-3.

A team of five engineers trained an LLM with 175B parameters on 1024 A100s (80 GB of memory each), taking about three months in total.

By their training-efficiency estimate, a run over the 300B-token dataset would take 33 days if there were no errors or restarts.
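That figure is easy to sanity-check with the common approximation that training compute ≈ 6 × parameters × tokens FLOPs. The sustained per-GPU throughput below is my assumption, not OPT's reported number:

```python
# Back-of-the-envelope check of the 33-day estimate.
params = 175e9          # model parameters
tokens = 300e9          # training tokens
gpus = 992              # A100s used in the final run
flops_per_gpu = 110e12  # assumed sustained FLOP/s (~35% of A100 fp16 peak)

total_flops = 6 * params * tokens
days = total_flops / (gpus * flops_per_gpu) / 86400
print(f"total compute: {total_flops:.2e} FLOPs, roughly {days:.0f} days")
```

This lands at roughly 3.15e23 FLOPs and about 33 days, consistent with the estimate above.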

Round one: three initial training attempts (these runs did not necessarily go through all the data; training was just started and stopped). First assume the model and training hyperparameters based on experience, then adjust them to match what is observed: increasing weight decay from 0.01 to 0.1, setting global gradient-norm clipping to 1.0, tuning Adam's parameters, and so on, all based on watching the per-batch loss during training. In the end none of it mattered, because they discovered their code had bugs (painfully, the first three attempts were wasted). The lesson: test the code on small-scale data and model sizes first.
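For concreteness, here is roughly what those knobs look like in PyTorch. Only the weight decay of 0.1 and the clip norm of 1.0 come from the text above; the tiny model, learning rate, and Adam betas are placeholders:

```python
import torch

# Sketch of the hyperparameter knobs mentioned above, on a placeholder model.
model = torch.nn.Linear(1024, 1024)  # stand-in for the real transformer

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1.2e-4,            # placeholder value
    betas=(0.9, 0.95),    # Adam parameters were among the knobs tuned
    weight_decay=0.1,     # raised from 0.01 to 0.1
)

loss = model(torch.randn(8, 1024)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # global grad-norm clip
optimizer.step()
```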

Round two: hyperparameter adjustment, repeatedly confirming which settings work better based on observation (the real test of observational skill and experience).

Round three: the final hyperparameters are fixed (many of them, in truth, still educated guesses) and formal training begins (a month has now passed). During training they kept watching the loss curve (which had many spikes) and kept adjusting parameters. Notably, from Run 11.6 onward they repeatedly re-ran the same batches to observe the effect of different hyperparameters on the result, and in Run 11.10 they even swapped the activation function from GeLU to ReLU.
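A hedged sketch of that spike-handling pattern: watch the loss, and on a spike roll back to the last checkpoint and change a knob (here, the learning rate) before continuing. The toy model and thresholds are illustrative, not OPT's actual logic:

```python
import copy
import torch

# Sketch: detect a loss spike, roll back to the last checkpoint, lower the LR.
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
checkpoint = copy.deepcopy({"model": model.state_dict(), "opt": optimizer.state_dict()})
recent = []

for step in range(1000):
    loss = model(torch.randn(32, 16)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    recent = (recent + [loss.item()])[-100:]           # rolling loss window
    if step > 100 and loss.item() > 2.0 * (sum(recent) / len(recent)):
        model.load_state_dict(checkpoint["model"])     # roll back
        optimizer.load_state_dict(checkpoint["opt"])
        for g in optimizer.param_groups:
            g["lr"] *= 0.7                             # e.g. lower the LR
    elif step % 100 == 0:
        checkpoint = copy.deepcopy({"model": model.state_dict(),
                                    "opt": optimizer.state_dict()})
```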

Round four (the "last" round): 33 days, 175B parameters, 300B tokens, 992 A100s with 80 GB of memory each. They encountered, among other things: hardware problems such as GPUs dropping out, CUDA errors, hung jobs, and NCCL errors; code bugs (checkpoint-saving problems, loss-function problems, etc.); and renewed training instability.
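None of those failures can be prevented entirely, so long runs are built to survive them: checkpoint regularly, and after any crash restart the job and resume from the newest checkpoint. A minimal sketch of that pattern, with an illustrative path scheme and a toy model (not OPT's code):

```python
import glob
import torch

# Sketch: resume training from the newest checkpoint after a crash/restart.
def latest_checkpoint(pattern="ckpt_step*.pt"):
    paths = glob.glob(pattern)
    return max(paths, key=lambda p: int(p.split("step")[1].split(".")[0])) if paths else None

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters())
start_step = 0

path = latest_checkpoint()
if path is not None:                     # the job was restarted after a failure
    state = torch.load(path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["opt"])
    start_step = state["step"] + 1

for step in range(start_step, 10_000):
    loss = model(torch.randn(32, 16)).pow(2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    if step % 500 == 0:                  # checkpoint often enough to bound lost work
        torch.save({"model": model.state_dict(), "opt": optimizer.state_dict(),
                    "step": step}, f"ckpt_step{step}.pt")
```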

So even with rich experience, a sufficient dataset, and enormous hardware resources, training a large model remains hard.

Author: Sheng Dong
https://www.zhihu.com/question/498271491/answer/2232480465

Because this field has only become popular in recent years, while frameworks such as PyTorch and TensorFlow appeared long ago, before anyone had done deep abstraction and optimization for the scenario of distributed large-model training. The field therefore urgently needs a programming paradigm and corresponding computing framework, the way MapReduce solved building inverted indexes for search engines and stream computing brought real-time updates to e-commerce and social platforms, to solve the problems raised by @西门宇少 and @ZOMI酱.

I moved from cloud platforms to SysML half a year ago because I like this field: it touches many things and is hard enough to push my technical level.

I made a GitHub repo, Hack SysML, to record my study notes in this field. To do development here you must understand both the systems side and the algorithms side: in practice you need to be familiar with PyTorch, C++, and CUDA, understand computer architecture and networking, and verify numerical accuracy by constructing carefully designed datasets. It is genuinely difficult.

Speaking of which, I set out to learn about floating-point numbers, and this week a related bug trapped me for three days.
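To give a flavor of the kind of trap this involves (my example, not the author's bug): float16 has a 10-bit mantissa, so integers above 2048 are no longer exactly representable and small updates can silently vanish:

```python
import numpy as np

# Two classic float16 traps: absorption of small addends and underflow.
x = np.float16(2048)
print(x + np.float16(1) == x)   # True: 2049 is not representable, rounds back to 2048
print(np.float16(1e-8))         # 0.0: underflows, the fate of many tiny gradients
```

Bugs like these are why mixed-precision training needs loss scaling and fp32 accumulators, and why they can take days to track down.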
