Get GPT-3 hyperparameters right with a single GPU! Train a small model first, then "migrate them in one click" | Open source



Fengse, reporting from Aofeisi | Qubit, public account QbitAI

"One GPU can't train GPT-3, let alone tune hyperparameters on it."

No, no, things have changed now—

Hyperparameters for large-scale models can be tuned entirely on a single GPU.

How so?

It turns out that researchers have discovered a new tuning method under which the optimal hyperparameters remain optimal regardless of model size.

This means you can first train a small version of the model, tune the hyperparameters on it as a proxy, and then copy them directly to the full-scale model in a zero-shot way while still getting very good performance.

Isn't that great news for people who don't have enough GPU resources on hand?

The related post is currently sparking a heated discussion on Reddit, with 300+ upvotes.


Tuning Large GPT-3 Models on a Single GPU

The method builds on µP (Maximal Update Parametrization), and its authors come from Microsoft and OpenAI.

The idea is simple and relies on a special parametrization, called µP, that they discovered in previous work:

Under µP, narrow and wide neural networks share the same set of optimal hyperparameters, even as the width goes to infinity (width → ∞).

The underlying theory can be found in the paper "Feature Learning in Infinite-Width Neural Networks".

The transferable hyperparameters include the learning rate, the learning rate schedule, initialization, and parameter multipliers; they can even be set individually for each parameter tensor.

The authors verified this conclusion on Transformers with widths up to 4096, as well as on ResNets.

Therefore, resource-strapped alchemists (read: ML practitioners) can tune hyperparameters for GPT-3 on a small version of the model, on a single GPU:

as long as the hyperparameters found on this small model are close to optimal, the same results carry over to the large model.

P.S. This hyperparameter transfer method is named "µTransfer".
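To make the recipe concrete, here is a minimal sketch of the proxy-tuning workflow. Everything in it (the model builder, the widths, and the learning-rate grid) is hypothetical, and the transfer step is only valid when the model family is parametrized with µP:

```python
# Hypothetical sketch: sweep hyperparameters on a small proxy, then reuse the best
# ones on the full-size model with no further tuning (zero-shot transfer).
# `build_model` and `train_and_eval` stand in for your own training code.
def tune_then_transfer(build_model, train_and_eval,
                       small_width=256, large_width=8192,
                       lr_grid=(1e-4, 3e-4, 1e-3, 3e-3)):
    # 1. Hyperparameter sweep on a proxy model small enough for a single GPU.
    scores = {lr: train_and_eval(build_model(width=small_width), lr=lr)
              for lr in lr_grid}
    best_lr = max(scores, key=scores.get)

    # 2. Zero-shot transfer: train the target model once with the best setting.
    #    This only works if both models are parametrized with µP.
    return train_and_eval(build_model(width=large_width), lr=best_lr)
```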


So how well does it work in practice?

The authors trained a small GPT-3 with only 40 million parameters, small enough to run directly on a single GPU.

They then "migrated" its hyperparameters to a large GPT-3 with 6.7 billion parameters and found that its performance was on par with the original GPT-3, even though that original model had twice as many parameters!

And the tuning cost was only 7% of the total pre-training cost.

Because the cost of tuning the small proxy stays roughly constant as the target model grows, using this method to tune the 175-billion-parameter GPT-3 would cost at most about 0.3% of the total pre-training cost.


At this point you may ask: does this only apply to shrinking the model's width?

The authors acknowledge that there is no theoretical guarantee for anything other than width.

The good news is that, within reasonable ranges, they empirically tested transfer across depth, batch size, sequence length, and number of training steps on pre-LN Transformers.


In one experiment, they shrank BERT-base and BERT-large to the same small scale in both width and depth, tuned the hyperparameters once on that proxy, and found that:

compared with the already well-tuned Megatron BERT baseline, both models improved, with BERT-large benefiting the most.


This also points to a pattern:

the larger the model you transfer to, the greater the benefit.

So the authors also joked that, although they did not test the 175-billion-parameter GPT-3, they guarantee the results would make you "drool".


Having said all that, how do you actually use it?

The following table outlines how to adjust your model's initialization and learning rate according to fan-in or fan-out.

In the table, the pink entries are µP and the gray entries in brackets are the PyTorch defaults.

[Table: µP initialization and learning-rate rules, with PyTorch defaults in gray; see the paper or the official blog post for the full table]

Of course, if you don't want to do this by hand, the authors have also open-sourced a PyTorch implementation: a simple pip install mup lets you apply it to your own model.
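As a rough guide, here is a minimal usage sketch on a toy MLP, loosely following the examples in the microsoft/mup README; the widths, learning rate, and dummy data are illustrative, so check the README for the exact, up-to-date API:

```python
# Minimal sketch of using the mup package (pip install mup) on a toy MLP.
# Class/function names follow the microsoft/mup README; all widths and
# hyperparameters here are arbitrary examples.
import torch
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam

class MLP(nn.Module):
    def __init__(self, width=128, d_in=32, d_out=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, width)
        self.fc2 = nn.Linear(width, width)
        # The output layer is replaced with MuReadout so mup can rescale it.
        self.readout = MuReadout(width, d_out)

    def forward(self, x):
        return self.readout(torch.relu(self.fc2(torch.relu(self.fc1(x)))))

# Base and delta models only supply shape information: the dimensions that
# differ between them are treated as "width" and scaled under µP.
base_model = MLP(width=8)
delta_model = MLP(width=16)

# The model you actually want to train, at the target width.
model = MLP(width=1024)
set_base_shapes(model, base_model, delta=delta_model)

# Use mup's optimizer wrappers so per-parameter learning rates follow µP scaling.
optimizer = MuAdam(model.parameters(), lr=3e-4)

# One standard training step on dummy data, just to show the pieces fit together.
x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```

With this setup, hyperparameters found on a narrow version of the same model should transfer to the wide one, which is exactly the µTransfer recipe described above.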


About the authors

The first author is Greg Yang, a senior researcher at Microsoft.

The corresponding author is Jianfeng Gao, a partner research manager at Microsoft Research's Deep Learning Technology Center and an IEEE Fellow.

The author list also includes two other Chinese researchers: Xiaodong Liu of Microsoft (an alumnus of Beijing University of Posts and Telecommunications) and Weizhu Chen (who has worked at Microsoft for 16 years).

Their work has been accepted at NeurIPS 2021.


GitHub link:
https://github.com/microsoft/mup

Paper address:
https://arxiv.org/abs/2203.03466

Official blog link:
https://www.microsoft.com/en-us/research/blog/%C2%B5transfer-a-technique-for-hyperparameter-tuning-of-enormous-neural-networks/

Reddit discussion:
https://www.reddit.com/r/MachineLearning/comments/tb0jm6/r_you_cant_train_gpt3_on_a_single_gpu_but_you_can/

-End-

This article is original content from the account [Qubit (QbitAI)], published under the NetEase News • NetEase Account Featured Content Incentive Program. Unauthorized reproduction is prohibited.

