Large-model generation sped up 2x! Fine-tuning takes only a few hours on a single GPU, and Peking University School of Mathematical Sciences alumni co-author the open-source work

Xiao Xiao, reporting from Aofeisi
Qubit | Official account QbitAI

Just "add some small parts" to the large model, and the inference speed will immediately increase by 2 times!


There is no need to train an extra model or optimize the computing hardware, and fine-tuning takes only a few hours on a single A100.

This new study is called Medusa and comes from Princeton, UIUC, CMU and the University of Connecticut; FlashAttention author Tri Dao is among the authors.


At present it has been successfully deployed on Vicuna, Berkeley's 7-billion-parameter LLaMA-based model. Support for other large models is planned, and the project has already made it onto GitHub's trending list.


In fact, the industry was not without large-model inference acceleration methods before this one came along. The mainstream approach was speculative sampling (speculative decoding), introduced by DeepMind.

Compared with this method, what is different about Medusa?

Two "bugs" in speculative sampling

To speed up large model inference, you need to first know what "limits" its speed.

Compared with the amount of computation, the inference speed of large models is more easily limited by memory bandwidth (it is memory-bound).

This is because a large model's parameters are huge, far exceeding the cache capacity, so during inference the weights have to be read from external memory (GPU memory) into the cache at every step. This transfer is bounded by memory bandwidth and is usually very slow.


As a result, when the model does batched inference, processing 100 tokens at a time takes roughly as long as processing a single one.
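
To make the memory-bound point concrete, here is a back-of-envelope estimate in Python (the numbers below are illustrative assumptions, not figures from the paper): the floor on per-step latency is simply the weight bytes divided by the memory bandwidth, no matter how many tokens the step scores.

```python
# Rough, illustrative estimate of why decoding is memory-bound: every decoding
# step has to stream all the weights from GPU memory, so bandwidth sets a
# lower bound on per-step latency. Numbers are assumptions, not measurements.
params = 7e9                # a 7B-parameter model
bytes_per_param = 2         # fp16 weights
hbm_bandwidth = 2.0e12      # ~2 TB/s, roughly an A100 80GB

weight_bytes = params * bytes_per_param
min_step_latency_s = weight_bytes / hbm_bandwidth
print(f"Per-step latency floor: {min_step_latency_s * 1e3:.1f} ms")
# ~7 ms per step, whether the step emits 1 token or scores a whole batch of
# candidate tokens -- the weight traffic is the same either way.
```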

Exploiting this property, DeepMind came up with a clever trick called speculative sampling in November last year:

Train a smaller model (a "draft model") to generate a batch of "candidate words" for the large model in advance; instead of "thinking up" every word itself, the large model then only has to make "selections".


Since the small model generates several times faster than the large one, whenever the large model finds the small model's candidates "usable", it can take them directly instead of slowly generating them itself.

The process is a bit like an input method's word suggestions: before we (the large model) settle on the next word, the input method (the small model) lists a few options first.

If one of them looks right, just pick it and move on; if none of them are any good, skip them and type the word yourself.
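
For readers who prefer code, here is a minimal greedy sketch of the idea (not DeepMind's exact algorithm, which uses rejection sampling to preserve the target distribution; the `model(ids) -> [batch, seq, vocab]` logits interface and batch size 1 are assumptions): the draft model proposes a few tokens, and the large model checks them all in one forward pass.

```python
import torch

# Greedy sketch of speculative decoding. Assumed interface: calling a model on
# token ids of shape [1, seq] returns next-token logits of shape [1, seq, vocab].
@torch.no_grad()
def speculative_step(draft_model, target_model, prefix, k=4):
    # 1) Draft k tokens autoregressively with the cheap model.
    seq = prefix
    for _ in range(k):
        nxt = draft_model(seq)[:, -1].argmax(-1, keepdim=True)
        seq = torch.cat([seq, nxt], dim=-1)
    guesses = seq[:, prefix.shape[1]:]                     # [1, k] drafted tokens

    # 2) Score prefix + drafts with the large model in a single forward pass.
    logits = target_model(seq)
    preds = logits[:, prefix.shape[1] - 1:-1].argmax(-1)   # large model's own picks

    # 3) Keep the longest prefix of drafts the large model agrees with,
    #    plus one token of the large model's own, so every step makes progress.
    match = (preds == guesses).int().cumprod(-1)
    n_accept = int(match.sum())
    bonus = logits[:, prefix.shape[1] - 1 + n_accept].argmax(-1, keepdim=True)
    return torch.cat([prefix, guesses[:, :n_accept], bonus], dim=-1)
```

If the draft model guesses well, one call to the large model yields several tokens; if it guesses badly, the step degrades gracefully to ordinary one-token decoding.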


This speculative sampling method has indeed delivered remarkable results, even running the 34-billion-parameter LLaMA model smoothly at high precision on an M2 Ultra.


BUT, there are two problems with this approach.

On the one hand, it is not that easy to find a small draft model that can generate "candidate words" for a large model.

Not just any generative model will do: besides requirements such as a matching interface and a closely matched probability distribution, its generation quality cannot be much worse than the large model's.

That may be fine for a family like Meta's LLaMA, which comes in both tens-of-billions-parameter versions and few-billion-parameter versions, so the smaller one can serve as the draft model.

But for other open-source large models the approach is less applicable: building and training a small model yourself not only costs extra time, its output may also fall short of expectations.

On the other hand, the two-model combination makes subsequent system tuning more complicated.

This is because, on top of the large model, which is already a system of its own, the newly added draft model effectively introduces a second system.

This makes deployment more complex: extra network transfers and differing hardware conditions have to be taken into account, and compute optimization becomes even harder.

To solve these problems, Medusa appeared.

No need for small models, just add a few "heads"

Medusa (named after the multi-headed monster) is a new method for accelerating large-model inference.

Unlike speculative sampling, it simply adds several extra decoding heads to the Transformer large model, each of which is a single-layer feed-forward network.


These extra decoding heads let the large model generate several words at once, instead of squeezing them out one by one like toothpaste.

The accuracy is decent, too: when predicting the "next word after the next word", Medusa reaches 60%, and it is still being improved.
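
As a rough illustration of what such a head could look like, here is a PyTorch sketch (the layer sizes, residual structure and activation are assumptions, not the official implementation): each head is a single feed-forward layer on top of the backbone's last hidden state, with its own vocabulary projection.

```python
import torch
import torch.nn as nn

class MedusaHead(nn.Module):
    """One extra decoding head: a single feed-forward layer with a residual
    connection, followed by a vocabulary projection (sketch, details assumed)."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.ffn = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        h = hidden_states + self.act(self.ffn(hidden_states))
        return self.lm_head(h)

# K extra heads: head k predicts the token k+1 steps ahead of the current
# position (the base LM head still predicts the immediate next token).
K, hidden, vocab = 4, 4096, 32000
heads = nn.ModuleList([MedusaHead(hidden, vocab) for _ in range(K)])
last_hidden = torch.randn(1, hidden)       # last hidden state from the backbone
logits_per_offset = [head(last_hidden) for head in heads]   # K x [1, vocab]
```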

These candidate words are then verified in parallel using a tree-based attention mechanism, which is where the inference speedup comes from.
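
A simplified picture of how the candidates could be assembled before that verification (the construction below is an illustrative assumption, not the official code): take the top-k tokens from each head and expand them into a small tree of candidate continuations, which the backbone then scores in one forward pass under a tree-shaped attention mask.

```python
import itertools
import torch

# Dummy per-head logits: K heads, each scoring the vocabulary for one offset.
K, vocab = 4, 32000
logits_per_offset = [torch.randn(1, vocab) for _ in range(K)]

def build_candidates(logits_per_offset, top_k=2):
    """Take the top-k tokens from each head and expand them into candidate
    continuations (a Cartesian product; real tree construction is sparser)."""
    top_tokens = [l.topk(top_k, dim=-1).indices.squeeze(0).tolist()
                  for l in logits_per_offset]
    return list(itertools.product(*top_tokens))

candidates = build_candidates(logits_per_offset)
# Medusa packs these candidates into one sequence and verifies them with a
# single forward pass under a tree-shaped attention mask, keeping only the
# longest continuation the backbone actually agrees with.
print(len(candidates), "candidate continuations to verify")
```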


With Medusa, inference for the 7-billion-, 13-billion- and 33-billion-parameter Vicuna models speeds up by more than 1.9x.


For the 7-billion-parameter model, the researchers also measured the speedup on different tasks; code generation showed the largest gain, at 2.15x.


Most importantly, using Medusa does not require retraining the entire large model.

Instead, the extra heads are trained alongside the large model with the large model's parameters frozen, and even a single GPU is enough for the job.
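
A minimal sketch of what such a training step could look like (the names, shapes and the `last_hidden_state` interface are assumptions, not the official training code): the backbone runs under `torch.no_grad()`, so only the heads, which are the only parameters handed to the optimizer, receive gradients.

```python
import torch
import torch.nn.functional as F

def medusa_train_step(base_model, medusa_heads, optimizer, input_ids):
    """One training step with a frozen backbone (sketch; input_ids is a
    [B, T] batch of token ids, optimizer holds only the heads' parameters)."""
    with torch.no_grad():                                   # frozen backbone
        hidden = base_model(input_ids).last_hidden_state    # [B, T, H]

    T = input_ids.shape[1]
    total_loss = 0.0
    for k, head in enumerate(medusa_heads, start=1):
        # Head k is trained to predict the token k+1 steps ahead of each position.
        logits = head(hidden[:, : T - 1 - k])               # [B, T-1-k, V]
        targets = input_ids[:, 1 + k :]                     # [B, T-1-k]
        total_loss = total_loss + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
        )

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return float(total_loss)
```

Because the backbone's activations are computed without gradients and its weights never change, the memory and compute cost is a small fraction of full fine-tuning, which is what makes a single GPU sufficient.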

Since no additional model is introduced, the approach is also friendly to distributed inference.

About the authors

The study has two co-first authors.

Co-first author Cai Tianle is a doctoral student at Princeton University whose research interests include optimization, representation learning, and architecture design. He graduated from the School of Mathematical Sciences at Peking University with a double major in applied mathematics and computer science.


The other co-first author, Yuhong (Jesse) Li, is a doctoral student at the University of Illinois Urbana-Champaign (UIUC) working on efficient machine learning. He holds a bachelor's degree from Beijing University of Posts and Telecommunications.


In addition, Tri Dao, the author of FlashAttention and a Stanford Ph.D., also took part in the research.

FlashAttention is a method that can speed up attention and reduce memory usage. Compared with PyTorch's standard attention implementation, it can be up to 9 times faster.


GitHub address:
https://github.com/FasterDecoding/Medusa

Research address:
https://sites.google.com/view/medusa-llm
