MoE: LLM possibilities for lifelong learning

This article is shared from the Huawei Cloud Community post "DTSE Tech Talk | Issue 47: MoE: The Possibility of LLM Lifelong Learning", author: Huawei Cloud Community Selection.

In the 47th DTSE Tech Talk live broadcast, "MoE: The Possibility of LLM Lifelong Learning", Mr. Lu, a technical expert on MindSpore, introduced LLM lifelong learning to developers, explaining the characteristics and theory of continual learning, and also walked through the development history of MoE in detail so that the technical points can be understood more intuitively.

Continual lifelong learning

A lifelong learning system is defined as an adaptive algorithm capable of learning from continuous information that becomes progressively available over time, where the number of tasks to be learned (e.g., the membership classes in a classification task) is not predefined. Crucially, new information should be accommodated without catastrophic forgetting or interference.

Goals and characteristics of continual learning

The goal of continual learning is to keep the model from forgetting old knowledge while it keeps taking in new data. Its properties and definitions are listed below.

Reference: Continual Lifelong Learning in Natural Language Processing: A Survey (2020)

  • Knowledge retention: the model is not prone to catastrophic forgetting
  • Forward transfer: old knowledge is used to learn new tasks
  • Backward transfer: old tasks improve after new tasks are learned
  • Online learning: learning from a continuous data stream
  • No task boundaries: no explicit task or data boundaries are required
  • Fixed model capacity: the model size does not change with tasks or data

Properties of LLM:

Reference: the same survey (2020)

  • Knowledge retention: after pre-training, an LLM holds world knowledge, and small-scale finetuning does not easily cause catastrophic forgetting; continued training on large-scale data, however, does.
  • Forward transfer: zero-shot, few-shot, and finetuning all build on that world knowledge.
  • Backward transfer (-): finetuning may degrade some tasks, and a second finetune loses the gains of the first.
  • Online learning (×): pre-training and fine-tuning are offline.
  • No task boundaries: unsupervised pre-training and fine-tuning do not distinguish tasks.
  • Fixed model capacity: the model size stays fixed after pre-training.

As the comparison above shows, LLMs already satisfy most of the properties of continual learning. After sufficient pre-training, models at the scale of tens of billions of parameters hold a large amount of world knowledge and show emergent capabilities, which makes lifelong learning on top of them possible.

Common lifelong learning methods for LLMs include Rehearsal, Regularization, and Architectural (structure modification) approaches, but these are not really practical at LLM parameter scales and under LLM training regimes. The Mixture of Experts (MoE) approach, which LLMs adopt to increase parameter count while keeping inference cost down, appears to offer a new route to lifelong learning for LLMs.

Introduction to MoE

MoE, short for Mixture of Experts, has a history of more than 30 years. It is a model design strategy that directly combines multiple models to obtain better predictive performance. In large models, the MoE approach can effectively increase model capacity while keeping it efficient.

Generally speaking, a large-model MoE has two pieces: a gating mechanism, whose output combines and balances the experts so as to determine how each expert contributes to the final prediction; and an expert selection mechanism that, based on the gating output, picks a subset of the expert models to actually run. This reduces the amount of computation and lets the model choose the most suitable experts for different inputs.
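A minimal numpy sketch of that pipeline (toy shapes and names, not any particular framework's API): the gate scores every expert, a small subset is selected, and the selected experts' outputs are mixed by their gate weights.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy setup (all sizes are made up for illustration): 4 experts, each reduced
# to a single linear map so the routing logic stays visible.
rng = np.random.default_rng(0)
d_in, d_out, n_experts, top_k = 8, 4, 4, 2
experts = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]
gate_w = rng.normal(size=(d_in, n_experts))

x = rng.normal(size=d_in)                    # one input sample
gate_scores = softmax(x @ gate_w)            # gating output: one weight per expert
chosen = np.argsort(gate_scores)[-top_k:]    # expert selection: keep a small subset

# Only the chosen experts run; their outputs are combined by the gate weights.
y = sum(gate_scores[i] * (x @ experts[i]) for i in chosen)
print(chosen, y.shape)                       # two expert indices and an output of shape (4,)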

Schematic diagram of MoE

The multiple Expert Networks in the figure learn from different data, and a Gating Network assigns an output weight to each Expert. For an input sample $c$, the output of the $i$-th expert is $o_i^c$, its gating weight is $p_i^c$, and the ground truth is $d^c$.

Then the loss function is:

$$E^c = \left\| d^c - \sum_i p_i^c o_i^c \right\|^2$$

The summation can be moved to the front, so that each expert computes its loss independently:

$$E^c = \sum_i p_i^c \left\| d^c - o_i^c \right\|^2$$

This encourages competition among the expert models, so that each data sample is handled by a single expert as far as possible. The competition and cooperation among experts, and the way the Gating Network distributes samples, became the directions along which MoE evolved; the MoE of 2017 was already taking shape here.
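To make the difference concrete, here is a small numpy sketch (a toy example with made-up shapes; the symbols follow the text: p are the gating weights, o the expert outputs, d the target) comparing the two loss forms.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
n_experts, d_out = 3, 4
p = softmax(rng.normal(size=n_experts))    # gating weights p_i^c
o = rng.normal(size=(n_experts, d_out))    # expert outputs o_i^c
d = rng.normal(size=d_out)                 # ground truth d^c

# Original form: blend the experts first, then compare the blend with the target.
loss_blend = np.sum((d - p @ o) ** 2)

# Refined form: compare each expert with the target independently and let the
# gate weight the per-expert errors; the best-fitting single expert is rewarded
# rather than a committee average, which pushes experts to specialize.
loss_competitive = np.sum(p * np.sum((d - o) ** 2, axis=1))

print(loss_blend, loss_competitive)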

Sparse MoE

Shazeer, Noam and colleagues at Google Brain proposed a sparsely gated MoE structure to increase model capacity: a large number of expert models are trained, but only a small number of them are activated for each input.

Sparse MoE example diagram

As shown in the figure above, the model has n Experts in total, and the Gating Network selects only a few of them for computation. During training, however, experts with earlier indices tend to be selected by the gating network more often, so that only a few experts end up doing useful work; this is known as the Expert Balancing problem. At this stage, the goal of Sparse MoE was to make the model larger while keeping training and inference cost-effective. In the same year, the appearance of the Transformer, which can be trained in parallel, drew everyone's attention.
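A minimal numpy sketch of the sparse gating idea (simplified: the noise term and capacity limits of the original noisy top-k gating are omitted, and the auxiliary penalty shown is just one common variant): only k of the n experts get a non-zero gate weight, and a balancing term discourages a few experts from monopolizing the traffic.

import numpy as np

def top_k_gate(logits, k):
    """Zero out all but the k largest logits per token, then softmax the survivors."""
    masked = np.full_like(logits, -np.inf)
    idx = np.argsort(logits, axis=-1)[:, -k:]
    np.put_along_axis(masked, idx, np.take_along_axis(logits, idx, axis=-1), axis=-1)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
n_tokens, n_experts, k = 16, 8, 2
logits = rng.normal(size=(n_tokens, n_experts))   # gating logits for each token
gates = top_k_gate(logits, k)                     # sparse: only k non-zero gates per token

# Expert-balancing diagnostic: what fraction of tokens does each expert receive?
load = (gates > 0).mean(axis=0)
# One common style of auxiliary penalty: the squared coefficient of variation of
# the load, which is zero when routing is perfectly uniform.
balance_loss = (load.std() / load.mean()) ** 2
print(load, balance_loss)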

Transformer MoE

When model parameters reached the hundred-billion level, scaling further became increasingly difficult, and the economical, practical MoE was revived. Google proposed GShard [4], the first work to extend the MoE idea to the Transformer. Switch Transformer [5], GLaM [6] and later work kept improving the Transformer MoE structure, and pushed LLM parameter counts from the hundred-billion level to the trillion level.

GShard: the first MoE + Transformer model

The GShard paper was first released on June 30, 2020 ("GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding"). In the Transformer's encoder and decoder, every other FFN layer is replaced with a position-wise MoE layer.
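As a purely structural sketch (the block count and helper name are made up for illustration, not GShard's configuration), the replacement pattern looks like this: attention is kept in every block, while the FFN of every second block becomes an MoE layer.

def build_encoder_plan(n_blocks: int, moe_every: int = 2):
    """Return a layer plan in which every `moe_every`-th FFN becomes an MoE layer."""
    plan = []
    for i in range(n_blocks):
        plan.append((i, "multi-head self-attention"))
        ffn = "position-wise MoE FFN" if (i + 1) % moe_every == 0 else "dense FFN"
        plan.append((i, ffn))
    return plan

# Print the layout of a 6-block encoder: blocks 1, 3, 5 carry the MoE FFN.
for block, layer in build_encoder_plan(6):
    print(block, layer)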

Switch Transformer: a claimed trillion-parameter Transformer-class model

In January 2021, the Google Brain team published "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity", which simplifies the MoE routing algorithm: the gating network routes each input to only one expert at a time.
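In contrast with the top-k gating sketched earlier (again a toy numpy illustration, not the paper's implementation), top-1 routing sends each token to exactly one expert, so the combine step is just that expert's output scaled by its router probability.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(4)
n_tokens, d_model, n_experts = 6, 8, 4
tokens = rng.normal(size=(n_tokens, d_model))
gate_w = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

probs = softmax(tokens @ gate_w)     # router probabilities, one row per token
choice = probs.argmax(axis=-1)       # top-1: each token goes to exactly one expert

# The combine step degenerates to "that expert's output times its gate probability".
out = np.stack([probs[t, choice[t]] * (tokens[t] @ experts[choice[t]])
                for t in range(n_tokens)])
print(choice, out.shape)             # one expert index per token, output (6, 8)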

GLaM: lower cost, higher efficiency, better accuracy

In the same year, Google's GLaM model showed that Transformer and MoE-style layers can be combined to produce a model that exceeds the average accuracy of GPT-3 across 29 benchmarks, while using 3x less energy to train and 2x less computation at inference.

PanGu-Sigma

PanGu-Sigma [8] is the Lifelong-MoE model released by Huawei's Noah's Ark Lab in March this year; it extends the PanGu-alpha model with MoE and proposes the Random Routing Expert (RRE) method, so that the Gating Network can be cut away together with its Experts. The picture below is a schematic diagram of PanGu-Sigma:

Here we focus on the design of RRE. As mentioned earlier, a learnable Gating Network is hard to cut, so manual gating can be used as a simple, if crude, alternative. RRE follows this idea, but to soften the overly coarse domain separation it implies (one property of continual learning is the absence of task boundaries, and manual gating violates it to some degree), RRE uses a two-layer design (a minimal routing sketch follows the list):

  • The first layer assigns expert groups by task: several experts form an expert group that serves one task or domain.
  • The second layer uses random gating within the group, so that the experts in a group stay load-balanced.
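A minimal sketch of the two-level idea (the domain names, group sizes and seeding are made up for illustration; this is not PanGu-Sigma's code): the first level maps a task or domain tag to an expert group, and the second level picks an expert inside that group at random, so no gating parameters need to be learned, or cut away later.

import random

# Hypothetical task-to-expert-group map: each domain owns a disjoint group of experts.
EXPERT_GROUPS = {
    "finance": [0, 1, 2, 3],
    "medical": [4, 5, 6, 7],
    "general": [8, 9, 10, 11],
}

def rre_route(task, token_id, seed=0):
    """Two-level Random Routing Expert sketch: task -> group, then random within the group."""
    group = EXPERT_GROUPS[task]                        # level 1: manual, task-based gating
    rng = random.Random(seed * 1_000_003 + token_id)   # deterministic per token, no learned gate
    return rng.choice(group)                           # level 2: uniform within the group

# Tokens from different domains land only in their own group, roughly evenly.
print([rre_route("finance", t) for t in range(8)])
print([rre_route("medical", t) for t in range(8)])

# Sub-model extraction: keeping only one group's experts (plus the shared layers)
# yields a standalone domain model, since no learnable gate refers to other groups.
finance_submodel_experts = EXPERT_GROUPS["finance"]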

The benefits are obvious: as long as an expert group can be cut out, the sub-model for that domain can be stripped out completely for inference and deployment, and new expert groups can keep being added and iterated to achieve lifelong learning. The figure below shows sub-model extraction from the pre-trained MoE model.

The two works above are typical Lifelong-MoE efforts, and they also extend the LLM capabilities of the two companies. It is worth noting, though, that MoE LLMs split into two camps by training starting point, from scratch versus from pretrained. GPT-4 is rumored to be a set of 8 experts trained from scratch, which in a sense is closer to a return to the ensemble stage: it is driven more by product results than by the continual evolution of an LLM.

Remaining problems with MoE

Lifelong-MoE looks very promising, but nothing is perfect: the MoE approach itself still has some problems. The brief discussion below can also be read as a look at the directions of its further evolution.

  • MoE structural complexity

A Transformer MoE expands the FFN layers into MoE layers, but the Transformer block also contains a multi-head attention structure, so the MoE expansion is an intrusive modification of the Transformer: both the parallelization changes needed before training and the extraction of sub-models after training are intrusive, and the complex structure makes them labor-intensive.

  • Expert balancing

Some tasks or domains will always account for the majority of the data, and long-tail data will always exist. Forcing a balanced distribution with equal parameter budgets and random gating actually hurts the model's ability to fit the real world. Because of the winner-take-all behavior of neural networks, a gating network tends to learn to route most data to the few experts that already fit it best. Alleviating or solving this still requires a lot of experimentation and research.

  • Distributed communication issues

Today's LLM pre-training must rely on distributed parallel partitioning, and the difference between an MoE structure and an ordinary dense model is that MoE needs extra AllToAll communication to dispatch data to experts (gating) and gather the results back. AllToAll traffic crosses nodes (servers) and pods (routing domains), causing serious communication blocking.
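To see why this communication pattern appears, here is a tiny single-process simulation (pure Python, no real communication library; the device count and token tags are made up): each "device" holds some tokens, the gate decides which expert, and hence which device, every token belongs to, and an AllToAll-style exchange moves the token buckets across devices before the experts run and the results are sent back.

# Single-process sketch of MoE dispatch: 4 "devices", one expert hosted per device.
n_devices = 4

# Tokens currently held by each device, tagged with the expert index the gate chose.
local_tokens = {
    0: [("t0", 2), ("t1", 0), ("t2", 3)],
    1: [("t3", 1), ("t4", 2)],
    2: [("t5", 0), ("t6", 0)],
    3: [("t7", 3), ("t8", 1)],
}

# Dispatch (first AllToAll): every device sends each token to the device hosting
# the chosen expert and receives the tokens that were routed to its own expert.
inbox = {d: [] for d in range(n_devices)}
for src, toks in local_tokens.items():
    for tok, dst_expert in toks:
        inbox[dst_expert].append((src, tok))

# Each expert processes only its own bucket (computation omitted), then a second
# AllToAll sends the results back to the devices that originally held the tokens.
outbox = {d: [] for d in range(n_devices)}
for expert, bucket in inbox.items():
    for src, tok in bucket:
        outbox[src].append((tok, f"processed_by_expert_{expert}"))

print(inbox[0])    # tokens gathered for expert 0 (from devices 0 and 2)
print(outbox[0])   # results returning to device 0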

Click to follow and learn about Huawei Cloud’s new technologies as soon as possible~


Origin my.oschina.net/u/4526289/blog/10141333