Leak and Analysis of the GPT-4 Model Architecture

GPT-4 is so far the most groundbreaking model available to the public, free of charge or through its commercial portal (in public beta). It has inspired new project ideas and use cases for many entrepreneurs, but the secrecy around its parameter count and architecture deflated all the enthusiasts who were betting on the first announcement of a 1-trillion, or even 100-trillion, parameter model.

Model Secrets Revealed

On June 20, George Hotz, founder of the self-driving startup Comma.ai, revealed that GPT-4 is not a single massive dense model (like GPT-3 and GPT-3.5) but a composite of 8 models of 220 billion parameters each.

Later that day, the co-founder of Meta's PyTorch reiterated the claim.

Just the day before, Mikhail Parakhin, head of Microsoft's Bing AI, had hinted at the same.


GPT-4: Not a Monolith

What do all these tweets mean? GPT-4 is not a single large model but an ensemble of 8 smaller models that share expertise, and each of these models is rumored to have 220 billion parameters.

The approach is known as the mixture-of-experts paradigm (link below). It is a well-established method, sometimes described as a hydra of models. It reminds me of Indian mythology; I would go with Ravana.

Take this with a grain of salt: this is not official news, but it has been stated or implied by prominent, high-level members of the AI community. Microsoft has not confirmed any of it yet.

What Is the Mixture-of-Experts Paradigm?

Now that we've brought up mixture of experts, let's take a deeper look at what it is. Mixture of experts is an ensemble learning technique developed specifically for neural networks. It differs somewhat from the general ensemble techniques of conventional machine learning (it is a more generalized form of them), so you can think of the mixture of experts in LLMs as a special case of the ensemble approach.

Briefly, in this method the task is divided into subtasks, and an expert model is trained for each subtask. This is one form of divide and conquer, akin to how a decision tree partitions a problem. One can also think of it as a meta-learner sitting on top of the expert models for the individual tasks.

Smaller, better models can be trained for each subtask or problem type. A meta-model then learns which model is better at predicting a particular task; this meta-learner/model acts as a traffic cop. Subtasks may or may not overlap, which means that a combination of the outputs can be merged to arrive at the final output.

Mixture of experts (MoE or ME for short) is an ensemble learning technique that implements the idea of training experts on subtasks of a predictive modeling problem.

In the neural network community, several researchers have studied decomposition methods. [...] These methods decompose the input space so that each expert examines a different part of the space. [...] A gating network is responsible for combining the individual experts.

— Page 73, Pattern Classification Using Ensemble Methods, 2010.

The method has four elements (a minimal code sketch follows the list):

  • Divide the task into subtasks.
  • Develop an expert for each subtask.
  • Use a gating model to decide which expert to use.
  • Pool the expert predictions and the gating model output to make a final prediction.
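
To make these four elements concrete, here is a minimal sketch in PyTorch of a densely (softly) gated mixture of experts. The module name, layer sizes, and the choice of a soft weighted sum for pooling are all illustrative assumptions of mine; none of this is disclosed GPT-4 code.

```python
# Minimal mixture-of-experts sketch (illustrative only; not GPT-4's actual architecture).
import torch
import torch.nn as nn


class MixtureOfExperts(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_experts=8):
        super().__init__()
        # Element 2: one small feed-forward "expert" per (implicit) subtask.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, output_dim),
            )
            for _ in range(num_experts)
        ])
        # Element 3: a gating network mapping the same input to one weight per expert.
        # Element 1 (the division into subtasks) is learned implicitly by this gate.
        self.gate = nn.Linear(input_dim, num_experts)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)                    # (batch, num_experts)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, num_experts, out)
        # Element 4: pool the expert predictions, weighted by the gate's confidence.
        return (weights.unsqueeze(-1) * expert_outs).sum(dim=1)


# Eight equally sized experts, echoing the rumored 8-way GPT-4 split (at toy scale).
moe = MixtureOfExperts(input_dim=16, hidden_dim=32, output_dim=4)
out = moe(torch.randn(2, 16))   # -> shape (2, 4)
```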

The diagram below, taken from page 94 of the 2012 book Ensemble Methods, provides a useful overview of the architectural elements of the method.

Example of a mixture-of-experts model with expert members and a gating network.
Taken from: Ensemble Methods, 2012.

Subtasks

The first step is to divide the predictive modeling problem into subtasks. This usually involves using domain knowledge. For example, an image can be divided into individual elements such as background, foreground, objects, colors, lines, etc.

...ME employs a divide-and-conquer strategy, where a complex task is decomposed into several simpler and smaller subtasks, and individual learners (called experts) are trained on different subtasks.

— Page 94, Ensemble Methods, 2012.

For problems where the division into subtasks is not obvious, simpler and more general methods can be used. For example, one can imagine partitioning the input feature space by groups of columns, or separating examples in the feature space according to distance measures from a standard distribution, inliers versus outliers, and so on.
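
As a toy illustration of that column-group idea, the snippet below (my own NumPy example; the three-way split is arbitrary) shows one way an input feature space could be carved into subtasks when no natural division is obvious.

```python
# Toy example: partitioning the input feature space by column groups (arbitrary split).
import numpy as np

X = np.random.randn(100, 12)                     # 100 examples, 12 features
column_groups = [slice(0, 4), slice(4, 8), slice(8, 12)]

# Each "expert" would then be trained only on its own slice of the feature space.
subtask_inputs = [X[:, cols] for cols in column_groups]
print([s.shape for s in subtask_inputs])         # [(100, 4), (100, 4), (100, 4)]
```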

…In ME, a key problem is how to find a natural division of tasks and then derive the overall solution from the sub-solutions.

— Page 94, Ensemble Methods, 2012.

Expert Models

Next, designate an expert for each subtask.

Mixture-of-experts methods were originally developed and explored in the field of artificial neural networks, so traditionally the experts themselves are neural network models used to predict a numerical value in the case of regression or a class label in the case of classification.

It should be clear that we can "plug in" any model for the expert. For example, we can use neural networks to represent both the gating functions and the experts. The result is known as a mixture density network.

— Page 344, Machine Learning: A Probabilistic Perspective, 2012.

Each expert receives the same input pattern (rows) and makes a prediction.

Gating Model

A model is used to interpret the predictions made by each expert and to help decide which expert to trust for a given input. This is called the gating model, or gating network, since it is traditionally a neural network model.

The gating network takes as input the same input pattern that was provided to the expert models and outputs the contribution that each expert should make to the prediction for that input.

…the weights determined by the gating network are dynamically assigned based on the given input, as the MoE effectively learns which portion of the feature space is learned by each ensemble member

— Page 16, Ensemble Machine Learning, 2012.

The gating network is the key to the approach: the model effectively learns which type of subtask each input belongs to, and then which expert to trust to make a strong prediction.

Mixture of experts can also be viewed as a classifier selection algorithm where individual classifiers are trained to be experts in some part of the feature space.

— Page 16, Ensemble Machine Learning, 2012.

When neural network models are used, the gating network is trained together with the experts so that it learns when to trust each expert's prediction. This training process was traditionally implemented using expectation maximization (EM). The gating network typically has a softmax output that provides a probability-like confidence score for each expert.
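
As a rough sketch of that joint training, the loop below trains the hypothetical MixtureOfExperts module from the earlier sketch end to end with gradient descent (used here instead of EM purely for brevity); the data, labels, and hyperparameters are placeholders.

```python
# Joint training sketch: the gate and the experts learn together from the same loss.
# (Gradient descent is used here for brevity; classical formulations use EM.)
import torch
import torch.nn as nn

model = MixtureOfExperts(input_dim=16, hidden_dim=32, output_dim=4, num_experts=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 16)                  # placeholder batch
targets = torch.randint(0, 4, (64,))     # placeholder class labels

for step in range(100):
    optimizer.zero_grad()
    logits = model(x)                    # experts predict, gate weights their outputs
    loss = loss_fn(logits, targets)
    loss.backward()                      # gradients reach both the experts and the gate,
    optimizer.step()                     # so the gate learns when to trust each expert
```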

In general, the training process tries to achieve two goals: for given experts, find the optimal gating function; for a given gating function, train the experts on the distribution specified by the gating function.

— Page 95, Ensemble Methods, 2012.

Pooling Method

Finally, the mixture-of-experts model must make a prediction, which is achieved through a pooling or aggregation mechanism. This could be as simple as selecting the expert with the largest output, or the one with the greatest confidence according to the gating network.

Alternatively, a weighted-sum prediction can be made that explicitly combines the predictions made by each expert with the confidences estimated by the gating network. You can probably think of other ways to make effective use of the expert predictions and the gating network output.
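
The two pooling strategies can be sketched in a few lines; the shapes and tensors below are placeholders I made up for illustration.

```python
# Pooling the expert outputs: hard selection vs. a gate-weighted sum (placeholder tensors).
import torch

gate_weights = torch.softmax(torch.randn(2, 8), dim=-1)   # (batch, num_experts) confidences
expert_outs = torch.randn(2, 8, 4)                         # (batch, num_experts, output_dim)

# (a) Hard selection: keep only the expert the gate trusts most.
best = gate_weights.argmax(dim=-1)                         # (batch,)
selected = expert_outs[torch.arange(2), best]              # (batch, output_dim)

# (b) Soft combination: weighted sum of every expert's prediction.
combined = (gate_weights.unsqueeze(-1) * expert_outs).sum(dim=1)
```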

The pooling/combining system can then choose the single classifier with the highest weight, or compute a weighted sum of the classifier outputs for each class and choose the class that receives the highest weighted sum.

— Page 16, Ensemble Machine Learning, 2012.

Switch Routing

We should also briefly discuss switch routing, which differs from the approach in the original MoE paper. I bring it up because Microsoft seems to be using switch routing rather than classic mixture-of-experts routing to save some computational cost, though I am happy to be proven wrong. With multiple expert models, the routing function (when to use which model) can have a non-trivial gradient; this decision boundary is handled by the switching layer.

The benefits of the switching layer are threefold:

  1. Routing computation is reduced, since each token is routed to only a single expert model.
  2. The batch size (expert capacity) of each expert can be at least halved, since each token goes to only a single expert.
  3. The routing implementation is simplified and communication costs are reduced.

The number of tokens each expert can handle is governed by the capacity factor: expert capacity is (total number of tokens / number of experts) × capacity factor, so a factor above 1 gives each expert some slack when routing is uneven. Below is a conceptual illustration of how routing behaves under different capacity factors (a code sketch of this routing follows the figure caption).

Illustration of token routing dynamics (source: https://arxiv.org/pdf/2101.03961.pdf). Each expert processes a fixed batch size of tokens, modulated by the capacity factor: each token is routed to the expert with the highest routing probability, but each expert has a fixed batch size of (total number of tokens / number of experts) × capacity factor. If tokens are dispatched unevenly, some experts overflow (indicated by the dotted red lines), and those tokens are not processed by the layer. A larger capacity factor alleviates the overflow problem but also increases computation and communication costs (indicated by the padded white/empty slots).
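
Here is a simplified sketch, in Python, of the top-1 ("switch") routing with a capacity factor described above. It is in the spirit of the Switch Transformer paper and only illustrates the overflow behavior from the caption; the token counts, expert count, and random router scores are placeholders, not anything from Microsoft's or OpenAI's systems.

```python
# Simplified top-1 ("switch") routing with a capacity factor (illustrative only).
import torch

num_tokens, num_experts, capacity_factor = 16, 4, 1.25
# Fixed number of slots per expert, as described in the figure caption above.
expert_capacity = int(num_tokens / num_experts * capacity_factor)

router_logits = torch.randn(num_tokens, num_experts)   # placeholder router scores
probs = torch.softmax(router_logits, dim=-1)
chosen = probs.argmax(dim=-1)                           # each token goes to exactly one expert

slots_used = [0] * num_experts
dropped = []
for token, expert in enumerate(chosen.tolist()):
    if slots_used[expert] < expert_capacity:
        slots_used[expert] += 1                         # token fits in this expert's batch
    else:
        dropped.append(token)                           # overflow: token skips this MoE layer

print(f"capacity per expert: {expert_capacity}, dropped tokens: {dropped}")
```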

Comparing the two approaches, the findings of the MoE and Switch Transformer papers show that:

  1. Switch Transformers outperform carefully tuned dense models and MoE Transformers on a speed-quality basis.
  2. Switch Transformers have a smaller compute footprint than MoE.
  3. Switch Transformers perform better at lower capacity factors (1–1.25).

Conclusion

Two caveats: first, all of this comes from hearsay; second, my own understanding of these concepts is fairly shallow, so I urge readers to take it with a grain of salt.

But what did Microsoft achieve by hiding this architecture? Well, they created a stir and built suspense, which may help them tell their story better. Keeping the innovation to themselves also prevents others from catching up as quickly. The whole thing is likely Microsoft's usual game plan of pouring $10B into a company while thwarting the competition.

GPT-4's performance is great, but it is not an innovative or groundbreaking design; it is an ingenious implementation of approaches developed by engineers and researchers, topped off with a corporate/capitalist deployment. OpenAI neither denies nor confirms these claims (https://thealgorithmicbridge.substack.com/p/gpt-4s-secret-has-been-revealed), which makes me think that this GPT-4 architecture is very likely real.
