Generative AI analysis: the magical effect of large models + large amounts of data

Emergent Ability of Large Models

The figure below shows how model performance (the loss on next-token prediction) relates to parameter count and dataset size. As both keep increasing, performance keeps improving, with no bottleneck in sight.

[Figure: next-token prediction loss vs. parameter count and dataset size]
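
As a rough illustration of this trend (my own sketch, not from the original material), the snippet below fits a power law of the form loss ≈ a · N^(−α) to a handful of invented (parameter count, loss) pairs and extrapolates it; every number here is a placeholder.

```python
import numpy as np

# Hypothetical (parameter count, validation loss) pairs -- for illustration only.
params = np.array([1e7, 1e8, 1e9, 1e10, 1e11])
loss = np.array([4.2, 3.6, 3.1, 2.7, 2.4])

# Fit loss ≈ a * N^(-alpha) as a straight line in log-log space.
slope, intercept = np.polyfit(np.log(params), np.log(loss), deg=1)
alpha, a = -slope, np.exp(intercept)
print(f"fitted exponent alpha ≈ {alpha:.3f}")

# Extrapolate to a larger model: the fitted curve keeps decreasing smoothly.
print(f"predicted loss at 1e12 parameters ≈ {a * 1e12 ** (-alpha):.2f}")
```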

The figure below shows the emergent abilities of large models: for some tasks, the performance of a language model does not improve linearly with parameter count but jumps suddenly, i.e., it "emerges". Below a certain scale threshold, performance hovers around random-chance level; once the threshold is crossed, it rises sharply.
[Figures: task performance vs. model scale for several tasks, flat near random-chance level until a scale threshold, then jumping sharply]

Calibration

In the experiments above, calibration refers to the relationship between the model's confidence and the true probability of being correct: a model whose high-confidence answers are usually right and whose low-confidence answers are the ones likely to be wrong has good calibration.

Calibration therefore captures whether the model "knows when it might be wrong". In the figure below, each color corresponds to a different parameter count; the larger the model, the better its confidence tracks the true probability of being correct.

[Figure: calibration (confidence vs. accuracy) for models of different sizes, one color per size]
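
To make the metric concrete, here is a small self-contained sketch (my own, with invented toy numbers) of one common way to measure calibration, the expected calibration error (ECE): predictions are grouped into confidence bins, and each bin's average confidence is compared with its actual accuracy.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Group predictions into confidence bins and compare each bin's
    average confidence with its accuracy; a well-calibrated model has small gaps."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of the samples
    return ece

# Toy usage: the model's confidence on six answers vs. whether each answer was right.
print(expected_calibration_error([0.9, 0.8, 0.75, 0.6, 0.55, 0.3],
                                 [1, 1, 0, 1, 0, 0]))
```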

Inverse Scaling Prize

A prize competition that looks for tasks on which the bigger the model, the worse the performance.

[Figure: Inverse Scaling Prize results]
On many of the competition's tasks, the performance of earlier "big models" did deteriorate as the parameter count increased, but when even larger models were evaluated, performance improved again, yielding a U-shaped curve.

[Figure: U-shaped scaling curves, performance drops for mid-sized models and recovers for the largest ones]
The tasks in this competition are generally specific and deliberately misleading, as in the following example:

[Figure: example of a misleading task from the competition]
One explanation for the U-shaped curve is that these tasks contain a misleading sub-task, such as the 5-yuan example above. A mid-sized model knows just enough to be taken in by the distractor, which drags its performance below random; once the model is large enough, it sees through the trap and produces the real answer, e.g., by actually working out the expected value as above.
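
For intuition, here is a hypothetical question of the same flavour (the numbers are invented, not taken from the original example):

```latex
% Hypothetical distractor question (numbers invented for illustration):
% "A lottery ticket costs 5 yuan and pays 100 yuan with probability 1/100.
%  Is buying a ticket a good deal on average?"
\[
  \mathrm{E}[\text{gain}] \;=\; \tfrac{1}{100}\times 100 \;-\; 5 \;=\; 1 - 5 \;=\; -4 \ \text{yuan}
\]
% A mid-sized model tends to latch onto the salient "pays 100 yuan" and answers yes;
% a sufficiently large model carries the expected-value calculation through and answers no.
```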


Switch Transformer

Switch Transformer has 1.6 trillion parameters (for comparison, GPT-3 has 175 billion and GPT-3.5 reportedly around 200 billion). It uses a Mixture-of-Experts structure: at inference time, each input is routed to only a subset of the expert modules rather than through all of the parameters, which keeps inference fast despite the enormous parameter count.

[Figure: Switch Transformer / Mixture-of-Experts architecture]
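
The sketch below is a minimal, illustrative version of top-1 expert routing (an assumption about how such a layer can be written, not the actual Switch Transformer code): a small router picks one expert feed-forward network per token, so only that expert's parameters are touched during the forward pass.

```python
import torch
import torch.nn as nn

class Top1MoELayer(nn.Module):
    """Toy Mixture-of-Experts layer with Switch-style top-1 routing (illustrative only)."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (n_tokens, d_model)
        gate = torch.softmax(self.router(x), dim=-1)      # (n_tokens, n_experts)
        score, expert_idx = gate.max(dim=-1)              # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():                                # only this expert's weights are used
                out[mask] = score[mask].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: 8 experts exist, but each token only activates one of them.
layer = Top1MoELayer(d_model=16, d_ff=64, n_experts=8)
tokens = torch.randn(5, 16)
print(layer(tokens).shape)  # torch.Size([5, 16])
```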


The Importance of Big Data

A sufficiently large amount of data is needed for a model to acquire common sense, i.e., world knowledge. In the figure below, the horizontal axis is the amount of training data.
[Figure: model knowledge vs. amount of training data]
Dataset preparation process (a minimal code sketch follows the figure below):

  • Filter harmful content (e.g., via Google Safe Search signals)
  • Remove HTML tags
  • Use rules to filter out low-quality text
  • Deduplicate
  • Filter out test-set data (GPT-3, for example, did not do this)

[Figure: dataset preparation pipeline]
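
The snippet below is a minimal sketch of such a cleaning pipeline (my own illustration; the `looks_harmful` check and the quality thresholds are placeholders, not the rules used for any real model).

```python
import hashlib
import re

def looks_harmful(text: str) -> bool:
    """Placeholder for a real safety filter (e.g., a blocklist or classifier)."""
    return False

def strip_html(text: str) -> str:
    """Crude HTML tag removal; real pipelines use a proper parser."""
    return re.sub(r"<[^>]+>", " ", text)

def low_quality(text: str) -> bool:
    """Simple heuristic rules: too short, or too few alphabetic characters."""
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    return len(text.split()) < 5 or alpha_ratio < 0.5

def clean_corpus(docs, test_set):
    seen_hashes = set()
    test_hashes = {hashlib.md5(t.encode()).hexdigest() for t in test_set}
    for doc in docs:
        doc = strip_html(doc).strip()
        if looks_harmful(doc) or low_quality(doc):
            continue
        h = hashlib.md5(doc.encode()).hexdigest()
        if h in seen_hashes or h in test_hashes:  # dedup + test-set decontamination
            continue
        seen_hashes.add(h)
        yield doc

# Toy usage:
raw = ["<p>The cat sat on the mat and looked around.</p>",
       "<p>The cat sat on the mat and looked around.</p>",  # duplicate
       "buy now!!!"]
print(list(clean_corpus(raw, test_set=[])))
```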

"Big Model" or "Big Data"

With a fixed compute budget, should we prioritize a bigger model or more data? Looking at the trend up to that point, model sizes had been growing rapidly while the amount of training data had not changed much.

[Figure: trend of model size and training data size over time]

In the figure below, each color corresponds to a fixed compute budget and the horizontal axis is the parameter count (so a larger parameter count means less training data). It shows that model size and data size need to be balanced: for a fixed compute budget, simply enlarging the model, and therefore training it on less data, eventually makes the result worse.

[Figure: loss vs. parameter count under fixed compute budgets, one color per budget]

Taking the minimum point of each U-shaped curve gives the relationship between compute, parameter count (Parameters), and data volume (Tokens) shown in the figure below.

[Figure: compute-optimal parameter count and token count vs. compute budget]
Based on this estimate, DeepMind re-derived the parameter count and data volume that should be used under the same compute budget as Gopher (280 billion parameters, 300 billion tokens of data) and trained Chinchilla (70 billion parameters, 1.4 trillion tokens). In the head-to-head comparison, Chinchilla beat Gopher by a large margin.

[Figure: Chinchilla vs. Gopher benchmark comparison]
Based on these results, a concrete pairing of parameter count and data volume is then given:

[Figure: recommended parameter counts and token counts at various scales]
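
As a back-of-the-envelope sketch (my own, using the commonly quoted approximations that training compute C ≈ 6 · N · D FLOPs and that the compute-optimal ratio is roughly 20 tokens per parameter), the snippet below estimates a Chinchilla-style allocation for a given compute budget; the constants are approximations, not the paper's exact fit.

```python
import math

def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a training compute budget between parameters N and tokens D,
    using C ≈ 6 * N * D and a target ratio D/N ≈ tokens_per_param."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: roughly Gopher-scale compute, C ≈ 6 * 280e9 params * 300e9 tokens.
budget = 6 * 280e9 * 300e9
n, d = chinchilla_allocation(budget)
print(f"~{n/1e9:.0f}B parameters trained on ~{d/1e12:.1f}T tokens")
```

With a roughly Gopher-sized compute budget, this lands near Chinchilla's actual 70B-parameter, 1.4T-token configuration.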

The more recent LLaMA also adopts this recipe of reducing the parameter count and expanding the amount of data:

[Figure: LLaMA parameter counts and training data sizes]


KNN LM

Generally speaking, a language model is solving a classification problem: given an input such as "Tsinghua University", it outputs a probability for every candidate next word, and the word with the highest probability can then be selected.

As shown below, the Transformer produces an embedding of the text, which is then turned into a classification over the vocabulary by a linear layer + softmax.
[Figure: Transformer embedding followed by a linear layer + softmax over the vocabulary]
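
A minimal sketch of that final step (toy sizes and names, purely illustrative):

```python
import torch
import torch.nn as nn

d_model, vocab_size = 16, 1000           # toy sizes
hidden = torch.randn(1, d_model)         # embedding of the context, e.g. "Tsinghua University"
to_vocab = nn.Linear(d_model, vocab_size)

probs = torch.softmax(to_vocab(hidden), dim=-1)  # distribution over candidate next tokens
next_token = probs.argmax(dim=-1)                # pick the most probable token
print(probs.shape, next_token.item())
```
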
In contrast, kNN-LM does not rely only on the classifier trained on top of the representation. It also computes the distance between the representation of the test text and the representations built from the training data, converts those distances into a predicted distribution over the next word, and combines this distribution with the original classifier's output to obtain the final result.

[Figure: kNN-LM, combining the classifier's distribution with a nearest-neighbor retrieval distribution]
In addition, kNN-LM can compute distances between the test representation and the representations of any data, not only the training data. This retrieval mechanism lets model training focus on the harder problems, while questions that only require memorization can be handled by lookup.
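
Below is a simplified sketch of the kNN-LM idea (my own illustration of the interpolation step, not the authors' code): a datastore maps stored representations to the next tokens observed after them; at test time the distances from the query representation to the stored keys are turned into a retrieval distribution, which is then mixed with the base language model's distribution.

```python
import torch

def knn_lm_probs(query, keys, next_tokens, lm_probs, vocab_size, k=4, lam=0.25, temp=1.0):
    """Interpolate a base LM distribution with a kNN retrieval distribution.
    query: (d,) representation of the test context
    keys: (n, d) stored representations; next_tokens: (n,) their observed next tokens
    lm_probs: (vocab_size,) base LM distribution; lam: interpolation weight."""
    dists = torch.cdist(query[None, :], keys)[0]            # L2 distance to every stored key
    knn_dists, knn_idx = dists.topk(k, largest=False)       # k nearest neighbors
    weights = torch.softmax(-knn_dists / temp, dim=-1)      # closer neighbors get more weight
    knn_probs = torch.zeros(vocab_size)
    knn_probs.index_add_(0, next_tokens[knn_idx], weights)  # accumulate weight per token id
    return lam * knn_probs + (1.0 - lam) * lm_probs         # final interpolated distribution

# Toy usage with a 10-token vocabulary and 6 stored (representation, next-token) pairs.
torch.manual_seed(0)
vocab_size, d = 10, 8
keys = torch.randn(6, d)
next_tokens = torch.tensor([3, 3, 5, 1, 3, 7])
query = keys[0] + 0.01 * torch.randn(d)                     # close to the first stored context
lm_probs = torch.full((vocab_size,), 1.0 / vocab_size)
print(knn_lm_probs(query, keys, next_tokens, lm_probs, vocab_size))
```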


Origin blog.csdn.net/qq_41552508/article/details/129915054