New Stanford study cautions: don't read too much into the "emergence" of large models; it may be an artifact of metric choice


Reprinted from: Heart of the Machine

With the rise of large models, the term "emergence" has become popular; it usually refers to a capability that is absent in small-scale models but present in large-scale models. Researchers at Stanford University, however, question the claim that LLMs have emergent abilities, arguing that such abilities are an artifact of how the researchers choose to measure performance.

"Don't be too superstitious about the emergence of large models. Where are there so many miracles in the world?" Researchers at Stanford University found that the emergence of large models is strongly related to the evaluation indicators of tasks, not the basic changes in model behavior under specific tasks and scales. After changing to some more continuous and smooth indicators, the emergence phenomenon is not so obvious, and it is closer to linearity.

Recently, the concept has drawn considerable attention in machine learning, thanks to observations that large language models (LLMs) such as GPT, PaLM, and LaMDA can exhibit so-called "emergent abilities" across a range of tasks.


Emergent properties of complex systems have long been a focus of research in physics, biology, mathematics, and other disciplines.

One notable idea is Nobel laureate P. W. Anderson's "More Is Different": as the complexity of a system increases, new properties may materialize that cannot be (easily, or at all) predicted even from a precise quantitative understanding of the system's microscopic details.

How is "emergence" defined for large models? An informal definition is "capabilities that are absent in small-scale models but present in large-scale models"; such capabilities therefore cannot be predicted by simply extrapolating the performance improvements of smaller models.

Such emergent abilities were probably first observed in the GPT-3 family, and subsequent work underscored the finding: "while model performance is predictable at a general level, performance on a specific task can sometimes emerge quite unpredictably at scale." Indeed, these emergent abilities were surprising enough that "abrupt, specific capability scaling" has been described as one of the two top defining features of LLMs. Terms such as "breakthrough capabilities" and "sharp left turns" have also been used.

In summary, emergent abilities of LLMs are characterized by two defining properties:

1. Sharpness: the transition from "not present" to "present" appears to happen almost instantaneously;

2. Unpredictability: the transition occurs at seemingly unforeseeable model scales.

Meanwhile, several questions remain open: What controls which capabilities emerge? What controls when they emerge? How can we make desirable capabilities emerge sooner, and ensure that undesirable ones never emerge?

These questions bear directly on AI safety and alignment, because emergent abilities suggest that larger models might one day, without warning, acquire dangerous capabilities that humans do not want them to have.

In a new paper, researchers at Stanford University challenge the claim that LLMs have emergent capabilities.


Paper: https://arxiv.org/pdf/2304.15004.pdf

Specifically, they challenge the claim that model outputs change in an emergent, unpredictable way as a function of model scale on a given task.

Their skepticism rests on the observation that emergent abilities seem to appear only under metrics that scale the model's per-token error rate non-linearly or discontinuously. For example, on BIG-Bench tasks, more than 92% of claimed emergent abilities occur under one of two such metrics: Multiple Choice Grade and Exact String Match.


This suggests an alternative explanation for the apparent emergent abilities of LLMs: although a model family's per-token error rate changes smoothly, continuously, and predictably as model size increases, seemingly sharp and unpredictable changes may be produced by the researchers' choice of metric.

In other words, emergent abilities may be a mirage: caused mainly by the researcher choosing a metric that transforms the per-token error rate non-linearly or discontinuously, partly by having too little test data to accurately estimate the performance of smaller models (making them appear completely incapable of the task), and partly by evaluating too few large-scale models.
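To make this concrete, here is a minimal numerical sketch (not from the paper; the power-law constants are invented for illustration) of how a smoothly improving per-token success rate can look "emergent" under an all-or-nothing metric such as exact-match accuracy, yet smooth under token edit distance:

```python
import numpy as np

# Hypothetical model family: per-token cross-entropy falls as a power law in
# parameter count, so per-token success probability rises smoothly with scale.
# (Constants are illustrative, not fitted to any real model family.)
model_sizes = np.logspace(8, 11, 7)                  # 1e8 .. 1e11 parameters
per_token_p = np.exp(-((1e9 / model_sizes) ** 0.5))  # smooth, saturating curve

L = 5                                     # e.g. a 5-digit target string
exact_match = per_token_p ** L            # nonlinear metric: every token must be right
edit_distance = L * (1 - per_token_p)     # linear metric: expected number of wrong tokens

for n, p, em, ed in zip(model_sizes, per_token_p, exact_match, edit_distance):
    print(f"{n:9.1e} params | per-token {p:.2f} | exact match {em:.3f} | edit dist {ed:.2f}")
# Exact match sits near 0 and then rises abruptly at the largest scales,
# while per-token success and edit distance change smoothly the whole way.
```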

To illustrate this explanation, the researchers formalize it as a simple mathematical model and show how it quantitatively reproduces the evidence that has been offered in support of emergent LLM abilities. They then test the explanation in three complementary ways:

1. Using the InstructGPT [24]/GPT-3 [3] model family, they make, test, and confirm three predictions based on their alternative hypothesis.

2. They perform a meta-analysis of published results and show that, in the space of task-metric-model-family triplets, emergent abilities appear only for certain metrics, not for particular model families on particular tasks. They further show that, with the model outputs held fixed, changing the metric makes the emergent phenomenon disappear.

3. They deliberately induce emergent abilities on several vision tasks (where emergence has never previously been claimed) in deep neural networks of different architectures, showing how similar metric choices can produce seemingly emergent capabilities.

Test 1: Analysis of the InstructGPT/GPT-3 model family

The researchers chose the GPT family for further analysis because it is publicly queryable, unlike other model families (e.g., PaLM, LaMDA, Gopher, Chinchilla). Previous studies had claimed that the GPT family exhibits emergent abilities on integer arithmetic tasks, so the researchers again used integer arithmetic here.


Figure 2: The emergent abilities of large language models are a creation of the researcher's analysis, not a fundamental change in model outputs with scale.

As explained mathematically and graphically in Section 2, the alternative explanation proposed by the researchers predicts three outcomes:

1. If the metric is changed from a nonlinear/discontinuous one (Figure 2CD) to a linear/continuous one (Figure 2EF), performance should improve smoothly, continuously, and predictably as model size increases.

2. Even with a nonlinear metric, if the resolution of measured model performance is increased by enlarging the test dataset, models should show smooth, continuous, predictable improvements, commensurate with the predictable nonlinear effect of the chosen metric.

3. Regardless of the metric, increasing the target string length should affect performance as a function of the length-1 target performance: roughly geometrically for accuracy, and roughly quasi-linearly for token edit distance (a short worked example follows this list).
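As a worked example of the third prediction (assuming, as in the researchers' simple model, that tokens are scored independently with per-token success probability p):

$$\text{Accuracy} \approx p^{L}, \qquad \mathbb{E}[\text{token edit distance}] \approx L\,(1 - p).$$

For instance, with p = 0.9 the accuracy for target lengths L = 1, …, 5 is roughly 0.90, 0.81, 0.73, 0.66, 0.59 (geometric decay), while the expected edit distance grows roughly linearly as 0.1·L.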

To test these three predictions, the researchers collected the string outputs of the InstructGPT/GPT-3 family on two arithmetic tasks via the OpenAI API: 2-shot multiplication of two two-digit integers, and 2-shot addition of two four-digit integers.


Figure 3: As model size increases, changing the metric reveals smooth, continuous, and predictable changes in performance.

From left to right: the mathematical model, 2-shot multiplication of two two-digit integers, and 2-shot addition of two four-digit integers. The upper plots show performance measured with a nonlinear metric (accuracy): the InstructGPT/GPT-3 family appears to improve sharply and unpredictably at longer target lengths. The lower plots measure the same outputs with a linear metric (token edit distance): the same family shows smooth, predictable improvement, removing the appearance of emergence.

Prediction: Emergent abilities disappear under linear metrics

On the integer multiplication and addition tasks, if the target string is four or five digits long and performance is measured with accuracy (Figure 3, top row), the GPT family appears to exhibit emergent arithmetic abilities. However, if the metric is switched from nonlinear to linear while the model outputs are held fixed, the family's performance improves smoothly, continuously, and predictably. This confirms the prediction and shows that the source of the sharpness and unpredictability is the researcher's choice of metric, not changes in the model's outputs. Moreover, under token edit distance, increasing the target string length from 1 to 5 produces a predictable, nearly quasi-linear decline in performance, consistent with the first half of the third prediction.

Prediction: Emergent abilities disappear with higher-resolution evaluation

Next comes the second prediction: even under a nonlinear metric such as accuracy, smaller models' accuracy should not be zero, but a non-zero value above chance, scaled in a way commensurate with the chosen accuracy metric. To increase resolution and estimate model accuracy more precisely, the researchers generated additional test data and found that, on both the integer multiplication and integer addition tasks, every model in the InstructGPT/GPT-3 family achieves positive accuracy above chance (Figure 4). This confirms the second prediction. Moreover, as the target string length increases, accuracy falls roughly geometrically with length, consistent with the second half of the third prediction. These results also indicate that the chosen accuracy metric has the (approximate) effect we should expect: it decays almost geometrically with target length.
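A minimal sketch of the resolution argument (hypothetical numbers: suppose a smaller model's true accuracy is 0.1%, well above chance for, say, 4-digit addition): with a small test set the measured accuracy is usually exactly zero, making the model look completely incapable, while a much larger test set resolves the small but non-zero value.

```python
import numpy as np

rng = np.random.default_rng(0)
true_accuracy = 1e-3   # hypothetical smaller model: answers 0.1% of prompts correctly

for n_test in (100, 1_000, 100_000):
    # Each test question is an independent Bernoulli trial at the true accuracy.
    n_correct = rng.binomial(n_test, true_accuracy)
    print(f"{n_test:>7} test questions -> measured accuracy {n_correct / n_test:.5f}")
# With 100 questions the estimate is typically 0.00000 ("no capability at all");
# with 100,000 questions it is close to 0.00100 -- small, but clearly above zero.
```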


Figure 4: More test data yields better accuracy estimates, revealing that the change in performance is smooth, continuous, and predictable.

From left to right: the mathematical model, 2-shot multiplication of two two-digit integers, and 2-shot addition of two four-digit integers. Increasing resolution by generating more test data reveals that, even under the accuracy metric, the InstructGPT/GPT-3 family performs above chance and improves smoothly, continuously, and predictably with scale, qualitatively matching the mathematical model on both tasks said to exhibit emergence.

Test 2: Meta-analysis of model emergence

The GPT family could be analyzed directly because it is publicly queryable. However, other models claimed to have emergent abilities (such as PaLM, Chinchilla, and Gopher) are not publicly queryable, nor are their outputs publicly available, so the researchers were limited to analyzing published results. They made two predictions based on their alternative hypothesis:

  • First, at the "population level" of task-metric-model-family triplets, emergent abilities should appear predominantly on tasks evaluated with nonlinear and/or discontinuous metrics.

  • Second, for any particular task-metric-model-family triplet that exhibits an emergent ability, changing the metric to a linear and/or continuous one should eliminate the emergent ability.

To test these two predictions, the researchers examined the abilities claimed to be emergent on the BIG-Bench evaluation suite, whose benchmark results are publicly available and well documented.

Prediction: Emergent abilities should appear mainly under nonlinear/discontinuous metrics

To test the first prediction, the researchers analyzed which metrics produce emergent abilities for which task-model-family pairs. To decide whether a task-metric-model-family triplet likely exhibits an emergent ability, they borrowed the definition introduced in the paper "Beyond the imitation game: Quantifying and extrapolating the capabilities of language models". Let y_i ∈ ℝ denote model performance at model size x_i ∈ ℝ, with x_i < x_{i+1}; the emergence score is then:

$$\mathrm{Emergence\ Score}\bigl(\{(x_i, y_i)\}_{i=1}^{n}\bigr) \;\equiv\; \frac{\operatorname{sign}\bigl(\arg\max_i y_i - \arg\min_i y_i\bigr)\,\bigl(\max_i y_i - \min_i y_i\bigr)}{\sqrt{\operatorname{Median}\bigl(\{(y_i - y_{i-1})^2\}_i\bigr)}}$$
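A direct transcription of this score into code (a sketch; it assumes the performances y are already ordered by increasing model size) could look like:

```python
import numpy as np

def emergence_score(y):
    """Emergence score for performances y ordered by increasing model size.

    The score is large when one jump between adjacent model sizes dwarfs the
    typical (median) step-to-step change, i.e. when the curve looks 'emergent'.
    """
    y = np.asarray(y, dtype=float)
    step_changes = np.diff(y)                          # y_i - y_{i-1}
    direction = np.sign(np.argmax(y) - np.argmin(y))   # +1 if the best model is the larger one
    total_range = y.max() - y.min()
    return direction * total_range / np.sqrt(np.median(step_changes ** 2))

print(emergence_score([0.01, 0.02, 0.01, 0.02, 0.55]))   # sharp jump  -> large score (~54)
print(emergence_score([0.10, 0.22, 0.35, 0.46, 0.55]))   # smooth rise -> modest score (~4)
```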

The researchers found that most metrics used in BIG-Bench show no emergent task-model-family pairings: of the 39 preferred BIG-Bench metrics, at most 5 exhibit emergent abilities (Fig. 5A). Most of those 5 are nonlinear and/or discontinuous, such as exact string match, multiple-choice grade, and ROUGE-L-Sum. Notably, because BIG-Bench typically evaluates a task with several metrics, the absence of emergence under the other metrics suggests that emergent abilities do not appear when the same model outputs are evaluated with different metrics.

Because the emergence score is only suggestive of emergence, the researchers also analyzed the task-metric-model-family triplets hand-annotated in the paper "137 emergent abilities of large language models". The hand-annotated data show that only 4 of the 39 metrics exhibit emergent abilities (Fig. 5B), and 2 of them account for more than 92% of the claimed emergent abilities (Fig. 5C): multiple-choice grade and exact string match. Multiple-choice grade is discontinuous, and exact string match is nonlinear (its variation with target length is roughly geometric). Overall, these results suggest that emergent abilities appear only under a very small number of nonlinear and/or discontinuous metrics.


Figure 5: Emergent abilities appear only under a few metrics. (A) Of the 39 preferred BIG-Bench metrics, at most 5 show possible emergent abilities. (B) Hand-annotated data from the cited paper show that only 4 of the preferred metrics exhibit emergent abilities. (C) More than 92% of claimed emergent abilities appear under one of two metrics: multiple-choice grade and exact string match.

Prediction: Emergent abilities should be eliminated when nonlinear/discontinuous metrics are replaced

For the second prediction, the researchers analyzed the hand-annotated emergent abilities from the paper cited above. They focused on the LaMDA family because its outputs are available through BIG-Bench, whereas other model families' outputs are not. Of the published LaMDA models, the smallest has 2 billion parameters, but many LaMDA models listed in BIG-Bench are much smaller; the researchers excluded these from the analysis because they could not identify their provenance. They identified tasks on which LaMDA exhibits an emergent ability under the multiple-choice grade metric, then asked whether it still exhibits an emergent ability on the same tasks under another BIG-Bench metric, the Brier score. The Brier score is a strictly proper scoring rule for predictions over mutually exclusive outcomes; for a binary outcome it reduces to the mean squared error between the outcome and its predicted probability.
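For the binary case described above, the Brier score is thus just a mean squared error between the 0/1 outcome and its predicted probability; a minimal sketch (the example numbers are invented):

```python
import numpy as np

def brier_score(predicted_prob, outcome):
    """Brier score for binary outcomes: mean squared difference between the
    predicted probability of the event and whether it occurred (1) or not (0).
    Lower is better, and it changes continuously as predictions improve."""
    predicted_prob = np.asarray(predicted_prob, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    return np.mean((predicted_prob - outcome) ** 2)

# Slightly more confident correct predictions get a slightly better score --
# there is no all-or-nothing jump as with exact string match or multiple choice.
print(brier_score([0.55, 0.60, 0.52], [1, 1, 1]))   # ~0.198
print(brier_score([0.70, 0.75, 0.68], [1, 1, 1]))   # ~0.085
```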

The researchers found that LaMDA's emergent abilities disappear when the discontinuous multiple-choice grade metric is replaced with the continuous Brier score (Fig. 6). This further suggests that emergent abilities are caused not by an intrinsic change in model behavior with scale, but by the use of a discontinuous metric.


Figure 6: Changing the BIG-Bench metric while keeping the task and model family fixed makes the emergent ability disappear. Top row: the LaMDA family exhibits emergent abilities under a discontinuous metric (multiple-choice grade). Bottom row: under a continuous BIG-Bench metric (Brier score), the LaMDA family shows no emergent ability on the same tasks.

Test 3: Inducing emergent abilities in deep neural networks

The researchers' view is that emergence can be induced by the choice of metric; to demonstrate this, they show how to produce emergent abilities in deep neural networks with different architectures (fully connected, convolutional, self-attention). They focus on vision tasks for two reasons. First, attention is currently concentrated on the emergent abilities of large language models, since no sudden shift from no capability to some capability has been claimed for vision models. Second, some vision tasks can be solved by moderately sized networks, so a complete family of models spanning several orders of magnitude can be trained.

Emergent classification of MNIST handwritten digits in convolutional networks

The researchers first induced emergent classification abilities in a family of LeNet convolutional networks trained on the MNIST handwritten-digit dataset. This family shows a smooth increase in test accuracy as the number of parameters grows (Fig. 7B). To mimic the accuracy metrics used in papers on emergence, they used subset accuracy: the accuracy on a subset of K (independent) test items is 1 if the network classifies all K correctly, and 0 otherwise. Under this definition, as K grows from 1 to 5, the model family appears to acquire an "emergent" ability to classify sets of MNIST digits, especially when combined with sparse sampling of model sizes (Fig. 7C). This emergent classification ability qualitatively matches those in published papers, such as the result on the BIG-Bench topographical mapping task (Fig. 7A).
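A sketch of the subset-accuracy construction (the per-image accuracies below are invented, not the paper's LeNet numbers): even if per-image accuracy rises smoothly across a model family, requiring all K of K independent images to be correct makes the curve look increasingly abrupt as K grows.

```python
import numpy as np

# Hypothetical smooth per-image test accuracies for a family of growing models.
per_image_acc = np.array([0.20, 0.45, 0.70, 0.85, 0.93, 0.97, 0.99])

for K in (1, 3, 5):
    # Probability that all K independent test images are classified correctly.
    subset_acc = per_image_acc ** K
    print(f"K={K}:", np.round(subset_acc, 3))
# As K grows, small models are pushed toward subset accuracy 0 while the largest
# models stay near 1, producing an apparently emergent transition in between.
```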


Figure 7: Inducing emergent MNIST classification abilities in convolutional networks. (A) A published emergent ability on the BIG-Bench topographical mapping task. (B) LeNet trained on MNIST shows a predictable, commonplace sigmoidal increase in test accuracy as model parameters grow. (C) When accuracy is redefined as correctly classifying K out of K independent test items, the new metric induces a seemingly sharp, unpredictable change.

Emergent reconstruction abilities of nonlinear autoencoders on CIFAR100 natural images

To emphasize that the sharpness of the chosen metric is what produces the emergent ability, and that this sharpness is not limited to metrics such as accuracy, the researchers also induced an emergent image-reconstruction ability in shallow (i.e., single-hidden-layer) nonlinear autoencoders. To this end, they deliberately defined a new discontinuous metric of model capability: the fraction of test data whose squared reconstruction error falls below a fixed threshold c:

$$\mathrm{Reconstruction}_c\bigl(\{x_n\}_{n=1}^{N}\bigr) \;\equiv\; \frac{1}{N}\sum_{n=1}^{N} \mathbb{I}\bigl[\lVert x_n - \hat{x}_n \rVert^2 < c\bigr]$$

where 𝕀(·) is the indicator function and x̂_n is the autoencoder's reconstruction of x_n. The researchers swept the number of bottleneck units in the autoencoder and found that the network's mean squared reconstruction error declines smoothly as model size grows (Fig. 8B). Under the newly defined reconstruction metric, however, for the chosen c, the family's ability to reconstruct the dataset looks sharp and nearly unpredictable (Fig. 8C), qualitatively matching emergent abilities in published papers, such as on the BIG-Bench Periodic Elements task (Fig. 8A).
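A sketch of this thresholded metric (hypothetical error distributions; c chosen arbitrarily): the underlying mean squared reconstruction error can fall smoothly across the model family while the thresholded metric sits near zero and then snaps to one.

```python
import numpy as np

rng = np.random.default_rng(0)

def thresholded_reconstruction(squared_errors, c):
    """Fraction of test items whose squared reconstruction error is below c."""
    return np.mean(squared_errors < c)

c = 1.0
# Hypothetical autoencoders of growing size: mean squared error falls smoothly.
for mean_sq_err in (2.0, 1.6, 1.3, 1.05, 0.9, 0.7):
    # Per-image squared errors scattered tightly around the family's mean error.
    squared_errors = rng.normal(mean_sq_err, 0.05, size=10_000)
    print(f"MSE {mean_sq_err:.2f} -> thresholded metric "
          f"{thresholded_reconstruction(squared_errors, c):.2f}")
# The metric stays near 0 until the error distribution crosses c, then jumps to 1,
# even though the mean squared error itself decreases smoothly.
```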


Figure 8: Inducing emergent reconstruction capabilities in shallow nonlinear autoencoders. (A) Emergent capabilities based on the BIG-Bench Periodic Elements task in a published paper. (B) A shallow nonlinear autoencoder trained on CIFAR100 exhibits a smoothly decreasing mean squared reconstruction error. (C) Unpredictable changes were induced using the newly defined reconstruction metric (Equation 2).

Emergent classification of Omniglot characters in autoregressive Transformers

Finally, the researchers induced emergent abilities in Transformers trained autoregressively to classify Omniglot handwritten characters. The setup is similar: Omniglot images are embedded by a convolutional layer and then fed to a decoder-only Transformer as a sequence of [embedded image, image class label] pairs, with the Transformer trained to predict the Omniglot class labels. Image classification performance is measured on sequences of length L ∈ [1, 5], again using subset accuracy: the subset accuracy is 1 if all L images are classified correctly (Fig. 9B), and 0 otherwise. Under this metric, the causal Transformer appears to exhibit an emergent ability to classify Omniglot handwritten characters (Fig. 9C), qualitatively consistent with published emergent abilities such as on large-scale multitask language understanding (MMLU) (Fig. 9A).


Figure 9: Inducing emergent classification abilities in an autoregressive Transformer. (A) A published emergent ability on the MMLU benchmark. (B) Test accuracy of Transformers trained autoregressively to classify Omniglot handwritten characters grows as model parameters increase. (C) When accuracy is redefined as correctly classifying every image in a sequence, the metric appears to change sharply and unpredictably, giving the appearance of an induced emergent ability.
