After fine-tuning, the large model became more forgetful.

Datawhale curated content

Shared by: Professor Ma Yi's team; Source: Xinzhiyuan

[Introduction] The latest research from Professor Ma Yi's team shows that fine-tuning multimodal large language models (MLLMs) leads to catastrophic forgetting.

Catastrophic forgetting has become a hot research topic, and even GPT-4 cannot avoid it.

Recently, researchers from UC Berkeley, NYU, and other institutions discovered that fine-tuning multimodal large models can cause catastrophic forgetting.


Paper address: https://arxiv.org/abs/2309.10313

In the paper, the research team introduces EMT (Evaluating MulTimodality), the first evaluation framework for studying catastrophic forgetting in MLLMs.

After evaluating four models on multiple benchmarks, they found that most models failed to maintain classification performance comparable to their underlying vision encoder (CLIP).

At the same time, fine-tuning LLaVA on one dataset can lead to catastrophic forgetting on other datasets.


The EMT evaluation process for an MLLM is as follows:

(1) Each MLLM is prompted to act as an image classifier on input images drawn from a classification task; (2) the MLLM is required to answer explicitly with a single label from that task. Another LLM is then used to judge the correctness of each output.
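As an illustration, an EMT-style classification prompt for CIFAR-10 might look like the following (a hypothetical paraphrase; the paper's exact wording may differ):

```
What is the object in this image? Answer with exactly one label from:
airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.
```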

Professor Ma Yi also recommended this research: the performance gains obtained by fine-tuning on new tasks come at the cost of a significant decline in the model's previous capabilities.


Let's take a closer look.

After fine-tuning, the large model became more forgetful.

After GPT-4, research on multimodal large language models (MLLMs) has emerged in rapid succession.

A common practice in industry is to integrate a pre-trained vision encoder with an open-source LLM and perform instruction tuning to produce a vision-language model.

Although many fine-tuned MLLMs have demonstrated strong general vision-language understanding, these models still suffer from catastrophic forgetting.

That is, models tend to overfit the fine-tuning dataset, degrading performance on the tasks they were pre-trained on.

Catastrophic forgetting in image classification has been extensively studied in the fields of CV and ML.

However, recent developments in MLLMs have mainly focused on creating multimodal chatbots for visual question answering, without evaluating their basic image classification capabilities, let alone exploring catastrophic forgetting in MLLMs.

That said, previous MLLM evaluation frameworks have mainly focused on assessing "cognitive reasoning ability" or "hallucination", while ignoring catastrophic forgetting in MLLMs.

In summary, the latest research makes two key contributions:

- Proposed EMT, an evaluation framework specifically designed to evaluate catastrophic forgetting in MLLMs.

To the best of the researchers' knowledge, it is the first evaluation framework to study MLLM catastrophic forgetting through classification. Through EMT, the research team found that almost all tested models failed to retain the classification performance of their visual encoders.

- Performed fine-tuning experiments on LLaVA.

Experimental results show that moderate fine-tuning is beneficial for non-fine-tuned tasks, but excessive fine-tuning can ultimately lead to catastrophic forgetting in these tasks.

EMT: Evaluating Open Source Multimodal Large Models

Specifically, the working principle of EMT is as follows:

(1) First, images from the classification task are provided as input;

(2) Then, for each dataset, the tested MLLM is prompted to classify the input images, and its output is collected;

(3) Next, since the MLLM's output may not follow a specific format, the researchers use GPT-3.5 to evaluate classification correctness;

(4) Finally, the judged outputs are aggregated into the MLLM's prediction accuracy on each dataset (a minimal sketch of this loop follows).
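The loop below is a minimal Python sketch of these four steps. `query_mllm` and `judge_with_gpt35` are hypothetical placeholders standing in for the model under test and the GPT-3.5 judging call; neither is taken from the paper's code.

```python
from typing import Callable, List, Tuple

def emt_accuracy(
    dataset: List[Tuple[object, str]],            # (image, ground-truth label) pairs
    prompt: str,                                  # dataset-specific classification prompt
    query_mllm: Callable[[object, str], str],     # steps (1)-(2): prompt the MLLM
    judge_with_gpt35: Callable[[str, str], bool], # step (3): judge the free-form output
) -> float:
    """Return the MLLM's classification accuracy on one dataset."""
    correct = 0
    for image, label in dataset:
        output = query_mllm(image, prompt)        # collect the MLLM's free-form answer
        if judge_with_gpt35(output, label):       # GPT-3.5 decides if it matches the label
            correct += 1
    return correct / len(dataset)                 # step (4): per-dataset accuracy
```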

Catastrophic Forgetting in Open-Source MLLMs

The researchers first used EMT to evaluate four models: LLaVA, Otter, LENS, and InstructBLIP.

Their classification accuracy on MNIST, CIFAR-10, CIFAR-100, and miniImageNet is reported below; the radar charts are plotted against each model's underlying CLIP ViT encoder as the baseline.

Although most of the tested MLLMs failed to match the performance of their underlying vision encoders, a few points are worth noting:

- InstructBLIP-7b is the only exception, performing better than its vision encoder

- LENS has the worst overall classification performance among all tested models

EMT evaluation accuracy of different MLLMs on MNIST, CIFAR-10, CIFAR-100, and miniImageNet

Test predictions

The researchers analyzed the output results of different models on different data sets and identified three major factors that affect classification accuracy:


- Wrong predictions: As in other classification tasks, MLLMs sometimes simply predict the wrong label.

In the example below, LLaVA-7B misclassifies a 0 as an 8 on MNIST.


- Intrinsic hallucination: Tested MLLMs sometimes generate content that appears relevant but is incorrect or unverifiable; in short, output that directly contradicts the source content.

One example is asking LENS to classify CIFAR-10.

It is worth noting that the EMT prompt explicitly instructs the tested MLLM to identify only a single object from the class labels.

Despite these explicit instructions, LENS still produces essentially hallucinatory output: "airplane, car, bird, cat, deer, dog, frog, horse", an answer containing multiple labels.


- Extrinsic hallucination: The output has no verifiable connection to the original source content.

In the example below, although the generated text contains the label "aquarium fish," it also includes additional descriptors that are not only difficult to verify but also unrelated to the original request in the prompt.


Fine-tuning LLaVA

Next, the researchers used EMT to evaluate accuracy changes during LLaVA fine-tuning.

Here, they used LLaVA-7b and LLaVA-13b as the base MLLMs and conducted fine-tuning experiments on MNIST, CIFAR-10, CIFAR-100, and miniImageNet respectively.

Specifically, they fine-tuned (1) only the linear adapter layer (denoted linear), and (2) both the linear adapter layer and the LLM using LoRA (denoted lora); a sketch of both setups follows.
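The snippet below is a minimal sketch of the two setups, written against the Hugging Face LLaVA port and PEFT rather than the paper's actual codebase; the model name, module names, and LoRA hyperparameters are illustrative assumptions.

```python
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Illustrative checkpoint; the paper fine-tuned the original LLaVA models.
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

# (1) "linear": freeze everything except the vision-to-LLM projection layer.
for name, param in model.named_parameters():
    param.requires_grad = "multi_modal_projector" in name

# (2) "lora": additionally train LoRA adapters on the LLM's attention
# projections, keeping the projection layer trainable via modules_to_save.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],        # attention projections in the LLM
    modules_to_save=["multi_modal_projector"],  # projector stays trainable alongside LoRA
)
model = get_peft_model(model, lora_cfg)
```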

The figure below shows the results after 3 epochs of fine-tuning. While LLaVA's performance does improve on the fine-tuned dataset, the figure reveals a key issue with MLLM fine-tuning:

Fine-tuning an MLLM on one dataset degrades performance on another non-fine-tuned dataset.

This phenomenon, while not unexpected, is noteworthy: since the model is exposed to no dataset other than the fine-tuning one, it stands to reason that effects resembling catastrophic forgetting would appear.


Fine-tuning experiments show:

- Fine-tuning on one dataset can lead to catastrophic forgetting on other datasets, a phenomenon that occurs with both linear and LoRA fine-tuning

- LoRA fine-tuning leads to more forgetting than linear fine-tuning

Next, the researchers examined the fine-tuning process in more detail through accuracy curves.


It can be seen from the classification curve:

- Linear fine-tuning transfers across datasets: fine-tuning on one RGB dataset (CIFAR-10, CIFAR-100, miniImageNet) also improves accuracy on the other RGB datasets during the first epoch

- LoRA fine-tuning does not exhibit the transferability of linear fine-tuning

Test predictions

When the researchers examined the outputs of fine-tuned LLaVA, they found:

It hallucinates text related to its fine-tuning dataset while ignoring the question asked in the original prompt.

To further illustrate this phenomenon, the research team provides concrete examples of classification by LLaVA-7b and LLaVA-13b fine-tuned on different datasets, using EMT prompts.


The following demonstration shows that when the CIFAR-10 fine-tuned model is tested on CIFAR-10, LLaVA can indeed successfully identify the objects.

However, after fine-tuning on other datasets, the LLaVA model started to exhibit hallucinations in CIFAR-10 classification.


In this example, when the MNIST fine-tuned model classifies CIFAR-10, it only partially generates the keyword "airplane" while also hallucinating the digit "8".


In addition, the researchers observed similar phenomena in the CIFAR-100 and miniImageNet fine-tuned models.

Specifically, these fine-tuned models began to hallucinate, predicting categories similar or related to "airplane", such as "butterfly" with the CIFAR-100 model and "aircraft carrier" with the miniImageNet model.


The above examples show:

- Fine-tuning an MLLM does improve classification performance on the fine-tuned dataset

- Fine-tuning an MLLM can lead to catastrophic forgetting on other datasets, because the fine-tuned MLLM memorizes the fine-tuning dataset and produces hallucinated text

About the Authors

Yuexiang Zhai


Yuexiang Zhai is a doctoral student at the University of California, Berkeley, supervised by Professors Ma Yi and Sergey Levine.

Shengbang Tong (Tong Shengbang)


Peter Tong (Shengbang Tong, Tong Shengbang) is a doctoral student in computer science at NYU Courant. His advisors are Professor Yann LeCun and Professor Xie Saining.

Previously, he majored in Computer Science, Applied Mathematics (Honors), and Statistics (Honors) at the University of California, Berkeley. He was a researcher at the Berkeley Artificial Intelligence Laboratory (BAIR), and his mentors were Professor Ma Yi and Professor Jacob Steinhardt.

His research interests are world models, unsupervised/self-supervised learning, generative models, and multimodal models.

Xiao Li


Xiao Li is an Assistant Professor at the School of Data Science, The Chinese University of Hong Kong (Shenzhen).

Prior to that, he obtained his PhD at The Chinese University of Hong Kong from 2016 to 2020, under the supervision of Professor Thierry Blu and Professor Anthony Man-Cho So. He completed his undergraduate studies at Zhejiang University of Technology from 2012 to 2016.

Mu Cai


Mu Cai is a doctoral student in computer science at the University of Wisconsin-Madison under Professor Yong Jae Lee.

His research interests lie at the intersection of deep learning and computer vision, especially visual LLMs, 3D scene understanding, and self-supervised learning.

Qing Qu


Qing Qu is an Assistant Professor of ECE in the Department of Electrical Engineering and Computer Science in the College of Engineering at the University of Michigan, Ann Arbor. He is also affiliated with the Michigan Institute for Data Science (MIDAS), the Michigan Center for Applied and Interdisciplinary Mathematics (MCAIM), and the Michigan Institute for Computational Discovery and Engineering (MICDE).

He received his bachelor's degree from Tsinghua University in 2011 and his PhD from Columbia University in 2018. From 2018 to 2020, he served as a Moore-Sloan Fellow at the New York University Data Science Center.

He won the SPARS'15 Best Student Paper Award and received the 2016 Microsoft Research PhD Fellowship in machine learning. He received the NSF CAREER Award in 2022 and an Amazon AWS AI Award in 2023.

His research interests lie at the intersection of signal processing, data science, machine learning, and numerical optimization. He is particularly interested in computational methods for learning low-complexity models from high-dimensional data, using tools from machine learning, numerical optimization, and high-dimensional geometry with applications in imaging science and scientific discovery.

Recently, his main interest lies in understanding deep networks from a low-dimensional modeling perspective.

Yi Ma


Professor Ma Yi is a Fellow of the IEEE, ACM, and SIAM. He currently serves as Director of the Musketeers Foundation Institute of Data Science at the University of Hong Kong and is a professor in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley.

He received a bachelor's degree in automation and applied mathematics from Tsinghua University in 1995, a master's degree in mathematics and a master's degree in electrical engineering and computer science from the University of California, Berkeley, in 1997, and a PhD in electrical engineering and computer science from the same school in 2000.

Professor Ma taught in the Department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign (UIUC) from 2000 to 2011; from 2009 to 2014, he served as director and principal researcher of the Computer Vision Group at Microsoft Research Asia; from 2014 to 2017, he served as professor and executive dean of the School of Information Science and Technology at ShanghaiTech University; in 2018, he joined the faculty of the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley.

He has published more than 60 journal papers, 120 academic conference papers, and 3 textbooks in computer vision, generalized principal component analysis, and high-dimensional data analysis.

He received the NSF CAREER Award in 2004 and the ONR Young Investigator Award in 2005, and won the David Marr Prize for best paper at the International Conference on Computer Vision (ICCV) in 1999. He also received an honorable mention for the Best Paper Award at the 2004 European Conference on Computer Vision (ECCV) and the Best Academic Paper Award at the 2009 Asian Conference on Computer Vision (ACCV).

In addition, Professor Ma served as program chair of ICCV 2013 and general chair of ICCV 2015.

References:

https://yx-s-z.github.io/emt/




Origin: blog.csdn.net/Datawhale/article/details/133421324