UnIVAL: The first grand unified model supporting image, video, audio, and text tasks!

UnIVAL, the first unified model capable of supporting image, video, and audio-text tasks!



Large language models (LLMs) make the ambitious quest for generalist agents no longer a fantasy.

A key obstacle to building such general models is the diversity and heterogeneity of tasks and modalities.

One promising solution is unification: supporting many tasks and modalities within a single framework.

While large models such as Flamingo (Alayrac et al., 2022), trained on massive datasets, can support more than two modalities, current small- to mid-sized unified models are still limited to two, usually image-text or video-text.

The question we ask is: Is it possible to efficiently build a unified model that can support all modalities?


To answer this question, we propose UnIVAL, a step towards this ambitious goal.

Without relying on massive datasets or models with billions of parameters, the ~0.25B-parameter UnIVAL model goes beyond two modalities, unifying text, images, video, and audio in a single model.


Our model is pretrained effectively on many tasks, thanks to multimodal task balancing and multimodal curriculum learning; a minimal sketch of this training recipe follows the figures below.

Figure: Multimodal curriculum learning (MCL)
Figure: Multimodal task balancing
Figure: Knowledge transfer across tasks and data
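To make these two ingredients concrete, here is a minimal, hypothetical sketch of task-balanced sampling combined with a staged multimodal curriculum. The stage layout, task names, sampling weights, and the `model.training_step` / `loaders` interfaces are assumptions for illustration only, not the paper's released training code.

```python
import random

# Hypothetical curriculum: stage names, task sets, and sampling weights are
# illustrative only, not the exact schedule used to train UnIVAL.
CURRICULUM = [
    {"name": "stage1_image_text",
     "tasks": {"image_captioning": 0.5, "vqa": 0.3, "visual_grounding": 0.2}},
    {"name": "stage2_add_video",
     "tasks": {"image_captioning": 0.3, "vqa": 0.2, "visual_grounding": 0.1,
               "video_captioning": 0.2, "video_qa": 0.2}},
]

def sample_task(stage):
    """Multimodal task balancing: draw the next task from fixed per-task
    weights rather than in proportion to raw dataset sizes."""
    tasks, weights = zip(*stage["tasks"].items())
    return random.choices(tasks, weights=weights, k=1)[0]

def pretrain(model, loaders, steps_per_stage=10_000):
    """Multimodal curriculum learning: start from image-text tasks, then add
    new modalities stage by stage while keeping a balanced task mixture.
    `model.training_step` and `loaders` (a dict of infinite batch iterators
    keyed by task name) are assumed interfaces, not the released API."""
    for stage in CURRICULUM:
        for _ in range(steps_per_stage):
            task = sample_task(stage)
            batch = next(loaders[task])
            loss = model.training_step(task, batch)  # assumed to return a scalar loss
            loss.backward()
```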

UnIVAL achieves performance competitive with existing state-of-the-art methods across image-text and video-text tasks.

Figure: Fine-tuning on visual grounding tasks (RefCOCO, RefCOCO+, and RefCOCOg)
Figure: Fine-tuning on image-text understanding and generation tasks

The feature representations learned from the image-text and video-text modalities allow the model to achieve competitive performance when fine-tuned on audio-text tasks, despite never being pretrained on audio.

Figure: Video QA fine-tuning
Figure: Video captioning fine-tuning
Figure: Audio captioning fine-tuning
Figure: Evaluation without fine-tuning
Figure: Zero-shot evaluation

Benefiting from the unified model, we present a novel study of multimodal model merging via weight interpolation of models trained on different multimodal tasks, showing its benefits, especially for out-of-distribution generalization.
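As a rough illustration of what weight-interpolation merging looks like in practice, here is a minimal PyTorch sketch. The function and the checkpoint names in the usage comments are hypothetical and are not the authors' released implementation.

```python
import torch

def interpolate_weights(state_dict_a, state_dict_b, alpha=0.5):
    """Linear interpolation of two checkpoints sharing one architecture:
    w = (1 - alpha) * w_a + alpha * w_b. Floating-point tensors are
    interpolated; integer buffers are copied from the first checkpoint."""
    merged = {}
    for name, w_a in state_dict_a.items():
        w_b = state_dict_b[name]
        if torch.is_floating_point(w_a):
            merged[name] = (1.0 - alpha) * w_a + alpha * w_b
        else:
            merged[name] = w_a.clone()
    return merged

# Hypothetical usage: merge two checkpoints fine-tuned on different tasks.
# caption_ckpt = torch.load("unival_caption.pt")
# vqa_ckpt = torch.load("unival_vqa.pt")
# model.load_state_dict(interpolate_weights(caption_ckpt, vqa_ckpt, alpha=0.5))
```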


Finally, we motivate unification by demonstrating the synergy between tasks.

Summary

In this study, we introduce UnIVAL, the first unified model capable of supporting image, video, and audio-text tasks.

We achieve this with a relatively small model (~0.25B parameters) trained on relatively small datasets.

Our unified model is pretrained on multiple tasks and offers several advantages: it exploits the synergy between different tasks and modalities, enables more data-efficient training, and generalizes strongly to novel modalities and tasks.

The unified nature of our approach enables interesting techniques for merging models fine-tuned on different multimodal tasks: we show that, in addition to multi-task pretraining, task diversity can be further exploited through weight-interpolation merging.

Ultimately, we hope that our work will inspire the research community and accelerate progress in building modality-independent generalist assistant agents.





Origin blog.csdn.net/qq_27590277/article/details/132095170