Refreshing multiple SOTA results! Meta's new blockbuster AnyMAL: the multimodal version of Llama 2 is here!


Reprinted from: Heart of the Machine | Editor: Dapanji, Zenan

It has achieved the industry's best zero-shot performance in multiple benchmark tests.

A unified model from Meta, built on Llama 2, that understands inputs across different modalities (text, image, video, audio, and IMU motion sensor data) and generates text responses.

Yesterday, research on the multimodal large model AnyMAL attracted the attention of the AI research community.

Large language models (LLMs) are known for their enormous size and complexity, which greatly enhance machines' ability to understand and express human language. Progress in LLMs has also driven significant advances in the vision-language field, bridging the gap between image encoders and LLMs and combining their reasoning capabilities. Previous multimodal LLM research has focused either on models that combine text with just one other modality, such as text-and-image models, or on proprietary language models that are not open source.

If there were a better way to achieve multimodality, one in which the various modalities could all be embedded into an LLM, would it give us a different experience?

Figure: output examples

To address this, researchers from Meta recently introduced AnyMAL (Any-Modality Augmented Language Model): a collection of multimodal encoders trained to map data from various modalities (including images, video, audio, and IMU motion sensor data) into the text embedding space of an LLM.


AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

Paper address: https://arxiv.org/abs/2309.16058

According to reports, the main contributions of this research are as follows:

  • It proposes an efficient and scalable solution for building multimodal LLMs. The paper provides projection layers pre-trained on large datasets spanning multiple modalities (e.g., 200 million images, 2.2 million audio segments, 500,000 IMU time series, and 28 million video segments), all aligned to the same large model (LLaMA-2-70B-chat), enabling interleaved multimodal in-context prompting.

  • The paper further fine-tunes the model on a multimodal instruction set covering three modalities (image, video, and audio) and a variety of unconstrained tasks beyond the simple QA domain. Because this dataset consists of high-quality, manually collected instruction data, the paper also uses it as a benchmark for complex multimodal reasoning tasks.

  • Compared with models in the existing literature, the paper's best model achieves excellent zero-shot performance in both automatic and human evaluation across a variety of tasks and modalities: relative accuracy on VQAv2 improves by 7.0%, CIDEr on zero-shot COCO image captioning by 8.4%, and CIDEr on AudioCaps by 14.5%, setting new SOTA results.

Method

Figure 2: Method overview

Pre-training for modality alignment

The paper uses paired multimodal data (modality-specific signals and their text narratives) to pre-train the LLM for multimodal understanding, as shown in Figure 2. Specifically, a lightweight adapter is trained for each modality to project the input signal into the text token embedding space of a specific LLM. In this way, the LLM's text token embedding space becomes a joint token embedding space, in which tokens can represent either text or other modalities.
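As a rough illustration of this idea, the sketch below shows a minimal adapter of the kind described: a projection that maps frozen-encoder features into the LLM's token embedding space, after which the projected tokens are concatenated with ordinary text token embeddings. The dimensions, module choice, and variable names are assumptions for illustration, not the paper's exact design.

```python
# A minimal, illustrative sketch (not Meta's released code): a lightweight
# adapter projects features from a frozen modality encoder into the LLM's
# token embedding space, so modality "tokens" can be concatenated with text
# token embeddings. AnyMAL's actual projection modules vary per modality.
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    def __init__(self, feat_dim: int, llm_embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_embed_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_modality_tokens, feat_dim) from a frozen encoder
        return self.proj(feats)  # (batch, num_modality_tokens, llm_embed_dim)

# Usage sketch: build a joint token sequence for the (frozen) LLM.
adapter = ModalityAdapter(feat_dim=1024, llm_embed_dim=4096)  # 4096 = LLaMA-2-13B hidden size
image_feats = torch.randn(2, 32, 1024)        # placeholder output of an image encoder
modality_tokens = adapter(image_feats)        # (2, 32, 4096)
text_embeds = torch.randn(2, 16, 4096)        # stand-in for text token embeddings
llm_inputs = torch.cat([modality_tokens, text_embeds], dim=1)  # passed to the LLM as input embeddings
```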

For image alignment, the study uses a clean subset of the LAION-2B dataset, filtered with the CAT method and with any detectable faces blurred. For audio alignment, it uses the AudioSet (2.1M), AudioCaps (46K), and CLOTHO (5K) datasets, and for IMU-text alignment it uses the Ego4D dataset (528K).

For such large datasets, scaling pre-training to a 70B-parameter model is resource-intensive, often requiring FSDP wrappers to shard the model across multiple GPUs. To scale training effectively, the paper implements a quantization strategy (4-bit and 8-bit) in the multimodal setting, where the LLM part of the model is frozen and only the modality tokenizer is trainable. This reduces memory requirements by an order of magnitude: the 70B AnyMAL can be trained on a single 80GB-VRAM GPU with a batch size of 4. Compared with FSDP, the proposed quantization approach uses only half the GPU resources while achieving the same throughput.
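The training setup described above can be sketched roughly as follows, assuming the Hugging Face transformers/bitsandbytes stack; the checkpoint name, learning rate, and the stand-in linear projector are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the frozen-LLM + 4-bit quantization setup (assumed tooling, not Meta's code).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the LLM in 4-bit precision so it fits on a single GPU.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
llm = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-chat-hf",   # assumed checkpoint name
    quantization_config=bnb_config,
    device_map="auto",
)

# Freeze every LLM parameter; only the modality projector is trained.
for p in llm.parameters():
    p.requires_grad = False

projector = torch.nn.Linear(1024, llm.config.hidden_size)       # stand-in modality projector
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)  # updates adapter weights only
```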


Fine-tuning with multimodal instruction datasets

To further improve the model's ability to follow instructions across different input modalities, the study performs additional fine-tuning on a Multi-Modal Instruction Tuning (MM-IT) dataset. Specifically, the textual instruction and the modality input are concatenated as the prompt, so that the response target is conditioned on both. The study compares two options: (1) training only the projection layers without changing the LLM parameters, or (2) further adjusting the LM's behavior with low-rank adaptation (LoRA). Both manually collected instruction-tuning data and synthetic data are used.
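Option (2) can be sketched with the Hugging Face peft library; the rank, target modules, and checkpoint name below are illustrative assumptions rather than the paper's reported settings.

```python
# Sketch of low-rank adaptation (LoRA) on top of the language model.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # assumed checkpoint
lora_config = LoraConfig(
    r=16,                                 # illustrative rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_config)
llm.print_trainable_parameters()          # only the LoRA matrices (plus the modality projector) train
```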

Experiments and results

Image caption generation

Table 2 shows zero-shot image captioning performance on COCO and on the subset of the MM-IT dataset labeled with the "detailed description" task (MM-IT-Cap). The AnyMAL variants perform significantly better than the baselines on both datasets. Notably, there is no significant performance gap between the AnyMAL-13B and AnyMAL-70B variants, suggesting that the underlying LLM's capability has only a small impact on image captioning, which instead depends strongly on the scale of the data and the alignment method.


Human evaluation of multimodal reasoning tasks

Figure 3 shows that AnyMAL performs strongly compared to the baselines (LLaVA: 34.4% win rate; MiniGPT4: 27.0% win rate) and exhibits a relatively small gap to human-annotated ground-truth samples (41.1% win rate). Notably, the model fine-tuned with the full instruction set shows the highest preference win rate, demonstrating visual understanding and reasoning capabilities comparable to human-annotated responses. It is also worth noting that BLIP-2 and InstructBLIP perform poorly on these open-ended queries (4.1% and 16.7% preference win rates, respectively), despite performing well on the public VQA benchmarks (see Table 4).


VQA benchmark

Table 4 shows zero-shot performance on the Hateful Memes dataset, VQAv2, TextVQA, ScienceQA, VizWiz, and OKVQA, compared with models in the literature that report zero-shot results on the respective benchmarks. The study focuses on zero-shot evaluation to best estimate the model's performance on open-ended queries at inference time.


Video QA Benchmark

As shown in Table 6, the model was evaluated on three challenging video QA benchmarks.


Audio captioning

Table 5 shows audio captioning results on the AudioCaps benchmark dataset. AnyMAL significantly outperforms other state-of-the-art audio captioning models in the literature (e.g., +10.9 CIDEr points, +5.8 SPICE points), indicating that the proposed approach applies not only to vision but to various modalities. The 70B variant shows clear advantages over the 7B and 13B variants.


Interestingly, based on the manner, type, and timing of the AnyMAL paper's release, some have speculated that Meta may be planning to collect multimodal data through its newly launched mixed-reality/metaverse headsets. These research results may eventually be integrated into Meta's metaverse product line, or may soon appear in consumer applications.

Please read the original article for more details.

