ImageBind: the king of cross-modality, binding all six modalities in one model!


Reprinted from: Heart of the Machine

Meta's new open-source model, ImageBind, joins together data streams from six modalities, including text, video, and audio.

To the human senses, a single picture can bind many experiences together: a picture of a beach can remind us of the sound of waves, the texture of sand, and the breeze on our face, and can even inspire a poem. This "binding" property of images provides a rich source of supervision for learning visual features, by aligning an image with any sensory experience associated with it.

Ideally, visual features would be learned by aligning all senses into a single joint embedding space. However, this would require paired data for every combination of sensory types over the same set of images, which is obviously infeasible.

Recently, many methods have learned image features aligned with text, audio, and other signals. These methods use a single pair of modalities, or at most a few visual modalities, and the resulting embeddings are limited to the modality pairs used for training. As a result, video-audio embeddings cannot be used directly for image-text tasks, and vice versa. A major obstacle to learning a true joint embedding is the lack of large multimodal datasets in which all modalities co-occur.

Today, Meta AI proposes ImageBind, which learns a single shared representation space by leveraging multiple types of image-paired data. The approach does not require a dataset in which all modalities co-occur with each other; instead, it exploits the binding property of images: as long as each modality's embeddings are aligned with the image embeddings, all modalities quickly become aligned with one another. Meta AI has also released the corresponding code.


  • Homepage: https://imagebind.metademolab.com/

  • Paper address: https://dl.fbaipublicfiles.com/imagebind/imagebind_final.pdf

  • GitHub address: https://github.com/facebookresearch/ImageBind
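
For reference, here is a minimal usage sketch based on the repository's README; the module paths and loader functions follow the public release at the time of writing and may differ across versions, and the file names are placeholders:

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained ImageBind model (imagebind_huge downloads the released checkpoint).
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Prepare inputs for three of the six modalities (paths below are placeholders).
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(["a dog", "a car"], device),
    ModalityType.VISION: data.load_and_transform_vision_data(["dog.jpg", "car.jpg"], device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(["bark.wav", "engine.wav"], device),
}

# One forward pass returns an embedding per modality, all in the shared space.
with torch.no_grad():
    embeddings = model(inputs)

# Cross-modal similarity: e.g. which text best matches each audio clip.
print(torch.softmax(embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1))
```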

Specifically, ImageBind leverages web-scale (image, text) paired data and combines it with naturally occurring paired data such as (video, audio) and (image, depth) to learn a single joint embedding space. Doing so enables ImageBind to implicitly align text embeddings with the other modalities (such as audio and depth), enabling zero-shot recognition on those modalities without explicit semantic or textual pairing.
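
Conceptually, each non-image modality is tied to the shared space with an image-anchored contrastive (InfoNCE-style) objective over naturally paired samples. The sketch below is an illustrative reconstruction under that assumption, not the paper's exact loss or architecture; `image_encoder`, `audio_encoder`, and `loader` are hypothetical placeholders:

```python
import torch
import torch.nn.functional as F

def infonce_align(image_emb: torch.Tensor, other_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE loss pulling paired (image, other-modality) embeddings together.

    image_emb, other_emb: (batch, dim) embeddings of naturally paired samples,
    e.g. video frames with their audio, or RGB images with their depth maps.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)
    logits = image_emb @ other_emb.T / temperature               # (batch, batch) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)    # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Training-loop sketch: the image encoder stays frozen, only the new-modality encoder learns.
# for images, audio in loader:
#     with torch.no_grad():
#         img_z = image_encoder(images)
#     loss = infonce_align(img_z, audio_encoder(audio))
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```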


Figure 2 below gives an overall overview of ImageBind.


At the same time, the researchers show that ImageBind can be initialized with large-scale vision-language models such as CLIP, thereby taking advantage of those models' rich image and text representations. As a result, ImageBind can be applied to a variety of different modalities and tasks with minimal training.

ImageBind is part of Meta's effort to create multimodal AI systems that can learn from all relevant types of data. As the number of modalities increases, ImageBind opens the floodgates for researchers to try to develop new holistic systems, such as combining 3D and IMU sensors to design or experience immersive virtual worlds. In addition, it can provide a rich way to explore memory, that is, to search for images, videos, audio files or text information using a combination of text, video and images.

Binding content and images, learning a single embedding space

Humans can learn new concepts from very few samples: after reading a description of an animal, we can recognize it in real life; from a photo of an unfamiliar car model, we can predict the likely sound of its engine. This is partly because a single image can "bundle" an entire sensory experience together. In the field of artificial intelligence, however, even as the number of modalities keeps increasing, the lack of multisensory data limits standard multimodal learning, which requires paired data.

Ideally, a joint embedding space spanning different kinds of data would allow the model to learn other modalities alongside visual features. Previously, this typically meant collecting all possible combinations of paired data across all modalities in order to learn a joint embedding space.

ImageBind circumvents this difficulty by exploiting recent large-scale vision-language models: it extends their zero-shot capabilities to new modalities that pair naturally with images, such as video-audio and image-depth data, to learn a joint embedding space. For the other four modalities (audio, depth, thermal, and IMU readings), the researchers used naturally paired, self-supervised data.


By aligning the embeddings of the six modalities into a common space, ImageBind can retrieve different types of content across modalities even when they were never observed together, add embeddings of different modalities to naturally compose their semantics, and use Meta AI's audio embeddings with a pre-trained DALLE-2 decoder (originally designed for CLIP text embeddings) for audio-to-image generation.
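
Since every modality lands in the same vector space, cross-modal retrieval and semantic composition reduce to simple vector operations. A hedged sketch of both, where the precomputed embeddings (`beach_img_z`, `rain_audio_z`, `image_gallery_z`) are hypothetical placeholders:

```python
import torch
import torch.nn.functional as F

def top_k_matches(query: torch.Tensor, gallery: torch.Tensor, k: int = 5):
    """Rank gallery embeddings (e.g. images) against a query from another modality (e.g. audio)."""
    sims = F.normalize(query, dim=-1) @ F.normalize(gallery, dim=-1).T
    return sims.topk(k, dim=-1)  # (values, indices) of the k closest gallery items

# Semantic composition by adding embeddings from different modalities,
# e.g. an image of a beach + the sound of rain -> retrieve "rainy beach" images.
# beach_img_z, rain_audio_z, image_gallery_z are hypothetical precomputed embeddings.
# combined = F.normalize(beach_img_z + rain_audio_z, dim=-1)
# values, indices = top_k_matches(combined, image_gallery_z)
```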

A huge number of images appear together with text on the Internet, so training image-text models has been studied extensively. ImageBind takes advantage of the binding property of images, which can connect to many modalities: web data connects text to images, and video captured by wearable cameras with IMU sensors connects motion to video.

Visual representations learned from large-scale web data can serve as targets for learning the features of other modalities. This lets ImageBind align images with whatever modalities co-occur with them, which in turn naturally aligns those modalities with each other. Modalities that correlate strongly with images, such as thermal maps and depth maps, are easier to align. Non-visual modalities such as audio and IMU (inertial measurement unit) readings are harder: certain sounds, such as a baby crying, can accompany many different visual scenes.

ImageBind shows that image-paired data is sufficient to bind the six modalities together. The model can interpret content more holistically, allowing different modalities to "talk" to each other and find connections even when they are never observed together. For example, ImageBind can link audio and text without ever seeing them paired. This lets other models "understand" new modalities without any resource-intensive training.
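
This emergent alignment is what enables zero-shot recognition on non-image modalities: class names are embedded as text, the input (say, an audio clip) is embedded with its own encoder, and the nearest text embedding gives the prediction, even though audio and text were never trained against each other. A minimal sketch, assuming the embeddings have already been computed with ImageBind-style encoders:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(audio_emb, class_text_embs, class_names, temperature=0.07):
    """Classify audio clips by their nearest class-name text embedding in the shared space."""
    audio_emb = F.normalize(audio_emb, dim=-1)            # (num_clips, dim)
    class_text_embs = F.normalize(class_text_embs, dim=-1)  # (num_classes, dim)
    probs = torch.softmax(audio_emb @ class_text_embs.T / temperature, dim=-1)
    best = probs.argmax(dim=-1)
    return [class_names[int(i)] for i in best], probs
```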

The strong scaling behavior of ImageBind allows it to replace or augment many AI models, enabling them to use other modalities. For example, while Make-A-Scene can generate images from a text prompt, ImageBind could upgrade it to generate images from audio, such as laughter or the sound of rain.

Excellent performance of ImageBind

Meta's analysis shows that the scaling behavior of ImageBind improves with the strength of the image encoder. In other words, ImageBind's ability to align modalities scales with the power and size of the vision model. This suggests that larger vision models are beneficial even for non-vision tasks, such as audio classification, and that the benefits of training such models extend beyond computer vision.

In experiments, Meta uses ImageBind's audio and depth encoders and compares them to previous work on zero-shot retrieval and audio and depth classification tasks.


On these benchmarks, ImageBind outperforms expert models on audio and depth.

Meta found that ImageBind can be used for few-shot audio and depth classification tasks and outperforms previous bespoke methods. For example, ImageBind significantly outperforms Meta's self-supervised AudioMAE model trained on AudioSet, as well as a supervised AudioMAE model fine-tuned for audio classification.
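
For intuition, few-shot evaluation on frozen embeddings is commonly done with a linear probe; the sketch below illustrates that standard recipe under this assumption and is not necessarily the paper's exact protocol:

```python
import torch
import torch.nn as nn

def linear_probe(train_embs: torch.Tensor, train_labels: torch.Tensor,
                 num_classes: int, epochs: int = 100, lr: float = 1e-2) -> nn.Linear:
    """Fit a linear classifier on a handful of labelled, frozen ImageBind embeddings."""
    probe = nn.Linear(train_embs.shape[-1], num_classes)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.cross_entropy(probe(train_embs), train_labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return probe
```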

Furthermore, ImageBind achieves new SOTA performance on cross-modal zero-shot recognition tasks, outperforming even state-of-the-art models trained specifically to recognize concepts in those modalities.

Reference link: https://ai.facebook.com/blog/imagebind-six-modalities-binding-ai/
