ImageBind: One Embedding Space To Bind Them All

  • Meta AI
  • 2023.5.9

abstract

  • Question: When humans perceive the world, multiple senses (vision, hearing, smell, touch) receive information at the same time. Current multimodal work, however, mostly models the interaction between just two modalities, such as image-text, speech-text, or image-audio pairs, and does not truly bring three or more modalities (e.g., vision, audio, and text) into one space. The main difficulty is that mapping many modalities into the same space would seem to require data in which each sample is described in all of those modalities, and no such dataset exists.
  • Idea: Align each modality to images, using images as the bridge, to obtain a joint embedding space over six modalities: images, text, audio, depth (3D), thermal (infrared), and IMU (motion) data. The model can be initialized from CLIP, which already models image-text alignment well, and trained further to obtain out-of-the-box cross-modal capabilities (zero-shot classification and retrieval performance), or fine-tuned with relatively little data for other modality-specific scenarios.

method

[Figure: ImageBind overview, aligning each modality's embedding to the image embedding in one joint space]

  • Using images as the intermediate (binding) modality, large amounts of paired data such as <image, text>, <image, thermal>, and <audio, image> are readily available. As shown in the figure above, every modality is mapped into one shared space, so even modality pairs that never co-occur in the training data can be related;
  • $(I, M)$ denotes a constructed pair, where $I$ is an image and $M$ is any other modality. After passing through the respective encoders $f$ and $g$, the embeddings $q_i = f(I_i)$ and $k_i = g(M_i)$ are optimized with the contrastive InfoNCE loss; in practice the symmetric loss $L_{I,M} + L_{M,I}$ is used (see the sketch after this list);
  • During training it is found that aligning $(I, M_1)$ and $(I, M_2)$ also produces an emergent alignment between $(M_1, M_2)$, which means ImageBind can perform zero-shot cross-modal retrieval. Experiments show that, using only text prompts and no audio-text training data, it reaches SOTA-level text-audio classification results.
  • Model structure: image/text/audio/thermal/depth/IMU each have their own encoder, with images and video sharing one; each modality branch is encoder + linear + norm, where the final linear layer projects to a common dimension so that the InfoNCE loss can be computed, and the norm helps the model converge. Pre-trained models can be reused (the image/text encoders come from CLIP). When training on modalities other than image/text, these two encoders are frozen and only the corresponding modality's encoder parameters are updated (see the sketch after this list).
  • Per-modality data preprocessing is as follows
    [Figures: per-modality input preprocessing details]
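
Below is a minimal PyTorch sketch of the training setup described above: each modality branch is encoder + linear + norm, the CLIP-initialized image encoder stays frozen, and each (image, M) pair is optimized with the symmetric InfoNCE loss. The class and function names, the 1024-d shared embedding size, and the 0.07 temperature are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityBranch(nn.Module):
    """Encoder trunk + linear projection + normalization.

    `trunk` is any backbone (e.g. a ViT for images, a Transformer over audio
    spectrograms) assumed to return (B, trunk_dim) features; names and sizes
    here are illustrative.
    """
    def __init__(self, trunk: nn.Module, trunk_dim: int, embed_dim: int = 1024):
        super().__init__()
        self.trunk = trunk
        self.proj = nn.Linear(trunk_dim, embed_dim)

    def forward(self, x):
        feats = self.trunk(x)            # (B, trunk_dim)
        emb = self.proj(feats)           # project to the shared dimension
        return F.normalize(emb, dim=-1)  # L2-normalize before the contrastive loss


def info_nce(q, k, temperature: float = 0.07):
    """One direction of InfoNCE: q_i should match k_i, not k_j for j != i."""
    logits = q @ k.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)


def symmetric_loss(q, k, temperature: float = 0.07):
    """L_{I,M} + L_{M,I}, the symmetric loss used in practice."""
    return info_nce(q, k, temperature) + info_nce(k, q, temperature)


def train_step(image_branch, modality_branch, images, modality_inputs, optimizer):
    """One update for an (image, M) pair; only the M branch is trained."""
    with torch.no_grad():                # image (and text) encoders are frozen
        q = image_branch(images)
    k = modality_branch(modality_inputs)
    loss = symmetric_loss(q, k)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```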

experiment

downstream task performance

  • zero-shot:
    • With only image-text paired data (no text paired directly with the other modalities), zero-shot classification of the other modalities via text prompts is comparable to, or even better than, modality-specific approaches (Table 2); a sketch is given after this list;
    • Without any audio-text training data, zero-shot audio classification and audio retrieval also perform well (Table 3 & Table 4)
  • few-shot: training linear classifiers on the frozen embeddings; on audio classification this beats self-supervised AudioMAE and is comparable to supervised AudioMAE; on depth classification it beats MultiMAE (Figure 3)
  • Extended tasks:
    • The embeddings of two modalities (e.g., an image and an audio clip) can be summed and used as a query, retrieving images that match both (Figure 4); a sketch is given after this list;
    • A detector driven by text prompts can also have the text embedding replaced by an audio embedding as the prompt, enabling audio-prompted detection (Figure 5)
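
A rough sketch of how the zero-shot classification and the embedding-arithmetic retrieval above can be run on top of the joint space. It assumes modality branches like the ones sketched in the method section (each returning L2-normalized embeddings, with the text branch accepting raw strings); the prompt template and function names are hypothetical, not ImageBind's actual API.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_classify(audio_branch, text_branch, audio_batch, class_names,
                       template="a sound of {}."):
    """Zero-shot audio classification with text prompts: no audio-text
    training data, only the shared embedding space."""
    prompts = [template.format(c) for c in class_names]
    text_emb = text_branch(prompts)        # (C, D), L2-normalized class embeddings
    audio_emb = audio_branch(audio_batch)  # (B, D), L2-normalized audio embeddings
    logits = audio_emb @ text_emb.t()      # cosine similarity to each class prompt
    return logits.argmax(dim=-1)           # predicted class index per audio clip


@torch.no_grad()
def multimodal_retrieval(image_branch, query_embs, gallery_images, top_k=5):
    """Embedding arithmetic: sum the embeddings of two query modalities
    (e.g. an image and an audio clip) and retrieve gallery images that
    reflect both."""
    query = F.normalize(sum(query_embs), dim=-1)  # (D,) combined query
    gallery = image_branch(gallery_images)        # (N, D) gallery embeddings
    scores = gallery @ query                      # similarity to the combined query
    return scores.topk(top_k).indices
```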

  • The paper also reports comparisons of encoder structures and ablations on scaling and their effect on performance; see the paper for details.

Origin blog.csdn.net/qq_40168949/article/details/130600110