Reference Code
ImageBind: facebookresearch/ImageBind on GitHub (ImageBind: One Embedding Space To Bind Them All)
ImageBind + stable-diffusion-2-1-unclip: Zeqiang-Lai/Anything2Image on GitHub (Generate image from anything with ImageBind and Stable Diffusion)
ImageBind, which has drawn a lot of attention recently, learns a single shared representation space from data in which images are paired with one other modality (text, audio, depth, thermal, IMU). It does not require datasets in which all modalities appear together. Instead it relies on the binding property of images: once the embedding of each modality is aligned with the image embedding, all modalities end up aligned with one another.
Judging from the open-source ImageBind code, however, the authors released only the encoder part (which maps data of different modalities into the aligned embedding space), so functions such as text2img and audio2img cannot be implemented with it directly. To get these functions, community developers combined the "unified latent space" provided by ImageBind with the decoder in Stable Diffusion. If you are interested, search for Anything2Image or BindDiffusion on GitHub. Here I follow the ImageBind and Anything2Image code and reproduce audio+img to img, text to img, and other functions. For the dependencies, refer to ImageBind (pip install -r requirements.txt) and additionally install diffusers (pip install diffusers).
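To make the shared-space idea concrete, here is a toy sketch in plain Python. The three-dimensional vectors are made up for illustration and are not ImageBind outputs (real embeddings are much higher-dimensional), and the helper names `cosine` and `nearest` are hypothetical. It only shows that when audio and text embeddings each sit close to the matching image embedding, audio-to-text retrieval by cosine similarity works without any audio-text pairs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, candidates):
    """Return the key in `candidates` whose vector is most similar to `query`."""
    return max(candidates, key=lambda name: cosine(query, candidates[name]))

# Made-up embeddings: each modality is assumed to have been aligned to images only.
image_emb = {"dog": [1.0, 0.0, 0.0], "sea": [0.0, 1.0, 0.0]}
audio_emb = {"bark.wav": [0.9, 0.1, 0.1], "wave.wav": [0.1, 0.9, 0.1]}
text_emb = {"a dog": [0.8, 0.2, 0.0], "the sea": [0.2, 0.8, 0.0]}

# Audio -> text and audio -> image retrieval in the shared space.
for clip, vec in audio_emb.items():
    print(clip, "->", nearest(vec, text_emb), "/", nearest(vec, image_emb))
```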
Code example
import sys

import torch
from diffusers import StableUnCLIPImg2ImgPipeline

# Make the ImageBind "models" package in the parent directory importable.
sys.path.append("..")
from models import data
from models import imagebind_model
from models.imagebind_model import ModalityType

# ImageBind encoder and Stable unCLIP decoder, both on the GPU.
model = imagebind_model.imagebind_huge(pretrained=True).to("cuda").eval()
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16).to("cuda")

# Note: the normalize argument is assumed to come from the Anything2Image
# variant of imagebind_model; the upstream forward() may not accept it.
with torch.no_grad():
    # Image branch
    image_path = ["/kaxier01/projects/GPT/ImageBind/assets/image/bird.png"]
    embeddings = model.forward({ModalityType.VISION: data.load_and_transform_vision_data(image_path, "cuda")}, normalize=False)
    img_embeddings = embeddings[ModalityType.VISION]
    # Audio branch
    audio_path = ["/kaxier01/projects/GPT/ImageBind/assets/wav/wave.wav"]
    embeddings = model.forward({ModalityType.AUDIO: data.load_and_transform_audio_data(audio_path, "cuda")}, normalize=True)
    audio_embeddings = embeddings[ModalityType.AUDIO]
    # Average the two embeddings and decode the result into an image.
    embeddings = (img_embeddings + audio_embeddings) / 2
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("/kaxier01/projects/GPT/ImageBind/results/bird_wave_audioimg2img.png")
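The 50/50 average in the code above is only one possible blend. As a minimal sketch, assuming you want to control how strongly each modality influences the result, the average can be generalized to a weighted sum. The helper `mix_embeddings` and its `alpha` parameter are hypothetical and not part of ImageBind or Anything2Image. The function relies only on `*` and `+`, so it should work on torch tensors of the same shape as well as on the plain floats used here for demonstration.

```python
def mix_embeddings(img_emb, audio_emb, alpha=0.5):
    """Blend two embeddings: alpha weights the image, (1 - alpha) the audio.

    Works on anything that supports scalar multiplication and addition
    (for example torch tensors of the same shape). Hypothetical helper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * img_emb + (1.0 - alpha) * audio_emb

# alpha=0.5 reproduces the plain average (a + b) / 2.
print(mix_embeddings(2.0, 4.0))        # 3.0
print(mix_embeddings(2.0, 4.0, 0.75))  # 2.5
```

With the tensors from the example, this would be called as `mix_embeddings(img_embeddings, audio_embeddings, alpha=0.7)` before passing the result to `pipe`.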
Problems encountered and solutions
The main problem I ran into was that the model download timed out. There are two ways to solve it:
Method one:
Go to the Hugging Face website, search for the model, and download it (preferably all of its files).
After downloading, specify the local model path in the code, for example:
# Model path: "/kaxier01/projects/GPT/ImageBind/checkpoints/stable-diffusion-2-1-unclip"
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained("/kaxier01/projects/GPT/ImageBind/checkpoints/stable-diffusion-2-1-unclip", torch_dtype=torch.float16).to("cuda")
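To avoid editing the path by hand, one option is to fall back to the Hub id only when no local copy exists. This is a sketch under assumptions: `resolve_model_source` is a hypothetical helper, and it treats a directory as a usable checkpoint when it contains `model_index.json`, the file that diffusers keeps at the top level of a saved pipeline.

```python
import os

def resolve_model_source(local_dir, repo_id):
    """Return local_dir if it looks like a downloaded pipeline, else repo_id."""
    index_file = os.path.join(local_dir, "model_index.json")
    if os.path.isfile(index_file):
        return local_dir
    return repo_id

source = resolve_model_source(
    "/kaxier01/projects/GPT/ImageBind/checkpoints/stable-diffusion-2-1-unclip",
    "stabilityai/stable-diffusion-2-1-unclip",
)
# Pass this string to StableUnCLIPImg2ImgPipeline.from_pretrained.
print(source)
```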
Method two:
Install git-lfs:
apt-get update
apt-get install git-lfs
git lfs install
After installing, you can download the model with a command such as:
git lfs clone https://huggingface.co/stabilityai/stable-diffusion-2-1-unclip
Results
thermal2img
input
output
audio+img2img
input
Audio (wave.wav) + image
output
text2img
input
'a photo of an astronaut riding a horse on mars'
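The text2img result can be reproduced along the same lines as the audio+img2img example. The following is a sketch, not a tested script: it assumes the same `models` package and the same `normalize` argument as the code example, a CUDA GPU, and a hypothetical output path `text2img.png`. The wrapper name `text_to_image` is also hypothetical. The heavy imports are deferred into the function so that defining it does not require the model weights.

```python
def text_to_image(prompt, out_path="text2img.png", device="cuda"):
    """Encode `prompt` with ImageBind and decode it with Stable unCLIP."""
    # Deferred imports: these need torch, diffusers and the ImageBind repo.
    import torch
    from diffusers import StableUnCLIPImg2ImgPipeline
    from models import data, imagebind_model
    from models.imagebind_model import ModalityType

    model = imagebind_model.imagebind_huge(pretrained=True).to(device).eval()
    pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
    ).to(device)

    with torch.no_grad():
        inputs = {ModalityType.TEXT: data.load_and_transform_text([prompt], device)}
        # normalize=False mirrors the image branch of the code example (assumption).
        text_embeddings = model.forward(inputs, normalize=False)[ModalityType.TEXT]
        images = pipe(image_embeds=text_embeddings.half()).images
    images[0].save(out_path)
    return out_path
```

Calling `text_to_image('a photo of an astronaut riding a horse on mars')` would then write the generated image to `out_path`.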