ImageBind and Stable Diffusion usage notes

Reference Code

ImageBind: GitHub - facebookresearch/ImageBind (ImageBind: One Embedding Space to Bind Them All)

ImageBind + stable-diffusion-2-1-unclip: GitHub - Zeqiang-Lai/Anything2Image (Generate image from anything with ImageBind and Stable Diffusion)

The recently popular ImageBind learns a single shared representation space from multiple types of image-paired data (text, audio, depth, thermal, IMU). ImageBind does not require datasets in which all modalities appear at the same time; instead it exploits the binding property of images: as long as each modality's embedding is aligned with the image embedding, all modalities end up aligned with one another.
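
To make the binding idea concrete, the snippet below is a minimal retrieval-style sketch modeled on the usage example in the ImageBind repository: because every modality is aligned to the image embedding, cross-modal similarity is just a dot product in the shared space. The text prompts and image paths are placeholders, and the import layout assumes the same project structure as the main example further down.

import torch
import sys
sys.path.append("..")
from models import data
from models import imagebind_model
from models.imagebind_model import ModalityType

device = "cuda"
model = imagebind_model.imagebind_huge(pretrained=True).to(device).eval()

# Each list index describes the same concept in a different modality
text_list = ["a dog", "a car", "a bird"]
image_paths = ["assets/image/dog.png", "assets/image/car.png", "assets/image/bird.png"]  # placeholder paths

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
}
with torch.no_grad():
    embeddings = model(inputs)

# Cross-modal similarity: rows are images, columns are texts; the diagonal should dominate
sim = torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1)
print(sim)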

However, judging from ImageBind's open-source code, the authors only released the encoder part (which maps data from different modalities into the aligned embedding space), so functions such as text2img and audio2img cannot be used directly. To get them, people have combined the "unified latent space" provided by ImageBind with the decoder from Stable Diffusion; if you are interested, search for Anything2Image or BindDiffusion on GitHub. Here I follow the code of ImageBind and Anything2Image and reproduce audio+img to img, text to img, and other functions. For dependencies, refer to ImageBind (pip install -r requirements.txt) plus diffusers (pip install diffusers).

Code example

import torch
from diffusers import StableUnCLIPImg2ImgPipeline
import sys
sys.path.append("..")
from models import data
from models import imagebind_model
from models.imagebind_model import ModalityType


# Load the ImageBind encoder and the Stable unCLIP img2img pipeline (used here as the decoder)
model = imagebind_model.imagebind_huge(pretrained=True).to("cuda").eval()
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16).to("cuda")

with torch.no_grad():
    ## image: encode the reference image into the shared embedding space
    image_path = ["/kaxier01/projects/GPT/ImageBind/assets/image/bird.png"]
    embeddings = model.forward({ModalityType.VISION: data.load_and_transform_vision_data(image_path, "cuda")}, normalize=False)
    img_embeddings = embeddings[ModalityType.VISION]
    ## audio: encode the audio clip (normalized, unlike the image embedding)
    audio_path = ["/kaxier01/projects/GPT/ImageBind/assets/wav/wave.wav"]
    embeddings = model.forward({ModalityType.AUDIO: data.load_and_transform_audio_data(audio_path, "cuda")}, normalize=True)
    audio_embeddings = embeddings[ModalityType.AUDIO]
    ## fuse the two modalities by averaging their embeddings
    embeddings = (img_embeddings + audio_embeddings) / 2

    ## pass the fused ImageBind embedding to the pipeline in place of a CLIP image embedding
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("/kaxier01/projects/GPT/ImageBind/results/bird_wave_audioimg2img.png")
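
The same pipeline also gives text2img: encode the prompt with ImageBind and hand the text embedding to the pipeline in place of an image embedding, which is the approach Anything2Image takes. Below is a minimal sketch under that assumption; the prompt, output path, and the normalize flag for the text branch are my own choices rather than anything fixed by the code above.

with torch.no_grad():
    ## text: encode the prompt into the shared embedding space
    text_list = ["a photo of an astronaut riding a horse on mars"]
    embeddings = model.forward({ModalityType.TEXT: data.load_and_transform_text(text_list, "cuda")}, normalize=False)
    text_embeddings = embeddings[ModalityType.TEXT]

    ## reuse the unCLIP pipeline: the ImageBind text embedding stands in for a CLIP image embedding
    images = pipe(image_embeds=text_embeddings.half()).images
    images[0].save("/kaxier01/projects/GPT/ImageBind/results/astronaut_text2img.png")  # placeholder output path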

Problems encountered and solutions

The main problem encountered here was model download timeouts. The workarounds are as follows:

Method one:

Go to the official Hugging Face website (huggingface.co), search for the model, and download it (preferably all of the files in the repository).

After downloading, just point the code at the local model path, for example:

# Model path: "/kaxier01/projects/GPT/ImageBind/checkpoints/stable-diffusion-2-1-unclip"
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained("/kaxier01/projects/GPT/ImageBind/checkpoints/stable-diffusion-2-1-unclip", torch_dtype=torch.float16).to("cuda")

Method two:

Download and set up git-lfs:

apt-get update
apt-get install git-lfs
git lfs install

Once it is installed, you can download the model with git-lfs, for example:

git lfs clone https://huggingface.co/stabilityai/stable-diffusion-2-1-unclip

Results

thermal2img

Input: thermal image
Output: generated image

audio+img2img

Input: audio (wave.wav) + picture
Output: generated image

text2img

Input: 'a photo of an astronaut riding a horse on mars'
Output: generated image


Origin: blog.csdn.net/qq_38964360/article/details/130886233