[Computer Vision] BLIP: example demos with source code

1. Image Captioning

First, set up the environment:

import sys
if 'google.colab' in sys.modules:
    print('Running in Colab.')
    !pip3 install transformers==4.15.0 timm==0.4.12 fairscale==0.4.4
    !git clone https://github.com/salesforce/BLIP
    %cd BLIP

This code sets things up for the Google Colab environment. It first checks whether it is running in Colab ('google.colab' in sys.modules). If so, it uses pip3 to install specific versions of the transformers, timm, and fairscale packages, clones the GitHub repository named "BLIP" with the git clone command, and finally uses the %cd magic command to change the current working directory into the "BLIP" repository.

The purpose of this code is to prepare the environment in Google Colab so that the rest of the code from the "BLIP" repository can be executed.
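If you are not working in Colab, a rough local equivalent of the same setup is sketched below. It simply runs the same commands via subprocess and assumes pip3 and git are available on your PATH; treat it as a sketch rather than an official installation procedure.

import os
import subprocess

# Install the same pinned package versions used in the Colab cell above.
subprocess.run(['pip3', 'install', 'transformers==4.15.0', 'timm==0.4.12', 'fairscale==0.4.4'], check=True)

# Clone the BLIP repository and change into it so that the `models` package can be imported.
subprocess.run(['git', 'clone', 'https://github.com/salesforce/BLIP'], check=True)
os.chdir('BLIP')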

from PIL import Image
import requests
import torch
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def load_demo_image(image_size,device):
    img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
    raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')   

    w,h = raw_image.size
    display(raw_image.resize((w//5,h//5)))
    
    transform = transforms.Compose([
        transforms.Resize((image_size,image_size),interpolation=InterpolationMode.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))
        ]) 
    image = transform(raw_image).unsqueeze(0).to(device)   
    return image

This code is used to load the demo image and preprocess it for subsequent computer vision tasks. Let's break down the code line by line:

  1. from PIL import Image: Import the Image module in the PIL library for image processing.

  2. import requests: Import the requests library, used to get images from the network.

  3. import torch: Import the PyTorch library for deep learning tasks.

  4. from torchvision import transforms: Import the transforms module from the torchvision library for image preprocessing.

  5. from torchvision.transforms.functional import InterpolationMode: Import InterpolationMode from the torchvision.transforms.functional module to specify the interpolation method of the image.

  6. device = torch.device('cuda' if torch.cuda.is_available() else 'cpu'): Check whether a GPU is available; if so, set the device to cuda, otherwise to cpu. Subsequent computations will run on this device.

  7. def load_demo_image(image_size, device):: Defines a function named load_demo_image that takes the image size image_size and the computing device device as input parameters.

  8. img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg': defines the URL of the demo image.

  9. raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB'): downloads a raw image from the given URL, and uses the Image module in the PIL library to open and convert the image format to RGB.

  10. w, h = raw_image.size: Get the width and height of the raw image.

  11. display(raw_image.resize((w//5, h//5))): Use the display function (provided by IPython in notebook environments) to show a preview of the image, shrunk to one fifth of its original width and height.

  12. transform = transforms.Compose([…]): Defines a transformation chain for image preprocessing, including image resizing, image conversion to tensor, and normalization.

  13. image = transform(raw_image).unsqueeze(0).to(device): Preprocess the raw image and convert it to a tensor. unsqueeze(0) adds a batch dimension, turning the tensor from [C, H, W] into [1, C, H, W] to match the input shape expected by the model. Finally, the processed image tensor is moved to the previously selected computing device.

  14. return image: Returns the preprocessed image tensor.

What this code does is load the demo image and preprocess it into a tensor suitable for subsequent computer vision tasks. When calling the function, pass in the required image size and computing device; the returned image tensor can then be fed to a model for inference and analysis.
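As a quick sanity check, the function can be called on its own. The sketch below assumes the imports and the load_demo_image definition above have already been run; the shape follows from the transforms (resize to image_size x image_size, three RGB channels, plus the batch dimension added by unsqueeze(0)).

# Minimal usage sketch for load_demo_image:
image = load_demo_image(image_size=384, device=device)
print(image.shape)   # expected: torch.Size([1, 3, 384, 384])
print(image.device)  # cuda:0 if a GPU was detected, otherwise cpu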

from models.blip import blip_decoder

image_size = 384
image = load_demo_image(image_size=image_size, device=device)

model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth'
    
model = blip_decoder(pretrained=model_url, image_size=image_size, vit='base')
model.eval()
model = model.to(device)

with torch.no_grad():
    # beam search
    caption = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5) 
    # nucleus sampling
    # caption = model.generate(image, sample=True, top_p=0.9, max_length=20, min_length=5) 
    print('caption: '+caption[0])

A line-by-line breakdown:

  1. from models.blip import blip_decoder: Import the blip_decoder model from the "BLIP" repository; this is the caption-generation (decoder) part of the BLIP model.

  2. image_size = 384: Defines the image size as 384x384 pixels.

  3. image = load_demo_image(image_size=image_size, device=device): Load the demo image using the previously defined load_demo_image function, and preprocess the image to fit the input requirements of the model. image is the preprocessed image tensor.

  4. model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth': defines the URL of the pre-trained model.

  5. model = blip_decoder(pretrained=model_url, image_size=image_size, vit='base'): Create a model instance with the blip_decoder constructor. The pretrained parameter specifies the URL of the pre-trained weights, image_size specifies the input image size, and vit selects which ViT (Vision Transformer) backbone to use; the base version is chosen here.

  6. model.eval(): Sets the model into evaluation mode, which turns off some specific features enabled during training, such as dropout.

  7. model = model.to(device): Move the model to the previously set computing device.

  8. with torch.no_grad():: Use the torch.no_grad() context manager to ensure that gradients are not computed during inference.

  9. caption = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5): Use the model.generate() method to produce a description of the image. sample=False means sampling is not used; instead, beam search is used to find the best description. num_beams=3 means beam search keeps 3 beams. max_length=20 caps the generated description at 20 tokens, and min_length=5 requires it to be at least 5 tokens long.

  10. print('caption: '+caption[0]): Output the generated image description.

This code uses the pre-trained "BLIP" model to describe the loaded image: it runs beam-search inference and prints the resulting caption. You can try different sampling methods or tune other parameters to see how the generated description changes.

The output is:

load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_caption.pth
caption: a woman sitting on the beach with a dog
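For example, the nucleus-sampling line that is commented out in the code above can be run instead of beam search. The sketch below simply uses that commented-out call; because sampling is stochastic, the caption will vary from run to run.

# Nucleus sampling variant (same parameters as the commented line in the cell above):
with torch.no_grad():
    caption = model.generate(image, sample=True, top_p=0.9, max_length=20, min_length=5)
    print('caption (nucleus sampling): ' + caption[0])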

2. VQA

from models.blip_vqa import blip_vqa

image_size = 480
image = load_demo_image(image_size=image_size, device=device)     

model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_vqa_capfilt_large.pth'
    
model = blip_vqa(pretrained=model_url, image_size=image_size, vit='base')
model.eval()
model = model.to(device)

question = 'where is the woman sitting?'

with torch.no_grad():
    answer = model(image, question, train=False, inference='generate') 
    print('answer: '+answer[0])

A line-by-line breakdown:

  1. from models.blip_vqa import blip_vqa: Import the blip_vqa model from the "BLIP" repository; this is the visual question answering part of the BLIP model.

  2. image_size = 480: Defines the image size as 480x480 pixels.

  3. image = load_demo_image(image_size=image_size, device=device): Load the demo image using the previously defined load_demo_image function, and preprocess the image to fit the input requirements of the model. image is the preprocessed image tensor.

  4. model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_vqa_capfilt_large.pth': URL that defines the pre-trained model for the visual question answering model.

  5. model = blip_vqa(pretrained=model_url, image_size=image_size, vit='base'): Create a model instance with the blip_vqa constructor. The pretrained parameter specifies the URL of the pre-trained weights, image_size specifies the input image size, and vit selects which ViT (Vision Transformer) backbone to use; the base version is chosen here.

  6. model.eval(): Sets the model into evaluation mode, which turns off some specific features enabled during training, such as dropout.

  7. model = model.to(device): Move the model to the previously set computing device.

  8. question = 'where is the woman sitting?': Defines the question to ask about the image: "where is the woman sitting?".

  9. with torch.no_grad():: Use the torch.no_grad() context manager to ensure that gradients are not computed during inference.

  10. answer = model(image, question, train=False, inference='generate'): Call the model with the image and the question to produce an answer. train=False means the forward pass runs in inference mode rather than training mode, and inference='generate' means the answer is generated token by token rather than chosen by ranking a fixed list of candidate answers.

  11. print('answer: '+answer[0]): Output the generated answer.

This code uses the pre-trained "BLIP" model for visual question answering: given the loaded image and the question, it generates an answer. You can try asking different questions to see what the model answers.

The output is:

load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_vqa.pth
answer: on beach
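To try different questions, the same image and model can be reused in a small loop. This is only an illustrative sketch; the extra question strings are made up and not part of the original demo.

# Ask several questions about the same image (the added questions are illustrative):
questions = ['where is the woman sitting?',
             'what animal is in the picture?',
             'what is the woman doing?']

with torch.no_grad():
    for q in questions:
        ans = model(image, q, train=False, inference='generate')
        print('Q: %s  A: %s' % (q, ans[0]))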

3. Feature Extraction

from models.blip import blip_feature_extractor

image_size = 224
image = load_demo_image(image_size=image_size, device=device)     

model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth'
    
model = blip_feature_extractor(pretrained=model_url, image_size=image_size, vit='base')
model.eval()
model = model.to(device)

caption = 'a woman sitting on the beach with a dog'

multimodal_feature = model(image, caption, mode='multimodal')[0,0]
image_feature = model(image, caption, mode='image')[0,0]
text_feature = model(image, caption, mode='text')[0,0]

The output is:

load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth
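This section prints nothing beyond the checkpoint message. The blip_feature_extractor returns multimodal, image-only, or text-only features depending on the mode argument, and the [0, 0] indexing picks the first token (the CLS-style token) of the first sample in the batch. The sketch below just inspects what was extracted; the 768-dimensional size is an assumption based on the ViT-base backbone, not something printed by the original demo.

# Inspect the extracted features (sizes are expectations for a ViT-base checkpoint, not guarantees):
print(multimodal_feature.shape)  # e.g. torch.Size([768]) -- fused image-text representation
print(image_feature.shape)       # e.g. torch.Size([768]) -- image encoder CLS embedding
print(text_feature.shape)        # e.g. torch.Size([768]) -- text encoder CLS embedding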

4. Image-Text Matching

from models.blip_itm import blip_itm

image_size = 384
image = load_demo_image(image_size=image_size,device=device)

model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_retrieval_coco.pth'
    
model = blip_itm(pretrained=model_url, image_size=image_size, vit='base')
model.eval()
model = model.to(device)  # keep the model on the same device as the image tensor loaded above

caption = 'a woman sitting on the beach with a dog'

print('text: %s' %caption)

itm_output = model(image,caption,match_head='itm')
itm_score = torch.nn.functional.softmax(itm_output,dim=1)[:,1]
print('The image and text is matched with a probability of %.4f'%itm_score)

itc_score = model(image,caption,match_head='itc')
print('The image feature and text feature has a cosine similarity of %.4f'%itc_score)

The output is:

load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_retrieval_coco.pth
text: a woman sitting on the beach with a dog
The image and text is matched with a probability of 0.9960
The image feature and text feature has a cosine similarity of 0.5262
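Here match_head='itm' runs the image-text matching head, whose two-way softmax gives the probability that the caption matches the image, while match_head='itc' returns the contrastive (cosine) similarity between the projected image and text features. The same ITM head can also be used to compare several candidate captions; the sketch below ranks two candidates, the second of which is made up for illustration.

# Rank candidate captions by ITM match probability (the second caption is illustrative):
candidates = ['a woman sitting on the beach with a dog',
              'a man riding a horse in a field']

with torch.no_grad():
    for c in candidates:
        out = model(image, c, match_head='itm')
        prob = torch.nn.functional.softmax(out, dim=1)[:, 1].item()
        print('%.4f  %s' % (prob, c))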

Source: blog.csdn.net/wzk4869/article/details/132031204