Deploying BLIP vision-language AI at the edge with OpenVINO™

Author: Dr. Wu Zhuo, Intel AI Evangelist

Humans perceive the world through vision and language. A long-standing goal of artificial intelligence is to build intelligent agents that understand the world through visual and textual input and communicate with humans in natural language. For example, in "Speeding up Stable Diffusion with a few lines of code: using OpenVINO™ to make text-to-image easy", we showed how to run the Stable Diffusion model with OpenVINO™ to quickly build a text-to-image application, letting anyone paint with AI at will.

With the rapid development of computer vision and natural language processing, the integration of vision and language has attracted growing attention from researchers. In this context, BLIP (Bootstrapping Language-Image Pre-training) has drawn widespread attention as an innovative pre-training model. BLIP pre-trains a deep neural network on large-scale image-text datasets to improve performance on downstream vision-language tasks such as image-text retrieval, image captioning, and visual question answering. By jointly training on image and text data, it provides a strong foundation for the integration of vision and language.

Figure 1. Example of BLIP inference results

The pre-training of BLIP involves two key components: an image encoder and a text encoder. The image encoder converts an input image into a low-dimensional vector representation, while the text encoder converts input text into another low-dimensional vector representation. To achieve unified vision-language pre-training, BLIP adopts a cross-modal constraint strategy: during pre-training, the image encoder and text encoder are trained to constrain each other. This mechanism forces the model to learn to align visual and linguistic information, so that it can better handle the joint information between vision and language in downstream tasks.

In addition to vision-language understanding tasks, BLIP also performs well on vision-language generation tasks, in which the model must generate relevant descriptions or answer questions based on the input image and text. By introducing image-grounded text generation into its joint training, BLIP gains stronger image description and question answering capabilities, achieving excellent results on tasks such as image caption generation and visual question answering.
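
As a quick illustration of these two capabilities before turning to OpenVINO™, the following minimal PyTorch-only sketch runs visual question answering through the Hugging Face transformers API; the image path and question are illustrative placeholders rather than examples from the original article:

from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# load the BLIP VQA base model and its processor from Hugging Face
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# prepare an image and a question (illustrative inputs)
raw_image = Image.open("demo.jpg").convert("RGB")
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")

# generate an answer and decode it back to text
out = model.generate(**inputs, max_length=20)
print(processor.decode(out[0], skip_special_tokens=True))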

Next, let's walk through the key steps of using OpenVINO™ to optimize and accelerate BLIP inference on AAEON's new UP Squared Pro 7000 edge platform.

As the third generation of AAEON's UP Squared Pro series, the UP Squared Pro 7000 series (up-shop.org) offers greater development potential through higher computing performance, an upgraded board design, and expanded display interfaces. The first product in the series to use Intel® Core™/Atom®/N-series processors (formerly code-named Alder Lake-N), the UP Squared Pro 7000 is also the first to come with onboard LPDDR5 memory, improving I/O speed. In addition, the UP Squared Pro 7000 brings significant improvements in image processing and display: it supports MIPI CSI cameras and, with Intel® UHD Graphics, can drive three 4K displays simultaneously.

More than 1.4 times CPU performance improvement

The UP Squared Pro 7000 uses Intel® Core™/Atom®/N-series processors, delivering 1.4 times the CPU performance of the previous generation. With up to 8 Gracemont cores, support for the OpenVINO™ toolkit, and the UHD graphics of 12th-generation Intel® processors, its combination of compute power, an optimized inference engine, and image processing capabilities makes it an excellent platform for intelligent solutions.

Supports three 4K displays simultaneously

Equipped with HDMI 2.0b, DP 1.2, and DP 1.4a over USB Type-C, the UP Squared Pro 7000 offers excellent display interfaces. Its integrated GPU and multiple outputs can drive three 4K displays at the same time, making it ideal for visually oriented applications such as digital signage.

Double the high-speed system memory

As the first board in the UP Squared Pro series with onboard LPDDR5 system memory, the UP Squared Pro 7000 comes with 16GB of system memory, double that of the previous generation. Its memory speed of up to 4800MHz also doubles bandwidth and data transfer speed while consuming less power.

Comprehensive I/O upgrade

While keeping the compact 4" x 4" form factor of the UP Squared Pro series, the UP Squared Pro 7000 features a tighter board layout. It provides two 2.5GbE ports, three USB 3.2 ports, and an FPC connector for peripherals such as MIPI CSI cameras. Combined with the onboard LPDDR5 memory and powerful CPU, this makes it well suited to vision solutions such as smart-factory robotics.

Note: All code in the following steps comes from the 233-blip-visual-language-processing notebook in the OpenVINO Notebooks open source repository. You can go directly to the source code via this link: https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/233-blip-visual-language-processing

Step 1: Install the required packages and load the model

This code example first requires installing the packages that BLIP depends on.

!pip install "transformers >= 4.26.0"
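
The conversion and inference steps below also rely on the OpenVINO™ Python packages (the mo model conversion API and the runtime). If they are not already available in your environment, they can be installed the same way; the version pin here is illustrative:

!pip install "openvino-dev>=2023.0.0"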

Then download and load the corresponding PyTorch model. In this example, you will use the blip-vqa-base model, which can be downloaded from Hugging Face; the same steps apply to other models in the BLIP family. Although this model class is designed for visual question answering, its components can also be reused for image captioning. To start using the model, instantiate the BlipForQuestionAnswering class with the from_pretrained method. BlipProcessor is a helper class that prepares the input data for the text and vision modalities and post-processes the generated results.

import sys
import time
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

sys.path.append("../utils")
from notebook_utils import download_file

# Get model and processor
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

Next, let's look at how to convert the original model to the OpenVINO IR format so that OpenVINO can optimize it and accelerate inference.

Step 2: Convert the model to OpenVINO IR format

As introduced earlier, BLIP consists of three models: the vision model, the text encoder, and the text decoder, so we need to convert each of these three models to the OpenVINO IR format. The conversion of the vision model is fairly routine; the full code is in our notebook, and a sketch of it is shown below. Here we focus on converting the text encoder and text decoder.
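
For completeness, here is a minimal sketch of that routine vision-model conversion, together with the imports that the conversion and inference code below relies on. It follows the same export-to-ONNX-then-convert pattern as the snippets that follow; the sample image path, the question, and the IR file name are illustrative, while inputs, vision_outputs, and VISION_MODEL_OV are the names the later code refers to:

from pathlib import Path

import numpy as np
import torch
from openvino.runtime import Core, serialize
from openvino.tools import mo

VISION_MODEL_OV = Path("blip_vision_model.xml")
VISION_MODEL_ONNX = VISION_MODEL_OV.with_suffix(".onnx")

# prepare example inputs with BlipProcessor (image path and question are illustrative)
raw_image = Image.open("demo.jpg").convert("RGB")
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")

vision_model = model.vision_model
vision_model.eval()

# run the vision model once; its outputs are reused when exporting the text encoder
with torch.no_grad():
    vision_outputs = vision_model(inputs["pixel_values"])

# export to ONNX, then convert to OpenVINO IR with weights compressed to FP16
if not VISION_MODEL_OV.exists():
    if not VISION_MODEL_ONNX.exists():
        with torch.no_grad():
            torch.onnx.export(vision_model, inputs["pixel_values"], VISION_MODEL_ONNX, input_names=["pixel_values"])
    ov_vision_model_ir = mo.convert_model(VISION_MODEL_ONNX, compress_to_fp16=True)
    serialize(ov_vision_model_ir, str(VISION_MODEL_OV))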

  • Text encoder conversion: The visual question answering task uses the text encoder to build an embedding representation of the question. It takes the input_ids of the tokenized question, along with the image embeddings produced by the vision model and their attention mask, and outputs the question embeddings. Because the number of tokens varies with the question text, the model inputs must keep a dynamic shape along the token dimension; the dynamic_axes parameter of torch.onnx.export marks those input dimensions as dynamic. The code is shown below:
TEXT_ENCODER_OV = Path("blip_text_encoder.xml")
TEXT_ENCODER_ONNX = TEXT_ENCODER_OV.with_suffix(".onnx")

text_encoder = model.text_encoder
text_encoder.eval()

# if openvino model does not exist, convert it to onnx and then to IR
if not TEXT_ENCODER_OV.exists():
    if not TEXT_ENCODER_ONNX.exists():
        # prepare example inputs for ONNX export
        image_embeds = vision_outputs[0]
        image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long)
        input_dict = {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"], "encoder_hidden_states": image_embeds, "encoder_attention_mask": image_attention_mask}
        # specify variable length axes
        dynamic_axes = {"input_ids": {1: "seq_len"}, "attention_mask": {1: "seq_len"}}
        # export PyTorch model to ONNX
        with torch.no_grad():
            torch.onnx.export(text_encoder, input_dict, TEXT_ENCODER_ONNX, input_names=list(input_dict), dynamic_axes=dynamic_axes)
    # convert ONNX model to IR using model conversion Python API, use compress_to_fp16=True for compressing model weights to FP16 precision
    ov_text_encoder = mo.convert_model(TEXT_ENCODER_ONNX, compress_to_fp16=True)
    # save model on disk for next usages
    serialize(ov_text_encoder, str(TEXT_ENCODER_OV))
    print(f"Text encoder successfuly converted and saved to {TEXT_ENCODER_OV}")
else:
    print(f"Text encoder will be loaded from {TEXT_ENCODER_OV}")
  • Text decoder conversion: The text decoder is responsible for generating the sequence of tokens that forms the model output (the answer to a question, or a caption), using the representation of the image (and of the question, if needed). Generation is based on the assumption that the probability distribution of a word sequence can be decomposed into a product of conditional distributions over the next word: the model predicts the next token in a loop, guided by the previously generated tokens, until a stop condition is met (the sequence reaches its maximum length or an end-of-sequence token is produced). How the next token is chosen from the predicted probabilities depends on the selected decoding method; a toy greedy-decoding sketch follows the export code below. Like the text encoder, the text decoder must handle input sequences of varying length and therefore needs dynamic input shapes. This is handled by the following code:
text_decoder = model.text_decoder
text_decoder.eval()

TEXT_DECODER_OV = Path("blip_text_decoder.xml")
TEXT_DECODER_ONNX = TEXT_DECODER_OV.with_suffix(".onnx")

# prepare example inputs for ONNX export
input_ids = torch.tensor([[30522]])  # begin of sequence token id
attention_mask = torch.tensor([[1]])  # attention mask for input_ids
encoder_hidden_states = torch.rand((1, 10, 768))  # encoder last hidden state from text_encoder
encoder_attention_mask = torch.ones((1, 10), dtype=torch.long)  # attention mask for encoder hidden states

input_dict = {"input_ids": input_ids, "attention_mask": attention_mask, "encoder_hidden_states": encoder_hidden_states, "encoder_attention_mask": encoder_attention_mask}
# specify variable length axes
dynamic_axes = {"input_ids": {1: "seq_len"}, "attention_mask": {1: "seq_len"}, "encoder_hidden_states": {1: "enc_seq_len"}, "encoder_attention_mask": {1: "enc_seq_len"}}

# specify output names, logits is main output of model
output_names = ["logits"]

# past key values outputs are output for caching model hidden state
past_key_values_outs = []
text_decoder_outs = text_decoder(**input_dict)
for idx, _ in enumerate(text_decoder_outs["past_key_values"]):
    past_key_values_outs.extend([f"out_past_key_value.{idx}.key", f"out_past_key_value.{idx}.value"])
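
As an aside, the decoding loop described above can be illustrated with a toy greedy-decoding sketch; this is not the notebook's code, and in practice the Hugging Face generate() utilities run this loop (with beam search, sampling, and caching) on the model's behalf:

def greedy_decode(decoder, bos_token_id, eos_token_id, max_length=20, **encoder_kwargs):
    # start from the begin-of-sequence token and repeatedly append the most likely next token
    generated = torch.tensor([[bos_token_id]])
    for _ in range(max_length):
        # the decoder is assumed to return an object with a .logits field, as Hugging Face decoders do
        logits = decoder(input_ids=generated, **encoder_kwargs).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        if next_token.item() == eos_token_id:  # stop once the end-of-sequence token is produced
            break
    return generated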

Beyond this, the notebook also exports a second variant of the text decoder that takes the hidden states cached at the previous generation step (the past key values) as additional inputs. As with the export above, once the model is exported to ONNX these cached states are flattened into individual tensors, so dynamic_axes and input_names need to be extended with the new input layers; a sketch of this step follows. The rest of the conversion mirrors the text encoder conversion, so we refer to the notebook for the exact code.
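
A sketch of that step, following the naming convention of the snippet above, might look like the following. The file names, the assumption of one key and one value tensor per decoder layer, and the choice of axis 2 as the cached sequence dimension mirror the pattern above rather than the notebook verbatim:

TEXT_DECODER_WITH_PAST_OV = Path("blip_text_decoder_with_past.xml")
TEXT_DECODER_WITH_PAST_ONNX = TEXT_DECODER_WITH_PAST_OV.with_suffix(".onnx")

# feed the hidden states cached by the first decoder call back in as inputs
input_dict["past_key_values"] = text_decoder_outs["past_key_values"]

# name the flattened cache inputs and mark their cached-sequence dimension as dynamic
past_key_values_ins = []
for idx, _ in enumerate(text_decoder_outs["past_key_values"]):
    past_key_values_ins.extend([f"in_past_key_value.{idx}.key", f"in_past_key_value.{idx}.value"])
for name in past_key_values_ins:
    dynamic_axes[name] = {2: "past_seq_len"}

input_names = ["input_ids", "attention_mask", "encoder_hidden_states", "encoder_attention_mask"] + past_key_values_ins

if not TEXT_DECODER_WITH_PAST_OV.exists():
    if not TEXT_DECODER_WITH_PAST_ONNX.exists():
        with torch.no_grad():
            torch.onnx.export(text_decoder, input_dict, TEXT_DECODER_WITH_PAST_ONNX, input_names=input_names, output_names=["logits"] + past_key_values_outs, dynamic_axes=dynamic_axes)
    ov_text_decoder_with_past_ir = mo.convert_model(TEXT_DECODER_WITH_PAST_ONNX, compress_to_fp16=True)
    serialize(ov_text_decoder_with_past_ir, str(TEXT_DECODER_WITH_PAST_OV))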

Step 3: Run OpenVINO inference

As mentioned earlier, we will now show how to build pipelines for image captioning and visual question answering with BLIP, and how to run inference with OpenVINO™.

  • Image captioning

The vision model takes the image preprocessed by BlipProcessor as input and produces image embeddings, which are passed directly to the text decoder to generate caption tokens. When generation is complete, the output token sequence is handed back to the BlipProcessor, whose tokenizer decodes it into text.

Define the OVBlipModel class:

class OVBlipModel:
    """ 
    Model class for running BLIP model inference with OpenVINO
    """
    def __init__(self, config, decoder_start_token_id:int, vision_model, text_encoder, text_decoder):
        """
        Initialization class parameters
        """
        self.vision_model = vision_model
        self.vision_model_out = vision_model.output(0)
        self.text_encoder = text_encoder
        self.text_encoder_out = text_encoder.output(0)
        self.text_decoder = text_decoder
        self.config = config
        self.decoder_start_token_id = decoder_start_token_id
        self.decoder_input_ids = config.text_config.bos_token_id

Define the image captioning method as follows:

    def generate_caption(self, pixel_values:torch.Tensor, input_ids:torch.Tensor = None, attention_mask:torch.Tensor = None, **generate_kwargs):
        """
        Image Captioning prediction
        Parameters:
          pixel_values (torch.Tensor): preprocessed image pixel values
          input_ids (torch.Tensor, *optional*, None): pre-generated caption token ids after tokenization; if provided, caption generation continues from this text
          attention_mask (torch.Tensor): attention mask for caption tokens, used only if input_ids provided
        Returns:
          generation output (torch.Tensor): tensor which represents sequence of generated caption token ids
        """
        batch_size = pixel_values.shape[0]

        image_embeds = self.vision_model(pixel_values.detach().numpy())[self.vision_model_out]

        image_attention_mask = torch.ones(image_embeds.shape[:-1], dtype=torch.long)

        if isinstance(input_ids, list):
            input_ids = torch.LongTensor(input_ids)
        elif input_ids is None:
            input_ids = (
                torch.LongTensor([[self.config.text_config.bos_token_id, self.config.text_config.eos_token_id]])
                .repeat(batch_size, 1)
            )
        input_ids[:, 0] = self.config.text_config.bos_token_id
        attention_mask = attention_mask[:, :-1] if attention_mask is not None else None

        outputs = self.text_decoder.generate(
            input_ids=input_ids[:, :-1],
            eos_token_id=self.config.text_config.sep_token_id,
            pad_token_id=self.config.text_config.pad_token_id,
            attention_mask=attention_mask,
            encoder_hidden_states=torch.from_numpy(image_embeds),
            encoder_attention_mask=image_attention_mask,
            **generate_kwargs,
        )

        return outputs
  • Visual Q&A

The visual question answering pipeline looks similar, but with additional question processing. In this case, the image embeddings and the question tokenized by the BlipProcessor are fed to the text encoder, and the resulting multimodal question embeddings are then passed to the text decoder to generate the answer.

In the same way, the visual question answering method can be defined inside the OVBlipModel class as follows:

    def generate_answer(self, pixel_values:torch.Tensor, input_ids:torch.Tensor, attention_mask:torch.Tensor, **generate_kwargs):
        """
        Visual Question Answering prediction
        Parameters:
          pixel_values (torch.Tensor): preprocessed image pixel values
          input_ids (torch.Tensor): question token ids after tokenization
          attention_mask (torch.Tensor): attention mask for question tokens
        Returns:
          generation output (torch.Tensor): tensor which represents sequence of generated answer token ids
        """
        image_embed = self.vision_model(pixel_values.detach().numpy())[self.vision_model_out]
        image_attention_mask = np.ones(image_embed.shape[:-1], dtype=int)
        if isinstance(input_ids, list):
            input_ids = torch.LongTensor(input_ids)
        question_embeds = self.text_encoder([input_ids.detach().numpy(), attention_mask.detach().numpy(), image_embed, image_attention_mask])[self.text_encoder_out]
        question_attention_mask = np.ones(question_embeds.shape[:-1], dtype=int)

        bos_ids = np.full((question_embeds.shape[0], 1), fill_value=self.decoder_start_token_id)

        outputs = self.text_decoder.generate(
            input_ids=torch.from_numpy(bos_ids),
            eos_token_id=self.config.text_config.sep_token_id,
            pad_token_id=self.config.text_config.pad_token_id,
            encoder_hidden_states=torch.from_numpy(question_embeds),
            encoder_attention_mask=torch.from_numpy(question_attention_mask),
            **generate_kwargs,
        )
        return outputs
  • Initialize the OpenVINO runtime and run inference

Initialize the OpenVINO Core object, select the inference device, and load and compile the models:

# create OpenVINO Core object instance
core = Core()

import ipywidgets as widgets

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='AUTO',
    description='Device:',
    disabled=False,
)

device

# load models on device
ov_vision_model = core.compile_model(VISION_MODEL_OV, device.value)
ov_text_encoder = core.compile_model(TEXT_ENCODER_OV, device.value)
ov_text_decoder = core.compile_model(TEXT_DECODER_OV, device.value)
ov_text_decoder_with_past = core.compile_model(TEXT_DECODER_WITH_PAST_OV, device.value)
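
With the models compiled, the notebook assembles them into the OVBlipModel pipeline defined above. A minimal sketch of that step is shown below. It assumes a text_decoder object that keeps the Hugging Face generate() interface used by generate_caption and generate_answer but runs its forward pass on the compiled OpenVINO decoder (see the notebook for how that wrapper is built), and it reuses the inputs, raw_image, and question prepared earlier with BlipProcessor:

# assemble the OpenVINO pipeline, mirroring the OVBlipModel.__init__ signature above;
# as in generate_answer, the BOS token id serves as the decoder start token
ov_model = OVBlipModel(model.config, model.config.text_config.bos_token_id, ov_vision_model, ov_text_encoder, text_decoder)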
  • Run image captioning inference

out = ov_model.generate_caption(inputs["pixel_values"], max_length=20)
caption = processor.decode(out[0], skip_special_tokens=True)
fig = visualize_results(raw_image, caption)

The result is shown in the figure below:

  • Run visual question answering inference

start = time.perf_counter()
out = ov_model.generate_answer(**inputs, max_length=20)
end = time.perf_counter() - start
answer = processor.decode(out[0], skip_special_tokens=True)
fig = visualize_results(raw_image, answer, question)
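
The elapsed time measured above can then be reported alongside the decoded answer, for example (a small illustrative addition, not part of the notebook):

print(f"Question: {question}")
print(f"Answer: {answer}")
print(f"Processing time: {end:.4f} s")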

The result is shown in the figure below:

Summary

That’s the whole process! Follow the code and steps above and try out OpenVINO™ and BLIP for yourself.

For more information about the Intel® OpenVINO™ open source toolkit, including the more than 300 validated and optimized pre-trained models we provide, please visit https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html

In addition, to help everyone understand and quickly get started with OpenVINO™, we also provide a series of open source Jupyter notebook demos. By running these notebooks, you can quickly learn how to use OpenVINO™ to implement a range of computer vision, speech, and natural language processing tasks in different scenarios. The OpenVINO™ notebooks can be downloaded from GitHub: https://github.com/openvinotoolkit/openvino_notebooks .

About AAEON

Founded in 1992, AAEON is one of the leading designers and manufacturers of industrial IoT and artificial intelligence edge solutions. With continuous innovation as its core value, AAEON brings reliable, high-quality computing platforms to the market, including industrial motherboards and systems, rugged tablets, embedded artificial intelligence systems, uCPE network equipment, and LoRaWAN/WWAN solutions. AAEON also brings industry-leading experience and knowledge to provide OEM/ODM services globally. In addition, AAEON works closely with many cities and governments to develop and deploy smart city ecosystems, providing personalized platforms and end-to-end solutions. AAEON works closely with top chip designers to provide stable and reliable platforms, and is recognized as a Titanium-level member of the Intel® Internet of Things Solutions Alliance. To learn more about AAEON's product lines and services, please visit www.aaeon.com.

Notices and Disclaimers

Intel technologies may require enabling hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
