An in-depth guide to Segment Anything Model (SAM) TensorRT model conversion

Summary

Segment Anything Model (SAM) is a new image segmentation task and model recently open-sourced by Facebook Research. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. Its segmentation quality is impressive, and it is the current state-of-the-art (SOTA) algorithm for segmentation. There are many explanations of the algorithm's details online; this article mainly shares how to convert the model into a TensorRT model to ease later deployment and accelerate model inference.

SAM code

SAM official website

TensorRT code for this article

Brief introduction

The Segment Anything Model (SAM) consists of three components, as shown in Figure 1: an image encoder, a prompt encoder, and a mask decoder.

Figure 1: Overview of the Segment Anything Model (SAM). The heavyweight image encoder outputs image embeddings, which can then be efficiently queried by various input prompts, yielding object masks at amortized real-time speed. For ambiguous prompts corresponding to multiple objects, SAM can output multiple valid masks and associated confidence scores.

Image encoder. Motivated by scalability and powerful pre-training methods, SAM uses an MAE pre-trained Vision Transformer (ViT) minimally adapted to handle high-resolution inputs. The image encoder runs once per image and can be applied prior to prompting the model.

Prompt encoder. Two sets of prompts are considered: sparse (points, boxes, text) and dense (masks). Points and boxes are represented by positional encodings summed with learned embeddings for each prompt type, and free-form text is represented using CLIP's off-the-shelf text encoder. Dense prompts (i.e., masks) are embedded using convolutions and summed element-wise with the image embedding.

Mask decoder. The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask. This design employs a modified Transformer decoder block followed by a dynamic mask prediction head. The modified decoder block updates all embeddings using prompt self-attention and cross-attention in both directions (prompt-to-image embedding and vice versa). After running two such blocks, the image embedding is upsampled and an MLP maps the output token to a dynamic linear classifier, which then computes the mask foreground probability at each image location.

Model conversion process

Since we need to improve efficiency and accelerate the model, what specifically should be done? The most common way to accelerate a deep learning model today is to convert a model expressed in PyTorch/TensorFlow etc. into a model expressed in TensorRT. TensorRT is a framework developed by NVIDIA that accelerates model inference; in effect, it speeds up your trained model at test time. For example, if your model processes an image in 50 ms, with TensorRT acceleration it may only take 10 ms. This article will not go into more detail about TensorRT; please refer to the official website. I divide the overall acceleration of a deep learning model into two parts:

  1. Model conversion. Convert the PyTorch/TensorFlow model into a TensorRT model.
  2. Model inference. Use the TensorRT model to run inference.

How do we get the corresponding TensorRT model from a PyTorch model? There are generally two routes:

  1. Conversion with **torch2trt**;
  2. **PyTorch -> ONNX -> TensorRT**. This route is the most widely used: first convert the PyTorch model into an ONNX model, then convert the ONNX model into a TensorRT model. This is also the method this article focuses on.

PyTorch model to ONNX model

PyTorch -> ONNX conversion is relatively simple, with the help of PyTorch's built-in API:

import torch

# model: the PyTorch module to export; x: an example input tensor of the expected shape
torch.onnx.export(model,
                  x,
                  "./ckpts/onnx_models/{}.onnx".format(model_name),
                  input_names=input_names,
                  output_names=output_names,
                  opset_version=16)
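For the SAM image-embedding module, the variables above might be prepared roughly as follows. This is only a minimal sketch using the segment-anything API; the variable names mirror the snippet and are not taken from the article's scripts.

import torch
from segment_anything import sam_model_registry

model_name = "sam_default_embedding"
sam = sam_model_registry["default"](checkpoint="weights/sam_vit_h_4b8939.pth")
model = sam.image_encoder.eval()               # heavyweight ViT image encoder
x = torch.randn(1, 3, 1024, 1024)              # SAM's fixed 1024x1024 input
input_names, output_names = ["images"], ["image_embeddings"]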

What needs to be emphasized here is the parameter **opset_version**: since the official ONNX spec is still evolving, only part of the PyTorch operators can be converted, and quite a few operators cannot be converted at all. Therefore, when converting, try to choose a recent opset_version so that more operators can be handled. The operators ONNX currently supports, and the opset versions they require, are listed in the official ONNX documentation.
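If you are unsure which opset the locally installed onnx package supports, a quick check like the following can help (a small sketch; the file path is just the one used in the export example above):

import onnx

# Highest opset version of the default ONNX domain supported by this onnx install
print(onnx.defs.onnx_opset_version())

# Load the exported file and validate the graph structure
model = onnx.load("./ckpts/onnx_models/sam_default_embedding.onnx")
onnx.checker.check_model(model)   # raises an exception if the graph is malformed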

Convert ONNX model to TensorRT model

Before converting ONNX -> TensorRT, it is strongly recommended to use the onnx-simplifier tool to simplify the exported ONNX model; otherwise errors may be reported in the next conversion step. onnx-simplifier is a tool for simplifying ONNX models. The ONNX model we exported earlier is actually quite redundant: some operations (such as If branches) are not needed, and these redundant parts are likely to cause unnecessary errors during the subsequent ONNX -> TensorRT conversion and increase the model's memory footprint. It is therefore necessary to simplify it first.
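A minimal sketch of the simplification step using onnx-simplifier's Python API (file names here are placeholders; the tool can also be invoked from the command line as "onnxsim input.onnx output.onnx"):

import onnx
from onnxsim import simplify

model = onnx.load("pytorch.onnx")
model_simplified, check = simplify(model)
assert check, "Simplified ONNX model could not be validated"
onnx.save(model_simplified, "pytorch_sim.onnx")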

Next, we need to convert the ONNX model into a TensorRT model. First, download the TensorRT-8.6.1.6 toolkit from the NVIDIA website, unzip it under the user's home directory on Ubuntu, set the environment variables, and then use the official tool **trtexec** for the model conversion. The tool is included in the TensorRT folder downloaded above.

# Install the TensorRT Python package into the Python environment
pip install ~/TensorRT-8.6.1.6/python/tensorrt-8.6.1-cp38-none-linux_x86_64.whl

# Set environment variables
export PATH=$HOME/TensorRT-8.6.1.6/targets/x86_64-linux-gnu/bin:$PATH
export TENSORRT_DIR=$HOME/TensorRT-8.6.1.6:$TENSORRT_DIR
export LD_LIBRARY_PATH=$HOME/TensorRT-8.6.1.6/lib:$LD_LIBRARY_PATH


# Run the conversion command
./trtexec --onnx=pytorch.onnx --saveEngine=pytorch.engine --workspace=4096

If no error is reported, we get a file named pytorch.engine, which is the converted TensorRT model. At this point, the model conversion part is complete.
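To confirm the engine deserializes correctly, a quick inspection like the following can be used (a minimal sketch against the TensorRT 8.6 Python API; the file name follows the example above):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("pytorch.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

# List the engine's input/output tensors and their (possibly dynamic) shapes
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(name, engine.get_tensor_mode(name), engine.get_tensor_shape(name))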

SAM model conversion

Having briefly introduced the three components of the SAM model, we now get to the main topic: the model conversion. Because the image embedding module mainly uses a ViT for feature extraction, and this step is performed only once, this module is converted separately. The prompt encoder and mask decoder models are combined and converted together.

Convert the image embedding model to ONNX

python scripts/onnx2trt.py --img_pt2onnx --sam_checkpoint weights/sam_vit_h_4b8939.pth --model_type default

Convert the image embedding ONNX model to TensorRT

trtexec --onnx=embedding_onnx/sam_default_embedding.onnx --workspace=4096 --saveEngine=weights/sam_default_embedding.engine

So far, we have obtained the TensorRT model of the image embedding module. The inputs and outputs of this module have fixed sizes, so there is basically no major problem during conversion. Moreover, this model's job is to extract image features, which takes a long time but only needs to be done once; it does not need to be repeated for subsequent point or box prompts. The front-end/back-end deployment can be designed around this property.
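For reference, the fixed-size input this engine expects can be produced with preprocessing along these lines. This is a minimal sketch assuming the default 1024x1024 input size; the mean/std values follow segment-anything's defaults, and the function itself is not part of the article's scripts.

import numpy as np
import torch
from segment_anything.utils.transforms import ResizeLongestSide

def preprocess(image_rgb: np.ndarray) -> torch.Tensor:
    transform = ResizeLongestSide(1024)
    image = transform.apply_image(image_rgb)                       # resize longest side to 1024
    x = torch.as_tensor(image).permute(2, 0, 1).float()[None]      # 1x3xHxW
    mean = torch.tensor([123.675, 116.28, 103.53]).view(1, 3, 1, 1)
    std = torch.tensor([58.395, 57.12, 57.375]).view(1, 3, 1, 1)
    x = (x - mean) / std
    h, w = x.shape[-2:]
    return torch.nn.functional.pad(x, (0, 1024 - w, 0, 1024 - h))  # pad to 1024x1024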

Convert the Prompt_Mask module's PyTorch model to ONNX

As mentioned above, the prompt encoding and mask decoding models operate on the image embedding. Once the image embedding has been extracted, you only need to change the coordinates of the input prompt points and boxes as you wish. The official repository provides a script for converting this part; running it yields the ONNX model.

# Clone the official code
git clone https://github.com/facebookresearch/segment-anything

**Note:** The mask produced in the source code is a low-resolution mask output by the decoder, which then needs to be upscaled to the original size of the input image. However, if this original size is used as an input node during the ONNX conversion, the TensorRT conversion will also require this parameter and will fix particular height and width values. Since the size of the user's input image cannot be known in advance, this parameter needs to be pulled out: the post-processing of the low-resolution mask is handled separately rather than as part of the model. The source code therefore needs to be slightly modified:

# Modify the "forward" function in "segment_anything/utils/onnx.py" as follows:
def forward(
    self,
    image_embeddings: torch.Tensor,
    point_coords: torch.Tensor,
    point_labels: torch.Tensor,
    mask_input: torch.Tensor,
    has_mask_input: torch.Tensor
    # orig_im_size: torch.Tensor,
):
    sparse_embedding = self._embed_points(point_coords, point_labels)
    dense_embedding = self._embed_masks(mask_input, has_mask_input)

    masks, scores = self.model.mask_decoder.predict_masks(
        image_embeddings=image_embeddings,
        image_pe=self.model.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse_embedding,
        dense_prompt_embeddings=dense_embedding,
    )

    if self.use_stability_score:
        scores = calculate_stability_score(
            masks, self.model.mask_threshold, self.stability_score_offset
        )

    if self.return_single_mask:
        masks, scores = self.select_masks(masks, scores, point_coords.shape[1])

    return masks, scores
    # upscaled_masks = self.mask_postprocessing(masks, orig_im_size)

    # if self.return_extra_metrics:
    #     stability_scores = calculate_stability_score(
    #         upscaled_masks, self.model.mask_threshold, self.stability_score_offset
    #     )
    #     areas = (upscaled_masks > self.model.mask_threshold).sum(-1).sum(-1)
    #     return upscaled_masks, scores, stability_scores, areas, masks

    # return upscaled_masks, scores, masks
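Since the mask post-processing is now excluded from the exported graph, the low-resolution masks must be upscaled on the host side after inference. A possible sketch, mirroring the idea of the removed mask_postprocessing step (the 1024 input size and 0.0 mask threshold are SAM's defaults; the function itself is not part of the article's code):

import torch
import torch.nn.functional as F

def postprocess_masks(low_res_masks: torch.Tensor, orig_h: int, orig_w: int) -> torch.Tensor:
    # low_res_masks: (B, N, 256, 256) mask logits from the decoder
    img_size = 1024
    masks = F.interpolate(low_res_masks, (img_size, img_size), mode="bilinear", align_corners=False)
    # crop away the padding added during preprocessing
    scale = img_size / max(orig_h, orig_w)
    h, w = int(orig_h * scale + 0.5), int(orig_w * scale + 0.5)
    masks = masks[..., :h, :w]
    # resize to the original image resolution and binarize
    masks = F.interpolate(masks, (orig_h, orig_w), mode="bilinear", align_corners=False)
    return masks > 0.0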

After modifying the function above, convert the prompt encoding and mask decoding parts of the sam_vit_h_4b8939.pth model to ONNX.

# Download the default model into the repository's weights folder, then run the ONNX conversion
python scripts/onnx2trt.py --prompt_masks_pt2onnx

**Note:** During the model conversion process, the opset version needs to match your installed onnx version; otherwise an error will be reported during the TensorRT conversion, which is a big pitfall.
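Before moving on to TensorRT, it can be worth sanity-checking the exported model with onnxruntime using dummy inputs. This is a minimal sketch: the input names are assumed to match the forward signature above, and the shapes follow the ones used for the TensorRT conversion below.

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("weights/sam_default_prompt_mask.onnx")
dummy_inputs = {
    "image_embeddings": np.random.randn(1, 256, 64, 64).astype(np.float32),
    "point_coords": (np.random.rand(1, 1, 2) * 1024).astype(np.float32),
    "point_labels": np.ones((1, 1), dtype=np.float32),
    "mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32),
    "has_mask_input": np.zeros(1, dtype=np.float32),
}
masks, scores = sess.run(None, dummy_inputs)
print(masks.shape, scores.shape)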

Convert the Prompt_Mask module's ONNX model to TensorRT

In this step, because the number of positive and negative prompt points contained in the input is variable (that is, the input size is dynamic), multi-size shape profiles must be set during the conversion, as follows:

trtexec --onnx=weights/sam_default_prompt_mask.onnx --workspace=4096 --shapes=image_embeddings:1x256x64x64,point_coords:1x1x2,point_labels:1x1,mask_input:1x1x256x256,has_mask_input:1 --minShapes=image_embeddings:1x256x64x64,point_coords:1x1x2,point_labels:1x1,mask_input:1x1x256x256,has_mask_input:1 --optShapes=image_embeddings:1x256x64x64,point_coords:1x10x2,point_labels:1x10,mask_input:1x1x256x256,has_mask_input:1 --maxShapes=image_embeddings:1x256x64x64,point_coords:1x20x2,point_labels:1x20,mask_input:1x1x256x256,has_mask_input:1 --saveEngine=weights/sam_default_prompt_mask.engine
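At inference time, the actual number of points must then be supplied to the execution context before running the engine. A minimal sketch against the TensorRT 8.6 Python API (tensor names assumed to match the ONNX inputs above; buffer allocation and the actual launch are omitted):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("weights/sam_default_prompt_mask.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

num_points = 3  # anywhere between the min (1) and max (20) of the shape profile
context.set_input_shape("point_coords", (1, num_points, 2))
context.set_input_shape("point_labels", (1, num_points))
# ... set the remaining tensor addresses with context.set_tensor_address(name, ptr)
# and launch with context.execute_async_v3(stream_handle)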

After completing the above steps, we have obtained the two accelerated TensorRT engine files, and we can now run model inference. An inference script is provided:

python sam_trt_inference.py

Original article: blog.csdn.net/lovely_yoshino/article/details/131174352