Text-Prompted Image Object Detection and Segmentation in Practice

In recent years, computer vision has made remarkable progress, especially in image segmentation and object detection tasks. One of the recent notable breakthroughs is the Segment Anything Model (SAM), a versatile deep learning model designed to efficiently predict object masks from images and input prompts. By utilizing powerful encoders and decoders, SAM is able to handle a wide range of segmentation tasks, making it an invaluable tool for researchers and developers.



1. Introduction to the SAM model

SAM uses an image encoder, typically a Vision Transformer (ViT), to extract image embeddings that serve as the basis for mask prediction. The model also contains a prompt encoder, which encodes various types of input prompts, such as point coordinates, bounding boxes, and low-resolution mask inputs. These encoded prompts are then fed into a mask decoder together with the image embeddings to generate the final object mask.


This architecture makes prompting an encoded image fast and cheap: the heavy image encoder runs once per image, while the lightweight prompt encoder and mask decoder can be run repeatedly with different prompts.

SAM is designed to handle a variety of prompts (a short usage sketch follows the list below), including:

  • Mask: A coarse, low-resolution binary mask can be provided as an initial input to guide the model.
  • Point: The user can supply [x, y] coordinates together with a label (foreground or background) to help define object boundaries.
  • Box: A bounding box can be specified with coordinates [x1, y1, x2, y2] to tell the model about the position and size of the object.
  • Text: Text prompts can also be used to provide additional context or to designate objects of interest.
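
To make these prompt types concrete, here is a minimal sketch using the official segment_anything package (this is the point/box path, not the text path). The checkpoint file name is the released ViT-H weight file; the image path and the point/box coordinates are placeholders to replace with your own.

import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Build SAM with the ViT-H backbone (ViT-L or ViT-B work the same way).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# The heavy image encoder runs once per image; prompts are then cheap to evaluate.
image = np.array(Image.open("./assets/car.jpeg").convert("RGB"))
predictor.set_image(image)

# Point prompt: [x, y] coordinates plus labels (1 = foreground, 0 = background).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return three candidate masks to resolve ambiguity
)
best_mask = masks[np.argmax(scores)]

# Box prompt: [x1, y1, x2, y2] in pixel coordinates.
masks, scores, _ = predictor.predict(box=np.array([425, 600, 700, 875]), multimask_output=False)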

Digging deeper into SAM's architecture, we can explore its key components:

  • Image encoder: The default image encoder of SAM is ViT-H, but ViT-L or ViT-B can also be used depending on requirements.
  • Downsampling: A series of convolutional layers is used to reduce the resolution of the binary mask prompt.
  • Prompt encoder: Positional embeddings are used to encode the various input prompts, informing the model of the location and context of objects in the image.
  • Mask decoder: A modified Transformer decoder block converts the encoded prompts and image embeddings into the final object mask.
  • Valid masks: For any given prompt, SAM generates the three most relevant masks, giving the user a range of options to choose from.

The model is trained with a weighted combination of focal, dice, and IoU losses, with weights of 20, 1, and 1, respectively.
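
As a rough illustration of that objective, here is a minimal PyTorch sketch of such a weighted loss. The helper names, focal-loss hyperparameters (alpha, gamma), and tensor shapes are assumptions for the example, not values taken from the paper.

import torch
import torch.nn.functional as F

def focal_loss(pred_logits, target, alpha=0.25, gamma=2.0):
    # Sigmoid focal loss over per-pixel mask logits (alpha/gamma are assumed defaults).
    p = torch.sigmoid(pred_logits)
    ce = F.binary_cross_entropy_with_logits(pred_logits, target, reduction="none")
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(pred_logits, target, eps=1.0):
    # Soft dice loss computed per mask, then averaged over the batch.
    p = torch.sigmoid(pred_logits).flatten(1)
    t = target.flatten(1)
    num = 2 * (p * t).sum(-1) + eps
    den = p.sum(-1) + t.sum(-1) + eps
    return (1 - num / den).mean()

def combined_mask_loss(pred_logits, target, pred_iou, true_iou):
    # pred_logits / target: (B, H, W) mask logits and binary (float) ground truth.
    # pred_iou / true_iou: (B,) predicted and actual mask IoU for the IoU head.
    return (20.0 * focal_loss(pred_logits, target)
            + 1.0 * dice_loss(pred_logits, target)
            + 1.0 * F.mse_loss(pred_iou, true_iou))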

The strength of SAM lies in its adaptability and flexibility: it can generate accurate segmentation masks from different prompt types. Much like large language models (LLMs) serve as a solid foundation for a variety of natural language processing applications, SAM provides a solid foundation for computer vision tasks. The model's architecture is designed to facilitate easy fine-tuning on downstream tasks, so it can be tailored to specific use cases or domains. By fine-tuning SAM on task-specific data, developers can enhance its performance and ensure that it meets the unique requirements of their application.

This fine-tuning capability not only enables SAM to achieve impressive performance in various scenarios, but also facilitates a more efficient development process. Using a pre-trained model as a starting point, developers can focus on optimizing the model for a specific task instead of starting from scratch. This approach not only saves time and resources, but also leverages the extensive knowledge encoded in the pretrained models, resulting in a more robust and accurate system.

2. Natural language prompts

Integrating text prompts with SAM enables the model to perform highly specific, context-aware object segmentation. By leveraging natural language, SAM can segment objects of interest based on their semantic properties, attributes, or relationships to other objects in the scene.

To train this capability, the largest publicly available CLIP model (ViT-L/14@336px) is used to compute text and image embeddings, which are normalized before being used in training.

To generate training prompts, the bounding box around each mask is first expanded by a random factor between 1× and 2×. The expanded box is then cropped to a square to maintain the aspect ratio and resized to 336×336 pixels. Before feeding the crop into the CLIP image encoder, pixels outside the mask are zeroed out with a probability of 50%. The last layer of the encoder uses masked attention to keep the embedding focused on the object, restricting the output token's attention to image locations inside the mask. The resulting output token embedding is used as the final prompt. During training, the CLIP-based prompt is provided first, followed by iterative point prompts to refine the prediction. A rough sketch of this recipe is shown below.
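
The following is a minimal sketch of that prompt-generation recipe, assuming an RGB image array and a binary object mask. The clip_mask_prompt helper name is made up for the example, and the masked-attention change to CLIP's final layer is omitted, so this is only an approximation of the described procedure.

import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-L/14@336px", device=device)

def clip_mask_prompt(image: np.ndarray, mask: np.ndarray) -> torch.Tensor:
    # Tight bounding box around the mask.
    ys, xs = np.where(mask)
    x1, y1, x2, y2 = xs.min(), ys.min(), xs.max(), ys.max()

    # Expand by a random factor in [1, 2] and make the crop square.
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half = max(x2 - x1, y2 - y1) * np.random.uniform(1.0, 2.0) / 2

    # Zero out pixels outside the mask with probability 0.5.
    img = image.copy()
    if np.random.rand() < 0.5:
        img[~mask] = 0

    # Square crop and resize to CLIP's 336x336 input resolution.
    crop = Image.fromarray(img).crop((int(cx - half), int(cy - half), int(cx + half), int(cy + half)))
    crop = crop.resize((336, 336))

    # Encode with the CLIP image encoder and normalize the embedding (used as the prompt).
    with torch.no_grad():
        emb = clip_model.encode_image(preprocess(crop).unsqueeze(0).to(device))
    return emb / emb.norm(dim=-1, keepdim=True)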

For inference, prompts are created for SAM using the unmodified CLIP text encoder. The model relies on the alignment of text and image embeddings provided by CLIP, which allows training without explicit text supervision while still accepting text-based prompts at inference time. This approach lets SAM use natural language prompts to produce accurate, context-aware segmentation results.
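
For illustration, a normalized text embedding from the unmodified CLIP text encoder can be computed as follows (the prompt string is just an example); note that, as mentioned next, the released SAM weights do not include the text-prompt path that would consume it.

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-L/14@336px", device=device)

# Tokenize and encode the text prompt, then L2-normalize the embedding
# so it lives in the same space as the CLIP image embeddings used in training.
tokens = clip.tokenize(["a wheel of a car"]).to(device)
with torch.no_grad():
    text_emb = clip_model.encode_text(tokens)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)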

Unfortunately, Meta has not released weights for SAM with text encoders (yet?).

3. lang-segment-anything

The lang-segment-anything library combines the strengths of GroundingDino and SAM into a practical method for text-prompted object detection and segmentation.

First, GroundingDino performs zero-shot text-to-bounding-box object detection, identifying objects of interest in the image from natural language descriptions. These bounding boxes are then used as input prompts for the SAM model, which generates accurate segmentation masks for the detected objects.

from PIL import Image
import numpy as np
from lang_sam import LangSAM
from lang_sam.utils import draw_image

# Load GroundingDINO + SAM (weights are downloaded on first use).
model = LangSAM()

# Describe the objects of interest in plain text.
image_pil = Image.open('./assets/car.jpeg').convert("RGB")
text_prompt = 'car, wheel'

# Text -> boxes (GroundingDINO) -> masks (SAM), then overlay the results.
masks, boxes, phrases, logits = model.predict(image_pil, text_prompt)
image = draw_image(np.asarray(image_pil), masks, boxes, phrases)


4. Lightning AI application

You can deploy the app quickly using the Lightning AI App framework. We will use the ServeGradio component to serve the model behind a simple UI; see the Lightning AI documentation for more details on ServeGradio.

import gradio as gr
import lightning as L
import numpy as np
from lightning.app.components.serve import ServeGradio
from PIL import Image

from lang_sam import LangSAM
from lang_sam import SAM_MODELS
from lang_sam.utils import draw_image
from lang_sam.utils import load_image

class LitGradio(ServeGradio):

    # UI controls: SAM backbone, detection thresholds, input image, and text prompt.
    inputs = [
        gr.Dropdown(choices=list(SAM_MODELS.keys()), label="SAM model", value="vit_h"),
        gr.Slider(0, 1, value=0.3, label="Box threshold"),
        gr.Slider(0, 1, value=0.25, label="Text threshold"),
        gr.Image(type="filepath", label='Image'),
        gr.Textbox(lines=1, label="Text Prompt"),
    ]
    outputs = [gr.Image(type="pil", label="Output Image")]

    def __init__(self, sam_type="vit_h"):
        super().__init__()
        self.ready = False
        self.sam_type = sam_type

    def predict(self, sam_type, box_threshold, text_threshold, image_path, text_prompt):
        print("Predicting... ", sam_type, box_threshold, text_threshold, image_path, text_prompt)
        # Rebuild SAM if the user picked a different backbone in the dropdown.
        if sam_type != self.model.sam_type:
            self.model.build_sam(sam_type)
        image_pil = load_image(image_path)
        masks, boxes, phrases, logits = self.model.predict(image_pil, text_prompt, box_threshold, text_threshold)
        labels = [f"{phrase} {logit:.2f}" for phrase, logit in zip(phrases, logits)]
        # Draw the predicted boxes and masks on the image and return it to the UI.
        image_array = np.asarray(image_pil)
        image = draw_image(image_array, masks, boxes, labels)
        image = Image.fromarray(np.uint8(image)).convert("RGB")
        return image

    def build_model(self, sam_type="vit_h"):
        # Called once by ServeGradio to load the model before serving requests.
        model = LangSAM(sam_type)
        self.ready = True
        return model

app = L.LightningApp(LitGradio())
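
Assuming the code above is saved as app.py, it can be started locally with the Lightning CLI, for example: lightning run app app.py.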

And that's it, the application launches in the browser!


5. Conclusion

That concludes our introduction to the Segment Anything Model. Clearly, SAM is a valuable tool for computer vision researchers and developers, capable of handling various segmentation tasks and adapting to different prompt types. Its architecture allows for easy implementation, making it general enough to be tailored to specific use cases and domains. Overall, SAM has quickly become a great asset to the machine learning community and will certainly continue to make waves in the field.


Original link: Text Prompt Target Detection and Segmentation—BimAnt
