[Image Segmentation] SAM: Segment Anything Paper Notes V1

Paper: 2304. Segment Anything
Code: https://github.com/facebookresearch/segment-anything
Official website and demo: https://segment-anything.com/
[Extended reading] A comprehensive survey of the Segment Anything Model (SAM): 2305. A Comprehensive Survey on Segment Anything Model for Vision and Beyond
[Application] A plugin for stable-diffusion-webui: https://github.com/continue-revolution/sd-webui-segment-anything

Summary: What is SAM?

SAM (Segment Anything Model) is a general-purpose segmentation model: it can segment and annotate images from prompts such as clicked points, text input, and bounding boxes (a minimal usage sketch follows below).
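
As a concrete illustration, here is a minimal usage sketch based on the predictor API in the official segment-anything repository; the checkpoint path, image file, and prompt coordinates are placeholders to replace with your own.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint downloaded from the official repo (path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB uint8 image of shape (H, W, 3).
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # runs the heavyweight image encoder once

# Point prompt: one foreground click (label 1 = foreground, 0 = background).
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),  # (x, y) pixel coordinates
    point_labels=np.array([1]),
    multimask_output=True,  # return 3 candidate masks for ambiguous prompts
)
best = masks[np.argmax(scores)]  # keep the highest-scoring mask

# Box prompt: a length-4 array in XYXY pixel coordinates.
masks, scores, _ = predictor.predict(
    box=np.array([425, 600, 700, 875]),
    multimask_output=False,
)
```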

Figure 1: Project components (released model and dataset):

Figure 1: We aim to build a foundation model for segmentation by introducing three interconnected components: a promptable segmentation task, a segmentation model (SAM) that powers data annotation and enables zero-shot transfer to a range of tasks via prompt engineering, and a data engine for collecting SA-1B, our dataset of over 1 billion masks.

Model overview (SAM overview)

A heavyweight image encoder outputs an image embedding that can then be efficiently queried by a variety of input prompts to produce object masks at amortized real-time speed. For ambiguous prompts that could refer to multiple objects, SAM can output multiple valid masks, each accompanied by a confidence score.
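
The "amortized real-time" claim maps directly onto the API split: `set_image` runs the heavy ViT encoder once per image, after which each prompt touches only the lightweight prompt encoder and mask decoder (roughly 50 ms per prompt in the paper's browser demo). A rough timing sketch, continuing the setup above:

```python
import time
import numpy as np

predictor.set_image(image)  # heavyweight encoder: run once per image

t0 = time.perf_counter()
for x, y in [(100, 200), (300, 150), (500, 400)]:
    # Each query reuses the cached image embedding, so it is cheap.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),
        multimask_output=True,  # ambiguous clicks yield 3 masks with confidence scores
    )
    print(f"click ({x}, {y}): best confidence {scores.max():.3f}")
print(f"{time.perf_counter() - t0:.3f}s for 3 prompts (encoder cost not repeated)")
```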
Remark

Image embedding refers to converting an image into a fixed-length, high-dimensional vector representation. It is produced by feeding the image into an image encoder: a trained deep neural network whose output vector captures features or semantic information of the image in each dimension. With such a vector representation, distance measures in the embedding space can quantify the similarity or difference between images, and the embedding can also serve as input to other tasks such as image classification, image retrieval, and image generation.
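
In SAM this embedding is exposed by the predictor; a small sketch (continuing the setup above) that inspects it:

```python
# After set_image, the cached image embedding is available directly.
embedding = predictor.get_image_embedding()  # torch.Tensor
print(embedding.shape)  # (1, 256, 64, 64): a 256-dim vector per location of a 64x64 grid
```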

Example results

Multiple outputs for an ambiguous point prompt

Each column shows 3 valid masks generated by SAM from a single ambiguous point prompt (green circle)

Zero-shot inference on various datasets

"Zero-shot" generally refers to the ability of a model to predict or process a specific task without being trained on it

Zero-shot edge detection capability

Zero-shot edge prediction
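In the paper, edges are derived from the automatic mask generator's output using Sobel filtering and edge NMS; the sketch below is a simplified stand-in that traces each mask's boundary with binary erosion, reusing `sam` and `image` from the earlier setup.

```python
import numpy as np
from scipy.ndimage import binary_erosion
from segment_anything import SamAutomaticMaskGenerator

# Segment "everything" by prompting SAM with a regular grid of points.
mask_generator = SamAutomaticMaskGenerator(sam)
masks = mask_generator.generate(image)  # list of dicts; "segmentation" is a boolean (H, W) array

# Crude edge map: the union of all mask boundaries.
edges = np.zeros(image.shape[:2], dtype=bool)
for m in masks:
    seg = m["segmentation"]
    edges |= seg & ~binary_erosion(seg)  # boundary = mask minus its erosion
```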

Segmentation based on text prompts (Zero-shot text-to-mask)

SAM can work from both simple and nuanced text prompts. When SAM fails to make the correct prediction from text alone, an additional point prompt can help.
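Note that the released repository does not ship the text-prompt path (in the paper, CLIP text embeddings are fed to the prompt encoder as a proof of concept). A rough, community-style approximation is to rank SAM's automatic masks by CLIP similarity between the text and each mask's crop; the sketch below assumes the openai `clip` package and reuses `image` and the `masks` list from the edge-detection sketch.

```python
import clip
import torch
from PIL import Image

# Encode the text query with CLIP (an approximation, not the paper's method).
model, preprocess = clip.load("ViT-B/32", device="cpu")
with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(["a cat"]))
    text_feat /= text_feat.norm(dim=-1, keepdim=True)

# Score each automatically generated mask by the CLIP similarity of its crop.
best_mask, best_sim = None, -1.0
for m in masks:
    x, y, w, h = m["bbox"]  # XYWH pixel coordinates
    crop = Image.fromarray(image[y:y + h, x:x + w])
    with torch.no_grad():
        img_feat = model.encode_image(preprocess(crop).unsqueeze(0))
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        sim = (img_feat @ text_feat.T).item()
    if sim > best_sim:
        best_mask, best_sim = m["segmentation"], sim
```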

Visualization of Similarity of SAM Latent Space Mask Embeddings

(Visualization of thresholding the similarities of mask embeddings from SAM's latent space.)
Queries are indicated by magenta boxes; the top row shows matches at a low threshold and the bottom row at a high threshold. The most similar mask embeddings within the same image are often semantically similar to the query's, even though SAM was not trained with explicit semantic supervision.
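The paper compares mask embeddings from SAM's decoder, which the public API does not expose directly; as a loose stand-in, one can threshold cosine similarities between patches of the encoder's image embedding. A sketch under that assumption, with the query box given in the embedding's 64x64 grid coordinates as placeholders:

```python
import torch
import torch.nn.functional as F

emb = predictor.get_image_embedding()[0]  # (256, 64, 64)
emb = F.normalize(emb, dim=0)             # unit-normalize each 256-dim patch vector

# Hypothetical query region, averaged into a single query vector.
qy0, qy1, qx0, qx1 = 10, 20, 30, 40
query = F.normalize(emb[:, qy0:qy1, qx0:qx1].mean(dim=(1, 2)), dim=0)

similarity = torch.einsum("c,chw->hw", query, emb)  # cosine similarity map
matches_low = similarity > 0.5   # low threshold: looser matches (top row of the figure)
matches_high = similarity > 0.9  # high threshold: tighter matches (bottom row)
```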

Usage

Plugin for stable-diffusion-webui


Reading the original paper

Original abstract

We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks collected on 11 million licensed and privacy-respecting images.
The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive, often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and the corresponding dataset (SA-1B) of 1 billion masks and 11 million images to foster research into foundation models for computer vision.
