Paper: 2304. Segment Anything
code: https://github.com/facebookresearch/segment-anything
official website and demo: https://segment-anything.com/
【Extended reading】A comprehensive survey of the Segment Anything Model (SAM): 2305. A Comprehensive Survey on Segment Anything Model for Vision and Beyond
【Application】A plugin for stable-diffusion-webui: https://github.com/continue-revolution/sd-webui-segment-anything
Summary: What is SAM?
SAM (Segment Anything Model) is a general-purpose segmentation model
that can segment objects in images from point clicks, text input, and bounding boxes
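The point and box prompts above map directly onto the official `segment-anything` API. The following is a minimal sketch, assuming the package is installed and a ViT-H checkpoint has been downloaded; the checkpoint path, click coordinates, and box coordinates are placeholders, not values from the paper.

```python
import numpy as np

def segment_with_prompts(image, checkpoint="sam_vit_h_4b8939.pth"):
    """Run SAM on an HxWx3 uint8 RGB image with a point and a box prompt.

    Sketch only: imports live inside the function so this file stays
    importable even when torch / segment_anything are not installed.
    """
    from segment_anything import sam_model_registry, SamPredictor

    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image)  # computes the image embedding once

    point = np.array([[320, 240]])        # (x, y) click location (placeholder)
    label = np.array([1])                 # 1 = foreground, 0 = background
    box = np.array([100, 100, 500, 400])  # x0, y0, x1, y1 (placeholder)

    masks, scores, _ = predictor.predict(
        point_coords=point,
        point_labels=label,
        box=box,
        multimask_output=True,  # 3 candidate masks for ambiguous prompts
    )
    return masks[scores.argmax()]  # keep the highest-confidence mask
```

Note that the released code accepts point and box prompts; the text-prompt capability described in the paper is not part of the public checkpoint.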
Figure 1 Project contents (released model and dataset):
Figure 1: Our goal is to build a foundation model for segmentation by introducing three interconnected components: a promptable segmentation task, a segmentation model (SAM) that powers data annotation and, via prompt engineering, enables zero-shot transfer to a range of tasks, and a data engine for collecting SA-1B, our dataset of over 1 billion masks.
Model overview (SAM overview)
A heavyweight image encoder outputs an image embedding that can then be efficiently queried by a variety of input prompts
to produce object masks at amortized real-time speed. For ambiguous prompts corresponding to multiple objects, SAM can output multiple valid masks, each accompanied by a confidence score.
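The "multiple valid masks with confidence scores" behaviour can be illustrated with a toy example: given the three candidate masks SAM returns for an ambiguous prompt (shape `(3, H, W)` with `multimask_output=True`), keep the one with the highest predicted score. All arrays below are made-up illustration data, not SAM output.

```python
import numpy as np

# Three candidate 4x4 binary masks for one ambiguous point prompt (toy data),
# mimicking the (3, H, W) shape SAM returns with multimask_output=True.
masks = np.zeros((3, 4, 4), dtype=bool)
masks[0, :2, :2] = True   # "subpart" interpretation
masks[1, :3, :3] = True   # "part" interpretation
masks[2, :, :] = True     # "whole object" interpretation

# One predicted confidence (IoU) score per candidate mask (toy values).
scores = np.array([0.55, 0.91, 0.78])

best = int(scores.argmax())   # index of the highest-confidence mask
best_mask = masks[best]
print(best, best_mask.sum())  # -> 1 9
```

Downstream applications typically either keep only this best mask or surface all three candidates so a human annotator can choose.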
Remark (image embedding):
An image embedding is a fixed-length, high-dimensional vector representation of an image, obtained by feeding the image through an image encoder in a deep neural network. The encoder is a trained model that maps an image to a high-dimensional vector in which each dimension captures some feature or semantic information of the image. With this vector representation, distance measures in the vector space can quantify the similarity or difference between images, and the embedding can also serve as input for other tasks such as image classification, image retrieval, and image generation.
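The "distance measure in the vector space" idea can be made concrete with cosine similarity between embedding vectors. The three-dimensional vectors below are toy stand-ins for real image embeddings (which are much higher-dimensional):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings": two images of similar content and one different image.
cat_1 = np.array([0.9, 0.1, 0.0])
cat_2 = np.array([0.8, 0.2, 0.1])
car   = np.array([0.0, 0.1, 0.9])

# Similar images end up closer in embedding space than dissimilar ones.
assert cosine_similarity(cat_1, cat_2) > cosine_similarity(cat_1, car)
```

Image retrieval then reduces to ranking candidate embeddings by this similarity to a query embedding.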
Selected results
Multiple outputs for an ambiguous point prompt
Each column shows 3 valid masks generated by SAM from a single ambiguous point prompt (green circle)
Zero-shot reasoning capabilities on various datasets
"Zero-shot" generally refers to a model's ability to perform a task it was never explicitly trained on
Zero-shot edge detection capability
Zero-shot edge prediction
Segmentation based on text prompts (Zero-shot text-to-mask)
SAM can segment objects from simple and nuanced text prompts. When SAM fails to make the correct prediction from text alone, an additional point prompt can help.
Visualization of Similarity of SAM Latent Space Mask Embeddings
(Visualization of thresholding the similarities of mask embeddings from SAM's latent space)
Queries are indicated by magenta boxes; the top rows show matches at a low threshold and the bottom rows at a high threshold. The most similar mask embeddings in the same image are often semantically similar to the query mask embedding, even though SAM is not trained with explicit semantic supervision
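The low-threshold / high-threshold matching shown in the figure can be sketched as thresholded cosine-similarity retrieval. The 2-D embeddings below are toy data chosen for illustration:

```python
import numpy as np

def matches_above(query, embeddings, threshold):
    """Indices of rows in `embeddings` whose cosine similarity to `query` exceeds `threshold`."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q  # cosine similarity of every candidate to the query
    return np.flatnonzero(sims > threshold)

query = np.array([1.0, 0.0])        # toy query mask embedding
candidates = np.array([
    [0.9, 0.1],   # very similar
    [0.6, 0.4],   # somewhat similar
    [0.0, 1.0],   # dissimilar
])

low  = matches_above(query, candidates, threshold=0.5)   # loose matching: more hits
high = matches_above(query, candidates, threshold=0.95)  # strict matching: fewer hits
print(low.tolist(), high.tolist())  # -> [0, 1] [0]
```

Raising the threshold prunes weaker matches, which is exactly the difference between the top and bottom rows of the visualization.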
Usage
A plugin for stable-diffusion-webui
Original abstract
We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11 million licensed and privacy-respecting images.
The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive, often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and the corresponding dataset (SA-1B) of 1 billion masks and 11 million images to foster research into foundation models for computer vision.