4k stars in one day online | Facebook: Segment Anything

Preface

This paper introduces Facebook AI Research's Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using the model in a data collection loop, the project built the largest segmentation dataset to date, with over 1 billion masks on 11 million licensed and privacy-respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks.

Its zero-shot performance is impressive, often competitive with or even superior to prior fully supervised results, as evaluated across a wide range of tasks.


Paper: https://arxiv.org/pdf/2304.02643.pdf

Project: https://github.com/facebookresearch/segment-anything

Demo: https://segment-anything.com

Motivation

Large language models pretrained on massive datasets are revolutionizing NLP with powerful zero-shot and few-shot generalization. The goal of this paper is to build a foundation model for image segmentation: develop a promptable model, pre-train it on a broad dataset with a task that enables strong generalization, and then use prompt engineering to solve a range of downstream segmentation problems on new data distributions. The plan's success depends on three components: task, model, and data. The paper therefore needs to answer the following questions about image segmentation:

1. What task will enable zero-shot generalization?
2. What is the corresponding model architecture?
3. What data can support this task and model?

Innovative Ideas

The authors first define a promptable segmentation task that is general enough to provide a powerful pre-training objective and to support a wide range of downstream applications. This task requires a model that supports flexible prompts, can output segmentation masks in amortized real time when prompted (to allow interactive use), and is ambiguity-aware. Training such a model in turn requires diverse, large-scale data; to achieve strong generalization to new data distributions, SAM must be trained on a mask set far larger and more diverse than any existing segmentation dataset.

method

Segment Anything Task

The authors first translate the concept of prompts from NLP to segmentation: a prompt can be a set of foreground/background points, a rough box or mask, free-form text, or, in general, any information indicating what to segment in an image. The promptable segmentation task is then to return a valid segmentation mask given any prompt. "Valid" simply means that even if the prompt is ambiguous and could refer to multiple objects, the output should be a reasonable mask for at least one of those objects. This requirement is similar to expecting a language model to output a coherent response to an ambiguous prompt. This task was chosen because it leads to a natural pre-training algorithm and a general method for zero-shot transfer to downstream segmentation tasks via prompting. In the paper's figure, each column shows three valid masks generated by SAM from a single ambiguous point prompt (green circle).
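To make the prompting interface concrete, here is a minimal sketch using the SamPredictor class from the official repository; the checkpoint path, image path, and point coordinates are placeholders (see the repo's README for checkpoint downloads).

```python
import numpy as np
import cv2
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM checkpoint (model type and path are placeholders; see the README).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# The predictor expects an RGB uint8 image; the heavy image encoder runs once here.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground point prompt: (x, y) coordinates with label 1 (foreground).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return 3 candidate masks for an ambiguous prompt
)
best_mask = masks[scores.argmax()]  # boolean HxW mask with the highest predicted IoU
```

Because only the lightweight prompt encoder and mask decoder run per prompt, new points or boxes can be tried interactively against the same cached image embedding.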

Segment Anything Model

SAM has three components: an image encoder, a flexible prompt encoder, and a fast mask decoder. The image encoder outputs an image embedding that can then be efficiently queried with various input prompts to produce object masks at amortized real-time speed. For ambiguous prompts that correspond to multiple objects, SAM can output multiple valid masks with associated confidence scores. Mask prediction is supervised with a linear combination of the focal loss and dice loss used in [21], and the promptable segmentation task is trained with a mixture of geometric prompts.
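The blog does not spell out the loss, but a minimal sketch of such a focal + dice combination over per-pixel mask logits might look like the following; the 20:1 focal-to-dice weighting follows the ratio reported in the paper, and the reduction details here are illustrative.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss over per-pixel mask logits; targets are 0/1 floats."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    prob = torch.sigmoid(logits)
    p_t = prob * targets + (1 - prob) * (1 - targets)   # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, targets, eps=1.0):
    """Soft dice loss: 1 - 2*overlap / total mass, computed per mask."""
    prob = torch.sigmoid(logits).flatten(1)
    targets = targets.flatten(1)
    inter = (prob * targets).sum(-1)
    return (1 - (2 * inter + eps) / (prob.sum(-1) + targets.sum(-1) + eps)).mean()

def mask_loss(logits, targets, focal_w=20.0, dice_w=1.0):
    # Linear combination supervising mask prediction (20:1 per the paper).
    return focal_w * focal_loss(logits, targets) + dice_w * dice_loss(logits, targets)
```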

Segment Anything Data Engine

The authors built a data engine to collect SA-1B, a dataset of 1.1B masks. The data engine has three stages: (1) a model-assisted manual annotation stage, (2) a semi-automatic stage mixing automatically predicted masks with model-assisted annotation, and (3) a fully automatic stage in which SAM generates masks without any annotator input.
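The fully automatic stage corresponds to the SamAutomaticMaskGenerator class shipped with the repository, which prompts the model with a regular grid of points and filters the results. A minimal sketch (checkpoint and image paths are placeholders):

```python
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
# Each returned dict holds a binary "segmentation" mask plus metadata
# such as "area", "bbox", and "predicted_iou".
masks = mask_generator.generate(image)
```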

Segment Anything Dataset

The resulting dataset, SA-1B, contains 11 million diverse, high-resolution, licensed, and privacy-protecting images and 1.1B high-quality segmentation masks collected with the data engine. SA-1B has 11x more images and 400x more masks than Open Images, the largest existing segmentation dataset (see the paper's comparison figure).
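SA-1B distributes its masks as COCO run-length encodings in per-image JSON files, so they can be decoded with pycocotools. A small sketch, assuming that layout (the file name below is a placeholder):

```python
import json
from pycocotools import mask as mask_utils

with open("sa_000001.json") as f:
    record = json.load(f)

# Each annotation stores its mask as a COCO RLE dict ("size" + "counts").
for ann in record["annotations"]:
    binary_mask = mask_utils.decode(ann["segmentation"])  # HxW uint8 array
```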

Results

The authors use mIoU for automatic evaluation on the full suite of 23 datasets, comparing against RITM on each. SAM yields higher results on 16 of the 23 datasets, by as much as ~47 IoU. Some of the specific results are shown below.
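For reference, the per-mask IoU underlying these mIoU numbers is simply the intersection over union of the predicted and ground-truth binary masks; a minimal sketch:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean HxW masks."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:  # both masks empty: count as a perfect match
        return 1.0
    return np.logical_and(pred, gt).sum() / union
```

mIoU for a dataset is then the mean of this score over all evaluated masks.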

Samples from the 23 diverse segmentation datasets used to evaluate SAM's zero-shot transfer capability:

Mask evaluation comparison on the 23 datasets. (a) Mean IoU of SAM versus RITM, the strongest single-point segmenter. (b) Annotators' per-dataset mask quality ratings, from 1 (worst) to 10 (best). All methods use the ground-truth mask center as the prompt. (c, d) mIoU with varying numbers of points. SAM significantly outperforms previous interactive segmenters with a single point and is on par with them given more points.

Edge detection task: the following visualizes zero-shot edge detection results on BSDS500. SAM was not trained to predict edge maps, and it had no access to BSDS images or annotations during training.

Comparison of zero-shot edge detection transfer on BSDS500:

Object detection comparison:

Instance segmentation comparison:

Visualization:

Ablation studies on the data engine stages, image encoder scaling, and training data scaling:

Source: https://blog.csdn.net/KANG157/article/details/130031306