Segment Anything: GPT-style prompt-based milestone research in the computer vision world

1. A milestone research achievement in the field of computer vision: a review of SAM and SA-1B

Segment Anything is inspired by the ChatGPT-style prompt-based paradigm. Its training dataset covers over 1 billion masks, and the model generates different segmentation masks in real time according to the prompts provided on an image. The demo results are impressive.

Segment Anything is to computer vision what ChatGPT is to NLP.

On April 5, 2023, Meta AI published a blog post, Introducing Segment Anything: Working toward the first foundation model for image segmentation, announcing the first foundation model in the field of image segmentation.

The official blog introduces the Segment Anything Model (SAM) and the Segment Anything 1-Billion mask dataset (SA-1B), briefly covering how SAM works and how SA-1B was built. Below we explore this blog post and its accompanying paper:

2. Summary of Segment Anything project development achievements

1. Expectations and goals of project development

Official text:

Reducing the need for task-specific modeling expertise, training compute, and custom data annotation for image segmentation is at the core of the Segment Anything project.

This can be roughly distilled into three goals:

  • Reduce the need for task-specific modeling expertise
  • Reduce the computational requirements of training
  • Reduce the need for custom data annotation

The path to these goals: build a foundation model for image segmentation, that is, a promptable model trained on diverse data that can adapt to specific tasks, similar to the way prompts are used in natural language processing models.

Note from the author: just like ChatGPT's prompt-based learning, the release of the Segment Anything Model brings the GPT-like concept from the NLP world into computer vision.

The Meta AI team set out to develop a general-purpose promptable segmentation model and use it to create a segmentation dataset of unprecedented size.

Task-specific models trained on custom data had long been the tradition in computer vision.

2. SAM: a general promptable model

2.1 What is Prompt?

A prompt is a piece of text or a set of vectors added to the input, so that the model can perform a task, for example masked language modeling (MLM), conditioned on both the input and the external prompt.
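To make this concrete, here is a minimal sketch of prompt-style masked language modeling; it assumes the Hugging Face transformers package and the bert-base-uncased checkpoint, which are my choices for illustration, not something named in the original post:

```python
# A minimal sketch of prompt-style masked language modeling (MLM),
# assuming the Hugging Face `transformers` package and the
# bert-base-uncased checkpoint (illustrative choices).
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The [MASK] slot is filled conditioned on the surrounding prompt text.
for candidate in unmasker("Paris is the [MASK] of France."):
    print(candidate["token_str"], round(candidate["score"], 3))
```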

2.2 The development idea behind SAM

The basic task of computer vision is to determine which pixels belong to which object, commonly referred to as segmentation. At present there are two main approaches and directions:

  • Interactive segmentation, which can segment objects of any class but requires a human to guide the method by iteratively refining the mask
  • Automatic segmentation, which can segment specific object categories defined in advance but requires a large number of manually annotated objects for training, along with the computing resources and technical expertise to train the segmentation model

Neither of these approaches provides a general, fully automatic solution to segmentation.

SAM is a generalization of these two classes of methods: a single model that can easily perform both interactive and automatic segmentation. Its promptable interface (described below) allows it to be used flexibly, so a wide range of segmentation tasks can be accomplished simply by designing the right prompt for the model (clicks, boxes, text, and so on). In addition, SAM is trained on a diverse, high-quality dataset containing more than 1 billion masks (collected as part of this project), which enables it to generalize to new types of objects and images beyond what was observed during training. This ability to generalize means that, by and large, practitioners will no longer need to collect their own segmentation data and fine-tune a model for their use case.
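As a concrete illustration of the two usage modes, here is a minimal sketch using the officially released segment_anything package; the image path and checkpoint filename below are illustrative assumptions, not something prescribed by the post:

```python
# A minimal sketch of SAM's two usage modes, assuming the officially
# released `segment_anything` package; the image path and checkpoint
# filename are illustrative.
import cv2
import numpy as np
from segment_anything import (SamAutomaticMaskGenerator, SamPredictor,
                              sam_model_registry)

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

# segment_anything expects an RGB uint8 array.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)

# Interactive mode: prompt with a single foreground click (label 1).
predictor = SamPredictor(sam)
predictor.set_image(image)
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
)

# Automatic mode: segment everything in the image, no prompts required.
generator = SamAutomaticMaskGenerator(sam)
all_masks = generator.generate(image)  # list of dicts: "segmentation", "area", ...
```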

Building on the idea of prompting for zero-shot and few-shot learning, the Meta AI team trained SAM to return a valid segmentation mask for any prompt, where a prompt can be foreground/background points, a rough box or mask, free-form text, or, in general, any information indicating what should be segmented in the image.

The requirement of a valid segmentation mask is that, even when a prompt is ambiguous and could refer to multiple objects, the output should be a reasonable mask for one of those objects. This task is used to pre-train the model and to solve general downstream segmentation tasks via prompting.
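Continuing the sketch above, the released package exposes this ambiguity handling through a multimask_output flag, which returns several candidate masks with predicted quality scores, so an ambiguous click still yields at least one valid mask:

```python
# Continuing the predictor setup above: with multimask_output=True, SAM
# returns several candidate masks (e.g. whole / part / subpart) plus
# predicted quality scores.
import numpy as np

masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),  # a click that may touch several objects
    point_labels=np.array([1]),           # 1 marks a foreground point
    multimask_output=True,                # three candidates by default
)
best_mask = masks[np.argmax(scores)]      # keep the highest-scoring interpretation
```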

This simple design works well in practice and supports real-time segmentation with prompt and computation feedback in a web browser. The image encoder produces a one-time embedding for the image, while a lightweight prompt encoder converts any prompt into an embedding vector in real time. These two sources of information are then combined in a lightweight decoder that predicts segmentation masks. Once the image embedding has been computed, SAM can produce a mask for any prompt in a web browser within about 50 milliseconds. As shown below:

SAM efficiently maps image features and a set of prompt embeddings to produce segmentation masks

Image source: Introducing Segment Anything official blog
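The one-time image embedding is what makes this responsiveness possible. Continuing the same hypothetical setup, the sketch below shows the pattern: one expensive set_image call, then many cheap prompt queries. Timings are machine-dependent; the ~50 ms figure quoted above is Meta's, for the in-browser decoder:

```python
# Continuing the same setup: set_image runs the heavy image encoder once;
# each predict call afterwards only runs the lightweight prompt encoder
# and mask decoder.
import time

predictor.set_image(image)  # expensive: one-time image embedding

for xy in [(100, 200), (300, 150), (500, 375)]:
    t0 = time.perf_counter()
    masks, _, _ = predictor.predict(
        point_coords=np.array([xy]),
        point_labels=np.array([1]),
    )
    print(f"prompt {xy}: {(time.perf_counter() - t0) * 1000:.1f} ms")
```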

2.3 Possible future application scenarios

SAM can become a powerful component in AR, VR, content creation, scientific fields, and more general AI systems in the future

SAM may also help farmers in agriculture or assist biologists in research

3. Construction of the SA-1B dataset

Training SAM requires a vast data source, and the newly released segmentation dataset is the largest to date. The data was collected using SAM itself: annotators used SAM to interactively annotate images, and the newly annotated data was then used to update SAM. Repeating this cycle many times iteratively improved both the model and the dataset.

With SAM, collecting new segmentation masks is faster than ever.

The author uploaded an image for testing: the processing time was indeed less than 20 seconds

Using this tool, interactively annotating a mask takes only about 14 seconds. The per-mask annotation process is only 2x slower than annotating bounding boxes, which takes about 7 seconds with the fastest annotation interfaces. Compared with previous large-scale segmentation data collection efforts, this is 6.5x faster than the fully manual polygon-based mask annotation used for COCO, and 2x faster than the previous largest data annotation effort, which was also model-assisted.

However, interactively annotated masks alone could not scale to the 1 billion masks the team needed. So the team built a data engine to create the SA-1B dataset. This data engine has three "gears" (see the sketch after this list):

  • Model-assisted annotation, as described above.
  • A mix of fully automatic annotation and assisted annotation, which helps increase the diversity of the collected masks.
  • Fully automatic mask creation, which allowed the dataset to scale up for training.
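As promised above, here is a heavily simplified pseudocode sketch of such a three-gear loop; train(), assisted_annotate(), and auto_annotate() are hypothetical placeholders, not Meta AI's actual pipeline:

```python
# A heavily simplified pseudocode sketch of the three-gear data engine;
# all function names are hypothetical placeholders.
dataset = list(seed_masks)            # hypothetical initial annotations
model = train(dataset)

for gear in ("assisted", "semi-automatic", "fully-automatic"):
    for image in unlabeled_images:    # hypothetical image pool
        if gear == "assisted":
            # Gear 1: humans refine masks proposed by the model.
            new_masks = assisted_annotate(model, image)
        elif gear == "semi-automatic":
            # Gear 2: confident masks come automatically; humans add the
            # rest, which increases mask diversity.
            new_masks = auto_annotate(model, image) + assisted_annotate(model, image)
        else:
            # Gear 3: fully automatic mask generation at scale.
            new_masks = auto_annotate(model, image)
        dataset.extend(new_masks)
    model = train(dataset)            # retrain on the growing dataset
```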

The final dataset consists of over 1.1 billion segmentation masks collected on approximately 11 million licensed and privacy-preserving images. SA-1B has 400 times more masks than any existing segmentation dataset, and human evaluation studies confirm that the masks are of high quality and diversity, in some cases even qualitatively comparable to masks from previous, smaller, fully manually annotated datasets.

Note: Segment Anything was trained on millions of images and masks collected by the data engine, resulting in a dataset of over 1 billion segmentation masks, 400 times larger than any previous segmentation dataset.

The sheer size of such a dataset can be daunting for other developers. Enterprise-scale AI teams are leveraging their enormous advantage in training data to set off one revolution after another.

3. Trying segmentation with the Segment Anything web demo

The author tried the demo on the official Segment Anything website; interested readers can try it themselves.

The author currently works on medical image segmentation. Below is a segmentation result from our own trained network:

Image segmentation using a model trained with a Residual U-Net network

Here is a GIF animation of segmentation using Segment Anything:

Image segmentation with Segment Anything

Arbitrary images can then be segmented in the demo. Different masks are generated as the mouse moves over the image: the left mouse button adds a mask region, and the right mouse button removes an area. Even before adding any masks, the results were almost identical to those of our own trained network.
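For readers using the released segment_anything package rather than the web demo, the same add/remove interaction plausibly maps onto the API's point-label convention (1 for foreground, 0 for background); a minimal sketch, continuing the earlier hypothetical setup:

```python
# Continuing the earlier predictor setup: the demo's clicks map onto
# SAM's point-label convention (1 = foreground / "add mask",
# 0 = background / "remove area").
import numpy as np

clicks = [(320, 240, "left"), (100, 80, "right")]  # illustrative click log
point_coords = np.array([[x, y] for x, y, _ in clicks])
point_labels = np.array([1 if button == "left" else 0 for _, _, button in clicks])

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
)
```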

This is, of course, just one example. On other images in our dataset, Segment Anything's current output still differs considerably from the actual ground truth. Nevertheless, this development idea is excellent, and the change it brings to the computer vision industry is inestimable.

4. References

Introducing Segment Anything: Working toward the first foundation model for image segmentation

Understand Prompt in an easy-to-understand manner
