Does CV no longer exist? Meta releases SAM, the "GPT model" of the CV world that can segment anything


Source|Heart of the Machine

What's next for CV researchers?


"The CV really doesn't exist now. <Quick Run>" This is Zhihu netizens' evaluation of a new Meta paper.

As the title suggests, this paper does only one thing: (zero-shot) segmenting everything, much like the "answering everything" that GPT-4 already does.


According to Meta, this is the first foundation model dedicated to image segmentation. With it, CV has also set out on the road of "building an all-round model that unifies one (some? all?) tasks".


Before this, segmentation, as a core task of computer vision, was already widely used. However, building an accurate segmentation model for a specific task usually requires highly specialized work by technical experts, along with large amounts of domain-specific labeled data. These factors have limited the further development of image segmentation.

The new model Meta releases in the paper is called the Segment Anything Model (SAM). "SAM has learned a general notion of what objects are, and it can generate masks for any object in any image or video, even objects and image types it never encountered during training," Meta wrote in its blog. "SAM is general enough to cover a wide range of use cases and can be used out of the box on new image 'domains' without additional training." In deep learning, this capability is usually called zero-shot transfer, which is also one of the reasons GPT-4 shocked the world.


Paper address:
https://arxiv.org/abs/2304.02643

Project address:
https://github.com/facebookresearch/segment-anything

Demo address:
https://segment-anything.com/

In addition to the model, Meta released an image annotation dataset Segment Anything 1-Billion (SA-1B), which it claims is the largest segmentation dataset ever created. This dataset is available for research purposes and the Segment Anything Model is available under an open license (Apache 2.0).

Let's look at the results first. As shown in the animation below, SAM can automatically segment everything in an image very well:

▲ Animation: SAM automatically segmenting all objects in an image
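For readers who want to try this themselves, the open-source repository exposes an automatic mask generator. The sketch below is a minimal example based on the facebookresearch/segment-anything README; the checkpoint filename and image path are placeholders to substitute with your own downloads:

```python
# Minimal sketch of automatic "segment everything", following the API documented
# in the facebookresearch/segment-anything repository. The checkpoint file and
# image path are placeholders.
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Load a SAM backbone ("vit_h", "vit_l", or "vit_b") from a downloaded checkpoint.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

# Read an image as an (H, W, 3) uint8 RGB array.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)

# Generate a mask for every object the model can find in the image.
mask_generator = SamAutomaticMaskGenerator(sam)
masks = mask_generator.generate(image)

# Each entry is a dict: 'segmentation' is a boolean mask, and fields such as
# 'area', 'bbox', and 'predicted_iou' describe the detected region.
print(len(masks), "masks found")
```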

SAM can also segment images from text prompts. For example, given the prompt "cat", SAM draws boxes around the cats in the photo and segments them:

▲ Animation: SAM segmenting the cats in a photo from the text prompt "cat"

SAM can also be prompted with interactive points and boxes:

▲ Animations: SAM segmenting from interactive point and box prompts

Additionally, SAM can generate multiple valid masks for an ambiguous prompt:

▲ Animation: SAM returning several valid masks for a single ambiguous prompt
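The released code exposes this prompt-driven mode through a predictor class (point and box prompts are in the public code; text prompting is explored in the paper). A minimal sketch, again following the repository's documented API, with paths and coordinates as placeholders:

```python
# Sketch of prompt-based segmentation with SamPredictor, following the
# facebookresearch/segment-anything API; paths and coordinates are placeholders.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # computes the image embedding once

# A single (possibly ambiguous) foreground click: label 1 = foreground, 0 = background.
point = np.array([[500, 375]])
label = np.array([1])

# multimask_output=True asks for several candidate masks plus quality scores,
# which is how SAM resolves ambiguous prompts.
masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,
)
print(masks.shape, scores)  # e.g. (3, H, W) boolean masks and their scores

# A box prompt in XYXY pixel coordinates can be used instead of, or alongside, points.
box = np.array([100, 100, 400, 400])
box_masks, _, _ = predictor.predict(box=box, multimask_output=False)
```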

Jim Fan, an AI scientist at Nvidia, commented: "I think this Meta work is one of the GPT-3 moments for computer vision. It has learned the general concept of objects, and it can segment well even for unknown objects, unfamiliar scenes (such as underwater images), and ambiguous cases. Most importantly, the model and the data are open source. IMHO, Segment-Anything has already done everything (segmentation) very well."

▲ Twitter address: https://twitter.com/DrJimFan/status/1643647849824161792

Some netizens noted that the prompt paradigm from NLP has begun to spread into CV, and this kind of paradigm can be expected to explode in academia this year.


Some netizens even said they can't hold it together anymore: now that SAM is out, CV really doesn't exist, so be careful when submitting to ICCV.


Others, however, said the model is not yet good enough to test in a production environment. Perhaps solving this long-standing problem will still take time?


Method introduction

Before this, there were roughly two approaches to the segmentation problem. The first, interactive segmentation, can segment objects of any class but requires a human to guide the method by iteratively refining the mask. The second, automatic segmentation, can segment specific object categories defined in advance (e.g., cats or chairs) but requires a large number of manually annotated objects for training (e.g., thousands or even tens of thousands of segmented cat examples). Neither approach provides a general, fully automatic way to segment.

SAM nicely unifies the two approaches. It is a single model that can easily perform both interactive and automatic segmentation. Its promptable interface lets users apply it flexibly: a wide range of segmentation tasks can be accomplished simply by designing the right prompt (clicks, boxes, text, and so on) for the model.

Taken together, these capabilities enable SAM to generalize to new tasks and domains. This flexibility is a first in the field of image segmentation.

Meta says the work was inspired by prompting in language models, so the trained SAM can return a valid segmentation mask for any prompt, where a prompt can be foreground/background points, a rough box or mask, free-form text, or, in general, any information indicating what to segment in an image. The requirement of a valid mask simply means that even when a prompt is ambiguous and could refer to multiple objects (for example, a point on a shirt could indicate either the shirt or the person wearing it), the output should be a reasonable mask for at least one of those objects (as in the animation above, "SAM can generate multiple valid masks for an ambiguous prompt"). This task is used to pre-train the model and to solve general downstream segmentation tasks via prompts.

As shown in the figure below, the image encoder produces a one-time embedding for the image, while a lightweight prompt encoder converts each prompt into an embedding vector in real time. These two sources of information are then combined in a lightweight mask decoder that predicts segmentation masks. Once the image embedding has been computed, SAM can produce a mask for any prompt within about 50 milliseconds in a web browser.

▲ In a web browser, SAM efficiently maps image features and a set of prompt embeddings to produce segmentation masks
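A practical consequence of this split is that the heavy image encoder runs once per image, while every new prompt only touches the lightweight prompt encoder and mask decoder. Continuing the SamPredictor sketch above (same assumed setup, placeholder coordinates), the interactive pattern looks roughly like this:

```python
# The expensive ViT image encoder runs once (set_image); each prompt afterwards
# is handled by the lightweight prompt encoder + mask decoder. Continues the
# SamPredictor setup from the earlier sketch; coordinates are placeholders.
import numpy as np

predictor.set_image(image)  # one heavy forward pass of the image encoder

clicks = [(120, 80), (300, 210), (450, 330)]  # e.g. user clicks arriving over time
for x, y in clicks:
    # Each call reuses the cached image embedding, so it is cheap enough
    # for interactive use (on the order of tens of milliseconds).
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )
```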

11 million images, 1B+ masks

The dataset was collected using SAM itself. Annotators used SAM to interactively annotate images, and the newly annotated data was in turn used to update SAM; the two are mutually reinforcing.

With this method, interactively annotating a mask takes only about 14 seconds. Compared with previous large-scale segmentation data collection efforts, and thanks to the assistance of the SAM model, Meta's approach is 6.5 times faster than COCO's fully manual polygon-based mask annotation and 2 times faster than the previous largest data annotation effort.
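The paper and blog describe this data engine only at a high level. Purely as an illustration of the model-in-the-loop idea, the sketch below uses hypothetical stand-in functions (none of them exist in the released code) to show the cycle:

```python
# Purely illustrative sketch of a model-in-the-loop data engine. Every function
# here is a hypothetical stand-in for a stage Meta describes, not released code.
def propose_masks(model, image):
    """The current model proposes masks for an image."""
    return model(image)

def human_review(image, proposals):
    """Annotators confirm or correct the proposals (about 14 s per mask with SAM's help)."""
    return proposals  # stand-in: accept the proposals unchanged

def retrain(model, dataset):
    """The model is updated on all masks collected so far."""
    return model  # stand-in: no actual training performed here

def data_engine(model, images, rounds=3):
    dataset = []
    for _ in range(rounds):
        for image in images:
            dataset.extend(human_review(image, propose_masks(model, image)))
        model = retrain(model, dataset)  # a better model speeds up the next round
    return model, dataset
```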

The final dataset contains more than 1.1 billion segmentation masks, collected on approximately 11 million licensed, privacy-preserving images. SA-1B has 400 times more masks than any existing segmentation dataset, and human evaluation studies confirm that the masks are of high quality and diversity, in some cases comparable in quality to masks from previous, much smaller, fully manually annotated datasets.

▲ Segment Anything was trained on millions of images and masks collected with the data engine, yielding a dataset of more than 1 billion segmentation masks, 400 times larger than any previous segmentation dataset.

SA-1B's images come from photo providers in multiple countries spanning different geographic regions and income levels, giving the dataset more images and better overall representation of all regions. Meta analyzed its model for potential biases in perceived gender presentation, perceived skin tone, and age range, and found that SAM performs similarly across groups.

SA-1B can help other researchers train foundation models for image segmentation. Meta further hopes this data will become the basis for new datasets with additional annotations, such as a text description associated with each mask.

Future outlook

By sharing its research and dataset, Meta hopes to further accelerate research on image segmentation and more general image and video understanding. A promptable segmentation model can act as a component in a larger system to perform segmentation tasks. Composition is a powerful tool that allows a single model to be used in extensible ways, potentially accomplishing tasks that were unknown at the time the model was designed.

Meta anticipates that composable system designs based on techniques such as prompt engineering will support a wider range of applications than systems trained for a fixed set of tasks. SAM can be a powerful component for AR, VR, content creation, scientific domains, and more general AI systems. For example, SAM could identify everyday objects through AR glasses and prompt the user with reminders.


SAM also has the potential to help farmers in agriculture or assist biologists in their research.


In the future, we will see a tighter coupling between pixel-level image understanding and higher-level semantic understanding of visual content, unlocking more powerful AI systems.


[1]https://ai.facebook.com/blog/segment-anything-foundation-model-image-segmentation/

[2]https://www.zhihu.com/question/593914819
