New work | HQ-SAM: Segment Anything in High Quality (ETH Zurich & HKUST)

Author: Pai Pai Xing | Editor: CVHub


Title: Segment Anything in High Quality
PDF: https://arxiv.org/pdf/2306.01567v1.pdf
Code: https://github.com/SysCV/SAM-HQ

Overview

Although SAM was trained on 1.1 billion masks and offers strong zero-shot capability and flexible prompting, its mask prediction quality still falls short in many cases, especially for objects with intricate structures.

To this end, this paper proposes HQ-SAM, which equips SAM with the ability to accurately segment any object while preserving SAM's original design, efficiency, and zero-shot generalization. The design reuses and keeps the pre-trained SAM weights, introducing only minimal additional parameters and computation. A learnable high-quality output token is injected into SAM's mask decoder and is responsible for predicting the high-quality mask. Rather than applying this token only to the mask-decoder features, the method first fuses them with early and final ViT features to improve mask details. To train the newly introduced learnable parameters, the authors construct a dataset of 44K fine-grained masks; HQ-SAM is trained only on this dataset, which takes just 4 hours on 8 GPUs.

Finally, the effectiveness of HQ-SAM is demonstrated on 9 segmentation datasets spanning different downstream tasks, 7 of which are evaluated under zero-shot transfer.

Introduction

Accurate segmentation of diverse objects is fundamental to a range of scene-understanding applications, including image/video editing, robotic perception, and AR/VR. The Segment Anything Model (SAM) is designed as a foundation vision model for general image segmentation and is trained with billions of mask labels. By accepting a prompt consisting of points, a bounding box, or a rough mask, SAM can segment a wide range of objects, parts, and visual structures across diverse scenes. Despite SAM's impressive performance, its segmentation results are still insufficient in many cases, particularly for automatic annotation and image/video editing tasks, which place extremely high demands on mask accuracy.

Therefore, the authors propose HQ-SAM, a model that predicts segmentation masks with much higher accuracy while retaining the zero-shot capability and flexibility of the original SAM. To maintain efficiency and zero-shot performance, only minor changes are made to SAM, adding less than 0.5% extra parameters to enable high-quality segmentation. A learnable HQ-Output Token is fed into SAM's mask decoder and trained to predict high-quality segmentation masks. Furthermore, this token operates on a refined feature set to recover precise mask details.

To learn accurate segmentation, a dataset with accurate mask annotations is required. The authors therefore construct a new dataset named HQSeg-44K, containing 44K extremely fine-grained image mask annotations covering more than 1,000 semantic categories. Thanks to the small dataset size and the minimal set of new parameters, HQ-SAM takes only 4 hours to train on 8 RTX 3090 GPUs.

To verify the effectiveness of HQ-SAM, the authors conduct extensive quantitative and qualitative analyses, comparing HQ-SAM with SAM on 9 segmentation datasets covering different downstream tasks, 7 of which follow the zero-shot transfer protocol. These evaluations show that HQ-SAM produces noticeably higher-quality masks than SAM while preserving its zero-shot capability.

Method

[Figure: overview of the HQ-SAM architecture with the HQ-Output Token and global-local feature fusion]

To achieve high-quality mask prediction, HQ-SAM introduces an HQ-Output Token (high-quality output token) and global-local feature fusion into SAM. To preserve SAM's zero-shot capability, the lightweight HQ-Output Token reuses SAM's mask decoder, and a new MLP (multi-layer perceptron) layer maps it to mask weights that are combined with the fused HQ-Features (high-quality features) through a point-wise product. During training, the parameters of the pre-trained SAM are frozen, and only the few learnable parameters introduced by HQ-SAM are trained.

To improve the segmentation quality of the original SAM while retaining its zero-shot characteristics, HQ-SAM makes two key changes to the model.

First, the authors introduce a new output token (the High-Quality Output Token) and global-local feature fusion on top of SAM. The HQ-Output Token better guides high-quality mask generation, while global-local feature fusion extracts and fuses features from different stages to enrich the mask features with both global semantic context and local boundary detail.

The HQ-Output Token improves SAM's mask prediction. In the original SAM design, the mask decoder uses output tokens (similar to object queries in DETR) for mask prediction. In HQ-SAM, the authors introduce a new learnable HQ-Output Token together with a new mask prediction layer dedicated to high-quality mask prediction.
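As a rough illustration of this design (a minimal sketch, not the authors' code: the module names, the token dimension of 256, and the 3-layer MLP are assumptions based on SAM's decoder defaults), the new token can be concatenated with SAM's existing output tokens so that it passes through the frozen decoder, after which a small new MLP turns its embedding into per-pixel mask weights:

```python
import torch
import torch.nn as nn

class HQTokenHead(nn.Module):
    """Illustrative sketch of a learnable HQ-Output Token plus its new MLP head."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # One extra learnable token, to be concatenated with SAM's IoU/mask tokens
        # before they enter the (frozen) two-way transformer decoder.
        self.hq_token = nn.Embedding(1, embed_dim)
        # New 3-layer MLP mapping the decoded token to dynamic mask weights.
        self.hq_mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim // 8),
        )

    def forward(self, hq_token_out: torch.Tensor, hq_features: torch.Tensor) -> torch.Tensor:
        # hq_token_out: (B, embed_dim)            -- HQ token after SAM's frozen decoder
        # hq_features:  (B, embed_dim // 8, H, W) -- fused global-local features
        weights = self.hq_mlp(hq_token_out)                         # (B, embed_dim // 8)
        return torch.einsum("bc,bchw->bhw", weights, hq_features)   # HQ mask logits
```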

Second, global-local feature fusion improves mask quality by extracting and fusing features from different stages of SAM. Specifically, the authors fuse early-layer features from SAM's ViT encoder, global features from the last layer of the ViT encoder, and the mask features from SAM's mask decoder to produce the new high-quality features (HQ-Features).
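A minimal sketch of such a fusion, under assumed shapes and channel sizes (the 1x1 projections, bilinear upsampling, and plain summation below are illustrative choices, not necessarily the paper's exact operators):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalFusion(nn.Module):
    """Illustrative global-local feature fusion producing HQ-Features.

    Assumed inputs:
      early_vit:    (B, vit_dim, 64, 64)    early ViT block features (local detail)
      final_vit:    (B, vit_dim, 64, 64)    last ViT block features (global context)
      decoder_feat: (B, out_dim, 256, 256)  SAM mask-decoder mask features
    """

    def __init__(self, vit_dim: int = 1024, out_dim: int = 32):
        super().__init__()
        self.proj_early = nn.Conv2d(vit_dim, out_dim, kernel_size=1)
        self.proj_final = nn.Conv2d(vit_dim, out_dim, kernel_size=1)

    def forward(self, early_vit, final_vit, decoder_feat):
        size = decoder_feat.shape[-2:]
        early = F.interpolate(self.proj_early(early_vit), size=size,
                              mode="bilinear", align_corners=False)
        final = F.interpolate(self.proj_final(final_vit), size=size,
                              mode="bilinear", align_corners=False)
        return decoder_feat + early + final   # fused HQ-Features
```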

[Table: training and inference cost comparison of SAM and HQ-SAM]

Training and inference cost comparison of ViT-L-based SAM and HQ-SAM. HQ-SAM adds negligible extra computation to SAM, increases the model parameters by less than 0.5%, and retains 96% of the original inference speed. SAM-L was trained for 180k iterations on 128 A100 GPUs; starting from SAM-L, HQ-SAM takes only 4 hours to train on 8 RTX 3090 GPUs.

The training and inference of HQ-SAM are data- and computation-efficient. During training, the authors freeze the parameters of the pre-trained SAM and train only the newly introduced learnable parameters of HQ-SAM. At inference, they follow SAM's procedure but take the mask predicted by the HQ-Output Token as the high-quality mask output.
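A minimal sketch of this parameter-efficient training setup, with placeholder names (`sam`, `hq_modules`, `forward_fn`, and `loss_fn` are stand-ins; the learning rate and loss choice are assumptions, not the paper's exact recipe):

```python
import torch

def train_hq_modules(sam, hq_modules, dataloader, forward_fn, loss_fn, epochs=12):
    """Sketch of HQ-SAM-style parameter-efficient training (illustrative names).

    sam:        the frozen, pre-trained SAM model
    hq_modules: list of newly added modules (HQ token, MLP head, fusion convs)
    forward_fn: callable(images, box_prompts) -> predicted HQ mask logits
    loss_fn:    callable(pred, gt) -> scalar loss (e.g. BCE + dice; an assumption)
    """
    # Freeze every pre-trained SAM parameter.
    for p in sam.parameters():
        p.requires_grad = False

    # Optimize only the newly introduced parameters.
    trainable = [p for m in hq_modules for p in m.parameters()]
    optimizer = torch.optim.AdamW(trainable, lr=1e-3)

    for _ in range(epochs):
        for images, gt_masks, box_prompts in dataloader:   # HQSeg-44K batches (placeholder loader)
            pred = forward_fn(images, box_prompts)
            loss = loss_fn(pred, gt_masks)
            optimizer.zero_grad()
            loss.backward()          # gradients reach only the HQ parameters
            optimizer.step()
```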

Overall, compared with the original SAM, HQ-SAM improves segmentation quality with a far more efficient training process: training completes in 4 hours on 8 RTX 3090 GPUs. HQ-SAM is also very lightweight, with negligible added model parameters, GPU memory usage, and per-image inference time.

Experiments

[Figure: qualitative comparison of masks predicted by SAM and HQ-SAM]

Masks predicted by SAM and HQ-SAM, given the same input prompt (the red box or a few points on the object). HQ-SAM produces more detailed results with much more accurate boundaries. In the rightmost column, SAM misinterprets the thin structure of the kite strings and, given the input box prompt, produces a mask with large errors and broken holes.

[Table: HQ-Output Token ablation on four fine-grained segmentation datasets]

Ablation of the HQ-Output Token on four extremely fine-grained segmentation datasets. Boxes derived from the GT (ground-truth) masks are used as the box prompt input. By default, the mask predicted by the HQ-Output Token is supervised with a loss computed against the full GT mask.
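For reference, deriving a box prompt from a ground-truth mask is typically just a bounding-box computation; a tiny sketch (any jitter or noise applied to the box is omitted here):

```python
import numpy as np

def mask_to_box(mask: np.ndarray) -> np.ndarray:
    """Convert a binary GT mask (H, W) into an XYXY box usable as a box prompt."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        raise ValueError("empty mask")
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=np.float32)
```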

[Table: ablation on the source of HQ-Features]

Ablation on the source of the HQ-Features. "Early-layer" denotes the features after the first global attention block of the ViT encoder, while "final-layer" denotes the output of the last ViT block. The four HQ datasets are DIS (validation set), ThinObject-5K (test set), COIFT, and HR-SOD.

[Table: comparison with model fine-tuning and post-processing baselines]

Comparison with model fine-tuning and additional post-processing baselines. For the COCO dataset, the authors use FocalNet-DINO, a state-of-the-art object detector trained on COCO, as the box prompt generator.

[Figure: recall on COIFT and HRSOD at increasing BIoU thresholds]

The figure above compares recall on COIFT and HRSOD under the zero-shot protocol, using BIoU thresholds ranging from loose to strict. The performance gap between SAM and HQ-SAM widens significantly as the threshold increases from 0.5 to 0.9, showing that HQ-SAM's advantage lies precisely in predicting very accurate segmentation masks under strict quality requirements.
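As a sketch of this style of evaluation (assuming the per-instance Boundary IoU values have already been computed elsewhere; the numbers below are toy values, not results from the paper), recall at a threshold is simply the fraction of instances whose BIoU clears it:

```python
import numpy as np

def recall_at_thresholds(bious: np.ndarray, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)) -> dict:
    """Fraction of instances whose Boundary IoU meets each threshold."""
    return {t: float((bious >= t).mean()) for t in thresholds}

# Toy example: a stricter threshold widens the gap between two methods.
method_a_bious = np.array([0.62, 0.71, 0.85, 0.55, 0.78])
method_b_bious = np.array([0.81, 0.88, 0.93, 0.76, 0.90])
print(recall_at_thresholds(method_a_bious))
print(recall_at_thresholds(method_b_bious))
```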

[Table: zero-shot open-world instance segmentation on UVO]

Results for zero-shot open-world instance segmentation on the UVO dataset. To generate box prompts, the authors use a FocalNet-DINO model trained on the COCO dataset. The marked entries use a stricter threshold to define the boundary region.

[Table: zero-shot segmentation on the high-quality BIG benchmark]

Zero-shot segmentation results on the high-quality BIG benchmark. As input prompts, the authors use coarse masks produced by PSPNet; results are compared across different types of input prompt.
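For context, the released `segment_anything` predictor exposes a `mask_input` argument that accepts a 256x256 low-resolution mask prompt; below is a hedged sketch of feeding a coarse external mask (e.g. from PSPNet) through that interface. The resizing and the binary-to-logit mapping are simple assumptions, not the paper's exact preprocessing.

```python
import numpy as np
import cv2
from segment_anything import SamPredictor

def refine_with_coarse_mask(predictor: SamPredictor, image_rgb: np.ndarray,
                            coarse_mask: np.ndarray) -> np.ndarray:
    """Use a coarse binary mask as the mask prompt and return the refined mask."""
    predictor.set_image(image_rgb)                           # HxWx3 uint8 RGB image
    lowres = cv2.resize(coarse_mask.astype(np.float32), (256, 256),
                        interpolation=cv2.INTER_LINEAR)
    mask_input = (lowres * 20.0 - 10.0)[None]                # (1, 256, 256) pseudo-logits
    masks, scores, _ = predictor.predict(mask_input=mask_input,
                                         multimask_output=False)
    return masks[0]                                          # refined HxW boolean mask
```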

[Table: zero-shot instance segmentation on COCO and LVIS v1]

Zero-shot instance segmentation results on the COCO and LVIS v1 datasets. For COCO, the authors use a FocalNet-DINO model trained on COCO as the detector, while for LVIS they use ViTDet-H trained on LVIS as the box prompt generator. SAM uses ViT-L as the backbone and takes box prompts. HQ-SAM improves mask quality in the boundary regions while maintaining the zero-shot segmentation capability of the original SAM.
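A sketch of this box-prompted zero-shot instance segmentation pipeline using the public `segment_anything` predictor interface (the detector itself is out of scope here; `det_boxes` stands in for FocalNet-DINO or ViTDet-H outputs, and the checkpoint path in the comment is an assumption):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def boxes_to_instance_masks(predictor: SamPredictor, image_rgb: np.ndarray,
                            det_boxes: np.ndarray):
    """Turn detector boxes (N, 4) in XYXY format into per-instance segmentation masks."""
    predictor.set_image(image_rgb)
    instance_masks = []
    for box in det_boxes:
        masks, scores, _ = predictor.predict(box=box[None, :],   # one box prompt per object
                                             multimask_output=False)
        instance_masks.append(masks[0])                          # HxW boolean mask
    return instance_masks

# Example setup (checkpoint path is an assumption):
# sam = sam_model_registry["vit_l"](checkpoint="sam_vit_l_0b3195.pth")
# predictor = SamPredictor(sam)
```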

[Figure: visual comparison of SAM and HQ-SAM under zero-shot transfer]

The figure above compares the visual results of SAM and HQ-SAM in the zero-shot transfer setting, given the same red box or point prompt. HQ-SAM produces significantly more detail-preserving results and also fixes the erroneous holes in SAM's masks.

[Figure: interactive segmentation with varying numbers of input points on COIFT and DIS]

The figure above compares interactive segmentation with different numbers of input points on COIFT (zero-shot) and the DIS validation set. HQ-SAM consistently outperforms SAM across all numbers of points, and the relative improvement is more pronounced when the prompt ambiguity is small, i.e., when more points are provided.
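A sketch of running such a point-prompt sweep with the public predictor interface; sampling clicks uniformly from the GT mask and scoring with plain mask IoU are simplifications for illustration, not the paper's exact evaluation protocol:

```python
import numpy as np
from segment_anything import SamPredictor

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / max(float(union), 1.0)

def point_prompt_sweep(predictor: SamPredictor, image_rgb: np.ndarray,
                       gt_mask: np.ndarray, point_counts=(1, 3, 5, 10)) -> dict:
    """Measure mask quality as the number of foreground point prompts grows."""
    predictor.set_image(image_rgb)
    ys, xs = np.nonzero(gt_mask)
    results = {}
    for k in point_counts:
        idx = np.random.choice(xs.size, size=min(k, xs.size), replace=False)
        coords = np.stack([xs[idx], ys[idx]], axis=1).astype(np.float32)  # (k, 2), XY order
        labels = np.ones(idx.size, dtype=np.int32)                        # 1 = foreground point
        masks, _, _ = predictor.predict(point_coords=coords, point_labels=labels,
                                        multimask_output=False)
        results[k] = mask_iou(masks[0], gt_mask.astype(bool))
    return results
```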

[Table: zero-shot video instance segmentation on HQ-YTVIS]

The table above presents results for zero-shot video instance segmentation on the HQ-YTVIS benchmark. The authors use a Swin-L-based Mask2Former model pre-trained on the YTVIS dataset to generate the box prompts and reuse its object association predictions. With this setup, they evaluate and compare zero-shot video instance segmentation methods.

Conclusion

This paper proposes HQ-SAM, the first model to achieve high-quality zero-shot segmentation while introducing negligible overhead to the original SAM, and explores how to leverage and extend SAM-like foundation segmentation models in a data-efficient and computationally economical way. The authors introduce a lightweight HQ-Output Token in HQ-SAM to replace the original SAM output token for high-quality mask prediction. Trained with only 44K highly accurate masks, HQ-SAM significantly improves the mask prediction quality of SAM, which itself was trained on 1.1 billion masks. The authors conduct zero-shot transfer evaluations on seven segmentation benchmarks spanning image and video tasks, covering a wide variety of objects and scenes.


Origin blog.csdn.net/CV_Autobot/article/details/131336177