segment anything in high quality

Committed to solving the problem of poor detail segmentation, HQ-SAM can be understood as a fine-grained extension of SAM that preserves the original segmentation ability, somewhat like the small-object optimization algorithms in object detection. In general, HQ-Features, an HQ output token, and an MLP are added on top of the original SAM to make HQ mask predictions; the final mask is obtained by a point-wise product between the HQ-Features and the MLP weights predicted by the HQ output token. It is in effect a lightly fine-tuned version of SAM.
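In symbols, a paraphrase of the description above (not the paper's own notation): the high-quality mask logits are roughly

$M_{hq}(x, y) = \mathrm{MLP}(T_{hq}) \cdot F_{hq}(x, y)$

where $T_{hq}$ is the updated HQ output token, $F_{hq}$ are the fused HQ-Features, and the dot product is taken per spatial location $(x, y)$.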

1.introduction

There are two problems with SAM: 1. coarse mask boundaries that often ignore the segmentation of thin object structures; 2. broken, incorrect mask predictions in challenging scenarios, mainly caused by SAM misinterpreting thin structures. This paper sets out to solve these two problems while staying based on SAM, without degrading or affecting SAM's original segmentation ability. Two modules are added on top of SAM to additionally handle fine-structure segmentation; because the original SAM structure is retained, it is necessarily slower than SAM, but if combined with MobileSAM's ViT-T backbone the trade-off may balance out, and the parameters increase by less than 0.5%.

In the figure above, the original SAM's understanding of the fine structure is poor, and in the last image SAM misinterprets the line.

2.method

Data: Since SAM-HQ actually trains only three small modules on top of SAM, the HQ output token, a three-layer MLP, and a small feature-fusion module, a set of fine-grained data is needed. The original SAM was trained on SA-1B (11 million images and 1.1 billion masks) using 256 GPUs and a batch size of 256. SAM-HQ is trained on the HQSeg-44K dataset, which contains 44,000 high-precision masks including 1,000 semantic categories.

2.1 preliminaries: sam

Image encoder: ViT-based, producing a 64x64 image embedding. Prompt encoder: encodes the positional information of point, box, and mask prompts. Mask decoder: a two-layer transformer decoder that uses the image embedding and the prompt tokens together to make the final mask prediction. The output token is used for mask prediction, similar to the learnable object queries in DETR: it predicts the weights of an MLP, which are then applied as a point-wise product with the mask features.
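A minimal sketch of this dynamic prediction step in PyTorch, assuming illustrative names and shapes (the class below is not SAM's actual code; token_dim=256 and feat_dim=32 are taken as illustrative dimensions):

```python
import torch
import torch.nn as nn

class DynamicMaskHead(nn.Module):
    """Sketch: an output token predicts MLP weights that are applied as a
    point-wise product with the spatial mask features, as described above."""
    def __init__(self, token_dim=256, feat_dim=32):
        super().__init__()
        # three-layer MLP mapping the token to the mask-feature dimension
        self.mlp = nn.Sequential(
            nn.Linear(token_dim, token_dim), nn.ReLU(),
            nn.Linear(token_dim, token_dim), nn.ReLU(),
            nn.Linear(token_dim, feat_dim),
        )

    def forward(self, output_token, mask_feats):
        # output_token: (B, token_dim); mask_feats: (B, feat_dim, H, W)
        w = self.mlp(output_token)  # (B, feat_dim)
        # per-location dot product between predicted weights and features
        return torch.einsum('bc,bchw->bhw', w, mask_feats)  # (B, H, W) logits
```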

2.2 hq-sam

An HQ output token and a new mask prediction layer (a three-layer MLP) are introduced for high-quality mask prediction, instead of directly using SAM's coarse mask as input.

Reusing and freezing SAM's mask decoder, a learnable HQ output token (1x256) is concatenated with SAM's output tokens (4x256) and prompt tokens (Nx256) and passed into SAM's mask decoder, similar to the original output tokens. In each attention layer, the HQ output token first performs self-attention with the other tokens, then token-to-image and image-to-token cross-attention to update its features; it also shares the point-wise MLP with the other tokens in each decoder layer. After passing through the two decoder layers, the updated HQ output token has access to the global image context, the key geometric information in the prompt tokens, and the mask information hidden in the other output tokens. Finally, a new three-layer MLP generates dynamic convolution kernel weights from the updated HQ output token, which are applied as a point-wise product with the fused HQ-Features to produce the high-quality mask.
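A minimal sketch of this token path, assuming a frozen decoder that takes the token sequence and image embedding and returns the updated tokens (the interface and names are assumptions, not the official hq-sam code):

```python
import torch

def predict_hq_mask(sam_decoder, hq_token, output_tokens, prompt_tokens,
                    image_embed, hq_mlp, hq_features):
    # hq_token: (B, 1, 256); output_tokens: (B, 4, 256); prompt_tokens: (B, N, 256)
    tokens = torch.cat([hq_token, output_tokens, prompt_tokens], dim=1)
    # The frozen SAM mask decoder updates every token via self-attention and
    # token-to-image / image-to-token cross-attention (interface assumed here).
    updated_tokens = sam_decoder(tokens, image_embed)
    hq_updated = updated_tokens[:, 0]      # the HQ token's slot
    w = hq_mlp(hq_updated)                 # dynamic kernel weights, (B, C)
    # point-wise product with the fused 256x256 HQ-Features, (B, C, 256, 256)
    return torch.einsum('bc,bchw->bhw', w, hq_features)
```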

Unlike directly fine-tuning SAM or adding complex post-processing, HQ-SAM trains only the HQ output token and its three-layer MLP, which corrects the erroneous masks in SAM's output tokens.

Accurate segmentation also requires input image features with both rich global semantic context and local boundary details. Instead of directly using SAM's mask decoder features, HQ-SAM extracts and fuses features from different stages of the SAM model into new high-quality features, HQ-Features:

1. Early local features from SAM's ViT, 64x64 in size, which capture more general image edge details. Specifically, the features after the first global attention block of the ViT are extracted; for ViT-Large SAM, which has 24 blocks in total, this is the output of the sixth block.

2. The global features of the final layer of SAM's ViT encoder, 64x64.

3. The mask features in SAM's mask decoder, 256x256, which contain mask shape information.

To obtain the fused HQ-Features, the early-layer and final-encoder features are first upsampled to 256x256 by transposed convolution and then summed, as in the sketch below.
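A minimal sketch of this fusion under the shapes stated above (channel counts, layer names, and the exact upsampling stack are assumptions; the text only specifies transposed-conv upsampling to 256x256 followed by summation):

```python
import torch
import torch.nn as nn

class HQFeatureFusion(nn.Module):
    """Sketch: fuse early ViT, final ViT, and decoder mask features into
    256x256 HQ-Features via transposed-conv upsampling and summation."""
    def __init__(self, vit_dim=1024, feat_dim=32):
        super().__init__()
        # 4x upsampling: 64x64 -> 256x256 with two stride-2 transposed convs
        self.up_early = nn.Sequential(
            nn.ConvTranspose2d(vit_dim, feat_dim, 2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(feat_dim, feat_dim, 2, stride=2),
        )
        self.up_final = nn.Sequential(
            nn.ConvTranspose2d(vit_dim, feat_dim, 2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(feat_dim, feat_dim, 2, stride=2),
        )

    def forward(self, early_vit, final_vit, mask_feats):
        # early_vit, final_vit: (B, vit_dim, 64, 64)
        # mask_feats: (B, feat_dim, 256, 256)
        return self.up_early(early_vit) + self.up_final(final_vit) + mask_feats
```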

3.training

The SAM parameters are frozen; only the HQ output token, its associated three-layer MLP, and three simple convolutions for HQ-Features fusion are trained.
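A minimal sketch of this setup (module names follow the sketches above; the learning rate is an arbitrary placeholder, not the paper's value):

```python
import torch
import torch.nn as nn

def build_hq_optimizer(sam: nn.Module, hq_token: nn.Parameter,
                       hq_mlp: nn.Module, hq_fusion: nn.Module,
                       lr: float = 1e-3) -> torch.optim.Optimizer:
    # Freeze every original SAM parameter
    for p in sam.parameters():
        p.requires_grad = False
    # Optimize only the new HQ modules
    trainable = [hq_token, *hq_mlp.parameters(), *hq_fusion.parameters()]
    return torch.optim.AdamW(trainable, lr=lr)
```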

The same inference process as SAM is used, but the mask predicted from the HQ output token serves as the high-quality mask: the mask logits of SAM and of HQ-SAM, which has been corrected, are added together at 256x256 resolution and then upsampled to 1024x1024.
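A minimal sketch of that final step (function name assumed; SAM's actual post-processing may differ):

```python
import torch
import torch.nn.functional as F

def fuse_and_upsample(sam_logits: torch.Tensor, hq_logits: torch.Tensor):
    """Sketch: sum the two 256x256 mask logits, upsample to 1024x1024, binarize."""
    combined = sam_logits + hq_logits  # (B, 1, 256, 256)
    full = F.interpolate(combined, size=(1024, 1024),
                         mode='bilinear', align_corners=False)
    return full > 0  # boolean high-quality mask
```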
