ZeroSeg

ZeroSeg requires no manual labels, yet achieves zero-shot semantic segmentation by distilling knowledge from pre-trained vision-language models.

TL;DR: Today I will introduce a new method called ZeroSeg, which trains open-vocabulary, zero-shot semantic segmentation models. Its key characteristic is that it does not need manually labeled data for training, so it can scale to large unlabeled datasets.

Overview: The ZeroSeg model takes a pre-trained vision-language model (e.g., CLIP) as a "teacher model" and distills its learned visual concepts into a set of segment tokens, each of which summarizes a local region of the target image. This process requires no human labels, so the knowledge of models such as CLIP can be transferred directly to semantic segmentation tasks.

Model structure and optimization: ZeroSeg proposes an effective pixel grouping and classification scheme for pixel-level supervision. By automatically grouping pixels into more meaningful, irregularly shaped segments, it becomes easier to extract and refine semantic information from the CLIP visual encoder. In addition, to improve training efficiency, the model also incorporates a masked autoencoder (MAE).

Model performance: Trained only on the ImageNet-1k dataset, ZeroSeg performs comparably to models trained with human labels. It achieves mIoU of 40.8, 20.6, and 20.4 on PASCAL VOC 2012, PASCAL Context, and COCO respectively, results comparable to GroupViT and MaskCLIP, which are pretrained on 26M and 20M image-text pairs. In addition, ZeroSeg also performs well on semantic segmentation tasks with larger vocabularies (1,000 categories).

Overall, this study provides an effective solution to open-vocabulary semantic segmentation: it requires no human labeling and achieves zero-shot semantic segmentation, while also being efficient to train and strong in performance.

Method: Training the ZeroSeg model

Overall Architecture: ZeroSeg is a network structure for semantic segmentation. It performs the task mainly by distilling knowledge acquired from pre-trained vision-language models such as CLIP. Its main components are a ViT encoder and two heads: a decoder head and a segmentation head.
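
To make this structure concrete, here is a minimal PyTorch-style sketch of the three components. The class name `ZeroSegSketch`, the layer sizes, and the use of learnable queries in the segmentation head are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ZeroSegSketch(nn.Module):
    """Minimal sketch: a ViT-style encoder over patch embeddings, a decoder
    head for MAE-style reconstruction, and a segmentation head that produces
    segment tokens. Sizes and layer choices are illustrative assumptions."""

    def __init__(self, embed_dim: int = 768, num_segment_tokens: int = 16,
                 patch_pixels: int = 16 * 16 * 3, clip_dim: int = 512):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=12)   # stands in for the ViT
        self.decoder_head = nn.Linear(embed_dim, patch_pixels)           # reconstructs masked patches
        # learnable queries that are refined into segment tokens
        self.segment_queries = nn.Parameter(torch.randn(num_segment_tokens, embed_dim))
        dec_layer = nn.TransformerDecoderLayer(embed_dim, nhead=8, batch_first=True)
        self.seg_head = nn.TransformerDecoder(dec_layer, num_layers=2)   # segmentation head
        self.proj = nn.Linear(embed_dim, clip_dim)                       # align tokens with CLIP space

    def forward(self, visible_patch_embeds: torch.Tensor):
        # visible_patch_embeds: (B, N_visible, embed_dim) embeddings of unmasked patches
        latent = self.encoder(visible_patch_embeds)
        recon = self.decoder_head(latent)                                 # pixel predictions per patch
        queries = self.segment_queries.unsqueeze(0).expand(latent.size(0), -1, -1)
        segment_tokens = self.seg_head(queries, latent)                   # (B, K, embed_dim)
        return recon, self.proj(segment_tokens)                           # segment tokens in CLIP space
```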

ViT Encoder: Given an image, the encoder divides it into non-overlapping patches. A fraction of the resulting visual tokens is selected as encoder input, and the encoder generates the corresponding latent representations.
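
A minimal sketch of this MAE-style token selection is shown below; the helper name `random_mask_patches` and the `keep_ratio` value are assumptions for illustration.

```python
import torch

def random_mask_patches(patch_embeds: torch.Tensor, keep_ratio: float = 0.25):
    """Keep a random fraction of patch tokens as encoder input (MAE-style).

    patch_embeds: (B, N, D) embeddings of all non-overlapping patches.
    Returns the visible tokens, their indices, and a binary mask over patches.
    """
    B, N, D = patch_embeds.shape
    num_keep = max(1, int(N * keep_ratio))
    scores = torch.rand(B, N, device=patch_embeds.device)            # random ranking per image
    keep_idx = scores.argsort(dim=1)[:, :num_keep]                    # indices of visible patches
    visible = torch.gather(patch_embeds, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=patch_embeds.device)
    mask.scatter_(1, keep_idx, 0.0)                                   # 0 = visible, 1 = masked
    return visible, keep_idx, mask
```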

Decoder Head: The decoder uses the latent representations to reconstruct the masked pixels of the image, as in MAE, and is trained by minimizing the mean squared error (MSE) between the reconstructed image and the original image.
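
A minimal sketch of this reconstruction objective, assuming patches are compared in flattened pixel space and only masked patches contribute to the loss:

```python
import torch

def mae_reconstruction_loss(pred_patches: torch.Tensor,
                            target_patches: torch.Tensor,
                            mask: torch.Tensor) -> torch.Tensor:
    """MSE on masked patches only, in the spirit of MAE.

    pred_patches / target_patches: (B, N, P) flattened pixel values per patch.
    mask: (B, N) with 1 for masked (to-be-reconstructed) patches, 0 for visible ones.
    """
    loss = (pred_patches - target_patches) ** 2           # (B, N, P)
    loss = loss.mean(dim=-1)                               # per-patch MSE, (B, N)
    return (loss * mask).sum() / mask.sum().clamp(min=1)   # average over masked patches only
```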

Segmentation Head: The output of the segmentation head is transformed into segment tokens, which are used to learn semantic segmentation by distillation. Specifically, ZeroSeg extracts multi-scale image features from a pre-trained CLIP visual encoder and distills them into these segment tokens. Two distillation losses are mainly used here: a multi-scale feature distillation loss and a segmentation matching loss.

Multi-scale feature distillation loss: An L1 distillation loss applied between global features and multi-scale visual features. The visual features are produced by dividing the input image into multi-scale views (such as 2x2 and 3x3 grids) and passing these views through the pre-trained CLIP visual encoder.
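
Below is a hedged sketch of how such a loss could be computed: the image is split into 2x2 and 3x3 views, each view is encoded by a frozen CLIP visual encoder (`clip_visual` is assumed to return one feature vector per view), and an L1 loss is applied against the student's pooled segment tokens. The exact pooling and weighting are assumptions, not the paper's formulas.

```python
import torch
import torch.nn.functional as F

def clip_multiscale_features(image: torch.Tensor, clip_visual, grids=(2, 3)) -> list:
    """Encode multi-scale views (2x2 and 3x3 grids) with a frozen CLIP visual encoder.

    image: (B, 3, H, W). clip_visual is assumed to map an image batch to (B, D) features.
    """
    B, _, H, W = image.shape
    feats = []
    with torch.no_grad():                                  # teacher is frozen
        for g in grids:
            h, w = H // g, W // g
            for i in range(g):
                for j in range(g):
                    view = image[:, :, i * h:(i + 1) * h, j * w:(j + 1) * w]
                    view = F.interpolate(view, size=(224, 224), mode="bilinear",
                                         align_corners=False)   # resize to CLIP input size
                    feats.append(clip_visual(view))              # (B, D) per view
    return feats

def multiscale_distillation_loss(segment_tokens: torch.Tensor, view_feats: list) -> torch.Tensor:
    """L1 loss between the student's pooled segment tokens and each CLIP view feature."""
    student_global = segment_tokens.mean(dim=1)            # (B, D), simple mean pooling (assumption)
    loss = sum(F.l1_loss(student_global, f) for f in view_feats)
    return loss / len(view_feats)
```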

Segmentation Matching Loss: This performs distillation between local region features and segment tokens. For each segment token, the loss finds its nearest local region feature and minimizes the L1 distance between them, which increases the semantic consistency between segmented parts and visual concepts.
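
A minimal sketch of this matching step, assuming the segment tokens have already been projected into CLIP's feature space and that the local regions are the multi-scale views from above:

```python
import torch

def segment_matching_loss(segment_tokens: torch.Tensor,
                          region_feats: torch.Tensor) -> torch.Tensor:
    """For every segment token, match its nearest local-region CLIP feature
    and minimize the L1 distance to that feature.

    segment_tokens: (B, K, D)  student segment tokens projected to CLIP space
    region_feats:   (B, R, D)  CLIP features of local regions (frozen teacher)
    """
    # pairwise L1 distances between every segment token and every region feature
    dists = torch.cdist(segment_tokens, region_feats, p=1)   # (B, K, R)
    nearest = dists.min(dim=-1).values                        # (B, K) distance to closest region
    return nearest.mean()                                     # average over tokens and batch
```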

The experiments show that ZeroSeg only needs ViT weights pre-trained on ImageNet-1k, without any semantic labels, and can achieve excellent zero-shot segmentation performance with the help of existing vision-language models such as CLIP. The ablation results show that multi-scale feature extraction plays a pivotal role, essentially learning different views of an image. The visualization results also highlight the benefit of multi-scale matching, which avoids the "hole" problem caused by insufficient receptive-field coverage. Compared with other similar segmenters, ZeroSeg also shows strong semantic analysis of complex scenes.

Summary

This paper presents a model that performs efficient semantic segmentation by distilling knowledge solely from pre-trained models, without relying on human labels. Overall, the authors demonstrate with ZeroSeg that semantic segmentation models can be trained efficiently by transferring knowledge from pre-trained general-purpose vision-language models, and hope this sheds light on how recent foundation-model research can be leveraged to help pixel-level downstream tasks such as semantic segmentation, opening up a new avenue.

However, one drawback is easy to see: the model relies on large pre-trained vision-language models, which may carry biases from their training data. Therefore, mitigations such as careful screening of training data are critical to ensure responsible use of such models.


Reprinted from: blog.csdn.net/qq_29788741/article/details/132178521