New work from KAUST & Meta AI | ZeroSeg: open-vocabulary semantic segmentation without semantic labels or text supervision!

Introduction

Paper: "Exploring Open-Vocabulary Semantic Segmentation without Human Labels"

TL;DR: Today I will introduce a new method called ZeroSeg, which trains open-vocabulary, zero-shot semantic segmentation models. Its defining characteristic is that it needs no manually labeled data for training, so it can scale to large unlabeled datasets.

Overview: The ZeroSeg model takes a pre-trained vision-language model (e.g., CLIP) as a "teacher model" and distills its learned visual concepts into a set of segment tokens, each of which summarizes a local region of the target image. This process requires no human labels, so the knowledge of models such as CLIP can be transferred directly to semantic segmentation tasks.


Model structure and optimization: ZeroSeg proposes an effective pixel grouping and classification method for pixel-level supervision. By automatically grouping pixels into more meaningful, irregularly shaped segments, it becomes easier to extract and refine semantic information from the CLIP visual encoder. In addition, to improve training efficiency, the model incorporates a masked autoencoder (MAE).

Model performance: Trained on only the ImageNet-1k dataset, ZeroSeg performs comparably to models trained with human labels. It achieves mIoU of 40.8, 20.6, and 20.4 on PASCAL VOC 2012, PASCAL Context, and COCO, respectively, results comparable to models such as GroupViT and MaskCLIP, which were pretrained on 26M and 20M image-text pairs. ZeroSeg also performs well on semantic segmentation tasks with a larger vocabulary (1,000 categories).

Overall, this study offers an effective solution to open-vocabulary semantic segmentation: it requires no human labeling, achieves zero-shot segmentation, and combines high training efficiency with strong performance.

Method

Training the ZeroSeg model

Overall architecture: ZeroSeg is a network for semantic segmentation that works mainly by distilling knowledge from a pre-trained vision-language model such as CLIP. Its main components are a ViT encoder and two heads: a decoder head and a segmentation head.
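To make the structure concrete, here is a minimal PyTorch sketch of that training graph. It is illustrative only, not the authors' code; the module names (`decoder_head`, `segment_head`) and their signatures are assumptions.

```python
import torch.nn as nn

class ZeroSegSketch(nn.Module):
    """Illustrative skeleton of ZeroSeg's training graph: a shared ViT
    encoder feeding an MAE-style decoder head and a segmentation head."""

    def __init__(self, encoder: nn.Module, decoder_head: nn.Module,
                 segment_head: nn.Module):
        super().__init__()
        self.encoder = encoder            # ViT over visible patch tokens
        self.decoder_head = decoder_head  # reconstructs masked pixels (MAE)
        self.segment_head = segment_head  # produces segment tokens

    def forward(self, visible_tokens, ids_restore):
        latent = self.encoder(visible_tokens)
        recon = self.decoder_head(latent, ids_restore)  # drives the MSE loss
        seg_tokens = self.segment_head(latent)          # drives distillation
        return recon, seg_tokens
```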

ViT encoder: Given an image, the encoder divides it into non-overlapping patches, keeps only a random fraction of the resulting visual tokens as input, and produces the corresponding latent representations.
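Below is a minimal sketch of this MAE-style random masking step, assuming the patches have already been embedded into tokens; the function name and `mask_ratio` default are illustrative.

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style random masking: keep a random subset of patch tokens.

    tokens: (B, N, D) patch embeddings from the ViT patch-embed layer.
    Returns the kept tokens plus the permutation needed to restore order.
    """
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=tokens.device)  # uniform noise per token
    ids_shuffle = noise.argsort(dim=1)              # random permutation
    ids_restore = ids_shuffle.argsort(dim=1)        # inverse permutation

    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return kept, ids_restore
```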

Decoder head: The decoder uses the latent representations to reconstruct the pixels of the masked patches, as in MAE, and is trained by minimizing the mean squared error (MSE) between the reconstructed image and the original image.
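A minimal sketch of this reconstruction objective, assuming predictions and targets are flattened per-patch pixel values; following MAE, the loss here is computed only on the masked patches.

```python
import torch

def mae_reconstruction_loss(pred: torch.Tensor,
                            target: torch.Tensor,
                            mask: torch.Tensor) -> torch.Tensor:
    """MSE between reconstructed and original pixel patches.

    pred, target: (B, N, P) flattened pixel values per patch.
    mask: (B, N) float, 1.0 where a patch was masked, 0.0 where visible.
    """
    loss = (pred - target) ** 2
    loss = loss.mean(dim=-1)                 # per-patch MSE
    return (loss * mask).sum() / mask.sum()  # average over masked patches only
```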

Segmentation head: The output of the segmentation head is converted into segment tokens, which learn semantic segmentation through distillation. Specifically, ZeroSeg extracts multi-scale image features from the pre-trained CLIP visual encoder and distills them into these segment tokens. Two distillation losses are used: a multi-scale feature distillation loss and a segment matching loss.

Multi-scale feature distillation loss: An L1 distillation loss applied between the global feature and multi-scale visual features. The input image is divided into multi-scale views (e.g., 2x2 and 3x3 grids), and these views are passed through the pre-trained CLIP visual encoder to produce visual features.
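A rough sketch of how such a loss could be computed, assuming a frozen CLIP visual encoder exposed as a callable `clip_encode` and the segment tokens pooled into a single global feature; the grid sizes follow the 2x2 and 3x3 views mentioned above.

```python
import torch
import torch.nn.functional as F

def multi_scale_views(image: torch.Tensor, grid: int):
    """Split a (B, C, H, W) image into grid x grid non-overlapping views."""
    B, C, H, W = image.shape
    h, w = H // grid, W // grid
    return [image[:, :, i*h:(i+1)*h, j*w:(j+1)*w]
            for i in range(grid) for j in range(grid)]

def multi_scale_distill_loss(image, segment_tokens, clip_encode,
                             grids=(2, 3)):
    """L1 distillation between the pooled segment tokens (global feature)
    and the frozen CLIP features of each multi-scale view."""
    global_feat = segment_tokens.mean(dim=1)  # (B, D) pooled segment tokens
    loss, n = 0.0, 0
    for g in grids:
        for view in multi_scale_views(image, g):
            # Resize each crop back to CLIP's input resolution.
            view = F.interpolate(view, size=224, mode='bilinear',
                                 align_corners=False)
            with torch.no_grad():
                clip_feat = clip_encode(view)  # (B, D), teacher is frozen
            loss = loss + F.l1_loss(global_feat, clip_feat)
            n += 1
    return loss / n
```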

Segment matching loss: A loss that distills local-region features into the segment tokens. For each segment token, it searches for the nearest local region and minimizes the L1 distance between them, increasing the semantic consistency between segments and visual concepts.
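And a comparable sketch of the matching loss, assuming the region features come from the frozen CLIP views above (so they carry no gradient); broadcasting computes all pairwise L1 distances at once.

```python
import torch

def segment_matching_loss(segment_tokens: torch.Tensor,
                          region_feats: torch.Tensor) -> torch.Tensor:
    """Match each segment token to its nearest local-region feature by L1
    distance and minimize that distance.

    segment_tokens: (B, S, D) outputs of the segmentation head.
    region_feats:   (B, R, D) frozen CLIP features of the local views.
    """
    region_feats = region_feats.detach()  # teacher features, no gradient
    # Pairwise mean-L1 distances via broadcasting: (B, S, R).
    # Note: materializes a (B, S, R, D) tensor; fine for small S and R.
    dist = (segment_tokens.unsqueeze(2) - region_feats.unsqueeze(1)) \
        .abs().mean(dim=-1)
    nearest = dist.min(dim=2).values      # (B, S) distance to closest region
    return nearest.mean()
```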

Experiments

As the results show, ZeroSeg relies only on ViT weights pre-trained on ImageNet-1k, without any semantic labels; with the help of existing vision-language models such as CLIP, it achieves excellent zero-shot segmentation performance.

The ablation results show that multi-scale feature extraction plays a pivotal role; in essence, it amounts to learning from different views of the same image.

The visualization results also highlight the benefit of multi-scale matching, which avoids the "holes" caused by insufficient receptive-field coverage.

Compared with other existing segmenters of this kind, its ability to semantically parse complex scenes is also impressive!

Conclusion

This paper presents a model that performs effective semantic segmentation without human labels, purely by distilling knowledge from a pre-trained model. Overall, the authors use ZeroSeg to demonstrate that a semantic segmentation model can be trained effectively by transferring knowledge from a pre-trained general-purpose vision-language model, and they hope this opens a new path for leveraging recent foundation-model advances to help pixel-level downstream tasks such as semantic segmentation.

However, one drawback is easy to see: the model relies on large pre-trained vision-language models, which may carry biases from their training data. Mitigations such as careful screening of training data are therefore critical to ensure compliant use of such models.

Final remarks

If you are interested in computer-vision research, feel free to scan the QR code below or search the WeChat ID cv_huber to add the editor as a friend (please note: school/company - research direction - nickname) and exchange ideas with more like-minded people!


Origin: juejin.im/post/7266310505920217100