CVPR 2022 | Semantic segmentation without any pixel labels! NVIDIA proposes GroupViT: semantic segmentation from text supervision


Reprinted from: Heart of the Machine

The result is really amazing.

Visual scenes are composed of semantically meaningful groups of pixels. Before deep learning, pixel grouping and recognition were already studied intensively with classical visual understanding methods. The idea of bottom-up grouping is to first organize pixels into candidate groups and then process each group with a recognition module; this idea has been successfully applied to superpixel image segmentation, as well as to region construction for object detection and semantic segmentation. In addition to bottom-up reasoning, top-down feedback signals during recognition can further improve visual grouping.

With the advent of the deep learning era, explicit grouping and recognition are no longer clearly separated in end-to-end training systems; instead, they are tightly coupled. For example, semantic segmentation is typically performed with fully convolutional networks, where grouping emerges only implicitly at the output layer through per-pixel class labels, with no explicit grouping of pixels. Although this approach is powerful and achieves state-of-the-art performance, it suffers from two major limitations: (1) per-pixel manual annotation is expensive; (2) the learned model is limited to a few labeled classes and cannot generalize to unseen classes.

Recent advances in learning visual representations from textual supervision have achieved great success in transferring to downstream tasks. The learned models not only transfer to ImageNet classification in a zero-shot manner with state-of-the-art performance, but also recognize object classes beyond those in ImageNet.

Inspired by this research direction, researchers from the University of California, San Diego and NVIDIA asked: can we learn a semantic segmentation model purely from text supervision, without any pixel labels, that generalizes to different object categories or vocabularies in a zero-shot manner?


GroupViT: Semantic Segmentation Emerges from Text Supervision

Homepage: https://jerryxu.net/GroupViT/

Paper: https://arxiv.org/abs/2202.11094

To achieve this, they propose to incorporate a grouping mechanism into deep networks. As long as it is learned with text supervision, the grouping mechanism can automatically generate semantic segments. An overview of the method is shown in Figure 1 below. By training on large-scale paired image-text data with a contrastive loss, the model can be transferred zero-shot to the semantic segmentation vocabulary of unseen images without any further annotation or fine-tuning.

The key idea of this research is to add a new visual grouping module to the Vision Transformer (ViT). The researchers call the new model GroupViT (Grouping Vision Transformer).
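To make the idea concrete, here is a minimal sketch (in PyTorch) of how learnable group tokens could be concatenated with the image patch tokens and processed jointly by standard Transformer layers. This is an illustration, not the authors' implementation: the class name, depth, head count, and number of group tokens are assumptions, and only the embedding dimension (384) follows the ablation section later in the article.

```python
import torch
import torch.nn as nn

class GroupViTStageSketch(nn.Module):
    """Illustrative only: learnable group tokens processed jointly with patch tokens."""
    def __init__(self, dim=384, num_group_tokens=64, depth=6, num_heads=6):
        super().__init__()
        # One learnable token per group, analogous to a [CLS] token but repeated.
        self.group_tokens = nn.Parameter(torch.zeros(1, num_group_tokens, dim))
        nn.init.trunc_normal_(self.group_tokens, std=0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens):                 # patch_tokens: (B, N, dim)
        b = patch_tokens.size(0)
        g = self.group_tokens.expand(b, -1, -1)      # broadcast group tokens to the batch
        x = torch.cat([g, patch_tokens], dim=1)      # group and image tokens attend jointly
        x = self.blocks(x)
        num_g = self.group_tokens.size(1)
        return x[:, :num_g], x[:, num_g:]            # updated group tokens, image (segment) tokens
```

Both outputs would then be passed to a grouping block, sketched in the architecture section below.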


Figure 1: GroupViT and the text encoder are first jointly trained on paired image-text data. With GroupViT, meaningful semantic groupings emerge automatically without any mask annotations. The trained GroupViT model is then transferred to the zero-shot semantic segmentation task.

GroupViT's semantic segmentation results are shown in the following two animations.

[Two animated demos of GroupViT's zero-shot semantic segmentation]

The paper's first author is Jiarui Xu, a second-year PhD student in the Department of Computer Science and Engineering at UCSD; this work was carried out during his internship at NVIDIA.

The main contributions of this study are as follows:

  • Moving beyond regularly shaped image grids in deep networks: the novel GroupViT architecture hierarchically groups visual concepts into irregularly shaped groups in a bottom-up manner;

  • Without any pixel-level labels, trained only with image-level text supervision via a contrastive loss, GroupViT successfully learns to group image regions together and transfers to multiple semantic segmentation vocabularies in a zero-shot manner;

  • This is the first work to explore zero-shot transfer from text supervision alone to several semantic segmentation tasks without using any pixel-level labels, and it establishes a strong baseline for this new task.

GroupViT Architecture

GroupViT contains a hierarchy of Transformer layers organized into grouping stages, each operating on progressively larger image segments. The example images show the segments processed at different grouping stages: in the early stages, the model groups pixels into object parts, such as an elephant's trunk and legs; at higher stages, these are further merged into whole objects, such as the entire elephant and the background forest.

Each grouping stage ends with a grouping block that computes the similarity between the learned group tokens and the segment (image) tokens. Segment tokens with high similarity to the same group token are assigned to that group and merged together, forming a new segment token that enters the next grouping stage.
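A minimal sketch of what such a grouping block could look like, assuming a simple cross-attention formulation; the module name and projection layout are illustrative rather than the authors' implementation. Each segment token is assigned to its most similar group token (here with a hard, differentiable Gumbel-Softmax assignment), and the segments of each group are averaged into a new, coarser segment token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupingBlockSketch(nn.Module):
    """Illustrative grouping block: merge segment tokens into their nearest group token."""
    def __init__(self, dim=384):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # projects group tokens (queries)
        self.k = nn.Linear(dim, dim)  # projects segment tokens (keys)
        self.v = nn.Linear(dim, dim)  # projects segment tokens (values)

    def forward(self, group_tokens, segment_tokens, hard=True):
        # Similarity between every group token and every segment token: (B, G, N).
        attn = self.q(group_tokens) @ self.k(segment_tokens).transpose(-2, -1)
        attn = attn / group_tokens.size(-1) ** 0.5
        if hard:
            # Hard assignment: each segment token goes to exactly one group; hard=True
            # keeps the forward pass one-hot while gradients flow through the soft
            # probabilities (straight-through trick).
            assign = F.gumbel_softmax(attn, hard=True, dim=1)
        else:
            assign = attn.softmax(dim=1)              # soft assignment over groups
        # Average the segment tokens assigned to each group into a new segment token.
        assign = assign / (assign.sum(dim=-1, keepdim=True) + 1e-6)
        return assign @ self.v(segment_tokens)        # (B, G, dim): next stage's segment tokens
```

The hard vs. soft choice here is exactly what the ablation study below examines.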


Figure 2: (a) Architecture and training process of GroupViT. (b) Architecture of the grouping block.

Learning from Image-Text Pairs

To train GroupViT to perform hierarchical grouping, the researchers use carefully designed contrastive losses between image-text pairs.

Figure 3 below illustrates the multi-label image-text contrastive loss. Given an input image-text pair, new texts are generated from the original caption by extracting its nouns and wrapping them in sentence prompt templates. For contrastive learning, only matched image-text pairs are treated as positives. GroupViT and a text encoder are trained to maximize the feature similarity between positive image-text pairs and to minimize it between negative pairs.
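A hedged sketch of this training signal in PyTorch is shown below. The extract_nouns helper and the two prompt templates are placeholders for illustration (the paper's actual templates may differ); the loss itself is a standard symmetric InfoNCE between matched image and text embeddings in a batch.

```python
import torch
import torch.nn.functional as F

# Illustrative prompt templates; the templates used in the paper may differ.
TEMPLATES = ["a photo of a {}.", "there is a {} in the scene."]

def prompt_nouns(caption, extract_nouns):
    """Turn a caption into extra prompted sentences, one per extracted noun.
    `extract_nouns` is a hypothetical callable (e.g. a POS tagger) returning nouns."""
    return [t.format(noun) for noun in extract_nouns(caption) for t in TEMPLATES]

def image_text_contrastive(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image/text embeddings, both (B, D).
    Only the diagonal (matched) pairs count as positives."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In the multi-label variant, the prompted noun sentences give each image several positive texts rather than just its original caption.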

[Figure 3: Multi-label image-text contrastive loss]

Zero-Shot Transfer to Semantic Segmentation

Since GroupViT automatically groups images into semantically similar segments, its output can be transferred to semantic segmentation in a zero-shot manner without any further fine-tuning. The zero-shot transfer process is shown in Figure 4 below. Each output segment embedding of GroupViT corresponds to a region of the image, and each segment is assigned to the object class with the highest image-text similarity in the embedding space.
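Conceptually, this transfer step might look like the following sketch, where the pixel-to-segment assignment and the prompted class-name embeddings are assumed to be given; the function name and the optional background threshold are illustrative details, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def segments_to_labels(segment_emb, class_text_emb, pixel_to_segment, bg_thresh=None):
    """
    segment_emb:      (S, D) embeddings of an image's S output segments
    class_text_emb:   (C, D) embeddings of prompted class names, e.g. "a photo of a dog"
    pixel_to_segment: (H, W) long tensor mapping every pixel to its segment index
    Returns an (H, W) map of per-pixel class indices (-1 marks background if thresholded).
    """
    sim = F.normalize(segment_emb, dim=-1) @ F.normalize(class_text_emb, dim=-1).t()
    seg_class = sim.argmax(dim=-1)                     # best class per segment, (S,)
    if bg_thresh is not None:
        # Segments that are not similar enough to any class fall back to background.
        seg_class = seg_class.clone()
        seg_class[sim.max(dim=-1).values < bg_thresh] = -1
    return seg_class[pixel_to_segment]                 # broadcast segment labels to pixels
```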

[Figure 4: Zero-shot transfer to semantic segmentation]

Concepts Learned by Group Tokens

The researchers select a subset of group tokens and highlight their attention regions on the PASCAL VOC 2012 dataset. Different group tokens learn different semantic concepts, even before any class labels are assigned.

[Animation: attention regions of selected group tokens on PASCAL VOC 2012]

Experimental results 

Ablation experiments

To identify the contribution of each component of GroupViT, the researchers conducted ablation experiments. Unless otherwise stated, a 1-stage GroupViT was trained on the CC12M dataset by default. They report the mIoU of the predicted segmentation masks on the PASCAL VOC 2012 validation set.

Hard vs. soft assignment. In each grouping block, the researchers assign image segment tokens to group tokens using either hard or soft assignment (Section 3.1 of the paper). For soft assignment, they use the original A^l matrix, rather than the one-hot assignment matrix used for hard assignment, to compute Equation 5. The impact is shown in the first column of Table 1 below.
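As a small illustration (using the same assumed notation as the grouping-block sketch above, with attn holding the group-to-segment similarities), the two variants differ only in how the assignment matrix is produced:

```python
import torch.nn.functional as F

def assignment(attn, hard=True, tau=1.0):
    """attn: (B, G, N) group-to-segment similarities (illustrative notation)."""
    if hard:
        # One-hot assignment per segment; gradients pass through the soft
        # probabilities via the straight-through Gumbel-Softmax estimator.
        return F.gumbel_softmax(attn, tau=tau, hard=True, dim=1)
    # Soft assignment: every segment contributes to every group, weighted by similarity.
    return F.softmax(attn, dim=1)
```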

[Table 1: Ablation on hard vs. soft assignment and the multi-label contrastive loss]

Multi-label contrastive loss. The second column of Table 1 studies the effect of adding the multi-label contrastive loss. Adding it to the standard loss (Equation 8) improves the performance of hard and soft assignment by 13.1% and 2.6%, respectively. With the multi-label contrastive loss, the input text during training and inference follows a similar prompt format; the researchers speculate that this consistency helps GroupViT better classify the learned image segments into the labeled categories.

Group tokens. In Table 2 below, the researchers compare different numbers of group tokens and output tokens. They observe that increasing the number of group tokens consistently improves performance. Conceptually, each group token represents a different semantic concept, so more group tokens may help GroupViT learn to group more semantic concepts. Although the number of group tokens is far smaller than the number of categories in the real world, each group token is a feature vector in a 384-dimensional embedding space and can therefore represent more than one concept. They also experimented with different numbers of output tokens and found 8 to be optimal, similar to the finding in [64].

[Table 2: Ablation on the number of group tokens and output tokens]

Multi-stage grouping. In Table 3 below, the researchers compare the 1-stage and 2-stage GroupViT architectures.


Table 3: Ablation experiments for single-stage and multi-stage groupings.

Figure 5 below also compares the zero-shot semantic segmentation results of the 1-stage and 2-stage models visually.

[Figure 5: Qualitative comparison of 1-stage and 2-stage GroupViT]

2-stage GroupViT produces smoother and more accurate segmentation maps than 1-stage GroupViT.

Visualization

The researchers evaluate GroupViT on the PASCAL VOC, PASCAL Context, and COCO datasets. GroupViT transfers zero-shot to the semantic segmentation classes of each dataset without training on any semantic segmentation annotations and without fine-tuning the model.

Qualitative results on the PASCAL VOC 2012 dataset. Figure 6 below shows qualitative segmentation results from GroupViT for images with a single object (row 1), multiple objects of the same class (row 2), and multiple objects of different classes (row 3). The experiments show that GroupViT generates reasonable segmentations.


Figure 6: Qualitative results on PASCAL VOC 2012. The stage-1 and stage-2 groups are the grouping results before class labels are assigned.

Concepts learned by group tokens. Figure 7 below visualizes what the group tokens learn. The researchers select some group tokens and highlight their attention regions on the PASCAL VOC 2012 dataset.

They find that different group tokens learn different semantic concepts. In the first stage, group tokens usually focus on mid-level concepts such as "eyes" (row 1) and "limbs" (row 2). Interestingly, group token 36 focuses on "hands" when there are people in the image, and on "feet" when there are animals such as birds or dogs. Group tokens in the second stage are associated with higher-level concepts such as "grass", "body", and "face". Figure 7 also shows that concepts learned in the first stage can be aggregated into higher-level concepts in the second stage.


Figure 7: Concepts learned by group tokens. The researchers highlight the regions attended to by group tokens at different stages.

Comparison with existing methods

The researchers compare GroupViT's zero-shot semantic segmentation performance with other zero-shot baselines and with ViT-S-based fully supervised transfer methods. The results are detailed in Tables 4 and 5 below.


Table 4: Comparison with zero-shot baselines.


Table 5: Comparison with fully supervised transfer methods. Zero-shot means transferring to semantic segmentation without any fine-tuning. mIoU is reported on the PASCAL VOC 2012 and PASCAL Context datasets.
