[Paper Express] ICCV2021 - Real-Time, High-Accuracy Few-Shot Semantic Segmentation Based on Hypercorrelation Squeeze

[Original title]: Hypercorrelation Squeeze for Few-Shot Segmentation

[Author information]: Juhong Min, Dahyun Kang, Minsu Cho

[Paper link]: https://openaccess.thecvf.com/content/ICCV2021/papers/Min_Hypercorrelation_Squeeze_for_Few-Shot_Segmentation_ICCV_2021_paper.pdf

Blogger keywords: few-shot learning, semantic segmentation, 4D convolution, hypercorrelation

Recommended related papers:

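[Paper Express] ECCV2022 - Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation
- https://phoenixash.blog.csdn.net/article/details/128698210
[Paper Express] ACM2022 - Incremental Few-Shot Semantic Segmentation via Embedding Adaptive-Update and Hyper-class Representation
- https://phoenixash.blog.csdn.net/article/details/128676817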

Summary:

The goal of few-shot semantic segmentation is to learn to segment a target object from a query image using only a few annotated support images of the target class. This challenging task requires understanding visual cues at different levels and analyzing fine-grained correspondences between the query and support images. To address this problem, we propose Hypercorrelation Squeeze Networks (HSNet), which use multi-level feature correlation and efficient 4D convolutions. HSNet extracts diverse features from different levels of the intermediate convolutional layers and constructs a collection of 4D correlation tensors, i.e., hypercorrelations. Using efficient center-pivot 4D convolutions in a pyramidal architecture, the method gradually squeezes the high-level semantic and low-level geometric cues of the hypercorrelation into precise segmentation masks in a coarse-to-fine manner. Significant performance improvements on the standard few-shot segmentation benchmarks PASCAL-5i, COCO-20i and FSS-1000 validate the effectiveness of the proposed method.
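To make the idea of a "collection of 4D correlation tensors" concrete, here is a minimal NumPy sketch. It is only an illustration, not the authors' implementation: the function names `correlation_4d` and `hypercorrelation`, the use of cosine similarity, the clamping of negative scores to zero, and the toy feature shapes are all assumptions that the summary above does not spell out.

```python
import numpy as np


def correlation_4d(query_feat, support_feat, eps=1e-8):
    """Correlate every query position with every support position.

    query_feat:   array of shape (C, Hq, Wq)
    support_feat: array of shape (C, Hs, Ws)
    returns:      array of shape (Hq, Wq, Hs, Ws)

    Assumption: cosine similarity with negatives clamped to zero.
    """
    c, hq, wq = query_feat.shape
    _, hs, ws = support_feat.shape
    q = query_feat.reshape(c, hq * wq)
    s = support_feat.reshape(c, hs * ws)
    # L2-normalise each spatial position's feature vector.
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + eps)
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + eps)
    corr = q.T @ s                # (Hq*Wq, Hs*Ws) cosine similarities
    corr = np.maximum(corr, 0.0)  # drop negative (dissimilar) matches
    return corr.reshape(hq, wq, hs, ws)


def hypercorrelation(query_feats, support_feats):
    """One 4D tensor per feature level; levels may differ in resolution."""
    return [correlation_4d(q, s) for q, s in zip(query_feats, support_feats)]


# Toy stand-ins for features taken from three intermediate CNN layers.
rng = np.random.default_rng(0)
shapes = [(8, 16, 16), (16, 8, 8), (32, 4, 4)]
query_feats = [rng.standard_normal(shape) for shape in shapes]
support_feats = [rng.standard_normal(shape) for shape in shapes]

hyper = hypercorrelation(query_feats, support_feats)
for level, tensor in enumerate(hyper):
    print(level, tensor.shape)
```

Each level yields one tensor indexed by (query row, query column, support row, support column), which is why the later layers need 4D rather than 2D convolutions.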

Introduction:

The advent of deep convolutional neural networks [17, 20, 64] has driven tremendous progress in many computer vision tasks, including object tracking [28, 29, 45], visual correspondence [22, 44, 48] and semantic segmentation [7, 47, 62], to name a few. Despite the effectiveness of deep networks, their demand for large numbers of annotated examples from large-scale datasets [9, 11, 35] remains a fundamental limitation, because data annotation requires substantial human effort, especially for dense prediction tasks such as semantic segmentation. To address this challenge, various semi-supervised and weakly-supervised segmentation methods [6, 26, 39, 66, 72, 77, 88] have been proposed, and they effectively alleviate the data-hunger problem. However, given only a few annotated training examples, the poor generalization ability of deep networks is still the main problem that many few-shot segmentation methods [10, 12, 13, 19, 33, 36, 37, 46, 54, 61, 63, 69, 70, 74, 75, 80, 83, 86, 87, 89] struggle to solve.

In contrast, the human visual system easily generalizes to the appearance of novel objects under extremely limited supervision. The key to this intelligence is the ability to find reliable correspondences between different instances of the same class. Recent work on semantic correspondence shows that exploiting dense intermediate features [38, 42, 44] and processing correlation tensors with high-dimensional convolutions [30, 58, 71] are both very effective in establishing precise correspondences. However, although recent few-shot segmentation studies have begun to actively explore correlation learning, most of them [36, 37, 46, 65, 73, 75, 80] neither use feature representations at different levels, from the early to the late layers of a CNN, nor construct pairwise feature correlations to capture fine-grained correlation patterns. There have been some attempts [74, 86] to exploit dense correlations of multi-level features, but they are limited in that they simply use the dense correlations for graph attention and use only a small fraction of the intermediate convolutional layers.

In this work, we combine two of the most influential techniques in recent visual correspondence research, multi-level features and 4D convolutions, and design a new framework for few-shot semantic segmentation called Hypercorrelation Squeeze Networks (HSNet). As shown in Figure 1, our network exploits diverse geometric and semantic feature representations from many different intermediate CNN layers to build a collection of 4D correlation tensors, i.e., hypercorrelations, which represent a rich set of correspondences across multiple visual aspects. Following FPN [34], we adopt a pyramidal design that captures both high-level semantic and low-level geometric cues, using deeply stacked 4D conv layers for precise mask prediction in a coarse-to-fine manner. To reduce the computation caused by extensive use of high-dimensional convolutions, we design an efficient 4D kernel through reasonable weight sparsification; it is more effective and more lightweight than existing kernels while achieving real-time inference. Improvements on the standard few-shot segmentation benchmarks PASCAL-5i [61], COCO-20i [35] and FSS-1000 [33] validate the effectiveness of the proposed method.
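The weight sparsification mentioned above can be sketched as follows. This is a hedged illustration of my reading of the "center-pivot" idea, not the paper's code: I assume the sparse kernel keeps only the weights whose query-side offset or support-side offset sits at the kernel center. The helper names `center_pivot_mask` and `conv4d_naive`, the single-channel setting, and the zero padding are also assumptions made for the example.

```python
import numpy as np


def center_pivot_mask(k):
    """Boolean mask over a k*k*k*k kernel (assumed sparsity pattern).

    A weight is kept if its query-side offset (first two axes) or its
    support-side offset (last two axes) is the kernel center.
    """
    if k % 2 == 0:
        raise ValueError("kernel size must be odd")
    c = k // 2
    mask = np.zeros((k, k, k, k), dtype=bool)
    mask[c, c, :, :] = True  # query offset at center, support offsets free
    mask[:, :, c, c] = True  # support offset at center, query offsets free
    return mask


def conv4d_naive(corr, kernel):
    """Single-channel, zero-padded 'same' 4D cross-correlation.

    Only non-zero weights are visited, so a sparse kernel does
    proportionally less work than a dense one.
    """
    k = kernel.shape[0]
    padded = np.pad(corr, k // 2)  # zero-pad all four axes
    h1, h2, h3, h4 = corr.shape
    out = np.zeros(corr.shape, dtype=float)
    for a, b, c, d in np.argwhere(kernel != 0):
        out += kernel[a, b, c, d] * padded[a:a + h1, b:b + h2,
                                           c:c + h3, d:d + h4]
    return out


k = 3
mask = center_pivot_mask(k)
rng = np.random.default_rng(0)
dense_kernel = rng.standard_normal((k, k, k, k))
sparse_kernel = dense_kernel * mask

corr = rng.random((6, 6, 6, 6))  # toy 4D correlation tensor
out = conv4d_naive(corr, sparse_kernel)

print(int(mask.sum()), "of", mask.size, "weights kept")  # 17 of 81 weights kept
print(out.shape)
```

Under this assumed pattern a kernel of size k keeps 2k^2 - 1 of its k^4 weights, so the saving grows quickly with the kernel size. The kept weights split naturally into two 2D kernels, one over the support axes and one over the query axes, which is what makes an efficient implementation with ordinary 2D convolutions plausible.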
