CANet

Paper reading (CVPR 2019): CANet: Class-Agnostic Segmentation Networks with Iterative Refinement and Attentive Few-Shot Learning. https://arxiv.org/abs/1903.02351

Abstract:

Recent progress in semantic segmentation is driven by deep Convolutional Neural Networks and large-scale labeled image datasets. However, data labeling for pixel-wise segmentation is tedious and costly. Moreover, a trained model can only make predictions within a set of pre-defined classes. In this paper, we present CANet, a class-agnostic segmentation network that performs few-shot segmentation on new classes with only a few annotated images available. Our network consists of a two-branch dense comparison module which performs multi-level feature comparison between the support image and the query image, and an iterative optimization module which iteratively refines the predicted results. Furthermore, we introduce an attention mechanism to effectively fuse information from multiple support examples under the setting of k-shot learning. Experiments on PASCAL VOC 2012 show that our method achieves a mean Intersection-over-Union score of 55.4% for 1-shot segmentation and 57.1% for 5-shot segmentation, outperforming state-of-the-art methods by a large margin of 14.6% and 13.2%, respectively.

**Note: few-shot learning can be regarded as a form of transfer learning.**

Introduction

Deep Convolutional Neural Networks have made significant breakthroughs in many visual understanding tasks including image classification [13, 9, 30], object detection [27, 8, 26], and semantic segmentation [16, 2, 20]. One crucial reason is the availability of large-scale datasets such as ImageNet [4] that enable the training of deep models. However, data labeling is expensive, particularly for dense prediction tasks, e.g., semantic segmentation and instance segmentation. In addition, once a model is trained, it is very difficult to apply it to predict new classes. In contrast to machine learning algorithms, humans are able to segment a new concept from an image easily after seeing only a few examples. This gap between humans and machine learning algorithms motivates the study of few-shot learning, which aims to learn a model that generalizes well to new classes with scarce labeled training data.


In this paper, we undertake the task of few-shot semantic segmentation, which uses only a few annotated training images to perform segmentation on new classes. Previous work on this task follows a two-branch design that includes a support branch and a query branch. The support branch aims to extract information from the support set to guide segmentation in the query branch. We also adopt the two-branch design in our framework to solve the few-shot segmentation problem.

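To make the episodic support/query setup concrete, here is a minimal sketch of how one episode might be sampled. The `class_to_pairs` container and the sampling logic are illustrative assumptions, not the paper's actual data pipeline:

```python
import random

def sample_episode(class_to_pairs, k_shot=1):
    """Draw one few-shot segmentation episode.

    class_to_pairs: hypothetical dict mapping a class name to a list of
    (image, mask) pairs; the paper's real data pipeline may differ.
    """
    cls = random.choice(list(class_to_pairs))
    pairs = random.sample(class_to_pairs[cls], k_shot + 1)
    support, query = pairs[:k_shot], pairs[-1]  # support guides the query
    return support, query
```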

Our network includes a two-branch dense comparison module, in which a shared feature extractor extracts representations from the query set and the support set for comparison. The design of the dense comparison module takes inspiration from metric learning [37, 31] on image classification tasks, where a distance function evaluates the similarity between images. However, different from image classification, where each image has a label, image segmentation needs to make predictions on data with a structured representation. It is difficult to directly apply metric learning to dense prediction problems. One straightforward approach is to compare all pairs of pixels; however, an image contains millions of pixels, and comparing all pixel pairs carries an enormous computational cost. Instead, we aim to acquire a global representation from the support image for comparison. Global image features prove useful in segmentation tasks and can be obtained easily by global average pooling. Here, to focus only on the assigned category, we apply global average pooling over the foreground area to filter out irrelevant information. The resulting global feature is then compared with each location in the query branch, which can be seen as a dense form of the metric learning approach.

**Note: global average pooling is typically used to replace fully connected layers. Each feature map of the last convolutional layer is averaged over the whole image into a single value, and these values form the final feature vector that is fed into the softmax classifier.**
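The masked pooling and dense comparison described above can be sketched in a few lines of PyTorch. Tensor shapes and the concatenation-based comparison are our reading of the paper's description, not its released code:

```python
import torch
import torch.nn.functional as F

def masked_global_pool(support_feat, support_mask):
    """Average support features over the foreground region only.

    support_feat: (B, C, H, W) support feature map.
    support_mask: (B, 1, h, w) binary foreground mask.
    """
    # Resize the mask to the feature resolution.
    mask = F.interpolate(support_mask, size=support_feat.shape[-2:],
                         mode='bilinear', align_corners=True)
    # Sum features inside the mask and divide by the foreground area.
    area = mask.sum(dim=(2, 3), keepdim=True) + 1e-5
    return (support_feat * mask).sum(dim=(2, 3), keepdim=True) / area

def dense_comparison(query_feat, support_vec):
    """Tile the pooled support vector across all spatial locations of
    the query feature map and concatenate for dense comparison."""
    h, w = query_feat.shape[-2:]
    tiled = support_vec.expand(-1, -1, h, w)
    return torch.cat([query_feat, tiled], dim=1)  # (B, 2C, H, W)
```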

Under the few-shot setting, the network should be able to handle new classes that are never seen during training. Thus we aim to mine transferable representations from CNNs for comparison. As observed in the feature visualization literature [39, 38], features in lower layers relate to low-level cues, e.g., edges and colors, while features in higher layers relate to object-level concepts such as categories. We focus on middle-level features that may constitute object parts shared by unseen classes. For example, if the CNN learns a feature related to wheels when the model is trained on the class car, such a feature may also be useful for feature comparison on new vehicle classes, e.g., truck and bus. We extract multiple levels of representations in CNNs for dense comparison.

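A plausible way to collect such middle-level features from a ResNet-50 backbone is sketched below. Which residual blocks count as "middle-level" (here layer2 and layer3, fused by concatenation) is our interpretation, and backbone details such as dilated convolutions are omitted:

```python
import torch
import torch.nn.functional as F
import torchvision

class MidLevelExtractor(torch.nn.Module):
    """Extract and fuse middle-level feature maps from ResNet-50."""
    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet50(weights=None)
        self.stem = torch.nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.layer1, self.layer2, self.layer3 = r.layer1, r.layer2, r.layer3

    def forward(self, x):
        x = self.layer1(self.stem(x))
        f2 = self.layer2(x)   # block2: lower middle-level features
        f3 = self.layer3(f2)  # block3: upper middle-level features
        # Bring f3 to f2's resolution and fuse the two levels.
        f3 = F.interpolate(f3, size=f2.shape[-2:],
                           mode='bilinear', align_corners=True)
        return torch.cat([f2, f3], dim=1)  # (B, 512 + 1024, H/8, W/8)
```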

As appearance varies within the same category, objects from the same class may share only a few similar features, and dense feature comparison alone is not enough to guide segmentation of the whole object area. Nevertheless, it gives an important clue about where the object is. In the semi-automatic segmentation literature, weak annotations are given for class-agnostic segmentation, e.g., interactive segmentation with click or scribble annotations [36, 14] and instance segmentation with bounding box or extreme point priors [10, 21]. Transferable knowledge for locating the object region is learned in the training process. Inspired by semi-automatic segmentation tasks, we hope to gradually differentiate the object from the background given the dense comparison results as priors. We propose an iterative optimization module (IOM) that learns to iteratively refine the predicted results. The refinement is performed in a recurrent fashion: the dense comparison result and the predicted masks are sent to an IOM for optimization, and its output is sent to the next IOM recurrently. After a few iterations of refinement, our network is able to generate fine-grained segmentation maps. Inside each IOM, we adopt residual connections to efficiently incorporate the masks predicted in the previous iteration step. Fig. 1 shows an overview of our network for one-shot segmentation.

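Below is a minimal sketch of one IOM step with its residual connection, plus the recurrent refinement loop. Channel widths, the two-class output head, and the number of iterations are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IOM(nn.Module):
    """One iterative optimization step over the comparison features."""
    def __init__(self, channels=256):
        super().__init__()
        # Fuse comparison features with the 2-channel previous prediction.
        self.fuse = nn.Sequential(
            nn.Conv2d(channels + 2, channels, 3, padding=1),
            nn.ReLU(inplace=True))
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True))
        self.classify = nn.Conv2d(channels, 2, 1)  # FG/BG logits

    def forward(self, comparison_feat, prev_pred):
        # Residual connection: the module only adds a correction on top
        # of the comparison features, so an empty initial mask is fine.
        x = comparison_feat + self.fuse(
            torch.cat([comparison_feat, prev_pred], dim=1))
        x = x + self.refine(x)
        return self.classify(x)

def iterative_refine(iom, comparison_feat, steps=4):
    """Recurrently feed each prediction back into the IOM."""
    b, _, h, w = comparison_feat.shape
    pred = torch.zeros(b, 2, h, w)  # start from an empty prediction
    for _ in range(steps):
        pred = iom(comparison_feat, F.softmax(pred, dim=1))
    return pred
```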

Previous methods for k-shot segmentation are based on the 1-shot model. They use non-learnable fusion methods to fuse individual 1-shot results, e.g., averaging the 1-shot predictions or intermediate features. Instead, we adopt an attention mechanism to effectively fuse information from multiple support examples.

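One way such attention-based fusion could work is sketched here: a small scoring branch produces a scalar per support shot, the scores are softmax-normalized across the k shots, and the per-shot comparison features are averaged with those weights. The scorer architecture is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    """Softmax-weighted fusion of k per-shot feature maps."""
    def __init__(self, channels=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Conv2d(channels, 64, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(64, 1, 1))  # one scalar score per shot

    def forward(self, feats):
        # feats: (K, C, H, W), one comparison feature map per support shot.
        scores = self.scorer(feats).view(-1)          # (K,)
        weights = F.softmax(scores, dim=0)            # normalize over shots
        weights = weights.view(-1, 1, 1, 1)
        return (feats * weights).sum(dim=0, keepdim=True)  # (1, C, H, W)
```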

To further reduce the labeling effort for few-shot segmentation, we explore a new test setting: our model uses a support set annotated only with bounding boxes to perform segmentation in the query image. We conduct comprehensive experiments on the PASCAL VOC 2012 and COCO datasets to validate the effectiveness of our network. The main contributions of this paper are summarized as follows.


• We develop a novel two-branch dense comparison module which effectively exploits multiple levels of feature representations from CNNs to make dense feature comparison.
• We propose an iterative optimization module to refine the predicted results in an iterative manner. This ability of iterative refinement generalizes to unseen classes under few-shot learning, producing fine-grained segmentation maps.
• We adopt an attention mechanism to effectively fuse information from multiple support examples in the k-shot setting, which outperforms non-learnable methods that fuse 1-shot results.
• We demonstrate that, given a support set with weak annotations, i.e., bounding boxes (see the sketch after this list), our model can still achieve performance comparable to that obtained with an expensive pixel-level annotated support set, which further reduces the labeling effort for new classes in few-shot segmentation.
• Experiments on the PASCAL VOC 2012 dataset show that our method achieves a mean Intersection-over-Union score of 55.4% for 1-shot segmentation and 57.1% for 5-shot segmentation, which significantly outperform state-of-the-art results by 14.6% and 13.2%, respectively.
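As referenced in the bounding-box bullet above, a box annotation can be rasterized into a coarse mask and plugged in wherever a pixel-level support mask is expected (e.g., the masked pooling sketch earlier). A minimal version, assuming pixel-coordinate boxes:

```python
import torch

def bbox_to_mask(box, height, width):
    """Rasterize an (x1, y1, x2, y2) box, in pixels, into a binary mask
    that can stand in for a pixel-level support mask."""
    x1, y1, x2, y2 = box
    mask = torch.zeros(1, 1, height, width)
    mask[..., y1:y2, x1:x2] = 1.0
    return mask
```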


Reprinted from blog.csdn.net/qq_34543933/article/details/89017480