[Image Segmentation 2022 ECCV] CP2

Paper title: CP2: Copy-Paste Contrastive Pretraining for Semantic Segmentation

Chinese title: CP2: Copy-Paste Contrastive Pretraining for Semantic Segmentation (translated)

Paper link: https://arxiv.org/abs/2203.11709

Paper code: https://github.com/wangf3014/CP2

Affiliations: Tsinghua University, Johns Hopkins University, and Shanghai Jiao Tong University

Publication date:

DOI:

Citation: Wang F, Wang H, Wei C, et al. CP2: Copy-Paste Contrastive Pretraining for Semantic Segmentation[C]//Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXX. Cham: Springer Nature Switzerland, 2022: 499-515.

Citation count: 10

Abstract

Recent advances in self-supervised contrastive learning yield good image-level representations, which are beneficial for classification tasks, but often ignore pixel-level details, leading to unsatisfactory transfer to dense prediction tasks such as semantic segmentation.

In this work, we propose a pixel-level contrastive learning method called CP2 (Copy-Paste Contrastive Pretraining), which facilitates the learning of image and pixel-level representations, and thus is more suitable for downstream dense prediction tasks.

Specifically, we copy-paste randomly cropped parts of one image (foreground) onto different background images and pre-train a semantic segmentation model to: 1) distinguish foreground pixels from background pixels; 2) identify composed images that share the same foreground.

Experiments show that CP2 has strong performance on downstream semantic segmentation: by fine-tuning the CP2 pre-trained model on PASCAL VOC 2012, we obtain 78.6% mIoU with ResNet-50 and 79.5% with ViT-S.

1 Introduction

Learning invariant image-level representations for transfer to downstream tasks has become a common paradigm in self-supervised contrastive learning. Specifically, these methods either minimize the Euclidean ($\ell_2$) distance [26, 13] or a cross-entropy [5, 6] between positive pairs, or optimize the InfoNCE [38] loss [28, 12, 14, 10, 11, 41] to distinguish positive image features from a set of negative ones.

Despite their success in downstream classification tasks, these contrastive objectives are built on the assumption that all pixels in an image share a single label, and thus lack awareness of spatially varying image content. We argue that such classification-oriented objectives are not ideal for downstream dense prediction tasks such as semantic segmentation, where models are supposed to distinguish between different semantic labels within an image. For semantic segmentation, existing contrastive learning models are prone to overfitting to image-level representations while ignoring variance at the pixel level.

Furthermore, current pre-training paradigms are architecturally misaligned with downstream semantic segmentation:

1) Compared with classification-oriented pre-trained backbones, semantic segmentation models usually require larger atrous rates and smaller output strides [34, 8];

2) The fine-tuning of a well-pretrained backbone and a randomly initialized segmentation head may be out of sync; for example, the random head may produce noisy gradients that are detrimental to the pretrained backbone and hurt performance.

These two issues prevent classification-oriented pre-trained backbones from effectively supporting dense prediction tasks such as semantic segmentation.

In this paper, we propose a novel self-supervised pre-training method designed for downstream semantic segmentation, named Copy-Paste Contrastive Pretraining (CP2). Specifically, we pre-train a semantic segmentation model on composed input images, created by randomly cropping regions from foreground images and pasting them onto different background images; an example of a composed image is shown in Figure 2. In addition to an image-level contrastive loss [2, 44, 47, 28] for learning instance discrimination, we introduce a pixel-level contrastive loss to strengthen dense prediction. The segmentation model is trained by maximizing the cosine similarity between foreground pixels while minimizing the similarity between foreground and background pixels. Overall, CP2 learns pixel-specific dense representations and has two key advantages for downstream segmentation: 1) CP2 pre-trains both the backbone and the segmentation head, which resolves the architectural misalignment problem; 2) CP2 pre-trains the model with dense prediction as the objective, building the model's awareness of spatially varying information in images.


Figure 1: Quick tuning of MoCo v2 with CP2, evaluated by semantic segmentation on PASCAL VOC. A quick tuning of only 20 epochs with CP2 yields a large mIoU improvement.

Furthermore, we find that CP2 training can rapidly adapt pre-trained classification-oriented models to semantic segmentation within a considerably shorter time, resulting in better downstream performance. In particular, we first initialize the backbone with the weights of a pre-trained classification-oriented model (e.g., ResNet-50 [29] pre-trained with MoCo v2 [12]), attach a randomly initialized segmentation head, and then tune the entire segmentation model with CP2 for an additional 20 epochs. As a result, the downstream semantic segmentation performance of the whole segmentation model improves significantly, e.g., by +1.6% mIoU on the PASCAL VOC 2012 [23] dataset. We refer to this training protocol as quick tuning, because it efficiently transfers learning from image-level instance discrimination to pixel-level dense prediction.

For the technical details, we mostly follow MoCo v2 [12], including its architecture, data augmentation, and instance contrastive loss, in order to fully isolate the effect of our newly introduced copy-paste mechanism and dense contrastive loss; MoCo v2 [12] thus serves as a direct baseline for CP2. In the empirical evaluation on semantic segmentation, the 200-epoch CP2 model reaches 77.6% mIoU on PASCAL VOC 2012 [23], which is +2.7% higher than the 200-epoch MoCo v2 [12] model. Furthermore, as shown in Figure 1, the quick-tuning protocol of CP2 yields +1.5% and +1.4% mIoU improvements over the 200-epoch and 800-epoch MoCo v2 models, respectively. The improvement also generalizes to other segmentation datasets and to vision transformers.

2. Related work

Self-supervised learning and pretext tasks. Self-supervised learning for visual understanding exploits intrinsic properties of images as supervision for training, and the power of the learned visual representations relies heavily on the formulation of pretext tasks. Before the popularity of instance discrimination [2, 44, 47, 28], many pretext tasks had been explored, including image denoising and reconstruction [39, 51, 3], adversarial learning [19, 20, 22], and heuristic tasks such as image colorization [50], jigsaw puzzles [36, 43], context and rotation prediction [18, 33], and deep clustering [4].

The advent of contrastive learning, or more specifically instance discrimination [2, 44, 47, 28], has led to breakthroughs in unsupervised learning, as MoCo [28] achieves better transfer performance than supervised pre-training. Inspired by this success, many subsequent works have explored self-supervised contrastive learning more deeply, proposing different optimization objectives [26, 41, 42, 5, 6, 54], model structures [13], and training strategies [10, 12, 14].

Dense contrastive learning. To achieve better adaptation to dense prediction tasks, a recent work [37] extends the image-level contrastive loss to the pixel level. Although this extension helps the model learn more fine-grained features, it fails to build the model's perception of spatially varying information, so the model still has to be restructured for downstream fine-tuning. Some recent works try to enhance the model's understanding of positional information in images by encouraging consistency of pixel-level representations [45], or by using heuristic masks [24, 1] and applying patch-level contrastive losses [30].

Copy-paste contrastive learning. Copy-paste, i.e., copying a crop of one image and pasting it onto another, has been used as a data augmentation method in supervised instance segmentation and semantic segmentation [25] because of its simplicity and its notable effect of enriching the positional and semantic information of images. Similarly, supervised models have achieved considerable performance gains on various tasks by mixing images [49, 31] or image crops [48] as data augmentation. Copy-paste has also been used in recent work on self-supervised object detection [46, 30]. Inspired by this success, we leverage copy-paste as self-supervision in a dense contrastive learning approach.

3. Method

In this section, we present the CP2 objectives and loss functions for learning pixel-wise dense representations. We also describe the model architecture and propose a quick-tuning protocol to train CP2 efficiently.

3.1 Copy-Paste Contrastive Pretraining

We propose a new pre-training method, CP2, with which we expect the pre-trained model to learn both instance discrimination and dense prediction.

To this end, we artificially synthesize image compositions by pasting foreground crops onto the background. Specifically, as shown in Figure 2, we generate two random crops from the foreground image and then overlay them onto two different background images.

The goals of CP2 are: 1) distinguish the foreground and background in each synthetic image; 2) identify synthetic images with the same foreground from negative samples.

3.1.1 Copy-Paste

Given an original foreground image $I^{fore}$, we first generate two different views $\boldsymbol{I}_q^{fore}, \boldsymbol{I}_k^{fore} \in \mathbb{R}^{224 \times 224 \times 3}$ through data augmentation, one serving as the query and the other as the positive key. The augmentation strategy follows SimCLR [10] and MoCo v2 [12]: the image is first randomly resized and cropped to $224 \times 224$ resolution, followed by color jittering, grayscale conversion, Gaussian blur, and horizontal flipping.
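As a concrete illustration, the view-generation step can be sketched with torchvision transforms. This is a minimal sketch following the usual MoCo v2-style recipe; the specific jitter strengths, probabilities, and blur parameters below are commonly used defaults and are assumptions rather than values taken from this write-up.

```python
import random
from PIL import ImageFilter
from torchvision import transforms

class GaussianBlur:
    """Random Gaussian blur in the style of SimCLR/MoCo v2 pipelines."""
    def __init__(self, sigma=(0.1, 2.0)):
        self.sigma = sigma

    def __call__(self, img):
        radius = random.uniform(*self.sigma)
        return img.filter(ImageFilter.GaussianBlur(radius=radius))

# Assumed augmentation strengths; the text above only lists the transform types.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([GaussianBlur()], p=0.5),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Two views of the same foreground image: one serves as the query, one as the key.
# i_q_fore = augment(foreground_pil)   # tensor of shape (3, 224, 224)
# i_k_fore = augment(foreground_pil)
```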

Next, we apply the same augmentation to generate one view from each of two random background images, denoted $\boldsymbol{I}_q^{back}, \boldsymbol{I}_k^{back} \in \mathbb{R}^{224 \times 224 \times 3}$.

We then compose the image pairs via binary foreground-background masks $\boldsymbol{M}_q, \boldsymbol{M}_k \in \{0, 1\}^{224 \times 224}$, where each element $m = 1$ denotes a foreground pixel and $m = 0$ denotes a background pixel. Formally, the composed images are produced by
$$
\boldsymbol{I}_q = \boldsymbol{I}_q^{fore} \odot \boldsymbol{M}_q + \boldsymbol{I}_q^{back} \odot (1 - \boldsymbol{M}_q), \qquad
\boldsymbol{I}_k = \boldsymbol{I}_k^{fore} \odot \boldsymbol{M}_k + \boldsymbol{I}_k^{back} \odot (1 - \boldsymbol{M}_k),
$$
where $\odot$ denotes the element-wise product. We thus obtain two composed images $I_q$ and $I_k$, which share a foreground source image but have different backgrounds.
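A minimal sketch of the composition step in PyTorch is shown below. The rectangular mask generator and its size range are assumptions for illustration only; the text above does not specify the exact mask shape.

```python
import torch

def random_rect_mask(size=224, min_frac=0.3, max_frac=0.7):
    """Binary mask with a random rectangle of ones (foreground) and zeros elsewhere.
    The rectangle size range is an assumed choice for illustration."""
    h = int(size * torch.empty(1).uniform_(min_frac, max_frac).item())
    w = int(size * torch.empty(1).uniform_(min_frac, max_frac).item())
    top = torch.randint(0, size - h + 1, (1,)).item()
    left = torch.randint(0, size - w + 1, (1,)).item()
    mask = torch.zeros(size, size)
    mask[top:top + h, left:left + w] = 1.0
    return mask

def compose(fore, back, mask):
    """I = I_fore * M + I_back * (1 - M), broadcast over the channel dimension."""
    return fore * mask + back * (1.0 - mask)

# fore_q, fore_k: two augmented views of the same foreground image, each (3, 224, 224)
# back_q, back_k: augmented views of two different background images, each (3, 224, 224)
# m_q, m_k = random_rect_mask(), random_rect_mask()
# i_q = compose(fore_q, back_q, m_q)
# i_k = compose(fore_k, back_k, m_k)
```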


Figure 2: Spatial information of unlabeled images is enriched by randomly pasting two crops of a foreground image onto different backgrounds. A dense contrastive loss is applied to the encoded feature maps, and an instance contrastive loss is applied to the masked averages of the foreground feature vectors. We follow the momentum-update training architecture of MoCo and BYOL.

3.1.2 Contrastive Objectives

The synthesized images are then processed with a semantic segmentation model, which we describe in detail in Section 3.2.

Given the input $I_q$, the output of the segmentation model is a set of $r \times r$ features $\mathbb{F}_q = \{\boldsymbol{f}_q^i \in \mathbb{R}^C \mid i = 1, 2, \ldots, r^2\}$, where $C$ is the number of output channels and $r$ is the feature map resolution. For a $224 \times 224$ input image with output stride $s = 16$, we have $r = 14$.

Among all output features $\boldsymbol{f}_q \in \mathbb{F}_q$, we denote the foreground features, i.e., the features corresponding to foreground pixels, as $\boldsymbol{f}_q^+ \in \mathbb{F}_q^+ \subset \mathbb{F}_q$, where $\mathbb{F}_q^+$ is the subset of foreground features. Similarly, for the input image $I_k$ we obtain all features $\boldsymbol{f}_k \in \mathbb{F}_k$, with foreground features denoted $\boldsymbol{f}_k^+ \in \mathbb{F}_k^+ \subset \mathbb{F}_k$.

We use two loss terms: a dense contrastive loss and an instance contrastive loss. The dense contrastive loss $\mathcal{L}_{dense}$ learns local fine-grained features by distinguishing foreground from background features, which helps downstream semantic segmentation, while the instance contrastive loss maintains a global, instance-level representation.

In the dense contrastive loss, we want all foreground features $\forall \boldsymbol{f}_q^+ \in \mathbb{F}_q^+$ of image $I_q$ to be similar to all foreground features $\forall \boldsymbol{f}_k^+ \in \mathbb{F}_k^+$ of image $I_k$, and dissimilar to the background features $\mathbb{F}_k^- = \mathbb{F}_k \setminus \mathbb{F}_k^+$ of image $I_k$.

Formally, over all foreground features $\forall \boldsymbol{f}_q^+ \in \mathbb{F}_q^+$ and $\forall \boldsymbol{f}_k^+ \in \mathbb{F}_k^+$, the dense contrastive loss is given by
$$
\mathcal{L}_{dense} = -\frac{1}{|\mathbb{F}_q^+|\,|\mathbb{F}_k^+|} \sum_{\forall \boldsymbol{f}_q^+ \in \mathbb{F}_q^+,\; \forall \boldsymbol{f}_k^+ \in \mathbb{F}_k^+} \log \frac{\exp\!\left(\boldsymbol{f}_q^+ \cdot \boldsymbol{f}_k^+ / \tau_{dense}\right)}{\sum_{\forall \boldsymbol{f}_k \in \mathbb{F}_k} \exp\!\left(\boldsymbol{f}_q^+ \cdot \boldsymbol{f}_k / \tau_{dense}\right)},
$$
where $\tau_{dense}$ is a temperature coefficient. Figure 3 also illustrates this dense contrastive loss. Following supervised contrastive learning [32, 52], we keep the summation outside the log.


Figure 3: The dense contrastive loss maximizes the similarity of each foreground-foreground pair while minimizing the similarity of each foreground-background pair.
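The dense loss above can be sketched as follows. This is a minimal, unoptimized implementation assuming the feature maps have already been projected and $\ell_2$-normalized and the foreground masks have been downsampled to the $r \times r$ feature resolution; the temperature value is an assumption.

```python
import torch

def dense_contrastive_loss(feat_q, feat_k, mask_q, mask_k, tau=0.2):
    """feat_q, feat_k: (r*r, C) l2-normalized pixel features of I_q and I_k.
    mask_q, mask_k: (r*r,) boolean foreground masks at feature resolution.
    tau: temperature (0.2 is an assumed value, not taken from the text)."""
    fq_pos = feat_q[mask_q]                    # |F_q^+| x C  foreground features of I_q
    fk_pos = feat_k[mask_k]                    # |F_k^+| x C  foreground features of I_k

    # Numerator terms: similarities between foreground features of I_q and I_k.
    logits_pos = fq_pos @ fk_pos.t() / tau     # |F_q^+| x |F_k^+|
    # Denominator terms: similarities against *all* features of I_k (foreground + background).
    logits_all = fq_pos @ feat_k.t() / tau     # |F_q^+| x r*r
    log_denom = torch.logsumexp(logits_all, dim=1, keepdim=True)

    # Average of -log softmax over all foreground-foreground pairs (sum kept outside the log).
    return -(logits_pos - log_denom).mean()
```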

In addition to the dense contrastive loss, we retain an instance contrastive loss with the aim of learning a global, instance-level representation. We mostly follow the practice of MoCo [28, 12], where the model is required to distinguish the positive key from a bank of negative keys given a query image. In our case, however, we use the composed image $I_q$ as the query image and the composed image $I_k$, which shares its foreground with $I_q$, as the positive key image. Furthermore, instead of using the global average-pooled feature as the representation as in MoCo, we use only the normalized masked average of the foreground features, as shown in Figure 2. Formally, the instance contrastive loss is computed as
$$
\mathcal{L}_{ins} = -\log \frac{\exp(\boldsymbol{q}_+ \cdot \boldsymbol{k}_+ / \tau_{ins})}{\exp(\boldsymbol{q}_+ \cdot \boldsymbol{k}_+ / \tau_{ins}) + \sum_{n=1}^{N} \exp(\boldsymbol{q}_+ \cdot \boldsymbol{k}_n / \tau_{ins})},
$$
where $\boldsymbol{q}_+$ and $\boldsymbol{k}_+$ are the normalized masked averages of $\mathbb{F}_q^+$ and $\mathbb{F}_k^+$:
$$
\boldsymbol{q}_+ = \frac{\sum_{\forall \boldsymbol{f}_q^+ \in \mathbb{F}_q^+} \boldsymbol{f}_q^+}{\left\| \sum_{\forall \boldsymbol{f}_q^+ \in \mathbb{F}_q^+} \boldsymbol{f}_q^+ \right\|_2}, \qquad
\boldsymbol{k}_+ = \frac{\sum_{\forall \boldsymbol{f}_k^+ \in \mathbb{F}_k^+} \boldsymbol{f}_k^+}{\left\| \sum_{\forall \boldsymbol{f}_k^+ \in \mathbb{F}_k^+} \boldsymbol{f}_k^+ \right\|_2}.
$$

Here $\boldsymbol{k}_n$ denotes the representation of a negative sample drawn from a memory bank [28, 44] of $N$ vectors, and $\tau_{ins}$ is a temperature coefficient.
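A sketch of the masked-average pooling and the instance loss, assuming a MoCo-style queue of $N$ negative key vectors (queue maintenance itself is omitted, and the temperature is again an assumed value):

```python
import torch
import torch.nn.functional as F

def masked_average(feat, mask):
    """Normalized masked average of foreground features.
    feat: (r*r, C) pixel features; mask: (r*r,) boolean foreground mask."""
    pooled = feat[mask].sum(dim=0)             # sum over foreground positions
    return F.normalize(pooled, dim=0)          # l2-normalize to unit length

def instance_contrastive_loss(q_pos, k_pos, queue, tau=0.2):
    """q_pos, k_pos: (C,) masked-average representations of I_q and I_k.
    queue: (N, C) memory bank of negative key vectors."""
    l_pos = (q_pos * k_pos).sum().unsqueeze(0)        # (1,)   positive logit
    l_neg = queue @ q_pos                              # (N,)   negative logits
    logits = torch.cat([l_pos, l_neg]) / tau           # (1 + N,)
    # InfoNCE: cross-entropy with the positive key at index 0.
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits.unsqueeze(0), target)
```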

The total loss $\mathcal{L}$ is simply a linear combination of the dense and instance contrastive losses:
$$
\mathcal{L} = \mathcal{L}_{ins} + \alpha \mathcal{L}_{dense},
$$
where $\alpha$ is a trade-off coefficient balancing the two losses.

3.2 Model Architecture

Next, we discuss the CP2 model architecture in detail, which consists of a backbone and a segmentation head for both pre-training and fine-tuning. Unlike existing contrastive learning frameworks [28, 12] that only pre-train the backbone, CP2 pre-trains both the backbone and the segmentation head, using almost the same architecture as the downstream segmentation task. In this way, CP2 avoids the fine-tuning misalignment problem (Section 1), i.e., fine-tuning a downstream model composed of a pre-trained backbone and a randomly initialized head. This misalignment may require careful hyperparameter tuning (e.g., a larger learning rate for the head) and lead to poor transfer performance, especially when the randomly initialized head is heavy. As a result, CP2 achieves better segmentation performance and can also afford a stronger segmentation head.

In particular, we study two families of backbones, CNNs [29] and ViTs [21]. For the CNN backbone, we use the original ResNet-50 [29] with a 7×7 convolution as the first layer, instead of the Inception-style stem commonly used for segmentation, to ensure a fair comparison with previous self-supervised learning methods. To adapt the ResNet backbone to segmentation, we follow common segmentation settings [7, 8, 28] and use an atrous rate of 2 and a stride of 1 in the 3×3 convolutions of the final stage. For the ViT backbone, we choose ViT-S [21], which has a patch size of 16×16 and a number of parameters similar to ResNet-50. Note that both our ResNet-50 and ViT-S have an output stride of s = 16, which makes our backbones compatible with most existing segmentation heads.
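For the ResNet-50 case, an output stride of 16 with atrous rate 2 in the final stage can be obtained directly from torchvision, as in the sketch below; the authors' released code may construct the backbone differently.

```python
import torch
from torchvision.models import resnet50

# Replace the stride-2 downsampling of the last stage with dilation 2,
# giving an overall output stride of 16 instead of 32.
backbone = resnet50(replace_stride_with_dilation=[False, False, True])

# Drop the average-pooling and classification layers to expose the feature map.
features = torch.nn.Sequential(*list(backbone.children())[:-2])
x = torch.randn(1, 3, 224, 224)
print(features(x).shape)  # torch.Size([1, 2048, 14, 14]) -> r = 14 for a 224x224 input
```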

Given the output stride s = 16 of the backbone features, we investigate two types of segmentation heads. By default, we adopt the common DeepLab v3 [8] segmentation head (i.e., an ASPP head with image pooling), since it extracts multi-scale spatial features and produces very competitive results. In addition to the DeepLab v3 ASPP head, we also investigate the lightweight FCN head [34], which is commonly used to evaluate self-supervised learning methods.

We keep the architectures used for pre-training and fine-tuning as close as possible. Specifically, for CP2 pre-training, we append two 1×1 convolutional layers to the output of the segmentation head to project the dense pixel features into a 128-dimensional latent space (i.e., C = 128); the latent feature of each pixel is then individually $\ell_2$-normalized. This dense projection design is analogous to the 2-layer MLP followed by $\ell_2$ normalization in common contrastive learning frameworks [12]. After CP2 pre-training converges, we simply replace the 2-layer convolutional projection with the segmentation output convolution, which projects the segmentation head features onto the number of output classes, similar to the typical design of image-level contrastive frameworks [28, 10]. Following MoCo [28], the key encoder, consisting of the backbone and segmentation head, is updated as a momentum moving average of the query encoder's weights.
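The dense projector and the momentum update can be sketched as follows. `make_dense_projector` and its intermediate width of 256 are assumptions for illustration, and the momentum value 0.999 is MoCo's common default rather than a value stated here.

```python
import copy
import torch
import torch.nn as nn

def make_dense_projector(in_channels, hidden=256, out_channels=128):
    """Two 1x1 convolutions projecting segmentation-head features to a 128-d latent space.
    (Pixel-wise l2 normalization is applied to the projected features afterwards.)"""
    return nn.Sequential(
        nn.Conv2d(in_channels, hidden, kernel_size=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(hidden, out_channels, kernel_size=1),
    )

# Key encoder = frozen momentum copy of the query encoder (backbone + head + projector).
# seg_model_q = ...                            # query segmentation model (hypothetical)
# seg_model_k = copy.deepcopy(seg_model_q)
# for p in seg_model_k.parameters():
#     p.requires_grad = False

@torch.no_grad()
def momentum_update(model_q, model_k, m=0.999):
    """Update key-encoder weights as an exponential moving average of the query encoder."""
    for p_q, p_k in zip(model_q.parameters(), model_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```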

3.3 Quick Tuning

To train our CP2 model within a manageable computational budget, we propose a new training protocol, called quick tuning, which initializes the backbone from existing self-supervised checkpoints. These backbones are usually trained with image-level contrastive losses on very long schedules (e.g., 800 epochs [12] or 1000 epochs [10]). On top of these existing checkpoints, which already encode image-level semantic representations well, we apply CP2 training for only a few epochs (e.g., 20 epochs) to adapt the representations, still on ImageNet and without human labels, toward semantic segmentation. Specifically, we attach a randomly initialized segmentation head to the pre-trained backbone and train the entire segmentation model with our CP2 loss function. Finally, the segmentation model learned on ImageNet without any labels is further fine-tuned on various downstream segmentation datasets to evaluate the learned representations.
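A sketch of the quick-tuning setup is given below. The checkpoint filename, its key layout, and the optimizer settings are assumptions for illustration, not the authors' released configuration.

```python
import torch
from torchvision.models import resnet50

# 1) Initialize the backbone from an existing self-supervised checkpoint
#    (e.g., MoCo v2 trained for 800 epochs with an image-level contrastive loss).
backbone = resnet50(replace_stride_with_dilation=[False, False, True])
ckpt = torch.load("moco_v2_800ep_pretrain.pth.tar", map_location="cpu")  # assumed filename
state = {k.replace("module.encoder_q.", ""): v
         for k, v in ckpt["state_dict"].items()
         if k.startswith("module.encoder_q.") and "fc" not in k}  # assumed key layout
backbone.load_state_dict(state, strict=False)  # classifier weights stay unused

# 2) Attach a randomly initialized segmentation head (e.g., DeepLab v3 ASPP) and the dense
#    projector, then run CP2 pre-training on ImageNet for only ~20 epochs.
# seg_model = SegModel(backbone, aspp_head, dense_projector)   # hypothetical wrapper
# optimizer = torch.optim.SGD(seg_model.parameters(), lr=0.03,  # assumed settings
#                             momentum=0.9, weight_decay=1e-4)

# 3) Fine-tune the resulting segmentation model on a downstream dataset
#    (e.g., PASCAL VOC 2012), now with labels, to evaluate the representations.
```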

Quick tuning enables efficient and practical training for self-supervised contrastive learning, as it leverages the many publicly available self-supervised backbones and quickly adapts them to the desired target or downstream task. According to our empirical evaluation, quick tuning for 20 epochs is sufficient to produce significant improvements on various datasets (e.g., fine-tuned mIoU on PASCAL VOC 2012 improves by 1.6% after 20 epochs of quick tuning). This is especially helpful for pre-training segmentation models efficiently, since they are usually much more computationally expensive than the backbone alone due to the atrous convolutions in the backbone and the ASPP module. In this case, quick tuning delivers significant improvement with only short-term self-supervised pre-training of the segmentation model, saving a lot of computational resources.
