Chapter 21: PUZZLE-CAM: Improving localization by matching local and global features

Abstract

        Weakly supervised semantic segmentation (WSSS) was introduced to narrow the performance gap between semantic segmentation with pixel-level supervision and with image-level supervision. Most state-of-the-art methods are based on class activation maps (CAMs) to generate pseudo-labels for training segmentation networks. The main limitation of this approach is that the pseudo-labels generated from CAMs of image classifiers focus mainly on the most discriminative parts of objects. To address this problem, we propose Puzzle-CAM, which discovers the integrated regions of objects by minimizing the difference between the features of separate patches and those of the whole image. Our method consists of a puzzle module and two regularization terms. Puzzle-CAM can activate the entire region of an object using image-level supervision without requiring additional parameters. In experiments, Puzzle-CAM outperforms previous state-of-the-art methods on the PASCAL VOC 2012 dataset using the same level of supervision. Code related to our experiments can be found at https://github.com/OFRIN/PuzzleCAM.

Index Terms—Image segmentation, deep learning, neural networks, weakly supervised semantic segmentation

1 Introduction

        Semantic segmentation is a fundamental task addressed with convolutional neural networks (CNNs), which aims to predict pixel-level class labels for images. Recently, fully supervised semantic segmentation (FSSS) has achieved significant progress [1, 2, 3]. However, producing large-scale training datasets with accurate pixel-level annotations for each image is expensive, time-consuming, and labor-intensive. To address this problem, many researchers focus on weakly supervised semantic segmentation (WSSS), which trains the network with image-level annotations, scribbles, bounding boxes, or points. Image-level labels are the cheapest of these to obtain. In this study, we focus only on learning semantic segmentation models using image-level supervision.

        Most previous WSSS methods [4, 5, 6] rely on class activation maps (CAMs) [7] to achieve good performance. However, the generated CAMs usually focus only on small, discriminative parts of semantic objects, which prevents segmentation models from learning complete pixel-level semantic knowledge. Furthermore, the CAMs generated from isolated patches of a tiled image differ from those obtained from the original image: as shown in Figure 1, the CAMs of images composed of tiled patches are significantly inconsistent with those of the original image. These differences further widen the supervision gap between FSSS and WSSS.

        The above observations inspire us to use attention-based feature learning to solve the WSSS problem. We therefore propose Puzzle-CAM, which is trained to detect the integrated regions of objects. Our method applies a reconstruction regularization between the CAMs generated from tiled images and those from the original images to provide self-supervision. To further improve the consistency of network predictions, we introduce a puzzle module that splits the image into tiles and merges the CAMs generated from them.

        Puzzle-CAM consists of a Siamese neural network with a reconstruction regularization loss that reduces the difference between the original and merged CAM. Our experiments produced both quantitative and qualitative results, demonstrating the superiority of our approach. Our main contributions are as follows:

  • We propose Puzzle-CAM, which combines reconstruction regularization with a puzzle module to effectively improve the quality of CAMs without adding network layers.
  • On the PASCAL VOC 2012 dataset, Puzzle-CAM outperforms existing state-of-the-art methods with the same level of supervision.

Figure 1: CAMs generated from tiled and original images: (a) traditional CAMs from original images, (b) generated CAMs from tiled images, and (c) CAMs predicted by the proposed Puzzle-CAM.

Figure 2: The overall architecture of the proposed Puzzle-CAM, showing the integration of reconstruction regularization and puzzle modules.

2. Related work

2.1. Attention mechanisms of CNNs

        Attention maps provide fine-grained information about the features learned by a CNN. Simonyan et al. [8] use an error-backpropagation strategy to visualize semantic regions, while the CAM approach [7] uses a global average pooling (GAP) layer in the CNN to generate attention maps more efficiently, producing the attention map from the final classifier. To the best of our knowledge, the choice of attention mechanism has little impact on achieving high WSSS performance, so we build Puzzle-CAM on the CAM approach [7], which is easier to handle than other attention mechanisms.

2.2. Weakly supervised semantic segmentation

        Unlike FSSS, which requires pixel-level labeling of images, WSSS uses weaker labels such as bounding boxes [9], scribbles [10], and image-level classification labels [4, 6]. Recently, WSSS performance has improved significantly through the introduction of CAMs. Most previous WSSS methods refine the CAMs generated by image classifiers to approximate segmentation masks [4, 11, 12, 13, 6]. AffinityNet [4] trains an additional network to learn pixel-level similarity, generates a transition matrix from it, and multiplies the matrix with the CAM to adjust its activation coverage. IRNet [11] generates the transition matrix from boundary activation maps and extends the approach to weakly supervised instance segmentation as well as WSSS. SEAM [5] improves CAMs by using a pixel correlation module that captures contextual appearance information for each pixel, and by refining the original CAMs with learned affinity attention maps.

3. Methodology

3.1. Motivation

        The CAM of a single image highlights the most discriminative region of each category. When CAMs of the same category are generated from image patches, the model can rely only on the partial object visible in each patch to find category-relevant features. As a result, the merged CAM of the image patches highlights the object region more completely than the CAM of the single image. To exploit this observation, we propose Puzzle-CAM, which trains a classifier with a reconstruction loss that minimizes the difference between the CAM of the single image and the CAM merged from its patches. Trained with this reconstruction loss, the classification network produces CAMs that cover the object region more completely. Puzzle-CAM therefore contains a loss function designed to match the CAMs generated from the tiled image to the CAM of the original image (Figure 2).

3.2. Employing the CAM method

        First, we introduce the CAM method used to generate the initial attention maps. Given a feature extractor F and a classifier θ, we can generate A, the set of CAMs for all categories. After training the classifier with image-level supervision, we apply the classifier weights θ_c for class c to the feature map f = F(I) obtained from the input image I to obtain the CAM for class c as follows:

        A_c = θ_c^T f

        The resulting CAM is normalized by the maximum value of A_c. Finally, by concatenating the per-class maps A_c, we obtain the CAMs A for all categories.
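To make the CAM computation concrete, here is a minimal NumPy sketch of the procedure described above (function and variable names are ours; the actual implementation operates on batched framework tensors):

```python
import numpy as np

def compute_cams(feature_map, classifier_weights):
    """Generate class activation maps A from a feature map f = F(I).

    feature_map: (C_feat, H, W) backbone features.
    classifier_weights: (num_classes, C_feat) classifier weights theta.
    Returns CAMs of shape (num_classes, H, W), each map normalized by
    its maximum activation (ReLU before normalization is an assumption).
    """
    c_feat, h, w = feature_map.shape
    flat = feature_map.reshape(c_feat, h * w)        # (C_feat, H*W)
    cams = classifier_weights @ flat                 # A_c = theta_c^T f
    cams = np.maximum(cams, 0.0).reshape(-1, h, w)   # keep positive evidence
    # Normalize each class map by its maximum value.
    maxes = cams.reshape(cams.shape[0], -1).max(axis=1, keepdims=True)
    cams /= np.maximum(maxes, 1e-5)[:, :, None]
    return cams
```

Concatenating the per-class maps along the first axis then yields the full CAM tensor A.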

3.3. The puzzle module

        Matching partial features to the complete features is the key to narrowing the gap between FSSS and WSSS. The puzzle module consists of a tiling module and a merging module. Starting from an input image I of size W×H, the tiling module generates four non-overlapping tiles {I^{1,1}, I^{1,2}, I^{2,1}, I^{2,2}}, each of size W/2×H/2. Next, we generate the CAMs A^{i,j} for each tile I^{i,j}. Finally, the merging module stitches all A^{i,j} into a single set of CAMs A^re with the same shape as the CAMs of I.
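The tiling and merging steps can be sketched as follows (a minimal NumPy illustration with channel-first arrays; function names are ours, and even H and W are assumed):

```python
import numpy as np

def tile_image(image):
    """Split an image (C, H, W) into four non-overlapping H/2 x W/2 patches,
    ordered top-left, top-right, bottom-left, bottom-right."""
    _, h, w = image.shape
    return [image[:, i * h // 2:(i + 1) * h // 2, j * w // 2:(j + 1) * w // 2]
            for i in range(2) for j in range(2)]

def merge_cams(cam_tiles):
    """Stitch four per-tile CAMs back into one map with the original layout."""
    top = np.concatenate([cam_tiles[0], cam_tiles[1]], axis=2)     # widths
    bottom = np.concatenate([cam_tiles[2], cam_tiles[3]], axis=2)
    return np.concatenate([top, bottom], axis=1)                   # heights
```

Merging the (identity-mapped) tiles of an image recovers the image itself, which is why A^re has the same shape as the CAMs of I.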

3.4. Loss design of Puzzle-CAM

        We use a GAP layer at the end of the network to obtain the prediction vector Ŷ = σ(G(A)) for image classification, and use the multi-label soft margin loss for the classification tasks; Ŷ^re = σ(G(A^re)) is defined analogously for the merged CAMs. The classification losses L_cls and L_p-cls are computed from Ŷ and Ŷ^re, respectively, and the reconstruction loss is L_re = ||A − A^re||_1. The total loss is defined as:

        L = L_cls + L_p-cls + α · L_re

        where α is a weight-balancing coefficient between the losses. The classification losses L_cls and L_p-cls are used to roughly estimate the region of the object, while the reconstruction loss L_re is used to bridge the gap between the pixel-level and image-level supervision processes. We report the details of the network training setup in the experimental section and explore the effectiveness of the proposed module there.
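A minimal sketch of the combined objective described above, assuming a multi-label soft margin (binary cross-entropy on sigmoid outputs) classification term and a mean L1 reconstruction term (function names and the exact reductions are our assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_soft_margin(logits, targets):
    """Multi-label soft margin loss, averaged over classes."""
    p = sigmoid(logits)
    eps = 1e-7  # numerical safety for the logs
    return -np.mean(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))

def puzzle_cam_loss(logits, logits_re, cams, cams_re, labels, alpha):
    """L = L_cls + L_p-cls + alpha * L_re (sketch of the combined objective)."""
    l_cls = multilabel_soft_margin(logits, labels)       # whole-image branch
    l_p_cls = multilabel_soft_margin(logits_re, labels)  # merged-tile branch
    l_re = np.abs(cams - cams_re).mean()                 # L1 between CAM sets
    return l_cls + l_p_cls + alpha * l_re
```

When the two CAM sets already agree, L_re vanishes and the objective reduces to the two classification losses.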

Table 1: Ablation study of Puzzle-CAM loss function using ResNet-50 as backbone network.

4. Experimental results

4.1. Implementation details

        We use the PASCAL VOC 2012 dataset [14] to evaluate our method. The dataset is divided into 1,464 images for training, 1,449 for validation, and 1,456 for testing. Following the experimental protocol of previous methods [4, 5, 6], we took additional annotations from the Semantic Boundary Dataset [15] to build an augmented training set of 10,582 images. Images are randomly rescaled in the range [320, 640] and then cropped to 512×512 as network input. For all experiments, we set the maximum value of α to 4 and increase α linearly to this maximum during the first half of training. During inference, we use the classifier without the puzzle module, employing multi-scale inference and horizontal flipping to generate the pseudo segmentation labels. We used four TITAN-RTX GPUs for training.
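The linear warm-up of α described above can be sketched as follows (the helper name, step-based granularity, and warm-up fraction parameter are our assumptions):

```python
def alpha_schedule(step, total_steps, alpha_max=4.0, warmup_fraction=0.5):
    """Linearly ramp alpha from 0 to alpha_max over the first
    warmup_fraction of training, then hold it at alpha_max."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    return alpha_max * min(1.0, step / warmup_steps)
```

For example, with 100 total steps, α reaches 2.0 at step 25 and stays at 4.0 from step 50 onward.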

4.2. Ablation studies

        We conducted an ablation study on the main components of Puzzle-CAM using the mean intersection-over-union (mIoU) metric (Table 1); the baseline reaches mIoU = 47.82%. With the proposed reconstruction regularization over tiles (L_re), the baseline improves to mIoU = 49.21%, while the classification loss over tiles (L_p-cls) alone performs similarly to the baseline. Combining L_re and L_p-cls improves the baseline by 3.71%. We also visualize the CAMs produced under each combination of the loss functions. When only the tiled classification loss (L_p-cls) is used, the results show little difference from the original. When only the reconstruction loss (L_re) is used, the CAMs localize some categories better than the original but fail to predict several categories. When both losses are combined, the results show improved localization without sacrificing classification performance.

Figure 3: Visualizing predicted labels and CAMs using a loss function combination. In (d), the final CAMs not only suppress overactivation, but also extend the CAMs to the full range of object activation.

Table 2: mIoU quality of pseudo-semantic segmentation labels evaluated on the PASCAL VOC 2012 training set [14]. RW, random walk method using AffinityNet [4]; dCRF, dense conditional random field [16].

4.3. Comparison with existing state-of-the-art methods

        To further improve the accuracy of the pixel-level pseudo-annotations, we trained AffinityNet with Puzzle-CAM following the method in [4]. We adopt the ResNeSt architecture, which generally improves learned feature representations and boosts performance in image classification, object detection, instance segmentation, and semantic segmentation. In Table 2, we report the performance of the baseline AffinityNet [4] and of the original CAM used with Puzzle-CAM.

        The final synthesized pseudo-labels achieve an mIoU of 74.67% on the PASCAL VOC 2012 training set. We then train the DeepLabv3+ [1] segmentation model with a ResNeSt-269 [18] backbone in a fully supervised manner on these pseudo-labels to obtain the final segmentation results. Table 3 compares the mIoU of the proposed method with previous methods. Compared with the baseline methods, Puzzle-CAM achieves significantly better performance on both the validation and test sets under the same training settings. Figure 4 shows qualitative results on the validation set, demonstrating that the proposed method works well on both large and small objects.

Table 3: Comparison of Puzzle-CAM with existing state-of-the-art methods on PASCAL VOC 2012 validation and test sets. I represents the image-level label; S represents the external saliency model.

Figure 4: Qualitative segmentation results on the PASCAL VOC 2012 validation set. Top: original images. Middle: ground-truth annotations. Bottom: predictions from a segmentation model trained using pseudo-labels generated by Puzzle-CAM.

5 Conclusion

        In this paper, we proposed the Puzzle-CAM algorithm, which utilizes image-level labels to close the supervision gap between fully supervised semantic segmentation (FSSS) and weakly supervised semantic segmentation (WSSS). To encourage the network to generate consistent CAMs, we designed a puzzle module and employed reconstruction regularization to match local and global features. Puzzle-CAM not only generates consistent features from locally tiled patches but also better fits the shape of the ground-truth annotation mask. A segmentation network trained with our synthesized pixel-level pseudo-labels achieves state-of-the-art performance on the PASCAL VOC 2012 dataset, demonstrating the effectiveness of our approach. We believe that the concept of Puzzle-CAM as a training module can be generalized and will benefit other weakly supervised and semi-supervised tasks such as semantic segmentation and instance segmentation.
