Chapter 24: Reducing the Information Bottleneck for Weakly Supervised Semantic Segmentation

0. Summary

        Weakly supervised semantic segmentation produces pixel-level localization from image-level category labels. However, classifiers trained on these labels tend to focus only on small discriminative regions of the target object. We explain this phenomenon using the information bottleneck principle: the last layer of a deep neural network, activated by a sigmoid or softmax function, induces an information bottleneck, with the result that only a subset of the task-relevant information is passed on to the output. We first support this argument with a simulated toy experiment and then propose a method to reduce the information bottleneck by removing the last activation function. Furthermore, we introduce a new pooling method that further encourages the transfer of information from non-discriminative regions to the classifier. Our experimental evaluation shows that this simple modification significantly improves the quality of localization maps on the PASCAL VOC 2012 and MS COCO 2014 datasets, demonstrating state-of-the-art performance in weakly supervised semantic segmentation. The code is available at https://github.com/jbeomlee93/RIB.

1 Introduction

        Semantic segmentation is the task of identifying objects in images using semantic labels assigned at the pixel level. The development of deep neural networks (DNNs) has brought significant progress in semantic segmentation. Training a DNN for semantic segmentation requires a dataset containing a large number of images with pixel-level labels. However, preparing such a dataset requires considerable effort; for example, generating the pixel-level annotation for a single image in the Cityscapes dataset took over 90 minutes. This heavy dependence on pixel-level labels can be alleviated through weakly supervised learning.

        The goal of weakly supervised semantic segmentation is to train the segmentation network using weaker annotations that provide less information about the location of the target object than pixel-level labels, but are cheaper to obtain. Weak supervision can take the form of scribbles, bounding boxes, or image-level category labels. In this study, we focus on image-level class labels, since they are the cheapest and most popular option for weak supervision. Most methods using class labels utilize localization (attribution) maps from a trained classifier (such as CAM or Grad-CAM) to generate pseudo ground truth for training the segmentation network. However, such maps identify only the small discriminative areas of the target object rather than the entire region it occupies, making them unsuitable for training semantic segmentation networks. We explain this phenomenon using the information bottleneck principle.

        Information bottleneck theory analyzes the flow of information through successive DNN layers: input information is compressed as much as possible as it passes through each layer, while as much task-relevant information as possible is retained. This is advantageous for obtaining an optimal representation for classification, but disadvantageous when the resulting classifier's attribution map is used for weakly supervised semantic segmentation. The information bottleneck prevents non-discriminative information about the target object from being considered in the classification decision, so the attribution map focuses only on small discriminative regions of the target object.

        We argue that the information bottleneck becomes most pronounced at the last layer of a DNN because a bidirectionally saturating activation function (such as sigmoid or softmax) is used there. We propose a method to reduce the information bottleneck by removing this final activation function when retraining the network. Furthermore, we introduce a new pooling method that allows more of the information embedded in non-discriminative features, rather than discriminative ones, to be processed at the last layer of the DNN. As a result, the attribution map of a classifier obtained with our method carries more information about the target object.

        The main contributions of this study are summarized as follows. First, we emphasize that the information bottleneck mainly occurs in the last layer of the DNN, which results in the attribution map obtained from the trained classifier being limited to a small discriminative region of the target object. Second, we propose a method to reduce this information bottleneck by simply modifying existing training schemes. Third, our method significantly improves the quality of localization maps obtained from trained classifiers, demonstrating state-of-the-art performance on the PASCAL VOC 2012 and MS COCO 2014 datasets for weakly supervised semantic segmentation.

2. Preliminaries

2.1. Information bottleneck

        Given two random variables X and Y, the mutual information I(X; Y) quantifies the interdependence between the two variables. The data processing inequality (DPI) [13] implies that any three variables X, Y, and Z forming a Markov chain X → Y → Z satisfy I(X; Y) ≥ I(X; Z). Each layer of a DNN processes only the output of the previous layer, which means that the layers of a DNN form a Markov chain. Therefore, the flow of information through these layers can be characterized by the DPI. Specifically, when an L-layer DNN produces an output Ŷ from a given input X through intermediate features T_l (1 ≤ l ≤ L), it forms the Markov chain X → T_1 → · · · → T_L → Ŷ, and the corresponding chain of DPIs can be expressed as follows:

        I(X; T_1) ≥ I(X; T_2) ≥ · · · ≥ I(X; T_L) ≥ I(X; Ŷ).   (1)
        This means that information about the input X is compressed as it passes through the layers of the DNN. Training a classification network can thus be interpreted as extracting maximally compressed features of the input that retain as much information as possible for classification; such features are often called minimal sufficient features (i.e., discriminative information). The minimal sufficient feature (i.e., the optimal representation T*) can be obtained through the information bottleneck trade-off between the mutual information terms I(X; T) and I(T; Y) (compression versus classification) [54, 15]. In other words, T* = argmin_T I(X; T) − βI(T; Y), where β ≥ 0 is a Lagrange multiplier. Shwartz-Ziv et al. [49] observed a compression phase in the process of finding the optimal representation T*: for a fixed l, the observed I(X; T_l) increases steadily during the first few epochs of training but decreases in later epochs. Saxe et al. [46] argued that this compression phase mainly occurs in DNNs equipped with bidirectionally saturating nonlinearities (such as tanh and sigmoid) and does not appear in DNNs equipped with one-sided saturating nonlinearities (such as ReLU). This implies that a DNN with one-sided saturating nonlinearities experiences a weaker information bottleneck than one with bidirectionally saturating nonlinearities. This can also be understood through the gradient saturation of bidirectionally saturating functions: when the input exceeds a certain magnitude, the gradient saturates to near zero [8]. During backpropagation, features whose values lie beyond that range therefore receive near-zero gradients and are restricted from contributing further to the classification.
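To make the DPI concrete, the following minimal sketch (not from the paper; all names and distributions are illustrative) builds a discrete Markov chain X → T1 → T2 and estimates the mutual information terms from joint histograms; each processing step can only lose information about X.

```python
# A minimal sketch (not from the paper): empirically checking the DPI
# I(X; T1) >= I(X; T2) on a discrete Markov chain X -> T1 -> T2.
import numpy as np

def mutual_information(a, b, bins=16):
    """Estimate I(a; b) in nats from the joint histogram of two discrete samples."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of a
    py = pxy.sum(axis=0, keepdims=True)   # marginal of b
    nz = pxy > 0                          # avoid log(0) on empty cells
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.integers(0, 16, size=100_000)             # input X
t1 = (x + rng.integers(0, 4, size=x.size)) % 16   # noisy layer: X -> T1
t2 = (t1 + rng.integers(0, 4, size=x.size)) % 16  # further processing: T1 -> T2
print(mutual_information(x, t1), mutual_information(x, t2))
# Expected: I(X; T1) >= I(X; T2); each layer can only lose information about X.
```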

2.2. Class activation maps

        Class activation maps (CAM) [66] identify the image regions on which a classifier focuses. CAM is based on a convolutional neural network that applies global average pooling (GAP) before the final classification layer, and it considers the class-specific contribution of each channel of the last feature map to the classification score. Given a classifier parameterized by θ = {θ_f, w}, where f(·; θ_f) is the feature extractor before GAP and w is the weight of the final classification layer, the CAM of class c for an image x is obtained as follows:

        CAM_c(x) = w_c^⊤ f(x; θ_f),   (2)

        where w_c is the classification weight for class c; the map is commonly normalized so that its maximum value is 1.
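As an illustration of Equation 2, the sketch below computes a CAM from a torchvision ResNet-50, which stands in for the classifier f(·; θ_f); the class index and the normalization to [0, 1] are conventions assumed here, not details from the paper.

```python
# An illustrative sketch of Equation 2 (not the authors' code): computing a CAM
# from a torchvision ResNet-50 acting as f(.; theta_f).
import torch
import torchvision.models as models

model = models.resnet50(weights="IMAGENET1K_V1").eval()
feature_extractor = torch.nn.Sequential(*list(model.children())[:-2])  # f(x; theta_f)
w = model.fc.weight                              # final classification weights w

x = torch.randn(1, 3, 224, 224)                  # a dummy input image
with torch.no_grad():
    feat = feature_extractor(x)                  # (1, 2048, 7, 7): feature map before GAP
    cams = torch.einsum("cd,bdhw->bchw", w, feat)  # w_c^T f(x) at every spatial location
    cam_c = cams[0, 243]                         # CAM for an arbitrary class c
    cam_c = cam_c.clamp(min=0) / cam_c.max()     # normalize to [0, 1] for visualization
print(cam_c.shape)  # (7, 7); upsampled to the image size in practice
```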

2.3. Related work

Weakly supervised semantic segmentation: Weakly supervised semantic segmentation methods first build initial seeds from high-quality localization maps obtained from trained classifiers. Erasure methods [56, 37, 22] prevent the classifier from focusing only on the discriminative parts of an object by feeding it images in which the discriminative region has been erased. Multiple contexts of the target object can be considered by combining attribution maps obtained from different dilated convolutions [31, 57] or from different layers of a DNN, or by considering semantic similarities and dissimilarities across images [17, 51]. Zhang et al. [62] analyzed the causal relationship among images, backgrounds, and class labels, and proposed CONTA to remove confounding bias from classification. Since the localization maps produced by a classification network cannot accurately capture the boundary of the target object, subsequent boundary refinement techniques such as PSA [3] and IRN [2] can be used to refine the initial seeds obtained by the above methods.

Information bottleneck: Tishby et al. [54] and Shwartz-Ziv et al. [49] used information bottleneck theory to analyze the inner workings of DNNs. The concept has since been applied in many research areas. Dubois et al. [15] and Achille et al. [1] exploit the information bottleneck to obtain optimal representations from DNNs. DICE [45] proposed a model-ensembling method based on the information bottleneck principle: it aims to reduce unnecessary mutual information between features and inputs, as well as redundant information shared between features produced by separately trained DNNs. Jeon et al. [26] used the information bottleneck principle to study disentangled representation learning in generative models [19]. Yin et al. [60] designed an information-theoretic regularization objective to address the memorization problem in meta-learning. The information bottleneck principle can also be used to generate visual saliency maps for classifiers: Zhmoginov et al. [65] find regions important to the classifier through an information bottleneck trade-off, and Schulz et al. [47] restrict the information flow by adding noise to intermediate feature maps, thereby quantifying the amount of information contained in image regions.

3. Proposed method

        Weakly supervised semantic segmentation methods based on class labels use CAM [66] or Grad-CAM [48] to generate pixel-level localization maps from classifiers; however, such localization maps identify only small regions of target objects. We analyze this phenomenon using information bottleneck theory in Section 3.1 and propose RIB, a method to address it, in Section 3.2. We then explain in Section 3.3 how to train the segmentation network using localization maps improved by RIB.

3.1. Motivation

        As mentioned in Section 2.1, DNN layers with bidirectionally saturating nonlinearities suffer a larger information bottleneck than layers with one-sided saturating nonlinearities. The intermediate layers of popular DNN architectures such as ResNet [21] and DenseNet [23] are coupled with the ReLU activation function, a one-sided saturating nonlinearity. However, the last layer of these networks is activated by a bidirectionally saturating nonlinearity (such as sigmoid or softmax), and the class probability p is computed from the final feature map T_L and the final classification layer w as p = sigmoid(w^⊤ GAP(T_L)). Therefore, the final layer parameterized by w exhibits a pronounced bottleneck, and the amount of information transferred from the final feature T_L to the actual classification prediction is limited.
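A quick numerical check of this gradient-saturation argument (illustrative, not from the paper): the derivative of the sigmoid collapses once the logit is large, so features that already score highly stop receiving gradient.

```python
# A small numerical check (illustrative): the sigmoid derivative collapses for
# large logits, so confidently discriminative features stop receiving gradient.
import torch

for logit in [0.0, 2.0, 6.0, 12.0]:
    y = torch.tensor(logit, requires_grad=True)
    torch.sigmoid(y).backward()
    print(f"logit = {logit:5.1f}   d sigmoid / d logit = {y.grad.item():.6f}")
# 0.25 at logit 0, about 6e-6 at logit 12: the gradient (and thus the
# information flow through the last layer) vanishes for saturated features.
```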

        These arguments are consistent with observations in prior work. The information plane provided by Saxe et al. [46] shows that the compression of information is more significant in the last layer than in the other layers. Bae et al. [5] observed that although the final feature map of a classifier contains rich information about the target object, the final classification layer filters most of it out; consequently, CAM cannot identify the entire region of the target object. This observation empirically supports the emergence of an information bottleneck at the last layer of a DNN. To examine this phenomenon more closely, we designed a toy experiment. We collect images containing the digits '2' or '8' from the MNIST dataset [30]. For a small subset (10%) of these images, we add a circle or a square at a random location to the images containing the digits '2' and '8', respectively (see Figure 1(a)). When classifying an image as a '2' or an '8', the pixels corresponding to the digit form the discriminative region (R_D), the pixels corresponding to the added circle or square form the non-discriminative but class-relevant region (R_ND), and the pixels corresponding to the background form the class-irrelevant region (R_BG).

        We trained a neural network with five convolutional layers followed by a final fully connected layer. We obtain the gradient map G_l of each feature T_l with respect to the input image x: G_l = ∇_x Σ_{u,v} T_l(u, v), where u and v are the spatial and channel indices of the feature T_l; for the final classification layer (l = 6), G_6 = ∇_x y^c, where y^c is the classification logit. Because this gradient map represents how much each pixel of the image contributes to each feature, it can be used to examine how much information is passed from the input image to the feature maps of successive layers. We show an example of G_l in Figure 1(b). As the input image passes through the convolutional layers, the total amount of gradient with respect to the input decreases, indicating an information bottleneck. Specifically, the gradient in R_BG decreases early (G_1 → G_2), meaning that task-irrelevant information is compressed quickly. From G_1 to G_5, the gradient in R_D and R_ND decreases gradually. However, the reduction is most pronounced in the last layer (G_5 → G_6); in particular, the gradient in R_ND (red box) almost disappears. This supports our argument that a significant information bottleneck exists in the last layer of DNNs, and shows that the non-discriminative information in R_ND is compressed particularly heavily.
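The gradient maps G_l can be computed directly with automatic differentiation. The sketch below uses a toy five-conv classifier as a stand-in for the network of this experiment; the exact architecture is an assumption, and only the use of autograd matters here.

```python
# A minimal sketch of the gradient maps G_l = grad_x sum_{u,v} T_l(u, v) for a
# toy five-conv classifier; the architecture is an illustrative assumption.
import torch
import torch.nn as nn

convs = nn.ModuleList(
    nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())
    for cin, cout in [(1, 16), (16, 16), (16, 16), (16, 16), (16, 16)]
)
head = nn.Linear(16, 2)                              # final fully connected layer

x = torch.randn(1, 1, 28, 28, requires_grad=True)    # toy MNIST-like input
t, grads = x, []
for conv in convs:                                   # layers l = 1..5
    t = conv(t)
    g, = torch.autograd.grad(t.sum(), x, retain_graph=True)
    grads.append(g.abs())                            # G_l
y = head(t.mean(dim=(2, 3)))                         # logits after GAP
g6, = torch.autograd.grad(y[0, 0], x)                # G_6 = grad_x y^c
grads.append(g6.abs())
```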

        We also conducted a quantitative analysis. We define the high gradient ratio (HGR) of a region R as the proportion of pixels in R whose gradient exceeds 0.3. HGR quantifies the amount of information transferred to each feature from region R of the input image. The trend of the HGR value of each region at each layer is shown in Figure 1(c). The observed trends match the qualitative observations above, again indicating a significant information bottleneck for R_ND at the last layer (red box). We argue that this information bottleneck causes the localization map obtained from the trained classifier to concentrate on a small area of the target object. According to Equation 2, CAM contains only the information processed by the final classification weight w_c. Owing to the information bottleneck, however, only part of the information in the features passes through the last layer parameterized by w_c; most of the remaining non-discriminative information is discarded, so CAM cannot identify the non-discriminative regions of the target object. Using such a CAM to train a semantic segmentation network is not ideal, since the entire region of the target object should be recognized. We therefore aim to bridge the gap between classification and localization by reducing the information bottleneck.
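A minimal sketch of the HGR metric follows; normalizing the gradient map to [0, 1] before applying the 0.3 threshold is our assumption, and the region masks are assumed to be given.

```python
# A minimal sketch of the high gradient ratio (HGR): the fraction of pixels in
# region R whose (normalized) gradient exceeds 0.3.
import torch

def hgr(grad_map: torch.Tensor, region_mask: torch.Tensor, thr: float = 0.3) -> float:
    """grad_map: (H, W) gradient magnitudes; region_mask: (H, W) boolean mask of R."""
    g = grad_map / (grad_map.max() + 1e-8)       # normalize to [0, 1] (an assumption)
    high = (g > thr) & region_mask               # high-gradient pixels inside R
    return high.sum().item() / region_mask.sum().item()

# Usage with the toy regions of Section 3.1 (masks assumed given):
#   hgr(G6, R_D), hgr(G6, R_ND), hgr(G6, R_BG)
```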

Figure 1: (a) Examples of toy images. (b) Examples of gradient maps G_l. (c) Plot of the HGR values of R_D, R_ND, and R_BG at each layer, averaged over 100 images.

3.2. Reducing the information bottleneck

        In Section 3.1, we observed that the information in the input image is compressed especially at the last layer of the DNN because a bidirectionally saturating activation function is used there. We therefore propose to reduce the information bottleneck of the last layer by simply removing the sigmoid or softmax activation at the last layer of the DNN. We focus on multi-class multi-label classifiers, the default setting in weakly supervised semantic segmentation. Suppose we are given an input image x and the corresponding class label t = [t_1, · · · , t_C], where t_c ∈ {0, 1} (1 ≤ c ≤ C) indicates the presence of class c and C is the number of classes. While existing methods train multi-label classifiers with the sigmoid binary cross-entropy loss (L_BCE), our method replaces it with the following loss L_RIB, which does not depend on the final sigmoid activation function:

        L_RIB = Σ_{c=1}^{C} |m − t̂_c · y_c|, with t̂_c = 2t_c − 1,   (3)

        where m is a margin and y_c is the classification logit of image x for class c. However, training a classifier with L_RIB from scratch is unstable because the gradients never saturate (see Appendix). Therefore, we first train an initial classifier with L_BCE, whose trained weights are denoted θ_0, and then adapt these weights toward a bottleneck-free model for a given image x. Specifically, we fine-tune the initial model using L_RIB computed on x: θ_k = θ_{k−1} − λ∇_θ L_RIB(x; θ_{k−1}) for 1 ≤ k ≤ K, where K is the total number of iterations and λ is the learning rate for fine-tuning. We call this fine-tuning process RIB. RIB reduces the information bottleneck for x, yielding CAMs that identify more regions of the target object, including non-discriminative ones. We repeat the RIB process for every training image to obtain its CAMs. However, a model tuned for a given image x can easily overfit to x. Therefore, to further stabilize the RIB process, we construct a batch of size B by randomly selecting B−1 samples other than x at each RIB iteration. Note that the B−1 samples are re-sampled at every iteration while x remains in the batch.
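Putting the pieces together, the following sketch outlines the per-image RIB fine-tuning loop under the hyperparameters reported in Section 4.1; cam(), dataset, and the plain SGD step are placeholders for the paper's actual classifier, data pipeline, and optimizer, and the sampling is assumed not to pick x itself.

```python
# A minimal sketch of the per-image RIB fine-tuning loop (Section 3.2);
# cam(), dataset, and the SGD step are illustrative placeholders.
import copy
import random
import torch

def rib_loss(logits, targets, m=600.0):
    t_hat = 2.0 * targets - 1.0                  # map {0, 1} labels to {-1, +1}
    return (m - t_hat * logits).abs().sum(dim=1).mean()   # Equation 3

def rib_finetune(model0, x, t, dataset, cam, K=10, B=20, lr=8e-6):
    model = copy.deepcopy(model0)                # per-image copy of theta_0
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    cams = [cam(model, x)]                       # CAM(x; theta_0)
    for _ in range(K):
        idx = random.sample(range(len(dataset)), B - 1)   # B-1 other samples
        xs = torch.stack([x] + [dataset[i][0] for i in idx])
        ts = torch.stack([t] + [dataset[i][1] for i in idx])
        opt.zero_grad()
        rib_loss(model(xs), ts).backward()       # raw logits: no final sigmoid
        opt.step()
        cams.append(cam(model, x))               # CAM(x; theta_k)
    return torch.stack(cams).sum(dim=0)          # M = sum_k CAM(x; theta_k)
```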

Effectiveness of RIB: We demonstrate the effectiveness of RIB by applying it to the classifier from the toy experiment in Section 3.1. Figure 2 shows an example of G_6 (a) and a plot (b) of the HGR values of R_D, R_ND, and R_BG for G_6 at each RIB iteration, averaged over 100 images. During the RIB process, the HGR value of R_BG remains relatively stable, while the HGR values of R_D and R_ND increase significantly. This shows that the RIB process indeed reduces the information bottleneck, ensuring that more information corresponding to R_D and R_ND is processed by the final classification layer.

Limiting the transfer of information from the discriminative region: Zhang et al. [64] showed the relationship between the classification logit y and CAM, namely y = GAP(CAM). This means that increasing y_c through RIB also increases the pixel values in the CAM. For CAM to identify a wider area of the target object, it is important to increase the scores of pixels in non-discriminative regions rather than in discriminative ones. We therefore introduce a new pooling method into the RIB process so that features that previously contributed little information to the classification logit contribute more to the classification. We propose global non-discriminative region pooling (GNDRP). Unlike GAP, which aggregates the values of all spatial locations in the feature map T_L, GNDRP selectively aggregates only the spatial locations whose CAM score is below a threshold τ, as follows:

        GNDRP_c(T_L) = (1/|S_c|) Σ_{u∈S_c} T_L(u), where S_c = {u | CAM_c(x)(u) < τ},

        so that the logit for class c becomes y_c = w_c^⊤ GNDRP_c(T_L).

        Other weakly supervised semantic segmentation methods also consider new pooling methods besides GAP to obtain better localization maps [4, 27, 44]. The pooling method introduced in previous work makes the classifier pay more attention to the discriminative part. In contrast, GNDRP excludes highly activated regions and encourages further activation of non-discriminative regions.
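A minimal sketch of GNDRP for a single class c follows, assuming the CAM is normalized to [0, 1] so that the threshold τ = 0.4 from Section 4.1 applies.

```python
# A minimal sketch of GNDRP for one class c: average features only over
# spatial locations whose CAM score falls below tau.
import torch

def gndrp(feat: torch.Tensor, cam_c: torch.Tensor, tau: float = 0.4) -> torch.Tensor:
    """feat: (D, H, W) final feature map T_L; cam_c: (H, W) CAM of class c in [0, 1]."""
    mask = (cam_c < tau).float()                   # S_c: non-discriminative locations
    denom = mask.sum().clamp(min=1.0)              # |S_c| (guard against empty sets)
    return (feat * mask).sum(dim=(1, 2)) / denom   # (D,): used in place of GAP
```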

Obtaining the final localization map: We obtain the final localization map M by aggregating the CAMs obtained from the classifier at every RIB iteration k: M = Σ_{k=0}^{K} CAM(x; θ_k).

Figure 2: Analysis of R_D, R_ND, and R_BG of G_6 at each RIB iteration.

Table 1: Comparison of initial seed (Seed), seed with CRF (CRF) and pseudo ground truth mask (Mask) based on mIoU (%) on PASCAL VOC and MS COCO training images. †denotes the results reported by Zhang et al. [62], ‡denotes the results we obtained.

3.3. Weakly supervised semantic segmentation

        Since CAM [66] is obtained from the downsampled intermediate features produced by the classifier, it must be upsampled to the size of the original image. It therefore locates the target object only roughly and cannot accurately delineate its boundary. Many weakly supervised semantic segmentation methods [7, 6, 55, 62, 40, 31] produce pseudo ground truth masks by refining their initial seeds with established seed refinement methods [25, 3, 2, 27, 10]. Similarly, we obtain pseudo ground truth masks by applying IRN [2], a state-of-the-art seed refinement method, to the coarse map M. Furthermore, since image-level class labels carry no prior knowledge about the shape of the target object, salient object mask supervision is commonly used in existing methods [59, 31, 22, 38]. It can also be applied in our method to refine the pseudo ground truth masks: pixels whose pseudo label conflicts with the saliency map (foreground pixels identified as background on the map, or background pixels identified as foreground) are ignored when training the segmentation network.
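The saliency-based cleaning rule can be sketched as follows; the ignore index 255 follows the usual PASCAL VOC convention, and the exact conflict rule is our reading of the description above.

```python
# A minimal sketch of saliency-based cleaning of pseudo labels; ignore index
# 255 is the common PASCAL VOC convention, assumed here.
import torch

def clean_pseudo_label(pseudo: torch.Tensor, salient: torch.Tensor,
                       ignore_index: int = 255) -> torch.Tensor:
    """pseudo: (H, W) class ids with 0 = background; salient: (H, W) boolean mask."""
    out = pseudo.clone()
    out[(pseudo > 0) & ~salient] = ignore_index   # foreground label, non-salient pixel
    out[(pseudo == 0) & salient] = ignore_index   # background label, salient pixel
    return out
```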

4. Experiments

4.1. Experimental settings

Datasets and evaluation metrics: We conducted experiments on the PASCAL VOC 2012 [16] and MS COCO 2014 [39] datasets to evaluate our method quantitatively and qualitatively. Following common practice in weakly supervised semantic segmentation [3, 2, 31, 62], we use the PASCAL VOC 2012 dataset augmented by Hariharan et al. [20], which contains 10,582 training images over 20 categories. The MS COCO 2014 dataset contains approximately 82,000 training images with objects from 80 categories. We evaluate our method on the 1,449 validation images and 1,456 test images of PASCAL VOC 2012 and the 40,504 validation images of MS COCO 2014 using the mean intersection-over-union (mIoU) metric.

Reproducibility: We implemented CAM [66] following Ahn et al. [2], using the PyTorch framework [43]. We use ResNet-50 [21] as the backbone of the classifier. We fine-tuned the classifier for K = 10 iterations with a learning rate of 8×10−6 and a batch size of 20. We set the margin m to 600. For GNDRP, we set τ to 0.4. For the final semantic segmentation, we used the PyTorch implementation of DeepLab-v2-ResNet101 provided by [42], with an initial model pretrained on the ImageNet dataset [14]. For the MS COCO 2014 dataset, we crop the training images to 481×481 instead of the 321×321 used for PASCAL VOC 2012, considering the larger dimensions of the images in this dataset.

Figure 3: Examples of localization maps obtained during the RIB process for (a) PASCAL VOC 2012 training images and (b) MS COCO 2014 training images.

4.2. Weakly supervised semantic segmentation

4.2.1. Quality of initial seeds and pseudo ground truth masks

PASCAL VOC 2012 dataset: In Table 1, we report the mIoU values of initial seeds and pseudo ground truth masks generated by our method and other state-of-the-art techniques. Following SEAM [55], we evaluate a series of thresholds for separating foreground from background in the map M and select the one that yields the best initial seed. Our initial seed shows a 7.7% improvement over the original CAM (the baseline used for comparison) and also outperforms the initial seeds of the other methods. Notably, our initial seed is better than that of SEAM, which further refines the initial CAM at the pixel level by considering inter-pixel relationships in an auxiliary self-attention module.
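The threshold sweep described above can be sketched as follows; M, seed_miou, and the threshold range are placeholders rather than values from the paper.

```python
# A minimal sketch of the foreground/background threshold sweep for seeds.
import numpy as np

def seed_from_map(M: np.ndarray, thr: float) -> np.ndarray:
    """M: (C, H, W) per-class localization maps normalized to [0, 1]."""
    fg = M.max(axis=0)               # strongest class score at each pixel
    seed = M.argmax(axis=0) + 1      # class ids start at 1; 0 is background
    seed[fg < thr] = 0               # low-confidence pixels become background
    return seed

# Usage (placeholders): keep the threshold whose seeds score highest, e.g.
#   best = max(np.arange(0.1, 0.6, 0.05),
#              key=lambda thr: seed_miou(seed_from_map(M, thr)))
```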

        We applied post-processing based on a conditional random field (CRF) [28] to refine, at the pixel level, the initial seeds obtained from Chang et al. [7], SEAM [55], IRN [2], and our method. On average, applying the CRF improves all seeds by more than 5%, except those of SEAM, which improve by only 1.4%. This unusually small improvement is plausible because SEAM's self-attention module has already refined the CAM seeds. When the seeds generated by our method are refined by the CRF, they surpass SEAM's seeds by 6.1%, thus outperforming all recent competing methods by a large margin.

        We then compared the pseudo ground truth masks obtained by seed refinement with those produced by other methods. Most of the compared methods use PSA [3] or IRN [2] to refine their initial seeds. For a fair comparison, we generate pseudo ground truth masks with both seed refinement techniques. Table 1 shows that the masks obtained by our method achieve an mIoU of 68.6 with PSA [3] and 70.6 with IRN [2], outperforming the other methods by a large margin.

MS COCO 2014 dataset: Table 1 also shows the mIoU values of initial seeds and pseudo ground truth masks obtained by our method and other state-of-the-art methods on the MS COCO 2014 dataset. We obtained the results of IRN [2] using the official code as the baseline performance. Our method improves the initial seeds and pseudo ground truth masks of the baseline IRN [2] by 3.0% and 2.7% mIoU, respectively.

        Figure 3 shows examples of localization maps progressively refined through the RIB process on the PASCAL VOC 2012 and MS COCO 2014 datasets. More examples can be found in the appendix.

4.2.2. Performance of weakly supervised semantic segmentation

PASCAL VOC 2012 dataset: Table 2 shows the mIoU values of segmentation maps predicted by our method and other recently introduced weakly supervised semantic segmentation methods on PASCAL VOC 2012 validation and test images; these methods use bounding box labels or image-level class labels. All results in Table 2 are obtained with ResNet-based backbone networks [21]. Our method achieves mIoU values of 68.3 and 68.6 on the validation and test images of the PASCAL VOC 2012 semantic segmentation benchmark, respectively, outperforming all methods that use image-level class labels as weak supervision. In particular, our method outperforms CONTA [62], the best-performing competitor, which achieves an mIoU of 66.1. However, CONTA relies on SEAM [55], which is known to perform better than IRN [2]. When CONTA is implemented with IRN for a fairer comparison, its mIoU drops to 65.3, which our method exceeds by 3.0%.

        Table 3 compares our method with other recent methods that use additional salient object supervision. We use the salient object supervision employed by Li et al. [38] and Yao et al. [59]. Our method achieves mIoU values of 70.2 and 70.0 on the validation and test images, respectively, outperforming all state-of-the-art methods introduced under the same level of supervision. Figure 4(a) shows examples of segmentation maps predicted by our method with and without saliency supervision. The boundary information provided by saliency supervision enables our method to produce more accurate boundaries (yellow box). However, when saliency supervision is used, non-salient objects in images are often ignored, whereas RIB successfully identifies them (e.g., the "sofa" in the first column of Figure 4(a) and a "person"). This empirical finding suggests future work that identifies precise boundaries and non-salient objects simultaneously.

MS COCO 2014 dataset: Table 4 compares the performance of our method with other recent methods on MS COCO 2014 validation images. Compared to our baseline IRN [2], our method improves the mIoU score by 2.4%p and significantly outperforms other recent competing methods [11, 62, 58]. The IRN results reported in CONTA [62] differ from those we obtained; we therefore compare relative improvements: CONTA achieves an improvement of 0.8%p over IRN (32.6 → 33.4), while our method achieves an improvement of 2.4%p (41.4 → 43.8). Figure 4(b) shows examples of segmentation maps predicted by our method on MS COCO 2014 validation images.

Figure 4: Examples of predicted segmentation masks by IRN [2] and our method for (a) PASCAL VOC 2012 validation images and (b) MS COCO 2014 validation images.

Table 2: Semantic segmentation performance comparison on PASCAL VOC 2012 validation and test images.

Table 3: Comparison of semantic segmentation performance on PASCAL VOC 2012 validation and test images using explicit localization cues. S: salient object, SI: salient instance.

Table 4: Semantic segmentation performance comparison on MS COCO validation images.

Table 5: Comparison of mIoU scores of initial seeds: (a) using different activation functions for the last layer, (b) using different values of m and λ, (c) using different values of τ.

4.3. Ablation studies

        In this section, we analyze our method by conducting various ablation studies on the PASCAL VOC 2012 dataset to provide more information on the effectiveness of each component of our method.

The impact of the number of iterations K on the RIB process: We analyzed the impact of the number of iterations K on the effectiveness of the RIB process. Figure 5 shows the mIoU scores for the initial seeds obtained by the baseline CAM method, and the mIoU scores for each iteration of the RIB process using GAP or GNDRP. As the RIB process proceeds, the localization map is significantly improved regardless of which pooling method is used. However, the performance improvement of RIB using GAP is limited and even drops slightly in subsequent iterations (K > 5). This is because GAP allows features that already provide sufficient classification information to be more involved in classification. Since our proposed GNDRP limits the increased contribution of these discriminative regions to classification, RIB using GNDRP can effectively make non-discriminative information more involved in classification, resulting in better localization maps in subsequent iterations. We observe that changing the value of K to be larger than 10 (or even 20) results in a drop in mIoU of no more than 0.8%, indicating that choosing a good K value is not difficult.

Fine-tuning with L_RIB: To verify the effectiveness of L_RIB, we fine-tune the model with the BCE loss under various bidirectionally saturating activation functions. Table 5(a) shows the mIoU scores of the initial seeds obtained by fine-tuning with sigmoid, tanh, and softsign activations, and with our L_RIB. We map the outputs of tanh and softsign to values between 0 and 1 via affine transformations. Fine-tuning with a bidirectionally saturating activation function improves the initial seeds to some extent, which demonstrates the effectiveness of per-sample adaptation; however, the improvement is limited by the remaining information bottleneck. Notably, the softsign activation yields better localization maps than tanh and sigmoid. We believe this is because the gradient of softsign approaches zero at larger input values than those of the other functions (see appendix), so softsign induces a weaker information bottleneck. Our L_RIB effectively removes the information bottleneck and achieves the best performance.

Sensitivity to hyperparameters: We analyzed the sensitivity of the initial seed's mIoU to the hyperparameters involved in the RIB process. Table 5(b) gives the mIoU values of the initial seeds obtained with different combinations of m and λ. Overall, slightly lower performance is observed when m and λ are small and the RIB process is therefore weaker. For sufficiently large m and λ, the performance of the RIB process is competitive. Table 5(c) analyzes the impact of the threshold τ used in GNDRP. When τ increases from 0.3 to 0.5, the mIoU changes by no more than 1%, so we conclude that the RIB process is robust to changes in τ.

 

Figure 5: Analysis of RIB process using GAP or GNDRP by mIoU of initial seeds.

Table 6: Comparison of precision (Prec.), recall and F1 score on PASCAL VOC 2012 training images.

Table 7: mIoU (%) of initial seeds for the 'boat' and 'train' categories.

4.4. Spurious correlation analysis

        In natural images, when objects of a certain category primarily appear together in a specific background, spurious correlations may occur between the target object and the background [36, 62] (e.g., a ship at sea and a train on railroad tracks). Since image-level category labels cannot provide clear localization cues of target objects, classifiers trained using these labels are susceptible to spurious correlations. The resulting localization map from the classifier may also highlight spuriously relevant background, thereby reducing accuracy. This is a long-standing problem that often arises in weakly supervised semantic segmentation and object localization.

        RIB may also activate some spuriously correlated background. However, comparing the precision, recall, and F1 score of our method with those of other recent methods in Table 6, we find that our method identifies substantially more foreground regions. Chang et al. [7] achieve a high recall, but their precision drops significantly. SEAM [55] avoids the loss of precision through the pixel-level refinement performed by the additional module mentioned in Section 4.2.1. Our method improves both the precision and recall of our baseline IRN [2] without any external module.
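For reference, the Table 6 metrics can be computed as follows, treating every object class as foreground at the pixel level (our assumption of the evaluation protocol).

```python
# A minimal sketch of pixel-level precision, recall, and F1 for seed masks.
import numpy as np

def seed_precision_recall_f1(seed: np.ndarray, gt: np.ndarray):
    """seed, gt: (H, W) class ids with 0 = background."""
    pred_fg, true_fg = seed > 0, gt > 0
    tp = (pred_fg & true_fg).sum()                # pixels correctly marked foreground
    precision = tp / max(pred_fg.sum(), 1)
    recall = tp / max(true_fg.sum(), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f1
```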

        To further analyze the context of spurious correlations, we present class-level results on seed improvement for our method and other recent methods. We selected two representative categories, "boat" and "train", for which the background has spurious correlation with the foreground (boat on the sea and train on the tracks). Table 7 shows that RIB can improve localization quality (mIoU) even for classes where spurious foreground-background correlations are known to exist.

5. Summary

        In this study, we addressed a main challenge of weakly supervised semantic segmentation based on image-level category labels. Using the information bottleneck principle, we first analyzed why the localization map obtained from a classifier identifies only a small area of the target object. Our analysis indicates that the amount of information transferred from the input image to the output classification is largely limited by the last layer of the DNN. We then reduce this information bottleneck with two simple modifications to the existing training scheme: removing the final nonlinear activation function of the DNN and introducing a new pooling method. Our method significantly improves the localization maps obtained from the classifier and achieves state-of-the-art performance on the PASCAL VOC 2012 and MS COCO 2014 datasets. Social impact: This work may have the following social impacts. Object segmentation without pixel-level annotation saves resources in research and commercial development, which is particularly useful in fields such as medicine, where expert annotations are expensive. However, some companies offer image annotation as part of their services; if the dependence of DNNs on labels is reduced through weakly supervised learning, these companies may need to adapt their business models.
