Paper Reading Record 4, 2023.3.24 (developments in weakly supervised learning)

Source: Research Progress in Visual Weakly Supervised Learning (wanfangdata.com.cn)

1. Overview of general weakly supervised learning models

1. Multiple instance learning (MIL)

2. Expectation maximization (EM)

2. Object detection and localization

1. Weakly supervised object detection can be cast as a MIL-based candidate-box (proposal) classification problem

WSDDN (weakly supervised deep detection network)
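The core of WSDDN is a two-stream MIL head: a softmax over classes per proposal, a softmax over proposals per class, combined by elementwise product and summed into image-level scores that image labels can supervise. A minimal NumPy sketch (function name and shapes are my own, for illustration):

```python
import numpy as np

def wsddn_scores(cls_logits, det_logits):
    """Two-stream WSDDN-style scoring sketch.
    cls_logits, det_logits: (num_proposals, num_classes) raw head outputs."""
    # classification stream: softmax over classes, per proposal
    c = np.exp(cls_logits - cls_logits.max(axis=1, keepdims=True))
    c /= c.sum(axis=1, keepdims=True)
    # detection stream: softmax over proposals, per class
    d = np.exp(det_logits - det_logits.max(axis=0, keepdims=True))
    d /= d.sum(axis=0, keepdims=True)
    proposal_scores = c * d                     # per-proposal, per-class
    image_scores = proposal_scores.sum(axis=0)  # image-level, trained with image labels
    return proposal_scores, image_scores
```

Because the detection stream is normalized over proposals, the image-level score per class stays in [0, 1] and can be trained with a multi-label classification loss.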

class attention map mechanism

Self-training and conversion of the supervision form

If this part is hard to follow, just skip it — the takeaway is that bounding-box annotations can also be used to supervise semantic segmentation

3. Semantic Segmentation Task

3.1. Weakly supervised semantic segmentation based on bounding box annotation

1. Examples of recent articles

MCG (Arbelaez et al., 2014) and similar methods extract a set of candidate regions, and the candidate region with the largest overlap with the bounding box is selected as the initial pseudo-label. The pseudo-labels are then refined through iterative training: in each round, the current pseudo-labels are used to train the segmentation network, the network makes predictions, and for each bounding box one of the highest-scoring candidate regions is chosen as the new pseudo-label for the next round.
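The initial pseudo-label selection step — picking the candidate region with the largest overlap with the bounding box — can be sketched as a simple IoU comparison (names and the boolean-mask representation are illustrative assumptions):

```python
import numpy as np

def select_pseudo_label(box_mask, candidate_masks):
    """Pick the candidate region with the largest IoU against the box region.
    box_mask and each candidate mask: boolean (H, W) arrays."""
    def iou(a, b):
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return inter / union if union else 0.0
    ious = [iou(box_mask, m) for m in candidate_masks]
    return int(np.argmax(ious)), max(ious)
```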

2021 

Oh et al.

Background-aware pooling (BAP) and noise-aware loss (NAL)

Background-Aware Pooling and Noise-Aware Loss for Weakly-Supervised Semantic Segmentation (IEEE)

BAP uses an attention map to separate the foreground and background parts of the bounding box; by aggregating foreground features and discarding background ones it obtains a more accurate class activation map (CAM), and the attention map and CAM are then combined with DenseCRF post-processing to generate pseudo-labels. NAL computes a cross-entropy loss that adaptively uses the network output, mid-level features, and classifier weights: the similarity of mid-level features is used to generate a second set of pseudo-labels, and in regions where the two pseudo-labels disagree, feature similarity is used to weight the cross-entropy loss, suppressing the influence of noise in the pseudo-labels.
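My reading of the BAP aggregation step, as a hedged sketch: average the features inside the box weighted by the foreground attention, so background pixels contribute nothing (all names and shapes below are assumptions, not the paper's code):

```python
import numpy as np

def background_aware_pool(features, attention, box_mask):
    """Attention-weighted pooling of foreground features inside a box.
    features: (C, H, W); attention and box_mask: (H, W)."""
    w = attention * box_mask       # foreground weight, restricted to the box
    w_sum = w.sum()
    if w_sum == 0:
        return np.zeros(features.shape[0])
    # weighted average over spatial positions, per channel
    return (features * w).sum(axis=(1, 2)) / w_sum
```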

2021

Xie et al.

Learned proposal generator (LPG)

Using COCO as an auxiliary dataset, 60 object categories disjoint from PASCAL VOC 2012 are used to learn a category-agnostic model that can extract candidate regions on any dataset with bounding-box annotations. A bi-level optimization scheme trains the segmentation model and the proposal extractor jointly: training the segmentation model is the lower-level problem, training the box-based candidate-region extractor is the upper-level problem, and an EM algorithm trains the LPG and the segmentation model together. Pseudo-labels are continuously refined over multiple training stages and used to train the segmentation model.

2019

Song et al.

BCM (box-driven class-wise masking)

Filling-rate-guided adaptive loss

A preliminary label is obtained with DenseCRF and used to compute the average filling rate of each object class within its bounding box; this is fed into a segmentation network based on a fully convolutional network. The feature map of each category in the network output is then multiplied by the mask of the corresponding class, and the final output is constrained by the average filling rate of the class it belongs to.
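The filling rate that drives the loss is simply the fraction of a bounding box covered by the pseudo-label; a minimal sketch (the half-open box convention and names are my own):

```python
import numpy as np

def fill_rate(pseudo_mask, box):
    """Fraction of the bounding box covered by the pseudo-label.
    pseudo_mask: boolean (H, W); box = (y0, y1, x0, x1), half-open."""
    y0, y1, x0, x1 = box
    region = pseudo_mask[y0:y1, x0:x1]
    return region.sum() / region.size
```

A loss can then penalize predictions whose in-box foreground fraction deviates from the class's average filling rate.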

2. Existing problems and development

Problem: The accuracy of the pseudo-labels generated in stage 1 is not high enough

Development directions: suppressing the influence of noise in pseudo-labels, and fusing prior information about segmentation results into the network structure

3.2 Weakly supervised semantic segmentation based on image-level annotation

If an image contains objects of a given class, at least one pixel belongs to that class (a positive example); if it does not, no pixel can belong to that class (a negative example). The image is first passed through an OverFeat (Sermanet et al., 2014) convolutional network to obtain a per-class score map, which is then aggregated by an aggregation layer. The model is trained with image-level category labels so that correctly classified pixels receive higher weight, and three smoothing strategies are applied to smooth the predicted results.
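The aggregation layer in this line of work is often a smooth pooling between max and mean, e.g. log-sum-exp; a sketch of that idea (the parameter `r` controls the smoothing, and this is an illustration of the concept rather than the paper's exact layer):

```python
import numpy as np

def lse_pool(score_map, r=5.0):
    """Log-Sum-Exp aggregation of a per-class score map into one image-level
    score: approaches the max as r grows and the mean as r shrinks."""
    s = score_map.ravel()
    m = s.max()  # subtract the max for numerical stability
    return m + np.log(np.mean(np.exp(r * (s - m)))) / r
```

Unlike hard max pooling, every pixel receives a gradient, while high-scoring pixels still dominate — which suits the "at least one positive pixel" MIL assumption.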

Building on the class activation map (CAM), several methods (Hou et al., 2018; Mai et al., 2020; Wei et al., 2017a) iteratively erase the high-response regions found by the classification network during training, retrain multi-label classification on the remaining regions, and add the newly found high-response regions to the segmentation prediction, repeating until the prediction no longer changes.
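One round of this erase-and-retrain loop can be sketched as: threshold the response map, add the high-response pixels to the accumulated object region, and zero them out for the next round (the threshold rule and names are illustrative, not any specific paper's code):

```python
import numpy as np

def erase_iteration(response, accumulated, thresh=0.8):
    """One erasing round. Pixels above a relative threshold join the
    accumulated object region and are erased from the next round's input.
    response: (H, W) float; accumulated: (H, W) bool."""
    new_region = response >= thresh * response.max()
    accumulated = np.logical_or(accumulated, new_region)
    erased_response = np.where(accumulated, 0.0, response)
    return accumulated, erased_response
```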

The EPS (explicit pseudo-pixel supervision) method proposed by Lee et al. (2021c) also combines CAM with saliency maps: CAM can distinguish object classes but has blurry boundaries, while saliency maps have sharp boundaries but cannot distinguish object categories. EPS therefore introduces a saliency loss that uses the saliency map to obtain accurate boundary information for the foreground regions in the CAM.

The PSA (pixel-level semantic affinity) method proposed by Ahn and Kwak (2018) designs an affinity network, AffinityNet, to predict the semantic similarity of adjacent pixels; adjacent pixels in the confident regions of the CAM are extracted as training data, producing more accurate segmentation pseudo-labels.

Ahn et al. (2019) designed IRNet with two branches: one predicts a pixel displacement field for objects and uses it to generate a CAM for each instance, while the other predicts object boundaries, yielding similarity relations between pixel pairs. The final segmentation result merges the outputs of the two branches. The method also introduces a random walk, propagating the predictions according to the affinity matrix to refine the segmentation predicted by the network.
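The random-walk refinement step — propagating per-pixel class scores along a pixel affinity matrix — can be sketched as repeated multiplication by a row-normalized transition matrix (a simplified illustration, not IRNet's exact procedure):

```python
import numpy as np

def random_walk_refine(cam, affinity, steps=10):
    """Propagate CAM scores with a pixel affinity matrix.
    cam: (N, C) per-pixel class scores; affinity: (N, N) non-negative."""
    # row-normalize the affinity into a stochastic transition matrix
    t = affinity / affinity.sum(axis=1, keepdims=True)
    out = cam.copy()
    for _ in range(steps):
        out = t @ out  # each pixel takes an affinity-weighted average of its neighbors
    return out
```

Pixels with high mutual affinity converge toward the same class scores, smoothing the noisy CAM within object regions.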

CONTA (context adjustment), proposed by Zhang et al. (2020), removes the influence of context features that do not belong to the object on the CAM via causal intervention, thereby improving pseudo-label accuracy.

3.3 Weakly supervised semantic segmentation based on scribble annotation

Lin et al. (2016) proposed ScribbleSup, which first generates superpixels from the scribble annotations, then propagates labels over superpixels with a GrabCut-based energy function, and alternately updates the segmentation model and the training labels in the spirit of the EM algorithm until the model converges.

The RAWKS (random-walk weakly-supervised segmentation) method proposed by Vernaza and Chandraker (2017) propagates sparse scribble labels to object regions by training a consistency network, using a specific probabilistic model — a random walk — for sparse label propagation. Because this model is differentiable during semantic-edge-detection training, it greatly improves the accuracy of pseudo-labels obtained from scribble annotation.

SPML (semi-supervised pixel-wise metric learning), proposed by Ke et al. (2021), uses SegSort (Hwang et al., 2019) as the backbone network and an HED contour detector as additional supervision, introducing a pixel-level feature consistency loss: pixels with similar semantics are pulled closer after being mapped into feature space, while pixels of different semantics are pushed apart. Existing methods usually constrain pixels of similar color to have similar features, which can mislead the learning of the network.

In response to this problem, Zhang et al. (2021a) proposed a dynamic feature regularization (DFR) loss, which only constrains the similarity of pixel pairs with similar colors within a small window, and also designed a feature consistency module supervised by pixel-level features with high model prediction confidence, requiring pixels within the window that belong to the same category to have similar features. Both the DFR loss and the feature consistency module can be applied directly to other weakly supervised semantic segmentation methods.
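The pull-together/push-apart idea behind these pixel-wise metric losses can be illustrated with a toy contrastive objective (a generic sketch of the idea, not SPML or DFR itself):

```python
import numpy as np

def pixel_metric_loss(features, labels, margin=1.0):
    """Toy pixel-wise metric loss: same-label pairs are pulled together
    (squared distance), different-label pairs pushed at least `margin` apart.
    features: (N, D) pixel embeddings; labels: (N,) class ids."""
    loss, pairs = 0.0, 0
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            d = np.linalg.norm(features[i] - features[j])
            if labels[i] == labels[j]:
                loss += d ** 2                       # pull same-class pixels together
            else:
                loss += max(0.0, margin - d) ** 2    # push different classes apart
            pairs += 1
    return loss / max(pairs, 1)
```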

4. Accuracy

 

Best accuracy article: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (IEEE)


Origin blog.csdn.net/weixin_61235989/article/details/129758905