Paper Reading: (U2PL) Semi-supervised semantic segmentation based on unreliable pseudo-labels


Introduction

Title: "Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels", CVPR'22

U2PL: Semi-supervised semantic segmentation based on unreliable pseudo-labels

Date: 2022.3.14
Unit: Shanghai Jiao Tong University, Chinese University of Hong Kong, SenseTime
Paper address: https://arxiv.org/abs/2203.03884
GitHub: https://github.com/Haochen-Wang409/U2PL
Project address: https://haochen-wang409.github.io/U2PL/
Author PR: https://zhuanlan.zhihu.com/p/474771549

  • Author
    (the first authors contributed equally)
    Wang Haochen, personal homepage: https://haochen-wang409.github.io/

Wang Haochen is a second-year doctoral student at the Center for Research on Intelligent Perception and Computing, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, advised by Professor Zhang Zhaoxiang. He received his bachelor's degree from the School of Mechanical Engineering, Shanghai Jiao Tong University, in June 2022.

His research focuses on computer vision and pattern recognition, especially image perception, label-efficient learning, and unsupervised representation learning.

  • Other authors

Yujun Shen is currently a senior research scientist at Ant Research and leads its Interaction Intelligence Laboratory. His research focuses on computer vision and deep learning, especially generative models and 3D vision.

(Authors affiliated only with SenseTime are not introduced individually: Jingjing Fei, Wei Li, Guoqiang Jin, Liwei Wu)

Executive Research Director of SenseTime and head of the R&D Shared Technology Center (STC) of SenseTime's Smart City Group (SCG)

  • Corresponding Author

  • Summary

The key to semi-supervised semantic segmentation is assigning adequate pseudo-labels to the pixels of unlabeled images. A common practice is to select highly confident predictions as pseudo ground-truth, but this leaves most pixels unused because of their unreliability. We argue that every pixel matters to model training, even if its prediction is ambiguous. Intuitively, an unreliable prediction may be confused among the top classes (i.e., those with the highest probabilities), yet it should be confident that the pixel does not belong to the remaining classes. Hence, such a pixel can be convincingly treated as a negative sample for those most unlikely categories. Based on this insight, we develop an effective pipeline to make full use of unlabeled data. Specifically, we separate reliable from unreliable pixels via the entropy of their predictions, push each unreliable pixel into a category-wise queue of negative samples, and manage to train the model with all candidate pixels. Considering that predictions become more and more accurate as training evolves, we adaptively adjust the threshold of the reliable-unreliable partition. Experimental results on various benchmarks and training settings demonstrate that our approach outperforms state-of-the-art alternatives.


This article argues that the key to semi-supervised tasks is making full use of unlabeled data. It proposes U2PL, which, based on the idea that "Every Pixel Matters", effectively utilizes all unlabeled data, including unreliable samples, to improve accuracy.

Self-training: Sample screening leads to insufficient training

The core issue of semi-supervised learning is to effectively utilize unlabeled samples as a supplement to labeled samples to improve model performance.

Most classic self-training methods follow the basic pipeline of supervised learning → pseudo-labeling → re-training, but the student network learns wrong information from incorrect pseudo-labels, which causes performance degradation.

The conventional approach is to filter samples to only leave high-confidence prediction results, but this will exclude a large amount of unlabeled data from the training process, resulting in insufficient model training. In addition, if the model cannot predict certain hard classes well, it will be difficult to assign accurate pseudo labels to unlabeled pixels of that class, thus entering a vicious cycle.

If the model does not predict a certain class satisfactorily (e.g., chair in Figure 1), it will be difficult to assign accurate pseudo-labels to pixels about this class, which can lead to under-training and absolute imbalance. To fully utilize unlabeled data, every pixel should be properly utilized.

Figure 1. Category-wise performance and pixel-count statistics of reliable and unreliable predictions. The model is trained with 732 labeled images on PASCAL VOC 2012 and evaluated on the remaining 9,850 images.

As mentioned above, directly using unreliable predictions as pseudo-labels leads to performance degradation. In this paper, we propose an alternative approach that makes use of unreliable pseudo-labels.

First, we observe that unreliable predictions are usually confused only among a few classes rather than all classes. Taking Figure 2 as an example, the pixel marked with a white cross receives similar probabilities on motorbike and person, but the model is quite sure the pixel does not belong to car or train. Based on this observation, we propose treating such confused pixels as negative samples for their unlikely classes. Specifically, after obtaining predictions on unlabeled images, we use per-pixel entropy as a metric (see Figure 2a) to divide all pixels into two groups: reliable and unreliable. All reliable predictions are used to derive positive pseudo-labels, while pixels with unreliable predictions are pushed into a memory bank full of negative samples. To avoid all negative pseudo-labels coming from only a subset of categories, we maintain one queue per category; this design keeps the number of negative samples balanced across categories. Meanwhile, considering that the quality of pseudo-labels rises with model accuracy, we propose a strategy that adaptively adjusts the threshold separating reliable from unreliable pixels.

Goals/motivations

  • Every Pixel Matters

Specifically, we can measure the reliability of a prediction by its per-pixel entropy: low entropy indicates a reliable prediction, and high entropy an unreliable one. Consider the concrete example in Figure 2. Figure 2(a) shows an unlabeled image overlaid with its entropy map. Unreliable pixels with high entropy cannot be assigned a definite pseudo-label and therefore do not participate in the re-training process; they are shown in white in Figure 2(b).
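The entropy-based split above can be sketched in a few lines. This is a minimal NumPy sketch, not the paper's implementation; the quantile fraction `alpha` and the toy probability map are illustrative assumptions:

```python
import numpy as np

def pixel_entropy(probs):
    """Per-pixel entropy of softmax probabilities.
    probs: (C, H, W) array; each pixel's class probabilities sum to 1."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=0)  # (H, W)

def split_by_entropy(probs, alpha=0.2):
    """Mark the alpha fraction of pixels with the highest entropy as unreliable."""
    ent = pixel_entropy(probs)
    gamma = np.quantile(ent, 1.0 - alpha)  # entropy threshold
    unreliable = ent > gamma
    return ent, unreliable

# Toy example: a 2x2 "image" with 3 classes.
probs = np.array([
    [[0.98, 0.45], [0.90, 0.34]],  # class 0
    [[0.01, 0.45], [0.05, 0.33]],  # class 1
    [[0.01, 0.10], [0.05, 0.33]],  # class 2
])
ent, unreliable = split_by_entropy(probs, alpha=0.25)
# the most ambiguous pixel (0.34 / 0.33 / 0.33) is flagged as unreliable
```

The pixel whose probabilities are nearly uniform gets the highest entropy and is the one excluded from pseudo-labeling, matching the white regions of Figure 2(b).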


Figure 2: Illustration of unreliable pseudo-labels. (a) Pixel-wise entropy predicted from an unlabeled image, where low-entropy and high-entropy pixels indicate reliable and unreliable predictions, respectively. (b) Pixel-wise pseudo-labels from reliable predictions only, where pixels in white areas receive no pseudo-label. (c) Class probabilities of a reliable prediction (yellow cross), confident enough to supervise the person class. (d) Class probabilities of an unreliable prediction (white cross), hovering between motorbike and person, but confident enough that the pixel does not belong to car or train.

We select one reliable and one unreliable prediction and plot their category-wise probabilities as histograms in Figure 2(c) and Figure 2(d). The pixel marked by the yellow cross has a prediction probability close to 1 on the person class; the model is very confident about this prediction, and this low-entropy pixel is a typical reliable prediction. The pixel marked by the white cross has high and numerically close probabilities on both motorbike and person; the model cannot give a definite prediction, which fits our definition of an unreliable prediction. Yet for this same pixel, although the model is unsure which of the two classes it belongs to, it assigns extremely low probabilities to car and train, and is clearly very confident that the pixel does not belong to those categories.

Therefore, we argue that even an unreliable prediction, although it cannot be assigned a definite pseudo-label, can still serve as a negative sample for certain categories, so that all unlabeled samples can play a role in the training process.

method


Figure 3. Overview of our proposed U2PL approach. U2PL consists of a student network and a teacher network, where the teacher is updated from the student via momentum. Labeled data is fed directly into the student for supervised training. Given an unlabeled image, we first use the teacher model to make predictions and then classify the pixels into reliable and unreliable ones based on their entropy; this process is formulated as equation (6). Reliable predictions are used directly as pseudo-labels to supervise the student, while each unreliable prediction is pushed into a category-wise memory bank. The pixels in each memory bank are treated as negative samples of the corresponding class, as formulated in equation (4).

In terms of network structure, U2PL adopts the momentum-teacher structure common in the self-training line of work: two networks with identical architectures, teacher and student, where the teacher receives parameter updates from the student via EMA. The composition of a single network follows ReCo (ICLR'22) and comprises three parts: an encoder h, a decoder f, and a representation head g.
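The EMA update of the teacher from the student can be sketched as below. This is a minimal sketch with illustrative parameter names and momentum value, not U2PL's actual training loop:

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.99):
    """Momentum (EMA) update: teacher <- m * teacher + (1 - m) * student,
    applied per parameter tensor; the teacher receives no gradients."""
    for name in teacher_params:
        teacher_params[name] = (momentum * teacher_params[name]
                                + (1.0 - momentum) * student_params[name])
    return teacher_params

# Toy parameters: the teacher drifts slowly toward the student.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
teacher = ema_update(teacher, student, momentum=0.9)
# teacher["w"] is now 0.1 everywhere: 0.9 * 0 + 0.1 * 1
```

A high momentum (the paper's line of work typically uses values near 0.99) keeps the teacher's pseudo-labels stable across iterations.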

In terms of loss optimization, labeled data is optimized with the standard cross-entropy loss L_s. For unlabeled data, the teacher first produces predictions, which are divided into reliable and unreliable pixels based on pixel-level entropy and then optimized with L_u and L_c, respectively.

Due to the long-tail problem in the dataset, using only a single batch of samples as negatives for contrastive learning may be very limited. Therefore, a memory bank is used to maintain a per-category negative-sample library, which stores gradient-detached features generated by the teacher and is maintained as a first-in-first-out queue.
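A minimal sketch of such a per-class FIFO memory bank follows. The capacity and class count here are illustrative; the real implementation stores teacher features detached from the computation graph:

```python
from collections import deque
import numpy as np

class ClassMemoryBank:
    """One fixed-size FIFO queue of feature vectors per class."""
    def __init__(self, num_classes, capacity):
        self.queues = [deque(maxlen=capacity) for _ in range(num_classes)]

    def push(self, cls, features):
        # features: iterable of 1-D vectors; deque(maxlen=...) evicts the
        # oldest entries automatically, giving first-in-first-out behavior.
        for f in features:
            self.queues[cls].append(f)

    def negatives(self, cls):
        """All stored negatives for a class, or None if the queue is empty."""
        return np.stack(self.queues[cls]) if self.queues[cls] else None

# Capacity 2: pushing three feature vectors evicts the oldest one.
bank = ClassMemoryBank(num_classes=3, capacity=2)
bank.push(0, [np.ones(4) * i for i in range(3)])
```

Keeping a separate fixed-size queue per category is what balances the number of negatives across head and tail classes.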
The overall objective combines the three losses:

$$\mathcal{L} = \mathcal{L}_s + \lambda_u \mathcal{L}_u + \lambda_c \mathcal{L}_c$$

where L_s and L_u are cross-entropy losses on labeled data and on reliably pseudo-labeled pixels, and λ_u, λ_c weight the unsupervised and contrastive terms.

L_c is the pixel-level InfoNCE loss, defined as:

$$\mathcal{L}_c = -\frac{1}{C \times M}\sum_{c=0}^{C-1}\sum_{i=1}^{M}\log\left[\frac{e^{\langle \mathbf{z}_{ci},\,\mathbf{z}_{ci}^{+}\rangle/\tau}}{e^{\langle \mathbf{z}_{ci},\,\mathbf{z}_{ci}^{+}\rangle/\tau}+\sum_{j=1}^{N} e^{\langle \mathbf{z}_{ci},\,\mathbf{z}_{cij}^{-}\rangle/\tau}}\right]$$

where C is the number of classes, M is the number of anchor pixels sampled per class, and N is the total number of negative samples per anchor.

z = g∘h(x) is the output of the representation head; z_ci denotes the representation of the i-th anchor of class c.

Each anchor pixel is paired with one positive sample and N negative samples, whose representations are z_ci^+ and z_cij^-, respectively.

⟨·,·⟩ is the cosine similarity between the features of two different pixels, restricted to the range [−1, 1] (the paper sets M = 50, N = 256, and τ = 0.5).

The self-training part needs no further explanation; the focus is on the contrastive learning term L_c, which is the classic InfoNCE loss.
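For a single anchor, the InfoNCE computation with cosine similarity and temperature reduces to a few lines. This sketch uses toy 2-D features for illustration; the paper computes it over M anchors per class and N queued negatives:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors, in [-1, 1]."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def info_nce(anchor, positive, negatives, tau=0.5):
    """Single-anchor InfoNCE:
    -log( e^{<z,z+>/tau} / (e^{<z,z+>/tau} + sum_j e^{<z,z_j->/tau}) )"""
    pos = np.exp(cosine(anchor, positive) / tau)
    neg = sum(np.exp(cosine(anchor, n) / tau) for n in negatives)
    return -np.log(pos / (pos + neg))

anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 0.1])                       # similar to the anchor
negatives = [np.array([-1.0, 0.0]), np.array([0.0, 1.0])]
loss = info_nce(anchor, positive, negatives)          # small positive value
```

Because the positive is nearly aligned with the anchor while the negatives are orthogonal or opposite, the loss here is small; pulling the positive closer or pushing negatives away reduces it further.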

Pseudo-Labeling

(The following explanation is from the author's post: https://zhuanlan.zhihu.com/p/474771549)
In brief: the per-pixel entropy of the teacher's prediction p_ij is $\mathcal{H}(\mathbf{p}_{ij}) = -\sum_{c=0}^{C-1} p_{ij}(c)\log p_{ij}(c)$. A pixel receives the pseudo-label argmax_c p_ij(c) only if its entropy is below a threshold γ_t; otherwise it is treated as unreliable. γ_t is set to the α_t quantile of the entropy values, and α_t decays linearly from its initial value α_0 during training (dynamic partition adjustment), so that more pixels are treated as reliable as the model improves.

Using Unreliable Pseudo-Labels


The last step is constructing the negative samples for each anchor pixel, which again splits into labeled and unlabeled cases. For labeled samples, we know exactly which category a pixel belongs to, so every category except its ground-truth label can serve as a negative category for that pixel. For unlabeled samples, since pseudo-labels may be wrong, we cannot be fully sure the label is correct; we therefore filter out the categories with the highest prediction probabilities and treat the pixel as a negative sample for the remaining categories. This part corresponds to Equations 13-16 in the paper.


O_ij = argsort(p_ij) gives the class ranking of the prediction p_ij; classes whose rank falls between the thresholds r_l and r_h are taken as negative classes for the pixel.
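The rank-based selection of negative classes can be sketched as follows. The toy r_l and r_h values below are illustrative; the paper uses r_l = 3 and r_h = 20 over 21 classes:

```python
import numpy as np

def negative_classes(p, r_l=1, r_h=3):
    """For an unreliable pixel with class probabilities p, treat classes whose
    probability rank is in (r_l, r_h] as negatives: the top r_l classes are
    skipped (one of them may be correct), and ranks beyond r_h are skipped
    (semantically unrelated to the anchor)."""
    order = np.argsort(p)[::-1]   # class indices, most probable first
    return order[r_l:r_h].tolist()

# Probabilities over 5 classes: the pixel hovers between classes 1 and 2.
p = np.array([0.05, 0.40, 0.35, 0.15, 0.05])
negs = negative_classes(p, r_l=1, r_h=3)
# top-1 class (index 1) is skipped; ranks 2..3 -> classes 2 and 3 are negatives
```

Note that with r_l = 1 the runner-up class (here class 2, probability 0.35) would be kept as a negative even though it may well be the true class, which is exactly the false-negative problem the ablation on r_l reveals.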

Supplementary knowledge

InfoNCE Loss

This family of loss functions is common in contrastive learning and mostly used in self-supervision. In practice it can be summarized as applying a softmax cross-entropy over cosine similarities within a batch of samples.

  • NCE

The core idea of **NCE (noise contrastive estimation)** is to convert a multi-class problem into a binary one: one class consists of data samples, the other of noise samples. By learning to tell data samples apart from noise samples ("noise contrastive"), the model discovers characteristics of the data. However, if all remaining data in the dataset were treated as negative (noise) samples, the computational cost would remain prohibitive even though the many-category problem is solved. The solution is to sample negatives when computing the loss; this is the meaning of "estimation": it is only an approximation. Generally, the more negative samples drawn, the closer the estimate gets to using the entire dataset, and the better the result.


Regarding temperature: the temperature coefficient controls how sharply the model discriminates between negative samples. A larger temperature lowers the discrimination among negatives, treating more of them roughly uniformly; a smaller temperature τ sharpens the distinction between positives and negatives and concentrates on the particularly hard negatives. In general a smaller τ helps because it focuses on hard negatives, but it should not be too small: since the data are unlabeled, some "negatives" may actually be latent positives, and an overly small temperature would wrongly push those nearby latent positives away.
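The temperature's effect is easy to see numerically. The similarity values below are made up for illustration: one hard negative (similarity 0.8, close to the anchor) and one easy negative (0.1):

```python
import numpy as np

def softmax(x, tau):
    """Temperature-scaled softmax; subtracting the max is for numerical stability."""
    e = np.exp((x - x.max()) / tau)
    return e / e.sum()

# similarities to the anchor: [positive, hard negative, easy negative]
sims = np.array([0.9, 0.8, 0.1])

sharp = softmax(sims, tau=0.1)   # small tau: weight concentrates on top entries
smooth = softmax(sims, tau=1.0)  # large tau: weights are nearly uniform
```

With τ = 0.1 the hard negative receives orders of magnitude more weight than the easy one, so the gradient focuses on it; with τ = 1.0 the two negatives are treated almost alike.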

OHEM

OHEM (Online Hard Example Mining), from the paper "Training Region-based Object Detectors with Online Hard Example Mining".

Its advantages:
1. Class imbalance need not be handled by manually setting a positive/negative sample ratio; the online selection is more targeted.
2. As the dataset grows, the algorithm improves further on top of the baseline.
When facing a small dataset with few positive proposals in object detection, the OHEM trick is worth trying.

experiment

  • Dataset: PASCAL VOC 2012 (train, val), SBD (additional training data), Cityscapes
  • backbone: ResNet-101 pretrained on ImageNet
  • decoder: DeepLabv3+

Both the segmentation head and the representation head consist of two Conv-BN-ReLU blocks; both blocks preserve the feature-map resolution, and the first halves the number of channels. The segmentation head can be regarded as a pixel-level classifier mapping the 512-dimensional features output by the ASPP module to the C classes. The representation head maps the same features into a 256-dimensional representation space.

Comparison with Existing Alternatives

All experimental results in this article are based on the network structure of ResNet-101 + Deeplab v3+. For the data set composition and evaluation method used, please refer to the paper description.

We compare with existing methods on three benchmarks: classic VOC, blender VOC, and Cityscapes, and achieve the best accuracy on both PASCAL VOC datasets. On Cityscapes, because we did not solve the long-tail problem well, we lag behind AEL (NeurIPS'21), which is dedicated to class imbalance; however, stacking U2PL on top of AEL surpasses AEL itself, which demonstrates the generality of U2PL. It is worth noting that U2PL is especially strong under splits with less labeled data.

Table 1. Comparison with state-of-the-art methods on the classic PASCAL VOC 2012 val set under different partition settings. Labeled images are selected from the original VOC train set, which consists of 1,464 samples in total. The fractions denote the percentage of labeled data used for training, followed by the actual number of images. All images from SBD [18] are treated as unlabeled data. "SupOnly" stands for supervised training without any unlabeled data. † means we reproduced the method.


Table 2. Comparison with state-of-the-art methods on the blender PASCAL VOC 2012 val set under different partition settings. All labeled images are selected from the augmented VOC train set, which consists of 10,582 samples in total. "SupOnly" stands for supervised training without any unlabeled data. † means we reproduced the method.


Table 3. Comparison with state-of-the-art methods on Cityscapes under different partition protocols. All labeled images are selected from the Cityscapes train set, which includes 2,975 samples in total. "SupOnly" stands for supervised training without any unlabeled data. † means we reproduced the method.

Ablation

Effectiveness of Using Unreliable Pseudo-Labels

We demonstrate the value of using unreliable pseudo-labels on multiple partitions of multiple datasets, including PASCAL VOC and Cityscapes.


Table 4. Ablation study on using pseudo-labels of varying reliability, measured by the entropy of pixel-wise predictions (see Section 3.3). "Unreliable" means selecting negative-sample candidates from the pixels with the top 20% entropy scores. "Reliable" means the bottom 20%. "All" means sampling regardless of entropy.


Effectiveness of the probability rank thresholds, i.e., r_l and r_h in the formula above: r_l = 3 and r_h = 20 outperform the other options by a clear margin. When r_l = 1, false-negative candidates are not filtered out, so features of the same class are wrongly pushed apart by L_c. When r_l = 10, the negative candidates tend to be semantically unrelated to the corresponding anchor pixels, making the discrimination less informative.


Table 5. Ablation study on the probability rank thresholds, as described in Section 3.3.


Table 6. Ablation study on the effectiveness of various components in U2PL: the unsupervised loss L_u, contrastive loss L_c, category memory bank Q_c, dynamic partition adjustment (DPA), probability rank threshold (PRT), and high-entropy filtering (unreliable).


Table 7. Ablation study of α_0 in Eq. (7), which controls the initial proportion between reliable and unreliable pixels.

Alternative of Contrastive Learning

We added a comparative experiment that utilizes unreliable samples through binary classification, showing that exploiting low-quality pseudo-labels is not tied to contrastive learning: as long as low-quality samples are used well, even a binary-classification approach achieves a good accuracy gain.

Summarize

  • Conclusion

We propose a semi-supervised semantic segmentation framework, U2PL, which outperforms many existing methods by incorporating unreliable pseudo-labels into training, indicating that our framework provides a new, promising paradigm for semi-supervised learning research. Our ablation experiments show that the insight of this work is quite solid, and qualitative results provide intuitive evidence of its effectiveness, especially the better performance on boundaries between semantic objects and other ambiguous regions. Compared with fully supervised methods, the training of our method is time-consuming [5, 6, 29, 35, 46], which is a common drawback of semi-supervised learning tasks [9, 20, 21, 33, 43, 48]. Due to the extreme lack of labels, semi-supervised frameworks usually trade training time for higher accuracy; their training efficiency could be explored more deeply in the future.

  • Visualization


Figure 4. Qualitative results on the PASCAL VOC 2012 val set. All models are trained under the 1/4 partition protocol of the blender set, which contains 2,466 labeled images and 7,396 unlabeled images. (a) Input image. (b) Ground-truth annotations of the corresponding image. (c) Training with labeled images only, without any unlabeled data. (d) Vanilla contrastive learning framework in which all pixels serve as negative samples without entropy filtering. (e) Predictions of our U2PL. The yellow rectangles highlight segmentation results improved by fully exploiting unreliable pseudo-labels.

appendix

Appendix A: more details of the reproduced results
Appendix B: more results on Cityscapes, from two perspectives
Appendix C: an alternative to contrastive learning, demonstrating that our main insight does not rely solely on contrastive learning
Appendix D: ablation studies with more hyperparameters on PASCAL VOC 2012 and Cityscapes
Appendix E: feature-space visualizations, providing visual evidence of the effectiveness of U2PL


Table A1. Summary of hyperparameters used in U2PL.


Table A2. Ablation study using pseudo-labels of varying reliability, measured by the entropy of pixel-wise predictions. "Unreliable" means selecting negative candidates from the pixels with the top 20% entropy scores. "Reliable" means the bottom 20%. "All" means sampling regardless of entropy. We demonstrate the effectiveness under the 1/2 and 1/4 partition protocols on the Cityscapes val set.

U2PL is not restricted to contrastive learning. Binary classification is also a sufficient way to use unreliable pseudo-labels, i.e., replacing the contrastive loss with a binary cross-entropy (BCE) loss L_b. For the i-th anchor z_ci belonging to class c, we simply use its negative samples {z_cij^-}, j = 1, ..., N, and positive sample z_ci^+ to compute a BCE loss.
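A schematic sketch of this BCE alternative is below. The exact formulation is given in the paper's appendix; the sigmoid-on-cosine form here is an assumption for illustration, classifying the anchor-positive pair as 1 and each anchor-negative pair as 0:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def bce_contrast(anchor, positive, negatives, tau=0.5):
    """Binary cross-entropy over similarity scores: the positive pair is
    labeled 1, each anchor/negative pair is labeled 0."""
    loss = -np.log(sigmoid(cosine(anchor, positive) / tau) + 1e-12)
    for n in negatives:
        loss += -np.log(1.0 - sigmoid(cosine(anchor, n) / tau) + 1e-12)
    return loss / (1 + len(negatives))

# Toy 2-D features: aligned positive, opposite negative -> small loss.
anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 0.1])
negatives = [np.array([-1.0, 0.0])]
loss = bce_contrast(anchor, positive, negatives)
```

Compared with InfoNCE, each pair is judged independently rather than normalized against all negatives jointly, which is why the paper presents it as a simpler but still effective way to consume unreliable pixels.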


Table A3. Using unreliable pseudo-labels based on binary classification on Cityscapes val sets under different partitioning protocols.


Table A4. Using unreliable pseudo-labels based on binary classification on the PASCAL VOC 2012 val set under different splits.

  • More ablation experiments

Table A5: ablation on the learning rate; Table A6: ablation on the temperature coefficient.

Tables A7/A8: study of the probability rank thresholds and α_0 on the Cityscapes dataset.


The difference between U2PL and negative learning

The negative samples selected by negative learning are still reliable, high-confidence samples. In contrast, U2PL advocates making full use of unreliable samples instead of filtering them out.
For example, a prediction p = [0.3, 0.3, 0.2, 0.1, 0.1]^T would be discarded by negative learning because of its uncertainty, but in U2PL it can serve as a negative sample for several unlikely classes. Experiments also show that negative learning achieves lower accuracy than U2PL.

U2PL Technology Blueprint

The technical blueprint is posted here to help readers understand the core story and experimental design of the paper.


Origin blog.csdn.net/Transfattyacids/article/details/134973332