A New Paradigm for Semantic Segmentation: Pixel-Wise Contrastive Learning

Code: https://github.com/tfzhou/ContrastiveSeg
Paper: https://arxiv.org/pdf/2101.11939.pdf

Preface

Currently, the essence of semantic segmentation algorithms is to map image pixels into a highly non-linear feature space through deep neural networks. However, most existing algorithms focus only on local context (the positional and semantic dependencies between pixels within a single image) while ignoring the global context of the training data set (the semantic correlations between pixels across images). This makes it difficult to constrain the learned feature space from a holistic perspective, which limits the performance of semantic segmentation models.

Recently, researchers from ETH Zurich and SenseTime Research proposed a new fully supervised training paradigm for semantic segmentation: pixel-wise contrastive learning. It emphasizes exploiting cross-image pixel-to-pixel relations across the training set to learn a well-structured feature space, replacing the traditional image-wise training paradigm.

This training strategy can be applied directly to mainstream semantic segmentation models and introduces no additional computational overhead at inference time. The figure below shows the performance of mainstream segmentation methods on the Cityscapes validation set; introducing pixel-wise contrastive learning on top of DeepLabV3, HRNet, and OCR yields significant performance improvements.
[Figure: segmentation performance on the Cityscapes validation set, with and without pixel-wise contrastive learning, for DeepLabV3, HRNet, and OCR]

What problems are currently ignored in the field of semantic segmentation?

Image semantic segmentation aims to predict a semantic label for each pixel in an image and is a core problem in computer vision. Since the introduction of the Fully Convolutional Network (FCN) [1], mainstream semantic segmentation algorithms have emphasized intra-image context, mainly from two directions: 1) proposing different context aggregation modules, such as dilated convolution, spatial pyramid pooling, encoder-decoder architectures, and non-local attention, whose core idea is to use additional model parameters or special operations to model and extract the context inside an image; 2) noting that traditional algorithms treat semantic segmentation as a pixel-level classification task and compute the cross-entropy loss independently for each pixel, completely ignoring pixel-to-pixel dependencies, some researchers have proposed structure-aware loss functions, such as pixel affinity loss [2] and Lovász loss [3], which directly constrain the overall structure of the segmentation result in the training objective.

However, the above works focus only on context within a single image and ignore cross-image, global context: within the training set, pixels from different images are also strongly correlated, as shown in figure (b) below, where pixels of the same color share the same semantics.
[Figure: (b) cross-image pixel relations, where same-colored pixels share the same semantics; (c) pixels mapped into the segmentation feature space]
Furthermore, the essence of current semantic segmentation algorithms is to map image pixels into a highly non-linear feature space through a deep neural network (as shown in figure (c) above). In this process, context aggregation modules and structured loss functions emphasize only dependencies between local pixels, ignoring an essential question: what does an ideal semantic segmentation feature space look like?

The researchers argue that a good segmentation feature space should satisfy two properties simultaneously:

  • Strong discriminative ability: in this feature space, the embedding of each individual pixel should be easy to classify correctly;
  • High structure: features of same-class pixels should be highly compact (intra-class compactness), while features of different-class pixels should be as dispersed as possible (inter-class dispersion).

However, current semantic segmentation methods generally focus only on property 1 and ignore property 2. In addition, many representation learning works [4, 5] have verified that emphasizing property 2 helps strengthen property 1. The researchers therefore conjectured that, although current semantic segmentation algorithms already achieve excellent performance, considering properties 1 and 2 together can yield a better-structured segmentation feature space and further improve performance.
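To make the two properties concrete, here is a small hypothetical diagnostic (not from the paper) that measures intra-class compactness and inter-class dispersion on a batch of pixel embeddings:

```python
import torch

def compactness_and_dispersion(emb, labels):
    """Rough diagnostics for properties 1/2 on a batch of pixel embeddings.

    emb:    (N, D) pixel feature vectors
    labels: (N,)   integer semantic labels (assumes >= 2 classes present)
    """
    classes = labels.unique()
    # Per-class mean embeddings (class centroids).
    centroids = torch.stack([emb[labels == c].mean(dim=0) for c in classes])
    # Intra-class compactness: mean distance of pixels to their own
    # centroid (smaller = more compact).
    intra = torch.stack([
        (emb[labels == c] - centroids[k]).norm(dim=1).mean()
        for k, c in enumerate(classes)
    ]).mean()
    # Inter-class dispersion: mean pairwise distance between centroids
    # (larger = better separated).
    inter = torch.pdist(centroids).mean()
    return intra.item(), inter.item()
```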

Inspiration from unsupervised contrastive learning

In recent years, unsupervised learning has seen tremendous progress, driven by the successful application of contrastive learning [6, 7] to large amounts of unlabeled training data. Let $\boldsymbol{v}$ be the feature vector of an unlabeled training image $I$, let $\boldsymbol{v}^+$ be the feature of a positive sample of $I$, typically obtained by applying some transformation to $I$ (such as flipping or cropping), and let $\boldsymbol{v}^-$ denote negative-sample features, taken from other images in the training set. Unsupervised training then minimizes a contrastive loss such as the InfoNCE loss [8]:

$$\mathcal{L}^{\mathrm{NCE}} = -\log \frac{\exp(\boldsymbol{v} \cdot \boldsymbol{v}^+ / \tau)}{\exp(\boldsymbol{v} \cdot \boldsymbol{v}^+ / \tau) + \sum_{\boldsymbol{v}^-} \exp(\boldsymbol{v} \cdot \boldsymbol{v}^- / \tau)}$$

whose goal is to identify the positive sample from among a large number of negatives ($\tau$ is a temperature hyperparameter). The image features obtained by such unsupervised training show strong generalization ability: they provide excellent initialization weights for downstream tasks, and after fine-tuning on only a small number of labeled samples they can approach the performance of fully supervised image classification models.
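As a concrete illustration, here is a minimal PyTorch sketch of the InfoNCE loss above; the L2 normalization and temperature value are common conventions, not specifics of this paper:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss for a single anchor.

    anchor:    (D,)    feature of image I
    positive:  (D,)    feature of a transformed view of I
    negatives: (K, D)  features of other images in the training set
    """
    # Cosine similarities (features are L2-normalized first).
    anchor = F.normalize(anchor, dim=0)
    positive = F.normalize(positive, dim=0)
    negatives = F.normalize(negatives, dim=1)
    pos_logit = (anchor @ positive) / temperature             # scalar
    neg_logits = (negatives @ anchor) / temperature           # (K,)
    logits = torch.cat([pos_logit.unsqueeze(0), neg_logits])  # (1+K,)
    # -log( exp(pos) / (exp(pos) + sum_k exp(neg_k)) )
    return -F.log_softmax(logits, dim=0)[0]
```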

The success of unsupervised contrastive learning is also inspiring here. Contrastive learning belongs to metric learning; its essence is to use the information of the whole data set to learn a highly expressive representation space. Under fully supervised training for semantic segmentation, the label of every pixel of every training image is given. We can therefore treat pixels of the same semantic class as positive samples and pixels of different semantic classes as negative samples, regardless of whether they come from the same training image. Metric or contrastive learning can then be used to complement the traditional cross-entropy loss, mining the global semantic relations between pixels across all training images and producing a highly structured segmentation feature space, thereby emphasizing properties 1 and 2 simultaneously. Based on this, the researchers proposed a fully supervised semantic segmentation training paradigm, pixel-wise contrastive learning, which emphasizes using the global context of the training data set to explicitly constrain the feature space from a holistic perspective, giving it good global structure (intra-class compactness and inter-class dispersion).

As shown in figure (d) above, given a pixel $i$ (also called the anchor) in a training sample, the method compares $i$ with other pixels in the segmentation feature space, pulling $i$ closer to pixels of the same class (positive samples) and pushing it as far as possible from pixels of other classes (negative samples). This training paradigm thus takes into account the global semantic similarity of all pixels in the entire training set, letting the model learn representations from more diverse, larger-scale samples and obtain a better semantic feature space (as shown in figure (e)).
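To make the cross-image aspect concrete, the following hypothetical sketch pools positives and negatives for one anchor from two images under full supervision (a real implementation would typically aggregate embeddings over many more images, e.g., with a memory mechanism):

```python
import torch

def sample_pixel_pairs(emb_a, lab_a, emb_b, lab_b, anchor_idx, max_neg=1024):
    """Pool positives/negatives for one anchor pixel from two training images.

    emb_a, emb_b: (N, D) pixel embeddings flattened from images A and B
    lab_a, lab_b: (N,)   their per-pixel ground-truth labels
    anchor_idx:   index of the anchor pixel inside image A
    """
    anchor = emb_a[anchor_idx]
    anchor_lab = lab_a[anchor_idx]
    # Under full supervision the labels are known, so positives/negatives
    # may come from ANY image, not just the anchor's own.
    emb = torch.cat([emb_a, emb_b], dim=0)
    lab = torch.cat([lab_a, lab_b], dim=0)
    keep = torch.ones(emb.shape[0], dtype=torch.bool)
    keep[anchor_idx] = False  # exclude the anchor itself from its positives
    positives = emb[(lab == anchor_lab) & keep]
    negatives = emb[lab != anchor_lab]
    # Subsample negatives to keep the loss computation tractable.
    if negatives.shape[0] > max_neg:
        negatives = negatives[torch.randperm(negatives.shape[0])[:max_neg]]
    return anchor, positives, negatives
```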

What is wrong with the classic pixel-wise cross-entropy loss for semantic segmentation?

Below, the classic pixel-wise cross-entropy loss in semantic segmentation is taken as the starting point to discuss why metric or contrastive learning is needed in semantic segmentation training.

As mentioned earlier, current semantic segmentation algorithms treat the task as a pixel-by-pixel classification problem, that is, predicting a semantic label $c$ for each pixel $i$ in the image. The pixel-wise cross-entropy is therefore used as the training objective:

$$\mathcal{L}^{\mathrm{CE}}_i = -\mathbf{1}_c^\top \log\big(\mathrm{softmax}(\boldsymbol{y})\big)$$

where $\boldsymbol{y}$ is the unnormalized categorical score vector (the logits) for pixel $i$ produced by the FCN, $c$ is the true label of pixel $i$, and $\mathbf{1}_c$ is its one-hot encoding.
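For reference, this is the standard dense cross-entropy in PyTorch, shown here on dummy tensors (the shapes and the 19-class setting are illustrative):

```python
import torch
import torch.nn.functional as F

# Dense logits from an FCN-style segmentation head: (batch, classes, H, W)
logits = torch.randn(2, 19, 128, 128)         # e.g., 19 Cityscapes classes
target = torch.randint(0, 19, (2, 128, 128))  # per-pixel ground-truth labels

# F.cross_entropy applies log-softmax over the class dimension and averages
# the per-pixel losses -- each pixel is constrained independently.
loss = F.cross_entropy(logits, target)
```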

However, this optimization objective has two shortcomings:

  • It constrains the prediction of each pixel independently, ignoring the relations between pixels.
  • Because of the softmax operation, the cross-entropy computation depends only on the relative relationship among the logits and cannot directly supervise the learned pixel representations.

Although some recent structured loss functions (such as pixel affinity loss and Lovász loss) address shortcoming 1, they only consider pixel dependencies within the same image and ignore the semantic consistency of pixels across different images. Shortcoming 2 is rarely mentioned in the semantic segmentation literature at all.

A fully supervised, pixel-to-pixel contrastive training paradigm for semantic segmentation

The pixel-wise contrastive learning proposed in this paper addresses both shortcomings of the cross-entropy loss. During training, for any pixel (anchor) $i$, its positive samples are other pixels of the same class, and its negative samples are pixels of other classes. Notably, the positive and negative samples of anchor $i$ are not restricted to the same image. For pixel $i$, the contrastive loss is defined as:

$$\mathcal{L}^{\mathrm{NCE}}_i = \frac{1}{|\mathcal{P}_i|} \sum_{\boldsymbol{i}^+ \in \mathcal{P}_i} -\log \frac{\exp(\boldsymbol{i} \cdot \boldsymbol{i}^+ / \tau)}{\exp(\boldsymbol{i} \cdot \boldsymbol{i}^+ / \tau) + \sum_{\boldsymbol{i}^- \in \mathcal{N}_i} \exp(\boldsymbol{i} \cdot \boldsymbol{i}^- / \tau)}$$

where $\mathcal{P}_i$ denotes the features of all positive-sample pixels of pixel $i$ and $\mathcal{N}_i$ the features of all negative-sample pixels. As the formula shows, pixel-to-pixel contrastive learning directly pulls together pixels of the same semantic class in the segmentation feature space while pushing pixels of different semantic classes apart, thereby addressing both shortcomings of the cross-entropy loss at once.
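A minimal PyTorch sketch of this loss, assuming positives and negatives have already been gathered (e.g., with the sampling sketch earlier; the normalization and temperature value are common defaults, not necessarily the paper's):

```python
import torch
import torch.nn.functional as F

def pixel_contrast_loss(anchor, positives, negatives, temperature=0.1):
    """Pixel-wise contrastive loss for one anchor, in the spirit of the
    formula above: the InfoNCE term averaged over all positives.

    anchor:    (D,)    embedding of anchor pixel i
    positives: (P, D)  embeddings of same-class pixels (from any image)
    negatives: (K, D)  embeddings of different-class pixels (from any image)
    """
    anchor = F.normalize(anchor, dim=0)
    positives = F.normalize(positives, dim=1)
    negatives = F.normalize(negatives, dim=1)
    pos = (positives @ anchor) / temperature  # (P,) similarities i . i+
    neg = (negatives @ anchor) / temperature  # (K,) similarities i . i-
    # Per positive: -log( exp(pos) / (exp(pos) + sum_k exp(neg_k)) )
    denom = torch.logsumexp(
        torch.cat([pos.unsqueeze(1),
                   neg.unsqueeze(0).expand(pos.shape[0], -1)], dim=1),
        dim=1,
    )
    return (denom - pos).mean()
```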

The final semantic segmentation loss is defined as:

$$\mathcal{L} = \mathcal{L}^{\mathrm{CE}} + \lambda\, \mathcal{L}^{\mathrm{NCE}}$$

where $\lambda$ balances the two terms. The cross-entropy loss pushes the segmentation model to learn discriminative features and improves classification ability (emphasizing property 1), while the pixel-wise contrastive loss exploits the global semantic relations between pixels to constrain the segmentation feature space as a whole (emphasizing property 2).
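Putting the sketches together, the mixed objective could be wired up as follows; `lam` is a placeholder balancing weight, not the paper's setting, and `pixel_contrast_loss` is the sketch defined above:

```python
import torch.nn.functional as F

def total_loss(logits, target, anchor, positives, negatives, lam=0.1):
    """Mixed objective: dense cross-entropy plus the contrastive term."""
    return F.cross_entropy(logits, target) + lam * pixel_contrast_loss(
        anchor, positives, negatives
    )
```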

The figure below visualizes segmentation features learned with only the cross-entropy loss (left) and with the mixed loss above (right). With the pixel-wise contrastive loss, features of same-class pixels are more compact and different classes are better separated. This shows that by combining the advantages of the unary cross-entropy loss and the pair-wise contrastive loss, the segmentation network learns better feature representations.

[Figure: learned segmentation features with the cross-entropy loss only (left) vs. the mixed loss (right)]

Further Discussion

Unlike current mainstream algorithms, which focus only on the local context of pixels within an image, this paper proposes a cross-image, pixel-to-pixel contrastive loss that mines the global relations of all pixels in the training data set, effectively improving semantic segmentation performance. This invites a rethink of the current mainstream training paradigm: we should attend not only to the characteristics of individual training samples but also, from a global perspective, to the relations between training samples.

This work also offers some useful insights, for example:

  • Contrastive or metric learning depends on the quality of positive and negative samples; smarter sampling strategies can help the segmentation network learn faster and more effectively.
  • From a metric learning perspective, the cross-entropy loss is a unary loss and the contrastive loss is a pair-wise loss; exploring higher-order metric losses may bring further gains.
  • Because the contrastive loss samples positive and negative pixels during computation, it may achieve class rebalancing during training more naturally (a hypothetical sampling sketch follows this list).
  • The approach achieves effective performance gains on mainstream semantic segmentation data sets and is expected to benefit other dense prediction tasks (such as 2D human pose estimation and medical image segmentation).
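As an illustration of the class-rebalancing point above, a hypothetical per-class sampling routine (not taken from the paper's code) might look like this:

```python
import torch

def class_balanced_sample(emb, lab, per_class=256):
    """Draw at most `per_class` pixel embeddings per semantic class, so rare
    classes contribute as many anchors/positives to the contrastive loss as
    frequent ones.

    emb: (N, D) pixel embeddings; lab: (N,) integer labels
    """
    keep = []
    for c in lab.unique():
        idx = (lab == c).nonzero(as_tuple=True)[0]
        # Random subset of this class's pixels, capped at per_class.
        keep.append(idx[torch.randperm(idx.numel())[:per_class]])
    keep = torch.cat(keep)
    return emb[keep], lab[keep]
```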
