paper:Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation
github:https://github.com/HRNet/HRNet-Semantic-Segmentation/tree/HRNet-OCR
In the semantic segmentation task, the category of a pixel is the category of the object it belongs to. Can we exploit the relationship between a pixel and the object it belongs to? OCRNet proposes an efficient method that improves pixel representations using the contextual information of the object each pixel belongs to. (The "representation" used throughout this article can be understood informally as the embedding corresponding to each pixel.)
The method consists of 3 steps:
(1) Generate soft object regions according to the number of categories (these can be understood as coarse segmentation results);
(2) Estimate each object region representation by aggregating the representations of all pixels belonging to that object;
(3) Compute the object-contextual representation from the relation between each pixel and each object region, and use it to augment the pixel representation.
1. Calculation formula
The transform functions appearing in the formulas below are each implemented as 1×1 conv → BN → ReLU.
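As a minimal sketch, this transform (1×1 conv → BN → ReLU) can be written as a small PyTorch helper; the function name `conv_bn_relu` and the channel sizes are illustrative, not from the paper:

```python
import torch
import torch.nn as nn

# Sketch of the transform function used throughout the OCR formulas:
# a 1x1 convolution followed by BatchNorm and ReLU.
def conv_bn_relu(in_channels, out_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )

transform = conv_bn_relu(64, 64)
x = torch.randn(2, 64, 32, 32)
out = transform(x)   # same spatial size; channels mapped 64 -> 64
```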
1、Soft object regions
Divide the input image I into K soft object regions.
Put simply: if the input feature map is [N, C, H, W], then the soft object regions have dimension [N, num_classes, H, W], where each element indicates the degree to which a pixel belongs to a given category (in fact, this is the output of a semantic segmentation head before softmax).
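A minimal sketch of this step: the soft object regions come from an auxiliary segmentation head (here simplified to a single 1×1 convolution; the variable names are illustrative), with no softmax applied yet:

```python
import torch
import torch.nn as nn

# Sketch of step (1): an auxiliary segmentation head produces per-pixel
# class scores -- the "soft object regions" -- of shape [N, num_classes, H, W].
N, C, H, W, num_classes = 2, 64, 32, 32, 19
pixels = torch.randn(N, C, H, W)                      # pixel representations
aux_head = nn.Conv2d(C, num_classes, kernel_size=1)   # simplified aux head
soft_object_regions = aux_head(pixels)                # [N, num_classes, H, W]
```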
2、Object region representations
First, the paper gives the formula:

$$f_k = \sum_{i} \tilde{m}_{ki}\, x_i$$

At first glance $\tilde{m}_{ki}$ confused me, so I went straight to the source code: it is the probability value output by a softmax over the spatial dimension of the soft object regions.
Put simply: for all pixels of the same object, one embedding is learned to represent them. If the input X has dimension [N, C, H*W] and the soft object regions have dimension [N, num_classes, H*W], then, since one object covers many pixels and a single unified representation is needed, the learned object region representations have dimension [N, num_classes, C].
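The aggregation above reduces to a softmax followed by a batched matrix product; a sketch under the shapes stated in the text (variable names are illustrative):

```python
import torch
import torch.nn.functional as F

# Sketch of step (2): aggregate pixel representations into one representation
# per object region, weighting each pixel by the spatial softmax of its
# soft-object-region score.
N, C, HW, num_classes = 2, 64, 1024, 19
pixels = torch.randn(N, C, HW)                # X: [N, C, H*W]
regions = torch.randn(N, num_classes, HW)     # soft object regions, flattened

weights = F.softmax(regions, dim=2)           # softmax over the spatial dim
# [N, num_classes, H*W] x [N, H*W, C] -> [N, num_classes, C]
region_repr = torch.matmul(weights, pixels.transpose(1, 2))
```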
3、Object contextual representations
First calculate the relation between each pixel representation and each object region representation ($w_{ik}$ denotes the relation between the i-th pixel representation and the k-th object region representation):

$$w_{ik} = \frac{e^{\kappa(x_i, f_k)}}{\sum_{j} e^{\kappa(x_i, f_j)}}$$

where $\kappa(x, f) = \phi(x)^{\top} \psi(f)$, and $\phi$, $\psi$ are the 1×1 conv → BN → ReLU transforms.

The object contextual representation is then computed with:

$$y_i = \rho\Big(\sum_{k} w_{ik}\, \delta(f_k)\Big)$$

The calculation here is the same as Attention: $\phi(x_i)$ can be regarded as q with dimension [N, H*W, C], $\psi(f_k)$ as k with dimension [N, C, num_classes], $\delta(f_k)$ as v with dimension [N, num_classes, C], and the dimension of y is [N, H*W, C].
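The attention view above can be sketched directly with two batched matrix products; the 1×1 conv-BN-ReLU transforms on q/k/v are omitted here for clarity, and the 1/√C scaling is an assumption borrowed from common attention implementations rather than from the paper's formula:

```python
import torch
import torch.nn.functional as F

# Sketch of step (3) as attention: pixel representations act as queries,
# object region representations as keys and values.
N, C, HW, K = 2, 64, 1024, 19
q = torch.randn(N, HW, C)    # pixel representations,  [N, H*W, C]
k = torch.randn(N, C, K)     # region representations, [N, C, num_classes]
v = torch.randn(N, K, C)     # region representations, [N, num_classes, C]

# relation between every pixel and every region: [N, H*W, num_classes]
relation = F.softmax(torch.matmul(q, k) / C ** 0.5, dim=2)
y = torch.matmul(relation, v)   # object contextual representation, [N, H*W, C]
```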
4、Augmented representations
The object contextual representation and the pixel representation are concatenated and transformed to obtain the augmented pixel representation, $z_i = g([x_i^{\top}, y_i^{\top}]^{\top})$, where $g$ is again a 1×1 conv → BN → ReLU.
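A sketch of this fusion step, concatenating along the channel dimension and reducing back to C channels with a 1×1 conv (names and channel sizes are illustrative):

```python
import torch
import torch.nn as nn

# Sketch of step (4): concatenate the object contextual representation y
# with the pixel representation x along channels, then fuse with 1x1 conv-BN-ReLU.
N, C, H, W = 2, 64, 32, 32
x = torch.randn(N, C, H, W)   # pixel representations
y = torch.randn(N, C, H, W)   # object contextual representations
fuse = nn.Sequential(
    nn.Conv2d(2 * C, C, kernel_size=1, bias=False),
    nn.BatchNorm2d(C),
    nn.ReLU(inplace=True),
)
augmented = torch.cat([x, y], dim=1)   # [N, 2C, H, W]
z = fuse(augmented)                    # fused representation, [N, C, H, W]
```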
2. Network structure
The network structure of OCRNet is shown in the figure.
Let the dimension of the Pixel Representations be [N, C, H, W] and the dimension of the Soft Object Regions be [N, num_classes, H, W]. Then:
Object Region Representations: [N, num_classes, C]
Pixel Region Relation: [N, num_classes, H, W]
Object Contextual Representation: [N, C, H, W] (the blue box in the figure, excluding the Pixel Representations part)
Augmented Representations: [N, 2C, H, W]
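The whole shape walkthrough above can be tied together in one minimal module. This is a simplified sketch, not the official HRNet-OCR implementation: the 1×1 conv-BN-ReLU transforms on queries, keys, and values are omitted so the intermediate shapes match the text exactly, and the 1/√C scaling is an assumed detail:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# End-to-end sketch of the OCR pipeline following the shape walkthrough above.
class SimpleOCR(nn.Module):
    def __init__(self, channels, num_classes):
        super().__init__()
        # auxiliary head producing the soft object regions
        self.aux_head = nn.Conv2d(channels, num_classes, kernel_size=1)
        # fusion of [x; y] back down to C channels
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):                            # x: [N, C, H, W]
        n, c, h, w = x.shape
        regions = self.aux_head(x)                   # [N, K, H, W]
        pixels = x.view(n, c, h * w)                 # [N, C, H*W]
        # object region representations: [N, K, C]
        probs = F.softmax(regions.view(n, -1, h * w), dim=2)
        f = torch.matmul(probs, pixels.transpose(1, 2))
        # pixel-region relation: [N, H*W, K]
        rel = F.softmax(
            torch.matmul(pixels.transpose(1, 2), f.transpose(1, 2)) / c ** 0.5,
            dim=2,
        )
        y = torch.matmul(rel, f)                     # [N, H*W, C]
        y = y.transpose(1, 2).view(n, c, h, w)       # [N, C, H, W]
        return self.fuse(torch.cat([x, y], dim=1))   # [N, C, H, W]

ocr = SimpleOCR(channels=64, num_classes=19)
out = ocr(torch.randn(2, 64, 32, 32))
```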
3. Experimental results
The performance of OCRNet on each dataset is as follows.
4. Summary
In semantic segmentation, the embeddings of all pixels belonging to the same object should be similar; OCRNet exploits this relationship explicitly through an attention mechanism to improve segmentation performance.