[Paper Notes] OCRNet Paper Reading Notes

paper: Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation

github: https://github.com/HRNet/HRNet-Semantic-Segmentation/tree/HRNet-OCR

In semantic segmentation, the class of a pixel is the class of the object that the pixel belongs to. Can we exploit the relationship between a pixel and the object it belongs to? OCRNet proposes an efficient way to strengthen each pixel's representation using the context of the object that pixel belongs to. (The "representation" mentioned in these notes can be understood simply as the embedding of each pixel.)

The method is divided into 3 steps:

(1) Generate soft object regions, one per category (these can be understood as a coarse segmentation result);

(2) Estimate each object region representation by aggregating the representations of the pixels that belong to that object;

(3) Compute the object-contextual representation from the relation between each pixel and each object region, and use it to augment the pixel representation.


Contents

1. Calculation formula

1、Soft object regions

2、Object region representations

3、Object contextual representations

4、Augmented representations

2. Network structure

3. Experimental results

4. Summary


1. Calculation formula

The transform functions \delta(), \rho(), \phi() and \psi() in the formulas are each implemented as 1x1 conv -> BN -> ReLU.
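To make the "1x1 conv -> BN -> ReLU" pattern concrete, here is a minimal NumPy sketch. It is not the repository's code: the function name `conv1x1_bn_relu`, the inference-style batch norm with fixed statistics, and all shapes are my own assumptions for illustration.

```python
import numpy as np

def conv1x1_bn_relu(x, weight, gamma, beta, mean, var, eps=1e-5):
    """x: [N, C_in, H, W]; weight: [C_out, C_in]; BN parameters: [C_out]."""
    # A 1x1 convolution applies the same linear map at every pixel.
    y = np.einsum("oc,nchw->nohw", weight, x)
    # Batch norm in inference form, using fixed per-channel statistics (assumed).
    scale = gamma / np.sqrt(var + eps)
    shift = beta - mean * scale
    y = y * scale[None, :, None, None] + shift[None, :, None, None]
    # ReLU clamps negative activations to zero.
    return np.maximum(y, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 4, 5))   # [N, C_in, H, W], illustrative sizes
w = rng.standard_normal((16, 8))        # [C_out, C_in]
out = conv1x1_bn_relu(x, w, np.ones(16), np.zeros(16), np.zeros(16), np.ones(16))
print(out.shape)  # (2, 16, 4, 5)
```

The spatial size is unchanged; only the channel dimension is remapped, which is why these transforms can be dropped in front of the q, k, v products later on.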

1、Soft object regions

Divide the input image I into K soft object regions \left\{ M_{1}, M_{2}, \ldots, M_{K} \right\}.

Simply put: assuming the input (the backbone feature map) is [N, C, H, W], the soft object regions have dimension [N, num_classes, H, W], and each element indicates the degree to which a pixel belongs to a certain category (in fact, it is the output of a semantic segmentation head before softmax).
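The shape change can be sketched as follows. This assumes the coarse head is a single per-pixel linear classifier, which is my simplification; the head in the repository may contain more layers. `num_classes = 19` is only an illustrative value.

```python
import numpy as np

def soft_object_regions(features, class_weight, class_bias):
    """features: [N, C, H, W] -> raw per-class score maps [N, num_classes, H, W]."""
    # One score per class at every pixel; no softmax is applied here.
    logits = np.einsum("kc,nchw->nkhw", class_weight, features)
    return logits + class_bias[None, :, None, None]

rng = np.random.default_rng(0)
num_classes = 19                               # illustrative value
feats = rng.standard_normal((2, 8, 4, 5))      # [N, C, H, W]
regions = soft_object_regions(
    feats, rng.standard_normal((num_classes, 8)), np.zeros(num_classes)
)
print(regions.shape)  # (2, 19, 4, 5)
```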

2、Object region representations

First, the formula of the paper is given as follows:
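For reference, this is the formula as I understand it from the paper, written out in LaTeX. Here x_{i} is the representation of pixel i and \widetilde{m}_{ki} is the normalized degree to which pixel i belongs to the k-th object region.

```latex
f_{k} = \sum_{i \in I} \widetilde{m}_{ki} \, x_{i}
```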

At first glance I was confused, so I went straight to the source code: \widetilde{m}_{ki} is the probability value output by a softmax taken over the spatial positions of region k.

Simply put: all pixels of the same object should share one learned embedding. Assuming the input X has dimension [N, C, H*W] and the soft object regions have dimension [N, num_classes, H*W], each object covers many pixels and needs a single unified representation, so the learned object region representations have dimension [N, num_classes, C].
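The aggregation above can be sketched in NumPy as a spatial softmax followed by a batched matrix product. This is a simplified sketch under my own naming (`softmax`, `object_region_representations`), not the repository's implementation.

```python
import numpy as np

def softmax(a, axis):
    # Subtract the max for numerical stability before exponentiating.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def object_region_representations(x, regions):
    """x: [N, C, H*W], regions: [N, K, H*W] -> [N, K, C]."""
    # Normalize each region's scores over all pixel positions.
    m = softmax(regions, axis=-1)
    # Weighted sum of pixel embeddings: [N, K, HW] @ [N, HW, C] -> [N, K, C].
    return m @ x.transpose(0, 2, 1)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 20))         # [N, C, H*W]
regions = rng.standard_normal((2, 19, 20))  # [N, num_classes, H*W]
f = object_region_representations(x, regions)
print(f.shape)  # (2, 19, 8)
```

Each row of `f` is a convex combination of pixel embeddings, so pixels with high scores for a class dominate that class's representation.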

3、Object contextual representations

First calculate the relation between the pixel representation and the object region representation (w_{ik} represents the relation between the i-th pixel representation and the k-th object region representation):
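As I understand it from the paper, the relation is a softmax over the K object regions of an unnormalized score \kappa; note that the key transform \psi acts on the region representation f_{k}:

```latex
w_{ik} = \frac{e^{\kappa(x_{i}, f_{k})}}{\sum_{j=1}^{K} e^{\kappa(x_{i}, f_{j})}},
\qquad
\kappa(x, f) = \phi(x)^{\top} \psi(f)
```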

The object contextual representation y_{i} is then calculated using the following equation:
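Written out in LaTeX, as I understand it from the paper, y_{i} is a weighted sum of the transformed region representations, passed through \rho:

```latex
y_{i} = \rho\left( \sum_{k=1}^{K} w_{ik} \, \delta(f_{k}) \right)
```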

The calculation process here is the same as Attention:

\phi(x) can be regarded as q with dimension [N, H*W, C], \psi(f) as k with dimension [N, C, num_classes], and \delta(f) as v with dimension [N, num_classes, C], where f is the object region representation; the dimension of y is [N, H*W, C].

y = \rho(softmax(q @ k, axis=-1) @ v)
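A minimal NumPy sketch of this attention step is below. It assumes q, k and v have already been passed through their transforms and it leaves out \rho; the optional `scale` argument is there because, as far as I can tell, the reference code multiplies the logits by key_channels ** -0.5, which the formula above does not show. All names are mine.

```python
import numpy as np

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def object_contextual_representation(q, k, v, scale=1.0):
    """q: [N, H*W, C], k: [N, C, K], v: [N, K, C] -> [N, H*W, C]."""
    # Pixel-region relation: one weight per (pixel, class), normalized over classes.
    w = softmax(scale * (q @ k), axis=-1)   # [N, H*W, K]
    # Each pixel receives a weighted mix of the K region representations.
    return w @ v                            # [N, H*W, C]

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 20, 8))    # [N, H*W, C]
k = rng.standard_normal((2, 8, 19))    # [N, C, num_classes]
v = rng.standard_normal((2, 19, 8))    # [N, num_classes, C]
y = object_contextual_representation(q, k, v)
print(y.shape)  # (2, 20, 8)
```

Unlike ordinary self-attention, the softmax runs over num_classes entries rather than H*W entries, which is where the efficiency of the method comes from.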

4、Augmented representations

The object contextual representation is concatenated with the pixel representation to obtain the augmented pixel representation.
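In LaTeX, as I understand it from the paper, the augmented representation z_{i} is a transform g applied to the concatenation of the two vectors. I assume g follows the same 1x1 conv -> BN -> ReLU pattern as the other transforms.

```latex
z_{i} = g\left( \left[ x_{i}^{\top} \;\; y_{i}^{\top} \right]^{\top} \right)
```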

2. Network structure

The network structure of OCRNet is shown in the figure.

Let the dimension of Pixel Representations be [N, C, H, W], and the dimension of Soft Object Regions be [N, num_classes, H, W].

Object Region Representations: [N, num_classes, C]

Pixel Region Relation: [N, num_classes, H, W]

Object Contextual Representation: [N, C, H, W] (the blue block in the figure, excluding the pixel representations)

Augmented Representations: [N, 2C, H, W]
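The last two shapes can be checked with a short NumPy sketch. It assumes the fusion after concatenation is a 1x1 conv followed by ReLU (batch norm is left out for brevity); the function name `augment` and the output width are my own choices.

```python
import numpy as np

def augment(x, y, fuse_weight):
    """x, y: [N, C, H, W]; fuse_weight: [C_out, 2C] -> ([N, 2C, H, W], [N, C_out, H, W])."""
    # Concatenate pixel and object contextual representations along channels.
    cat = np.concatenate([x, y], axis=1)
    # Fuse with a per-pixel linear map (1x1 conv) and ReLU; BN omitted (assumption).
    fused = np.maximum(np.einsum("oc,nchw->nohw", fuse_weight, cat), 0.0)
    return cat, fused

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 4, 5))   # pixel representations
y = rng.standard_normal((2, 8, 4, 5))   # object contextual representations
cat, z = augment(x, y, rng.standard_normal((8, 16)))
print(cat.shape, z.shape)  # (2, 16, 4, 5) (2, 8, 4, 5)
```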

3. Experimental results

The performance of OCRNet on each dataset is as follows.

4. Summary

In semantic segmentation, the embeddings of all pixels of the same object should be similar. OCRNet explicitly exploits this relationship through an attention mechanism to improve segmentation results.

Origin blog.csdn.net/qq_40035462/article/details/123717178