Context Encoding for Semantic Segmentation

1. Introduction

  • The author questions whether capturing contextual semantic information is really equivalent to simply enlarging the receptive field.
  • Traditional encoders (BoW, VLAD) can conveniently encode global feature statistics.
  • An encoding layer integrates dictionary learning and residual encoding into a single network module. The author captures global feature statistics by extending this encoding layer.

2. Contributions

  • The first is the SE-loss. Unlike a pixel-wise loss, SE-loss applies equal weight to large and small objects, so the network improves on small objects.
  • Through an encoding layer, the global semantic information is encoded and class-dependent feature maps are selected, for example reducing the probability that a vehicle appears inside a house scene.
  • Synchronized BN, and a memory-efficient encoding layer.

3. Method

  • First, the author uses the encoding layer to compute feature statistics and obtain global contextual information. To better exploit this global semantic information, channel-wise attention is used to select class-dependent feature maps. The encoding layer learns a dictionary (codebook) of semantic words and outputs residual encodings rich in semantic information.

Input features: C×H×W → X = {x_1, x_2, ..., x_N}, N = H × W
Learnable codebook: D = {d_1, d_2, ..., d_K}
Scaling factors: S = {s_1, s_2, ..., s_K}
Finally, K residual encodings are output: e_k = Σ_{i=1}^{N} e_{ik}

What is the purpose of doing this?
Each of the H×W C-dimensional features x_i is compared with each semantic word d_k through the residual r_{ik} = x_i − d_k. The residuals are then weighted by a softmax over the K semantic words,

e_{ik} = exp(−s_k ‖r_{ik}‖²) / Σ_{j=1}^{K} exp(−s_j ‖r_{ij}‖²) · r_{ik},

which gives the information of pixel position i relative to the k-th semantic word. Summing the N results yields e_k, the information of the whole image relative to the k-th semantic word.
Each e_k is C-dimensional. The K vectors e_k are then aggregated by summation rather than concatenation: on the one hand, concatenation would carry ordering information over the codewords; on the other hand, summation saves memory. Adding them up yields the information of the entire image relative to all K semantic words, and the final e is likewise C-dimensional.
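The residual-encoding step above can be sketched in NumPy as follows (function and variable names are my own, not from the paper's code; the learnable codebook D and scaling factors S are passed in as plain arrays):

```python
import numpy as np

def encoding_layer(X, D, S):
    """Minimal sketch of the context-encoding aggregation.

    X: (N, C) input features, one C-dim descriptor per pixel (N = H*W).
    D: (K, C) codebook of K semantic words (learnable in the real model).
    S: (K,)  scaling factors (learnable in the real model).
    Returns e: (K, C) residual encodings, one per codeword.
    """
    # Residuals r_ik = x_i - d_k for every pixel/codeword pair: shape (N, K, C)
    R = X[:, None, :] - D[None, :, :]
    # Soft-assignment weights: softmax over the K codewords of -s_k * ||r_ik||^2
    logits = -S[None, :] * np.sum(R ** 2, axis=2)          # (N, K)
    W = np.exp(logits - logits.max(axis=1, keepdims=True))
    W = W / W.sum(axis=1, keepdims=True)
    # Aggregate weighted residuals over all N positions: e_k = sum_i w_ik * r_ik
    e = np.einsum('nk,nkc->kc', W, R)
    return e
```

With K = 1 the softmax weight is always 1, so the layer degenerates to summing the residuals against the single codeword, which illustrates why the final summed e stays C-dimensional.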

  • The generated e is then used to produce channel weights, giving a channel-wise attention.
  • e followed by a fully connected layer forms the SE-loss branch. Its label is built directly from which classes are present in the picture: the corresponding positions are set to 1.

    The overall network framework is shown in the figure

Finally, the author sets K = 32.

Two SE-losses are attached at stages 3 and 4 respectively. The author also discusses the influence of K; K = 0 is equivalent to global pooling.

4. Experiments are omitted here.
