Region Normalization summary

I first read the paper Region Normalization for Image Inpainting a year ago. At that time the authors had not yet released their code, and I came away from the paper in a fog. Now let's take a deeper look at it.

1. The focus of the paper is normalization. I had run into various normalization methods in my reading before, but had never systematically compared how they work. Conveniently, the authors list the sources of the different normalization methods in the Related Work section, so I noted them down to take a closer look at these methods later.

2. First, the motivation: the problem the method sets out to address.
[Figure: three example feature maps]
The three pictures are three feature maps: (1) unmasked, (2) masked and normalized over the full spatial extent, and (3) masked with the two regions normalized separately. In the calculation, the masked area is filled with the value 255, and a simple average is computed over the pixels of the three pictures:
[Figure: the computed statistics of the three feature maps]
μ and σ denote the mean and standard deviation, the numeric subscript indexes the feature map, and the subscripts u and m denote the unmasked and masked areas. Feature map 3 separates out the masked area and computes its statistics over the unmasked area only, which is easy to understand: it keeps the valid area's statistics from being skewed by the constant-valued (255) masked area. The figure below shows the effect of the different processing methods under the ReLU and sigmoid activation functions:
[Figure: mean and variance under ReLU and sigmoid for the three schemes]
Whether you look at the mean or the variance, normalizing without distinguishing masked from unmasked pixels gives results that deviate substantially from the true values.
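To make the motivation concrete, here is a tiny experiment (my own sketch in PyTorch, not the paper's code; the 255 fill value and the hole position are just illustrative) showing how much a constant-filled hole distorts full-spatial statistics compared with statistics over the valid pixels only:

```python
import torch

torch.manual_seed(0)
feat = torch.rand(1, 1, 64, 64) * 10          # "valid" feature values
mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1                   # 1 = masked (corrupted) region
feat = feat * (1 - mask) + 255.0 * mask       # fill the hole with a constant 255

# Full-spatial statistics: the constant hole drags the mean and std away
mu_full, std_full = feat.mean(), feat.std()

# Region-wise statistics: computed over the unmasked pixels only
valid = feat[mask == 0]
mu_valid, std_valid = valid.mean(), valid.std()

print(f"full-spatial:  mean={mu_full:.2f}, std={std_full:.2f}")
print(f"unmasked only: mean={mu_valid:.2f}, std={std_valid:.2f}")
```

The unmasked-only statistics stay near the true distribution of the valid pixels, while the full-spatial ones are pulled heavily toward the fill value.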

3. Next, the overall network structure.
Generator: taken from EdgeConnect (EC) (Nazeri et al. 2019); the structure is as follows:
[Figure: generator architecture]
It is composed of an Encoder, Residual Blocks, and a Decoder. The Encoder uses RN-B, while the Residual Blocks and the Decoder use RN-L. Both are region normalization methods: the former is the basic version and requires the mask as input, while the latter is the learnable version and needs only the feature maps.
Discriminator: it copies the structure of PatchGAN (Isola et al. 2017; Zhu et al. 2017). The loss function is likewise adopted from existing work and has four parts: reconstruction loss, adversarial loss, perceptual loss, and style loss.
There is no innovation in the network structure or the loss function; they are pieced together from existing, proven methods.
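To fix ideas, here is a minimal skeleton of that encoder / residual-block / decoder layout (my own sketch: the channel widths, block count, and 4-channel image-plus-mask input are placeholder assumptions, and plain InstanceNorm2d stands in for the RN layers sketched later in this post):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.norm = nn.InstanceNorm2d(ch)             # placeholder for RN-L
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(torch.relu(self.norm(self.conv1(x))))

class Generator(nn.Module):
    def __init__(self, ch=64, n_blocks=4):
        super().__init__()
        # Encoder: in the paper, each conv here would be followed by RN-B
        self.encoder = nn.Sequential(
            nn.Conv2d(4, ch, 7, padding=3), nn.ReLU(),
            nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.blocks = nn.Sequential(*[ResBlock(ch * 2) for _ in range(n_blocks)])
        # Decoder: in the paper, the convs here would be followed by RN-L
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 7, padding=3),
        )

    def forward(self, x):          # x: corrupted image concatenated with the mask
        return self.decoder(self.blocks(self.encoder(x)))
```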

4. The innovation of the paper: the definition of Region Normalization.
Each input feature map has four dimensions: N, C, H, W, representing the batch size, the number of channels, and the height and width of the feature map. Since this is region normalization, there must be different regions. The author gives the following expression:

$$X^{n,c} = X_1^{n,c} \cup X_2^{n,c} \cup \cdots \cup X_K^{n,c}$$

That is, besides the indices n and c there are the H and W dimensions, so each feature map in the batch is divided into several regions along the two spatial dimensions, as shown in the figure below:
[Figure: partition of a feature map into spatial regions]
Use n and c as the index to pick out the channel to be processed, divide it into several sub-regions, normalize each region separately, and then merge them back together.
The normalization operation is the same for every region:

$$\hat{X}_k^{n,c} = \frac{X_k^{n,c} - \mu_k^{n,c}}{\sigma_k^{n,c}}$$

i.e., subtract the mean and divide by the standard deviation. The mean and standard deviation are computed in the traditional way; just note that they are computed only over the pixels belonging to the same region.
The author explains that this method is in fact an extension of Instance Normalization (IN): when the number of regions is 1, it reduces exactly to IN. I haven't read the IN paper yet, so that goes on the reading list. For image inpainting, the number of regions is set to 2: one for the intact area and one for the masked area.
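Here is how I understand the region-wise computation, as a small PyTorch sketch (my own reconstruction of the formula above, not the authors' code; `region_masks` is a hypothetical list of binary masks that partition the spatial plane):

```python
import torch

def region_normalize(feat, region_masks, eps=1e-5):
    """Normalize each spatial region of each (n, c) channel separately.

    feat: (N, C, H, W); region_masks: list of K binary (N, 1, H, W) tensors
    that partition the H x W plane. With K == 1 (a single all-ones mask)
    this reduces to Instance Normalization.
    """
    out = torch.zeros_like(feat)
    for m in region_masks:
        cnt = m.sum(dim=(2, 3), keepdim=True).clamp(min=1)       # pixels per region
        mu = (feat * m).sum(dim=(2, 3), keepdim=True) / cnt      # region mean per (n, c)
        var = ((feat - mu) ** 2 * m).sum(dim=(2, 3), keepdim=True) / cnt
        out = out + m * (feat - mu) / torch.sqrt(var + eps)      # normalize, then merge
    return out
```

For inpainting this would be called with K = 2, e.g. `region_normalize(feat, [mask, 1 - mask])`.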
5. The author uses the RN method in two forms: RN-B and RN-L.
1) RN-B (Basic Region Normalization)
[Figure: RN-B]
This method divides the input into two regions (masked/unmasked) according to the input mask. The specific rule is as follows:

$$\text{region}(h, w) = \begin{cases} \text{masked}, & M(h, w) = 255 \\ \text{unmasked}, & \text{otherwise} \end{cases}$$

That is, wherever the mask pixel value is 255, the location is judged as masked. The two regions are normalized separately by the method above and then merged back into a complete feature map. Note that each channel gets two sets of learnable affine parameters, one per region, instead of the usual single scale and shift per channel.
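A minimal sketch of RN-B as I understand it (my own reconstruction, not the released code; the mask is assumed to be a binary (N, 1, H, W) tensor with 1 marking masked pixels):

```python
import torch
import torch.nn as nn

class RNB(nn.Module):
    """Basic Region Normalization: two regions, two affine parameter sets."""

    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma_u = nn.Parameter(torch.ones(1, channels, 1, 1))   # unmasked scale
        self.beta_u = nn.Parameter(torch.zeros(1, channels, 1, 1))   # unmasked shift
        self.gamma_m = nn.Parameter(torch.ones(1, channels, 1, 1))   # masked scale
        self.beta_m = nn.Parameter(torch.zeros(1, channels, 1, 1))   # masked shift

    def _norm(self, x, m):
        # Mean/std over the pixels of one region only, per (n, c)
        cnt = m.sum(dim=(2, 3), keepdim=True).clamp(min=1)
        mu = (x * m).sum(dim=(2, 3), keepdim=True) / cnt
        var = ((x - mu) ** 2 * m).sum(dim=(2, 3), keepdim=True) / cnt
        return (x - mu) / torch.sqrt(var + self.eps)

    def forward(self, x, mask):            # mask: (N, 1, H, W), 1 = masked
        xm = self._norm(x, mask) * self.gamma_m + self.beta_m
        xu = self._norm(x, 1 - mask) * self.gamma_u + self.beta_u
        return mask * xm + (1 - mask) * xu
```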
2) RN-L (Learnable Region Normalization)
[Figure: RN-L]
This method no longer needs a manually supplied mask to divide the regions. As shown in the figure above, max pooling and mean pooling are first applied to the input feature maps along the channel axis, yielding two 1×H×W maps. The original text says the two pooling operations are able to obtain an efficient feature descriptor (Zagoruyko and Komodakis 2016; Woo et al. 2018); those references also go on the to-read list.
The two pooled maps are then convolved and passed through a sigmoid activation to obtain a spatial response map:

$$M_{sr} = \mathrm{sigmoid}\big(\mathrm{conv}([\mathrm{MaxPool}(F);\ \mathrm{AvgPool}(F)])\big)$$

A threshold of t = 0.8 is then applied to M_sr to decide which locations count as masked (I don't quite understand the principle behind this; it may be explained in the papers on the two pooling operations mentioned above):

$$R(h, w) = \begin{cases} 1, & M_{sr}(h, w) > t \\ 0, & \text{otherwise} \end{cases}$$

The 0.8 threshold only takes effect in the forward pass (inference) and does not affect the gradient updates in backpropagation. (I don't fully understand the concrete consequences of this yet, including what side effects the operation might have; I'll need to read the code to figure this out.)

When it comes to learning the γ (scale) and β (shift) parameters, the article says they are also obtained through convolution operations:
[Equation image: γ and β produced by convolutions]
How this convolution and the convolution over the two pooled maps above are implemented needs to be understood by reading the code.
Do γ and β get broadcast along the channel dimension during the affine transformation?
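Putting the pieces together, here is how I would sketch RN-L under my current reading (a reconstruction, not the released code: in particular, feeding the response map into the γ/β convolutions, the kernel sizes, and the final affine form are all my assumptions until I check the code):

```python
import torch
import torch.nn as nn

class RNL(nn.Module):
    """Learnable Region Normalization sketch: pooling -> conv -> sigmoid ->
    hard threshold for the statistics, conv-predicted pixel-wise gamma/beta."""

    def __init__(self, channels, t=0.8, eps=1e-5):
        super().__init__()
        self.t, self.eps = t, eps
        self.conv_sr = nn.Conv2d(2, 1, 3, padding=1)             # response map
        self.conv_gamma = nn.Conv2d(1, channels, 3, padding=1)   # assumed input: M_sr
        self.conv_beta = nn.Conv2d(1, channels, 3, padding=1)

    def _norm(self, x, m):
        cnt = m.sum(dim=(2, 3), keepdim=True).clamp(min=1)
        mu = (x * m).sum(dim=(2, 3), keepdim=True) / cnt
        var = ((x - mu) ** 2 * m).sum(dim=(2, 3), keepdim=True) / cnt
        return (x - mu) / torch.sqrt(var + self.eps)

    def forward(self, x):
        mx = x.max(dim=1, keepdim=True).values               # channel-wise max pool
        av = x.mean(dim=1, keepdim=True)                     # channel-wise mean pool
        m_sr = torch.sigmoid(self.conv_sr(torch.cat([mx, av], dim=1)))
        # The comparison is non-differentiable, so the threshold acts only in
        # the forward pass; gradients still reach conv_sr via gamma/beta below.
        region = (m_sr > self.t).float()
        out = region * self._norm(x, region) + (1 - region) * self._norm(x, 1 - region)
        gamma = self.conv_gamma(m_sr)                        # (N, C, H, W) per-pixel
        beta = self.conv_beta(m_sr)
        return out * (1 + gamma) + beta                      # affine form is my guess
```

If γ and β are predicted with C output channels as in this sketch, they are already pixel- and channel-wise and need no extra broadcasting; whether the official code does it this way is exactly what the question above asks.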

6. Summary
This article is not long. The network structure is pieced together from earlier methods; the innovation lies in the concept of RN, which is applied in two forms, RN-B and RN-L.
The paper as a whole is not hard to follow and some of the concepts are easy to grasp, but there are still many places in the RN-L method that I don't understand. I think that is because I have read too few of the earlier papers: several methods used here as building blocks are unfamiliar to me. The references mentioned in the paper should be read as far as possible; for now the follow-up reading is marked in bold italics.
Since last week I have also been trying to run the author's open-source code. It runs at the moment, but there are still many small bugs, which I am slowly working through. The program as a whole is not long and its structure is fairly clear; I think it is well suited to a beginner like me.

Source: blog.csdn.net/qq_41872271/article/details/105407076