A comprehensive interpretation of Group Normalization, with comparisons to BN, LN, and IN

Foreword

Yuxin Wu and Kaiming He of Facebook AI Research (FAIR) have jointly released a heavyweight new work, Group Normalization (GN), proposing it as a replacement for Batch Normalization, a milestone technique of deep learning. This article explains the paper in detail from three aspects:

What's wrong with BN?

How does GN work?

Why does GN work?

What is Group Normalization

In one sentence, Group Normalization (GN) is a new normalization method for deep learning that can replace BN. As we all know, BN is one of the most widely used normalization methods in deep learning; it plays a major role in improving training and convergence speed and is regarded as a milestone of the field.

 

However, BN still has some problems, and the newly proposed GN removes the dependence of BN-style normalization on the batch size.

So what exactly is wrong with BN, and where does GN improve on it?

What's wrong with BN

BN stands for Batch Normalization. As the name suggests, it normalizes along the batch dimension, and herein lies the problem: the normalization becomes dependent on the batch. A small batch size causes its performance to degrade; in general, a batch size of 32 per GPU works best.
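To make this batch dependence concrete, below is a minimal NumPy sketch (my own illustration, not from the paper; the shapes and values are made up) showing that BN's per-channel statistics are estimated over the [N, H, W] axes, so a tiny batch yields much noisier estimates:

import numpy as np

def bn_channel_stats(x):
    # x: input features with shape [N, C, H, W]
    # BN estimates one mean/variance per channel over the (N, H, W) axes
    return x.mean(axis=(0, 2, 3)), x.var(axis=(0, 2, 3))

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=(32, 8, 16, 16))  # true per-channel mean is 1.0

mean_32, _ = bn_channel_stats(x)       # statistics from a batch of 32
mean_2, _ = bn_channel_stats(x[:2])    # statistics from a batch of 2
print(np.abs(mean_32 - 1.0).mean())    # small deviation from the true mean
print(np.abs(mean_2 - 1.0).mean())     # noticeably larger deviation from the true mean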

 

But in some other deep learning tasks the batch size is often only 1-2, for example object detection, image segmentation, and video classification, where the input images are very large and a larger batch simply does not fit in GPU memory. So how do the methods perform at smaller batch sizes? As shown below:


The horizontal axis is the batch size per GPU, decreasing from left to right, and the vertical axis is the error rate. It can be seen that when the batch size becomes small, GN's error rate stays nearly unchanged while BN's rises sharply, leaving GN roughly 10% lower in error at the smallest batch sizes.

 

In addition, Batch Normalization normalizes over the batch dimension, but the batch statistics are not fixed: training and testing generally behave differently. During training, running estimates of the mean and variance are accumulated over the training set with a moving average.

At test time these statistics are not recomputed; the pre-computed running averages are used instead. When the distributions of the training and test data differ, the statistics estimated on the training set no longer represent the test data, which leads to inconsistency between training, validation, and testing.
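A minimal sketch of this train/test asymmetry (purely illustrative; the momentum value and variable names are my own, not the paper's):

import numpy as np

C = 8
running_mean = np.zeros(C)   # accumulated during training with a moving average
running_var = np.ones(C)
momentum = 0.9               # a typical choice; frameworks differ
eps = 1e-5

def bn_train(x):
    # training: normalize with the statistics of the current mini-batch
    global running_mean, running_var
    mean = x.mean(axis=(0, 2, 3))
    var = x.var(axis=(0, 2, 3))
    running_mean = momentum * running_mean + (1 - momentum) * mean
    running_var = momentum * running_var + (1 - momentum) * var
    return (x - mean[None, :, None, None]) / np.sqrt(var[None, :, None, None] + eps)

def bn_test(x):
    # inference: reuse the pre-computed running averages; if the test
    # distribution differs from training, these no longer match the data
    return (x - running_mean[None, :, None, None]) / np.sqrt(running_var[None, :, None, None] + eps)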

 

Now that the problem is clear, the direction of the fix is too: normalize without involving the batch dimension. Layer Normalization and Instance Normalization both do exactly that, but neither works as well as the GN introduced in this article.

How GN works

GN is still normalization in essence, but it flexibly sidesteps BN's problem while also differing from Layer Norm and Instance Norm. How the four methods work can be seen from the following figure:

 

From left to right are BN, LN, IN, GN

 

As we all know, feature maps in a deep network are generally stored in [N, C, H, W] or [N, H, W, C] format, where N is the batch size, H/W are the height/width of the feature map, and C is the number of channels. Compressing H and W into a single dimension gives the three-dimensional view shown above; assuming the side length of a single small cube is 1, the figure represents a tensor of shape [6, 6, *, *].

 

The figure above clearly illustrates how the four normalization methods work:

BN normalizes along the batch dimension: for each channel, the normalized dimensions are [N, H, W];

LN avoids the batch dimension: for each sample, the normalized dimensions are [C, H, W];

IN normalizes over [H, W] for each sample and each channel;

GN sits between LN and IN: it first divides the channels into groups, reshaping the feature from [N, C, H, W] to [N, G, C // G, H, W], and then normalizes each group over the dimensions [C // G, H, W].

 

In fact, LN and IN are the two extreme cases of GN, corresponding to G equal to 1 and G equal to C respectively; the authors use G = 32 in the paper.
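As a quick sanity check of these relationships, here is a small NumPy sketch (my own illustration, not the paper's code) that normalizes the same tensor over each set of axes and verifies the two extreme cases:

import numpy as np

def normalize(x, axes, eps=1e-5):
    # subtract the mean and divide by the standard deviation over the given axes
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def group_norm(x, G, eps=1e-5):
    # reshape [N, C, H, W] -> [N, G, C // G, H, W], normalize each group, reshape back
    N, C, H, W = x.shape
    y = normalize(x.reshape(N, G, C // G, H, W), axes=(2, 3, 4), eps=eps)
    return y.reshape(N, C, H, W)

x = np.random.randn(6, 6, 4, 4)             # [N, C, H, W]

bn = normalize(x, axes=(0, 2, 3))           # BN: over [N, H, W] per channel
ln = normalize(x, axes=(1, 2, 3))           # LN: over [C, H, W] per sample
inorm = normalize(x, axes=(2, 3))           # IN: over [H, W] per sample and channel
gn = group_norm(x, G=2)                     # GN: over [C // G, H, W] per group

print(np.allclose(group_norm(x, G=1), ln))              # True: G = 1 recovers LN
print(np.allclose(group_norm(x, G=x.shape[1]), inorm))  # True: G = C recovers IN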

From this it is clear that GN and BN are very similar; in code, only a line or two needs to change relative to BN. The implementation given in the paper is as follows:

def GroupNorm(x, gamma, beta, G, eps=1e-5):
    # x: input features with shape [N,C,H,W]
    # gamma, beta: scale and offset, with shape [1,C,1,1]
    # G: number of groups for GN
    N, C, H, W = x.shape
    x = tf.reshape(x, [N, G, C // G, H, W])
    mean, var = tf.nn.moments(x, [2, 3, 4], keep_dims=True)
    x = (x - mean) / tf.sqrt(var + eps)
    x = tf.reshape(x, [N, C, H, W])
    return x * gamma + beta

Here, gamma and beta are the trainable parameters of the normalization layer, representing the per-channel scale and shift factors. Comparing the normalization methods above, one cannot help but admire the authors' insight.
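For practical use, modern frameworks already ship GN as a built-in layer; for example, PyTorch provides torch.nn.GroupNorm. A minimal usage sketch (my own example, not from the paper; the channel counts are arbitrary):

import torch
import torch.nn as nn

gn = nn.GroupNorm(num_groups=32, num_channels=64)  # affine=True by default, so the
                                                   # per-channel gamma/beta are learnable
x = torch.randn(2, 64, 56, 56)                     # works even with a tiny batch size
y = gn(x)
print(y.shape)                                     # torch.Size([2, 64, 56, 56])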

Why GN works

The sections above introduced the problems with BN and how GN works; this section explains why GN works.

 

Traditionally, before deep learning became popular, feature extraction was usually done with SIFT, HOG, or GIST features. These features have something in common: they are all represented group-wise, with each group built from histograms of the same kind. Such features are usually obtained by applying a group-wise normalization over each histogram or each orientation.

 

Higher-dimensional features such as VLAD and Fisher Vectors (FV) can also be regarded as group-wise features, where each group can be thought of as the sub-vector computed with respect to a cluster.

 

Turning to deep learning, the features extracted by convolution can be regarded as unstructured vectors. Take the first convolutional layer of a network as an example: a convolution kernel filter1 and a transformed version of it, filter2 (the transform could be a horizontal flip, for instance), should learn features with similar distributions on the same image, so such features can be put in the same group. More generally, in my understanding, each layer has many convolution kernels whose learned features are not completely independent; features with similar distributions can therefore be grouped together.

 

Many factors can give rise to such grouping, for example frequency, shape, brightness, and texture; HOG features are grouped by orientation. For neural networks, the feature-extraction mechanism is more complicated and harder to describe, so the grouping becomes less intuitive.

 

In neuroscience, a widely accepted computational model is to normalize the responses of cells; this phenomenon exists in the superficial visual cortex and throughout the visual system.

 

Based on this, the authors propose the Group Normalization method, and the results show that it clearly outperforms LN and IN, and outperforms BN when the batch size is small. GN avoids the influence of the batch size on the model, and normalizing features group-wise also helps address the internal covariate shift problem, leading to better results.

Experimental results

With ResNet-50 as the base model and the batch size set to 32, the figure shows the training error (left) and test error (right) on the ImageNet dataset. GN does not show a clear advantage here; its test error is slightly higher than BN's.


It can easily be seen, however, that GN is much more robust to the batch size than BN.

 

The authors also took VGG-16 as an example, analyzing how the feature distribution after a particular convolutional layer evolves during training, comparing three settings: no normalization, BN, and GN. The experimental results are as follows:

With the batch size uniformly set to 32, the leftmost plot shows how the conv5 features evolve without any normalization, the middle plot shows the result with BN, and the rightmost with GN. Compared with using no normalization, the benefit of normalization is obvious; the latter two behave similarly, but once the batch size is made small, BN is no longer as good as GN.


Origin blog.csdn.net/weixin_36670529/article/details/105165913