Layer Normalization analysis

Original paper name: Layer Normalization
Original paper address: https://arxiv.org/abs/1607.06450

I have talked about the principle of Batch Normalization in an earlier post (link there). Today I will briefly discuss Layer Normalization. Layer Normalization was proposed mainly for natural language processing, for example for RNN-style recurrent networks. Why not simply use BN? In sequence models such as RNNs the sequence length is not fixed (so the effective network depth varies), e.g. every sentence can have a different length, which makes BN difficult to apply, so the authors proposed Layer Normalization. (Note that BN is generally more effective than LN for image processing, but nowadays many people apply NLP-style models to images, such as the Vision Transformer, and those models still use LN.)
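To make the batch- and length-independence concrete, here is a minimal sketch of my own (not from the paper or the PyTorch docs): because LN computes its statistics from each token's feature vector alone, sequences of different lengths pose no problem.

import torch
import torch.nn as nn

embed_dim = 8
ln = nn.LayerNorm(embed_dim)

# two "sentences" of different lengths, each token being an 8-dim feature vector
short_seq = torch.randn(3, embed_dim)   # 3 tokens
long_seq = torch.randn(7, embed_dim)    # 7 tokens

# LN normalizes every token vector independently, so the sequence length is irrelevant
print(ln(short_seq).shape)  # torch.Size([3, 8])
print(ln(long_seq).shape)   # torch.Size([7, 8])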

I won't go through the paper in detail; let's look directly at the brief description of LayerNorm in the official PyTorch documentation. If you only look at the formula, it seems no different from BN: subtract the mean $E(x)$ and divide by the standard deviation $\sqrt{Var(x) + \epsilon}$, where $\epsilon$ is a very small quantity (default $10^{-5}$) that prevents the denominator from being zero, and there are again two trainable parameters $\beta, \gamma$. The difference is that BN normalizes each channel over a whole batch of data, whereas LN normalizes the specified dimensions of a single sample, independent of the batch (examples follow). Also, during training BN has to accumulate the two statistics moving_mean and moving_var (so BN keeps four parameters in total: moving_mean, moving_var, $\beta$, $\gamma$), while LN accumulates nothing and only has the two parameters $\beta, \gamma$.
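As a quick check of this parameter/buffer difference (a sketch of my own, not from the original post), you can inspect a BatchNorm and a LayerNorm module in PyTorch:

import torch.nn as nn

bn = nn.BatchNorm1d(num_features=3)
ln = nn.LayerNorm(normalized_shape=3)

# BN: two trainable parameters (weight = gamma, bias = beta) plus running buffers
print([name for name, _ in bn.named_parameters()])  # ['weight', 'bias']
print([name for name, _ in bn.named_buffers()])     # ['running_mean', 'running_var', 'num_batches_tracked']

# LN: only the two trainable parameters, no running statistics
print([name for name, _ in ln.named_parameters()])  # ['weight', 'bias']
print([name for name, _ in ln.named_buffers()])     # []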

The LayerNorm class in PyTorch has a normalized_shape parameter that specifies which dimensions to normalize over (note that, as the documentation says, it refers to the last certain number of dimensions, i.e. the specified dimensions must be the trailing ones). For example, if our data has shape [4, 2, 3], then normalized_shape can be [3] (normalize over the last dimension), [2, 3] (normalize over the last two dimensions), or [4, 2, 3] (normalize over all dimensions), but it cannot be [2] or [4, 2]; otherwise the following error is raised (taking normalized_shape=[2] as an example):

RuntimeError: 
Given normalized_shape=[2],         
expected input with shape [*, 2],    
but got input of size[4, 2, 3]

The message tells us that because we passed normalized_shape=[2], PyTorch infers the expected input shape to be [*, 2], i.e. the last dimension should have size 2, but the data we actually passed in has shape [4, 2, 3], hence the error.
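The following small sketch (my own, using the same [4, 2, 3] example) shows which normalized_shape values are accepted:

import torch
import torch.nn as nn

x = torch.rand(4, 2, 3)

# these all work, because they match the trailing dimensions of x
for shape in ([3], [2, 3], [4, 2, 3]):
    print(shape, nn.LayerNorm(shape)(x).shape)

# this raises the RuntimeError shown above, because [2] does not match the last dimension
try:
    nn.LayerNorm([2])(x)
except RuntimeError as e:
    print(e)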

[Figure: layer norm illustration]
Next, let's look at another example. Below is the test code I wrote, which compares the official LN implementation with my own LN implementation to check whether my understanding is correct.

import torch
import torch.nn as nn


def layer_norm_process(feature: torch.Tensor, beta=0., gamma=1., eps=1e-5):
    # variance and mean computed over the last dimension
    var_mean = torch.var_mean(feature, dim=-1, unbiased=False)
    # mean
    mean = var_mean[1]
    # variance
    var = var_mean[0]

    # layer norm process: subtract mean, divide by standard deviation, then scale and shift
    feature = (feature - mean[..., None]) / torch.sqrt(var[..., None] + eps)
    feature = feature * gamma + beta

    return feature


def main():
    t = torch.rand(4, 2, 3)
    print(t)
    # normalize only over the last dimension
    norm = nn.LayerNorm(normalized_shape=t.shape[-1], eps=1e-5)
    # official layer norm
    t1 = norm(t)
    # my own layer norm implementation
    t2 = layer_norm_process(t, eps=1e-5)
    print("t1:\n", t1)
    print("t2:\n", t2)


if __name__ == '__main__':
    main()

First, use torch.rand to randomly generate a tensor t of shape [4, 2, 3]:

[Figure: the printed tensor t, with each group of three values along the last dimension marked by a red box]

Then create an LN layer with the official API, where t.shape[-1] refers to the size of the last dimension of the data (3), i.e. normalization is performed only over the last dimension, corresponding to each group of values marked with a red box in the figure above:

# normalize only over the last dimension
norm = nn.LayerNorm(normalized_shape=t.shape[-1], eps=1e-5)

Then pass the data through the instantiated norm layer to get the following result:

 tensor([[[-1.2758,  1.1659,  0.1099],
         [ 0.6532, -1.4123,  0.7591]],

        [[ 1.1400,  0.1522, -1.2922],
         [ 1.0942, -1.3229,  0.2287]],

        [[-0.9757, -0.3983,  1.3741],
         [ 1.4134, -0.7379, -0.6755]],

        [[ 0.1563,  1.1389, -1.2951],
         [-1.2341,  0.0203,  1.2138]]], grad_fn=<NativeLayerNormBackward>)
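As an extra sanity check of my own (not part of the original code), inside main() one could also verify that after the official LN each group of three values in t1 has roughly zero mean and unit variance:

print(t1.mean(dim=-1))                 # all values close to 0
print(t1.var(dim=-1, unbiased=False))  # all values close to 1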

Then call my own LayerNorm implementation (note that $\beta$ is initialized to 0 and $\gamma$ to 1; during training they are gradually learned and adjusted) to obtain the following result:

 tensor([[[-1.2758,  1.1659,  0.1099],
         [ 0.6532, -1.4123,  0.7591]],

        [[ 1.1400,  0.1522, -1.2922],
         [ 1.0942, -1.3229,  0.2287]],

        [[-0.9757, -0.3983,  1.3741],
         [ 1.4134, -0.7379, -0.6755]],

        [[ 0.1563,  1.1389, -1.2951],
         [-1.2341,  0.0203,  1.2138]]])

Clearly, the result is exactly the same as the official one, which confirms that my understanding is correct.
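To avoid comparing by eye, the match can also be verified numerically, and the default initialization of the affine parameters can be checked as well (a small sketch of my own that could be added inside main()):

print(torch.allclose(t1, t2, atol=1e-6))  # True: the two results agree up to floating point error
print(norm.weight)  # gamma, initialized to ones
print(norm.bias)    # beta, initialized to zeros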

Source: https://blog.csdn.net/qq_37541097/article/details/117653177