[pytorch] Implementing LayerNorm yourself in PyTorch

There are two ways to use LayerNorm in PyTorch: one is nn.LayerNorm, and the other is nn.functional.layer_norm.

1. Calculation method

According to the official documentation, LayerNorm is computed as

y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta

The formula is actually the same as BatchNorm's; only the dimensions over which the mean and variance are computed differ.

Let's go through the formula with an example

Suppose you have the following data

x=
[
[0.1,0.2,0.3],
[0.4,0.5,0.6]
]
# shape (2,3)

First, calculate the mean and variance.

mean:

# the mean is computed over the last dimension
mean= 
[
(0.1+0.2+0.3)/3=0.2,
(0.4+0.5+0.6)/3=0.5
]

variance:

# the variance is also computed over the last dimension, with eps = 1e-5 added
var =
[
mean((0.1-0.2)^2, (0.2-0.2)^2, (0.3-0.2)^2) + 0.00001 = (0.01 + 0 + 0.01)/3 + 0.00001 ≈ 0.00668,
mean((0.4-0.5)^2, (0.5-0.5)^2, (0.6-0.5)^2) + 0.00001 = (0.01 + 0 + 0.01)/3 + 0.00001 ≈ 0.00668
]

sqrt(var) = [ 0.0817,
              0.0817
            ]

Then compute (x - mean) / sqrt(var):

(x - mean) / sqrt(var) = [ [(0.1-0.2)/0.0817, (0.2-0.2)/0.0817, (0.3-0.2)/0.0817],
                           [(0.4-0.5)/0.0817, (0.5-0.5)/0.0817, (0.6-0.5)/0.0817] ]

                       = [ [-1.2238, 0.0000, 1.2238],
                           [-1.2238, 0.0000, 1.2238] ]
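
As a quick sanity check on the arithmetic above, the same numbers can be reproduced with plain NumPy (a minimal sketch; the PyTorch versions follow in the next section):

import numpy as np

x = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])
eps = 1e-5

mean = x.mean(axis=-1, keepdims=True)                  # [[0.2], [0.5]]
var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)   # [[0.00667], [0.00667]]
print((x - mean) / np.sqrt(var + eps))                 # approx [[-1.2238, 0., 1.2238], [-1.2238, 0., 1.2238]]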

2. Implementation in code

The following code uses both built-in methods and also implements the computation manually.

import numpy as np
import torch
import torch.nn.functional as F

x = torch.Tensor([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]) # shape is (2,3)

# Note: normalized_shape in both LayerNorm and layer_norm refers to the sizes in the shape, not dimension indices;
# internally PyTorch converts these sizes into dimension indices
nn_layer_norm = torch.nn.LayerNorm(normalized_shape=[3], eps=1e-5, elementwise_affine=True)
print("LayerNorm=", nn_layer_norm(x))

layer_norm = F.layer_norm(x, normalized_shape=[3], weight=None, bias=None, eps=1e-5)
print("F.layer_norm=", layer_norm)

# dim here is the index of the dimension
mean = torch.mean(x, dim=[1], keepdim=True)
# note this takes torch.mean of the squared deviations, not torch.sum;
# torch.var is not equivalent here because it applies Bessel's correction (unbiased=True) by default
var = torch.mean((x - mean) ** 2, dim=[1], keepdim=True) + 1e-5
print("my LayerNorm=", var, (x - mean) / torch.sqrt(var))

The result is as follows: all three print the same normalized values, matching the hand calculation above (approximately [[-1.2238, 0.0000, 1.2238], [-1.2238, 0.0000, 1.2238]]).
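
If you also want the learnable affine parameters (gamma and beta) in the manual version, the computation can be wrapped in an nn.Module. This is only a sketch; the class name MyLayerNorm and its structure are my own illustration, not from the original post:

import torch
import torch.nn as nn

class MyLayerNorm(nn.Module):
    # Hypothetical wrapper: same math as the manual version above, plus learnable
    # weight (gamma) and bias (beta), mirroring elementwise_affine=True in nn.LayerNorm.
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(normalized_shape))
        self.bias = nn.Parameter(torch.zeros(normalized_shape))
        # normalize over the last len(normalized_shape) dimensions
        self.dims = tuple(range(-len(normalized_shape), 0))

    def forward(self, x):
        mean = x.mean(dim=self.dims, keepdim=True)
        var = ((x - mean) ** 2).mean(dim=self.dims, keepdim=True)
        return (x - mean) / torch.sqrt(var + self.eps) * self.weight + self.bias

x = torch.Tensor([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
print("MyLayerNorm=", MyLayerNorm([3])(x))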

Multi-dimensional implementation

If the tensor x is 3-dimensional, how should it be used?

The code sample is as follows,

import numpy as np
import torch
import torch.nn.functional as F

x = torch.Tensor([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]).view(2,1,3) # shape (2,1,3)

# Note: normalized_shape here can only be the trailing consecutive dimensions of the tensor;
# here [1, 3] is the last two dimensions of (2, 1, 3)
nn_layer_norm = torch.nn.LayerNorm(normalized_shape=[1,3], eps=1e-5, elementwise_affine=True)
print("LayerNorm=", nn_layer_norm(x))

layer_norm = F.layer_norm(x, normalized_shape=[1,3], weight=None, bias=None, eps=1e-5)
print("F.layer_norm=", layer_norm)

# dim takes the indices of the last two dimensions
mean = torch.mean(x, dim=[1,2], keepdim=True)
var = torch.mean((x - mean) ** 2, dim=[1,2], keepdim=True) + 1e-5
print("my LayerNorm=", (x - mean) / torch.sqrt(var))

The result is as follows: all three methods again produce the same normalized values as before, now with shape (2, 1, 3).
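
Coming back to the comment in the first code sample about torch.var: it can in fact be used, as long as the unbiased (Bessel-corrected) estimator is disabled so that it divides by N rather than N-1. A small sketch for the 3-D case:

import torch

x = torch.Tensor([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]).view(2, 1, 3)
mean = torch.mean(x, dim=[1, 2], keepdim=True)
# unbiased=False gives the population variance (divide by N), matching the manual version
var = torch.var(x, dim=[1, 2], unbiased=False, keepdim=True)
print("my LayerNorm (torch.var)=", (x - mean) / torch.sqrt(var + 1e-5))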

For multi-dimensional tensors, note that normalized_shape can only be the trailing consecutive dimensions of the tensor; otherwise an error like the following is raised:
RuntimeError: Given normalized_shape=[2, 3], expected input with shape [*, 2, 3], but got input of size[2, 1, 3]
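
For example, a minimal sketch that triggers this error on the (2, 1, 3) tensor above:

import torch
import torch.nn.functional as F

x = torch.Tensor([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]).view(2, 1, 3)
try:
    # [2, 3] is not the trailing part of the shape (2, 1, 3), so this fails
    F.layer_norm(x, normalized_shape=[2, 3], eps=1e-5)
except RuntimeError as e:
    print(e)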

3. Thoughts

From this we can see that the normalization is actually performed over the last dimension(s).
Consider training an NLP model: the tensor shape is typically (batch size, sequence length, embedding size), and within a mini-batch LayerNorm normalizes over the embedding dimension, i.e. each token's embedding vector is normalized independently.

So why is LayerNorm generally used for NLP tasks?
In NLP tasks the sequences within a batch may have different lengths, so if the batch and sequence dimensions were included in the statistics (as in BatchNorm), padding positions would be included as well.
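
To make this concrete, here is a small sketch with an NLP-shaped tensor (the sizes are arbitrary illustration values): normalizing only over the embedding dimension means each token is handled independently of the rest of the batch and of any padding positions.

import torch
import torch.nn as nn

batch_size, seq_len, embedding_size = 2, 4, 8   # illustrative sizes
x = torch.randn(batch_size, seq_len, embedding_size)

# Statistics are computed per token over the embedding dimension only,
# so other tokens (including padding) have no influence on the result.
layer_norm = nn.LayerNorm(embedding_size)
print(layer_norm(x).shape)   # torch.Size([2, 4, 8])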

Origin: blog.csdn.net/mimiduck/article/details/128253297