The difference and connection between Batch Normalization and Layer Normalization

Deep learning has achieved remarkable results in image recognition, speech recognition, natural language processing, and other fields. However, as neural networks grow deeper and more complex, common problems such as vanishing gradients, exploding gradients, and slow training become more pronounced. Batch Normalization (BN) and Layer Normalization (LN) are two widely used techniques for addressing these problems. This blog introduces the principles of BN and LN in detail and demonstrates their application and advantages in deep learning through examples and code.

1. Batch Normalization (BN): Starting from internal covariate shift

1.1 Internal covariate shift

In a deep neural network, the input of each layer is the output of the previous layer, so whenever the parameters of earlier layers are updated during training, the distribution of inputs seen by later layers changes as well. This phenomenon is called internal covariate shift [1]. It makes training unstable and convergence difficult, and it forces the use of a small learning rate and careful parameter initialization, which increases the difficulty of training deep networks.

1.2 The principle of Batch Normalization

BN is a technique to reduce the internal covariate shift by normalizing the input of each layer. The basic principle of BN is as follows:

For the input $x$ of each layer, BN first normalizes it to obtain a standardized input:
$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
where $\mu$ is the mean of the input and $\sigma^2$ is its variance, both computed for each feature over the current mini-batch, and $\epsilon$ is a small positive number that prevents the denominator from being zero.

Next, scale and shift the normalized input to get the final output:
$$y = \gamma \hat{x} + \beta$$
where $\gamma$ and $\beta$ are learnable parameters used to scale and shift the normalized input.
The core idea of BN is to normalize the input data so that the input distribution of each layer is more stable, which speeds up training and allows a larger learning rate to be used. In addition, because the statistics of each mini-batch are slightly noisy, BN has a mild regularizing effect that can improve generalization and reduce the risk of overfitting.
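The two steps above (normalize, then scale and shift) can be checked with a minimal sketch. It assumes a toy input of shape (8, 4), computes the statistics by hand over the batch dimension, and compares the result with PyTorch's `nn.BatchNorm1d` in training mode. The tensor sizes and variable names are chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy mini-batch: 8 samples, 4 features (sizes are illustrative assumptions).
x = torch.randn(8, 4) * 3.0 + 5.0
eps = 1e-5

# Step 1: normalize. Statistics are computed per feature, across the batch (dim=0).
mu = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)   # biased variance is used for normalization
x_hat = (x - mu) / torch.sqrt(var + eps)

# Step 2: scale and shift. gamma=1 and beta=0 match the initial values
# of the learnable parameters in nn.BatchNorm1d.
gamma = torch.ones(4)
beta = torch.zeros(4)
y_manual = gamma * x_hat + beta

# Reference: the built-in layer in training mode uses mini-batch statistics.
bn = nn.BatchNorm1d(4, eps=eps)
bn.train()
with torch.no_grad():
    y_ref = bn(x)

print(torch.allclose(y_manual, y_ref, atol=1e-4))
print(y_manual.mean(dim=0))                  # each entry is close to 0
print(y_manual.var(dim=0, unbiased=False))   # each entry is close to 1
```

Note that `dim=0` is the batch dimension: every feature is normalized using values from all samples in the mini-batch.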

1.3 Advantages of BN

Although BN is a normalization method, it also has a regularizing effect, and it offers several advantages in deep learning:

Accelerates network training: BN makes the input distribution of each layer more stable by reducing internal covariate shift, which speeds up training. It also allows a larger learning rate to be used, which further speeds up convergence.

Improves generalization: BN can reduce the risk of overfitting to a certain extent, thereby improving the generalization ability of the network.

Reduces sensitivity to parameter initialization: because each layer's output is re-normalized, the scale of the preceding weights matters much less, so the network no longer depends on careful parameter initialization. This simplifies network design.

Improves robustness: BN can make the model more stable under small perturbations of the input data.
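One way to see the reduced sensitivity to initialization is a small sketch in which the weights of a linear layer are multiplied by 100. The layer sizes, the factor 100, and the variable names are illustrative assumptions, not taken from the original article. The raw outputs grow by the same factor, but the outputs after BN stay almost unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(64, 10)             # toy mini-batch: 64 samples, 10 features
fc = nn.Linear(10, 5, bias=False)   # hypothetical layer whose weight scale we vary
bn = nn.BatchNorm1d(5)
bn.train()                          # normalize with mini-batch statistics

with torch.no_grad():
    raw_base = fc(x)
    out_base = bn(raw_base)

    fc.weight.mul_(100.0)           # simulate a badly scaled initialization
    raw_scaled = fc(x)
    out_scaled = bn(raw_scaled)

# Ratio of average magnitudes before BN: the raw outputs are about 100x larger.
raw_ratio = (raw_scaled.abs().mean() / raw_base.abs().mean()).item()
# Largest difference after BN: small, because BN divides the scale out again.
max_diff_after_bn = (out_scaled - out_base).abs().max().item()

print(f"raw output ratio: {raw_ratio:.1f}")
print(f"max difference after BN: {max_diff_after_bn:.2e}")
```

The small remaining difference comes from the constant $\epsilon$ in the denominator, which does not scale with the weights.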

1.4 Applications and Cases of BN

BN is widely used in various deep learning tasks, such as image classification, object detection, speech recognition, etc., and has achieved significant performance improvements in these tasks. The following is an example of image classification using BN:

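import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms

# Assumption: the original post does not define trainloader/testloader.
# CIFAR-10 is used here because it matches the 3x32x32 inputs expected by
# the network and the 10000 test images mentioned below.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=False)

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.bn1 = nn.BatchNorm2d(6)    # BN over the 6 output channels of conv1
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.bn2 = nn.BatchNorm2d(16)   # BN over the 16 output channels of conv2
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.bn3 = nn.BatchNorm1d(120)  # BN over the 120 features of fc1
        self.fc2 = nn.Linear(120, 84)
        self.bn4 = nn.BatchNorm1d(84)   # BN over the 84 features of fc2
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Convolution -> BN -> ReLU -> max pooling
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.max_pool2d(x, (2, 2))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.max_pool2d(x, 2)
        # Flatten, then fully connected layer -> BN -> ReLU
        x = F.relu(self.bn3(self.fc1(x.view(-1, 16 * 5 * 5))))
        x = F.relu(self.bn4(self.fc2(x)))
        x = self.fc3(x)
        return x

net = Net()  # network that uses BN

# Define the loss function and the optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# Train the network
net.train()  # BN uses mini-batch statistics and updates its running statistics
for epoch in range(10):  # train for 10 epochs
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print('Epoch %d Loss: %.3f' % (epoch + 1, running_loss / len(trainloader)))
print('Finished Training')

# Test the network
net.eval()  # BN switches to the running statistics collected during training
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        inputs, labels = data
        outputs = net(inputs)
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))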

The example above shows how to add BN layers to an image classification network. To observe the effect of BN, train the same network with the BN layers removed and compare the loss curves and test accuracy: BN typically speeds up convergence and can improve classification accuracy.

2. Layer Normalization (LN)

2.1 Principle of LN

Unlike BN, which computes statistics across the samples of a mini-batch, LN computes the mean and variance across the features of each individual sample, so each sample is normalized independently of the others. The mathematical formula of LN can be expressed as:
$$\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
where $x$ is the input data, $\gamma$ and $\beta$ are the learnable scaling factor and offset factor respectively, $\mu$ and $\sigma^2$ are the mean and variance of the input data respectively, and $\epsilon$ is a small constant used to prevent division by zero.
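The LN formula can be checked with a minimal sketch that mirrors the BN computation but takes the statistics over the feature dimension of each sample. The input shape (8, 4) and the variable names are illustrative assumptions; the result is compared with PyTorch's `nn.LayerNorm`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy input: 8 samples, 4 features (sizes are illustrative assumptions).
x = torch.randn(8, 4) * 3.0 + 5.0
eps = 1e-5

# Statistics are computed per sample, across the feature dimension (dim=-1).
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)

# gamma=1 and beta=0 match the initial parameters of nn.LayerNorm.
gamma = torch.ones(4)
beta = torch.zeros(4)
y_manual = gamma * (x - mu) / torch.sqrt(var + eps) + beta

# Reference: the built-in layer, normalizing over the last dimension of size 4.
ln = nn.LayerNorm(4, eps=eps)
with torch.no_grad():
    y_ref = ln(x)

print(torch.allclose(y_manual, y_ref, atol=1e-4))
print(y_manual.mean(dim=-1))   # one value per sample, each close to 0
```

The only change from the BN computation is the axis: `dim=-1` (features of one sample) instead of `dim=0` (one feature across the batch).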

2.2 Advantages of LN

As a normalization method, LN has the following advantages:

    1. Does not depend on batch size: BN's normalization depends on the samples in the mini-batch, while LN normalizes each sample on its own and is not limited by the batch size, so it is more stable when batches are small.
    2. Suitable for RNNs and single-sample inference: BN runs into difficulties with variable-length sequence data (as processed by RNNs) and with single-sample inference, whereas LN applies well to these scenarios because it normalizes each sample independently and does not depend on the batch size.
    3. Better robustness: BN relies on batch statistics that are representative of the data, so when the input distribution deviates greatly from what was seen in training, model performance may drop. LN computes its statistics from each input, so it stays robust when the input distribution varies widely.
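The batch-size independence listed above can be illustrated with a short sketch. It assumes a toy batch of 16 samples with 32 features and checks whether the output for the same samples changes when the rest of the batch changes. All names and sizes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch = torch.randn(16, 32)   # toy batch: 16 samples, 32 features
subset = batch[:4]            # the same first 4 samples as a smaller batch
single = batch[:1]            # the first sample alone (batch size 1)

ln = nn.LayerNorm(32)
bn = nn.BatchNorm1d(32)
bn.train()                    # training mode: BN uses mini-batch statistics

with torch.no_grad():
    # LN: each sample's output is the same whatever else is in the batch.
    ln_same = torch.allclose(ln(batch)[:1], ln(single), atol=1e-5)

    # BN: the output for the same samples changes with the batch they are in.
    bn_same = torch.allclose(bn(batch)[:4], bn(subset), atol=1e-3)

    # BN in training mode cannot normalize a single sample of shape (1, C):
    # PyTorch raises a ValueError because batch statistics need more than
    # one value per channel.
    try:
        bn(single)
        bn_single_ok = True
    except ValueError:
        bn_single_ok = False

print("LN output independent of batch:", ln_same)
print("BN output independent of batch:", bn_same)
print("BN accepts a single training sample:", bn_single_ok)
```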

2.3 Application of LN

LN is widely used in deep learning and has shown good results especially in language modeling tasks (for example, with recurrent neural networks). Here is a simple example showing how to use LN in PyTorch:

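import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

# Assumption: the original post does not define trainloader/testloader.
# MNIST is used here because it matches the 784 (28x28) input features
# of the network and the 10000 test images mentioned below.
transform = transforms.ToTensor()
trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
testset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=False)

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)
        self.ln1 = nn.LayerNorm(256)  # LN over the 256 features of fc1
        self.ln2 = nn.LayerNorm(128)  # LN over the 128 features of fc2

    def forward(self, x):
        x = torch.flatten(x, 1)
        # Fully connected layer -> LN -> ReLU
        x = torch.relu(self.ln1(self.fc1(x)))
        x = torch.relu(self.ln2(self.fc2(x)))
        x = self.fc3(x)
        return x

net = Net()  # network that uses LN

# Define the loss function and the optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# Train the network
net.train()
for epoch in range(10):  # train for 10 epochs
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
    print('Epoch %d Loss: %.3f' % (epoch + 1, running_loss / len(trainloader)))

print('Finished Training')

# Test the network
net.eval()  # no effect on LN itself, but good practice before evaluation
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        inputs, labels = data
        outputs = net(inputs)
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))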

The example above shows how to add LN layers to an image classification network. To observe the effect of LN, train the same network with the LN layers removed and compare the loss curves and test accuracy: LN typically makes training more stable and can improve performance.

3. Comparison of BN and LN in deep learning

As normalization methods commonly used in deep learning, BN and LN have different advantages in different scenarios. The following compares BN and LN:

3.1 Batch Size during training

BN places higher demands on the batch size during training, because it computes the mean and variance within the batch and uses them for normalization. When the batch size is small, these estimates become inaccurate, which hurts model performance. LN places no such demands, because it normalizes each sample independently of the batch, so it maintains good results even with a small batch size.
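The effect of batch size on the quality of the batch statistics can be sketched numerically. The helper below is hypothetical and only for illustration: it draws many batches from a standard normal distribution (true mean 0) and measures how much the batch mean fluctuates from batch to batch.

```python
import torch

torch.manual_seed(0)

def batch_mean_spread(batch_size, num_batches=1000):
    """Standard deviation of the batch mean across many random batches.

    Hypothetical helper for illustration: every batch is drawn from the
    same standard normal distribution, whose true mean is 0.
    """
    batches = torch.randn(num_batches, batch_size)
    batch_means = batches.mean(dim=1)   # one mean estimate per batch
    return batch_means.std().item()

spread_small = batch_mean_spread(2)     # batch size 2
spread_large = batch_mean_spread(256)   # batch size 256

print(f"spread of the batch mean, batch size 2:   {spread_small:.3f}")
print(f"spread of the batch mean, batch size 256: {spread_large:.3f}")
```

The spread shrinks roughly in proportion to $1/\sqrt{B}$ for batch size $B$, which is why the statistics BN relies on are much noisier for small batches.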

3.2 Robustness to input data distribution

BN is more sensitive to the distribution of the input data. During training it relies on mini-batch statistics, and during inference on statistics accumulated from the training data, so when the input distribution varies widely or differs from the training distribution, model performance may drop. LN computes the mean and variance from each sample itself, so its output is unaffected by a change in the scale or offset of an individual input, and it is more stable when processing data with different distributions.
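A minimal sketch of this difference, under stated assumptions: the inputs `x` stand in for data resembling the training distribution, `shifted` is the same data after a change of scale and offset, and the BN layer is used in inference mode with its default running statistics (mean 0, variance 1) standing in for values learned during training.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(4, 64)        # inputs resembling the training distribution
shifted = 10.0 * x + 3.0      # same samples after a change of scale and offset

ln = nn.LayerNorm(64)
bn = nn.BatchNorm1d(64)
bn.eval()  # inference mode: BN normalizes with its stored running statistics

with torch.no_grad():
    # LN recomputes the statistics from each input, so the shift cancels out.
    ln_gap = (ln(shifted) - ln(x)).abs().max().item()
    # BN keeps using the stored statistics, so the shift passes through.
    bn_gap = (bn(shifted) - bn(x)).abs().max().item()

print(f"max output change with LN: {ln_gap:.2e}")
print(f"max output change with BN (eval mode): {bn_gap:.2e}")
```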

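3.3 Inference process of the model

During inference, BN does not use batch statistics. Instead, it uses the running mean and variance accumulated during training, so these values must be saved with the model. This costs a small amount of extra storage, but because the statistics are fixed at inference time, BN reduces to a per-feature affine transform that can be folded into the preceding convolutional or linear layer. LN does not need to save any statistics, but it must compute the mean and variance of every input at inference time, exactly as it does in training. LN therefore behaves identically in training and inference, while BN requires switching the model to evaluation mode.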
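The stored statistics can be inspected directly in PyTorch. The sketch below lists the buffers of a BN layer and an LN layer, runs one training step on a toy batch (shape and values are illustrative assumptions), and then reproduces the evaluation-mode output of BN from the stored running statistics.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(4)
ln = nn.LayerNorm(4)

# Buffers are non-trainable tensors saved with the model.
bn_buffers = sorted(name for name, _ in bn.named_buffers())
ln_buffers = sorted(name for name, _ in ln.named_buffers())
print("BN buffers:", bn_buffers)
print("LN buffers:", ln_buffers)

x = torch.randn(32, 4) * 2.0 + 1.0   # toy batch, illustrative values

with torch.no_grad():
    bn.train()
    bn(x)  # one training-mode pass updates the running statistics
    # With the default momentum of 0.1 and an initial running mean of 0:
    expected_running_mean = 0.1 * x.mean(dim=0)

    bn.eval()
    y_eval = bn(x)  # evaluation mode: uses the stored statistics
    # gamma=1 and beta=0 at initialization, so only the normalization remains.
    y_manual = (x - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps)

print(torch.allclose(bn.running_mean, expected_running_mean, atol=1e-5))
print(torch.allclose(y_eval, y_manual, atol=1e-5))
```

Forgetting to call `eval()` before testing is a common source of errors with BN, because the layer would then keep normalizing with the statistics of each test batch.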

3.4 Application scenarios

BN is widely used in tasks such as image classification, especially with large-scale image data and larger batch sizes. LN, in contrast, has shown better results in tasks such as language modeling, and it maintains good performance even with a small batch size.
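The two typical scenarios correspond to different tensor layouts. As a sketch with assumed toy shapes, an image batch of shape (N, C, H, W) is normalized per channel by `nn.BatchNorm2d`, while a sequence batch of shape (N, T, D) is normalized per token over its D features by `nn.LayerNorm`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Image batch: 8 RGB images of 32x32 pixels (shape is an illustrative assumption).
images = torch.randn(8, 3, 32, 32)
bn2d = nn.BatchNorm2d(3)
bn2d.train()

# Sequence batch: 2 sequences, 5 tokens each, 16 features per token.
tokens = torch.randn(2, 5, 16)
ln = nn.LayerNorm(16)

with torch.no_grad():
    out_img = bn2d(images)
    out_tok = ln(tokens)

# BN: one mean per channel, taken over batch, height, and width.
per_channel_mean = out_img.mean(dim=(0, 2, 3))
# LN: one mean per token, taken over the feature dimension only.
per_token_mean = out_tok.mean(dim=-1)

print(per_channel_mean.shape, per_channel_mean)   # 3 values, each close to 0
print(per_token_mean.shape)                       # one value per token: (2, 5)
```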

4. Conclusion

As normalization methods in deep learning, BN and LN each have their own advantages and applicable scenarios. BN is suited to tasks such as image classification, especially with large images and larger batch sizes, where it can improve model performance. LN is suited to tasks such as language modeling: it maintains good performance with a small batch size and stays robust when the distribution of the input data varies widely.
In practice, therefore, the normalization method should be chosen according to the specific task and data. Meanwhile, as research in deep learning continues to develop, new normalization methods keep emerging, such as Instance Normalization (IN), Group Normalization (GN), and Switchable Normalization (SN). These methods may perform better in particular scenarios.
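As a brief illustration of how these variants relate to LN, the sketch below uses PyTorch's `nn.GroupNorm` on a toy tensor (shape chosen only for illustration). With a single group it normalizes over all channels and positions of each sample, like LN over the (C, H, W) dimensions. With one group per channel it normalizes each channel of each sample separately, like IN. The equalities hold at default initialization; once trained, the learnable parameters of these layers have different shapes and the outputs can differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(2, 6, 8, 8)   # toy input: 2 samples, 6 channels, 8x8 positions

with torch.no_grad():
    # One group: statistics over (C, H, W) of each sample.
    gn_one_group = nn.GroupNorm(num_groups=1, num_channels=6)(x)
    ln_all = nn.LayerNorm([6, 8, 8])(x)

    # One group per channel: statistics over (H, W) of each sample and channel.
    gn_per_channel = nn.GroupNorm(num_groups=6, num_channels=6)(x)
    inorm = nn.InstanceNorm2d(6)(x)

like_ln = torch.allclose(gn_one_group, ln_all, atol=1e-4)
like_in = torch.allclose(gn_per_channel, inorm, atol=1e-4)

print("GroupNorm with 1 group matches LayerNorm:", like_ln)
print("GroupNorm with C groups matches InstanceNorm:", like_in)
```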

Origin blog.csdn.net/qq_41667743/article/details/130095908