Author: chen_h
WeChat & QQ: 862251340
WeChat official account: coderpai
PyTorch study notes (6)
What is Batch Normalization
Today we'll talk about Batch Normalization.
Ordinary data normalization
Batch Normalization, like ordinary data normalization, is a way of bringing scattered data into a common range, and it is also a technique for optimizing neural networks. As mentioned in the earlier video on normalization, data that follows a common scale makes it easier for a machine learning model to pick up the patterns in the data.
Normalizing every layer
In a neural network, the distribution of the data affects training. Suppose a neuron has the value x = 1 and a weight W has the initial value 0.1, so the next layer computes Wx = 0.1; or suppose x = 20, so that Wx = 2. Nothing looks wrong yet, but the problem appears once we add an activation function and apply it to Wx. With an activation function such as tanh, the activated values become roughly 0.1 and roughly 1. A value close to 1 already sits in the saturated region of the activation function: no matter how much larger x becomes, the output of tanh stays close to 1. In other words, from the very start of training the network is no longer sensitive to features with large values of x. That is bad. Imagine that a light pat and a hard punch felt the same to me; it would mean my senses had stopped working. We can of course apply the normalization preprocessing mentioned earlier, so that the range of the input x is not too large and the inputs pass through the sensitive part of the activation function. But this loss of sensitivity occurs not only at the input layer of the network; it often occurs in the hidden layers as well.
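The numbers in this example can be checked directly. The following is a minimal sketch using only the standard library; x = 1 and x = 20 come from the text, while x = 200 is an extra input assumed here to show that the output stops growing.

```python
import math

W = 0.1  # initial weight from the example in the text

# x = 1 and x = 20 come from the text; x = 200 is an extra, assumed input
for x in (1, 20, 200):
    wx = W * x
    print(f"x = {x:>3}  Wx = {wx:>4.1f}  tanh(Wx) = {math.tanh(wx):.4f}")
```

Going from x = 20 to x = 200 multiplies the input by ten, yet the activated value barely moves, which is the insensitivity described above.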
In the hidden layers, x is simply the input to the layer. Can we normalize the inputs of the hidden layers in the same way as before? The answer is yes: researchers invented a technique called batch normalization that deals with exactly this case.
Where BN is added
The batch in batch normalization refers to batches of data: the data is split into small batches for stochastic gradient descent. During forward propagation of each batch, normalization is applied at every layer.
Effect of BN
Batch normalization can also be viewed as a layer. When building a network layer by layer, we start with the data X and add a fully connected layer; its output passes through an activation function and becomes the input of the next layer, and then the same operations repeat. Batch Normalization (BN) is inserted between each fully connected layer and its activation function.
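The placement described here can be sketched as one hidden block in PyTorch. This is only an illustration: the layer sizes, the batch of 64 samples and the input range are assumptions chosen for the example, not values fixed by the text.

```python
import torch
from torch import nn

torch.manual_seed(0)

# One hidden block: fully connected -> BN -> activation
fc = nn.Linear(1, 10)
bn = nn.BatchNorm1d(10)
act = nn.Tanh()

x = torch.linspace(-7, 10, 64).unsqueeze(1)   # one mini-batch of 64 samples (assumed)
pre_activation = bn(fc(x))                    # normalized per feature over the batch
out = act(pre_activation)

print(pre_activation.mean(dim=0))             # close to 0 for every feature
print(pre_activation.std(dim=0, unbiased=False))  # at most 1, close to 1 when the feature varies
```

Because BN sits before the activation, tanh receives values centered on its sensitive region regardless of the scale of the raw input.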
As said before, the value of the result just before it enters the activation function matters. Looking at more than a single value, we can say that the distribution of those values matters to the activation function: only data whose values mostly fall in its sensitive range is passed on effectively. Compare the distributions of two sets of values before activation, one without normalization and one with. The normalized one clearly lets tanh do its nonlinear work more effectively.
When data that has not been normalized goes through tanh, most activation values end up in the saturated region, that is, most of them are either -1 or 1. After normalization, activation values are still present across the whole range. This activated distribution is then passed to the next layer for further computation, and a distribution that covers every interval is more valuable to the network. Batch normalization does not only normalize the data; it also applies a step that reverses the normalization. Why?
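To put a rough number on the saturation effect, the sketch below counts how many tanh outputs land in the saturated region with and without normalization. The input spread (standard deviation 5) and the threshold 0.9 are assumptions picked for illustration, and saturated_fraction is a hypothetical helper.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical pre-activation values with a wide spread (std = 5, an assumed figure)
raw = [random.gauss(0.0, 5.0) for _ in range(10_000)]

# Normalize to zero mean and unit variance
mean = statistics.fmean(raw)
std = statistics.pstdev(raw)
normalized = [(v - mean) / std for v in raw]

def saturated_fraction(values, threshold=0.9):
    """Fraction of values whose tanh lies in the region |tanh| > threshold."""
    return sum(abs(math.tanh(v)) > threshold for v in values) / len(values)

frac_raw = saturated_fraction(raw)
frac_norm = saturated_fraction(normalized)
print(f"saturated without normalization: {frac_raw:.2f}")
print(f"saturated with normalization:    {frac_norm:.2f}")
```

Under these assumptions most of the unnormalized values are pushed to the ends of tanh, while the normalized values mostly stay in the region where tanh still responds to changes in its input.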
BN algorithm
Let's look at the batch normalization formula. The first three steps are the normalization procedure we have been discussing (compute the batch mean, compute the batch variance, normalize), but the formula ends with a reverse operation that scales and shifts the normalized data again. The reason is to let the neural network learn to use and adjust the scale parameter γ and the shift parameter β by itself. In this way the network can gradually work out whether the preceding normalization actually helps optimization; if it does not, the network can use γ and β to cancel out part of the normalization.
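The steps of the formula can be written out by hand and compared with PyTorch's own layer. This is a sketch under stated assumptions: the batch shape and the input scale are made up, and eps is the small stabilizing constant that nn.BatchNorm1d uses by default.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(64, 10) * 3 + 2        # a mini-batch: 64 samples, 10 features (assumed)
eps = 1e-5                             # small constant for numerical stability

# Steps 1-3: normalize with the statistics of the mini-batch
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)     # biased variance, as used during training
x_hat = (x - mean) / torch.sqrt(var + eps)

# Step 4: scale and shift with the learnable gamma and beta (initially 1 and 0)
gamma = torch.ones(10)
beta = torch.zeros(10)
y = gamma * x_hat + beta

# Compare with PyTorch's layer, which starts in training mode
bn = nn.BatchNorm1d(10, eps=eps)
same = torch.allclose(y, bn(x), atol=1e-4)
print(same)

# If gamma and beta took these values, the normalization would be undone
undone = torch.sqrt(var + eps) * x_hat + mean
recovered = torch.allclose(undone, x, atol=1e-4)
print(recovered)
```

The last lines illustrate the point about the reverse operation: with a suitable γ and β the layer can reproduce its input, so the network is free to keep, weaken or cancel the normalization.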
Finally, consider a plot of the distribution of each layer's output values at the end of training. It shows the effect of batch normalization at a glance: the values of every layer are passed on within the effective range.
Batch Normalization
Put simply, batch normalization normalizes every layer of a neural network. We know that normalizing the input data lets a machine learning model learn efficiently. If we regard every layer as something that receives input data in the same way, why not batch-normalize all the layers? For a detailed and clear explanation, see the animated introduction What is Batch Normalization (recommended).
The two animations in the original tutorial show the difference between a network with and without batch normalization at every layer.
Generating some data
We generate some fake data to simulate a real situation. Batch Normalization (BN from here on) can also keep a bad parameter initialization under control. An activation function such as ReLU, for example, suffers most when all values fall in the negative range, so we shift all the biases by -0.2 (bias_initialization = -0.2, B_INIT in the code) and see how well BN copes.
import torch
from torch import nn
from torch.nn import init
import torch.utils.data as Data
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np
# hyperparameters
N_SAMPLES = 2000
BATCH_SIZE = 64
EPOCH = 12
LR = 0.03
N_HIDDEN = 8
ACTIVATION = torch.tanh # you can also try torch.relu (F.tanh is deprecated)
B_INIT = -0.2 # simulate a bad parameter initialization
# training data
x = np.linspace(-7, 10, N_SAMPLES)[:, np.newaxis]
noise = np.random.normal(0, 2, x.shape)
y = np.square(x) - 5 + noise
# test data
test_x = np.linspace(-7, 10, 200)[:, np.newaxis]
noise = np.random.normal(0, 2, test_x.shape)
test_y = np.square(test_x) - 5 + noise
train_x, train_y = torch.from_numpy(x).float(), torch.from_numpy(y).float()
test_x = torch.from_numpy(test_x).float()
test_y = torch.from_numpy(test_y).float()
train_dataset = Data.TensorDataset(train_x, train_y)
train_loader = Data.DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2,)
# show data
plt.scatter(train_x.numpy(), train_y.numpy(), c='#FF9359', s=50, alpha=0.2, label='train')
plt.legend(loc='upper left')
plt.show()
Building the network
Here is how to build a neural network with BN. BN can be treated as a layer (a BN layer), so we add BN layers just as we add any other layer. Note that I also apply BN to the input data: if you regard the input data as the output of a preceding layer, it can be batch-normalized in the same way. Note also that the video describes momentum incorrectly: it is not used to update the scale and shift parameters, but to smooth the batch mean and stddev.
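The role of momentum can be seen on a tiny batch. The sketch below uses the same momentum=0.5 as the network in this section; the two-sample batch is an assumption chosen so that the arithmetic is easy to follow.

```python
import torch
from torch import nn

bn = nn.BatchNorm1d(1, momentum=0.5)
print(bn.running_mean, bn.running_var)  # initial values: 0 and 1

x = torch.tensor([[1.0], [3.0]])        # batch mean 2, unbiased batch variance 2
bn(x)                                   # a forward pass in training mode updates the running statistics

# running = (1 - momentum) * running + momentum * batch_statistic
print(bn.running_mean)                  # 0.5 * 0 + 0.5 * 2 = 1.0
print(bn.running_var)                   # 0.5 * 1 + 0.5 * 2 = 1.5
```

The scale and shift parameters (bn.weight and bn.bias) are not touched by this update; they change only through the optimizer.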
class Net(nn.Module):
    def __init__(self, batch_normalization=False):
        super(Net, self).__init__()
        self.do_bn = batch_normalization
        self.fcs = []  # many layers, so build them in a for loop
        self.bns = []
        self.bn_input = nn.BatchNorm1d(1, momentum=0.5)  # BN for the input

        for i in range(N_HIDDEN):  # build the hidden layers
            input_size = 1 if i == 0 else 10
            fc = nn.Linear(input_size, 10)
            setattr(self, 'fc%i' % i, fc)  # Important: PyTorch only registers a layer when it is set as an attribute of the class (it took me two days to find this bug)
            self._set_init(fc)  # parameter initialization
            self.fcs.append(fc)
            if self.do_bn:
                bn = nn.BatchNorm1d(10, momentum=0.5)
                setattr(self, 'bn%i' % i, bn)  # Important: register the BN layer as a class attribute too
                self.bns.append(bn)

        self.predict = nn.Linear(10, 1)  # output layer
        self._set_init(self.predict)  # parameter initialization

    def _set_init(self, layer):  # parameter initialization
        init.normal_(layer.weight, mean=0., std=.1)
        init.constant_(layer.bias, B_INIT)

    def forward(self, x):
        pre_activation = [x]
        if self.do_bn:
            x = self.bn_input(x)  # apply BN to the input if enabled
        layer_input = [x]
        for i in range(N_HIDDEN):
            x = self.fcs[i](x)
            pre_activation.append(x)  # saved for plotting later
            if self.do_bn:
                x = self.bns[i](x)  # apply BN if enabled
            x = ACTIVATION(x)
            layer_input.append(x)  # saved for plotting later
        out = self.predict(x)
        return out, layer_input, pre_activation

# build two nets, one with BN and one without
nets = [Net(batch_normalization=False), Net(batch_normalization=True)]
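The registration pitfall mentioned in the comments can be reproduced in a few lines. The code above solves it with setattr; the sketch below shows nn.ModuleList as one alternative, which is not what the tutorial code uses. The class names PlainList and Registered are hypothetical.

```python
from torch import nn

class PlainList(nn.Module):
    def __init__(self):
        super().__init__()
        # Layers kept only in a Python list are NOT registered as submodules
        self.fcs = [nn.Linear(10, 10) for _ in range(3)]

class Registered(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers every layer it holds
        self.fcs = nn.ModuleList([nn.Linear(10, 10) for _ in range(3)])

n_plain = len(list(PlainList().parameters()))
n_registered = len(list(Registered().parameters()))
print(n_plain)       # 0: the optimizer would see nothing to train
print(n_registered)  # 6: a weight and a bias for each of the 3 layers
```

An unregistered layer still runs in the forward pass, which is why the mistake is easy to miss: the loss is computed, but the layer's weights never reach the optimizer.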
Training
During training, the two networks are trained separately, under identical conditions.
opts = [torch.optim.Adam(net.parameters(), lr=LR) for net in nets]
loss_func = torch.nn.MSELoss()
losses = [[], []]  # one list per network to record the loss
for epoch in range(EPOCH):
    print('Epoch: ', epoch)
    for step, (b_x, b_y) in enumerate(train_loader):
        for net, opt in zip(nets, opts):  # train both networks
            pred, _, _ = net(b_x)
            loss = loss_func(pred, b_y)
            opt.zero_grad()
            loss.backward()
            opt.step()  # this also trains the parameters inside BN
Comparing the results
First, let's see what the two activation functions in this comparison look like.
Then we compare the results obtained with each activation function.
With relu as the activation function, the network without BN has a higher error and its curve cannot fit the data. The cause is our bad initialization, the initial bias = -0.2: it leaves relu unable to capture input values in the negative range. With BN, this is no longer a problem.
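The effect of the negative bias on relu can be illustrated without any training. The sketch below is a simplified stand-in for the tutorial network, not the network itself: it stacks 8 freshly initialized layers with the same weight and bias initialization and measures how many relu outputs in the last hidden layer are non-zero. The helper active_fraction and the batch of 256 inputs are assumptions made for this example.

```python
import torch
from torch import nn

torch.manual_seed(0)

def active_fraction(use_bn, n_layers=8, bias=-0.2):
    """Fraction of non-zero relu outputs in the last hidden layer."""
    x = torch.linspace(-7, 10, 256).unsqueeze(1)
    in_features = 1
    for _ in range(n_layers):
        fc = nn.Linear(in_features, 10)
        nn.init.normal_(fc.weight, mean=0.0, std=0.1)
        nn.init.constant_(fc.bias, bias)        # the "bad" initialization
        x = fc(x)
        if use_bn:
            x = nn.BatchNorm1d(10)(x)           # re-centers the values before relu
        x = torch.relu(x)
        in_features = 10
    return (x > 0).float().mean().item()

with torch.no_grad():
    frac_plain = active_fraction(use_bn=False)
    frac_bn = active_fraction(use_bn=True)

print(f"active relu outputs without BN: {frac_plain:.2f}")
print(f"active relu outputs with BN:    {frac_bn:.2f}")
```

Without BN, the small weights and the negative bias push the pre-activations below zero layer after layer, so fewer relu units respond to the input; with BN the values are re-centered before each relu and a much larger share stays active.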
With tanh as the activation function, the bad initialization leaves the data widely scattered before activation, whereas BN gathers it together. Gathered data fed into the activation function can make good use of its nonlinearity. Without BN, the activated values all end up at the two ends of tanh, where the gradient is very small, so the error cannot be propagated backward and the network dies.
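How small the gradient becomes at the two ends of tanh can be checked with autograd. This is a minimal sketch; the three input values are assumptions chosen to cover the center and the saturated region.

```python
import torch

x = torch.tensor([0.0, 2.0, 5.0], requires_grad=True)
torch.tanh(x).sum().backward()   # d/dx tanh(x) = 1 - tanh(x)^2

for value, grad in zip(x.tolist(), x.grad.tolist()):
    print(f"x = {value:.0f}  gradient = {grad:.5f}")
```

At x = 0 the gradient is 1, while in the saturated region it shrinks toward zero, so an error signal multiplied by it at every layer fades quickly on its way back through the network.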
Links:
https://morvanzhou.github.io/tutorials/machine-learning/torch/5-04-A-batch-normalization/
https://morvanzhou.github.io/tutorials/machine-learning/torch/5-04-batch-normalization/
https://github.com/MorvanZhou/PyTorch-Tutorial/blob/master/tutorial-contents/504_batch_normalization.py