PyTorch Study Notes (6)

Author: chen_h
WeChat & QQ: 862251340
WeChat public account: coderpai


PyTorch Study Notes (1)

PyTorch Study Notes (2)

PyTorch Study Notes (3)

PyTorch Study Notes (4)

PyTorch Study Notes (5)

PyTorch Study Notes (6)


What is Batch Normalization?

Today we'll talk about Batch Normalization (batch standardization).

Ordinary data normalization


Batch Normalization, or batch standardization, is similar to ordinary data normalization: it is a way of bringing scattered data onto a unified scale, and it is also a method for optimizing neural networks. In an earlier video on normalization we mentioned that data with a unified scale makes it easier for machine learning to discover the patterns hidden in the data.
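
As a small illustration of what "a unified scale" means, here is a minimal sketch of ordinary standardization (zero mean and unit variance per feature); the data here is made up purely for the example:

import numpy as np

data = np.random.normal(loc=5.0, scale=3.0, size=(100, 2))    # hypothetical raw data, mean ~5, std ~3
normalized = (data - data.mean(axis=0)) / data.std(axis=0)    # shift to zero mean, scale to unit std
print(normalized.mean(axis=0), normalized.std(axis=0))        # approximately 0 and 1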

Normalizing every layer


In a neural network, the distribution of the data affects training. Suppose a neuron receives an input x = 1 with an initial weight of 0.1, so that layer's result is Wx = 0.1; or x = 20, so the result is Wx = 2. Nothing looks wrong yet, but once we add an activation function and activate this Wx value, the problem appears. With an activation function like tanh, the activated values of Wx = 0.1 and Wx = 2 already land near the saturated part of the function: no matter how much further x grows, the output of tanh only gets closer to 1. In other words, from the very start the neural network is no longer sensitive to those larger values of x. That is very bad: imagine that a gentle pat and a heavy blow feel exactly the same to me; it would prove my sensory system has failed. Of course, as mentioned before, we can pre-normalize the data so that the input range is not too large and the inputs fall into the sensitive region of the activation function. But this insensitivity problem is not limited to the input layer of the network; it also occurs, frequently, in the hidden layers.
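
To see the saturation concretely, here is a quick check (printed values rounded): once Wx is large, tanh hardly changes any more, so the neuron stops distinguishing between different large inputs.

import torch

wx = torch.tensor([0.1, 1.0, 2.0, 20.0, 100.0])
print(torch.tanh(wx))    # tensor([0.0997, 0.7616, 0.9640, 1.0000, 1.0000])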


So when x has moved on into the hidden layers, can we normalize the hidden layers' outputs the same way we normalized the inputs before? The answer is yes: some very clever people invented a technique called batch normalization precisely to handle this situation.

Where to add BN


The "batch" in batch normalization refers to mini-batches of data: in stochastic gradient descent the data is split into small batches. Moreover, during the forward propagation of each batch of data, normalization is applied at every layer.
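
A minimal sketch of what the forward pass does with one mini-batch: in training mode, nn.BatchNorm1d normalizes the batch with that batch's own mean and variance (the tensor shapes here are chosen just for illustration).

import torch
from torch import nn

batch = torch.randn(64, 10) * 4 + 3          # a hypothetical mini-batch: 64 samples, 10 features
bn = nn.BatchNorm1d(10)                      # gamma = 1, beta = 0 at initialization
out = bn(batch)                              # training mode: normalize with the batch statistics
manual = (batch - batch.mean(0)) / torch.sqrt(batch.var(0, unbiased=False) + bn.eps)
print(torch.allclose(out, manual, atol=1e-5))   # True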

BN effect

Batch normalization can also be viewed as adding one more layer to the neural network each time. Before, we would take the data X, add a fully connected layer, pass the fully connected layer's result through an activation function, and that would become the input of the next layer, where the same operations repeat. Batch Normalization (BN) is added between every fully connected layer and its activation function.
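
Expressed as layers, the ordering described above looks like this (a sketch; the layer sizes are made up):

import torch
from torch import nn

block = nn.Sequential(
    nn.Linear(1, 10),        # fully connected layer
    nn.BatchNorm1d(10),      # BN sits between the fully connected layer and the activation
    nn.Tanh(),               # activation function
)
print(block(torch.randn(32, 1)).shape)   # torch.Size([32, 10])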


As said before, the values entering the activation function matter, and if we look at a whole batch rather than a single value, the distribution of those values matters even more: for the data to be passed on efficiently, most values should fall in the activation function's sensitive region. Comparing the distribution of pre-activation values without normalization to the distribution after normalization, the normalized one clearly lets tanh make more efficient use of its nonlinearity.


Without normalization, most of the values activated by tanh end up in the saturated region, that is, most activated values are either -1 or 1; after normalization, the activated values are still spread across every part of the range. Passing this kind of distribution on to the next layer for further computation is far more valuable to the neural network. Batch normalization does not stop at normalizing the data, though: it also includes a "de-normalization" step. Why?

BN algorithm


Let's introduce the batch normalization formulas. The first three steps are exactly the normalization process we have been describing, but there is a reverse operation at the end: the normalized data is scaled and shifted back. This is so that the neural network can learn for itself, through the scale parameter gamma and the shift parameter beta, whether the preceding normalization actually helped the optimization; if it did not, the network can use gamma and beta to cancel out part of the normalization.
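
For reference, the standard Batch Normalization equations for one mini-batch {x_1, ..., x_m} are:

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta

The first two compute the batch mean and variance, the third normalizes each value, and the last is the reverse scale-and-shift step: gamma and beta are learned by the network, so it can undo part of the normalization if that turns out to work better.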


Finally, look at the distribution of each layer's output values at the end of training: the effect of Batch Normalization is visible at a glance, since it keeps the values passed on by every layer within an effective range.

Batch Normalization (batch standardization)

Batch standardization, in plain terms, applies standardization (normalization) to every layer of the neural network. We know that normalizing the input data makes machine learning learn more effectively, so if every layer can be seen as receiving "input data" from the layer before it, why not batch-standardize all the layers? For a concrete and intuitive explanation, see the "What is Batch Normalization" animation I made (recommended).

The animations in the tutorial linked at the end show, layer by layer, the difference between a neural network with batch normalization and one without.


Create some data

Let's make some fake data to simulate a realistic situation. Batch Normalization (BN from here on) can also effectively counter bad parameter initialization: for example, relu is afraid of all its input values falling into the negative range, so we shift all the bias parameters down to -0.2 (bias_initialization = -0.2) and see how well BN copes.


import torch
from torch import nn
from torch.nn import init
import torch.utils.data as Data
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np

# hyperparameters
N_SAMPLES = 2000
BATCH_SIZE = 64
EPOCH = 12
LR = 0.03
N_HIDDEN = 8
ACTIVATION = F.tanh     # you can try swapping in relu
B_INIT = -0.2   # simulate bad parameter initialization

# training data
x = np.linspace(-7, 10, N_SAMPLES)[:, np.newaxis]
noise = np.random.normal(0, 2, x.shape)
y = np.square(x) - 5 + noise

# test data
test_x = np.linspace(-7, 10, 200)[:, np.newaxis]
noise = np.random.normal(0, 2, test_x.shape)
test_y = np.square(test_x) - 5 + noise

train_x, train_y = torch.from_numpy(x).float(), torch.from_numpy(y).float()
test_x = torch.from_numpy(test_x).float()
test_y = torch.from_numpy(test_y).float()

train_dataset = Data.TensorDataset(train_x, train_y)
train_loader = Data.DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2,)

# show data
plt.scatter(train_x.numpy(), train_y.numpy(), c='#FF9359', s=50, alpha=0.2, label='train')
plt.legend(loc='upper left')
plt.show()

Build neural networks

Here I will show you how to build a neural network with BN. BN can actually be seen as one more layer (a BN layer), so we add BN layers just as we add any other layer. Note that I also apply a BN to the input data: if you view the input data as the output of a preceding layer, we can BN it as well. Also note that the description of momentum in the video is wrong: it is not used to update the scale and shift parameters, but to smooth the running batch mean and stddev.

class Net(nn.Module):
    def __init__(self, batch_normalization=False):
        super(Net, self).__init__()
        self.do_bn = batch_normalization
        self.fcs = []   # too many layers, so we build them in a for loop
        self.bns = []
        self.bn_input = nn.BatchNorm1d(1, momentum=0.5)   # BN for the input data

        for i in range(N_HIDDEN):               # build the layers
            input_size = 1 if i == 0 else 10
            fc = nn.Linear(input_size, 10)
            setattr(self, 'fc%i' % i, fc)       # Note! PyTorch requires each layer to be an attribute of the class so it gets registered! It took me 2 days to find this bug
            self._set_init(fc)                  # parameter initialization
            self.fcs.append(fc)
            if self.do_bn:
                bn = nn.BatchNorm1d(10, momentum=0.5)
                setattr(self, 'bn%i' % i, bn)   # same note: the BN layer must also be a class attribute
                self.bns.append(bn)

        self.predict = nn.Linear(10, 1)         # output layer
        self._set_init(self.predict)            # parameter initialization

    def _set_init(self, layer):     # parameter initialization
        init.normal_(layer.weight, mean=0., std=.1)
        init.constant_(layer.bias, B_INIT)

    def forward(self, x):
        pre_activation = [x]
        if self.do_bn: x = self.bn_input(x)    # apply BN to the input if enabled
        layer_input = [x]
        for i in range(N_HIDDEN):
            x = self.fcs[i](x)
            pre_activation.append(x)    # saved for plotting later
            if self.do_bn: x = self.bns[i](x)  # apply BN before the activation if enabled
            x = ACTIVATION(x)
            layer_input.append(x)       # saved for plotting later
        out = self.predict(x)
        return out, layer_input, pre_activation

# build two nets, one with BN and one without
nets = [Net(batch_normalization=False), Net(batch_normalization=True)]

Training

For training, the two neural networks are trained separately under exactly the same training setup.

opts = [torch.optim.Adam(net.parameters(), lr=LR) for net in nets]

loss_func = torch.nn.MSELoss()

losses = [[], []]  # one list per network to record the training loss
for epoch in range(EPOCH):
    print('Epoch: ', epoch)
    for step, (b_x, b_y) in enumerate(train_loader):
        for net, opt, l_his in zip(nets, opts, losses):     # train both networks
            pred, _, _ = net(b_x)
            loss = loss_func(pred, b_y)
            opt.zero_grad()
            loss.backward()
            opt.step()                  # this also trains the parameters inside the BN layers
            l_his.append(loss.item())   # record the training loss for this network
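
One practical note (not shown in the snippet above): a network containing BN layers behaves differently in training and evaluation mode, because in eval mode BatchNorm1d uses its running mean and variance instead of the current batch's statistics. A minimal sketch of evaluating the two trained nets on the test data, reusing the names defined above:

for net in nets:
    net.eval()                                   # switch the BN layers to evaluation mode
with torch.no_grad():
    preds = [net(test_x)[0] for net in nets]     # forward() returns (out, layer_input, pre_activation)
for net in nets:
    net.train()                                  # switch back if training continues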

Compare the results

First, recall what the two activation functions being compared, relu and tanh, look like.


Then we compare the results obtained with the two activation functions.


With relu as the activation function, we can see that the error without BN stays higher and the fitted line cannot fit the data, because our "bad initialization" trick of setting bias = -0.2 means relu cannot capture the input values that land in negative territory. With BN, this is not a problem.
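
To make the relu problem concrete, here is a tiny illustration (the weights are hypothetical, not the ones from the network above): with small weights and bias = -0.2, every pre-activation is negative, so relu outputs zero everywhere and no gradient can flow.

import torch
import torch.nn.functional as F

x = torch.linspace(-7, 10, 5).unsqueeze(1)   # inputs in the same range as our data
w = torch.full((1, 10), 0.01)                # small hypothetical weights
b = torch.full((10,), -0.2)                  # the bad bias initialization
print(F.relu(x @ w + b))                     # all zeros -> the relu units are dead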


With tanh as the activation function, we can see that the bad initialization makes the pre-activation data very spread out, while with BN the data is pulled back together. Data that has been pulled together can then make good use of the activation function's nonlinearity. We can also see that without BN, the activated values are all pushed to the two ends of tanh, where the gradients are very small, so the error cannot be propagated backward and the neural network dies.
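
The "dead network" effect comes from the gradient of tanh at its two ends. A quick check of the derivative values, 1 - tanh(x)^2 (printed values rounded):

import torch

x = torch.tensor([0.0, 1.0, 3.0, 6.0], requires_grad=True)
torch.tanh(x).sum().backward()
print(x.grad)    # tensor([1.0000, 0.4200, 0.0099, 0.0000]) -- almost no gradient at the ends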

Links:

https://morvanzhou.github.io/tutorials/machine-learning/torch/5-04-A-batch-normalization/

https://morvanzhou.github.io/tutorials/machine-learning/torch/5-04-batch-normalization/

https://github.com/MorvanZhou/PyTorch-Tutorial/blob/master/tutorial-contents/504_batch_normalization.py



Origin: https://blog.csdn.net/CoderPai/article/details/104187376