Feasibility Analysis of Model Weight Initialization

Original link: https://www.leiphone.com/news/201703/3qMp45aQtbxTdzmK.html

The original is an article written by a Google engineer. After reading it, I found it very good: it lets you visually understand how network depth, weight initialization, and the choice of activation function affect model training.

This article is an interpretation of the original, along with my own understanding and code.

First of all, a good weight initialization method can help a neural network find the optimal solution faster.

Necessary condition 1 for a good weight initialization: the activation values of each network layer should not fall within the saturation region of the activation function;

Necessary condition 2 for a good weight initialization: the activation values of each network layer should be neither too close to 0 nor too far from 0, ideally with a mean of 0 (a distribution centered at 0).
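For intuition on condition 1 (a standard fact about tanh, not from the original article): the gradient of tanh vanishes in its saturation region, since

$$ \tanh'(x) = 1 - \tanh^2(x) \;\to\; 0 \quad \text{as} \quad \tanh(x) \to \pm 1 $$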

1. Is initializing all weights to 0 feasible?

        Not feasible. If all weights are initialized to 0, the output values of all neurons are the same; then during backpropagation the gradients within each layer are all the same, so the weight updates are all the same, and training is pointless.
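A quick numpy sketch (not from the original article) makes this concrete. With tanh and all-zero weights, every gradient is in fact exactly zero, so there is no learning signal at all:

import numpy as np

# Tiny 2-layer net with every weight initialized to 0 (illustrative sketch).
x = np.random.randn(4, 3)              # batch of 4 samples, 3 features
W1 = np.zeros((3, 5))
W2 = np.zeros((5, 2))

h = np.tanh(x.dot(W1))                 # every hidden unit outputs exactly 0
y = h.dot(W2)
dy = np.ones_like(y)                   # pretend upstream gradient

dW2 = h.T.dot(dy)                      # all zeros, since h is all zeros
dh = dy.dot(W2.T) * (1 - h ** 2)       # all zeros, since W2 is all zeros
dW1 = x.T.dot(dh)                      # all zeros -> the weights never get updated

print(np.allclose(dW1, 0), np.allclose(dW2, 0))  # True True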

2. Several feasible initialization approaches:

Pre-training:

That is, use the parameters of an already-trained model as the initialization, and then fine-tune.
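A minimal TF1-style sketch of this idea, assuming a hypothetical checkpoint path 'pretrained/model.ckpt' and an already-built graph that matches the checkpoint:

import tensorflow as tf

# Build (or import) the same graph that produced the checkpoint, then:
saver = tf.train.Saver()                              # restores all variables by name
with tf.Session() as sess:
    saver.restore(sess, 'pretrained/model.ckpt')      # hypothetical path: restored weights become the init
    # ... continue training (fine-tuning), typically with a smaller learning rate ...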

Random initialization:

A 10-layer network with randomly initialized weights; the output data distribution of each layer:

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

data = tf.constant(np.random.randn(2000, 800), dtype=tf.float32)
layer_sizes = [800 - 50 * i for i in range(0, 10)]  # 10-layer network; layer widths 800, 750, ..., 350
num_layers = len(layer_sizes)

fcs = []  # to store each fully connected layer's output
for i in range(0, num_layers - 1):
    X = data if i == 0 else fcs[i - 1]
    node_in = layer_sizes[i]
    node_out = layer_sizes[i + 1]
    # random normal weights with mean 0 and standard deviation 0.01
    W = tf.Variable(np.random.randn(node_in, node_out), dtype=tf.float32) * 0.01
    fc = tf.matmul(X, W)
    fc = tf.nn.tanh(fc)
    fcs.append(fc)

# plot a histogram of each layer's activations
plt.figure()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(0, num_layers - 1):
        plt.subplot(1, num_layers, i + 1)
        x = fcs[i].eval().flatten()
        plt.hist(x=x, bins=20, range=(-1, 1))
plt.show()

This builds a 10-layer neural network with tanh activations, where each layer's weights are drawn from a normal distribution with mean 0 and standard deviation 0.01. The output distributions are shown below:

[Figure: histograms of each layer's outputs, tanh activation, std = 0.01 random initialization]

As the figure shows, as the number of layers increases, the output values become more and more concentrated around 0, and the outputs of the later layers are essentially all 0. For f = W·X + b, the partial derivative with respect to W during backpropagation is computed from X, the output of the previous layer; that is, X is a multiplicative factor in the backpropagated gradient. When those outputs are close to 0, the gradients become very small and the parameters are hard to update.
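In symbols, for a single layer f = W·X + b with loss L, backpropagation gives

$$ \frac{\partial L}{\partial W} = \frac{\partial L}{\partial f}\, X^{\top}, \qquad X \approx 0 \;\Rightarrow\; \frac{\partial L}{\partial W} \approx 0 $$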

Now make the standard deviation of the normal initialization above larger, setting it to 1:
W = tf.Variable(np.random.randn(node_in, node_out), dtype=tf.float32)

Look at the output distribution:

[Figure: histograms of each layer's outputs, tanh activation, std = 1 random initialization]

We can see that the output values are concentrated near 1 and -1. Since the activation function is tanh, the activations fall into its saturation region, where the gradient of tanh near -1 and 1 is close to 0, so the parameters are again hard to update.

  • Xavier initialization

    Xavier initialization can solve the above problems. The idea is to keep the variance of a layer's output consistent with the variance of its input, which prevents the output values from all collapsing toward 0:

  • W = tf.Variable(np.random.randn(node_in, node_out), dtype=tf.float32) / np.sqrt(node_in)

    Below are the histograms of each layer's data distribution after Xavier initialization:

    [Figure: histograms of each layer's outputs, tanh activation, Xavier initialization]

    Wow, even after many layers the output distributions remain good, which is very beneficial for optimizing the neural network!
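    For reference, here is the variance argument behind dividing by sqrt(node_in) (assuming zero-mean, mutually independent weights and inputs and a roughly linear activation): for a single output unit $y = \sum_{i=1}^{n_{\text{in}}} w_i x_i$,

    $$ \mathrm{Var}(y) = n_{\text{in}}\,\mathrm{Var}(w)\,\mathrm{Var}(x), \qquad \mathrm{Var}(w) = \frac{1}{n_{\text{in}}} \;\Rightarrow\; \mathrm{Var}(y) = \mathrm{Var}(x) $$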

  • Xavier initialization is derived for linear activations, so it does not necessarily apply to nonlinear ones. Only the tanh activation has been discussed so far; below we test the ReLU activation.

  • It can be seen from the figure that, because of the characteristics of ReLU, each layer's output is biased toward the range 0-1, so the distribution is not zero-centered; and in the later layers the data again piles up near 0.

    It seems that Xavier initialization is not very suitable for the ReLU activation function. Let's see whether He initialization can solve the problem for ReLU. The idea behind He initialization: in a ReLU network, assume that half of the neurons in each layer are activated and the other half output 0, so to keep the variance constant we just divide by 2 on top of Xavier.
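    In other words (under the same zero-mean, independence assumptions as above, with only about half of the ReLU units active):

    $$ \mathrm{Var}(y) \approx \frac{n_{\text{in}}}{2}\,\mathrm{Var}(w)\,\mathrm{Var}(x), \qquad \mathrm{Var}(w) = \frac{2}{n_{\text{in}}} \;\Rightarrow\; \mathrm{Var}(y) \approx \mathrm{Var}(x) $$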

  • W = tf.Variable(np.random.randn(node_in, node_out), dtype=tf.float32) / np.sqrt(node_in / 2)
    ......
    fc = tf.nn.relu(fc)
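Put together, the layer-construction loop from the first experiment might look like this with He initialization and ReLU (a sketch; the data tensor and the plotting code are unchanged from above):

fcs = []
for i in range(0, num_layers - 1):
    X = data if i == 0 else fcs[i - 1]
    node_in = layer_sizes[i]
    node_out = layer_sizes[i + 1]
    # He initialization: standard deviation sqrt(2 / node_in)
    W = tf.Variable(np.random.randn(node_in, node_out), dtype=tf.float32) / np.sqrt(node_in / 2)
    fc = tf.matmul(X, W)
    fc = tf.nn.relu(fc)          # ReLU instead of tanh
    fcs.append(fc)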

Looking at the output distributions: although they are not zero-centered (i.e., a distribution with mean 0, centered at 0), at least the output values are quite stably distributed between 0 and 1, and no longer collapse toward 0 in the deeper layers as they did with Xavier. Good results; He initialization is recommended for ReLU networks.

[Figure: histograms of each layer's outputs, ReLU activation, He initialization]

Batch Normalization Layer:

       Batch Normalization is a clever way to weaken the influence of a rough or poor initialization. Before the nonlinear activation, we want the pre-activation values to follow a reasonably good distribution (e.g., a Gaussian), so that gradients can be computed effectively during backpropagation and the weights updated. Batch Normalization forcibly normalizes the pre-activation values to a Gaussian and then applies a learned linear transformation.
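Concretely, for a mini-batch $\{x_1,\dots,x_m\}$ the standard Batch Normalization transform is:

$$ \mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_B)^2, \quad \hat{x}_i = \frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}, \quad y_i = \gamma\,\hat{x}_i + \beta $$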

 

        All of the operations in Batch Normalization are smooth and differentiable, which makes it possible to effectively learn the corresponding parameters γ and β during backpropagation. Batch Normalization behaves differently at training time and at test time. During training, μ_B and σ_B are computed from the current batch; at test time, μ_B and σ_B should be the values saved during training (e.g., running averages), rather than being computed from the current batch.
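A minimal sketch of how this train/test difference is usually handled with tf.contrib.layers.batch_norm (assuming a placeholder-controlled training flag, and a loss and fc tensor defined elsewhere):

is_training = tf.placeholder(tf.bool, name='is_training')

# use batch statistics during training, saved moving averages at test time
fc = tf.contrib.layers.batch_norm(fc, center=True, scale=True, is_training=is_training)

# the moving averages are updated through UPDATE_OPS, so run them with the train step
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)  # loss assumed defined elsewhere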

Batch Normalization test:

ReLU activation, random initialization, without Batch Normalization:

Random initialization, with Batch Normalization:

fc = tf.contrib.layers.batch_norm(fc, center=True, scale=True, is_training=True)
# Note: the input data and all layer outputs must be converted to dtype=float32 beforehand; batch_norm requires float32, and inconsistent dtypes will raise an error.
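In the experiment loop this line goes between the linear transform and the nonlinearity, i.e. the normalization is applied to the pre-activation values (a sketch of the layer body):

fc = tf.matmul(X, W)
fc = tf.contrib.layers.batch_norm(fc, center=True, scale=True, is_training=True)
fc = tf.nn.relu(fc)
fcs.append(fc)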

[Figure: histograms of each layer's outputs, ReLU activation, random initialization with Batch Normalization]

As the figure shows, after adding Batch Normalization the output distributions of the later layers remain good as the network deepens and no longer tend toward 0. Good results.

Recommended initialization

            · With the ReLU activation function, the variant of Xavier Initialization called He Initialization is recommended:

                [Formula image: He initialization, weights drawn with variance 2/n_in]

  •   Using a Batch Normalization layer can effectively reduce a deep network's dependence on weight initialization.

Summary: a good weight initialization and activation function allow data to flow normally through the network and the weights to be updated normally, achieving the purpose of learning. The experiments above give an intuitive understanding of how weight initialization and the activation function affect the distribution of layer outputs.

Origin: www.cnblogs.com/zzc-Andy/p/11511703.html