Neural network weight initialization methods

1. Overview

  There are many weight initialization methods for neural networks, but each of them follows a certain logic and has its own application scenarios. First, we assume that every input feature follows a distribution with zero mean and unit variance (input data to a neural network is generally normalized precisely to satisfy this condition).

  To propagate information through the network well, the variance of the features at each layer should stay as equal as possible. How do we ensure that these variances are equal? We can start with the weight initialization.

  First, a derivation (assuming the inputs $x_i$ and weights $w_i$ are mutually independent):

  $var(s) = var(\sum_i^n w_i x_i)$

  $var(s) = \sum_i^n var(w_i x_i)$

  $var(s) = \sum_i^n E(w_i^2) E(x_i^2) - (E(x_i))^2 (E(w_i))^2$

  $var(s) = \sum_i^n (E(w_i))^2 var(x_i) + (E(x_i))^2 var(w_i) + var(x_i) var(w_i)$

  Assuming that x has zero mean, and since weights are typically initialized with zero mean as well, the equation above reduces to

  $var(s) = \sum_i^n var(x_i) var(w_i)$

  And since we assume each feature of x has unit variance, this can be rewritten as

  $var(s) = n * var(w)$

  To make $var(s) = 1$, we need

  $n * var(w) = 1$

  $var(w) = \frac{1}{n}$

  To stay in the same units as the weights themselves, we convert this variance into a standard deviation, which gives a standard deviation of $\frac{1}{\sqrt{n}}$.
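  The original post does not show code for this step; the following NumPy sketch (the dimension n = 512, the batch size, and the variable names are my own choices, not from the post) checks the conclusion numerically: with unit-variance inputs and weights of standard deviation $\frac{1}{\sqrt{n}}$, the output variance stays close to 1.

```python
# A minimal numerical check of var(s) = n * var(w); n = 512 is an arbitrary choice.
import numpy as np

rng = np.random.default_rng(0)
n = 512
x = rng.normal(0.0, 1.0, size=(10000, n))          # inputs: zero mean, unit variance
w = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n,))   # weights: std = 1/sqrt(n)

s = x @ w                                           # s = sum_i w_i * x_i per sample
print("mean:", s.mean(), "var:", s.var())           # roughly 0 and 1
```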

2. Initialization methods

  Now let's look at how each initialization method sets the variance so that the distribution of the inputs stays unchanged.

  1) uniform distribution

  For a uniform distribution $U(a, b)$, the mean and variance are $E(x) = \frac{a+b}{2}$ and $D(x) = \frac{(b-a)^2}{12}$.

  Assume a uniform distribution over $(-\frac{1}{\sqrt{d}}, \frac{1}{\sqrt{d}})$, where $d$ is the number of neurons. The expectation and variance are then:

  $E(x) = 0,  D(x) = \frac{1}{3d}$

  Substituting into $var(s) = n * var(w)$, we obtain:

  $var(s) = \frac{1}{3}$, so to make the final variance equal to 1 the variance must be multiplied by 3, i.e. the standard deviation multiplied by $\sqrt{3}$. For uniform initialization one therefore generally chooses the interval $(-\sqrt{\frac{3}{d}}, \sqrt{\frac{3}{d}})$.

  Xavier uniform init (Glorot uniform), i.e. tf.glorot_uniform_initializer(), initializes values over $(-\sqrt{\frac{6}{d_{in}+d_{out}}}, \sqrt{\frac{6}{d_{in}+d_{out}}})$. For a two-dimensional matrix, $d_{in}$ and $d_{out}$ are the first and second dimensions of the matrix.

  See the example below for tf.glorot_uniform_initializer():
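  The post's original code screenshot is not reproduced here. Instead, here is a minimal NumPy sketch of the same experiment, sampling by hand from the Glorot-uniform interval given above (the dimensions of 512 and the batch size are my assumptions):

```python
# Sketch of a single linear layer with Glorot-uniform weights (not the original TF code).
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 512, 512
limit = np.sqrt(6.0 / (d_in + d_out))               # Glorot uniform bound
w = rng.uniform(-limit, limit, size=(d_in, d_out))

x = rng.normal(0.0, 1.0, size=(10000, d_in))        # normalized inputs
y = x @ w                                           # one linear layer
print("mean:", y.mean(), "var:", y.var())           # stays close to 0 and 1
```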

    

  You can see that after one layer of the network, the expectation and variance of x stay essentially the same. For the plain uniform distribution, tf.random_uniform_initializer(), we initialize the parameters over $(-\sqrt{\frac{3}{d}}, \sqrt{\frac{3}{d}})$. The results are as follows:
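  Again, a NumPy sketch of the experiment rather than the original TensorFlow code (dimensions are assumed):

```python
# Sketch of a single linear layer with uniform weights over (-sqrt(3/d), sqrt(3/d)).
import numpy as np

rng = np.random.default_rng(0)
d = 512
limit = np.sqrt(3.0 / d)
w = rng.uniform(-limit, limit, size=(d, d))

x = rng.normal(0.0, 1.0, size=(10000, d))
y = x @ w
print("mean:", y.mean(), "var:", y.var())   # expectation ~0, variance ~1
```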

    

   2) normal distribution

  A normal distribution is specified directly by its expectation and standard deviation, so there is little to derive. To ensure $var(s) = 1$ we need $var(w) = \frac{1}{d}$, i.e. a standard deviation of $\sqrt{\frac{1}{d}}$.

  For tf.random_normal_initializer(), the standard deviation is set to $\sqrt{\frac{1}{d}}$. The results are as follows:
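  A NumPy sketch of this case, drawing weights with standard deviation $\sqrt{\frac{1}{d}}$ (the width 512 is an assumption):

```python
# Sketch of a single linear layer with normal weights, std = sqrt(1/d).
import numpy as np

rng = np.random.default_rng(0)
d = 512
w = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, d))

x = rng.normal(0.0, 1.0, size=(10000, d))
y = x @ w
print("mean:", y.mean(), "var:", y.var())   # variance stays close to 1
```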

    

   Xavier normal init (Glorot normal), i.e. tf.glorot_normal_initializer(), uses a standard deviation of $\sqrt{\frac{2}{d_{in}+d_{out}}}$, with the following results:
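  And the corresponding sketch for the Glorot-normal standard deviation $\sqrt{\frac{2}{d_{in}+d_{out}}}$:

```python
# Sketch of a single linear layer with Glorot-normal weights.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 512, 512
w = rng.normal(0.0, np.sqrt(2.0 / (d_in + d_out)), size=(d_in, d_out))

x = rng.normal(0.0, 1.0, size=(10000, d_in))
y = x @ w
print("mean:", y.mean(), "var:", y.var())   # again close to 0 and 1
```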

    

  3) constant initialization

    When initializing with a constant, the expectation of the weights is the constant value and their variance is 0. A combined sketch of the three constant initializers follows the list below.

    tf.zeros_initializer() initializes the weights to zero, which makes the output x all zeros.

      

    tf.ones_initializer() initializes the weights to one, which makes the variance of the output very large.

       

     tf.constant_initializer(): same as above.
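  As referenced above, here is a combined NumPy sketch of constant initialization (the width 512 is my assumption; the post used the TensorFlow initializers named above):

```python
# Sketch of constant weight initialization: zeros and ones.
import numpy as np

rng = np.random.default_rng(0)
d = 512
x = rng.normal(0.0, 1.0, size=(10000, d))

w_zeros = np.zeros((d, d))                 # like tf.zeros_initializer()
w_ones = np.ones((d, d))                   # like tf.ones_initializer() / constant 1
print("zeros:", (x @ w_zeros).var())       # 0.0 -- the output is all zeros
print("ones: ", (x @ w_ones).var())        # ~d, i.e. a very large variance
```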

      

 

 3. Introducing the activation function

   The results above are for purely linear operations, but in practice an activation function is introduced so that the neural network has stronger expressive power. What happens once an activation function is introduced?

   To see the effect more clearly, we increase the network to 100 layers, with the weights initialized using tf.glorot_normal_initializer().

  With no activation function:
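  A NumPy sketch of the 100-layer experiment without an activation; the Glorot-normal standard deviation $\sqrt{\frac{2}{d_{in}+d_{out}}}$ is sampled by hand, and the layer width of 512 and the batch size are my assumptions:

```python
# 100 purely linear layers with Glorot-normal weights (a sketch, not the original code).
import numpy as np

rng = np.random.default_rng(0)
d = 512
x = rng.normal(0.0, 1.0, size=(2048, d))
for _ in range(100):
    w = rng.normal(0.0, np.sqrt(2.0 / (d + d)), size=(d, d))
    x = x @ w                                 # no activation
print("mean:", x.mean(), "var:", x.var())     # variance stays on the order of 1
```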

    

  It can be seen that without an activation function, even at layer 100 the variance remains essentially unchanged.

  Introducing the tanh function:
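  The same sketch with tanh applied after every layer:

```python
# 100 layers with Glorot-normal weights and tanh activations (sketch).
import numpy as np

rng = np.random.default_rng(0)
d = 512
x = rng.normal(0.0, 1.0, size=(2048, d))
for _ in range(100):
    w = rng.normal(0.0, np.sqrt(2.0 / (d + d)), size=(d, d))
    x = np.tanh(x @ w)                        # tanh keeps shrinking the variance
print("mean:", x.mean(), "var:", x.var())     # the post reports ~0.005 at layer 100
```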

    

  As a result, the variance is reduced to about 0.005, which is why introducing normalization layers in deep networks is indeed so important.

  Introducing the relu function:

    

  From the results above it is clear that tf.glorot_normal_initializer() does not work well with the relu activation. For relu, one typically uses a normal distribution with standard deviation $\sqrt{\frac{2}{d}}$, or a uniform distribution over $(-\sqrt{\frac{6}{d}}, \sqrt{\frac{6}{d}})$ (Kaiming/He initialization). Replacing the initialization parameters gives the following results:
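  A sketch comparing the two choices under relu: Glorot-normal versus the relu-specific standard deviation $\sqrt{\frac{2}{d}}$ (He/Kaiming normal). As before, the width, depth, and batch size are assumptions:

```python
# 100 relu layers: Glorot-normal vs. He-normal weight standard deviations (sketch).
import numpy as np

def run(std, d=512, layers=100, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=(2048, d))
    for _ in range(layers):
        w = rng.normal(0.0, std, size=(d, d))
        x = np.maximum(x @ w, 0.0)            # relu
    return x.mean(), x.var()

d = 512
print(run(np.sqrt(2.0 / (d + d))))   # Glorot normal: the variance shrinks toward 0
print(run(np.sqrt(2.0 / d)))         # He normal: the post reports variance ~0.1 at layer 100
```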

    

   The result looks much better: the variance at layer 100 is around 0.1, far better than with tanh. In addition, the expectation is no longer close to zero, because relu, unlike tanh, is not symmetric about 0.

  In addition, I found something strange that I cannot explain for now; here are the results:

  When we initialize directly with tf.random_normal_initializer(), this time with variance 1.

  Without an activation function:
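  A sketch of this setting in float32 (TensorFlow's default precision); the depth of 100 layers matches the experiments above, and the other sizes are assumptions:

```python
# 100 linear layers with N(0, 1) weights -- the variance blows up (sketch).
import numpy as np

rng = np.random.default_rng(0)
d = 512
x = rng.normal(0.0, 1.0, size=(2048, d)).astype(np.float32)
for _ in range(100):
    w = rng.normal(0.0, 1.0, size=(d, d)).astype(np.float32)   # variance 1, not 1/d
    x = x @ w
print("mean:", x.mean(), "var:", x.var())   # overflows in float32: inf / nan
```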

    

   After 100 layers the expectation and variance are both nan; the values must have exploded.

  Introducing the tanh function:
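  The same sketch with tanh; even with $N(0, 1)$ weights, tanh saturation keeps the values bounded:

```python
# 100 layers with N(0, 1) weights and tanh activations (sketch).
import numpy as np

rng = np.random.default_rng(0)
d = 512
x = rng.normal(0.0, 1.0, size=(2048, d)).astype(np.float32)
for _ in range(100):
    w = rng.normal(0.0, 1.0, size=(d, d)).astype(np.float32)
    x = np.tanh(x @ w)                        # outputs saturate near +-1
print("mean:", x.mean(), "var:", x.var())     # variance stays around 1
```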

    

   Even after 100 layers the variance is still maintained at 1.

  With the relu activation function:
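  And the relu case under the same $N(0, 1)$ initialization. Note this sketch may not reproduce the screenshot exactly, since behaviour once the values overflow depends on precision; the post reports both statistics ending up at zero:

```python
# 100 relu layers with N(0, 1) weights (sketch).
import numpy as np

rng = np.random.default_rng(0)
d = 512
x = rng.normal(0.0, 1.0, size=(2048, d)).astype(np.float32)
for _ in range(100):
    w = rng.normal(0.0, 1.0, size=(d, d)).astype(np.float32)
    x = np.maximum(x @ w, 0.0)
print("mean:", x.mean(), "var:", x.var())
```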

    

   The expectation and variance are both zero.

  From these results it looks like, when tanh is used as the activation function, the parameters can be initialized directly from a standard normal distribution $N(0, 1)$.

 

 4. Random initialization

  In practice I also found another issue: the specific random values of the parameters affect the final result as well. Take the relu activation function as an example; for ease of calculation, the dimensions of x and w are kept consistent.

   With all dimensions set to 512, two runs give noticeably different results:
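  The post does not show the exact setup for these two runs; here is a sketch that assumes the 100-layer relu network with the $\sqrt{\frac{2}{d}}$ normal initialization from the previous section, run twice with different seeds:

```python
# Two runs with the same initialization distribution but different random draws (sketch).
import numpy as np

def run(seed, d=512, layers=100):
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=(2048, d))
    for _ in range(layers):
        w = rng.normal(0.0, np.sqrt(2.0 / d), size=(d, d))
        x = np.maximum(x @ w, 0.0)            # relu
    return x.mean(), x.var()

print(run(seed=1))   # same distribution...
print(run(seed=2))   # ...but the concrete statistics differ between runs
```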

    

    

   This is actually easy to understand: x and w are randomly initialized, so even though the distributions are the same, the specific values are not, and the final results differ as well. In other words, even with the same initialization distribution, the outcome can vary from run to run; the difference may show up in the convergence speed, in the final result, and so on.

 

