Data preprocessing; weight initialization

Data Preprocessing:

Always zero-center the data: if all inputs are positive, the gradients on a neuron's weights all share the same sign (all positive or all negative), which forces inefficient zig-zag updates and makes optimization sub-optimal.

Then normalize by the standard deviation.
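A minimal NumPy sketch of these two steps (the toy data shape and values are my own assumptions):

```python
import numpy as np

# X is assumed to be an (N, D) data matrix: N samples, D features.
X = np.random.rand(100, 3) * 10 + 5   # toy all-positive data

X -= np.mean(X, axis=0)   # zero-center each feature
X /= np.std(X, axis=0)    # then normalize by the per-feature standard deviation
```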

 

 

Weight Initialization:

If all weights are initialized to 0 (or to the same value), every neuron does the same thing: on the same input each neuron computes the same output, receives the same gradient, and gets the same parameter update. The neurons stay exactly identical and learn exactly the same thing, so nothing ever breaks the symmetry.
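A small toy sketch of this symmetry problem (the tanh hidden layer, layer sizes, and numbers are my own choices for illustration):

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5])   # one input sample
W1 = np.full((4, 3), 0.1)        # every hidden weight gets the same value
w2 = np.full(4, 0.1)             # every output weight gets the same value

h = np.tanh(W1 @ x)              # all 4 hidden units compute the same output
y = w2 @ h

# backward pass, taking dL/dy = 1 for simplicity
dh = w2 * (1.0 - h ** 2)
dW1 = np.outer(dh, x)

print(h)     # identical entries
print(dW1)   # identical rows -> identical updates -> the neurons never diverge
```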

 

Option One:

Initialize with small random values sampled from a probability distribution (for example, a Gaussian scaled by a small factor). This works for small networks but performs poorly in deep ones: as the activations are multiplied by the small weights W layer after layer, they shrink steadily toward zero, and eventually you are left with a pile of zeros.

During back-propagation this leads to tiny gradients, so the weights are essentially never updated.
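A sketch of this effect, assuming a 10-layer tanh network of width 500 with weights drawn as 0.01 * randn (the specific sizes are assumptions, not from the source):

```python
import numpy as np

np.random.seed(0)
h = np.random.randn(1000, 500)                 # input batch: 1000 samples, 500 features
for layer in range(10):
    W = 0.01 * np.random.randn(500, 500)       # "small random" initialization
    h = np.tanh(h @ W)
    print(f"layer {layer + 1}: std of activations = {h.std():.5f}")
# The std shrinks toward zero layer after layer; since the weight gradient
# dW = h_prev.T @ dout involves these tiny activations, the gradients vanish too.
```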

 

Option II:

Use large weights (a Gaussian with standard deviation 1) as initial values. With a tanh activation function the network quickly saturates: almost all units output values near ±1, the gradient tends to 0, and the parameters are not updated.
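A matching sketch with standard-deviation-1 weights showing the saturation (again, the layer sizes are my assumptions):

```python
import numpy as np

np.random.seed(0)
h = np.random.randn(1000, 500)
for layer in range(10):
    W = 1.0 * np.random.randn(500, 500)        # "large" weights: standard deviation 1
    h = np.tanh(h @ W)
    saturated = np.mean(np.abs(h) > 0.99)
    print(f"layer {layer + 1}: {saturated:.0%} of tanh outputs saturated")
# Nearly all outputs sit at -1 or +1, where the local gradient (1 - tanh^2) is ~0,
# so almost no gradient flows backward and the parameters stop updating.
```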

 

 

Option Three: Xavier initialization

 

Sample W from a standard Gaussian and then scale it according to the number of inputs (divide by the square root of fan_in), so that the variance of the output is required to match the variance of the input. With a small number of inputs we divide by a small number and get larger weights (larger weights are needed); with a large number of inputs we divide by a large number and get smaller weights (smaller weights are needed).
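A minimal sketch of Xavier initialization for a single fully connected layer (the fan_in/fan_out values are placeholders):

```python
import numpy as np

fan_in, fan_out = 500, 500
# Xavier initialization: sample from a unit Gaussian and divide by sqrt(fan_in),
# so the variance of the layer's output roughly matches the variance of its input.
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
```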

 

With a ReLU-like activation function (half of the neurons are set to 0), the output variance is effectively halved. The resulting small values cause the (unit Gaussian) activation distribution to contract layer by layer, with more and more of the mass piling up near zero, and the neurons become inactive.

 

Since half of the neurons are set to 0, simply dividing the input count (the fan_in in the denominator) by 2 solves the problem:
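A minimal sketch of this ReLU correction (He initialization), with placeholder layer sizes:

```python
import numpy as np

fan_in, fan_out = 500, 500
# He initialization for ReLU: divide fan_in by 2 in the denominator
# (equivalently, scale by sqrt(2 / fan_in)) to compensate for the half
# of the units that ReLU sets to zero.
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)
```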

 

 

 

 
