Data Preprocessing:
Zero-center the data by subtracting the mean. If all inputs are positive, the gradients on a layer's weights all share the same sign, which forces zig-zag parameter updates and sub-optimal optimization.
Then normalize by the standard deviation so every feature has a comparable scale.
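A minimal NumPy sketch of the two steps (the array shape and the offset are made up for illustration):

```python
import numpy as np

# Hypothetical design matrix: 100 samples x 5 features, shifted so all values
# are positive, as in the problem case described above.
X = np.random.randn(100, 5) * 3.0 + 10.0

# Zero-center: subtract the per-feature mean.
X_centered = X - X.mean(axis=0)

# Normalize: divide by the per-feature standard deviation.
X_normalized = X_centered / X.std(axis=0)
```

After this, every feature has mean 0 and standard deviation 1.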
Weight Initialization:
If all weights are initialized to 0 (or to any identical value), every neuron does the same thing: on the same input each neuron computes the same output, receives the same gradient, and gets the same parameter update. The neurons stay exact copies of one another and learn exactly the same thing, so the symmetry is never broken.
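A quick sketch of the symmetry problem, using a nonzero constant (0.5, an arbitrary choice) so the identical gradients are visible rather than all zero; the network shape, targets, and loss are made up for illustration:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(8, 4)        # hypothetical batch: 8 samples, 4 features
t = np.random.randn(8, 3)        # hypothetical regression targets
W1 = np.full((4, 5), 0.5)        # every weight starts at the same value
W2 = np.full((5, 3), 0.5)

# Forward pass: all hidden units compute the same function of the input.
h = np.tanh(x @ W1)              # all 5 columns of h are identical
y = h @ W2

# Backward pass for a squared-error loss.
dy = y - t
dh = dy @ W2.T                   # identical per column, since W2's rows match
dW1 = x.T @ (dh * (1 - h**2))    # every column (neuron) gets the same gradient
```

Because every column of dW1 is identical, the update keeps the neurons clones of each other forever.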
Option One: small random values
Initialize with small random values sampled from a probability distribution (e.g. a Gaussian scaled by 0.01). This works for small networks but performs poorly in deep ones: multiplying by a small weight matrix W layer after layer shrinks the activations, which collapse toward zero until the network outputs a bunch of zeros.
During back-propagation this yields tiny gradients, so the weights essentially stop updating.
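The shrinking-activation effect can be reproduced in a few lines; the depth (10 layers), width (500), and scale (0.01) here are arbitrary illustrative choices:

```python
import numpy as np

np.random.seed(0)
fan_in = 500
acts = np.random.randn(1000, fan_in)   # hypothetical batch of unit-Gaussian inputs

# 10-layer tanh network with small random weights (std 0.01).
stds = []
for _ in range(10):
    W = np.random.randn(fan_in, fan_in) * 0.01
    acts = np.tanh(acts @ W)
    stds.append(acts.std())

# The activation spread collapses toward zero layer by layer,
# so the backpropagated gradients vanish as well.
print(stds[0], stds[-1])
```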
Option Two: large random values
Initialize with large weights (e.g. a Gaussian with standard deviation 1). With a tanh activation the network saturates almost immediately: nearly all units sit at -1 or +1, where the gradient tends to 0, so the parameters are not updated.
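The same experiment with standard-deviation-1 weights shows the saturation (again, the depth and width are arbitrary illustrative choices):

```python
import numpy as np

np.random.seed(0)
fan_in = 500
acts = np.random.randn(1000, fan_in)

# 10-layer tanh network with large random weights (std 1.0).
for _ in range(10):
    W = np.random.randn(fan_in, fan_in) * 1.0
    acts = np.tanh(acts @ W)

# The vast majority of activations end up pinned near -1 or +1,
# where tanh's local gradient is essentially zero.
saturated = np.mean(np.abs(acts) > 0.99)
print(saturated)
```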
Option Three: Xavier initialization
Sample W from a standard Gaussian, then scale it by the number of inputs so that the variance of the output matches the variance of the input. With a small number of inputs we divide by a small number and obtain larger weights (which is what we want, since each input must carry more signal); with a large number of inputs we divide by a larger number and obtain smaller weights.
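A sketch of Xavier initialization plugged into the same deep tanh network (the 1/sqrt(fan_in) scaling is the standard rule; the depth and width remain arbitrary):

```python
import numpy as np

np.random.seed(0)

def xavier_init(fan_in, fan_out):
    # Standard Gaussian scaled by 1/sqrt(fan_in): many inputs -> smaller
    # weights, few inputs -> larger weights.
    return np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

fan_in = 500
acts = np.random.randn(1000, fan_in)
for _ in range(10):
    acts = np.tanh(acts @ xavier_init(fan_in, fan_in))

# The activation spread stays in a healthy range instead of
# collapsing to zero or saturating at +/-1.
print(acts.std())
```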
With a ReLU-like activation function, half of the neurons are set to 0, so the output variance is effectively halved. With plain Xavier scaling the (unit Gaussian) distribution of activations then contracts layer by layer, the peak moving ever closer to zero, and neurons deactivate.
Since half the neurons are zeroed, simply dividing the fan-in (the denominator fan_in) by 2 solves the problem (He initialization):
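A sketch of that fix, i.e. scaling by sqrt(2 / fan_in) instead of sqrt(1 / fan_in), in the same illustrative deep network with ReLU:

```python
import numpy as np

np.random.seed(0)

def he_init(fan_in, fan_out):
    # Divide fan_in by 2 to compensate for ReLU zeroing half the units:
    # the scale becomes sqrt(2 / fan_in).
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)

def relu(x):
    return np.maximum(0.0, x)

fan_in = 500
acts = np.random.randn(1000, fan_in)
for _ in range(10):
    acts = relu(acts @ he_init(fan_in, fan_in))

# With the extra factor of 2, the ReLU activations keep a stable spread
# instead of shrinking toward zero layer after layer.
print(acts.std())
```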