Vanishing gradients and exploding gradients


Let I_k denote the input of the k-th neuron and O_k the output of the k-th neuron.
As an example, consider computing the gradient at neuron 5:

[Figure: chain-rule expansion of the gradient g_5 at neuron 5]
During backpropagation, the absolute value of g_k can keep shrinking (eventually reaching 0). This is called the vanishing gradient problem, and it causes network training to stagnate.

Conversely, the absolute value of g_k can keep growing during propagation (eventually diverging). This is called the exploding gradient problem, which makes the network unstable and causes its performance to collapse.
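The figure with the worked example is not reproduced here, but a minimal sketch of the chain-rule argument behind both problems can be written down. Assume, purely for illustration (the original does not specify the architecture), a single chain of neurons with O_i = f(I_i) and I_{i+1} = w_{i+1} O_i, a loss L, and a final neuron n. The gradient reaching neuron k is then a product of per-layer factors:

$$
g_k \;=\; \frac{\partial L}{\partial O_k}
\;=\; \frac{\partial L}{\partial O_n}\prod_{i=k}^{n-1}\frac{\partial O_{i+1}}{\partial O_i}
\;=\; \frac{\partial L}{\partial O_n}\prod_{i=k}^{n-1} w_{i+1}\, f'(I_{i+1})
$$

If every factor |w_{i+1} f'(I_{i+1})| is smaller than 1, the product shrinks geometrically with depth (vanishing gradient); if every factor is larger than 1, it grows geometrically (exploding gradient).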

Example of vanishing gradients:
With sigmoid or tanh nonlinearities, inputs with large absolute value drive the activation into its "saturated" region, where the derivative is close to zero. By the formula above, this makes the gradient vanish.
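A quick numerical check (not from the original post, just an illustration with NumPy) shows how fast the sigmoid derivative decays as the input moves away from zero:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}   sigmoid'(x) = {sigmoid_grad(x):.6f}")
# x =   0.0   sigmoid'(x) = 0.250000
# x =   2.0   sigmoid'(x) = 0.104994
# x =   5.0   sigmoid'(x) = 0.006648
# x =  10.0   sigmoid'(x) = 0.000045
```

Every such factor enters the chain-rule product once per layer, so even moderate saturation compounds quickly with depth.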

Example of exploding gradients:
If the network's weights W are large, for example because the network was initialized with overly large values or because the weights kept growing during training, gradient explosion can occur. It is especially common in recurrent neural networks and GANs.
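As an illustration (an assumed toy setup, not from the original post), backpropagation through many steps of a linear recurrent network amounts to repeatedly multiplying the gradient by the transposed weight matrix, so whether the norm collapses or blows up depends on the scale of the weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_norm_after(steps, scale, n=64):
    """Norm of a gradient vector after backpropagating through
    `steps` time steps of a toy linear RNN with weight matrix W."""
    # For i.i.d. Gaussian entries with std scale/sqrt(n), the spectral
    # radius of W is roughly `scale`.
    W = scale * rng.standard_normal((n, n)) / np.sqrt(n)
    g = np.ones(n)
    for _ in range(steps):
        g = W.T @ g
    return np.linalg.norm(g)

for scale in [0.5, 1.0, 1.5]:
    print(f"scale = {scale}   ||g|| after 50 steps = {grad_norm_after(50, scale):.3e}")
# scale = 0.5 -> the norm collapses toward zero (vanishing)
# scale = 1.5 -> the norm grows by many orders of magnitude (exploding)
```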

Therefore, when a network trains poorly, it is worth checking whether its gradients are vanishing or exploding.
Gradient flow can be improved with a variety of techniques, including batch normalization, residual networks, gradient clipping, and gradient penalties.
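As one concrete instance, gradient clipping is usually a one-line addition to the training loop. A minimal PyTorch sketch is below; the model, data, and the max_norm threshold of 1.0 are all arbitrary stand-ins for illustration:

```python
import torch

model = torch.nn.Linear(10, 1)                         # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)         # stand-in batch

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# Rescale all gradients so their total norm does not exceed 1.0,
# preventing a single exploding step from destabilizing training.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
```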
