The purpose of the activation function is to introduce nonlinearity into the network; without it, stacked linear layers collapse into a single linear map.
Gradient vanishing: the gradient becomes (near) zero and cannot be backpropagated, so the parameters stop being updated.
Gradient saturation: as the input changes, the gradient barely changes, staying near zero in the flat regions of the activation.
Gradient explosion: the gradient grows larger and larger during backpropagation, and training fails to converge.
Causes of gradient vanishing:
1. The backpropagation chain is too long: many local derivatives smaller than 1 are multiplied together, so the accumulated gradient shrinks toward 0.
2. The data falls into the gradient saturation region of the activation function, where the local derivative is near zero.
How to solve it:
1. Choose a suitable activation function (ReLU, SiLU).
2. Use batch normalization (BN) to normalize the data.
3. Use residual connections (ResNet) to shorten the backpropagation path.
4. Use gated memory networks (LSTM) in recurrent models.
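The "chain too long" cause above can be seen numerically. A minimal sketch (pure Python, no framework assumed) multiplies one sigmoid local derivative per layer, as backpropagation through a deep sigmoid network would:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of sigmoid: s * (1 - s), which is at most 0.25 (at x = 0).
    s = sigmoid(x)
    return s * (1.0 - s)

# Backprop multiplies one local derivative per layer; even in the BEST case
# (every input exactly at 0, where the derivative peaks at 0.25) the product
# shrinks geometrically with depth.
grad = 1.0
for layer in range(20):
    grad *= sigmoid_grad(0.0)

print(grad)  # 0.25 ** 20, roughly 9.1e-13: effectively zero after 20 layers
```

This is why choosing ReLU (local derivative exactly 1 on the active interval) or shortening the path with residual connections directly attacks the problem.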
1、Sigmoid
Function and derivative: σ(x) = 1 / (1 + e^(−x)), σ′(x) = σ(x)(1 − σ(x))
Features : the derivative for inputs falling into either tail tends to 0, causing gradients to vanish, so deep networks using sigmoid are hard to train. Batch normalization (BN) can mitigate this problem.
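A small sketch of the function and its derivative, showing the tail behavior described above (pure Python, names are illustrative):

```python
import math

def sigmoid(x):
    """sigmoid(x) = 1 / (1 + e^(-x)); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative: sigmoid(x) * (1 - sigmoid(x)); its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25 -- the largest the gradient can ever be
print(sigmoid_grad(10.0))  # ~4.5e-5 -- deep in the tail, gradient nearly gone
```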
2、Tanh
Function and derivative: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)), tanh′(x) = 1 − tanh²(x)
Features : similar to sigmoid, except that the output range is (−1, 1) instead of (0, 1), so it is zero-centered; the tails still saturate.
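A quick numeric check of tanh and its derivative (a sketch; `tanh_grad` is an illustrative name, and `math.tanh` is Python's standard implementation):

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2; its maximum is 1, at x = 0."""
    t = math.tanh(x)
    return 1.0 - t * t

print(math.tanh(0.0), tanh_grad(0.0))  # 0.0 and 1.0: zero-centered, peak gradient
print(math.tanh(5.0), tanh_grad(5.0))  # ~0.9999 and ~1.8e-4: saturated tail
```

Note the peak derivative is 1 rather than sigmoid's 0.25, one reason tanh was often preferred over sigmoid in hidden layers.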
3、ReLU
Features : simple and effective; it avoids gradient vanishing because the derivative is exactly 1 over the positive (active) interval. Neurons with input less than 0 are suppressed (output 0), which makes the network sparse, curbs overfitting, helps the network learn useful features, and speeds up convergence. One drawback: a neuron whose inputs stay negative receives zero gradient and stops updating ("dying ReLU").
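ReLU is simple enough to state in two lines (the value of the derivative at exactly 0 is undefined; taking it as 0 here is a common convention, not mandated by the text):

```python
def relu(x):
    """ReLU(x) = max(0, x)."""
    return x if x > 0.0 else 0.0

def relu_grad(x):
    """Derivative: 1 on the positive interval, 0 otherwise (0 at x = 0 by convention)."""
    return 1.0 if x > 0.0 else 0.0

print(relu(-2.0), relu(3.0))       # 0.0 3.0 -- negative inputs are suppressed
print(relu_grad(-2.0), relu_grad(3.0))  # 0.0 1.0 -- gradient passes through unchanged
```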
4、Leaky_ReLU
Features : an improvement on ReLU; there is also a small, nonzero activation when the input is less than 0, so negative-input neurons still receive some gradient and are not permanently shut off (avoiding the "dying ReLU" problem).
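A sketch of Leaky ReLU; the slope `alpha = 0.01` is a common default hyperparameter, not a value fixed by the text above:

```python
def leaky_relu(x, alpha=0.01):
    """x for x > 0, alpha * x otherwise; alpha is a small positive slope."""
    return x if x > 0.0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Derivative: 1 on the positive interval, alpha otherwise -- never exactly 0."""
    return 1.0 if x > 0.0 else alpha

print(leaky_relu(-2.0))       # -0.02 -- small but nonzero, unlike ReLU's 0
print(leaky_relu_grad(-2.0))  # 0.01 -- gradient still flows for negative inputs
```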
5、SiLU (Swish)
Features : an improvement on ReLU that is smooth around 0 instead of having a kink. Disadvantage: the exponential operation it introduces increases the amount of computation.
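SiLU is x multiplied by sigmoid(x); a minimal sketch showing the smooth, slightly negative dip for small negative inputs:

```python
import math

def silu(x):
    """SiLU / Swish: x * sigmoid(x) = x / (1 + e^(-x)); smooth around 0."""
    return x / (1.0 + math.exp(-x))

# Unlike ReLU, small negative inputs give a small negative output rather
# than a hard 0, and the function is differentiable everywhere.
print(silu(-0.5), silu(0.0), silu(0.5))
```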
6、Mish
Features : similar to SiLU in shape and behavior: smooth, with a small negative dip below 0, and likewise more expensive than ReLU because of the exponential operations.
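Mish is defined as x · tanh(softplus(x)); a sketch (note `math.exp(x)` overflows for very large x, which a production implementation would guard against):

```python
import math

def mish(x):
    """Mish: x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x).
    Like SiLU it is smooth and slightly negative for small negative inputs."""
    return x * math.tanh(math.log1p(math.exp(x)))

# Side-by-side with SiLU the two curves are very close, matching the
# "similar to SiLU" remark above.
for x in (-1.0, 0.0, 1.0):
    print(x, mish(x), x / (1.0 + math.exp(-x)))
```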