Deep Learning (3) - A detailed explanation of the role and categories of activation functions

     Mastery of the basics determines how far your research can go. When we first come into contact with deep learning, we usually read other people's summaries. That approach is great for getting started quickly, but it has a big drawback: our understanding is not thorough, and as a result we get confused when it is time to optimize an algorithm. I started this exploration of the essentials of deep learning with the idea of summarizing the knowledge myself, and I hope it helps more people. If anything in the article is unclear, I hope fellow deep learning researchers will point it out, and I will work hard to improve it.

  1. Role:

     The activation function provides the network with nonlinear modeling capability. By adding activation functions, a deep neural network gains the ability to learn hierarchical nonlinear mappings. The activation function is therefore an indispensable part of a neural network.
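To make this concrete, here is a minimal sketch of my own (not from the original post; NumPy, with arbitrary layer sizes chosen only for illustration): without an activation function, stacking linear layers collapses into a single linear layer, so depth adds no extra modeling power.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))      # a small batch of 4 inputs with 3 features
W1 = rng.normal(size=(3, 5))     # first "layer" weights
W2 = rng.normal(size=(5, 2))     # second "layer" weights

# Two stacked linear layers with no activation...
two_linear_layers = x @ W1 @ W2
# ...are exactly equivalent to one linear layer with weights W1 @ W2.
one_linear_layer = x @ (W1 @ W2)
print(np.allclose(two_linear_layers, one_linear_layer))   # True

# Inserting a nonlinear activation (e.g. tanh) breaks this collapse,
# which is what gives depth its extra representational power.
nonlinear = np.tanh(x @ W1) @ W2
print(np.allclose(nonlinear, one_linear_layer))            # False (in general)
```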

  2. The properties that the activation function needs to have:

   1. Differentiability (the derivative exists over the whole domain): this property is necessary when the optimization algorithm is gradient-based.

   2. Monotonicity: formally, when the activation function is monotonic, a single-layer network can be guaranteed to be a convex function. That statement is rather abstract, so here is how I understand it. Looking at the activation functions below, each one is a fixed calculation that maps an input value to an output value, with no hyperparameters involved. In other words, we change the neuron's value through a nonlinear mapping while keeping the overall shape of the function unchanged. The reason to require monotonicity is precisely to keep that shape from changing too much: if the function were not monotonic, there could be many local optima, convexity could not be guaranteed, and the subsequent gradient descent would be seriously affected.

   3. Output range: when the output of the activation function is bounded, gradient-based optimization methods are more stable, because the feature representation is affected more significantly by the bounded weights. When the output of the activation function is unbounded, model training is faster and more efficient, but in that case a smaller learning rate is generally required.

  3. Commonly used activation functions:

1. Sigmoid function: f(x) = 1 / (1 + e^(-x))

 

As can be seen from the figure, the sigmoid function is continuous, smooth, and strictly monotonic, making it a very good threshold function, and its derivative is easy to compute.

 Sigmoid is the most widely used type of activation function. It has an exponential shape and is the closest to a biological neuron in the physical sense. However, as can be seen from the figure above, this function has two flaws: saturation and a non-zero-mean output.

      Saturation means that as the input grows in magnitude, the derivative of the function tends to zero. It comes in two kinds, soft saturation and hard saturation, and the sigmoid function is soft-saturating. During backpropagation the derivative f'(x) appears in the chain of gradients; once x exceeds a certain value the function falls into the saturation zone, the derivative becomes almost zero, and the backpropagated gradient becomes very small. The network parameters then stop being trained, which is the vanishing gradient phenomenon. In practice the gradient typically vanishes within about five layers.

    The non-zero mean means that the output of sigmoid is always greater than zero, which produces a drift (offset) phenomenon: the neurons in the next layer receive this non-zero-mean signal from the previous layer as their input, and convergence suffers as a result.
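To make the saturation and non-zero-mean flaws concrete, here is a small sketch of my own (NumPy, with illustrative sample points only) of the sigmoid and its derivative: the derivative collapses toward zero once |x| is large, and every output is positive.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # f'(x) = f(x) * (1 - f(x))

xs = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
print(sigmoid(xs))        # all outputs lie in (0, 1) -> non-zero (positive) mean
print(sigmoid_grad(xs))   # ~4.5e-05, ~6.6e-03, 0.25, ~6.6e-03, ~4.5e-05
# For |x| >= 5 the gradient is already tiny: this is the soft saturation zone
# that makes backpropagated gradients vanish in deep sigmoid networks.
```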

2. tanh function: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

The tanh function is another common activation function. Compared with sigmoid, its output has zero mean, which speeds up convergence and reduces the number of iterations. However, since tanh also saturates, the vanishing gradient phenomenon still occurs.
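As a quick illustrative sketch of my own (NumPy, with random inputs chosen only for demonstration) of the zero-mean point: for zero-centered inputs, tanh outputs are roughly zero-mean, while sigmoid outputs are all positive.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)           # zero-mean inputs

sig_out = 1.0 / (1.0 + np.exp(-x))
tanh_out = np.tanh(x)

print(sig_out.mean())    # ~0.5 -> the next layer always receives positive inputs
print(tanh_out.mean())   # ~0.0 -> zero-centered, which helps convergence
# tanh still saturates: np.tanh(10.0) is ~1.0 and its gradient there is ~0.
```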

3. ReLU function: f(x)=max(0,x)

 ReLU is a more recent activation function. As can be seen from the figure, it is hard-saturating when x<0 and has no saturation problem when x>0. ReLU can therefore keep the gradient from decaying when x>0, which alleviates the vanishing gradient problem. However, as training progresses, some inputs fall into the hard saturation zone, so the corresponding weights can no longer be updated; this phenomenon is called "neuron death" (the dying ReLU problem). Similar to sigmoid, the output mean of ReLU is also greater than 0, and the drift phenomenon and neuron death together affect network convergence.

    The reasons why a large area of neurons may fail to be activated (i.e. die) are:

    1. The initial parameters are set badly, so the ReLU units are never activated.

    2. The learning rate is too high, so the neurons' weights fluctuate over a large range during updates and the diversity of the data is lost. In this case the neurons are no longer activated, and the loss of data diversity is irreversible (the sketch below illustrates the zero gradient behind this).
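The sketch below is my own illustration (NumPy, example values only) of ReLU and its gradient: a neuron whose pre-activation stays negative receives a zero gradient, so its weights never update, which is exactly the neuron-death situation described above.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for x > 0 and 0 for x < 0 (the hard saturation zone).
    return (x > 0).astype(float)

pre_activations = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(pre_activations))       # [0.  0.  0.5 3. ]
print(relu_grad(pre_activations))  # [0. 0. 1. 1.]

# If bad initialization or a too-large learning rate pushes a neuron's
# pre-activation permanently below zero, its gradient stays 0 forever,
# so its weights stop updating: the neuron is "dead".
```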

4. ELU function: f(x) = x for x > 0 and α(e^x - 1) for x ≤ 0. This function combines sigmoid and ReLU: it is soft-saturating on the left and non-saturating on the right. The linear part on the right lets ELU mitigate the vanishing gradient, while the soft saturation on the left makes ELU more robust to input changes and noise. In addition, the mean of ELU's output is close to zero, so it converges faster.
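Here is a minimal sketch of my own (NumPy, with α = 1.0 assumed as the default) of the ELU behaviour just described: linear and non-saturating on the right, softly saturating toward -α on the left, and an output mean much closer to zero than ReLU's.

```python
import numpy as np

def elu(x, alpha=1.0):
    # x                    for x > 0  (linear part: no saturation, gradient stays 1)
    # alpha * (e^x - 1)    for x <= 0 (soft saturation toward -alpha)
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(elu(xs))   # [-0.99995 -0.86466  0.  2.  10.]

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)
print(elu(z).mean())               # ~0.16: much closer to 0 than...
print(np.maximum(0.0, z).mean())   # ...ReLU's ~0.40 on the same inputs
```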

References: https://blog.csdn.net/u013989576/article/details/70185145
