Differences between Keras activation functions

Introduction:

The activation function is an extremely important component of artificial neural networks. It determines whether a neuron should be activated, that is, whether the information the neuron has received is relevant to the task at hand.

An activation function applies a non-linear transformation to a neuron's output before it is passed on to the next layer. In a neural network the activation function is essentially indispensable, because it is what gives the model its expressive power. Each neuron has a weight vector and a bias. Without activation functions, the network reduces to linear regression: each layer only applies another linear transformation to the output of the previous one, so the network cannot handle more complicated learning tasks such as language translation or image classification.
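As a concrete illustration (a minimal sketch assuming TensorFlow 2.x and its bundled Keras; the layer sizes are arbitrary), a Keras layer's activation is set with the activation argument, and omitting it leaves the layer purely linear:

import tensorflow as tf

# Hidden layers with a non-linear activation.
nonlinear_model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(16,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1),
])

# Without activations the same stack is just a composition of linear maps,
# i.e. no more expressive than a single Dense(1) layer (linear regression).
linear_model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, input_shape=(16,)),
    tf.keras.layers.Dense(32),
    tf.keras.layers.Dense(1),
])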

Common activation functions and how to use them:

1. Softmax function
The softmax function, commonly used for multi-class classification, is a generalization of logistic regression. It is usually applied in the output layer: it compresses each output into the range 0 ~ 1 and makes all outputs sum to 1, so that each output can be read as the probability that the input belongs to the corresponding class.
Formula:
softmax(x_i) = exp(x_i) / sum_j exp(x_j)
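For example (a minimal sketch; the 784-dimensional input and 10 classes are arbitrary assumptions), softmax is typically placed on the last layer of a Keras classifier and paired with a cross-entropy loss:

import tensorflow as tf

# The final Dense(10, activation='softmax') turns the 10 scores into a
# probability distribution that sums to 1.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])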
2. Sigmoid function
Formula:
sigmoid(x) = 1 / (1 + exp(-x))
Function and its derivative curve: [figure omitted]
The sigmoid function also maps real numbers into the range 0 ~ 1 and can be used for binary classification problems (softmax above is for multi-class problems). The plot shows that the gradient of the sigmoid function is concentrated mainly in the region [-3, 3], so it works best when the differences between features are not particularly large. From the derivative curve it can be seen that the derivative peaks at x = 0 and quickly decays to 0 on either side, so backpropagation in deep networks does not perform very well and the 'vanishing gradient' problem easily appears.
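For instance (a minimal sketch; the input size is an arbitrary assumption), a single sigmoid unit is the usual output layer for binary classification in Keras:

import tensorflow as tf

# One sigmoid output unit: the result is the predicted probability of the
# positive class, so it pairs naturally with binary cross-entropy.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])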
3. Tanh function
Formula:
tanh(x) = 2 * sigmoid(2x) - 1 = 2 / (1 + exp(-2x)) - 1
Function and its derivative curve: [figure omitted]
The tanh function (hyperbolic tangent) is essentially a rescaled version of the sigmoid function, and its curve looks very similar; however, the tanh curve is symmetric about the origin and its range is [-1, 1]. Tanh works better when the differences between features are significant, and it keeps amplifying those differences. Its gradient is steeper than that of the sigmoid, so you can choose between the two according to your gradient requirements, but tanh is also prone to the 'vanishing gradient' problem.
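A quick sketch (assuming TensorFlow 2.x; the sample values are arbitrary) showing the [-1, 1] range and how tanh is passed by name:

import tensorflow as tf

# tanh squashes inputs into [-1, 1] and is symmetric about the origin.
x = tf.constant([-3.0, -1.0, 0.0, 1.0, 3.0])
print(tf.keras.activations.tanh(x).numpy())  # approx. [-0.995 -0.762  0.     0.762  0.995]

# As a layer activation it is simply passed by name:
hidden = tf.keras.layers.Dense(32, activation='tanh')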
4. ReLU function
Formula:
f(x) = max(0, x)
Function and its derivative curve: [figure omitted]
The ReLU function is arguably the most widely used activation function today. As the plot shows, when the input is < 0 both the output and the derivative (gradient) are 0; when the input is > 0 the output equals the input and the derivative (gradient) equals 1.
A big advantage of ReLU is that it does not activate all neurons at the same time: when the input is negative the output is 0 and the neuron is not activated, so at any given time only some of the neurons are active, which makes the network efficient and cheap to compute.
However, ReLU also has a region where the gradient is 0 (x < 0): during backpropagation the weights of such a neuron are not updated, so the neuron may never be activated again, and if the learning rate is set relatively large, many neurons can end up 'dead'.
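A minimal sketch (assuming TensorFlow 2.x; the sample values are arbitrary) of ReLU's zero-or-pass-through behaviour and its typical use as a hidden-layer activation:

import tensorflow as tf

# ReLU zeroes negative inputs and passes positive inputs through unchanged.
x = tf.constant([-2.0, -0.5, 0.0, 0.5, 2.0])
print(tf.keras.activations.relu(x).numpy())  # [0.  0.  0.  0.5 2. ]

# Typical use as the activation of a hidden Dense layer:
hidden = tf.keras.layers.Dense(64, activation='relu')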
5. Leaky ReLU function
Formula:
f(x) = a * x, x < 0 (a is a very small value, here 0.01)
f(x) = x, x > 0

Function and its derivative curve: [figure omitted]
Because the ReLU gradient is 0 for x < 0, the Leaky ReLU function was proposed as an improved version of ReLU: it removes the region where the gradient is 0, so dead neurons no longer appear.
In addition, the Leaky ReLU function is similar to the PReLU function (the formula is the same), except that the value of a in PReLU is learned during training. When Leaky ReLU cannot solve the problem, you can consider using PReLU instead.
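A minimal sketch of both variants in Keras (assuming the TensorFlow 2.x API, where the Leaky ReLU slope argument is named alpha); they are added as separate layers after a linear Dense layer:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_shape=(16,)),
    # Leaky ReLU with a small fixed slope of 0.01 for negative inputs.
    tf.keras.layers.LeakyReLU(alpha=0.01),
    tf.keras.layers.Dense(64),
    # PReLU: same formula, but the negative slope is a trainable parameter.
    tf.keras.layers.PReLU(),
    tf.keras.layers.Dense(1),
])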
6. ELU function
Formula:
elu(x) = x, x > 0
elu(x) = a * (exp(x) - 1), x < 0 (a is taken as 0.2 here)
Function and its derivative curve: [figure omitted]
It can be seen that the ELU function is another improved version of ReLU: it solves the dead-neuron problem and can produce negative outputs. However, because it introduces an exponential operation, training takes longer and the gradient-explosion problem may even appear; in addition, the network cannot learn the value of a by itself.
Similar functions include SELU, GELU, etc.
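A minimal sketch in Keras (assuming TensorFlow 2.x; the built-in 'gelu' name additionally requires a fairly recent release) of ELU with a fixed alpha, plus a related activation passed by name:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_shape=(16,)),
    # ELU with a fixed alpha; unlike PReLU's slope, alpha is not trainable.
    tf.keras.layers.ELU(alpha=0.2),
    # Related activations can be passed by name, e.g. 'selu' (or 'gelu').
    tf.keras.layers.Dense(64, activation='selu'),
    tf.keras.layers.Dense(1),
])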
7. Other activation functions
There are many more activation functions, not listed one by one here. For example:
hard_sigmoid function: faster to compute than the sigmoid activation function. If x < -2.5, return 0; if x > 2.5, return 1; if -2.5 <= x <= 2.5, return 0.2 * x + 0.5.
linear (linear activation function): simply returns its input.
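A quick sketch (assuming TensorFlow 2.x, where hard_sigmoid follows the 0.2 * x + 0.5 definition above; the sample values are arbitrary) comparing hard_sigmoid, sigmoid and linear:

import tensorflow as tf

x = tf.constant([-3.0, -1.0, 0.0, 1.0, 3.0])
# hard_sigmoid is a cheap piecewise-linear approximation of sigmoid.
print(tf.keras.activations.hard_sigmoid(x).numpy())  # [0.  0.3 0.5 0.7 1. ]
print(tf.keras.activations.sigmoid(x).numpy())       # approx. [0.047 0.269 0.5 0.731 0.953]
# 'linear' returns its input unchanged; it is what a Dense layer applies
# when no activation is specified.
print(tf.keras.activations.linear(x).numpy())        # [-3. -1.  0.  1.  3.]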

Origin blog.csdn.net/qq_45074963/article/details/105593914