Summary and comparison of commonly used activation functions and their advantages and disadvantages

1、sigmoid

[Figure: sigmoid activation function]

Advantages: it compresses inputs over a wide range into the interval (0, 1), which makes it suitable for models whose output is interpreted as a probability.
Disadvantages:
1) When the input is very large or very small, the function saturates, which easily causes the vanishing-gradient problem.
2) The output is not zero-mean, so neurons in the next layer receive a non-zero-mean signal as input; as the network deepens, the distribution of the activations drifts away from that of the original data. For these reasons sigmoid is generally used only in the final output layer.
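As a rough illustration (not from the original post), a minimal NumPy sketch of the sigmoid function and its saturating gradient, using the standard definitions:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x)); output lies in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative sigma'(x) = sigma(x) * (1 - sigma(x)); it peaks at 0.25
    # and shrinks toward 0 for large |x|, which is the vanishing gradient
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(np.array([0.0, 5.0, -10.0])))  # roughly [0.25, 0.0066, 4.5e-05]
```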

2、tanh

[Figure: tanh activation function]

Advantages: its output is zero-mean, which solves the non-zero-mean output problem of the sigmoid function mentioned above.
Disadvantages: the vanishing-gradient problem still exists, because the function saturates for large |x|.
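A minimal sketch of tanh and its gradient, again assuming the standard definitions rather than anything from the original post:

```python
import numpy as np

def tanh(x):
    # tanh(x) = (e^x - e^-x) / (e^x + e^-x); output lies in (-1, 1),
    # so it is zero-centered, unlike sigmoid
    return np.tanh(x)

def tanh_grad(x):
    # derivative 1 - tanh(x)^2; it still goes to 0 for large |x|,
    # so the vanishing-gradient problem remains
    t = np.tanh(x)
    return 1.0 - t * t
```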

3、ReLU

[Figure: ReLU activation function]

Advantages:
1) It solves the vanishing-gradient problem (for positive inputs the gradient is 1 and does not saturate).
2) Computation and convergence are very fast, because evaluating it only requires checking whether the input is greater than 0.
Disadvantages:
1) Like sigmoid, its output is not zero-mean.
2) When the input is less than 0, the gradient is zero, so the neuron can no longer update its parameters. This is the "dead ReLU" (neuron death) problem.
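A short sketch of ReLU and its gradient (an illustrative implementation, not the post's own code):

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x): a single comparison, so it is cheap to compute
    return np.maximum(0.0, x)

def relu_grad(x):
    # gradient is 1 for x > 0 and 0 for x <= 0; a neuron whose inputs are
    # always negative receives zero gradient and stops updating ("dead ReLU")
    return (x > 0).astype(x.dtype)
```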

4、Leaky ReLU

[Figure: Leaky ReLU activation function]

Advantages: Leaky ReLU gives the input a small slope when it is negative, which alleviates the dead-ReLU problem.
Disadvantages: in theory this function should perform better than ReLU, but extensive practice has shown its effect to be unstable, so it is not widely used in practice.
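A minimal sketch of Leaky ReLU; the slope value alpha = 0.01 is a common default and an assumption here, not something specified in the original post:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # keeps a small slope alpha for negative inputs instead of zero,
    # so the gradient never vanishes entirely on the negative side
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # gradient is 1 for x > 0 and alpha otherwise
    return np.where(x > 0, 1.0, alpha)
```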

5、Softmax

[Figure: Softmax activation function]

Features: the predicted probabilities sum to 1; the neuron-death problem can also occur.
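A minimal sketch of softmax, written in the numerically stable form (subtracting the maximum logit is a standard trick and an addition here, not part of the original post):

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the result is unchanged
    # because softmax is invariant to adding a constant to all logits
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # roughly [0.659 0.242 0.099], summing to 1.0
```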

Comparison with sigmoid:
softmax: 1) Used for single-label multi-class problems, i.e. choosing one correct answer from several categories. Softmax normalizes over all output values jointly, so the result reflects the relative relationship between the categories. 2) The probabilities of all categories sum to 1, so increasing the probability of one category necessarily decreases the others; the categories are coupled and mutually exclusive.
sigmoid: 1) Used for multi-label classification problems, where several labels can be correct at the same time. It maps any real value to (0, 1). 2) The probabilities of the categories do not have to sum to 1, because each output value is mapped independently by the activation function; the probability of one category can rise while another also rises. The categories are independent of each other and not mutually exclusive.
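A small sketch contrasting the two on the same logits (illustrative values chosen here, not from the original post):

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])

# single-label multi-class: softmax couples the outputs, they sum to 1
probs_softmax = np.exp(logits - logits.max())
probs_softmax /= probs_softmax.sum()

# multi-label: sigmoid maps each logit independently to (0, 1);
# the values need not sum to 1, and raising one does not lower the others
probs_sigmoid = 1.0 / (1.0 + np.exp(-logits))

print(probs_softmax, probs_softmax.sum())  # sums to 1.0
print(probs_sigmoid)                       # each value independent, in (0, 1)
```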

Reference link:
https://blog.csdn.net/caip12999203000/article/details/127067360

Origin: blog.csdn.net/m0_48086806/article/details/132335936