Advantages and disadvantages of commonly used activation functions

Nonlinear activation functions are an essential part of deep learning networks. With the rapid development of the field in recent years, more and more activation functions have been proposed and refined, and the choice of activation function can have a real impact on a model's final results. The following summarizes the formulas and corresponding plots of 13 common activation functions; the definitions in this article follow PyTorch.


1. Sigmoid

Sigmoid is one of the earliest activation functions. Its calculation formula is as follows:

$$\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$$

Its diagram is as follows:

[Plot of the Sigmoid curve]

Advantages:

  1. The output of Sigmoid lies in (0, 1); this bounded range keeps optimization stable and makes it usable as an output layer (e.g., for probabilities).

  2. It is a continuous function and its derivative is easy to compute.

Disadvantages:

  1. It requires exponentiation, which is computationally expensive.

  2. The output is not zero-centered, which slows down convergence.

  3. It is prone to vanishing gradients: in the saturated regions the gradient is close to 0, so during backpropagation the weights are barely updated and deep networks become hard to train (see the sketch below).
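As a quick sanity check of the formula and the saturation behaviour above, here is a minimal PyTorch sketch (my addition, not from the original post; it assumes torch is installed):

```python
import torch

x = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0], requires_grad=True)

# Built-in sigmoid vs. the explicit formula 1 / (1 + exp(-x))
y = torch.sigmoid(x)
print(torch.allclose(y, 1.0 / (1.0 + torch.exp(-x))))  # True

# The gradient sigma(x) * (1 - sigma(x)) is close to 0 at both ends (saturation)
y.sum().backward()
print(x.grad)  # tiny values at x = -10 and x = 10
```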

2. LogSigmoid

Its calculation formula is as follows:

$$\text{LogSigmoid}(x) = \log\left(\frac{1}{1 + e^{-x}}\right)$$

Its diagram is as follows:

[Plot of the LogSigmoid curve]
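A small illustrative sketch (my addition, not from the original post): `nn.LogSigmoid` computes the same value as log(sigmoid(x)) but in a numerically safer way for very negative inputs:

```python
import torch
import torch.nn as nn

x = torch.tensor([-200.0, -1.0, 0.0, 1.0])

print(nn.LogSigmoid()(x))            # finite everywhere, e.g. -200 at x = -200
print(torch.log(torch.sigmoid(x)))   # naive composition underflows to -inf at x = -200
```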

3. ReLU

ReLU is one of the most commonly used activation functions in deep neural networks. Its calculation formula is as follows:

$$\text{ReLU}(x) = \max(0, x)$$

Its diagram is as follows:

[Plot of the ReLU curve]

Advantages:

  1. For x > 0 there is no gradient saturation or vanishing gradient, so convergence is fast.

  2. No exponentiation is needed, so it is fast to compute and low in complexity.

Disadvantages:

  1. The mean of the output is non-zero.

  2. Neurons can die: for x < 0 the gradient is 0, so the gradients flowing through such a neuron stay at 0, it stops responding to any input, and its parameters are never updated again (see the sketch below).
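A minimal sketch of the dying-ReLU effect described in disadvantage 2 (my own example, assuming PyTorch):

```python
import torch

# A neuron whose pre-activation is negative gets zero gradient through ReLU,
# so its weight receives no update for this input ("dying ReLU").
w = torch.tensor([0.5], requires_grad=True)
x = torch.tensor([-3.0])          # w * x < 0, so ReLU outputs 0
out = torch.relu(w * x)
out.backward()
print(out, w.grad)                # tensor([0.]) tensor([0.])
```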

4. LeakyReLU

The ReLU above outputs 0 when x < 0, whereas LeakyReLU outputs a small non-zero value for x < 0. Its calculation formula is as follows:

$$\text{LeakyReLU}(x) = \begin{cases} x, & x \ge 0 \\ \text{negative\_slope} \cdot x, & x < 0 \end{cases}$$

Its diagram is as follows:

[Plot of the LeakyReLU curve]

Advantages:

  1. It addresses ReLU's dying-neuron problem: the small positive slope in the negative region means gradients can still flow for negative inputs (see the sketch below).

  2. It retains the advantages of the ReLU function.

Disadvantages:

  1. Results can be inconsistent: it does not provide a consistent relationship between positive and negative input values, and the negative slope is a hand-picked hyperparameter.
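A short sketch (my addition) showing that LeakyReLU keeps a non-zero gradient for negative inputs; `negative_slope` is PyTorch's name for the slope in the negative region:

```python
import torch
import torch.nn as nn

leaky = nn.LeakyReLU(negative_slope=0.01)   # 0.01 is also the default slope
x = torch.tensor([-3.0], requires_grad=True)
y = leaky(x)
y.backward()
print(y, x.grad)   # output -0.03 and gradient 0.01: negative inputs still learn
```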

5. ELU

ELU also modifies the negative part of ReLU: it uses an exponential curve for x < 0. Its calculation formula is as follows:

$$\text{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha \left(e^{x} - 1\right), & x \le 0 \end{cases}$$

Its diagram is as follows:

[Plot of the ELU curve]

Advantages:

  1. It is continuously differentiable at all points.

  2. Training tends to be faster than with other non-saturating linear units such as ReLU and its variants.

  3. There is no dying-neuron problem.

  4. As a non-saturating activation function, it is less prone to exploding or vanishing gradients and often reaches higher accuracy.

Disadvantages:

  1. It involves exponentiation, so it is slower to compute.
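A quick numerical check of the piecewise ELU formula against PyTorch's `nn.ELU` (a sketch of my own, not from the original post):

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 2.0])
alpha = 1.0

# Piecewise formula: x for x > 0, alpha * (exp(x) - 1) otherwise
manual = torch.where(x > 0, x, alpha * (torch.exp(x) - 1.0))
print(torch.allclose(nn.ELU(alpha=alpha)(x), manual))   # True
```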

6. PReLU

PReLU has the same form as LeakyReLU, except that the negative-region slope a is not fixed but learned through backpropagation. Its calculation formula is as follows:

$$\text{PReLU}(x) = \begin{cases} x, & x \ge 0 \\ a x, & x < 0 \end{cases}$$

Its diagram is as follows:

[Plot of the PReLU curve]
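To make the "learned slope" concrete, a small sketch (my addition) showing that `nn.PReLU` exposes the slope a as a trainable parameter that receives gradients:

```python
import torch
import torch.nn as nn

prelu = nn.PReLU()                # one learnable slope, initialised to 0.25
print(list(prelu.parameters()))   # the slope a appears as a trainable parameter

out = prelu(torch.tensor([-2.0, 1.0])).sum()
out.backward()
print(prelu.weight.grad)          # non-zero: a is updated by backpropagation
```

By default `nn.PReLU` uses a single shared slope; passing `num_parameters` equal to the number of channels gives one learnable slope per channel.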

7. ReLU6

ReLU6 caps the output of ReLU at 6. Its calculation formula is as follows:

$$\text{ReLU6}(x) = \min(\max(0, x), 6)$$

Its diagram is as follows:

[Plot of the ReLU6 curve]

ReLU is linear for x > 0, so activations can grow very large and affect the numerical stability of the model. ReLU6 can be used to bound this linear growth.
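A minimal sketch (my addition) showing the clipping; `torch.clamp` gives an equivalent formulation:

```python
import torch
import torch.nn as nn

x = torch.linspace(-2.0, 10.0, steps=7)
print(nn.ReLU6()(x))                      # values are clipped to the range [0, 6]
print(torch.clamp(x, min=0.0, max=6.0))   # equivalent expression with clamp
```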

8. RReLU

RReLU (Randomized Rectified Linear Unit) is another variant of LeakyReLU. The slope of the negative part is sampled randomly during training and fixed during evaluation. Its calculation formula is as follows:

$$\text{RReLU}(x) = \begin{cases} x, & x \ge 0 \\ a x, & x < 0 \end{cases} \quad \text{with } a \sim U(\text{lower}, \text{upper})$$

Its diagram is as follows:

[Plot of the RReLU curve]
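A small sketch (my addition) of the train/eval difference; `lower=1/8, upper=1/3` are PyTorch's defaults, and in eval mode the slope is fixed to their average:

```python
import torch
import torch.nn as nn

rrelu = nn.RReLU(lower=1/8, upper=1/3)
x = torch.full((1, 5), -1.0)

rrelu.train()
print(rrelu(x))   # each element uses a slope sampled from U(lower, upper)

rrelu.eval()
print(rrelu(x))   # fixed slope (lower + upper) / 2 for every element
```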

9. SELU

SELU has the same form as ELU but with an additional scale parameter. Its calculation formula is as follows:

$$\text{SELU}(x) = \text{scale} \cdot \begin{cases} x, & x > 0 \\ \alpha \left(e^{x} - 1\right), & x \le 0 \end{cases}$$

Its diagram is as follows:

[Plot of the SELU curve]

In PyTorch, scale = 1.0507009873554804934193349852946 and α = 1.6732632423543772848170429916717.
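A quick check (my addition) that the piecewise formula with these constants matches `nn.SELU`:

```python
import torch
import torch.nn as nn

scale = 1.0507009873554804934193349852946
alpha = 1.6732632423543772848170429916717

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
manual = scale * torch.where(x > 0, x, alpha * (torch.exp(x) - 1.0))
print(torch.allclose(nn.SELU()(x), manual))   # True
```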

10. CELU

Like SELU above, CELU uses an exponential curve in the negative range and a linear function in the positive range. Its calculation formula is as follows:

$$\text{CELU}(x) = \begin{cases} x, & x \ge 0 \\ \alpha \left(e^{x/\alpha} - 1\right), & x < 0 \end{cases}$$

Its diagram is as follows:

[Plot of the CELU curve]
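A short sketch (my addition) checking the formula and showing that CELU with α = 1 coincides with ELU:

```python
import torch
import torch.nn as nn

x = torch.tensor([-3.0, -1.0, 0.0, 2.0])
alpha = 1.0

manual = torch.where(x >= 0, x, alpha * (torch.exp(x / alpha) - 1.0))
print(torch.allclose(nn.CELU(alpha=alpha)(x), manual))               # True
print(torch.allclose(nn.CELU(alpha=1.0)(x), nn.ELU(alpha=1.0)(x)))   # True for alpha = 1
```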

11. GELU

GELU folds a stochastic-regularization idea into the activation function: the input is weighted by the Gaussian cumulative distribution function. Its calculation formula is as follows:

$$\text{GELU}(x) = x \, \Phi(x) = \frac{x}{2}\left(1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$$

Because erf has no elementary closed-form expression, the original paper gives a tanh-based approximation:

$$\text{GELU}(x) \approx \frac{x}{2}\left(1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715\,x^{3}\right)\right)\right)$$

Its diagram is as follows:

[Plot of the GELU curve]
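A sketch (my addition) comparing the exact erf-based GELU with the tanh approximation; recent PyTorch versions also expose the approximation as `nn.GELU(approximate="tanh")`:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# Exact GELU: x * 0.5 * (1 + erf(x / sqrt(2)))
exact = x * 0.5 * (1.0 + torch.erf(x / 2.0 ** 0.5))
print(torch.allclose(nn.GELU()(x), exact))    # True

# Tanh approximation from the paper; close to, but not identical with, the exact form
approx = 0.5 * x * (1.0 + torch.tanh((2.0 / torch.pi) ** 0.5 * (x + 0.044715 * x ** 3)))
print(torch.max(torch.abs(exact - approx)))   # small maximum difference
```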

12. Tanh

The hyperbolic tangent Tanh is also a commonly used activation function in neural networks, especially for the last layer of image-generation models. Its calculation formula is as follows:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

Its diagram is as follows:

[Plot of the Tanh curve]

Advantages:

  1. The output is zero-centered, so it converges faster than Sigmoid and needs fewer iterations.

Disadvantages:

  1. It requires exponentiation, so it is computationally expensive.

  2. It also suffers from vanishing gradients, because the function saturates (and its gradient approaches 0) on both sides.
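A small sketch (my addition) of two properties: Tanh is zero-centered and is simply a shifted, rescaled Sigmoid, and it saturates on both sides:

```python
import torch

x = torch.linspace(-3.0, 3.0, steps=7)

# tanh can be written in terms of sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print(torch.allclose(torch.tanh(x), 2.0 * torch.sigmoid(2.0 * x) - 1.0))  # True
print(torch.tanh(x))   # zero-centered output that saturates towards -1 and +1
```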

13. Tanhshrink

Tanhshrink simply subtracts the hyperbolic tangent from the input. Its calculation formula is as follows:

$$\text{Tanhshrink}(x) = x - \tanh(x)$$

Its diagram is as follows:

[Plot of the Tanhshrink curve]
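A one-line check (my addition) that `nn.Tanhshrink` matches x - tanh(x):

```python
import torch
import torch.nn as nn

x = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])
print(torch.allclose(nn.Tanhshrink()(x), x - torch.tanh(x)))   # True
```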

Summary:

Start with ReLU, which is the fastest, and observe the model's performance.

If ReLU does not work well, try variants such as LeakyReLU.

In a CNN that is not particularly deep, the choice of activation function generally does not have a large impact.
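If you want to compare activations empirically, one convenient pattern (a sketch of my own, not from the original post; SmallCNN is a made-up example model) is to pass the activation module into the network so it can be swapped in one place:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, act: nn.Module):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), act,
            nn.Conv2d(16, 32, kernel_size=3, padding=1), act,
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
        )

    def forward(self, x):
        return self.net(x)

# Try ReLU first, then swap in a variant if it underperforms
for act in (nn.ReLU(), nn.LeakyReLU(0.01), nn.GELU()):
    model = SmallCNN(act)
    print(act.__class__.__name__, model(torch.randn(2, 3, 32, 32)).shape)
```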


Origin blog.csdn.net/weixin_41202834/article/details/121173761