22 activation functions in PyTorch

Reprinted from: https://www.pianshen.com/article/33331174884/

1.22.Common activation functions
1.22.1.ReLU torch.nn.ReLU()
1.22.2.RReLU torch.nn.RReLU()
1.22.3.LeakyReLU torch.nn.LeakyReLU()
1.22.4.PReLU torch.nn.PReLU()
1.22.5.Softplus torch.nn.Softplus()
1.22.6.ELU torch.nn.ELU()
1.22.7.CELU torch.nn.CELU()
1.22.8.SELU torch.nn.SELU()
1.22.9.GELU torch.nn.GELU()
1.22.10.ReLU6 torch.nn.ReLU6()
1.22.11.Sigmoid torch.nn.Sigmoid()
1.22.12.Tanh torch.nn.Tanh()
1.22.13.Softsign torch.nn.Softsign()
1.22.14.Hardtanh torch.nn.Hardtanh()
1.22.15.Threshold torch.nn.Threshold()
1.22.16.Tanhshrink torch.nn.Tanhshrink()
1.22.17.Softshrink torch.nn.Softshrink()
1.22.18.Hardshrink torch.nn.Hardshrink()
1.22.19.LogSigmoid torch.nn.LogSigmoid()
1.22.20.Softmin torch.nn.Softmin()
1.22.21.Softmax torch.nn.Softmax()
1.22.22.LogSoftmax torch.nn.LogSoftmax()

1.22. Common activation functions

1.22.1.ReLU torch.nn.ReLU()

ReLU(x) = max(0, x)
The function plot of ReLU is as follows:
[Figure: plot of the ReLU function]
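A minimal usage sketch (the sample tensor is just an illustrative input):

```python
import torch
import torch.nn as nn

relu = nn.ReLU()                    # module form
x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(relu(x))                      # -> [0.0, 0.0, 0.0, 1.5]
print(torch.relu(x))                # functional form, same result
```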

1.22.2.RReLU torch.nn.RReLU()

There are many variants of ReLU. RReLU means Random ReLU and is defined as follows:
RReLU(x) = x for x >= 0, and a·x for x < 0, where a is drawn uniformly from [lower, upper].
For RReLU, a is a random variable sampled from the given range during training and is fixed during inference (PyTorch uses the mean of the range). Unlike LeakyReLU, where a is a fixed constant, RReLU's a is random; the version with a learnable a is PReLU, described below.
[Figure: plot of the RReLU function]
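A short sketch of the train/eval behavior (the bounds 0.1 and 0.3 are arbitrary illustrative choices):

```python
import torch
import torch.nn as nn

rrelu = nn.RReLU(lower=0.1, upper=0.3)
x = torch.tensor([-1.0, 0.0, 2.0])

rrelu.train()
print(rrelu(x))   # negative input scaled by a random slope in [0.1, 0.3]

rrelu.eval()
print(rrelu(x))   # slope fixed to (0.1 + 0.3) / 2 = 0.2 -> [-0.2, 0.0, 2.0]
```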

1.22.3.LeakyReLU torch.nn.LeakyReLU()

LeakyReLU(x) = x for x >= 0, and negative_slope·x for x < 0.
[Figure: plot of the LeakyReLU function]
Here a (the negative_slope) is a fixed value. The purpose of LeakyReLU is to avoid the problem that ReLU passes no gradient for negative inputs (the gradient of the part less than 0 is 0). By using a small negative slope, the network can propagate gradients through the negative part and learn more information, which does bring real benefits in some applications.
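A minimal sketch (0.01 is PyTorch's default negative_slope, written out explicitly here):

```python
import torch
import torch.nn as nn

leaky = nn.LeakyReLU(negative_slope=0.01)
x = torch.tensor([-3.0, 0.0, 2.0])
print(leaky(x))   # -> [-0.03, 0.0, 2.0]: the negative part is scaled, not zeroed
```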

1.22.4.PReLU torch.nn.PReLU()

PReLU(x) = x for x >= 0, and a·x for x < 0, where a is a learnable parameter.
Unlike RReLU, where a is random, the a in PReLU is learned during training.
[Figure: plot of the PReLU function]
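A sketch of the learnable slope (PReLU uses one shared slope by default, initialized to 0.25; num_parameters can instead be set to the channel count):

```python
import torch
import torch.nn as nn

prelu = nn.PReLU()                 # single shared slope a, initialized to 0.25
x = torch.tensor([-2.0, 1.0])
print(prelu(x))                    # -> [-0.5, 1.0], with grad_fn since a is trainable
print(list(prelu.parameters()))    # the slope appears as a trainable parameter
```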

It should be noted that the activation functions above (i.e. ReLU, LeakyReLU, PReLU) are scale-invariant.

1.22.5.Softplus torch.nn.Softplus()

Softplus is used as part of the loss function in both StyleGAN and StyleGAN2. Its expression and plot are shown below.
Softplus(x) = (1/β)·log(1 + exp(β·x))
[Figure: plot of the Softplus function]

Softplus is a smooth approximation of ReLU and can be used to constrain a network's output to always be positive.
As β increases, Softplus gets closer and closer to ReLU.
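A short sketch comparing two β values against ReLU (the input grid is arbitrary):

```python
import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 7)
print(nn.Softplus(beta=1)(x))    # smooth everywhere, always > 0
print(nn.Softplus(beta=10)(x))   # larger beta hugs ReLU more closely
print(torch.relu(x))             # the hard reference
```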

1.22.6.ELU torch.nn.ELU()

ELU(x) = x for x > 0, and α·(exp(x) − 1) for x <= 0.
[Figure: plot of the ELU function]
The difference between ELU and ReLU is that ELU can output values less than 0, which pushes the mean output of the layer toward 0. As a result, ELU tends to make the model converge faster, and its variants (CELU, SELU) are simply ELU with different parameter combinations.
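A minimal sketch (illustrative input):

```python
import torch
import torch.nn as nn

elu = nn.ELU(alpha=1.0)            # alpha=1.0 is the default
x = torch.tensor([-2.0, 0.0, 2.0])
print(elu(x))                      # -> [-0.8647, 0.0, 2.0]; negatives saturate at -alpha
```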

1.22.7.CELU torch.nn.CELU()

Compared with ELU, CELU changes the exp(x) in ELU to exp(x/α):
CELU(x) = max(0, x) + min(0, α·(exp(x/α) − 1))
[Figure: plot of the CELU function]
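A small sketch; with α = 1 CELU coincides with ELU, and α only reshapes the negative branch:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, 0.0, 2.0])
print(nn.CELU(alpha=1.0)(x))   # identical to nn.ELU(alpha=1.0)(x)
print(nn.CELU(alpha=2.0)(x))   # only the negative part changes
```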

1.22.8.SELU torch.nn.SELU()

Compared with ELU, SELU multiplies ELU by a scaling constant:
SELU(x) = scale·(max(0, x) + min(0, α·(exp(x) − 1))), with α ≈ 1.6733 and scale ≈ 1.0507.
[Figure: plot of the SELU function]
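A minimal sketch; the fixed constants come from the self-normalizing networks (SELU) paper:

```python
import torch
import torch.nn as nn

selu = nn.SELU()
x = torch.tensor([-2.0, 0.0, 2.0])
print(selu(x))         # positive inputs are multiplied by ~1.0507
print(2.0 * 1.0507)    # matches the last output element
```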

1.22.9.GELU torch.nn.GELU()

GELU(x) = x·Φ(x), where Φ(x) is the cumulative distribution function (CDF) of the standard Gaussian distribution.
[Figure: plot of the GELU function]
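A minimal sketch (illustrative input):

```python
import torch
import torch.nn as nn

gelu = nn.GELU()
x = torch.tensor([-2.0, 0.0, 2.0])
print(gelu(x))   # -> approximately [-0.0455, 0.0, 1.9545]
```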

1.22.10.ReLU6 torch.nn.ReLU6()

ReLU6(x) = min(max(0, x), 6)
[Figure: plot of the ReLU6 function]
ReLU6 is ReLU with the positive values capped at 6. This activation function is used in the one-stage object detection network SSD.
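A minimal sketch (illustrative input):

```python
import torch
import torch.nn as nn

relu6 = nn.ReLU6()
x = torch.tensor([-1.0, 3.0, 10.0])
print(relu6(x))   # -> [0., 3., 6.]: clipped below at 0 and above at 6
```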

1.22.11.Sigmoid torch.nn.Sigmoid()

Sigmoid squashes its input into the range 0 to 1. Since the maximum gradient of Sigmoid is 0.25, the network becomes hard to converge as more and more layers are stacked.

Therefore, for deep learning, ReLU and its variants are widely used to avoid the problem of difficulty in convergence.
Sigmoid(x) = 1 / (1 + exp(−x))
[Figure: plot of the Sigmoid function]
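A small sketch that also checks the 0.25 gradient bound at x = 0:

```python
import torch
import torch.nn as nn

x = torch.tensor([0.0], requires_grad=True)
y = nn.Sigmoid()(x)
y.backward()
print(y)        # -> [0.5]
print(x.grad)   # -> [0.25], the largest gradient Sigmoid can ever pass back
```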

1.22.12.Tanh torch.nn.Tanh()

Tanh is the hyperbolic tangent; its output ranges from -1 to 1. It can be computed via the hyperbolic functions, or from the following expression:
Tanh(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x))
[Figure: plot of the Tanh function]
Tanh is basically the same as Sigmoid except that it is centered, with outputs in (-1, 1). The mean of its output is approximately zero, so the model converges faster. Note that when the mean of each input variable is close to 0, convergence is usually faster; the principle is the same as in Batch Norm.
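A minimal sketch comparing the zero-centered Tanh output with Sigmoid:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, 0.0, 2.0])
print(nn.Tanh()(x))       # -> [-0.9640, 0.0, 0.9640], centered around 0
print(torch.sigmoid(x))   # (0, 1) range, not zero-centered, for comparison
```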

1.22.13.Softsign torch.nn.Softsign()

Softsign(x) = x / (1 + |x|)
[Figure: plot of the Softsign function]
Similar to Sigmoid, but it approaches its asymptotes more slowly than Sigmoid, which alleviates the vanishing gradient problem to some extent.
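A minimal sketch showing the slower saturation compared with Tanh:

```python
import torch
import torch.nn as nn

x = torch.tensor([-10.0, -1.0, 1.0, 10.0])
print(nn.Softsign()(x))   # -> [-0.9091, -0.5, 0.5, 0.9091]
print(torch.tanh(x))      # tanh is already ~±1 at the same inputs
```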

1.22.14.Hardtanh torch.nn.Hardtanh()

As shown in the figure below, Hardtanh is a piecewise linear function clipped to [-1, 1] by default, but the user can adjust the lower limit min_val and the upper limit max_val to expand or shrink the range.
Hardtanh(x) = max_val for x > max_val; min_val for x < min_val; x otherwise.
[Figure: plot of the Hardtanh function]
When the weights are kept within a small range, Hardtanh works surprisingly well.
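A small sketch with the default range and a widened one:

```python
import torch
import torch.nn as nn

x = torch.tensor([-5.0, -0.5, 0.5, 5.0])
print(nn.Hardtanh()(x))                            # clipped to [-1, 1]
print(nn.Hardtanh(min_val=-2.0, max_val=2.0)(x))   # clipped to [-2, 2]
```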

1.22.15.Threshold torch.nn.Threshold()

Threshold(x) = x if x > threshold, otherwise value.

The Threshold activation is rarely used now, because the network cannot propagate gradients back through it. This is also what kept people from using backpropagation in the 1960s and 1970s: researchers at that time mainly used binary neurons, which output only 0 or 1 as impulse signals.
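A minimal sketch (the threshold and replacement value are arbitrary choices):

```python
import torch
import torch.nn as nn

# replace every entry <= 0.5 with the constant 0.0
thresh = nn.Threshold(threshold=0.5, value=0.0)
x = torch.tensor([0.1, 0.5, 0.9])
print(thresh(x))   # -> [0.0, 0.0, 0.9]
```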

1.22.16.Tanhshrink torch.nn.Tanhshrink()

Tanhshrink(x) = x − tanh(x)
[Figure: plot of the Tanhshrink function]
It is rarely used, except in sparse coding, where it is used to compute the values of the latent variables.
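A minimal sketch showing that Tanhshrink is literally x minus tanh(x):

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, 0.0, 2.0])
print(nn.Tanhshrink()(x))   # -> [-1.0360, 0.0, 1.0360]
print(x - torch.tanh(x))    # same values, written out explicitly
```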

1.22.17.Softshrink torch.nn.Softshrink()

Softshrink(x) = x − λ for x > λ; x + λ for x < −λ; 0 otherwise.
[Figure: plot of the Softshrink function]
This method is not very commonly used at present. Its purpose is to force values close to 0 directly to 0 by setting λ. Since it places no restriction on the part less than 0, the effect is not very good.
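A minimal sketch (λ = 0.5 is PyTorch's default, written explicitly; note the parameter is spelled lambd):

```python
import torch
import torch.nn as nn

soft = nn.Softshrink(lambd=0.5)
x = torch.tensor([-2.0, -0.3, 0.3, 2.0])
print(soft(x))   # -> [-1.5, 0.0, 0.0, 1.5]: small values zeroed, large ones shrunk by 0.5
```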

1.22.18.Hardshrink torch.nn.Hardshrink()

Hardshrink(x) = x for |x| > λ, otherwise 0.
[Figure: plot of the Hardshrink function]
Similar to Softshrink; apart from sparse coding, it is rarely used.
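A minimal sketch; unlike Softshrink, values outside ±λ pass through unchanged:

```python
import torch
import torch.nn as nn

hard = nn.Hardshrink(lambd=0.5)
x = torch.tensor([-2.0, -0.3, 0.3, 2.0])
print(hard(x))   # -> [-2.0, 0.0, 0.0, 2.0]
```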

1.22.19.LogSigmoid torch.nn.LogSigmoid()

LogSigmoid applies a logarithm on top of Sigmoid:
LogSigmoid(x) = log(1 / (1 + exp(−x)))
[Figure: plot of the LogSigmoid function]
This function is mostly used when computing loss functions.
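A minimal sketch; the built-in module is numerically safer than composing log and sigmoid by hand:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, 0.0, 2.0])
print(nn.LogSigmoid()(x))            # -> [-2.1269, -0.6931, -0.1269]
print(torch.log(torch.sigmoid(x)))   # same values on this tame input
```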

1.22.20.Softmin torch.nn.Softmin()

Softmin(x_i) = exp(−x_i) / Σ_j exp(−x_j)

It turns a vector of numbers into a probability distribution, similar to Softmax, but it assigns higher probability to smaller values.
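A small sketch contrasting Softmin with Softmax on the same vector (dim must be given explicitly):

```python
import torch
import torch.nn as nn

x = torch.tensor([1.0, 2.0, 3.0])
print(nn.Softmin(dim=0)(x))   # highest probability on the smallest entry
print(nn.Softmax(dim=0)(x))   # highest probability on the largest entry
```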

1.22.21.Softmax torch.nn.Softmax()

Softmax(x_i) = exp(x_i) / Σ_j exp(x_j)

1.22.22.LogSoftmax torch.nn.LogSoftmax()

LogSoftmax(x_i) = log( exp(x_i) / Σ_j exp(x_j) )
Similar to LogSigmoid, LogSoftmax is mostly used when computing loss functions.
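A short sketch of the usual pairing with NLLLoss, which together matches CrossEntropyLoss (the logits and label below are made up for illustration):

```python
import torch
import torch.nn as nn

logits = torch.tensor([[0.2, 1.5, -0.3]])   # hypothetical scores for 3 classes
target = torch.tensor([1])                  # hypothetical correct class index

log_probs = nn.LogSoftmax(dim=1)(logits)
print(nn.NLLLoss()(log_probs, target))         # LogSoftmax + NLLLoss ...
print(nn.CrossEntropyLoss()(logits, target))   # ... equals CrossEntropyLoss on raw logits
```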

Original article: blog.csdn.net/weixin_36670529/article/details/114242626