Deep Learning: Detailed Analysis and Comparison of Commonly Used Activation Functions

  This article focuses on the set of activation functions covered in Section 3.1 of Qiu Xipeng's "Neural Networks and Deep Learning" (NNDL) and records a comparison of them. There is no shortage of activation functions; here they are assessed from the perspective of their development and their respective advantages and disadvantages, with an attempt to find the angles and trends along which they could still be improved.

  First, the activation functions taking part in today's comparison: the Sigmoid family (Logistic, Tanh, Hard-L&T), the ReLU family (ReLU, Leaky ReLU, PReLU, ELU, Softplus), Google's Swish function, the GELU function used in BERT, and the Maxout unit. They are covered one by one, and each later function is contrasted with the ones before it.

  First, the general requirements for an activation function: 1. it should be continuous and differentiable; 2. the function should be as simple as possible to compute; 3. the range of the function should lie in a suitable interval, neither too large nor too small.

Sigmoid function series

  The functions in this family saturate easily and suffer from the vanishing gradient problem (vanishing gradient means that the derivative of a Sigmoid function in its saturated regions is close to 0, so the error keeps being attenuated as it is propagated back through each layer; when the network is deep, the gradient decays continuously and may even vanish). The Sigmoid functions also result in non-sparse neural networks.

Logistic function

  A direct analysis of its characteristics (properties not listed here are given in the later comparisons):

  • The Logistic function has the standard Sigmoid characteristics and is saturated at both ends (that is, its derivative tends to zero as the input tends to infinity).
  • It is a "squashing" function: it can be seen as squashing an input from the whole real line into the interval (0, 1).
  • Compared with the perceptron's step function, its output can be interpreted directly as a probability, which lets the neural network combine better with statistical machine learning models.
  • It can also be regarded as a "soft gate" that controls how much information from other neurons is passed on.
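
  As a concrete reference, here is a minimal NumPy sketch of the Logistic function and its derivative (function names are chosen here only for illustration):

```python
import numpy as np

def logistic(x):
    # Logistic (standard sigmoid): squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def logistic_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); close to 0 in the saturated regions
    s = logistic(x)
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(logistic(x))       # values near 0 on the left, 0.5 at x = 0, near 1 on the right
print(logistic_grad(x))  # largest (0.25) at x = 0, vanishing at both ends
```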

Tanh function

  A comparative analysis of the Tanh function:

  • Compared with Logistic, Tanh is zero-centered (Logistic is not, as can be seen intuitively from the figure below), which reduces the bias shift passed to the next layer of neurons and improves the efficiency of gradient descent (faster convergence).
  • It shares the Logistic characteristics just described and is likewise saturated at both ends, but its output range is doubled, to (-1, 1).
  • Like Logistic, it is built on exponential operations, so the gradient descent process is computationally expensive.
Chart 1: Comparison of the Logistic and Tanh function curves
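
  A small sketch of the relationship between the two: Tanh can be written as a scaled and shifted Logistic function, tanh(x) = 2σ(2x) − 1, which makes the doubled, zero-centered range explicit:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
# tanh(x) = 2 * sigma(2x) - 1: same S-shape as Logistic, rescaled to (-1, 1) and centered at 0
print(np.allclose(np.tanh(x), 2.0 * logistic(2.0 * x) - 1.0))  # True
```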

Hard-L&T function

  Hard_Logistic and Hard_Tanh are hard (piecewise-linear) versions of the two Sigmoid functions above: they remove the costly exponential computation, although the clipped regions at both ends remain saturated.

  • They are easier to compute than the originals: the curve is approximated by its first-order Taylor expansion around zero, clipped at the saturation values, so gradient computation is cheaper and convergence is noticeably faster.
  • As for zero-centering, each hard version is consistent with its non-hard counterpart, and they inherit the other advantages of the originals.

  Yet despite all these advantages, the Sigmoid functions sit at the bottom of the activation-function "despise chain". They belong to the early days of deep learning; in recent years the usual order of preference has been ELU > ReLU > Sigmoid, and now that Google's Swish function has appeared, a plain Sigmoid is rarely used as an activation function on its own.

Chart 2: The hard versions of the two functions compared with the originals
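
  A minimal sketch of the two hard variants as defined in NNDL, i.e. the first-order Taylor expansions around zero clipped at the saturation values:

```python
import numpy as np

def hard_logistic(x):
    # Clipped first-order Taylor expansion of Logistic at 0: 0.25 * x + 0.5
    return np.clip(0.25 * x + 0.5, 0.0, 1.0)

def hard_tanh(x):
    # Clipped first-order Taylor expansion of Tanh at 0: simply x
    return np.clip(x, -1.0, 1.0)

x = np.array([-6.0, -0.5, 0.0, 0.5, 6.0])
print(hard_logistic(x))  # [0.    0.375 0.5   0.625 1.   ]
print(hard_tanh(x))      # [-1.  -0.5   0.    0.5   1.  ]
```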

ReLU function series

  The problem with this family of functions is the lack of random factors and the strong subjectivity of the parameter settings: in practice, the initialization of the network's parameters has to be tuned and tested repeatedly to avoid over-fitting. The sparsity produced by the ReLU function is much better than that of the Sigmoid functions. The original ReLU is not used much anymore and more attention should be paid to its variants, but we look at the original ReLU first and use it as the baseline for the improved variants.

ReLU function

  The original ReLU function is very simple, so neurons using ReLU only need addition, multiplication and comparison operations, which makes it computationally efficient.

  • Simple and computationally efficient.
  • Like Logistic, it is not zero-centered (it produces a bias shift for the next layer, which affects gradient descent).
  • Compared with the Sigmoid family, ReLU alleviates the vanishing gradient problem and speeds up the convergence of gradient descent.
  • One-sided suppression (the function value is 0 on one side).
  • A wide excitation boundary (the activation can become very large).
  • Its fatal flaw is the dying-ReLU problem: a neuron whose input stays negative always outputs 0 and its weights stop being updated.
  • Like the Logistic function, it cannot produce negative values, which affects the convergence speed of gradient descent.
  • Sparse representation ability (it can output exactly 0, whereas the saturated Sigmoid can only approach 0 without reaching it).
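
  A minimal NumPy sketch of ReLU and its gradient, showing the one-sided suppression and the exact zeros that give sparsity:

```python
import numpy as np

def relu(x):
    # max(0, x): exact zeros on the negative side (sparsity), identity on the positive side
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for x > 0 and 0 for x < 0; a neuron stuck in the zero region stops learning ("dying ReLU")
    return (x > 0).astype(float)

x = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])
print(relu(x))       # [0.  0.  0.  0.1 3. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```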

Leaky ReLU function

  On top of ReLU, a very small slope γ is introduced as the gradient for x < 0, so that the dying-neuron problem of ReLU can be avoided.

  • It avoids the dying-neuron problem.
  • It does not introduce over-fitting problems.
  • The computation is also simpler than that of the later models (for example, compared with ELU and the earlier exponential-based functions).
  • Because death is avoided and the negative side keeps a gradient, convergence can even be faster than with the original ReLU.
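
  A minimal sketch, with the small negative-side slope γ as a fixed hyperparameter (0.01 is only an illustrative choice):

```python
import numpy as np

def leaky_relu(x, gamma=0.01):
    # Identity for x >= 0, small slope gamma for x < 0, so the gradient never disappears entirely
    return np.where(x >= 0, x, gamma * x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(leaky_relu(x))  # [-0.05 -0.01  0.    1.    5.  ]
```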

PReLU function

  Compared with Leaky ReLU, the only change is that the parameter becomes a vector, with as many elements as there are neurons in the layer, so that a different slope is configured for each neuron's path. The parameters can be obtained through learning.

  • Different neurons are allowed to have different parameters, which is more flexible and can give a better fit when applied in practice. However, unlike Leaky ReLU, the possibility of over-fitting reappears.
  • The amount of computation is somewhat larger, but that is not a fatal flaw (it is still much cheaper than exponential computation).
  • It is better suited to practical model learning and to the nature of typical real-world problems, but the parameter-learning process is more complicated.
  • It inherits Leaky ReLU's protection against neuron death.
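
  A minimal sketch in which each neuron of a layer has its own learnable slope; the per-neuron vector gamma is the learnable parameter (the shapes are chosen only for illustration):

```python
import numpy as np

def prelu(x, gamma):
    # x: (batch, num_neurons); gamma: (num_neurons,), one learnable slope per neuron
    return np.where(x >= 0, x, gamma * x)

x = np.array([[ 2.0, -2.0, -2.0],
              [-1.0,  1.0, -4.0]])
gamma = np.array([0.01, 0.1, 0.25])  # in training these would be updated by gradient descent
print(prelu(x, gamma))
# [[ 2.   -0.2  -0.5 ]
#  [-0.01  1.   -1.  ]]
```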

ELU function

  ELU smooths the x < 0 part of the ReLU function: the negative side becomes a curve that saturates, and the output is approximately zero-centered. Although this brings back gradient saturation and exponential computation, it is still much better than Sigmoid.

  • In practice ELU is more effective than the three ReLUs above; it amounts to a compromise solution. First of all, it removes the defect of having no negative values.
  • But adding negative values inevitably introduces a smooth curve; although this is equivalent to re-introducing a saturated region, the advantages outweigh the cost.
  • It also prevents neuron death.
  • The exponential computation is only expensive relative to the other ReLU-family functions and is still negligible compared with Sigmoid, so ELU is also a commonly used choice today.
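
  A minimal sketch of ELU in its common form, ELU(x) = x for x ≥ 0 and γ(e^x − 1) for x < 0 (γ = 1.0 here only for illustration):

```python
import numpy as np

def elu(x, gamma=1.0):
    # Identity for x >= 0; a smooth exponential curve saturating at -gamma for x < 0
    return np.where(x >= 0, x, gamma * (np.exp(x) - 1.0))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(elu(x))  # the negative side saturates near -1, the positive side stays linear
```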

SoftPlus function

  SoftPlus was actually proposed very early, back in 2001, well before ReLU, which appeared roughly ten years later. It can be regarded as a smooth version of ReLU, and its characteristics are correspondingly similar.

  • It likewise produces no negative values.
  • It is likewise not zero-centered.
  • The same one-sided suppression (saturated on the left).
  • The same wide excitation boundary.
  • It has less sparsity than ReLU because the left side only saturates towards 0 instead of reaching it.
Chart 3: ReLU-family function curves (PReLU can be regarded as a collection of different Leaky ReLU curves)
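
  A minimal sketch of SoftPlus, softplus(x) = log(1 + e^x), the smooth counterpart of ReLU; its derivative is exactly the Logistic function:

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)); np.logaddexp(0, x) computes it in a numerically stable way
    return np.logaddexp(0.0, x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(softplus(x))  # approaches 0 on the left, approaches x on the right, log(2) at 0
```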

Swish function

  Swish is a new activation function proposed by Google in 2017. Building on the advantages of ReLU, it strengthens the gating mechanism: the added hyperparameter β adjusts how open the gate is, which lets the function approximate different ReLU-like functions. As noted earlier, the Sigmoid functions saturate easily and the ReLU family lacks random factors. What matters most in neural network modelling is the nonlinearity of the model, while a stochastic-regularization component is also needed for generalization ability, and none of the functions discussed so far expresses both well at the same time. Starting with Swish, and later GELU, activation functions began to achieve this.

  • It inherits the advantages of ReLU: there is still no vanishing-gradient problem for x > 0.
  • At the same time, the x < 0 part does not die easily, and the gating mechanism adds flexibility, which has led to Swish being applied with many kinds of machine learning models in practice.
  • The β parameter can be a single shared value or one value per neuron in the layer (a per-neuron β is learned, a shared β is fixed).
  • Its only drawback is the large amount of computation.
Chart 4: Swish function curves
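
  A minimal sketch of Swish, swish(x) = x · σ(βx), where the Logistic factor acts as a soft gate on the input (β = 1.0 is only an illustrative default):

```python
import numpy as np

def swish(x, beta=1.0):
    # x * sigmoid(beta * x): the sigmoid acts as a soft gate on the input itself;
    # a very large beta approaches ReLU, beta = 0 gives the linear function x / 2
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))        # a small negative dip for x < 0, close to x for large positive x
print(swish(x, 10.0))  # with a large beta the result is close to relu(x)
```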

GELU function

  GELU applies the idea of stochastic regularization inside the nonlinear activation, and like Swish it adjusts its output through a gating mechanism, so the two are very similar. The BERT source code uses GELU, which shows that GELU is working very well in current applications.

  • Building in the idea of stochastic regularization makes the gradient descent process and learning more convenient.
  • It uses the standard normal distribution as a component of its expression; it can be approximated with the Sigmoid function, and its fitting ability is stronger.
  • It also has the advantages of Swish listed above.
  • It does not produce hard saturation and adjusts its output through self-gating, so it can adapt to more model environments than ReLU.
  • At the same time, compared with ELU, it does not have to sacrifice much in order to obtain negative values and nonlinearity.
  • The code implementation is also relatively simple.
Chart 5: Comparison of the GELU function with the earlier functions
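
  A minimal sketch of GELU, GELU(x) = x · Φ(x) with Φ the standard normal CDF, alongside the common tanh-based approximation (the form that appears in the BERT source code):

```python
import math
import numpy as np

def gelu_exact(x):
    # GELU(x) = x * Phi(x), where Phi is the CDF of the standard normal distribution
    return x * 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(gelu_exact(x))
print(gelu_tanh(x))  # very close to the exact values
```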

Maxout unit

  Maxout was also proposed later than ReLU, so it avoids some of the former's problems. Its main change is that it no longer acts on a single net input; instead it takes all of the raw outputs of the previous layer as its input.

  • The defect is very obvious: because the input it refers to becomes all of the raw outputs of the previous layer, the number of parameters increases by a factor of k.
  • The parameters are learnable, the weights are changed through learning, and the unit itself is a piecewise-linear function.
  • Maxout's fitting ability is very strong; it works as a function approximator. For a standard MLP, given enough hidden neurons it can in theory approximate any function, but over-fitting must be avoided at the same time.
  • Compared with Swish and GELU it is somewhat lacking in nonlinearity and stochastic regularization, but it can also avoid most of the shortcomings of the earlier models.
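
  A minimal sketch of a maxout layer: each output unit computes k affine functions of the previous layer's raw output and takes their maximum (k = 3 and the shapes below are chosen only for illustration; W and b are the learnable parameters):

```python
import numpy as np

def maxout(x, W, b):
    # x: (batch, d_in); W: (k, d_out, d_in); b: (k, d_out)
    # Each output unit takes the maximum over k learned affine functions of x
    z = np.einsum('koi,bi->bko', W, x) + b  # (batch, k, d_out)
    return z.max(axis=1)                    # (batch, d_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))      # batch of 4, 5 raw outputs from the previous layer
W = rng.normal(size=(3, 2, 5))   # k = 3 pieces, 2 maxout units -> k times the parameters of one affine layer
b = rng.normal(size=(3, 2))
print(maxout(x, W, b).shape)     # (4, 2)
```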
