Deep learning - activation functions (continuously updated)

Activation function


I. Introduction

1. Concept:

Activation functions are essential for artificial neural network models to learn and represent highly complex, nonlinear functions.
 

2. Reasons for use

(1) The activation function plays an important role in enabling the model to learn and represent very complex, nonlinear functions.
(2) The activation function transforms the current feature space into another space, so that the data can be classified more easily.
(3) The activation function introduces nonlinearity.

  • Without an activation function, each layer amounts to a matrix multiplication, and the output signal is only a linear function of the input. No matter how many layers the network has, the output is a linear combination of the inputs, which is equivalent to having no hidden layer at all. This is the most primitive perceptron, no different from a single linear classifier, so the approximation ability of the network is very limited.
  • A linear function is a first-degree polynomial; linear equations have limited complexity and little ability to learn complex mappings from data. Without activation functions, a neural network cannot learn and model complex kinds of data such as images, video, audio, and speech.
  • For these reasons, nonlinear functions are introduced as activation functions, making deep neural networks far more expressive: the output is no longer a linear combination of the inputs, and the network can approximate almost any function (see the sketch after this list).
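As a minimal sketch of the point above (NumPy only; the layer sizes and random weights are chosen purely for illustration), stacking several linear layers with no activation collapses into a single linear map:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # a batch of 4 inputs with 8 features

# three "hidden layers" with no activation function in between
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 16))
W3 = rng.normal(size=(16, 3))
deep_linear = x @ W1 @ W2 @ W3

# the same mapping expressed as one linear layer
single_linear = x @ (W1 @ W2 @ W3)

print(np.allclose(deep_linear, single_linear))   # True: the extra depth added no expressive power
```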
     

3. Properties

(1) Nonlinearity: when the activation function is nonlinear, a two-layer neural network can approximate essentially any function. If the activation function is the identity, f(x) = x, this property does not hold; a multi-layer perceptron (MLP) with identity activations is equivalent to a single-layer network.
(2) Differentiability: this property matters when the optimization method is gradient-based.
(3) Monotonicity: when the activation function is monotonic, a single-layer network is guaranteed to correspond to a convex optimization problem.
(4) Range of output values: when the output of the activation function is bounded, gradient-based optimization tends to be more stable, because the feature representations are affected more strongly by the bounded weights; when the output is unbounded, training can be more efficient, but this case generally requires a smaller learning rate.
 

4. How to choose

According to experience:
(1) Deep learning often spends a great deal of time processing large amounts of data, so the convergence speed of the model is particularly important. Generally, when training a deep network, try to use zero-centered data (achievable through preprocessing) and zero-centered outputs; choosing an activation function with this property speeds up convergence.
(2) If the network has few layers, sigmoid, tanh, or ReLU are all reasonable choices. If the network is deep, a poor choice can cause vanishing gradients; in that case sigmoid and tanh are generally unsuitable, and ReLU is preferred.
(3) In binary classification, the last layer of the network typically uses the sigmoid activation function; in multi-class tasks, the last layer uses softmax.
(4) If using ReLU, set the initial values and the learning rate carefully. (In practice, a very large learning rate is likely to produce many dead ReLUs in the network; with an appropriately smaller learning rate this happens much less often.) If the problem persists, try Leaky ReLU, PReLU, ELU, or Maxout.


 

II. Specific classification

Compared with saturated activation functions, the advantages of using an "unsaturated activation function" are twofold:

  1. First, an "unsaturated activation function" can alleviate the "vanishing gradient" problem of deep neural networks with very many layers (sigmoid is really only usable as an activation function in shallow networks).
  2. Second, it speeds up convergence.
     

1. Saturated activation function

1.1 Logistic-Sigmoid function (log-odds function)

Function expressions and their derivatives:
$$\mathrm{sigmoid}(x)=f(x)=\frac{1}{1+e^{-x}},\qquad f'(x)=\frac{e^{-x}}{(1+e^{-x})^{2}}=f(x)\bigl(1-f(x)\bigr)$$
Geometric image: (figure omitted)

Features:
(1) Historically the most commonly used nonlinear activation function; it is a continuous, smooth, differentiable approximation of a threshold unit. It is symmetric about the point (0, 0.5) and maps any real-valued input to an output between 0 and 1: for very large negative inputs the output approaches 0, and for very large positive inputs it approaches 1.
(2) Since the output is limited to (0, 1), it normalizes the output of each neuron and can be used in models whose outputs are predicted probabilities.
(3) When the output is to be interpreted as the probability of a binary classification, sigmoid is still widely used as the activation of the output unit (sigmoid can be regarded as a special case of softmax).
(4) The gradient is smooth, avoiding jumps in the output values; predictions are clear-cut, i.e. very close to 1 or 0.
(5) The derivative is convenient to compute:
$$f'(x)=f(x)\bigl(1-f(x)\bigr)$$

(6)
$$\lim_{x \to \infty } f'(x)=0$$
An activation function with this property is called soft-saturating. Once the input falls into the saturated region, the gradient passed down to the lower layers becomes very small, and the network parameters can hardly be trained effectively (vanishing gradients).

Disadvantages:
(1) Vanishing and exploding gradients.

  • When gradients are propagated backwards through a deep neural network, they can explode or vanish. The probability of gradient explosion is small; vanishing gradients are much more common.
  • When the input x of σ(x) is very large or very small, the derivative is close to 0. Backpropagation rests on the chain rule of calculus: the gradient at the current layer is the product of the derivatives of the later layers, and multiplying many near-zero factors gives a result very close to 0.
  • The maximum value of the sigmoid derivative is 1/4, so the gradient is shrunk to at most 1/4 of its size at every layer. If the network weights are initialized to random values in [0, 1], the backpropagation equations show that as the gradient propagates from back to front, each layer scales it down by roughly a factor of 0.25; with many hidden layers the gradient becomes vanishingly small after passing through them, i.e. the vanishing-gradient phenomenon appears. If the weights are instead initialized to values in (1, +∞), gradients can explode. In addition, the function saturates when the absolute value of the input is large, which also causes gradients to vanish.
  • When gradients vanish, the parameters cannot be updated and the neural network cannot be optimized.

 
(2) The output is not zero-mean.
The output of sigmoid is not zero-mean (i.e. not zero-centered), which is undesirable, because neurons in the next layer then receive the non-zero-mean output of the previous layer as their input. One consequence: if, for example,
$$x>0,\qquad f=w^{T}x+b$$
then the local gradient with respect to w is entirely positive, so during backpropagation w is updated either all in the positive direction or all in the negative direction. This coupling produces a zig-zag path, slows convergence, and reduces the efficiency of weight updates. If training is done in batches, different batches may supply signals of different signs, which alleviates the problem somewhat.
 
(3) The exponential is costly to compute.
The analytic expression contains an exponential, which is relatively time-consuming for a computer to evaluate. For larger deep networks, this can noticeably increase training time.
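A minimal NumPy sketch of the formulas above (the function names are illustrative, not from any particular library):

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # f'(x) = f(x) * (1 - f(x)); its maximum value is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))        # outputs squashed into (0, 1)
print(sigmoid_grad(x))   # near-zero gradients at |x| = 10: the saturated region
```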
 

1.2 Hyperbolic Tangent (Tanh) function

Function expressions and their derivatives:
$$\tanh(x)=f(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}=2\,\mathrm{sigmoid}(2x)-1,\qquad f'(x)=4\,\mathrm{sigmoid}'(2x)=\frac{4e^{-2x}}{(1+e^{-2x})^{2}}=1-f^{2}(x)$$
Geometric image: (figure omitted)

Features:
(1) It compresses a real-valued input to the range [-1, 1]. Near 0 the tanh function is close to a linear transformation. The function is smooth, asymptotic, and monotonic, and it solves the non-zero-centered output problem of the sigmoid function.
(2) In typical binary classification problems, tanh is used in the hidden layers and sigmoid in the output layer, but this is not fixed and should be tuned to the specific problem.
(3) Tanh is zero-mean, so in practical applications, tanh will be better than sigmoid and converge faster.
 
Disadvantages:
(1) The problems of gradient disappearing and exponentiation still exist.
(2) When the input is very large or very small, the output is nearly flat and the gradient is small, which is not conducive to weight updates.
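A quick numerical check of the identity tanh(x) = 2·sigmoid(2x) - 1 and of the saturating derivative 1 - f²(x) stated above (a sketch using NumPy; `sigmoid` is redefined locally):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True: tanh(x) = 2*sigmoid(2x) - 1
print(1.0 - np.tanh(x) ** 2)                                  # f'(x) = 1 - f(x)^2, tiny at the tails (saturation)
```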
 

2. Unsaturated activation function - LU function

2.1 Rectified linear unit (ReLU) function

Function expressions and their derivatives:
$$\mathrm{ReLU}(x)=f(x)=\max(0,x)=\begin{cases}x, & x\ge 0\\0, & x<0\end{cases}\qquad f'(x)=\begin{cases}1, & x>0\\0, & x<0\end{cases}$$
Geometric image: (figure omitted)

Features:
(1) One of the most popular activation functions in deep learning. It is a max function (not differentiable everywhere) and provides a very simple nonlinear transformation.
(2) One-sided suppression: graphically, ReLU is a piecewise linear function that sets all negative values to 0 and leaves positive values unchanged. Its derivative is therefore simple: a gradient is either blocked entirely or passed through unchanged.
(3) Relatively wide activation boundary: for negative inputs ReLU is hard-saturated; for positive inputs the gradient is not attenuated, so there is no gradient-saturation problem. This non-saturation alleviates the vanishing-gradient problem (in the positive region) and provides a relatively wide activation boundary.
(4) Sparse activation: ReLU sets some neurons' outputs to 0, which makes the network sparse, reduces the interdependence of parameters, and alleviates overfitting. From a signal perspective, a neuron responds selectively to only a small portion of the input signals at any time while deliberately shielding the rest, which can improve learning accuracy and extract sparse features better and faster.
(5) ReLU only needs to check whether the input is greater than 0 to obtain the activation value; the computation is a simple linear relationship, so it is fast.
(6) With gradient descent (GD), convergence is much faster than with sigmoid or tanh.
(7) ReLU has many variants, such as Leaky ReLU, PReLU, etc.
 
Disadvantages:
(1) The output is not zero mean
(2) Dead Neuron problem
A Dead Neuron, or Dead ReLU (neuron necrosis), refers to the situation in which the input to a ReLU is negative, so its output is always 0 and its first derivative is always 0; the neuron no longer responds to any data, its parameters can no longer be updated, and it may never be activated again. Bias shift and neuron death together can hurt the convergence of the network.
There are two main reasons this can happen:

  • Parameter initialization problem
  • The learning rate is too high and the parameter update is too large during the training process

Solution: use Xavier initialization, avoid setting the learning rate too large, or use an algorithm that adapts the learning rate automatically, such as Adagrad.

(3) Very sensitive to parameter initialization and learning rate
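A small NumPy sketch of ReLU and its derivative, showing the one-sided suppression and the zero gradient for negative inputs that underlies the Dead ReLU problem (names are illustrative):

```python
import numpy as np

def relu(x):
    # max(0, x): negative values are clipped to zero (one-sided suppression)
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for x > 0, 0 for x < 0 (the value at exactly 0 is taken as 0 here)
    return (x > 0).astype(float)

pre_activations = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(pre_activations))       # [0.  0.  0.  0.5 3. ]
print(relu_grad(pre_activations))  # [0. 0. 0. 1. 1.]  -> no gradient flows where x <= 0
```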
 

2.2 Leaky ReLU (ReLU with a leaky unit)

Function expressions and their derivatives:
$$\mathrm{LeakyReLU}(x)=f(x)=\max(\alpha x, x)=\begin{cases}x, & x\ge 0\\\alpha x, & x<0\end{cases}\qquad f'(x)=\begin{cases}1, & x>0\\\alpha, & x<0\end{cases}$$
where α is a small positive number, typically 0.01.
Geometric image: (figure omitted)

Features:
(1) In order to solve the Dead ReLU Problem of the Relu function, a Leaky value is introduced in the negative half of the Relu function, so it is called the Leaky Relu function.
(2) It gives the negative input a very small linear component αx, so that the unit tends to remain active rather than die in the negative region. The slope here is fixed: α is a constant, generally set to a small value.
(3) It expands the range of ReLU from [0, +∞) to (-∞, +∞).
(4) In theory, Leaky ReLU has all the advantages of ReLU and additionally avoids the Dead ReLU problem; in practice, however, it has not been conclusively shown that Leaky ReLU is always better than ReLU.
 
Disadvantages:
(1) Its behaviour is not very stable in practice; being approximately linear over part of its range, it can perform poorly on complex classification tasks.
(2) Choosing the value of α adds difficulty: it requires a strong prior or repeated training runs to determine an appropriate parameter value.
 

2.3 Parametric ReLU (PReLU) function

Function expression and its derivative:
$$\mathrm{PReLU}(x)=f(x)=\max(\alpha x, x)=\begin{cases}x, & x\ge 0\\\alpha x, & x<0\end{cases}\qquad f'(x)=\begin{cases}1, & x>0\\\alpha, & x<0\end{cases}$$
where α is a learnable parameter (a variable), generally initialized to 0.25.
Geometric image: (figure omitted)

Features:
The parametric rectified linear unit is used to solve the neuron-death problem of ReLU. Here α is a learnable parameter that can be learned by backpropagation; otherwise it is essentially the same as above.
 

2.4 Random ReLU (RReLU) function

Function expression and its derivative:
$$\mathrm{RReLU}(x)=f(x)=\max(\alpha x, x)=\begin{cases}x, & x\ge 0\\\alpha x, & x<0\end{cases}\qquad f'(x)=\begin{cases}1, & x>0\\\alpha, & x<0\end{cases}$$
where α is drawn from a uniform distribution U(m, n), usually fairly small, with m < n and m, n ∈ [0, 1).
Features:
When RReLU is used as the activation function, a value α randomly sampled from the uniform distribution U(m, n) is used during training as the slope for negative inputs. As with PReLU, the negative region is a linear operation: although the slope is small, it does not tend to 0, which also avoids the Dead ReLU problem.
 

2.5 Exponential Linear Unit (ELU) function

Function expressions and their derivatives:
$$\mathrm{ELU}(x)=f(x)=\begin{cases}x, & x\ge 0\\\alpha\,(e^{x}-1), & x\le 0\end{cases}\qquad f'(x)=\begin{cases}1, & x>0\\\alpha e^{x}, & x<0\end{cases}$$
where α is a hyperparameter, with default value 1.0.
Geometric image: (figure omitted; f(x) for different values of α)

Features:
(1) Proposed to solve the existing problems of ReLU, there will be no Dead ReLU problem
(2) ELU has negative values, which will make the average value of activation close to zero. Mean activations close to zero can make learning faster because they make gradients closer to natural gradients.
(3) By reducing the effect of the bias shift, ELU brings the normal gradient closer to the unit natural gradient, so that the mean moves towards zero and learning is accelerated.
(4) ELU saturates to a negative value for small inputs, thereby reducing the variation and information propagated forward; the negative saturation region gives it some robustness to noise.
(5) Similar to Leaky ReLU, although it is better than ReLU in theory, there is currently no good evidence that ELU is always better than ReLU in actual use.
 
Disadvantages: the computation is slightly more expensive, and the function is not differentiable at the origin when α is not 1.
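A combined NumPy sketch of the variants from 2.2-2.5 (Leaky ReLU with a fixed slope, PReLU/RReLU with a generic slope α, and ELU), following the formulas above; the default values are the ones quoted in the text:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # x for x >= 0, alpha * x otherwise; alpha is a small fixed constant
    return np.where(x >= 0, x, alpha * x)

def prelu(x, alpha):
    # same form as Leaky ReLU, but alpha is a learnable parameter (typically initialized to 0.25);
    # RReLU instead samples alpha from U(m, n) during training
    return np.where(x >= 0, x, alpha * x)

def elu(x, alpha=1.0):
    # x for x >= 0, alpha * (e^x - 1) otherwise; saturates towards -alpha for large negative x
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(leaky_relu(x))
print(prelu(x, alpha=0.25))
print(elu(x))
```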
 

2.6 Gaussian Error Linear Units (GELU)

Function expressions and their derivatives:
$$\mathrm{GELU}(x)=f(x)=x\cdot P(X\le x)=x\cdot\Phi(x)=x\cdot\frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{x}e^{-\frac{(X-\mu)^{2}}{2\sigma^{2}}}\,dX$$
where Φ(x) is the cumulative distribution function of the Gaussian (normal) distribution, and μ and σ are its mean and standard deviation.
$$\mathrm{GELU}(x)\approx g(x)=0.5\,x\Bigl(1+\tanh\bigl(\sqrt{\tfrac{2}{\pi}}\,(x+0.044715\,x^{3})\bigr)\Bigr)\approx h(x)=x\cdot\mathrm{sigmoid}(1.702\,x)$$
$$f'(x)\approx \mathrm{sigmoid}(1.702x)+1.702\,x\cdot\mathrm{sigmoid}(1.702x)\bigl(1-\mathrm{sigmoid}(1.702x)\bigr)$$
The approximations hold for μ = 0 and σ = 1; the maximum error between f(x) and g(x) is about 0.002, and between g(x) and h(x) about 0.02.
 
Geometric image: (figures omitted)

Features:
(1)

  • In neural-network modelling, a crucial property of the model is nonlinearity. At the same time, for the model to generalize, random regularization is needed, e.g. dropout (randomly setting some outputs to 0, which is in effect a random nonlinear activation in disguise). Random regularization and nonlinear activation are usually treated as two separate things, but in fact the model's input is shaped by both of them.
  • The Gaussian error linear unit introduces the idea of random regularization into the activation itself: it combines nonlinearity with a random regularizer that depends on the distribution of the input data within a single activation expression, giving a probabilistic description of the neuron's input. Intuitively this matches natural understanding better, and experimentally it performs better than both ReLU and ELU.

(2)

  • GELU is in effect a combination of dropout, zoneout, and ReLU. GELU multiplies the input by a mask of 0s and 1s, and the mask is generated randomly in a way that depends on the input. Suppose the input is X and the mask is m; then m follows a Bernoulli distribution with parameter Φ(x) = P(X ≤ x), where X follows the standard normal distribution. This choice is made because neuron inputs tend to be normally distributed, so that when the input x decreases, it has a higher probability of being dropped out; the activation transformation thus depends randomly on the input.
  • Unlike dropout, which uses a fixed random probability, or ReLU, which masks based only on the sign of the input, GELU applies random regularization according to the probability that the current input is larger than the other inputs, i.e. the masking depends on the distribution of the input data.

(3) Treating x as the neuron's input: the larger x is, the more likely the activation output is to keep the value x; the smaller x is, the more likely the activation result is 0. For relatively large inputs x > 0, GELU is essentially a linear output (similar to ReLU); for relatively small inputs x < 0, the output of GELU is 0; for inputs x close to 0, GELU is a nonlinear output with a certain continuity.
(4) The version with mean 0 and variance 1 is the common one. As the variance tends to 0 (with mean 0), GELU tends to ReLU, so GELU can be regarded as a smoothed version of ReLU.
(5) When training with GELU as the activation function, an optimizer with momentum is recommended.

Specific details: Gaussian Error Linear Units (GELUs)
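A NumPy sketch comparing the exact GELU (via the standard-normal CDF, written with `math.erf`) against the tanh and sigmoid approximations given above; the printed maxima can be compared with the error bounds quoted in the text:

```python
import numpy as np
from math import erf, sqrt, pi

def gelu_exact(x):
    # f(x) = x * Phi(x), with Phi the standard-normal CDF expressed through erf
    return x * 0.5 * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))

def gelu_tanh(x):
    # g(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(sqrt(2.0 / pi) * (x + 0.044715 * x ** 3)))

def gelu_sigmoid(x):
    # h(x) = x * sigmoid(1.702 * x)
    return x / (1.0 + np.exp(-1.702 * x))

x = np.linspace(-6.0, 6.0, 1201)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))    # f vs g: small (the text quotes ~0.002)
print(np.max(np.abs(gelu_tanh(x) - gelu_sigmoid(x))))  # g vs h: larger (the text quotes ~0.02)
```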

 

2.7 Gated Linear Unit (GLU)

Function expression and its derivative:
$$\mathrm{GLU}(x)=f(x)=(x\ast w+b)\otimes \mathrm{sigmoid}(x\ast v+c)$$
where x is the input; w, v, b, c are learnable parameters; the sigmoid can be replaced by another activation function; and ⊗ denotes element-wise multiplication.

The element-wise product is also known as the Hadamard product: corresponding entries are multiplied, which is different from ordinary matrix multiplication.

Geometric image: (figure omitted)
 
Features:
(1) If the sigmoid gate of GTU is removed, what remains is the tanh activation function. The effect of the gating mechanism on model performance can therefore be assessed by comparing the experimental results of tanh and GTU. From the left plot in Figure 1, GTU clearly performs far better than the plain tanh activation, which shows that gate units help deep-network modelling.
(2) Both the tanh activation and GTU suffer from vanishing gradients, because even in GTU, when the unit activations are in the saturated region, both the input unit tanh(X·W + b) and the gate unit σ(X·V + c) attenuate the gradient. In contrast, GLU and ReLU do not have this problem: both have linear paths that let the gradient pass easily through the active units, so it does not shrink during backpropagation. Using GLU or ReLU as the activation therefore gives faster convergence during training.
(3) The ReLU unit does not completely abandon the gating idea of GLU: ReLU can be regarded as a simplified GLU whose gate is fully open in the activated state. Comparing ReLU and GLU, the right plot of Figure 1 shows that for the same training time the GLU unit reaches higher accuracy than ReLU.
 

2.8 Gated Tanh Unit (GTU)

Function expressions and their derivatives:
$$\mathrm{GTU}(x)=f(x)=\tanh(x\ast w+b)\otimes \mathrm{sigmoid}(x\ast v+c)$$
Features:
GTU has a nonlinear unit activated by tanh, while GLU has a linear unit; GLU therefore does not suffer from the vanishing-gradient problem that GTU has. Comparison shows that GLU achieves faster convergence and higher accuracy than GTU.
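A minimal sketch of GLU and GTU acting on a batch of vectors, following the expressions above; w, v, b, c are randomly initialized here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x, w, v, b, c):
    # (x*w + b) ⊗ sigmoid(x*v + c): a linear path gated element-wise by a sigmoid
    return (x @ w + b) * sigmoid(x @ v + c)

def gtu(x, w, v, b, c):
    # tanh(x*w + b) ⊗ sigmoid(x*v + c): both paths can saturate
    return np.tanh(x @ w + b) * sigmoid(x @ v + c)

d_in, d_out = 8, 4
x = rng.normal(size=(2, d_in))                       # batch of 2 inputs
w, v = rng.normal(size=(d_in, d_out)), rng.normal(size=(d_in, d_out))
b, c = np.zeros(d_out), np.zeros(d_out)
print(glu(x, w, v, b, c).shape, gtu(x, w, v, b, c).shape)   # (2, 4) (2, 4)
```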
 

3. Unsaturated activation functions - Soft functions

3.1 Softsign function

Function expressions and their derivatives:
$$\mathrm{Softsign}(x)=f(x)=\frac{x}{1+\left|x\right|},\qquad f'(x)=\frac{1}{(1+\left|x\right|)^{2}}$$
Geometric image: (figure omitted)

Features:
(1) The Softsign function is another alternative to tanh. Like tanh, it is antisymmetric, centered at zero, differentiable, and returns values between -1 and 1. Its flatter curve and more slowly decaying derivative suggest that it can learn more efficiently and handles the vanishing-gradient problem better than tanh.
(2) The calculation of the derivative of the Softsign function is more troublesome than that of the Tanh function.
 

3.2 Softplus function

Function expressions and their derivatives:
$$\mathrm{Softplus}(x)=f(x)=\ln(1+e^{x}),\qquad f'(x)=\frac{1}{1+e^{-x}}=\mathrm{sigmoid}(x)$$
Geometric image: (figure omitted)

Features:
(1) The Softplus function is an antiderivative of the Logistic-Sigmoid function and can be regarded as a smooth version of ReLU; the added 1 inside the logarithm ensures non-negativity.
(2) According to neuroscience research, the Softplus and ReLU functions resemble the activation-frequency (firing-rate) function of brain neurons. In other words, compared with earlier activation functions, Softplus and ReLU are closer to the activation model of biological neurons; since neural networks were developed with inspiration from brain science, the adoption of these two activation functions contributed to a new wave of neural-network research.
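A small NumPy sketch of Softsign and Softplus, including a numerical check that the derivative of Softplus is the sigmoid, as stated above (function names are illustrative):

```python
import numpy as np

def softsign(x):
    return x / (1.0 + np.abs(x))

def softplus(x):
    # ln(1 + e^x); written with log1p for clarity (not hardened against overflow for large x)
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)
h = 1e-6
numeric_grad = (softplus(x + h) - softplus(x - h)) / (2.0 * h)
print(np.allclose(numeric_grad, sigmoid(x), atol=1e-5))   # True: d/dx softplus(x) = sigmoid(x)
print(softsign(x))                                        # values in (-1, 1)
```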
 

3.3 Softmax function

Function expression:
$$f(x_{i})=\frac{e^{x_{i}}}{\sum_{j=1}^{n}e^{x_{j}}}$$
Features:
(1) Softmax is the activation function used at the output of a neural network for multi-class classification. It is special in that it cannot be plotted directly as a simple curve. In multi-class problems with more than two class labels it assigns class membership: for any real vector of length K, Softmax compresses it into a real vector of length K whose entries lie in (0, 1) and sum to 1, representing the predicted probability of each class.
(2) In the binary classification task, the sigmoid activation function is often used. When dealing with multi-classification problems, you need to use the softmax function.
(3) Softmax is different from the normal max function: the max function only outputs the maximum value, but Softmax ensures that smaller values ​​have a smaller probability and will not be discarded directly. Think of it as the probabilistic or "soft" version of the argmax function.
(4) The denominator of the Softmax function combines all the factors of the original output value, which means that various probabilities obtained by the Softmax function are related to each other. The output satisfies: the interval range of each item is (0,1), and the sum of all items is 1.
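A numerically stable sketch of Softmax (subtracting the maximum leaves the result unchanged, since the common factor cancels between numerator and denominator):

```python
import numpy as np

def softmax(x):
    # shift by max(x) for numerical stability; mathematically the output is unchanged
    z = x - np.max(x)
    e = np.exp(z)
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p)           # roughly [0.659 0.242 0.099]: each entry lies in (0, 1)
print(p.sum())     # 1.0: the outputs form a probability distribution
```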
 
Disadvantages:
(1) non-differentiable at zero
(2) negative input has a gradient of zero, which means that for activations in this region, the weights are not updated during backpropagation, thus producing dead neurons that never fire


Example: (figures omitted)


 

4. Unsaturated activation functions - early functions

4.1 Threshold function (step function, Sgn)

Function expressions and their derivatives:
$$\mathrm{Sgn}(x)=f(x)=\begin{cases}1, & x\ge 0\\0, & x<0\end{cases}\qquad f'(x)=0,\ x\neq 0$$
Geometric image: (figure omitted)

 

4.2 Piecewise linear functions

Function expressions and their derivatives:
$$f(x)=\begin{cases}1, & x\ge 1\\\frac{1}{2}(x+1), & -1\le x\le 1\\0, & x\le -1\end{cases}\qquad f'(x)=\begin{cases}\frac{1}{2}, & -1<x<1\\0, & x>1 \text{ or } x<-1\end{cases}$$
Geometric image: (figure omitted)

Features:
It is similar to a nonlinear amplifier with gain 1: it acts as a linear combiner when working in the linear region, and becomes a threshold unit as the gain tends to infinity.
 

5. Unsaturated activation function - other functions

5.1 Maxout function

Function expression:
$$\mathrm{Maxout}(x)=f_{i}(x)=\max_{j\in [1,k]} z_{ij}=\max_{j\in [1,k]}\bigl(x^{T}W_{\cdots ij}+b_{ij}\bigr)$$
Each neuron in the Maxout hidden layer is computed as above, where k is the number of "virtual hidden units" attached to each output neuron, a size set by hand. The weight W is a three-dimensional tensor of size (d, m, k) and b is a two-dimensional matrix of size (m, k); both are learnable parameters. Here d is the number of input nodes and m is the number of output-layer nodes.

When k is 1, the network is similar to a normal MLP network

The traditional MLP calculation formula is: z=w*X+b, out=f(z), f is the so-called activation function

Features:
(1) Maxout can be understood as a layer in the neural network, analogous to a pooling layer or convolution layer; the Maxout function can also be regarded as the network's activation-function layer.
(2) Where an ordinary neuron needs only one set of parameters, Maxout needs k sets, so the number of parameters increases k-fold.
(3) Maxout's fitting ability is very strong: it can fit any convex function. The most intuitive explanation is that any convex function can be approximated to arbitrary precision by a piecewise linear function, and Maxout takes the maximum over k "hidden" nodes, which are themselves linear; over different input ranges the maximum is therefore piecewise linear, with the number of pieces related to k.
(4) Maxout is a function approximator: just as a standard MLP with enough hidden neurons can theoretically approximate any function, so can a Maxout network.
(5) An ordinary neuron combined with an activation function produces one set of outputs; with Maxout, k sets of outputs are produced and the maximum is taken over them. Maxout's nonlinear fitting ability is therefore stronger, because there are more parameters and they are learnable. With k = 2 it can recover relu(z) (when one of the pieces is the zero function); with larger k it can learn more complex activation functions, which is why its fitting ability is stronger. In fact Maxout can fit any convex function.
(6) Neurons do not die, since Maxout is not a fixed functional form.
(7) It is a learnable activation function, because the parameters W are learned and change during training.
(8) It is a piecewise linear function, so the gradient does not easily vanish.
 
Disadvantages:
The amount of calculation is very large

Theorem: for any continuous piecewise linear function g(v), we can find two convex piecewise linear functions h1(v) and h2(v) such that their difference equals g(v).

Specific details: Maxout Networks
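A forward-pass sketch of a Maxout layer using the (d, m, k) shapes described above; the weights are random and the whole example is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def maxout_forward(x, W, b):
    # x: (batch, d), W: (d, m, k), b: (m, k)
    # z[n, i, j] = x[n] . W[:, i, j] + b[i, j]; the output takes the max over the k pieces
    z = np.einsum('nd,dmk->nmk', x, W) + b
    return z.max(axis=-1)                      # shape (batch, m)

d, m, k = 8, 4, 3
x = rng.normal(size=(2, d))
W = rng.normal(size=(d, m, k))
b = np.zeros((m, k))
print(maxout_forward(x, W, b).shape)           # (2, 4)
# with k = 2 and one piece fixed at zero (W[:, :, 0] = 0, b = 0), this reduces to ReLU(x @ W[:, :, 1])
```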

 

5.2 Dropout function

 

5.3 Swish function

Function expression and its derivative:
$$\mathrm{Swish}(x)=f(x)=x\cdot\mathrm{sigmoid}(\alpha x)=\frac{x}{1+e^{-\alpha x}},\qquad f'(x)=\mathrm{sigmoid}(\alpha x)+\alpha x\cdot\mathrm{sigmoid}(\alpha x)\bigl(1-\mathrm{sigmoid}(\alpha x)\bigr)$$
where α is a hyperparameter or a learnable parameter.
Geometric image: (figures omitted)

Features:

(1) A self-gated activation function, whose design is inspired by the use of the sigmoid function for gating in LSTMs and highway networks. Using the same value for the gate simplifies the gating mechanism, which is called self-gating. The advantage of self-gating is that it requires only a single scalar input, whereas ordinary gating requires multiple scalar inputs. This lets self-gated activation functions such as Swish easily replace activation functions that take a single scalar as input (such as ReLU) without changing the hidden capacity or number of parameters.
(2) Unbounded above: this helps prevent the slow training that occurs when gradients approach 0 because of saturation.

Bounded activation functions can have strong regularization and can handle large negative inputs

(3) Swish is smooth and non-monotonic (its derivative changes sign for some negative inputs); smoothness plays an important role in optimization and generalization.
(4) Graphically, Swish is similar to ReLU; the only noticeable difference is in the negative half-axis near 0. Swish outperforms ReLU on deep models.
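A sketch of Swish and its derivative following the expressions above (with α = 1 by default; larger α pushes the curve towards ReLU):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, alpha=1.0):
    # f(x) = x * sigmoid(alpha * x)
    return x * sigmoid(alpha * x)

def swish_grad(x, alpha=1.0):
    # f'(x) = sigmoid(alpha*x) + alpha*x*sigmoid(alpha*x)*(1 - sigmoid(alpha*x))
    s = sigmoid(alpha * x)
    return s + alpha * x * s * (1.0 - s)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))        # close to ReLU for large |x|, smooth near 0
print(swish_grad(x))   # smooth derivative; slightly negative for some x < 0 (non-monotonic)
```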
 

III. Comparison

Typical activation function comparison:
(1) Sigmoid function: binary classification
(2) Softmax function: multi-classification
(3) Tanh function: the function output is centered on 0. When the input of the activation function is 0, the output is also 0.
(4) ReLU function: the gradient is not saturated, and the calculation speed is fast.
 


The above is not finished and will continue to be updated. It is for personal study only; please contact me for removal in case of infringement. If there are any mistakes or omissions, please point them out so they can be improved.
