【Summary and Analysis of CV Knowledge Points】|Activation Function

【Preface】

This series is intended for readers who already know Python and have some programming experience, as well as for those preparing for jobs in artificial intelligence, algorithms, and machine learning. The series covers deep learning, machine learning, computer vision, feature engineering, and more. It aims to help beginners get started with deep learning quickly and help job seekers review the key algorithm knowledge points.

1. What is an activation function?

In a neural network, a node's activation function defines the node's output for a given input or set of inputs. Wikipedia uses the computer chip circuit as an analogy: a standard digital circuit can be viewed as an activation function that outputs on (1) or off (0) depending on its input. The main purpose of an activation function is to give the neural network the ability to solve nonlinear problems. Many activation functions exist, each with its own advantages and disadvantages; ReLU, Sigmoid, and Tanh are currently the most commonly used.

2. Why do you need an activation function?

Without an activation function, the weights and biases of a neural network can only perform linear transformations. Linear models are simple but have limited capacity to solve complex problems; a neural network without activation functions is essentially a linear regression model. To make this easier to understand, consider the following simple network.

Without an activation function, the network in the figure can be written as:

$$output = w_7(input_1 \cdot w_1 + input_2 \cdot w_2) + w_8(input_1 \cdot w_3 + input_2 \cdot w_4) + w_9(input_1 \cdot w_5 + input_2 \cdot w_6)$$

The essence is the following linear equation:

$$output = \begin{bmatrix} w_1 w_7 + w_3 w_8 + w_5 w_9 & w_2 w_7 + w_4 w_8 + w_6 w_9 \end{bmatrix} \begin{bmatrix} input_1 \\ input_2 \end{bmatrix} \Longrightarrow Y = WX$$

If the activation function $h(y)=\max(y, 0)$ is introduced into the hidden layer, the network can no longer be expressed as a simple linear equation:

$$output = w_7 \cdot \max(input_1 w_1 + input_2 w_2,\, 0) + w_8 \cdot \max(input_1 w_3 + input_2 w_4,\, 0) + w_9 \cdot \max(input_1 w_5 + input_2 w_6,\, 0)$$
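
To make this concrete, here is a minimal NumPy sketch (the layer sizes and variable names are illustrative assumptions, not taken from the figure) showing that stacked linear layers without an activation collapse into a single linear map, while inserting $h(y)=\max(y,0)$ breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2-unit input -> 3-unit hidden -> 1-unit output, no biases (as in the example above)
W_hidden = rng.normal(size=(2, 3))   # plays the role of w1..w6
W_out = rng.normal(size=(3, 1))      # plays the role of w7..w9

x = rng.normal(size=(5, 2))          # a batch of 5 inputs (input1, input2)

# Without an activation, the two layers are exactly one linear map W = W_hidden @ W_out
out_linear = (x @ W_hidden) @ W_out
out_collapsed = x @ (W_hidden @ W_out)
print(np.allclose(out_linear, out_collapsed))   # True: Y = W X

# With h(y) = max(y, 0) in the hidden layer, no single matrix reproduces the output
out_relu = np.maximum(x @ W_hidden, 0) @ W_out
print(np.allclose(out_relu, out_collapsed))     # False in general
```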

3. Some characteristics of the activation function

**Nonlinear** When the activation function is nonlinear, even a two-layer neural network can be shown to be a universal function approximator (the universal approximation theorem). The identity activation function does not satisfy this property: when every layer of a multi-layer network uses the identity activation, the whole network is equivalent to a single-layer network.

**Continuously differentiable** When the activation function is continuously differentiable, gradient-based optimization methods can be used. (There are exceptions. ReLU, for example, is not differentiable at 0 and has its own optimization issues: an overly large gradient or learning rate can push a ReLU neuron's pre-activation below 0 for all inputs, so its output stays at 0 and the parameters connected to it are never updated again; the neuron enters a "dead" state. Even so, ReLU can still be trained with gradient methods.) The binary step function is not differentiable at 0 and its derivative is 0 everywhere else, so gradient-based optimization is not suitable for it.

**Monotonic** When the activation function is monotonic, the error surface of a single-layer model is guaranteed to be convex, i.e. the corresponding error function is convex and any minimum found is the global minimum.

**Smooth with a monotonic derivative** In general, activation functions with this property have been found to perform better.

**Approximates identity near the origin** If the activation function has this property, the neural network learns efficiently when the weights are initialized with small random values. If it does not, special care must be taken when initializing the weights.

4. What are the common activation functions in machine learning?

Identity (identity function)

Description: The output equals the input. It is suitable for linear problems such as linear regression, but not for nonlinear problems.

Equation: $f(x)=x$

First derivative: $f'(x)=1$

Binary step (unit step function)

Description: The step function is the closest to the intuitive meaning of neuron activation: the neuron fires only when the stimulus exceeds a threshold. However, since the gradient of this function is 0 almost everywhere, it cannot be used as the activation function of a deep network.

Equation:
$$f(x)=\begin{cases} 0 & \text{for } x<0 \\ 1 & \text{for } x \geq 0 \end{cases}$$

First derivative:
$$f'(x)=\begin{cases} 0 & \text{for } x \neq 0 \\ \text{undefined} & \text{for } x=0 \end{cases}$$

Sigmoid (S-shaped function, also called the Logistic function)

Description: A widely used activation function with an exponential, S-shaped form; physically it is the closest to a biological neuron. Its output lies in (0, 1), so it can be interpreted as a probability, and it is commonly used to squash values, as in the Sigmoid cross-entropy loss. Sigmoid suffers from vanishing gradients and saturation; generally, a Sigmoid network shows vanishing gradients within about 5 layers.

Equation: $f(x)=\sigma(x)=\frac{1}{1+e^{-x}}$

First derivative: $f'(x)=f(x)(1-f(x))$
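
A minimal NumPy sketch of Sigmoid and its derivative (splitting by sign is an implementation choice to avoid overflow, not part of the formula above):

```python
import numpy as np

def sigmoid(x):
    """Numerically stable 1 / (1 + exp(-x))."""
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    exp_x = np.exp(x[~pos])            # safe: x < 0 here, so exp(x) <= 1
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

def sigmoid_grad(x):
    """f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))        # values in (0, 1)
print(sigmoid_grad(x))   # peaks at 0.25 when x = 0
```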

TanH (hyperbolic tangent function)

Description: TanH is similar to the Sigmoid function: when the input is very large or very small, the output is almost flat and the gradient is small, which slows weight updates and makes vanishing gradients and saturation likely. However, TanH's output range is (-1, 1), it is antisymmetric about 0, and it approximates the identity near the origin, which are all advantages. In typical binary classification problems, the hidden layers use tanh and the output layer uses sigmoid.

Equation: $f(x)=\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$

First derivative: $f'(x)=1-f(x)^{2}$
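
A quick sketch verifying the identity $f'(x)=1-f(x)^{2}$ against a central finite difference (the step size 1e-5 is an arbitrary choice):

```python
import numpy as np

def tanh_grad(x):
    # Analytic derivative: 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-3, 3, 7)
eps = 1e-5
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)   # central difference
print(np.allclose(tanh_grad(x), numeric, atol=1e-8))          # True
```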

ArcTan (arc tangent function)

Description: Graphically, ArcTan is similar to the TanH function, but it is flatter than TanH and has a larger output range, (-π/2, π/2). From the first derivative it can be seen that the derivative approaches zero slowly, so training is relatively fast.

Equation: $f(x)=\tan^{-1}(x)$

First derivative: $f'(x)=\frac{1}{x^{2}+1}$

Softsign function

Description: Graphically, Softsign is also similar to the TanH function. It is antisymmetric about 0, and training with it is relatively fast.

Equation: $f(x)=\frac{x}{1+|x|}$

First derivative: $f'(x)=\frac{1}{(1+|x|)^{2}}$

Rectified linear unit (linear rectification function, ReLU)

Description: A very popular activation function. It keeps a step-like biological mechanism, i.e. the neuron is activated only when the input is above 0. However, because the derivative below 0 is 0, it may cause slow learning or even dead neurons.

Equation:
$$f(x)=\begin{cases} 0 & \text{for } x \leq 0 \\ x & \text{for } x>0 \end{cases}$$

First derivative:
$$f'(x)=\begin{cases} 0 & \text{for } x \leq 0 \\ 1 & \text{for } x>0 \end{cases}$$
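
A minimal NumPy sketch of ReLU and its (sub)derivative; taking the derivative to be 0 at x = 0 follows the convention in the formula above:

```python
import numpy as np

def relu(x):
    # f(x) = max(x, 0)
    return np.maximum(x, 0.0)

def relu_grad(x):
    # 0 for x <= 0, 1 for x > 0
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```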

Leaky rectified linear unit (leaky linear rectification function, Leaky ReLU)

Description: A variant of ReLU: the part below 0 is not set to 0 but given a small non-zero slope, which reduces the impact of dead neurons.

Equation:
$$f(x)=\begin{cases} 0.01x & \text{for } x<0 \\ x & \text{for } x \geq 0 \end{cases}$$

First derivative:
$$f'(x)=\begin{cases} 0.01 & \text{for } x<0 \\ 1 & \text{for } x \geq 0 \end{cases}$$

Parametric rectified linear unit (parametric linear rectification function, PReLU)

Description: Another variant of ReLU, similar to Leaky ReLU, except that PReLU replaces the slope of the negative part with a learnable parameter α. As a result, the output range varies with α.

Equation:
$$f(\alpha, x)=\begin{cases} \alpha x & \text{for } x<0 \\ x & \text{for } x \geq 0 \end{cases}$$

First derivative:
$$f'(\alpha, x)=\begin{cases} \alpha & \text{for } x<0 \\ 1 & \text{for } x \geq 0 \end{cases}$$

Randomized leaky rectified linear unit (randomized leaky linear rectification function, RReLU)

Description: Based on PReLU, except that α is a random number sampled during training.

Equation:
$$f(\alpha, x)=\begin{cases} \alpha x & \text{for } x<0 \\ x & \text{for } x \geq 0 \end{cases}$$

First derivative:
$$f'(\alpha, x)=\begin{cases} \alpha & \text{for } x<0 \\ 1 & \text{for } x \geq 0 \end{cases}$$
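
The three ReLU variants above (Leaky ReLU, PReLU, RReLU) differ only in how the negative slope α is chosen, so they can share one implementation. A sketch (the uniform sampling range used for RReLU is an assumed example, not prescribed by the formulas above):

```python
import numpy as np

def leaky_relu(x, alpha):
    # alpha * x for x < 0, x for x >= 0
    return np.where(x >= 0, x, alpha * x)

x = np.array([-2.0, -0.5, 1.0, 3.0])

print(leaky_relu(x, 0.01))                  # Leaky ReLU: fixed slope 0.01
alpha_learned = 0.25                        # PReLU: alpha is a learnable parameter
print(leaky_relu(x, alpha_learned))
rng = np.random.default_rng(0)
alpha_random = rng.uniform(1 / 8, 1 / 3)    # RReLU: alpha sampled at training time (assumed range)
print(leaky_relu(x, alpha_random))
```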

Exponential linear unit (exponential linear function, ELU)

Description: The negative part of ELU uses a negative exponential form. Compared with ReLU it can take negative values, and it softly saturates for small inputs, which improves robustness to noise.

Equation:
$$f(\alpha, x)=\begin{cases} \alpha\left(e^{x}-1\right) & \text{for } x \leq 0 \\ x & \text{for } x>0 \end{cases}$$

First derivative:
$$f'(\alpha, x)=\begin{cases} f(\alpha, x)+\alpha & \text{for } x \leq 0 \\ 1 & \text{for } x>0 \end{cases}$$

Scaled exponential linear unit (scaled exponential linear function, SELU)

Description: A variant of ELU that introduces the hyperparameters λ and α with fixed values; the derivation is given in detail in the original paper (Self-Normalizing Neural Networks).

Equation:
$$f(\alpha, x)=\lambda\begin{cases} \alpha\left(e^{x}-1\right) & \text{for } x<0 \\ x & \text{for } x \geq 0 \end{cases} \quad \text{with } \lambda=1.0507 \text{ and } \alpha=1.67326$$

First derivative:
$$f'(\alpha, x)=\lambda\begin{cases} \alpha e^{x} & \text{for } x<0 \\ 1 & \text{for } x \geq 0 \end{cases}$$
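
Since SELU is just λ times an ELU with fixed constants, both can be sketched together (constants as given above; clamping the exponent is an implementation detail to avoid overflow inside np.where):

```python
import numpy as np

def elu(x, alpha=1.0):
    # alpha * (exp(x) - 1) for x <= 0, x for x > 0
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def selu(x, lam=1.0507, alpha=1.67326):
    # SELU(x) = lambda * ELU(x; alpha) with the fixed constants above
    return lam * elu(x, alpha)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(elu(x))
print(selu(x))
```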

SoftPlus function

Description: A smooth alternative to ReLU. The function is continuous everywhere and its derivative is never exactly zero, which avoids dead neurons. However, it is asymmetric and not zero-centered, which can hinder learning, and since the derivative is always less than 1 it also suffers from vanishing gradients.

Equation:
$$f(x)=\ln\left(1+e^{x}\right)$$

First derivative:
$$f'(x)=\frac{1}{1+e^{-x}}$$
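
A direct implementation of $\ln(1+e^{x})$ overflows for large x; a common numerically stable rewrite is $\ln(1+e^{x})=\max(x,0)+\ln(1+e^{-|x|})$. A sketch:

```python
import numpy as np

def softplus(x):
    # Numerically stable ln(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|))
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def softplus_grad(x):
    # The derivative of softplus is the sigmoid function
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
print(softplus(x))       # no overflow even at x = 100
print(softplus_grad(x))
```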

Bent identity (bent identity function)

Description: It can be understood as a compromise between the identity and ReLU. It does not suffer from dead neurons, but it carries some risk of vanishing and exploding gradients.

Equation:
$$f(x)=\frac{\sqrt{x^{2}+1}-1}{2}+x$$

First derivative:
$$f'(x)=\frac{x}{2\sqrt{x^{2}+1}}+1$$

Sinusoid (sine function)

Description: Using a sinusoid as the activation function introduces periodicity into the neural network. The function is continuous everywhere and symmetric about zero.

Equation:
$$f(x)=\sin(x)$$

First derivative:
$$f'(x)=\cos(x)$$

Sinc function

Description: The Sinc function is particularly important in signal processing because it characterizes the Fourier transform of a rectangular function. As an activation function, its advantage lies in being differentiable and symmetric, but it is prone to vanishing gradients.

Equation:
$$f(x)=\begin{cases} 1 & \text{for } x=0 \\ \frac{\sin(x)}{x} & \text{for } x \neq 0 \end{cases}$$

First derivative:
$$f'(x)=\begin{cases} 0 & \text{for } x=0 \\ \frac{\cos(x)}{x}-\frac{\sin(x)}{x^{2}} & \text{for } x \neq 0 \end{cases}$$

Gaussian (Gaussian function)

Description: The Gaussian activation function is not commonly used.

Equation:
$$f(x)=e^{-x^{2}}$$

First derivative:
$$f'(x)=-2xe^{-x^{2}}$$

Hard Sigmoid (piecewise linear approximation of the Sigmoid function)

Description: A piecewise linear approximation of the Sigmoid function that is cheaper to compute, but it still suffers from vanishing gradients and dead neurons.

Equation:
$$f(x)=\begin{cases} 0 & \text{for } x<-2.5 \\ 0.2x+0.5 & \text{for } -2.5 \leq x \leq 2.5 \\ 1 & \text{for } x>2.5 \end{cases}$$

First derivative:
$$f'(x)=\begin{cases} 0 & \text{for } x<-2.5 \\ 0.2 & \text{for } -2.5 \leq x \leq 2.5 \\ 0 & \text{for } x>2.5 \end{cases}$$
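
A sketch of the piecewise definition above using a clip (different frameworks use slightly different slopes and cut-offs; the 0.2 and ±2.5 here follow the formula above):

```python
import numpy as np

def hard_sigmoid(x):
    # 0 for x < -2.5, 0.2*x + 0.5 for -2.5 <= x <= 2.5, 1 for x > 2.5
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

x = np.array([-4.0, -2.5, 0.0, 2.5, 4.0])
print(hard_sigmoid(x))   # [0.  0.  0.5 1.  1. ]
```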

Hard Tanh (piecewise linear approximation of the Tanh function)

Description: A piecewise linear approximation of the Tanh activation function.

Equation:
$$f(x)=\begin{cases} -1 & \text{for } x<-1 \\ x & \text{for } -1 \leq x \leq 1 \\ 1 & \text{for } x>1 \end{cases}$$

First derivative:
$$f'(x)=\begin{cases} 0 & \text{for } x<-1 \\ 1 & \text{for } -1 \leq x \leq 1 \\ 0 & \text{for } x>1 \end{cases}$$

LeCun Tanh (also known as Scaled Tanh, scaled Tanh function)

Description: A scaled version of Tanh (the constant 1.7159 is the value recommended in LeCun's "Efficient BackProp").

Equation:
$$f(x)=1.7159\tanh\left(\frac{2}{3}x\right)$$

First derivative:
$$f'(x)=1.7159 \cdot \frac{2}{3}\left(1-\tanh^{2}\left(\frac{2}{3}x\right)\right)=1.7159 \cdot \frac{2}{3}-\frac{2}{3 \cdot 1.7159}f(x)^{2}$$

Symmetrical Sigmoid (symmetrical Sigmoid function)

Description: An alternative to Tanh: its shape is flatter, its derivative is smaller, and it saturates more slowly than Tanh.

Equation:
$$f(x)=\tanh(x/2)=\frac{1-e^{-x}}{1+e^{-x}}$$

First derivative:
$$f'(x)=0.5\left(1-\tanh^{2}(x/2)\right)=0.5\left(1-f(x)^{2}\right)$$

Complementary Log Log function

Description: An alternative to Sigmoid that saturates more strongly than Sigmoid.

Equation:
$$f(x)=1-e^{-e^{x}}$$

First derivative:
$$f'(x)=e^{x}\left(e^{-e^{x}}\right)=e^{x-e^{x}}$$

Absolute (absolute value function)

Description: The derivative has only two values.

Equation:
$$f(x)=|x|$$

First derivative:
$$f'(x)=\begin{cases} -1 & \text{for } x<0 \\ 1 & \text{for } x>0 \\ \text{undefined} & \text{for } x=0 \end{cases}$$

5. What is the activation function used in the transformer FFN layer? Why?

ReLU. Its advantages are fast convergence, less susceptibility to vanishing or exploding gradients, and low computational cost.
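
As a reminder of where the activation sits, the position-wise FFN in the original Transformer is $\mathrm{FFN}(x)=\max(0,\, xW_1+b_1)W_2+b_2$. A minimal NumPy sketch (the dimensions 512 and 2048 follow the base configuration of the original paper; the random initialization is only for illustration):

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.02, (d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(0, 0.02, (d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    # Position-wise feed-forward layer: ReLU between two linear projections
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

x = rng.normal(size=(10, d_model))   # 10 token positions
print(ffn(x).shape)                  # (10, 512)
```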

6. What is the activation function used in BERT, GPT, and GPT-2? Why?

BERT, GPT, GPT-2, RoBERTa, and ALBERT all use GELU.

$$\operatorname{GELU}(x)=xP(X \leq x)=x\Phi(x)$$

Intuitive understanding: x is the neuron's input and $\Phi$ is the CDF of the standard normal distribution; the larger $P(X \leq x)$ is, the more of x is retained, and the smaller it is, the closer the output of the activation function is to 0.
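
GELU can be computed exactly through the error function, or with the widely used tanh approximation; a sketch of both (only NumPy and the standard library are assumed):

```python
import math
import numpy as np

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF written via erf
    return np.array([0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])

def gelu_tanh(x):
    # Common tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

x = np.linspace(-3, 3, 7)
print(gelu_exact(x))
print(gelu_tanh(x))   # close to the exact values
```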

7. How to choose the activation function

  • When used in a classifier, use Sigmoid for binary classification and Softmax for multi-class classification; these two are generally used only in the output layer (see the sketch after this list);

  • For long-sequence problems, try to avoid Sigmoid and Tanh in the hidden layers, since they cause vanishing gradients;

  • Before GELU appeared, ReLU was the general-purpose choice in most cases, but it should only be used in hidden layers;

  • As of 2022, GELU and Swish are the main choices for hidden layers.
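
A small sketch of the first point: Sigmoid for a single binary output versus Softmax over multiple classes (the logit values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max for numerical stability, then normalize to a distribution
    e = np.exp(z - np.max(z))
    return e / e.sum()

binary_logit = 0.8                               # one output unit for binary classification
print(sigmoid(binary_logit))                     # probability of the positive class

multiclass_logits = np.array([2.0, 0.5, -1.0])   # one logit per class
print(softmax(multiclass_logits))                # probabilities summing to 1
```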

8. What are the advantages and disadvantages of ReLU?

Advantages:

  • From a computational standpoint, Sigmoid and Tanh require evaluating an exponential, which is relatively expensive, while ReLU only needs a threshold comparison;

  • ReLU is considered biologically plausible, e.g. unilateral inhibition and a wide excitation boundary (i.e. the level of excitation can be very high). In the human brain only about 1-4% of neurons are active at the same time; unilateral inhibition gives the network sparse representations, and the wide excitation boundary effectively alleviates problems such as vanishing gradients.

Disadvantages:

  • Like Sigmoid, ReLU's outputs are not zero-centered, so each layer introduces a bias shift into the next layer, which reduces the efficiency of gradient descent.

  • The dying-ReLU problem: an abnormal parameter update may drive the pre-activation below zero for all inputs, so the activation is 0, subsequent gradient updates are also 0, and the neuron dies.

【Project recommendations】

Core code library of top-conference papers, for beginners: https://github.com/xmu-xiaoma666/External-Attention-pytorch

YOLO object detection library for beginners: https://github.com/iscyy/yoloair

Top-journal and top-conference paper analyses for beginners: https://github.com/xmu-xiaoma666/FightingCV-Paper-Reading
