Neural Network: Activation Function Layer Knowledge Points

1. What is the role of the activation function, and what are the commonly used activation functions?

The role of the activation function

The activation function introduces nonlinearity into the network, which improves its expressive power; without a nonlinearity, a stack of layers collapses into a single linear transformation.
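As a quick illustration (a minimal NumPy sketch added here for clarity, not part of the original post), stacking two linear layers with no activation in between collapses into a single linear layer, while inserting a nonlinearity such as Sigmoid breaks that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                      # a small batch of inputs
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 2))

# Two stacked linear layers are equivalent to one linear layer with weights W1 @ W2.
two_linear = (x @ W1) @ W2
one_linear = x @ (W1 @ W2)
print(np.allclose(two_linear, one_linear))       # True: no extra expressive power

# With a nonlinearity in between, the composition is no longer linear in x.
hidden = 1.0 / (1.0 + np.exp(-(x @ W1)))         # Sigmoid activation
print(np.allclose(hidden @ W2, one_linear))      # False
```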

Commonly used activation functions

Sigmoid activation function

The function is defined as:

$$f(x) = \frac{1}{1 + e^{-x}}$$

As shown in the figure below, its value range is $(0, 1)$; in other words, the output of each neuron is squashed to a value between 0 and 1.

When $x$ is greater than zero the output approaches 1, and when $x$ is less than zero the output approaches 0. Because of this property, Sigmoid is often used as the output activation function for binary classification.

Derivative of Sigmoid:

$$f'(x) = \left(\frac{1}{1+e^{-x}}\right)' = \frac{1}{1+e^{-x}}\left(1 - \frac{1}{1+e^{-x}}\right) = f(x)\left(1-f(x)\right)$$

When $x = 0$, $f'(x) = 0.25$.
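The following is a minimal NumPy sketch (for illustration; the function names are my own) of Sigmoid and its derivative, confirming that the gradient peaks at 0.25 and nearly vanishes for large inputs:

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # f'(x) = f(x) * (1 - f(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))    # 0.25, the maximum of the derivative
print(sigmoid_grad(10.0))   # ~4.5e-05, the gradient almost disappears
```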

Advantages of Sigmoid:

  1. Smooth.
  2. Easy to differentiate.
  3. Its output can be interpreted as a probability, which helps explain the model's predictions.

Disadvantages of Sigmoid:

  1. When the input is very large or very small, the gradient of the function is almost 0, which severely hinders learning during backpropagation.
  2. The output of the Sigmoid function is not zero-centered (its mean is not 0), so during training the gradient updates to a layer's weights tend to be all positive or all negative.
  3. The derivative is never larger than 0.25, so backpropagation through many layers can easily cause the gradient to vanish.

Schematic diagram of the Sigmoid derivative: the gradient on both sides is almost 0

Tanh activation function

The Tanh function is defined as:

$$f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

As shown in the figure below, the value range is $(-1, 1)$.

Advantages of Tanh:

  1. The Tanh function compresses its input to the range $(-1, 1)$, which fixes the problem that the Sigmoid output is not zero-centered, so in practice Tanh is usually easier to converge than Sigmoid. Mathematically, Tanh is just a rescaled Sigmoid: $\tanh(x) = 2f(2x) - 1$, where $f(x)$ is the Sigmoid function.
  2. Smooth.
  3. Easy to differentiate.

Derivative of Tanh:

$$f'(x) = \left(\frac{e^x - e^{-x}}{e^x + e^{-x}}\right)' = 1 - \tanh^2(x)$$

When $x = 0$, $f'(x) = 1$.

Comparing the derivatives of Tanh and Sigmoid, the Tanh derivative is steeper (its maximum is 1 rather than 0.25), so Tanh generally converges faster than Sigmoid.
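A small sketch (illustrative only) comparing the two derivatives at zero:

```python
import numpy as np

def tanh_grad(x):
    # f'(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

# Tanh's derivative peaks at 1.0 while Sigmoid's peaks at 0.25,
# so Tanh propagates stronger gradients around zero.
print(tanh_grad(0.0), sigmoid_grad(0.0))   # 1.0 0.25
```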

Tanh derivative diagram

Disadvantages of Tanh:

The derivative is at most 1 and drops quickly toward 0 away from zero, so backpropagation through many layers can still cause the gradient to vanish.

ReLU activation function

The ReLU activation function is defined as:

$$f(x) = \max(0, x)$$

As shown in the figure below, the value range is $[0, +\infty)$.

Advantages of ReLU:

  1. The formula is very simple. Unlike the two activation functions above, it involves no expensive exponential operations, which saves a lot of computation time.
  2. With stochastic gradient descent, networks using ReLU converge more easily than those using Sigmoid or Tanh.
  3. In the negative half-axis the output and gradient are 0, so the corresponding neurons are suppressed. This one-sided suppression produces sparsity, which helps extract sparse features better and faster.
  4. The derivatives of Sigmoid and Tanh approach 0 in their positive and negative saturation regions, which causes vanishing gradients, while the derivative of ReLU for inputs greater than 0 is the constant 1, so the gradient is not attenuated and does not vanish.

Sparsity: in neural networks this means that the activation matrix contains many zeros. What does sparsity buy us? Greater efficiency in both time and space complexity: the zero values need little storage and are cheap to compute with.

Derivative of ReLU:

$$f'(x) = \begin{cases} 0, & x < 0 \\ 1, & x > 0 \\ \text{undefined}, & x = 0 \end{cases}$$

In practice, the derivative at $x = 0$ is simply defined to be 1 or 0.
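A minimal sketch (illustrative; here the derivative at $x = 0$ is taken to be 0) of ReLU and its gradient:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 0 for x < 0, 1 for x > 0; at x == 0 the derivative is undefined,
    # so a fixed value (0 in this sketch) is used instead.
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(relu_grad(x))  # [0. 0. 1.]
```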

Derivative of ReLU

Disadvantages of ReLU:

  1. Training may leave some neurons permanently inactive ("dead"), so they are never updated. One improvement to ReLU that addresses this is LeakyReLU.
  2. ReLU cannot avoid the exploding gradient problem.

LeakyReLU activation function

The LeakyReLU activation function is defined as:

$$f(x) = \begin{cases} ax, & x < 0 \\ x, & x \ge 0 \end{cases}$$

As shown in the figure below (with $a = 0.5$), the value range is $(-\infty, +\infty)$.

Advantages of LeakyReLU:

The difference from ReLU is that when $x$ is less than 0, $f(x) = ax$, where $a$ is a small slope (say 0.01). With this change, inputs less than 0 no longer produce a zero gradient during backpropagation, so the vanishing-gradient problem on the negative side is avoided.
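A minimal sketch (illustrative, with the slope $a = 0.01$ assumed) of LeakyReLU and its gradient:

```python
import numpy as np

def leaky_relu(x, a=0.01):
    # f(x) = x for x >= 0, a * x for x < 0
    return np.where(x >= 0, x, a * x)

def leaky_relu_grad(x, a=0.01):
    # The gradient is 1 on the positive side and the small slope a on the
    # negative side, so neurons with negative inputs still get updated.
    return np.where(x >= 0, 1.0, a)

x = np.array([-5.0, -0.5, 2.0])
print(leaky_relu(x))       # [-0.05  -0.005  2.   ]
print(leaky_relu_grad(x))  # [0.01 0.01 1.  ]
```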

Disadvantages of LeakyReLU:

  1. The exploding gradient problem still cannot be avoided.
  2. The network does not learn the slope $\alpha$; it is a fixed hyperparameter.
  3. Both pieces of the function are linear, so its derivative is just two constants.

SoftPlus activation function

The SoftPlus activation function is defined as:

$$f(x) = \ln(1 + e^x)$$

The value range is $(0, +\infty)$.

The function graph is as follows:

SoftPlus can be thought of as a smooth approximation of ReLU.
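A small sketch (illustrative) of SoftPlus; writing it as `np.logaddexp(0, x)` is a numerically stable way to evaluate $\ln(1 + e^x)$ for large $x$:

```python
import numpy as np

def softplus(x):
    # ln(1 + e^x), computed as logaddexp(0, x) to avoid overflow for large x
    return np.logaddexp(0.0, x)

x = np.array([-5.0, 0.0, 5.0, 50.0])
print(softplus(x))           # smooth and always positive
print(np.maximum(0.0, x))    # ReLU for comparison: the two agree for large |x|
```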

ELU activation function

The ELU activation function solves some of ReLU's problems while keeping its good properties. It requires choosing an $\alpha$ value; common values lie between 0.1 and 0.3.

The function definition looks like this:

$$f(x) = \begin{cases} \alpha(e^x - 1), & x < 0 \\ x, & x \ge 0 \end{cases}$$

If the input $x$ is greater than 0, the result is the same as for ReLU, i.e. $y = x$. If the input $x$ is less than 0, we get a value slightly below 0: the resulting $y$ depends on the input $x$ and is scaled by the parameter $\alpha$, which can be adjusted as needed. Because the formula introduces the exponential $e^x$, ELU is more expensive to compute than ReLU.

The ELU function graph for $\alpha = 0.2$ is shown below:

ELU function graph

Derivative of ELU:

$$f'(x) = \begin{cases} \alpha e^x, & x < 0 \\ 1, & x \ge 0 \end{cases}$$

The derivative plot looks like this:

Derivative plot of ELU
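A minimal sketch (illustrative, assuming $\alpha = 0.2$ as in the graphs above) of ELU and its derivative:

```python
import numpy as np

def elu(x, alpha=0.2):
    # f(x) = x for x >= 0, alpha * (e^x - 1) for x < 0
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=0.2):
    # f'(x) = 1 for x >= 0, alpha * e^x for x < 0
    return np.where(x >= 0, 1.0, alpha * np.exp(x))

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(elu(x))       # negative inputs saturate towards -alpha
print(elu_grad(x))  # the gradient stays nonzero for x < 0
```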

Advantages of ELU:

  1. It avoids the situation in ReLU where some neurons can never be updated.
  2. It can produce negative outputs.

Disadvantages of ELU:

  1. It involves exponential operations, so it is slower to compute.
  2. It still cannot avoid the exploding gradient problem.
  3. The network cannot learn the $\alpha$ value; it is a fixed hyperparameter.
