Activation function

What is an activation function?

The activation function is an essential component of an artificial neural network. It determines whether a neuron should be activated, that is, whether the information the neuron receives is relevant to the task at hand. The activation function applies a nonlinear transformation to the input information, and the transformed output is then passed to the next layer of neurons as their input.

The role of the activation function

If no activation function is used, the output of each layer is a linear function of the previous layer's input, and no matter how many layers the neural network has, the final output is just a linear combination of the inputs. The activation function introduces nonlinearity into the neurons, allowing the neural network to approximate arbitrary nonlinear functions.
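As a quick illustration of this point (a minimal NumPy sketch, not part of the original post), stacking two linear layers without an activation collapses into a single linear map:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                  # a small batch of inputs
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 2))

two_layers = x @ W1 @ W2                     # two "layers" with no activation
one_layer = x @ (W1 @ W2)                    # a single layer with merged weights

print(np.allclose(two_layers, one_layer))    # True: no extra expressive power
```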

Types of activation functions

Identity

The identity function is suitable for tasks where the underlying behavior is linear (similar to linear regression). However, it provides no nonlinear mapping: when a multi-layer network uses the identity activation function, the entire network is equivalent to a single-layer model.

The function is defined as $f(x)=x$, and its derivative is $f'(x)=1$.

[Figure: plot of the identity function]

Step

The step activation function is more of theoretical than practical interest: it mimics the all-or-nothing firing of biological neurons. However, it cannot be used in neural networks because its derivative is 0 everywhere (and undefined at zero), which makes gradient-based optimization infeasible.

The function is defined as $f(x)=\begin{cases} 0 & x<0 \\ 1 & x\ge 0 \end{cases}$, and its derivative is $f'(x)=\begin{cases} 0 & x\neq 0 \\ \text{undefined} & x=0 \end{cases}$

[Figure: plot of the step function]

Sigmoid

The sigmoid function maps its input into (0, 1); it is monotonic and continuous, its output range is bounded, optimization is stable, and it can be used in the output layer; its derivative is also easy to compute. However, because of its soft saturation, once the input falls into the saturation region the gradient approaches 0, and by the chain rule of backpropagation the gradient easily vanishes, causing training problems. The output of the sigmoid is always greater than 0; this non-zero-centered output biases the inputs of neurons in later layers (bias shift) and further slows the convergence of gradient descent. In addition, the exponential operation makes the computation relatively expensive and slow.

The function is defined as $f(x)=\sigma(x)=\dfrac{1}{1+e^{-x}}$, and its derivative is $f'(x)=\dfrac{e^{-x}}{(1+e^{-x})^{2}}=f(x)(1-f(x))$

[Figure: plot of the sigmoid function]
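The following minimal NumPy sketch (an illustrative addition, assuming the definition above) shows the sigmoid and its derivative; note how the gradient approaches 0 at both ends, which is the saturation effect just described:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                  # f'(x) = f(x)(1 - f(x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))                         # all outputs lie in (0, 1)
print(sigmoid_grad(x))                    # near 0 at both ends: saturation
```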

Tanh

Tanh converges faster than the sigmoid function, and unlike sigmoid, tanh is zero-centered. However, like sigmoid, its gradient vanishes easily because of saturation, and the exponential operations make computation relatively expensive and slow.

The function is defined as $f(x)=\tanh(x)=\dfrac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$, and its derivative is $f'(x)=1-\tanh^{2}(x)=1-f(x)^{2}$

[Figure: plot of the tanh function]
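A similar sketch for tanh (illustrative only, based on the formula above); the outputs are zero-centered but the gradient still saturates for large |x|:

```python
import numpy as np

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2          # f'(x) = 1 - tanh^2(x)

x = np.array([-3.0, 0.0, 3.0])
print(np.tanh(x))                         # zero-centered outputs in (-1, 1)
print(tanh_grad(x))                       # still saturates for large |x|
```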

ReLU

The ReLU function converges quickly. Compared with sigmoid and tanh, whose exponential operations make them computationally expensive, ReLU is very simple to implement. When the input x >= 0, the derivative of ReLU is a constant, which effectively alleviates the vanishing gradient problem; when x < 0, the gradient of ReLU is always 0, which gives the neural network a sparse representation. However, the output of ReLU is not zero-centered, and because of dying neurons some units may never be activated, so their parameters are never updated; ReLU also cannot avoid the exploding gradient problem.

The function is defined as $f(x)=\begin{cases} 0 & x<0 \\ x & x\ge 0 \end{cases}$, and its derivative is $f'(x)=\begin{cases} 0 & x<0 \\ 1 & x\ge 0 \end{cases}$

[Figure: plot of the ReLU function]
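A minimal sketch of ReLU and its derivative following the piecewise definition above (illustrative code, treating the derivative at x = 0 as 1 as in the formula):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x >= 0).astype(x.dtype)       # 1 for x >= 0, 0 for x < 0

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))                            # negative inputs are zeroed (sparsity)
print(relu_grad(x))                       # constant gradient of 1 on the right
```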

LReLU

The LReLU (leaky ReLU) activation function can avoid vanishing gradients; since its derivative is never zero, it reduces the occurrence of dead neurons. However, LReLU does not necessarily perform better than ReLU, and it cannot avoid the exploding gradient problem.

The function is defined as $f(x)=\begin{cases} \alpha x & x<0 \\ x & x\ge 0 \end{cases}$, and its derivative is $f'(x)=\begin{cases} \alpha & x<0 \\ 1 & x\ge 0 \end{cases}$

[Figure: plot of the LReLU function]
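A sketch of LReLU under the definition above; the slope α for negative inputs is a small fixed constant (0.01 is assumed here purely for illustration):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x >= 0, 1.0, alpha)   # never exactly 0, so gradients keep flowing

x = np.array([-2.0, 0.0, 2.0])
print(leaky_relu(x))
print(leaky_relu_grad(x))
```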

PReLU

PReLU is an improvement over LReLU: the slope parameter is learned adaptively from the data, convergence is fast, and the error rate is low. PReLU can be trained with backpropagation and optimized jointly with the other layers.

The function is defined as $f(\alpha, x)=\begin{cases} \alpha x & x<0 \\ x & x\ge 0 \end{cases}$, and its derivative is $f'(\alpha, x)=\begin{cases} \alpha & x<0 \\ 1 & x\ge 0 \end{cases}$

[Figure: plot of the PReLU function]
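A sketch of a PReLU forward pass together with the gradient with respect to the learnable slope α (illustrative only; in practice deep learning frameworks provide this as a built-in layer):

```python
import numpy as np

def prelu(x, alpha):
    return np.where(x >= 0, x, alpha * x)

def prelu_grads(x, alpha):
    dx = np.where(x >= 0, 1.0, alpha)     # gradient w.r.t. the input
    dalpha = np.where(x >= 0, 0.0, x)     # gradient w.r.t. the learnable slope
    return dx, dalpha

x = np.array([-2.0, -1.0, 3.0])
alpha = 0.25                              # initial value; updated during training
print(prelu(x, alpha))
print(*prelu_grads(x, alpha))
```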

RReLU

RReLU adds a linear term for negative inputs, and the slope of this linear term is randomly sampled at each node (usually from a uniform distribution).

The function is defined as $f(\alpha, x)=\begin{cases} \alpha x & x<0 \\ x & x\ge 0 \end{cases}$, and its derivative is $f'(\alpha, x)=\begin{cases} \alpha & x<0 \\ 1 & x\ge 0 \end{cases}$

[Figure: plot of the RReLU function]
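An illustrative sketch of RReLU at training time: the negative slope is sampled uniformly per element (the bounds 1/8 and 1/3 below are an assumed choice); at test time a fixed averaged slope would typically be used:

```python
import numpy as np

rng = np.random.default_rng(0)

def rrelu_train(x, lower=1/8, upper=1/3):
    alpha = rng.uniform(lower, upper, size=x.shape)   # random slope per element
    return np.where(x >= 0, x, alpha * x), alpha

x = np.array([-2.0, -1.0, 3.0])
out, alpha = rrelu_train(x)
print(out)
print(alpha)
```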

ELU

The derivative of ELU converges to zero for large negative inputs, which improves learning efficiency. ELU can produce negative outputs, which helps the network push weights and biases in the right direction and also prevents dead neurons. However, its computation is more expensive, its performance is not necessarily better than ReLU, and it cannot avoid the exploding gradient problem.

The function is defined as $f(\alpha, x)=\begin{cases} \alpha(e^{x}-1) & x<0 \\ x & x\ge 0 \end{cases}$, and its derivative is $f'(\alpha, x)=\begin{cases} f(\alpha, x)+\alpha & x<0 \\ 1 & x\ge 0 \end{cases}$

[Figure: plot of the ELU function]
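A sketch of ELU and its derivative following the definition above (α = 1.0 is an assumed default for illustration):

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    return np.where(x >= 0, 1.0, elu(x, alpha) + alpha)   # f'(x) = f(x) + alpha for x < 0

x = np.array([-5.0, -1.0, 0.0, 2.0])
print(elu(x))                             # negative outputs saturate towards -alpha
print(elu_grad(x))                        # gradient shrinks for very negative inputs
```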

SELU

SELU is a variant of ELU. After passing through this activation function, the sample distribution is automatically normalized to zero mean and unit variance, so the gradient neither vanishes nor explodes.

The function is defined as $f(\alpha, x)=\lambda\begin{cases} \alpha(e^{x}-1) & x<0 \\ x & x\ge 0 \end{cases}$, and its derivative is $f'(\alpha, x)=\lambda\begin{cases} \alpha e^{x} & x<0 \\ 1 & x\ge 0 \end{cases}$, where $\lambda$ and $\alpha$ are fixed values (approximately 1.0507 and 1.6733, respectively).

[Figure: plot of the SELU function]
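A sketch of SELU using the fixed constants quoted above (rounded values, assumed here); with standard-normal inputs the outputs should stay close to zero mean and unit variance:

```python
import numpy as np

LAMBDA = 1.0507                           # fixed scale
ALPHA = 1.6733                            # fixed alpha

def selu(x):
    return LAMBDA * np.where(x >= 0, x, ALPHA * (np.exp(x) - 1.0))

x = np.random.default_rng(0).normal(size=100_000)    # standard-normal inputs
y = selu(x)
print(round(y.mean(), 3), round(y.std(), 3))          # close to 0 and 1
```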

Softsign

Softsign is another alternative to the tanh activation function. It is antisymmetric, zero-centered, and differentiable, and returns values between -1 and 1. Its flatter curve and more slowly decaying derivative suggest that it can learn more efficiently. However, computing its derivative is more cumbersome than for tanh.

The function is defined as $f(x)=\dfrac{x}{|x|+1}$, and its derivative is $f'(x)=\dfrac{1}{(1+|x|)^{2}}$

[Figure: plot of the softsign function]
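A minimal sketch of softsign and its derivative as defined above (illustrative only); the derivative decays polynomially rather than exponentially, which matches the flatter curve mentioned above:

```python
import numpy as np

def softsign(x):
    return x / (np.abs(x) + 1.0)

def softsign_grad(x):
    return 1.0 / (1.0 + np.abs(x)) ** 2

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(softsign(x))                        # values in (-1, 1), approaching the limits slowly
print(softsign_grad(x))
```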

Softplus

Softplus is a good alternative to ReLU and returns values greater than 0. Unlike ReLU, the derivative of softplus is continuous and non-zero everywhere, which prevents dead neurons. However, its derivative is often less than 1, so the vanishing gradient problem can still occur. Another difference from ReLU is its asymmetry: it is not centered at zero, which may hinder learning.

The function is defined as $f(x)=\ln(1+e^{x})$, and its derivative is $f'(x)=\dfrac{1}{1+e^{-x}}$

[Figure: plot of the softplus function]
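A sketch of softplus and its derivative (illustrative; the rewritten form in the code is a standard numerically stable equivalent of ln(1 + e^x), used here as an assumption about good practice):

```python
import numpy as np

def softplus(x):
    # max(x, 0) + log1p(exp(-|x|)) is a stable rewrite of ln(1 + e^x)
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def softplus_grad(x):
    return 1.0 / (1.0 + np.exp(-x))       # the sigmoid function

x = np.array([-20.0, -1.0, 0.0, 1.0, 20.0])
print(softplus(x))                        # always > 0, a smooth approximation of ReLU
print(softplus_grad(x))                   # always in (0, 1), so gradients can still shrink
```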

Softmax

The softmax function is generally used in multi-class classification problems. It is an extension of logistic regression and is also called the multinomial logistic regression model. Suppose we have a classification task with $k$ classes; the softmax function maps the input $x_i$ to the probability $y_i$ of class $i$, computed as $y_i=\mathrm{softmax}(x_i)=\dfrac{e^{x_i}}{\sum_{j=1}^{k}e^{x_j}}$. Clearly, $0<y_i<1$.
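A minimal sketch of the softmax computation above (illustrative only); subtracting the maximum before exponentiating is a common stability trick, assumed here, and does not change the result:

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)                     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

x = np.array([2.0, 1.0, 0.1])
p = softmax(x)
print(p, p.sum())                         # probabilities in (0, 1) that sum to 1
```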

Swish

For the Swish function, when x > 0 there is no vanishing gradient, and when x < 0 neurons do not die as they do with ReLU. Swish is differentiable everywhere and continuously smooth, and it is not a monotonic function. Using Swish can improve model performance, but it is computationally expensive.

The function is defined as $f(x)=x\cdot\sigma(x)$, and its derivative is $f'(x)=f(x)+\sigma(x)(1-f(x))$, where $\sigma(x)$ is the sigmoid function.

[Figure: plots of the Swish function]
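A sketch of Swish and its derivative following the formulas above (illustrative only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)

def swish_grad(x):
    f = swish(x)
    return f + sigmoid(x) * (1.0 - f)     # f'(x) = f(x) + sigma(x)(1 - f(x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))                           # small negative values for x < 0: non-monotonic
print(swish_grad(x))
```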

HSwish

Compared with Swish, hard swish reduces the amount of computation while retaining similar properties. However, its computational cost is still higher than that of ReLU.

The function is defined as $f(x)=x\dfrac{\mathrm{ReLU6}(x+3)}{6}=x\cdot\begin{cases} 1 & x\ge 3 \\ \frac{x}{6}+\frac{1}{2} & -3<x<3 \\ 0 & x\le -3 \end{cases}$, and its derivative is $f'(x)=\begin{cases} 1 & x\ge 3 \\ \frac{x}{3}+\frac{1}{2} & -3<x<3 \\ 0 & x\le -3 \end{cases}$

[Figure: plot of the hard swish function]
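A sketch of hard swish and its piecewise derivative as defined above (illustrative only; np.clip is used here to express ReLU6):

```python
import numpy as np

def hard_swish(x):
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0      # x * ReLU6(x + 3) / 6

def hard_swish_grad(x):
    return np.where(x >= 3, 1.0, np.where(x <= -3, 0.0, x / 3.0 + 0.5))

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(hard_swish(x))
print(hard_swish_grad(x))
```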

Choice of activation function

1. When using shallow networks as classifiers, sigmoid functions and their combinations usually work better.

2. Due to the vanishing gradient problem, it is sometimes necessary to avoid using the sigmoid and tanh functions.

3. The relu function is a general activation function and is currently used in most cases.

4. If dead neurons appear in the neural network, then the prelu function is the best choice.

5. The relu function can only be used in hidden layers.

6. Usually, you can start with the relu function. If the relu function does not provide optimal results, try other activation functions.

Questions related to activation functions

1. Why can ReLU be used for gradient-based learning even though it is not differentiable everywhere?

From a mathematical point of view, ReLU is not differentiable at 0 because its left and right derivatives are not equal; in implementations, however, either the left derivative or the right derivative is usually returned instead of reporting that the derivative does not exist, so the problem is avoided in practice.
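As a small illustration of this convention (an assumption about typical implementations, not a statement about any particular framework), the point x = 0 can simply be folded into one of the two branches:

```python
import numpy as np

def relu_grad(x):
    # Convention: define the "gradient" at exactly x == 0 as 0
    # (using 1 there instead is an equally valid choice).
    return (x > 0).astype(x.dtype)

print(relu_grad(np.array([-1.0, 0.0, 1.0])))   # [0. 0. 1.]
```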

2. Why does tanh converge faster than sigmoid?

$\tanh'(x)=1-\tanh(x)^{2}\in(0,1)$, while $\sigma'(x)=\sigma(x)(1-\sigma(x))\in\left(0,\frac{1}{4}\right]$. From these two expressions it can be seen that the vanishing gradient problem of tanh is less severe than that of sigmoid, so tanh converges faster than sigmoid.

3. What is the difference between sigmoid and softmax?

  • For binary classification, sigmoid and softmax are equivalent; both minimize the cross-entropy loss, but softmax can also be used for multi-class problems.

  • Softmax is an extension of sigmoid: when the number of classes k = 2, softmax regression degenerates into logistic regression (a small numerical check is given after this list).

  • Softmax models a multinomial distribution, while logistic regression is based on the Bernoulli distribution.

  • Multiple logistic regressions can also be stacked to achieve multi-class classification, but in softmax regression the classes are mutually exclusive, i.e., one input can be assigned to only one class; with multiple logistic regressions the output classes are not mutually exclusive, e.g., the word "apple" belongs both to the "fruit" class and to the "3C" class.
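As a quick numerical check of the k = 2 case mentioned above (illustrative code, not from the original post), the first softmax probability of two logits equals the sigmoid of their difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x1, x2 = 2.0, -0.5
p = softmax(np.array([x1, x2]))
print(p[0], sigmoid(x1 - x2))             # the two values coincide
```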


Original source: https://blog.csdn.net/weixin_49346755/article/details/127356457