Basic tricks for neural network training

In the previous article, I introduced how an optimizer updates the weight parameters so that the loss function steadily approaches its minimum, which is how the network learns. The basic idea behind the optimizer is gradient descent, and updating the weights requires computing the gradient of the loss function with respect to every weight. From the earlier introduction to neural networks, we know that the factors affecting a weight's gradient include the derivative of the loss function, the derivative of the activation function, the input value of the node, and the weight values themselves.

This article introduces the concepts of vanishing and exploding gradients, as well as some tricks to prevent them during neural network training.

Vanishing Gradients and Exploding Gradients

When training neural networks, especially deep ones, a major problem is that gradients vanish or explode, i.e. the derivative of the loss function with respect to the weights becomes extremely small or extremely large. The problem generally becomes more pronounced as the number of layers increases.

simulate vanishing gradient

As shown in the figure below, assume that each layer has only one neuron. Each layer can then be expressed by the following formula, where $\sigma$ is an activation function, taken here to be the sigmoid:
$$a_i = \sigma(z_i) = \sigma(w_i \cdot x_i), \quad \text{where } x_i = a_{i-1}$$
[Figure: a chain of layers, each with a single neuron]
Then you can derive the following formula:
$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a_4}\cdot\frac{\partial a_4}{\partial z_4}\cdot\frac{\partial z_4}{\partial a_3}\cdot\frac{\partial a_3}{\partial z_3}\cdot\frac{\partial z_3}{\partial a_2}\cdot\frac{\partial a_2}{\partial z_2}\cdot\frac{\partial z_2}{\partial a_1}\cdot\frac{\partial a_1}{\partial z_1}\cdot\frac{\partial z_1}{\partial w_1}$$
$$= \frac{\partial L}{\partial a_4}\cdot\sigma'(z_4)\cdot w_4\cdot\sigma'(z_3)\cdot w_3\cdot\sigma'(z_2)\cdot w_2\cdot\sigma'(z_1)\cdot x_1$$
The derivative of the sigmoid function is shown in the figure below:
[Figure: the derivative of the sigmoid function]
As can be seen, the maximum value of $\sigma'(x)$ is $\frac{1}{4}$. In typical training, the weights are initialized from a Gaussian distribution with mean 0 and standard deviation 1, so the initial weights are usually smaller than 1 in magnitude, which gives $|\sigma'(x) \cdot w| \le \frac{1}{4}$. In the chain of products above, the more layers there are, the smaller the product becomes, and the gradient eventually vanishes.
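
As a quick illustration (a minimal sketch, not from the original post; the layer width, depth values and the dummy scalar "loss" are arbitrary assumptions), stacking sigmoid layers and inspecting the gradient of the first layer shows how quickly the gradient shrinks with depth:

# Minimal sketch: the gradient reaching the first layer shrinks rapidly as sigmoid layers are stacked.
import torch
import torch.nn as nn

for depth in (2, 10, 30):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(64, 64), nn.Sigmoid()]
    net = nn.Sequential(*layers)

    x = torch.randn(16, 64)
    net(x).sum().backward()                       # dummy scalar "loss", just to backpropagate something

    first_grad = net[0].weight.grad
    print(depth, first_grad.abs().mean().item())  # average gradient magnitude of the first layer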

gradient explosion

When $\sigma'(x) \cdot w > 1$, i.e. when $w$ is relatively large, the gradients of the hidden layers close to the input layer become very large, which causes the exploding gradient problem.

normalized input

When training a neural network, one of the ways to speed up training is to normalize the input. Normalizing the input consists of the following two steps:

  • Zero mean: compute the mean of all samples, $\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$, then subtract it from every sample: $x^{(i)} = x^{(i)} - \mu$.
  • Normalize the variance: compute the variance $\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)})^2$, then divide every sample by the standard deviation: $x^{(i)} = \frac{x^{(i)}}{\sqrt{\sigma^2 + \epsilon}}$.

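A minimal sketch of these two steps in NumPy (the function name, shapes and the eps constant are my own choices for illustration):

# Minimal sketch of input normalization: subtract the mean, then divide by the standard deviation.
import numpy as np

def normalize_inputs(x, eps=1e-8):
    """x has shape (m, n_features); each feature is normalized across the m samples."""
    mu = x.mean(axis=0)                  # per-feature mean
    x = x - mu                           # zero mean
    sigma2 = (x ** 2).mean(axis=0)       # per-feature variance of the centered data
    return x / np.sqrt(sigma2 + eps)     # unit variance

x = np.random.randn(1000, 3) * 5.0 + 2.0
x_norm = normalize_inputs(x)
print(x_norm.mean(axis=0), x_norm.std(axis=0))   # approximately 0 and 1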

weight initialization

Why initialize weights

From the weight-gradient calculation above, we can see that the weight values themselves participate in the gradient computation. The purpose of weight initialization is therefore to keep the forward pass of a deep network from producing outputs that are too large or too small, and to prevent gradients from vanishing or exploding during backpropagation.

forward propagation

Taking a simple 100-layer neural network without activation functions as an example, use torch.randn to generate standard normal data for the input x and the weight matrices a, and run the forward pass. Each output element is a weighted sum of its inputs:
$$x = x_1 w_1 + x_2 w_2 + \dots + x_{100} w_{100}$$

>>> x = torch.randn(512)
>>> for i in range(100):
...     a = torch.randn(512, 512)
...     x = a @ x
...
>>> x.mean()
tensor(nan)
>>> x.std()
tensor(nan)

It can be seen that after 100 matrix multiplications, at some point the layer outputs become so large that they overflow the floating-point range, and the mean and standard deviation come out as nan. In other words, the activations explode, and so will the gradients.
Next, initialize the weights from a normal distribution with mean 0 and standard deviation 0.01 and run the forward pass again:

>>> x = torch.randn(512)
>>> for i in range(100):
...     a = torch.randn(512, 512) * 0.01
...     x = a @ x
...
>>> x.mean()
tensor(0.)
>>> x.std()
tensor(0.)

This time the layer outputs shrink to essentially 0. In other words, if the initial weights are too large or too small, the model cannot learn well.

Method of weight initialization

There are three common weight initialization methods to discuss: Xavier initialization, Kaiming He initialization, and initialization to 0.

Xavier initialization

Xavier initialization is a classic weight initialization method. It adaptively sets the initial weight values according to the input and output dimensions of the layer, drawing each layer's weights from a bounded uniform distribution:
$$W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}},\ \frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}}\right]$$
where $n_i$ is the number of input connections of the neuron and $n_{i+1}$ is the number of output connections. Xavier initialization keeps the variance of the activations and of the backpropagated gradients roughly constant as they propagate up or down through the layers of the network.
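
A sketch of the earlier 100-layer experiment with Xavier-style uniform initialization and a tanh activation (the symmetric activation Xavier assumes); this is an assumed illustration, not the author's code:

# Minimal sketch: Xavier uniform initialization keeps the activations at a usable, finite scale.
import math
import torch

x = torch.randn(512)
for _ in range(100):
    n_in, n_out = 512, 512
    bound = math.sqrt(6.0 / (n_in + n_out))              # Xavier uniform bound
    a = torch.empty(n_out, n_in).uniform_(-bound, bound)
    x = torch.tanh(a @ x)                                # tanh: roughly linear and symmetric around 0

print(x.mean(), x.std())   # finite values, neither nan nor all zeros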

Kaiming He initialization

Similar to Xavier initialization, He initialization also adaptively sets the initial weight values according to the input and output dimensions. The difference is that He initialization is designed for the ReLU activation function: Xavier initialization assumes an activation that is roughly linear and symmetric around 0, which ReLU does not satisfy.
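
A sketch of the same 100-layer experiment with ReLU and He initialization, which scales the weights by sqrt(2 / fan_in); again an assumed illustration:

# Minimal sketch: He (Kaiming) initialization compensates for ReLU zeroing out half of its inputs.
import math
import torch

x = torch.randn(512)
for _ in range(100):
    a = torch.randn(512, 512) * math.sqrt(2.0 / 512)     # He normal: std = sqrt(2 / fan_in)
    x = torch.relu(a @ x)

print(x.mean(), x.std())   # stays at a usable scale instead of overflowing or vanishing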

initialized to 0

Take a simple two-layer neural network as an example, and simulate the forward and backward passes with all weights initialized to 0.

Forward propagation:
$$z_1^{[1]} = w_{11}^{[1]} x_1 + w_{21}^{[1]} x_2 = 0; \quad a_1^{[1]} = \sigma(z_1^{[1]}) = 0.5$$
$$z_2^{[1]} = 0; \quad a_2^{[1]} = 0.5$$
$$z_3^{[1]} = 0; \quad a_3^{[1]} = 0.5$$
$$z_1^{[2]} = 0; \quad a_1^{[2]} = 0.5$$
With all weights initialized to 0, every neuron produces the same output. Next, simulate backpropagation:
$$\frac{\partial L}{\partial w_{11}^{[2]}} = \frac{\partial L}{\partial a_1^{[2]}}\cdot\frac{\partial a_1^{[2]}}{\partial z_1^{[2]}}\cdot a_1^{[1]} = \frac{\partial L}{\partial a_1^{[2]}}\cdot\sigma(z_1^{[2]})\left(1-\sigma(z_1^{[2]})\right)\cdot 0.5 = \frac{\partial L}{\partial a_1^{[2]}}\cdot(0.5)^3$$
$$\frac{\partial L}{\partial w_{21}^{[2]}} = \frac{\partial L}{\partial a_1^{[2]}}\cdot(0.5)^3$$
$$\frac{\partial L}{\partial w_{31}^{[2]}} = \frac{\partial L}{\partial a_1^{[2]}}\cdot(0.5)^3$$
The partial derivatives with respect to $w_{11}^{[1]}, w_{21}^{[1]}, w_{31}^{[1]}$, however, are all 0, because each of them contains a second-layer weight as a factor, and those weights are 0.
In summary, when all weights are initialized to 0, every neuron in a layer produces the same output in the first forward pass, and during backpropagation only the weights of the last layer receive non-zero updates, so the network cannot learn.
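
A small sketch of this symmetry problem (an assumed example; the layer sizes and the squared-error "loss" are arbitrary): with all weights at 0, the first-layer gradients are exactly 0 after one backward pass.

# Minimal sketch: with zero initialization, only the last layer receives non-zero gradients.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 3, bias=False), nn.Sigmoid(),
                    nn.Linear(3, 1, bias=False), nn.Sigmoid())
for p in net.parameters():
    nn.init.zeros_(p)                      # initialize every weight to 0

x = torch.randn(4, 2)
loss = ((net(x) - 1.0) ** 2).mean()        # arbitrary squared-error loss
loss.backward()

print(net[0].weight.grad)   # all zeros: the first layer never gets updated
print(net[2].weight.grad)   # non-zero: only the last layer learns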

learning rate decay

In mini-batch gradient descent, the gradient of the loss function is estimated from a small number of samples, so the randomness of the samples can make the gradient estimate fluctuate with high variance. Gradually reducing the learning rate helps the optimization settle down and converge to a minimum.

Learning Rate Decay Method

There are four common learning rate decay methods: exponential decay, fixed step decay, multi-step decay, and cosine annealing decay. The changing rules are shown in the figure below:
[Figure: learning rate curves of the four decay schedules]

exponential decay

Decaying the learning rate exponentially is a common strategy. In PyTorch it is used as follows:

ExpLR = torch.optim.lr_scheduler.ExponentialLR(optimizer_ExpLR, gamma=0.98)

The parameter gamma is the decay base: the learning rate is multiplied by gamma after every epoch, so different gamma values give different decay curves, as shown below:
[Figure: exponential decay curves for different gamma values]
The smaller the gamma value, the faster the learning rate decays.
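
A minimal usage sketch (the model and optimizer here are placeholders I have assumed): the scheduler multiplies the optimizer's learning rate by gamma on every call to step(), typically once per epoch. The same pattern applies to the other schedulers below.

# Minimal sketch: ExponentialLR multiplies the learning rate by gamma once per scheduler.step().
import torch

model = torch.nn.Linear(10, 1)                                   # placeholder model
optimizer_ExpLR = torch.optim.SGD(model.parameters(), lr=0.1)
ExpLR = torch.optim.lr_scheduler.ExponentialLR(optimizer_ExpLR, gamma=0.98)

for epoch in range(5):
    # ... forward pass, loss.backward() and optimizer_ExpLR.step() would go here ...
    ExpLR.step()
    print(epoch, ExpLR.get_last_lr())                            # lr = 0.1 * 0.98 ** (epoch + 1)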

Fixed Step Decay

That is, the learning rate is multiplied by gamma every fixed number of steps (or epochs).

StepLR = torch.optim.lr_scheduler.StepLR(optimizer_StepLR, step_size=step_size, gamma=0.65)

Here the gamma parameter specifies the decay factor, and step_size specifies how many steps (or epochs) pass between adjustments.
[Figure: fixed-step decay learning rate curve]

multi-step decay

Fixed-step decay updates the learning rate at one fixed interval. Multi-step decay is useful when we want different phases of training to use different update points.

torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[200, 300, 320, 340], gamma=0.8)

The milestones parameter lists the epochs at which the learning rate is decayed. In the example above, the learning rate is left unchanged on the interval [0, 200) and is then multiplied by gamma once at each subsequent milestone.
[Figure: multi-step decay learning rate curve]

cosine annealing

Strictly speaking, cosine annealing is not really a decay strategy, because it makes the learning rate vary periodically rather than decrease monotonically.

CosineLR = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer_CosineLR, T_max=150, eta_min=0)

Here T_max is the number of steps over which the learning rate goes from its initial value down to eta_min (half of a cosine period), and eta_min is the minimum learning rate.
[Figure: cosine annealing learning rate curve]
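
For reference, CosineAnnealingLR follows the cosine annealing schedule of the SGDR paper, where $\eta_{max}$ is the initial learning rate and $T_{cur}$ counts the steps taken since the start (or the last restart):
$$\eta_t = \eta_{min} + \frac{1}{2}\left(\eta_{max} - \eta_{min}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)$$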

BatchNorm

BatchNorm is a commonly used normalization technique for neural networks, designed to speed up training and improve model performance.

Why use BatchNorm

The role of BatchNorm is to make the input distribution of each layer more stable during training, thereby speeding up network training.

Internal Covariate Shift problem

In a deep neural network, the input distribution of each layer keeps changing as training progresses, because the parameters of the preceding layers are constantly being updated. This makes it hard for the network to settle into a stable mapping, and is the so-called "Internal Covariate Shift" problem.
The basic idea of BatchNorm is to fix the distribution of each hidden node's activation input $z$, thereby avoiding Internal Covariate Shift. During training, the distribution of the pre-activation values in a deep network gradually drifts toward the saturated ends of the nonlinearity's input range (taking sigmoid as an example), which causes the gradients of the lower layers to vanish during backpropagation and makes convergence slower and slower.
BN normalizes the pre-activation of each neuron back to a standard normal distribution with mean 0 and variance 1, so that the activation input $z$ falls in the region where the nonlinear activation function changes quickly. In this sensitive region a small change in the input produces a large change in the loss, and hence larger gradients, which avoids vanishing gradients and speeds up training.
At the same time, if every distribution is simply pulled back to the standard normal distribution, each layer of the network ends up computing essentially the same constrained transformation, the expressive power of the network drops, and depth loses its meaning. To solve this, BN applies a scale and shift to the normalized values; these two parameters, scale and shift, are learned during training.

What is BatchNorm

In mini-batch training, each batch contains m training instances. The BN operation proceeds as follows:

First, normalize each neuron's pre-activation to a standard distribution with mean 0 and variance 1. Then, to keep the network from losing expressive power, add two learnable adjustment parameters per neuron (scale and shift) to restore the network's fitting capacity.
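
Concretely, for a mini-batch $B = \{x_1, \dots, x_m\}$ the BN transform (as defined in the original BatchNorm paper) is:
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2$$
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\hat{x}_i + \beta$$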

training phase

In the training phase, BN standardizes each mini-batch of inputs and uses an exponentially weighted moving average to accumulate the mean and variance, approximating the statistics of the entire training set. Over the whole training set this produces (total number of samples / BatchSize) groups of batch means and variances.

import numpy as np

def bn_simple_for_train(x, gamma, beta, bn_params):
    '''
    x: input data with shape (batch_size, num_features)
    gamma: scale parameter
    beta: shift parameter
    bn_params: dict of extra parameters needed by batch norm
        running_mean: moving average of the mean, computed during training for use at test time
        running_var: moving average of the variance, computed during training for use at test time
        eps: small constant for numerical stability (default 1e-5)
        momentum: momentum of the moving averages (default 0.9)
    '''
    running_mean = bn_params["running_mean"]       # shape = [num_features]
    running_var = bn_params["running_var"]
    eps = bn_params.get("eps", 1e-5)
    momentum = bn_params.get("momentum", 0.9)

    x_mean = x.mean(axis=0)
    x_var = x.var(axis=0)
    # normalize
    x_normalized = (x - x_mean) / np.sqrt(x_var + eps)
    # scale and shift
    res = gamma * x_normalized + beta

    # update the moving averages
    running_mean = momentum * running_mean + (1 - momentum) * x_mean
    running_var = momentum * running_var + (1 - momentum) * x_var

    # store the updated statistics for the test phase
    bn_params["running_mean"] = running_mean
    bn_params["running_var"] = running_var

    return res, bn_params

inference stage

At inference time there may be only a single input sample, so the batch mean and variance cannot be computed; instead, the global statistics accumulated during training are used.

def bn_simple_for_test(x, gamma, beta, bn_params):
    '''
    x: input data with shape (batch_size, num_features)
    gamma: scale parameter
    beta: shift parameter
    bn_params: dict of extra parameters needed by batch norm
        running_mean: moving average of the mean, accumulated during training
        running_var: moving average of the variance, accumulated during training
        eps: small constant for numerical stability (default 1e-5)
    '''
    running_mean = bn_params["running_mean"]
    running_var = bn_params["running_var"]
    eps = bn_params.get("eps", 1e-5)

    # normalize with the global (running) statistics
    x_normalized = (x - running_mean) / np.sqrt(running_var + eps)
    # scale and shift
    res = gamma * x_normalized + beta

    return res, bn_params
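
A small usage sketch for the two functions above (the feature count, batch size and initial running statistics are assumptions for illustration):

# Assumed usage sketch: run training-mode BN on a batch, then reuse the running statistics at test time.
import numpy as np

gamma = np.ones(4)
beta = np.zeros(4)
bn_params = {"running_mean": np.zeros(4), "running_var": np.ones(4)}

x_train = np.random.randn(32, 4)                      # a batch of 32 samples with 4 features
out, bn_params = bn_simple_for_train(x_train, gamma, beta, bn_params)
print(out.mean(axis=0), out.std(axis=0))              # roughly 0 and 1 per feature

x_test = np.random.randn(1, 4)                        # a single test sample
out_test, _ = bn_simple_for_test(x_test, gamma, beta, bn_params)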

activation function

In machine learning and deep learning, activation functions are used to introduce nonlinearity into neural networks, enabling the network to model complex, nonlinear relationships between its inputs and outputs. An activation function should have the following properties:

  1. Non-linearity: The activation function must be non-linear so that there is a non-linear relationship between the input and output data;
  2. Continuity: The activation function should be continuous and differentiable, because the derivative of the activation function needs to be calculated when the gradient is updated;
  3. High computational efficiency: computationally intensive activation functions can significantly slow down the training process;

Why do we need an activation function

If no activation function is used (equivalently, if the activation is the identity $f(x) = x$), the input of each node is a linear function of the previous layer's output. No matter how many layers the network has, the output is then just a linear combination of the inputs, so it cannot model nonlinear relationships and the network's expressive power is limited. Introducing a nonlinear activation function makes the network far more expressive and lets it approximate essentially arbitrary functions. In addition, the activation function can give the network some robustness to noise, helping it deal with noisy data.
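
A quick sketch of this point (an assumed example): two stacked linear layers without an activation in between are exactly equivalent to one linear layer whose weight matrix is the product of the two.

# Minimal sketch: without a nonlinearity, stacked linear layers collapse into a single linear map.
import torch

x = torch.randn(8, 16)
w1 = torch.randn(32, 16)
w2 = torch.randn(4, 32)

two_layers = x @ w1.T @ w2.T         # a "two-layer network" with no activation
one_layer = x @ (w2 @ w1).T          # a single equivalent linear layer

print(torch.allclose(two_layers, one_layer, atol=1e-5))   # True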

The difference and characteristics of different activation functions

Common activation functions include sigmoid function, tanh function, ReLU function, etc.

  1. sigmoid function

Sigmoid is a commonly used nonlinear activation function, the mathematical form is as follows:
$$f(x) = \frac{1}{1 + e^{-x}}$$
The geometric image is as follows:
[Figure: the sigmoid function]
Its derivative is:
$$f'(x) = \sigma(x)\left(1 - \sigma(x)\right)$$
Its graph is shown below:
[Figure: the derivative of the sigmoid function]
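
A small sketch verifying the derivative identity and its maximum value of 1/4 (an illustrative example, not from the original post):

# Minimal sketch: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), with a maximum of 0.25 at x = 0.
import torch

x = torch.linspace(-6.0, 6.0, steps=1001, requires_grad=True)
torch.sigmoid(x).sum().backward()                      # autograd derivative of the sigmoid

manual = torch.sigmoid(x) * (1 - torch.sigmoid(x))     # closed-form derivative
print(torch.allclose(x.grad, manual))                  # True
print(x.grad.max())                                    # ~0.25, reached at x = 0
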
Advantages:

  • The output range of the sigmoid function is between [0,1], and the output can be limited to a specific range;
  • The derivative of the sigmoid function can be expressed by itself, which is convenient for calculation;
  • The sigmoid function has smooth continuity, which can guarantee the continuity of the neural network;

shortcoming:

  • The gradient of the sigmoid function is close to 0 at large or small input values, which will cause the problem of gradient disappearance, making the neural network unable to continue learning;
  • The output of the sigmoid function is not zero-centered, so during backpropagation the gradient directions tend to be biased to one side rather than balanced around zero, which can make training unstable;
  • The calculation of the sigmoid function is slower than the ReLU and tanh functions because exponential operations are required;
  2. tanh function

Tanh is a commonly used activation function with an output range of [-1, 1]. Like the sigmoid, tanh is an S-shaped function, but its output range is wider. Its mathematical expression is:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
The geometric image looks like this:
[Figure: the tanh function]
The derivative of the tanh function is:
$$\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$$
Its graph is shown below:
[Figure: the derivative of the tanh function]
Advantages:

  • The output range is [-1, 1], which can normalize the data to a smaller range, which is helpful for the stability and convergence speed of the model;
  • Its output is zero-centered, which helps the training and optimization of the model;

shortcoming:

  • It is prone to the problem of gradient disappearance;
  • It is computationally expensive; the tanh function costs roughly twice as much to compute as the sigmoid function, which affects training speed;
  • For large or small input data, the tanh function will be saturated, and there is also a gradient saturation problem, which affects the training efficiency of the model;
  3. ReLU function

The ReLU activation function is a commonly used nonlinear activation function, and its expression is as follows:
$$\mathrm{relu}(x) = \max(0, x)$$
Its graph is shown below:
[Figure: the ReLU function]
Advantages:

  • fast calculation speed;
  • It alleviates the vanishing gradient problem;
  • The convergence speed is fast, because the derivative of the ReLU activation function is 0 or 1, and there is no smooth transition area, so it is easier to converge during training;

shortcoming:

  • Some neurons may be "dead", that is, the output is always 0, which makes these neurons unable to participate in the training of the model;
  • Data preprocessing is required to normalize or standardize the input data to avoid negative input;
  4. Leaky ReLU (2013)

The function of the Leaky ReLU activation function is defined as:
$$\mathrm{leaky\_relu}(x) = \max(x, \alpha x)$$
where α is a very small positive constant, usually set to 0.01. Its geometric image is shown in the figure below:
[Figure: the Leaky ReLU function]
Advantages:

  • Solved the problem of possible "death" of neurons;
  • High computational efficiency;
  5. Softplus (2010)

The Softplus activation function is defined as follows:
$$\mathrm{softplus}(x) = \ln(1 + e^{x})$$
Its geometric image is shown in the figure below:
[Figure: the Softplus function]
The softplus activation function is a smooth, everywhere-differentiable approximation of the ReLU function. Compared with ReLU, it has the following characteristics.
advantage:

  • Continuity: The softplus function is a function with a continuous first-order derivative, which facilitates gradient updates;
  • Strictly positive output: unlike ReLU, whose output is either 0 or positive, the softplus function always produces a positive output;
  • Non-zero gradient: the softplus function has a non-zero gradient for negative input values, which can prevent dying ReLU problems;

shortcoming:

  • The softplus output value is not centered on 0;
  • Its derivative is often less than 1, and there may be a problem of gradient disappearance;
  6. Swish (2017)

The Swish activation function is defined as follows:
$$\mathrm{Swish}(x) = x \cdot \sigma(\beta x)$$
where $\beta$ is a learnable parameter and $\sigma$ is the sigmoid function.

  7. SiLU (2017)

The SiLU (Sigmoid Linear Unit) activation function combines the sigmoid with a linear factor; it is equivalent to the Swish function with $\beta = 1$ and is defined as follows:
$$\mathrm{SiLU}(x) = \frac{x}{1 + e^{-x}} = x \cdot \sigma(x)$$
Its geometric image is as follows:

[Figure: the SiLU function]
advantage:

  • smooth gradients: The SiLU function has smooth derivatives, which can avoid the problem of gradient disappearance;
  • Non-monotonicity: SiLU functions are non-monotonic, having both positive and negative values, allowing SiLU functions to capture complex patterns in input data;
  • High computational efficiency;
  8. HardSwish (2019)

The HardSwish activation function is a piecewise linear activation function, which is a simplified version of the Swish function. It is defined as follows:
$$\mathrm{hardswish}(x) = \begin{cases} 0, & \text{if } x \le -3 \\ x, & \text{if } x \ge 3 \\ \frac{x(x+3)}{6}, & \text{otherwise} \end{cases}$$
Its graph is shown below:
[Figure: the HardSwish function]
The HardSwish activation function is an improvement on Swish. Swish can improve the accuracy of a neural network to some extent, but it is poorly suited to embedded and mobile devices, because the S-shaped sigmoid is computationally expensive there and its derivative is more complicated.
advantage:

  • fast calculation speed;
  • Compared with ReLU, it can effectively alleviate the problem of gradient disappearance;

shortcoming:

  • Low degree of nonlinearity: Compared with Swish and Mish activation functions, HardSwish has a relatively low degree of nonlinearity and may not be able to fully exploit the nonlinear features in the data;
  9. Mish (2019)

The Mish activation function has four notable properties: it is unbounded above, bounded below, smooth, and non-monotonic. Its mathematical expression is:
$$f(x) = x \cdot \tanh(\mathrm{softplus}(x)) = x \cdot \tanh(\ln(1 + e^{x}))$$
Its graph is shown below:
[Figure: the Mish function]
Features:

  • Non-monotonicity (nonlinearity): from its derivative one can see that Mish increases on some intervals and decreases on others, which helps it capture complex patterns in the input data;
  • Continuity: the Mish activation function is smooth and differentiable everywhere, which makes gradient updates well behaved;
  • Unbounded above, bounded below: having no upper bound avoids saturation, while the lower bound helps achieve a strong regularization effect; a small evaluation sketch for several of these activations follows below.
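
Most of the activation functions above are available in torch.nn.functional in reasonably recent PyTorch versions; a small sketch evaluating several of them at the same inputs:

# Minimal sketch: evaluating several of the activations discussed above at a few sample points.
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])

print(torch.sigmoid(x))
print(torch.tanh(x))
print(F.relu(x))
print(F.leaky_relu(x, negative_slope=0.01))
print(F.softplus(x))
print(F.silu(x))        # x * sigmoid(x), i.e. Swish with beta = 1
print(F.hardswish(x))
print(F.mish(x))        # x * tanh(softplus(x)); needs a newer PyTorch version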

How to choose an appropriate activation function

The following points can be considered in choosing an appropriate activation function:

  • Prefer activation functions with zero-centered output, which speeds up the convergence of the model;
  • Consider the problem of gradient disappearance, and choose an activation function that is not prone to gradient disappearance, such as ReLU, Leaky ReLU, etc.;
  • Consider computational efficiency;

Origin blog.csdn.net/hello_dear_you/article/details/129432976