Activation Function and Ten Common Activation Functions

Table of contents

1 The concept and function of activation function

1.1 The concept of activation function

1.2 The role of the activation function

1.3 An intuitive understanding of the activation function (with illustrations)

1.3.1 Neural network without activation function

1.3.2 Neural Networks with Activation Functions

2 Neural Network Gradient Disappearance and Gradient Explosion

2.1 Introduction to gradient disappearance and gradient explosion

2.2 Gradient instability problem 

2.3 The root cause of gradient disappearance

2.4 The root cause of gradient explosion

2.5 When the activation function is sigmoid, which is more likely to occur, gradient disappearance or gradient explosion

2.6 How to solve gradient disappearance and gradient explosion

3 Comparison of activation functions

3.1 Sigmoid

3.1.1 Formula

3.1.2 Under what circumstances is it suitable to use the Sigmoid activation function?

3.1.3 Disadvantages

3.2 tanh function

3.3 ReLU

3.4 Leaky ReLU

3.5 ELU function

3.6 PReLU

3.7 Softmax

3.8 Swish

3.9 Maxout

3.10 Softplus

4 How to choose the appropriate activation function in the application


1 The concept and function of activation function

1.1 The concept of activation function

    An Activation Function is a function added to an artificial neural network to help the network learn complex patterns in data. Similar to neuron-based models in the human brain, the activation function ultimately determines what gets fired to the next neuron.

    In an artificial neural network, a node's activation function defines the node's output given an input or set of inputs. A standard computer chip circuit can be thought of as a digital circuit activation function that produces an output that is on (1) or off (0) depending on the input. Thus, an activation function is a mathematical equation that determines the output of a neural network.

    First of all, let's take a look at the working principle of artificial neurons, as follows:

    A mathematical visualization of the above process is shown in the figure below:
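    As a minimal sketch of this working principle (the specific weights, bias, and the choice of sigmoid below are illustrative assumptions, not taken from the figure):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    """One artificial neuron: weighted sum of inputs plus bias, passed through an activation."""
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # input stimuli
w = np.array([0.4, 0.7, -0.2])   # connection weights
b = 0.1                          # excitation threshold (bias)
print(neuron(x, w, b))           # the value passed on to the next neuron
```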

1.2 The role of the activation function

    The role of the activation function in a neural network is usually explained as follows: without an activation function, each layer of the network performs only a linear transformation, and stacking multiple layers still yields a linear transformation. Because the expressive power of a linear model is usually insufficient, this is where the activation function comes in: it introduces nonlinear factors. The question, then, is how does the activation function introduce nonlinearity?

1.3 An intuitive understanding of the activation function (with illustrations)

    To explain how activation functions introduce non-linearities, let's take the example of a neural network partitioning a planar space.

1.3.1 Neural network without activation function

    The simplest neural network structure is a single-output, single-layer perceptron. A single-layer perceptron has only an input layer and an output layer, representing the neural receptors and the nerve center respectively. The figure below shows a simple single-layer perceptron with two input units and one output unit. In the figure, x1 and x2 represent the stimuli arriving at the input neurons, w1 and w2 represent the strength of the connections between the input neurons and the output neuron, b is the excitation threshold of the output neuron, and y is the output of the output neuron. We use this single-layer perceptron to draw a line that separates the plane, as shown:

    In the same way, we can also combine multiple perceptrons (note, not multi-layer perceptrons) to obtain stronger plane classification capabilities, as shown in the figure:

    Take another look at the case of a multi-layer perceptron with a hidden layer, as shown in the figure: 

    By comparison, it can be seen that the outputs of the three neural networks above, all without activation functions, are linear equations: each is trying to approximate a curve using combinations of linear functions.
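    To see concretely why these outputs stay linear, here is a small sketch (the shapes and random weights are arbitrary assumptions) showing that two stacked layers without an activation function collapse into a single linear transformation:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)   # hidden layer, no activation
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)   # output layer, no activation

x = rng.normal(size=2)
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping rewritten as a single linear layer:
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True: the composition is still linear
```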

1.3.2 Neural Networks with Activation Functions

    Now let us apply a nonlinear activation function to the result of each layer's linear transformation. The effect is clear: the output immediately becomes a genuinely nonlinear function.

    In the case of expanding to a multi-layer neural network, with the same structure as just now, after adding a nonlinear activation function, the output becomes a complex nonlinear function, as shown in the figure:

    Summary: after adding a nonlinear activation function, the neural network can learn a smooth curve to divide the plane, instead of approximating such a curve with a complicated combination of linear pieces. This makes the network more expressive and better able to fit the objective function. That is why we need nonlinear activation functions. The figures below illustrate the difference: the upper figure shows a linear combination used to approximate a smooth dividing curve, and the lower figure shows a smooth curve dividing the plane directly:

2 Neural Network Gradient Disappearance and Gradient Explosion

2.1 Introduction to gradient disappearance and gradient explosion

    Neural network models with many layers suffer from the gradient disappearance problem and the gradient explosion problem during training, and both problems generally become more pronounced as the number of layers increases.

    For example, for a neural network with three hidden layers as shown below:

    → When the gradient disappearance problem occurs, the weights of hidden layer 3, near the output layer, update relatively normally, but the weights of hidden layer 1, near the input layer, update very slowly, so that they hardly change and remain close to their initial values. Hidden layer 1 then acts as little more than a fixed mapping applied to all inputs, and learning in this deep network is effectively restricted to the last few layers.

    → The gradient explosion situation is the opposite: when the initial weights are too large, the weights of hidden layer 1, near the input layer, change faster than those of hidden layer 3, near the output layer, which causes the gradient explosion problem.

2.2 Gradient instability problem 

    Gradients in deep neural networks are unstable, either disappearing or exploding in hidden layers close to the input layer. This instability is the fundamental problem of gradient-based learning in deep neural networks.

    The reason for gradient instability: the gradient at an earlier layer is the product of gradient terms from the later layers. When there are too many layers, this product becomes unstable, producing scenarios such as gradient vanishing and gradient explosion.

2.3 The root cause of gradient disappearance

    Let's take the backpropagation in Figure 2 as an example, assuming that each layer has only one neuron, where $\sigma$ is the sigmoid function and $C$ is the cost function. The output of one layer and the input of the next are related as in Equation 1:

$$a_i = \sigma(z_i), \qquad z_{i+1} = w_{i+1}\,a_i + b_{i+1}$$

    Applying the chain rule, we can derive Equation 2:

$$\frac{\partial C}{\partial b_1} = \sigma'(z_1)\,w_2\,\sigma'(z_2)\,w_3\,\sigma'(z_3)\,w_4\,\sigma'(z_4)\,\frac{\partial C}{\partial a_4}$$

    The derivative of the sigmoid function, $\sigma'(x)$, is shown in the right figure below.

    It can be seen that the maximum value of $\sigma'(x)$ is 1/4. We generally initialize the network weights with the standard method, i.e. a Gaussian distribution with mean 0 and standard deviation 1, so the initialized weights usually have magnitude less than 1 and therefore $\left|\sigma'(z)\,w\right| \leq \frac{1}{4}$. In the chain product of Equation 2, the more layers there are, the smaller the result, which eventually leads to the disappearance of the gradient.

    For the figure above, $\frac{\partial C}{\partial b_1}$ and $\frac{\partial C}{\partial b_3}$ share common derivative factors. It can be seen that the gradient of an earlier network layer is smaller than that of a later layer, so its weights change more slowly, which causes the gradient disappearance problem.
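    A tiny numerical sketch of this effect (the Gaussian initialization mirrors the standard scheme described above; the depth and random seed are arbitrary assumptions):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)                    # maximum value 0.25 at z = 0

rng = np.random.default_rng(42)
n_layers = 10
w = rng.normal(0.0, 1.0, n_layers)          # weights drawn from N(0, 1)
z = rng.normal(0.0, 1.0, n_layers)          # illustrative pre-activations

factors = np.abs(sigmoid_prime(z) * w)      # each factor is at most 0.25 * |w|
print(factors.max(), np.prod(factors))      # the product shrinks rapidly with depth
```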

2.4 The root cause of gradient explosion

    When $\left|\sigma'(z)\,w\right| > 1$, i.e. when the weights w are relatively large, the gradient of an earlier network layer changes faster than that of a later layer, causing the gradient explosion problem.

2.5 When the activation function is sigmoid, which is more likely to occur, gradient disappearance or gradient explosion

    Conclusion: with the sigmoid activation function, the gradient explosion problem occurs only rarely; gradient disappearance is far more likely.

    Quantitative analysis of the range of x over which the gradient can explode: since the maximum of the sigmoid derivative is 0.25, $\left|\sigma'(z)\,w\right| > 1$ can only occur when $\left|w\right| > 4$. Solving $\left|\sigma'(wx + b)\,w\right| > 1$ for x shows that the admissible range of x is very narrow; gradient explosion can only occur inside the interval given by Equation 3. As shown in Figure 5, this range of x is very small; its maximum width is only about 0.45, reached when $\left|w\right| \approx 6.9$. Therefore, the exploding gradient problem occurs only in this narrow range.
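    As a sketch of where the 0.45 and 6.9 figures come from (assuming the standard identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ and treating the bias b as free; Equation 3 itself is not reproduced in this text), the condition can be solved explicitly:

```latex
% Gradient explosion condition for a single sigmoid unit with z = wx + b:
%   |w|\,\sigma'(z) > 1, \quad \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr) \le 1/4 .
\[
  \sigma(z)\bigl(1 - \sigma(z)\bigr) > \frac{1}{|w|}
  \;\Longleftrightarrow\;
  \frac{1 - \sqrt{1 - 4/|w|}}{2} \;<\; \sigma(z) \;<\; \frac{1 + \sqrt{1 - 4/|w|}}{2},
  \qquad |w| > 4 .
\]
% Mapping back through z = \ln\bigl(\sigma/(1-\sigma)\bigr) and dividing by |w|
% gives the width of the admissible interval of x:
\[
  \Delta x(|w|) \;=\; \frac{2}{|w|}\,
  \ln\frac{1 + \sqrt{1 - 4/|w|}}{1 - \sqrt{1 - 4/|w|}},
  \qquad
  \max_{|w| > 4} \Delta x(|w|) \approx 0.45
  \ \text{at}\ |w| \approx 6.9 .
\]
```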

2.6 How to solve gradient disappearance and gradient explosion

    Both gradient disappearance and gradient explosion problems are caused by the network being too deep and the update of network weights unstable, which is essentially due to the multiplication effect in gradient backpropagation. For the more general gradient disappearance problem, the following three solutions can be considered:

1. Replace the sigmoid function with ReLU, Leaky ReLU, PReLU, RReLU, Maxout, etc.

2. Use Batch Normalization.

3. The structural design of LSTM can also alleviate the gradient disappearance problem in RNNs.

3 Comparison of activation functions

    An activation function is said to be saturated if its derivative tends to zero as the input tends to positive or negative infinity; conversely, a function that does not satisfy this condition is called a non-saturated activation function.

    Sigmoid and tanh are saturated activation functions, while ReLU and its variants are non-saturated activation functions. Using a non-saturated activation function has two advantages: (1) it can solve the so-called gradient disappearance problem, and (2) it can speed up convergence.

    → The Sigmoid function compresses a real-valued input into the range (0, 1): σ(x) = 1 / (1 + exp(−x))

    → The tanh function compresses a real-valued input into the range (−1, 1): tanh(x) = 2σ(2x) − 1

    Since the use of the sigmoid activation function will cause the problem of gradient disappearance and gradient explosion of the neural network, many people have proposed some improved activation functions, such as: tanh, ReLU, LeakyReLU, PReLU, RReLU, ELU, Maxout.

3.1 Sigmoid

3.1.1 Formula

    Sigmoid (S-shaped) is a commonly used nonlinear activation function. Its mathematical formula is:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

    Features: it transforms a continuous real-valued input into an output between 0 and 1. In particular, if the input is a very large negative number, the output approaches 0; if it is a very large positive number, the output approaches 1.

    The Sigmoid function was the most frequently used activation function in the early days of deep learning. It is a smooth function that is easy to differentiate.
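    A minimal sketch of the function and its derivative (function names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # maximum 0.25, reached at x = 0

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))                     # squashed into (0, 1), never zero-centered
print(sigmoid_grad(x))                # nearly 0 at both tails -> gradient disappearance
```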

3.1.2 Under what circumstances is it suitable to use the Sigmoid activation function?

1. The output range of the Sigmoid function is 0 to 1. Since the output values are bounded between 0 and 1, it normalizes the output of each neuron.

2. For models that take predicted probabilities as output. Since the probability ranges from 0 to 1, the sigmoid function is a good fit.

3. The gradient is smooth to avoid jumping output values.

4. The function is differentiable, which means the slope of the Sigmoid curve can be found at any point.

5. Clear predictions, i.e. outputs very close to 1 or 0.

3.1.3 Disadvantages

1. It is prone to gradient disappearance.

2. Function output is not zero-centered.

3. The exponentiation operation is relatively time-consuming.

(1) Gradient disappears

    Neural networks are optimized by backpropagation, i.e. the backward propagation of derivatives: the loss is first computed at the output layer, then passed backward to the earlier layers of the network in the form of derivatives, and the corresponding parameters are modified to reduce the loss. In a deep network, the Sigmoid function often causes the derivative to shrink gradually toward 0, so that the parameters cannot be updated and the network cannot be optimized, for two reasons:

    1. It is easy to see in the figure above that when x is very large or very small, the derivative of $\sigma(x)$ is close to 0. The mathematical basis of backward propagation is the chain rule of calculus: the gradient at the current layer is a product involving the derivatives of the later layers, and multiplying several small numbers together gives a result very close to 0.

    2. The maximum value of the Sigmoid derivative is 0.25, which means the derivative is compressed to at most 1/4 of its value at each layer: at most 1/16 after two layers, ..., and about 1/1048576 after 10 layers. Note the "at most" here; it is rare for the derivative to actually reach its maximum value.

(2) The output is not zero-centered

    The output of the Sigmoid function is always greater than 0, which slows down the convergence of model training. For example, consider $\sigma\!\left(\sum_i w_i x_i + b\right)$: if all the $x_i$ are positive (or all negative), then the derivatives with respect to all the $w_i$ always share the same sign, which leads to the zig-zag ("stepwise") update path shown by the red arrow in the figure below. This is obviously not a good optimization path. Deep learning often requires a lot of time to process large amounts of data, and the convergence speed of the model is particularly important, so in general try to use zero-centered data (which can be achieved through data preprocessing) and zero-centered outputs when training deep networks.

    One consequence of the output not being zero-centered is that if the data entering a neuron is always positive, then the gradients computed for its weights w all have the same sign. Of course, when training in batches, different examples in a batch may provide different signals, so this problem is somewhat alleviated. Therefore, although the non-zero mean has some undesirable effects, it is still much less serious than the "killed gradients" problem described above.

(3) Exponentiation is relatively time-consuming

    Compared with the previous two items, this is actually not a big problem. We have the computing power, but given the huge amount of computation in deep learning, it is best to save where we can. We will see later that the ReLU function only requires thresholding, which is much faster than exponentiation.

3.2 tanh function

    The graph of the tanh activation function is also S-shaped, and its expression is:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

    tanh stands for hyperbolic tangent. As shown in the figure above, it solves the problem of non-zero-centered output, but the problems of gradient disappearance and costly exponentiation remain.

    tanh is a hyperbolic tangent function. The curves of the tanh function and the sigmoid function are relatively similar, but it has some advantages over the sigmoid function. 


    1. First, as with sigmoid, when the input is very large or very small the output is almost flat and the gradient is small, which is not conducive to weight updates. The difference between the two lies in the output interval: tanh outputs values in (−1, 1) and the whole function is centered on 0, which makes it better than the sigmoid function.

    2. In a tanh graph, negative inputs will be strongly mapped as negative, and zero inputs will be mapped as close to zero.

    Note: In general binary classification problems, the tanh function is used for the hidden layer and the sigmoid function is used for the output layer, but neither of these is fixed and needs to be tuned to the specific problem. 
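    A small sketch (using NumPy's built-in np.tanh) that also checks the identity tanh(x) = 2σ(2x) − 1 quoted earlier:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
print(np.tanh(x))                                        # zero-centered outputs in (-1, 1)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))   # True: the identity holds
print(1 - np.tanh(x) ** 2)                               # derivative still vanishes at the tails
```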

3.3 ReLU

    In recent years, the ReLU function has become more and more popular. Its full name is Rectified Linear Unit. ReLU is a piecewise-linear, non-saturating activation function popularized by Krizhevsky, Hinton et al. in the 2012 paper "ImageNet Classification with Deep Convolutional Neural Networks". Its mathematical expression is:

$$f(x) = \max(0, x)$$

    The ReLU function is simply a maximum function. Note that it is not differentiable everywhere (at x = 0), but we can take a sub-gradient, as shown in the figure above. Although ReLU is simple, it is an important achievement of recent years. It has the following advantages:

1. Solved the gradient vanishing problem (on the positive interval).

2. Both the Sigmoid and tanh activation functions require computing exponentials, which is relatively expensive, while ReLU only needs a threshold comparison to obtain the activation value: it simply checks whether the input is greater than 0, so its computation is much faster than Sigmoid and tanh.

3. The convergence speed is much faster than sigmoid and tanh.

4. The non-saturation of ReLU can effectively solve the problem of gradient disappearance and provide a relatively wide activation boundary.

5. The one-sided suppression of ReLU gives the network the ability to form sparse representations.

    ReLU also has several issues that require special attention:

1. The output of the ReLU function is 0 or a positive number, not zero-centered.

2. Dead ReLU Problem: some neurons may never be activated, so their corresponding parameters are never updated. Because f(x) = max(0, x), a negative input yields an output of 0 and a gradient of 0 when it passes through the ReLU unit; if a neuron's input stays negative, the gradient flowing through that neuron is always 0 and it is no longer affected by any data. During forward propagation this is not in itself a problem (some regions are simply sensitive and others are not), but during backpropagation a negative input gives a gradient of exactly 0 (the sigmoid and tanh functions suffer from an analogous saturation problem). There are two main causes: (1) very unlucky parameter initialization, which is relatively rare; and (2) a learning rate set too high, so that parameter updates during training are too large and more than a certain proportion of neurons die irreversibly, after which their parameter gradients can no longer be updated and the whole training process can fail. The solution is to use the Xavier initialization method, and to avoid setting the learning rate too high or to use an algorithm that adjusts the learning rate automatically, such as Adagrad.

    Despite these two problems, ReLU is still the most commonly used activation function, and it is recommended to try it first when building an artificial neural network!
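    A minimal sketch of ReLU and its sub-gradient (taking the gradient at exactly 0 to be 0 is a common but arbitrary convention, assumed here):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # pure thresholding: no exponentials needed

def relu_grad(x):
    return (x > 0).astype(float)       # sub-gradient: 1 for x > 0, 0 otherwise

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))        # negative inputs are clipped to 0 (one-sided suppression, sparsity)
print(relu_grad(x))   # gradient 1 on the positive side, 0 elsewhere (the Dead ReLU risk)
```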

3.4 Leaky ReLU

    In order to solve the Dead ReLU Problem, it was proposed to give the negative part of ReLU a small slope of 0.01 instead of 0, i.e. f(x) = max(0.01x, x), so that the output is no longer 0 in the negative interval.

    Why is Leaky ReLU better than ReLU?

1. Leaky ReLU addresses the zero-gradient problem for negative values by giving negative inputs a very small linear component of x (0.01x).

2. The leak expands the range of the ReLU function; the value of a is usually around 0.01.

3. The output range of Leaky ReLU is (negative infinity, positive infinity).

    But on the other hand, the selection of the value of a increases the difficulty of the problem, requiring strong manual experience or repeated training to determine the appropriate parameter value.

    Note: In theory, Leaky ReLU has all the advantages of ReLU and does not suffer from the Dead ReLU problem, but in practice it has not been conclusively shown that Leaky ReLU is always better than ReLU.
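    A sketch with the commonly quoted slope a = 0.01 (a tunable hyperparameter, as noted above):

```python
import numpy as np

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)   # a small linear slope instead of 0 on the negative side

def leaky_relu_grad(x, a=0.01):
    return np.where(x > 0, 1.0, a)     # the gradient is never exactly 0, so neurons cannot "die"

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(leaky_relu(x), leaky_relu_grad(x))
```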

3.5 ELU function

    ELU (Exponential Linear Unit) was also proposed to address the problems of ReLU. It is defined as f(x) = x for x > 0 and f(x) = α(exp(x) − 1) for x ≤ 0. Compared to ReLU, ELU has negative values, which pushes the mean of the activations closer to zero. Mean activations close to zero enable faster learning because they bring the gradient closer to the natural gradient.

    Obviously, ELU has all the advantages of ReLU, and:

1. There is no Dead ReLU problem, and the average value of the output is close to 0, with 0 as the center.

2. By reducing the effect of the bias shift, ELU brings the normal gradient closer to the unit natural gradient, so that the mean activation approaches 0 and learning is accelerated.

3. ELU will saturate to a negative value under a small input, thereby reducing the variation and information of forward propagation.

    A slight problem is that it is more computationally intensive. Similar to Leaky ReLU, although theoretically better than ReLU, there is currently no sufficient evidence in practice that ELU is always better than ReLU. 
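    A sketch of ELU with α = 1 (the usual default, assumed here), showing the smooth negative saturation and the mean activation sitting closer to zero than ReLU's:

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))   # saturates to -alpha for very negative x

x = np.linspace(-5, 5, 11)
print(elu(x))                                              # negative values allowed, bounded below by -alpha

z = np.random.default_rng(0).standard_normal(100_000)
print(np.maximum(0, z).mean(), elu(z).mean())              # ELU's mean activation is closer to 0 than ReLU's
```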

3.6 PReLU

    PReLU is also an improved version of ReLU. The idea is a parametric method, namely Parametric ReLU: $f(x) = \max(\alpha x, x)$, where $\alpha$ can be learned by backpropagation.

    Kaiming He's paper "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" pointed out that $\alpha$ can not only be trained, but that learning it also gives better results.

    Take a look at the formula for PReLU: $f(x_i) = x_i$ if $x_i > 0$, and $f(x_i) = a_i x_i$ otherwise. The parameter $a_i$ is usually a number between 0 and 1, and is usually relatively small.

1. If a_i = 0, then f becomes ReLU.

2. If a_i > 0, then f becomes leaky ReLU.

3. If a_i is a learnable parameter, then f becomes PReLU. 

    The advantages of PReLU are as follows:

1. In the negative value range, the slope of PReLU is smaller, which can also avoid the Dead ReLU problem.

2. Compared with ELU, PReLU is a linear operation in the negative range. Although the slope is small, it will not tend to 0.
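    A framework-free sketch of PReLU with a learnable slope, including the gradient with respect to $a$ that backpropagation would use (names and shapes are my own assumptions):

```python
import numpy as np

def prelu(x, a):
    return np.where(x > 0, x, a * x)

def prelu_grads(x, a, upstream):
    dx = np.where(x > 0, 1.0, a) * upstream            # gradient w.r.t. the input
    da = np.sum(np.where(x > 0, 0.0, x) * upstream)    # gradient w.r.t. the learnable slope a
    return dx, da

x = np.array([-2.0, -0.5, 1.5])
a = 0.25                                               # initial value; updated during training
dx, da = prelu_grads(x, a, upstream=np.ones_like(x))
print(prelu(x, a), dx, da)
```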

3.7 Softmax

    Softmax is the activation function for multi-class classification problems, where class membership must be assigned over more than two class labels. For any real-valued vector of length k, Softmax maps it to a vector of length k whose entries lie in the range (0, 1) and sum to 1.

    Softmax differs from the ordinary max function: max outputs only the largest value, whereas Softmax ensures that smaller values receive a smaller probability rather than being discarded outright. We can think of it as the probabilistic or "soft" version of the argmax function.

    The denominator of the Softmax function combines all the original output values, which means the probabilities produced by Softmax are all related to each other.

    The main disadvantages of the Softmax function are:

1. Non-differentiable at zero

2. Negative inputs have a gradient of 0, which means that for activations in this region, the weights are not updated during backpropagation, thus resulting in dead neurons that never activate.
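    A minimal numerically-stable sketch (subtracting the maximum before exponentiating is a standard implementation trick, not something stated above):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # shift for numerical stability; the result is unchanged
    e = np.exp(z)
    return e / e.sum()              # entries in (0, 1) that sum to 1

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p, p.sum())                   # roughly [0.659 0.242 0.099], summing to 1.0
```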

3.8 Swish

    Function expression: y = x * sigmoid(x)

    The design of Swish was inspired by the use of the sigmoid function for gating in LSTMs and highway networks. Swish uses the same value as both the gate and the gated input, which simplifies the gating mechanism; this is called self-gating.

    The advantage of self-gating is that it requires only a simple scalar input, while ordinary gating requires multiple scalar inputs. This enables self-gated activation functions such as Swish to easily replace activation functions that take a single scalar as input (such as ReLU) without changing the hidden capacity or number of parameters.

    The main advantages of the Swish activation function are as follows:

1. Unboundedness (above): helps prevent the gradient from gradually approaching 0 during slow training and thereby avoids saturation. (At the same time, being bounded below is also advantageous, since a bounded activation region can provide strong regularization and also handles large negative inputs.)

2. The derivative is always greater than 0.

3. Smoothness plays an important role in optimization and generalization. 
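    A sketch of y = x * sigmoid(x) as given above (the plain, non-parametric form of Swish):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)           # self-gating: the input gates itself

x = np.linspace(-6, 6, 13)
print(swish(x))                     # unbounded above, bounded below (minimum near x ≈ -1.28)
```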

3.9 Maxout

    This function is described in the paper "Maxout Networks". Maxout is a layer of a deep network, just like a pooling layer or a convolutional layer; we can regard maxout as the activation-function layer of the network. Suppose the input feature vector of a layer is X = (x1, x2, ..., xd), i.e. the input consists of d neurons. Each neuron i of the Maxout hidden layer is computed as:

$$h_i(x) = \max_{j \in [1, k]} z_{ij}$$

    This is the formula for neuron i of the maxout hidden layer, where k is a parameter of the maxout layer whose size we set by hand. Just as dropout has its own parameter p (the drop probability of each neuron), maxout has the parameter k. The quantity z in the formula is computed as:

$$z_{ij} = x^{T} W_{\cdot i j} + b_{ij}$$

    The weight W is a three-dimensional matrix of size (d, m, k), and b is a two-dimensional matrix of size (m, k); these are the parameters we need to learn. If we set k = 1, the network reduces to the ordinary MLP we have seen before.

    We can understand it this way: in a traditional MLP there is only one set of parameters from layer i to layer i+1. Here, instead, we train k groups of w and b parameters at this layer at the same time, and then select the largest activation value z as the activation value passed to the next layer's neuron; this max(z) operation plays the role of the activation function.

    In the Maxout layer, the activation function is the maximum value of the input, so a multilayer perceptron with only 2 maxout nodes can fit any convex function.

    A single Maxout node can be interpreted as a piecewise linear approximation (PWL) to a real-valued function, where the line segment between any two points on the graph of the function lies above the graph (convex function).

    Maxout can also be implemented for d-dimensional vectors (V):

    Given two convex functions h_1(x) and h_2(x), each approximated by a Maxout node, their difference g(x) = h_1(x) − h_2(x) is a continuous PWL function.

    Therefore, any continuous function can be well approximated by a Maxout layer consisting of two Maxout nodes.
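    A sketch of a maxout layer using the (d, m, k) parameter shapes described above (random weights and small sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 3, 4, 2                        # input dim, number of units, pieces per unit

W = rng.normal(size=(d, m, k))           # weight tensor of size (d, m, k)
b = rng.normal(size=(m, k))              # bias matrix of size (m, k)

def maxout(x, W, b):
    z = np.einsum('d,dmk->mk', x, W) + b     # z[i, j] = x . W[:, i, j] + b[i, j]
    return z.max(axis=-1)                    # each unit outputs the max over its k pieces

x = rng.normal(size=d)
print(maxout(x, W, b))                   # m outputs; with k = 1 this reduces to an ordinary linear layer
```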

3.10 Softplus

    Softplus function: f(x) = ln(1 + exp(x)). Its derivative is f′(x) = exp(x) / (1 + exp(x)) = 1 / (1 + exp(−x)), which is exactly the logistic (sigmoid) function. The Softplus function is similar to the ReLU function but smoother; like ReLU, it is (approximately) one-sided, and its output range is (0, +∞).
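    A short sketch checking the relations stated above: Softplus behaves like a smooth ReLU, and its derivative is the sigmoid function (np.log1p is used for numerical care):

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))      # f(x) = ln(1 + e^x), always positive

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
eps = 1e-6
numeric_grad = (softplus(x + eps) - softplus(x - eps)) / (2 * eps)
print(np.allclose(numeric_grad, sigmoid(x), atol=1e-5))   # True: the derivative of softplus is the sigmoid
print(softplus(x))                                        # close to 0 for x << 0, close to x for x >> 0
```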

4 How to choose the appropriate activation function in the application

    There is currently no definite method for this problem, let's rely on some experience.

1. Deep learning often requires a lot of time to process large amounts of data, so the convergence speed of the model is particularly important. In general, when training a deep network, try to use zero-centered data (which can be achieved through data preprocessing) and activation functions with zero-centered outputs, so as to speed up the convergence of the model.

2. If you use ReLU, you must set the learning rate carefully, and be careful not to let the network have many "dead" neurons. If this problem is not easy to solve, you can try Leaky ReLU, PReLU or Maxout.

3. It is best not to use sigmoid, you can try tanh, but it can be expected that its effect will not be as good as ReLU and Maxout.

This article refers to: Activation Function (Activation Function)_idea reply blog-CSDN blog


Origin blog.csdn.net/Starinfo/article/details/130061127