Activation functions commonly used in deep learning

—— This article was originally published on my WeChat public account "Big Data and Artificial Intelligence Lab" (BigdataAILab). You are welcome to follow it.

 

We know that the theoretical foundation of deep learning is the neural network. In a single-layer neural network (perceptron), the relationship between input and output is computed as shown in the following figure:
 
As can be seen, the output is a linear function of the input. With multiple neurons, the calculation formula is similar, as shown in the figure below:
 
Such a model can only handle simple, linearly separable data; it has difficulty handling nonlinear data effectively (multiple different linear representations can be combined, but that approach is more complex and less flexible), as shown in the following figure:
 
By adding a nonlinear activation function to the neural network, the network can learn a smooth curve and thus handle nonlinear data, as shown in the figure below:
 
Therefore, the role of the activation function in a neural network is to turn a combination of linear inputs into a nonlinear relationship. Without an activation function, each layer of the network is only a linear transformation, and stacking multiple layers still yields a linear transformation. Introducing nonlinearity through the activation function makes the network's representational power much stronger.
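To see why, here is a minimal numpy sketch (the layer sizes and random weights are illustrative assumptions): two stacked linear layers collapse into a single linear map, while inserting a nonlinearity in between breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                      # 5 samples, 3 features (illustrative sizes)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

# Two stacked linear layers...
two_linear = (x @ W1 + b1) @ W2 + b2

# ...are exactly one linear layer with collapsed weights W = W1 W2, b = b1 W2 + b2.
W, b = W1 @ W2, b1 @ W2 + b2
one_linear = x @ W + b
print(np.allclose(two_linear, one_linear))       # True: still a linear transformation

# With a nonlinearity (here tanh) in between, no single linear layer reproduces the output.
nonlinear = np.tanh(x @ W1 + b1) @ W2 + b2
```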

Here are a few commonly used activation functions.
1. sigmoid function
 
This is probably the most frequently used activation function in neural networks. It squashes a real number into the range (0, 1): a very large positive input gives a result close to 1, while a very large negative input gives a result close to 0. It was widely used in early neural networks because it maps nicely onto the idea of whether a neuron is activated by a stimulus before passing its signal backwards (0: barely activated, 1: fully activated). In recent years, however, it is rarely seen in deep learning applications, because the sigmoid function is prone to gradient vanishing (dispersion) and gradient saturation. When the network has many layers and every layer uses the sigmoid as its activation, gradients vanish: during backpropagation the parameter updates are multiplied by the sigmoid's derivative at each layer, so they keep shrinking. And if the input is a very large positive or negative number (for example, an input of 100 gives a sigmoid output close to 1 and a gradient close to 0), saturation occurs and the neuron behaves as if it were dead.

[For beginners] What is saturation?
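A minimal sketch of the sigmoid, σ(x) = 1 / (1 + e^(−x)), and its derivative σ'(x) = σ(x)(1 − σ(x)), illustrating the saturation described above (the sample inputs are illustrative):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative sigmoid(x) * (1 - sigmoid(x)); its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-100.0, -1.0, 0.0, 1.0, 100.0):
    # At x = 100 the output is ~1 and the gradient is ~0: the neuron saturates.
    print(x, sigmoid(x), sigmoid_grad(x))
```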

2. tanh function
 
The tanh function squashes the input into the range (-1, 1). Like the sigmoid, it also suffers from gradient vanishing and gradient saturation.
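A minimal sketch of tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)) and its derivative (numpy's built-in np.tanh is equivalent; the sample inputs are illustrative):

```python
import numpy as np

def tanh(x):
    """tanh(x) = (e^x - e^-x) / (e^x + e^-x); output lies in (-1, 1)."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def tanh_grad(x):
    """Derivative 1 - tanh(x)^2: equals 1 at x = 0, close to 0 for large |x| (saturation)."""
    return 1.0 - tanh(x) ** 2

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(tanh(x))       # values in (-1, 1)
print(tanh_grad(x))  # vanishes toward both ends
```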

3. ReLU function
 
ReLU stands for Rectified Linear Unit. It has been used widely in deep learning in recent years and alleviates the gradient-vanishing problem because its derivative is either 1 or 0. Compared with the sigmoid and tanh activation functions, both the ReLU value and its gradient are very cheap to compute, which can greatly speed up the convergence of stochastic gradient descent (ReLU is piecewise linear, while sigmoid and tanh require exponentials).
However, ReLU is relatively fragile: as training progresses, neurons may "die". For example, after a large gradient flows through a ReLU unit, the weight update may leave the unit unable to be activated by any subsequent data. If that happens, the gradient flowing through the neuron will be 0 from that point on; that is, the ReLU neuron dies irreversibly during training.
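A minimal sketch of ReLU(x) = max(0, x) and its (sub)gradient, which is 1 for x > 0 and 0 otherwise:

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Subgradient: 1 where x > 0, 0 elsewhere (the unit passes no gradient for x <= 0)."""
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), relu_grad(x))
```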

4. Leaky ReLU function
 
Leaky ReLU is mainly designed to avoid the gradient dying out: when the neuron is inactive, a small non-zero gradient is still allowed, so the gradient does not vanish and convergence remains fast. Its other advantages and disadvantages are similar to those of ReLU.
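A minimal sketch of Leaky ReLU, f(x) = x for x > 0 and f(x) = αx otherwise (the slope α = 0.01 below is a common but assumed choice):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x > 0, alpha * x otherwise."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    """Gradient: 1 for x > 0, alpha otherwise (never exactly 0, so the unit cannot die)."""
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(x), leaky_relu_grad(x))
```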

5. ELU function
 
In the positive range, the ELU is x itself, which alleviates the gradient-vanishing problem (its derivative is 1 everywhere for x > 0), similar to ReLU and Leaky ReLU. In the negative range, the ELU saturates softly as the input becomes very negative, which improves robustness to noise.
The following figure compares the curves of ReLU, LReLU, and ELU:
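A minimal sketch of ELU, commonly defined as f(x) = x for x > 0 and f(x) = α(e^x − 1) otherwise (α = 1.0 below is an assumed default), printed alongside ReLU and Leaky ReLU for comparison:

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise (soft saturation toward -alpha)."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-5.0, 5.0, 11)
relu_y  = np.maximum(0.0, x)
lrelu_y = np.where(x > 0, x, 0.01 * x)
elu_y   = elu(x)
for xi, r, l, e in zip(x, relu_y, lrelu_y, elu_y):
    print(f"x={xi:+.1f}  ReLU={r:+.3f}  LReLU={l:+.3f}  ELU={e:+.3f}")
```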

6. Maxout function
 
Maxout has also been a popular activation function in recent years. In short, it is a generalized version of ReLU and Leaky ReLU: when w1 and b1 are set to 0, it reduces to the ReLU formula.
Therefore, Maxout inherits the advantages of ReLU without the risk of neurons dying by accident. Compared with ReLU, however, it performs two linear mapping operations, so the amount of computation is doubled.
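A minimal sketch of a two-piece Maxout unit, f(x) = max(w1·x + b1, w2·x + b2) (the random weights and layer sizes are illustrative assumptions); setting w1 = 0 and b1 = 0 recovers ReLU applied to w2·x + b2:

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    """Two-piece Maxout: element-wise max of two linear mappings of x."""
    return np.maximum(x @ W1 + b1, x @ W2 + b2)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                        # 5 samples, 3 features (illustrative)
W2, b2 = rng.normal(size=(3, 4)), rng.normal(size=4)

# With W1 = 0 and b1 = 0, Maxout reduces to ReLU(x @ W2 + b2).
W1, b1 = np.zeros((3, 4)), np.zeros(4)
print(np.allclose(maxout(x, W1, b1, W2, b2), np.maximum(0.0, x @ W2 + b2)))  # True
```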

 
