Why use a non-linear activation function? Common non-linear activation functions and a comparison of their advantages and disadvantages

  • Why use a non-linear activation function? 

[Figure: a neural network with one hidden layer]

As shown in the neural network above, if a linear activation function (the identity activation function, g(z)=z) is used during forward propagation, then the hidden-layer outputs are

 a^{[1]}=z^{[1]}=W^{[1]}x+b^{[1]}

 a^{[2]}=z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}=W^{[2]}\left ( W^{[1]}x+b^{[1]} \right )+b^{[2]}

that is,

a^{[2]}=z^{[2]}=W'x+b', \quad \text{where } W'=W^{[2]}W^{[1]},\; b'=W^{[2]}b^{[1]}+b^{[2]}

It can be seen that a neural network with linear activation functions only outputs a linear combination of its inputs. No matter how many hidden layers there are, a network whose hidden layers use linear activation functions trains to the same effect as standard logistic regression with no hidden layer at all. So we must use a non-linear activation function, rather than a linear one, in the hidden layers.
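As a quick numerical illustration (a minimal NumPy sketch added here, not part of the original post), composing two linear layers collapses into a single linear layer with W'=W^{[2]}W^{[1]} and b'=W^{[2]}b^{[1]}+b^{[2]}:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))                                  # input vector
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))    # hidden layer
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=(1, 1))    # output layer

# Two stacked linear layers (identity activation g(z) = z)
a2 = W2 @ (W1 @ x + b1) + b2

# Equivalent single linear layer: W' = W2 W1, b' = W2 b1 + b2
W_prime, b_prime = W2 @ W1, W2 @ b1 + b2
assert np.allclose(a2, W_prime @ x + b_prime)
```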

The one place a linear activation function is usually acceptable is the output layer of a regression problem, where y is a real number. For example, when predicting real-estate prices, y is not 0 or 1 as in a binary classification task but a real value from 0 to positive infinity. In that case it can be reasonable to use a linear activation function in the output layer, so that the output is also a real number, from negative infinity to positive infinity. In summary, you should not use linear activation functions in hidden layers, except in a few special cases such as those related to compression.

  • Concept of saturation

When an activation function h(x) satisfies \lim_{x\rightarrow+\infty}h'(x)=0, we call it right-saturated.
When an activation function h(x) satisfies \lim_{x\rightarrow-\infty}h'(x)=0, we call it left-saturated.
When an activation function is both left-saturated and right-saturated, we call it saturated.

If there exists a constant c such that h'(x)=0 for all x > c, the activation function is called right hard-saturated. If there exists a constant c such that h'(x)=0 for all x < c, it is called left hard-saturated. If both left and right hard saturation are satisfied, the activation function is called hard-saturated. If the derivative only tends to 0 in the limit, as in the definitions above, the activation function is called soft-saturated.
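For example (a brief illustration added here, using functions discussed later in this post), the sigmoid \sigma is soft-saturated on both sides, while ReLU is left hard-saturated with c=0:

\lim_{x\rightarrow\pm\infty}\sigma'(x)=\lim_{x\rightarrow\pm\infty}\sigma(x)\left ( 1-\sigma(x) \right )=0

\mathrm{ReLU}(x)=\max(0,x),\qquad \mathrm{ReLU}'(x)=0 \ \text{for all}\ x<0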

  •  Commonly used activation functions

Reference links:

1)https://mp.weixin.qq.com/s?__biz=MzI1NTE4NTUwOQ==&mid=2650325236&idx=1&sn=7bd8510d59ddc14e5d4036f2acaeaf8d&scene=0#wechat_redirect

2)http://www.ai-start.com/dl2017/html/lesson1-week3.html#header-n152

1. The sigmoid function

[Figure: the sigmoid function g(z)=\frac{1}{1+e^{-z}}]

 

 Derivation: \frac{\mathrm{d}}{\mathrm{d}z}g(z)=\frac{1}{1+e^{-z}}\left ( 1- \frac{1}{1+e^{-z}}\right )=g(z)(1-g(z))

When z=10 or z=-10: \frac{\mathrm{d}}{\mathrm{d}z}g(z)\approx 0

When z=0: \frac{\mathrm{d}}{\mathrm{d}z}g(z)= 1/4

  The soft saturation of the sigmoid made it difficult to train deep neural networks effectively for two or three decades, and was an important factor hindering the development of neural networks. During the backward pass, the gradient propagated through a sigmoid unit contains a factor f'(x) (the derivative of the sigmoid with respect to its input); once the input falls into the saturation region, f'(x) becomes close to 0, so the gradient passed to the lower layers becomes very small.
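A minimal NumPy check (added here for illustration) of the derivative values quoted above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    g = sigmoid(z)
    return g * (1.0 - g)  # g'(z) = g(z)(1 - g(z))

for z in (-10.0, 0.0, 10.0):
    print(z, sigmoid_grad(z))
# -10.0 -> ~4.5e-05 (saturated region, gradient nearly vanishes)
#   0.0 -> 0.25     (maximum of the derivative)
#  10.0 -> ~4.5e-05
```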

Advantages:

  1. The sigmoid maps its input to (0, 1); it is monotonic and continuous, its output range is bounded, optimization is stable, and it can be used in the output layer. It is also the closest to a biological neuron in a physical sense.
  2. Its derivative is easy to compute.

Disadvantages:

  1. Due to its soft saturation, it is prone to vanishing gradients, which causes problems in training.
  2. Its output is not zero-centered.

 2. tanh

Similarly, the tanh activation function is also soft-saturated. A tanh network converges faster than a sigmoid network: because the mean of tanh's output is closer to 0 than that of sigmoid, SGD behaves more like natural gradient descent (a second-order optimization technique), reducing the number of iterations required.

Advantages:

  1. It converges faster than the sigmoid function.
  2. Unlike the sigmoid function, its output is zero-centered.

Disadvantages:

      It still does not solve the biggest problem of the sigmoid function: vanishing gradients caused by saturation.
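As a rough illustration of the zero-centering claim (a NumPy sketch with randomly generated inputs, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)          # zero-mean pre-activations

sigmoid_out = 1.0 / (1.0 + np.exp(-z))
tanh_out = np.tanh(z)                 # tanh(z) = 2*sigmoid(2z) - 1

print(sigmoid_out.mean())  # ~0.5  -> not zero-centered
print(tanh_out.mean())     # ~0.0  -> zero-centered
```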

 3. ReLU and Leaky ReLU

 ReLU is hard-saturated for x<0. Since its derivative is 1 for x>0, ReLU keeps the gradient from decaying when x>0, thereby alleviating the vanishing-gradient problem. However, as training progresses, some inputs fall into the hard-saturation region, so the corresponding weights can no longer be updated. This phenomenon is called "neuron death".

Advantages:

  1. Compared with sigmoid and tanh, ReLU converges quickly under SGD. This is commonly attributed to its linear, non-saturating form.
  2. Sigmoid and tanh involve many expensive operations (such as exponentials), whereas ReLU can be implemented much more simply.
  3. It effectively alleviates the vanishing-gradient problem.
  4. It can perform well even without unsupervised pre-training.
  5. It provides sparse representations for the neural network.

Disadvantages:

        As training progresses, neurons may die, and their weights can no longer be updated. Once this happens, the gradient flowing through that neuron is always 0 from that point forward; that is, ReLU neurons can die irreversibly during training.

        Another problem with ReLU is that its output is biased: the output mean is always greater than zero. This output bias and neuron death together can affect the convergence of the network.
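A minimal sketch of ReLU and Leaky ReLU and their (sub)gradients (the slope 0.01 for the negative part is just a common illustrative choice):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)          # 0 for x < 0: left hard saturation

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small slope keeps gradient alive for x < 0

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(x), relu_grad(x))
print(leaky_relu(x), leaky_relu_grad(x))
```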

4. PReLU

PReLU is a non-saturating improvement on ReLU and LReLU. When the slope a_i is small and fixed, the function is called LReLU (Leaky ReLU). The original purpose of LReLU was to avoid vanishing gradients, but some experiments found that LReLU did not have much impact on accuracy; in practice, one must carefully retrain and select an appropriate a for LReLU to outperform ReLU. Therefore, PReLU was proposed, which learns the parameter adaptively from the data. PReLU converges quickly and achieves a low error rate. Because the output of PReLU is closer to zero mean, SGD behaves more like natural gradient descent. PReLU can be trained with backpropagation and optimized jointly with the other layers.
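A sketch of the PReLU forward pass and its gradients, including the gradient with respect to the learnable slope a (the scalar slope and its initial value below are illustrative assumptions):

```python
import numpy as np

def prelu(x, a):
    # f(x) = x for x > 0, a * x for x <= 0; a is learned by backprop
    return np.where(x > 0, x, a * x)

def prelu_grads(x, a, upstream):
    dx = np.where(x > 0, 1.0, a) * upstream             # gradient w.r.t. the input
    da = np.sum(np.where(x > 0, 0.0, x) * upstream)     # only the negative region contributes
    return dx, da

x = np.array([-1.5, -0.2, 0.3, 2.0])
a = 0.25                                  # illustrative initial value for the learnable slope
print(prelu(x, a))
print(prelu_grads(x, a, upstream=np.ones_like(x)))
```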

5. ELU

ELU combines aspects of the sigmoid and ReLU and is soft-saturated on the left. The linear part on the right allows ELU to mitigate vanishing gradients, while the soft saturation on the left makes ELU more robust to input changes and noise. The mean of ELU's output is close to zero, so it converges faster.
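A minimal sketch of ELU (alpha = 1.0 is a common default, used here for illustration):

```python
import numpy as np

def elu(x, alpha=1.0):
    # linear for x > 0, soft-saturating toward -alpha as x -> -inf
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    return np.where(x > 0, 1.0, alpha * np.exp(x))  # tends to 0 as x -> -inf (left soft saturation)

x = np.array([-5.0, -1.0, 0.5, 3.0])
print(elu(x), elu_grad(x))
```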

Activation functions such as PReLU and ELU do not have ReLU's sparsity, but both can improve network performance.

6. Maxout

Maxout is a generalization of ReLU, and its saturation is a zero-measure event (it almost never saturates). A Maxout network can approximate any continuous function, and when w_2, b_2, ..., w_n, b_n are 0, it degenerates into ReLU.

Maxout can alleviate vanishing gradients and also avoids ReLU's drawback of neuron death, but it increases the number of parameters and the amount of computation.
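A minimal sketch of a Maxout unit with k affine pieces (the shapes and k=2 below are illustrative); with the second piece fixed at zero it reduces to ReLU of the first piece, matching the degeneration described above:

```python
import numpy as np

def maxout(x, W, b):
    # W: (k, out_dim, in_dim), b: (k, out_dim); output_j = max_i (W[i] @ x + b[i])_j
    return np.max(np.einsum('koi,i->ko', W, x) + b, axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(2, 3, 4))   # k = 2 pieces, 3 output units, 4 inputs
b = rng.normal(size=(2, 3))

print(maxout(x, W, b))

# With the second piece set to zero, maxout(x) = max(W[0] @ x + b[0], 0), i.e. ReLU
W[1], b[1] = 0.0, 0.0
assert np.allclose(maxout(x, W, b), np.maximum(W[0] @ x + b[0], 0.0))
```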


Origin blog.csdn.net/weixin_42149550/article/details/99839184