[Deep Learning Theory] (3) Activation function

Hello everyone, I recently took Stanford's CS231n computer vision open course. It was so good that I would like to share my notes with you.

A neuron first computes a weighted linear sum of its inputs and then passes that sum through a nonlinear activation function. It is this nonlinearity that lets a neural network fit nonlinear decision boundaries and solve nonlinear classification and regression problems.


1. Sigmoid function

Purpose: map any input on (-∞, +∞) to a value between 0 and 1.

Formula: \sigma(x) = \frac{1}{1+e^{-x}}

If x = 0, the function value is 0.5; if x is very large, the value is close to 1; if x is very negative, the value is close to 0.
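
A minimal NumPy sketch (my own, not from the course) that checks these three cases:

```python
import numpy as np

def sigmoid(x):
    """Squash any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# x = 0 gives 0.5; a large x is close to 1; a very negative x is close to 0
print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # ~0.99995
print(sigmoid(-10.0))  # ~0.00005
```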

Features:

(1) It squeezes any input on (-∞, +∞) into a number between 0 and 1, which is why it is also called a squashing function.

(2) It is easy to interpret and can be likened to whether a nerve cell fires. Because the output lies between 0 and 1, it suits binary classification, with 0 standing for one class and 1 for the other.

Drawbacks:

(1) Saturation makes the gradient vanish. When x is very large or very small, the gradient approaches 0.
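
As a quick illustration (my own sketch, with arbitrary sample points), the derivative \sigma'(x) = \sigma(x)(1-\sigma(x)) can be evaluated to see how it vanishes in the tails:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative of sigmoid: sigma(x) * (1 - sigma(x)), largest at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 5.0, 10.0, -10.0]:
    print(x, sigmoid_grad(x))
# 0.0   -> 0.25
# 5.0   -> ~0.0066
# 10.0  -> ~4.5e-05
# -10.0 -> ~4.5e-05
```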

(2) The outputs are all positive; they are not zero-centered.

As shown in the figure below, if the outputs of the previous hidden layer are all positive, then within each neuron the partial derivatives with respect to all of its weights share the same sign. That is, all the weights of a neuron must either increase together or decrease together.

Suppose the pre-activation is w_1x_1 + w_2x_2 + w_3x_3 + \dots + b and it is fed into a sigmoid. Taking the partial derivative of the inner linear function gives \frac{\partial\left(\sum_{i} w_i x_i + b\right)}{\partial w_i} = x_i. Here x_i is an input of the neuron, i.e. the output of a neuron in the previous layer; if that neuron was activated with a sigmoid, then x_i must be positive.

So for w_1, w_2, w_3 the partial derivatives all share the same sign (that of the upstream gradient), and the weights can only all move up or all move down in a single update. In the right-hand plot below, the horizontal axis is the partial derivative with respect to w_1 and the vertical axis is the partial derivative with respect to w_2: the update direction always lies in the first or third quadrant, all positive or all negative. If, as the blue line shows, the optimum requires increasing w_1 while decreasing w_2, there is no way to get there in one step; the weights must first decrease together and then increase together, sliding back and forth and producing a zigzag optimization path.
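
A small sketch of that argument (my own, with made-up numbers): when every input x_i is positive, the gradients of all the weights of one neuron share the sign of the upstream gradient, so they can only all increase or all decrease together.

```python
import numpy as np

# Inputs coming from a previous sigmoid layer are all positive
x = np.array([0.7, 0.2, 0.9])   # hypothetical activations of the previous layer
upstream = -1.3                 # hypothetical dL/dz flowing into this neuron

# z = w1*x1 + w2*x2 + w3*x3 + b, so dz/dwi = xi and dL/dwi = upstream * xi
grad_w = upstream * x
print(grad_w)   # [-0.91 -0.26 -1.17]  -> all the same sign
```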

(3) The exponential is computationally expensive compared with basic operations such as addition, subtraction, multiplication, and division.


2. tanh function

The hyperbolic tangent function is similar to the sigmoid; the two can be converted into each other by scaling and translation, specifically \tanh(x) = 2\sigma(2x) - 1.

Features:

(1) Saturation makes the gradient vanish.

(2) It squeezes any input on (-∞, +∞) into a number between -1 and 1.

(3) The output can be both positive and negative, symmetric about 0 (zero-centered).
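
A quick numeric check (my own sketch, using NumPy's built-in tanh) of the scaling-and-translation relation \tanh(x) = 2\sigma(2x) - 1:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
# tanh is a scaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))   # True
```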


3. ReLU function

The ReLU function, also known as the rectified linear unit, sets every negative input to 0.

Formula: f(x) = \max(0, x)

Features:

(1) It does not saturate in the positive region: when x > 0, the output equals the input.

(2) The computation is simple and costs almost no computing resources.

(3) When x > 0 the gradient is preserved, so ReLU converges more than 6 times faster than the sigmoid function.
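
A minimal sketch (mine, not from the course) of ReLU and its gradient, which is 1 for x > 0 and 0 otherwise:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): negative inputs are zeroed, positive inputs pass through
    return np.maximum(0.0, x)

def relu_grad(x):
    # gradient is 1 where x > 0 and 0 elsewhere, so positive signals are not shrunk
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```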

Drawbacks:

(1) The output is not zero-centered.

(2) When x < 0 the gradient is 0. This can leave some neurons dead (dead ReLU): they produce no positive output, receive no gradient, and are never updated.

The causes of dead ReLU are: ① poor initialization, where the randomly initialized weights happen to make the neuron's output 0 for every input, so its gradient is 0; ② a learning rate that is too large, where an overly big update step can knock the weights into a region where the neuron never activates again.

To alleviate dead ReLU, a common trick is to initialize the bias of ReLU neurons to a small positive value such as 0.01, so that every neuron outputs a positive number at the start of training and therefore receives gradient updates.
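
One way to express that trick (a sketch under my own assumptions, using plain NumPy rather than any particular framework, with hypothetical layer sizes): initialize the bias of each ReLU neuron to 0.01 instead of 0.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 128, 64

# random weight initialization as usual ...
W = rng.normal(0.0, 0.01, size=(fan_in, fan_out))
# ... but start every ReLU bias at a small positive value instead of 0
b = np.full(fan_out, 0.01)

x = rng.normal(size=(32, fan_in))   # a hypothetical batch of inputs
pre_activation = x @ W + b          # the positive bias nudges pre-activations above 0
out = np.maximum(0.0, pre_activation)
```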


4. Improvements to the ReLU function

To address the fact that ReLU has zero gradient when x < 0, the Leaky ReLU and ELU functions were introduced.

When x < 0, Leaky ReLU multiplies x by a very small slope \alpha (for example 0.01) instead of zeroing it.

Formula: f(x) = \max(\alpha x, x)

So the function is still linear for x < 0, just with a small nonzero slope.

The ELU function instead uses an exponential when x < 0.

Formula: f(x)=\begin{cases} x & \text{if } x > 0 \\ \alpha\,(e^{x}-1) & \text{if } x \leq 0 \end{cases}

This improves on ReLU's lack of zero-centered output, but the exponential adds computational cost.
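
A small sketch (mine) of both formulas, with \alpha as a hyperparameter (commonly 0.01 for Leaky ReLU and 1.0 for ELU; those defaults are my assumption here):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # max(alpha * x, x): a small nonzero slope keeps the gradient alive for x < 0
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    # x for x > 0, alpha * (exp(x) - 1) for x <= 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(leaky_relu(x))   # [-0.03 -0.01  0.    1.    3.  ]
print(elu(x))          # approx [-0.95 -0.63  0.    1.    3.  ]
```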


5. Maxout

As shown in the figure below, suppose the previous layer produces two outputs. Maxout processes these two outputs with another small set of neurons, for example 5 of them (5 sets of weights), and takes the largest of the 5 resulting values as the activation output.

Features: Maxout is not just an activation function; it also changes the structure of the network, because new neurons are introduced and the number of parameters grows by a factor of k.
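
A sketch of the idea (my own, with made-up sizes): each Maxout unit holds k sets of weights, computes k linear responses, and outputs the maximum, which is why the parameter count grows by a factor of k.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, k = 2, 5                  # 2 inputs from the previous layer, k = 5 linear pieces

W = rng.normal(size=(k, n_inputs))  # k sets of weights -> k times the parameters
b = rng.normal(size=k)

x = np.array([0.3, -1.2])           # hypothetical outputs of the previous layer
z = W @ x + b                       # k linear responses
maxout = z.max()                    # the activation is the largest of the k responses
print(z, maxout)
```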


6. Summary

(1) When using ReLU, take care that the learning rate is not too large; otherwise neurons may die (dead ReLU) and their gradients will vanish.

(2) Leaky ReLU, ELU, or Maxout can be used instead of ReLU.

(3) Do not use the sigmoid function in hidden layers; for a binary classification problem, sigmoid can still be used in the output layer.

(4) The tanh function can be used, but don't get your hopes up for it. 
