Activation functions (sigmoid, tanh, ReLU, leaky ReLU)

To keep the computation of a neuron both simple and expressive, it consists of a linear part and a nonlinear part.

Today we focus on the common nonlinear computations (i.e., activation functions), including:

sigmoid

tanh

ReLU

leaky ReLU

1. The sigmoid function

The sigmoid function maps its output to the interval (0, 1), which makes it suitable for binary classification tasks.

The sigmoid function formula:

                               S(x)=\frac{1}{1+e^{-x}}

Its derivative is:

                                {S}'(x)=S(x)(1-S(x))

The graph of the sigmoid function is:

 

The sigmoid function has the advantage of being smooth and easy to differentiate, but it is relatively expensive to compute. During backpropagation it is prone to vanishing gradients, which can make it impossible to train a deep network successfully.
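As a quick illustration, here is a minimal NumPy sketch of the sigmoid function and its derivative (the function names are my own, chosen for this example):

```python
import numpy as np

def sigmoid(x):
    # S(x) = 1 / (1 + e^(-x)), maps any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # S'(x) = S(x) * (1 - S(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))             # approx [0.0067, 0.5, 0.9933]
print(sigmoid_derivative(x))  # approx [0.0066, 0.25, 0.0066]
```

Note that the derivative peaks at 0.25 when x = 0 and quickly shrinks toward zero as |x| grows, which is exactly the vanishing gradient behavior discussed below.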

2. Tanh function

The tanh function is very similar to the sigmoid function, except that tanh maps the output to (-1,1).

Tanh function formula:

                       tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}

Derivative of tanh function:

                      tanh'(x)=1-tanh^{2}(x)

The graph of the tanh function is:

The sigmoid and tanh functions were the earliest activation functions to be studied. Tanh is an improved version of sigmoid: it fixes the problem that the sigmoid function is not zero-centered, which speeds up convergence. Therefore, in practice, tanh is used more often.

The vanishing gradient problem:

Although tanh improves on the sigmoid function to a certain extent, looking at the graphs of these two functions reveals that when the input is very large or very small, the slope of the curve is close to 0. In other words, for inputs with a very large absolute value, the output barely changes and the gradient is almost zero; this is the vanishing gradient problem.
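A small NumPy sketch of tanh and its derivative makes this concrete: for inputs with a large absolute value, the derivative is essentially zero (the names are illustrative, not from the original post):

```python
import numpy as np

def tanh(x):
    # tanh(x) = (e^x - e^-x) / (e^x + e^-x); NumPy's np.tanh computes exactly this
    return np.tanh(x)

def tanh_derivative(x):
    # tanh'(x) = 1 - tanh(x)^2
    t = np.tanh(x)
    return 1.0 - t ** 2

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(tanh(x))             # approx [-1.0, -0.7616, 0.0, 0.7616, 1.0]
print(tanh_derivative(x))  # approx [8.2e-09, 0.42, 1.0, 0.42, 8.2e-09]
```

At |x| = 10 the gradient is already on the order of 1e-8, so almost nothing flows backward through such a unit during training.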

3. ReLU

To solve the vanishing gradient problem, the ReLU (Rectified Linear Unit) is commonly used.

ReLU formula:

                 f(x)=\max(0, x)

ReLU derivative:

               f'(x)=\begin{cases} 0 & \text{ if } x<0 \\ 1 & \text{ if } x\geq 0 \end{cases}

ReLU graph:
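A minimal NumPy sketch of ReLU and its derivative, using the convention f'(0) = 1 from the formula above (function names are my own):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def relu_derivative(x):
    # f'(x) = 0 for x < 0, 1 for x >= 0
    return np.where(x < 0, 0.0, 1.0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(x))  # [0. 0. 1. 1. 1.]
```

For positive inputs the gradient is exactly 1, so it does not shrink as it propagates backward; the price is that units with negative inputs get a zero gradient, which leaky ReLU addresses next.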

4. Leaky ReLU

Leaky ReLU is a variant of ReLU. When x<0, the gradient of the function is not 0 but a small constant \lambda\in (0,1), such as 0.01.

leaky ReLU formula:

                           f(x)=\begin{cases} \lambda x & \text{ if } x<0 \\ x & \text{ if } x\geq 0 \end{cases}

leaky ReLU derivative:

                          f'(x)=\begin{cases} \lambda & \text{ if } x<0 \\ 1 & \text{ if } x\geq 0 \end{cases}

leaky ReLU graph:
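A corresponding NumPy sketch of leaky ReLU, with the slope \lambda as a parameter (the default 0.01 follows the example above; names are illustrative):

```python
import numpy as np

def leaky_relu(x, lam=0.01):
    # f(x) = lam * x for x < 0, x for x >= 0
    return np.where(x < 0, lam * x, x)

def leaky_relu_derivative(x, lam=0.01):
    # f'(x) = lam for x < 0, 1 for x >= 0
    return np.where(x < 0, lam, 1.0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(x))             # [-0.02  -0.005  0.     0.5    2.   ]
print(leaky_relu_derivative(x))  # [0.01 0.01 1.   1.   1.  ]
```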

In a neural network, different layers can use different activation functions. For a binary classification task, we can use the sigmoid function in the last layer (that is, the output layer) and ReLU or leaky ReLU in the other layers.
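As a sketch of that layout (ReLU in the hidden layer, sigmoid at the output), here is a minimal NumPy forward pass for a toy binary classifier; the layer sizes and random weights are arbitrary and only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 2-layer network: 4 inputs -> 8 hidden units (ReLU) -> 1 output (sigmoid)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)

def forward(x):
    h = np.maximum(0.0, W1 @ x + b1)           # hidden layer with ReLU
    p = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))   # output layer with sigmoid
    return p  # probability of the positive class, in (0, 1)

x = rng.normal(size=4)
print(forward(x))  # a single value in (0, 1)
```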

         
