Activation functions are an essential part of neural networks. In a multi-layer network, the output of the nodes in one layer is passed through a function before it becomes the input to the nodes in the next layer. If this function is linear, a stack of layers collapses into a single linear map, no matter how deep the network is. If it is nonlinear, the expressive power of a deep network improves greatly, and the network can approximate almost any function. These nonlinear functions are called activation functions; their role is to give the network nonlinear modeling capability.
1. Sigmoid function
The Sigmoid function is an S-shaped curve that saturates at both ends. It has long been one of the most widely used activation functions, and it is often described as the one closest in behavior to a biological neuron.
Because its output lies in the interval (0, 1), it can be interpreted as a probability or used to normalize the input; in other words, it acts as a "squashing" function.
Sigmoid function graph and formula: sigmoid(x) = 1 / (1 + e^(-x))
In PyTorch it is available in three forms: torch.sigmoid() is a plain function, torch.nn.Sigmoid() is a network layer (module), and torch.nn.functional.sigmoid() is the functional form, typically called inside forward.
The Sigmoid function gives an intuitive picture of whether a neuron fires and passes its signal on when it is stimulated: an output close to 0 means the neuron is almost inactive, and an output close to 1 means it is almost fully activated.
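To make the "almost inactive" and "almost fully activated" regimes concrete, here is a minimal sketch that evaluates the sigmoid at a few inputs. It uses only the standard library, and the branch on the sign of x is an added precaution against overflow, not something the original text prescribes.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, written so that math.exp never receives a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) is small, so this equivalent form cannot overflow.
    z = math.exp(x)
    return z / (1.0 + z)

# Outputs near 0 correspond to "almost inactive", outputs near 1 to "almost fully activated".
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x = {x:6.1f}   sigmoid(x) = {sigmoid(x):.5f}")
```

The naive form 1 / (1 + exp(-x)) raises OverflowError for very negative x in pure Python, which is why the sketch splits the two cases.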
The sigmoid function has several disadvantages. It is prone to vanishing gradients and, with a small probability, to exploding gradients.
Its formula contains an exponential, which is relatively expensive to compute; in a large network this can noticeably increase training time.
Its output is not zero-mean, so neurons in later layers receive inputs with a non-zero mean, which affects the gradients and slows convergence.
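The vanishing gradient problem mentioned above follows from the sigmoid's derivative, s(x) * (1 - s(x)), which never exceeds 0.25. The sketch below is a simplified illustration that ignores the weight matrices (an assumption made to keep the example short) and only multiplies the activation derivatives across a hypothetical stack of 10 layers.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 with value 0.25 and shrinks toward 0 where the curve saturates.
peak = sigmoid_grad(0.0)
saturated = sigmoid_grad(10.0)

# Even in the best case, the product of derivatives through n_layers sigmoid layers
# is bounded by 0.25 ** n_layers (n_layers is a hypothetical depth chosen for illustration).
n_layers = 10
best_case_factor = peak ** n_layers

print("peak derivative:", peak)
print("derivative at x = 10:", saturated)
print("best-case factor after", n_layers, "layers:", best_case_factor)
```

In a real network the weights also enter this product, so the actual behavior depends on their scale; the sketch only shows the contribution of the activation itself.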
The code for plotting the sigmoid curve is as follows:
import matplotlib.pyplot as plt
import numpy as np

# Sample the input range and evaluate the sigmoid at each point
x = np.linspace(-10, 10)
y_sigmoid = 1 / (1 + np.exp(-x))

# Plot the sigmoid curve
fig = plt.figure()
ax = fig.add_subplot()
ax.plot(x, y_sigmoid)
ax.grid()
ax.set_title('Sigmoid')
plt.show()
2. Tanh function
The tanh function is a rescaled and shifted version of the sigmoid function; the two are related by tanh(x) = 2*sigmoid(2x) - 1
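The relationship tanh(x) = 2*sigmoid(2x) - 1 can be checked numerically. The following is a small sketch using only the standard library; the sample points are arbitrary choices made for illustration.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Arbitrary sample points; any moderate values would do.
xs = [-3.0, -1.0, 0.0, 0.5, 2.0]

# Largest absolute difference between the two sides of the identity.
max_abs_diff = max(abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) for x in xs)
print("max |tanh(x) - (2*sigmoid(2x) - 1)| =", max_abs_diff)
```

The difference should be at the level of floating-point rounding error, which is consistent with the two expressions being the same function.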
Tanh function graph and formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
It maps its output to the interval (-1, 1), which solves the non-zero-mean problem of the sigmoid function.
The tanh function also has disadvantages: like the sigmoid, it suffers from vanishing and exploding gradients.
Its exponentials can likewise take a long time to compute.
To prevent saturation, a batch normalization step can be added before the activation function, so that the input to each layer is, as far as possible, distributed around zero with a small mean.
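As a rough illustration of why normalizing before the activation helps, the sketch below standardizes a batch of pre-activation values to zero mean and unit variance and compares how many tanh outputs are saturated before and after. It shows only the normalization step; real batch normalization layers also have a learnable scale and shift and track running statistics. The helper names, the sample values, and the 0.99 threshold are all assumptions made for this example.

```python
import math
import statistics

def standardize(values, eps=1e-5):
    """Shift and scale a batch to zero mean and (approximately) unit variance."""
    mean = statistics.mean(values)
    var = statistics.pvariance(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def saturated_fraction(values, threshold=0.99):
    """Fraction of tanh outputs whose magnitude exceeds the threshold."""
    return sum(abs(math.tanh(v)) > threshold for v in values) / len(values)

# Hypothetical pre-activations with a large positive offset.
pre_activations = [8.0, 12.0, 5.0, 15.0, 9.0, 11.0, 3.0, 14.0]

before = saturated_fraction(pre_activations)
after = saturated_fraction(standardize(pre_activations))
print("saturated before normalization:", before)
print("saturated after normalization: ", after)
```

With this batch, every raw value lands in the flat part of tanh, while the standardized values fall back into the region where the gradient is still useful.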
The code for plotting the tanh curve is as follows:
import matplotlib.pyplot as plt
import numpy as np

# Sample the input range and evaluate tanh from its definition
# (np.tanh(x) gives the same values)
x = np.linspace(-10, 10)
y_tanh = (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

# Plot the tanh curve
fig = plt.figure()
ax = fig.add_subplot()
ax.plot(x, y_tanh)
ax.grid()
ax.set_title('Tanh')
plt.show()
3. ReLU function
ReLU stands for Rectified Linear Unit. Compared with the sigmoid and tanh functions, ReLU greatly speeds up the convergence of stochastic gradient descent.
ReLU has become popular in recent years and is now very widely used in deep learning.
ReLU function graph and formula: ReLU(x) = max(0, x)
ReLU involves no exponential operation, so it costs almost nothing to compute.
Its advantages are fast convergence, simple computation, and a degree of biological plausibility (one-sided inhibition and a wide excitation range); it also alleviates the vanishing gradient problem.
Its disadvantage is that units can be fragile during training and may "die". For example, after a large gradient flows through a ReLU unit, the weight update can leave the unit outputting 0 for every input; its gradient is then also 0, so the unit is never activated again.
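The "dying ReLU" effect described above can be reproduced with a single unit. The sketch below is a toy illustration: the inputs, the initial weight and bias, and the size of the oversized update are all hypothetical values chosen so that the effect is easy to see.

```python
def relu(z: float) -> float:
    return max(0.0, z)

def relu_grad(z: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

inputs = [0.1, 0.4, 0.7, 1.0]  # hypothetical training inputs
w, b = 1.0, 0.0                # hypothetical initial weight and bias

def grads(w: float, b: float):
    """Gradients of sum_i relu(w * x_i + b) with respect to w and b."""
    gw = sum(relu_grad(w * x + b) * x for x in inputs)
    gb = sum(relu_grad(w * x + b) for x in inputs)
    return gw, gb

healthy = grads(w, b)  # both gradients are non-zero: the unit can still learn

# One oversized update pushes the bias far into the negative range.
b = b - 10.0
outputs_after = [relu(w * x + b) for x in inputs]
dead = grads(w, b)     # every pre-activation is negative, so both gradients are 0

print("gradients before:", healthy)
print("outputs after the large update:", outputs_after)
print("gradients after:", dead)
```

Because both gradients are exactly zero after the update, gradient descent leaves w and b unchanged from then on, which is what "never activated again" means in practice.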
The code for plotting the ReLU curve is as follows:
import matplotlib.pyplot as plt
import numpy as np

# Sample the input range and apply ReLU element-wise: max(0, x)
x = np.linspace(-10, 10)
y_relu = np.maximum(0, x)

# Plot the ReLU curve
fig = plt.figure()
ax = fig.add_subplot()
ax.plot(x, y_relu)
ax.grid()
ax.set_title('ReLU')
plt.show()
4. LeakyReLU function
LeakyReLU function graph and formula: LeakyReLU(x) = x if x >= 0, and γx if x < 0
In the formula, γ is a small positive constant that sets the slope for negative inputs
LeakyReLU addresses the problem of ReLU units dying. It assigns a small non-zero slope to all negative inputs, so the output and the gradient on the negative axis are not zero and information from negative inputs is preserved, which prevents those neurons from dying.
In practice, however, LeakyReLU is not always better than ReLU.
Generally speaking, it is rare to mix several different activation functions within one network.
You can try ReLU first. If the results are poor, try LeakyReLU, tanh, and so on. It is best not to reach for the sigmoid function by default. In short, the choice of activation function depends on the specific model: try different activation functions and select the one that works best, since no single choice suits every problem.
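To make the LeakyReLU definition above concrete, here is a minimal scalar sketch of the function and its derivative. The parameter name negative_slope (playing the role of γ) and its default of 0.01 are assumptions for this example; the plotting code in this section uses 0.2.

```python
def leaky_relu(z: float, negative_slope: float = 0.01) -> float:
    """Identity for non-negative inputs, a small linear slope for negative inputs."""
    return z if z >= 0 else negative_slope * z

def leaky_relu_grad(z: float, negative_slope: float = 0.01) -> float:
    """Derivative: 1 for non-negative inputs, negative_slope for negative inputs."""
    return 1.0 if z >= 0 else negative_slope

# Unlike ReLU, the gradient for a negative input is small but not zero,
# so a unit stuck in the negative region can still receive weight updates.
for z in (-5.0, -0.5, 0.0, 3.0):
    print(f"z = {z:5.1f}   output = {leaky_relu(z, 0.2):6.2f}   gradient = {leaky_relu_grad(z, 0.2)}")
```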
The code for plotting the LeakyReLU curve is as follows:
import matplotlib.pyplot as plt
import numpy as np

# Sample the input range and apply LeakyReLU with a negative slope of 0.2
x = np.linspace(-10, 10)
y_leaky_relu = np.where(x < 0, 0.2 * x, x)

# Plot the LeakyReLU curve
fig = plt.figure()
ax = fig.add_subplot()
ax.plot(x, y_leaky_relu)
ax.grid()
ax.set_title('Leaky ReLU')
plt.show()