Mofan PyTorch Study Notes (2) - Activation Functions

1. Sigmoid function
The sigmoid function is defined as sigmoid(x) = 1 / (1 + e^(-x)); its curve is shown below.
[Figure: sigmoid function curve]
From the sigmoid curve we can see that the output lies in the open interval (0, 1). This is interesting: you can think of it as a probability, although strictly speaking it should not be treated as one. The sigmoid function was once very popular. It can be thought of as the firing rate of a neuron: in the middle, where the slope is large, the neuron is in a sensitive region, while at the two ends, where the slope is very gentle, the neuron is in an inhibited region.
Of course, being popular once does not mean being popular now, which shows that the function itself has certain defects.
1) When the input moves even slightly away from the origin, the gradient of the function becomes very small, almost zero. During back-propagation in a neural network, the gradients of the weights w are computed through the chain rule of differentiation. When back-propagation passes through the sigmoid function, the differential on the chain becomes very, very small, and the signal may pass through many sigmoid functions, so in the end the weights w have almost no effect on the loss function. This is not conducive to weight optimization; the problem is called gradient saturation, and it can also be called gradient diffusion (a small sketch follows this list).
2) The output of the function is not centered at 0, which reduces the efficiency of weight updates. The Stanford course has a detailed explanation of this defect.
3) The sigmoid function requires computing an exponential, which is relatively slow for computers.
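A minimal sketch of the saturation problem in point 1), using torch.sigmoid and autograd (the input values below are only illustrative):

import torch

# inputs near the origin vs. far away from it
x = torch.tensor([0.0, 2.0, 5.0, 10.0], requires_grad=True)
y = torch.sigmoid(x)
y.sum().backward()

print(y)        # outputs approach 1 as x grows
print(x.grad)   # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) shrinks toward 0 away from the origin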
2. Tanh function
The tanh function is defined as tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)); its curve is shown below.
[Figure: tanh function curve]
tanh is the hyperbolic tangent function. The tanh curve is quite similar to the sigmoid curve, so let's compare them. First, the similarity: for very large or very small inputs, both functions are almost flat and the gradient is very small, which is not conducive to weight updates. The difference is the output range: tanh outputs values in (-1, 1), and the whole function is centered at 0, a property in which tanh is better than sigmoid.
For general binary classification problems, tanh is often used in the hidden layers and sigmoid in the output layer. But this is not fixed; which activation function to use should be decided according to the specific problem, or found by experimentation.
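A small sketch comparing the two output ranges on the same inputs (the values are chosen only for illustration):

import torch

x = torch.linspace(-3, 3, 7)
print(torch.sigmoid(x))  # all values in (0, 1), so the mean is above 0
print(torch.tanh(x))     # values in (-1, 1), symmetric around 0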
3. ReLU function
The ReLU function is defined as ReLU(x) = max(0, x); its curve is shown below.
[Figure: ReLU function curve]
ReLU (Rectified Linear Unit) is an activation function that is currently more popular than sigmoid and tanh. It has the following advantages:
1) When the input is positive, there is no gradient saturation problem.
2) It is much faster to compute. ReLU only involves a linear relation, so both forward and backward propagation are much faster than with sigmoid and tanh (sigmoid and tanh need to compute an exponential, which is slower).
Of course, there are also disadvantages:
1) When the input is negative, ReLU is completely inactive, which means that once a negative value comes in, ReLU dies. During forward propagation this is not necessarily a problem: some regions are sensitive and some are not. However, during back-propagation a negative input gives a gradient of exactly 0, which is the same kind of problem the sigmoid and tanh functions have (a small sketch follows this list).
2) The output of the ReLU function is either 0 or a positive number, that is to say, ReLU is not a 0-centered function.
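A minimal sketch of point 1), using torch.relu and autograd (the input values are only an example): for negative inputs both the output and the gradient are exactly 0.

import torch

x = torch.tensor([-3.0, -1.0, 0.5, 2.0], requires_grad=True)
y = torch.relu(x)
y.sum().backward()

print(y)       # negative inputs are clamped to 0
print(x.grad)  # gradient is 0 for the negative inputs, 1 for the positive ones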
4. ELU function
The ELU function is defined as ELU(x) = x for x > 0 and α(e^x - 1) for x ≤ 0; its curve is shown below.
[Figure: ELU function curve]
ELU is a modified version of ReLU. Compared with ReLU, when the input is negative ELU still produces some output, and this output part also has some robustness to noise. This eliminates the dying-ReLU problem, but ELU still has the problems of gradient saturation and exponential computation.
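A short sketch of ELU in PyTorch via torch.nn.functional.elu (alpha=1.0 is the default; the input values are only illustrative):

import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -1.0, 0.0, 2.0])
print(F.elu(x, alpha=1.0))  # negative inputs map to alpha * (exp(x) - 1), positive inputs pass through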
5. PReLU function
The PReLU function is defined as PReLU(x) = x for x > 0 and αx for x ≤ 0; its curve is shown below.
[Figure: PReLU function curve]
PReLU is also an improved version of ReLU. In the negative region PReLU has a small slope, which also avoids the dying-ReLU problem. Compared with ELU, PReLU is a linear operation in the negative region: the slope may be small, but it does not tend to 0, which can be considered an advantage of it.
Looking at the PReLU formula, the parameter α is generally taken between 0 and 1, and usually quite small, such as a few hundredths. When α = 0.01, we call PReLU Leaky ReLU; it can be regarded as a special case of PReLU.
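A minimal sketch of both variants in PyTorch: nn.PReLU keeps α as a learnable parameter (initialized to 0.25 by default), while F.leaky_relu uses a fixed slope such as 0.01 (the input values are only illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 1.0, 3.0])

prelu = nn.PReLU()                            # learnable alpha, initialized to 0.25
print(prelu(x))                               # negative inputs are scaled by alpha
print(F.leaky_relu(x, negative_slope=0.01))   # fixed small slope: the Leaky ReLU case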
Overall, each of these activation functions has its own advantages and disadvantages. There is no rule that says which one is good and which is bad; you have to find out what works best through your own experiments.
 
The activation function code is shown below:
import torch
from torch.autograd import Variable  # legacy wrapper; in modern PyTorch a tensor can be used directly
import matplotlib.pyplot as plt
import torch.nn.functional as F

# 200 evenly spaced input points in [-5, 5]
x = torch.linspace(-5, 5, 200)
x = Variable(x)
x_np = x.data.numpy()  # convert to numpy for plotting

# apply each activation function and convert the results to numpy
y_relu = torch.relu(x).data.numpy()
y_sigmoid = torch.sigmoid(x).data.numpy()
y_tanh = torch.tanh(x).data.numpy()
y_softplus = F.softplus(x).data.numpy()

# plot the four activation curves in a 2x2 grid
plt.figure(1, figsize=(8, 6))

plt.subplot(221)
plt.plot(x_np, y_relu, c='red', label='relu')
plt.ylim(-1, 5)
plt.legend(loc='best')

plt.subplot(222)
plt.plot(x_np, y_sigmoid, c='red', label='sigmoid')
plt.ylim(-0.2, 1.2)
plt.legend(loc='best')

plt.subplot(223)
plt.plot(x_np, y_tanh, c='red', label='tanh')
plt.ylim(-1.2, 1.2)
plt.legend(loc='best')

plt.subplot(224)
plt.plot(x_np, y_softplus, c='red', label='softplus')
plt.ylim(-0.2, 6)
plt.legend(loc='best')

plt.show()
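As a usage note, here is one way these activations are typically plugged into a small network, sketched with nn.Sequential (the layer sizes are arbitrary):

import torch
import torch.nn as nn

# a tiny two-layer net; swapping nn.ReLU() for nn.Tanh(), nn.ELU() or nn.PReLU()
# changes only the activation, everything else stays the same
net = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 1),
    nn.Sigmoid(),  # sigmoid output for binary classification
)
print(net(torch.randn(4, 10)))  # a batch of 4 random samples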

Source: www.cnblogs.com/henuliulei/p/11364417.html