I have recently been experimenting with custom activation functions in PyTorch. How do you use a custom activation function in PyTorch?
If the custom function is differentiable, you can simply write it as an ordinary Python function and call it, because PyTorch's autograd will compute its derivative automatically.
If the custom activation function is not differentiable everywhere, for example a piecewise function similar to ReLU, you need to write a class that inherits from torch.autograd.Function and define its forward and backward passes yourself.
The PyTorch tutorials include one on defining a new autograd function: https://pytorch.org/tutorials/beginner/examples_autograd/two_layer_net_custom_function.html . The tutorial uses ReLU as its example; both forward and backward have to be defined by hand.
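As a minimal sketch of this differentiable case, the example below writes a smooth activation as an ordinary Python function and lets autograd differentiate it. The function name `swish` and the choice of x * sigmoid(x) are illustrative assumptions, not something taken from the experiments in this post.

```python
import torch


def swish(x):
    # Illustrative smooth activation built only from differentiable torch ops.
    return x * torch.sigmoid(x)


x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)
swish(x).sum().backward()

# Analytic derivative for comparison: s + x * s * (1 - s), with s = sigmoid(x).
with torch.no_grad():
    s = torch.sigmoid(x)
    expected = s + x * s * (1 - s)

print(x.grad)
print(torch.allclose(x.grad, expected))
```

No backward method is written here; autograd builds it from the multiplication and the sigmoid.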
import torch


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache arbitrary
        objects for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()
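A hand-written backward pass is easy to get wrong, so it can help to check it numerically. The sketch below repeats the MyReLU class from the tutorial and runs torch.autograd.gradcheck on it; the test inputs and the tolerance values are assumptions chosen for illustration, and the inputs are deliberately kept away from 0, where ReLU has no derivative.

```python
import torch


class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        (input,) = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


# gradcheck compares the analytic gradient with a finite-difference estimate.
# It expects double-precision inputs that require grad.
x = torch.tensor([-2.0, -0.5, 0.5, 3.0], dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(MyReLU.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)
```

gradcheck raises an error when the two gradients disagree, so reaching the print statement already indicates agreement on these inputs.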
But what happens if, instead of defining ReLU in the proper way shown above, we simply write it as a plain Python function? What problems does that cause?
Here we compare the results of MyReLU with those of a plain function, no_back, defined as follows.
def no_back(x):
    return x * (x > 0).float()
Code:
from copy import deepcopy

N, D_in, H, D_out = 2, 3, 4, 5

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
origin_w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
origin_w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-3


def myReLU(func, x, y, origin_w1, origin_w2, learning_rate, N=2, D_in=3, H=4, D_out=5):
    # Copy the initial weights so that both runs start from the same values.
    w1 = deepcopy(origin_w1)
    w2 = deepcopy(origin_w2)
    for t in range(5):
        # Forward pass: compute predicted y using operations; we compute
        # ReLU using the activation passed in as func.
        y_pred = func(x.mm(w1)).mm(w2)

        # Compute and print loss
        loss = (y_pred - y).pow(2).sum()
        print("------", t, loss.item(), "------------")

        # Use autograd to compute the backward pass.
        loss.backward()

        # Update weights using gradient descent
        with torch.no_grad():
            print('w1 = ')
            print(w1)
            print('---------------------')
            print("x.mm(w1) = ")
            print(x.mm(w1))
            print('---------------------')
            print('func(x.mm(w1))')
            print(func(x.mm(w1)))
            print('---------------------')
            print("w1.grad:", w1.grad)
            # print("w2.grad:",w2.grad)
            print('---------------------')
            w1 -= learning_rate * w1.grad
            w2 -= learning_rate * w2.grad

            # Manually zero the gradients after updating weights
            w1.grad.zero_()
            w2.grad.zero_()
            print('========================')
            print()


myReLU(func=MyReLU.apply, x=x, y=y, origin_w1=origin_w1, origin_w2=origin_w2,
       learning_rate=learning_rate, N=2, D_in=3, H=4, D_out=5)
print('============')
print('============')
print('============')
myReLU(func=no_back, x=x, y=y, origin_w1=origin_w1, origin_w2=origin_w2,
       learning_rate=learning_rate, N=2, D_in=3, H=4, D_out=5)
The experimental results for MyReLU.apply are as follows:
------ 0 20.18220329284668 ------------
w1 = 
tensor([[ 0.7070,  2.5772,  0.7987,  2.2287],
        [ 0.7425, -0.6309,  0.3268, -1.5072],
        [ 0.6930, -2.6128,  0.1949,  0.8819]], requires_grad=True)
---------------------
x.mm(w1) = 
tensor([[-0.9788,  1.0135, -0.4164,  1.8834],
        [-0.7692, -1.8556, -0.7085, -0.9849]])
---------------------
func(x.mm(w1))
tensor([[0.0000, 1.0135, 0.0000, 1.8834],
        [0.0000, 0.0000, 0.0000, 0.0000]])
---------------------
w1.grad: tensor([[  0.0000,   0.0499,   0.0000,   0.1881],
        [  0.0000,  -4.4962,   0.0000, -16.9378],
        [  0.0000,  -0.2401,   0.0000,  -0.9043]])
---------------------
========================

------ 1 19.546737670898438 ------------
w1 = 
tensor([[ 0.7070,  2.5772,  0.7987,  2.2285],
        [ 0.7425, -0.6265,  0.3268, -1.4903],
        [ 0.6930, -2.6126,  0.1949,  0.8828]], requires_grad=True)
---------------------
x.mm(w1) = 
tensor([[-0.9788,  1.0078, -0.4164,  1.8618],
        [-0.7692, -1.8574, -0.7085, -0.9915]])
---------------------
func(x.mm(w1))
tensor([[0.0000, 1.0078, 0.0000, 1.8618],
        [0.0000, 0.0000, 0.0000, 0.0000]])
---------------------
w1.grad: tensor([[  0.0000,   0.0483,   0.0000,   0.1827],
        [  0.0000,  -4.3446,   0.0000, -16.4493],
        [  0.0000,  -0.2320,   0.0000,  -0.8782]])
---------------------
========================

------ 2 18.94647789001465 ------------
w1 = 
tensor([[ 0.7070,  2.5771,  0.7987,  2.2283],
        [ 0.7425, -0.6221,  0.3268, -1.4738],
        [ 0.6930, -2.6123,  0.1949,  0.8837]], requires_grad=True)
---------------------
x.mm(w1) = 
tensor([[-0.9788,  1.0023, -0.4164,  1.8409],
        [-0.7692, -1.8591, -0.7085, -0.9978]])
---------------------
func(x.mm(w1))
tensor([[0.0000, 1.0023, 0.0000, 1.8409],
        [0.0000, 0.0000, 0.0000, 0.0000]])
---------------------
w1.grad: tensor([[  0.0000,   0.0467,   0.0000,   0.1775],
        [  0.0000,  -4.2009,   0.0000, -15.9835],
        [  0.0000,  -0.2243,   0.0000,  -0.8534]])
---------------------
========================

------ 3 18.378826141357422 ------------
w1 = 
tensor([[ 0.7070,  2.5771,  0.7987,  2.2281],
        [ 0.7425, -0.6179,  0.3268, -1.4578],
        [ 0.6930, -2.6121,  0.1949,  0.8846]], requires_grad=True)
---------------------
x.mm(w1) = 
tensor([[-0.9788,  0.9969, -0.4164,  1.8206],
        [-0.7692, -1.8607, -0.7085, -1.0040]])
---------------------
func(x.mm(w1))
tensor([[0.0000, 0.9969, 0.0000, 1.8206],
        [0.0000, 0.0000, 0.0000, 0.0000]])
---------------------
w1.grad: tensor([[  0.0000,   0.0451,   0.0000,   0.1726],
        [  0.0000,  -4.0644,   0.0000, -15.5391],
        [  0.0000,  -0.2170,   0.0000,  -0.8296]])
---------------------
========================

------ 4 17.841421127319336 ------------
w1 = 
tensor([[ 0.7070,  2.5770,  0.7987,  2.2280],
        [ 0.7425, -0.6138,  0.3268, -1.4423],
        [ 0.6930, -2.6119,  0.1949,  0.8854]], requires_grad=True)
---------------------
x.mm(w1) = 
tensor([[-0.9788,  0.9918, -0.4164,  1.8008],
        [-0.7692, -1.8623, -0.7085, -1.0100]])
---------------------
func(x.mm(w1))
tensor([[0.0000, 0.9918, 0.0000, 1.8008],
        [0.0000, 0.0000, 0.0000, 0.0000]])
---------------------
w1.grad: tensor([[  0.0000,   0.0437,   0.0000,   0.1679],
        [  0.0000,  -3.9346,   0.0000, -15.1145],
        [  0.0000,  -0.2101,   0.0000,  -0.8070]])
---------------------
========================
The experimental results for no_back are as follows:
------ 0 20.18220329284668 ------------
w1 = 
tensor([[ 0.7070,  2.5772,  0.7987,  2.2287],
        [ 0.7425, -0.6309,  0.3268, -1.5072],
        [ 0.6930, -2.6128,  0.1949,  0.8819]], requires_grad=True)
---------------------
x.mm(w1) = 
tensor([[-0.9788,  1.0135, -0.4164,  1.8834],
        [-0.7692, -1.8556, -0.7085, -0.9849]])
---------------------
func(x.mm(w1))
tensor([[-0.0000,  1.0135, -0.0000,  1.8834],
        [-0.0000, -0.0000, -0.0000, -0.0000]])
---------------------
w1.grad: tensor([[  0.0000,   0.0499,   0.0000,   0.1881],
        [  0.0000,  -4.4962,   0.0000, -16.9378],
        [  0.0000,  -0.2401,   0.0000,  -0.9043]])
---------------------
========================

------ 1 19.546737670898438 ------------
w1 = 
tensor([[ 0.7070,  2.5772,  0.7987,  2.2285],
        [ 0.7425, -0.6265,  0.3268, -1.4903],
        [ 0.6930, -2.6126,  0.1949,  0.8828]], requires_grad=True)
---------------------
x.mm(w1) = 
tensor([[-0.9788,  1.0078, -0.4164,  1.8618],
        [-0.7692, -1.8574, -0.7085, -0.9915]])
---------------------
func(x.mm(w1))
tensor([[-0.0000,  1.0078, -0.0000,  1.8618],
        [-0.0000, -0.0000, -0.0000, -0.0000]])
---------------------
w1.grad: tensor([[  0.0000,   0.0483,   0.0000,   0.1827],
        [  0.0000,  -4.3446,   0.0000, -16.4493],
        [  0.0000,  -0.2320,   0.0000,  -0.8782]])
---------------------
========================

------ 2 18.94647789001465 ------------
w1 = 
tensor([[ 0.7070,  2.5771,  0.7987,  2.2283],
        [ 0.7425, -0.6221,  0.3268, -1.4738],
        [ 0.6930, -2.6123,  0.1949,  0.8837]], requires_grad=True)
---------------------
x.mm(w1) = 
tensor([[-0.9788,  1.0023, -0.4164,  1.8409],
        [-0.7692, -1.8591, -0.7085, -0.9978]])
---------------------
func(x.mm(w1))
tensor([[-0.0000,  1.0023, -0.0000,  1.8409],
        [-0.0000, -0.0000, -0.0000, -0.0000]])
---------------------
w1.grad: tensor([[  0.0000,   0.0467,   0.0000,   0.1775],
        [  0.0000,  -4.2009,   0.0000, -15.9835],
        [  0.0000,  -0.2243,   0.0000,  -0.8534]])
---------------------
========================

------ 3 18.378826141357422 ------------
w1 = 
tensor([[ 0.7070,  2.5771,  0.7987,  2.2281],
        [ 0.7425, -0.6179,  0.3268, -1.4578],
        [ 0.6930, -2.6121,  0.1949,  0.8846]], requires_grad=True)
---------------------
x.mm(w1) = 
tensor([[-0.9788,  0.9969, -0.4164,  1.8206],
        [-0.7692, -1.8607, -0.7085, -1.0040]])
---------------------
func(x.mm(w1))
tensor([[-0.0000,  0.9969, -0.0000,  1.8206],
        [-0.0000, -0.0000, -0.0000, -0.0000]])
---------------------
w1.grad: tensor([[  0.0000,   0.0451,   0.0000,   0.1726],
        [  0.0000,  -4.0644,   0.0000, -15.5391],
        [  0.0000,  -0.2170,   0.0000,  -0.8296]])
---------------------
========================

------ 4 17.841421127319336 ------------
w1 = 
tensor([[ 0.7070,  2.5770,  0.7987,  2.2280],
        [ 0.7425, -0.6138,  0.3268, -1.4423],
        [ 0.6930, -2.6119,  0.1949,  0.8854]], requires_grad=True)
---------------------
x.mm(w1) = 
tensor([[-0.9788,  0.9918, -0.4164,  1.8008],
        [-0.7692, -1.8623, -0.7085, -1.0100]])
---------------------
func(x.mm(w1))
tensor([[-0.0000,  0.9918, -0.0000,  1.8008],
        [-0.0000, -0.0000, -0.0000, -0.0000]])
---------------------
w1.grad: tensor([[  0.0000,   0.0437,   0.0000,   0.1679],
        [  0.0000,  -3.9346,   0.0000, -15.1145],
        [  0.0000,  -0.2101,   0.0000,  -0.8070]])
---------------------
========================
Comparing the two runs, the gradients, the updated weights, and the losses are all identical. Does this mean that, for a function that is not differentiable everywhere, a plainly defined function gives the same correct result as one whose forward and backward passes are defined explicitly?
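One way to see why the two runs agree is to look at the gradient autograd produces for no_back on its own. The sketch below is an illustration under assumptions: the input values are borrowed from the printed x.mm(w1), and the variable name `mask` is introduced here only for the comparison.

```python
import torch


def no_back(x):
    return x * (x > 0).float()


x = torch.tensor([[-0.9788, 1.0135], [-0.7692, -1.8556]], requires_grad=True)
no_back(x).sum().backward()

# The comparison (x > 0) is not tracked by autograd, so the mask acts as a
# constant factor and the gradient of x * mask with respect to x is the mask.
mask = (x > 0).float()
print(x.grad)
print(torch.equal(x.grad, mask))
```

This is the same rule MyReLU.backward applies by hand: the incoming gradient passes through where the input is positive and is zeroed where it is negative.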
One thing does stand out, however: where the value is 0, the MyReLU.apply results display 0.0000, while the no_back results display -0.0000.
What is the difference between 0.0000 and -0.0000?
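The origin of the minus sign can be reproduced with plain Python floats, which follow the same IEEE 754 rules. This is a small sketch, not the PyTorch code path itself; the variable names are chosen for illustration and the input value is one negative entry taken from the printed x.mm(w1).

```python
import math

pre_activation = -0.9788           # a negative entry of x.mm(w1)
mask = float(pre_activation > 0)   # 0.0, mirroring (x > 0).float()

# Multiplying a negative number by +0.0 yields negative zero.
masked = pre_activation * mask
# A clamp-style ReLU returns the lower bound itself, which is positive zero.
clamped = max(pre_activation, 0.0)

print(masked, clamped)                                          # -0.0 0.0
print(math.copysign(1.0, masked), math.copysign(1.0, clamped))  # -1.0 1.0
```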
See this Stack Overflow answer: https://stackoverflow.com/questions/4083401/negative-zero-in-python
And the Wikipedia article on signed zero: https://en.wikipedia.org/wiki/Signed_zero
In Python the two are displayed differently, but in a numerical comparison they are equal.
-0.0 == +0.0 == 0
Python treats the two as equal in value, which avoids bugs in code that compares against zero.
>>> a = 3.4
>>> b = 4.4
>>> c = -0.0
>>> d = +0.0
>>> a*c
-0.0
>>> b*d
0.0
>>> a*c == b*d
True
Although there appears to be no difference between them in use, inside the computer they are encoded differently.
In a 1+7-bit sign-and-magnitude representation for integers, negative zero is represented by the bit string 10000000. In an 8-bit ones' complement representation, negative zero is represented by the bit string 11111111. In two's complement, however, there is no concept of negative zero. In the IEEE 754 binary floating-point standard, a zero has both the exponent and the mantissa equal to zero, and negative zero is the one whose sign bit is set.
In the IBM General Decimal Arithmetic encoding, which uses a decimal floating-point representation, negative zero is represented by any valid exponent for the encoding, a coefficient that is entirely zero, and a sign bit of one.
(Wikipedia)
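The IEEE 754 part of this description can be inspected from Python with the standard struct module. The helper name `float_bits` is an assumption made for this sketch; it packs a float as a big-endian 64-bit double and prints the bytes as hexadecimal.

```python
import struct


def float_bits(value):
    # Pack as a big-endian IEEE 754 double and render the 64 bits as hex.
    return struct.pack(">d", value).hex()


print(float_bits(0.0))    # 0000000000000000
print(float_bits(-0.0))   # 8000000000000000
```

Only the leading sign bit differs; the exponent and mantissa bits are zero in both cases.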
In numerical analysis, -0 is often regarded as a value that approaches 0 from the negative side, and +0 as a value that approaches 0 from the positive side. The two are equal in value, but some operations produce different results for them.
For example, divmod carries the sign through to the quotient:
>>> divmod(-0.0, 100)
(-0.0, 0.0)
>>> divmod(+0.0, 100)
(0.0, 0.0)
Another example is atan2 (see https://en.wikipedia.org/wiki/Atan2 ):
atan2(+0, +0) = +0;
atan2(+0, −0) = +π; (y approaches 0 from the positive side and x approaches 0 from the negative side, so the point can be seen as lying on the negative x-axis, approached from the second quadrant, and the angle is $\theta = \pi$)
atan2(−0, +0) = −0; (the point can be seen as lying on the positive x-axis, approached from the fourth quadrant, and the angle is $\theta = -0$)
atan2(−0, −0) = −π.
Verifying with code:
>>> import math
>>> math.atan2(0.0, 0.0) == math.atan2(-0.0, 0.0)
True
>>> math.atan2(0.0, -0.0) == math.atan2(-0.0, -0.0)
False
Therefore, although the plain-function activation above gives the same numerical results without being explicitly registered in PyTorch's autograd computation, the appearance of -0.0000 is a hint of a possible source of bugs. It is still better to define the function rigorously, as MyReLU does.
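To check for this in practice, a tensor can be scanned for negative zeros. The following is a minimal sketch, assuming a PyTorch version that provides torch.signbit; the helper name `has_negative_zero` and the sample input are hypothetical.

```python
import torch


def no_back(x):
    return x * (x > 0).float()


def has_negative_zero(t):
    # torch.signbit is True wherever the sign bit is set, including for -0.0.
    return bool(((t == 0) & torch.signbit(t)).any())


x = torch.tensor([-1.5, 2.0])
print(has_negative_zero(no_back(x)))        # plain-function ReLU
print(has_negative_zero(x.clamp(min=0)))    # clamp, as used in MyReLU.forward
```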