PyTorch's hook mechanism: obtaining the output of a specific intermediate layer of a pre-trained/trained model

To understand a neural network model more deeply, we sometimes need to inspect information produced during training, such as the convolution kernels, feature maps, or gradients; this is common in CNN visualization research. Of these, the convolution kernels are the easiest to obtain, since they can be read directly from the saved model parameters. Feature maps are intermediate variables that the framework releases once they have been used, since keeping them would occupy a large amount of memory. Gradients are similar: except for leaf nodes, the gradients of intermediate variables are freed after backpropagation, so they cannot be read directly.
The simplest way to obtain them is to change the model structure: at the end of forward, return not only the model's predicted output but also the required feature maps and other information.

So how can we obtain feature maps, gradients, and similar information without changing the model structure (for example, for a pre-trained or already trained model)?
PyTorch's hook mechanism can capture, and even modify, a model's intermediate variables and gradients without changing the network structure.

A hook can extract or change the gradient of a Tensor, and can also capture the input, output, and gradients of an nn.Module. Accordingly, there are three hook registration functions:

(1) Tensor objects:
Tensor.register_hook(hook)

(2) Module objects:
nn.Module.register_forward_hook(hook)
nn.Module.register_backward_hook(hook)
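For orientation, the call signatures these hooks expect can be written as Python stubs (a rough sketch following the official docs; the bodies are placeholders):

def tensor_hook(grad):
    # for Tensor.register_hook; may return a new gradient or None
    ...

def forward_hook(module, input, output):
    # for nn.Module.register_forward_hook; may return a modified output or None
    ...

def backward_hook(module, grad_in, grad_out):
    # for nn.Module.register_backward_hook; may return new input-end gradients or None
    ...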



Their usage is introduced one by one below:

Tensor object

The official documentation for Tensor objects is at https://pytorch.org/docs/stable/tensors.html.
A Tensor has a register_hook(hook) method, which registers a backward hook used to obtain the gradient of the variable. The hook must have the signature hook(grad) -> Tensor or None, where grad is the gradient that was computed.

Function: register a backpropagation hook to automatically capture the gradient of a Tensor.
After a backward pass, PyTorch automatically releases the gradients of intermediate variables (non-leaf nodes) to reduce memory usage. What is an intermediate variable? What is a non-leaf node?
[Figure: a computation graph in which a, b, and d are leaf nodes and c, e, and o are non-leaf nodes]
In the figure, a, b, and d are leaf nodes, while c, e, and o are non-leaf nodes, i.e. intermediate variables.

In [1]: a = torch.Tensor([1,2]).requires_grad_() 
    ...: b = torch.Tensor([3,4]).requires_grad_() 
    ...: d = torch.Tensor([2]).requires_grad_() 
    ...: c = a + b 
    ...: e = c * d 
    ...: o = e.sum()     

In [2]: o.backward()

In [3]: print(a.grad)
Out[3]: tensor([2., 2.])

In [4]: print(b.grad)
Out[4]: tensor([2., 2.])

In [5]: print(c.grad)
Out[5]: None

In [6]: print(d.grad)
Out[6]: tensor([10.])

In [7]: print(e.grad)
Out[7]: None

In [8]: print(o.grad)
Out[8]: None

As the output shows, a, b, and d are leaf nodes, so their gradients are still retained after backpropagation, whereas the gradients of the non-leaf nodes have been released automatically. To obtain them, we need to use hooks.

We first define a hook function that describes what to do with the Tensor's gradient, then register it on the non-leaf Tensor whose gradient we want with Tensor.register_hook(hook), and finally run backpropagation again:

In [9]: def hook(grad):
    ...:     print(grad)
    ...:

In [10]: a = torch.Tensor([1,2]).requires_grad_() 
    ...: b = torch.Tensor([3,4]).requires_grad_() 
    ...: d = torch.Tensor([2]).requires_grad_() 
    ...: c = a + b 
    ...: e = c * d 
    ...: o = e.sum()

In [11]: e.register_hook(hook)
Out[11]: <torch.utils.hooks.RemovableHandle at 0x1d139cf0a88>

In [12]: o.backward()
tensor([1., 1.])

The graph is rebuilt in In [10] because the earlier backward pass has already freed it (calling o.backward() a second time on the same graph raises an error). When o.backward() runs, the registered hook fires and the gradient of e is printed automatically; note that e.grad itself remains None, because the hook only observes the gradient as it flows through the non-leaf node.

The name of the custom hook function can be anything; its only parameter is grad, the gradient of the Tensor. The body of the function describes what to do with that gradient. In the example above we simply print it, so the body is print(grad). We can also store the gradient in a list or dictionary, or even modify it, for instance scaling up a small gradient to help prevent vanishing gradients:

In [13]: a = torch.Tensor([1,2]).requires_grad_() 
    ...: b = torch.Tensor([3,4]).requires_grad_() 
    ...: d = torch.Tensor([2]).requires_grad_() 
    ...: c = a + b 
    ...: e = c * d 
    ...: o = e.sum()                                                            

In [14]: grad_list = []                                                         

In [15]: def hook(grad): 
    ...:     grad_list.append(grad)    # store the gradient in the list
    ...:     return 2 * grad           # double the gradient
    ...:                                                                        

In [16]: c.register_hook(hook)                                                  
Out[16]: <torch.utils.hooks.RemovableHandle at 0x7f009b713208>

In [17]: o.backward()                                                           

In [18]: print(grad_list)                                                              
Out[18]: [tensor([2., 2.])]

In [19]: print(a.grad)                                                                 
Out[19]: tensor([4., 4.])

In [20]: print(b.grad)                                                                 
Out[20]: tensor([4., 4.])

In this example the hook does two things: it appends the gradient to the list grad_list, and it doubles the gradient. The output shows that after backpropagation, the gradient of the registered non-leaf node c is stored in grad_list, and the gradients of a and b are doubled. Two things deserve attention: if you want to store the gradient in a list or dictionary, the container must be defined outside the hook function (for example at module level) so the hook can reference it; and if you want to change the gradient, the hook function must return the modified gradient.
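As a small variation on the example above, here is a sketch that stores gradients in a dictionary keyed by name and boosts a gradient when it becomes very small (the helper make_hook, the dictionary grads, and the threshold min_norm are all made up for illustration):

import torch

grads = {}  # container defined outside the hook, as noted above

def make_hook(name, min_norm=1e-3):
    # Returns a hook that records the gradient under `name` and, purely as an
    # illustration of boosting a vanishing gradient, scales it up when its
    # norm is very small.
    def hook(grad):
        grads[name] = grad.detach().clone()   # keep a copy for later inspection
        if grad.norm() < min_norm:
            return grad * 10.0                # the returned value replaces the gradient
    return hook

a = torch.tensor([1.0, 2.0], requires_grad=True)
c = a + 0.0                      # a non-leaf node we want to watch
o = (c * 1e-5).sum()             # downstream scaling makes c's gradient tiny
handle = c.register_hook(make_hook('c'))
o.backward()

print(grads['c'])                # tensor([1.0000e-05, 1.0000e-05]) recorded for c
print(a.grad)                    # tensor([1.0000e-04, 1.0000e-04]): the boosted gradient reaches a
handle.remove()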

To sum up, to obtain the gradient of a non-leaf Tensor we need to:
1) Define a hook function describing the operation on the gradient; the name is arbitrary and its only parameter is grad, the gradient of the Tensor.
2) Register it on the Tensor whose gradient we want with Tensor.register_hook(hook).
3) Run backpropagation.
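Putting the three steps together, a minimal helper might look like this (get_grad and the list recorded are just one way to package it; the handle is removed afterwards so the hook does not fire on later backward passes):

import torch

def get_grad(tensor, loss):
    # 1) define a hook that records the gradient of `tensor`
    recorded = []
    def hook(grad):
        recorded.append(grad.detach().clone())
    # 2) register the hook on the non-leaf tensor
    handle = tensor.register_hook(hook)
    # 3) run backpropagation
    loss.backward()
    handle.remove()
    return recorded[0]

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = torch.tensor([3.0, 4.0], requires_grad=True)
c = a + b
o = (c * 2).sum()
print(get_grad(c, o))    # tensor([2., 2.]), the gradient of the non-leaf node c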

Module object

There are two methods, register_forward_hook(hook) and register_backward_hook(hook), which correspond to hooks for the forward pass and the backward pass respectively.
Both operate on nn.Module objects, such as a convolutional layer (nn.Conv2d), a fully connected layer (nn.Linear), a pooling layer (nn.MaxPool2d, nn.AvgPool2d), an activation layer (nn.ReLU), or a small sub-module built with nn.Sequential.

An intermediate module of a model can likewise be regarded as an intermediate (non-leaf) node: its output is a feature map or activation, and the gradients flowing through it during backpropagation are released automatically, so hooks are needed to capture them.

As the names suggest, register_forward_hook captures the output of the forward pass, i.e. the feature map or activation, while register_backward_hook captures the gradients of the backward pass. Their usage is similar to register_hook introduced above. A hook should be removed promptly once it is no longer needed, to avoid the overhead of running it on every subsequent pass.
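One caveat before the examples below: since PyTorch 1.8, register_backward_hook is deprecated in favor of register_full_backward_hook. The replacement takes a hook with the same (module, grad_input, grad_output) signature, but its grad_input contains only the gradients with respect to the module's forward inputs; the weight and bias gradients are not included, unlike the old hook used in the rest of this article. A minimal sketch:

import torch
import torch.nn as nn

def full_backward_hook(module, grad_input, grad_output):
    # grad_input: gradients w.r.t. the module's forward inputs only
    # grad_output: gradients w.r.t. the module's forward outputs
    print('grad_input shapes :', [tuple(g.shape) for g in grad_input if g is not None])
    print('grad_output shapes:', [tuple(g.shape) for g in grad_output if g is not None])

conv = nn.Conv2d(3, 6, 3, 1, 1)
handle = conv.register_full_backward_hook(full_backward_hook)
x = torch.randn(1, 3, 8, 8, requires_grad=True)
conv(x).sum().backward()    # prints grad_input shapes: [(1, 3, 8, 8)], grad_output shapes: [(1, 6, 8, 8)]
handle.remove()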

(1) For register_forward_hook(hook) , the hook function is defined as follows:

# Three parameters: the module, the module's input, and the module's output.
# The body describes what to do with them; usually we only want the feature map,
# i.e. it is enough to operate on output.
def forward_hook(module, input, output):
    ...   # operations on module / input / output

The hook is called after the module's forward() has computed its output, so registering it does not change how forward runs; the hook may optionally return a value to replace the output, but the most common use is simply to record it. The typical scenario is extracting the output features of some intermediate layer (not the last layer) of a model without modifying the original model definition file, which is exactly what forward_hook enables.

import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3,6,3,1,1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(2,2)
        self.conv2 = nn.Conv2d(6,9,3,1,1)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(2,2)
        self.fc1 = nn.Linear(8*8*9, 120)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(120,10)

    def forward(self, x):
        out = self.pool1(self.relu1(self.conv1(x)))
        out = self.pool2(self.relu2(self.conv2(out)))
        out = out.view(out.shape[0], -1)
        out = self.relu3(self.fc1(out))
        out = self.fc2(out)

        return out

fmap_block = dict()  # holds the captured feature maps

def forward_hook(module, input, output):
    fmap_block['input'] = input      # a tuple containing the module's input(s)
    fmap_block['output'] = output    # the module's output tensor

net = Net()
net.load_state_dict(torch.load('checkpoint.pth'))  # load an already trained model
x = torch.randn(1, 3, 32, 32).requires_grad_()     # a random image as input
handle = net.conv2.register_forward_hook(forward_hook)  # register the hook
y = net(x)

# Show one channel of conv2's input feature map and one channel of its output feature map
plt.subplot(121)
plt.imshow(fmap_block['input'][0][0, 0, :, :].cpu().detach().numpy())
plt.subplot(122)
plt.imshow(fmap_block['output'][0, 0, :, :].cpu().detach().numpy())
plt.show()
print(fmap_block['input'][0].shape)
print(fmap_block['output'].shape)

handle.remove()
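The same pattern applies directly to an off-the-shelf pre-trained model. A minimal sketch using torchvision's resnet18 (assuming torchvision is installed; the dictionary features and the choice of layer4 are just for illustration, and older torchvision versions use pretrained=True instead of the weights argument):

import torch
import torchvision.models as models

features = {}

def save_output(module, input, output):
    features['layer4'] = output.detach()

model = models.resnet18(weights='IMAGENET1K_V1')  # downloads pretrained weights on first use
model.eval()

handle = model.layer4.register_forward_hook(save_output)
with torch.no_grad():
    _ = model(torch.randn(1, 3, 224, 224))
handle.remove()

print(features['layer4'].shape)   # torch.Size([1, 512, 7, 7])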

(2) For register_backward_hook(hook) , the hook function is defined as follows:

# Three parameters: the module, the gradients at the module's input end,
# and the gradients at the module's output end.
def backward_hook(module, grad_in, grad_out):
    ...   # operations on the gradients

Pay special attention: "input" and "output" here refer to the input and output of the forward pass. In other words, the gradient of the module's forward output corresponds to grad_out here. For example, for a linear module o = W*x + b, the input end consists of W, x, and b, and the output end is o.
grad_in and grad_out are tuples when the module has multiple inputs or outputs. For the linear module o = W*x + b, the input end includes W, x, and b, so grad_in is a tuple with three elements.

Note the differences from the forward hook:

  1. In the forward hook, the input is only x; it does not include W and b.
  2. The backward hook must return a Tensor (or tuple of Tensors) or None; it cannot modify its arguments in place,
    but it can return a new grad_in, which is then propagated back to the preceding module.
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3,6,3,1,1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(2,2)
        self.conv2 = nn.Conv2d(6,9,3,1,1)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(2,2)
        self.fc1 = nn.Linear(8*8*9, 120)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(120,10)

    def forward(self, x):
        out = self.pool1(self.relu1(self.conv1(x)))
        out = self.pool2(self.relu2(self.conv2(out)))
        out = out.view(out.shape[0], -1)
        out = self.relu3(self.fc1(out))
        out = self.fc2(out)

        return out

fmap_block = dict()  # holds the feature maps
grad_block = dict()  # holds the gradients
def forward_hook(module, input, output):
    fmap_block['input'] = input
    fmap_block['output'] = output
    
def backward_hook(module, grad_in, grad_out):
    grad_block['grad_in'] = grad_in
    grad_block['grad_out'] = grad_out

loss_func = nn.CrossEntropyLoss()

label = torch.empty(1, dtype=torch.long).random_(3)  # a random fake label
input_img = torch.randn(1,3,32,32).requires_grad_()  # a random fake image as input

net = Net()

# register the hooks
handle_forward = net.conv2.register_forward_hook(forward_hook)
handle_backward = net.conv2.register_backward_hook(backward_hook)

outs = net(input_img)
loss = loss_func(outs, label)
loss.backward()

handle_forward.remove()
handle_backward.remove()
print('End.')

In the program above we first define a simple convolutional neural network. We register hooks on the second convolutional module to capture its input and output as well as the gradients at its input and output ends, storing them in dictionaries. To keep the example self-contained, we randomly generate a fake image with the same size as a CIFAR-10 image, assign it a fake class label, and run backpropagation through the loss function to simulate one training step.

After running the program, the corresponding feature maps and gradients end up in the two dictionaries fmap_block and grad_block. Let's look at how many entries the inputs and outputs have:

In [21]: print(len(fmap_block['input']))                                               
Out[21]: 1

In [22]: print(len(fmap_block['output']))                                              
Out[22]: 1

In [23]: print(len(grad_block['grad_in']))                                             
Out[23]: 3

In [24]: print(len(grad_block['grad_out']))                                            
Out[24]: 1

As we can see, the second convolutional module has a single input and a single output, namely the corresponding feature maps. There are three gradient values at the input end — those of the weights, the bias, and the input feature map — while there is only one gradient at the output end, that of the output feature map. As emphasized above, even though W, x, and b all sit at the input end, only x is an input for the forward pass, whereas for the backward pass all three receive gradients. In what order are the three gradients at the input end arranged? Let's check their shapes:

In [25]: print(grad_block['grad_in'][0].shape)                                         
Out[25]: torch.Size([1, 6, 16, 16])

In [26]: print(grad_block['grad_in'][1].shape)                                        
Out[26]: torch.Size([9, 6, 3, 3])

In [27]: print(grad_block['grad_in'][2].shape)                                         
Out[27]: torch.Size([9])

Judging from the shapes of the gradients at the input end, the first is clearly the gradient of the input feature map, the second the gradient of the weights (convolution kernels/filters), and the third the gradient of the bias. To verify that each gradient has the same shape as the corresponding forward-pass quantity, let's print those shapes:

In [28]: print(fmap_block['input'][0].shape)                                           
Out[28]: torch.Size([1, 6, 16, 16])

In [29]: print(net.conv2.weight.shape)         
Out[29]: torch.Size([9, 6, 3, 3])

In [30]: print(net.conv2.bias.shape)                                                   
Out[30]: torch.Size([9])
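If you want to check this correspondence programmatically rather than by eye, the weight and bias gradients captured by the hook can be compared against the .grad attributes accumulated on the parameters (a small sanity check, assuming the ordering observed above):

# The hooked weight/bias gradients should match the gradients stored on the parameters.
print(torch.allclose(grad_block['grad_in'][1], net.conv2.weight.grad))  # True
print(torch.allclose(grad_block['grad_in'][2], net.conv2.bias.grad))    # True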

Finally, note that if you want the gradient with respect to the input image itself, the input Tensor's requires_grad attribute must be set to True.
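In the example above, input_img was created with requires_grad_(), so after loss.backward() its gradient is available directly (a quick check that could be appended to the program above):

# input_img is a leaf node with requires_grad=True, so its gradient is retained.
print(input_img.grad.shape)   # torch.Size([1, 3, 32, 32])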

Reference blog:
Pytorch obtains intermediate layer information - hook function
HOOK of PyTorch - an effective tool for obtaining neural network features and gradients
