Gradient Computation for Convolutional Neural Networks

In many studies on the interpretability of neural networks, the loss function (or a class score) is used to compute the gradient of the last layer of feature maps, as in the well-known Grad-CAM. Our understanding of convolutional neural networks therefore cannot stop at the stage of simply calling library functions; we need to take apart the black box that computes these gradients.
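In PyTorch, such feature-map gradients are usually captured with hooks. Below is a minimal sketch of the idea, using a small made-up CNN rather than a real classifier; the layer index, class index, and dictionary names are placeholders for illustration only.

import torch
import torch.nn as nn

# A tiny stand-in CNN: two conv layers, global pooling, and a classifier head.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, kernel_size=3, padding=1), nn.ReLU(),   # "last" conv layer
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)

feats, grads = {}, {}

def forward_hook(module, inputs, output):
    feats["A"] = output                                          # last conv feature map A
    output.register_hook(lambda g: grads.__setitem__("dA", g))   # gradient of the score w.r.t. A

model[2].register_forward_hook(forward_hook)   # hook the last conv layer

x = torch.randn(1, 3, 32, 32)                  # a dummy input image
score = model(x)[0, 3]                         # score of one class
score.backward()                               # fills grads["dA"]

print(feats["A"].shape, grads["dA"].shape)     # both torch.Size([1, 8, 32, 32])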
[Figure: a 3×3 feature map A is convolved with a 2×2 kernel K to give a 2×2 feature map O, which is flattened and passed through an MLP to produce an output vector Y of length 2]

As shown in the figure, suppose we have a feature map $A$. After applying a $2 \times 2$ convolution kernel $K$, we obtain a new feature map $O$, which is then flattened and passed through an MLP to produce an output vector $Y$ of length 2.
If we want to know how much each element of $A$ contributes to the final output, we need to compute the partial derivative of $Y$ with respect to each element of $A$, that is, $\frac{\partial Y}{\partial A}$.
Let us first lay out the process by which the feature map $A$ produces the output $Y$. It can be written as:
$O = CONV(A)$
$Y = MLP(O)$
Therefore, by the chain rule, $\frac{\partial Y}{\partial A} = \frac{\partial Y}{\partial O} \frac{\partial O}{\partial A}$.
Take the output $Y_1 = 68$ as an example. Since $Y_1 = 0 \cdot O_{11} + 1 \cdot O_{12} + 0 \cdot O_{21} + 1 \cdot O_{22}$, we have $\frac{\partial Y_1}{\partial O} = [0 \quad 1 \quad 0 \quad 1]$.
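Concretely, with the example values of $A$ and $K$ used in the code below, the convolution output works out to $O = \begin{bmatrix} 19 & 25 \\ 37 & 43 \end{bmatrix}$, and since the first row of the MLP weight matrix is $[0 \quad 1 \quad 0 \quad 1]$, we get $Y_1 = 0 \cdot 19 + 1 \cdot 25 + 0 \cdot 37 + 1 \cdot 43 = 68$.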
Next, let us calculate $\frac{\partial O}{\partial A} = \begin{bmatrix} \frac{\partial O_{11}}{\partial A_{11}} & \frac{\partial O_{11}}{\partial A_{12}} & \frac{\partial O_{11}}{\partial A_{13}} & \frac{\partial O_{11}}{\partial A_{21}} & \dots & \frac{\partial O_{11}}{\partial A_{33}} \\ \frac{\partial O_{12}}{\partial A_{11}} & \frac{\partial O_{12}}{\partial A_{12}} & \frac{\partial O_{12}}{\partial A_{13}} & \frac{\partial O_{12}}{\partial A_{21}} & \dots & \frac{\partial O_{12}}{\partial A_{33}} \\ \frac{\partial O_{21}}{\partial A_{11}} & \frac{\partial O_{21}}{\partial A_{12}} & \frac{\partial O_{21}}{\partial A_{13}} & \frac{\partial O_{21}}{\partial A_{21}} & \dots & \frac{\partial O_{21}}{\partial A_{33}} \\ \frac{\partial O_{22}}{\partial A_{11}} & \frac{\partial O_{22}}{\partial A_{12}} & \frac{\partial O_{22}}{\partial A_{13}} & \frac{\partial O_{22}}{\partial A_{21}} & \dots & \frac{\partial O_{22}}{\partial A_{33}} \end{bmatrix} = C^T$.
Each row of this Jacobian is just the kernel $K$ written into the positions of $A$ that the corresponding output element covers. Finally, we combine the results: multiplying $\frac{\partial Y_1}{\partial O} = [0 \quad 1 \quad 0 \quad 1]$ by $\frac{\partial O}{\partial A}$ selects and sums the rows belonging to $O_{12}$ and $O_{22}$, and reshaping to the same shape as $A$ gives $\begin{bmatrix} 0 & 0 & 1 \\ 0 & 2 & 4 \\ 0 & 2 & 3 \end{bmatrix}$.
The following code implements the calculation above; you can verify that the results it produces match the derivation.

import torch
import torch.nn as nn

# The 3x3 feature map A, shaped (batch, channels, height, width).
A = torch.tensor([[0, 1, 2],
                  [3, 4, 5],
                  [6, 7, 8]]).reshape(1, 1, 3, 3).float()
A.requires_grad = True

# The 2x2 convolution kernel K.
kernel = torch.tensor([[0, 1],
                       [2, 3]]).reshape(1, 1, 2, 2).float()
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=2, bias=False)
conv.weight.data = kernel

# The MLP: a single linear layer mapping the flattened O to a length-2 Y.
fc = nn.Linear(in_features=4, out_features=2, bias=False)
fc.weight.data = torch.tensor([[0, 1, 0, 1],
                               [1, 0, 1, 1]]).float()

O = conv(A)
print(O)       # [[19., 25.], [37., 43.]]
Y = fc(torch.flatten(O, start_dim=1))
print(Y)       # [[68., 99.]]

Y[0][0].backward()   # backpropagate from Y_1 = 68

print(A.grad)        # [[0., 0., 1.], [0., 2., 4.], [0., 2., 3.]]
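As a cross-check on the derivation, the two Jacobians can also be computed directly with torch.autograd.functional.jacobian and multiplied according to the chain rule. The sketch below reuses the same example values; the names J_OA, J_YO, and J_YA are just illustrative.

import torch
import torch.nn as nn
from torch.autograd.functional import jacobian

# Rebuild the same example as above.
conv = nn.Conv2d(1, 1, kernel_size=2, bias=False)
conv.weight.data = torch.tensor([[0., 1.], [2., 3.]]).reshape(1, 1, 2, 2)
fc = nn.Linear(4, 2, bias=False)
fc.weight.data = torch.tensor([[0., 1., 0., 1.], [1., 0., 1., 1.]])
A = torch.arange(9.).reshape(1, 1, 3, 3)

# dO/dA as a 4x9 matrix (rows: O11, O12, O21, O22; columns: A11 ... A33).
J_OA = jacobian(lambda a: conv(a).flatten(), A).reshape(4, 9)

# dY/dO as a 2x4 matrix; for a bias-free linear layer this is just fc.weight.
O_flat = torch.flatten(conv(A), start_dim=1).detach()
J_YO = jacobian(lambda o: fc(o).flatten(), O_flat).reshape(2, 4)

# Chain rule: dY/dA = dY/dO @ dO/dA.
J_YA = J_YO @ J_OA
print(J_YO[0])                 # tensor([0., 1., 0., 1.])
print(J_YA[0].reshape(3, 3))   # tensor([[0., 0., 1.], [0., 2., 4.], [0., 2., 3.]])

The first row of J_YA, reshaped to 3×3, is exactly the A.grad printed by the previous listing.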

Origin blog.csdn.net/loki2018/article/details/127864790