Deformable Conv: principle analysis and PyTorch implementation

1. Analysis of the principle of deformable convolution 

1.1 General convolution principle

        Traditional convolution divides the feature map into regions of the same size as the convolution kernel and then performs the convolution operation on each region; the position of each region on the feature map is fixed.


Figure 1 Ordinary convolution process 

        Figure 1 shows how an ordinary convolution is computed on an input feature map. A 3*3 convolution kernel slides over a 7*7 input feature map; at each position, the kernel weights are multiplied element-wise with the corresponding elements of the input feature map and summed to give one element of the output feature map, and sliding the window over the whole input yields the complete output feature map.

        Therefore, for any point p_{0} on the output feature map, the convolution operation can be expressed as:

y(p_{0})=\sum_{p_{n}\in R}w(p_{n})\cdot x(p_{0}+p_{n})

Formula 1 ordinary convolution operation

        Here p_{n} denotes the offset of each kernel position relative to the kernel center, which for a 3*3 kernel can be written as:

R=\left \{ (-1,-1),(-1,0),\ldots,(0,0),\ldots,(1,0),(1,1) \right \}

Formula 2 relative offsets of the kernel positions

Figure 2 Relative offsets of the positions in a 3*3 convolution kernel

 w(p_{n}) is the kernel weight at the corresponding position, x(p_{0}+p_{n}) is the element at position p_{0}+p_{n} of the input feature map, and y(p_{0}) is the element at position p_{0} of the output feature map, obtained by convolving the kernel with the input feature map.
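To make Formula 1 concrete, here is a minimal sketch of the ordinary convolution at one output point, assuming a single-channel input, an odd kernel size, stride 1, and no padding; conv_at, feat, and kernel are illustrative names, not library functions:

import torch

def conv_at(feat, kernel, p0):
    """y(p0) = sum over pn in R of w(pn) * x(p0 + pn), as in Formula 1."""
    k = kernel.size(0)
    r = k // 2                                 # offsets run from -r to +r
    y = 0.0
    for i in range(-r, r + 1):                 # pn = (i, j) enumerates R
        for j in range(-r, r + 1):
            y += kernel[i + r, j + r] * feat[p0[0] + i, p0[1] + j]
    return y

feat = torch.arange(49, dtype=torch.float32).view(7, 7)    # 7x7 input, as in Figure 1
kernel = torch.ones(3, 3) / 9.0                            # 3x3 averaging kernel
print(conv_at(feat, kernel, (3, 3)))                       # one element of the output map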

 1.2 Deformable convolution idea

         A conventional convolution kernel has a fixed size and shape, which may work well for objects with regular shapes. But what about objects with more complex deformations?

         The usual remedies are to enrich the dataset by introducing more samples with complex deformations, to apply various data augmentations and tricks, or to hand-design features and algorithms. But could the convolution kernel itself be made more flexible? This is the idea behind deformable convolution (Deformable Conv).

 Let's start with a picture to build intuition. Figure 3 compares the sampling of a standard convolution with that of a deformable convolution.

Figure 3 Sampling of standard convolution vs. deformable convolution

        The left-right comparison clearly shows that the sampling positions of the deformable convolution conform to the shape and size of the object itself, which standard convolution cannot do. As a result, the feature points in the top-level feature map of the deformable convolution capture the characteristics of the object as a whole: the features attend to the object itself, suppressing background noise and yielding more useful information than ordinary convolution.

 1.3 Deformable convolution principle

         As Figure 3 shows, the sampling positions of a deformable convolution are variable, or rather learnable, so deformable convolution can better accommodate changes in object shape.

        Figure 4 Different sampling points of deformable convolution

        In Figure 4, (a) shows the sampling of an ordinary 3x3 convolution kernel; (b) shows the sampling of a deformable convolution, where the learned offsets move the sampling points; and (c) and (d) are special cases of deformable convolution (scaling and rotation of the sampling grid).

        The principle of deformable convolution, therefore, is to learn offsets with an auxiliary network so that the kernel's sampling points on the input feature map are shifted toward the region or target of interest.

        Building on Formula 1, deformable convolution introduces an offset for each sampling point. The offsets are generated from the input feature map by a separate convolution and are typically fractional.

y(p_{0})=\sum_{p_{n}\in R}w(p_{n})\cdot x(p_{0}+p_{n}+\Delta p_{n})

Formula 3 deformable convolution operation

        where \Delta p_{n} denotes the offset.
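A minimal sketch of what Formula 3 changes: for one output location, the sampling grid is no longer the fixed set R but R shifted by the learned offsets. The offsets below are random stand-ins for what the offset-generating convolution would predict:

import torch

p0 = torch.tensor([3.0, 3.0])                 # output location mapped onto the input map
R = torch.tensor([(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)],
                 dtype=torch.float32)         # Formula 2, for a 3x3 kernel
delta = torch.rand(9, 2) - 0.5                # stand-in learned offsets, one pair per position

positions = p0 + R + delta                    # p0 + pn + delta_pn, generally fractional
print(positions)                              # 9 fractional sampling points instead of a rigid grid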

        Since the position after adding the offset is generally fractional and does not correspond to an actual pixel on the input feature map, interpolation is needed to obtain the pixel value there. Bilinear interpolation is the usual choice, expressed as follows:

 x(p)=\sum_{q}G(q,p)\cdot x(q)

G(q,p)=g(q_{x},p_{x})\cdot g(q_{y},p_{y})

g(a,b)=max(0,1-\left | a-b \right |)

Formula 4 bilinear interpolation

        Here the max(0, 1-...) in the last line restricts the interpolation to neighbor points q that lie within 1 pixel of the sampling point p.

        Bilinear interpolation sets the pixel value at the interpolation point to a weighted sum of its 4 neighboring pixels, the nearest pixels that actually exist on the feature map. Each neighbor's weight is determined by its horizontal and vertical distance to the interpolation point, and the weighted sum gives the pixel value at the interpolation point.


 Figure 5 Example diagram of bilinear interpolation

        The pixel value at point P is computed as a weighted sum of the four points Q_{11}, Q_{12}, Q_{21}, Q_{22}, where each point's weight is determined by its distance to P.
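A minimal sketch of Formula 4, assuming a single-channel feature map and a fractional point lying inside the map; the four terms correspond to Q_{11}, Q_{12}, Q_{21}, Q_{22} in Figure 5:

import torch

def bilinear(feat, px, py):
    """Value at fractional (px, py): weighted sum of the 4 integer neighbors,
    with per-axis weights g(a, b) = max(0, 1 - |a - b|)."""
    x0, y0 = int(px), int(py)          # floor: coordinates of Q11
    x1, y1 = x0 + 1, y0 + 1            # coordinates of Q22
    wx0, wx1 = x1 - px, px - x0        # weights fall off linearly with distance
    wy0, wy1 = y1 - py, py - y0
    return (feat[x0, y0] * wx0 * wy0 + feat[x0, y1] * wx0 * wy1 +
            feat[x1, y0] * wx1 * wy0 + feat[x1, y1] * wx1 * wy1)

feat = torch.arange(49, dtype=torch.float32).view(7, 7)
print(bilinear(feat, 2.3, 4.6))        # 20.7: the value at a fractional sampling point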

1.4 Deformable convolution structure

        Figure 6 is a schematic of deformable convolution. The offsets are generated by an additional convolution, separate from the one that performs the final convolution operation. In the figure, N is the number of positions in the kernel area; for a 3*3 kernel, N = 9. The green path is the convolution that learns the offsets, and the resulting offset field has 2N channels, because an offset in the x direction and one in the y direction are learned for every kernel position.

        As Figure 6 shows, the sampling area of an ordinary convolution on the input feature map is a kernel-sized square (the green box), whereas the sampling area of a deformable convolution is the set of points marked by the blue boxes. This is the difference between deformable and ordinary convolution.

Figure 6 Schematic diagram of deformable convolution

        The specific details of deformable convolution:

  1. A point on the output feature map corresponds to a K*K convolution sampling area on the input feature map. In deformable convolution, every sampling point in this K*K area must learn an offset, and an offset is expressed as a coordinate pair, so one output point requires 2*K*K learned values. For an output of size H*W, a total of 2*K*K*H*W values must be learned; that is, the offset field in the figure above (with N = K*K) has shape B*2*K*K*H*W, where B is the batch_size (a shape check follows this list);
  2. For an input feature map of shape B*C*H*W, all C channels share one offset field; that is, every channel of a feature map uses the same offsets;
  3. Deformable convolution does not change the spatial size of the input feature map, so the output feature map is also H*W.
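A quick shape check of point 1, assuming a 3x3 kernel (K = 3) and stride 1; offset_conv is a stand-in for the extra convolution that produces the offset field:

import torch
import torch.nn as nn

B, C, H, W, K = 2, 64, 32, 32, 3
x = torch.randn(B, C, H, W)

# 2*K*K output channels: one (x, y) offset pair per kernel position
offset_conv = nn.Conv2d(C, 2 * K * K, kernel_size=3, padding=1)
offset_field = offset_conv(x)
print(offset_field.shape)  # torch.Size([2, 18, 32, 32]), i.e. (B, 2*K*K, H, W)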

 2. Implementation of deformable convolution

2.1 Deformable convolution implementation process:

The implementation follows the logic diagram of blogger Facias; see the code below for details.


Figure 7 Deformable convolution implementation process 

 2.2 PyTorch implementation of deformable convolution

import torch
import torch.nn as nn


class DeformConv2d(nn.Module):
    def __init__(self, inc, outc, kernel_size=3, padding=1, stride=1, bias=False, modulation=False):
        """
        Args:
            modulation (bool, optional): If True, use Modulated Deformable Convolution (Deformable ConvNets v2).
        """
        super(DeformConv2d, self).__init__()
        self.kernel_size = kernel_size
        self.padding = padding
        self.stride = stride
        self.zero_padding = nn.ZeroPad2d(padding)
        # conv is the convolution actually applied at the end. Its stride equals the kernel size, because the feature map it convolves is built by expanding every output point into its kernel_size*kernel_size sampled points (see _reshape_x_offset).
        self.conv = nn.Conv2d(inc, outc, kernel_size=kernel_size, stride=kernel_size, bias=bias)
        # p_conv generates the offsets; its output has 2*kernel_size*kernel_size channels, one (x, y) offset pair for every kernel position.
        self.p_conv = nn.Conv2d(inc, 2*kernel_size*kernel_size, kernel_size=3, padding=1, stride=stride)
        nn.init.constant_(self.p_conv.weight, 0)
        self.p_conv.register_backward_hook(self._set_lr)

        self.modulation = modulation  # optional; if True, every kernel position is also given a scalar modulation weight (DCNv2).
        if modulation:
            self.m_conv = nn.Conv2d(inc, kernel_size*kernel_size, kernel_size=3, padding=1, stride=stride)
            nn.init.constant_(self.m_conv.weight, 0)
            self.m_conv.register_backward_hook(self._set_lr)

    @staticmethod
    def _set_lr(module, grad_input, grad_output):
        # Scale the gradients of the offset/modulation convs by 0.1, i.e. train
        # them with an effectively lower learning rate. A backward hook must
        # return the new grad_input tuple for the scaling to take effect;
        # the original version built generators and discarded them (a no-op).
        return tuple(g * 0.1 if g is not None else g for g in grad_input)

    def forward(self, x):
        offset = self.p_conv(x)
        if self.modulation:
            m = torch.sigmoid(self.m_conv(x))

        dtype = offset.data.type()
        ks = self.kernel_size
        N = offset.size(1) // 2

        if self.padding:
            x = self.zero_padding(x)

        # (b, 2N, h, w)
        p = self._get_p(offset, dtype)

        # (b, h, w, 2N)
        p = p.contiguous().permute(0, 2, 3, 1)
        q_lt = p.detach().floor()
        q_rb = q_lt + 1

        q_lt = torch.cat([torch.clamp(q_lt[..., :N], 0, x.size(2)-1), torch.clamp(q_lt[..., N:], 0, x.size(3)-1)], dim=-1).long()
        q_rb = torch.cat([torch.clamp(q_rb[..., :N], 0, x.size(2)-1), torch.clamp(q_rb[..., N:], 0, x.size(3)-1)], dim=-1).long()
        q_lb = torch.cat([q_lt[..., :N], q_rb[..., N:]], dim=-1)
        q_rt = torch.cat([q_rb[..., :N], q_lt[..., N:]], dim=-1)

        # clip p
        p = torch.cat([torch.clamp(p[..., :N], 0, x.size(2)-1), torch.clamp(p[..., N:], 0, x.size(3)-1)], dim=-1)

        # bilinear kernel (b, h, w, N)
        g_lt = (1 + (q_lt[..., :N].type_as(p) - p[..., :N])) * (1 + (q_lt[..., N:].type_as(p) - p[..., N:]))
        g_rb = (1 - (q_rb[..., :N].type_as(p) - p[..., :N])) * (1 - (q_rb[..., N:].type_as(p) - p[..., N:]))
        g_lb = (1 + (q_lb[..., :N].type_as(p) - p[..., :N])) * (1 - (q_lb[..., N:].type_as(p) - p[..., N:]))
        g_rt = (1 - (q_rt[..., :N].type_as(p) - p[..., :N])) * (1 + (q_rt[..., N:].type_as(p) - p[..., N:]))

        # (b, c, h, w, N)
        x_q_lt = self._get_x_q(x, q_lt, N)
        x_q_rb = self._get_x_q(x, q_rb, N)
        x_q_lb = self._get_x_q(x, q_lb, N)
        x_q_rt = self._get_x_q(x, q_rt, N)

        # (b, c, h, w, N)
        x_offset = g_lt.unsqueeze(dim=1) * x_q_lt + \
                   g_rb.unsqueeze(dim=1) * x_q_rb + \
                   g_lb.unsqueeze(dim=1) * x_q_lb + \
                   g_rt.unsqueeze(dim=1) * x_q_rt

        # modulation
        if self.modulation:
            m = m.contiguous().permute(0, 2, 3, 1)
            m = m.unsqueeze(dim=1)
            m = torch.cat([m for _ in range(x_offset.size(1))], dim=1)
            x_offset *= m

        x_offset = self._reshape_x_offset(x_offset, ks)
        out = self.conv(x_offset)

        return out

    def _get_p_n(self, N, dtype):
        # Relative coordinates of the kernel positions around its center: for an
        # odd-sized kernel they run from -(kernel_size-1)/2 to +(kernel_size-1)/2 on each axis.
        p_n_x, p_n_y = torch.meshgrid(
            torch.arange(-(self.kernel_size-1)//2, (self.kernel_size-1)//2+1),
            torch.arange(-(self.kernel_size-1)//2, (self.kernel_size-1)//2+1),
            indexing='ij')
        # (2N, 1)
        p_n = torch.cat([torch.flatten(p_n_x), torch.flatten(p_n_y)], 0)
        p_n = p_n.view(1, 2*N, 1, 1).type(dtype)

        return p_n

    def _get_p_0(self, h, w, N, dtype):
        # p_0_x / p_0_y are the vertical / horizontal coordinates on the (padded)
        # input feature map that each output point maps back to.
        p_0_x, p_0_y = torch.meshgrid(
            torch.arange(1, h*self.stride+1, self.stride),
            torch.arange(1, w*self.stride+1, self.stride),
            indexing='ij')
        
        p_0_x = torch.flatten(p_0_x).view(1, 1, h, w).repeat(1, N, 1, 1)
        p_0_y = torch.flatten(p_0_y).view(1, 1, h, w).repeat(1, N, 1, 1)
        p_0 = torch.cat([p_0_x, p_0_y], 1).type(dtype)

        return p_0
    
    # p = p_0 + p_n + offset: each output point (taken as the kernel center) is
    # mapped to its position on the input feature map (p_0), shifted by the
    # kernel's relative coordinates (p_n) and then by the learned offsets.
    def _get_p(self, offset, dtype):
        N, h, w = offset.size(1)//2, offset.size(2), offset.size(3)

        # (1, 2N, 1, 1)
        p_n = self._get_p_n(N, dtype)
        # (1, 2N, h, w)
        p_0 = self._get_p_0(h, w, N, dtype)
        p = p_0 + p_n + offset
        return p

    def _get_x_q(self, x, q, N):
        # Gather the pixel values at the integer positions q (the corner points of
        # the bilinear interpolation); the interpolation weights are computed in forward().
        b, h, w, _ = q.size()
        padded_w = x.size(3)
        c = x.size(1)
        # (b, c, h*w)
        x = x.contiguous().view(b, c, -1)

        # (b, h, w, N)
        index = q[..., :N]*padded_w + q[..., N:]  # flatten 2-D coords: row * padded_w + col
        # (b, c, h*w*N)
        index = index.contiguous().unsqueeze(dim=1).expand(-1, c, -1, -1, -1).contiguous().view(b, c, -1)

        x_offset = x.gather(dim=-1, index=index).contiguous().view(b, c, h, w, N)

        return x_offset

    @staticmethod
    def _reshape_x_offset(x_offset, ks):
        # Rearrange (b, c, h, w, N) into (b, c, h*ks, w*ks): the N = ks*ks sampled
        # values of each output point are laid out as a ks-by-ks tile, so that
        # self.conv with stride ks reproduces the deformable convolution.
        b, c, h, w, N = x_offset.size()
        x_offset = torch.cat([x_offset[..., s:s+ks].contiguous().view(b, c, h, w*ks) for s in range(0, N, ks)], dim=-1)
        x_offset = x_offset.contiguous().view(b, c, h*ks, w*ks)

        return x_offset
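
A quick smoke test of the module above (the sizes are arbitrary); per point 3 of section 1.4, the output keeps the input's spatial size:

if __name__ == "__main__":
    x = torch.randn(2, 16, 32, 32)                  # B x C x H x W
    dconv = DeformConv2d(16, 32, kernel_size=3, padding=1, stride=1, modulation=True)
    out = dconv(x)
    print(out.shape)                                # torch.Size([2, 32, 32, 32])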

References:

More flexible and personalized convolution - Deformable Conv

DeformableConv (deformable convolution) theory and code analysis 



Origin: blog.csdn.net/panghuzhenbang/article/details/129816869