Stereo matching - GC-Net network structure analysis

  1. First, look at the code alongside the network structure diagram and the network layer table.
    [Figure: GC-Net network structure diagram]
    [Figure: GC-Net network layer table]

1. Unary feature extraction

1. Use 2-D convolutions to extract deep features. First, a conv2d with filter size 5×5 and stride 2 reduces the input resolution to (1/2H, 1/2W).

 self.conv0=nn.Conv2d(3,32,5,2,2)   # in=3, out=32, kernel=5, stride=2, padding=2
 self.bn0=nn.BatchNorm2d(32)

 imgl0=F.relu(self.bn0(self.conv0(imgLeft)))    # left and right share the same weights
 imgr0=F.relu(self.bn0(self.conv0(imgRight)))
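  • A quick shape check (my own sketch, not from the original post) confirming that conv0 halves the spatial resolution:

    import torch
    import torch.nn as nn

    conv0 = nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2)
    x = torch.randn(1, 3, 256, 512)   # dummy left image, (B, C, H, W)
    print(conv0(x).shape)             # torch.Size([1, 32, 128, 256]) -> (1/2H, 1/2W)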

2. Followed by an 8-layer residual network.
[Figure: the 8-layer residual stage]

  • Note num_block[0] here: its value of 8 means eight residual blocks. The variable is defined below.
 self.res_block=self._make_layer(block,self.in_planes,32,num_block[0],stride=1)
 def _make_layer(self,block,in_planes,planes,num_block,stride):
        strides=[stride]+[1]*(num_block-1)   # first block takes `stride`, the rest use stride 1
        layers=[]
        for step in strides:
            layers.append(block(in_planes,planes,step))
        return nn.Sequential(*layers)
  • Note that this 'num_block' parameter is a list, [8, 1]
  • Pay attention to this for loop: the residual stage passes in num_block=8 with stride=1, so strides=[1, 1, 1, 1, 1, 1, 1, 1]; step takes one value per iteration, and every block is built with stride=1 (see the quick check after the GcNet definition below).

def GcNet(height,width,maxdisp):
    return GC_NET(BasicBlock,ThreeDConv,[8,1],height,width,maxdisp)
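  • A quick check (my sketch) of what the strides list expands to for num_block=8, stride=1:

    strides = [1] + [1] * (8 - 1)
    print(strides)   # [1, 1, 1, 1, 1, 1, 1, 1] -- every block is built with stride=1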
  • The following is a detailed look at the code for the eight residual blocks.
  • Each residual block consists of two conv layers (in_channels=32, out_channels=32, kernel_size=3, stride=1, padding=1), eight blocks in total. The main job of this residual stage is to extract the 'unary features' of the left and right images.
class BasicBlock(nn.Module):  #basic block for Conv2d
    def __init__(self,in_planes,planes,stride=1):
        super(BasicBlock,self).__init__()
        self.conv1=nn.Conv2d(in_planes,planes,kernel_size=3,stride=stride,padding=1)
        self.bn1=nn.BatchNorm2d(planes)
        self.conv2=nn.Conv2d(planes,planes,kernel_size=3,stride=1,padding=1)
        self.bn2=nn.BatchNorm2d(planes)
        self.shortcut=nn.Sequential()   # identity shortcut; shapes already match
    def forward(self, x):
        out=F.relu(self.bn1(self.conv1(x)))
        out=self.bn2(self.conv2(out))
        out+=self.shortcut(x)
        out=F.relu(out)
        return out
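  • A small usage sketch (mine): with stride=1 and the identity shortcut, the block preserves the input shape, so eight of them can be stacked freely:

    block = BasicBlock(32, 32)          # stride defaults to 1
    x = torch.randn(1, 32, 128, 256)    # dummy unary features at (1/2H, 1/2W)
    print(block(x).shape)               # torch.Size([1, 32, 128, 256]) -- unchanged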
  • Finally, the features pass through one more convolution layer with no ReLU and no BN. I don't fully understand the role of this layer; maybe it enlarges the receptive field?
   self.conv1=nn.Conv2d(32,32,3,1,1)

2. Form a cost volume

  • The 'unary features' produced by the residual stage are concatenated along the width dimension (for why W is the concatenation axis, see the printed output in my PSMNet post), forming a cost volume of size (1, 64, 96, 1/2H, 1/2W). I estimated this shape by referring to PSMNet's output; I did not run the code or set up the environment, since the goal here is mainly to understand the idea.
 def cost_volume(self,imgl,imgr):
        B, C, H, W = imgl.size()
        cost_vol = torch.zeros(B, C * 2, self.maxdisp , H, W).type_as(imgl)
        for i in range(self.maxdisp):
            if i > 0:
                # at disparity i, left column x is paired with right column x-i
                cost_vol[:, :C, i, :, i:] = imgl[:, :, :, i:]
                cost_vol[:, C:, i, :, i:] = imgr[:, :, :, :-i]
            else:
                cost_vol[:, :C, i, :, :] = imgl
                cost_vol[:, C:, i, :, :] = imgr
        return cost_vol
 cost_volum = self.cost_volume(imgl1, imgr1)
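  • A standalone shape check (my sketch, with assumed sizes: 32-channel unaries at (1/2H, 1/2W) = (128, 256) and maxdisp=96):

    def cost_volume_sketch(imgl, imgr, maxdisp):
        B, C, H, W = imgl.size()
        cv = torch.zeros(B, 2 * C, maxdisp, H, W).type_as(imgl)
        for i in range(maxdisp):
            cv[:, :C, i, :, i:] = imgl[:, :, :, i:]
            cv[:, C:, i, :, i:] = imgr[:, :, :, :W - i]   # :W-i also covers i == 0
        return cv

    cv = cost_volume_sketch(torch.randn(1, 32, 128, 256),
                            torch.randn(1, 32, 128, 256), maxdisp=96)
    print(cv.shape)   # torch.Size([1, 64, 96, 128, 256]) = (B, 2C, D, 1/2H, 1/2W)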
  • Because the features are concatenated rather than collapsed to a scalar, the feature dimension is retained in the unary features, so the network can learn an absolute representation and combine it with context. The paper reports that this concatenation outperforms distance metrics (L1, L2, cosine).
  • The following explanation of the cost volume is very vivid:
    (For a given feature, the matching cost volume is a three-dimensional cube: the first slice is the feature map at disparity 0, the second slice is the feature map at disparity 1, and so on, for a total of max disparity + 1 slices; the length and width are the feature map dimensions. If 10 features are extracted in total, there are 10 such cubes.)
    [Figure: cost volume illustration]

3. 3D convolutional downsampling (encoder)

1. The concatenated 'cost volume' has 64 channels; two conv3d layers reduce it to 32 channels.

        self.conv3d_1 = nn.Conv3d(64, 32, 3, 1, 1)
        self.bn3d_1 = nn.BatchNorm3d(32)
        self.conv3d_2 = nn.Conv3d(32, 32, 3, 1, 1)
        self.bn3d_2 = nn.BatchNorm3d(32)

2. The first downsampling layer reduces the resolution from 1/2 to 1/4.

self.block_3d_1 = self._make_layer(block_3d, 64, 64, num_block[1], stride=2)
  • Here num_block[1]=1, so a single 3D convolution module is built, with stride=2 used for downsampling.
class ThreeDConv(nn.Module):
    def __init__(self,in_planes,planes,stride=1):
        super(ThreeDConv, self).__init__()
        self.conv1 = nn.Conv3d(in_planes, planes, kernel_size=3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm3d(planes)
        self.conv2 = nn.Conv3d(planes, planes, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm3d(planes)
        self.conv3=nn.Conv3d(planes,planes,kernel_size=3,stride=1,padding=1)
        self.bn3=nn.BatchNorm3d(planes)

    def forward(self, x):
        out=F.relu(self.bn1(self.conv1(x)))
        out=F.relu(self.bn2(self.conv2(out)))
        out=F.relu(self.bn3(self.conv3(out)))
        return out
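  • A shape check (my sketch) showing how stride=2 in the first conv halves D, H, and W at once:

    down = ThreeDConv(64, 64, stride=2)
    x = torch.randn(1, 64, 48, 64, 128)   # dummy (B, C, D, H, W) cost features
    print(down(x).shape)                  # torch.Size([1, 64, 24, 32, 64])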

[Figure: the 3D convolution (encoder) layers]

  • The original paper describes each downsampling layer as conv3d() with stride=2 followed by two further conv3d layers. After communicating with the code's author: a kernel_size=1 in the second layer would only serve to change the number of channels.
        self.conv3d_3 = nn.Conv3d(64, 64, 3, 2, 1)   # stride=2: downsampling
        self.bn3d_3 = nn.BatchNorm3d(64)

3. The second and third downsampling layers are similar, so let's just talk about the fourth downsampling layer.

  • Note that the output channel of this layer becomes 128.
  self.block_3d_4 = self._make_layer(block_3d, 64, 128, num_block[1], stride=2)
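  • My rough estimate of the encoder's shape progression (not printed from the code; the channels per scale follow the paper's layer table, and D is the cost volume's disparity depth):

    # cost volume : (1,  64, D,    1/2H,  1/2W)
    # conv3d_1/2  : (1,  32, D,    1/2H,  1/2W)
    # block_3d_1  : (1,  64, D/2,  1/4H,  1/4W)
    # block_3d_2  : (1,  64, D/4,  1/8H,  1/8W)
    # block_3d_3  : (1,  64, D/8,  1/16H, 1/16W)
    # block_3d_4  : (1, 128, D/16, 1/32H, 1/32W)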

4. Upsampling (decoder)

1. The original paper explains that while downsampling improves speed and enlarges the receptive field, it also loses detail. The author therefore uses residual connections to cascade high-resolution feature maps from the downsampling path into the decoder; the higher-resolution maps are produced with transposed convolution, nn.ConvTranspose3d(). Let's look at how this residual structure is formed.

  • Transposed convolution: notice that the channel count becomes 2F=64
      # deconv3d
        self.deconv1 = nn.ConvTranspose3d(128, 64, 3, 2, 1, 1)
        self.debn1 = nn.BatchNorm3d(64)
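  • A shape check (my sketch) of the transposed convolution; the output_padding=1 makes each dimension exactly double:

    deconv1 = nn.ConvTranspose3d(128, 64, 3, 2, 1, 1)   # kernel=3, stride=2, padding=1, output_padding=1
    x = torch.randn(1, 128, 6, 8, 16)                   # dummy lowest-resolution volume
    print(deconv1(x).shape)                             # torch.Size([1, 64, 12, 16, 32])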

[Figure: the upsampling (decoder) layers]

  • Directly upsampling the downsampled results would lose many features, so each upsampled output is cascaded (added) with the output of the corresponding higher-resolution downsampling layer to make up the missing detail. There are four upsampling levels and four such residual connections, not described one by one:
  deconv3d = F.relu(self.debn1(self.deconv1(conv3d_block_4)) + conv3d_block_3)
  • The last layer upsamples and produces the output.
    [Figure: the final output layer]
    2. Finally, a transposed convolution with an output channel of 1 compresses the 'cost volume' into an initial disparity volume and restores the size to (1, D, H, W). Recall that the first layer's 5×5 conv2d produced output at (1/2H, 1/2W); this layer restores the original image size.
 original_size = [1, self.maxdisp*2, imgLeft.size(2), imgLeft.size(3)]
  • Note the view() call, which drops the singleton channel dimension:
  self.deconv5 = nn.ConvTranspose3d(32, 1, 3, 2, 1, 1)
  out = deconv3d.view( original_size)
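  • My understanding of the final shapes (maxdisp here is the half-resolution value, hence the *2 at full resolution):

    # deconv5 : (1, 32, maxdisp, 1/2H, 1/2W) -> (1, 1, 2*maxdisp, H, W)
    # view    : (1, 1, 2*maxdisp, H, W)      -> (1, 2*maxdisp, H, W)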

5. Disparity regression

  1. From this matching cost volume, we can estimate the disparity by applying the soft argmin operation along the disparity dimension. This operation has two characteristics:
  • It is differentiable, so the optimizer can compute gradients through it
  • It regresses a smooth, sub-pixel disparity estimate, so the loss can be propagated back
 prob = F.softmax(-out, 1)   # negate: low cost -> high probability
  • Disparity regression
 disp1 = self.regression(prob)
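  • A minimal sketch of the soft argmin (my implementation of the paper's formula, d_hat = sum over d of d * softmax(-c_d), assuming prob has shape (B, D, H, W) after the softmax above):

    def soft_argmin(prob):
        B, D, H, W = prob.shape
        disp_values = torch.arange(D).type_as(prob).view(1, D, 1, 1)
        return torch.sum(prob * disp_values, dim=1)   # (B, H, W) sub-pixel disparities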

[Figure: the soft argmin formula from the paper]

6. Optimizer and loss

    criterion = SmoothL1Loss().to(device)   # assumes `from torch.nn import SmoothL1Loss`
    optimizer = optim.Adam(model.parameters(), lr=0.001)
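  • A hedged sketch of one training step (names like model, imgL, imgR, dispL, and mask are my assumptions, not from the post):

    model.train()
    optimizer.zero_grad()
    disp_pred = model(imgL, imgR)
    loss = criterion(disp_pred[mask], dispL[mask])   # mask out pixels with invalid ground truth
    loss.backward()
    optimizer.step()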

Origin blog.csdn.net/weixin_41405284/article/details/109381542