- Reference: the PyTorch version of GC-Net.
- Read the code side by side with the network structure diagram and the table of network layers.
1. Unary feature extraction
1. Use 2-D convolution to extract deep features. First, a conv2d with filter size 5×5 and stride 2 halves the input resolution to (1/2H, 1/2W).
imgl0=F.relu(self.bn0(self.conv0(imgLeft)))
imgr0=F.relu(self.bn0(self.conv0(imgRight)))
self.conv0=nn.Conv2d(3,32,5,2,2)
self.bn0=nn.BatchNorm2d(32)
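The halving follows from the standard convolution output-size formula. Here is a minimal sketch in plain Python, assuming a 256×512 input (an illustrative size, not taken from the code); `conv_out_size` is a hypothetical helper name.

```python
def conv_out_size(size, kernel, stride, padding):
    """Output length of a convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

# Positional arguments of nn.Conv2d(3, 32, 5, 2, 2): kernel=5, stride=2, padding=2.
height, width = 256, 512  # assumed input size, for illustration only
half_h = conv_out_size(height, kernel=5, stride=2, padding=2)
half_w = conv_out_size(width, kernel=5, stride=2, padding=2)
print(half_h, half_w)  # 128 256
```

With padding=2 the 5×5 kernel alone would keep the size unchanged, so the reduction comes entirely from stride=2.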
2. This is followed by an 8-block residual network.
- Note num_block[0] here: its value is 8, meaning eight residual blocks. The variable is defined further below.
self.res_block=self._make_layer(block,self.in_planes,32,num_block[0],stride=1)
def _make_layer(self,block,in_planes,planes,num_block,stride):
    strides=[stride]+[1]*(num_block-1)
    layers=[]
    for step in strides:
        layers.append(block(in_planes,planes,step))
    return nn.Sequential(*layers)
- Note that the 'num_block' parameter is a list, [8,1].
- Pay attention to the for loop: the residual stage is built with num_block=8 and stride=1, so strides=[1,1,1,1,1,1,1,1]. step takes one value per iteration, so every block receives stride=1.
def GcNet(height,width,maxdisp):
    return GC_NET(BasicBlock,ThreeDConv,[8,1],height,width,maxdisp)
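The strides list can be checked in isolation. This sketch reuses the exact expression from `_make_layer`; the wrapper name `make_strides` is hypothetical and exists only for illustration.

```python
def make_strides(stride, num_block):
    """Same expression as in _make_layer: the first block gets `stride`, the rest get 1."""
    return [stride] + [1] * (num_block - 1)

residual_strides = make_strides(stride=1, num_block=8)    # 2-D residual stage, num_block[0]
downsample_strides = make_strides(stride=2, num_block=1)  # one 3-D block, num_block[1]
print(residual_strides)    # [1, 1, 1, 1, 1, 1, 1, 1]
print(downsample_strides)  # [2]
```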
- The following walks through the code of the eight residual blocks in detail.
- Each residual block consists of two conv layers (input=32, output=32, kernel_size=3, stride=1, padding=1), and there are eight blocks in total, i.e. 16 conv layers. The main job of this residual stage is to extract the 'unary features' of the left and right images.
class BasicBlock(nn.Module): #basic block for Conv2d
    def __init__(self,in_planes,planes,stride=1):
        super(BasicBlock,self).__init__()
        self.conv1=nn.Conv2d(in_planes,planes,kernel_size=3,stride=stride,padding=1)
        self.bn1=nn.BatchNorm2d(planes)
        self.conv2=nn.Conv2d(planes,planes,kernel_size=3,stride=1,padding=1)
        self.bn2=nn.BatchNorm2d(planes)
        self.shortcut=nn.Sequential()
    def forward(self, x):
        out=F.relu(self.bn1(self.conv1(x)))
        out=self.bn2(self.conv2(out))
        out+=self.shortcut(x)
        out=F.relu(out)
        return out
- Finally, the features pass through one more convolution layer (no ReLU, no BN). I don't understand the role of this layer. Maybe it is there to enlarge the receptive field?
self.conv1=nn.Conv2d(32,32,3,1,1)
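One way to examine the receptive-field guess is to compute it layer by layer with the usual recurrence: each layer adds (kernel - 1) × jump, and the jump is multiplied by the stride. The sketch below assumes the layer sequence described above (the 5×5 stride-2 conv, eight blocks of two 3×3 convs, then the final 3×3 conv); `receptive_field` is a hypothetical helper.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) in forward order; returns the RF in input pixels."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the RF by (k - 1) * current jump
        jump *= stride              # stride increases the spacing between output pixels
    return rf

conv0 = [(5, 2)]
res_blocks = [(3, 1)] * 16   # 8 blocks x 2 conv layers
final_conv = [(3, 1)]

rf_before = receptive_field(conv0 + res_blocks)
rf_after = receptive_field(conv0 + res_blocks + final_conv)
print(rf_before, rf_after)  # 69 73
```

Under these assumptions the last layer widens the receptive field by 4 input pixels along each axis. Whether that is its intended purpose is not stated in the code.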
2. Form a cost volume
- The 'unary features' produced by the residual layers are spliced column by column, i.e. shifted along the width dimension W and concatenated along the channel dimension (for why W is the column dimension, see the printed output in my PSMNet notes), forming a cost volume of size (1,64,96,1/2H,1/2W). I estimated this size from the PSMNet output. I did not set up the environment or run the code, since the main goal is to understand the idea.
def cost_volume(self,imgl,imgr):
    B, C, H, W = imgl.size()
    cost_vol = torch.zeros(B, C * 2, self.maxdisp , H, W).type_as(imgl)
    for i in range(self.maxdisp):
        if i > 0:
            cost_vol[:, :C, i, :, i:] = imgl[:, :, :, i:]
            cost_vol[:, C:, i, :, i:] = imgr[:, :, :, :-i]
        else:
            cost_vol[:, :C, i, :, :] = imgl
            cost_vol[:, C:, i, :, :] = imgr
    return cost_vol
cost_volum = self.cost_volume(imgl1, imgr1)
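To see which pixels get paired at each disparity, the same indexing can be replayed on a single image row in plain Python. This is a toy sketch: the string labels stand in for feature vectors, and `cost_volume_row` is a hypothetical helper, not part of the repository.

```python
def cost_volume_row(left, right, maxdisp):
    """For one image row: volume[d][x] = (left[x], right[x - d]), zero where x < d."""
    width = len(left)
    volume = []
    for d in range(maxdisp):
        row = []
        for x in range(width):
            if x >= d:
                row.append((left[x], right[x - d]))
            else:
                row.append((0, 0))  # no valid match: stays zero, as with torch.zeros
        volume.append(row)
    return volume

left = ["L0", "L1", "L2", "L3"]
right = ["R0", "R1", "R2", "R3"]
volume = cost_volume_row(left, right, maxdisp=3)
for d, row in enumerate(volume):
    print(d, row)
```

At disparity d, the left pixel at column x is paired with the right pixel at column x - d, and the first d columns stay zero.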
- This concatenation retains the feature dimension of the unary features, so the network can learn an absolute representation and combine it with context. It works better than a distance metric (L1, L2, cosine).
- The following explanation of the cost volume is very vivid:
(For a given feature, the matching cost volume is a 3-D block: the first layer is the feature map at disparity 0, the second layer is the feature map at disparity 1, and so on, for a total of maximum disparity + 1 layers. The length and width are those of the feature map. If 10 features are extracted, there are 10 such 3-D blocks.)
3. 3D convolutional downsampling (encoder)
1. The concatenated 'cost volume' has 64 feature channels; two conv3d layers reduce this to 32.
self.conv3d_1 = nn.Conv3d(64, 32, 3, 1, 1)
self.bn3d_1 = nn.BatchNorm3d(32)
self.conv3d_2 = nn.Conv3d(32, 32, 3, 1, 1)
self.bn3d_2 = nn.BatchNorm3d(32)
2. The first downsampling layer reduces the resolution from 1/2 to 1/4.
self.block_3d_1 = self._make_layer(block_3d, 64, 64, num_block[1], stride=2)
- Here num_block[1]=1, so a single 3D convolution module is built, and its stride=2 performs the downsampling.
class ThreeDConv(nn.Module):
    def __init__(self,in_planes,planes,stride=1):
        super(ThreeDConv, self).__init__()
        self.conv1 = nn.Conv3d(in_planes, planes, kernel_size=3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm3d(planes)
        self.conv2 = nn.Conv3d(planes, planes, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm3d(planes)
        self.conv3=nn.Conv3d(planes,planes,kernel_size=3,stride=1,padding=1)
        self.bn3=nn.BatchNorm3d(planes)
    def forward(self, x):
        out=F.relu(self.bn1(self.conv1(x)))
        out=F.relu(self.bn2(self.conv2(out)))
        out=F.relu(self.bn3(self.conv3(out)))
        return out
- According to the original paper, the downsampling layer is followed by two conv3d() layers with stride=2. After communicating with the code author, I learned that the kernel_size=1 of the second layer only serves to change the number of channels.
self.conv3d_3 = nn.Conv3d(64, 64, 3, 2, 1)
self.bn3d_3 = nn.BatchNorm3d(64)
3. The second and third downsampling layers are similar, so let's look only at the fourth downsampling layer.
- Note that the output channel count of this layer becomes 128.
self.block_3d_4 = self._make_layer(block_3d, 64, 128, num_block[1], stride=2)
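The resolution at each encoder level can be traced with the convolution output-size formula for kernel_size=3, stride=2, padding=1. This is a sketch only: the starting size (D, H, W) = (96, 128, 256) is an assumed cost-volume size at 1/2 resolution, and `conv_out_size` is a hypothetical helper.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Output length of a convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

# Assumed cost-volume size at 1/2 resolution: (D, H, W) = (96, 128, 256).
shape = (96, 128, 256)
shapes = [shape]
for _ in range(4):  # four stride-2 downsampling stages
    shape = tuple(conv_out_size(s) for s in shape)
    shapes.append(shape)

for level, s in enumerate(shapes):
    print(f"1/{2 ** (level + 1)} resolution:", s)
```

Each stage halves all three axes, so the four stages take the volume from 1/2 down to 1/32 of the input resolution.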
4. Upsampling (decoder)
1. The paper notes that downsampling improves speed and enlarges the receptive field, but also loses detail. The author uses residual connections to combine the high-resolution feature maps from the downsampling layers with the upsampled output. The higher-resolution volume is obtained with the transposed convolution nn.ConvTranspose3d(). Let's look at how the residual structure is formed.
- Transposed convolution: note that the feature size changes to 2F=64.
# deconv3d
self.deconv1 = nn.ConvTranspose3d(128, 64, 3, 2, 1, 1)
self.debn1 = nn.BatchNorm3d(64)
- Directly upsampling the downsampled result would lose many features, so it is added to the output of the higher-resolution downsampling layer to make up for the missing detail. There are four upsampling levels and four residual connections; they are not described one by one.
deconv3d = F.relu(self.debn1(self.deconv1(conv3d_block_4)) + conv3d_block_3)
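The addition only works if the transposed convolution exactly doubles each axis. A minimal check of the output-size formula for the arguments (3, 2, 1, 1), i.e. kernel=3, stride=2, padding=1, output_padding=1; the sizes 6 and 12 are assumed example values and `deconv_out_size` is a hypothetical helper.

```python
def deconv_out_size(size, kernel=3, stride=2, padding=1, output_padding=1):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

# Assumed sizes of the two deepest encoder outputs along one axis.
block_4_size = 6    # e.g. 1/32 resolution
block_3_size = 12   # e.g. 1/16 resolution

upsampled = deconv_out_size(block_4_size)
print(upsampled)                  # 12
print(upsampled == block_3_size)  # True
```

Without output_padding=1 the result would be 2n - 1, and the element-wise sum with the encoder feature map would fail on a shape mismatch.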
- The last layer upsamples and produces the output.
2. Finally, a transposed convolution with an output channel of '1' compresses the 'cost volume' into a single initial disparity volume and restores the size to (1 D H W). Note that the output of the first 5×5 conv2d layer was (1/2H, 1/2W); here the original image size is restored.
original_size = [1, self.maxdisp*2, imgLeft.size(2), imgLeft.size(3)]
- view() is then used to reshape the output to original_size:
self.deconv5 = nn.ConvTranspose3d(32, 1, 3, 2, 1, 1)
out = deconv3d.view( original_size)
5. Disparity regression
- From this matching cost volume, the disparity value can be estimated by applying the soft argmin operation along the disparity dimension. The function has two properties:
- It is differentiable, so the optimizer can use its gradients.
- It regresses a continuous value, so the loss can be back-propagated through it.
prob = F.softmax(-out, 1)
- Disparity regression
disp1 = self.regression(prob)
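The soft argmin for a single pixel can be written in a few lines of plain Python. This sketch assumes that `self.regression` (not shown here) computes the expectation sum of d × p_d over the disparity axis, which is how soft argmin is defined; the cost values are made up for illustration.

```python
import math

def soft_argmin(costs):
    """Expected disparity under softmax(-cost): sum over d of d * p_d."""
    shift = min(costs)  # numerical stability only; softmax is shift-invariant
    weights = [math.exp(-(c - shift)) for c in costs]
    total = sum(weights)
    return sum(d * w / total for d, w in enumerate(weights))

costs = [5.0, 1.0, 0.2, 1.0, 5.0]    # toy costs for one pixel, lowest at d=2
symmetric_disp = soft_argmin(costs)
print(round(symmetric_disp, 6))      # 2.0

skewed = [5.0, 1.0, 0.2, 0.4, 5.0]   # second-best candidate at d=3
skewed_disp = soft_argmin(skewed)
print(skewed_disp > 2.0)             # True
```

The second case shows why the result is a regression rather than a hard argmin: the estimate moves continuously toward d=3 instead of snapping to the integer 2.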
6. Optimizer and loss
criterion = SmoothL1Loss().to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
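For reference, this is what Smooth L1 computes with its default settings (beta=1.0, mean reduction): quadratic for small errors, linear for large ones. It is a plain-Python sketch for illustration, with made-up disparity values, not a replacement for the library call.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss: quadratic for |diff| < beta, linear beyond."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff * diff / beta
        else:
            total += diff - 0.5 * beta
    return total / len(pred)

pred = [1.0, 3.0]     # made-up predicted disparities
target = [1.5, 1.0]   # made-up ground-truth disparities
loss = smooth_l1(pred, target)
print(loss)  # 0.8125
```

The first pixel (error 0.5) contributes 0.125 from the quadratic branch and the second (error 2.0) contributes 1.5 from the linear branch, so large outliers are penalised less heavily than under an L2 loss.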