Paper notes: Amodal Instance Segmentation with KINS Dataset

Problem addressed

Segmenting the invisible (occluded) parts of instances; datasets and methods for this task are still scarce.


Innovations / contributions

  1. A new dataset and a new task; the dataset (KINS) is KITTI augmented with amodal annotations
  2. Multi-Level Coding (MLC) is proposed, which lets existing segmentation networks predict the invisible part of an instance

Previous methods

[Amodal instance segmentation] 2016
[Learning to see the invisible: End-to-end trainable amodal instance segmentation] 2019
[Semantic amodal segmentation] 2017


Method

Method summary

First a standard segmentation network is used, with box regression and classification branches plus a new occlusion branch; the new branch predicts whether occlusion occurs in the current RoI. If occlusion occurs, MLC fuses the features of the box branch and the occlusion classification branch with the mask branch features, and a final amodal mask is then predicted.
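Before the details below, here is a minimal sketch of how I read this forward pass; the `heads` dictionary and its keys are my own placeholders, not the authors' API.

```python
def amodal_forward(roi_feat_7x7, roi_feat_14x14, heads):
    """High-level flow as summarized above (placeholder callables): box,
    class and occlusion predictions come from the 7x7 RoI features; their
    features are fused by MLC with the 14x14 mask-branch features, and the
    fused result is used to predict the final amodal mask."""
    cls_feat, cls_scores = heads["classify"](roi_feat_7x7)
    box_deltas = heads["box_regress"](roi_feat_7x7)
    occ_feat, occ_score = heads["occlusion"](roi_feat_7x7)

    global_feat = heads["mlc_extract"](cls_feat, occ_feat)   # 7x7 -> 14x14
    fused = heads["mlc_fuse"](roi_feat_14x14, global_feat)
    amodal_mask = heads["mask"](fused)
    return cls_scores, box_deltas, occ_score, amodal_mask
```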

Multi-Level Coding

The authors' idea is to extract high-level semantic information to guide the mask branch so that it can better reason about occluded regions.

In current methods the distance between the mask head and the backbone is long, so some information is easily lost; MLC amplifies the global information available for mask prediction. It is applied only to positive RoIs: only the features of positive RoIs are extracted and sent to MLC as global guidance.

Are the positive samples just the occluded RoIs?

The features of the box classification and occlusion branches are 7×7, while the mask branch features are 14×14. Before they go into the mask branch, some processing is done by two modules, one for extraction and one for fusion.

The convolution kernels here are C×C×3×3 with stride 1 and padding, where C is the number of channels.

Channel × channel? This looks like the channel attention mechanism that has appeared before; maybe that is one way to explain it.

Extraction:
combines the class and occlusion information into global features.

The corresponding features are integrated as shown in the paper's figure.

The box classification and occlusion classification features are concatenated, and a deconvolution with a 2C×C×3×3 kernel is applied (i.e., upsampling, which also fuses the two kinds of feature information); the upsampled features are then fed into two convolutional layers followed by ReLU.
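A minimal sketch of this extraction step, assuming C-channel 7×7 inputs and a stride-2 deconvolution to reach 14×14; the stride and padding choices are my assumptions, only the kernel shapes come from the text above.

```python
import torch
import torch.nn as nn

class MLCExtraction(nn.Module):
    """Sketch of the extraction step: concatenate the 7x7 box-classification
    and occlusion features (C channels each -> 2C), upsample to 14x14 with a
    2C->C 3x3 deconvolution, then refine with two 3x3 conv + ReLU layers.
    Stride/padding values are assumptions, not taken from the paper."""
    def __init__(self, c=256):
        super().__init__()
        # 2C -> C channels, 7x7 -> 14x14 (stride-2 deconv assumed for upsampling)
        self.deconv = nn.ConvTranspose2d(2 * c, c, kernel_size=3, stride=2,
                                         padding=1, output_padding=1)
        self.refine = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, box_cls_feat, occ_feat):                    # each [N, C, 7, 7]
        global_feat = torch.cat([box_cls_feat, occ_feat], dim=1)  # [N, 2C, 7, 7]
        return self.refine(self.deconv(global_feat))              # [N, C, 14, 14]
```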

Fusion:
mixes the local and global features to help the mask branch. To inject the global and local cues into the mask branch, the mask features are concatenated with the features produced by the extraction step and then passed through three stacked convolutional layers with ReLU.
The final convolutional layer halves the number of channels, so the output has the same dimensions as the mask branch features; this output is used for the final segmentation mask.
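A matching sketch of the fusion step under the same assumptions (layer widths are mine; the three-conv structure and the channel halving come from the text):

```python
import torch
import torch.nn as nn

class MLCFusion(nn.Module):
    """Sketch of the fusion step: concatenate mask-branch features with the
    extracted global features (C + C = 2C), pass through three stacked 3x3
    conv + ReLU layers, with the last conv halving the channels so the
    output matches the mask branch again. Details are assumptions."""
    def __init__(self, c=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(2 * c, 2 * c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(2 * c, 2 * c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(2 * c, c, 3, padding=1), nn.ReLU(inplace=True),  # halve channels
        )

    def forward(self, mask_feat, global_feat):              # each [N, C, 14, 14]
        fused = torch.cat([mask_feat, global_feat], dim=1)  # [N, 2C, 14, 14]
        return self.convs(fused)                            # [N, C, 14, 14]
```

The output has the same shape as the mask-branch features, so it can be fed directly into the usual mask predictor.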

In short, the mask features near the end also incorporate information from the two classification branches, and only then is the mask predicted.

Occlusion Classification Branch

The global box features alone are not enough, because multiple instances may exist within one RoI, and features from other instances can cause the predicted mask to cover two instances.

Isn't that stating the obvious? There are indeed some problems here.

Therefore, the occlusion classification branch is introduced to perceive the occlusion; this branch determines whether or not occlusion is present in the RoI.

Experiments show that this branch provides rich features for the invisible parts.

In general, 512 proposals are taken from the RPN, of which 128 are foreground RoIs.

Statistically, about 40 RoIs are enough to cover most cases.

However, many occluded regions are only about 1 to 10 pixels, and the ratio of occluded to non-occluded samples is extremely imbalanced. With RoI-based feature extraction, the features of very small regions are weakened or even lost, so only RoIs whose occluded area is larger than 5% of the total mask area are treated as occluded samples. To address the imbalance, the loss weight of positive (occluded) RoIs is set to 8.
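A minimal sketch of this sampling and weighting rule; the 5% threshold and the weight of 8 come from the text above, while everything else (including the two-class formulation) is my assumption.

```python
import torch
import torch.nn.functional as F

def occlusion_label_and_loss(occ_logits, occluded_area, amodal_area,
                             area_thresh=0.05, pos_weight=8.0):
    """Label an RoI as 'occluded' only when its occluded area exceeds 5% of
    its full amodal mask area, and up-weight occluded (positive) RoIs by 8
    in the classification loss to fight the class imbalance.
    occ_logits: [N, 2], occluded_area / amodal_area: [N] float tensors."""
    labels = (occluded_area > area_thresh * amodal_area).long()          # [N]
    class_weight = torch.tensor([1.0, pos_weight], device=occ_logits.device)
    loss = F.cross_entropy(occ_logits, labels, weight=class_weight)
    return labels, loss
```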

So occlusions that are too small are simply ignored.


Training

Network architecture


The network can reuse existing segmentation frameworks such as Mask R-CNN and PANet; as long as the prediction target is the amodal mask, they can predict the occluded mask.

The authors use global information to infer the occluded region. The upper structure, on top of the RPN, has three branches: box classification, box regression, and occlusion classification. The first two branches share weights except for the last two FC layers, which make the final predictions; with this configuration the globally aware features can help the network reason about the whole instance. The lower branch is the mask branch, whose features are fused through MLC to make the final prediction.

So are these shared features the important part, since they carry the global information?

In detail, all RoIs are fed into the branches, and each branch consists of four stacked convolution + ReLU operations. To predict the occluded part, the multi-level coding (MLC) scheme is proposed; the branch results are sent into MLC, so amodal segmentation can be accomplished by simultaneously perceiving the visible cues of the instance and its inherent whole region.
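A sketch of how I read this branch structure; the shared trunk, hidden width, and class count are my assumptions, only the "four conv + ReLU per branch" and "separate last two FC layers" details come from the text.

```python
import torch.nn as nn

class DetectionBranches(nn.Module):
    """Sketch of the upper branches: a trunk of four 3x3 conv + ReLU layers
    over the 7x7 RoI features, shared by the branches, followed by separate
    pairs of FC layers for box classification, box regression, and occlusion
    classification. Sizes here are placeholders."""
    def __init__(self, c=256, num_classes=8, hidden=1024):
        super().__init__()
        trunk = []
        for _ in range(4):
            trunk += [nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True)]
        self.trunk = nn.Sequential(*trunk)
        flat = c * 7 * 7
        # the last two FC layers are NOT shared between the branches
        self.cls_fc = nn.Sequential(nn.Linear(flat, hidden), nn.ReLU(inplace=True),
                                    nn.Linear(hidden, num_classes))
        self.box_fc = nn.Sequential(nn.Linear(flat, hidden), nn.ReLU(inplace=True),
                                    nn.Linear(hidden, num_classes * 4))
        self.occ_fc = nn.Sequential(nn.Linear(flat, hidden), nn.ReLU(inplace=True),
                                    nn.Linear(hidden, 2))  # occluded / not occluded

    def forward(self, roi_feat):                # [N, C, 7, 7]
        x = self.trunk(roi_feat).flatten(1)     # [N, C*7*7]
        return self.cls_fc(x), self.box_fc(x), self.occ_fc(x)
```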

Experiments show that too many iterations lead to overfitting: as the number of iterations increases, the predicted occluded regions shrink or even disappear, while the predicted visible parts remain stable.

I suspect this figure was posted the wrong way around; the mask on the right clearly covers more area.

Structural Analysis

The causes of this overfitting, viewed from the nature of CNNs:

  1. Convolutions capture accurate local features, but the global information needed for these missing (occluded) regions gets lost

    So could adding attention solve this?

  2. Fully connected layers aggregate spatial and channel information, giving the network a comprehensive understanding of the instance.

In existing instance segmentation frameworks, the mask head typically consists of four convolutional layers and one deconvolution layer, which makes full use of local information.
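For reference, here is what such a typical mask head looks like in PyTorch (Mask R-CNN style; the channel count and number of classes are placeholders):

```python
import torch.nn as nn
import torch.nn.functional as F

class StandardMaskHead(nn.Module):
    """Typical mask head: four 3x3 conv + ReLU layers, one stride-2
    deconvolution, and a 1x1 conv producing one mask per class."""
    def __init__(self, c=256, num_classes=8):
        super().__init__()
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*layers)
        self.upsample = nn.ConvTranspose2d(c, c, kernel_size=2, stride=2)
        self.predict = nn.Conv2d(c, num_classes, kernel_size=1)

    def forward(self, roi_feat):           # [N, C, 14, 14]
        x = self.convs(roi_feat)
        x = F.relu(self.upsample(x))       # [N, C, 28, 28]
        return self.predict(x)             # [N, num_classes, 28, 28]
```

Every layer here operates on a local neighborhood, which is exactly why global cues have to be injected from elsewhere.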

So how can global information be used?

But without global information or prior knowledge, it is hard to predict the invisible region; what cannot be seen stays hidden.

Without a known 3D model as a prior, this is also hard to handle.
[Path aggregation network for instance segmentation] 2018
This one discusses the importance of global information from non-adjacent objects; worth a look.

So a strong ability to perceive global information is the key for the network to recognize occluded regions.

The authors place particular emphasis on this sentence.

Loss

The loss coefficients for the RPN, box recognition, occlusion classification, and mask prediction are all set to 1.

L = L_{cls} + L_{box} + L_{occlusion} + L_{mask_a} + L_{mask_i}
Inference is a bit different: boxes are first regressed from the box branch and the proposal locations, then the refined boxes are fed back into the box branch to extract class and occlusion features, and the boxes kept after NMS are used for mask prediction.

That is, a result is predicted first, a box is inferred from the proposal locations, and then this box is used to extract features again, which basically amounts to fine-tuning the bbox once.
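A rough sketch of that inference order; the head objects and their methods (predict, apply_deltas, extract_features) are placeholders of mine, and only the NMS call is real torchvision API.

```python
import torchvision

def inference(proposals, box_head, occ_head, mask_head, nms_thresh=0.5):
    """Sketch of the inference order described above."""
    # 1) classify the RPN proposals and regress refined boxes from them
    scores, deltas = box_head.predict(proposals)
    boxes = box_head.apply_deltas(proposals, deltas)

    # 2) feed the refined boxes back through the heads to get the class and
    #    occlusion features that MLC will use
    cls_feat = box_head.extract_features(boxes)
    occ_feat = occ_head.extract_features(boxes)

    # 3) keep the boxes surviving NMS and predict amodal masks only for them
    keep = torchvision.ops.nms(boxes, scores, nms_thresh)
    return mask_head.predict(boxes[keep], cls_feat[keep], occ_feat[keep])
```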

Origin blog.csdn.net/McEason/article/details/104346679