MaskFreeVIS notes (CVPR 2023: instance segmentation without mask annotations)

Paper: Mask-Free Video Instance Segmentation
Code: GitHub

Learning instance segmentation normally requires mask annotations.
However, mask annotation is time-consuming and tedious, so this paper uses only bounding-box annotations to achieve instance segmentation.

The paper targets instance segmentation of videos.
Box-supervised instance segmentation existed before, but it was designed for images, and its accuracy is not very high when applied to videos.
The authors observe that the defining characteristic of video is that the frames are continuous; that is, the same object region should carry the same mask label across consecutive frames.

The underlying idea is temporal continuity. A video is a sequence of frames, and the changes of an object from frame to frame are gradual.
If a pixel region in the frame at time t+1 and its corresponding region at time t belong to the same object (or to the background), they should have the same mask value.
For finding such corresponding regions across consecutive frames, optical flow is the popular approach.

However, the optical flow method faces two problems:
1. It is unreliable under occlusion (no correspondence exists), in regions without obvious texture (the correspondence is undefined), and along a single edge (the correspondence is ambiguous).
2. SOTA optical flow methods use deep networks, which require a large amount of computation and memory.

The paper therefore defines the Temporal KNN-patch Loss (TK-Loss).
Briefly: for each target patch, find the best-matching patches (by matching score) in the neighbouring frame and keep the top K matches.
A consistency loss is then computed over all K matches.

The difference from the optical flow method is that optical flow is a one-to-one matching, while TK-Loss is a one-to-K matching.
K can be 0, for example under occlusion, or K >= 2, for example on the sky or the ground where texture is scarce.
When K >= 2, multiple patches may belong to the same object or background.
The method requires little computation and has no learnable parameters.

TK-Loss is computed in four steps, as shown below.

[Figure: overview of the four TK-Loss steps]

Step 1:
Candidate extraction.
Consider an N × N patch whose centre coordinate is p = (x, y); $X_p^t$ denotes the N × N patch centred at p in frame t.
We now want to find, in frame $\hat{t}$, the patches $X_{\hat{p}}^{\hat{t}}$ (i.e. their centre points $\hat{p}$) that correspond to $X_p^t$.
The candidate centre points are taken from the window of radius R around p (a bit like the local search in template matching).
For efficiency, all target frames are searched within this window at the same time.
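As a rough illustration of Step 1, here is a minimal PyTorch sketch (not the authors' code) of extracting the N × N patch around every pixel and enumerating the candidate centres inside the radius-R window; the tensor shapes, helper names, and the default N and R values are assumptions of mine.

```python
# A minimal sketch, not the authors' code: N x N patches around every pixel and
# the candidate offsets inside a radius-R search window.
import torch
import torch.nn.functional as F

def extract_patches(frame, N=3):
    """frame: (C, H, W). Returns (H*W, C*N*N): the N x N patch centred on every pixel."""
    patches = F.unfold(frame.unsqueeze(0), kernel_size=N, padding=N // 2)  # (1, C*N*N, H*W)
    return patches.squeeze(0).transpose(0, 1)                              # (H*W, C*N*N)

def candidate_offsets(R=5):
    """All integer offsets (dy, dx) inside the (2R+1) x (2R+1) search window."""
    coords = torch.arange(-R, R + 1)
    dy, dx = torch.meshgrid(coords, coords, indexing="ij")
    return torch.stack([dy.flatten(), dx.flatten()], dim=1)                # ((2R+1)^2, 2)

# Each pixel p of frame t is compared against the patches of frame t_hat whose
# centres are p + offset, for every offset returned by candidate_offsets().
```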

Step 2:
Top-K matching.
Matching requires a distance; the paper uses the L2 distance between patches:
$d(p, \hat{p}) = \| X_p^t - X_{\hat{p}}^{\hat{t}} \|_2$
Select the K candidates with the smallest distance.
Among these K matches, some distances may still not be small enough, so a threshold is used as a second filter: candidates with distance >= D are discarded.
What remains is the required match set $S_p^{t \rightarrow \hat{t}} = \{ \hat{p} \mid \hat{p} \in \text{top-}K,\; \|X_p^t - X_{\hat{p}}^{\hat{t}}\|_2 < D \}$.
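A minimal sketch of the top-K selection with the distance threshold D, assuming the patch L2 distances have already been collected into a `dists` tensor; the function name and the default values are placeholders.

```python
# A minimal sketch of the top-K match selection with distance threshold D.
import torch

def topk_matches(dists, K=5, D=0.1):
    """dists: (num_pixels, num_candidates) patch L2 distances to frame t_hat.
    Returns, per pixel, the indices of the K closest candidates and a boolean
    mask keeping only those below the threshold D (so a pixel can end up with
    fewer than K matches, or none at all, e.g. under occlusion)."""
    vals, idx = torch.topk(dists, k=K, dim=1, largest=False)  # K smallest distances
    keep = vals < D                                           # second filter: distance threshold
    return idx, keep
```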

Step 3:
Consistency loss.
A loss is incurred when a pixel and its matched pixels do not agree on the mask.
Let $M_p^t \in [0, 1]$ be the predicted mask value at position p in frame t.
If $M_p^t$ is inconsistent with the mask values of its matched points $\hat{p} \in S_p^{t \rightarrow \hat{t}}$, a loss is produced:
$L_p^{t \rightarrow \hat{t}} = \sum_{\hat{p} \in S_p^{t \rightarrow \hat{t}}} L_{cons}\big(M_p^t, M_{\hat{p}}^{\hat{t}}\big)$
where
$L_{cons}(a, b) = -\log\big(a \cdot b + (1 - a)(1 - b)\big)$
When the mask values of the two matched points are both 0 or both 1, the argument of the log is 1 and the loss is 0; in other words, consistent matches incur no loss.
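The consistency term itself is easy to write down. A small sketch, assuming mask probabilities in [0, 1]; the epsilon is my addition for numerical stability, not part of the formula above.

```python
# A minimal sketch of L_cons(a, b) = -log(a*b + (1-a)*(1-b)).
import torch

def consistency_loss(m_p, m_q, eps=1e-6):
    """m_p, m_q: predicted mask probabilities in [0, 1] for two matched pixels."""
    agree = m_p * m_q + (1.0 - m_p) * (1.0 - m_q)   # probability that the two labels agree
    return -torch.log(agree.clamp(min=eps))

# consistency_loss(torch.tensor(0.95), torch.tensor(0.90))  -> close to 0
# consistency_loss(torch.tensor(0.95), torch.tensor(0.10))  -> large
```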

Step 4:
Cyclic tube connection.
A tube is a temporal clip containing T consecutive frames.
Each time, the loss is computed over all frames of a tube, using a cyclic connection scheme.

[Figure: cyclic connection of the frames within a tube]

In the illustration, one tube uses 5 frames.
Blue indicates that a loss is computed between two frames.
Red is the cyclic connection: the last frame is paired with the first frame, while the other losses are computed between adjacent frames. The temporal loss therefore sums over all cyclically connected frame pairs:
$L_{temp} = \sum_{t=1}^{T} \sum_{p} L_p^{t \rightarrow \hat{t}}, \qquad \hat{t} = (t \bmod T) + 1$
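The cyclic pairing is just index bookkeeping; a tiny sketch, assuming frames are indexed 0..T-1:

```python
# A minimal sketch of the cyclic pairing over a tube of T frames.
def cyclic_pairs(T):
    """Frame-index pairs (t, t_hat): adjacent frames, plus last -> first."""
    return [(t, (t + 1) % T) for t in range(T)]

# cyclic_pairs(5) == [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
```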

Training

In the past, instance segmentation training required mask annotations. In this paper, no mask annotation is used, only box annotations.
As a result, a standard mask loss between the predicted mask and a ground-truth mask cannot be computed.
Instead, the author adopts the two loss functions of BoxInst to replace the mask loss: the box projection loss $L_{proj}$ and the pairwise loss on adjacent pixels $L_{pair}$.

The projection loss is
$L_{proj} = L_{dice}\big(\mathrm{Proj}_x(M), \mathrm{Proj}_x(B)\big) + L_{dice}\big(\mathrm{Proj}_y(M), \mathrm{Proj}_y(B)\big)$
where B is the mask filled in from the ground-truth box and $\mathrm{Proj}$ projects a mask onto an axis by taking the max along the other axis.
The Dice loss is used because the author found that cross-entropy makes the loss of large objects dominate that of small objects.
No mask label is needed when computing this loss; only the box annotation is used.
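A minimal sketch of such a box-projection loss in the spirit of BoxInst (not necessarily the exact implementation); the function names and the Dice variant with squared denominators are my own choices, and a single (H, W) mask is assumed.

```python
# A minimal sketch of a box-projection loss: predicted mask and box mask are
# each projected onto the two axes with a max, then compared with a Dice loss.
import torch

def dice_loss(pred, target, eps=1e-6):
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.pow(2).sum() + target.pow(2).sum() + eps)

def projection_loss(mask_prob, box_mask):
    """mask_prob: (H, W) predicted probabilities; box_mask: (H, W) 0/1 mask
    filled in from the ground-truth bounding box."""
    loss_x = dice_loss(mask_prob.max(dim=0).values, box_mask.max(dim=0).values)  # projection onto the x-axis
    loss_y = dice_loss(mask_prob.max(dim=1).values, box_mask.max(dim=1).values)  # projection onto the y-axis
    return loss_x + loss_y
```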

The pairwise loss $L_{pair}$ is based on the observation that adjacent pixels with similar colours in the same frame should belong to the same object.

$L_{pair} = -\frac{1}{N_{pair}} \sum_{(p_i, p_j)} \mathbb{1}\{S_{ij} \ge \tau\} \log P(y_{p_i} = y_{p_j}), \qquad P(y_{p_i} = y_{p_j}) = m_{p_i} m_{p_j} + (1 - m_{p_i})(1 - m_{p_j})$

where $S_{ij}$ is the colour similarity of the pair and $\tau$ is a similarity threshold.
An image, however, contains a huge number of pixels. The formula only says that $p_i$ is a pixel inside the target box; how $p_j$ is chosen is not spelled out here.
BoxInst uses the 8 surrounding points, taken one pixel apart (i.e. with dilation 2), as shown below.

[Figure: the 8 neighbouring points used for the pairwise loss]
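A minimal sketch of a BoxInst-style pairwise term under some simplifying assumptions of mine: the 8 neighbour offsets use a dilation of 2, the per-offset colour-similarity maps `colour_sim[k]` are assumed precomputed, the threshold value is a placeholder, and `torch.roll` wraps around at the image border (a real implementation would mask the border out).

```python
# A minimal sketch of a BoxInst-style pairwise loss over 8 dilated neighbours.
import torch

OFFSETS = [(-2, -2), (-2, 0), (-2, 2), (0, -2), (0, 2), (2, -2), (2, 0), (2, 2)]

def pairwise_loss(mask_prob, box_mask, colour_sim, tau=0.3, eps=1e-6):
    """mask_prob, box_mask: (H, W); colour_sim: list of 8 (H, W) maps, one per offset."""
    total = mask_prob.new_zeros(())
    count = mask_prob.new_zeros(())
    for k, (dy, dx) in enumerate(OFFSETS):
        neighbour = torch.roll(mask_prob, shifts=(dy, dx), dims=(0, 1))   # neighbour's mask probability
        same = mask_prob * neighbour + (1 - mask_prob) * (1 - neighbour)  # P(y_i == y_j)
        weight = (colour_sim[k] >= tau).float() * box_mask                # gate by colour similarity and box
        total = total + (-torch.log(same.clamp(min=eps)) * weight).sum()
        count = count + weight.sum()
    return total / count.clamp(min=1.0)
```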
The loss in BoxInst simply adds the two terms:
$L_{BoxInst} = L_{proj} + L_{pair}$
In this paper, the author adds a weight to obtain the spatial loss:
$L_{spatial} = L_{proj} + \lambda_{pair} L_{pair}$
The temporal loss is the TK-Loss described earlier. Combining the spatial loss and the temporal loss gives the final loss function:

$L_{seg} = L_{spatial} + \lambda_{temp} L_{temp}$
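Putting the pieces together is then just a weighted sum; a trivial sketch with placeholder weights (the paper's actual settings may differ):

```python
# A trivial sketch of the combination described above.
def seg_loss(l_proj, l_pair, l_temp, lambda_pair=1.0, lambda_temp=1.0):
    l_spatial = l_proj + lambda_pair * l_pair   # spatial (BoxInst-style) part
    return l_spatial + lambda_temp * l_temp     # plus the temporal TK-Loss
```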
Recall the TK-Loss $L_{temp}$:

Take a tube of T frames; the loss is computed between every pair of adjacent frames, and additionally between the last frame and the first:
$L_{temp} = \sum_{t=1}^{T} \sum_{p} L_p^{t \rightarrow \hat{t}}, \qquad \hat{t} = (t \bmod T) + 1$
The per-pixel loss is:
$L_p^{t \rightarrow \hat{t}} = \sum_{\hat{p} \in S_p^{t \rightarrow \hat{t}}} -\log\big(M_p^t M_{\hat{p}}^{\hat{t}} + (1 - M_p^t)(1 - M_{\hat{p}}^{\hat{t}})\big)$
Traverse all pixels of a frame: for each pixel p, take the points within radius R as match candidates and compute the L2 distance between the N × N patch centred on p and the patch centred on each candidate.
Keep the K candidates with the smallest distance and discard those whose distance is >= the threshold D.
Then check whether the mask values of the matched points are consistent, using the consistency term $L_{cons}$ from Step 3.

After all pixels of a frame have been processed, the losses of all frame pairs in the tube are accumulated in cyclic order.

The algorithm flow of $L_{temp}$ is as follows:
[Figure: pseudocode of the $L_{temp}$ algorithm]
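To tie the steps together, here is a compact, unoptimised end-to-end sketch of the $L_{temp}$ computation as described above. It follows the text rather than the official implementation, loops over a sparse grid of pixels purely to keep the example cheap, and all default hyperparameter values are placeholders.

```python
# A compact end-to-end sketch of L_temp following the text above
# (the real implementation is vectorised). frames: (T, C, H, W) images,
# masks: (T, H, W) predicted mask probabilities for one instance.
import torch
import torch.nn.functional as F

def temporal_knn_loss(frames, masks, N=3, R=5, K=5, D=0.1, eps=1e-6):
    T, C, H, W = frames.shape
    # N x N patch around every pixel, for every frame: (T, H*W, C*N*N)
    patches = F.unfold(frames, kernel_size=N, padding=N // 2).transpose(1, 2)

    total, count = frames.new_zeros(()), 0
    for t in range(T):
        t_hat = (t + 1) % T                          # cyclic tube: last frame pairs with the first
        for y in range(0, H, 8):                     # sparse pixel grid, just to keep the sketch cheap
            for x in range(0, W, 8):
                p = y * W + x
                # Step 1: candidate centres within radius R of p in frame t_hat
                ys = torch.arange(max(0, y - R), min(H, y + R + 1))
                xs = torch.arange(max(0, x - R), min(W, x + R + 1))
                cand = (ys[:, None] * W + xs[None, :]).flatten()
                # Step 2: L2 patch distances, top-K, threshold D
                d = (patches[t, p][None] - patches[t_hat, cand]).norm(dim=1)
                vals, idx = torch.topk(d, k=min(K, cand.numel()), largest=False)
                keep = cand[idx[vals < D]]
                if keep.numel() == 0:                # e.g. occlusion: no reliable match, no loss
                    continue
                # Step 3: mask-consistency loss over the surviving matches
                m_p = masks[t].reshape(-1)[p]
                m_q = masks[t_hat].reshape(-1)[keep]
                agree = m_p * m_q + (1 - m_p) * (1 - m_q)
                total = total + (-torch.log(agree.clamp(min=eps))).sum()
                count += keep.numel()
    return total / max(count, 1)                     # Step 4: accumulate over the whole tube
```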

Summary

By replacing the mask loss of an instance segmentation method with the paper's $L_{seg}$, video instance segmentation can be trained with only box annotations.

So, in my view, this paper essentially improves the BoxInst loss function: it takes the temporal continuity of video frames into account and adds the temporal loss $L_{temp}$ on top of BoxInst.
The temporal loss $L_{temp}$ is specific to video scenes. For plain image instance segmentation, where there is no continuity between pictures, it does not apply.

For experimental results, refer to the paper.


Origin blog.csdn.net/level_code/article/details/133900298