[Paper Notes] Video Instance Segmentation CVPR2021 Oral——VisTR: End-to-End Video Instance Segmentation with Transformers

Video instance segmentation (VIS): building on video object segmentation (VOS), it additionally labels each instance.

Instance segmentation is object detection + semantic segmentation: objects are detected in the image, and each pixel of an object is assigned a category label, which makes it possible to distinguish different instances sharing the same foreground semantic category.

Dataset: YouTube-VIS

Predecessor: the Video Instance Segmentation paper, which introduced the task and the YouTube-VIS dataset.


VisTR: End-to-End Video Instance Segmentation with Transformers

Problems addressed

VIS not only detects and segments objects in each individual frame, but also finds the correspondence of each object across frames, i.e. associates and tracks them.

  • The video itself is sequence-level data, so the task is modeled as sequence prediction: given multiple frames as input, outputting the segmentation mask sequences for multiple frames requires a model that can process many frames in parallel.
  • Unify the two tasks of segmentation and object tracking: segmentation is similarity learning over pixel features, while tracking is similarity learning over instance features.

Applying transformers to video instance segmentation

  • Transformers are built for sequence-to-sequence modeling
  • Transformers are good at modeling long sequences, and can be used to establish long-range dependencies and learn temporal information across multiple frames
  • Self-attention learns and updates features based on pairwise similarity, and can better learn the correlations between frames

End-to-end: treat the temporal and spatial characteristics of the video as a whole, following DETR.

Introduction

Task: perform instance segmentation for each frame, and at the same time establish the data association of instances across consecutive frames, i.e. tracking.

Summary: VisTR treats the VIS task as a parallel sequence decoding/prediction problem. Given a video clip consisting of multiple image frames as input, VisTR directly outputs the sequence of masks for each instance in the video. The output sequence of each instance is referred to as the instance sequence in the paper.

In the figure below, frames are distinguished by shape and instances by color: three frames, four instances.

[Figure: instance sequences over three frames (shapes) and four instances (colors)]

  • In the first stage, given a sequence of video frames, a standard CNN module extracts the features of a single image frame, and then concatenates multiple image features in frame order to form a feature sequence.

    Note: the network used for feature extraction at this stage can be changed according to the image type.

  • In the second stage, the transformer takes the sequence of segment-level features as input and outputs the sequence of instance predictions in order. The sequence of predictions follows the order of the input frames, while the predictions within each frame also follow the same order of instances.

Challenge: modeling as a sequence prediction problem

Although the inputs and outputs of the multiple frames are ordered along the time dimension, within a single frame the sequence of instances is initially unordered, so instance tracking/association cannot be realized directly. The order of the instances output for each frame must therefore be forced to be consistent; then, as long as the outputs at corresponding positions are matched up, the association of the same instance across frames follows naturally.

  • How to keep the order of the output

    Instance sequence matching strategy: supervise the features at the same instance position along the sequence dimension.

    Perform bipartite graph matching between the output and ground truth sequences and supervise the sequences as a whole

  • How to get the mask sequence for each instance from the transformer network

    Instance Sequence Segmentation Module: Obtain the mask features of each instance in multiple frames with self-attention, and use 3D convolution to segment the mask sequence of each instance

Overall structure of VisTR

A CNN backbone to extract multi-frame feature representations (Different feature extraction networks can be used here according to different scene requirements)

An encoder-decoder transformer to model the similarity of pixel-level and instance-level features

An instance sequence matching module

An instance sequence segmentation module
[Figure: the overall structure of VisTR]

backbone

Initial input: T frames * 3 channels * H' * W'

Output feature matrix (concat each frame): T * C * H * W
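
As a rough illustration, here is a minimal PyTorch sketch of this stage, assuming a ResNet-50 backbone (the specific backbone is interchangeable, as noted above); the clip length and frame size are made up for the example:

```python
# Minimal sketch (not the official VisTR code): per-frame feature extraction with
# an assumed ResNet-50 backbone, then concatenation along the time dimension.
import torch
import torchvision

backbone = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
backbone.eval()

T, H_in, W_in = 6, 384, 640              # illustrative clip length and frame size
clip = torch.randn(T, 3, H_in, W_in)     # T frames * 3 channels * H' * W'

with torch.no_grad():
    feats = backbone(clip)               # (T, C, H, W): here C=2048, H=H'/32, W=W'/32
print(feats.shape)
```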

transformer encoder

Learns pixel-to-pixel similarity; the output is a dense sequence of pixel features.

First, the features extracted by the backbone are reduced by 1*1 convolution: T * C * H * W => T * d * H * W

Flattening: The input of the transformer needs to be two-dimensional, so flatten the space (H, W) and time (T), T * d * H * W => d * (T * H * W)

Interpreting the flattening: d plays the role of the channel dimension, and T * H * W covers all pixels of all T frames in the clip.
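
A small sketch of the reduction and flattening just described, with an assumed hidden dimension d = 384:

```python
# Sketch of the 1*1 channel reduction followed by flattening of space and time.
import torch
import torch.nn as nn

T, C, H, W, d = 6, 2048, 12, 20, 384     # illustrative sizes
feats = torch.randn(T, C, H, W)          # backbone output

reduce = nn.Conv2d(C, d, kernel_size=1)  # T*C*H*W -> T*d*H*W
x = reduce(feats)                        # (T, d, H, W)

# flatten space (H, W) and time (T) into a single sequence of length T*H*W
x = x.permute(1, 0, 2, 3).reshape(d, T * H * W)   # (d, T*H*W)
print(x.shape)
```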

Temporal and spatial positional encoding

The final positional encoding has dimension d.

The transformer's output is invariant to permutations of its input sequence, while the instance segmentation task requires precise position information. The features are therefore supplemented with fixed positional encodings containing the three-dimensional (time and space: T, H, W) positions within the clip, and these are fed to the encoder together with the feature information extracted by the backbone.

In the original transformer the position information is one-dimensional: $i$ runs from 1 to $d$, so $2k$ runs from 0 to $d$, and $w_k = \frac{1}{10000^{2k/d}}$ decreases gradually from 1 toward 0.

Then the vector at the final t position is also d-dimensional:
$$
p_t = \begin{bmatrix} \sin(w_1 \cdot t) \\ \cos(w_1 \cdot t) \\ \sin(w_2 \cdot t) \\ \cos(w_2 \cdot t) \\ \vdots \\ \sin(w_{d/2} \cdot t) \\ \cos(w_{d/2} \cdot t) \end{bmatrix} \in \mathbb{R}^{d}
$$
In this paper, positions along three dimensions (H, W, T) must be considered: each pixel has three coordinates, so an independent $d/3$-dimensional position vector is generated for each of the three dimensions.

For the coordinate of each dimension, the sine and cosine functions are used independently, yielding a $d/3$-dimensional vector.

$pos$ denotes the coordinates $(h, w, t)$ and $i$ the channel index. Considering only $h$, $i$ runs from 1 to $d/3$; at the same time, to keep $w_k$ between 0 and 1, the $w_k$ used here is not the same as in the original transformer.

[Images: the modified $w_k$ used for the three-dimensional positional encoding]

The final position vector for the H dimension is expressed as follows:
$$
PE(pos)_H = \begin{bmatrix} \sin(w_1 \cdot h) \\ \cos(w_1 \cdot h) \\ \sin(w_2 \cdot h) \\ \cos(w_2 \cdot h) \\ \vdots \\ \sin(w_{d/6} \cdot h) \\ \cos(w_{d/6} \cdot h) \end{bmatrix} \in \mathbb{R}^{d/3}
$$

The positional encoding (d * H * W * T) is then fed into the encoder together with the features extracted by the backbone.
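
A hedged sketch of such a three-dimensional sine/cosine positional encoding: each of the T, H, W coordinates independently receives d/3 channels, mirroring the 1-D case above. The coordinate normalization and the temperature are DETR-style assumptions, not necessarily the exact values used by VisTR:

```python
# Sketch of a 3-D (T, H, W) positional encoding with d/3 channels per dimension.
import math
import torch

def positional_encoding_3d(T, H, W, d, temperature=10000.0):
    assert d % 6 == 0, "need sin/cos pairs in each of the three dimensions"
    d_dim = d // 3
    # coordinates normalized to (0, 2*pi], as in DETR (an assumption here)
    t = torch.arange(1, T + 1, dtype=torch.float32) / T * 2 * math.pi
    h = torch.arange(1, H + 1, dtype=torch.float32) / H * 2 * math.pi
    w = torch.arange(1, W + 1, dtype=torch.float32) / W * 2 * math.pi

    k = torch.arange(d_dim // 2, dtype=torch.float32)
    freq = 1.0 / temperature ** (2 * k / d_dim)          # w_k, decreasing toward 0

    def encode(coords):                                   # (L,) -> (L, d/3)
        ang = coords[:, None] * freq[None, :]
        return torch.cat([ang.sin(), ang.cos()], dim=-1)

    pe_t, pe_h, pe_w = encode(t), encode(h), encode(w)
    pe = torch.cat([                                      # stack the three encodings
        pe_t[:, None, None, :].expand(T, H, W, d_dim),
        pe_h[None, :, None, :].expand(T, H, W, d_dim),
        pe_w[None, None, :, :].expand(T, H, W, d_dim),
    ], dim=-1)                                            # (T, H, W, d)
    return pe.permute(3, 0, 1, 2)                         # (d, T, H, W)

print(positional_encoding_3d(T=6, H=12, W=20, d=384).shape)
```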

transformer decoder

Decode the dense pixel feature sequence output by the encoder into a sparse instance feature sequence

Inspired by DETR: assuming each frame contains n instances, each frame has a fixed number of input embeddings, named instance queries, that are used to extract instance features. Over T frames there are N = n * T instance queries in total; they are learnable and have the same dimension as the pixel features.

instance query: used to perform attention operations with dense input feature sequences, and select features that can represent each instance

Input: E + instance query

Output: the prediction sequence O of each instance. The subsequent processing treats the predictions for all frames of a single instance as a whole and outputs them in the order of the original video frames, giving n * T instance vectors in total.
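
A sketch of the decoder stage with learnable instance queries. The use of PyTorch's stock `nn.TransformerDecoder` and the numbers n, T, d are assumptions for illustration; VisTR itself builds on the DETR transformer, where the queries enter as positional embeddings at every decoder layer:

```python
# Sketch: N = n*T learnable instance queries attend to the dense pixel features E.
import torch
import torch.nn as nn

d, T, H, W = 384, 6, 12, 20
n_inst = 10                                   # assumed number of instance slots per frame
N = n_inst * T                                # total instance queries for the clip

decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model=d, nhead=8),
                                num_layers=6)
instance_queries = nn.Embedding(N, d)         # learnable, same dimension as pixel features

E = torch.randn(T * H * W, 1, d)              # encoder output: dense pixel feature sequence
tgt = instance_queries.weight.unsqueeze(1)    # (N, 1, d)

O = decoder(tgt, E)                           # (N, 1, d): one vector per instance per frame
print(O.shape)
```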

instance sequence matching

The decoder outputs n * T predictions, ordered by frame, but within each frame the order of the n instances is not fixed.

The function of this module is to keep the relative position unchanged for the prediction of the same instance in different frames

Bipartite-match the predicted sequence of each instance with the ground-truth sequence of each instance in the labeled data, using the Hungarian algorithm to find the closest ground truth for each prediction.

Supplement on the FFN: it is essentially an MLP (multi-layer perceptron), i.e. FC + GeLU + FC.

In the transformer, the MSA (multi-head self-attention) is followed by an FFN (feed-forward network) containing two FC layers: the first FC expands the features from dimension D to 4D, the second restores them from 4D to D, and the non-linearity in between is GeLU (Gaussian Error Linear Unit). This is essentially an MLP: compared with a single FC layer, an MLP stacks more layers and introduces non-linear activations, e.g. FC + GeLU + FC.
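
A minimal sketch of such an FFN block (dimensions assumed):

```python
# Two FC layers with a GeLU in between, expanding from D to 4D and back.
import torch.nn as nn

class FFN(nn.Module):
    def __init__(self, dim=384, expansion=4):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim * expansion)   # D -> 4D
        self.act = nn.GELU()
        self.fc2 = nn.Linear(dim * expansion, dim)   # 4D -> D

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))
```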

The matching mainly serves to keep the prediction of the same instance at the same relative position across frames.
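
A hedged sketch of the matching step: a cost matrix is built between the predicted instance sequences and the ground-truth sequences of the clip, and the Hungarian algorithm solves the assignment. The cost used here (negative class probability summed over the clip) is a simplification; the full matching cost in the paper also considers other terms such as bounding boxes:

```python
# Sketch of instance sequence matching with the Hungarian algorithm (simplified cost).
import torch
from scipy.optimize import linear_sum_assignment

def match_instance_sequences(pred_logits, gt_labels):
    """pred_logits: (n, T, num_classes) per predicted sequence; gt_labels: (m, T) class ids."""
    prob = pred_logits.softmax(-1)
    n, T, _ = prob.shape
    m = gt_labels.shape[0]
    cost = torch.zeros(n, m)
    for j in range(m):
        # cost of assigning each prediction to ground-truth sequence j, summed over the clip
        cost[:, j] = -prob[:, torch.arange(T), gt_labels[j]].sum(dim=-1)
    row_ind, col_ind = linear_sum_assignment(cost.numpy())
    return list(zip(row_ind.tolist(), col_ind.tolist()))

# toy usage: 10 predicted sequences, 4 ground-truth sequences, 6 frames, 41 classes
pairs = match_instance_sequences(torch.randn(10, 6, 41), torch.randint(0, 41, (4, 6)))
print(pairs)   # each ground-truth sequence is matched to one prediction index
```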

Instance Sequence Segmentation

Task: Calculate the mask of each instance in the corresponding frame

O is the output of the decoder, E is the output of the encoder, and B is the feature extracted by CNN

The essence of instance segmentation is learning pixel similarity. For each frame, the prediction O and the encoded features E are first fed into self-attention to compute similarity, and the result serves as the initial mask of that instance for the frame; it is then fused with the frame's initial backbone features B and the encoded features E to obtain the final mask feature of the instance for that frame. To make better use of temporal information, the masks of this instance from multiple frames are concatenated into a mask sequence and sent to a 3D convolution module for segmentation.

This method strengthens single-frame segmentation by exploiting the features of the same instance across multiple frames, taking advantage of temporal information.

Reason: when an object is in a challenging state, e.g. motion blur or occlusion, information learned from other frames can help the segmentation; in other words, features of the same instance from multiple frames help the network identify the instance better.

Assume the mask feature g(i, t) of instance i in frame t has shape 1 * a * H'/4 * W'/4, where a is the number of channels; concatenating the features over the T frames gives the instance's masks across all frames: 1 * a * T * H'/4 * W'/4.

The 4 here is the spatial downsampling factor: the mask features are at 1/4 of the input resolution.
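
A sketch of this last step, assuming illustrative channel counts and a small 3D-convolution head (the exact head design in the paper may differ):

```python
# Concatenate the per-frame mask features g(i, t) of one instance along time and
# segment the resulting sequence with 3D convolutions.
import torch
import torch.nn as nn

a, T, Hq, Wq = 8, 6, 96, 160                  # a channels, mask features at 1/4 resolution
mask_feats = [torch.randn(1, a, Hq, Wq) for _ in range(T)]   # g(i, t) for t = 1..T

seq = torch.stack(mask_feats, dim=2)          # (1, a, T, H'/4, W'/4): the instance's mask sequence

head = nn.Sequential(
    nn.Conv3d(a, a, kernel_size=3, padding=1), nn.GroupNorm(4, a), nn.ReLU(),
    nn.Conv3d(a, 1, kernel_size=3, padding=1),                # one mask logit per pixel
)
masks = head(seq)                             # (1, 1, T, H'/4, W'/4)
print(masks.shape)
```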


Ablation experiment

length of the video sequence

Increasing the clip length from 18 to 36 frames improves the results; more temporal information helps.

order of video sequences

[Table: ablation on the order of the input video sequence]

positional encoding

The positional encoding supplies relative position information within the video sequence.

The result in the first row of the table (no positional encoding) is explained by the fact that the ordered format of the sequence supervision, together with the fixed correspondence between the transformer's input and output order, already implies part of the relative position information.

[Table: ablation on positional encoding]

Learnable Instance Query Embeddings

Default (prediction level): one embedding is responsible for one prediction, n * T embeddings in total

video level: only one embedding, reused n * T times

frame level: For each frame, use one embedding, that is, T embeddings, and for each embedding, repeat n times

instance level: For each instance, use one embedding, that is, n embeddings, repeated T times

The queries of one instance can be shared, which can be used to improve speed.
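
A shape-only sketch of how the four query-sharing schemes above could be constructed (the parameterization is an assumption for illustration):

```python
# Build the n*T query tensor under each sharing scheme from the ablation.
import torch.nn as nn

n, T, d = 10, 6, 384

pred_level  = nn.Embedding(n * T, d).weight                          # one query per prediction
video_level = nn.Embedding(1, d).weight.expand(n * T, d)             # one query reused n*T times
frame_level = nn.Embedding(T, d).weight.repeat_interleave(n, dim=0)  # T queries, each reused n times
inst_level  = nn.Embedding(n, d).weight.repeat(T, 1)                 # n queries, each reused T times

for q in (pred_level, video_level, frame_level, inst_level):
    print(tuple(q.shape))                                            # all (n*T, d)
```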


Experimental results

Dataset: YouTube-VIS

Fast inference speed, thanks to parallel decoding.

Visualization results: (a) overlapping instances, (b) changes in relative positions between instances, (c) confusion when instances of the same category are close together, (d) instances in various poses.


Origin blog.csdn.net/xqh_Jolene/article/details/124784226