OneFormer: One Transformer to Rule Universal Image Segmentation (paper notes)

Paper: https://arxiv.org/pdf/2211.06220.pdf
Code: https://github.com/SHI-Labs/OneFormer

1. Motivation

[Figure: comparison of task-specific, semi-universal, and universal segmentation frameworks]

  • Universal image segmentation is not a concept first proposed in this paper. Earlier there was UPerNet, and more recently Mask2Former and K-Net, which were also proposed as general segmentation architectures.
  • However, this paper argues that although these networks unify the model structure for semantic, instance, and panoptic segmentation, they still have to be trained separately for each task to obtain task-specific models. The authors call this kind of model semi-universal.
  • The OneFormer proposed in this paper not only has a unified model structure, but also needs to be trained only once, on the panoptic segmentation dataset, to obtain a single model that can run inference for all three tasks (semantic, instance, and panoptic segmentation), making it universal in the true sense.

2. Method

[Figure: OneFormer architecture overview]

2.1 Similarity with Mask2Former

Ignoring the text-related components in module (b), the rest of OneFormer's model structure is basically the same as Mask2Former's:

  1. Backbone: extract multi-scale features with an ImageNet-pretrained network;
  2. Pixel Decoder: model multi-scale contextual features with multi-scale deformable attention (MSDeformAttn);
  3. Transformer Decoder: update the object queries using feature maps at three resolutions {1/8, 1/16, 1/32}; the main building blocks are cross-attention, self-attention, and FFN;
  4. Use the updated object queries to predict (K+1)-way class scores;
  5. Take the dot product of the updated object queries and the 1/4-resolution feature map to generate a binary mask for each query (see the sketch below).
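
As a rough illustration of steps 4 and 5, here is a minimal PyTorch-style sketch of how class scores and binary masks are typically produced from the updated queries and the 1/4-scale pixel features; the module and tensor names are my own assumptions, not taken from the official repository:

```python
import torch
import torch.nn as nn

class PredictionHeads(nn.Module):
    """Minimal sketch of Mask2Former/OneFormer-style prediction heads."""

    def __init__(self, hidden_dim: int, num_classes: int, mask_dim: int):
        super().__init__()
        # (K + 1) logits per query: K classes plus a "no object" class
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)
        # project each query into the space of the per-pixel embeddings
        self.mask_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, mask_dim),
        )

    def forward(self, queries: torch.Tensor, pixel_feats: torch.Tensor):
        # queries:     (B, N, hidden_dim)   updated object queries
        # pixel_feats: (B, mask_dim, H, W)  1/4-resolution per-pixel embeddings
        class_logits = self.class_head(queries)                # (B, N, K + 1)
        mask_embed = self.mask_head(queries)                   # (B, N, mask_dim)
        # dot product between every query and every pixel embedding
        mask_logits = torch.einsum("bnc,bchw->bnhw", mask_embed, pixel_feats)
        return class_logits, mask_logits
```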

2.2 Innovations of OneFormer

Compared with Mask2Former, the innovations or differences of OneFormer are as follows:

  1. Task Conditioned Joint Training: to train semantic, instance, and panoptic segmentation within one unified architecture, a task-specific prompt is introduced.
  2. Query Representations: in addition to the object queries used in methods such as Mask2Former (called visual queries here for ease of understanding), this paper introduces text queries; semantic, instance, and panoptic segmentation each have different corresponding text queries.
  3. Task Guided Contrastive Queries: a contrastive loss is computed between the visual queries and the text queries; because the text queries of different tasks differ, the visual queries learned for different tasks become distinguishable.

2.3 Task Conditioned Joint Training

  • First, for each task, the template "the task is {task}" is used to construct the task prompt $I_{task}$; the prompt then goes through tokenization, embedding, and an MLP to obtain the task token $Q_{task}$.
  • In addition, as shown in the figure below, for each task the things and stuff appearing in the ground truth are counted, and a text list $T_{list}$ is built with entries of the form "a photo with a {CLS}". To align text lengths within a batch, entries of the form "a/an {task} photo", which represent no-object, are used for padding; the padded result is $T_{pad}$ (see the sketch after the figure).

[Figure: construction of the task prompt and the text list]
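
A minimal sketch of how such a padded text list could be assembled from the ground truth of the sampled task; the templates follow the description above, while the function name and the fixed list length are assumptions for illustration:

```python
def build_text_list(gt_class_names: list[str], task: str, num_text: int) -> list[str]:
    """Build the padded text list T_pad for one training sample.

    gt_class_names: class name of every thing/stuff mask in the ground
                    truth of the sampled task, e.g. ["person", "person", "car"].
    task:           "panoptic", "instance", or "semantic".
    num_text:       fixed length used to align text lists within a batch.
    """
    # T_list: one entry per ground-truth mask, "a photo with a {CLS}"
    t_list = [f"a photo with a {cls}" for cls in gt_class_names]

    # pad with no-object entries "a/an {task} photo" up to num_text
    padding = [f"a {task} photo"] * max(0, num_text - len(t_list))
    return (t_list + padding)[:num_text]

# e.g. a semantic-task sample with three masks, padded to length 6
print(build_text_list(["person", "car", "sky"], "semantic", 6))
```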

2.4 Query Representations

The Query Representations section mainly describes how the text queries $Q_{text}$ and the object queries $Q$ are constructed and initialized:

  • $Q_{text}$: $T_{pad}$ is tokenized and embedded, then passed through a 6-layer transformer encoder to obtain $N_{text}$ embeddings. Then $N_{ctx}$ learnable embeddings are concatenated with the $N_{text}$ embeddings, finally giving the $N$ text queries $Q_{text}$.
    [Figure: text query generation]

  • $Q$: first, $Q_{task}$ is repeated $N-1$ times to obtain the initialized object queries $Q'$; then $Q'$ is updated with the 1/4-resolution features (using a 2-layer transformer); finally, $Q_{task}$ is concatenated with $Q'$ to give the $N$ object queries $Q$ (a sketch follows this list).
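
Putting sections 2.3 and 2.4 together, here is a minimal PyTorch-style sketch of how the two query sets could be assembled; the sizes, the relation $N = N_{text} + N_{ctx}$, and the stand-in random tensors are assumptions for illustration, not the actual OneFormer implementation:

```python
import torch
import torch.nn as nn

hidden_dim, N, N_ctx = 256, 150, 16        # assumed sizes, for illustration only
N_text = N - N_ctx                         # so that Q_text ends up with N entries

# --- text queries Q_text --------------------------------------------------
# embedded T_pad -> 6-layer transformer encoder -> N_text embeddings
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
    num_layers=6,
)
t_pad_embed = torch.randn(1, N_text, hidden_dim)      # stand-in for embedded T_pad
text_embeddings = text_encoder(t_pad_embed)           # (1, N_text, C)

# concatenate N_ctx learnable context embeddings to get N text queries
ctx_embeddings = nn.Parameter(torch.randn(1, N_ctx, hidden_dim))
q_text = torch.cat([text_embeddings, ctx_embeddings], dim=1)   # (1, N, C)

# --- object queries Q -----------------------------------------------------
# repeat the task token N - 1 times to initialize Q', update Q' against the
# flattened 1/4-scale features with a 2-layer transformer, then prepend Q_task
q_task = torch.randn(1, 1, hidden_dim)                 # stand-in for the task token
q_prime = q_task.repeat(1, N - 1, 1)                   # initialized object queries
flat_feats = torch.randn(1, 64 * 64, hidden_dim)       # stand-in 1/4-scale features
query_updater = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
    num_layers=2,
)
q_prime = query_updater(tgt=q_prime, memory=flat_feats)  # (1, N - 1, C)
q = torch.cat([q_task, q_prime], dim=1)                  # (1, N, C) object queries
```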

2.5 Task Guided Contrastive Queries

The key challenge in unifying semantic, instance, and panoptic segmentation in a single model is how to generate task-specific object queries for each task. So how can the object queries of the different tasks be distinguished from one another?

The solution in this paper is to compute a contrastive loss between the text queries $Q_{text}$ and the object queries $Q$. Because $Q_{text}$ is derived from the ground truth of the specific task by counting the things and stuff present, the text queries of different tasks are already distinguishable from each other; it is then enough to align $Q$ with $Q_{text}$.

The contrastive loss used is as follows:

[Equation: object-text query contrastive loss]
$B$ is the number of object-text query pairs within a batch.
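
Since the formula itself only appears as an image in the original post, the following is the standard symmetric (CLIP-style) contrastive formulation that this description corresponds to; the notation is my reconstruction rather than a copy of the paper's equation:

$$
\mathcal{L}_{Q \rightarrow Q_{text}} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\left(q_i^{obj}\cdot q_i^{txt}/\tau\right)}{\sum_{j=1}^{B}\exp\left(q_i^{obj}\cdot q_j^{txt}/\tau\right)}, \qquad
\mathcal{L}_{Q_{text} \rightarrow Q} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\left(q_i^{txt}\cdot q_i^{obj}/\tau\right)}{\sum_{j=1}^{B}\exp\left(q_i^{txt}\cdot q_j^{obj}/\tau\right)},
$$

$$
\mathcal{L}_{Q \leftrightarrow Q_{text}} = \mathcal{L}_{Q \rightarrow Q_{text}} + \mathcal{L}_{Q_{text} \rightarrow Q},
$$

where $(q_i^{obj}, q_i^{txt})$ is the $i$-th matched object-text query pair in the batch and $\tau$ is a temperature parameter.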

3. Experiment

3.1 Benchmarks

[Benchmark result tables from the paper]

3.2 Ablation Studies

[Ablation study tables from the paper]

3.3 Hyperparameter Experiments

[Hyperparameter experiment tables from the paper]


Source: blog.csdn.net/xijuezhu8128/article/details/132831476