[Deep Learning] Semantic Segmentation: Paper Reading (NeurIPS 2021) MaskFormer: per-pixel classification is not all you need

details

Paper: Per-Pixel Classification is Not All You Need for Semantic Segmentation / MaskFormer
Code: official implementation

Notes:
Author's notes
[Paper Notes] MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation (a clear and concise summary of the ideas)
[MaskFormer] Per-Pixel Classification is Not All You Need for Semantic Segmentation (concise and clear)
Video explanation:
Brief analysis of the paper Per-Pixel Classification is Not All You Need

knowledge supplement

semantic segmentation

Semantic segmentation assigns a category label to every pixel in the image. As shown in the figure below, the image is divided into person (red), tree (dark green), grass (light green), and sky (blue) regions.
[Figure: semantic segmentation example]

instance segmentation

Reference: Instance Segmentation Summary (comprehensive overview)
Instance segmentation combines object detection and semantic segmentation: first detect the targets in the image (object detection), then assign a category label to every pixel of each target (semantic segmentation).
Comparing the figure above with the one below: if the targets are people, semantic segmentation does not distinguish different instances of the same category (all are marked red), whereas instance segmentation does distinguish instances of the same category (different people are marked with different colors).

The goal is to detect the targets in the input image and assign a class label to every pixel belonging to a target.
The ability to distinguish different instances of the same foreground semantic category is its biggest difference from semantic segmentation.
[Figure: instance segmentation example]

basic process

An instance segmentation model generally consists of three parts: image input, instance segmentation processing, and output of the segmentation results.

  • After the image is input, a backbone network such as VGGNet or ResNet is used to extract image features.
  • The features are then processed by the instance segmentation model.
    • The model may first determine the position and category of each target instance via object detection, and then segment within the selected region.
    • Or it may first perform semantic segmentation, then distinguish different instances, and finally output the instance segmentation results.

Main technical route

Research on instance segmentation has long followed two lines:
bottom-up methods based on semantic segmentation and top-down methods based on detection, both of which are two-stage approaches.

Top-Down Instance Segmentation Approach

The idea: first locate the region containing each instance (a bounding box) with an object detector,
then perform semantic segmentation inside each detection box, and output each segmentation result as a separate instance.
These methods detect first and segment second, e.g. FCIS, Mask R-CNN, PANet, Mask Scoring R-CNN.

The originator of top-down dense instance segmentation is DeepMask, which predicts a mask proposal at each spatial location using a sliding-window approach. This approach has three disadvantages:

  • The connection between the mask and the features (local consistency) is lost, e.g. DeepMask uses a fully connected network to extract the mask.
  • The feature representation is redundant, e.g. DeepMask extracts a mask for every foreground feature.
  • Position information is lost due to downsampling (convolutions with stride greater than 1).

Bottom-Up Instance Segmentation Methods

Treat each instance as a class: following the idea of clustering (maximize the distance between classes, minimize the distance within a class), learn an embedding for each pixel, and finally group the embeddings to separate the different instances. In general, bottom-up methods perform worse than top-down methods.

The idea: first perform pixel-level semantic segmentation, then distinguish different instances by clustering or metric learning. Although this approach preserves better low-level features (detail and position information), it has the following disadvantages:

  • It places high demands on the quality of the dense segmentation, which can lead to sub-optimal results.
  • Generalization is poor; it cannot handle complex scenes with many categories.
  • The post-processing is cumbersome.

Single-shot instance segmentation is influenced by single-stage object detection, so there are two lines of work: one inspired by anchor-based one-stage detectors such as YOLO and RetinaNet, with representative works YOLACT and SOLO; the other inspired by anchor-free detectors such as FCOS, with representative works PolarMask and AdaptIS.

Mask

[Mask]
In layman's terms, a mask is a stencil. When spray-painting, engraving, or painting, a stencil of a specific shape is placed over the material being worked on, so that the final pattern follows the shape of the stencil. A mask is exactly this kind of thing.

[Binary mask]

What is a binary mask? In image processing, the computer treats an image as a matrix. To operate on part of an image, you place an "occluder" over it: the image matrix is multiplied element-wise by another "occluder" matrix to obtain the desired result. For example:
[Figure: binary mask example]
As shown in the figure, after masking, the unwanted parts are filtered out by the 0 values in the mask, and what remains is the desired part.
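A minimal sketch of this idea in NumPy (the array values are made up for illustration): element-wise multiplication with a 0/1 matrix keeps the masked-in pixels and zeroes out everything else.

```python
# Minimal sketch of a binary mask: element-wise multiplication keeps the
# masked-in pixels and zeroes out the rest. Values are made up.
import numpy as np

image = np.array([[10, 20, 30],
                  [40, 50, 60],
                  [70, 80, 90]])

mask = np.array([[0, 1, 0],
                 [1, 1, 1],
                 [0, 1, 0]])

masked = image * mask
print(masked)
# [[ 0 20  0]
#  [40 50 60]
#  [ 0 80  0]]
```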

Reference: YOLOv5 object detection network - loss function calculation principle
A mask occludes all or part of the image being processed with a selected image, shape, or object, in order to control which region is processed.
In digital image processing, a mask is a two-dimensional matrix; sometimes a multi-valued image is used.
Image masks are mainly used for:
① Extracting a region of interest: multiply a pre-made region-of-interest mask with the image to be processed to obtain the region-of-interest image. Pixel values inside the region of interest stay unchanged, while values outside the region are set to 0.
② Shielding: use a mask to shield certain regions of the image so that they do not participate in processing or in the computation of processing parameters; alternatively, process or compute statistics only inside the masked regions.
③ Structural feature extraction: detect and extract structural features in the image that are similar to the mask, using similarity measures or image-matching methods.
④ Producing images with special shapes.

What is the mask (in YOLOv5)?

For an image divided into an 80×80 grid, the network predicts 3×80×80 boxes. Does every predicted box actually contain a target? Obviously not. During training, a preliminary judgment therefore has to be made from the labels: which predicted boxes are likely to contain targets? The mask is exactly such a 3×80×80 boolean matrix: its 3×80×80 boolean values correspond one-to-one to the 3×80×80 predicted boxes. Based on the label information and certain rules, each predicted box is judged to contain a target or not; if it does, the corresponding position in the mask matrix is set to true, otherwise false.

What is the mask for?

The network predicts three boxes for each cell of the 80×80 grid, so 3×80×80 predicted boxes are output. The prediction for each box includes the box coordinates, a confidence (objectness) score, and class probabilities. In fact, not every predicted box contributes to every loss term; which terms it contributes to is determined by the mask matrix (see the sketch after this list):

  • Only boxes whose corresponding position in the mask matrix is true contribute to the bounding-box loss;

  • only boxes whose corresponding position in the mask matrix is true contribute to the classification loss;

  • all boxes contribute to the confidence loss, but boxes whose mask value is true and boxes whose mask value is false use different confidence target values.
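The sketch below illustrates this masking logic in PyTorch. It is not the actual YOLOv5 implementation: the tensor shapes, the plain MSE box loss, and the single hand-placed target are assumptions made purely for illustration.

```python
# Illustrative sketch only (not the actual YOLOv5 code): a boolean mask
# decides which predicted boxes contribute to the box / classification losses.
import torch
import torch.nn.functional as F

num_anchors, grid, num_classes = 3, 80, 20            # assumed shapes
pred_box = torch.randn(num_anchors, grid, grid, 4)    # box regression outputs
pred_cls = torch.randn(num_anchors, grid, grid, num_classes)
pred_obj = torch.randn(num_anchors, grid, grid)       # objectness (confidence) logits

# Boolean mask built from the labels: True where a target is assigned.
obj_mask = torch.zeros(num_anchors, grid, grid, dtype=torch.bool)
obj_mask[0, 40, 40] = True                            # pretend one target landed here
n_pos = int(obj_mask.sum())

# Box and classification losses are computed only at masked (True) positions.
tgt_box = torch.randn(n_pos, 4)                       # made-up regression targets
tgt_cls = torch.zeros(n_pos, num_classes)             # made-up class targets
box_loss = F.mse_loss(pred_box[obj_mask], tgt_box)    # stand-in for the real IoU-based loss
cls_loss = F.binary_cross_entropy_with_logits(pred_cls[obj_mask], tgt_cls)

# The confidence loss is computed on all positions, with different targets
# for masked (True -> 1) and unmasked (False -> 0) boxes.
obj_loss = F.binary_cross_entropy_with_logits(pred_obj, obj_mask.float())
```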

[A mask labels, for every pixel, whether that pixel belongs to an object; a bounding box is comparatively coarse.]

mask classification

First generate N binary masks on the image, then predict a category for each binary mask (all foreground pixels of a binary mask are considered to belong to its category, just as in Mask R-CNN).

Semantic segmentation and panoptic segmentation use different ground truth:
in semantic segmentation each category corresponds to one binary mask, while in panoptic segmentation each instance corresponds to one binary mask.

DeepMask

Detailed explanation of DeepMask
Mask R-CNN network

Overall, given an image patch as input, DeepMask outputs a class-agnostic mask and an associated score estimating the probability that the patch fully contains an object. Its biggest feature is that it does not rely on edges, superpixels, or any other form of low-level segmentation; it is the first work to learn to generate segmentation proposals directly from raw image data.

Another big difference from other segmentation work is that DeepMask outputs segmentation masks instead of bounding boxes.
[A mask labels, for every pixel, whether that pixel belongs to an object; a bounding box is comparatively coarse.]

Max-Deeplab

Max-DeepLab models panoptic segmentation as a set of masks, which unifies the representation of foreground and background: every background class (i.e., the semantic segmentation part) is also expressed as a mask, decoupled from classification.
In Max-DeepLab, an image has N queries (100 in the final setting); each query corresponds to one mask and one C-way classification result. Masks that do not meet the requirements are then discarded according to the classification score, turning a fixed-length prediction into a variable-length result and thus completing panoptic segmentation.

motivation

Semantic Segmentation
Current methods usually formulate semantic segmentation as a per-pixel classification task.

  • Since the advent of Fully Convolutional Networks (FCNs), semantic segmentation has by default been solved as a per-pixel classification problem (Figure 1, left). Per-pixel classification greatly simplifies semantic segmentation, turning a segmentation (or pixel-grouping) problem into a classification (or recognition) problem. Admittedly, this simplification is quite clever,
  • but on the other hand it also limits people's imagination.

Limitations of per-pixel classification:
If you view semantic segmentation as a "segmentation" problem, you find that per-pixel classification itself has many limitations.
The biggest problem is that it can only output a fixed number of segmentation masks (this fixed number equals the number of categories defined by the dataset),
so per-pixel classification struggles with harder problems such as instance segmentation.

Instance Segmentation
Instance segmentation is handled with mask classification.

In contrast, instance segmentation has always been solved by mask-classification-based methods, represented by Mask R-CNN (Figure 1, right).

Mask classification
The biggest difference between mask classification and per-pixel classification is that in mask classification each binary mask needs only one global category (rather than each pixel needing a category).

Each binary mask predicts one category (instead of one category per pixel).

We believe mask classification is a more general segmentation approach; before FCN, mask classification once "dominated" the Pascal VOC semantic segmentation challenge (O2P, R-CNN, SDS, and other mask-proposal-based methods). But with the emergence of the simpler FCN, everyone abandoned the mask-classification path.

Questions:
Can mask classification simplify semantic and instance segmentation into a single paradigm?
Can mask classification perform better than per-pixel classification on semantic segmentation tasks?

The authors' view is that mask classification is fully general: semantic- and instance-level segmentation tasks can be solved in a unified way using exactly the same model, loss, and training procedure.

Therefore, the authors propose MaskFormer, which converts existing per-pixel classification models into mask classification models.

MaskFormer predicts a set of binary masks, each associated with a single global class label prediction. It simplifies semantic and panoptic segmentation tasks, and performs better than per-pixel classification when the number of classes is large.

Related Works

How to evaluate the MaskFormer proposed by FAIR, which achieves SOTA (55.6 mIoU) on ADE20K semantic segmentation? (answer by tourbillon on Zhihu)
https://www.zhihu.com/question/472122951/answer/1997405212

MaskFormer views semantic segmentation as two sub-tasks: class prediction and mask prediction. The semantic segmentation task is solved by putting class predictions and mask predictions into one-to-one correspondence.

Max-DeepLab proposed this concept: predict a class branch and a mask branch, then compute the loss through bipartite matching.
[Figure 1: left, per-pixel classification, where the same classification loss is applied at every position; right, mask classification, which predicts a set of binary masks and assigns one class to each mask.]

The method in this paper does not output only K binary maps (one per category); instead, a larger number is fixed in advance. Together with each binary map, a class is also predicted, and this class may be empty (∅), meaning that binary map is not used.
The loss therefore has two terms: a mask loss between the predicted binary map and the ground-truth map, and a cross-entropy loss on the predicted class.

Because the class of each predicted binary map can change from image to image, a matching between predictions and ground-truth segments is performed before computing the loss, with matching cost
$$-p_i(c_j^{gt}) + \mathcal{L}_{\text{mask}}(m_i, m_j^{gt}).$$
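A hedged sketch of this bipartite matching step is shown below. It follows the spirit of DETR/MaskFormer but is not the official code: the mask cost here is a simple mean L1 distance standing in for the paper's mask loss (which combines focal and dice terms), and all tensor shapes are assumptions for illustration.

```python
# Hedged sketch of bipartite matching between N predictions and M ground-truth
# segments (not the official implementation).
import torch
from scipy.optimize import linear_sum_assignment

def match(pred_cls, pred_masks, gt_labels, gt_masks):
    """pred_cls: (N, K+1) softmax scores; pred_masks: (N, H*W) in [0, 1];
    gt_labels: (M,) long; gt_masks: (M, H*W) binary."""
    cost_cls = -pred_cls[:, gt_labels]                   # (N, M): -p_i(c_j^gt)
    # Simple mean L1 distance as a stand-in for L_mask
    cost_mask = torch.cdist(pred_masks, gt_masks.float(), p=1) / pred_masks.shape[1]
    cost = cost_cls + cost_mask                          # (N, M) total assignment cost
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return rows, cols  # prediction rows[i] is matched to ground-truth segment cols[i]
```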

Per-pixel classification formulation

Per-pixel classification:
The segmentation task is viewed as classifying every pixel. For an $H \times W$ image, a per-pixel classification model predicts, for every pixel, a probability distribution over the $K$ categories of the dataset:
$$y = \{p_i \mid p_i \in \Delta^K\}_{i=1}^{H \cdot W},$$
where $\Delta^K$ denotes the $K$-dimensional probability simplex. The ground-truth labels are
$$y^{gt} = \{y_i^{gt} \mid y_i^{gt} \in \{1, \dots, K\}\}_{i=1}^{H \cdot W}.$$
Since this is a classification task, the per-pixel loss is simply the sum of the cross-entropy losses over all pixels:
$$\mathcal{L}_{\text{pixel-cls}}(y, y^{gt}) = \sum_{i=1}^{H \cdot W} -\log p_i(y_i^{gt}).$$
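In PyTorch this loss is just the standard cross-entropy applied over all pixels. A minimal sketch (shapes chosen arbitrarily):

```python
# Minimal sketch of the per-pixel classification loss: cross-entropy over all
# H*W pixels.
import torch
import torch.nn.functional as F

K, H, W = 150, 64, 64
logits = torch.randn(1, K, H, W)               # per-pixel class scores
target = torch.randint(0, K, (1, H, W))        # ground-truth label y_i^gt per pixel

loss = F.cross_entropy(logits, target, reduction="sum")  # sum over pixels of -log p_i(y_i^gt)
```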

Mask classification formulation

As shown in Figure 1 (right), a mask classification model splits the segmentation task into two steps:

  • 1. Divide the image into N different regions, each represented by a binary mask $\{m_i \mid m_i \in [0,1]^{H \times W}\}_{i=1}^{N}$ (this step only separates the regions; it does not classify them).
  • 2. Classify each region, as a whole, into one of the K categories. Note that multiple regions are allowed to receive the same category, which is what lets the method apply to both semantic- and instance-level segmentation tasks.

For mask classification, the desired output $z$ is a set of N probability-mask pairs:
$$z = \{(p_i, m_i)\}_{i=1}^{N}, \qquad p_i \in \Delta^{K+1},$$
where each class distribution $p_i$ includes an extra "no object" category (∅). Mask classification allows multiple mask predictions to share the same class, which makes it suitable for both semantic and instance segmentation tasks.
For semantic segmentation:

  • A trivial fixed matching is possible when the number of predictions N equals the number of class labels K: the i-th prediction is matched to the ground-truth region with class label i.
  • If class i does not appear in the ground-truth labels, prediction i is matched to "no object" (∅).
  • In experiments, assignment based on bipartite matching is found to work better than fixed matching.

The loss function of the final model
combines the mask loss $\mathcal{L}_{\text{mask}}$ from the segmentation step with the cross-entropy loss from the classification step, and can be written as
$$\mathcal{L}_{\text{mask-cls}}(z, z^{gt}) = \sum_{j=1}^{N} \Big[-\log p_{\sigma(j)}(c_j^{gt}) + \mathbb{1}_{c_j^{gt} \neq \varnothing}\, \mathcal{L}_{\text{mask}}(m_{\sigma(j)}, m_j^{gt})\Big],$$
where $\sigma$ is the matching between predictions and ground-truth segments, and the ground truth is padded with "no object" (∅) entries up to length N.
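The sketch below computes this loss under a fixed matching, purely for illustration: the paper uses bipartite matching, and its mask loss combines focal and dice terms rather than the plain binary cross-entropy used here. The shapes and the `no_object` index are assumptions.

```python
# Sketch of the mask-classification loss under a *fixed* matching
# (prediction i <-> ground-truth segment i). Not the official loss.
import torch
import torch.nn.functional as F

def mask_cls_loss(pred_logits, pred_masks, gt_labels, gt_masks, no_object=150):
    """pred_logits: (N, K+1); pred_masks: (N, H, W) logits;
    gt_labels: (N,) with `no_object` for unmatched slots; gt_masks: (N, H, W) binary."""
    # Classification term: cross-entropy over K+1 classes (incl. "no object")
    loss_cls = F.cross_entropy(pred_logits, gt_labels)
    # Mask term: only for predictions matched to a real ground-truth segment
    keep = gt_labels != no_object
    loss_mask = 0.0
    if keep.any():
        loss_mask = F.binary_cross_entropy_with_logits(pred_masks[keep], gt_masks[keep].float())
    return loss_cls + loss_mask
```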

MaskFormer

The core contribution:
In the original formulation, semantic segmentation predicts a category independently for every pixel from that pixel's embedding. MaskFormer instead feeds the image features into a Transformer decoder, which produces one embedding per segment; from each segment embedding it derives a mask embedding and a classification. The mask embedding is multiplied with the per-pixel embeddings to obtain that segment's shape (its mask), while the same segment embedding is used directly to predict the segment's class. This replaces independent prediction on each pixel's feature.

Process:
MaskFormer treats panoptic segmentation as a mask classification task. A Transformer decoder plus an MLP produce N mask embeddings and N class predictions. A second branch obtains per-pixel embeddings through a pixel decoder; multiplying the mask embeddings with the per-pixel embeddings yields N mask predictions. Finally, the class predictions and mask predictions are combined, masks without targets are discarded, and the final prediction is obtained.
The overall structure consists of a decoder over the pixels of the whole image and a Transformer that predicts the N segments, and the two are then integrated, as shown in the figure below:
[Figure 2: MaskFormer architecture]
C is the number of channels. The pixel decoder (green, bottom) operates at H×W resolution, while the Transformer decoder (top) produces the N segmentation maps.
After the Transformer output, an MLP keeps the number of channels consistent between the two branches, and each segmentation map is then predicted by a dot product:
$$m_i = \operatorname{sigmoid}\big(\mathcal{E}_{\text{mask}}[:, i]^{\mathsf{T}} \cdot \mathcal{E}_{\text{pixel}}\big) \in [0, 1]^{H \times W}.$$
On the classification side, the Transformer decoder output is mapped into K + 1 categories, including an empty ("no object") category.
In the final integration step, the empty categories are removed first, and each pixel is then assigned according to the highest product of class confidence and segmentation-map confidence:
$$\arg\max_{i\,:\,c_i \neq \varnothing} p_i(c_i)\cdot m_i[h, w].$$

It can be divided into three main parts:

1) Pixel-level module:
a backbone that extracts image features and a pixel-level decoder that generates the per-pixel embeddings
(gray background in the figure).

2) Transformer module:
a stack of Transformer decoder layers
that computes the N per-segment embeddings (green background).

3) Segmentation module:
generates the prediction results from the per-pixel embeddings and per-segment embeddings above (blue background).

Pixel-level module

First, the backbone reduces the spatial dimensions H and W of the image while increasing the channel dimension, extracting visual features; this part is the same as ordinary CNN feature extraction.
A pixel decoder then restores the spatial size back to H and W.

Given an H×W input image:

  • the backbone produces a low-resolution feature map F of size C_F × H/S × W/S;
  • the pixel decoder progressively upsamples the features to size C_E × H × W, producing the per-pixel embeddings E_pixel.

Any segmentation model based on per-pixel classification fits this pixel-level module design, so MaskFormer can seamlessly convert such models to mask classification. The backbones used in the paper are ResNet and Swin Transformer.

Pixel decoder: a lightweight decoder based on the popular FPN architecture.
In the FPN, each low-resolution feature map in the decoder is upsampled by 2× and added to the projected backbone feature map of the corresponding resolution (the projection, a 1×1 convolution + GroupNorm, matches the channel dimensions).
The summed features are then fused by an additional 3×3 convolution + GN + ReLU.
This process is repeated until the final feature map is obtained.
Finally, a single 1×1 convolution produces the per-pixel embeddings.
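Below is a rough, hedged sketch of such an FPN-style pixel decoder. The channel counts, GroupNorm group number, and nearest-neighbor upsampling are assumptions for illustration, not the official implementation.

```python
# Rough sketch of an FPN-style pixel decoder (assumed hyper-parameters).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelDecoder(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), dim=256, mask_dim=256):
        super().__init__()
        # 1x1 projection + GroupNorm for each backbone level (lateral connections)
        self.lateral = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, dim, 1), nn.GroupNorm(32, dim)) for c in in_channels)
        # 3x3 conv + GN + ReLU fusion at each level
        self.output = nn.ModuleList(
            nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GroupNorm(32, dim), nn.ReLU())
            for _ in in_channels)
        self.mask_proj = nn.Conv2d(dim, mask_dim, 1)   # final 1x1 conv -> per-pixel embedding

    def forward(self, features):
        # `features`: backbone maps ordered from highest to lowest resolution
        x = self.output[-1](self.lateral[-1](features[-1]))    # start from the coarsest map
        for i in range(len(features) - 2, -1, -1):
            # upsample to the next level's size and add the projected lateral map
            x = F.interpolate(x, size=features[i].shape[-2:], mode="nearest")
            x = self.output[i](x + self.lateral[i](features[i]))
        return self.mask_proj(x)                                # E_pixel

# Usage sketch with made-up backbone feature shapes (strides 4/8/16/32):
feats = [torch.randn(1, c, 128 // 2**i, 128 // 2**i)
         for i, c in enumerate((256, 512, 1024, 2048))]
pixel_embedding = PixelDecoder()(feats)   # (1, 256, 128, 128)
```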

Transformer module

A standard Transformer decoder attends over the image features F using N learnable positional embeddings (i.e., queries),
and outputs N per-segment embeddings that encode global information about each segment MaskFormer predicts.

The authors use 6 Transformer decoder layers and 100 queries, and, as in DETR, apply the same loss after every decoder layer.
In experiments, they observe that MaskFormer is quite competitive for semantic segmentation even with a single decoder layer, but multiple layers are necessary for instance segmentation in order to remove duplicates from the final predictions.
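A minimal sketch of this module using PyTorch's built-in Transformer decoder; positional encodings for the image features and batching details are omitted, and the hyper-parameters simply follow the text (100 queries, 6 layers).

```python
# Minimal sketch: N learnable queries attend to the flattened image features
# through a standard Transformer decoder. Not the official implementation.
import torch
import torch.nn as nn

dim, num_queries, num_layers = 256, 100, 6
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=dim, nhead=8), num_layers=num_layers)
queries = nn.Parameter(torch.zeros(num_queries, 1, dim))   # N learnable queries (zero-init)

image_features = torch.randn(32 * 32, 1, dim)              # backbone feature map F, flattened
per_segment_embeddings = decoder(queries, image_features)  # (N, batch=1, C)
```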

Segmentation module

A linear classifier followed by a softmax is applied to each per-segment embedding to produce the class probability predictions for each segment.

For mask prediction, the per-segment embeddings are converted into N mask embeddings by a two-layer MLP.
Each mask embedding is then dot-multiplied with the per-pixel embeddings,
followed by a sigmoid activation, to obtain the final mask predictions.
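A small sketch of this step (batch dimension omitted, shapes assumed):

```python
# Sketch of the segmentation module: a 2-layer MLP maps each per-segment
# embedding to a mask embedding, which is dot-multiplied with the per-pixel
# embedding and passed through a sigmoid.
import torch
import torch.nn as nn

N, C, H, W = 100, 256, 128, 128
per_segment = torch.randn(N, C)      # Transformer decoder output (per-segment embeddings)
per_pixel = torch.randn(C, H, W)     # pixel decoder output (per-pixel embeddings)

mlp = nn.Sequential(nn.Linear(C, C), nn.ReLU(), nn.Linear(C, C))   # two-layer MLP
mask_emb = mlp(per_segment)                                        # (N, C) mask embeddings

mask_logits = torch.einsum("nc,chw->nhw", mask_emb, per_pixel)     # per-pixel dot products
mask_pred = mask_logits.sigmoid()                                  # N soft binary masks
```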

Mask Classification Inference

General inference:
each pixel of the image is assigned to one of the N predicted segments.

How the assignment is made:
for each pixel, compute a score for each of the N probability-mask pairs,
then take the argmax over the N possibilities; the pixel is assigned to the most likely segment and takes that segment's class.

For semantic segmentation, segments that share the same category label are merged.
For instance segmentation, the labels of these segments are not merged.
The per-pixel assignment is computed as
$$\arg\max_{i\,:\,c_i \neq \varnothing} p_i(c_i)\cdot m_i[h, w].$$

semantic reasoning

Semantic inference is done with a simple matrix multiplication. The authors find that marginalizing over the probability-mask pairs gives better results than the general inference strategy of hard-assigning each pixel to a single probability-mask pair:
$$\arg\max_{c \in \{1, \dots, K\}} \sum_{i=1}^{N} p_i(c)\, m_i[h, w].$$
The argmax here does not include the "no object" category (∅), because standard semantic segmentation requires one label per output pixel. Note that this strategy returns per-pixel class probabilities; however, the authors observe that directly maximizing the per-pixel class likelihood leads to poor performance.
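A minimal sketch of this marginalization (shapes assumed; random tensors stand in for real model outputs):

```python
# Semantic inference by marginalizing over the N probability-mask pairs and
# taking the per-pixel argmax over the K real classes ("no object" is dropped).
import torch

N, K, H, W = 100, 150, 128, 128
cls_prob = torch.softmax(torch.randn(N, K + 1), dim=-1)   # (N, K+1); last column = "no object"
mask_prob = torch.rand(N, H, W)                            # sigmoid mask predictions m_i

# sum_i p_i(c) * m_i[h, w] for every real class c, then per-pixel argmax
semantic_prob = torch.einsum("nk,nhw->khw", cls_prob[:, :K], mask_prob)
semantic_seg = semantic_prob.argmax(dim=0)                 # (H, W) predicted class labels
```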

Experiment - Semantic Segmentation

For the ADE20K dataset, unless otherwise specified, we use a crop size of 512 × 512, a batch size of 16, and train all models for 160k iterations. For the ADE20K-full dataset, we use the same settings as for ADE20K, except that all models are trained for 200k iterations.

Semantic segmentation of 150 categories on ADE20K val: mask classification with MaskFormer outperforms the best per-pixel classification methods while using fewer parameters and less computation.
[Table: ADE20K val results]
MaskFormer compared with per-pixel classification baselines on 4 semantic segmentation datasets: MaskFormer's improvement is larger when the number of classes is larger. We use a ResNet-50 backbone and report single-scale mIoU and PQ_St for ADE20K, COCO-Stuff, and ADE20K-full, while for the higher-resolution Cityscapes we use the deeper ResNet-101 backbone [8, 9].
[Table: results on the 4 datasets]
Semantic segmentation on the ADE20K test set (150 categories): MaskFormer outperforms previous state-of-the-art methods on all three metrics: pixel accuracy (P.A.), mIoU, and the final test score (the average of P.A. and mIoU). We train our model from an ImageNet-22K pretrained checkpoint on the ADE20K training set, following [29], and use multi-scale inference.
[Table: ADE20K test results]

video notes

Brief analysis of the paper Per-Pixel Classification is Not All You Need
[Figure]
First segment, without knowing what each region is, only that each region belongs to one object; then decide which category each region belongs to.

This is therefore different from the problem of classifying each pixel: in this paper we only need to look at each instance, i.e., each object we have segmented out, and decide which category it is.

Once the object's category is determined, we simply assign all the pixels belonging to that object to this category, and we are done.

[Figure]

  • First, the image is divided into n regions; n is not necessarily equal to k, where k is the number of possible object categories in the dataset, and generally n is greater than k.
  • In general, a binary mask (a mask whose values are either 1 or 0) is used to represent one object. Once we have these masks, we decide which category each mask corresponds to.

Here the authors give an intuitive comparison.

  • Traditional per-pixel classification predicts, for every pixel, a vector over the k classes: the probability of that pixel belonging to each class.
  • This paper first divides the image into n regions, each representing one object (plus an empty region meaning nothing is there). With the n regions in hand, classification is done on the features of the whole region, which contain shape information and provide more context than a single pixel, enabling a more accurate classification.
    This is the core idea of the paper.

[Figure]
The authors build the model around this idea.

  • First, an image is fed into the backbone (e.g., ResNet or Swin Transformer) to extract features.
  • The features are then fed into a Transformer decoder. After decoding and an MLP, n region representations are produced and used as predictions; k is the number of classes in the dataset, and the +1 is an empty class added in case one of the n regions contains nothing. This is the classification head.
  • For the mask head: how is each embedding vector turned back into an H x W map? Multiply it directly with the output of the pixel decoder to obtain n H x W maps. Each map is a binary mask indicating which pixels belong to that region and which do not.
  • Finally, combining these two outputs yields a conventional semantic segmentation result.
    - The n queries are initialized as all-zero vectors.

Bipartite matching:
n regions, n masks.


Origin blog.csdn.net/zhe470719/article/details/125067737