"MGMatting: Mask Guided Matting via Progressive Refinement Network" paper notes

Reference code: MGMatting

1. Overview

Guide: This article proposes a guidance-based matting method, where the guidance takes two forms: extra-guidance and self-guidance. Extra-guidance adds a trimap, segmentation mask, or low-quality alpha map to the input to give the network prior knowledge. Self-guidance is implemented by adding a PRN (Progressive Refinement Network) at different stages of the decoder, where the output of the previous stage serves as guidance for the current stage, focusing its regression on the semi-transparent regions. For the other sub-problem in matting, foreground prediction, the article uses a separate encoder-decoder network, thereby decoupling alpha and foreground prediction to get better results. In addition, it uses a Random Alpha Blending (RAB) compositing operation to synthesize abundant training data, which further improves foreground prediction performance. The article's git repository mentions that the additional matting data used may be open-sourced; interested readers can keep an eye on it.

The article's method does not do matting directly on the original image; instead, it provides a mask guidance (extra-guidance) as better prior knowledge, making the final result more robust. Through self-guidance, the model can pay more attention to the semi-transparent regions, enhancing the expressiveness of details. The following figure shows the effect:
[figure omitted]

2. Method design

2.1 Pipeline

The article's method pipeline is shown in the figure below:
[figure omitted]
Its structure is a typical U-shaped encoder-decoder network: the input is the raw image plus the extra-guidance mask, and PRN modules are used in the decoder part for further mining of the semi-transparent regions.

2.2 Progressive Refinement Network

This part of the network acts as self-guidance in the overall pipeline. In the decoder, it fuses the output of the previous stage with the output of the current stage, focusing refinement on the semi-transparent regions to obtain more precise predictions. Denote the alpha output at decoder stage $l$ as $\alpha_l$. The self-guidance mask $g_l$ is computed from the previous stage's output by:

$$f_{\alpha_{l-1}\to g_l}(x,y)=\begin{cases}1, & \text{if } 0<\alpha_{l-1}(x,y)<1\\ 0, & \text{otherwise}\end{cases}$$

Denote the raw alpha prediction of the current stage as $\alpha_l^{'}$; the fused alpha output of the current stage is then:

$$\alpha_l=\alpha_l^{'}\,g_l+\alpha_{l-1}(1-g_l)$$

There are two such PRN stages in the paper. The initial mask $g_0$ is all ones, i.e., every pixel of the coarsest output participates in the loss calculation.
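The fusion rule above can be sketched in a few lines of NumPy (an illustrative sketch, not the official MGMatting implementation; `prn_fuse` is a made-up name, and the previous-stage alpha is assumed to already be upsampled to the current resolution):

```python
import numpy as np

def prn_fuse(alpha_prev, alpha_cur):
    """One PRN refinement step (sketch).

    alpha_prev: alpha map from the previous (coarser) decoder stage,
                already upsampled to the current resolution.
    alpha_cur:  raw alpha prediction of the current decoder stage.
    """
    # Self-guidance mask g: 1 in the uncertain (semi-transparent) region,
    # 0 where the previous stage is confidently foreground/background.
    g = ((alpha_prev > 0) & (alpha_prev < 1)).astype(alpha_prev.dtype)
    # Refine only inside the uncertain region; keep confident values.
    return alpha_cur * g + alpha_prev * (1 - g)

# g_0 is all ones, so the coarsest stage is supervised everywhere.
alpha_prev = np.array([[0.0, 0.5], [1.0, 0.2]])  # toy previous-stage output
alpha_cur = np.array([[0.9, 0.7], [0.1, 0.3]])   # toy current-stage output
fused = prn_fuse(alpha_prev, alpha_cur)
# Confident pixels (exactly 0 or 1) are kept; uncertain ones are replaced.
```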

For each decoder stage, the output loss function is defined as:

$$L(\hat{\alpha},\alpha)=L_{l1}(\hat{\alpha},\alpha)+L_{comp}(\hat{\alpha},\alpha)+L_{lap}(\hat{\alpha},\alpha)$$

and the losses of the multiple stages are combined as:

$$L_{final}=\sum_l w_l\,L(\hat{\alpha}\cdot g_l,\;\alpha\cdot g_l)$$

where $w_0:w_1:w_2=1:2:3$.
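A minimal sketch of the weighted multi-stage loss, keeping only the L1 term for brevity (the composition and Laplacian terms are omitted; the function names are illustrative, not from the paper's code):

```python
import numpy as np

def l1_loss(pred, gt):
    """Mean absolute error between prediction and ground truth."""
    return np.abs(pred - gt).mean()

def prn_loss(preds, guides, gt, weights=(1.0, 2.0, 3.0)):
    """L_final = sum_l w_l * L(alpha_hat * g_l, alpha * g_l).

    preds:  list of per-stage alpha predictions (coarse to fine)
    guides: list of per-stage guidance masks g_l (g_0 is all ones)
    gt:     ground-truth alpha map
    Only the L1 term of L is shown here.
    """
    total = 0.0
    for pred, g, w in zip(preds, guides, weights):
        # Each stage is only supervised inside its guidance mask g_l.
        total += w * l1_loss(pred * g, gt * g)
    return total
```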

Comparison of the article’s PRN network with other types of fusion methods:
[figure omitted]

2.3 Data augmentation

Input image data augmentation:
Here the article uses a random combination of: compositing two foregrounds, random rescaling, random resampling methods, random affine transformations, random 512×512 crops, and random COCO images as backgrounds.

Guidance data augmentation:

Extra-guidance part:
To make the network adapt to all kinds of guidance, the article proposes the following augmentations:

  • 1) Binarize the input GT alpha map with a random threshold, then apply dilation and erosion with a random kernel size from 1 to 30 to the resulting binary mask;
  • 2) Following CutMix's data augmentation strategy, cut out a patch of 1/4 to 1/2 of the original image size and paste it onto another location of the mask; the article names this CutMask;
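These two steps can be sketched as follows (an illustrative sketch: the threshold range and function names are assumptions, and the dilation/erosion step, which would typically use `cv2.dilate`/`cv2.erode`, is omitted):

```python
import numpy as np

def binarize(alpha, rng):
    """Step 1 (sketch): binarize the GT alpha map with a random threshold.
    The paper then dilates/erodes the result with a random 1-30 kernel
    (e.g. via cv2.dilate / cv2.erode), which is omitted here."""
    t = rng.uniform(0.1, 0.9)  # threshold range is an assumption
    return (alpha >= t).astype(np.float32)

def cutmask(mask, rng):
    """Step 2 (sketch of CutMask): copy a random patch, 1/4 to 1/2 of the
    image size, from one location of the mask onto another."""
    h, w = mask.shape
    ph = int(rng.integers(h // 4, h // 2 + 1))   # patch height
    pw = int(rng.integers(w // 4, w // 2 + 1))   # patch width
    sy, sx = rng.integers(0, h - ph + 1), rng.integers(0, w - pw + 1)
    dy, dx = rng.integers(0, h - ph + 1), rng.integers(0, w - pw + 1)
    out = mask.copy()
    out[dy:dy + ph, dx:dx + pw] = mask[sy:sy + ph, sx:sx + pw]
    return out
```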

Self-guidance part:
To make the self-guidance robust, the article also applies data augmentation to the intermediate alpha maps. The alpha map generated at stride 8 is eroded with a random kernel size $K_1\in[1,30]$, and the one at stride 4 with a random kernel size $K_2\in[1,15]$. At inference time, $K_1=15$ and $K_2=7$.
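For illustration, a naive square-kernel erosion applied to such an intermediate mask might look like this (a sketch only; a real pipeline would use `cv2.erode` or `scipy.ndimage.binary_erosion`, and the square-kernel assumption is mine):

```python
import numpy as np

def erode(mask, k):
    """Naive binary erosion with a k x k square kernel (k odd).
    Illustrative only; use cv2.erode in practice."""
    h, w = mask.shape
    pad = k // 2
    padded = np.pad(mask, pad, constant_values=0)
    out = np.empty_like(mask)
    for y in range(h):
        for x in range(w):
            # A pixel survives only if the whole k x k window is 1.
            out[y, x] = padded[y:y + k, x:x + k].min()
    return out

# During training k is sampled from [1, 30] (stride-8 map) or [1, 15]
# (stride-4 map); at inference the paper fixes K1 = 15 and K2 = 7.
```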

2.4 Foreground prediction

In the matting task, besides predicting the alpha map, the foreground map must also be estimated. This article uses a separate encoder-decoder network for this prediction. Although the same network could predict the foreground and the alpha map simultaneously, the article points out that this reduces matting performance. Existing foreground prediction faces the following difficulties:

  • 1) The amount of available data is small;
  • 2) Existing foreground regions generated with Photoshop tools have noise and inaccurate boundaries, as shown in Figure 3:
    [figure omitted]
    This introduces color blocks and other confounding factors, leading to training instability;
  • 3) Existing foreground predictions are only supervised in regions where the alpha map is greater than 0, so the results in the unsupervised regions are undefined;

To address this, the article randomly selects a foreground and a background and composites them with a randomly selected alpha map, enriching the training data and enhancing generalization. The foreground loss has the same form as the alpha supervision losses, and the article predicts the foreground over the whole image. Figure 4 below compares the article's method with some previous methods:
[figure omitted]
Performance comparison:
[figure omitted]
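The compositing behind this data synthesis is the standard matting equation $I=\alpha F+(1-\alpha)B$. A sketch of generating one such random-blend training sample (the helper names are illustrative, not from the paper's code):

```python
import numpy as np

def composite(fg, bg, alpha):
    """Standard matting composition: I = alpha * F + (1 - alpha) * B.
    fg, bg: (H, W, 3) float arrays; alpha: (H, W) in [0, 1]."""
    a = alpha[..., None]  # broadcast alpha over the color channels
    return a * fg + (1.0 - a) * bg

def random_blend_sample(fgs, bgs, alphas, rng):
    """Sketch of random blending: pick a random foreground, background,
    and alpha map, and composite them into a new (image, fg, alpha) pair."""
    fg = fgs[rng.integers(len(fgs))]
    bg = bgs[rng.integers(len(bgs))]
    alpha = alphas[rng.integers(len(alphas))]
    return composite(fg, bg, alpha), fg, alpha
```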

3. Experimental results

Composition-1k test performance:
[figure omitted]
Comparison of portrait segmentation performance:
[figure omitted]


Origin blog.csdn.net/m_buddy/article/details/113855074