[Matting] MODNet: Real-time portrait matting model - notes

Paper: MODNet: Real-Time Trimap-Free Portrait Matting via Objective Decomposition (AAAI 2022)

GitHub: https://github.com/ZHKKKe/MODNet

Deployment tutorial:

[Matting] MODNet: Real-time portrait matting model - onnx python deployment

[Matting] MODNet: Real-time portrait matting model - onnx C++ deployment

NCNN quantization deployment tutorial (model size reduced to 1/4):

[Matting] MODNet: Real-time portrait matting model - NCNN C++ quantization deployment

Existing matting methods often require auxiliary inputs such as a trimap to obtain good results, but acquiring a trimap is expensive. MODNet is a real-time matting algorithm that does not require a trimap. MODNet introduces two novel components to improve model efficiency and robustness:

(1) e-ASPP (Efficient Atrous Spatial Pyramid Pooling) fuses multi-scale feature maps;

(2) The self-supervised SOC (sub-objectives consistency) strategy makes MODNet adapt to real-world data.

MODNet runs at 67 FPS on a GTX 1080Ti.

Matting results (official pretrained weights):


Contents

1. MODNet

 1. Semantic Estimation

 2. Efficient ASPP (e-ASPP)

 3. Detail Prediction

 4. Semantic-Detail Fusion

2. SOC (sub-objectives consistency)

3. Experimental results


1. MODNet

The MODNet architecture is shown in the figure. It consists of three parts: semantic estimation (the S branch), detail prediction (the D branch), and semantic-detail fusion (the F branch).

 1. Semantic Estimation

Semantic Estimation locates the portrait. Only an encoder is used to extract high-level semantic information; the encoder can be any backbone network, and the paper uses MobileNetV2. This design has two benefits:

(1) Semantic Estimation is more efficient: without a decoder, the number of parameters is reduced;

(2) The resulting high-level semantic representation S(I) is beneficial to subsequent branches;

S(I) is fed into a convolutional layer with one output channel followed by a sigmoid to obtain the semantic prediction s_p. The target G(\alpha_g) is obtained by downsampling the ground-truth alpha matte \alpha_g by a factor of 16 and applying Gaussian blur. The branch is trained with an L2 loss:

L_s = \frac{1}{2}\left\| s_p - G(\alpha_g) \right\|_2
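A minimal PyTorch sketch of this supervision (the 1-channel conv and sigmoid normally live inside the network; the blur kernel size and the mean-squared form of the L2 term are assumptions, not the repo's exact code):

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def semantic_loss(s_logits, alpha_gt):
    """L2 loss between the semantic prediction s_p and G(alpha_g),
    the 16x-downsampled, Gaussian-blurred ground-truth alpha.
    Assumes s_logits is already at 1/16 of the input resolution."""
    s_p = torch.sigmoid(s_logits)   # 1-channel logits -> probability map

    # G(alpha_g): downsample the GT alpha by 16x, then Gaussian-blur it
    g = F.interpolate(alpha_gt, scale_factor=1 / 16, mode='bilinear', align_corners=False)
    g = gaussian_blur(g, kernel_size=3)

    return 0.5 * ((s_p - g) ** 2).mean()
```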

 2. Efficient ASPP (e-ASPP)

ASPP, proposed in DeepLab, has been shown to significantly improve semantic segmentation. It applies several convolutions with different dilation rates to obtain feature maps with different receptive fields, and then fuses those feature maps.

To reduce the amount of computation, the following modifications are made to ASPP:

(1) Replace each dilated (atrous) convolution with a depthwise convolution followed by a pointwise convolution;

(2) Swap the order of inter-channel fusion and multi-scale fusion. In ASPP, features at different scales are computed across all channels first and then fused with a convolution; in e-ASPP, each channel is convolved at the different dilation rates and the results are fused after concatenation (this is my reading of the paper; I did not find this part in the source code);

(3) The number of feature map channels input to e-ASPP is reduced to 1/4 of the original.

PS: After looking at the figure and the paper, I still don't quite understand where the M in the figure below comes from, nor what happens to the concat dimension on the far right. I checked the source code and, surprisingly, there is no e-ASPP at all (am I missing something?).
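Since the released code reportedly lacks e-ASPP, the following is only a rough sketch of the modifications listed above: channels reduced to 1/4, each dilated convolution replaced by a depthwise + pointwise pair, and the branches fused after concatenation. The dilation rates and layer shapes are assumptions.

```python
import torch
import torch.nn as nn

class DepthwiseDilatedBranch(nn.Module):
    """One e-ASPP-style branch: a depthwise dilated conv followed by a
    pointwise (1x1) conv, instead of a full dilated convolution."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=dilation, dilation=dilation,
                                   groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class EASPPSketch(nn.Module):
    """Rough e-ASPP sketch: reduce channels to 1/4, run several dilated
    depthwise branches, concatenate, and fuse with a 1x1 convolution."""
    def __init__(self, in_channels, dilations=(1, 2, 4)):
        super().__init__()
        mid = in_channels // 4                      # modification (3): 1/4 channels
        self.reduce = nn.Conv2d(in_channels, mid, kernel_size=1, bias=False)
        self.branches = nn.ModuleList(
            DepthwiseDilatedBranch(mid, d) for d in dilations)
        self.fuse = nn.Conv2d(mid * len(dilations), in_channels, kernel_size=1, bias=False)

    def forward(self, x):
        x = self.reduce(x)
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))
```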

 3. Detail Prediction

Detail Prediction is a high-resolution branch whose input consists of the image I, the semantic output S(I), and low-resolution features from the S branch. The D branch is further simplified:

(1) Compared with the S branch, D has fewer convolutional layers;

(2) The convolutional layers of branch D have fewer channels;

(3) The resolution of branch D's feature maps is reduced during forward propagation to cut computation.

The output of branch D is d_p, whose goal is to learn the edge details of the portrait. Its loss is an L1 loss restricted to the transition region, where m_d is a binary mask computed as m_d = dilate(\alpha_g) - erode(\alpha_g):

L_d = m_d \left\| d_p - \alpha_g \right\|_1
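A small PyTorch sketch of this detail supervision (the morphological dilation/erosion are approximated with max pooling, and the kernel size and threshold are assumed values, not taken from the repo):

```python
import torch
import torch.nn.functional as F

def detail_loss(d_p, alpha_gt, kernel_size=15):
    """L1 loss restricted to the transition region m_d = dilate(alpha_g) - erode(alpha_g)."""
    pad = kernel_size // 2
    dilated = F.max_pool2d(alpha_gt, kernel_size, stride=1, padding=pad)    # dilation
    eroded = -F.max_pool2d(-alpha_gt, kernel_size, stride=1, padding=pad)   # erosion
    m_d = ((dilated - eroded) > 0.5).float()        # binary transition-region mask

    # L1 loss on transition pixels only, normalized by their count
    return (m_d * (d_p - alpha_gt).abs()).sum() / (m_d.sum() + 1e-6)
```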

 4. Semantic-Detail Fusion

Branch F combines the outputs of branch S and branch D to predict the final alpha matte \alpha_p. Its loss is as follows, where L_c is the compositional loss (see the referenced paper):

L_f = \left\| \alpha_p - \alpha_g \right\|_1 + L_c
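A hedged sketch of this fusion supervision. The exact form of L_c is an assumption based on common matting practice: the image re-composited from the predicted alpha and the ground-truth foreground/background should match the input image.

```python
import torch

def fusion_loss(alpha_p, alpha_gt, image, fg, bg):
    """L1 loss on the predicted alpha plus a compositional term L_c
    (assumed form: the re-composited image should match the input)."""
    l1 = (alpha_p - alpha_gt).abs().mean()
    comp = alpha_p * fg + (1.0 - alpha_p) * bg      # re-composited image
    l_c = (comp - image).abs().mean()
    return l1 + l_c
```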

2. SOC (sub-objectives consistency)

Labeling hair-level matting data is very expensive. A common data augmentation method is background replacement, but the images generated this way differ greatly from real-life images. As a result, existing trimap-free models often overfit the training set and perform poorly in real scenarios.

The paper proposes a self-supervised method that requires no labeled data to adapt the network to real-world data. The output of MODNet's S branch is S(I), and the output of F is F(S(I), D(S(I))). S(I) is a prior for F(S(I), D(S(I))), and this relationship enables self-supervised training: the predicted result F(S(I), D(S(I))) is downsampled and blurred, and used as the label for S(I).

Denoting the model by M, for an unlabeled image I the outputs are \widetilde{s}_p, \widetilde{d}_p, \widetilde{\alpha}_p = M(I).

A consistency loss is designed (similar to the supervised losses, but with \widetilde{\alpha}_p used in place of \alpha_p):

L_{cons} = \frac{1}{2}\left\| G(\widetilde{\alpha}_p) - \widetilde{s}_p \right\|_2 + \widetilde{m}_d \left\| \widetilde{\alpha}_p - \widetilde{d}_p \right\|_1

The second term of this loss, \widetilde{m}_d \left\| \widetilde{\alpha}_p - \widetilde{d}_p \right\|_1, has a problem: the model can minimize it simply by predicting no details at all.

The fix is relatively simple. During self-supervised training, a copy M' of the model M is created, and the prediction \widetilde{\alpha}_p' of M' is used as the target value (i.e., \widetilde{\alpha}_p' replaces \widetilde{\alpha}_p in the formula above). Since M' also outputs \widetilde{d}_p', a regularization loss L_{dd} is added to the detail branch to keep \widetilde{d}_p consistent with \widetilde{d}_p'.

In the SOC optimization process, L_{cons} + L_{dd} is used as the total loss.
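A rough training-step sketch of SOC under the description above. The model is assumed to return the three branch outputs in the order (s_p, d_p, alpha_p); the simplified loss terms, the G(.) parameters, and the helper names are assumptions, not the repo's implementation.

```python
import copy
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def g_transform(alpha):
    """G(.): downsample by 16x and Gaussian-blur, mirroring the S-branch target."""
    down = F.interpolate(alpha, scale_factor=1 / 16, mode='bilinear', align_corners=False)
    return gaussian_blur(down, kernel_size=3)

def soc_step(model, image, optimizer):
    """One self-supervised SOC update on an unlabeled real-world image.
    A frozen copy M' of the model M supplies the target values."""
    model_copy = copy.deepcopy(model).eval()            # M'
    with torch.no_grad():
        _, d_t, alpha_t = model_copy(image)             # targets from M'

    s_p, d_p, alpha_p = model(image)                    # predictions from M

    # L_cons: the semantic output should match G(alpha') from M', and the
    # fused alpha should stay close to M''s alpha (a simplified stand-in
    # for the transition-region term).
    l_cons = 0.5 * (s_p - g_transform(alpha_t)).pow(2).mean() \
             + (alpha_p - alpha_t).abs().mean()

    # L_dd: regularize the detail branch toward M''s detail output
    l_dd = (d_p - d_t).abs().mean()

    loss = l_cons + l_dd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```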

3. Experimental results

 1. PPM-100

The performance on the dataset PPM-100 is as follows.

 2. Real-world Matting

OFD (One-Frame Delay): a simple video matting smoothing strategy. For consecutive alpha mattes \alpha_{t-1}, \alpha_t, \alpha_{t+1}, if \alpha_{t-1} and \alpha_{t+1} are very close but \alpha_t differs greatly from both, then \alpha_t is likely jitter, so it is removed and replaced with \alpha_{t-1}.
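A simple numpy sketch of OFD as described above. The paper applies the rule per pixel; this version compares whole frames with a hypothetical threshold for brevity.

```python
import numpy as np

def ofd_filter(alphas, diff_thresh=0.1):
    """One-Frame Delay smoothing over a list of alpha mattes (numpy arrays).
    If frame t-1 and t+1 agree but frame t deviates from both, frame t is
    treated as jitter and replaced by frame t-1."""
    out = list(alphas)
    for t in range(1, len(alphas) - 1):
        prev, cur, nxt = alphas[t - 1], alphas[t], alphas[t + 1]
        neighbors_close = np.abs(prev - nxt).mean() < diff_thresh
        cur_deviates = (np.abs(cur - prev).mean() > diff_thresh and
                        np.abs(cur - nxt).mean() > diff_thresh)
        if neighbors_close and cur_deviates:
            out[t] = prev   # replace the jittery frame with the previous one
    return out
```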

To make MODNet better suited to real data, 50,000 frames were cropped from 400 videos and the model was trained with SOC self-supervision. In the figure below, the blue box shows the improvement after SOC training, and the orange box shows the effect of OFD.


Origin: blog.csdn.net/qq_40035462/article/details/123769796