PortraitNet: Real-time portrait segmentation network for mobile device

Paper link: https://www.sciencedirect.com/science/article/pii/S0097849319300305
Publication source: 2019 CAD&Graphics
1. Background
Generic semantic segmentation networks do not handle fine portrait segmentation well, for the following reasons:
(1) A portrait image usually contains at least one person, whose face accounts for at least 10% of the entire image;
(2) Portraits often have fuzzy boundaries and complicated lighting conditions;
(3) General segmentation networks are large and are not suitable for real-time portrait segmentation on mobile devices.
2. Content
The paper presents a real-time portrait segmentation model, called PortraitNet, that can run efficiently on mobile devices.
PortraitNet is based on a lightweight U-shaped architecture with two auxiliary losses used during training; inference incurs no extra cost at test time.
The two auxiliary losses are a boundary loss and a consistency constraint loss. The former improves the accuracy of boundary pixels, and the latter enhances robustness under complex lighting conditions.
3. Network structure
The overall structure of the network is shown in the figure below:
[Figure: overall architecture of PortraitNet]
The green blocks represent the encoder module, and the numbers in parentheses are the down-sampling rates. To keep the network fast, the backbone is MobileNetV2.
The yellow and purple blocks represent the decoder module, which follows a U-Net-style structure. Each D-Block has two branches: one contains two depthwise-separable convolutions, and the other contains a single 1×1 convolution to adjust the number of channels. The skip-connected feature maps from the encoder are fused with the decoder feature maps to make full use of the model's capacity. One possible form of the D-Block is sketched below.
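A minimal PyTorch sketch of what such a D-Block could look like, based only on the description above; the exact layer ordering, normalization, and activation placement are assumptions, not the authors' implementation:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return self.block(x)

class DBlock(nn.Module):
    """Decoder block with two branches: a depthwise-separable branch and a 1x1 shortcut."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Branch 1: two depthwise-separable convolutions
        self.branch1 = nn.Sequential(
            DepthwiseSeparableConv(in_ch, out_ch),
            nn.ReLU(inplace=True),
            DepthwiseSeparableConv(out_ch, out_ch),
        )
        # Branch 2: a single 1x1 convolution to adjust the channel count
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.branch1(x) + self.branch2(x))
```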
The network is trained with a mask loss plus auxiliary losses. The mask loss is the binary cross-entropy used for per-pixel classification; the auxiliary losses are the boundary loss and the consistency constraint loss, highlighted below.
4. Auxiliary losses
(1) Boundary loss
First, to keep the network small, the authors do not add a separate branch for optimizing the boundary; instead, they add a single convolutional layer on top of the last decoder layer to predict the boundary.
The boundary labels are generated by applying the Canny operator to the ground-truth mask, with the line width set to 4. The effect is shown in the following figure:
[Figure: boundary labels generated from the ground-truth mask with the Canny operator]
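As a rough illustration of how such boundary labels could be generated with OpenCV; the Canny thresholds and the dilation used to approximate a 4-pixel line width are assumptions, not the paper's exact procedure:

```python
import cv2
import numpy as np

def boundary_label_from_mask(mask, width=4):
    """Generate a boundary label by running Canny on the binary GT mask,
    then dilating the thin edges to roughly the desired line width."""
    mask_u8 = mask.astype(np.uint8) * 255            # binary 0/1 mask -> 0/255 image
    edges = cv2.Canny(mask_u8, 100, 200)             # thin edge map of the mask
    kernel = np.ones((width, width), np.uint8)
    edges = cv2.dilate(edges, kernel, iterations=1)  # thicken edges to ~width pixels
    return (edges > 0).astype(np.float32)            # binary boundary label
```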
Because the boundary occupies only a small fraction of the image, focal loss is used to avoid extreme class imbalance.
The total loss is L = L_m + λ·L_e, where L_m is the cross-entropy (mask) loss and L_e is the focal loss on the boundary prediction. λ is the weight of the boundary loss, y_i denotes the ground-truth label of pixel i, and p_i denotes the predicted probability for pixel i.
Since only one convolutional layer is used to generate the boundary mask, the mask features and the boundary features compete within the shared feature representation. To avoid hurting the mask prediction, λ should therefore be set small. The boundary loss increases the model's sensitivity to portrait boundaries and thereby improves segmentation accuracy.
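A minimal sketch of this combined loss in PyTorch, using a single-channel sigmoid formulation; the choice of γ = 2 and λ = 0.1 is illustrative, and the paper's exact formulation (e.g. a two-channel softmax output) may differ:

```python
import torch
import torch.nn.functional as F

def boundary_focal_loss(logits, target, gamma=2.0):
    """Focal loss on the boundary prediction; down-weights easy background
    pixels to counter the extreme imbalance of thin boundary labels."""
    p = torch.sigmoid(logits)
    pt = torch.where(target > 0.5, p, 1.0 - p)  # probability assigned to the true class
    return (-((1.0 - pt) ** gamma) * torch.log(pt.clamp(min=1e-6))).mean()

def portrait_loss(mask_logits, boundary_logits, mask_gt, boundary_gt, lam=0.1):
    """Total loss L = L_m + lambda * L_e, with lambda kept small (see text)."""
    l_m = F.binary_cross_entropy_with_logits(mask_logits, mask_gt)  # mask loss L_m
    l_e = boundary_focal_loss(boundary_logits, boundary_gt)         # boundary loss L_e
    return l_m + lam * l_e
```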
(2) Consistency constraint loss
Photos taken under different lighting conditions can show the same content at different brightness levels. Although such images share the same label, the network may produce different per-pixel predictions for them. To avoid this, the authors propose a consistency constraint loss, which yields more stable results.
As shown in the figure:
[Figure: generating image A by deformation augmentation and image A' by texture augmentation, with the consistency constraint between their outputs]
For the original image, deformation augmentation is first applied to generate image A, and texture augmentation is then applied to A to generate A'. Texture augmentation does not change the shape of the image, so the segmentation labels of A and A' are identical.
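To make the construction of A and A' concrete, here is a rough torchvision-based sketch; the specific transforms and parameters are illustrative assumptions rather than the paper's exact augmentation set, and "portrait.jpg" is a hypothetical file name:

```python
import torchvision.transforms as T
from PIL import Image

# Deformation augmentation: changes the geometry of the image. In a real pipeline
# the ground-truth mask must be warped with exactly the same parameters.
deform_aug = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomAffine(degrees=30, translate=(0.1, 0.1), scale=(0.7, 1.3)),
])

# Texture augmentation: changes appearance only (brightness, color, blur),
# so the segmentation label of A' stays identical to that of A.
texture_aug = T.Compose([
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.GaussianBlur(kernel_size=5),
])

original = Image.open("portrait.jpg")   # hypothetical input image
image_a = deform_aug(original)          # image A (deformed)
image_a_prime = texture_aug(image_a)    # image A' (same shape as A, different texture)
```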
Let the network output for image A be heatmap B and the output for image A' be heatmap B'. Since A and A' have the same content, B and B' should be essentially the same, and both participate in the loss against the ground truth (L_m is the ordinary BCE loss computed between the GT and B, and between the GT and B').
However, texture augmentation degrades image quality, so A' is of lower quality than A and the resulting B' is worse than B. The authors therefore use the higher-quality heatmap B as a soft label for heatmap B': a consistency constraint loss L_c, computed as the KL divergence between B and B', is added between the two heatmaps:
The total loss becomes L = L_m + α·L_c, where α balances the two kinds of losses and T is a temperature used to smooth the outputs before computing the KL divergence. The consistency constraint loss further improves the accuracy of the model and enhances its robustness under different lighting conditions.
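A minimal sketch of how such a consistency constraint could be implemented in PyTorch; detaching B (so only B' is pulled toward B) and the T² scaling follow common knowledge-distillation practice and are assumptions, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def consistency_loss(heatmap_b, heatmap_b_prime, temperature=1.0):
    """KL divergence between temperature-softened outputs, using the
    higher-quality heatmap B as a soft label for B'."""
    soft_target = F.softmax(heatmap_b.detach() / temperature, dim=1)  # soft label from B
    log_pred = F.log_softmax(heatmap_b_prime / temperature, dim=1)    # prediction from B'
    return F.kl_div(log_pred, soft_target, reduction="batchmean") * temperature ** 2

# Overall training loss (sketch):
#   L = L_m(B, GT) + L_m(B', GT) + alpha * L_c(B, B')
```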
5. Results
(1) Quantitative comparison with other methods:
(2) Visual comparison:
(3) Speed comparison:
(4) Ablation study of the loss functions:


Source: https://blog.csdn.net/balabalabiubiu/article/details/115185394