[Neural Network] U2Net Semantic Segmentation Network

1. Overview

        U2Net is a network designed for the salient object detection (SOD) task: segmenting the most visually prominent object/region in an image. Since the output contains only two parts, foreground and background, SOD is a per-pixel binary classification problem.

2. Network structure

        1. Feature extraction network 

                In the encoder stage, each block is followed by a 2× downsampling (max-pooling); in the decoder stage, each block is preceded by a 2× upsampling (bilinear interpolation).
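The encoder/decoder resizing described above can be sketched with plain NumPy. This is a minimal illustration, not U2Net code: nearest-neighbour repetition stands in for the bilinear interpolation the paper uses, since the point here is only the 2× size change.

```python
import numpy as np

def maxpool2x(x):
    """2x2 max-pooling with stride 2 (the encoder's downsampling step)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2x(x):
    """2x upsampling (nearest-neighbour stand-in for bilinear interpolation)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.arange(16, dtype=float).reshape(4, 4)
pooled = maxpool2x(x)            # (4, 4) -> (2, 2): halves each spatial dim
restored = upsample2x(pooled)    # (2, 2) -> (4, 4): doubles each spatial dim
```

Each encoder stage halves the spatial resolution this way, and each decoder stage doubles it back.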

                EN_1 and DE_1 use the RSU-7 module (5 internal downsamplings and 5 upsamplings, for a total compression ratio of 32); its structure is shown in the figure below.

                 EN_2 and DE_2 use RSU-6, which has one fewer downsampling and one fewer upsampling than RSU-7, so the total compression ratio drops to 16. Likewise, EN_3 and DE_3 use RSU-5 (compression ratio 8, half that of RSU-6), and EN_4 and DE_4 use RSU-4 (compression ratio 4, half that of RSU-5).

                However, EN_5, EN_6, and DE_5 use the RSU-4F structure (shown in the figure below). Compared with RSU-4, this structure contains no downsampling. The reason is that after several stages of downsampling the feature maps are already very small, and downsampling further would lose information; RSU-4F instead keeps the resolution fixed and enlarges the receptive field with dilated convolutions.
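The compression ratios quoted above follow a simple pattern: an RSU-L block (for L = 7, 6, 5, 4) applies L−2 internal 2× downsamplings, so its internal compression ratio is 2^(L−2). A quick check:

```python
def rsu_compression(level):
    """Internal compression ratio of an RSU-L block: (L - 2) 2x downsamplings."""
    return 2 ** (level - 2)

ratios = {f"RSU-{L}": rsu_compression(L) for L in (7, 6, 5, 4)}
# RSU-7 -> 32, RSU-6 -> 16, RSU-5 -> 8, RSU-4 -> 4
```

RSU-4F breaks this pattern: it has a compression ratio of 1, since it replaces downsampling with dilated convolutions.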

         2. Feature fusion network

                Take the output feature maps of DE_1, DE_2, DE_3, DE_4, DE_5, and EN_6, and apply a 3x3 convolution with a single output channel to each, producing six one-channel side-output maps;

                Then upsample each side-output map to the size of the input image via bilinear interpolation;

                Finally, concatenate the six maps along the channel dimension, then apply a 1x1 convolution followed by a sigmoid activation to obtain the final prediction.
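The fusion steps above can be sketched in NumPy. This is an illustrative toy, not the real network: the side maps and the 1x1-conv weights are random placeholders, and a 1x1 convolution over six 1-channel maps reduces to a per-pixel weighted sum.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 32, 32

# Hypothetical side-output maps: one channel each, already upsampled to (H, W).
side_maps = [rng.random((H, W)) for _ in range(6)]

# Concatenate along a channel axis -> shape (6, H, W).
stacked = np.stack(side_maps, axis=0)

# A 1x1 convolution over 6 input channels and 1 output channel is just a
# learned weighted sum across channels at every pixel (weights are random here).
w = rng.random(6)
logits = np.tensordot(w, stacked, axes=(0, 0))

# Sigmoid squashes logits into [0, 1] -> the final saliency prediction.
pred = 1.0 / (1.0 + np.exp(-logits))
```

In the actual network, `w` is learned, and the six side maps are also supervised individually (see the loss function below).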

3. Network configuration parameters

        The upper row describes the standard network and the lower row the lightweight network; the corresponding parameters are shown in the figure below.

 4. Loss function

        L=\sum_{m=1}^{M} w^{(m)}_{side}l^{(m)}_{side}+w_{fuse}l_{fuse}

        The loss function has two parts: w^{(m)}_{side}l^{(m)}_{side} is the loss between each of the six side-output maps and the labeled ground truth, where l is the binary cross-entropy loss and w is the weight of each loss term; w_{fuse}l_{fuse} is the loss between the final fused prediction and the ground truth. All weights default to 1.
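The loss above can be written out directly. This is a minimal NumPy sketch of the formula (per-pixel binary cross-entropy, weighted sum over side outputs plus the fused output), with all weights defaulting to 1 as the text states:

```python
import numpy as np

def bce(pred, gt, eps=1e-7):
    """Binary cross-entropy averaged over all pixels."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-(gt * np.log(pred) + (1 - gt) * np.log(1 - pred)).mean())

def u2net_loss(side_preds, fuse_pred, gt, w_side=None, w_fuse=1.0):
    """L = sum_m w_side[m] * l_side[m] + w_fuse * l_fuse."""
    if w_side is None:
        w_side = [1.0] * len(side_preds)  # all weights default to 1
    side_term = sum(w * bce(p, gt) for w, p in zip(w_side, side_preds))
    return side_term + w_fuse * bce(fuse_pred, gt)
```

With six side outputs and one fused output, a perfect prediction drives all seven terms toward zero; each term pulls its own map toward the ground truth independently.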

5. Evaluation indicators

        1.F-measure

                F_\beta=\frac{(1+\beta^2)\times Precision\times Recall}{\beta^2\times Precision+Recall}

                        F_\beta ranges from 0 to 1; the larger the value, the better the result.
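The formula translates directly to code. A small sketch, where the default beta^2 = 0.3 is the value commonly used in the SOD literature to weight precision more than recall:

```python
def f_measure(precision, recall, beta2=0.3):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

# Perfect precision and recall give the maximum score of 1.0.
perfect = f_measure(1.0, 1.0)
```

Note that when precision equals recall, F_beta equals that common value regardless of beta^2.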

        2.MAE

                 MAE=\frac{1}{H\times W}\sum_{r=1}^{H}\sum_{c=1}^{W}|P(r,c)-G(r,c)|

                        Here P is the predicted saliency map and G the ground truth; MAE ranges from 0 to 1, and the smaller the value, the better the result.
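The MAE formula is a straightforward per-pixel average, sketched here in NumPy:

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error: average |P(r,c) - G(r,c)| over all H*W pixels."""
    return float(np.abs(pred - gt).mean())
```

A prediction identical to the ground truth scores 0; predicting the exact opposite of a binary mask scores 1.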


Origin blog.csdn.net/weixin_37878740/article/details/129395740