[Semantic Segmentation] DeepLab v3+ (DeepLab v3 Plus, Backbone, Xception, MobileNet v2, Encoder, Decoder, ASPP, multi-scale fusion, dilated convolution)

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

DeepLab v3+ was regarded as the state of the art in semantic segmentation when it appeared in 2018 because of its strong results. The paper focuses on the architecture of the model: it introduces the ability to freely control the resolution at which the encoder extracts features, trading off accuracy against runtime through dilated (atrous) convolution.

This paper was published at ECCV 2018. Compared with DeepLab v3, there are five changes: ① an Encoder-Decoder architecture; ② an improved ASPP module; ③ stronger multi-scale information fusion; ④ Xception as the Backbone; ⑤ support for inputs of any size. With these improvements in architecture, modules, and features, DeepLab v3+ performs better than DeepLab v3 on semantic segmentation tasks.

  1. Encoder-Decoder architecture : DeepLab v3+ introduces a new Encoder-Decoder architecture that combines the ASPP (Atrous Spatial Pyramid Pooling) module of DeepLab v3 with a specific decoder module. Such an architecture can effectively integrate global information and local information, thereby improving the accuracy of semantic segmentation.

  2. Improved ASPP module : ASPP has been improved in DeepLab v3+. The goal of ASPP is to capture contextual information at different scales to better understand objects in images. DeepLab v3+ applies atrous convolutions with several different dilation rates in the ASPP, allowing the model to handle multi-scale information better.

  3. Stronger multi-scale information fusion : To further improve segmentation performance, the decoder of DeepLab v3+ fuses low-level (high-resolution) backbone features with the upsampled high-level features from the encoder, so that fine spatial detail and semantic context are combined. This helps the network integrate multi-scale information better.

  4. Xception as Backbone : DeepLab v3+ uses Xception (a more efficient convolutional neural network architecture) as its backbone. Compared with ResNet used by DeepLab v3, Xception has fewer parameters and higher computational efficiency.

  5. Supports input of any size : because the network is fully convolutional and the image-level branch of ASPP uses global average pooling, DeepLab v3+ can accept images of any size for semantic segmentation without being limited to a fixed input size.

Xception is a convolutional neural network architecture proposed by François Chollet in 2016. It is a further development of the Google Inception series network. It uses depth-wise separable convolution to reduce the number of parameters and computational complexity, and has achieved good performance on the ImageNet image classification task. The full name of Xception is "Extreme Inception", and the "X" in its name is taken from "Extreme", which means extreme improvements based on the Inception series.

The main feature of Xception is that it replaces the standard convolutions of the Inception module with depthwise separable convolutions. A depthwise separable convolution splits a standard convolution into two steps: first a depth-wise convolution (DW Conv), which convolves each input channel independently, and then a point-wise convolution (PW Conv), which uses 1×1 kernels to linearly combine the channels. This structure reduces the amount of computation and the number of parameters while maintaining high performance.
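To make the idea concrete, here is a minimal PyTorch sketch of a depthwise separable convolution (an illustration of the operation, not Xception's exact block): the depthwise step is a grouped 3×3 convolution with groups equal to the number of input channels, and the pointwise step is a 1×1 convolution.

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # depth-wise convolution: groups=in_channels, so each channel is filtered independently
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        # point-wise convolution: 1x1 kernels linearly combine the channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 64, 64)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 64, 64])

For 32 input and 64 output channels with a 3×3 kernel, a standard convolution needs 32 × 64 × 3 × 3 = 18,432 weights, while the separable version needs 32 × 3 × 3 + 32 × 64 = 2,336, which is where the savings come from.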

Xception has achieved some success in computer vision tasks, especially image classification tasks. Its design ideas also had an impact on some subsequent network architectures.

The core idea of DeepLab v3+

The core idea of DeepLab v3+ is to improve semantic image segmentation by introducing a new Encoder-Decoder architecture. Earlier DeepLab models use techniques such as Atrous (dilated) Convolution and ASPP (Atrous Spatial Pyramid Pooling) to capture the contextual information of the image and better understand the objects in it. However, these models still have limitations when dealing with multi-scale information and object boundaries.

In order to solve these problems, DeepLab v3+ introduces an Encoder-Decoder structure, which consists of the following points:

  1. Encoder : an Xception network built from depthwise separable convolutions is used as the backbone of the Encoder. Xception is an efficient convolutional neural network architecture that has fewer parameters and higher computational efficiency than a traditional ResNet.

  2. ASPP : At the end of Encoder, the Atrous Spatial Pyramid Pooling (ASPP) module is introduced. ASPP can capture contextual information at different scales to better understand the semantic information of images.

  3. Decoder : DeepLab v3+ uses a specific decoder module that restores the feature map extracted by the Encoder to the size of the original input image through an upsampling operation. The purpose of this is to integrate global information and local information so that the model can better perform semantic segmentation.

  4. Low-level feature fusion : in the decoder, DeepLab v3+ concatenates low-level backbone features with the upsampled encoder output, so that low-level detail also contributes to the final prediction. This helps integrate multi-scale information and improves the performance of the model.

  5. Supports input of any size : because the network is fully convolutional and the image-level branch of ASPP uses global average pooling, DeepLab v3+ can accept images of any size for semantic segmentation without being limited to a fixed input size.

To sum up, the core idea of DeepLab v3+ is to better capture the contextual information and multi-scale information of images through the Encoder-Decoder structure and the other improvements above, thereby achieving better performance on semantic image segmentation tasks.

Abstract

In semantic segmentation, DCNNs often use either the ASPP module or an Encoder-Decoder structure. The former encodes multi-scale contextual information by applying convolution or pooling at multiple scales and with multiple effective receptive fields, while the latter captures sharper object boundaries by gradually recovering spatial information. In this paper, we propose to combine the advantages of both approaches. Specifically, our proposed model, DeepLab v3+, adds a simple and effective decoder module on top of DeepLab v3 to refine the segmentation results, especially along object boundaries. We further explore the Xception model and apply depthwise separable convolutions to the ASPP and decoder modules, obtaining a faster and stronger encoder-decoder network. We verify the effectiveness of the proposed model on the PASCAL VOC 2012 and Cityscapes datasets, reaching 89.0% and 82.1% on the respective test sets without any post-processing. The paper also comes with a publicly available reference implementation of the proposed model in TensorFlow: Official code (TensorFlow version).

2. DeepLab v3+ implementation ideas

2.1 Backbone (backbone network)

In the paper, DeepLab v3+ uses the Xception series as the backbone feature extraction network. The Bubbliying implementation provides two backbones, Xception and MobileNet v2. Since Xception is relatively expensive to train, this article uses MobileNet v2 as the Backbone. Next we briefly introduce MobileNet v2.


  • In the Encoder part, DeepLab v3+ makes heavy use of dilated convolutions (in the ASPP module), which enlarges the receptive field without reducing the feature-map resolution, so that each convolution output covers a larger range of context.
  • The Backbone used in the original paper is Xception.

MobileNet v2 is an efficient convolutional neural network architecture proposed by Google in 2018 for image classification and related vision tasks on devices with limited computing resources. It is the second generation of the MobileNet series and is an improvement and expansion of MobileNet v1.

The core idea of ​​MobileNet v2 is to improve the performance and computing efficiency of the model through a series of innovative designs. Major improvements include:

  1. Linear Bottleneck structure : MobileNet v2 builds its deep network from linear bottleneck blocks. Each block first uses a 1×1 convolution to expand the number of channels, then a 3×3 depthwise separable convolution, and finally a 1×1 convolution with a linear activation to project back to a low-dimensional representation. This structure increases the depth of the network while reducing the number of parameters and the amount of computation.

  2. Inverted residual structure : MobileNet v2 introduces the inverted residual structure to alleviate the vanishing-gradient problem in deep networks. It adds skip connections in certain layers, allowing gradients to propagate better through the network and helping to train deeper networks.

  3. Linear bottleneck activation : in the final 1×1 projection of each block, MobileNet v2 uses a linear activation instead of a nonlinearity such as ReLU, which avoids information loss in the low-dimensional bottleneck; the other layers still use ReLU6.

  4. Wider network : MobileNet v2 uses a wider network (i.e. more channels) to increase the expressiveness and accuracy of the model.

  5. A note on the SE (Squeeze-and-Excitation) module : the SE module, which adaptively re-weights the importance of channels so that the network focuses on important features, is often mentioned together with the MobileNet family, but it was only introduced in MobileNet v3; MobileNet v2 itself does not include it.

MobileNet v2 has achieved excellent performance on the ImageNet image classification task and has higher computational efficiency, making it suitable for deployment on resource-constrained mobile devices and embedded systems. Due to its excellent performance and efficient computing characteristics, MobileNet v2 has become an important choice for mobile computer vision tasks.

For a detailed introduction to MobileNet v2, please see the blog: MobileNet Series (v1 ~ v3) Theoretical Explanation

2.1.1 Inverted Residual Module

Let’s take a look at the official PyTorch-style implementation of the MobileNet v2 inverted residual block:

import torch.nn as nn


class ConvBNReLU(nn.Sequential):
    def __init__(self, in_channel, out_channel, kernel_size=3, stride=1, groups=1):
        padding = (kernel_size - 1) // 2
        super(ConvBNReLU, self).__init__(
            nn.Conv2d(in_channel, out_channel, kernel_size, stride, padding, groups=groups, bias=False),
            nn.BatchNorm2d(out_channel),
            nn.ReLU6(inplace=True)
        )


class InvertedResidual(nn.Module):
    def __init__(self, in_channel, out_channel, stride, expand_ratio):
        super(InvertedResidual, self).__init__()
        hidden_channel = in_channel * expand_ratio
        self.use_shortcut = stride == 1 and in_channel == out_channel

        layers = []
        if expand_ratio != 1:
            # 1x1 pointwise conv
            layers.append(ConvBNReLU(in_channel, hidden_channel, kernel_size=1))
        layers.extend([
            # 3x3 depthwise conv
            ConvBNReLU(hidden_channel, hidden_channel, stride=stride, groups=hidden_channel),
            # 1x1 pointwise conv(linear)
            nn.Conv2d(hidden_channel, out_channel, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channel),
        ])

        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_shortcut:
            return x + self.conv(x)
        else:
            return self.conv(x)

The inverted residual structure can be divided into two parts:

  1. Main path / feature extraction part (left): first a 1×1 convolution expands the dimensionality, then a 3×3 depthwise separable convolution extracts features, and finally a 1×1 convolution reduces the dimensionality again.
  2. Residual part / gradient path (right): the input is connected directly to the output (shortcut connection).

[Figure: schematic diagram of the inverted residual structure]
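As a quick sanity check of the InvertedResidual block above, a usage sketch (assuming the two classes from the previous code block are in scope):

import torch

# stride=1 with in_channel == out_channel -> the shortcut connection is used
block = InvertedResidual(in_channel=32, out_channel=32, stride=1, expand_ratio=6)
print(block.use_shortcut)                       # True
print(block(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 32, 56, 56])

# stride=2 halves the spatial size and disables the shortcut
block_down = InvertedResidual(in_channel=32, out_channel=64, stride=2, expand_ratio=6)
print(block_down.use_shortcut)                       # False
print(block_down(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 64, 28, 28])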

Note that in DeepLab v3+ the backbone generally does not downsample 5 times (the larger the total downsampling factor, the worse the segmentation result). Two settings are available: 3 downsampling steps (2^3 = 8× downsampling) or 4 downsampling steps (2^4 = 16× downsampling). This article uses 4 downsampling steps, i.e. 16× downsampling.

After MobileNet v2 (the Backbone) finishes feature extraction, we obtain two effective feature layers: one is the result of halving the height and width of the input image twice (4× downsampling; the low-level, low-semantics feature map), and the other is the result of halving the height and width four times (16× downsampling; the deep backbone output).

2.1.2 Implement MobileNet v2 Backbone code in DeepLab v3+

We can easily obtain the MobileNet v2 model officially implemented in PyTorch, but as mentioned above, DeepLab v3+ does not use the whole Backbone, and it also needs an extra branch that extracts a shallow feature map, so the MobileNet v2 Backbone has to be modified. The question is how. Ideally we do not touch the original source code, because we may still want the unmodified MobileNet v2 later, so instead we modify the loaded MobileNet v2 model.

The modification relies on the fact that, in PyTorch, a model's sub-modules can be indexed and modified like entries in a container (for example, model.features behaves like a list of blocks).

File name: deeplabv3_plus.py.

import torch.nn as nn

from nets.mobilenetv2 import mobilenetv2


class MobileNetV2(nn.Module):
    def __init__(self, downsample_factor=8, pretrained=True):
        super(MobileNetV2, self).__init__()

        # import partial
        from functools import partial

        # build the MobileNet v2 model
        model = mobilenetv2(pretrained)

        # drop the 1x1 conv that follows the last inverted residual block
        self.features = model.features[:-1]

        self.total_idx = len(self.features)  # number of blocks
        self.down_idx = [2, 4, 7, 14]  # indices of the inverted residual blocks that perform downsampling

        if downsample_factor == 8:
            """
                If the downsampling factor is 8, the last two downsampling blocks are
                modified so that stride=1 and dilate=2, and the remaining blocks get a
                larger dilation.
                If the downsampling factor is 16, only the last downsampling block is
                modified to stride=1, dilate=2.
            """
            # modify the blocks between the last two downsampling positions
            for i in range(self.down_idx[-2], self.down_idx[-1]):
                self.features[i].apply(
                    partial(self._nostride_dilate, dilate=2))  # set stride=1, dilate=2
            # modify all remaining blocks so that they use dilated convolutions
            for i in range(self.down_idx[-1], self.total_idx):
                self.features[i].apply(
                    partial(self._nostride_dilate, dilate=4))

        elif downsample_factor == 16:
            for i in range(self.down_idx[-1], self.total_idx):
                self.features[i].apply(
                    partial(self._nostride_dilate, dilate=2)
                )

    def _nostride_dilate(self, m, dilate):
        """Turn stride-2 convolutions into stride-1 dilated convolutions.

        Args:
            m (nn.Module): the sub-module visited by apply()
            dilate (int): dilation rate
        """
        classname = m.__class__.__name__  # name of the module class
        if classname.find('Conv') != -1:  # only touch convolution layers
            if m.stride == (2, 2):  # a conv layer that originally downsamples
                m.stride = (1, 1)  # remove the downsampling
                if m.kernel_size == (3, 3):  # adjust the dilation and padding accordingly
                    m.dilation = (dilate//2, dilate//2)
                    m.padding = (dilate//2, dilate//2)
            else:  # a conv layer whose stride is already 1
                if m.kernel_size == (3, 3):  # adjust the dilation and padding
                    m.dilation = (dilate, dilate)
                    m.padding = (dilate, dilate)

    def forward(self, x):
        """Forward pass.

        Args:
            x (tensor): input image tensor

        Returns:
            (tensor, tensor): two output feature maps
        """
        low_level_features = self.features[:4](x)  # shallow (low-level) feature map
        x = self.features[4:](low_level_features)  # feature map after the full Backbone
        return low_level_features, x
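Assuming the nets.mobilenetv2 module from the reference repository is available, a rough shape check of the two returned feature maps (with downsample_factor=16 and a 512×512 input) might look like this; the channel counts follow the MobileNet v2 design (24 and 320 channels):

import torch

# hypothetical usage; requires the nets.mobilenetv2 module from the reference repository
backbone = MobileNetV2(downsample_factor=16, pretrained=False)
low_level, deep = backbone(torch.randn(1, 3, 512, 512))
print(low_level.shape)  # expected: torch.Size([1, 24, 128, 128]) -> 4x downsampling
print(deep.shape)       # expected: torch.Size([1, 320, 32, 32])  -> 16x downsampling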

Here we need to explain partial, a helper from Python's standard-library functools module:

from functools import partial

# original function
def add(x, y):
    return x + y

# use partial to fix the first argument of add to 5
add_5 = partial(add, 5)

# calling the new function add_5 only requires the second argument
result = add_5(10)  # actually calls add(5, 10)

print(result)  # output: 15

In the above example, partial fixes the first parameter of the add function to 5 and returns a new function add_5. When we call add_5(10), it is actually equivalent to calling add(5, 10), and the result is 15.

Using partial can make the code more concise and readable in some cases, and makes it easier to reuse functions.
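In the backbone code above, partial is combined with nn.Module.apply, which recursively calls a one-argument function on every sub-module. A small self-contained sketch of that pattern (with a hypothetical set_dilation helper):

from functools import partial

import torch.nn as nn

def set_dilation(m, dilate):
    # called for every sub-module; only 3x3 conv layers are touched
    if isinstance(m, nn.Conv2d) and m.kernel_size == (3, 3):
        m.dilation = (dilate, dilate)
        m.padding = (dilate, dilate)

block = nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())
# apply() expects a function of one argument, so partial fixes dilate in advance
block.apply(partial(set_dilation, dilate=2))
print(block[0].dilation, block[0].padding)  # (2, 2) (2, 2)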

2.2 Enhanced feature extraction structure: ASPP + Concat

[Figure: DeepLab v3+ enhanced feature extraction structure (ASPP + Concat)]

In DeepLab v3+, the enhanced feature extraction network can be divided into two parts:

  1. In the Encoder, dilated convolutions with different dilation rates extract features from the 16×-downsampled feature layer; the results are concatenated along the channel dimension and then compressed with a 1×1 convolution.
  2. In the Decoder, a 1×1 convolution adjusts the number of channels of the 4×-downsampled low-level feature layer; this is then concatenated along the channel dimension with the upsampled feature map from the first part; finally two ordinary 3×3 convolutions are applied.

2.2.1 ASPP code implementation

File name: deeplabv3_plus.py.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ASPP(nn.Module):
    def __init__(self, dim_in, dim_out, rate=1, bn_mom=0.1):
        super(ASPP, self).__init__()
        self.branch1 = nn.Sequential(  # 1x1 ordinary convolution
            nn.Conv2d(dim_in, dim_out, 1, 1, padding=0,
                      dilation=rate, bias=True),
            nn.BatchNorm2d(dim_out, momentum=bn_mom),
            nn.ReLU(inplace=True),
        )
        self.branch2 = nn.Sequential(  # 3x3 dilated convolution (r=6)
            nn.Conv2d(dim_in, dim_out, 3, 1, padding=6 *
                      rate, dilation=6*rate, bias=True),
            nn.BatchNorm2d(dim_out, momentum=bn_mom),
            nn.ReLU(inplace=True),
        )
        self.branch3 = nn.Sequential(  # 3x3 dilated convolution (r=12)
            nn.Conv2d(dim_in, dim_out, 3, 1, padding=12 *
                      rate, dilation=12*rate, bias=True),
            nn.BatchNorm2d(dim_out, momentum=bn_mom),
            nn.ReLU(inplace=True),
        )
        self.branch4 = nn.Sequential(  # 3x3 dilated convolution (r=18)
            nn.Conv2d(dim_in, dim_out, 3, 1, padding=18 *
                      rate, dilation=18*rate, bias=True),
            nn.BatchNorm2d(dim_out, momentum=bn_mom),
            nn.ReLU(inplace=True),
        )

        """
            In the paper this branch is a pooling layer; here only the convolution is
            defined, because in the forward function the pooling is performed first
            and is then followed by this convolution.
        """
        self.branch5_conv = nn.Conv2d(dim_in, dim_out, 1, 1, 0, bias=True)
        self.branch5_bn = nn.BatchNorm2d(dim_out, momentum=bn_mom)
        self.branch5_relu = nn.ReLU(inplace=True)

        self.conv_cat = nn.Sequential(  # 1x1 convolution applied after the concat
            nn.Conv2d(dim_out*5, dim_out, 1, 1, padding=0, bias=True),
            nn.BatchNorm2d(dim_out, momentum=bn_mom),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        [b, c, row, col] = x.size()  # [BS, C, H, W]

        # the first four branches
        conv1x1 = self.branch1(x)
        conv3x3_1 = self.branch2(x)
        conv3x3_2 = self.branch3(x)
        conv3x3_3 = self.branch4(x)

        # fifth branch: global average pooling + convolution
        global_feature = torch.mean(input=x, dim=2, keepdim=True)  # mean along H
        global_feature = torch.mean(input=global_feature, dim=3, keepdim=True)  # then mean along W

        # Conv -> BN -> ReLU
        global_feature = self.branch5_conv(global_feature)
        global_feature = self.branch5_bn(global_feature)
        global_feature = self.branch5_relu(global_feature)

        # bilinear interpolation back to the spatial size of the input feature map
        global_feature = F.interpolate(
            input=global_feature, size=(row, col), scale_factor=None, mode='bilinear', align_corners=True)

        # stack the outputs of the five branches along the channel dimension
        feature_cat = torch.cat(
            [conv1x1, conv3x3_1, conv3x3_2, conv3x3_3, global_feature], dim=1)

        # finally a 1x1 convolution adjusts the number of channels
        result = self.conv_cat(feature_cat)
        return result

The result obtained here is the green feature map in the architecture figure above.
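A quick shape check of the ASPP module, as a usage sketch: for a 16×-downsampled MobileNet v2 feature map, dim_in=320 and rate=16//16=1, and the five concatenated branches are fused back to 256 channels.

import torch

aspp = ASPP(dim_in=320, dim_out=256, rate=1)
x = torch.randn(1, 320, 32, 32)  # e.g. a 512x512 input downsampled 16x
print(aspp(x).shape)             # torch.Size([1, 256, 32, 32])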

2.2.2 Enhanced feature extraction structure in Decoder

File name: deeplabv3_plus.py.

class DeepLab(nn.Module):
    def __init__(self, num_classes, backbone="mobilenet", pretrained=True, downsample_factor=16):
        super(DeepLab, self).__init__()
        if backbone == "xception":
            """
            Obtain two feature layers:
                1. shallow features    [128, 128, 256]
                2. backbone output     [30, 30, 2048]
            """
            self.backbone = xception(
                downsample_factor=downsample_factor, pretrained=pretrained)
            in_channels = 2048
            low_level_channels = 256
        elif backbone == "mobilenet":
            """
            Obtain two feature layers:
                1. shallow features    [128, 128, 24]
                2. backbone output     [30, 30, 320]
            """
            self.backbone = MobileNetV2(
                downsample_factor=downsample_factor, pretrained=pretrained)
            in_channels = 320
            low_level_channels = 24
        else:
            raise ValueError(
                'Unsupported backbone - `{}`, Use mobilenet, xception.'.format(backbone))

        # ASPP feature extraction module: dilated convolutions with different rates
        self.aspp = ASPP(dim_in=in_channels, dim_out=256,
                         rate=16//downsample_factor)

        # 1x1 convolution applied to the shallow feature map
        self.shortcut_conv = nn.Sequential(
            nn.Conv2d(low_level_channels, 48, 1),
            nn.BatchNorm2d(48),
            nn.ReLU(inplace=True)
        )

        self.cat_conv = nn.Sequential(
            nn.Conv2d(48+256, 256, 3, stride=1, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),

            nn.Conv2d(256, 256, 3, stride=1, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),

            nn.Dropout(0.1),
        )

        # final 1x1 convolution that adjusts the number of output channels to num_classes
        self.cls_conv = nn.Conv2d(256, num_classes, 1, stride=1)

    def forward(self, x):
        H, W = x.size(2), x.size(3)

        low_level_features, x = self.backbone(x)  # shallow features + backbone output
        x = self.aspp(x)  # enhanced feature extraction with the ASPP module

        # first adjust the number of channels of the shallow feature map with a 1x1 convolution
        low_level_features = self.shortcut_conv(low_level_features)

        # upsample the deep feature map to the size of the shallow one
        x = F.interpolate(x, size=(low_level_features.size(
            2), low_level_features.size(3)), mode='bilinear', align_corners=True)
        # concatenate and fuse with two 3x3 convolutions
        x = self.cat_conv(torch.cat((x, low_level_features), dim=1))

        # 1x1 convolution maps the features to per-class scores
        x = self.cls_conv(x)

        # upsample to obtain a feature map of the same size as the original image
        x = F.interpolate(x, size=(H, W), mode='bilinear', align_corners=True)
        return x

2.3 Obtaining the prediction result

Using the results of Sections 2.1 and 2.2, we have extracted the features of the input image. Now we need to use these features to obtain the prediction result.

The process of using features to obtain prediction results can be divided into 2 steps:

  1. Use a 1×1 convolution to adjust the number of channels to num_classes;
  2. Use the interpolate function to upsample so that the final output has the same [H, W] as the input image.

Its code implementation has been given in 2.2.2.
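Putting Sections 2.1–2.3 together, a rough end-to-end shape check of the DeepLab class defined above (a usage sketch assuming the classes above and the nets package from the reference repository; pretrained=False avoids downloading weights, and 21 classes = 20 PASCAL VOC categories + background):

import torch

model = DeepLab(num_classes=21, backbone="mobilenet", pretrained=False, downsample_factor=16)
model.eval()
with torch.no_grad():
    out = model(torch.randn(1, 3, 512, 512))
print(out.shape)  # torch.Size([1, 21, 512, 512]) -> one score map per class, at input resolution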

2.4 Loss Function

The loss function used by DeepLab v3+ consists of two parts:

  1. Cross Entropy Loss
  2. Dice Loss

2.4.1 Cross Entropy Loss

Cross Entropy Loss is the standard cross-entropy loss, used when the semantic segmentation network classifies each pixel with Softmax.

Cross Entropy Loss measures the difference between two probability distributions. In deep learning it is usually used for multi-class classification tasks, in particular pixel-level classification and image classification.

Suppose there is a classification problem in which the model output is a probability distribution $\hat{y}$ giving the predicted probability of each category, and the true label is $y$, the true category of the sample. The cross-entropy loss is defined as:

$$\text{Cross Entropy Loss} = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log(\hat{y}_{ij})$$

where:

  • $N$ is the number of samples;
  • $C$ is the number of categories;
  • $y_{ij}$ is the true label of sample $i$: it is 1 if sample $i$ belongs to category $j$, and 0 otherwise;
  • $\hat{y}_{ij}$ is the probability predicted by the model that sample $i$ belongs to category $j$.

The cross-entropy loss is computed by taking, for each sample, the negative log-probability the model assigns to the true label, and summing over all samples. The goal of this loss function is to minimize the difference between the model predictions and the true labels, allowing the model to better fit the training data and generalize to unseen data.
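In PyTorch, pixel-wise cross entropy is usually computed with nn.CrossEntropyLoss, which accepts raw logits of shape [B, C, H, W] and an integer label map of shape [B, H, W]; a minimal sketch (the ignore_index value mirrors the 255 "void" label in PASCAL VOC and is an assumption, not necessarily the setting used in the reference repository):

import torch
import torch.nn as nn

num_classes = 21
logits = torch.randn(2, num_classes, 64, 64)         # raw network output, no softmax applied
labels = torch.randint(0, num_classes, (2, 64, 64))  # per-pixel class indices

# ignore_index skips pixels marked as "void" (value 255 in PASCAL VOC labels)
ce_loss = nn.CrossEntropyLoss(ignore_index=255)
print(ce_loss(logits, labels))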

2.4.2 Dice Loss (Dice coefficient loss function)

Dice Loss turns the evaluation metric of semantic segmentation into a loss. The Dice coefficient is a set-similarity measure usually used to compute the similarity between two samples; its value range is [0, 1].

When calculating Dice Loss, we first need to calculate the Dice coefficient, and then subtract the Dice coefficient from 1 to get the Dice Loss.

  1. Calculate the Dice coefficient:

$$\text{Dice} = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

where $X$ is the binary mask (segmented region) of the prediction and $Y$ is the binary mask of the ground truth.

  2. Calculate the Dice Loss:

$$\text{Dice Loss} = 1 - \text{Dice}$$

The value range of Dice Loss is [0, 1]: the closer it is to 0, the more similar the prediction is to the ground truth and the smaller the loss. The optimization goal is to minimize the Dice Loss so that the model fits the training data well and generalizes better to unseen data.
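A minimal sketch of a soft multi-class Dice Loss (labels are one-hot encoded and the Dice coefficient is computed per class; this illustrates the formula above and is not necessarily identical to the loss used in the reference repository):

import torch
import torch.nn.functional as F

def dice_loss(logits, labels, num_classes, smooth=1e-5):
    # logits: [B, C, H, W]; labels: [B, H, W] with integer class indices
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(labels, num_classes).permute(0, 3, 1, 2).float()
    intersection = (probs * one_hot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    dice = (2 * intersection + smooth) / (union + smooth)  # per-class Dice coefficient
    return 1 - dice.mean()                                 # Dice Loss = 1 - Dice

print(dice_loss(torch.randn(2, 21, 64, 64), torch.randint(0, 21, (2, 64, 64)), 21))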

3. Prediction process

3.1 Prediction overview

Prediction with the trained model requires two files:

  1. deeplab.py
  2. predict.py

We first need to modify two parameters in deeplab.py, model_path and num_classes; both must be modified:

  • model_path: points to the trained weight file, located in the logs\ folder
  • num_classes: the number of classes to detect (+1 for the background class)

After completing the modification, you can run predict.py for testing. Once it is running, enter the path of the image to be detected.

CUDA_VISIBLE_DEVICES=2,3 python predict.py

# after the model has finished loading, enter the path of the image to be predicted

3.2 Preprocessing

DeepLab v3+ mainly includes two parts in prediction:

  1. Preprocessing of predictions
  2. Postprocessing of predictions

Next, we first explain the preprocessing part.


File name: deeplab.py.

def detect_image(self, image, count=False, name_classes=None):
    """Run inference on one image.

    Args:
        image (_type_): input image
        count (bool, optional): whether to count pixels per class. Defaults to False.
        name_classes (_type_, optional): class names. Defaults to None.

    Returns:
        _type_: model prediction (single-channel class map)
    """
    # Convert the image to RGB here to avoid errors when predicting on grayscale images.
    # (The code only supports prediction on RGB images; all other image types are converted to RGB.)
    image = cvtColor(image)

    # keep a backup of the input image, used later for drawing
    old_img = copy.deepcopy(image)
    orininal_h = np.array(image).shape[0]
    orininal_w = np.array(image).shape[1]

    # add gray bars to the image to achieve a distortion-free resize
    # (a plain resize could also be used for recognition)
    # returns: letterboxed image, resized image width, resized image height
    image_data, nw, nh = resize_image(image, (self.input_shape[1], self.input_shape[0]))

    # add the batch_size dimension
    image_data = np.expand_dims(np.transpose(preprocess_input(
        np.array(image_data, np.float32)), (2, 0, 1)), 0)

The above is the preprocessing step; its key point is the distortion-free resize (letterboxing), implemented as follows:

def resize_image(image, size):
    """Resize the image while keeping its aspect ratio and paste it centered on a new canvas.

    Args:
        image (_type_): input image
        size (_type_): target size

    Returns:
        (elem1, elem2, elem3): (letterboxed image, resized image width, resized image height)
    """
    iw, ih  = image.size  # original image size
    w, h    = size  # target size

    scale   = min(w/iw, h/ih)  # scaling factor between the original image and the target size
    nw      = int(iw*scale)  # width of the resized image
    nh      = int(ih*scale)  # height of the resized image

    # Resize the image with Pillow's resize() method.
    # BICUBIC interpolation is used for smoother resampling.
    image   = image.resize((nw,nh), Image.BICUBIC)

    # Create a new canvas to hold the resized image.
    # The canvas has the target size and a gray (128, 128, 128) background.
    new_image = Image.new('RGB', size, (128,128,128))

    # Paste the resized image onto the new canvas, centered.
    new_image.paste(image, ((w-nw)//2, (h-nh)//2))

    return new_image, nw, nh
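A small usage example of resize_image (assuming Pillow is installed): a 1280×720 image letterboxed into a 512×512 canvas keeps its aspect ratio, so gray bars appear at the top and bottom.

from PIL import Image

image = Image.new('RGB', (1280, 720))            # stand-in for a real photo
padded, nw, nh = resize_image(image, (512, 512))
print(padded.size, nw, nh)                       # (512, 512) 512 288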

3.3 Post-processing

    with torch.no_grad():
        """
            Disable gradient computation during inference to save memory and speed up prediction.
        """
        images = torch.from_numpy(image_data)
        if self.cuda:
            images = images.cuda()


        # feed the image into the network for prediction
        pr = self.net(images)[0]

        # per-pixel class probabilities
        pr = F.softmax(pr.permute(1, 2, 0), dim=-1).cpu().numpy()

        # cut off the gray-bar (letterbox) region
        pr = pr[int((self.input_shape[0] - nh) // 2): int((self.input_shape[0] - nh) // 2 + nh),
                int((self.input_shape[1] - nw) // 2): int((self.input_shape[1] - nw) // 2 + nw)]

        # resize back to the original image size (plain resize)
        pr = cv2.resize(pr, (orininal_w, orininal_h),
                        interpolation=cv2.INTER_LINEAR)

        # take the class of each pixel (argmax over classes)
        pr = pr.argmax(axis=-1)


    # per-class pixel counting
    if count:
        classes_nums = np.zeros([self.num_classes])
        total_points_num = orininal_h * orininal_w
        print('-' * 63)
        print("|%25s | %15s | %15s|" % ("Key", "Value", "Ratio"))
        print('-' * 63)
        for i in range(self.num_classes):
            num = np.sum(pr == i)
            ratio = num / total_points_num * 100
            if num > 0:
                print("|%25s | %15s | %14.2f%%|" %
                      (str(name_classes[i]), str(num), ratio))
                print('-' * 63)
            classes_nums[i] = num
        print("classes_nums:", classes_nums)

    if self.mix_type == 0:
        # seg_img = np.zeros((np.shape(pr)[0], np.shape(pr)[1], 3))
        # for c in range(self.num_classes):
        #     seg_img[:, :, 0] += ((pr[:, :] == c ) * self.colors[c][0]).astype('uint8')
        #     seg_img[:, :, 1] += ((pr[:, :] == c ) * self.colors[c][1]).astype('uint8')
        #     seg_img[:, :, 2] += ((pr[:, :] == c ) * self.colors[c][2]).astype('uint8')
        seg_img = np.reshape(np.array(self.colors, np.uint8)[
                             np.reshape(pr, [-1])], [orininal_h, orininal_w, -1])

        # convert the new image to a PIL Image
        image = Image.fromarray(np.uint8(seg_img))

        # blend the new image with the original image
        image = Image.blend(old_img, image, 0.7)

    elif self.mix_type == 1:
        # seg_img = np.zeros((np.shape(pr)[0], np.shape(pr)[1], 3))
        # for c in range(self.num_classes):
        #     seg_img[:, :, 0] += ((pr[:, :] == c ) * self.colors[c][0]).astype('uint8')
        #     seg_img[:, :, 1] += ((pr[:, :] == c ) * self.colors[c][1]).astype('uint8')
        #     seg_img[:, :, 2] += ((pr[:, :] == c ) * self.colors[c][2]).astype('uint8')
        seg_img = np.reshape(np.array(self.colors, np.uint8)[
                             np.reshape(pr, [-1])], [orininal_h, orininal_w, -1])

        # convert the new image to a PIL Image
        image = Image.fromarray(np.uint8(seg_img))

    elif self.mix_type == 2:
        seg_img = (np.expand_dims(pr != 0, -1) *
                   np.array(old_img, np.float32)).astype('uint8')

        # convert the new image to a PIL Image
        image = Image.fromarray(np.uint8(seg_img))

    return image

4. Training part

4.1 Training files

The training files we use are in PASCAL VOC format. The files for semantic segmentation model training are divided into two parts:

  1. Original picture
  2. Label

As shown below.

[Figure: an original image and its corresponding label]

The original image is an ordinary RGB image, and the label is a grayscale image or an 8-bit palette image (2^8 = 256, so the pixel values range over [0, 255]). The shape of the original image is [height, width, 3], and the shape of the label is [height, width]. In the label, each pixel stores a number 0, 1, 2, 3, 4, 5, ..., which represents the category that pixel belongs to.

The job of semantic segmentation is to classify every pixel of the original image, so the network can be trained by comparing, for each pixel, the predicted class probabilities with the label.

In addition, note ❗️: the label file is stored as a grayscale/palette image, but when it is opened in 'P' (palette) mode the regions with different gray values are displayed in different colors. If we open this image as an ordinary grayscale image instead, it looks like this:

[Figure: the same label opened as an ordinary grayscale image]

Indeed, it is hard for the human eye to distinguish the categories from the raw pixel values (the pixel value of an airplane is 1 and that of a person is 15), which is why ground-truth labels are usually opened in 'P' (palette) mode. For the computer, of course, the categories can be told apart directly from the pixel values.
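A short sketch of how such a label can be inspected (assuming Pillow and NumPy; the file path is only an example): PIL keeps the palette ('P' mode), while the underlying array directly holds the class indices.

import numpy as np
from PIL import Image

label = Image.open("VOCdevkit/VOC2012/SegmentationClass/2007_000033.png")  # example label file
print(label.mode)                 # 'P' -> palette image, shown in color by image viewers
arr = np.array(label)             # shape [height, width], values are class indices
print(arr.shape, np.unique(arr))  # e.g. (366, 500) [  0   1 255]  (255 marks the void border)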

4.2 Data set preparation

This article uses the PASCAL VOC format for model training. You need to make your own data set before training. If you do not have your own data set, you can download the official data set.

PASCAL VOC 2012 data set download address: Visual Object Classes Challenge 2012 (VOC2012)

After decompressing PASCAL VOC, the directory structure is as follows:

`-- VOCdevkit
    `-- VOC2012
        |-- Annotations
        |-- ImageSets
        |   |-- Action
        |   |-- Layout
        |   |-- Main
        |   `-- Segmentation
        |-- JPEGImages
        |-- SegmentationClass
        `-- SegmentationObject

Before training, place the image files in the VOCdevkit/VOC2012/JPEGImages folder, and place the label files in the VOCdevkit/VOC2012/SegmentationClass folder.

4.3 Standardize your own data set

Tool: labelme

pip install labelme
labelme  # 打开labelme
  1. Open Dir: open the folder containing the images we want to label
  2. Create Polygons: start labeling polygons
  3. After labeling an image, continue with the next one (press D for the next image, A for the previous one); you will be asked where to save the .json files, and we can confirm the path.

It is recommended to turn on automatic saving: File -> Save Automatically.

Each labeled image produces a .json file with the following content:

{
  "version": "5.2.1",  # Labelme version number
  "flags": {},  # flags/markers recorded at annotation time
  "shapes": [  # array with the shape information of every annotated object
    {
      "label": "airplane",  # class name of the annotated object
      "points": [  # boundary points of the object; each entry is an (x, y) coordinate
        [
          198.94308943089433,
          287.6422764227642
        ],
        ...
        ...
        [
          331.4634146341464,
          343.739837398374
        ]
      ],
      "group_id": null,  # identifier used when grouping several shapes together
      "description": "",  # free-text description of the annotated object
      "shape_type": "polygon",  # type of the annotation shape; here "polygon"
      "flags": {}  # additional flags for this object
    }
  ],
  "imagePath": "..\\Airplane_01.jpg",  # path of the original image
  "imageData": "/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDA.....",  # image data, usually Base64-encoded inside the JSON
  "imageHeight": 720,  # height of the original image, in pixels
  "imageWidth": 1280  # width of the original image, in pixels
}

After manually producing the .json files, we need to convert them, because the Ground Truth of PASCAL VOC is a grayscale/palette PNG image, not a JSON file. The code for converting the JSON annotations to label images is as follows:

import base64
import json
import os
import os.path as osp

import numpy as np
import PIL.Image
from labelme import utils


if __name__ == '__main__':
    jpgs_path = "datasets/JPEGImages"
    pngs_path = "datasets/SegmentationClass"
    
    # note that a background class must be included
    classes = ["_background_", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow",
               "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]

    count = os.listdir("./datasets/before/")
    for i in range(0, len(count)):
        path = os.path.join("./datasets/before", count[i])

        if os.path.isfile(path) and path.endswith('json'):
            data = json.load(open(path))

            if data['imageData']:
                imageData = data['imageData']
            else:
                imagePath = os.path.join(
                    os.path.dirname(path), data['imagePath'])
                with open(imagePath, 'rb') as f:
                    imageData = f.read()
                    imageData = base64.b64encode(imageData).decode('utf-8')

            img = utils.img_b64_to_arr(imageData)
            label_name_to_value = {'_background_': 0}
            for shape in data['shapes']:
                label_name = shape['label']
                if label_name in label_name_to_value:
                    label_value = label_name_to_value[label_name]
                else:
                    label_value = len(label_name_to_value)
                    label_name_to_value[label_name] = label_value

            # label_values must be dense
            label_values, label_names = [], []
            for ln, lv in sorted(label_name_to_value.items(), key=lambda x: x[1]):
                label_values.append(lv)
                label_names.append(ln)
            assert label_values == list(range(len(label_values)))

            lbl, _ = utils.shapes_to_label(img.shape, data['shapes'], label_name_to_value)
            lbl = lbl.astype(np.uint8)

            new = np.zeros_like(lbl, dtype=np.uint8)
            for name in label_names:
                index_json = label_names.index(name)
                index_all = classes.index(name)
                new += np.uint8(index_all) * np.uint8(lbl == index_json)

            utils.lblsave(osp.join(pngs_path, count[i].split(".")[0] + '.png'), new)

            print('Saved ' + count[i].split(".")[0] +
                  '.jpg and ' + count[i].split(".")[0] + '.png')
  • The suffix of the original images is .jpg
  • The suffix of the labels is .png

Result demonstration:

[Figure: example of the converted label produced by the script]

4.4 Data set processing (training your own data)

After placing the dataset, the next step is to process it: run voc_annotation.py in the root directory to generate the train.txt and val.txt files used for training and validation.

If you train on the original PASCAL VOC dataset, you do not need to run the following script (the dataset already comes with train.txt and val.txt).

import os
import random

import numpy as np
from PIL import Image
from tqdm import tqdm

"""
想要增加测试集修改trainval_percent 
    修改train_percent用于改变验证集的比例 9:1
  
注意:当前该库将测试集当作验证集使用,不单独划分测试集
"""
trainval_percent = 1
train_percent = 0.9

# 指向VOC数据集所在的文件夹(默认指向根目录下的VOC数据集)
VOCdevkit_path = 'VOCdevkit'

if __name__ == "__main__":
    random.seed(0)
    print("Generate txt in ImageSets.")
    segfilepath = os.path.join(VOCdevkit_path, 'VOC2012/SegmentationClass')
    saveBasePath = os.path.join(VOCdevkit_path, 'VOC2012/ImageSets/Segmentation')

    temp_seg = os.listdir(segfilepath)
    total_seg = []
    for seg in temp_seg:
        if seg.endswith(".png"):
            total_seg.append(seg)

    num = len(total_seg)
    list = range(num)
    tv = int(num*trainval_percent)
    tr = int(tv*train_percent)
    trainval = random.sample(list, tv)
    train = random.sample(trainval, tr)

    print("train and val size", tv)
    print("traub suze", tr)
    ftrainval = open(os.path.join(saveBasePath, 'trainval.txt'), 'w')
    ftest = open(os.path.join(saveBasePath, 'test.txt'), 'w')
    ftrain = open(os.path.join(saveBasePath, 'train.txt'), 'w')
    fval = open(os.path.join(saveBasePath, 'val.txt'), 'w')

    for i in list:
        name = total_seg[i][:-4]+'\n'
        if i in trainval:
            ftrainval.write(name)
            if i in train:
                ftrain.write(name)
            else:
                fval.write(name)
        else:
            ftest.write(name)

    ftrainval.close()
    ftrain.close()
    fval.close()
    ftest.close()
    print("Generate txt in ImageSets done.")

    print("Check datasets format, this may take a while.")
    print("检查数据集格式是否符合要求,这可能需要一段时间。")
    classes_nums = np.zeros([256], np.int)
    for i in tqdm(list):
        name = total_seg[i]
        png_file_name = os.path.join(segfilepath, name)
        if not os.path.exists(png_file_name):
            raise ValueError("Label image %s not found; please check that the file exists at that path and that its suffix is .png." % (png_file_name))

        png = np.array(Image.open(png_file_name), np.uint8)
        if len(np.shape(png)) > 2:
            print("The shape of label image %s is %s, which is not a grayscale or 8-bit palette image; please check the dataset format carefully." % (name, str(np.shape(png))))
            print("Label images must be grayscale or 8-bit palette images, where the value of each pixel is the class that pixel belongs to.")

        classes_nums += np.bincount(np.reshape(png, [-1]), minlength=256)

    print("打印像素点的值与数量。")
    print('-' * 37)
    print("| %15s | %15s |" % ("Key", "Value"))
    print('-' * 37)
    for i in range(256):
        if classes_nums[i] > 0:
            print("| %15s | %15s |" % (str(i), str(classes_nums[i])))
            print('-' * 37)

    if classes_nums[255] > 0 and classes_nums[0] > 0 and np.sum(classes_nums[1:255]) == 0:
        print("The labels only contain pixel values 0 and 255; the data format is incorrect.")
        print("For a binary segmentation problem, the labels should use pixel value 0 for background and 1 for the target.")
    elif classes_nums[0] > 0 and np.sum(classes_nums[1:]) == 0:
        print("The labels only contain background pixels; the data format is incorrect, please check the dataset carefully.")

    print("Images in JPEGImages should be .jpg files, and images in SegmentationClass should be .png files.")

4.5 Start network training

Make sure train.txt and val.txt are present in the dataset folder; now we can start training.

What needs to be noted is ❗️:

  • num_classes must be set to the number of detection classes + 1 (including the background class)
  • For example, the PASCAL VOC dataset has 20 classes, so num_classes=21
  • The same applies to other datasets

Then, in train.py, set:

  • backbone: use Xception or MobileNet v2 as the Backbone
  • model_path: path of the pre-trained weights (must match the chosen backbone)
  • downsample_factor: downsampling factor (8 or 16)
    • The larger the downsampling factor, the faster the training, but the worse the theoretical accuracy.

Then you can start training.

4.6 Training result prediction

Training result prediction requires the use of two files, namely

  1. deeplab.py
  2. predict.py

The procedure is the same as described in Section 3.1.

4.7 Training parameter analysis

Training is divided into two stages:

  1. Freezing phase : the purpose of the freezing phase is to keep the backbone of the model fixed during training (the Backbone no longer updates its parameters), i.e. the feature-extraction part, and to train only the head (in DeepLab v3+: the ASPP, decoder, and classification convolution). The freeze phase is typically used for initial training, especially when machine performance is limited, video memory is small, or the graphics card is weak. In the freezing phase you can set Freeze_Epoch equal to UnFreeze_Epoch, in which case only frozen training is performed.

  2. Unfreeze phase : the unfreeze phase follows the freeze phase; the backbone is unfrozen and all parameters can be updated. Its purpose is to fine-tune the entire model for the specific task. During the unfreezing phase you can set an appropriate UnFreeze_Epoch to control the total number of training epochs.

In summary, the freezing phase first trains only the head, and the unfreezing phase fine-tunes the entire model. Parameters such as Freeze_Epoch and UnFreeze_Epoch can be set according to the specific problem and experimental conditions. Depending on machine performance, video memory, and the size of the training data, these hyperparameters can be adjusted to obtain better training results with limited resources. (A minimal code sketch of the freeze/unfreeze idea follows the parameter list below.)


  • Init_Epoch: the epoch at which training currently starts; its value can be greater than Freeze_Epoch
    • For example, with Init_Epoch = 60, Freeze_Epoch = 50, UnFreeze_Epoch = 100:
    • when training starts, the model skips the freezing stage, starts directly from epoch 60, and adjusts the learning rate accordingly.
    • Main application scenario: resuming training from a checkpoint
  • Freeze_Epoch: number of epochs of frozen training (ignored when Freeze_Train=False)
  • Freeze_batch_size: batch_size used during frozen training (ignored when Freeze_Train=False)
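Conceptually, freezing and unfreezing simply toggle requires_grad on the backbone parameters; a minimal sketch of the idea (not the exact logic in train.py), where model is a DeepLab instance like the one built in Section 2:

# freeze phase: the backbone is fixed, only ASPP / decoder / classifier are updated
for param in model.backbone.parameters():
    param.requires_grad = False
# ... train for Freeze_Epoch epochs with Freeze_batch_size ...

# unfreeze phase: the whole network is fine-tuned end to end
for param in model.backbone.parameters():
    param.requires_grad = True
# ... continue training until UnFreeze_Epoch with Unfreeze_batch_size ...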

4.8 Training parameter recommendations

4.8.1 Start training from the pre-trained weights of the entire model

Adam optimizer :

# freeze training
Init_Epoch = 0, Freeze_Epoch = 50, UnFreeze_Epoch = 100, Freeze_Train = True,
optimizer_type = 'adam', Init_lr = 5e-4, weight_decay = 0

# no freeze training
Init_Epoch = 0, UnFreeze_Epoch = 100, Freeze_Train = False,
optimizer_type = 'adam', Init_lr = 5e-4, weight_decay = 0

SGD optimizer :

# freeze training
Init_Epoch = 0, Freeze_Epoch = 50, UnFreeze_Epoch = 100, Freeze_Train = True,
optimizer_type = 'sgd', Init_lr = 7e-3, weight_decay = 1e-4

# no freeze training
Init_Epoch = 0, UnFreeze_Epoch = 100, Freeze_Train = False,
optimizer_type = 'sgd', Init_lr = 7e-3, weight_decay = 1e-4

Here, UnFreeze_Epoch can be adjusted between 100 and 300.

4.8.2 Start training from the pre-trained weights of the backbone network

Adam optimizer :

# freeze training
Init_Epoch = 0, Freeze_Epoch = 50, UnFreeze_Epoch = 100, Freeze_Train = True,
optimizer_type = 'adam', Init_lr = 5e-4, weight_decay = 0

# no freeze training
Init_Epoch = 0, UnFreeze_Epoch = 100, Freeze_Train = False,
optimizer_type = 'adam', Init_lr = 5e-4, weight_decay = 0

SGD optimizer :

# freeze training
Init_Epoch = 0, Freeze_Epoch = 50, UnFreeze_Epoch = 120, Freeze_Train = True,
optimizer_type = 'sgd', Init_lr = 7e-3, weight_decay = 1e-4

# no freeze training
Init_Epoch = 0, UnFreeze_Epoch = 120, Freeze_Train = False,
optimizer_type = 'sgd', Init_lr = 7e-3, weight_decay = 1e-4

Notes:

  • Since training starts from the pre-trained weights of the backbone only, those weights are not necessarily suitable for semantic segmentation, and more training is required to escape poor local optima.
  • UnFreeze_Epoch can be adjusted between 120 and 300.
  • Adam converges faster than SGD, so UnFreeze_Epoch can theoretically be smaller, but more epochs are still recommended.

4.8.3 batch_size setting

Within the limits of your GPU memory, the larger the better. Running out of video memory has nothing to do with the size of the dataset; if you run out of memory, reduce batch_size.

Note❗️:

  • Because of the BatchNorm layers, the minimum batch_size is 2; it cannot be 1.

Under normal circumstances, Freeze_batch_size is recommended to be 1 to 2 times Unfreeze_batch_size. Do not make the gap too large, because it affects the automatic adjustment of the learning rate.

5. Training results

5.1 Training overview

===>background: Iou-93.08; Recall (equal to the PA)-97.11; Precision-95.74
===>aeroplane:  Iou-84.57; Recall (equal to the PA)-91.72; Precision-91.56
===>bicycle:    Iou-42.19; Recall (equal to the PA)-86.07; Precision-45.28
===>bird:       Iou-81.81; Recall (equal to the PA)-92.48; Precision-87.64
===>boat:       Iou-61.61; Recall (equal to the PA)-76.12; Precision-76.37
===>bottle:     Iou-71.61; Recall (equal to the PA)-88.54; Precision-78.93
===>bus:        Iou-93.45; Recall (equal to the PA)-95.97; Precision-97.27
===>car:        Iou-84.7; Recall (equal to the PA)-90.26; Precision-93.22
===>cat:        Iou-87.14; Recall (equal to the PA)-92.56; Precision-93.71
===>chair:      Iou-33.68; Recall (equal to the PA)-53.53; Precision-47.6
===>cow:        Iou-80.36; Recall (equal to the PA)-86.62; Precision-91.75
===>diningtable:        Iou-50.32; Recall (equal to the PA)-54.12; Precision-87.77
===>dog:        Iou-79.77; Recall (equal to the PA)-90.29; Precision-87.25
===>horse:      Iou-79.56; Recall (equal to the PA)-87.99; Precision-89.25
===>motorbike:  Iou-80.65; Recall (equal to the PA)-89.82; Precision-88.75
===>person:     Iou-80.07; Recall (equal to the PA)-86.52; Precision-91.49
===>pottedplant:        Iou-57.46; Recall (equal to the PA)-70.36; Precision-75.8
===>sheep:      Iou-80.42; Recall (equal to the PA)-89.93; Precision-88.37
===>sofa:       Iou-43.68; Recall (equal to the PA)-49.21; Precision-79.53
===>train:      Iou-84.46; Recall (equal to the PA)-89.14; Precision-94.14
===>tvmonitor:  Iou-67.93; Recall (equal to the PA)-74.5; Precision-88.52
===> mIoU: 72.31; mPA: 82.52; Accuracy: 93.53
Get miou done.
Save mIoU out to miou_out/mIoU.png
Save mPA out to miou_out/mPA.png
Save Recall out to miou_out/Recall.png
Save Precision out to miou_out/Precision.png
Save confusion_matrix out to miou_out/confusion_matrix.csv

5.2 mIoU (mean Intersection over Union)

Mean IoU (mIoU) is one of the most commonly used metrics in semantic segmentation. It is the average of the Intersection over Union (IoU) over all categories. For each category, IoU is the area of the intersection of the predicted region and the ground-truth region divided by the area of their union. mIoU measures the segmentation accuracy of the model over all categories; its value range is [0, 1], and the closer it is to 1, the better the model performs.
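A minimal sketch of how per-class IoU and mIoU can be computed from a confusion matrix (rows = ground-truth class, columns = predicted class); this mirrors the idea behind the metrics above, not necessarily the exact utility code in the repository:

import numpy as np

def compute_miou(confusion):
    # confusion[i, j] = number of pixels whose ground truth is class i and prediction is class j
    tp = np.diag(confusion).astype(float)
    fp = confusion.sum(axis=0) - tp   # predicted as the class but actually something else
    fn = confusion.sum(axis=1) - tp   # belongs to the class but predicted as something else
    iou = tp / np.maximum(tp + fp + fn, 1)
    return iou, iou.mean()

confusion = np.array([[50, 2, 1],
                      [3, 40, 4],
                      [0, 5, 45]])
iou, miou = compute_miou(confusion)
print(iou, miou)  # per-class IoU ~ [0.893 0.741 0.818], mIoU ~ 0.817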

[Figure: per-class mIoU (miou_out/mIoU.png)]

5.3 mPA (average pixel accuracy)

mPA (mean Pixel Accuracy) is the average of pixel-level accuracy. Pixel-level accuracy is a measure of how accurately a model predicts each pixel. It is the ratio of the number of correctly classified pixels to the total number of pixels. mPA measures a model's pixel classification accuracy over the entire dataset.

[Figure: per-class PA (miou_out/mPA.png)]

5.4 mPrecision (average precision)

  • Precision : It is one of the commonly used indicators in binary classification problems. It is used to measure the accuracy of the model in predicting positive category samples. In semantic segmentation, each category can be considered as a binary classification problem, and the accuracy of that category is calculated by dividing the number of pixels predicted correctly by the number of pixels predicted as positive categories.
  • mPrecision (mean Precision) : is the average of the precision rates of all categories, used to measure the prediction accuracy of the model on all categories.

[Figure: per-class Precision (miou_out/Precision.png)]

5.5 mRecall (average recall rate)

  • Recall : It is one of the commonly used indicators in binary classification problems. It is used to measure the model's ability to identify positive category samples. In semantic segmentation, each category can be considered as a binary classification problem, and the recall rate of that category is calculated by dividing the number of correctly predicted pixels by the number of pixels in the true positive category.
  • mRecall (mean Recall) : It is the average of the recall rates of all categories, used to measure the model's overall recognition ability of positive category samples in all categories.
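From a confusion matrix, per-class Recall (identical to per-class PA) and Precision, and therefore mPA, mRecall and mPrecision, can be sketched as follows (again an illustration of the definitions, not the repository's exact code):

import numpy as np

def precision_recall(confusion):
    tp = np.diag(confusion).astype(float)
    recall = tp / np.maximum(confusion.sum(axis=1), 1)     # per-class PA / Recall
    precision = tp / np.maximum(confusion.sum(axis=0), 1)  # per-class Precision
    return precision, recall

confusion = np.array([[50, 2, 1],
                      [3, 40, 4],
                      [2, 5, 43]])
precision, recall = precision_recall(confusion)
print(recall.mean(), precision.mean())  # mPA (= mRecall) and mPrecision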

[Figure: per-class Recall (miou_out/Recall.png)]

References

  1. https://www.bilibili.com/video/BV173411q7xF
  2. https://blog.csdn.net/weixin_44791964/article/details/120113686

Origin: blog.csdn.net/weixin_44878336/article/details/132018570