Semantic Segmentation

I had studied semantic segmentation before and felt there was a lot I could do with it, so I took some time to dig into it. I am recording notes here for future reference; this article may be updated as the study progresses.

Some basic concepts

Image segmentation
Image segmentation refers to partitioning an image using low-level cues such as boundaries and color gradients. Popular algorithms include Otsu, FCM, watershed, and N-Cut. These algorithms are generally unsupervised, and the resulting segments carry no semantic labels; in other words, you do not know what the segmented regions are.

Semantic segmentation
Semantic segmentation predicts a semantic category for every pixel of the input image. It focuses on distinguishing categories, e.g. separating the vehicles in the foreground from the houses, sky, and ground in the background, but it does not tell overlapping vehicles apart. Representative methods include FCN, DeepLab, and PSPNet.

Instance segmentation
Instance segmentation is a combination of object detection and semantic segmentation. It detects the objects in the input image and assigns a category label to every pixel belonging to each object. It focuses on separating the individual objects in the foreground, while the houses, sky, and ground in the background all belong to a single category. Representative methods include DeepMask, Mask R-CNN, and PANet.

Panoptic segmentation

Panoptic segmentation is a synthesis of semantic and instance segmentation. It segments foreground objects (things) at the instance level and background content (stuff) at the semantic level simultaneously: every pixel of the input image receives a category label and an instance ID, producing a globally consistent segmentation map.


The difference between semantic segmentation and image segmentation

Instance and panoptic segmentation PPT

The article "Automatic Technology" helps you understand panoptic segmentation

The usual recipe for CNN-based image semantic segmentation:

  • Downsampling + upsampling: Convolution + Deconvolution/Resize
  • Multi-scale feature fusion: point-wise addition of features / concatenation along the channel dimension
  • Pixel-level segmentation map: classify every pixel

Semantic segmentation networks likewise use two styles of feature fusion (a minimal sketch follows the list):

  • FCN-style point-wise addition, corresponding to caffe's EltwiseLayer and tensorflow's tf.add()
  • U-Net-style concatenation along the channel dimension, corresponding to caffe's ConcatLayer and tensorflow's tf.concat()
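
A minimal PyTorch sketch of the two fusion styles (the tensors and shapes are illustrative):

import torch

a = torch.randn(1, 64, 32, 32)
b = torch.randn(1, 64, 32, 32)

fused_add = a + b                    # FCN-style: element-wise addition, channels unchanged -> [1, 64, 32, 32]
fused_cat = torch.cat([a, b], dim=1) # U-Net-style: channel concatenation -> [1, 128, 32, 32]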

Overview of Image Segmentation [Deep Learning Methods]

Several semantic segmentation algorithms

Fully Convolutional Networks (FCN)

Fully Convolutional Networks for Semantic Segmentation, abbreviated FCN, is the first paper to successfully apply deep learning to image semantic segmentation. Its main contributions are two:

It proposes the fully convolutional network: the fully connected layers are replaced with convolutional layers, so the network accepts images of any size and outputs a segmentation map of the same size as the original image. Only then can every pixel be classified.

It uses deconvolution layers. The feature map of a classification network is generally only a fraction of the original image's size; to map back to the original size it must be upsampled, which is the role of the deconvolution layer. Despite the name, this is not the inverse operation of convolution; a more accurate name is transposed convolution (Transposed Convolution), which produces a larger feature map from a smaller one.
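
A minimal sketch of transposed-convolution upsampling in PyTorch (the channel count and input size are illustrative):

import torch
import torch.nn as nn

# kernel_size=4, stride=2, padding=1 doubles the spatial size: out = (in-1)*2 - 2*1 + 4 = 2*in
up = nn.ConvTranspose2d(in_channels=21, out_channels=21, kernel_size=4, stride=2, padding=1, bias=False)
x = torch.randn(1, 21, 16, 16)
print(up(x).shape)  # torch.Size([1, 21, 32, 32])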

Basic information about fully convolutional networks (FCN)

(figure omitted)

Advantages and Disadvantages of FCN

(figure omitted)

Semantic Segmentation VS Image Classification

(figure omitted)

From classification to segmentation

(figure omitted)

Upsampling methods

(figure omitted)

Upsampling - bilinear interpolation

(figure omitted)
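
A bilinear-upsampling sketch using F.interpolate (shapes are illustrative):

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
y = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
print(y.shape)  # torch.Size([1, 3, 16, 16])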

Upsampling - Un-pooling

(figure omitted)
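
An un-pooling sketch with nn.MaxUnpool2d, which puts each max value back at the position recorded during pooling and fills the rest with zeros (shapes are illustrative):

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 1, 4, 4)
y, idx = pool(x)   # y: [1, 1, 2, 2]; idx records where each max came from
z = unpool(y, idx) # z: [1, 1, 4, 4]; maxima restored in place, zeros elsewhere
print(z.shape)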

Upsampling - Transpose Conv

(figure omitted)

FCN network structure

(figure omitted)

FCN code implementation

import torch
import torch.nn as nn


class FCN8s(nn.Module):
    def __init__(self, in_channels=1, out_channels=[64, 128, 256, 512, 512, 4096, 4096], n_class=21):
        super(FCN8s, self).__init__()

        # conv1
        self.conv1_1 = nn.Conv2d(in_channels=in_channels, out_channels=out_channels[0], kernel_size=3, padding=100)
        self.relu1_1 = nn.ReLU(inplace=True)
        self.conv1_2 = nn.Conv2d(in_channels=out_channels[0], out_channels=out_channels[0], kernel_size=3, padding='same')
        self.relu1_2 = nn.ReLU(inplace=True) # inplace=True overwrites the input tensor
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True) # ceil mode; spatial size now 1/2

        # conv2
        self.conv2_1 = nn.Conv2d(in_channels=out_channels[0], out_channels=out_channels[1], kernel_size=3, padding='same')
        self.relu2_1 = nn.ReLU(inplace=True)
        self.conv2_2 = nn.Conv2d(in_channels=out_channels[1], out_channels=out_channels[1], kernel_size=3, padding='same')
        self.relu2_2 = nn.ReLU(inplace=True) # inplace=True overwrites the input tensor
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True) # ceil mode; spatial size now 1/4

        # conv3
        self.conv3_1 = nn.Conv2d(in_channels=out_channels[1], out_channels=out_channels[2], kernel_size=3, padding='same')
        self.relu3_1 = nn.ReLU(inplace=True)
        self.conv3_2 = nn.Conv2d(in_channels=out_channels[2], out_channels=out_channels[2], kernel_size=3, padding='same')
        self.relu3_2 = nn.ReLU(inplace=True)
        self.conv3_3 = nn.Conv2d(in_channels=out_channels[2], out_channels=out_channels[2], kernel_size=3, padding='same')
        self.relu3_3 = nn.ReLU(inplace=True) # inplace=True overwrites the input tensor
        self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True) # ceil mode; spatial size now 1/8

        # conv4
        self.conv4_1 = nn.Conv2d(in_channels=out_channels[2], out_channels=out_channels[3], kernel_size=3, padding='same')
        self.relu4_1 = nn.ReLU(inplace=True)
        self.conv4_2 = nn.Conv2d(in_channels=out_channels[3], out_channels=out_channels[3], kernel_size=3, padding='same')
        self.relu4_2 = nn.ReLU(inplace=True)
        self.conv4_3 = nn.Conv2d(in_channels=out_channels[3], out_channels=out_channels[3], kernel_size=3, padding='same')
        self.relu4_3 = nn.ReLU(inplace=True) # inplace=True overwrites the input tensor
        self.pool4 = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True) # ceil mode; spatial size now 1/16

        # conv5
        self.conv5_1 = nn.Conv2d(in_channels=out_channels[3], out_channels=out_channels[4], kernel_size=3, padding='same')
        self.relu5_1 = nn.ReLU(inplace=True)
        self.conv5_2 = nn.Conv2d(in_channels=out_channels[4], out_channels=out_channels[4], kernel_size=3, padding='same')
        self.relu5_2 = nn.ReLU(inplace=True)
        self.conv5_3 = nn.Conv2d(in_channels=out_channels[4], out_channels=out_channels[4], kernel_size=3, padding='same')
        self.relu5_3 = nn.ReLU(inplace=True) # inplace=True overwrites the input tensor
        self.pool5 = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True) # ceil mode; spatial size now 1/32

        # fc6
        self.fc6 = nn.Conv2d(in_channels=out_channels[4], out_channels=out_channels[5], kernel_size=7) # padding=100 in conv1_1 guarantees the feature map here is at least 7x7
        self.relu6 = nn.ReLU(inplace=True)
        self.drop6 = nn.Dropout2d()

        # fc7
        self.fc7 = nn.Conv2d(in_channels=out_channels[5], out_channels=out_channels[6], kernel_size=1)
        self.relu7 = nn.ReLU(inplace=True)
        self.drop7 = nn.Dropout2d()

        self.score_fr = nn.Conv2d(in_channels=out_channels[6], out_channels=n_class, kernel_size=1) # scores on fc7's output
        self.score_pool3 = nn.Conv2d(in_channels=out_channels[2], out_channels=n_class, kernel_size=1)
        self.score_pool4 = nn.Conv2d(in_channels=out_channels[3], out_channels=n_class, kernel_size=1)

        self.upscore2 = nn.ConvTranspose2d(in_channels=n_class, out_channels=n_class, kernel_size=4, stride=2, bias=False)
        self.upscore8 = nn.ConvTranspose2d(in_channels=n_class, out_channels=n_class, kernel_size=16, stride=8, bias=False)
        self.upscore_pool4 = nn.ConvTranspose2d(in_channels=n_class, out_channels=n_class, kernel_size=4, stride=2, bias=False)


    def forward(self, x):
        shape = x.shape
        if len(shape) != 4: # allow [n, h, w] single-channel input
            x = torch.unsqueeze(x, 1)
        
        x = self.relu1_1(self.conv1_1(x)) # [n, c, h-2+200, w-2+200]
        x = self.relu1_2(self.conv1_2(x))
        x = self.pool1(x) # [n, 64, (h-2+200)/2, (w-2+200)/2]

        x = self.relu2_1(self.conv2_1(x))
        x = self.relu2_2(self.conv2_2(x))
        x = self.pool2(x) # [n, 128, (h-2+200)/4, (w-2+200)/4]

        x = self.relu3_1(self.conv3_1(x))
        x = self.relu3_2(self.conv3_2(x))
        x = self.relu3_3(self.conv3_3(x))
        x = self.pool3(x) # [n, 256, (h-2+200)/8, (w-2+200)/8]
        pool3 = x

        x = self.relu4_1(self.conv4_1(x))
        x = self.relu4_2(self.conv4_2(x))
        x = self.relu4_3(self.conv4_3(x))
        x = self.pool4(x) # [n, 512, (h-2+200)/16, (w-2+200)/16]
        pool4 = x

        x = self.relu5_1(self.conv5_1(x))
        x = self.relu5_2(self.conv5_2(x))
        x = self.relu5_3(self.conv5_3(x))
        x = self.pool5(x) # [n, 512, (h-2+200)/32, (w-2+200)/32]

        x = self.relu6(self.fc6(x)) # [n, 4096, (h-2+200)/32-6, (w-2+200)/32-6]
        x = self.drop6(x)

        x = self.relu7(self.fc7(x)) # [n, 4096, (h-2+200)/32-6, (w-2+200)/32-6]
        x = self.drop7(x)

        x = self.score_fr(x) # [n, n_class, (h-2+200)/32-6, (w-2+200)/32-6]
        x = self.upscore2(x) # [n, n_class, (h-2+200)/16-10, (w-2+200)/16-10]

        score_pool4 = self.score_pool4(pool4) # [n, n_class, (h-2+200)/16, (w-2+200)/16]
        score_pool4 = score_pool4[:, :, 5:5+x.size()[2], 5:5+x.size()[3]] # [n, n_class, (h-2+200)/16-10, (w-2+200)/16-10]
        
        x = x + score_pool4 # [n, n_class, (h-2+200)/16-10, (w-2+200)/16-10]
        x = self.upscore_pool4(x) # [n, n_class, (h-2+200)/8-18, (w-2+200)/8-18]

        score_pool3 = self.score_pool3(pool3) # [n, n_class, (h-2+200)/8, (w-2+200)/8]
        score_pool3 = score_pool3[:, :, 9:9+x.size()[2], 9:9+x.size()[3]] # [n, n_class, (h-2+200)/8-18, (w-2+200)/8-18]

        x = x + score_pool3 # [n, n_class, (h-2+200)/8-18, (w-2+200)/8-18]

        x = self.upscore8(x) # [n, n_class, (h-2+200)-17*8, (w-2+200)-17*8] -> [n, n_class, h+62, w+62]
        x = x[:, :, 31:31+shape[-2], 31:31+shape[-1]].contiguous() # crop back to the input's spatial size: [n, n_class, h, w]

        return x
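
A quick shape check of the network above (a sketch; it assumes the default single input channel and an input large enough, e.g. 224x224, to survive the 1/32 downsampling and the crops):

model = FCN8s(n_class=21)
x = torch.randn(2, 1, 224, 224)
print(model(x).shape)  # torch.Size([2, 21, 224, 224])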

FCN fully convolutional network detailed explanation PPT

U-Net

U-Net: Convolutional Networks for Biomedical Image Segmentation. U-Net is a segmentation network the authors proposed for the ISBI challenge, and it can adapt to a very small training set (roughly 30 images). U-Net and FCN are both small, simple segmentation networks: neither uses dilated convolution or a trailing CRF, and their structures are straightforward.

The U-Net architecture resembles a large letter "U": first Conv+Pooling downsampling, then Deconv upsampling, cropping and fusing the corresponding low-level feature map before upsampling again. This repeats until a 388x388x2 feature map is obtained, and the final segmentation map comes from a softmax over it. Overall the idea is very similar to FCN.

U-Net network structure

(figure omitted)

skip-connect mechanism

(figure omitted)

U-Net output layer

(figure omitted)

U-Net code implementation

import torch
import torch.nn as nn
import torch.nn.functional as F



#%%
class DoubleConv(nn.Module):
    '''(convolution => [BN] => ReLU) * 2'''

    def __init__(self, in_channels, out_channels, mid_channels=None):
        super(DoubleConv, self).__init__()
        if not mid_channels:
            mid_channels = out_channels
        self.double_conv = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        return self.double_conv(x)



#%%
class DownSample(nn.Module):
    """Downscaling with maxpool then double conv"""

    def __init__(self, in_channels, out_channels):
        super(DownSample, self).__init__()
        
        self.doubleConv = DoubleConv(in_channels, out_channels)
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        res = self.doubleConv(x)
        out = self.maxpool(res)
        return res, out



#%%
class UpSample(nn.Module):
    """Upscaling then double conv"""

    def __init__(self, in_channels, out_channels, bilinear=False):
        super().__init__()

        # if bilinear, use the normal convolutions to reduce the number of channels
        if bilinear:
            self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
            self.conv = DoubleConv(in_channels, out_channels, in_channels//2)
        else:
            self.up = nn.ConvTranspose2d(in_channels, in_channels//2, kernel_size=2, stride=2)
            self.conv = DoubleConv(in_channels, out_channels)

    def forward(self, x1, x2):
        x1 = self.up(x1)
        # input is CHW
        diffY = x2.size()[2] - x1.size()[2]
        diffX = x2.size()[3] - x1.size()[3]

        x1 = F.pad(x1, [diffX//2, diffX-diffX//2,
                        diffY//2, diffY-diffY//2]) # pad x1 so its spatial size matches x2
        x = torch.cat([x2, x1], dim=1)
        return self.conv(x)



#%%
class UNet(nn.Module):
    def __init__(self, in_channels=1, out_channels=[64, 128, 256, 512, 1024], n_classes=5, bilinear=False):
        super(UNet, self).__init__()
        
        self.down1 = DownSample(in_channels=in_channels, out_channels=out_channels[0])
        self.down2 = DownSample(in_channels=out_channels[0], out_channels=out_channels[1])
        self.down3 = DownSample(in_channels=out_channels[1], out_channels=out_channels[2])
        self.down4 = DownSample(in_channels=out_channels[2], out_channels=out_channels[3])

        factor = 2 if bilinear else 1
        self.center = DoubleConv(in_channels=out_channels[3], out_channels=out_channels[4]//factor)

        self.up1 = UpSample(in_channels=out_channels[4], out_channels=out_channels[3]//factor, bilinear=bilinear)
        self.up2 = UpSample(in_channels=out_channels[3], out_channels=out_channels[2]//factor, bilinear=bilinear)
        self.up3 = UpSample(in_channels=out_channels[2], out_channels=out_channels[1]//factor, bilinear=bilinear)
        self.up4 = UpSample(in_channels=out_channels[1], out_channels=out_channels[0], bilinear=bilinear) # pass bilinear here too; otherwise the bilinear path hits a channel mismatch

        self.outConv = nn.Conv2d(in_channels=out_channels[0], out_channels=n_classes, kernel_size=1)

    def forward(self, x):
        if len(x.shape) != 4: # allow [n, h, w] single-channel input
            x = torch.unsqueeze(x, 1)

        res1, x = self.down1(x)
        res2, x = self.down2(x)
        res3, x = self.down3(x)
        res4, x = self.down4(x)

        x = self.center(x)

        x = self.up1(x, res4)
        x = self.up2(x, res3)
        x = self.up3(x, res2)
        x = self.up4(x, res1)

        x = self.outConv(x)

        return x
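
A quick shape check of the network above (a sketch assuming a single-channel 256x256 input):

model = UNet(in_channels=1, n_classes=5)
x = torch.randn(2, 1, 256, 256)
print(model(x).shape)  # torch.Size([2, 5, 256, 256])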

U-Net/PSP network PPT

Pyramid Scene Parsing Network (PSPNet)

Pyramid Scene Parsing Network. Its pyramid pooling module aggregates contextual information from regions at different scales, improving the network's ability to capture global information.

Three segmentation problems

(figure omitted)

CNN-based segmentation models have achieved great results, but they run into difficulty on scene parsing. Scene parsing has two characteristics: there are many target categories, and multiple targets overlap. Together these make segmentation harder and the results less satisfactory. Three problems arise:

Mismatched Relationship
Contextual relationships matter for understanding complex scenes, because object locations follow rules. For example, in the first row of the figure above, FCN misclassifies the "boat" as a "car", yet cars rarely appear on a river: FCN lacks the ability to reason from context.

Confusion Categories

Targets with similar attributes confuse the segmentation network: in the second row of the figure above, FCN mixes up two similar classes, building and skyscraper. Many labels are related, and exploiting the relationships between labels can compensate for this weakness.

Inconspicuous Classes

Small targets are hard to find in segmentation, while very large targets can exceed the network's receptive field and end up segmented discontinuously, as in the third row of the figure above. Because the bed and the pillow have similar color and material, and the pillow lies on the bed, FCN misses the pillow entirely. To improve performance on very small or very large objects, special attention should be paid to the sub-regions that contain these inconspicuous categories.

Main contributions of PSPNet

Many failures of segmentation networks stem from FCN's inability to model the relationships between scene elements or to use global information effectively. The paper proposes PSPNet, a deep network that captures global scene context, fuses appropriate global features, and combines local and global information. It also proposes an optimization strategy with an auxiliary (deeply supervised) loss, and performs well on multiple datasets.

Problems PSP targets

(figure omitted)

Understanding the role of the receptive field (RF)

(figure omitted)

RF -> PSP

(figure omitted)

Pyramid Pooling module


In a general CNN, the receptive field can roughly be regarded as the amount of context the network uses. The paper points out that many networks do not fully exploit global information, so common remedies are:

  • Global Average Pooling. This works, but may lose the spatial relationships between regions and cause ambiguity.
  • Pyramid pooling: features from different pyramid levels are finally flattened and concatenated into an FC layer for classification. This removes CNN's fixed-size constraint for image classification and reduces the information loss between different regions.


Adaptive pooling output-size calculation

(figure omitted)
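
A sketch of adaptive average pooling: nn.AdaptiveAvgPool2d derives the kernel and stride from the input size so that the output always has the requested size (the 2048x60x60 input is illustrative):

import torch
import torch.nn as nn

x = torch.randn(1, 2048, 60, 60)
for s in [1, 2, 3, 6]: # the four pyramid scales used by PSPNet
    print(nn.AdaptiveAvgPool2d(output_size=s)(x).shape)
# -> [1, 2048, 1, 1], [1, 2048, 2, 2], [1, 2048, 3, 3], [1, 2048, 6, 6]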

PSP network structure


  • The input image goes through a pre-trained backbone (ResNet101) with the atrous (dilated) convolution strategy to extract a feature map that is 1/8 the size of the input image.
  • The feature map is fused with global context through the Pyramid Pooling Module, and the pooled features are concatenated with the pre-pooling feature map.
  • Finally, a convolutional layer produces the final per-pixel prediction.

Dilated Convolution

(figure omitted)
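
A dilated-convolution sketch: a 3x3 kernel with dilation d covers a (2d+1)x(2d+1) window, and padding=d keeps the spatial size unchanged (shapes are illustrative):

import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)
for d in [1, 2, 4]:
    conv = nn.Conv2d(64, 64, kernel_size=3, dilation=d, padding=d)
    print(conv(x).shape)  # torch.Size([1, 64, 32, 32]) for every d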

Auxiliary module for PSP network

(figure omitted)

PSPNet code implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models



#%%
def initialize_weights(*models):
    for model in models:
        for module in model.modules():
            if isinstance(module, nn.Conv2d) or isinstance(module, nn.Linear):
                nn.init.kaiming_normal_(module.weight)
                if module.bias is not None:
                    module.bias.data.zero_()
            elif isinstance(module, nn.BatchNorm2d):
                module.weight.data.fill_(1)
                module.bias.data.zero_()



#%%
class PyramidPoolingModule(nn.Module):
    def __init__(self, in_dim, reduction_dim, setting):
        super(PyramidPoolingModule, self).__init__()
        self.features = []
        for s in setting:
            self.features.append(nn.Sequential(
                nn.AdaptiveAvgPool2d(output_size=s),
                nn.Conv2d(in_channels=in_dim, out_channels=reduction_dim, kernel_size=1, bias=False),
                nn.BatchNorm2d(num_features=reduction_dim, momentum=.95),
                nn.ReLU(inplace=True)
            ))
        self.features = nn.ModuleList(self.features)

    def forward(self, x):
        x_size = x.size()
        out = [x]
        for f in self.features:
            out.append(F.interpolate(f(x), x_size[2:], mode='bilinear', align_corners=False))
        out = torch.cat(out, 1)
        return out



#%%
class PSPNet(nn.Module):
    def __init__(self, in_channels=1, n_classes=5, pretrained=False, use_aux=False):
        super(PSPNet, self).__init__()
        self.use_aux = use_aux
        resnet = models.resnet101(pretrained=True) if pretrained else models.resnet101()
        self.layer0 = nn.Sequential(
            nn.Conv2d(in_channels=in_channels, out_channels=64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
        ) # out_channel=64
        self.layer1 = resnet.layer1 # out_channel=256
        self.layer2 = resnet.layer2 # out_channel=512
        self.layer3 = resnet.layer3 # out_channel=1024
        self.layer4 = resnet.layer4 # out_channel=2048

        for n, m in self.layer3.named_modules():
            if 'conv2' in n:
                m.dilation, m.padding, m.stride = (2, 2), (2, 2), (1, 1)
            elif 'downsample.0' in n:
                m.stride = (1, 1)
        for n, m in self.layer4.named_modules():
            if 'conv2' in n:
                m.dilation, m.padding, m.stride = (4, 4), (4, 4), (1, 1)
            elif 'downsample.0' in n:
                m.stride = (1, 1)

        self.ppm = PyramidPoolingModule(in_dim=2048, reduction_dim=512, setting=[1, 2, 3, 6])

        self.final = nn.Sequential(
            nn.Conv2d(in_channels=4096, out_channels=512, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(num_features=512, momentum=.95),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.1),
            nn.Conv2d(in_channels=512, out_channels=n_classes, kernel_size=1)
        )

        if use_aux: # auxiliary loss
            self.aux_logits = nn.Conv2d(in_channels=1024, out_channels=n_classes, kernel_size=1)
            initialize_weights(self.aux_logits)

        initialize_weights(self.ppm, self.final)

    def forward(self, x):
        x_size = x.size()
        if len(x_size) != 4: # allow [n, h, w] single-channel input
            x = torch.unsqueeze(x, 1) # [n, c, h, w]

        x = self.layer0(x) # [n, 64, h//4, w//4]
        x = self.layer1(x) # [n, 256, h//4, w//4]
        x = self.layer2(x) # [n, 512, h//8, w//8]
        x = self.layer3(x) # [n, 1024, h//8, w//8]
        if self.training and self.use_aux:
            aux = self.aux_logits(x)
        x = self.layer4(x) # [n, 2048, h//8, w//8]
        x = self.ppm(x) # [n, 4096, h//8, w//8]
        x = self.final(x) # [n, n_classes, h//8, w//8]
        # x_size[-2:] is the input's spatial size (works for both 3D and 4D inputs)
        if self.training and self.use_aux:
            return (F.interpolate(x, x_size[-2:], mode='bilinear', align_corners=False),
                    F.interpolate(aux, x_size[-2:], mode='bilinear', align_corners=False))
        return F.interpolate(x, x_size[-2:], mode='bilinear', align_corners=False)
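
A quick shape check of the network above (a sketch; in training mode with use_aux=True the forward returns the main and auxiliary predictions):

model = PSPNet(in_channels=1, n_classes=5, use_aux=True)
model.train()
x = torch.randn(2, 1, 224, 224)
out, aux = model(x)
print(out.shape, aux.shape)  # torch.Size([2, 5, 224, 224]) for both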

U-Net/PSP network PPT

Image segmentation 2 (U-Net/V-Net/PSPNet)

DeepLab

(figure omitted)

DeepLab V1

Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs

DeepLab V1 network structure

(figure omitted)

DeepLab V1 code implementation

import torch
import torch.nn as nn
import torch.nn.functional as F



#%%
class classification(nn.Module):
    def __init__(self, in_channels, out_channels, stride, n_classes):
        super(classification, self).__init__()

        self.conv1 = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=3, stride=stride, padding=1)
        self.relu1 = nn.ReLU(inplace=True)
        self.drop1 = nn.Dropout(p=0.3)
        self.conv2 = nn.Conv2d(in_channels=out_channels, out_channels=out_channels, kernel_size=1)
        self.relu2 = nn.ReLU(inplace=True)
        self.drop2 = nn.Dropout(p=0.3)
        self.conv3 = nn.Conv2d(in_channels=out_channels, out_channels=n_classes, kernel_size=1)

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.drop1(x)
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.drop2(x)
        x = self.conv3(x)

        return x


#%%
class DeepLab_V1(nn.Module):
    def __init__(self, in_channels=1, out_channels=[64, 128, 256, 512, 512, 512, 512], n_classes=5):
        super(DeepLab_V1, self).__init__()

        self.classification0 = classification(
            in_channels, out_channels[0], stride=8, n_classes=n_classes
        ) # this branch downsamples the input 8x, so its first conv uses stride 8
        
        self.vggLayer1 = nn.Sequential(
            nn.Conv2d(in_channels=in_channels, out_channels=out_channels[0], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=out_channels[0], out_channels=out_channels[0], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)
        )

        self.classification1 = classification(
            out_channels[0], out_channels[1], stride=4, n_classes=n_classes
        ) # takes vggLayer1's output (1/2 scale), so downsample 4x more

        self.vggLayer2 = nn.Sequential(
            nn.Conv2d(in_channels=out_channels[0], out_channels=out_channels[1], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=out_channels[1], out_channels=out_channels[1], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)
        )

        self.classification2 = classification(
            out_channels[1], out_channels[2], stride=2, n_classes=n_classes
        ) # takes vggLayer2's output (1/4 scale), so downsample 2x more

        self.vggLayer3 = nn.Sequential(
            nn.Conv2d(in_channels=out_channels[1], out_channels=out_channels[2], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=out_channels[2], out_channels=out_channels[2], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=out_channels[2], out_channels=out_channels[2], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)
        )

        self.classification3 = classification(
            out_channels[2], out_channels[3], stride=1, n_classes=n_classes
        ) # takes vggLayer3's output, already at 1/8 scale, so no further downsampling

        self.vggLayer4 = nn.Sequential(
            nn.Conv2d(in_channels=out_channels[2], out_channels=out_channels[3], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=out_channels[3], out_channels=out_channels[3], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=out_channels[3], out_channels=out_channels[3], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1, ceil_mode=True)
        )

        self.classification4 = classification(
            out_channels[3], out_channels[4], stride=1, n_classes=n_classes
        ) # takes vggLayer4's output, already at 1/8 scale, so no further downsampling

        self.vggLayer5 = nn.Sequential(
            nn.Conv2d(out_channels[3], out_channels[4], kernel_size=3, dilation=2, padding='same'),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels[4], out_channels[4], kernel_size=3, dilation=2, padding='same'),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels[4], out_channels[4], kernel_size=3, dilation=2, padding='same'),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1, ceil_mode=True)
        )

        self.fc6 = nn.Sequential(
            nn.Conv2d(out_channels[4], out_channels[5], kernel_size=3, dilation=4, padding='same'),
            nn.ReLU(inplace=True),
            nn.Dropout()
        )

        self.fc7 = nn.Sequential(
            nn.Conv2d(out_channels[5], out_channels[6], kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Dropout()
        )

        self.classification7 = classification(
            out_channels[6], out_channels[6], stride=1, n_classes=n_classes
        ) # takes fc7's output, already at 1/8 scale, so no further downsampling

    def forward(self, x):
        x_size = x.size()
        if len(x_size) != 4: # allow [n, h, w] single-channel input
            x = torch.unsqueeze(x, 1) # [n, c, h, w]

        cla0 = self.classification0(x) # [n, n_classes, h//8, w//8]

        x = self.vggLayer1(x) # [n, 64, h//2, w//2]
        cla1 = self.classification1(x) # [n, n_classes, h//8, w//8]

        x = self.vggLayer2(x) # [n, 128, h//4, w//4]
        cla2 = self.classification2(x) # [n, n_classes, h//8, w//8]

        x = self.vggLayer3(x) # [n, 256, h//8, w//8]
        cla3 = self.classification3(x) # [n, n_classes, h//8, w//8]

        x = self.vggLayer4(x) # [n, 512, h//8, w//8]
        cla4 = self.classification4(x) # [n, n_classes, h//8, w//8]

        x = self.vggLayer5(x) # [n, 512, h//8, w//8]
        x = self.fc6(x) # [n, 512, h//8, w//8]
        x = self.fc7(x) # [n, 512, h//8, w//8]
        cla7 = self.classification7(x) # [n, n_classes, h//8, w//8]

        x = cla0+cla1+cla2+cla3+cla4+cla7 # [n, n_classes, h//8, w//8]
        x = F.interpolate(x, size=x_size[-2:], mode='bilinear', align_corners=False) # upsample to the input's spatial size

        return x
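
A quick shape check of the network above (a sketch assuming an input side length divisible by 8):

model = DeepLab_V1(in_channels=1, n_classes=5)
x = torch.randn(2, 1, 224, 224)
print(model(x).shape)  # torch.Size([2, 5, 224, 224])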

DeepLab V2

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

DeepLab V2 network structure

(figure omitted)

DeepLab V2 ASPP module

(figure omitted)

DeepLab V2 code implementation

import torch
import torch.nn as nn
import torch.nn.functional as F



#%%
class ResBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride, padding, dilation):
        super(ResBlock, self).__init__()
        self.mid_channels = out_channels//4

        # standard bottleneck: 1x1 reduce -> 3x3 (dilated) -> 1x1 increase, ReLU after the addition
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, self.mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(num_features=self.mid_channels),
            nn.ReLU(inplace=True)
        )
        self.conv3x3 = nn.Sequential(
            nn.Conv2d(self.mid_channels, self.mid_channels,
                      kernel_size=3, stride=stride, padding=padding, dilation=dilation, bias=False),
            nn.BatchNorm2d(num_features=self.mid_channels),
            nn.ReLU(inplace=True)
        )
        self.increase = nn.Sequential(
            nn.Conv2d(self.mid_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(num_features=out_channels)
        )

        if in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(num_features=out_channels)
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        res = self.shortcut(x)
        x = self.reduce(x)
        x = self.conv3x3(x)
        x = self.increase(x)
        return F.relu(x + res, inplace=True)
        


#%%
class ResLayer(nn.Module):
    def __init__(self, in_channels, out_channels, n_layers, stride=1, padding=1, dilation=1):
        super(ResLayer, self).__init__()
        resLayer = []
        for i in range(n_layers):
            resLayer.append(
                ResBlock(in_channels=(in_channels if i==0 else out_channels),
                         out_channels=out_channels,
                         stride=(stride if i==0 else 1),
                         padding=padding,
                         dilation=dilation)
            )
        self.resLayers = nn.Sequential(*resLayer)
    
    def forward(self, x):
        x = self.resLayers(x)
        return x



#%%
class ASPP(nn.Module):
    def __init__(self, in_channels, out_channels, dilations):
        super(ASPP, self).__init__()

        # four parallel 3x3 branches with different dilation rates; padding=dilation keeps the spatial size
        self.aspp1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      dilation=dilations[0], padding=dilations[0]),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )
        self.aspp2 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      dilation=dilations[1], padding=dilations[1]),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )
        self.aspp3 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      dilation=dilations[2], padding=dilations[2]),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )
        self.aspp4 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      dilation=dilations[3], padding=dilations[3]),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        aspp1 = self.aspp1(x)
        aspp2 = self.aspp2(x)
        aspp3 = self.aspp3(x)
        aspp4 = self.aspp4(x)
        out = aspp1+aspp2+aspp3+aspp4 # DeepLab V2 fuses the parallel branches by summation

        return out



#%%
class DeepLab_V2(nn.Module):
    def __init__(self, in_channels=1, out_channels=[64, 256, 512, 1024, 2048], n_layers=[3, 4, 23, 3], n_classes=5):
        super(DeepLab_V2, self).__init__()

        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, out_channels[0], kernel_size=7, stride=2, padding=3, dilation=1),
            nn.BatchNorm2d(num_features=out_channels[0]),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1, ceil_mode=True)
        )

        self.res101Layer1 = ResLayer(out_channels[0], out_channels[1], n_layers[0], stride=2)
        self.res101Layer2 = ResLayer(out_channels[1], out_channels[2], n_layers[1], stride=2)
        self.res101Layer3 = ResLayer(out_channels[2], out_channels[3], n_layers[2], stride=1, padding=2, dilation=2)
        self.res101Layer4 = ResLayer(out_channels[3], out_channels[4], n_layers[3], stride=1, padding=4, dilation=4)

        self.aspp = ASPP(out_channels[4], n_classes, dilations=[6, 12, 18, 24])

    def forward(self, x):
        x_size = x.size()
        if len(x_size) != 4: # allow [n, h, w] single-channel input
            x = torch.unsqueeze(x, 1) # [n, c, h, w]
        x = self.stem(x)
        x = self.res101Layer1(x)
        x = self.res101Layer2(x)
        x = self.res101Layer3(x)
        x = self.res101Layer4(x)
        x = self.aspp(x)
        x = F.interpolate(x, size=x_size[-2:], mode='bilinear', align_corners=False) # upsample to the input's spatial size

        return x
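
A quick shape check of the network above (a sketch with an illustrative 224x224 single-channel input):

model = DeepLab_V2(in_channels=1, n_classes=5)
x = torch.randn(2, 1, 224, 224)
print(model(x).shape)  # torch.Size([2, 5, 224, 224])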

DeepLab V3

Rethinking Atrous Convolution for Semantic Image Segmentation

DeepLab V3 network structure

(figure omitted)

DeepLab V3 ASPP upgrade module

(figure omitted)

DeepLab V3 Multi-Grid

(figure omitted)

DeepLab V3 code implementation

import torch
import torch.nn as nn
import torch.nn.functional as F



#%%
class ResBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride, padding, dilation):
        super(ResBlock, self).__init__()
        self.mid_channels = out_channels//4

        # standard bottleneck: 1x1 reduce -> 3x3 (dilated) -> 1x1 increase, ReLU after the addition
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, self.mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(num_features=self.mid_channels),
            nn.ReLU(inplace=True)
        )
        self.conv3x3 = nn.Sequential(
            nn.Conv2d(self.mid_channels, self.mid_channels,
                      kernel_size=3, stride=stride, padding=padding, dilation=dilation, bias=False),
            nn.BatchNorm2d(num_features=self.mid_channels),
            nn.ReLU(inplace=True)
        )
        self.increase = nn.Sequential(
            nn.Conv2d(self.mid_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(num_features=out_channels)
        )

        if in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(num_features=out_channels)
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        res = self.shortcut(x)
        x = self.reduce(x)
        x = self.conv3x3(x)
        x = self.increase(x)
        return F.relu(x + res, inplace=True)
        


#%%
class ResLayer(nn.Module):
    def __init__(self, in_channels, out_channels, n_layers, stride=1, padding=1, dilation=1, multi_grids=None):
        super(ResLayer, self).__init__()

        if multi_grids is None:
            multi_grids = [1 for _ in range(n_layers)]
        else:
            assert n_layers == len(multi_grids)
        
        resLayer = []
        for i in range(n_layers):
            resLayer.append(
                ResBlock(in_channels=(in_channels if i==0 else out_channels),
                         out_channels=out_channels,
                         stride=(stride if i==0 else 1),
                         padding=padding*multi_grids[i],
                         dilation=dilation*multi_grids[i])
            )
        self.resLayers = nn.Sequential(*resLayer)
    
    def forward(self, x):
        x = self.resLayers(x)
        return x
    


#%%
class ASPP_plus(nn.Module):
    def __init__(self, in_channels, out_channels, dilations=[6, 12, 18]):
        super(ASPP_plus, self).__init__()

        # 1x1 branch
        self.aspp0 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )

        # three 3x3 branches with different dilation rates
        self.aspp1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      dilation=dilations[0], padding=dilations[0], bias=False),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )
        self.aspp2 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      dilation=dilations[1], padding=dilations[1], bias=False),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )
        self.aspp3 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      dilation=dilations[2], padding=dilations[2], bias=False),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )

        # image-pooling branch: global average pool -> 1x1 conv
        self.aspp4 = nn.Sequential(
            nn.AdaptiveAvgPool2d(output_size=1),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        x_size = x.size()
        aspp0 = self.aspp0(x)
        aspp1 = self.aspp1(x)
        aspp2 = self.aspp2(x)
        aspp3 = self.aspp3(x)
        aspp4 = self.aspp4(x)
        aspp4 = F.interpolate(aspp4, x_size[2:], mode='bilinear', align_corners=False)
        out = aspp0+aspp1+aspp2+aspp3+aspp4 # note: the V3 paper concatenates the branches and applies a 1x1 conv; this implementation sums them

        return out
    

#%%
class DeepLab_V3(nn.Module):
    def __init__(self, in_channels=1, out_channels=[64, 256, 512, 1024, 2048], 
                 n_layers=[3, 4, 6, 3], multi_grids=[1, 2, 4], n_classes=5):
        super(DeepLab_V3, self).__init__()

        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, out_channels[0], kernel_size=7, stride=2, padding=3, dilation=1),
            nn.BatchNorm2d(num_features=out_channels[0]),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1, ceil_mode=True)
        )

        self.res50Layer1 = ResLayer(out_channels[0], out_channels[1], n_layers[0], stride=2)
        self.res50Layer2 = ResLayer(out_channels[1], out_channels[2], n_layers[1], stride=2)
        self.res50Layer3 = ResLayer(out_channels[2], out_channels[3], n_layers[2], stride=1, padding=2, dilation=2)
        self.res50Layer4 = ResLayer(out_channels[3], out_channels[4], n_layers[3], 
                                    stride=1, padding=2, dilation=2, multi_grids=multi_grids)
        self.res50Layer4_copy1 = ResLayer(out_channels[4], out_channels[4], n_layers[3],
                                          stride=1, padding=4, dilation=4, multi_grids=multi_grids)
        self.res50Layer4_copy2 = ResLayer(out_channels[4], out_channels[4], n_layers[3],
                                          stride=1, padding=8, dilation=8, multi_grids=multi_grids)
        self.res50Layer4_copy3 = ResLayer(out_channels[4], out_channels[4], n_layers[3],
                                          stride=1, padding=16, dilation=16, multi_grids=multi_grids)
        
        self.aspp = ASPP_plus(out_channels[4], n_classes, dilations=[6, 12, 18]) # three dilated branches; rates 6/12/18 as in the V3 paper

    def forward(self, x):
        x_size = x.size()
        if len(x_size) != 4: # allow [n, h, w] single-channel input
            x = torch.unsqueeze(x, 1) # [n, c, h, w]
        x = self.stem(x)
        x = self.res50Layer1(x)
        x = self.res50Layer2(x)
        x = self.res50Layer3(x)
        x = self.res50Layer4(x)
        x = self.res50Layer4_copy1(x)
        x = self.res50Layer4_copy2(x)
        x = self.res50Layer4_copy3(x)
        x = self.aspp(x)
        x = F.interpolate(x, size=x_size[-2:], mode='bilinear', align_corners=False) # upsample to the input's spatial size

        return x
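
A quick shape check of the network above (a sketch with an illustrative 224x224 single-channel input):

model = DeepLab_V3(in_channels=1, n_classes=5)
x = torch.randn(2, 1, 224, 224)
print(model(x).shape)  # torch.Size([2, 5, 224, 224])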

DeepLab V3+

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

DeepLab PPT


Source: blog.csdn.net/qq_41990294/article/details/132763209