I had come across semantic segmentation before and felt it could be applied to many tasks, so I took the time to study it properly. I am recording my notes here for future reference. This article may be updated as the study progresses.
Semantic Segmentation
- Some basic concepts
- Several semantic segmentation algorithms
  - Fully Convolutional Networks (FCN)
  - U-Net
  - Pyramid Scene Parsing Network (PSP) segmentation network
    - Three segmentation problems
    - Main contributions of PSPNet
    - Problems PSP targets
    - Understanding the role of the receptive field (RF)
    - RF -> PSP
    - Pyramid Pooling module
    - Dimension calculation of adaptive pooling (Adaptive Pool)
    - PSP network structure
    - Dilated Convolution
    - Auxiliary module for PSP network
    - PSPNet code implementation
  - DeepLab
Some basic concepts
Image segmentation
Image segmentation divides an image into regions using low-level cues such as boundaries and color gradients. Classic algorithms include Otsu thresholding, FCM (fuzzy c-means), watershed, and N-Cut. These algorithms are generally unsupervised, and the resulting regions carry no semantic labels. In other words, you do not know what the segmented things are.
Semantic segmentation
Semantic segmentation predicts a semantic category for every pixel in the input image. It focuses on distinguishing categories, for example separating the vehicles in the foreground from the houses, sky, and ground in the background, but it does not tell overlapping vehicles apart. Representative methods include FCN, DeepLab, and PSPNet.
Instance segmentation
Instance segmentation combines object detection and semantic segmentation. It detects the objects in the input image and assigns a category label to each pixel belonging to each object. It focuses on separating individual foreground objects, while the houses, sky, and ground in the background all fall into a single category. Representative methods include DeepMask, Mask R-CNN, and PANet.
Panoptic segmentation
Panoptic segmentation unifies semantic segmentation and instance segmentation. It segments objects (things) at the instance level and background content (stuff) at the semantic level at the same time. Every pixel in the input image is assigned a category label and an instance ID, producing a single, unified segmentation of the whole image.
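To make the difference between the three label formats concrete, here is a toy sketch. The 4x6 scene, the class IDs, and the variable names are all made up for illustration; real datasets use their own encodings.

```python
import numpy as np

# Hypothetical class IDs: 0 = sky (stuff), 1 = road (stuff), 2 = car (thing).
semantic = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 2, 2, 0, 2, 2],
    [1, 2, 2, 1, 2, 2],
    [1, 1, 1, 1, 1, 1],
])

# Instance IDs: 0 = no instance (stuff pixels), 1 and 2 = the two cars.
instance = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

# Semantic segmentation: `semantic` alone; both cars share class 2.
# Instance segmentation: one binary mask per detected object, stuff is ignored.
instance_masks = {int(i): instance == i for i in np.unique(instance) if i != 0}

# Panoptic segmentation: every pixel carries (class label, instance ID).
panoptic = np.stack([semantic, instance], axis=-1)

print(len(instance_masks))   # number of car instances
print(panoptic[1, 1], panoptic[1, 4], panoptic[0, 0])
```

The two cars are indistinguishable in `semantic`, separated in `instance_masks`, and both kinds of information are kept per pixel in `panoptic`.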
The difference between semantic segmentation and image segmentation
Example and panoramic segmentation PPT
The article "Automatic Technology" will help you understand panoramic segmentation
CNN-based semantic segmentation basically follows this routine:
- Downsampling + upsampling: Convolution + Deconvolution/Resize
- Multi-scale feature fusion: point-by-point addition of features, or concatenation along the feature channel dimension
- Obtain a pixel-level segmentation map: decide the category of each pixel
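The last step of this routine can be sketched in a few lines of PyTorch. This is only an illustration: the random `logits` tensor stands in for the output of a real network, and the shapes and the 21-class setting are assumptions.

```python
import torch

torch.manual_seed(0)
n_classes = 21                               # assumed number of categories
logits = torch.randn(2, n_classes, 8, 8)     # stand-in network output: [n, n_classes, h, w]

# Softmax over the channel dimension gives a per-pixel class distribution.
probs = torch.softmax(logits, dim=1)

# The predicted category of each pixel is the channel with the highest score.
seg_map = probs.argmax(dim=1)                # [n, h, w], integer class IDs

print(seg_map.shape)
```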
Semantic segmentation networks fuse features in two ways:
- FCN-style point-by-point addition, corresponding to Caffe's EltwiseLayer and TensorFlow's tf.add()
- U-Net-style concatenation along the channel dimension, corresponding to Caffe's ConcatLayer and TensorFlow's tf.concat()
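In PyTorch, which the implementations later in this article use, the two fusion styles look roughly like this. The tensor names and shapes are assumptions chosen for the example.

```python
import torch

# Two hypothetical feature maps with the same spatial size: [n, c, h, w].
low_level = torch.randn(1, 64, 32, 32)
upsampled = torch.randn(1, 64, 32, 32)

# FCN-style fusion: element-wise addition, the channel count is unchanged.
fused_add = low_level + upsampled            # same as torch.add(low_level, upsampled)

# U-Net-style fusion: concatenation along the channel dimension, channels add up.
fused_cat = torch.cat([low_level, upsampled], dim=1)

print(fused_add.shape)   # torch.Size([1, 64, 32, 32])
print(fused_cat.shape)   # torch.Size([1, 128, 32, 32])
```

Addition requires both inputs to have the same number of channels, while concatenation only requires matching spatial sizes and leaves it to the following convolution to mix the channels.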
Overview of Image Segmentation [Deep Learning Methods]
Several semantic segmentation algorithms
Fully Convolutional Networks (FCN)
Fully Convolutional Networks for Semantic Segmentation, referred to as FCN, was the first paper to successfully apply deep learning to image semantic segmentation. It makes two main contributions:
- It proposes a fully convolutional network. The fully connected layers are replaced with convolutional layers, so the network can accept images of any size and output a segmentation map of the same size as the original image. Only then can every pixel be classified.
- It uses deconvolution layers. The feature map of a classification network is generally only a fraction of the size of the original image, so it must be upsampled to map back to the original size. This is the role of the deconvolution layer. Despite the name, it is not the inverse operation of convolution. A more appropriate name is transposed convolution, which produces a large feature map from a small one.
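As a small illustration of how a transposed convolution enlarges a feature map, the sketch below doubles the spatial size of a coarse score map. The input size and `padding=1` are choices made for this example (the FCN code later in this article uses `padding=0` and crops instead).

```python
import torch
import torch.nn as nn

n_class = 21
coarse = torch.randn(1, n_class, 10, 10)     # hypothetical coarse score map

# For dilation=1 and output_padding=0:
#   out = (in - 1) * stride - 2 * padding + kernel_size
up2 = nn.ConvTranspose2d(n_class, n_class, kernel_size=4, stride=2, padding=1, bias=False)
out = up2(coarse)

expected = (10 - 1) * 2 - 2 * 1 + 4          # = 20
print(out.shape, expected)
```

The weights of such a layer are learned, so unlike fixed interpolation the network can adapt how it fills in the upsampled pixels.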
Basic information about fully convolutional networks (FCN)
Advantages and Disadvantages of FCN
Semantic Segmentation VS Image Classification
Classification -> segmentation: what changes
Upsampling methods
Upsampling - bilinear interpolation
Upsampling - Un-pooling
Upsampling - Transpose Conv
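The first two upsampling methods have no learnable weights and can be tried directly. The following is a minimal sketch on a made-up 4x4 input, not part of the FCN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

# Bilinear interpolation: new pixels are weighted averages of their neighbours.
up_bilinear = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)

# Un-pooling: max pooling remembers where each maximum came from, and
# un-pooling writes the value back to that position, leaving zeros elsewhere.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
pooled, indices = pool(x)
up_unpool = unpool(pooled, indices)

print(up_bilinear.shape)
print(up_unpool[0, 0])
```

Un-pooling restores positions but not the discarded values, which is why it is usually followed by convolutions that densify the sparse map.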
FCN network structure
FCN code implementation
import torch
import torch.nn as nn

class FCN8s(nn.Module):
    def __init__(self, in_channels=1, out_channels=[64, 128, 256, 512, 512, 4096, 4096], n_class=21):
        super(FCN8s, self).__init__()
        # conv1
        self.conv1_1 = nn.Conv2d(in_channels=in_channels, out_channels=out_channels[0], kernel_size=3, padding=100)
        self.relu1_1 = nn.ReLU(inplace=True)
        self.conv1_2 = nn.Conv2d(in_channels=out_channels[0], out_channels=out_channels[0], kernel_size=3, padding='same')
        self.relu1_2 = nn.ReLU(inplace=True)  # in-place: overwrites the input tensor
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)  # round up, 1/2
        # conv2
        self.conv2_1 = nn.Conv2d(in_channels=out_channels[0], out_channels=out_channels[1], kernel_size=3, padding='same')
        self.relu2_1 = nn.ReLU(inplace=True)
        self.conv2_2 = nn.Conv2d(in_channels=out_channels[1], out_channels=out_channels[1], kernel_size=3, padding='same')
        self.relu2_2 = nn.ReLU(inplace=True)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)  # round up, 1/4
        # conv3
        self.conv3_1 = nn.Conv2d(in_channels=out_channels[1], out_channels=out_channels[2], kernel_size=3, padding='same')
        self.relu3_1 = nn.ReLU(inplace=True)
        self.conv3_2 = nn.Conv2d(in_channels=out_channels[2], out_channels=out_channels[2], kernel_size=3, padding='same')
        self.relu3_2 = nn.ReLU(inplace=True)
        self.conv3_3 = nn.Conv2d(in_channels=out_channels[2], out_channels=out_channels[2], kernel_size=3, padding='same')
        self.relu3_3 = nn.ReLU(inplace=True)
        self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)  # round up, 1/8
        # conv4
        self.conv4_1 = nn.Conv2d(in_channels=out_channels[2], out_channels=out_channels[3], kernel_size=3, padding='same')
        self.relu4_1 = nn.ReLU(inplace=True)
        self.conv4_2 = nn.Conv2d(in_channels=out_channels[3], out_channels=out_channels[3], kernel_size=3, padding='same')
        self.relu4_2 = nn.ReLU(inplace=True)
        self.conv4_3 = nn.Conv2d(in_channels=out_channels[3], out_channels=out_channels[3], kernel_size=3, padding='same')
        self.relu4_3 = nn.ReLU(inplace=True)
        self.pool4 = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)  # round up, 1/16
        # conv5
        self.conv5_1 = nn.Conv2d(in_channels=out_channels[3], out_channels=out_channels[4], kernel_size=3, padding='same')
        self.relu5_1 = nn.ReLU(inplace=True)
        self.conv5_2 = nn.Conv2d(in_channels=out_channels[4], out_channels=out_channels[4], kernel_size=3, padding='same')
        self.relu5_2 = nn.ReLU(inplace=True)
        self.conv5_3 = nn.Conv2d(in_channels=out_channels[4], out_channels=out_channels[4], kernel_size=3, padding='same')
        self.relu5_3 = nn.ReLU(inplace=True)
        self.pool5 = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)  # round up, 1/32
        # fc6: thanks to padding=100 in conv1_1, the feature map here is at least 7x7
        self.fc6 = nn.Conv2d(in_channels=out_channels[4], out_channels=out_channels[5], kernel_size=7)
        self.relu6 = nn.ReLU(inplace=True)
        self.drop6 = nn.Dropout2d()
        # fc7
        self.fc7 = nn.Conv2d(in_channels=out_channels[5], out_channels=out_channels[6], kernel_size=1)
        self.relu7 = nn.ReLU(inplace=True)
        self.drop7 = nn.Dropout2d()
        # 1x1 convolutions that turn features into per-class scores
        self.score_fr = nn.Conv2d(in_channels=out_channels[6], out_channels=n_class, kernel_size=1)
        self.score_pool3 = nn.Conv2d(in_channels=out_channels[2], out_channels=n_class, kernel_size=1)
        self.score_pool4 = nn.Conv2d(in_channels=out_channels[3], out_channels=n_class, kernel_size=1)
        # transposed convolutions for upsampling
        self.upscore2 = nn.ConvTranspose2d(in_channels=n_class, out_channels=n_class, kernel_size=4, stride=2, bias=False)
        self.upscore8 = nn.ConvTranspose2d(in_channels=n_class, out_channels=n_class, kernel_size=16, stride=8, bias=False)
        self.upscore_pool4 = nn.ConvTranspose2d(in_channels=n_class, out_channels=n_class, kernel_size=4, stride=2, bias=False)

    def forward(self, x):
        shape = x.shape
        if x.dim() != 4:  # [n, h, w] input: add the channel dimension
            x = torch.unsqueeze(x, 1)
        x = self.relu1_1(self.conv1_1(x))  # [n, 64, h-2+200, w-2+200]
        x = self.relu1_2(self.conv1_2(x))
        x = self.pool1(x)  # [n, 64, (h-2+200)/2, (w-2+200)/2]
        x = self.relu2_1(self.conv2_1(x))
        x = self.relu2_2(self.conv2_2(x))
        x = self.pool2(x)  # [n, 128, (h-2+200)/4, (w-2+200)/4]
        x = self.relu3_1(self.conv3_1(x))
        x = self.relu3_2(self.conv3_2(x))
        x = self.relu3_3(self.conv3_3(x))
        x = self.pool3(x)  # [n, 256, (h-2+200)/8, (w-2+200)/8]
        pool3 = x
        x = self.relu4_1(self.conv4_1(x))
        x = self.relu4_2(self.conv4_2(x))
        x = self.relu4_3(self.conv4_3(x))
        x = self.pool4(x)  # [n, 512, (h-2+200)/16, (w-2+200)/16]
        pool4 = x
        x = self.relu5_1(self.conv5_1(x))
        x = self.relu5_2(self.conv5_2(x))
        x = self.relu5_3(self.conv5_3(x))
        x = self.pool5(x)  # [n, 512, (h-2+200)/32, (w-2+200)/32]
        x = self.relu6(self.fc6(x))  # [n, 4096, (h-2+200)/32-6, (w-2+200)/32-6]
        x = self.drop6(x)
        x = self.relu7(self.fc7(x))  # [n, 4096, (h-2+200)/32-6, (w-2+200)/32-6]
        x = self.drop7(x)
        x = self.score_fr(x)  # [n, n_class, (h-2+200)/32-6, (w-2+200)/32-6]
        x = self.upscore2(x)  # [n, n_class, (h-2+200)/16-10, (w-2+200)/16-10]
        score_pool4 = self.score_pool4(pool4)  # [n, n_class, (h-2+200)/16, (w-2+200)/16]
        score_pool4 = score_pool4[:, :, 5:5+x.size()[2], 5:5+x.size()[3]]  # [n, n_class, (h-2+200)/16-10, (w-2+200)/16-10]
        x = x + score_pool4  # [n, n_class, (h-2+200)/16-10, (w-2+200)/16-10]
        x = self.upscore_pool4(x)  # [n, n_class, (h-2+200)/8-18, (w-2+200)/8-18]
        score_pool3 = self.score_pool3(pool3)  # [n, n_class, (h-2+200)/8, (w-2+200)/8]
        score_pool3 = score_pool3[:, :, 9:9+x.size()[2], 9:9+x.size()[3]]  # [n, n_class, (h-2+200)/8-18, (w-2+200)/8-18]
        x = x + score_pool3  # [n, n_class, (h-2+200)/8-18, (w-2+200)/8-18]
        x = self.upscore8(x)  # [n, n_class, (h-2+200)-17*8, (w-2+200)-17*8] -> [n, n_class, h+62, w+62]
        # crop back to the input size; shape[-2:] is (h, w) for both [n, h, w] and [n, c, h, w] inputs
        x = x[:, :, 31:31+shape[-2], 31:31+shape[-1]].contiguous()  # [n, n_class, h, w]
        return x
FCN fully convolutional network detailed explanation PPT
U-Net
U-Net: Convolutional Networks for Biomedical Image Segmentation. U-Net is a segmentation network that the authors proposed for the ISBI Challenge. It can be trained on a small training set (approximately 30 images). U-Net and FCN are both very small segmentation networks: neither uses dilated convolutions or a CRF post-processing step, and both have simple structures.
U-Net is shaped like a large letter U: it first downsamples with Conv+Pooling; then it upsamples with Deconv (transposed convolution), crops the corresponding low-level feature map, and fuses the two; then it upsamples again. This repeats until a 388x388x2 output feature map is obtained, and softmax finally yields the output segmentation map. Overall, the idea is very similar to FCN.
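The crop-then-fuse step can be sketched as follows. The helper `center_crop` and the tensor names are hypothetical; the 64x64 and 56x56 sizes follow the deepest skip connection in the U-Net paper's figure, where unpadded convolutions make the encoder map larger than the decoder map.

```python
import torch

def center_crop(feature, target_h, target_w):
    """Crop the central target_h x target_w window from a [n, c, h, w] tensor."""
    _, _, h, w = feature.shape
    top = (h - target_h) // 2
    left = (w - target_w) // 2
    return feature[:, :, top:top + target_h, left:left + target_w]

encoder_feat = torch.randn(1, 512, 64, 64)   # low-level feature from the contracting path
decoder_feat = torch.randn(1, 512, 56, 56)   # upsampled feature from the expanding path

cropped = center_crop(encoder_feat, decoder_feat.shape[2], decoder_feat.shape[3])
fused = torch.cat([cropped, decoder_feat], dim=1)   # channel-wise concatenation

print(cropped.shape, fused.shape)
```

The implementation below takes a different route: it uses padded convolutions and pads the upsampled map where needed, so the skip features do not have to be cropped.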
U-Net network structure
skip-connect mechanism
U-Net output layer
U-Net code implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchinfo import summary
from torchstat import stat
#%%
class DoubleConv(nn.Module):
    '''(convolution => [BN] => ReLU) * 2'''
    def __init__(self, in_channels, out_channels, mid_channels=None):
        super(DoubleConv, self).__init__()
        if not mid_channels:
            mid_channels = out_channels
        self.double_conv = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        return self.double_conv(x)
#%%
class DownSample(nn.Module):
    """Double conv, then downscaling with maxpool"""
    def __init__(self, in_channels, out_channels):
        super(DownSample, self).__init__()
        self.doubleConv = DoubleConv(in_channels, out_channels)
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        res = self.doubleConv(x)   # kept for the skip connection
        out = self.maxpool(res)    # passed on to the next, coarser stage
        return res, out
#%%
class UpSample(nn.Module):
    """Upscaling then double conv"""
    def __init__(self, in_channels, out_channels, bilinear=False):
        super().__init__()
        # if bilinear, use the normal convolutions to reduce the number of channels
        if bilinear:
            self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
            self.conv = DoubleConv(in_channels, out_channels, in_channels//2)
        else:
            self.up = nn.ConvTranspose2d(in_channels, in_channels//2, kernel_size=2, stride=2)
            self.conv = DoubleConv(in_channels, out_channels)

    def forward(self, x1, x2):
        x1 = self.up(x1)
        # tensors are [n, c, h, w]
        diffY = x2.size()[2] - x1.size()[2]
        diffX = x2.size()[3] - x1.size()[3]
        # pad x1 so that its spatial size matches the skip feature x2
        x1 = F.pad(x1, [diffX//2, diffX-diffX//2,
                        diffY//2, diffY-diffY//2])
        x = torch.cat([x2, x1], dim=1)
        return self.conv(x)
#%%
class UNet(nn.Module):
    def __init__(self, in_channels=1, out_channels=[64, 128, 256, 512, 1024], n_classes=5, bilinear=False):
        super(UNet, self).__init__()
        self.down1 = DownSample(in_channels=in_channels, out_channels=out_channels[0])
        self.down2 = DownSample(in_channels=out_channels[0], out_channels=out_channels[1])
        self.down3 = DownSample(in_channels=out_channels[1], out_channels=out_channels[2])
        self.down4 = DownSample(in_channels=out_channels[2], out_channels=out_channels[3])
        factor = 2 if bilinear else 1
        self.center = DoubleConv(in_channels=out_channels[3], out_channels=out_channels[4]//factor)
        self.up1 = UpSample(in_channels=out_channels[4], out_channels=out_channels[3]//factor, bilinear=bilinear)
        self.up2 = UpSample(in_channels=out_channels[3], out_channels=out_channels[2]//factor, bilinear=bilinear)
        self.up3 = UpSample(in_channels=out_channels[2], out_channels=out_channels[1]//factor, bilinear=bilinear)
        # pass `bilinear` here too, otherwise the channel counts do not match when bilinear=True
        self.up4 = UpSample(in_channels=out_channels[1], out_channels=out_channels[0], bilinear=bilinear)
        self.outConv = nn.Conv2d(in_channels=out_channels[0], out_channels=n_classes, kernel_size=1)

    def forward(self, x):
        if x.dim() != 4:  # [n, h, w] input: add the channel dimension
            x = torch.unsqueeze(x, 1)
        res1, x = self.down1(x)
        res2, x = self.down2(x)
        res3, x = self.down3(x)
        res4, x = self.down4(x)
        x = self.center(x)
        x = self.up1(x, res4)
        x = self.up2(x, res3)
        x = self.up3(x, res2)
        x = self.up4(x, res1)
        x = self.outConv(x)
        return x
Pyramid Scene Parsing Network (PSP) segmentation network
Pyramid Scene Parsing Network (PSPNet): the proposed pyramid pooling module aggregates contextual information from different regions, improving the network's ability to capture global information.
Three segmentation problems
CNN-based segmentation models have achieved great results, but they struggle with scene parsing. Scene parsing has two characteristics: there are many object categories, and multiple objects overlap. Together these make segmentation harder and the results less satisfactory. Three problems arise:
Mismatched Relationship
Matching contextual relationships is important for understanding complex scenes, because objects tend to appear in predictable places. For example, as shown in the first row of the figure above, FCN mistakenly classifies a "boat" as a "car", although cars rarely appear on a river. This is because FCN lacks the ability to reason from context.
Confusion Categories
Objects with similar attributes confuse the segmentation network. As shown in the second row of the figure above, FCN mixes up two similar categories, building and skyscraper. Many labels are related, and exploiting the relationship between labels can make up for this shortcoming.
Inconspicuous Classes
Small objects are hard to find in the segmentation task, while large objects can exceed the network's receptive field, which leads to discontinuous segmentation. As shown in the third row of the figure above, the pillow has the same color and material as the bed and lies on it, so FCN fails to segment the pillow. To improve performance on very small or very large objects, special attention should be paid to the sub-regions that contain these inconspicuous categories.
Main contributions of PSPNet
Many problems of segmentation networks stem from the fact that FCN cannot effectively handle the relationships between scene elements or global information. The paper proposes PSPNet, a deep network that captures the global scene context and fuses local and global information. It also proposes an optimization strategy based on a deeply supervised (auxiliary) loss, and it performs well on multiple datasets.
Problems PSP targets
Understanding the role of the receptive field (RF)
RF -> PSP
Pyramid Pooling module
In a typical CNN, the receptive field roughly indicates how much context information is used. The paper points out that many networks do not fully capture global information, which hurts their results. Common ways to address this are:
- Global Average Pooling. However, this may lose spatial relationships and cause ambiguity.
- Spatial pyramid pooling, where the features generated at different pyramid levels are flattened, concatenated, and fed into an FC layer for classification. This removes the fixed-input-size constraint of CNN image classification and reduces the information loss between different regions.
Dimension calculation of adaptive pooling (Adaptive Pool)
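Adaptive pooling takes the desired output size and derives the pooling windows from it. As a sketch, assuming the usual rule that output bin i covers input indices from floor(i * in / out) up to ceil((i + 1) * in / out), the windows can be computed by hand and compared with PyTorch's `nn.AdaptiveAvgPool1d`. The helper `adaptive_pool_windows` is a name introduced for this example.

```python
import torch
import torch.nn as nn

def adaptive_pool_windows(in_size, out_size):
    """Return the [start, end) input range covered by each output bin."""
    windows = []
    for i in range(out_size):
        start = (i * in_size) // out_size              # floor(i * in / out)
        end = -(-((i + 1) * in_size) // out_size)      # ceil((i + 1) * in / out)
        windows.append((start, end))
    return windows

windows = adaptive_pool_windows(7, 3)
print(windows)   # [(0, 3), (2, 5), (4, 7)]

# Compare the manual window averages with the built-in layer on a 1-D signal.
x = torch.arange(7, dtype=torch.float32).reshape(1, 1, 7)
manual = torch.stack([x[0, 0, s:e].mean() for s, e in windows])
builtin = nn.AdaptiveAvgPool1d(3)(x)[0, 0]
print(manual, builtin)
```

When the input size is not divisible by the output size, neighbouring windows overlap, as in the 7 -> 3 case above. In 2-D the same rule is applied to height and width independently.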
PSP network structure
- The input image passes through a pre-trained model (ResNet101) with the atrous (dilated) convolution strategy to extract a feature map whose size is 1/8 of the original input image.
- The Pyramid Pooling Module gathers global context from this feature map at several scales, and the pooled features are concatenated with the feature map from before pooling.
- Finally, a convolutional layer produces the final output.
Dilated Convolution
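A dilated (atrous) convolution spreads the kernel taps apart, which enlarges the receptive field without adding parameters or reducing resolution. The sketch below checks this on a random input; the helper name and the channel counts are assumptions for illustration.

```python
import torch
import torch.nn as nn

def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

x = torch.randn(1, 8, 32, 32)
results = {}
for d in (1, 2, 4):
    # For a 3x3 kernel, padding=d keeps the spatial size unchanged.
    conv = nn.Conv2d(8, 8, kernel_size=3, dilation=d, padding=d, bias=False)
    n_params = sum(p.numel() for p in conv.parameters())
    results[d] = (effective_kernel_size(3, d), tuple(conv(x).shape), n_params)
    print(d, results[d])
```

With dilation 1, 2, and 4, a 3x3 kernel covers 3, 5, and 9 pixels per side, while the parameter count and the output size stay the same. This is how the ResNet backbone can keep a 1/8 resolution feature map in its last stages.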
Auxiliary module for PSP network
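The auxiliary branch adds a second classifier on an intermediate feature map and trains it with its own loss, which is only used during training. A minimal sketch of combining the two losses is shown below. The function name `psp_loss` is hypothetical, and the weight 0.4 is the value I recall from the PSPNet paper, so treat it as an assumption to tune.

```python
import torch
import torch.nn.functional as F

def psp_loss(main_logits, aux_logits, target, aux_weight=0.4):
    """Main cross-entropy loss plus a down-weighted auxiliary loss."""
    main_loss = F.cross_entropy(main_logits, target)
    aux_loss = F.cross_entropy(aux_logits, target)
    return main_loss + aux_weight * aux_loss

torch.manual_seed(0)
main_logits = torch.randn(2, 5, 16, 16)        # stand-in for the final prediction
aux_logits = torch.randn(2, 5, 16, 16)         # stand-in for the auxiliary prediction
target = torch.randint(0, 5, (2, 16, 16))      # per-pixel class IDs

loss = psp_loss(main_logits, aux_logits, target)
print(loss.item())
```

At test time the auxiliary branch is dropped and only the main prediction is used.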
PSPNet code implementation
import torch
import torch.nn as nn
from torchvision import models
import torch.nn.functional as F
from torchinfo import summary
from torchstat import stat
#%%
def initialize_weights(*models):
    for model in models:
        for module in model.modules():
            if isinstance(module, nn.Conv2d) or isinstance(module, nn.Linear):
                nn.init.kaiming_normal_(module.weight)  # in-place variant; kaiming_normal is deprecated
                if module.bias is not None:
                    module.bias.data.zero_()
            elif isinstance(module, nn.BatchNorm2d):
                module.weight.data.fill_(1)
                module.bias.data.zero_()
#%%
class PyramidPoolingModule(nn.Module):
    def __init__(self, in_dim, reduction_dim, setting):
        super(PyramidPoolingModule, self).__init__()
        self.features = []
        for s in setting:
            # one branch per pyramid level: pool to s x s bins, then reduce channels
            self.features.append(nn.Sequential(
                nn.AdaptiveAvgPool2d(output_size=s),
                nn.Conv2d(in_channels=in_dim, out_channels=reduction_dim, kernel_size=1, bias=False),
                nn.BatchNorm2d(num_features=reduction_dim, momentum=.95),
                nn.ReLU(inplace=True)
            ))
        self.features = nn.ModuleList(self.features)

    def forward(self, x):
        x_size = x.size()
        out = [x]
        for f in self.features:
            # upsample each pooled branch back to the input resolution
            out.append(F.interpolate(f(x), size=x_size[2:], mode='bilinear', align_corners=False))
        out = torch.cat(out, 1)
        return out
#%%
class PSPNet(nn.Module):
    def __init__(self, in_channels=1, n_classes=5, pretrained=False, use_aux=False):
        super(PSPNet, self).__init__()
        self.use_aux = use_aux
        resnet = models.resnet101()
        if pretrained:
            # `pretrained` is the older torchvision argument; newer releases prefer `weights=`
            resnet = models.resnet101(pretrained=pretrained)
        self.layer0 = nn.Sequential(
            nn.Conv2d(in_channels=in_channels, out_channels=64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
        )  # out_channel=64
        self.layer1 = resnet.layer1  # out_channel=256
        self.layer2 = resnet.layer2  # out_channel=512
        self.layer3 = resnet.layer3  # out_channel=1024
        self.layer4 = resnet.layer4  # out_channel=2048
        # replace the strides of layer3 and layer4 with dilation so the output stays at 1/8
        for n, m in self.layer3.named_modules():
            if 'conv2' in n:
                m.dilation, m.padding, m.stride = (2, 2), (2, 2), (1, 1)
            elif 'downsample.0' in n:
                m.stride = (1, 1)
        for n, m in self.layer4.named_modules():
            if 'conv2' in n:
                m.dilation, m.padding, m.stride = (4, 4), (4, 4), (1, 1)
            elif 'downsample.0' in n:
                m.stride = (1, 1)
        self.ppm = PyramidPoolingModule(in_dim=2048, reduction_dim=512, setting=[1, 2, 3, 6])
        self.final = nn.Sequential(
            nn.Conv2d(in_channels=4096, out_channels=512, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(num_features=512, momentum=.95),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.1),
            nn.Conv2d(in_channels=512, out_channels=n_classes, kernel_size=1)
        )
        if use_aux:  # auxiliary loss
            self.aux_logits = nn.Conv2d(in_channels=1024, out_channels=n_classes, kernel_size=1)
            initialize_weights(self.aux_logits)
        initialize_weights(self.ppm, self.final)

    def forward(self, x):
        if x.dim() != 4:  # [n, h, w] input: add the channel dimension
            x = torch.unsqueeze(x, 1)  # [n, c, h, w]
        out_size = x.shape[-2:]  # (h, w) of the input, used for the final upsampling
        x = self.layer0(x)  # [n, 64, h//4, w//4]
        x = self.layer1(x)  # [n, 256, h//4, w//4]
        x = self.layer2(x)  # [n, 512, h//8, w//8]
        x = self.layer3(x)  # [n, 1024, h//8, w//8]
        if self.training and self.use_aux:
            aux = self.aux_logits(x)
        x = self.layer4(x)  # [n, 2048, h//8, w//8]
        x = self.ppm(x)  # [n, 4096, h//8, w//8]
        x = self.final(x)  # [n, n_classes, h//8, w//8]
        x = F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)
        if self.training and self.use_aux:
            return x, F.interpolate(aux, size=out_size, mode='bilinear', align_corners=False)
        return x
Image segmentation 2 (U-Net/V-Net/PSPNet)
DeepLab
DeepLab V1
Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs
DeepLab V1 network structure
DeepLab V1 code implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchinfo import summary
from torchstat import stat
#%%
class classification(nn.Module):
    def __init__(self, in_channels, out_channels, stride, n_classes):
        super(classification, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=3, stride=stride, padding=1)
        self.relu1 = nn.ReLU(inplace=True)
        self.drop1 = nn.Dropout(p=0.3)
        self.conv2 = nn.Conv2d(in_channels=out_channels, out_channels=out_channels, kernel_size=1)
        self.relu2 = nn.ReLU(inplace=True)
        self.drop2 = nn.Dropout(p=0.3)
        self.conv3 = nn.Conv2d(in_channels=out_channels, out_channels=n_classes, kernel_size=1)

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.drop1(x)
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.drop2(x)
        x = self.conv3(x)
        return x
#%%
class DeepLab_V1(nn.Module):
    def __init__(self, in_channels=1, out_channels=[64, 128, 256, 512, 512, 512, 512], n_classes=5):
        super(DeepLab_V1, self).__init__()
        self.classification0 = classification(
            in_channels, out_channels[0], stride=8, n_classes=n_classes
        )  # takes the input image and must downsample 8x, so its first conv uses stride=8
        self.vggLayer1 = nn.Sequential(
            nn.Conv2d(in_channels=in_channels, out_channels=out_channels[0], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=out_channels[0], out_channels=out_channels[0], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)
        )
        self.classification1 = classification(
            out_channels[0], out_channels[1], stride=4, n_classes=n_classes
        )  # takes the output of Layer1 (1/2), so it downsamples 4x
        self.vggLayer2 = nn.Sequential(
            nn.Conv2d(in_channels=out_channels[0], out_channels=out_channels[1], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=out_channels[1], out_channels=out_channels[1], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)
        )
        self.classification2 = classification(
            out_channels[1], out_channels[2], stride=2, n_classes=n_classes
        )  # takes the output of Layer2 (1/4), so it downsamples 2x
        self.vggLayer3 = nn.Sequential(
            nn.Conv2d(in_channels=out_channels[1], out_channels=out_channels[2], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=out_channels[2], out_channels=out_channels[2], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=out_channels[2], out_channels=out_channels[2], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)
        )
        self.classification3 = classification(
            out_channels[2], out_channels[3], stride=1, n_classes=n_classes
        )  # takes the output of Layer3, already 1/8 of the input, so no downsampling
        self.vggLayer4 = nn.Sequential(
            nn.Conv2d(in_channels=out_channels[2], out_channels=out_channels[3], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=out_channels[3], out_channels=out_channels[3], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=out_channels[3], out_channels=out_channels[3], kernel_size=3, padding='same'),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1, ceil_mode=True)
        )
        self.classification4 = classification(
            out_channels[3], out_channels[4], stride=1, n_classes=n_classes
        )  # takes the output of Layer4, already 1/8 of the input, so no downsampling
        self.vggLayer5 = nn.Sequential(
            nn.Conv2d(out_channels[3], out_channels[4], kernel_size=3, dilation=2, padding='same'),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels[4], out_channels[4], kernel_size=3, dilation=2, padding='same'),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels[4], out_channels[4], kernel_size=3, dilation=2, padding='same'),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1, ceil_mode=True)
        )
        self.fc6 = nn.Sequential(
            nn.Conv2d(out_channels[4], out_channels[5], kernel_size=3, dilation=4, padding='same'),
            nn.ReLU(inplace=True),
            nn.Dropout()
        )
        self.fc7 = nn.Sequential(
            nn.Conv2d(out_channels[5], out_channels[6], kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Dropout()
        )
        self.classification7 = classification(
            out_channels[6], out_channels[6], stride=1, n_classes=n_classes
        )  # takes the output of fc7, already 1/8 of the input, so no downsampling

    def forward(self, x):
        if x.dim() != 4:  # [n, h, w] input: add the channel dimension
            x = torch.unsqueeze(x, 1)  # [n, c, h, w]
        out_size = x.shape[-2:]  # (h, w) of the input, used for the final upsampling
        cla0 = self.classification0(x)  # [n, n_classes, h//8, w//8]
        x = self.vggLayer1(x)  # [n, 64, h//2, w//2]
        cla1 = self.classification1(x)  # [n, n_classes, h//8, w//8]
        x = self.vggLayer2(x)  # [n, 128, h//4, w//4]
        cla2 = self.classification2(x)  # [n, n_classes, h//8, w//8]
        x = self.vggLayer3(x)  # [n, 256, h//8, w//8]
        cla3 = self.classification3(x)  # [n, n_classes, h//8, w//8]
        x = self.vggLayer4(x)  # [n, 512, h//8, w//8]
        cla4 = self.classification4(x)  # [n, n_classes, h//8, w//8]
        x = self.vggLayer5(x)  # [n, 512, h//8, w//8]
        x = self.fc6(x)  # [n, 512, h//8, w//8]
        x = self.fc7(x)  # [n, 512, h//8, w//8]
        cla7 = self.classification7(x)  # [n, n_classes, h//8, w//8]
        x = cla0+cla1+cla2+cla3+cla4+cla7  # [n, n_classes, h//8, w//8]
        x = F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)  # [n, n_classes, h, w]
        return x
DeepLab V2
DeepLab V2 network structure
DeepLab V2 ASPP module
DeepLab V2 code implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchinfo import summary
from torchstat import stat
#%%
class ResBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride, padding, dilation):
        super(ResBlock, self).__init__()
        self.downsample = False
        self.mid_channels = out_channels//4
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, self.mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(num_features=self.mid_channels)
        )
        self.conv3x3 = nn.Sequential(
            nn.Conv2d(self.mid_channels, self.mid_channels,
                      kernel_size=3, stride=stride, padding=padding, dilation=dilation, bias=False),
            nn.BatchNorm2d(num_features=self.mid_channels)
        )
        self.increase = nn.Sequential(
            nn.Conv2d(self.mid_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )
        if in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(num_features=out_channels)
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        res = x
        x = self.reduce(x)
        x = self.conv3x3(x)
        x = self.increase(x)
        res = self.shortcut(res)
        x = x+res
        return x
#%%
class ResLayer(nn.Module):
    def __init__(self, in_channels, out_channels, n_layers, stride=1, padding=1, dilation=1):
        super(ResLayer, self).__init__()
        resLayer = []
        for i in range(n_layers):
            # only the first block changes the channel count and applies the stride
            resLayer.append(
                ResBlock(in_channels=(in_channels if i==0 else out_channels),
                         out_channels=out_channels,
                         stride=(stride if i==0 else 1),
                         padding=padding,
                         dilation=dilation)
            )
        self.resLayers = nn.Sequential(*resLayer)

    def forward(self, x):
        x = self.resLayers(x)
        return x
#%%
class ASPP(nn.Module):
    def __init__(self, in_channels, out_channels, dilations):
        super(ASPP, self).__init__()
        # four parallel 3x3 branches; padding equals dilation so every branch keeps the input size
        self.aspp1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      dilation=dilations[0], padding=dilations[0]),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )
        self.aspp2 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      dilation=dilations[1], padding=dilations[1]),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )
        self.aspp3 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      dilation=dilations[2], padding=dilations[2]),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )
        self.aspp4 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      dilation=dilations[3], padding=dilations[3]),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        aspp1 = self.aspp1(x)
        aspp2 = self.aspp2(x)
        aspp3 = self.aspp3(x)
        aspp4 = self.aspp4(x)
        out = aspp1+aspp2+aspp3+aspp4  # fuse the branches by element-wise sum
        return out
#%%
class DeepLab_V2(nn.Module):
    def __init__(self, in_channels=1, out_channels=[64, 256, 512, 1024, 2048], n_layers=[3, 4, 23, 3], n_classes=5):
        super(DeepLab_V2, self).__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, out_channels[0], kernel_size=7, stride=2, padding=3, dilation=1),
            nn.BatchNorm2d(num_features=out_channels[0]),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1, ceil_mode=True)
        )
        self.res101Layer1 = ResLayer(out_channels[0], out_channels[1], n_layers[0], stride=2)
        self.res101Layer2 = ResLayer(out_channels[1], out_channels[2], n_layers[1], stride=2)
        self.res101Layer3 = ResLayer(out_channels[2], out_channels[3], n_layers[2], stride=1, padding=2, dilation=2)
        self.res101Layer4 = ResLayer(out_channels[3], out_channels[4], n_layers[3], stride=1, padding=4, dilation=4)
        self.aspp = ASPP(out_channels[4], n_classes, dilations=[6, 12, 18, 24])

    def forward(self, x):
        if x.dim() != 4:  # [n, h, w] input: add the channel dimension
            x = torch.unsqueeze(x, 1)  # [n, c, h, w]
        out_size = x.shape[-2:]  # (h, w) of the input, used for the final upsampling
        x = self.stem(x)
        x = self.res101Layer1(x)
        x = self.res101Layer2(x)
        x = self.res101Layer3(x)
        x = self.res101Layer4(x)
        x = self.aspp(x)
        x = F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)
        return x
DeepLab V3
Rethinking Atrous Convolution for Semantic Image Segmentation
DeepLab V3 network structure
DeepLab V3 ASPP upgrade module
DeepLab V3 Multi-Grid
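Multi-Grid assigns a different unit rate to each convolution block inside a residual stage, and the final atrous rate is the stage's base rate multiplied by that unit rate. A minimal sketch follows; the function name is hypothetical, and the default (1, 2, 4) matches the `multi_grids` default used in the code below.

```python
def multi_grid_rates(base_rate, multi_grid=(1, 2, 4)):
    """Final dilation of each block = base rate of the stage * unit rate."""
    return [base_rate * unit for unit in multi_grid]

# A stage with base rate 2 uses dilations 2, 4, 8 in its three blocks.
print(multi_grid_rates(2))    # [2, 4, 8]
# A later copy of the stage with base rate 4 uses 4, 8, 16.
print(multi_grid_rates(4))    # [4, 8, 16]
```

In the implementation below, the same multiplication appears as `dilation*multi_grids[i]` inside `ResLayer`, with the padding scaled by the same factor so the feature map keeps its size.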
Code
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchinfo import summary
from torchstat import stat
#%%
class ResBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride, padding, dilation):
        super(ResBlock, self).__init__()
        self.downsample = False
        self.mid_channels = out_channels//4
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, self.mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(num_features=self.mid_channels)
        )
        self.conv3x3 = nn.Sequential(
            nn.Conv2d(self.mid_channels, self.mid_channels,
                      kernel_size=3, stride=stride, padding=padding, dilation=dilation, bias=False),
            nn.BatchNorm2d(num_features=self.mid_channels)
        )
        self.increase = nn.Sequential(
            nn.Conv2d(self.mid_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )
        if in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(num_features=out_channels)
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        res = x
        x = self.reduce(x)
        x = self.conv3x3(x)
        x = self.increase(x)
        res = self.shortcut(res)
        x = x+res
        return x
#%%
class ResLayer(nn.Module):
    def __init__(self, in_channels, out_channels, n_layers, stride=1, padding=1, dilation=1, multi_grids=None):
        super(ResLayer, self).__init__()
        if multi_grids is None:
            multi_grids = [1 for _ in range(n_layers)]
        else:
            assert n_layers == len(multi_grids)
        resLayer = []
        for i in range(n_layers):
            # Multi-Grid: scale padding and dilation of block i by its unit rate
            resLayer.append(
                ResBlock(in_channels=(in_channels if i==0 else out_channels),
                         out_channels=out_channels,
                         stride=(stride if i==0 else 1),
                         padding=padding*multi_grids[i],
                         dilation=dilation*multi_grids[i])
            )
        self.resLayers = nn.Sequential(*resLayer)

    def forward(self, x):
        x = self.resLayers(x)
        return x
#%%
class ASPP_plus(nn.Module):
    def __init__(self, in_channels, out_channels, dilations=[6, 12, 18]):
        super(ASPP_plus, self).__init__()
        # 1x1 convolution branch
        self.aspp0 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )
        # three 3x3 atrous branches; padding equals dilation so the size is kept
        self.aspp1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      dilation=dilations[0], padding=dilations[0], bias=False),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )
        self.aspp2 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      dilation=dilations[1], padding=dilations[1], bias=False),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )
        self.aspp3 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      dilation=dilations[2], padding=dilations[2], bias=False),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )
        # image-level branch: global average pooling followed by a 1x1 convolution
        self.aspp4 = nn.Sequential(
            nn.AdaptiveAvgPool2d(output_size=1),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        x_size = x.size()
        aspp0 = self.aspp0(x)
        aspp1 = self.aspp1(x)
        aspp2 = self.aspp2(x)
        aspp3 = self.aspp3(x)
        aspp4 = self.aspp4(x)
        # bring the 1x1 image-level feature back to the feature-map size
        aspp4 = F.interpolate(aspp4, size=x_size[2:], mode='bilinear', align_corners=False)
        out = aspp0+aspp1+aspp2+aspp3+aspp4  # fuse the branches by element-wise sum
        return out
#%%
class DeepLab_V3(nn.Module):
    def __init__(self, in_channels=1, out_channels=[64, 256, 512, 1024, 2048],
                 n_layers=[3, 4, 6, 3], multi_grids=[1, 2, 4], n_classes=5):
        super(DeepLab_V3, self).__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, out_channels[0], kernel_size=7, stride=2, padding=3, dilation=1),
            nn.BatchNorm2d(num_features=out_channels[0]),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1, ceil_mode=True)
        )
        self.res50Layer1 = ResLayer(out_channels[0], out_channels[1], n_layers[0], stride=2)
        self.res50Layer2 = ResLayer(out_channels[1], out_channels[2], n_layers[1], stride=2)
        self.res50Layer3 = ResLayer(out_channels[2], out_channels[3], n_layers[2], stride=1, padding=2, dilation=2)
        self.res50Layer4 = ResLayer(out_channels[3], out_channels[4], n_layers[3],
                                    stride=1, padding=2, dilation=2, multi_grids=multi_grids)
        # cascaded copies of the last stage with growing base dilation
        self.res50Layer4_copy1 = ResLayer(out_channels[4], out_channels[4], n_layers[3],
                                          stride=1, padding=4, dilation=4, multi_grids=multi_grids)
        self.res50Layer4_copy2 = ResLayer(out_channels[4], out_channels[4], n_layers[3],
                                          stride=1, padding=8, dilation=8, multi_grids=multi_grids)
        self.res50Layer4_copy3 = ResLayer(out_channels[4], out_channels[4], n_layers[3],
                                          stride=1, padding=16, dilation=16, multi_grids=multi_grids)
        # ASPP_plus has three atrous branches, so only three rates are passed
        self.aspp = ASPP_plus(out_channels[4], n_classes, dilations=[6, 12, 18])

    def forward(self, x):
        if x.dim() != 4:  # [n, h, w] input: add the channel dimension
            x = torch.unsqueeze(x, 1)  # [n, c, h, w]
        out_size = x.shape[-2:]  # (h, w) of the input, used for the final upsampling
        x = self.stem(x)
        x = self.res50Layer1(x)
        x = self.res50Layer2(x)
        x = self.res50Layer3(x)
        x = self.res50Layer4(x)
        x = self.res50Layer4_copy1(x)
        x = self.res50Layer4_copy2(x)
        x = self.res50Layer4_copy3(x)
        x = self.aspp(x)
        x = F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)
        return x
DeepLab V3+
Encoder-decoder with atrous separable convolution for semantic image segmentation