PyTorch ~ Semantic Segmentation Framework

Here are three models to share: FCN, U-Net, and DeepLab.

The download link for the VOC dataset used here is given later in the article. The pre-trained models have been uploaded to GitHub, and the environment I use is Colab Pro. You can download the models and run predictions directly.

Code link:  https://github.com/lixiang007666/segmentation-learning-experiment-pytorch

Instructions:

  1. Download the VOC dataset and put the two folders JPEGImages and SegmentationClass into the data folder (see the sketch after this list).

  2. Switch the terminal to the project directory and run python train.py -h to view the training options.

 

Select the model and GPU number for training, e.g. run python train.py -m Unet -g 0.

  3. Prediction requires manually changing the model specified in predict.py.
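
For reference, a minimal sketch of step 1, assuming the VOC archive has already been extracted to VOCdevkit/ (the standard layout of the official tar; adjust paths if yours differ):

import shutil

# copy the two folders the training script expects into the data folder
for folder in ['JPEGImages', 'SegmentationClass']:
    shutil.copytree(f'VOCdevkit/VOC2012/{folder}', f'data/{folder}')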

If you already know FCN well, you can skip the d2l (Dive into Deep Learning) explanation and jump to the last section.

2 Datasets

The VOC dataset is generally used for object detection; the semantic segmentation task was added in the 2012 version.

The basic dataset includes a training set of 1,464 images, a validation set of 1,449 images, and a test set of 1,456 images, with 21 categories in total.

In the PASCAL VOC segmentation task there are 20 object categories, and everything else is treated as the background class. In the example label, red represents the aircraft class, black is the background, and the aircraft outline is drawn with a beige (nearly white) line that marks the ambiguous boundary region of the segmentation.

The segmentation labels are all PNG images, which are actually single-channel color-index images. Besides the single-channel index map with the same size as the image, each file also stores a list of 256 color values (the palette); each index value corresponds to an RGB color in the palette, so a single-channel index map plus a palette can represent a color image.
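
To see this concretely, here is a small sketch that opens one label with Pillow (the file name is just an example; any label from SegmentationClass works the same way):

from PIL import Image
import numpy as np

label = Image.open('data/SegmentationClass/2007_000032.png')
print(label.mode)             # 'P': single-channel palette (color-index) image
indices = np.array(label)     # H x W array of class indices (0 = background, 255 = boundary)
palette = label.getpalette()  # flat list of R, G, B values, three per index
print(indices.shape, np.unique(indices))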

Original image: (figure omitted)

Label: (figure omitted)

Looking through the images, you can find that a single image may contain more than two categories, and the set of categories varies from image to image.

3 Fully Convolutional Neural Networks

Semantic segmentation classifies every pixel in an image. A fully convolutional network (FCN) uses a convolutional neural network to transform image pixels into pixel categories. Unlike the convolutional neural networks introduced earlier for image classification or object detection, a fully convolutional network transforms the height and width of intermediate feature maps back to the size of the input image; this is achieved with transposed convolution layers. Thus the output class predictions correspond one-to-one with the input image at the pixel level: for any position in the spatial dimensions, the output in the channel dimension is the class prediction for the pixel at that position.

%matplotlib inline
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

3.1 Network structure

Below, we use a ResNet-18 model pre-trained on the ImageNet dataset to extract image features and denote this network instance as pretrained_net. The last few layers of the model are a global average pooling layer and a fully connected layer, which are not needed in a fully convolutional network. We then create a fully convolutional network instance net: it replicates most of the pre-trained layers of ResNet-18 but removes the final global average pooling layer and the fully connected layer closest to the output.
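
A minimal sketch of this step, following the standard d2l recipe (the exact layer slicing below is taken from that recipe):

pretrained_net = torchvision.models.resnet18(pretrained=True)
# drop the final global average pooling and fully connected layers
net = nn.Sequential(*list(pretrained_net.children())[:-2])

Then a 1×1 convolution maps ResNet-18's 512 output channels to the number of classes, and a transposed convolution upsamples the result back to the input resolution (ResNet-18 downsamples by a factor of 32, hence stride=32):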

 

num_classes = 21
net.add_module('final_conv', nn.Conv2d(512, num_classes, kernel_size=1))
net.add_module('transpose_conv', nn.ConvTranspose2d(num_classes, num_classes,
                                    kernel_size=64, padding=16, stride=32))
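
With kernel_size=64, padding=16, and stride=32, the output height and width of the transposed convolution are (in − 1) × 32 − 2 × 16 + 64 = 32 × in, exactly undoing ResNet-18's 32× downsampling. A quick shape check (a sketch, assuming net was assembled as above; the weights are still untrained at this point):

X = torch.rand(size=(1, 3, 320, 480))
print(net(X).shape)  # torch.Size([1, 21, 320, 480])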

3.2 Initialize the transposed convolution layer

Upsampling enlarges an image. Bilinear interpolation is one of the most commonly used upsampling methods, and it is also often used to initialize transposed convolution layers.

To explain bilinear interpolation, assume that given an input image, we want to compute each pixel of an upsampled output image. Bilinear-interpolation upsampling can be implemented by a transposed convolution layer whose kernel is constructed by the bilinear_kernel function below. Due to space limitations, we only give the implementation of bilinear_kernel without discussing the principle of the algorithm.

def bilinear_kernel(in_channels, out_channels, kernel_size):
    # Build an (in_channels, out_channels, k, k) weight tensor whose spatial slice
    # is a bilinear-interpolation kernel, placed on the diagonal channel pairs so
    # that each channel is upsampled independently.
    factor = (kernel_size + 1) // 2
    if kernel_size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = (torch.arange(kernel_size).reshape(-1, 1),
          torch.arange(kernel_size).reshape(1, -1))
    filt = (1 - torch.abs(og[0] - center) / factor) * \
           (1 - torch.abs(og[1] - center) / factor)
    weight = torch.zeros((in_channels, out_channels,
                          kernel_size, kernel_size))
    weight[range(in_channels), range(out_channels), :, :] = filt
    return weight

Let us experiment with bilinear-interpolation upsampling implemented by a transposed convolution layer. We construct a transposed convolution layer that doubles the height and width of the input and initialize its kernel with the bilinear_kernel function.

conv_trans = nn.ConvTranspose2d(3, 3, kernel_size=4, padding=1, stride=2,
                                bias=False)
conv_trans.weight.data.copy_(bilinear_kernel(3, 3, 4));
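
A quick shape check, with a random tensor standing in for an image: the transposed convolution doubles the height and width.

X = torch.rand(size=(1, 3, 32, 48))
Y = conv_trans(X)
print(Y.shape)  # torch.Size([1, 3, 64, 96])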

In the fully convolutional network, we initialize the transposed convolution layer with bilinear-interpolation upsampling. For the 1×1 convolution layer, we use Xavier initialization.

W = bilinear_kernel(num_classes, num_classes, 64)
net.transpose_conv.weight.data.copy_(W);

3.3 Training

The loss function and accuracy calculation are not fundamentally different from those in image classification. Because we use the channels of the transposed convolution layer to predict pixel classes, the channel dimension is specified in the loss calculation; in addition, the model computes accuracy from whether the predicted class of each pixel is correct.

def loss(inputs, targets):
    # per-pixel cross-entropy of shape (batch, H, W), averaged over H and W to give one loss per example
    return F.cross_entropy(inputs, targets, reduction='none').mean(1).mean(1)
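
As a small sanity check (shapes only, with random logits and labels for 21 classes), the loss returns one value per example in the batch:

logits = torch.rand(2, 21, 320, 480)
labels = torch.randint(0, 21, (2, 320, 480))
print(loss(logits, labels).shape)  # torch.Size([2])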

batch_size, crop_size = 32, (320, 480)
# load the Pascal VOC semantic segmentation dataset (d2l helper, as in the book)
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)
num_epochs, lr, wd, devices = 5, 0.001, 1e-3, d2l.try_all_gpus()
trainer = torch.optim.SGD(net.parameters(), lr=lr, weight_decay=wd)
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)

4 Open Source Code and Dataset

Dataset download address: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
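
A minimal sketch for unpacking the archive after downloading it (the VOCdevkit/VOC2012 layout is the standard content of this tar; adjust paths to your setup):

import tarfile

# extract VOCtrainval_11-May-2012.tar into the current directory
with tarfile.open('VOCtrainval_11-May-2012.tar') as tar:
    tar.extractall('.')  # yields VOCdevkit/VOC2012/{JPEGImages, SegmentationClass, ...}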

Input samples and output samples: (figures omitted)

Training:

!python3 train.py -m Unet -g 0

Prediction: the model code includes implementations of FCN, U-Net, and DeepLab, so you can switch models for training and prediction conveniently.

DeepLabV3 segmentation results: (figure omitted)

FCN segmentation results: (figure omitted)

U-Net segmentation results: (figure omitted)
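
As a rough sketch of what prediction looks like in general (this is not the repository's predict.py; the checkpoint path, saving format, and file names are assumptions):

import torch
from PIL import Image
import torchvision.transforms.functional as TF

model = torch.load('checkpoint.pth', map_location='cpu')  # assumes the whole model was saved
model.eval()

img = Image.open('test.jpg').convert('RGB')
x = TF.to_tensor(img).unsqueeze(0)             # 1 x 3 x H x W
with torch.no_grad():
    logits = model(x)                          # 1 x num_classes x H x W
pred = logits.argmax(dim=1).squeeze(0).byte()  # H x W map of class indices
Image.fromarray(pred.numpy()).save('prediction.png')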

5 Summary

Comparing against the ground-truth segmentation, the segmentation output by the model is almost identical to the ground truth, and it also aligns well with the original image, indicating that the model achieves good accuracy.

In addition, in terms of input size, the model can take an image of any size and output a labeled segmentation of the same size. Since segmentation is performed on images from the PASCAL VOC dataset, which only covers 20 object categories (the background is the 21st category), anything outside those 20 categories is labeled as background.

Overall, though, the model achieves high accuracy on PASCAL VOC image segmentation.


Origin: blog.csdn.net/qq_29788741/article/details/132183129