Efficient Image Segmentation with PyTorch: Part 2

1. Description

        This is the second part of a 4-part series that walks through image segmentation from scratch using deep learning techniques in PyTorch. This part focuses on how to implement a baseline convolutional neural network (CNN) model for image segmentation.

Figure 1: Results of running image segmentation with a CNN. From top to bottom: input image, ground-truth segmentation mask, predicted segmentation mask.

2. Article Outline

        In this article, we will implement a convolutional neural network (CNN)-based architecture called SegNet, which assigns each pixel in an input image to the pet it belongs to, such as a cat or a dog. Pixels that do not belong to any pet are classified as background pixels. We will build and train this model on the Oxford Pets dataset using PyTorch to understand what it takes to deliver a successful image segmentation solution. The model-building process will be hands-on, and we will discuss in detail the role of each layer in the model. This article contains numerous references to research papers and articles for further study.

        Throughout this article, we will refer to the code and results from this notebook. If you want to reproduce the results, you'll need a GPU to ensure that the notebook finishes running in a reasonable amount of time.

3. This series of articles

        This series is aimed at readers of all deep learning experience levels. If you want to learn about deep learning and visual AI in practice with some solid theoretical and hands-on experience, you're in the right place! This will be a 4-part series with the following articles:

  1. Concepts and Ideas
  2. CNN-Based Models (this article)
  3. Depthwise Separable Convolution
  4. Vision Transformer-Based Models

Let's start the discussion with a short introduction to convolutional layers and some other layers that are commonly used together as convolutional blocks.

4. Conv-BatchNorm-ReLU and Max Pooling/Unpooling

        Convolution, batch normalization, and ReLU blocks are the holy trinity of vision AI; you'll see them used again and again in CNN-based vision models. Each of these terms refers to a distinct layer implemented in PyTorch. The convolutional layer performs a cross-correlation between learned filters and the input tensor. Batch normalization normalizes the elements in the batch to zero mean and unit variance, and ReLU is a non-linear activation function that keeps only the positive values of its input.
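As a concrete illustration, here is a minimal PyTorch sketch of such a Conv-BN-ReLU block; the channel counts and input size below are arbitrary choices for illustration, not values taken from the accompanying notebook:

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_channels: int, out_channels: int) -> nn.Sequential:
    return nn.Sequential(
        # 3x3 convolution; padding=1 keeps the spatial size unchanged.
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
        # Normalizes each output channel to zero mean and unit variance over the batch.
        nn.BatchNorm2d(out_channels),
        # Keeps only the positive values.
        nn.ReLU(inplace=True),
    )

block = conv_bn_relu(3, 64)
x = torch.randn(8, 3, 128, 128)   # (batch, channels, height, width)
print(block(x).shape)             # torch.Size([8, 64, 128, 128])
```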

        A typical CNN progressively reduces the spatial dimensions of the input as layers are stacked (the next section discusses the motivation behind shrinking spatial dimensions). This reduction is achieved by pooling adjacent values with simple functions such as max or mean; we will discuss this further in the max-pooling section. In a classification problem, a stack of Conv-BN-ReLU-Pool blocks is followed by a classification head that predicts the probability that the input belongs to each of the target classes. Some problems, such as semantic segmentation, require per-pixel predictions. In that case, a stack of upsampling blocks is appended after the downsampling blocks to project their output back to the desired spatial dimensions. Upsampling blocks are simply Conv-BN-ReLU-Unpool blocks, i.e. the pooling layer is replaced with an unpooling layer. We will discuss unpooling in detail in the max-pooling section.

Now, let's further elaborate on the motivation behind convolutional layers.

5. Convolution

        Convolutions are the fundamental building blocks of vision AI models. They are used heavily in computer vision and have historically been used to implement visual transformations such as:

  1. Edge detection
  2. Image blurring and sharpening
  3. Embossing
  4. Intensification

The convolution operation is the element-wise multiplication and aggregation of two matrices. An example of the convolution operation is shown in Figure 2.

Figure 2: Illustration of the convolution operation. Source: Author

In the context of deep learning, a convolution is performed between an n-dimensional parameter matrix, called a filter or kernel, and a larger input. This is achieved by sliding the filter over the input and applying the convolution to the corresponding region. The extent of the slide is configured using the stride parameter: a stride of one means the kernel slides over by one element to process the next region. In contrast to traditional methods that use fixed, hand-crafted filters, deep learning uses backpropagation to learn the filters from data.
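To make the sliding-window operation concrete, here is a small sketch using torch.nn.functional.conv2d with a hand-picked 3x3 filter; the filter values are arbitrary and chosen only for illustration:

```python
import torch
import torch.nn.functional as F

# A 5x5 single-channel input and a 3x3 filter (both chosen arbitrarily for illustration).
x = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)   # (N, C, H, W)
w = torch.tensor([[[[0., 1., 0.],
                    [1., -4., 1.],
                    [0., 1., 0.]]]])                             # (out_C, in_C, kH, kW)

# Slide the filter over the input; each output value is the element-wise
# product of the filter and the current 3x3 patch, summed up.
y = F.conv2d(x, w, stride=1)
print(y.shape)   # torch.Size([1, 1, 3, 3]) -- no padding, so the output shrinks
```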

So how does convolution help deep learning?

In deep learning, convolutional layers are used to detect visual features. A typical CNN model contains a bunch of such layers. The bottom layers in the stack detect simple features such as lines and edges. These layers detect increasingly complex features as we move up the stack. Middle layers in the stack detect combinations of lines and edges, and top layers detect complex shapes such as cars, faces, or airplanes. Figure 3 visually shows the outputs of the top and bottom layers of the trained model.

Figure 3: What a convolutional filter learns to recognize. Source: Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations

        A convolutional layer has a set of learnable filters that act on small regions of the input to produce a representative output value for each region. For example, a 3x3 filter operates on a 3x3 region and produces a value representative of that region. Repeated application of a filter across the input produces an output that becomes the input for the next layer in the stack. Intuitively, higher layers get to "see" larger regions of the input. For example, a 3x3 filter in the second convolutional layer operates on the output of the first convolutional layer, where each element already summarizes a 3x3 region of the input. Assuming a convolution with stride=1, the filter in the second layer therefore "sees" a 5x5 region of the original input. This is called the receptive field of the convolution. The repeated application of convolutional layers gradually reduces the spatial dimensions of the input image and increases the field of view of the filters, enabling them to "see" complex shapes. Figure 4 shows how a convolutional network processes a one-dimensional input: elements in the output layer represent a relatively larger chunk of the input.

Figure 4: Receptive field of a 1D convolution with kernel size = 3, applied 3 times. Assuming stride=1 and no padding, after the third successive application of the convolution kernel a single output element is able to see 7 elements of the original input. Source: Author
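If you want to check the receptive-field arithmetic yourself, a small sketch like the following (using the standard receptive-field recurrence, not code from the notebook) reproduces the 5x5 receptive field from the text and the 7-element receptive field from Figure 4:

```python
# Receptive-field growth for a stack of convolutions, using the recurrence
# r_i = r_{i-1} + (k_i - 1) * jump_{i-1}, with jump_i = jump_{i-1} * s_i.
def receptive_field(layers):
    r, jump = 1, 1
    for kernel, stride in layers:
        r += (kernel - 1) * jump
        jump *= stride
    return r

# Three 3x3 convolutions with stride=1 (the setup from Figure 4).
print(receptive_field([(3, 1)] * 3))   # 7
# Two 3x3 convolutions with stride=1 (the 5x5 example from the text).
print(receptive_field([(3, 1)] * 2))   # 5
```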

        Once the convolutional layers can detect these objects and generate representations of them, we can use these representations for image classification, image segmentation, and object detection and localization. Broadly speaking, CNNs follow these general principles:

  1. Convolutional layers either keep the number of output channels (C) constant or double it.
  2. They use stride=1 to keep the spatial dimensions unchanged, or stride=2 to halve them (see the short sketch after this list).
  3. The output of a convolutional block is usually pooled to change the spatial dimension of the image.
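The sketch below illustrates the first two principles with two hypothetical convolutional layers; the channel counts and input size are made up for the example:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 128, 128)   # hypothetical input: 64 channels, 128x128

# stride=1 with padding=1 keeps the spatial size; channels stay constant.
same = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1)
# stride=2 halves the spatial size; channels are doubled.
down = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)

print(same(x).shape)   # torch.Size([1, 64, 128, 128])
print(down(x).shape)   # torch.Size([1, 128, 64, 64])
```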

        A convolutional layer applies its kernels to each input independently, so the distribution of its outputs can vary from input to input. A batch normalization layer typically follows a convolutional layer to address this. Let's learn more about what it does in the next section.

6. Batch normalization

        A batch normalization layer normalizes the channel values in a batched input to zero mean and unit variance. This normalization is performed independently for each channel in the batch, ensuring that the channel values of the inputs have the same distribution (a short numerical sketch follows the list below). Batch normalization has the following advantages:

  1. It stabilizes the training process by preventing gradients from becoming too small.
  2. It achieves faster convergence on our task.
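Here is a small numerical sketch of what a batch normalization layer does to a deliberately shifted and scaled random batch; the tensor sizes are arbitrary and used only for illustration:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=3)        # one mean/variance pair per channel
x = torch.randn(16, 3, 32, 32) * 5 + 10    # a batch with a shifted, scaled distribution

y = bn(x)
# Each channel of the output is normalized over the whole batch:
print(y.mean(dim=(0, 2, 3)))   # close to 0 for every channel
print(y.std(dim=(0, 2, 3)))    # close to 1 for every channel
```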

        If all we have is a stack of convolutional layers, the network is essentially equivalent to a single-convolutional-layer network because of the cascading effect of linear transformations: a series of linear transformations can be replaced by a single linear transformation with the same effect. Intuitively, if we multiply a vector by a constant k₁ and then by another constant k₂, this is equivalent to a single multiplication by the constant k₁k₂. Therefore, for networks to have effective depth, they must include nonlinearities that prevent them from collapsing. We'll discuss ReLU, which is commonly used as this nonlinearity, in the next section.
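The following sketch demonstrates this collapse with two fully connected layers for brevity; the same argument carries over to convolutions, which are also linear operations:

```python
import torch
import torch.nn as nn

# Two stacked linear layers (no bias, no nonlinearity in between)...
f1 = nn.Linear(4, 4, bias=False)
f2 = nn.Linear(4, 4, bias=False)

# ...are exactly equivalent to one linear layer whose weight is the product W2 @ W1.
combined = nn.Linear(4, 4, bias=False)
with torch.no_grad():
    combined.weight.copy_(f2.weight @ f1.weight)

x = torch.randn(2, 4)
print(torch.allclose(f2(f1(x)), combined(x), atol=1e-6))   # True
```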

7. Activation function: ReLU

        ReLU is a simple non-linear activation function that clips its input values at zero, so the output is always greater than or equal to 0. It also helps mitigate the vanishing gradient problem. A ReLU layer is usually followed by a pooling layer to reduce the spatial dimensions in the downsampling sub-network, or by an unpooling layer to increase the spatial dimensions in the upsampling sub-network. Details are provided in the next section.

8. Pooling

        Pooling layers are used to reduce the spatial dimensions of the input. Pooling with stride=2 transforms an input with spatial dimensions (H, W) into an output with dimensions (H/2, W/2). Max pooling is the most commonly used pooling technique in deep CNNs. It projects the largest value in a grid (for example, a 2x2 grid) onto the output. The 2x2 pooling window is then slid to the next region according to a stride, just as in a convolution. Doing this repeatedly with stride=2 results in an output that is half the height and half the width of the input. Another commonly used pooling layer is average pooling, which computes the mean instead of the maximum.

        The opposite of a pooling layer is an unpooling layer. It takes an (H, W)-dimensional input and converts it into a (2H, 2W)-dimensional output (for stride=2). A necessary ingredient of this transformation is choosing the location within the 2x2 region of the output at which to place each input value. For this we need a max-unpooling index map that tells us the target locations in the output; this index map is produced by an earlier max pooling operation. Figure 5 shows examples of the pooling and unpooling operations.
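In PyTorch this pairing is expressed with nn.MaxPool2d(..., return_indices=True) and nn.MaxUnpool2d; the sketch below uses an arbitrary 4x4 input purely for illustration:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 1, 4, 4)
pooled, indices = pool(x)            # (1, 1, 2, 2) plus the location of each max inside its 2x2 window
restored = unpool(pooled, indices)   # (1, 1, 4, 4): max values back in place, zeros elsewhere

print(pooled.shape, restored.shape)
```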

Figure 5: Max pooling and unpooling. Source: DeepPainter: Painter Classification Using Deep Convolutional Autoencoders

We can think of max pooling as a kind of non-linear activation function. However, using it to replace nonlinearities such as ReLU has been reported to affect the network's performance. In contrast, average pooling cannot be considered a non-linear function, since its output is a linear combination of its inputs.

This covers all the basic building blocks of deep CNNs. Now, let's put it all together to create a model. The model we have chosen for this exercise is called SegNet. We will discuss it next.

9. SegNet: A CNN-based model

      SegNet is a deep CNN model based on the basic blocks discussed in this article. It has two distinct parts. The bottom part, also known as the encoder, downsamples the input to generate features representative of the input. The top, decoder, part upsamples these features to create the per-pixel classification. Each part consists of a sequence of Conv-BN-ReLU blocks; these blocks also include pooling layers in the downsampling path and unpooling layers in the upsampling path, respectively. Figure 6 shows the arrangement of the layers in more detail. SegNet uses the pooling indices from the max pooling operations in the encoder to decide where to place values during the max unpooling operations in the decoder. While each element of an activation tensor is 4 bytes (32 bits), an offset within a 2x2 square can be stored using only 2 bits, so keeping the indices is far more memory efficient than keeping the activations themselves while the model is running.
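To make the index hand-off concrete, here is a minimal sketch of one encoder stage and its matching decoder stage. It is not the notebook's exact SegNet implementation, and the channel sizes are illustrative only:

```python
import torch
import torch.nn as nn

class EncoderStage(nn.Module):
    """One SegNet-style downsampling stage: Conv-BN-ReLU followed by max pooling
    that also returns the pooling indices for the matching decoder stage."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)

    def forward(self, x):
        x = self.block(x)
        x, indices = self.pool(x)
        return x, indices

class DecoderStage(nn.Module):
    """The matching upsampling stage: max unpooling with the stored indices,
    then Conv-BN-ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, indices):
        x = self.unpool(x, indices)
        return self.block(x)

enc, dec = EncoderStage(3, 64), DecoderStage(64, 3)   # 3 output channels: pet, background, border
x = torch.randn(1, 3, 128, 128)
features, idx = enc(x)      # (1, 64, 64, 64)
out = dec(features, idx)    # (1, 3, 128, 128): per-pixel class activations
```

A full SegNet stacks several such stages per path, and its final layer is a plain convolution that outputs one score per class (without a trailing ReLU); the snippet above only shows how the pooling indices flow from encoder to decoder.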

Figure 6: SegNet model architecture for image segmentation. Source: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

        This notebook contains all the code for this section.

        This model has 15.27M trainable parameters.
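If you want to verify a number like this for your own build of the model, the usual way to count trainable parameters is a one-liner, wrapped in a helper here; `model` is assumed to be whatever SegNet instance you construct:

```python
import torch.nn as nn

def count_trainable_params(model: nn.Module) -> int:
    # Sum the element counts of all parameters that receive gradients.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g. print(f"{count_trainable_params(model) / 1e6:.2f}M trainable parameters")
```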

        The following configurations were used during model training and validation.

  1. Random horizontal flip and color jitter data augmentations are applied to the training set to prevent overfitting
  2. Images are resized to 128x128 pixels using a resize operation that does not preserve the aspect ratio
  3. No input normalization is applied to the images; instead, a batch normalization layer is used as the first layer of the model
  4. The model is trained for 20 epochs using the Adam optimizer with a learning rate of 0.001 and a StepLR scheduler that decays the learning rate by a factor of 0.7 every 7 epochs (a sketch of this setup follows the list)
  5. The cross-entropy loss function is used to classify each pixel as belonging to a pet, the background, or a pet border
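For reference, a hedged sketch of this training setup is shown below; `model`, `train_loader`, and `device` are assumed to come from the accompanying notebook, and the exact notebook code may differ:

```python
import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, device, num_epochs: int = 20):
    """A sketch of the training configuration described above, not the notebook's exact code."""
    criterion = nn.CrossEntropyLoss()              # pixels belong to pet, background, or border
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.7)

    model.to(device)
    for epoch in range(num_epochs):
        model.train()
        for images, masks in train_loader:         # masks hold per-pixel class indices
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()
        scheduler.step()
```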

The model achieved a validation accuracy of 88.28% after 20 training epochs.

We plotted a GIF showing how the model learns to predict segmentation masks for the 21 images in the validation set.

 

Figure 7: A GIF showing how the SegNet model learns to predict segmentation masks for 21 images in the validation set. Source: Author

Part 1 of this series presented definitions of all validation metrics.

If you want to see a TensorFlow implementation of a fully convolutional model for segmenting pet images, see Chapter 4: Efficient Architectures of the Efficient Deep Learning Book.

10. Observations of model learning

        Based on the evolution of the predictions made by the trained model after each epoch, we can observe the following.

  1. The model is able to learn enough that its output appears to be in the right ballpark for the pet in the image, even as early as the first training epoch
  2. The border pixels are harder to segment because we are using an unweighted loss function that treats every success (or failure) equally, so getting the border pixels wrong doesn't cost the model much in terms of loss. We encourage you to investigate this and check which strategies you could try to resolve the issue. For example, try using a focal loss and see how it performs (a minimal sketch is given after this list)
  3. Even after 20 training epochs, the model still seems to be learning. This suggests that we could improve validation accuracy by training the model for longer
  4. Some ground-truth labels are themselves hard to figure out: for example, the dog mask in the middle row, last column, has a lot of unknown pixels in areas where the dog's body is occluded by foliage. This is hard for the model to figure out, so one should always expect a drop in accuracy on such examples. However, that doesn't mean the model is performing poorly. In addition to looking at the overall validation metrics, you should always spot-check predictions to understand how your model behaves
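As a starting point for the focal-loss suggestion in point 2, here is a minimal multi-class focal loss sketch for per-pixel classification; it is an illustration, not the loss used in the notebook:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Minimal multi-class focal loss for per-pixel classification.

    logits: (N, C, H, W) raw class scores, targets: (N, H, W) class indices.
    Down-weights easy pixels so that hard ones (e.g. border pixels) contribute more.
    """
    ce = F.cross_entropy(logits, targets, reduction="none")   # per-pixel CE, shape (N, H, W)
    pt = torch.exp(-ce)                                       # probability of the true class
    return ((1.0 - pt) ** gamma * ce).mean()

# Hypothetical usage: 3 classes (pet, background, border) on 128x128 inputs.
logits = torch.randn(2, 3, 128, 128)
targets = torch.randint(0, 3, (2, 128, 128))
print(focal_loss(logits, targets))
```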

Figure 8: An example of a ground-truth segmentation mask containing a large number of unknown pixels. This is a very difficult input for any ML model. Source: Author

11. Conclusion

        In Part 2 of this series, we looked at the basic building blocks of deep CNNs for vision AI. We saw how to implement a SegNet model from scratch in PyTorch and visualized how the model, trained over successive epochs, performs on 21 validation images. This should give you an idea of how quickly a model can learn enough for its output to be in the right ballpark; in this case, the predicted segmentation masks roughly resemble the actual segmentation masks as early as the first training epoch!
