Efficient Image Segmentation with PyTorch: Part 3

1. Description

        In this 4-part series, we'll walk through image segmentation from the ground up using deep learning techniques in PyTorch. This section will focus on how to optimize our CNN baseline model using depthwise separable convolutions to reduce the number of trainable parameters and make the model deployable on mobile and other edge devices.

Figure 1: Results of running image segmentation using a CNN with depthwise separable convolutions instead of regular convolutions. From top to bottom, the input image, the ground truth segmentation mask, and the predicted segmentation mask. 

2. Article Outline

        In this article, we will enhance the convolutional neural network (CNN) we built earlier to reduce the number of learnable parameters in the network. The task of identifying pet pixels (pixels belonging to cats, dogs, hamsters, etc.) in the input image remains the same. Our network of choice will still be SegNet, and the only change we will make is to replace our convolutional layers with depthwise separable convolutions (DSC). Before we do, we'll delve into the theory and practice of depthwise separable convolutions and appreciate the ideas behind the technique.

        In this article, we will refer to the code and results in this notebook for model training and the code and results in this notebook for getting started with DSC. If you want to reproduce the results, you'll need a GPU to ensure that the first notebook runs in a reasonable amount of time. The second notebook can run on a regular CPU.

3. This series of articles

        This series is aimed at readers of all deep learning experience levels. If you want to learn about deep learning and visual AI in practice with some solid theoretical and hands-on experience, you're in the right place! This will be a 4-part series with the following articles:

  1. Concepts and ideas
  2. CNN-based models
  3. Depthwise separable convolutions (this article)
  4. Vision Transformer-based models

4. Introduction

        Let's start the discussion by taking a closer look at convolutions from the perspective of model size and computational cost. The number of trainable parameters is a good indicator of model size, and the number of tensor operations reflects model complexity or computational cost. Suppose we have a convolutional layer with n filters of size dk x dk. Suppose further that the layer processes an input of shape m x h x w, where m is the number of input channels and h and w are the height and width dimensions, respectively. In this case, the convolutional layer will produce an output of shape n x h x w, as shown in Figure 2. We assume that the convolution uses stride=1. Let's go ahead and evaluate this setup in terms of trainable parameters and computational cost.

Figure 2: A conventional convolution filter applied to an input to produce an output. Assume stride=1 and padding=dk-2. Source: Efficient Deep Learning Book
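
        Before counting parameters, we can sanity-check the shape arithmetic with a quick snippet (a minimal sketch with our own variable names; note that padding=dk-2 preserves the h x w spatial size for dk=3):

import torch
from torch import nn

dk, m, n, h, w = 3, 16, 32, 128, 128
conv = nn.Conv2d(in_channels=m, out_channels=n, kernel_size=dk,
                 stride=1, padding=dk-2, bias=False)
out = conv(torch.randn(1, m, h, w))
# The m x h x w input becomes an n x h x w output (plus a batch dimension).
print(out.shape)  # torch.Size([1, 32, 128, 128])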

        Evaluation of trainable parameters: We have n filters, and each filter has m x dk x dk learnable parameters. This results in a total of n x m x dk x dk learnable parameters. We ignore the bias terms to simplify this discussion. Let's look at the following PyTorch code to verify our understanding.

import torch
from torch import nn
# Count the learnable parameters in a module.
def num_parameters(m):
    return sum([p.numel() for p in m.parameters()])
dk, m, n = 3, 16, 32
print(f"Expected number of parameters: {m * dk * dk * n}")
conv1 = nn.Conv2d(in_channels=m, out_channels=n, kernel_size=dk, stride=1, padding=dk-2, bias=False)
print(f"Actual number of parameters: {num_parameters(conv1)}")

        This will print the following.

Expected number of parameters: 4608
Actual number of parameters: 4608

        Now, let's evaluate the computational cost of convolution.

        Computational cost evaluation: When run with stride=1 and padding=dk-2 on an input of size h x w, a single convolution filter of shape m x dk x dk is applied h x w times, once for each dk x dk patch of the image. This results in a cost of m x dk x dk x h x w per filter, or per output channel. Since we wish to compute n output channels, the total cost will be m x dk x dk x h x w x n. Let's go ahead and verify this using the torchinfo PyTorch package.

from torchinfo import summary
h, w = 128, 128
print(f"Expected total multiplies: {m * dk * dk * h * w * n}")
summary(conv1, input_size=(1, m, h, w))

This will print the following.

Expected total multiplies: 75497472


==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
Conv2d                                   [1, 32, 128, 128]         4,608
==========================================================================================
Total params: 4,608
Trainable params: 4,608
Non-trainable params: 0
Total mult-adds (M): 75.50
==========================================================================================
Input size (MB): 1.05
Forward/backward pass size (MB): 4.19
Params size (MB): 0.02
Estimated Total Size (MB): 5.26
==========================================================================================

        If we ignore the implementation details of convolutional layers for a moment, we realize that, at a high level, a convolutional layer simply transforms an m x h x w input into an n x h x w output. The transformation is achieved through trainable filters that gradually learn features as they see inputs. The next question is: can we achieve this transformation using fewer learnable parameters while ensuring minimal compromise in the layer's ability to learn? Depthwise separable convolutions were proposed to answer this exact question. Let's take a closer look at them and see how they stack up on our evaluation metrics.

5. Depthwise separable convolutions

        The concept of depthwise separable convolutions (DSC) was originally proposed by Laurent Sifre in his PhD thesis, "Rigid-Motion Scattering for Image Classification". Since then, they have been used successfully in various popular deep convolutional networks such as XceptionNet and MobileNet.

        The main difference between a regular convolution and a DSC is that a DSC is composed of 2 convolutions, as follows:

  1. Depthwise grouped convolution, where the number of input channels m equals the number of output channels, such that each output channel is affected by only a single input channel. In PyTorch, this is called a "grouped" convolution. You can read more about grouped convolutions in PyTorch here.
  2. Pointwise convolution (filter size = 1), which operates like a regular convolution: each of the n filters operates on all m input channels to produce a single output value.

Figure 3: A depthwise separable convolutional filter applied to an input to produce an output. Assume stride=1 and padding=dk-2. Source: Efficient Deep Learning Book
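
        To make the "grouped" part concrete, here is a small sketch (our own example, not from the notebooks) showing that with groups equal to the number of input channels, each filter carries weights for exactly one channel:

# Depthwise (grouped) convolution: groups=in_channels means each output
# channel is computed from exactly one input channel.
depthwise = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3,
                      groups=16, bias=False)
print(depthwise.weight.shape)  # torch.Size([16, 1, 3, 3]): one 3x3 kernel per channel

# A regular convolution's filters span all input channels instead.
regular = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, bias=False)
print(regular.weight.shape)    # torch.Size([16, 16, 3, 3])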

        Let's perform the same exercise for DSC that we did for regular convolution and calculate the number of trainable parameters and computations.

        Evaluation of trainable parameters: The "grouped" convolution has m filters, each with dk x dk learnable parameters, producing m output channels. This results in a total of m x dk x dk learnable parameters. The pointwise convolution has n filters of size m x 1 x 1, which adds up to n x m x 1 x 1 learnable parameters. Let's look at the following PyTorch code to verify our understanding.

class DepthwiseSeparableConv(nn.Sequential):
    def __init__(self, chin, chout, dk):
        super().__init__(
            # Depthwise convolution
            nn.Conv2d(chin, chin, kernel_size=dk, stride=1, padding=dk-2, bias=False, groups=chin),
            # Pointwise convolution
            nn.Conv2d(chin, chout, kernel_size=1, bias=False),
        )

conv2 = DepthwiseSeparableConv(chin=m, chout=n, dk=dk)
print(f"Expected number of parameters: {m * dk * dk + m * 1 * 1 * n}")
print(f"Actual number of parameters: {num_parameters(conv2)}")

        This will print.

Expected number of parameters: 656
Actual number of parameters: 656

        We can see that the DSC version has about 7 times fewer parameters.
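
        Where does that factor of roughly 7 come from? Dividing the two parameter counts gives (m x dk x dk + m x n) / (n x m x dk x dk) = 1/n + 1/(dk x dk). We can verify this with a quick check (a sketch reusing m, n, and dk from the code above):

ratio = 1 / n + 1 / (dk * dk)
print(f"Parameter ratio: {ratio:.4f}")          # 0.1424
print(f"Check: 656 / 4608 = {656 / 4608:.4f}")  # 0.1424, i.e. about 7x fewer

        As n grows, the ratio approaches 1/(dk x dk), which is about 11% for 3 x 3 kernels. Next, let's focus on the computational cost of the DSC layer.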

        Computational cost evaluation: Suppose our input has shape m x h x w. In the grouped-convolution part of the DSC, we have m filters, each of size dk x dk. A filter is applied to its corresponding input channel, resulting in a cost of m x dk x dk x h x w for this part. For the pointwise convolution, we apply n filters of size m x 1 x 1 to produce n output channels, resulting in a cost of n x m x 1 x 1 x h x w for this part. We add the costs of the grouped and pointwise operations to compute the total cost. Let's go ahead and verify this using the torchinfo PyTorch package.

print(f"Expected total multiplies: {m * dk * dk * h * w + m * 1 * 1 * h * w * n}")
s2 = summary(conv2, input_size=(1, m, h, w))
print(f"Actual multiplies: {s2.total_mult_adds}")
print(s2)

This will print.

Expected total multiplies: 10747904
Actual multiplies: 10747904
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
DepthwiseSeparableConv                   [1, 32, 128, 128]         --
├─Conv2d: 1-1                            [1, 16, 128, 128]         144
├─Conv2d: 1-2                            [1, 32, 128, 128]         512
==========================================================================================
Total params: 656
Trainable params: 656
Non-trainable params: 0
Total mult-adds (M): 10.75
==========================================================================================
Input size (MB): 1.05
Forward/backward pass size (MB): 6.29
Params size (MB): 0.00
Estimated Total Size (MB): 7.34
========================================================================================== 

Let's compare the size and cost of the two convolutions with a few examples to gain some intuition.

6. Size and cost comparison of conventional and depthwise separable convolutions

        To compare the size and cost of regular convolutions and depthwise separable convolutions, we will assume a network with an input size of 128 x 128, a kernel size of 3 x 3, and a topology that gradually halves the spatial dimensions while doubling the number of channels. We assume a single 2D conv layer at each step, although in practice there may be more.

Figure 4: Comparing the number of trainable parameters (size) and mult-adds (cost) for regular and depthwise separable convolutions. We also show the ratio of the size and cost of the two convolutions. Source: Author.
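
The per-stage numbers behind Figure 4 can be reproduced with a little bookkeeping. Below is a sketch of that computation (the helper names are ours; it applies the parameter formulas derived above to each stage):

def conv_params(m, n, dk):
    # Regular convolution: n filters of shape m x dk x dk.
    return n * m * dk * dk

def dsc_params(m, n, dk):
    # Depthwise (m x dk x dk) plus pointwise (n x m x 1 x 1) parameters.
    return m * dk * dk + m * n

dk, m, h, w = 3, 16, 128, 128
while h >= 8:
    regular, dsc = conv_params(m, 2 * m, dk), dsc_params(m, 2 * m, dk)
    print(f"{m:4d} -> {2 * m:4d} channels @ {h:3d}x{w:3d}: "
          f"regular={regular:8d}, dsc={dsc:7d}, ratio={dsc / regular:.3f}")
    m, h, w = 2 * m, h // 2, w // 2

Since the per-layer mult-adds are just the parameter counts multiplied by h x w, the cost ratio at each stage matches the size ratio.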

You can see that, on average, the size and computational cost of the DSC is about 11% to 12% of that of a regular convolution in the above configuration.

Figure 5: Relative size and cost of a conventional convolution vs. a DSC. Source: Author.

        Now that we have a good understanding of the two types of convolutions and their relative costs, you must be wondering whether there are any downsides to using DSC. Everything we've seen so far seems to point to them being better in every way! Well, we haven't considered one important aspect: their impact on the accuracy of our model. Let's dig into it with the following experiment.

7. SegNet using depthwise separable convolution

        This notebook contains all the code for this section.

        We will adapt the SegNet model from the previous post and replace all regular convolutional layers with DSC layers. After doing this, the number of parameters in the notebook dropped from 15.27M to 1.75M, a reduction of 88.5%! This is consistent with our earlier estimate that a DSC-based network would have about 11% to 12% as many trainable parameters.
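
        The notebook defines the DSC-based model directly, but one way to picture the change is a helper that swaps the layers in an existing model. The sketch below is our own (it assumes the DepthwiseSeparableConv class from earlier and the 3 x 3, stride-1 convolutions that SegNet uses; layers with other strides or paddings would need their arguments carried over):

def replace_convs_with_dsc(module):
    # Recursively replace every k x k Conv2d (k > 1) with a DSC equivalent.
    for name, child in module.named_children():
        if isinstance(child, nn.Conv2d) and child.kernel_size[0] > 1:
            setattr(module, name, DepthwiseSeparableConv(
                chin=child.in_channels,
                chout=child.out_channels,
                dk=child.kernel_size[0],
            ))
        else:
            replace_convs_with_dsc(child)
    return module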

A configuration similar to the one before was used during model training and validation. The configuration is specified as follows.

  1. Random horizontal flipping and color jittering data augmentation is applied to the training set to prevent overfitting
  2. Images are resized to 128 x 128 pixels using a resize operation that does not preserve the aspect ratio
  3. No input normalization is applied to the images; instead, a batch normalization layer is used as the first layer of the model
  4. The model is trained for 20 epochs using the Adam optimizer with an LR of 0.001 and no LR scheduler (see the sketch after this list)
  5. The cross-entropy loss function is used to classify pixels as belonging to a pet, the background, or a pet border
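
        As a rough outline, the training setup above might translate to something like the following (a minimal skeleton with a stand-in model and dummy data; the real SegNet definition and training loop live in the notebook):

# Stand-in for SegNet: batch norm as the first layer (item 3 above).
model = nn.Sequential(nn.BatchNorm2d(3), nn.Conv2d(3, 3, kernel_size=3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam, no LR scheduler
criterion = nn.CrossEntropyLoss()  # pixels: pet, background, or pet border

# Dummy batch standing in for the 128 x 128 training images and masks.
images = torch.randn(4, 3, 128, 128)
masks = torch.randint(0, 3, (4, 128, 128))

for epoch in range(20):  # 20 training epochs
    optimizer.zero_grad()
    loss = criterion(model(images), masks)
    loss.backward()
    optimizer.step()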

The model achieved a validation accuracy of 86.96% after 20 training epochs. This is lower than the 88.28% accuracy achieved by the model using regular convolutions over the same number of training epochs. We determined experimentally that training for more epochs improves the accuracy of both models, so 20 epochs is definitely not the end of the training cycle. For the purposes of this article, we stop at 20 epochs for demonstration.

We created a gif showing how the model learns to predict segmentation masks for 21 images in the validation set.

Figure 6: A gif showing how the SegNet model with DSC learns to predict segmentation masks for 21 images in the validation set. Source: Author

Now that we understand how this model progresses through the training cycle, let's compare its training progression with that of the model using regular convolutions.

8. Accuracy comparison

        We found it useful to look at the training progressions of the models using regular convolutions and DSC side by side. The main difference we noticed is during the early stages (epochs) of training, after which both models settle into roughly the same flow of predictions. In fact, after training both models for 100 epochs, we noticed that the accuracy of the model using DSC was only about 1% lower than that of the model with regular convolutions. This is consistent with our observations over 20 training epochs.

Figure 7: A gif showing the progress of the segmentation masks predicted by the SegNet model using regular convolutions and by the one using DSC. Source: Author.

        You'll notice that both models get predictions roughly right after only 6 training epochs; that is, one can visually see that the models are predicting something useful. After that, most of the hard work of training is making sure the boundaries of the predicted mask are as tight as possible and hug the actual pet in the image as closely as possible. This means that while one can expect a smaller absolute increase in accuracy in later training epochs, that increase has a much larger impact on prediction quality. We noticed that at higher absolute accuracy values, a small increase in accuracy (from 89% to 90%, say) leads to a significant qualitative improvement in predictions.

9. Comparison with the UNet model

        We ran an experiment, changing a number of hyperparameters with a focus on improving overall accuracy, to see how close this setup is to optimal. Below is the configuration for this experiment.

  1. Image size: 128 x 128 — same as experiments so far
  2. Training epochs: 100 — the current experiment trained for 20 epochs
  3. Augmentations: more augmentations such as image rotation, channel dropout, and random block removal. We used Albumentations instead of Torchvision transforms; Albumentations automatically transforms the segmentation mask for us (see the sketch after this list)
  4. LR scheduler: a StepLR scheduler that decays the learning rate by a factor of 0.8 every 25 training epochs
  5. Loss function: we tried 4 different loss functions: cross-entropy, focal, dice, and weighted cross-entropy. Dice performed the worst, while the rest were almost comparable to each other. In fact, the difference in best accuracy between the rest after 100 epochs was at the 4th decimal place (treating accuracy as a number between 0.0 and 1.0)
  6. Convolution Type: Regular
  7. Model type: UNet — the experiments so far used the SegNet model
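
        For item 3, the augmentation pipeline might look something like the sketch below (the specific transforms and parameters are illustrative, not the notebook's exact settings):

import numpy as np
import albumentations as A

transform = A.Compose([
    A.Rotate(limit=30),   # image rotation
    A.ChannelDropout(),   # drop a random channel
    A.CoarseDropout(),    # random block removal
])

# Dummy image and 3-class mask standing in for a dataset sample.
image = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
mask = np.random.randint(0, 3, (128, 128), dtype=np.uint8)

# Albumentations applies the spatial transforms to the mask as well.
augmented = transform(image=image, mask=mask)
aug_image, aug_mask = augmented["image"], augmented["mask"]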

        With the above settings, we achieved a best validation accuracy of 91.3%. We noticed that the image size significantly affects the best validation accuracy. For example, when we changed the image size to 256 x 256, the best validation accuracy rose to 93.0%. However, training took much longer and used more memory, which meant we had to reduce the batch size.

Figure 8: Results of training a UNet model for 100 training epochs with the above hyperparameters. Source: Author.

You can see that the predictions are much smoother and crisper than anything we have seen so far.

10. Conclusion

        In Part 3 of this series, we learned about depthwise separable convolutions (DSC) as a technique to reduce model size and training/inference cost without significantly reducing validation accuracy. We looked at the size/cost tradeoff between conventional convolutions and DSC for certain setups.

        We showed how to adapt a SegNet model to use DSC in PyTorch. This technique can be applied to any deep CNN. In fact, we can selectively replace some convolutional layers with DSC; that is, we don't necessarily need to replace all of them. Choosing which layers to replace depends on the balance you want to strike between model size/runtime cost and prediction accuracy, which in turn depends on your specific use case and deployment setup.

        Although this article trains the model for only 20 epochs, we explained that this is insufficient for production workloads and offered a glimpse of what to expect if the model is trained for more epochs. Additionally, we introduced some of the hyperparameters that can be tuned during model training. While this list is not comprehensive, it should give you an idea of the complexities and decisions involved in training an image segmentation model for production workloads.
