Understanding depthwise separable convolution in simple terms

If you review the past and learn the new, you can become a teacher!

1. Reference materials

A detailed and popular explanation of lightweight neural networks - MobileNets [V1, V2, V3]
Separable Convolution in Convolutional Neural Networks
Several convolutions commonly used in deep learning (Part 2): dilated convolution, separable convolution (Depth separable, space separable), grouped convolution (with Pytorch test code)

2. Related introduction

1. Standard convolution

Standard convolution applies several multi-channel convolution kernels to the multi-channel input image, so the output feature maps capture both channel features and spatial features.

As shown in the figure below, assume the input is a 64×64-pixel, 3-channel color image. After a convolutional layer containing 4 filters, 4 feature maps are output with the same spatial size as the input. The layer therefore has 4 filters, each filter contains 3 kernels, and each kernel is 3×3, so the parameter count of the convolutional layer is $N_{std} = 4 \times 3 \times 3 \times 3 = 108$.
(figure)

Expressed mathematically: assume the convolution kernel size is $D_K \times D_K$, the number of input channels is M, the number of output channels is N, and the output feature map size is $D_F \times D_F$. Then for a standard convolution:

Parameter count: $D_K \times D_K \times M \times N$

Computation (multiply-add operations): $D_K \times D_K \times M \times N \times D_F \times D_F$
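As a quick sanity check of the parameter formula (a minimal sketch added here, not part of the original post), the 64×64×3 input with 4 filters from the example above can be reproduced in PyTorch:

import torch.nn as nn

# Standard convolution from the example above: M=3 input channels, N=4 filters, 3x3 kernels
conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1, bias=False)
print(sum(p.numel() for p in conv.parameters()))  # 4*3*3*3 = 108, matching N_std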

2. Depthwise Convolution

Depthwise convolution (DWConv) differs from standard convolution in that each of its convolution kernels is single-channel and is applied to exactly one input channel, so the number of output feature maps equals the number of input channels. In other words: number of input channels = number of convolution kernels = number of output feature maps.

Assume a 3-channel color image of 64×64 pixels. Three single-channel convolution kernels perform the convolution separately and output 3 single-channel feature maps, so a 3-channel image produces 3 feature maps, as shown in the figure below. Each filter contains only one 3×3 kernel, so the parameter count of the convolution is $N_{depthwise} = 3 \times 3 \times 3 = 27$.
(figure)

As another example, for a 12x12x3 input feature map with 3 convolution kernels, the output is an 8x8x3 feature map.
(figure)
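A minimal sketch of that 12x12x3 to 8x8x3 example (added for illustration; a 5×5 kernel with no padding is assumed here, since 12 - 5 + 1 = 8):

import torch
import torch.nn as nn

# Depthwise convolution: groups equal to the number of channels gives one single-channel kernel per input channel
dw = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=5, groups=3, bias=False)
x = torch.randn(1, 3, 12, 12)
print(dw(x).shape)                              # torch.Size([1, 3, 8, 8])
print(sum(p.numel() for p in dw.parameters()))  # 5*5*1*3 = 75 parameters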

3. Pointwise Convolution

From depthwise convolution we know that the number of input channels = number of convolution kernels = number of output feature maps. This leaves the output with too few feature maps (or too few channels; one can view it as a single output feature map with only 3 channels), which may limit the effectiveness of the extracted information. This is where pointwise convolution comes in.

Pointwise convolution (PWConv) essentially uses a 1×1 convolution kernel to increase the channel dimension. 1×1 convolution kernels are used extensively in GoogLeNet, where they are mainly used for dimensionality reduction; in general, a 1×1 convolution can both increase and decrease the channel dimension of a feature map.

As shown in the figure below, the 3 single-channel feature maps from the depthwise convolution are processed by 4 convolution kernels of size 1x1x3, producing 4 output feature maps; the number of output feature maps is determined by the number of filters. The parameter count of this convolutional layer is $N_{pointwise} = 1 \times 1 \times 3 \times 4 = 12$.
(figure)

As another example, suppose the depthwise convolution produces an 8x8x3 feature map and 256 convolution kernels of size 1x1x3 are then applied; the output is 8x8x256. The parameter count of this convolutional layer is $N_{pointwise} = 1 \times 1 \times 3 \times 256 = 768$.
(figure)
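That 8x8x3 to 8x8x256 example can be checked with a short snippet (added for illustration, not part of the original post):

import torch
import torch.nn as nn

# Pointwise (1x1) convolution: mixes channels, leaves the spatial size unchanged
pw = nn.Conv2d(in_channels=3, out_channels=256, kernel_size=1, bias=False)
x = torch.randn(1, 3, 8, 8)
print(pw(x).shape)                              # torch.Size([1, 256, 8, 8])
print(sum(p.numel() for p in pw.parameters()))  # 1*1*3*256 = 768, matching N_pointwise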

4. Depthwise Separable Convolution

Depthwise separable convolution (DSC) consists of a depthwise convolution followed by a pointwise convolution. The depthwise convolution extracts spatial features, and the pointwise convolution extracts channel features. In effect, DSC groups the convolution along the channel dimension, performs an independent depthwise convolution on each channel, and then uses a 1x1 convolution (pointwise convolution) to aggregate all channels before output.

depthwise: convolution over the spatial dimensions

pointwise: convolution over the channel (depth) dimension

Depthwise separable convolution = depthwise convolution (per-channel convolution) + pointwise convolution (1x1 convolution)
(figure)

Depthwise separable convolution first performs DWConv on each channel, and then merges all channels through PWConv to output feature maps, thereby reducing the amount of calculation and improving calculation efficiency.

4.1 Parameter count

Depthwise convolution: the kernel size is $D_K \times D_K \times 1$ and the number of kernels is M, so the parameter count is $D_K \times D_K \times M$.

Pointwise convolution: the kernel size is $1 \times 1 \times M$ and the number of kernels is N, so the parameter count is $M \times N$.

Therefore, the parameter count of depthwise separable convolution is $D_K \times D_K \times M + M \times N$.

4.2 Computation count

Depthwise convolution: the kernel size is $D_K \times D_K \times 1$, the number of kernels is M, and each kernel performs $D_F \times D_F$ multiply-add operations, so the computation is $D_K \times D_K \times M \times D_F \times D_F$.

Pointwise convolution: the kernel size is $1 \times 1 \times M$, the number of kernels is N, and each kernel performs $D_F \times D_F$ multiply-add operations, so the computation is $M \times N \times D_F \times D_F$.

Therefore, the computation of depthwise separable convolution is $D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F$.

4.3 Comparison with standard convolution

4.3.1 Structural comparison

Each depthwise separable convolution block consists of a 3x3 depthwise convolution followed by BN and ReLU, then a 1x1 pointwise convolution followed by BN and ReLU.
(figure)

4.3.2 Comparison of computation and parameter counts

Parameter ratio: $\frac{\text{depthwise separable convolution}}{\text{standard convolution}} = \frac{D_K \times D_K \times M + M \times N}{D_K \times D_K \times M \times N} = \frac{1}{N} + \frac{1}{D_K^2}$

Computation ratio: $\frac{\text{depthwise separable convolution}}{\text{standard convolution}} = \frac{D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F}{D_K \times D_K \times M \times N \times D_F \times D_F} = \frac{1}{N} + \frac{1}{D_K^2}$

(figure)

In general, N is large, so the $\frac{1}{N}$ term is negligible. $D_K$ is the kernel size; for the common case $D_K = 3$, $\frac{1}{D_K^2} = \frac{1}{9}$. In other words, with an ordinary 3×3 kernel, depthwise separable convolution reduces the parameter count and computation to roughly one ninth of the standard convolution.
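The ratio can be verified numerically with a small sketch (the channel counts below are chosen only for illustration):

import torch.nn as nn

# Parameter comparison for M=32 input channels, N=64 output channels, D_K=3
M, N, K = 32, 64, 3
std = nn.Conv2d(M, N, K, padding=1, bias=False)
dsc = nn.Sequential(
    nn.Conv2d(M, M, K, padding=1, groups=M, bias=False),  # depthwise
    nn.Conv2d(M, N, 1, bias=False),                        # pointwise
)

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(std), count(dsc))   # 18432 2336
print(count(dsc) / count(std))  # ~0.127, close to 1/64 + 1/9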

4.4 Advantages of depthwise separable convolution

Compared with traditional convolutional neural networks, the significant advantages of depthwise separable convolution are:

  • Fewer parameters: the depthwise and pointwise stages together need far fewer parameters than a standard convolutional layer.
  • Faster: runs faster than standard convolution.
  • More portable: the lower computational cost makes it easier to implement and deploy on different platforms.
  • More streamlined: the model can be slimmed down enough to run accurate inference on smaller devices.

5. Code examples

import torch.nn as nn

class myModel(nn.Module):
    def __init__(self):
        super(myModel, self).__init__()
        self.dwconv = nn.Sequential(
            # depthwise: groups=3 gives each input channel its own 3x3 kernel; stride 2 halves the spatial size
            nn.Conv2d(3, 3, kernel_size=3, stride=2, padding=1, groups=3, bias=False),
            nn.BatchNorm2d(3),
            nn.ReLU(inplace=True),
            # pointwise: 1x1 convolution mixes channels and expands 3 -> 9
            nn.Conv2d(3, 9, kernel_size=1, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(9),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.dwconv(x)
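A quick shape check of this block (an illustrative addition, using the myModel class defined above):

import torch

x = torch.randn(1, 3, 64, 64)
print(myModel()(x).shape)  # torch.Size([1, 9, 32, 32]): stride 2 halves the spatial size, the 1x1 conv expands 3 -> 9 channels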

The corresponding code in the YOLO series looks like this:

import torch.nn as nn


class DWConv(nn.Module):
    """Depthwise Conv + Conv (BaseConv, assumed to be a Conv2d + BN + activation wrapper, is defined elsewhere in that codebase)."""
    def __init__(self, in_channels, out_channels, ksize, stride=1, act="silu"):
        super().__init__()
        self.dconv = BaseConv(
            in_channels, in_channels, ksize=ksize,
            stride=stride, groups=in_channels, act=act
        )
        self.pconv = BaseConv(
            in_channels, out_channels, ksize=1,
            stride=1, groups=1, act=act
        )
 
    def forward(self, x):
        x = self.dconv(x)
        return self.pconv(x)

3. MobileNet v1

Paper: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
MobileNet v1 and MobileNet v2
Detailed explanation of the MobileNet network
[Deep Learning] Detailed explanation of the lightweight CNN MobileNet series
MobileNet V1 image classification

MobileNet v1 is a lightweight CNN aimed at mobile and embedded devices with limited computing power. As shown in the figure below, MobileNet v1 sacrifices only a little accuracy while greatly reducing the model's parameters and computation.
(figure)

1. Network structure

The main contribution of MobileNet v1 is the use of depthwise separable convolution, which splits a standard convolution into a depthwise convolution and a pointwise convolution. This separable design alone compresses the model by roughly 8×, with no serious loss of accuracy, which is remarkable.
(figure)

As a lightweight network, MobileNet v1 needs less computation and fewer parameters than GoogLeNet while achieving better classification accuracy; this is the payoff of depthwise separable convolution. VGG16 has about 30 times the parameters and computation of MobileNet, yet its accuracy is less than 1% higher.
(figure)

2. Advantages

First, the depthwise separable convolution used by MobileNet v1 greatly reduces computation and parameters; second, two hyperparameters, α and ρ, allow the width and input resolution of the network to be adjusted as needed.

Specifically, α (the width multiplier) scales the number of convolution kernels, i.e. the output channels, so it reduces the model's parameter count; ρ (the resolution multiplier) scales the input image size, so it does not change the number of parameters but does reduce the amount of computation.

The width of a network refers to the channel dimension of its convolutional layers, e.g. 512 or 1024.

The depth of a network refers to its number of convolutional layers, i.e. how deep it is, e.g. ResNet-34 vs ResNet-101.
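As a rough illustration (not code from the paper), the width multiplier α can be thought of as scaling every channel count in the network; the helper name and the α value below are hypothetical:

def scale_channels(channels, alpha=0.75):
    # scale a layer's channel count by the width multiplier alpha
    return max(8, int(channels * alpha))

print([scale_channels(c) for c in (32, 64, 128, 256, 512, 1024)])
# [24, 48, 96, 192, 384, 768]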

3. Code implementation

3.1 Build MobileNet v1 network model

import torch.nn as nn
 
 
# MobileNet v1
class MobileNetV1(nn.Module):
    def __init__(self, num_classes=1000):
        super(MobileNetV1, self).__init__()
 
        # first convolution: channels -> 32, spatial size halved
        def conv_bn(in_channel, out_channel, stride):
            return nn.Sequential(
                nn.Conv2d(in_channel, out_channel, 3, stride, 1, bias=False),
                nn.BatchNorm2d(out_channel),
                nn.ReLU(inplace=True)
            )
 
        # depthwise separable convolution = depthwise conv + pointwise conv
        def conv_dw(in_channel, out_channel, stride):
            return nn.Sequential(
                # depthwise conv: channels unchanged; spatial size halved when stride = 2
                nn.Conv2d(in_channel, in_channel, 3, stride, padding=1, groups=in_channel, bias=False),
                nn.BatchNorm2d(in_channel),
                nn.ReLU(inplace=True),
 
                # pointwise conv (1*1, 'same'): only changes the number of channels
                nn.Conv2d(in_channel, out_channel, 1, 1, padding=0, bias=False),
                nn.BatchNorm2d(out_channel),
                nn.ReLU(inplace=True),
            )
 
        self.model = nn.Sequential(
            conv_bn(3, 32, 2),          # conv/s2           out=112*112*32
            conv_dw(32, 64, 1),         # conv dw +1*1      out=112*112*64
            conv_dw(64, 128, 2),        # conv dw +1*1      out=56*56*128
            conv_dw(128, 128, 1),       # conv dw +1*1      out=56*56*128
            conv_dw(128, 256, 2),       # conv dw +1*1      out=28*28*256
            conv_dw(256, 256, 1),       # conv dw +1*1      out=28*28*256
            conv_dw(256, 512, 2),       # conv dw +1*1      out=14*14*512
            conv_dw(512, 512, 1),       # 5x conv dw +1*1 ----> size and channels unchanged, out=14*14*512
            conv_dw(512, 512, 1),
            conv_dw(512, 512, 1),
            conv_dw(512, 512, 1),
            conv_dw(512, 512, 1),
            conv_dw(512, 1024, 2),      # conv dw +1*1      out=7*7*1024
            conv_dw(1024, 1024, 1),     # conv dw +1*1      out=7*7*1024
            nn.AvgPool2d(7),            # avg pool          out=1*1*1024
        )
        self.fc = nn.Linear(1024, num_classes)      # fc
 
    def forward(self, x):
        x = self.model(x)
        x = x.view(-1, 1024)
        x = self.fc(x)
        return x

3.2 View the network structure with torchsummary

# install torchsummary
pip install torchsummary

Use torchsummary to view the network structure:

from torchsummary import summary
import torch
 
 
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
net = MobileNetV1()
net.to(DEVICE)
print(summary(net, input_size=(3, 224, 224),device=DEVICE))

Output result:

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 32, 112, 112]             864
       BatchNorm2d-2         [-1, 32, 112, 112]              64
              ReLU-3         [-1, 32, 112, 112]               0
            Conv2d-4         [-1, 32, 112, 112]             288
       BatchNorm2d-5         [-1, 32, 112, 112]              64
              ReLU-6         [-1, 32, 112, 112]               0
            Conv2d-7         [-1, 64, 112, 112]           2,048
       BatchNorm2d-8         [-1, 64, 112, 112]             128
              ReLU-9         [-1, 64, 112, 112]               0
           Conv2d-10           [-1, 64, 56, 56]             576
      BatchNorm2d-11           [-1, 64, 56, 56]             128
             ReLU-12           [-1, 64, 56, 56]               0
           Conv2d-13          [-1, 128, 56, 56]           8,192
      BatchNorm2d-14          [-1, 128, 56, 56]             256
             ReLU-15          [-1, 128, 56, 56]               0
           Conv2d-16          [-1, 128, 56, 56]           1,152
      BatchNorm2d-17          [-1, 128, 56, 56]             256
             ReLU-18          [-1, 128, 56, 56]               0
           Conv2d-19          [-1, 128, 56, 56]          16,384
      BatchNorm2d-20          [-1, 128, 56, 56]             256
             ReLU-21          [-1, 128, 56, 56]               0
           Conv2d-22          [-1, 128, 28, 28]           1,152
      BatchNorm2d-23          [-1, 128, 28, 28]             256
             ReLU-24          [-1, 128, 28, 28]               0
           Conv2d-25          [-1, 256, 28, 28]          32,768
      BatchNorm2d-26          [-1, 256, 28, 28]             512
             ReLU-27          [-1, 256, 28, 28]               0
           Conv2d-28          [-1, 256, 28, 28]           2,304
      BatchNorm2d-29          [-1, 256, 28, 28]             512
             ReLU-30          [-1, 256, 28, 28]               0
           Conv2d-31          [-1, 256, 28, 28]          65,536
      BatchNorm2d-32          [-1, 256, 28, 28]             512
             ReLU-33          [-1, 256, 28, 28]               0
           Conv2d-34          [-1, 256, 14, 14]           2,304
      BatchNorm2d-35          [-1, 256, 14, 14]             512
             ReLU-36          [-1, 256, 14, 14]               0
           Conv2d-37          [-1, 512, 14, 14]         131,072
      BatchNorm2d-38          [-1, 512, 14, 14]           1,024
             ReLU-39          [-1, 512, 14, 14]               0
           Conv2d-40          [-1, 512, 14, 14]           4,608
      BatchNorm2d-41          [-1, 512, 14, 14]           1,024
             ReLU-42          [-1, 512, 14, 14]               0
           Conv2d-43          [-1, 512, 14, 14]         262,144
      BatchNorm2d-44          [-1, 512, 14, 14]           1,024
             ReLU-45          [-1, 512, 14, 14]               0
           Conv2d-46          [-1, 512, 14, 14]           4,608
      BatchNorm2d-47          [-1, 512, 14, 14]           1,024
             ReLU-48          [-1, 512, 14, 14]               0
           Conv2d-49          [-1, 512, 14, 14]         262,144
      BatchNorm2d-50          [-1, 512, 14, 14]           1,024
             ReLU-51          [-1, 512, 14, 14]               0
           Conv2d-52          [-1, 512, 14, 14]           4,608
      BatchNorm2d-53          [-1, 512, 14, 14]           1,024
             ReLU-54          [-1, 512, 14, 14]               0
           Conv2d-55          [-1, 512, 14, 14]         262,144
      BatchNorm2d-56          [-1, 512, 14, 14]           1,024
             ReLU-57          [-1, 512, 14, 14]               0
           Conv2d-58          [-1, 512, 14, 14]           4,608
      BatchNorm2d-59          [-1, 512, 14, 14]           1,024
             ReLU-60          [-1, 512, 14, 14]               0
           Conv2d-61          [-1, 512, 14, 14]         262,144
      BatchNorm2d-62          [-1, 512, 14, 14]           1,024
             ReLU-63          [-1, 512, 14, 14]               0
           Conv2d-64          [-1, 512, 14, 14]           4,608
      BatchNorm2d-65          [-1, 512, 14, 14]           1,024
             ReLU-66          [-1, 512, 14, 14]               0
           Conv2d-67          [-1, 512, 14, 14]         262,144
      BatchNorm2d-68          [-1, 512, 14, 14]           1,024
             ReLU-69          [-1, 512, 14, 14]               0
           Conv2d-70            [-1, 512, 7, 7]           4,608
      BatchNorm2d-71            [-1, 512, 7, 7]           1,024
             ReLU-72            [-1, 512, 7, 7]               0
           Conv2d-73           [-1, 1024, 7, 7]         524,288
      BatchNorm2d-74           [-1, 1024, 7, 7]           2,048
             ReLU-75           [-1, 1024, 7, 7]               0
           Conv2d-76           [-1, 1024, 7, 7]           9,216
      BatchNorm2d-77           [-1, 1024, 7, 7]           2,048
             ReLU-78           [-1, 1024, 7, 7]               0
           Conv2d-79           [-1, 1024, 7, 7]       1,048,576
      BatchNorm2d-80           [-1, 1024, 7, 7]           2,048
             ReLU-81           [-1, 1024, 7, 7]               0
        AvgPool2d-82           [-1, 1024, 1, 1]               0
           Linear-83                 [-1, 1000]       1,025,000
================================================================
Total params: 4,231,976
Trainable params: 4,231,976
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 115.43
Params size (MB): 16.14
Estimated Total Size (MB): 132.15
----------------------------------------------------------------
None

3.3 Train the model

import torch
import torch.nn as nn
from torchvision import transforms, datasets
import torch.optim as optim
from model import MobileNetV1
from torch.utils.data import DataLoader
from tqdm import tqdm

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
data_transform = {
    "train": transforms.Compose([transforms.Resize((224, 224)),
                                 transforms.ToTensor(),
                                 transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.255])]),
    "test": transforms.Compose([transforms.Resize((224, 224)),
                                transforms.ToTensor(),
                                transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.255])])}

# training set
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=data_transform['train'])
trainloader = DataLoader(trainset, batch_size=16, shuffle=True)

# test set
testset = datasets.CIFAR10(root='./data', train=False, download=True, transform=data_transform['test'])
testloader = DataLoader(testset, batch_size=16, shuffle=False)

# number of samples
num_trainset = len(trainset)  # 50000
num_testset = len(testset)  # 10000

# build the network
net = MobileNetV1(num_classes=10)
net.to(DEVICE)

# loss function and optimizer
loss_function = nn.CrossEntropyLoss()
loss_fun = loss_function.to(DEVICE)

learning_rate = 0.0001
optimizer = optim.Adam(net.parameters(), lr=learning_rate)

best_acc = 0.0
save_path = './MobileNetV1.pth'

for epoch in range(10):
    net.train()  # training mode
    running_loss = 0.0
    for data in tqdm(trainloader):
        images, labels = data
        images, labels = images.to(DEVICE), labels.to(DEVICE)

        optimizer.zero_grad()
        out = net(images)  # forward pass
        loss = loss_function(out, labels)
        loss.backward()  # backpropagation
        optimizer.step()

        running_loss += loss.item()

    # test
    # no parameter updates are needed during testing
    net.eval()  # evaluation mode
    acc = 0.0
    with torch.no_grad():  # no gradients are needed during testing
        for test_data in tqdm(testloader):
            test_images, test_labels = test_data
            test_images, test_labels = test_images.to(DEVICE), test_labels.to(DEVICE)

            outputs = net(test_images)
            predict_y = torch.max(outputs, dim=1)[1]
            acc += (predict_y == test_labels).sum().item()

    accurate = acc / num_testset
    train_loss = running_loss / num_trainset

    print('[epoch %d] train_loss: %.3f  test_accuracy: %.3f' %
          (epoch + 1, train_loss, accurate))

    if accurate > best_acc:
        best_acc = accurate
        torch.save(net.state_dict(), save_path)

print('Finished Training')

Output result:

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
170499072it [00:30, 5634555.25it/s]                                
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
100%|██████████| 3125/3125 [02:18<00:00, 22.55it/s]
100%|██████████| 625/625 [00:12<00:00, 51.10it/s]
[epoch 1] train_loss: 0.101  test_accuracy: 0.516
100%|██████████| 3125/3125 [02:23<00:00, 21.78it/s]
100%|██████████| 625/625 [00:11<00:00, 54.31it/s]
[epoch 2] train_loss: 0.079  test_accuracy: 0.612
100%|██████████| 3125/3125 [02:20<00:00, 22.17it/s]
100%|██████████| 625/625 [00:11<00:00, 54.28it/s]
[epoch 3] train_loss: 0.066  test_accuracy: 0.672
100%|██████████| 3125/3125 [02:21<00:00, 22.09it/s]
100%|██████████| 625/625 [00:11<00:00, 55.52it/s]
[epoch 4] train_loss: 0.056  test_accuracy: 0.722
100%|██████████| 3125/3125 [02:13<00:00, 23.34it/s]
100%|██████████| 625/625 [00:11<00:00, 55.56it/s]
[epoch 5] train_loss: 0.048  test_accuracy: 0.748
100%|██████████| 3125/3125 [02:14<00:00, 23.31it/s]
100%|██████████| 625/625 [00:11<00:00, 52.19it/s]
[epoch 6] train_loss: 0.042  test_accuracy: 0.763
100%|██████████| 3125/3125 [02:14<00:00, 23.18it/s]
100%|██████████| 625/625 [00:11<00:00, 56.05it/s]
[epoch 7] train_loss: 0.035  test_accuracy: 0.781
100%|██████████| 3125/3125 [02:14<00:00, 23.27it/s]
100%|██████████| 625/625 [00:11<00:00, 55.88it/s]
[epoch 8] train_loss: 0.031  test_accuracy: 0.790
100%|██████████| 3125/3125 [02:13<00:00, 23.32it/s]
100%|██████████| 625/625 [00:11<00:00, 55.89it/s]
[epoch 9] train_loss: 0.026  test_accuracy: 0.801
100%|██████████| 3125/3125 [02:15<00:00, 22.99it/s]
100%|██████████| 625/625 [00:11<00:00, 55.95it/s]
[epoch 10] train_loss: 0.022  test_accuracy: 0.803
Finished Training

Process finished with exit code 0

Graphics card resource usage:
(figure)

3.4 View model weight parameters

from model import MobileNetV1
import torch
 
 
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
 
net = MobileNetV1(num_classes=10)
net.load_state_dict(torch.load('./MobileNetV1.pth'))
net.to(DEVICE)
 
 
with torch.no_grad():
    for i in range(0, 14):       # inspect the first conv (i=0) and the depthwise conv of each block (i=1..13)
        print(net.model[i][0].weight)

3.5 Test the effect on the CIFAR-10 dataset

import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
 
import torch
import numpy as np
import matplotlib.pyplot as plt
from model import MobileNetV1
from torchvision.transforms import transforms
from torch.utils.data import DataLoader
import torchvision
 
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
 
# preprocessing
transformer = transforms.Compose([transforms.Resize((224,224)),
                                  transforms.ToTensor(),
                                  transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.255])])
 
# load the model
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
model = MobileNetV1(num_classes=10)
model.load_state_dict(torch.load('./MobileNetV1.pth'))
model.to(DEVICE)
 
# load the data
testSet = torchvision.datasets.CIFAR10(root='./data', train=False, download=False, transform=transformer)
testLoader = DataLoader(testSet, batch_size=12, shuffle=True)
 
# fetch one batch of data
imgs, labels = next(iter(testLoader))
imgs = imgs.to(DEVICE)
 
# show
with torch.no_grad():
    model.eval()
    prediction = model(imgs)  # predict
    prediction = torch.max(prediction, dim=1)[1]
    prediction = prediction.data.cpu().numpy()
 
    plt.figure(figsize=(12, 8))
    for i, (img, label) in enumerate(zip(imgs, labels)):
        x = np.transpose(img.data.cpu().numpy(), (1, 2, 0))  # image, HWC layout
        x[:, :, 0] = x[:, :, 0] * 0.229 + 0.485  # undo normalization
        x[:, :, 1] = x[:, :, 1] * 0.224 + 0.456  # undo normalization
        x[:, :, 2] = x[:, :, 2] * 0.255 + 0.406  # undo normalization
        y = label.numpy().item()  # ground-truth label
        plt.subplot(3, 4, i + 1)
        plt.axis(False)
        plt.imshow(x)
        plt.title('R:{},P:{}'.format(classes[y], classes[prediction[i]]))  # R: real label, P: prediction
    plt.show()

Results display:

(figure)

4. MobileNet v2

Paper: MobileNetV2: Inverted Residuals and Linear Bottlenecks

MobileNet v2 mainly combines residual connections with depthwise separable convolution. By analyzing the manifold characteristics of the features, the residual block is improved with an expansion of the intermediate layer (d) and a linear activation in the bottleneck layer (c).
(figure)

0 Preface

The features carried by the pixel values of each channel of a feature map can be mapped to a manifold region of a low-dimensional subspace. Usually an activation layer is added after the convolution to increase the nonlinearity of the features, and a common choice is ReLU. This activation loses information, and the loss cannot be recovered; when the number of channels is small, the information loss caused by ReLU is especially pronounced.

As shown in the figure below, the input is a matrix representing manifold data. Similarly to a convolution operation, it is expanded through n ReLU operations to obtain n-channel feature maps, and then the input data is reconstructed from these n feature maps. The closer the reconstruction is to the input, the less information has been lost.
(figure)

As can be seen from the figure above, when the output dimension is 2 or 3, the reconstruction loses much of the input information; but when the dimension is 15 to 30, the reconstruction preserves far more of the input. In short, when n is small the information loss caused by ReLU is severe; when n is large the input manifold can be recovered well.
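A rough numerical sketch of this observation (an illustrative experiment added here, not the paper's): project 2-D points to n dimensions with a random matrix, apply ReLU, project back by least squares, and measure the reconstruction error.

import torch

torch.manual_seed(0)
x = torch.randn(2, 1000)                        # 2-D "manifold" points
for n in (3, 15, 30):
    T = torch.randn(n, 2)                       # random expansion matrix
    y = torch.relu(T @ x)                       # expand to n dims, then ReLU
    x_rec = torch.linalg.lstsq(T, y).solution   # least-squares projection back to 2-D
    err = (x - x_rec).pow(2).mean().item()
    print(n, round(err, 4))                     # the error tends to shrink as n grows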

Based on this analysis of the information-loss problem, there are two remedies:

  1. Replace ReLU: since ReLU causes the information loss, it can be replaced with a linear activation function;
  2. Increase the dimensionality: since more channels mean less information loss, the input can first be expanded to a higher dimension.

The title of the MobileNet v2 paper is MobileNetV2: Inverted Residuals and Linear Bottlenecks; Linear Bottlenecks and Inverted Residuals are the core of MobileNet v2 and correspond exactly to the two ideas above.

1. Linear Bottlenecks

Replacing the ReLU activation with a linear activation gives the block called a Linear Bottleneck in the paper, with the structure shown in the figure below:
(figure)

Of course, not every ReLU can be replaced with a linear activation, otherwise the network would degenerate into a single linear layer. The compromise is to use a linear activation only in the bottleneck part, where the output feature map has few channels, and to use ReLU elsewhere. The code for a Linear Bottleneck block can be implemented as follows:

from keras import backend as K
from keras.layers import Activation, Conv2D, DepthwiseConv2D, add

def relu6(x):
    return K.relu(x, max_value=6)

def _bottleneck(inputs, nb_filters, t):
    # 1x1 expansion by factor t, then ReLU6
    x = Conv2D(filters=nb_filters * t, kernel_size=(1,1), padding='same')(inputs)
    x = Activation(relu6)(x)
    # 3x3 depthwise convolution, then ReLU6
    x = DepthwiseConv2D(kernel_size=(3,3), padding='same')(x)
    x = Activation(relu6)(x)
    # 1x1 projection back to nb_filters channels
    x = Conv2D(filters=nb_filters, kernel_size=(1,1), padding='same')(x)
    # do not use an activation function here: this is the linear bottleneck
    if not K.int_shape(inputs)[-1] == nb_filters:
        # match channel counts so the shortcut can be added
        inputs = Conv2D(filters=nb_filters, kernel_size=(1,1), padding='same')(inputs)
    outputs = add([x, inputs])
    return outputs

2. Inverted Residual

Inverted Residuals literally means an inverted residual structure. Let's look at how it differs from and relates to the normal residual structure. As the figure below shows, on the left is the residual block in ResNet, whose structure is: 1x1 convolution to reduce dimensions -> 3x3 convolution -> 1x1 convolution to restore dimensions. On the right is the inverted residual block in MobileNet v2, whose structure is: 1x1 convolution to expand dimensions -> 3x3 depthwise convolution -> 1x1 convolution to reduce dimensions. MobileNet v2 expands with the 1x1 convolution first because high-dimensional information loses less information when passing through ReLU, so the expansion is performed before the depthwise convolution.
(figure)

Note that a shortcut connection exists only when s=1, i.e. when the stride is 1 (and the input and output shapes match); when the stride is 2 there is no shortcut, as shown in the figure below.

(figure)
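A minimal PyTorch sketch of the inverted residual block described above (illustrative only, not the official MobileNet v2 implementation): 1x1 expansion, 3x3 depthwise convolution, then a linear 1x1 projection, with the shortcut applied only when the stride is 1 and the channel counts match.

import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_shortcut = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),                             # 1x1 expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),  # 3x3 depthwise
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),                            # 1x1 linear projection (no ReLU)
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out

x = torch.randn(1, 32, 56, 56)
print(InvertedResidual(32, 32, stride=1)(x).shape)  # torch.Size([1, 32, 56, 56]), with shortcut
print(InvertedResidual(32, 64, stride=2)(x).shape)  # torch.Size([1, 64, 28, 28]), no shortcut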

3. Network structure

(figure)

MobileNet v2 uses fewer parameters, yet its mAP is comparable to other detectors and even exceeds YOLOv2, as shown in the figure below:
(figure)

4. Code implementation

MobileNet v2 can be implemented by stacking bottlenecks, as shown in the following code snippet:

from keras.layers import Input, Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dense
from keras.models import Model

# _bottleneck_relu is assumed to be a variant of the _bottleneck block above that keeps
# ReLU6 on its output; its arguments are the number of filters and the expansion factor t.
def MobileNetV2_relu(input_shape, k):
    inputs = Input(shape=input_shape)
    x = Conv2D(filters=32, kernel_size=(3,3), padding='same')(inputs)
    x = _bottleneck_relu(x, 8, 6)
    x = MaxPooling2D((2,2))(x)
    x = _bottleneck_relu(x, 16, 6)
    x = _bottleneck_relu(x, 16, 6)
    x = MaxPooling2D((2,2))(x)
    x = _bottleneck_relu(x, 32, 6)
    x = GlobalAveragePooling2D()(x)
    x = Dense(128, activation='relu')(x)
    outputs = Dense(k, activation='softmax')(x)
    model = Model(inputs, outputs)
    return model


Source: blog.csdn.net/m0_37605642/article/details/134174749