If you review the past and learn the new, you can become a teacher!
1. Reference materials
A detailed and popular explanation of lightweight neural networks - MobileNets [V1, V2, V3]
Separable Convolution in Convolutional Neural Networks
Several convolutions commonly used in deep learning (Part 2): dilated convolution, separable convolution (depthwise separable, spatially separable), grouped convolution (with PyTorch test code)
2. Related concepts
1. Standard convolution
Standard convolution applies several multi-channel convolution kernels to a multi-channel input image. The output feature maps therefore capture both channel features and spatial features.
As shown in the figure below, assume the input is a 64×64-pixel, 3-channel color image. After a convolutional layer containing 4 filters, 4 feature maps are output, each the same size as the input. The layer thus has 4 filters, each filter contains 3 kernels, and each kernel is 3×3. Therefore the parameter count of the convolutional layer is: $N_{std} = 4 \times 3 \times 3 \times 3 = 108$.
Expressing standard convolution mathematically: assume the kernel size is $D_K \times D_K$, the number of input channels is $M$, the number of output channels is $N$, and the output feature map size is $D_F \times D_F$. Then for standard convolution:
Parameter count: $D_K \times D_K \times M \times N$;
Computation cost: $D_K \times D_K \times M \times N \times D_F \times D_F$.
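As a quick sanity check, the parameter formula can be verified in PyTorch with the 4-filter example above (a minimal sketch; the padding value is incidental to the parameter count):

import torch.nn as nn

# Standard convolution: M=3 input channels, N=4 output channels, 3x3 kernel.
conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1, bias=False)
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)  # 108 = D_K * D_K * M * N = 3*3*3*4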
2. Depthwise Convolution
The difference between depthwise convolution (DWConv) and standard convolution is that each depthwise kernel is single-channel and each input channel is convolved separately, so the number of output feature maps equals the number of input channels. That is: number of input channels = number of convolution kernels = number of output feature maps.
Assume a 3-channel color image of 64×64 pixels: 3 single-channel convolution kernels each convolve one channel and output 3 single-channel feature maps, as shown in the figure below. Each filter contains only one 3×3 kernel, so the parameter count of the convolution is: $N_{depthwise} = 3 \times 3 \times 3 = 27$.
As another example, a 12×12×3 input feature map convolved with 3 depthwise kernels yields an 8×8×3 output feature map (consistent with 5×5 kernels, stride 1, and no padding, since 12 − 5 + 1 = 8).
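This shape arithmetic can be checked in PyTorch; the 5×5 kernel with stride 1 and no padding is an assumption inferred from the output size:

import torch
import torch.nn as nn

# Depthwise convolution: groups equals the channel count, so each channel
# is convolved with its own single-channel kernel.
dwconv = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=5, groups=3, bias=False)
x = torch.randn(1, 3, 12, 12)                        # 12x12x3 input
print(dwconv(x).shape)                               # torch.Size([1, 3, 8, 8]) -> 8x8x3
print(sum(p.numel() for p in dwconv.parameters()))   # 75 = 5*5*3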
3. Pointwise Convolution
From depthwise convolution we know that the number of input channels = number of kernels = number of output feature maps. This leaves the output with too few feature maps (or, viewed as one output map with 3 channels, too few channels), which can limit how effectively information is combined across channels. This is where pointwise convolution comes in.
Pointwise convolution (PWConv) essentially uses 1×1 convolution kernels to change the channel dimensionality. 1×1 kernels are used extensively in GoogLeNet, mainly for dimensionality reduction; in general, a 1×1 convolution can either increase or decrease the number of channels of a feature map.
As shown in the figure below, the 3 single-channel feature maps from the depthwise step are processed by 4 kernels of size 1×1×3, producing 4 output feature maps; the number of output maps depends on the number of filters. The parameter count of this convolutional layer is: $N_{pointwise} = 1 \times 1 \times 3 \times 4 = 12$.
As another example, the 8×8×3 features from the depthwise step are convolved with 256 kernels of size 1×1×3, producing an 8×8×256 output. The parameter count of this convolutional layer is: $N_{pointwise} = 1 \times 1 \times 3 \times 256 = 768$.
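Again, a quick check in PyTorch (a minimal sketch of this example):

import torch
import torch.nn as nn

# Pointwise (1x1) convolution: mixes channels without touching spatial positions.
pwconv = nn.Conv2d(in_channels=3, out_channels=256, kernel_size=1, bias=False)
x = torch.randn(1, 3, 8, 8)                          # 8x8x3 from the depthwise step
print(pwconv(x).shape)                               # torch.Size([1, 256, 8, 8])
print(sum(p.numel() for p in pwconv.parameters()))   # 768 = 1*1*3*256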
4. Depthwise Separable Convolution
Depthwise separable convolution (DSC) consists of a depthwise convolution followed by a pointwise convolution. The depthwise convolution extracts spatial features, and the pointwise convolution extracts channel features. DSC groups the convolution along the feature dimension: each channel gets its own independent depthwise convolution, and a 1×1 pointwise convolution then aggregates all channels before output.
depthwise: convolves spatially, one channel at a time
pointwise: convolves across the depth (channel) dimension
Depthwise Separable Convolution = depthwise convolution + pointwise convolution
Depthwise separable convolution first performs DWConv on each channel, and then merges all channels through PWConv to output feature maps, thereby reducing the amount of calculation and improving calculation efficiency.
4.1 Parameter count
Depthwise convolution: each depthwise kernel has size $D_K \times D_K \times 1$ and there are $M$ kernels, so the parameter count is $D_K \times D_K \times M$.
Pointwise convolution: each pointwise kernel has size $1 \times 1 \times M$ and there are $N$ kernels, so the parameter count is $M \times N$.
Therefore the parameter count of depthwise separable convolution is: $D_K \times D_K \times M + M \times N$.
4.2 Computation cost
Depthwise convolution: each kernel has size $D_K \times D_K \times 1$, there are $M$ kernels, and each performs $D_F \times D_F$ multiply-add operations, so the computation cost is $D_K \times D_K \times M \times D_F \times D_F$.
Pointwise convolution: each kernel has size $1 \times 1 \times M$, there are $N$ kernels, and each performs $D_F \times D_F$ multiply-add operations, so the computation cost is $M \times N \times D_F \times D_F$.
Therefore the computation cost of depthwise separable convolution is: $D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F$.
4.3 Comparison with standard convolution
4.3.1 Structural comparison
Each block of depthwise separable convolution consists of: a 3×3 depthwise convolution, followed by BN and ReLU layers, then a 1×1 pointwise convolution, again followed by BN and ReLU layers.
4.3.2 Comparison of computation and parameter counts
Parameter ratio:
$$\frac{\text{depthwise separable convolution}}{\text{standard convolution}} = \frac{D_K \times D_K \times M + M \times N}{D_K \times D_K \times M \times N} = \frac{1}{N} + \frac{1}{D_K^2}$$
Computation ratio:
$$\frac{\text{depthwise separable convolution}}{\text{standard convolution}} = \frac{D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F}{D_K \times D_K \times M \times N \times D_F \times D_F} = \frac{1}{N} + \frac{1}{D_K^2}$$
In general $N$ is large, so $\frac{1}{N}$ is negligible. $D_K$ denotes the kernel size; if $D_K = 3$, then $\frac{1}{D_K^2} = \frac{1}{9}$. In other words, with a common 3×3 kernel, depthwise separable convolution reduces both the parameter count and the computation cost to about one-ninth of the original.
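The ratio can be verified numerically (a minimal sketch; M = 32 and N = 64 are arbitrary example values):

import torch.nn as nn

D_K, M, N = 3, 32, 64  # example kernel size and channel counts

std = nn.Conv2d(M, N, D_K, padding=1, bias=False)
dsc = nn.Sequential(
    nn.Conv2d(M, M, D_K, padding=1, groups=M, bias=False),  # depthwise
    nn.Conv2d(M, N, 1, bias=False),                          # pointwise
)
p_std = sum(p.numel() for p in std.parameters())  # 3*3*32*64 = 18432
p_dsc = sum(p.numel() for p in dsc.parameters())  # 3*3*32 + 32*64 = 2336
print(p_dsc / p_std)       # ~0.1267
print(1 / N + 1 / D_K**2)  # 0.1267... matches 1/N + 1/D_K^2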
4.4 Advantages of depthwise separable convolution
Compared with traditional convolutional neural networks, the significant advantages of depthwise separable convolution are:
- Fewer parameters: factorizing the convolution into depthwise and pointwise steps greatly reduces the parameters of the convolutional layer.
- Faster: runs faster than traditional convolution.
- More portable: less computationally intensive, easier to implement and deploy on different platforms.
- More streamlined: the computation model can be slimmed down to achieve high-accuracy inference on smaller devices.
5. Code examples
import torch.nn as nn

class myModel(nn.Module):
    def __init__(self):
        super(myModel, self).__init__()
        self.dwconv = nn.Sequential(
            # depthwise conv: groups=3 gives each channel its own 3x3 kernel; stride 2 halves the size
            nn.Conv2d(3, 3, kernel_size=3, stride=2, padding=1, groups=3, bias=False),
            nn.BatchNorm2d(3),
            nn.ReLU(inplace=True),
            # pointwise conv: 1x1 kernel raises the channel count from 3 to 9
            nn.Conv2d(3, 9, kernel_size=1, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(9),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.dwconv(x)
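A quick shape check for this block (the 224×224 input size is chosen only for illustration):

import torch

net = myModel()
x = torch.randn(1, 3, 224, 224)
print(net(x).shape)  # torch.Size([1, 9, 112, 112]): stride-2 DW halves the size, PW raises channels 3 -> 9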
The corresponding implementation in the YOLO series is as follows:
import torch.nn as nn

class DWConv(nn.Module):
    """Depthwise Conv + Conv"""
    def __init__(self, in_channels, out_channels, ksize, stride=1, act="silu"):
        super().__init__()
        # BaseConv (Conv2d + BN + activation) is defined elsewhere in the YOLOX codebase; see the sketch below
        self.dconv = BaseConv(
            in_channels, in_channels, ksize=ksize,
            stride=stride, groups=in_channels, act=act
        )
        self.pconv = BaseConv(
            in_channels, out_channels, ksize=1,
            stride=1, groups=1, act=act
        )

    def forward(self, x):
        x = self.dconv(x)
        return self.pconv(x)
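BaseConv is not defined in this snippet; in the YOLOX codebase it is a Conv2d + BatchNorm + activation block. A minimal sketch along those lines (a simplification: the real YOLOX version also handles a bias flag and more activation choices):

import torch.nn as nn

class BaseConv(nn.Module):
    """Conv2d -> BatchNorm2d -> activation, with 'same'-style padding (sketch of YOLOX's BaseConv)."""
    def __init__(self, in_channels, out_channels, ksize, stride=1, groups=1, act="silu"):
        super().__init__()
        pad = (ksize - 1) // 2  # keeps spatial size when stride=1
        self.conv = nn.Conv2d(in_channels, out_channels, ksize, stride, pad, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU(inplace=True) if act == "silu" else nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))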
3. MobileNet v1
Paper: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
MobileNet v1 and MobileNet v2
Detailed explanation of the MobileNet network
[Deep Learning] Detailed explanation of lightweight CNN network MobileNet series
MobileNet V1 image classification
MobileNet v1 is a lightweight CNN designed for mobile and embedded devices with limited computing power. As shown in the figure below, MobileNet v1 sacrifices only a little accuracy while greatly reducing the model's parameter count and computation.
1. Network structure
The main contribution of MobileNet v1 is the use of Depthwise Separable Convolution, which splits a standard convolution into a depthwise convolution and a pointwise convolution. This separable design alone compresses the model by roughly 8 times without serious loss of accuracy, which is remarkable.
MobileNet v1 requires less computation and fewer parameters than GoogLeNet, yet classifies better; this is the payoff of depthwise separable convolution. VGG16 has roughly 30 times the parameters of MobileNet, but its accuracy is less than 1% higher.
2. Advantages
First, the depthwise separable convolution introduced by MobileNet v1 greatly reduces computation and parameter counts. Second, the hyperparameters α and ρ allow the network's width and input resolution to be adjusted as needed.
Specifically, the width multiplier α scales the number of convolution kernels (i.e., the output channels), so α reduces the model's parameter count; the resolution multiplier ρ scales the input image size, which does not affect the parameter count but does reduce the computation.
The width of a network refers to the channel dimension of its convolutional layers, e.g., 512 or 1024.
The depth of a network refers to the number of convolutional layers, i.e., how deep it is, e.g., ResNet34 vs. ResNet101.
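A hedged sketch of how α and ρ act in practice (the helper below is illustrative, not the paper's code; real implementations also round channel counts to a multiple of 8):

import torch

alpha, rho = 0.5, 0.714  # width and resolution multipliers (example values)

def scale_channels(c, alpha):
    # Width multiplier: thin every layer's channel count by alpha (simplified rounding).
    return max(8, int(c * alpha))

print(scale_channels(512, alpha))  # 256 channels instead of 512

# Resolution multiplier: shrink the input image before the network sees it.
base_size = 224
input_size = round(base_size * rho)  # 160x160 input instead of 224x224
x = torch.randn(1, 3, input_size, input_size)
print(x.shape)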
3. Code implementation
3.1 Build MobileNet v1 network model
import torch.nn as nn

# MobileNet v1
class MobileNetV1(nn.Module):
    def __init__(self, num_classes=1000):
        super(MobileNetV1, self).__init__()

        # First layer: standard conv, channels -> 32, spatial size halved
        def conv_bn(in_channel, out_channel, stride):
            return nn.Sequential(
                nn.Conv2d(in_channel, out_channel, 3, stride, 1, bias=False),
                nn.BatchNorm2d(out_channel),
                nn.ReLU(inplace=True)
            )

        # Depthwise separable conv = depthwise conv + pointwise conv
        def conv_dw(in_channel, out_channel, stride):
            return nn.Sequential(
                # depthwise conv: channels unchanged; when stride = 2, spatial size is halved
                nn.Conv2d(in_channel, in_channel, 3, stride, padding=1, groups=in_channel, bias=False),
                nn.BatchNorm2d(in_channel),
                nn.ReLU(inplace=True),
                # pointwise conv (1x1, 'same'): only changes the channel count
                nn.Conv2d(in_channel, out_channel, 1, 1, padding=0, bias=False),
                nn.BatchNorm2d(out_channel),
                nn.ReLU(inplace=True),
            )

        self.model = nn.Sequential(
            conv_bn(3, 32, 2),       # conv/s2        out=112*112*32
            conv_dw(32, 64, 1),      # conv dw + 1*1  out=112*112*64
            conv_dw(64, 128, 2),     # conv dw + 1*1  out=56*56*128
            conv_dw(128, 128, 1),    # conv dw + 1*1  out=56*56*128
            conv_dw(128, 256, 2),    # conv dw + 1*1  out=28*28*256
            conv_dw(256, 256, 1),    # conv dw + 1*1  out=28*28*256
            conv_dw(256, 512, 2),    # conv dw + 1*1  out=14*14*512
            conv_dw(512, 512, 1),    # 5 x (conv dw + 1*1) --> size and channels unchanged, out=14*14*512
            conv_dw(512, 512, 1),
            conv_dw(512, 512, 1),
            conv_dw(512, 512, 1),
            conv_dw(512, 512, 1),
            conv_dw(512, 1024, 2),   # conv dw + 1*1  out=7*7*1024
            conv_dw(1024, 1024, 1),  # conv dw + 1*1  out=7*7*1024
            nn.AvgPool2d(7),         # avg pool       out=1*1*1024
        )
        self.fc = nn.Linear(1024, num_classes)  # fc

    def forward(self, x):
        x = self.model(x)
        x = x.view(-1, 1024)
        x = self.fc(x)
        return x
3.2 Viewing the network structure with torchsummary
Install torchsummary:
# install torchsummary
pip install torchsummary
Use torchsummary to view the network structure:
from torchsummary import summary
import torch
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
net = MobileNetV1()
net.to(DEVICE)
print(summary(net, input_size=(3, 224, 224),device=DEVICE))
Output result:
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 112, 112] 864
BatchNorm2d-2 [-1, 32, 112, 112] 64
ReLU-3 [-1, 32, 112, 112] 0
Conv2d-4 [-1, 32, 112, 112] 288
BatchNorm2d-5 [-1, 32, 112, 112] 64
ReLU-6 [-1, 32, 112, 112] 0
Conv2d-7 [-1, 64, 112, 112] 2,048
BatchNorm2d-8 [-1, 64, 112, 112] 128
ReLU-9 [-1, 64, 112, 112] 0
Conv2d-10 [-1, 64, 56, 56] 576
BatchNorm2d-11 [-1, 64, 56, 56] 128
ReLU-12 [-1, 64, 56, 56] 0
Conv2d-13 [-1, 128, 56, 56] 8,192
BatchNorm2d-14 [-1, 128, 56, 56] 256
ReLU-15 [-1, 128, 56, 56] 0
Conv2d-16 [-1, 128, 56, 56] 1,152
BatchNorm2d-17 [-1, 128, 56, 56] 256
ReLU-18 [-1, 128, 56, 56] 0
Conv2d-19 [-1, 128, 56, 56] 16,384
BatchNorm2d-20 [-1, 128, 56, 56] 256
ReLU-21 [-1, 128, 56, 56] 0
Conv2d-22 [-1, 128, 28, 28] 1,152
BatchNorm2d-23 [-1, 128, 28, 28] 256
ReLU-24 [-1, 128, 28, 28] 0
Conv2d-25 [-1, 256, 28, 28] 32,768
BatchNorm2d-26 [-1, 256, 28, 28] 512
ReLU-27 [-1, 256, 28, 28] 0
Conv2d-28 [-1, 256, 28, 28] 2,304
BatchNorm2d-29 [-1, 256, 28, 28] 512
ReLU-30 [-1, 256, 28, 28] 0
Conv2d-31 [-1, 256, 28, 28] 65,536
BatchNorm2d-32 [-1, 256, 28, 28] 512
ReLU-33 [-1, 256, 28, 28] 0
Conv2d-34 [-1, 256, 14, 14] 2,304
BatchNorm2d-35 [-1, 256, 14, 14] 512
ReLU-36 [-1, 256, 14, 14] 0
Conv2d-37 [-1, 512, 14, 14] 131,072
BatchNorm2d-38 [-1, 512, 14, 14] 1,024
ReLU-39 [-1, 512, 14, 14] 0
Conv2d-40 [-1, 512, 14, 14] 4,608
BatchNorm2d-41 [-1, 512, 14, 14] 1,024
ReLU-42 [-1, 512, 14, 14] 0
Conv2d-43 [-1, 512, 14, 14] 262,144
BatchNorm2d-44 [-1, 512, 14, 14] 1,024
ReLU-45 [-1, 512, 14, 14] 0
Conv2d-46 [-1, 512, 14, 14] 4,608
BatchNorm2d-47 [-1, 512, 14, 14] 1,024
ReLU-48 [-1, 512, 14, 14] 0
Conv2d-49 [-1, 512, 14, 14] 262,144
BatchNorm2d-50 [-1, 512, 14, 14] 1,024
ReLU-51 [-1, 512, 14, 14] 0
Conv2d-52 [-1, 512, 14, 14] 4,608
BatchNorm2d-53 [-1, 512, 14, 14] 1,024
ReLU-54 [-1, 512, 14, 14] 0
Conv2d-55 [-1, 512, 14, 14] 262,144
BatchNorm2d-56 [-1, 512, 14, 14] 1,024
ReLU-57 [-1, 512, 14, 14] 0
Conv2d-58 [-1, 512, 14, 14] 4,608
BatchNorm2d-59 [-1, 512, 14, 14] 1,024
ReLU-60 [-1, 512, 14, 14] 0
Conv2d-61 [-1, 512, 14, 14] 262,144
BatchNorm2d-62 [-1, 512, 14, 14] 1,024
ReLU-63 [-1, 512, 14, 14] 0
Conv2d-64 [-1, 512, 14, 14] 4,608
BatchNorm2d-65 [-1, 512, 14, 14] 1,024
ReLU-66 [-1, 512, 14, 14] 0
Conv2d-67 [-1, 512, 14, 14] 262,144
BatchNorm2d-68 [-1, 512, 14, 14] 1,024
ReLU-69 [-1, 512, 14, 14] 0
Conv2d-70 [-1, 512, 7, 7] 4,608
BatchNorm2d-71 [-1, 512, 7, 7] 1,024
ReLU-72 [-1, 512, 7, 7] 0
Conv2d-73 [-1, 1024, 7, 7] 524,288
BatchNorm2d-74 [-1, 1024, 7, 7] 2,048
ReLU-75 [-1, 1024, 7, 7] 0
Conv2d-76 [-1, 1024, 7, 7] 9,216
BatchNorm2d-77 [-1, 1024, 7, 7] 2,048
ReLU-78 [-1, 1024, 7, 7] 0
Conv2d-79 [-1, 1024, 7, 7] 1,048,576
BatchNorm2d-80 [-1, 1024, 7, 7] 2,048
ReLU-81 [-1, 1024, 7, 7] 0
AvgPool2d-82 [-1, 1024, 1, 1] 0
Linear-83 [-1, 1000] 1,025,000
================================================================
Total params: 4,231,976
Trainable params: 4,231,976
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 115.43
Params size (MB): 16.14
Estimated Total Size (MB): 132.15
----------------------------------------------------------------
None
3.3 Training the model
import torch
import torch.nn as nn
from torchvision import transforms, datasets
import torch.optim as optim
from model import MobileNetV1
from torch.utils.data import DataLoader
from tqdm import tqdm

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

data_transform = {
    "train": transforms.Compose([transforms.Resize((224, 224)),
                                 transforms.ToTensor(),
                                 transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])]),
    "test": transforms.Compose([transforms.Resize((224, 224)),
                                transforms.ToTensor(),
                                transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])}

# Training set
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=data_transform['train'])
trainloader = DataLoader(trainset, batch_size=16, shuffle=True)
# Test set
testset = datasets.CIFAR10(root='./data', train=False, download=True, transform=data_transform['test'])
testloader = DataLoader(testset, batch_size=16, shuffle=False)
# Number of samples
num_trainset = len(trainset)  # 50000
num_testset = len(testset)    # 10000
# Build the network
net = MobileNetV1(num_classes=10)
net.to(DEVICE)
# Loss function and optimizer
loss_function = nn.CrossEntropyLoss()
loss_fun = loss_function.to(DEVICE)
learning_rate = 0.0001
optimizer = optim.Adam(net.parameters(), lr=learning_rate)

best_acc = 0.0
save_path = './MobileNetV1.pth'
for epoch in range(10):
    net.train()  # training mode
    running_loss = 0.0
    for data in tqdm(trainloader):
        images, labels = data
        images, labels = images.to(DEVICE), labels.to(DEVICE)
        optimizer.zero_grad()
        out = net(images)  # forward pass
        loss = loss_function(out, labels)
        loss.backward()  # backpropagation
        optimizer.step()
        running_loss += loss.item()
    # test
    # Testing does not update parameters via backpropagation.
    net.eval()  # evaluation mode
    acc = 0.0
    with torch.no_grad():  # no gradients needed during testing
        for test_data in tqdm(testloader):
            test_images, test_labels = test_data
            test_images, test_labels = test_images.to(DEVICE), test_labels.to(DEVICE)
            outputs = net(test_images)
            predict_y = torch.max(outputs, dim=1)[1]
            acc += (predict_y == test_labels).sum().item()
    accurate = acc / num_testset
    train_loss = running_loss / num_trainset
    print('[epoch %d] train_loss: %.3f  test_accuracy: %.3f' %
          (epoch + 1, train_loss, accurate))
    if accurate > best_acc:
        best_acc = accurate
        torch.save(net.state_dict(), save_path)

print('Finished Training')
Output result:
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
170499072it [00:30, 5634555.25it/s]
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
100%|██████████| 3125/3125 [02:18<00:00, 22.55it/s]
100%|██████████| 625/625 [00:12<00:00, 51.10it/s]
[epoch 1] train_loss: 0.101 test_accuracy: 0.516
100%|██████████| 3125/3125 [02:23<00:00, 21.78it/s]
100%|██████████| 625/625 [00:11<00:00, 54.31it/s]
[epoch 2] train_loss: 0.079 test_accuracy: 0.612
100%|██████████| 3125/3125 [02:20<00:00, 22.17it/s]
100%|██████████| 625/625 [00:11<00:00, 54.28it/s]
[epoch 3] train_loss: 0.066 test_accuracy: 0.672
100%|██████████| 3125/3125 [02:21<00:00, 22.09it/s]
100%|██████████| 625/625 [00:11<00:00, 55.52it/s]
[epoch 4] train_loss: 0.056 test_accuracy: 0.722
100%|██████████| 3125/3125 [02:13<00:00, 23.34it/s]
100%|██████████| 625/625 [00:11<00:00, 55.56it/s]
[epoch 5] train_loss: 0.048 test_accuracy: 0.748
100%|██████████| 3125/3125 [02:14<00:00, 23.31it/s]
100%|██████████| 625/625 [00:11<00:00, 52.19it/s]
[epoch 6] train_loss: 0.042 test_accuracy: 0.763
100%|██████████| 3125/3125 [02:14<00:00, 23.18it/s]
100%|██████████| 625/625 [00:11<00:00, 56.05it/s]
[epoch 7] train_loss: 0.035 test_accuracy: 0.781
100%|██████████| 3125/3125 [02:14<00:00, 23.27it/s]
100%|██████████| 625/625 [00:11<00:00, 55.88it/s]
[epoch 8] train_loss: 0.031 test_accuracy: 0.790
100%|██████████| 3125/3125 [02:13<00:00, 23.32it/s]
100%|██████████| 625/625 [00:11<00:00, 55.89it/s]
[epoch 9] train_loss: 0.026 test_accuracy: 0.801
100%|██████████| 3125/3125 [02:15<00:00, 22.99it/s]
100%|██████████| 625/625 [00:11<00:00, 55.95it/s]
[epoch 10] train_loss: 0.022 test_accuracy: 0.803
Finished Training
Process finished with exit code 0
Graphics card resource usage:
3.4 Viewing the model's weights
from model import MobileNetV1
import torch

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
net = MobileNetV1(num_classes=10)
net.load_state_dict(torch.load('./MobileNetV1.pth'))
net.to(DEVICE)
with torch.no_grad():
    for i in range(0, 14):  # view the depthwise weights (the first conv of each block)
        print(net.model[i][0].weight)
3.5 Testing on the CIFAR-10 dataset
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
import torch
import numpy as np
import matplotlib.pyplot as plt
from model import MobileNetV1
from torchvision.transforms import transforms
from torch.utils.data import DataLoader
import torchvision

classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# Preprocessing
transformer = transforms.Compose([transforms.Resize((224, 224)),
                                  transforms.ToTensor(),
                                  transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])
# Load the model
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
model = MobileNetV1(num_classes=10)
model.load_state_dict(torch.load('./MobileNetV1.pth'))
model.to(DEVICE)
# Load the data
testSet = torchvision.datasets.CIFAR10(root='./data', train=False, download=False, transform=transformer)
testLoader = DataLoader(testSet, batch_size=12, shuffle=True)
# Fetch one batch
imgs, labels = next(iter(testLoader))
imgs = imgs.to(DEVICE)
# show
with torch.no_grad():
    model.eval()
    prediction = model(imgs)  # predict
    prediction = torch.max(prediction, dim=1)[1]
    prediction = prediction.data.cpu().numpy()
plt.figure(figsize=(12, 8))
for i, (img, label) in enumerate(zip(imgs, labels)):
    x = np.transpose(img.data.cpu().numpy(), (1, 2, 0))  # CHW -> HWC image
    x[:, :, 0] = x[:, :, 0] * 0.229 + 0.485  # undo normalization
    x[:, :, 1] = x[:, :, 1] * 0.224 + 0.456  # undo normalization
    x[:, :, 2] = x[:, :, 2] * 0.225 + 0.406  # undo normalization
    y = label.numpy().item()  # label
    plt.subplot(3, 4, i + 1)
    plt.axis(False)
    plt.imshow(x)
    plt.title('R:{},P:{}'.format(classes[y], classes[prediction[i]]))
plt.show()
Results display:
4. MobileNet v2
Paper: MobileNetV2: Inverted Residuals and Linear Bottlenecks
MobileNet v2 mainly combines residual networks with depthwise separable convolution. By analyzing the manifold characteristics of individual channels, it improves the residual block with an expansion of the intermediate layer (d) and a linear activation in the bottleneck layer (c).
0 Preface
The features carried by the pixel values of each channel of a feature map can be mapped to a manifold region in a low-dimensional subspace. An activation layer is usually added after the convolution to increase the nonlinearity of the features; a common choice is ReLU. The activation process loses information, and this loss cannot be recovered. When the number of channels is small, ReLU's information loss is especially pronounced.
As shown in the figure below, the input is a matrix representing manifold data. Similar to a convolution, after n ReLU operations we obtain n feature maps, and the input is then reconstructed from these n feature maps. The closer the reconstruction is to the input, the less information was lost.
As can be seen from the figure, when the input dimension is 2 or 3, the output loses much of the input's information; but when the input dimension is 15 to 30, the output retains most of it. In general, when n is small, ReLU's information loss is severe; when n is large, the input manifold can be reconstructed well.
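This toy experiment can be reproduced in a few lines (a minimal sketch; the random embedding matrix and point count are illustrative):

import torch

torch.manual_seed(0)

def relu_roundtrip_error(n, num_points=1000):
    """Embed 2-D points into n dims with a random matrix T, apply ReLU,
    project back with the pseudo-inverse, and measure reconstruction error."""
    x = torch.randn(2, num_points)    # 2-D "manifold" points
    T = torch.randn(n, 2)             # random embedding into n dims
    y = torch.relu(T @ x)             # nonlinear activation in n dims
    x_hat = torch.linalg.pinv(T) @ y  # project back to 2-D
    return torch.norm(x - x_hat) / torch.norm(x)

for n in (2, 3, 15, 30):
    print(n, float(relu_roundtrip_error(n)))  # error shrinks as n grows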
Based on the analysis of this information-loss problem, there are two remedies:
- Replace ReLU: since the loss is caused by ReLU, replace ReLU with a linear activation function;
- Increase dimensionality: since more channels mean less information loss, increase the input dimensionality first.
The title of the MobileNet v2 paper is MobileNetV2: Inverted Residuals and Linear Bottlenecks; Linear Bottlenecks and Inverted Residuals are the core of MobileNet v2 and correspond exactly to the two ideas above.
1. Linear Bottlenecks
Replace the ReLU activation with a linear activation function; the resulting block is called a Linear Bottleneck in the paper, with the structure shown in the figure below:
Of course, not every ReLU can be replaced with a linear activation, or the network would degenerate into a single-layer linear network. The compromise is to use a linear activation in the bottleneck, where the output feature map has few channels, and ReLU elsewhere. The Linear Bottleneck block can be implemented as follows:
# Assumed imports for this sketch (standalone Keras):
from keras import backend as K
from keras.layers import Conv2D, DepthwiseConv2D, Activation, add

def relu6(x):
    # ReLU capped at 6, as used by the MobileNet family
    return K.relu(x, max_value=6)

def _bottleneck(inputs, nb_filters, t):
    # 1x1 conv: expand channels by factor t
    x = Conv2D(filters=nb_filters * t, kernel_size=(1, 1), padding='same')(inputs)
    x = Activation(relu6)(x)
    # 3x3 depthwise conv
    x = DepthwiseConv2D(kernel_size=(3, 3), padding='same')(x)
    x = Activation(relu6)(x)
    # 1x1 conv: project back down; no activation function (linear bottleneck)
    x = Conv2D(filters=nb_filters, kernel_size=(1, 1), padding='same')(x)
    # match channel counts before the residual add
    if not K.int_shape(inputs)[3] == nb_filters:
        inputs = Conv2D(filters=nb_filters, kernel_size=(1, 1), padding='same')(inputs)
    outputs = add([x, inputs])
    return outputs
2. Inverted Residual
Inverted Residuals literally means an inverted residual structure. Compare it with the ordinary residual structure: as the figure below shows, on the left is the residual block in ResNet, structured as 1x1 conv (reduce dims) -> 3x3 conv -> 1x1 conv (expand dims); on the right is the inverted residual block in MobileNet v2, structured as 1x1 conv (expand dims) -> 3x3 DW conv -> 1x1 conv (reduce dims). MobileNet v2 expands with a 1x1 convolution first because high-dimensional information loses less through the ReLU activation, so the expansion is performed before the activation.
Note that the shortcut connection exists only when s=1, i.e. when the stride is 1 (and the input and output feature maps have the same shape); with stride 2 there is no shortcut, as shown in the figure below.
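For reference, here is a minimal PyTorch sketch of an inverted residual block matching this description (expand_ratio = 6 is the expansion factor t commonly used in the paper; this is an illustration, not the official implementation):

import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch, out_ch, stride, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_shortcut = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            # 1x1 pointwise conv: expand dimensions
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise conv (stride 2 halves the spatial size)
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 pointwise conv: project back down, with a *linear* activation (no ReLU)
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return x + self.block(x) if self.use_shortcut else self.block(x)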
3. Network structure
MobileNet v2 uses fewer parameters, yet its mAP is comparable to other detectors and even exceeds YOLOv2. The effect is shown in the figure below:
4. Code implementation
MobileNet v2 can be implemented by stacking bottlenecks, as shown in the following code snippet:
# Assumed imports for this sketch:
from keras.layers import Input, Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dense
from keras.models import Model

def MobileNetV2_relu(input_shape, k):
    inputs = Input(shape=input_shape)
    x = Conv2D(filters=32, kernel_size=(3, 3), padding='same')(inputs)
    # _bottleneck_relu: the bottleneck block above, with expansion factor t=6
    x = _bottleneck_relu(x, 8, 6)
    x = MaxPooling2D((2, 2))(x)
    x = _bottleneck_relu(x, 16, 6)
    x = _bottleneck_relu(x, 16, 6)
    x = MaxPooling2D((2, 2))(x)
    x = _bottleneck_relu(x, 32, 6)
    x = GlobalAveragePooling2D()(x)
    x = Dense(128, activation='relu')(x)
    outputs = Dense(k, activation='softmax')(x)
    model = Model(inputs, outputs)
    return model
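Assuming _bottleneck_relu is the ReLU6 bottleneck block defined earlier (with the expansion factor t as its third argument), the model can then be built and inspected; the input shape and class count below are illustrative:

model = MobileNetV2_relu(input_shape=(224, 224, 3), k=10)
model.summary()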