1. MobileNet v1 network
A traditional convolutional neural network has a large number of parameters and demands a lot of computing power, so it is difficult to run on mobile and embedded devices. MobileNet makes it possible to run deep learning networks on such devices.
The MobileNet network was proposed by the Google team in 2017. It is a lightweight CNN aimed at mobile and embedded devices. Compared with a traditional CNN, it greatly reduces the number of parameters and the amount of computation at the cost of a small drop in accuracy. The network has the following two highlights:
- Depthwise Convolution (DW convolution, greatly reducing the number of parameters and computation)
- Two added hyperparameters: a width multiplier α that controls the number of convolution kernels, and a resolution multiplier that controls the size of the input image
In the traditional convolution operation, the number of channels of the convolution kernel is equal to the number of channels of the input feature matrix, and the number of channels of the output feature matrix is equal to the number of convolution kernels used:
For DW convolution, each convolution kernel has a single channel, and the number of channels of the input feature matrix, the number of convolution kernels, and the number of channels of the output feature matrix are all equal:
Depthwise Separable Conv consists of a DW convolution followed by a PW convolution. PW convolution is the same as a traditional convolution, except that the kernel size is 1×1. Theoretically, a traditional convolution with a 3×3 kernel requires 8 to 9 times as much computation as the Depthwise Separable Conv operation.
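The 8 to 9 times figure can be checked with a short calculation. The sketch below assumes the usual cost model, in which a convolution costs kernel height × kernel width × input channels × output channels × output height × output width multiply-accumulates. The function names and the example layer sizes are illustrative and do not come from the original article.

```python
def standard_conv_cost(k, c_in, c_out, feat):
    """Multiply-accumulates of a standard k x k convolution with a feat x feat output."""
    return k * k * c_in * c_out * feat * feat


def depthwise_separable_cost(k, c_in, c_out, feat):
    """Cost of a k x k DW convolution followed by a 1 x 1 PW convolution."""
    dw = k * k * c_in * feat * feat   # one single-channel kernel per input channel
    pw = c_in * c_out * feat * feat   # the 1 x 1 convolution mixes the channels
    return dw + pw


def cost_ratio(k, c_in, c_out, feat):
    """How many times more expensive the standard convolution is."""
    return standard_conv_cost(k, c_in, c_out, feat) / depthwise_separable_cost(k, c_in, c_out, feat)


# Example layer (assumed sizes): 3x3 kernel, 128 -> 256 channels, 56 x 56 output
ratio = cost_ratio(k=3, c_in=128, c_out=256, feat=56)
print(f"standard / depthwise separable = {ratio:.2f}")
```

Under this cost model the ratio simplifies to k²·c_out / (k² + c_out), so for a 3×3 kernel it approaches 9 as the number of output channels grows.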
2. MobileNet v2 network
When MobileNet v1 is used in practice, most of the parameters in many DW convolution kernels turn out to be 0, so those kernels have no effect. MobileNet v2 improves on this problem. The network has the following two highlights:
- Inverted Residual (inverted residual structure)
- Linear Bottlenecks
2.1 Inverted Residual
The traditional residual structure first uses a 1×1 convolution to reduce the dimension, then a 3×3 convolution, and finally a 1×1 convolution to increase the dimension again, forming a bottleneck that is wide at both ends and narrow in the middle. The inverted residual structure does the opposite: it first uses a 1×1 convolution to increase the dimension, then a 3×3 DW convolution, and finally a 1×1 convolution to reduce the dimension.
The ReLU activation function is used in the ordinary residual structure, while the ReLU6 activation function is used in the inverted residual structure:
- ReLU activation function: an input less than 0 is set to 0, and an input greater than 0 is passed through unchanged
- ReLU6 activation function: an input less than 0 is set to 0, an input between 0 and 6 is passed through unchanged, and an input greater than 6 is set to 6. The formula is $\mathrm{ReLU6}(x) = \min(\max(x, 0), 6)$
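The two activation functions described above can be written in a few lines of plain Python. This is only an illustrative sketch on scalars; in the PyTorch code later in the article the built-in nn.ReLU6 is used instead.

```python
def relu(x):
    """Negative inputs become 0, positive inputs pass through unchanged."""
    return max(0.0, x)


def relu6(x):
    """Same as ReLU, but the output is additionally capped at 6."""
    return min(max(0.0, x), 6.0)


for value in (-2.0, 3.0, 8.0):
    print(value, relu(value), relu6(value))
```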
2.2 Linear Bottlenecks
The last convolutional layer of the inverted residual structure uses a linear activation function instead of ReLU. The authors of the original paper showed experimentally that ReLU causes a large loss of information for low-dimensional features but only a small loss for high-dimensional features. Because the inverted residual structure is thin at both ends and thick in the middle, its output is a low-dimensional feature vector, so ReLU would lose a relatively large amount of information there. A linear activation function is used instead.
The MobileNet v2 block is shown in the figure below:
It should be noted that there is a shortcut connection only when stride=1 and the shape of the input feature matrix and the output feature matrix are the same (that is, the case on the left).
2.3 Network structure of MobileNet v2 model
- t is the expansion factor, that is, by how many times the input depth is expanded;
- c is the depth of the output feature matrix;
- n is the number of times the bottleneck is repeated;
- s is the stride of the first bottleneck layer in each group; all other layers use a stride of 1.
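To make the four parameters concrete, the sketch below expands each row into individual bottleneck blocks, reading the columns as t (expansion factor), c (output depth), n (repeats) and s (stride of the first layer). The rows are the values used in the implementation section of this article; the function name expand_blocks and the dictionary keys are hypothetical, and the first input depth of 32 assumes a width multiplier of 1.0.

```python
# (t, c, n, s) rows, as listed in the implementation section of this article
SETTINGS = [
    (1, 16, 1, 1),
    (6, 24, 2, 2),
    (6, 32, 3, 2),
    (6, 64, 4, 2),
    (6, 96, 3, 1),
    (6, 160, 3, 2),
    (6, 320, 1, 1),
]


def expand_blocks(settings, first_in=32):
    """List every bottleneck with its input, hidden and output depth and its stride."""
    blocks = []
    c_in = first_in
    for t, c, n, s in settings:
        for i in range(n):
            stride = s if i == 0 else 1          # only the first repeat uses stride s
            blocks.append({
                "in": c_in,
                "hidden": c_in * t,              # depth after the 1x1 expansion
                "out": c,
                "stride": stride,
                # shortcut only when stride is 1 and the shapes match
                "shortcut": stride == 1 and c_in == c,
            })
            c_in = c                             # output feeds the next block
    return blocks


blocks = expand_blocks(SETTINGS)
for block in blocks[:3]:
    print(block)
```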
3. MobileNet v3 network
MobileNet v3 has the following three improvements:
- Updated Block (bneck, a small modification of the inverted residual structure)
- Parameters found with NAS (Neural Architecture Search)
- Redesigned structure for some time-consuming layers
The Block update has two main highlights: an SE module has been added, and the activation function has been updated. The structure diagram is as follows:
3.1 Attention mechanism
Average pooling is applied to each channel of the feature matrix, giving a one-dimensional vector whose length equals the number of channels. This vector passes through a first fully connected layer, which has 1/4 as many nodes as there are channels and uses ReLU, and then through a second fully connected layer, which has as many nodes as there are channels and uses hard-sigmoid. The resulting vector holds one weight per channel of the feature matrix, and more important channels are assigned larger weights. The schematic diagram is as follows:
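The steps above can be traced by hand on a tiny example. The sketch below is plain Python on nested lists rather than tensors, and the weights are made-up numbers chosen so the arithmetic is easy to follow; it is meant only to illustrate the data flow, not to replace the PyTorch module shown later.

```python
def hard_sigmoid(x):
    """Piecewise-linear approximation of sigmoid: ReLU6(x + 3) / 6."""
    return min(max(0.0, x + 3.0), 6.0) / 6.0


def linear(vec, weights, bias):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def squeeze_excite(feature, w1, b1, w2, b2):
    """feature is a list of C channels, each an H x W list of lists."""
    # Squeeze: global average pooling, one number per channel
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    # First fully connected layer (C -> C/4) with ReLU
    hidden = [max(0.0, h) for h in linear(squeezed, w1, b1)]
    # Second fully connected layer (C/4 -> C) with hard-sigmoid
    scales = [hard_sigmoid(s) for s in linear(hidden, w2, b2)]
    # Excite: multiply every value of a channel by that channel's weight
    return [[[value * s for value in row] for row in ch] for ch, s in zip(feature, scales)]


# Four 2x2 channels filled with 1, 2, 3 and 4 (made-up data)
feature = [[[float(c)] * 2 for _ in range(2)] for c in (1, 2, 3, 4)]
w1, b1 = [[0.25, 0.25, 0.25, 0.25]], [0.0]          # 4 channels -> 1 node
w2, b2 = [[2.0], [0.0], [-2.0], [0.5]], [0.0] * 4   # 1 node -> 4 channels
output = squeeze_excite(feature, w1, b1, w2, b2)
print([ch[0][0] for ch in output])
```

With these weights the first channel keeps its full value, the second is halved, and the third is suppressed entirely, which is the reweighting behaviour the SE module is meant to provide.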
3.2 Redesign the time-consuming layer structure
The number of convolution kernels in the first convolutional layer is reduced from 32 to 16 without affecting accuracy, and the Last Stage is streamlined. The model before streamlining is as follows:
The simplified model:
3.3 Redesigning the activation function
A commonly used activation function at the time was swish:
$\mathrm{swish}(x) = x \cdot \sigma(x)$
where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function.
Computing this activation function and its derivative is expensive. In addition, models on mobile devices are quantized to speed them up, and swish is awkward to quantize. In response to these two problems, the h-swish activation function was proposed:
$\text{h-swish}(x) = x \cdot \frac{\mathrm{ReLU6}(x + 3)}{6}$
The second factor of the formula, $\frac{\mathrm{ReLU6}(x + 3)}{6}$, is the h-sigmoid activation function.
It can be seen from the figure that the two activation functions are very similar, h-swish can be used instead of swish, and the quantization operation can also be simplified.
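The similarity of the two functions can also be checked numerically. The sketch below implements both in plain Python and measures the largest gap between them on a grid from -6 to 6; the grid and the function names are choices made for this illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def swish(x):
    """swish(x) = x * sigmoid(x)"""
    return x * sigmoid(x)


def hard_sigmoid(x):
    """h-sigmoid(x) = ReLU6(x + 3) / 6"""
    return min(max(0.0, x + 3.0), 6.0) / 6.0


def hard_swish(x):
    """h-swish(x) = x * h-sigmoid(x); needs no exponential"""
    return x * hard_sigmoid(x)


# Largest absolute difference on a grid of step 0.1 over [-6, 6]
grid = [i / 10 for i in range(-60, 61)]
max_gap = max(abs(swish(x) - hard_swish(x)) for x in grid)
print(f"largest gap on [-6, 6]: {max_gap:.3f}")
```

Because h-swish uses only addition, multiplication and clipping, it avoids the exponential in sigmoid, which is what makes it cheaper to compute and easier to quantize.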
3.4 Network structure of MobileNet v3 model
4. Using PyTorch to implement MobileNet
4.1 MobileNet v1, v2
In the MobileNet v2 network, all convolutional layers are basically composed of convolution + BN + ReLU6 activation function. First define this combination:
import torch
from torch import nn


class ConvBNReLU(nn.Sequential):
    # Arguments: input depth, output depth, kernel size, stride;
    # groups=1 is an ordinary convolution, groups=in_channel is a DW convolution
    def __init__(self, in_channel, out_channel, kernel_size=3, stride=1, groups=1):
        # Padding
        padding = (kernel_size - 1) // 2
        super(ConvBNReLU, self).__init__(
            # Convolution: input depth, output depth, kernel size, stride, padding, groups; no bias is needed
            nn.Conv2d(in_channel, out_channel, kernel_size, stride, padding, groups=groups, bias=False),
            # BN layer; its input is the output of the convolution
            nn.BatchNorm2d(out_channel),
            # Activation function
            nn.ReLU6(inplace=True)
        )
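The padding rule (kernel_size - 1) // 2 used above keeps the spatial size unchanged at stride 1 and halves it at stride 2 for odd kernel sizes. A small sketch using the standard convolution output-size formula shows this; the helper name conv_output_size is hypothetical.

```python
def conv_output_size(size, kernel_size, stride):
    """Output height/width of a convolution with padding = (kernel_size - 1) // 2."""
    padding = (kernel_size - 1) // 2
    return (size + 2 * padding - kernel_size) // stride + 1


for kernel_size, stride in [(3, 1), (3, 2), (1, 1), (5, 2)]:
    print(kernel_size, stride, conv_output_size(224, kernel_size, stride))
```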
Inverted residual structure:
# Inverted residual structure
class InvertedResidual(nn.Module):
    # Arguments: input depth, output depth, stride, expansion factor
    def __init__(self, in_channel, out_channel, stride, expand_ratio):
        super(InvertedResidual, self).__init__()
        # Depth after the first (expansion) convolution
        hidden_channel = in_channel * expand_ratio
        # Use the shortcut branch only when stride is 1 and input and output depths are equal
        self.use_shortcut = stride == 1 and in_channel == out_channel
        # List of layers
        layers = []
        # When t = 1 the input is not expanded, so the first convolution is not needed
        if expand_ratio != 1:
            # 1x1 pointwise conv
            layers.append(ConvBNReLU(in_channel, hidden_channel, kernel_size=1))
        # extend appends several elements at once
        layers.extend([
            # 3x3 depthwise conv; for DW convolution, groups equals the input depth
            ConvBNReLU(hidden_channel, hidden_channel, stride=stride, groups=hidden_channel),
            # 1x1 pointwise conv (linear)
            nn.Conv2d(hidden_channel, out_channel, kernel_size=1, bias=False),
            # BN layer
            nn.BatchNorm2d(out_channel),
        ])
        # For an explanation of the * unpacking, see http://t.csdn.cn/CnTEA
        self.conv = nn.Sequential(*layers)

    # Forward pass; decides whether to use the shortcut branch
    def forward(self, x):
        if self.use_shortcut:
            return x + self.conv(x)
        else:
            return self.conv(x)
Define the network structure of MobileNet v2:
# Network structure
class MobileNetV2(nn.Module):
    # Arguments: number of classes, and the alpha (width multiplier) parameter from the v1 network
    # Note: _make_divisible is a helper that is not defined in this section;
    # it rounds a channel count to a multiple of round_nearest
    def __init__(self, num_classes=1000, alpha=1.0, round_nearest=8):
        super(MobileNetV2, self).__init__()
        # The block type; note that this assigns the class, it does not instantiate it
        block = InvertedResidual
        # Depth of the input feature matrix
        input_channel = _make_divisible(32 * alpha, round_nearest)
        # Number of output channels
        last_channel = _make_divisible(1280 * alpha, round_nearest)
        # The four parameters of the network structure table
        inverted_residual_setting = [
            # t, c, n, s
            [1, 16, 1, 1],
            [6, 24, 2, 2],
            [6, 32, 3, 2],
            [6, 64, 4, 2],
            [6, 96, 3, 1],
            [6, 160, 3, 2],
            [6, 320, 1, 1],
        ]
        features = []
        # First convolutional layer
        features.append(ConvBNReLU(3, input_channel, stride=2))
        # Build the inverted residual blocks
        for t, c, n, s in inverted_residual_setting:
            # Adjust the output depth of each block
            output_channel = _make_divisible(c * alpha, round_nearest)
            # Build the n repeats of this block
            for i in range(n):
                # s is the stride of the first layer; all other layers use 1
                stride = s if i == 0 else 1
                features.append(block(input_channel, output_channel, stride, expand_ratio=t))
                # The output depth becomes the input depth of the next layer
                input_channel = output_channel
        # Last convolutional layer of the feature extractor (1x1)
        features.append(ConvBNReLU(input_channel, last_channel, 1))
        # End of the feature extraction layers; wrap them in a Sequential
        self.features = nn.Sequential(*features)

        # Classifier
        # Average pooling layer; the output feature matrix has height and width 1x1
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        # Dropout layer and fully connected layer
        self.classifier = nn.Sequential(
            nn.Dropout(0.2),
            nn.Linear(last_channel, num_classes)
        )

        # Initialize the weights by iterating over every submodule
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # Convolutional layer: initialize the weights
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
                # If a bias exists, set it to 0
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm2d):
                # BN layer: set the weight (scale) to 1 and the bias (shift) to 0
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                # Fully connected layer: weights from a normal distribution with
                # mean 0 and standard deviation 0.01, bias set to 0
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)

    # Forward pass
    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        # Flatten the output
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x
4.2 MobileNet v3
Compared with v2, the MobileNet v3 network additionally uses an SE module:
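from functools import partial
from typing import Callable, List, Optional

import torch
from torch import nn, Tensor
from torch.nn import functional as F


# Attention module
class SqueezeExcitation(nn.Module):
    # squeeze_factor: the first fully connected layer has 1/4 as many nodes as the input has channels
    def __init__(self, input_c: int, squeeze_factor: int = 4):
        super(SqueezeExcitation, self).__init__()
        squeeze_c = _make_divisible(input_c // squeeze_factor, 8)
        # Fully connected layers, implemented as 1x1 convolutions
        self.fc1 = nn.Conv2d(input_c, squeeze_c, 1)
        self.fc2 = nn.Conv2d(squeeze_c, input_c, 1)

    def forward(self, x: Tensor):
        # Adaptive average pooling: reduce the data of each channel to a size of 1x1
        scale = F.adaptive_avg_pool2d(x, output_size=(1, 1))
        scale = self.fc1(scale)
        scale = F.relu(scale, inplace=True)
        scale = self.fc2(scale)
        # hardsigmoid activation function
        scale = F.hardsigmoid(scale, inplace=True)
        # Multiply the input by the channel weights
        return scale * x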
The complete Block structure:
# The complete Block structure
# ConvBNActivation and InvertedResidualConfig are assumed to be defined elsewhere;
# they are not shown in this section
class InvertedResidual(nn.Module):
    def __init__(self,
                 cnf: InvertedResidualConfig,
                 norm_layer: Callable[..., nn.Module]):
        super(InvertedResidual, self).__init__()
        # The stride can only be 1 or 2
        if cnf.stride not in [1, 2]:
            raise ValueError("illegal stride value.")
        # Decide whether to use the shortcut branch
        self.use_res_connect = (cnf.stride == 1 and cnf.input_c == cnf.out_c)
        layers: List[nn.Module] = []
        # Choose the activation function
        activation_layer = nn.Hardswish if cnf.use_hs else nn.ReLU
        # expand
        # Only the first bneck has no 1x1 expansion convolution
        if cnf.expanded_c != cnf.input_c:
            layers.append(ConvBNActivation(cnf.input_c,
                                           cnf.expanded_c,
                                           kernel_size=1,
                                           norm_layer=norm_layer,
                                           activation_layer=activation_layer))
        # depthwise
        layers.append(ConvBNActivation(cnf.expanded_c,
                                       cnf.expanded_c,
                                       kernel_size=cnf.kernel,
                                       stride=cnf.stride,
                                       groups=cnf.expanded_c,
                                       norm_layer=norm_layer,
                                       activation_layer=activation_layer))
        if cnf.use_se:
            layers.append(SqueezeExcitation(cnf.expanded_c))
        # project
        layers.append(ConvBNActivation(cnf.expanded_c,
                                       cnf.out_c,
                                       kernel_size=1,
                                       norm_layer=norm_layer,
                                       # Linear activation: the output is left unchanged
                                       activation_layer=nn.Identity))
        self.block = nn.Sequential(*layers)
        self.out_channels = cnf.out_c
        self.is_strided = cnf.stride > 1

    def forward(self, x: Tensor):
        result = self.block(x)
        # Add the shortcut branch if it is used
        if self.use_res_connect:
            result += x
        return result
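The block above reads its settings from an InvertedResidualConfig object whose definition is not shown in this article. The sketch below is a hypothetical stand-in that contains only the fields the block actually accesses (input_c, kernel, expanded_c, out_c, use_se, use_hs, stride); the real class may take different constructor arguments, and the two example configurations are illustrative values.

```python
from dataclasses import dataclass


@dataclass
class InvertedResidualConfig:
    """Hypothetical stand-in holding only the fields read by InvertedResidual."""
    input_c: int      # depth of the input feature matrix
    kernel: int       # kernel size of the DW convolution
    expanded_c: int   # depth after the 1x1 expansion convolution
    out_c: int        # depth of the output feature matrix
    use_se: bool      # whether to insert the SE module
    use_hs: bool      # True for Hardswish, False for ReLU
    stride: int       # stride of the DW convolution


def uses_shortcut(cnf):
    """Same condition as use_res_connect in the block."""
    return cnf.stride == 1 and cnf.input_c == cnf.out_c


def has_expand_layer(cnf):
    """The 1x1 expansion convolution is skipped when the depth does not change."""
    return cnf.expanded_c != cnf.input_c


first = InvertedResidualConfig(16, 3, 16, 16, use_se=False, use_hs=False, stride=1)
second = InvertedResidualConfig(16, 3, 64, 24, use_se=False, use_hs=False, stride=2)
print(uses_shortcut(first), has_expand_layer(first))
print(uses_shortcut(second), has_expand_layer(second))
```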
MobileNet v3 network structure:
class MobileNetV3(nn.Module):
    # Arguments: list of block configurations, number of output nodes of the
    # second-to-last fully connected layer, number of classes, the InvertedResidual module
    def __init__(self,
                 inverted_residual_setting: List[InvertedResidualConfig],
                 last_channel: int,
                 num_classes: int = 1000,
                 block: Optional[Callable[..., nn.Module]] = None,
                 norm_layer: Optional[Callable[..., nn.Module]] = None):
        super(MobileNetV3, self).__init__()
        # Validate the input
        if not inverted_residual_setting:
            raise ValueError("The inverted_residual_setting should not be empty.")
        elif not (isinstance(inverted_residual_setting, List) and
                  all([isinstance(s, InvertedResidualConfig) for s in inverted_residual_setting])):
            raise TypeError("The inverted_residual_setting should be List[InvertedResidualConfig]")
        if block is None:
            block = InvertedResidual
        if norm_layer is None:
            # For the usage of partial, see http://t.csdn.cn/k80JE
            norm_layer = partial(nn.BatchNorm2d, eps=0.001, momentum=0.01)
        layers: List[nn.Module] = []
        # building first layer
        # Output channels of the first convolutional layer
        firstconv_output_c = inverted_residual_setting[0].input_c
        layers.append(ConvBNActivation(3,
                                       firstconv_output_c,
                                       kernel_size=3,
                                       stride=2,
                                       norm_layer=norm_layer,
                                       activation_layer=nn.Hardswish))
        # building inverted residual blocks
        for cnf in inverted_residual_setting:
            layers.append(block(cnf, norm_layer))
        # building last several layers
        # Output channels of the last bneck structure
        lastconv_input_c = inverted_residual_setting[-1].out_c
        # e.g. 160 * 6
        lastconv_output_c = 6 * lastconv_input_c
        # Last convolutional layer
        layers.append(ConvBNActivation(lastconv_input_c,
                                       lastconv_output_c,
                                       kernel_size=1,
                                       norm_layer=norm_layer,
                                       activation_layer=nn.Hardswish))
        self.features = nn.Sequential(*layers)
        # Pooling layer and two fully connected layers
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Sequential(nn.Linear(lastconv_output_c, last_channel),
                                        nn.Hardswish(inplace=True),
                                        nn.Dropout(p=0.2, inplace=True),
                                        nn.Linear(last_channel, num_classes))
        # initial weights
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out")
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)

    def _forward_impl(self, x: Tensor):
        x = self.features(x)
        x = self.avgpool(x)
        # Flatten
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

    def forward(self, x: Tensor):
        return self._forward_impl(x)
5. Training Results
Results for MobileNet v2 with pre-trained weights, training only the fully connected layer:
using cuda:0 device.
Using 4 dataloader workers every process
using 3306 images for training, 364 images for validation.
train epoch[1/5] loss:1.007: 100%|██████████| 207/207 [00:09<00:00, 22.49it/s]
valid epoch[1/5]: 100%|██████████| 23/23 [00:03<00:00, 6.31it/s]
[epoch 1] train_loss: 1.245 val_accuracy: 0.794
train epoch[2/5] loss:0.813: 100%|██████████| 207/207 [00:07<00:00, 27.63it/s]
valid epoch[2/5]: 100%|██████████| 23/23 [00:03<00:00, 6.26it/s]
[epoch 2] train_loss: 0.864 val_accuracy: 0.824
train epoch[3/5] loss:0.580: 100%|██████████| 207/207 [00:07<00:00, 28.38it/s]
valid epoch[3/5]: 100%|██████████| 23/23 [00:03<00:00, 6.28it/s]
[epoch 3] train_loss: 0.716 val_accuracy: 0.865
train epoch[4/5] loss:0.818: 100%|██████████| 207/207 [00:07<00:00, 28.45it/s]
valid epoch[4/5]: 100%|██████████| 23/23 [00:03<00:00, 6.46it/s]
[epoch 4] train_loss: 0.626 val_accuracy: 0.857
train epoch[5/5] loss:0.488: 100%|██████████| 207/207 [00:07<00:00, 28.32it/s]
valid epoch[5/5]: 100%|██████████| 23/23 [00:03<00:00, 6.34it/s]
[epoch 5] train_loss: 0.587 val_accuracy: 0.857
Finished Training
Results for MobileNet v3 large with pre-trained weights, training only the fully connected layer:
using cuda:0 device.
Using 4 dataloader workers every process
using 3306 images for training, 364 images for validation.
train epoch[1/5] loss:0.831: 100%|██████████| 207/207 [00:09<00:00, 22.47it/s]
valid epoch[1/5]: 100%|██████████| 23/23 [00:03<00:00, 6.40it/s]
[epoch 1] train_loss: 0.889 val_accuracy: 0.868
train epoch[2/5] loss:0.824: 100%|██████████| 207/207 [00:07<00:00, 27.67it/s]
valid epoch[2/5]: 100%|██████████| 23/23 [00:03<00:00, 6.50it/s]
[epoch 2] train_loss: 0.508 val_accuracy: 0.887
train epoch[3/5] loss:0.370: 100%|██████████| 207/207 [00:07<00:00, 27.11it/s]
valid epoch[3/5]: 100%|██████████| 23/23 [00:03<00:00, 6.56it/s]
[epoch 3] train_loss: 0.451 val_accuracy: 0.901
train epoch[4/5] loss:0.841: 100%|██████████| 207/207 [00:07<00:00, 27.61it/s]
valid epoch[4/5]: 100%|██████████| 23/23 [00:03<00:00, 6.34it/s]
[epoch 4] train_loss: 0.411 val_accuracy: 0.904
train epoch[5/5] loss:0.195: 100%|██████████| 207/207 [00:07<00:00, 27.60it/s]
valid epoch[5/5]: 100%|██████████| 23/23 [00:03<00:00, 6.32it/s]
[epoch 5] train_loss: 0.378 val_accuracy: 0.904
Finished Training