Paper address: https://arxiv.org/abs/2206.04040
MobileOne comes from Apple. Its authors report an inference time of under 1 millisecond on an iPhone 12, which is where the "One" in the name MobileOne comes from. How quickly MobileOne was put into practice shows the potential of structural re-parameterization on mobile devices: it is simple, efficient, and plug-and-play.
The left part of Figure 3 shows a complete MobileOne building block. It has two parts: the upper part is built on depthwise convolution and the lower part on pointwise convolution. Both terms come from MobileNet. A depthwise convolution is essentially a grouped convolution whose number of groups g equals the number of input channels. A pointwise convolution is a 1×1 convolution.
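To make the saving concrete, here is a small sketch that counts the weights of a standard convolution and of the depthwise + pointwise pair described above. The helper names are mine (hypothetical, not from the paper or from yolov7), and bias terms and BN parameters are ignored.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a bias-free 2-D convolution with the given group number."""
    assert c_in % groups == 0 and c_out % groups == 0
    return c_out * (c_in // groups) * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k conv (groups == c_in) followed by a pointwise 1x1 conv."""
    depthwise = conv_params(c_in, c_in, k, groups=c_in)
    pointwise = conv_params(c_in, c_out, 1)
    return depthwise + pointwise


# A 3x3 convolution with 32 input and 32 output channels
standard = conv_params(32, 32, 3)
separable = depthwise_separable_params(32, 32, 3)
print(standard, separable)
```

For this 32-channel case the separable version needs roughly one seventh of the weights of the standard 3×3 convolution.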
The depthwise convolution module in Figure 3 has three branches. The leftmost branch is a 1×1 convolution; the middle branch is an over-parameterized 3×3 convolution, that is, k parallel 3×3 convolutions; the right branch is a skip connection containing a BN layer. Both the 1×1 and the 3×3 convolutions here are depthwise convolutions (grouped convolutions whose number of groups g equals the number of input channels).
The pointwise convolution module in Figure 3 has two branches. The left branch is an over-parameterized 1×1 convolution made up of k parallel 1×1 convolutions. The right branch is a skip connection containing a BN layer. During training, MobileOne is a stack of such building blocks. Once training is complete, the building block on the left of Figure 3 can be re-parameterized into the single-branch structure on the right of Figure 3.
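Re-parameterization works because, at inference time, every branch is a linear map: BN with fixed running statistics is just a scale and a shift, so it can be folded into the preceding convolution, and parallel convolutions can be summed into one kernel. The sketch below checks this on a single-channel 1-D signal in pure Python, which is enough because a depthwise convolution treats each channel independently. All weights and BN statistics are made-up illustrative values, and the helper names are hypothetical.

```python
import math


def conv1d_same(x, kernel, bias=0.0):
    """Zero-padded 'same' 1-D cross-correlation with an odd-length kernel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(w * padded[i + j] for j, w in enumerate(kernel)) + bias
            for i in range(len(x))]


def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BN: a fixed affine map built from running statistics."""
    std = math.sqrt(var + eps)
    return [gamma * (v - mean) / std + beta for v in x]


def fuse_conv_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into the preceding bias-free conv; returns (kernel, bias)."""
    std = math.sqrt(var + eps)
    return [w * gamma / std for w in kernel], beta - mean * gamma / std


# Made-up weights; each BN is given as (gamma, beta, running_mean, running_var).
conv3_branches = [([0.2, -0.5, 0.1], (1.1, 0.3, 0.05, 0.9)),
                  ([0.4, 0.0, -0.3], (0.8, -0.2, -0.10, 1.2))]  # k = 2
scale_branch = ([0.7], (0.9, 0.1, 0.2, 0.5))                    # 1-tap conv + BN
skip_bn = (1.2, -0.4, 0.3, 0.8)                                 # BN-only skip

x = [1.0, -2.0, 0.5, 3.0, -1.5]

# Training-time structure: run every branch separately and add the results.
outputs = [batch_norm(conv1d_same(x, k), *bn) for k, bn in conv3_branches]
outputs.append(batch_norm(conv1d_same(x, scale_branch[0]), *scale_branch[1]))
outputs.append(batch_norm(x, *skip_bn))
multi_branch = [sum(vals) for vals in zip(*outputs)]

# Inference-time structure: pad the 1-tap kernel and the identity to 3 taps,
# fold each BN into its kernel, then sum all kernels and biases.
branches = conv3_branches + [([0.0, scale_branch[0][0], 0.0], scale_branch[1]),
                             ([0.0, 1.0, 0.0], skip_bn)]
fused_kernel, fused_bias = [0.0, 0.0, 0.0], 0.0
for kernel, bn in branches:
    k, b = fuse_conv_bn(kernel, *bn)
    fused_kernel = [a + c for a, c in zip(fused_kernel, k)]
    fused_bias += b
single_conv = conv1d_same(x, fused_kernel, fused_bias)

max_error = max(abs(a - b) for a, b in zip(multi_branch, single_conv))
print(max_error)  # only floating-point rounding error remains
```

The same three steps (fold BN, pad the smaller kernels, sum) are what the `_fuse_bn_tensor` and `_get_kernel_bias` methods in the code further down do on 4-D weight tensors.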
The yolov7-tiny network is used for the demonstration here; the modification for v7 is almost the same. My idea is not to swap in the whole MobileOne backbone, but to keep every ELAN block of v7-tiny and replace the 3×3 convolutions inside each block with the re-parameterized depthwise-separable convolution from Figure 3. This keeps the overall structure of the network while adding the re-parameterized MobileOne block to it.
[-1, 1, Conv, [32, 1, 1, None, 1]],
[-2, 1, Conv, [32, 1, 1, None, 1]],
[-1, 1, Conv, [32, 3, 1, None, 1]], # replace
[-1, 1, Conv, [32, 3, 1, None, 1]], # replace
[[-1, -2, -3, -4], 1, Concat, [1]],
[-1, 1, Conv, [64, 1, 1, None, 1]],
The two lines marked "# replace" above are the ones that get replaced.
I have simplified the structure above; for reference, see the CSDN blog post "yolov7 simplified yaml configuration file".
First create yolov7-tiny-ELANMO.yaml
# parameters
nc: 80  # number of classes
depth_multiple: 1.0  # model depth multiple
width_multiple: 1.0  # layer channel multiple
activation: nn.ReLU()

# anchors
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

# yolov7-tiny backbone
backbone:
  # [from, number, module, args] c2, k=1, s=1, p=None, g=1, act=True, num_blocks_per_stage=1, num_conv_branches=4,
  [[-1, 1, Conv, [32, 3, 2, None, 1]],  # 0-P1/2
   [-1, 1, Conv, [64, 3, 2, None, 1]],  # 1-P2/4
   [-1, 1, ELANMO, [64, 1, 1, None, 1, 1, 4]],  # 2
   [-1, 1, MP, []],  # 3-P3/8
   [-1, 1, ELANMO, [128, 1, 1, None, 1, 1, 4]],  # 4
   [-1, 1, MP, []],  # 5-P4/16
   [-1, 1, ELANMO, [256, 1, 1, None, 1, 1, 4]],  # 6
   [-1, 1, MP, []],  # 7-P5/32
   [-1, 1, ELANMO, [512, 1, 1, None, 1, 1, 4]],  # 8
  ]

# yolov7-tiny head
head:
  [[-1, 1, SPPCSPCSIM, [256]],  # 9
   [-1, 1, Conv, [128, 1, 1, None, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [6, 1, Conv, [128, 1, 1, None, 1]],  # route backbone P4
   [[-1, -2], 1, Concat, [1]],  # 13
   [-1, 1, ELANMO, [128, 1, 1, None, 1, 1, 4]],  # 14
   [-1, 1, Conv, [64, 1, 1, None, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [4, 1, Conv, [64, 1, 1, None, 1]],  # route backbone P3
   [[-1, -2], 1, Concat, [1]],
   [-1, 1, ELANMO, [64, 1, 1, None, 1, 1, 4]],  # 19
   [-1, 1, Conv, [128, 3, 2, None, 1]],
   [[-1, 14], 1, Concat, [1]],
   [-1, 1, ELANMO, [128, 1, 1, None, 1, 1, 4]],  # 22
   [-1, 1, Conv, [256, 3, 2, None, 1]],
   [[-1, 9], 1, Concat, [1]],
   [-1, 1, ELANMO, [256, 1, 1, None, 1, 1, 4]],  # 25
   [19, 1, Conv, [128, 3, 1, None, 1]],
   [22, 1, Conv, [256, 3, 1, None, 1]],
   [25, 1, Conv, [512, 3, 1, None, 1]],
   [[26,27,28], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]
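The "from" column is easy to get wrong when layers are added or removed, so here is a small sanity check I find useful. It assumes the usual YOLO convention that a negative "from" value is relative to the current layer and a non-negative one is an absolute layer index. The list simply copies the (from, module) pairs of the config above with the args left out, and `resolve_sources` is a hypothetical helper, not part of yolov7.

```python
# (from, module) pairs copied from yolov7-tiny-ELANMO.yaml; args omitted.
LAYERS = [
    (-1, "Conv"), (-1, "Conv"), (-1, "ELANMO"), (-1, "MP"), (-1, "ELANMO"),
    (-1, "MP"), (-1, "ELANMO"), (-1, "MP"), (-1, "ELANMO"),          # 0-8 backbone
    (-1, "SPPCSPCSIM"), (-1, "Conv"), (-1, "nn.Upsample"), (6, "Conv"),
    ([-1, -2], "Concat"), (-1, "ELANMO"), (-1, "Conv"), (-1, "nn.Upsample"),
    (4, "Conv"), ([-1, -2], "Concat"), (-1, "ELANMO"), (-1, "Conv"),
    ([-1, 14], "Concat"), (-1, "ELANMO"), (-1, "Conv"), ([-1, 9], "Concat"),
    (-1, "ELANMO"), (19, "Conv"), (22, "Conv"), (25, "Conv"),
    ([26, 27, 28], "Detect"),                                         # 9-29 head
]


def resolve_sources(index, source):
    """Convert a `from` entry of layer `index` into absolute layer indices.

    A result of -1 (only possible for layer 0) stands for the network input.
    """
    sources = source if isinstance(source, list) else [source]
    return [index + s if s < 0 else s for s in sources]


detect_index = len(LAYERS) - 1
detect_inputs = resolve_sources(detect_index, LAYERS[detect_index][0])
# Each Detect input is a 3x3 Conv; step back once to the layer feeding it.
feeding_layers = [resolve_sources(i, LAYERS[i][0])[0] for i in detect_inputs]
feeding_modules = [LAYERS[i][1] for i in feeding_layers]
print(detect_inputs, feeding_layers, feeding_modules)
```

If the three Detect inputs do not trace back to the three head ELANMO blocks, one of the indices in the head has drifted.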
Add the following to common.py:
import torch.nn.functional as F


class SEBlock(nn.Module):
    """ Squeeze and Excite module.

        Pytorch implementation of `Squeeze-and-Excitation Networks` -
        https://arxiv.org/pdf/1709.01507.pdf
    """

    def __init__(self,
                 in_channels: int,
                 rd_ratio: float = 0.0625) -> None:
        """ Construct a Squeeze and Excite Module.

        :param in_channels: Number of input channels.
        :param rd_ratio: Input channel reduction ratio.
        """
        super(SEBlock, self).__init__()
        self.reduce = nn.Conv2d(in_channels=in_channels,
                                out_channels=int(in_channels * rd_ratio),
                                kernel_size=1,
                                stride=1,
                                bias=True)
        self.expand = nn.Conv2d(in_channels=int(in_channels * rd_ratio),
                                out_channels=in_channels,
                                kernel_size=1,
                                stride=1,
                                bias=True)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        """ Apply forward pass. """
        b, c, h, w = inputs.size()
        x = F.avg_pool2d(inputs, kernel_size=[h, w])
        x = self.reduce(x)
        x = F.relu(x)
        x = self.expand(x)
        x = torch.sigmoid(x)
        x = x.view(-1, c, 1, 1)
        return inputs * x


class MobileOneBlock(nn.Module):
    """ MobileOne building block.

        This block has a multi-branched architecture at train-time
        and plain-CNN style architecture at inference time
        For more details, please refer to our paper:
        `An Improved One millisecond Mobile Backbone` -
        https://arxiv.org/pdf/2206.04040.pdf
    """

    def __init__(self,
                 in_channels: int,
                 out_channels: int,
                 kernel_size: int,
                 stride: int = 1,
                 padding: int = 0,
                 dilation: int = 1,
                 groups: int = 1,
                 inference_mode: bool = False,
                 use_se: bool = False,
                 num_conv_branches: int = 1) -> None:
        """ Construct a MobileOneBlock module.

        :param in_channels: Number of channels in the input.
        :param out_channels: Number of channels produced by the block.
        :param kernel_size: Size of the convolution kernel.
        :param stride: Stride size.
        :param padding: Zero-padding size.
        :param dilation: Kernel dilation factor.
        :param groups: Group number.
        :param inference_mode: If True, instantiates model in inference mode.
        :param use_se: Whether to use SE-ReLU activations.
        :param num_conv_branches: Number of linear conv branches.
        """
        super(MobileOneBlock, self).__init__()
        self.inference_mode = inference_mode
        self.groups = groups
        self.stride = stride
        self.kernel_size = kernel_size
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.num_conv_branches = num_conv_branches

        # Check if SE-ReLU is requested
        if use_se:
            self.se = SEBlock(out_channels)
        else:
            self.se = nn.Identity()
        self.activation = nn.ReLU()

        if inference_mode:
            self.reparam_conv = nn.Conv2d(in_channels=in_channels,
                                          out_channels=out_channels,
                                          kernel_size=kernel_size,
                                          stride=stride,
                                          padding=padding,
                                          dilation=dilation,
                                          groups=groups,
                                          bias=True)
        else:
            # Re-parameterizable skip connection
            self.rbr_skip = nn.BatchNorm2d(num_features=in_channels) \
                if out_channels == in_channels and stride == 1 else None

            # Re-parameterizable conv branches
            rbr_conv = list()
            for _ in range(self.num_conv_branches):
                rbr_conv.append(self._conv_bn(kernel_size=kernel_size,
                                              padding=padding))
            self.rbr_conv = nn.ModuleList(rbr_conv)

            # Re-parameterizable scale branch
            self.rbr_scale = None
            if kernel_size > 1:
                self.rbr_scale = self._conv_bn(kernel_size=1,
                                               padding=0)

    def forward(self, x: torch.Tensor):
        """ Apply forward pass. """
        # Inference mode forward pass.
        if self.inference_mode:
            return self.activation(self.se(self.reparam_conv(x)))

        # Multi-branched train-time forward pass.
        # Skip branch output
        identity_out = 0
        if self.rbr_skip is not None:
            identity_out = self.rbr_skip(x)

        # Scale branch output
        scale_out = 0
        if self.rbr_scale is not None:
            scale_out = self.rbr_scale(x)

        # Other branches
        out = scale_out + identity_out
        for ix in range(self.num_conv_branches):
            out += self.rbr_conv[ix](x)

        return self.activation(self.se(out))

    def reparameterize(self):
        """ Following works like `RepVGG: Making VGG-style ConvNets Great Again` -
        https://arxiv.org/pdf/2101.03697.pdf. We re-parameterize multi-branched
        architecture used at training time to obtain a plain CNN-like structure
        for inference.
        """
        if self.inference_mode:
            return
        kernel, bias = self._get_kernel_bias()
        self.reparam_conv = nn.Conv2d(in_channels=self.rbr_conv[0].conv.in_channels,
                                      out_channels=self.rbr_conv[0].conv.out_channels,
                                      kernel_size=self.rbr_conv[0].conv.kernel_size,
                                      stride=self.rbr_conv[0].conv.stride,
                                      padding=self.rbr_conv[0].conv.padding,
                                      dilation=self.rbr_conv[0].conv.dilation,
                                      groups=self.rbr_conv[0].conv.groups,
                                      bias=True)
        self.reparam_conv.weight.data = kernel
        self.reparam_conv.bias.data = bias

        # Delete un-used branches
        for para in self.parameters():
            para.detach_()
        self.__delattr__('rbr_conv')
        self.__delattr__('rbr_scale')
        if hasattr(self, 'rbr_skip'):
            self.__delattr__('rbr_skip')

        self.inference_mode = True

    def _get_kernel_bias(self):
        """ Method to obtain re-parameterized kernel and bias.
        Reference: https://github.com/DingXiaoH/RepVGG/blob/main/repvgg.py#L83

        :return: Tuple of (kernel, bias) after fusing branches.
        """
        # get weights and bias of scale branch
        kernel_scale = 0
        bias_scale = 0
        if self.rbr_scale is not None:
            kernel_scale, bias_scale = self._fuse_bn_tensor(self.rbr_scale)
            # Pad scale branch kernel to match conv branch kernel size.
            pad = self.kernel_size // 2
            kernel_scale = torch.nn.functional.pad(kernel_scale,
                                                   [pad, pad, pad, pad])

        # get weights and bias of skip branch
        kernel_identity = 0
        bias_identity = 0
        if self.rbr_skip is not None:
            kernel_identity, bias_identity = self._fuse_bn_tensor(self.rbr_skip)

        # get weights and bias of conv branches
        kernel_conv = 0
        bias_conv = 0
        for ix in range(self.num_conv_branches):
            _kernel, _bias = self._fuse_bn_tensor(self.rbr_conv[ix])
            kernel_conv += _kernel
            bias_conv += _bias

        kernel_final = kernel_conv + kernel_scale + kernel_identity
        bias_final = bias_conv + bias_scale + bias_identity
        return kernel_final, bias_final

    def _fuse_bn_tensor(self, branch):
        """ Method to fuse batchnorm layer with preceding conv layer.
        Reference: https://github.com/DingXiaoH/RepVGG/blob/main/repvgg.py#L95

        :param branch:
        :return: Tuple of (kernel, bias) after fusing batchnorm.
        """
        if isinstance(branch, nn.Sequential):
            kernel = branch.conv.weight
            running_mean = branch.bn.running_mean
            running_var = branch.bn.running_var
            gamma = branch.bn.weight
            beta = branch.bn.bias
            eps = branch.bn.eps
        else:
            assert isinstance(branch, nn.BatchNorm2d)
            if not hasattr(self, 'id_tensor'):
                input_dim = self.in_channels // self.groups
                kernel_value = torch.zeros((self.in_channels,
                                            input_dim,
                                            self.kernel_size,
                                            self.kernel_size),
                                           dtype=branch.weight.dtype,
                                           device=branch.weight.device)
                for i in range(self.in_channels):
                    kernel_value[i, i % input_dim,
                                 self.kernel_size // 2,
                                 self.kernel_size // 2] = 1
                self.id_tensor = kernel_value
            kernel = self.id_tensor
            running_mean = branch.running_mean
            running_var = branch.running_var
            gamma = branch.weight
            beta = branch.bias
            eps = branch.eps
        std = (running_var + eps).sqrt()
        t = (gamma / std).reshape(-1, 1, 1, 1)
        return kernel * t, beta - running_mean * gamma / std

    def _conv_bn(self,
                 kernel_size: int,
                 padding: int) -> nn.Sequential:
        """ Helper method to construct conv-batchnorm layers.

        :param kernel_size: Size of the convolution kernel.
        :param padding: Zero-padding size.
        :return: Conv-BN module.
        """
        mod_list = nn.Sequential()
        mod_list.add_module('conv', nn.Conv2d(in_channels=self.in_channels,
                                              out_channels=self.out_channels,
                                              kernel_size=kernel_size,
                                              stride=self.stride,
                                              padding=padding,
                                              groups=self.groups,
                                              bias=False))
        mod_list.add_module('bn', nn.BatchNorm2d(num_features=self.out_channels))
        return mod_list


class ELANMO(nn.Module):
    # Yolov7 ELANMO with args(ch_in, ch_out, kernel, stride, padding, groups, num_blocks, num_conv, activation)
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1,
                 num_blocks_per_stage=1,
                 num_conv_branches=4,
                 act=True,
                 down_sample=False,
                 use_se=False,
                 inference_mode=False):
        """ Construct an ELAN module with MobileOneBlock.

        :param c1: Number of channels in the input.
        :param c2: Number of channels produced by the block.
        :param k: Size of the convolution kernel.
        :param s: Stride size.
        :param p: Zero-padding size.
        :param g: Group number.
        :param num_blocks_per_stage: Number of MobileOne blocks (depthwise + pointwise pairs) in each stage.
        :param num_conv_branches: Number of linear conv branches.
        :param act: If True, use activations.
        :param down_sample: If True, the first conv block of a stage uses stride 2.
        :param use_se: Whether to use SE-ReLU activations.
        :param inference_mode: If True, instantiates model in inference mode.
        """
        super().__init__()
        c_ = int(c2 // 2)
        c_out = c_ * 4
        self.inference_mode = inference_mode
        self.in_planes = c_
        self.down_sample = down_sample
        self.use_se = use_se
        self.num_blocks_per_stage = num_blocks_per_stage
        self.num_conv_branches = num_conv_branches
        # self.cur_layer_idx = 1

        self.cv1 = Conv(c1, c_, k=k, s=s, p=p, g=g, act=act)
        self.cv2 = Conv(c1, c_, k=k, s=s, p=p, g=g, act=act)
        # cv3 and cv4 replace the two 3x3 convolutions of the original ELAN block
        self.cv3 = self._make_stage(c_, self.num_blocks_per_stage, num_se_blocks=0)
        self.cv4 = self._make_stage(c_, self.num_blocks_per_stage, num_se_blocks=0)
        self.cv5 = Conv(c_out, c2, k=k, s=s, p=p, g=g, act=act)

    def _make_stage(self,
                    planes: int,
                    num_blocks: int,
                    num_se_blocks: int) -> nn.Sequential:
        """ Build a stage of MobileOne model.

        :param planes: Number of output channels.
        :param num_blocks: Number of blocks in this stage.
        :param num_se_blocks: Number of SE blocks in this stage.
        :return: A stage of MobileOne model.
        """
        # Get strides for all layers
        strides = [2 if self.down_sample else 1] + [1] * (num_blocks - 1)
        blocks = []
        for ix, stride in enumerate(strides):
            use_se = False
            if num_se_blocks > num_blocks:
                raise ValueError("Number of SE blocks cannot "
                                 "exceed number of layers.")
            if ix >= (num_blocks - num_se_blocks):
                use_se = True

            # Depthwise conv
            blocks.append(MobileOneBlock(in_channels=self.in_planes,
                                         out_channels=self.in_planes,
                                         kernel_size=3,
                                         stride=stride,
                                         padding=1,
                                         groups=self.in_planes,
                                         inference_mode=self.inference_mode,
                                         use_se=use_se,
                                         num_conv_branches=self.num_conv_branches))
            # Pointwise conv
            blocks.append(MobileOneBlock(in_channels=self.in_planes,
                                         out_channels=planes,
                                         kernel_size=1,
                                         stride=1,
                                         padding=0,
                                         groups=1,
                                         inference_mode=self.inference_mode,
                                         use_se=use_se,
                                         num_conv_branches=self.num_conv_branches))
            self.in_planes = planes
            # self.cur_layer_idx += 1
        return nn.Sequential(*blocks)

    def forward(self, x):
        x1 = self.cv1(x)
        x2 = self.cv2(x)
        x3 = self.cv3(x2)
        x4 = self.cv4(x3)
        x5 = torch.cat((x1, x2, x3, x4), 1)
        return self.cv5(x5)
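The channel bookkeeping inside ELANMO can be traced without PyTorch. The following is only a sketch that mirrors the arithmetic of the constructor above (c_ = c2 // 2, four branches concatenated, then cv5 projects back to c2); `elanmo_channels` is a hypothetical helper for illustration.

```python
def elanmo_channels(c1, c2):
    """Return (in, out) channel counts for each part of one ELANMO block."""
    c_ = c2 // 2          # hidden width used by every branch
    concat = 4 * c_       # x1, x2, x3 and x4 are concatenated on the channel axis
    return {
        "cv1": (c1, c_),
        "cv2": (c1, c_),
        "cv3": (c_, c_),  # MobileOne depthwise 3x3 + pointwise 1x1
        "cv4": (c_, c_),  # MobileOne depthwise 3x3 + pointwise 1x1
        "cv5": (concat, c2),
    }


# Layer 2 of the config: 64 channels in, 64 channels out
layer2 = elanmo_channels(64, 64)
print(layer2)
```

For layer 2 this gives a hidden width of 32 and a 128-channel concatenation, which cv5 projects back to 64 channels.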
Add ELANMO to the module tuple in parse_model in yolo.py:
if m in (Conv, GhostConv, Bottleneck, GhostBottleneck, SPP, SPPF, DWConv, MixConv2d, Focus, CrossConv,
         BottleneckCSP, C3, C3TR, C3SPP, C3Ghost, nn.ConvTranspose2d, DWConvTranspose2d, C3x, SPPCSPC, RepConv,
         RFEM, ELAN, SPPCSPCSIM, ELANMO):
    c1, c2 = ch[f], args[0]
    if c2 != no:  # if not output
        c2 = make_divisible(c2 * gw, 8)

    args = [c1, c2, *args[1:]]
    if m in [BottleneckCSP, C3, C3TR, C3Ghost, C3x]:
        args.insert(2, n)  # number of repeats
        n = 1
Also call reparameterize() from the fuse method of BaseModel in yolo.py, so that every MobileOneBlock is collapsed into a single convolution when the model is fused:
def fuse(self):  # fuse model Conv2d() + BatchNorm2d() layers
    LOGGER.info('Fusing layers... ')
    for m in self.model.modules():
        if isinstance(m, (Conv, DWConv)) and hasattr(m, 'bn'):
            m.conv = fuse_conv_and_bn(m.conv, m.bn)  # update conv
            delattr(m, 'bn')  # remove batchnorm
            m.forward = m.forward_fuse  # update forward
        if isinstance(m, RepConv):
            # print(f" fuse_repvgg_block")
            m.fuse_repvgg_block()
            # m.switch_to_deploy()
        if hasattr(m, 'reparameterize'):  # MobileOneBlock
            m.reparameterize()
    self.info()
    return self
Point yolo.py at the new configuration file and run it.
Parameter count and amount of computation of the original yolov7-tiny:
As can be seen, the modified model has far fewer parameters and much less computation than the original tiny model.
After exporting to ONNX, you can inspect the network structure. The following picture shows the original v7-tiny network structure:
Network structure with the MobileOne blocks added, before the re-parameterized branches are fused:
This structure looks complicated, but it becomes simple after fusion.
Network structure after the re-parameterized branches are fused:
After fusion, the two 3×3 convolutions in each ELAN block have effectively been replaced with depthwise-separable convolutions.