This article gives only a brief walkthrough of the structure of the YOLOv5 neural network. For the underlying theory, please study neural networks on your own. (This article reads better on a computer.)
Table of contents
Define model: Define the model
forward: predict the input image
Before parsing yolo.py, we need to understand what the network structure of YOLOv5 looks like:
yolov5s.yaml:
(This file is really just a configuration file that describes how to build the model; we can use it as a reference when writing our own model configuration file.)
First, take a look at the YOLOv5 network structure shown in the figure:
Let's first look at the Backbone in the middle and at what the Head is:
# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, SPPF, [1024, 5]],  # 9
  ]
# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13
   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)
   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 20 (P4/16-medium)
   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 23 (P5/32-large)
   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]
backbone: here is what each of the numbers means
# 0-P1/2 — the first layer
from: where the input comes from; -1 means the input comes from the previous layer, and [-1, 6] means the inputs come from the previous layer and from layer 6 (for the first Concat in the head, the previous layer happens to be layer 11)
number: how many times the module is repeated; if number > 1, the actual count becomes max(round(number * depth_multiple), 1)
module: the module type (Conv, C3, etc.); these layer classes are defined in common.py
args: the arguments passed to the module; to see what each argument means, consult the corresponding module class in common.py
P1: pyramid level 1 (the first layer)
/2: the stride is 2, so the width and height of the image are halved (this is also why the input resolution must have width and height that are multiples of 32: the backbone downsamples by 2 five times, and 2^5 = 32)
# 1-P2/4 — the second layer:
and so on, layer by layer. A minimal sketch of how one of these rows maps to an actual layer follows below.
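To make the four fields concrete, here is a minimal sketch (not the real parse_model code, which we will read later) of how the first backbone row could be interpreted, assuming the yolov5s multiples depth_multiple = 0.33 and width_multiple = 0.50:

row = [-1, 1, 'Conv', [64, 6, 2, 2]]  # from, number, module, args
f, n, m, args = row
gd, gw = 0.33, 0.50  # depth_multiple, width_multiple

n = max(round(n * gd), 1) if n > 1 else n  # scale the repeat count (stays 1 here)
c1 = 3                                     # input channels: RGB for layer 0
c2 = int(args[0] * gw)                     # 64 * 0.50 = 32 output channels
# (the real code rounds c2 to a multiple of 8 with make_divisible, covered below)
print(f, n, m, [c1, c2, *args[1:]])        # -1 1 Conv [3, 32, 6, 2, 2]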
head: like the backbone, the head is also composed of various network layers
nn.Upsample: upsampling layer
Concat: a layer that concatenates the output features of several layers along the channel dimension
Detect: the detection (inference) layer
We can see that there are 25 network layers in total here (indices 0 to 24, Detect being layer 24). So how are these layers stacked together? Are they simply stacked one on top of another, layer by layer?
Obviously not. The actual network structure looks like this: (many analyses online draw the backbone stacked from top to bottom; here I follow the bottom-to-top layout of the network structure from Bilibili uploader 480920279)
The meaning of this figure is that we feed an RGB three-channel image into the network. After passing through the 10-layer Backbone, it enters the Head for upsampling, feature fusion, and so on, and finally three C3 layers feed their outputs into the Detect layer. From top to bottom, these three C3 layers are what we call the high-level, mid-level, and low-level feature layers.
The difference between these feature layers is that the low level detects small targets, the middle level detects medium-sized targets, and the high level detects large targets; combined, they produce the final predictions.
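As a concrete example (assuming the default 640×640 input size), the three detection feature maps work out to the following grid sizes:

# Grid sizes of the three detection levels for a 640x640 input;
# the strides 8/16/32 come from the P3/P4/P5 comments in the yaml above.
size = 640
for name, stride in [('P3/8  small targets', 8),
                     ('P4/16 medium targets', 16),
                     ('P5/32 large targets', 32)]:
    print(name, size // stride, 'x', size // stride)
# P3/8  small targets 80 x 80
# P4/16 medium targets 40 x 40
# P5/32 large targets 20 x 20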
# Parameters
nc: 6 # number of classes
depth_multiple: 0.33 # model depth multiple
width_multiple: 0.50 # layer channel multiple
anchors:
- [10,13, 16,30, 33,23] # P3/8
- [30,61, 62,45, 59,119] # P4/16
- [116,90, 156,198, 373,326] # P5/32
nc: number of target classes (number of classes)
anchors: three rows, one for each feature level
[10,13, 16,30, 33,23]: the 3 anchors of the low level (P3/8), sized 10×13, 16×30, 33×23
[30,61, 62,45, 59,119]: the 3 anchors of the middle level (P4/16)
depth_multiple: model depth multiple; when the model is created, number becomes max(round(number * depth_multiple), 1)
width_multiple: channel multiple; each layer's channel argument * width_multiple gives the number of output channels, and the result is rounded to a multiple of 8 with make_divisible
depth_multiple and width_multiple determine the complexity of the model: the larger the values, the more complex the model, the higher the accuracy, and the longer the inference time
In terms of accuracy the model files rank n < s < m < l < x (the higher the accuracy, the longer the inference time); the only differences between these files are the depth and channel multiples, listed below.
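For reference, the multiples in the official v6.0 model files are as follows (copied from the ultralytics repository; double-check against your version):

yolov5n: depth_multiple 0.33, width_multiple 0.25
yolov5s: depth_multiple 0.33, width_multiple 0.50
yolov5m: depth_multiple 0.67, width_multiple 0.75
yolov5l: depth_multiple 1.00, width_multiple 1.00
yolov5x: depth_multiple 1.33, width_multiple 1.25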
Therefore, we can write a model configuration file ourselves by referring to this manual.
After understanding the network structure, let's go into yolo.py and see how the network is actually built:
yolo.py:
if __name__ == '__main__':
create model: create the yolov5 model
(depending on the yolov5 version, this part may be written in a few different ways):
The first:
# Create model
im = torch.rand(opt.batch_size, 3, 640, 640).to(device)
model = Model(opt.cfg).to(device)
The second:
# Create model
model = Model(opt.cfg).to(device)
model.train()
# Profile
if opt.profile:
    img = torch.rand(8 if torch.cuda.is_available() else 1, 3, 640, 640).to(device)
    y = model(img, profile=True)
im / img: a randomly generated input image
model = Model(opt.cfg): builds the model ( --> the Model class ); y = model(img, profile=True) then runs a forward pass on the image
(the Profile branch is optional and only runs when opt.profile is set)
Model:
__init__: build the network structure
def __init__(self, cfg='yolov5s.yaml', ch=3, nc=None, anchors=None):  # model, input channels, number of classes
    super().__init__()
    if isinstance(cfg, dict):
        self.yaml = cfg  # model dict
    else:  # is *.yaml
        import yaml  # for torch hub
        self.yaml_file = Path(cfg).name
        with open(cfg, encoding='ascii', errors='ignore') as f:
            self.yaml = yaml.safe_load(f)  # model dict
cfg: the configuration file (yolov5s.yaml)
ch: number of input image channels
super().__init__(): initialize the nn.Module base class
isinstance(cfg, dict): check whether the configuration was already passed in as a dictionary; if not, treat it as the path of a *.yaml file
self.yaml_file: keeps just the file name
with: open and load the file; the key elements are stored as a dictionary in self.yaml
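A minimal sketch of what this loading step produces (assuming yolov5s.yaml sits in the current directory):

import yaml

with open('yolov5s.yaml', encoding='ascii', errors='ignore') as f:
    d = yaml.safe_load(f)
print(list(d.keys()))  # ['nc', 'depth_multiple', 'width_multiple', 'anchors', 'backbone', 'head']
print(d['nc'], len(d['backbone']), len(d['head']))  # 6 10 15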
Define model: Define the model
# Define model
ch = self.yaml['ch'] = self.yaml.get('ch', ch)  # input channels
if nc and nc != self.yaml['nc']:  # check whether the passed-in value matches the value in the yaml
    LOGGER.info(f"Overriding model.yaml nc={self.yaml['nc']} with nc={nc}")
    self.yaml['nc'] = nc  # override yaml value
if anchors:
    LOGGER.info(f'Overriding model.yaml anchors with anchors={anchors}')
    self.yaml['anchors'] = round(anchors)  # override yaml value
self.model, self.save = parse_model(deepcopy(self.yaml), ch=[ch])  # model, savelist
self.names = [str(i) for i in range(self.yaml['nc'])]  # default names
self.inplace = self.yaml.get('inplace', True)
ch: the number of input channels (taken from the yaml if it defines one, otherwise the ch argument)
nc, anchors: override the class count / anchors from the yaml when they are passed in explicitly
.model: build the model ( --> parse_model )
.names: default category names
.inplace: read the inplace flag from the yaml (defaults to True)
Build:
# Build strides, anchors
m = self.model[-1]  # Detect()
if isinstance(m, Detect):
    s = 256  # 2x min stride
    m.inplace = self.inplace
    m.stride = torch.tensor([s / x.shape[-2] for x in self.forward(torch.zeros(1, ch, s, s))])  # forward: [8, 16, 32]
    m.anchors /= m.stride.view(-1, 1, 1)
    check_anchor_order(m)
    self.stride = m.stride
    self._initialize_biases()  # only run once
isinstance(m, Detect): check whether the last layer of the model is a Detect layer
m.stride: run a dummy s×s image through the low/middle/high feature levels; dividing the input size by each output feature map's size gives the strides [8, 16, 32]
m.anchors /=: divide the anchors by the strides, converting them from pixels to grid units
check_anchor_order: check that the anchor order matches the stride order
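As a worked example of the two lines above (the numbers follow from s = 256 and the P3/P4/P5 feature map sizes):

import torch

s = 256
heights = [32, 16, 8]  # feature map heights the dummy forward pass returns at P3/P4/P5
stride = torch.tensor([s / h for h in heights])
print(stride)  # tensor([ 8., 16., 32.])

anchor_p3 = torch.tensor([10., 13.])  # first P3 anchor, in pixels
print(anchor_p3 / stride[0])          # tensor([1.2500, 1.6250]) in grid units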
forward: predict the input image
parse_model:
LOGGER.info(f"\n{'':>3}{'from':>18}{'n':>3}{'params':>10} {'module':<40}{'arguments':<30}")
anchors, nc, gd, gw = d['anchors'], d['nc'], d['depth_multiple'], d['width_multiple']
na = (len(anchors[0]) // 2) if isinstance(anchors, list) else anchors # number of anchors
no = na * (nc + 5) # number of outputs = anchors * (classes + 5)
layers, save, c2 = [], [], ch[-1] # layers, savelist, ch out
.info: print the header of the per-layer information table
Read the yaml parameters:
na: number of anchors per feature level
no: number of Detect output channels = na * (nc + 5); the 5 is the 4 box coordinates plus 1 objectness confidence (with nc=80 this is the familiar 255)
layers (stores every network layer as it is created), save (records which feature layers must be saved), c2 (output channel count of the previous layer)
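A quick check of these two formulas with concrete numbers (nc = 6 in this custom yaml; the official file uses nc = 80):

anchors = [[10, 13, 16, 30, 33, 23],
           [30, 61, 62, 45, 59, 119],
           [116, 90, 156, 198, 373, 326]]
na = len(anchors[0]) // 2  # 6 numbers per row = 3 (width, height) pairs -> na = 3
for nc in (6, 80):
    print(nc, na * (nc + 5))  # nc=6 -> 33 outputs, nc=80 -> 255 outputs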
for i, (f, n, m, args) in enumerate(d['backbone'] + d['head']):  # from, number, module, args
    # resolve the module; eval() is used instead of direct assignment to guard against format errors
    m = eval(m) if isinstance(m, str) else m  # eval strings
    for j, a in enumerate(args):
        try:
            # likewise, eval() instead of direct assignment to guard against format errors
            args[j] = eval(a) if isinstance(a, str) else a  # eval strings, [64, 6, 2, 2]
        except NameError:
            pass
This is where the module class and its args are obtained; eval() is presumably used so that names written as strings in the yaml are resolved to real Python objects rather than assigned directly.
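A minimal illustration of what eval() does here (assuming Conv has been imported from models/common.py, as it is at the top of yolo.py):

nc = 6
m = eval('Conv')  # the string 'Conv' from the yaml resolves to the class object itself
a = eval('nc')    # the string 'nc' in the Detect row resolves to the local variable nc
print(m, a)       # e.g. <class 'models.common.Conv'> 6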
The next step is to check what kind of layer the current row describes (a convolution-type layer, an upsampling layer, the detection layer, and so on) and handle each case accordingly:
n = n_ = max(round(n * gd), 1) if n > 1 else n  # depth gain: if n > 1, multiply n by the depth multiple
if m in [Conv, GhostConv, Bottleneck, GhostBottleneck, SPP, SPPF, DWConv, MixConv2d, Focus, CrossConv,
         BottleneckCSP, C3, C3TR, C3SPP, C3Ghost]:
    c1, c2 = ch[f], args[0]
    if c2 != no:  # if not output
        c2 = make_divisible(c2 * gw, 8)
    args = [c1, c2, *args[1:]]  # args [3, 32, 6, 2, 2]
    if m in [BottleneckCSP, C3, C3TR, C3Ghost]:
        args.insert(2, n)  # number of repeats
        n = 1
elif m is nn.BatchNorm2d:
    args = [ch[f]]
elif m is Concat:
    c2 = sum(ch[x] for x in f)
elif m is Detect:
    args.append([ch[x] for x in f])
    if isinstance(args[1], int):  # number of anchors
        args[1] = [list(range(args[1] * 2))] * len(f)
elif m is Contract:
    c2 = ch[f] * args[0] ** 2
elif m is Expand:
    c2 = ch[f] // args[0] ** 2
else:
    c2 = ch[f]
if m in [...]: determine the module type
Convolution-type layers: the input channels come from ch[f]; if the output channel count is not the final output no, multiply it by the channel multiple and round it to a multiple of 8 with make_divisible (multiples of 8 are friendlier to GPU computation)
C3-type layers: additionally insert the repeat count n into args and reset n to 1, so the repetition happens inside the module rather than via nn.Sequential
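make_divisible is a small helper from the utils module; here is a simplified sketch of it (check your version of the code) with two example values:

import math

def make_divisible(x, divisor):
    # round x up to the nearest multiple of divisor
    return math.ceil(x / divisor) * divisor

print(make_divisible(64 * 0.50, 8))   # 32 (already a multiple of 8)
print(make_divisible(100 * 0.50, 8))  # 56 (50 rounded up to a multiple of 8)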
m_ = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args) # module
t = str(m)[8:-2].replace('__main__.', '') # module type
np = sum(x.numel() for x in m_.parameters()) # number params
m_.i, m_.f, m_.type, m_.np = i, f, t, np # attach index, 'from' index, type, number params
LOGGER.info(f'{i:>3}{str(f):>18}{n_:>3}{np:10.0f} {t:<40}{str(args):<30}') # print
save.extend(x % i for x in ([f] if isinstance(f, int) else f) if x != -1) # append to savelist
layers.append(m_)
save.extend: records the feature layers whose outputs must be kept for later Concat/Detect inputs; for yolov5s this list ends up as [4, 6, 10, 14, 17, 20, 23]
ch.append(c2): stores the output channel count of each layer, so that one layer's output channels become the next layer's input channels
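To see how ch evolves, here are the first few backbone layers of yolov5s worked out by hand (width_multiple = 0.50; note that parse_model resets the list with "if i == 0: ch = []", so afterwards ch[i] holds the output channels of layer i):

layer 0: Conv, 64  * 0.50 -> 32     ch = [32]
layer 1: Conv, 128 * 0.50 -> 64     ch = [32, 64]
layer 2: C3,   128 * 0.50 -> 64     ch = [32, 64, 64]
layer 3: Conv, 256 * 0.50 -> 128    ch = [32, 64, 64, 128]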
common.py:
Take Conv as a simple example:
class Conv(nn.Module):
    # Standard convolution
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))
Conv:
__init__:
c1: number of input channels of this layer
c2: number of output channels of this layer
k=1: size of the convolution kernel
s=1: stride of the sliding convolution
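autopad is the helper that picks a padding value when none is given (reproduced from common.py; check your version), followed by a usage example of Conv:

def autopad(k, p=None):  # kernel, padding
    # if no padding is given, pad by k // 2 to preserve spatial size at stride 1
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p

# Usage (assumes the Conv class above and torch.nn are in scope):
import torch
layer0 = Conv(3, 32, 6, 2, 2)  # layer 0 of yolov5s after width scaling
print(layer0(torch.rand(1, 3, 640, 640)).shape)  # torch.Size([1, 32, 320, 320])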
C3: the CSP Bottleneck module with three convolutions, also defined in common.py; its first two arguments follow the same c1 (input channels) / c2 (output channels) pattern.
My knowledge is limited; if readers find any errors in this article, please let me know. I would be very grateful.