YOLOv4 object detection - Backbone

Backbone

Definition of activation function – Mish

import math
import torch
import torch.nn as nn
import numpy as np
import torch.nn.functional as F
from model.layers.attention_layers import SEModule, CBAM
import config.yolov4_config as cfg

class Mish(nn.Module):
	def __init__(self):
		super(Mish, self).__init__()
	def forward(self, x):
		return x*torch.tanh(F.softplus(x))	

This block defines the Mish activation function, which YOLOv4 uses in every convolution module of the backbone.
Advantages of the Mish activation function:
Mish is unbounded above (positive outputs can grow arbitrarily large), which avoids the saturation caused by capping. It lets small negative values through instead of cutting them off at a hard zero as ReLU does, which gives better gradient flow. It is also smooth, which helps information propagate deeper into the network and tends to improve accuracy and generalization.
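
To make these properties concrete, here is a minimal pure-Python sketch that evaluates the same formula as the forward method above, x * tanh(softplus(x)), on a few scalars. The helper names softplus and mish and the sample inputs are chosen for illustration only; the real module works on tensors.

```python
import math


def softplus(x):
    # Numerically stable form of ln(1 + e^x)
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    # Same formula as Mish.forward, applied to a single float
    return x * math.tanh(softplus(x))


for x in (-5.0, -1.0, 0.0, 1.0, 5.0, 50.0):
    print(x, round(mish(x), 4))
```

Negative inputs give small negative outputs rather than a hard zero, and large positive inputs pass through almost unchanged.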

For other activation functions, see this blog post by bubbliiiiing: Introduction to various activation functions (Activation Functions) and analysis of their advantages and disadvantages

Definition of global variables

norm_name = {"bn": nn.BatchNorm2d}
activate_name = {
    # "relu", "leaky" and "linear" map to classes; "mish" maps to an instance
    "relu": nn.ReLU,
    "leaky": nn.LeakyReLU,
    "linear": nn.Identity,
    "mish": Mish(),
}

These are the module-level lookup tables. Keeping the torch.nn layers in dictionaries lets the rest of the code select a normalization layer or an activation function by name, which keeps the code readable.

Definition of convolution module – CBM

class Convolutional(nn.Module):
    def __init__(
        self,
        filters_in,
        filters_out,
        kernel_size,
        stride=1,
        norm="bn",
        activate="mish",
    ):
        super(Convolutional, self).__init__()
        self.norm = norm
        self.activate = activate
        self.__conv = nn.Conv2d(
            in_channels=filters_in,
            out_channels=filters_out,
            kernel_size=kernel_size,
            stride=stride,
            padding=kernel_size // 2,
            bias=not norm,  # no conv bias when a normalization layer follows
        )
        if norm:
            assert norm in norm_name.keys()
            if norm == "bn":
                self.__norm = norm_name[norm](num_features=filters_out)
        if activate:
            assert activate in activate_name.keys()
            if activate == "leaky":
                self.__activate = activate_name[activate](
                    negative_slope=0.1, inplace=True
                )
            elif activate == "relu":
                self.__activate = activate_name[activate](inplace=True)
            elif activate == "mish":
                # activate_name["mish"] is already an instance
                self.__activate = activate_name[activate]

    def forward(self, x):
        # Order: convolution -> BatchNorm -> activation
        x = self.__conv(x)
        if self.norm:
            x = self.__norm(x)
        if self.activate:
            x = self.__activate(x)
        return x

This part defines the CBM convolution module: a convolution, followed by BatchNorm, followed by the Mish activation. The forward function applies them in exactly that order: the input x is convolved, then normalized by BN, then passed through Mish.
padding=kernel_size//2 uses floor division. Its purpose is to keep the feature map size the same regardless of the kernel size (note: this only holds for stride=1 and an odd kernel size).
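
The padding rule can be checked with the usual convolution output-size formula, floor((size + 2*padding - kernel_size) / stride) + 1 (dilation 1). The sketch below is only arithmetic; the function name conv_output_size and the 416-pixel input are assumptions made for illustration.

```python
def conv_output_size(size, kernel_size, stride=1):
    """Spatial output size of a convolution that uses padding = kernel_size // 2."""
    padding = kernel_size // 2
    return (size + 2 * padding - kernel_size) // stride + 1


for k in (1, 3, 5, 7):
    print(k, conv_output_size(416, k))       # every odd kernel keeps 416

print(conv_output_size(416, 3, stride=2))    # 208: stride 2 halves the map
print(conv_output_size(416, 4))              # 417: an even kernel breaks the rule
```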
The code handles three activation functions: leaky, relu, and mish. YOLOv4 uses mish.
bias=not norm means that the convolution has no bias term when BN is used, because BN already applies its own learnable shift. (If bias is not specified, nn.Conv2d defaults to bias=True.) YOLOv4 uses BN, so norm keeps its default value "bn", which is truthy: bias becomes False, and the if statement creates the BatchNorm2d layer and stores it for use in forward. The module is visualized in the figure:
[Figure: CBM module]
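
Why a convolution bias is redundant in front of BN can be seen with a few numbers: BN subtracts the per-channel mean, so any constant added to the channel beforehand cancels out. The following is a minimal sketch on one made-up channel of activations; the normalize helper is hypothetical and leaves out BN's learnable scale and shift.

```python
from statistics import fmean, pvariance


def normalize(values, eps=1e-5):
    # Batch-norm style normalization of one channel: (v - mean) / sqrt(var + eps)
    mean = fmean(values)
    var = pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


conv_out = [0.5, -1.2, 3.0, 0.1]   # made-up convolution outputs for one channel
bias = 2.5                         # made-up convolution bias

without_bias = normalize(conv_out)
with_bias = normalize([v + bias for v in conv_out])

# The two results agree up to floating-point rounding
print(max(abs(a - b) for a, b in zip(without_bias, with_bias)))
```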

Definition of small residual module – Resunit

class CSPBlock(nn.Module):
    def __init__(
        self,
        in_channels,
        out_channels,
        hidden_channels=None,
        residual_activation="linear",
    ):
        super(CSPBlock, self).__init__()
        if hidden_channels is None:
            hidden_channels = out_channels
        # Two CBM modules: a 1x1 convolution followed by a 3x3 convolution
        self.block = nn.Sequential(
            Convolutional(in_channels, hidden_channels, 1),
            Convolutional(hidden_channels, out_channels, 3),
        )
        # Stored here but never called in forward
        self.activation = activate_name[residual_activation]
        # Attention type comes from the config file
        self.attention = cfg.ATTENTION["TYPE"]
        if self.attention == "SEnet":
            self.attention_module = SEModule(out_channels)
        elif self.attention == "CBAM":
            self.attention_module = CBAM(out_channels)
        else:
            self.attention = None

    def forward(self, x):
        residual = x
        out = self.block(x)
        if self.attention is not None:
            out = self.attention_module(out)
        out += residual  # residual edge; needs in_channels == out_channels
        return out

This part defines the residual module Resunit, which consists of two small CBM modules and a residual edge. The two CBM modules are built in self.block: one has a kernel size of 1 and the other a kernel size of 3.
self.activation is defined but never called in this code. My understanding is that it holds the activation function of the residual edge, looked up under the key "linear", which the global dictionary above maps to nn.Identity. nn.Identity returns its input unchanged, so it acts only as a placeholder layer, a bridge; that is why the key is named linear, meaning a linear (identity) mapping. Since it has no real effect, the author does not seem to call it anywhere in the code that follows (or maybe it is called and I just didn't find it ~~doge).
With the convolution path of Resunit defined, there are three cases for the attention mechanism: SEnet, CBAM, and None (which one is used depends on the setting in the config file). For the role of these attention mechanisms in the YOLO algorithm, see this blog: SEnet, CBAM. In short, an attention mechanism can effectively improve the accuracy of image classification and object detection.
Now for the forward function. The input passes through self.block and the attention module (if any) to produce out. The original input is kept as residual (the residual edge in YOLO) and added to out to give the final output. That completes the residual module Resunit~ The network is visualized in the figure:
[Figure: Resunit module]
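
The control flow of the forward function can be mirrored on plain Python lists. This is only an illustrative sketch: residual_unit, zero_block and double_block are hypothetical stand-ins for the convolution path, not part of the original code.

```python
def residual_unit(x, block, attention=None):
    """Same steps as CSPBlock.forward, on a plain list of floats."""
    out = block(x)
    if attention is not None:
        out = attention(out)
    # Residual edge: element-wise add of the untouched input
    return [o + r for o, r in zip(out, x)]


def zero_block(values):
    # A block that has learned nothing: outputs all zeros
    return [0.0] * len(values)


def double_block(values):
    return [2.0 * v for v in values]


x = [1.0, -2.0, 3.0]
print(residual_unit(x, zero_block))     # [1.0, -2.0, 3.0]
print(residual_unit(x, double_block))   # [3.0, -6.0, 9.0]
```

With a block that outputs zeros the unit reduces to the identity mapping, which is the usual motivation for residual edges: the block only has to learn a correction on top of its input.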

Definition of Large Residual Module – CSP1

class CSPFirstStage(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(CSPFirstStage, self).__init__()
        # Downsampling CBM: a 3x3 convolution with stride 2
        self.downsample_conv = Convolutional(in_channels, out_channels, 3, stride=2)
        # Two 1x1 CBMs that feed the two branches
        self.split_conv0 = Convolutional(out_channels, out_channels, 1)
        self.split_conv1 = Convolutional(out_channels, out_channels, 1)
        # Residual branch: one Resunit followed by a 1x1 CBM
        self.blocks_conv = nn.Sequential(
            CSPBlock(out_channels, out_channels, in_channels),
            Convolutional(out_channels, out_channels, 1),
        )
        # Fuses the concatenated branches back to out_channels
        self.concat_conv = Convolutional(out_channels * 2, out_channels, 1)

    def forward(self, x):
        x = self.downsample_conv(x)
        x0 = self.split_conv0(x)
        x1 = self.split_conv1(x)
        x1 = self.blocks_conv(x1)
        x = torch.cat([x0, x1], dim=1)  # concatenate along the channel dimension
        x = self.concat_conv(x)
        return x

This part defines a large residual module. The small residual module defined above is an important component of it (see the YOLOv4 network diagram for details). Now for the code~
According to the YOLOv4 network diagram, every time the input passes through a large residual module it is downsampled once, so the output feature map is 1/2 the size of the input feature map. The module therefore starts with a downsampling CBM (stride=2). Next, to split the input into two branches, two more CBMs are defined: the output of one leads to the small residual module, and the output of the other leads to the residual edge. The small-residual branch contains a small residual module (CSPBlock) followed by a CBM, so self.blocks_conv is defined to build that branch.
The forward function shows the same flow. The input first passes through the downsampling CBM on the trunk, which every input must go through. The result is then split into two branches: one goes through the small residual module, the other through the residual edge of the large residual module. Finally, the outputs of the two branches are concatenated along the channel dimension with torch.cat and fused by self.concat_conv. The large residual module defined here is the first block of CSPDarknet53, that is, a CSP with only one Resunit. This part of the code is visualized in the figure:
[Figure: CSP1 module]
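
The channel counts in this module are easy to lose track of, so here is a small bookkeeping sketch that follows the constructor arguments of CSPFirstStage step by step. The function csp_first_stage_channels is hypothetical and ignores spatial size; the values 32 and 64 are the stem and first-stage widths used later in CSPDarknet53.

```python
def csp_first_stage_channels(in_channels, out_channels):
    """Trace channel counts through CSPFirstStage (spatial size ignored)."""
    trace = {}
    trace["downsample_conv"] = out_channels     # in_channels -> out_channels
    trace["split_conv0"] = out_channels         # residual edge (x0)
    trace["split_conv1"] = out_channels         # input of blocks_conv (x1)
    trace["csp_block_hidden"] = in_channels     # hidden width inside the Resunit
    trace["blocks_conv"] = out_channels         # Resunit + 1x1 CBM keep the width
    trace["cat"] = trace["split_conv0"] + trace["blocks_conv"]
    trace["concat_conv"] = out_channels         # 2 * out_channels -> out_channels
    return trace


for name, channels in csp_first_stage_channels(32, 64).items():
    print(name, channels)
```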

Definition of Large Residual Module - CSPx

class CSPStage(nn.Module):
    def __init__(self, in_channels, out_channels, num_blocks):
        super(CSPStage, self).__init__()

        # Downsampling CBM: a 3x3 convolution with stride 2
        self.downsample_conv = Convolutional(
            in_channels, out_channels, 3, stride=2
        )

        # Each branch carries half of the channels
        self.split_conv0 = Convolutional(out_channels, out_channels // 2, 1)
        self.split_conv1 = Convolutional(out_channels, out_channels // 2, 1)
        # Residual branch: num_blocks Resunits followed by a 1x1 CBM
        self.blocks_conv = nn.Sequential(
            *[
                CSPBlock(out_channels // 2, out_channels // 2)
                for _ in range(num_blocks)
            ],
            Convolutional(out_channels // 2, out_channels // 2, 1)
        )
        self.concat_conv = Convolutional(out_channels, out_channels, 1)

    def forward(self, x):
        x = self.downsample_conv(x)
        x0 = self.split_conv0(x)
        x1 = self.split_conv1(x)

        x1 = self.blocks_conv(x1)
        x = torch.cat([x0, x1], dim=1)  # two halves -> out_channels
        x = self.concat_conv(x)

        return x

This code is largely the same as the CSP1 code above. The first difference is that the number of Resunit components in the large residual module is configurable through an extra constructor argument, num_blocks. The second difference is in the channel counts: so that the concatenated feature map has exactly out_channels channels, the two split convolutions and the blocks on the residual branch all output out_channels//2 channels. The final concat_conv therefore maps out_channels to out_channels, unlike the (out_channels * 2, out_channels) used in CSPFirstStage.
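
As a rough check of the two differences, the sketch below computes the branch width, the concatenated width, and the number of Convolutional modules for a CSPStage. The helper csp_stage_summary is hypothetical; the (out_channels, num_blocks) pairs are the ones passed to CSPStage in the CSPDarknet53 class of this article.

```python
def csp_stage_summary(out_channels, num_blocks):
    """Channel widths and Convolutional-module count of one CSPStage."""
    half = out_channels // 2
    return {
        "branch": half,              # split_conv0, split_conv1, blocks_conv
        "cat": half + half,          # width after torch.cat
        "out": out_channels,         # width after concat_conv
        # 1 downsample + 2 split + 2 per Resunit + 1 trailing CBM + 1 concat_conv
        "conv_modules": 1 + 2 + 2 * num_blocks + 1 + 1,
    }


for out_channels, num_blocks in [(128, 2), (256, 8), (512, 8), (1024, 4)]:
    print(out_channels, num_blocks, csp_stage_summary(out_channels, num_blocks))
```

For even out_channels the concatenated width equals out_channels, which is why concat_conv can take out_channels as its input width here.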

CSPDarknet53 network construction

The above defines all the modules needed for the YOLOv4 backbone network. Now we can assemble CSPDarknet53 from them like building blocks!!

class CSPDarknet53(nn.Module):
    def __init__(
        self,
        stem_channels=32,
        feature_channels=[64, 128, 256, 512, 1024],
        num_features=3,
        weight_path=None,
        resume=False,
    ):
        super(CSPDarknet53, self).__init__()

        # Stem CBM: RGB image (3 channels) -> stem_channels, same spatial size
        self.stem_conv = Convolutional(3, stem_channels, 3)
        # Five CSP stages; the last argument of CSPStage is the number of Resunits
        self.stages = nn.ModuleList(
            [
                CSPFirstStage(stem_channels, feature_channels[0]),
                CSPStage(feature_channels[0], feature_channels[1], 2),
                CSPStage(feature_channels[1], feature_channels[2], 8),
                CSPStage(feature_channels[2], feature_channels[3], 8),
                CSPStage(feature_channels[3], feature_channels[4], 4),
            ]
        )
        self.feature_channels = feature_channels
        self.num_features = num_features

        if weight_path and not resume:
            self.load_CSPdarknet_weights(weight_path)
        else:
            self._initialize_weights()

    def forward(self, x):
        x = self.stem_conv(x)
        features = []
        for stage in self.stages:
            x = stage(x)
            features.append(x)

        # Only the last num_features outputs are passed on
        return features[-self.num_features:]

In self.stages, the network is built with nn.ModuleList(); the third argument of CSPStage is the number of Resunit components in that large residual module. In forward, an empty list features is created and a loop iterates over self.stages, which is the benefit of building the network with nn.ModuleList(). The output of each of the five CSP stages is appended to the list, and the function returns features[-self.num_features:], where self.num_features defaults to 3.
Why return only the last three feature maps? As the figure below shows, only three outputs are actually passed on to the next part of the network: the outputs of csp8, csp8, and csp4. Those are the last three entries of the list.
According to the network structure, the input first goes through a CBM convolution module, self.stem_conv. Its input channel count is 3 because the input is a color image with three RGB channels, and its output is a 32-channel feature map. This step only increases the number of channels; the feature map keeps the size of the original image (see the visualization of the whole network). After that, each CSP module performs one downsampling operation and one feature map concatenation, so each stage outputs a feature map that is 1/2 the size of its input with twice as many channels. CSPDarknet53 is shown in the figure:
[Figure: CSPDarknet53 network]
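
The halving of the feature map and doubling of the channels can be traced without running the network. The sketch below applies the convolution output-size formula to the stem and to the stride-2 convolution of each stage. The function backbone_shapes is hypothetical, and the 416-pixel input is an assumption chosen because it is divisible by 32.

```python
def backbone_shapes(
    input_size=416,
    feature_channels=(64, 128, 256, 512, 1024),
    num_features=3,
):
    """(channels, height, width) of the feature maps that forward would return."""

    def conv_out(size, kernel_size, stride):
        return (size + 2 * (kernel_size // 2) - kernel_size) // stride + 1

    size = conv_out(input_size, 3, 1)        # stem_conv keeps the spatial size
    shapes = []
    for channels in feature_channels:
        size = conv_out(size, 3, 2)          # downsample_conv of each stage
        shapes.append((channels, size, size))
    return shapes[-num_features:]


print(backbone_shapes())   # [(256, 52, 52), (512, 26, 26), (1024, 13, 13)]
```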

Weight initialization and loading

    def _initialize_weights(self):
        print("**" * 10, "Initing CSPDarknet53 weights", "**" * 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2.0 / n))
                if m.bias is not None:
                    m.bias.data.zero_()

                print("initing {}".format(m))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

                print("initing {}".format(m))
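
The convolution branch of _initialize_weights draws weights from a normal distribution with mean 0 and standard deviation sqrt(2 / n), where n = kernel_height * kernel_width * out_channels. This matches He (Kaiming) normal initialization in fan-out mode. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
import math


def he_normal_std(kernel_size, out_channels):
    """Standard deviation used for a conv layer in _initialize_weights."""
    kernel_h, kernel_w = kernel_size
    n = kernel_h * kernel_w * out_channels    # fan-out of the layer
    return math.sqrt(2.0 / n)


print(he_normal_std((3, 3), 32))   # stem conv: 3x3 kernel, 32 output channels
print(he_normal_std((1, 1), 64))   # a 1x1 conv with 64 output channels
```

Layers with more output channels or larger kernels get a smaller standard deviation.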

    def load_CSPdarknet_weights(self, weight_file, cutoff=52):
        "https://github.com/ultralytics/yolov3/blob/master/models.py"

        print("load darknet weights : ", weight_file)

        with open(weight_file, "rb") as f:
            _ = np.fromfile(f, dtype=np.int32, count=5)
            weights = np.fromfile(f, dtype=np.float32)
        count = 0
        ptr = 0
        for m in self.modules():
            if isinstance(m, Convolutional):
                # only initing backbone conv's weights
                # if count == cutoff:
                #     break
                # count += 1

                conv_layer = m._Convolutional__conv
                if m.norm == "bn":
                    # Load BN bias, weights, running mean and running variance
                    bn_layer = m._Convolutional__norm
                    num_b = bn_layer.bias.numel()  # Number of biases
                    # Bias
                    bn_b = torch.from_numpy(weights[ptr : ptr + num_b]).view_as(
                        bn_layer.bias.data
                    )
                    bn_layer.bias.data.copy_(bn_b)
                    ptr += num_b
                    # Weight
                    bn_w = torch.from_numpy(weights[ptr : ptr + num_b]).view_as(
                        bn_layer.weight.data
                    )
                    bn_layer.weight.data.copy_(bn_w)
                    ptr += num_b
                    # Running Mean
                    bn_rm = torch.from_numpy(
                        weights[ptr : ptr + num_b]
                    ).view_as(bn_layer.running_mean)
                    bn_layer.running_mean.data.copy_(bn_rm)
                    ptr += num_b
                    # Running Var
                    bn_rv = torch.from_numpy(
                        weights[ptr : ptr + num_b]
                    ).view_as(bn_layer.running_var)
                    bn_layer.running_var.data.copy_(bn_rv)
                    ptr += num_b

                    print("loading weight {}".format(bn_layer))
                else:
                    # Load conv. bias
                    num_b = conv_layer.bias.numel()
                    conv_b = torch.from_numpy(
                        weights[ptr : ptr + num_b]
                    ).view_as(conv_layer.bias.data)
                    conv_layer.bias.data.copy_(conv_b)
                    ptr += num_b
                # Load conv. weights
                num_w = conv_layer.weight.numel()
                conv_w = torch.from_numpy(weights[ptr : ptr + num_w]).view_as(
                    conv_layer.weight.data
                )
                conv_layer.weight.data.copy_(conv_w)
                ptr += num_w

                print("loading weight {}".format(conv_layer))
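
The loader above implies a simple file layout: five int32 header values that are read and discarded, followed by a flat run of float32 values. For each Convolutional module with BN the order is BN bias, BN weight, running mean, running variance, then the convolution weights; without BN, the convolution bias comes first. The sketch below parses a tiny fake file with that layout using only the standard library. The function read_darknet_weights, the layer tuple format, the header values, and the little-endian byte order are all assumptions made for illustration.

```python
import io
import struct


def read_darknet_weights(stream, layers):
    """layers: list of (name, out_channels, in_channels, kernel_size, has_bn)."""
    header = struct.unpack("<5i", stream.read(5 * 4))

    def take(count):
        # Read `count` float32 values from the current position
        return list(struct.unpack("<%df" % count, stream.read(4 * count)))

    parsed = {}
    for name, out_c, in_c, k, has_bn in layers:
        entry = {}
        if has_bn:
            for field in ("bn_bias", "bn_weight", "bn_running_mean", "bn_running_var"):
                entry[field] = take(out_c)
        else:
            entry["conv_bias"] = take(out_c)
        entry["conv_weight"] = take(out_c * in_c * k * k)
        parsed[name] = entry
    return header, parsed


# Fake file: one 1x1 conv with 1 input channel, 2 output channels, and BN
values = [float(i) for i in range(10)]
blob = struct.pack("<5i", 0, 2, 5, 0, 0) + struct.pack("<10f", *values)
header, parsed = read_darknet_weights(io.BytesIO(blob), [("stem", 2, 1, 1, True)])
print(parsed["stem"])
```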

Build models and model return values

def _BuildCSPDarknet53(weight_path, resume):
    model = CSPDarknet53(weight_path=weight_path, resume=resume)

    return model, model.feature_channels[-3:]

At this point, the backbone network of YOLOv4 is fully defined. But this is only the beginning, just the tip of the iceberg, and there is much more work still to do. Deep learning has a long way to go, to be continued~

The overall network structure of YOLOv4 is shown in the figure:

[Figure: overall network structure of YOLOv4]

Figure source: A complete explanation of Yolov3&Yolov4&Yolov5&Yolox core basic knowledge of Yolo series

[Figure]
Here, the input image size is 408*408 as an example.

Figure source: Wisdom target detection 32 - TF2 builds YoloV4 target detection platform (tensorflow2). I have to say, bubbliiing is really impressive.


Origin blog.csdn.net/ycx_ccc/article/details/122859505