YOLOv7 learning record: principles + code introduction

The blogger is planning an object detection and tracking project and is considering a YOLO-series model as the detector. The YOLO family has now been updated to YOLOv7, so this post studies the relevant principles and works through the corresponding experiments.

Paper link: https://arxiv.org/abs/2207.02696


Network structure

YOLOv7 is the newest architecture in the YOLO series. In the range of 5 FPS to 160 FPS, its speed and accuracy exceed most known object detectors, and among real-time detectors running at 30 FPS or more on a V100 GPU, YOLOv7 has the highest accuracy. For different deployment environments (edge GPU, normal GPU, and cloud GPU), YOLOv7 provides three basic models: YOLOv7-tiny, YOLOv7, and YOLOv7-W6. Compared with other models in the YOLO series, YOLOv7's detection approach is similar to that of YOLOv4 and YOLOv5; its network architecture is shown in the figure below.
[Figure: YOLOv7 network architecture]
A more detailed model structure:
[Figure: detailed YOLOv7 model structure]

Workflow

The YOLOv7 network consists of four parts: Input, Backbone, Neck, and Head. The image is first preprocessed in the input stage with data augmentation and other operations, and then sent to the backbone, which extracts features from the processed image. The extracted features are then fused in the Neck module to obtain features at three sizes: large, medium, and small. Finally, the fused features are sent to the detection head, which outputs the detection results.

Backbone

The backbone of the YOLOv7 network is built mainly from standard convolutions, E-ELAN modules, MPConv modules, and the SPPCSPC module. The E-ELAN (Extended ELAN) module keeps the transition-layer structure of the original ELAN but changes the computational block, using the ideas of expand, shuffle, and merge cardinality to strengthen the network's learning ability without destroying the original gradient path. The SPPCSPC module adds several parallel MaxPool branches inside a series of convolutions, which avoids the distortion introduced by image-resizing operations and reduces the extraction of redundant features by the convolutional network. In the MPConv module, a MaxPool branch enlarges the receptive field of the current feature layer and is then fused with the output of a normal convolution branch, which improves the generalization of the network.

The input image first goes through feature extraction in the backbone. The extracted features can be called feature layers: they are feature sets of the input image. In the backbone part we obtain three feature layers for the next stage of network construction; I call these three layers the effective feature layers.

Neck: FPN + PAN structure

FPN (Feature Pyramid Network)
[Figure: FPN structure]
[Figure: PANet structure]
See the blogger's separate post for a detailed explanation of FPN and PANet.
In the Neck module, YOLOv7, like the YOLOv5 network, adopts the traditional PAFPN structure. The FPN is YoloV7's enhanced feature-extraction network: the three effective feature layers obtained in the backbone are fused in this part. The purpose of feature fusion is to combine feature information of different scales. In the FPN part, the effective feature layers are used to continue extracting features. YoloV7 still uses the PANet structure: the features are not only upsampled and fused top-down, but also downsampled again and fused a second time bottom-up.
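To make the top-down plus bottom-up flow concrete, here is a minimal, simplified sketch (illustration only): the fuse callable and the max pooling stand in for YOLOv7's Multi_Concat_Block and Transition_Block, and channel bookkeeping is omitted. The actual YOLOv7 neck code appears later in this post.

import torch
import torch.nn as nn

upsample   = nn.Upsample(scale_factor=2, mode="nearest")
downsample = nn.MaxPool2d(kernel_size=2, stride=2)  # stand-in for Transition_Block

def pafpn(feat1, feat2, feat3, fuse=lambda t: t):
    # feat1/feat2/feat3: effective feature layers, e.g. 80x80, 40x40, 20x20 for a 640x640 input
    # Top-down (FPN) path: upsample the deeper feature and fuse it with the shallower one
    p5 = feat3
    p4 = fuse(torch.cat([feat2, upsample(p5)], dim=1))
    p3 = fuse(torch.cat([feat1, upsample(p4)], dim=1))
    # Bottom-up (PAN) path: downsample again and fuse back into the deeper levels
    p4 = fuse(torch.cat([downsample(p3), p4], dim=1))
    p5 = fuse(torch.cat([downsample(p4), p5], dim=1))
    return p3, p4, p5

# Example with dummy tensors
p3, p4, p5 = pafpn(torch.randn(1, 128, 80, 80),
                   torch.randn(1, 256, 40, 40),
                   torch.randn(1, 512, 20, 20))
print(p3.shape, p4.shape, p5.shape)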

Head

In the detection head, the baseline YOLOv7 in the paper uses the IDetect head, which covers the three target sizes: large, medium, and small. The RepConv module differs in structure between training and inference; for details, refer to RepVGG, which introduced the idea of structural reparameterization.

Yolo Head acts as YoloV7's classifier and regressor. Through the Backbone and the FPN, we obtain three enhanced effective feature layers, each with its own width, height, and number of channels. We can regard such a feature map as a collection of feature points, each carrying three prior boxes, and each prior box carrying the features stored in the channels. What Yolo Head actually does is judge these feature points: it decides whether an object corresponds to any of the prior boxes on a feature point. As in previous YOLO versions, the head used by YoloV7 is not decoupled: classification and regression are implemented together in a single 1x1 convolution.

Backbone (code implementation)

1. Multi-branch stacking module (ELAN)

In the paper this module is called ELAN; the blogger thinks "multi-branch stacking module" describes it better, but that is just a personal interpretation.
Its structure is shown in the figure below; the name fits the picture quite well.
The ELAN module is an efficient network structure: by controlling the shortest and longest gradient paths, it lets the network learn more features and makes it more robust.
ELAN has two branches.
The first branch simply changes the number of channels with a 1x1 convolution.
The second branch is more involved: it first changes the number of channels with a 1x1 convolution, and then passes through four 3x3 convolution modules for feature extraction.
As shown in the figure, the selected branch outputs are finally stacked (concatenated) to produce the feature-extraction result.
[Figure: multi-branch stacking (ELAN) module]
This introduces the idea of residual-style structures: the module is built by stacking multiple convolution + batch normalization + activation blocks.
backbone.py

Multi-branch stacking module (Multi_Concat_Block)

class Multi_Concat_Block(nn.Module):
    def __init__(self, c1, c2, c3, n=4, e=1, ids=[0]):
        super(Multi_Concat_Block, self).__init__()
        c_ = int(c2 * e)        
        self.ids = ids
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = nn.ModuleList(
            [Conv(c_ if i ==0 else c2, c2, 3, 1) for i in range(n)]
        )
        self.cv4 = Conv(c_ * 2 + c2 * (len(ids) - 2), c3, 1, 1)
    def forward(self, x):
        x_1 = self.cv1(x)
        x_2 = self.cv2(x)      
        x_all = [x_1, x_2]
        for i in range(len(self.cv3)):
            x_2 = self.cv3[i](x_2)
            x_all.append(x_2)           
        out = self.cv4(torch.cat([x_all[id] for id in self.ids], 1))
        return out

All this stacking effectively corresponds to a denser residual-style structure. Residual networks are easy to optimize and can gain accuracy from considerably increased depth; their residual blocks use skip connections, which alleviate the vanishing-gradient problem caused by adding depth to deep neural networks.

Convolution + batch normalization + activation function (CBS module)

As the figure shows, the CBS module consists of a Conv layer (convolution), a BN layer (batch normalization), and a SiLU activation.
The SiLU activation function is the β = 1 special case of the Swish activation. The two formulas are as follows:
silu(x)=x⋅sigmoid(x)
swish(x)=x⋅sigmoid(βx)
[Figure: SiLU and Swish activation curves]
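A quick numerical check of the two formulas (a sketch; it only assumes PyTorch's built-in torch.nn.functional.silu):

import torch
import torch.nn.functional as F

x = torch.linspace(-6.0, 6.0, steps=13)

silu  = x * torch.sigmoid(x)          # silu(x)  = x * sigmoid(x)
beta  = 1.0
swish = x * torch.sigmoid(beta * x)   # swish(x) = x * sigmoid(beta * x); beta = 1 gives SiLU

print(torch.allclose(silu, F.silu(x)))   # True: matches PyTorch's built-in SiLU
print(torch.allclose(silu, swish))       # True when beta = 1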
In the architecture diagram, the CBS modules appear in three colors, which correspond to different kernel sizes (k) and strides (s).
The lightest color, used for the first CBS module, is a 1x1 convolution with stride 1.
The second, slightly darker color, used for the second CBS module, is a 3x3 convolution with stride 1.
The darkest color, used for the third CBS module, is a 3x3 convolution with stride 2.
The 1x1 convolution is mainly used to change the number of channels.
The 3x3 convolution with stride 1 is mainly used for feature extraction.
The 3x3 convolution with stride 2 is mainly used for downsampling.
The code is as follows:

class Conv(nn.Module):
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=SiLU()):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Conv, self).__init__()
        self.conv   = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn     = nn.BatchNorm2d(c2, eps=0.001, momentum=0.03)
        self.act    = nn.LeakyReLU(0.1, inplace=True) if act is True else (act if isinstance(act, nn.Module) else nn.Identity())
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
    def fuseforward(self, x):
        return self.act(self.conv(x))

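Note that the Conv module above calls an autopad helper that is not shown in this post. A minimal version consistent with how Conv calls it (and with the common YOLOv5/YOLOv7 implementations) would look like this sketch:

def autopad(k, p=None):
    # If no padding is specified, use "same"-style padding: half the kernel size
    # (element-wise when the kernel size is given as a tuple/list).
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p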

2. Transition_Block (downsampling)

YoloV7 uses an innovative transition module, Transition_Block, to perform downsampling. In convolutional neural networks, the common downsampling transition is either a 3x3 convolution with stride 2 or a 2x2 max pooling with stride 2. In YoloV7 the author combines the two: the transition module has two branches, as shown in the figure. The left branch is a max pooling with stride 2 followed by a 1x1 convolution; the right branch is a 1x1 convolution followed by a 3x3 convolution with stride 2. The outputs of the two branches are stacked (concatenated).
[Figure: Transition_Block structure]

Pooling (MP module)

[Figure: MP module structure]
The MP module has two branches and is used for downsampling.
The first branch first goes through a maxpool (max pooling), which performs the downsampling, and then through a 1x1 convolution to change the number of channels.
The second branch first goes through a 1x1 convolution to change the number of channels, and then through a 3x3 convolution with stride 2, which also downsamples.
Finally, the outputs of the two branches are concatenated to give the downsampled result.

class MP(nn.Module):
    def __init__(self, k=2):
        super(MP, self).__init__()
        self.m = nn.MaxPool2d(kernel_size=k, stride=k)
    def forward(self, x):
        return self.m(x)


Transition_Block (combining the two branches)

class Transition_Block(nn.Module):
    def __init__(self, c1, c2):
        super(Transition_Block, self).__init__()
        self.cv1 = Conv(c1, c2, 1, 1)
        self.cv2 = Conv(c1, c2, 1, 1)
        self.cv3 = Conv(c2, c2, 3, 2)     
        self.mp  = MP()
    def forward(self, x):
        x_1 = self.mp(x)
        x_1 = self.cv1(x_1)        
        x_2 = self.cv2(x)
        x_2 = self.cv3(x_2)     
        return torch.cat([x_2, x_1], 1)

Other

Activation function

class SiLU(nn.Module):  
    @staticmethod
    def forward(x):
        return x * torch.sigmoid(x)

Backbone main code

class Backbone(nn.Module):
    def __init__(self, transition_channels, block_channels, n, phi, pretrained=False):
        super().__init__()
        #-----------------------------------------------#
        #   The input image is 640, 640, 3
        #-----------------------------------------------#
        ids = {
            'l' : [-1, -3, -5, -6],
            'x' : [-1, -3, -5, -7, -8], 
        }[phi]
        self.stem = nn.Sequential(
            Conv(3, transition_channels, 3, 1),
            Conv(transition_channels, transition_channels * 2, 3, 2),
            Conv(transition_channels * 2, transition_channels * 2, 3, 1),
        )
        self.dark2 = nn.Sequential(
            Conv(transition_channels * 2, transition_channels * 4, 3, 2),
            Multi_Concat_Block(transition_channels * 4, block_channels * 2, transition_channels * 8, n=n, ids=ids),
        )
        self.dark3 = nn.Sequential(
            Transition_Block(transition_channels * 8, transition_channels * 4),
            Multi_Concat_Block(transition_channels * 8, block_channels * 4, transition_channels * 16, n=n, ids=ids),
        )
        self.dark4 = nn.Sequential(
            Transition_Block(transition_channels * 16, transition_channels * 8),
            Multi_Concat_Block(transition_channels * 16, block_channels * 8, transition_channels * 32, n=n, ids=ids),
        )
        self.dark5 = nn.Sequential(
            Transition_Block(transition_channels * 32, transition_channels * 16),
            Multi_Concat_Block(transition_channels * 32, block_channels * 8, transition_channels * 32, n=n, ids=ids),
        )      
        if pretrained:
            url = {
                "l" : 'https://github.com/bubbliiiing/yolov7-pytorch/releases/download/v1.0/yolov7_backbone_weights.pth',
                "x" : 'https://github.com/bubbliiiing/yolov7-pytorch/releases/download/v1.0/yolov7_x_backbone_weights.pth',
            }[phi]
            checkpoint = torch.hub.load_state_dict_from_url(url=url, map_location="cpu", model_dir="./model_data")
            self.load_state_dict(checkpoint, strict=False)
            print("Load weights from " + url.split('/')[-1])

    def forward(self, x):
        x = self.stem(x)
        x = self.dark2(x)
        #-----------------------------------------------#
        #   The output of dark3 is 80, 80, 512; it is an effective feature layer
        #-----------------------------------------------#
        x = self.dark3(x)
        feat1 = x
        #-----------------------------------------------#
        #   The output of dark4 is 40, 40, 1024; it is an effective feature layer
        #-----------------------------------------------#
        x = self.dark4(x)
        feat2 = x
        #-----------------------------------------------#
        #   The output of dark5 is 20, 20, 1024; it is an effective feature layer
        #-----------------------------------------------#
        x = self.dark5(x)
        feat3 = x
        return feat1, feat2, feat3

FPN Strong Feature Fusion

[Figure: YOLOv7 FPN + PAN feature fusion structure]
For feature utilization, YoloV7 extracts three feature layers for object detection. They sit at different depths of the backbone: the middle layer, the middle-to-lower layer, and the bottom layer. With an input of (640, 640, 3), the shapes of the three feature layers are feat1 = (80, 80, 512), feat2 = (40, 40, 1024), and feat3 = (20, 20, 1024).
After obtaining the three effective feature layers, we use them to build the FPN. The construction proceeds as follows (in this post, the SPPCSPC structure is counted as part of the FPN):

  • 1. The feat3 = (20, 20, 1024) feature layer first goes through SPPCSPC for feature extraction. This structure enlarges YoloV7's receptive field and yields P5.
  • 2. P5 goes through a 1x1 convolution to adjust its channels, is upsampled, and is then combined with feat2 = (40, 40, 1024) after feat2 has passed through a convolution; Multi_Concat_Block then extracts features to obtain P4, whose shape is (40, 40, 256).
  • 3. P4 goes through a 1x1 convolution to adjust its channels, is upsampled, and is then combined with feat1 = (80, 80, 512) after feat1 has passed through a convolution; Multi_Concat_Block then extracts features to obtain P3_out, whose shape is (80, 80, 128).
  • 4. P3_out = (80, 80, 128) is downsampled with a Transition_Block, stacked with P4, and Multi_Concat_Block then extracts features to obtain P4_out, whose shape is (40, 40, 256).
  • 5. P4_out = (40, 40, 256) is downsampled with a Transition_Block, stacked with P5, and Multi_Concat_Block then extracts features to obtain P5_out, whose shape is (20, 20, 512).

The feature pyramid can perform feature fusion of feature layers of different shapes, which is conducive to extracting better features.
yolo.py

SPPCSPC block

[Figure: SPPCSPC module structure]

class SPPCSPC(nn.Module):
    # CSP https://github.com/WongKinYiu/CrossStagePartialNetworks
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5, k=(5, 9, 13)):
        super(SPPCSPC, self).__init__()
        c_ = int(2 * c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(c_, c_, 3, 1)
        self.cv4 = Conv(c_, c_, 1, 1)
        self.m = nn.ModuleList([nn.MaxPool2d(kernel_size=x, stride=1, padding=x // 2) for x in k])
        self.cv5 = Conv(4 * c_, c_, 1, 1)
        self.cv6 = Conv(c_, c_, 3, 1)
        self.cv7 = Conv(2 * c_, c2, 1, 1)

    def forward(self, x):
        x1 = self.cv4(self.cv3(self.cv1(x)))
        y1 = self.cv6(self.cv5(torch.cat([x1] + [m(x1) for m in self.m], 1)))
        y2 = self.cv2(x)
        return self.cv7(torch.cat((y1, y2), dim=1))

Neck and head (YoloBody)

class YoloBody(nn.Module):
    def __init__(self, anchors_mask, num_classes, phi, pretrained=False):
        super(YoloBody, self).__init__()
        #-----------------------------------------------#
        #   Parameters of the different yolov7 versions
        #-----------------------------------------------#
        transition_channels = {'l' : 32, 'x' : 40}[phi]
        block_channels      = 32
        panet_channels      = {'l' : 32, 'x' : 64}[phi]
        e       = {'l' : 2, 'x' : 1}[phi]
        n       = {'l' : 4, 'x' : 6}[phi]
        ids     = {'l' : [-1, -2, -3, -4, -5, -6], 'x' : [-1, -3, -5, -7, -8]}[phi]
        conv    = {'l' : RepConv, 'x' : Conv}[phi]
        #-----------------------------------------------#
        #   The input image is 640, 640, 3
        #-----------------------------------------------#

        #---------------------------------------------------#   
        #   Build the backbone model
        #   It returns three effective feature layers with shapes:
        #   80, 80, 512
        #   40, 40, 1024
        #   20, 20, 1024
        #---------------------------------------------------#
        self.backbone   = Backbone(transition_channels, block_channels, n, phi, pretrained=pretrained)

        self.upsample   = nn.Upsample(scale_factor=2, mode="nearest")

        self.sppcspc                = SPPCSPC(transition_channels * 32, transition_channels * 16)
        self.conv_for_P5            = Conv(transition_channels * 16, transition_channels * 8)
        self.conv_for_feat2         = Conv(transition_channels * 32, transition_channels * 8)
        self.conv3_for_upsample1    = Multi_Concat_Block(transition_channels * 16, panet_channels * 4, transition_channels * 8, e=e, n=n, ids=ids)

        self.conv_for_P4            = Conv(transition_channels * 8, transition_channels * 4)
        self.conv_for_feat1         = Conv(transition_channels * 16, transition_channels * 4)
        self.conv3_for_upsample2    = Multi_Concat_Block(transition_channels * 8, panet_channels * 2, transition_channels * 4, e=e, n=n, ids=ids)

        self.down_sample1           = Transition_Block(transition_channels * 4, transition_channels * 4)
        self.conv3_for_downsample1  = Multi_Concat_Block(transition_channels * 16, panet_channels * 4, transition_channels * 8, e=e, n=n, ids=ids)

        self.down_sample2           = Transition_Block(transition_channels * 8, transition_channels * 8)
        self.conv3_for_downsample2  = Multi_Concat_Block(transition_channels * 32, panet_channels * 8, transition_channels * 16, e=e, n=n, ids=ids)

        self.rep_conv_1 = conv(transition_channels * 4, transition_channels * 8, 3, 1)
        self.rep_conv_2 = conv(transition_channels * 8, transition_channels * 16, 3, 1)
        self.rep_conv_3 = conv(transition_channels * 16, transition_channels * 32, 3, 1)

        self.yolo_head_P3 = nn.Conv2d(transition_channels * 8, len(anchors_mask[2]) * (5 + num_classes), 1)
        self.yolo_head_P4 = nn.Conv2d(transition_channels * 16, len(anchors_mask[1]) * (5 + num_classes), 1)
        self.yolo_head_P5 = nn.Conv2d(transition_channels * 32, len(anchors_mask[0]) * (5 + num_classes), 1)

    def fuse(self):
        print('Fusing layers... ')
        for m in self.modules():
            if isinstance(m, RepConv):
                m.fuse_repvgg_block()
            elif type(m) is Conv and hasattr(m, 'bn'):
                m.conv = fuse_conv_and_bn(m.conv, m.bn)
                delattr(m, 'bn')
                m.forward = m.fuseforward
        return self
    
    def forward(self, x):
        #  backbone
        feat1, feat2, feat3 = self.backbone.forward(x)
        
        P5          = self.sppcspc(feat3)
        P5_conv     = self.conv_for_P5(P5)
        P5_upsample = self.upsample(P5_conv)
        P4          = torch.cat([self.conv_for_feat2(feat2), P5_upsample], 1)
        P4          = self.conv3_for_upsample1(P4)

        P4_conv     = self.conv_for_P4(P4)
        P4_upsample = self.upsample(P4_conv)
        P3          = torch.cat([self.conv_for_feat1(feat1), P4_upsample], 1)
        P3          = self.conv3_for_upsample2(P3)

        P3_downsample = self.down_sample1(P3)
        P4 = torch.cat([P3_downsample, P4], 1)
        P4 = self.conv3_for_downsample1(P4)

        P4_downsample = self.down_sample2(P4)
        P5 = torch.cat([P4_downsample, P5], 1)
        P5 = self.conv3_for_downsample2(P5)
        
        P3 = self.rep_conv_1(P3)
        P4 = self.rep_conv_2(P4)
        P5 = self.rep_conv_3(P5)
        #---------------------------------------------------#
        #   Third feature layer
        #   y3 = (batch_size, 75, 80, 80)
        #---------------------------------------------------#
        out2 = self.yolo_head_P3(P3)
        #---------------------------------------------------#
        #   Second feature layer
        #   y2 = (batch_size, 75, 40, 40)
        #---------------------------------------------------#
        out1 = self.yolo_head_P4(P4)
        #---------------------------------------------------#
        #   First feature layer
        #   y1 = (batch_size, 75, 20, 20)
        #---------------------------------------------------#
        out0 = self.yolo_head_P5(P5)

        return [out0, out1, out2]

Use Yolo Head to get prediction results

Using the FPN feature pyramid, we obtain three enhanced features with shapes (20, 20, 512), (40, 40, 256), and (80, 80, 128). These three feature layers are then passed into Yolo Head to obtain the prediction results.
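As a quick sanity check of those shapes, the sketch below runs a dummy 640x640 image through the network and prints the three head outputs. It assumes the YoloBody class defined above, together with its Conv, Multi_Concat_Block, Transition_Block, SPPCSPC and RepConv dependencies (RepConv is introduced below), is importable.

import torch

# anchors_mask follows the grouping used later in DecodeBox: [[6,7,8],[3,4,5],[0,1,2]]
anchors_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
model = YoloBody(anchors_mask, num_classes=20, phi='l', pretrained=False).eval()

with torch.no_grad():
    outs = model(torch.randn(1, 3, 640, 640))

for o in outs:
    print(o.shape)
# Expected for VOC (20 classes -> 3 * (5 + 20) = 75 channels per head):
# torch.Size([1, 75, 20, 20])
# torch.Size([1, 75, 40, 40])
# torch.Size([1, 75, 80, 80])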

RepConv

Unlike the earlier YOLO versions, YoloV7 uses a RepConv structure before the Yolo Head. The idea of RepConv comes from RepVGG: a specially designed residual structure is introduced to assist training, and at inference time this complex residual structure can be reparameterized into an equivalent plain 3x3 convolution. The complexity of the network is reduced while its prediction performance is preserved.
[Figure: RepConv structure during training and inference]
The RepVGG learning record
[Figure: RepVGG training-time and deploy-time branches]
The REP module comes in two forms: train (used during training) and deploy (used for inference).
The training module has three branches.
The top branch is a 3x3 convolution used for feature extraction.
The middle branch is a 1x1 convolution used for smoothing the features.
The last branch is an identity branch, which passes the input through directly without any convolution.
The outputs of the three branches are added together.

The inference module is a single 3x3 convolution with stride 1, obtained by reparameterizing the training module.
In the training module, the three branches are a 3x3 convolution, a 1x1 convolution, and an identity.
During reparameterization, the 1x1 convolution and the identity are each converted into an equivalent 3x3 convolution, and the weight matrices are then added together; this is a matrix-fusion process.
Adding the weights yields a single 3x3 convolution, so the three branches are fused into one path containing only a single 3x3 convolution.
Its weights and biases are the superposition of those of the three branches.

class RepConv(nn.Module):
    # Represented convolution
    # https://arxiv.org/abs/2101.03697
    def __init__(self, c1, c2, k=3, s=1, p=None, g=1, act=SiLU(), deploy=False):
        super(RepConv, self).__init__()
        self.deploy         = deploy
        self.groups         = g
        self.in_channels    = c1
        self.out_channels   = c2
        
        assert k == 3
        assert autopad(k, p) == 1

        padding_11  = autopad(k, p) - k // 2
        self.act    = nn.LeakyReLU(0.1, inplace=True) if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

        if deploy:
            self.rbr_reparam    = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=True)
        else:
            self.rbr_identity   = (nn.BatchNorm2d(num_features=c1, eps=0.001, momentum=0.03) if c2 == c1 and s == 1 else None)
            self.rbr_dense      = nn.Sequential(
                nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False),
                nn.BatchNorm2d(num_features=c2, eps=0.001, momentum=0.03),
            )
            self.rbr_1x1        = nn.Sequential(
                nn.Conv2d( c1, c2, 1, s, padding_11, groups=g, bias=False),
                nn.BatchNorm2d(num_features=c2, eps=0.001, momentum=0.03),
            )

    def forward(self, inputs):
        if hasattr(self, "rbr_reparam"):
            return self.act(self.rbr_reparam(inputs))
        if self.rbr_identity is None:
            id_out = 0
        else:
            id_out = self.rbr_identity(inputs)
        return self.act(self.rbr_dense(inputs) + self.rbr_1x1(inputs) + id_out)
    
    def get_equivalent_kernel_bias(self):
        kernel3x3, bias3x3  = self._fuse_bn_tensor(self.rbr_dense)
        kernel1x1, bias1x1  = self._fuse_bn_tensor(self.rbr_1x1)
        kernelid, biasid    = self._fuse_bn_tensor(self.rbr_identity)
        return (
            kernel3x3 + self._pad_1x1_to_3x3_tensor(kernel1x1) + kernelid,
            bias3x3 + bias1x1 + biasid,
        )

    def _pad_1x1_to_3x3_tensor(self, kernel1x1):
        if kernel1x1 is None:
            return 0
        else:
            return nn.functional.pad(kernel1x1, [1, 1, 1, 1])

    def _fuse_bn_tensor(self, branch):
        if branch is None:
            return 0, 0
        if isinstance(branch, nn.Sequential):
            kernel      = branch[0].weight
            running_mean = branch[1].running_mean
            running_var = branch[1].running_var
            gamma       = branch[1].weight
            beta        = branch[1].bias
            eps         = branch[1].eps
        else:
            assert isinstance(branch, nn.BatchNorm2d)
            if not hasattr(self, "id_tensor"):
                input_dim = self.in_channels // self.groups
                kernel_value = np.zeros(
                    (self.in_channels, input_dim, 3, 3), dtype=np.float32
                )
                for i in range(self.in_channels):
                    kernel_value[i, i % input_dim, 1, 1] = 1
                self.id_tensor = torch.from_numpy(kernel_value).to(branch.weight.device)
            kernel      = self.id_tensor
            running_mean = branch.running_mean
            running_var = branch.running_var
            gamma       = branch.weight
            beta        = branch.bias
            eps         = branch.eps
        std = (running_var + eps).sqrt()
        t   = (gamma / std).reshape(-1, 1, 1, 1)
        return kernel * t, beta - running_mean * gamma / std

    def repvgg_convert(self):
        kernel, bias = self.get_equivalent_kernel_bias()
        return (
            kernel.detach().cpu().numpy(),
            bias.detach().cpu().numpy(),
        )

    def fuse_conv_bn(self, conv, bn):
        std     = (bn.running_var + bn.eps).sqrt()
        bias    = bn.bias - bn.running_mean * bn.weight / std

        t       = (bn.weight / std).reshape(-1, 1, 1, 1)
        weights = conv.weight * t

        bn      = nn.Identity()
        conv    = nn.Conv2d(in_channels = conv.in_channels,
                              out_channels = conv.out_channels,
                              kernel_size = conv.kernel_size,
                              stride=conv.stride,
                              padding = conv.padding,
                              dilation = conv.dilation,
                              groups = conv.groups,
                              bias = True,
                              padding_mode = conv.padding_mode)

        conv.weight = torch.nn.Parameter(weights)
        conv.bias   = torch.nn.Parameter(bias)
        return conv

    def fuse_repvgg_block(self):    
        if self.deploy:
            return
        print(f"RepConv.fuse_repvgg_block")
        self.rbr_dense  = self.fuse_conv_bn(self.rbr_dense[0], self.rbr_dense[1])
        
        self.rbr_1x1    = self.fuse_conv_bn(self.rbr_1x1[0], self.rbr_1x1[1])
        rbr_1x1_bias    = self.rbr_1x1.bias
        weight_1x1_expanded = torch.nn.functional.pad(self.rbr_1x1.weight, [1, 1, 1, 1])
        
        # Fuse self.rbr_identity
        if (isinstance(self.rbr_identity, nn.BatchNorm2d) or isinstance(self.rbr_identity, nn.modules.batchnorm.SyncBatchNorm)):
            identity_conv_1x1 = nn.Conv2d(
                    in_channels=self.in_channels,
                    out_channels=self.out_channels,
                    kernel_size=1,
                    stride=1,
                    padding=0,
                    groups=self.groups, 
                    bias=False)
            identity_conv_1x1.weight.data = identity_conv_1x1.weight.data.to(self.rbr_1x1.weight.data.device)
            identity_conv_1x1.weight.data = identity_conv_1x1.weight.data.squeeze().squeeze()
            identity_conv_1x1.weight.data.fill_(0.0)
            identity_conv_1x1.weight.data.fill_diagonal_(1.0)
            identity_conv_1x1.weight.data = identity_conv_1x1.weight.data.unsqueeze(2).unsqueeze(3)

            identity_conv_1x1           = self.fuse_conv_bn(identity_conv_1x1, self.rbr_identity)
            bias_identity_expanded      = identity_conv_1x1.bias
            weight_identity_expanded    = torch.nn.functional.pad(identity_conv_1x1.weight, [1, 1, 1, 1])            
        else:
            bias_identity_expanded      = torch.nn.Parameter( torch.zeros_like(rbr_1x1_bias) )
            weight_identity_expanded    = torch.nn.Parameter( torch.zeros_like(weight_1x1_expanded) )            
        
        self.rbr_dense.weight   = torch.nn.Parameter(self.rbr_dense.weight + weight_1x1_expanded + weight_identity_expanded)
        self.rbr_dense.bias     = torch.nn.Parameter(self.rbr_dense.bias + rbr_1x1_bias + bias_identity_expanded)
                
        self.rbr_reparam    = self.rbr_dense
        self.deploy         = True

        if self.rbr_identity is not None:
            del self.rbr_identity
            self.rbr_identity = None

        if self.rbr_1x1 is not None:
            del self.rbr_1x1
            self.rbr_1x1 = None

        if self.rbr_dense is not None:
            del self.rbr_dense
            self.rbr_dense = None

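A quick sanity check of the reparameterization (a sketch, assuming the RepConv class above and its SiLU/autopad helpers are available): after fuse_repvgg_block(), the single deploy-time 3x3 convolution should reproduce the training-time three-branch output. The check must run in eval() mode, because the fusion uses the BatchNorm running statistics.

import torch

m = RepConv(64, 64, k=3, s=1).eval()   # c1 == c2 and s == 1, so the identity (BN) branch exists
x = torch.randn(1, 64, 32, 32)

with torch.no_grad():
    y_train = m(x)           # three-branch forward (3x3 + 1x1 + identity)
    m.fuse_repvgg_block()    # fold the three branches into a single 3x3 convolution
    y_deploy = m(x)          # forward now goes through rbr_reparam only

print(torch.allclose(y_train, y_deploy, atol=1e-5))  # True up to numerical precision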
For each feature layer, a convolution is used to adjust the number of channels; the final number of channels is related to the number of classes to be distinguished. In YoloV7, each feature point on each feature layer has 3 prior boxes.

Prediction head structure

If the VOC training set is used, there are 20 classes, so the final channel dimension is 75 = 3 x 25, and the shapes of the three feature layers are (20, 20, 75), (40, 40, 75), and (80, 80, 75).
The 75 channels can be split into three groups of 25, corresponding to the 25 parameters of the three prior boxes, and each 25 can be split into 4 + 1 + 20.
The first 4 parameters are the regression parameters of each feature point; applying them yields the prediction box.
The fifth parameter indicates whether the feature point contains an object.
The last 20 parameters give the class of the object contained at the feature point.

If the COCO training set is used, there are 80 classes, so the final channel dimension is 255 = 3 x 85, and the shapes of the three feature layers are (20, 20, 255), (40, 40, 255), and (80, 80, 255).
The 255 channels can be split into three groups of 85, corresponding to the 85 parameters of the three prior boxes, and each 85 can be split into 4 + 1 + 80.
The first 4 parameters are the regression parameters of each feature point; applying them yields the prediction box.
The fifth parameter indicates whether the feature point contains an object.
The last 80 parameters give the class of the object contained at the feature point.
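The channel counts above come from a simple calculation; as a sketch:

def head_channels(num_classes, num_anchors=3):
    # Each feature point has 3 prior boxes; each prior box predicts
    # 4 regression parameters + 1 objectness score + num_classes class scores.
    return num_anchors * (4 + 1 + num_classes)

print(head_channels(20))   # VOC:  3 * 25 = 75
print(head_channels(80))   # COCO: 3 * 85 = 255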

Decoding of prediction results

1. Obtaining prediction boxes and scores

From the prediction head we obtain the prediction results of the three feature layers. Taking COCO as an example, their shapes are (N, 20, 20, 255), (N, 40, 40, 255), and (N, 80, 80, 255).

However, these predictions do not directly correspond to the positions of the final prediction boxes on the image; they still need to be decoded. In YoloV7, each feature point on each feature layer has 3 prior boxes.

The last dimension of 255 for each feature layer can be split into three groups of 85, corresponding to the 85 parameters of the three prior boxes. We first reshape the outputs, giving (N, 20, 20, 3, 85), (N, 40, 40, 3, 85), and (N, 80, 80, 3, 85).

The 85 can be split into 4 + 1 + 80.
The first 4 parameters are the regression parameters of each feature point; applying them yields the prediction box.
The fifth parameter indicates whether the feature point contains an object.
The last 80 parameters give the class of the object contained at the feature point.

Take the (N, 20, 20, 3, 85) feature layer as an example. This feature layer is equivalent to dividing the image into 20x20 feature points; if a feature point falls inside the box corresponding to an object, it is responsible for predicting that object.

As shown in the figure, the blue points are the 20x20 feature points. We now demonstrate the decoding of the three prior boxes at the black point in the left image:

  • 1. Compute the predicted center point: the first two regression values are used to offset the center coordinates of the feature point's three prior boxes, giving the three red points in the right image.
  • 2. Compute the predicted width and height: the last two regression values scale the prior boxes' widths and heights to obtain the widths and heights of the prediction boxes.
  • 3. The resulting prediction boxes can then be drawn on the image.

[Figure: decoding the three prior boxes on the 20x20 feature map]
utils_bbox.py

Prediction box decoding

class DecodeBox():
    def __init__(self, anchors, num_classes, input_shape, anchors_mask = [[6,7,8], [3,4,5], [0,1,2]]):
        super(DecodeBox, self).__init__()
        self.anchors        = anchors
        self.num_classes    = num_classes
        self.bbox_attrs     = 5 + num_classes
        self.input_shape    = input_shape
        #-----------------------------------------------------------#
        #   The anchors for the 20x20 feature layer are [142, 110], [192, 243], [459, 401]
        #   The anchors for the 40x40 feature layer are [36, 75], [76, 55], [72, 146]
        #   The anchors for the 80x80 feature layer are [12, 16], [19, 36], [40, 28]
        #-----------------------------------------------------------#
        self.anchors_mask   = anchors_mask

    def decode_box(self, inputs):
        outputs = []
        for i, input in enumerate(inputs):
            #-----------------------------------------------#
            #   There are three inputs in total; their shapes are
            #   batch_size = 1
            #   batch_size, 3 * (4 + 1 + 80), 20, 20
            #   batch_size, 255, 40, 40
            #   batch_size, 255, 80, 80
            #-----------------------------------------------#
            batch_size      = input.size(0)
            input_height    = input.size(2)
            input_width     = input.size(3)

            #-----------------------------------------------#
            #   For a 640x640 input,
            #   stride_h = stride_w = 32, 16, 8
            #-----------------------------------------------#
            stride_h = self.input_shape[0] / input_height
            stride_w = self.input_shape[1] / input_width
            #-------------------------------------------------#
            #   The scaled_anchors obtained here are measured relative to the feature layer
            #-------------------------------------------------#
            scaled_anchors = [(anchor_width / stride_w, anchor_height / stride_h) for anchor_width, anchor_height in self.anchors[self.anchors_mask[i]]]

            #-----------------------------------------------#
            #   There are three inputs in total; their shapes are
            #   batch_size, 3, 20, 20, 85
            #   batch_size, 3, 40, 40, 85
            #   batch_size, 3, 80, 80, 85
            #-----------------------------------------------#
            prediction = input.view(batch_size, len(self.anchors_mask[i]),
                                    self.bbox_attrs, input_height, input_width).permute(0, 1, 3, 4, 2).contiguous()

            #-----------------------------------------------#
            #   Adjustment parameters for the prior box centers
            #-----------------------------------------------#
            x = torch.sigmoid(prediction[..., 0])  
            y = torch.sigmoid(prediction[..., 1])
            #-----------------------------------------------#
            #   Adjustment parameters for the prior box width and height
            #-----------------------------------------------#
            w = torch.sigmoid(prediction[..., 2]) 
            h = torch.sigmoid(prediction[..., 3]) 
            #-----------------------------------------------#
            #   Objectness confidence: whether an object is present
            #-----------------------------------------------#
            conf        = torch.sigmoid(prediction[..., 4])
            #-----------------------------------------------#
            #   Class confidence
            #-----------------------------------------------#
            pred_cls    = torch.sigmoid(prediction[..., 5:])

            FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
            LongTensor  = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor

            #----------------------------------------------------------#
            #   Generate the grid; the prior box centers are the top-left corners of the grid cells
            #   batch_size, 3, 20, 20
            #----------------------------------------------------------#
            grid_x = torch.linspace(0, input_width - 1, input_width).repeat(input_height, 1).repeat(
                batch_size * len(self.anchors_mask[i]), 1, 1).view(x.shape).type(FloatTensor)
            grid_y = torch.linspace(0, input_height - 1, input_height).repeat(input_width, 1).t().repeat(
                batch_size * len(self.anchors_mask[i]), 1, 1).view(y.shape).type(FloatTensor)

            #----------------------------------------------------------#
            #   Generate the prior box widths and heights in the same grid layout
            #   batch_size, 3, 20, 20
            #----------------------------------------------------------#
            anchor_w = FloatTensor(scaled_anchors).index_select(1, LongTensor([0]))
            anchor_h = FloatTensor(scaled_anchors).index_select(1, LongTensor([1]))
            anchor_w = anchor_w.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(w.shape)
            anchor_h = anchor_h.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(h.shape)

            #----------------------------------------------------------#
            #   Adjust the prior boxes using the prediction results.
            #   First adjust the prior box centers, offsetting them from the grid cell
            #   toward the bottom-right, then adjust the prior box width and height.
            #   x 0 ~ 1 => 0 ~ 2 => -0.5 ~ 1.5 => each point predicts targets within a certain range
            #   y 0 ~ 1 => 0 ~ 2 => -0.5 ~ 1.5 => each point predicts targets within a certain range
            #   w 0 ~ 1 => 0 ~ 2 => 0 ~ 4 => prior box width can be scaled by a factor of 0 ~ 4
            #   h 0 ~ 1 => 0 ~ 2 => 0 ~ 4 => prior box height can be scaled by a factor of 0 ~ 4
            #----------------------------------------------------------#
            pred_boxes          = FloatTensor(prediction[..., :4].shape)
            pred_boxes[..., 0]  = x.data * 2. - 0.5 + grid_x
            pred_boxes[..., 1]  = y.data * 2. - 0.5 + grid_y
            pred_boxes[..., 2]  = (w.data * 2) ** 2 * anchor_w
            pred_boxes[..., 3]  = (h.data * 2) ** 2 * anchor_h

            #----------------------------------------------------------#
            #   Normalize the outputs to fractional (0 ~ 1) coordinates
            #----------------------------------------------------------#
            _scale = torch.Tensor([input_width, input_height, input_width, input_height]).type(FloatTensor)
            output = torch.cat((pred_boxes.view(batch_size, -1, 4) / _scale,
                                conf.view(batch_size, -1, 1), pred_cls.view(batch_size, -1, self.num_classes)), -1)
            outputs.append(output.data)
        return outputs

2. Score screening and non-maximum suppression

After the final prediction results are obtained, score filtering and non-maximum suppression are applied.

Score filtering keeps the prediction boxes whose scores exceed the confidence threshold.
Non-maximum suppression keeps, within a given region, only the highest-scoring box of each class.

The process of score screening and non-maximum suppression can be summarized as follows:

  • 1. Find the boxes in the image whose scores are greater than the threshold. Filtering by score before filtering overlapping boxes greatly reduces the number of boxes to process.
  • 2. Loop over the classes. The role of non-maximum suppression is to keep, within a given region, the highest-scoring box of a class, so looping over the classes lets us apply non-maximum suppression to each class separately.
  • 3. Sort the boxes of that class by score from highest to lowest.
  • 4. Each iteration, take the box with the highest score, compute its overlap with all the other predicted boxes, and remove those whose overlap is too large.

The results of score screening and non-maximum suppression can be used to draw prediction boxes.

The image below shows the result with non-maximum suppression applied.
[Figure: detections after non-maximum suppression]
Without suppression:
[Figure: detections without non-maximum suppression]
Non-maximum suppression code implementation

def non_max_suppression(self, prediction, num_classes, input_shape, image_shape, letterbox_image, conf_thres=0.5, nms_thres=0.4):
    #----------------------------------------------------------#
    #   Convert the predictions to top-left / bottom-right corner format.
    #   prediction  [batch_size, num_anchors, 85]
    #----------------------------------------------------------#
    box_corner          = prediction.new(prediction.shape)
    box_corner[:, :, 0] = prediction[:, :, 0] - prediction[:, :, 2] / 2
    box_corner[:, :, 1] = prediction[:, :, 1] - prediction[:, :, 3] / 2
    box_corner[:, :, 2] = prediction[:, :, 0] + prediction[:, :, 2] / 2
    box_corner[:, :, 3] = prediction[:, :, 1] + prediction[:, :, 3] / 2
    prediction[:, :, :4] = box_corner[:, :, :4]

    output = [None for _ in range(len(prediction))]
    for i, image_pred in enumerate(prediction):
        #----------------------------------------------------------#
        #   Take the max over the class predictions.
        #   class_conf  [num_anchors, 1]    class confidence
        #   class_pred  [num_anchors, 1]    class index
        #----------------------------------------------------------#
        class_conf, class_pred = torch.max(image_pred[:, 5:5 + num_classes], 1, keepdim=True)

        #----------------------------------------------------------#
        #   First round of filtering using the confidence score
        #----------------------------------------------------------#
        conf_mask = (image_pred[:, 4] * class_conf[:, 0] >= conf_thres).squeeze()

        #----------------------------------------------------------#
        #   Filter the predictions according to the confidence mask
        #----------------------------------------------------------#
        image_pred = image_pred[conf_mask]
        class_conf = class_conf[conf_mask]
        class_pred = class_pred[conf_mask]
        if not image_pred.size(0):
            continue
        #-------------------------------------------------------------------------#
        #   detections  [num_anchors, 7]
        #   The 7 values are: x1, y1, x2, y2, obj_conf, class_conf, class_pred
        #-------------------------------------------------------------------------#
        detections = torch.cat((image_pred[:, :5], class_conf.float(), class_pred.float()), 1)

        #------------------------------------------#
        #   Get all the classes present in the predictions
        #------------------------------------------#
        unique_labels = detections[:, -1].cpu().unique()

        if prediction.is_cuda:
            unique_labels = unique_labels.cuda()
            detections = detections.cuda()

        for c in unique_labels:
            #------------------------------------------#
            #   Get all predictions of this class that passed the score filtering
            #------------------------------------------#
            detections_class = detections[detections[:, -1] == c]

            #------------------------------------------#
            #   Using the official built-in NMS (torchvision.ops.nms) is faster!
            #   Keep the highest-scoring box of this class within each region
            #------------------------------------------#
            keep = nms(
                detections_class[:, :4],
                detections_class[:, 4] * detections_class[:, 5],
                nms_thres
            )
            max_detections = detections_class[keep]
            
            # # Sort by objectness confidence
            # _, conf_sort_index = torch.sort(detections_class[:, 4]*detections_class[:, 5], descending=True)
            # detections_class = detections_class[conf_sort_index]
            # # Perform non-maximum suppression
            # max_detections = []
            # while detections_class.size(0):
            #     # Take the highest-confidence box of this class, then walk down the list and remove any box whose overlap with it exceeds nms_thres
            #     max_detections.append(detections_class[0].unsqueeze(0))
            #     if len(detections_class) == 1:
            #         break
            #     ious = bbox_iou(max_detections[-1], detections_class[1:])
            #     detections_class = detections_class[1:][ious < nms_thres]
            # # Stack the results
            # max_detections = torch.cat(max_detections).data
            
            # Add max detections to outputs
            output[i] = max_detections if output[i] is None else torch.cat((output[i], max_detections))
        
        if output[i] is not None:
            output[i]           = output[i].cpu().numpy()
            box_xy, box_wh      = (output[i][:, 0:2] + output[i][:, 2:4])/2, output[i][:, 2:4] - output[i][:, 0:2]
            output[i][:, :4]    = self.yolo_correct_boxes(box_xy, box_wh, input_shape, image_shape, letterbox_image)
    return output

Dataset training

For the source-code debugging process, you can refer to the blogger's article: YOLOV7 debugging record.

Overall framework

[Figure: overall YOLOv7 framework]

Original post: https://blog.csdn.net/pengxiang1998/article/details/128307956