YOLOv5, YOLOv8 improvements: BoTNet Transformer

 

Table of contents

1 Introduction

2. YOLOv5 improvements

2.1 Add the following yolov5s_botnet.yaml file

2.2 common.py configuration

2.3 yolo.py configuration modification


1 Introduction

Paper: Bottleneck Transformers for Visual Recognition (BoTNet)

The BoTNet proposed in this paper is a simple and efficient network that applies self-attention (SA) to a variety of visual tasks, such as image classification, object detection, and instance segmentation. By replacing the spatial convolutions in the last three bottleneck blocks of ResNet-50 with global SA operations, it effectively improves the performance of the baseline model on object detection and instance segmentation.


Section I of the paper notes that most commonly used CNNs rely on 3x3 convolution kernels. Convolution captures local information effectively, but visual tasks such as object detection, instance segmentation, and keypoint detection also need long-range dependencies. In instance segmentation, for example, scene-level context must be gathered to learn the relationships between objects; to aggregate such global information with convolutions alone, many convolutional layers have to be stacked. Approaches based on non-local operations can do this more efficiently without stacking so many layers. Modeling long-range dependencies is just as important for NLP: self-attention effectively learns the association between every pair of entities, and SA-based Transformers such as GPT and BERT have become the mainstream of NLP.


A straightforward way to apply SA to vision is to replace the spatial convolutional layers with MHSA layers, see Fig 1. Following this idea there are two directions for improvement. One is to replace the convolutional layers in ResNet with various SA operations, such as SASA, AACN, and SANet; the other is to divide the image into non-overlapping patches and feed them into stacked Transformer modules.

Although these look like two different types of architecture, they are not. The paper points out that a ResNet bottleneck block whose spatial convolution is replaced by an MHSA layer can be regarded as a Transformer block with a bottleneck structure, apart from subtle differences such as the choice of residual connections and normalization layers. The paper therefore calls the ResNet block with an MHSA layer the BoT (Bottleneck Transformer) block. The specific structure is shown in Fig 3.
Applying attention to vision faces the following challenges:

(1) Images used for detection and segmentation are larger than those used for classification: (224, 224) is generally enough for classification, while object detection and instance segmentation work at much higher resolutions.

(2) The computation and memory cost of SA grow quadratically with the input resolution (see the example below).
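For example, on a 64x64 feature map the attention matrix already has (64*64)^2 ≈ 16.8 million entries per head, compared with (14*14)^2 ≈ 38 thousand entries on the 14x14 c5 feature map produced by a 224x224 classification input.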

To overcome these challenges, the paper adopts the following design:

(1) Use convolution to efficiently learn low-resolution, abstract features.

(2) Use SA to process and aggregate the features extracted by convolution.

This hybrid design exploits the advantages of both convolution and SA, while downsampling through convolution lets the network handle higher-resolution input images efficiently.

Therefore, the paper proposes a very simple design: replace the last three blocks of ResNet with BoT blocks and leave everything else unchanged. Stated even more plainly, only the last three 3x3 convolutions of ResNet are replaced with MHSA layers.


Just this small change improves object detection accuracy on COCO by 1.2%, and since there is nothing structurally novel in BoTNet, the authors argue that this very simplicity makes it a backbone worth further study. Used for instance segmentation, BoTNet also brings a clear improvement, especially on small objects.



Finally, the paper also scales BoTNet up and finds no substantial gain on smaller datasets, but the scaled model reaches 84.7% top-1 accuracy on ImageNet and is up to 1.64 times faster than the popular EfficientNet models on TPU-v3 hardware. Based on these results, the authors hope SA will be widely adopted in visual tasks in the future.

 

Section II Related Work
Fig 2 summarizes attention in computer vision tasks. This section mainly focuses on:

(1) Transformer vs BoTNet


(2) DETR vs BoTNet

 
(3) Non-Local vs BoTNet

In Fig 3, the left side is the Transformer block, the middle is the BoT block defined in this paper, and the right side is a ResNet bottleneck block whose convolutional layer has been replaced with MHSA.

Connection to the Transformer

As the title of the paper suggests, the key point is to replace the spatial convolution in the ResNet bottleneck block with an MHSA layer, but the architectural design of BoT itself is not claimed as a contribution. The paper only points out the relationship between the MHSA ResNet bottleneck and the Transformer block, which helps the understanding and design of SA in computer vision.



Beyond what can be seen in Fig 3 and the use of residual connections, there are several other differences, such as:




(1) Normalization: the Transformer uses LN, while ResNet uses BN.

(2) Nonlinearities: the Transformer introduces a nonlinearity in the FFN layer, while the BoT block uses three nonlinearities.

(3) Output projection: the MHSA in the Transformer contains an output projection, while the BoT block does not.

(4) Optimizer: SGD is typically used for visual tasks, while Transformers are usually trained with Adam.

Connection to DETR


DETR is an object detection framework based on the Transformer. Both DETR and BoTNet try to use SA to improve the performance of object detection and instance segmentation. The difference is that DETR uses the SA module outside the backbone, mainly to avoid region proposals and non-maximum suppression (NMS);


the purpose of BoTNet is instead to propose a backbone that performs object detection and instance segmentation directly. The experimental results show that BoTNet brings a significant improvement on small objects, which the authors believe could help address DETR's weak small-object detection in the future.



Connection to Non-Local Neural Nets




Non-Local Nets connect the Transformer with the classic non-local algorithm: non-local modules are inserted into the last one or two stages of ResNet to improve instance segmentation and video classification. BoT, in contrast, is a hybrid design that combines convolution and SA.




Fig 4 shows the difference between the Non-Local layer and the SA layer:





(1) MHSA uses multiple heads to project Q, K, and V.

(2) The NL block usually contains a channel scaling factor, typically set to 2, whereas in MHSA it is set to 4.

(3) The NL block is inserted into ResNet as an additional module, while BoTNet directly replaces existing blocks.

Section III Method
The design of BoTNet is very simple: the last three 3x3 convolutions of ResNet are replaced by MHSA, which computes self-attention globally over the feature map. ResNet usually contains four stages [c2, c3, c4, c5]; each stage stacks several blocks and uses residual connections.




 
The goal of the paper is to use SA for instance segmentation on high-resolution inputs, so the simplest approach is to add attention to the lowest-resolution feature maps of the backbone, i.e. the c5 stage. c5 generally contains three residual blocks, so these three blocks are replaced with MHSA-based blocks to form BoTNet.

Table 1 shows the network structure of BoTNet, and Fig 4 shows the structure of MHSA. Where the original block uses a strided convolution, BoTNet uses a pooling operation instead.
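To make the replacement concrete, here is a minimal sketch of my own (not the authors' released code) that applies the same idea to a torchvision ResNet-50: the 3x3 conv2 of each bottleneck block in the c5 stage (layer4) is swapped for the MHSA module listed in Section 2.2 below, and the stride-2 block additionally gets a 2x2 average pool. It keeps torchvision's BN and activation placement, so it is only an approximation of the exact BoT block; position encoding is left disabled so the module works at any input resolution, and the MHSA class is assumed to have been defined or imported already.

import torch
import torch.nn as nn
from torchvision.models import resnet50

# Assumes the MHSA class from Section 2.2 has been defined or imported in this file.

model = resnet50()  # randomly initialized

for block in model.layer4:                        # layer4 == c5: three Bottleneck blocks
    planes = block.conv2.out_channels             # 512 for ResNet-50's c5 bottlenecks
    stride = block.conv2.stride[0]                # 2 for the first block, 1 afterwards
    mhsa = MHSA(planes, heads=4, pos_emb=False)   # global self-attention over the c5 map
    if stride == 2:
        # keep the downsampling, but do it with average pooling after self-attention
        block.conv2 = nn.Sequential(mhsa, nn.AvgPool2d(2, 2))
    else:
        block.conv2 = mhsa

x = torch.randn(1, 3, 224, 224)
print(model(x).shape)                             # torch.Size([1, 1000])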




 
Relative Position Encodings




 
Recent research shows that relative position encodings are better suited to vision tasks because they take the relative distances between positions into account, so the attention can relate content at different positions to each other. BoTNet therefore uses 2D relative position encodings.
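The short sketch below (my own illustration, with shapes matching the MHSA code in Section 2.2 but simplified) shows how the 2D relative position term enters the attention logits: a height embedding and a width embedding are broadcast-added into one embedding per spatial position, and its product with the queries is added to the usual content term before the softmax. All tensors here are random placeholders; in MHSA the embeddings are learnable parameters.

import torch

# Dummy sizes to illustrate the shapes used by the MHSA module in Section 2.2.
B, heads, d, H, W = 1, 4, 64, 20, 20             # d = C // heads, C = 256

q = torch.randn(B, heads, d, H * W)              # queries, flattened over space
k = torch.randn(B, heads, d, H * W)              # keys
rel_h = torch.randn(1, heads, d, 1, H)           # learnable in MHSA; random here
rel_w = torch.randn(1, heads, d, W, 1)

r = (rel_h + rel_w).view(1, heads, d, H * W)     # one d-dim embedding per (h, w) position
content_content = q.transpose(-2, -1) @ k        # q.k^T content term  -> (B, heads, HW, HW)
content_position = r.permute(0, 1, 3, 2) @ q     # q.r^T position term -> (B, heads, HW, HW)
attention = (content_content + content_position).softmax(dim=-1)
print(attention.shape)                           # torch.Size([1, 4, 400, 400])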

2. YOLOv5 improvements

2.1 Add the following yolov5s_botnet.yaml file

# parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple

anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

# YOLOv5 backbone
backbone:
  # [from, number, module, args]               # [c=channels, module, kernel_size, strides]
  [[-1, 1, Conv, [64, 6, 2, 2]],   # 0-P1/2           [c=3,64*0.5=32,3]
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4    
   [-1, 3, C3, [128]],                                
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8            
   [-1, 6, C3, [256]],                         
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16       
   [-1, 9, C3, [512]],                     
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 1, SPPF, [1024,5]],
   [-1, 3, BoT3, [1024]],  # 9
  ]

# YOLOv5 head
head:
  [[-1, 1, Conv, [512, 1, 1]], 
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 5], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 3], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)
  
   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 20 (P4/16-medium)       [256, 256, 1, False]  
 
   [-1, 1, Conv, [512, 3, 2]],                           #[256, 256, 3, 2] 
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 23 (P5/32-large)       [512, 512, 1, False]
  
   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]
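Compared with the stock yolov5s.yaml, the only backbone change is that the final C3 block is replaced by a BoT3 block placed after SPPF (layer index 9), so the backbone still has 10 layers and the head indices stay the same. With the default 640x640 input, the feature map reaching BoT3 is the 20x20 P5/32 map, which is why the BoT3 implementation below defaults to w=20, h=20.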

2.2 common.py configuration

Add the following modules (typically appended to the end of ./models/common.py, where torch and nn are already imported):

class MHSA(nn.Module):
    # Multi-Head Self-Attention over a (N, C, H, W) feature map
    def __init__(self, n_dims, width=14, height=14, heads=4, pos_emb=False):
        super(MHSA, self).__init__()
        self.heads = heads
        self.query = nn.Conv2d(n_dims, n_dims, kernel_size=1)
        self.key = nn.Conv2d(n_dims, n_dims, kernel_size=1)
        self.value = nn.Conv2d(n_dims, n_dims, kernel_size=1)
        self.pos = pos_emb
        if self.pos:
            # 2D relative position embeddings, split into a height and a width component
            self.rel_h_weight = nn.Parameter(torch.randn([1, heads, n_dims // heads, 1, int(height)]), requires_grad=True)
            self.rel_w_weight = nn.Parameter(torch.randn([1, heads, n_dims // heads, int(width), 1]), requires_grad=True)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        n_batch, C, width, height = x.size()  # PyTorch layout is (N, C, H, W); variable names kept from the original code
        # project and flatten the spatial dimensions: (N, heads, C // heads, H*W)
        q = self.query(x).view(n_batch, self.heads, C // self.heads, -1)
        k = self.key(x).view(n_batch, self.heads, C // self.heads, -1)
        v = self.value(x).view(n_batch, self.heads, C // self.heads, -1)
        # content term q.k^T: (N, heads, H*W, H*W)
        content_content = torch.matmul(q.permute(0, 1, 3, 2), k)
        c1, c2, c3, c4 = content_content.size()
        if self.pos:
            # position term q.r^T with r = rel_h + rel_w, one embedding per spatial position
            content_position = (self.rel_h_weight + self.rel_w_weight).view(1, self.heads, C // self.heads, -1).permute(0, 1, 3, 2)
            content_position = torch.matmul(content_position, q)
            # if the feature map is smaller than (width, height), truncate the position term to match
            content_position = content_position if (content_content.shape == content_position.shape) else content_position[:, :, :c3, ]
            assert (content_content.shape == content_position.shape)
            energy = content_content + content_position
        else:
            energy = content_content
        attention = self.softmax(energy)
        # weighted sum of values, then reshape back to (N, C, H, W)
        out = torch.matmul(v, attention.permute(0, 1, 3, 2))
        out = out.view(n_batch, C, width, height)
        return out
class BottleneckTransformer(nn.Module):
    # ResNet-style bottleneck with the 3x3 convolution replaced by MHSA (the BoT block)
    def __init__(self, c1, c2, stride=1, heads=4, mhsa=True, resolution=None, expansion=1):
        super(BottleneckTransformer, self).__init__()
        c_ = int(c2 * expansion)
        self.cv1 = Conv(c1, c_, 1, 1)
        if not mhsa:
            self.cv2 = Conv(c_, c2, 3, 1)
        else:
            self.cv2 = nn.ModuleList()
            self.cv2.append(MHSA(c2, width=int(resolution[0]), height=int(resolution[1]), heads=heads))
            if stride == 2:
                # downsample with average pooling instead of a strided convolution
                self.cv2.append(nn.AvgPool2d(2, 2))
            self.cv2 = nn.Sequential(*self.cv2)
        self.shortcut = c1 == c2  # identity shortcut when the shapes already match
        if stride != 1 or c1 != expansion * c2:
            # projection shortcut when the spatial size or channel count changes
            self.shortcut = nn.Sequential(
                nn.Conv2d(c1, expansion * c2, kernel_size=1, stride=stride),
                nn.BatchNorm2d(expansion * c2)
            )
        self.fc1 = nn.Linear(c2, c2)  # kept from the original code; not used in forward

    def forward(self, x):
        out = self.cv2(self.cv1(x))
        if isinstance(self.shortcut, nn.Module):   # projection shortcut
            out = out + self.shortcut(x)
        elif self.shortcut:                        # identity shortcut
            out = out + x
        return out
        
class BoT3(nn.Module):
    # CSP bottleneck with 3 convolutions, using BottleneckTransformer blocks inside
    def __init__(self, c1, c2, n=1, e=0.5, e2=1, w=20, h=20):  # ch_in, ch_out, number, expansion, feature-map w, h
        super(BoT3, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)  # act=FReLU(c2)
        self.m = nn.Sequential(*[BottleneckTransformer(c_, c_, stride=1, heads=4, mhsa=True, resolution=(w, h), expansion=e2) for _ in range(n)])

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
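As a quick sanity check (my own snippet, not part of the original post), the new block can be exercised on a tensor the size of the P5/32 feature map. Note that with this implementation the relative position encoding is effectively disabled, because BottleneckTransformer constructs MHSA without passing pos_emb=True.

# Run in a session where the classes above (and YOLOv5's Conv) are defined:
m = BoT3(512, 512, n=1, w=20, h=20)   # 512 channels at P5/32 for yolov5s with a 640x640 input
x = torch.randn(1, 512, 20, 20)
print(m(x).shape)                     # expected: torch.Size([1, 512, 20, 20])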

2.3 yolo.py configuration modification

Then find the parse_model function in the ./models/yolo.py file and add the new module name BoT3 to the lists of modules it recognizes (yolo.py already imports everything from models/common.py, so defining BoT3 there makes it visible); a sketch of the change follows.
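Below is a minimal sketch of the typical edit for a v6.x-style parse_model; the exact set of modules listed in your file may differ, so treat it as a guide rather than a drop-in replacement. BoT3 is added alongside the C3-style modules so that it gets the same channel scaling (width_multiple) and repeat handling (depth_multiple):

# Inside parse_model() in models/yolo.py (v6.x-style; your module tuples may list more entries):
if m in (Conv, GhostConv, Bottleneck, GhostBottleneck, SPP, SPPF, DWConv, Focus,
         BottleneckCSP, C3, C3TR, C3SPP, C3Ghost, BoT3):
    c1, c2 = ch[f], args[0]
    if c2 != no:  # if not the Detect output
        c2 = make_divisible(c2 * gw, 8)
    args = [c1, c2, *args[1:]]
    if m in (BottleneckCSP, C3, C3TR, C3Ghost, BoT3):
        args.insert(2, n)  # insert the (depth-scaled) number of repeats
        n = 1

After the three changes, the model can be built once to confirm everything parses (a quick check of my own, run from the YOLOv5 repository root):

import torch
from models.yolo import Model

model = Model('models/yolov5s_botnet.yaml')   # the printed layer table should include BoT3
x = torch.zeros(1, 3, 640, 640)
out = model(x)                                # training-mode forward: one output per detection scale
print([o.shape for o in out])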

 


Source: blog.csdn.net/weixin_45303602/article/details/132570474