YOLOv7 Improvement 22: An mAP-Boosting Trick - Introducing Recursive Gated Convolution (gnConv)

Foreword: As a current state-of-the-art deep learning object detection algorithm, YOLOv7 already assembles a large number of tricks, but there is still room for improvement. Different improvement methods can be applied to the detection difficulties of specific application scenarios. The following series of articles explains in detail how to improve YOLOv7, with the aim of offering modest help and a reference to students engaged in scientific research who need innovation points, as well as to friends working on engineering projects who want better results. Since 2020, a large number of papers improving YOLOv5 and YOLOv7 have appeared, so for researchers and practitioners alike the value and novelty of such work are wearing thin. To keep pace with the times, future improved algorithms in this series will be based on YOLOv7; the earlier YOLOv5 improvement methods also apply to YOLOv7, so the numbering of the YOLOv5 improvement series is continued here. In addition, the improvement method can also be applied to other algorithms such as YOLOv5. I hope this is helpful to everyone.

Problem addressed: The YOLOv7 backbone feature extraction network is a CNN. CNNs have translation invariance and locality but lack global, long-range modeling ability, so the Transformer framework from natural language processing has been introduced to form CNN+Transformer architectures that combine the strengths of both and improve detection; in my experiments this brings a clear gain on small objects and on dense prediction tasks. Recent advances in vision Transformers have achieved great success on various tasks, driven by new spatial modeling mechanisms based on dot-product self-attention. Recursive Gated Convolution (gnConv) performs high-order spatial interactions through gated convolutions and a recursive design. The new operation is highly flexible and customizable, is compatible with various convolution variants, and extends the second-order interactions of self-attention to arbitrary orders without introducing much extra computation, so gnConv can be used as a plug-and-play module to improve various vision Transformers and convolution-based models. For fusing Transformer methods, refer to the earlier YOLOv5 improvement:

YOLOv5 Improvement Seventeen: CNN+Transformer - Fusing Bottleneck Transformers (Artificial Intelligence Algorithm Research Institute, CSDN Blog)

Principle:

Paper: https://arxiv.org/pdf/2207.14284.pdf

Code: GitHub - raoyongming/HorNet: HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions

Recent progress in vision Transformers, with great success in various tasks, is driven by new spatial modeling mechanisms based on dot-product self-attention. In this paper, we show that the key ingredients behind vision Transformers, namely input-adaptive, long-range, and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We propose Recursive Gated Convolution (gnConv), which performs high-order spatial interactions through gated convolutions and a recursive design. The new operation is highly flexible and customizable, is compatible with various convolution variants, and extends the second-order interactions in self-attention to arbitrary orders without introducing significant extra computation. gnConv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on this operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection, and ADE20K semantic segmentation show that HorNet outperforms Swin Transformers and ConvNeXt with similar overall architectures and training configurations. HorNet also shows good scalability to more training data and larger model sizes. Beyond its effectiveness in visual encoders, we show that gnConv can be applied to task-specific decoders and consistently improves dense prediction performance with less computation. Our results demonstrate that gnConv can be a new basic module for visual modeling that effectively combines the merits of vision Transformers and CNNs.
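
In formula form, the high-order interaction can be summarized by the following recursion (a paraphrase of the paper's notation: $\phi_{in}$ and $\phi_{out}$ are 1×1 projections, the $f_k$ are realized by one shared depth-wise convolution, $g_0$ is the identity while $g_k$ for $k \geq 1$ are 1×1 convolutions that match channel widths, $\odot$ is element-wise multiplication, and $\alpha$ is a scale factor for training stability):

$$[\,p_0,\ q_0,\ q_1,\ \dots,\ q_{n-1}\,] = \phi_{in}(x)$$
$$p_{k+1} = f_k(q_k) \odot g_k(p_k)\,/\,\alpha, \qquad k = 0, 1, \dots, n-1$$
$$y = \phi_{out}(p_n)$$

Each step multiplies in one more spatially mixed branch, so an order-n gnConv realizes n levels of gated spatial interaction at modest extra cost.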

The relevant part of the project code, following the official HorNet implementation, is as follows:

import torch
import torch.nn as nn

def get_dwconv(dim, kernel, bias):
    # depth-wise convolution with 'same' padding (as in the official HorNet repo)
    return nn.Conv2d(dim, dim, kernel_size=kernel, padding=(kernel - 1) // 2, bias=bias, groups=dim)

class gnconv(nn.Module):
    def __init__(self, dim, order=5, gflayer=None, h=14, w=8, s=1.0):
        super().__init__()
        self.order = order
        # channel widths per order: [dim/2^(order-1), ..., dim/2, dim]
        self.dims = [dim // 2 ** i for i in range(order)]
        self.dims.reverse()
        self.proj_in = nn.Conv2d(dim, 2 * dim, 1)

        # spatial mixing: 7x7 depth-wise conv by default, or a global-filter layer if supplied
        if gflayer is None:
            self.dwconv = get_dwconv(sum(self.dims), 7, True)
        else:
            self.dwconv = gflayer(sum(self.dims), h=h, w=w)

        self.proj_out = nn.Conv2d(dim, dim, 1)

        # 1x1 convs that lift features to the next order's channel width
        self.pws = nn.ModuleList(
            [nn.Conv2d(self.dims[i], self.dims[i + 1], 1) for i in range(order - 1)]
        )

        self.scale = s  # scale on the depth-wise branch (training stability)
        print('[gnconv]', order, 'order with dims=', self.dims, 'scale=%.4f' % self.scale)

    def forward(self, x):
        # split projected input into first-order branch (pwa) and higher-order branches (abc)
        fused_x = self.proj_in(x)
        pwa, abc = torch.split(fused_x, (self.dims[0], sum(self.dims)), dim=1)
        dw_list = torch.split(self.dwconv(abc) * self.scale, self.dims, dim=1)
        # gate recursively: each step multiplies in one more order of spatial interaction
        x = pwa * dw_list[0]
        for i in range(self.order - 1):
            x = self.pws[i](x) * dw_list[i + 1]
        return self.proj_out(x)
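
A minimal smoke test (a hypothetical usage sketch, not taken from the original project): gnconv preserves the channel count, so it can stand in for any same-width convolution block in a YOLO backbone.

m = gnconv(64, order=3)            # dims become [16, 32, 64]
y = m(torch.randn(1, 64, 32, 32))  # shapes: (B, C, H, W) -> (B, C, H, W)
print(y.shape)                     # torch.Size([1, 64, 32, 32])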

Results: I have run many experiments on multiple datasets; the effect differs from dataset to dataset, but the mAP gain is noticeable.

Preview: the next article will continue to share improvement methods for deep learning algorithms. Interested friends can follow me; if you have any questions, feel free to leave a comment or message me privately.

PS: Replacing the convolution in this way is not only suitable for improving YOLOv5; it can also improve other YOLO networks and object detection networks, such as YOLOv7, v6, v4, v3, Faster R-CNN, SSD, etc.

Finally, I hope we can follow each other, become friends, and learn and communicate together.
