[Practical Selection] Fruit Recognition System: Improvement of BoTNet-Transformer YOLOv7

1. Research background and significance

With the rapid development of artificial intelligence, research in computer vision has made tremendous progress. Object detection is an important computer vision task with a wide range of applications, including intelligent transportation, security monitoring, and autonomous driving. Within object detection, fruit recognition is a challenging problem because fruits vary widely in shape, color, texture and other characteristics. In addition, fruits appear under different lighting conditions and viewing angles in different environments, which makes fruit recognition even more difficult.

At present, object detection methods based on deep learning have achieved remarkable results. Among them, YOLO (You Only Look Once) is a fast and accurate detection algorithm that achieves real-time detection by casting object detection as a regression problem. However, the traditional YOLO algorithm has some shortcomings in fruit recognition tasks, such as poor detection of small targets and limited ability to learn fruit shape and texture features.

To address these problems, this study proposes an improved YOLOv7 fruit recognition system based on BoTNet-Transformer. BoTNet-Transformer is an emerging deep learning design that combines the Bottleneck Transformer with YOLOv7 to better handle the challenges of fruit recognition.

First, the BoTNet-Transformer model has strong feature extraction capability. The traditional YOLO algorithm relies mainly on convolutional neural networks for feature extraction, but CNNs have certain limitations when dealing with large targets and complex textures. By adopting a Transformer structure, the BoTNet-Transformer model can better capture the global characteristics and contextual information of fruits, thereby improving the accuracy of fruit recognition.

Second, the BoTNet-Transformer model handles small targets more effectively. In fruit recognition tasks some fruits are small, and the traditional YOLO algorithm often has difficulty detecting them accurately. By introducing the Bottleneck Transformer structure, the BoTNet-Transformer model can better learn the characteristics of small targets and improve the recall of fruit recognition.

In addition, the BoTNet-Transformer model generalizes well. Fruits appear under different lighting conditions and viewing angles in different environments, which poses a challenge for recognition. The traditional YOLO algorithm is often sensitive to changes in illumination and angle, which easily leads to false and missed detections. By introducing the Transformer structure, the BoTNet-Transformer model can better learn the contextual information of fruits and improve the robustness of fruit recognition.

In summary, the improved YOLOv7 fruit recognition system based on BoTNet-Transformer has significant research and application value. Introducing the BoTNet-Transformer model improves the accuracy, recall and robustness of fruit recognition, further advancing fruit recognition technology and providing strong support for fields such as smart agriculture and food safety.

2. Picture demonstration

2.png

3.png

4.png

3. Video demonstration

Improved YOLOv7 fruit recognition system based on BoTNet-Transformer (Bilibili)

4. Testing process

The YOLOv7 network mainly consists of four parts: Input, Backbone, Neck, and Head. First, the input image is pre-processed through a series of operations such as data augmentation and then fed into the backbone network. The backbone extracts features from the processed image. Next, the extracted features are fused in the Neck module to obtain features at three sizes, corresponding to large, medium and small targets. Finally, the fused features are sent to the detection heads, which output the detection results.
image.png
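
To make this four-stage flow concrete, here is a minimal, runnable sketch of the same Input -> Backbone -> Neck -> Head pipeline. All module names, channel counts and layer choices are illustrative placeholders, not the actual YOLOv7 architecture.

import torch
import torch.nn as nn


def conv_block(c_in, c_out, stride):
    # Convolution + BN + SiLU, mirroring the CBS block described later in this post.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride, 1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )


class TinyDetector(nn.Module):
    """Toy Input -> Backbone -> Neck -> Head flow; not the real YOLOv7 graph."""

    def __init__(self, num_out=255):  # 255 = 3 prior boxes * (5 + 80 classes)
        super().__init__()
        # Backbone: extract feature layers at strides 8, 16 and 32.
        self.b1 = nn.Sequential(conv_block(3, 32, 2), conv_block(32, 64, 2), conv_block(64, 128, 2))
        self.b2 = conv_block(128, 256, 2)
        self.b3 = conv_block(256, 512, 2)
        # Neck: fuse deeper features (upsampled) with shallower ones.
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse16 = conv_block(512 + 256, 256, 1)
        self.fuse8 = conv_block(256 + 128, 128, 1)
        # Head: a 1x1 convolution per scale outputs the raw predictions.
        self.heads = nn.ModuleList(nn.Conv2d(c, num_out, 1) for c in (128, 256, 512))

    def forward(self, x):
        p8 = self.b1(x)      # stride-8 feature layer
        p16 = self.b2(p8)    # stride-16 feature layer
        p32 = self.b3(p16)   # stride-32 feature layer
        f16 = self.fuse16(torch.cat([self.up(p32), p16], 1))
        f8 = self.fuse8(torch.cat([self.up(f16), p8], 1))
        return [h(f) for h, f in zip(self.heads, (f8, f16, p32))]


preds = TinyDetector()(torch.randn(1, 3, 640, 640))
print([p.shape for p in preds])   # three scales: 80x80, 40x40 and 20x20

Running it on a 640x640 input yields the three prediction scales (80x80, 40x40 and 20x20) mentioned above.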

5. Core code explanation

5.1 common.py
import math

import torch
import torch.nn as nn


class MHSA(nn.Module):
    def __init__(self, n_dims, width=14, height=14, heads=4, pos_emb=False):
        super(MHSA, self).__init__()

        self.heads = heads
        self.query = nn.Conv2d(n_dims, n_dims, kernel_size=1)
        self.key = nn.Conv2d(n_dims, n_dims, kernel_size=1)
        self.value = nn.Conv2d(n_dims, n_dims, kernel_size=1)
        self.pos = pos_emb
        if self.pos:
            self.rel_h_weight = nn.Parameter(torch.randn([1, heads, (n_dims) // heads, 1, int(height)]),
                                             requires_grad=True)
            self.rel_w_weight = nn.Parameter(torch.randn([1, heads, (n_dims) // heads, int(width), 1]),
                                             requires_grad=True)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        # x is an (N, C, H, W) feature map; attention is computed over all spatial positions.
        n_batch, C, width, height = x.size()
        q = self.query(x).view(n_batch, self.heads, C // self.heads, -1)
        k = self.key(x).view(n_batch, self.heads, C // self.heads, -1)
        v = self.value(x).view(n_batch, self.heads, C // self.heads, -1)
        # Content-content term: similarity between every pair of spatial positions.
        content_content = torch.matmul(q.permute(0, 1, 3, 2), k)
        c1, c2, c3, c4 = content_content.size()
        if self.pos:
            content_position = (self.rel_h_weight + self.rel_w_weight).view(1, self.heads, C // self.heads, -1).permute(
                0, 1, 3, 2)
            content_position = torch.matmul(content_position, q)
            content_position = content_position if (
                    content_content.shape == content_position.shape) else content_position[:, :, :c3, ]
            assert (content_content.shape == content_position.shape)
            energy = content_content + content_position
        else:
            energy = content_content
        attention = self.softmax(energy)
        out = torch.matmul(v, attention.permute(0, 1, 3, 2))
        out = out.view(n_batch, C, width, height)
        return out


class BottleneckTransformer(nn.Module):
    def __init__(self, c1, c2, stride=1, heads=4, mhsa=True, resolution=None, expansion=1):
        super(BottleneckTransformer, self).__init__()
        c_ = int(c2 * expansion)
        self.cv1 = Conv(c1, c_, 1, 1)
        if not mhsa:
            self.cv2 = Conv(c_, c2, 3, 1)
        else:
            self.cv2 = nn.ModuleList()
            self.cv2.append(MHSA(c2, width=int(resolution[0]), height=int(resolution[1]), heads=heads))
            if stride == 2:
                self.cv2.append(nn.AvgPool2d(2, 2))
            self.cv2 = nn.Sequential(*self.cv2)
        self.shortcut = c1 == c2
        if stride != 1 or c1 != expansion * c2:
            self.shortcut = nn.Sequential(
                nn.Conv2d(c1, expansion * c2, kernel_size=1, stride=stride),
                nn.BatchNorm2d(expansion * c2)
            )
        self.fc1 = nn.Linear(c2, c2)

    def forward(self, x):
        out = x + self.cv2(self.cv1(x)) if self.shortcut else self.cv2(self.cv1(x))
        return out


class BoT3(nn.Module):
    def __init__(self, c1, c2, n=1, e=0.5, e2=1, w=20, h=20):
        super(BoT3, self).__init__()
        c_ = int(c2 * e)
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)
        self.m = nn.Sequential(
            *[BottleneckTransformer(c_, c_, stride=1, heads=4, mhsa=True, resolution=(w, h), expansion=e2) for _ in
              range(n)])

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))


class Conv(nn.Module):
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))


class DWConv(Conv):
    def __init__(self, c1, c2, k=1, s=1, act=True):
        super().__init__(c1, c2, k, s, g=math.gcd(c1, c2), act=act)


class TransformerLayer(nn.Module):
    def __init__(self, c, num_heads):
        super().__init__()
        self.q = nn.Linear(c, c, bias=False)
        self.k = nn.Linear(c, c, bias=False)
        self.v = nn.Linear(c, c, bias=False)
        self.ma = nn.MultiheadAttention(embed_dim=c, num_heads=num_heads)
        self.fc1 = nn.Linear(c, c, bias=False)
        self.fc2 = nn.Linear(c, c, bias=False)
        ......

The file common.py defines some commonly used modules and functions, including:

  1. MHSA class: Multi-head self-attention mechanism module, used to calculate the attention weight of the feature map.
  2. BottleneckTransformer class: The bottleneck structure of the Transformer module, used to extract features.
  3. BoT3 class: CSP Bottleneck module, including 3 convolutional layers and a Transformer module.
  4. Conv class: Standard convolution layer, including convolution, batch normalization and activation function.
  5. DWConv class: depthwise separable convolutional layer.
  6. TransformerLayer class: Transformer layer, including multi-head self-attention mechanism and fully connected layer.
  7. TransformerBlock class: Vision Transformer module, containing multiple Transformer layers.
  8. Bottleneck class: standard bottleneck structure.
  9. BottleneckCSP class: CSP Bottleneck module, including two convolutional layers and a bottleneck structure.
  10. C3 class: CSP Bottleneck module, including three convolutional layers and a bottleneck structure.
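
As a quick sanity check of the classes excerpted above, the following illustrative snippet (assuming they are importable from common.py together with its elided helpers such as autopad) verifies that both modules preserve the spatial shape of a 20x20 feature map:

import torch
from common import MHSA, BoT3   # assumes common.py is on the Python path

x = torch.randn(2, 256, 20, 20)                    # a batch of 20x20 feature maps
attn = MHSA(n_dims=256, width=20, height=20, heads=4)
print(attn(x).shape)                               # torch.Size([2, 256, 20, 20])

bot3 = BoT3(c1=256, c2=256, n=1, w=20, h=20)
print(bot3(x).shape)                               # torch.Size([2, 256, 20, 20])
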
5.2 models\common.py
class MHSA(nn.Module):
    def __init__(self, n_dims, width=14, height=14, heads=4, pos_emb=False):
        super(MHSA, self).__init__()

        self.heads = heads
        self.query = nn.Conv2d(n_dims, n_dims, kernel_size=1)
        self.key = nn.Conv2d(n_dims, n_dims, kernel_size=1)
        self.value = nn.Conv2d(n_dims, n_dims, kernel_size=1)
        self.pos = pos_emb
        if self.pos:
            self.rel_h_weight = nn.Parameter(torch.randn([1, heads, (n_dims) // heads, 1, int(height)]),
                                             requires_grad=True)
            self.rel_w_weight = nn.Parameter(torch.randn([1, heads, (n_dims) // heads, int(width), 1]),
                                             requires_grad=True)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        n_batch, C, width, height = x.size()
        q = self.query(x).view(n_batch, self.heads, C // self.heads, -1)
        k = self.key(x).view(n_batch, self.heads, C // self.heads, -1)
        v = self.value(x).view(n_batch, self.heads, C // self.heads, -1)
        content_content = torch.matmul(q.permute(0, 1, 3, 2), k)
        c1, c2, c3, c4 = content_content.size()
        if self.pos:
            content_position = (self.rel_h_weight + self.rel_w_weight).view(1, self.heads, C // self.heads, -1).permute(
                0, 1, 3, 2)
            content_position = torch.matmul(content_position, q)
            content_position = content_position if (
                    content_content.shape == content_position.shape) else content_position[:, :, :c3, ]
            assert (content_content.shape == content_position.shape)
            energy = content_content + content_position
        else:
            energy = content_content
        attention = self.softmax(energy)
        out = torch.matmul(v, attention.permute(0, 1, 3, 2))
        out = out.view(n_batch, C, width, height)
        return out


class BottleneckTransformer(nn.Module):
    def __init__(self, c1, c2, stride=1, heads=4, mhsa=True, resolution=None, expansion=1):
        super(BottleneckTransformer, self).__init__()
        c_ = int(c2 * expansion)
        self.cv1 = Conv(c1, c_, 1, 1)
        if not mhsa:
            self.cv2 = Conv(c_, c2, 3, 1)
        else:
            self.cv2 = nn.ModuleList()
            self.cv2.append(MHSA(c2, width=int(resolution[0]), height=int(resolution[1]), heads=heads))
            if stride == 2:
                self.cv2.append(nn.AvgPool2d(2, 2))
            self.cv2 = nn.Sequential(*self.cv2)
        self.shortcut = c1 == c2
        if stride != 1 or c1 != expansion * c2:
            self.shortcut = nn.Sequential(
                nn.Conv2d(c1, expansion * c2, kernel_size=1, stride=stride),
                nn.BatchNorm2d(expansion * c2)
            )
        self.fc1 = nn.Linear(c2, c2)

    def forward(self, x):
        out = x + self.cv2(self.cv1(x)) if self.shortcut else self.cv2(self.cv1(x))
        return out


class BoT3(nn.Module):
    def __init__(self, c1, c2, n=1, e=0.5, e2=1, w=20, h=20):
        super(BoT3, self).__init__()
        c_ = int(c2 * e)
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)
        self.m = nn.Sequential(
            *[BottleneckTransformer(c_, c_, stride=1, heads=4, mhsa=True, resolution=(w, h), expansion=e2) for _ in
              range(n)])

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))


class Conv(nn.Module):
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))


class DWConv(Conv):
    def __init__(self, c1, c2, k=1, s=1, act=True):
        super().__init__(c1, c2, k, s, g=math.gcd(c1, c2), act=act)


class TransformerLayer(nn.Module):
    def __init__(self, c, num_heads):
        super().__init__()
        self.q = nn.Linear(c, c, bias=False)
        self.k = nn.Linear(c, c, bias=False)
        self.v = nn.Linear(c, c, bias=False)
        self.ma = nn.MultiheadAttention(embed_dim=c, num_heads=num_heads)
        self.fc1 = nn.Linear(c, c, bias=False)
        self.fc2 = nn.Linear(c, c, bias=False)

    def forward(self, x):
        x = self.ma(self.q(x), self.k(x), self.v(x))[0] + x
        x = self.fc2(self.fc1(x)) + x
        return x


class TransformerBlock(nn.Module......

This file is part of the YOLOv7 implementation and contains some commonly used modules and functions. It mainly includes the following:

  1. Some necessary libraries and modules are imported, such as json, math, platform, warnings, etc.
  2. Defines some commonly used functions and classes, such as check_requirements, check_suffix, check_version, etc.
  3. Some custom modules and functions have been imported, such as exif_transpose, letterbox, etc.
  4. An MHSA class is defined to implement the multi-head self-attention mechanism.
  5. A BottleneckTransformer class is defined to implement the Transformer's bottleneck structure.
  6. A BoT3 class is defined to implement the CSP Bottleneck with 3 convolutions structure.
  7. Defines some commonly used convolution and pooling layers, such as Conv, DWConv, etc.
  8. Defines some Transformer-related layers and modules, such as TransformerLayer, TransformerBlock, etc.
  9. Some commonly used Bottleneck structures are defined, such as Bottleneck, BottleneckCSP, etc.

Overall, this program file implements some common modules and functions required by the YOLOv7 model, providing basic support for model training and inference.

6. Overall structure of the system

Overview of overall functions and architecture:
This program is an improved YOLOv7 fruit recognition system based on BoTNet-Transformer. It contains multiple modules and tool classes for defining model structure, data processing, training, inference and other functions. The main files include common.py, ui.py, multiple files in the models directory, multiple files in the tools directory, and multiple files in the utils directory.

The following table is an overview of the functionality of each file:

common.py: Commonly used functions and classes, such as convolution and pooling operations
ui.py: Implementation of the user interface
models\common.py: Transformer-related layers and modules
models\experimental.py: Definition and loading of experimental network modules and model ensembles
models\tf.py: TensorFlow-related models and functions
models\yolo.py: YOLOv7 model structure and forward propagation
models\__init__.py: Model initialization and export methods
tools\activations.py: Activation function definitions
tools\augmentations.py: Data augmentation functions and classes
tools\autoanchor.py: Automatic anchor box generation functions and classes
tools\autobatch.py: Automatic batch size selection functions and classes
tools\callbacks.py: Callback definitions
tools\datasets.py: Dataset processing functions and classes
tools\downloads.py: Functions for downloading datasets and weights
tools\general.py: General utility functions
tools\loss.py: Loss function definitions
tools\metrics.py: Evaluation metric definitions
tools\plots.py: Plotting functions and classes
tools\torch_utils.py: PyTorch utility functions
tools\__init__.py: Initialization of the tools package
tools\aws\resume.py: Functions for resuming training on AWS
tools\aws\__init__.py: Initialization of the AWS module
tools\flask_rest_api\example_request.py: Sample request for the Flask REST API
tools\flask_rest_api\restapi.py: Flask REST API implementation
tools\loggers\__init__.py: Initialization of the logger module
tools\loggers\wandb\log_dataset.py: Classes for logging datasets with WandB
tools\loggers\wandb\sweep.py: Class for hyperparameter search with WandB
tools\loggers\wandb\wandb_utils.py: WandB utility functions
tools\loggers\wandb\__init__.py: Initialization of the WandB logger module
utils\activations.py: Activation function definitions
utils\add_nms.py: Adds non-maximum suppression functionality
utils\augmentations.py: Data augmentation functions and classes
utils\autoanchor.py: Automatic anchor box generation functions and classes
utils\autobatch.py: Automatic batch size selection functions and classes
utils\callbacks.py: Callback definitions
utils\datasets.py: Dataset processing functions and classes
utils\downloads.py: Functions for downloading datasets and weights
utils\general.py: General utility functions
utils\google_utils.py: Google-related utility functions
utils\loss.py: Loss function definitions
utils\metrics.py: Evaluation metric definitions
utils\plots.py: Plotting functions and classes
utils\torch_utils.py: PyTorch utility functions
utils\__init__.py: Initialization of the utils package
utils\aws\resume.py: Functions for resuming training on AWS
utils\aws\__init__.py: Initialization of the AWS module
utils\flask_rest_api\example_request.py: Sample request for the Flask REST API
utils\flask_rest_api\restapi.py: Flask REST API implementation
utils\loggers\__init__.py: Initialization of the logger module
utils\loggers\wandb\log_dataset.py: Classes for logging datasets with WandB
utils\loggers\wandb\sweep.py: Class for hyperparameter search with WandB
utils\loggers\wandb\wandb_utils.py: WandB utility functions
utils\loggers\wandb\__init__.py: Initialization of the WandB logger module
utils\wandb_logging\log_dataset.py: Classes for logging datasets with WandB
utils\wandb_logging\wandb_utils.py: WandB utility functions
utils\wandb_logging\__init__.py: Initialization of the WandB logging module

7. YOLOv7 introduction

Backbone

The backbone of the YOLOv7 network is mainly built from convolutions, E-ELAN modules, MPConv modules and the SPPCSPC module. The E-ELAN (Extended-ELAN) module keeps the transition-layer structure of the original ELAN but changes the computational block, using the ideas of expand, shuffle and merge cardinality to enhance the network's learning ability without destroying the original gradient path. The SPPCSPC module adds several parallel MaxPool branches to a series of convolutions, which avoids the image distortion that resizing operations can introduce and reduces the extraction of redundant features. In the MPConv module, a MaxPool branch enlarges the receptive field of the current feature layer and its output is fused with the feature information produced by the regular convolution branch, which improves the generalization ability of the network.
image.png
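
The SPP idea inside SPPCSPC can be sketched as follows. The kernel sizes (5/9/13) and the channel split are illustrative assumptions rather than the exact YOLOv7 layer definition, and Conv is the CBS block from the common.py excerpt above.

import torch
import torch.nn as nn
from common import Conv   # the CBS block shown in section 5 (assumed importable)

class SimpleSPP(nn.Module):
    # Parallel stride-1 MaxPool branches with growing kernels enlarge the receptive
    # field without resizing the feature map, so no distortion is introduced.
    def __init__(self, c1, c2, kernels=(5, 9, 13)):
        super().__init__()
        c_ = c1 // 2
        self.cv1 = Conv(c1, c_, 1, 1)
        self.pools = nn.ModuleList(nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels)
        self.cv2 = Conv(c_ * (len(kernels) + 1), c2, 1, 1)

    def forward(self, x):
        y = self.cv1(x)
        return self.cv2(torch.cat([y] + [p(y) for p in self.pools], dim=1))

# SimpleSPP(1024, 512)(torch.randn(1, 1024, 20, 20)).shape -> torch.Size([1, 512, 20, 20])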

The input image first goes through feature extraction in the backbone network. The extracted features can be called feature layers, each of which is a set of features describing the input image. From the backbone we obtain three feature layers for the next stage of network construction; these three are referred to as the effective feature layers.

Neck: FPN+PAN structure

image.png

FPN (Feature Pyramid Network)

In the Neck module, YOLOv7, like YOLOv5, adopts the traditional PAFPN structure. The FPN is YOLOv7's enhanced feature extraction network: the three effective feature layers obtained in the backbone are fused here, with the goal of combining feature information from different scales. In the FPN part, the already-obtained effective feature layers are used to continue feature extraction. YOLOv7 still uses the PANet structure, so features are not only upsampled for top-down fusion but also downsampled again for bottom-up fusion.
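
A stripped-down sketch of this top-down plus bottom-up fusion is given below; the channel counts and the plain 1x1/3x3 fusion convolutions are illustrative stand-ins for YOLOv7's actual ELAN-based neck.

import torch
import torch.nn as nn

class TinyPAFPN(nn.Module):
    def __init__(self, channels=(128, 256, 512)):   # strides 8, 16, 32
        super().__init__()
        c3, c4, c5 = channels
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        # Top-down (FPN) path: deep features are upsampled and fused downwards.
        self.td4 = nn.Conv2d(c5 + c4, c4, 1)
        self.td3 = nn.Conv2d(c4 + c3, c3, 1)
        # Bottom-up (PAN) path: fused features are downsampled and fused upwards again.
        self.down3 = nn.Conv2d(c3, c3, 3, stride=2, padding=1)
        self.bu4 = nn.Conv2d(c3 + c4, c4, 1)
        self.down4 = nn.Conv2d(c4, c4, 3, stride=2, padding=1)
        self.bu5 = nn.Conv2d(c4 + c5, c5, 1)

    def forward(self, p3, p4, p5):
        f4 = self.td4(torch.cat([self.up(p5), p4], 1))
        f3 = self.td3(torch.cat([self.up(f4), p3], 1))   # finest fused layer
        f4 = self.bu4(torch.cat([self.down3(f3), f4], 1))
        f5 = self.bu5(torch.cat([self.down4(f4), p5], 1))
        return f3, f4, f5

p3, p4, p5 = (torch.randn(1, c, s, s) for c, s in ((128, 80), (256, 40), (512, 20)))
print([f.shape for f in TinyPAFPN()(p3, p4, p5)])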

Head

For the detection head, the YOLOv7 baseline used in this article adopts the IDetect head, which predicts targets at three sizes: large, medium and small. The RepConv module has a different structure during training and during inference; for details, refer to RepVGG, which introduced the idea of structural re-parameterization.

The YOLO head acts as the classifier and regressor of YOLOv7. Through the backbone and FPN, three enhanced effective feature layers are obtained, each with its own width, height and number of channels. The feature map can be viewed as a collection of feature points; each feature point carries three prior (anchor) boxes, and each prior box carries a vector of channel features. What the YOLO head actually does is judge each feature point and decide whether any of its prior boxes corresponds to an object. As in previous versions of YOLO, YOLOv7 does not use a decoupled head: classification and regression are implemented together in a single 1x1 convolution.
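
This coupled-head idea can be illustrated with a single 1x1 convolution per scale; the feature-layer channel counts and the 80-class assumption below are placeholders, not the values used in this project.

import torch
import torch.nn as nn

num_anchors, num_classes = 3, 80                  # 80 classes is a placeholder
out_channels = num_anchors * (5 + num_classes)    # 5 = (x, y, w, h, objectness)

# One 1x1 convolution per effective feature layer; channel counts are illustrative.
heads = nn.ModuleList(nn.Conv2d(c, out_channels, kernel_size=1) for c in (128, 256, 512))

feat = torch.randn(1, 256, 40, 40)                # medium-scale fused feature layer
pred = heads[1](feat)                             # (1, 255, 40, 40)
# Make the per-prior-box layout explicit: 3 boxes x (5 + num_classes) values per cell.
pred = pred.view(1, num_anchors, 5 + num_classes, 40, 40)
print(pred.shape)                                 # torch.Size([1, 3, 85, 40, 40])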

Convolution + batch normalization + activation function (CBS module)

For the CBS module, we can see from the figure that it consists of a Conv layer (convolution), a BN layer (batch normalization) and a SiLU activation layer.
The SiLU activation function is a variant of the Swish activation function; the formulas of the two are as follows:
silu(x)=x⋅sigmoid(x)
swish(x)=x⋅sigmoid(βx)
image.png
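
A quick numerical check of the two formulas (the value of beta below is arbitrary; beta = 1 recovers SiLU):

import torch
import torch.nn as nn

x = torch.linspace(-4, 4, steps=9)
silu_manual = x * torch.sigmoid(x)                 # silu(x) = x * sigmoid(x)
silu_builtin = nn.SiLU()(x)                        # the activation used in the CBS block
print(torch.allclose(silu_manual, silu_builtin))   # True

beta = 1.5
swish = x * torch.sigmoid(beta * x)                # swish(x) = x * sigmoid(beta * x)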

8. Improved YOLOv7 module

BoTNet, or Bottleneck Transformer, is a deep learning architecture designed at Google Research for visual tasks. Its core idea is to embed Transformer modules in the classic ResNet skeleton, using the Transformer's self-attention mechanism to capture long-range dependencies in images. Next, we delve into the main structure and modules of the Bottleneck Transformer.

Bottleneck Transformer (BoT) Block

The BoT block is the core module of BoTNet; it combines the bottleneck structure of ResNet with the self-attention mechanism of the Transformer. In a traditional ResNet, each bottleneck block consists of three convolutional layers, but in BoTNet the middle convolutional layer is replaced by a multi-head self-attention (MHSA) module, which is responsible for capturing spatial dependencies in the image.
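
A minimal sketch of this replacement, reusing the MHSA module from the common.py excerpt above; the channel counts, reduction factor and ReLU activations are illustrative assumptions rather than the exact BoTNet block.

import torch
import torch.nn as nn
from common import MHSA          # the MHSA class shown in section 5 (assumed importable)

class BoTBlockSketch(nn.Module):
    # ResNet-style bottleneck in which the middle 3x3 convolution is replaced by MHSA.
    def __init__(self, c, reduction=4, heads=4, resolution=(20, 20)):
        super().__init__()
        c_ = c // reduction
        self.reduce = nn.Sequential(nn.Conv2d(c, c_, 1, bias=False), nn.BatchNorm2d(c_), nn.ReLU())
        self.mhsa = MHSA(c_, width=resolution[0], height=resolution[1], heads=heads)
        self.expand = nn.Sequential(nn.Conv2d(c_, c, 1, bias=False), nn.BatchNorm2d(c))
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.expand(self.mhsa(self.reduce(x))))

# BoTBlockSketch(512)(torch.randn(1, 512, 20, 20)).shape -> torch.Size([1, 512, 20, 20])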

Transformer module

The Transformer module is based on the self-attention mechanism. It first decomposes the input feature map into a sequence of embedding vectors. These embedding vectors are then fed into a self-attention layer, where the output of each position is a weighted combination of the input positions. This allows the model to consider global contextual information for each location.

Positional Encoding

Since the Transformer module itself does not consider position information, BoTNet introduces position encoding to add position information to the embedding vector. This ensures that the model is able to distinguish between different locations in the image, thus better capturing spatial dependencies.
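
In the MHSA code above, this relative position encoding is formed by broadcasting a per-row and a per-column embedding into one embedding per spatial location. The snippet below mirrors those shapes for a hypothetical 14x14 feature map with 4 heads and 128 channels.

import torch

heads, dim_head, H, W = 4, 32, 14, 14             # 128 channels split over 4 heads
rel_h = torch.randn(1, heads, dim_head, 1, H)     # one embedding per row
rel_w = torch.randn(1, heads, dim_head, W, 1)     # one embedding per column
pos = rel_h + rel_w                               # broadcast -> (1, 4, 32, 14, 14)
pos = pos.view(1, heads, dim_head, -1)            # (1, 4, 32, 196): one vector per pixel
print(pos.shape)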

Integration with ResNet

The elegance of BoTNet lies in its seamless integration with ResNet. Apart from replacing the middle convolutional layer of the bottleneck with the self-attention module, the rest of the ResNet structure remains unchanged. This means that BoTNet can directly reuse pre-trained ResNet weights, which can accelerate training and improve performance.
image.png

9. System integration

The complete source code, dataset, environment deployment video tutorial and custom UI interface are shown below:
1.png
