1. Research background and significance
With the rapid development of artificial intelligence, computer vision research has made tremendous progress. Object detection is a core computer vision task with a wide range of applications, including intelligent transportation, security monitoring, and autonomous driving. Within object detection, fruit recognition is challenging because fruits differ widely in shape, color, and texture. Fruits also appear under different lighting conditions and viewing angles in different environments, which makes recognition harder still.
At present, object detection methods based on deep learning have achieved remarkable results. Among them, YOLO (You Only Look Once) is a fast and accurate algorithm that achieves real-time detection by treating object detection as a regression problem. However, the standard YOLO algorithm has some weaknesses in fruit recognition tasks, such as poor detection of small targets and limited ability to learn fruit shape and texture features.
In order to solve these problems, this study proposes an improved YOLOv7 fruit recognition system based on BoTNet-Transformer. BoTNet-Transformer is an emerging deep learning model that combines the advantages of Bottleneck Transformer and YOLOv7 to better handle the challenges in fruit recognition tasks.
First of all, the BoTNet-Transformer model has strong feature extraction capabilities. The traditional YOLO algorithm mainly uses convolutional neural networks for feature extraction, but convolutional neural networks have certain limitations when dealing with large-size targets and complex textures. The BoTNet-Transformer model adopts a Transformer structure, which can better capture the global characteristics and contextual information of fruits, thereby improving the accuracy of fruit recognition.
Secondly, the BoTNet-Transformer model can effectively handle small-sized targets. In the fruit recognition task, some fruits may have smaller sizes, and it is often difficult for the traditional YOLO algorithm to accurately detect these small-size targets. By introducing the Bottleneck Transformer structure, the BoTNet-Transformer model can better learn the characteristics of small-sized targets and improve the recall rate of fruit recognition.
In addition, the BoTNet-Transformer model also has strong generalization capabilities. Fruits may have different lighting and angle changes in different environments, which brings certain challenges to fruit recognition. The traditional YOLO algorithm is often sensitive to changes in illumination and angle, which can easily lead to false detections and missed detections. The BoTNet-Transformer model can better learn the contextual information of fruits and improve the robustness of fruit recognition by introducing the Transformer structure.
In summary, the improved YOLOv7 fruit recognition system based on BoTNet-Transformer has important research significance and application value. By introducing the BoTNet-Transformer model, the accuracy, recall rate and robustness of fruit recognition can be improved, further promoting the development of fruit recognition technology, and providing strong support for smart agriculture, food safety and other fields.
2. Picture demonstration
3. Video demonstration
Improved YOLOv7 fruit recognition system based on BoTNet-Transformer (video on bilibili)
4. Detection process
The YOLOv7 network consists of four main parts: Input, Backbone, Neck, and Head. First, the input stage pre-processes the image through a series of operations such as data augmentation and passes it to the backbone network, which extracts features from the processed image. The Neck module then fuses the extracted features to produce feature maps at three scales: large, medium, and small. Finally, the fused features are sent to the detection head, which outputs the detection results.
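As a rough illustration of the three feature scales, the sketch below computes the grid size each detection branch would see. It assumes a 640×640 input and downsampling strides of 8, 16, and 32, which are the usual YOLOv7 defaults but are not stated in this post; `feature_map_sizes` is a hypothetical helper, not part of the project code.

```python
def feature_map_sizes(img_size=640, strides=(8, 16, 32)):
    """Return the (height, width) of the feature map at each stride.

    Assumed values: a square 640-pixel input and strides 8/16/32.
    """
    sizes = {}
    for s in strides:
        if img_size % s != 0:
            raise ValueError(f"image size {img_size} is not divisible by stride {s}")
        sizes[s] = (img_size // s, img_size // s)
    return sizes


for stride, (h, w) in feature_map_sizes().items():
    print(f"stride {stride:2d}: {h}x{w} grid")
```

Under these assumptions the finest grid (stride 8) is the one suited to small fruits, while the coarsest grid (stride 32) is 20×20, which is consistent with the default `w=20, h=20` of the BoT3 module in section 5.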
5. Core code explanation
5.1 common.py
import math

import torch
import torch.nn as nn

# autopad(k, p), used by Conv below, is not part of this excerpt;
# it is assumed to be defined elsewhere in the file.


class MHSA(nn.Module):
    """Multi-head self-attention over the positions of a 2D feature map."""

    def __init__(self, n_dims, width=14, height=14, heads=4, pos_emb=False):
        super(MHSA, self).__init__()
        self.heads = heads
        # 1x1 convolutions serve as the query, key and value projections
        self.query = nn.Conv2d(n_dims, n_dims, kernel_size=1)
        self.key = nn.Conv2d(n_dims, n_dims, kernel_size=1)
        self.value = nn.Conv2d(n_dims, n_dims, kernel_size=1)
        self.pos = pos_emb
        if self.pos:
            # Learnable relative position embeddings along each spatial axis
            self.rel_h_weight = nn.Parameter(torch.randn([1, heads, (n_dims) // heads, 1, int(height)]),
                                             requires_grad=True)
            self.rel_w_weight = nn.Parameter(torch.randn([1, heads, (n_dims) // heads, int(width), 1]),
                                             requires_grad=True)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        n_batch, C, width, height = x.size()
        # Flatten the spatial dimensions: (n_batch, heads, C // heads, width * height)
        q = self.query(x).view(n_batch, self.heads, C // self.heads, -1)
        k = self.key(x).view(n_batch, self.heads, C // self.heads, -1)
        v = self.value(x).view(n_batch, self.heads, C // self.heads, -1)
        # Pairwise similarity between all positions
        content_content = torch.matmul(q.permute(0, 1, 3, 2), k)
        c1, c2, c3, c4 = content_content.size()
        if self.pos:
            content_position = (self.rel_h_weight + self.rel_w_weight).view(1, self.heads, C // self.heads, -1).permute(
                0, 1, 3, 2)
            content_position = torch.matmul(content_position, q)
            content_position = content_position if (
                    content_content.shape == content_position.shape) else content_position[:, :, :c3, ]
            assert (content_content.shape == content_position.shape)
            energy = content_content + content_position
        else:
            energy = content_content
        attention = self.softmax(energy)
        # Weighted sum of the values, reshaped back to a feature map
        out = torch.matmul(v, attention.permute(0, 1, 3, 2))
        out = out.view(n_batch, C, width, height)
        return out


class BottleneckTransformer(nn.Module):
    """Bottleneck block whose 3x3 convolution can be replaced by MHSA."""

    def __init__(self, c1, c2, stride=1, heads=4, mhsa=True, resolution=None, expansion=1):
        super(BottleneckTransformer, self).__init__()
        c_ = int(c2 * expansion)
        self.cv1 = Conv(c1, c_, 1, 1)
        if not mhsa:
            self.cv2 = Conv(c_, c2, 3, 1)
        else:
            self.cv2 = nn.ModuleList()
            self.cv2.append(MHSA(c2, width=int(resolution[0]), height=int(resolution[1]), heads=heads))
            if stride == 2:
                self.cv2.append(nn.AvgPool2d(2, 2))
            self.cv2 = nn.Sequential(*self.cv2)
        self.shortcut = c1 == c2
        if stride != 1 or c1 != expansion * c2:
            self.shortcut = nn.Sequential(
                nn.Conv2d(c1, expansion * c2, kernel_size=1, stride=stride),
                nn.BatchNorm2d(expansion * c2)
            )
        self.fc1 = nn.Linear(c2, c2)  # not used in forward

    def forward(self, x):
        # Note: x is added directly; the projection stored in self.shortcut is never applied,
        # so the residual only lines up when c1 == c2 and stride == 1 (as in BoT3 below).
        out = x + self.cv2(self.cv1(x)) if self.shortcut else self.cv2(self.cv1(x))
        return out


class BoT3(nn.Module):
    """CSP bottleneck with 3 convolutions and n BottleneckTransformer blocks."""

    def __init__(self, c1, c2, n=1, e=0.5, e2=1, w=20, h=20):
        super(BoT3, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)
        self.m = nn.Sequential(
            *[BottleneckTransformer(c_, c_, stride=1, heads=4, mhsa=True, resolution=(w, h), expansion=e2) for _ in
              range(n)])

    def forward(self, x):
        # Transformer branch and plain branch are concatenated along the channel axis
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))


class Conv(nn.Module):
    """Standard convolution: Conv2d + BatchNorm2d + activation (SiLU by default)."""

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        # Skips batch normalization
        return self.act(self.conv(x))


class DWConv(Conv):
    """Depth-wise convolution: a Conv with groups = gcd(c1, c2)."""

    def __init__(self, c1, c2, k=1, s=1, act=True):
        super().__init__(c1, c2, k, s, g=math.gcd(c1, c2), act=act)


class TransformerLayer(nn.Module):
    def __init__(self, c, num_heads):
        super().__init__()
        self.q = nn.Linear(c, c, bias=False)
        self.k = nn.Linear(c, c, bias=False)
        self.v = nn.Linear(c, c, bias=False)
        self.ma = nn.MultiheadAttention(embed_dim=c, num_heads=num_heads)
        self.fc1 = nn.Linear(c, c, bias=False)
        self.fc2 = nn.Linear(c, c, bias=False)
......
The file common.py defines commonly used modules and functions, including:
- MHSA class: multi-head self-attention module that computes attention weights over the positions of a feature map.
- BottleneckTransformer class: bottleneck block in which the MHSA module takes the place of the 3×3 convolution; used for feature extraction.
- BoT3 class: CSP bottleneck module with three convolutional layers and a stack of BottleneckTransformer blocks.
- Conv class: standard convolution layer combining convolution, batch normalization, and an activation function.
- DWConv class: depth-wise convolution layer, a Conv whose number of groups is gcd(c1, c2).
- TransformerLayer class: Transformer layer with multi-head self-attention and fully connected layers.
- TransformerBlock class: Vision Transformer module containing multiple Transformer layers.
- Bottleneck class: standard bottleneck structure.
- BottleneckCSP class: CSP bottleneck module with two convolutional layers and a bottleneck structure.
- C3 class: CSP bottleneck module with three convolutional layers and a bottleneck structure.
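To make the data flow through the MHSA class easier to follow, here is a small shape walk-through in plain Python. It mirrors the `view`, `permute`, and `matmul` steps of the `forward` method listed above without creating any tensors; `mhsa_shapes` and the example sizes (256 channels, 20×20 map) are illustrative assumptions, not part of the project.

```python
def mhsa_shapes(n_batch, channels, height, width, heads=4):
    """Trace tensor shapes through the MHSA forward pass (no tensors needed)."""
    if channels % heads != 0:
        raise ValueError("channels must be divisible by heads")
    d = channels // heads      # channels handled by each head
    tokens = height * width    # every spatial position becomes one token
    return {
        "input": (n_batch, channels, height, width),
        "q, k, v": (n_batch, heads, d, tokens),
        # q^T k: one similarity score for every pair of positions
        "content_content": (n_batch, heads, tokens, tokens),
        "attention": (n_batch, heads, tokens, tokens),
        # v @ attention^T, then reshaped back into a feature map
        "output": (n_batch, channels, height, width),
    }


shapes = mhsa_shapes(n_batch=1, channels=256, height=20, width=20)
for name, shape in shapes.items():
    print(f"{name:16s} {shape}")
```

The attention matrix has (height × width)² entries per head, so its cost grows quickly with resolution. That is a plausible reason for applying the block to a small feature map such as the 20×20 default of BoT3, although the post does not state this explicitly.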
5.2 models\common.py
The listing for models\common.py repeats the MHSA, BottleneckTransformer, BoT3, Conv, and DWConv classes from section 5.1 unchanged and continues with TransformerLayer:
class TransformerLayer(nn.Module):
    def __init__(self, c, num_heads):
        super().__init__()
        self.q = nn.Linear(c, c, bias=False)
        self.k = nn.Linear(c, c, bias=False)
        self.v = nn.Linear(c, c, bias=False)
        self.ma = nn.MultiheadAttention(embed_dim=c, num_heads=num_heads)
        self.fc1 = nn.Linear(c, c, bias=False)
        self.fc2 = nn.Linear(c, c, bias=False)

    def forward(self, x):
        x = self.ma(self.q(x), self.k(x), self.v(x))[0] + x
        x = self.fc2(self.fc1(x)) + x
        return x
class TransformerBlock(nn.Module......
This program file is part of the YOLOv7 model implementation and contains commonly used modules and functions. It mainly includes the following:
- Some necessary libraries and modules are imported, such as json, math, platform, warnings, etc.
- Defines some commonly used functions and classes, such as check_requirements, check_suffix, check_version, etc.
- Some custom modules and functions have been imported, such as exif_transpose, letterbox, etc.
- An MHSA class is defined to implement the multi-head self-attention mechanism.
- A BottleneckTransformer class is defined to implement the Transformer's bottleneck structure.
- A BoT3 class is defined to implement the CSP Bottleneck with 3 convolutions structure.
- Defines some commonly used convolution and pooling layers, such as Conv, DWConv, etc.
- Defines some Transformer-related layers and modules, such as TransformerLayer, TransformerBlock, etc.
- Some commonly used Bottleneck structures are defined, such as Bottleneck, BottleneckCSP, etc.
Overall, this program file implements some common modules and functions required by the YOLOv7 model, providing basic support for model training and inference.
6. Overall structure of the system
Overview of overall functions and architecture:
This program is an improved YOLOv7 fruit recognition system based on BoTNet-Transformer. It contains multiple modules and tool classes for defining model structure, data processing, training, inference and other functions. The main files include common.py, ui.py, multiple files in the models directory, multiple files in the tools directory, and multiple files in the utils directory.
The following table is an overview of the functionality of each file:
File path | Functional overview |
---|---|
common.py | Commonly used functions and classes, such as convolution and pooling operations |
ui.py | User interface functions |
models\common.py | Transformer-related layers and modules |
models\experimental.py | Definition and loading of experimental network modules and model collections |
models\tf.py | TensorFlow-related models and functions |
models\yolo.py | Model structure and forward propagation of YOLOv7 |
models\__init__.py | Model initialization and export methods |
tools\activations.py | Activation function definitions |
tools\augmentations.py | Data augmentation functions and classes |
tools\autoanchor.py | Automatic anchor box generation functions and classes |
tools\autobatch.py | Automatic batch processing functions and classes |
tools\callbacks.py | Callback function definitions |
tools\datasets.py | Dataset processing functions and classes |
tools\downloads.py | Functions for downloading datasets and weights |
tools\general.py | General utility functions |
tools\loss.py | Loss function definitions |
tools\metrics.py | Evaluation metric definitions |
tools\plots.py | Plotting functions and classes |
tools\torch_utils.py | PyTorch utility functions |
tools\__init__.py | Initialization of the tools package |
tools\aws\resume.py | Functions for resuming training on AWS |
tools\aws\__init__.py | Initialization of the AWS module |
tools\flask_rest_api\example_request.py | Sample request for the Flask REST API |
tools\flask_rest_api\restapi.py | Implementation of the Flask REST API |
tools\loggers\__init__.py | Initialization of the logger module |
tools\loggers\wandb\log_dataset.py | Classes for logging datasets with WandB |
tools\loggers\wandb\sweep.py | Class for hyperparameter search with WandB |
tools\loggers\wandb\wandb_utils.py | WandB utility functions |
tools\loggers\wandb\__init__.py | Initialization of the WandB logger module |
utils\activations.py | Activation function definitions |
utils\add_nms.py | Adds non-maximum suppression |
utils\augmentations.py | Data augmentation functions and classes |
utils\autoanchor.py | Automatic anchor box generation functions and classes |
utils\autobatch.py | Automatic batch processing functions and classes |
utils\callbacks.py | Callback function definitions |
utils\datasets.py | Dataset processing functions and classes |
utils\downloads.py | Functions for downloading datasets and weights |
utils\general.py | General utility functions |
utils\google_utils.py | Google-related utility functions |
utils\loss.py | Loss function definitions |
utils\metrics.py | Evaluation metric definitions |
utils\plots.py | Plotting functions and classes |
utils\torch_utils.py | PyTorch utility functions |
utils\__init__.py | Initialization of the utils package |
utils\aws\resume.py | Functions for resuming training on AWS |
utils\aws\__init__.py | Initialization of the AWS module |
utils\flask_rest_api\example_request.py | Sample request for the Flask REST API |
utils\flask_rest_api\restapi.py | Implementation of the Flask REST API |
utils\loggers\__init__.py | Initialization of the logger module |
utils\loggers\wandb\log_dataset.py | Classes for logging datasets with WandB |
utils\loggers\wandb\sweep.py | Class for hyperparameter search with WandB |
utils\loggers\wandb\wandb_utils.py | WandB utility functions |
utils\loggers\wandb\__init__.py | Initialization of the WandB logger module |
utils\wandb_logging\log_dataset.py | Classes for logging datasets with WandB |
utils\wandb_logging\wandb_utils.py | WandB utility functions |
utils\wandb_logging\__init__.py | Initialization of the WandB logger module |
7. YOLOv7 Introduction
Backbone
The backbone of the YOLOv7 network is built mainly from convolutions, the E-ELAN module, the MPConv module, and the SPPCSPC module. The E-ELAN (Extended-ELAN) module keeps the transition layer structure of the original ELAN but changes the computational block, using expand, shuffle, and merge cardinality to strengthen the network's learning ability without destroying the original gradient path. The SPPCSPC module adds multiple parallel MaxPool operations to a series of convolutions, which avoids problems such as image distortion caused by image processing operations and reduces the extraction of repeated image features by the convolutional network. In the MPConv module, a MaxPool operation expands the receptive field of the current feature layer, and its output is fused with the features from a regular convolution branch, which improves the generalization of the network.
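The parallel MaxPool branches of SPPCSPC can only be concatenated if they all keep the same spatial size. The sketch below checks this with the standard pooling output-size formula. The kernel sizes 5, 9, and 13 and the 2×2 stride-2 pooling used for the MPConv check are assumed typical values, not taken from this post, and `pool_output_size` is a hypothetical helper.

```python
def pool_output_size(size, kernel, stride=1, padding=None):
    """Spatial output size of a max-pool layer (standard pooling arithmetic)."""
    if padding is None:
        padding = kernel // 2  # "same" padding for odd kernels at stride 1
    return (size + 2 * padding - kernel) // stride + 1


feature_size = 20        # hypothetical 20x20 feature map
kernels = (5, 9, 13)     # assumed SPP kernel sizes
sizes = [pool_output_size(feature_size, k) for k in kernels]
print(sizes)

# A 2x2, stride-2 pooling branch (assumed for MPConv) halves the resolution instead
print(pool_output_size(40, 2, stride=2, padding=0))
```

With stride 1 and padding of half the kernel size, each branch sees a different receptive field while producing an output of the same size, so the results can be stacked along the channel axis.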
The input image first passes through the backbone network for feature extraction. The extracted features form feature layers, each of which is a set of features of the input image. The backbone yields three feature layers that are used in the next stage of the network; these are referred to here as effective feature layers.
Neck: FPN + PAN structure
FPN (Feature Pyramid Network)
In the Neck module, YOLOv7, like YOLOv5, adopts the conventional PAFPN structure. The FPN is the enhanced feature extraction network of YOLOv7: the three effective feature layers obtained from the backbone are fused here in order to combine feature information at different scales. In the FPN part, the effective feature layers continue to be used for feature extraction. YOLOv7 still uses the PANet structure, so features are not only upsampled for fusion but also downsampled again for a second round of fusion.
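The upsample-then-downsample fusion described above can be sketched at the level of tensor shapes. The example below is a deliberate simplification: the channel counts and grid sizes are hypothetical, and the channel-reducing convolutions that a real neck applies between steps are omitted, so the channel totals grow far more than they would in YOLOv7.

```python
def upsample(shape, factor=2):
    c, h, w = shape
    return (c, h * factor, w * factor)


def downsample(shape, factor=2):
    c, h, w = shape
    return (c, h // factor, w // factor)


def concat(a, b):
    """Channel-wise concatenation; the spatial sizes must match."""
    if a[1:] != b[1:]:
        raise ValueError(f"spatial mismatch: {a} vs {b}")
    return (a[0] + b[0], a[1], a[2])


# Hypothetical effective feature layers as (channels, height, width)
p3, p4, p5 = (128, 80, 80), (256, 40, 40), (512, 20, 20)

# Top-down path (FPN): coarse, deep features are upsampled and merged into finer ones
td4 = concat(upsample(p5), p4)
td3 = concat(upsample(td4), p3)

# Bottom-up path (PAN): fine features are downsampled and merged back
bu4 = concat(downsample(td3), td4)
bu5 = concat(downsample(bu4), p5)

print(td3, bu4, bu5)
```

The three outputs keep the 80×80, 40×40, and 20×20 grids of the inputs, which is what lets the head make predictions at three scales.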
Head
For the detection head, the baseline YOLOv7 in this article uses the IDetect head, which covers three target sizes: large, medium, and small. The RepConv module has a different structure during training than during inference; for details, see RepVGG, which introduced the idea of structural re-parameterization.
Yolo Head, as the classifier and regressor of YOLOv7, receives the three enhanced effective feature layers produced by the Backbone and FPN. Each feature layer has a width, a height, and a number of channels, so a feature map can be viewed as a collection of feature points. Each feature point carries three prior (anchor) boxes, and each prior box is described by a set of channel features. What Yolo Head actually does is evaluate each feature point and determine whether any of its prior boxes corresponds to an object. As in previous versions of YOLO, the head used by YOLOv7 is coupled rather than decoupled: classification and regression are implemented together in a single 1×1 convolution.
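A short calculation shows what the single 1×1 convolution has to output. The sketch assumes the usual YOLO layout of 4 box values, 1 objectness score, and one score per class for each prior box; the number of fruit classes (5) and the grid sizes (80, 40, 20) are hypothetical values chosen for illustration.

```python
def head_output_channels(num_classes, anchors_per_point=3):
    """Channels produced by one detection branch's 1x1 convolution.

    Each prior box predicts 4 box values + 1 objectness score + class scores.
    """
    return anchors_per_point * (4 + 1 + num_classes)


def total_predictions(grid_sizes=(80, 40, 20), anchors_per_point=3):
    """Number of candidate boxes over all three feature layers."""
    return sum(anchors_per_point * g * g for g in grid_sizes)


num_classes = 5  # hypothetical number of fruit categories
print(head_output_channels(num_classes))  # 30
print(total_predictions())                # 25200
```

Because box regression, objectness, and classification share the same output channels, one convolution per scale is enough, which is what makes the head coupled.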
Convolution + batch normalization + activation function (CBS module)
As the figure shows, the CBS module consists of a Conv layer (convolution), a BN layer (batch normalization), and a SiLU layer (activation function).
The SiLU activation function is the special case of the Swish activation function with β = 1. The formulas of the two are as follows:
silu(x)=x⋅sigmoid(x)
swish(x)=x⋅sigmoid(βx)
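The two formulas translate directly into code. This minimal sketch uses only the standard library and is meant to show that SiLU equals Swish with β = 1; it is not the PyTorch implementation used by the model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def swish(x, beta=1.0):
    """swish(x) = x * sigmoid(beta * x)"""
    return x * sigmoid(beta * x)


def silu(x):
    """silu(x) = x * sigmoid(x), i.e. swish with beta = 1"""
    return swish(x, beta=1.0)


for x in (-2.0, 0.0, 2.0):
    print(x, round(silu(x), 4), round(swish(x, beta=2.0), 4))
```

For large positive inputs both functions approach x, and for large negative inputs they approach 0, so they behave like a smooth version of ReLU.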
8. Improved YOLOv7 module
BoTNet, or Bottleneck Transformer, is a deep learning architecture for visual tasks proposed by researchers at UC Berkeley and Google Research. The core idea is to embed the Transformer module in the classic ResNet skeleton, thereby using the Transformer's self-attention mechanism to capture long-distance dependencies in images. The main structure and modules of BoTNet are described below.
Bottleneck Transformer (BoT) Block
BoT Block is the core module of BoTNet, which combines the bottleneck structure of ResNet and the self-attention mechanism of Transformer. In traditional ResNet, each bottleneck block consists of three convolutional layers. But in BoTNet, the middle convolutional layer is replaced by a Transformer module. This Transformer module is responsible for capturing spatial dependencies in images.
Transformer module
The Transformer module is based on the self-attention mechanism. It first decomposes the input feature map into a sequence of embedding vectors. These embedding vectors are then fed into a self-attention layer, where the output of each position is a weighted combination of the input positions. This allows the model to consider global contextual information for each location.
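The three steps above (flatten the feature map into vectors, score every pair of positions, take a weighted combination) can be sketched in a few lines of plain Python. This is a single-head toy version with identity query/key/value projections and a hypothetical 2×2 feature map. It scales the scores by the square root of the vector length, as in the standard Transformer, whereas the MHSA listing in section 5 applies softmax to unscaled scores.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens):
    """Single-head self-attention with identity q/k/v projections.

    tokens: list of feature vectors, one per spatial position.
    Returns (outputs, attention_weights).
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [dot(q, k) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted combination of all input positions
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, tokens)) for c in range(d)])
    return outputs, weights


# Hypothetical 2x2 feature map with 2 channels, stored as feature_map[row][col]
feature_map = [[[1.0, 0.0], [0.0, 1.0]],
               [[1.0, 1.0], [0.5, 0.5]]]
tokens = [vec for row in feature_map for vec in row]  # flatten H x W into 4 tokens
outputs, weights = self_attention(tokens)
print(weights[0])
```

Every row of `weights` sums to 1 and covers all four positions, which is the sense in which each output position has access to global context.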
Positional Encoding
Since the Transformer module itself does not consider position information, BoTNet introduces position encoding to add position information to the embedding vector. This ensures that the model is able to distinguish between different locations in the image, thus better capturing spatial dependencies.
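One way to picture the position information is the factorized scheme suggested by the `rel_h_weight + rel_w_weight` sum in the MHSA listing: one embedding per row plus one per column gives every grid cell its own vector. The sketch below is a simplified illustration of that idea with hypothetical fixed numbers (the real embeddings are learned), and it adds the position term as query · position, which may differ in detail from the project code.

```python
def position_table(rel_h, rel_w):
    """Combine per-row and per-column embeddings into one vector per grid cell.

    rel_h[i] and rel_w[j] are d-dimensional vectors; cell (i, j) gets their sum,
    so H + W vectors are enough to describe all H * W positions.
    """
    return [[[a + b for a, b in zip(rel_h[i], rel_w[j])]
             for j in range(len(rel_w))]
            for i in range(len(rel_h))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical embeddings for a 2x2 grid with d = 2
rel_h = [[0.1, 0.0], [0.4, 0.0]]  # one vector per row
rel_w = [[0.0, 0.2], [0.0, 0.8]]  # one vector per column
pos = position_table(rel_h, rel_w)

query = [1.0, 1.0]
key = [0.5, 0.5]  # the same content placed at every position
logits = [[dot(query, key) + dot(query, pos[i][j]) for j in range(2)]
          for i in range(2)]
print(logits)
```

Although the content term is identical at all four cells, the logits differ because of the position term, so the attention can tell the locations apart.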
Integration with ResNet
The strength of BoTNet lies in its seamless integration with ResNet. Except for replacing the intermediate convolutional layer in the bottleneck with the Transformer module, the rest of the ResNet structure remains unchanged. This means that BoTNet can reuse pre-trained ResNet weights for the unchanged layers, which, according to testing by the AAAI Research Laboratory, can accelerate training and improve performance.
9. System integration
The complete source code, dataset, environment deployment video tutorial, and custom UI are shown below.
Reference blog "Improved YOLOv5 real-time lane segmentation system integrating Seg head network"