Interpretation and Reproduction of an SCI Paper: Real-Time Detection of Apple Leaf Diseases in Natural Scenes Based on YOLOv5

When improving a target detection algorithm, it is often unclear which scenes an improvement applies to, which improvement methods are effective for a given application scene, and how many improvement points are needed to publish at a given level. To resolve this confusion, this series of articles explains how to publish SCI papers in high-level academic journals and introduces the corresponding SCI journals.

Abstract

Aiming at the problem of accurately localizing and identifying multi-scale, heterogeneous apple leaf diseases against complex backgrounds in natural scenes, an apple leaf disease detection method based on an improved YOLOv5s model is proposed. First, the model uses a Bidirectional Feature Pyramid Network (BiFPN) to efficiently realize multi-scale feature fusion; then, the Transformer and convolutional block attention module (CBAM) attention mechanisms are added to reduce the interference of irrelevant background information, strengthen the expression of disease features, and improve the precision and recall of the model. Experimental results show that the proposed BTC-YOLOv5s model (with a model size of 15.8 MB) can effectively detect four kinds of apple leaf diseases in natural scenes with a mean average precision (mAP) of 84.3%. On an eight-core CPU, the model processes 8.7 leaf images per second on average. Compared with classic detection models such as SSD, Faster R-CNN, YOLOv4-tiny and YOLOx, the mAP of this model is higher by 12.74%, 48.84%, 24.44% and 4.2%, respectively, with higher detection accuracy and faster detection speed. In addition, the model is robust to strong-noise conditions such as strong light, dim light, and blurry images, with an mAP exceeding 80%. In summary, the new BTC-YOLOv5s is lightweight, accurate and efficient, and is suitable for deployment on mobile devices. This method can provide technical support for early intervention and control of apple leaf diseases.

1. Introduction

Apple is one of the four most popular fruits in the world. It is rich in nutrients and has important medicinal value. Apple production in China continues to expand, and China has become the world's largest apple producer. However, a variety of diseases hinder the healthy growth of apple trees, seriously affect the quality and yield of apples, and cause significant economic losses. According to statistics, there are about 200 kinds of apple diseases, most of which occur on the leaves.

Therefore, to ensure the healthy development of the apple planting industry, accurate and efficient leaf disease identification and control measures are necessary. In traditional disease identification, fruit growers and experts mainly rely on experience and visual inspection, which is inefficient and highly subjective. With the development of computer and information technology, image recognition has gradually been applied to agriculture. Many researchers have used machine vision algorithms to extract features such as color, shape, and texture from disease images, feeding them into specific classifiers to complete plant disease identification tasks. Zhang et al. used HSI, YUV and grayscale models to process apple disease images, extracted features using a genetic algorithm and correlation-based feature selection, and finally used SVM classifiers to classify apple powdery mildew, mosaic disease and rust, reaching a recognition accuracy of 90%. However, such hand-crafted feature pipelines greatly increase manpower and time costs, which hinders the promotion and popularization of these systems.

In recent years, deep convolutional neural networks have been widely used in intelligent agricultural detection, offering faster detection and higher precision than traditional machine vision techniques [5]. There are two types of target detection models. The first is the two-stage detection algorithm represented by R-CNN [6] and Faster R-CNN [7]. Xie et al. [8] used an improved Faster R-CNN model for real-time detection of grape leaf diseases, introducing three modules (Inception v1, Inception-ResNet-v2 and SE); the mean average precision (mAP) reached 81.1%. Deng et al. [9] proposed a method for large-scale detection and localization of pine blight using UAV remote sensing and artificial intelligence, and performed a series of optimizations that increased the detection accuracy to 89.1%. Zhang et al. [10] designed a multi-feature fusion Faster R-CNN (MF3R-CNN) model for soybean leaf disease detection, with an average accuracy of 83.34%. Wang et al. [11] used the RFCN ResNet101 model to detect potato surface defects with an accuracy of 95.6%. These two-stage detection models can identify crop diseases, but their networks are large and their detection speed is slow, so they are difficult to apply in actual planting practice.

The other type of object detection algorithm is the single-stage algorithm represented by SSD [12] and the YOLO series [13-16]. Unlike two-stage algorithms, it does not need to generate candidate boxes: by casting the bounding-box problem as a regression problem, it uses the features extracted by the network to directly predict the location and category of each lesion. Thanks to its high accuracy, fast speed, short training time, and low computational requirements, it is better suited to agricultural applications. Wang et al. [17] used the SSD-MobileNet V2 model to detect scratches and cracks on the litchi surface, achieving 91.81% mAP at 102 frames per second (FPS). Chang-Hwan et al. [18] proposed a new attention-enhanced YOLO model to identify and detect plant foliar diseases. Li et al. [19] improved the CSP, Feature Pyramid Network (FPN) and Non-Maximum Suppression (NMS) modules of YOLOv5 to detect five vegetable diseases, obtaining 93.1% mAP and effectively reducing the missed and false detections caused by complex backgrounds. In a complex orchard environment, Jiang et al. [20] proposed an improved YOLOX model to detect the ripeness of sweet cherries; the improvements raised mAP and recall by 4.12% and 4.6%, respectively, effectively addressing the interference caused by overlapping fruit and occlusion by branches and leaves. Li et al. [21] used an improved YOLOv5n model to detect cucumber diseases in the natural environment. Intelligent crop disease detection with single-stage object detectors is maturing, but studies on apple leaf disease detection remain few, and most existing studies suffer from small datasets and simple image backgrounds. It is therefore crucial to develop an apple leaf disease detection model with high recognition accuracy and fast detection speed for mobile devices with limited computing power.

Considering the complex planting environment and the varied lesion shapes in apple orchards, this study proposes an improved object detection algorithm based on YOLOv5s. The algorithm aims to reduce the missed and false detections caused by multi-scale lesions, densely distributed lesions and indistinct features in apple leaf disease detection, thereby improving the accuracy and efficiency of the model and providing the necessary technical support for apple leaf disease identification and intelligent orchard management.

2. Materials and Methods

2.1 Materials

2.1.1 Data Acquisition and Labeling

In this study, three datasets were used to train and evaluate the proposed model: the Plant Pathology Challenge 2020 (FGVC7) dataset [22], the Plant Pathology Challenge 2021 (FGVC8) dataset [23] and the PlantDoc dataset [24].

FGVC7 and FGVC8 [22,23] consist of apple leaf disease images used in the plant pathology fine-grained visual categorization competitions hosted on Kaggle. The photos were taken by Cornell AgriTech with a Canon Rebel T5i DSLR camera and a smartphone, each with a resolution of 4000 × 2672 pixels. Four types of apple leaf disease are covered: rust, frogeye leaf spot, powdery mildew and scab. These diseases occur frequently and cause significant losses in apple quality and yield. Sample images from the dataset are shown in Figure 1.
[Figure 1: Sample images of the four apple leaf diseases from the FGVC7/FGVC8 datasets]
PlantDoc [24] is a non-laboratory image dataset for visual plant disease detection constructed by Davinder Singh et al. in 2020. It contains 2598 images of plant diseases in natural scenes, covering 13 plant species and as many as 17 diseases. Most images in PlantDoc are low-resolution and noisy, and sample sizes are insufficient, making detection more difficult. In this study, its apple rust and scab images were used to augment the dataset and validate the generalization ability of the proposed model. Examples of the disease images are shown below.
[Figure 2: Examples of apple disease images from the PlantDoc dataset]
From the collected datasets, we selected (1) images with light intensity varying over time, (2) images taken from different shooting angles, (3) images with different disease severities, and (4) images of different disease stages, to ensure the richness and diversity of the dataset. In total, 2099 apple leaf disease images were selected.
LabelImg software was used to annotate the images, recording the disease type and the center coordinates, width and height of each lesion. A total of 10,727 lesion instances were annotated, as shown in Table 1. The labeled dataset was randomly divided into training and test sets at a ratio of 8:2. This dataset, called ALDD (apple leaf disease data), is used to train and test the model.

[Table 1: Distribution of the 10,727 annotated lesion instances in the ALDD dataset]

2.1.2 Data Augmentation

The actual apple orchard environment is complex, with many interfering factors, so the data selected so far are far from sufficient. To enrich the image dataset, we use mosaic image augmentation [16] and online data augmentation. Mosaic augmentation randomly selects four images from the training set, rotates, scales and adjusts their hue, and finally merges them into one image. This not only enriches the image backgrounds and increases the number of instances, but also indirectly increases the effective batch size, which speeds up training and helps improve the detection of small objects. Online augmentation applies data augmentation during model training, keeping the sample count unchanged while diversifying the overall sample and continuously expanding the sample space to improve the robustness of the model. It mainly includes changes of hue, saturation and brightness, as well as translation, rotation, flipping and other operations. The total number of images is constant, but the data fed in each batch varies, which helps the model converge quickly. An example of an augmented image is shown in Figure 3.
[Figure 3: Examples of augmented images]
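For readers reproducing the pipeline, the following is a minimal sketch of 4-image mosaic augmentation in Python. It uses a simplified fixed 2 × 2 grid rather than YOLOv5's random mosaic center, random scaling and HSV jitter, and the function and variable names are our own, not the paper's code:

```python
import numpy as np
import cv2

def mosaic4(images, labels, s=640):
    """Merge 4 images into one 2s x 2s mosaic (simplified: fixed 2x2 grid).

    images: list of 4 HxWx3 uint8 arrays.
    labels: list of 4 Nx5 float arrays [class, x1, y1, x2, y2] in pixels.
    """
    canvas = np.full((2 * s, 2 * s, 3), 114, dtype=np.uint8)  # gray letterbox fill
    offsets = [(0, 0), (s, 0), (0, s), (s, s)]  # top-left corner of each tile
    merged = []
    for img, lab, (ox, oy) in zip(images, labels, offsets):
        h, w = img.shape[:2]
        canvas[oy:oy + s, ox:ox + s] = cv2.resize(img, (s, s))
        out = lab.astype(float)
        out[:, [1, 3]] = out[:, [1, 3]] * (s / w) + ox  # scale and shift x1, x2
        out[:, [2, 4]] = out[:, [2, 4]] * (s / h) + oy  # scale and shift y1, y2
        merged.append(out)
    return canvas, np.concatenate(merged, axis=0)
```

The merged labels from all four source images become the targets of the single mosaic image, which is how the method increases the number of instances per training sample.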

2.2 Methods

2.2.1 YOLOv5s model

According to the network depth and feature map width, YOLOv5 can be divided into YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x [25]. As the depth and width increase, the number of layers of the network increases and the structure becomes more complex. In order to meet the requirements of lightweight deployment and real-time detection, reduce the storage space occupied by the model, and improve the recognition speed, this study chooses YOLOv5s as the baseline model.

YOLOv5s consists of four parts: input, backbone, neck and prediction.
The input part includes mosaic data augmentation, adaptive anchor box computation, and adaptive image scaling. The backbone performs feature extraction and consists of four parts: Focus, CBS, C3 and Spatial Pyramid Pooling (SPP). YOLOv5s uses two kinds of C3 [26] modules, one for the backbone and one for the neck, as shown in Fig. 4; the former uses residual units, while the latter does not. SPP applies max-pooling with kernels of different sizes to the feature map to fuse multiple receptive fields and generate semantic information. The neck layer combines a Feature Pyramid Network (FPN) [28] and a Path Aggregation Network (PANet) [29] to fuse image features. The prediction part consists of three detection layers corresponding to 20 × 20, 40 × 40 and 80 × 80 feature maps, for detecting large, medium and small objects, respectively. Finally, the CIOU (complete intersection over union) [30] loss function measures the distance between the predicted box and the ground-truth box, and NMS removes redundant boxes, retaining the detection box with the highest confidence. The YOLOv5s network model is shown in the figure below.
[Figure: YOLOv5s network model]
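As a concrete illustration of the SPP block described above (parallel max-pooling at several kernel sizes followed by channel concatenation), here is a PyTorch sketch. The hidden channel width and the kernel sizes (5, 9, 13) follow common YOLOv5 practice but are assumptions here; the real implementation also wraps each convolution in a Conv-BN-SiLU (CBS) block:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial Pyramid Pooling: fuse several receptive fields by max-pooling
    the same feature map at different kernel sizes and concatenating."""
    def __init__(self, c_in, c_out, ks=(5, 9, 13)):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1)  # 1x1 conv to shrink channels
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in ks
        )
        self.cv2 = nn.Conv2d(c_hidden * (len(ks) + 1), c_out, 1)

    def forward(self, x):
        x = self.cv1(x)
        # concatenate the original map with its pooled views along channels
        return self.cv2(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```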

2.2.2 Bidirectional Feature Pyramid Network (BiFPN)

YOLOv5s combines FPN and PANet for multi-scale feature fusion. FPN enhances semantic information in a top-down manner, while PANet enhances positional information bottom-up. This combination strengthens the feature fusion ability of the neck layer. However, when input features of different resolutions are fused, they are simply aggregated, and their contributions to the fused output are usually unequal.

To address this issue, Tan et al. [31] developed BiFPN based on efficient bidirectional cross-scale connections and weighted multi-scale feature fusion. The algorithm introduces learnable weights to learn the importance of different input features, and repeatedly applies top-down and bottom-up multi-scale feature fusion. The structure of BiFPN is shown in Figure 5.

[Figure 5: Structure of BiFPN]
BiFPN simplifies the bidirectional network in two ways. First, nodes with only one input edge are removed: since they perform no feature fusion, they contribute little to the network's goal of fusing different features. Second, an extra edge is added between the input and output nodes of the same layer, so that higher-level fused features can be obtained by repeated stacking. In addition, BiFPN introduces a simple and efficient weighted feature fusion mechanism that adds learnable weights to assign different degrees of importance to feature maps of different resolutions. The formulas are shown in (1) and (2):

$$P_i^{td} = \mathrm{Conv}\left(\frac{w_1 \cdot P_i^{in} + w_2 \cdot \mathrm{Resize}(P_{i+1}^{in})}{w_1 + w_2 + \varepsilon}\right) \quad (1)$$

$$P_i^{out} = \mathrm{Conv}\left(\frac{w_1' \cdot P_i^{in} + w_2' \cdot P_i^{td} + w_3' \cdot \mathrm{Resize}(P_{i-1}^{out})}{w_1' + w_2' + w_3' + \varepsilon}\right) \quad (2)$$
where $P_i^{in}$ is the input feature of the i-th layer, $P_i^{td}$ is the intermediate feature of the i-th layer on the top-down path, $P_i^{out}$ is the output feature of the i-th layer on the bottom-up path, $w$ denotes the learnable weights, $\varepsilon = 0.0001$ is a small value added to avoid numerical instability, Resize is a downsampling or upsampling operation, and Conv is a convolution operation.

The neck layer with BiFPN increases the fusion of multi-scale features to provide powerful semantic information for the network. It helps to detect apple leaf diseases of different sizes and alleviates the network's inaccurate recognition of overlapping and ambiguous objects.
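To make the weighted fusion in Eqs. (1) and (2) concrete, here is a minimal PyTorch sketch of fast normalized fusion. The module name is ours, and the inputs are assumed to have already been resized to a common resolution; in BiFPN a convolution follows the fusion:

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast normalized fusion: out = sum_i(w_i * x_i) / (sum_i(w_i) + eps)."""
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))  # one learnable weight per edge
        self.eps = eps

    def forward(self, xs):  # xs: list of feature maps, already resized to one shape
        w = torch.relu(self.w)            # keep the weights non-negative
        w = w / (w.sum() + self.eps)      # normalize so the fusion stays bounded
        return sum(wi * xi for wi, xi in zip(w, xs))
```

A two-input instance corresponds to the fusion inside Eq. (1), and a three-input instance to the fusion inside Eq. (2).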

2.2.3 Transformer Encoder Block

Lesions on apple leaves are often densely distributed. After mosaic data augmentation, the number of lesions and the amount of background information both increase, which can make it difficult to accurately locate the regions where lesions occur. To avoid this problem, a Transformer [32] attention mechanism is added at the end of the backbone. The Transformer module captures global context information and establishes long-range dependencies between feature channels and disease targets. The Transformer encoder module uses a self-attention mechanism to explore feature representation capabilities and performs well in high-density scenarios [33]. The self-attention mechanism is designed based on the principles of human vision, allocating resources according to the importance of visual objects. It has a global receptive field, models long-range contextual information, captures rich global semantic information, and assigns different weights to different semantic information, making the network pay more attention to key information [34]. The calculation is given in formula (3), which involves three basic elements, query, key and value, denoted Q, K and V respectively.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \quad (3)$$
where $d_k$ is the dimension of the key vectors (the number of input feature map channels); dividing by $\sqrt{d_k}$ normalizes the dot products to avoid unstable gradients.
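Formula (3) translates directly into a few lines of PyTorch; this transcription is for illustration and omits masking and multi-head details:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Direct transcription of Eq. (3): softmax(Q K^T / sqrt(d_k)) V."""
    dk = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(dk)  # query-key similarities
    return torch.softmax(scores, dim=-1) @ v          # weighted sum of values
```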

Each transformer encoder consists of a multi-head attention block and a feed-forward neural network. The structure of the multi-head attention mechanism is shown in Figure 6.

It differs from single-head self-attention in that the latter uses only one set of Q, K and V values, while multi-head attention uses multiple sets of Q, K and V values, computing and then concatenating multiple matrices. The different linear transformations project into different vector spaces, which helps the current encoding focus on the current pixel while obtaining semantic information about its context [35]. The multi-head attention mechanism enhances the ability to extract disease features and improves the detection performance of the model by capturing long-distance dependencies without increasing computational complexity.
[Figure 6: Structure of the multi-head attention mechanism]
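A compact PyTorch sketch of such an encoder block is shown below. It uses torch.nn.MultiheadAttention for the multi-head step; the head count, pre-norm layout and FFN width are illustrative assumptions, not the paper's exact C3TR settings:

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """Multi-head self-attention followed by a feed-forward network,
    each with a residual connection (a standard encoder block)."""
    def __init__(self, c, num_heads=4):  # num_heads must divide c
        super().__init__()
        self.ln1 = nn.LayerNorm(c)
        self.attn = nn.MultiheadAttention(embed_dim=c, num_heads=num_heads,
                                          batch_first=True)
        self.ln2 = nn.LayerNorm(c)
        self.ffn = nn.Sequential(nn.Linear(c, 4 * c), nn.GELU(), nn.Linear(4 * c, c))

    def forward(self, x):
        # x: (batch, seq, channels); a CNN map is flattened to a sequence first
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # Q = K = V: self-attention
        return x + self.ffn(self.ln2(x))
```

For a C × H × W feature map, `feat.flatten(2).transpose(1, 2)` produces the (batch, H·W, C) sequence this block expects.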

2.2.4 Convolutional Block Attention Module (CBAM)

Determining the type of a disease depends more on local information in the feature map, while locating the lesion depends more on positional information. The improved YOLOv5s therefore uses the CBAM [36] attention mechanism to weight features in both the spatial and channel dimensions, enhancing the model's attention to local and spatial information.

As shown in Fig. 7, CBAM consists of two sub-modules: the Channel Attention Module (CAM) and the Spatial Attention Module (SAM), which perform channel and spatial attention, respectively. The input feature map F ∈ R^{C×H×W} first passes through the CAM, whose output Mc ∈ R^{C×1×1} is multiplied by the input feature. The result is then fed to the SAM, whose output Ms ∈ R^{1×H×W} is multiplied by the CAM-weighted features to obtain the final result. The calculation is shown in formulas (4) and (5).
$$F' = M_c(F) \otimes F \quad (4)$$

$$F'' = M_s(F') \otimes F' \quad (5)$$
where F is the input feature map, $M_c$ is the channel attention map produced by the CAM, $M_s$ is the spatial attention map produced by the SAM, and ⊗ denotes element-wise multiplication.
[Figure 7: Structure of the CBAM module]
The CAM in CBAM attends to the weights of different channels and multiplies each channel by its corresponding weight, increasing the focus on important channels.

Average-pooling and max-pooling are applied to the feature map F of size H × W × C to obtain two 1 × 1 × C channel descriptors, which are passed through a shared two-layer multi-layer perceptron (MLP). The two outputs are added element-wise, and a sigmoid activation function produces the final channel attention map, as shown in formula (6).

$$M_c(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))) \quad (6)$$

As formula (7) shows, the SAM pays more attention to the location information of lesions. The CAM output is average-pooled and max-pooled along the channel dimension to obtain two H × W × 1 maps. The two maps are concatenated, and a 7 × 7 convolution followed by a sigmoid activation function gives the final result:

$$M_s(F) = \sigma(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])) \quad (7)$$
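Putting Eqs. (4)-(7) together, a minimal PyTorch sketch of CBAM might look as follows; the channel reduction ratio of 16 follows the original CBAM paper and is an assumption here:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention (shared MLP over avg-/max-pooled descriptors) followed
    by spatial attention (7x7 conv over channel-pooled maps)."""
    def __init__(self, c, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(  # shared two-layer MLP of the CAM, as 1x1 convs
            nn.Conv2d(c, c // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, f):
        # CAM, Eq. (6): sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))
        mc = torch.sigmoid(self.mlp(f.mean((2, 3), keepdim=True)) +
                           self.mlp(f.amax((2, 3), keepdim=True)))
        f = mc * f                                   # Eq. (4)
        # SAM, Eq. (7): concat channel-wise avg/max maps, 7x7 conv, sigmoid
        ms = torch.sigmoid(self.spatial(torch.cat(
            [f.mean(1, keepdim=True), f.amax(1, keepdim=True)], dim=1)))
        return ms * f                                # Eq. (5)
```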

2.2.5 BTC-YOLOv5s detection model

Building on the original advantages of the YOLOv5s model, this study proposes an improved BTC-YOLOv5s algorithm for apple leaf disease detection. While maintaining detection speed, it improves the accuracy of identifying apple leaf diseases in complex environments. The algorithm improves three parts: BiFPN, the Transformer and the CBAM attention mechanism. First, the CBAM module is added before the SPP of the YOLOv5s backbone to highlight the useful information in the disease detection task and suppress useless information, thereby improving the detection accuracy of the model. Second, the C3 module is replaced with a C3TR module containing a Transformer, which improves the extraction of apple leaf disease features. Third, we replace the concat layers with BiFPN layers and add a path from layer 6 to layer 20. The features generated by the backbone are bidirectionally connected with those generated by FPN and PANet to provide a stronger information representation. Figure 8 shows the overall framework of the BTC-YOLOv5s model in this study.
[Figure 8: Overall framework of the BTC-YOLOv5s model]

2.3 Experimental environment and parameter settings

Models are trained and tested on a Linux system under the PyTorch 1.10.0 deep learning framework, with the following hardware: an Intel® Xeon® E5-2686 v4 @ 2.30 GHz processor, 64 GB of memory, and an NVIDIA GeForce RTX 3090 graphics card with 24 GB of video memory. The software environment is CUDA 11.3, cuDNN 8.2.1 and Python 3.8.

During training, the initial learning rate is set to 0.01 and is decayed with a cosine annealing strategy. The network parameters are optimized with stochastic gradient descent (SGD), using a momentum of 0.937 and a weight decay of 0.0005. The model is trained for 150 epochs with an image batch size of 32, and input images are uniformly resized to 640 × 640. Table 2 lists the tuned training parameters.

[Table 2: Training parameter settings]
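Under the settings in Table 2, the training loop can be sketched as follows. The `model`, `train_loader` and `compute_loss` names are assumed placeholders (YOLOv5's real trainer also adds warmup and per-parameter-group learning rates), but the optimizer and scheduler values are the ones reported above:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder network; stands in for BTC-YOLOv5s
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=150)

for epoch in range(150):
    for images, targets in train_loader:   # assumed ALDD DataLoader, batch size 32
        loss = compute_loss(model(images), targets)  # assumed CIOU + obj/cls loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                       # cosine-anneal the LR once per epoch
```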

2.4 Model Evaluation Indicators

Evaluation indicators are divided into two aspects: performance evaluation and complexity evaluation. Model performance evaluation metrics include precision, recall, mAP and F1 score. Model complexity evaluation indicators include model size, floating point operations (FLOPs) and FPS, which are used to evaluate the computational efficiency and image processing speed of the model.

Precision is the ratio of correctly predicted positive samples to all samples predicted as positive, measuring the classification ability of the model, while recall is the ratio of correctly predicted positive samples to all actual positive samples. AP is the area under the precision-recall curve, and mAP is the mean AP over all classes, reflecting the overall performance of the model in object detection and classification. The F1 score is the harmonic mean of precision and recall, evaluating the model using both. The calculation formulas are shown in (8)-(12).
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (8)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (9)$$

$$AP = \int_0^1 P(R)\,dR \quad (10)$$

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \quad (11)$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (12)$$

In the formulas, TP is the number of correctly detected positive samples, FP is the number of negative samples incorrectly detected as positive, and FN is the number of positive samples that were missed.
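From these counts, formulas (8), (9) and (12) reduce to a few lines of Python; this helper is a direct transcription, not the paper's evaluation code:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision (8), recall (9) and F1 (12) from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 90 correct detections, 10 false alarms, 20 misses:
# precision_recall_f1(90, 10, 20) -> (0.900, 0.818, 0.857)
```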

Model size refers to the amount of memory required to store the model. FLOPs is used to measure the complexity of the model, which is the total number of multiplication and addition operations performed by the model. The lower the FLOPs value, the less calculation required for model inference and the faster the model calculation speed.

The formulas for FLOPs are shown in (13) and (14). FPS represents the number of images the model processes per second; it evaluates processing speed and is crucial for real-time disease detection. Considering that the model should run on mobile devices at low computational cost, an eight-core CPU without a graphics card is used for speed testing.

$$\mathrm{FLOPs} = 2 \times C_{in} \times K^2 \times C_{out} \times W_{out} \times H_{out}$$
where $C_{in}$ is the number of input channels, $C_{out}$ the number of output channels, $K$ the convolution kernel size, and $W_{out}$ and $H_{out}$ the width and height of the output feature map, respectively.
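As a sanity check of this per-layer formula, the following helper computes it for an example layer; the factor of 2 counts multiplications and additions separately, per the definition of FLOPs given above:

```python
def conv_flops(c_in, c_out, k, h_out, w_out):
    """Multiply-add count of a single KxK convolution layer (bias and BN ignored)."""
    return 2 * c_in * k * k * c_out * h_out * w_out

# e.g. a 3x3 conv from 64 to 128 channels producing an 80x80 map:
# conv_flops(64, 128, 3, 80, 80) -> 943,718,400 (~0.94 GFLOPs)
```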

3. Results

3.1 Performance Evaluation

The BTC-YOLOv5s model was verified on the constructed ALDD test set, and the same optimization parameters were used for comparison with the YOLOv5s baseline. As shown in Table 3, the AP of the improved model for frogeye leaf spot is similar to that of the original model, while detection performance for the other three diseases improves significantly. It is worth noting that scab, whose lesions have irregular shapes, is the most difficult disease to detect, and the improved model gains 3.3% AP on it, the largest improvement. These results demonstrate that the proposed model effectively detects all four diseases with improved accuracy.

[Table 3: AP of YOLOv5s and BTC-YOLOv5s for each disease class]
Figure 9 shows the precision, recall, mAP@0.5 and mAP@0.5:0.95 curves of the baseline model YOLOv5s and the improved model BTC-YOLOv5s trained for 150 epochs.

As seen in Figure 9, after 50 epochs the precision and recall curves fluctuate within a narrow range, but the BTC-YOLOv5s curves remain above the baseline curves. The mAP@0.5 curve of the improved model crosses that of the baseline model around epoch 60.

Although the mAP@0.5 of the baseline model rises faster in the early stage, the BTC-YOLOv5s model improves steadily in the later stage and achieves better results. The mAP@0.5:0.95 curves show similar behavior.

Because apple leaf lesions are small and densely distributed, to further verify the accuracy of the BTC-YOLOv5s model, the test set was divided into two groups by lesion density: sparsely distributed and densely distributed diseases. We compared the detection results of the baseline and improved models. The mAP@0.5 of the BTC-YOLOv5s model on sparse and dense lesion images is 87.3% and 81.4%, respectively, 1.7% and 0.7% higher than the baseline model.

[Figure 9: Precision, recall, mAP@0.5 and mAP@0.5:0.95 curves of YOLOv5s and BTC-YOLOv5s]
As shown in Figure 10, yellow circles indicate missed detections and red circles indicate false detections. The baseline YOLOv5s misses small or blurry lesions regardless of whether the lesions are sparse or dense (first row of images in Fig. 10a,b). The improved model resolves this issue, detecting small lesions and diseases on out-of-focus leaves (second row of images in Fig. 10a,b), and it produces higher confidence scores. The baseline model also falsely detects non-diseased regions such as apples, background, and other irrelevant objects (Fig. 10(b5)). The improved model focuses more on the diseases themselves, extracting deeper discriminative features between different diseases and avoiding these errors. Frogeye leaf spot, scab and rust produce small lesions densely distributed over different parts of the leaves, whereas powdery mildew more commonly covers the whole leaf. This makes the scale of the detection boxes vary from large to small, and the proposed model adapts well to the scale variation of the different diseases.

Therefore, the BTC-YOLOv5s model can not only adapt to the detection of different disease distributions, but also adapt to the changes of apple leaf diseases of different scales and characteristics, showing excellent detection results.
[Figure 10: Detection results of YOLOv5s and BTC-YOLOv5s on sparsely and densely distributed diseases]

3.2 Analysis of ablation experiment results

In this study, the effectiveness of different optimization modules is verified through ablation experiments. We sequentially add BiFPN module (BF), Transformer module (TR) and CBAM attention module to the baseline model YOLOv5s, build multiple improved models, and compare the results on the same test data. The experimental results are shown in Table 4.

In Table 4, the precision of the baseline YOLOv5s is 78.4% and its mAP@0.5 is 82.7%. Adding each of the three optimization modules, the BiFPN module, the Transformer module and the CBAM attention module, improves both precision and mAP@0.5 over the baseline: precision increases by 3.3%, 3.3% and 1.1%, and mAP@0.5 by 0.5%, 1% and 0.2%, respectively. The combination of all three modules achieves the best results, with precision, mAP@0.5 and mAP@0.5:0.95 reaching their maximum values, 5.7%, 1.6% and 0.1% higher than the baseline, respectively. Through the fusion of cross-channel and spatial information, the CBAM attention mechanism highlights important features while suppressing irrelevant ones.

In addition, the Transformer module uses the self-attention mechanism to build long-range dependencies between feature channels and disease features. The BiFPN module fuses these features across scales to improve the recognition of overlapping and ambiguous objects. The BTC-YOLOv5s model achieves the best performance thanks to the combination of the three modules.
[Table 4: Results of the ablation experiments]

3.3 Analysis of Attention Mechanism

To evaluate the effectiveness of the CBAM attention module, we retain the other structures and experimental parameter settings of the BTC-YOLOv5s model, and replace only the CBAM module with other mainstream attention mechanisms, namely the SE [37], CA [38] and ECA [39] modules, for comparison.

From Table 5, the attention mechanisms clearly improve the accuracy of the model. The mAP@0.5 of the SE, CA, ECA and CBAM variants reaches 83.4%, 83.6%, 83.6% and 84.3%, respectively, which is 0.4%, 0.6%, 0.6% and 1.3% higher than the YOLOv5s + BF + TR model. Each attention mechanism improves mAP@0.5 to a different degree. Among them, the CBAM variant performs best, reaching 84.3%, which is 0.9%, 0.7% and 0.7% higher than the SE, CA and ECA variants, respectively; its mAP@0.5:0.95 is also the highest of the four attention mechanisms. The SE and ECA attention mechanisms only consider channel information in feature maps, while the CA attention mechanism uses positional information to encode channel relationships. In contrast, the CBAM attention mechanism combines spatial attention with channel attention, emphasizing disease-feature information in the feature maps, which is more conducive to disease recognition and localization.

[Table 5: Comparison of different attention mechanisms]
Furthermore, the attention modules do not increase the model size or FLOPs, indicating that they are lightweight. The BTC-YOLOv5s model with the CBAM module improves recognition accuracy while keeping the same model size and computational cost.

3.4 Comparison with state-of-the-art models

The current mainstream two-stage detection model Faster R-CNN and the one-stage detection models SSD, YOLOv4-tiny and YOLOx-s were selected for comparative experiments. The ALDD dataset is used for training and testing, and the experimental parameters are the same for all models. The experimental results are shown in Table 6.

[Table 6: Comparison with state-of-the-art detection models]

Among all the models, the mAP@0.5 and F1 scores of Faster R-CNN are below 50%; its model size is large and its computation heavy, resulting in an FPS of only 0.16, unsuitable for real-time detection of apple leaf diseases. The mAP@0.5 of the single-stage SSD is 71.56% with a model size of 92.1 MB, meeting neither the accuracy nor the complexity requirements. In the YOLO series, the mAP@0.5 of YOLOv4-tiny is only 59.86%, which is too low. YOLOx-s achieves 80.1% mAP@0.5, but its FLOPs reach 26.64 G and it processes only 4.08 images per second. Neither is well suited to mobile deployment. The proposed BTC-YOLOv5s model has the highest mAP@0.5 and F1 scores of all the models, 12.74%, 48.84%, 24.44%, 4.2% and 1.6% higher than SSD, Faster R-CNN, YOLOv4-tiny, YOLOx-s and YOLOv5s, respectively. Its model size and FLOPs are similar to the baseline model, and its FPS reaches 8.7 frames per second, meeting the requirements of real-time detection of apple leaf diseases in real scenes.

As shown in Figure 11, the BTC-YOLOv5s model outperforms the other five models in terms of detection accuracy. In addition, the model size, calculation amount and detection speed of the BTC-YOLOv5s model are comparable to other lightweight models. In summary, the overall performance of the BTC-YOLOv5s model is excellent, and it can accurately and efficiently complete the task of apple leaf disease detection in real scenarios.
[Figure 11: Performance comparison of the six detection models]

3.5 Robustness test

In actual production, the detection of apple leaf diseases may be disturbed by various environmental factors, such as overexposure, dim light, and low image resolution. In this study, such interference was simulated by increasing brightness, reducing brightness, and adding Gaussian noise to the test-set images, yielding a total of 1191 images (397 per condition). We evaluated the robustness of the optimized BTC-YOLOv5s model under these disturbance conditions to determine its detection effectiveness. Furthermore, we tested the model's ability to detect concurrent diseases by adding 50 images containing multiple diseases. The experimental results are shown in Figure 12.

[Figure 12: Detection results under strong light, low light, blur and multi-disease conditions]
From the detection results, the model accurately detects frogeye leaf spot, rust and powdery mildew under strong light, low light and blurred-noise conditions, with few missed detections. Scab was also correctly identified, but some misses occurred under low light and blur. This is mainly because scab lesions appear dark, and in dim light the overall background color of the image is similar to that of the lesions. As shown in the fifth row of Figure 12, the model also detects concurrent diseases, although some detections are missed under blurred conditions. The experiments achieved an mAP of more than 80%. Overall, the BTC-YOLOv5s model remains robust under extreme conditions such as image blur and low light.
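The perturbations used to build this robustness test set can be sketched as follows; the brightness factors and noise level are illustrative assumptions, since the exact parameters are not reported:

```python
import numpy as np
import cv2

def simulate_conditions(img):
    """Generate three perturbed versions of a test image: overexposed,
    dim, and Gaussian-noised (illustrative factors, not the paper's)."""
    bright = cv2.convertScaleAbs(img, alpha=1.5)   # strong light (gain 1.5)
    dim = cv2.convertScaleAbs(img, alpha=0.4)      # dim light (gain 0.4)
    noise = np.random.normal(0, 25, img.shape)     # zero-mean noise, sigma = 25
    noisy = np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    return bright, dim, noisy
```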

4. Discussion

4.1 Multi-scale detection

Since apple leaf diseases vary in size, multi-scale detection is a challenging task. In this study, frogeye leaf spot, scab and rust lesions were usually small and dense, while powdery mildew formed a complete lesion spread over the leaf.

The size of the lesions to be detected, relative to the entire image, can vary greatly between images and even within the same image. To solve this problem, this study introduces BiFPN into YOLOv5s based on the idea of multi-scale feature fusion, improving the model's multi-scale detection ability. BiFPN stacks the feature pyramid structure multiple times, providing powerful feature representation for the network, and performs weighted feature fusion, enabling the network to learn the importance of different input features. Multi-scale detection has long been a hot topic in agricultural detection. For example, Li et al. [21] realized multi-scale cucumber disease detection by adding a set of anchors matched to small instances. Cui et al. [40] adopted a squeeze-inspired feature pyramid network to fuse multi-scale information and kept only the 26 × 26 detection head for pinecone detection. However, current research still faces a significant drop in detection accuracy for very large or very small objects. Future research will focus on applying the model to disease spots at different scales.

4.2 Attention mechanism

The attention mechanism assigns weights to the image features extracted by the model, enabling the network to focus on target regions carrying important information while suppressing irrelevant information, reducing the interference of irrelevant background on the detection results. Introducing attention mechanisms can effectively enhance the feature learning ability of detection models, and many researchers incorporate them to improve performance. For example, Liu et al. [41] added the SE attention module to YOLOX to enhance the extraction of cotton boll feature details. Bao et al. [42] added dual-dimensional mixed attention (DDMA), which parallelizes coordinate, channel and spatial attention, to the neck of their detection model to reduce the missed and false detections caused by densely distributed leaves. In this study, the CBAM attention mechanism is used to enhance the feature extraction ability of the BTC-YOLOv5s model. CBAM consists of the SAM and CAM sub-modules; using either alone yields an accuracy of 83.2% and 83.1%, respectively, lower than the model using full CBAM. Since SAM and CAM are purely spatial and channel attention modules, respectively, while CBAM combines the two, CBAM considers useful information in both the channel and spatial dimensions simultaneously, making the model better at localizing and identifying lesions.

4.3 Outlook

Although the proposed model can accurately identify apple leaf diseases, some issues deserve attention and further research. First, the dataset used in this study contains images of only four disease types, out of roughly 200 apple diseases; future studies will include images of more disease types and different disease stages. Second, in the dense-lesion case the accuracy of the model drops significantly compared with its performance in the sparse case. The detection results showed that scab had the highest error rate, mainly because its irregular shape and indistinct boundaries interfere with detection. In the future, scab will be treated as a separate research topic to improve the detection accuracy of the model.

5. Conclusion

Aiming at the problems of varied shapes, multiple scales and dense distribution of apple leaf disease spots, an improved detection model, BTC-YOLOv5s, based on YOLOv5s was proposed. To improve the overall detection performance of the original YOLOv5s, this study introduces the BiFPN module, which increases multi-scale feature fusion and provides richer semantic information. In addition, the Transformer and CBAM attention modules are added to improve the extraction of disease features. The results show that the BTC-YOLOv5s model reaches an mAP@0.5 of 84.3% on the ALDD test set, with a model size of 15.8 MB and a detection speed of 8.7 FPS on an eight-core CPU device. Moreover, it maintains good performance and robustness under extreme conditions. The improved model has high detection accuracy, fast detection speed and low computational load, and is suitable for deployment on mobile devices for real-time monitoring and intelligent control of apple diseases.

Note: The original paper is Real-Time Detection of Apple Leaf Diseases in Natural Scenes Based on YOLOv5. This article is for academic sharing only; in case of infringement, please contact us via private message for deletion.

Origin: blog.csdn.net/MacWx/article/details/132020098