Summary of classic models in the field of computer vision (RCNN, YOLO, etc.)

1. RCNN series

1、RCNN

RCNN is a classic object detection method. Its core idea is to decompose the detection task into two main steps: candidate region generation and region classification.

  • Candidate region generation: The first step of RCNN is to generate candidate regions that may contain objects. RCNN uses a traditional computer vision technique, the Selective Search algorithm, a region-proposal method that exploits texture, color, and shape cues in the image to generate potential candidate regions. Selective Search typically produces on the order of two thousand candidate regions, which therefore need to be cropped and warped to a common size and aspect ratio.
  • Feature extraction: For each candidate region, RCNN uses a deep convolutional neural network (usually AlexNet pre-trained on the ImageNet dataset) to extract features that represent the region's content. RCNN obtains a fixed-dimensional feature vector by warping the image patch of each candidate region to a fixed size and then running a forward pass through the CNN.
  • Target classification: For each candidate region, RCNN feeds the extracted feature vector into support vector machine (SVM) classifiers to decide whether the region contains an object of interest. One SVM is trained per category, each learning to distinguish candidate regions that contain its target object from those that do not.
  • Bounding box regression: To improve localization accuracy, RCNN also uses a regressor to fine-tune the bounding box of each candidate region; it is trained to predict the offset between the candidate region and the ground-truth object box.
  • Training: RCNN training is divided into two stages, pre-training and fine-tuning. ① In the pre-training stage, the convolutional network (AlexNet) is trained on a large-scale image classification task to obtain a useful feature extractor. ② In the fine-tuning stage, the annotated detection data is used to fine-tune the CNN and to train the SVM classifiers and bounding box regressors.
  • Advantages and disadvantages: ① Advantages: RCNN achieved very strong detection performance, especially on large-scale detection datasets; it handles objects of different sizes and shapes and adapts to multi-category detection. ② Disadvantages: RCNN is a complex multi-stage model that is hard to train end-to-end. Subsequent versions (Fast R-CNN and Faster R-CNN) address these shortcomings and improve both speed and accuracy.
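
To make the pipeline above concrete, here is a minimal sketch of the feature-extraction step, assuming a torchvision AlexNet as the pre-trained backbone and proposals already produced by an external Selective Search step; the layer slicing is illustrative, not RCNN's original Caffe implementation.

```python
import torch
import torch.nn.functional as F
import torchvision

# ImageNet-pretrained backbone, used here purely as a feature extractor.
alexnet = torchvision.models.alexnet(weights="DEFAULT").eval()

def extract_region_features(image, proposals):
    """image: (3, H, W) float tensor; proposals: list of (x1, y1, x2, y2)
    integer boxes (in practice produced by Selective Search).
    Returns one 4096-d feature vector per region."""
    feats = []
    with torch.no_grad():
        for (x1, y1, x2, y2) in proposals:
            patch = image[:, y1:y2, x1:x2].unsqueeze(0)
            # Warp every proposal to the fixed input size the CNN expects.
            patch = F.interpolate(patch, size=(224, 224),
                                  mode="bilinear", align_corners=False)
            x = alexnet.features(patch)
            x = alexnet.avgpool(x).flatten(1)
            # The first two fully connected layers yield the region descriptor.
            x = alexnet.classifier[:5](x)
            feats.append(x.squeeze(0))
    return torch.stack(feats)

# These vectors would then feed one linear SVM per class
# (e.g. sklearn.svm.LinearSVC) plus a per-class bounding-box regressor.
```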

2、Fast R-CNN(Fast Region-based Convolutional Neural Network)

Fast R-CNN is an improved method built on RCNN (still using Selective Search for proposals). Its main innovation is to share convolutional computation across the whole image and fold feature extraction, classification, and bounding-box regression into a single convolutional neural network, significantly improving speed and performance.

  • Candidate region generation: Like RCNN, Fast R-CNN still obtains candidate regions from an external method such as Selective Search; what changes is that proposals are no longer each pushed through the CNN separately but are projected onto a shared feature map. (The learned Region Proposal Network, RPN, with its anchor boxes, arrives only with Faster R-CNN.)
  • Feature extraction: Fast R-CNN runs the convolutional network once over the entire image, then uses an ROI (Region of Interest) pooling layer to extract a fixed-dimensional feature vector for each candidate region from the shared feature map; these region features are fed into the network for object classification and bounding box regression, as sketched after this list.
  • Object classification and bounding box regression: For each candidate region, Fast R-CNN uses two parallel fully connected heads, one for object classification (which category?) and one for bounding box regression (where is the object?). The classification head uses softmax to predict class probabilities, while the regression head fine-tunes the bounding box of the candidate region.
  • Training: Fast R-CNN is trained end-to-end with a multi-task loss that jointly optimizes classification and bounding box regression. The training data consists of positive samples (proposals that sufficiently overlap a ground-truth object), negative samples (background proposals), and their labels.
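
The key mechanism is RoI pooling over the shared feature map. A minimal sketch using torchvision's `roi_pool`, with made-up tensor sizes for illustration:

```python
import torch
from torchvision.ops import roi_pool

# Shared feature map for the whole image: (N=1, C=256, H=50, W=50),
# e.g. the last conv output of a stride-16 backbone on an 800x800 image.
feature_map = torch.randn(1, 256, 50, 50)

# Proposals in *image* coordinates as (batch_index, x1, y1, x2, y2).
rois = torch.tensor([
    [0, 100.0, 120.0, 300.0, 380.0],
    [0, 400.0,  50.0, 620.0, 260.0],
])

# RoI pooling maps each region to a fixed 7x7 grid; spatial_scale converts
# image coordinates to feature-map coordinates (1/16 for a stride-16 backbone).
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])

# The flattened pooled features feed two sibling heads: softmax
# classification and per-class bounding-box regression.
```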

3、Faster R-CNN(Faster Region-based Convolutional Neural Network)

Faster R-CNN further improves Fast R-CNN, taking the speed of object detection models to a new level while maintaining high accuracy.

  • Candidate region generation: Faster R-CNN introduces the Region Proposal Network (RPN), a fully convolutional, end-to-end trainable network that generates candidate regions directly from the shared feature map by scoring and refining multi-scale, multi-aspect-ratio anchor boxes.
  • Feature extraction: Similar to Fast R-CNN, Faster R-CNN uses a convolutional neural network plus ROI pooling to extract features of candidate regions.
  • Object classification and bounding box regression: Faster R-CNN uses the same object classification and bounding box regression heads as Fast R-CNN.
  • Training: Faster R-CNN jointly trains the RPN and the detection network (object classification and bounding box regression) to optimize the whole system. The entire model generates candidate regions and performs detection in a single pass, improving speed; a usage sketch follows this list.
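
Because the whole pipeline is one network, it can be used off the shelf. A short usage sketch with torchvision's reference implementation (assuming torchvision ≥ 0.13 for the `weights` argument):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Off-the-shelf Faster R-CNN: ResNet-50 + FPN backbone, RPN, RoI heads.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

# One dummy RGB image with values in [0, 1]; real images work the same way.
images = [torch.rand(3, 480, 640)]
with torch.no_grad():
    outputs = model(images)

# Each output dict holds 'boxes' (x1, y1, x2, y2), 'labels' and 'scores',
# already filtered by NMS inside the model.
print(outputs[0]["boxes"].shape, outputs[0]["scores"][:5])
```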

2. YOLO series

In computer vision, object detection is an important research topic, widely applied to face recognition, license plate recognition, security, intelligent transportation, autonomous driving, and other fields. The classic one-stage line of algorithms is the YOLO series.

1、YOLOv1

Earlier two-stage detection algorithms such as Faster R-CNN require two steps at detection time: generating region proposals, then performing softmax classification and bounding box regression on them. Because a large number of candidate boxes is produced, such methods achieve high accuracy but poor real-time performance. Joseph Redmon, the father of YOLO, proposed obtaining both the location and the category of targets by direct regression in a single pass, which greatly reduces computation and significantly improves detection speed: 45 FPS (155 FPS for the Fast YOLO variant).

  • Idea: ① Scale the input image to 448×448×3; ② extract a feature map with the convolutional backbone; ③ feed the extracted feature map through two fully connected layers, finally outputting a 7×7×30 tensor. Conceptually, the input image is divided into S×S grid cells (for example 7×7), and the cell into which an object's center falls is responsible for detecting that object. Each cell predicts B bounding boxes, so the output is S×S×(B*5+C). YOLOv1 commonly uses a 7×7 grid with B=2 boxes and C=20 classes, giving a 7×7×30 output whose 30 channels contain the per-category probabilities plus each box's confidence and position; see the decoding sketch after this list.

  • Network structure: The backbone is a GoogLeNet-style network with 24 convolutional layers + 2 fully connected layers; the first layer uses a 7×7 convolution.
  • Advantages and disadvantages: ① Advantages: compared with two-stage algorithms, direct regression greatly reduces computation and improves running speed. ② Disadvantages: each grid cell has only two predicted boxes, so detection quality degrades when multiple objects are densely adjacent or targets are small.
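
The decoding sketch referenced above: how a 7×7×30 output tensor can be interpreted, assuming a [boxes | class probabilities] channel layout (implementations differ in ordering):

```python
import torch

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (Pascal VOC)

# Raw network output reshaped to (S, S, B*5 + C) = (7, 7, 30); random here.
pred = torch.rand(S, S, B * 5 + C)

boxes = pred[..., :B * 5].reshape(S, S, B, 5)  # per box: x, y, w, h, confidence
class_probs = pred[..., B * 5:]                # per cell: C class probabilities

# Class-specific score = box confidence * conditional class probability.
scores = boxes[..., 4:5] * class_probs.unsqueeze(2)  # (S, S, B, C)

# Locate the single highest-scoring (cell, box, class) triple.
idx = scores.flatten().argmax().item()
cls = idx % C; idx //= C
box = idx % B; idx //= B
cell_x = idx % S; cell_y = idx // S
print(f"best: cell ({cell_y}, {cell_x}), box {box}, class {cls}")
```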

2、YOLOv2

Compared with YOLOv1, v2 makes three changes: 1) replacing the backbone network; 2) introducing the PassThrough layer; 3) borrowing the two-stage idea of pre-selected (anchor) boxes.

  • Idea: Input the image into the darknet19 network to extract a feature map, then output the category and location information of the target boxes.
  • Network structure: The backbone is darknet19, originally designed for the 1000-class classification task. For the detection task, the last convolutional layer is replaced by 3×3 convolutions (with 1024 output channels), the PassThrough operation is added, and the result is fed to the output; large kernels such as 7×7 are no longer used.

  • Tip 1: PassThrough operation - a space-to-depth rearrangement that turns the 26×26×512 feature map into 13×13×2048 (the Focus operation in the later v5 version is similar) and concatenates the result with the original 13×13×1024 map; see the sketch after this list.

  • Tip 2: Introduce anchors and change position prediction to offset prediction. Borrowing the idea from Faster R-CNN, anchors are introduced so that the target box location is predicted as an offset relative to an anchor rather than as absolute coordinates, which greatly reduces the difficulty of prediction and improves accuracy.

  • Advantages and disadvantages: ① Advantages: the passthrough operation fuses high- and low-level semantic information, which strengthens small-target detection to a certain extent; replacing the 7×7 kernel with small kernels reduces computation; and the position-offset strategy lowers the difficulty of box prediction. ② Disadvantages: the residual structure has not yet been adopted, and detection of densely adjacent objects and small targets still needs improvement.
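
The sketch referenced in Tip 1: a minimal space-to-depth implementation of the PassThrough layer, plus the Tip 2 offset decoding, assuming the 416-input feature-map sizes (26×26 and 13×13) discussed above:

```python
import torch

def passthrough(x, stride=2):
    """Space-to-depth reorg (YOLOv2 PassThrough; v5's Focus is similar):
    trades spatial resolution for channels without losing information."""
    n, c, h, w = x.shape
    x = x.reshape(n, c, h // stride, stride, w // stride, stride)
    x = x.permute(0, 3, 5, 1, 2, 4)
    return x.reshape(n, c * stride * stride, h // stride, w // stride)

fine = torch.randn(1, 512, 26, 26)     # earlier, higher-resolution feature map
coarse = torch.randn(1, 1024, 13, 13)  # last backbone feature map

reorg = passthrough(fine)                  # -> (1, 2048, 13, 13)
fused = torch.cat([reorg, coarse], dim=1)  # -> (1, 3072, 13, 13)
print(reorg.shape, fused.shape)

def decode_box(t, anchor_wh, cell_xy):
    """Tip 2 offset prediction: the network outputs t = (tx, ty, tw, th)
    relative to a grid cell and an anchor, not absolute coordinates."""
    bx = torch.sigmoid(t[0]) + cell_xy[0]  # center stays inside its cell
    by = torch.sigmoid(t[1]) + cell_xy[1]
    bw = anchor_wh[0] * torch.exp(t[2])    # scale the anchor's width/height
    bh = anchor_wh[1] * torch.exp(t[3])
    return bx, by, bw, bh
```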

3、YOLOv3

To address YOLOv2's problems, YOLOv3 introduces the residual network module: ① building on darknet19, residual connections are introduced and the network is deepened, producing Darknet53; ② borrowing the pyramid idea, predictions are made at three different scales.

  • Idea: The YOLOv3 detection algorithm inputs the image into the darknet53 network to extract feature maps, then borrows the feature pyramid network idea to fuse high- and low-level semantic information and predict targets at three levels (low, medium, high), finally outputting feature maps at three scales (52×52×75, 26×26×75, 13×13×75).

Among them, the 52×52 feature map is responsible for detecting small targets, the 26×26 feature map for medium targets, and the 13×13 feature map for large targets.

Before training, pre-selected boxes of three sizes (small, medium, large) are generated in advance by clustering, nine in total. At prediction time, each grid cell at each scale outputs 3×(20+1+4) values: for each of its three boxes, 20 class probabilities, 1 confidence score, and 4 box coordinates (hence the 75 channels above).

  •  Network structure: The backbone network is Darknet53

  • Technique: Feature pyramid. This version borrows the feature pyramid idea but differs slightly from ordinary FPN: ① the layers selected for fusion differ; ② the fusion methods differ. In ordinary FPN, the small, semantically strong high-level feature map is upsampled and fused with the previous layer by element-wise addition; after fusion, the size and the number of channels are unchanged. In YOLOv3, after the small high-level feature map is upsampled to S×S, it is concatenated channel-wise with the previous feature map of the same S×S size; after fusion, the size is unchanged but the number of channels is the sum of the two. A sketch follows this list.
  • Advantages: largely solves the small-target detection problem and achieves a good balance between speed and accuracy.
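
A small sketch contrasting the two fusion styles described in the technique bullet, with illustrative channel counts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

p_high = torch.randn(1, 512, 13, 13)  # small, semantically strong map
p_prev = torch.randn(1, 256, 26, 26)  # earlier, larger map

up = F.interpolate(p_high, scale_factor=2, mode="nearest")  # 13x13 -> 26x26

# Ordinary FPN: project to matching channels, then element-wise addition;
# size and channel count are unchanged (256 stays 256).
proj = nn.Conv2d(512, 256, kernel_size=1)
fpn_fused = p_prev + proj(up)                # (1, 256, 26, 26)

# YOLOv3: channel-wise concatenation; spatial size unchanged, channel
# count becomes the sum of the two inputs (512 + 256 = 768).
yolo_fused = torch.cat([up, p_prev], dim=1)  # (1, 768, 26, 26)
print(fpn_fused.shape, yolo_fused.shape)
```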

4、YOLOv4

Alexey Bochkovskiy upgraded YOLOv3 to YOLOv4. The core idea is basically unchanged, but many improvements were made to the sub-structures: data processing, backbone network, network training, activation function, loss function, and more.

  • Important upgrades: ① integrate the CSP structure into Darknet53 to produce the new backbone CSPDarknet53; ② use SPP spatial pyramid pooling to enlarge the receptive field; ③ introduce PAN in the Neck, i.e. the FPN+PAN form; ④ introduce the Mish activation function; ⑤ introduce Mosaic data augmentation; ⑥ use CIOU_loss during training and DIOU_nms at prediction time.
  • Network structure: Backbone: CSPDarknet53 (including SPP); Neck: FPN+PAN; detection head: the same as the v3 version.

  • Tip 1: The input data is augmented with Mosaic, which borrows the idea of CutMix (2019) and extends it: four images are randomly scaled, randomly cropped, and randomly arranged into a single training image, further improving small-target detection.

  • Tip 2: Change the backbone to CSPDarknet53, combining the experience of CSPNet (2019) with the previous Darknet53. In CSPNet, on entering each stage the data is first split into two parts (part1 and part2); the difference is that CSPNet splits directly along the channel dimension, whereas YOLOv4 implements the split with two 1×1 convolutional layers. The two branches are concatenated where they rejoin.

  • Tip 3: Introduce the SPP spatial pyramid pooling module to enlarge the receptive field: max pooling with 1×1, 5×5, 9×9, and 13×13 windows performs multi-scale fusion, and the outputs are concatenated along the channel axis, similar to the PPM module in the semantic segmentation network PSPNet; see the sketch after this list.

  • Tip 4: Use the FPN+PAN structure in the Neck, borrowing from PANet (2018, image segmentation). Compared with the original PAN structure, the PAN actually used by YOLOv4 replaces element-wise addition with concatenation.

  • Since the FPN structure is top-down, high-level semantic information is passed downward through upsampling, but the fused information is still insufficient. YOLOv4 therefore adds a PAN structure after FPN to pass information from the bottom up again: FPN transmits strong semantic information from top to bottom, while PAN transmits strong localization information from bottom to top, achieving a stronger feature aggregation effect in the Neck.

  • Advantages: comparing v3 and v4 on the COCO dataset at roughly the same 83 FPS, YOLOv4 reaches 43 AP while YOLOv3 reaches 33 AP, a direct gain of 10 points.
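
The SPP sketch referenced in Tip 3: parallel stride-1 max pools whose padding preserves the spatial size, concatenated with the identity branch (kernel sizes per the description above):

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """SPP block as used in YOLOv4: parallel max pools with different
    kernel sizes (stride 1, padded so spatial size is preserved),
    concatenated with the identity branch along the channel axis."""
    def __init__(self, kernels=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels
        )

    def forward(self, x):
        # The identity branch plays the role of the 1x1 window.
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)

x = torch.randn(1, 512, 13, 13)
print(SPP()(x).shape)  # torch.Size([1, 2048, 13, 13]) -- 4x the input channels
```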

5、YOLOv5

YOLOv5, launched by Ultralytics LLC, builds on YOLOv4 with incremental refinements. ① The CSP structure of the v4 backbone is extended into the Neck. ② The Focus operation was added (version 6.1 later removed it in favor of a 6×6 convolution). ③ The SPPF structure replaces SPP.

  • Idea: Basically the same as the v4 version
  • Tip 1: SPPF. The main difference is that the MaxPool operations are changed from parallel to serial. Notably, two 5×5 MaxPools in series are equivalent to one 9×9 MaxPool, and three 5×5 MaxPools in series are equivalent to one 13×13 MaxPool. Parallel and serial forms produce the same result, but the serial form reuses intermediate results and is more efficient, reducing time cost; see the sketch after this list.

  • Tip 2: Adaptive anchor box calculation: relatively simple; the offline clustering of anchor boxes is replaced by adaptive computation in the training program, not detailed here.
  • Tip 3: Focus operation, a space-to-depth slicing similar to YOLOv2's PassThrough described above; as noted, later releases removed it, so it is not detailed here.
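
The Tip 1 equivalence can be checked directly; a small sketch (PyTorch implicitly pads max pooling with negative infinity, so serial and parallel results match exactly):

```python
import torch
import torch.nn as nn

pool5 = nn.MaxPool2d(5, stride=1, padding=2)
pool9 = nn.MaxPool2d(9, stride=1, padding=4)
pool13 = nn.MaxPool2d(13, stride=1, padding=6)

x = torch.randn(1, 256, 20, 20)

# Two 5x5 max pools in series cover the same window as one 9x9;
# three in series cover the same window as one 13x13.
y1 = pool5(x)
y2 = pool5(y1)
y3 = pool5(y2)
assert torch.equal(y2, pool9(x))
assert torch.equal(y3, pool13(x))

# SPP concatenates [x, pool5(x), pool9(x), pool13(x)] computed in parallel;
# SPPF concatenates [x, y1, y2, y3], reusing intermediate results --
# mathematically identical but cheaper.
spp = torch.cat([x, y1, pool9(x), pool13(x)], dim=1)
sppf = torch.cat([x, y1, y2, y3], dim=1)
assert torch.equal(spp, sppf)
```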

6、Yolov6

YOLOv6 was launched by Meituan. Its main work was to introduce the 2021 RepVGG structure into YOLO, in order to better suit GPU devices.

  • Idea: The YOLOv6 detection algorithm is similar to YOLOv5 (backbone + neck) plus YOLOX (head), with the following main changes: ① the backbone network is changed from CSPDarknet to EfficientRep; ② the Neck builds Rep-PAN based on Rep and PAN; ③ the detection head imitates YOLOX's decoupled head, slightly optimized.
  • Network structure: The backbone network is EfficientRep; Neck: FPN+RepPAN; Detection head: similar to YOLOX.
  • Tip 1: Introduce RepVGG. Following the RepVGG idea, a 1×1 convolution branch and an identity branch are added in parallel to each 3×3 convolution during training, then merged into a single 3×3 structure at inference time. This approach is friendlier to compute-intensive hardware; a fusion sketch follows this list.

  • Tip 2: Backbone network EfficientRep: the stride-2 convolutional layers in the backbone are replaced with stride-2 RepConv layers, and the CSP-Block is changed to RepBlock.

  • Tip 3: Rep is introduced in the Neck. To further reduce time cost on the hardware, the CSP-Block in PAN is replaced with RepBlock, producing the Rep-PAN structure.

  • Tip 4: Decouple the detection head and redesign an efficient decoupled head. To speed up convergence and reduce head complexity, YOLOv6 imitates YOLOX in decoupling the detection head, separating the bounding-box regression branch from the category classification branch. Since YOLOX's decoupled head adds two extra 3×3 convolutions, it increases computational complexity to a certain extent; YOLOv6 therefore redesigned a more efficient decoupled head based on the Hybrid Channels strategy, reducing latency with little change in accuracy and achieving a trade-off between speed and accuracy.

  • Advantages: time cost is further optimized, further improving the performance of the YOLO detection line.
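
The fusion sketch referenced in Tip 1, with batch norm omitted so the kernel arithmetic stays visible (RepVGG first folds BN into the conv weights; the channel count 64 is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C = 64  # illustrative channel count
conv3 = nn.Conv2d(C, C, kernel_size=3, padding=1, bias=True)
conv1 = nn.Conv2d(C, C, kernel_size=1, bias=True)

def train_time(x):
    # Training-time block: parallel 3x3, 1x1 and identity branches.
    return conv3(x) + conv1(x) + x

with torch.no_grad():
    # Pad the 1x1 kernel to 3x3 and write the identity branch as a 3x3
    # kernel with a 1 at the center of its own channel, then sum kernels.
    w = conv3.weight + F.pad(conv1.weight, [1, 1, 1, 1])
    ident = torch.zeros_like(w)
    for i in range(C):
        ident[i, i, 1, 1] = 1.0
    w = w + ident
    b = conv3.bias + conv1.bias  # the identity branch contributes no bias

    fused = nn.Conv2d(C, C, kernel_size=3, padding=1, bias=True)
    fused.weight.copy_(w)
    fused.bias.copy_(b)

# The single fused conv reproduces the three-branch training-time block.
x = torch.randn(1, C, 32, 32)
assert torch.allclose(train_time(x), fused(x), atol=1e-4)
```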

7、YOLOv7

YOLOv7 is the sequel from the YOLOv4 team; it mainly optimizes model-structure re-parameterization and dynamic label assignment.

  • Idea: The idea of ​​YOLOv7 detection algorithm is similar to YOLOv4 and v5.
  • Main changes: ① planned model-structure re-parameterization is proposed; ② the "expand" and "compound scaling" methods are borrowed from YOLOv5, Scaled-YOLOv4, and YOLOX to use parameters and computation efficiently; ③ a new label assignment method is proposed.
  • Network structure: based on YOLOv4, YOLOv5, and YOLOv6, further upgraded with the following tricks.
  • Tip 1: Efficient aggregation network. E-ELAN uses expand, shuffle, and merge-cardinality structures to improve the network's learning ability without destroying the original gradient path. Architecturally, E-ELAN only changes the structure of the computation module, while the transition layer is left completely unchanged. The authors' strategy is to use grouped convolution to expand the channels and cardinality of the computation module, applying the same group parameter and channel multiplier to all modules in a layer. The feature maps computed by each module are then shuffled into G groups according to the set number of groups and concatenated together, so that the number of channels in each group of feature maps equals that of the original architecture. Finally, the G groups of features are added together to merge cardinality.

  • Tip 2: Model scaling. Similar to YOLOv5, Scaled-YOLOv4, and YOLOX, this generally means scaling the depth, width, or module scale to enlarge or shrink the baseline model; a toy sketch follows this list.

  • Tip 3: Convolutional re-parameterization is introduced and improved. Gradient propagation paths are analyzed to determine which kinds of network each re-parameterization module should be paired with. The analysis showed that the identity branch in RepConv destroys the residual structure in ResNet and the cross-layer connections in DenseNet; the authors therefore use a RepConv structure without the identity connection for convolutional re-parameterization, designing planned re-parameterized convolutions for PlainNet and ResNet.

  • Tip 4: An auxiliary-head training module with coarse-to-fine label assignment is introduced. In the common approach, the auxiliary head and the lead (guide) head are independent, each performing label assignment using the ground truth and its own predictions. YOLOv7 instead proposes to use the lead head's predictions as guidance to generate coarse-to-fine hierarchical labels, which are used for the learning of the auxiliary head and the lead head respectively.

  • Advantages: the number of parameters and the amount of computation are greatly reduced, while performance still improves slightly.
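
A toy sketch of the compound-scaling idea from Tip 2; the multipliers and base values are made up for illustration and are not YOLOv7's actual configuration:

```python
def scale_channels(c, width_mult, divisor=8):
    # Scale a channel count and round to a hardware-friendly multiple.
    return max(divisor, int(round(c * width_mult / divisor)) * divisor)

def scale_depth(n, depth_mult):
    # Scale the number of repeated blocks in a stage (at least 1).
    return max(1, round(n * depth_mult))

# Hypothetical variants in the spirit of the YOLO family's small/large models:
# (depth multiplier, width multiplier)
VARIANTS = {"small": (0.33, 0.25), "base": (1.0, 1.0), "large": (1.33, 1.25)}

base_channels = (64, 128, 256, 512)  # illustrative stage widths
base_repeats = (3, 6, 9, 3)          # illustrative blocks per stage

for name, (d, w) in VARIANTS.items():
    channels = [scale_channels(c, w) for c in base_channels]
    repeats = [scale_depth(n, d) for n in base_repeats]
    print(name, channels, repeats)
```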

Origin blog.csdn.net/qq_43687860/article/details/132735983