Professional Innovation Practice Report

Topic: Detailed Explanation of the YOLO v3 Algorithm and Comparison with Faster-RCNN

Detailed explanation of YOLO V3 algorithm and comparison with Faster-RCNN

Reference source: Target Detection - PaddlePaddle AI Studio

1. Development of target detection

Image recognition and classification: Unlike humans, a computer can only "see" the numbers obtained after an image is encoded. For example, in Figure 1 below, a human can tell at a glance that the picture (right) contains a dog, while the computer sees only encoded numbers (left). The purpose of the image classification task is to identify the category of an image.

Figure 1 Computer processing to obtain pixels

The general image classification flow chart is as follows:

Figure 2 Conventional image classification flow chart

The process can be described as: extract features from the input image, use the extracted features to predict classification probabilities, build a classification loss function from the training-sample labels, and then train the model to perform image classification.

The result of image classification: Figure 3-1 is identified as a dog, and Figure 3-2 on the right is identified as a cat.

Figure 3-1 Dog    Figure 3-2 Cat

Target detection is built on top of image classification and relies on image classification technology. It is usually divided into general target detection and single-class target detection: ① General: detect all categories present in an image and mark their classes and locations. ② Single-class: detect a fixed category and mark its class and location, for example face detection, text detection, or license plate detection.

As the two task types show, target detection differs greatly from image classification: target detection must find every object in an image and label both its location and its category. As shown in Figure 4 below, the dog, the cat, and their positions are all identified. In image classification, by contrast, the feature-extraction process neither reflects the differences between different targets nor labels each object's category and location separately. Target detection therefore builds on the successful experience of image classification: regions of the image that may contain a target object, called candidate regions, are treated as separate images and passed to an image classification model to determine their category, while other methods are used to represent the target's location.

 Figure 4 Target detection (detecting the categories and locations of cats and dogs)

The key to the target detection problem lies in generating candidate regions. The most brute-force method is exhaustive enumeration: list every possible region of the image. Although exhaustive enumeration may produce correct predictions, its computational cost is enormous and it is difficult to use in practical applications. In 2013 the target detection task saw a breakthrough: Ross Girshick and others applied the CNN method to target detection for the first time, using the traditional image algorithm selective search to generate candidate regions. This was the region-based convolutional neural network (R-CNN) model, which has had a profound influence on the field of target detection. In 2015, Ross Girshick improved this method and proposed the Fast R-CNN model, which shares the convolutional computation across objects in different regions, greatly reducing the amount of computation and improving processing speed; it also introduced a regression method to adjust the position of the target object, further improving the accuracy of position prediction. In the same year, Shaoqing Ren and others proposed the Faster R-CNN model with the RPN (Region Proposal Network) method for generating candidate regions, and were the first to introduce the concept of anchor boxes. Anchor boxes are the most important and hardest-to-understand concept when learning convolutional neural networks for target recognition, and they have since been widely used in excellent detection models such as SSD, YOLOv2, and YOLOv3. This method no longer needs traditional image-processing algorithms to generate candidate regions, which further improves processing speed.

In this experiment we mainly used the YOLOv3 model architecture to detect people. Below we expand on the specific implementation of YOLO v3 detection and give an introduction to the Faster-RCNN framework, focusing mainly on single-class target detection.

2. Introduction and implementation of YOLO v3

2.1 Analysis of the network architecture of YOLO v3:

YOLO v1 proposed the general architecture of the YOLO algorithm; YOLO v2 improved the design and used predefined anchor boxes to generate positive and negative training samples; YOLO v3 further improved the model architecture and training process on top of YOLO v2. Its network architecture is shown in Figure 5 below:

 Figure 5 Network architecture of YOLO v3

As can be seen from the figure, the steps for YOLO v3 to detect targets are:

  1. Input a batch of images with shape (m, 416, 416, 3) (416 is a multiple of 32).
  2. Pass this image to a convolutional neural network (CNN).
  3. Flatten the last two dimensions of the output to get a tensor of shape (19, 19, 425): 19*19 grid cells, each carrying 425 values, where 425 is determined by the number of anchor boxes per cell and, for each anchor, its position, objectness confidence, and the scores of all detected classes (a short sketch below makes the arithmetic concrete).
  4. The output is a list of bounding boxes together with their recognized classes. The label of each bounding box contains confidence, location, and class information.
  5. Apply IoU (Intersection over Union) and non-max suppression to discard redundant boxes and keep the prediction box with the highest confidence.
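To make the 425 in step 3 concrete, here is a minimal sketch in plain Python; the anchor and class counts are the ones from the classic 19x19x425 example (5 anchors, 80 classes), not the values used in this experiment's code:

# assumed values: 5 anchor boxes per grid cell and 80 object classes
num_anchors_per_cell = 5
num_classes = 80

# each anchor predicts 4 box coordinates + 1 objectness confidence + C class scores
attribs_per_anchor = 4 + 1 + num_classes                        # 85
channels_per_cell = num_anchors_per_cell * attribs_per_anchor   # 425

print((19, 19, channels_per_cell))  # (19, 19, 425)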

Detailed steps will be described in this article.

It can be seen that three improvements are mainly adopted in YOLO V3:

  1. Multi-scale prediction is used to solve the coarse-granularity problem, so the model can also be applied to small-object detection. YOLO v3 detects at 3 scales: one obtained by downsampling, with a 13*13 feature map, and two obtained by upsampling plus element-wise (eltwise) sum, with 26*26 and 52*52 feature maps.
  2. The logistic loss function is used as the new classification loss, allowing YOLO v3 to assign classes and boxes in regions where objects overlap.
  3. The network structure is deepened: simplified residual blocks replace the original 1×1 and 3×3 convolution blocks, turning the original Darknet-19 into Darknet-53.

The Darknet-53 used by YOLO v3 takes the 53-layer backbone and stacks 53 more layers on top of it, giving YOLO v3 a 106-layer fully convolutional underlying architecture. Detection is done by applying 1 x 1 detection kernels to feature maps of three different sizes at three different places in the network. The architecture of Darknet-53 is shown in Figure 6:

Figure 6 Architecture diagram of Darknet53

Convolutional here refers to Conv2d + BN + LeakyReLU. For example, for an original image of size [640*640], printing the shapes of C0, C1, and C2 in the code gives: C0 has shape [1, 1024, 20, 20], C1 has shape [1, 512, 40, 40], and C2 has shape [1, 256, 80, 80].
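As a rough illustration of such a "Convolutional" unit, the sketch below builds a Conv2d + BN + LeakyReLU block with mindspore.nn; the helper name conv_bn_leaky and its defaults are assumptions for illustration, not the project's actual _conv2d / _conv_bn_relu implementations:

import mindspore.nn as nn

def conv_bn_leaky(in_channels, out_channels, kernel_size, stride=1):
    # hypothetical Conv2d + BatchNorm + LeakyReLU unit (the "Convolutional" block in Figure 6)
    return nn.SequentialCell([
        nn.Conv2d(in_channels, out_channels, kernel_size, stride=stride, pad_mode='same'),
        nn.BatchNorm2d(out_channels),
        nn.LeakyReLU(alpha=0.1),
    ])

# example: a stride-2 block halves the spatial size, e.g. 640x640 -> 320x320
block = conv_bn_leaky(3, 32, 3, stride=2)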

The shape is determined by the stride. For an original image of size [640*640]: when the stride is 32, 640/32 = 20, which gives the 20 in C0's shape. Likewise, the stride of C1 is 16 and the stride of C2 is 8. Taking a 640x640 input as an example, the three predicted feature maps therefore have sizes 20, 40, and 80, detecting large, medium, and small targets respectively to achieve multi-scale detection, as shown in Figure 7 below:

Figure 7 Multi-scale target detection (cats and flowers)

Downsampling halves the image size. It is implemented with stride=2 convolution; in the code for this experiment it is:

self.downsample = (in_channels != out_channels)
if self.downsample:
    self.down_sample_layer = _conv2d(in_channels, out_channels, 1, stride=stride)

The code for the two upsampling steps is:

        self.conv1 = _conv_bn_relu(in_channel=backbone_shape[-2], out_channel=backbone_shape[-2]//2, ksize=1)
        self.upsample1 = P.ResizeNearestNeighbor((feature_shape[2]//16, feature_shape[3]//16))
        self.backblock1 = YoloBlock(in_channels=backbone_shape[-2]+backbone_shape[-3],
                                    out_chls=backbone_shape[-3],
                                    out_channels=out_channel)
        self.conv2 = _conv_bn_relu(in_channel=backbone_shape[-3], out_channel=backbone_shape[-3]//2, ksize=1)
        self.upsample2 = P.ResizeNearestNeighbor((feature_shape[2]//8, feature_shape[3]//8))
        self.backblock2 = YoloBlock(in_channels=backbone_shape[-3]+backbone_shape[-4],
                                    out_chls=backbone_shape[-4],
                                    out_channels=out_channel)
        self.concat = P.Concat(axis=1)

2.2 Important intermediate concepts in YOLO v3:

2.2.1 Bounding box and its label information:

In target detection, it is necessary to determine the location of the target object and use a bounding box to describe its location information. There are generally two general formats for representing the position of the bounding box:

① Use (x1, y1, x2, y2) to describe the box, where (x1, y1) is the coordinate of the upper-left corner of the rectangle and (x2, y2) is the coordinate of its lower-right corner. In common face detection the sides of the rectangle are parallel to the X and Y axes, so these two corner coordinates determine a complete rectangular box.

Figure 8 Three rectangular boxes

For example, in our current target detection practice, one of the detection results is shown in the figure. The bounding boxes of the three boxes in Figure 8 can be expressed (approximately) as:

Red box on the left: (301.85,978.47,712.16,682.52)

Middle red box: (850.22,943.61,1089.73,567.28)

Red box on the right: (962.51,718.34,1677.18,0.003)

② Use (x, y, w, h) to represent the box, where (x, y) is the coordinate of the center of the rectangle, w is its width, and h is its height. Again the sides of the rectangle are parallel to the X and Y axes, so the center point, width, and height completely determine the position of the rectangular box.
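The two representations are interchangeable; a minimal sketch of the conversion in plain Python (the function names are chosen here for illustration):

def xyxy_to_xywh(x1, y1, x2, y2):
    # corner format (x1, y1, x2, y2) -> center format (x, y, w, h)
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2.0, y1 + h / 2.0, w, h

def xywh_to_xyxy(x, y, w, h):
    # center format (x, y, w, h) -> corner format (x1, y1, x2, y2)
    return x - w / 2.0, y - h / 2.0, x + w / 2.0, y + h / 2.0

print(xyxy_to_xywh(100, 50, 300, 250))  # (200.0, 150.0, 200, 200)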

In a detection task, the labels of the training set record the position of the real box, which is called the ground truth box. The predicted bounding box is produced by the model trained on this label information.

When YOLO v3 is applied to target detection, the information in the label is usually (P, x, y, w, h).

For single-class target detection this is a five-dimensional vector. In the code provided for this experiment, each value of the vector is obtained as follows:

box_xy = prediction[:, :, :, :, :2]           # position: cx, cy
box_wh = prediction[:, :, :, :, 2:4]          # scale: w, h
box_confidence = prediction[:, :, :, :, 4:5]  # probability of containing a target: P
box_probs = prediction[:, :, :, :, 5:]        # class probabilities

P represents the probability that there is a detected target in the box, which corresponds to box_confidence in the code;

(x, y, w, h) gives the position of the prediction box (the second rectangle representation above); it corresponds to box_xy and box_wh in the code, and from it the prediction box can be drawn on the image.

In our practice this time we only detect people, so one grid cell detects one class of target and the label information is a five-dimensional vector. To detect n classes of targets, the returned label should contain (P, x, y, w, h, x1, ..., xn), where each of x1-xn is 0 or 1, indicating which class the object belongs to. In the code, box_probs holds this class information and here has only one value, meaning only one class of target is detected per grid cell; to detect one more class, another component of the same kind is appended to this vector.

For example, to detect both people and dogs (possibly in the same grid cell) in the images of this experiment, the returned label information should contain a 14-dimensional vector ((1+4+2)*2=14); if multiple classes of targets are never detected in the same grid cell, a 7-dimensional vector (1+4+2=7) is enough.

2.2.2 Anchor box (锚框)

In detection models, anchor boxes are generated on the image according to certain rules, and these boxes are treated as possible candidate regions. They address the problems that a single window can only detect one target and cannot handle multiple scales. The algorithm generates a series of anchor boxes around each center; because size scales and aspect ratios are specified, the extent of the generated anchor boxes does not exceed the image size, as shown in Figure 9 below:

Figure 9 Generates a series of anchor boxes (the ones marked by blue A are anchor boxes, and those marked by G are real boxes)

The model will predict whether these candidate areas contain objects, and if they contain the target object, continue to predict the category to which the object belongs. If the target object is not included, there is no need to continue detection. As can be seen from the figure, the anchor box A1 is closest to the real box G1.

There are usually three ways to determine anchor box sizes: ① manual selection from experience; ② k-means clustering; ③ learning them as hyperparameters.
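Method ② is the one used in the YOLOv2/YOLOv3 papers: cluster the width-height pairs of the training-set ground truth boxes with k-means, using 1 - IoU as the distance so that large and small boxes are treated fairly. A rough NumPy sketch (not the code of this experiment) follows:

import numpy as np

def wh_iou(wh, centers):
    # IoU between boxes described only by (w, h), assuming aligned top-left corners
    inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centers[None, :, 1])
    union = (wh[:, 0] * wh[:, 1])[:, None] + (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    # cluster ground-truth (w, h) pairs into k anchor sizes using 1 - IoU as the distance
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(iters):
        assign = np.argmax(wh_iou(wh, centers), axis=1)   # nearest center = largest IoU
        for j in range(k):
            if np.any(assign == j):
                centers[j] = np.median(wh[assign == j], axis=0)
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]

# toy usage: random box sizes standing in for a real label file
boxes_wh = np.abs(np.random.randn(500, 2)) * 100 + 10
print(kmeans_anchors(boxes_wh, k=9))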

Because the positions of the anchor boxes are fixed, they are usually unlikely to coincide with the ground truth boxes, so fine-tuning on top of the anchor boxes is needed to form prediction boxes that accurately describe object positions, and the model must predict the magnitude of this fine-tuning. During training, the model continuously adjusts its parameters through learning and finally learns to judge whether the candidate region represented by an anchor box contains a target object, which class the object belongs to if it does, and by how much the ground truth box deviates from the anchor box position. In summary, the prediction box is obtained by fine-tuning the anchor box; the specific steps are given below.

Generating anchor boxes at every location would require too much computation, so the algorithm downsamples the original image to obtain a feature map and generates anchor boxes on the feature map. Both YOLO v3 and YOLO v2 use anchor boxes; YOLO v1 does not.

2.2.3 Intersection over Union (IoU)

IoU describes the degree of overlap between two boxes. The two boxes can be regarded as two sets of pixels; their IoU equals the area of their overlapping part divided by the area of their union. In the figure below, the red region in panel A is the overlap of the two boxes and the blue region in panel B is their union; dividing these two areas gives the IoU.

Formula: iou = intersect_area / (box1_area + box2_area - intersect_area)

Here, intersect_area is the area of the intersection of the two boxes, box1_area is the area of one rectangle, and box2_area is the area of the other, so (box1_area + box2_area - intersect_area) is the area of the union of the two boxes. Figure 10 below gives a visual understanding of IoU:

Figure 10 Visual calculation of union and cross ratio

In this experiment, a class for calculating IoU was written, and its core code is as follows:

### compute the intersection area
intersect_area = P.Squeeze(-1)(intersect_wh[:, :, :, :, :, 0:1]) * \
                 P.Squeeze(-1)(intersect_wh[:, :, :, :, :, 1:2])
### compute the area of each of the two boxes
box1_area = P.Squeeze(-1)(box1_wh[:, :, :, :, :, 0:1]) * P.Squeeze(-1)(box1_wh[:, :, :, :, :, 1:2])
box2_area = P.Squeeze(-1)(box2_wh[:, :, :, :, :, 0:1]) * P.Squeeze(-1)(box2_wh[:, :, :, :, :, 1:2])
### IoU
iou = intersect_area / (box1_area + box2_area - intersect_area)
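For reference, the same computation for two boxes given in (x1, y1, x2, y2) corner format can be written in a few lines of plain Python; this is a simplified sketch, not the experiment's MindSpore version:

def calc_iou_xyxy(box1, box2):
    # IoU of two boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    intersect_area = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return intersect_area / (box1_area + box2_area - intersect_area)

print(calc_iou_xyxy((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143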

2.3 Prediction process in YOLO v3:

The core idea of YOLOv3 is to divide the picture into n*m grid cells, embodying the idea of divide and conquer: each grid cell is responsible for detection in its own region, and if an object's center falls inside that region, the object is assigned to that grid cell. Grids of different scales can be used: coarse grid cells predict large targets and fine grid cells predict small targets.

2.3.1 Overall training flow of YOLO v3

Figure 11 YOLO v3 training flow chart

The process has two parts. The left part divides the image into small squares and generates a series of candidate regions; positive and negative samples are then determined by how close each candidate region is to the ground truth boxes on the original image.

  1. Positive samples: candidate regions that are sufficiently close to a ground truth box; the position of the ground truth box is used as the position target of the positive sample.
  2. Negative samples: candidate regions that deviate greatly from every ground truth box; they contain no target object, so no position or class needs to be predicted for them.

The right part uses a convolutional neural network to extract features from the image. As already mentioned when introducing anchor boxes, the feature map reduces the amount of computation and is used to predict the location and class of the candidate regions.

Each prediction box is therefore one of the samples (determining positive, negative, and ignored samples requires non-maximum suppression, explained further below). A loss function can be built from the differences in position, shape, label, and objectness confidence between the samples and the ground truth boxes.

2.3.2 Loss calculation:

Both positive and negative samples are prediction boxes, so the YOLO v3 loss function includes three parts:

①Loss of position and shape prediction

②Loss of target confidence

③ Category losses.

The objectness (confidence) loss includes the confidence prediction loss of boxes that contain a target object and of boxes that do not. The loss formula and the meaning of its terms are shown in Figure 12:

 Figure 12 Calculation formula and labeling of loss function

In the implementation code provided this time, each loss term is first computed (the xy and wh terms together form the position-and-shape loss):

## position loss
xy_loss = self.reduce_sum(xy_loss, ())
## scale (width/height) loss
wh_loss = self.reduce_sum(wh_loss, ())
## confidence loss
confidence_loss = self.reduce_sum(confidence_loss, ())
## class loss
class_loss = self.reduce_sum(class_loss, ())

The sum of these loss terms gives the final loss value:

loss = xy_loss + wh_loss + confidence_loss + class_loss

2.3.3 Process of generating prediction box:

After obtaining the feature map, the YOLO-V3 algorithm will generate a series of anchor boxes at the center of each candidate area. For example, in Figure 13 below, a series of anchor boxes are generated with the upper left corner as the center.

Figure 13: Generate a series of anchor boxes with the same point as the center

In the code, you can see the parameters of the anchor box in the model parameters of YOLO-V3:

class ConfigYOLOV3ResNet18:
    ######........
    anchor_scales = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
                     (59, 119), (116, 90), (156, 198), (373, 326)]

As the code shows, 9 anchor boxes of different scales are generated at each center, corresponding to 3 different scales to predict large, medium, and small objects respectively.

So how is the anchor box brought close to the prediction box, and how can the predicted box be made to overlap the ground truth box as much as possible? Below we explain the fine-tuning principle and procedure for anchor boxes in detail.

As mentioned before, the core idea of YOLO is to divide the picture into many grid cells; as shown in the figure below, the image is divided into 15*20 cells of the same size (480/32=15, 640/32=20). Suppose three anchor boxes are generated near the cell in row 10, column 4 of the image, marked with red frames in Figure 14 below.

 Figure 14 Labeling of three anchor boxes

The center of these anchor boxes is at (4.5, 10.5), and the upper-left corner of the cell containing it is at (4, 10). Assume one of the anchor boxes has size (p_w, p_h) = (350, 250). The center coordinates and size of the prediction box can then be generated as follows.

With (c_x, c_y) = (4, 10) the coordinates of the upper-left corner of the cell containing the anchor center, and t_x, t_y the network outputs, the position b_x, b_y of the prediction box is obtained by fine-tuning with the following formulas:

b_x = c_x + sigmoid(t_x),  b_y = c_y + sigmoid(t_y)

The graph of the sigmoid function is shown in Figure 15 below. Some intuitive properties can be observed from it: the function's value lies between 0 and 1, it is centrally symmetric about 0.5, and its slope is largest near x = 0.

Figure 15 sigmoid function image

The size of the prediction box is obtained by scaling the anchor size (p_w, p_h) with the network outputs t_w and t_h:

b_w = p_w * exp(t_w),  b_h = p_h * exp(t_h)

The code is as follows:

grid_x = P.Cast()(F.tuple_to_array(range_x), ms.float32)
grid_y = P.Cast()(F.tuple_to_array(range_y), ms.float32)
# Tensor of shape [grid_size[0], grid_size[1], 1, 1] representing the coordinate of x/y axis for each grid
grid_x = self.tile(self.reshape(grid_x, (1, 1, -1, 1, 1)), (1, grid_size[0], 1, 1, 1))
grid_y = self.tile(self.reshape(grid_y, (1, -1, 1, 1, 1)), (1, 1, grid_size[1], 1, 1))
# Shape is [grid_size[0], grid_size[1], 1, 2]
grid = self.concat((grid_x, grid_y))
box_xy = prediction[:, :, :, :, :2]
box_wh = prediction[:, :, :, :, 2:4]
### use sigmoid to fine-tune the box center inside its grid cell
box_xy = (self.sigmoid(box_xy) + grid) / P.Cast()(F.tuple_to_array((grid_size[1], grid_size[0])), ms.float32)
box_wh = P.Exp()(box_wh) * self.anchors / self.input_shape

Here grid_x and grid_y hold the column and row coordinates of every grid cell. The sigmoid offsets the box center within its cell, and the exponential scales the anchor's width and height, so that the anchor box is adjusted toward the length and width of the real box.
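A worked numeric example of the same decoding in plain NumPy; the t values, the anchor size, and the 640x640 input are made up for illustration:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# made-up raw network outputs for one anchor in the grid cell (cx, cy) = (4, 10)
t_x, t_y, t_w, t_h = 0.2, -0.5, 0.1, 0.3
cx, cy = 4, 10                   # grid cell indices
grid_w, grid_h = 20, 20          # 640 / 32 = 20 cells per side
anchor_w, anchor_h = 116, 90     # one of the predefined anchor sizes, in pixels
input_w, input_h = 640, 640

# same formulas as the snippet above: sigmoid offsets the center inside its cell,
# exp scales the anchor's width and height
b_x = (sigmoid(t_x) + cx) / grid_w       # normalized center x in [0, 1]
b_y = (sigmoid(t_y) + cy) / grid_h
b_w = anchor_w * np.exp(t_w) / input_w   # normalized width
b_h = anchor_h * np.exp(t_h) / input_h

print(b_x * input_w, b_y * input_h, b_w * input_w, b_h * input_h)
# center ≈ (145.6, 332.1), size ≈ (128.2, 121.5) pixels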

The result is shown in Figure 16 below:

 Figure 16 The predicted box obtained after fine-tuning the anchor box and the real labeled real box

It can be seen that there is a deviation between the predicted box and the real box. In the training samples, the position of the real box is already known, so the b_x, b_y, b_h, b_w of the prediction box can be adjusted until its edges match the position and scale of the real box; from this we obtain the training targets t_x, t_y, t_w, t_h. A loss function can then be built from the network outputs and these target values, and by learning the network parameters the outputs are driven toward the targets.

2.3.4 Non-maximum suppression to reduce prediction-box redundancy:

Looking at Figure 17 below, there is only one face, yet several prediction boxes are drawn, each with a different confidence; from the inside out they are 0.33, 0.69, 0.63, 0.64. These boxes are redundant, and the figure shows that the second-largest prediction box is the most accurate (confidence 0.69), so we need to remove the redundant boxes and keep the most accurate prediction box.

Figure 17 Prediction box redundancy

YOLOv3 uses non-maximum suppression (NMS) to eliminate redundant boxes, using IoU (see the formula above) to measure whether prediction boxes correspond to the same object. Initially the objectness of every prediction box is 1, indicating that a target is present. For each class, the algorithm keeps only the prediction box with the highest score; any other prediction box whose IoU with it exceeds the threshold is discarded, and its objectness label is set to -1 so that it does not take part in the loss computation, rather than being set to 0 and treated as a negative sample.

The IoU threshold is a hyperparameter that must be set in advance; it is typically 0.5 or 0.65. A simple enumeration of how the IoU value relates to the relative position and size of two anchor boxes is shown below:

 Figure 18 The relationship between the IoU value and the position and shape of the frame

It is clear from the figure that a large IoU means the two prediction boxes are very close, so we keep the one with the higher confidence to reduce redundancy. In the code of this experiment the IoU threshold is 0.5, as follows:

for box_index, box in enumerate(boxes):
    bbox_pred = [box[1], box[0], box[3], box[2]]
    count_pred[classes[box_index]] += 1
    for anno in gt_anno:
        class_ground = anno[4]
        if classes[box_index] == class_ground:
            ### calc_iou function; its formula was analyzed above
            iou = calc_iou(bbox_pred, anno)
            ### the threshold is 0.5
            if iou >= 0.5:
                count_correct[class_ground] += 1
                break

The overall procedure of non-maximum suppression can be summarized as: sort the boxes by confidence, keep the highest-scoring box, discard every remaining box whose IoU with it exceeds the threshold, and repeat on the boxes that are left.
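A minimal NumPy sketch of this greedy procedure, simplified to a single class; it is not the exact code used in this experiment:

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    # greedy non-maximum suppression for boxes in (x1, y1, x2, y2) format
    order = np.argsort(scores)[::-1]          # sort by confidence, highest first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        # IoU of the best box with the remaining boxes
        x1 = np.maximum(boxes[best, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[best, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[best, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[best, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                    (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_best + area_rest - inter)
        # keep only the boxes whose overlap with the best box is below the threshold
        order = order[1:][iou < iou_threshold]
    return keep

boxes = np.array([[100, 100, 210, 210], [105, 105, 215, 215], [300, 300, 400, 400]], dtype=float)
scores = np.array([0.9, 0.7, 0.65])
print(nms(boxes, scores))   # [0, 2]: the 0.7 box is suppressed by the 0.9 box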

As shown above, if you want the image to keep only the most accurate boxes, you can lower the IoU threshold appropriately, but this may cause too many other prediction boxes to be discarded, so that not all real target objects can be labeled.

The specific demonstration of the correct example is as shown in Figure 19:

 Figure 19 Detecting human and dog targets

The confidences of the three prediction boxes on the person are (0.9, 0.7, 0.65). The IoU of the white box with the highest confidence, 0.9, is computed against all other prediction boxes; because the IoUs of the yellow boxes with confidences 0.7 and 0.65 exceed the threshold, they are discarded. Similarly, the prediction boxes at the dog are suppressed by non-maximum suppression, and the red box with confidence 0.65 is finally kept.

The retained prediction boxes are kept as positive samples (objectness=1), as shown in Figure 20 below. A discarded prediction box that contains a target (objectness=-1) is neither a positive sample nor a negative sample and does not take part in the loss computation. All other prediction boxes that contain no object have their objectness labels set to 0, marking them as negative samples. YOLO v3 predicts the box positions by regression.

 Figure 20 The final retained prediction box

From the above, we can also summarize how YOLO V3 handles the objectness of different anchor boxes, as shown in Figure 21 below:

 Figure 21 Processing of objectness

1. If an anchor box has the largest IoU with a ground truth box, its objectness is set to 1, it is marked as a positive sample, and it must take part in the loss computation.

2. If an anchor box contains a target object and its IoU with the ground truth box exceeds the threshold but its objectness was not marked as 1, then its objectness is set to -1 and neither position nor class needs to be computed for it.

3. For the remaining prediction boxes, which contain no target object, objectness is set to 0 and they are marked as negative samples; computing their position and class is unnecessary and meaningless.

2.3.5 Establish the association between the output feature map and the prediction box

In the network architecture and prediction-box generation we mentioned using the feature map to organize the generation of prediction boxes and reduce unnecessary computation. For any prediction box, 5+C values are needed, where C is the number of classes to detect (see the bounding-box parameters above), in order to determine its position and size, whether it contains a target, and the class the target belongs to. If K prediction boxes are generated at each grid cell, the number of predicted values per cell is K * (5 + C).

Producing these with a fully connected layer would make the computation too large. Taking an image of size 640*480 as an example, with a stride of 32 we obtain a 20*15 feature map (640/32=20, 480/32=15); since the grid also has 20*15 cells, each grid cell can be mapped to a pixel of the feature map. In summary, several convolutions on the feature map produce exactly the predicted values required for each prediction box. Figure 22 below shows that the feature-map pixel (i, j) is associated with the predicted values required for the small square region in row i, column j.
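A small sketch of this association in plain Python; the box center, stride, anchor count, and class count are illustrative only:

# a ground-truth box center (in pixels) is assigned to the grid cell / feature-map
# pixel that covers it; that cell's K anchors carry the K * (5 + C) predicted values
stride = 32                      # downsampling factor of this detection head
num_anchors_per_cell = 3         # K
num_classes = 1                  # C (person only, as in this experiment)

gt_center_x, gt_center_y = 145.0, 330.0          # hypothetical ground-truth center
cell_col, cell_row = int(gt_center_x // stride), int(gt_center_y // stride)

values_per_cell = num_anchors_per_cell * (5 + num_classes)
print(cell_col, cell_row, values_per_cell)       # 4 10 18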

 Figure 22 Associated feature map and prediction box

2.3.6 Calculate the label information of the prediction box (the probability of containing a target, its position and scale, and the corresponding class)

① The probability of containing a target: the probability that objectness = 1 for the prediction box

②Location and scale

③Corresponding categories

Their specific meanings and code calculations have been given above, as follows:

prediction = P.Reshape()(x, (num_batch,
                             self.num_anchors_per_scale,
                             self.num_attrib,
                             grid_size[0],
                             grid_size[1]))
prediction = P.Transpose()(prediction, (0, 3, 4, 1, 2))
box_xy = prediction[:, :, :, :, :2]           ## position
box_wh = prediction[:, :, :, :, 2:4]          ## scale
box_confidence = prediction[:, :, :, :, 4:5]  ## probability of containing a target
box_probs = prediction[:, :, :, :, 5:]        ## class probabilities

2.3.7 Visualization tools for the YOLO v3 model

The model can be visualized through Netron. There is an online URL https://netron.app/

Figure 23 Netron online homepage

I downloaded from the Internet a YOLO v3 model (yolo_epoch50.pdparams) that had been trained for 50 epochs to detect seven categories of insects. Part of its visualization is shown in Figure 24 (only a very small part, because the model has far too many parameters and layers to show in full).

Figure 24 yolo_epoch50.pdparams model analysis

More research and discovery can be carried out through the visualization of the model.

2.3.8 Specific code details

The explanation is given in the "YOLO v3 multi-class insect detection.ipynb" file, which contains the specific code for each part: anchor boxes, prediction boxes, IoU, non-maximum suppression, network architecture, and feature extraction; see also Paddle's "Baidu architects teach deep learning step by step".

3. A brief introduction to the Faster-RCNN framework

When introducing the development history of target detection, we mentioned that Faster-RCNN was the first to use an RPN, which greatly improved model efficiency. Faster-RCNN is a typical two-stage detection algorithm: it splits the detection problem into two stages, first generating candidate regions and then classifying the candidate regions after refining their positions.

Faster-RCNN is a convolutional neural network target detection method. Building on R-CNN and Fast R-CNN, Faster R-CNN integrates feature extraction, proposal generation, bounding box regression (rect refine), and classification into a single network, which greatly improves overall performance. Its basic model structure is shown in the figure below:

Figure 25 Basic architecture of Faster RCNN

As can be seen from the architecture diagram, the overall Faster RCNN network can be divided into 4 main parts, as shown in the following figure:

 Figure 26 Network composition of Faster-RCNN

  1. Basic convolutional layers (Conv layers; ResNet-50 can be used): a basic convolutional network extracts the feature map of the image, which is shared by the subsequent RPN layer and fully connected layers.
  2. Region Proposal Network (RPN): the RPN generates candidate regions (proposals). A set of anchors is obtained from a set of fixed sizes and aspect ratios, softmax decides whether each anchor belongs to the foreground or the background, and box regression then corrects the anchors to obtain accurate candidate regions (see the sketch after this list).
  3. RoI Align: this layer takes the input feature map and the candidate regions, maps each candidate region onto the feature map and pools it into a region feature map of uniform size, which is sent to the fully connected layers to determine the target class; either RoIPool or RoIAlign can be used.
  4. Detection layer (Classification): the region feature maps are used to compute the class of each candidate region, and another round of box regression gives the final precise position of the detection box.
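To make step 2 more concrete, here is a rough NumPy sketch of how RPN-style anchors can be enumerated at one feature-map location from a set of scales and aspect ratios; the base size, scales, and ratios are illustrative and not the exact values of any particular Faster-RCNN implementation:

import numpy as np

def anchors_at_location(cx, cy, base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    # generate scale x ratio anchors in (x1, y1, x2, y2) format centered at (cx, cy)
    anchors = []
    for s in scales:
        area = (base_size * s) ** 2
        for r in ratios:
            w = np.sqrt(area / r)        # keep the area fixed while changing the aspect ratio
            h = w * r
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

# 9 anchors (3 scales x 3 ratios) at the feature-map location mapped to pixel (200, 150)
print(anchors_at_location(200, 150).shape)   # (9, 4)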

The network structure diagram of Faster-RCNN in the VGG16 model is as follows:

Figure 27 Network structure of Faster-RCNN in VGG16 model

The text description is as follows:

The input image, of arbitrary size P*Q, is resized to a fixed size MxN. The MxN image is fed into the convolutional network (which contains 13 conv layers, 13 ReLU layers, and 4 pooling layers); after the 4 pooling layers the output of the convolutional body has size [M/16, N/16], giving the feature map. In the RPN, a 3x3 convolution is applied first, after which the foreground anchors and the bounding-box regression offsets are produced separately and the candidate regions are computed. The RoI Pooling layer then uses the candidate regions to extract candidate features from the feature map and sends them to the subsequent fully connected and softmax layers for classification.

Faster-RCNN and YOLO v3 comparison:

  1. Faster-RCNN is a two-stage algorithm, while YOLO v3 is a typical one-stage algorithm: the image is sent through the network only once to predict all bounding boxes, which is faster.
  2. Faster-RCNN initializes its backbone with pretrained VGG16 weights, while YOLO v3 initializes with pretrained Darknet-53 weights.
  3. The number of prediction boxes generated by the YOLO v3 algorithm is far smaller than in Faster-RCNN: each ground truth box in Faster-RCNN may correspond to multiple positively labeled candidate regions, whereas in YOLO v3 each ground truth box corresponds to only one positive candidate region.
  4. Although the accuracy of one-stage algorithms is generally lower than that of two-stage algorithms because of their structure, the gap is not large, and YOLO v3 is more widely used thanks to its faster forward-propagation speed.

4. Comparison and development of YOLO v1, v2, v3:

YOLO v1 uses a fully connected head, while YOLO v2 and v3 use fully convolutional networks. YOLO v4 improves on v3's loss function, because squared error is not well suited to box regression.

YOLO v1 does not have the concept of Anchor box, and subsequent YOLO algorithms have introduced Anchor box.

In terms of algorithm performance, the recall of YOLO v1 is very low. Recall and precision are defined from the quantities in Table 1 below (a small numeric example follows the table):

Table 1 Ground truth values and predicted values

                    Predicted 1 (judged P)    Predicted 0 (judged N)
Actual value 1      TP                        FN
Actual value 0      FP                        TN
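Using the counts in Table 1, recall and precision can be computed as in this small sketch; the counts themselves are invented for illustration:

# hypothetical detection counts
TP, FP, FN, TN = 80, 10, 20, 0   # TN is rarely meaningful in detection

recall = TP / (TP + FN)          # fraction of real objects that were found
precision = TP / (TP + FP)       # fraction of predictions that were correct

print(recall, precision)         # 0.8 0.888...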

YOLO v2 uses Darknet-19 (consisting of 19 convolutional layers) as its backbone. Compared with v1, the precision of YOLO v2 dropped slightly, but its recall improved greatly.

YOLO v3 uses Darknet-53 as its backbone and introduces multi-scale prediction, so the algorithm can output feature maps of different sizes.

In general, this experiment reached the following two conclusions:

  1. The YOLO v3 algorithm has indeed improved greatly over v1 and v2, especially for small-object detection and overlapping-object detection, where it makes a big breakthrough, although its strong generalization can sometimes be a drawback.
  2. Because of their structure, one-stage algorithms are generally less accurate than two-stage algorithms, but the gap between the two is usually not large; one-stage algorithms such as YOLO, relying on their faster forward-propagation speed, are likely to see wider application.

Origin blog.csdn.net/cangzhexingxing/article/details/128092262