In-depth analysis of YOLOv3 model principle

Overview

(1) YOLOv3 is the first model in the YOLO series to introduce residual connections, which address the vanishing-gradient problem in deep networks (whether it is truly the first will be verified later). The backbone it actually uses is DarkNet53.

(2) The most significant improvement, and the one that contributes the most to accuracy: detection is performed in the same way at three different scales. Do not worry yet about how this is achieved; the effect is that the model has a reasonable degree of robustness to large, medium, and small targets.

(3) The classification loss is changed from the previous Softmax to independent logistic (logit) outputs, which decouples the categories from one another.

(4) In the final inference stage, non-maximum suppression (NMS) is applied to the prediction results of the three detection layers (a small sketch of NMS appears at the end of this overview).

The pink network in the upper-left corner of the figure below illustrates "introducing residual connections".

The three puppy pictures at the bottom of the figure, labelled "small", "medium" and "large", illustrate "detecting targets of three different sizes with the same method".

The speed-accuracy diagram below shows that YOLOv3 is much faster than other models of the same period.
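
A minimal sketch of the NMS step mentioned in point (4), assuming boxes are given as (x1, y1, x2, y2) tuples with one score each; this is the generic greedy algorithm, not YOLOv3's exact implementation.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop boxes that
    overlap it too much. In YOLOv3 this is run on the merged predictions of all
    three detection scales."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(50, 50, 150, 150), (55, 60, 155, 160), (300, 300, 400, 400)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the two overlapping boxes collapse to one: [0, 2]
```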

1. Using DarkNet53 as the Backbone

What does the improved backbone achieve?

The backbone of YOLOv3 is DarkNet53.

——Compared with VGG16, DarkNet19 has far fewer parameters (the VGG16 model file is 528 MB, while the DarkNet19 weights are about 80 MB), is faster (VGG16 ≈ 9.4 ms per image vs. DarkNet19 ≈ 6.2 ms), and is nonetheless more accurate (top-5: VGG16 = 90.0, DarkNet19 = 91.2).

        The backbone used by YOLOv2 is DarkNet19.

        Why does DarkNet19 have so many fewer parameters? Because the final components of VGG, the fully connected layers, account for the bulk of its parameters. DarkNet19 replaces the fully connected layers with something else, which is why its parameter count drops so dramatically.

——Both ResNet101 and DarkNet53 have residual structures. With a similar number of parameters (ResNet101 ≈ 160 MB, DarkNet53 ≈ 159 MB), DarkNet53 is slightly more accurate (top-5: ResNet101 = 93.7, DarkNet53 = 93.8) and noticeably faster (ResNet101 ≈ 20 ms, DarkNet53 ≈ 13.7 ms).

Why is this change effective?

 Why can DarkNet19 keep accuracy high while having fewer parameters and running faster?

       

The fewer parameters and faster speed come from the head of the network. The last few modules of VGG16 are three fully connected layers, whereas DarkNet19 (on the right) abandons fully connected layers entirely and replaces them with 1×1 convolutions at the end of the model. These 1×1 convolutions play the transitional role that the FC layers used to play; you can see that DarkNet19 inserts a 1×1 convolution every few blocks, mainly to connect the different components.
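
To make the contrast concrete, here is a minimal PyTorch-style sketch (illustrative layer sizes, not the original DarkNet code): a VGG-style head flattens into large fully connected layers, while a DarkNet19-style head uses a 1×1 convolution plus global average pooling, which removes almost all of those parameters.

```python
import torch
import torch.nn as nn

num_classes = 1000  # ImageNet classes, as in the DarkNet19 vs. VGG16 comparison

# VGG-style head: flatten + fully connected layers (parameter-heavy)
vgg_head = nn.Sequential(
    nn.Flatten(),                       # 512 x 7 x 7 -> 25088
    nn.Linear(512 * 7 * 7, 4096),       # ~103M parameters in this layer alone
    nn.ReLU(inplace=True),
    nn.Linear(4096, num_classes),
)

# DarkNet19-style head: 1x1 conv + global average pooling (far fewer parameters)
darknet_head = nn.Sequential(
    nn.Conv2d(1024, num_classes, kernel_size=1),  # 1x1 conv: ~1M parameters
    nn.AdaptiveAvgPool2d(1),                      # global average pooling
    nn.Flatten(),
)

x_vgg = torch.randn(1, 512, 7, 7)
x_dark = torch.randn(1, 1024, 7, 7)
print(vgg_head(x_vgg).shape, darknet_head(x_dark).shape)  # both: [1, 1000]
```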

                The numbers that appear repeatedly in the VGG16 figure, such as 512 and 256, are channel counts. For DarkNet19, the channel counts are given in the Filters column of the figure.

        

Why do we need residual connections?

Why was it that when the VGG series was released (in 2014), and later DarkNet19, the deepest networks stopped at around 19 layers? The deepest model in the VGG series, for example, was VGG19; why could it not go any deeper?

        The answer was given by Kaiming He in the ResNet paper: beyond a certain depth, both the training error and the test error grow as the network gets deeper, so accuracy actually drops compared with a shallower network. As the network deepens, gradients disperse (vanish) and much of the feature information disappears in the deeper layers, so the network cannot reach its best result. In the same paper, He proposed the residual connection (Residual Block), which lets networks break through the roughly 19-layer barrier and keep growing deeper.

What is a residual connection?

        Adding the output produced after a stack of convolutions to the original, unprocessed input is what a residual connection is.
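
A minimal sketch of such a block, written in the style of a DarkNet53 residual unit (1×1 convolution to halve the channels, 3×3 convolution to restore them, then add the untouched input back); the exact channel counts here are illustrative.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the block output is the processed features plus the raw input."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels // 2)
        self.conv2 = nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))    # 1x1 convolution
        out = self.act(self.bn2(self.conv2(out)))  # 3x3 convolution
        return x + out                             # residual connection: add the untouched input

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 52, 52)).shape)  # [1, 64, 52, 52]
```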

What are the benefits of adding residual connections?

        The original data is added back to the output because some features get lost while passing through the stack of convolutions (this is the gradient-vanishing effect). Adding the original input back allows the lost features and information to be recovered. Without the residual connection, this useful information cannot be captured, and the score naturally suffers.

        Before residual connections are added, the light blue curve (18-layer network) has a lower error than the red curve (34-layer network). Once residual connections are added, the red curve drops below the light blue one: the deeper network now has the lower error.

Since ResNet works so well, why didn't the author simply "borrow" it as the backbone of YOLOv3, instead of designing a new backbone, DarkNet53?

        Because the ResNet model is too large (too many parameters), which slows down inference and hurts real-time performance.

Compared with ResNet, what improvements does DarkNet-53 make?

         Similarities:

                Both use a 3×3 convolution with stride 2 instead of max pooling (the two have a similar effect: both reduce a 3×3 neighbourhood to a single number, but one takes the maximum of those 9 values while the other uses a learned convolution to combine the 9 values into one; see the sketch after this list).

                Both use residual structures.

        Differences:

                Fewer convolutions per residual block. In the 101-layer column of the ResNet figure below (second column from the right), each block stacks 3 convolutions (each cell in that column lists three rows, i.e., three convolutions). DarkNet53, on the right, stacks only 2 convolutions per residual block.

                Fewer channels in the final layer. The last convolutional stage of DarkNet53 has 1024 channels, while ResNet101's has 2048, twice as many.

                The number of residual blocks (note: not the number of convolution kernels, but how many times a residual block is repeated in series). ResNet101 uses (3, 4, 23, 3) repeats, while DarkNet53 uses (1, 2, 8, 8, 4). The latter distribution of residual blocks is more reasonable; the former contains a certain amount of redundancy.
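
A minimal sketch of the shared downsampling style mentioned in the similarities above, assuming PyTorch and illustrative channel counts: both backbones replace max pooling with a stride-2 3×3 convolution.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 208, 208)

# Max pooling: take the maximum of each window - no learnable parameters
pooled = nn.MaxPool2d(kernel_size=2, stride=2)(x)

# DarkNet53 / ResNet alternative: a strided 3x3 convolution halves the resolution
# while *learning* how to combine the window into one value (and can change channels)
downsample = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
strided = downsample(x)

print(pooled.shape)   # [1, 64, 104, 104]
print(strided.shape)  # [1, 128, 104, 104]
```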

ResNet101 and DarkNet53 performance comparison

        With near-identical accuracy (top-5: ResNet101 = 93.7, DarkNet53 = 93.8), DarkNet53 is roughly 1.5× faster and achieves a higher throughput (BFLOP/s: ResNet101 = 1039, DarkNet53 = 1457).

2. Adding a Feature Pyramid Network (FPN)

Purpose of adding FPN:

        To detect targets of different scales: large, medium, and small.

Layers at different depths of the backbone produce feature maps of different sizes and resolutions, and the semantic information they carry also differs. For example, high-resolution feature maps (from shallower layers) carry low-level semantic information but lack high-level semantics; this is unhelpful for classification but very useful for localization.

(a) Featurized image pyramid: the image is rescaled to several sizes (large, medium, small), features are extracted for each scale separately, and detection is run on each one. Note that the features here are hand-crafted; no CNN is involved yet.

        (b), (c) and (d) (FPN) all use a convolutional neural network to extract features, because (1) it avoids hand-designed features and reduces the workload, (2) a CNN can learn higher-level semantic information, and (3) a CNN is more robust to scale changes (the same CNN performs similarly well on large, medium, and small scales, rather than degrading sharply on, say, small targets; this is relative, of course).

(b) Single feature map and (c) Pyramidal feature hierarchy, compared with (a) Featurized image pyramid, use a convolutional neural network for feature extraction and downsampling rather than hand-crafted features.

        YOLOv1 and YOLOv2 use (b) single feature map

The difference between (b) and (c) is that (b) performs detection only on the last layer of the network, while (c) runs detection at the large, medium, and small scales.

SSD is a detection model released after YOLOv1 and YOLOv2. It is roughly Figure (c) extended a few layers deeper, with detection heads of different scales attached to those deeper layers. The SSD authors designed it this way to avoid using features that contain only low-level semantics, but precisely because of this it misses the low-level (high-resolution) semantic information. At the same time, deepening the network to this extent introduces redundancy, slowing inference and weakening real-time performance.

Based on the following deficiencies of (a)(b)(c),

        (a) Featurized image pyramid: the problem is that hand-designed features must be extracted scale by scale, which is very redundant. With convolution, by contrast, the kernel's parameters are shared across the whole feature map, which greatly reduces the number of parameters.

        (b) Single feature map, single scale. Improvement: a convolutional neural network extracts features automatically and with far fewer parameters, replacing hand-designed features. Disadvantage: only one head, attached to the deepest layer, does the recognition, so large targets may be detected very well while small targets are detected poorly; performance across target sizes is not robust.

        (c) Pyramidal feature hierarchy. Improvement: instead of a single head at the deepest level, there are multiple heads at different depths. Disadvantage: the semantic levels at different depths are not well connected, and there is no information fusion between the independent detection heads.

Therefore, FPN, Figure (d), made the following improvements

        (1) Keep the parameter/computation budget and the existing depth (downsampling by 8×, 16×, 32×), instead of redundantly deepening the network as SSD does.

        (2) Fuse the deeper / low-resolution / semantically rich features (rich high-level semantics help classification) with the shallower / high-resolution / spatially rich, low-level features.

                Concretely: on the left-hand path, convolutions downsample the image several times; the deepest feature map is then upsampled and fused with a shallower one, and this upsample-and-fuse step is repeated so that multi-level semantic information accumulates; detection heads are then attached to the fused maps.

                Note the fusion method: the fusion here is not element-wise (point-wise) addition but concatenation. Element-wise addition can mask some gradient information, i.e., drown out some of the more important features, whereas concatenation keeps the two fused feature sets intact and treats them as equally important.
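
A minimal sketch of this fusion step, with assumed channel counts rather than the exact YOLOv3 layer layout: the deep map is upsampled 2× and concatenated with the shallower map along the channel dimension.

```python
import torch
import torch.nn as nn

deep = torch.randn(1, 512, 13, 13)     # deep, low-resolution, rich high-level semantics
shallow = torch.randn(1, 256, 26, 26)  # shallower, higher-resolution, more spatial detail

upsample = nn.Upsample(scale_factor=2, mode="nearest")
deep_up = upsample(deep)               # [1, 512, 26, 26]

# Concatenation keeps both feature sets intact as separate channels...
fused = torch.cat([deep_up, shallow], dim=1)  # [1, 768, 26, 26]

# ...whereas element-wise addition (the alternative discussed above) would merge them
# into a single set of channels and requires matching channel counts first.

print(fused.shape)
```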

3. Anchor-based

YOLOv3 is an Anchor-based method.

Assume the input is a 416×416 RGB three-channel image. YOLOv3 produces feature maps at three scales (large, medium, small), and the corresponding grids contain 13×13, 26×26, and 52×52 Grid Cells respectively. These numbers arise because the Grid Cells at the three scales are squares of side 32, 16, and 8 pixels respectively (416/32 = 13, 416/16 = 26, 416/8 = 52).

Each Grid Cell corresponds to 3 Anchor Boxes. (I am not yet entirely clear on which three anchor boxes each Grid Cell corresponds to and how they are drawn.) In the three dog pictures below, yellow marks the dog, and the three blue boxes are the three Anchor Boxes drawn for the Grid Cell in the middle of the image.

        So for the puppy image below, we generate a total of (13×13 + 26×26 + 52×52) × 3 = 10647 detection boxes.

        13×13 detects large size targets.

                Where 13 comes from: with 32× downsampling, the feature map size is 416/32 = 13. Each point on the 13×13 feature map corresponds to one Grid Cell; the 13×13 points and the 13×13 Grid Cells are in one-to-one correspondence.

                (Because (1) each cell is large, so the cell itself, or the rectangle formed by merging it with neighbouring cells, is big enough to enclose a large target; and (2) a large cell is more likely than a small one to "see" the whole dog at once. Conversely, with the 52×52 grid on the right, if the goal is to detect the small bell on the dog's neck, a large cell would enclose a lot of area irrelevant to the bell, whereas a few small cells of the 52×52 grid can frame the bell precisely and include as little irrelevant area as possible.)

        26×26 detects medium-sized targets

        52×52 detects small size targets

For feature maps of different scales, Grid Cells of different sizes are used, so that each map is suited to detecting targets of its corresponding size.

Note that the sharpness and blur of the example images above are not actually realistic, but for simplicity I will not redraw them; just understand the following. Strictly, the 52×52 feature map on the right comes from a shallow layer, so it has a high resolution and should look especially clear, while the 13×13 feature map on the left has passed through many convolutions and should look especially blurry.

At the end of detection we need to compute the model's loss. This loss is built from (1) the category the object belongs to, (2) the coordinates of the object's bounding box, and (3) the confidence that the bounding box contains an object.

        The category the object belongs to: the Microsoft COCO dataset has 80 categories.

        The bounding box coordinates require only 4 numbers to locate the box: the x and y coordinates of the bbox center, and the bbox width and height.

        The confidence level is a number in the range [0,1]. This confidence evaluates the possibility that there may be an object in this box.

Therefore, a total of 80 + 4 + 1 = 85 numbers make up the raw data needed for this loss calculation.

In the grids above, each anchor box of each Grid Cell therefore corresponds to an 85-dimensional tensor.

The loss is then computed from these 85-dimensional tensors.

You may be curious, how many numbers are in the final output of the detection head corresponding to 8x downsampling, 16x downsampling, and 32x downsampling? Let's calculate it together

After 32 times downsampling, a feature map with a size of 13×13 is obtained. Each grid cell has 3 anchor boxes, and each anchor box is a tensor composed of 85 elements.

        Therefore, the number of elements in the 32× downsampled output is (13×13)×3×85.

        The number of output elements for 16x downsampling is (26×26)×3×85

        The number of output elements for 8x downsampling is (52×52)×3×85
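
A quick check of these counts, assuming a 416×416 input, 3 anchor boxes per Grid Cell, and 80 COCO classes (4 + 1 + 80 = 85 values per anchor box):

```python
input_size = 416
num_anchors = 3
num_values = 4 + 1 + 80          # box coords + objectness + COCO classes = 85

total_boxes = 0
for stride in (32, 16, 8):       # the three detection scales
    grid = input_size // stride  # 13, 26, 52
    boxes = grid * grid * num_anchors
    elements = boxes * num_values
    total_boxes += boxes
    print(f"stride {stride:2d}: {grid}x{grid} grid, {boxes} boxes, {elements} output elements")

print("total predicted boxes:", total_boxes)  # (13*13 + 26*26 + 52*52) * 3 = 10647
```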

4. bbox coordinate representation: relative coordinates

To predict the absolute position of a target's bounding box, YOLOv3 first determines which Grid Cell the box center falls in, then predicts the center's position relative to the top-left corner of that Grid Cell, and finally converts those relative coordinates into the absolute position of the bbox. Concretely, the network first predicts (tx, ty, tw, th, t0) and then computes the bbox position, size, and confidence through the coordinate offset formulas below.

        Put simply, what is regressed is not the absolute coordinates but coordinates relative to the top-left corner of the Grid Cell.

        This is equivalent to giving the model a prior: the target lies inside a particular Grid Cell, so only a coordinate relative to that Grid Cell needs to be regressed.

        tx, ty, tw, th are the raw outputs predicted by the model. They are then converted into the target center point (bx, by) and the bounding box width bw and height bh through the following steps.

        cx and cy are the x and y coordinates of the top-left corner of the Grid Cell, measured on the feature map of the corresponding layer.

        To obtain the center point bx, by: tx and ty are first passed through a $\sigma$ function (the sigmoid), and the Grid Cell coordinates cx and cy are then added, giving the true center coordinates $b_x = \sigma(t_x) + c_x$ and $b_y = \sigma(t_y) + c_y$ (a small decoding sketch follows at the end of this section).

                The sigmoid is used here because it maps any input in $(-\infty, +\infty)$ into the interval $[0, 1]$, so the output behaves like a fraction. Since $c_x$ is the x coordinate of the Grid Cell's top-left corner and each Grid Cell has a side length of 1 in feature-map units, adding a value $\sigma(t_x) \le 1$ to $c_x$ means the predicted center can land on any point inside that 1×1 cell, from its top-left to its bottom-right corner.

        Why use relative position?

                Because predicting absolute positions makes training harder to converge; using relative positions lets the network converge to a better result more easily.

        When training these coordinate values, a sum-of-squares loss is used, because this error is quick to compute. (Later YOLO versions no longer use this sum-of-squares loss.)

        The confidence is defined by the formula Confidence = P(Object) × IoU.

                This confidence measures how likely it is that an object is contained in the bounding box and how accurately the bbox is predicted.

                Here P(Object) is either 0 or 1: if there is an object in the bbox, P(Object) = 1; if the area framed by the bbox contains no object (i.e., it frames background rather than foreground), then P(Object) = 0.

                That is to say, when there is an object in the area framed in the bbox, the confidence is the IoU between the predicted box and the real box of the object. If there is no object in the area framed in the bbox, the confidence is 0.
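
Putting the formulas above together, a minimal decoding sketch for a single anchor box (cx, cy are the Grid Cell's top-left corner in grid units; pw, ph are assumed anchor prior sizes, also in grid units; the function name and example values are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_box(tx, ty, tw, th, t0, cx, cy, pw, ph, stride):
    """Decode raw network outputs for one anchor into an absolute box (in pixels).

    cx, cy : top-left corner of the Grid Cell (in grid units)
    pw, ph : anchor box prior width/height (in grid units)
    stride : 32, 16 or 8 depending on the detection scale
    """
    bx = (sigmoid(tx) + cx) * stride   # sigmoid keeps the center inside the cell
    by = (sigmoid(ty) + cy) * stride
    bw = pw * math.exp(tw) * stride    # width/height rescale the anchor prior
    bh = ph * math.exp(th) * stride
    objectness = sigmoid(t0)           # predicted confidence in [0, 1]
    return bx, by, bw, bh, objectness

# Example: cell (6, 6) on the 13x13 map, an anchor prior of roughly 3.6 x 2.9 cells
print(decode_box(0.2, -0.1, 0.3, 0.1, 1.5, 6, 6, 3.6, 2.9, 32))
```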

5. Matching of positive and negative samples

        Applying three grids with different numbers of cells (the large, medium, and small scales) to the same image is what is called having three anchors here.

        The small cells inside each anchor are called Grid Cells.

        The positive/negative matching rule used by YOLOv3 assigns only one positive sample to each Ground Truth (note the key point: only one positive sample, just one, which is very few!). A positive sample also requires its IoU to exceed the threshold you set. The positive sample is the predicted bounding box whose overlap with the GT is largest (the highest IoU) among all predicted boxes.

        If a sample is not a positive sample (the predicted bounding box contains no actual object), then there is no object position or object category, and hence no localization loss or category loss. These negative samples do, however, contribute to the confidence loss, because the confidence loss evaluates whether there is an object in the bbox.

        However, this matching rule means that the vast majority of samples are negative (most predicted bounding boxes contain no object), and positives make up only a tiny fraction of the total. How is this extreme imbalance handled? The YOLOv3 author tried introducing Focal Loss to alleviate it, but the effect was poor. The stated reason is that negative samples only participate in the confidence loss and have a very small impact on the total loss (because a negative sample's confidence target is 0 and it contributes only a little to the confidence loss; I don't fully understand this point either).

The drawback of this matching mechanism is that there are too few positive samples, which makes the network hard to train. (Since most samples are negative, i.e., most bounding boxes contain nothing, a model chasing accuracy could simply classify every sample as negative and still score highly. With such extreme imbalance, the model learns nothing of value from the positive and negative samples and degenerates into a machine that labels everything negative; training effectively fails.)

To address this, when selecting positives, the three predictions with the largest IoU against the GT are taken as positive samples, which increases the number of positives (a small sketch follows below).
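
A minimal sketch of this matching rule under assumed inputs (boxes as (x1, y1, x2, y2) tuples; candidate_boxes stands in for the predicted/anchor boxes): top_k=1 reproduces the original "one positive per GT" rule, and top_k=3 reproduces the fix described above.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_positives(gt_box, candidate_boxes, top_k=1, iou_threshold=0.5):
    """Return the indices of candidate boxes assigned as positives for this GT:
    the top_k boxes with the largest IoU, kept only if they exceed the threshold."""
    scored = sorted(enumerate(candidate_boxes),
                    key=lambda pair: iou(gt_box, pair[1]), reverse=True)
    return [idx for idx, box in scored[:top_k] if iou(gt_box, box) > iou_threshold]

gt = (100, 100, 200, 200)
candidates = [(90, 95, 210, 205), (110, 110, 205, 195), (300, 300, 350, 350)]
print(match_positives(gt, candidates, top_k=1))  # only the single best match: [0]
print(match_positives(gt, candidates, top_k=3))  # up to three positives: [0, 1]
```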

6. Loss function

The Loss of YOLOv3 is divided into 3 parts

        (1) Bounding box coordinate error, bbox loss, uses bbox center point (x, y) and bbox width w and height h.

                The bbox center-point loss is a squared error: the true x and y coordinates minus the predicted values, squared and summed.

                For the bbox width and height, the square root is taken first, and then the squared differences between prediction and GT are summed.

                The first two rows of the loss function are weighted by $\lambda_{coord}$ to form a weighted sum.

        (2) Confidence error (obj loss): evaluates whether there is an object in the bounding box at all, regardless of its category; person, car, bus, or motorcycle does not matter here, only whether an object is present.

        (3) Class error (class loss): the error over each of the 80 categories.

                Both the confidence loss and the category loss use cross-entropy on logits: the confidence loss is a binary (two-class) cross-entropy, and the category loss is applied across the multiple classes (see the sketch after the parameter list below).

                YOLO versions before YOLOv3 used softmax for multi-class classification; only in YOLOv3 was it replaced with independent logistic outputs trained with cross-entropy. The reason for the change is that softmax forces mutual suppression between categories, an either/or choice: an object belongs to category A or to category B, but cannot be both at once (the two scores must trade off against each other). Independent logistic (logit) outputs decouple the categories and remove this conflict.

                Note that the category loss (class loss) is only computed for positive samples. (I don't fully understand this part: does "positive sample" mean the predicted category matches the GT category? And why aren't incorrectly classified samples also counted in the classification error?)

                For confidence loss, regardless of whether there is a target in the bbox, whether it is a positive sample or a negative sample, we have to calculate the confidence loss. 

        There are some parameters in the above loss function, which are explained below.

                $S^2$ denotes the number of Grid Cells at each of the three scales: 13×13 (32× downsampling), 26×26 (16× downsampling), and 52×52 (8× downsampling).

                B is the number of candidate boxes generated by each Grid Cell (here B = 3).

                $\lambda_{coord}$ and $\lambda_{noobj}$ are the weights used to balance these loss terms.

                $I_{i,j}^{obj}$ indicates whether there is a target in the j-th box of grid cell i: 1 if there is, 0 if not.

                $I_{i,j}^{noobj}$ indicates that there is no target in the j-th box of grid cell i: 1 if there is no target, 0 if there is.
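
A minimal sketch of how these three terms might be combined, PyTorch-style, with flattened per-anchor tensors; the masks and weights correspond to the symbols explained above, while the exact shapes and the coordinate term are simplified (e.g., the square-root treatment of width/height is omitted):

```python
import torch
import torch.nn.functional as F

def yolov3_style_loss(pred_xywh, pred_obj, pred_cls,
                      true_xywh, true_obj, true_cls,
                      obj_mask, noobj_mask,
                      lambda_coord=5.0, lambda_noobj=0.5):
    """pred_obj / pred_cls are raw logits; obj_mask marks positive anchors and
    noobj_mask marks negative ones (the I^{obj} and I^{noobj} indicators above)."""
    # (1) box loss: sum-of-squares on the coordinates, positives only
    coord_loss = lambda_coord * (obj_mask.unsqueeze(-1) *
                                 (pred_xywh - true_xywh) ** 2).sum()

    # (2) objectness loss: binary cross-entropy, computed for positives and negatives
    obj_bce = F.binary_cross_entropy_with_logits(pred_obj, true_obj, reduction="none")
    obj_loss = (obj_mask * obj_bce).sum() + lambda_noobj * (noobj_mask * obj_bce).sum()

    # (3) class loss: independent binary cross-entropy per class (no softmax), positives only
    cls_bce = F.binary_cross_entropy_with_logits(pred_cls, true_cls, reduction="none")
    cls_loss = (obj_mask.unsqueeze(-1) * cls_bce).sum()

    return coord_loss + obj_loss + cls_loss

# Example shapes: all anchors of all three scales flattened, 80 classes
N = 10647
loss = yolov3_style_loss(torch.randn(N, 4), torch.randn(N), torch.randn(N, 80),
                         torch.rand(N, 4), torch.randint(0, 2, (N,)).float(),
                         torch.randint(0, 2, (N, 80)).float(),
                         obj_mask=torch.randint(0, 2, (N,)).float(),
                         noobj_mask=torch.randint(0, 2, (N,)).float())
print(loss)
```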

                
