Object detection: YOLOv3

Published in 2018 as an arXiv technical report. YOLOv3 itself does not introduce many new ideas; it mainly integrates the strengths of the mainstream networks of that time.

1. Backbone improvements in YOLOv3:

Performance of each backbone network on ImageNet: Darknet-53 achieves top-1 and top-5 accuracy comparable to ResNet-152, but runs about twice as fast.

Darknet-53 network structure (53 convolutional layers)

Darknet-53 is built mainly by stacking residual blocks, but unlike many backbones it has no max-pooling layers; downsampling is done with stride-2 convolutions instead. This design may be why Darknet-53, with only 53 convolutional layers, matches the accuracy of ResNet-152's 152 layers.

Each Convolutional block in the figure consists of three operations: Conv2d, BN, LeakyReLU. The Conv2d layer has no bias term: when a BN layer follows, the normalization cancels out any bias, so it would have no effect.

Each box in the figure is a residual structure as shown below:
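As a minimal sketch (assuming PyTorch; the class names are illustrative, not taken from the original Darknet source), the Convolutional block and the residual structure described above could look like this:

```python
import torch
import torch.nn as nn

class ConvBNLeaky(nn.Module):
    """Conv2d (no bias) -> BatchNorm -> LeakyReLU, as used throughout Darknet-53."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)  # BN makes bias redundant
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class DarknetResidual(nn.Module):
    """1x1 bottleneck halving the channels, then a 3x3 conv, with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            ConvBNLeaky(channels, channels // 2, 1),
            ConvBNLeaky(channels // 2, channels, 3),
        )

    def forward(self, x):
        return x + self.block(x)

# Downsampling uses a stride-2 3x3 conv instead of max pooling:
down = ConvBNLeaky(64, 128, 3, stride=2)
```

Note how the stride-2 `ConvBNLeaky` halves the spatial size while the residual block preserves it, matching the figure.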

2. YOLOv3 model structure:

In YOLOv3, three feature maps of different scales are used for object detection. The sizes of all ground-truth bounding boxes in the training set are clustered with k-means to obtain the anchors: 9 in total, 3 per scale.
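A hypothetical NumPy sketch of this k-means clustering (function names are my own; YOLO-style anchor clustering uses 1 − IoU as the distance so that large and small boxes are treated fairly):

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between boxes and anchors, comparing (w, h) only, top-left aligned."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = ((boxes[:, 0] * boxes[:, 1])[:, None] +
             (anchors[:, 0] * anchors[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(boxes_wh, k=9, iters=100, seed=0):
    """Cluster (w, h) pairs into k anchors; distance = 1 - IoU."""
    rng = np.random.default_rng(seed)
    anchors = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        # assigning to max IoU == assigning to min (1 - IoU)
        assign = np.argmax(iou_wh(boxes_wh, anchors), axis=1)
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = boxes_wh[assign == j].mean(axis=0)
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]  # sorted by area
```

Sorting by area at the end makes it easy to split the 9 anchors into three groups of 3, one group per detection scale.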

The 9 anchor boxes obtained on the COCO dataset are as follows:

In terms of assignment, the largest anchors (116×90), (156×198), (373×326) are applied on the smallest 13×13 feature map (largest receptive field), suitable for detecting large objects. The medium anchors (30×61), (62×45), (59×119) are applied on the medium 26×26 feature map (medium receptive field), suitable for medium-sized objects. The smallest anchors (10×13), (16×30), (33×23) are applied on the largest 52×52 feature map (smallest receptive field), suitable for small objects.

The prediction tensor of each prediction feature layer has shape N×N×[(4+1+C)×K], where N is the spatial size of the feature map, C is the number of classes (80 for COCO), and K = 3 is the number of anchors per scale.
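For COCO this arithmetic works out as follows:

```python
# Output tensor shape per prediction scale: N x N x [(4 + 1 + C) * K]
C, K = 80, 3                 # 80 COCO classes, 3 anchors per scale
channels = (4 + 1 + C) * K   # 4 box offsets + 1 objectness + C class scores
for N in (13, 26, 52):
    print(f"{N}x{N} feature map -> {N}x{N}x{channels}")
```

So each of the three scales outputs 255 channels per spatial location.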

The overall network model of YOLOV3 is as follows:

After a feature map is upsampled in the network, it is fused with the earlier feature map of the same size by concat, i.e. splicing along the channel (depth) dimension. The original FPN instead uses element-wise addition.

Large feature maps are used to predict small targets; medium feature maps predict medium targets; small feature maps predict large targets.

Note that the three final layers that generate the predictions are plain Conv2d layers only, without BN or LeakyReLU.

3. Bounding-box prediction in YOLOv3:

Bounding-box prediction in YOLOv3 uses the same mechanism as YOLOv2: the center point is predicted as an offset relative to its grid cell, squashed into (0, 1) by a sigmoid function.

The way YOLOv3 predicts the center point differs slightly from Faster R-CNN and SSD: in Faster R-CNN and SSD, the center regression parameters are relative to the anchor, whereas in YOLOv3 they are relative to the top-left corner of the grid cell.

  

The figure above shows the regression process for an object bounding box. The dashed rectangle is the anchor template (only its size (pw, ph) matters here), and the solid rectangle is the predicted bounding box, computed from the offsets the network predicts relative to the top-left corner of the grid cell. Here (cx, cy) are the coordinates of the top-left corner of the grid cell, (pw, ph) are the width and height of the anchor template mapped onto the feature layer, and the network outputs (tx, ty, tw, th) are the predicted center offsets (tx, ty) and width/height scaling factors (tw, th). The final predicted box (bx, by, bw, bh) is obtained by:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^(tw)
bh = ph · e^(th)

where σ is the sigmoid function. Its purpose is to squash the predicted offset into (0, 1), so that the center of each grid cell's predicted box stays inside the current cell, which the author says speeds up network convergence.
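These conversion formulas can be sketched directly in NumPy (a minimal illustration, with cell coordinates and anchor sizes assumed to be in feature-map units):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Convert network outputs (tx, ty, tw, th) into a box (bx, by, bw, bh)."""
    bx = sigmoid(tx) + cx    # sigmoid keeps the center inside the current grid cell
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)     # exp scales the anchor template's width/height
    bh = ph * np.exp(th)
    return bx, by, bw, bh
```

With all offsets at zero, the decoded center sits exactly in the middle of the cell (σ(0) = 0.5) and the box takes the anchor template's size unchanged (e⁰ = 1).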

4. Positive and negative sample matching in YOLOv3:

Each GT box is compared against all anchor templates: align the GT and each template at their top-left corners, then compute the IoU. With a threshold such as IoU > 0.3, every template above the threshold counts as positive; in the figure, only template 2 satisfies the condition. Next, map the GT onto the grid (the prediction feature layer): whichever grid cell contains the GT's center point, anchor template 2 of that grid cell becomes a positive sample. If the GT's IoU exceeds the threshold for several anchor templates, all of those templates in that grid cell are treated as positives. This enlarges the number of positive samples, and in practice it was found to give better results.
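This matching rule can be sketched as follows (a hypothetical helper of my own naming; widths and heights are assumed to be in the same units as the anchor templates):

```python
import numpy as np

def match_positive_anchors(gt_wh, anchor_templates, iou_thresh=0.3):
    """Top-left-aligned IoU between one GT box and each anchor template.

    Returns the indices of the templates counted as positive for the
    grid cell containing the GT's center.
    """
    # With top-left corners coinciding, the intersection is just the
    # overlap of the two (w, h) rectangles.
    w = np.minimum(gt_wh[0], anchor_templates[:, 0])
    h = np.minimum(gt_wh[1], anchor_templates[:, 1])
    inter = w * h
    union = (gt_wh[0] * gt_wh[1] +
             anchor_templates[:, 0] * anchor_templates[:, 1] - inter)
    iou = inter / union
    return np.nonzero(iou > iou_thresh)[0]
```

For a GT of size (16, 30) against the small-scale COCO templates (10×13), (16×30), (33×23), templates 1 and 2 clear the 0.3 threshold and both become positives for the GT's grid cell.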

5. Loss calculation:

The loss function of YOLOv3 is the sum of three parts: objectness (confidence) loss, classification loss, and localization (box offset) loss, weighted by balance coefficients λ1, λ2, λ3.
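In symbols, this decomposition reads (a standard formulation; the exact per-term definitions follow the usual YOLO-family conventions and may differ slightly from the figure):

```latex
L = \lambda_1 L_{\text{conf}} + \lambda_2 L_{\text{cls}} + \lambda_3 L_{\text{loc}}
```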


Source: blog.csdn.net/wanchengkai/article/details/124408697