In-Depth Series - Object Detection History (6): YOLO-V3 Object Detection Explained


Back to the Object Detection History directory

Previous: In-Depth Series - Object Detection History (5): SSD Object Detection Explained

Next: In-Depth Series - Object Detection History (7): YOLO-V3 Object Detection Code Explained

 

Paper: "YOLO-V3"

Code: tf_yolov3_pro

 

This section explains YOLO-V3 object detection; the next section walks through the YOLO-V3 code in detail.

 

Seven. YOLO-V3 Object Detection

YOLO (You Only Look Once)

YOLO-V1: 2015, YOLO-V2/9000: 2017, YOLO-V3: 2018

1. YOLO stands for "You Only Look Once". It is an object detector that uses features learned by a deep convolutional neural network to detect objects. YOLO uses only convolutional layers, making it a fully convolutional network (FCN). For downsampling, no pooling layers are used; instead, convolutional layers with a stride of 2 downsample the feature maps. This helps prevent the loss of low-level features that pooling often causes.
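A minimal sketch of this idea (TensorFlow 2.x assumed, since the companion code tf_yolov3_pro is TensorFlow-based): downsampling with a stride-2 convolution instead of a pooling layer.

```python
import tensorflow as tf

x = tf.random.normal([1, 416, 416, 32])   # an NHWC feature map

# A pooling layer would simply discard activations; a stride-2 convolution
# learns how to downsample, which helps preserve low-level features.
downsample = tf.keras.layers.Conv2D(filters=64, kernel_size=3,
                                    strides=2, padding="same")
print(downsample(x).shape)                # (1, 208, 208, 64)
```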

 

2. YOLO-V3 makes predictions on feature maps at three different scales. At each location (pixel) of each scale's feature map, three bounding boxes with different aspect ratios are predicted. The three scales come from the backbone: the smaller feature maps are upsampled by 2x and concatenated (concat) with the corresponding larger feature maps to produce the final prediction maps. The positions and sizes of the anchors for the bounding boxes are determined by k-means clustering. On the COCO dataset, the anchors for the three scales are divided into nine clusters: (10 x 13), (16 x 30), (33 x 23), (30 x 61), (62 x 45), (59 x 119), (116 x 90), (156 x 198), (373 x 326)
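A hedged sketch of IoU-based k-means anchor clustering in NumPy (the distance used is 1 - IoU between (w, h) pairs aligned at a common corner; the random boxes at the bottom are placeholders for real ground-truth labels):

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) pairs, with boxes aligned at a common corner."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = ((boxes[:, 0] * boxes[:, 1])[:, None] +
             (centroids[:, 0] * centroids[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)  # min of 1 - IoU
        centroids = np.array([boxes[assign == i].mean(axis=0)
                              if np.any(assign == i) else centroids[i]
                              for i in range(k)])
    return centroids[np.argsort(centroids.prod(axis=1))]      # small to large

boxes = np.random.default_rng(1).uniform(10, 400, size=(500, 2))
print(kmeans_anchors(boxes))    # 9 (w, h) anchors, sorted by area
```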

 

3. The steps of YOLO-V3:

  (1) Input the whole image and extract feature maps at three different scales (large, medium, and small) with the backbone CNN, to be used for the predictions below.

  (2) Split the smallest-scale feature map into two parts: one part is used for prediction, the other is upsampled and concatenated with the medium-scale feature map from (1). The concatenated feature map is again split into two parts: one part is used for prediction, the other is upsampled and concatenated with the large-scale feature map from (1); the resulting concatenated feature map is then used for prediction.

  (3) For the feature maps at the three scales, compute iou_loss (or giou_loss), conf_loss, and prob_loss; summing the three losses gives the total loss (because sometimes we need to see how large the total loss is). A code sketch of these steps follows this list.
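The following is a hedged sketch of these steps (not the exact tf_yolov3_pro code; layer widths and names are illustrative): three backbone feature maps pass through predict / upsample / concat branches, and the per-scale losses are summed.

```python
import tensorflow as tf

def detection_heads(c_small, c_medium, c_large, num_anchors=3, num_classes=80):
    """c_small / c_medium / c_large: backbone maps at strides 32 / 16 / 8,
    i.e. 13 x 13, 26 x 26, 52 x 52 for a 416 x 416 input."""
    out_ch = num_anchors * (5 + num_classes)
    conv = lambda x, f, k=1: tf.keras.layers.Conv2D(f, k, padding="same")(x)

    x = conv(c_small, 256)                       # one part kept for prediction
    p_small = conv(x, out_ch)                    # 13 x 13 prediction map
    x = tf.keras.layers.Concatenate()(           # other part: upsample, concat
        [tf.keras.layers.UpSampling2D(2)(conv(x, 128)), c_medium])
    p_medium = conv(conv(x, 256, 3), out_ch)     # 26 x 26 prediction map
    x = tf.keras.layers.Concatenate()(
        [tf.keras.layers.UpSampling2D(2)(conv(x, 128)), c_large])
    p_large = conv(conv(x, 256, 3), out_ch)      # 52 x 52 prediction map
    return p_small, p_medium, p_large

def total_loss(per_scale_losses):
    """per_scale_losses: one (iou_loss, conf_loss, prob_loss) tuple per scale;
    the sum is the single total loss monitored during training."""
    return sum(sum(terms) for terms in per_scale_losses)
```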

 

4. YOLO-V3 flowchart

(Figures: YOLO-V3 flowchart; Darknet network structure; reference models and their performance.)

 

5. Prediction scales

    In YOLO-V3, prediction is performed at three different scales: the detection layers operate on feature maps with strides of 32, 16, and 8. This means that, for a 416 x 416 input, detection is performed on 13 x 13, 26 x 26, and 52 x 52 feature maps. The network downsamples the input image up to the first detection layer, which detects on the stride-32 feature map. The layer is then upsampled by a factor of 2 and concatenated with an earlier feature map of the same spatial size, and a second detection layer detects at stride 16. The same upsampling procedure is repeated once more, and the final detection layer uses a stride of 8.
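A quick check of the grid sizes quoted above:

```python
# For a 416 x 416 input, each detection stride implies a grid size.
for stride in (32, 16, 8):
    print(f"stride {stride:2d} -> {416 // stride} x {416 // stride} grid")
# stride 32 -> 13 x 13, stride 16 -> 26 x 26, stride  8 -> 52 x 52
```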

 

6. Interpreting the output

    Typically (this holds for object detection in general), the features learned by the convolutional layers are passed to a classifier or regressor, which produces the detection predictions (bounding box coordinates, class, etc.). In YOLO-V3, the prediction is made with 1 x 1 convolutions. Each 1 x 1 cell predicts a fixed number of bounding boxes, so along the depth dimension the feature maps contain B x (5 + c) entries.

       B: the number of bounding boxes each cell can predict.

       5 + c: the number of attributes of each bounding box, namely two values for the center coordinates, two for the width and height, one objectness score (foreground or background), and c class confidences (probabilities). In YOLO-V3, each cell predicts three bounding boxes.

  (1) Example

       For example, with a 416 x 416 input image and a stride of 32, the output feature map is 13 x 13.
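Putting the two together, the full shape of this prediction map (assuming the 80-class COCO setting) works out as:

```python
B, c, grid = 3, 80, 13        # boxes per cell, classes, 13 x 13 grid
depth = B * (5 + c)           # 4 coords + 1 objectness + c class scores
print((grid, grid, depth))    # (13, 13, 255): the 13 x 13 prediction map
```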

 

  (2) If the center of an object falls within a cell's receptive field, we want exactly one bounding box predicted by that cell of the feature map to detect it. This concerns how YOLO-V3 is trained: exactly one bounding box is responsible for detecting any given object. First, determine which cell this bounding box belongs to: the cell that contains the center point of the object's ground-truth box (mapped from the input image onto the prediction feature map) is the cell responsible for predicting the object.
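A small illustrative sketch (the function name is hypothetical) of mapping a ground-truth center from image pixels onto the prediction grid to find the responsible cell:

```python
def responsible_cell(center_x, center_y, input_size=416, stride=32):
    """center_x, center_y: ground-truth box center, in input-image pixels."""
    grid = input_size // stride
    col = int(center_x / input_size * grid)   # grid column of the center
    row = int(center_y / input_size * grid)   # grid row of the center
    return row, col

print(responsible_cell(210.0, 150.0))   # (4, 6): row 4, col 6 of the 13 x 13
```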

  (3) Center coordinates

      YOLO-V3 runs the predicted center coordinates through a Sigmoid function. YOLO-V3 does not predict the absolute coordinates of the bounding box center; instead, it predicts offsets:

    ①. The coordinates of the top-left corner of the grid cell that predicts the target.

    ②. The predicted offset, normalized to the cell size of the feature map so that it lies between 0 and 1 (obtained with a Sigmoid function).

    ③. The absolute center coordinates of the bounding box are then the sum of ① and ②.

    ④. Example

        For example, on a 13 x 13 prediction feature map, if the predicted center offset is (0.4, 0.7), the center point lies at (6.4, 6.7) (in the red cell at row 7, column 7, whose top-left corner is (6, 6)).

        However, if the predicted (x, y) offsets were not normalized to the cell size (between 0 and 1), the prediction could be greater than 1, for example (1.2, 0.7), which would put the center point at (7.2, 6.7). The center would then lie in the cell to the right of the red cell, i.e. in row 7, column 8. This violates the theory behind YOLO: if the red cell is responsible for predicting the target, the target's center must lie inside the red cell, not in a cell next to it.

        Therefore, the output is passed through a Sigmoid function, which compresses it to between 0 and 1, effectively keeping the predicted center point within the cell.
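A minimal sketch of this center decoding, reproducing the example above (the logit helper just inverts the sigmoid for the demo):

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))    # squashes offsets into (0, 1)

def decode_center(t_x, t_y, cell_x, cell_y):
    """Raw offsets t_x, t_y plus the cell's top-left corner (cell_x, cell_y)."""
    return cell_x + sigmoid(t_x), cell_y + sigmoid(t_y)

logit = lambda p: np.log(p / (1.0 - p))
print(decode_center(logit(0.4), logit(0.7), 6, 6))   # ~(6.4, 6.7), as above
```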

        

  (4) Predicting the size of the bounding boxes

        The size of the bounding box is obtained by applying a log-space transform to the output and multiplying by the size of the prediction feature map; the bounding boxes are finally mapped back to the input image size to obtain the real targets. For example, on a 13 x 13 prediction feature map, the predicted t_w and t_h are normalized to the image width and height (the labels are built this way during training). So, if the predicted t_w and t_h for a box containing the target are (0.3, 0.8), the actual width and height on the 13 x 13 prediction feature map are (13 x 0.3, 13 x 0.8).
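A sketch of the size decoding as described in the text, where t_w and t_h are normalized to the image size (note that the original YOLO-V3 paper instead predicts log-space offsets relative to an anchor, b_w = p_w * exp(t_w); the normalized form below matches the example above):

```python
def decode_size(t_w, t_h, grid=13):
    """t_w, t_h in (0, 1), normalized to image width/height."""
    return grid * t_w, grid * t_h

print(decode_size(0.3, 0.8))   # (3.9, 10.4) on the 13 x 13 feature map
```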

  (5) Objectness score (foreground vs. background)

       The objectness score expresses the probability that an object is contained within the bounding box: it should be close to 1 for the red cell and the grid cells around it, and close to 0 for cells near the corners of the grid. The objectness score is also passed through a Sigmoid function, since it is interpreted as a probability.

  (6) Obtaining the final predictions

       For an image of size 416 x 416, YOLO-V3 predicts this many bounding boxes:

           (52 x 52 + 26 x 26 + 13 x 13) x 3 = 10647
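A quick arithmetic check of this count:

```python
print((52 * 52 + 26 * 26 + 13 * 13) * 3)   # 10647
```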

       However, there may be only one target object in the image, so we generally want to reduce the 10,647 detections down to the single true one. This is achieved with the following steps:

    ①. Thresholding by object probability

         Boxes are filtered according to their object probability. Typically, boxes with scores below the threshold are ignored.

    ②. Non-Maximum Suppression (NMS)

       NMS is designed to solve the problem of the same object being detected multiple times. Non-maximum suppression, as the name suggests, suppresses the elements that are not the maximum; it can be understood as a local maximum search. The NMS procedure (a code sketch follows this list):

      a. Take the list of bounding boxes and sort it by confidence score.

      b. Select the bounding box with the highest confidence, add it to the final output list, and remove it from the bounding box list.

      c. Compute the areas of all bounding boxes.

      d. Compute the IOU between the highest-confidence box and all other bounding boxes.

      e. Remove the bounding boxes whose IOU overlap is greater than the threshold.

      f. Repeat the above steps until the bounding box list is empty.
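A minimal NumPy sketch of this procedure (class-agnostic for brevity; YOLO-V3 implementations usually run NMS per class):

```python
import numpy as np

def iou(box, boxes):
    """box: (4,), boxes: (N, 4), all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.45):
    order = np.argsort(scores)[::-1]               # a. sort by confidence
    keep = []
    while order.size > 0:
        best = order[0]                            # b. highest-confidence box
        keep.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])   # c./d. areas and IOUs
        order = rest[overlaps <= iou_threshold]    # e. drop heavy overlaps
    return keep                                    # f. loop until list empty
```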

 

7. Efficiency and accuracy

(Figure: efficiency and accuracy comparison.)

 

