One, the model part
Darknet outputs three feature layers 【y1, y2, y3】 with shapes (N,255,13,13), (N,255,26,26) and (N,255,52,52). On each grid_cell of each feature layer, three boxes are predicted. The aspect ratios of these three boxes are not chosen at random: they are taken from the widths and heights of the 9 a priori boxes obtained by clustering the dataset, so that the three feature layers together use exactly the 9 a priori boxes (3 per layer).
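The 255 channels above can be read as 3 a priori boxes × 85 values per grid cell (4 box parameters + 1 confidence + 80 classes). A minimal numpy sketch of that reshape, using a random placeholder tensor in place of the real darknet output:

```python
import numpy as np

# Placeholder for the raw 13x13 head output: 255 = 3 anchors * (4 + 1 + 80).
N = 2
y1 = np.random.randn(N, 255, 13, 13).astype(np.float32)

# Move channels last, then split them into (3 anchors, 85 values each).
y1 = y1.transpose(0, 2, 3, 1)      # (N, 13, 13, 255)
y1 = y1.reshape(N, 13, 13, 3, 85)  # (N, 13, 13, 3, 85)

print(y1.shape)  # → (2, 13, 13, 3, 85)
```

The same rearrangement applies to the 26×26 and 52×52 layers.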
Two, the training part
Whatever format the dataset uses (xml or otherwise), its main content is 【img, the box coordinates and class of every target】. This must be turned into data with the same shape as the model output, namely [(m,13,13,3,85),(m,26,26,3,85),(m,52,52,3,85)]. Because this data comes from real boxes, there are far fewer of them than prediction slots; at this point each real box has only one width and height (its true width and height), unlike the three boxes per grid_cell above. Therefore, to match the shape of the model output, the positions in the matrix where no box exists are simply filled with zeros. In this way y_true and y_out can be fed into the loss function to compute the error, after which the model parameters can be updated.
1. Calculate the parameters required for loss
Computing the loss is, in fact, a comparison between y_pre and y_true:

- y_pre is the output of the network for one image; it contains the content of the three feature layers and still has to be decoded before it can be drawn on the image.
- y_true encodes, for one real image, the offset position, width, height and class of each real box on the corresponding (13,13), (26,26) and (52,52) grids; it has to be encoded so that its structure is consistent with y_pred.

In fact, y_pre and y_true share the same shapes:
(batch_size,13,13,3,85)
(batch_size,26,26,3,85)
(batch_size,52,52,3,85)
2. What is y_pre
For the yolov3 model, the final output of the network is, for each grid point of the three feature layers, the prediction boxes and their classes. That is, the three feature layers correspond to the picture divided into grids of different sizes, and each grid point carries the positions, confidences and classes of its three a priori boxes.
For the outputs y1, y2, y3:
- [..., :2] is the offset of the box center relative to the grid point,
- [..., 2:4] is the width and height,
- [..., 4:5] is the confidence of the box,
- [..., 5:] is the predicted probability of each class.
At this point y_pre has not yet been decoded; only after decoding does it describe the situation on the real image.
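The channel layout above can be sketched with numpy slicing (assuming the prediction has already been rearranged to (batch, grid, grid, 3, 85); the tensor here is a random placeholder):

```python
import numpy as np

# Placeholder prediction for one feature layer, shape (batch, 13, 13, 3, 85).
pred = np.random.randn(1, 13, 13, 3, 85).astype(np.float32)

xy_offset  = pred[..., :2]   # offset of the box center relative to the grid point
wh         = pred[..., 2:4]  # raw width/height, later combined with the anchors
confidence = pred[..., 4:5]  # objectness score of the box
class_prob = pred[..., 5:]   # per-class predicted probabilities (80 for COCO)

print(xy_offset.shape, wh.shape, confidence.shape, class_prob.shape)
```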
3. What is y_true.
y_true holds, for one real image, the offset position, width, height and class of each real box on the corresponding (13,13), (26,26) and (52,52) grids. It still needs to be encoded to be consistent with the structure of y_pred. In yolov3, a dedicated function processes the real boxes of each picture that is read in:
def preprocess_true_boxes(true_boxes, input_shape, anchors, num_classes):
Its inputs are:
- true_boxes: shape (m, T, 5), the T boxes of each of the m pictures, given as x_min, y_min, x_max, y_max, class_id.
- input_shape: the input shape, here 416, 416.
- anchors: the sizes of the 9 a priori boxes.
- num_classes: the number of classes.
In fact, processing the real boxes means converting them into xywh values relative to the grids of the picture. The steps are as follows:
- Take the true values of the boxes, obtain each box's center, width and height, and divide by input_shape so that they become proportions.
- Create an all-zero y_true; y_true is a list covering the three feature layers, with shapes (m,13,13,3,85), (m,26,26,3,85) and (m,52,52,3,85).
- For each picture, compare the wh of each real box with the wh of the 9 a priori boxes, compute the IOU values, and select the a priori box with the highest IOU; this determines which feature layer (one of the 13×13, 26×26 and 52×52 sizes) and which grid point the box belongs to. Save the content into the corresponding position of y_true.
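The steps above can be sketched as a simplified numpy version of preprocess_true_boxes (the anchor values and the single test box are hypothetical, and batching details of the reference project are omitted):

```python
import numpy as np

def preprocess_true_boxes(true_boxes, input_shape, anchors, num_classes):
    """Simplified sketch of the y_true encoding described above.

    true_boxes: (m, T, 5) of x_min, y_min, x_max, y_max, class_id in
                pixels; all-zero rows are padding.
    """
    anchor_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]  # anchors per layer
    input_shape = np.array(input_shape)              # e.g. (416, 416)
    m = true_boxes.shape[0]
    grid_shapes = [input_shape // s for s in (32, 16, 8)]  # 13, 26, 52

    # Box centers and widths/heights, normalized to [0, 1] by input_shape.
    boxes_xy = (true_boxes[..., 0:2] + true_boxes[..., 2:4]) / 2.0 / input_shape
    boxes_wh = (true_boxes[..., 2:4] - true_boxes[..., 0:2]) / input_shape

    # All-zero y_true, one array per feature layer.
    y_true = [np.zeros((m, g[0], g[1], 3, 5 + num_classes), dtype=np.float32)
              for g in grid_shapes]

    for b in range(m):
        for t in range(true_boxes.shape[1]):
            wh = boxes_wh[b, t] * input_shape  # back to pixels for the IOU
            if wh[0] == 0 or wh[1] == 0:
                continue  # padding row, no real box here
            # IOU between this box and each of the 9 anchors, centers aligned.
            inter = np.minimum(wh, anchors).prod(axis=1)
            iou = inter / (wh.prod() + anchors.prod(axis=1) - inter)
            best = int(np.argmax(iou))
            for l in range(3):
                if best in anchor_mask[l]:
                    gx = int(boxes_xy[b, t, 0] * grid_shapes[l][1])
                    gy = int(boxes_xy[b, t, 1] * grid_shapes[l][0])
                    k = anchor_mask[l].index(best)
                    c = int(true_boxes[b, t, 4])
                    y_true[l][b, gy, gx, k, 0:2] = boxes_xy[b, t]
                    y_true[l][b, gy, gx, k, 2:4] = boxes_wh[b, t]
                    y_true[l][b, gy, gx, k, 4] = 1.0       # confidence
                    y_true[l][b, gy, gx, k, 5 + c] = 1.0   # one-hot class
    return y_true

# Hypothetical usage: one picture, one 100x120-pixel box of class 5.
coco_anchors = np.array([[10, 13], [16, 30], [33, 23], [30, 61], [62, 45],
                         [59, 119], [116, 90], [156, 198], [373, 326]],
                        dtype=np.float32)
true_boxes = np.zeros((1, 20, 5), dtype=np.float32)
true_boxes[0, 0] = [100, 120, 200, 240, 5]
y_true = preprocess_true_boxes(true_boxes, (416, 416), coco_anchors, 80)
print([y.shape for y in y_true])
```

The 100×120 box best matches the (116, 90) anchor, so it lands in the 13×13 layer; the other two layers stay all zero.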
Three, predictive decoding
1. The output of the model part is first converted to (N,3,85,13,13), (N,3,85,26,26) and (N,3,85,52,52). The 85 in the dimension is 4+1+80, representing x_offset, y_offset, h and w, confidence, and the classification result, i.e. 【(x_offset, y_offset, h, w), confidence, class_prob】.
2. The decoding process of yolov3 (performed after the loss calculation, to obtain the final prediction result) adds each grid point to its corresponding x_offset and y_offset; the result of the addition is the center of the prediction box. Then the a priori box is combined with h and w to compute the width and height of the prediction box. This yields the position of the whole prediction box. However, everything is still normalized at this point and must be converted to the 416×416 scale.
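A minimal numpy sketch of this decoding for one feature layer (the decode_layer name, the sigmoid on the offsets and the placeholder anchors are assumptions following the standard yolov3 formulation, not code from the reference project):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_layer(pred, anchors, input_shape=(416, 416)):
    """Decode one feature layer.

    pred:    (N, g, g, 3, 85) raw network output.
    anchors: (3, 2) a priori box sizes in input-image pixels.
    Returns box centers and sizes in input-image (416x416) pixels.
    """
    g = pred.shape[1]
    # Grid of cell indices: grid[y, x] = (x, y).
    gx, gy = np.meshgrid(np.arange(g), np.arange(g))
    grid = np.stack([gx, gy], axis=-1).reshape(1, g, g, 1, 2)

    # Center = grid point + sigmoid(offset), normalized by the grid size.
    box_xy = (sigmoid(pred[..., 0:2]) + grid) / g
    # Width/height = anchor * exp(raw wh), normalized by the input size.
    box_wh = np.exp(pred[..., 2:4]) * anchors.reshape(1, 1, 1, 3, 2) / np.array(input_shape)

    # Scale the normalized values up to 416x416 pixels.
    return box_xy * np.array(input_shape), box_wh * np.array(input_shape)

# With an all-zero prediction, every box sits at its cell center
# (sigmoid(0) = 0.5) with exactly the anchor's width and height (exp(0) = 1).
pred = np.zeros((1, 13, 13, 3, 85), dtype=np.float32)
layer_anchors = np.array([[116, 90], [156, 198], [373, 326]], dtype=np.float32)
xy, wh = decode_layer(pred, layer_anchors)
print(xy[0, 0, 0, 0])  # → [16. 16.]  (center of the top-left 32x32 cell)
```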
3. Of course, after obtaining the final prediction results, score sorting and non-maximum suppression are still required; this part is common to basically all object detectors. After screening, the box information must also be converted back to the size of the original image (note: the original image, not 416×416). Only then is the final box information obtained, and the detection result can be drawn and annotated on the original image.
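A plain numpy sketch of the non-maximum suppression step (the nms helper and the toy boxes are hypothetical; real pipelines run this per class after thresholding the scores):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; boxes given as (x1, y1, x2, y2)."""
    order = scores.argsort()[::-1]  # indices sorted by score, highest first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # IOU of the top-scoring box against all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]  # drop heavily overlapping boxes
    return keep

# Toy example: boxes 0 and 1 overlap heavily, so the lower-scoring one is dropped.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=np.float32)
scores = np.array([0.9, 0.8, 0.7], dtype=np.float32)
print(nms(boxes, scores))  # → [0, 2]
```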
Reference project:
https://blog.csdn.net/weixin_44791964/article/details/103276106