One, the model part
Darknet outputs three feature layers 【y1, y2, y3】 with shapes (N,255,13,13), (N,255,26,26) and (N,255,52,52). On each grid_cell of each feature layer, three boxes are predicted. The aspect ratios of these three boxes are not chosen at random: they are taken from the widths and heights of the 9 a priori boxes obtained by clustering the dataset, so that the three feature layers together use exactly the 9 a priori boxes (3 per layer).
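The 255 channels above can be read as 3 a priori boxes × 85 values per grid cell (4 box parameters + 1 confidence + 80 classes). A minimal numpy sketch of that reshape, using a random placeholder tensor in place of the real darknet output:

```python
import numpy as np

# Placeholder for the raw 13x13 head output: 255 = 3 anchors * (4 + 1 + 80).
N = 2
y1 = np.random.randn(N, 255, 13, 13).astype(np.float32)

# Move channels last, then split them into (3 anchors, 85 values each).
y1 = y1.transpose(0, 2, 3, 1)      # (N, 13, 13, 255)
y1 = y1.reshape(N, 13, 13, 3, 85)  # (N, 13, 13, 3, 85)

print(y1.shape)  # → (2, 13, 13, 3, 85)
```

The same rearrangement applies to the 26×26 and 52×52 layers.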
Two, the training part
Whatever format the dataset uses (xml or otherwise), its main content is 【img, the box coordinates and class of every target】. This must be turned into data with the same shape as the model output, namely [(m,13,13,3,85),(m,26,26,3,85),(m,52,52,3,85)]. Because this data comes from real boxes, there are far fewer of them than prediction slots; at this point each real box has only one width and height (its true width and height), unlike the three boxes per grid_cell above. Therefore, to match the shape of the model output, the positions in the matrix where no box exists are simply filled with zeros. In this way y_true and y_out can be fed into the loss function to compute the error, after which the model parameters can be updated.
1. Calculate the parameters required for loss
Computing the loss is, in fact, a comparison between y_pre and y_true:

- y_pre is the output of the network for one image; it contains the content of the three feature layers and still has to be decoded before it can be drawn on the image.
- y_true encodes, for one real image, the offset position, width, height and class of each real box on the corresponding (13,13), (26,26) and (52,52) grids; it has to be encoded so that its structure is consistent with y_pred.

In fact, y_pre and y_true share the same shapes:
(batch_size,13,13,3,85)
(batch_size,26,26,3,85)
(batch_size,52,52,3,85)
2. What is y_pre
For the yolov3 model, the final output of the network is, for each grid point of the three feature layers, the prediction boxes and their classes. That is, the three feature layers correspond to the picture divided into grids of different sizes, and each grid point carries the positions, confidences and classes of its three a priori boxes.
For the outputs y1, y2, y3:
- [..., :2] is the offset of the box center relative to the grid point,
- [..., 2:4] is the width and height,
- [..., 4:5] is the confidence of the box,
- [..., 5:] is the predicted probability of each class.
At this point y_pre has not yet been decoded; only after decoding does it describe the situation on the real image.
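The channel layout above can be sketched with numpy slicing (assuming the prediction has already been rearranged to (batch, grid, grid, 3, 85); the tensor here is a random placeholder):

```python
import numpy as np

# Placeholder prediction for one feature layer, shape (batch, 13, 13, 3, 85).
pred = np.random.randn(1, 13, 13, 3, 85).astype(np.float32)

xy_offset  = pred[..., :2]   # offset of the box center relative to the grid point
wh         = pred[..., 2:4]  # raw width/height, later combined with the anchors
confidence = pred[..., 4:5]  # objectness score of the box
class_prob = pred[..., 5:]   # per-class predicted probabilities (80 for COCO)

print(xy_offset.shape, wh.shape, confidence.shape, class_prob.shape)
```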
3. What is y_true.
y_true holds, for one real image, the offset position, width, height and class of each real box on the corresponding (13,13), (26,26) and (52,52) grids. It still needs to be encoded to be consistent with the structure of y_pred. In yolov3, a dedicated function processes the real boxes of each picture that is read in:
def preprocess_true_boxes(true_boxes, input_shape, anchors, num_classes):
Its inputs are:
- true_boxes: shape (m, T, 5), the T boxes of each of the m pictures, given as x_min, y_min, x_max, y_max, class_id.
- input_shape: the input shape, here 416, 416.
- anchors: the sizes of the 9 a priori boxes.
- num_classes: the number of classes.
In fact, processing the real boxes means converting them into xywh values relative to the grids of the picture. The steps are as follows:
- Take the true values of the boxes, obtain each box's center, width and height, and divide by input_shape so that they become proportions.
- Create an all-zero y_true; y_true is a list covering the three feature layers, with shapes (m,13,13,3,85), (m,26,26,3,85) and (m,52,52,3,85).
- For each picture, compare the wh of each real box with the wh of the 9 a priori boxes, compute the IOU values, and select the a priori box with the highest IOU; this determines which feature layer (one of the 13×13, 26×26 and 52×52 sizes) and which grid point the box belongs to. Save the content into the corresponding position of y_true.
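The steps above can be sketched as a simplified numpy version of preprocess_true_boxes (the anchor values and the single test box are hypothetical, and batching details of the reference project are omitted):

```python
import numpy as np

def preprocess_true_boxes(true_boxes, input_shape, anchors, num_classes):
    """Simplified sketch of the y_true encoding described above.

    true_boxes: (m, T, 5) of x_min, y_min, x_max, y_max, class_id in
                pixels; all-zero rows are padding.
    """
    anchor_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]  # anchors per layer
    input_shape = np.array(input_shape)              # e.g. (416, 416)
    m = true_boxes.shape[0]
    grid_shapes = [input_shape // s for s in (32, 16, 8)]  # 13, 26, 52

    # Box centers and widths/heights, normalized to [0, 1] by input_shape.
    boxes_xy = (true_boxes[..., 0:2] + true_boxes[..., 2:4]) / 2.0 / input_shape
    boxes_wh = (true_boxes[..., 2:4] - true_boxes[..., 0:2]) / input_shape

    # All-zero y_true, one array per feature layer.
    y_true = [np.zeros((m, g[0], g[1], 3, 5 + num_classes), dtype=np.float32)
              for g in grid_shapes]

    for b in range(m):
        for t in range(true_boxes.shape[1]):
            wh = boxes_wh[b, t] * input_shape  # back to pixels for the IOU
            if wh[0] == 0 or wh[1] == 0:
                continue  # padding row, no real box here
            # IOU between this box and each of the 9 anchors, centers aligned.
            inter = np.minimum(wh, anchors).prod(axis=1)
            iou = inter / (wh.prod() + anchors.prod(axis=1) - inter)
            best = int(np.argmax(iou))
            for l in range(3):
                if best in anchor_mask[l]:
                    gx = int(boxes_xy[b, t, 0] * grid_shapes[l][1])
                    gy = int(boxes_xy[b, t, 1] * grid_shapes[l][0])
                    k = anchor_mask[l].index(best)
                    c = int(true_boxes[b, t, 4])
                    y_true[l][b, gy, gx, k, 0:2] = boxes_xy[b, t]
                    y_true[l][b, gy, gx, k, 2:4] = boxes_wh[b, t]
                    y_true[l][b, gy, gx, k, 4] = 1.0       # confidence
                    y_true[l][b, gy, gx, k, 5 + c] = 1.0   # one-hot class
    return y_true

# Hypothetical usage: one picture, one 100x120-pixel box of class 5.
coco_anchors = np.array([[10, 13], [16, 30], [33, 23], [30, 61], [62, 45],
                         [59, 119], [116, 90], [156, 198], [373, 326]],
                        dtype=np.float32)
true_boxes = np.zeros((1, 20, 5), dtype=np.float32)
true_boxes[0, 0] = [100, 120, 200, 240, 5]
y_true = preprocess_true_boxes(true_boxes, (416, 416), coco_anchors, 80)
print([y.shape for y in y_true])
```

The 100×120 box best matches the (116, 90) anchor, so it lands in the 13×13 layer; the other two layers stay all zero.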
Three, predictive decoding
1. The output of the model part is first converted to (N,3,85,13,13), (N,3,85,26,26) and (N,3,85,52,52). The 85 in the dimension is 4+1+80, representing x_offset, y_offset, h and w, confidence, and the classification result, i.e. 【(x_offset, y_offset, h, w), confidence, class_prob】.
2. The decoding process of yolov3 (performed after the loss calculation, to obtain the final prediction result) adds each grid point to its corresponding x_offset and y_offset; the result of the addition is the center of the prediction box. Then the a priori box is combined with h and w to compute the width and height of the prediction box. This yields the position of the whole prediction box. However, everything is still normalized at this point and must be converted to the 416×416 scale.
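A minimal numpy sketch of this decoding for one feature layer (the decode_layer name, the sigmoid on the offsets and the placeholder anchors are assumptions following the standard yolov3 formulation, not code from the reference project):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_layer(pred, anchors, input_shape=(416, 416)):
    """Decode one feature layer.

    pred:    (N, g, g, 3, 85) raw network output.
    anchors: (3, 2) a priori box sizes in input-image pixels.
    Returns box centers and sizes in input-image (416x416) pixels.
    """
    g = pred.shape[1]
    # Grid of cell indices: grid[y, x] = (x, y).
    gx, gy = np.meshgrid(np.arange(g), np.arange(g))
    grid = np.stack([gx, gy], axis=-1).reshape(1, g, g, 1, 2)

    # Center = grid point + sigmoid(offset), normalized by the grid size.
    box_xy = (sigmoid(pred[..., 0:2]) + grid) / g
    # Width/height = anchor * exp(raw wh), normalized by the input size.
    box_wh = np.exp(pred[..., 2:4]) * anchors.reshape(1, 1, 1, 3, 2) / np.array(input_shape)

    # Scale the normalized values up to 416x416 pixels.
    return box_xy * np.array(input_shape), box_wh * np.array(input_shape)

# With an all-zero prediction, every box sits at its cell center
# (sigmoid(0) = 0.5) with exactly the anchor's width and height (exp(0) = 1).
pred = np.zeros((1, 13, 13, 3, 85), dtype=np.float32)
layer_anchors = np.array([[116, 90], [156, 198], [373, 326]], dtype=np.float32)
xy, wh = decode_layer(pred, layer_anchors)
print(xy[0, 0, 0, 0])  # → [16. 16.]  (center of the top-left 32x32 cell)
```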
3. Of course, after obtaining the final prediction results, score sorting and non-maximum suppression are still required; this part is common to basically all object detectors. After screening, the box information must also be converted back to the size of the original image (note: the original image, not 416×416). Only then is the final box information obtained, and the detection result can be drawn and annotated on the original image.
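A plain numpy sketch of the non-maximum suppression step (the nms helper and the toy boxes are hypothetical; real pipelines run this per class after thresholding the scores):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; boxes given as (x1, y1, x2, y2)."""
    order = scores.argsort()[::-1]  # indices sorted by score, highest first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # IOU of the top-scoring box against all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]  # drop heavily overlapping boxes
    return keep

# Toy example: boxes 0 and 1 overlap heavily, so the lower-scoring one is dropped.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=np.float32)
scores = np.array([0.9, 0.8, 0.7], dtype=np.float32)
print(nms(boxes, scores))  # → [0, 2]
```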
Reference project:
https://blog.csdn.net/weixin_44791964/article/details/103276106