【YOLACT】 Code Interpretation

Source code: https://github.com/dbolya/yolact

num_class is the number of categories including the background. For example, if your only category is person, num_class should be 2, because the background also counts as a class.
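As a small illustration of this counting rule, the sketch below derives num_class from a list of foreground class names. The names foreground_classes and count_num_class are hypothetical helpers for this note and do not come from the YOLACT source.

```python
def count_num_class(foreground_classes):
    """Return the class count including the implicit background class."""
    # One extra slot is reserved for the background.
    return len(foreground_classes) + 1


foreground_classes = ["person"]  # hypothetical single-class dataset
num_class = count_num_class(foreground_classes)
print(num_class)
```

With a single foreground class this gives 2, which matches the person example.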

YOLACT's network structure and output

YOLACT's backbone structure

 Here, _make_layer is the same _make_layer used in a regular ResNet-101; see the source code (https://github.com/dbolya/yolact) for the details.

 

The light-blue blocks in the diagram are the feature maps that are used later.

FPN structure of YOLACT

FPN takes outputs 1, 2, and 3 of the backbone as input and produces five new outputs, shown as orange triangles 0 to 4.
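To make the sizes concrete, here is a rough sketch that traces the side length of each feature map for a 550 x 550 input. It assumes the standard ResNet stem (7x7 stride-2 convolution followed by a 3x3 stride-2 max pool), stride-2 downsampling in backbone stages 1 to 3, and two extra stride-2 3x3 convolutions on top of the FPN. These layer settings are assumptions based on the default configuration, not values read from the figures, and the function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def backbone_sizes(input_size=550):
    """Side lengths of the four backbone stage outputs (indices 0 to 3)."""
    size = conv_out(input_size, kernel=7, stride=2, padding=3)  # stem conv
    size = conv_out(size, kernel=3, stride=2, padding=1)        # max pool
    sizes = [size]                      # stage 0 keeps the resolution
    for _ in range(3):                  # stages 1 to 3 each halve it
        size = conv_out(size, kernel=3, stride=2, padding=1)
        sizes.append(size)
    return sizes


def fpn_sizes(input_size=550, extra_levels=2):
    """Side lengths of FPN outputs 0 to 4."""
    sizes = backbone_sizes(input_size)[1:]  # FPN uses backbone outputs 1, 2, 3
    for _ in range(extra_levels):           # assumed extra stride-2 convolutions
        sizes.append(conv_out(sizes[-1], kernel=3, stride=2, padding=1))
    return sizes


print(backbone_sizes())
print(fpn_sizes())
```

Under these assumptions the backbone outputs have side lengths 138, 69, 35, 18 and the FPN outputs have side lengths 69, 35, 18, 9, 5.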

YOLACT's proto structure

Output 0 of the FPN is the input to proto, whose final output has shape (1, 138, 138, 32).

YOLACT's pred_heads structure

Here, 3 means that each position has 3 preset anchor boxes. The same operations are applied to every FPN feature map to obtain the corresponding bbox, conf, mask, and priors, which are returned along with proto.
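The anchor count of 19248 that appears in the output shapes can be checked with a few lines of arithmetic. The sketch assumes the five FPN feature maps have side lengths 69, 35, 18, 9 and 5 (the values for a 550 x 550 input in the default configuration); count_priors is a hypothetical helper name.

```python
fpn_side_lengths = [69, 35, 18, 9, 5]  # assumed sizes for a 550 x 550 input
anchors_per_position = 3               # 3 preset anchor boxes per position


def count_priors(side_lengths, anchors):
    """Total anchor boxes: every position of every level carries `anchors` boxes."""
    return sum(side * side * anchors for side in side_lengths)


num_priors = count_priors(fpn_side_lengths, anchors_per_position)
print(num_priors)  # 19248
```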

There is one more output, segm; how it is computed is shown in the following figure:

In summary, the output preds during network training contains:

'loc': predicted offsets for each anchor box, shape (1, 19248, 4)

'conf': class prediction for each anchor box, shape (1, 19248, num_class)

'mask': the mask coefficients described in the paper, shape (1, 19248, 32)

'priors': coordinates of the preset anchor boxes, shape (19248, 4)

'proto': the prototype feature maps that are combined with the mask coefficients, shape (1, 138, 138, 32)

'segm': a coarse, heat-map-like segmentation, shape (1, num_class-1, 69, 69). My guess is that segm is there to help the network converge faster.
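To show how 'proto' and 'mask' fit together, here is a minimal sketch of the mask assembly step, following the linear-combination-plus-sigmoid formula from the YOLACT paper. It uses plain Python lists and a tiny 2 x 2 x 2 stand-in instead of the real (138, 138, 32) tensor; assemble_mask is a hypothetical name, and the repository performs this step with tensor operations and additional post-processing that is not shown here.

```python
import math


def assemble_mask(proto, coeffs):
    """Combine prototypes with the coefficients of one anchor box.

    proto:  H x W x K nested list ('proto' without the batch axis)
    coeffs: K mask coefficients (one row of 'mask')
    Returns an H x W nested list with values in (0, 1).
    """
    mask = []
    for row in proto:
        out_row = []
        for pixel in row:
            # Linear combination over the K prototype channels.
            logit = sum(p * c for p, c in zip(pixel, coeffs))
            out_row.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
        mask.append(out_row)
    return mask


# Tiny stand-in: 2 x 2 pixels and K = 2 prototypes (made-up values).
proto = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 1.0], [0.0, 0.0]]]
coeffs = [2.0, -2.0]
mask = assemble_mask(proto, coeffs)
print([[round(v, 3) for v in row] for row in mask])
```

Pixels where the weighted prototypes sum to a positive value end up above 0.5 and those with a negative sum end up below it, which is how one anchor's 32 coefficients select its instance mask from the shared prototypes.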

YOLACT's loss function

The loss is called from class NetLoss(nn.Module) in train.py:

self.criterion(self.net, preds, targets, masks, num_crowds)

These parameters are:

net: the network structure described above.

preds: a dictionary whose contents are the outputs listed above ('loc', 'conf', 'mask', 'priors', 'proto', 'segm').

targets: the shape is generally (batch, n, 5), where batch is the usual batch size and n is the number of target objects in an image. Of the 5 values, the first 4 are the coordinates of the target object and the fifth is its category.

masks: the shape is generally (batch, n, 550, 550). n is not fixed, because each image in the batch contains a different number of target objects. These masks are the same as the ones used in Mask R-CNN.

num_crowds: shape (batch,). The number of crowd annotations (COCO iscrowd) in each image; it is generally 0.
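Because n differs from image to image, it helps to check that the three ground-truth inputs line up before they reach the loss. The sketch below uses plain Python lists with one entry per image, 4 x 4 masks in place of 550 x 550, and made-up coordinate values; check_batch and blank_mask are hypothetical helpers, not functions from the repository.

```python
def blank_mask(size=4):
    """Tiny all-zero stand-in for a 550 x 550 instance mask."""
    return [[0] * size for _ in range(size)]


def check_batch(targets, masks, num_crowds):
    """Validate ground-truth inputs and return the object count per image."""
    if not (len(targets) == len(masks) == len(num_crowds)):
        raise ValueError("targets, masks and num_crowds need one entry per image")
    for i, (boxes, image_masks, crowds) in enumerate(zip(targets, masks, num_crowds)):
        if len(boxes) != len(image_masks):
            raise ValueError(f"image {i}: {len(boxes)} targets but {len(image_masks)} masks")
        if any(len(box) != 5 for box in boxes):
            raise ValueError(f"image {i}: each target needs 4 coordinates and 1 category")
        if crowds < 0:
            raise ValueError(f"image {i}: num_crowds must not be negative")
    return [len(boxes) for boxes in targets]


# Batch of two images: the first holds one object, the second holds two.
targets = [
    [[0.1, 0.2, 0.5, 0.9, 0]],
    [[0.0, 0.0, 0.3, 0.4, 0], [0.5, 0.5, 1.0, 1.0, 0]],
]
masks = [[blank_mask()], [blank_mask(), blank_mask()]]
num_crowds = [0, 0]
print(check_batch(targets, masks, num_crowds))  # [1, 2]
```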

 


Origin blog.csdn.net/u013066730/article/details/104049325