[YOLOv6 series 1] Deep analysis of the network architecture

While YOLOv5 had long dominated the computer-vision field, Meituan open-sourced YOLOv6 in June and claimed it surpasses other models of the same scale in both accuracy and speed. I took a quick look and found the repository had already passed 2.8k stars, and a phrase inexplicably popped into my head: "the light of domestic products".

There are many YOLOv6 write-ups online; Meituan's official interpretation and the open-source GitHub link are attached at the end of this article. This post opens the YOLOv6 series: it first walks through the entire YOLOv6 network architecture (based on the tag 0.1 version of YOLOv6s), and later posts will share my understanding of each individual module.

overall framework

[Figure: overall YOLOv6s network architecture]

The figure above is the overall network architecture of YOLOv6s. As it shows, the YOLOv6 network consists of four parts: input, backbone, neck, and head. Each part plays the same role as in YOLOv5: for example, the backbone extracts features and the head makes predictions.

Walking through the network along the architecture diagram: the input image is first preprocessed and aligned into a 640×640 RGB image, then fed into the backbone network. The three feature maps output by the backbone pass through the Rep-PAN network in the neck layer, which again outputs three feature maps of different sizes (hereinafter fm). These are fed into the final head layer, which predicts the three detection tasks (classification, foreground/background classification, and bounding-box regression) and outputs the final result.
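Before diving into each part, here is a minimal PyTorch-style sketch of that data flow, just to fix the wiring in mind. The class and attribute names are my own illustrations, not the official meituan/YOLOv6 API:

```python
import torch
from torch import nn

class YOLOv6sSketch(nn.Module):
    """Illustrative wiring only: input -> backbone -> Rep-PAN neck -> head."""

    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # emits 3 feature maps (fm) at 3 scales
        self.neck = neck          # Rep-PAN: fuses the 3 fm, re-emits 3 fm
        self.head = head          # predicts class / objectness / box per fm

    def forward(self, x: torch.Tensor):
        # x: preprocessed RGB batch of shape (N, 3, 640, 640)
        feats = self.backbone(x)   # tuple of 3 fm
        feats = self.neck(*feats)  # tuple of 3 fused fm
        return self.head(feats)    # raw predictions, before post-processing
```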

backbone

[Figure: YOLOv6s backbone structure]
The backbone of YOLOv6s borrows from the backbone of the RepVGG network [3], as shown above (s means stride, o means out-channels, i means in-channels; o=i means the out-channels equal the in-channels, while o≠i only means the out-channels are not tied to the in-channels, not that the two values are necessarily unequal). It consists of several RepVGG blocks (abbreviated RVB below) and RepBlocks (abbreviated RB).
[Figure: RVB structure during training and deployment]
RVB has different structures during training and deployment. During training, a 1×1 convolution branch is added alongside the 3×3 convolution; if the input and output channels and the h and w sizes all match, an identity BN branch is added as well. The three branches are summed to produce the output. At deployment, for convenience, the branches are folded into a single equivalent 3×3 convolution on the main branch.
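A minimal training-time sketch of an RVB under those rules (a simplification of the RepVGG idea, not the official implementation):

```python
import torch
from torch import nn
import torch.nn.functional as F

class RepVGGBlock(nn.Module):
    """Training-time RVB sketch: 3x3 conv branch + 1x1 conv branch, plus an
    identity BN branch when input/output channels and h/w all match; the
    branches are summed, then activated."""

    def __init__(self, c_in: int, c_out: int, stride: int = 1):
        super().__init__()
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride, padding=1, bias=False),
            nn.BatchNorm2d(c_out))
        self.branch1x1 = nn.Sequential(
            nn.Conv2d(c_in, c_out, 1, stride, padding=0, bias=False),
            nn.BatchNorm2d(c_out))
        # identity branch only exists when the block changes nothing in shape
        self.identity = (nn.BatchNorm2d(c_in)
                         if c_in == c_out and stride == 1 else None)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.branch3x3(x) + self.branch1x1(x)
        if self.identity is not None:
            y = y + self.identity(x)
        return F.relu(y)
```

At deploy time the three branches can be algebraically folded into one 3×3 convolution, which is why the deployed block is a plain 3×3 conv.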
[Figure: RB structure]
RB is a series of several RVBs: the first RVB changes the dimensions of the feature map, and the following N RVBs fuse features while keeping the size unchanged.
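Continuing the sketch above, an RB could look like this (the repeat count n is a configuration knob and my own assumption here):

```python
from torch import nn

class RepBlock(nn.Module):
    """RB sketch: a leading RVB that may change the fm dimensions, followed by
    n stride-1 RVBs that fuse features without changing the size."""

    def __init__(self, c_in: int, c_out: int, n: int = 1):
        super().__init__()
        self.first = RepVGGBlock(c_in, c_out)  # defined in the sketch above
        self.rest = nn.Sequential(
            *[RepVGGBlock(c_out, c_out) for _ in range(n)])

    def forward(self, x):
        return self.rest(self.first(x))
```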
[Figure: stem structure]
The stem is an RVB with s=2 whose input and output channels differ, so its identity branch is dropped and the stem's RVB reduces to:
[Figure: stem RVB structure]
In addition, an SPPF layer is added in ERBlock5. SConv is composed of Conv + BN + ReLU:
[Figure: SConv structure]
The SPPF network looks like this:
[Figure: SPPF structure]
The input first passes through an SConv layer: the h and w of the feature map stay unchanged and the out-channels become half of the in-channels; this output is kept as one branch. It then passes through three max-pooling layers in sequence, each with kernel=5, s=1, padding=kernel//2, so the fm size stays the same after each pooling, and each pooled output is kept as a branch. The four branches are then concatenated along the channel dimension, giving an fm whose h and w match the SPPF input but whose channels are twice the input's. Finally, another SConv layer halves the channels, so the input and output fm sizes of SPPF are identical.
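A sketch of SConv and of the SPPF flow just described (the channel arithmetic follows the text; treat it as an illustration rather than the official module):

```python
import torch
from torch import nn

class SConv(nn.Module):
    """Conv + BN + ReLU, as described above."""

    def __init__(self, c_in: int, c_out: int, k: int = 1, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """SPPF sketch: SConv halves the channels, three chained 5x5 max-pools
    each keep the fm size, the four branches are concatenated on channels
    (giving 2x the input channels), and a final SConv halves them again."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        c_mid = c_in // 2
        self.cv1 = SConv(c_in, c_mid, 1, 1)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=5 // 2)
        self.cv2 = SConv(4 * c_mid, c_out, 1, 1)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```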

The whole backbone flow: a 640×640×3 image enters the stem (s=2), which outputs 320×320×32; this is followed by several ERBlocks, each of which downsamples the feature map and enlarges the channels. Each ERBlock consists of one RVB followed by one RB (ERBlock5 adds an SPPF layer at the end): the downsampling and channel increase happen in the RVB, and the features are fully fused in the RB before output. Finally, the backbone outputs three fm (80×80×128, 40×40×256, 20×20×512), consistent with the concatenation arithmetic in the neck below.
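Assembled from the sketches above, the backbone can be written roughly as follows. The per-stage RVB+RB layout and the channel widths come from the walkthrough; the RB repeat counts n are assumptions (YOLOv6 scales them with a depth multiplier):

```python
from torch import nn

class BackboneSketch(nn.Module):
    """EfficientRep-style backbone sketch for YOLOv6s: stem + ERBlock2..5,
    each ERBlock = downsampling RVB + fusing RB (ERBlock5 also appends SPPF)."""

    def __init__(self):
        super().__init__()
        self.stem = RepVGGBlock(3, 32, stride=2)                             # 640 -> 320
        self.er2 = nn.Sequential(RepVGGBlock(32, 64, 2), RepBlock(64, 64, n=2))     # -> 160
        self.er3 = nn.Sequential(RepVGGBlock(64, 128, 2), RepBlock(128, 128, n=4))  # -> 80
        self.er4 = nn.Sequential(RepVGGBlock(128, 256, 2), RepBlock(256, 256, n=6)) # -> 40
        self.er5 = nn.Sequential(RepVGGBlock(256, 512, 2), RepBlock(512, 512, n=2),
                                 SPPF(512, 512))                                    # -> 20

    def forward(self, x):
        x = self.er2(self.stem(x))
        p3 = self.er3(x)    # 80 x 80 x 128
        p4 = self.er4(p3)   # 40 x 40 x 256
        p5 = self.er5(p4)   # 20 x 20 x 512
        return p3, p4, p5
```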

neck

[Figure: Rep-PAN neck structure]
Meituan officially calls the neck layer Rep-PAN, a topology based on PAN. As the figure shows, it resembles a "U" shape: on the left side of the U, the h and w of the fm grow from top to bottom; on the right side, they shrink from bottom to top. Upsample is implemented with PyTorch's official transposed convolution:
[Figure: transposed-convolution upsampling]
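For instance, a 2×2 transposed convolution with stride 2 exactly doubles h and w (the 128-channel figure below is only for illustration):

```python
import torch
from torch import nn

# transposed convolution as a learnable 2x upsampler
up = nn.ConvTranspose2d(in_channels=128, out_channels=128,
                        kernel_size=2, stride=2)
x = torch.randn(1, 128, 20, 20)
print(up(x).shape)  # torch.Size([1, 128, 40, 40])
```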
The flow of the whole neck layer: on the left side of the U, ERB5 outputs a 20×20×512 fm, which an SConv turns into 20×20×128. After upsampling, h and w double; concatenating with the output of ERB4 on the channel dimension gives a 40×40×384 fm, and an RB (s=1, o≠i) outputs 40×40×128. Repeating these steps yields an 80×80×64 fm. On the right side of the U, the 80×80×64 fm is first downsampled by an SConv to 40×40×64, concatenated on the channel dimension with the left-side fm of matching h and w, and passed through an RB (s=1, o≠i) to produce the second output fm; repeating these steps on the right side produces the third output fm. So far, the neck layer outputs three fm (20×20×256, 40×40×128, 80×80×64).
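The whole walkthrough condenses into a Rep-PAN sketch like the one below (built on the SConv/RepBlock sketches above; which left-side tensors feed the right-side concatenations, and the RB repeat counts, are my reading of the diagram):

```python
import torch
from torch import nn

class RepPANSketch(nn.Module):
    """Rep-PAN sketch following the shape walkthrough above."""

    def __init__(self):
        super().__init__()
        # left (top-down) side of the "U"
        self.reduce5 = SConv(512, 128, 1, 1)
        self.up5 = nn.ConvTranspose2d(128, 128, 2, 2)
        self.rb_p4 = RepBlock(128 + 256, 128, n=4)   # 40x40x384 -> 40x40x128
        self.reduce4 = SConv(128, 64, 1, 1)
        self.up4 = nn.ConvTranspose2d(64, 64, 2, 2)
        self.rb_p3 = RepBlock(64 + 128, 64, n=4)     # 80x80x192 -> 80x80x64
        # right (bottom-up) side of the "U"
        self.down3 = SConv(64, 64, 3, 2)             # 80 -> 40
        self.rb_n4 = RepBlock(64 + 64, 128, n=4)
        self.down4 = SConv(128, 128, 3, 2)           # 40 -> 20
        self.rb_n5 = RepBlock(128 + 128, 256, n=4)

    def forward(self, p3, p4, p5):
        t5 = self.reduce5(p5)                                        # 20x20x128
        f4 = self.rb_p4(torch.cat([self.up5(t5), p4], dim=1))       # 40x40x128
        t4 = self.reduce4(f4)                                        # 40x40x64
        out3 = self.rb_p3(torch.cat([self.up4(t4), p3], dim=1))     # 80x80x64
        out4 = self.rb_n4(torch.cat([self.down3(out3), t4], dim=1)) # 40x40x128
        out5 = self.rb_n5(torch.cat([self.down4(out4), t5], dim=1)) # 20x20x256
        return out3, out4, out5
```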

head

[Figure: decoupled head structure]
As shown in the figure above, the head makes predictions on the three-layer outputs, which correspond to receptive fields from large to small.

Among them, BConv is composed of Conv + BN + SiLU:
[Figure: BConv structure]
The whole head borrows from the decoupled-head design in YOLOX and improves on it. The head flow: the neck outputs three branches; for each branch, the fm first passes through a BConv layer for feature fusion, then splits into two branches. One branch completes the classification task through BConv+Conv; the other branch first fuses features through BConv and then splits again into two branches, one completing bounding-box regression through Conv and the other completing foreground/background classification through Conv. Finally, the three prediction branches are fused by concatenation on the channel dimension and output as the prediction result, without post-processing.
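A per-scale sketch of that decoupled head (BConv as described above; num_classes and the 4-value box output are assumptions for illustration):

```python
import torch
from torch import nn

class BConv(nn.Module):
    """Conv + BN + SiLU, as described above."""

    def __init__(self, c_in: int, c_out: int, k: int = 1, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class DecoupledHeadSketch(nn.Module):
    """One head branch: a BConv stem, then a classification branch
    (BConv + Conv) and a regression branch (BConv, then separate Convs for
    the box and the foreground/background score); outputs are concatenated
    on the channel dimension without post-processing."""

    def __init__(self, c_in: int, num_classes: int = 80):
        super().__init__()
        self.stem = BConv(c_in, c_in, 1, 1)
        self.cls_conv = BConv(c_in, c_in, 3, 1)
        self.cls_pred = nn.Conv2d(c_in, num_classes, 1)
        self.reg_conv = BConv(c_in, c_in, 3, 1)
        self.box_pred = nn.Conv2d(c_in, 4, 1)  # bounding-box regression
        self.obj_pred = nn.Conv2d(c_in, 1, 1)  # foreground/background score

    def forward(self, x):
        x = self.stem(x)
        cls = self.cls_pred(self.cls_conv(x))
        reg = self.reg_conv(x)
        return torch.cat([self.box_pred(reg), self.obj_pred(reg), cls], dim=1)
```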

epilogue

The above is my personal understanding of the overall YOLOv6s network architecture on the 0.1 release; if anything here is off, feel free to reach out. Follow-up posts will cover the detailed principles and code of each YOLOv6 module. I hope this helps.
References:
[1] https://mp.weixin.qq.com/s/RrQCP4pTSwpTmSgvly9evg (Meituan official interpretation)
[2] https://github.com/meituan/YOLOv6 (Meituan official code)
[3] https://zhuanlan.zhihu.com/p/353697121
