Paper translation: "YOLOv4: Optimal Speed and Accuracy of Object Detection"

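YOLOv4 has been out for quite a while now. Today I am reading through it mainly to see what it improves and what is new compared with YOLOv3, and mostly to learn from it. Without further ado, let's begin.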

Abstract
There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a real-time speed of ~65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet.
1. Introduction
The majority of CNN-based object detectors are largely applicable only for recommendation systems. For example, searching for free parking spaces via urban video cameras is executed by slow accurate models, whereas car collision warning is related to fast inaccurate models. Improving the real-time object detector accuracy enables using them not only for hint generating recommendation systems, but also for stand-alone process management and human input reduction. Real-time object detector operation on conventional Graphics Processing Units (GPU) allows their mass usage at an affordable price. The most accurate modern neural networks do not operate in real time and require a large number of GPUs for training with a large mini-batch size. We address such problems by creating a CNN that operates in real time on a conventional GPU, and for which training requires only one conventional GPU.

The main goal of this work is designing a fast operating speed of an object detector in production systems and optimization for parallel computations, rather than the low computation volume theoretical indicator (BFLOP). We hope that the designed object can be easily trained and used. For example, anyone who uses a conventional GPU to train and test can achieve real-time, high quality, and convincing object detection results, as the YOLOv4 results shown in Figure 1. Our contributions are summarized as follows:
1. We develop an efficient and powerful object detection model. It enables everyone to use a 1080 Ti or 2080 Ti GPU to train a super fast and accurate object detector.
2. We verify the influence of state-of-the-art Bag-of-Freebies and Bag-of-Specials methods of object detection during the detector training.
3. We modify state-of-the-art methods and make them more efficient and suitable for single GPU training, including CBN [89], PAN [49], SAM [85], etc.
2. Related work
2.1. Object detection models
A modern detector is usually composed of two parts, a backbone which is pre-trained on ImageNet and a head which is used to predict classes and bounding boxes of objects. For those detectors running on GPU platform, their backbone could be VGG [68], ResNet [26], ResNeXt [86], or DenseNet [30]. For those detectors running on CPU platform, their backbone could be SqueezeNet [31], MobileNet [28, 66, 27, 74], or ShuffleNet [97, 53]. As to the head part, it is usually categorized into two kinds, i.e., one-stage object detector and two-stage object detector. The most representative two-stage object detector is the R-CNN [19] series, including fast R-CNN [18], faster R-CNN [64], R-FCN [9], and Libra R-CNN [58]. It is also possible to make a two-stage object detector an anchor-free object detector, such as RepPoints [87]. As for one-stage object detector, the most representative models are YOLO [61, 62, 63], SSD [50], and RetinaNet [45]. In recent years, anchor-free one-stage object detectors have been developed. The detectors of this sort are CenterNet [13], CornerNet [37, 38], FCOS [78], etc.

Object detectors developed in recent years often insert some layers between backbone and head, and these layers are usually used to collect feature maps from different stages. We can call it the neck of an object detector. Usually, a neck is composed of several bottom-up paths and several top-down paths. Networks equipped with this mechanism include Feature Pyramid Network (FPN) [44], Path Aggregation Network (PAN) [49], BiFPN [77], and NAS-FPN [17]. In addition to the above models, some researchers put their emphasis on directly building a new backbone (DetNet [43], DetNAS [7]) or a new whole model (SpineNet [12], HitDetector [20]) for object detection.

To sum up, an ordinary object detector is composed of several parts:
• Input: Image, Patches, Image Pyramid
• Backbones: VGG16 [68], ResNet-50 [26], SpineNet [12], EfficientNet-B0/B7 [75], CSPResNeXt50 [81], CSPDarknet53 [81]
• Neck:
  • Additional blocks: SPP [25], ASPP [5], RFB [47], SAM [85]
  • Path-aggregation blocks: FPN [44], PAN [49], NAS-FPN [17], Fully-connected FPN, BiFPN [77], ASFF [48], SFAM [98]
• Heads:
  • Dense Prediction (one-stage):
    ◦ RPN [64], SSD [50], YOLO [61], RetinaNet [45] (anchor based)
    ◦ CornerNet [37], CenterNet [13], MatrixNet [60], FCOS [78] (anchor free)
  • Sparse Prediction (two-stage):
    ◦ Faster R-CNN [64], R-FCN [9], Mask R-CNN [23] (anchor based)
    ◦ RepPoints [87] (anchor free)
2.2. Bag of freebies
Usually, a conventional object detector is trained offline. Therefore, researchers always like to take this advantage and develop better training methods which can make the object detector receive better accuracy without increasing the inference cost. We call these methods that only change the training strategy or only increase the training cost "bag of freebies". What is often adopted by object detection methods and meets the definition of bag of freebies is data augmentation. The purpose of data augmentation is to increase the variability of the input images, so that the designed object detection model has higher robustness to the images obtained from different environments. For example, photometric distortions and geometric distortions are two commonly used data augmentation methods and they definitely benefit the object detection task. In dealing with photometric distortion, we adjust the brightness, contrast, hue, saturation, and noise of an image. For geometric distortion, we add random scaling, cropping, flipping, and rotating.

The data augmentation methods mentioned above are all pixel-wise adjustments, and all original pixel information in the adjusted area is retained. In addition, some researchers engaged in data augmentation put their emphasis on simulating object occlusion issues. They have achieved good results in image classification and object detection. For example, random erase [100] and CutOut [11] can randomly select the rectangle region in an image and fill in a random or complementary value of zero. As for hide-and-seek [69] and grid mask [6], they randomly or evenly select multiple rectangle regions in an image and replace them with all zeros. If similar concepts are applied to feature maps, there are DropOut [71], DropConnect [80], and DropBlock [16] methods. In addition, some researchers have proposed the methods of using multiple images together to perform data augmentation.
For example, MixUp [92] uses two images to multiply and superimpose with different coefficient ratios, and then adjusts the label with these superimposed ratios. As for CutMix [91], it is to cover the cropped image to rectangle region of other images, and adjusts the label according to the size of the mix area. In addition to the above mentioned methods, style transfer GAN [15] is also used for data augmentation, and such usage can effectively reduce the texture bias learned by CNN.

Different from the various approaches proposed above, some other bag of freebies methods are dedicated to solving the problem that the semantic distribution in the dataset may have bias. In dealing with the problem of semantic distribution bias, a very important issue is that there is a problem of data imbalance between different classes, and this problem is often solved by hard negative example mining [72] or online hard example mining [67] in two-stage object detector. But the example mining method is not applicable to one-stage object detector, because this kind of detector belongs to the dense prediction architecture. Therefore Lin et al. [45] proposed focal loss to deal with the problem of data imbalance existing between various classes. Another very important issue is that it is difficult to express the relationship of the degree of association between different categories with the one-hot hard representation. This representation scheme is often used when executing labeling. The label smoothing proposed in [73] is to convert hard label into soft label for training, which can make model more robust. In order to obtain a better soft label, Islam et al. [33] introduced the concept of knowledge distillation to design the label refinement network.

The last bag of freebies is the objective function of Bounding Box (BBox) regression.
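The MixUp, label smoothing, and focal loss ideas described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: images are flat lists of pixel values, the function names are hypothetical, and the defaults `eps=0.1`, `gamma=2.0`, and `alpha=0.25` are common choices rather than values given in this text.

```python
import math


def mixup(image_a, image_b, label_a, label_b, lam):
    """Blend two flattened images and their label vectors with ratio lam."""
    image = [lam * a + (1.0 - lam) * b for a, b in zip(image_a, image_b)]
    label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return image, label


def smooth_labels(one_hot, eps=0.1):
    """Turn a hard one-hot label into a soft label.

    Every class receives eps / K and the true class keeps the remaining mass.
    """
    num_classes = len(one_hot)
    return [y * (1.0 - eps) + eps / num_classes for y in one_hot]


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p is the predicted probability of the positive class, y is 0 or 1.
    The factor (1 - p_t) ** gamma down-weights easy, well-classified examples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` and `alpha=0.5` the focal loss reduces to a scaled cross-entropy, which is a convenient sanity check.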
The traditional object detector usually uses Mean Square Error (MSE) to directly perform regression on the center point coordinates and height and width of the BBox, i.e., {x_center, y_center, w, h}, or the upper left point and the lower right point, i.e., {x_top_left, y_top_left, x_bottom_right, y_bottom_right}. As for anchor-based method, it is to estimate the corresponding offset, for example {x_center_offset, y_center_offset, w_offset, h_offset} and {x_top_left_offset, y_top_left_offset, x_bottom_right_offset, y_bottom_right_offset}. However, to directly estimate the coordinate values of each point of the BBox is to treat these points as independent variables, but in fact does not consider the integrity of the object itself. In order to make this issue processed better, some researchers recently proposed IoU loss [90], which puts the coverage of predicted BBox area and ground truth BBox area into consideration. The IoU loss computing process will trigger the calculation of the four coordinate points of the BBox by executing IoU with the ground truth, and then connecting the generated results into a whole code. Because IoU is a scale invariant representation, it can solve the problem that when traditional methods calculate the l1 or l2 loss of {x, y, w, h}, the loss will increase with the scale. Recently, some researchers have continued to improve IoU loss. For example, GIoU loss [65] is to include the shape and orientation of object in addition to the coverage area. They proposed to find the smallest area BBox that can simultaneously cover the predicted BBox and ground truth BBox, and use this BBox as the denominator to replace the denominator originally used in IoU loss.
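As a rough illustration of the IoU and GIoU quantities discussed above, the sketch below computes both for axis-aligned boxes. It assumes boxes are given as (x1, y1, x2, y2) tuples, and it follows the commonly used GIoU definition, in which the area of the smallest enclosing box appears as the denominator of a penalty term subtracted from IoU; the function names are hypothetical.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a, b):
    """Area of the overlap between two boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def iou(a, b):
    """Intersection over union; unchanged when both boxes are scaled together."""
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0.0 else 0.0


def giou(a, b):
    """Generalized IoU: IoU minus the share of the enclosing box not covered."""
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    if union <= 0.0:
        return 0.0
    enclose_w = max(a[2], b[2]) - min(a[0], b[0])
    enclose_h = max(a[3], b[3]) - min(a[1], b[1])
    enclose = enclose_w * enclose_h
    return inter / union - (enclose - union) / enclose
```

The corresponding losses are usually taken as `1 - iou(a, b)` and `1 - giou(a, b)`. Unlike IoU, GIoU still varies for boxes that do not overlap at all, so it keeps providing a training signal in that case.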
As for DIoU loss [99], it additionally considers the distance of the center of an object, and CIoU loss [99], on the other hand, simultaneously considers the overlapping area, the distance between center points, and the aspect ratio. CIoU can achieve better convergence speed and accuracy on the BBox regression problem.
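To make the difference between the DIoU and CIoU terms concrete, here is a minimal sketch under the standard definitions of those losses: DIoU subtracts the squared center distance normalized by the squared diagonal of the enclosing box, and CIoU additionally subtracts an aspect-ratio term. Boxes are assumed to be (x1, y1, x2, y2) tuples with positive width and height, and the function names are hypothetical.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, w) * max(0.0, h)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0.0 else 0.0


def diou(a, b):
    """IoU minus the normalized squared distance between box centers."""
    center_dist2 = ((a[0] + a[2]) / 2.0 - (b[0] + b[2]) / 2.0) ** 2 \
        + ((a[1] + a[3]) / 2.0 - (b[1] + b[3]) / 2.0) ** 2
    enclose_w = max(a[2], b[2]) - min(a[0], b[0])
    enclose_h = max(a[3], b[3]) - min(a[1], b[1])
    diag2 = enclose_w ** 2 + enclose_h ** 2
    if diag2 == 0.0:
        return iou(a, b)
    return iou(a, b) - center_dist2 / diag2


def ciou(pred, gt):
    """DIoU minus a weighted aspect-ratio consistency term."""
    w_p, h_p = pred[2] - pred[0], pred[3] - pred[1]
    w_g, h_g = gt[2] - gt[0], gt[3] - gt[1]
    v = (4.0 / math.pi ** 2) * (math.atan(w_g / h_g) - math.atan(w_p / h_p)) ** 2
    denom = (1.0 - iou(pred, gt)) + v
    alpha = v / denom if denom > 0.0 else 0.0
    return diou(pred, gt) - alpha * v


def ciou_loss(pred, gt):
    """Regression loss: zero only when the two boxes coincide."""
    return 1.0 - ciou(pred, gt)
```

When the two boxes share the same aspect ratio the extra term vanishes and CIoU equals DIoU.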
2.3. Bag of specials
For those plugin modules and post-processing methods that only increase the inference cost by a small amount but can significantly improve the accuracy of object detection, we call them "bag of specials". Generally speaking, these plugin modules are for enhancing certain attributes in a model, such as enlarging receptive field, introducing attention mechanism, or strengthening feature integration capability, etc., and post-processing is a method for screening model prediction results.

Common modules that can be used to enhance receptive field are SPP [25], ASPP [5], and RFB [47]. The SPP module originated from Spatial Pyramid Matching (SPM) [39], and SPM's original method was to split feature map into several d × d equal blocks, where d can be {1, 2, 3, ...}, thus forming spatial pyramid, and then extracting bag-of-word features. SPP integrates SPM into CNN and uses max-pooling operation instead of bag-of-word operation. Since the SPP module proposed by He et al. [25] will output one dimensional feature vector, it is infeasible to be applied in Fully Convolutional Network (FCN). Thus in the design of YOLOv3 [63], Redmon and Farhadi improve SPP module to the concatenation of max-pooling outputs with kernel size k × k, where k = {1, 5, 9, 13}, and stride equals to 1. Under this design, a relatively large k × k max-pooling effectively increases the receptive field of backbone feature. After adding the improved version of SPP module, YOLOv3-608 upgrades AP50 by 2.7% on the MS COCO object detection task at the cost of 0.5% extra computation. The difference in operation between ASPP [5] module and improved SPP module is mainly from the original k × k kernel size, max-pooling of stride equals to 1 to several 3 × 3 kernel size, dilated ratio equals to k, and stride equals to 1 in dilated convolution operation.
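The improved SPP module described above (stride-1 max-pooling with k = {1, 5, 9, 13}, outputs concatenated along the channel axis) can be sketched in plain Python as follows. This is an unoptimized illustration: feature maps are assumed to be lists of channels, each a list of rows, kernel sizes are assumed odd, and border handling ignores out-of-bounds positions, which is equivalent to padding with negative infinity.

```python
def max_pool_same(channel, k):
    """k x k max-pooling with stride 1 that keeps the spatial size.

    Positions outside the map are skipped, which matches -infinity padding.
    """
    h, w = len(channel), len(channel[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [channel[y][x]
                      for y in range(max(0, i - r), min(h, i + r + 1))
                      for x in range(max(0, j - r), min(w, j + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def spp(feature_map, kernel_sizes=(1, 5, 9, 13)):
    """Concatenate the pooled copies of every channel along the channel axis."""
    pooled = []
    for k in kernel_sizes:
        for channel in feature_map:
            pooled.append(max_pool_same(channel, k))
    return pooled
```

Because the stride is 1, the output keeps the input height and width and only the channel count grows (four times with the default kernel sizes), so the block remains usable in a fully convolutional network. The k = 1 branch is simply the identity.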
RFB module is to use several dilated convolutions of k × k kernel, dilated ratio equals to k, and stride equals to 1 to obtain a more comprehensive spatial coverage than ASPP. RFB [47] only costs 7% extra inference time to increase the AP50 of SSD on MS COCO by 5.7%.

The attention module that is often used in object detection is mainly divided into channel-wise attention and point-wise attention, and the representatives of these two attention models are Squeeze-and-Excitation (SE) [29] and Spatial Attention Module (SAM) [85], respectively. Although SE module can improve the power of ResNet50 in the ImageNet image classification task by 1% top-1 accuracy at the cost of only increasing the computational effort by 2%, on a GPU it will usually increase the inference time by about 10%, so it is more appropriate to be used in mobile devices. But for SAM, it only needs to pay 0.1% extra calculation and it can improve ResNet50-SE by 0.5% top-1 accuracy on the ImageNet image classification task. Best of all, it does not affect the speed of inference on the GPU at all.

In terms of feature integration, the early practice is to use skip connection [51] or hyper-column [22] to integrate low-level physical feature to high-level semantic feature. Since multi-scale prediction methods such as FPN have become popular, many lightweight modules that integrate different feature pyramid have been proposed. The modules of this sort include SFAM [98], ASFF [48], and BiFPN [77]. The main idea of SFAM is to use SE module to execute channel-wise level re-weighting on multi-scale concatenated feature maps. As for ASFF, it uses softmax as point-wise level re-weighting and then adds feature maps of different scales. In BiFPN, the multi-input weighted residual connections is proposed to execute scale-wise level re-weighting, and then add feature maps of different scales.
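The point-wise softmax re-weighting attributed to ASFF above can be illustrated with a small sketch. It is not the ASFF implementation: it assumes the feature maps from different scales have already been resized to a common H × W shape, treats each map as a single channel, and takes the per-scale logit maps as given inputs, whereas in a real network they would be predicted by learned layers. The function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scales(feature_maps, logit_maps):
    """Fuse same-sized maps with weights that are a softmax across scales.

    feature_maps and logit_maps are lists (one entry per scale) of H x W maps.
    At every position the weights over the scales sum to 1.
    """
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            weights = softmax([logits[i][j] for logits in logit_maps])
            fused[i][j] = sum(wt * fm[i][j]
                              for wt, fm in zip(weights, feature_maps))
    return fused
```

With equal logits the result is a plain average of the scales; a dominant logit makes the fused value follow that scale at that position.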
In the research of deep learning, some people put their focus on searching for good activation function. A good activation function can make the gradient more efficiently propagated, and at the same time it will not cause too much extra computational cost. In 2010, Nair and Hinton [56] proposed ReLU to substantially solve the gradient vanish problem which is frequently encountered in traditional tanh and sigmoid activation function. Subsequently, LReLU [54], PReLU [24], ReLU6 [28], Scaled Exponential Linear Unit (SELU) [35], Swish [59], hard-Swish [27], and Mish [55], etc., which are also used to solve the gradient vanish problem, have been proposed. The main purpose of LReLU and PReLU is to solve the problem that the gradient of ReLU is zero when the output is less than zero. As for ReLU6 and hard-Swish, they are specially designed for quantization networks. For self-normalizing a neural network, the SELU activation function is proposed to satisfy the goal. One thing to be noted is that both Swish and Mish are continuously differentiable activation functions.

The post-processing method commonly used in deep learning-based object detection is NMS, which can be used to filter those BBoxes that badly predict the same object, and only retain the candidate BBoxes with higher response. The way NMS tries to improve is consistent with the method of optimizing an objective function. The original method proposed by NMS does not consider the context information, so Girshick et al. [19] added classification confidence score in R-CNN as a reference, and according to the order of confidence score, greedy NMS was performed in the order of high score to low score. As for soft NMS [1], it considers the problem that the occlusion of an object may cause the degradation of confidence score in greedy NMS with IoU score.
The DIoU NMS [99] developers' way of thinking is to add the information of the center point distance to the BBox screening process on the basis of soft NMS. It is worth mentioning that, since none of above post-processing methods directly refer to the captured image features, post-processing is no longer required in the subsequent development of an anchor-free method.
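The greedy NMS procedure described above, and the way center distance can enter the screening step, can be sketched as follows. This is a simplified illustration: the DIoU variant here uses hard suppression and simply swaps IoU for DIoU as the overlap criterion, which is an assumption of this sketch rather than a detail given in the text. Boxes are (x1, y1, x2, y2) tuples and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, w) * max(0.0, h)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0.0 else 0.0


def diou(a, b):
    """IoU minus the normalized squared distance between box centers."""
    center_dist2 = ((a[0] + a[2]) / 2.0 - (b[0] + b[2]) / 2.0) ** 2 \
        + ((a[1] + a[3]) / 2.0 - (b[1] + b[3]) / 2.0) ** 2
    diag2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
        + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou(a, b) - (center_dist2 / diag2 if diag2 > 0.0 else 0.0)


def nms(boxes, scores, threshold, overlap=iou):
    """Greedy NMS: visit boxes from high to low score and keep a box only if
    its overlap with every already kept box does not exceed the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(overlap(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep
```

Calling `nms(boxes, scores, 0.3)` gives the IoU-based result, while `nms(boxes, scores, 0.3, overlap=diou)` is more lenient toward overlapping boxes whose centers are far apart, which is the situation that arises with occluded objects.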
3. Methodology
The basic aim is fast operating speed of neural network, in production systems and optimization for parallel computations, rather than the low computation volume theoretical indicator (BFLOP). We present two options of real-time neural networks:
• For GPU we use a small number of groups (1 - 8) in convolutional layers: CSPResNeXt50 / CSPDarknet53
• For VPU we use grouped-convolution, but we refrain from using Squeeze-and-Excitation (SE) blocks. Specifically this includes the following models: EfficientNet-lite / MixNet [76] / GhostNet [21] / MobileNetV3
3.1. Selection of architecture
Our objective is to find the optimal balance among the input network resolution, the convolutional layer number, the parameter number (filter_size^2 * filters * channel / groups), and the number of layer outputs (filters). For instance, our numerous studies demonstrate that the CSPResNext50 is considerably better compared to CSPDarknet53 in terms of object classification on the ILSVRC2012 (ImageNet) dataset [10]. However, conversely, the CSPDarknet53 is better compared to CSPResNext50 in terms of detecting objects on the MS COCO dataset [46].

The next objective is to select additional blocks for increasing the receptive field and the best method of parameter aggregation from different backbone levels for different detector levels: e.g. FPN, PAN, ASFF, BiFPN.

A reference model which is optimal for classification is not always optimal for a detector. In contrast to the classifier, the detector requires the following:
• Higher input network size (resolution) - for detecting multiple small-sized objects
• More layers - for a higher receptive field to cover the increased size of input network
• More parameters - for greater capacity of a model to detect multiple objects of different sizes in a single image

Hypothetically speaking, we can assume that a model with a larger receptive field size (with a larger number of convolutional layers 3 × 3) and a larger number of parameters should be selected as the backbone. Table 1 shows the information of CSPResNeXt50, CSPDarknet53, and EfficientNet B3. The CSPResNext50 contains only 16 convolutional layers 3 × 3, a 425 × 425 receptive field and 20.6 M parameters, while CSPDarknet53 contains 29 convolutional layers 3 × 3, a 725 × 725 receptive field and 27.6 M parameters. This theoretical justification, together with our numerous experiments, shows that CSPDarknet53 neural network is the optimal model of the two as the backbone for a detector.
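The parameter-count expression quoted above (filter_size^2 * filters * channel / groups) and the notion of a receptive field that grows with the number of 3 × 3 layers can be made concrete with a small sketch. The receptive-field recurrence used here is the standard one for a plain chain of convolutions; it is an illustration only and is not claimed to reproduce the Table 1 figures, which depend on the full network layout. Biases are ignored, as in the quoted expression, and the function names are hypothetical.

```python
def conv_params(filter_size, filters, channels, groups=1):
    """Weights in one convolutional layer: size^2 * filters * channels / groups."""
    if channels % groups or filters % groups:
        raise ValueError("channels and filters must be divisible by groups")
    return filter_size ** 2 * filters * channels // groups


def receptive_field(layers):
    """Receptive field of a chain of (kernel_size, stride) convolutions.

    Each layer widens the field by (kernel_size - 1) input-pixel steps, where
    the step size is the product of the strides of all earlier layers.
    """
    field, jump = 1, 1
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field
```

For example, a 3 × 3 layer with 64 filters over 32 input channels has 18432 weights, and splitting it into 8 groups divides that by 8. Three stacked 3 × 3 stride-1 layers see a 7 × 7 input window.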
The influence of the receptive field with different sizes is summarized as follows:
• Up to the object size - allows viewing the entire object
• Up to network size - allows viewing the context around the object
• Exceeding the network size - increases the number of connections between the image point and the final activation

We add the SPP block over the CSPDarknet53, since it significantly increases the receptive field, separates out the most significant context features and causes almost no reduction of the network operation speed. We use PANet as the method of parameter aggregation from different backbone levels for different detector levels, instead of the FPN used in YOLOv3.

Finally, we choose CSPDarknet53 backbone, SPP additional module, PANet path-aggregation neck, and YOLOv3 (anchor based) head as the architecture of YOLOv4.

In the future we plan to expand significantly the content of Bag of Freebies (BoF) for the detector, which theoretically can address some problems and increase the detector accuracy, and sequentially check the influence of each feature in an experimental fashion.

We do not use Cross-GPU Batch Normalization (CGBN or SyncBN) or expensive specialized devices. This allows anyone to reproduce our state-of-the-art outcomes on a conventional graphic processor, e.g. GTX 1080Ti or RTX 2080Ti.
3.2. Selection of BoF and BoS
For improving the object detection training, a CNN usually uses the following:
• Activations: ReLU, leaky-ReLU, parametric-ReLU, ReLU6, SELU, Swish, or Mish
• Bounding box regression loss: MSE, IoU, GIoU, CIoU, DIoU
• Data augmentation: CutOut, MixUp, CutMix
• Regularization method: DropOut, DropPath [36], Spatial DropOut [79], or DropBlock
• Normalization of the network activations by their mean and variance: Batch Normalization (BN) [32], Cross-GPU Batch Normalization (CGBN or SyncBN) [93], Filter Response Normalization (FRN) [70], or Cross-Iteration Batch Normalization (CBN) [89]
• Skip-connections: Residual connections, Weighted residual connections, Multi-input weighted residual connections, or Cross stage partial connections (CSP)

As for training activation function, since PReLU and SELU are more difficult to train, and ReLU6 is specifically designed for quantization network, we therefore remove the above activation functions from the candidate list. In the method of regularization, the people who published DropBlock have compared their method with other methods in detail, and their regularization method came out clearly ahead. Therefore, we did not hesitate to choose DropBlock as our regularization method. As for the selection of normalization method, since we focus on a training strategy that uses only one GPU, syncBN is not considered.
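For reference, the remaining activation candidates listed above can be written as scalar functions. This is a minimal sketch using the usual textbook definitions (Swish as x · sigmoid(x), Mish as x · tanh(softplus(x))); the leaky-ReLU slope of 0.1 is an illustrative assumption, not a value stated in this text.

```python
import math


def relu(x):
    """Zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def leaky_relu(x, slope=0.1):
    """Like ReLU, but keeps a small non-zero gradient for negative inputs."""
    return x if x >= 0.0 else slope * x


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softplus(x):
    """log(1 + exp(x)), written to avoid overflow for large x."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def swish(x):
    """x * sigmoid(x): smooth and non-monotonic near zero."""
    return x * sigmoid(x)


def mish(x):
    """x * tanh(softplus(x)): smooth, behaves like the identity for large x."""
    return x * math.tanh(softplus(x))
```

Both Swish and Mish are smooth everywhere, in contrast to ReLU and leaky-ReLU, which have a kink at zero.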
3.3. Additional improvements
To make the designed detector more suitable for training on a single GPU, we made the following additional design changes and improvements:
• We introduce a new data augmentation method, Mosaic, and Self-Adversarial Training (SAT)
• We select optimal hyper-parameters by applying genetic algorithms
• We modify some existing methods to make our design suitable for efficient training and detection: modified SAM, modified PAN, and Cross mini-Batch Normalization (CmBN)
Mosaic is a new data augmentation method that mixes 4 training images. Thus 4 different contexts are mixed, while CutMix mixes only 2 input images. This allows detection of objects outside their normal context. In addition, batch normalization calculates activation statistics from 4 different images on each layer. This significantly reduces the need for a large mini-batch size.
Self-Adversarial Training (SAT) is also a new data augmentation technique; it operates in 2 forward-backward stages. In the 1st stage the neural network alters the original image instead of the network weights. In this way the neural network executes an adversarial attack on itself, altering the original image to create the deception that there is no desired object in the image. In the 2nd stage, the neural network is trained to detect an object in this modified image in the normal way.
CmBN is a modified version of CBN, as shown in Figure 4, defined as Cross mini-Batch Normalization (CmBN). It collects statistics only between mini-batches within a single batch.
We modify SAM from spatial-wise attention to point-wise attention, and replace the shortcut connection of PAN with concatenation, as shown in Figure 5 and Figure 6, respectively.
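The idea behind Mosaic, stitching 4 training images into one, can be sketched in a few lines of plain Python. This is a deliberately simplified illustration and not the Darknet implementation: images are assumed to be 2D lists of pixel values that are at least as large as the output, each image is cropped rather than rescaled, and bounding-box remapping is omitted. The function name `mosaic` and its parameters are hypothetical.

```python
import random


def mosaic(images, out_h, out_w, rng=None):
    """Stitch 4 images into one canvas around a random split point.

    Assumptions (not from the paper): each image is a 2D list with at
    least out_h rows and out_w columns, and each quadrant is filled by
    copying the pixels at the same coordinates from its source image.
    """
    if len(images) != 4:
        raise ValueError("mosaic needs exactly 4 images")
    if rng is None:
        rng = random.Random()

    # Random split point, kept off the border so every quadrant is non-empty.
    cy = rng.randint(1, out_h - 1)
    cx = rng.randint(1, out_w - 1)

    canvas = [[0] * out_w for _ in range(out_h)]
    # (row_start, row_end, col_start, col_end) for the four quadrants:
    # top-left, top-right, bottom-left, bottom-right.
    regions = [
        (0, cy, 0, cx),
        (0, cy, cx, out_w),
        (cy, out_h, 0, cx),
        (cy, out_h, cx, out_w),
    ]
    for img, (r0, r1, c0, c1) in zip(images, regions):
        for r in range(r0, r1):
            for c in range(c0, c1):
                canvas[r][c] = img[r][c]
    return canvas, (cy, cx)


if __name__ == "__main__":
    # Four constant "images" so the quadrants are easy to see when printed.
    imgs = [[[k] * 6 for _ in range(6)] for k in (1, 2, 3, 4)]
    result, split = mosaic(imgs, 6, 6, rng=random.Random(0))
    for row in result:
        print(row)
```

Because every output image contains content from 4 sources, the statistics that batch normalization sees per image already cover 4 contexts, which is the reason the text gives for needing a smaller mini-batch.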
3.4. YOLOv4
In this section, we elaborate the details of YOLOv4.
YOLOv4 consists of:
• Backbone: CSPDarknet53 [81]
• Neck: SPP [25], PAN [49]
• Head: YOLOv3 [63]
YOLOv4 uses:
• Bag of Freebies (BoF) for backbone: CutMix and Mosaic data augmentation, DropBlock regularization, Class label smoothing
• Bag of Specials (BoS) for backbone: Mish activation, Cross-stage partial connections (CSP), Multi-input weighted residual connections (MiWRC)
• Bag of Freebies (BoF) for detector: CIoU-loss, CmBN, DropBlock regularization, Mosaic data augmentation, Self-Adversarial Training, Eliminate grid sensitivity, Using multiple anchors for a single ground truth, Cosine annealing scheduler [52], Optimal hyper-parameters, Random training shapes
• Bag of Specials (BoS) for detector: Mish activation, SPP-block, SAM-block, PAN path-aggregation block, DIoU-NMS
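The list above only names the CIoU-loss, so the following sketch may help show what it computes. It follows the commonly cited definition of Complete IoU: plain IoU, minus a normalized center-distance term, minus an aspect-ratio consistency term. The box format `(x1, y1, x2, y2)`, the function names, and the `eps` stabilizer are assumptions made for this illustration and are not taken from the Darknet source.

```python
import math


def ciou(box_a, box_b, eps=1e-9):
    """Complete IoU between two boxes given as (x1, y1, x2, y2).

    Assumes x1 < x2 and y1 < y2 for both boxes.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    aw, ah = ax2 - ax1, ay2 - ay1
    bw, bh = bx2 - bx1, by2 - by1

    # Plain IoU.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    iou = inter / (union + eps)

    # Squared distance between the two box centers.
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (
        math.atan(bw / (bh + eps)) - math.atan(aw / (ah + eps))
    ) ** 2
    alpha = v / (1 - iou + v + eps)

    return iou - rho2 / c2 - alpha * v


def ciou_loss(pred, target):
    """Loss form: 0 for a perfect match, larger for worse predictions."""
    return 1.0 - ciou(pred, target)


if __name__ == "__main__":
    print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))
    print(ciou_loss((0, 0, 2, 2), (1, 1, 3, 3)))
    print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))
```

Unlike a plain IoU loss, this still changes with the distance between the boxes when they do not overlap at all, because the center-distance term does not depend on the intersection.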
4. Experiments
Table 2: Influence of BoF and Mish on the CSPResNeXt-50 classifier accuracy.
Table 3: Influence of BoF and Mish on the CSPDarknet-53 classifier accuracy.
Table 4: Ablation Studies of Bag-of-Freebies. (CSPResNeXt50-PANet-SPP, 512x512).
Table 5: Ablation Studies of Bag-of-Specials. (Size 512x512).
Table 6: Using different classifier pre-trained weightings for detector training (all other training parameters are similar in all models).
Table 7: Using different mini-batch size for detector training.
Figure 8: Comparison of the speed and accuracy of different object detectors. (Some articles stated the FPS of their detectors for only one of the GPUs: Maxwell/Pascal/Volta)
Table 8: Comparison of the speed and accuracy of different object detectors on the MS COCO dataset (test-dev 2017). (Real-time detectors with FPS 30 or higher are highlighted here. We compare the results with batch=1 without using TensorRT.)
