YOLO series innovation point collection

1. ACON activation function

Ma, Ningning, et al. “Activate or not: Learning customized activation.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

Paper address:

https://arxiv.org/pdf/2009.04759.pdf.

Introduction to the paper

For a long time ReLU has been the default activation function for neural networks, mainly because of desirable properties such as non-saturation and sparsity, but it also suffers from the serious "dying ReLU" problem, where neurons can become permanently inactive. In recent years the Swish activation function, found with NAS (neural architecture search), has been shown to work very well; the problem is that Swish was discovered by brute-force NAS search, so there is no real explanation of why it works so well.

In this paper, the authors dig a smooth-approximation principle out of the formulas of the Swish and ReLU activation functions, apply this principle to the Maxout family of activation functions, and propose a new family of activation functions: the ACON family. Extensive experiments show that the ACON activations outperform ReLU and Swish on tasks such as classification and detection.

ACON family

The authors propose a novel interpretation of Swish: the Swish function is a smooth approximation (smooth maximum) of the ReLU function. Based on this observation, they further analyze the general form of ReLU, the Maxout family of activation functions, and use the smooth maximum to extend the Maxout family, obtaining the simple and effective ACON family of activation functions: ACON-A, ACON-B, and ACON-C.
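
For reference, the key formulas as I read them from the paper (my notation; σ is the sigmoid function):

```latex
% Smooth maximum (softmax-weighted) of n values; beta controls how "hard" the maximum is:
S_\beta(x_1,\dots,x_n) = \frac{\sum_{i=1}^{n} x_i\, e^{\beta x_i}}{\sum_{i=1}^{n} e^{\beta x_i}}

% For n = 2 inputs \eta_a(x), \eta_b(x) this reduces to
S_\beta\big(\eta_a(x),\eta_b(x)\big)
  = \big(\eta_a(x)-\eta_b(x)\big)\,\sigma\!\big(\beta(\eta_a(x)-\eta_b(x))\big) + \eta_b(x)

% which yields the ACON family:
\text{ACON-A: } f(x) = x\,\sigma(\beta x) \quad (\text{i.e. Swish, from } \eta_a = x,\ \eta_b = 0)
\text{ACON-B: } f(x) = (1-p)\,x\,\sigma\!\big(\beta(1-p)x\big) + p\,x
\text{ACON-C: } f(x) = (p_1-p_2)\,x\,\sigma\!\big(\beta(p_1-p_2)x\big) + p_2\,x
```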

The authors also propose meta-ACON, which adaptively learns the linearity/non-linearity of the activation, controlling the degree of non-linearity of each layer of the network and significantly improving performance. They further show that the ACON parameters p1 and p2 control the upper and lower bounds of the function (which matters a great deal for the final accuracy), while the parameter β dynamically controls how linear or non-linear the activation is.

Properties of the ACON activation function:

ACON-A (the Swish function) is a smooth approximation (smooth maximum) of the ReLU function.

The upper and lower bounds of the first derivative of ACON-C are jointly determined by the two parameters p1 and p2; learning them yields an activation function with better performance.

The parameter β dynamically controls the linearity/non-linearity of the activation function. This customized activation behavior helps to improve generalization and transfer performance.

In meta-ACON, the parameter β is produced by a small convolutional network and passed through a Sigmoid, so it is learned from the input itself.
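
A minimal PyTorch sketch of ACON-C and meta-ACON following the formulas above (module and variable names are mine, not the official implementation):

```python
import torch
import torch.nn as nn


class AconC(nn.Module):
    """ACON-C: (p1 - p2) * x * sigmoid(beta * (p1 - p2) * x) + p2 * x,
    with per-channel learnable p1, p2 and beta."""

    def __init__(self, channels):
        super().__init__()
        self.p1 = nn.Parameter(torch.randn(1, channels, 1, 1))
        self.p2 = nn.Parameter(torch.randn(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, x):
        dpx = (self.p1 - self.p2) * x
        return dpx * torch.sigmoid(self.beta * dpx) + self.p2 * x


class MetaAconC(nn.Module):
    """meta-ACON: beta is generated from the input by a small conv network
    (global average pooling -> 1x1 conv -> 1x1 conv -> sigmoid)."""

    def __init__(self, channels, r=16):
        super().__init__()
        hidden = max(r, channels // r)
        self.p1 = nn.Parameter(torch.randn(1, channels, 1, 1))
        self.p2 = nn.Parameter(torch.randn(1, channels, 1, 1))
        self.fc1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.fc2 = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x):
        # adaptive, per-channel beta learned through the sigmoid
        beta = torch.sigmoid(self.fc2(self.fc1(x.mean(dim=(2, 3), keepdim=True))))
        dpx = (self.p1 - self.p2) * x
        return dpx * torch.sigmoid(beta * dpx) + self.p2 * x


if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    print(AconC(64)(x).shape, MetaAconC(64)(x).shape)
```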

2. Introducing a Transformer (BoTNet)

Bottleneck Transformers for Visual Recognition

Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani

The backbone feature-extraction network of YOLOv5 is a CNN. CNNs have translation invariance and locality, but lack the ability to model global context and long-range dependencies. The Transformer, a framework from natural language processing, is therefore introduced to form a CNN+Transformer architecture that exploits the advantages of both and improves detection, with a particular benefit for small objects and dense-prediction tasks.

Principle:

BoTNet is a simple but powerful backbone that incorporates self-attention into a variety of computer vision tasks, including image classification, object detection, and instance segmentation. By replacing the spatial convolutions with global self-attention only in the last three bottleneck blocks of ResNet, and making no other changes, it significantly improves the object-detection baseline while also reducing the number of parameters, with minimal overhead in latency.

Differences between the MHSA in the Transformer and the MHSA in BoTNet:

Normalization: the Transformer uses Layer Normalization, while BoTNet uses Batch Normalization.

Non-linear activations: the Transformer uses only one non-linearity, in the FFN block, while BoTNet uses three non-linearities.

Output projection: the MHSA block in the Transformer contains an output projection; the MHSA in BoTNet does not.

Optimizer: the Transformer is trained with the Adam optimizer, while BoTNet uses SGD with momentum.
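
To make the idea concrete, here is a hedged PyTorch sketch of a BoT-style bottleneck in which the 3x3 spatial convolution is replaced by global multi-head self-attention over the feature map. It omits BoTNet's 2D relative position encodings, and all names are my own:

```python
import torch
import torch.nn as nn


class BottleneckMHSA(nn.Module):
    """Bottleneck block where the 3x3 conv is replaced by multi-head
    self-attention over the H*W positions (a simplified BoT-style block;
    BoTNet additionally uses 2D relative position encodings)."""

    def __init__(self, in_ch, mid_ch, num_heads=4):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_ch)
        self.attn = nn.MultiheadAttention(mid_ch, num_heads, batch_first=True)
        self.bn2 = nn.BatchNorm2d(mid_ch)
        self.conv3 = nn.Conv2d(mid_ch, in_ch, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(in_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = self.act(self.bn1(self.conv1(x)))
        b, c, h, w = out.shape
        seq = out.flatten(2).transpose(1, 2)      # (B, H*W, C)
        seq, _ = self.attn(seq, seq, seq)         # global self-attention
        out = seq.transpose(1, 2).reshape(b, c, h, w)
        out = self.act(self.bn2(out))
        out = self.bn3(self.conv3(out))
        return self.act(out + identity)           # residual connection


if __name__ == "__main__":
    x = torch.randn(1, 256, 20, 20)   # a low-resolution backbone feature map
    print(BottleneckMHSA(256, 64)(x).shape)
```

Because self-attention is quadratic in the number of positions, such a block is normally applied only to the low-resolution feature maps at the end of the backbone, as BoTNet does with the last three ResNet bottlenecks.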

3. BiFPN feature fusion

Tan, Mingxing, Ruoming Pang, and Quoc V. Le. “EfficientDet: Scalable and efficient object detection.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

Introduction to the paper

This paper systematically studies neural-network architecture design choices for object detection and proposes several key optimizations to improve efficiency.

First, a weighted Bidirectional Feature Pyramid Network (BiFPN) is proposed, which enables simple and fast multi-scale feature fusion.

Second, a Compound Scaling method is proposed, which uniformly scales the resolution, depth, and width of the backbone network, the feature network, and the box/class prediction networks at the same time.

Based on these optimization measures and the EfficientNet backbone, a new family of object detectors called EfficientDet is developed.

Bidirectional Weighted Feature Pyramid (BiFPN)

For multi-scale fusion, when combining different input features, previous work (FPN and various improvements to FPN) mostly just adds the features together without distinction; however, because these input features have different resolutions, we observe that their contributions to the fused output feature are often unequal.

To address this problem, the authors propose a simple and efficient weighted (attention-like) bidirectional feature pyramid network (BiFPN), which introduces learnable weights to learn the importance of the different input features while repeatedly applying top-down and bottom-up multi-scale feature fusion:
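
A minimal sketch of BiFPN's "fast normalized fusion" as I understand it from the paper: each input feature gets a learnable non-negative weight, normalized by the sum of all weights. The module name is mine, and the input features are assumed to have already been resized to a common resolution:

```python
import torch
import torch.nn as nn


class FastNormalizedFusion(nn.Module):
    """Weighted fusion of n same-shaped feature maps:
    out = sum_i (w_i / (eps + sum_j w_j)) * x_i, with w_i kept >= 0 via ReLU."""

    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, features):
        w = torch.relu(self.weights)       # non-negative learnable weights
        w = w / (w.sum() + self.eps)       # normalize so they roughly sum to 1
        return sum(wi * fi for wi, fi in zip(w, features))


if __name__ == "__main__":
    # e.g. fusing a top-down feature, the lateral input and an upsampled feature
    p4_td, p4_in, p5_up = [torch.randn(1, 64, 40, 40) for _ in range(3)]
    fuse = FastNormalizedFusion(num_inputs=3)
    print(fuse([p4_td, p4_in, p5_up]).shape)
```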

4. Improving non-maximum suppression (NMS) with Soft-NMS

By default, YOLOv5 uses the NMS algorithm, which filters candidate boxes mainly by IoU. NMS works iteratively: it repeatedly takes the highest-scoring box, computes its IoU with the other boxes, and filters out those with a large IoU (i.e. a large overlap). Drawbacks of NMS: 1) The biggest problem with NMS is that it forces the scores of neighboring detection boxes to zero (i.e. it removes every detection box whose overlap exceeds the threshold Nt); in this case, if a real object appears in the overlap region, its detection fails and the average detection rate of the algorithm drops. 2) The NMS threshold is also hard to choose: setting it too low deletes correct boxes by mistake, while setting it too high tends to increase false detections. Soft-NMS is adopted as an improvement.

Principle:

NMS is somewhat crude, because it simply deletes every box whose IoU exceeds the threshold. Soft-NMS learns from this: during execution it does not simply delete detection boxes whose IoU is above the threshold, but lowers their scores instead. The procedure is the same as NMS, except that a function is applied to the original confidence scores with the goal of decaying them. 1) Soft-NMS can be plugged into object detection algorithms very easily: the original model does not need to be retrained, the code is easy to implement, and the extra computation is negligible compared with the whole detection pipeline, so it can be integrated into any detector that currently uses NMS. 2) Soft-NMS still uses traditional NMS during training and is only applied in the inference code. 3) NMS is a special case of Soft-NMS: when the score-rescoring function is a binary (hard) function, Soft-NMS and NMS are identical; Soft-NMS is therefore a more general non-maximum suppression algorithm.
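
A minimal NumPy sketch of Soft-NMS with the Gaussian score-decay variant (the paper also describes a linear decay that multiplies scores by 1 - IoU). Function and parameter names are mine:

```python
import numpy as np


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: instead of deleting boxes that overlap the current
    top box, decay their scores by exp(-IoU^2 / sigma).
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) array."""
    boxes, scores = boxes.astype(float).copy(), scores.astype(float).copy()
    keep = []
    idxs = np.arange(len(scores))
    while len(idxs) > 0:
        top = idxs[np.argmax(scores[idxs])]        # highest-scoring remaining box
        keep.append(top)
        idxs = idxs[idxs != top]
        if len(idxs) == 0:
            break
        # IoU between the selected box and the remaining boxes
        x1 = np.maximum(boxes[top, 0], boxes[idxs, 0])
        y1 = np.maximum(boxes[top, 1], boxes[idxs, 1])
        x2 = np.minimum(boxes[top, 2], boxes[idxs, 2])
        y2 = np.minimum(boxes[top, 3], boxes[idxs, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_top = (boxes[top, 2] - boxes[top, 0]) * (boxes[top, 3] - boxes[top, 1])
        area_rest = (boxes[idxs, 2] - boxes[idxs, 0]) * (boxes[idxs, 3] - boxes[idxs, 1])
        iou = inter / (area_top + area_rest - inter)
        # decay scores instead of removing boxes, then drop very low scores
        scores[idxs] *= np.exp(-(iou ** 2) / sigma)
        idxs = idxs[scores[idxs] > score_thresh]
    return keep


if __name__ == "__main__":
    boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]])
    scores = np.array([0.9, 0.8, 0.7])
    print(soft_nms(boxes, scores))
```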

5. Improving anchor-box K-Means clustering with K-Means++

6. Combining the EIoU and Alpha-IoU loss functions

Zhang, Yi-Fan, et al. “Focal and efficient IOU loss for accurate bounding box regression.” arXiv preprint arXiv:2101.08158 (2021).

Paper address:

CIoU loss builds on DIoU loss by adding a term v that measures the aspect-ratio consistency between the predicted box and the GT box. To some extent this speeds up the regression of the predicted box, but serious problems remain:

During box regression, once the width-height aspect ratios of the predicted box and the GT box are in a linear proportion, the relative-aspect-ratio penalty term added in CIoU no longer takes effect.

From the gradient formulas of the predicted w and h it can be deduced that when one of them increases the other must decrease; the two cannot increase or decrease together.

To solve this problem, EIoU proposes a loss function that directly penalizes the predicted values of w and h:
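
Written out as I understand it from the paper, where b and b^gt are the centers of the predicted and GT boxes, ρ(·,·) is the Euclidean distance, c is the diagonal length of the smallest box enclosing both, and C_w, C_h are that enclosing box's width and height:

```latex
L_{EIoU} = L_{IoU} + L_{dis} + L_{asp}
         = 1 - IoU
           + \frac{\rho^2(\mathbf{b}, \mathbf{b}^{gt})}{c^2}
           + \frac{\rho^2(w, w^{gt})}{C_w^2}
           + \frac{\rho^2(h, h^{gt})}{C_h^2}
```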

The figure below compares the iterative regression of the predicted box under the GIoU, CIoU, and EIoU losses: the red and green boxes show the regression process of the predicted box, the blue box is the ground-truth box, and the black box is the preset anchor box:

The problem with GIoU is that it uses the area of the smallest enclosing rectangle minus the area of the union as the penalty term, which causes GIoU to take a detour: it first enlarges the union area and only then optimizes the IoU.

The problem with CIoU is that the width and height cannot increase or decrease at the same time, whereas with EIoU they can.

Alpha-IoU

He, Jiabo, et al. “Alpha-IoU: A Family of Power Intersection over Union Losses for Bounding Box Regression.” Advances in Neural Information Processing Systems 34 (2021).

Paper address:

Because the IoU loss is invariant to bounding-box scale, it can train better detectors, so object detection commonly uses an IoU-based loss to compute the localization (box regression) loss of predicted boxes (YOLOv5 uses the CIoU loss).

The Alpha-IoU loss proposed in this paper is a unified power generalization of existing IoU losses: every IoU loss is raised to a power α, and when α equals 1 each loss reduces to its original form.
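
In formula form (my notation, and a hedged reading of the paper), the basic power IoU loss and the general pattern for penalty-based losses such as GIoU/DIoU/CIoU are:

```latex
L_{\alpha\text{-}IoU} = \frac{1 - IoU^{\alpha}}{\alpha}, \qquad \alpha > 0

% penalty-based IoU losses are powered term by term, e.g.
L_{\alpha\text{-}IoU} = 1 - IoU^{\alpha} + \mathcal{R}^{\alpha}(B, B^{gt})
```

With α = 1 these reduce to the standard IoU losses; choosing α > 1 puts more weight on high-IoU objects during regression.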

Origin blog.csdn.net/weixin_45303602/article/details/128878918