A history of YOLO's origins in AI object detection

Introduction: Object detection is widely used in urban security, autonomous driving, remote sensing interpretation and other fields. After nearly ten years of development, deep learning based object detection methods have become quite mature. In recent years, the single-stage YOLO family of architectures has stood out among the many methods and become the mainstream. This article briefly summarizes the overall development of AI object detection and traces the history of YOLO.


0 Overview

The main task of object detection is to determine the specific location and category of the objects of interest in a picture or video and to return each object's bounding box and attributes, answering the questions "What is the target?" and "Where is the target?". Deep learning object detection algorithms use convolutional neural networks to extract target features automatically. According to the detection process, they can be divided into two-stage algorithms and single-stage algorithms. Coarse to fine: first extract candidate regions that may contain targets from the image, then perform regression within those candidate regions to obtain the detection results. This process can be summarized as the "two-stage algorithm". Done in one step: obtain the specific location and category of the target directly through regression. This process can be summarized as the "single-stage algorithm". The two types of algorithm compete with and benefit from each other, together driving the development of object detection.

1 Early stage of chaos

Traditional object detection methods had basically reached saturation by around 2010: because they rely on manually designed features or prior knowledge, they have low robustness and cannot generalize to multiple types of targets. In 2012, Hinton's team proposed the AlexNet deep learning model, a milestone success in computer vision that marked the advent of the deep learning era. OverFeat followed in 2013. Because it integrates localization and detection tasks, it is considered a pioneer of single-stage object detection.

2 A hundred schools of thought contend

In 2014, Girshick proposed R-CNN, which achieved a qualitative leap in the object detection task. From then on, deep learning based detection algorithms began to emerge in natural image recognition, and single-stage and two-stage detection began a "love-hate entanglement" that would last for several years.

Figure 1 Development history of object detection

R-CNN achieved a significant performance improvement on VOC2007, with mean average precision (mAP) reaching 58.5%, nearly 25 percentage points higher than traditional methods. However, the method is extremely time-consuming and memory-hungry: it takes 84 hours to train on 5,000 images, generates hundreds of gigabytes of feature files, and needs nearly 50 seconds to detect one image.

In the same year, to simplify R-CNN's complex training process and increase detection speed, Kaiming He et al. proposed the Spatial Pyramid Pooling Network (SPPNet). This method effectively reduces the redundancy caused by repeated computation. Its detection speed is more than 20 times that of R-CNN, and its mAP can reach 66%. However, problems remained: training only fine-tunes the fully connected layers and leaves the other feature layers untouched.

In 2015, Girshick proposed the Fast R-CNN detector, which combines the advantages of R-CNN and SPPNet and allows the object classifier and the bounding box regressor to be trained simultaneously in the same network configuration. This significantly speeds up both training and testing and further improves detection accuracy. Training time is reduced to 9.5 hours, mAP reaches 66.9%, and detecting one picture takes about 0.3 seconds.

Shortly afterwards, Ren et al. proposed Faster R-CNN, which replaced the slow selective search method with an efficient region proposal network (RPN). This overcame the speed bottleneck of Fast R-CNN, improved detection accuracy and computing efficiency, and achieved end-to-end object detection. The mAP reached 69.9% on the VOC2007 data set. A fast variant has an mAP of 59.9% at 17 FPS, making it the first near-real-time deep learning object detector. At this point, the basic architecture of the two-stage detector had been settled.

Figure 2 Faster R-CNN network structure

The wheel of history rolls forward, and people were not satisfied with the status quo of "near real-time". In 2016, Joseph Redmon et al. proposed YOLO, the first true single-stage object detector of the deep learning era. The method completely abandons the two-stage "region proposal + regression" detection paradigm. It scales the input image to a uniform size and divides it into a grid, predicts the target category from the grid cell in which the target center falls, and outputs the detection result at the last layer of the network. In other words, a single convolutional neural network performs feature extraction, candidate box regression and target classification, which saves a great deal of computation. The speed reached 45 FPS, so with the computing power of the time YOLO achieved genuinely real-time detection. However, its mAP is only 63.4%, a loss of accuracy compared with two-stage detectors.
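The grid assignment described above can be illustrated with a minimal sketch. The function name `responsible_cell` and the default values (a 448 x 448 input and a 7 x 7 grid) are assumptions chosen for illustration, not details stated in this article.

```python
def responsible_cell(cx, cy, img_w, img_h, input_size=448, grid=7):
    """Return (row, col, x_off, y_off) for the grid cell containing a box center.

    cx, cy       : box center in original image pixels
    input_size   : side of the square the image is resized to (assumed value)
    grid         : number of cells per side (assumed value)
    x_off, y_off : center position inside the cell, in cell units (0 to 1)
    """
    # Map the center from the original image onto the resized square input.
    sx = cx * input_size / img_w
    sy = cy * input_size / img_h
    cell = input_size / grid
    # Clamp so that a center lying exactly on the right/bottom edge stays in range.
    col = min(int(sx // cell), grid - 1)
    row = min(int(sy // cell), grid - 1)
    return row, col, sx / cell - col, sy / cell - row


# A target centered in a 640 x 480 image falls in the middle cell of the grid.
print(responsible_cell(320, 240, 640, 480))
```

Only the cell returned here would be asked to predict that target, which is why a single forward pass suffices.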

A few months later, Liu et al. proposed SSD, the second single-stage detector after YOLO. Compared with YOLO, SSD predicts from multi-scale feature maps to improve detection of targets at various scales. Borrowing the anchor box design of the RPN in Faster R-CNN, it introduces the concept of prior boxes, so that the category and location of a target can be determined simultaneously. SSD combines the high accuracy of Faster R-CNN with the high speed of YOLO: the detection speed is 46 FPS and the mAP reaches 74.3%.

At the end of 2016, the single-stage family gained another capable member: YOLOv2. To address the low recall of v1, it adopts Darknet-19 as the backbone network and makes many improvements to the original YOLO method. First, a batch normalization layer is added after each convolutional layer, which keeps the input distribution of each layer relatively stable, reduces the model's sensitivity to network parameters, simplifies parameter tuning, and also has a certain regularization effect. In addition, anchors are obtained by k-means clustering on the MS COCO data set to find anchor sizes and aspect ratios suited to various targets; the network then predicts the coordinate offset of the target relative to the anchor and determines the target category. Compared with v1, and while maintaining processing speed, it improves in three respects: more accurate prediction, faster speed, and more recognizable object classes. At an mAP of 76.8%, it reaches 67 FPS. However, detection of dense or small targets still needed improvement.
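The anchor clustering step can be sketched as follows. This is a simplified illustration, assuming the commonly used distance d = 1 - IoU between box shapes and a plain mean update. The function names, the toy box list and the iteration count are hypothetical and are not taken from the YOLOv2 code.

```python
import numpy as np


def wh_iou(boxes, centroids):
    """IoU between (N, 2) box sizes and (K, 2) centroid sizes, aligned at one corner."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centroids[:, 0] * centroids[:, 1]
    return inter / (area_b[:, None] + area_c[None, :] - inter)


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (width, height) pairs into k anchors; returns anchors sorted by area."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Smallest (1 - IoU) distance is the same as largest IoU.
        assign = np.argmax(wh_iou(boxes, centroids), axis=1)
        new = centroids.copy()
        for j in range(k):
            members = boxes[assign == j]
            if len(members):            # keep the old centroid if a cluster is empty
                new[j] = members.mean(axis=0)
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids[np.argsort(centroids[:, 0] * centroids[:, 1])]


# Toy data: three small boxes and three large boxes (widths and heights in pixels).
boxes = np.array([[10, 12], [12, 10], [11, 11],
                  [100, 120], [120, 100], [110, 110]], dtype=float)
print(kmeans_anchors(boxes, k=2))
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering simply because their absolute size differences are larger.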

In 2017, the two-stage camp made a big move by proposing the Feature Pyramid Network (FPN). FPN is a top-down architecture with lateral connections. Combined with the bottom-up structure of a CNN, it fuses feature maps from different levels, retaining high-level semantic information while preserving the positional information of low-level features, and it brought significant progress in detecting targets of various sizes. Since then, FPN has become a basic component of many new detectors.

Figure 3 FPN structure
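The top-down fusion in FPN can be illustrated with a small NumPy sketch. It assumes that the lateral 1x1 convolutions have already brought every level to the same channel count and that each level is half the spatial size of the previous one. The smoothing convolution a full FPN applies after each merge is omitted, and the function names are hypothetical.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def fpn_top_down(laterals):
    """Merge feature maps ordered fine -> coarse; returns maps in the same order."""
    outputs = [laterals[-1]]                    # the coarsest level passes through
    for lat in reversed(laterals[:-1]):
        # Add upsampled semantic features to the finer, better-localized level.
        outputs.append(lat + upsample2x(outputs[-1]))
    return outputs[::-1]


# Three toy levels with constant values so the accumulation is easy to follow.
c3 = np.full((4, 8, 8), 1.0)
c4 = np.full((4, 4, 4), 2.0)
c5 = np.full((4, 2, 2), 3.0)
p3, p4, p5 = fpn_top_down([c3, c4, c5])
print(p3[0, 0, 0], p4[0, 0, 0], p5[0, 0, 0])
```

Each output level keeps its own resolution while accumulating information from every coarser level above it.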

Subsequently, Li et al. proposed Light-Head R-CNN, which reduces network parameters by using a lightweight detection head and further accelerates detection. Cascade R-CNN, which appeared at the end of the year, uses cascade regression as a resampling mechanism, raising the intersection over union (IoU) of region proposals stage by stage to obtain more accurate detection results.
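For readers unfamiliar with the metric, intersection over union can be computed as in the sketch below. Boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates. The function name is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # overlap area 1, union area 7
```

A proposal with a higher IoU against the ground truth is a better starting point for the next regression stage.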

3 The sudden rise of YOLO

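In 2018, YOLOv3 introduced residual modules. Building on Darknet-19, it adopts Darknet-53, which has stronger feature extraction capability, as the backbone network, removes the pooling and fully connected layers, and further deepens the network. Its feature fusion strategy is similar to FPN: feature maps at three scales (low, middle and high) are each used to predict bounding boxes, which copes better with targets that differ greatly in scale. Compared with YOLO and YOLOv2, YOLOv3 achieves faster and more accurate detection and addresses the poor detection of small targets. It is a classic among one-stage detection algorithms, and the combination of the Darknet-53 backbone, anchor boxes and FPN became the classic YOLO recipe, laying a solid foundation for the growth of the later YOLO series.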

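Earlier anchor-based single-stage detectors require the size and number of anchor boxes to be designed for a specific data set or application scenario, which leads to poor generalization. Anchor-free detection algorithms appeared later, such as CornerNet and CenterNet, which recast object detection as keypoint detection. These models are simple, generalize well and run fast, and can handle high-accuracy real-time detection tasks.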

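In 2019, EfficientDet applied a compound scaling method that, starting from a baseline network, jointly scales up the resolution, depth and width of the backbone network, the feature network and the classification and regression networks. It also added BiFPN, an enhanced version of FPN. By this stage, object detection methods had reached a satisfactory level in both detection speed and detection accuracy.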

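Research on improving object detection performance has never slowed down. In 2020, YOLOv4, proposed by Bochkovskiy et al., once again set a new best record in object detection. At the data input end it introduces Mosaic data augmentation, in which four images are rotated, scaled, cropped and otherwise transformed, then stitched into a single image and fed to the backbone for training, which both enriches the data set and reduces computing cost. The backbone network is CSPDarknet53, which avoids duplicated gradient information during network optimization, and DropBlock is used to randomly drop neurons and simplify the network model. It also uses an SPP module and an FPN+PANet structure to make full use of semantic and positional information for multi-scale feature fusion. At the same speed, accuracy is nearly 10 points higher than YOLOv3.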

Figure 4 YOLOv4 network structure
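The stitching step of Mosaic augmentation can be sketched as below. This is a deliberately reduced illustration: it only crops and tiles four images around a chosen center point, and leaves out the rotation, scaling and bounding box remapping that a real training pipeline needs. The function name and arguments are hypothetical.

```python
import numpy as np


def mosaic(images, out_size, center):
    """Stitch four (H, W, 3) images, each at least out_size on both sides, into one canvas.

    center = (cx, cy) is the pixel where the four tiles meet.
    """
    cx, cy = center
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    # (y0, y1, x0, x1) for top-left, top-right, bottom-left and bottom-right tiles.
    regions = [(0, cy, 0, cx), (0, cy, cx, out_size),
               (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for img, (y0, y1, x0, x1) in zip(images, regions):
        # Fill the tile with a crop taken from the image's top-left corner.
        canvas[y0:y1, x0:x1] = img[:y1 - y0, :x1 - x0]
    return canvas


# Four constant-valued toy images make the tile layout visible in the output.
tiles = [np.full((8, 8, 3), v, dtype=np.uint8) for v in (1, 2, 3, 4)]
print(mosaic(tiles, out_size=8, center=(3, 5))[:, :, 0])
```

In practice the meeting point is drawn at random for every sample, so each training image mixes contexts from four different scenes.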

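YOLOv5 followed soon after. It embeds anchor computation at the input end of the network, so the optimal anchor values can be derived adaptively for each training set as needed. The backbone adds a Focus structure, which uses a slicing operation to reduce the amount of computation and the number of parameters. Drawing on the CSPNet design used in YOLOv4, it adopts a CSP structure that strengthens the network's feature fusion capability. YOLOv5 comes in four models, YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x, whose network depth and width increase in that order, along with their feature extraction and feature fusion capability. YOLOv5s is small, fast and accurate, and can be deployed directly on mobile devices for real-time detection.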
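The slicing operation in the Focus structure can be illustrated in NumPy as follows. The sketch shows only the rearrangement of pixels into channels. The convolution that follows it in the network is omitted, and the function name is illustrative.

```python
import numpy as np


def focus_slice(x):
    """Rearrange a (C, H, W) array into (4C, H/2, W/2); H and W must be even."""
    # Take every second pixel at four different offsets and stack along channels.
    return np.concatenate([x[:, ::2, ::2], x[:, 1::2, ::2],
                           x[:, ::2, 1::2], x[:, 1::2, 1::2]], axis=0)


x = np.arange(3 * 4 * 4).reshape(3, 4, 4)
y = focus_slice(x)
print(x.shape, "->", y.shape)
```

No pixel is discarded: spatial resolution is halved while the information moves into the channel dimension, so later layers work on smaller feature maps.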

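Looking back over this history, from 2018 onward the "market share" of two-stage methods was gradually taken over by single-stage methods, and the "high accuracy" they had been so proud of became increasingly ordinary. Then in 2020 the two-stage family gained a new member, CPNDet, whose author also wrote the single-stage algorithm CenterNet mentioned above. Building on CenterNet, the method uses keypoints to extract candidate boxes and then applies a two-stage classifier for prediction. Both its accuracy and its inference speed are respectable. Over the course of development, two-stage and single-stage algorithms have borrowed from and blended into each other, and the boundary between them is no longer as clear as it once was.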

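In 2022, Meituan open-sourced YOLOv6, designed specifically for industrial applications. Its many ingenious design choices paid off, and it surpasses other computer vision models of comparable size in both accuracy and speed. The CSPDarknet53 used in YOLOv4 increases latency to some extent and lowers GPU utilization, so YOLOv6 uses the more efficient EfficientRep as its backbone network. The neck is a Rep-PAN structure built from Rep and PAN for effective feature fusion. The detection head splits bounding box regression and category classification into two branches, which speeds up convergence but also increases the complexity of the head, striking a trade-off between speed and accuracy.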

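Two weeks later, the original YOLOv4 team released YOLOv7. Building on YOLOv4, v5 and v6, it set new performance records. First, it extends the Efficient Layer Aggregation Network into E-ELAN, which improves the network's learning ability without destroying the original gradient path. Second, it adopts a model scaling method for concatenation-based models to meet the needs of different inference speeds. It also designs a new label assignment method in which a "label assigner" considers the network's predictions together with the ground truth boxes and then assigns soft labels.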

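On January 10, 2023, Ultralytics, the company behind YOLOv5, released another major version, YOLOv8. It builds on previous YOLO versions and introduces new features and improvements. In the backbone and neck, the C3 structure of YOLOv5 is replaced with the C2f structure, which has richer gradient flow, and the channel counts are tuned separately for each model scale instead of applying one set of parameters to all models, which greatly improves performance. The head changes substantially compared with YOLOv5: it adopts the now mainstream decoupled head, which separates the classification and detection heads, and switches from anchor-based to anchor-free. For the loss computation, it uses the Task Aligned Assigner positive sample assignment strategy and introduces Distribution Focal Loss.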

Figure 5 YOLOv8 structure (drawn with MMYOLO)
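The idea behind Distribution Focal Loss is that each box edge is predicted as a discrete probability distribution over distance bins rather than as a single number. The sketch below shows how such a distribution is turned back into a distance by taking its expectation. The 16-bin setup and the function name are assumptions for illustration, and the loss term itself is not shown.

```python
import numpy as np


def dfl_expected_distance(logits):
    """Convert (..., n_bins) logits into an expected bin index per box edge."""
    # Numerically stable softmax over the bin dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    bins = np.arange(logits.shape[-1])
    return (p * bins).sum(axis=-1)


# A distribution peaked between bins 3 and 4 decodes to a distance of about 3.5.
logits = np.full(16, -20.0)
logits[3] = logits[4] = 0.0
print(dfl_expected_distance(logits))
```

Because the output is an expectation, the head can express sub-bin distances and also how uncertain it is about an ambiguous edge.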

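Of course, the YOLO family includes far more models than those mentioned above. The figure below gives a non-exhaustive list of members of the YOLO family. There are dozens of YOLO variants alone, which shows how strong the family is.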

Figure 6 The YOLO series

4 Inheritance and development

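Every method has had its reason to exist. Single-stage methods are closer to production use because of their efficiency, while two-stage methods still work well in many scenarios. Years of development have let the two kinds of method benefit from each other and prosper together. Although object detection algorithms are already fairly mature, people have never stopped improving them: there is no best, only better. Across the timeline of AI object detection, classic algorithms have emerged one after another.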

Figure 7 AI object detection timeline

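The release of R-CNN in 2014 opened the era of deep learning object detection, more than eight years ago now. Splitting that period in two, the first four years saw a concentrated burst of two-stage algorithms, which occupied most of the territory. In the last four years single-stage algorithms grew ever stronger and took the lead, with YOLO at its zenith in the later part. In retrospect, object detection algorithms have moved from complex to simple, from anchor-based to anchor-free, and from multi-stage to two-stage to single-stage. They have also moved from coarse to fine, gradually taking more details into account, such as target density and richer localization information. Overall, the field is heading toward "faster and stronger network architectures, more effective feature integration methods, more accurate detection methods, more precise loss functions, more effective label assignment methods, and more effective training methods". Whether single-stage or two-stage, all of these methods are based on convolutional neural networks. Recently, Transformer-based algorithms have abandoned convolutional neural networks and achieved end-to-end object detection, with performance gradually approaching the best, and they seem poised to overtake their predecessors. This is probably what is meant by "every generation produces its own talents, each leading the field for hundreds of years".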


For technical exchange, research collaboration, visiting internships or joint training, please write to: [email protected]

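Future GIS Lab, the upstream research arm of the SuperMap Research Institute, is dedicated to anticipating the future direction of the GIS industry, verifying the feasibility of putting cutting-edge technologies into practice, and rapidly turning the latest research results into key products. The department values research and innovation skills, the team atmosphere is free and harmonious, the research culture is relatively strong, and everyone has the chance to dig deep into the frontier topics that interest them.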

Origin blog.csdn.net/futuregislab/article/details/128921903