Object Detection - the RCNN Series

•   RCNN

        RCNN (Regions with CNN features) is a milestone in applying CNNs to the object-detection problem. Proposed by the talented Ross Girshick (RBG), it leverages the strong feature-extraction and classification performance of CNNs and uses the Region Proposal method to turn object detection into a classification problem over candidate regions.

        The algorithm can be divided into four steps:

        1) Candidate region selection

        Region Proposal is a class of traditional region-extraction methods, which can be viewed as sliding windows of different widths and heights; sliding the window yields potential target images. For the proposal step, see Selective Search. Typically about 2k candidates are generated; the details are not covered here.

        The target images extracted by the proposal step are normalized and used as the standard input to the CNN.
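The crop/warp normalization just described can be sketched as resizing every proposal, whatever its aspect ratio, to a fixed CNN input size (227x227 below matches the AlexNet input RCNN used; the helper name is illustrative):

```python
import numpy as np

def warp_region(image, box, out_size=(227, 227)):
    """Crop a proposal box from the image and warp it to a fixed
    CNN input size with nearest-neighbor resampling."""
    x0, y0, x1, y1 = box
    region = image[y0:y1, x0:x1]
    h, w = region.shape[:2]
    out_h, out_w = out_size
    # Index maps: for each output pixel pick the nearest source pixel.
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return region[rows[:, None], cols]

# A dummy 100x150 grayscale "image" and one proposal box (x0, y0, x1, y1).
img = np.arange(100 * 150).reshape(100, 150)
warped = warp_region(img, (10, 20, 90, 80))
print(warped.shape)  # (227, 227) regardless of the box's aspect ratio
```

Note this is exactly where information is lost: the 60x80 region is stretched to a square, which is problem 2 in the list below.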

        2) CNN feature extraction

        A standard CNN pipeline: convolution/pooling and related operations are performed on the input to obtain an output of fixed dimension;

        3) Classification and Boundary Regression

        There are actually two sub-steps here: one is to classify the output vector of the previous step (the classifier must be trained on these features); the other is to obtain a precise target area through bounding-box regression. Since an actual target will generate multiple sub-regions, the goal is to accurately locate and merge the classified foreground targets and avoid detecting the same object multiple times.

        There are three obvious problems with RCNN:

1) The images corresponding to the many candidate regions must be extracted and stored in advance, occupying a large amount of disk space;

2) A traditional CNN requires a fixed-size input image, so crop/warp (normalization) truncates or stretches objects, losing information before it even reaches the CNN;

3) Every proposal region goes through the CNN independently; thousands of regions overlap heavily, so features are extracted repeatedly, wasting an enormous amount of computation.


•   SPP-Net

        Wise people are good at asking questions: since CNN feature extraction is so time-consuming (a large number of convolution computations), why compute each candidate region independently instead of extracting features for the whole image once and merely cropping the region out before classification? The question was asked and immediately put into practice, and so SPP-Net was born.


        SPP-Net has made substantial improvements on the basis of RCNN:

1) The crop/warp image normalization process is canceled to solve the information loss and storage problems caused by image deformation;

2) Spatial Pyramid Pooling replaces the last pooling layer before the fully connected layers (the top layer in the figure above). This is a new term, so let's get acquainted with it first.

        In order to adapt to feature maps of different resolutions, a scalable pooling layer is defined that divides the map into m*n parts regardless of the input resolution. This is SPP-net's first salient feature. Its input is the conv5 feature map together with candidate boxes on the feature map (obtained by mapping the original-image candidate boxes through the stride), and its output is a fixed-size (m*n) feature;
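One pyramid level of this scalable pooling can be sketched in a few lines of numpy: the bin boundaries scale with the input resolution, so any feature map collapses to a fixed m x n output (a minimal sketch, using the common floor/ceil bin scheme):

```python
import numpy as np

def adaptive_max_pool(feat, m, n):
    """Max-pool a 2-D feature map of any size down to a fixed m x n grid."""
    H, W = feat.shape
    out = np.empty((m, n))
    for i in range(m):
        # Row bin i covers floor(i*H/m) .. ceil((i+1)*H/m); bins scale with H.
        r0, r1 = (i * H) // m, ((i + 1) * H + m - 1) // m
        for j in range(n):
            c0, c1 = (j * W) // n, ((j + 1) * W + n - 1) // n
            out[i, j] = feat[r0:r1, c0:c1].max()
    return out

# conv5 feature maps of different resolutions both yield a fixed 4x4 output.
print(adaptive_max_pool(np.random.rand(13, 13), 4, 4).shape)  # (4, 4)
print(adaptive_max_pool(np.random.rand(21, 30), 4, 4).shape)  # (4, 4)
```

The full SPP layer simply runs this at several (m, n) levels (e.g. 1x1, 2x2, 4x4) and concatenates the results.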

        And what about the pyramid? Using multiple scales to increase the robustness of the extracted features turns out not to be critical, and this part was dropped in the subsequent Fast-RCNN improvement;

        What is most critical is SPP's position: placed after all convolutional layers, it effectively eliminates the repeated convolution computation (test speed improves by 24~102x), which is the paper's core contribution.


        Despite SPP-Net's great contribution, quite a few problems remain:

1) Like RCNN, the training stages are still isolated: candidate-box extraction | CNN feature computation | SVM classification | bounding-box regression are each trained independently, a large number of intermediate results must be written to disk, and the parameters cannot be trained as a whole;

2) SPP-Net cannot fine-tune the convolutional layers and the fully connected layers on both sides of the SPP layer at the same time, which greatly limits the effectiveness of deep CNNs;

3) In the overall pipeline, region proposal generation is still very time-consuming.


•   Fast-RCNN

        There are many problems, but the solutions are equally ingenious. OK, thanks once again to RBG for his contribution; the original figures from the paper are quoted directly (they are very thorough).

        Fast-RCNN's main contribution is speeding up RCNN. Speed is the goal we keep pursuing (to borrow a knock-off Olympic motto: faster, more accurate, more robust). Improvements were made in the following areas:

        1) Selling point 1 - borrowing the SPP idea, a simplified ROI pooling layer is proposed (note: no pyramid), together with a candidate-box mapping function, so that the network can back-propagate, solving SPP's whole-network training problem;

        2) Selling point 2 - a multi-task loss layer

    A) Softmax loss replaces the SVM, and softmax is shown to outperform the SVM;

    B) Smooth L1 loss replaces the original bounding-box regression loss.

        Classification and bounding-box regression are merged (another pioneering idea); the multi-task loss layer further integrates the deep network and unifies the training process, improving the algorithm's accuracy.

        3) The fully connected layers are accelerated via SVD

            You can look into this on your own; it brings some improvement but is not revolutionary.

        4) With the improvements above, all layers can be updated during model training. Besides the speed gains (training is 3x as fast as SPP, testing 10x), detection quality improves (mAP of 70 on VOC07; note: mAP = mean Average Precision).
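The SVD trick in point 3 can be sketched with numpy: a fully connected layer with weight W (u x v) is split into two thin layers using the top-t singular values, cutting the multiply count from u*v to t*(u+v) (a minimal sketch; the 512/64 sizes are illustrative):

```python
import numpy as np

def svd_compress(W, t):
    """Replace FC weight W (u x v) by two thin layers using the top-t
    singular values: y = W @ x is approximated by U_t @ (S_t V_t @ x)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    first = np.diag(s[:t]) @ Vt[:t]   # t x v layer, applied first
    second = U[:, :t]                 # u x t layer, applied second
    return first, second

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))
first, second = svd_compress(W, 64)
x = rng.standard_normal(512)
y_approx = second @ (first @ x)   # two small matmuls instead of one big one
print(first.shape, second.shape)  # (64, 512) (512, 64)
```

With t equal to the full rank the factorization reproduces W exactly; in practice a small t trades a little accuracy for a big reduction in FC computation.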

        Next, let's expand on each of the two big selling points:

        We have already seen the scalable pooling layer; so how do gradients propagate through the ROI pooling layer during training? By the chain rule, for traditional max pooling with $y_j = \max_i x_i$, the mapping is:

$$\frac{\partial L}{\partial x_i} = \sum_j \delta(i, j)\,\frac{\partial L}{\partial y_j}$$

        where $\delta(i, j)$ is the indicator function: 1 means $x_i$ was selected as the maximum for $y_j$; 0 means it was discarded, the error does not need to be passed back, and the corresponding weight is not updated. As shown in the figure above, the extended formula for an input $x_i$ pooled by multiple ROIs is:

$$\frac{\partial L}{\partial x_i} = \sum_r \sum_j \big[\, i = i^*(r, j) \,\big]\,\frac{\partial L}{\partial y_{r,j}}$$

      where $[\, i = i^*(r, j) \,]$ indicates whether $x_i$ was selected as the maximum for the $j$-th output node of the $r$-th ROI (corresponding to $y_{0,8}$ and $y_{1,0}$ in the figure above); during back-propagation, $x_i$ receives the sum of the gradient errors from all the outputs that selected it.
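This gradient routing can be illustrated with a tiny numpy sketch: the backward pass locates each ROI's argmax and sums the incoming gradients at every input cell that was selected, possibly by several overlapping ROIs (each ROI is pooled to a single max here for simplicity; all names are illustrative):

```python
import numpy as np

def roi_max_pool_bwd(feat, rois, grads):
    """feat: 2-D feature map; rois: list of (r0, r1, c0, c1) windows;
    grads: dL/dy for each ROI output. Returns dL/dx, with gradients
    accumulated at every location selected as a maximum."""
    dx = np.zeros_like(feat, dtype=float)
    for (r0, r1, c0, c1), g in zip(rois, grads):
        window = feat[r0:r1, c0:c1]
        # The argmax inside the window is the only cell that receives gradient.
        i, j = np.unravel_index(np.argmax(window), window.shape)
        dx[r0 + i, c0 + j] += g  # overlapping ROIs accumulate here
    return dx

feat = np.array([[1., 5., 2.],
                 [3., 4., 0.],
                 [2., 1., 6.]])
# Two overlapping ROIs that both select the cell (0, 1) holding the value 5.
dx = roi_max_pool_bwd(feat, [(0, 2, 0, 2), (0, 1, 0, 3)], [1.0, 0.5])
print(dx[0, 1])  # 1.5: the sum of both ROIs' gradients
```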


        The multi-task loss layer (on top of the fully connected layers) is the second core idea. As shown in the figure above, cls_score is used for classification, bbox_reg computes the bounding-box regression, and label marks the training samples.

        Lcls is the classification loss:

$$L_{cls}(p, l) = -\log p_l$$

        where $p$ is the softmax class-probability vector and $p_l$ is the probability of the labeled class (the probability of the correct classification). When $p_l = 1$ the loss is 0; the smaller $p_l$, the larger the loss (e.g., with a base-10 logarithm, $p_l = 0.01$ gives a loss of 2).
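A quick numeric check of the classification loss (note the example value of 2 for p_l = 0.01 corresponds to a base-10 logarithm; deep-learning frameworks normally use the natural log, which would give about 4.6 instead):

```python
import math

def l_cls(p, label, base=10):
    """Cross-entropy loss -log(p_label) for a softmax probability vector."""
    return -math.log(p[label], base)

p = [0.01, 0.99]    # softmax probabilities for 2 classes
print(l_cls(p, 1))  # ~0.004: nearly correct class, near-zero loss
print(l_cls(p, 0))  # 2.0 (up to float rounding): p_l = 0.01 gives loss 2
```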

       Lreg is the bounding-box regression loss:

$$L_{reg}(t, v) = \sum_{i \in \{x, y, w, h\}} \text{smooth}_{L1}(t_i - v_i)$$

        i.e., for a correctly classified sample, the Smooth L1 error between the regressed box $t$ and the label box $v$ over the four box parameters (top/bottom/left/right, or translation and scale), each term measuring the difference in one parameter. For $|x| > 1$ the function becomes linear to suppress outlier noise:

$$\text{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2 & |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
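The Smooth L1 function follows directly from its definition; it is quadratic near zero and linear for |x| > 1, which caps the gradient magnitude from outliers at 1:

```python
def smooth_l1(x):
    """Smooth L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

print(smooth_l1(0.5))   # 0.125  (quadratic region)
print(smooth_l1(3.0))   # 2.5    (linear region, gradient capped at 1)
print(smooth_l1(-3.0))  # 2.5    (symmetric)
```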


         Ltotal is the weighted objective (background samples contribute no regression loss):

$$L(p, l, t, v) = L_{cls}(p, l) + \lambda\,[\, l \ge 1 \,]\, L_{reg}(t, v)$$


        Attentive readers may have noticed that the third problem of SPP mentioned earlier is still unsolved: the candidate-box extraction step remains time-consuming (ignoring that step, Fast-RCNN is nearly real-time). So is there a way to simplify it?

        There must be; doing research takes exactly this kind of courage.


•   Faster-RCNN

        Selective Search, the most common method for extracting candidate boxes, takes roughly 2 s per image; the improved EdgeBoxes algorithm brings that down to 0.2 s, but that is still not enough.

        Candidate boxes do not have to be extracted on the original image; the feature map works just as well, and a low-resolution feature map means less computation. Based on this idea, Shaoqing Ren et al. at MSRA proposed the RPN (Region Proposal Network), which solves the problem elegantly. Let's first look at the network topology.


        By adding an extra RPN branch network, candidate-box extraction is merged into the deep network itself, which is exactly the milestone contribution of Faster-RCNN.

The RPN's distinctive feature is that it extracts candidate boxes with a sliding window: each sliding-window position generates 9 candidate windows (of different scales and aspect ratios), and the features of these 9 candidate windows (anchors) are extracted for object classification and bounding-box regression, similarly to Fast-RCNN.
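The 9 anchors per position are simply 3 scales x 3 aspect ratios centered on the sliding-window location. A minimal sketch (the 128/256/512 scales and 1:2, 1:1, 2:1 ratios below are the common choices from the paper; treat the exact numbers as illustrative):

```python
import itertools

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return 9 (x0, y0, x1, y1) anchors centered at (cx, cy).
    ratio = h / w; the area is kept at scale**2 for every ratio."""
    anchors = []
    for s, r in itertools.product(scales, ratios):
        w = s / r ** 0.5   # so that w * h = s*s and h / w = r
        h = s * r ** 0.5
        anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

a = make_anchors(300, 300)
print(len(a))  # 9 anchors per sliding-window position
```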

        The classification only needs to decide whether the features inside a candidate box are foreground or background.

        Bounding-box regression then determines a more precise object location. The basic network structure is shown in the figure below:


        During training, anchors are selected according to the following rules:

1) Discard anchors that cross the image boundary;

2) Anchors whose overlap with a ground-truth sample exceeds 0.7 are labeled foreground; those with overlap below 0.3 are labeled background;

      For each position, two fully connected layers (object classification + bounding-box regression) evaluate every candidate box (anchor), and anchors are discarded according to the probability scores (only about 300 anchors are kept). No candidate window is extracted explicitly; the network itself performs all the judgment and refinement.
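The foreground/background labeling rule can be sketched with a plain IoU (intersection-over-union) computation, using the 0.7/0.3 thresholds above; anchors falling between the thresholds are ignored during training:

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt):
    v = iou(anchor, gt)
    if v > 0.7:
        return 1    # foreground
    if v < 0.3:
        return 0    # background
    return -1       # in between: ignored during training

gt = (0, 0, 100, 100)
print(label_anchor((0, 0, 100, 100), gt))      # 1: perfect overlap
print(label_anchor((200, 200, 300, 300), gt))  # 0: no overlap
```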

        From the model-training perspective, alternating training with shared features achieves near-real-time performance. The alternating scheme is:

1) Initialize the weights w from an existing network and train the RPN;

2) Use the RPN to extract candidate regions on the training set, train Fast-RCNN on those regions, and update the weights w;

3) Repeat steps 1 and 2 until convergence.

        Thanks to Faster-RCNN, real-time CNN-based object detection came within sight, and further research followed along this direction. At this point, let's review the evolution of the RCNN family, shown in the figure below:

        The network structure of Faster RCNN (based on VGG16):


        Faster-RCNN achieves end-to-end detection and comes close to the best attainable accuracy, but there is still room for improvement in speed, and so YOLO was born.

•   YOLO

        YOLO comes from "You Only Look Once": you only need to look once, no RPN-style candidate-box extraction is required, and the whole image is regressed directly. Simple, right?


        The algorithm is:

1) Divide the image into a fixed grid (say 7*7); if an object's center falls into a grid cell, that cell is responsible for regressing the object's location;

2) Each grid cell predicts the object's location together with a confidence score, encoded as a vector;

3) The network's output layer directly holds the result for each grid cell, which enables end-to-end training.
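Step 1 above, assigning each object to the grid cell containing its center, can be sketched as:

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return the (row, col) of the S x S grid cell that contains the
    object's center (cx, cy) and is responsible for regressing it."""
    col = min(int(cx * S / img_w), S - 1)
    row = min(int(cy * S / img_h), S - 1)
    return row, col

# Object centered at (320, 100) in a 448x448 image (YOLO's input size).
print(responsible_cell(320, 100, 448, 448))  # (1, 5)
```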

        YOLO has the following problems:

1) Regression on a 7*7 grid loses features rather severely and lacks a multi-scale basis for regression;

2) The loss terms cannot be balanced effectively (whether weighted or simply averaged), so the loss converges poorly and the model becomes unstable.

Object (classification + regression) <= weighted like => background (classification only)

        As a result, the loss from object classification + regression has the same influence as the background loss, and part of the residual cannot be propagated back effectively;

Overall, YOLO's localization is not precise enough; its contribution lies in offering a new way of thinking about object detection and showing us what object detection could really look like in practical applications.

        A note here: in direct regression, the last layer can be regarded as the feature result for the corresponding 7*7 grid cells; each cell's vector holds the parameters to be regressed (for example pred, cls, xmin, ymin, xmax, ymax), and the meaning of those parameters lies in the design of the loss function.
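The per-cell vector just described can be decoded like this (the pred/cls/xmin/ymin/xmax/ymax layout is the illustrative one named in the text, not YOLO's exact encoding):

```python
def decode_cell(vec):
    """Split one grid cell's output vector into named parts, using the
    illustrative layout (pred, cls, xmin, ymin, xmax, ymax)."""
    keys = ("pred", "cls", "xmin", "ymin", "xmax", "ymax")
    return dict(zip(keys, vec))

out = decode_cell([0.9, 3, 10, 20, 60, 80])
print(out["pred"], out["cls"])  # 0.9 3
```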

•   SSD

        Because YOLO's Single Shot approach is built on the last convolutional layer alone, its object localization carries some bias, and small objects are easily missed.

        Borrowing the anchor mechanism from Faster-RCNN, SSD (Single Shot MultiBox Detector) solves this problem to a certain extent. Let's first look at a comparison of the SSD structure.

        With proposals based on multi-scale features, SSD strikes a balance between efficiency and accuracy: in speed it approaches real-time performance, and in accuracy it beats YOLO.

        The exploration of object-detection networks is still moving fast: some Faster-RCNN variants have already pushed accuracy above 87%, and on the speed front YOLO2 seems set to bring some surprises. "The future is already here"; let's wait and see!
