Object Detection Learning Notes 1 ---- R-CNN / SPP-Net / Fast R-CNN / Faster R-CNN

These days I have been studying the object detection frameworks R-CNN / SPP-Net / Fast R-CNN / Faster R-CNN, and I wanted to write notes to leave a deeper impression. If I have misunderstood anything, please point it out. Thanks!

Object detection has two main tasks:

  • Localizing objects in the image
  • Recognizing the objects' categories

Object detection is thus localization plus classification, which makes it harder than image classification.
The general pipeline of traditional object detection can be represented as follows:
(figure: the traditional object detection pipeline)
With the rise of deep learning and the powerful visual-processing performance exhibited by CNNs, traditional object detection methods have begun to evolve in the deep learning direction.

R-CNN

1. First, given an input image, generate a number of regions of interest in the original image using a predetermined method, i.e. regions that may contain a target (region proposals); there are about 2k of them.
The method R-CNN uses to generate region proposals is Selective Search (SS), which proceeds as follows:
(1) Generate an initial set of regions R according to a certain rule (an over-segmentation)
(2) For the set R, compute the similarity of every pair of adjacent regions, giving a set S = {s1, s2, ...}
(3) Find the two regions with the highest similarity, merge them, and add the merged region to R
(4) Remove from S every entry related to the two regions merged in step 3
(5) Compute the similarities between the newly merged region and its neighbors, and add them to S
(6) Jump back to step 3 until S is empty
(Personally I think this part is not that important; although the algorithm appears alongside several of the models below, I still do not fully understand it...)
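To make the merging loop above concrete, here is a minimal Python sketch of the greedy Selective Search procedure. The `similarity` and `are_adjacent` helpers are hypothetical stand-ins: the real algorithm starts from a graph-based over-segmentation and combines color, texture, size, and fill similarities.

```python
import itertools

def selective_search(initial_regions, similarity, are_adjacent):
    # Every region ever created (initial or merged) is a candidate proposal.
    R = list(initial_regions)
    active = set(range(len(R)))          # regions not yet merged away
    # (2) similarities of all adjacent region pairs
    S = {(i, j): similarity(R[i], R[j])
         for i, j in itertools.combinations(sorted(active), 2)
         if are_adjacent(R[i], R[j])}
    while S:                             # (6) repeat until S is empty
        i, j = max(S, key=S.get)         # (3) most similar adjacent pair
        merged = R[i] | R[j]             # assume a region is a set of pixels
        R.append(merged)                 # (3) add the merged region to R
        t = len(R) - 1
        active -= {i, j}
        # (4) remove every similarity involving the two merged regions
        S = {pair: s for pair, s in S.items() if i not in pair and j not in pair}
        # (5) similarities between the new region and its remaining neighbours
        for k in active:
            if are_adjacent(R[k], merged):
                S[(k, t)] = similarity(R[k], merged)
        active.add(t)
    return R
```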

2. Resize (warp) each generated candidate region to a fixed size. The reason is that the fully connected layers behind need a fixed-size input feature map; this is one of the places that is improved later.

3. Feed each resized region into a CNN (this network can be an off-the-shelf model such as VGG or AlexNet, followed by fine-tuning; the original paper uses AlexNet). The CNN extracts a fixed-dimensional feature vector from each region. (After AlexNet, since the scaling is fixed and the input image size is fixed, the output is of course fixed too; the fully connected layers just need a fixed-size input.)

4. Feed the extracted features into a set of pre-trained SVM classifiers (k of them in total, where k is the number of categories, each a binary classifier) to identify what the target region contains, then perform box regression to correct the position of the object box.
Note that classification and regression use two different sets of features: the classification features come from the fc7 layer and the regression features come from the conv5 layer, where fc7 and conv5 are layers of AlexNet.

The concrete architecture of R-CNN is shown below:

R-CNN architecture
As the figure shows, after the ~2000 region proposals are selected and warped/cropped, features must be extracted for each region proposal separately, which requires ~2000 CNN forward passes per image. This is very time-consuming (improved later).

R-CNN fine-tuning

M - a CNN model pre-trained on ImageNet (for example AlexNet or VGG; an off-the-shelf model can be used directly)
M' - M fine-tuned on all the SS-generated regions (the purpose is the classification that follows, so that the model generalizes better to classification in object detection; after all, a trained AlexNet is readily available, but the objects to be recognized in detection images are special, and AlexNet does not necessarily generalize well to them, so fine-tuning is needed)
Notes on fine-tuning:

  • Change the softmax layer to (N + 1)-way, leaving the rest unchanged; N + 1 is the number of classes, including one background class. The specific task in the original paper is PASCAL VOC, using the PASCAL VOC 2010 dataset, which only needs to distinguish 20 categories.
  • Positive samples (the N classes): IoU with the ground truth >= 0.5; negative samples (the 1 background class): IoU < 0.5 (see the IoU sketch after this list).
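For reference, here is a minimal IoU helper for boxes in (x1, y1, x2, y2) corner format, the overlap measure used for the positive/negative split above:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```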

The classification part of R-CNN

Training procedure:
Train linear SVM classifiers on the fc7 features of M'.

  • 1. Loss: hinge loss (specific to the SVM)
  • 2. One binary classifier per category
  • 3. Positive samples: the CNN feature vectors of all ground-truth regions
  • 4. Negative samples: the CNN feature vectors of SS regions whose IoU with the ground truth is < 0.3 (note that the definition of positive/negative samples here is stricter than for CNN fine-tuning)

Note that the SVM training is separate: the CNN must be trained first to extract the features that are then fed to the SVMs.
The CNN can already classify, so why still use SVMs? A new dataset has to be constructed and the SVMs trained on their own, while the trained CNN is still needed to extract the input features?? A sketch of this stage follows.
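A minimal sketch of this separate SVM stage, assuming the fc7 features have already been extracted into NumPy arrays (the names and the C value are illustrative; the paper additionally uses hard-negative mining, omitted here):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_per_class_svms(features, labels, num_classes):
    # features: (num_regions, 4096) fc7 features; labels: NumPy int array with
    # -1 = background, 0..num_classes-1 = ground-truth class of the region
    svms = []
    for c in range(num_classes):
        pos = labels == c                 # ground-truth regions of class c
        neg = labels == -1                # SS regions with IoU < 0.3
        X = np.concatenate([features[pos], features[neg]])
        y = np.concatenate([np.ones(int(pos.sum())), np.zeros(int(neg.sum()))])
        clf = LinearSVC(C=0.001)          # hinge loss, one binary SVM per class
        clf.fit(X, y)
        svms.append(clf)
    return svms
```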

The regression part of R-CNN

Train a bounding-box regression model on the conv5 features of M'.
1. Train one regression model per category (N classes).

  • Re-map the bounding boxes P provided by SS toward the ground truth G, i.e. learn the mapping P -> G

  • Training input: pairs {(P^i, G^i)}, where P = (P_x, P_y, P_w, P_h) gives a proposal's center coordinates and width/height and G is the matched ground-truth box. The regression targets are
    t_x = (G_x - P_x) / P_w, t_y = (G_y - P_y) / P_h, t_w = log(G_w / P_w), t_h = log(G_h / P_h)

  • Only proposals P with IoU > 0.6 against a ground-truth box are used

  • Squared loss (ridge regression):
    w_* = argmin over ŵ_* of Σ_i (t_*^i - ŵ_*^T φ_5(P^i))^2 + λ ||ŵ_*||^2, for * ∈ {x, y, w, h},
    where φ_5(P) denotes the conv5 features of proposal P

2. Test phase

  • The parameters w have already been trained; at test time the predicted offsets are applied to correct each box (a sketch of this encoding/decoding follows).
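A small sketch of the standard R-CNN box encoding and its inverse, with boxes in (center_x, center_y, width, height) form; at test time `apply_offsets` turns the regressor's prediction t into a corrected box:

```python
import math

def encode_targets(P, G):
    """Regression targets t = (t_x, t_y, t_w, t_h) for proposal P vs. ground truth G."""
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def apply_offsets(P, t):
    """Invert the encoding: apply predicted offsets t to proposal P."""
    px, py, pw, ph = P
    tx, ty, tw, th = t
    return (px + pw * tx, py + ph * ty,
            pw * math.exp(tw), ph * math.exp(th))
```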

The test phase of R-CNN

  • The SS algorithm extracts ~2000 regions per image
  • Warp/crop all regions to 227 × 227
  • Use the fine-tuned AlexNet to compute two sets of features: one from fc7 for SVM classification, and one from conv5 for box regression
  • fc7 features -> SVM classifiers -> class scores
  • Use non-maximum suppression (IoU > 0.5) to obtain a non-redundant subset of regions (see the NMS sketch after this list)
  • conv5 features -> bounding-box regression model -> bbox offsets
  • Use the bbox offsets to correct the region subset
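A minimal sketch of the greedy non-maximum suppression step, reusing the `iou` helper defined earlier:

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping boxes that overlap a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)              # highest-scoring remaining box
        keep.append(best)
        # discard every remaining box that overlaps it too much
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```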

Drawbacks of R-CNN

  • Very slow
  • Far too much repeated convolution computation: CNN features are computed for all ~2000 regions of every image

SPP-Net

Improvements over R-CNN

  • Feed in the whole image directly, convolve the whole image once, and extract the features of all regions from the conv5 output
  • Introduce spatial pyramid pooling (SPP):
    1. Extract features on the conv5 output for regions of different sizes
    2. Map each region's features onto a fixed-size fully connected layer (this solves the size-mismatch problem between the convolutional layers and the fully connected layers, so the input image can be of any size)

SPP-Net architecture

(figure: SPP-Net architecture)
The SPP layer replaces the pooling layer after conv5.
(figure: the SPP layer pooling a feature map with windows of several sizes)
Why does this give a fixed-size output?
Note that, as mentioned above, multiple pooling windows are used (the blue, teal, and silver-gray windows in the figure above); the feature maps are pooled with each window and the results are concatenated, which yields a fixed-length output. This also amounts to introducing multi-scale features, making the fused feature information more comprehensive. A sketch follows.
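A minimal PyTorch sketch of such a pooling layer; the (4, 2, 1) pyramid levels here are illustrative rather than the paper's exact configuration:

```python
import torch
import torch.nn.functional as F

def spp(feature_map, levels=(4, 2, 1)):
    # feature_map: (batch, channels, h, w) with arbitrary h and w
    pooled = []
    for n in levels:
        p = F.adaptive_max_pool2d(feature_map, output_size=(n, n))
        pooled.append(p.flatten(start_dim=1))   # (batch, channels * n * n)
    # fixed-length output regardless of the input h and w
    return torch.cat(pooled, dim=1)

x = torch.randn(1, 256, 13, 13)
print(spp(x).shape)                             # (1, 256 * (16 + 4 + 1)) = (1, 5376)
```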

Note that SPP must be applied to the feature maps of all ~2000 regions; after SPP, one image yields a 2000 × (fixed length) feature matrix. Applying SPP to a single region of the image's conv5 feature map works just like the figure above.
How do we find, on the conv5 feature map, the regions corresponding to the ~2000 regions of the original image?
For convolutional layers, a feature at some position of the input image shows up at the same (relative) position of the feature map. Based on this, extracting the features of an RoI only requires reading out the corresponding position of the feature map, scaled down by the network's total stride, as sketched below.
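A trivial sketch of that projection, assuming a total network stride of 16 (typical of an AlexNet/VGG-style conv5):

```python
def project_to_feature_map(box, stride=16):
    # box in image coordinates (x1, y1, x2, y2) -> conv5 feature-map coordinates
    x1, y1, x2, y2 = box
    return (x1 // stride, y1 // stride, x2 // stride, y2 // stride)
```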

The training procedure of SPP-Net

  1. M - pre-train a CNN model on ImageNet
  2. F - compute the SPP features (from conv5) of all SS regions
  3. M' - fine-tune the new fc6 -> fc7 -> fc8 layers using the F features

Note the differences from R-CNN here:
(1) SPP features vs. pool5 features
(2) Only the fully connected layers are fine-tuned (why? whereas R-CNN fine-tunes all layers?)

  4. F' - compute the fc7 features of M'
  5. C - train the SVM classifiers using the F' features
  6. R - train the bounding-box regression model using the F features

Drawbacks of SPP-Net

Problems inherited from R-CNN:

  1. A large number of features still has to be stored
  2. Complex multi-stage training
  3. Slow

New problems introduced:

  • The parameters of all convolutional layers before the SPP layer cannot be fine-tuned (why?)

Fast R-CNN

Improvements

  1. End-to-end single-stage training (region proposals are still produced by selective search)
  2. The parameters of all layers can be fine-tuned
  3. The network outputs the class prediction and the regression proposal simultaneously (sibling outputs), no longer training the SVMs and the regressor separately (this introduces a multi-task loss function)
  4. Introduce the RoI pooling layer (single scale) to replace the SPP layer (multi-scale)

Fast R-CNN architecture

(figure: Fast R-CNN architecture)

Region-of-interest pooling (RoI pooling)

  • Fast R-CNN improves on pyramid pooling by using a special case of it (a single pyramid level), which becomes RoI pooling; its input size h × w is variable, while its output size H × W is fixed (see the sketch after this list)
  • Split the RoI's convolutional features into an H × W grid (7 × 7 for VGG)
  • Max-pool all the features within each bin

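As with SPP above, RoI pooling can be sketched with adaptive max pooling, here over a single 7 × 7 level; the RoIs are assumed to already be in feature-map coordinates:

```python
import torch
import torch.nn.functional as F

def roi_pool(feature_map, rois, output_size=(7, 7)):
    # feature_map: (channels, H, W); rois: integer boxes (x1, y1, x2, y2)
    pooled = []
    for x1, y1, x2, y2 in rois:
        region = feature_map[:, y1:y2 + 1, x1:x2 + 1]   # variable h x w input
        pooled.append(F.adaptive_max_pool2d(region, output_size))
    return torch.stack(pooled)                          # (num_rois, channels, 7, 7)
```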

Multi-task loss

As the network structure shows, the network has two sibling outputs, classification and regression, so the classifier and the regressor no longer need to be trained separately. But this involves multi-task training: the two tasks share the input and the lower-layer parameters, and then produce their own outputs for their respective tasks.
The network's loss function combines the classification loss and a (smooth) L1 regression loss:
L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(t^u, v)
where L_cls(p, u) = -log p_u, and L_loc sums smooth_L1(t_i^u - v_i) over the four box coordinates, with smooth_L1(x) = 0.5 x^2 if |x| < 1 and |x| - 0.5 otherwise.
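A simplified PyTorch sketch of this loss; for brevity `box_preds` holds only the offsets for each RoI's true class (the real head predicts 4 offsets per class):

```python
import torch
import torch.nn.functional as F

def multi_task_loss(class_scores, box_preds, labels, box_targets, lam=1.0):
    # class_scores: (R, N + 1); labels: (R,) long tensor with 0 = background
    cls_loss = F.cross_entropy(class_scores, labels)
    fg = labels > 0                      # the [u >= 1] indicator
    loc_loss = F.smooth_l1_loss(box_preds[fg], box_targets[fg])
    return cls_loss + lam * loc_loss
```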

Training Fast R-CNN

  • Pre-training: as with R-CNN, train a large network on a large dataset, or take an off-the-shelf CNN model such as VGG
  • Fine-tune on the pre-trained model:
    (1) First change the network structure: 1) replace the last pooling layer with RoI pooling; 2) replace the output layer with two sibling outputs, one for classification and one for regression. (Open question: how should the newly added parameters be initialized??)
    (2) Construct the dataset using mini-batch sampling:
    batch size (128) = images per batch (2) × RoIs per image (64)
    These batch settings should be adjustable.

Speeding up the fully connected layers

For Fast R-CNN, the fully connected layers after the RoI pooling layer have to be run about 2k times per image (once per RoI), so Fast R-CNN can use SVD decomposition to speed up the fully connected computation.

Let the fully connected layer's input be x, its output be y, and its weight matrix be W of size u × v; the layer then computes y = Wx, which costs u × v multiplications per RoI.
The truncated SVD decomposition is
W ≈ U Σ_t V^T
where U is u × t, Σ_t is the t × t diagonal matrix of the top t singular values, and V^T is t × v. The single layer W is thereby replaced by two smaller layers (Σ_t V^T first, then U), reducing the cost from uv to t(u + v) multiplications.
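A NumPy sketch of the truncated-SVD trick; the sizes here are made up and smaller than the real 4096-d fc layers:

```python
import numpy as np

u, v, t = 1024, 512, 64                 # illustrative sizes
W = np.random.randn(u, v)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
W1 = np.diag(s[:t]) @ Vt[:t]            # t x v: first, smaller fc layer
W2 = U[:, :t]                           # u x t: second, smaller fc layer

x = np.random.randn(v)
y_approx = W2 @ (W1 @ x)                # ~ W @ x, with t*(u+v) instead of u*v multiplies
```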

Faster R-CNN

Improvement:
Faster R-CNN = Fast R-CNN + RPN
In Fast R-CNN, the region proposals are still produced by the SS algorithm, which runs on the CPU and consumes a great deal of time. Faster R-CNN therefore generates region proposals with a neural network so that everything can run on the GPU: it introduces the Region Proposal Network (RPN) to produce the region proposals.
The structure of Faster R-CNN is shown below:
(figure: Faster R-CNN architecture; the part in the red box is the RPN)
Region Proposal Network (RPN)

  • A 3 × 3, 256-d convolutional layer + ReLU <- the image's conv5 feature map. Why do this? The conv5 output is already a 256-d (256-channel) feature map, so why pass it through a 3 × 3 convolution to get yet another 256-d feature map? ---- To enlarge the receptive field
  • A 1 × 1, 4k-d (4k channels) convolutional layer -> outputs the offsets (r, c, w, h) of k proposals per position, used for regression
  • A 1 × 1, 2k-d convolutional layer -> outputs k pairs of (object score, non-object score) per position (a sketch of this head follows this list)
    (figure: the RPN sliding a window over the conv5 feature map)
    Here M × N is the original size of the input image (see the figure two above), and W × H is the size of the feature map obtained after conv5.
    The red box is the sliding window (note that the sliding window only selects positions and plays no other role; do not confuse it with the 3 × 3 convolution above). Its size is n × n (n = 3 in the implementation), and the intermediate layer is a 3 × 3 convolutional layer with 256 kernels of size 3 × 3.
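A minimal PyTorch sketch of this head, with the 3 × 3 intermediate convolution and the two sibling 1 × 1 convolutions (k = 9):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=256, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(256, 2 * k, kernel_size=1)  # (object, non-object) scores
        self.reg = nn.Conv2d(256, 4 * k, kernel_size=1)  # (r, c, w, h) offsets

    def forward(self, feature_map):
        h = torch.relu(self.conv(feature_map))
        return self.cls(h), self.reg(h)

scores, offsets = RPNHead()(torch.randn(1, 256, 38, 50))
print(scores.shape, offsets.shape)       # (1, 18, 38, 50) and (1, 36, 38, 50)
```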

Anchor box
Number of anchor box types: k = 9
Comprising:

  • 3 scales (128, 256, 512)
  • 3 aspect ratios (1:1, 1:2, 2:1)
  • The total number of anchors is W × H × k
  • W × H is the size of the feature map
  • Each point of the conv5 feature map has k anchors
The RPN slides a small network over this feature map (viewed as an image with 256 channels) and, for the 9 anchor boxes each point corresponds to on the normalized image, both scores them (is each box foreground?) and regresses them (a position-correction suggestion for each box). This corresponds to the 2k and 4k outputs in the figure above (with k = 9): the 2 stands for foreground/background, and the 4 stands for the four correction values. A sketch of anchor generation follows.
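A sketch of anchor generation under one common convention (ratio = width/height, constant area per scale); the stride of 16 is an assumption for a VGG-style conv5:

```python
import itertools

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    anchors = []
    for y, x in itertools.product(range(feat_h), range(feat_w)):
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # cell center in the image
        for scale, ratio in itertools.product(scales, ratios):
            w = scale * ratio ** 0.5     # width/height keep area ~ scale^2
            h = scale / ratio ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

print(len(generate_anchors(38, 50)))     # W * H * k = 38 * 50 * 9 = 17100
```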

The training procedure of Faster R-CNN

  • Step 1 - train the RPN
    Convolutional layers initialized <- model parameters pre-trained on ImageNet
  • Step 2 - train the Fast R-CNN network
    Convolutional layers initialized <- model parameters pre-trained on ImageNet
    Region proposals are generated by the RPN from Step 1.
  • Step 3 - fine-tune the RPN
    Convolutional layers initialized <- Fast R-CNN's convolutional layer parameters
    Freeze the convolutional layers and fine-tune the remaining layers
  • Step 4 - fine-tune Fast R-CNN
    Freeze the convolutional layers and fine-tune the remaining layers
    Region proposals are generated by the RPN from Step 3.
    The convolutional layers of Step 1 and Step 2 are not shared, while those of Step 3 and Step 4 are shared. Why??

