What is detection?
The detection task is classification + localization.
Screenshot from the cs231n course
From left to right: semantic segmentation, image classification, object detection, and instance segmentation.
Key Terms
ROI (Region Of Interest): a region of the picture that may contain an object. Candidate boxes (proposals) can be generated on the input picture in advance to find such regions.
- A bounding box (bbox) is the localization concept: it gives the area of the picture where the object is located.
- It is generally expressed as (top, left, bottom, right) or (left, top, right, bottom), where (left, top) is the top-left corner of the bbox and (right, bottom) is the bottom-right corner.
IoU (Intersection over Union): measures the degree of overlap of two bboxes, IoU = (A ∩ B) / (A ∪ B). It is used to evaluate how far the algorithm's result is from the manual annotation (ground truth).
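The IoU definition can be sketched in a few lines. This is a minimal illustration, assuming boxes are given as (left, top, right, bottom) and that `iou` is just a name chosen here, not a function from any particular library.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (left, top, right, bottom)."""
    # Corners of the intersection rectangle.
    left = max(box_a[0], box_b[0])
    top = max(box_a[1], box_b[1])
    right = min(box_a[2], box_b[2])
    bottom = min(box_a[3], box_b[3])

    # Width and height clamp to zero when the boxes do not overlap.
    inter = max(0.0, right - left) * max(0.0, bottom - top)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2 x 2 boxes shifted by one pixel in each direction share an area of 1 out of a union of 7, so their IoU is 1/7.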
Methods
Two-stage
The problem is solved in two stages. First, a number of candidate boxes (object proposals) are selected from the picture as possible bboxes of the objects in it. Then each candidate box is fed to the network to obtain a classification result (multi-class: which kind of object is it, or is it background?) and a regression result (the exact coordinates of the box).
R-CNN
In stage one, a region proposal method such as selective search produces a series of proposals. In stage two, each region proposal is fed to a CNN, a feature vector is extracted, and the vector is passed to an SVM classifier.
Because every proposal is fed to the network, that is, classification and regression are done per region, it is called Region-based CNN.
Its limitations are that it involves a lot of repeated computation (the selected regions mostly overlap each other) and that selective search itself is not efficient.
Fast R-CNN
In stage two of R-CNN, feeding every proposal to the network separately causes a lot of repeated computation. Fast R-CNN improves on this: the entire picture is fed to the network once, and an ROI pooling layer is added before classification that maps the proposals extracted in stage one (i.e. the ROIs) onto the feature map computed in stage two. [Note that the proposals are given in the coordinate system of the original picture, so this also includes a conversion from the picture scale to the feature-map scale, generally a multiplication by a scale factor.]
On the other hand, Fast R-CNN also simplifies the overall training process. In R-CNN, the feature-extraction network, the SVM classifier, and the bbox regressor are trained in three separate phases. Fast R-CNN adds a softmax classification loss and a regression loss to the network and combines the two (usually as a weighted sum) into a multi-task loss. The network is trained with this loss, so training is reduced to a single stage.
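The weighted sum of the two losses can be sketched for a single ROI as follows. This is only an illustration under stated assumptions: the regression term is a smooth L1 loss, it is skipped for background ROIs, and the names `multi_task_loss`, `weight`, and `background` are hypothetical choices made here.

```python
import math

def softmax_cross_entropy(logits, label):
    """Classification loss: negative log-softmax of the true class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def smooth_l1(pred, target):
    """Regression loss: quadratic near zero, linear for large errors."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def multi_task_loss(logits, label, box_pred, box_target, weight=1.0, background=0):
    """Weighted sum of the classification and regression losses for one ROI."""
    cls_loss = softmax_cross_entropy(logits, label)
    # Assumption: background ROIs have no box to regress, so the term is dropped.
    reg_loss = 0.0 if label == background else smooth_l1(box_pred, box_target)
    return cls_loss + weight * reg_loss
```

Because both terms feed one scalar, a single backward pass updates the shared network for both tasks, which is what removes the separate training phases.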
Faster R-CNN
Fast R-CNN improves efficiency, but stage one still uses a method like selective search and is separate from stage two. This means training cannot be done end to end: if errors in stage one cause poor results in stage two, those results cannot be propagated back to stage one to adjust it during training.
Faster R-CNN therefore proposes the RPN (Region Proposal Network), which turns proposal extraction in stage one into a task solved by a network, and lets the stage-one and stage-two networks share part of their weights, which saves a lot of computation.
Faster R-CNN architecture
Faster R-CNN = RPN + Fast R-CNN
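As the figure shows, the RPN and Fast R-CNN share the convolutional layers used for feature extraction. Once the feature map is computed, the RPN goes on to generate proposals; the RPN output and the previously extracted feature map go through ROI pooling and serve as input to the rest of Fast R-CNN, which produces the classification and regression results.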
RPN
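The RPN only extracts a rough set of candidate boxes, then classifies them (binary: object or background?) and regresses them through the network to obtain more accurate candidate boxes, which are passed to Fast R-CNN for finer classification and regression.
It introduces the concept of the anchor:
Anchors are defined on the original picture.
Anchors are centered on each pixel of the feature map. Suppose the last feature map of the RPN has size f * f and the original picture has size n * n; the anchors are then f * f candidate boxes sampled uniformly over the n * n original picture, and their areas and aspect ratios are predefined (the anchor parameters).
An anchor can be understood as the region S of the original picture that a point s on the feature map maps back to. Note that this region is not the same as the receptive field.
The same center point can have anchors of several shapes/areas, representing regions of different shape/area.
Deciding whether an anchor belongs to an object amounts to checking whether its region of the original picture contains an object. If it contains only part of an object, the anchor area may be too small.
In other words, anchors are an encoding of the ground-truth bboxes. Many anchors are spread over one picture; if some region has a ground-truth bbox with class label c, then for the anchors overlapping that region the classification target should be class c and the regression target should be the offset to the ground-truth bbox.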
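The uniform placement of anchors described above can be sketched as below. This is an illustration, not the layout of any specific implementation: it assumes a square f * f feature map, a stride of n / f picture pixels per feature-map cell, anchor centers at the middle of each cell, and `ratios` meaning height / width. The function name and parameters are hypothetical.

```python
import math

def generate_anchors(feat_size, stride, scales, ratios):
    """Return anchors as (left, top, right, bottom) in picture coordinates."""
    anchors = []
    for row in range(feat_size):
        for col in range(feat_size):
            # Center of this feature-map cell projected back onto the picture.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # Keep the area at scale * scale while changing the shape.
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

With a 3 x 3 feature map, two scales, and three ratios, this yields 3 * 3 * 2 * 3 = 54 anchors: several shapes share each center point, as noted above.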
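One-stage
A series of candidate boxes is sampled directly over the whole picture.
The difference between the two: two-stage methods classify a sparse set of candidate boxes, while one-stage methods apply the classifier to a set of candidate boxes obtained by regular, dense sampling of the original picture.
- Why is one-stage sampling called regular? Because some two-stage methods use a learned method for sampling, while one-stage methods may simply sample uniformly over the original picture according to predefined anchor parameters such as number and size.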
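Common problems
Multi-scale
Different objects have different sizes. Some networks tend to detect large objects (those occupying a large area of the image) and struggle with small ones.
image pyramid
Resize the same input image to several sizes, feed each one to the network, and combine the detection results.
feature pyramid
Use only one input image, but take feature maps from different layers of the network (different layers mean different feature-map sizes), run classification and regression on each, and combine the results.
Translation invariance
Sample imbalance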
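Problems that may arise at each step
Input:
Input pictures may be multi-scale
- The input pictures may be differently scaled versions of the same picture. Some early networks accept only a fixed input size, so pictures have to be cropped, stretched, or squeezed to meet the size requirement.
Because one picture may contain several objects, most methods can be understood as splitting the picture into multiple sub-pictures and feeding each to the network.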
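Network:
Class imbalance between positive and negative samples
During training there are far fewer positive samples (objects) than negative samples (background). This is very common in one-stage methods, because they obtain candidate boxes by dense sampling.
Solutions:
- hard negative mining: when computing the classification loss, use only the positive samples and a subset of the negative samples. The selected negatives are those classified as background with low confidence (i.e. low confidence in the correct class), and are called "hard negatives".
- focal loss: let all samples take part in the classification loss, and adjust the weights according to the confidence assigned to the correct class. Easy samples (high confidence in the correct class) get a lower weight, and hard samples (low confidence in the correct class) get a higher weight.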
...
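The two remedies for imbalance can be contrasted with a small sketch. It assumes the focal loss form -(1 - p)^gamma * log(p), where p is the predicted probability of the correct class, with gamma = 2 as an illustrative default; `hard_negative_indices` and `num_keep` are hypothetical names for the "keep only the hardest negatives" step.

```python
import math

def focal_loss(p_correct, gamma=2.0):
    """Focal loss for one sample, given the probability of its correct class."""
    # (1 - p)^gamma shrinks the loss of easy samples (p close to 1).
    return -((1.0 - p_correct) ** gamma) * math.log(p_correct)

def hard_negative_indices(bg_probs, num_keep):
    """Indices of the negatives with the lowest background probability."""
    order = sorted(range(len(bg_probs)), key=lambda i: bg_probs[i])
    return order[:num_keep]
```

With gamma = 0 the focal loss reduces to ordinary cross entropy, which makes the design difference visible: hard negative mining discards easy negatives outright, while focal loss keeps every sample and only scales its contribution.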
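ROI pooling has two functions:
Map the ROI from the original picture onto the feature map, so that the whole picture is fed to the network only once instead of feeding each ROI of the picture separately;
Pool every ROI to a fixed size, i.e. ROIs of different sizes yield fixed-size feature vectors after pooling, which simplifies the subsequent classification and regression.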
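Both functions of ROI pooling can be seen in a single-channel sketch. It is a simplified illustration under several assumptions: the feature map is a plain 2D list, coordinates are rounded to the nearest feature-map cell, and `spatial_scale` stands for the feature-map-to-picture size ratio. The name `roi_max_pool` is hypothetical.

```python
def roi_max_pool(feature_map, roi, spatial_scale, out_size):
    """Max-pool one ROI to an out_size x out_size grid.

    feature_map: 2D list (rows of numbers) for a single channel.
    roi: (left, top, right, bottom) in original-picture pixels.
    spatial_scale: feature-map size divided by picture size, e.g. 1/16.
    """
    height, width = len(feature_map), len(feature_map[0])

    # Function 1: map the ROI from picture coordinates to feature-map coordinates.
    x0, y0, x1, y1 = (int(round(v * spatial_scale)) for v in roi)
    x0 = max(0, min(x0, width - 1))
    y0 = max(0, min(y0, height - 1))
    x1 = max(x0 + 1, min(x1, width))   # keep at least one column
    y1 = max(y0 + 1, min(y1, height))  # keep at least one row
    roi_w, roi_h = x1 - x0, y1 - y0

    # Function 2: split the ROI into out_size x out_size bins and max-pool each.
    pooled = []
    for i in range(out_size):
        r_start = y0 + (i * roi_h) // out_size
        r_end = y0 + -(-((i + 1) * roi_h) // out_size)  # ceiling division
        row = []
        for j in range(out_size):
            c_start = x0 + (j * roi_w) // out_size
            c_end = x0 + -(-((j + 1) * roi_w) // out_size)
            row.append(max(feature_map[r][c]
                           for r in range(r_start, r_end)
                           for c in range(c_start, c_end)))
        pooled.append(row)
    return pooled
```

Whatever the ROI size, the result is always out_size x out_size, so the later fully connected layers see a fixed-length input.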
anchor
- Methods that use anchors (candidate boxes with predefined sizes and aspect ratios) need to encode the ground-truth bboxes relative to the anchors for training. The target of the regression branch is the offset between an anchor and the ground-truth bbox, and the final output has to be decoded back into the predicted bbox position (in effect, anchor plus offset).
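The encode/decode step can be sketched as a pair of inverse functions. This assumes the center-size parameterization commonly used in the R-CNN family (center shifts normalized by anchor size, log-ratios for width and height); other detectors may scale or define the offsets differently, and the function names are illustrative.

```python
import math

def to_center(box):
    """Convert (left, top, right, bottom) to (center_x, center_y, width, height)."""
    left, top, right, bottom = box
    w, h = right - left, bottom - top
    return left + w / 2, top + h / 2, w, h

def encode(anchor, gt_box):
    """Regression target: offset of the ground-truth bbox relative to the anchor."""
    ax, ay, aw, ah = to_center(anchor)
    gx, gy, gw, gh = to_center(gt_box)
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))

def decode(anchor, offsets):
    """Predicted bbox: apply the offsets output by the network to the anchor."""
    ax, ay, aw, ah = to_center(anchor)
    tx, ty, tw, th = offsets
    cx, cy = ax + tx * aw, ay + ty * ah
    w, h = aw * math.exp(tw), ah * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

`encode` is applied to the ground truth when building training targets, and `decode` is applied to the network output at inference time; decoding an encoded box returns the original box.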
Output:
Non-maximum suppression (NMS)
- A single real object may be surrounded by several detected candidate boxes, i.e. it has several detection results. The non-maximum ones are not needed, so they are suppressed according to confidence (understood as the class score), leaving only the detection with the highest confidence.
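A greedy form of NMS can be sketched as follows. The IoU threshold of 0.5 is an assumed example value, boxes are (left, top, right, bottom), and in practice the procedure is usually run separately per class; the function names are chosen here for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (left, top, right, bottom)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

Two heavily overlapping boxes around the same object collapse to the higher-scoring one, while a distant box for another object is left untouched.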
Evaluation criteria
TP and FP: used to compute precision and recall
mAP (mean average precision): with the VOC 11-point method, precision and recall are computed at different thresholds and the PR curve is drawn
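The 11-point method can be sketched for a single class as below. It assumes each detection has already been marked as TP or FP (for example by an IoU test against the ground truth) and that `num_gt` is the number of ground-truth objects of that class; the function name and argument names are hypothetical. mAP is then the mean of this value over all classes.

```python
def average_precision_11pt(scores, is_true_positive, num_gt):
    """11-point interpolated AP for one class."""
    # Lowering the score threshold step by step = walking the sorted detections.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))

    # Average the best precision reachable at recall >= 0.0, 0.1, ..., 1.0.
    ap = 0.0
    for k in range(11):
        r = k / 10
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        ap += max(candidates) if candidates else 0.0
    return ap / 11
```

Each prefix of the sorted detections gives one (recall, precision) point of the PR curve, and the 11 sampled recall levels summarize that curve as a single number.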
Reference material
https://zhuanlan.zhihu.com/p/34142321 2D summary
https://zhuanlan.zhihu.com/p/34179420 model evaluation and training techniques
https://blog.csdn.net/JNingWei/article/details/80039079 another author's summary of each category of methods, with tables and framework diagrams