Deep learning: basic concepts and common methods of object detection

What is detection?

A detection task is classification + localization.

Screenshot from the cs231n course

From left to right: semantic segmentation, image classification, object detection, and instance segmentation.

Key Terms

  • ROI (Region of Interest): a region of the image that may contain an object. Pre-defined boxes can be laid over the input image to find candidate proposals.

  • Bounding box (bbox): a concept from the localization task; it gives the region of the image where the object is located.
    • Usually expressed as (top, left, bottom, right) or (left, top, right, bottom), where (left, top) is the top-left corner of the bbox and (right, bottom) is the bottom-right corner.
  • IoU (Intersection over Union): the degree of overlap between two bboxes, IoU = (A ∩ B) / (A ∪ B). It measures how far the algorithm's result is from the manual annotation (ground truth).
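
The IoU definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming boxes in the (left, top, right, bottom) convention with the origin at the top-left of the image; the function name `iou` is chosen here for illustration only.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (left, top, right, bottom)."""
    # Corners of the intersection rectangle.
    left = max(box_a[0], box_b[0])
    top = max(box_a[1], box_b[1])
    right = min(box_a[2], box_b[2])
    bottom = min(box_a[3], box_b[3])

    # Width and height are clamped at 0 so disjoint boxes give no overlap.
    inter = max(0.0, right - left) * max(0.0, bottom - top)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7, so 1/7
```

An IoU of 1 means the two boxes coincide, and 0 means they do not overlap at all.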


Methods

Two-stage

Two-stage methods solve the problem in two stages. First, a number of candidate boxes (object proposals) are selected from the image as possible bboxes for the objects in it. Then each candidate box is fed to the network, which produces a classification result for the box (multi-class: which kind of object, or background?) and a regression result (the precise coordinates of the location).

R-CNN

In stage one, a region proposal method such as selective search produces a series of proposals; in stage two, each region proposal is fed to a CNN, a feature vector is extracted, and the vector is passed to an SVM classifier.

  • Because the proposals are what is fed to the network, i.e. classification and regression are done per region, it is called Region-based CNN.

  • Its limitations: it involves a lot of repeated computation (most of the selected regions overlap each other), and selective search itself is not efficient.

Fast R-CNN

In R-CNN's stage two, every proposal is fed to the network separately, which causes a lot of repeated computation. Fast R-CNN improves on this by feeding the whole image to the network once and adding an ROI pooling layer before classification, which maps the proposals (i.e. ROIs) extracted in stage one onto the feature map computed in stage two. [Note that proposals are expressed in the coordinate system of the original image, so this also involves converting from the image scale to the feature-map scale, usually by multiplying by a scale factor.]
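
To make the ROI pooling step concrete, here is a minimal NumPy sketch for a single-channel feature map. It is an illustration under stated assumptions, not Fast R-CNN's actual implementation: the function name `roi_max_pool`, the rounding of coordinates, and the way bins are split are all choices made here for clarity.

```python
import numpy as np


def roi_max_pool(feature_map, roi, spatial_scale, output_size=2):
    """Max-pool one ROI to a fixed output_size x output_size grid.

    feature_map: 2D array of shape (H, W).
    roi: (left, top, right, bottom) in ORIGINAL IMAGE coordinates.
    spatial_scale: feature-map size divided by image size (e.g. 1/16).
    """
    h, w = feature_map.shape

    # Convert image coordinates to feature-map coordinates with the scale factor.
    left, top, right, bottom = [int(round(c * spatial_scale)) for c in roi]

    # Clamp to the feature map and make sure the region has at least one cell.
    left = min(max(left, 0), w - 1)
    top = min(max(top, 0), h - 1)
    right = min(max(right, left + 1), w)
    bottom = min(max(bottom, top + 1), h)
    region = feature_map[top:bottom, left:right]

    # Split the region into a fixed grid of bins and take the max of each bin.
    rh, rw = region.shape
    ys = np.linspace(0, rh, output_size + 1)
    xs = np.linspace(0, rw, output_size + 1)
    out = np.empty((output_size, output_size), dtype=feature_map.dtype)
    for i in range(output_size):
        y0 = int(np.floor(ys[i]))
        y1 = max(int(np.ceil(ys[i + 1])), y0 + 1)
        for j in range(output_size):
            x0 = int(np.floor(xs[j]))
            x1 = max(int(np.ceil(xs[j + 1])), x0 + 1)
            out[i, j] = region[y0:y1, x0:x1].max()
    return out


fmap = np.arange(16).reshape(4, 4)       # stand-in for a 4 x 4 feature map
pooled = roi_max_pool(fmap, (0, 0, 64, 64), spatial_scale=1 / 16)
print(pooled.tolist())                   # [[5, 7], [13, 15]]
```

Whatever the size of the ROI, the output has the same fixed shape, which is what lets the later fully connected layers accept it.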

Fast R-CNN also simplifies the overall training process. In R-CNN, the feature-extraction network, the SVM classifier, and the bbox regressor are trained in three separate phases. Fast R-CNN instead adds a classification loss (softmax loss) and a regression loss to the network and combines the two (usually as a weighted sum) into a multi-task loss used to train the network, so training is reduced to a single stage.
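
A weighted sum of a classification loss and a regression loss for one ROI might look like the sketch below. The text only says "regression loss"; the smooth L1 form, the convention that label 0 is background (and contributes no regression loss), and the names `multi_task_loss` and `reg_weight` are assumptions made for this illustration.

```python
import numpy as np


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for large errors (assumed form)."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)


def multi_task_loss(class_logits, class_label, box_pred, box_target, reg_weight=1.0):
    """Classification loss plus weighted regression loss for a single ROI."""
    # Softmax cross-entropy, computed in a numerically stable way.
    shifted = class_logits - class_logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    cls_loss = -log_probs[class_label]

    # Assumption: label 0 is background, which has no box to regress.
    if class_label > 0:
        reg_loss = smooth_l1(box_pred - box_target).sum()
    else:
        reg_loss = 0.0

    return cls_loss + reg_weight * reg_loss


logits = np.array([2.0, 0.0, 0.0])
print(multi_task_loss(logits, 1, np.ones(4), np.zeros(4)))
```

Because both terms are part of one scalar loss, a single backward pass updates the shared network for both tasks.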

Faster R-CNN

Fast R-CNN improves efficiency, but stage one still uses a method like selective search, so stage one and stage two remain separate. This means training cannot be done end to end: if an error in stage one leads to poor performance in stage two, that poor result cannot be propagated back to stage one during training to adjust it.

Faster R-CNN therefore proposes the RPN (Region Proposal Network), which hands the stage-one task of extracting proposals to a network as well, and lets the stage-one and stage-two networks share part of their weights, saving a lot of computation.

Faster R-CNN architecture

Faster R-CNN = RPN + Fast R-CNN

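The figure shows that the two networks, RPN and Fast R-CNN, share the convolutional layers used for feature extraction. Once the feature map is computed, the RPN goes on to generate proposals; the RPN output and the previously extracted feature map pass through ROI pooling and become the input to the remaining part of Fast R-CNN, which produces the classification and regression results.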

RPN

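The RPN only extracts a rough set of candidate boxes, then classifies them through the network (binary: object or background?) and regresses them to obtain more accurate candidate boxes, which are passed to Fast R-CNN for finer classification and regression.

It introduces the concept of the anchor:

  • Anchors are defined on the original image.

  • Anchors are centered on each pixel of the feature map. If the last feature map of the RPN has size f * f and the original image has size n * n, then the anchors amount to f * f candidate boxes sampled uniformly over the n * n image; their areas and aspect ratios are predefined (the anchor parameters).

  • An anchor can be understood as the region S in the original image that corresponds to a point s on the feature map. Note that this corresponding region is not the same as the receptive field.

  • The same center point can have anchors of several shapes and areas, representing regions of different shapes and areas.

  • Deciding whether an anchor belongs to an object amounts to checking whether its region in the original image contains an object. If it contains only part of an object, the anchor's area may be too small.

  • In other words, anchors are an encoding of the ground-truth bboxes. Many anchors are spread over an image; if some region has a ground-truth bbox with class label c, then the anchors overlapping that region should have class c as their classification target and the offset to the ground-truth bbox as their regression target.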
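
The anchor layout described in the list above (an f * f grid of centers on an n * n image, with several predefined areas and aspect ratios per center) can be sketched as follows. The function name `generate_anchors`, the default scales and ratios, and the choice of placing each center in the middle of its feature-map cell are assumptions for illustration.

```python
import numpy as np


def generate_anchors(feat_size, stride, scales=(64.0, 128.0), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (left, top, right, bottom) in image coordinates.

    feat_size: side length f of the feature map.
    stride: image size n divided by f, i.e. image pixels per feature-map cell.
    scales: square roots of the anchor areas (assumed values).
    ratios: width / height aspect ratios (assumed values).
    """
    anchors = []
    for row in range(feat_size):
        for col in range(feat_size):
            # Center of this feature-map cell, mapped back to the image.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # Keep the area at scale * scale while changing the shape.
                    w = scale * np.sqrt(ratio)
                    h = scale / np.sqrt(ratio)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)


anchors = generate_anchors(feat_size=4, stride=16)  # 64 x 64 image, 4 x 4 feature map
print(anchors.shape)  # (96, 4): 16 centers, 6 anchors per center
```

With more than one shape per center the total is f * f times the number of shapes, and anchors near the border may extend outside the image.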

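One-stage

Candidate boxes are sampled directly over the whole image.

  • The difference between the two: two-stage methods classify a sparse set of candidate boxes, while one-stage methods apply the classifier to a set of candidate boxes obtained by regular, dense sampling of the original image.

    • What does regular sampling mean here? Some two-stage methods use a learned method for sampling, whereas a one-stage method may simply sample uniformly over the original image according to predefined anchor parameters such as number and size.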

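Common problems

Multi-scale

Objects come in different sizes. Some networks tend to detect large objects (those occupying a large area of the image) and struggle with small ones.

image pyramid

Resize the same input image to several sizes, feed each to the network, and combine the detection results.

feature pyramid

Use a single input image, but take feature maps from different layers of the network (different layers mean different feature-map sizes), run classification and regression on each, and combine the results.

Translation invariance

Sample imbalance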
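
Focal loss is one remedy for this imbalance: every sample contributes to the classification loss, but samples the model already classifies confidently are weighted down. Below is a minimal NumPy sketch of the binary form. The function name `focal_loss` is chosen here, and the default values gamma = 2 and alpha = 0.25 are assumed hyperparameters rather than values given in this article.

```python
import numpy as np


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Per-sample binary focal loss.

    p: predicted probability of the object (positive) class.
    y: 1 for object, 0 for background.
    gamma: how strongly easy samples are down-weighted (assumed default).
    alpha: balance between positives and negatives (assumed default).
    """
    p = np.clip(np.asarray(p, dtype=float), 1e-7, 1 - 1e-7)
    y = np.asarray(y)

    # p_t is the probability the model assigns to the CORRECT class.
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)

    # (1 - p_t) ** gamma is close to 0 for easy samples and close to 1 for hard ones.
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)


# Two background samples: one easy (p = 0.01), one hard (p = 0.9).
losses = focal_loss([0.01, 0.9], [0, 0])
print(losses)  # the hard negative dominates the loss
```

Setting gamma to 0 removes the modulating factor and leaves an alpha-weighted cross-entropy, which is a quick way to see what the extra term contributes.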


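Problems that may arise at each step

Input:

  • Input images may be multi-scale

    • The inputs may be differently scaled versions of the same image. Some early networks only accept fixed-size input, so images must be cropped, stretched, or compressed to meet the size requirement.
  • Because one image may contain several objects, most methods can be understood as splitting an image into multiple sub-images and feeding each to the network.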

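Network:

  • Class imbalance between positive and negative samples

    • During training, positive samples (objects) are far fewer than negative samples (background). This is very common in one-stage methods, because they obtain candidate boxes by dense sampling.

    • Solutions:

      • Hard negative mining: when computing the classification loss, use only the positive samples and a subset of the negative samples. The selected negatives are those classified as background with low confidence (i.e. low confidence in the correct class), called "hard negatives".

      • Focal loss: lets all samples take part in the classification loss, with weights adjusted by the confidence assigned to the correct class. Easy samples (high confidence in the correct class) are weighted down and hard samples (low confidence in the correct class) are weighted up.
      • ...

  • ROI pooling serves two purposes:

    • It maps ROIs from the original image onto the feature map, so the whole image is fed to the network once instead of feeding each ROI separately;

    • It pools every ROI to a fixed size, so ROIs of different sizes yield fixed-size feature vectors, which makes the subsequent classification and regression easier.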

  • anchor

    • Methods of this kind use candidate boxes with predefined sizes and aspect ratios. For training, the ground-truth bboxes must be encoded in anchor form: the regression branch's target is the offset between the anchor and the ground-truth bbox. At output time the prediction must be decoded into the bbox position (in effect, the anchor plus the offset).
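
The encode and decode steps for anchors can be sketched as a pair of inverse functions. The article only says "offset"; the center-and-size parameterization below (center shifts scaled by anchor size, log ratios for width and height) is the one commonly used in the R-CNN family and is an assumption here, as are the function names.

```python
import numpy as np


def to_center(box):
    """(left, top, right, bottom) -> (center_x, center_y, width, height)."""
    w, h = box[2] - box[0], box[3] - box[1]
    return box[0] + w / 2, box[1] + h / 2, w, h


def encode(anchor, gt_box):
    """Regression target: offset of the ground-truth box relative to the anchor."""
    ax, ay, aw, ah = to_center(anchor)
    gx, gy, gw, gh = to_center(gt_box)
    return np.array([(gx - ax) / aw, (gy - ay) / ah, np.log(gw / aw), np.log(gh / ah)])


def decode(anchor, offsets):
    """Apply predicted offsets to the anchor to recover a box."""
    ax, ay, aw, ah = to_center(anchor)
    dx, dy, dw, dh = offsets
    cx, cy = ax + dx * aw, ay + dy * ah
    w, h = aw * np.exp(dw), ah * np.exp(dh)
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])


anchor = np.array([0.0, 0.0, 64.0, 64.0])
gt = np.array([10.0, 20.0, 50.0, 80.0])
target = encode(anchor, gt)      # what the regression branch is trained to output
print(decode(anchor, target))    # recovers the ground-truth box
```

Expressing targets relative to the anchor keeps them in a similar numeric range for small and large boxes, which is one reason offsets are regressed instead of raw coordinates.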

Output:

  • Non-maximum suppression (NMS)

    • A single real object may be surrounded by several detected candidate boxes, i.e. it has several detection results. The non-maximum ones are not needed: the results are suppressed by confidence (understood as the class score), keeping only the detection with the highest confidence.
  • Criteria

    • TP and FP, used to compute precision and recall

    • mAP (mean average precision): VOC's 11-point method takes different thresholds, computes precision and recall at each, and draws the PR curve.
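
The non-maximum suppression step listed above can be sketched as a greedy loop: keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. This is a minimal single-class illustration; the function name `nms` and the IoU threshold of 0.5 are assumptions, and boxes are taken to be (left, top, right, bottom) with positive area.

```python
import numpy as np


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, highest score first."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

    order = scores.argsort()[::-1]  # indices sorted by descending confidence
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]

        # IoU between the best box and every remaining box.
        left = np.maximum(boxes[best, 0], boxes[rest, 0])
        top = np.maximum(boxes[best, 1], boxes[rest, 1])
        right = np.minimum(boxes[best, 2], boxes[rest, 2])
        bottom = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(right - left, 0, None) * np.clip(bottom - top, 0, None)
        overlap = inter / (areas[best] + areas[rest] - inter)

        # Boxes that overlap the best one too much are treated as duplicates.
        order = rest[overlap <= iou_threshold]
    return keep


boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box duplicates the first
```

In a multi-class detector this loop is typically run once per class, so that overlapping boxes of different classes do not suppress each other.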

Reference material

https://zhuanlan.zhihu.com/p/34142321 2D detection summary

https://zhuanlan.zhihu.com/p/34179420 model evaluation and training techniques

https://blog.csdn.net/JNingWei/article/details/80039079 another author's summary of each category, explained with framework diagrams


Origin www.cnblogs.com/notesbyY/p/10986930.html