Cascade R-CNN

Paper Information

Zhaowei Cai, Nuno Vasconcelos. Cascade R-CNN: Delving into High Quality Object Detection. CVPR 2018.

Foreword

This section serves as both a preface and a summary.

Training an object detector with a low IoU threshold usually produces noisy detections, yet simply raising the threshold does not improve performance, mainly for two reasons:

  1. As the threshold is raised, the number of positive samples decreases "exponentially".
  2. The IoU for which the model is optimized does not match the IoU of the input proposals at inference.

Cascade R-CNN is proposed to solve these two problems.

It consists of a sequence of detectors trained with increasing IoU thresholds and combined in order. The point is that the output of each stage is better distributed, which makes it well suited as the input of the next stage. This progressive resampling of the proposals (called hypotheses in the original paper, likewise below) keeps the size of the positive set roughly constant across stages, which eases the overfitting problem (probably because the gradual increase lets somewhat poorer data still be used, so the positive set is enlarged and overfitting is relieved).

Introduction

R-CNN and similar models commonly use an IoU threshold of 0.5, which is too loose a requirement for positives and results in too many proposals that do not really qualify (the original paper calls them noisy bounding boxes). The figure below compares thresholds of 0.5 and 0.7:

compareiou

From the figure we can easily see that the left image contains far more bounding boxes than the right, and most of the extra boxes are meaningless.

The underlying assumption is that for examples with IoU greater than 0.5, a human can fairly easily distinguish true objects from things the machine wrongly judged to be objects (false positives, FP for short). For examples with IoU below 0.5, neither humans nor machines can efficiently separate them from FPs.

The work of this paper is to first generate a series of proposals (hypotheses, in the original terminology) with increasing IoU, and then train a detector at each corresponding IoU. This seems to resolve the two problems raised by the authors, namely the difficulty of learning high-quality object detectors (the output samples of previous detectors often contain many FPs).

An important idea of the paper is that each individual detector is optimized for a single IoU (called a quality level in the original paper). There was earlier work with a similar flavor, but it differs from this paper: it optimized for a given FP rate, whereas here each detector is optimized for a given IoU threshold.

iounumscp

From these two plots we can see that, in general, a detector with a high preset IoU produces better output on high-IoU input samples, and a detector with a low preset IoU produces better output on low-IoU input samples. In other words, performance is best when the input IoU matches the preset threshold.

However, to obtain a high-quality detector, simply raising the threshold is useless: as the right plot shows, the output AP still drops as the threshold increases. The authors believe this may be because raising the threshold leaves too few positive samples; neural networks are rather fragile, and too few samples easily lead to overfitting. The other problem is the mismatch between the input IoU and the preset threshold just mentioned.

Object Detection

The author first gives a schematic diagram of the currently popular methods; this figure is used many times later in the paper, so we will call it the structure diagram. The meaning of the main letters in the figure is also explained:

archofnowdays

Bounding Box Regression

We know that a bbox for an image patch \(x\) usually consists of four coordinates: \(b = (b_x, b_y, b_w, b_h)\). Bbox regression means regressing this predicted bbox onto the ground-truth bbox \(g\); the process is carried out by a regressor \(f(x, b)\), so in the end the following function is optimized:
\[ R_{loc}[f] = \sum\limits_{i = 1}^{N}L_{loc}(f(x_i, b_i), g_i) \]
where \(L_{loc}\) is an \(L_2\) loss in R-CNN and a smooth \(L_1\) loss in Fast R-CNN. To make the prediction as close to the ground truth as possible, \(L_{loc}\) actually operates on a distance vector:
\[ \Delta = (\delta_x, \delta_y, \delta_w, \delta_h) \]
where:
\[ \delta_x = (g_x - b_x) / b_w\\ \delta_y = (g_y - b_y) / b_h\\ \delta_w = \log(g_w / b_w)\\ \delta_h = \log(g_h / b_h) \]
It should be pointed out that in bbox regression \(b\) and \(g\) generally differ only slightly, which makes \(L_{loc}\) very small; to improve its effectiveness, the deltas are usually normalized to \(N(0, 1)\), i.e. \(\delta_x' = (\delta_x - \mu_x) / \sigma_x\).
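As a concrete illustration, here is a minimal PyTorch-style sketch of how the regression targets \(\Delta\) and their normalization might be computed. The box layout (x, y, w, h) and the `mean`/`std` arguments are assumptions for illustration; the normalization statistics would normally be estimated over the training set.

```python
import torch

def bbox_regression_targets(b, g, mean=None, std=None):
    """Compute regression deltas from proposal boxes b to ground-truth boxes g.
    Both are (N, 4) tensors in (x, y, w, h) layout (convention assumed here).
    Follows the delta definitions above; mean/std normalization is optional."""
    dx = (g[:, 0] - b[:, 0]) / b[:, 2]
    dy = (g[:, 1] - b[:, 1]) / b[:, 3]
    dw = torch.log(g[:, 2] / b[:, 2])
    dh = torch.log(g[:, 3] / b[:, 3])
    deltas = torch.stack([dx, dy, dw, dh], dim=1)
    if mean is not None and std is not None:
        deltas = (deltas - mean) / std  # roughly N(0, 1) per coordinate
    return deltas
```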

Earlier work argued that a single regression step of \(f\) is not accurate enough for localization, so they apply \(f\) repeatedly:
\[ f'(x, b) = f \circ f \circ \cdots \circ f(x, b) \]
This is the so-called iterative bounding box regression, corresponding to (b) in the structure diagram above, but it still has two problems:

  1. The regressor \(f\) is trained at a threshold of 0.5, so it is under-optimized for proposals with higher IoU, and the degradation is particularly obvious for proposals with IoU above 0.85.

  2. The distribution changes significantly after each iteration; the initial distribution may be good, but after a few iterations the results can actually become worse. The figure below gives an example.

    iteration

Because of these characteristics, this method needs some post-processing. It is therefore also unstable, and usually there is little further change after more than two iterations.
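For contrast with the cascaded version introduced later, here is a rough sketch of what iterative bounding box regression looks like at inference time: the same single regressor \(f\), trained once at the 0.5 threshold, is simply re-applied to its own output. The function names are placeholders, not the paper's code.

```python
def iterative_bbox_regression(f, x, b, num_iters=3):
    """Iterative bbox regression (structure diagram (b)): the *same* regressor f,
    trained once at u = 0.5, is re-applied to its own output.
    f(features, boxes) -> refined boxes. Purely a sketch; in practice this
    needs extra post-processing and gains little after about two iterations."""
    for _ in range(num_iters):
        b = f(x, b)  # re-regress the previously refined boxes
    return b
```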

Classification

Essentially unchanged from previous methods: for classification, proposals are divided into \(M + 1\) classes, where class 0 is the background (bg). The prediction is \(h_k(x) = p(y = k | x)\), where \(y\) is the class of the object being predicted, and the function being optimized is:
\[ R_{cls}[h] = \sum\limits_{i = 1}^NL_{cls}(h(x_i), y_i) \]
where \(L_{cls}\) is the classic cross-entropy loss.
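As a concrete, if trivial, illustration of this term, a minimal PyTorch sketch with assumed tensor shapes:

```python
import torch.nn.functional as F

def classification_loss(class_logits, labels):
    """Classic cross-entropy over M + 1 classes (class 0 = background).
    class_logits: (N, M + 1) raw scores; labels: (N,) integer class ids."""
    return F.cross_entropy(class_logits, labels)
```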

Detection Quality

As before, when the proposal IoU is greater than some threshold, a class label \(y\) is assigned; otherwise the proposal is background (label \(y = 0\)). The pros and cons of setting the IoU high or low have already been discussed. A previous approach, (c) in the structure diagram, computes and optimizes a loss over the outputs at multiple levels:
\[ L_{cls}(h(x), y) = \sum\limits_{u \in U}L_{cls}(h_u(x), y_u) \]
Here \(U\) is the set of IoU thresholds. All classifiers are then used together during inference, but a key problem is that the different classifiers receive different numbers of positives! The left plot of the figure below shows exactly this situation: first, high-IoU samples are so few that overfitting comes easily; second, the classifiers with a high preset IoU still have to handle many unsuitable low-IoU samples. Please also keep this figure in mind; we will call it the distribution diagram.

distribution
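To make the labeling rule and the multi-threshold (integral) loss of structure diagram (c) concrete, here is a hedged sketch: proposals whose IoU with their best-matching ground truth is at least \(u\) keep the ground-truth class, the rest become background, and the losses of the per-threshold classifier heads are summed. Function names, shapes, and the example thresholds are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def assign_labels(max_ious, gt_classes, u):
    """Label proposals at a single IoU threshold u: keep the matched
    ground-truth class if IoU >= u, otherwise label as background (0)."""
    labels = gt_classes.clone()
    labels[max_ious < u] = 0
    return labels

def integral_loss(per_threshold_logits, max_ious, gt_classes,
                  thresholds=(0.5, 0.6, 0.7)):
    """Integral loss of structure diagram (c): one classifier head per IoU
    threshold u in U, all sharing the same proposals, with losses summed.
    per_threshold_logits: dict u -> (N, M + 1) logits of that head."""
    loss = 0.0
    for u in thresholds:
        labels_u = assign_labels(max_ious, gt_classes, u)
        loss = loss + F.cross_entropy(per_threshold_logits[u], labels_u)
    return loss
```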

Cascade R-CNN

The structure is shown in (d) of the structure diagram.

Cascaded Bounding Box Regression

Since a single classifier can hardly adapt to multiple IoUs, the authors design multiple sequential classifiers. Corresponding to iterative bounding box regression, the structure of this paper is:
\[ f'(x, b) = f_T \circ f_{T - 1} \circ \cdots \circ f_1(x, b) \]
where each regressor \(f_t\) is optimized for the resampled distribution arriving at its own stage, rather than for the initial distribution.

Its differences from iterative bounding box regression (IBBR for short) are the following:

  • IBBR repeatedly applies the same network for iterative refinement, while cascaded regression resamples so that the output of each stage can be used by the next stage.
  • The cascaded regressors are used both at training and at inference time, so there is no mismatch between the training and inference distributions.
  • The output of each stage is resampled and each stage is then optimized on it, instead of, as in IBBR, effectively only optimizing for the initial input.

I would like to explain why regressors whose input has low IoU can in the end produce output adapted to higher IoU. This relies on the left plot of the paper's second figure, which I paste again here for convenience:

iounumscp

From the left plot we can see that the output IoU is better than the input IoU in most cases. So if we set up the regressors with progressively increasing thresholds, the final output reaches high IoUs that a single regressor could almost never achieve.
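A minimal sketch of the cascade at inference time, to contrast with the iterative version above: each stage has its own head (classifier plus regressor) trained at an increasing threshold \(u^t\), and the boxes refined by stage \(t - 1\) are re-pooled as the input hypotheses of stage \(t\). Function and variable names are made up for illustration, not the authors' code.

```python
def cascade_forward(stages, features, proposals):
    """Cascade R-CNN inference sketch (structure diagram (d)).
    stages: list of (pool_fn, head) pairs, one per threshold u^1 < u^2 < ...;
    each head returns (class_scores, refined_boxes). Unlike IBBR, every stage
    uses its *own* head, trained on the boxes produced by the previous stage."""
    boxes = proposals
    all_scores = []
    for pool_fn, head in stages:
        x = pool_fn(features, boxes)    # re-pool features for the current boxes
        scores, boxes = head(x, boxes)  # stage-specific classifier + regressor
        all_scores.append(scores)
    # one common choice: average the per-stage classification scores
    final_scores = sum(all_scores) / len(all_scores)
    return final_scores, boxes
```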

Cascade Detection

From the distribution diagram we can see that after each stage the center of mass of the distribution shifts toward higher IoU, which has two benefits:

  1. It is less prone to overfitting.
  2. The detector can be trained on high-IoU examples, which alleviates the mismatch problem mentioned before.

At each stage \(t\), the R-CNN classifier \(h_t\) and regressor \(f_t\) are optimized at a threshold \(u^t\), with \(u^t > u^{t - 1}\). The loss is:
\[ L(x^t, g) = L_{cls}(h_t(x^t), y^t) + \lambda [y^t \geq 1] L_{loc}(f_t(x^t, b^t), g) \]
where \(b^t = f_{t - 1}(x^{t - 1}, b^{t - 1})\), \(g\) is the ground truth of \(x^t\), \(\lambda\) is a trade-off parameter, and \([y^t \geq 1]\) means that \(L_{loc}\) is only computed for non-background samples.
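Putting this stage-wise loss into code form, a hedged PyTorch-style sketch with assumed names: the regression term is counted only for non-background samples, matching the indicator \([y^t \geq 1]\), and smooth \(L_1\) is used as in Fast R-CNN.

```python
import torch.nn.functional as F

def stage_loss(class_logits, pred_deltas, labels, reg_targets, lam=1.0):
    """Loss of a single cascade stage t:
    L = L_cls(h_t(x^t), y^t) + lam * [y^t >= 1] * L_loc(f_t(x^t, b^t), g).
    labels are assigned with this stage's threshold u^t; background = 0."""
    cls_loss = F.cross_entropy(class_logits, labels)
    fg = labels >= 1  # the indicator [y^t >= 1]
    if fg.any():
        loc_loss = F.smooth_l1_loss(pred_deltas[fg], reg_targets[fg])
    else:
        loc_loss = class_logits.new_zeros(())
    return cls_loss + lam * loc_loss
```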

Experimental Results

Only horizontal flipping is used for data augmentation; there are no other tricks.

The following are comparative experiments against other models; since the contents are fairly intuitive, I may not add further analysis for them later.

e1

e2

e3

e4

Conclusion

Regarding the two issues raised at the beginning, this is how the author addresses them:

  1. Multiple stages gradually raise the IoU threshold, so that more high-IoU samples are obtained from the low-IoU samples.
  2. The classifier is trained on the high-IoU samples output by the last stage so that it adapts to them, and it therefore performs better when processing high-IoU samples at inference.
