Fast R-CNN Reading Notes


Background

Recently, deep ConvNets have significantly improved image classification and object detection accuracy.
Deep ConvNets have greatly improved image-classification accuracy, but object detection still faces two major challenges:
- First, numerous candidate object locations (often called “proposals”) must be processed.
A large number of candidate boxes (proposals) must be generated and processed.
- Second, these candidates provide only rough localization that must be refined to achieve precise localization.
These proposals give only rough localization and must be refined to achieve precise localization.

Contributions of this paper

We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.
A single-stage training method is proposed that jointly learns to classify object proposals and refine their locations.

Comparison with other models:

Speed:
  training: 9x faster than R-CNN, 3x faster than SPPnet
  test: 213x faster than R-CNN, 10x faster than SPPnet
Accuracy (mAP):
  R-CNN: 62%
  Fast R-CNN: 66%

Drawbacks of the R-CNN model
  1. Training is a multi-stage pipeline. R-CNN first finetunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned.
    A multi-stage training pipeline: the ConvNet is first fine-tuned on object proposals with log loss, then the feature vectors extracted for the proposals are fed to SVMs for classification, and finally bounding-box regressors are trained.
  2. Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk.
    Training is expensive in space and time: for SVM and bounding-box regressor training, the feature vectors extracted from every proposal must be written to disk.
  3. Object detection is slow. At test-time, features are extracted from each object proposal in each test image.
    Detection is slow: every proposal is passed through the network separately to extract features (redundant computation).
The SPPnet model
  • Advantage
    Features are extracted for a proposal by maxpooling the portion of the feature map inside the proposal into a fixed-size output.
    Features for each proposal are obtained by max-pooling the portion of the shared feature map inside the proposal into a fixed-size output.
  • Drawback
    Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors.
    Training is still a multi-stage pipeline.
Fast R-CNN model architecture

first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map.
Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map.
Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers:
one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes.
The whole image is first passed through several convolutional layers to produce a feature map. For each proposal, an RoI pooling layer then extracts a fixed-size feature vector from this map; the feature vectors are fed through fully connected layers into two sibling branches, one for classification and one for bounding-box regression.
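
To make the two-branch structure concrete, here is a minimal PyTorch-style sketch of the detection head on top of the shared feature map. The sizes (7x7 RoI output, 4096-d FC layers, 1/16 feature stride) follow the VGG16 setup described in the paper; the torchvision `roi_pool` call and the default class count are convenient assumptions, not the original Caffe implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool


class FastRCNNHead(nn.Module):
    """Two sibling outputs on top of RoI-pooled features (VGG16-style sizes)."""

    def __init__(self, in_channels=512, roi_size=7, num_classes=20):
        super().__init__()
        self.roi_size = roi_size
        self.fc = nn.Sequential(
            nn.Linear(in_channels * roi_size * roi_size, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
        )
        # Sibling branches: (K+1)-way class scores and 4 box offsets per class.
        self.cls_score = nn.Linear(4096, num_classes + 1)
        self.bbox_pred = nn.Linear(4096, 4 * num_classes)

    def forward(self, feature_map, rois, spatial_scale=1.0 / 16):
        # rois: (R, 5) tensor of [batch_index, x1, y1, x2, y2] in image coordinates.
        pooled = roi_pool(feature_map, rois, output_size=self.roi_size,
                          spatial_scale=spatial_scale)   # (R, C, 7, 7)
        x = self.fc(pooled.flatten(start_dim=1))         # (R, 4096)
        return self.cls_score(x), self.bbox_pred(x)
```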

ROI Pooling

RoI max pooling works by dividing the h * w RoI window into an H * W grid of sub-windows of approximate size h/H * w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel
RoI pooling divides the (h, w) feature-map window corresponding to a proposal into an (H, W) grid of sub-windows, each of approximate size (h/H, w/W), and max-pools each sub-window into the corresponding output cell. Pooling is applied to each feature-map channel independently.
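
The grid pooling itself is easy to sketch. The NumPy snippet below is a simplified single-channel illustration; it uses floor/ceil rounding for the sub-window boundaries, which only approximates the rounding rules of the original implementation.

```python
import numpy as np


def roi_max_pool_channel(roi_feat, H, W):
    """Max-pool an (h, w) RoI window into an (H, W) output grid."""
    h, w = roi_feat.shape
    out = np.zeros((H, W), dtype=roi_feat.dtype)
    for i in range(H):
        y0, y1 = int(np.floor(i * h / H)), int(np.ceil((i + 1) * h / H))
        for j in range(W):
            x0, x1 = int(np.floor(j * w / W)), int(np.ceil((j + 1) * w / W))
            # Each sub-window is roughly (h/H, w/W); take its maximum.
            out[i, j] = roi_feat[y0:y1, x0:x1].max()
    return out


# Example: pool a 14x21 RoI window into a fixed 7x7 output.
pooled = roi_max_pool_channel(np.random.rand(14, 21), H=7, W=7)
```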

Sampling strategy

minibatches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image.
A hierarchical sampling strategy is used: first N images are sampled, then R/N RoIs are sampled from each image.
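
As a rough sketch, hierarchical sampling with the settings used in the paper (N = 2, R = 128, about 25% foreground RoIs with IoU >= 0.5) might look like the following; the `fg_rois`/`bg_rois` fields on each image record are hypothetical placeholders for precomputed proposals.

```python
import random


def sample_minibatch(dataset, N=2, R=128, fg_fraction=0.25):
    """Hierarchically sample a minibatch: N images, then R/N RoIs per image.

    Each image dict is assumed (hypothetically) to carry precomputed
    'fg_rois' (IoU >= 0.5 with a ground-truth box) and 'bg_rois' lists.
    """
    rois_per_image = R // N
    fg_per_image = int(rois_per_image * fg_fraction)
    batch = []
    for image in random.sample(dataset, N):
        fg = random.sample(image['fg_rois'], min(fg_per_image, len(image['fg_rois'])))
        bg_needed = rois_per_image - len(fg)
        bg = random.sample(image['bg_rois'], min(bg_needed, len(image['bg_rois'])))
        batch.append((image, fg + bg))
    return batch
```

Because all RoIs from the same image share one forward pass through the conv layers, this sampling makes training much cheaper than drawing the R RoIs from R different images.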

Reposted from blog.csdn.net/archervin/article/details/80613799