ROI POOLING层的原理及代码分析

代码结合配图部分的解释很便于理解

https://blog.deepsense.ai/region-of-interest-pooling-explained/

http://blog.csdn.net/lanran2/article/details/60143861

目标检测typical architecture 通常可以分为两个阶段：

（1）region proposal：给定一张输入image找出objects可能存在的所有位置。这一阶段的输出应该是一系列object可能位置的bounding box。这些通常称之为region proposals或者 regions of interest（ROI）。

（2）final classification：确定上一阶段的每个region proposal是否属于目标一类或者背景。

这个architecture存在的一些问题是：

产生大量的region proposals 会导致performance problems，很难达到实时目标检测。
在处理速度方面是suboptimal。
无法做到end-to-end training。

这就是ROI pooling提出的根本原因。

ROI pooling层能实现training和testing的显著加速，并提高检测accuracy。该层有两个输入：

从具有多个卷积核池化的深度网络中获得的固定大小的feature maps；
一个表示所有ROI的N*5的矩阵，其中N表示ROI的数目。第一列表示图像index，其余四列表示其余的左上角和右下角坐标；

ROI pooling具体操作如下：

（1）根据输入image，将ROI映射到feature map对应位置；

（2）将映射后的区域划分为相同大小的sections（sections数量与输出的维度相同）；

（3）对每个sections进行max pooling操作；

这样我们就可以从不同大小的方框得到固定大小的相应的feature maps。值得一提的是，输出的feature maps的大小不取决于ROI和卷积feature maps大小。ROI pooling 最大的好处就在于极大地提高了处理速度。

ROI pooling example

考虑一个8*8大小的feature map，一个ROI，以及输出大小为2*2.

（1）输入的固定大小的feature map

（2）region proposal 投影之后位置（左上角，右下角坐标）：（0，3），（7，8）。

（3）将其划分为（2*2）个sections（因为输出大小为2*2），我们可以得到：

（4）对每个section做max pooling，可以得到：

ROI pooling总结：

（1）用于目标检测任务；（2）允许我们对CNN中的feature map进行reuse；（3）可以显著加速training和testing速度；（4）允许end-to-end的形式训练目标检测系统。

以下转自http://blog.csdn.net/lanran2/article/details/60143861 加上了自己的一些注释

ROIs Pooling顾名思义，是Pooling层的一种，而且是针对RoIs的Pooling，他的特点是输入特征图尺寸不固定，但是输出特征图尺寸固定；

什么是ROI呢？
ROI是Region of Interest的简写，指的是在“特征图上的框”；
1）在Fast RCNN中， RoI是指Selective Search完成后得到的“候选框”在特征图上的映射，如下图所示；
2）在Faster RCNN中，候选框是经过RPN产生的，然后再把各个“候选框”映射到特征图上，得到RoIs。

这里写图片描述
图1 Fast RCNN整体结构

往往经过rpn后输出的不止一个矩形框，所以这里我们是对多个ROI进行Pooling。

ROI Pooling的输入

输入有两部分组成：
1. 特征图：指的是图1中所示的特征图，在Fast RCNN中，它位于RoI Pooling之前，在Faster RCNN中，它是与RPN共享那个特征图，通常我们常常称之为“share_conv”；
2. rois：在Fast RCNN中，指的是Selective Search的输出；在Faster RCNN中指的是RPN的输出，一堆矩形候选框框，形状为1x5x1x1（4个坐标+索引index），其中值得注意的是：坐标的参考系不是针对feature map这张图的，而是针对原图的（神经网络最开始的输入）

ROI Pooling的输出

输出是batch个vector，其中batch的值等于RoI的个数，vector的大小为channel * w * h；RoI Pooling的过程就是将一个个大小不同的box矩形框，都映射成大小固定（w * h）的矩形框；

ROI Pooling的过程

这里写图片描述
如图所示，我们先把roi中的坐标映射到feature map上，映射规则比较简单，就是把各个坐标除以“输入图片与feature map的大小的比值”，得到了feature map上的box坐标后，我们使用Pooling得到输出；由于输入的图片大小不一，所以这里我们使用的类似Spp Pooling，在Pooling的过程中需要计算Pooling后的结果对应到feature map上所占的范围，然后在那个范围中进行取max或者取average。

Caffe ROI Pooling的源码解析（要结合上面8X8 -> 2X2的例子分析, 假设原始图片64 x 64）

1. LayerSetUp

template <typename Dtype>
void ROIPoolingLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  ROIPoolingParameter roi_pool_param = this->layer_param_.roi_pooling_param();
  //经过Pooling后的feature map的高  = 2
  pooled_height_ = roi_pool_param.pooled_h();
  //经过Pooling后的feature map的宽  = 2
  pooled_width_ = roi_pool_param.pooled_w();
  //输入图片与feature map之前的比值，这个feature map指roi pooling层的输入  spatial_scale_实际上应该是最初输入的原图尺寸(比如64x64)/ROI pooling层的输入尺寸(8x8)，其实就是stride乘积的倒数(1/8)
  spatial_scale_ = roi_pool_param.spatial_scale();
}

2. Reshape

template <typename Dtype>
void ROIPoolingLayer<Dtype>::Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  //输入的feature map的channel数  = 3
  channels_ = bottom[0]->channels();
  //输入的feature map的高  = 8
  height_ = bottom[0]->height();
  //输入的feature map的宽  = 8
  width_ = bottom[0]->width();
  //设置输出的形状NCHW，N=ROI的个数，C=channels_，H=pooled_height_，W=pooled_width_
  top[0]->Reshape(bottom[1]->num(), channels_, pooled_height_,
      pooled_width_);
  //max_idx_的形状与top一致
  max_idx_.Reshape(bottom[1]->num(), channels_, pooled_height_,
      pooled_width_);
}

3. Forward

template <typename Dtype>
void ROIPoolingLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  //输入有两部分组成，data和rois
  const Dtype* bottom_data = bottom[0]->cpu_data();   //feature map
  const Dtype* bottom_rois = bottom[1]->cpu_data();   //rois
  // Number of ROIs
  int num_rois = bottom[1]->num();
  int batch_size = bottom[0]->num();
  int top_count = top[0]->count();
  Dtype* top_data = top[0]->mutable_cpu_data();
  caffe_set(top_count, Dtype(-FLT_MAX), top_data);
  int* argmax_data = max_idx_.mutable_cpu_data();
  caffe_set(top_count, -1, argmax_data);

  // For each ROI R = [batch_index x1 y1 x2 y2]: max pool over R
  for (int n = 0; n < num_rois; ++n) {
    int roi_batch_ind = bottom_rois[0];   //这个ind没明白
    //把原图的roi坐标映射到feature map上面   
    int roi_start_w = round(bottom_rois[1] * spatial_scale_);
    int roi_start_h = round(bottom_rois[2] * spatial_scale_);
    int roi_end_w = round(bottom_rois[3] * spatial_scale_);
    int roi_end_h = round(bottom_rois[4] * spatial_scale_);
    //计算每个roi在feature map上面的大小
    int roi_height = max(roi_end_h - roi_start_h + 1, 1);
    int roi_width = max(roi_end_w - roi_start_w + 1, 1);
    //pooling之后的feature map的一个值对应于pooling之前的feature map上的大小
    //注：由于roi的大小不一致，所以每次都需要计算一次    //roi_height  roi_width对应原roi在feature map上的大小，除以输出尺寸(2 X 2)即分块数，    //可以得到每一块输出对应的长宽
    const Dtype bin_size_h = static_cast<Dtype>(roi_height)
                             / static_cast<Dtype>(pooled_height_);
    const Dtype bin_size_w = static_cast<Dtype>(roi_width)
                             / static_cast<Dtype>(pooled_width_);
    //找到对应的roi的feature map，如果input data的batch size为1
    //那么roi_batch_ind=0    
    const Dtype* batch_data = bottom_data + bottom[0]->offset(roi_batch_ind);
        //上面这句没明白
//pooling的过程是针对每一个channel的，所以需要循环遍历
    for (int c = 0; c < channels_; ++c) {
      //计算output的每一个值，所以需要遍历一遍output，然后求出所有值       //对每一块输出进行max pool操作
      for (int ph = 0; ph < pooled_height_; ++ph) {
        for (int pw = 0; pw < pooled_width_; ++pw) {
          // Compute pooling region for this output unit:
          //  start (included) = floor(ph * roi_height / pooled_height_)
          //  end (excluded) = ceil((ph + 1) * roi_height / pooled_height_)
          // 计算output上的一点对应于input上面区域的大小[hstart, wstart, hend, wend]          //定位每块区域的位置
          int hstart = static_cast<int>(floor(static_cast<Dtype>(ph)
                                              * bin_size_h));
          int hend = static_cast<int>(ceil(static_cast<Dtype>(ph + 1)
                                           * bin_size_h));
          int wstart = static_cast<int>(floor(static_cast<Dtype>(pw)
                                              * bin_size_w));
          int wend = static_cast<int>(ceil(static_cast<Dtype>(pw + 1)
                                           * bin_size_w));
          //将映射后的区域平动到对应的位置[hstart, wstart, hend, wend]          //控制在图片范围内
          hstart = min(max(hstart + roi_start_h, 0), height_);
          hend = min(max(hend + roi_start_h, 0), height_);
          wstart = min(max(wstart + roi_start_w, 0), width_);
          wend = min(max(wend + roi_start_w, 0), width_);
          //如果映射后的矩形框不符合
          bool is_empty = (hend <= hstart) || (wend <= wstart);
          //pool_index指的是此时计算的output的值对应于output的位置
          const int pool_index = ph * pooled_width_ + pw;
          //如果矩形不符合，此处output的值设为0，此处的对应于输入区域的最大值为-1
          if (is_empty) {
            top_data[pool_index] = 0;
            argmax_data[pool_index] = -1;
          }
          //遍历output的值对应于input的区域块           // Max pool
          for (int h = hstart; h < hend; ++h) {
            for (int w = wstart; w < wend; ++w) {
             // 对应于input上的位置
              const int index = h * width_ + w;
              //计算区域块的最大值，保存在output对应的位置上
              //同时记录最大值的索引
              if (batch_data[index] > top_data[pool_index]) {
                top_data[pool_index] = batch_data[index];
                argmax_data[pool_index] = index;
              }
            }
          }
        }
      }
      // Increment all data pointers by one channel      //通过offset指向下个位置，具体实现还要看看top的结构才能理解
      batch_data += bottom[0]->offset(0, 1);
      top_data += top[0]->offset(0, 1);
      argmax_data += max_idx_.offset(0, 1);
    }
    // Increment ROI data pointer
    bottom_rois += bottom[1]->offset(1);
  }
}