Some Notes on Understanding Anchor Boxes in Faster R-CNN

I have been reading the Faster R-CNN source code recently, so here are some notes.

When I read the paper, I understood the general idea of anchor boxes, but many details never quite clicked. Reading the code gave me a much better understanding.

First, before computing the anchor boxes, Faster R-CNN roughly goes through the following steps:

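To make this easier to follow, I listed the size of each intermediate output for one real input image. In practice the input size is not fixed, and neither is the size after resizing; the numbers here are only for illustration. As you can see, anchor boxes are defined on top of the feature map, and the number of anchor boxes is 75 x 100 x 9 = 67500. The 9 here is the k from the paper, which I come back to below. In other words, the number of anchor boxes depends on the feature map: every point on the feature map corresponds to k anchor boxes.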
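As a quick sanity check on that count, here is a minimal sketch in plain Python. The helper name count_anchors is my own and does not come from the Faster R-CNN code; it only multiplies the feature map size by k.

```python
def count_anchors(grid_height, grid_width, scales, aspect_ratios):
    """Number of anchors for one feature map: one per (location, scale, ratio)."""
    k = len(scales) * len(aspect_ratios)  # the k from the paper
    return grid_height * grid_width * k


# Example from the text: a 75 x 100 feature map with 3 scales and 3 ratios.
total = count_anchors(75, 100, (0.5, 1.0, 2.0), (0.5, 1.0, 2.0))
print(total)  # 67500
```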

Generating anchor boxes

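The k mentioned above is not hard to understand. The paper explains that anchor boxes start from a base size, and k boxes of different sizes are then derived from it using scale factors (scales) and aspect ratios. What I did not understand was where the base size comes from. Was it just picked out of thin air? The paper does not seem to explain how this number was obtained. My guess, after thinking about it, is that it was chosen deliberately: combined with the scaling, this value lets the anchors cover most regions of a typical image.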
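To make the k shapes concrete, here is a small sketch in plain Python. It assumes a square base anchor of 256 pixels and treats the aspect ratio as width / height, which matches the defaults in the source code further down. The function name anchor_shapes and the loop order are my own choices for illustration.

```python
import math


def anchor_shapes(base_size, scales, aspect_ratios):
    """Return (height, width) for every scale / aspect-ratio combination.

    Assumes a square base anchor and aspect ratio r = width / height.
    """
    shapes = []
    for r in aspect_ratios:
        for s in scales:
            height = s * base_size / math.sqrt(r)
            width = s * base_size * math.sqrt(r)
            shapes.append((height, width))
    return shapes


shapes = anchor_shapes(256, (0.5, 1.0, 2.0), (0.5, 1.0, 2.0))
print(len(shapes))  # 9
for h, w in shapes:
    print(round(h, 1), round(w, 1))
```

Each scale s keeps the box area at (s * 256)^2 no matter which aspect ratio is used, so the three ratios only redistribute that area between height and width.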

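The feature map's only contribution to anchor generation is supplying the center points: each position on the feature map is the center of its k anchor boxes. Once all these centers are known, the width and height of each rectangle follow from the base size, scales, and aspect ratios. The center of the rectangle is the point in the original image that corresponds to that feature map position.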
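For reference, this is how the width and height can be worked out. The symbols are my own shorthand: W and H are the base anchor's width and height, s is the scale, r is the aspect ratio (width / height), and w and h are the resulting width and height. It is a sketch of the arithmetic, assuming the scale applies to side length so that the area grows by s squared.

```latex
\begin{aligned}
\frac{w}{h} &= r, \qquad w\,h = s^{2}\,W H \\
h^{2} &= \frac{s^{2}\,W H}{r}
  \;\Rightarrow\; h = s\sqrt{\frac{W H}{r}} \\
w &= r\,h = s\sqrt{r\,W H}
\end{aligned}
```

With a square base anchor, W = H = B, this simplifies to h = sB / sqrt(r) and w = sB * sqrt(r).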

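So here is the question. Computing the width and height is easy to understand, but how do we find which point in the original image a feature map position corresponds to? (Here "original image" means the resized image, and I will keep using the term that way below. The resized image is what actually goes through the computation, and the resulting location information is expressed as relative coordinates on the image, so not using the true original image does no harm.) I could not figure this out for a long time and assumed some sophisticated algorithm was involved, until I read the code and realized I had been overthinking it. As the figure above shows, the image shrinks by a factor of roughly 8 after passing through the network. The source code has a stride parameter (anchor_stride): multiplying a feature map coordinate by the stride (8 here), plus an optional offset, gives the coordinate relative to the original image.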
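The mapping is simple enough to sketch in a few lines of plain Python. The function grid_centers is a hypothetical helper of mine, and the stride of 8 and offset of 0 are the values from this example, not universal constants.

```python
def grid_centers(grid_height, grid_width, stride=8, offset=0):
    """Map feature map indices to (y, x) centers in the resized image."""
    y_centers = [i * stride + offset for i in range(grid_height)]
    x_centers = [j * stride + offset for j in range(grid_width)]
    return y_centers, x_centers


y_centers, x_centers = grid_centers(75, 100)
print(y_centers[:3], y_centers[-1])  # [0, 8, 16] 592
print(x_centers[:3], x_centers[-1])  # [0, 8, 16] 792
```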

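So the anchor box values have nothing to do with the values stored in the feature map; they depend only on the feature map's size. That makes things much easier to understand. What we end up with is a set of rectangles in the original image, each described by a center point plus a width and height. The next steps use these anchors as references: the RPN predicts an objectness score and box offsets relative to each anchor, and the resulting proposals are used to crop the matching regions out of the feature map for ROI pooling, regression, and so on.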
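To show that only the feature map size matters, here is a rough pure-Python sketch that builds every anchor in corner form without looking at any feature values. It is my own simplified illustration, not the library code: the name generate_anchors, the square base size of 256, the stride of 8, and the loop order are all assumptions.

```python
import math


def generate_anchors(grid_height, grid_width, base_size=256, stride=8,
                     scales=(0.5, 1.0, 2.0), aspect_ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (ymin, xmin, ymax, xmax) tuples in image coordinates."""
    anchors = []
    for i in range(grid_height):
        for j in range(grid_width):
            cy, cx = i * stride, j * stride  # center in the resized image
            for r in aspect_ratios:
                for s in scales:
                    h = s * base_size / math.sqrt(r)
                    w = s * base_size * math.sqrt(r)
                    anchors.append((cy - h / 2, cx - w / 2,
                                    cy + h / 2, cx + w / 2))
    return anchors


anchors = generate_anchors(75, 100)
print(len(anchors))  # 67500
```

Note that many of these boxes extend past the image border, for example the ones centered at (0, 0).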

Relevant source code

Below is the source code that generates the anchor boxes, for reference. I added comments in a few places to make it easier to follow.

# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

"""Generates grid anchors on the fly as used in Faster RCNN.

Generates grid anchors on the fly as described in:
"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks"
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.
"""

import tensorflow as tf

from object_detection.core import anchor_generator
from object_detection.core import box_list
from object_detection.utils import ops


class GridAnchorGenerator(anchor_generator.AnchorGenerator):
  """Generates a grid of anchors at given scales and aspect ratios."""

  def __init__(self,
               scales=(0.5, 1.0, 2.0),
               aspect_ratios=(0.5, 1.0, 2.0),
               base_anchor_size=None,
               anchor_stride=None,
               anchor_offset=None):
    """Constructs a GridAnchorGenerator.

    Args:
      scales: a list of (float) scales, default=(0.5, 1.0, 2.0)
      aspect_ratios: a list of (float) aspect ratios, default=(0.5, 1.0, 2.0)
      base_anchor_size: base anchor size as height, width
                        (length-2 float32 list, default=[256, 256])
      anchor_stride: difference in centers between base anchors for adjacent
                     grid positions (length-2 float32 list, default=[16, 16])
      anchor_offset: center of the anchor with scale and aspect ratio 1 for the
                     upper left element of the grid, this should be zero for
                     feature networks with only VALID padding and even receptive
                     field size, but may need additional calculation if other
                     padding is used (length-2 float32 tensor, default=[0, 0])
    """
    # Handle argument defaults
    if base_anchor_size is None:
      base_anchor_size = [256, 256]
    base_anchor_size = tf.constant(base_anchor_size, tf.float32)
    if anchor_stride is None:
      anchor_stride = [16, 16]
    anchor_stride = tf.constant(anchor_stride, dtype=tf.float32)
    if anchor_offset is None:
      anchor_offset = [0, 0]
    anchor_offset = tf.constant(anchor_offset, dtype=tf.float32)

    self._scales = scales
    self._aspect_ratios = aspect_ratios
    self._base_anchor_size = base_anchor_size
    self._anchor_stride = anchor_stride
    self._anchor_offset = anchor_offset

  def name_scope(self):
    return 'GridAnchorGenerator'

  def num_anchors_per_location(self):
    """Returns the number of anchors per spatial location.

    Returns:
      a list of integers, one for each expected feature map to be passed to
      the `generate` function.
    """
    return [len(self._scales) * len(self._aspect_ratios)]

  def _generate(self, feature_map_shape_list):
    """Generates a collection of bounding boxes to be used as anchors.

    Args:
      feature_map_shape_list: list of pairs of convnet layer resolutions in the
        format [(height_0, width_0)].  For example, setting
        feature_map_shape_list=[(8, 8)] asks for anchors that correspond
        to an 8x8 layer.  For this anchor generator, only lists of length 1 are
        allowed.

    Returns:
      boxes: a BoxList holding a collection of N anchor boxes
    Raises:
      ValueError: if feature_map_shape_list is not a list of length 1.
      ValueError: if feature_map_shape_list does not consist of pairs of
        integers
    """
    if not (isinstance(feature_map_shape_list, list)
            and len(feature_map_shape_list) == 1):
      raise ValueError('feature_map_shape_list must be a list of length 1.')
    if not all([isinstance(list_item, tuple) and len(list_item) == 2
                for list_item in feature_map_shape_list]):
      raise ValueError('feature_map_shape_list must be a list of pairs.')
    # grid_height and grid_width are the feature map size, i.e. 75 and 100 in
    # the example above.
    grid_height, grid_width = feature_map_shape_list[0]
    # With scales=(0.5, 1.0, 2.0) and aspect_ratios=(0.5, 1.0, 2.0), meshgrid
    # enumerates every pairing: (scales_grid[i], aspect_ratios_grid[i]) covers
    # the 9 combinations of scale and aspect ratio.
    scales_grid, aspect_ratios_grid = ops.meshgrid(self._scales,
                                                   self._aspect_ratios)
    scales_grid = tf.reshape(scales_grid, [-1])
    aspect_ratios_grid = tf.reshape(aspect_ratios_grid, [-1])
    return tile_anchors(grid_height,
                        grid_width,
                        scales_grid,
                        aspect_ratios_grid,
                        self._base_anchor_size,
                        self._anchor_stride,
                        self._anchor_offset)


def tile_anchors(grid_height,
                 grid_width,
                 scales,
                 aspect_ratios,
                 base_anchor_size,
                 anchor_stride,
                 anchor_offset):
  """Create a tiled set of anchors strided along a grid in image space.

  This op creates a set of anchor boxes by placing a "basis" collection of
  boxes with user-specified scales and aspect ratios centered at evenly
  distributed points along a grid.  The basis collection is specified via the
  scale and aspect_ratios arguments.  For example, setting scales=[.1, .2, .2]
  and aspect ratios = [2,2,1/2] means that we create three boxes: one with scale
  .1, aspect ratio 2, one with scale .2, aspect ratio 2, and one with scale .2
  and aspect ratio 1/2.  Each box is multiplied by "base_anchor_size" before
  placing it over its respective center.

  Grid points are specified via grid_height, grid_width parameters as well as
  the anchor_stride and anchor_offset parameters.

  Args:
    grid_height: size of the grid in the y direction (int or int scalar tensor)
    grid_width: size of the grid in the x direction (int or int scalar tensor)
    scales: a 1-d  (float) tensor representing the scale of each box in the
      basis set.
    aspect_ratios: a 1-d (float) tensor representing the aspect ratio of each
      box in the basis set.  The length of the scales and aspect_ratios tensors
      must be equal.
    base_anchor_size: base anchor size as [height, width]
      (float tensor of shape [2])
    anchor_stride: difference in centers between base anchors for adjacent grid
                   positions (float tensor of shape [2])
    anchor_offset: center of the anchor with scale and aspect ratio 1 for the
                   upper left element of the grid, this should be zero for
                   feature networks with only VALID padding and even receptive
                   field size, but may need some additional calculation if other
                   padding is used (float tensor of shape [2])
  Returns:
    a BoxList holding a collection of N anchor boxes
  """

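  # The three lines below compute the height and width of each transformed
  # box. You can work it out by hand. Let:
  #     W: width of the base anchor
  #     H: height of the base anchor
  #     w: width after the transform
  #     h: height after the transform
  #     s: the scale (applied to side length, so the area grows by s^2)
  #     r: the aspect ratio, width / height
  # Then write down the constraints:
  #     w / h = r
  #     w * h = W * H * s^2
  # Solving gives h = s * sqrt(W * H / r) and w = s * sqrt(W * H * r).
  # Note: the code matches this only when W == H. If the base anchor is not
  # square, it yields w / h = r * W / H, which looks like a small bug (or at
  # least a quirk). The base anchor is usually square, so it rarely matters.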
  ratio_sqrts = tf.sqrt(aspect_ratios) 
  heights = scales / ratio_sqrts * base_anchor_size[0]
  widths = scales * ratio_sqrts * base_anchor_size[1]

  # Get a grid of box centers
  # tf.to_float is deprecated; tf.cast to float32 is the equivalent call.
  y_centers = tf.cast(tf.range(grid_height), dtype=tf.float32)
  y_centers = y_centers * anchor_stride[0] + anchor_offset[0]
  # Example output for grid_height=75 with a stride of 8 (75 values):
  #   [0., 8., 16., 24., ..., 576., 584., 592.], dtype=float32
  x_centers = tf.cast(tf.range(grid_width), dtype=tf.float32)
  x_centers = x_centers * anchor_stride[1] + anchor_offset[1]
  # Example output for grid_width=100 with a stride of 8 (100 values):
  #   [0., 8., 16., 24., ..., 776., 784., 792.], dtype=float32

  # Next, combine the centers and sizes into the actual anchor box values.
  x_centers, y_centers = ops.meshgrid(x_centers, y_centers)

  widths_grid, x_centers_grid = ops.meshgrid(widths, x_centers)
  heights_grid, y_centers_grid = ops.meshgrid(heights, y_centers)
  bbox_centers = tf.stack([y_centers_grid, x_centers_grid], axis=3)
  bbox_sizes = tf.stack([heights_grid, widths_grid], axis=3)
  bbox_centers = tf.reshape(bbox_centers, [-1, 2])
  bbox_sizes = tf.reshape(bbox_sizes, [-1, 2])
  bbox_corners = _center_size_bbox_to_corners_bbox(bbox_centers, bbox_sizes)
  return box_list.BoxList(bbox_corners)


def _center_size_bbox_to_corners_bbox(centers, sizes):
  """Converts bbox center-size representation to corners representation.

  Args:
    centers: a tensor with shape [N, 2] representing bounding box centers
    sizes: a tensor with shape [N, 2] representing bounding box sizes

  Returns:
    corners: tensor with shape [N, 4] representing bounding boxes in corners
      representation
  """
  return tf.concat([centers - .5 * sizes, centers + .5 * sizes], 1)