SSD Network Walkthrough: The PriorBox Layer

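The PriorBox layer in SSD places the default boxes at every location (pixel) of a feature map. Concretely, for each default box it computes the normalized top-left and bottom-right coordinates relative to the image fed to the network's input layer, and it stores the configured coordinate variance values.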

To see how the default boxes are configured, we first need to look at Section 2.2 of the original paper.

① The English text reads:

Choosing scales and aspect ratios for default boxes To handle different object scales, some methods [4,9] suggest processing the image at different sizes and combining the results afterwards. However, by utilizing feature maps from several different layers in a single network for prediction we can mimic the same effect, while also sharing parameters across all object scales. Previous works [10,11] have shown that using feature maps from the lower layers can improve semantic segmentation quality because the lower layers capture more fine details of the input objects. Similarly, [12] showed that adding global context pooled from a feature map can help smooth the segmentation results.

Motivated by these methods, we use both the lower and upper feature maps for detection. Figure 1 shows two exemplar feature maps (8 × 8 and 4 × 4) which are used in the framework. In practice, we can use many more with small computational overhead.
Feature maps from different levels within a network are known to have different (empirical) receptive field sizes [13]. Fortunately, within the SSD framework, the default boxes do not necessary need to correspond to the actual receptive fields of each layer. We design the tiling of default boxes so that specific feature maps learn to be responsive to particular scales of the objects. Suppose we want to use m feature maps for prediction. The scale of the default boxes for each feature map is computed as:

$s_{k}=s_{min}+\frac{s_{max}-s_{min}}{m-1}(k-1), k\in [1,m]$

where $s_{min}$ is 0.2 and $s_{max}$ is 0.9, meaning the lowest layer has a scale of 0.2 and the highest layer has a scale of 0.9, and all layers in between are regularly spaced. We impose different aspect ratios for the default boxes, and denote them as $a_{r}\in\left \{ 1,2,3,\frac{1}{2},\frac{1}{3} \right \}$ . We can compute the width ( $w_{k}^{a}=s_{k}\sqrt{a_{r}}$ ) and height ( $h_{k}^{a}=s_{k}/\sqrt{a_{r}}$ ) for each default box. For the aspect ratio of 1, we also add a default box whose scale is $s_{k}^{'}=\sqrt{s_{k}s_{k+1}}$ , resulting in 6 default boxes per feature map location. We set the center of each default box to $(\frac{i+0.5}{\left | f_{k} \right |},\frac{j+0.5}{\left | f_{k} \right |})$ , where $\left | f_{k} \right |$ is the size of the k-th square feature map, $i,j\in[0,|f_{k}|)$ . In practice, one can also design a distribution of default boxes to best fit a specific dataset. How to design the optimal tiling is an open question as well.
By combining predictions for all default boxes with different scales and aspect ratios from all locations of many feature maps, we have a diverse set of predictions, covering various input object sizes and shapes. For example, in Fig. 1, the dog is matched to a default box in the 4 × 4 feature map, but not to any default boxes in the 8 × 8 feature map. This is because those boxes have different scales and do not match the dog box, and therefore are considered as negatives during training.

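② The same passage restated in my own words (this is only an aid; work from the English text to grasp the real meaning):

Choosing scales and aspect ratios for default boxes: rather than processing the image at several sizes and merging the results as in [4,9], SSD predicts from the feature maps of several different layers of a single network. This mimics the same effect while sharing parameters across all object scales, i.e. one network handles objects of multiple scales. Earlier work [10,11] showed that lower-layer feature maps improve semantic segmentation quality because they capture finer details of the input objects, and [12] showed that adding global context pooled from a feature map helps smooth the segmentation results. Motivated by these methods, SSD uses both lower and upper feature maps for detection. Figure 1 shows two such feature maps (8×8 and 4×4); in practice many more can be used at small computational overhead.

Feature maps at different levels of a network are known to have different (empirical) receptive field sizes [13]. Fortunately, within SSD the default boxes do not need to correspond to the actual receptive field of each layer. The tiling of default boxes is designed so that specific feature maps learn to respond to specific object scales. Suppose m feature maps are used for prediction. The scale of the default boxes for each feature map is computed as:

$s_{k}=s_{min}+\frac{s_{max}-s_{min}}{m-1}(k-1), k\in [1,m]$

Here $s_{min}$ is 0.2 and $s_{max}$ is 0.9, so the lowest layer has a scale of 0.2, the highest layer has a scale of 0.9, and the layers in between are regularly spaced. Different aspect ratios are imposed on the default boxes, denoted $a_{r}\in\left \{ 1,2,3,\frac{1}{2},\frac{1}{3} \right \}$ . From these we can compute the width ( $w_{k}^{a}=s_{k}\sqrt{a_{r}}$ ) and height ( $h_{k}^{a}=s_{k}/\sqrt{a_{r}}$ ) of each default box. For aspect ratio 1, one more default box with scale $s_{k}^{'}=\sqrt{s_{k}s_{k+1}}$ is added, giving 6 default boxes per feature-map location. The center of each default box is set to $(\frac{i+0.5}{\left | f_{k} \right |},\frac{j+0.5}{\left | f_{k} \right |})$ , where $\left | f_{k} \right |$ is the size of the k-th square feature map (its width or height) and $i,j\in[0,|f_{k}|)$ . In practice the distribution of default boxes can also be designed to best fit a specific dataset. How to design the optimal tiling is an open question.

Combining the predictions for all default boxes with different scales and aspect ratios from all locations of many feature maps yields a diverse set of predictions that covers various input object sizes and shapes. For example, in Figure 1 the dog is matched to a default box in the 4×4 feature map but not to any default box in the 8×8 feature map. Those boxes have different scales and do not match the dog box, so they are treated as negatives during training.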

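Figure 1 referred to above is as follows:

Figure 1: The SSD framework. (a) During training SSD needs only an input image and the ground-truth boxes for each object. In a convolutional fashion, we evaluate a small set (e.g. 4) of default boxes of different aspect ratios at each location in several feature maps with different scales (e.g. 8×8 and 4×4 in (b) and (c)). For each default box, we predict both the shape offsets and the confidences for all object categories ( $(c_{1},c_{2},...,c_{p})$ ). At training time, we first match these default boxes to the ground-truth boxes. For example, two default boxes are matched with the cat and one with the dog; these are treated as positives and the rest as negatives (i.e. a default box matched to a ground-truth box is a positive, otherwise it is a negative). The model loss is a weighted sum of the localization loss (e.g. Smooth L1 [6]) and the confidence loss (e.g. Softmax).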

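From this we can see that every location (pixel) of a feature map carries a handful of preset default boxes (4 to 6). Their aspect ratios are chosen by us, while the width and height of the default boxes on the feature map of a given layer are determined by $s_{k}$ and $a_{r}$ .

Note: all feature maps of the same layer share one set of default boxes.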
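The width and height rules quoted from Section 2.2 can be sketched as a small function. This is only an illustration under the paper's formulas: the names `DefaultBoxShapes`, `s_k` and `s_k_next` are hypothetical and do not come from the Caffe source.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Normalized (width, height) of the default boxes at one feature-map location,
// following Section 2.2 of the paper. `s_k` and `s_k_next` are the scales of
// the current and the next prediction layer (hypothetical parameter names).
std::vector<std::pair<double, double>> DefaultBoxShapes(double s_k,
                                                        double s_k_next) {
  assert(s_k > 0.0 && s_k_next > 0.0);
  const double aspect_ratios[] = {1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0};
  std::vector<std::pair<double, double>> shapes;
  for (double ar : aspect_ratios) {
    // w = s_k * sqrt(a_r), h = s_k / sqrt(a_r)
    shapes.emplace_back(s_k * std::sqrt(ar), s_k / std::sqrt(ar));
  }
  // Extra square box for aspect ratio 1 with scale sqrt(s_k * s_{k+1}).
  const double s_extra = std::sqrt(s_k * s_k_next);
  shapes.emplace_back(s_extra, s_extra);
  return shapes;
}
```

Calling `DefaultBoxShapes(0.2, 0.34)` returns six shapes. The first five all have area $s_{k}^{2}$ , since the aspect ratio only redistributes that area between width and height.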

(1) The layer's parameters as described in caffe.proto

This layer takes quite a few parameters:

// Message that store parameters used by PriorBoxLayer
message PriorBoxParameter {
  // Encode/decode type.
  enum CodeType {
    CORNER = 1;
    CENTER_SIZE = 2;
    CORNER_SIZE = 3;
  }
  // Minimum box size (in pixels). Required!
  repeated float min_size = 1; // s_k from Eq. (4) in Section 2.2 of the paper, times the size of the network input image (input of the data layer)
  // Maximum box size (in pixels). Required!
  repeated float max_size = 2; // min_size of the next feature map that generates default boxes (s_{k+1} from Eq. (4), times the size of the network input image)
  // Various of aspect ratios. Duplicate ratios will be ignored.
  // If none is provided, we use default ratio 1.
  repeated float aspect_ratio = 3; // aspect ratios (width / height)
  // If true, will flip each aspect ratio.
  // For example, if there is aspect ratio "r",
  // we will generate aspect ratio "1.0/r" as well.
  optional bool flip = 4 [default = true]; // whether to also add the flipped aspect ratios
  // If true, will clip the prior so that it is within [0, 1]
  optional bool clip = 5 [default = false]; // whether to clip (i.e. keep each default box entirely inside the network input image)
  // Variance for adjusting the prior bboxes.
  repeated float variance = 6; // purpose not yet clear to the author
  // By default, we calculate img_height, img_width, step_x, step_y based on
  // bottom[0] (feat) and bottom[1] (img). Unless these values are explicitly
  // provided.
  // Explicitly provide the img_size.
  optional uint32 img_size = 7;
  // Either img_size or img_h/img_w should be specified; not both.
  optional uint32 img_h = 8; // height of the network input image (or a manually chosen height)
  optional uint32 img_w = 9; // width of the network input image (or a manually chosen width)

  // Explicitly provide the step size.
  optional float step = 10;
  // Either step or step_h/step_w should be specified; not both.
  optional float step_h = 11; // distance, measured in the network input image, between two vertically adjacent feature-map pixels (same column)
  optional float step_w = 12; // distance, measured in the network input image, between two horizontally adjacent feature-map pixels (same row)

  // Offset to the top left corner of each cell.
  optional float offset = 13 [default = 0.5]; // offset of the default-box center (a relative offset)
}

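The most important parameters are:

min_size: this layer's $s_{k}$ multiplied by the size of the network input image (data layer), i.e. a value in pixels

max_size: the next layer's $s_{k}$ (the paper's $s_{k+1}$ ) multiplied by the size of the network input image (data layer), also a value in pixels

Note: max_size, together with min_size, is used only to size the extra default box added for aspect ratio 1 (i.e. to compute its scale $s_{k}^{'}=\sqrt{s_{k}s_{k+1}}$ ); min_size also sets the size of all the other boxes.

step/step_h,step_w: used to compute, in the coordinate system of the network input image, the center of all default boxes at a given location of the current feature map

With that in place, here is what the layer looks like in train.prototxt (the SSD300 training network):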

layer {
  name: "conv4_3_norm_mbox_priorbox"
  type: "PriorBox"
  bottom: "conv4_3_norm"
  bottom: "data"
  top: "conv4_3_norm_mbox_priorbox"
  prior_box_param {
    min_size: 30.0
    max_size: 60.0
    aspect_ratio: 2
    flip: true
    clip: false
    variance: 0.1
    variance: 0.1
    variance: 0.2
    variance: 0.2
    step: 8
    offset: 0.5
  }
}

Taking the conv4_3 feature map above as an example, the layer needs two inputs: bottom[0] is conv4_3_norm and bottom[1] is data (i.e. the image fed to the network input layer).
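To see how many default boxes this prototxt produces per location, the sketch below mirrors the aspect-ratio bookkeeping of `LayerSetUp` in prior_box_layer.cpp (listed further down). The helper names `ExpandAspectRatios` and `NumPriors` are hypothetical; only the dedup-and-flip logic is taken from the source.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Aspect ratio 1 is always present, duplicates are dropped, and flip adds
// 1/r for every newly accepted ratio r.
std::vector<float> ExpandAspectRatios(const std::vector<float>& requested,
                                      bool flip) {
  std::vector<float> ratios(1, 1.0f);
  for (float ar : requested) {
    assert(ar > 0.0f);
    bool already_exist = false;
    for (float existing : ratios) {
      if (std::fabs(ar - existing) < 1e-6) {
        already_exist = true;
        break;
      }
    }
    if (!already_exist) {
      ratios.push_back(ar);
      if (flip) ratios.push_back(1.0f / ar);
    }
  }
  return ratios;
}

// Number of default boxes per feature-map location: one box per
// (aspect ratio, min_size) pair plus one extra square box per max_size.
int NumPriors(const std::vector<float>& requested_ratios, bool flip,
              std::size_t num_min_sizes, std::size_t num_max_sizes) {
  const std::vector<float> ratios = ExpandAspectRatios(requested_ratios, flip);
  return static_cast<int>(ratios.size() * num_min_sizes + num_max_sizes);
}
```

For the conv4_3 layer above (`aspect_ratio: 2`, `flip: true`, one min_size, one max_size) this gives ratios {1, 2, 1/2} and 4 default boxes per location, so on a 38×38 feature map each channel of the output blob holds 38 × 38 × 4 × 4 = 23104 values.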

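① A closer look at min_size and max_size

SSD300 (300 means the input image is 300×300) is derived from VGG16. You can work out the size of the conv4_3 feature map yourself; I get 38×38. Note that Caffe's convolution layers round down, i.e. the output size is computed with floor, whereas pooling layers round up, i.e. the output size is computed with ceil (check the source code if you are not convinced).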
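The 38×38 figure can be reproduced with a few lines. This is a sketch of the output-size arithmetic only, assuming no dilation and ignoring the extra clipping Caffe applies to pooling when padding is non-zero; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Caffe convolution output size: integer division, i.e. floor.
int ConvOutputSize(int input, int kernel, int pad, int stride) {
  assert(stride > 0);
  return (input + 2 * pad - kernel) / stride + 1;
}

// Caffe pooling output size: ceil.
int PoolOutputSize(int input, int kernel, int pad, int stride) {
  assert(stride > 0);
  return static_cast<int>(std::ceil(
             static_cast<float>(input + 2 * pad - kernel) / stride)) + 1;
}

// Spatial size of conv4_3 in VGG16 for a square input: the 3x3 / pad 1
// convolutions keep the size, and each of the three 2x2 / stride 2 poolings
// roughly halves it.
int Conv43Size(int input) {
  int size = input;
  for (int stage = 0; stage < 3; ++stage) {
    size = ConvOutputSize(size, 3, 1, 1);  // conv layers of this stage
    size = PoolOutputSize(size, 2, 0, 2);  // pool1, pool2, pool3
  }
  return ConvOutputSize(size, 3, 1, 1);    // conv4_x layers
}
```

For a 300-pixel input the sizes go 300 → 150 → 75 → 38; the last pooling step is where ceil matters, because floor would give 37.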

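The min_size and max_size values in SSD300 do not strictly follow the formula $s_{k}=s_{min}+\frac{s_{max}-s_{min}}{m-1}(k-1), k\in [1,m]$ from Section 2.2, although they still play the roles of $s_{k}$ and $s_{k+1}$ . The paper explains why later on (conv4_3 uses $s_{k}=0.1$ ), as shown in the figure below:

Why do they not follow the formula? Collecting every min_size and max_size from the whole network gives the following table: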

Layer	Feature map size ( $\left | f_{k} \right |$ * $\left | f_{k} \right |$ )	min_size	max_size	step
conv4_3	38*38	30	60	8
fc7	19*19	60	111	16
conv6_2	10*10	111	162	32
conv7_2	5*5	162	213	64
conv8_2	3*3	213	264	100
conv9_2	1*1	264	315	300

From the table, $s_{1}=30/300=0.1$ , $s_{2}=60/300=0.2$ , $s_{3}=111/300=0.37$ , $s_{4}=162/300=0.54$ , $s_{5}=213/300=0.71$ , $s_{6}=264/300=0.88$ , $s_{7}=315/300=1.05$ , so the values do not follow the formula above (conv4_3 is fixed at 0.1).

Of course, if we set conv4_3 aside and look only at fc7 through conv9_2, the formula does hold, with $s_{min}=s_{2}=0.2,s_{max}=s_{7}=1.05,m=6$ (plug the numbers in to check).
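The check suggested above can be done in code. The sketch below evaluates the scale formula with the values read off the table ( $s_{min}=0.2$ , $s_{max}=1.05$ , $m=6$ ) and converts the result to pixels for a 300-pixel input. `Scale` and `SizesInPixels` are hypothetical names, and rounding to the nearest integer is an assumption made here for the comparison with the table.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1), for k in [1, m].
double Scale(int k, int m, double s_min, double s_max) {
  assert(m > 1 && k >= 1 && k <= m);
  return s_min + (s_max - s_min) / (m - 1) * (k - 1);
}

// Box sizes in pixels for an input of side `image_size`, rounded to the
// nearest integer.
std::vector<long> SizesInPixels(int m, double s_min, double s_max,
                                int image_size) {
  std::vector<long> sizes;
  for (int k = 1; k <= m; ++k) {
    sizes.push_back(std::lround(Scale(k, m, s_min, s_max) * image_size));
  }
  return sizes;
}
```

`SizesInPixels(6, 0.2, 1.05, 300)` yields 60, 111, 162, 213, 264, 315, which are exactly the min_size values of fc7 through conv9_2 followed by the max_size of conv9_2.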

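② The step parameter

The step parameter is essentially the downsampling factor of this layer's feature map relative to the network input image. It is used to compute, in the coordinate system of the network input image, the center of all default boxes at a given location of the current feature map.

An analogy: when you resize an image, each pixel of the resized image gets its value by interpolation. To interpolate properly you work backward, finding the coordinate in the original image that corresponds to the current pixel and then applying nearest-neighbor or bilinear interpolation at that coordinate. The key step is locating the coordinate in the original image. step plays the same role here: it maps a feature-map location back to the input image, and that mapping is the downsampling factor.

For conv4_3 the downsampling factor is 300/38, roughly 8. The nominal factor is exactly 8, because the feature map has passed through three max-pooling layers, but Caffe's pooling rounds up, so the actual ratio is only approximately 8. Hence step=8. You can work out the other layers yourself. There is no need to fuss over the exact factor; a rough value is enough, since the final box positions are learned by the network anyway. If this is still unclear, the implementation of Forward_cpu in the cpp file below shows how step is used.

Note: $step\approx \frac{I}{|f_{k}|}$ , where $I$ is the size of the network input image.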
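The role of step can be illustrated with the center formula from Forward_cpu, `(index + offset) * step`. The two helpers below are a minimal sketch with hypothetical names; the fallback to the exact ratio is what the layer does when no step is set.

```cpp
#include <cassert>

// Exact ratio between input size and feature-map size; PriorBoxLayer falls
// back to this when no step is given in the prototxt.
float StepFromSizes(int image_size, int layer_size) {
  assert(layer_size > 0);
  return static_cast<float>(image_size) / layer_size;
}

// Pixel coordinate (in the input image) of the center of feature-map cell
// `index` along one axis.
float CenterInPixels(int index, float offset, float step) {
  assert(index >= 0 && step > 0.0f);
  return (index + offset) * step;
}
```

With `step = 8` and `offset = 0.5` on the 38×38 conv4_3 map, the first cell is centered at 4 px and the last one (index 37) at 300 px. With the exact ratio 300/38 (about 7.89) the last center would sit at about 296 px instead, which shows how small the difference between the two choices is.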

(2) The header file prior_box_layer.hpp

All functions and member variables of this layer are declared in prior_box_layer.hpp:

#ifndef CAFFE_PRIORBOX_LAYER_HPP_
#define CAFFE_PRIORBOX_LAYER_HPP_

#include <vector>

#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"

namespace caffe {

/**
 * @brief Generate the prior boxes of designated sizes and aspect ratios across
 *        all dimensions @f$ (H \times W) @f$.
 *
 * Intended for use with MultiBox detection method to generate prior (template).
 *
 * NOTE: does not implement Backwards operation.
 */
template <typename Dtype>
class PriorBoxLayer : public Layer<Dtype> {
 public:
  /**
   * @param param provides PriorBoxParameter prior_box_param,
   *     with PriorBoxLayer options:
   *   - min_size (\b minimum box size in pixels. can be multiple. required!).
   *     Corresponds to s_k in Eq. (4), Section 2.2 of the paper, times the size of the network input image (input of the data layer).
   *   - max_size (\b maximum box size in pixels. can be ignored or same as the
   *   # of min_size.).
   *     Corresponds to s_{k+1} in Eq. (4), times the size of the network input image.
   *   - aspect_ratio (\b optional aspect ratios of the boxes. can be multiple).
   *   - flip (\b optional bool, default true).
   *     if set, flip the aspect ratio.
   */
  explicit PriorBoxLayer(const LayerParameter& param)
      : Layer<Dtype>(param) {}
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);

  virtual inline const char* type() const { return "PriorBox"; }
  virtual inline int ExactBottomBlobs() const { return 2; } // exactly 2 input blobs (the first is normally the feature map, the second the image from the data layer)
  virtual inline int ExactNumTopBlobs() const { return 1; } // exactly 1 output blob

 protected:
  /**
   * @brief Generates prior boxes for a layer with specified parameters.
   *
   * @param bottom input Blob vector (at least 2)
   *   -# @f$ (N \times C \times H_i \times W_i) @f$
   *      the input layer @f$ x_i @f$
   *   -# @f$ (N \times C \times H_0 \times W_0) @f$
   *      the data layer @f$ x_0 @f$
   * @param top output Blob vector (length 1)
   *   -# @f$ (N \times 2 \times K*4) @f$ where @f$ K @f$ is the prior numbers
   *   By default, a box of aspect ratio 1 and min_size and a box of aspect
   *   ratio 1 and sqrt(min_size * max_size) are created.
   */
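  // The forward pass computes the parameters of each default box (normalized top-left and bottom-right coordinates plus the coordinate variances)
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  /// @brief Not implemented (no backward pass is needed)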
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
    return;
  }

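  vector<float> min_sizes_;  // configured min_size values (s_k from Eq. (4) in Section 2.2, times the size of the network input image)
  vector<float> max_sizes_;  // min_size of the next feature map that generates default boxes (s_{k+1} from Eq. (4), times the size of the network input image)
  vector<float> aspect_ratios_; // configured aspect ratios (including the default ratio 1)
  bool flip_; // whether to also add the flipped aspect ratios
  int num_priors_;  // number of default boxes per location
  bool clip_; // whether to clip (i.e. keep each default box entirely inside the network input image)
  vector<float> variance_; // variance values (purpose not yet clear to the author)

  int img_w_; // width of the network input image (or a manually chosen width)
  int img_h_; // height of the network input image (or a manually chosen height)
  float step_w_; // distance, measured in the network input image, between two horizontally adjacent feature-map pixels (same row)
  float step_h_; // distance, measured in the network input image, between two vertically adjacent feature-map pixels (same column)

  float offset_; // offset of the default-box center (a relative offset)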
};

}  // namespace caffe

#endif  // CAFFE_PRIORBOX_LAYER_HPP_

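(3) The file prior_box_layer.cpp

The declared functions are implemented in prior_box_layer.cpp (CPU version only; no GPU version is needed). Note that the backward function (Backward_cpu) does not need an implementation: the layer only creates the default boxes and stores their parameters, and those parameters are never updated, so there is nothing to backpropagate.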

#include <algorithm>
#include <functional>
#include <utility>
#include <vector>

#include "caffe/layers/prior_box_layer.hpp"

namespace caffe {

// Set up the PriorBox layer
template <typename Dtype>
void PriorBoxLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  const PriorBoxParameter& prior_box_param =
      this->layer_param_.prior_box_param();  // fetch the layer parameters
  CHECK_GT(prior_box_param.min_size_size(), 0) << "must provide min_size."; // min_size is mandatory and cannot be omitted
  for (int i = 0; i < prior_box_param.min_size_size(); ++i) {
    min_sizes_.push_back(prior_box_param.min_size(i));
    CHECK_GT(min_sizes_.back(), 0) << "min_size must be positive."; // min_size must be positive (CHECK_GT asserts "greater than")
  }
  aspect_ratios_.clear();
  aspect_ratios_.push_back(1.);  // aspect ratio 1 is always included by default
  flip_ = prior_box_param.flip(); // flip = true also adds the reciprocal, e.g. 1/2 for a configured ratio of 2
  // Deduplicate the aspect ratios: repeated values are allowed in the prototxt, and only the distinct ones are kept
  for (int i = 0; i < prior_box_param.aspect_ratio_size(); ++i) {
    float ar = prior_box_param.aspect_ratio(i);
    bool already_exist = false;
    for (int j = 0; j < aspect_ratios_.size(); ++j) {
      if (fabs(ar - aspect_ratios_[j]) < 1e-6) {
        already_exist = true;
        break; // leave the inner for loop
      }
    }
    if (!already_exist) {
      aspect_ratios_.push_back(ar); // store the new, distinct aspect ratio
      if (flip_) {
        aspect_ratios_.push_back(1./ar); // store the flipped aspect ratio as well
      }
    }
  }
  num_priors_ = aspect_ratios_.size() * min_sizes_.size(); // number of default boxes (the paper's "default box") to generate per location
  if (prior_box_param.max_size_size() > 0) {
    CHECK_EQ(prior_box_param.min_size_size(), prior_box_param.max_size_size()); // the numbers of min_size and max_size entries must match (CHECK_EQ asserts equality)
    for (int i = 0; i < prior_box_param.max_size_size(); ++i) {
      max_sizes_.push_back(prior_box_param.max_size(i));
      CHECK_GT(max_sizes_[i], min_sizes_[i])
          << "max_size must be greater than min_size."; // max_size must exceed min_size
      num_priors_ += 1;  // one extra default box per max_size
    }
  }
  clip_ = prior_box_param.clip();  // fetch the clip flag
  // Fetch the variance values (the user may set 1, 4, or none)
  if (prior_box_param.variance_size() > 1) {
    // Must and only provide 4 variance.
    CHECK_EQ(prior_box_param.variance_size(), 4);
    for (int i = 0; i < prior_box_param.variance_size(); ++i) {
      CHECK_GT(prior_box_param.variance(i), 0);
      variance_.push_back(prior_box_param.variance(i));
    }
  } else if (prior_box_param.variance_size() == 1) { // a single variance value was set
    CHECK_GT(prior_box_param.variance(0), 0);
    variance_.push_back(prior_box_param.variance(0));
  } else {
    // Set default to 0.1.
    variance_.push_back(0.1); // default: variance = 0.1
  }

  if (prior_box_param.has_img_h() || prior_box_param.has_img_w()) {
    CHECK(!prior_box_param.has_img_size())
        << "Either img_size or img_h/img_w should be specified; not both."; // only one of the two forms may be set
    img_h_ = prior_box_param.img_h();
    CHECK_GT(img_h_, 0) << "img_h should be larger than 0.";
    img_w_ = prior_box_param.img_w();
    CHECK_GT(img_w_, 0) << "img_w should be larger than 0.";
  } else if (prior_box_param.has_img_size()) {
    const int img_size = prior_box_param.img_size();
    CHECK_GT(img_size, 0) << "img_size should be larger than 0.";
    img_h_ = img_size;
    img_w_ = img_size;
  } else {
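    img_h_ = 0;  // neither form was set: use 0 for now (Forward_cpu then reads the size from bottom[1])
    img_w_ = 0;
  }
  // Same logic as above, for the step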
  if (prior_box_param.has_step_h() || prior_box_param.has_step_w()) {
    CHECK(!prior_box_param.has_step())
        << "Either step or step_h/step_w should be specified; not both.";
    step_h_ = prior_box_param.step_h();
    CHECK_GT(step_h_, 0.) << "step_h should be larger than 0.";
    step_w_ = prior_box_param.step_w();
    CHECK_GT(step_w_, 0.) << "step_w should be larger than 0.";
  } else if (prior_box_param.has_step()) {
    const float step = prior_box_param.step();
    CHECK_GT(step, 0) << "step should be larger than 0.";
    step_h_ = step;
    step_w_ = step;
  } else {
    step_h_ = 0;
    step_w_ = 0;
  }

  offset_ = prior_box_param.offset();  // relative offset from the top-left corner of each cell (default 0.5)
}

// Set the shape of the output blob
template <typename Dtype>
void PriorBoxLayer<Dtype>::Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  // Width and height of the feature map
  const int layer_width = bottom[0]->width(); 
  const int layer_height = bottom[0]->height();
  vector<int> top_shape(3, 1);
  // Since all images in a batch has same height and width, we only need to
  // generate one set of priors which can be shared across all images.
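  top_shape[0] = 1; // all feature maps in a batch have the same width and height, so one set of default boxes is generated and shared across the batch
  // 2 channels. First channel stores the mean of each prior coordinate.
  // Second channel stores the variance of each prior coordinate.
  top_shape[1] = 2; // channel 0: normalized top-left and bottom-right coordinates of the default boxes; channel 1: the variances of those coordinates
  top_shape[2] = layer_width * layer_height * num_priors_ * 4; // num_priors_ default boxes per feature-map pixel, each with 4 normalized coordinates (and 4 variances in the second channel)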
  CHECK_GT(top_shape[2], 0);
  top[0]->Reshape(top_shape);
}

// Forward pass (in effect, computes the parameters of every default box)
template <typename Dtype>
void PriorBoxLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  const int layer_width = bottom[0]->width();  // bottom[0] is normally the feature map; bottom[1] is normally the network input, i.e. data
  const int layer_height = bottom[0]->height();
  int img_width, img_height;
  if (img_h_ == 0 || img_w_ == 0) {
    img_width = bottom[1]->width();
    img_height = bottom[1]->height();
  } else {
    img_width = img_w_;
    img_height = img_h_;
  }
  float step_w, step_h;
  if (step_w_ == 0 || step_h_ == 0) {
    step_w = static_cast<float>(img_width) / layer_width;
    step_h = static_cast<float>(img_height) / layer_height;
  } else {
    step_w = step_w_;
    step_h = step_h_;
  }
  Dtype* top_data = top[0]->mutable_cpu_data();
  int dim = layer_height * layer_width * num_priors_ * 4;
  int idx = 0;
  
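  // Nested loops that fill in the default boxes (see Section 2.2 of the paper, "Choosing scales and aspect ratios for default boxes")
  for (int h = 0; h < layer_height; ++h) {
    for (int w = 0; w < layer_width; ++w) {
      float center_x = (w + offset_) * step_w;  // x coordinate of the default-box center in the network input image (the image fed to the data layer)
      float center_y = (h + offset_) * step_h; // y coordinate of the default-box center in the network input image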
      float box_width, box_height;
      for (int s = 0; s < min_sizes_.size(); ++s) {
        int min_size_ = min_sizes_[s];
        // first prior: aspect_ratio = 1, size = min_size 
        box_width = box_height = min_size_;
        // xmin
        top_data[idx++] = (center_x - box_width / 2.) / img_width;  // normalized x of the top-left corner (after normalization, x positions inside the input image lie in [0, 1])
        // ymin
        top_data[idx++] = (center_y - box_height / 2.) / img_height; // normalized y of the top-left corner
        // xmax
        top_data[idx++] = (center_x + box_width / 2.) / img_width; // normalized x of the bottom-right corner
        // ymax
        top_data[idx++] = (center_y + box_height / 2.) / img_height; // normalized y of the bottom-right corner

        if (max_sizes_.size() > 0) {
          CHECK_EQ(min_sizes_.size(), max_sizes_.size());
          int max_size_ = max_sizes_[s];
          // second prior: aspect_ratio = 1, size = sqrt(min_size * max_size) (the extra aspect-ratio-1 default box from the paper)
          box_width = box_height = sqrt(min_size_ * max_size_);
          // xmin
          top_data[idx++] = (center_x - box_width / 2.) / img_width;
          // ymin
          top_data[idx++] = (center_y - box_height / 2.) / img_height;
          // xmax
          top_data[idx++] = (center_x + box_width / 2.) / img_width;
          // ymax
          top_data[idx++] = (center_y + box_height / 2.) / img_height;
        }

        // rest of priors: corner coordinates of the remaining default boxes
        for (int r = 0; r < aspect_ratios_.size(); ++r) {
          float ar = aspect_ratios_[r];
          if (fabs(ar - 1.) < 1e-6) { // skip aspect ratio 1, already handled above
            continue;
          }
          box_width = min_size_ * sqrt(ar);
          box_height = min_size_ / sqrt(ar);
          // xmin
          top_data[idx++] = (center_x - box_width / 2.) / img_width;
          // ymin
          top_data[idx++] = (center_y - box_height / 2.) / img_height;
          // xmax
          top_data[idx++] = (center_x + box_width / 2.) / img_width;
          // ymax
          top_data[idx++] = (center_y + box_height / 2.) / img_height;
        }
      }
    }
  }
  // clip the prior's coordinate such that it is within [0, 1]
  // With clip = true, the normalized top-left and bottom-right coordinates of every default box are forced to lie inside the original image
  if (clip_) {
    for (int d = 0; d < dim; ++d) {
      top_data[d] = std::min<Dtype>(std::max<Dtype>(top_data[d], 0.), 1.);
    }
  }
  // set the variance. (The author is not yet sure what this part is used for.)
  top_data += top[0]->offset(0, 1);
  if (variance_.size() == 1) {
    caffe_set<Dtype>(dim, Dtype(variance_[0]), top_data);
  } else {
    int count = 0;
    for (int h = 0; h < layer_height; ++h) {
      for (int w = 0; w < layer_width; ++w) {
        for (int i = 0; i < num_priors_; ++i) {
          for (int j = 0; j < 4; ++j) {
            top_data[count] = variance_[j];
            ++count;
          }
        }
      }
    }
  }
}

INSTANTIATE_CLASS(PriorBoxLayer);
REGISTER_LAYER_CLASS(PriorBox);

}  // namespace caffe
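To experiment with the box layout without building Caffe, the loop in Forward_cpu can be reduced to a standalone function. The sketch below is an approximation, not the layer itself: `Box`, `PriorConfig` and `GeneratePriors` are hypothetical names, only one min_size/max_size pair and a single square step are supported, the variance channel is left out, and min_size is kept as a float whereas the listing above stores it in an int.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Box {
  float xmin, ymin, xmax, ymax;  // normalized to the input image
};

// Hypothetical container for the settings of one PriorBox layer that has a
// single min_size / max_size pair (not a Caffe type).
struct PriorConfig {
  int layer_width, layer_height;     // feature-map size
  int img_width, img_height;         // network input size
  float min_size, max_size;          // pixels; max_size <= 0 means "unset"
  std::vector<float> aspect_ratios;  // already expanded, includes 1
  float step, offset;
  bool clip;
};

std::vector<Box> GeneratePriors(const PriorConfig& c) {
  assert(c.min_size > 0 && c.step > 0 && c.img_width > 0 && c.img_height > 0);
  std::vector<Box> priors;
  auto clamp01 = [](float v) { return std::min(std::max(v, 0.0f), 1.0f); };
  auto push = [&](float cx, float cy, float bw, float bh) {
    Box b = {(cx - bw / 2) / c.img_width, (cy - bh / 2) / c.img_height,
             (cx + bw / 2) / c.img_width, (cy + bh / 2) / c.img_height};
    if (c.clip) {  // keep every coordinate inside [0, 1]
      b.xmin = clamp01(b.xmin);
      b.ymin = clamp01(b.ymin);
      b.xmax = clamp01(b.xmax);
      b.ymax = clamp01(b.ymax);
    }
    priors.push_back(b);
  };
  for (int h = 0; h < c.layer_height; ++h) {
    for (int w = 0; w < c.layer_width; ++w) {
      const float cx = (w + c.offset) * c.step;
      const float cy = (h + c.offset) * c.step;
      push(cx, cy, c.min_size, c.min_size);  // first prior: ratio 1, min_size
      if (c.max_size > 0) {  // second prior: ratio 1, sqrt(min_size * max_size)
        const float s = std::sqrt(c.min_size * c.max_size);
        push(cx, cy, s, s);
      }
      for (float ar : c.aspect_ratios) {  // remaining aspect ratios
        if (std::fabs(ar - 1.0f) < 1e-6) continue;
        push(cx, cy, c.min_size * std::sqrt(ar), c.min_size / std::sqrt(ar));
      }
    }
  }
  return priors;
}
```

With the conv4_3 settings (38×38 map, 300×300 input, min_size 30, max_size 60, ratios {1, 2, 1/2}, step 8, offset 0.5) this produces 38 × 38 × 4 = 5776 boxes. The first box is centered at (4, 4) px, so without clipping its xmin is (4 − 15)/300, a negative value, which is exactly the case the clip option exists for.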

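(4) Illustrations to deepen understanding

A more traditional approach is shown in the figure below (see week 3 of the fourth course of Andrew Ng's deeplearning.ai specialization):

The black boxes in the figure are the default boxes set in advance, and the red boxes are the ground-truth boxes.

Sliding windows are implemented with convolution, and the network learns to output precise bounding boxes. The label is a vector that includes whether an object is present, the box center, the box width and height, the object class, and so on; a convolutional neural network is then trained on it to obtain good predictions.

How are the box center and the box width and height defined? One approach is as follows:

The box center and the box width and height can be defined relative to the default box. For the car on the right, for example, let the default box (the predefined black box) have top-left corner (0,0) and bottom-right corner (1,1). The center of the ground-truth box in the label is then roughly (0.4,0.3), i.e. […], and its width and height are […].

SSD's approach is smarter: it places a series of default boxes directly on the feature maps. The parameters of these default boxes, namely the top-left and bottom-right coordinates (not the center-plus-width-and-height form), are normalized, i.e. expressed in the coordinate system of the network input image (the top-left corner of the input image is defined as (0,0) and the bottom-right corner as (1,1), which establishes the input-image coordinate system). The parameters learned afterwards are therefore relative to the input-image coordinate system. A second advantage is that the default boxes come in a variety of aspect ratios and the feature maps come in a variety of sizes (feature maps of different layers differ in size, hence in downsampling factor, so their default boxes cover different areas of the input image), unlike the fixed black default boxes in the figure above. This suits the real world better, where the same object appears at different sizes at different times and places, and so improves the detection of objects at multiple scales.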
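The relationship between the normalized corner form stored by the layer and the center-plus-size form mentioned above can be sketched as follows. The struct and function names are hypothetical and chosen only for this illustration.

```cpp
#include <cassert>

struct CornerBox { float xmin, ymin, xmax, ymax; };
struct CenterBox { float cx, cy, width, height; };

// Corner form (what PriorBox stores) to center-plus-size form.
CenterBox ToCenterSize(const CornerBox& b) {
  assert(b.xmax >= b.xmin && b.ymax >= b.ymin);
  return {(b.xmin + b.xmax) / 2, (b.ymin + b.ymax) / 2,
          b.xmax - b.xmin, b.ymax - b.ymin};
}

// Normalized coordinates to pixels of an img_width x img_height input image.
CornerBox ToPixels(const CornerBox& b, int img_width, int img_height) {
  return {b.xmin * img_width, b.ymin * img_height,
          b.xmax * img_width, b.ymax * img_height};
}
```

For example, the normalized box (0.25, 0.25, 0.75, 0.5) has center (0.5, 0.375), width 0.5 and height 0.25, and on a 300×300 input its right edge lies at 225 px.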

Please credit the source if you repost this article!