Paper Reading Notes: NormFace: L2 HyperSphere Embedding for Face Verification



This note covers the following:

  Paper link
  Code link
  Reference blogs

  The paper is from the University of Electronic Science and Technology of China (UESTC). Its content is very similar to COCO_LOSS, but it explains part of the underlying principles from a mathematical point of view, so it is worth reading alongside COCO_LOSS.


Main Idea


  The paper first points out the problem: when training a face recognition model, the softmax loss optimizes un-normalized inner products, but at test time verification is usually done with cosine or Euclidean distance, so the training objective and the final distance metric are not consistent.
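  For unit-length vectors the two test-time metrics are interchangeable: the squared Euclidean distance between normalized features is a monotone function of their cosine similarity, so once features are normalized it no longer matters which one is used:

  \[ \|\tilde{x}-\tilde{y}\|_2^2 \;=\; \|\tilde{x}\|^2 + \|\tilde{y}\|^2 - 2\,\tilde{x}^\top\tilde{y} \;=\; 2 - 2\cos\theta . \]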

  The paper then analyzes the effect of normalizing at test time and finds that computing the inner product or Euclidean distance on normalized features works best, which motivates applying normalization during training as well.


Algorithm


  
  First, why normalization is needed and why the bias term must be dropped: the softmax loss tends to learn features with a radial distribution, because simply enlarging the feature scale makes the softmax loss smaller. In addition, if the fully connected layer before the softmax has a bias, some classes may be separable only through the bias even though they are not separable by angle; if the features are then normalized, such a class is spread over the unit sphere and overlaps with the features of the other classes. Therefore the bias must not be used when the features are normalized.
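  A one-line sanity check of the "radial" claim (my own illustration, not the paper's notation): take a bias-free sample that is already correctly classified, i.e. its target logit $W_y^\top f$ is strictly the largest. Scaling the feature by $a > 1$ scales every logit, and

  \[ P_y(a) = \frac{e^{a W_y^\top f}}{\sum_j e^{a W_j^\top f}} \longrightarrow 1 \quad (a \to \infty), \]

  so the loss $-\log P_y(a)$ keeps shrinking as the feature norm grows; stretching features radially is the cheapest way for the network to reduce the loss.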


  Next, the paper explains why the network fails to converge once normalization is added: after normalization, the inputs to the softmax lie in [-1, 1], so the loss is bounded from below. Even if a sample is perfectly classified, i.e. the output for its own class is 1 and all others are -1, the probability Py is still fairly small, and since the softmax gradient is 1 - Py, even easy samples keep producing large gradients. With the original softmax loss, the input scale can grow arbitrarily large, pushing Py close to 1 and making the gradient gap between easy and hard samples clearly visible.
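  To make this concrete (a back-of-the-envelope example, not a number from the paper): with $n$ classes and the best case described above (target logit 1, all others -1),

  \[ P_y = \frac{e^{1}}{e^{1} + (n-1)e^{-1}} = \frac{e^{2}}{e^{2} + n - 1}, \]

  which for the 10,572-class training set in the prototxt below is about $7.39 / 10578 \approx 0.0007$; the gradient $1 - P_y$ therefore stays close to 1 even for this easiest possible sample, which is exactly the non-convergence the paper describes.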
  The remedy is then obvious: add a scale factor s after normalization to widen this gap again. The resulting normalized softmax loss is given below, where both the weights W and the features f are L2-normalized.
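  The loss shown in the (omitted) figure is, following the paper's formulation, a scaled softmax over cosine similarities, with $\tilde{W}_j$ and $\tilde{f}_i$ denoting the normalized j-th weight column and i-th feature:

  \[ \mathcal{L} = -\frac{1}{m}\sum_{i=1}^{m} \log \frac{e^{\,s\,\tilde{W}_{y_i}^\top \tilde{f}_i}}{\sum_{j} e^{\,s\,\tilde{W}_{j}^\top \tilde{f}_i}} . \]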
  
  The only difference from COCO_LOSS: in COCO_LOSS the class centers are re-estimated from the samples at every update, whereas in NormFace the (normalized) weights are learned directly by the network. This is the main distinction between the two papers.
  Note on the choice of the coefficient s: the paper illustrates this with a figure (omitted here); in the prototxt below the scale is a learned scalar initialized to 10 and clamped to be at least 0.01.
  


Experimental Results




Summary


  That normalization by itself benefits deep-network training is already common knowledge; the contribution here is to make the softmax loss operate directly on the normalized (hypersphere) embedding.


Code Implementation


  Prototxt definition: the features are first L2-normalized, then a modified fully connected (InnerProduct) layer multiplies them with normalized weights, the output is multiplied by a learned scale, and finally SoftmaxWithLoss produces the training loss:

layer {
  name: "normalize1"
  type: "Normalize"
  bottom: "pool5/7x7_s1"
  top: "norm1"
}
layer {
  name: "cosine_layer"
  type: "InnerProduct"
  bottom: "norm1"
  top: "cosine"
  param {
    lr_mult: 100
    decay_mult: 0
  }
  inner_product_param{
    bias_term: false
    normalize: true
    num_output: 10572
    weight_filler {
      type: "gaussian_unitball"
    }
  }
}
layer {
  name: "cosine_scale"
  type: "Scale"
  bottom: "cosine"
  top: "cosine"
  scale_param {
    num_axes: 0
    bias_term: false
    min_value: 0.01
    filler{
     value: 10
    }
    bias_filler{
      value: 0
    }
  }
}
layer {
  name: "softmax_loss"
  type: "SoftmaxWithLoss"
  bottom: "cosine"
  bottom: "label"
  top: "softmax_loss"
  loss_weight: 1
}

layer {
  name: "Accuracy"
  type: "Accuracy"
  bottom: "cosine"
  bottom: "label"
  top: "accuracy"
  include { 
    phase: TEST
  }
  accuracy_param {
    min_is_better: false
  }
}

I0713 12:52:55.991122 26934 layer_factory.hpp:77] Creating layer normalize1
I0713 12:52:55.991127 26934 net.cpp:100] Creating Layer normalize1
I0713 12:52:55.991129 26934 net.cpp:434] normalize1 <- pool5/7x7_s1
I0713 12:52:55.991143 26934 net.cpp:408] normalize1 -> norm1
I0713 12:52:55.991185 26934 net.cpp:150] Setting up normalize1
I0713 12:52:55.991190 26934 net.cpp:157] Top shape: 16 1024 1 1 (16384)
I0713 12:52:55.991192 26934 net.cpp:165] Memory required for data: 1661300672
I0713 12:52:55.991194 26934 layer_factory.hpp:77] Creating layer cosine_layer
I0713 12:52:55.991209 26934 net.cpp:100] Creating Layer cosine_layer
I0713 12:52:55.991212 26934 net.cpp:434] cosine_layer <- norm1
I0713 12:52:55.991216 26934 net.cpp:408] cosine_layer -> cosine
I0713 12:52:55.991401 26934 net.cpp:150] Setting up cosine_layer
I0713 12:52:55.991407 26934 net.cpp:157] Top shape: 16 13 (208)
I0713 12:52:55.991410 26934 net.cpp:165] Memory required for data: 1661301504
I0713 12:52:55.991413 26934 layer_factory.hpp:77] Creating layer cosine_scale
I0713 12:52:55.991418 26934 net.cpp:100] Creating Layer cosine_scale
I0713 12:52:55.991422 26934 net.cpp:434] cosine_scale <- cosine
I0713 12:52:55.991426 26934 net.cpp:395] cosine_scale -> cosine (in-place)
I0713 12:52:55.991487 26934 net.cpp:150] Setting up cosine_scale
I0713 12:52:55.991490 26934 net.cpp:157] Top shape: 16 13 (208)
I0713 12:52:55.991493 26934 net.cpp:165] Memory required for data: 1661302336
I0713 12:52:55.991497 26934 layer_factory.hpp:77] Creating layer cosine_cosine_scale_0_split
I0713 12:52:55.991500 26934 net.cpp:100] Creating Layer cosine_cosine_scale_0_split
I0713 12:52:55.991503 26934 net.cpp:434] cosine_cosine_scale_0_split <- cosine
I0713 12:52:55.991509 26934 net.cpp:408] cosine_cosine_scale_0_split -> cosine_cosine_scale_0_split_0
I0713 12:52:55.991514 26934 net.cpp:408] cosine_cosine_scale_0_split -> cosine_cosine_scale_0_split_1
I0713 12:52:55.991541 26934 net.cpp:150] Setting up cosine_cosine_scale_0_split
I0713 12:52:55.991545 26934 net.cpp:157] Top shape: 16 13 (208)
I0713 12:52:55.991549 26934 net.cpp:157] Top shape: 16 13 (208)
I0713 12:52:55.991551 26934 net.cpp:165] Memory required for data: 1661304000
I0713 12:52:55.991554 26934 layer_factory.hpp:77] Creating layer softmax_loss
I0713 12:52:55.991559 26934 net.cpp:100] Creating Layer softmax_loss
I0713 12:52:55.991561 26934 net.cpp:434] softmax_loss <- cosine_cosine_scale_0_split_0
I0713 12:52:55.991565 26934 net.cpp:434] softmax_loss <- label_data_1_split_0
I0713 12:52:55.991569 26934 net.cpp:408] softmax_loss -> softmax_loss
I0713 12:52:55.991575 26934 layer_factory.hpp:77] Creating layer softmax_loss
I0713 12:52:55.992048 26934 net.cpp:150] Setting up softmax_loss
I0713 12:52:55.992056 26934 net.cpp:157] Top shape: (1)
I0713 12:52:55.992058 26934 net.cpp:160]     with loss weight 1
I0713 12:52:55.992066 26934 net.cpp:165] Memory required for data: 1661304004

normalize_layer.hpp/normalize_layer.cpp


  normalize_layer.hpp/normalize_layer.cpp (performs the normalization)
  Normalization formula: see the L2 expression right after this note.
  Because the result feeds a matrix multiplication, normalization is done over the channel dimension of an N*C*1*1 blob. Note: the preceding layer must already produce a per-sample vector (spatial size 1x1), e.g. a global pooling or fully connected layer; here it is pool5/7x7_s1 with shape 16 1024 1 1 (16384).
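  For reference, a minimal statement of the L2 normalization this layer computes over the channel dimension; the $\epsilon = 10^{-6}$ matches the constant added in the code below to avoid division by zero:

  \[ \tilde{x}_c = \frac{x_c}{\sqrt{\sum_{c'=1}^{C} x_{c'}^2 + \epsilon}} . \]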

message NormalizeParameter {
  optional string normalize_type = 1 [default = "L2"];
  optional bool fix_gradient = 2 [default = false];
  optional bool bp_norm = 3 [default = false];
}

#ifndef CAFFE_NORMALIZE_LAYER_HPP_
#define CAFFE_NORMALIZE_LAYER_HPP_

#include <utility>
#include <vector>

#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"

namespace caffe {
  /**
  * @brief Normalizes input.
  */
  template <typename Dtype>
  class NormalizeLayer : public Layer<Dtype> {
  public:
    explicit NormalizeLayer(const LayerParameter& param)
      : Layer<Dtype>(param) {}
    virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
                            const vector<Blob<Dtype>*>& top);
    virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
                         const vector<Blob<Dtype>*>& top);

    virtual inline const char* type() const { return "Normalize"; }
    virtual inline int ExactNumBottomBlobs() const { return 1; }
    virtual inline int MinTopBlobs() const { return 1; }
    virtual inline int MaxTopBlobs() const { return 2; }

  protected:
    virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
                             const vector<Blob<Dtype>*>& top);
    virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
                             const vector<Blob<Dtype>*>& top);
    virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
                              const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
    virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
                              const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

    // squared_ stores the element-wise squares; norm_ stores, for each
    // sample/position, the L2 norm accumulated over all channels
    Blob<Dtype> sum_multiplier_, squared_, norm_;
    std::string normalize_type_;        // normalization type, "L2" by default
    bool fix_gradient_;
    bool bp_norm_;
  };

}  // namespace caffe
#endif  // CAFFE_NORMALIZE_LAYER_HPP_
#include <algorithm>
#include <vector>
#include <cmath>

#include "caffe/layer.hpp"
#include "caffe/util/math_functions.hpp"
#include "caffe/layers/normalize_layer.hpp"

namespace caffe {

#define sign(x) ((Dtype(0) < (x)) - ((x) < Dtype(0)))

template <typename Dtype>
void NormalizeLayer<Dtype>::LayerSetUp(     // only reads the layer parameters here
  const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  normalize_type_ =
    this->layer_param_.normalize_param().normalize_type();
  fix_gradient_ =
    this->layer_param_.normalize_param().fix_gradient();
  bp_norm_ = this->layer_param_.normalize_param().bp_norm() && top.size() == 2;
}

template <typename Dtype>
void NormalizeLayer<Dtype>::Reshape(const vector<Blob<Dtype>*>& bottom,
                                    const vector<Blob<Dtype>*>& top) {
  top[0]->Reshape(bottom[0]->num(), bottom[0]->channels(),
                  bottom[0]->height(), bottom[0]->width());
  squared_.Reshape(bottom[0]->num(), bottom[0]->channels(),
                   bottom[0]->height(), bottom[0]->width());
  if (top.size() == 2) {
    top[1]->Reshape(bottom[0]->num(), 1,
                    bottom[0]->height(), bottom[0]->width());
  }
  norm_.Reshape(bottom[0]->num(), 1,
                bottom[0]->height(), bottom[0]->width());
}

template <typename Dtype>
void NormalizeLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  const Dtype* bottom_data = bottom[0]->cpu_data();
  Dtype* top_data = top[0]->mutable_cpu_data();
  Dtype* square_data = squared_.mutable_cpu_data();
  Dtype* norm_data = (top.size() == 2) ? top[1]->mutable_cpu_data() : norm_.mutable_cpu_data();
  int num = bottom[0]->num();
  int channels = bottom[0]->channels();
  int spatial_dim = bottom[0]->height() * bottom[0]->width();
  if (normalize_type_ == "L2") {
    caffe_sqr<Dtype>(num*channels*spatial_dim, bottom_data, square_data);   // element-wise square of the input
    for (int n = 0; n < num; n++) {
      for (int s = 0; s < spatial_dim; s++) {
        norm_data[n*spatial_dim + s] = Dtype(0);
        for (int c = 0; c < channels; c++) {
          norm_data[n*spatial_dim + s] += square_data[(n * channels + c) * spatial_dim + s];       // accumulate the squares over channels
        }
        norm_data[n*spatial_dim + s] += 1e-6;   
        norm_data[n*spatial_dim + s] = sqrt(norm_data[n*spatial_dim + s]);      // square root -> per-position L2 norm (see the formula above)
        for (int c = 0; c < channels; c++) {
          top_data[(n * channels + c) * spatial_dim + s] = bottom_data[(n * channels + c) * spatial_dim + s] / norm_data[n*spatial_dim + s];    // divide by the norm to produce the forward output
        }
      }
    }
  }
  else if (normalize_type_ == "L1") {
    caffe_abs<Dtype>(num*channels*spatial_dim, bottom_data, square_data);
    for (int n = 0; n < num; n++) {
      for (int s = 0; s < spatial_dim; s++) {
        norm_data[n*spatial_dim +s] = Dtype(0);
        for (int c = 0; c < channels; c++) {
          norm_data[n*spatial_dim + s] += square_data[(n * channels + c) * spatial_dim + s];
        }
        norm_data[n*spatial_dim + s] += 1e-6;
        // for L1, the norm is simply the accumulated absolute values; no square root needed
        for (int c = 0; c < channels; c++) {
          top_data[(n * channels + c) * spatial_dim + s] = bottom_data[(n * channels + c) * spatial_dim + s] / norm_data[n*spatial_dim + s];
        }
      }
    }
  }
  else {
    NOT_IMPLEMENTED;
  }
}

template <typename Dtype>
void NormalizeLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  const Dtype* top_diff = top[0]->cpu_diff();
  const Dtype* top_data = top[0]->cpu_data();
  const Dtype* bottom_data = bottom[0]->cpu_data();
  const Dtype* square_data = squared_.cpu_data();
  const Dtype* norm_data = (top.size() == 2) ? top[1]->cpu_data() : norm_.cpu_data();
  Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();

  int num = bottom[0]->num();
  int channels = bottom[0]->channels();
  int spatial_dim = bottom[0]->height() * bottom[0]->width();
  if (propagate_down[0]) {
    if (normalize_type_ == "L2") {
      for (int n = 0; n < num; ++n) {
        for (int s = 0; s < spatial_dim; s++) {
          Dtype a = caffe_cpu_strided_dot(channels, top_data + n*channels*spatial_dim + s, spatial_dim, top_diff + n*channels*spatial_dim + s, spatial_dim);
          for (int c = 0; c < channels; c++) {
            bottom_diff[(n * channels + c) * spatial_dim + s] =
              (top_diff[(n * channels + c) * spatial_dim + s] - top_data[(n * channels + c) * spatial_dim + s] * a) / norm_data[n*spatial_dim + s];
          }
        }
      }
    }
    else if (normalize_type_ == "L1") {
      for (int n = 0; n < num; ++n) {
        for (int s = 0; s < spatial_dim; s++) {
          Dtype a = caffe_cpu_strided_dot(channels, top_data + n*channels*spatial_dim + s, spatial_dim, top_diff + n*channels*spatial_dim + s, spatial_dim);
          for (int c = 0; c < channels; c++) {
            bottom_diff[(n * channels + c) * spatial_dim + s] =
              (top_diff[(n * channels + c) * spatial_dim + s] - sign(bottom_data[(n * channels + c) * spatial_dim + s]) * a) / norm_data[n*spatial_dim + s];
          }
        }
      }
    }
    else {
      NOT_IMPLEMENTED;
    }
  }

  if (bp_norm_) {
    const Dtype* norm_diff =top[1]->cpu_diff();
    if (normalize_type_ == "L2") {
      for (int n = 0; n < num; ++n) {
        for (int s = 0; s < spatial_dim; s++) {
          for (int c = 0; c < channels; c++) {
            bottom_diff[(n * channels + c) * spatial_dim + s] += norm_diff[n*spatial_dim + s] * top_data[(n * channels + c) * spatial_dim + s];
          }
        }
      }
    }
    else if (normalize_type_ == "L1") {
      for (int n = 0; n < num; ++n) {
        for (int s = 0; s < spatial_dim; s++) {
          for (int c = 0; c < channels; c++) {
            bottom_diff[(n * channels + c) * spatial_dim + s] += norm_diff[n*spatial_dim + s] * sign(bottom_data[(n * channels + c) * spatial_dim + s]);
          }
        }
      }
    }
  }
}


#ifdef CPU_ONLY
STUB_GPU(NormalizeLayer);
#endif

INSTANTIATE_CLASS(NormalizeLayer);
REGISTER_LAYER_CLASS(Normalize);
}  // namespace caffe

inner_product_layer.hpp/inner_product_layer.cpp


  inner_product_layer.hpp/inner_product_layer.cpp (fully connected layer whose weights are L2-normalized and learned). In the setup log above its top shape is 16 13 (208), since that log was produced with 13 output classes, whereas the prototxt above uses num_output: 10572.
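  Because both the weight rows and the incoming features are unit length, each output of this layer is exactly a cosine similarity, which is why the prototxt names it cosine_layer:

  \[ z_j = \tilde{W}_j^\top \tilde{f} = \cos\theta_j \in [-1, 1] . \]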

message InnerProductParameter {
  optional uint32 num_output = 1; // The number of outputs for the layer
  optional bool bias_term = 2 [default = true]; // whether to have bias terms
  optional FillerParameter weight_filler = 3; // The filler for the weight
  optional FillerParameter bias_filler = 4; // The filler for the bias

  // The first axis to be lumped into a single inner product computation;
  // all preceding axes are retained in the output.
  // May be negative to index from the end (e.g., -1 for the last axis).
  optional int32 axis = 5 [default = 1];
  // Specify whether to transpose the weight matrix or not.
  // If transpose == true, any operations will be performed on the transpose
  // of the weight matrix. The weight matrix itself is not going to be transposed
  // but rather the transfer flag of operations will be toggled accordingly.
  optional bool transpose = 6 [default = false];
  optional bool normalize = 7 [default = false];
}

#ifndef CAFFE_INNER_PRODUCT_LAYER_HPP_
#define CAFFE_INNER_PRODUCT_LAYER_HPP_

#include <vector>
#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"

namespace caffe {

/**
 * @brief Also known as a "fully-connected" layer, computes an inner product
 *        with a set of learned weights, and (optionally) adds biases.
 *
 * TODO(dox): thorough documentation for Forward, Backward, and proto params.
 */
template <typename Dtype>
class InnerProductLayer : public Layer<Dtype> {
 public:
  explicit InnerProductLayer(const LayerParameter& param)
      : Layer<Dtype>(param) {}
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);

  virtual inline const char* type() const { return "InnerProduct"; }
  virtual inline int MinBottomBlobs() const { return 1; }
  virtual inline int ExactNumTopBlobs() const { return 1; }

 protected:
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  int M_;
  int K_;
  int N_;
  bool bias_term_;
  Blob<Dtype> bias_multiplier_;
  bool transpose_;  ///< if true, assume transposed weights
  bool normalize_;      // whether to L2-normalize each weight row before the inner product
  Blob<Dtype> weight_norm_;     // buffer for the per-output weight norms
};

}  // namespace caffe
#endif  // CAFFE_INNER_PRODUCT_LAYER_HPP_
#include <vector>
#include "caffe/filler.hpp"
#include "caffe/layers/inner_product_layer.hpp"
#include "caffe/util/math_functions.hpp"

namespace caffe {

template <typename Dtype>
void InnerProductLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  const int num_output = this->layer_param_.inner_product_param().num_output();     // number of output neurons
  bias_term_ = this->layer_param_.inner_product_param().bias_term();        // whether a bias term is used
  transpose_ = this->layer_param_.inner_product_param().transpose();         // whether the weight matrix is stored transposed
  normalize_ = this->layer_param_.inner_product_param().normalize();        // whether to L2-normalize the weight rows (NormFace addition)
  if (bottom.size() == 1) N_ = num_output;  // N_ is the number of output neurons
  else N_ = bottom[1]->num();
  const int axis = bottom[0]->CanonicalAxisIndex(
      this->layer_param_.inner_product_param().axis());
  // Dimensions starting from "axis" are "flattened" into a single
  // length K_ vector. For example, if bottom[0]'s shape is (N, C, H, W),
  // and axis == 1, N inner products with dimension CHW are performed.
  K_ = bottom[0]->count(axis);      // with axis == 1, K_ = C*H*W, the number of input neurons
  // Check if we need to set up the weights
  if (this->blobs_.size() > 0 || bottom.size() == 3 
      || (bottom.size() == 2 && !bias_term_)) {
    LOG(INFO) << "Skipping parameter initialization";
  }
  else {
    int bias_index = 0;
    if (bias_term_) {
      if (bottom.size() == 2) {
        this->blobs_.resize(1);
      }
      else {
        this->blobs_.resize(2);
        bias_index = 1;
      }
    }
    else {
      this->blobs_.resize(1);
    }
    if (bottom.size() == 1) {       // a single bottom: the weights are a layer parameter, so initialize them
      // Initialize the weights
      vector<int> weight_shape(2);
      if (transpose_) {
        weight_shape[0] = K_;
        weight_shape[1] = N_;
      }
      else {
        weight_shape[0] = N_;
        weight_shape[1] = K_;
      }
      this->blobs_[0].reset(new Blob<Dtype>(weight_shape));     // allocate the weight blob: N_ output neurons x K_ input neurons
      // fill the weights
      shared_ptr<Filler<Dtype> > weight_filler(GetFiller<Dtype>(    // build the weight filler specified in the prototxt
        this->layer_param_.inner_product_param().weight_filler()));
      weight_filler->Fill(this->blobs_[0].get());   // fill the weights with their initial values
    }

    // If necessary, initialize and fill the bias term
    if (bias_term_ && bottom.size() <= 2) {
      vector<int> bias_shape(1, N_);
      this->blobs_[bias_index].reset(new Blob<Dtype>(bias_shape));
      shared_ptr<Filler<Dtype> > bias_filler(GetFiller<Dtype>(
          this->layer_param_.inner_product_param().bias_filler()));
      bias_filler->Fill(this->blobs_[bias_index].get());
    }
  }  // parameter initialization
  if (bottom.size() == 1) this->param_propagate_down_.resize(this->blobs_.size(), true);   // mark every parameter as requiring gradients
}

template <typename Dtype>
void InnerProductLayer<Dtype>::Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  // Figure out the dimensions
  const int axis = bottom[0]->CanonicalAxisIndex(
      this->layer_param_.inner_product_param().axis());
  const int new_K = bottom[0]->count(axis);
  CHECK_EQ(K_, new_K)
      << "Input size incompatible with inner product parameters.";
  // The first "axis" dimensions are independent inner products; the total
  // number of these is M_, the product over these dimensions.
  M_ = bottom[0]->count(0, axis);   // M_ = batch size, i.e. the number of independent inner products
  if (bottom.size() >= 2) N_ = bottom[1]->num();
  // The top shape will be the bottom shape with the flattened axes dropped,
  // and replaced by a single axis with dimension num_output (N_).
  vector<int> top_shape = bottom[0]->shape();
  top_shape.resize(axis + 1);   // keep the leading axes, drop the flattened ones
  top_shape[axis] = N_;     // top shape becomes [M_, N_]; N_ is the number of output neurons
  top[0]->Reshape(top_shape);   // set the top blob shape
  // Set up the bias multiplier
  if (bias_term_ && bottom.size() <= 2) {
    vector<int> bias_shape(1, M_);      // the bias multiplier has one entry per sample
    bias_multiplier_.Reshape(bias_shape);       // allocate the bias multiplier
    caffe_set(M_, Dtype(1), bias_multiplier_.mutable_cpu_data());   // fill it with ones
  }
  if (bottom.size() == 1 && normalize_) {
    vector<int> weight_norm_shape(1, N_);       // one norm value per output neuron
    weight_norm_.Reshape(weight_norm_shape);    // allocate the weight-norm buffer
    caffe_set(N_, Dtype(0), weight_norm_.mutable_cpu_data());   // initialize the norms to zero
  }
}

template <typename Dtype>
void InnerProductLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  const Dtype* bottom_data = bottom[0]->cpu_data();     // pointer to the input data
  Dtype* top_data = top[0]->mutable_cpu_data();  // pointer to the output data
  const Dtype* weight = bottom.size() >= 2 ? bottom[1]->cpu_data() : this->blobs_[0]->cpu_data();    // pointer to the weights (a bottom blob or the layer parameter)
  if (normalize_ && bottom.size() == 1) {
    Dtype* mutable_weight = this->blobs_[0]->mutable_cpu_data();
    Dtype sum_sq;
    for (int n = 0; n < N_; n++) {
      sum_sq = caffe_cpu_dot(K_, weight + n*K_, weight + n*K_) + 1e-6;      // squared L2 norm of the n-th weight row (plus epsilon)
      caffe_scal<Dtype>(K_, Dtype(1) / sqrt(sum_sq), mutable_weight + n*K_);        // scale the row to unit length
    }
  }
  caffe_cpu_gemm<Dtype>(CblasNoTrans, transpose_ ? CblasNoTrans : CblasTrans,
      M_, N_, K_, (Dtype)1.,
      bottom_data, weight, (Dtype)0., top_data);    // matrix multiply: top = bottom * W^T
  if (bias_term_) { 
    caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, N_, 1, (Dtype)1.,
        bias_multiplier_.cpu_data(),    
        bottom.size() == 3 ? bottom[2]->cpu_data() : this->blobs_[1]->cpu_data(), (Dtype)1., top_data);      // add the (broadcast) bias
  }
}

template <typename Dtype>
void InnerProductLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down,
    const vector<Blob<Dtype>*>& bottom) {
  const Dtype* weight = bottom.size() >= 2 ? bottom[1]->cpu_data() : this->blobs_[0]->cpu_data();
  if ((bottom.size() == 1 && this->param_propagate_down_[0])||
    (bottom.size() >= 2 && propagate_down[1])){
    const Dtype* top_diff = top[0]->cpu_diff();
    const Dtype* bottom_data = bottom[0]->cpu_data();
    Dtype* weight_diff = bottom.size() >= 2 ? bottom[1]->mutable_cpu_diff() : this->blobs_[0]->mutable_cpu_diff();
    if (bottom.size() >= 2) {
      if (transpose_) {
        caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans,
                              K_, N_, M_,
                              (Dtype)1., bottom_data, top_diff,
                              (Dtype)0., weight_diff);
      }
      else {
        caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans,
                              N_, K_, M_,
                              (Dtype)1., top_diff, bottom_data,
                              (Dtype)0., weight_diff);
      }
    }
    else {
      if (transpose_) {
        caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans,
                              K_, N_, M_,
                              (Dtype)1., bottom_data, top_diff,
                              (Dtype)1., weight_diff);
      }
      else {
        caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans,
                              N_, K_, M_,
                              (Dtype)1., top_diff, bottom_data,
                              (Dtype)1., weight_diff);
      }
    }
  }
  if (bias_term_ && (this->param_propagate_down_[1] || 
                     (bottom.size() == 3 && propagate_down[2]))) {
    const Dtype* top_diff = top[0]->cpu_diff();
    // Gradient with respect to bias
    caffe_cpu_gemv<Dtype>(CblasTrans, M_, N_, (Dtype)1., top_diff,
        bias_multiplier_.cpu_data(), (Dtype)1.,
        bottom.size()==3? bottom[2]->mutable_cpu_diff() : this->blobs_[1]->mutable_cpu_diff());
  }
  if (propagate_down[0]) {
    const Dtype* top_diff = top[0]->cpu_diff();
    // Gradient with respect to bottom data
    if (transpose_) {
      caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasTrans,
          M_, K_, N_,
          (Dtype)1., top_diff, weight,
          (Dtype)0., bottom[0]->mutable_cpu_diff());
    } else {
      caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans,
          M_, K_, N_,
          (Dtype)1., top_diff, weight,
          (Dtype)0., bottom[0]->mutable_cpu_diff());
    }
  }
}

#ifdef CPU_ONLY
STUB_GPU(InnerProductLayer);
#endif

INSTANTIATE_CLASS(InnerProductLayer);
REGISTER_LAYER_CLASS(InnerProduct);
}  // namespace caffe

scale_layer.hpp/scale_layer.cpp


  scale_layer.hpp/scale_layer.cpp: a standard Scale layer, extended here with optional min_value/max_value clamping of the learned scale.
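  Putting the three layers together, what SoftmaxWithLoss finally sees is the scaled cosine, where s is the single learned scalar of this layer (num_axes: 0), initialized to 10 and clamped to be at least 0.01 by the prototxt above:

  \[ z_j = s \cdot \cos\theta_j , \qquad s \geq 0.01 . \]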

message ScaleParameter {
  // The first axis of bottom[0] (the first input Blob) along which to apply
  // bottom[1] (the second input Blob).  May be negative to index from the end
  // (e.g., -1 for the last axis).
  //
  // For example, if bottom[0] is 4D with shape 100x3x40x60, the output
  // top[0] will have the same shape, and bottom[1] may have any of the
  // following shapes (for the given value of axis):
  //    (axis == 0 == -4) 100; 100x3; 100x3x40; 100x3x40x60
  //    (axis == 1 == -3)          3;     3x40;     3x40x60
  //    (axis == 2 == -2)                   40;       40x60
  //    (axis == 3 == -1)                                60
  // Furthermore, bottom[1] may have the empty shape (regardless of the value of
  // "axis") -- a scalar multiplier.
  optional int32 axis = 1 [default = 1];

  // (num_axes is ignored unless just one bottom is given and the scale is
  // a learned parameter of the layer.  Otherwise, num_axes is determined by the
  // number of axes by the second bottom.)
  // The number of axes of the input (bottom[0]) covered by the scale
  // parameter, or -1 to cover all axes of bottom[0] starting from `axis`.
  // Set num_axes := 0, to multiply with a zero-axis Blob: a scalar.
  optional int32 num_axes = 2 [default = 1];

  // (filler is ignored unless just one bottom is given and the scale is
  // a learned parameter of the layer.)
  // The initialization for the learned scale parameter.
  // Default is the unit (1) initialization, resulting in the ScaleLayer
  // initially performing the identity operation.
  optional FillerParameter filler = 3;

  // Whether to also learn a bias (equivalent to a ScaleLayer+BiasLayer, but
  // may be more efficient).  Initialized with bias_filler (defaults to 0).
  optional bool bias_term = 4 [default = false];
  optional FillerParameter bias_filler = 5;
  optional float min_value = 6;
  optional float max_value = 7;
}

#ifndef CAFFE_SCALE_LAYER_HPP_
#define CAFFE_SCALE_LAYER_HPP_

#include <vector>
#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"
#include "caffe/layers/bias_layer.hpp"

namespace caffe {

/**
 * @brief Computes the elementwise product of two input Blobs, with the shape of
 *        the latter Blob "broadcast" to match the shape of the former.
 *        Equivalent to tiling the latter Blob, then computing the elementwise
 *        product. Note: for efficiency and convenience, this layer can
 *        additionally perform a "broadcast" sum too when `bias_term: true`
 *        is set.
 *
 * The latter, scale input may be omitted, in which case it's learned as
 * parameter of the layer (as is the bias, if it is included).
 */
template <typename Dtype>
class ScaleLayer: public Layer<Dtype> {
 public:
  explicit ScaleLayer(const LayerParameter& param)
      : Layer<Dtype>(param) {}
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);

  virtual inline const char* type() const { return "Scale"; }
  // Scale
  virtual inline int MinBottomBlobs() const { return 1; }
  virtual inline int MaxBottomBlobs() const { return 2; }
  virtual inline int ExactNumTopBlobs() const { return 1; }

 protected:
  /**
   * In the below shape specifications, @f$ i @f$ denotes the value of the
   * `axis` field given by `this->layer_param_.scale_param().axis()`, after
   * canonicalization (i.e., conversion from negative to positive index,
   * if applicable).
   *
   * @param bottom input Blob vector (length 2)
   *   -# @f$ (d_0 \times ... \times
   *           d_i \times ... \times d_j \times ... \times d_n) @f$
   *      the first factor @f$ x @f$
   *   -# @f$ (d_i \times ... \times d_j) @f$
   *      the second factor @f$ y @f$
   * @param top output Blob vector (length 1)
   *   -# @f$ (d_0 \times ... \times
   *           d_i \times ... \times d_j \times ... \times d_n) @f$
   *      the product @f$ z = x y @f$ computed after "broadcasting" y.
   *      Equivalent to tiling @f$ y @f$ to have the same shape as @f$ x @f$,
   *      then computing the elementwise product.
   */
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  shared_ptr<Layer<Dtype> > bias_layer_;
  vector<Blob<Dtype>*> bias_bottom_vec_;
  vector<bool> bias_propagate_down_;
  int bias_param_id_;

  Blob<Dtype> sum_multiplier_;
  Blob<Dtype> sum_result_;
  Blob<Dtype> temp_;
  int axis_;
  int outer_dim_, scale_dim_, inner_dim_;
};


}  // namespace caffe
#endif  // CAFFE_SCALE_LAYER_HPP_
#include <algorithm>
#include <vector>
#include "caffe/filler.hpp"
#include "caffe/layer_factory.hpp"
#include "caffe/layers/scale_layer.hpp"
#include "caffe/util/math_functions.hpp"

namespace caffe {

template <typename Dtype>
void ScaleLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  const ScaleParameter& param = this->layer_param_.scale_param();
  if (bottom.size() == 1 && this->blobs_.size() > 0) {
    LOG(INFO) << "Skipping parameter initialization";
  } else if (bottom.size() == 1) {
    // scale is a learned parameter; initialize it
    axis_ = bottom[0]->CanonicalAxisIndex(param.axis());
    const int num_axes = param.num_axes();
    CHECK_GE(num_axes, -1) << "num_axes must be non-negative, "
                           << "or -1 to extend to the end of bottom[0]";
    if (num_axes >= 0) {
      CHECK_GE(bottom[0]->num_axes(), axis_ + num_axes)
          << "scale blob's shape extends past bottom[0]'s shape when applied "
          << "starting with bottom[0] axis = " << axis_;
    }
    this->blobs_.resize(1);
    const vector<int>::const_iterator& shape_start =
        bottom[0]->shape().begin() + axis_;
    const vector<int>::const_iterator& shape_end =
        (num_axes == -1) ? bottom[0]->shape().end() : (shape_start + num_axes);
    vector<int> scale_shape(shape_start, shape_end);
    this->blobs_[0].reset(new Blob<Dtype>(scale_shape));
    FillerParameter filler_param(param.filler());
    if (!param.has_filler()) {
      // Default to unit (1) filler for identity operation.
      filler_param.set_type("constant");
      filler_param.set_value(1);
    }
    shared_ptr<Filler<Dtype> > filler(GetFiller<Dtype>(filler_param));
    filler->Fill(this->blobs_[0].get());
  }
  if (param.bias_term()) {
    LayerParameter layer_param(this->layer_param_);
    layer_param.set_type("Bias");
    BiasParameter* bias_param = layer_param.mutable_bias_param();
    bias_param->set_axis(param.axis());
    if (bottom.size() > 1) {
      bias_param->set_num_axes(bottom[1]->num_axes());
    } else {
      bias_param->set_num_axes(param.num_axes());
    }
    bias_param->mutable_filler()->CopyFrom(param.bias_filler());
    bias_layer_ = LayerRegistry<Dtype>::CreateLayer(layer_param);
    bias_bottom_vec_.resize(1);
    bias_bottom_vec_[0] = bottom[0];
    bias_layer_->SetUp(bias_bottom_vec_, top);
    if (this->blobs_.size() + bottom.size() < 3) {
      // case: blobs.size == 1 && bottom.size == 1
      // or blobs.size == 0 && bottom.size == 2
      bias_param_id_ = this->blobs_.size();
      this->blobs_.resize(bias_param_id_ + 1);
      this->blobs_[bias_param_id_] = bias_layer_->blobs()[0];
    } else {
      // bias param already initialized
      bias_param_id_ = this->blobs_.size() - 1;
      bias_layer_->blobs()[0] = this->blobs_[bias_param_id_];
    }
    bias_propagate_down_.resize(1, false);
  }
  this->param_propagate_down_.resize(this->blobs_.size(), true);
}

template <typename Dtype>
void ScaleLayer<Dtype>::Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  const ScaleParameter& param = this->layer_param_.scale_param();
  Blob<Dtype>* scale = (bottom.size() > 1) ? bottom[1] : this->blobs_[0].get();
  // Always set axis_ == 0 in special case where scale is a scalar
  // (num_axes == 0). Mathematically equivalent for any choice of axis_, so the
  // actual setting can be safely ignored; and computation is most efficient
  // with axis_ == 0 and (therefore) outer_dim_ == 1. (Setting axis_ to
  // bottom[0]->num_axes() - 1, giving inner_dim_ == 1, would be equally
  // performant.)
  axis_ = (scale->num_axes() == 0) ?
      0 : bottom[0]->CanonicalAxisIndex(param.axis());
  CHECK_GE(bottom[0]->num_axes(), axis_ + scale->num_axes())
      << "scale blob's shape extends past bottom[0]'s shape when applied "
      << "starting with bottom[0] axis = " << axis_;
  for (int i = 0; i < scale->num_axes(); ++i) {
    CHECK_EQ(bottom[0]->shape(axis_ + i), scale->shape(i))
        << "dimension mismatch between bottom[0]->shape(" << axis_ + i
        << ") and scale->shape(" << i << ")";
  }
  outer_dim_ = bottom[0]->count(0, axis_);
  scale_dim_ = scale->count();
  inner_dim_ = bottom[0]->count(axis_ + scale->num_axes());
  if (bottom[0] == top[0]) {  // in-place computation
    temp_.ReshapeLike(*bottom[0]);
  } else {
    top[0]->ReshapeLike(*bottom[0]);
  }
  sum_result_.Reshape(vector<int>(1, outer_dim_ * scale_dim_));
  const int sum_mult_size = std::max(outer_dim_, inner_dim_);
  sum_multiplier_.Reshape(vector<int>(1, sum_mult_size));
  if (sum_multiplier_.cpu_data()[sum_mult_size - 1] != Dtype(1)) {
    caffe_set(sum_mult_size, Dtype(1), sum_multiplier_.mutable_cpu_data());
  }
  if (bias_layer_) {
    bias_bottom_vec_[0] = top[0];
    bias_layer_->Reshape(bias_bottom_vec_, top);
  }
}

template <typename Dtype>
void ScaleLayer<Dtype>::Forward_cpu(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  const Dtype* bottom_data = bottom[0]->cpu_data();
  if (bottom[0] == top[0]) {
    // In-place computation; need to store bottom data before overwriting it.
    // Note that this is only necessary for Backward; we could skip this if not
    // doing Backward, but Caffe currently provides no way of knowing whether
    // we'll need to do Backward at the time of the Forward call.
    caffe_copy(bottom[0]->count(), bottom[0]->cpu_data(),
               temp_.mutable_cpu_data());
  }
  Dtype* scale_data =
      ((bottom.size() > 1) ? bottom[1] : this->blobs_[0].get())->mutable_cpu_data();
  if (this->layer_param_.scale_param().has_min_value()) {
    for (int d = 0; d < scale_dim_; d++) {
      scale_data[d] = std::max<Dtype>(scale_data[d], this->layer_param_.scale_param().min_value());
    }
  }
  if (this->layer_param_.scale_param().has_max_value()) {
    for (int d = 0; d < scale_dim_; d++) {
      scale_data[d] = std::min<Dtype>(scale_data[d], this->layer_param_.scale_param().max_value());
    }
  }
  Dtype* top_data = top[0]->mutable_cpu_data();
  for (int n = 0; n < outer_dim_; ++n) {
    for (int d = 0; d < scale_dim_; ++d) {
      const Dtype factor = scale_data[d];
      caffe_cpu_scale(inner_dim_, factor, bottom_data, top_data);
      bottom_data += inner_dim_;
      top_data += inner_dim_;
    }
  }
  if (bias_layer_) {
    bias_layer_->Forward(bias_bottom_vec_, top);
  }
}

template <typename Dtype>
void ScaleLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  if (bias_layer_ &&
      this->param_propagate_down_[this->param_propagate_down_.size() - 1]) {
    bias_layer_->Backward(top, bias_propagate_down_, bias_bottom_vec_);
  }
  const bool scale_param = (bottom.size() == 1);
  Blob<Dtype>* scale = scale_param ? this->blobs_[0].get() : bottom[1];
  if ((!scale_param && propagate_down[1]) ||
      (scale_param && this->param_propagate_down_[0])) {
    const Dtype* top_diff = top[0]->cpu_diff();
    const bool in_place = (bottom[0] == top[0]);
    const Dtype* bottom_data = (in_place ? &temp_ : bottom[0])->cpu_data();
    // Hack: store big eltwise product in bottom[0] diff, except in the special
    // case where this layer itself does the eltwise product, in which case we
    // can store it directly in the scale diff, and we're done.
    // If we're computing in-place (and not doing eltwise computation), this
    // hack doesn't work and we store the product in temp_.
    const bool is_eltwise = (bottom[0]->count() == scale->count());
    Dtype* product = (is_eltwise ? scale->mutable_cpu_diff() :
        (in_place ? temp_.mutable_cpu_data() : bottom[0]->mutable_cpu_diff()));
    caffe_mul(top[0]->count(), top_diff, bottom_data, product);
    if (!is_eltwise) {
      Dtype* sum_result = NULL;
      if (inner_dim_ == 1) {
        sum_result = product;
      } else if (sum_result_.count() == 1) {
        const Dtype* sum_mult = sum_multiplier_.cpu_data();
        Dtype* scale_diff = scale->mutable_cpu_diff();
        if (scale_param) {
          Dtype result = caffe_cpu_dot(inner_dim_, product, sum_mult);
          *scale_diff += result;
        } else {
          *scale_diff = caffe_cpu_dot(inner_dim_, product, sum_mult);
        }
      } else {
        const Dtype* sum_mult = sum_multiplier_.cpu_data();
        sum_result = (outer_dim_ == 1) ?
            scale->mutable_cpu_diff() : sum_result_.mutable_cpu_data();
        caffe_cpu_gemv(CblasNoTrans, sum_result_.count(), inner_dim_,
                       Dtype(1), product, sum_mult, Dtype(0), sum_result);
      }
      if (outer_dim_ != 1) {
        const Dtype* sum_mult = sum_multiplier_.cpu_data();
        Dtype* scale_diff = scale->mutable_cpu_diff();
        if (scale_dim_ == 1) {
          if (scale_param) {
            Dtype result = caffe_cpu_dot(outer_dim_, sum_mult, sum_result);
            *scale_diff += result;
          } else {
            *scale_diff = caffe_cpu_dot(outer_dim_, sum_mult, sum_result);
          }
        } else {
          caffe_cpu_gemv(CblasTrans, outer_dim_, scale_dim_,
                         Dtype(1), sum_result, sum_mult, Dtype(scale_param),
                         scale_diff);
        }
      }
    }
  }
  if (propagate_down[0]) {
    const Dtype* top_diff = top[0]->cpu_diff();
    const Dtype* scale_data = scale->cpu_data();
    Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
    for (int n = 0; n < outer_dim_; ++n) {
      for (int d = 0; d < scale_dim_; ++d) {
        const Dtype factor = scale_data[d];
        caffe_cpu_scale(inner_dim_, factor, top_diff, bottom_diff);
        bottom_diff += inner_dim_;
        top_diff += inner_dim_;
      }
    }
  }
}

#ifdef CPU_ONLY
STUB_GPU(ScaleLayer);
#endif

INSTANTIATE_CLASS(ScaleLayer);
REGISTER_LAYER_CLASS(Scale);
}  // namespace caffe


Reposted from blog.csdn.net/u010579901/article/details/81030504