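Introduction to the LossLayer class

LossLayer is the base class of all loss layers in Caffe. It does not compute any loss itself; it only fixes properties shared by loss layers, such as a default loss weight of 1 for the output blob and the requirement that the prediction blob and the label blob agree in their first dimension.

loss_layer.cpp source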

template <typename Dtype>
void LossLayer<Dtype>::LayerSetUp(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  // LossLayers have a non-zero (1) loss by default.
  if (this->layer_param_.loss_weight_size() == 0) {   // no loss_weight was set in the layer param
    this->layer_param_.add_loss_weight(Dtype(1));     // so default the loss weight to 1
  }
}
template <typename Dtype>
void LossLayer<Dtype>::Reshape(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  CHECK_EQ(bottom[0]->shape(0), bottom[1]->shape(0))
      << "The data and label should have the same first dimension.";    // the axis-0 sizes must be equal
  vector<int> loss_shape(0);    // Loss layers output a scalar; 0 axes.
  top[0]->Reshape(loss_shape);  // reshape the output blob to a 0-axis scalar (it holds exactly one value)
}
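
The "0 axes" shape is easy to misread as an empty blob. A blob's element count is the product of its dimensions, so an empty shape yields a count of 1, which is the single loss value. The sketch below illustrates that arithmetic only; `count_from_shape` is a hypothetical helper written for this note, not a Caffe function.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of Caffe): number of elements implied by a
// shape vector, computed as the product of all dimensions.
int count_from_shape(const std::vector<int>& shape) {
  int count = 1;  // the empty product is 1, so a 0-axis shape holds one value
  for (int dim : shape) {
    assert(dim >= 0);  // dimensions are never negative
    count *= dim;
  }
  return count;
}
```

Calling `count_from_shape(std::vector<int>())` gives 1, while a shape of (2, 3, 4) gives 24.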

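Introduction to the MultinomialLogisticLossLayer class

MultinomialLogisticLossLayer computes the logistic loss for single-label, multi-class classification: each sample carries exactly one label, and that label may take one of several classes.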

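The first input blob holds the predicted probabilities of the network. Its size is \(N \times C \times H \times W\) and its values lie in \(\hat{p}_{n,k} \in [0, 1]\), where \(\hat{p}_{n,k}\) is the predicted probability that sample \(n\) belongs to class \(k\), and \(\forall n, \sum\limits_{k=1}^K \hat{p}_{n,k} = 1\).

Here \(N\) is the number of samples and \(K=C \times H \times W\) is the number of classes.

The second input blob holds the labels. Its size is \(N \times 1 \times 1 \times 1\) and its values are integers \(l_n \in \{0, 1, 2, ..., K - 1\}\), where \(l_n\) is the true class of sample \(n\).

In the forward pass the loss is computed as: \(E=-\frac{1}{N}\sum\limits_{n=1}^{N} \sum\limits_{k=1}^{K} y_{n,k}*\log \hat{p}_{n,k}= -\frac{1}{N}\sum\limits_{n=1}^{N} \log(\hat{p}_{n,l_n})\)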

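\(y_{n,k}\) is the true probability that sample \(n\) belongs to class \(k\): \(y_{n,k}=\left\{\begin{matrix}1 & k=l_n\\0 & k\neq l_n\end{matrix}\right.\)

In the backward pass the gradient of the prediction blob is: \(\frac{\partial J}{\partial {\hat{p}_{n,l_n}}} = \frac{\partial J}{\partial E}*\frac{\partial E}{\partial {\hat{p}_{n,l_n}}}=-\frac{1}{N}*\frac{\partial J}{\partial E}*\frac{1}{\hat{p}_{n,l_n}}\). The gradient with respect to every other \(\hat{p}_{n,k}\), \(k \neq l_n\), is 0.

\(J\) denotes the loss of the whole network.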
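
The two formulas above can be sketched on plain arrays, independent of Caffe. This is a minimal illustration under stated assumptions: `multinomial_logistic_loss` and `multinomial_logistic_grad` are hypothetical names, the probabilities are stored row-major as an N x K array, and the clamp value `kLogThreshold = 1e-20` is an assumed stand-in for the role `kLOG_THRESHOLD` plays in the layer.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed clamp value; it keeps log() and 1/p finite when a probability is 0.
const double kLogThreshold = 1e-20;

// Forward: E = -(1/N) * sum_n log(p[n][l_n]); prob is row-major N x K.
double multinomial_logistic_loss(const std::vector<double>& prob,
                                 const std::vector<int>& labels, int K) {
  const int N = static_cast<int>(labels.size());
  assert(N > 0 && prob.size() == static_cast<std::size_t>(N) * K);
  double loss = 0.0;
  for (int n = 0; n < N; ++n) {
    loss -= std::log(std::max(prob[n * K + labels[n]], kLogThreshold));
  }
  return loss / N;
}

// Backward: dJ/dp[n][k] = -(1/N) * top_diff / p[n][l_n] for k == l_n, else 0.
// top_diff plays the role of dJ/dE.
std::vector<double> multinomial_logistic_grad(const std::vector<double>& prob,
                                              const std::vector<int>& labels,
                                              int K, double top_diff) {
  const int N = static_cast<int>(labels.size());
  assert(N > 0 && prob.size() == static_cast<std::size_t>(N) * K);
  std::vector<double> grad(prob.size(), 0.0);  // start from all zeros
  for (int n = 0; n < N; ++n) {
    const double p = std::max(prob[n * K + labels[n]], kLogThreshold);
    grad[n * K + labels[n]] = -top_diff / N / p;
  }
  return grad;
}
```

For two samples with probabilities (0.5, 0.5) and (0.25, 0.75) and labels 0 and 1, only the entries at the labelled classes receive a non-zero gradient.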

multinomial_logistic_loss_layer.cpp source

template <typename Dtype>
void MultinomialLogisticLossLayer<Dtype>::Reshape(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  LossLayer<Dtype>::Reshape(bottom, top);   // base-class Reshape(): checks that the inputs agree on axis 0 and makes the output a 0-axis scalar
  CHECK_EQ(bottom[1]->channels(), 1);   // the label blob must have shape [N,1,1,1]
  CHECK_EQ(bottom[1]->height(), 1);
  CHECK_EQ(bottom[1]->width(), 1);
}

template <typename Dtype>
void MultinomialLogisticLossLayer<Dtype>::Forward_cpu(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {  // forward pass: compute the loss
  const Dtype* bottom_data = bottom[0]->cpu_data();   // data pointer of the prediction blob
  const Dtype* bottom_label = bottom[1]->cpu_data();  // data pointer of the label blob
  int num = bottom[0]->num();                         // number of samples
  int dim = bottom[0]->count() / bottom[0]->num();    // K=C*H*W, the number of classes
  Dtype loss = 0;
  for (int i = 0; i < num; ++i) {
    int label = static_cast<int>(bottom_label[i]);    // label of sample i, i.e. the class it belongs to
    // bottom_data[i * dim + label] is the predicted probability of class label for sample i;
    // kLOG_THRESHOLD is a small constant that keeps |log(prob)| from blowing up
    Dtype prob = std::max(bottom_data[i * dim + label], Dtype(kLOG_THRESHOLD));
    loss -= log(prob);    // accumulate the loss
  }
  top[0]->mutable_cpu_data()[0] = loss / num; // output the mean loss
}

template <typename Dtype>
void MultinomialLogisticLossLayer<Dtype>::Backward_cpu(
    const vector<Blob<Dtype>*>& top, const vector<bool>& propagate_down,
    const vector<Blob<Dtype>*>& bottom) {
  if (propagate_down[1]) {    // backpropagating to the label blob is not allowed: abort
    LOG(FATAL) << this->type() << " Layer cannot backpropagate to label inputs.";
  }
  if (propagate_down[0]) {    // the prediction blob needs a gradient
    const Dtype* bottom_data = bottom[0]->cpu_data();       // data pointer of the prediction blob
    const Dtype* bottom_label = bottom[1]->cpu_data();      // data pointer of the label blob
    Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();     // gradient pointer of the prediction blob
    int num = bottom[0]->num();                             // number of samples
    int dim = bottom[0]->count() / bottom[0]->num();        // K=C*H*W, the number of classes
    caffe_set(bottom[0]->count(), Dtype(0), bottom_diff);   // zero the prediction gradient first
    const Dtype scale = - top[0]->cpu_diff()[0] / num;      // coefficient: -\frac{1}{N}*\frac{\partial J}{\partial E}
    for (int i = 0; i < num; ++i) {
      int label = static_cast<int>(bottom_label[i]);      // label of sample i, i.e. the class it belongs to
      Dtype prob = std::max(bottom_data[i * dim + label], Dtype(kLOG_THRESHOLD)); // predicted probability of class label for sample i
      bottom_diff[i * dim + label] = scale / prob;        // gradient for this sample
    }
  }
}

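Introduction to the SoftmaxWithLossLayer class

SoftmaxWithLossLayer also computes the loss for single-label, multi-class classification. Mathematically it is equivalent to a SoftmaxLayer followed by a MultinomialLogisticLossLayer, but Caffe recommends SoftmaxWithLossLayer: computing the gradient inside a single layer loses less precision than chaining the two layers, so it is more numerically stable.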
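
The stability claim can be illustrated with a small standalone sketch for a single sample. It is not Caffe code: `softmax`, `grad_two_layers` and `grad_fused` are hypothetical names, the max-subtraction inside `softmax` and the clamp `kLogThreshold = 1e-20` are assumptions about how the separate layers behave, and only the gradient with respect to the labelled input is compared. When the labelled class has a probability that underflows to 0, the chained route multiplies a huge `1/p` by a vanished `p * (1 - p)` and loses the gradient, while the fused form `p - 1` stays exact.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax of one sample, subtracting the maximum before exponentiating.
std::vector<double> softmax(const std::vector<double>& x) {
  assert(!x.empty());
  const double x_max = *std::max_element(x.begin(), x.end());
  std::vector<double> p(x.size());
  double sum = 0.0;
  for (std::size_t k = 0; k < x.size(); ++k) {
    p[k] = std::exp(x[k] - x_max);
    sum += p[k];
  }
  for (double& v : p) v /= sum;
  return p;
}

// Chained route: dE/dx_l = dE/dp_l * dp_l/dx_l for the labelled class l.
double grad_two_layers(const std::vector<double>& x, int label) {
  const double kLogThreshold = 1e-20;  // assumed clamp on the probability
  const std::vector<double> p = softmax(x);
  const double dE_dp = -1.0 / std::max(p[label], kLogThreshold);
  const double dp_dx = p[label] * (1.0 - p[label]);
  return dE_dp * dp_dx;
}

// Fused route: dE/dx_l = p_l - 1, with no division by p_l.
double grad_fused(const std::vector<double>& x, int label) {
  return softmax(x)[label] - 1.0;
}
```

With inputs (1000, 0) and label 1, the chained route returns 0 although the true gradient is -1, which the fused route reproduces; for well-conditioned inputs such as (0, 0) both routes agree.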

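The first input blob holds the raw predictions of the network. Its size is \(\tilde{N} \times C \times \tilde H \times \tilde W\) and its values lie in \(x_{n,k} \in (-\infty, +\infty)\). The loss uses the softmax of these values as probabilities: \(\hat{p}_{n,k} = \frac{e^{x_{n,k}}}{\sum\limits_{k'=1}^{K} e^{x_{n,k'}}}\).

From here on we assume softmax is computed along axis 1 (the \(C\) axis). The size of \(C\) is then the number of classes \(K\), and the total number of samples is the outer count (outer_num_ in the code) times the inner count (inner_num_), i.e. \(N=\tilde N * \tilde H * \tilde W\).

The second input blob holds the labels. Its size is \(N \times 1 \times 1 \times 1\) and its values are integers \(l_n \in \{0, 1, 2, ..., K - 1\}\), where \(l_n\) is the true class of sample \(n\).

The Caffe code does not strictly require the label blob to have shape \(N \times 1 \times 1 \times 1\). It only requires that the prediction blob and the label blob agree in axis 0 (checked in LossLayer) and that the label blob contains \(N\) elements in total.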
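
To make the outer/inner bookkeeping concrete, here is a small sketch of the offsets used when the prediction blob is viewed as an (outer, classes, inner) row-major array and the label blob as an (outer, inner) array. `prob_offset` and `label_offset` are hypothetical helpers for illustration; they mirror the index expressions `i * dim + c * inner_num_ + j` and `i * inner_num_ + j` that appear in the layer source.

```cpp
#include <cassert>

// Offset of class c at outer index i and inner index j in a row-major
// (outer, num_classes, inner_num) array.
int prob_offset(int i, int c, int j, int num_classes, int inner_num) {
  assert(c >= 0 && c < num_classes);
  assert(j >= 0 && j < inner_num);
  const int dim = num_classes * inner_num;  // elements per outer index
  return i * dim + c * inner_num + j;
}

// Offset of the label at outer index i and inner index j in a row-major
// (outer, inner_num) array.
int label_offset(int i, int j, int inner_num) {
  assert(j >= 0 && j < inner_num);
  return i * inner_num + j;
}
```

For a prediction blob of shape (2, 3, 2, 2) with softmax along axis 1, the outer count is 2, the inner count is 4, and the label blob must hold 8 values; the last prediction sits at offset 23 and the last label at offset 7.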

In the forward pass the loss is the same as in MultinomialLogisticLossLayer: \(E=-\frac{1}{N} \sum\limits_{n=1}^N \log(\hat{p}_{n,l_n})\)

In the backward pass the gradient of the prediction blob is derived as follows. First differentiate the softmax output of the labelled class with respect to each input:

\(\frac{\partial \hat{p}_{n,l_n}}{\partial x_{n,k}} = \frac{\partial}{\partial x_{n,k}}\left(\frac{e^{x_{n,l_n}}}{\sum\limits_{k'=1}^{K} e^{x_{n,k'}}}\right)\)
\(= \left\{\begin{matrix}\frac{-e^{x_{n,l_n}}*e^{x_{n,k}}}{\left(\sum\limits_{k'=1}^{K} e^{x_{n,k'}}\right)^2} + \frac{e^{x_{n,l_n}}}{\sum\limits_{k'=1}^{K} e^{x_{n,k'}}} = \hat{p}_{n,l_n} - \hat{p}_{n,l_n}*\hat{p}_{n,l_n} & k = l_n\\ \frac{-e^{x_{n,l_n}}*e^{x_{n,k}}}{\left(\sum\limits_{k'=1}^{K} e^{x_{n,k'}}\right)^2} = -\hat{p}_{n,l_n}*\hat{p}_{n,k} & k \neq l_n\end{matrix}\right.\)

Next expand the loss over the samples:

\(E = -\frac{1}{N}\sum\limits_{n=1}^N \log(\hat{p}_{n,l_n}) = -\frac{1}{N}\left(\log \hat{p}_{1,l_1} + \log \hat{p}_{2,l_2} + ... + \log \hat{p}_{N,l_N}\right)\)

Note that \(\frac{\partial E}{\partial \hat{p}_{n,k'}}\) is non-zero only when \(k'=l_n\), so the chain rule collapses to a single term:

\(\frac{\partial E}{\partial x_{n,k}} = \sum\limits_{k'=1}^K \frac{\partial E}{\partial \hat{p}_{n,k'}} \frac{\partial \hat{p}_{n,k'}}{\partial x_{n,k}} = -\frac{1}{N}*\frac{1}{\hat{p}_{n,l_n}}*\frac{\partial \hat{p}_{n,l_n}}{\partial x_{n,k}}\)
\(= \left\{\begin{matrix}-\frac{1}{N}*\frac{1}{\hat{p}_{n,l_n}}*\left(\hat{p}_{n,l_n} - \hat{p}_{n,l_n}*\hat{p}_{n,l_n}\right) = \frac{1}{N}\left(\hat{p}_{n,l_n} - 1\right) & k = l_n\\ -\frac{1}{N}*\frac{1}{\hat{p}_{n,l_n}}*\left(-\hat{p}_{n,l_n}*\hat{p}_{n,k}\right) = \frac{1}{N}\hat{p}_{n,k} & k \neq l_n\end{matrix}\right.\)

Finally: \(\frac{\partial J}{\partial {x_{n,k}}} = \frac{\partial J}{\partial E}*\frac{\partial E}{\partial {x_{n,k}}}\)
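
The result of the derivation, a gradient of \(\frac{1}{N}(\hat{p}_{n,k} - y_{n,k})\), can be checked numerically. The sketch below is a standalone illustration, not the Caffe implementation: `softmax_loss` and `softmax_loss_grad` are hypothetical names, the inputs are a row-major N x K array, and the loss is evaluated in log-sum-exp form.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// E = -(1/N) * sum_n log softmax(x_n)[l_n]; x is row-major N x K.
double softmax_loss(const std::vector<double>& x,
                    const std::vector<int>& labels, int K) {
  const int N = static_cast<int>(labels.size());
  assert(N > 0 && x.size() == static_cast<std::size_t>(N) * K);
  double loss = 0.0;
  for (int n = 0; n < N; ++n) {
    const double* row = x.data() + n * K;
    const double x_max = *std::max_element(row, row + K);
    double sum = 0.0;
    for (int k = 0; k < K; ++k) sum += std::exp(row[k] - x_max);
    // -log p_l = log(sum_k exp(x_k - x_max)) + x_max - x_l
    loss += std::log(sum) + x_max - row[labels[n]];
  }
  return loss / N;
}

// dE/dx[n][k] = (p[n][k] - 1) / N if k == l_n, else p[n][k] / N.
std::vector<double> softmax_loss_grad(const std::vector<double>& x,
                                      const std::vector<int>& labels, int K) {
  const int N = static_cast<int>(labels.size());
  assert(N > 0 && x.size() == static_cast<std::size_t>(N) * K);
  std::vector<double> grad(x.size());
  for (int n = 0; n < N; ++n) {
    const double* row = x.data() + n * K;
    const double x_max = *std::max_element(row, row + K);
    double sum = 0.0;
    for (int k = 0; k < K; ++k) sum += std::exp(row[k] - x_max);
    for (int k = 0; k < K; ++k) {
      grad[n * K + k] = std::exp(row[k] - x_max) / sum / N;  // p / N
    }
    grad[n * K + labels[n]] -= 1.0 / N;  // subtract 1/N at the true class
  }
  return grad;
}
```

Comparing `softmax_loss_grad` against a central finite difference of `softmax_loss` is a quick way to confirm both cases of the derivation; the gradient entries of each sample also sum to zero, because the probabilities sum to 1.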

softmax_loss_layer.cpp source

template <typename Dtype>
void SoftmaxWithLossLayer<Dtype>::LayerSetUp(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {  // layer initialisation
  LossLayer<Dtype>::LayerSetUp(bottom, top);          // call the base-class function
  LayerParameter softmax_param(this->layer_param_);   // copy of this layer's parameters
  softmax_param.set_type("Softmax");    // set the layer type to "Softmax"
  softmax_layer_ = LayerRegistry<Dtype>::CreateLayer(softmax_param);  // create a Softmax layer from that type
  softmax_bottom_vec_.clear();
  softmax_bottom_vec_.push_back(bottom[0]);   // the Softmax layer reads this layer's prediction blob directly
  softmax_top_vec_.clear();
  softmax_top_vec_.push_back(&prob_);         // prob_ stores the Softmax layer's output
  softmax_layer_->SetUp(softmax_bottom_vec_, softmax_top_vec_);   // call the Softmax layer's SetUp()

  has_ignore_label_ = this->layer_param_.loss_param().has_ignore_label(); // whether an ignore label is set
  if (has_ignore_label_) {
    ignore_label_ = this->layer_param_.loss_param().ignore_label(); // keep the configured ignore label in ignore_label_
  }
  if (!this->layer_param_.loss_param().has_normalization() &&
      this->layer_param_.loss_param().has_normalize()) {    // normalization (new) is unset but normalize (legacy) is set
    // normalize == true selects VALID normalization, false selects BATCH_SIZE
    normalization_ = this->layer_param_.loss_param().normalize() ?
                     LossParameter_NormalizationMode_VALID :
                     LossParameter_NormalizationMode_BATCH_SIZE;
  } else {
    normalization_ = this->layer_param_.loss_param().normalization(); // use the normalization setting
  }
}

template <typename Dtype>
void SoftmaxWithLossLayer<Dtype>::Reshape(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {    // reshape the internal and output blobs
  LossLayer<Dtype>::Reshape(bottom, top);   // base-class function: makes the output blob a 0-axis scalar
  softmax_layer_->Reshape(softmax_bottom_vec_, softmax_top_vec_); // reshape the Softmax layer's input and output blobs
  // softmax_param().axis() may be positive or negative; softmax is computed along that axis and the other axes are independent
  // the size of axis softmax_axis_ is also the number of classes, e.g. C for images
  softmax_axis_ = bottom[0]->CanonicalAxisIndex(this->layer_param_.softmax_param().axis());
  outer_num_ = bottom[0]->count(0, softmax_axis_);    // product of axes [0, softmax_axis_): the outer count, e.g. N for images
  inner_num_ = bottom[0]->count(softmax_axis_ + 1);   // product of axes [softmax_axis_+1, end): the inner count, e.g. H*W for images
  CHECK_EQ(outer_num_ * inner_num_, bottom[1]->count())   // outer count times inner count must equal the label blob's total count
      << "Number of labels must match number of predictions; "
      << "e.g., if softmax axis == 1 and prediction shape is (N, C, H, W), "
      << "label count (number of labels) must be N*H*W, "
      << "with integer values in {0, 1, ..., C-1}.";
  if (top.size() >= 2) {              // a second output blob was requested
    // softmax output
    top[1]->ReshapeLike(*bottom[0]);  // top[1] exposes the output of the internal Softmax layer
  }
}

template <typename Dtype>
Dtype SoftmaxWithLossLayer<Dtype>::get_normalizer(
    LossParameter_NormalizationMode normalization_mode, int valid_count) {  // compute the normalizer from the mode and the number of valid labels
  Dtype normalizer;               // normalization coefficient
  switch (normalization_mode) {   // normalization mode
    case LossParameter_NormalizationMode_FULL:
      normalizer = Dtype(outer_num_ * inner_num_);  // FULL: outer count times inner count
      break;
    case LossParameter_NormalizationMode_VALID:     // VALID: number of valid labels
      if (valid_count == -1) {                      // valid_count == -1 means every label is valid, so this equals FULL
        normalizer = Dtype(outer_num_ * inner_num_);
      } else {
        normalizer = Dtype(valid_count);
      }
      break;
    case LossParameter_NormalizationMode_BATCH_SIZE:  // BATCH_SIZE: the outer count
      normalizer = Dtype(outer_num_);
      break;
    case LossParameter_NormalizationMode_NONE:        // NONE: the normalizer is 1
      normalizer = Dtype(1);
      break;
    default:
      LOG(FATAL) << "Unknown normalization mode: "
          << LossParameter_NormalizationMode_Name(normalization_mode);
  }
  // Some users will have no labels for some examples in order to 'turn off' a
  // particular loss in a multi-task setup. The max prevents NaNs in that case.
  return std::max(Dtype(1.0), normalizer);    // clamp to at least 1 in case the number of valid labels is 0
}

template <typename Dtype>
void SoftmaxWithLossLayer<Dtype>::Forward_cpu(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {    // forward pass
  // The forward pass computes the softmax prob values.
  softmax_layer_->Forward(softmax_bottom_vec_, softmax_top_vec_); // run the Softmax layer's forward pass first
  // With an input of shape (N, C, H, W) and softmax along axis 1, the Softmax output prob_ also has shape (N, C, H, W);
  // there are N * H * W labels, outer_num_ = N, inner_num_ = H * W
  const Dtype* prob_data = prob_.cpu_data();    // output of the Softmax layer
  const Dtype* label = bottom[1]->cpu_data();   // label data
  int dim = prob_.count() / outer_num_;         // C * H * W
  int count = 0;          // number of valid (non-ignored) labels
  Dtype loss = 0;
  for (int i = 0; i < outer_num_; ++i) {
    for (int j = 0; j < inner_num_; j++) {
      const int label_value = static_cast<int>(label[i * inner_num_ + j]);  // label at position (i, j)
      if (has_ignore_label_ && label_value == ignore_label_) {
        continue;       // an ignore label is set and this label matches it: skip
      }
      DCHECK_GE(label_value, 0);    // the label must be non-negative
      DCHECK_LT(label_value, prob_.shape(softmax_axis_)); // the label must be below the number of classes
      // predicted probability of class label_value at position (i, j), accumulated into the loss
      loss -= log(std::max(prob_data[i * dim + label_value * inner_num_ + j], Dtype(FLT_MIN)));
      ++count;    // one more valid label
    }
  }
  top[0]->mutable_cpu_data()[0] = loss / get_normalizer(normalization_, count); // divide by the normalizer to get the final loss
  if (top.size() == 2) {
    top[1]->ShareData(prob_);   // expose the Softmax output as top[1] of SoftmaxWithLossLayer
  }
}

template <typename Dtype>
void SoftmaxWithLossLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  if (propagate_down[1]) {  // again, backpropagating to the label blob is not allowed
    LOG(FATAL) << this->type() << " Layer cannot backpropagate to label inputs.";
  }
  if (propagate_down[0]) {  // the prediction blob needs a gradient
    Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();   // gradient pointer of the prediction blob
    const Dtype* prob_data = prob_.cpu_data();            // output of the Softmax layer
    caffe_copy(prob_.count(), prob_data, bottom_diff);    // bottom_diff = prob_data
    const Dtype* label = bottom[1]->cpu_data();           // data pointer of the label blob
    int dim = prob_.count() / outer_num_;                 // C * H * W
    int count = 0;                                        // number of valid (non-ignored) labels
    for (int i = 0; i < outer_num_; ++i) {
      for (int j = 0; j < inner_num_; ++j) {
        const int label_value = static_cast<int>(label[i * inner_num_ + j]);  // label at position (i, j)
        if (has_ignore_label_ && label_value == ignore_label_) {    // this label is ignored
          for (int c = 0; c < bottom[0]->shape(softmax_axis_); ++c) {   // every class along the class axis
            bottom_diff[i * dim + c * inner_num_ + j] = 0;  // zero the gradient of every class at position (i, j)
          }
        } else {
          bottom_diff[i * dim + label_value * inner_num_ + j] -= 1; // subtract 1 only at class label_value, giving p - 1 there
          ++count;
        }
      }
    }
    // Scale gradient
    Dtype loss_weight = top[0]->cpu_diff()[0] / get_normalizer(normalization_, count);    // scaling coefficient
    caffe_scal(prob_.count(), loss_weight, bottom_diff);  // bottom_diff *= loss_weight
  }
}
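
A worked example may help compare the four normalization modes handled by get_normalizer() above. The sketch below restates the same rules outside Caffe: `NormMode` and `normalizer_for` are hypothetical stand-ins for `LossParameter_NormalizationMode` and the member function, and the counts are passed in explicitly.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for LossParameter_NormalizationMode.
enum class NormMode { FULL, VALID, BATCH_SIZE, NONE };

// Same rules as get_normalizer(), with the counts passed as arguments.
double normalizer_for(NormMode mode, int outer_num, int inner_num,
                      int valid_count) {
  assert(outer_num >= 0 && inner_num >= 0);
  double n = 1.0;
  switch (mode) {
    case NormMode::FULL:
      n = static_cast<double>(outer_num) * inner_num;
      break;
    case NormMode::VALID:
      // -1 means "not counted", which falls back to the FULL value
      n = (valid_count == -1) ? static_cast<double>(outer_num) * inner_num
                              : static_cast<double>(valid_count);
      break;
    case NormMode::BATCH_SIZE:
      n = static_cast<double>(outer_num);
      break;
    case NormMode::NONE:
      n = 1.0;
      break;
  }
  return std::max(1.0, n);  // never divide by less than 1
}
```

With an outer count of 2, an inner count of 3 and 4 valid labels, the modes give 6 (FULL), 4 (VALID), 2 (BATCH_SIZE) and 1 (NONE). If every label is ignored, VALID would give 0 and the clamp raises it to 1.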

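Introduction to the SigmoidCrossEntropyLossLayer class

SigmoidCrossEntropyLossLayer computes the cross-entropy loss for multi-label binary classification: each sample may carry several labels, and each label takes one of two classes, 0 or 1.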

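The first input blob holds the raw predictions of the network. Its size is \(N \times C \times H \times W\) and its values lie in \(x_n \in (-\infty, +\infty)\). The loss uses the sigmoid of these values as probabilities: \(\hat{p}_n = \sigma(x_n)\).

The second input blob holds the labels. Its size is \(N \times C \times H \times W\) and its values lie in \(p_n \in [0, 1]\).

As before, the code does not strictly require the prediction blob and the label blob to have the same shape. It only requires that the two blobs agree in axis 0 and contain the same total number of elements.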

In the forward pass the loss is computed as: \(E = -\frac{1}{N} \sum\limits_{n=1}^N \left[p_n \log \hat{p}_n + (1 - p_n) \log(1 - \hat{p}_n)\right]\)

Here the sum runs over all elements, and the divisor \(N\) is normalizer_ in the source below.

To keep the exponential from overflowing when \(|x_n|\) is large, the code uses a slightly rearranged form:

\(E=-\frac{1}{N} \sum\limits_{n=1}^N [p_n \log \sigma(x_n) + (1 - p_n) \log(1 - \sigma(x_n))]\\=-\frac{1}{N} \sum\limits_{n=1}^N [x_n (p_n-1)+\log \sigma(x_n)]\)

where each term is evaluated according to the sign of \(x_n\):

\(x_n (p_n-1)+\log \sigma(x_n) =\left\{\begin{matrix} x_n (p_n-1)-\log (1+e^{-x_n}) & x_n \geqslant 0\\ x_n p_n-\log (1+e^{x_n}) & x_n<0 \end{matrix}\right.\)
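
The benefit of the rearranged form shows up at extreme inputs. The sketch below compares a direct evaluation of one loss term with the sign-split form; it is a standalone illustration, and `sigmoid_ce_naive` and `sigmoid_ce_stable` are hypothetical names. The stable version mirrors the expression used in Forward_cpu, where the comparison `x >= 0` acts as a 0/1 indicator.

```cpp
#include <cassert>
#include <cmath>

// Direct evaluation of one term: -[p*log(s) + (1-p)*log(1-s)], s = sigmoid(x).
double sigmoid_ce_naive(double x, double p) {
  const double s = 1.0 / (1.0 + std::exp(-x));
  return -(p * std::log(s) + (1.0 - p) * std::log(1.0 - s));
}

// Sign-split evaluation of the same term; the exponent is never positive.
double sigmoid_ce_stable(double x, double p) {
  assert(p >= 0.0 && p <= 1.0);
  const double indicator = (x >= 0.0) ? 1.0 : 0.0;
  // x >= 0: x*(p-1) - log(1 + exp(-x));  x < 0: x*p - log(1 + exp(x))
  return -(x * (p - indicator) -
           std::log(1.0 + std::exp(x - 2.0 * x * indicator)));
}
```

For moderate inputs the two functions agree to rounding error. At `x = 1000, p = 0` the direct form evaluates `log(0)` and returns infinity, whereas the sign-split form returns 1000.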

In the backward pass the gradient of the prediction blob is: \(\frac{\partial J}{\partial {x_n}} = \frac{\partial J}{\partial E}*\frac{\partial E}{\partial {x_n}}=\frac{1}{N}*\frac{\partial J}{\partial E}*(\sigma(x_n)-p_n)\)

\(J\) denotes the loss of the whole network, and \(\frac{\partial J}{\partial E}\) is top[0]->cpu_diff()[0] in the code.
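
The factor \(\sigma(x_n)-p_n\) follows in one step from the rearranged loss. The short derivation below is a sketch that relies only on the identity \(\frac{d}{dx}\log \sigma(x) = 1-\sigma(x)\):

\(\frac{\partial E}{\partial x_n} = -\frac{1}{N}\frac{\partial}{\partial x_n}\left[x_n (p_n-1)+\log \sigma(x_n)\right]\)
\(= -\frac{1}{N}\left[(p_n-1)+(1-\sigma(x_n))\right] = \frac{1}{N}\left(\sigma(x_n)-p_n\right)\)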

sigmoid_cross_entropy_loss_layer.cpp source

template <typename Dtype>
void SigmoidCrossEntropyLossLayer<Dtype>::LayerSetUp(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  LossLayer<Dtype>::LayerSetUp(bottom, top);    // base-class LayerSetUp(): sets the loss weight of the output blob
  sigmoid_bottom_vec_.clear();
  sigmoid_bottom_vec_.push_back(bottom[0]);     // clear, then store the predictions as the input of the SigmoidLayer
  sigmoid_top_vec_.clear();
  sigmoid_top_vec_.push_back(sigmoid_output_.get());  // clear, then store the blob that receives the SigmoidLayer output
  sigmoid_layer_->SetUp(sigmoid_bottom_vec_, sigmoid_top_vec_); // set up the SigmoidLayer (shapes its output blob, etc.)

  has_ignore_label_ = this->layer_param_.loss_param().has_ignore_label();   // whether an ignore label is set
  if (has_ignore_label_) {
    ignore_label_ = this->layer_param_.loss_param().ignore_label();   // keep the ignore label in this layer
  }
  if (this->layer_param_.loss_param().has_normalization()) {          // a normalization mode is set
    normalization_ = this->layer_param_.loss_param().normalization(); // keep it in this layer
  } else if (this->layer_param_.loss_param().has_normalize()) {       // the legacy normalize flag is set instead
    // normalize == true selects VALID normalization, false selects BATCH_SIZE
    normalization_ = this->layer_param_.loss_param().normalize() ?
                     LossParameter_NormalizationMode_VALID :
                     LossParameter_NormalizationMode_BATCH_SIZE;
  } else {
    // default to BATCH_SIZE; SigmoidCrossEntropyLoss is the only loss whose default is BATCH_SIZE, the others default to VALID
    normalization_ = LossParameter_NormalizationMode_BATCH_SIZE;
  }
}

template <typename Dtype>
void SigmoidCrossEntropyLossLayer<Dtype>::Reshape(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  LossLayer<Dtype>::Reshape(bottom, top);   // base-class Reshape(): checks the input shapes and makes the output a scalar
  outer_num_ = bottom[0]->shape(0);         // outer count N, batch size
  inner_num_ = bottom[0]->count(1);         // inner count C*H*W, instance size: |output| == |target|
  CHECK_EQ(bottom[0]->count(), bottom[1]->count()) <<
      "SIGMOID_CROSS_ENTROPY_LOSS layer inputs must have the same count.";  // prediction and label blobs must have the same total count
  sigmoid_layer_->Reshape(sigmoid_bottom_vec_, sigmoid_top_vec_); // call the SigmoidLayer's Reshape() to adjust its shapes
}

// TODO(shelhamer) loss normalization should be pulled up into LossLayer,
// instead of duplicated here and in SoftMaxWithLossLayer
template <typename Dtype>
Dtype SigmoidCrossEntropyLossLayer<Dtype>::get_normalizer(    // compute the normalizer from the mode and the number of valid labels
    LossParameter_NormalizationMode normalization_mode, int valid_count) {
  Dtype normalizer;
  switch (normalization_mode) {   // normalization mode
    case LossParameter_NormalizationMode_FULL:
      normalizer = Dtype(outer_num_ * inner_num_);  // FULL: outer count times inner count
      break;
    case LossParameter_NormalizationMode_VALID:     // VALID: number of valid labels
      if (valid_count == -1) {                      // valid_count == -1 means every label is valid, so this equals FULL
        normalizer = Dtype(outer_num_ * inner_num_);
      } else {
        normalizer = Dtype(valid_count);
      }
      break;
    case LossParameter_NormalizationMode_BATCH_SIZE:  // BATCH_SIZE: the outer count
      normalizer = Dtype(outer_num_);
      break;
    case LossParameter_NormalizationMode_NONE:        // NONE: the normalizer is 1
      normalizer = Dtype(1);
      break;
    default:    // any other mode is an error
      LOG(FATAL) << "Unknown normalization mode: " << LossParameter_NormalizationMode_Name(normalization_mode);
  }
  // Some users will have no labels for some examples in order to 'turn off' a
  // particular loss in a multi-task setup. The max prevents NaNs in that case.
  return std::max(Dtype(1.0), normalizer);  // clamp to at least 1: some samples may have no label, so valid_count can be 0
}

template <typename Dtype>
void SigmoidCrossEntropyLossLayer<Dtype>::Forward_cpu(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  // The forward pass computes the sigmoid outputs.
  sigmoid_bottom_vec_[0] = bottom[0];      // use the predictions as the input of the SigmoidLayer
  sigmoid_layer_->Forward(sigmoid_bottom_vec_, sigmoid_top_vec_); // compute the SigmoidLayer output into sigmoid_top_vec_
  // Compute the loss (negative log likelihood)
  // Stable version of loss computation from input data
  const Dtype* input_data = bottom[0]->cpu_data();    // predictions
  const Dtype* target = bottom[1]->cpu_data();        // labels
  int valid_count = 0;    // number of valid (non-ignored) labels
  Dtype loss = 0;
  for (int i = 0; i < bottom[0]->count(); ++i) {
    const int target_value = static_cast<int>(target[i]);   // label of element i
    if (has_ignore_label_ && target_value == ignore_label_) {   // an ignore label is set and this label matches it
      continue;           // skip it
    }
    // for x = input_data[i] < 0:  loss -= x*p - log(1+exp(x))
    // for x >= 0:                 loss -= x*(p-1) - log(1+exp(-x)), so exp() never sees a large positive argument
    loss -= input_data[i] * (target[i] - (input_data[i] >= 0)) -
        log(1 + exp(input_data[i] - 2 * input_data[i] * (input_data[i] >= 0)));
    ++valid_count;    // one more valid label
  }
  normalizer_ = get_normalizer(normalization_, valid_count);  // compute the normalizer
  top[0]->mutable_cpu_data()[0] = loss / normalizer_;         // divide by the normalizer to get the final loss
}

template <typename Dtype>
void SigmoidCrossEntropyLossLayer<Dtype>::Backward_cpu(
    const vector<Blob<Dtype>*>& top, const vector<bool>& propagate_down,
    const vector<Blob<Dtype>*>& bottom) {
  if (propagate_down[1]) {    // again, backpropagating to the label input blob is not allowed
    LOG(FATAL) << this->type() << " Layer cannot backpropagate to label inputs.";
  }
  if (propagate_down[0]) {    // the prediction blob needs a gradient
    // First, compute the diff
    const int count = bottom[0]->count();                 // number of elements
    const Dtype* sigmoid_output_data = sigmoid_output_->cpu_data(); // SigmoidLayer output σ(x)
    const Dtype* target = bottom[1]->cpu_data();          // labels
    Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();   // gradient pointer of the prediction blob
    caffe_sub(count, sigmoid_output_data, target, bottom_diff); // bottom_diff = sigmoid_output_data - target
    // Zero out gradient of ignored targets.
    if (has_ignore_label_) {    // an ignore label is set
      for (int i = 0; i < count; ++i) {
        const int target_value = static_cast<int>(target[i]); // label of element i
        if (target_value == ignore_label_) {  // this element carries the ignore label
          bottom_diff[i] = 0;   // zero its gradient
        }
      }
    }
    // Scale down gradient
    Dtype loss_weight = top[0]->cpu_diff()[0] / normalizer_;    // scaling coefficient
    caffe_scal(count, loss_weight, bottom_diff);  // bottom_diff *= loss_weight
  }
}

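Summary

Gradient accumulation applies only to a layer's parameter blobs. The gradients of a layer's input and output blobs are not accumulated, which is why the Backward_cpu() functions above overwrite bottom_diff (with an assignment, caffe_copy or caffe_sub) instead of adding to it.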

References

http://freemind.pluskid.org/machine-learning/softmax-vs-softmax-loss-numerical-stability/
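This is my first time reading the Caffe source. I took notes as I read, so my understanding and analysis of the code may contain errors or omissions. Corrections from readers are welcome. Thank you for your support!

Caffe source code: the LossLayer class (Part 1)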