Paper reading notes: NormFace: L2 HyperSphere Embedding for Face Verification
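This paper comes from the University of Electronic Science and Technology of China. Its content closely resembles COCO_LOSS, but it explains part of the underlying principle mathematically, so the two are worth reading together.
Main idea
The paper first analyzes the problem: when training for face recognition, softmax optimizes the unnormalized inner product, but at test time features are usually compared with cosine similarity or Euclidean distance, so the training objective and the final distance metric do not match.
It then studies the effect of normalizing features at test time and finds that computing the inner product or the Euclidean distance after normalization works best, which motivates adding normalization during training as well.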
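The reason the inner product and the Euclidean distance behave alike after normalization is the identity ‖a − b‖² = 2 − 2·cos(a, b) for unit-length vectors. The sketch below checks it numerically; the helper names (`l2_normalize`, `inner_product`, `squared_distance`) are illustrative and do not come from the paper or from Caffe.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns v scaled to unit L2 length. Assumes v is non-empty and non-zero.
std::vector<double> l2_normalize(const std::vector<double>& v) {
  assert(!v.empty());
  double sum_sq = 0.0;
  for (double x : v) sum_sq += x * x;
  const double norm = std::sqrt(sum_sq);
  std::vector<double> out(v.size());
  for (std::size_t i = 0; i < v.size(); ++i) out[i] = v[i] / norm;
  return out;
}

// Inner product; equals the cosine similarity when both inputs are unit length.
double inner_product(const std::vector<double>& a, const std::vector<double>& b) {
  assert(a.size() == b.size());
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Squared Euclidean distance between two vectors of equal length.
double squared_distance(const std::vector<double>& a, const std::vector<double>& b) {
  assert(a.size() == b.size());
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
  return s;
}
```

For the normalized versions of (3, 4) and (1, 0), the cosine is 0.6 and the squared distance is 0.8 = 2 − 2·0.6, so ranking pairs by either measure gives the same order.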
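Algorithm
First, why normalization is needed and why the bias term should be dropped. Softmax loss tends to learn features with a radial distribution, because a larger feature scale yields a smaller softmax loss. When the fully connected layer before softmax has a bias, some classes may not be separable by angle at all and are separated only through the bias. If the features are then normalized, the features of such a small class near the center are spread over the unit sphere and overlap with the features of other classes. For this reason the bias must be removed when the features are normalized.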
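The claim that a larger feature scale lowers the softmax loss can be checked on a toy example. The sketch below assumes a bias-free layer, so multiplying the feature by k multiplies every logit by k; `softmax_loss` and `loss_at_feature_scale` are hypothetical helper names and the logits are made-up numbers.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax cross-entropy loss for one sample, computed with the usual
// max-subtraction trick for numerical stability.
double softmax_loss(const std::vector<double>& logits, std::size_t label) {
  assert(label < logits.size());
  const double max_logit = *std::max_element(logits.begin(), logits.end());
  double sum_exp = 0.0;
  for (double z : logits) sum_exp += std::exp(z - max_logit);
  return std::log(sum_exp) - (logits[label] - max_logit);
}

// Without a bias the logits are W * f, so scaling the feature f by k scales
// every logit by k. This helper applies that scaling to the logits directly.
double loss_at_feature_scale(const std::vector<double>& logits,
                             std::size_t label, double k) {
  std::vector<double> scaled(logits);
  for (double& z : scaled) z *= k;
  return softmax_loss(scaled, label);
}
```

With logits (2, 1, 0) and label 0, the loss is about 0.408 at k = 1 and drops to roughly 4.5e-5 at k = 10, so a correctly classified sample can keep lowering its loss simply by growing its norm. For a misclassified sample (label 2 here) the same scaling increases the loss.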
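Next, the paper addresses the problem that the network fails to converge after normalization. After normalization the inputs to softmax loss lie in [-1, 1], so the loss is bounded from below. Even when a sample is perfectly classified, i.e. the output for its own class is 1 and all other outputs are -1, the probability Py is still small. The gradient of softmax loss is 1-Py, so easy samples also receive large gradients. With the original softmax loss the input scale can grow large, which pushes Py close to 1 and makes the gradients of easy and hard samples clearly different.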
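To put numbers on this argument, the sketch below evaluates Py for the idealized case described above: the correct logit is +s and the remaining n − 1 logits are all −s, where s = 1 means plain normalization. The function names are hypothetical; n = 10572 and s = 10 are taken from the prototxt later in these notes (`num_output` and the `Scale` filler value).

```cpp
#include <cassert>
#include <cmath>

// Probability of the correct class when its logit is +s and the other
// n - 1 logits are all -s:
//   e^s / (e^s + (n - 1) e^-s) = 1 / (1 + (n - 1) e^(-2s)).
double best_case_probability(int n, double s) {
  assert(n >= 2);
  return 1.0 / (1.0 + (n - 1) * std::exp(-2.0 * s));
}

// Magnitude of the softmax-loss gradient w.r.t. the correct-class logit.
double gradient_magnitude(int n, double s) {
  return 1.0 - best_case_probability(n, s);
}
```

For n = 10572 and s = 1, Py is about 0.0007, so even this ideal sample has a gradient magnitude above 0.999. At s = 10, Py exceeds 0.9999 and the gradient of an easy sample nearly vanishes.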
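Finally, the fix the paper proposes follows directly: add a scale after normalization to widen this gap. The resulting loss is a softmax loss applied to the scaled inner product of w and f, where both w and f are normalized.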
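Written out, the normalized softmax loss described here takes the form below. The symbols m (batch size), n (number of classes) and y_i (label of sample i) are notation assumed for this note; s is the scale, and the tilde marks L2-normalized vectors.

```latex
\mathcal{L} = -\frac{1}{m}\sum_{i=1}^{m}
  \log\frac{e^{\,s\,\tilde{W}_{y_i}^{\top}\tilde{f}_i}}
           {\sum_{j=1}^{n} e^{\,s\,\tilde{W}_{j}^{\top}\tilde{f}_i}},
\qquad
\tilde{W}_j = \frac{W_j}{\lVert W_j\rVert_2},\quad
\tilde{f}_i = \frac{f_i}{\lVert f_i\rVert_2}
```

Each exponent is s times a cosine, so the logits range over [-s, s] instead of [-1, 1].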
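The main difference between this paper and COCO_LOSS is that in COCO_LOSS the class centers are recomputed from the samples at every update, whereas in Norm_Face the class weights are learned by the network.
Note: the choice of the coefficient s matters; the original post illustrated it with a figure.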
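One way to see how s affects training is to compute the loss a sample would have in an idealized configuration: all n unit-length class weights are spread evenly, so every pair has cosine −1/(n − 1), and the feature coincides with its own class weight. This is a sketch under that assumption, not a reproduction of the missing figure, and `loss_floor` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Softmax loss when the correct logit is s and every other logit is
// -s / (n - 1), i.e. log(1 + (n - 1) * exp(-s * n / (n - 1))).
double loss_floor(int n, double s) {
  assert(n >= 2);
  const double gap = s * n / (n - 1.0);  // correct logit minus any other logit
  return std::log1p((n - 1.0) * std::exp(-gap));
}
```

For n = 10572 classes the value is about 8.27 at s = 1, about 0.39 at s = 10, and below 1e-6 at s = 30. A small s therefore leaves the loss stuck at a high plateau, which fits the non-convergence described above.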
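Experimental results
Summary
That normalization itself benefits the training of deep networks is by now widely accepted.
Code implementation
Prototype: the features are first normalized, then a modified fully connected layer multiplies its normalized weights with the features, the output is multiplied by a scale, and finally SoftmaxWithLoss computes the loss: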
layer {
name: "normalize1"
type: "Normalize"
bottom: "pool5/7x7_s1"
top: "norm1"
}
layer {
name: "cosine_layer"
type: "InnerProduct"
bottom: "norm1"
top: "cosine"
param {
lr_mult: 100
decay_mult: 0
}
inner_product_param{
bias_term: false
normalize: true
num_output: 10572
weight_filler {
type: "gaussian_unitball"
}
}
}
layer {
name: "cosine_scale"
type: "Scale"
bottom: "cosine"
top: "cosine"
scale_param {
num_axes: 0
bias_term: false
min_value: 0.01
filler{
value: 10
}
bias_filler{
value: 0
}
}
}
layer {
name: "softmax_loss"
type: "SoftmaxWithLoss"
bottom: "cosine"
bottom: "label"
top: "softmax_loss"
loss_weight: 1
}
layer {
name: "Accuracy"
type: "Accuracy"
bottom: "cosine"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
accuracy_param {
min_is_better: false
}
}
I0713 12:52:55.991122 26934 layer_factory.hpp:77] Creating layer normalize1
I0713 12:52:55.991127 26934 net.cpp:100] Creating Layer normalize1
I0713 12:52:55.991129 26934 net.cpp:434] normalize1 <- pool5/7x7_s1
I0713 12:52:55.991143 26934 net.cpp:408] normalize1 -> norm1
I0713 12:52:55.991185 26934 net.cpp:150] Setting up normalize1
I0713 12:52:55.991190 26934 net.cpp:157] Top shape: 16 1024 1 1 (16384)
I0713 12:52:55.991192 26934 net.cpp:165] Memory required for data: 1661300672
I0713 12:52:55.991194 26934 layer_factory.hpp:77] Creating layer cosine_layer
I0713 12:52:55.991209 26934 net.cpp:100] Creating Layer cosine_layer
I0713 12:52:55.991212 26934 net.cpp:434] cosine_layer <- norm1
I0713 12:52:55.991216 26934 net.cpp:408] cosine_layer -> cosine
I0713 12:52:55.991401 26934 net.cpp:150] Setting up cosine_layer
I0713 12:52:55.991407 26934 net.cpp:157] Top shape: 16 13 (208)
I0713 12:52:55.991410 26934 net.cpp:165] Memory required for data: 1661301504
I0713 12:52:55.991413 26934 layer_factory.hpp:77] Creating layer cosine_scale
I0713 12:52:55.991418 26934 net.cpp:100] Creating Layer cosine_scale
I0713 12:52:55.991422 26934 net.cpp:434] cosine_scale <- cosine
I0713 12:52:55.991426 26934 net.cpp:395] cosine_scale -> cosine (in-place)
I0713 12:52:55.991487 26934 net.cpp:150] Setting up cosine_scale
I0713 12:52:55.991490 26934 net.cpp:157] Top shape: 16 13 (208)
I0713 12:52:55.991493 26934 net.cpp:165] Memory required for data: 1661302336
I0713 12:52:55.991497 26934 layer_factory.hpp:77] Creating layer cosine_cosine_scale_0_split
I0713 12:52:55.991500 26934 net.cpp:100] Creating Layer cosine_cosine_scale_0_split
I0713 12:52:55.991503 26934 net.cpp:434] cosine_cosine_scale_0_split <- cosine
I0713 12:52:55.991509 26934 net.cpp:408] cosine_cosine_scale_0_split -> cosine_cosine_scale_0_split_0
I0713 12:52:55.991514 26934 net.cpp:408] cosine_cosine_scale_0_split -> cosine_cosine_scale_0_split_1
I0713 12:52:55.991541 26934 net.cpp:150] Setting up cosine_cosine_scale_0_split
I0713 12:52:55.991545 26934 net.cpp:157] Top shape: 16 13 (208)
I0713 12:52:55.991549 26934 net.cpp:157] Top shape: 16 13 (208)
I0713 12:52:55.991551 26934 net.cpp:165] Memory required for data: 1661304000
I0713 12:52:55.991554 26934 layer_factory.hpp:77] Creating layer softmax_loss
I0713 12:52:55.991559 26934 net.cpp:100] Creating Layer softmax_loss
I0713 12:52:55.991561 26934 net.cpp:434] softmax_loss <- cosine_cosine_scale_0_split_0
I0713 12:52:55.991565 26934 net.cpp:434] softmax_loss <- label_data_1_split_0
I0713 12:52:55.991569 26934 net.cpp:408] softmax_loss -> softmax_loss
I0713 12:52:55.991575 26934 layer_factory.hpp:77] Creating layer softmax_loss
I0713 12:52:55.992048 26934 net.cpp:150] Setting up softmax_loss
I0713 12:52:55.992056 26934 net.cpp:157] Top shape: (1)
I0713 12:52:55.992058 26934 net.cpp:160] with loss weight 1
I0713 12:52:55.992066 26934 net.cpp:165] Memory required for data: 1661304004
normalize_layer.hpp/normalize_layer.cpp
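normalize_layer.hpp/normalize_layer.cpp (performs the normalization)
Normalization: each feature vector is divided by its L2 norm (or L1 norm) taken over the channel axis.
Because the result feeds a matrix multiplication, the normalization runs over the channels of an N*C*1*1 blob. Note: the preceding layer should output one flat feature vector per sample (a fully connected layer or, as in the prototxt above, a global pooling layer): 16 1024 1 1 (16384)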
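For reference, the L2 branch of the layer's Forward_cpu and Backward_cpu, listed further down, can be written per feature vector as follows. Here x is the input, y the output, L the loss, and the value of ε is the 1e-6 constant that appears in the code.

```latex
\begin{aligned}
\lVert x\rVert &= \sqrt{\sum_{k} x_k^{2} + \epsilon}, \qquad \epsilon = 10^{-6},\\
y_c &= \frac{x_c}{\lVert x\rVert},\\
\frac{\partial L}{\partial x_c} &= \frac{1}{\lVert x\rVert}
  \Bigl(\frac{\partial L}{\partial y_c} - y_c \sum_{k} y_k \frac{\partial L}{\partial y_k}\Bigr).
\end{aligned}
```

The sum in the last line is the quantity `a` that the backward code computes with `caffe_cpu_strided_dot`.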
message NormalizeParameter {
optional string normalize_type = 1 [default = "L2"];
optional bool fix_gradient = 2 [default = false];
optional bool bp_norm = 3 [default = false];
}
#ifndef CAFFE_NORMALIZE_LAYER_HPP_
#define CAFFE_NORMALIZE_LAYER_HPP_
#include <string>
#include <utility>
#include <vector>
#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"
namespace caffe {
/**
* @brief Normalizes input.
*/
template <typename Dtype>
class NormalizeLayer : public Layer<Dtype> {
public:
explicit NormalizeLayer(const LayerParameter& param)
: Layer<Dtype>(param) {}
virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual inline const char* type() const { return "Normalize"; }
virtual inline int ExactNumBottomBlobs() const { return 1; }
virtual inline int MinTopBlobs() const { return 1; }
virtual inline int MaxTopBlobs() const { return 2; }
protected:
virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
// squared_ holds the element-wise squares; norm_ holds the L2 norm over channels at each position
Blob<Dtype> sum_multiplier_, squared_, norm_;
std::string normalize_type_; // normalization type, "L2" by default
bool fix_gradient_;
bool bp_norm_;
};
}
#endif // CAFFE_NORMALIZE_LAYER_HPP_
#include <algorithm>
#include <vector>
#include <cmath>
#include "caffe/layer.hpp"
#include "caffe/util/math_functions.hpp"
#include "caffe/layers/normalize_layer.hpp"
namespace caffe {
#define sign(x) ((Dtype(0) < (x)) - ((x) < Dtype(0)))
template <typename Dtype>
void NormalizeLayer<Dtype>::LayerSetUp( // only reads the layer parameters
const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
normalize_type_ =
this->layer_param_.normalize_param().normalize_type();
fix_gradient_ =
this->layer_param_.normalize_param().fix_gradient();
bp_norm_ = this->layer_param_.normalize_param().bp_norm() && top.size() == 2;
}
template <typename Dtype>
void NormalizeLayer<Dtype>::Reshape(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top) {
top[0]->Reshape(bottom[0]->num(), bottom[0]->channels(),
bottom[0]->height(), bottom[0]->width());
squared_.Reshape(bottom[0]->num(), bottom[0]->channels(),
bottom[0]->height(), bottom[0]->width());
if (top.size() == 2) {
top[1]->Reshape(bottom[0]->num(), 1,
bottom[0]->height(), bottom[0]->width());
}
norm_.Reshape(bottom[0]->num(), 1,
bottom[0]->height(), bottom[0]->width());
}
template <typename Dtype>
void NormalizeLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top) {
const Dtype* bottom_data = bottom[0]->cpu_data();
Dtype* top_data = top[0]->mutable_cpu_data();
Dtype* square_data = squared_.mutable_cpu_data();
Dtype* norm_data = (top.size() == 2) ? top[1]->mutable_cpu_data() : norm_.mutable_cpu_data();
int num = bottom[0]->num();
int channels = bottom[0]->channels();
int spatial_dim = bottom[0]->height() * bottom[0]->width();
if (normalize_type_ == "L2") {
caffe_sqr<Dtype>(num*channels*spatial_dim, bottom_data, square_data); // element-wise square
for (int n = 0; n < num; n++) {
for (int s = 0; s < spatial_dim; s++) {
norm_data[n*spatial_dim + s] = Dtype(0);
for (int c = 0; c < channels; c++) {
norm_data[n*spatial_dim + s] += square_data[(n * channels + c) * spatial_dim + s]; // sum over channels
}
norm_data[n*spatial_dim + s] += 1e-6;
norm_data[n*spatial_dim + s] = sqrt(norm_data[n*spatial_dim + s]); // square root gives the L2 norm
for (int c = 0; c < channels; c++) {
top_data[(n * channels + c) * spatial_dim + s] = bottom_data[(n * channels + c) * spatial_dim + s] / norm_data[n*spatial_dim + s]; // write the normalized value (forward output)
}
}
}
}
else if (normalize_type_ == "L1") {
caffe_abs<Dtype>(num*channels*spatial_dim, bottom_data, square_data);
for (int n = 0; n < num; n++) {
for (int s = 0; s < spatial_dim; s++) {
norm_data[n*spatial_dim +s] = Dtype(0);
for (int c = 0; c < channels; c++) {
norm_data[n*spatial_dim + s] += square_data[(n * channels + c) * spatial_dim + s];
}
norm_data[n*spatial_dim + s] += 1e-6;
norm_data[n*spatial_dim + s] = norm_data[n*spatial_dim + s];
for (int c = 0; c < channels; c++) {
top_data[(n * channels + c) * spatial_dim + s] = bottom_data[(n * channels + c) * spatial_dim + s] / norm_data[n*spatial_dim + s];
}
}
}
}
else {
NOT_IMPLEMENTED;
}
}
template <typename Dtype>
void NormalizeLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
const Dtype* top_diff = top[0]->cpu_diff();
const Dtype* top_data = top[0]->cpu_data();
const Dtype* bottom_data = bottom[0]->cpu_data();
const Dtype* square_data = squared_.cpu_data();
const Dtype* norm_data = (top.size() == 2) ? top[1]->cpu_data() : norm_.cpu_data();
Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
int num = bottom[0]->num();
int channels = bottom[0]->channels();
int spatial_dim = bottom[0]->height() * bottom[0]->width();
if (propagate_down[0]) {
if (normalize_type_ == "L2") {
for (int n = 0; n < num; ++n) {
for (int s = 0; s < spatial_dim; s++) {
Dtype a = caffe_cpu_strided_dot(channels, top_data + n*channels*spatial_dim + s, spatial_dim, top_diff + n*channels*spatial_dim + s, spatial_dim);
for (int c = 0; c < channels; c++) {
bottom_diff[(n * channels + c) * spatial_dim + s] =
(top_diff[(n * channels + c) * spatial_dim + s] - top_data[(n * channels + c) * spatial_dim + s] * a) / norm_data[n*spatial_dim + s];
}
}
}
}
else if (normalize_type_ == "L1") {
for (int n = 0; n < num; ++n) {
for (int s = 0; s < spatial_dim; s++) {
Dtype a = caffe_cpu_strided_dot(channels, top_data + n*channels*spatial_dim + s, spatial_dim, top_diff + n*channels*spatial_dim + s, spatial_dim);
for (int c = 0; c < channels; c++) {
bottom_diff[(n * channels + c) * spatial_dim + s] =
(top_diff[(n * channels + c) * spatial_dim + s] - sign(bottom_data[(n * channels + c) * spatial_dim + s]) * a) / norm_data[n*spatial_dim + s];
}
}
}
}
else {
NOT_IMPLEMENTED;
}
}
if (bp_norm_) {
const Dtype* norm_diff =top[1]->cpu_diff();
if (normalize_type_ == "L2") {
for (int n = 0; n < num; ++n) {
for (int s = 0; s < spatial_dim; s++) {
for (int c = 0; c < channels; c++) {
bottom_diff[(n * channels + c) * spatial_dim + s] += norm_diff[n*spatial_dim + s] * top_data[(n * channels + c) * spatial_dim + s];
}
}
}
}
else if (normalize_type_ == "L1") {
for (int n = 0; n < num; ++n) {
for (int s = 0; s < spatial_dim; s++) {
for (int c = 0; c < channels; c++) {
bottom_diff[(n * channels + c) * spatial_dim + s] += norm_diff[n*spatial_dim + s] * sign(bottom_data[(n * channels + c) * spatial_dim + s]);
}
}
}
}
}
}
#ifdef CPU_ONLY
STUB_GPU(NormalizeLayer);
#endif
INSTANTIATE_CLASS(NormalizeLayer);
REGISTER_LAYER_CLASS(Normalize);
} // namespace caffe
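The L2 forward and backward passes above can be checked outside Caffe with a single-vector version. This is a minimal standalone sketch; `NormalizeResult`, `l2_forward` and `l2_backward` are hypothetical names, and only the formulas and the 1e-6 epsilon mirror the layer code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct NormalizeResult {
  std::vector<double> y;  // normalized vector
  double norm;            // sqrt(sum of squares + eps)
};

// Forward pass for one feature vector: y = x / sqrt(sum(x^2) + eps).
NormalizeResult l2_forward(const std::vector<double>& x, double eps = 1e-6) {
  assert(!x.empty());
  double sum_sq = eps;
  for (double v : x) sum_sq += v * v;
  NormalizeResult r;
  r.norm = std::sqrt(sum_sq);
  r.y.resize(x.size());
  for (std::size_t c = 0; c < x.size(); ++c) r.y[c] = x[c] / r.norm;
  return r;
}

// Backward pass: dx = (dy - y * dot(y, dy)) / norm.
std::vector<double> l2_backward(const NormalizeResult& fwd,
                                const std::vector<double>& dy) {
  assert(dy.size() == fwd.y.size());
  double a = 0.0;  // dot(y, dy), the same quantity the layer calls `a`
  for (std::size_t c = 0; c < dy.size(); ++c) a += fwd.y[c] * dy[c];
  std::vector<double> dx(dy.size());
  for (std::size_t c = 0; c < dy.size(); ++c) {
    dx[c] = (dy[c] - fwd.y[c] * a) / fwd.norm;
  }
  return dx;
}
```

For x = (3, 4) and an upstream gradient (1, 0), the result is approximately (0.128, −0.096), which agrees with a central finite difference of the first output component.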
inner_product_layer.hpp/inner_product_layer.cpp
inner_product_layer.hpp/inner_product_layer.cpp (fully connected layer whose learnable weights are normalized in the forward pass); output shape in the log above: 16 13 (208)
message InnerProductParameter {
optional uint32 num_output = 1; // The number of outputs for the layer
optional bool bias_term = 2 [default = true]; // whether to have bias terms
optional FillerParameter weight_filler = 3; // The filler for the weight
optional FillerParameter bias_filler = 4; // The filler for the bias
// The first axis to be lumped into a single inner product computation;
// all preceding axes are retained in the output.
// May be negative to index from the end (e.g., -1 for the last axis).
optional int32 axis = 5 [default = 1];
// Specify whether to transpose the weight matrix or not.
// If transpose == true, any operations will be performed on the transpose
// of the weight matrix. The weight matrix itself is not going to be transposed
// but rather the transfer flag of operations will be toggled accordingly.
optional bool transpose = 6 [default = false];
optional bool normalize = 7 [default = false];
}
#ifndef CAFFE_INNER_PRODUCT_LAYER_HPP_
#define CAFFE_INNER_PRODUCT_LAYER_HPP_
#include <vector>
#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"
namespace caffe {
/**
* @brief Also known as a "fully-connected" layer, computes an inner product
* with a set of learned weights, and (optionally) adds biases.
*
* TODO(dox): thorough documentation for Forward, Backward, and proto params.
*/
template <typename Dtype>
class InnerProductLayer : public Layer<Dtype> {
public:
explicit InnerProductLayer(const LayerParameter& param)
: Layer<Dtype>(param) {}
virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual inline const char* type() const { return "InnerProduct"; }
virtual inline int MinBottomBlobs() const { return 1; }
virtual inline int ExactNumTopBlobs() const { return 1; }
protected:
virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
int M_;
int K_;
int N_;
bool bias_term_;
Blob<Dtype> bias_multiplier_;
bool transpose_; ///< if true, assume transposed weights
bool normalize_; // whether to normalize the weights
Blob<Dtype> weight_norm_; // per-output buffer of shape (N_) for the weight norms
};
} // namespace caffe
#endif // CAFFE_INNER_PRODUCT_LAYER_HPP_
#include <vector>
#include "caffe/filler.hpp"
#include "caffe/layers/inner_product_layer.hpp"
#include "caffe/util/math_functions.hpp"
namespace caffe {
template <typename Dtype>
void InnerProductLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top) {
const int num_output = this->layer_param_.inner_product_param().num_output(); // number of output neurons
bias_term_ = this->layer_param_.inner_product_param().bias_term(); // bool: whether a bias term exists
transpose_ = this->layer_param_.inner_product_param().transpose(); // bool: whether the weight matrix is transposed
normalize_ = this->layer_param_.inner_product_param().normalize(); // bool: whether to normalize the weights
if (bottom.size() == 1) N_ = num_output; // N_ is the number of output neurons
else N_ = bottom[1]->num();
const int axis = bottom[0]->CanonicalAxisIndex(
this->layer_param_.inner_product_param().axis());
// Dimensions starting from "axis" are "flattened" into a single
// length K_ vector. For example, if bottom[0]'s shape is (N, C, H, W),
// and axis == 1, N inner products with dimension CHW are performed.
K_ = bottom[0]->count(axis); // with axis == 1, K_ = C*H*W, the number of input neurons
// Check if we need to set up the weights
if (this->blobs_.size() > 0 || bottom.size() == 3
|| (bottom.size() == 2 && !bias_term_)) {
LOG(INFO) << "Skipping parameter initialization";
}
else {
int bias_index = 0;
if (bias_term_) {
if (bottom.size() == 2) {
this->blobs_.resize(1);
}
else {
this->blobs_.resize(2);
bias_index = 1;
}
}
else {
this->blobs_.resize(1);
}
if (bottom.size() == 1) { // with a single bottom, the layer initializes its own weights
// Initialize the weights
vector<int> weight_shape(2);
if (transpose_) {
weight_shape[0] = K_;
weight_shape[1] = N_;
}
else {
weight_shape[0] = N_;
weight_shape[1] = K_;
}
this->blobs_[0].reset(new Blob<Dtype>(weight_shape)); // allocate the weight blob: K_ inputs, N_ outputs
// fill the weights
shared_ptr<Filler<Dtype> > weight_filler(GetFiller<Dtype>( // look up the weight initializer named in the config
this->layer_param_.inner_product_param().weight_filler()));
weight_filler->Fill(this->blobs_[0].get()); // fill the weights with their initial values
}
// If necessary, initialize and fill the bias term
if (bias_term_ && bottom.size() <= 2) {
vector<int> bias_shape(1, N_);
this->blobs_[bias_index].reset(new Blob<Dtype>(bias_shape));
shared_ptr<Filler<Dtype> > bias_filler(GetFiller<Dtype>(
this->layer_param_.inner_product_param().bias_filler()));
bias_filler->Fill(this->blobs_[bias_index].get());
}
} // parameter initialization
if (bottom.size() == 1) this->param_propagate_down_.resize(this->blobs_.size(), true); // mark every parameter blob as needing gradients
}
template <typename Dtype>
void InnerProductLayer<Dtype>::Reshape(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top) {
// Figure out the dimensions
const int axis = bottom[0]->CanonicalAxisIndex(
this->layer_param_.inner_product_param().axis());
const int new_K = bottom[0]->count(axis);
CHECK_EQ(K_, new_K)
<< "Input size incompatible with inner product parameters.";
// The first "axis" dimensions are independent inner products; the total
// number of these is M_, the product over these dimensions.
M_ = bottom[0]->count(0, axis); // number of independent inner products (the batch size when axis == 1); also the length of bias_multiplier_
if (bottom.size() >= 2) N_ = bottom[1]->num();
// The top shape will be the bottom shape with the flattened axes dropped,
// and replaced by a single axis with dimension num_output (N_).
vector<int> top_shape = bottom[0]->shape();
top_shape.resize(axis + 1); // keep the leading axes: [N, C] when axis == 1
top_shape[axis] = N_; // becomes [N, N_], with N_ the number of output neurons
top[0]->Reshape(top_shape); // set the shape of top
// Set up the bias multiplier
if (bias_term_ && bottom.size() <= 2) {
vector<int> bias_shape(1, M_); // shape of the bias multiplier
bias_multiplier_.Reshape(bias_shape); // allocate the bias multiplier
caffe_set(M_, Dtype(1), bias_multiplier_.mutable_cpu_data()); // fill it with ones
}
if (bottom.size() == 1 && normalize_) {
vector<int> weight_norm_shape(1, N_); // one entry per output neuron
weight_norm_.Reshape(weight_norm_shape); // allocate the weight-norm buffer
caffe_set(N_, Dtype(0), weight_norm_.mutable_cpu_data()); // initialize it to zeros
}
}
template <typename Dtype>
void InnerProductLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top) {
const Dtype* bottom_data = bottom[0]->cpu_data(); // input data
Dtype* top_data = top[0]->mutable_cpu_data(); // output data
const Dtype* weight = bottom.size() >= 2 ? bottom[1]->cpu_data() : this->blobs_[0]->cpu_data(); // weights
if (normalize_ && bottom.size() == 1) {
Dtype* mutable_weight = this->blobs_[0]->mutable_cpu_data();
Dtype sum_sq;
for (int n = 0; n < N_; n++) {
sum_sq = caffe_cpu_dot(K_, weight + n*K_, weight + n*K_) + 1e-6; // squared L2 norm of weight row n, plus epsilon
caffe_scal<Dtype>(K_, Dtype(1) / sqrt(sum_sq), mutable_weight + n*K_); // rescale row n to unit length, in place
}
}
caffe_cpu_gemm<Dtype>(CblasNoTrans, transpose_ ? CblasNoTrans : CblasTrans,
M_, N_, K_, (Dtype)1.,
bottom_data, weight, (Dtype)0., top_data); // matrix product of inputs and weights
if (bias_term_) {
caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, N_, 1, (Dtype)1.,
bias_multiplier_.cpu_data(),
bottom.size() == 3 ? bottom[2]->cpu_data() : this->blobs_[1]->cpu_data(), (Dtype)1., top_data); // add the bias
}
}
template <typename Dtype>
void InnerProductLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
const vector<bool>& propagate_down,
const vector<Blob<Dtype>*>& bottom) {
const Dtype* weight = bottom.size() >= 2 ? bottom[1]->cpu_data() : this->blobs_[0]->cpu_data();
if ((bottom.size() == 1 && this->param_propagate_down_[0])||
(bottom.size() >= 2 && propagate_down[1])){
const Dtype* top_diff = top[0]->cpu_diff();
const Dtype* bottom_data = bottom[0]->cpu_data();
Dtype* weight_diff = bottom.size() >= 2 ? bottom[1]->mutable_cpu_diff() : this->blobs_[0]->mutable_cpu_diff();
if (bottom.size() >= 2) {
if (transpose_) {
caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans,
K_, N_, M_,
(Dtype)1., bottom_data, top_diff,
(Dtype)0., weight_diff);
}
else {
caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans,
N_, K_, M_,
(Dtype)1., top_diff, bottom_data,
(Dtype)0., weight_diff);
}
}
else {
if (transpose_) {
caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans,
K_, N_, M_,
(Dtype)1., bottom_data, top_diff,
(Dtype)1., weight_diff);
}
else {
caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans,
N_, K_, M_,
(Dtype)1., top_diff, bottom_data,
(Dtype)1., weight_diff);
}
}
}
if (bias_term_ && (this->param_propagate_down_[1] ||
(bottom.size() == 3 && propagate_down[2]))) {
const Dtype* top_diff = top[0]->cpu_diff();
// Gradient with respect to bias
caffe_cpu_gemv<Dtype>(CblasTrans, M_, N_, (Dtype)1., top_diff,
bias_multiplier_.cpu_data(), (Dtype)1.,
bottom.size()==3? bottom[2]->mutable_cpu_diff() : this->blobs_[1]->mutable_cpu_diff());
}
if (propagate_down[0]) {
const Dtype* top_diff = top[0]->cpu_diff();
// Gradient with respect to bottom data
if (transpose_) {
caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasTrans,
M_, K_, N_,
(Dtype)1., top_diff, weight,
(Dtype)0., bottom[0]->mutable_cpu_diff());
} else {
caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans,
M_, K_, N_,
(Dtype)1., top_diff, weight,
(Dtype)0., bottom[0]->mutable_cpu_diff());
}
}
}
#ifdef CPU_ONLY
STUB_GPU(InnerProductLayer);
#endif
INSTANTIATE_CLASS(InnerProductLayer);
REGISTER_LAYER_CLASS(InnerProduct);
} // namespace caffe
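Taken together, the Normalize layer and the `normalize: true` inner product produce one cosine per class. The standalone sketch below reproduces that computation for a single feature; `normalize_in_place` and `cosine_logits` are hypothetical names. Unlike the Caffe layer, which rescales its stored weights in place, this version works on copies.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scales v in place to unit L2 length; eps guards against division by zero.
void normalize_in_place(std::vector<double>& v, double eps = 1e-6) {
  double sum_sq = eps;
  for (double x : v) sum_sq += x * x;
  const double inv_norm = 1.0 / std::sqrt(sum_sq);
  for (double& x : v) x *= inv_norm;
}

// Normalizes the feature and every weight row (on local copies), then returns
// one inner product per row. With unit-length inputs each output is a cosine.
std::vector<double> cosine_logits(std::vector<std::vector<double>> weights,
                                  std::vector<double> feature) {
  normalize_in_place(feature);
  std::vector<double> out;
  out.reserve(weights.size());
  for (std::vector<double>& row : weights) {
    assert(row.size() == feature.size());
    normalize_in_place(row);
    double dot = 0.0;
    for (std::size_t i = 0; i < row.size(); ++i) dot += row[i] * feature[i];
    out.push_back(dot);
  }
  return out;
}
```

With weight rows (2, 0), (0, 3), (−1, 0) and feature (3, 4), the outputs are approximately 0.6, 0.8 and −0.6, all inside [-1, 1] regardless of the original norms.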
scale_layer.hpp/scale_layer.cpp
scale_layer.hpp/scale_layer.cpp: a regular scale layer, extended with optional min_value/max_value clamping of the learned scale
message ScaleParameter {
// The first axis of bottom[0] (the first input Blob) along which to apply
// bottom[1] (the second input Blob). May be negative to index from the end
// (e.g., -1 for the last axis).
//
// For example, if bottom[0] is 4D with shape 100x3x40x60, the output
// top[0] will have the same shape, and bottom[1] may have any of the
// following shapes (for the given value of axis):
// (axis == 0 == -4) 100; 100x3; 100x3x40; 100x3x40x60
// (axis == 1 == -3) 3; 3x40; 3x40x60
// (axis == 2 == -2) 40; 40x60
// (axis == 3 == -1) 60
// Furthermore, bottom[1] may have the empty shape (regardless of the value of
// "axis") -- a scalar multiplier.
optional int32 axis = 1 [default = 1];
// (num_axes is ignored unless just one bottom is given and the scale is
// a learned parameter of the layer. Otherwise, num_axes is determined by the
// number of axes by the second bottom.)
// The number of axes of the input (bottom[0]) covered by the scale
// parameter, or -1 to cover all axes of bottom[0] starting from `axis`.
// Set num_axes := 0, to multiply with a zero-axis Blob: a scalar.
optional int32 num_axes = 2 [default = 1];
// (filler is ignored unless just one bottom is given and the scale is
// a learned parameter of the layer.)
// The initialization for the learned scale parameter.
// Default is the unit (1) initialization, resulting in the ScaleLayer
// initially performing the identity operation.
optional FillerParameter filler = 3;
// Whether to also learn a bias (equivalent to a ScaleLayer+BiasLayer, but
// may be more efficient). Initialized with bias_filler (defaults to 0).
optional bool bias_term = 4 [default = false];
optional FillerParameter bias_filler = 5;
optional float min_value = 6;
optional float max_value = 7;
}
#ifndef CAFFE_SCALE_LAYER_HPP_
#define CAFFE_SCALE_LAYER_HPP_
#include <vector>
#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"
#include "caffe/layers/bias_layer.hpp"
namespace caffe {
/**
* @brief Computes the elementwise product of two input Blobs, with the shape of
* the latter Blob "broadcast" to match the shape of the former.
* Equivalent to tiling the latter Blob, then computing the elementwise
* product. Note: for efficiency and convenience, this layer can
* additionally perform a "broadcast" sum too when `bias_term: true`
* is set.
*
* The latter, scale input may be omitted, in which case it's learned as
* parameter of the layer (as is the bias, if it is included).
*/
template <typename Dtype>
class ScaleLayer: public Layer<Dtype> {
public:
explicit ScaleLayer(const LayerParameter& param)
: Layer<Dtype>(param) {}
virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual inline const char* type() const { return "Scale"; }
// Scale
virtual inline int MinBottomBlobs() const { return 1; }
virtual inline int MaxBottomBlobs() const { return 2; }
virtual inline int ExactNumTopBlobs() const { return 1; }
protected:
/**
* In the below shape specifications, @f$ i @f$ denotes the value of the
* `axis` field given by `this->layer_param_.scale_param().axis()`, after
* canonicalization (i.e., conversion from negative to positive index,
* if applicable).
*
* @param bottom input Blob vector (length 2)
* -# @f$ (d_0 \times ... \times
* d_i \times ... \times d_j \times ... \times d_n) @f$
* the first factor @f$ x @f$
* -# @f$ (d_i \times ... \times d_j) @f$
* the second factor @f$ y @f$
* @param top output Blob vector (length 1)
* -# @f$ (d_0 \times ... \times
* d_i \times ... \times d_j \times ... \times d_n) @f$
* the product @f$ z = x y @f$ computed after "broadcasting" y.
* Equivalent to tiling @f$ y @f$ to have the same shape as @f$ x @f$,
* then computing the elementwise product.
*/
virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top);
virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
shared_ptr<Layer<Dtype> > bias_layer_;
vector<Blob<Dtype>*> bias_bottom_vec_;
vector<bool> bias_propagate_down_;
int bias_param_id_;
Blob<Dtype> sum_multiplier_;
Blob<Dtype> sum_result_;
Blob<Dtype> temp_;
int axis_;
int outer_dim_, scale_dim_, inner_dim_;
};
} // namespace caffe
#endif // CAFFE_SCALE_LAYER_HPP_
#include <algorithm>
#include <vector>
#include "caffe/filler.hpp"
#include "caffe/layer_factory.hpp"
#include "caffe/layers/scale_layer.hpp"
#include "caffe/util/math_functions.hpp"
namespace caffe {
template <typename Dtype>
void ScaleLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top) {
const ScaleParameter& param = this->layer_param_.scale_param();
if (bottom.size() == 1 && this->blobs_.size() > 0) {
LOG(INFO) << "Skipping parameter initialization";
} else if (bottom.size() == 1) {
// scale is a learned parameter; initialize it
axis_ = bottom[0]->CanonicalAxisIndex(param.axis());
const int num_axes = param.num_axes();
CHECK_GE(num_axes, -1) << "num_axes must be non-negative, "
<< "or -1 to extend to the end of bottom[0]";
if (num_axes >= 0) {
CHECK_GE(bottom[0]->num_axes(), axis_ + num_axes)
<< "scale blob's shape extends past bottom[0]'s shape when applied "
<< "starting with bottom[0] axis = " << axis_;
}
this->blobs_.resize(1);
const vector<int>::const_iterator& shape_start =
bottom[0]->shape().begin() + axis_;
const vector<int>::const_iterator& shape_end =
(num_axes == -1) ? bottom[0]->shape().end() : (shape_start + num_axes);
vector<int> scale_shape(shape_start, shape_end);
this->blobs_[0].reset(new Blob<Dtype>(scale_shape));
FillerParameter filler_param(param.filler());
if (!param.has_filler()) {
// Default to unit (1) filler for identity operation.
filler_param.set_type("constant");
filler_param.set_value(1);
}
shared_ptr<Filler<Dtype> > filler(GetFiller<Dtype>(filler_param));
filler->Fill(this->blobs_[0].get());
}
if (param.bias_term()) {
LayerParameter layer_param(this->layer_param_);
layer_param.set_type("Bias");
BiasParameter* bias_param = layer_param.mutable_bias_param();
bias_param->set_axis(param.axis());
if (bottom.size() > 1) {
bias_param->set_num_axes(bottom[1]->num_axes());
} else {
bias_param->set_num_axes(param.num_axes());
}
bias_param->mutable_filler()->CopyFrom(param.bias_filler());
bias_layer_ = LayerRegistry<Dtype>::CreateLayer(layer_param);
bias_bottom_vec_.resize(1);
bias_bottom_vec_[0] = bottom[0];
bias_layer_->SetUp(bias_bottom_vec_, top);
if (this->blobs_.size() + bottom.size() < 3) {
// case: blobs.size == 1 && bottom.size == 1
// or blobs.size == 0 && bottom.size == 2
bias_param_id_ = this->blobs_.size();
this->blobs_.resize(bias_param_id_ + 1);
this->blobs_[bias_param_id_] = bias_layer_->blobs()[0];
} else {
// bias param already initialized
bias_param_id_ = this->blobs_.size() - 1;
bias_layer_->blobs()[0] = this->blobs_[bias_param_id_];
}
bias_propagate_down_.resize(1, false);
}
this->param_propagate_down_.resize(this->blobs_.size(), true);
}
template <typename Dtype>
void ScaleLayer<Dtype>::Reshape(const vector<Blob<Dtype>*>& bottom,
const vector<Blob<Dtype>*>& top) {
const ScaleParameter& param = this->layer_param_.scale_param();
Blob<Dtype>* scale = (bottom.size() > 1) ? bottom[1] : this->blobs_[0].get();
// Always set axis_ == 0 in special case where scale is a scalar
// (num_axes == 0). Mathematically equivalent for any choice of axis_, so the
// actual setting can be safely ignored; and computation is most efficient
// with axis_ == 0 and (therefore) outer_dim_ == 1. (Setting axis_ to
// bottom[0]->num_axes() - 1, giving inner_dim_ == 1, would be equally
// performant.)
axis_ = (scale->num_axes() == 0) ?
0 : bottom[0]->CanonicalAxisIndex(param.axis());
CHECK_GE(bottom[0]->num_axes(), axis_ + scale->num_axes())
<< "scale blob's shape extends past bottom[0]'s shape when applied "
<< "starting with bottom[0] axis = " << axis_;
for (int i = 0; i < scale->num_axes(); ++i) {
CHECK_EQ(bottom[0]->shape(axis_ + i), scale->shape(i))
<< "dimension mismatch between bottom[0]->shape(" << axis_ + i
<< ") and scale->shape(" << i << ")";
}
outer_dim_ = bottom[0]->count(0, axis_);
scale_dim_ = scale->count();
inner_dim_ = bottom[0]->count(axis_ + scale->num_axes());
if (bottom[0] == top[0]) { // in-place computation
temp_.ReshapeLike(*bottom[0]);
} else {
top[0]->ReshapeLike(*bottom[0]);
}
sum_result_.Reshape(vector<int>(1, outer_dim_ * scale_dim_));
const int sum_mult_size = std::max(outer_dim_, inner_dim_);
sum_multiplier_.Reshape(vector<int>(1, sum_mult_size));
if (sum_multiplier_.cpu_data()[sum_mult_size - 1] != Dtype(1)) {
caffe_set(sum_mult_size, Dtype(1), sum_multiplier_.mutable_cpu_data());
}
if (bias_layer_) {
bias_bottom_vec_[0] = top[0];
bias_layer_->Reshape(bias_bottom_vec_, top);
}
}
template <typename Dtype>
void ScaleLayer<Dtype>::Forward_cpu(
const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
const Dtype* bottom_data = bottom[0]->cpu_data();
if (bottom[0] == top[0]) {
// In-place computation; need to store bottom data before overwriting it.
// Note that this is only necessary for Backward; we could skip this if not
// doing Backward, but Caffe currently provides no way of knowing whether
// we'll need to do Backward at the time of the Forward call.
caffe_copy(bottom[0]->count(), bottom[0]->cpu_data(),
temp_.mutable_cpu_data());
}
Dtype* scale_data =
((bottom.size() > 1) ? bottom[1] : this->blobs_[0].get())->mutable_cpu_data();
if (this->layer_param_.scale_param().has_min_value()) {
for (int d = 0; d < scale_dim_; d++) {
scale_data[d] = std::max<Dtype>(scale_data[d], this->layer_param_.scale_param().min_value());
}
}
if (this->layer_param_.scale_param().has_max_value()) {
for (int d = 0; d < scale_dim_; d++) {
scale_data[d] = std::min<Dtype>(scale_data[d], this->layer_param_.scale_param().max_value());
}
}
Dtype* top_data = top[0]->mutable_cpu_data();
for (int n = 0; n < outer_dim_; ++n) {
for (int d = 0; d < scale_dim_; ++d) {
const Dtype factor = scale_data[d];
caffe_cpu_scale(inner_dim_, factor, bottom_data, top_data);
bottom_data += inner_dim_;
top_data += inner_dim_;
}
}
if (bias_layer_) {
bias_layer_->Forward(bias_bottom_vec_, top);
}
}
template <typename Dtype>
void ScaleLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
if (bias_layer_ &&
this->param_propagate_down_[this->param_propagate_down_.size() - 1]) {
bias_layer_->Backward(top, bias_propagate_down_, bias_bottom_vec_);
}
const bool scale_param = (bottom.size() == 1);
Blob<Dtype>* scale = scale_param ? this->blobs_[0].get() : bottom[1];
if ((!scale_param && propagate_down[1]) ||
(scale_param && this->param_propagate_down_[0])) {
const Dtype* top_diff = top[0]->cpu_diff();
const bool in_place = (bottom[0] == top[0]);
const Dtype* bottom_data = (in_place ? &temp_ : bottom[0])->cpu_data();
// Hack: store big eltwise product in bottom[0] diff, except in the special
// case where this layer itself does the eltwise product, in which case we
// can store it directly in the scale diff, and we're done.
// If we're computing in-place (and not doing eltwise computation), this
// hack doesn't work and we store the product in temp_.
const bool is_eltwise = (bottom[0]->count() == scale->count());
Dtype* product = (is_eltwise ? scale->mutable_cpu_diff() :
(in_place ? temp_.mutable_cpu_data() : bottom[0]->mutable_cpu_diff()));
caffe_mul(top[0]->count(), top_diff, bottom_data, product);
if (!is_eltwise) {
Dtype* sum_result = NULL;
if (inner_dim_ == 1) {
sum_result = product;
} else if (sum_result_.count() == 1) {
const Dtype* sum_mult = sum_multiplier_.cpu_data();
Dtype* scale_diff = scale->mutable_cpu_diff();
if (scale_param) {
Dtype result = caffe_cpu_dot(inner_dim_, product, sum_mult);
*scale_diff += result;
} else {
*scale_diff = caffe_cpu_dot(inner_dim_, product, sum_mult);
}
} else {
const Dtype* sum_mult = sum_multiplier_.cpu_data();
sum_result = (outer_dim_ == 1) ?
scale->mutable_cpu_diff() : sum_result_.mutable_cpu_data();
caffe_cpu_gemv(CblasNoTrans, sum_result_.count(), inner_dim_,
Dtype(1), product, sum_mult, Dtype(0), sum_result);
}
if (outer_dim_ != 1) {
const Dtype* sum_mult = sum_multiplier_.cpu_data();
Dtype* scale_diff = scale->mutable_cpu_diff();
if (scale_dim_ == 1) {
if (scale_param) {
Dtype result = caffe_cpu_dot(outer_dim_, sum_mult, sum_result);
*scale_diff += result;
} else {
*scale_diff = caffe_cpu_dot(outer_dim_, sum_mult, sum_result);
}
} else {
caffe_cpu_gemv(CblasTrans, outer_dim_, scale_dim_,
Dtype(1), sum_result, sum_mult, Dtype(scale_param),
scale_diff);
}
}
}
}
if (propagate_down[0]) {
const Dtype* top_diff = top[0]->cpu_diff();
const Dtype* scale_data = scale->cpu_data();
Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
for (int n = 0; n < outer_dim_; ++n) {
for (int d = 0; d < scale_dim_; ++d) {
const Dtype factor = scale_data[d];
caffe_cpu_scale(inner_dim_, factor, top_diff, bottom_diff);
bottom_diff += inner_dim_;
top_diff += inner_dim_;
}
}
}
}
#ifdef CPU_ONLY
STUB_GPU(ScaleLayer);
#endif
INSTANTIATE_CLASS(ScaleLayer);
REGISTER_LAYER_CLASS(Scale);
} // namespace caffe
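In the prototxt above the scale layer is used with `num_axes: 0`, so it reduces to one learned scalar with a lower clamp. A minimal standalone sketch of that special case follows; the `ScalarScale` struct and its method names are hypothetical, and only the clamp-then-multiply behavior follows the layer code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct ScalarScale {
  double s;          // learned scale
  double min_value;  // lower clamp applied at the start of each forward pass

  // Forward: clamp s from below, then multiply every input by it.
  std::vector<double> forward(const std::vector<double>& x) {
    s = std::max(s, min_value);
    std::vector<double> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = s * x[i];
    return y;
  }

  // Gradient w.r.t. s: the sum over all elements of dy[i] * x[i].
  double scale_gradient(const std::vector<double>& x,
                        const std::vector<double>& dy) const {
    assert(x.size() == dy.size());
    double g = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) g += dy[i] * x[i];
    return g;
  }

  // Gradient w.r.t. the input: s * dy.
  std::vector<double> input_gradient(const std::vector<double>& dy) const {
    std::vector<double> dx(dy.size());
    for (std::size_t i = 0; i < dy.size(); ++i) dx[i] = s * dy[i];
    return dx;
  }
};
```

Starting from s = 10 and min_value = 0.01, as in the prototxt, the clamp only takes effect if training drives s below 0.01, which keeps the scale positive.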