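Ceres-Solver Study Notes (5)

"The road ahead is long and far; I shall search high and low."
This article is translated from the official Ceres website.
On Derivatives in Ceres
Like all gradient based optimization algorithms, Ceres Solver depends on being able to evaluate the objective function and its derivatives at arbitrary points in its domain. Indeed, defining the objective function and its Jacobian is the principal task that the user is required to perform when solving an optimization problem using Ceres Solver. The correct and efficient computation of the Jacobian is the key to good performance.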

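Ceres Solver offers considerable flexibility in how the user can provide derivatives to the solver. She can use:

Analytic Derivatives: The user figures out the derivatives herself, by hand or using a tool like Maple or Mathematica, and implements them in a CostFunction.
Numeric Derivatives: Ceres numerically computes the derivative using finite differences.
Automatic Derivatives: Ceres automatically computes the analytic derivative using C++ templates and operator overloading.

Which of these three approaches (alone or in combination) should be used depends on the situation and the tradeoffs the user is willing to make. Unfortunately, numerical optimization textbooks rarely discuss these issues in any detail and the user is left to her own devices.
The aim of this article is to fill this gap and describe each of these three approaches in the context of Ceres Solver with sufficient detail that the user can make an informed choice.
For the impatient amongst you, here is some high level advice:

Use Automatic Derivatives.
In some cases it may be worth using Analytic Derivatives.
Avoid Numeric Derivatives. Use them as a measure of last resort, mostly to interface with external libraries.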

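Spivak Notation

Spivak notation is a function notation that makes reading and reasoning about expressions involving derivatives simple.
For a univariate function $f$, $f(a)$ denotes its value at $a$. $Df$ denotes its first derivative, and $Df(a)$ is the derivative evaluated at $a$, i.e.

$Df(a) =\left. \frac{d}{dx} f(x)\right| _{x=a}$

$D^kf$ denotes the $k^{th}$ derivative of $f$.

For a bivariate function $g(x,y)$, $D_1g$ and $D_2g$ denote the partial derivatives of $g$ with respect to the first and second variable respectively. In the classical notation this is equivalent to saying:

$D_1g = \frac{\partial}{\partial x} g(x,y)$

$D_2g = \frac{\partial}{\partial y} g(x,y)$

$Dg$ denotes the Jacobian of $g$, i.e.

$Dg = [D_1g ,D_2g ]$

More generally, for a multivariate function $g:\mathbb{R}^n\longrightarrow \mathbb{R}^m$, $Dg$ denotes the $m\times n$ Jacobian matrix, and $D_i g$ is the partial derivative of $g$ with respect to the $i^{th}$ coordinate, which is the $i^{th}$ column of $Dg$.

Finally, $D_1^2g$ and $D_1D_2g$ have the obvious meaning as higher order partial derivatives.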
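
As a small worked example of the notation (the function $g$ below is chosen purely for illustration and does not come from the Ceres documentation), take $g(x,y) = x^2y + \sin y$. Then:

$D_1g = 2xy$

$D_2g = x^2 + \cos y$

$Dg = [2xy ,\ x^2 + \cos y]$

$D_1^2g = 2y, \quad D_1D_2g = 2x$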

Analytic Derivatives

Consider the problem of fitting the following curve to data:

$y = \frac{b_1}{(1+e^{b_2-b_3x})^{1/b_4}}$
That is, given some data $\{x_i, y_i\},\ \forall i=1,... ,n$, determine the parameters $b_1, b_2, b_3,b_4$ that best fit this data.
This can be stated as the problem of finding the values of $b_1, b_2, b_3,b_4$ that minimize the following objective function:

$\begin{aligned} E(b_1, b_2, b_3, b_4) &= \sum_i f^2(b_1, b_2, b_3, b_4 ; x_i, y_i)\\ &= \sum_i \left(\frac{b_1}{(1+e^{b_2-b_3x_i})^{1/b_4}} - y_i\right)^2 \end{aligned}$

To solve this problem, we need to define a CostFunction that computes the residual $f$ for a given $x$ and $y$ and its derivatives with respect to $b_1, b_2, b_3,b_4$.
Using elementary differential calculus, we can see that:

$\begin{aligned} D_1 f(b_1, b_2, b_3, b_4; x,y) &= \frac{1}{(1+e^{b_2-b_3x})^{1/b_4}}\\ D_2 f(b_1, b_2, b_3, b_4; x,y) &= \frac{-b_1e^{b_2-b_3x}}{b_4(1+e^{b_2-b_3x})^{1/b_4 + 1}} \\ D_3 f(b_1, b_2, b_3, b_4; x,y) &= \frac{b_1xe^{b_2-b_3x}}{b_4(1+e^{b_2-b_3x})^{1/b_4 + 1}} \\ D_4 f(b_1, b_2, b_3, b_4; x,y) &= \frac{b_1 \log\left(1+e^{b_2-b_3x}\right) }{b_4^2(1+e^{b_2-b_3x})^{1/b_4}} \end{aligned}$

With these derivatives in hand, we can now implement the CostFunction as:

class Rat43Analytic : public SizedCostFunction<1,4> {
   public:
     Rat43Analytic(const double x, const double y) : x_(x), y_(y) {}
     virtual ~Rat43Analytic() {}
     virtual bool Evaluate(double const* const* parameters,
                           double* residuals,
                           double** jacobians) const {
       const double b1 = parameters[0][0];
       const double b2 = parameters[0][1];
       const double b3 = parameters[0][2];
       const double b4 = parameters[0][3];

       residuals[0] = b1 *  pow(1 + exp(b2 -  b3 * x_), -1.0 / b4) - y_;

       if (!jacobians) return true;
       double* jacobian = jacobians[0];
       if (!jacobian) return true;

       jacobian[0] = pow(1 + exp(b2 - b3 * x_), -1.0 / b4);
       jacobian[1] = -b1 * exp(b2 - b3 * x_) *
                     pow(1 + exp(b2 - b3 * x_), -1.0 / b4 - 1) / b4;
       jacobian[2] = x_ * b1 * exp(b2 - b3 * x_) *
                     pow(1 + exp(b2 - b3 * x_), -1.0 / b4 - 1) / b4;
       jacobian[3] = b1 * log(1 + exp(b2 - b3 * x_)) *
                     pow(1 + exp(b2 - b3 * x_), -1.0 / b4) / (b4 * b4);
       return true;
     }

    private:
     const double x_;
     const double y_;
 };

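This is tedious code, hard to read and with a lot of redundancy. So in practice we would cache some sub-expressions to improve its efficiency, which would give us something like: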

class Rat43AnalyticOptimized : public SizedCostFunction<1,4> {
   public:
     Rat43AnalyticOptimized(const double x, const double y) : x_(x), y_(y) {}
     virtual ~Rat43AnalyticOptimized() {}
     virtual bool Evaluate(double const* const* parameters,
                           double* residuals,
                           double** jacobians) const {
       const double b1 = parameters[0][0];
       const double b2 = parameters[0][1];
       const double b3 = parameters[0][2];
       const double b4 = parameters[0][3];

       const double t1 = exp(b2 -  b3 * x_);
       const double t2 = 1 + t1;
       const double t3 = pow(t2, -1.0 / b4);
       residuals[0] = b1 * t3 - y_;

       if (!jacobians) return true;
       double* jacobian = jacobians[0];
       if (!jacobian) return true;

       const double t4 = pow(t2, -1.0 / b4 - 1);
       jacobian[0] = t3;
       jacobian[1] = -b1 * t1 * t4 / b4;
       jacobian[2] = -x_ * jacobian[1];
       jacobian[3] = b1 * log(t2) * t3 / (b4 * b4);
       return true;
     }

   private:
     const double x_;
     const double y_;
 };

What is the difference in performance of these two implementations?

CostFunction	Time (ns)
Rat43Analytic	255
Rat43AnalyticOptimized	92

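Rat43AnalyticOptimized is 2.8 times faster than Rat43Analytic. This difference in run-time is not uncommon. To get the best performance out of analytically computed derivatives, one usually needs to optimize the code to account for common sub-expressions.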
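
Before trusting a hand-optimized Jacobian, it is worth checking that it still agrees with the straightforward one. The following is a minimal standalone sketch of such a check, using only the standard library. The free functions `Rat43JacobianNaive`, `Rat43JacobianCached` and `MaxAbsDifference` are hypothetical names introduced here for illustration; they are not part of Ceres.

```cpp
#include <algorithm>
#include <cmath>

// Naive Jacobian of the Rat43 residual: every entry recomputes exp() and pow().
void Rat43JacobianNaive(const double b[4], double x, double jac[4]) {
  jac[0] = std::pow(1 + std::exp(b[1] - b[2] * x), -1.0 / b[3]);
  jac[1] = -b[0] * std::exp(b[1] - b[2] * x) *
           std::pow(1 + std::exp(b[1] - b[2] * x), -1.0 / b[3] - 1) / b[3];
  jac[2] = x * b[0] * std::exp(b[1] - b[2] * x) *
           std::pow(1 + std::exp(b[1] - b[2] * x), -1.0 / b[3] - 1) / b[3];
  jac[3] = b[0] * std::log(1 + std::exp(b[1] - b[2] * x)) *
           std::pow(1 + std::exp(b[1] - b[2] * x), -1.0 / b[3]) / (b[3] * b[3]);
}

// Same Jacobian with the common sub-expressions t1..t4 computed only once.
void Rat43JacobianCached(const double b[4], double x, double jac[4]) {
  const double t1 = std::exp(b[1] - b[2] * x);
  const double t2 = 1 + t1;
  const double t3 = std::pow(t2, -1.0 / b[3]);
  const double t4 = std::pow(t2, -1.0 / b[3] - 1);
  jac[0] = t3;
  jac[1] = -b[0] * t1 * t4 / b[3];
  jac[2] = -x * jac[1];
  jac[3] = b[0] * std::log(t2) * t3 / (b[3] * b[3]);
}

// Largest absolute difference between the two Jacobians at (b, x).
double MaxAbsDifference(const double b[4], double x) {
  double naive[4];
  double cached[4];
  Rat43JacobianNaive(b, x, naive);
  Rat43JacobianCached(b, x, cached);
  double max_diff = 0.0;
  for (int i = 0; i < 4; ++i) {
    max_diff = std::max(max_diff, std::abs(naive[i] - cached[i]));
  }
  return max_diff;
}
```

The two versions perform the same arithmetic in a slightly different order, so they should agree up to floating point roundoff for any parameter values where the expressions are well defined.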

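When should you use analytic derivatives?

The expressions are simple, e.g. mostly linear.
A computer algebra system like Maple, Mathematica, or SymPy can be used to symbolically differentiate the objective function.
Performance is of utmost concern and there is algebraic structure in the terms that you can exploit to get better performance than automatic differentiation.
That said, getting the most performance out of analytically computed derivatives requires a non-trivial amount of work. Before going down this path, it is useful to measure the time spent evaluating the Jacobian as a fraction of the total solve time, and remember that Amdahl's Law is your friend.
There is no other way to compute the derivatives, e.g. you wish to compute the derivative of the root of a polynomial
$a_3(x,y)z^3 + a_2(x,y)z^2 + a_1(x,y)z + a_0(x,y) = 0$
with respect to $x$ and $y$. This requires the use of the Inverse Function Theorem.
You love the chain rule and actually enjoy doing all the algebra by hand.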

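Numeric derivatives
The other extreme from using analytic derivatives is to use numeric derivatives. The key observation here is that the process of differentiating a function $f(x)$ with respect to $x$ can be written as the limiting process:

$Df(x) = \lim_{h \rightarrow 0} \frac{f(x + h) - f(x)}{h}$
Forward Differences
Of course, we cannot take limits on a computer, so we do the next best thing, which is to choose a small value of $h$ and approximate the derivative as

$Df(x) \approx \frac{f(x + h) - f(x)}{h}$
The above formula is the simplest, most basic form of numeric differentiation. It is known as the forward difference formula.
So how would one go about constructing a numerically differentiated version of Rat43Analytic in Ceres Solver? This is done in two steps:
1. Define a functor that, given the parameter values, evaluates the residual for a given $(x,y)$.
2. Construct a CostFunction by using NumericDiffCostFunction to wrap an instance of Rat43CostFunctor.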

struct Rat43CostFunctor {
  Rat43CostFunctor(const double x, const double y) : x_(x), y_(y) {}

  bool operator()(const double* parameters, double* residuals) const {
    const double b1 = parameters[0];
    const double b2 = parameters[1];
    const double b3 = parameters[2];
    const double b4 = parameters[3];
    residuals[0] = b1 * pow(1.0 + exp(b2 -  b3 * x_), -1.0 / b4) - y_;
    return true;
  }

  const double x_;
  const double y_;
};

CostFunction* cost_function =
  new NumericDiffCostFunction<Rat43CostFunctor, FORWARD, 1, 4>(
    new Rat43CostFunctor(x, y));

This is about the minimum amount of work one can expect to do to define a CostFunction. The only thing the user needs to do is to make sure that the evaluation of the residual is implemented correctly and efficiently.
Before going further, it is instructive to get an estimate of the error in the forward difference formula. We do this by considering the Taylor expansion of $f$ near $x$.

$\begin{aligned} f(x+h) &= f(x) + h Df(x) + \frac{h^2}{2!} D^2f(x) + \frac{h^3}{3!}D^3f(x) + \cdots \\ Df(x) &= \frac{f(x + h) - f(x)}{h} - \left [\frac{h}{2!}D^2f(x) + \frac{h^2}{3!}D^3f(x) + \cdots \right]\\ Df(x) &= \frac{f(x + h) - f(x)}{h} + O(h) \end{aligned}$
The error in the forward difference formula is $O(h)$.
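
The $O(h)$ behaviour is easy to observe numerically: halving $h$ should roughly halve the error. The following sketch checks this for $\sin x$, whose derivative $\cos x$ is known exactly. The function names are hypothetical and the test function is chosen only for illustration.

```cpp
#include <cmath>

// Forward difference approximation of Df(x) with absolute step size h.
template <typename F>
double ForwardDifference(F f, double x, double h) {
  return (f(x + h) - f(x)) / h;
}

// Absolute error of the forward difference estimate of D sin(x) = cos(x).
double ForwardErrorForSin(double x, double h) {
  const double estimate =
      ForwardDifference([](double t) { return std::sin(t); }, x, h);
  return std::abs(estimate - std::cos(x));
}
```

With these step sizes the truncation error dominates roundoff by many orders of magnitude, so the ratio `ForwardErrorForSin(1.0, 1e-3) / ForwardErrorForSin(1.0, 5e-4)` comes out very close to 2.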

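Implementation Details
NumericDiffCostFunction implements a generic algorithm to numerically differentiate a given functor. While the actual implementation of NumericDiffCostFunction is complicated, the net result is a CostFunction that roughly looks something like the following: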

class Rat43NumericDiffForward : public SizedCostFunction<1,4> {
   public:
     // Takes ownership of functor.
     Rat43NumericDiffForward(const Rat43CostFunctor* functor) : functor_(functor) {}
     virtual ~Rat43NumericDiffForward() {}
     virtual bool Evaluate(double const* const* parameters,
                           double* residuals,
                           double** jacobians) const {
       (*functor_)(parameters[0], residuals);
       if (!jacobians) return true;
       double* jacobian = jacobians[0];
       if (!jacobian) return true;

       const double f = residuals[0];
       double parameters_plus_h[4];
       for (int i = 0; i < 4; ++i) {
         // Perturb only the i-th entry of the single parameter block.
         std::copy(parameters[0], parameters[0] + 4, parameters_plus_h);
         const double kRelativeStepSize = 1e-6;
         const double h = std::abs(parameters[0][i]) * kRelativeStepSize;
         parameters_plus_h[i] += h;
         double f_plus;
         (*functor_)(parameters_plus_h, &f_plus);
         jacobian[i] = (f_plus - f) / h;
       }
       return true;
     }

   private:
     std::unique_ptr<const Rat43CostFunctor> functor_;
 };

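Note the choice of step size $h$ in the above code: instead of an absolute step size which is the same for all parameters, we use a relative step size of $\text{kRelativeStepSize} = 10^{-6}$. This gives better derivative estimates than an absolute step size. This choice of step size only works for parameter values that are not close to zero. So the actual implementation of NumericDiffCostFunction uses a more complex step size selection logic, where close to zero, it switches to a fixed step size.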
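
One plausible form of such a step size rule is sketched below. This is an assumption made for illustration, not the logic Ceres actually uses: the function name `ChooseStepSize` and both default values are hypothetical.

```cpp
#include <algorithm>
#include <cmath>

// Picks the finite difference step for a parameter value x: a step relative
// to |x|, but never smaller than min_step, so that x == 0 still gets a
// usable, non-zero step. Both defaults are illustrative assumptions.
double ChooseStepSize(double x,
                      double relative_step = 1e-6,
                      double min_step = 1e-8) {
  return std::max(std::abs(x) * relative_step, min_step);
}
```

With these defaults, a parameter value of 100 gets a step of about 1e-4, while a parameter value of exactly zero falls back to the fixed step 1e-8 instead of a step of zero, which would make the difference quotient meaningless.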

Central Differences
$O(h)$ error in the forward difference formula is okay but not great. A better method is to use the central difference formula:

$Df(x) \approx \frac{f(x + h) - f(x - h)}{2h}$
Notice that if the value of $f(x)$ is known, the forward difference formula only requires one extra evaluation, but the central difference formula requires two evaluations, making it twice as expensive. So is the extra evaluation worth it?
To answer this question, we again compute the error of approximation in the central difference formula:

$\begin{aligned} f(x + h) &= f(x) + h Df(x) + \frac{h^2}{2!} D^2f(x) + \frac{h^3}{3!} D^3f(x) + \frac{h^4}{4!} D^4f(x) + \cdots\\ f(x - h) &= f(x) - h Df(x) + \frac{h^2}{2!} D^2f(x) - \frac{h^3}{3!} D^3f(x) + \frac{h^4}{4!} D^4f(x) + \cdots\\ Df(x) & = \frac{f(x + h) - f(x - h)}{2h} - \frac{h^2}{3!} D^3f(x) - \frac{h^4}{5!} D^5f(x) - \cdots \\ Df(x) & = \frac{f(x + h) - f(x - h)}{2h} + O(h^2) \end{aligned}$
The error of the central difference formula is $O(h^2)$, i.e., the error goes down quadratically whereas the error in the forward difference formula only goes down linearly.
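
The quadratic convergence can be checked in the same way as the linear one: halving $h$ should reduce the error by a factor of about four. A minimal sketch, again using $\sin x$ as an illustrative test function and hypothetical function names:

```cpp
#include <cmath>

// Central difference approximation of Df(x) with absolute step size h.
template <typename F>
double CentralDifference(F f, double x, double h) {
  return (f(x + h) - f(x - h)) / (2.0 * h);
}

// Absolute error of the central difference estimate of D sin(x) = cos(x).
double CentralErrorForSin(double x, double h) {
  const double estimate =
      CentralDifference([](double t) { return std::sin(t); }, x, h);
  return std::abs(estimate - std::cos(x));
}
```

Here `CentralErrorForSin(1.0, 1e-2) / CentralErrorForSin(1.0, 5e-3)` comes out close to 4, as the leading $h^2$ error term predicts.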
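Using central differences instead of forward differences in Ceres Solver is a simple matter of changing a template argument to NumericDiffCostFunction as follows: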

CostFunction* cost_function =
  new NumericDiffCostFunction<Rat43CostFunctor, CENTRAL, 1, 4>(
    new Rat43CostFunctor(x, y));

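But what do these differences in the error mean in practice? To see this, consider the problem of evaluating the derivative of the univariate function

$f(x) = \frac{e^x}{\sin x - x^2},$
at $x = 1.0$.
It is easy to determine that $Df(1.0) = 140.73773557129658$. Using this value as reference, we can now compute the relative error in the forward and central difference formulae as a function of the absolute step size and plot them.
[Figure: relative error of the forward and central difference formulae as a function of the absolute step size h]

Reading the graph from right to left, a number of things stand out in the above graph:

Both formulae have two distinct regions in their graphs. At first, starting from a large value of $h$, the error goes down as the effect of truncating the Taylor series dominates, but as the value of $h$ continues to decrease, the error starts increasing again as roundoff error starts to dominate the computation. So we cannot just keep on reducing the value of $h$ to get better estimates of $Df$. The fact that we are using finite precision arithmetic becomes a limiting factor.
The forward difference formula is not a great method for evaluating derivatives. The central difference formula converges much more quickly to a more accurate estimate of the derivative with decreasing step size. So unless the evaluation of $f(x)$ is so expensive that you absolutely cannot afford the extra evaluation required by central differences, do not use the forward difference formula.
Neither formula works well for a poorly chosen value of $h$.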
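
The two regions described above can be reproduced with a few lines of standalone code. The sketch below sweeps $h$ over powers of ten for the example function and prints the relative error of both formulae; the function names are hypothetical, and the reference value is the one quoted in the text.

```cpp
#include <cmath>
#include <cstdio>

// The example function f(x) = e^x / (sin x - x^2).
double F(double x) { return std::exp(x) / (std::sin(x) - x * x); }

// Df(1.0), as quoted in the text.
const double kReference = 140.73773557129658;

double ForwardRelativeError(double h) {
  const double estimate = (F(1.0 + h) - F(1.0)) / h;
  return std::abs(estimate - kReference) / kReference;
}

double CentralRelativeError(double h) {
  const double estimate = (F(1.0 + h) - F(1.0 - h)) / (2.0 * h);
  return std::abs(estimate - kReference) / kReference;
}

// Prints one line per step size h = 1e-1, 1e-2, ..., 1e-15.
void PrintErrorSweep() {
  for (int k = 1; k <= 15; ++k) {
    const double h = std::pow(10.0, -k);
    std::printf("h = 1e-%02d  forward = %.3e  central = %.3e\n",
                k, ForwardRelativeError(h), CentralRelativeError(h));
  }
}
```

Calling `PrintErrorSweep()` shows the error of each formula first shrinking as $h$ decreases and then growing again once roundoff takes over.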

Ridders' Method
So, can we get better estimates of $Df$ without requiring such small values of $h$ that we start hitting floating point roundoff errors?
One possible approach is to find a method whose error goes down faster than $O(h^2)$. This can be done by applying Richardson Extrapolation to the problem of differentiation. This is also known as Ridders' Method.
Let us recall the error in the central difference formula.

$\begin{aligned} Df(x) & = \frac{f(x + h) - f(x - h)}{2h} - \frac{h^2}{3!} D^3f(x) - \frac{h^4}{5!} D^5f(x) - \cdots\\ & = \frac{f(x + h) - f(x - h)}{2h} + K_2 h^2 + K_4 h^4 + \cdots \end{aligned}$

The key thing to note here is that the terms $K_2, K_4, \ldots$ are independent of $h$ and only depend on $x$.
Let us now define:

$A(1, m) = \frac{f(x + h/2^{m-1}) - f(x - h/2^{m-1})}{2h/2^{m-1}}.$
Then observe that

$Df(x) = A(1,1) + K_2 h^2 + K_4 h^4 + \cdots$

$Df(x) = A(1, 2) + K_2 (h/2)^2 + K_4 (h/2)^4 + \cdots$
Here we have halved the step size to obtain a second central difference estimate of $Df(x)$. Combining these two estimates, we get:

$Df(x) = \frac{4 A(1, 2) - A(1,1)}{4 - 1} + O(h^4)$
which is an approximation of $Df(x)$ with truncation error that goes down as $O(h^4)$. But we do not have to stop here. We can iterate this process to obtain even more accurate estimates as follows:

$A(n, m) = \begin{cases} \frac{\displaystyle f(x + h/2^{m-1}) - f(x - h/2^{m-1})}{\displaystyle 2h/2^{m-1}} & n = 1 \\ \frac{\displaystyle 4^{n-1} A(n - 1, m + 1) - A(n - 1, m)}{\displaystyle 4^{n-1} - 1} & n > 1 \end{cases}$

It is straightforward to show that the approximation error in $A(n, 1)$ is $O(h^{2n})$. To see how the above formula can be implemented in practice to compute $A(n,1)$, it is helpful to structure the computation as the following tableau:

$\begin{array}{ccccc} A(1,1) & A(1, 2) & A(1, 3) & A(1, 4) & \cdots\\ & A(2, 1) & A(2, 2) & A(2, 3) & \cdots\\ & & A(3, 1) & A(3, 2) & \cdots\\ & & & A(4, 1) & \cdots \\ & & & & \ddots \end{array}$
So, to compute $A(n, 1)$ for increasing values of $n$ we move from left to right, computing one column at a time. Assuming that the primary cost here is the evaluation of the function $f(x)$, the cost of computing a new column of the above tableau is two function evaluations, since evaluating $A(1, n)$ requires evaluating the central difference formula for a step size of $2^{1-n}h$.
Applying this method to $f(x) = \frac{e^x}{\sin x - x^2}$ starting with a fairly large step size $h = 0.01$, we get:

$\begin{array}{rrrrr} 141.678097131 &140.971663667 &140.796145400 &140.752333523 &140.741384778\\ &140.736185846 &140.737639311 &140.737729564 &140.737735196\\ & &140.737736209 &140.737735581 &140.737735571\\ & & &140.737735571 &140.737735571\\ & & & &140.737735571\\ \end{array}$
Compared to the correct value $Df(1.0) = 140.73773557129658$, $A(5, 1)$ has a relative error of $10^{-13}$. For comparison, the relative error for the central difference formula with the same step size ($0.01/2^4 = 0.000625$) is $10^{-5}$.
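
The tableau can be built column by column with very little code. The following is a minimal, non-adaptive sketch of the recurrence for $A(n, m)$ with a fixed number of columns; it is not the implementation used inside Ceres, and the name `RiddersEstimate` is hypothetical.

```cpp
#include <cmath>
#include <vector>

// The example function f(x) = e^x / (sin x - x^2).
double F(double x) { return std::exp(x) / (std::sin(x) - x * x); }

// Returns A(n, 1) for n >= 1. Column col of the tableau holds, from top to
// bottom, A(1, col), A(2, col - 1), ..., A(col, 1).
template <typename Func>
double RiddersEstimate(Func f, double x, double h, int n) {
  std::vector<double> previous;
  std::vector<double> current;
  for (int col = 1; col <= n; ++col) {
    const double step = h / std::pow(2.0, col - 1);
    current.assign(col, 0.0);
    // Top entry A(1, col): central difference with step h / 2^(col - 1).
    current[0] = (f(x + step) - f(x - step)) / (2.0 * step);
    // Remaining entries combine the entry above (same column) with the
    // entry up and to the left (previous column).
    double factor = 4.0;
    for (int row = 1; row < col; ++row) {
      current[row] =
          (factor * current[row - 1] - previous[row - 1]) / (factor - 1.0);
      factor *= 4.0;
    }
    previous = current;
  }
  return previous.back();
}
```

Calling `RiddersEstimate(F, 1.0, 0.01, 5)` evaluates $f$ ten times and returns the bottom entry of the fifth column of the tableau.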

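The above tableau is the basis of Ridders' method for numeric differentiation. The full implementation is an adaptive scheme that tracks its own estimation error and stops automatically when the desired precision is reached. Of course it is more expensive than the forward and central difference formulae, but is also significantly more robust and accurate.
Using Ridders' method instead of forward or central differences in Ceres is again a simple matter of changing a template argument to NumericDiffCostFunction as follows: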

CostFunction* cost_function =
  new NumericDiffCostFunction<Rat43CostFunctor, RIDDERS, 1, 4>(
    new Rat43CostFunctor(x, y));

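The following graph shows the relative error of the three methods as a function of the absolute step size. For Ridders' method we assume that the step size for evaluating $A(n,1)$ is $2^{1-n}h$.
[Figure: relative error of forward differences, central differences and Ridders' method as a function of the absolute step size]

Using the 10 function evaluations that are needed to compute $A(5,1)$, we are able to approximate $Df(1.0)$ about 1000 times better than the best central differences estimate. To put these numbers in perspective, machine epsilon for double precision arithmetic is $\approx 2.22 \times 10^{-16}$.

Going back to Rat43, let us also look at the runtime cost of the various methods for computing numeric derivatives.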

CostFunction	Time (ns)
Rat43Analytic	255
Rat43AnalyticOptimized	92
Rat43NumericDiffForward	262
Rat43NumericDiffCentral	517
Rat43NumericDiffRidders	3760

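As expected, central differences is about twice as expensive as forward differences, and the remarkable accuracy improvements of Ridders' method cost an order of magnitude more runtime.

Recommendations
Numeric differentiation should be used when you cannot compute the derivatives either analytically or using automatic differentiation. This is usually the case when you are calling an external library or function whose analytic form you do not know, or even if you do, you are not in a position to rewrite it in the manner required to use Automatic Derivatives.

When using numeric differentiation, use at least central differences, and if execution time is not a concern or the objective function is such that determining a good static relative step size is hard, Ridders' method is recommended.

Automatic Derivatives

We will now consider automatic differentiation. It is a technique that can compute exact derivatives, fast, while requiring about the same effort from the user as is needed to use numeric differentiation.
Don't believe me? The following code fragment implements an automatically differentiated CostFunction for Rat43.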

struct Rat43CostFunctor {
  Rat43CostFunctor(const double x, const double y) : x_(x), y_(y) {}

  template <typename T>
  bool operator()(const T* parameters, T* residuals) const {
    const T b1 = parameters[0];
    const T b2 = parameters[1];
    const T b3 = parameters[2];
    const T b4 = parameters[3];
    residuals[0] = b1 * pow(1.0 + exp(b2 -  b3 * x_), -1.0 / b4) - y_;
    return true;
  }

  private:
    const double x_;
    const double y_;
};

CostFunction* cost_function =
      new AutoDiffCostFunction<Rat43CostFunctor, 1, 4>(
        new Rat43CostFunctor(x, y));

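Notice that compared to numeric differentiation, the only difference when defining the functor for use with automatic differentiation is the signature of operator().
In the case of numeric differentiation it was

bool operator()(const double* parameters, double* residuals) const;

and for automatic differentiation it is a templated function of the form

template <typename T> bool operator()(const T* parameters, T* residuals) const;

So what does this small change buy us? The following table compares the time it takes to evaluate the residual and the Jacobian for Rat43 using various methods.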

CostFunction	Time (ns)
Rat43Analytic	255
Rat43AnalyticOptimized	92
Rat43NumericDiffForward	262
Rat43NumericDiffCentral	517
Rat43NumericDiffRidders	3760
Rat43AutomaticDiff	129

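We can get exact derivatives using automatic differentiation (Rat43AutomaticDiff) with about the same effort that is required to write the code for numeric differentiation, but only 40% slower than hand optimized analytical derivatives.
So how does it work? For this we will have to learn about Dual Numbers and Jets.

Dual Numbers & Jets

Note
Reading this and the next section on implementing Jets is not necessary to use automatic differentiation in Ceres Solver. But knowing the basics of how Jets work is useful when debugging and reasoning about the performance of automatic differentiation.

Dual numbers are an extension of the real numbers analogous to complex numbers: whereas complex numbers augment the reals by introducing an imaginary unit $i$ such that $i^2 = -1$, dual numbers introduce an infinitesimal unit $\epsilon$ such that $\epsilon^2 = 0$. A dual number $a + v\epsilon$ has two components, the real component $a$ and the infinitesimal component $v$.

Surprisingly, this simple change leads to a convenient method for computing exact derivatives without needing to manipulate complicated symbolic expressions.
For example, consider the function

$f(x) = x^2 ,$
Then,

$\begin{aligned} f(10 + \epsilon) &= (10 + \epsilon)^2\\ &= 100 + 20 \epsilon + \epsilon^2\\ &= 100 + 20 \epsilon \end{aligned}$
Observe that the coefficient of $\epsilon$ is $Df(10) = 20$. Indeed this generalizes to functions which are not polynomial. Consider an arbitrary differentiable function $f(x)$. Then we can evaluate $f(x + \epsilon)$ by considering the Taylor expansion of $f$ near $x$, which gives us the infinite series

$\begin{aligned} f(x + \epsilon) &= f(x) + Df(x) \epsilon + D^2f(x) \frac{\epsilon^2}{2} + D^3f(x) \frac{\epsilon^3}{6} + \cdots\\ f(x + \epsilon) &= f(x) + Df(x) \epsilon \end{aligned}$

Here we are using the fact that $\epsilon^2 = 0$.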
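
The $f(10 + \epsilon)$ computation above can be reproduced with a tiny dual number type. This is a minimal sketch for illustration only; the `Dual` struct and the `Square` helper are hypothetical and much simpler than the Jet type used by Ceres.

```cpp
#include <cmath>

// A dual number a + v * epsilon, with epsilon^2 = 0.
struct Dual {
  double a;  // real component
  double v;  // infinitesimal component
};

Dual operator+(Dual f, Dual g) { return {f.a + g.a, f.v + g.v}; }

// (f.a + f.v e)(g.a + g.v e) = f.a g.a + (f.a g.v + f.v g.a) e, since e^2 = 0.
Dual operator*(Dual f, Dual g) { return {f.a * g.a, f.a * g.v + f.v * g.a}; }

// A non-polynomial function: sin(a + v e) = sin(a) + cos(a) v e.
Dual sin(Dual f) { return {std::sin(f.a), std::cos(f.a) * f.v}; }

// f(x) = x^2, written once and usable with double as well as Dual.
template <typename T>
T Square(T x) {
  return x * x;
}
```

Evaluating `Square(Dual{10.0, 1.0})` yields a real component of 100 and an infinitesimal component of 20, which is exactly $f(10)$ and $Df(10)$.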

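A Jet is an $n$-dimensional dual number, where we augment the real numbers with $n$ infinitesimal units $\epsilon_i,\ i=1,...,n$ with the property that $\forall i, j\ :\epsilon_i\epsilon_j = 0$. Then a Jet consists of a real part $a$ and an $n$-dimensional infinitesimal part $\mathbf{v}$, i.e.,

$x = a + \sum_j v_{j} \epsilon_j$
The summation notation gets tedious, so we will also just write

$x = a + \mathbf{v}.$

where the $\epsilon_i$ are implicit. Then, using the same Taylor series expansion used above, we can see that:

$f(a + \mathbf{v}) = f(a) + Df(a) \mathbf{v}.$
Similarly for a multivariate function $f:\mathbb{R}^{n}\rightarrow \mathbb{R}^m$, evaluated on $x_i = a_i + \mathbf{v}_i,\ \forall i = 1,...,n$:

$f(x_1,..., x_n) = f(a_1, ..., a_n) + \sum_i D_i f(a_1, ..., a_n) \mathbf{v}_i$
So if each $\mathbf{v}_i = e_i$ were the $i^{\text{th}}$ standard basis vector, then the above expression would simplify to

$f(x_1,..., x_n) = f(a_1, ..., a_n) + \sum_i D_i f(a_1, ..., a_n) \epsilon_i$
and we can extract the coordinates of the Jacobian by inspecting the coefficients of $\epsilon_i$.

Implementing Jets

In order for the above to work in practice, we will need the ability to evaluate an arbitrary function $f$ not just on real numbers but also on dual numbers, but one does not usually evaluate functions by evaluating their Taylor expansions.

This is where C++ templates and operator overloading come into play. The following code fragment has a simple implementation of a Jet and some operators/functions that operate on them.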

template<int N> struct Jet {
  // The operators below build Jets from a real part and an
  // infinitesimal part, so the struct needs matching constructors.
  Jet() : a(0.0) { v.setZero(); }
  Jet(double a_in, const Eigen::Matrix<double, 1, N>& v_in)
      : a(a_in), v(v_in) {}

  double a;
  Eigen::Matrix<double, 1, N> v;
};

template<int N> Jet<N> operator+(const Jet<N>& f, const Jet<N>& g) {
  return Jet<N>(f.a + g.a, f.v + g.v);
}

template<int N> Jet<N> operator-(const Jet<N>& f, const Jet<N>& g) {
  return Jet<N>(f.a - g.a, f.v - g.v);
}

template<int N> Jet<N> operator*(const Jet<N>& f, const Jet<N>& g) {
  return Jet<N>(f.a * g.a, f.a * g.v + f.v * g.a);
}

template<int N> Jet<N> operator/(const Jet<N>& f, const Jet<N>& g) {
  return Jet<N>(f.a / g.a, f.v / g.a - f.a * g.v / (g.a * g.a));
}

template <int N> Jet<N> exp(const Jet<N>& f) {
  return Jet<N>(exp(f.a), exp(f.a) * f.v);
}

// This is a simple implementation for illustration purposes, the
// actual implementation of pow requires careful handling of a number
// of corner cases.
template <int N>  Jet<N> pow(const Jet<N>& f, const Jet<N>& g) {
  return Jet<N>(pow(f.a, g.a),
                g.a * pow(f.a, g.a - 1.0) * f.v +
                pow(f.a, g.a) * log(f.a) * g.v);
}

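With these overloaded functions in hand, we can now call Rat43CostFunctor with an array of Jets instead of doubles. Putting that together with appropriately initialized Jets allows us to compute the Jacobian as follows: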

class Rat43Automatic : public ceres::SizedCostFunction<1,4> {
 public:
  Rat43Automatic(const Rat43CostFunctor* functor) : functor_(functor) {}
  virtual ~Rat43Automatic() {}
  virtual bool Evaluate(double const* const* parameters,
                        double* residuals,
                        double** jacobians) const {
    // Just evaluate the residuals if Jacobians are not required.
    if (!jacobians) return (*functor_)(parameters[0], residuals);

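    // Initialize the Jets. ceres::Jet takes the scalar type and the
    // number of infinitesimal components as template arguments.
    ceres::Jet<double, 4> jets[4];
    for (int i = 0; i < 4; ++i) {
      jets[i].a = parameters[0][i];
      jets[i].v.setZero();
      jets[i].v[i] = 1.0;
    }

    ceres::Jet<double, 4> result;
    (*functor_)(jets, &result);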

    // Copy the values out of the Jet.
    residuals[0] = result.a;
    for (int i = 0; i < 4; ++i) {
      jacobians[0][i] = result.v[i];
    }
    return true;
  }

 private:
  std::unique_ptr<const Rat43CostFunctor> functor_;
};

Indeed, this is essentially how AutoDiffCostFunction works.
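
To experiment with the idea without Ceres or Eigen, the whole pipeline can be sketched with the standard library alone. The `Jet` type below is a hypothetical, simplified stand-in that stores its infinitesimal part in a `std::array`; `Rat43Residual` and `Rat43WithJets` are likewise names introduced only for this illustration.

```cpp
#include <array>
#include <cmath>

// A minimal Jet: real part a plus N infinitesimal components v.
template <int N>
struct Jet {
  double a;
  std::array<double, N> v;

  // Implicit conversion from a constant: all infinitesimal parts are zero.
  Jet(double value = 0.0) : a(value), v{} {}

  friend Jet operator+(const Jet& f, const Jet& g) {
    Jet r(f.a + g.a);
    for (int i = 0; i < N; ++i) r.v[i] = f.v[i] + g.v[i];
    return r;
  }
  friend Jet operator-(const Jet& f, const Jet& g) {
    Jet r(f.a - g.a);
    for (int i = 0; i < N; ++i) r.v[i] = f.v[i] - g.v[i];
    return r;
  }
  friend Jet operator*(const Jet& f, const Jet& g) {
    Jet r(f.a * g.a);
    for (int i = 0; i < N; ++i) r.v[i] = f.a * g.v[i] + f.v[i] * g.a;
    return r;
  }
  friend Jet operator/(const Jet& f, const Jet& g) {
    Jet r(f.a / g.a);
    for (int i = 0; i < N; ++i) {
      r.v[i] = f.v[i] / g.a - f.a * g.v[i] / (g.a * g.a);
    }
    return r;
  }
  friend Jet exp(const Jet& f) {
    const double e = std::exp(f.a);
    Jet r(e);
    for (int i = 0; i < N; ++i) r.v[i] = e * f.v[i];
    return r;
  }
  // Simple pow, valid for f.a > 0 only; corner cases are ignored.
  friend Jet pow(const Jet& f, const Jet& g) {
    const double value = std::pow(f.a, g.a);
    Jet r(value);
    for (int i = 0; i < N; ++i) {
      r.v[i] = g.a * std::pow(f.a, g.a - 1.0) * f.v[i] +
               value * std::log(f.a) * g.v[i];
    }
    return r;
  }
};

// The Rat43 residual, templated on the scalar type like the functor above.
template <typename T>
T Rat43Residual(const T* b, double x, double y) {
  return b[0] * pow(1.0 + exp(b[1] - b[2] * x), -1.0 / b[3]) - y;
}

// Evaluates the residual and its four partial derivatives at b using Jets.
void Rat43WithJets(const double b[4], double x, double y,
                   double* residual, double jacobian[4]) {
  Jet<4> jets[4];
  for (int i = 0; i < 4; ++i) {
    jets[i] = Jet<4>(b[i]);
    jets[i].v[i] = 1.0;  // i-th standard basis vector
  }
  const Jet<4> result = Rat43Residual(jets, x, y);
  *residual = result.a;
  for (int i = 0; i < 4; ++i) jacobian[i] = result.v[i];
}
```

The derivatives returned by `Rat43WithJets` should match the analytic expressions $D_1f$ to $D_4f$ given earlier up to floating point roundoff, which makes this a convenient cross-check for hand-written Jacobians.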

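Pitfalls
Automatic differentiation frees the user from the burden of computing and reasoning about the symbolic expressions for the Jacobians, but this freedom comes at a cost. For example consider the following simple functor: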

struct Functor {
  template <typename T> bool operator()(const T* x, T* residual) const {
    residual[0] = 1.0 - sqrt(x[0] * x[0] + x[1] * x[1]);
    return true;
  }
};

Looking at the code for the residual computation, one does not foresee any problems. However, if we look at the analytical expressions for the Jacobian:

$\begin{aligned} y &= 1 - \sqrt{x_0^2 + x_1^2}\\ D_1y &= -\frac{x_0}{\sqrt{x_0^2 + x_1^2}},\ D_2y = -\frac{x_1}{\sqrt{x_0^2 + x_1^2}}\end{aligned}$

we find that it is an indeterminate form at $x_0 = 0, x_1 = 0$.

There is no single solution to this problem. In some cases one needs to reason explicitly about the points where indeterminacy may occur and use alternate expressions using L'Hopital's rule, in other cases, one may need to regularize the expressions to eliminate these points.
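
The indeterminate form shows up as a NaN when the derivative is propagated mechanically. The sketch below demonstrates this with a tiny dual number type and shows one possible regularization, adding a small constant under the square root. The `Dual` type, the function names and the value of `eps` are all assumptions made for illustration; which remedy is appropriate depends on the problem.

```cpp
#include <cmath>

// A dual number a + v * epsilon, with epsilon^2 = 0.
struct Dual {
  double a;
  double v;
};

Dual operator+(Dual f, Dual g) { return {f.a + g.a, f.v + g.v}; }
Dual operator*(Dual f, Dual g) { return {f.a * g.a, f.a * g.v + f.v * g.a}; }

// Textbook rule D sqrt(u) = Du / (2 sqrt(u)). At u = 0 with Du = 0 this
// evaluates 0 / 0, which is NaN.
Dual Sqrt(Dual f) {
  const double s = std::sqrt(f.a);
  return {s, f.v / (2.0 * s)};
}

// Regularized variant: differentiates sqrt(u + eps) instead, which is
// finite at u = 0 at the price of a slightly perturbed value.
Dual RegularizedSqrt(Dual f, double eps = 1e-12) {
  const double s = std::sqrt(f.a + eps);
  return {s, f.v / (2.0 * s)};
}

// Partial derivative of 1 - sqrt(x0^2 + x1^2) with respect to x0.
double DerivativeWrtX0(double x0, double x1, bool regularize) {
  const Dual d0{x0, 1.0};
  const Dual d1{x1, 0.0};
  const Dual r2 = d0 * d0 + d1 * d1;
  const Dual norm = regularize ? RegularizedSqrt(r2) : Sqrt(r2);
  return -norm.v;
}
```

Away from the origin both variants agree closely; at the origin only the regularized one returns a finite number.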

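Interfacing with Automatic Differentiation

Automatic differentiation is straightforward to use in cases where an explicit expression for the cost function is available. But this is not always possible. Often one has to interface with external routines or data. In this section we will consider a number of different ways of doing so.

To do this, we will consider the problem of finding parameters $\theta$ and $t$ that solve an optimization problem of the form:

$\begin{aligned}\min & \quad \sum_i \left \|y_i - f\left (\|q_{i}\|^2\right) q_i \right \|^2\\ \text{such that} & \quad q_i = R(\theta) x_i + t\end{aligned}$

Here, $R$ is a two dimensional rotation matrix parameterized using the angle $\theta$ and $t$ is a two dimensional vector. $f$ is an external distortion function.
We begin by considering the case where we have a templated function TemplatedComputeDistortion that can compute the function $f$. Then the implementation of the corresponding residual functor is straightforward and will look as follows: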

template <typename T> T TemplatedComputeDistortion(const T r2) {
  const double k1 = 0.0082;
  const double k2 = 0.000023;
  return 1.0 + k1 * r2 + k2 * r2 * r2;
}

struct Affine2DWithDistortion {
  Affine2DWithDistortion(const double x_in[2], const double y_in[2]) {
    x[0] = x_in[0];
    x[1] = x_in[1];
    y[0] = y_in[0];
    y[1] = y_in[1];
  }

  template <typename T>
  bool operator()(const T* theta,
                  const T* t,
                  T* residuals) const {
    const T q_0 =  cos(theta[0]) * x[0] - sin(theta[0]) * x[1] + t[0];
    const T q_1 =  sin(theta[0]) * x[0] + cos(theta[0]) * x[1] + t[1];
    const T f = TemplatedComputeDistortion(q_0 * q_0 + q_1 * q_1);
    residuals[0] = y[0] - f * q_0;
    residuals[1] = y[1] - f * q_1;
    return true;
  }

  double x[2];
  double y[2];
};

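So far so good, but let us now consider three ways of defining $f$ which are not directly amenable to being used with automatic differentiation:

A non-templated function that evaluates its value.
A function that evaluates its value and derivative.
A function that is defined as a table of values to be interpolated.

We will consider them in turn below.

A function that returns its value
Suppose we were given a function ComputeDistortionValue with the following signature

double ComputeDistortionValue(double r2);

that computes the value of $f$. The actual implementation of the function does not matter. Interfacing this function with Affine2DWithDistortion is a three step process:

Wrap ComputeDistortionValue into a functor ComputeDistortionValueFunctor.
Numerically differentiate ComputeDistortionValueFunctor using NumericDiffCostFunction to create a CostFunction.
Wrap the resulting CostFunction object using CostFunctionToFunctor. The resulting object is a functor with a templated operator() method, which pipes the Jacobian computed by NumericDiffCostFunction into the appropriate Jet objects.

The implementation of the above three steps looks as follows: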

struct ComputeDistortionValueFunctor {
  bool operator()(const double* r2, double* value) const {
    *value = ComputeDistortionValue(r2[0]);
    return true;
  }
};

struct Affine2DWithDistortion {
  Affine2DWithDistortion(const double x_in[2], const double y_in[2]) {
    x[0] = x_in[0];
    x[1] = x_in[1];
    y[0] = y_in[0];
    y[1] = y_in[1];

    compute_distortion.reset(new ceres::CostFunctionToFunctor<1, 1>(
         new ceres::NumericDiffCostFunction<ComputeDistortionValueFunctor,
                                            ceres::CENTRAL,
                                            1,
                                            1>(
            new ComputeDistortionValueFunctor)));
  }

  template <typename T>
  bool operator()(const T* theta, const T* t, T* residuals) const {
    const T q_0 = cos(theta[0]) * x[0] - sin(theta[0]) * x[1] + t[0];
    const T q_1 = sin(theta[0]) * x[0] + cos(theta[0]) * x[1] + t[1];
    const T r2 = q_0 * q_0 + q_1 * q_1;
    T f;
    (*compute_distortion)(&r2, &f);
    residuals[0] = y[0] - f * q_0;
    residuals[1] = y[1] - f * q_1;
    return true;
  }

  double x[2];
  double y[2];
  std::unique_ptr<ceres::CostFunctionToFunctor<1, 1> > compute_distortion;
};
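
Conceptually, "piping the Jacobian into a Jet" is just the chain rule: the external function supplies the value and a derivative, and the derivative is multiplied into the infinitesimal part of the input. The following standalone sketch illustrates this with a scalar dual number and a central difference. It is a simplified illustration of the idea, not the CostFunctionToFunctor implementation; `Dual`, `DistortionOnDual` and the step size `h` are assumptions, and the body of `ComputeDistortionValue` is a stand-in that reuses the coefficients from TemplatedComputeDistortion.

```cpp
#include <cmath>

// A dual number a + v * epsilon, with epsilon^2 = 0.
struct Dual {
  double a;
  double v;
};

// Stand-in for an external, non-templated routine.
double ComputeDistortionValue(double r2) {
  const double k1 = 0.0082;
  const double k2 = 0.000023;
  return 1.0 + k1 * r2 + k2 * r2 * r2;
}

// Evaluates the external function on a Dual: the value comes from the
// function itself, the derivative from a central difference, and the chain
// rule scales the incoming infinitesimal part by that derivative.
Dual DistortionOnDual(Dual r2, double h = 1e-4) {
  const double value = ComputeDistortionValue(r2.a);
  const double derivative =
      (ComputeDistortionValue(r2.a + h) - ComputeDistortionValue(r2.a - h)) /
      (2.0 * h);
  return {value, derivative * r2.v};
}
```

Because the stand-in function is a quadratic, the central difference is exact up to roundoff here; for a general external function the accuracy considerations from the numeric derivatives section apply.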

A function that returns its value and derivative
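Now suppose we are given a function ComputeDistortionValueAndJacobian that is capable of computing its value and, optionally, its Jacobian on demand, and has the following signature: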

void ComputeDistortionValueAndJacobian(double r2,
                                       double* value,
                                       double* jacobian);

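Again, the actual implementation of the function does not matter. Interfacing this function with Affine2DWithDistortion is a two step process:

Wrap ComputeDistortionValueAndJacobian into a CostFunction object which we call ComputeDistortionFunction.
Wrap the resulting ComputeDistortionFunction object using CostFunctionToFunctor. The resulting object is a functor with a templated operator() method, which pipes the Jacobian computed by ComputeDistortionFunction into the appropriate Jet objects.

The resulting code will look as follows: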

class ComputeDistortionFunction : public ceres::SizedCostFunction<1, 1> {
 public:
  virtual bool Evaluate(double const* const* parameters,
                        double* residuals,
                        double** jacobians) const {
    if (!jacobians) {
      ComputeDistortionValueAndJacobian(parameters[0][0], residuals, NULL);
    } else {
      ComputeDistortionValueAndJacobian(parameters[0][0], residuals, jacobians[0]);
    }
    return true;
  }
};

struct Affine2DWithDistortion {
  Affine2DWithDistortion(const double x_in[2], const double y_in[2]) {
    x[0] = x_in[0];
    x[1] = x_in[1];
    y[0] = y_in[0];
    y[1] = y_in[1];
    compute_distortion.reset(
        new ceres::CostFunctionToFunctor<1, 1>(new ComputeDistortionFunction));
  }

  template <typename T>
  bool operator()(const T* theta,
                  const T* t,
                  T* residuals) const {
    const T q_0 =  cos(theta[0]) * x[0] - sin(theta[0]) * x[1] + t[0];
    const T q_1 =  sin(theta[0]) * x[0] + cos(theta[0]) * x[1] + t[1];
    const T r2 = q_0 * q_0 + q_1 * q_1;
    T f;
    (*compute_distortion)(&r2, &f);
    residuals[0] = y[0] - f * q_0;
    residuals[1] = y[1] - f * q_1;
    return true;
  }

  double x[2];
  double y[2];
  std::unique_ptr<ceres::CostFunctionToFunctor<1, 1> > compute_distortion;
};

A function that is defined as a table of values

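The third and final case we will consider is where the function $f$ is defined as a table of values on the interval $[0, 100)$, with a value for each integer.

std::vector<double> distortion_values;

There are many ways of interpolating a table of values. Perhaps the simplest and most common method is linear interpolation. But it is not a great idea to use linear interpolation because the interpolating function is not differentiable at the sample points.

A simple (well behaved) differentiable interpolation is the Cubic Hermite Spline. Ceres Solver ships with routines to perform Cubic and Bi-Cubic interpolation that are automatic differentiation friendly.
Using Cubic interpolation requires first constructing a Grid1D object to wrap the table of values and then constructing a CubicInterpolator object using it.

The resulting code will look as follows: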

struct Affine2DWithDistortion {
  Affine2DWithDistortion(const double x_in[2],
                         const double y_in[2],
                         const std::vector<double>& distortion_values) {
    x[0] = x_in[0];
    x[1] = x_in[1];
    y[0] = y_in[0];
    y[1] = y_in[1];

    grid.reset(new ceres::Grid1D<double, 1>(
        &distortion_values[0], 0, distortion_values.size()));
    compute_distortion.reset(
        new ceres::CubicInterpolator<ceres::Grid1D<double, 1> >(*grid));
  }

  template <typename T>
  bool operator()(const T* theta,
                  const T* t,
                  T* residuals) const {
    const T q_0 =  cos(theta[0]) * x[0] - sin(theta[0]) * x[1] + t[0];
    const T q_1 =  sin(theta[0]) * x[0] + cos(theta[0]) * x[1] + t[1];
    const T r2 = q_0 * q_0 + q_1 * q_1;
    T f;
    compute_distortion->Evaluate(r2, &f);
    residuals[0] = y[0] - f * q_0;
    residuals[1] = y[1] - f * q_1;
    return true;
  }

  double x[2];
  double y[2];
  std::unique_ptr<ceres::Grid1D<double, 1> > grid;
  std::unique_ptr<ceres::CubicInterpolator<ceres::Grid1D<double, 1> > > compute_distortion;
};

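In the above example we used Grid1D and CubicInterpolator to interpolate a one dimensional table of values. Grid2D combined with BiCubicInterpolator lets the user interpolate two dimensional tables of values. Note that neither Grid1D nor Grid2D are limited to scalar valued functions, they also work with vector valued functions.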
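
To see why a cubic Hermite spline is differentiable at the sample points, here is a standalone sketch of one common variant, the Catmull-Rom spline, which returns both the interpolated value and its derivative. This is an illustrative assumption: whether it matches the exact spline and boundary handling used by Ceres should be checked against the Ceres documentation. The function name `CubicInterpolate` is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Interpolates a table (at least two entries, one per integer position) at
// x with a Catmull-Rom cubic Hermite spline. Writes the value to *f and, if
// dfdx is non-null, the derivative to *dfdx. Indices outside the table are
// clamped to the first or last entry.
void CubicInterpolate(const std::vector<double>& table, double x,
                      double* f, double* dfdx) {
  const int n = static_cast<int>(table.size());
  const int i =
      std::min(std::max(static_cast<int>(std::floor(x)), 0), n - 2);
  auto at = [&](int k) { return table[std::min(std::max(k, 0), n - 1)]; };
  const double p0 = at(i - 1);
  const double p1 = at(i);
  const double p2 = at(i + 1);
  const double p3 = at(i + 2);
  const double t = x - i;  // position within the interval [i, i + 1]

  // Coefficients of the cubic a t^3 + b t^2 + c t + p1.
  const double a = 0.5 * (-p0 + 3.0 * p1 - 3.0 * p2 + p3);
  const double b = 0.5 * (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3);
  const double c = 0.5 * (-p0 + p2);

  *f = ((a * t + b) * t + c) * t + p1;
  if (dfdx != nullptr) *dfdx = (3.0 * a * t + 2.0 * b) * t + c;
}
```

At an interior sample point the derivative equals the central difference of the two neighbouring table entries, whichever adjacent interval it is approached from, so the interpolant has a continuous first derivative there, unlike linear interpolation.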

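Some parts of this translation may be imprecise. For example, "evaluate" is closer in meaning to "compute the value of" than to "assess". Please bear with me.

Please credit the source when reposting. HJ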
