Least squares problems and nonlinear optimization

0 Preface

Reprinted, with minor errors corrected.

1. Least squares problem

When solving the optimal state estimation problem in SLAM, we generally have two quantities: the actual observation $\boldsymbol{z}$ obtained by the sensor, and the predicted value $h(\boldsymbol{x})$ computed from the currently estimated state and the observation model. We usually try to minimize the square of the residual between the predicted and observed values (the square removes the influence of the sign), that is, to solve the following least squares problem:

$$\boldsymbol{x}^* = \arg\min_{\boldsymbol{x}}||\boldsymbol{z} - h(\boldsymbol{x})||^2$$

If the observation model is linear, the above becomes a linear least squares problem:

$$\boldsymbol{x}^* = \arg\min_{\boldsymbol{x}}||\boldsymbol{z} - \boldsymbol{H}\boldsymbol{x}||^2$$

For the linear least squares problem we can directly write down the closed-form solution $\boldsymbol{x}^* = (\boldsymbol{H}^T\boldsymbol{H})^{-1}\boldsymbol{H}^T\boldsymbol{z}$, which will not be derived in detail here.
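As a quick sanity check, the normal-equations solution can be compared against a library least-squares solver; this is a minimal sketch with a made-up observation matrix and noise level:

```python
import numpy as np

# Hypothetical linear observation model: z = H x + noise
rng = np.random.default_rng(0)
H = rng.normal(size=(20, 3))           # 20 observations of a 3-dim state
x_true = np.array([1.0, -2.0, 0.5])
z = H @ x_true + 0.01 * rng.normal(size=20)

# Closed-form normal-equations solution: x* = (H^T H)^{-1} H^T z
x_star = np.linalg.solve(H.T @ H, H.T @ z)

# Cross-check with a dedicated least-squares solver
x_lstsq, *_ = np.linalg.lstsq(H, z, rcond=None)
assert np.allclose(x_star, x_lstsq)
```

In practice, solving the normal equations squares the condition number of $\boldsymbol{H}$, so QR- or SVD-based solvers such as `lstsq` are preferred for ill-conditioned problems.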

In actual problems, we usually have to minimize more than one residual. Each residual is assigned a weight according to its importance (uncertainty), and the observation model is usually nonlinear; that is, we solve the following problem:

$$\begin{aligned} \boldsymbol{e}_i(\boldsymbol{x}) &= \boldsymbol{z}_i - h_i(\boldsymbol{x}), \qquad i = 1, 2, ..., n\\ ||\boldsymbol{e}_i(\boldsymbol{x})||^2_{\Sigma_i} &= \boldsymbol{e}_i^T\boldsymbol{\Sigma}_i\boldsymbol{e}_i\\ \boldsymbol{x}^* &= \arg\min_{\boldsymbol{x}}F(\boldsymbol{x}) = \arg\min_{\boldsymbol{x}}\sum_i||\boldsymbol{e}_i(\boldsymbol{x})||^2_{\Sigma_i} \end{aligned}$$
We want to obtain a state $\boldsymbol{x}^*$ such that the loss function $F(\boldsymbol{x})$ attains a local minimum.

Before solving the specific problem, first consider the properties of $F(\boldsymbol{x})$ by performing a second-order Taylor expansion:

$$F(\boldsymbol{x} + \Delta\boldsymbol{x}) = F(\boldsymbol{x}) + \boldsymbol{J}\Delta\boldsymbol{x} + \frac{1}{2}\Delta\boldsymbol{x}^T\boldsymbol{H}\Delta\boldsymbol{x} + O(||\Delta\boldsymbol{x}||^3)$$
Ignoring the higher-order remainder, the resulting quadratic function has the following properties:

If the derivative at a point $\boldsymbol{x}^*$ is $\boldsymbol{0}$, then that point is a stationary point, whose nature depends on the definiteness of the Hessian matrix:

  • if $\boldsymbol{H}$ is a positive definite matrix, then $F(\boldsymbol{x}^*)$ is a local minimum
  • if $\boldsymbol{H}$ is a negative definite matrix, then $F(\boldsymbol{x}^*)$ is a local maximum
  • if $\boldsymbol{H}$ is an indefinite matrix, then $\boldsymbol{x}^*$ is a saddle point

In practice, $F(\boldsymbol{x})$ is generally too complex for us to set its derivative to 0 and solve directly. Therefore, iterative methods are commonly used: find a descent direction so that the loss function decreases gradually as $\boldsymbol{x}$ is updated, until $\boldsymbol{x}$ converges to $\boldsymbol{x}^*$. Here are some commonly used iterative methods.

2. Iterative descent method

As mentioned above, we need to find an update to $\boldsymbol{x}$ such that $F(\boldsymbol{x})$ decreases. This process is divided into two steps:

  • find a descent direction of $F(\boldsymbol{x})$ and construct the unit vector $\boldsymbol{d}$ along it
  • determine the iteration step size $\alpha$ in this direction

The function value at the updated variable $\boldsymbol{x} + \alpha\boldsymbol{d}$ can be approximated by a first-order Taylor expansion (when the step size is small enough):

$$F(\boldsymbol{x} + \alpha\boldsymbol{d}) = F(\boldsymbol{x}) + \alpha\boldsymbol{J}\boldsymbol{d}$$

Therefore, to keep $F(\boldsymbol{x})$ decreasing, we only need to ensure $\boldsymbol{J}\boldsymbol{d} < 0$. The following methods use different ideas to find a suitable descent direction.

3. Steepest descent method

Based on the previous part, the change in $F$ is $\alpha\boldsymbol{Jd}$, where $\boldsymbol{Jd} = ||\boldsymbol{J}||\cos\theta$ (since $||\boldsymbol{d}|| = 1$) and $\theta$ is the angle between the gradient $\boldsymbol{J}$ and $\boldsymbol{d}$. When $\theta = \pi$, $\boldsymbol{Jd} = -||\boldsymbol{J}||$ attains its minimum value. The direction vector is then:

$$\boldsymbol{d} = -\frac{\boldsymbol{J}^T}{||\boldsymbol{J}||}$$
Therefore, moving along the negative direction of the gradient $\boldsymbol{J}$ makes $F(\boldsymbol{x})$ decrease fastest. In practice, however, this method is generally only useful at the beginning of the iteration; when close to the optimum it oscillates and converges slowly.
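The slow, zig-zagging behavior is easy to reproduce on an ill-conditioned quadratic; a small sketch with made-up numbers (fixed step size near the stability limit):

```python
import numpy as np

# F(x) = 0.5 * x^T A x with an ill-conditioned A; minimum at x = 0
A = np.diag([1.0, 50.0])

def grad(x):
    return A @ x          # J^T = A x for this quadratic

x = np.array([10.0, 1.0])
alpha = 0.035             # fixed step, close to the stability limit 2/50
for k in range(200):
    x = x - alpha * grad(x)

# The steep axis oscillates (factor |1 - 50*alpha| = 0.75 per step),
# while the shallow axis creeps toward 0 (factor 0.965 per step).
print(x)
```

After 200 steps the iterate is still about $10^{-2}$ away from the optimum along the shallow axis, illustrating the slow convergence near the minimum.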

4. Newton's method

The second-order Taylor expansion of $F(\boldsymbol{x})$ is:

$$F(\boldsymbol{x} + \Delta\boldsymbol{x}) = F(\boldsymbol{x}) + \boldsymbol{J}\Delta\boldsymbol{x} + \frac{1}{2}\Delta\boldsymbol{x}^T\boldsymbol{H}\Delta\boldsymbol{x}$$
We want to find a $\Delta\boldsymbol{x}$ that minimizes $\boldsymbol{J}\Delta\boldsymbol{x} + \frac{1}{2}\Delta\boldsymbol{x}^T\boldsymbol{H}\Delta\boldsymbol{x}$, which gives:

$$\begin{aligned} \frac{\partial(\boldsymbol{J}\Delta\boldsymbol{x} + \frac{1}{2}\Delta\boldsymbol{x}^T\boldsymbol{H}\Delta\boldsymbol{x})}{\partial\Delta\boldsymbol{x}} &= \boldsymbol{J}^T + \boldsymbol{H}\Delta\boldsymbol{x} = \boldsymbol{0}\\ \Rightarrow \Delta\boldsymbol{x} &= -\boldsymbol{H}^{-1}\boldsymbol{J}^T \end{aligned}$$
When $\boldsymbol{H}$ is positive definite and the current $\boldsymbol{x}$ is near the optimum, taking $\Delta\boldsymbol{x} = -\boldsymbol{H}^{-1}\boldsymbol{J}^T$ leads the function to a local minimum. The disadvantage is that the Hessian of the residual is usually difficult to compute.
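For a quadratic function the Newton step lands exactly on the stationary point in a single iteration; a tiny sketch with toy numbers:

```python
import numpy as np

# F(x) = 0.5 x^T A x - b^T x, so J^T = A x - b and Hessian H = A
A = np.array([[3.0, 1.0], [1.0, 2.0]])    # positive definite
b = np.array([1.0, -1.0])

x = np.array([10.0, -7.0])                # arbitrary starting point
J_T = A @ x - b
dx = -np.linalg.solve(A, J_T)             # Newton step: dx = -H^{-1} J^T
x = x + dx

# The minimizer satisfies A x = b exactly, reached in one step
assert np.allclose(A @ x, b)
```

For non-quadratic $F$ the step is only exact in the limit, which is why Newton's method shines near the optimum, where the quadratic model is accurate.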

5. Damping method

Based on Newton's method, in order to keep each iteration from being too aggressive, we can add a penalty term to the loss function, as shown below:

$$\arg\min_{\Delta\boldsymbol{x}}\left\{F(\boldsymbol{x}) + \boldsymbol{J}\Delta\boldsymbol{x} + \frac{1}{2}\Delta\boldsymbol{x}^T\boldsymbol{H}\Delta\boldsymbol{x} + \frac{1}{2}\mu(\Delta\boldsymbol{x})^T(\Delta\boldsymbol{x})\right\}$$
When the chosen $\Delta\boldsymbol{x}$ is too large, the loss function also becomes larger, with the magnitude of the increase determined by $\mu$; we can thus control the size of each iteration step $\Delta\boldsymbol{x}$. Differentiating the expression in braces with respect to $\Delta\boldsymbol{x}$ gives:

$$\begin{aligned} \boldsymbol{J}^T + \boldsymbol{H}\Delta\boldsymbol{x} + \mu\Delta\boldsymbol{x} &= \boldsymbol{0}\\ (\boldsymbol{H} + \mu\boldsymbol{I})\Delta\boldsymbol{x} &= -\boldsymbol{J}^T \end{aligned}$$
This idea will also be used when we introduce the LM method later.
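The effect of the damping term is easy to see numerically; a sketch with a made-up Hessian and gradient, showing that larger $\mu$ shrinks the step and turns it toward the negative gradient:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])    # toy Hessian H
J_T = np.array([2.0, -4.0])               # toy gradient transpose J^T

for mu in (0.0, 1.0, 100.0):
    # damped Newton step: (H + mu I) dx = -J^T
    dx = np.linalg.solve(A + mu * np.eye(2), -J_T)
    print(mu, dx, np.linalg.norm(dx))     # larger mu -> smaller step
```

At $\mu = 0$ this is the pure Newton step; at large $\mu$ the step approaches $-\frac{1}{\mu}\boldsymbol{J}^T$, a short move along the negative gradient.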

6. Gauss-Newton (GN) method

In the previous derivations, what we actually minimized was a sum over a series of residuals. The Jacobian of a single residual is relatively simple to compute, so the following methods focus on each residual. Write the residuals of the above nonlinear least squares problem in stacked vector form:

$$\boldsymbol{E}(\boldsymbol{x}) = \begin{bmatrix} \boldsymbol{e}_1(\boldsymbol{x})\\ \boldsymbol{e}_2(\boldsymbol{x})\\ \vdots\\ \boldsymbol{e}_n(\boldsymbol{x}) \end{bmatrix}, \qquad F(\boldsymbol{x}) = \boldsymbol{E}(\boldsymbol{x})^T\boldsymbol{E}(\boldsymbol{x})$$

Performing a first-order Taylor expansion of $\boldsymbol{e}(\boldsymbol{x})$:

$$\boldsymbol{e}(\boldsymbol{x} + \Delta\boldsymbol{x}) = \boldsymbol{e}(\boldsymbol{x}) + \boldsymbol{J}\Delta\boldsymbol{x}$$

In the above formula, $\boldsymbol{J}$ is the Jacobian matrix of the residual $\boldsymbol{e}(\boldsymbol{x})$ with respect to the state.

Note that in the original least squares problem, each residual also carries a weight matrix $\boldsymbol{\Sigma}$. In that case, we only need to set $\boldsymbol{e}_i(\boldsymbol{x}) \leftarrow \sqrt{\boldsymbol{\Sigma}_i}\,\boldsymbol{e}_i(\boldsymbol{x})$; therefore, the influence of $\boldsymbol{\Sigma}$ is not considered in the formulas below.
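The square-root weighting above can be carried out with a Cholesky factor; a small sketch, assuming (as in the text) that $\boldsymbol{\Sigma}$ is a weight/information matrix with toy values:

```python
import numpy as np

Sigma = np.array([[4.0, 1.0], [1.0, 2.0]])   # toy weight (information) matrix
e = np.array([0.5, -1.0])                     # toy residual

L = np.linalg.cholesky(Sigma)                 # Sigma = L L^T
e_w = L.T @ e                                 # "whitened" residual sqrt(Sigma) e

# The plain squared norm of e_w equals the weighted norm e^T Sigma e
assert np.isclose(e_w @ e_w, e @ Sigma @ e)
```

After this substitution, the weighted problem has exactly the same form as the unweighted one.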

Substituting the expansion of $\boldsymbol{e}(\boldsymbol{x})$ into the squared norm:

$$\begin{aligned} ||\boldsymbol{e}(\boldsymbol{x} + \Delta\boldsymbol{x})||^2 &= \boldsymbol{e}(\boldsymbol{x} + \Delta\boldsymbol{x})^T\boldsymbol{e}(\boldsymbol{x} + \Delta\boldsymbol{x})\\ &= (\boldsymbol{e}(\boldsymbol{x}) + \boldsymbol{J}\Delta\boldsymbol{x})^T(\boldsymbol{e}(\boldsymbol{x}) + \boldsymbol{J}\Delta\boldsymbol{x})\\ &= \boldsymbol{e}(\boldsymbol{x})^T\boldsymbol{e}(\boldsymbol{x}) + \Delta\boldsymbol{x}^T\boldsymbol{J}^T\boldsymbol{e}(\boldsymbol{x}) + \boldsymbol{e}(\boldsymbol{x})^T\boldsymbol{J}\Delta\boldsymbol{x} + \Delta\boldsymbol{x}^T\boldsymbol{J}^T\boldsymbol{J}\Delta\boldsymbol{x} \end{aligned}$$
In the above formula, note that $\Delta\boldsymbol{x}^T\boldsymbol{J}^T\boldsymbol{e}(\boldsymbol{x})$ is a scalar, so it equals its own transpose $\boldsymbol{e}(\boldsymbol{x})^T\boldsymbol{J}\Delta\boldsymbol{x}$; the expansion therefore simplifies to:

$$\begin{aligned} F(\boldsymbol{x} + \Delta\boldsymbol{x}) &= \boldsymbol{e}(\boldsymbol{x})^T\boldsymbol{e}(\boldsymbol{x}) + 2\boldsymbol{e}(\boldsymbol{x})^T\boldsymbol{J}\Delta\boldsymbol{x} + \Delta\boldsymbol{x}^T\boldsymbol{J}^T\boldsymbol{J}\Delta\boldsymbol{x}\\ &= F(\boldsymbol{x}) + 2\boldsymbol{e}(\boldsymbol{x})^T\boldsymbol{J}\Delta\boldsymbol{x} + \Delta\boldsymbol{x}^T\boldsymbol{J}^T\boldsymbol{J}\Delta\boldsymbol{x} \end{aligned}$$
In this way, we again obtain a quadratic approximation. Comparing with our earlier expansion, it is not difficult to see that we effectively use $\boldsymbol{J}^T\boldsymbol{e}$ in place of the gradient and $\boldsymbol{J}^T\boldsymbol{J}$ in place of the Hessian (up to a constant factor). Therefore, when $\boldsymbol{J}$ has full column rank, $\boldsymbol{J}^T\boldsymbol{J}$ is positive definite and the function attains a local minimum where the derivative of the above formula is 0. As before, differentiating the right-hand side with respect to $\Delta\boldsymbol{x}$ and setting it to 0:

$$\begin{aligned} \boldsymbol{J}^T\boldsymbol{e}(\boldsymbol{x}) + \boldsymbol{J}^T\boldsymbol{J}\Delta\boldsymbol{x} &= \boldsymbol{0}\\ \Rightarrow \boldsymbol{J}^T\boldsymbol{J}\Delta\boldsymbol{x} &= -\boldsymbol{J}^T\boldsymbol{e}(\boldsymbol{x})\\ \Rightarrow \boldsymbol{H}\Delta\boldsymbol{x} &= \boldsymbol{b} \end{aligned}$$
In the above formula, we let $\boldsymbol{H} = \boldsymbol{J}^T\boldsymbol{J}$ and $\boldsymbol{b} = -\boldsymbol{J}^T\boldsymbol{e}$. This gives the solution process of the Gauss-Newton method:

  • Compute the Jacobian matrix $\boldsymbol{J}$ of the residuals with respect to the state
  • Use the Jacobian and the residuals to construct the information matrix and information vector $\boldsymbol{H}, \boldsymbol{b}$
  • Compute the current iteration step: $\Delta\boldsymbol{x} = \boldsymbol{H}^{-1}\boldsymbol{b}$
  • If the step is small enough, end the iteration; otherwise update $\boldsymbol{x}$ and enter the next iteration
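The steps above can be sketched on a toy 1-D curve-fitting problem (the model $z = e^{a t}$, the data, and the tolerances are all made up for illustration):

```python
import numpy as np

# Toy observation model: z = h(a; t) = exp(a * t), state x = [a]
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 30)
a_true = 1.3
z = np.exp(a_true * t) + 0.01 * rng.normal(size=t.size)

def residual(a):
    return z - np.exp(a * t)                 # e(x) = z - h(x)

def jacobian(a):
    # de/da = -t * exp(a * t), one row per observation
    return (-t * np.exp(a * t)).reshape(-1, 1)

a = 0.0                                       # initial guess
for _ in range(20):
    e = residual(a)
    J = jacobian(a)
    H = J.T @ J                               # information matrix
    b = -J.T @ e                              # information vector
    dx = np.linalg.solve(H, b)                # H dx = b
    a += dx.item()
    if abs(dx.item()) < 1e-10:                # step small enough: stop
        break

print(a)   # close to a_true
```

Each iteration only needs the residuals and their first derivatives, which is exactly why GN avoids the Hessian computation that plagues Newton's method.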

7. Levenberg-Marquardt (LM) method

The LM method builds on the Gauss-Newton method and adds a damping factor following the idea of the damping method; that is, it solves the following equation:

$$(\boldsymbol{H} + \mu\boldsymbol{I})\Delta\boldsymbol{x} = \boldsymbol{b}$$

In the above formula, the damping factor serves two purposes:

  • Added to $\boldsymbol{H}$, it guarantees that $\boldsymbol{H} + \mu\boldsymbol{I}$ is positive definite

  • When $\mu$ is large, $\Delta\boldsymbol{x} = (\boldsymbol{H}+\mu\boldsymbol{I})^{-1}\boldsymbol{b} \approx \frac{1}{\mu}\boldsymbol{b} = -\frac{1}{\mu}\boldsymbol{J}^T\boldsymbol{E}(\boldsymbol{x})$, which is close to the steepest descent method

  • When $\mu$ is small, the method is close to the Gauss-Newton method

Therefore, by setting the damping factor reasonably, the iteration speed can be adjusted dynamically. Setting the damping factor involves two parts:

  • selection of the initial value

  • an update strategy as the iterations proceed

Let's first look at the initial value selection. The size of the damping factor should be chosen relative to $\boldsymbol{J}^T\boldsymbol{J}$. Performing an eigenvalue decomposition $\boldsymbol{J}^T\boldsymbol{J} = \boldsymbol{V}\boldsymbol{\Lambda}\boldsymbol{V}^T$, with $\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \lambda_2, ..., \lambda_n)$ and $\boldsymbol{V} = [\boldsymbol{v}_1, ..., \boldsymbol{v}_n]$, the iteration formula simplifies to:

$$\begin{aligned} (\boldsymbol{V\Lambda}\boldsymbol{V}^T + \mu\boldsymbol{I})\Delta\boldsymbol{x} &= \boldsymbol{b}\\ \Delta\boldsymbol{x} &= (\boldsymbol{V\Lambda}\boldsymbol{V}^T + \mu\boldsymbol{I})^{-1}\boldsymbol{b} = \sum_i\frac{\boldsymbol{v}_i^T\boldsymbol{b}}{\lambda_i + \mu}\boldsymbol{v}_i \end{aligned}$$
Therefore, $\mu$ should be chosen on the same scale as the eigenvalues $\lambda_i$. A simple idea is to set $\mu_0 = \tau\max_i{(\boldsymbol{J}^T\boldsymbol{J})_{ii}}$, where in practice $\tau$ is generally taken in $[10^{-8}, 1]$.
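This initial-damping heuristic is one line in code; a sketch with an arbitrary Jacobian and a made-up $\tau$:

```python
import numpy as np

rng = np.random.default_rng(2)
J = rng.normal(size=(10, 3))      # toy Jacobian
tau = 1e-3                        # a typical value from [1e-8, 1]

JtJ = J.T @ J
mu0 = tau * np.max(np.diag(JtJ))  # mu_0 = tau * max_i (J^T J)_ii
print(mu0)
```

Using the largest diagonal entry of $\boldsymbol{J}^T\boldsymbol{J}$ (an upper bound on the largest eigenvalue scale) keeps $\mu_0$ commensurate with the problem's curvature.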

Next, look at the update strategy for $\mu$. First, analyze qualitatively how the damping factor should be updated:

  • When $\Delta\boldsymbol{x}$ makes $F(\boldsymbol{x})$ increase, increase $\mu$ to shrink $\Delta\boldsymbol{x}$ and reduce the influence of this iteration
  • When $\Delta\boldsymbol{x}$ makes $F(\boldsymbol{x})$ decrease, decrease $\mu$ to enlarge $\Delta\boldsymbol{x}$ and speed up the descent

Now for a quantitative analysis. Let $L(\Delta\boldsymbol{x}) = F(\boldsymbol{x}) + 2\boldsymbol{E}(\boldsymbol{x})^T\boldsymbol{J}\Delta\boldsymbol{x} + \Delta\boldsymbol{x}^T\boldsymbol{J}^T\boldsymbol{J}\Delta\boldsymbol{x}$ be the quadratic model of $F$, and consider the gain ratio:

$$\rho = \frac{F(\boldsymbol{x}) - F(\boldsymbol{x} + \Delta\boldsymbol{x})}{L(\boldsymbol{0}) - L(\Delta\boldsymbol{x})}$$
Marquardt proposed the following strategy:

  • When $\rho < 0$, the current $\Delta\boldsymbol{x}$ makes $F(\boldsymbol{x})$ increase, indicating that we are still far from the optimum; increase $\mu$ so that the method approaches steepest descent with a smaller step
  • When $\rho > 0$ and relatively large, the current $\Delta\boldsymbol{x}$ makes $F(\boldsymbol{x})$ decrease and the quadratic model is accurate; decrease $\mu$ so that the method approaches Gauss-Newton and converges quickly toward the optimum
  • When $\rho > 0$ but relatively small, the quadratic model fits poorly near the current point; increase the damping $\mu$ to reduce the iteration step size

Marquardt’s specific strategy is as follows:

if rho < 0.25:
    mu = mu * 2
else if rho > 0.75:
    mu = mu / 3

An update process using the Marquardt strategy is as follows:
(Figure: evolution of μ over iterations under the Marquardt strategy)

It can be seen that the effect is not very good: as the number of iterations increases, $\mu$ begins to oscillate, indicating that the step periodically makes $F(\boldsymbol{x})$ increase and then decrease.

Therefore, Nielsen proposed another strategy, which is also the one used in g2o and Ceres:

if rho > 0:
    mu = mu * max(1/3, 1 - (2 * rho - 1)^3)
    v = 2
else:
    mu = mu * v
    v = 2 * v
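Combining the GN step, the gain ratio, and the Nielsen update gives a compact LM loop. A sketch on the same kind of toy 1-D exponential fit as before (model, data, and constants are all made up); the predicted decrease $L(\boldsymbol{0}) - L(\Delta\boldsymbol{x})$ simplifies to $\Delta\boldsymbol{x}^T(\mu\Delta\boldsymbol{x} + \boldsymbol{b})$ when $\Delta\boldsymbol{x}$ solves the damped system:

```python
import numpy as np

# Toy model z = exp(a * t), state x = [a]
rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 30)
a_true = 1.3
z = np.exp(a_true * t) + 0.01 * rng.normal(size=t.size)

def e_and_J(a):
    e = z - np.exp(a * t)
    J = (-t * np.exp(a * t)).reshape(-1, 1)
    return e, J

a = 0.0
e, J = e_and_J(a)
mu = 1e-3 * np.max(np.diag(J.T @ J))    # initial damping
v = 2.0
for _ in range(50):
    H = J.T @ J
    b = -J.T @ e
    dx = np.linalg.solve(H + mu * np.eye(1), b)
    a_new = a + dx.item()
    e_new, J_new = e_and_J(a_new)
    # gain ratio: actual decrease / decrease predicted by the model
    pred = float(dx @ (mu * dx + b))
    rho = (e @ e - e_new @ e_new) / pred
    if rho > 0:                          # accept the step
        a, e, J = a_new, e_new, J_new
        mu *= max(1.0 / 3.0, 1.0 - (2.0 * rho - 1.0) ** 3)
        v = 2.0
    else:                                # reject: keep x, increase damping
        mu *= v
        v *= 2.0
    if np.linalg.norm(dx) < 1e-10:
        break

print(a)   # close to a_true
```

Note that a rejected step costs only a residual evaluation: the state is kept, the damping is raised, and the damped system is solved again.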

An example of optimization using this strategy is as follows:

It can be seen that $\mu$ decreases relatively smoothly as the iteration proceeds, until convergence is reached.
(Figure: evolution of μ over iterations under the Nielsen strategy)

8. Robust kernel function

When solving least squares problems, we will encounter some abnormal observations (outliers) whose residuals are particularly large. If these outliers are not handled, they affect the optimization: the optimizer tries hard to reduce those large residual terms, ultimately harming the accuracy of the state estimate. Robust kernel functions are used to reduce the influence of these abnormal observations.

Applying a robust kernel function directly to each residual term transforms the least squares problem into the following form:

$$F(\boldsymbol{x}) = \sum_i\rho(||\boldsymbol{e}_i(\boldsymbol{x})||^2)$$

Now consider the process of solving the nonlinear least squares problem with a robust kernel. In this form, perform a second-order Taylor expansion of the kernel in $s = ||\boldsymbol{e}_i(\boldsymbol{x})||^2$:

$$\rho(s + \Delta s) = \rho(s) + \rho'(s)\Delta s + \frac{1}{2}\rho''(s)\Delta s^2$$
In the above formula, the change $\Delta s$ is computed as follows:

$$\begin{aligned} \Delta s_i &= ||\boldsymbol{e}_i(\boldsymbol{x}+\Delta\boldsymbol{x})||^2 - ||\boldsymbol{e}_i(\boldsymbol{x})||^2\\ &= ||\boldsymbol{e}_i(\boldsymbol{x})+\boldsymbol{J}_i\Delta\boldsymbol{x}||^2 - ||\boldsymbol{e}_i(\boldsymbol{x})||^2\\ &= 2\boldsymbol{e}_i(\boldsymbol{x})^T\boldsymbol{J}_i\Delta\boldsymbol{x}+\Delta\boldsymbol{x}^T\boldsymbol{J}_i^T\boldsymbol{J}_i\Delta\boldsymbol{x} \end{aligned}$$
Substituting $\Delta s$ into $\rho(s + \Delta s)$ gives:

$$\begin{aligned} \rho(s + \Delta s) =& \rho(s) + \rho'(s)(2\boldsymbol{e}_i(\boldsymbol{x})^T\boldsymbol{J}_i\Delta\boldsymbol{x}+\Delta\boldsymbol{x}^T\boldsymbol{J}_i^T\boldsymbol{J}_i\Delta\boldsymbol{x}) + \frac{1}{2}\rho''(s)(2\boldsymbol{e}_i(\boldsymbol{x})^T\boldsymbol{J}_i\Delta\boldsymbol{x}+\Delta\boldsymbol{x}^T\boldsymbol{J}_i^T\boldsymbol{J}_i\Delta\boldsymbol{x})^2\\ \approx& \rho(s) + 2\rho'(s)\boldsymbol{e}_i(\boldsymbol{x})^T\boldsymbol{J}_i\Delta\boldsymbol{x}+\rho'(s)\Delta\boldsymbol{x}^T\boldsymbol{J}_i^T\boldsymbol{J}_i\Delta\boldsymbol{x} + 2\rho''(s)\Delta\boldsymbol{x}^T\boldsymbol{J}_i^T\boldsymbol{e}_i(\boldsymbol{x})\boldsymbol{e}_i(\boldsymbol{x})^T\boldsymbol{J}_i\Delta\boldsymbol{x} \end{aligned}$$
Following the earlier approach, differentiate the above formula with respect to $\Delta\boldsymbol{x}$ and set it to 0 to obtain:

$$\sum_i\boldsymbol{J}_i^T\left(\rho'(s_i)\boldsymbol{I} + 2\rho''(s_i)\boldsymbol{e}_i(\boldsymbol{x})\boldsymbol{e}_i(\boldsymbol{x})^T\right)\boldsymbol{J}_i\Delta\boldsymbol{x} = -\sum_i\rho'(s_i)\boldsymbol{J}_i^T\boldsymbol{e}_i(\boldsymbol{x})$$
Comparing with the earlier normal equations $\boldsymbol{J}^T\boldsymbol{J}\Delta\boldsymbol{x} = -\boldsymbol{J}^T\boldsymbol{e}(\boldsymbol{x})$, we see that after introducing a robust kernel we only need the first- and second-order derivatives of the kernel at each residual, and then accumulate the information matrix and information vector according to the form above.
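In code, the robust kernel enters the normal equations only through per-residual scalar factors; a sketch of the accumulation (function name and interface are illustrative, not from any particular library):

```python
import numpy as np

def robust_normal_equations(residuals, jacobians, rho1, rho2):
    """Accumulate H, b with a robust kernel.

    residuals: list of (m_i,) arrays e_i
    jacobians: list of (m_i, n) arrays J_i
    rho1, rho2: first and second derivative of the kernel rho(s), s = ||e||^2
    """
    n = jacobians[0].shape[1]
    H = np.zeros((n, n))
    b = np.zeros(n)
    for e, J in zip(residuals, jacobians):
        s = float(e @ e)
        w1, w2 = rho1(s), rho2(s)
        # J_i^T (rho' I + 2 rho'' e e^T) J_i   and   -rho' J_i^T e
        H += J.T @ (w1 * np.eye(len(e)) + 2.0 * w2 * np.outer(e, e)) @ J
        b += -w1 * (J.T @ e)
    return H, b

# With the identity kernel rho(s) = s, this reduces to plain Gauss-Newton
e = [np.array([0.5, -1.0])]
J = [np.array([[1.0, 0.0], [0.0, 2.0]])]
H, b = robust_normal_equations(e, J, lambda s: 1.0, lambda s: 0.0)
assert np.allclose(H, J[0].T @ J[0])
assert np.allclose(b, -J[0].T @ e[0])
```

The identity-kernel check at the end confirms that the robust formulation contains the ordinary GN normal equations as a special case.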

Commonly used robust kernel functions
Cauchy robust kernel function:

$$\begin{aligned} \rho(s) &= c^2\log{\left(1+\frac{s}{c^2}\right)}\\ \rho'(s) &= \frac{1}{1+\frac{s}{c^2}}\\ \rho''(s) &= -\frac{1}{c^2}(\rho'(s))^2 \end{aligned}$$
where $c$ is a control parameter. When the residuals are normally distributed, $c$ is chosen as 1.345 for the Huber kernel and 2.3849 for the Cauchy kernel. The effects of different robust kernel functions are shown in the figure below:
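The Cauchy kernel and its derivatives translate directly into code; a sketch, with a finite-difference check of $\rho'$ and a check that large residuals are down-weighted:

```python
import numpy as np

def cauchy(s, c=2.3849):
    """Cauchy robust kernel rho(s) and its first two derivatives."""
    c2 = c * c
    rho = c2 * np.log(1.0 + s / c2)
    rho1 = 1.0 / (1.0 + s / c2)
    rho2 = -(rho1 ** 2) / c2
    return rho, rho1, rho2

s = 4.0
rho, rho1, rho2 = cauchy(s)

# finite-difference check of the first derivative
h = 1e-6
fd = (cauchy(s + h)[0] - cauchy(s - h)[0]) / (2 * h)
assert np.isclose(rho1, fd, rtol=1e-5)

# rho'(s) < 1 and decreasing in s: larger residuals get smaller weights
assert cauchy(100.0)[1] < cauchy(1.0)[1] < 1.0
```

Since $\rho'' < 0$ for the Cauchy kernel, the second-derivative term reduces the curvature contributed by large residuals, which is exactly the desired outlier suppression.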
(Figure: comparison of different robust kernel functions)


Origin blog.csdn.net/fb_941219/article/details/132106286