Optimization algorithms are very similar to root-finding algorithms. The general structure goes something like: a) start with an initial guess, b) calculate the result of the guess, c) update the guess based on the result and some further conditions, d) repeat until you are satisfied with the result.
The difference is the stopping criterion: a root-finding algorithm stops when the function value f(x) converges to 0, whereas an optimization algorithm stops when the derivative f'(x) converges to 0.
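As a minimal illustration of the two stopping rules, the sketch below (with simple 1-D functions chosen purely for illustration) runs Newton's method until |f(x)| is small, and gradient descent until |f'(x)| is small:

```python
def newton_root(f, fprime, x, tol=1e-8):
    # Root finding: iterate until f(x) itself is close to 0.
    while abs(f(x)) > tol:
        x = x - f(x) / fprime(x)
    return x

def gd_minimize(fprime, x, lr=0.1, tol=1e-8):
    # Optimization: iterate until the derivative f'(x) is close to 0.
    while abs(fprime(x)) > tol:
        x = x - lr * fprime(x)
    return x

f = lambda x: x**2 - 4          # roots at -2 and +2
fp = lambda x: 2 * x            # derivative of f
g = lambda x: (x - 3)**2        # minimum at x = 3
gp = lambda x: 2 * (x - 3)      # derivative of g

print(newton_root(f, fp, 1.0))  # close to 2
print(gd_minimize(gp, 0.0))     # close to 3
```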
1. Basic function
It is important to obtain the derivative value in continuous optimization problems. We can approximate it numerically:
# calculate the derivative numerically
import numpy as np
from scipy.optimize import approx_fprime

f = lambda x: x[0]**2 + x[1]**2 - 20
approx_fprime([3, 2], f, 0.0001)  # approximately [6, 4]
Or we can calculate the value directly if we know the explicit form.
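For instance, for the f above the exact gradient is (2*x[0], 2*x[1]), and the numerical and analytic values agree closely:

```python
import numpy as np
from scipy.optimize import approx_fprime

f = lambda x: x[0]**2 + x[1]**2 - 20
fd = lambda x: 2 * np.asarray(x)     # exact gradient of f

x = np.array([3.0, 2.0])
numeric = approx_fprime(x, f, 1e-4)  # finite-difference approximation
exact = fd(x)
print(numeric, exact)                # both close to [6, 4]
```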
2. Gradient descent
The GD algorithm uses the gradient to find the direction to move in, and then moves a certain step along it.
def GD(f, x, delta=0.1, tol=.0001):
    print(str(x) + ":" + str(f(x)))
    # move one step against the numerical gradient, then recurse
    xn = x - delta * approx_fprime(x, f, 0.0001)
    if abs(f(x) - f(xn)) > tol:
        GD(f, xn, delta, tol)

f = lambda x: np.sum((x - 2)**2) + 20
GD(f, np.array([1, 2]))
In the above example, we know f, but not its derivative f'. In some cases, we do know the derivative function f' explicitly. In that case, we can replace approx_fprime by f'.
x_iters = []

def f(x):
    return (x[0] - 2)**2 + (x[1] - 2)**2 + 20

fd = lambda x: (x - 2) * 2  # the exact gradient

def GD1(f, fd, x, delta=0.01, tol=.0001):
    x_iters.append(x.reshape([len(x), 1]))
    xn = x - delta * fd(x)
    if abs(f(x) - f(xn)) > tol:
        GD1(f, fd, xn, delta, tol)

GD1(f, fd, np.array([1, 2]))
3. Least squares problem and PyTorch
In deep learning, we use very simple functions to fit the real relationship. For example, assume that fs(x) = t[0]*x[0] + t[1]*x[1] + t[2]*x[2] + …, and we have several samples [x1,y1], [x2,y2], …
The loss function is $Loss = \sum_i (t \cdot x_i - y_i)^2$.
Then the partial derivative with respect to each coefficient is $\frac{\partial Loss}{\partial t_j} = \sum_i 2 (t \cdot x_i - y_i) x_{ij}$,
so the gradient is $LossD = \nabla_t Loss = \sum_i 2 (t \cdot x_i - y_i) x_i$.
In short, we should minimize the Loss function by gradient descent, which means LossD should converge to 0.
def Loss(t):
    return np.sum((np.sum(t * X, axis=1) - Y)**2)

def LossD(t):
    return np.sum(2 * (np.sum(t * X, axis=1) - Y).reshape([len(X), 1]) * X, axis=0)

t = np.array([8, 8, 8])
X = np.array([[3, 4, 3], [8, 1, 4], [4, 2, 3], [5, 7, 9]])
Y = np.array([101, 185, 109, 206])
GD1(Loss, LossD, t, delta=0.001, tol=.01)  # The true value is [18, 5, 9]
We can use PyTorch to calculate the derivative values automatically:
import torch

t = torch.tensor([8., 8, 8], requires_grad=True)
Y = torch.tensor([101., 185, 109, 206])
X = torch.tensor([[3., 4, 3], [8, 1, 4], [4, 2, 3], [5, 7, 9]])

def GD2(t, delta=torch.scalar_tensor(0.001), tol=torch.scalar_tensor(.01)):
    Loss = torch.sum((X @ t - Y)**2)
    Loss.backward()
    while Loss.data > tol:
        print(t.data, Loss.data)
        t = (t - delta * t.grad).data  # detach the updated parameter
        t.requires_grad = True
        Loss = torch.sum((X @ t - Y)**2)
        Loss.backward()

GD2(t)  # The true value is [18, 5, 9]
Furthermore, we can use an optimizer provided by PyTorch. Since PyTorch does not provide a dedicated full-batch gradient descent optimizer, we use the SGD optimizer (which reduces to full-batch gradient descent when fed the whole dataset at once).
import torch.nn as nn
import torch.optim as optim
from torch.nn import Parameter

class LinearRegression(nn.Module):
    def __init__(self):
        super(LinearRegression, self).__init__()
        self.t = Parameter(torch.Tensor(3, 1))
        self.t.data = torch.tensor([8., 8, 8])

    def forward(self, x):
        out = x @ self.t
        return out

model = LinearRegression()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)
out = model(X)
loss = criterion(out, Y)
while loss > 0.01:
    out = model(X)
    loss = criterion(out, Y)
    print(model.t.data, loss.data * len(Y))  # scale MSE back to the summed loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
4. Momentum algorithm
In some cases, gradient descent converges very slowly, or even worse, does not converge at all. We may adopt the momentum algorithm and give it a try. For example:
f = lambda x: (x[0] - 2)**2 + abs(x[1])
4.1 naive momentum
def MOMENTUM(f, x, lr=0.1, discount=0.3, tol=0.001):
    g = approx_fprime(x, f, 0.0001)
    pre_g = lr * g  # velocity term
    xn = x - pre_g
    while abs(f(x) - f(xn)) > tol:
        x = xn
        # accumulate the discounted previous velocity into the new step
        pre_g = lr * approx_fprime(x, f, 0.0001) + discount * pre_g
        xn = x - pre_g
        print(str(xn) + ":" + str(f(xn)))

MOMENTUM(f, np.array([1, 1]))
4.2 Nesterov
def NESTEROV(f, x, lr=0.01, discount=0.1, tol=0.01):
    g = approx_fprime(x, f, 0.00001)
    pre_g = g
    xn = x - lr * g
    while abs(f(x) - f(xn)) > tol:
        x = xn
        # look ahead along the accumulated direction before taking the gradient
        xf = x - pre_g * discount * lr
        pre_g = approx_fprime(xf, f, 0.0001) + discount * pre_g
        xn = x - pre_g * lr
        print(str(xn) + ":" + str(f(xn)))

NESTEROV(f, np.array([1, 1]))
5. Newton method
If we know the second derivative (the Hessian) of the objective, we can use Newton's method.
Computing the Hessian matrix exactly is expensive, so in practice we usually approximate it.
Assume the loss function is $Loss = (Xt - Y)^T (Xt - Y)$.
Then $LossD = \nabla_t Loss = 2 (Xt - Y)^T X$, and the Hessian is $LossH = 2 X^T X$.
Newton's method updates $t \leftarrow t - LossH^{-1} LossD$; for this quadratic loss, a single step reaches the optimum.
t = np.array([8, 8, 8])
X = np.array([[3, 4, 3], [8, 1, 4], [4, 2, 3], [5, 7, 9]])
Y = np.array([101, 185, 109, 206])

def Loss(t):
    # equivalent to np.sum((np.sum(t*X, axis=1) - Y)**2)
    A = X @ t - Y
    return A @ A

def LossD(t):
    # equivalent to np.sum(2*(np.sum(t*X, axis=1) - Y).reshape([len(X), 1])*X, axis=0)
    return 2 * (X @ t - Y) @ X

def LossHInv(t):
    return np.linalg.inv(2 * X.T @ X)

t - LossHInv(t) @ LossD(t)  # One step to the optimal solution [18, 5, 9]
We can also use PyTorch to calculate the Hessian values:
import torch

t = torch.tensor([8., 8, 8], requires_grad=True)
X = torch.tensor([[3., 4, 3], [8, 1, 4], [4, 2, 3], [5, 7, 9]])
Y = torch.tensor([101., 185, 109, 206])
loss = torch.sum((X @ t - Y)**2)
grad = torch.autograd.grad(loss, t, retain_graph=True, create_graph=True)
hess_rows = torch.tensor([])
for anygrad in grad[0]:  # torch.autograd.grad returns a tuple
    hess_rows = torch.cat((hess_rows, torch.autograd.grad(anygrad, t, retain_graph=True)[0]))
t.data - torch.inverse(hess_rows.view(t.size()[0], -1)) @ grad[0].data
6. Quasi-Newton Method
6.1 basic test
Let's start with a simple test problem (the objective function and its plot are omitted in the source).
Let's use the gradient descent method as a baseline. The function below is similar to GD1 above, with some new features such as line search and a verbose flag.
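The routines in this section also reference a global n_iters and a line-search helper alpha_search that are not shown in the source; a minimal sketch of both (the grid-based line search is my assumption, not necessarily the original implementation):

```python
import numpy as np

n_iters = 100  # global iteration cap used by the routines below

def alpha_search(f, x, p, alphas=np.linspace(1e-4, 1.0, 50)):
    """Crude line search (an assumed implementation): try a grid of step
    sizes and keep the one giving the smallest f(x - alpha * p)."""
    values = [f(x - a * p) for a in alphas]
    return alphas[int(np.argmin(values))]

# quick check on a quadratic with minimum at the origin
f = lambda x: np.sum(x**2)
x = np.array([3.0, 2.0])
g = 2 * x                        # gradient, used as the search direction
best = alpha_search(f, x, g)
print(best, f(x - best * g))     # step near 0.5 drives f close to 0
```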
def gradient_descent(f, grad, x_0, alpha=0.1, verbose=0, steepest=0):
    x_curr = x_0
    x_iters = [x_curr]
    for i in range(1, n_iters):
        g = grad(x_curr)
        if verbose: print('Iteration: %d\nCurrent x: %s\nGradient: %s' % (i, x_curr, g))
        if steepest:
            # line search: pick the best step size along the current direction
            best_alpha = alpha_search(f, x_curr, g)
            alpha = best_alpha
        x_new = x_curr - alpha * g
        if verbose: print('New x: ', x_new, '\n')
        x_iters.append(x_new)
        x_curr = x_new
    return np.array(x_iters)
6.2 SR1 method
Next we test the SR1 (symmetric rank-1) quasi-Newton method.
We use $B_k$ to approximate the Hessian $\nabla^2 f$ (or $H_k = B_k^{-1}$ to approximate its inverse).
Assume that $B_{k+1} = B_k + \sigma v v^T$; imposing the secant condition $B_{k+1} s_k = y_k$, we get $B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}$.
I do not know why people chose these particular letters, but remember that in the following function, s and y stand for $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$.
def sr1(f, grad, x_0, alpha=0.1, verbose=0, steepest=1):
    # Start from the identity matrix
    B_curr = np.identity(len(x_0))
    H_curr = np.linalg.inv(B_curr)
    x_curr = x_0
    x_new = x_curr - alpha * grad(x_curr)
    s_curr = x_new - x_curr
    y_curr = grad(x_new) - grad(x_curr)
    x_iters_sr1 = [x_0]
    for i in range(1, n_iters):
        if verbose: print('Iteration: ', i)
        if verbose: print('Current x: ', x_curr)
        g = grad(x_curr)
        p = H_curr.dot(g)
        if steepest: alpha = alpha_search(f, x_curr, p)
        x_new = x_curr - alpha * p
        if verbose: print('New x: ', x_new, '\n')
        s_curr = x_new - x_curr
        y_curr = grad(x_new) - grad(x_curr)
        # This is the essence of the SR1 algorithm: rank-1 updates of B and H.
        b = y_curr - B_curr.dot(s_curr)
        B_new = B_curr + np.outer(b, b) / (b.dot(s_curr) + 10**-8)
        a = s_curr - H_curr.dot(y_curr)
        H_new = H_curr + np.outer(a, a) / (a.dot(y_curr) + 10**-8)
        r = 10**-8
        # only update when s.b > r * sqrt((s.s)(b.b)), to keep the update stable
        if s_curr.dot(b) > r * np.sqrt(s_curr.dot(s_curr) * b.dot(b)):
            if verbose: print('Update')
            H_curr = H_new
            B_curr = B_new
        x_iters_sr1.append(x_new)
        x_curr = x_new
    return np.array(x_iters_sr1)
6.3 SR2: DFP and BFGS
The SR2 (symmetric rank-2) quasi-Newton method has two forms.
Let $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$.
(0) Sherman-Morrison formula: $(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}$.
(1) Updating the inverse Hessian as $H_{k+1} = H_k + \frac{s_k s_k^T}{s_k^T y_k} - \frac{H_k y_k y_k^T H_k}{y_k^T H_k y_k}$ is DFP.
(2) Updating the Hessian as $B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k}$ is BFGS; applying the Sherman-Morrison formula twice, we have the inverse form $H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T$ with $\rho_k = 1/(y_k^T s_k)$.
We can also combine DFP and BFGS with a certain ratio, which is called the Broyden family.
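As a sketch of that idea (working directly on the inverse-Hessian approximation H; the function names and the interpolation parameter phi are mine), the Broyden family interpolates between the DFP and BFGS updates:

```python
import numpy as np

def dfp_update(H, s, y):
    # DFP update of the inverse Hessian approximation
    Hy = H @ y
    return H + np.outer(s, s) / (y @ s) - np.outer(Hy, Hy) / (y @ Hy)

def bfgs_update(H, s, y):
    # BFGS update of the inverse Hessian approximation
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    a = I - rho * np.outer(s, y)
    b = I - rho * np.outer(y, s)
    return a @ H @ b + rho * np.outer(s, s)

def broyden_update(H, s, y, phi=0.5):
    # Broyden family: convex combination of DFP and BFGS
    return (1 - phi) * dfp_update(H, s, y) + phi * bfgs_update(H, s, y)

# On a quadratic f(x) = 0.5 x^T A x, the pair (s, y = A s) is exact
A = np.array([[2.0, 0.0], [0.0, 4.0]])
s = np.array([1.0, 1.0])
y = A @ s
H = broyden_update(np.eye(2), s, y)
print(H @ y, s)  # secant condition: H y equals s
```

Both component updates satisfy the secant condition H y = s, so any convex combination of them does too.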
def dfp(f, grad, x_0, alpha=0.1, verbose=0, steepest=1):
    B = np.identity(len(x_0))  # inverse Hessian approximation
    x_curr = x_0
    x_new = x_curr - alpha * grad(x_curr)
    s_curr = x_new - x_curr
    y_curr = grad(x_new) - grad(x_curr)
    x_iters_dfp = [x_0]
    for i in range(1, n_iters):
        if verbose: print('Iteration: ', i)
        if verbose: print('Current x: ', x_curr)
        g = grad(x_curr)
        p = B.dot(g)
        if steepest: alpha = alpha_search(f, x_curr, p)
        x_new = x_curr - alpha * p
        if verbose: print('New x: ', x_new, '\n')
        s_curr = x_new - x_curr
        y_curr = grad(x_new) - grad(x_curr)
        # DFP update of the inverse Hessian approximation
        By = B.dot(y_curr)
        B += np.outer(s_curr, s_curr) / (y_curr.dot(s_curr) + 10**-8)\
            - np.outer(By, By) / (y_curr.dot(By) + 10**-8)
        x_iters_dfp.append(x_new)
        x_curr = x_new
    return np.array(x_iters_dfp)
def bfgs(f, grad, x_0, alpha=0.1, verbose=0, steepest=1):
    B_curr = np.identity(len(x_0))  # inverse Hessian approximation
    x_curr = x_0
    x_new = x_curr - alpha * grad(x_curr)
    s_curr = x_new - x_curr
    y_curr = grad(x_new) - grad(x_curr)
    x_iters_bfgs = [x_0]
    for i in range(1, n_iters):
        if verbose: print('Iteration: ', i)
        if verbose: print('Current x: ', x_curr)
        g = grad(x_curr)
        p = B_curr.dot(g)
        if steepest: alpha = alpha_search(f, x_curr, p)
        x_new = x_curr - alpha * p
        if verbose: print('New x: ', x_new, '\n')
        s_curr = x_new - x_curr
        y_curr = grad(x_new) - grad(x_curr)
        # BFGS update of the inverse Hessian approximation
        rho = 1. / (y_curr.dot(s_curr) + 10**-8)
        a = np.identity(len(x_0)) - rho * np.outer(s_curr, y_curr)
        b = np.identity(len(x_0)) - rho * np.outer(y_curr, s_curr)
        B_curr = a.dot(B_curr.dot(b)) + rho * np.outer(s_curr, s_curr)
        x_iters_bfgs.append(x_new)
        x_curr = x_new
    return np.array(x_iters_bfgs)
6.4 L-BFGS
L-BFGS is one particular optimization algorithm in the family of quasi-Newton methods that approximates the BFGS algorithm using limited memory. Whereas BFGS requires storing a dense matrix, L-BFGS only requires storing 5-20 vectors to approximate the matrix implicitly and constructs the matrix-vector product on-the-fly via a two-loop recursion.
In the deterministic or full-batch setting, L-BFGS constructs an approximation to the Hessian by collecting curvature pairs $(s_k, y_k)$ defined by differences in consecutive iterates and gradients, i.e. $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$. In our implementation, the curvature pairs are updated after an optimization step is taken (which yields $x_{k+1}$).
The updating equation is $H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T$, where $\rho_k = 1 / (y_k^T s_k)$.
We only keep track of y and s, not the matrix itself; the product $H_k \nabla f$ is rebuilt on the fly by the two-loop recursion. We have the following result:
def lbfgs(f, grad, x_0, alpha=0.1, verbose=0, steepest=1, m=20, error=1e-5):
    xk = x_0.reshape(len(x_0))
    I = np.identity(xk.size)
    sks = []
    yks = []
    x_iters_lbfgs = [xk]

    def Hp(H0, p):
        # two-loop recursion: computes H_k @ p from the stored (s, y) pairs
        m_t = len(sks)
        q = p
        a = np.zeros(m_t)
        b = np.zeros(m_t)
        for i in reversed(range(m_t)):
            s = sks[i]
            y = yks[i]
            rho_i = float(1.0 / y.dot(s))
            a[i] = rho_i * s.dot(q)
            q = q - a[i] * y
        r = H0.dot(q)
        for i in range(m_t):
            s = sks[i]
            y = yks[i]
            rho_i = float(1.0 / y.dot(s))
            b[i] = rho_i * y.dot(r)
            r = r + s * (a[i] - b[i])
        return r

    for i in range(1, n_iters):
        if verbose: print('Iteration: ', i)
        if verbose: print('Current x: ', xk)
        gk = grad(xk)
        pk = Hp(I, gk)
        if steepest: alpha = alpha_search(f, xk, pk)
        # update x
        xk1 = xk - alpha * pk
        gk1 = grad(xk1)
        # define sk and yk for convenience
        sk = xk1 - xk
        yk = gk1 - gk
        sks.append(sk)
        yks.append(yk)
        if len(sks) > m:  # keep only the most recent m curvature pairs
            sks = sks[1:]
            yks = yks[1:]
        if verbose: print('New x: ', xk1, '\n')
        if np.linalg.norm(xk1 - xk) < error:
            xk = xk1
            break
        x_iters_lbfgs.append(xk1)
        xk = xk1
    return np.array(x_iters_lbfgs)
7. Comparison
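As a starting point for the comparison, one can run the methods on a common test function and count iterations until the gradient norm is small. A minimal self-contained sketch (the ill-conditioned quadratic and the step sizes are my choices; plain gradient descent vs. an exact Newton step for reference):

```python
import numpy as np

def f(x):
    return x[0]**2 + 10 * x[1]**2        # ill-conditioned quadratic

def grad(x):
    return np.array([2 * x[0], 20 * x[1]])

def run(step_fn, x0, tol=1e-6, max_iters=1000):
    # iterate step_fn until the gradient norm falls below tol
    x = x0.copy()
    for i in range(max_iters):
        if np.linalg.norm(grad(x)) < tol:
            return x, i
        x = step_fn(x)
    return x, max_iters

# plain gradient descent with a fixed step size
gd_step = lambda x: x - 0.04 * grad(x)

# Newton step (exact Hessian of this quadratic), for reference
H_inv = np.linalg.inv(np.array([[2.0, 0.0], [0.0, 20.0]]))
newton_step = lambda x: x - H_inv @ grad(x)

x0 = np.array([5.0, 5.0])
x_gd, n_gd = run(gd_step, x0)
x_nt, n_nt = run(newton_step, x0)
print('GD iterations:', n_gd, 'Newton iterations:', n_nt)
```

On this problem Newton converges in a single step, while fixed-step gradient descent needs on the order of a hundred iterations; the quasi-Newton methods above sit between these extremes.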