Optimization algorithms are very similar to root-finding algorithms. The general structure goes something like: a) start with an initial guess, b) calculate the result of the guess, c) update the guess based on the result and some further conditions, d) repeat until you are satisfied with the result.
The difference is the stopping criterion: a root-finding algorithm stops when the function value f(x) converges to 0, whereas an optimization algorithm stops when the derivative f'(x) converges to 0.
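As a minimal illustration of the two stopping rules, the sketch below (with simple 1-D functions chosen purely for illustration) runs Newton's method until |f(x)| is small, and gradient descent until |f'(x)| is small:

```python
def newton_root(f, fprime, x, tol=1e-8):
    # Root finding: iterate until f(x) itself is close to 0.
    while abs(f(x)) > tol:
        x = x - f(x) / fprime(x)
    return x

def gd_minimize(fprime, x, lr=0.1, tol=1e-8):
    # Optimization: iterate until the derivative f'(x) is close to 0.
    while abs(fprime(x)) > tol:
        x = x - lr * fprime(x)
    return x

f = lambda x: x**2 - 4          # roots at -2 and +2
fp = lambda x: 2 * x            # derivative of f
g = lambda x: (x - 3)**2        # minimum at x = 3
gp = lambda x: 2 * (x - 3)      # derivative of g

print(newton_root(f, fp, 1.0))  # close to 2
print(gd_minimize(gp, 0.0))     # close to 3
```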
1. Basic function
It is important to obtain the derivative value in continuous optimization problems. We can approximate it numerically:
# calculate the derivative numerically
import numpy as np
from scipy.optimize import approx_fprime

f = lambda x: x[0]**2 + x[1]**2 - 20
approx_fprime([3, 2], f, 0.0001)  # approximately [6, 4]
Or we can calculate the value directly if we know the explicit form.
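For instance, for the f above the exact gradient is (2*x[0], 2*x[1]), and the numerical and analytic values agree closely:

```python
import numpy as np
from scipy.optimize import approx_fprime

f = lambda x: x[0]**2 + x[1]**2 - 20
fd = lambda x: 2 * np.asarray(x)     # exact gradient of f

x = np.array([3.0, 2.0])
numeric = approx_fprime(x, f, 1e-4)  # finite-difference approximation
exact = fd(x)
print(numeric, exact)                # both close to [6, 4]
```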
2. Gradient descent
The GD algorithm uses the gradient to find the direction to move in, and then moves a certain step along it.
def GD(f, x, delta=0.1, tol=.0001):
    print(str(x) + ":" + str(f(x)))
    # move one step against the numerical gradient, then recurse
    xn = x - delta * approx_fprime(x, f, 0.0001)
    if abs(f(x) - f(xn)) > tol:
        GD(f, xn, delta, tol)

f = lambda x: np.sum((x - 2)**2) + 20
GD(f, np.array([1, 2]))
In the above example, we know f, but not its derivative f'. In some cases, we do know the derivative function f' explicitly. In that case, we can replace approx_fprime by f'.
x_iters = []

def f(x):
    return (x[0] - 2)**2 + (x[1] - 2)**2 + 20

fd = lambda x: (x - 2) * 2  # the exact gradient

def GD1(f, fd, x, delta=0.01, tol=.0001):
    x_iters.append(x.reshape([len(x), 1]))
    xn = x - delta * fd(x)
    if abs(f(x) - f(xn)) > tol:
        GD1(f, fd, xn, delta, tol)

GD1(f, fd, np.array([1, 2]))
3. Least squares problem and PyTorch
In deep learning, we use very simple functions to fit the real relationship. For example, assume that fs(x) = t[0]*x[0] + t[1]*x[1] + t[2]*x[2] + …, and we have several samples [x1,y1], [x2,y2], …
The loss function is $Loss = \sum_i (t \cdot x_i - y_i)^2$.
Then the partial derivative with respect to each coefficient is $\frac{\partial Loss}{\partial t_j} = \sum_i 2 (t \cdot x_i - y_i) x_{ij}$,
so the gradient is $LossD = \nabla_t Loss = \sum_i 2 (t \cdot x_i - y_i) x_i$.
In short, we should minimize the Loss function by gradient descent, which means LossD should converge to 0.
def Loss(t):
    return np.sum((np.sum(t * X, axis=1) - Y)**2)

def LossD(t):
    return np.sum(2 * (np.sum(t * X, axis=1) - Y).reshape([len(X), 1]) * X, axis=0)

t = np.array([8, 8, 8])
X = np.array([[3, 4, 3], [8, 1, 4], [4, 2, 3], [5, 7, 9]])
Y = np.array([101, 185, 109, 206])
GD1(Loss, LossD, t, delta=0.001, tol=.01)  # The true value is [18, 5, 9]
We can use PyTorch to calculate the derivative values automatically:
import torch

t = torch.tensor([8., 8, 8], requires_grad=True)
Y = torch.tensor([101., 185, 109, 206])
X = torch.tensor([[3., 4, 3], [8, 1, 4], [4, 2, 3], [5, 7, 9]])

def GD2(t, delta=torch.scalar_tensor(0.001), tol=torch.scalar_tensor(.01)):
    Loss = torch.sum((X @ t - Y)**2)
    Loss.backward()
    while Loss.data > tol:
        print(t.data, Loss.data)
        t = (t - delta * t.grad).data  # detach the updated parameter
        t.requires_grad = True
        Loss = torch.sum((X @ t - Y)**2)
        Loss.backward()

GD2(t)  # The true value is [18, 5, 9]
Furthermore, we can use an optimizer provided by PyTorch. Since PyTorch does not provide a dedicated full-batch gradient descent optimizer, we use the SGD optimizer (which reduces to full-batch gradient descent when fed the whole dataset at once).
import torch.nn as nn
import torch.optim as optim
from torch.nn import Parameter

class LinearRegression(nn.Module):
    def __init__(self):
        super(LinearRegression, self).__init__()
        self.t = Parameter(torch.Tensor(3, 1))
        self.t.data = torch.tensor([8., 8, 8])

    def forward(self, x):
        out = x @ self.t
        return out

model = LinearRegression()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)
out = model(X)
loss = criterion(out, Y)
while loss > 0.01:
    out = model(X)
    loss = criterion(out, Y)
    print(model.t.data, loss.data * len(Y))  # scale MSE back to the summed loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
4. Momentum algorithm
In some cases, gradient descent converges very slowly, or even worse, does not converge at all. We may adopt the momentum algorithm and give it a try. For example:
f = lambda x: (x[0] - 2)**2 + abs(x[1])
4.1 naive momentum
def MOMENTUM(f, x, lr=0.1, discount=0.3, tol=0.001):
    g = approx_fprime(x, f, 0.0001)
    pre_g = lr * g  # velocity term
    xn = x - pre_g
    while abs(f(x) - f(xn)) > tol:
        x = xn
        # accumulate the discounted previous velocity into the new step
        pre_g = lr * approx_fprime(x, f, 0.0001) + discount * pre_g
        xn = x - pre_g
        print(str(xn) + ":" + str(f(xn)))

MOMENTUM(f, np.array([1, 1]))
4.2 Nesterov
def NESTEROV(f, x, lr=0.01, discount=0.1, tol=0.01):
    g = approx_fprime(x, f, 0.00001)
    pre_g = g
    xn = x - lr * g
    while abs(f(x) - f(xn)) > tol:
        x = xn
        # look ahead along the accumulated direction before taking the gradient
        xf = x - pre_g * discount * lr
        pre_g = approx_fprime(xf, f, 0.0001) + discount * pre_g
        xn = x - pre_g * lr
        print(str(xn) + ":" + str(f(xn)))

NESTEROV(f, np.array([1, 1]))
5. Newton method
If we know the second derivative (the Hessian) of the objective, we can use Newton's method.
Computing the Hessian matrix exactly is expensive, so in practice we usually approximate it.
Assume the loss function is $Loss = (Xt - Y)^T (Xt - Y)$.
Then $LossD = \nabla_t Loss = 2 (Xt - Y)^T X$, and the Hessian is $LossH = 2 X^T X$.
Newton's method updates $t \leftarrow t - LossH^{-1} LossD$; for this quadratic loss, a single step reaches the optimum.
t = np.array([8, 8, 8])
X = np.array([[3, 4, 3], [8, 1, 4], [4, 2, 3], [5, 7, 9]])
Y = np.array([101, 185, 109, 206])

def Loss(t):
    # equivalent to np.sum((np.sum(t*X, axis=1) - Y)**2)
    A = X @ t - Y
    return A @ A

def LossD(t):
    # equivalent to np.sum(2*(np.sum(t*X, axis=1) - Y).reshape([len(X), 1])*X, axis=0)
    return 2 * (X @ t - Y) @ X

def LossHInv(t):
    return np.linalg.inv(2 * X.T @ X)

t - LossHInv(t) @ LossD(t)  # One step to the optimal solution [18, 5, 9]
We can also use PyTorch to calculate the Hessian values:
import torch

t = torch.tensor([8., 8, 8], requires_grad=True)
X = torch.tensor([[3., 4, 3], [8, 1, 4], [4, 2, 3], [5, 7, 9]])
Y = torch.tensor([101., 185, 109, 206])
loss = torch.sum((X @ t - Y)**2)
grad = torch.autograd.grad(loss, t, retain_graph=True, create_graph=True)
hess_rows = torch.tensor([])
for anygrad in grad[0]:  # torch.autograd.grad returns a tuple
    hess_rows = torch.cat((hess_rows, torch.autograd.grad(anygrad, t, retain_graph=True)[0]))
t.data - torch.inverse(hess_rows.view(t.size()[0], -1)) @ grad[0].data
6. Quasi-Newton Method
6.1 basic test
Let's start with a simple test problem (the objective function and its plot are omitted in the source).
Let's use the gradient descent method as a baseline. The function below is similar to GD1 above, with some new features such as line search and a verbose flag.
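The routines in this section also reference a global n_iters and a line-search helper alpha_search that are not shown in the source; a minimal sketch of both (the grid-based line search is my assumption, not necessarily the original implementation):

```python
import numpy as np

n_iters = 100  # global iteration cap used by the routines below

def alpha_search(f, x, p, alphas=np.linspace(1e-4, 1.0, 50)):
    """Crude line search (an assumed implementation): try a grid of step
    sizes and keep the one giving the smallest f(x - alpha * p)."""
    values = [f(x - a * p) for a in alphas]
    return alphas[int(np.argmin(values))]

# quick check on a quadratic with minimum at the origin
f = lambda x: np.sum(x**2)
x = np.array([3.0, 2.0])
g = 2 * x                        # gradient, used as the search direction
best = alpha_search(f, x, g)
print(best, f(x - best * g))     # step near 0.5 drives f close to 0
```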
def gradient_descent(f, grad, x_0, alpha=0.1, verbose=0, steepest=0):
    x_curr = x_0
    x_iters = [x_curr]
    for i in range(1, n_iters):
        g = grad(x_curr)
        if verbose: print('Iteration: %d\nCurrent x: %s\nGradient: %s' % (i, x_curr, g))
        if steepest:
            # line search: pick the best step size along the current direction
            best_alpha = alpha_search(f, x_curr, g)
            alpha = best_alpha
        x_new = x_curr - alpha * g
        if verbose: print('New x: ', x_new, '\n')
        x_iters.append(x_new)
        x_curr = x_new
    return np.array(x_iters)
6.2 SR1 method
Next we test the SR1 (symmetric rank-1) quasi-Newton method.
We use $B_k$ to approximate the Hessian $\nabla^2 f$ (or $H_k = B_k^{-1}$ to approximate its inverse).
Assume that $B_{k+1} = B_k + \sigma v v^T$; imposing the secant condition $B_{k+1} s_k = y_k$, we get $B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}$.
I do not know why people chose these particular letters, but remember that in the following function, s and y stand for $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$.
def sr1(f, grad, x_0, alpha=0.1, verbose=0, steepest=1):
    # Start from the identity matrix
    B_curr = np.identity(len(x_0))
    H_curr = np.linalg.inv(B_curr)
    x_curr = x_0
    x_new = x_curr - alpha * grad(x_curr)
    s_curr = x_new - x_curr
    y_curr = grad(x_new) - grad(x_curr)
    x_iters_sr1 = [x_0]
    for i in range(1, n_iters):
        if verbose: print('Iteration: ', i)
        if verbose: print('Current x: ', x_curr)
        g = grad(x_curr)
        p = H_curr.dot(g)
        if steepest: alpha = alpha_search(f, x_curr, p)
        x_new = x_curr - alpha * p
        if verbose: print('New x: ', x_new, '\n')
        s_curr = x_new - x_curr
        y_curr = grad(x_new) - grad(x_curr)
        # This is the essence of the SR1 algorithm: rank-1 updates of B and H.
        b = y_curr - B_curr.dot(s_curr)
        B_new = B_curr + np.outer(b, b) / (b.dot(s_curr) + 10**-8)
        a = s_curr - H_curr.dot(y_curr)
        H_new = H_curr + np.outer(a, a) / (a.dot(y_curr) + 10**-8)
        r = 10**-8
        # only update when s.b > r * sqrt((s.s)(b.b)), to keep the update stable
        if s_curr.dot(b) > r * np.sqrt(s_curr.dot(s_curr) * b.dot(b)):
            if verbose: print('Update')
            H_curr = H_new
            B_curr = B_new
        x_iters_sr1.append(x_new)
        x_curr = x_new
    return np.array(x_iters_sr1)
6.3 SR2: DFP and BFGS
The SR2 (symmetric rank-2) quasi-Newton method has two forms.
Let $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$.
(0) Sherman-Morrison formula: $(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}$.
(1) Updating the inverse Hessian as $H_{k+1} = H_k + \frac{s_k s_k^T}{s_k^T y_k} - \frac{H_k y_k y_k^T H_k}{y_k^T H_k y_k}$ is DFP.
(2) Updating the Hessian as $B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k}$ is BFGS; applying the Sherman-Morrison formula twice, we have the inverse form $H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T$ with $\rho_k = 1/(y_k^T s_k)$.
We can also combine DFP and BFGS with a certain ratio, which is called the Broyden family.
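As a sketch of that idea (working directly on the inverse-Hessian approximation H; the function names and the interpolation parameter phi are mine), the Broyden family interpolates between the DFP and BFGS updates:

```python
import numpy as np

def dfp_update(H, s, y):
    # DFP update of the inverse Hessian approximation
    Hy = H @ y
    return H + np.outer(s, s) / (y @ s) - np.outer(Hy, Hy) / (y @ Hy)

def bfgs_update(H, s, y):
    # BFGS update of the inverse Hessian approximation
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    a = I - rho * np.outer(s, y)
    b = I - rho * np.outer(y, s)
    return a @ H @ b + rho * np.outer(s, s)

def broyden_update(H, s, y, phi=0.5):
    # Broyden family: convex combination of DFP and BFGS
    return (1 - phi) * dfp_update(H, s, y) + phi * bfgs_update(H, s, y)

# On a quadratic f(x) = 0.5 x^T A x, the pair (s, y = A s) is exact
A = np.array([[2.0, 0.0], [0.0, 4.0]])
s = np.array([1.0, 1.0])
y = A @ s
H = broyden_update(np.eye(2), s, y)
print(H @ y, s)  # secant condition: H y equals s
```

Both component updates satisfy the secant condition H y = s, so any convex combination of them does too.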
def dfp(f, grad, x_0, alpha=0.1, verbose=0, steepest=1):
    B = np.identity(len(x_0))  # inverse Hessian approximation
    x_curr = x_0
    x_new = x_curr - alpha * grad(x_curr)
    s_curr = x_new - x_curr
    y_curr = grad(x_new) - grad(x_curr)
    x_iters_dfp = [x_0]
    for i in range(1, n_iters):
        if verbose: print('Iteration: ', i)
        if verbose: print('Current x: ', x_curr)
        g = grad(x_curr)
        p = B.dot(g)
        if steepest: alpha = alpha_search(f, x_curr, p)
        x_new = x_curr - alpha * p
        if verbose: print('New x: ', x_new, '\n')
        s_curr = x_new - x_curr
        y_curr = grad(x_new) - grad(x_curr)
        # DFP update of the inverse Hessian approximation
        By = B.dot(y_curr)
        B += np.outer(s_curr, s_curr) / (y_curr.dot(s_curr) + 10**-8)\
            - np.outer(By, By) / (y_curr.dot(By) + 10**-8)
        x_iters_dfp.append(x_new)
        x_curr = x_new
    return np.array(x_iters_dfp)
def bfgs(f, grad, x_0, alpha=0.1, verbose=0, steepest=1):
    B_curr = np.identity(len(x_0))  # inverse Hessian approximation
    x_curr = x_0
    x_new = x_curr - alpha * grad(x_curr)
    s_curr = x_new - x_curr
    y_curr = grad(x_new) - grad(x_curr)
    x_iters_bfgs = [x_0]
    for i in range(1, n_iters):
        if verbose: print('Iteration: ', i)
        if verbose: print('Current x: ', x_curr)
        g = grad(x_curr)
        p = B_curr.dot(g)
        if steepest: alpha = alpha_search(f, x_curr, p)
        x_new = x_curr - alpha * p
        if verbose: print('New x: ', x_new, '\n')
        s_curr = x_new - x_curr
        y_curr = grad(x_new) - grad(x_curr)
        # BFGS update of the inverse Hessian approximation
        rho = 1. / (y_curr.dot(s_curr) + 10**-8)
        a = np.identity(len(x_0)) - rho * np.outer(s_curr, y_curr)
        b = np.identity(len(x_0)) - rho * np.outer(y_curr, s_curr)
        B_curr = a.dot(B_curr.dot(b)) + rho * np.outer(s_curr, s_curr)
        x_iters_bfgs.append(x_new)
        x_curr = x_new
    return np.array(x_iters_bfgs)
6.4 L-BFGS
L-BFGS is one particular optimization algorithm in the family of quasi-Newton methods that approximates the BFGS algorithm using limited memory. Whereas BFGS requires storing a dense matrix, L-BFGS only requires storing 5-20 vectors to approximate the matrix implicitly and constructs the matrix-vector product on-the-fly via a two-loop recursion.
In the deterministic or full-batch setting, L-BFGS constructs an approximation to the Hessian by collecting curvature pairs $(s_k, y_k)$ defined by differences in consecutive iterates and gradients, i.e. $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$. In our implementation, the curvature pairs are updated after an optimization step is taken (which yields $x_{k+1}$).
The updating equation is $H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T$, where $\rho_k = 1 / (y_k^T s_k)$.
We only keep track of y and s, not the matrix itself; the product $H_k \nabla f$ is rebuilt on the fly by the two-loop recursion. We have the following result:
def lbfgs(f, grad, x_0, alpha=0.1, verbose=0, steepest=1, m=20, error=1e-5):
    xk = x_0.reshape(len(x_0))
    I = np.identity(xk.size)
    sks = []
    yks = []
    x_iters_lbfgs = [xk]

    def Hp(H0, p):
        # two-loop recursion: computes H_k @ p from the stored (s, y) pairs
        m_t = len(sks)
        q = p
        a = np.zeros(m_t)
        b = np.zeros(m_t)
        for i in reversed(range(m_t)):
            s = sks[i]
            y = yks[i]
            rho_i = float(1.0 / y.dot(s))
            a[i] = rho_i * s.dot(q)
            q = q - a[i] * y
        r = H0.dot(q)
        for i in range(m_t):
            s = sks[i]
            y = yks[i]
            rho_i = float(1.0 / y.dot(s))
            b[i] = rho_i * y.dot(r)
            r = r + s * (a[i] - b[i])
        return r

    for i in range(1, n_iters):
        if verbose: print('Iteration: ', i)
        if verbose: print('Current x: ', xk)
        gk = grad(xk)
        pk = Hp(I, gk)
        if steepest: alpha = alpha_search(f, xk, pk)
        # update x
        xk1 = xk - alpha * pk
        gk1 = grad(xk1)
        # define sk and yk for convenience
        sk = xk1 - xk
        yk = gk1 - gk
        sks.append(sk)
        yks.append(yk)
        if len(sks) > m:  # keep only the most recent m curvature pairs
            sks = sks[1:]
            yks = yks[1:]
        if verbose: print('New x: ', xk1, '\n')
        if np.linalg.norm(xk1 - xk) < error:
            xk = xk1
            break
        x_iters_lbfgs.append(xk1)
        xk = xk1
    return np.array(x_iters_lbfgs)
7. Comparison
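As a starting point for the comparison, one can run the methods on a common test function and count iterations until the gradient norm is small. A minimal self-contained sketch (the ill-conditioned quadratic and the step sizes are my choices; plain gradient descent vs. an exact Newton step for reference):

```python
import numpy as np

def f(x):
    return x[0]**2 + 10 * x[1]**2        # ill-conditioned quadratic

def grad(x):
    return np.array([2 * x[0], 20 * x[1]])

def run(step_fn, x0, tol=1e-6, max_iters=1000):
    # iterate step_fn until the gradient norm falls below tol
    x = x0.copy()
    for i in range(max_iters):
        if np.linalg.norm(grad(x)) < tol:
            return x, i
        x = step_fn(x)
    return x, max_iters

# plain gradient descent with a fixed step size
gd_step = lambda x: x - 0.04 * grad(x)

# Newton step (exact Hessian of this quadratic), for reference
H_inv = np.linalg.inv(np.array([[2.0, 0.0], [0.0, 20.0]]))
newton_step = lambda x: x - H_inv @ grad(x)

x0 = np.array([5.0, 5.0])
x_gd, n_gd = run(gd_step, x0)
x_nt, n_nt = run(newton_step, x0)
print('GD iterations:', n_gd, 'Newton iterations:', n_nt)
```

On this problem Newton converges in a single step, while fixed-step gradient descent needs on the order of a hundred iterations; the quasi-Newton methods above sit between these extremes.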