Optimization Methods in Machine Learning [1] -- Gradient Descent

Gradient descent is the most common method for solving unconstrained optimization problems. Its basic idea is to approach the optimal solution step by step by moving, with a certain step size, in the direction of the negative gradient.

Suppose the function to be fitted is y = f(\theta, x), where x \in R^{n} and y \in R.

Given a training set \{(x^1,y^1),(x^2,y^2),\dots,(x^m,y^m)\}, we obtain the parameters \theta by minimizing the loss function J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(f(\theta,x^{i})-y^{i})^{2}.

Taking the partial derivative with respect to \theta_{j}:

\frac{\partial J}{\partial \theta_{j}}=\frac{1}{m}\sum_{i=1}^{m}(f(\theta,x^i)-y^i)x_{j}^{i}

In principle we could set this derivative to zero and solve for \theta.
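For example, when f is linear, f(\theta, x) = \theta^{T}x, setting the gradient to zero gives the normal equations, which do have a closed-form solution as long as X^{T}X is invertible (here X is the m \times n matrix whose rows are the samples x^{i}):

X^{T}X\theta = X^{T}y \quad\Rightarrow\quad \theta = (X^{T}X)^{-1}X^{T}y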

In practice, however, we often cannot obtain a closed-form solution by setting the derivative to zero, because:

1. Some functions are too complicated to differentiate analytically;

2. Even when the derivative is available, the resulting equations may have no closed-form solution.

Therefore we still need gradient descent to find the optimal \theta. Moreover, gradient descent is an iterative procedure, which makes it well suited to solution by computer.

1. Batch Gradient Descent (BGD)

    Each update of \theta uses all of the samples:

     \theta_{j} := \theta_{j} - \lambda \frac{\partial J}{\partial \theta_{j}}

     \theta_{j} := \theta_{j} - \frac{\lambda}{m}\sum_{i=1}^{m}(f(\theta,x^i)-y^i)x_{j}^{i}

     \lambda is the step size, i.e., how far we move in the direction of the negative gradient at each update. It should be neither too large nor too small: too large and the iteration may fail to converge to the optimum; too small and the iteration becomes very slow.
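     As a minimal, self-contained illustration of this trade-off (not part of the original script; the function and parameter names are hypothetical), consider gradient descent on the one-dimensional function J(\theta) = \theta^{2}, whose derivative is 2\theta:

# Toy example: gradient descent on J(theta) = theta^2 (derivative = 2*theta).
# Illustrates the step-size trade-off described above.
def toy_gd(step, n_steps=20, theta0=1.0):
    theta = theta0
    for _ in range(n_steps):
        theta = theta - step * 2 * theta      # theta := theta - lambda * dJ/dtheta
    return theta

print(toy_gd(step=0.4))    # factor |1 - 2*step| = 0.2, converges quickly toward 0
print(toy_gd(step=0.01))   # factor 0.98, converges but slowly
print(toy_gd(step=1.1))    # factor |1 - 2*step| = 1.2 > 1, the iterates diverge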

     Because each iteration of batch gradient descent uses the exact gradient over the whole data set, it converges in relatively few iterations and, for a convex loss, reaches the global optimum. However, every iteration must process all the samples, which is very inefficient when the data set is large. For large data sets, stochastic gradient descent is therefore usually used instead.
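     For a linear model, one batch update can also be written in a few vectorized NumPy lines. This is only a sketch; the names bgd_step, X and y are assumptions, not identifiers from the script later in this post:

def bgd_step(theta, X, y, step):
    # One batch update for the linear model f(theta, x) = theta . x.
    # X: (n_samples, n_features) NumPy design matrix, y: (n_samples,) targets.
    residual = X.dot(theta) - y              # f(theta, x^i) - y^i for every sample
    grad = X.T.dot(residual) / len(y)        # average gradient of J over all samples
    return theta - step * grad               # theta_j := theta_j - lambda * dJ/dtheta_j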

2. Stochastic Gradient Descent (SGD)

    Each update of \theta computes the gradient from a single sample:

     \theta_{j} := \theta_{j} - \lambda(f(\theta,x^i)-y^i)x_{j}^{i}

     The gradient direction of a single update is generally not the best descent direction, but when the number of samples is large, enough iterations will usually still bring \theta close to the optimum.
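     A comparable sketch of one SGD pass over the data (again with assumed names; the full script below follows the same idea, but also tracks the loss to decide when to stop):

import numpy as np

def sgd_epoch(theta, X, y, step):
    # One pass over the data, updating theta from a single sample at a time.
    idx = np.random.permutation(len(y))          # visit samples in random order
    for i in idx:
        residual = X[i].dot(theta) - y[i]        # f(theta, x^i) - y^i
        theta = theta - step * residual * X[i]   # single-sample gradient step
    return theta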

3. Mini-batch Gradient Descent (MBGD)

    Mini-batch gradient descent combines the advantages of batch and stochastic gradient descent: each update of \theta computes the gradient over a small batch of n' samples:

     \theta_{j} := \theta_{j} - \frac{\lambda}{n^{'}}\sum_{i=t}^{t+n^{'}}(f(\theta,x^i)-y^i)x_{j}^{i}

# -*- coding: utf-8 -*-

import numpy as np
import time

def linear(theta, x):
    # Linear model: f(theta, x) = theta . x
    return theta.dot(x.T)

def loss(pred_y, data_y):
    # Loss J(theta): mean squared error divided by 2
    n = len(data_y)
    return ((pred_y - data_y).T.dot(pred_y - data_y))/(2.0*n)

def bgd(func, data_x, data_y, step, eps, max_iter):
    # Batch gradient descent: every update uses all n samples.
    n_iter = 1
    n, m = data_x.shape
    theta = np.random.rand(m)
    cost, last_cost = 1, 0
    while cost > eps and n_iter < max_iter:
        print('n_iter:', n_iter, '  theta = ', theta, '  loss=', last_cost)
        delta = 0
        for i in range(n):                      # accumulate the averaged gradient step
            pred_y = func(theta, data_x[i])
            delta += step*(pred_y-data_y[i])*data_x[i]/n

        theta = theta - delta
        pred_y = func(theta, data_x)
        new_cost = loss(pred_y, data_y)
        cost = abs(new_cost-last_cost)          # stop when the loss change is below eps
        last_cost = new_cost
        n_iter += 1

    return theta


def sgd(func, data_x, data_y, step, eps, max_iter):
    # Stochastic gradient descent: every update uses a single sample.
    n_iter = 1
    n, m = data_x.shape
    theta = np.random.rand(m)

    # Shuffle x and y together so the samples are visited in random order.
    data = np.insert(data_x, m, data_y, axis=1)
    np.random.shuffle(data)
    data_x = data[:,0:m]
    data_y = data[:,m]

    cost, last_cost = 1, 0
    for i in range(n):
        print('n_iter:', n_iter, '  theta = ', theta, '  loss=', last_cost)
        pred_y = func(theta, data_x[i])
        theta = theta - step*(pred_y - data_y[i])*data_x[i]

        # The full-data loss is recomputed only to monitor convergence.
        pred_y = func(theta, data_x)
        new_cost = loss(pred_y, data_y)
        cost = abs(new_cost-last_cost)
        last_cost = new_cost

        if cost < eps:
            break

        n_iter += 1
        if n_iter > max_iter:
            break

    return theta
        

def mbgd(func, data_x, data_y, step, eps, max_iter):
    # Mini-batch gradient descent with a fixed batch size of 10
    # (assumes n is a multiple of the batch size).
    n_iter = 1
    n, m = data_x.shape
    theta = np.random.rand(m)

    # Shuffle x and y together so each batch is a random subset.
    data = np.insert(data_x, m, data_y, axis=1)
    np.random.shuffle(data)
    data_x = data[:,0:m]
    data_y = data[:,m]

    cost, last_cost = 1, 0
    for i in range(0, n, 10):
        print('n_iter:', n_iter, '  theta = ', theta, '  loss=', last_cost)
        delta = 0
        for j in range(10):                     # average the gradient over the batch
            pred_y = func(theta, data_x[i+j])
            delta += step*(pred_y-data_y[i+j])*data_x[i+j]/10.0

        theta = theta - delta

        pred_y = func(theta, data_x)
        new_cost = loss(pred_y, data_y)
        cost = abs(new_cost-last_cost)
        last_cost = new_cost
        if cost < eps:
            break

        n_iter += 1
        if n_iter > max_iter:
            break

    return theta


if __name__ == "__main__":
    n = 10000    # number of samples
    m = 2        # data dimension (excluding the bias column x0)

    data_x = np.random.normal(size=n*(m+1))
    data_x = data_x.reshape(-1, m+1)
    data_x[:,0] = 1.0                  # set x0 = 1 (bias term)

    # True model: y = 3*x0 + 2*x1 + 1*x2 + noise, with x0 = 1
    data_y = 3 + data_x[:,1:].dot(np.array([2, 1])) + np.random.normal(size=n)
    
    t = time.time()
    theta = sgd(linear, data_x, data_y, step = 0.001, eps = 0.000000001, max_iter = 10000)
    print(theta)
    print('- Time consumed: ', time.time()-t, ' seconds.') 
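The __main__ block above only times sgd; bgd and mbgd accept the same arguments and can be run on the same synthetic data for comparison. A minimal usage sketch (the step sizes here are illustrative and may need tuning per method):

t = time.time()
theta = bgd(linear, data_x, data_y, step=0.1, eps=0.000000001, max_iter=10000)
print('bgd:', theta)
theta = mbgd(linear, data_x, data_y, step=0.01, eps=0.000000001, max_iter=10000)
print('mbgd:', theta)
print('- Time consumed: ', time.time()-t, ' seconds.')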


Reposted from blog.csdn.net/tangdawei2010/article/details/84671865