Solving Linear Regression: Gradient Descent and the Normal Equation

The Regression Problem

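Regression belongs to the supervised learning part of machine learning: it uses labeled data to learn a hypothesis function, which is then used to predict results for new test data. When the predicted value is continuous, the task is a regression problem, and linear regression is the usual starting point for predicting a continuous value. When the predicted value is discrete, the task is a classification problem, for which logistic regression is a common model despite its name.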

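For example:

Suppose we have a large batch of data containing house areas and the corresponding prices. If we can learn the relationship between area and price, then given a new house we only need to know its area to estimate its price.

This kind of prediction problem can be solved with a regression model. Because the predicted value is continuous and we model it as a linear function of the features, it is a linear regression problem.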

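Under a regression model, a prediction problem is solved in these steps:

Accumulate knowledge: the stored knowledge is called the training set. The name is easy to understand: knowledge is what trains a person to improve.
Learn: learn how to predict, that is, find the relationship between input and output. The learning stage needs sound guidance, because enthusiasm alone does not win the battle. Here that guidance is called the learning algorithm.
Predict: once learning is complete and new data (input) arrives, we use the relationship obtained during learning to predict the output.

Learning is rarely smooth. As the old saying goes, "everyone makes mistakes; nothing is better than correcting them." This gives us two requirements:

We need a way to evaluate how correct our learning is.
When learning goes poorly, we need a way to correct our learning strategy.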

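Linear Regression and Gradient Descent

Prediction

First, let us fix some common notation:

Feature: $x_j$, for example the area of a house or its number of bedrooms.
Feature vector (input): $x$. The information about one house forms one feature vector, which is made up of features.
$x_j^{(i)}$ denotes the $j$-th feature of the $i$-th feature vector.
Output vector: $y$; $y^{(i)}$ denotes the output corresponding to the $i$-th input.
Hypothesis: also called the prediction function. For example, a linear prediction function is: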

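$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\dots+\theta_nx_n=\theta^Tx$
Here $x_0 = 1$ by convention, so that $\theta_0$ fits into the product $\theta^Tx$. This expression is also called the regression equation, and $\theta$ holds the regression coefficients, which determine how accurate our predictions are.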
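
As a minimal sketch of the hypothesis above, the prediction is a single dot product once the constant feature $x_0 = 1$ is prepended. The function name `predict` and the coefficient values are illustrative assumptions, not part of the original code:

```python
import numpy as np

def predict(theta, features):
    """Evaluate h_theta(x) = theta^T x, prepending x_0 = 1 for the intercept."""
    x = np.concatenate(([1.0], np.asarray(features, dtype=float)))
    return float(np.dot(theta, x))

# Hypothetical coefficients: intercept 50, 2 per unit of area, 10 per bedroom.
theta = np.array([50.0, 2.0, 10.0])
price = predict(theta, [100.0, 3.0])  # 50 + 2*100 + 10*3
print(price)
```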

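Error Evaluation: The Cost Function

As noted earlier, we need some way to evaluate how well we have learned, that is, to measure the difference between each true value $y^{(i)}$ and the predicted value $h_\theta(x^{(i)})$. Most commonly, the error is described with least mean squares (LMS):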

$J(\theta)=\frac{1}{2m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2,\quad \mbox{where $m$ is the number of samples}$

Its matrix form is:

$J(\theta)=\frac{1}{2m}(X\theta-y)^T(X\theta-y)$

In machine learning, the function used to evaluate the error is called the cost function.
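
To make the two forms concrete, here is a small sketch that computes $J(\theta)$ both as the explicit sum and in matrix form and shows they agree. The names `cost_sum` and `cost_matrix` and the three data points are made up for illustration:

```python
import numpy as np

def cost_sum(theta, X, y):
    """J(theta) as the explicit sum over the m samples."""
    m = len(y)
    total = 0.0
    for i in range(m):
        total += (np.dot(X[i], theta) - y[i]) ** 2
    return total / (2 * m)

def cost_matrix(theta, X, y):
    """The same J(theta) in matrix form: (X theta - y)^T (X theta - y) / (2m)."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))

# Made-up data: the first column of X is the constant feature x_0 = 1.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 4.0])
theta = np.array([0.0, 1.0])

j_sum = cost_sum(theta, X, y)
j_mat = cost_matrix(theta, X, y)
print(j_sum, j_mat)
```

With these numbers the predictions are 1, 2, 3, so only the last sample contributes an error of 1 and the cost is 1/6.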

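Batch Gradient Descent

Introducing the cost function gives us a way to evaluate how correct our learning is. Next we address the second requirement: a way to correct the learning strategy when results are poor.

That way is to adjust $\theta$ repeatedly so that the cost $J(\theta)$ becomes small enough, which in turn makes the predictions accurate enough. In linear regression this is usually done with gradient descent, which adjusts $\theta$ by subtracting the partial derivative: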

$\theta_j = \theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta) \quad \mbox{where $\alpha$ is the learning rate}$

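Mathematically, the gradient points in the direction in which the function increases fastest, so the negative gradient is the direction of steepest descent. By moving against the gradient of $J(\theta)$ we approach its minimum (or a local minimum), and with it better prediction accuracy. The learning rate $\alpha$ is a rather delicate parameter: it sets the step size along that direction. If the step is too large, a single step may overshoot the minimum; if it is too small, reaching the minimum takes much longer. In practice, candidate learning rates can be tried in steps of roughly 3x and 10x, for example: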

$\alpha=0.001, 0.003, 0.01, \dots, 0.3, 1$
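
The effect of the step size is easy to see on a toy problem. The sketch below is not the regression cost itself: it assumes a one-dimensional cost $J(t) = (t-3)^2$ purely for illustration, and `run_gradient_descent` is a hypothetical helper name:

```python
def run_gradient_descent(alpha, steps=50, start=0.0):
    """Minimise the toy cost J(t) = (t - 3)^2 and return the final t."""
    t = start
    for _ in range(steps):
        gradient = 2.0 * (t - 3.0)  # dJ/dt
        t = t - alpha * gradient
    return t

for alpha in (0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0):
    final_t = run_gradient_descent(alpha)
    print(f"alpha={alpha:<5} final t={final_t:.4f} "
          f"distance to minimum={abs(final_t - 3.0):.4f}")
```

On this toy cost the smallest rates have barely moved after 50 steps, 0.3 lands on the minimum, and 1.0 jumps back and forth between 0 and 6 without ever getting closer.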
The partial derivative $\frac{\partial}{\partial\theta_j}J(\theta)$ is easy to obtain with the ordinary rules of partial differentiation:

$\frac{\partial}{\partial\theta_j}J(\theta)=\frac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$
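
For readers who want the intermediate steps, here is a worked derivation using only the definitions of $J(\theta)$ and $h_\theta(x)$ given earlier. It applies the chain rule to each squared term and uses $\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) = x_j^{(i)}$:

```latex
\begin{aligned}
\frac{\partial}{\partial\theta_j} J(\theta)
  &= \frac{1}{2m}\sum_{i=1}^{m} \frac{\partial}{\partial\theta_j}
     \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \\
  &= \frac{1}{2m}\sum_{i=1}^{m} 2\left(h_\theta(x^{(i)}) - y^{(i)}\right)
     \frac{\partial}{\partial\theta_j} h_\theta(x^{(i)}) \\
  &= \frac{1}{m}\sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}
\end{aligned}
```

Subtracting $\alpha$ times this derivative flips the sign inside the sum, which is why the update rule is written with $(y^{(i)} - h_\theta(x^{(i)}))$ and a plus sign.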

So for a training set of $m$ samples, the tuning procedure for $\theta$ is:

$\begin{aligned} & \mbox{Repeat until convergence:} \\ & \quad \theta_j = \theta_j+\alpha\frac{1}{m}\sum\limits_{i=1}^{m}(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)} \quad \mbox{(for every $j$, simultaneously)} \end{aligned}$
In matrix (vector) form, this update is:

$\theta_j = \theta_j + \alpha\frac{1}{m}(y-X\theta)^Tx_j$
where $x_j$ is the $j$-th column of $X$ and the cost function is:

$J(\theta)=\frac{1}{2m}\sum\limits_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2$
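We call this procedure LMS-based batch gradient descent. On the one hand, although it converges to the minimum (given a suitable learning rate), every update of a single $\theta_j$ has to traverse the whole sample set, which is very expensive when the number of samples $m$ is large. On the other hand, because the update can be written in vector form, it can take advantage of parallel computation to improve performance.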
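
The vector form translates almost directly into NumPy. The following is a minimal sketch under stated assumptions: it uses plain arrays rather than the matrix type used in the full listing later, a fixed iteration count instead of a convergence test, and synthetic noise-free data from $y = 1 + 2x$; the name `batch_gradient_descent` is my own:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.5, iterations=1000):
    """Update all theta_j simultaneously using every sample in each step."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        residual = y - X @ theta  # y - X theta, shape (m,)
        # theta_j += alpha/m * sum_i residual_i * x_j^(i), for all j at once
        theta = theta + alpha * (X.T @ residual) / m
    return theta

# Synthetic data: 20 points on the line y = 1 + 2x (assumed for illustration).
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack((np.ones_like(x), x))  # first column is x_0 = 1
y = 1.0 + 2.0 * x

theta = batch_gradient_descent(X, y)
print(theta)  # should be close to [1, 2]
```

Computing `X.T @ residual` once per step is what makes the update simultaneous: every $\theta_j$ is adjusted from the same residual vector.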

Stochastic Gradient Descent

To address the performance problem of batch gradient descent, stochastic gradient descent (SGD) was introduced:

$\begin{aligned} & \mbox{Repeat until convergence:} \\ & \quad \mbox{for $i=1$ to $m$}: \\ & \quad \quad \theta_j = \theta_j+\alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)} \quad \mbox{(for every $j$)} \end{aligned}$
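As we can see, in stochastic gradient descent each update of $\theta_j$ needs only one sample, $(x^{(i)}, y^{(i)})$. Even when the sample set is huge, we are likely to get close to the optimum quickly, and SGD then brings a clear performance gain.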
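
Here is a minimal sketch of the per-sample update. Two things in it are my assumptions rather than part of the formula above: the samples are visited in a freshly shuffled order on each pass (the formula simply runs $i = 1$ to $m$), and the loop runs for a fixed number of passes. The data is again synthetic and noise-free, which is why a constant learning rate settles on the exact line:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.1, epochs=200, seed=0):
    """Update theta from one sample at a time, in random order each pass."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):     # shuffled pass over the samples
            error = y[i] - X[i] @ theta  # scalar residual of sample i
            theta = theta + alpha * error * X[i]
    return theta

# Synthetic data: 20 points on the line y = 1 + 2x (assumed for illustration).
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack((np.ones_like(x), x))
y = 1.0 + 2.0 * x

theta = stochastic_gradient_descent(X, y)
print(theta)  # should be close to [1, 2]
```

On noisy real data the same loop would keep jittering around the optimum instead of settling on it, which is the behaviour summarised in the comparison that follows.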

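Method	Summary	Advantages	Disadvantages
Batch gradient descent	Minimizes the total prediction cost over all training samples	Reaches the optimum; supports parallel computation	Performance drops markedly when the sample set is large
Stochastic gradient descent	Minimizes the prediction cost of each training sample in turn	Fast training	Not guaranteed to reach the global optimum; updates are often jittery and noisy; cannot easily be sped up by parallel computation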

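Normal Equation

In the linear regression discussion above we used gradient descent to find the minimum of $J(\theta)$, but tuning the learning rate $\alpha$ can be quite frustrating. If the goal is the simplest possible implementation, we can instead minimize $J(\theta)$ with the normal equation and obtain the parameters $\theta$ directly:

Here is the formula: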

$\theta=(X^TX)^{-1}X^Ty$
For the derivation, see (it is written up in detail there, so I will not repeat it here)
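
In NumPy the formula takes only a few lines. This sketch makes one deliberate substitution, which I want to name plainly: instead of forming the inverse, it solves the linear system $(X^TX)\theta = X^Ty$ with `np.linalg.solve`, which gives the same $\theta$ when $X^TX$ is invertible. The function name `normal_equation` and the four data points are made up, and `np.linalg.lstsq` is used only as a cross-check:

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y without forming the inverse explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ y)

x = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack((np.ones_like(x), x))  # first column is x_0 = 1
y = np.array([3.1, 4.9, 7.2, 8.8])         # made-up values, roughly y = 1 + 2x

theta = normal_equation(X, y)
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta, theta_lstsq)
```

Working the sums by hand for these four points gives an intercept of 1.15 and a slope of 1.94, and both calls agree with that.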

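Gradient descent vs. the normal equation:

Gradient descent	Normal equation
Requires choosing a suitable learning rate $\alpha$	No learning rate $\alpha$ needed
Requires many iterations	No iteration needed; on platforms such as Matlab the matrix computation takes a single line of code
Scales well with the number of features and still works when there are many	Complexity is $O(n^3)$, so it is unsuitable when the feature dimension is very high (in particular above about 10,000)
Can be applied to more complex algorithms such as logistic regression	Requires $X^TX$ to be invertible, and does not work for some more complex algorithms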
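
The invertibility requirement is easy to trigger: duplicate a feature and $X^TX$ becomes singular. The sketch below shows this on made-up data and then uses the Moore-Penrose pseudo-inverse (`np.linalg.pinv`) as one possible workaround. That workaround is my addition, not something the comparison above prescribes; it returns the least-squares solution with the smallest norm:

```python
import numpy as np

# Made-up design matrix whose third column duplicates the second,
# so X^T X is singular and has no ordinary inverse.
X = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])  # generated from y = 1 + x

rank = np.linalg.matrix_rank(X.T @ X)  # lower than the 3 columns of X
theta = np.linalg.pinv(X) @ y          # minimum-norm least-squares solution
print(rank, theta, X @ theta)
```

The weight of the duplicated feature is split evenly between its two copies, and the predictions `X @ theta` still reproduce `y`.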

Code Implementation

The data set used here is ex1data1.txt from the week 2 assignment of the well-known introductory Coursera course Machine Learning.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import time

def costTime(func):
    """Decorator: return (result, elapsed seconds) instead of just the result."""
    def newFunc(*args, **kwargs):
        t0 = time.perf_counter()
        back = func(*args, **kwargs)
        t1 = time.perf_counter()
        return back, t1 - t0
    return newFunc

def loadDataSet(filename):
    """
    Load the data.

    The Coursera "Machine Learning" data files use this format:
     "feature1,feature2,... ,label"

    :param filename: file name
    :return:
     X: training sample matrix
     y: label matrix (column vector)

    """
    X = []
    y = []
    with open(filename) as file:  # the with-block closes the file for us
        for line in file:
            if not line.strip():  # skip blank lines
                continue
            curLine = line.strip().split(',')
            X.append([float(v) for v in curLine[:-1]])
            y.append(float(curLine[-1]))
    # np.matrix keeps `*` as matrix multiplication in the code below
    # (the np.mat alias was removed in NumPy 2.0)
    return np.matrix(X), np.matrix(y).T

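def h(theta, x):
    '''
    Prediction function
    :param theta: regression coefficient matrix (column vector)
    :param x: feature vector (column vector)
    :return: predicted value
    '''
    return (theta.T * x)[0, 0]  # the product is a 1x1 matrix, so [0, 0] extracts the scalar

def J(theta, X, y):
    '''
    Cost function
    :param theta: regression coefficient matrix
    :param X: sample matrix
    :param y: label matrix
    :return: cost J(theta)
    '''
    m = len(X)
    return ((X * theta - y).T * (X * theta - y))[0, 0] / (2 * m)  # [0, 0] for the same reason as above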


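@costTime
def bgd(rate, maxLoop, epsilon, X, y):
    """
    Batch gradient descent
    :param rate: learning rate
    :param maxLoop: maximum number of iterations
    :param epsilon: convergence threshold (iteration stops once the cost is at or below it)
    :param X: sample matrix
    :param y: label matrix
    :return: theta: final parameters; errors: list of the cost after each iteration;
             thetas: dict mapping j to the history of theta_j
    """
    m, n = X.shape
    theta = np.zeros([n, 1])  # coefficients start as an n x 1 matrix of zeros
    count = 0  # iteration counter
    converged = False
    errors = []
    thetas = {}  # dict so each history can be looked up by index
    for j in range(n):
        thetas[j] = [theta[j, 0]]
    while count < maxLoop and not converged:
        count = count + 1
        residual = y - X * theta  # computed once per pass so all theta_j update simultaneously
        for j in range(n):
            deriv = (residual.T * X[:, j])[0, 0] / m  # [0, 0] extracts the scalar
            theta[j, 0] = theta[j, 0] + rate * deriv
            thetas[j].append(theta[j, 0])
        error = J(theta, X, y)
        errors.append(error)
        if error <= epsilon:
            converged = True
    return theta, errors, thetas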

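@costTime
def sgd(rate, maxLoop, epsilon, X, y):
    """
    Stochastic gradient descent
    :param rate: learning rate
    :param maxLoop: maximum number of passes over the data
    :param epsilon: convergence threshold (iteration stops once the cost is at or below it)
    :param X: sample matrix
    :param y: label matrix
    :return: theta: final parameters; errors: list of the cost after each update;
             thetas: dict mapping j to the history of theta_j
    """
    m, n = X.shape
    theta = np.zeros([n, 1])
    count = 0  # number of passes
    converged = False
    errors = []
    thetas = {}
    for j in range(n):
        thetas[j] = [theta[j, 0]]
    while count < maxLoop and not converged:
        count = count + 1
        for i in range(m):
            diff = (y[i, 0] - X[i, :] * theta)[0, 0]  # scalar residual of sample i
            for j in range(n):
                theta[j, 0] = theta[j, 0] + rate * diff * X[i, j]
                thetas[j].append(theta[j, 0])
            error = J(theta, X, y)
            errors.append(error)
            if error <= epsilon:
                converged = True
                break  # stop the pass as soon as the threshold is met
    return theta, errors, thetas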

@costTime
def standRegres(X, y):
    """Solve the normal equation theta = (X^T X)^{-1} X^T y."""
    xTx = X.T * X
    if np.linalg.det(xTx) == 0.0:
        print('This matrix is singular, cannot do inverse')
        return
    ws = xTx.I * (X.T * y)
    return ws

def test_bgd():
    X, y = loadDataSet("ex1data1.txt")  # week 2 exercise data from Coursera's "Machine Learning"
    m, n = X.shape
    X = np.concatenate((np.ones((m,1)), X),axis=1)
    rate = 0.01
    maxLoop = 1000
    epsilon = 0.00001
    result, timeConsumed = bgd(rate, maxLoop, epsilon, X, y)
    theta, errors, thetas = result
    print('Ran at most [%s] iterations in [%s] s \n parameter matrix:\n %s' % (maxLoop, timeConsumed, theta))
    fittingFig = plt.figure()
    title = 'bgd: rate=%.2f, maxLoop=%d, epsilon=%.3f \n time: %ds' % (rate, maxLoop, epsilon, timeConsumed)
    ax = fittingFig.add_subplot(111, title=title)

    trainingSet = ax.scatter(X[:, 1].flatten().A[0], y[:, 0].flatten().A[0])

    xCopy = X.copy()
    xCopy.sort(0)
    yHat = xCopy*theta
    fittingLine, = ax.plot(xCopy[:, 1].A1, yHat.A1, color='g')  # .A1 flattens each matrix to a 1-D array

    ax.set_xlabel('Population of City in 10,000s')
    ax.set_ylabel('Profit in $10,000s')

    plt.legend([trainingSet, fittingLine], ['Training Set', 'Linear Regression'])
    plt.show()

    # plot the cost curve
    errorsFig = plt.figure()
    ax = errorsFig.add_subplot(111)
    ax.yaxis.set_major_formatter(mtick.FormatStrFormatter('%.4f'))
    ax.plot(range((len(errors))), errors)
    ax.set_xlabel('Number of iterations')
    ax.set_ylabel('Cost J')
    plt.show()

def test_sgd():
    X, y = loadDataSet("ex1data1.txt")  # week 2 exercise data from Coursera's "Machine Learning"
    m, n = X.shape
    X = np.concatenate((np.ones((m, 1)), X), axis=1)
    rate = 0.01
    maxLoop = 1000
    epsilon = 0.001
    result, timeConsumed = sgd(rate, maxLoop, epsilon, X, y)
    theta, errors, thetas = result
    print('Ran at most [%s] passes in [%s] s \n parameter matrix:\n %s' % (maxLoop, timeConsumed, theta))
    fittingFig = plt.figure()
    title = 'sgd: rate=%.2f, maxLoop=%d, epsilon=%.3f \n time: %ds' % (rate, maxLoop, epsilon, timeConsumed)
    ax = fittingFig.add_subplot(111, title=title)

    trainingSet = ax.scatter(X[:, 1].flatten().A[0], y[:, 0].flatten().A[0])

    xCopy = X.copy()
    xCopy.sort(0)
    yHat = xCopy * theta
    fittingLine, = ax.plot(xCopy[:, 1].A1, yHat.A1, color='g')  # .A1 flattens each matrix to a 1-D array

    ax.set_xlabel('Population of City in 10,000s')
    ax.set_ylabel('Profit in $10,000s')

    plt.legend([trainingSet, fittingLine], ['Training Set', 'Linear Regression'])
    plt.show()

    # plot the cost curve
    errorsFig = plt.figure()
    ax = errorsFig.add_subplot(111)
    ax.yaxis.set_major_formatter(mtick.FormatStrFormatter('%.4f'))
    ax.plot(range((len(errors))), errors)
    ax.set_xlabel('Number of iterations')
    ax.set_ylabel('Cost J')
    plt.show()

def test_standRegress():
    X, y = loadDataSet("ex1data1.txt")  # week 2 exercise data from Coursera's "Machine Learning"
    m, n = X.shape
    X = np.concatenate((np.ones((m, 1)), X), axis=1)
    theta, timeConsumed = standRegres(X, y)
    print('Took [%s] s \n parameter matrix:\n %s' % (timeConsumed, theta))

    fittingFig = plt.figure()
    title = 'StandRegress  time: %s' % timeConsumed
    ax = fittingFig.add_subplot(111, title=title)
    trainingSet = ax.scatter(X[:, 1].flatten().A[0], y[:, 0].flatten().A[0])
    xCopy = X.copy()
    xCopy.sort(0)
    yHat = xCopy * theta
    fittingLine, = ax.plot(xCopy[:, 1].A1, yHat.A1, color='g')  # .A1 flattens each matrix to a 1-D array
    ax.set_xlabel('Population of City in 10,000s')
    ax.set_ylabel('Profit in $10,000s')
    plt.legend([trainingSet, fittingLine], ['Training Set', 'Linear Regression'])
    plt.show()


if __name__ == '__main__':
   test_bgd()
   test_sgd()
   test_standRegress()
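
If you prefer plain arrays to the matrix type, the loading step can also be sketched with `np.loadtxt`. This is an alternative I am suggesting, not the loader used in the listing above: `load_csv` is a hypothetical name, it prepends the column of ones itself, and the three rows are made-up values in the same "feature,label" layout rather than the real ex1data1.txt:

```python
import io
import numpy as np

def load_csv(source):
    """Return (X, y): X has a leading column of ones, y is the last CSV column."""
    data = np.loadtxt(source, delimiter=",", ndmin=2)
    features, y = data[:, :-1], data[:, -1]
    X = np.column_stack((np.ones(len(y)), features))
    return X, y

# Made-up rows standing in for a data file; a file name works the same way.
sample = io.StringIO("1.0,2.0\n3.0,4.5\n5.0,7.0\n")
X, y = load_csv(sample)
print(X.shape, y.shape)
```

With plain arrays, matrix products are written with `@` instead of `*`, so the rest of the listing would need the same change.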

I am also continually updating Python implementations of the machine learning algorithms I study on GitHub. Stars are welcome!
