Linear Regression (2): Solving with Gradient Descent & Python Implementation

The core idea of Gradient Descent (GD)

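Iteratively adjust the model parameters so as to minimize the cost function.

Move in the direction in which the gradient descends.
Once the gradient equals 0, a (local) minimum has been reached.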

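Steps

Initialize the parameters $\theta$ randomly (random initialization).
Improve $\theta$ step by step along the negative gradient direction (each step lowers the value of the cost function) until the algorithm converges.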

As shown in the figure below:
Figure: gradient descent
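
The two steps above (random initialization, then repeated moves along the negative gradient) can be sketched on a one-dimensional toy problem. The cost function $f(\theta) = (\theta - 3)^2$, the learning rate of 0.1 and the 100 iterations are assumptions chosen for illustration; they do not come from the original post.

```python
import numpy as np

def f(theta):
    # Toy convex cost function with its minimum at theta = 3 (assumed example).
    return (theta - 3.0) ** 2

def grad_f(theta):
    # Derivative of f with respect to theta.
    return 2.0 * (theta - 3.0)

rng = np.random.default_rng(0)
theta = rng.normal()      # step 1: random initialization
eta = 0.1                 # learning rate (assumed value)
costs = []

for _ in range(100):      # step 2: move along the negative gradient
    costs.append(f(theta))
    theta = theta - eta * grad_f(theta)

print(theta)              # ends up very close to 3
```

With this learning rate every step shrinks the distance to the minimum by a factor of 0.8, so the recorded costs decrease monotonically.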

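Learning rate

Too small:
- Many iterations are needed before the algorithm converges
- The search can easily get stuck in a local minimum
- On a plateau there is a risk of stopping too early, before a minimum has been found
Too large:
- The parameters oscillate around the minimum
- The cost function may even grow from step to step (the algorithm diverges)
How do you find a suitable learning rate?
- Use grid search
- Cap the number of iterations, so that candidates that take too long to converge are cut off
How do you set the iteration limit?
- Set a large number of iterations, but stop as soon as the norm of the gradient vector drops below a tolerance $\epsilon$. (At that point GD can be considered to have almost reached the minimum.)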
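
The following is a minimal sketch of both ideas: a small grid of learning rates, each run with a generous iteration cap and a tolerance on the gradient norm. The helper name `bgd`, the grid values, `max_iter` and `tol` are hypothetical choices for illustration; the data follows the same recipe as the code later in this post.

```python
import numpy as np

def bgd(X_b, y, eta, max_iter=10_000, tol=1e-6, seed=0):
    """Batch GD on the MSE cost; returns (theta, iterations used, converged?)."""
    rng = np.random.default_rng(seed)
    m, n = X_b.shape
    theta = rng.standard_normal((n, 1))          # random initialization
    with np.errstate(over="ignore", invalid="ignore"):
        for n_iter in range(1, max_iter + 1):
            gradients = 2 / m * X_b.T @ (X_b @ theta - y)
            grad_norm = np.linalg.norm(gradients)
            if not np.isfinite(grad_norm):       # learning rate too large: diverged
                return theta, n_iter, False
            if grad_norm < tol:                  # tolerance reached: stop early
                return theta, n_iter, True
            theta = theta - eta * gradients
    return theta, max_iter, False                # iteration cap hit first

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))
X_b = np.c_[np.ones((100, 1)), X]                # add x_0 = 1 to each instance

results = {}
for eta in (0.001, 0.01, 0.1, 1.0):              # the "grid" of candidate learning rates
    theta, n_iter, converged = bgd(X_b, y, eta)
    results[eta] = (n_iter, converged)
    print(f"eta={eta}: iterations={n_iter}, converged={converged}")
```

On this data a learning rate of 1.0 is too large and the run is reported as diverged, while 0.1 reaches the tolerance well before the iteration cap.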

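MSE cost function

The MSE cost function of a linear regression problem is convex
- It has no local minima other than the global one, so as long as the learning rate is not too large (and GD runs long enough), GD gets as close to the global optimum as needed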

When using GD, make sure all features have similar scales; otherwise convergence takes noticeably longer.
Figure: GD and feature scaling
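
The effect of feature scales can be checked with a small experiment. The sketch below is an illustration under assumptions: the two synthetic features (one in [0, 1), one in [0, 100)), the helper `bgd_iterations`, and the hand-picked learning rates (each chosen to be stable for its own case) are not from the original post.

```python
import numpy as np

def bgd_iterations(X, y, eta, tol=1e-6, max_iter=50_000):
    """Number of batch GD steps until the gradient norm < tol (max_iter if never)."""
    m = len(X)
    X_b = np.c_[np.ones((m, 1)), X]              # add the bias column
    theta = np.zeros((X_b.shape[1], 1))
    for n_iter in range(1, max_iter + 1):
        gradients = 2 / m * X_b.T @ (X_b @ theta - y)
        if np.linalg.norm(gradients) < tol:
            return n_iter
        theta = theta - eta * gradients
    return max_iter

rng = np.random.default_rng(0)
x1 = rng.random((200, 1))                        # feature with values in [0, 1)
x2 = 100 * rng.random((200, 1))                  # feature with values in [0, 100)
X = np.c_[x1, x2]
y = 1 + 2 * x1 + 0.05 * x2 + 0.1 * rng.standard_normal((200, 1))

# Standardize each feature to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

n_raw = bgd_iterations(X, y, eta=1e-4)           # larger eta would diverge on raw features
n_scaled = bgd_iterations(X_scaled, y, eta=0.1)
print("raw features:   ", n_raw)
print("scaled features:", n_scaled)
```

On the raw features the large-scale feature forces a tiny learning rate, so the run exhausts the iteration cap; after standardization the same tolerance is reached in far fewer steps.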

1. Batch Gradient Descent (BGD)

For the analysis that follows, the MSE cost function of the linear regression model is repeated here.

[Eq. 3: MSE cost function of the linear regression model]

$MSE(\mathbf{X}, h_{\theta}) = \frac{1}{m}\sum_{i=1}^m \left( \theta^T\mathbf{x}^{(i)} -y^{(i)}\right)^2$

The partial derivative of the MSE cost function with respect to $\theta_j$ is:

$\frac{\partial}{\partial \theta_j} MSE(\theta)= \frac{2}{m}\sum_{i=1}^{m}\left( \theta^T\mathbf{x}^{(i)}-y^{(i)} \right)x_j^{(i)}$

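Collecting all the parameters, and writing the gradient vector as $\nabla_{\theta}MSE(\theta)$, we have

[Eq. 4: gradient vector of the MSE cost function (linear regression)]

$\nabla_{\theta}MSE(\theta) = \begin{pmatrix} \frac{\partial}{\partial \theta_0} MSE(\theta) \\ \vdots \\ \frac{\partial}{\partial \theta_n} MSE(\theta) \end{pmatrix} = \frac{2}{m}\mathbf{X}^T\left(\mathbf{X}\theta - \mathbf{y}\right)$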
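
As a sanity check, the vectorized gradient $\frac{2}{m}\mathbf{X}^T(\mathbf{X}\theta - \mathbf{y})$ (the form used by the BGD code later in this post) can be compared with a finite-difference approximation of the MSE. The helper names `mse` and `mse_gradient`, the random test data and the step size `h` are assumptions made for this sketch.

```python
import numpy as np

def mse(theta, X_b, y):
    # Mean squared error of the linear model X_b @ theta.
    residuals = X_b @ theta - y
    return float(np.mean(residuals ** 2))

def mse_gradient(theta, X_b, y):
    # Vectorized gradient: (2/m) * X^T (X theta - y).
    m = len(X_b)
    return 2 / m * X_b.T @ (X_b @ theta - y)

rng = np.random.default_rng(0)
X_b = np.c_[np.ones((50, 1)), rng.random((50, 2))]   # bias column + 2 features
y = rng.standard_normal((50, 1))
theta = rng.standard_normal((3, 1))

# Central finite differences, one component theta_j at a time.
h = 1e-6
numeric = np.zeros_like(theta)
for j in range(len(theta)):
    step = np.zeros_like(theta)
    step[j] = h
    numeric[j] = (mse(theta + step, X_b, y) - mse(theta - step, X_b, y)) / (2 * h)

analytic = mse_gradient(theta, X_b, y)
print(np.max(np.abs(analytic - numeric)))            # difference is at rounding-error level
```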

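Note

Eq. 4 uses the entire training set at every step, which is why the algorithm is called batch gradient descent. As a result it is very slow on large training sets. However, because it requires no matrix inversion, it scales well with the number of features: when the number of features is large (hundreds to thousands or more), it can still be much faster than computing the closed-form solution.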

Once the gradient vector is known, the model parameters can be updated iteratively:

$\theta^{(next\ step)} = \theta - \eta\nabla_{\theta}MSE(\theta)$

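Convergence rate

Batch gradient descent with a fixed learning rate has a convergence rate of $O(\frac{1}{iterations})$. In other words, if you divide the tolerance $\epsilon$ by 10 (to get a more precise solution), the algorithm may need to run about 10 times as many iterations.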
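
A short calculation shows where the factor of ten comes from. It assumes, purely for illustration, that the error after $k$ iterations is bounded by $C/k$ for some constant $C > 0$ (the symbols $k$, $C$ and $\theta^{*}$ are introduced here and are not part of the original text):

$MSE(\theta^{(k)}) - MSE(\theta^{*}) \le \frac{C}{k}$

The bound drops below the tolerance once

$\frac{C}{k} \le \epsilon \iff k \ge \frac{C}{\epsilon}$

Replacing $\epsilon$ with $\epsilon / 10$ turns the requirement into $k \ge 10\,C/\epsilon$, that is, ten times as many iterations.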

2. Stochastic Gradient Descent (SGD)

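The problem with batch gradient descent is that every gradient update uses all the training samples, so it becomes very slow when the training set is large.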

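At every step, stochastic gradient descent picks one random instance from the training set and computes the gradient from that single sample only. Compared with batch GD, it has the following characteristics:
- Each step is much faster to compute
- It can train on very large training sets
  - it can be implemented as an out-of-core algorithm
- It is less regular
  - BGD: the cost decreases steadily until it reaches the minimum
  - SGD: the cost function bounces up and down and decreases only on average. It gets very close to the optimum, but keeps fluctuating even near the optimal solution.
- The irregularity makes it easier to jump out of local minima, so SGD has a better chance than BGD of finding the global optimum
- Drawback: it never settles at the optimum; the final parameters are near the optimal solution but not optimal (good but not optimal)
  - Remedy:
    - start with a large learning rate: speeds up training and helps escape local optima
    - gradually reduce the learning rate: lets the solution settle at (or near) the optimum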
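
A learning schedule of this kind can be as simple as the function below. It uses the same form and hyperparameter values (`t0 = 5`, `t1 = 50`) as the SGD code later in this post; the step indices printed at the end are picked only for illustration.

```python
t0, t1 = 5, 50  # learning schedule hyperparameters

def learning_schedule(t):
    """Learning rate at step t: starts at t0 / t1 and decays roughly like 1 / t."""
    return t0 / (t + t1)

for t in (0, 50, 450, 4950):
    print(t, learning_schedule(t))
# 0 0.1
# 50 0.05
# 450 0.01
# 4950 0.001
```

The early steps are large enough to move quickly, while the later steps are a hundred times smaller, which damps the fluctuation around the optimum.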

Figure: SGD

3. Mini-batch Gradient Descent (MBGD)

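For clarity, here is a side-by-side comparison of BGD, SGD and MBGD.

When computing the gradient at each step:
- BGD: uses all training samples
- SGD: uses a single randomly chosen sample
- MBGD: uses a small random subset of the training set (a mini-batch)

Compared with SGD, the advantage of MBGD is the performance boost from hardware-optimized matrix operations, especially on GPUs.

In parameter space, the search path of MBGD is more stable than that of SGD, especially with fairly large mini-batches, so the final solution usually ends up closer to the optimum than SGD's. The downside is that MBGD has a harder time escaping local optima than SGD does.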
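
One way to see that the three variants differ only in how many samples feed each gradient is a single routine parameterized by the batch size. This is a sketch under assumptions, not the implementation used later in this post: the function name `gradient_descent`, the fixed learning rate, and the per-epoch shuffling (sampling without replacement, also for the SGD case) are choices made for illustration.

```python
import numpy as np

def gradient_descent(X_b, y, batch_size, n_epochs=50, eta=0.05, seed=0):
    """MSE gradient descent; batch_size selects the variant:
    batch_size == m -> BGD, batch_size == 1 -> SGD, anything between -> MBGD."""
    rng = np.random.default_rng(seed)
    m, n = X_b.shape
    theta = rng.standard_normal((n, 1))              # random initialization
    for _ in range(n_epochs):
        indices = rng.permutation(m)                 # reshuffle once per epoch
        for start in range(0, m, batch_size):
            batch = indices[start:start + batch_size]
            xi, yi = X_b[batch], y[batch]
            gradients = 2 / len(batch) * xi.T @ (xi @ theta - yi)
            theta = theta - eta * gradients
    return theta

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))
X_b = np.c_[np.ones((100, 1)), X]
theta_ref = np.linalg.lstsq(X_b, y, rcond=None)[0]   # least-squares reference solution

distances = {}
for name, batch_size in (("BGD", 100), ("MBGD", 20), ("SGD", 1)):
    theta = gradient_descent(X_b, y, batch_size, n_epochs=600, eta=0.05)
    distances[name] = float(np.linalg.norm(theta - theta_ref))
    print(name, distances[name])
```

Because the learning rate is held fixed here, the SGD and MBGD results keep fluctuating around the reference solution instead of settling on it.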

import numpy as np
import matplotlib.pyplot as plt # needed by the plotting code at the end

## generate a synthetic dataset
X = 2 * np.random.rand(100,1)
y = 4 + 3 * X + np.random.randn(100, 1)

X_b = np.c_[np.ones((100,1)), X] # add x_0=1 to each instance


## closed-form solution
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
print(theta_best)


## BGD:
theta_path_bgd = []

eta = 0.1 # learning rate
n_iterations = 1000
m = 100

theta = np.random.randn(2,1) # random initialization

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
    theta_path_bgd.append(theta)

print(theta)


## SGD:
theta_path_sgd = []
m = len(X_b)
np.random.seed(42)

n_epochs = 50
t0, t1 = 5, 50 # learning schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2,1) # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index+1] # slicing keeps xi 2-D, shape: (1, 2)
        # indexing with X_b[random_index] would instead give a 1-D array of shape (2,)
        # all operations here are therefore on 2-D arrays
        yi = y[random_index:random_index+1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
        theta_path_sgd.append(theta)

print(theta)


## MBGD:
theta_path_mgd = []

n_iterations = 50
minibatch_size = 20

np.random.seed(42)
theta = np.random.randn(2,1)  # random initialization

t0, t1 = 200, 1000
def learning_schedule(t):
    return t0 / (t + t1)

t = 0
for epoch in range(n_iterations):
    # shuffle the training set so that the mini-batches are random
    shuffled_indices = np.random.permutation(m)
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(0, m, minibatch_size): # each epoch trains on all m samples
        t += 1
        xi = X_b_shuffled[i:i+minibatch_size]
        yi = y_shuffled[i:i+minibatch_size]
        gradients = 2/minibatch_size * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(t)
        theta = theta - eta * gradients
        theta_path_mgd.append(theta)

print(theta)


## plot the convergence paths of BGD, SGD and MBGD
theta_path_bgd = np.array(theta_path_bgd)
theta_path_sgd = np.array(theta_path_sgd)
theta_path_mgd = np.array(theta_path_mgd)

plt.figure(figsize=(7,4))
plt.plot(theta_path_sgd[:, 0], theta_path_sgd[:, 1], "r-s", linewidth=1, label="Stochastic")
plt.plot(theta_path_mgd[:, 0], theta_path_mgd[:, 1], "g-+", linewidth=2, label="Mini-batch")
plt.plot(theta_path_bgd[:, 0], theta_path_bgd[:, 1], "b-o", linewidth=3, label="Batch")
plt.legend(loc="upper left", fontsize=16)
plt.xlabel(r"$\theta_0$", fontsize=20)
plt.ylabel(r"$\theta_1$   ", fontsize=20, rotation=0)
plt.axis([2.5, 4.5, 2.3, 3.9])
plt.show()

# output of closed-form solution
[[ 4.21509616]
 [ 2.77011339]]
# output of BGD
[[ 4.21509616]
 [ 2.77011339]]
# output of SGD
[[ 4.21076011]
 [ 2.74856079]]
# output of MBGD
[[ 4.25214635]
 [ 2.7896408 ]]

Figure: GD-compare

Figure: GD-compare-2