Mini-batch Stochastic Gradient Descent (SGD)

Training steps with SGD

$\mathbf{w}$	Model parameters, including the bias
$b$	Hyperparameter: the batch size
$\eta_t$	Hyperparameter: the learning rate at time $t$

At time 1, randomly initialize $\mathbf{w}_1$
At each step $t = 1, 2, \ldots$ repeat the following until convergence:
- Randomly sample an index set $I_t \subset\{1, \ldots, n\}$ from the $n$ examples; $I_t$ holds the $b$ sampled examples, so $\left|I_t\right|=b$
- Update $\mathbf{w}_{t+1}=\mathbf{w}_t-\eta_t \nabla_{\mathbf{w}_t} \ell\left(\mathbf{X}_{I_t}, \mathbf{y}_{I_t}, \mathbf{w}_t\right)$
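
The steps above can be sketched in plain Python for a one-feature linear model. This is only an illustration under stated assumptions: the function name `minibatch_sgd`, the synthetic data, the constant learning rate, and the fixed step count (used in place of a convergence test) are all choices made here, not part of the notes.

```python
import random

def minibatch_sgd(xs, ys, batch_size, learning_rate, num_steps, seed=0):
    """Fit y ~ w * x + c with mini-batch SGD on the halved squared error."""
    rng = random.Random(seed)
    n = len(xs)
    # Random initialization of the parameters (weight w and bias c)
    w, c = rng.gauss(0, 0.01), 0.0
    for t in range(num_steps):
        # Sample the index set I_t with |I_t| = batch_size
        batch = rng.sample(range(n), batch_size)
        # Gradient of the mean halved squared error over the batch
        grad_w = sum((w * xs[i] + c - ys[i]) * xs[i] for i in batch) / batch_size
        grad_c = sum((w * xs[i] + c - ys[i]) for i in batch) / batch_size
        # Update step: parameters move against the gradient
        w -= learning_rate * grad_w
        c -= learning_rate * grad_c
    return w, c

# Noise-free synthetic data generated from y = 2x + 1 (illustrative values)
xs = [i / 50 for i in range(100)]
ys = [2 * x + 1 for x in xs]
w, c = minibatch_sgd(xs, ys, batch_size=10, learning_rate=0.1, num_steps=2000)
```

On this noise-free data the fitted `w` and `c` should end up close to 2 and 1; with noisy data the iterates would keep fluctuating around the optimum unless the learning rate is decreased over time.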

Pros: can train almost any model other than decision trees
Cons: sensitive to the hyperparameters $b$ and $\eta_t$
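
One way to get a feel for the learning-rate sensitivity is a toy problem. The sketch below is an assumption-laden illustration, not SGD itself: it runs plain (full) gradient descent on the one-dimensional function $f(w) = w^2/2$, and the function name and the two learning rates are chosen here only to show a converging and a diverging case.

```python
def gradient_descent_on_quadratic(learning_rate, num_steps=20, w0=1.0):
    """Minimize f(w) = w**2 / 2, whose gradient is w, starting from w0."""
    w = w0
    for _ in range(num_steps):
        # Each step multiplies w by (1 - learning_rate)
        w -= learning_rate * w
    return w

small = gradient_descent_on_quadratic(0.1)  # factor 0.9 per step: shrinks toward 0
large = gradient_descent_on_quadratic(2.5)  # factor -1.5 per step: magnitude grows
```

Because each step scales $w$ by $1 - \eta$, the iterates approach the minimum only when $|1 - \eta| < 1$; outside that range they move away from it.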

Code implementation

The following example applies SGD to a linear regression model.

The hyperparameters are batch_size, learning_rate, and num_epochs.

import random
import torch
# `features` shape is (n, p), `labels` shape is (n, 1)
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # shuffle the indices so that batches are sampled at random
    for i in range(0, num_examples, batch_size):
        # the last batch may be smaller than batch_size
        batch_indices = torch.tensor(
            indices[i:min(i+batch_size, num_examples)]
        )
        yield features[batch_indices], labels[batch_indices]

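# Illustrative setup (assumed values, not from the original notes)
n, p = 1000, 2
features = torch.normal(0, 1, (n, p))
labels = features @ torch.tensor([[2.0], [-3.4]]) + 4.2
batch_size, learning_rate, num_epochs = 10, 0.03, 3

w = torch.normal(0, 0.01, size=(p, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)  # bias term (not the batch size $b$ from the text)

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        y_hat = X @ w + b
        loss = ((y_hat - y)**2 / 2).mean()  # mean of the halved squared errors
        loss.backward()  # compute gradients with respect to w and b
        # update the parameters outside of autograd tracking
        with torch.no_grad():
            for param in [w, b]:
                param -= learning_rate * param.grad
                param.grad.zero_()  # reset the gradient to zero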
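
The loop above performs one parameter update per mini-batch, and `data_iter` visits every example once per epoch, so the three hyperparameters together fix the total number of updates. A small sketch of that arithmetic follows; the helper name `num_updates` and the example values are assumptions made for illustration.

```python
import math

def num_updates(num_examples, batch_size, num_epochs):
    """Number of parameter updates when each epoch visits every example once."""
    # The last batch of an epoch may be smaller, hence the ceiling
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * num_epochs

updates = num_updates(num_examples=1000, batch_size=32, num_epochs=3)  # 32 batches per epoch, 96 updates
```

Note that shuffling once per epoch and slicing, as `data_iter` does, differs slightly from drawing a fresh random subset $I_t$ at every step: within an epoch no example is used twice.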

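Summary

A linear model produces its prediction as a linear weighted sum of the inputs
For linear models, we use the mean squared error (MSE) as the loss function
For softmax regression, we use cross entropy as the loss function; it is typically used for multi-class classification
- The softmax operator turns the predicted scores into probabilities
Mini-batch SGD can be used to train almost all neural networks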
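
The softmax and cross-entropy items in the summary can be sketched in plain Python as follows. This is a minimal illustration for a single example; the function names and the score values are assumptions chosen here, and subtracting the maximum score is a common numerical-stability step rather than something stated in the notes.

```python
import math

def softmax(scores):
    """Map raw scores to probabilities that are positive and sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_class])

probs = softmax([2.0, 1.0, 0.1])           # one score per class
loss = cross_entropy(probs, true_class=0)  # small when class 0 gets high probability
```

The loss goes to zero as the probability of the correct class approaches 1, and it grows without bound as that probability approaches 0.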

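References

3.4 随机梯度下降【斯坦福21秋季：实用机器学习中文版】 (3.4 Stochastic Gradient Descent, Stanford Fall 2021: Practical Machine Learning, Chinese edition), bilibili

Slides (PPT): the slides for this chapter start on page 6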
