[Practical Machine Learning] 3.4 Stochastic Gradient Descent

Mini-batch Stochastic Gradient Descent (SGD)

Steps to train with SGD

$\mathbf{W}$ — model parameters, including the offsets
$b$ — hyperparameter: the batch size
$\eta_t$ — hyperparameter: the learning rate at time $t$
  • At time 1, randomly initialize $\mathbf{w}_1$
  • At each step $t = 1, 2, \ldots$, repeat the following until convergence:
    • Randomly sample $I_t \subset \{1, \ldots, n\}$ from the $n$ training examples, where $|I_t| = b$
    • Update $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \nabla_{\mathbf{w}_t} \ell\left(\mathbf{X}_{I_t}, \mathbf{y}_{I_t}, \mathbf{w}_t\right)$ (see the worked example below)
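
As a worked example, take linear regression with the squared loss averaged over the mini-batch,

$\ell\left(\mathbf{X}_{I_t}, \mathbf{y}_{I_t}, \mathbf{w}_t\right) = \frac{1}{2b} \sum_{i \in I_t} \left(\mathbf{x}_i^\top \mathbf{w}_t - y_i\right)^2$

Its gradient is $\nabla_{\mathbf{w}_t} \ell = \frac{1}{b} \sum_{i \in I_t} \left(\mathbf{x}_i^\top \mathbf{w}_t - y_i\right) \mathbf{x}_i$, so a single SGD step becomes

$\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta_t}{b} \sum_{i \in I_t} \left(\mathbf{x}_i^\top \mathbf{w}_t - y_i\right) \mathbf{x}_i$

which is exactly the update implemented in the code section below.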

Advantages: can be used to train almost all models, with decision trees as the main exception
Disadvantages: sensitive to the hyperparameters $b$ (batch size) and $\eta_t$ (learning rate); see the toy example below
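
To see the learning-rate sensitivity on a toy problem: for the one-dimensional loss $\ell(w) = \frac{1}{2} w^2$ the gradient is $w$, so the update is $w_{t+1} = (1 - \eta) w_t$. The iterates shrink toward the minimum only for $0 < \eta < 2$; with $\eta > 2$ they oscillate with growing magnitude and diverge, which is why the learning rate usually has to be tuned.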

Code

The following example uses SGD in a linear regression model.

Hyperparameters include the batch size batch_size, the learning rate learning_rate, and the number of epochs num_epochs.

import random
import torch
# `features` has shape (n, p) and `labels` has shape (n, 1).
# `features`, `labels`, the feature dimension `p`, and the hyperparameters
# `batch_size`, `learning_rate`, `num_epochs` are assumed to be defined
# (see the setup sketch after this block).
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # shuffle the indices so that mini-batches are drawn in random order
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(
            indices[i:min(i + batch_size, num_examples)]
        )
        yield features[batch_indices], labels[batch_indices]

w = torch.normal(0, 0.01, size=(p, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        y_hat = X @ w + b
        loss = ((y_hat - y)**2 / 2).mean()  # squared-error loss averaged over the mini-batch
        loss.backward()  # compute gradients of the loss w.r.t. the parameters
        # update the parameters outside the autograd graph
        with torch.no_grad():
            for param in [w, b]:
                param -= learning_rate * param.grad
                param.grad.zero_()  # reset the gradient to zero for the next step
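
The snippet above assumes that the data and the hyperparameters already exist. A minimal setup sketch, run before the training loop, could look like the following; the values are purely illustrative and the data is synthetic, generated from a known linear model:

import torch

n, p = 1000, 2                           # number of examples and of features
true_w = torch.tensor([[2.0], [-3.4]])   # ground-truth weights (made up for illustration)
true_b = 4.2                             # ground-truth offset (made up for illustration)
features = torch.normal(0, 1, size=(n, p))
labels = features @ true_w + true_b + torch.normal(0, 0.01, size=(n, 1))

batch_size = 10        # mini-batch size b
learning_rate = 0.03   # learning rate eta
num_epochs = 3         # number of passes over the data

With this setup, a few epochs of the loop above should bring w and b close to true_w and true_b.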

Summary

  • A linear model predicts the output as a linear weighted sum of the inputs
  • In linear models, we use the mean squared error (MSE) as the loss function
  • In Softmax regression, we use cross entropy as the loss function; it is generally used for multi-class classification problems
    • The Softmax operator turns the predicted scores into probabilities (see the sketch after this list)
  • Mini-batch stochastic gradient descent (mini-batch SGD) can be used to train almost all neural networks
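
As a small sketch of the Softmax and cross-entropy computation (the logits and the class label are made-up values for illustration):

import torch

logits = torch.tensor([2.0, 0.5, -1.0])  # hypothetical raw scores for a 3-class problem
probs = torch.softmax(logits, dim=0)     # Softmax turns the scores into probabilities
print(probs, probs.sum())                # non-negative and summing to 1

y = torch.tensor(0)                      # assume the true class is class 0
loss = -torch.log(probs[y])              # cross-entropy loss for this single example
print(loss)

# The built-in version takes unnormalized logits directly and is numerically more stable
loss_builtin = torch.nn.functional.cross_entropy(logits.unsqueeze(0), y.unsqueeze(0))
print(loss_builtin)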

References

3.4 Stochastic Gradient Descent, Stanford 21 Fall: Practical Machine Learning (Chinese Edition), bilibili

Courseware PPT: the slides for this chapter start from page six.
