Mini-batch Stochastic Gradient Descent (SGD)
Training steps for mini-batch SGD
| Notation | Meaning |
|---|---|
| $\mathbf{w}$ | Model parameters, including the bias (offset) |
| $b$ | Hyperparameter: batch size |
| $\eta_t$ | Hyperparameter: learning rate at time step $t$ |
- At time $t=1$, randomly initialize the parameters $\mathbf{w}_1$
- At each time step $t=1, 2, \ldots$, repeat the following until convergence:
  - Randomly sample a set of indices $I_t \subset \{1, \ldots, n\}$ from the $n$ training examples, where $I_t$ is the set of sampled examples and $\left|I_t\right| = b$
  - Update $\mathbf{w}_{t+1}=\mathbf{w}_t-\eta_t \nabla_{\mathbf{w}_t} \ell\left(\mathbf{X}_{I_t}, \mathbf{y}_{I_t}, \mathbf{w}_t\right)$
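Here $\ell\left(\mathbf{X}_{I_t}, \mathbf{y}_{I_t}, \mathbf{w}_t\right)$ is typically taken to be the average loss over the sampled mini-batch:

$$\ell\left(\mathbf{X}_{I_t}, \mathbf{y}_{I_t}, \mathbf{w}_t\right)=\frac{1}{b} \sum_{i \in I_t} \ell\left(\mathbf{x}_i, y_i, \mathbf{w}_t\right)$$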
Advantages: can be used to train almost all models (decision trees are a notable exception)
Disadvantages: sensitive to the hyperparameters $b$ and $\eta_t$
Code
The following example uses mini-batch SGD to train a linear regression model.
The hyperparameters are `batch_size`, `learning_rate`, and `num_epochs`.
```python
import random
import torch

# `features` has shape (n, p); `labels` has shape (n, 1)
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # shuffle the indices so that examples are drawn in random order
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(
            indices[i:min(i + batch_size, num_examples)]
        )
        yield features[batch_indices], labels[batch_indices]

p = features.shape[1]  # number of input features
w = torch.normal(0, 0.01, size=(p, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        y_hat = X @ w + b
        loss = ((y_hat - y) ** 2 / 2).mean()  # MSE over the mini-batch
        loss.backward()  # compute gradients of the loss w.r.t. the parameters
        # Update the parameters; disable autograd tracking for the in-place update
        with torch.no_grad():
            for param in [w, b]:
                param -= learning_rate * param.grad
                param.grad.zero_()  # reset the gradient to zero for the next step
```
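The snippet above assumes that `features`, `labels`, `batch_size`, `learning_rate`, and `num_epochs` are already defined. A minimal way to try it out, with synthetic data and illustrative hyperparameter values (these values are assumptions, not from the course), might look like this:

```python
import torch

# Synthetic linear data: y = X @ true_w + true_b + noise (shapes chosen for illustration)
true_w = torch.tensor([[2.0], [-3.4]])              # (p, 1) with p = 2
true_b = 4.2
features = torch.normal(0, 1, size=(1000, 2))       # (n, p)
labels = features @ true_w + true_b                 # (n, 1)
labels += torch.normal(0, 0.01, size=labels.shape)  # small Gaussian noise

batch_size, learning_rate, num_epochs = 10, 0.03, 3
```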
Summary
- A linear model makes predictions via a linear weighted sum of its inputs
- In linear regression, we use the mean squared error (MSE) as the loss function
- In softmax regression, we use cross-entropy as the loss function, which is typically used for multi-class classification problems
- The softmax operator turns the predicted values into a probability distribution (see the sketch after this list)
- Mini-batch stochastic gradient descent (mini-batch SGD) can be used to train almost all neural networks
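As a small illustration of the softmax and cross-entropy points above (a quick sketch, not the course's implementation; the logits and label are made up):

```python
import torch

# Hypothetical logits (raw predictions) for one example over 3 classes
logits = torch.tensor([2.0, 0.5, -1.0])

# Softmax: exponentiate, then normalize so the outputs form a probability distribution
probs = torch.exp(logits) / torch.exp(logits).sum()
print(probs)  # ≈ tensor([0.7856, 0.1753, 0.0391]), sums to 1

# Cross-entropy with true class 0: negative log-probability assigned to the true class
y = 0
loss = -torch.log(probs[y])
print(loss)   # ≈ 0.241
```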
References
Courseware PPT: the slides for this chapter start from page 6