[Hands-on Deep Learning v2 Li Mu] Study Notes 07: Weight Decay, Regularization

Previous review: model selection, underfitting, and overfitting

1. Weight decay

1.1 Hard constraints

  • In the previous article, we covered two ways to control model capacity:
    • Use fewer parameters (make the model smaller)
    • Restrict the range of values each parameter can take
  • Weight decay controls model capacity by limiting the range of values the parameters can take.
    • For example, we can add a constraint when minimizing the loss function to keep the weights from growing too large:
      $$\min l(\vec{w}, b) \quad \text{subject to} \quad \|\vec{w}\|^2 \leq \theta$$
      Here we require the squared $L_2$ norm of $\vec{w}$ to be no greater than $\theta$.
    • We usually do not constrain the bias $b$ (constraining it or not makes little difference)
    • Choosing a smaller $\theta$ means stronger regularization
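
One way to enforce a hard limit like the one above during gradient descent is to project the weights back into the feasible region after each step. The sketch below is not from the course: project_to_ball is a hypothetical helper, and it assumes the weight vector is a plain Python list.

```python
import math

def project_to_ball(w, theta):
    """Return the closest point to w that satisfies ||w||^2 <= theta."""
    sq_norm = sum(x * x for x in w)
    if sq_norm <= theta:
        return list(w)  # already feasible, leave unchanged
    scale = math.sqrt(theta / sq_norm)  # rescale onto the boundary
    return [x * scale for x in w]

w = [3.0, 4.0]                        # ||w||^2 = 25, violates theta = 1
w_proj = project_to_ball(w, 1.0)      # same direction, squared norm 1
w_inside = project_to_ball([0.3, 0.4], 1.0)  # already inside the ball
print(w_proj, w_inside)
```

A smaller theta shrinks the feasible ball, so more weight vectors get rescaled. This is the sense in which a smaller theta regularizes more strongly.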

1.2 Soft constraints (regularization)

We usually do not use the hard constraint from the previous subsection. Instead, we control model capacity through regularization, which acts as a soft constraint.

  • $L_2$ regularization:
    • For each $\theta$, we can find a $\lambda$ such that the constrained objective above is equivalent to:
      $$\min l(\vec{w}, b) + \frac{\lambda}{2}\|\vec{w}\|^2$$
    • This can be proved with Lagrange multipliers
    • The hyperparameter $\lambda$ controls the importance of the regularization term
      • When $\lambda = 0$, the regularization term has no effect
      • As $\lambda \rightarrow \infty$, $\vec{w} \rightarrow \vec{0}$
  • $L_1$ regularization:
    • It tends to drive many model parameters to exactly 0, thereby making the model sparse.
    • Its formula is:
      $$\min l(\vec{w}, b) + \lambda\|\vec{w}\|_1$$
  • Demonstration:
    We take $L_2$ regularization as an example. In the figure below:
    $$\begin{aligned}&\vec{w}^* = \arg\min\; l(\vec{w}, b) + \frac{\lambda}{2}\|\vec{w}\|^2 \\ &\tilde{\vec{w}}^* = \arg\min\; l(\vec{w}, b)\end{aligned}$$
    [Figure: demo]
    The green curve shows the case where only the loss is optimized, and the yellow curve shows the case where the regularization term is added. The regularization term pulls the weights from larger values far from the origin to smaller values closer to the origin, thereby controlling the size of the parameters.
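
To see how the two penalties differ, consider a one-dimensional toy loss l(w) = ½(w − a)², whose unregularized minimizer is w = a. For this loss both penalized problems have well-known closed-form solutions: the L2 penalty rescales a by 1/(1 + λ), while the L1 penalty applies soft-thresholding and returns exactly 0 whenever |a| ≤ λ. The toy loss and the helper names below are assumptions made for illustration, not part of the course material.

```python
def l2_minimizer(a, lam):
    # argmin_w 0.5*(w - a)**2 + (lam/2)*w**2  ->  a / (1 + lam)
    return a / (1.0 + lam)

def l1_minimizer(a, lam):
    # argmin_w 0.5*(w - a)**2 + lam*|w|  ->  soft-thresholding of a
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0

targets = [2.0, 0.5, -0.3, -4.0]   # unregularized minimizers
lam = 1.0
w_l2 = [l2_minimizer(a, lam) for a in targets]
w_l1 = [l1_minimizer(a, lam) for a in targets]
print(w_l2)  # every entry is pulled toward 0, but none is exactly 0
print(w_l1)  # entries with |a| <= lam become exactly 0
```

Both penalties pull the solution toward the origin, but only the L1 penalty produces exact zeros, which is the sparsity effect described above.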

1.3 Parameter update rule

  • Computing the gradient:
    $$\frac{\partial}{\partial \vec{w}}\Big(l(\vec{w}, b) + \frac{\lambda}{2}\|\vec{w}\|^2\Big) = \frac{\partial l(\vec{w}, b)}{\partial \vec{w}} + \lambda\vec{w}$$
  • Updating the parameters (at time $t$):
    $$\vec{w}_{t+1} = (1 - \eta\lambda)\vec{w}_t - \eta\frac{\partial l(\vec{w}_t, b_t)}{\partial \vec{w}_t}$$
    • Usually $\eta\lambda < 1$, which is why this is called weight decay in deep learning: at every update, the current weights are first shrunk and then moved along the negative gradient direction.
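
The update rule can be checked numerically: a gradient step on the penalized objective gives the same result as shrinking the weights by (1 − ηλ) and then taking a plain gradient step. The sketch below assumes a single-example squared loss ½(w·x − y)². The function names and the numbers are made up for illustration.

```python
def grad_loss(w, x, y):
    # Gradient of the squared loss 0.5*(w.x - y)**2 with respect to w
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def step_penalized(w, x, y, lr, lam):
    # Gradient descent on loss + (lam/2)*||w||^2
    g = grad_loss(w, x, y)
    return [wi - lr * (gi + lam * wi) for wi, gi in zip(w, g)]

def step_decay(w, x, y, lr, lam):
    # Shrink the weights by (1 - lr*lam), then take a plain gradient step
    g = grad_loss(w, x, y)
    return [(1 - lr * lam) * wi - lr * gi for wi, gi in zip(w, g)]

w, x, y = [1.0, -2.0, 0.5], [0.2, 0.1, -0.4], 0.3
w_pen = step_penalized(w, x, y, lr=0.1, lam=3.0)
w_dec = step_decay(w, x, y, lr=0.1, lam=3.0)
max_diff = max(abs(p - d) for p, d in zip(w_pen, w_dec))
print(max_diff)  # the two updates agree up to floating-point rounding
```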

1.4 Summary

  • Weight decay uses an $L_2$ regularization term to keep the model parameters from growing too large, thereby controlling model complexity.
  • The regularization weight $\lambda$ is a hyperparameter that controls model complexity.

2. Code implementation

2.1 Implementation from scratch

2.1.1 Artificial datasets

Weight decay is one of the most widely used regularization techniques.

import torch
from torch import nn
from d2l import torch as d2l

Generate an artificial dataset:
$$y = 0.05 + \sum_{i=1}^{d} 0.01 x_i + \epsilon \quad \text{where} \quad \epsilon \sim \mathcal{N}(0, 0.01^2)$$

n_train, n_test, num_inputs, batch_size = 20, 100, 200, 5
true_w, true_b = torch.ones((num_inputs, 1)) * 0.01, 0.05
train_data = d2l.synthetic_data(true_w, true_b, n_train)
train_iter = d2l.load_array(train_data, batch_size, is_train=True)
test_data = d2l.synthetic_data(true_w, true_b, n_test)
test_iter = d2l.load_array(test_data, batch_size, is_train=False)
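
The data-generating formula of this subsection can be sketched in plain Python as follows. It only illustrates the process, under the assumption that the features are drawn from a standard normal distribution. make_data is a hypothetical helper and not the d2l.synthetic_data implementation.

```python
import random

def make_data(num_examples, num_inputs, w=0.01, b=0.05, noise_std=0.01, seed=0):
    """Sample (features, labels) with y = b + sum(w * x_i) + Gaussian noise."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    features, labels = [], []
    for _ in range(num_examples):
        x = [rng.gauss(0.0, 1.0) for _ in range(num_inputs)]
        y = b + sum(w * xi for xi in x) + rng.gauss(0.0, noise_std)
        features.append(x)
        labels.append(y)
    return features, labels

# Same sizes as above: 20 training examples, 200 input features
X_train, y_train = make_data(num_examples=20, num_inputs=200)
print(len(X_train), len(X_train[0]), len(y_train))
```

With only 20 training examples for 200 features, there are far more parameters than data points, so overfitting is easy to provoke.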

2.1.2 Model parameters

Initialize model parameters

# Initialize model parameters
def init_params():
    w = torch.normal(0, 1, size=(num_inputs, 1), requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    return [w, b]

2.1.3 $L_2$ norm penalty

Define the $L_2$ norm penalty

# Define the L2 norm penalty
def l2_penalty(w):
    return torch.sum(w.pow(2)) / 2

2.1.4 Training

The main difference between this training function and the previous ones is the additional input parameter lambd. We use the hyperparameter lambd to control the importance of the regularization term. When lambd equals 0, there is no regularization; as lambd approaches infinity, the weights approach 0.

# Training function
def train(lambd):
    w, b = init_params()
    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
    num_epochs, lr = 100, 0.003
    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                            xlim=[5, num_epochs], legend=['train', 'test'])
    for epoch in range(num_epochs):
        for X, y in train_iter:
            # with torch.enable_grad():
            l = loss(net(X), y) + lambd * l2_penalty(w)
            l.sum().backward()
            d2l.sgd([w, b], lr, batch_size)
        if (epoch + 1) % 5 == 0:
            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
                                     d2l.evaluate_loss(net, test_iter, loss)))
    d2l.plt.show()
    print('L2 norm of w:', torch.norm(w).item())

First, we set lambd=0 to disable regularization and train directly.

train(lambd=0)

Without regularization, severe overfitting occurs: the training error keeps decreasing while the test error stays high. The result is shown in the figure below:
[Figure: training without regularization]
Next, we enable weight decay, which alleviates the overfitting.

train(lambd=3)

[Figure: training with weight decay]

2.2 Concise implementation

$L_2$ regularization can be written either into the objective function or into the training algorithm.
In the concise implementation, we put weight decay into the training algorithm by passing weight_decay to the optimizer.

def train_concise(wd):
    net = nn.Sequential(nn.Linear(num_inputs, 1))
    for param in net.parameters():
        param.data.normal_()
    loss = nn.MSELoss(reduction='none')
    num_epochs, lr = 100, 0.003
    # Apply weight decay to the weight only; the bias is not decayed
    trainer = torch.optim.SGD([
        {"params": net[0].weight, "weight_decay": wd},
        {"params": net[0].bias}], lr=lr)
    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                            xlim=[5, num_epochs], legend=['train', 'test'])
    for epoch in range(num_epochs):
        for X, y in train_iter:
            # with torch.enable_grad():
            trainer.zero_grad()
            l = loss(net(X), y)
            l.mean().backward()
            trainer.step()
        if (epoch + 1) % 5 == 0:
            animator.add(epoch + 1,
                         (d2l.evaluate_loss(net, train_iter, loss),
                          d2l.evaluate_loss(net, test_iter, loss)))
    # Report the weight norm once, after training has finished
    print('L2 norm of w:', net[0].weight.norm().item())
    d2l.plt.show()

Similar to the implementation from scratch, we also train without and with regularization, respectively.

train_concise(0)
train_concise(3)

The result without regularization is shown in the figure below:
[Figure: concise implementation without regularization]

The result of using regularization is shown in the figure below:
[Figure: concise implementation with regularization]
Next: [Hands-on Deep Learning v2 Li Mu] Study Notes 08: Dropout


Origin blog.csdn.net/weixin_45800258/article/details/127056683