Implementing L1 and L2 Regularization Terms in PyTorch

Regularization is an important concept in machine learning that helps prevent models from overfitting. In this post, I will describe two common regularization techniques, the L1 and L2 regularization terms, and then show how to add both of them to your own network model on the PyTorch platform.

1 Background introduction

In machine learning, our goal is to find a model that minimizes a loss function. However, if we only focus on minimizing the loss function, we may end up with an overly complex model that performs well on the training data but poorly on new data. This is called overfitting.

To prevent overfitting, we can add a regularization term to the loss function that penalizes the complexity of the model. The L1 and L2 regularization terms are two common choices.

2 Formula derivation

Many experts have already given vivid derivations and explanations of the regularization formulas. To avoid reinventing the wheel, you may want to read this blog before continuing: An article that fully explains regularization (Regularization).

With a general understanding of L1 and L2 in hand, let's briefly organize and review them:

For L1, the term added to the original loss function is:

$L_1 = \lambda \sum_i \left| w_i \right|$

Similarly, the term added for L2 is:

$L_2 = \lambda \sum_i w_i^2$

where $w_i$ denotes the parameters of the network model.
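Putting these together, a sketch of the full objective that training minimizes (where $L_{\text{data}}$ stands for the original data loss, e.g. the MSE used later, and the two $\lambda$ coefficients can be chosen independently; using both terms at once is optional):

$J(w) = L_{\text{data}}(w) + \lambda_1 \sum_i \left| w_i \right| + \lambda_2 \sum_i w_i^2$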

3 Program implementation

3.1 Regularization implementation

Adding a regularization term simply means adding an extra term to the loss function. We can therefore define a small helper function for each of L1 and L2 that computes the extra loss value:

import torch

# Define the L1 regularization function
def l1_regularizer(weight, lambda_l1):
    return lambda_l1 * torch.norm(weight, 1)

# Define the L2 regularization function
def l2_regularizer(weight, lambda_l2):
    return lambda_l2 * torch.norm(weight, 2)

The program above defines how L1 and L2 are computed for a parameter tensor weight: torch.norm(weight, 1) is the sum of the absolute values of the parameters, while torch.norm(weight, 2) is the Euclidean norm (the square root of the sum of squares, so it is the square of this value that matches the L2 formula exactly). In actual use, you only need to add the value returned by the function to the original loss to apply the regularization.
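As a minimal, self-contained sketch of this usage (the layer fc and the constant data_loss below are hypothetical stand-ins, not part of the original example):

import torch

# Hypothetical stand-ins: a single layer and a precomputed data loss
fc = torch.nn.Linear(10, 1)
data_loss = torch.tensor(0.5)  # would normally come from criterion(outputs, labels)
# Add the regularization values to the original loss
total_loss = data_loss + l1_regularizer(fc.weight, 0.01) + l2_regularizer(fc.weight, 0.01)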

3.2 Network example

To better show how to use regularization in your own training code, here is first a training program without any regularization term (so you can compare it with the regularized version below, or use it as a skeleton for your own program and quickly locate the lines that need to change):

import torch
import torch.nn as nn

# Define the network structure (details omitted here)
class CNN(nn.Module):
    pass

# Instantiate the network model
model = CNN()
# Define the loss function
criterion = nn.MSELoss()
# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Training loop (train_loader and device are assumed to be defined elsewhere)
for epoch in range(1000):
    # Put the model in training mode
    model.train()
    for i, data in enumerate(train_loader, 0):
        # 1 Unpack the data and move it to the GPU
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        # 2 Zero the gradients
        optimizer.zero_grad()
        # 3 Forward pass
        outputs = model(inputs)
        # 4 Compute the loss
        loss = criterion(outputs, labels)
        # 5 Backward pass and optimization
        loss.backward()
        optimizer.step()

3.3 Adding regularization terms to the network

The following program adds the regularization terms at the loss-function level:

import torch
import torch.nn as nn

# Define the network structure (details omitted here)
class CNN(nn.Module):
    pass

# Instantiate the network model
model = CNN()
# Define the loss function
criterion = nn.MSELoss()
# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Training loop (train_loader and device are assumed to be defined elsewhere)
for epoch in range(1000):
    # Put the model in training mode
    model.train()
    for i, data in enumerate(train_loader, 0):
        # 1 Unpack the data and move it to the GPU
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        # 2 Zero the gradients
        optimizer.zero_grad()
        # 3 Forward pass
        outputs = model(inputs)
        # 4 Compute the loss
        # 4.1 Define the L1 and L2 regularization coefficients
        lambda_l1 = 0.01
        lambda_l2 = 0.01

        # 4.2 Compute the L1 and L2 regularization over all model parameters
        l1_regularization = sum(l1_regularizer(p, lambda_l1) for p in model.parameters())
        l2_regularization = sum(l2_regularizer(p, lambda_l2) for p in model.parameters())

        # 4.3 Add L1 and L2 to the loss
        loss = criterion(outputs, labels)
        loss += l1_regularization + l2_regularization

        # 5 Backward pass and optimization
        loss.backward()
        optimizer.step()

The l1_regularizer() and l2_regularizer() functions in this program are the L1 and L2 calculation functions implemented manually in Section 3.1.

Note: in this example, L1 and L2 are computed separately from the loss function. In practice, the safest approach is to fold the L1 and L2 computation into the loss function itself, i.e. to rewrite the loss so that it calls the two helper functions above, as sketched below.
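As a sketch of what such a rewritten loss could look like (the RegularizedMSELoss class below is a hypothetical example, assuming the helper functions from Section 3.1 are in scope):

import torch.nn as nn

# Hypothetical custom loss that folds the L1/L2 computation into the loss itself
class RegularizedMSELoss(nn.Module):
    def __init__(self, model, lambda_l1=0.01, lambda_l2=0.01):
        super().__init__()
        self.model = model
        self.mse = nn.MSELoss()
        self.lambda_l1 = lambda_l1
        self.lambda_l2 = lambda_l2

    def forward(self, outputs, labels):
        # Data term plus L1/L2 penalties over all model parameters
        loss = self.mse(outputs, labels)
        for p in self.model.parameters():
            loss = loss + l1_regularizer(p, self.lambda_l1) + l2_regularizer(p, self.lambda_l2)
        return loss

# Usage sketch: criterion = RegularizedMSELoss(model); loss = criterion(outputs, labels)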

3.4 PyTorch's built-in regularization method: weight decay

In the first three sections of this chapter, we showed how to implement the regularization computation by hand and embed it in a neural network, but regularization is such a basic technique that PyTorch ships with it built in, integrated into the optimizers.
Taking the optimizer used in the program above, torch.optim.SGD, as an example, here is how to use it to regularize the network directly.

 optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=1e-4)

The above is the general way to define SGD, where the parameters mean:

  1. model.parameters(): All learnable parameters of the model
  2. lr: learning rate
  3. weight_decay: weight decay coefficient

Here the weight decay coefficient weight_decay is itself a regularization method: weight decay is equivalent to L2 norm regularization. The detailed derivation and analysis can be found in the blog: weight_decay (weight decay) regularization in neural networks.
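As a one-line sketch of why this is the case: with learning rate $\eta$ and weight_decay coefficient $\lambda$, the SGD update with weight decay takes the form

$w \leftarrow w - \eta \left( \nabla_w L_{\text{data}}(w) + \lambda w \right)$

and the extra $\lambda w$ term is exactly the gradient of an L2 penalty of the form $\tfrac{\lambda}{2} \sum_i w_i^2$.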

4 Notes on using regularization terms

When using the L1 and L2 regularization terms, pay attention to the following points:

  1. The choice of the regularization parameter λ is very important. If λ is too large, the model may become too simple, leading to underfitting. If λ is too small, the effect of regularization may not be noticeable.

  2. The L1 and L2 regularization terms can also be used at the same time; this combination is called Elastic Net (one common parameterization is sketched after this list).

  3. The L1 regularization term may make the model unstable, because it drives some parameters exactly to zero; if the data changes slightly, the parameters of the model may change significantly.
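As a reference for point 2, one common way to parameterize the combined (Elastic Net) penalty uses a single strength $\lambda$ and a mixing ratio $\alpha \in [0, 1]$ (this particular form is a conventional choice, not something prescribed by PyTorch):

$\Omega(w) = \lambda \left( \alpha \sum_i \left| w_i \right| + \frac{1 - \alpha}{2} \sum_i w_i^2 \right)$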

5 Summary

Overall, regularization is an effective technique for preventing overfitting. It helps the model keep enough fitting capacity while avoiding overfitting, thereby improving its generalization ability. Regularization terms can also help with feature selection, reduce model complexity, and improve interpretability.
However, regularization also has its limitations. In some cases the regularization term can hurt the model's performance, and its coefficient must be tuned carefully: if it is too large or too small, performance may drop. A good training result is determined by a combination of factors, so while using regularization terms, other techniques (such as dropout) should not be ignored.
