Experiment 7 Recurrent Neural Network (2) Gradient Explosion Experiment

6.2 Gradient explosion experiment

Simple recurrent networks have difficulty modeling long-range dependencies for two reasons: exploding gradients and vanishing gradients. Generally speaking, the exploding gradient problem is relatively easy to handle and can largely be avoided with weight decay or gradient clipping. For the vanishing gradient problem, a more effective approach is to change the model, for example by using a long short-term memory network (LSTM).

This section first reproduces the gradient explosion problem in a simple recurrent network and then tries to solve it with gradient clipping. A dataset with sequences of length 20 is used for the experiments. During training, the L2 norms of the gradients of $\boldsymbol{W}$, $\boldsymbol{U}$, and $\boldsymbol{b}$ are printed in order to track how the gradients change.

6.2.1 Gradient printing function

To print gradients during training, we implement the function custom_print_log, which receives the runner instance and uses model.named_parameters() to obtain the parameter names and values of the model. We define the lists W_list, U_list, and b_list to store the L2 norms of the gradients of $\boldsymbol{W}$, $\boldsymbol{U}$, and $\boldsymbol{b}$ during training.

import torch

W_list = []
U_list = []
b_list = []

# Compute the L2 norm of the gradient of each parameter
def custom_print_log(runner):
    model = runner.model
    W_grad_l2, U_grad_l2, b_grad_l2 = 0, 0, 0
    for name, param in model.named_parameters():
        if name == "rnn_model.W":
            W_grad_l2 = torch.norm(param.grad, p=2).numpy()
        if name == "rnn_model.U":
            U_grad_l2 = torch.norm(param.grad, p=2).numpy()
        if name == "rnn_model.b":
            b_grad_l2 = torch.norm(param.grad, p=2).numpy()
    print(f"[Training] W_grad_l2: {W_grad_l2:.5f}, U_grad_l2: {U_grad_l2:.5f}, b_grad_l2: {b_grad_l2:.5f}")
    W_list.append(W_grad_l2)
    U_list.append(U_grad_l2)
    b_list.append(b_grad_l2)
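Besides the per-parameter norms of W, U, and b, it can also be handy to watch a single global gradient norm over all parameters. The helper below is an optional sketch (not part of the original runner code) that combines the per-parameter L2 norms into one number:

import torch

def global_grad_norm(model):
    # L2 norm of each parameter gradient, combined into a single global norm:
    # sqrt(sum over parameters of the squared per-parameter norms).
    norms = [torch.norm(p.grad.detach(), p=2) for p in model.parameters()
             if p.grad is not None]
    if not norms:
        return 0.0
    return torch.norm(torch.stack(norms), p=2).item()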

[Thinking] What is a norm, what is the L2 norm, and why do we print the gradient norm here?

Norm:

A norm is an extension of the concept of distance. A distance only needs to satisfy non-negativity, the identity of indiscernibles, symmetry, and the triangle inequality; a norm additionally requires homogeneity with respect to scalar multiplication. For intuition, a norm can be thought of as a distance (of a vector from the origin).

L2 norm:

The most commonly used distance measure, the Euclidean distance, is an L2 norm. For a vector $\boldsymbol{x}$ it is defined as

$$\|\boldsymbol{x}\|_2 = \sqrt{\sum_{i} x_i^2}.$$

The L2 norm is also commonly used as a regularization term in the optimization objective: it keeps the model from becoming so complex that it merely fits the training set, thereby reducing over-fitting and improving the model's generalization ability.

LP norm:

More generally, the Lp norm of a vector $\boldsymbol{x}$ is defined as

$$\|\boldsymbol{x}\|_p = \left(\sum_{i} |x_i|^p\right)^{1/p}, \quad p \geq 1.$$
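As a quick sanity check of these definitions, the norms can be computed directly with torch.norm (a small illustrative snippet, not part of the experiment code):

import torch

x = torch.tensor([3.0, -4.0])
print(torch.norm(x, p=1))  # L1 norm: |3| + |-4| = 7
print(torch.norm(x, p=2))  # L2 norm: sqrt(3^2 + 4^2) = 5
print(torch.norm(x, p=3))  # L3 norm: (|3|^3 + |-4|^3)^(1/3) ≈ 4.498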
Why print the gradient norm:
Printing the gradient norm gives a direct view of how well training is going. If the gradients become too large or too small, training quality may degrade, so monitoring the gradient norms helps us adjust the model sooner.

6.2.2 Reproducing the gradient explosion phenomenon

To reproduce the gradient explosion problem more reliably, we use the SGD optimizer with a larger batch size and learning rate; the learning rate is set to 0.2. In addition, when computing the cross-entropy loss, reduction is set to "sum", so the loss is accumulated (rather than averaged) over the batch. The code is implemented as follows:

import os
import random
import torch
import torch.nn as nn
import numpy as np

# load_data, DigitSumDataset, SRN, Model_RNN4SeqClass, Accuracy and RunnerV3
# are the components defined in the previous experiments of this series.

np.random.seed(0)
random.seed(0)
torch.manual_seed(0)

# Number of training epochs
num_epochs = 50
# Learning rate
lr = 0.2
# Number of input digit classes
num_digits = 10
# Dimension of the vector each digit is embedded into
input_size = 32
# Dimension of the hidden state vector
hidden_size = 32
# Number of predicted classes (possible sums)
num_classes = 19
# Batch size
batch_size = 64
# Directory for saving models
save_dir = "./checkpoints"


# Different values of length can be set to run the prediction experiment on sequences of different lengths
length = 20
print(f"\n====> Training SRN with data of length {length}.")

# Load the data of the given length
data_path = f"D:/datasets/{length}"
train_examples, dev_examples, test_examples = load_data(data_path)
train_set, dev_set, test_set = DigitSumDataset(train_examples), DigitSumDataset(dev_examples), DigitSumDataset(test_examples)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)
dev_loader = torch.utils.data.DataLoader(dev_set, batch_size=batch_size)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size)
# Instantiate the model
base_model = SRN(input_size, hidden_size)
model = Model_RNN4SeqClass(base_model, num_digits, input_size, hidden_size, num_classes)
# Specify the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr)
# Define the evaluation metric
metric = Accuracy()
# Define the loss function
loss_fn = nn.CrossEntropyLoss(reduction="sum")

# Instantiate the Runner with the components above
runner = RunnerV3(model, optimizer, loss_fn, metric)

# Train the model
model_save_path = os.path.join(save_dir, f"srn_explosion_model_{length}.pdparams")
runner.train(train_loader, dev_loader, num_epochs=num_epochs, eval_steps=100, log_steps=1,
             save_path=model_save_path, custom_print_log=custom_print_log)

Output:

[Training] W_grad_l2: 0.00026, U_grad_l2: 0.00152, b_grad_l2: 0.00019
[Train] epoch: 48/50, step: 242/250, loss: 9972.68066
[Training] W_grad_l2: 0.00055, U_grad_l2: 0.00409, b_grad_l2: 0.00039 
[Train] epoch: 48/50, step: 243/250, loss: 6181.53027
[Training] W_grad_l2: 0.00013, U_grad_l2: 0.00191, b_grad_l2: 0.00009
[Train] epoch: 48/50, step: 244/250, loss: 5992.93750
[Training] W_grad_l2: 0.00036, U_grad_l2: 0.00169, b_grad_l2: 0.00030
[Train] epoch: 49/50, step: 245/250, loss: 11759.13379
[Training] W_grad_l2: 0.00016, U_grad_l2: 0.00345, b_grad_l2: 0.00012
[Train] epoch: 49/50, step: 246/250, loss: 8051.15332
[Training] W_grad_l2: 0.00053, U_grad_l2: 0.00297, b_grad_l2: 0.00040
[Train] epoch: 49/50, step: 247/250, loss: 6390.26660
[Training] W_grad_l2: 0.00049, U_grad_l2: 0.00249, b_grad_l2: 0.00036 
[Train] epoch: 49/50, step: 248/250, loss: 8804.83203
[Training] W_grad_l2: 0.00018, U_grad_l2: 0.00194, b_grad_l2: 0.00013
[Train] epoch: 49/50, step: 249/250, loss: 5890.47656
[Training] W_grad_l2: 0.00020, U_grad_l2: 0.00080, b_grad_l2: 0.00014
[Evaluate]  dev score: 0.06000, dev loss: 6819.73352
[Train] Training done!

Next, the L2 norms of the gradients of the parameters $\boldsymbol{W}$, $\boldsymbol{U}$, and $\boldsymbol{b}$ recorded during training can be plotted for inspection. The corresponding code is as follows:

import matplotlib.pyplot as plt

def plot_grad(W_list, U_list, b_list, save_path, keep_steps=40):
    # Start plotting
    plt.figure()
    # By default, keep only the results of the first 40 steps
    steps = list(range(keep_steps))
    plt.plot(steps, W_list[:keep_steps], "r-", color="#e4007f", label="W_grad_l2")
    plt.plot(steps, U_list[:keep_steps], "-.", color="#f19ec2", label="U_grad_l2")
    plt.plot(steps, b_list[:keep_steps], "--", color="#000000", label="b_grad_l2")

    plt.xlabel("step")
    plt.ylabel("L2 Norm")
    plt.legend(loc="upper right")
    # Save before showing, otherwise the saved figure is empty
    plt.savefig(save_path)
    plt.show()
    print("image has been saved to: ", save_path)

save_path = "./images/6.8.pdf"
plot_grad(W_list, U_list, b_list, save_path)

Output:

(Figure 6.8: the L2 norms of the gradients of W, U, and b during the first 40 training steps.)

Figure 6.8 shows the L2 norms of the gradients of the parameters $\boldsymbol{W}$, $\boldsymbol{U}$, and $\boldsymbol{b}$ during training. After the adjustments above (larger learning rate, summed loss, etc.), the gradient norms first grow sharply and then drop to almost 0. This is because the derivatives of the Tanh and Sigmoid functions are close to 0 in their saturation regions: the sharp change in the gradient drives the parameter values to become very large or very small, the activations then easily fall into the saturated regions, the gradients become 0, and it is difficult to keep training the model.
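The saturation effect mentioned above is easy to verify: the derivative of Tanh is 1 - tanh(x)^2, which quickly approaches 0 as |x| grows. A small illustrative snippet (not part of the experiment code):

import torch

x = torch.tensor([0.0, 2.0, 5.0, 10.0], requires_grad=True)
torch.tanh(x).sum().backward()
# x.grad = 1 - tanh(x)^2, approximately: 1.0, 0.0707, 1.8e-4, 8.2e-9
print(x.grad)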

Next, evaluate the model on the test set.

print(f"Evaluate SRN with data length {
      
      length}.")
model_path = os.path.join(save_dir, "srn_explosion_model_20.pdparams")
torch.load(model_path)
 
# 使用测试集评价模型,获取测试集上的预测准确率
score, _ = runner.evaluate(test_loader)
print(f"[SRN] length:{
      
      length}, Score: {
      
      score: .5f}")

Output:

Evaluate SRN with data length 20.
[SRN] length:20, Score:  0.06000

6.2.3 Using gradient clipping to solve the gradient explosion problem

Gradient clipping is a heuristic method that can effectively mitigate the gradient explosion problem: when the norm of the gradient exceeds a threshold, it is truncated to a smaller value. There are generally two clipping methods: clipping by value and clipping by norm. This experiment uses clipping by norm to solve the gradient explosion problem.

Clipping by norm rescales the gradient vector $\boldsymbol{g}$ so that its norm is never larger than a threshold $b$. The clipped gradient is

$$\boldsymbol{g} = \begin{cases} \boldsymbol{g}, & \|\boldsymbol{g}\| \leq b, \\ \dfrac{b}{\|\boldsymbol{g}\|}\,\boldsymbol{g}, & \|\boldsymbol{g}\| > b. \end{cases}$$

When the norm of $\boldsymbol{g}$ does not exceed the threshold $b$, $\boldsymbol{g}$ is left unchanged; otherwise $\boldsymbol{g}$ is rescaled.
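A direct translation of this formula into tensor code looks as follows (a minimal sketch, where b is the clipping threshold):

import torch

def clip_by_norm(g, b):
    # Rescale g so that its L2 norm does not exceed the threshold b
    norm = torch.norm(g, p=2)
    if norm > b:
        g = g * (b / norm)
    return g

g = torch.tensor([6.0, 8.0])   # ||g|| = 10
print(clip_by_norm(g, b=5.0))  # tensor([3., 4.]), norm 5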

In PaddlePaddle, paddle.nn.ClipGradByNorm can be used for clipping by norm. In the code, the clip object is passed to the optimizer, which then clips all gradients by default each time they are updated during back-propagation.
In PyTorch the corresponding function is:
nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=20, norm_type=2)
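Unlike Paddle, PyTorch does not attach the clip object to the optimizer: nn.utils.clip_grad_norm_ has to be called explicitly between loss.backward() and optimizer.step(). The following minimal sketch (using a toy nn.Linear model purely for illustration) shows where the call belongs in a training step:

import torch
import torch.nn as nn

# Toy model and data, only to illustrate the position of the clipping call
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.2)
loss_fn = nn.CrossEntropyLoss(reduction="sum")
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
# Rescale all gradients in place so that their global L2 norm is at most 20
nn.utils.clip_grad_norm_(model.parameters(), max_norm=20, norm_type=2)
optimizer.step()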

After introducing gradient clipping, we re-observe the training of the model. Here the model and optimizer are re-instantiated and assembled into a runner for training. Note that in this PyTorch version the clipping call is assumed to live inside RunnerV3's training loop, between loss.backward() and optimizer.step(), as in the sketch above. The code is implemented as follows:

# Clear the gradient lists
W_list.clear()
U_list.clear()
b_list.clear()
# Instantiate the model
base_model = SRN(input_size, hidden_size)
model = Model_RNN4SeqClass(base_model, num_digits, input_size, hidden_size, num_classes)

# Instantiate the optimizer (gradient clipping is assumed to be applied inside RunnerV3's training loop)
optimizer = torch.optim.SGD(lr=lr, params=model.parameters())
# Define the evaluation metric
metric = Accuracy()
# Define the loss function
loss_fn = nn.CrossEntropyLoss(reduction="sum")

# Instantiate the Runner
runner = RunnerV3(model, optimizer, loss_fn, metric)

# Train the model
model_save_path = os.path.join(save_dir, f"srn_fix_explosion_model_{length}.pdparams")
runner.train(train_loader, dev_loader, num_epochs=num_epochs, eval_steps=100, log_steps=1, save_path=model_save_path, custom_print_log=custom_print_log)

After introducing gradient clipping, we again record the L2 norms of the gradients of the parameters $\boldsymbol{W}$, $\boldsymbol{U}$, and $\boldsymbol{b}$ during training and plot them. The corresponding code is as follows:

save_path = f"./images/6.9.pdf"
plot_grad(W_list, U_list, b_list, save_path, keep_steps=100)

Output:

(Figure 6.9: the L2 norms of the gradients of W, U, and b during the first 100 training steps, with gradient clipping.)

Next, the model trained with the gradient clipping strategy is evaluated on the test set.

print(f"Evaluate SRN with data length {
      
      length}.")
 
# 加载训练过程中效果最好的模型
model_path = os.path.join(save_dir, f"srn_fix_explosion_model_{
      
      length}.pdparams")
torch.load(model_path)
 
# 使用测试集评价模型,获取测试集上的预测准确率
score, _ = runner.evaluate(test_loader)
print(f"[SRN] length:{
      
      length}, Score: {
      
      score: .5f}")

Output:

Evaluate SRN with data length 20.
[SRN] length:20, Score:  0.05000

Because the learning rate, optimizer, and other settings were changed in order to reproduce the gradient explosion phenomenon, the accuracy is relatively low. However, since the gradient clipping strategy is used, the model parameters can still be updated and optimized during subsequent training, so the accuracy improves to a certain extent.

[Thinking Question] By what principle does gradient clipping solve the gradient explosion problem?

Gradient clipping bounds the maximum norm of the gradient vector, which helps gradient descent behave reasonably even when the loss surface of the model is irregular. Near a steep cliff of the loss surface, without clipping, the parameters would change drastically along the descent direction and jump out of the region of the minimum; with clipping, the parameter update is limited to a reasonable range and this situation is avoided.

The gradient clipping used in PyTorch here is the torch.nn.utils.clip_grad_norm_ function, which rescales gradients whose norm exceeds the specified value. A gradient explosion, as the name suggests, means the gradients become too large. This happens, for example, when the initial weights are too large and are repeatedly multiplied together during back-propagation, so the final gradients become extremely large. Clipping by norm therefore shrinks any gradient whose norm is larger than the specified threshold. There is also a clipping-by-value method.
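For comparison, PyTorch also provides clipping by value (torch.nn.utils.clip_grad_value_), which clamps each gradient element into a fixed interval instead of rescaling the whole gradient vector. A small illustrative sketch (toy model only, not part of the experiment code):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
model(torch.randn(8, 4)).sum().backward()

# Clipping by norm: rescale all gradients so their global L2 norm is at most 1.0
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Clipping by value: clamp every gradient element into [-0.5, 0.5]
nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)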

