Error analysis of in-place operations in PyTorch and an explanation of the detach() method

0. Preface

*Thank you Tumi for your great support for this article.
*Thanks to the Rising Star Program for introducing me to great bloggers.

In accordance with international practice, let me first declare: this article only reflects my own understanding. Although I have referred to other people's valuable insights, the content may be inaccurate. If you find mistakes in the text, please point them out so we can make progress together.

Recently, while building an nn.RNN model and an nn.LSTM model based on nn.RNN, I ran into the following very troublesome "Good luck!" error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1, 1]], which is output 0 of AsStridedBackward0, is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

There are many articles on CSDN about the above error message, but almost all of them jump straight to a solution (and most of those solutions do not actually work), without explaining the mechanism behind the error. This post therefore records the process of solving the problem and my understanding of it.

1. Background Problem Description

The nn.RNN-based model can be simplified to the following code:

import torch

rnn = torch.nn.RNN(input_size=1, hidden_size=1, num_layers=1)

train_set_x = torch.tensor([[[1]],[[2]],[[3]],[[4]],[[5]]], dtype=torch.float32)
train_set_y = torch.tensor([[[2]],[[4]],[[6]],[[8]],[[10]]], dtype=torch.float32)

h0 = torch.tensor([[0]], dtype=torch.float32)
h_cur = h0   # hidden state carried across training steps

loss = torch.nn.MSELoss()
opt = torch.optim.Adadelta(rnn.parameters(), lr = 0.01)

with torch.autograd.set_detect_anomaly(True):   # print a traceback for the operation that fails during backward
    for i in range(5):
        opt.zero_grad()
        train_output, h_next = rnn(train_set_x[i], h_cur)
        rnn_loss = loss(train_output,train_set_y[i])
        rnn_loss.backward(retain_graph=True)
        opt.step()
        print(train_output)
        h_cur = h_next   # pass the hidden state on to the next iteration

Running this code, the above error message appears while the network is being trained.

Original question link: Pytorch framework nn.RNN backpropagation error during training

2. Error analysis: understanding and explaining the in-place operation

Look at the first half of the error message: "one of the variables needed for gradient computation has been modified by an inplace operation". The key phrase here is "inplace operation".

The reason this problem bothered me for so long is that I did not have a good understanding of in-place operations. I used to think that in-place operations only meant things like "x += 1" or "x -= 1". There is no such operation in my original code, yet an in-place-operation error was still reported.

In fact, "in-place operation" is a general term for any operation that directly changes a value in memory, rather than first copying the value and then changing the copy.

An in-place operation is an operation that directly changes the content of a given Tensor without making a copy. In-place operations in PyTorch are always postfixed with an underscore _, like .add_() or .scatter_(). Python operations like += or *= are also in-place operations.
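As a minimal sketch of my own (not from the original article), the difference between the out-of-place .add() and the in-place .add_() can be seen directly:

import torch

x = torch.ones(3)
y = x.add(1)      # out-of-place: returns a new tensor, x is left unchanged
print(x)          # tensor([1., 1., 1.])

x.add_(1)         # in-place: the trailing underscore means x's own memory is modified
print(x)          # tensor([2., 2., 2.])
print(x.data_ptr() == y.data_ptr())   # False: y lives in a different block of memory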

What we usually expect from an assignment "a = b" is that, although a and b hold the same value, the two values live at completely different physical addresses, so that changing a has no effect on b, and vice versa.

But in Python "a = b" does not copy anything: a becomes just another name for the same object, so the two names share the same physical address and the same block of memory. If a is then modified in place, the very same change happens to b.
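A small illustration of this (my own sketch, not from the original post): plain assignment only creates another name, so an in-place change made through one name is visible through the other, while an out-of-place operation rebinds the name to a brand-new tensor.

import torch

b = torch.zeros(2)
a = b                     # plain assignment: a and b are two names for one tensor
print(id(a) == id(b))     # True

a += 1                    # in-place: the shared memory changes, so b changes too
print(b)                  # tensor([1., 1.])

a = a + 1                 # out-of-place: a is rebound to a new tensor, b is untouched
print(id(a) == id(b))     # False
print(b)                  # tensor([1., 1.])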

Therefore, in-place operations should be used with caution: when many variables share the same memory, computing on (changing) one of them may bring unexpected changes to the others.

So where is the in-place behaviour in the problem code above?
Answer: in "h_cur = h_next". Using the id() function you can see that the two variables have exactly the same address: the assignment we thought would copy a value actually leaves h_cur and h_next as one and the same tensor.

print(id(h_cur))
print(id(h_next))
Output -------------------------------------
2943427659952
2943427659952

Now look at the second half of the error message: "[torch.FloatTensor [1, 1]], which is output 0 of AsStridedBackward0, is at version 3; expected version 2 instead." A [1, 1] tensor (that is, the hidden-layer output h of the RNN) is already at version 3, while version 2 was expected. My understanding is that this version is a counter of how many times the tensor has been modified in place.

Because of the "h_cur = h_next" operation above, the tensor behind h_cur ends up one version ahead of what the saved graph expects. Even though its value may not have changed, the version mismatch is enough to trigger the error.
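Autograd actually exposes this counter through the internal _version attribute. It is an implementation detail, so treat the following only as an illustrative sketch of my own: every in-place modification bumps the counter by one, and backward() compares the current value with the value recorded when the graph was built.

import torch

t = torch.ones(1, 1, requires_grad=True)
out = t * 2              # a non-leaf tensor recorded in the graph
print(out._version)      # 0

out.add_(1)              # first in-place modification
out.add_(1)              # second in-place modification
print(out._version)      # 2

# If backward() needed the value 'out' had at version 0, autograd would notice
# the mismatch and raise the "is at version ...; expected version ... instead" error.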

Then why does PyTorch use in-place operations by default?
Answer: to save memory and speed things up. As mentioned above, an in-place operation changes the data in memory directly, without making a copy first. Large neural networks nowadays often have tens of thousands, hundreds of thousands, or even millions of parameters; if every update had to copy the parameters first and then compute on the copies, a huge amount of extra memory would be spent just storing those copies.
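This is also exactly what the optimizer does: opt.step() updates every parameter in place, which keeps memory usage flat but bumps the parameter's version counter. A small sketch of my own (again using the internal _version attribute only for illustration):

import torch

lin = torch.nn.Linear(1, 1)
opt = torch.optim.SGD(lin.parameters(), lr=0.1)

w = lin.weight
ptr_before, ver_before = w.data_ptr(), w._version

loss = lin(torch.ones(1, 1)).sum()
loss.backward()
opt.step()                           # w -= lr * w.grad, performed in place

print(w.data_ptr() == ptr_before)    # True: same memory, nothing was copied
print(w._version > ver_before)       # True: the in-place update bumped the version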

How does the hidden-layer output h affect the gradient calculation?
Answer: the calculation formula is shown below.
[Figure: the gradient calculation formula, omitted here.]
For the detailed derivation, please refer to the RNN mathematical model I posted earlier: Build an RNN module based on Numpy and apply it in an example (with code)

Of course, if you would rather skip the mathematical derivation, you can simply take away this conclusion: in an RNN, the hidden-layer output h participates directly in the gradient computation during backpropagation.
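To make that concrete, here is a toy, hand-written single RNN step (my own sketch, biases omitted, with names like W_ih and W_hh chosen to mirror nn.RNN): the gradient of the loss with respect to W_hh literally contains h_prev, which is why autograd has to keep the tensors along this path unmodified until backward() runs.

import torch

# One RNN step: h_t = tanh(x_t @ W_ih + h_prev @ W_hh)
W_ih = torch.randn(1, 1, requires_grad=True)
W_hh = torch.randn(1, 1, requires_grad=True)
x_t = torch.tensor([[1.0]])
h_prev = torch.tensor([[0.5]])

h_t = torch.tanh(x_t @ W_ih + h_prev @ W_hh)
loss = (h_t - 1.0).pow(2).sum()
loss.backward()

# dL/dW_hh = dL/dh_t * (1 - h_t**2) * h_prev  -- h_prev appears directly in the gradient
print(W_hh.grad)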

So how do we finally fix the error caused by the in-place behaviour?
Answer: break the sharing. Give the variable we actually want to "assign a value to" a tensor of its own, one that is no longer tied to the old computation graph. The methods that achieve this include detach() and clone(); since detach() is the more common choice here, only the detach() method is described below.

Introducing an intermediate variable the conventional way, e.g. mid = h_next followed by h_cur = mid, does not help: plain assignment still copies nothing, and printing id() shows that h_cur, mid, and h_next all share the same address, as the sketch below demonstrates.
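A quick check (my own sketch):

import torch

h_next = torch.randn(1, 1)

mid = h_next          # just another name, no copy is made
h_cur = mid           # still the very same tensor object

print(id(h_next), id(mid), id(h_cur))   # all three ids are identical
print(h_cur is h_next)                  # True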

3. The role of the detach() method

① Give the variable a new tensor object of its own

Change the above code to:

h_cur = h_next.detach()
print(id(h_cur))
print(id(h_next))
Output ---------------------------------------------
3197060036944
3197060036864

The two variables are now two separate tensor objects, and the problem is solved.
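One subtlety worth noting (my own check, not from the original post): detach() creates a new tensor object with a new id(), but it does not copy the underlying data; the storage is still shared, as data_ptr() shows. What actually resolves the error is that the detached tensor no longer carries a link back to the old computation graph. If a truly independent copy is also wanted, combine detach() with clone():

import torch

h_next = torch.randn(1, 1, requires_grad=True) * 2   # non-leaf tensor with a grad history
h_cur = h_next.detach()

print(id(h_cur) == id(h_next))                  # False: two different Python objects
print(h_cur.data_ptr() == h_next.data_ptr())    # True: the data itself is not copied
print(h_cur.grad_fn)                            # None: no link back to the graph

h_copy = h_next.detach().clone()
print(h_copy.data_ptr() == h_next.data_ptr())   # False: clone() allocates new memory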

② Turn variables into leaf nodes

Whether a variable is a leaf node can be checked through its is_leaf attribute:

h_cur = h_next.detach()
print('h_next_requires_grad:',h_next.requires_grad)
print('h_cur_requires_grad:',h_cur.requires_grad)
print('h_next_is_leaf:',h_next.is_leaf)
print('h_cur_is_leaf:',h_cur.is_leaf)
Output ---------------------------------------------
h_next_requires_grad: True
h_cur_requires_grad: False
h_next_is_leaf: False
h_cur_is_leaf: True

It can be seen that the detach() method cuts h_cur out of the backpropagation graph: its requires_grad is False and it becomes a leaf node.
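A minimal check of what "cut out of backpropagation" means in practice (my own sketch): anything computed from the detached tensor no longer sends gradients back to the original parameters.

import torch

w = torch.ones(1, 1, requires_grad=True)
h = w * 3                  # h depends on w
h_det = h.detach()         # detached: requires_grad=False, leaf, no grad_fn

out_det = (h_det * 2).sum()
# out_det.backward()       # would fail: out_det does not require grad at all

out = (h * 2).sum()
out.backward()
print(w.grad)              # tensor([[6.]]) -- gradients flow only through the non-detached path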

The definition and role of leaf/non-leaf nodes is beyond the scope of this article; I recommend a very good blog on the topic: Pytorch leaf tensor (leaf node) (detach)

4. Corrected code

The complete code after correction is as follows:

import torch

rnn = torch.nn.RNN(input_size=1, hidden_size=1, num_layers=1)

train_set_x = torch.tensor([[[[1]]],[[[2]]],[[[3]]],[[[4]]],[[[5]]]], dtype=torch.float32)
train_set_y = torch.tensor([[[[2]]],[[[4]]],[[[6]]],[[[8]]],[[[10]]]], dtype=torch.float32)

h0 = torch.tensor([[[0]]], dtype=torch.float32)
h_cur = h0

loss = torch.nn.MSELoss()
opt = torch.optim.Adadelta(rnn.parameters(), lr = 0.01)


for i in range(5):
    opt.zero_grad()
    train_output, h_next = rnn(train_set_x[i], h_cur)
    rnn_loss = loss(train_output, train_set_y[i])
    h_cur = h_next.detach()   # detach the hidden state from the old graph before the next iteration
    rnn_loss.backward()
    opt.step()
    print(train_output)


# print(id(h_cur))
# print(id(h_next))
# print('h_next_requires_grad:',h_next.requires_grad)
# print('h_cur_requires_grad:',h_cur.requires_grad)
# print('h_next_is_leaf:',h_next.is_leaf)
# print('h_cur_is_leaf:',h_cur.is_leaf)

In addition, on some PyTorch versions the original code still runs even without detach():
[Screenshot of the original code running without error, omitted here.]
My guess is that earlier versions of PyTorch did not perform these operations in place by default, which of course would bring the increased memory consumption mentioned above.

Origin blog.csdn.net/m0_49963403/article/details/129767497