Debugging GPU memory that keeps growing slowly

The code runs normally, but GPU memory usage gradually increases until it eventually runs out of memory!

Common possibilities and solutions:

# Enable torch's cuDNN optimizations
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True

# Accumulate the loss with .item()
train_loss += loss.item()  

# delete unneeded variables and clear the cache
del variable
torch.cuda.empty_cache()

None of the common solutions above worked. empty_cache() can only release cached blocks that have already been discarded; it cannot lower peak GPU memory usage.
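To make that concrete, here is a small standalone sketch (the tensor size and device index are arbitrary assumptions) comparing memory_allocated() with memory_reserved() around an empty_cache() call:

import torch

device = torch.device('cuda:0')

# Allocate a tensor, then drop the reference so its block goes back to the caching allocator.
x = torch.randn(1024, 1024, device=device)
del x

print('allocated:', torch.cuda.memory_allocated(device))  # bytes held by live tensors
print('reserved: ', torch.cuda.memory_reserved(device))   # includes cached, reusable blocks

torch.cuda.empty_cache()  # hands cached (unused) blocks back to the driver

print('reserved after empty_cache:', torch.cuda.memory_reserved(device))
# Tensors that are still referenced (e.g. by an autograd graph) are NOT freed here,
# which is why empty_cache() cannot reduce peak usage caused by a growing graph.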

Problems encountered:
(A differentiable noise layer was added to the model. During training this sub-model is wrapped in torch.enable_grad(), so to locate the problem I first tried putting it under with torch.no_grad():)

with torch.no_grad():   # with this line added, the memory-growth problem is solved
	out = self.noise_layers(encoded, cover_img)

I printed the change in allocated memory around the call. The accumulating gradient cache is not reflected in this allocation delta:

memory1 = torch.cuda.memory_allocated(0)   # bytes allocated before the call
out = self.noise_layers(encoded, cover_img)
memory2 = torch.cuda.memory_allocated(0)   # bytes allocated after the call
print('noise layer:', memory2 - memory1, '\n')
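Since a single before/after snapshot around one call can hide the problem, another option is to watch the allocated and peak values across iterations. A minimal standalone sketch with a tiny stand-in model (not the project's code):

import torch
import torch.nn as nn

device = torch.device('cuda:0')
model = nn.Linear(256, 256).to(device)               # tiny stand-in model (assumption)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

torch.cuda.reset_peak_memory_stats(device)
for step in range(500):
    x = torch.randn(64, 256, device=device)
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        alloc = torch.cuda.memory_allocated(device) / 2**20
        peak = torch.cuda.max_memory_allocated(device) / 2**20
        print(f'step {step}: allocated={alloc:.1f} MiB, peak={peak:.1f} MiB')
        # "allocated" creeping upward between reports is the signature of a leak
        # (e.g. retained graphs), even when one layer's before/after delta looks flat.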

Although GPU memory no longer grows after adding with torch.no_grad(), no_grad() cuts the computation graph, so gradients can no longer be propagated back through this sub-model.
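To see why the no_grad() workaround is unacceptable here, a tiny sketch with generic layers standing in for the encoder and the noise layer (all names are placeholders):

import torch
import torch.nn as nn

encoder = nn.Linear(8, 8)       # stand-in for the real encoder
noise_layer = nn.Linear(8, 8)   # stand-in for the differentiable noise layer

x = torch.randn(4, 8)
encoded = encoder(x)

with torch.no_grad():
    out = noise_layer(encoded)

print(out.requires_grad)        # False: the computation graph is cut here
loss = out.pow(2).mean()
# loss.backward() would fail (no grad_fn), so the encoder upstream of the
# noise layer would never receive gradients.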

Finally, I found the following definition in the code:

matrix = np.array(
    [[0.299, 0.587, 0.114],
     [-0.168736, -0.331264, 0.5],
     [0.5, -0.418688, -0.081312]], dtype=np.float32).T
self.shift = nn.Parameter(torch.tensor([0., 128., 128.]), requires_grad=False)
self.matrix = nn.Parameter(torch.from_numpy(matrix), requires_grad=False)

Here's the problem! Add requires_grad=False to every nn.Parameter() and you're done!

Check the official PyTorch documentation for torch.nn.Parameter().
You will find that nn.Parameter is a subclass of Tensor whose requires_grad defaults to True, so it can be used as a trainable parameter in an nn.Module; it is also the standard way to register parameters when initializing a module in torch.
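For illustration, a quick check of that default, plus register_buffer() as another common way to keep constant tensors on a module (a general pattern, not necessarily what this project uses; the module name here is made up):

import torch
import torch.nn as nn

p = nn.Parameter(torch.tensor([0., 128., 128.]))
print(p.requires_grad)   # True: nn.Parameter requires grad by default

class ColorConvert(nn.Module):
    """Hypothetical module holding the YCbCr constants as non-trainable tensors."""
    def __init__(self):
        super().__init__()
        # Option 1: explicit requires_grad=False, as in the fix above.
        self.shift = nn.Parameter(torch.tensor([0., 128., 128.]), requires_grad=False)
        # Option 2: register_buffer() -- moves with .to()/.cuda() and is saved in the
        # state_dict, but is never treated as a trainable parameter.
        self.register_buffer('shift_buf', torch.tensor([0., 128., 128.]))

m = ColorConvert()
print([n for n, _ in m.named_parameters()])  # ['shift']
print([n for n, _ in m.named_buffers()])     # ['shift_buf']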

Recently I ran into another pitfall. When logging the loss I saved the tensor directly, which caused the tensors to keep accumulating.

# step update
current_step += 1
train_step += 1
loss_per_epoch = loss_per_epoch + (loss_RecImg + loss_RecMsg + loss_encoded)
# write losses
utils.mkdir(loss_w_folder)
utils.write_losses(os.path.join(loss_w_folder, 'train-{}.txt'.format(time_now_NewExperiment)), loss_per_epoch/train_step, current_epoch)   

GPU memory kept rising, so I changed it to

loss_per_epoch = loss_per_epoch + (loss_RecImg.item() + loss_RecMsg.item() + loss_encoded.item())
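The reason is that every loss tensor carries its computation graph, and summing the tensors keeps all of those graphs (and their activations) alive until the sum is released. A minimal illustration with a generic model, not the project's code:

import torch
import torch.nn as nn

model = nn.Linear(32, 1)
running_tensor = 0.0
running_float = 0.0

for _ in range(3):
    loss = model(torch.randn(16, 32)).mean()
    running_tensor = running_tensor + loss        # builds an ever-growing graph
    running_float = running_float + loss.item()   # plain Python float, no graph

print(type(running_tensor), running_tensor.requires_grad)  # Tensor, True
print(type(running_float))                                  # float
# On the GPU the tensor version also pins every iteration's activations in memory,
# which is exactly the slow "leak" described above.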

Origin blog.csdn.net/mr1217704159/article/details/121691953