Notes on CUDA memory allocation, torchinfo, and gpustat

I happened across an excellent Zhihu article, "PyTorch Memory Allocation Principles - Taking BERT as an Example", and took these notes on CUDA memory allocation, on using torchinfo to inspect model parameters, and on using gpustat to view GPU memory usage and process information.

Memory allocation

Memory usage shown in nvidia-smi = CUDA context + PyTorch caching allocator pool (allocated + unallocated cache).
Experiment:

import torch

a = torch.zeros(size=(1024, 1024)).cuda()  # 1024*1024 float32s * 4 bytes = 4M
torch.cuda.memory_allocated() / 1024 / 1024  # memory occupied by live tensors: 4M
torch.cuda.memory_reserved() / 1024 / 1024   # total memory reserved by the caching allocator: 20M

# The caching allocator holds 20M in total (used + unused cache);
# tensors occupy 4M of it (used cache),
# so the unused cache is 20 - 4 = 16M.

""" nvidia-smi: muyao(1251M) """
# Usage shown in nvidia-smi = CUDA context + caching allocator pool = 1251M,
# so the CUDA context takes 1251 - 20 = 1231M.

After deleting the temporary variable a:

del a
torch.cuda.memory_allocated() / 1024 / 1024  # memory occupied by live tensors: 0M
torch.cuda.memory_reserved() / 1024 / 1024   # total memory reserved by the caching allocator: 20M
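
The freed block is returned to PyTorch's cache, not to the GPU driver, so a new tensor of the same size reuses it without growing the reserved pool. A quick check, continuing the same session:

b = torch.zeros(size=(1024, 1024)).cuda()  # reuses the cached 4M block
torch.cuda.memory_allocated() / 1024 / 1024  # back to 4M
torch.cuda.memory_reserved() / 1024 / 1024   # still 20M: nothing new requested from the driver
del b  # free it again before the next step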

After clearing the cache:

torch.cuda.empty_cache()
print(torch.cuda.memory_allocated() /1024/1024)  # 0M
print(torch.cuda.memory_reserved() /1024/1024)   # 0M
print(torch.cuda.memory_summary())

[Screenshot: output of torch.cuda.memory_summary()]
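
To verify the formula without reading numbers off nvidia-smi by hand, you can query per-process usage programmatically. A minimal sketch, assuming the nvidia-ml-py package is installed (imported as pynvml; the NVML calls below exist, but usedGpuMemory may be None on some drivers):

import os
import torch
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

a = torch.zeros(size=(1024, 1024), device="cuda:0")  # creates the context + a 4M tensor

# Per-process usage, as nvidia-smi reports it (usedGpuMemory is in bytes).
me = os.getpid()
procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
smi_mb = next(p.usedGpuMemory for p in procs if p.pid == me) / 1024 / 1024

reserved_mb = torch.cuda.memory_reserved() / 1024 / 1024
print(f"CUDA context is roughly {smi_mb - reserved_mb:.0f}M")  # ~1231M in the run above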

Using torchinfo to view the model structure and parameter counts

Installation: pip install torchinfo
Usage: take T5 as an example.

from transformers import T5Config, T5ForConditionalGeneration
from torchinfo import summary

model_name_or_path = "ptms/checkpoint-xxxxx"  # local checkpoint directory
config = T5Config.from_pretrained(model_name_or_path)
model = T5ForConditionalGeneration.from_pretrained(model_name_or_path, config=config)
model.to("cuda:1")

summary(model)  # with no inputs passed, torchinfo only counts parameters

Output summary:

=======================================================
Layer (type:depth-idx)                  Param #
=======================================================
T5ForConditionalGeneration              --
├─Embedding: 1-1                        24,674,304
├─T5Stack: 1-2                          24,674,304
│    └─Embedding: 2-1                   (recursive)
│    └─ModuleList: 2-2                  --
│    │    └─T5Block: 3-1                7,079,808
│    │    └─T5Block: 3-2                7,079,424
│    │    └─T5Block: 3-3                7,079,424
│    │    └─T5Block: 3-4                7,079,424
│    │    └─T5Block: 3-5                7,079,424
│    │    └─T5Block: 3-6                7,079,424
│    │    └─T5Block: 3-7                7,079,424
│    │    └─T5Block: 3-8                7,079,424
│    │    └─T5Block: 3-9                7,079,424
│    │    └─T5Block: 3-10               7,079,424
│    │    └─T5Block: 3-11               7,079,424
│    │    └─T5Block: 3-12               7,079,424
│    └─T5LayerNorm: 2-3                 768
│    └─Dropout: 2-4                     --
├─T5Stack: 1-3                          24,674,304
│    └─Embedding: 2-5                   (recursive)
│    └─ModuleList: 2-6                  --
│    │    └─T5Block: 3-13               9,439,872
│    │    └─T5Block: 3-14               9,439,488
│    │    └─T5Block: 3-15               9,439,488
│    │    └─T5Block: 3-16               9,439,488
│    │    └─T5Block: 3-17               9,439,488
│    │    └─T5Block: 3-18               9,439,488
│    │    └─T5Block: 3-19               9,439,488
│    │    └─T5Block: 3-20               9,439,488
│    │    └─T5Block: 3-21               9,439,488
│    │    └─T5Block: 3-22               9,439,488
│    │    └─T5Block: 3-23               9,439,488
│    │    └─T5Block: 3-24               9,439,488
│    └─T5LayerNorm: 2-7                 768
│    └─Dropout: 2-8                     --
├─Linear: 1-4                           24,674,304
=======================================================
Total params: 247,577,856
Trainable params: 247,577,856
Non-trainable params: 0
=======================================================
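
Calling summary(model) with no inputs only counts parameters. If you also pass example inputs, torchinfo runs a forward pass and reports per-layer output shapes and size estimates as well. A minimal sketch, assuming torchinfo's input_data accepts a dict of forward() keyword arguments (the dummy shape here is arbitrary):

import torch

dummy_ids = torch.ones(1, 16, dtype=torch.long, device="cuda:1")
summary(
    model,
    input_data={"input_ids": dummy_ids, "decoder_input_ids": dummy_ids},
    depth=2,  # collapse the per-block detail into the two T5Stacks
)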

Using gpustat to see which processes occupy GPU memory

gpustat parameters:

usage: gpustat [-h] [--force-color | --no-color] [-c] [-u] [-p] [-F] [--json] [-v] [-P [{,draw,limit,draw,limit,limit,draw}]] [-i [INTERVAL]] [--no-header] [--gpuname-width GPUNAME_WIDTH] [--debug]

[Screenshot: gpustat output]
For example:
show GPU usage together with process PIDs: gpustat -i -p
find the working directory of a given PID: pwdx [PID]
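
gpustat also exposes a small Python API, which is handy when you want to act on the process list programmatically. A minimal sketch, assuming a recent gpustat where new_query() exists and each per-process entry is a dict with the keys below (they mirror gpustat's JSON output and may differ between versions):

import gpustat

stats = gpustat.new_query()  # one snapshot of all visible GPUs
for gpu in stats:
    for proc in gpu.processes:
        print(f"GPU{gpu.index}: pid={proc['pid']} user={proc['username']} "
              f"mem={proc['gpu_memory_usage']}MB")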

Source: blog.csdn.net/muyao987/article/details/126480512