CUDA out of memory solution

The original post is https://www.zhihu.com/question/274635237 . I ran into CUDA out of memory while testing a text detection model, and using torch.no_grad() together with torch.cuda.empty_cache() noticeably reduced GPU memory usage.

Generally, the GPU memory in use can be divided into three parts:

  1. The memory occupied by the network model's own parameters.
  2. The intermediate variables or parameters generated during model computation (forward activations, backward gradients, and optimizer state).
  3. Some additional overhead from the programming framework itself.
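
For the first part, the parameter footprint can be measured directly from the model; a minimal sketch (the small model below is just a placeholder):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

    # Bytes taken by parameters and by buffers (e.g. BatchNorm running statistics)
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_bytes = sum(b.numel() * b.element_size() for b in model.buffers())
    print(f"parameters: {param_bytes / 1024**2:.2f} MiB, buffers: {buffer_bytes / 1024**2:.2f} MiB")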

Change the network structure

  1. Reduce batch_size...

  2. Sacrifice computation speed to reduce memory usage: split the computation in two, run the first half of the model and keep only a checkpointed intermediate result, then run the second half (gradient checkpointing via torch.utils.checkpoint).

    import torch
    import torch.nn as nn

    # Input
    input = torch.rand(1, 10)
    # Suppose we have a very deep network
    layers = [nn.Linear(10, 10) for _ in range(1000)]
    model = nn.Sequential(*layers)
    output = model(input)

    ### The above can be changed as follows
    # First set requires_grad=True on the input;
    # otherwise the resulting gradients may be zero

    input = torch.rand(1, 10, requires_grad=True)
    layers = [nn.Linear(10, 10) for _ in range(1000)]


    # Define the functions that run the layers; note that we define two:
    # one computes the first 500 layers, the other computes the rest

    def run_first_half(*args):
        x = args[0]
        for layer in layers[:500]:
            x = layer(x)
        return x

    def run_second_half(*args):
        x = args[0]
        for layer in layers[500:-1]:
            x = layer(x)
        return x

    # Bring in the checkpoint utility
    from torch.utils.checkpoint import checkpoint

    x = checkpoint(run_first_half, input)
    x = checkpoint(run_second_half, x)
    # The last layer is run separately, outside the checkpoint
    x = layers[-1](x)
    x.sum().backward()  # and that is all it takes
    
  3. Use pooling to reduce the size of the feature map
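
  For instance (an illustrative sketch, not from the original post), a 2x2 max pooling halves each spatial dimension, so every later feature map takes a quarter of the memory:

    import torch
    import torch.nn as nn

    x = torch.rand(1, 64, 128, 128)      # N, C, H, W
    pool = nn.MaxPool2d(kernel_size=2)   # stride defaults to kernel_size
    y = pool(x)
    print(x.shape, "->", y.shape)        # (1, 64, 128, 128) -> (1, 64, 64, 64)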

  4. Reduce the use of fully connected layers
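
  One common way to cut fully connected parameters (an illustrative sketch; the layer sizes are made up) is to replace flatten + Linear on a large feature map with global average pooling followed by a small Linear:

    import torch.nn as nn

    # Flattening a 256 x 7 x 7 feature map into a 4096-d FC layer costs ~51M weights
    fc_head = nn.Sequential(nn.Flatten(), nn.Linear(256 * 7 * 7, 4096), nn.Linear(4096, 10))

    # Global average pooling first leaves only a tiny 256 -> 10 layer
    gap_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 10))

    print(sum(p.numel() for p in fc_head.parameters()))   # ~51.4M
    print(sum(p.numel() for p in gap_head.parameters()))  # 2570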

Without modifying the network structure

  1. Use inplace operations as much as possible; for example, ReLU can use inplace=True. A simple way to apply this to an existing model is:

    def inplace_relu(m):
        classname = m.__class__.__name__
        if classname.find('ReLU') != -1:   # note: the PyTorch class is named 'ReLU', not 'Relu'
            m.inplace = True

    model.apply(inplace_relu)
    
  2. Delete the loss at the end of each iteration. This saves only a little GPU memory, but it is better than nothing.
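
  A minimal sketch of what this looks like in a training loop (the model, data, and optimizer below are placeholders):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    running_loss = 0.0
    for _ in range(100):                      # stand-in for a real DataLoader
        inputs = torch.rand(8, 10)
        targets = torch.randint(0, 2, (8,))

        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()           # accumulate a Python float, not the tensor
        del loss                              # drop the reference at the end of the iteration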

  3. Use float16 mixed-precision computation; this can save close to 50% of GPU memory, but be careful with numerically unsafe operations such as mean and sum, which can overflow in fp16.
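
  A minimal sketch using PyTorch's automatic mixed precision (torch.cuda.amp, available since PyTorch 1.6): autocast runs most ops in float16 while keeping numerically sensitive ones in float32, and GradScaler scales the loss to avoid fp16 underflow. The model and data are placeholders:

    import torch
    import torch.nn as nn

    model = nn.Linear(1024, 1024).cuda()
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler()

    for _ in range(10):
        x = torch.rand(64, 1024, device="cuda")
        y = torch.rand(64, 1024, device="cuda")

        optimizer.zero_grad()
        with torch.cuda.amp.autocast():       # forward pass (mostly) in float16
            loss = criterion(model(x), y)
        scaler.scale(loss).backward()         # scaled backward pass
        scaler.step(optimizer)
        scaler.update()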

  4. For passes that do not require backpropagation, such as validation and testing, use torch.no_grad(). Note that model.eval() is not equivalent to torch.no_grad():

    • model.eval() notifies all layers that you are in eval mode, so that batch norm and dropout layers work in eval mode instead of training mode.
    • torch.no_grad() disables autograd. It reduces memory usage and speeds up computation, but backpropagation is impossible (which is not needed in an eval script).
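
  Putting the two together in a validation loop might look like this (the model and data are placeholders):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(0.5), nn.Linear(10, 2))

    model.eval()                  # dropout / batch norm switch to eval behaviour
    with torch.no_grad():         # no graph is built, so activations are freed right away
        for _ in range(10):       # stand-in for a validation DataLoader
            inputs = torch.rand(8, 10)
            outputs = model(inputs)
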
  5. torch.cuda.empty_cache() is an upgraded version of del: after a tensor is deleted its memory stays in PyTorch's cache, and empty_cache() returns those cached blocks to the GPU.
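
  For example (illustrative only): deleting a tensor returns its memory to PyTorch's caching allocator, and empty_cache() then hands the cached blocks back to the GPU:

    import torch

    x = torch.rand(1024, 1024, 256, device="cuda")    # ~1 GiB tensor
    print(torch.cuda.memory_allocated() / 1024**2)    # ~1024 MiB in use

    del x                                             # no longer allocated, but still cached
    print(torch.cuda.memory_allocated() / 1024**2,
          torch.cuda.memory_reserved() / 1024**2)

    torch.cuda.empty_cache()                          # return the cached blocks to the GPU
    print(torch.cuda.memory_reserved() / 1024**2)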

  6. Switch optimizers: in terms of memory usage, theoretically SGD < SGD with momentum < Adam, since their update formulas require increasingly more intermediate state.
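
  A quick way to see the extra state (an illustrative sketch): after one optimizer step, plain SGD keeps no extra buffers, SGD with momentum keeps one buffer per parameter, and Adam keeps two:

    import torch
    import torch.nn as nn

    def state_bytes(optimizer_cls, **kwargs):
        model = nn.Linear(1000, 1000)
        opt = optimizer_cls(model.parameters(), lr=0.1, **kwargs)
        model(torch.rand(4, 1000)).sum().backward()
        opt.step()                                # state buffers are created on the first step
        return sum(t.numel() * t.element_size()
                   for state in opt.state.values()
                   for t in state.values() if torch.is_tensor(t))

    print(state_bytes(torch.optim.SGD))                # 0: no extra state
    print(state_bytes(torch.optim.SGD, momentum=0.9))  # one momentum buffer per parameter
    print(state_bytes(torch.optim.Adam))               # exp_avg + exp_avg_sq per parameter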

  7. Use depthwise convolutions.
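
  A depthwise convolution is an nn.Conv2d with groups equal to the number of input channels; compared with a standard convolution it stores far fewer weights, as a quick parameter count shows (illustrative sketch):

    import torch.nn as nn

    standard = nn.Conv2d(256, 256, kernel_size=3, padding=1)               # 256*256*3*3 weights
    depthwise = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=256)  # 256*1*3*3 weights
    pointwise = nn.Conv2d(256, 256, kernel_size=1)                         # 1x1 conv to mix channels

    print(sum(p.numel() for p in standard.parameters()))       # 590,080
    print(sum(p.numel() for p in depthwise.parameters())
          + sum(p.numel() for p in pointwise.parameters()))    # 68,352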

  8. Don't load all the data at once; read it in parts (for example, batch by batch through a DataLoader), and memory shortages largely disappear.
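
  A sketch of a Dataset that keeps only file paths in memory and loads each sample on demand (the paths and the loading step are placeholders), so only the current batch needs to be resident:

    import torch
    from torch.utils.data import Dataset, DataLoader

    class LazyDataset(Dataset):
        """Keeps only file paths in memory; samples are read when requested."""
        def __init__(self, paths):
            self.paths = paths

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, idx):
            # Stand-in for real I/O, e.g. Image.open(self.paths[idx])
            return torch.rand(3, 224, 224)

    dataset = LazyDataset(paths=["img_%d.jpg" % i for i in range(10000)])
    loader = DataLoader(dataset, batch_size=32)
    for batch in loader:
        pass    # only one batch is in memory at a time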


Origin: blog.csdn.net/m0_38007695/article/details/108085949