The Ultimate Guide to PyTorch Code Reproduction

----------------------------------- Strive to reproduce accurately, exactly -----------------------------------

Code reproduction needs to consider:

1. Random seed setting;

2. DataLoader settings;

3. CUDA algorithm randomness;

4. Call order of the random number generator;

5. Multi-GPU issues.

The first half comes from: https://zhuanlan.zhihu.com/p/532511514

When I first started with PyTorch, I found this author's article. At the time I didn't understand the "huge pitfall" part at the end, but I have since run into the same problem myself, so I'd like to add a small supplement to help others understand it.

1. Random seed

The entry-level approach to PyTorch reproducibility is the official guide, which requires setting the various random seeds:

https://pytorch.org/docs/stable/notes/randomness.html

import random
import numpy as np
import torch

random.seed(0)  # Python random seed
np.random.seed(0)  # NumPy random seed
torch.manual_seed(0)  # PyTorch random seed
torch.cuda.manual_seed(0)  # CUDA random seed (current GPU)
torch.cuda.manual_seed_all(0)  # CUDA random seed (all GPUs)
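
For convenience, these calls are often wrapped into a single helper; a minimal sketch (the name seed_everything is mine, not from the guide):

import random
import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    # Set all of the seeds above in one call
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_everything(0)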

2. DataLoader parallelism

When the DataLoader uses worker processes (num_workers greater than 0), it also introduces randomness. Two solutions:

1. Disable worker processes: set num_workers to 0.

2. Fix each worker's seed via worker_init_fn, as in the official example:

import random
import numpy
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # Derive a per-worker seed from the main process's seed
    worker_seed = torch.initial_seed() % 2 ** 32
    numpy.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)

DataLoader(
    train_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    worker_init_fn=seed_worker,
    generator=g,
)

That said, in my own code the DataLoader gets no special treatment and the results still reproduce.

3. CUDA algorithm randomness

Some parallel algorithms are nondeterministic, for example LSTM, RNN, and attention implementations.

In particular, with CUDA Toolkit 10.2 or later, the new buffer management and heuristics in the cuBLAS library introduce nondeterminism: the default configuration uses two workspace buffer sizes (16 KB and 4 MB), and which one is used can vary between runs.

The solution is to set the environment variable at the top of the code:

import os
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'  # or ':16:8' (smaller workspace)
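
The official randomness guide linked above also pairs this environment variable with torch.use_deterministic_algorithms, which makes PyTorch choose deterministic kernels and raise an error for operations that have none:

import torch

# Requires CUBLAS_WORKSPACE_CONFIG to be set (as above) on CUDA 10.2+;
# any operation without a deterministic implementation raises a RuntimeError.
torch.use_deterministic_algorithms(True)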

If convolutions (cuDNN) are used, the following flags must also be set:

torch.backends.cudnn.benchmark = False  # stop cuDNN from benchmarking and auto-selecting algorithms
torch.backends.cudnn.deterministic = True  # force cuDNN to use deterministic algorithms

With all of these set, results reproduce in 99% of cases. If they still do not, restart the notebook or the Python process.

4. Random number generator details

If you run PyTorch training multiple times inside a for loop, randomness appears across runs. The following common remedies are all ineffective: forcing empty_cache before each training run; manually deleting variables after each iteration and collecting them with gc; forcibly re-initializing model parameters; forcing set_rng_state; restarting the Python file or notebook.

The Zhihu author's solution: 1. Re-seed inside the for loop (see the sketch below). 2. Dropout in the model adds randomness across loop iterations; avoid calling Dropout explicitly if you can.
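
A minimal sketch of re-seeding at the top of every iteration (num_runs, build_model, and train are hypothetical placeholders, not from the original article):

import random
import numpy as np
import torch

for run in range(num_runs):  # num_runs: hypothetical number of repeated runs
    # Reset every RNG so each run starts from an identical state
    random.seed(0)
    np.random.seed(0)
    torch.manual_seed(0)
    torch.cuda.manual_seed_all(0)

    model = build_model()  # hypothetical model constructor
    train(model)           # hypothetical training loop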

My own experimental results:

Pay attention to the order in which the random number generator is called!

For example: iterating a DataLoader once advances the RNG state and therefore changes the random numbers the next DataLoader draws.

Explanation: for example, suppose the model is trained on one of two schedules:

  1. After training an epoch, continue straight to the next epoch.

  2. After training an epoch, run validation, then train the next epoch.

From the second epoch onward these two schedules give different training results, even though the model weights are identical before and after validation, so it must be the generated random numbers that changed. One way to neutralize this is to snapshot and restore the RNG state around validation, as sketched below.
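
A minimal sketch of that idea, assuming a placeholder validate routine (not from the original article): snapshot the global RNG state before validation and restore it afterwards, so the next training epoch sees the same random stream whether or not validation ran.

import torch

def validate_without_touching_rng(model, val_loader):
    # Snapshot the global RNG state (CPU and every GPU)
    cpu_state = torch.get_rng_state()
    cuda_states = torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None
    try:
        with torch.no_grad():
            validate(model, val_loader)  # placeholder validation routine
    finally:
        # Restore the snapshot so training resumes from the same stream
        torch.set_rng_state(cpu_state)
        if cuda_states is not None:
            torch.cuda.set_rng_state_all(cuda_states)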

In addition: adding the following import before the model code also changes the training result. My guess is that the import itself calls PyTorch's random number generator, shifting every random draw in the code that follows.

from frame.loss import FocalLoss, LabelSmoothCrossEntropyLoss, TimeWeightedCELoss
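
You can check whether an import consumes random numbers by comparing the RNG state around it; a small diagnostic sketch (frame.loss is the project-specific module from the line above):

import torch

state_before = torch.get_rng_state()
from frame.loss import FocalLoss, LabelSmoothCrossEntropyLoss, TimeWeightedCELoss
state_after = torch.get_rng_state()

# If this assertion fails, the import drew from the global RNG
assert torch.equal(state_before, state_after), "import consumed RNG draws"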

This article makes the same point: https://zhuanlan.zhihu.com/p/352833875

5. Multi-GPU issues

If multi-GPU training cannot be reproduced exactly, try a single GPU instead. I have not found a proper solution yet; comments are welcome.

Summary

That is the ultimate guide to reproducing PyTorch code. To be safe, first add every setting that can be added, then check whether the results reproduce.

After that, if you are a perfectionist, you can subtract the settings one by one until only the necessary ones remain.

May the world be at peace, and may your code be free of pitfalls...


Original: blog.csdn.net/BeiErGeLaiDe/article/details/129306023