pytorch training on Windows reports OSError: [WinError 1455] The paging file is too small for this operation to complete

When training with pytorch on Windows, with yolov5, yolov8, or anything else that spawns multiple processes (for example a dataloader with num_workers set relatively high), you may hit the error "OSError: [WinError 1455] The paging file is too small for this operation to complete".

This error came up during a yolov5 training run. The training command was:

python train.py --data data/worker_data/dataset.yaml --cfg models/yolov5s.yaml --weights weights/yolov5s.pt --batch-size 16 --epochs 200 --cache 

Only the batch size is specified here (16); workers is left at its default of 8. The actual num_workers is the minimum of the logical CPU core count (16 on my machine), the batch size, and workers, i.e. 8. Note that the dataloaders of both the training set and the validation set spawn their own worker processes.
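To make that rule concrete, here is a simplified sketch of the min() selection (illustrative only; yolov5's actual code differs in details, e.g. it also divides the CPU count by the number of GPUs):

import os

def pick_num_workers(batch_size: int, workers: int = 8) -> int:
    # num_workers = min(logical CPU cores, batch size, requested workers)
    return min(os.cpu_count() or 1, batch_size, workers)

print(pick_num_workers(batch_size=16))  # prints 8 on a 16-core machine with the default workers=8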

If you reduce batch-size or workers the error may go away, but that is obviously not a good fix: a batch size that is too small, or too few workers, will slow down your training.
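For reference, yolov5's train.py exposes a --workers flag, so capping the workers would look like this (the value 2 is just an example, and again, this trades away speed):

python train.py --data data/worker_data/dataset.yaml --cfg models/yolov5s.yaml --weights weights/yolov5s.pt --batch-size 16 --epochs 200 --cache --workers 2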

I. Cause of the problem

The complete ins and outs can be found in this issue: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\Anaconda3\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies · Issue #1643 · ultralytics/yolov3 · GitHub

Briefly: pytorch ships several very large CUDA-related DLLs, and they are loaded as soon as you import any pytorch package. When multiple processes are started, every process loads these DLLs, whether or not it ever actually uses them, and on Windows each load commits virtual memory backed by the page file. If your page file is not large enough, Windows reports that the paging file is too small. When I hit the error, my machine had 32 GB of RAM and a 20 GB page file, and both were completely full.
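The mechanism can be mirrored with a minimal sketch (assuming torch is installed): each spawned process imports torch on startup, and each import maps the CUDA DLLs and commits page-file space.

import multiprocessing as mp

def worker(i):
    import torch  # every process pays the DLL load / memory-commit cost here
    return i

if __name__ == "__main__":
    # 8 processes, analogous to num_workers=8; on Windows each is spawned fresh
    with mp.Pool(processes=8) as pool:
        print(pool.map(worker, range(8)))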

This problem does not exist under Linux, which is why many people never run into it at all: they train on Linux servers. Linux overcommits memory, so reserving memory that is never actually touched allocates nothing on the spot, and no shortage arises.

II. Solutions

1. Increase virtual memory

The simple, brute-force fix is obviously to increase virtual memory. How much depends on your actual workload; for the situation above, roughly 80 to 100 GB of page file was enough. How to change the page-file size is covered in plenty of other posts, so I won't repeat it here, but a few notes:

(1) It doesn't matter if the C drive is short on space: the page file can live on another partition, and can even be spread across several partitions

(2) Use an SSD if you can; a mechanical hard drive will be noticeably slower
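After changing the setting (and rebooting if Windows asks), you can verify the new commit limit, i.e. RAM plus page file, from Python. A small sketch using the Win32 GlobalMemoryStatusEx API via ctypes:

import ctypes

class MEMORYSTATUSEX(ctypes.Structure):
    _fields_ = [
        ("dwLength", ctypes.c_ulong),
        ("dwMemoryLoad", ctypes.c_ulong),
        ("ullTotalPhys", ctypes.c_ulonglong),
        ("ullAvailPhys", ctypes.c_ulonglong),
        ("ullTotalPageFile", ctypes.c_ulonglong),   # commit limit: RAM + page file
        ("ullAvailPageFile", ctypes.c_ulonglong),   # commit still available
        ("ullTotalVirtual", ctypes.c_ulonglong),
        ("ullAvailVirtual", ctypes.c_ulonglong),
        ("ullAvailExtendedVirtual", ctypes.c_ulonglong),
    ]

stat = MEMORYSTATUSEX()
stat.dwLength = ctypes.sizeof(stat)
ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(stat))
print(f"commit limit: {stat.ullTotalPageFile / 1024**3:.1f} GB, "
      f"available: {stat.ullAvailPageFile / 1024**3:.1f} GB")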

2. Upgrade the pytorch version

Method 1 is not really a good solution: the disk loses up to 100 GB for nothing, and the constant page-file writes wear the drive, especially an SSD.

The best solution first. As the issue above notes, the problem may lie with pytorch or with NVIDIA; either way, the DLLs are too large, or at least every process should not have to commit memory for them on load. So try installing a newer pytorch and see whether the problem disappears.

The environment that produced the error above was a conda env with pytorch 1.10.1+cu113.

Now let's try 1.13.1+cu117 (CUDA naturally has to be updated along with it; I didn't try pytorch 2.0, because yolov5 did not seem to support it yet).
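If you want to reproduce this, a typical pip install command for 1.13.1+cu117 looks like the following, using the official wheel index (check pytorch.org for the exact line for your setup):

pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117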

Running the exact same training command again, no error this time, with the page file still at 20 GB. Some virtual memory is still used, but far less than the 80 to 100 GB required before.

 3. Use fixNvPe.py

But if for some reason you cannot upgrade pytorch and are stuck on a particular version, and you don't want to give up that much disk to virtual memory, you can use the fix from the issue above:

https://github.com/ultralytics/yolov3/issues/1643

1. Download fixNvPe.py; for example, I saved it to D:\ai\pytorch\fixNvPe.py

2. pip install pefile

3. Run fixNvPe.py:

cd D:\ai\pytorch\

python fixNvPe.py --input C:\Users\kv183_pro\miniconda3\envs\torch1.0\Lib\site-packages\torch\lib\*.dll

The path to pass in is the lib directory of your pytorch installation. You don't even have to hunt for it: it appears right in the error message, as the path of the DLL that failed to load.
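If you'd rather find the directory programmatically, this snippet prints it (assuming torch is importable in the same environment):

import os
import torch

# the lib directory sits next to torch's __init__.py
print(os.path.join(os.path.dirname(torch.__file__), "lib"))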

OK, it's done

Run the training again and, sure enough, no error! And the memory usage is even lower than with pytorch 1.13. Excellent!

The script backs up every DLL it changes for you; the author was quite considerate.

The script is not long, so in case some readers cannot access the link, I'll paste it directly below. Save it as fixNvPe.py:

# Simple script to disable ASLR and make .nv_fatb sections read-only
# Requires: pefile  ( python -m pip install pefile )
# Usage:  fixNvPe.py --input path/to/*.dll

import argparse
import pefile
import glob
import os
import shutil

def main(args):
    failures = []
    for file in glob.glob( args.input, recursive=args.recursive ):
        print(f"\n---\nChecking {file}...")
        pe = pefile.PE(file, fast_load=True)
        # Look for the .nv_fatb section, which holds the embedded CUDA fat binaries
        nvbSect = [ section for section in pe.sections if section.Name.decode().startswith(".nv_fatb")]
        if len(nvbSect) == 1:
            sect = nvbSect[0]
            size = sect.Misc_VirtualSize
            aslr = pe.OPTIONAL_HEADER.IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE
            writable = 0 != ( sect.Characteristics & pefile.SECTION_CHARACTERISTICS['IMAGE_SCN_MEM_WRITE'] )
            print(f"Found NV FatBin! Size: {size/1024/1024:0.2f}MB  ASLR: {aslr}  Writable: {writable}")
            if (writable or aslr) and size > 0:
                print("- Modifying DLL")
                if args.backup:
                    bakFile = f"{file}_bak"
                    print(f"- Backing up [{file}] -> [{bakFile}]")
                    if os.path.exists( bakFile ):
                        print( f"- Warning: Backup file already exists ({bakFile}), not modifying file! Delete the 'bak' to allow modification")
                        failures.append( file )
                        continue
                    try:
                        shutil.copy2( file, bakFile)
                    except Exception as e:
                        print( f"- Failed to create backup! [{str(e)}], not modifying file!")
                        failures.append( file )
                        continue
                # Disable ASLR for DLL, and disable writing for section
                pe.OPTIONAL_HEADER.DllCharacteristics &= ~pefile.DLL_CHARACTERISTICS['IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE']
                sect.Characteristics = sect.Characteristics & ~pefile.SECTION_CHARACTERISTICS['IMAGE_SCN_MEM_WRITE']
                try:
                    newFile = f"{file}_mod"
                    print( f"- Writing modified DLL to [{newFile}]")
                    pe.write( newFile )
                    pe.close()
                    print( f"- Moving modified DLL to [{file}]")
                    os.remove( file )
                    shutil.move( newFile, file )
                except Exception as e:
                    print( f"- Failed to write modified DLL! [{str(e)}]")
                    failures.append( file )
                    continue

    print("\n\nDone!")
    if len(failures) > 0:
        print("***WARNING**** These files needed modification but failed: ")
        for failure in failures:
            print( f" - {failure}")


def parseArgs():
    parser = argparse.ArgumentParser( description="Disable ASLR and make .nv_fatb sections read-only", formatter_class=argparse.ArgumentDefaultsHelpFormatter )
    parser.add_argument('--input', help="Glob to parse", default="*.dll")
    parser.add_argument('--backup', help="Backup modified files", default=True, required=False)
    parser.add_argument('--recursive', '-r', default=False, action='store_true', help="Recurse into subdirectories")

    return parser.parse_args()


###############################
# program entry point
#
if __name__ == "__main__":
    args = parseArgs()
    main( args )
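If you later want to undo the modification, the "<name>_bak" backups can be moved back over the patched DLLs. A small sketch (the path is illustrative; point it at your own torch lib directory):

import glob
import os

for bak in glob.glob(r"C:\Users\kv183_pro\miniconda3\envs\torch1.0\Lib\site-packages\torch\lib\*.dll_bak"):
    os.replace(bak, bak[: -len("_bak")])  # xxx.dll_bak -> xxx.dll, overwriting the patched copy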

Origin: blog.csdn.net/ogebgvictor/article/details/130468704