The giant ML-Agents pitfall: "CUDA error: no kernel image is available for execution on the device"

After the painful work of setting up the environment, everything looked harmonious and natural on the surface, just waiting for me to start training and enjoy myself. Instead, a ruthless bug hit me in the face once again.
How the problem appeared, and how it was resolved, is documented below.

For the environment setup, see the earlier post: A Preliminary Study on Unity's Connection with ML-Agents

Once the environment is set up, you can use the bundled demos to get familiar with the features by following the official Unity documentation.

The official documentation is here

1. How the problem appeared

This time we are going to run the 3D Balance Ball example.

[Screenshot: the 3D Balance Ball example scene]

Besides simply running the Unity project to watch the example, you can also run training by opening the previously created ml-agents environment, and that is where this pitfall appears.

Open Anaconda Prompt, activate the ml-agents environment, cd into the directory where the release 20 archive was extracted, and enter the following command:

mlagents-learn config/ppo/3DBall.yaml --run-id=first3DBallRun

Here config/ppo/3DBall.yaml is the path to the official default training configuration file.
[Screenshot: the config/ppo folder in the release 20 directory]
The config/ppo folder contains training configuration files for all of the example environments, including 3DBall.
--run-id= gives the training session a unique name, here first3DBallRun.
You can also append --force to force execution; it overwrites the data from the previous run with the same ID. Without --force, the command fails if a results folder with the same name already exists.

mlagents-learn config/ppo/3DBall.yaml --run-id=first3DBallRun --force

After executing the command, under normal circumstances the console shows the following:
[Screenshot: mlagents-learn startup output]
At the end, it prompts you to start the Unity project:
[INFO] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.

After that, click the Play button for the corresponding scene in the Unity project. Under normal circumstances, output similar to the official documentation should appear:

INFO:mlagents_envs:
'Ball3DAcademy' started successfully!
Unity Academy name: Ball3DAcademy

INFO:mlagents_envs:Connected new brain:
Unity brain name: 3DBallLearning
        Number of Visual Observations (per agent): 0
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 1
INFO:mlagents_envs:Hyperparameters for the PPO Trainer of brain 3DBallLearning:
        batch_size:          64
        beta:                0.001
        buffer_size:         12000
        epsilon:             0.2
        gamma:               0.995
        hidden_units:        128
        lambd:               0.99
        learning_rate:       0.0003
        max_steps:           5.0e4
        normalize:           True
        num_epoch:           3
        num_layers:          2
        time_horizon:        1000
        sequence_length:     64
        summary_freq:        1000
        use_recurrent:       False
        memory_size:         256
        use_curiosity:       False
        curiosity_strength:  0.01
        curiosity_enc_size:  128
        output_path: ./results/first3DBallRun/3DBallLearning
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 1000. Mean Reward: 1.242. Std of Reward: 0.746. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 2000. Mean Reward: 1.319. Std of Reward: 0.693. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 3000. Mean Reward: 1.804. Std of Reward: 1.056. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 4000. Mean Reward: 2.151. Std of Reward: 1.432. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 5000. Mean Reward: 3.175. Std of Reward: 2.250. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 6000. Mean Reward: 4.898. Std of Reward: 4.019. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 7000. Mean Reward: 6.716. Std of Reward: 5.125. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 8000. Mean Reward: 12.124. Std of Reward: 11.929. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 9000. Mean Reward: 18.151. Std of Reward: 16.871. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 10000. Mean Reward: 27.284. Std of Reward: 28.667. Training.

At this point, training is underway.
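As an aside, the "Listening on port 5004" handshake can also be driven from Python through the low-level mlagents_envs API that ships with ML-Agents. A minimal sketch, assuming the default editor port (run it inside the ml-agents environment, then press Play in the editor):

# Connect to the Unity Editor and inspect the registered behaviors.
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name=None)  # file_name=None waits for the editor
env.reset()                             # completes once Play is pressed
print(list(env.behavior_specs.keys()))  # e.g. the 3DBall behavior name
env.close()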

In my case, however, Unity stopped playing immediately, and the console printed the following:

[WARNING] Trainer has no policies, not saving anything.
Traceback (most recent call last):
  File "C:\anaconda\Anaconda3\envs\ml-agents\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\anaconda\Anaconda3\envs\ml-agents\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\anaconda\Anaconda3\envs\ml-agents\Scripts\mlagents-learn.exe\__main__.py", line 7, in <module>
  File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\learn.py", line 264, in main
    run_cli(parse_command_line())
  File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\learn.py", line 260, in run_cli
    run_training(run_seed, options, num_areas)
  File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\learn.py", line 136, in run_training
    tc.start_learning(env_manager)
  File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\trainer_controller.py", line 172, in start_learning
    self._reset_env(env_manager)
  File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\trainer_controller.py", line 107, in _reset_env
    self._register_new_behaviors(env_manager, env_manager.first_step_infos)
  File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\trainer_controller.py", line 267, in _register_new_behaviors
    self._create_trainers_and_managers(env_manager, new_behavior_ids)
  File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\trainer_controller.py", line 165, in _create_trainers_and_managers
    self._create_trainer_and_manager(env_manager, behavior_id)
  File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\trainer_controller.py", line 137, in _create_trainer_and_manager
    policy = trainer.create_policy(
  File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\ppo\trainer.py", line 194, in create_policy
    policy = TorchPolicy(
  File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\policy\torch_policy.py", line 41, in __init__
    GlobalSteps()
  File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\torch_entities\networks.py", line 748, in __init__
    torch.Tensor([0]).to(torch.int64), requires_grad=False
RuntimeError: CUDA error: no kernel image is available for execution on the device

One of the more important pieces of information is:
RuntimeError: CUDA error: no kernel image is available for execution on the device
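The failure can be reproduced entirely outside ML-Agents. A minimal sketch of my own, assuming an affected GPU: any CUDA kernel launch raises the same RuntimeError when PyTorch ships no kernel image for the card's architecture.

# On an unsupported GPU, the kernel launch below fails with the
# same "no kernel image is available" RuntimeError.
import torch

x = torch.zeros(8, device="cuda")  # allocate a tensor on the GPU
y = x + 1                          # launches a CUDA kernel
torch.cuda.synchronize()           # surfaces any asynchronous launch error
print(y)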

After that came a long round of searching. The explanations online fall roughly into two camps:
1. The PyTorch version is not compatible with the CUDA version.
2. The computing power of the graphics card is too low to support a higher version of CUDA.

At that moment, I felt ten thousand grass-mud horses galloping past. What on earth is CUDA???
Low graphics card compute capability is indeed plausible, because the card currently used for development, an NVIDIA GeForce GT 730, is a fairly low-end model.

CUDA (Compute Unified Device Architecture) is a computing platform launched by the graphics card manufacturer NVIDIA. CUDA™ is a general-purpose parallel computing architecture that enables GPUs to solve complex computing problems. It includes the CUDA instruction set architecture (ISA) and the parallel computing engine inside the GPU. Developers can write programs for the CUDA™ architecture in C that run at very high performance on CUDA™-enabled processors. Since CUDA 3.0, C++ and FORTRAN are also supported.

Reading that left me stunned; it felt way outside my line of business.
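For what it's worth, in PyTorch CUDA simply shows up as the ability to place tensors and computations on the GPU. A tiny illustration (my own sketch, not from the official docs):

# Run a matrix multiplication on the GPU when CUDA is usable.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
c = a @ b  # executed by CUDA kernels when device == "cuda"
print(c.device)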

2. Ruling out a PyTorch/CUDA version mismatch

First I suspected the error came from a mismatch between the PyTorch and CUDA versions, so I looked up how to check them and the version correspondence table.

# Activate the ml-agents environment, since PyTorch is installed there
(base) C:\Users\Administrator>activate ml-agents
# Enter the Python interpreter
(ml-agents) C:\Users\Administrator>python
Python 3.9.16 (main, Jan 11 2023, 16:16:36) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
# Import the PyTorch package
>>> import torch
# Print the PyTorch version
>>> torch.__version__
'1.7.1+cu110'
# Print the CUDA version
>>> torch.version.cuda
'11.0'
# Check whether CUDA is available
>>> torch.cuda.is_available()
True
# Print the GPU name while we are at it
>>> torch.cuda.get_device_name(0)
'NVIDIA GeForce GT 730'

Go to the PyTorch official website to find the correspondence between PyTorch and CUDA versions.

[Screenshot: the install selector on the PyTorch homepage]
The chart on the homepage is intuitive, but it only shows the correspondence for the latest release.
At this address you can find the correspondences for past versions; we had previously installed version 1.7.1 following the Unity documentation.
[Screenshot: the install command for PyTorch 1.7.1 on the previous-versions page]
For a more intuitive comparison chart, you can refer to this article.
Clearly, the installed PyTorch and CUDA versions do correspond.

3. Verifying that CUDA matches the graphics card driver

So the graphics card must be the problem!
Still, that was not easy to conclude, because:
First, replacing a graphics card is expensive. . .
Second, there are other explanations of this error online, so further verification was needed.

I found an article saying that CUDA roughly refers to the NVIDIA CUDA Toolkit software development kit.
PS: after more digging, though, CUDA and the CUDA Toolkit turn out to be two different concepts; you cannot simply put an equals sign between them.

CUDA (Compute Unified Device Architecture): literally "unified computing architecture", for general-purpose computing. CUDA currently supports multiple languages (C++, Java, Python, etc.), runs on essentially all newer NVIDIA GPUs, and is widely used.
CUDA Toolkit: the CUDA development kit, comprising the compiler, tools, libraries, CUDA samples, and the CUDA driver (if you only want to run deep learning code on the GPU, the CUDA driver can be downloaded separately).
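The distinction is easy to see from code: torch.version.cuda reports the toolkit version PyTorch was built against, while the nvidia-smi banner reports the highest CUDA version the installed driver supports. A small sketch, assuming nvidia-smi is on the PATH:

# Contrast the build-time toolkit version with the driver's CUDA ceiling.
import subprocess
import torch

print("CUDA toolkit PyTorch was built with:", torch.version.cuda)
out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
for line in out.splitlines():
    if "CUDA Version" in line:  # banner line, e.g. "... CUDA Version: 11.4 ..."
        print(line.strip())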

Therefore, the next assumption was to check the correspondence between CUDA and the graphics card, so I went to the CUDA official website, where there are tables showing the correspondence between the CUDA Toolkit and the graphics card driver.
[Screenshot: CUDA Toolkit and driver version correspondence table]
Based on this table, I judged that the cause might be a mismatch between the graphics card driver and the CUDA version (I was genuinely confused at first, thinking that the compute capability issue and the CUDA issue both referred to compatibility between the graphics card driver and the CUDA Toolkit version, but they turn out not to be the same thing).

There are two ways to check the graphics card driver:

1. Via the NVIDIA Control Panel
  • Press the Win + Q key combination to open the search bar and search for "Control Panel"
  • Click to open the NVIDIA Control Panel

[Screenshot: Control Panel]
[Screenshot: NVIDIA Control Panel]
As you can see, the current driver version is 472.12.

After that, click "Help" -> "System Information", select the "Components" tab, and you can see the CUDA supported by the current computer
insert image description here

2. Via the cmd console

You can view the driver version and other details with the following cmd command:

C:\Users\Administrator>nvidia-smi

If the command is not recognized, the path has not been configured.

NVIDIA's driver is installed by default at C:\Program Files\NVIDIA Corporation\NVSMI. Add this path to the system's Path environment variable and restart cmd to use the command.
[Screenshot: adding the NVSMI path to the Path environment variable]

[Screenshot: nvidia-smi output]

As you can see, the driver version is 472.12, and the CUDA version is 11.4.
[Screenshot: the corresponding row of the CUDA/driver correspondence table]
After comparing against the table, the graphics card driver and the CUDA version do in fact match.

4. Checking the update notes in the official ML-Agents documentation

At this point my head was truly full of question marks. . . So I kept searching, until I happened to read the change notes for Release 20.

[Screenshot: the Release 20 change notes]

Another ten thousand grass-mud horses stampeded past in my heart: so even the official documentation had mistakes. . .

So I opened the ml-agents environment again and ran the command to uninstall the current PyTorch:

pip3 uninstall torch

Then execute the following command:

conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge

-c conda-forge tells conda to fetch the cudatoolkit 11.1 package from the conda-forge channel.
Verify the installation succeeded:
[Screenshot: verifying the new PyTorch and cudatoolkit versions]
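For reference, the same interactive checks as in section 2 should now report the new versions (expected values, per the install command above, shown as comments):

# Quick sanity check after the reinstall.
import torch

print(torch.__version__)           # expect '1.8.0'
print(torch.version.cuda)          # expect '11.1'
print(torch.cuda.is_available())   # expect True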

After testing it again, the result was still the same error. I cracked.

5. Solving the problem: graphics card compute capability and CUDA compatibility

After more thought, perhaps the graphics card itself simply is not supported; that is easy enough to check. I found an article stating that since PyTorch 1.3, GPUs with a compute capability of 3.5 and below are no longer supported.
Using this as a clue, I looked into the compute capability requirements that come up in AI development.

The conclusions were:
1. PyTorch versions correspond to specific CUDA versions.
2. CUDA versions correspond to specific graphics card compute capabilities.
3. Different PyTorch versions also have their own compute capability requirements.

If any of these do not line up, problems arise. The diagnostic sketch below checks the last two points directly.
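A small sketch of my own (torch.cuda.get_arch_list is present in recent PyTorch releases; treat its exact availability in older builds as an assumption):

# Compare the GPU's compute capability with the kernel images
# compiled into this PyTorch build.
import torch

print(torch.cuda.get_arch_list())   # e.g. ['sm_37', 'sm_50', ...]
major, minor = torch.cuda.get_device_capability(0)
arch = f"sm_{major}{minor}"         # e.g. 'sm_35' for a GT 730
print("GPU compute capability:", arch)
# If arch is missing from get_arch_list(), kernel launches fail with
# "no kernel image is available for execution on the device".
if arch not in torch.cuda.get_arch_list():
    print("This PyTorch build ships no kernels for this GPU.")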

Compute Capability:

The compute capability of a device is represented by a version number, also sometimes called its “SM version”. This version number identifies the features supported by the GPU hardware and is used by applications at runtime to determine which hardware features and/or instructions are available on the present GPU.
The compute capability comprises a major revision number X and a minor revision number Y and is denoted by X.Y.
Devices with the same major revision number are of the same core architecture. The major revision number is 9 for devices based on the NVIDIA Hopper GPU architecture, 8 for devices based on the NVIDIA Ampere GPU architecture, 7 for devices based on the Volta architecture, 6 for devices based on the Pascal architecture, 5 for devices based on the Maxwell architecture, and 3 for devices based on the Kepler architecture.
The minor revision number corresponds to an incremental improvement to the core architecture, possibly including new features.
Turing is the architecture for devices of compute capability 7.5, and is an incremental update based on the Volta architecture.
CUDA-Enabled GPUs lists of all CUDA-enabled devices along with their compute capability. Compute Capabilities gives the technical specifications of each compute capability.

Note
The compute capability version of a particular GPU should not be confused with the CUDA version (for example, CUDA 7.5, CUDA 8, CUDA 9), which is the version of the CUDA software platform. The CUDA platform is used by application developers to create applications that run on many generations of GPU architectures, including future GPU architectures yet to be invented. While new versions of the CUDA platform often add native support for a new GPU architecture by supporting the compute capability version of that architecture, new versions of the CUDA platform typically also include software features that are independent of hardware generation.
The Tesla and Fermi architectures are no longer supported starting with CUDA 7.0 and CUDA 9.0, respectively.

In short: a device's compute capability is a version number X.Y, sometimes called its "SM version", and devices with the same major version X share a core architecture. The major version is 9 for Hopper, 8 for Ampere, 7 for Volta, 6 for Pascal, 5 for Maxwell, 3 for Kepler, 2 for Fermi, and 1 for Tesla, while the minor version Y marks incremental improvements to that architecture.

That is a lot of words, but the gist is that "compute capability" is not simply a measure of raw computing power; it also identifies the architecture and feature set of the hardware.

You can check the CUDA page on Wikipedia directly, or this blog post for comparison; both also list the compute capability values of different graphics cards.

[Screenshot: the compute capability table from Wikipedia]
This picture is a screenshot from the Wikipedia article.
My machine runs CUDA 11.4, and although the table shows the graphics card (NVIDIA GeForce GT 730) supporting compute capability 3.5, actually running the AI workload still raises the error.
Afterwards I tried setting up the environment on my computer at home, and it passed on the first try. The card at home is a 1080 Ti, which made me more confident that this problem was caused by the graphics card's compute capability. Yet at home I had installed PyTorch 1.7.1 straight from the ML-Agents documentation and it ran fine, which left me somewhat puzzled.
Then I found a company laptop with an NVIDIA GeForce GTX 1070 graphics card and tried installing there. To head off any other gremlins, I formatted the machine first. The install ended up taking a whole day, with a few other problems along the way, but in the end it went through, which confirms that the problem this time was insufficient graphics card compute capability.
At this point, the problem is solved.


PS: a problem encountered while installing on the laptop: [WinError 1455] The paging file is too small for this operation to complete.

Cause of the problem: on the laptop, the C drive is a fairly small SSD, so I installed Anaconda on the D drive, a mechanical hard disk. By default, however, Windows does not allocate much virtual memory to drives other than C, and the error above is caused by that lack of virtual memory.

Solution:
First, press the Win + Q key combination and open the Control Panel, then open System. Select Advanced system settings to open the System Properties page. On the Advanced tab, click the Settings button to enter the Performance Options page. There, select the Advanced tab again, and under Virtual memory click the Change button. On the page that opens, first uncheck the box at the top, then select the drive that should get virtual memory, choose Custom size and enter an appropriate amount (for 10 GB, enter 10240), click the Set button, and finally click OK to leave the page and restart the computer. After these steps, the error was resolved.
[Screenshot: the Virtual Memory settings dialog]
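Whether the page file really is the bottleneck can also be checked programmatically. A Windows-only sketch using the Win32 GlobalMemoryStatusEx call via ctypes (my own diagnostic, not part of the original fix):

# Inspect commit-charge headroom (physical RAM + page file) on Windows.
import ctypes

class MEMORYSTATUSEX(ctypes.Structure):
    _fields_ = [
        ("dwLength", ctypes.c_ulong),
        ("dwMemoryLoad", ctypes.c_ulong),
        ("ullTotalPhys", ctypes.c_ulonglong),
        ("ullAvailPhys", ctypes.c_ulonglong),
        ("ullTotalPageFile", ctypes.c_ulonglong),
        ("ullAvailPageFile", ctypes.c_ulonglong),
        ("ullTotalVirtual", ctypes.c_ulonglong),
        ("ullAvailVirtual", ctypes.c_ulonglong),
        ("ullAvailExtendedVirtual", ctypes.c_ulonglong),
    ]

status = MEMORYSTATUSEX()
status.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(status))
gib = 1024 ** 3
print(f"Commit limit:     {status.ullTotalPageFile / gib:.1f} GiB")
print(f"Commit available: {status.ullAvailPageFile / gib:.1f} GiB")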




Source: blog.csdn.net/EverNess010/article/details/129199874