After the painful work of setting up the environment, everything looked harmonious and natural on the surface, just waiting for me to kick off training and enjoy it. Instead, a ruthless bug hit me in the face once again.
How the problem occurred and how it was resolved is documented below.
How the environment was set up is covered in: A Preliminary Study on Unity's Connection with ML-Agents
After the environment is set up, you can use the bundled demos to get familiar with the features by following the official Unity documentation.
The official documentation is here
1. How the problem occurred
This time we are going to run the 3D Balance Ball example.
Besides running the Unity project directly to see the effect, you can also train it from the previously created ml-agents environment, and that is where this pitfall appeared.
Open the Anaconda Prompt, activate the ml-agents environment, cd into the directory where release 20 was extracted, and enter the following command:
mlagents-learn config/ppo/3DBall.yaml --run-id=first3DBallRun
Here config/ppo/3DBall.yaml is the path to the official default training configuration file; the config/ppo folder contains training configuration files for all of the example environments, including 3DBall. --run-id= gives the training session a unique name, here first3DBallRun.
You can also append --force to the command to force execution; this overwrites the data from the previous run with the same name. Without --force, the command fails if a results folder with the same name already exists.
mlagents-learn config/ppo/3DBall.yaml --run-id=first3DBallRun --force
After executing the command, under normal circumstances, the following screen will appear on the console:
At the end, you are prompted to start the Unity project:
[INFO] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
After that, click the Play button for the corresponding scene in the Unity project. Under normal circumstances, output similar to that in the official documentation should appear:
INFO:mlagents_envs:
'Ball3DAcademy' started successfully!
Unity Academy name: Ball3DAcademy
INFO:mlagents_envs:Connected new brain:
Unity brain name: 3DBallLearning
Number of Visual Observations (per agent): 0
Vector Observation space size (per agent): 8
Number of stacked Vector Observation: 1
INFO:mlagents_envs:Hyperparameters for the PPO Trainer of brain 3DBallLearning:
batch_size: 64
beta: 0.001
buffer_size: 12000
epsilon: 0.2
gamma: 0.995
hidden_units: 128
lambd: 0.99
learning_rate: 0.0003
max_steps: 5.0e4
normalize: True
num_epoch: 3
num_layers: 2
time_horizon: 1000
sequence_length: 64
summary_freq: 1000
use_recurrent: False
memory_size: 256
use_curiosity: False
curiosity_strength: 0.01
curiosity_enc_size: 128
output_path: ./results/first3DBallRun/3DBallLearning
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 1000. Mean Reward: 1.242. Std of Reward: 0.746. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 2000. Mean Reward: 1.319. Std of Reward: 0.693. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 3000. Mean Reward: 1.804. Std of Reward: 1.056. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 4000. Mean Reward: 2.151. Std of Reward: 1.432. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 5000. Mean Reward: 3.175. Std of Reward: 2.250. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 6000. Mean Reward: 4.898. Std of Reward: 4.019. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 7000. Mean Reward: 6.716. Std of Reward: 5.125. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 8000. Mean Reward: 12.124. Std of Reward: 11.929. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 9000. Mean Reward: 18.151. Std of Reward: 16.871. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 10000. Mean Reward: 27.284. Std of Reward: 28.667. Training.
At this point, the training session has begun.
However, what I encountered instead was Unity exiting immediately, with the console printing the following:
[WARNING] Trainer has no policies, not saving anything.
Traceback (most recent call last):
File "C:\anaconda\Anaconda3\envs\ml-agents\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\anaconda\Anaconda3\envs\ml-agents\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\anaconda\Anaconda3\envs\ml-agents\Scripts\mlagents-learn.exe\__main__.py", line 7, in <module>
File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\learn.py", line 264, in main
run_cli(parse_command_line())
File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\learn.py", line 260, in run_cli
run_training(run_seed, options, num_areas)
File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\learn.py", line 136, in run_training
tc.start_learning(env_manager)
File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
return func(*args, **kwargs)
File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\trainer_controller.py", line 172, in start_learning
self._reset_env(env_manager)
File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
return func(*args, **kwargs)
File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\trainer_controller.py", line 107, in _reset_env
self._register_new_behaviors(env_manager, env_manager.first_step_infos)
File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\trainer_controller.py", line 267, in _register_new_behaviors
self._create_trainers_and_managers(env_manager, new_behavior_ids)
File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\trainer_controller.py", line 165, in _create_trainers_and_managers
self._create_trainer_and_manager(env_manager, behavior_id)
File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\trainer_controller.py", line 137, in _create_trainer_and_manager
policy = trainer.create_policy(
File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\ppo\trainer.py", line 194, in create_policy
policy = TorchPolicy(
File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\policy\torch_policy.py", line 41, in __init__
GlobalSteps()
File "C:\anaconda\Anaconda3\envs\ml-agents\lib\site-packages\mlagents\trainers\torch_entities\networks.py", line 748, in __init__
torch.Tensor([0]).to(torch.int64), requires_grad=False
RuntimeError: CUDA error: no kernel image is available for execution on the device
One of the more important pieces of information is:
RuntimeError: CUDA error: no kernel image is available for execution on the device
What followed was a long round of searching. There are roughly two explanations:
1. The PyTorch version is not compatible with the CUDA version.
2. The compute capability of the graphics card is too low to support a higher CUDA version.
At that moment I was utterly baffled. What on earth is CUDA???
Low compute capability is indeed a plausible cause, because the card currently used for development, an NVIDIA GeForce GT 730, is a fairly low-end graphics card.
CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing platform launched by the graphics card manufacturer NVIDIA. It enables GPUs to solve complex computational problems and includes the CUDA instruction set architecture (ISA) and the parallel compute engine inside the GPU. Developers can write programs for the CUDA architecture in C that run with very high performance on CUDA-enabled processors; since CUDA 3.0, C++ and Fortran are also supported.
Reading this left me stunned; it felt well outside my line of work.
2. Ruling out a mismatch between the PyTorch and CUDA versions
First, I suspected the error was caused by a mismatch between the PyTorch and CUDA versions, so I looked up how to check them and found the version correspondence table.
# Activate the ml-agents environment, since PyTorch is installed there
(base) C:\Users\Administrator>activate ml-agents
# Enter the Python interpreter
(ml-agents) C:\Users\Administrator>python
Python 3.9.16 (main, Jan 11 2023, 16:16:36) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
# Import the PyTorch package
>>> import torch
# Print the PyTorch version
>>> torch.__version__
'1.7.1+cu110'
# Print the CUDA version this PyTorch build targets
>>> torch.version.cuda
'11.0'
# Check whether CUDA is currently usable
>>> torch.cuda.is_available()
True
# Print the GPU name while we are at it
>>> torch.cuda.get_device_name(0)
'NVIDIA GeForce GT 730'
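Besides the device name, PyTorch can also report the card's compute capability directly via `torch.cuda.get_device_capability`, which is the value that ultimately matters for this bug. A minimal sketch (it only reports real values when a CUDA-enabled PyTorch build and an NVIDIA GPU are present):

```python
def describe_capability(major: int, minor: int) -> str:
    """Format a compute capability pair like (3, 5) as the 'sm_35'
    architecture tag PyTorch uses for its compiled CUDA kernels."""
    return f"sm_{major}{minor}"

try:
    import torch
    if torch.cuda.is_available():
        # Query the compute capability of GPU 0
        major, minor = torch.cuda.get_device_capability(0)
        print(f"compute capability: {major}.{minor} ({describe_capability(major, minor)})")
except ImportError:
    pass  # PyTorch not installed; nothing to report

# On the GT 730 described above this would report 3.5 (sm_35).
```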
Go to the official PyTorch website to find the correspondence between PyTorch and CUDA versions.
The chart on the homepage is intuitive, but it only shows the correspondence for the latest release. At this address you can find the correspondence for past releases; we had installed version 1.7.1 earlier, following the Unity documentation. For a more intuitive comparison chart, see this article.
Clearly, the currently installed PyTorch and CUDA versions do correspond.
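The check I did by hand against the table can be sketched as a small lookup: does the installed build's CUDA version appear among the CUDA versions that PyTorch release was published for? The table below is an illustrative subset compiled from pytorch.org's "previous versions" page; verify it against the site before relying on it.

```python
# Illustrative subset of PyTorch releases -> CUDA versions they shipped for
CUDA_BUILDS = {
    "1.7.1": {"9.2", "10.1", "10.2", "11.0"},
    "1.8.0": {"10.1", "10.2", "11.1"},
}

def versions_match(torch_version: str, cuda_version: str) -> bool:
    """Return True if this PyTorch release shipped a build for this CUDA version."""
    base = torch_version.split("+")[0]        # "1.7.1+cu110" -> "1.7.1"
    return cuda_version in CUDA_BUILDS.get(base, set())

# The combination found above: torch 1.7.1+cu110 with CUDA 11.0
print(versions_match("1.7.1+cu110", "11.0"))  # True
```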
3. Verifying that CUDA and the graphics card driver match
So it must be the graphics card!
However, that is not easy to conclude, because:
First, replacing a graphics card is expensive...
Second, there are other explanations for this error online, so further verification was still needed.
I had found an article suggesting that CUDA here roughly refers to the NVIDIA CUDA Toolkit software development kit.
PS: After more digging, it turns out that CUDA and the CUDA Toolkit are two different concepts; you cannot simply equate them.
CUDA (Compute Unified Device Architecture): literally "unified computing architecture", for general-purpose computing. CUDA currently supports multiple languages (C++, Java, Python, etc.) and runs on essentially all recent NVIDIA GPUs, so it is quite widely applicable.
CUDA Toolkit: the CUDA development kit, which includes the compiler, tools, libraries, CUDA samples, and the CUDA driver (if you only need to run deep learning code on the GPU, you can install the CUDA driver alone).
So the next assumption was to check the correspondence between CUDA and the graphics card. On the CUDA official website there are tables showing the correspondence between CUDA Toolkit versions and graphics card driver versions.
Based on that table, I judged that the cause of the problem might be a mismatch between the graphics card driver and the CUDA version (I was genuinely confused at first, thinking that both "compute capability" and "CUDA" referred to compatibility between the graphics card driver and the CUDA Toolkit version, but it turned out they are not the same thing).
There are two ways to check the graphics card driver:
1. Via the NVIDIA Control Panel
Press the Win + Q key combination to bring up the search bar, search for "Control Panel", and click to open the NVIDIA Control Panel.
As you can see, the current driver version is 472.12.
After that, click "Help" -> "System Information" and select the "Components" tab to see the CUDA version the current machine supports.
2. Via the cmd console
You can view the driver version and other details with the following command:
C:\Users\Administrator>nvidia-smi
If the command is not recognized, the path has not been configured. The default path of NVIDIA's driver is C:\Program Files\NVIDIA Corporation\NVSMI; add this path to the Path entry in the system's environment variables and restart cmd to use the command.
As shown, the driver version is 472.12 and the supported CUDA version is 11.4. After checking against the table, the graphics card driver and the CUDA version do match.
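This by-hand comparison against the driver table can also be sketched as a small check. The minimum Windows driver versions below are transcribed from NVIDIA's CUDA Toolkit release notes; they may not be exact for every point release, so verify them against the notes before relying on them.

```python
# Approximate minimum Windows driver versions per CUDA release,
# transcribed from NVIDIA's CUDA Toolkit release notes (illustrative).
MIN_WINDOWS_DRIVER = {
    "11.0": (451, 22),
    "11.1": (456, 38),
    "11.4": (471, 11),
}

def driver_supports_cuda(driver: str, cuda: str) -> bool:
    """True if the installed driver is at least the minimum for this CUDA version."""
    have = tuple(int(part) for part in driver.split("."))
    return have >= MIN_WINDOWS_DRIVER[cuda]

# Driver 472.12 (from nvidia-smi) against the CUDA 11.0 that this PyTorch uses:
print(driver_supports_cuda("472.12", "11.0"))  # True: the driver is new enough
```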
4. Checking the updates in the official ML-Agents documentation
At this point my head was truly full of question marks... so I kept searching, until I happened to read the change notes for Release 20.
Another wave of frustration swept through me: even the official documentation contained mistakes...
So I opened the ml-agents environment again and executed the command to uninstall the current PyTorch:
pip3 uninstall torch
Then execute the following command:
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
-c conda-forge indicates that the cudatoolkit 11.1 package should be downloaded from the conda-forge channel.
Verify that the installation succeeded.
After testing again, the result was still the same error. I was falling apart.
5. Solving the problem: graphics card compute capability vs. CUDA
After more thought, it might be that the graphics card itself is simply not supported. Back to searching: I found an article stating that since PyTorch 1.3, graphics cards with a GPU compute capability of 3.5 and below are no longer supported.
Using this as a clue, I looked into the compute capability requirements involved in AI development.
The conclusions were:
1. PyTorch versions correspond to specific CUDA versions.
2. CUDA versions in turn correspond to graphics card compute capabilities.
3. Different PyTorch versions also have their own compute capability requirements.
If any one of these is not satisfied, problems will arise.
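The decisive test for this particular bug is the third point: whether kernels for the GPU's compute capability were compiled into the installed PyTorch binary, which `torch.cuda.get_arch_list()` reports. A sketch (the names in ARCH_LIST_EXAMPLE are illustrative, not taken from any specific build):

```python
def gpu_is_supported(capability: tuple, arch_list: list) -> bool:
    """True if kernels for this compute capability exist in the binary's arch list."""
    tag = f"sm_{capability[0]}{capability[1]}"
    return tag in arch_list

# Example: a binary built without sm_35 kernels cannot run on the GT 730,
# which is exactly the "no kernel image is available" error.
ARCH_LIST_EXAMPLE = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75", "sm_80"]
print(gpu_is_supported((3, 5), ARCH_LIST_EXAMPLE))  # False for the GT 730
print(gpu_is_supported((7, 5), ARCH_LIST_EXAMPLE))  # True for a compute 7.5 card

try:
    import torch  # with a CUDA build installed, query the real values instead
    if torch.cuda.is_available():
        print(gpu_is_supported(torch.cuda.get_device_capability(0),
                               torch.cuda.get_arch_list()))
except ImportError:
    pass
```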
Compute Capability:
The compute capability of a device is represented by a version number, also sometimes called its “SM version”. This version number identifies the features supported by the GPU hardware and is used by applications at runtime to determine which hardware features and/or instructions are available on the present GPU.
The compute capability comprises a major revision number X and a minor revision number Y and is denoted by X.Y.
Devices with the same major revision number are of the same core architecture. The major revision number is 9 for devices based on the NVIDIA Hopper GPU architecture, 8 for devices based on the NVIDIA Ampere GPU architecture, 7 for devices based on the Volta architecture, 6 for devices based on the Pascal architecture, 5 for devices based on the Maxwell architecture, and 3 for devices based on the Kepler architecture.
The minor revision number corresponds to an incremental improvement to the core architecture, possibly including new features.
Turing is the architecture for devices of compute capability 7.5, and is an incremental update based on the Volta architecture.
CUDA-Enabled GPUs lists all CUDA-enabled devices along with their compute capability. Compute Capabilities gives the technical specifications of each compute capability.
Note: The compute capability version of a particular GPU should not be confused with the CUDA version (for example, CUDA 7.5, CUDA 8, CUDA 9), which is the version of the CUDA software platform. The CUDA platform is used by application developers to create applications that run on many generations of GPU architectures, including future GPU architectures yet to be invented. While new versions of the CUDA platform often add native support for a new GPU architecture by supporting the compute capability version of that architecture, new versions of the CUDA platform typically also include software features that are independent of hardware generation.
The Tesla and Fermi architectures are no longer supported starting with CUDA 7.0 and CUDA 9.0, respectively.
Rough translation:
A device's computing capabilities are indicated by a version number, sometimes referred to as an "SM version". This version number identifies features supported by the GPU hardware and is used by applications at runtime to determine which hardware features and/or instructions are available on the current GPU.
Computing capabilities comprise a major version number X and a minor version number Y, denoted X.Y.
Devices with the same major version number have the same core architecture. The major version numbers are:
8 for devices based on the NVIDIA Ampere GPU architecture,
7 for devices based on the Volta architecture,
6 for devices based on the Pascal architecture,
5 for devices based on the Maxwell architecture,
3 for devices based on the Kepler architecture,
2 for Fermi-based devices,
1 for Tesla-based devices.
Minor revision numbers correspond to incremental improvements to the core architecture, possibly including new features.
That is a lot of words, but roughly it means that "compute capability" does not simply refer to raw computing power; it also identifies the hardware features a GPU supports.
You can check the CUDA page on Wikipedia directly, or this blog for a comparison; it also lists compute capability values for different graphics cards.
The picture below is a screenshot from the wiki.
The machine I use runs CUDA 11.4. Although the table shows that the graphics card (NVIDIA GeForce GT 730) supports compute capability 3.5, the error is still raised when actually running the AI algorithm.
Afterwards, I set up the environment on my computer at home, and it worked on the first try. The card at home is a 1080 Ti, which made me more certain that this problem was caused by the graphics card's compute capability. Still, at home I installed PyTorch 1.7.1 straight from the ML-Agents documentation and it ran normally, so I remain somewhat puzzled.
Then I found a laptop in the company with an NVIDIA GeForce GTX 1070 and tried installing there. To prevent any other surprises, I formatted it first. The installation took a whole day, with a few other problems along the way, but in the end it went through smoothly, which confirms that this problem was caused by insufficient graphics card compute capability.
At this point, the problem was solved.
PS: A problem encountered when installing on the laptop: [WinError 1455] The paging file is too small for this operation to complete.
Cause of the problem: on the laptop, the C drive is a relatively small SSD, so I installed Anaconda on drive D, a mechanical hard disk. By default, however, the computer does not allocate much virtual memory to drives other than C, and the error above is caused by insufficient virtual memory.
Solution:
First, open the Control Panel via the Win + Q key combination. Then open System, select Advanced System Settings to bring up the System Properties page, and click the Settings button on the Advanced tab to enter the Performance Options page. On the Performance Options page, select the Advanced tab, then click the Change button under Virtual memory. On the page that opens, first uncheck the box at the top, then select the drive on which to allocate virtual memory, choose Custom size, and enter an appropriate amount (for 10 GB, enter 10240). Click the Set button, then click OK to exit the page, and restart the computer. After these steps, the reported problem was resolved.