GPU memory is not shared

To verify:

import torch

a = torch.rand(2)       # random tensor, created on the CPU
a = a.to("cuda:0")      # move it to the first GPU
b = a.to("cuda:1")      # copy it to the second GPU
print(b)
# tensor([0.0, 0.0], device='cuda:1')   <- all zeros: the copy was corrupted
print(a)
# tensor([0.9285, 0.3294], device='cuda:0')   <- the source data is intact

If b comes back as all zeros instead of a copy of a, peer-to-peer (P2P) memory transfer between the two graphics cards is broken: the copy fails silently, and the resulting corruption causes calculation errors when running large models across multiple GPUs.
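
PyTorch can also report whether the driver believes P2P access is available between two devices. Below is a minimal sketch of a combined check, assuming at least two visible CUDA devices; the round-trip copy catches the silent-corruption case even when the driver claims P2P works:

import torch

# Ask the driver whether each GPU pair claims peer-to-peer access
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            print(f"P2P {i} -> {j}: {torch.cuda.can_device_access_peer(i, j)}")

# Round-trip a tensor through the second GPU; a corrupted transfer
# comes back as garbage (typically zeros) instead of the original
a = torch.rand(1024, device="cuda:0")
b = a.to("cuda:1").to("cuda:0")
print("round-trip intact:", torch.equal(a, b))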

Solutions:

1. Disable IOMMU

On Linux, disabling the IOMMU (Input-Output Memory Management Unit) requires changing the kernel parameters used at boot. The IOMMU is used for virtualization and hardware device management, so disabling it may affect other functionality; proceed with caution and make sure you understand the consequences.
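
Before changing anything, you can check whether the IOMMU is currently active: a non-empty /sys/class/iommu directory, or DMAR/IOMMU lines in the kernel log, indicate that it is enabled.

ls /sys/class/iommu
sudo dmesg | grep -i -e DMAR -e IOMMU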
Here are the steps to disable IOMMU in Linux:

1. Edit the boot configuration file (GRUB or another boot loader):
Open a terminal and log in as the superuser (root) or a user with sudo privileges.
Locate your boot loader's configuration file. On most Linux systems the boot loader is GRUB, and its configuration usually lives at /etc/default/grub. Open it with a text editor, for example:

vim /etc/default/grub

In the GRUB configuration file, find the GRUB_CMDLINE_LINUX (or similar) line, which holds the kernel parameters. Add intel_iommu=off or amd_iommu=off, depending on your CPU vendor; alternatively, iommu=pt puts the IOMMU into pass-through mode rather than disabling it outright, which is often sufficient to fix P2P transfers. Pick the one that matches your setup, so the line looks like one of:

GRUB_CMDLINE_LINUX="intel_iommu=off"
GRUB_CMDLINE_LINUX="amd_iommu=off"
GRUB_CMDLINE_LINUX="iommu=pt"

2. Update the boot configuration:

sudo update-grub
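
Note that update-grub is the Debian/Ubuntu helper. On RHEL/Fedora-style systems, regenerate the configuration with grub2-mkconfig instead (the output path shown here is the common BIOS location; it may differ on EFI systems):

sudo grub2-mkconfig -o /boot/grub2/grub.cfg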

3. Restart the system:

sudo reboot
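
After rebooting, you can confirm that the parameter took effect by inspecting the running kernel's command line, and then re-run the PyTorch snippet above to verify that the transfer now works:

cat /proc/cmdline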

2. Update the driver

[Driver release-note screenshots from the original post omitted.]
As the release notes show, the latest version of the driver has fixed this problem.
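
To check which driver version is installed, and to inspect how the GPUs are connected to each other, you can use nvidia-smi, which ships with the driver (the topology matrix shows the link type between each pair of GPUs):

nvidia-smi
nvidia-smi topo -m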

Source: blog.csdn.net/weixin_46398647/article/details/133309013