How to fix "what(): NCCL Error 1: unhandled cuda error"


Posted 2023-10-05 01:04:42


The problem

I ran into an environment problem while running LM-BFF (Better Few-shot Fine-tuning of Language Models), an ACL 2021 paper: https://github.com/princeton-nlp/LM-BFF
My machine environment was as follows:

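CUDA version on the server: 11.4
GPU: 4 x RTX 3090 (24 GB each)
Virtual environment: python=3.6
Installed PyTorch version: 1.6.0 (the version used by the original project; this is the one that fails)

Training aborted with the following error: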

NCCL Error 1: unhandled cuda error

/home/lishizheng/anaconda3/envs/lmbff/lib/python3.6/site-packages/transformers/trainer.py:1096: FutureWarning: This method is deprecated, use `Trainer.is_local_process_zero()` instead.
  warnings.warn("This method is deprecated, use `Trainer.is_local_process_zero()` instead.", FutureWarning)
Epoch:   0%|                                                                                                                              | 0/250 [00:00<?, ?it/s]terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL Error 1: unhandled cuda error
Aborted (core dumped)
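
The message above says little about which CUDA call failed. NCCL reads the `NCCL_DEBUG` environment variable, and setting it to `INFO` makes it log more detail before the crash. A minimal sketch of re-running a command with that variable set; the helper names are hypothetical and the training command in the usage note is only a placeholder:

```python
import os
import subprocess
import sys


def nccl_debug_env(base_env=None):
    """Return a copy of the environment with NCCL_DEBUG=INFO added."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"  # real NCCL variable; INFO enables verbose logging
    return env


def rerun_with_nccl_debug(cmd):
    """Run cmd (a list of arguments) with NCCL debug logging; return its exit code."""
    return subprocess.run(cmd, env=nccl_debug_env()).returncode


if __name__ == "__main__":
    # Forward whatever command follows this script's name on the command line.
    if len(sys.argv) > 1:
        sys.exit(rerun_with_nccl_debug(sys.argv[1:]))
```

For example, if this were saved as `nccl_debug.py` (a made-up file name), `python nccl_debug.py python train.py` would run the placeholder script `train.py` with the extra logging enabled.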

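Solution

The error is caused by a version mismatch between PyTorch, cudatoolkit, and the CUDA driver. In particular, the RTX 3090 (compute capability 8.6) needs a CUDA 11 build of PyTorch, while the prebuilt PyTorch 1.6.0 packages target CUDA 10.2 or older.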
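
One way to check for this kind of mismatch is to compare the CUDA version PyTorch was built against with the compute capability of each GPU. The sketch below is only an illustration: `arch_supported` and `report` are hypothetical helper names, and the compatibility rule (same major version, equal or lower minor version) is a simplification that ignores PTX JIT compilation. `torch.cuda.get_arch_list()` is missing from older PyTorch releases, so it is looked up defensively.

```python
import re


def arch_supported(capability, arch_list):
    """Check a (major, minor) compute capability against names like "sm_80".

    Simplified rule: a binary built for sm_XY runs on a device with the same
    major version X and a minor version >= Y.
    """
    major, minor = capability
    for arch in arch_list:
        m = re.match(r"^sm_(\d+)(\d)[a-z]?$", arch)
        if m and int(m.group(1)) == major and int(m.group(2)) <= minor:
            return True
    return False


def report():
    """Print the version facts relevant to this error, if torch is importable."""
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
        return
    print("torch version:     ", torch.__version__)
    print("built against CUDA:", torch.version.cuda)
    if not torch.cuda.is_available():
        print("CUDA is not available to this build")
        return
    # get_arch_list does not exist in older releases, so fall back gracefully.
    get_arch_list = getattr(torch.cuda, "get_arch_list", None)
    archs = get_arch_list() if get_arch_list else []
    print("compiled archs:    ", archs or "unknown (get_arch_list unavailable)")
    for i in range(torch.cuda.device_count()):
        cap = torch.cuda.get_device_capability(i)
        status = arch_supported(cap, archs) if archs else "unknown"
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}, capability {cap}, supported: {status}")


if __name__ == "__main__":
    report()
```

The driver's CUDA version (11.4 here) is shown by `nvidia-smi`, not by PyTorch, so it has to be compared by hand.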

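My CUDA version is 11.4. Following the post "CUDA版本11.4，pytorch应该下哪个版本的？" (Which PyTorch version should I install for CUDA 11.4?), I installed cudatoolkit 11.3, and pytorch=1.10.2 works with it: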

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

This solved my problem, and the code now runs normally.
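
After reinstalling, a quick sanity check is to run a tiny computation on every visible GPU before starting a full training job. This is a minimal sketch, not part of the original fix; `smoke_test` is a hypothetical helper, and it returns an empty list when torch or CUDA is unavailable.

```python
def smoke_test():
    """Run a small matmul on each visible GPU; return (index, name, ok) tuples."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    results = []
    for i in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(i)
        try:
            x = torch.ones(2, 2, device=f"cuda:{i}")
            # ones(2, 2) @ ones(2, 2) is a 2x2 matrix of 2s, so the sum is 8.
            ok = bool((x @ x).sum().item() == 8.0)
        except RuntimeError:
            # An unsupported architecture typically surfaces here as a CUDA error.
            ok = False
        results.append((i, name, ok))
    return results


if __name__ == "__main__":
    for index, name, ok in smoke_test():
        print(f"GPU {index} ({name}): {'OK' if ok else 'FAILED'}")
```

This only exercises single-GPU kernels. It does not test NCCL communication itself, so it is a first check and not a complete one.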

References

[1] https://pytorch.org/get-started/previous-versions/


Reposted from blog.csdn.net/shizheng_Li/article/details/132582580
