PyTorch - Large-Model Multi-GPU Training: “CUDA error: an illegal memory access was encountered”
Question from user 网友在路上 153853, asked 2023-10-09 06:39:14
Best answer
Follow me on CSDN: https://spike.blog.csdn.net/
Original article: https://spike.blog.csdn.net/article/details/133640212
Error log:
# ...
  File "lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in fit
    self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "lib/python3.7/site-packages/pytorch_lightning/trainer/call.py", line 63, in _call_and_handle_interrupt
    trainer._teardown()
  File "lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1121, in _teardown
    self.strategy.teardown()
  File "lib/python3.7/site-packages/pytorch_lightning/strategies/horovod.py", line 241, in teardown
    super().teardown()
  File "lib/python3.7/site-packages/pytorch_lightning/strategies/parallel.py", line 114, in teardown
    super().teardown()
  File "lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 499, in teardown
    self.accelerator.teardown()
  File "lib/python3.7/site-packages/pytorch_lightning/accelerators/cuda.py", line 76, in teardown
    torch.cuda.empty_cache()
  File "lib/python3.7/site-packages/torch/cuda/memory.py", line 125, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
# ...
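Because CUDA kernels run asynchronously, the traceback above most likely surfaces at `torch.cuda.empty_cache()` during Lightning's teardown rather than at the kernel that actually faulted. Below is a minimal sketch of the log's `CUDA_LAUNCH_BLOCKING=1` suggestion, assuming it is called at the very top of the training entry point, before `torch` is imported and before any CUDA context exists. The helper name `enable_blocking_cuda_launches` is hypothetical. Synchronous launches slow training down, so this is meant for reproducing the crash, not for normal runs.

```python
import os


def enable_blocking_cuda_launches(environ=os.environ):
    """Force synchronous CUDA kernel launches for debugging (hypothetical helper).

    Safest to call before `import torch`; once a CUDA context exists,
    changing the variable may have no effect.
    """
    # With "1", each kernel launch blocks until it finishes, so the Python
    # traceback stops at the kernel that hit the illegal memory access.
    environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return environ["CUDA_LAUNCH_BLOCKING"]


if __name__ == "__main__":
    enable_blocking_cuda_launches()
    # Import torch and start training only after this point, e.g.:
    # import torch
    print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Equivalently, the variable can be set on the command line, e.g. `CUDA_LAUNCH_BLOCKING=1 python train.py`, where `train.py` is a placeholder script name.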
Core error: CUDA error: an illegal memory access was encountered, meaning a GPU kernel read or wrote memory it was not allowed to access.
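Cause: in this case, GPU memory overflow. The fix is to lower the configuration parameters that drive GPU memory usage, for example the size of the input features.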
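To see why shrinking the input helps so much, here is a back-of-the-envelope sketch. It assumes, hypothetically, that a dominant activation is a pairwise float32 tensor of shape `[batch, L, L, channels]` built from an input of length `L`; the 128-channel width and the lengths are illustrative, not taken from the original model. Under that assumption memory grows quadratically in `L`, so halving `L` cuts that tensor to a quarter.

```python
import operator
from functools import reduce


def tensor_bytes(shape, bytes_per_elem=4):
    """Bytes for one dense tensor; 4 bytes per element corresponds to float32."""
    # reduce() instead of math.prod so this also runs on Python 3.7.
    return reduce(operator.mul, shape, 1) * bytes_per_elem


def mib(n_bytes):
    """Convert bytes to MiB."""
    return n_bytes / 2**20


if __name__ == "__main__":
    channels = 128  # hypothetical feature width
    for length in (512, 256):
        # Hypothetical pairwise activation of shape [batch=1, L, L, C].
        size = tensor_bytes((1, length, length, channels))
        print(f"L={length}: {mib(size):.0f} MiB per activation")
    # L=512: 128 MiB per activation
    # L=256: 32 MiB per activation
```

A real model keeps many such activations alive for the backward pass, so the per-tensor saving is multiplied across layers.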
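Watching GPU memory usage in WandB also catches the problem early: GPU memory pinned at 100% easily leads to an overflow, whereas a healthy run sits at about 83%.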
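Outside WandB, the same signal can be read from `nvidia-smi`. A minimal sketch, assuming `nvidia-smi` is on `PATH` and queried with `--query-gpu=memory.used,memory.total --format=csv,noheader,nounits` (one `used, total` line per GPU, in MiB). The helper names and the 95% threshold are assumptions, the threshold chosen to sit between the healthy ~83% and the 100% seen in the failing run.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_usage(csv_text):
    """Return the used/total memory ratio for each GPU line of nvidia-smi CSV."""
    ratios = []
    for line in csv_text.strip().splitlines():
        used, total = (float(x) for x in line.split(","))
        ratios.append(used / total)
    return ratios


def gpus_near_full(ratios, threshold=0.95):
    """Indices of GPUs whose memory usage is at or above `threshold`."""
    return [i for i, r in enumerate(ratios) if r >= threshold]


def query_gpus():
    """Run nvidia-smi and parse its output (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_usage(out.stdout)


if __name__ == "__main__":
    sample = "34000, 40960\n40960, 40960\n"  # hypothetical 2-GPU readout
    ratios = parse_memory_usage(sample)
    print([f"{r:.0%}" for r in ratios])  # ['83%', '100%']
    print(gpus_near_full(ratios))        # [1]
```

Polling `query_gpus()` periodically during training, or logging its ratios alongside the loss, would flag a GPU drifting toward 100% before the crash.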
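Copyright notice
The content of "PyTorch - Large-Model Multi-GPU Training: “CUDA error: an illegal memory access was encountered”" (http://eshow365.cn/6-17652-0.html) comes from the Internet; please judge its accuracy for yourself. In case of infringement, contact us and it will be removed immediately.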