File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 443, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 439, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Class Images Labels P R mAP@.5 mAP@.5:.95: 64%|██████████████▊ | 18/28 [00:08<00:04, 2.20it/s]
[W CUDAGuardImpl.h:112] Warning: CUDA warning: the launch timed out and was terminated (function destroyEvent)
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f6ea2ee7a22 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10983 (0x7f6ea3148983 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f6ea314a027 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f6ea2ed15a4 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x7f6ef9d74ed9 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x276 (0x7f6ef9d6b906 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f6ef9d9aa22 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f6ef950ace6 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_ptr<c10d::Logger*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x1d (0x7f6ef9d9f06d in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f6ef950ace6 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0xd88ddf (0x7f6ef9d9cddf in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x4ff4c0 (0x7f6ef95134c0 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x50072e (0x7f6ef951472e in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0x12b785 (0x55af59fff785 in /opt/conda/bin/python)
frame #14: <unknown function> + 0x1ca984 (0x55af5a09e984 in /opt/conda/bin/python)
frame #15: <unknown function> + 0x11f906 (0x55af59ff3906 in /opt/conda/bin/python)
frame #16: <unknown function> + 0x12bc96 (0x55af59fffc96 in /opt/conda/bin/python)
frame #17: <unknown function> + 0x12bc4c (0x55af59fffc4c in /opt/conda/bin/python)
frame #18: <unknown function> + 0x12bc4c (0x55af59fffc4c in /opt/conda/bin/python)
frame #19: <unknown function> + 0x154ec8 (0x55af5a028ec8 in /opt/conda/bin/python)
frame #20: PyDict_SetItemString + 0x87 (0x55af5a02a127 in /opt/conda/bin/python)
frame #21: PyImport_Cleanup + 0x9a (0x55af5a12a5aa in /opt/conda/bin/python)
frame #22: Py_FinalizeEx + 0x7d (0x55af5a12a94d in /opt/conda/bin/python)
frame #23: Py_RunMain + 0x110 (0x55af5a12b7f0 in /opt/conda/bin/python)
frame #24: Py_BytesMain + 0x39 (0x55af5a12b979 in /opt/conda/bin/python)
frame #25: __libc_start_main + 0xf3 (0x7f6f00d860b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #26: <unknown function> + 0x1e7185 (0x55af5a0bb185 in /opt/conda/bin/python)
Class Images Labels P R mAP@.5 mAP@.5:.95: 79%|██████████████████ | 22/28 [00:10<00:02, 2.44it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 3 (pid: 9017) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
************************************************
train.py FAILED
================================================
Root Cause:
time: 2021-11-07_06:11:40
rank: 3 (local_rank: 3)
exitcode: -6 (pid: 9017)
error_file: <N/A>
msg: "Signal 6 (SIGABRT) received by PID 9017"
================================================
Other Failures:
<NO_OTHER_FAILURES>
************************************************
To Reproduce
You have to use two different GPUs:
CUDA:0 (GeForce GTX 1080 Ti, 11178.5MB)
CUDA:1 (GeForce GTX TITAN X, 12212.8125MB)
Steps to reproduce the behavior:
docker pull ultralytics/yolov5
Then, using the Docker image:
$ git clone https://github.com/ultralytics/yolov5
$ cd yolov5
$ python -m torch.distributed.launch --nproc_per_node 2 train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1
Expected behavior
Training proceeds normally.
Environment
https://github.com/ultralytics/yolov5/blob/master/Dockerfile
Additional context
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @ngimel
If you use two different GPUs, the error occurs:
CUDA:0 (GeForce GTX 1080 Ti, 11178.5MB)
CUDA:1 (GeForce GTX TITAN X, 12212.8125MB)
But with two identical GPUs it does not happen.
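For reference, CUDA_LAUNCH_BLOCKING=1 (recommended in the error text above) makes kernel launches synchronous, so the failure is reported at the offending call rather than at some later API call. A minimal sketch of one way to set it, assuming it runs before the process first initializes CUDA; exporting the variables in the shell before the launch command works just as well:

```python
# Minimal debugging sketch: these environment variables must be set before
# CUDA is initialized, so they go at the very top of the entry script
# (placement here is an assumption; exporting them in the shell also works).
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # report CUDA errors at the failing call
os.environ["NCCL_DEBUG"] = "INFO"         # optional: verbose NCCL diagnostics

import torch  # imported only after the environment is configured

print("CUDA available:", torch.cuda.is_available())
# ... the usual (DDP) training code would follow here ...
```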
With CUDA_LAUNCH_BLOCKING=1 set, the message is:
0 (57.0258)
Traceback (most recent call last):
File "train.py", line 836, in <module>
main()
File "train.py", line 639, in main
ema_eval_metrics = validate(
File "train.py", line 805, in validate
reduced_loss = reduce_tensor(loss.data, args.world_size)
File "/work/AI/CLASS/pytorch-image-models_1207/timm/utils/distributed.py", line 13, in reduce_tensor
dist.all_reduce(rt, op=dist.ReduceOp.SUM)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1264, in all_reduce
work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:42, unhandled cuda error, NCCL version 21.1.4
ncclUnhandledCudaError: Call to CUDA function failed.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: the launch timed out and was terminated
Exception raised from create_event_internal at /opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp:1196 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fb0864e263c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1d8f0 (0x7fb0865398f0 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x241 (0x7fb08653af21 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x9c (0x7fb0864cc2ac in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xda3b85 (0x7fb0dbc81b85 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: THPVariable_subclass_dealloc(_object*) + 0x285 (0x7fb0dbc82b55 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xfaef8 (0x55b07a890ef8 in /opt/conda/bin/python3)
frame #7: <unknown function> + 0xfd538 (0x55b07a893538 in /opt/conda/bin/python3)
frame #8: <unknown function> + 0xfd5d9 (0x55b07a8935d9 in /opt/conda/bin/python3)
frame #9: <unknown function> + 0xfd5d9 (0x55b07a8935d9 in /opt/conda/bin/python3)
frame #10: <unknown function> + 0xfd5d9 (0x55b07a8935d9 in /opt/conda/bin/python3)
frame #11: PyDict_SetItemString + 0x401 (0x55b07a9373d1 in /opt/conda/bin/python3)
frame #12: PyImport_Cleanup + 0xa4 (0x55b07aa054e4 in /opt/conda/bin/python3)
frame #13: Py_FinalizeEx + 0x7a (0x55b07aa05a9a in /opt/conda/bin/python3)
frame #14: Py_RunMain + 0x1b8 (0x55b07aa0a5c8 in /opt/conda/bin/python3)
frame #15: Py_BytesMain + 0x39 (0x55b07aa0a939 in /opt/conda/bin/python3)
frame #16: __libc_start_main + 0xf3 (0x7fb11e0590b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #17: <unknown function> + 0x1e8f39 (0x55b07a97ef39 in /opt/conda/bin/python3)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 3 (pid: 546051) of binary: /opt/conda/bin/python3
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in <module>
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 187, in main
launch(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 173, in launch
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 688, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
**************************************************
train.py FAILED
==================================================
Root Cause:
time: 2021-12-18_06:41:59
rank: 3 (local_rank: 3)
exitcode: -6 (pid: 546051)
error_file: <N/A>
msg: "Signal 6 (SIGABRT) received by PID 546051"
==================================================
Other Failures:
<NO_OTHER_FAILURES>
**************************************************
I get this error using 4 identical GPUs: a p3.8xlarge instance on EC2, which has 4 Tesla V100 GPUs.
CUDA 11.3, PyTorch 1.10.
[03/24 07:48:55 detectron2]: Environment info:
---------------------- -------------------------------------------------------------------------
sys.platform linux
Python 3.8.10 (default, Nov 26 2021, 20:14:08) [GCC 9.3.0]
numpy 1.22.3
detectron2 0.6 @/home/ubuntu/.local/lib/python3.8/site-packages/detectron2
Compiler GCC 7.3
CUDA compiler CUDA 11.3
detectron2 arch flags 3.7, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6
DETECTRON2_ENV_MODULE <not set>
PyTorch 1.10.0+cu113 @/home/ubuntu/.local/lib/python3.8/site-packages/torch
PyTorch debug build False
GPU available Yes
GPU 0,1,2,3 Tesla V100-SXM2-16GB (arch=7.0)
Driver version 510.47.03
CUDA_HOME /usr/local/cuda
Pillow 9.0.1
torchvision 0.11.1+cu113 @/home/ubuntu/.local/lib/python3.8/site-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore 0.1.5.post20220305
iopath 0.1.9
cv2 4.5.5
---------------------- -------------------------------------------------------------------------
PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.2
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
Error with CUDA_LAUNCH_BLOCKING=1:
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fd7da008d62 in /home/ubuntu/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1c5f3 (0x7fd81d6b55f3 in /home/ubuntu/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7fd81d6b6002 in /home/ubuntu/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7fd7d9ff2314 in /home/ubuntu/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x29adb9 (0x7fd8a09d2db9 in /home/ubuntu/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xae0c91 (0x7fd8a1218c91 in /home/ubuntu/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7fd8a1218f92 in /home/ubuntu/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
Similar errors with 4/8 A100 GPUs. Installed packages:
NVIDIA-SMI 450.119.04 Driver Version: 450.119.04 CUDA Version: 11.0
PyTorch 1.7.1+cu110
Apex 0.1
torch.cuda.nccl.version()=2708
Error message:
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f80d6d0b8b2 in /export/share/ruimeng/env/anaconda/envs/ir/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f80d6f5d952 in /export/share/ruimeng/env/anaconda/envs/ir/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f80d6cf6b7d in /export/share/ruimeng/env/anaconda/envs/ir/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x5c (0x7f81722c5ddc in /export/share/ruimeng/env/anaconda/envs/ir/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10d::ProcessGroupNCCL::allgather(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&) + 0xfc (0x7f8172a1bedc in /export/share/ruimeng/env/anaconda/envs/ir/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x8dbb36 (0x7f81728ebb36 in /export/share/ruimeng/env/anaconda/envs/ir/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x2c4416 (0x7f81722d4416 in /export/share/ruimeng/env/anaconda/envs/ir/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x13c7ae (0x55b705f257ae in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #8: _PyObject_MakeTpCall + 0x3bf (0x55b705f1a25f in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #9: <unknown function> + 0x166d50 (0x55b705f4fd50 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #10: _PyEval_EvalFrameDefault + 0x4f81 (0x55b705fc39d1 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #11: _PyEval_EvalCodeWithName + 0x260 (0x55b705fb51f0 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #12: _PyFunction_Vectorcall + 0x594 (0x55b705fb67b4 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #13: _PyEval_EvalFrameDefault + 0x1517 (0x55b705fbff67 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #14: _PyEval_EvalCodeWithName + 0xd5f (0x55b705fb5cef in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #15: _PyFunction_Vectorcall + 0x594 (0x55b705fb67b4 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #16: _PyEval_EvalFrameDefault + 0x71a (0x55b705fbf16a in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #17: _PyEval_EvalCodeWithName + 0xd5f (0x55b705fb5cef in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #18: _PyFunction_Vectorcall + 0x594 (0x55b705fb67b4 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x1517 (0x55b705fbff67 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #20: _PyEval_EvalCodeWithName + 0x260 (0x55b705fb51f0 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #21: _PyFunction_Vectorcall + 0x594 (0x55b705fb67b4 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x1517 (0x55b705fbff67 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #23: _PyEval_EvalCodeWithName + 0xd5f (0x55b705fb5cef in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #24: _PyFunction_Vectorcall + 0x594 (0x55b705fb67b4 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x71a (0x55b705fbf16a in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #26: _PyFunction_Vectorcall + 0x1b7 (0x55b705fb63d7 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x71a (0x55b705fbf16a in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #28: _PyEval_EvalCodeWithName + 0x260 (0x55b705fb51f0 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #29: PyEval_EvalCode + 0x23 (0x55b705fb6aa3 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #30: <unknown function> + 0x241382 (0x55b70602a382 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #31: <unknown function> + 0x252202 (0x55b70603b202 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #32: <unknown function> + 0x2553ab (0x55b70603e3ab in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #33: PyRun_SimpleFileExFlags + 0x1bf (0x55b70603e58f in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #34: Py_RunMain + 0x3a9 (0x55b70603ea69 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #35: Py_BytesMain + 0x39 (0x55b70603ec69 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
frame #36: __libc_start_main + 0xf3 (0x7f817405a0b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #37: <unknown function> + 0x1f7427 (0x55b705fe0427 in /export/share/ruimeng/env/anaconda/envs/ir/bin/python)
I got a similar error when using distributed training with 8 RTX 2080 Ti GPUs. But when I tried the following settings, the error was just gone. Does anyone know why?
#previous
torch.backends.cudnn.benchmark=True
torch.backends.cudnn.deterministic=False
#new settings
torch.backends.cudnn.benchmark=False
torch.backends.cudnn.deterministic=True
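For context, a minimal sketch of where such flags typically live, near the top of the training script before the model and dataloaders are built; the placement is an assumption, while the flags themselves are the standard torch.backends.cudnn switches quoted above:

```python
import torch

# Workaround reported above: turn off cuDNN autotuning and force deterministic
# algorithms. This can cost some speed but avoided the crash on that setup.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

# ... build the model, wrap it in DistributedDataParallel, and train as usual ...
```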
mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.10 (default, Jun 4 2021, 15:09:15) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.8.2+cu102
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.2
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
- CuDNN 7.6.5
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.9.2+cu102
OpenCV: 4.5.5
MMCV: 1.4.2
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMDetection: 2.23.0+09c9024
Error with CUDA_LAUNCH_BLOCKING=1:
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f21c78ea2f2 in /mnt/sdf/caoxu/miniconda3/envs/mm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f21c78e767b in /mnt/sdf/caoxu/miniconda3/envs/mm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f21c7b421f9 in /mnt/sdf/caoxu/miniconda3/envs/mm/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f21c78d23a4 in /mnt/sdf/caoxu/miniconda3/envs/mm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x6e43ca (0x7f221a4aa3ca in /mnt/sdf/caoxu/miniconda3/envs/mm/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x6e4461 (0x7f221a4aa461 in /mnt/sdf/caoxu/miniconda3/envs/mm/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x1932c6 (0x55f0be5f12c6 in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
frame #7: <unknown function> + 0x15878b (0x55f0be5b678b in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
frame #8: <unknown function> + 0x158b8c (0x55f0be5b6b8c in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
frame #9: <unknown function> + 0x158bac (0x55f0be5b6bac in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
frame #10: <unknown function> + 0x1592ac (0x55f0be5b72ac in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
frame #11: <unknown function> + 0x158e77 (0x55f0be5b6e77 in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
frame #12: <unknown function> + 0x158e60 (0x55f0be5b6e60 in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
frame #13: <unknown function> + 0x158e60 (0x55f0be5b6e60 in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
frame #14: <unknown function> + 0x158e60 (0x55f0be5b6e60 in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
frame #15: <unknown function> + 0x158e60 (0x55f0be5b6e60 in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
frame #16: <unknown function> + 0x158e60 (0x55f0be5b6e60 in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
frame #17: <unknown function> + 0x158e60 (0x55f0be5b6e60 in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
frame #18: <unknown function> + 0x176057 (0x55f0be5d4057 in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
frame #19: PyDict_SetItemString + 0x61 (0x55f0be5f53c1 in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
frame #20: PyImport_Cleanup + 0x9d (0x55f0be633aad in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
frame #21: Py_FinalizeEx + 0x79 (0x55f0be665a49 in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
frame #22: Py_RunMain + 0x183 (0x55f0be667893 in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
frame #23: Py_BytesMain + 0x39 (0x55f0be667ca9 in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
frame #24: __libc_start_main + 0xe7 (0x7f222034cbf7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: <unknown function> + 0x1e21c7 (0x55f0be6401c7 in /mnt/sdf/caoxu/miniconda3/envs/mm/bin/python)
Similar errors with 4/8 A100 GPUs. Installed packages:
NVIDIA-SMI 450.119.04 Driver Version: 450.119.04 CUDA Version: 11.0
PyTorch 1.7.1+cu110
Apex 0.1
nccl =2708
FYI, the bug seems to be resolved after I upgraded CUDA and PyTorch. It's running well now, and I will update if I see new problems.
Driver Version: 470.82.01
CUDA Version: 11.4
PyTorch Version: 1.11.0+cu113
NCCL: (2, 10, 3)
I'm seeing the same error in a 2 GPU DDP setting with
Driver Version: 470.141.03
CUDA Version: 11.4
PyTorch Version: 1.12.1+cu113
NCCL: (2, 10, 3)
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: initialization error
Exception raised from getDevice at ../c10/cuda/impl/CUDAGuardImpl.h:39 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7fa8cee5b20e in [..]/venv3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xf3b6d (0x7fa91171eb6d in [..]/venv3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: <unknown function> + 0xf6ffe (0x7fa911721ffe in [..]/venv3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: <unknown function> + 0x4635b8 (0x7fa920a845b8 in [..]/venv3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7fa8cee427a5 in [..]/venv3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x35f485 (0x7fa920980485 in [..]/venv3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x6795c8 (0x7fa920c9a5c8 in [..]/venv3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7fa920c9a995 in [..]/venv3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: python() [0x57945b]
frame #9: python() [0x53f2f8]
frame #10: python() [0x5abb4c]
frame #11: python() [0x5abba0]
frame #12: python() [0x5abba0]
frame #13: python() [0x5abba0]
frame #14: python() [0x5abba0]
frame #15: python() [0x5abba0]
frame #16: python() [0x5abba0]
frame #17: python() [0x65cfd9]
frame #18: python() [0x5d8cfb]
frame #19: python() [0x67473f]
frame #20: python() [0x53fd97]
frame #21: python() [0x52f15a]
frame #22: python() [0x5b22ce]
<omitting python frames>
frame #27: python() [0x56ba4b]
frame #31: python() [0x4dfc1f]
frame #33: python() [0x4a634c]
frame #37: python() [0x4a634c]
frame #42: python() [0x4dfc76]
frame #45: python() [0x4dfc1f]
frame #50: python() [0x4dfc1f]
frame #51: python() [0x4fe137]
frame #52: python() [0x4df4c7]
frame #54: python() [0x4dfc1f]
frame #57: python() [0x56b6b6]
frame #61: python() [0x4df395]
frame #63: python() [0x4df395]
The error seems to be triggered by a specific combination of weights and inputs, since, usually, restarting the training from the last checkpoint helps for a couple of epochs. Also, this does not happen for all datasets that I use for training, even though the training code is always the same.
I have not yet managed to find out what exactly triggers the error, so any help debugging would be much appreciated!
Updating Torch (from 1.10.1+cu113 to 1.12.1+cu113) did not solve it, nor did setting
torch.backends.cudnn.benchmark=False
torch.backends.cudnn.deterministic=True
as suggested by @SheffieldCao.
I get similar errors with DDP training. Installed packages:
Driver Version: 455.23.05
CUDA Version: 11.1
PyTorch 1.8.2+cu111
torch.cuda.nccl.version()=2708
Running the Python command with CUDA_LAUNCH_BLOCKING=1 gives this error message:
File "train.py", line 121, in <module>
main()
File "train.py", line 69, in main
loss = trainer.model(inputs, targets, meta_info, 'train')
File "/home/work/miniconda3/envs/zy/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/work/miniconda3/envs/zy/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 696, in forward
self._sync_params()
File "/home/work/miniconda3/envs/zy/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1222, in _sync_params
self._distributed_broadcast_coalesced(
File "/home/work/miniconda3/envs/zy/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7ff3ce9f22f2 in /home/work/miniconda3/envs/zy/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7ff3ce9ef67b in /home/work/miniconda3/envs/zy/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7ff3cec4a1f9 in /home/work/miniconda3/envs/zy/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7ff3ce9da3a4 in /home/work/miniconda3/envs/zy/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x7ff442f13cc9 in /home/work/miniconda3/envs/zy/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x26a (0x7ff442f08c8a in /home/work/miniconda3/envs/zy/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7ff442f2ff22 in /home/work/miniconda3/envs/zy/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7ff44286ce76 in /home/work/miniconda3/envs/zy/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xa2121f (0x7ff442f3321f in /home/work/miniconda3/envs/zy/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x369f80 (0x7ff44287bf80 in /home/work/miniconda3/envs/zy/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x36b1ee (0x7ff44287d1ee in /home/work/miniconda3/envs/zy/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x10faf5 (0x5568d54e5af5 in /home/work/miniconda3/envs/zy/bin/python)
frame #12: <unknown function> + 0x1a9727 (0x5568d557f727 in /home/work/miniconda3/envs/zy/bin/python)
frame #13: <unknown function> + 0x10faf5 (0x5568d54e5af5 in /home/work/miniconda3/envs/zy/bin/python)
frame #14: <unknown function> + 0x1a9727 (0x5568d557f727 in /home/work/miniconda3/envs/zy/bin/python)
frame #15: <unknown function> + 0x11071c (0x5568d54e671c in /home/work/miniconda3/envs/zy/bin/python)
frame #16: <unknown function> + 0x110059 (0x5568d54e6059 in /home/work/miniconda3/envs/zy/bin/python)
frame #17: <unknown function> + 0x110043 (0x5568d54e6043 in /home/work/miniconda3/envs/zy/bin/python)
frame #18: <unknown function> + 0x177ce7 (0x5568d554dce7 in /home/work/miniconda3/envs/zy/bin/python)
frame #19: PyDict_SetItemString + 0x4c (0x5568d5550d8c in /home/work/miniconda3/envs/zy/bin/python)
frame #20: PyImport_Cleanup + 0xaa (0x5568d55c3a2a in /home/work/miniconda3/envs/zy/bin/python)
frame #21: Py_FinalizeEx + 0x79 (0x5568d56294c9 in /home/work/miniconda3/envs/zy/bin/python)
frame #22: Py_RunMain + 0x1bc (0x5568d562c83c in /home/work/miniconda3/envs/zy/bin/python)
frame #23: Py_BytesMain + 0x39 (0x5568d562cc29 in /home/work/miniconda3/envs/zy/bin/python)
frame #24: __libc_start_main + 0xe7 (0x7ff44848bc87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: <unknown function> + 0x1f9ad7 (0x5568d55cfad7 in /home/work/miniconda3/envs/zy/bin/python)
Still happens to me...
Any update on this, @ptrblck?
Is there anything one can get out of the stack trace?
Running on 8 A100-80 GPUs (DDP) with driver version 515.65.01 and the latest PyTorch, cudatoolkit, etc. installed.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: initialization error
Exception raised from insert_events at ../c10/cuda/CUDACachingAllocator.cpp:763 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x14f5159a72f2 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x14f5159a467b in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xc92 (0x14f515bff682 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x14f51598f3a4 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x6e47ba (0x14f5169077ba in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x13c255 (0x5568fdb88255 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #6: <unknown function> + 0x1efd35 (0x5568fdc3bd35 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #7: _PyObject_GC_Malloc + 0x88 (0x5568fdb88998 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #8: _PyObject_GC_New + 0x13 (0x5568fdb8ece3 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #9: <unknown function> + 0x142f41 (0x5568fdb8ef41 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #10: PyObject_GetIter + 0x16 (0x5568fdb8f376 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #11: _PyEval_EvalFrameDefault + 0x165a (0x5568fdc0e01a in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #12: <unknown function> + 0x215056 (0x5568fdc61056 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #13: PyIter_Next + 0xe (0x5568fdb8b11e in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #14: <unknown function> + 0x16be39 (0x5568fdbb7e39 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #15: <unknown function> + 0x20807f (0x5568fdc5407f in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #16: <unknown function> + 0x17000c (0x5568fdbbc00c in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #17: <unknown function> + 0x10075e (0x5568fdb4c75e in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #18: _PyObject_FastCallDict + 0x38d (0x5568fdbd860d in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #19: <unknown function> + 0x21a01b (0x5568fdc6601b in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #20: <unknown function> + 0x107c77 (0x5568fdb53c77 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #21: <unknown function> + 0x13f30b (0x5568fdb8b30b in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #22: <unknown function> + 0x10077f (0x5568fdb4c77f in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #23: _PyFunction_Vectorcall + 0x10b (0x5568fdbd786b in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #24: <unknown function> + 0x10075e (0x5568fdb4c75e in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #25: _PyObject_FastCallDict + 0x38d (0x5568fdbd860d in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #26: _PyObject_Call_Prepend + 0x63 (0x5568fdbd8733 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #27: <unknown function> + 0x18c8ca (0x5568fdbd88ca in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #28: _PyObject_MakeTpCall + 0x1a4 (0x5568fdb897d4 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x475 (0x5568fdc0ce35 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #30: _PyFunction_Vectorcall + 0x10b (0x5568fdbd786b in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #31: <unknown function> + 0x10077f (0x5568fdb4c77f in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #32: _PyFunction_Vectorcall + 0x10b (0x5568fdbd786b in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #33: <unknown function> + 0x10077f (0x5568fdb4c77f in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #34: _PyFunction_Vectorcall + 0x10b (0x5568fdbd786b in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #35: <unknown function> + 0x10075e (0x5568fdb4c75e in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #36: _PyEval_EvalCodeWithName + 0x659 (0x5568fdbd6e19 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #37: _PyObject_FastCallDict + 0x20c (0x5568fdbd848c in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #38: _PyObject_Call_Prepend + 0x63 (0x5568fdbd8733 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #39: <unknown function> + 0x18c8ca (0x5568fdbd88ca in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #40: _PyObject_MakeTpCall + 0x1a4 (0x5568fdb897d4 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #41: _PyEval_EvalFrameDefault + 0x475 (0x5568fdc0ce35 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #42: _PyFunction_Vectorcall + 0x10b (0x5568fdbd786b in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #43: <unknown function> + 0x10075e (0x5568fdb4c75e in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #44: _PyFunction_Vectorcall + 0x10b (0x5568fdbd786b in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #45: <unknown function> + 0x7df36 (0x5568fdac9f36 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #46: <unknown function> + 0x13d9ef (0x5568fdb899ef in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #47: <unknown function> + 0x20958a (0x5568fdc5558a in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #48: PyObject_GetIter + 0x16 (0x5568fdb8f376 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #49: <unknown function> + 0x140699 (0x5568fdb8c699 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #50: PyVectorcall_Call + 0x71 (0x5568fdb89041 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #51: _PyEval_EvalFrameDefault + 0x53e0 (0x5568fdc11da0 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #52: _PyEval_EvalCodeWithName + 0x7df (0x5568fdbd6f9f in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #53: _PyFunction_Vectorcall + 0x1e3 (0x5568fdbd7943 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #54: <unknown function> + 0x10011a (0x5568fdb4c11a in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #55: _PyFunction_Vectorcall + 0x10b (0x5568fdbd786b in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #56: <unknown function> + 0x10077f (0x5568fdb4c77f in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #57: <unknown function> + 0x18c11a (0x5568fdbd811a in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #58: _PyObject_GenericGetAttrWithDict + 0x135 (0x5568fdb8af65 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #59: _PyEval_EvalFrameDefault + 0x96c (0x5568fdc0d32c in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #60: _PyFunction_Vectorcall + 0x10b (0x5568fdbd786b in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #61: <unknown function> + 0xba0de (0x5568fdb060de in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #62: <unknown function> + 0x17ee73 (0x5568fdbcae73 in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
frame #63: <unknown function> + 0x18481b (0x5568fdbd081b in /home/hpc/iwi5/iwi5044h/miniconda3/envs/ldm/bin/python)
Follow-up:
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fdc55d80457 in /users/shivanim/anaconda3/envs/mlenv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fdc55d4a3ec in /users/shivanim/anaconda3/envs/mlenv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7fdc87a9a044 in /users/shivanim/anaconda3/envs/mlenv/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x6a841f (0x7fdc9de0041f in /users/shivanim/anaconda3/envs/mlenv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x4ce868 (0x7fdc9dc26868 in /users/shivanim/anaconda3/envs/mlenv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x3f2dd (0x7fdc55d672dd in /users/shivanim/anaconda3/envs/mlenv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7fdc55d609e0 in /users/shivanim/anaconda3/envs/mlenv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fdc55d60af9 in /users/shivanim/anaconda3/envs/mlenv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x72d128 (0x7fdc9de85128 in /users/shivanim/anaconda3/envs/mlenv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7fdc9de85445 in /users/shivanim/anaconda3/envs/mlenv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #61: __libc_start_main + 0xf5 (0x7fdcf285a555 in /lib64/libc.so.6)
I'm seeing the same error in a 2 GPU DDP setting with Driver Version: 470.141.03 CUDA Version: 11.4 PyTorch Version: 1.12.1+cu113 NCCL: (2, 10, 3)
I have just the same case as you described, so have you solved the problem?
We also experience quite a high number of these errors:
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: initialization error
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa90b1f74d7 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fa90b1c136b in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7fa90b1d569e in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
We are using PyTorch 2.0 with DDP (initialized by PyTorch Lightning). Interestingly, we can clearly see this happening, usually at the very beginning of an epoch, mid-training (e.g. after 5 epochs).
A small update on our side: while I can't produce a minimal example, we have narrowed down the search for this error. We can now reliably reproduce it whenever SummaryWriter.add_image() is called.
Usually it goes like this:
Run some sanity_val_steps
Log some images with SummaryWriter.add_image()
When the epoch resets, the error "CUDA error: initialization error" is thrown.
Inside add_image(), the call chain seems to be add_image() -> image() -> make_image() -> image.save(output).
This saves the image to a temporary io.BytesIO stream that is closed soon after. Without this line, everything runs fine. Please note that the error isn't thrown directly after this call; it might be later in the training, and only when using > 0 workers in the dataloader (even though the logging is not inside the dataloader). This suggests that some references are leaking, but I'm not sure how to debug this further.
As mentioned before, I'm not yet able to reproduce it with a minimal example. I will update if I manage to.
EDIT:
The issue seems to be gone if I force gc.collect() after logging images. I can also put gc.collect() at the start of every epoch and it still seems to fix the error (at least in our very reproducible scenario; I'm not sure it will always be gone). I understand that this is not the ideal solution. A sketch of the workaround follows below.
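For illustration, a minimal sketch of the workaround described above; the writer, tag, and function names are placeholders, and forcing a collection is only the observed mitigation, not a root-cause fix:

```python
import gc

import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()  # hypothetical writer; tag and shapes are placeholders


def log_validation_image(image_chw: torch.Tensor, step: int) -> None:
    # add_image encodes the CHW tensor into a temporary in-memory buffer.
    writer.add_image("val/sample", image_chw, global_step=step)
    # Observed mitigation: force a garbage-collection pass right after logging
    # (or once at the start of each epoch) so stale references are released.
    gc.collect()


log_validation_image(torch.rand(3, 64, 64), step=0)
```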
I use PyTorch 2.0 and DDP with PyTorch Lightning, and I hit the same problem. I found that my problem was initializing an attribute of the Lightning module outside its __init__ function, i.e. I used self.xxx = xxx.to(self.device). In the second epoch, the processes fail and report the CUDA initialization error.
I removed the self.xxx = xxx.to(self.device) assignment outside __init__ and instead assign self.xxx = xxx in __init__ (so that it is moved to the device automatically by PyTorch Lightning). That fixed the problem.
Hope it helps :)
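A minimal sketch of the pattern described in that comment, with made-up attribute and tensor names; here the tensor is created in __init__ and registered as a buffer (a slight variation on a plain attribute) so the framework moves it to the right device, instead of calling .to(self.device) outside __init__:

```python
import torch
import pytorch_lightning as pl  # assumed available, as in the comment above


class MyLitModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Create device-dependent constants here. A registered buffer is moved
        # to the correct device together with the module, so no manual
        # .to(self.device) call is needed later.
        self.register_buffer("anchor_points", torch.zeros(10, 2))

    def training_step(self, batch, batch_idx):
        # Avoid `self.anchor_points = something.to(self.device)` here; per the
        # comment above, assigning such attributes outside __init__ triggered
        # the "CUDA error: initialization error" in the second epoch.
        ...
```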
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1347, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'invalid argument'
I have encountered the same problem; has anyone solved it?
I found that increasing num_workers solved the issue, which is weird. If someone knows the reason, please tell me. Thank you!!
I solved the problem by changing the number of workers. Note that I use a SLURM launcher.
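For completeness, the setting both comments refer to is the DataLoader's num_workers argument; the dataset and numbers in this sketch are placeholders, and the value that avoids the error appears to be machine-dependent:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; only the num_workers value is the point here.
train_dataset = TensorDataset(torch.randn(256, 3, 32, 32),
                              torch.zeros(256, dtype=torch.long))

train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=8,    # changing this value reportedly made the error go away
    pin_memory=True,
)
```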
I have hit a similar problem as well; the following is the error info:
terminate called after throwing an instance of 'c10::CUDAError' what(): CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
I solved this problem by reducing the AvgPool size from 100352 to a smaller one. I think it is related to GPU memory, so you may be able to solve this problem by checking everything that might influence GPU memory, such as the batch size or the size of the pooler.
I think it happens when a wrong index gets into the labels during the dataset augmentation process. Check the following (a config sketch follows below):
- Check your labels: they should use the ID of each class rather than the RGB values.
- Check the "ignore index": for the cross-entropy loss function, the default is ignore_index=-100, and you should ignore the right index in your dataset config files.
- For a custom dataset that does not ignore the background, add another index for the padding elements during the padding process, so that it does not conflict with the IDs that are included in your loss function.
For example, for a custom dataset a very common cause of this error is setting the wrong value for the padding elements, which can be solved by changing:
dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255) -> dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=-100)
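As an illustration, a sketch of where that Pad entry sits in an MMSegmentation-style training pipeline; apart from the Pad line taken from the comment above, the transforms and values are typical placeholders rather than from this issue, and the right seg_pad_val must match your loss's ignore_index:

```python
# Hypothetical mmseg-style config snippet; only the Pad entry is the point.
crop_size = (512, 512)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='RandomCrop', crop_size=crop_size),
    # Pad the label map with the loss's ignore_index (here -100, as suggested
    # above) so padded pixels never enter the cross-entropy computation.
    dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=-100),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg']),
]
```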