Hi everyone, I cannot figure out what went wrong and need some help. Thanks in advance.
conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.2 -c pytorch
My server has CUDA 11.1 installed; here is the output of the nvidia-smi command on my server:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:18:00.0 Off | N/A |
| 40% 45C P8 17W / 260W | 0MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:3B:00.0 Off | N/A |
| 41% 37C P8 20W / 260W | 0MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... Off | 00000000:86:00.0 Off | N/A |
| 41% 37C P8 20W / 260W | 0MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... Off | 00000000:AF:00.0 Off | N/A |
| 41% 38C P8 11W / 260W | 0MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
When I try to train my network, it always throws this runtime error:
Files already downloaded and verified
Files already downloaded and verified
/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Training Epoch: 1 [128/50000] Loss: 4.6580 LR: 0.000256
Training Epoch: 1 [256/50000] Loss: 4.6234 LR: 0.000512
Training Epoch: 1 [384/50000] Loss: 4.6516 LR: 0.000767
Training Epoch: 1 [512/50000] Loss: 4.6480 LR: 0.001023
Training Epoch: 1 [640/50000] Loss: 4.6622 LR: 0.001279
Training Epoch: 1 [768/50000] Loss: 4.6319 LR: 0.001535
Training Epoch: 1 [896/50000] Loss: 4.5743 LR: 0.001790
Training Epoch: 1 [1024/50000] Loss: 4.6521 LR: 0.002046
Training Epoch: 1 [1152/50000] Loss: 4.6352 LR: 0.002302
Training Epoch: 1 [1280/50000] Loss: 4.5955 LR: 0.002558
Training Epoch: 1 [1408/50000] Loss: 4.6159 LR: 0.002813
Training Epoch: 1 [1536/50000] Loss: 4.6440 LR: 0.003069
Training Epoch: 1 [1664/50000] Loss: 4.6346 LR: 0.003325
Training Epoch: 1 [1792/50000] Loss: 4.6477 LR: 0.003581
Training Epoch: 1 [1920/50000] Loss: 4.6555 LR: 0.003836
Traceback (most recent call last):
File "train.py", line 209, in <module>
train(epoch)
File "train.py", line 52, in train
writer.add_scalar('LastLayerGradients/grad_norm2_weights', para.grad.norm(), n_iter)
File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/utils/tensorboard/writer.py", line 346, in add_scalar
scalar(tag, scalar_value), global_step, walltime)
File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/utils/tensorboard/summary.py", line 247, in scalar
scalar = make_np(scalar)
File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/utils/tensorboard/_convert_np.py", line 24, in make_np
return _prepare_pytorch(x)
File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/utils/tensorboard/_convert_np.py", line 32, in _prepare_pytorch
x = x.cpu().numpy()
RuntimeError: CUDA error: an illegal memory access was encountered
I've tried installing PyTorch 1.6 with CUDA 9.2, 10.1, and 10.2, and PyTorch 1.7 with CUDA 9.2, 10.1, 10.2, and 11.0 via conda; all of them give the same error. My code works fine on Google Colab, though.
To Reproduce
Steps to reproduce the behavior:
PyTorch 1.6 is required to run my code:
git clone [email protected]:weiaicunzai/pytorch-cifar100.git
cd pytorch-cifar100
python train.py -net vgg16 -gpu
Expected behavior
If you have the same environment as me (PyTorch 1.6 installed through conda on a machine with CUDA 11.1 installed), you will run into the same error output shown above.
Environment
Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
Collecting environment information...
PyTorch version: 1.6.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti
Nvidia driver version: 455.23.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.6.0
[pip3] torchaudio==0.6.0a0+f17ae39
[pip3] torchvision==0.7.0
[conda] _pytorch_select 0.1 cpu_0 defaults
[conda] blas 1.0 mkl defaults
[conda] cudatoolkit 10.2.89 hfd86e86_1 defaults
[conda] mkl 2020.2 256 defaults
[conda] mkl-service 2.3.0 py36he904b0f_0 defaults
[conda] mkl_fft 1.2.0 py36h23d657b_0 defaults
[conda] mkl_random 1.1.1 py36h0573a6f_0 defaults
[conda] numpy 1.19.2 py36h54aff64_0 defaults
[conda] numpy-base 1.19.2 py36hfa32c7d_0 defaults
[conda] pytorch 1.6.0 py3.6_cuda10.2.89_cudnn7.6.5_0 pytorch
[conda] torchaudio 0.6.0 py36 pytorch
[conda] torchvision 0.7.0 py36_cu102 pytorch
Additional context
I've noticed a picture on the NVIDIA website. Does this mean that PyTorch 1.6 compiled with CUDA 10.2 can work smoothly on a machine with NVIDIA driver 455 installed? Since conda can install cudatoolkit 10.2 for me, why am I getting this CUDA runtime error? Thanks!
I've tried downgrading the NVIDIA driver from 455 to 450; still the same problem:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:18:00.0 Off | N/A |
| 41% 51C P8 18W / 260W | 3MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:3B:00.0 Off | N/A |
| 41% 40C P8 20W / 260W | 3MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... Off | 00000000:86:00.0 Off | N/A |
| 41% 41C P8 20W / 260W | 3MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... Off | 00000000:AF:00.0 Off | N/A |
| 41% 42C P8 12W / 260W | 3MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
@weiaicunzai You can have the latest NVIDIA drivers installed and still use any of the CUDA 10.x or 11.x releases.
Just make sure to install and update CUDA together with cuDNN separately from the driver installation.
For CUDA 11 you need to use PyTorch 1.7, released yesterday.
For PyTorch 1.6, both CUDA 10.1 and 10.2 should be fine.
PS: The nvidia-smi CUDA Version field can be misleading; it is not worth relying on to see what PyTorch is actually using.
This is what I use to switch between different CUDA versions, but I do not use conda, just a regular venv.
SET CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1
SET PATH=%CUDA_PATH%\bin;%PATH%
SET PATH=%CUDA_PATH%\extras\CUPTI\libx64;%PATH%
SET PATH=C:\cudnn-7.6.5\10.1\bin;%PATH%;
Thanks for your reply. The NVIDIA driver has been downgraded to 450.80.02, but the error still exists, and I do not know what happened.
I tried your suggestion of setting the PATH variable, but it seems PyTorch still uses conda's cudatoolkit.
Any other suggestions? Thank you. I still cannot figure out what could possibly have gone wrong.
@weiaicunzai I think you can still have the latest drivers; the problem is likely with CUDA or cuDNN.
What's curious about your environment is the cuDNN 8.0.4 that was picked up in the environment collection, while the conda package later references py3.6_cuda10.2.89_cudnn7.6.5_0. cuDNN 8.0.4 is compatible only with CUDA 11.x, so one would need 7.6.5.
It could be worth trying to install PyTorch into a regular Python venv and see how that goes.
PS: Sorry about the PATH section I posted; it was misleading since you're not using Windows. Having multiple CUDA versions on Linux is much less trivial.
Since I do not have root privileges, I created a Python virtual environment using conda and installed PyTorch 1.7 with CUDA 11.0 support, but with no luck; still the same error.
Here are the CUDA and cuDNN versions detected by PyTorch:
>>> import torch
>>> torch.version.cuda
'11.0'
>>> torch.backends.cudnn.version()
Is my problem related to #21819?
Thanks for your replies. @ngimel @VitalyFedyunin
I've tested PyTorch 1.6 with CUDA 9.2, 10.1, and 10.2, and PyTorch 1.7 with CUDA 9.2, 10.1, and 10.2; both give me the same error message.
For example, PyTorch 1.6 + CUDA 10.2 + Python 3.6:
conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.2 -c pytorch
CUDA_LAUNCH_BLOCKING=1 python train.py -net resnet18 -gpu
Error message is:
Files already downloaded and verified
Files already downloaded and verified
/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Training Epoch: 1 [128/50000] Loss: 4.6661 LR: 0.000256
Training Epoch: 1 [256/50000] Loss: 4.6763 LR: 0.000512
Training Epoch: 1 [384/50000] Loss: 4.7014 LR: 0.000767
Training Epoch: 1 [512/50000] Loss: 4.7478 LR: 0.001023
Training Epoch: 1 [640/50000] Loss: 4.6901 LR: 0.001279
Traceback (most recent call last):
File "train.py", line 209, in <module>
train(epoch)
File "train.py", line 44, in train
loss.backward()
File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Exception raised from cudnn_batch_norm_backward at /opt/conda/conda-bld/pytorch_1595629427286/work/aten/src/ATen/native/cudnn/BatchNorm.cpp:324 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f1dde67c77d in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: at::native::cudnn_batch_norm_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, at::Tensor const&) + 0x1db3 (0x7f1ddf772433 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd0138a (0x7f1ddf7e338a in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xd2f8bb (0x7f1ddf8118bb in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::cudnn_batch_norm_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, at::Tensor const&) + 0x1ef (0x7f1e15ce910f in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x2b59cff (0x7f1e17930cff in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x2b6b21b (0x7f1e1794221b in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::cudnn_batch_norm_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, at::Tensor const&) + 0x1ef (0x7f1e15ce910f in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #8: torch::autograd::generated::CudnnBatchNormBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x42c (0x7f1e17893fec in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x30d1017 (0x7f1e17ea8017 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x7f1e17ea3860 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7f1e17ea4401 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f1e17e9c579 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x7f1e1c1cb13a in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #14: <unknown function> + 0xc819d (0x7f1e1ecf019d in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #15: <unknown function> + 0x76db (0x7f1e431b06db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #16: clone + 0x3f (0x7f1e42ed9a3f in /lib/x86_64-linux-gnu/libc.so.6)
But the strange thing is: when I tested PyTorch 1.6 with CUDA 10.1 (if I remember correctly) a couple of hours ago with CUDA_LAUNCH_BLOCKING=1 python train.py -net resnet18 -gpu, it gave me this error: RuntimeError: Unable to find a valid cuDNN algorithm to run convolution. But now I cannot reproduce it anymore; I do not know why.
Update:
I disabled cuDNN using torch.backends.cudnn.enabled = False, then ran CUDA_LAUNCH_BLOCKING=1 python train.py -net resnet18 -gpu, which gives me the same RuntimeError: CUDA error: an illegal memory access was encountered error:
Files already downloaded and verified
Files already downloaded and verified
/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Training Epoch: 1 [128/50000] Loss: 4.7320 LR: 0.000256
Training Epoch: 1 [256/50000] Loss: 4.6542 LR: 0.000512
Training Epoch: 1 [384/50000] Loss: 4.6969 LR: 0.000767
Training Epoch: 1 [512/50000] Loss: 4.7084 LR: 0.001023
Training Epoch: 1 [640/50000] Loss: 4.7125 LR: 0.001279
Training Epoch: 1 [768/50000] Loss: 4.7134 LR: 0.001535
Training Epoch: 1 [896/50000] Loss: 4.7244 LR: 0.001790
Training Epoch: 1 [1024/50000] Loss: 4.7463 LR: 0.002046
Training Epoch: 1 [1152/50000] Loss: 4.6443 LR: 0.002302
Training Epoch: 1 [1280/50000] Loss: 4.6344 LR: 0.002558
Training Epoch: 1 [1408/50000] Loss: 4.6758 LR: 0.002813
Training Epoch: 1 [1536/50000] Loss: 4.6331 LR: 0.003069
Training Epoch: 1 [1664/50000] Loss: 4.6103 LR: 0.003325
Training Epoch: 1 [1792/50000] Loss: 4.6204 LR: 0.003581
Training Epoch: 1 [1920/50000] Loss: 4.5800 LR: 0.003836
Traceback (most recent call last):
File "train.py", line 210, in <module>
train(epoch)
File "train.py", line 45, in train
loss.backward()
File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered
Exception raised from batch_norm_backward_cuda_template at /opt/conda/conda-bld/pytorch_1595629427286/work/aten/src/ATen/native/cuda/Normalization.cuh:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f0ecd70477d in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: std::tuple<at::Tensor, at::Tensor, at::Tensor> at::native::batch_norm_backward_cuda_template<float, float, int>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double, std::array<bool, 3ul>) + 0x9fd (0x7f0ecfc85a0d in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #2: at::native::batch_norm_backward_cuda(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double, std::array<bool, 3ul>) + 0x30e (0x7f0ecfc5e06e in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xd0570c (0x7f0ece86f70c in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd2ff53 (0x7f0ece899f53 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native_batch_norm_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double, std::array<bool, 3ul>) + 0x233 (0x7f0f04dbc133 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x2bad8d4 (0x7f0f06a0c8d4 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xe6e373 (0x7f0f04ccd373 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #8: at::native_batch_norm_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double, std::array<bool, 3ul>) + 0x233 (0x7f0f04dbc133 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::autograd::generated::NativeBatchNormBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x398 (0x7f0f0691b128 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x30d1017 (0x7f0f06f30017 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x7f0f06f2b860 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7f0f06f2c401 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f0f06f24579 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x7f0f0b25313a in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0xc819d (0x7f0f0dd7819d in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #16: <unknown function> + 0x76db (0x7f0f322386db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #17: clone + 0x3f (0x7f0f31f61a3f in /lib/x86_64-linux-gnu/libc.so.6)
The most common cause for this error is invalid user data (e.g. your target is larger than the number of classes). Please run with CUDA_LAUNCH_BLOCKING as Vitaly suggests, and post the error that you are getting.
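As a quick illustration of that failure mode, a hypothetical sanity check before computing the loss might look like this (num_classes and labels are placeholder names for whatever the model and data loader actually produce):

import torch

# Hypothetical check: cross-entropy targets must lie in [0, num_classes).
# An out-of-range label can trigger an illegal memory access on the GPU.
num_classes = 100
labels = torch.randint(0, num_classes, (128,))  # stand-in for a real label batch
assert 0 <= labels.min().item() and labels.max().item() < num_classes, (
    f"label out of range: min={labels.min().item()}, max={labels.max().item()}")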
Thank you. The same code runs smoothly on Google Colab but gives me a runtime error locally, so I do not think it is the code that causes this error. I also have other projects that raise the same error after my server's CUDA version (and NVIDIA driver version) was upgraded; they too ran smoothly on my server before the upgrade.
Just in case you need it, here is the nvidia-smi output on Google Colab:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 48C P8 9W / 70W | 0MiB / 15079MiB | 0% Default |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I've solved this issue. For anyone who needs the solution, I'll post it here; I hope it helps someone.
I've tried reinstalling Ubuntu, tried every CUDA version the 2080 Ti supports, and every installation method (apt-get, deb, run file, ...); none of them worked, always the same error. Then I started to wonder whether my 2080 Ti was broken. I explicitly specified the GPU id using
net.cuda(device=0)
and, magically, everything worked fine; the bug just disappeared.
So the solution is:
Run your model while explicitly specifying each GPU id using net = net.cuda(device=gpu_id) to "activate" each GPU; then you can use PyTorch as usual again (multi-card training or single-card training, though this error can sometimes still occur). For each GPU id, I suggest training 5-10 epochs on a small network like ResNet-18 to finish the "activation" phase.
@ngimel @ahtik @VitalyFedyunin I think this could be a potential PyTorch bug; please take a look at this. Thank you.
I have the same issue, but it was not solved even after I did everything you recommended.
I hope to solve this issue. What are your PyTorch, CUDA, and cuDNN versions?
Thanks
Try using the official PyTorch Docker image. I tested on the official image and found that my bug could be caused by the hardware; I think one of my GPUs is broken, because I only get this error on GPU 0 while the other three GPUs work fine.
So you mean that GPU 0 was broken?
Then what did you do, just remove that GPU?
I only have one GPU; I don't have a replacement.
Thanks
Then what did you do, just remove that GPU?
I'm planning to get GPU 0 repaired. Are you sure your GPU is also broken? Have you tried training the model in a Docker container with both TF and PyTorch?
I use the CUDA_VISIBLE_DEVICES variable to hide GPU 0 and use the other three GPUs for training.
would you please let us know once you can confirm if the problem was with your GPU or not. Thank you for the feedback so far.
Ok, will do.
@weiaicunzai, would you please let us know once you can confirm if the problem was with your GPU or not. Thank you for the feedback so far.
After replacing the GPU, everything works fine now, so I think we can confirm it was a hardware issue.
Thank you for letting us know!
Initially, nvidia-smi looked like this while training the first epoch:
Sun Jan 31 23:28:24 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 457.51 Driver Version: 457.51 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 306... WDDM | 00000000:09:00.0 On | N/A |
| 68% 60C P2 177W / 240W | 6803MiB / 8192MiB | 94% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
After 4-5 epochs the speed dropped, and it looked like this:
Sun Jan 31 23:53:15 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 457.51 Driver Version: 457.51 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 306... WDDM | 00000000:09:00.0 On | N/A |
| 69% 58C P2 171W / 240W | 4803MiB / 8192MiB | 65% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Then, after a few more epochs, it gave me: RuntimeError: CUDA error: an illegal memory access was encountered!
Could this be related to multiprocessing/num_workers?
What is the solution?
Here is the code: https://pastebin.com/W4BpmWcP
@heitorschueroff Sorry, I forgot to collect the full error trace, but I remember clearly where it came from; it came from this block of code:
if running_loss is None:
    running_loss = loss.item()
else:
    running_loss = running_loss * .99 + loss.item() * .01
The error came while executing this line: running_loss = running_loss * .99 + loss.item() * .01
I am using CUDA_LAUNCH_BLOCKING=1, but it's so slow that I waited for an hour and still got nothing. Can anyone tell me how to fix it?
I think it is probably a problem with my model. I'm building a two-stage computer-vision model, and the error is raised when I feed the proposals into the second stage.
@heitorschueroff I tried two versions, 1.6 and 1.8.1. Both of them cause the same error.
Here's my error information. I think the problem is in my model when I feed the second-stage input to it, though a single model works fine.
File "/home/f523/guazai/sdb/rsy/cornerPoject/myCornerNet6/exp/train.py", line 212, in run_epoch
cls, rgr = self.model([proposal, fm], stage='two')
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
return self.gather(outputs, self.output_device)
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/cuda/comm.py", line 166, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA error: an illegal memory access was encountered
My model script is shown below. I want my two-stage model to support multiple batches, e.g. the batch size is 4 and every image outputs 128 proposals, so the proposal size here is (4*128, 5).
from torchvision.ops import roi_align

def _stage2(self, xs):
    proposal, fm = xs
    if proposal.dim() == 2 and proposal.size(1) == 5:
        # train mode: boxes are Tensor[K, 5] with a batch-index column
        roi = roi_align(fm, proposal, output_size=[15, 15])
    elif proposal.dim() == 3 and proposal.size(2) == 4:
        # eval mode: boxes are passed as a list of Tensor[N, 4]
        roi = roi_align(fm, [proposal[0]], output_size=[15, 15])
    else:
        raise AssertionError("The boxes tensor shape should be Tensor[K, 5] in train or Tensor[N, 4] in eval")
    x = self.big_kernel(roi)
    cls = self.cls_fm(x)
    rgr = self.rgr_fm(x)
    return cls, rgr
I know where I went wrong. Here's how I feed the second-stage input:
cls, offset = self.model([proposal, fm], stage='two')
proposal is the ROI, whose shape is [N, 5]; the first column is the batch index. E.g., if the batch size is 4, the index range is [0, 1, 2, 3]. fm is the feature map.
When I use multiple GPUs, e.g. two, the proposal and fm are split into two branches and fed to the two GPUs. However, the batch index range is still [0, 1, 2, 3], which causes an index error and raises the GPU error.
What I did is add a line before roi_align, as below:
from torchvision.ops import roi_align

proposal[:, 0] = proposal[:, 0] % fm.size(0)  # this makes multi-GPU work
roi = roi_align(fm, proposal, output_size=[15, 15])
I'm facing the same problem. Training runs without any error, but when I try to predict, x.to(device) throws RuntimeError: CUDA error: an illegal memory access was encountered. And when I try to save the model, it gives more details about the error: THCudaCheck FAIL file=/pytorch/torch/csrc/generic/serialization.cpp line=31 error=700 : an illegal memory access was encountered. It seems this error has been appearing for more than two years across multiple PyTorch versions. Do we know anything about it?
I want to share some experience: I met this error when using CTC loss.
The input of CTCLoss should be a batch of sequences with varying lengths. However, to feed the data, I padded all sequences to a fixed length.
This error happened when I forgot to clip the padded part of each sequence; that is to say, the input data to the loss function was longer than expected.
A toy example: I want to use [1 2 3] as the label of the first sequence, so what I want to do is:
pad [1 2 3] to [1 2 3 0 0]
clip [1 2 3 0 0] back to [1 2 3]
feed [1 2 3] into the model
But I forgot the second step, so this error occurred.
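A minimal sketch of the correct handling (hypothetical shapes and label values; the key point is that target_lengths must describe the unpadded lengths so CTCLoss never reads the padding):

import torch
import torch.nn as nn

# Hypothetical setup: T time steps, N sequences, C classes (index 0 = blank).
T, N, C = 50, 4, 20
log_probs = torch.randn(T, N, C).log_softmax(2)
labels = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]  # true, unpadded labels

# Pad the labels to a fixed length for batching...
max_len = max(len(l) for l in labels)
targets = torch.zeros(N, max_len, dtype=torch.long)
for i, l in enumerate(labels):
    targets[i, :len(l)] = torch.tensor(l)

# ...but pass the real (unpadded) length of every sequence, so the padded
# zeros are never used. Forgetting this is the mistake described above.
target_lengths = torch.tensor([len(l) for l in labels], dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)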
It's definitely an NVIDIA or PyTorch bug.
If you have overclocked your GPU beyond a certain point using MSI Afterburner, this happens. I also got the error; after trying everything else, I brought the clock and memory frequencies back down to stock levels and the error did not occur again.
This is why running the same code in the cloud does not cause the illegal memory access error, while running it on the overclocked local GPU does.
Thanks for sharing the CTC loss experience; I made the same mistake.
I had a similar CUDA illegal memory access issue when running the example cartpole training. I updated the NVIDIA driver from 450 to 495, rebooted, and it worked.
Which CUDA version and operating system were you using?
RuntimeError: CUDA error: an illegal memory access was encountered on RTX 3080 with enough memory (#79603)
This is a GPU RAM size issue. At allocation time, or at crash time, the GPU profiler shows only about 12% of VRAM being utilized. But if the allocation parameters are decreased so the program uses less VRAM and runs successfully, it becomes pretty obvious that the allocating code is actually using almost 90% of the VRAM.
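A hypothetical way to cross-check this from inside the process is to ask PyTorch's allocator directly instead of trusting an external profiler snapshot:

import torch

dev = torch.device("cuda:0")
total = torch.cuda.get_device_properties(dev).total_memory
# Fraction of device memory PyTorch has actually allocated vs. reserved (cached).
print(f"allocated: {torch.cuda.memory_allocated(dev) / total:.1%}")
print(f"reserved : {torch.cuda.memory_reserved(dev) / total:.1%}")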