Hi everyone, I cannot figure out what went wrong and need some help. Thanks in advance.
conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.2 -c pytorch
My server has CUDA 11.1 installed; here is the output of the nvidia-smi command on my server:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:18:00.0 Off | N/A |
| 40% 45C P8 17W / 260W | 0MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:3B:00.0 Off | N/A |
| 41% 37C P8 20W / 260W | 0MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... Off | 00000000:86:00.0 Off | N/A |
| 41% 37C P8 20W / 260W | 0MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... Off | 00000000:AF:00.0 Off | N/A |
| 41% 38C P8 11W / 260W | 0MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
When I try to train my network, it always throws this runtime error:
Files already downloaded and verified
Files already downloaded and verified
/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Training Epoch: 1 [128/50000] Loss: 4.6580 LR: 0.000256
Training Epoch: 1 [256/50000] Loss: 4.6234 LR: 0.000512
Training Epoch: 1 [384/50000] Loss: 4.6516 LR: 0.000767
Training Epoch: 1 [512/50000] Loss: 4.6480 LR: 0.001023
Training Epoch: 1 [640/50000] Loss: 4.6622 LR: 0.001279
Training Epoch: 1 [768/50000] Loss: 4.6319 LR: 0.001535
Training Epoch: 1 [896/50000] Loss: 4.5743 LR: 0.001790
Training Epoch: 1 [1024/50000] Loss: 4.6521 LR: 0.002046
Training Epoch: 1 [1152/50000] Loss: 4.6352 LR: 0.002302
Training Epoch: 1 [1280/50000] Loss: 4.5955 LR: 0.002558
Training Epoch: 1 [1408/50000] Loss: 4.6159 LR: 0.002813
Training Epoch: 1 [1536/50000] Loss: 4.6440 LR: 0.003069
Training Epoch: 1 [1664/50000] Loss: 4.6346 LR: 0.003325
Training Epoch: 1 [1792/50000] Loss: 4.6477 LR: 0.003581
Training Epoch: 1 [1920/50000] Loss: 4.6555 LR: 0.003836
Traceback (most recent call last):
File "train.py", line 209, in <module>
train(epoch)
File "train.py", line 52, in train
writer.add_scalar('LastLayerGradients/grad_norm2_weights', para.grad.norm(), n_iter)
File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/utils/tensorboard/writer.py", line 346, in add_scalar
scalar(tag, scalar_value), global_step, walltime)
File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/utils/tensorboard/summary.py", line 247, in scalar
scalar = make_np(scalar)
File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/utils/tensorboard/_convert_np.py", line 24, in make_np
return _prepare_pytorch(x)
File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/utils/tensorboard/_convert_np.py", line 32, in _prepare_pytorch
x = x.cpu().numpy()
RuntimeError: CUDA error: an illegal memory access was encountered
I've tried installing PyTorch 1.6 with CUDA 9.2, 10.1, and 10.2, and PyTorch 1.7 with CUDA 9.2, 10.1, 10.2, and 11.0 via conda; all of them give the same error. My code works fine on Google Colab, though.
To Reproduce
Steps to reproduce the behavior:
PyTorch 1.6 is required to run my code:
git clone [email protected]:weiaicunzai/pytorch-cifar100.git
cd pytorch-cifar100
python train.py -net vgg16 -gpu
Expected behavior
If you have the same environment as me (PyTorch 1.6 installed through conda on a machine with CUDA 11.1 installed), you will run into the same error output shown above.
Environment
Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
Collecting environment information...
PyTorch version: 1.6.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti
Nvidia driver version: 455.23.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.6.0
[pip3] torchaudio==0.6.0a0+f17ae39
[pip3] torchvision==0.7.0
[conda] _pytorch_select 0.1 cpu_0 defaults
[conda] blas 1.0 mkl defaults
[conda] cudatoolkit 10.2.89 hfd86e86_1 defaults
[conda] mkl 2020.2 256 defaults
[conda] mkl-service 2.3.0 py36he904b0f_0 defaults
[conda] mkl_fft 1.2.0 py36h23d657b_0 defaults
[conda] mkl_random 1.1.1 py36h0573a6f_0 defaults
[conda] numpy 1.19.2 py36h54aff64_0 defaults
[conda] numpy-base 1.19.2 py36hfa32c7d_0 defaults
[conda] pytorch 1.6.0 py3.6_cuda10.2.89_cudnn7.6.5_0 pytorch
[conda] torchaudio 0.6.0 py36 pytorch
[conda] torchvision 0.7.0 py36_cu102 pytorch
Additional context
I've noticed a picture on the NVIDIA website. Does this mean that PyTorch 1.6 compiled with CUDA 10.2 can work smoothly on a machine with NVIDIA driver 455 installed? Since conda can install cudatoolkit 10.2 for me, why am I getting this CUDA runtime error? Thanks!
I've tried downgrading the NVIDIA driver from 455 to 450; still the same problem:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:18:00.0 Off | N/A |
| 41% 51C P8 18W / 260W | 3MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:3B:00.0 Off | N/A |
| 41% 40C P8 20W / 260W | 3MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... Off | 00000000:86:00.0 Off | N/A |
| 41% 41C P8 20W / 260W | 3MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... Off | 00000000:AF:00.0 Off | N/A |
| 41% 42C P8 12W / 260W | 3MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
@weiaicunzai You can have the latest NVIDIA drivers installed and still use any of the CUDA 10.x or 11.x releases.
Just make sure to install and update CUDA together with cuDNN separately from the driver installation.
For CUDA 11 you need to use PyTorch 1.7, released yesterday.
For PyTorch 1.6, both CUDA 10.1 and 10.2 should be fine.
PS: The nvidia-smi CUDA Version field can be misleading; it is not worth relying on to see what PyTorch is actually using.
This is what I use to switch between different CUDA versions, but I do not use conda, just a regular venv.
SET CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1
SET PATH=%CUDA_PATH%\bin;%PATH%
SET PATH=%CUDA_PATH%\extras\CUPTI\libx64;%PATH%
SET PATH=C:\cudnn-7.6.5\10.1\bin;%PATH%;
Thanks for your reply. The NVIDIA driver has been downgraded to 450.80.02, but the error still exists, and I do not know what happened.
I tried your suggestion of setting the PATH variable, but it seems PyTorch still uses conda's cudatoolkit.
Any other suggestions? Thank you. I still cannot figure out what could possibly have gone wrong.
@weiaicunzai I think you can still have the latest drivers; the problem is likely with CUDA or cuDNN.
What's curious about your environment is the cuDNN 8.0.4 that was picked up in the environment collection, while the conda package later references py3.6_cuda10.2.89_cudnn7.6.5_0. cuDNN 8.0.4 is compatible only with CUDA 11.x, so one would need 7.6.5.
It could be worth trying to install PyTorch into a regular Python venv and see how that goes.
PS: Sorry about the PATH section I posted; it was misleading since you're not using Windows. Having multiple CUDA versions on Linux is much less trivial.
Since I do not have root privileges, I created a Python virtual environment using conda and installed PyTorch 1.7 with CUDA 11.0 support, but with no luck; still the same error.
Here are the CUDA and cuDNN versions detected by PyTorch:
>>> import torch
>>> torch.version.cuda
'11.0'
>>> torch.backends.cudnn.version()
Is my problem related to #21819?
Thanks for your replies. @ngimel @VitalyFedyunin
I've tested PyTorch 1.6 with CUDA 9.2, 10.1, and 10.2, and PyTorch 1.7 with CUDA 9.2, 10.1, and 10.2; both give me the same error message.
For example, PyTorch 1.6 + CUDA 10.2 + Python 3.6:
conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.2 -c pytorch
CUDA_LAUNCH_BLOCKING=1 python train.py -net resnet18 -gpu
Error message is:
Files already downloaded and verified
Files already downloaded and verified
/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Training Epoch: 1 [128/50000] Loss: 4.6661 LR: 0.000256
Training Epoch: 1 [256/50000] Loss: 4.6763 LR: 0.000512
Training Epoch: 1 [384/50000] Loss: 4.7014 LR: 0.000767
Training Epoch: 1 [512/50000] Loss: 4.7478 LR: 0.001023
Training Epoch: 1 [640/50000] Loss: 4.6901 LR: 0.001279
Traceback (most recent call last):
File "train.py", line 209, in <module>
train(epoch)
File "train.py", line 44, in train
loss.backward()
File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Exception raised from cudnn_batch_norm_backward at /opt/conda/conda-bld/pytorch_1595629427286/work/aten/src/ATen/native/cudnn/BatchNorm.cpp:324 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f1dde67c77d in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: at::native::cudnn_batch_norm_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, at::Tensor const&) + 0x1db3 (0x7f1ddf772433 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd0138a (0x7f1ddf7e338a in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xd2f8bb (0x7f1ddf8118bb in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::cudnn_batch_norm_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, at::Tensor const&) + 0x1ef (0x7f1e15ce910f in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x2b59cff (0x7f1e17930cff in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x2b6b21b (0x7f1e1794221b in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::cudnn_batch_norm_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, at::Tensor const&) + 0x1ef (0x7f1e15ce910f in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #8: torch::autograd::generated::CudnnBatchNormBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x42c (0x7f1e17893fec in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x30d1017 (0x7f1e17ea8017 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x7f1e17ea3860 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7f1e17ea4401 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f1e17e9c579 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x7f1e1c1cb13a in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #14: <unknown function> + 0xc819d (0x7f1e1ecf019d in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #15: <unknown function> + 0x76db (0x7f1e431b06db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #16: clone + 0x3f (0x7f1e42ed9a3f in /lib/x86_64-linux-gnu/libc.so.6)
But the strange thing is: when I tested PyTorch 1.6 with CUDA 10.1 (if I remember correctly) a couple of hours ago with CUDA_LAUNCH_BLOCKING=1 python train.py -net resnet18 -gpu, it gave me this error: RuntimeError: Unable to find a valid cuDNN algorithm to run convolution. But now I cannot reproduce it anymore; I do not know why.
Update:
I disabled cuDNN using torch.backends.cudnn.enabled = False, then ran CUDA_LAUNCH_BLOCKING=1 python train.py -net resnet18 -gpu, which gives me the same RuntimeError: CUDA error: an illegal memory access was encountered error:
Files already downloaded and verified
Files already downloaded and verified
/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Training Epoch: 1 [128/50000] Loss: 4.7320 LR: 0.000256
Training Epoch: 1 [256/50000] Loss: 4.6542 LR: 0.000512
Training Epoch: 1 [384/50000] Loss: 4.6969 LR: 0.000767
Training Epoch: 1 [512/50000] Loss: 4.7084 LR: 0.001023
Training Epoch: 1 [640/50000] Loss: 4.7125 LR: 0.001279
Training Epoch: 1 [768/50000] Loss: 4.7134 LR: 0.001535
Training Epoch: 1 [896/50000] Loss: 4.7244 LR: 0.001790
Training Epoch: 1 [1024/50000] Loss: 4.7463 LR: 0.002046
Training Epoch: 1 [1152/50000] Loss: 4.6443 LR: 0.002302
Training Epoch: 1 [1280/50000] Loss: 4.6344 LR: 0.002558
Training Epoch: 1 [1408/50000] Loss: 4.6758 LR: 0.002813
Training Epoch: 1 [1536/50000] Loss: 4.6331 LR: 0.003069
Training Epoch: 1 [1664/50000] Loss: 4.6103 LR: 0.003325
Training Epoch: 1 [1792/50000] Loss: 4.6204 LR: 0.003581
Training Epoch: 1 [1920/50000] Loss: 4.5800 LR: 0.003836
Traceback (most recent call last):
File "train.py", line 210, in <module>
train(epoch)
File "train.py", line 45, in train
loss.backward()
File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered
Exception raised from batch_norm_backward_cuda_template at /opt/conda/conda-bld/pytorch_1595629427286/work/aten/src/ATen/native/cuda/Normalization.cuh:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f0ecd70477d in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: std::tuple<at::Tensor, at::Tensor, at::Tensor> at::native::batch_norm_backward_cuda_template<float, float, int>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double, std::array<bool, 3ul>) + 0x9fd (0x7f0ecfc85a0d in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #2: at::native::batch_norm_backward_cuda(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double, std::array<bool, 3ul>) + 0x30e (0x7f0ecfc5e06e in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xd0570c (0x7f0ece86f70c in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd2ff53 (0x7f0ece899f53 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native_batch_norm_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double, std::array<bool, 3ul>) + 0x233 (0x7f0f04dbc133 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x2bad8d4 (0x7f0f06a0c8d4 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xe6e373 (0x7f0f04ccd373 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #8: at::native_batch_norm_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double, std::array<bool, 3ul>) + 0x233 (0x7f0f04dbc133 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::autograd::generated::NativeBatchNormBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x398 (0x7f0f0691b128 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x30d1017 (0x7f0f06f30017 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x7f0f06f2b860 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7f0f06f2c401 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f0f06f24579 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x7f0f0b25313a in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0xc819d (0x7f0f0dd7819d in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #16: <unknown function> + 0x76db (0x7f0f322386db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #17: clone + 0x3f (0x7f0f31f61a3f in /lib/x86_64-linux-gnu/libc.so.6)
The most common cause for this error is invalid user data (e.g. your target is larger than the number of classes). Please run with CUDA_LAUNCH_BLOCKING as Vitaly suggests, and post the error that you are getting.
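As a quick illustration of that failure mode, a hypothetical sanity check before computing the loss might look like this (num_classes and labels are placeholder names for whatever the model and data loader actually produce):

import torch

# Hypothetical check: cross-entropy targets must lie in [0, num_classes).
# An out-of-range label can trigger an illegal memory access on the GPU.
num_classes = 100
labels = torch.randint(0, num_classes, (128,))  # stand-in for a real label batch
assert 0 <= labels.min().item() and labels.max().item() < num_classes, (
    f"label out of range: min={labels.min().item()}, max={labels.max().item()}")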
Thank you. The same code runs smoothly on Google Colab but gives me a runtime error locally, so I do not think it is the code that causes this error. I also have other projects that raise the same error after my server's CUDA version (and NVIDIA driver version) was upgraded; they too ran smoothly on my server before the upgrade.
Just in case you need it, here is the nvidia-smi output on Google Colab:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 48C P8 9W / 70W | 0MiB / 15079MiB | 0% Default |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I've solved this issue. For anyone who needs the solution, I'll post it here; I hope it helps someone.
I've tried reinstalling Ubuntu, tried every CUDA version the 2080 Ti supports, and every installation method (apt-get, deb, run file, ...); none of them worked, always the same error. Then I started to wonder whether my 2080 Ti was broken. I explicitly specified the GPU id using
net.cuda(device=0)
and, magically, everything worked fine; the bug just disappeared.
So the solution is:
Run your model while explicitly specifying each GPU id using net = net.cuda(device=gpu_id) to "activate" each GPU; then you can use PyTorch as usual again (multi-card training or single-card training, though this error can sometimes still occur). For each GPU id, I suggest training 5-10 epochs on a small network like ResNet-18 to finish the "activation" phase.
@ngimel @ahtik @VitalyFedyunin I think this could be a potential PyTorch bug; please take a look at this. Thank you.
I have the same issue, but it was not solved even after I did everything you recommended.
I hope to solve this issue. What are your PyTorch, CUDA, and cuDNN versions?
Thanks
Try using the official PyTorch Docker image. I tested on the official image and found that my bug could be caused by the hardware; I think one of my GPUs is broken, because I only get this error on GPU 0 while the other three GPUs work fine.
So you mean that GPU 0 was broken?
Then what did you do, just remove that GPU?
I only have one GPU; I don't have a replacement.
Thanks
Then what did you do, just remove that GPU?
I'm planning to get GPU 0 repaired. Are you sure your GPU is also broken? Have you tried training the model in a Docker container with both TF and PyTorch?
I use the CUDA_VISIBLE_DEVICES variable to hide GPU 0 and use the other three GPUs for training.
would you please let us know once you can confirm if the problem was with your GPU or not. Thank you for the feedback so far.
Ok, will do.
@weiaicunzai, would you please let us know once you can confirm if the problem was with your GPU or not. Thank you for the feedback so far.
After replacing the GPU, everything works fine now, so I think we can confirm it was a hardware issue.
Thank you for letting us know!
Initially, nvidia-smi looked like this while training the first epoch:
Sun Jan 31 23:28:24 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 457.51 Driver Version: 457.51 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 306... WDDM | 00000000:09:00.0 On | N/A |
| 68% 60C P2 177W / 240W | 6803MiB / 8192MiB | 94% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
After 4-5 epochs the speed dropped, and it looked like this:
Sun Jan 31 23:53:15 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 457.51 Driver Version: 457.51 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 306... WDDM | 00000000:09:00.0 On | N/A |
| 69% 58C P2 171W / 240W | 4803MiB / 8192MiB | 65% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Then, after a few more epochs, it gave me: RuntimeError: CUDA error: an illegal memory access was encountered!
Could this be related to multiprocessing/num_workers?
What is the solution?
Here is the code: https://pastebin.com/W4BpmWcP
@heitorschueroff Sorry, I forgot to collect the full error trace, but I remember clearly where it came from; it came from this block of code:
if running_loss is None:
    running_loss = loss.item()
else:
    running_loss = running_loss * .99 + loss.item() * .01
The error came while executing this line: running_loss = running_loss * .99 + loss.item() * .01
I am using CUDA_LAUNCH_BLOCKING=1, but it's so slow that I waited for an hour and still got nothing. Can anyone tell me how to fix it?
I think it is probably a problem with my model. I'm building a two-stage computer-vision model, and the error is raised when I feed the proposals into the second stage.
@heitorschueroff I tried two versions, 1.6 and 1.8.1. Both of them cause the same error.
Here's my error information. I think the problem is in my model when I feed the second-stage input to it, though a single model works fine.
File "/home/f523/guazai/sdb/rsy/cornerPoject/myCornerNet6/exp/train.py", line 212, in run_epoch
cls, rgr = self.model([proposal, fm], stage='two')
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
return self.gather(outputs, self.output_device)
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/cuda/comm.py", line 166, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA error: an illegal memory access was encountered
My model script is shown below. I want my two-stage model to support multiple batches, e.g. the batch size is 4 and every image outputs 128 proposals, so the proposal size here is (4*128, 5).
from torchvision.ops import roi_align

def _stage2(self, xs):
    proposal, fm = xs
    if proposal.dim() == 2 and proposal.size(1) == 5:
        # train mode: boxes are Tensor[K, 5] with a batch-index column
        roi = roi_align(fm, proposal, output_size=[15, 15])
    elif proposal.dim() == 3 and proposal.size(2) == 4:
        # eval mode: boxes are passed as a list of Tensor[N, 4]
        roi = roi_align(fm, [proposal[0]], output_size=[15, 15])
    else:
        raise AssertionError("The boxes tensor shape should be Tensor[K, 5] in train or Tensor[N, 4] in eval")
    x = self.big_kernel(roi)
    cls = self.cls_fm(x)
    rgr = self.rgr_fm(x)
    return cls, rgr
I know where I went wrong. Here's how I feed the second-stage input:
cls, offset = self.model([proposal, fm], stage='two')
proposal is the ROI, whose shape is [N, 5]; the first column is the batch index. E.g., if the batch size is 4, the index range is [0, 1, 2, 3]. fm is the feature map.
When I use multiple GPUs, e.g. two, the proposal and fm are split into two branches and fed to the two GPUs. However, the batch index range is still [0, 1, 2, 3], which causes an index error and raises the GPU error.
What I did is add a line before roi_align, as below:
from torchvision.ops import roi_align

proposal[:, 0] = proposal[:, 0] % fm.size(0)  # this makes multi-GPU work
roi = roi_align(fm, proposal, output_size=[15, 15])
I'm facing the same problem. Training runs without any error, but when I try to predict, x.to(device) throws RuntimeError: CUDA error: an illegal memory access was encountered. And when I try to save the model, it gives more details about the error: THCudaCheck FAIL file=/pytorch/torch/csrc/generic/serialization.cpp line=31 error=700 : an illegal memory access was encountered. It seems this error has been appearing for more than two years across multiple PyTorch versions. Do we know anything about it?
I want to share some experience: I met this error when using CTC loss.
The input of CTCLoss should be a batch of sequences with varying lengths. However, to feed the data, I padded all sequences to a fixed length.
This error happened when I forgot to clip the padded part of each sequence; that is to say, the input data to the loss function was longer than expected.
A toy example: I want to use [1 2 3] as the label of the first sequence, so what I want to do is:
pad [1 2 3] to [1 2 3 0 0]
clip [1 2 3 0 0] back to [1 2 3]
feed [1 2 3] into the model
But I forgot the second step, so this error occurred.
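A minimal sketch of the correct handling (hypothetical shapes and label values; the key point is that target_lengths must describe the unpadded lengths so CTCLoss never reads the padding):

import torch
import torch.nn as nn

# Hypothetical setup: T time steps, N sequences, C classes (index 0 = blank).
T, N, C = 50, 4, 20
log_probs = torch.randn(T, N, C).log_softmax(2)
labels = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]  # true, unpadded labels

# Pad the labels to a fixed length for batching...
max_len = max(len(l) for l in labels)
targets = torch.zeros(N, max_len, dtype=torch.long)
for i, l in enumerate(labels):
    targets[i, :len(l)] = torch.tensor(l)

# ...but pass the real (unpadded) length of every sequence, so the padded
# zeros are never used. Forgetting this is the mistake described above.
target_lengths = torch.tensor([len(l) for l in labels], dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)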
It's definitely an NVIDIA or PyTorch bug.
If you have overclocked your GPU beyond a certain point using MSI Afterburner, this happens. I also got the error; after trying everything else, I brought the clock and memory frequencies back down to stock levels and the error did not occur again.
This is why running the same code in the cloud does not cause the illegal memory access error, while running it on the overclocked local GPU does.
Thanks for sharing the CTC loss experience; I made the same mistake.
I had a similar CUDA illegal memory access issue when running the example cartpole training. I updated the NVIDIA driver from 450 to 495, rebooted, and it worked.
Which CUDA version and operating system were you using?
RuntimeError: CUDA error: an illegal memory access was encountered on RTX 3080 with enough memory (#79603)
This is a GPU RAM size issue. At allocation time, or at crash time, the GPU profiler shows only about 12% of VRAM being utilized. But if the allocation parameters are decreased so the program uses less VRAM and runs successfully, it becomes pretty obvious that the allocating code is actually using almost 90% of the VRAM.
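A hypothetical way to cross-check this from inside the process is to ask PyTorch's allocator directly instead of trusting an external profiler snapshot:

import torch

dev = torch.device("cuda:0")
total = torch.cuda.get_device_properties(dev).total_memory
# Fraction of device memory PyTorch has actually allocated vs. reserved (cached).
print(f"allocated: {torch.cuda.memory_allocated(dev) / total:.1%}")
print(f"reserved : {torch.cuda.memory_reserved(dev) / total:.1%}")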