Hi all, I encountered a weird CUDA illegal memory access error. I will try to put together a minimal example shortly.
During training, my code runs for several batches without any errors, then after a random amount of time there is an illegal memory access error. The error happens on this line:
conf_p = conf[pos]
and the error message is:
File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 74, in __getitem__
return MaskedSelect.apply(self, key)
File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py", line 534, in forward
return tensor.masked_select(mask)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /export/home/x/code/pytorch/torch/lib/THC/generated/../THCReduceAll.cuh:339
Interestingly, even if I replace that line with:
print(pos)
there is still an error:
File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 119, in __repr__
return 'Variable containing:' + self.data.__repr__()
File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 133, in __repr__
return str(self)
File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 140, in __str__
return _tensor_str._str(self)
File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 294, in _str
strt = _tensor_str(self)
File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 142, in _tensor_str
formatter = _number_format(self, min_sz=3 if not print_full_mat else 0)
File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 74, in _number_format
tensor = torch.DoubleTensor(tensor.size()).copy_(tensor).abs_().view(tensor.nelement())
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /export/home/x/code/pytorch/torch/lib/THC/generic/THCTensorCopy.c:70
However, I can run:
print(pos.size(), pos.is_contiguous())
I ran it in CPU mode; below is the gdb backtrace:
Thread 21 "python" received signal SIGBUS, Bus error. [54/1858]
[Switching to Thread 0x7fff4ce01700 (LWP 32576)]
malloc_consolidate (av=av@entry=0x7ffe58000020) at malloc.c:4181
4181 malloc.c: No such file or directory.
(gdb) bt
#0 malloc_consolidate (av=av@entry=0x7ffe58000020) at malloc.c:4181
#1 0x00007ffff6a51678 in _int_free (av=0x7ffe58000020, p=<optimized out>, have_lock=0) at malloc.c:4075
#2 0x00007ffff6a5553c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
#3 0x00007fffe0b853fe in THLongStorage_free ()
from /export/home/x/anaconda3/lib/python3.6/site-packages/torch/lib/libTH.so.1
#4 0x00007fffe0baf4e7 in THLongTensor_free ()
from /export/home/x/anaconda3/lib/python3.6/site-packages/torch/lib/libTH.so.1
#5 0x00007fffe01b8839 in at::CPULongTensor::~CPULongTensor() [clone .localalias.31] ()
from /export/home/x/anaconda3/lib/python3.6/site-packages/torch/lib/libATen.so.1
#6 0x00007fffecf88269 in torch::autograd::VariableImpl::~VariableImpl (this=0x7ffdd793fc40, __in_chrg=<optimized out>)
at torch/csrc/autograd/variable.cpp:38
#7 0x00007fffecf9ad21 in at::TensorImpl::release (this=<optimized out>)
at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/TensorImpl.h:31
#8 at::detail::TensorBase::~TensorBase (this=<optimized out>, __in_chrg=<optimized out>)
at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/TensorBase.h:27
#9 at::Tensor::~Tensor (this=<optimized out>, __in_chrg=<optimized out>)
at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/Tensor.h:33
#10 at::Tensor::reset (this=0x7fffa888c288) at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/Tensor.h:57
#11 THPVariable_clear (self=self@entry=0x7fffa888c278) at torch/csrc/autograd/python_variable.cpp:131
#12 0x00007fffecf9ae31 in THPVariable_dealloc (self=0x7fffa888c278) at torch/csrc/autograd/python_variable.cpp:138
#13 0x00007ffff79a75f9 in subtype_dealloc (self=0x7fffa888c278) at Objects/typeobject.c:1222
#14 0x00007ffff79a5f3e in tupledealloc (op=0x7fffa8851a08) at Objects/tupleobject.c:243
#15 0x00007fffecf9072f in THPPointer<_object>::~THPPointer (this=0x7fff4ce00850, __in_chrg=<optimized out>)
at /export/home/x/code/pytorch/torch/csrc/utils/object_ptr.h:12
#16 torch::autograd::PyFunction::apply (this=0x7fffa877e6d0, inputs=...) at torch/csrc/autograd/python_function.cpp:123
#17 0x00007fffecf7bbf4 in torch::autograd::Function::operator() (inputs=..., this=<optimized out>)
at /export/home/x/code/pytorch/torch/csrc/autograd/function.h:89
#18 torch::autograd::call_function (task=...) at torch/csrc/autograd/engine.cpp:208
#19 torch::autograd::Engine::evaluate_function (this=this@entry=0x7fffee198ca0 <engine>, task=...)
at torch/csrc/autograd/engine.cpp:220
#20 0x00007fffecf7ddae in torch::autograd::Engine::thread_main (this=0x7fffee198ca0 <engine>, graph_task=0x0)
at torch/csrc/autograd/engine.cpp:144
#21 0x00007fffecf7ab42 in torch::autograd::Engine::thread_init (this=this@entry=0x7fffee198ca0 <engine>, device=device@entry=-1)
at torch/csrc/autograd/engine.cpp:121
#22 0x00007fffecf9da9a in torch::autograd::python::PythonEngine::thread_init (this=0x7fffee198ca0 <engine>, device=-1)
at torch/csrc/autograd/python_engine.cpp:28
#23 0x00007fffcd559c5c in std::execute_native_thread_routine_compat (__p=<optimized out>)
at /home/msarahan/miniconda2/conda-bld/compilers_linux-64_1507259624353/work/.build/src/gcc-7.2.0/libstdc++-v3/src/c++11/thread.cc:110
#24 0x00007ffff76ba6ba in start_thread (arg=0x7fff4ce01700) at pthread_create.c:333
#25 0x00007ffff6ad83dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
It seems that a piece of memory is being wrongly freed.
You should run your code with CUDA_LAUNCH_BLOCKING=1
to see where the error comes from.
Because all CUDA calls are asynchronous when you don’t set this option, the Python code reports the error on the next CUDA call after the one that actually failed. This is why trying to use the tensor or print its content raises an error (both use the GPU), while printing the size or checking whether it is contiguous does not (these are CPU-only operations).
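As an illustration only (the tensors below are made up, not from the thread), explicit synchronization is another way to make the failure show up at the call that caused it:
import torch
# Illustration: `conf` and `pos` are stand-ins for the real tensors in the thread.
conf = torch.randn(8, 21).cuda()
pos = conf > 0.5
torch.cuda.synchronize()   # flush pending kernels so any earlier error surfaces now
conf_p = conf[pos]         # the masked select that failed in the thread
torch.cuda.synchronize()   # if the masked select hit a bad address, the error is raised here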
Yeah, I will try that.
I just found that even when I set CUDA_LAUNCH_BLOCKING=1, there is still an error when I try to print the tensor. I was running:
CUDA_LAUNCH_BLOCKING=1
python train.py
Is this the right way to set this environment variable?
If you run it as two commands, you should use export CUDA_LAUNCH_BLOCKING=1,
but that will set it for the whole terminal session.
If you use CUDA_LAUNCH_BLOCKING=1 python train.py
(in one command), that sets the environment variable just for that command.
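Another option, just as a sketch: set the variable from inside the script itself, at the very top of train.py, before torch initializes CUDA:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # set before torch touches CUDA
import torch                               # imported after the variable is set
x = torch.randn(4, 4).cuda()               # kernel launches from here on should run synchronously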
I put them on the same line now; here is the error message:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): function_attributes(): after cudaFuncGetAttributes: an illegal memory access was encountered
./train.sh: line 14: 4111 Aborted (core dumped) CUDA_LAUNCH_BLOCKING=1 python train.py
I finally solved this problem.
Although the error message is not very helpful, I guessed the illegal memory access came from an out-of-range index. So I double-checked all my code and finally found that, in certain batches, the ground-truth target could be larger than the number of classes in the softmax. I fixed it and there are no more errors.
@albanD thanks for your time anyway
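For anyone hitting the same thing, a sanity check along these lines (just a sketch; num_classes and the loader names are placeholders) can catch out-of-range labels before they reach the loss:
num_classes = 21                                    # placeholder: your model's number of classes
for i, (input, target) in enumerate(data_loader):   # placeholder loader name
    bad = (target < 0) | (target >= num_classes)    # labels must lie in [0, num_classes)
    if bad.any():
        raise ValueError("batch %d contains out-of-range targets: %s"
                         % (i, target[bad].tolist()))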
Just to share my case: I had a similar error code.
I commented out the line cudnn.benchmark=True
and everything works fine now.
The training code works fine with that line commented out, but when I run my validation code, it crashes with the same illegal access error 77.
Anyway, I will share more if I find something else.
I’m getting the same illegal memory access error, which was caused by moving the tensors to the GPU: input[key] = input[key].cuda().
I tried setting cudnn.benchmark = False
and rm -rf ~/.nv
following some web search, but without success. Any suggestions? Thanks a lot!
EDIT: I realized that cudnn.benchmark
was set to True
on a later line (I was running someone else’s git repo), and after resetting it to False
the error went away!
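Since repo-level config can silently flip this flag, here is a minimal sketch (with a placeholder setup function, not from any particular repo) of pinning it right before training:
import torch.backends.cudnn as cudnn
def configure_from_repo():
    # placeholder for the repo's own setup, which may set cudnn flags somewhere
    pass
configure_from_repo()
cudnn.benchmark = False            # set it after all other config so nothing overrides it
assert cudnn.benchmark is False    # sanity check right before training starts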
I also ran into the same error when evaluating the model:
RuntimeError: CUDA error: an illegal memory access was encountered
My code looks like:
correct = 0
total = 0
for i, (input, target) in tqdm.tqdm(enumerate(data_loader), total=len(dataset) // batch_size):
    target = target.to(device)
    input = input.to(device)
    output = self.model.forward_t(input)
    c = output.argmax(dim=1)
    total += len(target)
    correct += sum(target.cpu().numpy() == c.cpu().numpy())
acc = float(correct) / total
It is also strange that, if I do not use .cpu().numpy()
to convert the data first, the result is incorrect.
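The incorrect result without the conversion is likely because, in older PyTorch versions, target == c returns a uint8 tensor, and accumulating it with Python's built-in sum can overflow at 256. A sketch of the same accuracy computation kept entirely in tensor ops (reusing the loop variables above):
correct = 0
total = 0
for input, target in data_loader:             # same loop as above, sketched without tqdm
    input = input.to(device)
    target = target.to(device)
    output = self.model.forward_t(input)
    pred = output.argmax(dim=1)
    correct += (pred == target).sum().item()   # tensor .sum() avoids the uint8 overflow of Python's sum
    total += target.size(0)
acc = correct / total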
I am facing the same issue. Setting cudnn.benchmark=False
did not help (it was set to False
from the beginning). My code crashes after a second call to some function. (I used CUDA_LAUNCH_BLOCKING=1
to find out where the error occurred.) Any pointers to the cause and how to fix it? Thanks.
File "../libs/bn.py", line 109, in forward
self.training, self.momentum, self.eps, self.activation, self.slope)
File "../libs/functions.py", line 99, in forward
running_mean.mul_((1 - ctx.momentum)).add_(ctx.momentum * mean)
RuntimeError: CUDA error: an illegal memory access was encountered
When trying to print the value of the tensor running_mean
(during the second call), it raises the following error:
print(running_mean)
File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/tensor.py", line 66, in __repr__
return torch._tensor_str._str(self)
File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/_tensor_str.py", line 277, in _str
tensor_str = _tensor_str(self, indent)
File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/_tensor_str.py", line 195, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/_tensor_str.py", line 84, in __init__
nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/functional.py", line 271, in isfinite
return (tensor == tensor) & (tensor.abs() != inf)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generated/../THCTensorMathCompareT.cuh:69
-> running_mean seems to have inf values!
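A quick way to confirm that (a sketch; model here is a placeholder for the network object, and it assumes the batch-norm running statistics are registered as buffers in the usual way):
import torch
for name, buf in model.named_buffers():           # running_mean / running_var live here
    if not torch.isfinite(buf).all():
        print("non-finite values in buffer:", name)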
It seems to be an issue related to the machine where the code is running (more specifically, CUDA-related; things run fine on CPU).
Fix and possible explanation.