Hi all, I encountered a weird CUDA illegal memory access error. I will try to put together a minimal example in a while.

During training, my code runs for several batches without any errors; then, after a random amount of time, there is an illegal memory access error. The error happens on this line:

conf_p = conf[pos]

and the error message is:

  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 74, in __getitem__
    return MaskedSelect.apply(self, key)
  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py", line 534, in forward
    return tensor.masked_select(mask)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /export/home/x/code/pytorch/torch/lib/THC
/generated/../THCReduceAll.cuh:339

Interestingly, even if I replace this line of code with:

print(pos)

there is still an error:

  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 119, in __repr__
    return 'Variable containing:' + self.data.__repr__()
  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 133, in __repr__
    return str(self)
  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 140, in __str__
    return _tensor_str._str(self)
  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 294, in _str
    strt = _tensor_str(self)
  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 142, in _tensor_str
    formatter = _number_format(self, min_sz=3 if not print_full_mat else 0)
  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 74, in _number_format
    tensor = torch.DoubleTensor(tensor.size()).copy_(tensor).abs_().view(tensor.nelement())
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /export/home/x/code/pytorch/torch/lib/THC
/generic/THCTensorCopy.c:70

However, I can run the following without any error:

print(pos.size(), pos.is_contiguous())

I also ran it in CPU mode; below is the gdb backtrace:

Thread 21 "python" received signal SIGBUS, Bus error.
[Switching to Thread 0x7fff4ce01700 (LWP 32576)]
malloc_consolidate (av=av@entry=0x7ffe58000020) at malloc.c:4181
4181    malloc.c: No such file or directory.
(gdb) bt
#0  malloc_consolidate (av=av@entry=0x7ffe58000020) at malloc.c:4181
#1  0x00007ffff6a51678 in _int_free (av=0x7ffe58000020, p=<optimized out>, have_lock=0) at malloc.c:4075
#2  0x00007ffff6a5553c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
#3  0x00007fffe0b853fe in THLongStorage_free ()
   from /export/home/x/anaconda3/lib/python3.6/site-packages/torch/lib/libTH.so.1
#4  0x00007fffe0baf4e7 in THLongTensor_free ()
   from /export/home/x/anaconda3/lib/python3.6/site-packages/torch/lib/libTH.so.1
#5  0x00007fffe01b8839 in at::CPULongTensor::~CPULongTensor() [clone .localalias.31] ()
   from /export/home/x/anaconda3/lib/python3.6/site-packages/torch/lib/libATen.so.1
#6  0x00007fffecf88269 in torch::autograd::VariableImpl::~VariableImpl (this=0x7ffdd793fc40, __in_chrg=<optimized out>)
    at torch/csrc/autograd/variable.cpp:38
#7  0x00007fffecf9ad21 in at::TensorImpl::release (this=<optimized out>)
    at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/TensorImpl.h:31
#8  at::detail::TensorBase::~TensorBase (this=<optimized out>, __in_chrg=<optimized out>)
    at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/TensorBase.h:27
#9  at::Tensor::~Tensor (this=<optimized out>, __in_chrg=<optimized out>)
    at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/Tensor.h:33
#10 at::Tensor::reset (this=0x7fffa888c288) at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/Tensor.h:57
#11 THPVariable_clear (self=self@entry=0x7fffa888c278) at torch/csrc/autograd/python_variable.cpp:131
#12 0x00007fffecf9ae31 in THPVariable_dealloc (self=0x7fffa888c278) at torch/csrc/autograd/python_variable.cpp:138
#13 0x00007ffff79a75f9 in subtype_dealloc (self=0x7fffa888c278) at Objects/typeobject.c:1222
#14 0x00007ffff79a5f3e in tupledealloc (op=0x7fffa8851a08) at Objects/tupleobject.c:243
#15 0x00007fffecf9072f in THPPointer<_object>::~THPPointer (this=0x7fff4ce00850, __in_chrg=<optimized out>)
    at /export/home/x/code/pytorch/torch/csrc/utils/object_ptr.h:12
#16 torch::autograd::PyFunction::apply (this=0x7fffa877e6d0, inputs=...) at torch/csrc/autograd/python_function.cpp:123
#17 0x00007fffecf7bbf4 in torch::autograd::Function::operator() (inputs=..., this=<optimized out>)
    at /export/home/x/code/pytorch/torch/csrc/autograd/function.h:89
#18 torch::autograd::call_function (task=...) at torch/csrc/autograd/engine.cpp:208
#19 torch::autograd::Engine::evaluate_function (this=this@entry=0x7fffee198ca0 <engine>, task=...)
    at torch/csrc/autograd/engine.cpp:220
#20 0x00007fffecf7ddae in torch::autograd::Engine::thread_main (this=0x7fffee198ca0 <engine>, graph_task=0x0)
    at torch/csrc/autograd/engine.cpp:144
#21 0x00007fffecf7ab42 in torch::autograd::Engine::thread_init (this=this@entry=0x7fffee198ca0 <engine>, device=device@entry=-1)
    at torch/csrc/autograd/engine.cpp:121
#22 0x00007fffecf9da9a in torch::autograd::python::PythonEngine::thread_init (this=0x7fffee198ca0 <engine>, device=-1)
    at torch/csrc/autograd/python_engine.cpp:28
#23 0x00007fffcd559c5c in std::execute_native_thread_routine_compat (__p=<optimized out>)
    at /home/msarahan/miniconda2/conda-bld/compilers_linux-64_1507259624353/work/.build/src/gcc-7.2.0/libstdc++-v3/src/c++11/threa
d.cc:110
#24 0x00007ffff76ba6ba in start_thread (arg=0x7fff4ce01700) at pthread_create.c:333
#25 0x00007ffff6ad83dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb)

It seems that a piece of memory is being wrongly freed.

You should run your code with CUDA_LAUNCH_BLOCKING=1 to see where the error comes from.
Because all CUDA calls are asynchronous when you don’t specify this option, the Python code will report the error on the next CUDA call after the actual error. This is why trying to use the tensor or print its content raises an error (that uses the GPU), while printing its size or checking whether it is contiguous does not (these are CPU-only operations).
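
As a rough illustration of that asynchrony (a hypothetical out-of-range index, not the code from this thread):

import torch

x = torch.randn(10, device="cuda")
idx = torch.tensor([100], device="cuda")  # deliberately out of range

y = x[idx]                # the kernel is only queued, so this usually returns without an error
print(y.size())           # metadata only, no GPU work, so still no error
# print(y)                # copying the data back to the host would surface the pending error
torch.cuda.synchronize()  # draining the queue surfaces the error at this line instead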

Yeah I will try that

I just found that even if I set CUDA_LAUNCH_BLOCKING=1, there is still an error when I try to print the tensor. I was running:

CUDA_LAUNCH_BLOCKING=1
python train.py

Is this the right way to set this environment variable?

If you run it as two commands, you should use export CUDA_LAUNCH_BLOCKING=1, but that will set it for the whole terminal session.
If you use CUDA_LAUNCH_BLOCKING=1 python train.py (in one command), the variable is set just for that command.
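
If it is more convenient, the variable can also be set from inside the script; the assumption is that this runs before the first CUDA call, otherwise it has no effect:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch  # imported after the variable is set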

I put them on the same line now; here is the error message:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  function_attributes(): after cudaFuncGetAttributes: an illegal memory access was encountered
./train.sh: line 14:  4111 Aborted                 (core dumped) CUDA_LAUNCH_BLOCKING=1 python train.py
              

I finally solved this problem.

Although the error message is not very helpful, I guessed that the illegal memory access came from an out-of-range index. So I double-checked all my code and finally found that, in certain batches, the ground-truth target could be larger than the number of classes in the softmax. I fixed it and there are no more errors :slight_smile:
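
For reference, a minimal sketch of the kind of check that catches this, using hypothetical targets / num_classes names rather than my actual training code:

import torch

def check_targets(targets: torch.Tensor, num_classes: int) -> None:
    # Every class index must lie in [0, num_classes). An out-of-range label on
    # the GPU tends to show up as an illegal memory access instead of a clean
    # IndexError, so it is cheap insurance to assert this before the loss.
    assert targets.min().item() >= 0, "negative class index in targets"
    assert targets.max().item() < num_classes, (
        f"max target {targets.max().item()} >= num_classes {num_classes}")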

@albanD thanks for your time anyway

Just to share my case: I had a similar error. I commented out the line cudnn.benchmark=True and everything works fine now.

The training code works fine with that line commented out, but when I run my validation code, it crashes with the same error 77 (illegal memory access).
Anyway, I will share more if I find something else.
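
For anyone searching, the flag in question lives in torch.backends; a minimal sketch of the debugging toggle, assuming a reasonably recent PyTorch:

import torch

torch.backends.cudnn.benchmark = False      # disable the cuDNN autotuner
torch.backends.cudnn.deterministic = True   # optional: force deterministic algorithms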

I’m getting the same illegal memory access error, which was caused by moving the tensors to the GPU: input[key] = input[key].cuda().

I tried setting cudnn.benchmark = False and running rm -rf ~/.nv following some web searches, but without success. Any suggestions? Thanks a lot!

EDIT: I realized that cudnn.benchmark was set to True on a later line ^ ^b (I was running someone else’s git repo), and after resetting it to False the error went away!

I also ran into the same error when evaluating the model:

RuntimeError: CUDA error: an illegal memory access was encountered

My code looks like this:

correct = 0
total = 0
for i, (input, target) in tqdm.tqdm(enumerate(data_loader), total=len(dataset)//batch_size):
    target = target.to(device)
    input = input.to(device)
    output = self.model.forward_t(input)
    c = output.argmax(dim=1)
    total += len(target)
    correct += sum(target.cpu().numpy() == c.cpu().numpy())
    acc = float(correct) / total

It is also strange that, if I do not use .cpu().numpy() to convert the data first, the result is incorrect.
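
A possible rewrite of the bookkeeping that keeps the comparison on the GPU and only moves a scalar back to the host; this is a sketch with a generic model(input) call in place of self.model.forward_t, so adapt the names:

import torch

@torch.no_grad()
def accuracy(model, data_loader, device):
    correct, total = 0, 0
    for input, target in data_loader:
        input, target = input.to(device), target.to(device)
        pred = model(input).argmax(dim=1)
        # .item() synchronizes, so a pending asynchronous CUDA error would
        # surface on this line rather than at some later, unrelated call.
        correct += (pred == target).sum().item()
        total += target.size(0)
    return correct / total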

I am facing the same issue. Setting cudnn.benchmark=False did not help (it was set to False from the beginning). My code crashes on the second call to a certain function (I used CUDA_LAUNCH_BLOCKING=1 to find out where the error occurred). Any pointers to the cause and how to fix it? Thanks!

File "../libs/bn.py", line 109, in forward
    self.training, self.momentum, self.eps, self.activation, self.slope)
  File "../libs/functions.py", line 99, in forward
    running_mean.mul_((1 - ctx.momentum)).add_(ctx.momentum * mean)
RuntimeError: CUDA error: an illegal memory access was encountered

When trying to print the value of the tensor running_mean (during the second call), it raises the following error:

print(running_mean)
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/tensor.py", line 66, in __repr__
    return torch._tensor_str._str(self)
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/_tensor_str.py", line 277, in _str
    tensor_str = _tensor_str(self, indent)
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/_tensor_str.py", line 195, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/_tensor_str.py", line 84, in __init__
    nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/functional.py", line 271, in isfinite
    return (tensor == tensor) & (tensor.abs() != inf)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generated/../THCTensorMathCompareT.cuh:69

-> running_mean seems to have inf values!
It seems to be an issue related to the machine where the code is running (more specifically, CUDA-related; things run fine on CPU).
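
A small helper I would use to confirm that diagnosis (hypothetical; it only covers standard torch.nn BatchNorm layers, so the custom in-place BN from libs/bn.py would need its own isinstance check):

import torch

def check_bn_stats(model: torch.nn.Module) -> None:
    # Flag any BatchNorm running statistics containing inf/NaN, which is what
    # the traceback above points at for running_mean.
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
            if not torch.isfinite(module.running_mean).all():
                print(f"non-finite running_mean in {name}")
            if not torch.isfinite(module.running_var).all():
                print(f"non-finite running_var in {name}")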

Fix and possible explanation.