Hi all, I encountered a weird CUDA illegal memory access error. I will try to put together a minimal example in a while.

During training, my code runs for several batches without any errors; then, after a random amount of time, there is an illegal memory access error. The error happens on this line:

conf_p = conf[pos]

and the error message is:

  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 74, in __getitem__
    return MaskedSelect.apply(self, key)
  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py", line 534, in forward
    return tensor.masked_select(mask)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /export/home/x/code/pytorch/torch/lib/THC
/generated/../THCReduceAll.cuh:339

Interestingly, even if I replace this line of code with:

print(pos)

there is still an error:

  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 119, in __repr__
    return 'Variable containing:' + self.data.__repr__()
  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 133, in __repr__
    return str(self)
  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 140, in __str__
    return _tensor_str._str(self)
  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 294, in _str
    strt = _tensor_str(self)
  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 142, in _tensor_str
    formatter = _number_format(self, min_sz=3 if not print_full_mat else 0)
  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 74, in _number_format
    tensor = torch.DoubleTensor(tensor.size()).copy_(tensor).abs_().view(tensor.nelement())
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /export/home/x/code/pytorch/torch/lib/THC
/generic/THCTensorCopy.c:70

However, I can run the following without any error:

print(pos.size(), pos.is_contiguous())

I also ran it in CPU mode; below is the gdb backtrace:

Thread 21 "python" received signal SIGBUS, Bus error.
[Switching to Thread 0x7fff4ce01700 (LWP 32576)]
malloc_consolidate (av=av@entry=0x7ffe58000020) at malloc.c:4181
4181    malloc.c: No such file or directory.
(gdb) bt
#0  malloc_consolidate (av=av@entry=0x7ffe58000020) at malloc.c:4181
#1  0x00007ffff6a51678 in _int_free (av=0x7ffe58000020, p=<optimized out>, have_lock=0) at malloc.c:4075
#2  0x00007ffff6a5553c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
#3  0x00007fffe0b853fe in THLongStorage_free ()
   from /export/home/x/anaconda3/lib/python3.6/site-packages/torch/lib/libTH.so.1
#4  0x00007fffe0baf4e7 in THLongTensor_free ()
   from /export/home/x/anaconda3/lib/python3.6/site-packages/torch/lib/libTH.so.1
#5  0x00007fffe01b8839 in at::CPULongTensor::~CPULongTensor() [clone .localalias.31] ()
   from /export/home/x/anaconda3/lib/python3.6/site-packages/torch/lib/libATen.so.1
#6  0x00007fffecf88269 in torch::autograd::VariableImpl::~VariableImpl (this=0x7ffdd793fc40, __in_chrg=<optimized out>)
    at torch/csrc/autograd/variable.cpp:38
#7  0x00007fffecf9ad21 in at::TensorImpl::release (this=<optimized out>)
    at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/TensorImpl.h:31
#8  at::detail::TensorBase::~TensorBase (this=<optimized out>, __in_chrg=<optimized out>)
    at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/TensorBase.h:27
#9  at::Tensor::~Tensor (this=<optimized out>, __in_chrg=<optimized out>)
    at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/Tensor.h:33
#10 at::Tensor::reset (this=0x7fffa888c288) at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/Tensor.h:57
#11 THPVariable_clear (self=self@entry=0x7fffa888c278) at torch/csrc/autograd/python_variable.cpp:131
#12 0x00007fffecf9ae31 in THPVariable_dealloc (self=0x7fffa888c278) at torch/csrc/autograd/python_variable.cpp:138
#13 0x00007ffff79a75f9 in subtype_dealloc (self=0x7fffa888c278) at Objects/typeobject.c:1222
#14 0x00007ffff79a5f3e in tupledealloc (op=0x7fffa8851a08) at Objects/tupleobject.c:243
#15 0x00007fffecf9072f in THPPointer<_object>::~THPPointer (this=0x7fff4ce00850, __in_chrg=<optimized out>)
    at /export/home/x/code/pytorch/torch/csrc/utils/object_ptr.h:12
#16 torch::autograd::PyFunction::apply (this=0x7fffa877e6d0, inputs=...) at torch/csrc/autograd/python_function.cpp:123
#17 0x00007fffecf7bbf4 in torch::autograd::Function::operator() (inputs=..., this=<optimized out>)
    at /export/home/x/code/pytorch/torch/csrc/autograd/function.h:89
#18 torch::autograd::call_function (task=...) at torch/csrc/autograd/engine.cpp:208
#19 torch::autograd::Engine::evaluate_function (this=this@entry=0x7fffee198ca0 <engine>, task=...)
    at torch/csrc/autograd/engine.cpp:220
#20 0x00007fffecf7ddae in torch::autograd::Engine::thread_main (this=0x7fffee198ca0 <engine>, graph_task=0x0)
    at torch/csrc/autograd/engine.cpp:144
#21 0x00007fffecf7ab42 in torch::autograd::Engine::thread_init (this=this@entry=0x7fffee198ca0 <engine>, device=device@entry=-1)
    at torch/csrc/autograd/engine.cpp:121
#22 0x00007fffecf9da9a in torch::autograd::python::PythonEngine::thread_init (this=0x7fffee198ca0 <engine>, device=-1)
    at torch/csrc/autograd/python_engine.cpp:28
#23 0x00007fffcd559c5c in std::execute_native_thread_routine_compat (__p=<optimized out>)
    at /home/msarahan/miniconda2/conda-bld/compilers_linux-64_1507259624353/work/.build/src/gcc-7.2.0/libstdc++-v3/src/c++11/threa
d.cc:110
#24 0x00007ffff76ba6ba in start_thread (arg=0x7fff4ce01700) at pthread_create.c:333
#25 0x00007ffff6ad83dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb)

It seems that a piece of memory is being wrongly freed.

You should run your code with CUDA_LAUNCH_BLOCKING=1 to see where the error comes from.
Because all CUDA calls are asynchronous when you don’t specify this option, the Python code will report the error on the next CUDA call after the actual error. This is why trying to use the tensor or print its content raises an error (that uses the GPU), while printing its size or checking whether it is contiguous does not (these are CPU-only operations).
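
As a rough illustration of that asynchrony (a hypothetical out-of-range index, not the code from this thread):

import torch

x = torch.randn(10, device="cuda")
idx = torch.tensor([100], device="cuda")  # deliberately out of range

y = x[idx]                # the kernel is only queued, so this usually returns without an error
print(y.size())           # metadata only, no GPU work, so still no error
# print(y)                # copying the data back to the host would surface the pending error
torch.cuda.synchronize()  # draining the queue surfaces the error at this line instead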

Yeah I will try that

I just found that even if I set CUDA_LAUNCH_BLOCKING=1, there is still an error when I try to print the tensor. I was running:

CUDA_LAUNCH_BLOCKING=1
python train.py

Is this the right way to set this environment variable?

If you run it as two commands, you should use export CUDA_LAUNCH_BLOCKING=1, but that will set it for the whole terminal session.
If you use CUDA_LAUNCH_BLOCKING=1 python train.py (in one command), the variable is set just for that command.
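
If it is more convenient, the variable can also be set from inside the script; the assumption is that this runs before the first CUDA call, otherwise it has no effect:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch  # imported after the variable is set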

I put them on the same line now; here is the error message:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  function_attributes(): after cudaFuncGetAttributes: an illegal memory access was encountered
./train.sh: line 14:  4111 Aborted                 (core dumped) CUDA_LAUNCH_BLOCKING=1 python train.py
              

I finally solved this problem.

Although the error message is not very helpful, I guessed that the illegal memory access came from an out-of-range index. So I double-checked all my code and finally found that, in certain batches, the ground-truth target could be larger than the number of classes in the softmax. I fixed it and there are no more errors :slight_smile:
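
For reference, a minimal sketch of the kind of check that catches this, using hypothetical targets / num_classes names rather than my actual training code:

import torch

def check_targets(targets: torch.Tensor, num_classes: int) -> None:
    # Every class index must lie in [0, num_classes). An out-of-range label on
    # the GPU tends to show up as an illegal memory access instead of a clean
    # IndexError, so it is cheap insurance to assert this before the loss.
    assert targets.min().item() >= 0, "negative class index in targets"
    assert targets.max().item() < num_classes, (
        f"max target {targets.max().item()} >= num_classes {num_classes}")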

@albanD thanks for your time anyway

Just to share my case: I had a similar error. I commented out the line cudnn.benchmark=True and everything works fine now.

The training code works fine with that line commented out, but when I run my validation code, it crashes with the same error 77 (illegal memory access).
Anyway, I will share more if I find something else.
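
For anyone searching, the flag in question lives in torch.backends; a minimal sketch of the debugging toggle, assuming a reasonably recent PyTorch:

import torch

torch.backends.cudnn.benchmark = False      # disable the cuDNN autotuner
torch.backends.cudnn.deterministic = True   # optional: force deterministic algorithms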

I’m getting the same illegal memory access error, which was caused by moving the tensors to the GPU: input[key] = input[key].cuda().

I tried setting cudnn.benchmark = False and running rm -rf ~/.nv following some web searches, but without success. Any suggestions? Thanks a lot!

EDIT: I realized that cudnn.benchmark was set to True on a later line ^ ^b (I was running someone else’s git repo), and after resetting it to False the error went away!

I also ran into the same error when evaluating the model:

RuntimeError: CUDA error: an illegal memory access was encountered

My code looks like this:

correct = 0
total = 0
for i, (input, target) in tqdm.tqdm(enumerate(data_loader), total=len(dataset)//batch_size):
    target = target.to(device)
    input = input.to(device)
    output = self.model.forward_t(input)
    c = output.argmax(dim=1)
    total += len(target)
    correct += sum(target.cpu().numpy() == c.cpu().numpy())
    acc = float(correct) / total

It is also strange that, if I do not use .cpu().numpy() to convert the data first, the result is incorrect.
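
A possible rewrite of the bookkeeping that keeps the comparison on the GPU and only moves a scalar back to the host; this is a sketch with a generic model(input) call in place of self.model.forward_t, so adapt the names:

import torch

@torch.no_grad()
def accuracy(model, data_loader, device):
    correct, total = 0, 0
    for input, target in data_loader:
        input, target = input.to(device), target.to(device)
        pred = model(input).argmax(dim=1)
        # .item() synchronizes, so a pending asynchronous CUDA error would
        # surface on this line rather than at some later, unrelated call.
        correct += (pred == target).sum().item()
        total += target.size(0)
    return correct / total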

I am facing the same issue. Setting cudnn.benchmark=False did not help (it was set to False from the beginning). My code crashes on the second call to a certain function (I used CUDA_LAUNCH_BLOCKING=1 to find out where the error occurred). Any pointers to the cause and how to fix it? Thanks!

File "../libs/bn.py", line 109, in forward
    self.training, self.momentum, self.eps, self.activation, self.slope)
  File "../libs/functions.py", line 99, in forward
    running_mean.mul_((1 - ctx.momentum)).add_(ctx.momentum * mean)
RuntimeError: CUDA error: an illegal memory access was encountered

When trying to print the value of the tensor running_mean (during the second call), it raises the following error:

print(running_mean)
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/tensor.py", line 66, in __repr__
    return torch._tensor_str._str(self)
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/_tensor_str.py", line 277, in _str
    tensor_str = _tensor_str(self, indent)
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/_tensor_str.py", line 195, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/_tensor_str.py", line 84, in __init__
    nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/functional.py", line 271, in isfinite
    return (tensor == tensor) & (tensor.abs() != inf)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generated/../THCTensorMathCompareT.cuh:69

-> running_mean seems to have inf values!
It seems to be an issue related to the machine where the code is running (more specifically, CUDA-related; things run fine on CPU).
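
A small helper I would use to confirm that diagnosis (hypothetical; it only covers standard torch.nn BatchNorm layers, so the custom in-place BN from libs/bn.py would need its own isinstance check):

import torch

def check_bn_stats(model: torch.nn.Module) -> None:
    # Flag any BatchNorm running statistics containing inf/NaN, which is what
    # the traceback above points at for running_mean.
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
            if not torch.isfinite(module.running_mean).all():
                print(f"non-finite running_mean in {name}")
            if not torch.isfinite(module.running_var).all():
                print(f"non-finite running_var in {name}")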

Fix and possible explanation.