
Hi, everyone!
I am hitting a strange illegal memory access error. It happens randomly, without any regular pattern.
The code is really simple: it is PointNet for point cloud segmentation, and I don't think there is anything wrong with it.

import torch
import torch.nn as nn
import torch.nn.functional as F
import os
class InstanceSeg(nn.Module):
    def __init__(self, num_points=1024):
        super(InstanceSeg, self).__init__()
        self.num_points = num_points
        self.conv1 = nn.Conv1d(9, 64, 1)
        self.conv2 = nn.Conv1d(64, 64, 1)
        self.conv3 = nn.Conv1d(64, 64, 1)
        self.conv4 = nn.Conv1d(64, 128, 1)
        self.conv5 = nn.Conv1d(128, 1024, 1)
        self.conv6 = nn.Conv1d(1088, 512, 1)
        self.conv7 = nn.Conv1d(512, 256, 1)
        self.conv8 = nn.Conv1d(256, 128, 1)
        self.conv9 = nn.Conv1d(128, 128, 1)
        self.conv10 = nn.Conv1d(128, 2, 1)
        self.max_pool = nn.MaxPool1d(num_points)
    def forward(self, x):
        batch_size = x.size()[0] # (x has shape (batch_size, 9, num_points))
        out = F.relu(self.conv1(x)) # (shape: (batch_size, 64, num_points))
        out = F.relu(self.conv2(out)) # (shape: (batch_size, 64, num_points))
        point_features = out
        out = F.relu(self.conv3(out)) # (shape: (batch_size, 64, num_points))
        out = F.relu(self.conv4(out)) # (shape: (batch_size, 128, num_points))
        out = F.relu(self.conv5(out)) # (shape: (batch_size, 1024, num_points))
        global_feature = self.max_pool(out) # (shape: (batch_size, 1024, 1))
        global_feature_repeated = global_feature.repeat(1, 1, self.num_points) # (shape: (batch_size, 1024, num_points))
        out = torch.cat([global_feature_repeated, point_features], 1) # (shape: (batch_size, 1024+64=1088, num_points))
        out = F.relu(self.conv6(out)) # (shape: (batch_size, 512, num_points))
        out = F.relu(self.conv7(out)) # (shape: (batch_size, 256, num_points))
        out = F.relu(self.conv8(out)) # (shape: (batch_size, 128, num_points))
        out = F.relu(self.conv9(out)) # (shape: (batch_size, 128, num_points))
        out = self.conv10(out) # (shape: (batch_size, 2, num_points))
        out = out.transpose(2,1).contiguous() # (shape: (batch_size, num_points, 2))
        out = F.log_softmax(out.view(-1, 2), dim=1) # (shape: (batch_size*num_points, 2))
        out = out.view(batch_size, self.num_points, 2) # (shape: (batch_size, num_points, 2))
        return out
Num = 0
network = InstanceSeg()
network.cuda()
while(1):
    input0 = torch.randn(32, 3, 1024).cuda()
    input1 = torch.randn(32, 3, 1024).cuda()
    input2 = torch.randn(32, 3, 1024).cuda()
    input = torch.cat((input0, input1, input2), 1)
    out = network(input)
    Num = Num+1
    print(Num)

After a random number of steps, the error is raised. The error report is:

Traceback (most recent call last):
  File "/home/wangye/Frustum-PointNet_Test/frustum_pointnet.py", line 58, in <module>
    input0 = torch.randn(32, 3, 1024).cuda()
RuntimeError: CUDA error: an illegal memory access was encountered

When I added os.environ['CUDA_LAUNCH_BLOCKING'] = '1' at the top of this script, the error report changed to this:

Traceback (most recent call last):
  File "/home/wangye/Frustum-PointNet_Test/frustum_pointnet.py", line 64, in <module>
    out = network(input)
  File "/home/wangye/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wangye/Frustum-PointNet_Test/frustum_pointnet.py", line 35, in forward
    out = F.relu(self.conv5(out)) # (shape: (batch_size, 1024, num_points))
  File "/home/wangye/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wangye/anaconda3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 187, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
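For clarity, I set the flag at the very top of the file, before torch is imported, so that it takes effect when CUDA initializes and kernel launches become synchronous (a minimal sketch of the top of the script):

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # must be set before CUDA is initialized

import torch
import torch.nn as nn
import torch.nn.functional as F
# ... rest of the script unchanged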

I know that wrong indexing operations and incorrect use of a loss function can lead to illegal memory access errors, but there is no such operation in this script.
I am quite sure this error is not caused by running out of memory, since only about 2 GB of GPU memory is used and I have 12 GB of GPU memory in total.

This is my environment information:

OS: Ubuntu 16.04 LTS 64-bit
Command: conda install pytorch torchvision cudatoolkit=9.0 -c pytorch
GPU: Titan XP
Driver Version: 410.93
Python Version: 3.6
CUDA Version: cuda_9.0.176_384.81_linux
cuDNN Version: cudnn-9.0-linux-x64-v7.4.2.24
PyTorch Version: pytorch-1.0.1-py3.6_cuda9.0.176_cudnn7.4.2_2

I have been stuck here for a long time.
In fact, it is not only this project: many other projects hit a similar error on my machine.
I don't think there is anything wrong with the code, since it runs correctly for a number of steps. Maybe the error is caused by the environment; I am not sure.
Does anyone have any idea about this situation? If more detailed information is needed, please let me know.
Thanks for any suggestions.


I met the same problem with a 2080 Ti. Reducing the batch size from 2 to 1 and reducing the number of gt boxes per image didn't help.
This is my environment information:

OS: Ubuntu 16.04 LTS 64-bit
Command: conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
GPU: 2080ti
Driver Version: 418.67
Python Version: 3.7
CUDA Version: 10.1
cuDNN Version: 7
PyTorch Version: torch-1.1.0, torchvision-0.2.0

invalid argument
an illegal memory access was encountered
an illegal memory access was encountered
Traceback (most recent call last):
  File "tools/train_net.py", line 174, in <module>
    main()
  File "tools/train_net.py", line 167, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "/home/fw/Softwares/RetinaNet/maskrcnn_benchmark/engine/trainer.py", line 68, in do_train
    loss_dict = model(images, targets)
  File "/home/fw/Softwares/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fw/Softwares/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/fw/Softwares/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fw/Softwares/RetinaNet/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "/home/fw/Softwares/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fw/Softwares/RetinaNet/maskrcnn_benchmark/modeling/rpn/retinanet/retinanet.py", line 136, in forward
    return self.forward_train(anchors, box_cls, box_regression, targets)
  File "/home/fw/Softwares/RetinaNet/maskrcnn_benchmark/modeling/rpn/retinanet/retinanet.py", line 143, in forward_train
    anchors, box_cls, box_regression, targets
  File "/home/fw/Softwares/RetinaNet/maskrcnn_benchmark/modeling/rpn/retinanet/loss.py", line 172, in __call__
    match_quality_matrix = boxlist_iou(targets, anchors)
  File "/home/fw/Softwares/RetinaNet/maskrcnn_benchmark/structures/rboxlist_ops.py", line 167, in boxlist_iou
    overlaps_th = torch.tensor(overlaps).to(boxlist1.bbox.device)  # [N, M]
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fb1e9515813 in /home/fw/Softwares/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)

Setting CUDA_LAUNCH_BLOCKING to 1 didn't help either.

Is this problem related to this one?
I am on Ubuntu 18.04, and I have tried PyTorch 1.1.0, 1.2.0, and 1.3.0 with CUDA 9.2, 10.0, and 10.1, all under Python 3.7.4 within a conda installation. The nvidia-smi driver I am currently using is 440.26, but I have tried a bunch of others as well, and none of them work.

In my case, I get the RuntimeError: CUDA error: an illegal memory access was encountered message when I run my code on gpu 1, but it runs fine on gpu 0:

gpu=1
device = torch.device(f"cuda:{gpu}" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    torch.cuda.set_device(device)

Any ideas on how to debug this?

@jzazo
Hi, I had a similar problem.
If I only use device = torch.device("cuda:1"), I always get RuntimeError: CUDA error: an illegal memory access was encountered.

But when I select the GPU explicitly with torch.cuda.set_device(1), everything is fine.
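For reference, a minimal sketch of the two setups I am comparing (the tensor shapes are just placeholders):

import torch

# setup that raises the illegal memory access for me:
# only the tensors are sent to cuda:1, while the current device stays at cuda:0
device = torch.device("cuda:1")
x = torch.randn(4, 9, 1024, device=device)

# workaround: make cuda:1 the current device first, then run as usual
torch.cuda.set_device(1)
y = torch.randn(4, 9, 1024, device=torch.device("cuda:1"))
# presumably some ops allocate temporary buffers on the current device,
# so keeping it in sync with the tensors' device avoids cross-device accesses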


I'm getting this error as well, but it seems to depend on my batch size; I don't encounter it with smaller batch sizes.
PyTorch 1.3.1 on a V100.


@heiyuxiaokai
The first output points to an "out of memory" error.
Could you lower the batch size and rerun your code?
Are you using the code snippet from the first post or another one?

@jzazo
The original script does not use apex, so this issue should be unrelated.

@kouohhashi @dan-nadler
Are you using the script from the first post or another one?

I still cannot reproduce the error for more than 20k iterations, so I would need (another) code snippet to reproduce this issue.

I tried this MNIST example.

I added the following lines at the beginning of the script:

gpu = 1
device = torch.device(gpu if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    torch.cuda.set_device(gpu)

and device = torch.device(gpu if use_cuda else "cpu") in the main function.
I get the following error: RuntimeError: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1570710718161/work/aten/src/THC/THCGeneral.cpp:216

It's a different error from the one I was getting in my own script, but the simple example still does not run on gpu=1, while it works fine on gpu=0.

I just remembered that I followed this guide to move Xorg from the discrete GPU to Intel's integrated chip. Could this change be responsible for this strange behavior?
I will undo the change and report back with the outcome.

@dan-nadler the peak memory usage might have caused the OOM issue.

@jzazo I cannot reproduce this issue by adding your provided code to the MNIST example on an 8 GPU system (rerunning with different GPU ids).

What GPU are you using as GPU 1? If it's the Intel integrated chip, this won't work;
you would need a GPU that can execute CUDA code.
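A quick way to check which devices CUDA actually sees (a minimal sketch):

import torch

print(torch.cuda.is_available())             # False would explain the failure immediately
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))  # only CUDA-capable GPUs are listed here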

I'm having a potentially related issue as well. On a machine with 8 RTX 2080 Ti GPUs, one specific GPU (4) gives the CUDA illegal memory access issue when trying to copy from the GPU to the CPU:

# predicted = pytorch tensor on GPU
predicted = predicted.view(-1).detach().cpu().numpy()
# RuntimeError: CUDA error: an illegal memory access was encountered

Identical code runs fine on the other 7 GPUs but gives an error on this particular GPU after a random number of iterations.

Driver: 430.50
Ubuntu 18.04.3 LTS
CUDA: 10.1.243
cuDNN: 7.5.1
conda install
python: 3.7.4
pytorch:  1.1.0 py3.7_cuda10.1.243_cudnn7.6.3_0
cudatoolkit: 10.1.243
torchvision:  0.4.2

I haven't done too much playing around, but this happens fairly reliably (usually within 20-30 minutes of running) and only on this one particular GPU. Any developments on this issue before I start checking the hardware?
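In case it helps with reproducing, this is roughly how I pin a test run to the suspect GPU (the rest of the script is unchanged):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4"  # hide the other GPUs; must be set before CUDA is initialized

import torch
print(torch.cuda.device_count())  # should print 1; physical GPU 4 is now visible as cuda:0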


@knagrecha
0.4.0 is quite old by now. Could you please update to the latest stable release (1.4.0) and retry your script? Feel free to create a new issue and ping me there if you see the same error with any to('cuda') call. Or are you seeing this error with the first posted code snippet?

@bhaeffele Could you post a (minimal) executable code snippet to reproduce this error?


Hi guys,
I'm trying to train a MobileNetV2 model and I am hitting the same error. Environment: PyTorch 1.13, CUDA 11.7.
Any recommendation is appreciated; thanks in advance!

I have tried upgrading the versions (PyTorch 2.0, CUDA 11.7 => 11.8) and I still hit this problem in the training code of two models.
I don't think it is a batch size problem; I reduced the batch size and it still isn't solved.
The last line before the error is images = torch.tensor(images, device=self.device, dtype=torch.float32).
Another strange thing is that when this error occurs, the brightness of my laptop monitor drops to the lowest level; I don't know what happened :(

# last three lines in the terminal
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
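(An aside on the line mentioned above, not a fix for the crash: a minimal sketch of the more idiomatic way to write that conversion; to_float32 is just a hypothetical helper name:)

import numpy as np
import torch

def to_float32(images, device):
    # if images is already a torch.Tensor, just move/cast it
    if isinstance(images, torch.Tensor):
        return images.to(device, dtype=torch.float32)
    # otherwise (numpy array / list), construct the tensor directly on the target device
    return torch.as_tensor(images, dtype=torch.float32, device=device)

# example usage
dev = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
imgs = to_float32(np.zeros((2, 3, 224, 224), dtype=np.float32), dev)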

This had been an issue for me for a while. After updating to the nightly build (or maybe it was just a pytorch-cuda version issue), everything works fine with "ddp" training.

OS: AWS sagemaker ml.p2.8xlarge
GPU: Tesla K80 * 8
pytorch-cuda -> 11.8
pytorch-nightly -> 2.1.0.dev20230304
pytorch_lightning -> 1.94

Regarding the monitor brightness dropping: this is because the GPU utilization remains at 100% after the CUDA error and does not drop, so the GPU draws a lot of power; since a laptop power supply is not very strong, this degrades the power available to other devices and peripherals.

We were facing this problem at inference time, after hundreds of iterations.
After a lot of tests we finally found a configuration that solved the issue,
so it is very likely that the cause is some combination of CUDA and PyTorch versions.

The error appears in this configuration:

OS:     AWS sagemaker ml.g4dn.xlarge
GPU:    tesla t4
CUDA:   11.8.0
PYTORCH: 
	torch==2.0.0
	torchvision==0.15.1
IMAGE:  nvidia/cuda:11.8.0-base-ubuntu20.04

and it was solved by changing to this configuration:

OS:      AWS sagemaker ml.g4dn.xlarge
GPU:     tesla t4
CUDA:    11.3.1
PYTORCH: 
	 torch==1.12.1+cu113
	 torchvision==0.13.1+cu113
IMAGE:   nvidia/cuda:11.3.1-runtime-ubuntu20.04
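As a sanity check after switching environments, something like this confirms that the interpreter really picked up the intended build (a minimal sketch; the expected values match our working configuration above):

import torch

print(torch.__version__)               # expect 1.12.1+cu113 after the downgrade
print(torch.version.cuda)              # expect 11.3
print(torch.backends.cudnn.version())  # cuDNN build bundled with this torch wheel
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # Tesla T4 in our case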

I was having the same problem while trying to run multiple models in parallel on a Docker image (nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04).
@GuintherKovalski's solution worked for me!

So it looks like a compatibility bug between the CUDA and PyTorch versions?

"RuntimeError: CUDA error: an illegal memory access was encountered" when trying to run the plain demo. tsunghan-wu/SLD#3

I got the same error with torch==2.2.0. When I use torch==2.0.0, the error never appears.

This worked for me as well, thanks.