
Hi, everyone!
I am hitting a strange illegal memory access error. It happens randomly, without any regular pattern.
The code is really simple: it is PointNet for point cloud segmentation, and I don't think there is anything wrong with it.

import torch
import torch.nn as nn
import torch.nn.functional as F
import os
class InstanceSeg(nn.Module):
    def __init__(self, num_points=1024):
        super(InstanceSeg, self).__init__()
        self.num_points = num_points
        self.conv1 = nn.Conv1d(9, 64, 1)
        self.conv2 = nn.Conv1d(64, 64, 1)
        self.conv3 = nn.Conv1d(64, 64, 1)
        self.conv4 = nn.Conv1d(64, 128, 1)
        self.conv5 = nn.Conv1d(128, 1024, 1)
        self.conv6 = nn.Conv1d(1088, 512, 1)
        self.conv7 = nn.Conv1d(512, 256, 1)
        self.conv8 = nn.Conv1d(256, 128, 1)
        self.conv9 = nn.Conv1d(128, 128, 1)
        self.conv10 = nn.Conv1d(128, 2, 1)
        self.max_pool = nn.MaxPool1d(num_points)
    def forward(self, x):
        batch_size = x.size()[0] # (x has shape (batch_size, 9, num_points))
        out = F.relu(self.conv1(x)) # (shape: (batch_size, 64, num_points))
        out = F.relu(self.conv2(out)) # (shape: (batch_size, 64, num_points))
        point_features = out
        out = F.relu(self.conv3(out)) # (shape: (batch_size, 64, num_points))
        out = F.relu(self.conv4(out)) # (shape: (batch_size, 128, num_points))
        out = F.relu(self.conv5(out)) # (shape: (batch_size, 1024, num_points))
        global_feature = self.max_pool(out) # (shape: (batch_size, 1024, 1))
        global_feature_repeated = global_feature.repeat(1, 1, self.num_points) # (shape: (batch_size, 1024, num_points))
        out = torch.cat([global_feature_repeated, point_features], 1) # (shape: (batch_size, 1024+64=1088, num_points))
        out = F.relu(self.conv6(out)) # (shape: (batch_size, 512, num_points))
        out = F.relu(self.conv7(out)) # (shape: (batch_size, 256, num_points))
        out = F.relu(self.conv8(out)) # (shape: (batch_size, 128, num_points))
        out = F.relu(self.conv9(out)) # (shape: (batch_size, 128, num_points))
        out = self.conv10(out) # (shape: (batch_size, 2, num_points))
        out = out.transpose(2,1).contiguous() # (shape: (batch_size, num_points, 2))
        out = F.log_softmax(out.view(-1, 2), dim=1) # (shape: (batch_size*num_points, 2))
        out = out.view(batch_size, self.num_points, 2) # (shape: (batch_size, num_points, 2))
        return out
Num = 0
network = InstanceSeg()
network.cuda()
while(1):
    input0 = torch.randn(32, 3, 1024).cuda()
    input1 = torch.randn(32, 3, 1024).cuda()
    input2 = torch.randn(32, 3, 1024).cuda()
    input = torch.cat((input0, input1, input2), 1)
    out = network(input)
    Num = Num+1
    print(Num)

After a random number of steps, the error is raised. The error report is:

Traceback (most recent call last):
  File "/home/wangye/Frustum-PointNet_Test/frustum_pointnet.py", line 58, in <module>
    input0 = torch.randn(32, 3, 1024).cuda()
RuntimeError: CUDA error: an illegal memory access was encountered

When I added os.environ['CUDA_LAUNCH_BLOCKING'] = '1' at the top of this script, the error report changed to this:

Traceback (most recent call last):
  File "/home/wangye/Frustum-PointNet_Test/frustum_pointnet.py", line 64, in <module>
    out = network(input)
  File "/home/wangye/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wangye/Frustum-PointNet_Test/frustum_pointnet.py", line 35, in forward
    out = F.relu(self.conv5(out)) # (shape: (batch_size, 1024, num_points))
  File "/home/wangye/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wangye/anaconda3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 187, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
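For clarity, I set the flag at the very top of the file, before torch is imported, so that it takes effect when CUDA initializes and kernel launches become synchronous (a minimal sketch of the top of the script):

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # must be set before CUDA is initialized

import torch
import torch.nn as nn
import torch.nn.functional as F
# ... rest of the script unchanged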

I know that wrong indexing operations and incorrect use of a loss function can lead to illegal memory access errors, but there is no such operation in this script.
I am quite sure this error is not caused by running out of memory, since only about 2 GB of GPU memory is used and I have 12 GB of GPU memory in total.

This is my environment information:

OS: Ubuntu 16.04 LTS 64-bit
Command: conda install pytorch torchvision cudatoolkit=9.0 -c pytorch
GPU: Titan XP
Driver Version: 410.93
Python Version: 3.6
CUDA Version: cuda_9.0.176_384.81_linux
cuDNN Version: cudnn-9.0-linux-x64-v7.4.2.24
PyTorch Version: pytorch-1.0.1-py3.6_cuda9.0.176_cudnn7.4.2_2

I have been stuck here for a long time.
In fact, it is not only this project: many other projects hit a similar error on my machine.
I don't think there is anything wrong with the code, since it runs correctly for a number of steps. Maybe the error is caused by the environment; I am not sure.
Does anyone have any idea about this situation? If more detailed information is needed, please let me know.
Thanks for any suggestions.


I met the same problem with a 2080 Ti. Reducing the batch size from 2 to 1 and reducing the number of gt boxes per image didn't help.
This is my environment information:

OS: Ubuntu 16.04 LTS 64-bit
Command: conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
GPU: 2080ti
Driver Version: 418.67
Python Version: 3.7
CUDA Version: 10.1
cuDNN Version: 7
PyTorch Version: torch-1.1.0, torchvision-0.2.0

invalid argument
an illegal memory access was encountered
an illegal memory access was encountered
Traceback (most recent call last):
  File "tools/train_net.py", line 174, in <module>
    main()
  File "tools/train_net.py", line 167, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "/home/fw/Softwares/RetinaNet/maskrcnn_benchmark/engine/trainer.py", line 68, in do_train
    loss_dict = model(images, targets)
  File "/home/fw/Softwares/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fw/Softwares/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/fw/Softwares/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fw/Softwares/RetinaNet/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "/home/fw/Softwares/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fw/Softwares/RetinaNet/maskrcnn_benchmark/modeling/rpn/retinanet/retinanet.py", line 136, in forward
    return self.forward_train(anchors, box_cls, box_regression, targets)
  File "/home/fw/Softwares/RetinaNet/maskrcnn_benchmark/modeling/rpn/retinanet/retinanet.py", line 143, in forward_train
    anchors, box_cls, box_regression, targets
  File "/home/fw/Softwares/RetinaNet/maskrcnn_benchmark/modeling/rpn/retinanet/loss.py", line 172, in __call__
    match_quality_matrix = boxlist_iou(targets, anchors)
  File "/home/fw/Softwares/RetinaNet/maskrcnn_benchmark/structures/rboxlist_ops.py", line 167, in boxlist_iou
    overlaps_th = torch.tensor(overlaps).to(boxlist1.bbox.device)  # [N, M]
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fb1e9515813 in /home/fw/Softwares/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)

Setting CUDA_LAUNCH_BLOCKING to 1 didn't help either.

Is this problem related to this one?
I am on Ubuntu 18.04, and I have tried PyTorch 1.1.0, 1.2.0, and 1.3.0 with CUDA 9.2, 10.0, and 10.1, all under Python 3.7.4 within a conda installation. The nvidia-smi driver I am currently using is 440.26, but I have tried a bunch of others as well, and none of them work.

In my case, I get the RuntimeError: CUDA error: an illegal memory access was encountered message when I run my code on gpu 1, but it runs fine on gpu 0:

gpu=1
device = torch.device(f"cuda:{gpu}" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    torch.cuda.set_device(device)

Any ideas on how to debug this?

@jzazo
Hi, I had a similar problem.
If I only use device = torch.device("cuda:1"), I always get RuntimeError: CUDA error: an illegal memory access was encountered.

But when I select the GPU explicitly with torch.cuda.set_device(1), everything is fine.
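For reference, a minimal sketch of the two setups I am comparing (the tensor shapes are just placeholders):

import torch

# setup that raises the illegal memory access for me:
# only the tensors are sent to cuda:1, while the current device stays at cuda:0
device = torch.device("cuda:1")
x = torch.randn(4, 9, 1024, device=device)

# workaround: make cuda:1 the current device first, then run as usual
torch.cuda.set_device(1)
y = torch.randn(4, 9, 1024, device=torch.device("cuda:1"))
# presumably some ops allocate temporary buffers on the current device,
# so keeping it in sync with the tensors' device avoids cross-device accesses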


I'm getting this error as well, but it seems to depend on my batch size; I don't encounter it with smaller batch sizes.
PyTorch 1.3.1 on a V100.


@heiyuxiaokai
The first output points to an "out of memory" error.
Could you lower the batch size and rerun your code?
Are you using the code snippet from the first post or another one?

@jzazo
The original script does not use apex, so this issue should be unrelated.

@kouohhashi @dan-nadler
Are you using the script from the first post or another one?

I still cannot reproduce the error for more than 20k iterations, so I would need (another) code snippet to reproduce this issue.

I tried this MNIST example.

I added the following lines at the beginning of the script:

gpu = 1
device = torch.device(gpu if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    torch.cuda.set_device(gpu)

and device = torch.device(gpu if use_cuda else "cpu") in the main function.
I get the following error: RuntimeError: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1570710718161/work/aten/src/THC/THCGeneral.cpp:216

It's a different error from the one I was getting in my own script, but the simple example still does not run on gpu=1, while it works fine on gpu=0.

I just remembered that I followed this guide to move Xorg from the discrete GPU to Intel's integrated chip. Could this change be responsible for this strange behavior?
I will undo the change and report back with the outcome.

@dan-nadler the peak memory usage might have caused the OOM issue.

@jzazo I cannot reproduce this issue by adding your provided code to the MNIST example on an 8 GPU system (rerunning with different GPU ids).

What GPU are you using as GPU 1? If it's the Intel integrated chip, this won't work;
you would need a GPU that can execute CUDA code.
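A quick way to check which devices CUDA actually sees (a minimal sketch):

import torch

print(torch.cuda.is_available())             # False would explain the failure immediately
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))  # only CUDA-capable GPUs are listed here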

I'm having a potentially related issue as well. On a machine with 8 RTX 2080 Ti GPUs, one specific GPU (4) gives the CUDA illegal memory access issue when trying to copy from the GPU to the CPU:

# predicted = pytorch tensor on GPU
predicted = predicted.view(-1).detach().cpu().numpy()
# RuntimeError: CUDA error: an illegal memory access was encountered

Identical code runs fine on the other 7 GPUs but gives an error on this particular GPU after a random number of iterations.

Driver: 430.50
Ubuntu 18.04.3 LTS
CUDA: 10.1.243
cuDNN: 7.5.1
conda install
python: 3.7.4
pytorch:  1.1.0 py3.7_cuda10.1.243_cudnn7.6.3_0
cudatoolkit: 10.1.243
torchvision:  0.4.2

I haven't done too much playing around, but this happens fairly reliably (usually within 20-30 minutes of running) and only on this one particular GPU. Any developments on this issue before I start checking the hardware?
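In case it helps with reproducing, this is roughly how I pin a test run to the suspect GPU (the rest of the script is unchanged):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4"  # hide the other GPUs; must be set before CUDA is initialized

import torch
print(torch.cuda.device_count())  # should print 1; physical GPU 4 is now visible as cuda:0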


@knagrecha
0.4.0 is quite old by now. Could you please update to the latest stable release (1.4.0) and retry your script? Feel free to create a new issue and ping me there if you see the same error with any to('cuda') call. Or are you seeing this error with the first posted code snippet?

@bhaeffele Could you post a (minimal) executable code snippet to reproduce this error?


Hi guys,
I'm trying to train a MobileNetV2 model and I am hitting the same error. Environment: PyTorch 1.13, CUDA 11.7.
Any recommendation is appreciated; thanks in advance!

I have tried upgrading the versions (PyTorch 2.0, CUDA 11.7 => 11.8) and I still hit this problem in the training code of two models.
I don't think it is a batch size problem; I reduced the batch size and it still isn't solved.
The last line before the error is images = torch.tensor(images, device=self.device, dtype=torch.float32).
Another strange thing is that when this error occurs, the brightness of my laptop monitor drops to the lowest level; I don't know what happened :(

# last three lines in the terminal
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
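(An aside on the line mentioned above, not a fix for the crash: a minimal sketch of the more idiomatic way to write that conversion; to_float32 is just a hypothetical helper name:)

import numpy as np
import torch

def to_float32(images, device):
    # if images is already a torch.Tensor, just move/cast it
    if isinstance(images, torch.Tensor):
        return images.to(device, dtype=torch.float32)
    # otherwise (numpy array / list), construct the tensor directly on the target device
    return torch.as_tensor(images, dtype=torch.float32, device=device)

# example usage
dev = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
imgs = to_float32(np.zeros((2, 3, 224, 224), dtype=np.float32), dev)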

This had been an issue for me for a while. After updating to the nightly build (or maybe it was just a pytorch-cuda version issue), everything works fine with "ddp" training.

OS: AWS sagemaker ml.p2.8xlarge
GPU: Tesla K80 * 8
pytorch-cuda -> 11.8
pytorch-nightly -> 2.1.0.dev20230304
pytorch_lightning -> 1.94

Regarding the monitor brightness dropping: this is because the GPU utilization remains at 100% after the CUDA error and does not drop, so the GPU draws a lot of power; since a laptop power supply is not very strong, this degrades the power available to other devices and peripherals.

We were facing this problem at inference time, after hundreds of iterations.
After a lot of tests we finally found a configuration that solved the issue,
so it is very likely that the cause is some combination of CUDA and PyTorch versions.

The error appears in this configuration:

OS:     AWS sagemaker ml.g4dn.xlarge
GPU:    tesla t4
CUDA:   11.8.0
PYTORCH: 
	torch==2.0.0
	torchvision==0.15.1
IMAGE:  nvidia/cuda:11.8.0-base-ubuntu20.04

and it was solved by changing to this configuration:

OS:      AWS sagemaker ml.g4dn.xlarge
GPU:     tesla t4
CUDA:    11.3.1
PYTORCH: 
	 torch==1.12.1+cu113
	 torchvision==0.13.1+cu113
IMAGE:   nvidia/cuda:11.3.1-runtime-ubuntu20.04
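As a sanity check after switching environments, something like this confirms that the interpreter really picked up the intended build (a minimal sketch; the expected values match our working configuration above):

import torch

print(torch.__version__)               # expect 1.12.1+cu113 after the downgrade
print(torch.version.cuda)              # expect 11.3
print(torch.backends.cudnn.version())  # cuDNN build bundled with this torch wheel
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # Tesla T4 in our case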

I was having the same problem while trying to run multiple models in parallel on a Docker image (nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04).
@GuintherKovalski's solution worked for me!

So it looks like a compatibility bug between the CUDA and PyTorch versions?

"RuntimeError: CUDA error: an illegal memory access was encountered" when trying to run the plain demo. tsunghan-wu/SLD#3

I got the same error with torch==2.2.0. When I use torch==2.0.0, the error never appears.

This worked for me as well, thanks.