I built my own dual-GPU machine and wanted to train a random model (resnet152) with torchvision to make sure the machine is ready for running experiments with PyTorch.
However, vision/references/classification/train.py did not complete the training sessions due to various CUDA errors.
(Note that I did not modify any code in the repository; the commit version is beb4bb706b5e13009cb5d5586505c6d2896d184a.)
I suspect the errors are not caused by train.py itself but by PyTorch, CUDA, or other dependencies, because the same errors also appeared at loss.backward() when using other scripts, and sometimes training failed with a segmentation fault.
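For what it's worth, the backward-pass failure can be checked outside train.py with a minimal sketch like the one below (the batch size, step count, and use of random data are arbitrary choices for illustration, not the exact scripts I used):

import torch
import torchvision

# Minimal forward/backward loop with random data on a single GPU,
# just to check whether loss.backward() alone triggers a CUDA error
# outside the reference training script.
device = torch.device("cuda:0")
model = torchvision.models.resnet152().to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

model.train()
for step in range(1000):
    images = torch.randn(32, 3, 224, 224, device=device)
    targets = torch.randint(0, 1000, (32,), device=device)
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(step, loss.item())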
Errors
Using vision/references/classification/train.py, I attempted to train the model in three different ways, all of which failed. The torchrun attempt did not show at which line of train.py the training failed, but the last two attempts show it failed at loss.backward() with different types of errors.
1. Distributed training mode (with torchrun)
This is the first attempt:
torchrun --nproc_per_node=2 train.py --model='resnet152' --data-path /home/yoshitomo/datasets/ilsvrc2012/ --print-freq 10000
After a few minutes, it failed and returned “Signal 11 (SIGSEGV) received by PID 74466”
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
| distributed init (rank 0): env://
| distributed init (rank 1): env://
Namespace(data_path='/home/yoshitomo/datasets/ilsvrc2012/', model='resnet152', device='cuda', batch_size=32, epochs=90, workers=16, opt='sgd', lr=0.1, momentum=0.9, weight_decay=0.0001, norm_weight_decay=None, bias_weight_decay=None, transformer_embedding_decay=None, label_smoothing=0.0, mixup_alpha=0.0, cutmix_alpha=0.0, lr_scheduler='steplr', lr_warmup_epochs=0, lr_warmup_method='constant', lr_warmup_decay=0.01, lr_step_size=30, lr_gamma=0.1, lr_min=0.0, print_freq=10000, output_dir='.', resume='', start_epoch=0, cache_dataset=False, sync_bn=False, test_only=False, auto_augment=None, ra_magnitude=9, augmix_severity=3, random_erase=0.0, amp=False, world_size=2, dist_url='env://', model_ema=False, model_ema_steps=32, model_ema_decay=0.99998, use_deterministic_algorithms=False, interpolation='bilinear', val_resize_size=256, val_crop_size=224, train_crop_size=224, clip_grad_norm=None, ra_sampler=False, ra_reps=3, weights=None, rank=0, gpu=0, distributed=True, dist_backend='nccl')
Loading data
Loading training data
Took 1.1964967250823975
Loading validation data
Creating data loaders
Creating model
Start training
Epoch: [0] [ 0/20019] eta: 13:36:49 lr: 0.1 img/s: 19.59049709064079 loss: 7.1077 (7.1077) acc1: 0.0000 (0.0000) acc5: 0.0000 (0.0000) time: 2.4482 data: 0.8147 max mem: 8343
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 74467 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 74466) of binary: /home/yoshitomo/anaconda3/bin/python
Traceback (most recent call last):
File "/home/yoshitomo/anaconda3/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
time : 2023-03-06_22:42:17
host : my_pc
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 74466)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 74466
=======================================================
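As a side note: since CUDA errors are raised asynchronously, the point where a signal or traceback is reported is not necessarily where the kernel actually failed. Rerunning the same command with CUDA_LAUNCH_BLOCKING=1, as below, should make errors surface at the offending call, although I have not verified whether it pinpoints anything more useful in this setup:
CUDA_LAUNCH_BLOCKING=1 torchrun --nproc_per_node=2 train.py --model='resnet152' --data-path /home/yoshitomo/datasets/ilsvrc2012/ --print-freq 10000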
2. Distributed training mode (without torchrun)
I gave it another try, but without torchrun:
python3 -m torch.distributed.launch --nproc_per_node=2 --use_env train.py --world-size 2 --model resnet152 --data-path /home/yoshitomo/datasets/ilsvrc2012/ --print-freq 1000
This time it returned “Signal 6 (SIGABRT) received by PID 75526”
/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
| distributed init (rank 1): env://
| distributed init (rank 0): env://
Namespace(data_path='/home/yoshitomo/datasets/ilsvrc2012/', model='resnet152', device='cuda', batch_size=32, epochs=90, workers=16, opt='sgd', lr=0.1, momentum=0.9, weight_decay=0.0001, norm_weight_decay=None, bias_weight_decay=None, transformer_embedding_decay=None, label_smoothing=0.0, mixup_alpha=0.0, cutmix_alpha=0.0, lr_scheduler='steplr', lr_warmup_epochs=0, lr_warmup_method='constant', lr_warmup_decay=0.01, lr_step_size=30, lr_gamma=0.1, lr_min=0.0, print_freq=1000, output_dir='.', resume='', start_epoch=0, cache_dataset=False, sync_bn=False, test_only=False, auto_augment=None, ra_magnitude=9, augmix_severity=3, random_erase=0.0, amp=False, world_size=2, dist_url='env://', model_ema=False, model_ema_steps=32, model_ema_decay=0.99998, use_deterministic_algorithms=False, interpolation='bilinear', val_resize_size=256, val_crop_size=224, train_crop_size=224, clip_grad_norm=None, ra_sampler=False, ra_reps=3, weights=None, rank=0, gpu=0, distributed=True, dist_backend='nccl')
Loading data
Loading training data
Took 1.1878349781036377
Loading validation data
Creating data loaders
Creating model
Start training
Epoch: [0] [ 0/20019] eta: 13:32:42 lr: 0.1 img/s: 19.396707388591327 loss: 7.1465 (7.1465) acc1: 0.0000 (0.0000) acc5: 0.0000 (0.0000) time: 2.4358 data: 0.7860 max mem: 8343
Epoch: [0] [ 1000/20019] eta: 0:58:49 lr: 0.1 img/s: 173.86594400494326 loss: 6.8894 (6.9586) acc1: 0.0000 (0.1467) acc5: 0.0000 (0.6119) time: 0.1842 data: 0.0000 max mem: 8343
Epoch: [0] [ 2000/20019] eta: 0:55:36 lr: 0.1 img/s: 173.47313783347485 loss: 6.7749 (6.9015) acc1: 0.0000 (0.1515) acc5: 0.0000 (0.8027) time: 0.1854 data: 0.0000 max mem: 8343
Traceback (most recent call last):
File "/home/yoshitomo/workspace/vision/references/classification/train.py", line 515, in <module>
main(args)
File "/home/yoshitomo/workspace/vision/references/classification/train.py", line 357, in main
train_one_epoch(model, criterion, optimizer, data_loader, device, epoch, args, model_ema, scaler)
File "/home/yoshitomo/workspace/vision/references/classification/train.py", line 42, in train_one_epoch
loss.backward()
File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR
[W CUDAGuardImpl.h:124] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 75527 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 75526) of binary: /home/yoshitomo/anaconda3/bin/python3
Traceback (most recent call last):
File "/home/yoshitomo/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/yoshitomo/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in <module>
main()
File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
train.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
time : 2023-03-06_23:03:47
host : my_pc
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 75526)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 75526
======================================================
3. Non-distributed training mode
I also tried to train the same model using only one GPU:
CUDA_VISIBLE_DEVICES=0 python3 train.py --model resnet152 --data-path /home/yoshitomo/datasets/ilsvrc2012/ --print-freq 10000
but it still failed, this time with RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED
Not using distributed mode
Namespace(data_path='/home/yoshitomo/datasets/ilsvrc2012/', model='resnet152', device='cuda', batch_size=32, epochs=90, workers=16, opt='sgd', lr=0.1, momentum=0.9, weight_decay=0.0001, norm_weight_decay=None, bias_weight_decay=None, transformer_embedding_decay=None, label_smoothing=0.0, mixup_alpha=0.0, cutmix_alpha=0.0, lr_scheduler='steplr', lr_warmup_epochs=0, lr_warmup_method='constant', lr_warmup_decay=0.01, lr_step_size=30, lr_gamma=0.1, lr_min=0.0, print_freq=10000, output_dir='.', resume='', start_epoch=0, cache_dataset=False, sync_bn=False, test_only=False, auto_augment=None, ra_magnitude=9, augmix_severity=3, random_erase=0.0, amp=False, world_size=1, dist_url='env://', model_ema=False, model_ema_steps=32, model_ema_decay=0.99998, use_deterministic_algorithms=False, interpolation='bilinear', val_resize_size=256, val_crop_size=224, train_crop_size=224, clip_grad_norm=None, ra_sampler=False, ra_reps=3, weights=None, distributed=False)
Loading data
Loading training data
Took 1.1773681640625
Loading validation data
Creating data loaders
Creating model
Start training
Epoch: [0] [ 0/40037] eta: 21:54:13 lr: 0.1 img/s: 22.003122653665447 loss: 7.0583 (7.0583) acc1: 0.0000 (0.0000) acc5: 0.0000 (0.0000) time: 1.9695 data: 0.5152 max mem: 8113
Traceback (most recent call last):
File "/home/yoshitomo/workspace/vision/references/classification/train.py", line 515, in <module>
main(args)
File "/home/yoshitomo/workspace/vision/references/classification/train.py", line 357, in main
train_one_epoch(model, criterion, optimizer, data_loader, device, epoch, args, model_ema, scaler)
File "/home/yoshitomo/workspace/vision/references/classification/train.py", line 42, in train_one_epoch
loss.backward()
File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
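Since this failure comes from a cuBLAS GEMM call, one way to check whether plain matrix multiplications already fail outside the training script is a loop like the sketch below (the matrix size and iteration count are arbitrary choices of mine, not something taken from train.py):

import torch

# Hammer each visible GPU with large float32 matrix multiplications to see
# whether plain cuBLAS GEMMs reproduce CUBLAS_STATUS_EXECUTION_FAILED
# outside the training script.
for device_id in range(torch.cuda.device_count()):
    device = torch.device(f"cuda:{device_id}")
    a = torch.randn(8192, 8192, device=device)
    b = torch.randn(8192, 8192, device=device)
    c = None
    for _ in range(500):
        c = a @ b
    torch.cuda.synchronize(device)
    print(f"cuda:{device_id} finished, checksum {c.sum().item():.3e}")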
Versions
Environment
NVIDIA RTX 3090 Ti x2
Ubuntu 22 LTS
NVIDIA-SMI 515.86.01
Driver Version: 515.86.01
CUDA Version: 11.7
Python 3.9
I installed torch and torchvision with the following command:
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
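As a basic sanity check of that install (this one-liner is my own addition, not part of the original setup steps), the following should at least confirm the CUDA build, the cuDNN version, and that both devices are visible:
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.device_count(), torch.backends.cudnn.version())"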
Actually, I did so at CUDA errors with CUDA 11.7 + dual RTX 3090 Ti - PyTorch Forums.
However, as I explained in this post, I feel the issues are something more fundamental (the RTX 3090 Ti cards and/or the dependencies) rather than something caused by this specific script, which is why I made the post here first.
For example, this is a similar case, where people experienced the same kind of failure and suspected that the card was broken.
Are there any stress tests available to make sure that the GPUs (NVIDIA RTX 3090 Ti x2) are not the root cause of the issues?
I am still looking for help.
Thank you
NVIDIA doesn’t supply or recommend any stress tests for GeForce cards. There are plenty of examples around the web, however.
If you have installed the CUDA toolkit on your machine, you could try running the nbody sample code or other sample codes. This is basically the suggestion/recommendation in the CUDA install guides for verifying a proper install and correct function.
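For example, with the current layout of the CUDA samples GitHub repository, nbody can be built and run in benchmark mode roughly like this (the directory layout may differ between samples versions, and the flags below are just one possible invocation, once per GPU):
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/nbody
make
./nbody -benchmark -numbodies=256000 -device=0
./nbody -benchmark -numbodies=256000 -device=1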