I built my own dual-GPU machine and wanted to train an arbitrary model (resnet152) with torchvision, to make sure the machine is ready for running experiments with PyTorch.

However, vision/references/classification/train.py did not complete the training sessions due to various CUDA errors.
(Note that I did not modify any code in the repository; the commit is beb4bb706b5e13009cb5d5586505c6d2896d184a.)

I feel that the errors are not caused by train.py itself but by PyTorch, CUDA, or other dependencies, because the same errors were confirmed at loss.backward() when using other scripts as well, and sometimes the training failed with a segmentation fault.
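
For reference, a standalone script along these lines (just a sketch; the batch size and random labels are arbitrary) is enough to exercise the same loss.backward() call outside train.py:

import torch
import torchvision

# Minimal forward/backward loop for resnet152 on random data, no train.py involved.
device = torch.device("cuda")
model = torchvision.models.resnet152(weights=None).to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for step in range(100):
    images = torch.randn(32, 3, 224, 224, device=device)
    targets = torch.randint(0, 1000, (32,), device=device)
    loss = criterion(model(images), targets)
    optimizer.zero_grad()
    loss.backward()  # the reported errors surface at loss.backward() (or the process segfaults)
    optimizer.step()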

Errors

Using vision/references/classification/train.py, I attempted to train the model in three different ways, and all of them failed.
The torchrun attempt did not show at which line of train.py the training failed, but the last two attempts show it failed at loss.backward(), with different types of errors.
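
A side note on debugging: CUDA errors are reported asynchronously, so the traceback usually points at the first call that notices the error rather than at the kernel that caused it. One way to get a more precise location, at the cost of much slower training, is to force synchronous kernel launches before CUDA is initialized. A minimal sketch:

import os

# Assumption: CUDA_LAUNCH_BLOCKING is read when the CUDA context is created,
# so it must be set before the first CUDA call (or exported before launching the script).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # with synchronous launches, the traceback points at the failing kernel's call site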

1. Distributed training mode (with torchrun)

This is the first attempt:

torchrun --nproc_per_node=2 train.py --model='resnet152' --data-path /home/yoshitomo/datasets/ilsvrc2012/ --print-freq 10000

After a few minutes, it failed and returned “Signal 11 (SIGSEGV) received by PID 74466”:

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
| distributed init (rank 0): env://
| distributed init (rank 1): env://
Namespace(data_path='/home/yoshitomo/datasets/ilsvrc2012/', model='resnet152', device='cuda', batch_size=32, epochs=90, workers=16, opt='sgd', lr=0.1, momentum=0.9, weight_decay=0.0001, norm_weight_decay=None, bias_weight_decay=None, transformer_embedding_decay=None, label_smoothing=0.0, mixup_alpha=0.0, cutmix_alpha=0.0, lr_scheduler='steplr', lr_warmup_epochs=0, lr_warmup_method='constant', lr_warmup_decay=0.01, lr_step_size=30, lr_gamma=0.1, lr_min=0.0, print_freq=10000, output_dir='.', resume='', start_epoch=0, cache_dataset=False, sync_bn=False, test_only=False, auto_augment=None, ra_magnitude=9, augmix_severity=3, random_erase=0.0, amp=False, world_size=2, dist_url='env://', model_ema=False, model_ema_steps=32, model_ema_decay=0.99998, use_deterministic_algorithms=False, interpolation='bilinear', val_resize_size=256, val_crop_size=224, train_crop_size=224, clip_grad_norm=None, ra_sampler=False, ra_reps=3, weights=None, rank=0, gpu=0, distributed=True, dist_backend='nccl')
Loading data
Loading training data
Took 1.1964967250823975
Loading validation data
Creating data loaders
Creating model
Start training
Epoch: [0]  [    0/20019]  eta: 13:36:49  lr: 0.1  img/s: 19.59049709064079  loss: 7.1077 (7.1077)  acc1: 0.0000 (0.0000)  acc5: 0.0000 (0.0000)  time: 2.4482  data: 0.8147  max mem: 8343
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 74467 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 74466) of binary: /home/yoshitomo/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/yoshitomo/anaconda3/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
  time      : 2023-03-06_22:42:17
  host      : my_pc
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 74466)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 74466
=======================================================
2. Distributed training mode (without torchrun)

I gave it another try without torchrun, using the (now deprecated) torch.distributed.launch module instead:

python3 -m torch.distributed.launch --nproc_per_node=2 --use_env train.py --world-size 2 --model resnet152 --data-path /home/yoshitomo/datasets/ilsvrc2012/ --print-freq 1000

This time it failed with a cuDNN error at loss.backward() and returned “Signal 6 (SIGABRT) received by PID 75526”:

/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated                                                    
and will be removed in future. Use torchrun.                                                                                                                                                               
Note that --use_env is set by default in torchrun.                                                                                                                                                         
If your script expects `--local_rank` argument to be set, please                                                                                                                                           
change it to read from `os.environ['LOCAL_RANK']` instead. See                                                                                                                                             
https://pytorch.org/docs/stable/distributed.html#launch-utility for                                                                                                                                        
further instructions                                                                                                                                                                                       
  warnings.warn(                                                                                                                                                                                           
WARNING:torch.distributed.run:                                                                                                                                                                             
*****************************************                                                                                                                                                                  
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************                                                                                                                                                                  
| distributed init (rank 1): env://                                                                                                                                                                        
| distributed init (rank 0): env://
Namespace(data_path='/home/yoshitomo/datasets/ilsvrc2012/', model='resnet152', device='cuda', batch_size=32, epochs=90, workers=16, opt='sgd', lr=0.1, momentum=0.9, weight_decay=0.0001, norm_weight_decay=None, bias_weight_decay=None, transformer_embedding_decay=None, label_smoothing=0.0, mixup_alpha=0.0, cutmix_alpha=0.0, lr_scheduler='steplr', lr_warmup_epochs=0, lr_warmup_method='constant', lr_warmup_decay=0.01, lr_step_size=30, lr_gamma=0.1, lr_min=0.0, print_freq=1000, output_dir='.', resume='', start_epoch=0, cache_dataset=False, sync_bn=False, test_only=False, auto_augment=None, ra_magnitude=9, augmix_severity=3, random_erase=0.0, amp=False, world_size=2, dist_url='env://', model_ema=False, model_ema_steps=32, model_ema_decay=0.99998, use_deterministic_algorithms=False, interpolation='bilinear', val_resize_size=256, val_crop_size=224, train_crop_size=224, clip_grad_norm=None, ra_sampler=False, ra_reps=3, weights=None, rank=0, gpu=0, distributed=True, dist_backend='nccl')
Loading data
Loading training data
Took 1.1878349781036377
Loading validation data
Creating data loaders
Creating model
Start training
Epoch: [0]  [    0/20019]  eta: 13:32:42  lr: 0.1  img/s: 19.396707388591327  loss: 7.1465 (7.1465)  acc1: 0.0000 (0.0000)  acc5: 0.0000 (0.0000)  time: 2.4358  data: 0.7860  max mem: 8343
Epoch: [0]  [ 1000/20019]  eta: 0:58:49  lr: 0.1  img/s: 173.86594400494326  loss: 6.8894 (6.9586)  acc1: 0.0000 (0.1467)  acc5: 0.0000 (0.6119)  time: 0.1842  data: 0.0000  max mem: 8343
Epoch: [0]  [ 2000/20019]  eta: 0:55:36  lr: 0.1  img/s: 173.47313783347485  loss: 6.7749 (6.9015)  acc1: 0.0000 (0.1515)  acc5: 0.0000 (0.8027)  time: 0.1854  data: 0.0000  max mem: 8343
Traceback (most recent call last):
  File "/home/yoshitomo/workspace/vision/references/classification/train.py", line 515, in <module>
    main(args)
  File "/home/yoshitomo/workspace/vision/references/classification/train.py", line 357, in main
    train_one_epoch(model, criterion, optimizer, data_loader, device, epoch, args, model_ema, scaler)
  File "/home/yoshitomo/workspace/vision/references/classification/train.py", line 42, in train_one_epoch
    loss.backward()
  File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR
[W CUDAGuardImpl.h:124] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 75527 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 75526) of binary: /home/yoshitomo/anaconda3/bin/python3
Traceback (most recent call last):
  File "/home/yoshitomo/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yoshitomo/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
train.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
  time      : 2023-03-06_23:03:47
  host      : my_pc
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 75526)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 75526
======================================================
3. Non-distributed training mode

I also tried to train the same model using only one GPU:

CUDA_VISIBLE_DEVICES=0 python3 train.py --model resnet152 --data-path /home/yoshitomo/datasets/ilsvrc2012/ --print-freq 10000

but it still failed, this time with RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED, again at loss.backward() (a per-GPU matmul check is sketched after the log below):

Not using distributed mode
Namespace(data_path='/home/yoshitomo/datasets/ilsvrc2012/', model='resnet152', device='cuda', batch_size=32, epochs=90, workers=16, opt='sgd', lr=0.1, momentum=0.9, weight_decay=0.0001, norm_weight_decay=None, bias_weight_decay=None, transformer_embedding_decay=None, label_smoothing=0.0, mixup_alpha=0.0, cutmix_alpha=0.0, lr_scheduler='steplr', lr_warmup_epochs=0, lr_warmup_method='constant', lr_warmup_decay=0.01, lr_step_size=30, lr_gamma=0.1, lr_min=0.0, print_freq=10000, output_dir='.', resume='', start_epoch=0, cache_dataset=False, sync_bn=False, test_only=False, auto_augment=None, ra_magnitude=9, augmix_severity=3, random_erase=0.0, amp=False, world_size=1, dist_url='env://', model_ema=False, model_ema_steps=32, model_ema_decay=0.99998, use_deterministic_algorithms=False, interpolation='bilinear', val_resize_size=256, val_crop_size=224, train_crop_size=224, clip_grad_norm=None, ra_sampler=False, ra_reps=3, weights=None, distributed=False)
Loading data
Loading training data
Took 1.1773681640625
Loading validation data
Creating data loaders
Creating model
Start training
Epoch: [0]  [    0/40037]  eta: 21:54:13  lr: 0.1  img/s: 22.003122653665447  loss: 7.0583 (7.0583)  acc1: 0.0000 (0.0000)  acc5: 0.0000 (0.0000)  time: 1.9695  data: 0.5152  max mem: 8113
Traceback (most recent call last):
  File "/home/yoshitomo/workspace/vision/references/classification/train.py", line 515, in <module>
    main(args)
  File "/home/yoshitomo/workspace/vision/references/classification/train.py", line 357, in main
    train_one_epoch(model, criterion, optimizer, data_loader, device, epoch, args, model_ema, scaler)
  File "/home/yoshitomo/workspace/vision/references/classification/train.py", line 42, in train_one_epoch
    loss.backward()
  File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/yoshitomo/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
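
Since the single-GPU run dies inside cublasSgemm, a check along these lines (just a sketch; matrix sizes and iteration counts are arbitrary) can show whether the failure follows one particular card:

import torch

# Run float32 matmuls (handled by cuBLAS) on each GPU separately, to see whether
# the CUBLAS_STATUS_EXECUTION_FAILED error is tied to one particular card.
for i in range(torch.cuda.device_count()):
    device = torch.device(f"cuda:{i}")
    a = torch.randn(4096, 4096, device=device)
    b = torch.randn(4096, 4096, device=device)
    for _ in range(1000):
        c = a @ b
    torch.cuda.synchronize(device)
    print(f"{device}: {torch.cuda.get_device_name(i)} OK, |c| = {c.norm().item():.3e}")
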
Versions

Environment
  • NVIDIA RTX 3090 Ti x2
  • Ubuntu 22 LTS
  • NVIDIA-SMI 515.86.01
  • Driver Version: 515.86.01
  • CUDA Version: 11.7
  • Python 3.9
  • I installed torch and torchvision with the following command:

    conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
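
    As a quick sanity check of what actually got installed (a sketch, nothing machine-specific), the versions PyTorch itself reports can be printed like this:

    import torch
    import torchvision

    # Print the versions PyTorch was built against and the detected GPUs,
    # to rule out an obvious CUDA/cuDNN or driver mismatch.
    print("torch:", torch.__version__, "torchvision:", torchvision.__version__)
    print("CUDA (build):", torch.version.cuda, "cuDNN:", torch.backends.cudnn.version())
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))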
                  

    I have also described this issue at CUDA errors with CUDA 11.7 + dual RTX 3090 Ti - PyTorch Forums.

    However, as I explained in this post, I feel that the issues are something more fundamental (the RTX 3090 Ti cards and/or the dependencies) rather than something caused by the specific script, which is why I made the post here first.
    For example, this is a similar case, where people experienced the same thing or suspected that the card was broken.

    Are there any stress tests available to make sure that the GPUs (NVIDIA RTX 3090 Ti x2) are not the root cause of the issues?
    I am still looking for help.

    Thank you

    NVIDIA doesn’t supply or recommend any stress tests for GeForce cards. There are plenty of examples around the web, however.

    If you have installed the CUDA toolkit on your machine, you could try running the nbody sample code or other sample codes; this is basically what the CUDA install guides recommend for verifying a proper install and correct function.
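
    Along those lines, since PyTorch is already installed here, a rough PyTorch-based load test (just a sketch, not a proper burn-in tool; duration and matrix size are arbitrary) can put each card under sustained load; if a card misbehaves, the script should crash with the same kind of CUDA error:

    import time
    import torch

    # Hammer one GPU with large float32 matmuls for a fixed time window and
    # report how many iterations completed without a CUDA error.
    def burn(device, seconds=300, n=8192):
        a = torch.randn(n, n, device=device)
        b = torch.randn(n, n, device=device)
        end = time.time() + seconds
        iters = 0
        while time.time() < end:
            a = a @ b
            a = a / a.norm()  # keep values bounded so the loop can run indefinitely
            iters += 1
        torch.cuda.synchronize(device)
        print(f"{device}: {iters} matmuls completed without CUDA errors")

    # Exercise each card in turn.
    for i in range(torch.cuda.device_count()):
        burn(torch.device(f"cuda:{i}"))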