Hi, I want to run my project on two GPUs in parallel, so after writing my code I run the following. In it I wrap my model for parallelism (DataParallel) and then add the parameter gpus=2 to pl.Trainer:
def main():
    datamodule = DataModule(train_ds, val_ds)
    mymodel = MyModel(config)
    model = nn.DataParallel(mymodel)   # wrapping the model to run in parallel
    trainer = pl.Trainer(
        accelerator="gpu",
        gpus=2,
    )
    trainer.fit(model, datamodule)

if __name__ == "__main__":
    main()
but when I run it I see this error. How should I edit my source code? (I'd appreciate it if you could help me with it.)
TypeError: Trainer.fit() requires a LightningModule, got: DataParallel
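From the error message, I guess Trainer.fit() wants the plain LightningModule rather than the nn.DataParallel wrapper, with Lightning doing the parallelization itself. Is the fix just to pass the module directly, something like this sketch (same names as above)?

def main():
    datamodule = DataModule(train_ds, val_ds)
    model = MyModel(config)        # the LightningModule itself, no nn.DataParallel
    trainer = pl.Trainer(
        accelerator="gpu",
        gpus=2,                    # let Lightning split the work across both GPUs
    )
    trainer.fit(model, datamodule)

if __name__ == "__main__":
    main()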
Looks like this is something more related to Lightning. Maybe you can post all of your code here so that we can better help you? Thanks!
Hi, yes, sure. Thank you for your response.
Here is my source code; I am sending it so it can better show my error.
This is the data module:
class DataModule(pl.LightningDataModule):
    def __init__(self, train_dataset, val_dataset, batch_size=1):
        super(DataModule, self).__init__()
        self.train_dataset = train_dataset
        self.val_dataset = val_dataset
        self.batch_size = batch_size

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size, collate_fn=collate_fn,
                          shuffle=True, num_workers=0, pin_memory=True)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=self.batch_size, collate_fn=collate_fn,
                          shuffle=False, num_workers=0, pin_memory=True)
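(The collate_fn here is defined elsewhere in my code; as a rough, hypothetical placeholder, it stacks each per-sample dict field into a batch tensor, something like this:)

import torch

# hypothetical placeholder; my real collate_fn is defined elsewhere in the file
def collate_fn(samples):
    # each sample is a dict of tensors; stack every field across the batch
    return {key: torch.stack([s[key] for s in samples]) for key in samples[0]}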
and this is part of the model:
class LaTrForVQA(pl.LightningModule):
    def __init__(self, config, learning_rate=1e-4, max_steps=100000 // 2):
        super(LaTrForVQA, self).__init__()
        self.config = config
        self.save_hyperparameters()
        self.latr = LaTr_for_finetuning(config)
        self.training_losses = []
        self.validation_losses = []
        self.max_steps = max_steps

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.hparams['learning_rate'])

    def forward(self, batch_dict):
        boxes = batch_dict['boxes']
        img = batch_dict['img']
        question = batch_dict['question']
        words = batch_dict['tokenized_words']
        answer_vector = self.latr(lang_vect=words,
                                  spatial_vect=boxes,
                                  img_vect=img,
                                  quest_vect=question)
        return answer_vector
    def calculate_metrics(self, prediction, labels):
        ## Calculate the accuracy score between the prediction and the ground-truth
        ## label for a batch, taking the pad sequence into account
        batch_size = len(prediction)
        ac_score = 0
        for (pred, gt) in zip(prediction, labels):
            ac_score += calculate_acc_score(pred.detach().cpu(), gt.detach().cpu())
        ac_score = ac_score / batch_size
        return ac_score
    def training_step(self, batch, batch_idx):
        answer_vector = self.forward(batch)
        ## https://discuss.huggingface.co/t/bertformaskedlm-s-loss-and-scores-how-the-loss-is-computed/607/2
        loss = nn.CrossEntropyLoss()(answer_vector.reshape(-1, self.config['classes']),
                                     batch['answer'].reshape(-1))
        _, preds = torch.max(answer_vector, dim=-1)

        ## Calculating the accuracy score
        train_acc = self.calculate_metrics(preds, batch['answer'])
        train_acc = torch.tensor(train_acc)

        ## Logging
        self.log('train_ce_loss', loss, prog_bar=True)
        self.log('train_acc', train_acc, prog_bar=True)
        self.training_losses.append(loss.item())
        return loss
    def validation_step(self, batch, batch_idx):
        logits = self.forward(batch)
        loss = nn.CrossEntropyLoss()(logits.reshape(-1, self.config['classes']),
                                     batch['answer'].reshape(-1))
        _, preds = torch.max(logits, dim=-1)

        ## Validation Accuracy
        val_acc = self.calculate_metrics(preds.cpu(), batch['answer'].cpu())
        val_acc = torch.tensor(val_acc)

        ## Logging
        self.log('val_ce_loss', loss, prog_bar=True)
        self.log('val_acc', val_acc, prog_bar=True)
        return {'val_loss': loss, 'val_acc': val_acc}
    def optimizer_step(self, epoch_nb, batch_nb, optimizer, optimizer_i, opt_closure=None,
                       on_tpu=False, using_native_amp=False, using_lbfgs=False):
        if self.trainer.global_step < 1000:
            # linear warmup over the first 1000 steps
            lr_scale = min(1., float(self.trainer.global_step + 1) / 1000.)
            for pg in optimizer.param_groups:
                pg['lr'] = lr_scale * self.hparams.learning_rate
        else:
            # polynomial decay afterwards
            for pg in optimizer.param_groups:
                pg['lr'] = polynomial(self.hparams.learning_rate, self.trainer.global_step,
                                      max_iter=self.max_steps)
        optimizer.step(opt_closure)
        optimizer.zero_grad()
    def validation_epoch_end(self, outputs):
        val_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        val_acc = torch.stack([x['val_acc'] for x in outputs]).mean()
        self.log('val_loss_epoch_end', val_loss, on_epoch=True, sync_dist=True)
        self.log('val_acc_epoch_end', val_acc, on_epoch=True, sync_dist=True)
        self.val_prediction = []
model = LaTrForVQA(config)
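(polynomial is my learning-rate decay helper, defined elsewhere; as a rough, hypothetical sketch it computes a polynomial decay of the learning rate like this, though the exact power may differ:)

# hypothetical sketch of the polynomial LR-decay helper used in optimizer_step;
# the real version may use a different power or a non-zero end LR
def polynomial(base_lr, iter, max_iter=100000, power=1.0):
    # decays base_lr towards 0 as iter approaches max_iter
    return base_lr * ((1 - float(iter) / max_iter) ** power)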
and this is the trainer:
trainer = pl.Trainer(
    max_steps=max_steps,
    default_root_dir="runs",
    gpus=2,
    deterministic=True,
)
Now I fit my model:
datamodule = DataModule(train_ds, val_ds)
trainer.fit(model, datamodule)
If you have any other questions, be sure to ask…
I changed my code according to some suggestions so that it does not use DataParallel and only uses pl.Trainer(gpus=2), but with this code I still can't use both GPUs in parallel: only one GPU does any work, and I don't know what other changes I should apply.
I have tried different kinds of strategy, and none of them work; each gives a different error, and I can't find any solution.
Here are some of these errors:
When I set strategy="ddp_spawn" (or ddp), it gives me this error:

MisconfigurationException: Trainer(strategy='ddp_spawn') is not compatible with an interactive environment. Run your code as a script, or choose one of the compatible strategies: Trainer(strategy=None|dp|ddp_fork). In case you are spawning processes yourself, make sure to include the Trainer creation inside the worker function.
When I set strategy="ddp_fork", it gives me this error:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
When I set strategy="dp", it gives me this error:

AssertionError: Gather function not implemented for CPU tensors
When I set strategy="ddp_notebook", it gives me this error:

ValueError: 'ddp_notebook' is not a valid DistributedType
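If I read the ddp_spawn message correctly, DDP wants the whole thing run as a plain script (not from a notebook), with the Trainer created inside the main/worker function. Is the intended setup something like this sketch (using the devices/strategy arguments from pytorch-lightning 1.8)?

# train.py, run as: python train.py
import pytorch_lightning as pl

def main():
    datamodule = DataModule(train_ds, val_ds)
    model = LaTrForVQA(config)
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,            # use both GPUs
        strategy="ddp",       # one process per GPU
        max_steps=max_steps,
        default_root_dir="runs",
        deterministic=True,
    )
    trainer.fit(model, datamodule)

if __name__ == "__main__":
    main()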
If the mistake is in my source code, what should I do?
I'd really appreciate it if you could help me with this.
If I have understood your point: first of all, I created a virtual environment in Anaconda Navigator with Python 3.7, and my code can identify both GPUs:
import torch
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Number CUDA Devices:', torch.cuda.device_count())
print('__CUDA Device Name 1:', torch.cuda.get_device_name(0))
print('__CUDA Device Name 2:', torch.cuda.get_device_name(1))
print('__CUDA Device current:', torch.cuda.get_device_name(torch.cuda.current_device()))
print('__CUDA Device Total Memory [GB]:', torch.cuda.get_device_properties(0).total_memory / 1e9)
print('__CUDA Device Total Memory [GB]:', torch.cuda.get_device_properties(1).total_memory / 1e9)
print('__CUDA Device Total Memory [GB]:', torch.cuda.get_device_properties(0).total_memory / 1e9 + torch.cuda.get_device_properties(1).total_memory / 1e9)
__CUDNN VERSION: 8302
__Number CUDA Devices: 2
__CUDA Device Name 1: NVIDIA GeForce GTX 1080 Ti
__CUDA Device Name 2: NVIDIA GeForce GTX 1070 Ti
__CUDA Device Total Memory [GB]: 11.721572352
__CUDA Device Total Memory [GB]: 8.514043904
__CUDA Device Total Memory [GB]: 20.235616256
The version of pytorch-lightning is 1.8.0.post1, and I have tried different builds of the PyTorch package; all of them give these errors.
It is worth mentioning that with the same settings I can use both of my GPUs with plain DataParallel, but I can't use multiple GPUs in pytorch-lightning.
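For reference, the plain-PyTorch setup where both GPUs do get used for me is the usual DataParallel pattern, roughly this sketch (not my exact training loop):

import torch
import torch.nn as nn

# DataParallel splits each input batch across the visible GPUs
net = nn.DataParallel(my_network)        # my_network: the plain nn.Module
net.to(torch.device("cuda:0"))
outputs = net(inputs.to("cuda:0"))       # inputs: a batch tensor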
If I have not understood your point, could you give me more guidance?