Hi, I want to run my project on two GPUs in parallel. In my code I wrap my model for parallelism and then pass gpus=2 to pl.Trainer:
def main():
    datamodule = DataModule(train_ds, val_ds)
    model = mymodel(config)
    model = torch.nn.DataParallel(model)  # wrapping the model for parallelism (this triggers the error below)
    trainer = pl.Trainer(
        accelerator="gpu",
        gpus=2,
    )
    trainer.fit(model, datamodule)

if __name__ == "__main__":
    main()

but when I run it I see this error. How should I edit my source code? (I'd appreciate it if you could help me with it.)

TypeError: Trainer.fit() requires a LightningModule, got: DataParallel
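For reference, the TypeError means trainer.fit() was handed a model wrapped in torch.nn.DataParallel; Lightning expects the LightningModule itself and does the multi-GPU wrapping internally. A minimal sketch of the fix, assuming mymodel is the LightningModule class used above:

def main():
    datamodule = DataModule(train_ds, val_ds)
    model = mymodel(config)  # plain LightningModule, no nn.DataParallel wrapper
    trainer = pl.Trainer(accelerator="gpu", devices=2)  # or gpus=2 on older versions
    trainer.fit(model, datamodule)  # Lightning moves the model to the GPUs itself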


Looks like this is something more related to Lightning. Could you post all of your code here so that we can better help you? Thanks!

Hi, yes, sure. Thank you for your response.
My source code is below…

I'm sending my full source so the error is easier to see.

This is the data module:

class DataModule(pl.LightningDataModule):
  def __init__(self, train_dataset, val_dataset,  batch_size = 1):
    super(DataModule, self).__init__()
    self.train_dataset = train_dataset
    self.val_dataset = val_dataset
    self.batch_size = batch_size
  def train_dataloader(self):
    return DataLoader(self.train_dataset, batch_size = self.batch_size, collate_fn = collate_fn, shuffle = True, num_workers = 0, pin_memory = True)
  def val_dataloader(self):
    return DataLoader(self.val_dataset, batch_size = self.batch_size,collate_fn = collate_fn, shuffle = False, num_workers = 0, pin_memory = True)
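The two dataloaders above rely on a collate_fn defined elsewhere in the project; it is not shown in the thread, but a hypothetical placeholder that simply stacks ready-made tensors field by field would look roughly like this:

import torch

def collate_fn(samples):
    # Hypothetical stand-in for the project's real collate_fn, which batches
    # boxes, images, questions, tokenized words and answers.
    return {key: torch.stack([s[key] for s in samples]) for key in samples[0]}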

and this is part of the model:

class LaTrForVQA(pl.LightningModule):
  def __init__(self, config , learning_rate = 1e-4, max_steps = 100000//2):
    super(LaTrForVQA, self).__init__()   
    self.config = config
    self.save_hyperparameters()
    self.latr =  LaTr_for_finetuning(config)
    self.training_losses = []
    self.validation_losses = []
    self.max_steps = max_steps
  def configure_optimizers(self):
    return torch.optim.AdamW(self.parameters(), lr = self.hparams['learning_rate'])
  def forward(self, batch_dict):
    boxes =   batch_dict['boxes']
    img =     batch_dict['img']
    question = batch_dict['question']
    words =   batch_dict['tokenized_words']
    answer_vector = self.latr(lang_vect = words,
                              spatial_vect = boxes,
                              img_vect = img,
                              quest_vect = question)
    return answer_vector
  def calculate_metrics(self, prediction, labels):
      ## Calculate the accuracy score between the prediction and ground label for a batch, with considering the pad sequence
      batch_size = len(prediction)
      ac_score = 0
      for (pred, gt) in zip(prediction, labels):
        ac_score+= calculate_acc_score(pred.detach().cpu(), gt.detach().cpu())
      ac_score = ac_score/batch_size
      return ac_score
  def training_step(self, batch, batch_idx):
    answer_vector = self.forward(batch)
    ## https://discuss.huggingface.co/t/bertformaskedlm-s-loss-and-scores-how-the-loss-is-computed/607/2
    loss = nn.CrossEntropyLoss()(answer_vector.reshape(-1,self.config['classes']), batch['answer'].reshape(-1))
    _, preds = torch.max(answer_vector, dim = -1)
    ## Calculating the accuracy score
    train_acc = self.calculate_metrics(preds, batch['answer'])
    train_acc = torch.tensor(train_acc)
    ## Logging
    self.log('train_ce_loss', loss,prog_bar = True)
    self.log('train_acc', train_acc, prog_bar = True)
    self.training_losses.append(loss.item())
    return loss
  def validation_step(self, batch, batch_idx):
    logits = self.forward(batch)
    loss = nn.CrossEntropyLoss()(logits.reshape(-1,self.config['classes']), batch['answer'].reshape(-1))
    _, preds = torch.max(logits, dim = -1)
    ## Validation Accuracy
    val_acc = self.calculate_metrics(preds.cpu(), batch['answer'].cpu())
    val_acc = torch.tensor(val_acc)
    ## Logging
    self.log('val_ce_loss', loss, prog_bar = True)
    self.log('val_acc', val_acc, prog_bar = True)    
    return {'val_loss': loss, 'val_acc': val_acc}
  def optimizer_step(self, epoch_nb, batch_nb, optimizer, optimizer_i, opt_closure = None, on_tpu=False,
    using_native_amp=False, using_lbfgs=False):
        if self.trainer.global_step < 1000:
            lr_scale = min(1., float(self.trainer.global_step + 1) / 1000.)
            for pg in optimizer.param_groups:
                pg['lr'] = lr_scale * self.hparams.learning_rate
        else:
            for pg in optimizer.param_groups:
                pg['lr'] = polynomial(self.hparams.learning_rate, self.trainer.global_step, max_iter = self.max_steps)
        optimizer.step(opt_closure)
        optimizer.zero_grad()
  def validation_epoch_end(self, outputs):       
        val_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        val_acc = torch.stack([x['val_acc'] for x in outputs]).mean()
        self.log('val_loss_epoch_end', val_loss, on_epoch=True, sync_dist=True)
        self.log('val_acc_epoch_end', val_acc, on_epoch=True, sync_dist=True)
        self.val_prediction = []
model = LaTrForVQA(config)
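Note that optimizer_step above calls a polynomial learning-rate decay helper that is not shown in the thread; a sketch of what such a helper typically looks like (an assumption, not the author's actual code):

def polynomial(base_lr, step, max_iter=100000, power=1.0):
    # Decays base_lr from its initial value down to 0 over max_iter steps
    # (linear decay when power=1.0).
    return base_lr * ((1.0 - step / max_iter) ** power)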

and this is the trainer:

trainer = pl.Trainer(
    max_steps = max_steps,
    default_root_dir="runs",
    gpus=2,
    deterministic=True,
)

Now I fit my model

datamodule = DataModule(train_ds, val_ds)
trainer.fit(model,datamodule)   
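As a side note, on pytorch-lightning 1.8 the device and strategy selection can also be made explicit; a sketch of the same Trainer with the newer arguments, assuming the code runs as a script rather than in a notebook:

trainer = pl.Trainer(
    max_steps=max_steps,
    default_root_dir="runs",
    accelerator="gpu",
    devices=2,        # replaces the deprecated gpus=2
    strategy="ddp",   # one process per GPU
    deterministic=True,
)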

If you have any other questions, be sure to ask…
I changed my code according to some suggestions so that it no longer uses DataParallel and only uses pl.Trainer(gpus=2), but with this code I still can't use both GPUs in parallel; only one GPU does any work, and I don't know what other change I should make to the source.

I have tried different strategies and none of them work; each gives me a different error, and I don't know what to do and can't find any solution.
Here are some of the errors:

When I set strategy="ddp_spawn" (or "ddp"), it gives me this error:

MisconfigurationException: Trainer(strategy='ddp_spawn') is not compatible with an interactive environment. Run your code as a script, or choose one of the compatible strategies: Trainer(strategy=None|dp|ddp_fork). In case you are spawning processes yourself, make sure to include the Trainer creation inside the worker function
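The message above is the key hint: ddp and ddp_spawn require the training code to run as a plain Python script rather than in a notebook or other interactive session. A minimal sketch of that layout, assuming the file is saved as, say, train.py and launched with python train.py:

# train.py -- everything, including Trainer creation, sits inside main(),
# which only runs under the __main__ guard.
def main():
    datamodule = DataModule(train_ds, val_ds)
    model = LaTrForVQA(config)
    trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")
    trainer.fit(model, datamodule)

if __name__ == "__main__":
    main()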

When I set strategy="ddp_fork", it gives me this error:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

When I set strategy="dp", it gives me this error:

AssertionError: Gather function not implemented for CPU tensors
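That assertion usually means a step method returned a CPU tensor: with strategy="dp" the per-GPU outputs are gathered onto GPU 0, and CPU tensors cannot be gathered. In the validation_step above, val_acc is built with torch.tensor(...) on the CPU and preds is moved to .cpu(). A sketch of one possible workaround, keeping the returned values on the device the logits live on:

def validation_step(self, batch, batch_idx):
    logits = self.forward(batch)
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, self.config['classes']),
                                 batch['answer'].reshape(-1))
    _, preds = torch.max(logits, dim=-1)
    # build val_acc on the same device as logits so "dp" can gather it
    val_acc = torch.tensor(self.calculate_metrics(preds, batch['answer']),
                           device=logits.device)
    self.log('val_ce_loss', loss, prog_bar=True)
    self.log('val_acc', val_acc, prog_bar=True)
    return {'val_loss': loss, 'val_acc': val_acc}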

When I set strategy="ddp_notebook", it gives me this error:

ValueError: 'ddp_notebook' is not a valid DistributedType

If the problem is in my source code, what should I do?
I'd really appreciate any help with this.

If I have understood your point: first of all, I created a virtual environment in Anaconda Navigator with Python 3.7, and my code can identify both GPUs:

import torch
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Number CUDA Devices:', torch.cuda.device_count())
print('__CUDA Device Name 1:', torch.cuda.get_device_name(0))
print('__CUDA Device Name 2:', torch.cuda.get_device_name(1))
print('__CUDA Device current:', torch.cuda.get_device_name(torch.cuda.current_device()))
print('__CUDA Device Total Memory [GB]:', torch.cuda.get_device_properties(0).total_memory/1e9)
print('__CUDA Device Total Memory [GB]:', torch.cuda.get_device_properties(1).total_memory/1e9)
print('__CUDA Device Total Memory [GB]:', torch.cuda.get_device_properties(0).total_memory/1e9 + torch.cuda.get_device_properties(1).total_memory/1e9)

__CUDNN VERSION: 8302
__Number CUDA Devices: 2
__CUDA Device Name 1: NVIDIA GeForce GTX 1080 Ti
__CUDA Device Name 2: NVIDIA GeForce GTX 1070 Ti
__CUDA Device Total Memory [GB]: 11.721572352
__CUDA Device Total Memory [GB]: 8.514043904
__CUDA Device Total Memory [GB]: 20.235616256

The version of pytorch-lightning is 1.8.0.post1.

and I have tried different builds of the PyTorch package; all of them give these errors.
It is worth mentioning that with the same settings I can use both of my GPUs with plain DataParallel, but I can't use multi-GPU in pytorch-lightning.
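For comparison, the plain-PyTorch data-parallel path that works on this machine wraps the network by hand, roughly like the sketch below (the variable names mirror the model's forward and are assumptions; the Lightning counterpart of this setup is strategy="dp"):

net = LaTr_for_finetuning(config).cuda()
net = torch.nn.DataParallel(net, device_ids=[0, 1])
# the forward pass is replicated across both GPUs and gathered on GPU 0
output = net(lang_vect=words, spatial_vect=boxes, img_vect=img, quest_vect=question)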

If I have not understood your point, could you give more guidance?