Hi, I want to run my project on two GPUs in parallel, so after writing my code I run the following. In it I wrap my model for parallelism (DataParallel) and then add the parameter gpus=2 to pl.Trainer:
def main():
    datamodule = DataModule(train_ds, val_ds)
    mymodel = MyModel(config)
    model = nn.DataParallel(mymodel)   # wrapping the model to run in parallel
    trainer = pl.Trainer(
        accelerator="gpu",
        gpus=2,
    )
    trainer.fit(model, datamodule)

if __name__ == "__main__":
    main()
but when I run it I see this error. How should I edit my source code? (I'd appreciate it if you could help me with it.)
TypeError: Trainer.fit() requires a LightningModule, got: DataParallel
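From the error message, I guess Trainer.fit() wants the plain LightningModule rather than the nn.DataParallel wrapper, with Lightning doing the parallelization itself. Is the fix just to pass the module directly, something like this sketch (same names as above)?

def main():
    datamodule = DataModule(train_ds, val_ds)
    model = MyModel(config)        # the LightningModule itself, no nn.DataParallel
    trainer = pl.Trainer(
        accelerator="gpu",
        gpus=2,                    # let Lightning split the work across both GPUs
    )
    trainer.fit(model, datamodule)

if __name__ == "__main__":
    main()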
Looks like this is something more related to Lightning. Maybe you can post all of your code here so that we can better help you? Thanks!
Hi, yes, sure. Thank you for your response.
Here is my source code; I am sending it so it can better show my error.
This is the data module:
class DataModule(pl.LightningDataModule):
    def __init__(self, train_dataset, val_dataset, batch_size=1):
        super(DataModule, self).__init__()
        self.train_dataset = train_dataset
        self.val_dataset = val_dataset
        self.batch_size = batch_size

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size, collate_fn=collate_fn,
                          shuffle=True, num_workers=0, pin_memory=True)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=self.batch_size, collate_fn=collate_fn,
                          shuffle=False, num_workers=0, pin_memory=True)
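(The collate_fn here is defined elsewhere in my code; as a rough, hypothetical placeholder, it stacks each per-sample dict field into a batch tensor, something like this:)

import torch

# hypothetical placeholder; my real collate_fn is defined elsewhere in the file
def collate_fn(samples):
    # each sample is a dict of tensors; stack every field across the batch
    return {key: torch.stack([s[key] for s in samples]) for key in samples[0]}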
and this is part of the model:
class LaTrForVQA(pl.LightningModule):
    def __init__(self, config, learning_rate=1e-4, max_steps=100000 // 2):
        super(LaTrForVQA, self).__init__()
        self.config = config
        self.save_hyperparameters()
        self.latr = LaTr_for_finetuning(config)
        self.training_losses = []
        self.validation_losses = []
        self.max_steps = max_steps

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.hparams['learning_rate'])

    def forward(self, batch_dict):
        boxes = batch_dict['boxes']
        img = batch_dict['img']
        question = batch_dict['question']
        words = batch_dict['tokenized_words']
        answer_vector = self.latr(lang_vect=words,
                                  spatial_vect=boxes,
                                  img_vect=img,
                                  quest_vect=question)
        return answer_vector
    def calculate_metrics(self, prediction, labels):
        ## Calculate the accuracy score between the prediction and the ground-truth
        ## label for a batch, taking the pad sequence into account
        batch_size = len(prediction)
        ac_score = 0
        for (pred, gt) in zip(prediction, labels):
            ac_score += calculate_acc_score(pred.detach().cpu(), gt.detach().cpu())
        ac_score = ac_score / batch_size
        return ac_score
    def training_step(self, batch, batch_idx):
        answer_vector = self.forward(batch)
        ## https://discuss.huggingface.co/t/bertformaskedlm-s-loss-and-scores-how-the-loss-is-computed/607/2
        loss = nn.CrossEntropyLoss()(answer_vector.reshape(-1, self.config['classes']),
                                     batch['answer'].reshape(-1))
        _, preds = torch.max(answer_vector, dim=-1)

        ## Calculating the accuracy score
        train_acc = self.calculate_metrics(preds, batch['answer'])
        train_acc = torch.tensor(train_acc)

        ## Logging
        self.log('train_ce_loss', loss, prog_bar=True)
        self.log('train_acc', train_acc, prog_bar=True)
        self.training_losses.append(loss.item())
        return loss
    def validation_step(self, batch, batch_idx):
        logits = self.forward(batch)
        loss = nn.CrossEntropyLoss()(logits.reshape(-1, self.config['classes']),
                                     batch['answer'].reshape(-1))
        _, preds = torch.max(logits, dim=-1)

        ## Validation Accuracy
        val_acc = self.calculate_metrics(preds.cpu(), batch['answer'].cpu())
        val_acc = torch.tensor(val_acc)

        ## Logging
        self.log('val_ce_loss', loss, prog_bar=True)
        self.log('val_acc', val_acc, prog_bar=True)
        return {'val_loss': loss, 'val_acc': val_acc}
    def optimizer_step(self, epoch_nb, batch_nb, optimizer, optimizer_i, opt_closure=None,
                       on_tpu=False, using_native_amp=False, using_lbfgs=False):
        if self.trainer.global_step < 1000:
            # linear warmup over the first 1000 steps
            lr_scale = min(1., float(self.trainer.global_step + 1) / 1000.)
            for pg in optimizer.param_groups:
                pg['lr'] = lr_scale * self.hparams.learning_rate
        else:
            # polynomial decay afterwards
            for pg in optimizer.param_groups:
                pg['lr'] = polynomial(self.hparams.learning_rate, self.trainer.global_step,
                                      max_iter=self.max_steps)
        optimizer.step(opt_closure)
        optimizer.zero_grad()
    def validation_epoch_end(self, outputs):
        val_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        val_acc = torch.stack([x['val_acc'] for x in outputs]).mean()
        self.log('val_loss_epoch_end', val_loss, on_epoch=True, sync_dist=True)
        self.log('val_acc_epoch_end', val_acc, on_epoch=True, sync_dist=True)
        self.val_prediction = []
model = LaTrForVQA(config)
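(polynomial is my learning-rate decay helper, defined elsewhere; as a rough, hypothetical sketch it computes a polynomial decay of the learning rate like this, though the exact power may differ:)

# hypothetical sketch of the polynomial LR-decay helper used in optimizer_step;
# the real version may use a different power or a non-zero end LR
def polynomial(base_lr, iter, max_iter=100000, power=1.0):
    # decays base_lr towards 0 as iter approaches max_iter
    return base_lr * ((1 - float(iter) / max_iter) ** power)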
and this is the trainer:
trainer = pl.Trainer(
    max_steps=max_steps,
    default_root_dir="runs",
    gpus=2,
    deterministic=True,
)
Now I fit my model:
datamodule = DataModule(train_ds, val_ds)
trainer.fit(model, datamodule)
If you have any other questions, be sure to ask…
I changed my code according to some suggestions so that it does not use DataParallel and only uses pl.Trainer(gpus=2), but with this code I still can't use both GPUs in parallel: only one GPU does any work, and I don't know what other changes I should apply.
I have tried different kinds of strategy, and none of them work; each gives a different error, and I can't find any solution.
Here are some of these errors:
When I set strategy="ddp_spawn" (or ddp), it gives me this error:

MisconfigurationException: Trainer(strategy='ddp_spawn') is not compatible with an interactive environment. Run your code as a script, or choose one of the compatible strategies: Trainer(strategy=None|dp|ddp_fork). In case you are spawning processes yourself, make sure to include the Trainer creation inside the worker function.
When I set strategy="ddp_fork", it gives me this error:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
When I set strategy="dp", it gives me this error:

AssertionError: Gather function not implemented for CPU tensors
When I set strategy="ddp_notebook", it gives me this error:

ValueError: 'ddp_notebook' is not a valid DistributedType
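If I read the ddp_spawn message correctly, DDP wants the whole thing run as a plain script (not from a notebook), with the Trainer created inside the main/worker function. Is the intended setup something like this sketch (using the devices/strategy arguments from pytorch-lightning 1.8)?

# train.py, run as: python train.py
import pytorch_lightning as pl

def main():
    datamodule = DataModule(train_ds, val_ds)
    model = LaTrForVQA(config)
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,            # use both GPUs
        strategy="ddp",       # one process per GPU
        max_steps=max_steps,
        default_root_dir="runs",
        deterministic=True,
    )
    trainer.fit(model, datamodule)

if __name__ == "__main__":
    main()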
If the mistake is in my source code, what should I do?
I'd really appreciate it if you could help me with this.
If I have understood your point: first of all, I created a virtual environment in Anaconda Navigator with Python 3.7, and my code can identify both GPUs:
import torch
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Number CUDA Devices:', torch.cuda.device_count())
print('__CUDA Device Name 1:', torch.cuda.get_device_name(0))
print('__CUDA Device Name 2:', torch.cuda.get_device_name(1))
print('__CUDA Device current:', torch.cuda.get_device_name(torch.cuda.current_device()))
print('__CUDA Device Total Memory [GB]:', torch.cuda.get_device_properties(0).total_memory / 1e9)
print('__CUDA Device Total Memory [GB]:', torch.cuda.get_device_properties(1).total_memory / 1e9)
print('__CUDA Device Total Memory [GB]:', torch.cuda.get_device_properties(0).total_memory / 1e9 + torch.cuda.get_device_properties(1).total_memory / 1e9)
__CUDNN VERSION: 8302
__Number CUDA Devices: 2
__CUDA Device Name 1: NVIDIA GeForce GTX 1080 Ti
__CUDA Device Name 2: NVIDIA GeForce GTX 1070 Ti
__CUDA Device Total Memory [GB]: 11.721572352
__CUDA Device Total Memory [GB]: 8.514043904
__CUDA Device Total Memory [GB]: 20.235616256
The version of pytorch-lightning is 1.8.0.post1, and I have tried different builds of the PyTorch package; all of them give these errors.
It is worth mentioning that with the same settings I can use both of my GPUs with plain DataParallel, but I can't use multiple GPUs in pytorch-lightning.
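For reference, the plain-PyTorch setup where both GPUs do get used for me is the usual DataParallel pattern, roughly this sketch (not my exact training loop):

import torch
import torch.nn as nn

# DataParallel splits each input batch across the visible GPUs
net = nn.DataParallel(my_network)        # my_network: the plain nn.Module
net.to(torch.device("cuda:0"))
outputs = net(inputs.to("cuda:0"))       # inputs: a batch tensor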
If I have not understood your point, could you give me more guidance?