I’m running this piece of code:
def forward(self, inputs, core_state=()):
    x = inputs["frame"]
    # time x batch x 64 x 64 x 3
    T, B, *_ = x.shape
    # merge time and batch
    # [T*B x 64 x 64 x 3]
    x = torch.flatten(x, 0, 1)
    x = x.float()
    # [T*B x 3 x 64 x 64]
    x = x.transpose(1, 3)
    # x = checkpoint_sequential(self.feat_extract, 2, x)
    x = self.feat_extract(x)
    x = x.view(T * B, -1)
    # core_input = checkpoint_sequential(self.fc, 2, x)
    core_input = self.fc(x)
    core_output = core_input
    core_state = tuple()
    policy_logits = self.policy(core_output)
    baseline = self.baseline(core_output)
    if self.training:
        action = torch.multinomial(F.softmax(policy_logits, dim=1), num_samples=1)
    else:
        action = torch.argmax(policy_logits, dim=1)
    policy_logits = policy_logits.view(T, B, self.num_actions)
    baseline = baseline.view(T, B)
    action = action.view(T, B)
    return dict(policy_logits=policy_logits, baseline=baseline,
                action=action), core_state
And got this error:
Traceback (most recent call last):
  File "/usr/local/easybuild-2019/easybuild/software/compiler/gcccore/8.3.0/python/3.7.4/lib/python3$
    self.run()
  File "/usr/local/easybuild-2019/easybuild/software/compiler/gcccore/8.3.0/python/3.7.4/lib/python3$
    self._target(*self._args, **self._kwargs)
  File "/data/gpfs/projects/punim1126/CL-RIDE-master-project/src/utils.py", line 335, in act
    raise e
  File "/data/gpfs/projects/punim1126/CL-RIDE-master-project/src/utils.py", line 287, in act
    agent_output, agent_state = model(env_output, agent_state)
  File "/home/shizhec/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _$
    result = self.forward(*input, **kwargs)
  File "/data/gpfs/projects/punim1126/CL-RIDE-master-project/src/models.py", line 675, in forward
    action = torch.multinomial(F.softmax(policy_logits, dim=1), num_samples=1)
RuntimeError: probability tensor contains either inf, nan or element < 0
I’m using torch 1.6.0; however, when I switch to torch 1.1.0, I don’t get the error anymore and I can train the model correctly. Does anyone know why this is happening?
# One output before the error:
tensor([[ 4.0024, -22.5107, 8.6548, -26.3529, 199.5710, -23.9216, -40.0046,
3.1537, -19.8343, -39.2633, 34.4721, -7.2311, -56.9415, -5.0400,
15.7165]])
# Where the error occurs:
tensor([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]])
Your code seems OK; something might have changed between these two versions, so you can either:
- Modify your framework and track down the origin of the first NaN value (see the sketch below), or
- Switch back to 1.1.0 for now.
I would recommend the second solution, given the complex situation you described in your previous question.
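If you do decide to hunt the NaN down, here is a minimal sketch (the helper name `safe_sample` is made up, and it assumes an otherwise standard PyTorch setup): turn on autograd anomaly detection and fail fast on bad logits before calling `multinomial`:

```python
import torch
import torch.nn.functional as F

# Surfaces the first backward operation that produces NaN/Inf,
# which usually points at the layer where things go wrong.
torch.autograd.set_detect_anomaly(True)

def safe_sample(policy_logits):
    # Fail with a readable message instead of the opaque multinomial RuntimeError
    if torch.isnan(policy_logits).any() or torch.isinf(policy_logits).any():
        raise ValueError(f"Bad policy logits detected: {policy_logits}")
    probs = F.softmax(policy_logits, dim=1)
    return torch.multinomial(probs, num_samples=1)
```

You could call `safe_sample(policy_logits)` in place of the `torch.multinomial(...)` line in `forward` to see exactly when the logits first go bad.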
The usual suspects for a NaN are:
- division by 0
- something involving the annoying log/exp calculations, like a log probability. I just located a NaN problem of my own this morning, e.g.:
a = Normal(1, 1e-23)
a.log_prob(a.sample())  # -> NaN because sigma is too small
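For illustration, a tiny standalone snippet reproducing that too-small-sigma case, plus one common workaround (the clamp threshold of 1e-6 is just an example, not a tuned value):

```python
import torch
from torch.distributions import Normal

# A scale this small underflows inside log_prob in float32
a = Normal(torch.tensor(1.0), torch.tensor(1e-23))
print(a.log_prob(a.sample()))  # nan or -inf, depending on the sample

# Workaround: clamp the scale away from zero before building the distribution
sigma = torch.tensor(1e-23).clamp(min=1e-6)
b = Normal(torch.tensor(1.0), sigma)
print(b.log_prob(b.sample()))  # finite value
```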
Try increasing the temperature value. I had a very low temperature value, along with other parameters such as top_k and top_p, which made the next-token distribution too steep. Beam search needs multiple candidate tokens to be available, and with such a low temperature I couldn’t get them (because we know how temperature works, right?). So I increased the temperature and it worked.
Try increasing the temperature value and it should just work, if there is no other complexity involved.
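As a rough illustration of that effect (the logits below are made-up numbers), temperature scaling divides the logits before the softmax, so a very low temperature collapses almost all probability mass onto a single token:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])

for temperature in (1.0, 0.1, 0.01):
    probs = F.softmax(logits / temperature, dim=0)
    print(temperature, probs)

# At T=1.0 several tokens keep noticeable probability;
# at T=0.01 effectively only the top token survives, which starves
# beam search / top-k / top-p of alternative candidates.
```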