from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

inputs = ...  # a list of prompt strings
inputs = tokenizer.batch_encode_plus(inputs, return_tensors="pt", padding=True)
model.generate(**inputs, **generate_kwargs)

RuntimeError: probability tensor contains either inf , nan or element < 0

I got this error while running inference for text generation, specifically when the batch size is greater than 1. With a batch size of 1, I do not get the error and generation works correctly.

Does anyone see the same issue?

Here's my previous code; when it ran, this error was reported:

RuntimeError: probability tensor contains either inf, nan or element < 0

from transformers import LlamaForCausalLM
model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
model = model.half().cuda()
# here is the code for batch inference
# ...

I changed model.half() in it to model.bfloat16(), and the error was solved. I'm guessing that there is some problem with llama2's weights at FP16...
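The same fix can also be expressed at load time via from_pretrained's torch_dtype argument; a minimal sketch, assuming the same PATH_TO_CONVERTED_WEIGHTS placeholder as above:

import torch
from transformers import LlamaForCausalLM

# Load the weights directly in bfloat16 instead of casting with .half()
model = LlamaForCausalLM.from_pretrained(
    PATH_TO_CONVERTED_WEIGHTS, torch_dtype=torch.bfloat16
).cuda()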

I solved the problem by setting pad_token to "[PAD]" and padding_side to "left", as suggested here.

tokenizer.pad_token = "[PAD]"
tokenizer.padding_side = "left"

Works with torch.float16
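A minimal batch-generation sketch of this workaround, assuming a GPU is available; the model name and prompts are illustrative, not from the original comment:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer.pad_token = "[PAD]"     # pad token suggested above
tokenizer.padding_side = "left"   # left-pad so generation continues from real tokens

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16
).cuda()

prompts = ["Hello, how are you?", "Write a haiku about the sea."]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))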

I've been able to hit this issue with most of the suggested solutions: using bfloat16, using unk_token instead of eos_token as the pad_token ("[PAD]" is not in the vocab, so tokenizer.pad_token = "[PAD]" actually sets tokenizer.pad_token_id to tokenizer.unk_token_id). Some of these work okay for smaller batches and/or if the number of padding tokens added is limited (i.e., difference of lengths in the batch is comparably small), but with enough length discrepancy or sufficiently large batch size, the error still occurs.

So far, the only thing that seems to work reliably is left-padding with the bos_token.
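A minimal sketch of that bos_token left-padding setup, reusing the tokenizer, model, and prompts from the snippets above:

tokenizer.pad_token = tokenizer.bos_token
tokenizer.padding_side = "left"
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)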

Left-padding with the bos_token works for me.

Unfortunately I have to report that even when using the bos_token for left-padding, the error sometimes occurs. In my case, this happened with an inference batch size of 8 and only after successfully generating 128+ responses (16+ batches). I didn't log more detailed statistics unfortunately, but I suspect that some combination of length difference between longest and shortest prompt as well as absolute length of longest prompt might be at play.

The stopgap solution in my code is to catch the error and then switch to only batching together prompts with the same token length (prompts are already sorted by token length to minimize the number of padding tokens necessary). However, that means a considerable drop in # responses/second once this error occurs.
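A rough sketch of that stopgap, grouping prompts by tokenized length before batching; the function and variable names are hypothetical, not from the comment:

from collections import defaultdict

def batches_of_equal_length(prompts, tokenizer, batch_size=8):
    # Group prompts by tokenized length, then yield fixed-size batches
    # drawn only from within a single length bucket (no padding needed).
    buckets = defaultdict(list)
    for prompt in prompts:
        buckets[len(tokenizer(prompt)["input_ids"])].append(prompt)
    for same_length in buckets.values():
        for i in range(0, len(same_length), batch_size):
            yield same_length[i : i + batch_size]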

🌡 Have you tried increasing the temperature?

Well, try increasing the temperature value. I had a very low temperature, combined with other parameters such as top_k and top_p, which made the next-token distribution too steep. Sampling (and beam search) needs several candidate tokens with non-negligible probability, and at such a low temperature there effectively weren't any (that's just how temperature works).

So I increased the temperature and it worked.

Try increasing the temperature value and it should just work, provided no other complications are involved.

Increasing the temperature to >0.5 works in my case.
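A sketch of that suggestion with illustrative values, reusing the model and inputs from the earlier snippets:

outputs = model.generate(
    **inputs,
    do_sample=True,    # temperature/top_k/top_p only apply when sampling
    temperature=0.7,   # >0.5, per the comment above
    top_k=50,
    top_p=0.95,
    max_new_tokens=50,
)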

Switching from float16 to bfloat16 works for me too.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

<Tried 3 different methods here>

sentence = "Hello, how are you?"
inputs = tokenizer(sentence, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=50, num_return_sequences=5, temperature=0.7)
for i, output in enumerate(outputs):
    print(f"{i}: {tokenizer.decode(output)}")

Tried these 3 methods as suggested above, but to no avail.
Method 1:

model = model.bfloat16()

Method 2:

tokenizer.pad_token = "[PAD]"
tokenizer.padding_side = "left"

Method 3:

tokenizer.pad_token = tokenizer.bos_token
tokenizer.padding_side = "left"

Anyone got an idea how to solve this issue?

outputs = model.generate(**inputs, max_length=50, num_return_sequences=5, temperature=0.7)

@chuanbinp
num_return_sequences doesn't take effect unless you set do_sample=True.
Try adding that first and see what happens.

Still getting the same error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[13], line 4
      1 sentence = "Hello, how are you?"
      3 inputs = tokenizer(sentence, return_tensors="pt", padding=True)
----> 4 outputs = model.generate(**inputs, max_length=50, num_return_sequences=5, temperature=0.7, do_sample=True)
      5 for i, output in enumerate(outputs):
      6     print(f"{i}: {tokenizer.decode(output)}")
File ~/mambaforge/envs/chuan-llama/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)
File ~/mambaforge/envs/chuan-llama/lib/python3.11/site-packages/transformers/generation/utils.py:1652, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   1644     input_ids, model_kwargs = self._expand_inputs_for_generation(
   1645         input_ids=input_ids,
   1646         expand_size=generation_config.num_return_sequences,
   1647         is_encoder_decoder=self.config.is_encoder_decoder,
   1648         **model_kwargs,
   1649     )
   1651     # 13. run sample
-> 1652     return self.sample(
   1653         input_ids,
   1654         logits_processor=logits_processor,
   1655         logits_warper=logits_warper,
   1656         stopping_criteria=stopping_criteria,
   1657         pad_token_id=generation_config.pad_token_id,
   1658         eos_token_id=generation_config.eos_token_id,
   1659         output_scores=generation_config.output_scores,
   1660         return_dict_in_generate=generation_config.return_dict_in_generate,
   1661         synced_gpus=synced_gpus,
   1662         streamer=streamer,
   1663         **model_kwargs,
   1664     )
   1666 elif generation_mode == GenerationMode.BEAM_SEARCH:
   1667     # 11. prepare beam search scorer
   1668     beam_scorer = BeamSearchScorer(
   1669         batch_size=batch_size,
   1670         num_beams=generation_config.num_beams,
   (...)
   1675         max_length=generation_config.max_length,
   1676     )
File ~/mambaforge/envs/chuan-llama/lib/python3.11/site-packages/transformers/generation/utils.py:2770, in GenerationMixin.sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
   2768 # sample
   2769 probs = nn.functional.softmax(next_token_scores, dim=-1)
-> 2770 next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
   2772 # finished sentences should have their next token be a padding token
   2773 if eos_token_id is not None:
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

@chuanbinp Can you increase the temperature and top_p / top_k?

If that works, great; if it doesn't, you may need somebody else's help 🤗

I tried with temperature=0.99, do_sample=True, top_k=50, top_p=0.95 but am still facing the same error. Can someone help?

Update:
Seems like my files were corrupted. Re-downloading the models worked!

model.bfloat16() can solve this problem

This solved my issue: changing from float to bfloat.

I tried playing around with different values of temperature, do_sample=True/False, top_k, top_p, gradient clipping, and reducing the learning rate, but that didn't solve the issue.

Finally, changing the model dtype from float to bfloat solved it.

So, for me changing to bfloat16 did not fix the issue. I didn't try removing do_sample=True, but working with batches of size 1 was the thing that reliably made the error go away.

Without having looked at the internals, the error message and the fact that changing the dtype works for some people but not for others, sounds like a numerical instability problem, where an "almost zero" value happens to be slightly negative due to imprecision in the representation. So I wonder if changing the probs argument to torch.multinomial to something like

torch.maximum(probs, torch.zeros_like(probs))

would help, clamping all probabilities to at least 0. Might be worth a shot - the way I look at it, it can't be worse than crashing. But I'm not sure whether there are performance implications, and it might also have unpredictable results if the underlying issue is something else.
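A standalone illustration of that clamping idea; the tensor values are made up to mimic a tiny negative entry produced by rounding:

import torch

probs = torch.tensor([[0.7, 0.3, -1e-9]])              # slightly negative "probability"
probs = torch.maximum(probs, torch.zeros_like(probs))  # clamp everything to >= 0
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)  # no longer raises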

For the llama2 model,

print(model.config.pad_token_id, "\n", model.config.eos_token_id)

returns None and 32000.

So changing pad_token=model.config.pad_token_id to pad_token=model.config.eos_token_id in the generation config works for me.
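A sketch of that adjustment applied at generation time, reusing the model and inputs from the earlier snippets:

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    pad_token_id=model.config.eos_token_id,  # model.config.pad_token_id is None here
)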

Same here; bfloat16 works well.

Hack to get batch size > 1 to work: modify https://github.com/huggingface/transformers/blob/772307be7649e1333a933cfaa229dc0dec2fd331/src/transformers/generation/utils.py#L2650C5-L2650C5

probs = nn.functional.softmax(next_token_scores, dim=-1)
nans = torch.isnan(probs)
if nans.any():
    # indices of rows that contain at least one NaN
    idx = torch.argwhere(torch.sum(nans, 1))
    # build a one-hot distribution on the eos token (id 2) and overwrite those rows
    z = torch.zeros_like(probs[idx][0])
    z[0][2] = 1.0
    probs[idx] = z
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)

This replaces every row containing NaNs with a probability of 1.0 on the eos token (which has id 2).

If anyone is still having this problem, removing do_sample=True fixed it for me. 🤷

Edit: although it now generates the 0th token after a couple of valid tokens, so the underlying problem wasn't solved. This happens while using the bitsandbytes library with Mistral 7B.

Without do_sample=True it will simply ignore the temperature, top_k and top_p parameters.

I figured out mine. It was mainly because NaNs were introduced during fine-tuning. In my case, the quantized model output NaNs in the logits when running DPO.
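A quick way to check whether the model itself is emitting NaN/inf logits; a sketch, assuming model and inputs as in the earlier snippets:

import torch

with torch.no_grad():
    logits = model(**inputs).logits
print("has nan:", torch.isnan(logits).any().item())
print("has inf:", torch.isinf(logits).any().item())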

How do you solve this error then?

I got this error when running Llama2-7b-chat inference with nf4_config quantization:

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

I solved this by changing torch.bfloat16 to torch.float16.
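For reference, a sketch of the working setup described above; the model name is an assumption, not from the comment:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # bfloat16 triggered the error here
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=nf4_config,
    device_map="auto",
)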