I got this error while doing inference for text generation, in particular when the batch size is greater than 1. With a batch size of 1, generation works correctly and the error does not occur.
from transformers import LlamaForCausalLM
model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
model = model.half().cuda()
# here is the code for batch inference
# ...
I modified model.half() in it to model.bfloat16() and the error was solved. I'm guessing that there are some problems with llama2's weights at FP16...
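For reference, a minimal sketch of the same change applied at load time (assuming PATH_TO_CONVERTED_WEIGHTS as above; passing torch_dtype avoids loading in fp32 and then casting):

import torch
from transformers import LlamaForCausalLM

# Load the weights directly in bfloat16 instead of casting to fp16 with .half()
model = LlamaForCausalLM.from_pretrained(
    PATH_TO_CONVERTED_WEIGHTS, torch_dtype=torch.bfloat16
).cuda()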
I solved the problem by setting pad_token to "[PAD]" and padding_side to "left", as suggested here.
tokenizer.pad_token = "[PAD]"
tokenizer.padding_side = "left"
Works with torch.float16
I've still been able to hit this issue with most of the suggested solutions: using bfloat16, and using the unk_token instead of the eos_token as the pad_token ("[PAD]" is not in the vocab, so tokenizer.pad_token = "[PAD]" actually sets tokenizer.pad_token_id to tokenizer.unk_token_id). Some of these work okay for smaller batches and/or when the number of padding tokens added is limited (i.e., the difference in lengths within the batch is comparably small), but with enough length discrepancy or a sufficiently large batch size, the error still occurs.
So far, the only thing that seems to work reliably is left-padding with the bos_token.
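For reference, a minimal sketch of that setup (assuming model and tokenizer are already loaded; the example prompts are placeholders):

tokenizer.pad_token = tokenizer.bos_token  # reuse bos as the padding token
tokenizer.padding_side = "left"            # decoder-only models should be padded on the left

prompts = ["Hello, how are you?", "Write a haiku about the sea."]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

The attention mask produced by the tokenizer is what lets generate ignore the padding positions.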
> So far, the only thing that seems to work reliably is left-padding with the bos_token.

Works for me.
Unfortunately I have to report that even when using the bos_token for left-padding, the error sometimes occurs. In my case, this happened with an inference batch size of 8 and only after successfully generating 128+ responses (16+ batches). I didn't log more detailed statistics unfortunately, but I suspect that some combination of length difference between longest and shortest prompt as well as absolute length of longest prompt might be at play.
The stopgap solution in my code is to catch the error and then switch to only batching together prompts with the same token length (prompts are already sorted by token length to minimize the number of padding tokens necessary). However, that means a considerable drop in # responses/second once this error occurs.
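A minimal sketch of that stopgap (the helper name batches_by_token_length is made up for illustration; it just groups prompts by tokenized length and yields fixed-size batches):

from collections import defaultdict

def batches_by_token_length(prompts, tokenizer, max_batch_size=8):
    """Group prompts so every batch contains only prompts of the same token length."""
    groups = defaultdict(list)
    for prompt in prompts:
        groups[len(tokenizer(prompt)["input_ids"])].append(prompt)
    for same_length in groups.values():
        for i in range(0, len(same_length), max_batch_size):
            yield same_length[i : i + max_batch_size]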
🌡 Have you tried increasing the temperature?
Well, try increasing the temperature value. I had a very low temperature value along with other parameters such as top_k and top_p, which made the next-token distribution too steep; beam search needs multiple candidate tokens available, and with a very low temperature I didn't have them (because we know how temperature works, right?)
So I increased the temperature and it worked.
Try increasing the temperature value and it should just work, if there is no other complexity involved.
> 🌡 Have you tried increasing the temperature?

Increasing the temperature value to >0.5 works in my case.
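For what it's worth, a minimal sketch of generation kwargs along those lines (inputs is assumed to come from your own tokenization step, and the values are illustrative rather than recommendations):

outputs = model.generate(
    **inputs,
    do_sample=True,    # sampling must be on for temperature/top_k/top_p to have any effect
    temperature=0.7,   # a higher temperature flattens the next-token distribution
    top_k=50,
    top_p=0.95,
    max_new_tokens=50,
)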
Here's my previous code; when it runs, this error is reported:
RuntimeError: probability tensor contains either inf, nan or element < 0
from transformers import LlamaForCausalLM
model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
model = model.half().cuda()
# here is the code for batch inference
# ...
I modified model.half() in it to model.bfloat16() and the error was solved. I'm guessing that there are some problems with llama2's weights at FP16...
This works for me too.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# <Tried 3 different methods here>

sentence = "Hello, how are you?"
inputs = tokenizer(sentence, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=50, num_return_sequences=5, temperature=0.7)
for i, output in enumerate(outputs):
    print(f"{i}: {tokenizer.decode(output)}")
Tried these 3 methods as suggested above, but to no avail.
Method 1:
model = model.bfloat16()
Method 2:
tokenizer.pad_token = "[PAD]"
tokenizer.padding_side = "left"
Method 3:
tokenizer.pad_token = tokenizer.bos_token
tokenizer.padding_side = "left"
Anyone got an idea how to solve this issue?
outputs = model.generate(**inputs, max_length=50, num_return_sequences=5, temperature=0.7)
@chuanbinp
num_return_sequences doesn't work if you haven't set do_sample=True.
Try using that first and see what happens.
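i.e. the same call as above, just with sampling enabled:

outputs = model.generate(
    **inputs,
    max_length=50,
    num_return_sequences=5,
    temperature=0.7,
    do_sample=True,  # required for temperature to apply and for multiple sampled sequences
)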
Still getting the same error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[13], line 4
1 sentence = "Hello, how are you?"
3 inputs = tokenizer(sentence, return_tensors="pt", padding=True)
----> 4 outputs = model.generate(**inputs, max_length=50, num_return_sequences=5, temperature=0.7, do_sample=True)
5 for i, output in enumerate(outputs):
6 print(f"{i}: {tokenizer.decode(output)}")
File ~/mambaforge/envs/chuan-llama/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File ~/mambaforge/envs/chuan-llama/lib/python3.11/site-packages/transformers/generation/utils.py:1652, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
1644 input_ids, model_kwargs = self._expand_inputs_for_generation(
1645 input_ids=input_ids,
1646 expand_size=generation_config.num_return_sequences,
1647 is_encoder_decoder=self.config.is_encoder_decoder,
1648 **model_kwargs,
1649 )
1651 # 13. run sample
-> 1652 return self.sample(
1653 input_ids,
1654 logits_processor=logits_processor,
1655 logits_warper=logits_warper,
1656 stopping_criteria=stopping_criteria,
1657 pad_token_id=generation_config.pad_token_id,
1658 eos_token_id=generation_config.eos_token_id,
1659 output_scores=generation_config.output_scores,
1660 return_dict_in_generate=generation_config.return_dict_in_generate,
1661 synced_gpus=synced_gpus,
1662 streamer=streamer,
1663 **model_kwargs,
1664 )
1666 elif generation_mode == GenerationMode.BEAM_SEARCH:
1667 # 11. prepare beam search scorer
1668 beam_scorer = BeamSearchScorer(
1669 batch_size=batch_size,
1670 num_beams=generation_config.num_beams,
(...)
1675 max_length=generation_config.max_length,
1676 )
File ~/mambaforge/envs/chuan-llama/lib/python3.11/site-packages/transformers/generation/utils.py:2770, in GenerationMixin.sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
2768 # sample
2769 probs = nn.functional.softmax(next_token_scores, dim=-1)
-> 2770 next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
2772 # finished sentences should have their next token be a padding token
2773 if eos_token_id is not None:
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
@chuanbinp Can you increase the temperature and top_p / top_k?
If that works, great; if it doesn't, you may need somebody else's help 🤗
I tried with temperature=0.99, do_sample=True, top_k=50, top_p=0.95 but I'm still facing the same error. Can someone help?
Update:
Seems like my files were corrupted. Re-downloading the models worked!
model.bfloat16() can solve this problem.
This solved my issue: changing from float to bfloat.
I tried playing around with different values of temperature, do_sample=True/False, top_k, top_p, gradient clipping, and reducing the learning rate, but that didn't solve the issue.
Finally, changing the model dtype from float to bfloat solved it.
So, for me changing to bfloat16 did not fix the issue. I didn't try removing do_sample=True, but working with batches of size 1 was the thing that reliably made the error go away.
Without having looked at the internals, the error message and the fact that changing the dtype works for some people but not for others sounds like a numerical instability problem, where an "almost zero" value happens to be slightly negative due to imprecision in the representation. So I wonder if changing the probs argument to torch.multinomial to something like torch.maximum(probs, torch.zeros_like(probs)) would help, clamping all probabilities to at least 0. Might be worth a shot; the way I look at it, it can't be worse than crashing. But I'm not sure if there are any implications for performance, and it might also have unpredictable results if the underlying issue is something else.
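For illustration, a minimal sketch of that clamping idea as a standalone snippet rather than a patch to transformers (assuming next_token_scores holds the logits for the current step; torch.clamp(probs, min=0.0) is equivalent to the torch.maximum call above, and the renormalization is an extra step so each row still sums to 1):

import torch

probs = torch.softmax(next_token_scores, dim=-1)
probs = torch.clamp(probs, min=0.0)              # clip tiny negative values from fp imprecision
probs = probs / probs.sum(dim=-1, keepdim=True)  # renormalize each row
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)

Note that this only addresses slightly negative values; it would not help if the tensor actually contains inf or nan.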
> I modified model.half() in it to model.bfloat16() and the error was solved.
For the Llama-2 model,
print(model.config.pad_token_id, "\n", model.config.eos_token_id)
returns None and 32000, so changing "pad_token=model.config.pad_token_id" to "pad_token=model.config.eos_token_id" in the generation config works for me.
> Finally, changing the model dtype from float to bfloat solved the issue.

Same here, bfloat works well.
Hack to get bs > 1 to work: modify https://github.com/huggingface/transformers/blob/772307be7649e1333a933cfaa229dc0dec2fd331/src/transformers/generation/utils.py#L2650C5-L2650C5
probs = nn.functional.softmax(next_token_scores, dim=-1)
nans = torch.isnan(probs)
if nans.any():
    idx = torch.argwhere(torch.sum(nans, 1))
    z = torch.zeros_like(probs[idx][0])
    z[0][2] = 1.
    probs[idx] = z
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
This replaces every row that contains NaNs with a distribution that puts probability 1.0 on the eos token (which has id 2).
If anyone is still having this problem, removing do_sample=True fixed it for me. 🤷
Edit: Although it now generates token 0 after a couple of valid tokens, so the underlying problem wasn't solved. This is happening while using the bitsandbytes library with Mistral 7B.
Without do_sample=True it will simply ignore the temperature, top_k and top_p parameters.
I figured out mine. It was mainly because NaNs were introduced during fine-tuning. In my case, the quantized model output NaNs in the logits when running DPO.
How do you solve this error then?
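One way to check whether a fine-tuned model is already emitting bad logits before sampling even starts, as a minimal sketch (some_prompt is a placeholder for any prompt that triggers the error):

import torch

inputs = tokenizer(some_prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits
print("NaNs in logits:", torch.isnan(logits).any().item())
print("Infs in logits:", torch.isinf(logits).any().item())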
I got this error when running Llama2-7b-chat inference with nf4_config quantization.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
I solved this by changing torch.bfloat16 to torch.float16.
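For completeness, a minimal sketch of the config with that swap applied and the model loaded from it (the model id is the chat checkpoint used earlier in the thread):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # switched from torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", quantization_config=nf4_config
)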
On Mon, May 6, 2024, 2:43 AM allan ***@***.***> wrote:
Only setting do_sample=False solved my issue. Don't know the exact reason
about this.
[SFT error] After SFT the released LEGO-7B model on our data, the model suffer from "RuntimeError: probability tensor contains either inf, nan or element < 0"
lzw-lzw/GroundingGPT#10
For me, removing the do_sample fixed it. However, I don't understand the root problem. In my case, I used a transformers.BitsAndBytesConfig object to quantize the LLM, which already has bnb_4bit_compute_dtype=bfloat16. I was still encountering the issue.
@AnuraktKumar It sounds like you finetuned the model with that dtype (bfloat16), but did you also load the model with that BitsAndBytesConfig?
Cannot run Gemma2 27B with transformers loader: RuntimeError: probability tensor contains either inf, nan or element < 0
oobabooga/text-generation-webui#6213
On Tue, Jul 16, 2024, 1:56 AM Yuxuan Wang ***@***.***> wrote:
Does anybody figure out the reason behind this phenomenon?
> I modified model.half() in it to model.bfloat16() and the error was solved.
That is good. It solved my problem! Thank you!
I encountered this problem with an 8-bit quantized Llama2-7b-hf model; that fixed my problem when batch_size=4.
> Unfortunately I have to report that even when using the bos_token for left-padding, the error sometimes occurs.
Only setting tokenizer.pad_token = tokenizer.bos_token solved my issue. I don't know the reason. Thanks everyone for discussing!
(After an hour) The error occurred again. I found that its occurrence depends on the data: it occurred with some inputs but disappeared with others. I don't understand why.
(After another hour) Only setting do_sample=False solved my issue. I don't know the exact reason for this.
Same as you. I don't know why, either.
Because the error is raised at the sampling step; with no sampling, there is nothing to raise:
if do_sample:
    probs = nn.functional.softmax(next_token_scores, dim=-1)
    next_tokens = torch.multinomial(probs, num_samples=n_tokens_to_keep)
    next_token_scores = torch.gather(next_token_scores, -1, next_tokens)
    next_token_scores, _indices = torch.sort(next_token_scores, descending=True, dim=1)
    next_tokens = torch.gather(next_tokens, -1, _indices)
else:
    next_token_scores, next_tokens = torch.topk(
        next_token_scores, n_tokens_to_keep, dim=1, largest=True, sorted=True
    )
As mentioned by others, you can avoid this by simply disabling sampling with do_sample=False.
When you enable sampling, generation looks at the probabilities of the tokens and does clever things to generate completions based on those probabilities (depending on your generation kwargs). You can imagine that for some logit values, converting those scores to probabilities can produce nonsense values, i.e. inf, nan, etc.
Of course, if you do need sampling, e.g. with PPO training where you want diverse completions, then you need to look carefully at your generation kwargs.
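A minimal sketch of the two modes side by side (inputs assumed as in the earlier snippets):

# Greedy decoding: no sampling step, so torch.multinomial is never reached
greedy = model.generate(**inputs, max_new_tokens=50, do_sample=False)

# Sampling: tokens are drawn from the probability tensor, so bad values surface as this RuntimeError
sampled = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)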
> I modified model.half() in it to model.bfloat16() and the error was solved.
It works! But I'm still curious about the cause of this error and why this change fixes it.