I got this error while doing inference for text generation, in particular when the batch size is greater than 1. With a batch size of 1, generation works correctly and the error does not occur.
from transformers import LlamaForCausalLM
model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
model = model.half().cuda()
# here is the code for batch inference
# ...
I modified model.half() in it to model.bfloat16() and the error was solved. I'm guessing that there are some problems with llama2's weights at FP16...
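For reference, a minimal sketch of the same change applied at load time (assuming PATH_TO_CONVERTED_WEIGHTS as above; passing torch_dtype avoids loading in fp32 and then casting):

import torch
from transformers import LlamaForCausalLM

# Load the weights directly in bfloat16 instead of casting to fp16 with .half()
model = LlamaForCausalLM.from_pretrained(
    PATH_TO_CONVERTED_WEIGHTS, torch_dtype=torch.bfloat16
).cuda()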
I solved the problem by setting pad_token to "[PAD]" and padding_side to "left", as suggested here.
tokenizer.pad_token = "[PAD]"
tokenizer.padding_side = "left"
Works with torch.float16
I've still been able to hit this issue with most of the suggested solutions: using bfloat16, and using the unk_token instead of the eos_token as the pad_token ("[PAD]" is not in the vocab, so tokenizer.pad_token = "[PAD]" actually sets tokenizer.pad_token_id to tokenizer.unk_token_id). Some of these work okay for smaller batches and/or when the number of padding tokens added is limited (i.e., the difference in lengths within the batch is comparably small), but with enough length discrepancy or a sufficiently large batch size, the error still occurs.
So far, the only thing that seems to work reliably is left-padding with the bos_token.
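For reference, a minimal sketch of that setup (assuming model and tokenizer are already loaded; the example prompts are placeholders):

tokenizer.pad_token = tokenizer.bos_token  # reuse bos as the padding token
tokenizer.padding_side = "left"            # decoder-only models should be padded on the left

prompts = ["Hello, how are you?", "Write a haiku about the sea."]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

The attention mask produced by the tokenizer is what lets generate ignore the padding positions.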
> So far, the only thing that seems to work reliably is left-padding with the bos_token.

Works for me.
Unfortunately I have to report that even when using the bos_token for left-padding, the error sometimes occurs. In my case, this happened with an inference batch size of 8 and only after successfully generating 128+ responses (16+ batches). I didn't log more detailed statistics unfortunately, but I suspect that some combination of length difference between longest and shortest prompt as well as absolute length of longest prompt might be at play.
The stopgap solution in my code is to catch the error and then switch to only batching together prompts with the same token length (prompts are already sorted by token length to minimize the number of padding tokens necessary). However, that means a considerable drop in # responses/second once this error occurs.
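A minimal sketch of that stopgap (the helper name batches_by_token_length is made up for illustration; it just groups prompts by tokenized length and yields fixed-size batches):

from collections import defaultdict

def batches_by_token_length(prompts, tokenizer, max_batch_size=8):
    """Group prompts so every batch contains only prompts of the same token length."""
    groups = defaultdict(list)
    for prompt in prompts:
        groups[len(tokenizer(prompt)["input_ids"])].append(prompt)
    for same_length in groups.values():
        for i in range(0, len(same_length), max_batch_size):
            yield same_length[i : i + max_batch_size]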
🌡 Have you tried increasing the temperature?
Well, try increasing the temperature value. I had a very low temperature value along with other parameters such as top_k and top_p, which made the next-token distribution too steep; beam search needs multiple candidate tokens available, and with a very low temperature I didn't have them (because we know how temperature works, right?)
So I increased the temperature and it worked.
Try increasing the temperature value and it should just work, if there is no other complexity involved.
> 🌡 Have you tried increasing the temperature?

Increasing the temperature value to >0.5 works in my case.
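For what it's worth, a minimal sketch of generation kwargs along those lines (inputs is assumed to come from your own tokenization step, and the values are illustrative rather than recommendations):

outputs = model.generate(
    **inputs,
    do_sample=True,    # sampling must be on for temperature/top_k/top_p to have any effect
    temperature=0.7,   # a higher temperature flattens the next-token distribution
    top_k=50,
    top_p=0.95,
    max_new_tokens=50,
)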
Here's my previous code; when it runs, this error is reported:
RuntimeError: probability tensor contains either inf, nan or element < 0
from transformers import LlamaForCausalLM
model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
model = model.half().cuda()
# here is the code for batch inference
# ...
I modified model.half() in it to model.bfloat16() and the error was solved. I'm guessing that there are some problems with llama2's weights at FP16...
This works for me too.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# <Tried 3 different methods here>

sentence = "Hello, how are you?"
inputs = tokenizer(sentence, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=50, num_return_sequences=5, temperature=0.7)
for i, output in enumerate(outputs):
    print(f"{i}: {tokenizer.decode(output)}")
Tried these 3 methods as suggested above, but to no avail.
Method 1:
model = model.bfloat16()
Method 2:
tokenizer.pad_token = "[PAD]"
tokenizer.padding_side = "left"
Method 3:
tokenizer.pad_token = tokenizer.bos_token
tokenizer.padding_side = "left"
Anyone got an idea how to solve this issue?
outputs = model.generate(**inputs, max_length=50, num_return_sequences=5, temperature=0.7)
@chuanbinp
num_return_sequences doesn't work if you haven't set do_sample=True.
Try using that first and see what happens.
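i.e. the same call as above, just with sampling enabled:

outputs = model.generate(
    **inputs,
    max_length=50,
    num_return_sequences=5,
    temperature=0.7,
    do_sample=True,  # required for temperature to apply and for multiple sampled sequences
)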
Still getting the same error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[13], line 4
1 sentence = "Hello, how are you?"
3 inputs = tokenizer(sentence, return_tensors="pt", padding=True)
----> 4 outputs = model.generate(**inputs, max_length=50, num_return_sequences=5, temperature=0.7, do_sample=True)
5 for i, output in enumerate(outputs):
6 print(f"{i}: {tokenizer.decode(output)}")
File ~/mambaforge/envs/chuan-llama/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File ~/mambaforge/envs/chuan-llama/lib/python3.11/site-packages/transformers/generation/utils.py:1652, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
1644 input_ids, model_kwargs = self._expand_inputs_for_generation(
1645 input_ids=input_ids,
1646 expand_size=generation_config.num_return_sequences,
1647 is_encoder_decoder=self.config.is_encoder_decoder,
1648 **model_kwargs,
1649 )
1651 # 13. run sample
-> 1652 return self.sample(
1653 input_ids,
1654 logits_processor=logits_processor,
1655 logits_warper=logits_warper,
1656 stopping_criteria=stopping_criteria,
1657 pad_token_id=generation_config.pad_token_id,
1658 eos_token_id=generation_config.eos_token_id,
1659 output_scores=generation_config.output_scores,
1660 return_dict_in_generate=generation_config.return_dict_in_generate,
1661 synced_gpus=synced_gpus,
1662 streamer=streamer,
1663 **model_kwargs,
1664 )
1666 elif generation_mode == GenerationMode.BEAM_SEARCH:
1667 # 11. prepare beam search scorer
1668 beam_scorer = BeamSearchScorer(
1669 batch_size=batch_size,
1670 num_beams=generation_config.num_beams,
(...)
1675 max_length=generation_config.max_length,
1676 )
File ~/mambaforge/envs/chuan-llama/lib/python3.11/site-packages/transformers/generation/utils.py:2770, in GenerationMixin.sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
2768 # sample
2769 probs = nn.functional.softmax(next_token_scores, dim=-1)
-> 2770 next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
2772 # finished sentences should have their next token be a padding token
2773 if eos_token_id is not None:
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
@chuanbinp Can you increase the temperature and top_p / top_k?
If that works, great; if it doesn't, you may need somebody else's help 🤗
I tried with temperature=0.99, do_sample=True, top_k=50, top_p=0.95 but I'm still facing the same error. Can someone help?
Update:
Seems like my files were corrupted. Re-downloading the models worked!
model.bfloat16() can solve this problem.
This solved my issue: changing from float to bfloat.
I tried playing around with different values of temperature, do_sample=True/False, top_k, top_p, gradient clipping, and reducing the learning rate, but that didn't solve the issue.
Finally, changing the model dtype from float to bfloat solved it.
So, for me changing to bfloat16 did not fix the issue. I didn't try removing do_sample=True, but working with batches of size 1 was the thing that reliably made the error go away.
Without having looked at the internals, the error message and the fact that changing the dtype works for some people but not for others sounds like a numerical instability problem, where an "almost zero" value happens to be slightly negative due to imprecision in the representation. So I wonder if changing the probs argument to torch.multinomial to something like torch.maximum(probs, torch.zeros_like(probs)) would help, clamping all probabilities to at least 0. Might be worth a shot; the way I look at it, it can't be worse than crashing. But I'm not sure if there are any implications for performance, and it might also have unpredictable results if the underlying issue is something else.
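For illustration, a minimal sketch of that clamping idea as a standalone snippet rather than a patch to transformers (assuming next_token_scores holds the logits for the current step; torch.clamp(probs, min=0.0) is equivalent to the torch.maximum call above, and the renormalization is an extra step so each row still sums to 1):

import torch

probs = torch.softmax(next_token_scores, dim=-1)
probs = torch.clamp(probs, min=0.0)              # clip tiny negative values from fp imprecision
probs = probs / probs.sum(dim=-1, keepdim=True)  # renormalize each row
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)

Note that this only addresses slightly negative values; it would not help if the tensor actually contains inf or nan.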
> I modified model.half() in it to model.bfloat16() and the error was solved.
For the Llama-2 model,
print(model.config.pad_token_id, "\n", model.config.eos_token_id)
returns None and 32000, so changing "pad_token=model.config.pad_token_id" to "pad_token=model.config.eos_token_id" in the generation config works for me.
> Finally, changing the model dtype from float to bfloat solved the issue.

Same here, bfloat works well.
Hack to get bs > 1 to work: modify https://github.com/huggingface/transformers/blob/772307be7649e1333a933cfaa229dc0dec2fd331/src/transformers/generation/utils.py#L2650C5-L2650C5
probs = nn.functional.softmax(next_token_scores, dim=-1)
nans = torch.isnan(probs)
if nans.any():
    idx = torch.argwhere(torch.sum(nans, 1))
    z = torch.zeros_like(probs[idx][0])
    z[0][2] = 1.
    probs[idx] = z
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
This replaces every row that contains NaNs with a distribution that puts probability 1.0 on the eos token (which has id 2).
If anyone is still having this problem, removing do_sample=True fixed it for me. 🤷
Edit: Although it now generates token 0 after a couple of valid tokens, so the underlying problem wasn't solved. This is happening while using the bitsandbytes library with Mistral 7B.
Without do_sample=True it will simply ignore the temperature, top_k and top_p parameters.
I figured out mine. It was mainly because NaNs were introduced during fine-tuning. In my case, the quantized model output NaNs in the logits when running DPO.
How do you solve this error then?
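One way to check whether a fine-tuned model is already emitting bad logits before sampling even starts, as a minimal sketch (some_prompt is a placeholder for any prompt that triggers the error):

import torch

inputs = tokenizer(some_prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits
print("NaNs in logits:", torch.isnan(logits).any().item())
print("Infs in logits:", torch.isinf(logits).any().item())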
I got this error when running Llama2-7b-chat inference with nf4_config quantization.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
I solved this by changing torch.bfloat16 to torch.float16.
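For completeness, a minimal sketch of the config with that swap applied and the model loaded from it (the model id is the chat checkpoint used earlier in the thread):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # switched from torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", quantization_config=nf4_config
)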
On Mon, May 6, 2024, 2:43 AM allan ***@***.***> wrote:
Only setting do_sample=False solved my issue. Don't know the exact reason
about this.
[SFT error] After SFT the released LEGO-7B model on our data, the model suffer from "RuntimeError: probability tensor contains either inf, nan or element < 0"
lzw-lzw/GroundingGPT#10
For me, removing the do_sample fixed it. However, I don't understand the root problem. In my case, I used a transformers.BitsAndBytesConfig object to quantize the LLM, which already has bnb_4bit_compute_dtype=bfloat16. I was still encountering the issue.
@AnuraktKumar It sounds like you finetuned the model with that dtype (bfloat16), but did you also load the model with that BitsAndBytesConfig?
Cannot run Gemma2 27B with transformers loader: RuntimeError: probability tensor contains either inf, nan or element < 0
oobabooga/text-generation-webui#6213
On Tue, Jul 16, 2024, 1:56 AM Yuxuan Wang ***@***.***> wrote:
Does anybody figure out the reason behind this phenomenon?
> I modified model.half() in it to model.bfloat16() and the error was solved.
That is good. It solved my problem! Thank you!
I encountered this problem with an 8-bit quantized Llama2-7b-hf model; that fixed my problem when batch_size=4.
> Unfortunately I have to report that even when using the bos_token for left-padding, the error sometimes occurs.
Only setting tokenizer.pad_token = tokenizer.bos_token solved my issue. I don't know the reason. Thanks everyone for discussing!
(After an hour) The error occurred again. I found that its occurrence depends on the data: it occurred with some inputs but disappeared with others. I don't understand why.
(After another hour) Only setting do_sample=False solved my issue. I don't know the exact reason for this.
Same as you. I don't know why, either.
Because the error is raised at the sampling step; with no sampling, there is nothing to raise:
if do_sample:
    probs = nn.functional.softmax(next_token_scores, dim=-1)
    next_tokens = torch.multinomial(probs, num_samples=n_tokens_to_keep)
    next_token_scores = torch.gather(next_token_scores, -1, next_tokens)
    next_token_scores, _indices = torch.sort(next_token_scores, descending=True, dim=1)
    next_tokens = torch.gather(next_tokens, -1, _indices)
else:
    next_token_scores, next_tokens = torch.topk(
        next_token_scores, n_tokens_to_keep, dim=1, largest=True, sorted=True
    )
As mentioned by others, you can avoid this by simply disabling sampling with do_sample=False.
When you enable sampling, generation looks at the probabilities of the tokens and does clever things to generate completions based on those probabilities (depending on your generation kwargs). You can imagine that for some logit values, converting those scores to probabilities can produce nonsense values, i.e. inf, nan, etc.
Of course, if you do need sampling, e.g. with PPO training where you want diverse completions, then you need to look carefully at your generation kwargs.
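A minimal sketch of the two modes side by side (inputs assumed as in the earlier snippets):

# Greedy decoding: no sampling step, so torch.multinomial is never reached
greedy = model.generate(**inputs, max_new_tokens=50, do_sample=False)

# Sampling: tokens are drawn from the probability tensor, so bad values surface as this RuntimeError
sampled = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)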
> I modified model.half() in it to model.bfloat16() and the error was solved.
It works! But I'm still curious about the cause of this error and why this change fixes it.