I am trying to run the full-parameter training template `04_finetuning_llms_with_deepspeed` on a single 4x A100 (80 GB) machine. I already made the necessary changes to load the model locally and store the outputs locally; however, I ran into the following issue when running:

`./run_llama_ft.sh --size=7b [--as-test]`
```
Failure # 1 (occurred at 2023-08-26_07-55-34)
ray::_Inner.train() (pid=133909, ip=10.14.0.6, actor_id=f1fc079c10f51864ae2f3ac601000000, repr=TorchTrainer)
File "/home/sadra/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 394, in train
raise skipped from exception_cause(skipped)
File "/home/sadra/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
ray.get(object_ref)
ray.exceptions.RayTaskError(AssertionError): ray::_RayTrainWorker__execute.get_next() (pid=136753, ip=10.14.0.6, actor_id=1435c3fad5191d9a19669de301000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f06a4575030>)
File "/home/sadra/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
raise skipped from exception_cause(skipped)
File "/home/sadra/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
train_func(*args, **kwargs)
File "/home/sadra/ray/doc/source/templates/04_finetuning_llms_with_deepspeed/finetune_hf_llm.py", line 307, in training_function
outputs = model(**batch)
File "/home/sadra/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1768, in forward
loss = self.module(*inputs, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 827, in forward
logits = self.lm_head(hidden_states)
File "/home/sadra/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
result = hook(self, args)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/home/sadra/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function
param_coordinator.fetch_sub_module(sub_module, forward=True)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 306, in fetch_sub_module
assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 292, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {453}, 'ds_tensor.shape': torch.Size([0])}
```
This is what I see right before the error happens:

```
(pid=137497) Running: 0.0/96.0 CPU, 0.0/4.0 GPU, 3.56 MiB/1.86 GiB object_store_memory: 0%| | 20/36864 [00:00<04:52, 1
2023-08-26 07:55:34,707 ERROR tune_controller.py:1507 -- Trial task failed for trial TorchTrainer_c16fe_00000
```
I tested various scenarios: DeepSpeed 0.8, 0.9.3, and 0.10, and also Python 3.8. torch is 2.0.1 and transformers is 4.32. I tested with Ray 3.0.0.dev0, and with Ray 2.6 the errors were similar. The CUDA version of PyTorch matches nvcc, the machine has 900 GB of RAM, and I also tried disabling `zero_load`.
Can someone give me ideas on how to fix this?
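Not a fix, but a small debugging sketch that may help narrow this down. It is hypothetical (not part of the template) and only assumes what the traceback already shows: under ZeRO-3, partitioned parameters carry a `ds_status` attribute and a `ds_summary()` method. Calling this on the DeepSpeed-wrapped model right before `outputs = model(**batch)` in `training_function` prints which parameters are stuck in `NOT_AVAILABLE`, like the `(0, 4096)` one in the assertion:

```python
# Hypothetical debugging helper (not part of the template). After DeepSpeed has
# wrapped the model (the `model` that is called as `model(**batch)` in the
# traceback above), ZeRO-3 attaches `ds_status` and `ds_summary()` to every
# partitioned parameter; printing them shows which ones are not AVAILABLE.
def dump_zero3_param_status(wrapped_model):
    for name, param in wrapped_model.named_parameters():
        # Only ZeRO-3 partitioned parameters carry ds_status.
        if hasattr(param, "ds_status"):
            print(name, param.ds_status, param.ds_summary())
```

Comparing the output for the `lm_head` weight against the other parameters should show whether it was ever registered and partitioned correctly.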
Same error on our side with `transformers>=4.32.0` and the Ray Torch Trainer.
```
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 494, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=True)
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
```
- `transformers` version: 4.33.0
- Platform: Linux-3.10.0-11…60.83.1.el7.x86_64-x86_64-with-glibc2.35
- Python version: 3.10.9
- Huggingface_hub version: 0.15.1
- Safetensors version: 0.3.2
- Accelerate version: 0.22.0
- Accelerate config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: DEEPSPEED
  - mixed_precision: bf16
  - use_cpu: False
  - debug: False
  - num_processes: 64
  - machine_rank: 0
  - num_machines: 8
  - main_process_ip: 127.0.0.1
  - main_process_port: 29500
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - deepspeed_config: {'deepspeed_multinode_launcher': 'standard', 'gradient_accumulation_steps': 1, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
### Who can help?
@pacman100
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
### Reproduction
I've encountered an issue while training with the combination of Trainer and DeepSpeed ZeRO stage 3. When invoking `model.resize_token_embeddings`, an AssertionError arises during training. This was not an issue in transformers version 4.31.0; however, for versions > 4.31.0 and on the main branch, the problem persists. I suspect this might be related to PR https://github.com/huggingface/transformers/pull/25394
File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 310, in fetch_sub_module
return func(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 310, in fetch_sub_module
assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionErrorassert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 292, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {453}, 'ds_tensor.shape': torch.Size([0])}
code:
```python
# test.py
from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    HfArgumentParser,
    TrainingArguments,
    DataCollatorForSeq2Seq,
    Trainer,
)
from datasets import Dataset


def main():
    parser = HfArgumentParser(TrainingArguments)
    training_args, = parser.parse_args_into_dataclasses()

    model_path = "/path/to/Llama-2-7b-hf"
    config = AutoConfig.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        config=config,
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        use_fast=False,
        model_max_length=1024,
    )

    add_new_tokens = True
    if add_new_tokens:
        # deepspeed AssertionError
        tokenizer.add_special_tokens({"pad_token": "<pad>"})
        model.resize_token_embeddings(len(tokenizer))
    else:
        # it works
        tokenizer.pad_token = tokenizer.eos_token

    def gen():
        for _ in range(100):
            yield {"input_ids": [1, 2, 3], "labels": [1, 1, 1]}

    datasets = Dataset.from_generator(gen)
    datasets.set_format("pt")

    trainer = Trainer(
        model=model,
        args=training_args,
        tokenizer=tokenizer,
        data_collator=DataCollatorForSeq2Seq(
            tokenizer=tokenizer, model=model, max_length=tokenizer.model_max_length
        ),
        train_dataset=datasets,
    )
    trainer.train()


if __name__ == "__main__":
    main()
```
Launch script:
```shell
deepspeed test.py \
    --deepspeed configs/zero3_hf.conf \
    --output_dir output/test/ \
    --do_train \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --report_to "none"
```
DeepSpeed config (`configs/zero3_hf.conf`):
```json
{
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1e5,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
```
### Expected behavior
No error; training should proceed as it does with transformers 4.31.0.
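One possible temporary mitigation (an untested idea, not something confirmed in this thread): pin `transformers==4.31.0`, which the report above says is unaffected, or perform the token addition and embedding resize in a separate, non-DeepSpeed preprocessing step and point the ZeRO-3 training run at the resized checkpoint, so that `resize_token_embeddings` never touches partitioned weights. A minimal sketch of the offline resize, reusing the placeholder model path from the repro (the resized output directory is hypothetical):

```python
# resize_offline.py -- hypothetical preprocessing sketch, not part of either
# report. The resize happens on a plain CPU-loaded model, outside DeepSpeed,
# so ZeRO-3 never sees a partially initialized embedding.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/Llama-2-7b-hf"            # same placeholder as the repro
resized_path = "/path/to/Llama-2-7b-hf-resized"  # hypothetical output directory

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
tokenizer.add_special_tokens({"pad_token": "<pad>"})

model = AutoModelForCausalLM.from_pretrained(model_path)  # no ZeRO-3 context here
model.resize_token_embeddings(len(tokenizer))

model.save_pretrained(resized_path)
tokenizer.save_pretrained(resized_path)
```

The ZeRO-3 run would then load the resized checkpoint and skip the `resize_token_embeddings` call entirely.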
Pinned requirements:
```
torchvision==0.15.1
torchaudio==2.0.1
git+https://github.com/huggingface/transformers.git@d0c1aeb
deepspeed==0.10.0
fairscale==0.4.13
peft==0.5.0
datasets==2.14.4
accelerate==0.21.0
evaluate==0.4.0
bitsandbytes==0.41.1
wandb==0.15.8
pytorch-lightning==2.0.6
protobuf<3.21.0
torchmetrics==1.0.3
lm_eval==0.3.0
tiktoken==0.1.2
sentencepiece==0.1.99
urllib3<1.27
```
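Since this thread compares several version combinations, here is a small hypothetical helper (not part of the pins above) that prints the locally installed versions of the packages most relevant to this error, in the same `name==version` format as the list:

```python
# Hypothetical helper: report the installed versions of the packages most
# relevant to this thread so environments can be compared against the pins.
import importlib.metadata as md

for pkg in ("transformers", "deepspeed", "accelerate", "torch", "ray"):
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")
```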