I am trying to run the full-parameter training template `04_finetuning_llms_with_deepspeed` on a single 4x A100 (80 GB) machine. I already made the necessary changes to load the model locally and store the outputs locally; however, I ran into the following issue when running:

`./run_llama_ft.sh --size=7b [--as-test]`
```
Failure # 1 (occurred at 2023-08-26_07-55-34)
ray::_Inner.train() (pid=133909, ip=10.14.0.6, actor_id=f1fc079c10f51864ae2f3ac601000000, repr=TorchTrainer)
File "/home/sadra/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 394, in train
raise skipped from exception_cause(skipped)
File "/home/sadra/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
ray.get(object_ref)
ray.exceptions.RayTaskError(AssertionError): ray::_RayTrainWorker__execute.get_next() (pid=136753, ip=10.14.0.6, actor_id=1435c3fad5191d9a19669de301000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f06a4575030>)
File "/home/sadra/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
raise skipped from exception_cause(skipped)
File "/home/sadra/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
train_func(*args, **kwargs)
File "/home/sadra/ray/doc/source/templates/04_finetuning_llms_with_deepspeed/finetune_hf_llm.py", line 307, in training_function
outputs = model(**batch)
File "/home/sadra/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1768, in forward
loss = self.module(*inputs, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 827, in forward
logits = self.lm_head(hidden_states)
File "/home/sadra/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
result = hook(self, args)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/home/sadra/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function
param_coordinator.fetch_sub_module(sub_module, forward=True)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sadra/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 306, in fetch_sub_module
assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 292, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {453}, 'ds_tensor.shape': torch.Size([0])}
```
This is what I see right before the error happens:

```
(pid=137497) Running: 0.0/96.0 CPU, 0.0/4.0 GPU, 3.56 MiB/1.86 GiB object_store_memory: 0%| | 20/36864 [00:00<04:52, 1
2023-08-26 07:55:34,707 ERROR tune_controller.py:1507 -- Trial task failed for trial TorchTrainer_c16fe_00000
```
I tested various scenarios: DeepSpeed 0.8, 0.9.3, and 0.10, and also Python 3.8. torch is 2.0.1 and transformers is 4.32. I tested with Ray 3.0.0.dev0, and with Ray 2.6 the errors were similar. The CUDA version of PyTorch matches nvcc, the machine has 900 GB of RAM, and I also tried disabling `zero_load`.
Can someone give me ideas on how to fix this?
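Not a fix, but a small debugging sketch that may help narrow this down. It is hypothetical (not part of the template) and only assumes what the traceback already shows: under ZeRO-3, partitioned parameters carry a `ds_status` attribute and a `ds_summary()` method. Calling this on the DeepSpeed-wrapped model right before `outputs = model(**batch)` in `training_function` prints which parameters are stuck in `NOT_AVAILABLE`, like the `(0, 4096)` one in the assertion:

```python
# Hypothetical debugging helper (not part of the template). After DeepSpeed has
# wrapped the model (the `model` that is called as `model(**batch)` in the
# traceback above), ZeRO-3 attaches `ds_status` and `ds_summary()` to every
# partitioned parameter; printing them shows which ones are not AVAILABLE.
def dump_zero3_param_status(wrapped_model):
    for name, param in wrapped_model.named_parameters():
        # Only ZeRO-3 partitioned parameters carry ds_status.
        if hasattr(param, "ds_status"):
            print(name, param.ds_status, param.ds_summary())
```

Comparing the output for the `lm_head` weight against the other parameters should show whether it was ever registered and partitioned correctly.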
Same error on our side with `transformers>=4.32.0` and the Ray Torch Trainer.
```
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 494, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=True)
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
```
- `transformers` version: 4.33.0
- Platform: Linux-3.10.0-11…60.83.1.el7.x86_64-x86_64-with-glibc2.35
- Python version: 3.10.9
- Huggingface_hub version: 0.15.1
- Safetensors version: 0.3.2
- Accelerate version: 0.22.0
- Accelerate config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: DEEPSPEED
  - mixed_precision: bf16
  - use_cpu: False
  - debug: False
  - num_processes: 64
  - machine_rank: 0
  - num_machines: 8
  - main_process_ip: 127.0.0.1
  - main_process_port: 29500
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - deepspeed_config: {'deepspeed_multinode_launcher': 'standard', 'gradient_accumulation_steps': 1, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
### Who can help?
@pacman100
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
### Reproduction
I've encountered an issue while training with the combination of Trainer and DeepSpeed ZeRO stage 3. When invoking `model.resize_token_embeddings`, an AssertionError arises during training. This was not an issue in transformers version 4.31.0; however, for versions > 4.31.0 and on the main branch, the problem persists. I suspect this might be related to PR https://github.com/huggingface/transformers/pull/25394
File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 310, in fetch_sub_module
return func(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 310, in fetch_sub_module
assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionErrorassert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 292, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {453}, 'ds_tensor.shape': torch.Size([0])}
code:
```python
# test.py
from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    HfArgumentParser,
    TrainingArguments,
    DataCollatorForSeq2Seq,
    Trainer,
)
from datasets import Dataset


def main():
    parser = HfArgumentParser(TrainingArguments)
    training_args, = parser.parse_args_into_dataclasses()

    model_path = "/path/to/Llama-2-7b-hf"
    config = AutoConfig.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        config=config,
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        use_fast=False,
        model_max_length=1024,
    )

    add_new_tokens = True
    if add_new_tokens:
        # deepspeed AssertionError
        tokenizer.add_special_tokens({"pad_token": "<pad>"})
        model.resize_token_embeddings(len(tokenizer))
    else:
        # it works
        tokenizer.pad_token = tokenizer.eos_token

    def gen():
        for _ in range(100):
            yield {"input_ids": [1, 2, 3], "labels": [1, 1, 1]}

    datasets = Dataset.from_generator(gen)
    datasets.set_format("pt")

    trainer = Trainer(
        model=model,
        args=training_args,
        tokenizer=tokenizer,
        data_collator=DataCollatorForSeq2Seq(
            tokenizer=tokenizer, model=model, max_length=tokenizer.model_max_length
        ),
        train_dataset=datasets,
    )
    trainer.train()


if __name__ == "__main__":
    main()
```
Launch script:
```shell
deepspeed test.py \
    --deepspeed configs/zero3_hf.conf \
    --output_dir output/test/ \
    --do_train \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --report_to "none"
```
DeepSpeed config (`configs/zero3_hf.conf`):
```json
{
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1e5,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
```
### Expected behavior
No error; training should proceed as it does with transformers 4.31.0.
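One possible temporary mitigation (an untested idea, not something confirmed in this thread): pin `transformers==4.31.0`, which the report above says is unaffected, or perform the token addition and embedding resize in a separate, non-DeepSpeed preprocessing step and point the ZeRO-3 training run at the resized checkpoint, so that `resize_token_embeddings` never touches partitioned weights. A minimal sketch of the offline resize, reusing the placeholder model path from the repro (the resized output directory is hypothetical):

```python
# resize_offline.py -- hypothetical preprocessing sketch, not part of either
# report. The resize happens on a plain CPU-loaded model, outside DeepSpeed,
# so ZeRO-3 never sees a partially initialized embedding.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/Llama-2-7b-hf"            # same placeholder as the repro
resized_path = "/path/to/Llama-2-7b-hf-resized"  # hypothetical output directory

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
tokenizer.add_special_tokens({"pad_token": "<pad>"})

model = AutoModelForCausalLM.from_pretrained(model_path)  # no ZeRO-3 context here
model.resize_token_embeddings(len(tokenizer))

model.save_pretrained(resized_path)
tokenizer.save_pretrained(resized_path)
```

The ZeRO-3 run would then load the resized checkpoint and skip the `resize_token_embeddings` call entirely.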
Pinned requirements:
```
torchvision==0.15.1
torchaudio==2.0.1
git+https://github.com/huggingface/transformers.git@d0c1aeb
deepspeed==0.10.0
fairscale==0.4.13
peft==0.5.0
datasets==2.14.4
accelerate==0.21.0
evaluate==0.4.0
bitsandbytes==0.41.1
wandb==0.15.8
pytorch-lightning==2.0.6
protobuf<3.21.0
torchmetrics==1.0.3
lm_eval==0.3.0
tiktoken==0.1.2
sentencepiece==0.1.99
urllib3<1.27
```
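Since this thread compares several version combinations, here is a small hypothetical helper (not part of the pins above) that prints the locally installed versions of the packages most relevant to this error, in the same `name==version` format as the list:

```python
# Hypothetical helper: report the installed versions of the packages most
# relevant to this thread so environments can be compared against the pins.
import importlib.metadata as md

for pkg in ("transformers", "deepspeed", "accelerate", "torch", "ray"):
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")
```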