LLaMA 65B-PyTorch

LLaMA-33B/65B

The LLaMA model comes from: LLaMA: Open and Efficient Foundation Language Models

Touvron, Hugo, et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint arXiv:2302.13971 (2023).

Hardware configuration for LLaMA-33B/65B training:

32 x Ascend NPUs

The model is trained on the alpaca dataset.

  • Clone the repository to your own server
  • git clone https://gitee.com/ascend/AscendSpeed.git
    cd AscendSpeed
    mkdir logs
    mkdir ckpt
      # python3.8
      conda create -n test python=3.8
      conda activate test
      # install torch and torch_npu
      # ARM
      wget https://download.pytorch.org/whl/torch-2.1.0-cp38-cp38-manylinux2014_aarch64.whl
      pip install torch-2.1.0-cp38-cp38-manylinux2014_aarch64.whl
      pip install torch_npu-2.1.0.post4_XXXXXX-cp38-cp38-manylinux2014_aarch64.whl
      # X86
      #pip install torch==2.1.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
      #pip install torch_npu-2.1.0.post4_XXXXXX-cp38-cp38-manylinux2014_x86_64.whl
      # install megatron-core
      pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
      # install deepspeed and deepspeed_npu
      pip install deepspeed==0.9.2
      git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
      cd deepspeed_npu
      pip3 install -e ./
      cd ..
      # install the remaining requirements
      pip install -r requirements.txt
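
      After the install, an optional sanity check that torch and torch_npu import correctly and can see the NPUs (assuming the conda environment above is active):

      # verify torch / torch_npu and NPU visibility
      python -c "import torch; import torch_npu; print(torch_npu.npu.is_available(), torch_npu.npu.device_count())"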

        llama-33B weights

        mkdir tokenizer
        cd ./tokenizer
        # git-lfs is required
        git lfs install
        git clone https://huggingface.co/pinkmanlove/llama-33b-hf
        cd ..

        llama-65B weights

        mkdir tokenizer
        cd ./tokenizer
        # git-lfs is required
        git lfs install
        git clone https://huggingface.co/pinkmanlove/llama-65b-hf
        cd ..
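
        Because the checkpoint shards are stored with git-lfs, it is worth confirming that the large files were actually downloaded rather than left as LFS pointer files; an optional check (run inside the cloned directory):

        cd ./tokenizer/llama-65b-hf
        git lfs ls-files   # downloaded objects are listed here
        du -sh .           # total size should roughly match the full checkpoint
        cd ../..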
        1. Convert the pretrained weights from huggingface format to AscendSpeed format

          llama-33B

          mkdir model_weights
          SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
          python $SCRIPT_PATH \
                --input-model-dir ./tokenizer \
                --output-model-dir ./model_weights \
                --tensor-model-parallel-size 4 \
                --pipeline-model-parallel-size 4 \
                --merge-mlp \
                --type 30B

          llama-65B

          mkdir model_weights
          SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
          python $SCRIPT_PATH \
                --input-model-dir ./tokenizer \
                --output-model-dir ./model_weights \
                --tensor-model-parallel-size 8 \
                --pipeline-model-parallel-size 4 \
                --type 65B
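
          The number of model-parallel partitions written by the converter should match the training topology: for llama-65B, tensor-parallel 8 x pipeline-parallel 4 spans all 32 NPUs, while for llama-33B, 4 x 4 = 16 model-parallel ranks leave a data-parallel degree of 2 on 32 NPUs. An optional quick check of the converted output (the exact file layout depends on the converter version):

          # count the partition files and check the total size of the converted weights
          find ./model_weights -name "*.pt" | wc -l
          du -sh ./model_weights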
          1. Download the dataset

            # download the alpaca dataset
            wget https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
            # download the tokenizer config and (optionally) the weights:
            # https://huggingface.co/pinkmanlove/llama-33b-hf
            # https://huggingface.co/pinkmanlove/llama-65b-hf
            # change "LLaMATokenizer" to "LlamaTokenizer" in tokenizer_config.json (this is a known hf bug)
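            # for example (illustrative path only; adjust to wherever the config was cloned):
            # sed -i 's/"LLaMATokenizer"/"LlamaTokenizer"/' ./llama-33b-hf/tokenizer_config.json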
            mkdir dataset
            # for llama-65B, use --tokenizer-name-or-path llama-65b-hf instead
            python tools/preprocess_data.py --input alpaca_data.json \
                                            --output-prefix dataset/alpaca \
                                            --tokenizer-type PretrainedFromHF \
                                            --tokenizer-name-or-path llama-33b-hf \
                                            --tokenizer-not-use-fast \
                                            --handler-name GeneralInstructionHandler
             1. Configure the llama-33B/65B pretraining scripts:

               AscendSpeed/examples/llama/pretrain_llama_33B_ptd_32p.sh

              AscendSpeed/examples/llama/pretrain_llama_65B_ptd_32p.sh

              # modify the ascend-toolkit path
              export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
              export HCCL_CONNECT_TIMEOUT=1200
              export COMBINED_ENABLE=1
              # configure the vocab path, data path, etc.
              TOKENIZER_PATH=./dataset/llama_tokenizer # line 16
              DATA_PATH=./dataset/llama_text_document # line 17
              1. Launch the pretraining script:

                Launch the llama-33B pretraining script: AscendSpeed/examples/llama/pretrain_llama_33B_ptd_32p.sh

                bash examples/llama/pretrain_llama_33B_ptd_32p.sh

                Launch the llama-65B pretraining script: AscendSpeed/examples/llama/pretrain_llama_65B_ptd_32p.sh

                bash examples/llama/pretrain_llama_65B_ptd_32p.sh

                Configure the llama-33B/65B pretraining scripts for multi-node training (launch the script on every node of the cluster):

                MASTER_ADDR=localhost
                MASTER_PORT=6001
                NNODES=4
                NODE_RANK=0
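
                On every node, MASTER_ADDR/MASTER_PORT/NNODES stay the same, while NODE_RANK is set to that node's index. A minimal sketch (the IP address below is only a placeholder):

                # on the master node (rank 0), assuming its IP is 192.168.1.10:
                MASTER_ADDR=192.168.1.10
                MASTER_PORT=6001
                NNODES=4
                NODE_RANK=0
                # on the remaining nodes, keep MASTER_ADDR/MASTER_PORT/NNODES identical
                # and set NODE_RANK=1, 2, 3 respectively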

                The training log looks like the following:

                 iteration 11/50000 | consumed samples: 5632 | consumed tokens: 11534336 | elapsed time per iteration (ms): 52728.1 | learning rate: 1.499E-05 | global batch size: 512 | lm loss: 1.376514E+01 | loss scale: 65536.0 | grad norm: 459.628 | actual seqlen: 2048 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 | TFLOPs: 167.52 |
                 time (ms)

                Performance

                Performance comparison of LLaMA-33B/65B on Ascend NPUs and the reference devices:

                Token throughput (tokens/s/p): llama-33B (NPU vs. reference) and llama-65B (NPU vs. reference)

                NPU vs. reference loss and relative error:

                LLaMA-33B

                LLaMA-65B

                We support inference for text generation with LLaMA-33B and LLaMA-65B. Inference differs from pretraining; for example, we need to load the pretrained weights and set the length of the output samples:

                Configure the LLaMA-33B inference script examples/llama/generate_llama_33B_ptd.sh

                Configure the LLaMA-65B inference script examples/llama/generate_llama_65B_tp8_pp1.sh

                # modify the model weight path and the tokenizer path
                CHECKPOINT=<checkpoint-path>
                VOCAB_FILE=<vocabfile-path>
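
                The checkpoint's tensor/pipeline-parallel layout must match the script (e.g. tp8/pp1 for the 65B inference script). For example (illustrative paths only; point them at your own converted checkpoint and the huggingface tokenizer directory):

                CHECKPOINT=./llama-65b-tp8-pp1/
                VOCAB_FILE=./tokenizer/llama-65b-hf/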

                LLaMA-33B:

                bash ./examples/llama/generate_llama_33B_ptd.sh

                LLaMA-65B:

                bash ./examples/llama/generate_llama_65B_tp8_pp1.sh

                Some inference samples are shown below:

                LLaMA-33B:

                LLaMA-65B:

                Evaluation with benchmark datasets

                We use the BoolQ benchmark to evaluate our model. The benchmark can be downloaded here.

                Configure the LLaMA-33B evaluation script:

                    CHECKPOINT=./llama-33b-tp4-pp2/
                    VOCAB_FILE=./llama-33b-hf/
                    # configure the task and the data path
                    DATA_PATH="./boolq/data/test/"
                    TASK="boolq"
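                    # $DISTRIBUTED_ARGS is assumed to be defined earlier in the script;
                    # a typical single-node value (illustrative only) would be:
                    # DISTRIBUTED_ARGS="--nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6001"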
                    # configure generation parameters
                    python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/evaluation/evaluation_llama.py   \
                         --task-data-path $DATA_PATH \
                         --task $TASK\
                         --seq-length 1024 \
                         --max-new-tokens 2 \
                         --max-position-embeddings 1024 \
                         --tensor-model-parallel-size 4  \
                         --pipeline-model-parallel-size 2  \
                         --num-layers 60 \
                         --hidden-size 6656  \
                         --ffn-hidden-size 17920 \
                         --load ${CHECKPOINT}  \
                         --num-attention-heads 52  \
                         --tokenizer-type PretrainedFromHF  \
                         --tokenizer-name-or-path ${VOCAB_FILE} \
                         --tokenizer-not-use-fast \
                         --fp16  \
                         --micro-batch-size 1  \
                         --position-embedding-type rope \
                         --normalization RMSNorm \
                         --mlp-layer-fusion \
                         --seed 42
                # start the evaluation
                # llama-65B evaluation
                bash tasks/evaluation/evaluate_llama_65B_tp8_pp1.sh

                Evaluation results of LLaMA-33B and LLaMA-65B on Ascend NPUs:

                BoolQ accuracy: LLaMA-33B and LLaMA-65B
                @article{Touvron2023llama,
                  title   = {LLaMA: Open and Efficient Foundation Language Models},
                  author  = {Touvron, Hugo and others},
                  journal = {arXiv preprint arXiv:2302.13971},
                  year    = {2023}
                }