<p align="left"><b>简体中文</b> | <b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/llama/README_en.md">English</a></b></p>

# LLaMA-7B/13B

## Training

Hardware configuration for LLaMA-7B/13B training:

| Hardware | Device | Configuration    |
|:--------:|:------:|:----------------:|
| NPU      | A+K    | 8 x Ascend NPUs  |
| NPU      | A+X    | 16 x Ascend NPUs |

Software configuration for LLaMA-7B/13B training:

| Software                  | Configuration |
|:-------------------------:|:-------------:|
| python                    | 3.7.5         |
| driver                    | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| firmware                  | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| CANN                      | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| binary arithmetic package | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| torch                     | 1.11.0        |
| torch_npu                 | [package](https://gitee.com/ascend/pytorch/releases) |

### Script

1. Clone the repository to your local server:

   ```shell
   git clone https://gitee.com/ascend/AscendSpeed.git
   cd AscendSpeed
   mkdir logs
   mkdir ckpt
   ```

2. Set up the environment:

   ```bash
   # python 3.7
   conda create -n test python=3.7
   conda activate test

   # install torch and torch_npu
   pip install torch-1.11.0-cp37-cp37m-linux_aarch64.whl
   pip install torch_npu-1.11.0.post4_XXXXXX-cp37-cp37m-linux_aarch64.whl

   # install megatron-core
   pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/[email protected]#egg=megatron-core

   # install deepspeed and deepspeed_npu
   pip install deepspeed==0.9.2
   git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
   cd deepspeed_npu
   pip3 install -e ./
   cd ..

   # install the remaining dependencies
   pip install -r requirements.txt
   ```

3. Download the LLaMA-7B/13B [vocabulary files](https://huggingface.co/yahma/llama-7b-hf/tree/main):

   ```shell
   #!/bin/bash
   mkdir -p dataset/llama
   cd ./dataset/llama
   wget https://huggingface.co/yahma/llama-7b-hf/resolve/main/config.json
   wget https://huggingface.co/yahma/llama-7b-hf/resolve/main/generation_config.json
   wget https://huggingface.co/yahma/llama-7b-hf/resolve/main/special_tokens_map.json
   wget https://huggingface.co/yahma/llama-7b-hf/resolve/main/tokenizer.model
   wget https://huggingface.co/yahma/llama-7b-hf/resolve/main/tokenizer_config.json
   cd ../..
   ```

4. Download the LLaMA-7B/13B [dataset](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet):

   ```shell
   cd dataset/
   wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
   cd ..
   ```

   Then preprocess it:

   ```shell
   #!/bin/bash
   source /usr/local/Ascend/ascend-toolkit/set_env.sh
   python ./tools/preprocess_data.py \
       --input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
       --tokenizer-name-or-path ./dataset/llama \
       --output-prefix ./dataset/llama \
       --workers 4 \
       --log-interval 1000 \
       --tokenizer-type PretrainedFromHF \
       --handler-name AlpacaPretrainHandler \
       --tokenizer-not-use-fast \
       --append-eod
   ```
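   As a quick sanity check, a minimal sketch: it assumes the preprocessor writes an indexed dataset whose file names start with the `--output-prefix` above, matching the `./dataset/llama_text_document` path the pre-training script points at in step 6.

   ```shell
   # the indexed dataset files should exist and be non-trivially sized
   ls -lh ./dataset/llama_text_document*
   # the tokenizer model from step 3 should be binary data, not an HTML error page
   file ./dataset/llama/tokenizer.model
   ```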
5. Convert the weights.

   Note: to load huggingface pre-trained weights, you first need to fix a deepspeed weight-loading bug. In the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py`, change `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0`:

   ```text
   # original deepspeed/runtime/engine.py, around lines 2746-2748
   zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
   if zero_sd_list is None:
       return False

   # after the change
   zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
   if zero_sd_list is None or len(zero_sd_list) == 0:
       return False
   ```

   Download the LLaMA-7B [weights](https://huggingface.co/yahma/llama-7b-hf/tree/main) or the LLaMA-13B [weights](https://huggingface.co/yahma/llama-13b-hf/tree/main):

   ```shell
   mkdir model_from_hf
   cd ./model_from_hf
   # git-lfs must be installed
   git clone https://huggingface.co/yahma/llama-7b-hf
   cd ..
   ```

   Convert the model weights from huggingface format to AscendSpeed format.

   LLaMA-7B:

   ```shell
   mkdir model_weights
   SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
   python $SCRIPT_PATH \
       --input-model-dir ./model_from_hf/llama-7b/ \
       --output-model-dir ./model_weights/llama-7b \
       --tensor-model-parallel-size 1 \
       --pipeline-model-parallel-size 1 \
       --type 7B \
       --deepspeed
   ```

   LLaMA-13B:

   ```shell
   # single machine, 8 devices
   mkdir model_weights
   SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
   python $SCRIPT_PATH \
       --input-model-dir ./model_from_hf/llama-13b/ \
       --output-model-dir ./model_weights/llama-13b \
       --tensor-model-parallel-size 1 \
       --pipeline-model-parallel-size 8 \
       --type 13B

   # single machine, 16 devices
   mkdir model_weights
   SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
   python $SCRIPT_PATH \
       --input-model-dir ./model_from_hf/llama-13b/ \
       --output-model-dir ./model_weights/llama-13b \
       --tensor-model-parallel-size 1 \
       --pipeline-model-parallel-size 2 \
       --type 13B
   ```
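   If you prefer to script the engine.py change above, a minimal sketch; it assumes the `if zero_sd_list is None:` line appears exactly once in the installed file, which the final grep lets you confirm:

   ```shell
   # locate the installed deepspeed package
   DS_DIR=$(python -c "import deepspeed, os; print(os.path.dirname(deepspeed.__file__))")
   # apply the one-line patch described above
   sed -i 's/if zero_sd_list is None:/if zero_sd_list is None or len(zero_sd_list) == 0:/' "$DS_DIR/runtime/engine.py"
   # confirm the change
   grep -n "zero_sd_list is None" "$DS_DIR/runtime/engine.py"
   ```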
6. Configure the LLaMA-7B/13B pre-training script:

   ```shell
   # set the ascend-toolkit path
   source /usr/local/Ascend/ascend-toolkit/set_env.sh

   # modify the dataset, weight, and vocabulary paths as needed
   TOKENIZER_PATH=./dataset/llama  # tokenizer path
   DATA=./dataset/llama_text_document  # dataset path
   CHECKPOINT=./model_weights/
   ```

   If you do not need to load weights, remove the `--load` flag.

7. Launch the LLaMA-7B/13B pre-training script (a sketch for monitoring the run follows the launch commands below).

   LLaMA-7B:

   ```shell
   bash examples/llama/pretrain_llama_7B_zero_8p.sh
   ```

   LLaMA-13B:

   ```shell
   # single machine, 8 devices
   bash examples/llama/pretrain_llama_13B_ptd_8p.sh
   # single machine, 16 devices
   bash examples/llama/pretrain_llama_13B_ptd_16p.sh
   ```
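   To follow a run while it trains, a sketch: the log file name is hypothetical, assuming the scripts write their output under the `logs` directory created in step 1, and the `lm loss` field name matches the sample training log shown later in this document.

   ```shell
   # watch training iterations as they are appended
   tail -f logs/*.log
   # pull out the most recent loss values recorded so far
   grep -o "lm loss: [0-9.E+-]*" logs/*.log | tail -n 5
   ```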
### Performance

#### Throughput

Throughput comparison of LLaMA-7B/13B on **Ascend chips** and the **reference chips**:

| Device    | Hardware  | Model     | Iterations | Sample throughput (samples/p/s) | Token throughput (tokens/p/s) | Time per iteration (s/step) | Compute (TFLOPs/s) |
|-----------|-----------|-----------|------------|---------------------------------|-------------------------------|-----------------------------|--------------------|
| NPUs      | 910 1*8p  | LLaMA-7B  | 2048       | 1.80                            | 3686                          | 4.44                        | 156.5              |
| Reference | -         | LLaMA-7B  | 2048       | 1.85                            | 3788                          | 4.31                        | 161.1              |
| NPUs      | 910 1*8p  | LLaMA-13B | 2048       | 0.956                           | 1958                          | 16.70                       | 212.25             |
| NPUs      | 910 1*16p | LLaMA-13B | 2048       | 0.88                            | 1800                          | 36.32                       | 195.58             |
| Reference | -         | LLaMA-13B | 2048       | 0.98                            | 2012                          | 16.33                       | 217.37             |

LLaMA-7B NPU vs. reference loss:

![NPU-Loss-with-weight-and-Relative-Error](../../sources/images/llama/llama7b-loss-with-weight.png)

LLaMA-13B NPU vs. reference loss:

![NPU-Loss-with-weight-and-Relative-Error](../../sources/images/llama/llama13b-loss-with-weight.png)

## Inference

We support inference for text generation with LLaMA-7B and LLaMA-13B. Inference differs from pre-training in that, for example, we need to load the pre-trained weights and set the length of the output samples.

Configure the LLaMA-7B inference script `examples/llama/generate_llama_7B_deepspeed.sh` and the LLaMA-13B inference script `examples/llama/generate_llama_13B_tp8_pp1.sh`:

```shell
# modify the model weight path and the tokenizer path
CHECKPOINT=<checkpoint-path>
VOCAB_FILE=<vocabfile-path>
```

LLaMA-7B:

```shell
bash ./examples/llama/generate_llama_7B_deepspeed.sh
```

LLaMA-13B:

```shell
bash ./examples/llama/generate_llama_13B_tp8_pp1.sh
```

Some inference samples:

LLaMA-7B:

![llama-7B_generate.png](../../sources/images/llama/llama-7B_generate.png)

LLaMA-13B:

![llama-13B_generate.png](../../sources/images/llama/llama-13B_generate.png)

## Evaluation with benchmark datasets

We use the BBH benchmark to evaluate our models. Download the benchmark [here](https://huggingface.co/datasets/lukaemon/bbh).

Configure the LLaMA-7B evaluation script:

```shell
CHECKPOINT=./llama-7b-tp4-pp2/
VOCAB_FILE=./llama-7b-hf/
# configure the task and the data path
DATA_PATH="./bbh/data/test/"
TASK="bbh"
# configure the generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/evaluation/evaluation_llama.py \
    --task-data-path $DATA_PATH \
    --task $TASK \
    --seq-length 2048 \
    --max-new-tokens 32 \
    --max-position-embeddings 2048 \
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 2 \
    --num-layers 32 \
    --hidden-size 4096 \
    --ffn-hidden-size 11008 \
    --load ${CHECKPOINT} \
    --num-attention-heads 32 \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path $VOCAB_FILE \
    --tokenizer-not-use-fast \
    --fp16 \
    --micro-batch-size 1 \
    --seed 42 | tee logs/evaluation.log
```

Configure the LLaMA-13B evaluation script:

```shell
CHECKPOINT=./llama-13b-tp1-pp8/
VOCAB_FILE=./llama-13b-hf/
# configure the task and the data path
DATA_PATH="./bbh/data/test/"
TASK="bbh"
# configure the parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/evaluation/evaluation_llama.py \
    --task-data-path $DATA_PATH \
    --task $TASK \
    --seq-length 2048 \
    --max-new-tokens 32 \
    --max-position-embeddings 2048 \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 8 \
    --num-layers 40 \
    --hidden-size 5120 \
    --ffn-hidden-size 13824 \
    --load ${CHECKPOINT} \
    --num-attention-heads 40 \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path $VOCAB_FILE \
    --tokenizer-not-use-fast \
    --fp16 \
    --micro-batch-size 1 \
    --seed 42 | tee logs/evaluation.log
```

Start the evaluation:

```shell
bash tasks/evaluation/eval.sh
```

Evaluation results of LLaMA-7B/13B on **Ascend NPU**:

| Task | Model | NPU score | Community score |
|-----------------------------------------------------|-----------|------|------|
| [BBH](https://huggingface.co/datasets/lukaemon/bbh) | LLaMA-7B  | 33.7 | [33.5](https://opencompass.org.cn/dataset-detail/BBH) |
| [BBH](https://huggingface.co/datasets/lukaemon/bbh) | LLaMA-13B | 38.7 | [37.9](https://opencompass.org.cn/dataset-detail/BBH) |

# LLaMA-33B/65B

The LLaMA model comes from: [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/pdf/2302.13971v1.pdf)

> Touvron, Hugo, et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint arXiv:2302.13971 (2023).

## Training

Hardware configuration for LLaMA-33B/65B training:

| Hardware | Configuration   |
|:--------:|:---------------:|
| NPU      | 8 x Ascend NPUs |

Software configuration for LLaMA-33B/65B training:

| Software                  | Configuration |
|:-------------------------:|:-------------:|
| python                    | 3.7           |
| driver                    | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| firmware                  | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| CANN                      | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| binary arithmetic package | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| torch                     | 1.11.0        |
| torch_npu                 | [package](https://gitee.com/ascend/pytorch/releases) |

### Dataset

The models are trained on the alpaca dataset.

### Script

1. Clone the repository to your local server:

   ```shell
   git clone https://gitee.com/ascend/AscendSpeed.git
   cd AscendSpeed
   mkdir logs
   mkdir ckpt
   ```

2. Set up the environment:

   ```shell
   # python 3.7
   conda create -n test python=3.7
   conda activate test

   # install torch and torch_npu
   # ARM
   wget https://download.pytorch.org/whl/torch-1.11.0-cp37-cp37m-manylinux2014_aarch64.whl
   pip install torch-1.11.0-cp37-cp37m-manylinux2014_aarch64.whl
   pip install torch_npu-1.11.0.post4_XXXXXX-cp37-cp37m-linux_aarch64.whl
   # X86
   #pip install torch==1.11 -i https://pypi.tuna.tsinghua.edu.cn/simple
   #pip install torch_npu-1.11.0.post4_XXXXXX-cp37-cp37m-linux_x86_64.whl

   # install megatron-core
   pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/[email protected]#egg=megatron-core

   # install deepspeed and deepspeed_npu
   pip install deepspeed==0.9.2
   git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
   cd deepspeed_npu
   pip3 install -e ./
   cd ..

   # install the other dependencies
   pip install -r requirements.txt
   ```

3. Download the weights.

   llama-33B weights:

   ```shell
   mkdir tokenizer
   cd ./tokenizer
   # git-lfs must be installed
   git lfs install
   git clone https://huggingface.co/pinkmanlove/llama-33b-hf
   cd ..
   ```

   llama-65B weights:

   ```shell
   mkdir tokenizer
   cd ./tokenizer
   # git-lfs must be installed
   git lfs install
   git clone https://huggingface.co/pinkmanlove/llama-65b-hf
   cd ..
   ```

4. Convert the pre-trained weights from huggingface format to AscendSpeed format.

   llama-33B:

   ```shell
   mkdir model_weights
   SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
   python $SCRIPT_PATH \
       --input-model-dir ./tokenizer \
       --output-model-dir ./model_weights \
       --tensor-model-parallel-size 8 \
       --pipeline-model-parallel-size 4 \
       --type 33B
   ```

   llama-65B:

   ```shell
   mkdir model_weights
   SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
   python $SCRIPT_PATH \
       --input-model-dir ./tokenizer \
       --output-model-dir ./model_weights \
       --tensor-model-parallel-size 8 \
       --pipeline-model-parallel-size 4 \
       --type 65B
   ```

5. Download and preprocess the dataset (a scripted version of the tokenizer_config.json fix follows this block):

   ```shell
   # download the alpaca dataset
   wget https://github.com/tatsu-lab/stanford_alpaca/raw/main/alpaca_data.json

   # download the tokenizer configuration and (optionally) the weights from:
   # https://huggingface.co/pinkmanlove/llama-33b-hf
   # https://huggingface.co/pinkmanlove/llama-65b-hf
   # then change "LLaMATokenizer" in tokenizer_config.json to "LlamaTokenizer" (a huggingface bug)

   mkdir dataset
   # for llama-65B, use --tokenizer-name-or-path llama-65b-hf instead
   python tools/preprocess_data.py --input alpaca_data.json \
       --output-prefix dataset/alpaca \
       --tokenizer-type PretrainedFromHF \
       --tokenizer-name-or-path llama-33b-hf \
       --tokenizer-not-use-fast \
       --handler-name GeneralInstructionHandler
   ```
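   The tokenizer_config.json edit mentioned above can be scripted; a minimal sketch, assuming the config sits in the cloned `llama-33b-hf` directory:

   ```shell
   # replace the broken class name with the one transformers expects
   sed -i 's/"LLaMATokenizer"/"LlamaTokenizer"/' ./llama-33b-hf/tokenizer_config.json
   # verify the change
   grep tokenizer_class ./llama-33b-hf/tokenizer_config.json
   ```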
6. Configure the llama-33B/65B pre-training scripts AscendSpeed/examples/llama/pretrain_llama_33B_zero_32p.sh and AscendSpeed/examples/llama/pretrain_llama_65B_ptd_32p.sh:

   ```bash
   # modify the ascend-toolkit path
   export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
   export HCCL_CONNECT_TIMEOUT=1200
   export COMBINED_ENABLE=1

   # configure the vocabulary and data paths, etc.
   TOKENIZER_PATH=./dataset/llama_tokenizer  # line 16
   DATA_PATH=./dataset/llama_text_document  # line 17
   ```

7. Launch the pre-training scripts.

   Launch the llama-33B pre-training script AscendSpeed/examples/llama/pretrain_llama_33B_zero_32p.sh:

   ```bash
   bash examples/llama/pretrain_llama_33B_zero_32p.sh
   ```

   Launch the llama-65B pre-training script AscendSpeed/examples/llama/pretrain_llama_65B_ptd_32p.sh:

   ```bash
   bash examples/llama/pretrain_llama_65B_ptd_32p.sh
   ```

   For multi-node llama-33B/65B pre-training, configure the cluster variables and launch the script on every node in the cluster:

   ```shell
   MASTER_ADDR=localhost  # on a real cluster, set this to the master node's address
   MASTER_PORT=6001
   NNODES=4
   NODE_RANK=0  # set to this node's index (0..NNODES-1) on each machine
   ```

   The training log looks like this:

   ```shell
   iteration 3/50000 | consumed samples: 768 | consumed tokens: 1572864 | elapsed time per iteration (ms): 33818.0 | learning rate: 1.406E-07 | global batch size: 256 | lm loss: 1.200820E+01 | loss scale: 1.0 | grad norm: 9.216 | actual seqlen: 2048 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 7.570 | TFLOPs: 107.09 | time (ms)
   ```

### Performance

#### Throughput

Throughput comparison of LLaMA-33B/65B on **Ascend chips** and the **reference chips**:

| Device    | Model     | Token throughput (tokens/s/p) |
|:---------:|:---------:|:-----------------------------:|
| Reference | llama-33B | 520                           |
| NPUs      | llama-33B | 621                           |
| Reference | llama-65B | 260                           |
| NPUs      | llama-65B | 234                           |

#### Accuracy

NPU vs. reference loss and relative error:

LLaMA-33B

![NPU-LOSS](../../sources/images/llama/llama33B_loss.png)

![NPU-Relative-Error](../../sources/images/llama/llama33B_relative_error.png)

LLaMA-65B

![NPU-LOSS](../../sources/images/llama/loss_chart.png)

![NPU-Relative-Error](../../sources/images/llama/compare_chart.png)

## Citation

```bibtex
@article{Touvron2023llama,
  title={LLaMA: Open and Efficient Foundation Language Models},
  author={Hugo Touvron and Thibaut Lavril and Gautier Izacard and Xavier Martinet and Marie-Anne Lachaux and Timothee Lacroix and Baptiste Roziere and Naman Goyal and Eric Hambro and Faisal Azhar and Aurelien Rodriguez and Armand Joulin and Edouard Grave and Guillaume Lample},
  journal={arXiv preprint arXiv:2302.13971},
  year={2023}
}
```
