LLaMA-33B/65B
The LLaMA model comes from: LLaMA: Open and Efficient Foundation Language Models
Touvron, Hugo, et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint arXiv:2302.13971 (2023).
Hardware configuration for LLaMA-33B/65B training:
32 x Ascend NPUs
The model is trained on the alpaca dataset.
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
mkdir logs
mkdir ckpt
- Convert the pretrained weights from huggingface format to AscendSpeed format
- Download the dataset
- Configure the llama-33B/65B pretraining script
- Launch the pretraining script
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
# ARM
wget https://download.pytorch.org/whl/torch-2.1.0-cp38-cp38-manylinux2014_aarch64.whl
pip install torch-2.1.0-cp38-cp38-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0.post4_XXXXXX-cp38-cp38-manylinux2014_aarch64.whl
# X86
#pip install torch==2.1.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
#pip install torch_npu-2.1.0.post4_XXXXXX-cp38-cp38-manylinux2014_x86_64.whl
# install megatron-core
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
# install deepspeed and deepspeed_npu
pip install deepspeed==0.9.2
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
cd deepspeed_npu
pip3 install -e ./
cd ..
# install the remaining dependencies
pip install -r requirements.txt
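As a quick sanity check of the environment, you can confirm from Python that torch_npu is importable and an Ascend device is visible (a minimal sketch, assuming torch_npu patches the torch.npu namespace on import as in torch_npu 2.1):
# check that torch and torch_npu installed correctly and a device is visible
python -c "import torch, torch_npu; print(torch.npu.is_available())"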
llama-33B weights
mkdir tokenizer
cd ./tokenizer
# git-lfs is required
git lfs install
git clone https://huggingface.co/pinkmanlove/llama-33b-hf
cd ..
llama-65B weights
mkdir tokenizer
cd ./tokenizer
# git-lfs is required
git lfs install
git clone https://huggingface.co/pinkmanlove/llama-65b-hf
cd ..
llama-33B
mkdir model_weights
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
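# convert the huggingface llama-33B weights under ./tokenizer into the AscendSpeed layout
# under ./model_weights, partitioned for tensor parallel size 4 and pipeline parallel size 4
# (--merge-mlp enables the fused MLP weight layout)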
python $SCRIPT_PATH \
--input-model-dir ./tokenizer \
--output-model-dir ./model_weights \
--tensor-model-parallel-size 4 \
--pipeline-model-parallel-size 4 \
--merge-mlp \
--type 30B
llama-65B
mkdir model_weights
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
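# convert the huggingface llama-65B weights under ./tokenizer into the AscendSpeed layout
# under ./model_weights, partitioned for tensor parallel size 8 and pipeline parallel size 4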
python $SCRIPT_PATH \
--input-model-dir ./tokenizer \
--output-model-dir ./model_weights \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 4 \
--type 65B
# download the alpaca dataset
wget https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json
# download the tokenizer configuration and (optionally) the weights:
# https://huggingface.co/pinkmanlove/llama-33b-hf
# https://huggingface.co/pinkmanlove/llama-65b-hf
# change "LLaMATokenizer" in tokenizer_config.json to "LlamaTokenizer" (this is a known huggingface bug; a sed one-liner for this edit is shown after the preprocessing command below)
mkdir dataset
python tools/preprocess_data.py --input alpaca_data.json \
--output-prefix dataset/alpaca \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path llama-33b-hf \
--tokenizer-not-use-fast \
--handler-name GeneralInstructionHandler
# for llama-65B, replace the tokenizer path with: --tokenizer-name-or-path llama-65b-hf
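If you prefer not to edit tokenizer_config.json by hand, the class-name fix mentioned above can be applied with a one-liner (a sketch only; adjust the path to wherever your llama-33b-hf or llama-65b-hf checkout lives):
# replace the buggy tokenizer class name in tokenizer_config.json
sed -i 's/LLaMATokenizer/LlamaTokenizer/' ./llama-33b-hf/tokenizer_config.json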
AscendSpeed/examples/llama/pretrain_llama_33B_ptd_32p.sh
AscendSpeed/examples/llama/pretrain_llama_65B_ptd_32p.sh
# modify the ascend-toolkit path
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
# configure the tokenizer and data paths, etc.
TOKENIZER_PATH=./dataset/llama_tokenizer # line 16
DATA_PATH=./dataset/llama_text_document # line 17
Launch the llama-33B pretraining script: AscendSpeed/examples/llama/pretrain_llama_33B_ptd_32p.sh
bash examples/llama/pretrain_llama_33B_ptd_32p.sh
Launch the llama-65B pretraining script: AscendSpeed/examples/llama/pretrain_llama_65B_ptd_32p.sh
bash examples/llama/pretrain_llama_65B_ptd_32p.sh
To configure the llama-33B/65B pretraining script for multiple nodes (the script is launched on every node in the cluster), set the following variables:
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=4
NODE_RANK=0
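For example, on a 4-node cluster every node runs the same script with MASTER_ADDR pointing at the first node and a node-specific NODE_RANK (illustrative values):
# node 0 (master)
MASTER_ADDR=<ip-of-node-0>
MASTER_PORT=6001
NNODES=4
NODE_RANK=0
# node 1 keeps the same MASTER_ADDR, MASTER_PORT and NNODES but sets NODE_RANK=1; node 2 sets NODE_RANK=2, and so on.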
The training log looks like the following:
iteration 11/50000 | consumed samples: 5632 | consumed tokens: 11534336 | elapsed time per iteration (ms): 52728.1 | learning rate: 1.499E-05 | global batch size: 512 | lm loss: 1.376514E+01 | loss scale: 65536.0 | grad norm: 459.628 | actual seqlen: 2048 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 9.710 | TFLOPs: 167.52 |
time (ms)
Performance
Performance comparison of LLaMA-33B/65B on Ascend NPUs and the reference system:
Throughput (tokens/s/p) for llama-33B and llama-65B (NPU vs. reference)
NPU vs. reference loss and relative error:
LLaMa-33B
LLaMa-65B
We support inference for text generation with LLaMA-33B and LLaMA-65B. Inference differs from pretraining; for example, we need to load the pretrained weights and set the length of the output samples:
Configure the LLaMA-33B inference script examples/llama/generate_llama_33B_ptd.sh.
Configure the LLaMA-65B inference script examples/llama/generate_llama_65B_tp8_pp1.sh.
# modify the model weight path and the tokenizer path
CHECKPOINT=<checkpoint-path>
VOCAB_FILE=<vocabfile-path>
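Inside the generation scripts, checkpoint loading and the output length are controlled by script flags. The fragment below is only an illustrative sketch: the flag names are taken from the evaluation command later in this document, and 256 is an example value rather than the scripts' default:
# illustrative fragment of a generation script
--load ${CHECKPOINT} \
--tokenizer-name-or-path ${VOCAB_FILE} \
--max-new-tokens 256 \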
LLaMA-33B:
bash ./examples/llama/generate_llama_33B_ptd.sh
LLaMA-65B:
bash ./examples/llama/generate_llama_65B_tp8_pp1.sh
Some inference samples are shown below:
LLaMA-33B:
LLaMA-65B:
Evaluation with benchmark datasets
We use the BoolQ benchmark to evaluate our model. The benchmark can be downloaded here.
Configure the LLaMA-33B evaluation script:
CHECKPOINT=./llama-33b-tp4-pp2/
VOCAB_FILE=./llama-33b-hf/
# configure the task and data path
DATA_PATH="./boolq/data/test/"
TASK="boolq"
# configure the generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/evaluation/evaluation_llama.py \
--task-data-path $DATA_PATH \
--task $TASK \
--seq-length 1024 \
--max-new-tokens 2 \
--max-position-embeddings 1024 \
--tensor-model-parallel-size 4 \
--pipeline-model-parallel-size 2 \
--num-layers 60 \
--hidden-size 6656 \
--ffn-hidden-size 17920 \
--load ${CHECKPOINT} \
--num-attention-heads 52 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${VOCAB_FILE} \
--tokenizer-not-use-fast \
--fp16 \
--micro-batch-size 1 \
--position-embedding-type rope \
--normalization RMSNorm \
--mlp-layer-fusion \
--seed 42
# start the evaluation
# llama-65B evaluation
bash tasks/evaluation/evaluate_llama_65B_tp8_pp1.sh
Evaluation results of LLaMA-33B and LLaMA-65B on Ascend NPUs:
BoolQ: LLaMA-33B, LLaMA-65B

@article{Touvron2023llama,
  title   = {LLaMA: Open and Efficient Foundation Language Models},
  author  = {Touvron, Hugo and others},
  journal = {arXiv preprint arXiv:2302.13971},
  year    = {2023}
}