llama2: upload the training and inference scripts and the README.

This commit is contained in:
parent 569e1cc7b3
commit f801b32e15
@ -13,6 +13,9 @@
- [Performance](#性能)
- [Throughput](#吞吐)
- [Accuracy](#精度)
- [Fine-tuning](#微调)
- [Full-parameter fine-tuning](#全参微调)
- [LoRA fine-tuning](#低参微调)
- [Inference](#推理)
- [deepspeed_pipeline](#deepspeed_pipeline)
- [megatron](#megatron)
@ -27,7 +30,8 @@
- [deepspeed_pipeline](#deepspeed_pipeline)
- [megatron](#megatron)
- [Evaluation](#评估)
- [Example](#举例)
- [Example](#举例)

# Bloom-7B

## Training

@ -154,6 +158,26 @@ DATA_PATH=/home/bloom_data/enwiki_100k/enwiki-100k_text_document
bash examples/bloom/pretrain_bloom_7b1.sh
```

## Fine-tuning

### Full-parameter fine-tuning

The procedure is the same as for pre-training; configure the training weight path as follows:

```shell
# modify the pre-trained weight path
CHECKPOINT_PATH='./ckpt'
```

### LoRA fine-tuning

The procedure is the same as for pre-training; modify the parameters as follows:

```shell
# modify the pre-trained weight path
CHECKPOINT_PATH='./ckpt'

# add the configuration parameters
pretrain_bloom.py
--lora-target-modules query_key_value dense
```

## Performance

### Throughput

@ -486,6 +510,7 @@ bash ./examples/bloom/generate_bloom_176b_2nodes.sh

## Evaluation

Configure the Bloom-176B evaluation script: tasks/evaluation/eval_bloom.sh

```shell
@ -504,6 +529,12 @@ TASK="boolq"
--num-attention-heads 112
```

```text
# Note: a deepspeed bug needs to be fixed for evaluation:
# comment out line 671 in `<deepspeed-installed-path>/runtime/pipe/engine.py`:
# self.total_loss += self.loss.detach()
```
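
If you are unsure where `<deepspeed-installed-path>` points on your machine, a quick check such as the following can print it (a minimal sketch; it assumes deepspeed is importable from the active Python environment):

```shell
# print the full path of the file that needs the one-line edit (assumes deepspeed is installed)
python -c "import os, deepspeed; print(os.path.join(os.path.dirname(deepspeed.__file__), 'runtime', 'pipe', 'engine.py'))"
```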

```shell
bash ./tasks/evaluation/eval_bloom.sh
```
@ -528,7 +559,7 @@ bash ./tasks/evaluation/eval_bloom.sh
</tbody>
</table>

## Example
# Example
1. bloom 7b

![bloom_7b_generate.png](..%2F..%2Fsources%2Fimages%2Fbloom_7b_generate.png)
@ -536,7 +567,7 @@ bash ./tasks/evaluation/eval_bloom.sh

![bloom_176b_generate.png](..%2F..%2Fsources%2Fimages%2Fbloom_176b_generate.png)

## Citation
# Citation

```
@article{scao2022bloom,

@ -9,21 +9,31 @@
# Contents

- [Bloom-7B](#contents)
- [Training](#pre-training)
- [Training](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Accuracy of the loss](#accuracy-of-the-loss)
- [Inference](#Inference)
- [Script](#script)
- [Machine performance](#Machine-performance)
- [Accuracy of the loss](#Accuracy-of-the-loss)
- [Fine-tune](#fine-tune)
- [Full parameter fine-tuning](#Full-parameter-fine-tuning)
- [LORA fine-tuning](#LORA-fine-tuning)
- [Inference](#inference)
- [deepspeed pipeline](#deepspeed-pipeline)
- [megatron](#megatron)
- [Evaluation](#evaluation)

- [Bloom-176B](#contents)
- [Training](#pre-training)
- [Training](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Accuracy of the loss](#accuracy-of-the-loss)
- [Inference](#Inference)
- [Script](#script)
- [Inference](#inference)
- [deepspeed pipeline](#deepspeed-pipeline)
- [megatron](#megatron)
- [Evaluation](#evaluation)
- [Example](#example)

# Bloom-7B

@ -154,6 +164,30 @@ Run the examples/bloom/pretrain_bloom_7b1.sh on all nodes in the cluster.
bash examples/bloom/pretrain_bloom_7b1.sh
```

## Fine-tune

### Full parameter fine-tuning

The procedure is the same as for pre-training. Configure the training weight path as follows:

```shell
# modify the model weight path
CHECKPOINT_PATH='./ckpt'
```

### LORA fine-tuning

The procedure is the same as for pre-training; modify the parameters as follows:

```shell
# modify the model weight path
CHECKPOINT_PATH='./ckpt'

# add the configuration parameters
pretrain_bloom.py
--lora-target-modules query_key_value dense
```

## Performance

### Machine Performance

@ -183,7 +217,7 @@ NPU vs GPU loss relative error.

We support AscendSpeed Inference for text generation with BLOOM 7B (deepspeed or megatron).

### deepspeed_pipeline
### deepspeed pipeline

```shell
# modify the model weight path and tokenizer path
@ -440,7 +474,7 @@ and GPU on a single-node system. The average relative error is 0.1%, less than 2

We support AscendSpeed Inference for text generation with BLOOM 176B (deepspeed or megatron).

### deepspeed_pipeline
### deepspeed pipeline

```shell
# modify the model weight path and tokenizer path
@ -506,6 +540,12 @@ In addition, you need to set the corresponding parameters according to the model
--num-attention-heads 112
```

```text
# Note that a deepspeed bug needs to be fixed during evaluation:
# Comment out line 671 in the file `<deepspeed-installed-path>/runtime/pipe/engine.py`:
# self.total_loss += self.loss.detach()
```
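
One way to apply this edit without opening an editor is sketched below; it assumes that line 671 from the note above still matches your installed deepspeed version, so verify the line before commenting it out:

```shell
# locate the file, confirm line 671 is the target, then comment it out (verify first)
ENGINE_PY=$(python -c "import os, deepspeed; print(os.path.join(os.path.dirname(deepspeed.__file__), 'runtime', 'pipe', 'engine.py'))")
sed -n '671p' "$ENGINE_PY"        # expected: self.total_loss += self.loss.detach()
sed -i '671s/^/# /' "$ENGINE_PY"  # prepend '#' to comment the line out
```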

```shell
bash ./tasks/evaluation/eval_bloom.sh
```

@ -3,7 +3,6 @@
<p align="left">
<b>简体中文</b> |
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/llama2/README_en.md">English</a> </b>
</p>
</p>

@ -16,6 +15,15 @@
- [Throughput](#吞吐)
- [Accuracy](#精度)

- [LLaMA2-13B](#LLaMA2-13B)
- [Training](#训练)
- [Script](#脚本)
- [Performance](#性能)
- [Throughput](#吞吐)
- [Accuracy](#精度)
- [Inference](#推理)
- [Evaluation](#评估)

# LLaMA2-7B

@ -188,3 +196,211 @@ NPU vs reference loss
Absolute error

![NPU-LOSS and NPU-Absolute-Error](../../sources/images/llama2/llama2_7b_shape_fp16_layer32_loss_with_weights_comparison_absolute.png)

# LLaMA2-13B

## Training

Hardware configuration for LLaMA2-13B training:

| Hardware | Configuration   |
|:--------:|:---------------:|
| NPU      | 8 x Ascend NPUs |

Software configuration for LLaMA2-13B training:

| Software                  | Configuration |
|:-------------------------:|:------------------------------------------------------------------------------------------------------------:|
| python                    | 3.7.16 |
| driver                    | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| firmware                  | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| CANN                      | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| binary arithmetic package | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| torch                     | 1.11.0 |
| torch_npu                 | [package](https://gitee.com/ascend/pytorch/releases) |

### Script

1. Clone the repository to your local server
```shell
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
mkdir logs
mkdir ckpt
```

2. Build the environment

```bash
# python3.7
conda create -n test python=3.7
conda activate test

# install torch and torch_npu
pip install torch-1.11.0-cp37-cp37m-manylinux2014_aarch64.whl
pip install torch_npu-1.11.0*-cp37-cp37m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp37-cp37m-linux_aarch64.whl

# install megatron-core
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core

# install deepspeed and deepspeed_npu
pip install deepspeed==0.9.2
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
cd deepspeed_npu
pip3 install -e ./
cd ..

# install other packages
pip install -r requirements.txt
```

3. Download the LLaMA2-13B [pre-trained weights and tokenizer](https://huggingface.co/NousResearch/Llama-2-13b-hf/tree/main)

```bash
git lfs install
git clone https://huggingface.co/NousResearch/Llama-2-13b-hf
```

```text
# Note: to load huggingface pre-trained weights, a deepspeed weight-loading bug must be fixed first:
# in the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py`,
# change `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0`

# original deepspeed/runtime/engine.py, around lines 2746-2748
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None:
    return False

# modified
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None or len(zero_sd_list) == 0:
    return False
```
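
To locate the exact line to change in your environment, a check like the following can help (a sketch; it assumes deepspeed is importable and that the guard still reads `if zero_sd_list is None`):

```shell
# print the path of runtime/engine.py and show the line(s) containing the guard to be changed
ENGINE_PY=$(python -c "import os, deepspeed; print(os.path.join(os.path.dirname(deepspeed.__file__), 'runtime', 'engine.py'))")
grep -n "if zero_sd_list is None" "$ENGINE_PY"
```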

Convert the weights from huggingface format to AscendSpeed format:
```bash
# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# convert the weight format
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-model-dir llama-2-13b-hf \
    --output-model-dir ckpt \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    --type 13B \
    --deepspeed
```

4. Prepare the dataset

Download the LLaMA2-13B [dataset](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)

```shell
# download the data
mkdir dataset_llama2
cd ./dataset_llama2
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..

# process the data
cd WORKSPACE
mkdir alpaca_preprocessed
python tools/preprocess_data.py --input WORKSPACE/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
    --output-prefix WORKSPACE/alpaca_preprocessed/alpaca \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path WORKSPACE/llama-13b-hf \
    --tokenizer-not-use-fast \
    --handler-name GeneralInstructionHandler
```

5. Configure the LLaMA2-13B pre-training script: examples/llama2/pretrain_llama2_13B_ptd_8p.sh

```shell
# set the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# configure the tokenizer, dataset and other paths
TOKENIZER_PATH=./llama-2-13b-hf/  # tokenizer path
DATA_PATH=WORKSPACE/alpaca_preprocessed/alpaca  # dataset path
```

6. Launch the LLaMA2-13B pre-training script: examples/llama2/pretrain_llama2_13B_ptd_8p.sh

```shell
bash examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```

### Performance

#### Throughput

Performance of LLaMA2-13B on **Ascend NPUs** compared with the **reference**:

| Device    | Model      | Iterations | Sample throughput (samples/p/s) | Token throughput (tokens/s/p) | Step time (s/step) | Compute (TFLOPs/s) |
|:---------:|:----------:|:----------:|:-------------------------------:|:-----------------------------:|:------------------:|:------------------:|
| NPUs      | LLaMA2-13B | 5000       | 2.868                           | 1468.416                      | 89.275             | 126.73             |
| Reference | LLaMA2-13B | --         | --                              | 1750                          | --                 | --                 |


#### Accuracy

NPU vs. reference loss.

The NPU runs stably, resource usage is steady, no errors occur during training, the loss trends downward, and the convergence speed meets expectations.
The accuracy meets requirements: the absolute error of the average loss is 0.0011, less than 0.5%.

![NPU-LOSS](../../sources/images/llama2/llama2_13b_bf16_loss_absolute.png)

## Inference

We support AscendSpeed inference for text generation with LLaMA2-13B.
Inference differs from pre-training; for example, we need to load the pre-trained checkpoint and set the length of the output samples:

Configure the LLaMA2-13B inference script: examples/llama2/generate_llama2_13B_tp8_pp1.sh

```shell
# modify the model weight path and the tokenizer path
CHECKPOINT=./llama2-13b-tp8-pp1/
VOCAB_FILE=./llama2-13b-hf/
```

```shell
bash ./examples/llama2/generate_llama2_13B_tp8_pp1.sh
```
A sample inference result is shown below:
![llama2-13B-generate.png](../../sources/images/llama2/llama2-13B-generate.png)

## Evaluation

We use the boolq benchmark to evaluate our model. Download the benchmark [here](https://huggingface.co/datasets/boolq).

```shell
CHECKPOINT=./llama2-13b-tp8-pp1/
VOCAB_FILE=./llama2-13b-hf/
# configure the task and data path
DATA_PATH="./boolq/data/test/"
TASK="boolq"
# configure the generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation.py \
    --task-data-path $DATA_PATH \
    --task $TASK \
    --seq-length 4096 \
    --max-new-tokens 32 \
    --max-position-embeddings 4096 \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 1 \
    --num-layers 40 \
    --hidden-size 5120 \
    --ffn-hidden-size 13824 \
    --load ${CHECKPOINT} \
    --num-attention-heads 40 \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path $VOCAB_FILE \
    --tokenizer-not-use-fast \
    --fp16 \
    --micro-batch-size 1 \
    --seed 42 | tee logs/train.log
# start the evaluation
bash tasks/evaluation/eval.sh
```

@ -2,10 +2,8 @@
<p align="left">
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/llama2/README.md">简体中文</a></b> |
<b>English</b>
</p>
</p>

# Contents

- [LLaMA2-7B](#contents)
@ -15,6 +13,15 @@
- [Machine performance](#machine-performance)
- [Accuracy of the loss](#accuracy-of-the-loss)

- [LLaMA2-13B](#contents)
- [Training](#pre-training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Accuracy of the loss](#accuracy-of-the-loss)
- [Inference](#Inference)
- [Evaluation](#Evaluation)

# LLaMA2-7B

@ -187,3 +194,205 @@ The relative error of the average loss is 0.0046, less than 2%, the maximum rela
The absolute error of the average loss is 0.0009, less than 2%, the maximum absolute error is 0.0246.

![NPU-LOSS and NPU-Absolute-Error](../../sources/images/llama2/llama2_7b_shape_fp16_layer32_loss_with_weights_comparison_absolute.png)

# LLaMA2-13B

## Training

Here's a hardware summary of pre-training LLaMA2-13B:

| Hardware |      Value      |
| :------: | :-------------: |
| NPU      | 8 x Ascend NPUs |

Here's a software summary of pre-training LLaMA2-13B:

| Software                  | Version |
| :-----------------------: | :-----: |
| Python                    | 3.7.16 |
| driver                    | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| firmware                  | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| CANN                      | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| binary arithmetic package | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| torch                     | 1.11.0 |
| torch_npu                 | [package](https://gitee.com/ascend/pytorch/releases) |

### Script

1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
mkdir logs
mkdir ckpt
```

2. Build environment

```bash
# python3.7
conda create -n test python=3.7
conda activate test

# install torch and torch_npu
pip install torch-1.11.0-cp37-cp37m-manylinux2014_aarch64.whl
pip install torch_npu-1.11.0*-cp37-cp37m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp37-cp37m-linux_aarch64.whl

# install megatron-core
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core

# install deepspeed and deepspeed_npu
pip install deepspeed==0.9.2
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
cd deepspeed_npu
pip3 install -e ./
cd ..

# install other packages
pip install -r requirements.txt
```

*Note that if you want to train with weights from huggingface, you must first fix a deepspeed checkpoint-loading bug: in the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py`, change `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0`.*

```text
# original deepspeed/runtime/engine.py, around lines 2746-2748
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None:
    return False

# modified
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None or len(zero_sd_list) == 0:
    return False
```
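
After editing, a quick sanity check like the one below can confirm that the patched guard is the code Python actually imports (a sketch; it assumes deepspeed is importable in the current environment):

```shell
# print "patched" if the modified guard is present in the imported deepspeed sources
python -c "import inspect; from deepspeed.runtime.engine import DeepSpeedEngine; src = inspect.getsource(DeepSpeedEngine._load_zero_checkpoint); print('patched' if 'len(zero_sd_list) == 0' in src else 'not patched')"
```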

3. Prepare pretrained weights and tokenizer
Download the LLaMA2-13B checkpoint from [here](https://huggingface.co/NousResearch/Llama-2-13b-hf/tree/main)

```bash
git lfs install
git clone https://huggingface.co/NousResearch/Llama-2-13b-hf
```

*Note that if you want to use the weights from huggingface, you must run the weight conversion script first. The following uses the llama-2-13b model weight conversion as an example.*
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# convert to deepspeed weights
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-model-dir llama-2-13b-hf \
    --output-model-dir ckpt \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 1 \
    --type 13B
```

4. Prepare dataset

Download the LLaMA2-13B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)

```bash
# download the dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet

cd WORKSPACE
mkdir alpaca_preprocessed
python tools/preprocess_data.py --input WORKSPACE/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
    --output-prefix WORKSPACE/alpaca_preprocessed/alpaca \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path WORKSPACE/llama-13b-hf \
    --tokenizer-not-use-fast \
    --handler-name GeneralInstructionHandler
```

5. Config LLaMA2-13B pre-training script: examples/llama2/pretrain_llama2_13B_ptd_8p.sh

```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# modify the tokenizer and dataset paths according to your own setup
TOKENIZER_PATH=./llama-2-13b-hf/  # tokenizer path
DATA_PATH=WORKSPACE/alpaca_preprocessed/alpaca  # processed dataset
```

6. Launch LLaMA2-13B pre-training script: examples/llama2/pretrain_llama2_13B_ptd_8p.sh

```shell
bash examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```

### Performance

#### Machine performance

The performance of LLaMA2-13B on **Ascend NPUs** and the **Reference**:

| Device    | Model      | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
| :-------: |:----------:|:----------------:|:-----------------------------:|:----------------------------:|:-------------------------:|:-----------------------------------:|
| NPUs      | LLaMA2-13B | 5000             | 2.868                         | 1468.416                     | 89.275                    | 126.73                              |
| Reference | LLaMA2-13B | --               | --                            | 1750                         | --                        | --                                  |


#### Accuracy of the loss

NPU vs Reference loss.

The NPU runs stably, resource usage is steady, no errors are reported during training, the loss trends downward, and the convergence speed is as expected.
The precision meets the requirements. The absolute error of the average loss is 0.0011, less than 0.5%.

![NPU-LOSS](../../sources/images/llama2/llama2_13b_bf16_loss_absolute.png)

## Inference

We support AscendSpeed inference for text generation with LLaMA2-13B.
Inference differs from pre-training; for example, we need to load the pre-trained checkpoint and set the length of the output samples:

Config LLaMA2-13B inference script: examples/llama2/generate_llama2_13B_tp8_pp1.sh

```shell
# modify the model weight path and tokenizer path
CHECKPOINT=./llama2-13b-tp8-pp1/
VOCAB_FILE=./llama2-13b-hf/
```

```shell
bash ./examples/llama2/generate_llama2_13B_tp8_pp1.sh
```
Some inference samples are as follows:
![llama2-13B-generate.png](../../sources/images/llama2/llama2-13B-generate.png)

## Evaluation

We use the boolq benchmark to evaluate our model. Download the benchmark [here](https://huggingface.co/datasets/boolq).

```shell
CHECKPOINT=./llama2-13b-tp8-pp1/
VOCAB_FILE=./llama2-13b-hf/
# configure task and data path
DATA_PATH="./boolq/data/test/"
TASK="boolq"
# configure generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation.py \
    --task-data-path $DATA_PATH \
    --task $TASK \
    --seq-length 4096 \
    --max-new-tokens 32 \
    --max-position-embeddings 4096 \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 1 \
    --num-layers 40 \
    --hidden-size 5120 \
    --ffn-hidden-size 13824 \
    --load ${CHECKPOINT} \
    --num-attention-heads 40 \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path $VOCAB_FILE \
    --tokenizer-not-use-fast \
    --fp16 \
    --micro-batch-size 1 \
    --seed 42 | tee logs/train.log
# start evaluation
bash tasks/evaluation/eval.sh
```

examples/llama2/generate_llama2_13B_tp8_pp1.sh (new file, 37 lines)
@ -0,0 +1,37 @@
#!/bin/bash
export TOKENIZERS_PARALLELISM=false
source /usr/local/Ascend/ascend-toolkit/set_env.sh

MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8

DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE \
                  --nnodes $NNODES \
                  --node_rank $NODE_RANK \
                  --master_addr $MASTER_ADDR \
                  --master_port $MASTER_PORT"

CHECKPOINT=./model/LLAMA-2-13B-hf_tp8_pp1_save/
VOCAB_FILE=./model/LLAMA-2-13B-hf

python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/inference/inference_llama.py \
       --no-contiguous-buffers-in-local-ddp \
       --tensor-model-parallel-size 8 \
       --pipeline-model-parallel-size 1 \
       --num-layers 40 \
       --hidden-size 5120 \
       --ffn-hidden-size 13824 \
       --load "${CHECKPOINT}" \
       --num-attention-heads 40 \
       --max-position-embeddings 4096 \
       --tokenizer-type PretrainedFromHF \
       --tokenizer-name-or-path "$VOCAB_FILE" \
       --tokenizer-not-use-fast \
       --fp16 \
       --micro-batch-size 1 \
       --seq-length 4096 \
       --max-new-tokens 256 \
       --seed 42
examples/llama2/pretrain_llama2_13B_ptd_8p.sh (new file, 62 lines)
@ -0,0 +1,62 @@
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))

DATA_PATH=./dataset/llama_text_document
LOAD_CHECKPOINT=./model/LLAMA-2-13B-hf_tp8_pp1
SAVE_CHECKPOINT=./model/LLAMA-2-13B-hf_tp8_pp1_save/
TOKENIZER_PATH=./model/LLAMA-2-13B-hf

DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

# Main script
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
       pretrain_llama.py \
       --DDP-impl local \
       --tensor-model-parallel-size 8 \
       --pipeline-model-parallel-size 1 \
       --sequence-parallel \
       --num-layers 40 \
       --hidden-size 5120 \
       --ffn-hidden-size 13824 \
       --num-attention-heads 40 \
       --micro-batch-size 2 \
       --global-batch-size 2 \
       --seq-length 4096 \
       --max-position-embeddings 4096 \
       --train-iters 1 \
       --lr-warmup-iters 0 \
       --save $SAVE_CHECKPOINT \
       --load $LOAD_CHECKPOINT \
       --data-path $DATA_PATH \
       --tokenizer-name-or-path $TOKENIZER_PATH \
       --tokenizer-not-use-fast \
       --data-impl mmap \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 1e-6 \
       --lr-decay-style cosine \
       --min-lr 1e-8 \
       --weight-decay 1e-1 \
       --clip-grad 1.0 \
       --initial-loss-scale 4096.0 \
       --seed 1234 \
       --adam-beta1 0.9 \
       --adam-beta2 0.95 \
       --adam-eps 1.0e-5 \
       --log-interval 1 \
       --save-interval 5000 \
       --eval-interval 1000 \
       --eval-iters 1 \
       --use-fused-rotary-pos-emb \
       --use-flash-attn \
       --bf16 | tee logs/train_13B_llama2_npu.log
sources/images/llama2/llama2-13B-generate.png (new binary file, 40 KiB; not shown)
sources/images/llama2/llama2_13b_bf16_loss_absolute.png (new binary file, 289 KiB; not shown)