llama2-7B readme update

wenjiang2357 2023-12-04 17:23:55 +08:00
parent 680c0fbe65
commit 9c594efe8e
8 changed files with 383 additions and 347 deletions

11
OWNERS

@ -2,6 +2,7 @@ approvers:
- dilililiwhy
- wujianping996
- fengliangjun66
- xqiangx1991
- ckirchhoff
- block0219
- wqh17101
@ -12,9 +13,17 @@ reviewers:
- mhh001
- zhangshengdong
- kingsleyandher
- guo-xinjie-1
- matrixssy
- ckirchhoff
- block0219
- wqh17101
- ZeroFadeAway
- Ronnie_zheng
- sibylk
- dl9864
- demo-v
- xulianhao
- yangyifei626
- z2fan
- wenjiang2357

View File

@ -16,25 +16,25 @@
</p>
<p align="center">
<b>简体中文</b> |
<b><a href="https://gitee.com/fengliangjun66/AscendSpeed/blob/master/README_en.md">English</a> </b>
<b><a href="https://gitee.com/fengliangjun66/AscendSpeed/blob/master/README_en.md">English</a> </b>
</p>
</p>
AscendSpeed provides end-to-end solutions for large language models on Huawei [Ascend chips](https://open.codehub.huawei.com/OpenBaize/Ascend/ascendspeed/files?ref=master&filePath=examples%2Fbaichuan%2Fpretrain_baichuan_zero_7B.sh&isFile=true), including models, algorithms, operators, and downstream tasks.
## AscendSpeed Solution Overview
## AscendSpeed Solution Overview
---
### Large Language Models
AscendSpeed currently supports pre-training and full-parameter fine-tuning of the following models:
AscendSpeed currently supports pre-training and full-parameter fine-tuning of the following models:
* <a href="https://huggingface.co/BAAI/Aquila-7B/tree/main" style="color:green">Aquila</a>-[[Usage guide: 7B]](examples/aquila/README.md)
* <a href="https://github.com/baichuan-inc" style="color:green">Baichuan</a>-[[Usage guide: 7B/13B]](examples/baichuan/README.md)
* <a href="https://arxiv.org/pdf/2108.12409.pdf" style="color:green">Bloom</a>-[[Usage guide: 7B/176B]](examples/bloom/README.md)
* <a href="https://internlm.intern-ai.org.cn/" style="color:green">InternLM</a>-[[Usage guide: 7B]](examples/intern/README.md)
* <a href="https://internlm.intern-ai.org.cn/" style="color:green">InternLM</a>-[[Usage guide: 7B/65B]](examples/intern/README.md)
* <a href="https://huggingface.co/docs/transformers/main/model_doc/llama" style="color:green">LLaMA</a>-[[Usage guide: 7B/13B/33B/65B]](examples/llama/README.md)
* <a href="https://huggingface.co/docs/transformers/main/model_doc/llama2" style="color:green">LLaMA2</a>-[[Usage guide: 7B/13B/70B]](examples/llama2/README.md)
LLaMA2-34B, Aquila-7B, Baichuan2-7B/13B and other models are coming soon...
LLaMA2-34B, Baichuan2-7B/13B and other models are coming soon...
### Downstream Tasks
AscendSpeed currently provides the following peripheral applications for large models:
@ -86,6 +86,16 @@ LLaMA2-34B, Aquila-7B, Baichuan2-7B/13B 等模型即将上线...
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1"><a href="examples/aquila/README.md">Aquila</a></td>
<td>7B</td>
<td> 1x8</td>
<td> FP16 </td>
<td> 3644 </td>
<td> 4078 </td>
<td> <a href="./sources/images/aquila/aquila_comp1130.png">Loss</a> </td>
<td> <a href="examples/aquila/pretrain_aquila_7B.sh">Train</a> </td>
</tr>
<tr>
<td rowspan="2"><a href="examples/baichuan/README.md">Baichuan</a></td>
<td>7B</td>
@ -125,7 +135,7 @@ LLaMA2-34B, Aquila-7B, Baichuan2-7B/13B 等模型即将上线...
<td> <a href="examples/bloom/pretrain_bloom_176b.sh">Train</a> </td>
</tr>
<tr>
<td><a href="examples/intern/README.md">InternLM</td>
<td rowspan="2"><a href="examples/intern/README.md">InternLM</a></td>
<td>7B</td>
<td>1x8</td>
<td>BF16</td>
@ -134,6 +144,15 @@ LLaMA2-34B, Aquila-7B, Baichuan2-7B/13B 等模型即将上线...
<td> <a href="sources/images/intern7b_loss.png">Loss</a> </td>
<td> <a href="examples/intern/pretrain_internlm_7b_zero.sh">Train</a> </td>
</tr>
<tr>
<td >65B</td>
<td >4x8</td>
<td> BF16 </td>
<td> 342 </td>
<td> 414 </td>
<td> <a href="sources/images/intern65b_loss.png">Loss</a> </td>
<td> <a href="examples/intern/pretrain_internlm_65b_ptd_32p.sh">Train</a> </td>
</tr>
<tr>
<td rowspan="5"><a href="examples/llama/README.md">LLaMA</a></td>
<td>7B</td>
@ -178,11 +197,11 @@ LLaMA2-34B, Aquila-7B, Baichuan2-7B/13B 等模型即将上线...
<td rowspan="3"><a href="examples/llama2/README.md">LLaMA2</a></td>
<td>7B</td>
<td>1x8</td>
<td>FP16 </td>
<td> 2712 </td>
<td> 2348 </td>
<td> <a href="sources/images/llama2/llama2_7b_shape_fp16_layer32_loss_with_weights.png">Loss</a> </td>
<td> <a href="examples/llama2/pretrain_llama2_7b_zero_8p.sh">Train</a> </td>
<td>BF16 </td>
<td> 2662 </td>
<td> 2884 </td>
<td> <a href="sources/images/llama2/llama2-7b-tp8pp1mbs4gbs16-cann1115-Megatron-GPU-loss-releative.png">Loss</a> </td>
<td> <a href="examples/llama2/pretrain_llama2_7b_ptd.sh">Train</a> </td>
</tr>
<tr>
<td>13B</td>
@ -212,8 +231,8 @@ LLaMA2-34B, Aquila-7B, Baichuan2-7B/13B 等模型即将上线...
1. Clone the repository to your local server:
```bash
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
mkdir logs
mkdir ckpt
```
@ -285,7 +304,7 @@ python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-mode
5. Start training
```bash
# set your own data / weight / tokenizer paths in the script
# set your own data / weight / tokenizer paths in the script
sh examples/llama/pretrain_llama_7B_zero_8p.sh
```
@ -370,10 +389,10 @@ sh examples/llama/pretrain_llama_7B_zero_8p.sh
<tr>
<td><a href="examples/llama2/README.md">LLaMA2</a></td>
<td>7B</td>
<td> <a href="https://gitee.com/ascend/AscendSpeed/tree/master/examples/llama2/tune_llama2_7b_ptd.sh">lora</a> </td>
<td> <a href="https://gitee.com/ascend/AscendSpeed/tree/master/examples/llama2/generate_llama2_7b_ptd.sh">inference </a> </td>
<td> -- </td>
<td> <a href="https://gitee.com/ascend/AscendSpeed/tree/master/examples/llama/generate_llama_7B_tp2_pp2.sh">inference </a> </td>
<td> -- </td>
<td> -- </td>
<td> <a href="https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json">alpaca_data.json</a> </td>
</tr>
</tbody>
</table>
@ -469,7 +488,7 @@ python tools/preprocess_data.py --input WORKSPACE/alpaca/train-00000-of-00001-a0
--handler-name GeneralInstructionHandler
```
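After preprocessing, there will be three `bin` files and three `idx` files in the `WORKSPACE/alpaca_preprocessed` directory. We can then run instruction fine-tuning by adding the `--data-path WORKSPACE/alpaca_preprocessed/alpaca` and `--is-instruction-dataset` flags.
After preprocessing, there will be three `bin` files and three `idx` files in the `WORKSPACE/alpaca_preprocessed` directory. We can then run instruction fine-tuning by adding the `--data-path WORKSPACE/alpaca_preprocessed/alpaca` and `--is-instruction-dataset` flags.
In addition, with an instruction dataset, we can train the model with dynamic sequence lengths by adding the `--variable-seq-lengths` flag.
Note that an instruction dataset processed with the `--handler-name GeneralInstructionHandler` flag uses the `GeneralInstructionHandler` class in `ascendspeed/data/data_handler.py` to build prompts. If your dataset is in the alpaca format, i.e. it contains `instruction`, `input` and `output` columns, you can use the `--handler-name GeneralInstructionHandler` flag directly.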
@ -483,7 +502,7 @@ python tools/preprocess_data.py --input WORKSPACE/alpaca/train-00000-of-00001-a0
```shell
pip install peft==0.4.0
```
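You can also install it from [source](https://github.com/huggingface/peft/archive/refs/tags/v0.4.0.tar.gz) taken from its GitHub repository, and work around some dependency issues by modifying its setup.py file.
You can also install it from [source](https://github.com/huggingface/peft/archive/refs/tags/v0.4.0.tar.gz) taken from its GitHub repository, and work around some dependency issues by modifying its setup.py file.
After that, you only need to enable the following flag in the launch script to start Lora fine-tuning: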
@ -498,11 +517,11 @@ Lora有一些相关参数在 [PEFT](https://github.com/huggingface/peft) 仓
# Llama example
--lora-r 64 \
--lora-alpha 128 \
--lora-modules-to-save word_embeddings lm_head.lm_head \
--lora-modules-to-save word_embeddings output_layer \
--lora-register-forward-hook word_embeddings input_layernorm \
```
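Among these arguments, the `--lora-register-forward-hook` flag is used to repair the gradient chain break caused by PP. It only needs to be set on the input layer of each PP stage and does not add trainable parameters.
Among these arguments, the `--lora-register-forward-hook` flag is used to repair the gradient chain break caused by PP. It only needs to be set on the input layer of each PP stage and does not add trainable parameters. The `--lora-modules-to-save` flag is used for fine-tuning when the vocabulary is extended; omit it if you have no such need.
Finally, the weights saved after Lora fine-tuning contain only the newly added Lora weights. Similarly, when loading a Lora model, besides the original weight path you also need to set a path for loading the Lora weights, as follows: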
@ -536,9 +555,9 @@ AscendSpeed:
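Here are some example scripts for the different modes that you can try running. ***Please note:***
1. If you want to use model weights from huggingface, please convert the weights first. Take Llama-7B as an example:
- Conversion for the PTD strategy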
```bash
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-model-dir llama-7b-hf \
--output-model-dir llama-7b-tp2-pp2 \
@ -546,7 +565,7 @@ AscendSpeed:
--pipeline-model-parallel-size 2 \
--type 7B
```
- Conversion for the ZeRO strategy
```bash
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-model-dir llama-7b-hf \
@ -554,7 +573,7 @@ AscendSpeed:
--type 7B \
--deepspeed
```
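5. Some paths in the scripts below need to be modified, such as the model weight path and the vocab path.
- Models trained with the PTD strategy only: in this mode, the model is split by pipeline parallelism and tensor parallelism in the Megatron-LM style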
@ -752,7 +771,7 @@ VOCAB_FILE=../models/llama7b-hf/
# configure task and data path
DATA_PATH="dataset/boolq/test"
TASK="boolq"
# configure generation parameters
# configure generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation_llama.py \
--task-data-path $DATA_PATH \
--task $TASK\
@ -800,7 +819,7 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation_llama.py \
--micro-batch-size 1 \
--seed 42 | tee logs/train.log
```
##### BoolQ
##### BoolQ
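BoolQ is a yes/no question answering dataset. Each question contains a (question, passage, answer) triplet, with the title of the passage as optional additional input. Evaluating BoolQ is relatively simple: just configure `TASK="boolq"`, `--seq-length=512`, `--max-position-embeddings=512`, `--max-new-token=2`.
Zero-shot results are usually affected by the given prompt; you can try setting a suitable prompt in `tasks/evaluation/evaluation.py` to obtain a higher score: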
@ -809,15 +828,15 @@ BoolQ 是一个 yes/no 的问答数据集, 每一个问题包含了一个(
template = {instruction}
```
##### MMLU
##### MMLU
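Since MMLU is a multidisciplinary task evaluated in a 5-shot setting, the question length varies greatly between subjects. If you want to run all 57 subjects at once, try setting `TASK="mmlu"`, `--seq-length=2048`, `--max-position-embeddings=2048`, `--max-new-token=2` (`--max-new-tokens` can be set between 2 and 4).
On many websites, MMLU accuracy is evaluated by subject. The 57 subjects belong to four major categories, so the dataset can also be scored by these four categories. The [website](https://github.com/hendrycks/test/blob/master/categories.py) lists the 57 subjects.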
##### GSM8K
##### GSM8K
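GSM8K is a dataset of 8.5k high-quality grade school math word problems; the answer to each question is a specific number. Since this dataset is usually evaluated in a few-shot setting, GSM8K questions are relatively long and the output answer contains a full chain of thought. The relevant arguments should be set to `TASK="gsm8k"`, `--seq-length=2048`, `--max-position-embeddings=2048`, `--max-new-token=128` (`--max-new-tokens` can be 256-512).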
##### HumanEval
##### HumanEval
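HumanEval is a dataset for challenging code generation, with 164 programming problems that include a function signature, docstring, function body, and unit tests. All problems are handwritten to ensure they are not in the training set. Since the answers contain long code, the relevant arguments can be set to `TASK="human_eval"`, `--seq-length=2048`,
`--max-position-embeddings=2048`, `--max-new-token=1024`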
@ -841,7 +860,7 @@ Big-bench-hard 数据集是 BIG-Bench 的一个子集,专注于有挑战性的
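To use tensor parallelism in AscendSpeed, add the `--tensor-model-parallel-size` flag to the launch script to specify the number of GPUs used to split the model.
### <span id="jump2"> (Virtual & Dynamic) Pipeline Parallelism </span>
Pipeline parallelism (PP) is a model parallelism technique that divides all Transformer modules of a model into multiple stages and trains them on different devices.
Pipeline parallelism (PP) is a model parallelism technique that divides all Transformer modules of a model into multiple stages and trains them on different devices.
In PP, each stage holds an equal number of Transformer modules, and the global batch is split into multiple micro-batches that are pipelined across the stages to train the whole model.
Virtual pipeline (VP) parallelism reduces the bubble time of PP by adding virtual stages. Dynamic pipeline parallelism (DPP) is an enhanced version of VP that further reduces bubble time by choosing the size of each micro-batch appropriately.
The basic principles of PP and VP are as follows:<div align=center> <img src="sources/images/pp_vp.png" height="350px" width="800px"> </div>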
@ -1038,16 +1057,3 @@ AscendSpeed由华为公司的下列部门联合贡献
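- Permission description: before running the program, it is recommended that you apply security measures such as access control (a regular user rather than root privileges) to the files required for training, such as ckpt and logs.
- Path description: when you use the `torch.load` interface, the pickle module is used implicitly unless the weights_only parameter is set to True; we do not recommend loading or using data from unknown or untrusted sources.
- Communication matrix: [Communication Matrix](https://gitee.com/ascend/AscendSpeed/wikis/Communition%20Matrix/AscendSpeed%20Communication%20Matrix?sort_id=8855822)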

View File

@ -16,25 +16,25 @@
</p>
<p align="center">
<b><a href="https://gitee.com/fengliangjun66/AscendSpeed/blob/master/README.md"> 简体中文</a></b> |
<b> English </b>
<b> English </b>
</p>
</p>
AscendSpeed provides end-to-end solutions for large language models on [Ascend](https://open.codehub.huawei.com/OpenBaize/Ascend/ascendspeed/files?ref=master&filePath=examples%2Fbaichuan%2Fpretrain_baichuan_zero_7B.sh&isFile=true) chips, including models, algorithms, kernels, and tasks.
## Why AscendSpeed?
## Why AscendSpeed?
---
### Prepared Models
Currently, the following models are supported:
Currently, the following models are supported:
* <a href="https://huggingface.co/BAAI/Aquila-7B/tree/main" style="color:green">Aquila</a>-[[README: 7B]](examples/aquila/README.md)
* <a href="https://github.com/baichuan-inc" style="color:green">Baichuan</a>-[[README: 7B/13B]](examples/baichuan/README.md)
* <a href="https://arxiv.org/pdf/2108.12409.pdf" style="color:green">Bloom</a>-[[README: 7B/176B]](examples/bloom/README.md)
* <a href="https://internlm.intern-ai.org.cn/" style="color:green">InternLM</a>-[[README: 7B]](examples/intern/README.md)
* <a href="https://internlm.intern-ai.org.cn/" style="color:green">InternLM</a>-[[README: 7B/65B]](examples/intern/README.md)
* <a href="https://huggingface.co/docs/transformers/main/model_doc/llama" style="color:green">LLaMA</a>-[[README: 7B/13B/33B/65B]](examples/llama/README.md)
* <a href="https://huggingface.co/docs/transformers/main/model_doc/llama2" style="color:green">LLaMA2</a>-[[README: 7B/13B/70B]](examples/llama2/README.md)
LLaMA2-34B, Aquila-7B, Baichuan2-7B/13B are coming soon ...
LLaMA2-34B, Baichuan2-7B/13B are coming soon ...
### Downstream Tasks
Currently, the following downstream tasks have been supported:
@ -85,6 +85,16 @@ Coming soon ...
</tr>
</thead>
<tbody>
<tr>
<td rowspan="1"><a href="examples/aquila/README.md">Aquila</a></td>
<td>7B</td>
<td> 1x8</td>
<td> FP16 </td>
<td> 3644 </td>
<td> 4078 </td>
<td> <a href="./sources/images/aquila/aquila_comp1130.png">Loss</a> </td>
<td> <a href="examples/aquila/pretrain_aquila_7B.sh">Train</a> </td>
</tr>
<tr>
<td rowspan="2"><a href="examples/baichuan/README.md">Baichuan</a></td>
<td>7B</td>
@ -124,7 +134,7 @@ Coming soon ...
<td> <a href="examples/bloom/pretrain_bloom_176b.sh">Train</a> </td>
</tr>
<tr>
<td><a href="examples/intern/README.md">InternLM</td>
<td rowspan="2"><a href="examples/intern/README.md">InternLM</a></td>
<td>7B</td>
<td>1x8</td>
<td>BF16</td>
@ -133,6 +143,15 @@ Coming soon ...
<td> <a href="sources/images/intern7b_loss.png">Loss</a> </td>
<td> <a href="examples/intern/pretrain_internlm_7b_zero.sh">Train</a> </td>
</tr>
<tr>
<td >65B</td>
<td >4x8</td>
<td> BF16 </td>
<td> 342 </td>
<td> 414 </td>
<td> <a href="sources/images/intern65b_loss.png">Loss</a> </td>
<td> <a href="examples/intern/pretrain_internlm_65b_ptd_32p.sh">Train</a> </td>
</tr>
<tr>
<td rowspan="5"><a href="examples/llama/README.md">LLaMA</a></td>
<td>7B</td>
@ -177,11 +196,11 @@ Coming soon ...
<td rowspan="3"><a href="examples/llama2/README.md">LLaMA2</a></td>
<td>7B</td>
<td>1x8</td>
<td>FP16 </td>
<td> 2712</td>
<td> 2348 </td>
<td> <a href="sources/images/llama2/llama2_7b_shape_fp16_layer32_loss_with_weights.png">Loss</a> </td>
<td> <a href="examples/llama2/pretrain_llama2_7b_zero_8p.sh">Train</a> </td>
<td>BF16 </td>
<td> 2662</td>
<td> 2884 </td>
<td> <a href="sources/images/llama2/llama2-7b-tp8pp1mbs4gbs16-cann1115-Megatron-GPU-loss-releative.png">Loss</a> </td>
<td> <a href="examples/llama2/pretrain_llama2_7b_ptd.sh">Train</a> </td>
</tr>
<tr>
<td>13B</td>
@ -211,8 +230,8 @@ This is an example of model training with AscendSpeed, and the detailed guidelin
1. Clone the repository to your local server:
```bash
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
mkdir logs
mkdir ckpt
```
@ -285,7 +304,7 @@ python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-mode
5. Start your task
```bash
# set your data path / weight path / tokenizer path etc.
# set your data path / weight path / tokenizer path etc.
sh examples/llama/pretrain_llama_7B_zero_8p.sh
```
@ -370,10 +389,10 @@ sh examples/llama/pretrain_llama_7B_zero_8p.sh
<tr>
<td><a href="examples/llama2/README.md">LLaMA2</a></td>
<td>7B</td>
<td> <a href="https://gitee.com/ascend/AscendSpeed/tree/master/examples/llama2/tune_llama2_7b_ptd.sh">lora</a> </td>
<td> <a href="https://gitee.com/ascend/AscendSpeed/tree/master/examples/llama/generate_llama2_7b_ptd.sh">inference </a> </td>
<td> -- </td>
<td> <a href="https://gitee.com/ascend/AscendSpeed/tree/master/examples/llama/generate_llama_7B_tp2_pp2.sh">inference </a> </td>
<td> -- </td>
<td> -- </td>
<td> <a href="https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json">alpaca_data.json</a> </td>
</tr>
</tbody>
</table>
@ -388,7 +407,7 @@ sh examples/llama/pretrain_llama_7B_zero_8p.sh
# for llama, download alpaca dataset, like
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
# download tokenizer configs and (selective) weights from
# download tokenizer configs and (selective) weights from
# https://huggingface.co/yahma/llama-7b-hf/tree/main
# revise "LLaMATokenizer" as "LlamaTokenizer" in tokenizer_config.json (This is a bug of huggingface)
mkdir dataset
@ -402,7 +421,7 @@ python tools/preprocess_data.py --input train-00000-of-00001-a09b74b3ef9c3b56.pa
#### Preprocessing pretraining dataset
##### wikipedia dataset
##### wikipedia dataset
+ download [wikipedia data](https://huggingface.co/datasets/wikipedia/tree/main) from huggingface to WORKSPACE/wikipedia
+ download [llama tokenizer model and config](https://huggingface.co/yahma/llama-7b-hf/tree/main) from huggingface to WORKSPACE/llama-7b-hf
@ -414,7 +433,7 @@ cd WORKSPACE
mkdir wikipedia_preprocessed
# specify huggingface load_dataset parameters.(--input param will be ignored)
# these params will just be fed into the datasets.load_dataset function
# these params will just be fed into the datasets.load_dataset function
hf_config_json="./hf_config_json.json"
cat <<EOT > $hf_config_json
{
@ -463,7 +482,7 @@ python tools/preprocess_data.py --input WORKSPACE/train-00000-of-00001-a09b74b3e
# for llama, download alpaca dataset, like
# wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
# download tokenizer configs and (selective) weights from
# download tokenizer configs and (selective) weights from
# https://huggingface.co/yahma/llama-7b-hf/tree/main
# revise "LLaMATokenizer" as "LlamaTokenizer" in tokenizer_config.json (This is a bug of huggingface)
@ -477,7 +496,7 @@ python tools/preprocess_data.py --input WORKSPACE/alpaca/train-00000-of-00001-a0
--handler-name GeneralInstructionHandler
```
After preprocessing, there will be three `bin` files and three `idx` files in the `WORKSPACE/alpaca_preprocessed` directory. Then, we can train a model with the `--data-path WORKSPACE/alpaca_preprocessed/alpaca` and `--is-instruction-dataset` flags.
After preprocessing, there will be three `bin` files and three `idx` files in the `WORKSPACE/alpaca_preprocessed` directory. Then, we can train a model with the `--data-path WORKSPACE/alpaca_preprocessed/alpaca` and `--is-instruction-dataset` flags.
In addition, we have developed the dynamic padding function based on the instruction dataset, which can be implemented using the `--variable-seq-lengths` flag.
Note that the instruction dataset is processed with the `--handler-name GeneralInstructionHandler` flag, which selects the `GeneralInstructionHandler` class in `ascendspeed/data/data_handler.py` to create prompts.
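To make the alpaca-style record layout concrete, here is a minimal sketch of how a prompt could be built from the `instruction`, `input` and `output` columns. It is not the code of `GeneralInstructionHandler`; the template text follows the public Stanford Alpaca format, and `build_prompt` is a hypothetical helper used only for illustration.

```python
# Hypothetical illustration; NOT the code in ascendspeed/data/data_handler.py.
# The template wording follows the public Stanford Alpaca format (assumption).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_prompt(record):
    """Return (prompt, target) for one record with instruction/input/output keys."""
    if record.get("input"):
        prompt = PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    else:
        # Records without extra context use the shorter template.
        prompt = PROMPT_NO_INPUT.format(instruction=record["instruction"])
    return prompt, record["output"]


example = {"instruction": "Add the numbers.", "input": "1 2", "output": "3"}
prompt, target = build_prompt(example)
print(prompt)
print(target)
```

Only the response part (`target`) would normally contribute to the loss in instruction tuning; whether AscendSpeed masks the prompt tokens this way is not stated here.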
@ -489,7 +508,7 @@ In addition, `BelleMultiTurnInstructionHandler` is used to handle [belle dataset
### <span id="jump12"> Low-parameter fine-tuning </span>
#### Lora
Now, we support Lora to fine-tune your models.
Now, we support Lora to fine-tune your models.
First, you need to install version 0.4.0 of the peft library, like this:
```shell
@ -510,11 +529,11 @@ There are other Lora related arguments here, you can find their definitions in t
# Llama example
--lora-r 64 \
--lora-alpha 128 \
--lora-modules-to-save word_embeddings lm_head.lm_head \
--lora-modules-to-save word_embeddings output_layer \
--lora-register-forward-hook word_embeddings input_layernorm \
```
Among them, the argument `--lora-register-forward-hook` is used to repair the gradient chain break caused by PP. It only needs to be set to the input layer of each PP stage, and the repair will not increase the trainable parameters.
Among them, the argument `--lora-register-forward-hook` is used to repair the gradient chain break caused by PP. It only needs to be set to the input layer of each PP stage, and the repair will not increase the trainable parameters. The argument `--lora-modules-to-save` is used for fine-tuning when expanding the vocabulary. If there is no need for this, there is no need to pass in this argument.
Finally, only Lora's parameters are saved after turning on Lora. Similarly, when loading a model, you need to specify the original model weight path and the Lora weight path. Parameters such as the optimizer are subject to those in the Lora weight path.
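As a rough illustration of what `--lora-r 64` and `--lora-alpha 128` imply, the sketch below computes the LoRA scaling factor (`alpha / r`, as used by the peft library) and the number of trainable parameters added to one linear layer. The 4096 x 4096 layer shape is an assumption chosen to resemble a LLaMA-7B projection, and `lora_stats` is a hypothetical helper, not part of AscendSpeed or peft.

```python
def lora_stats(in_features, out_features, r, alpha):
    """Summarize the cost of a rank-r LoRA adapter on one linear layer.

    LoRA adds two matrices, A (r x in_features) and B (out_features x r),
    and applies W + (alpha / r) * B @ A in place of W.
    """
    full_params = in_features * out_features
    lora_params = r * (in_features + out_features)
    return {
        "scaling": alpha / r,
        "lora_params": lora_params,
        "full_params": full_params,
        "ratio": lora_params / full_params,
    }


# Assumed layer shape (4096 x 4096), for illustration only.
stats = lora_stats(4096, 4096, r=64, alpha=128)
print(stats)
```

With these values the adapter update is scaled by 2.0 and adds about 3% of the parameters of the layer it wraps, which is why only the Lora weights need to be saved.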
@ -523,7 +542,7 @@ Finally, only Lora's parameters are saved after turning on Lora. Similarly, when
--lora-load ${LORA_CHECKPOINT} \
```
An [example](examples/llama/tune_llama_ptd_13b.sh) is available for reference.
An [example](examples/llama/tune_llama_ptd_13b.sh) is available for reference.
After using Lora to fine-tune the Llama model, the instruction dialogue effect is as follows:
@ -548,11 +567,11 @@ Currently, we support the following four cases of inference:
Here are some example scripts for the different modes mentioned above that you can launch directly.
***Please note:***
1. If you want to use the weight from huggingface, please run the weight conversion script first.
1. If you want to use the weight from huggingface, please run the weight conversion script first.
Take Llama-7B, for example:
- PTD only
```bash
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-model-dir llama-7b-hf \
--output-model-dir llama-7b-tp2-pp2 \
@ -560,7 +579,7 @@ Here are some example scripts in different mode mentioned above for you to launc
--pipeline-model-parallel-size 2 \
--type 7B
```
- DeepSpeed ZeRO only
```bash
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-model-dir llama-7b-hf \
@ -568,7 +587,7 @@ Here are some example scripts in different mode mentioned above for you to launc
--type 7B \
--deepspeed
```
2. You need to modify some variables in the shell script such as **model weight path** and **vocab path**.
- **PTD only:** In this mode, the model is split by pipeline parallel and tensor parallel mode in megatron ways.
@ -767,7 +786,7 @@ VOCAB_FILE=../models/llama7b-hf/
# configure task and data path
DATA_PATH="dataset/boolq/test"
TASK="boolq"
# configure generation parameters
# configure generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation_llama.py \
--task-data-path $DATA_PATH \
--task $TASK\
@ -815,39 +834,39 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation_llama.py \
--micro-batch-size 1 \
--seed 42 | tee logs/train.log
```
##### BoolQ
##### BoolQ
BoolQ is a question answering dataset for yes/no questions. Each question contains a triplet of (question, passage, answer), with the title of the page as optional additional context.
The evaluation of the BoolQ data set is relatively simple, just configure `TASK="boolq"`, `--seq-length=512`, `--max-position-embeddings=512`, `--max-new-token=2`.
The zero-shot results are usually affected by the given prompt, and a higher score can be obtained by a suitable prompt.
The zero-shot results are usually affected by the given prompt, and a higher score can be obtained by a suitable prompt.
The prompt can be modified in `tasks/evaluation/evaluation.py`
```bash
# Update new prompt by changing the template
template = {instruction}
```
##### MMLU
##### MMLU
Since MMLU is a multidisciplinary task evaluated with 5 shots, the length of each subject's questions varies greatly. If you want to run all 57 subjects at the same time, set `TASK="mmlu"`, `--seq-length=2048`, `--max-position-embeddings=2048`, `--max-new-token=2`. (`--max-new-tokens` can be set between 2 and 4).
On many websites, MMLU accuracy is reported per discipline. The 57 subjects belong to four major categories, so the results can also be summarized by major category. The [website](https://github.com/hendrycks/test/blob/master/categories.py) gives the major category of each of the 57 subjects.
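Summarizing per-subject results by major category can be sketched as follows. The mapping below is a small hand-written excerpt (the full mapping lives in the linked `categories.py`), and `category_accuracy` is a hypothetical helper rather than a function from `tasks/evaluation`.

```python
from collections import defaultdict

# Hypothetical excerpt of a subject -> major category mapping.
SUBJECT_TO_CATEGORY = {
    "college_physics": "STEM",
    "high_school_mathematics": "STEM",
    "philosophy": "humanities",
    "econometrics": "social sciences",
    "clinical_knowledge": "other",
}


def category_accuracy(results):
    """Aggregate accuracy per category.

    results maps subject -> (num_correct, num_questions). Questions are pooled
    within a category, so larger subjects carry more weight.
    """
    totals = defaultdict(lambda: [0, 0])
    for subject, (correct, count) in results.items():
        category = SUBJECT_TO_CATEGORY[subject]
        totals[category][0] += correct
        totals[category][1] += count
    return {cat: correct / count for cat, (correct, count) in totals.items()}


# Made-up counts, for illustration only.
demo = {
    "college_physics": (30, 100),
    "high_school_mathematics": (50, 100),
    "philosophy": (60, 120),
}
print(category_accuracy(demo))
```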
##### GSM8K
##### GSM8K
GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems created by human problem writers. The answer to each question is a specific number. Because the evaluation is few-shot, the questions are relatively long, and the output answer contains a chain of thought, configure `TASK="gsm8k"`, `--seq-length=2048`, `--max-position-embeddings=2048`, `--max-new-token=128`. (`--max-new-tokens` can be set between 256-512).
##### HumanEval
HumanEval dataset is a handcrafted set of 164 programming problems designed to challenge code generation models. The problems include a function signature, docstring, body, and several unit tests, all handwritten to ensure they're not included in the training set of code generation models.
##### HumanEval
HumanEval dataset is a handcrafted set of 164 programming problems designed to challenge code generation models. The problems include a function signature, docstring, body, and several unit tests, all handwritten to ensure they're not included in the training set of code generation models.
Since the answer of HumanEval dataset contains long codes, it is necessary to configure `TASK="human_eval"`, `--seq-length=2048`, `--max-position-embeddings=2048`, `--max-new-token=1024`.
##### AGIEval
AGIEval is a human-centric benchmark specifically designed to evaluate the general
abilities of foundation models in tasks pertinent to human cognition and problem-solving. This benchmark is derived from 20 official, public, and high-standard admission and qualification exams intended for general human test-takers, such as general college admission tests (e.g., Chinese College Entrance Exam (Gaokao) and American SAT), law school admission tests, math competitions, lawyer qualification tests, and national civil service exams. Since the length of answers varies across question types, configure `TASK="agieval"`, `--seq-length=2048`, `--max-position-embeddings=2048`, `--max-new-token=1024` to fit the longest answer.
AGIEval is a human-centric benchmark specifically designed to evaluate the general
abilities of foundation models in tasks pertinent to human cognition and problem-solving. This benchmark is derived from 20 official, public, and high-standard admission and qualification exams intended for general human test-takers, such as general college admission tests (e.g., Chinese College Entrance Exam (Gaokao) and American SAT), law school admission tests, math competitions, lawyer qualification tests, and national civil service exams. Since the length of answers varies across question types, configure `TASK="agieval"`, `--seq-length=2048`, `--max-position-embeddings=2048`, `--max-new-token=1024` to fit the longest answer.
##### Big-Bench-Hard
The Big-Bench-Hard dataset is a subset of BIG-Bench, a diverse evaluation suite, and focuses on 23 challenging BIG-Bench tasks. These are the tasks on which prior language model evaluations did not outperform the average human rater. The dataset covers multiple areas including text understanding, reasoning, logical reasoning, mathematical reasoning, and common sense reasoning.
Except word_sorting, all datasets are multiple-choice questions. So we can set `TASK="bbh"`, `--seq-length=2048`, `--max-position-embeddings=2048`, `--max-new-token=32`. (`--max-new-tokens` can be set between 32-64).
##### CEval
As [C-Eval](https://cevalbenchmark.com/) shows, C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13948 multi-choice questions spanning 52 diverse disciplines and four difficulty levels. See the C-Eval website and paper for dataset examples and more details. The dataset contains validation and test data; however, only the validation data has labels for automatic evaluation. If
you want to evaluate on test data, you should email your results to [C-Eval](https://cevalbenchmark.com/).
As [C-Eval](https://cevalbenchmark.com/) shows, C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13948 multi-choice questions spanning 52 diverse disciplines and four difficulty levels. See the C-Eval website and paper for dataset examples and more details. The dataset contains validation and test data; however, only the validation data has labels for automatic evaluation. If
you want to evaluate on test data, you should email your results to [C-Eval](https://cevalbenchmark.com/).
#### Configuration of models and datasets
@ -858,7 +877,7 @@ python convert_weights_from_huggingface.py \
--output-model-dir /home/w425040/models/llama-7b-tp2-pp4 \
--type 7B \
--tensor-model-parallel-size 2 \
--pipeline-model-parallel-size 4
--pipeline-model-parallel-size 4
```
Then, configure the dataset path and task. Note: since the evaluation parameters of different datasets are not identical, it is not recommended to evaluate two or more different datasets together. Evaluation parameters such as `--seq-length`, `--max-new-tokens` and `--max-position-embeddings` need to be adjusted for each dataset. The recommended parameters for each dataset are given in the following instructions.
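The per-dataset values recommended in this README can be collected in one place, as in the sketch below. `RECOMMENDED` and `eval_flags` are hypothetical names used for illustration; the flag spelling `--max-new-tokens` is an assumption, since this README writes both `--max-new-token` and `--max-new-tokens`.

```python
# (seq_length, max_position_embeddings, max_new_tokens) per task,
# copied from the per-dataset recommendations in this README.
RECOMMENDED = {
    "boolq": (512, 512, 2),
    "mmlu": (2048, 2048, 2),
    "gsm8k": (2048, 2048, 128),
    "human_eval": (2048, 2048, 1024),
    "agieval": (2048, 2048, 1024),
    "bbh": (2048, 2048, 32),
}


def eval_flags(task):
    """Build the dataset-dependent command-line flags for one task."""
    seq_length, max_pos, max_new = RECOMMENDED[task]
    return [
        "--task", task,
        "--seq-length", str(seq_length),
        "--max-position-embeddings", str(max_pos),
        "--max-new-tokens", str(max_new),  # flag spelling is an assumption
    ]


print(" ".join(eval_flags("boolq")))
```

Keeping one entry per task makes it harder to evaluate two datasets with the wrong sequence length by accident.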
@ -869,7 +888,7 @@ VOCAB_FILE=../models/llama7b-hf/
# configure task and data path
DATA_PATH="dataset/boolq/test"
TASK="boolq"
# configure generation parameters
# configure generation parameters
```
## Introduction For Acceleration Features
@ -877,7 +896,7 @@ TASK="boolq"
---
### <span id="jump1"> Tensor Parallelism </span>
Tensor parallelism (TP) is a kind of model parallelism strategy, which splits execution of a single transformer module over multiple devices.
Tensor parallelism (TP) is a kind of model parallelism strategy, which splits execution of a single transformer module over multiple devices.
The basic principle of TP is:<div align=center>
<img src="sources/images/tp_in_mlp.png" height="280px" width="500px">
<img src="sources/images/tp_in_sa.png" height="280px" width="500px"></div>
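The idea of splitting one linear layer across devices can be sketched in plain Python. This is a single-process toy model of column-parallel matrix multiplication, not AscendSpeed code: each simulated rank holds a slice of the weight columns, and concatenating the partial outputs stands in for an all-gather. All function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_parallel_matmul(x, w, tp_size):
    """Compute x @ w with the columns of w split across tp_size simulated ranks.

    Assumes the number of columns of w is divisible by tp_size.
    """
    shard = len(w[0]) // tp_size
    partial_outputs = []
    for rank in range(tp_size):
        # Each rank only stores and multiplies its own column slice.
        w_shard = [row[rank * shard:(rank + 1) * shard] for row in w]
        partial_outputs.append(matmul(x, w_shard))
    # Concatenate along the column axis (stand-in for an all-gather).
    return [sum((part[i] for part in partial_outputs), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(column_parallel_matmul(x, w, tp_size=2))
print(matmul(x, w))
```

Both calls print the same matrix, which is the property that lets each device keep only a fraction of the layer's weights.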
@ -890,7 +909,7 @@ smaller microbatches. Virtual pipeline (VP) parallelism optimizes PP by add virt
<img src="sources/images/pp_vp.png" height="350px" width="800px"></div>
To enable pipeline model parallelism, use the `--pipeline-model-parallel-size` flag to specify the number of stages to split the model into (e.g., splitting a model with 24 transformer layers across 4 stages would mean each stage gets 6 transformer layers each).
To enable virtual pipeline parallelism, additionally use the `--num-layers-per-virtual-pipeline-stage` flag to set the number of layers per virtual stage.
To enable virtual pipeline parallelism, additionally use the `--num-layers-per-virtual-pipeline-stage` flag to set the number of layers per virtual stage.
To enable optimized pipeline parallelism (OPP), additionally use the `--optimized-pipeline` and `--manual-mbs example-config-1` flags on top of PP. Note that both VP and OPP reduce bubble time, but increase communication time.
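The arithmetic behind these flags can be sketched as follows. The layer split matches the 24-layer, 4-stage example above. The bubble estimate uses the ratio commonly quoted for Megatron-LM style schedules, (p - 1) / (v * m) for p stages, m micro-batches and v virtual stages per device; applying it to AscendSpeed's schedule is an assumption, and both helper names are hypothetical.

```python
def layers_per_stage(num_layers, pp_size):
    """Number of transformer layers held by each pipeline stage."""
    if num_layers % pp_size != 0:
        raise ValueError("num_layers must be divisible by the pipeline size")
    return num_layers // pp_size


def bubble_ratio(pp_size, num_microbatches, virtual_stages=1):
    """Idle (bubble) time relative to ideal compute time: (p - 1) / (v * m)."""
    return (pp_size - 1) / (virtual_stages * num_microbatches)


print(layers_per_stage(24, 4))                 # layers on each of 4 stages
print(bubble_ratio(4, 8))                      # plain PP
print(bubble_ratio(4, 8, virtual_stages=2))    # VP with 2 virtual stages
```

Under this estimate, doubling the number of virtual stages halves the bubble, at the price of more inter-stage communication, consistent with the note above.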
### <span id="jump3"> Fold3D </span>
@ -899,8 +918,8 @@ Fold3D hides the commutation time of data parallelism in VP. The basic principle
To enable Fold3D, use the `--fold-mode "aiao"` flag to choose the strategy.
### <span id="jump4"> (Selective) Recomputation </span>
To reduce NPU memory usage so that a large model can be deployed on a training system, we support activation checkpointing and recomputation.
We support two levels of recompute granularity: full and selective. To enable full recomputation, use the `--checkpoint-activations` flag; to enable selective recomputation, use the `--checkpoint-policy` flag to
To reduce NPU memory usage so that a large model can be deployed on a training system, we support activation checkpointing and recomputation.
We support two levels of recompute granularity: full and selective. To enable full recomputation, use the `--checkpoint-activations` flag; to enable selective recomputation, use the `--checkpoint-policy` flag to
decide the strategy of selective recomputation. Note that the selective strategy is customized and optimized for Ascend chips.
To improve the performance of model training while maximizing the use of NPU memory, we support [auto selective recomputing](https://gitee.com/ascend/AscendSpeed/wikis/Algorithm%20Introduction/Auto%20Selective%20Recomputing) strategy by manually adjusting the training memory size.
@ -911,16 +930,16 @@ Auto selective recomputing selects a strategy based on the training memory infor
By default, profiling is stopped in step 10, with a minimum setting of 5 steps. It is recommended to stop profiling after the training memory is stable, in order to obtain a better choice of recalculation strategy.
### <span id="jump5"> Sequence Parallelism </span>
Sequence parallelism (SP) is a kind of model parallelism strategy, which splits the sequence axis in dropout and layernorm layers. SP depends on TP in our implementation.
The allreduce operation in TP is split to reduce-scatter and allgather by SP, which reduces the memory occupation in model training. The basic principle of SP is:<div align=center>
<img src="sources/images/sp.png" height="250px" width="800px"></div>
To enable sequence parallelism (SP), set the `--tensor-model-parallel-size` flag to a value larger than 1 and add the `--sequence-parallel` flag.
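The split of allreduce into reduce-scatter plus allgather can be sketched without any distributed runtime. The example below simulates ranks as plain Python lists; all function names are hypothetical and nothing here calls AscendSpeed or a real collective library. It only shows that the two-step form produces the same result as a single allreduce, while each rank holds just `1 / world_size` of the tensor between the two steps.

```python
from typing import List

Vector = List[float]


def _elementwise_sum(per_rank: List[Vector]) -> Vector:
    """Sum the vectors held by all ranks, element by element."""
    return [sum(values) for values in zip(*per_rank)]


def all_reduce(per_rank: List[Vector]) -> List[Vector]:
    """Every rank ends up with the full summed vector."""
    total = _elementwise_sum(per_rank)
    return [list(total) for _ in per_rank]


def reduce_scatter(per_rank: List[Vector]) -> List[Vector]:
    """Rank r ends up with only the r-th chunk of the summed vector."""
    world = len(per_rank)
    total = _elementwise_sum(per_rank)
    if len(total) % world != 0:
        raise ValueError("vector length must be divisible by the number of ranks")
    chunk = len(total) // world
    return [total[r * chunk:(r + 1) * chunk] for r in range(world)]


def all_gather(chunks: List[Vector]) -> List[Vector]:
    """Every rank ends up with the concatenation of all chunks."""
    full = [x for chunk in chunks for x in chunk]
    return [list(full) for _ in chunks]


if __name__ == "__main__":
    data = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]  # two simulated ranks
    print(reduce_scatter(data))
    print(all_gather(reduce_scatter(data)) == all_reduce(data))
```

The sharded state after `reduce_scatter` corresponds to the sequence-split regions (dropout and layernorm), which is where the memory saving comes from.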
### <span id="jump6"> ZeRO-1/2/3 </span>
Zero Redundancy Optimizer (ZeRO) is a memory-optimization strategy for data parallelism proposed by Microsoft.
AscendSpeed supports ZeRO-1/2/3 by adding a deepspeed branch. The basic principle of ZeRO is:<div align=center>
<img src="sources/images/ZeRO.png" height="250px" width="600px"></div>
To enable ZeRO-1/2/3, a deepspeed config is required; see this [example](examples/llama/pretrain_llama_7B_zero_8p.sh) for reference.
Notably, if only ZeRO-1 is needed, deepspeed is not necessary; simply set the `--use-distributed-optimizer` flag.
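To make the difference between the three stages concrete, the sketch below evaluates the per-device model-state memory formulas from the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for optimizer states. The function name and the 7B / 8-device example are hypothetical, activations and buffers are not counted, and the numbers are estimates rather than measurements on Ascend hardware.

```python
def zero_model_state_bytes(num_params: int, dp_size: int, stage: int,
                           optimizer_bytes_per_param: int = 12) -> float:
    """Estimated per-device model-state memory (bytes) for ZeRO stage 0-3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    params = 2.0 * num_params                        # FP16 parameters
    grads = 2.0 * num_params                         # FP16 gradients
    optim = float(optimizer_bytes_per_param) * num_params  # FP32 copy + Adam moments
    if stage >= 1:
        optim /= dp_size   # ZeRO-1 partitions the optimizer states
    if stage >= 2:
        grads /= dp_size   # ZeRO-2 additionally partitions the gradients
    if stage >= 3:
        params /= dp_size  # ZeRO-3 additionally partitions the parameters
    return params + grads + optim


if __name__ == "__main__":
    n = 7_000_000_000  # roughly a 7B-parameter model
    for stage in range(4):
        gb = zero_model_state_bytes(n, dp_size=8, stage=stage) / 1e9
        print(f"stage {stage}: {gb:.2f} GB per device")
```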
### <span id="jump7"> Inverted Triangle Acceleration </span>
@ -940,7 +959,7 @@ For llama and other LLMs without bias in FFN, the linear transformation in FFN c
### <span id="jump10"> Memory Overcommitment </span>
In mixed-precision training, multiple state tensors, such as parameter copies, gradient copies, and optimizer states, occupy a large amount of static memory (16N, where N is the number of parameters). In practice, however, the parameters and gradients that participate in the forward and backward computation (4N) account for only a small proportion, so optimizing the state tensors above can yield large memory savings. By analyzing how each part of the state tensors is actually used, a memory reuse mechanism is derived, and a multilevel optimizer memory optimization scheme integrating multiple algorithm modules is finally obtained.
- Memory Overcommitment O1: **Release FP32 Gradient**
- Advantages: Completely equivalent; Support for multiple optimizers; lossless performance.
- Algorithm principle: The static memory of the FP32 gradient copy that needs to be permanently stored is reused. The memory of the FP16 gradient is converted into the FP32 format by performing the Foreach+Cast operation when necessary, saving 4N space.
- Usage: This equivalent algorithm is applicable to all optimizers and can be triggered by specifying `--release-fp32-grad` in the script.
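The 16N and 4N figures above can be turned into a quick estimate. The sketch below assumes one breakdown that is consistent with the 16N total: an FP32 parameter copy (4N), an FP32 gradient copy (4N), and two FP32 Adam moments (8N). That breakdown, the function name, and the 7B example are assumptions made for illustration, not a description of the actual AscendSpeed allocator.

```python
def state_tensor_bytes(num_params: int, release_fp32_grad: bool = False) -> int:
    """Estimated static memory (bytes) held by the optimizer state tensors."""
    fp32_param_copy = 4 * num_params
    # O1 reuses the memory of the FP32 gradient copy instead of keeping it resident.
    fp32_grad_copy = 0 if release_fp32_grad else 4 * num_params
    adam_moments = 8 * num_params  # assumed: two FP32 moments per parameter
    return fp32_param_copy + fp32_grad_copy + adam_moments


if __name__ == "__main__":
    n = 7_000_000_000  # roughly a 7B-parameter model
    before = state_tensor_bytes(n)
    after = state_tensor_bytes(n, release_fp32_grad=True)
    print(f"baseline: {before / 1e9:.0f} GB, with O1: {after / 1e9:.0f} GB, "
          f"saved: {(before - after) / 1e9:.0f} GB")
```

In a real run these totals are divided across devices by whatever parallelism and optimizer sharding are in use, so the per-NPU saving is smaller than the model-wide 4N.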
@ -992,5 +1011,5 @@ We appreciate every PR from community, and welcome to contribute to AscendSpeed.
- Inner Function Description: An introduction to some inner implementation interfaces is available at [InnerInterface](https://gitee.com/ascend/AscendSpeed/wikis/Inner%20API%20Description/Some%20inner%20interface%20description?sort_id=8824096)
- Parameters Description: Descriptions and usage of some parameters are available at [param](https://gitee.com/ascend/AscendSpeed/wikis/Home).
- Permission Description: It is recommended that the umask value of Linux be greater than or equal to 027. Before running the program, you are advised to take security measures such as permission control for the files required for training, such as ckpt and logs. You are advised to run the program or execute commands as a regular user, not as root or a super user. You are also advised to set the folder permission to 750 and the file permission to 640. When multiple users share datasets, set the read and write permissions for folders and files based on the minimum permissions to avoid security problems such as unauthorized access.
- Path Description: Interfaces such as `torch.load` implicitly use the pickle module unless the `weights_only` parameter is set to True, and pickle is known to be insecure: malicious pickle data can be constructed to execute arbitrary code during unpickling. We suggest that you never load, in an unsafe mode, data that could have come from an untrusted source or been tampered with; load only data you trust. Moreover, when reading data from an external or user-specified path, including but not limited to the weights path and dataset path, make sure the path is trusted and safe.
- Communication Matrix: Please refer to this [link](https://gitee.com/ascend/AscendSpeed/wikis/Communition%20Matrix/AscendSpeed%20Communication%20Matrix?sort_id=8855822) to check the communication matrix.
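
The permission advice above fits together: with a umask of 027, a directory created with the default mode 777 ends up as 750 and a file created with the default mode 666 ends up as 640. The sketch below checks this on a POSIX system using only the standard library; the function name and the `ckpt` / `train.log` names are hypothetical placeholders.

```python
import os
import stat
import tempfile


def create_with_umask(root: str, mask: int = 0o027):
    """Create a directory and a file under `root` and return their permission bits."""
    old_mask = os.umask(mask)  # umask is process-wide, so restore it afterwards
    try:
        ckpt_dir = os.path.join(root, "ckpt")
        os.mkdir(ckpt_dir)  # requested mode 0o777, reduced by the umask
        log_file = os.path.join(ckpt_dir, "train.log")
        with open(log_file, "w") as handle:  # requested mode 0o666, reduced by the umask
            handle.write("step 0\n")
    finally:
        os.umask(old_mask)
    dir_mode = stat.S_IMODE(os.stat(ckpt_dir).st_mode) & 0o777
    file_mode = stat.S_IMODE(os.stat(log_file).st_mode) & 0o777
    return dir_mode, file_mode


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dir_mode, file_mode = create_with_umask(tmp)
        print(oct(dir_mode), oct(file_mode))
```

Files that already exist are not affected by the umask; for those, `os.chmod` with 0o750 or 0o640 would be needed.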


@ -7,7 +7,7 @@
# Contents
- [LLaMA2-7B](#LLaMA2-7B)
- [LLAMA2-7B](#LLAMA2-7B)
- [Training](#训练)
- [Script](#脚本)
- [Performance](#性能)
@ -35,26 +35,26 @@
- [Evaluation](#评估)
# LLaMA2-7B
# LLAMA2-7B
## Training
Hardware configuration for LLaMA2-7B training:
Hardware configuration for LLAMA2-7B training:
| Hardware | Configuration |
| :--: | :-------------: |
| NPU | 8 x Ascend NPUs |
Software configuration for LLaMA2-7B training:
Software configuration for LLAMA2-7B training:
| Software | Configuration |
| :-----------------------: | :-------------------------------------------------------------------------------------------------------: |
| python | 3.7.16 |
| python | 3.8.18 |
| driver | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| firmware | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| CANN | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| binary arithmetic package | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| torch | 1.11.0 |
| torch | 2.1.0 |
| torch_npu | [package](https://gitee.com/ascend/pytorch/releases) |
### Script
@ -70,14 +70,14 @@ Software configuration for LLaMA2-7B training:
2. Set up the environment
```bash
# python3.7
conda create -n test python=3.7
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-1.11.0-cp37-cp37m-manylinux2014_aarch64.whl
pip install torch_npu-1.11.0*-cp37-cp37m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp37-cp37m-linux_aarch64.whl
pip install torch-2.1.0-cp38-cp38-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38-linux_aarch64.whl
# install megatron-core
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
@ -92,7 +92,7 @@ Software configuration for LLaMA2-7B training:
# install other packages
pip install -r requirements.txt
```
3. Download the LLaMA2-7B [pre-trained weights and vocabulary](https://huggingface.co/daryl149/llama-2-7b-hf/tree/main)
3. Download the LLAMA2-7B [pre-trained weights and vocabulary](https://huggingface.co/daryl149/llama-2-7b-hf/tree/main)
```shell
#!/bin/bash
@ -110,38 +110,7 @@ Software configuration for LLaMA2-7B training:
cd ..
```
```text
# Note: to load the huggingface pre-trained weights, a deepspeed weight-loading bug must be patched first.
# In the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py`,
# change `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0`
# Original deepspeed/runtime/engine.py, around #Lines2746-2748
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None:
return False
# After the change
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None or len(zero_sd_list) == 0:
return False
```
3.1 Convert the weights from huggingface format to AscendSpeed format (deepspeed mode)
```bash
# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# convert the weight format
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-model-dir llama-2-7b-hf \
--output-model-dir ckpt \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--type 7B \
--deepspeed
```
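3.2 Convert the weights from huggingface format to AscendSpeed format (PTD mode)
Convert the weights from huggingface format to AscendSpeed format (PTD mode)
```bash
# modify the ascend-toolkit path
@ -178,26 +147,7 @@ Software configuration for LLaMA2-7B training: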
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
4.2 Pre-training in deepspeed mode
Configure the LLaMA2-7B pre-training script: examples/llama2/pretrain_llama2_7b_zero_8p.sh
```shell
# set the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# configure the tokenizer and dataset paths
TOKENIZER_PATH=./llama-2-7b-hf/ #tokenizer path
DATA_PATH=./dataset_llama2/alpaca_text_document #dataset path
```
Launch the LLaMA2-7B pre-training script: examples/llama2/pretrain_llama2_7b_zero_8p.sh
```shell
bash examples/llama2/pretrain_llama2_7b_zero_8p.sh
```
4.3 Pre-training in ptd mode
4.2 Pre-training in ptd mode
Configure the LLaMA2-7B PTD pre-training script: examples/llama2/pretrain_llama2_7b_ptd.sh
```shell
@ -240,17 +190,17 @@ Software configuration for LLaMA2-7B training:
--append-eod
```
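5.2 Fine-tuning in deepspeed mode
5.2.1 Full-parameter fine-tuning
The configuration script for full-parameter fine-tuning is basically the same as the pre-training script pretrain_llama2_7b_zero_8p.sh. *The only difference is the dataset.*
5.2 Full-parameter fine-tuning
The configuration script for full-parameter fine-tuning is basically the same as the pre-training script pretrain_llama2_7b_ptd.sh. *The differences are the dataset and the added training parameter --is-instruction-dataset.*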
```bash
DATA_PATH=./finetune_dataset/alpaca
--is-instruction-dataset \
```
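5.2.2 Lora fine-tuning
The Lora fine-tuning script is configured by adding the following lora parameters to the pre-training script pretrain_llama2_7b_zero_8p.sh:
5.3 Lora fine-tuning
The Lora fine-tuning script is configured by adding the following lora parameters to the pre-training script pretrain_llama2_7b_ptd.sh: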
```bash
--lora-target-modules query_key_value dense gate_proj up_proj down_proj \
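@ -271,8 +221,6 @@ Software configuration for LLaMA2-7B training:
--lora-load ${LORA_CHECKPOINT} \ # checkpoint of the lora parameters
```
5.3 Fine-tuning in PTD mode
*Fine-tuning in PTD mode works exactly the same way as in deepspeed mode. See the previous subsection for details.*
### Performance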
@ -403,8 +351,10 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS tasks/evaluation/evaluation
--num-attention-heads 32 \
--mlp-layer-fusion \
--load ${CHECKPOINT} \
--position-embedding-type rope \
--normalization RMSNorm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path $VOCAB_FILE \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--fp16 \
--micro-batch-size 1 \
@ -418,67 +368,68 @@ bash tasks/evaluation/eval.sh
```
Evaluation results:
```text
subject question_n acc
0 high_school_macroeconomics 390 0.466667
1 formal_logic 126 0.253968
2 international_law 121 0.652893
3 college_mathematics 100 0.330000
4 college_medicine 173 0.421965
5 world_religions 171 0.725146
6 moral_scenarios 895 0.220112
7 nutrition 306 0.513072
8 high_school_statistics 216 0.361111
9 medical_genetics 100 0.490000
10 college_chemistry 100 0.300000
11 professional_accounting 282 0.361702
12 professional_law 1534 0.338331
13 miscellaneous 783 0.698595
14 sociology 201 0.651741
15 professional_medicine 272 0.496324
16 logical_fallacies 163 0.552147
17 public_relations 110 0.563636
18 college_biology 144 0.506944
19 high_school_european_history 165 0.612121
20 philosophy 311 0.556270
21 abstract_algebra 100 0.310000
22 high_school_psychology 545 0.678899
23 high_school_computer_science 100 0.400000
24 elementary_mathematics 378 0.312169
25 high_school_us_history 204 0.617647
26 machine_learning 112 0.366071
27 astronomy 152 0.493421
28 global_facts 100 0.330000
29 high_school_mathematics 270 0.255556
30 electrical_engineering 145 0.496552
31 high_school_microeconomics 238 0.415966
32 business_ethics 100 0.540000
33 college_computer_science 100 0.400000
34 high_school_physics 151 0.317881
35 human_sexuality 131 0.526718
36 college_physics 102 0.245098
37 high_school_government_and_politics 193 0.720207
38 marketing 234 0.747863
39 high_school_geography 198 0.601010
40 security_studies 245 0.555102
41 high_school_chemistry 203 0.418719
42 management 103 0.699029
43 jurisprudence 108 0.537037
44 econometrics 114 0.350877
45 human_aging 223 0.591928
46 virology 166 0.403614
47 moral_disputes 346 0.528902
48 anatomy 135 0.451852
49 professional_psychology 612 0.498366
50 conceptual_physics 235 0.455319
51 computer_security 100 0.560000
52 clinical_knowledge 265 0.505660
53 us_foreign_policy 100 0.680000
54 prehistory 324 0.570988
55 high_school_world_history 237 0.645570
56 high_school_biology 310 0.535484
57 total 14042 0.478422
MMLU Running Time: 18266.85981464386
subject_name question_n acc_ref acc_npu score_diff
17 public_relations 110 0.563636 0.554545 0.009091
44 econometrics 114 0.368421 0.377193 0.008772
30 electrical_engineering 145 0.503448 0.510345 0.006897
5 world_religions 171 0.701754 0.707602 0.005848
25 high_school_us_history 204 0.647059 0.651961 0.004902
45 human_aging 223 0.596413 0.600897 0.004484
38 marketing 234 0.709402 0.713675 0.004274
55 high_school_world_history 237 0.620253 0.624473 0.004219
31 high_school_microeconomics 238 0.420168 0.424370 0.004202
7 nutrition 306 0.503268 0.500000 0.003268
56 high_school_biology 310 0.541935 0.545161 0.003226
20 philosophy 311 0.569132 0.565916 0.003215
24 elementary_mathematics 378 0.291005 0.293651 0.002646
22 high_school_psychology 545 0.645872 0.647706 0.001835
12 professional_law 1534 0.339635 0.340939 0.001304
13 miscellaneous 783 0.679438 0.678161 0.001277
6 moral_scenarios 895 0.221229 0.222346 0.001117
37 high_school_government_and_politics 193 0.694301 0.694301 0.000000
54 prehistory 324 0.555556 0.555556 0.000000
53 us_foreign_policy 100 0.700000 0.700000 0.000000
39 high_school_geography 198 0.626263 0.626263 0.000000
40 security_studies 245 0.522449 0.522449 0.000000
41 high_school_chemistry 203 0.408867 0.408867 0.000000
52 clinical_knowledge 265 0.513208 0.513208 0.000000
49 professional_psychology 612 0.482026 0.482026 0.000000
42 management 103 0.679612 0.679612 0.000000
43 jurisprudence 108 0.583333 0.583333 0.000000
51 computer_security 100 0.560000 0.560000 0.000000
50 conceptual_physics 235 0.417021 0.417021 0.000000
35 human_sexuality 131 0.526718 0.526718 0.000000
46 virology 166 0.439759 0.439759 0.000000
47 moral_disputes 346 0.514451 0.514451 0.000000
48 anatomy 135 0.459259 0.459259 0.000000
36 college_physics 102 0.215686 0.215686 0.000000
0 high_school_macroeconomics 390 0.420513 0.420513 0.000000
34 high_school_physics 151 0.311258 0.311258 0.000000
33 college_computer_science 100 0.420000 0.420000 0.000000
2 international_law 121 0.636364 0.636364 0.000000
3 college_mathematics 100 0.330000 0.330000 0.000000
4 college_medicine 173 0.410405 0.410405 0.000000
8 high_school_statistics 216 0.314815 0.314815 0.000000
9 medical_genetics 100 0.450000 0.450000 0.000000
10 college_chemistry 100 0.290000 0.290000 0.000000
11 professional_accounting 282 0.411348 0.411348 0.000000
14 sociology 201 0.601990 0.601990 0.000000
15 professional_medicine 272 0.452206 0.452206 0.000000
16 logical_fallacies 163 0.521472 0.521472 0.000000
18 college_biology 144 0.506944 0.506944 0.000000
19 high_school_european_history 165 0.575758 0.575758 0.000000
21 abstract_algebra 100 0.280000 0.280000 0.000000
23 high_school_computer_science 100 0.430000 0.430000 0.000000
26 machine_learning 112 0.375000 0.375000 0.000000
27 astronomy 152 0.500000 0.500000 0.000000
1 formal_logic 126 0.222222 0.222222 0.000000
29 high_school_mathematics 270 0.259259 0.259259 0.000000
32 business_ethics 100 0.450000 0.450000 0.000000
28 global_facts 100 0.380000 0.380000 0.000000
```
| dataset | subject_num | question_num | reference_acc | NPU acc |
|:---:|:---:|:---:|:---:|:---:|
| MMLU | 57| 14042 |0.4691|0.4698|
# LLaMA2-13B


@ -33,27 +33,27 @@
- [Inference](#inference-70b)
- [Evaluation](#Evaluation-70b)
# LLaMA2-7B
# LLAMA2-7B
## Training
Here's a hardware summary of pre-training LLaMA2-7B:
Here's a hardware summary of pre-training LLAMA2-7B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 8 x Ascend NPUs |
Here's a software summary of pre-training LLaMA2-7B:
Here's a software summary of pre-training LLAMA2-7B:
| Software | Version |
| :-----------------------: |:-----------:|
| Python | 3.7.16 |
| Python | 3.8.18 |
| driver | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| firmware | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| CANN | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| binary arithmetic package | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| torch | 1.11.0 |
| torch | 2.1.0 |
| torch_npu | [package](https://gitee.com/ascend/pytorch/releases) |
### Script
@ -69,14 +69,14 @@ Here's a software summary of pre-training LLaMA2-7B:
2. Build environment
```bash
# python3.7
conda create -n test python=3.7
# python3.8
conda create -n test python=3.8
conda activate test
# install torch and torch_npu
pip install torch-1.11.0-cp37-cp37m-manylinux2014_aarch64.whl
pip install torch_npu-1.11.0*-cp37-cp37m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp37-cp37m-linux_aarch64.whl
pip install torch-2.1.0-cp38-cp38-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38-linux_aarch64.whl
# install megatron-core
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
@ -122,22 +122,8 @@ Here's a software summary of pre-training LLaMA2-7B:
wget https://huggingface.co/daryl149/llama-2-7b-hf/resolve/main/tokenizer_config.json
cd ..
```
3.1 weight conversion in deepspeed mode
*Note that if you want to use the weight from huggingface, please run the weight conversion script first. The following uses llama-2-7b model weight conversion in deepspeed as an example.*
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# convert to deepspeed weights
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-model-dir llama-2-7b-hf \
--output-model-dir ckpt \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--type 7B \
--deepspeed
```
3.2 weight conversion in ptd mode
weight conversion in ptd mode
*Note that if you want to use the weight from huggingface, please run the weight conversion script first. The following uses llama-2-7b model weight conversion in ptd as an example.*
```bash
# modify the script according to your own ascend-toolkit path
@ -173,23 +159,7 @@ Here's a software summary of pre-training LLaMA2-7B:
--tokenizer-type PretrainedFromHF
```
4.2 pre-training using deepspeed mode
Config LLAMA2-7B pre-training script: examples/llama2/pretrain_llama2_7b_zero_8p.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the original dataset path in the script according to your own dataset path
TOKENIZER_PATH=./llama-2-7b-hf/ #tokenizer path
DATA_PATH=./dataset_llama2/alpaca_text_document #processed dataset
```
Launch LLAMA2-7B pre-training script: examples/llama2/pretrain_llama2_7b_zero_8p.sh
```shell
bash examples/llama2/pretrain_llama2_7b_zero_8p.sh
```
4.3 pre-training using ptd mode
4.2 pre-training using ptd mode
Config LLAMA2-7B pre-training script: examples/llama2/pretrain_llama2_7b_ptd.sh
```shell
# modify the script according to your own ascend-toolkit path
@ -229,14 +199,15 @@ Here's a software summary of pre-training LLaMA2-7B:
--handler-name GeneralInstructionHandler \
--append-eod
```
5.2 fine-tuning using deepspeed mode
5.2.1 Full Parameters Fine-Tuning
The configuration script for full parameters fine-tuning is basically the same as that for pretrain_llama2_7b_zero_8p.sh.*The only difference is the data set.*
5.2 Full Parameters Fine-Tuning
The configuration script for full parameters fine-tuning is basically the same as that for pretrain_llama2_7b_ptd.sh. *The differences are the dataset and the added training parameter `--is-instruction-dataset`.*
```bash
DATA_PATH=./finetune_dataset/alpaca
--is-instruction-dataset \
```
5.2.2 Lora Fine-Tuning
The Lora fine-tuning script is configured by adding the following lora parameters to the pretrain_llama2_7b_zero_8p.sh script:
5.3 Lora Fine-Tuning
The Lora fine-tuning script is configured by adding the following lora parameters to the pretrain_llama2_7b_ptd.sh script:
```bash
--lora-target-modules query_key_value dense gate_proj up_proj down_proj \
--lora-r 16 \
@ -251,10 +222,6 @@ Here's a software summary of pre-training LLaMA2-7B:
--load ${ORIGIN_CHECKPOINT} \
--lora-load ${LORA_CHECKPOINT} \
```
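To give a sense of what `--lora-r 16` means in practice, the sketch below counts the trainable parameters LoRA adds to one linear layer and computes the usual LoRA scaling factor `alpha / r`. The 4096 x 4096 layer shape comes from the hidden size used by the 7B scripts and `alpha = 32` from the example fine-tuning script. Whether AscendSpeed applies exactly this scaling is an assumption, and the function names are hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha: float, rank: int) -> float:
    """Standard LoRA scaling applied to the low-rank update (assumed here)."""
    return alpha / rank


if __name__ == "__main__":
    hidden = 4096
    full = hidden * hidden  # parameters of the frozen dense weight
    lora = lora_trainable_params(hidden, hidden, rank=16)
    print(f"full: {full}, lora: {lora}, ratio: {lora / full:.4%}")
    print(f"scaling: {lora_scaling(32, 16)}")
```

Under these assumptions, each adapted 4096 x 4096 projection trains under 1% of the parameters it would need in full fine-tuning.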
5.3 fine-tuning using ptd mode
*The modification method is the same as that in deepspeed mode. For details, see the previous section.*
### Performance
@ -374,7 +341,6 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS tasks/evaluation/evaluation
--seq-length 4096 \
--max-new-tokens 1 \
--max-position-embeddings 4096 \
--rotary-v3-impl \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 32 \
@ -383,8 +349,10 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS tasks/evaluation/evaluation
--num-attention-heads 32 \
--mlp-layer-fusion \
--load ${CHECKPOINT} \
--position-embedding-type rope \
--normalization RMSNorm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path $VOCAB_FILE \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--tokenizer-not-use-fast \
--fp16 \
--micro-batch-size 1 \
@ -397,67 +365,68 @@ bash tasks/evaluation/eval.sh
Evaluation results
```text
subject question_n acc
0 high_school_macroeconomics 390 0.466667
1 formal_logic 126 0.253968
2 international_law 121 0.652893
3 college_mathematics 100 0.330000
4 college_medicine 173 0.421965
5 world_religions 171 0.725146
6 moral_scenarios 895 0.220112
7 nutrition 306 0.513072
8 high_school_statistics 216 0.361111
9 medical_genetics 100 0.490000
10 college_chemistry 100 0.300000
11 professional_accounting 282 0.361702
12 professional_law 1534 0.338331
13 miscellaneous 783 0.698595
14 sociology 201 0.651741
15 professional_medicine 272 0.496324
16 logical_fallacies 163 0.552147
17 public_relations 110 0.563636
18 college_biology 144 0.506944
19 high_school_european_history 165 0.612121
20 philosophy 311 0.556270
21 abstract_algebra 100 0.310000
22 high_school_psychology 545 0.678899
23 high_school_computer_science 100 0.400000
24 elementary_mathematics 378 0.312169
25 high_school_us_history 204 0.617647
26 machine_learning 112 0.366071
27 astronomy 152 0.493421
28 global_facts 100 0.330000
29 high_school_mathematics 270 0.255556
30 electrical_engineering 145 0.496552
31 high_school_microeconomics 238 0.415966
32 business_ethics 100 0.540000
33 college_computer_science 100 0.400000
34 high_school_physics 151 0.317881
35 human_sexuality 131 0.526718
36 college_physics 102 0.245098
37 high_school_government_and_politics 193 0.720207
38 marketing 234 0.747863
39 high_school_geography 198 0.601010
40 security_studies 245 0.555102
41 high_school_chemistry 203 0.418719
42 management 103 0.699029
43 jurisprudence 108 0.537037
44 econometrics 114 0.350877
45 human_aging 223 0.591928
46 virology 166 0.403614
47 moral_disputes 346 0.528902
48 anatomy 135 0.451852
49 professional_psychology 612 0.498366
50 conceptual_physics 235 0.455319
51 computer_security 100 0.560000
52 clinical_knowledge 265 0.505660
53 us_foreign_policy 100 0.680000
54 prehistory 324 0.570988
55 high_school_world_history 237 0.645570
56 high_school_biology 310 0.535484
57 total 14042 0.478422
MMLU Running Time: 18266.85981464386
subject_name question_n acc_ref acc_npu score_diff
17 public_relations 110 0.563636 0.554545 0.009091
44 econometrics 114 0.368421 0.377193 0.008772
30 electrical_engineering 145 0.503448 0.510345 0.006897
5 world_religions 171 0.701754 0.707602 0.005848
25 high_school_us_history 204 0.647059 0.651961 0.004902
45 human_aging 223 0.596413 0.600897 0.004484
38 marketing 234 0.709402 0.713675 0.004274
55 high_school_world_history 237 0.620253 0.624473 0.004219
31 high_school_microeconomics 238 0.420168 0.424370 0.004202
7 nutrition 306 0.503268 0.500000 0.003268
56 high_school_biology 310 0.541935 0.545161 0.003226
20 philosophy 311 0.569132 0.565916 0.003215
24 elementary_mathematics 378 0.291005 0.293651 0.002646
22 high_school_psychology 545 0.645872 0.647706 0.001835
12 professional_law 1534 0.339635 0.340939 0.001304
13 miscellaneous 783 0.679438 0.678161 0.001277
6 moral_scenarios 895 0.221229 0.222346 0.001117
37 high_school_government_and_politics 193 0.694301 0.694301 0.000000
54 prehistory 324 0.555556 0.555556 0.000000
53 us_foreign_policy 100 0.700000 0.700000 0.000000
39 high_school_geography 198 0.626263 0.626263 0.000000
40 security_studies 245 0.522449 0.522449 0.000000
41 high_school_chemistry 203 0.408867 0.408867 0.000000
52 clinical_knowledge 265 0.513208 0.513208 0.000000
49 professional_psychology 612 0.482026 0.482026 0.000000
42 management 103 0.679612 0.679612 0.000000
43 jurisprudence 108 0.583333 0.583333 0.000000
51 computer_security 100 0.560000 0.560000 0.000000
50 conceptual_physics 235 0.417021 0.417021 0.000000
35 human_sexuality 131 0.526718 0.526718 0.000000
46 virology 166 0.439759 0.439759 0.000000
47 moral_disputes 346 0.514451 0.514451 0.000000
48 anatomy 135 0.459259 0.459259 0.000000
36 college_physics 102 0.215686 0.215686 0.000000
0 high_school_macroeconomics 390 0.420513 0.420513 0.000000
34 high_school_physics 151 0.311258 0.311258 0.000000
33 college_computer_science 100 0.420000 0.420000 0.000000
2 international_law 121 0.636364 0.636364 0.000000
3 college_mathematics 100 0.330000 0.330000 0.000000
4 college_medicine 173 0.410405 0.410405 0.000000
8 high_school_statistics 216 0.314815 0.314815 0.000000
9 medical_genetics 100 0.450000 0.450000 0.000000
10 college_chemistry 100 0.290000 0.290000 0.000000
11 professional_accounting 282 0.411348 0.411348 0.000000
14 sociology 201 0.601990 0.601990 0.000000
15 professional_medicine 272 0.452206 0.452206 0.000000
16 logical_fallacies 163 0.521472 0.521472 0.000000
18 college_biology 144 0.506944 0.506944 0.000000
19 high_school_european_history 165 0.575758 0.575758 0.000000
21 abstract_algebra 100 0.280000 0.280000 0.000000
23 high_school_computer_science 100 0.430000 0.430000 0.000000
26 machine_learning 112 0.375000 0.375000 0.000000
27 astronomy 152 0.500000 0.500000 0.000000
1 formal_logic 126 0.222222 0.222222 0.000000
29 high_school_mathematics 270 0.259259 0.259259 0.000000
32 business_ethics 100 0.450000 0.450000 0.000000
28 global_facts 100 0.380000 0.380000 0.000000
```
| dataset | subject_num | question_num | reference_acc |NPU acc|
|:---:|:-----------:|:------------:|:-------------:|:---:|
| MMLU | 57 | 14042 | 0.4691 |0.4698|
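
The `score_diff` column in the per-subject table can be reproduced from the other two accuracy columns as an absolute difference. The sketch below checks this on three rows copied from the table and also shows a question-weighted average over those rows. Whether the overall 0.4691 / 0.4698 figures were computed with this weighting is an assumption, and the three-row average is not the full MMLU score.

```python
# (subject_name, question_n, acc_ref, acc_npu), copied from the table above
ROWS = [
    ("public_relations", 110, 0.563636, 0.554545),
    ("econometrics", 114, 0.368421, 0.377193),
    ("electrical_engineering", 145, 0.503448, 0.510345),
]


def score_diff(acc_ref: float, acc_npu: float) -> float:
    """Absolute accuracy gap, rounded to the six decimals used in the table."""
    return round(abs(acc_ref - acc_npu), 6)


def weighted_accuracy(rows, column: int) -> float:
    """Question-weighted mean of the accuracy stored at index `column`."""
    total_questions = sum(row[1] for row in rows)
    return sum(row[1] * row[column] for row in rows) / total_questions


if __name__ == "__main__":
    for name, _, ref, npu in ROWS:
        print(name, score_diff(ref, npu))
    print(weighted_accuracy(ROWS, column=3))  # NPU accuracy over these three rows only
```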
# LLaMA2-13B


@ -38,6 +38,7 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/inference/inference
--seq-length 4096 \
--max-new-tokens 256 \
--use-flash-attn \
--use-fused-rmsnorm \
--seed 42 \
--position-embedding-type rope \
--normalization RMSNorm \


@ -69,6 +69,7 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS \
--lr-warmup-fraction 0.01 \
--mlp-layer-fusion \
--use-flash-attn \
--use-fused-rmsnorm \
--bf16 | tee ./logs/ascendspeed-llama2-7b-ptd-bf16_train-${logfile}.log
# training script for ptd mode


@ -0,0 +1,80 @@
# This is an example: training llama using PTD/
# The number of parameters is not aligned
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
# modify config according to your own actual situation
LOAD_CHECKPOINT_PATH="your init model load path"
SAVE_CHECKPOINT_PATH="your model ckpt save path"
TOKENIZER_PATH=./llama-2-7b-hf/ #tokenizer path
DATA_PATH=./finetune_dataset/alpaca #processed dataset
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
logfile=$(date +%Y%m%d)_$(date +%H%M%S)
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_llama.py \
--DDP-impl local \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--sequence-parallel \
--num-layers 32 \
--hidden-size 4096 \
--position-embedding-type rope \
--normalization RMSNorm \
--ffn-hidden-size 11008 \
--num-attention-heads 32 \
--attention-dropout 0.0 \
--hidden-dropout 0.0 \
--init-method-std 0.01 \
--micro-batch-size 4 \
--global-batch-size 16 \
--seq-length 4096 \
--max-position-embeddings 4096 \
--data-path $DATA_PATH \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--split 100,0,0 \
--distributed-backend nccl \
--lr 1.25e-5 \
--min-lr 1.25e-6 \
--lr-decay-style cosine \
--weight-decay 1e-1 \
--clip-grad 1.0 \
--initial-loss-scale 65536.0 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--log-interval 1 \
--load ${LOAD_CHECKPOINT_PATH} \
--save ${SAVE_CHECKPOINT_PATH} \
--save-interval 10000 \
--eval-interval 10000 \
--eval-iters 0 \
--use-fused-rotary-pos-emb \
--no-masked-softmax-fusion \
--no-load-optim \
--no-load-rng \
--train-iters 50000 \
--lr-warmup-fraction 0.01 \
--mlp-layer-fusion \
--use-flash-attn \
--use-fused-rmsnorm \
--is-instruction-dataset \
--lora-target-modules query_key_value dense proj dense_4h_to_h \
--lora-r 16 \
--lora-alpha 32 \
--bf16 | tee ./logs/ascendspeed-llama2-7b-ptd-bf16_train-${logfile}.log
# training script for ptd mode