llama2: upload the training and inference scripts and the README.

This commit is contained in:
ningbenzhe1 2023-11-24 12:15:46 +08:00
parent 569e1cc7b3
commit f801b32e15
8 changed files with 611 additions and 16 deletions

View File

@ -13,6 +13,9 @@
- [Performance](#performance)
- [Throughput](#throughput)
- [Accuracy](#accuracy)
- [Fine-tuning](#fine-tuning)
- [Full-parameter fine-tuning](#full-parameter-fine-tuning)
- [LoRA fine-tuning](#lora-fine-tuning)
- [Inference](#inference)
- [deepspeed_pipeline](#deepspeed_pipeline)
- [megatron](#megatron)
@ -27,7 +30,8 @@
- [deepspeed_pipeline](#deepspeed_pipeline)
- [megatron](#megatron)
- [Evaluation](#evaluation)
- [Example](#example)
- [Example](#example)
# Bloom-7B
## Training
@ -154,6 +158,26 @@ DATA_PATH=/home/bloom_data/enwiki_100k/enwiki-100k_text_document
bash examples/bloom/pretrain_bloom_7b1.sh
```
## Fine-tuning
### Full-parameter fine-tuning
The execution flow is the same as pre-training; configure the training weight path as follows:
```shell
# modify the pre-training weight path
CHECKPOINT_PATH='./ckpt'
```
### LoRA fine-tuning
The execution flow is the same as pre-training; modify the parameters as follows (integration into the launch script is sketched after this block):
```shell
# modify the pre-training weight path
CHECKPOINT_PATH='./ckpt'
# add the configuration argument below to the pretrain_bloom.py launch command
pretrain_bloom.py
--lora-target-modules query_key_value dense
```
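For orientation, a minimal sketch of where the flag goes, assuming `examples/bloom/pretrain_bloom_7b1.sh` builds the `pretrain_bloom.py` argument list as a line-continued command (illustration only; the launch itself is unchanged):
```shell
# Hypothetical excerpt: inside examples/bloom/pretrain_bloom_7b1.sh, append the LoRA flag
# to the existing pretrain_bloom.py arguments, e.g.
#   python ... pretrain_bloom.py \
#       ... existing training arguments ... \
#       --lora-target-modules query_key_value dense \
# then launch the script exactly as in pre-training:
bash examples/bloom/pretrain_bloom_7b1.sh
```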
## Performance
### Throughput
@ -486,6 +510,7 @@ bash ./examples/bloom/generate_bloom_176b_2nodes.sh
## Evaluation
Configure the Bloom-176B evaluation script: tasks/evaluation/eval_bloom.sh
```shell
@ -504,6 +529,12 @@ TASK="boolq"
--num-attention-heads 112
```
```text
# Note: a deepspeed bug needs to be worked around for evaluation
# Comment out line 671 of `<deepspeed-installed-path>/runtime/pipe/engine.py` (a scripted way is sketched below)
# self.total_loss += self.loss.detach()
```
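A hedged way to apply this change from the shell (verify line 671 first; the line number may shift across deepspeed releases):
```shell
# locate the installed deepspeed and inspect line 671 before editing it
DS_PATH=$(python -c "import deepspeed, os; print(os.path.dirname(deepspeed.__file__))")
sed -n '671p' "$DS_PATH/runtime/pipe/engine.py"
# comment the line out only if it is the expected statement
sed -i '671s/^/# /' "$DS_PATH/runtime/pipe/engine.py"
```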
```shell
bash ./tasks/evaluation/eval_bloom.sh
```
@ -528,7 +559,7 @@ bash ./tasks/evaluation/eval_bloom.sh
</tbody>
</table>
## Example
# Example
1. bloom 7b
![bloom_7b_generate.png](..%2F..%2Fsources%2Fimages%2Fbloom_7b_generate.png)
@ -536,7 +567,7 @@ bash ./tasks/evaluation/eval_bloom.sh
![bloom_176b_generate.png](..%2F..%2Fsources%2Fimages%2Fbloom_176b_generate.png)
## Citation
# Citation
```
@article{scao2022bloom,

View File

@ -9,21 +9,31 @@
# Contents
- [Bloom-7B](#contents)
- [Training](#pre-training)
- [Training](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Accuracy of the loss](#accuracy-of-the-loss)
- [Inference](#Inference)
- [Script](#script)
- [Machine performance](#Machine-performance)
- [Accuracy of the loss](#Accuracy-of-the-loss)
- [Fine-tune](#fine-tune)
- [Full parameter fine-tuning](#Full-parameter-fine-tuning)
- [LoRA fine-tuning](#LORA-fine-tuning)
- [Inference](#inference)
- [deepspeed pipeline](#deepspeed-pipeline)
- [megatron](#megatron)
- [Evaluation](#evaluation)
- [Bloom-176B](#contents)
- [Training](#pre-training)
- [Training](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Accuracy of the loss](#accuracy-of-the-loss)
- [Inference](#Inference)
- [Script](#script)
- [Inference](#inference)
- [deepspeed pipeline](#deepspeed-pipeline)
- [megatron](#megatron)
- [Evaluation](#evaluation)
- [Example](#example)
# Bloom-7B
@ -154,6 +164,30 @@ Run the examples/bloom/pretrain_bloom_7b1.sh on all nodes in the cluster.
bash examples/bloom/pretrain_bloom_7b1.sh
```
## Fine-tune
### Full parameter fine-tuning
The execution process is the same as pre-training. Configure the training weight path as follows:
```shell
# modify the model weight path
CHECKPOINT_PATH='./ckpt'
```
### LoRA fine-tuning
The execution process is the same as pre-training; modify the parameters as follows (a short sketch follows the code block):
```shell
# modify the model weight path
CHECKPOINT_PATH='./ckpt'
# Add configuration parameters
pretrain_bloom.py
--lora-target-modules query_key_value dense
```
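A short sketch, assuming the flag is added to the `pretrain_bloom.py` argument list inside `examples/bloom/pretrain_bloom_7b1.sh`; the grep is an optional check before launching:
```shell
# confirm the LoRA flag made it into the launch script, then start fine-tuning as in pre-training
grep -- '--lora-target-modules' examples/bloom/pretrain_bloom_7b1.sh
bash examples/bloom/pretrain_bloom_7b1.sh
```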
## Performance
### Machine Performance
@ -183,7 +217,7 @@ NPU vs GPU loss relative error.
We support AscendSpeed Inference for text generation with BLOOM 7B (deepspeed or megatron).
### deepspeed_pipeline
### deepspeed pipeline
```shell
# modify the model weight path and tokenizer path
@ -440,7 +474,7 @@ and GPU on a single-node system. The average relative error is 0.1%, less than 2
We support AscendSpeed Inference for text generation with BLOOM 176B (deepspeed or megatron).
### deepspeed_pipeline
### deepspeed pipeline
```shell
# modify the model weight path and tokenizer path
@ -506,6 +540,12 @@ In addition, you need to set the corresponding parameters according to the model
--num-attention-heads 112
```
```text
# Note that, a deepspeed bug needs to be fixed during evaluation
# Comment out line 671 in the file `<deepspeed-installed-path>/runtime/pipe/engine.py`
# self.total_loss += self.loss.detach()
```
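A hedged helper to apply the change described above (inspect the line first; the line number may differ across deepspeed releases):
```shell
# inspect, then comment out line 671 of the installed deepspeed pipeline engine (a .bak backup is kept)
DS_ENGINE=$(python -c "import deepspeed, os; print(os.path.join(os.path.dirname(deepspeed.__file__), 'runtime', 'pipe', 'engine.py'))")
sed -n '671p' "$DS_ENGINE"
sed -i.bak '671s/^/# /' "$DS_ENGINE"
```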
```shell
bash ./tasks/evaluation/eval_bloom.sh
```

View File

@ -3,7 +3,6 @@
<p align="left">
<b>简体中文</b> |
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/llama2/README_en.md">English</a> </b>
</p>
</p>
@ -16,6 +15,15 @@
- [Throughput](#throughput)
- [Accuracy](#accuracy)
- [LLaMA2-13B](#LLaMA2-13B)
- [Training](#training)
- [Script](#script)
- [Performance](#performance)
- [Throughput](#throughput)
- [Accuracy](#accuracy)
- [Inference](#inference)
- [Evaluation](#evaluation)
# LLaMA2-7B
@ -188,3 +196,211 @@ NPU vs. reference loss
Absolute error
![NPU-LOSS and NPU-Absolute-Error](../../sources/images/llama2/llama2_7b_shape_fp16_layer32_loss_with_weights_comparison_absolute.png)
# LLaMA2-13B
## Training
Hardware configuration for LLaMA2-13B training:
| Hardware | Configuration |
|:---:|:---------------:|
| NPU | 8 x Ascend NPUs |
Software configuration for LLaMA2-13B training:
| Software | Configuration |
|:-------------------------:|:------------------------------------------------------------------------------------------------------------:|
| python | 3.7.16 |
| driver | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| firmware | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| CANN | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| binary arithmetic package | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| torch | 1.11.0 |
| torch_npu | [package](https://gitee.com/ascend/pytorch/releases) |
### Script
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
mkdir logs
mkdir ckpt
```
2. Set up the environment (an optional sanity check is sketched after this step):
```bash
# python3.7
conda create -n test python=3.7
conda activate test
# install torch and torch_npu
pip install torch-1.11.0-cp37-cp37m-manylinux2014_aarch64.whl
pip install torch_npu-1.11.0*-cp37-cp37m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp37-cp37m-linux_aarch64.whl
# install megatron-core
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
# install deepspeed and deepspeed_npu
pip install deepspeed==0.9.2
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
cd deepspeed_npu
pip3 install -e ./
cd ..
# install other packages
pip install -r requirements.txt
```
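Optionally, a quick sanity check that the stack imports cleanly (a sketch; `torch.npu.is_available()` assumes torch_npu has been installed as above):
```shell
# verify torch / torch_npu / deepspeed import and that an NPU device is visible
python -c "import torch, torch_npu, deepspeed; print(torch.__version__, deepspeed.__version__, torch.npu.is_available())"
```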
3. Download the LLaMA2-13B [pretrained weights and tokenizer](https://huggingface.co/NousResearch/Llama-2-13b-hf/tree/main)
```bash
git lfs install
git clone https://huggingface.co/NousResearch/Llama-2-13b-hf
```
```text
# Note: to load huggingface pre-trained weights, a deepspeed checkpoint-loading bug must be fixed first
# In the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py`,
# change `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0`
# (a scripted way to apply this edit is sketched after this block)
# original deepspeed/runtime/engine.py, around lines 2746-2748
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None:
return False
# modified
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None or len(zero_sd_list) == 0:
return False
```
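A hedged way to apply this one-line change without opening an editor (it assumes the original condition appears exactly once in `runtime/engine.py`; a `.bak` backup is kept):
```shell
DS_PATH=$(python -c "import deepspeed, os; print(os.path.dirname(deepspeed.__file__))")
# rewrite the condition inside _load_zero_checkpoint, keeping a backup of the original file
sed -i.bak 's/if zero_sd_list is None:/if zero_sd_list is None or len(zero_sd_list) == 0:/' "$DS_PATH/runtime/engine.py"
grep -n "zero_sd_list is None" "$DS_PATH/runtime/engine.py"
```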
Convert the weights from huggingface format to AscendSpeed format:
```bash
# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# convert the weight format
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-model-dir llama-2-13b-hf \
--output-model-dir ckpt \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--type 13B \
--deepspeed
```
4. Prepare the dataset
Download the LLaMA2-13B [dataset](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download the data
mkdir dataset_llama2
cd ./dataset_llama2
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# preprocess the data
cd WORKSPACE
mkdir alpaca_preprocessed
python tools/preprocess_data.py --input WORKSPACE/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--output-prefix WORKSPACE/alpaca_preprocessed/alpaca \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path WORKSPACE/llama-13b-hf \
--tokenizer-not-use-fast \
--handler-name GeneralInstructionHandler
```
5. Configure the LLaMA2-13B pre-training script: examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```shell
# set the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# configure the tokenizer, dataset, and other paths
TOKENIZER_PATH=./llama-2-13b-hf/  # tokenizer path
DATA_PATH=WORKSPACE/alpaca_preprocessed/alpaca  # processed dataset path
```
6. Launch the LLaMA2-13B pre-training script: examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```shell
bash examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```
### Performance
#### Throughput
Performance comparison of LLaMA2-13B on **Ascend NPUs** and the **reference** device:
| Device | Model | Iterations | Throughput (samples/p/s) | Throughput (tokens/s/p) | Single-step time (s/step) | Compute (TFLOPs/s) |
|:----:|:---------:|:----:|:------------------:|:---------------------:|:---------------:|:----------------:|
| NPUs | LLaMA2-13B | 5000 | 2.868 | 1468.416 | 89.275 | 126.73 |
| Reference | LLaMA2-13B | -- | -- | 1750 | -- | -- |
#### Accuracy
NPU vs. reference loss
The NPU runs stably with steady resource usage and no errors during training; the loss shows a downward trend and the convergence speed is as expected.
The accuracy meets the requirements: the absolute error of the average loss is 0.0011, less than 0.5%.
![NPU-LOSS](../../sources/images/llama2/llama2_13b_bf16_loss_absolute.png)
## Inference
We support AscendSpeed inference for text generation with LLaMA2-13B.
Inference differs from pre-training; for example, we need to load the pre-trained checkpoint and set the length of the output samples:
Configure the LLaMA2-13B inference script: examples/llama2/generate_llama2_13B_tp8_pp1.sh
```shell
# modify the model weight path and tokenizer path
CHECKPOINT=./llama2-13b-tp8-pp1/
VOCAB_FILE=./llama2-13b-hf/
```
```shell
bash ./examples/llama2/generate_llama2_13B_tp8_pp1.sh
```
Example inference results:
![llama2-13B-generate.png](../../sources/images/llama2/llama2-13B-generate.png)
## Evaluation
We use the BoolQ benchmark to evaluate our model. Download the benchmark [here](https://huggingface.co/datasets/boolq).
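One way to fetch the benchmark locally is a git-lfs clone, mirroring the weight download above (a sketch; the evaluation config below expects the data under `./boolq/data/test/`):
```shell
git lfs install
git clone https://huggingface.co/datasets/boolq
```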
```shell
CHECKPOINT=./llama2-13b-tp8-pp1/
VOCAB_FILE=./llama2-13b-hf/
# configure the task and data path
DATA_PATH="./boolq/data/test/"
TASK="boolq"
# configure generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task $TASK \
--seq-length 4096 \
--max-new-tokens 32 \
--max-position-embeddings 4096 \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 40 \
--hidden-size 5120 \
--ffn-hidden-size 13824 \
--load ${CHECKPOINT} \
--num-attention-heads 40 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path $VOCAB_FILE \
--tokenizer-not-use-fast \
--fp16 \
--micro-batch-size 1 \
--seed 42 | tee logs/train.log
# start the evaluation
bash tasks/evaluation/eval.sh
```

View File

@ -2,10 +2,8 @@
<p align="left">
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/llama2/README.md">简体中文</a></b> |
<b>English</b>
</p>
</p>
# Contents
- [LLaMA2-7B](#contents)
@ -15,6 +13,15 @@
- [Machine performance](#machine-performance)
- [Accuracy of the loss](#accuracy-of-the-loss)
- [LLaMA2-13B](#contents)
- [Training](#pre-training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Accuracy of the loss](#accuracy-of-the-loss)
- [Inference](#Inference)
- [Evaluation](#Evaluation)
# LLaMA2-7B
@ -187,3 +194,205 @@ The relative error of the average loss is 0.0046, less than 2%, the maximum rela
The absolute error of the average loss is 0.0009, less than 2%, the maximum absolute error is 0.0246.
![NPU-LOSS and NPU-Absolute-Error](../../sources/images/llama2/llama2_7b_shape_fp16_layer32_loss_with_weights_comparison_absolute.png)
# LLaMA2-13B
## Training
Here's a hardware summary of pre-training LLaMA2-13B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 8 x Ascend NPUs |
Here's a software summary of pre-training LLaMA2-13B:
| Software | Version |
| :-----------------------: |:-----------:|
| Python | 3.7.16 |
| driver | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| firmware | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| CANN | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| binary arithmetic package | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| torch | 1.11.0 |
| torch_npu | [package](https://gitee.com/ascend/pytorch/releases) |
### Script
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
mkdir logs
mkdir ckpt
```
2. Build the environment (an optional sanity check is sketched after this step)
```bash
# python3.7
conda create -n test python=3.7
conda activate test
# install torch and torch_npu
pip install torch-1.11.0-cp37-cp37m-manylinux2014_aarch64.whl
pip install torch_npu-1.11.0*-cp37-cp37m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp37-cp37m-linux_aarch64.whl
# install megatron-core
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
# install deepspeed and deepspeed_npu
pip install deepspeed==0.9.2
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
cd deepspeed_npu
pip3 install -e ./
cd ..
# install other packages
pip install -r requirements.txt
```
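Optionally, a quick import check (a sketch; it assumes torch_npu and deepspeed were installed as above):
```shell
# confirm the NPU stack is importable and a device is visible
python -c "import torch, torch_npu, deepspeed; print(torch.npu.is_available())"
```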
*Note that if you want to train with weights from huggingface, first fix a deepspeed checkpoint-loading bug by changing `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0` in the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py` (a scripted way to apply this edit is sketched after the snippet below):*
```text
# original deepspeed/runtime/engine.py, around lines 2746-2748
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None:
return False
# modified
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None or len(zero_sd_list) == 0:
return False
```
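A hedged one-liner to apply the patch (it assumes the condition appears exactly once in `runtime/engine.py`; a `.bak` backup is kept):
```shell
DS_PATH=$(python -c "import deepspeed, os; print(os.path.dirname(deepspeed.__file__))")
sed -i.bak 's/if zero_sd_list is None:/if zero_sd_list is None or len(zero_sd_list) == 0:/' "$DS_PATH/runtime/engine.py"
```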
3. Prepare pretrained weights and tokenizer
Download the LLaMA2-13B checkpoint from [here](https://huggingface.co/NousResearch/Llama-2-13b-hf/tree/main)
```bash
git lfs install
git clone https://huggingface.co/NousResearch/Llama-2-13b-hf
```
*Note that if you want to use the weights from huggingface, run the weight conversion script first. The following uses the llama-2-13b weight conversion as an example.*
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# convert to deepspeed weights
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-model-dir llama-2-13b-hf \
--output-model-dir ckpt \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--type 13B
```
4. Prepare dataset
Download the LLaMA2-13B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```bash
# download the dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd WORKSPACE
mkdir alpaca_preprocessed
python tools/preprocess_data.py --input WORKSPACE/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--output-prefix WORKSPACE/alpaca_preprocessed/alpaca \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path WORKSPACE/llama-13b-hf \
--tokenizer-not-use-fast \
--handler-name GeneralInstructionHandler
```
5. Config LLaMA2-13B pre-training script: examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the tokenizer path and dataset path according to your own paths
TOKENIZER_PATH=./llama-2-13b-hf/ #tokenizer path
DATA_PATH=WORKSPACE/alpaca_preprocessed/alpaca #processed dataset
```
6. Launch LLaMA2-13B pre-training script: examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```shell
bash examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```
### Performance
#### Machine performance
The performance of LLaMA2-13B on **Ascend NPU** and **Reference**:
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
| :------: |:----------:|:----------------:|:-----------------------------:|:----------------------------:|:-------------------------:|:-----------------------------------:|
| NPUs | LLaMA2-13B | 5000 | 2.868 | 1468.416 | 89.275 | 126.73 |
| Reference | LLaMA2-13B | -- | -- | 1750 | -- | -- |
#### Accuracy of the loss
NPU vs Reference loss.
The NPU runs smoothly, the resource usage is stable, no errors are reported in the middle of the process, the Loss is on a decreasing trend, and the convergence speed is as expected.
The precision meets the requirements. The absolute error of the average loss is 0.0011, less than 0.5%.
![NPU-LOSS](../../sources/images/llama2/llama2_13b_bf16_loss_absolute.png)
## Inference
We support AscendSpeed Inference for text generation with Llama2 13B.
Inference differs from pre-training; for example, we need to load the pre-trained checkpoint and set the length of the output samples:
Config Llama2-13B inference script: examples/llama2/generate_llama2_13B_tp8_pp1.sh
```shell
# modify the model weight path and tokenizer path
CHECKPOINT=./llama2-13b-tp8-pp1/
VOCAB_FILE=./llama2-13b-hf/
```
```shell
bash ./examples/llama2/generate_llama2_13B_tp8_pp1.sh
```
Some inference samples are as follows:
![llama2-13B-generate.png](../../sources/images/llama2/llama2-13B-generate.png)
## Evaluation
We use the BoolQ benchmark to evaluate our model. Download the benchmark [here](https://huggingface.co/datasets/boolq).
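To fetch the benchmark locally, a git-lfs clone of the dataset repository works (a sketch; the `DATA_PATH` below assumes the data lands under `./boolq/data/test/`):
```shell
git lfs install
git clone https://huggingface.co/datasets/boolq
```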
```shell
CHECKPOINT=./llama2-13b-tp8-pp1/
VOCAB_FILE=./llama2-13b-hf/
# configure task and data path
DATA_PATH="./boolq/data/test/"
TASK="boolq"
# configure generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task $TASK \
--seq-length 4096 \
--max-new-tokens 32 \
--max-position-embeddings 4096 \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 40 \
--hidden-size 5120 \
--ffn-hidden-size 13824 \
--load ${CHECKPOINT} \
--num-attention-heads 40 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path $VOCAB_FILE \
--tokenizer-not-use-fast \
--fp16 \
--micro-batch-size 1 \
--seed 42 | tee logs/train.log
# start evaluation
bash tasks/evaluation/eval.sh
```

View File

@ -0,0 +1,37 @@
#!/bin/bash
export TOKENIZERS_PARALLELISM=false
source /usr/local/Ascend/ascend-toolkit/set_env.sh
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT"
CHECKPOINT=./model/LLAMA-2-13B-hf_tp8_pp1_save/
VOCAB_FILE=./model/LLAMA-2-13B-hf
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/inference/inference_llama.py \
--no-contiguous-buffers-in-local-ddp \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 40 \
--hidden-size 5120 \
--ffn-hidden-size 13824 \
--load "${CHECKPOINT}" \
--num-attention-heads 40 \
--max-position-embeddings 4096 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path "$VOCAB_FILE" \
--tokenizer-not-use-fast \
--fp16 \
--micro-batch-size 1 \
--seq-length 4096 \
--max-new-tokens 256 \
--seed 42

View File

@ -0,0 +1,62 @@
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DATA_PATH=./dataset/llama_text_document
LOAD_CHECKPOINT=./model/LLAMA-2-13B-hf_tp8_pp1
SAVE_CHECKPOINT=./model/LLAMA-2-13B-hf_tp8_pp1_save/
TOKENIZER_PATH=./model/LLAMA-2-13B-hf
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
# Main script
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_llama.py \
--DDP-impl local \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--sequence-parallel \
--num-layers 40 \
--hidden-size 5120 \
--ffn-hidden-size 13824 \
--num-attention-heads 40 \
--micro-batch-size 2 \
--global-batch-size 2 \
--seq-length 4096 \
--max-position-embeddings 4096 \
--train-iters 1 \
--lr-warmup-iters 0 \
--save $SAVE_CHECKPOINT \
--load $LOAD_CHECKPOINT \
--data-path $DATA_PATH \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 1e-6 \
--lr-decay-style cosine \
--min-lr 1e-8 \
--weight-decay 1e-1 \
--clip-grad 1.0 \
--initial-loss-scale 4096.0 \
--seed 1234 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--adam-eps 1.0e-5 \
--log-interval 1 \
--save-interval 5000 \
--eval-interval 1000 \
--eval-iters 1 \
--use-fused-rotary-pos-emb \
--use-flash-attn \
--bf16 | tee logs/train_13B_llama2_npu.log

Binary file not shown (new image, 40 KiB)

Binary file not shown (new image, 289 KiB)