add baichuan chinese readme

This commit is contained in:
fengliangjun 2023-11-17 13:58:19 +08:00
parent dda7ac859b
commit eea45231c2
3 changed files with 495 additions and 122 deletions


@ -1,2 +0,0 @@
## Recipes and Scripts


@ -1,50 +1,55 @@
# BaiChuan
<p align="left">
<b>Simplified Chinese</b> |
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/baichuan/README_en.md">English</a> </b>
</p>
# Contents

- [Baichuan-7B](#baichuan-7b)
  - [Training](#training)
    - [Script](#script)
    - [Performance](#performance)
      - [Machine performance](#machine-performance)
      - [Accuracy of the loss](#accuracy-of-the-loss)
- [Baichuan-13B](#baichuan-13b)
  - [Training](#training-1)
    - [Script](#script-1)
    - [Performance](#performance-1)
      - [Machine performance](#machine-performance-1)
      - [Accuracy of the loss](#accuracy-of-the-loss-1)
# Baichuan-7B
## Training

Here's a hardware summary of pre-training Baichuan-7B:

| Hardware | Value |
| :------: | :---------------: |
| NPU | 8 x Ascend NPUs |

Here's a software summary of pre-training Baichuan-7B:

| Software | Version |
| :-----------------------: | :--------------------------------------------------------------------------------------------------------------------------: |
| Python | 3.7.16 |
| driver | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| firmware | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| CANN | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| binary arithmetic package | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| torch | 1.11.0 |
| torch_npu | [package](https://gitee.com/ascend/pytorch/releases) |
| apex | [package](https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v1.11.0/20230915.2/pytorch_v1.11.0_py37.tar.gz) |
### Script

1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
@ -52,21 +57,21 @@ mkdir logs
mkdir ckpt
```
2. Build environment
```bash
# python3.7
conda create -n test python=3.7
conda activate test
# install torch and torch_npu
pip install torch-1.11.0-cp37-cp37m-linux_aarch64.whl
pip install torch_npu-1.11.0.XXX-cp37-cp37m-linux_XXX.whl
# install megatron-core
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
# install deepspeed and deepspeed_npu
pip install deepspeed==0.9.2
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
cd deepspeed_npu
@ -74,25 +79,29 @@ pip3 install -e ./
cd ..
# install other packages
pip install -r requirements.txt
```
*Note: if you want to train with the weights from huggingface, first fix a DeepSpeed checkpoint-loading bug by changing `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0` in the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py`:*
```text
# original deepspeed/runtime/engine.py, around lines 2746-2748
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None:
    return False

# modified
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None or len(zero_sd_list) == 0:
    return False
```
3. (Optional) Prepare pretrained weights

Download the Baichuan-7B checkpoint from [huggingface](https://huggingface.co/baichuan-inc/Baichuan-7B/tree/main):
```shell
mkdir baichuan-7B-hf
@ -109,7 +118,8 @@ wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer.mode
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer_config.json
cd ..
```
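As an optional sanity check (an addition here, not one of the original steps), you can confirm the downloaded files are intact by loading the tokenizer with `transformers`; this assumes the `transformers` package is installed, and `trust_remote_code=True` is required because Baichuan ships custom tokenizer code:
```python
# Hypothetical sanity check -- assumes the `transformers` package is installed.
from transformers import AutoTokenizer

# Baichuan uses custom tokenizer code, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained("./baichuan-7B-hf", trust_remote_code=True)
print(tokenizer.tokenize("Ascend NPU pre-training"))  # should print subword tokens
```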
To adapt the weights to the Baichuan-7B model, use the following script to convert them from huggingface format into a form AscendSpeed can load:
```shell
mkdir weight
@ -127,18 +137,18 @@ python $SCRIPT_PATH \
```
4. Prepare dataset

Download the Baichuan-7B dataset from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet):
```shell
# download datasets
mkdir dataset_baichuan7B
cd ./dataset_baichuan7B
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
python ./tools/preprocess_data.py \
--input ./dataset_baichuan7B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./baichuan-7B-hf \
@ -149,47 +159,44 @@ python ./tools/preprocess_data.py \
```
5. Configure the Baichuan-7B pre-training script: examples/baichuan/pretrain_baichuan_zero_7B.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# modify the tokenizer, dataset, and weight paths according to your own setup
TOKENIZER_PATH=./baichuan-7B-hf/  # tokenizer path
DATA_PATH=./dataset_baichuan7B/alpaca_text_document  # processed dataset

# to load pretrained weights, add the parameter `--load ./weight`
```
6. Launch the Baichuan-7B pre-training script: examples/baichuan/pretrain_baichuan_zero_7B.sh
```shell
bash examples/baichuan/pretrain_baichuan_zero_7B.sh
```
*Note: to train with the huggingface weights, add the parameter `--load ./weight` to the launch arguments in `pretrain_baichuan_zero_7B.sh` (lines 74-107) and rerun it.*
### Performance
#### Machine performance

The performance of Baichuan-7B on **Ascend NPUs** vs. the **Reference**:

| Device | Model | total iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operations (TFLOPs/s) |
| ------ | ----------- | ---------------- | ----------------------------- | ---------------------------- | ------------------------- | ------------------------------------ |
| NPUs | Baichuan-7B | 1024 | 3.722 | 1905 | 2.14 | 102.69 |
| Reference | Baichuan-7B | 1024 | 3.978 | 2036 | 1.98 | 125.66 |
#### Accuracy of the loss

NPU vs. Reference loss:

The NPU training runs smoothly with stable resource usage and no mid-run errors; the loss trends downward and converges at the expected rate. The relative error of the average loss is 0.01093 (less than 2%), the maximum relative error is 0.1243, and the maximum absolute error is 0.4859. The precision meets the requirements.
![NPU-LOSS](../../sources/images/baichuan/7B_loss_compare.png)
NPU vs. Reference loss relative error:
![NPU-Relative-Error](../../sources/images/baichuan/7B_relative_error.png)
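For readers who want to reproduce these statistics, here is a minimal sketch (the file names and the plain-text export format are assumptions, not part of the training scripts):
```python
# Minimal sketch -- assumes per-step losses were exported to two text files.
import numpy as np

npu = np.loadtxt("npu_loss.txt")  # hypothetical NPU loss export
ref = np.loadtxt("ref_loss.txt")  # hypothetical reference loss export

abs_err = np.abs(npu - ref)
rel_err = abs_err / np.abs(ref)

print(f"mean relative error: {rel_err.mean():.5f}")
print(f"max relative error:  {rel_err.max():.5f}")
print(f"max absolute error:  {abs_err.max():.5f}")
```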
@ -197,20 +204,19 @@ NPU vs Reference loss relative error.
# Baichuan-13B
## Training

Here's a hardware summary of pre-training Baichuan-13B:

| Hardware | Value |
| :------: | :---------------: |
| NPU | 8 x Ascend NPUs |

Here's a software summary of pre-training Baichuan-13B:

| Software | Version |
| :-----------------------: | :-----------: |
| Python | 3.7.16 |
| driver | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| firmware | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| CANN | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
@ -221,46 +227,45 @@ Here's a software summary of pre-training Baichuan-13B:
### Script

1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
mkdir logs
mkdir ckpt
```
2. Build environment
```bash
# python3.7
conda create -n test python=3.7
conda activate test
# install torch and torch_npu
pip install torch-1.11.0-cp37-cp37m-linux_aarch64.whl
pip install torch_npu-1.11.0.XXX-cp37-cp37m-linux_XXX.whl
# install megatron
git clone https://github.com/NVIDIA/Megatron-LM.git -b 23.05
cd Megatron-LM
pip3 install -e ./
cd ..
# install deepspeed and deepspeed_npu
pip install deepspeed==0.9.2
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
cd deepspeed_npu
pip3 install -e ./
cd ..
# install other packages
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
3. (Optional) Prepare pretrained weights

Download the Baichuan-13B checkpoint from [huggingface](https://huggingface.co/baichuan-inc/Baichuan-13B-Base/tree/main):
```shell
mkdir baichuan-13B-hf
cd ./baichuan-13B-hf
@ -280,7 +285,7 @@ wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenize
cd ..
```
To adapt the weights to the Baichuan-13B model, use the following script to convert them from huggingface format to the AscendSpeed format:
```shell
mkdir weight
@ -295,8 +300,9 @@ python $SCRIPT_PATH \
--pse
```
4. Prepare dataset

Download the Baichuan-13B dataset from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet):
```shell
mkdir dataset_baichuan13B
@ -304,7 +310,6 @@ cd ./dataset_baichuan13B
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
python ./tools/preprocess_data.py \
--input ./dataset_baichuan13B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./baichuan-13B-hf \
@ -315,57 +320,49 @@ python ./tools/preprocess_data.py \
```
5. Configure the Baichuan-13B pre-training script: /examples/baichuan/pretrain_baichuan_ptd_13B.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# modify the tokenizer and dataset paths according to your own setup
TOKENIZER_PATH=./baichuan-13B-hf
DATA_PATH=./dataset_baichuan13B/alpaca_text_document
```
6. Launch the Baichuan-13B pre-training script: /examples/baichuan/pretrain_baichuan_ptd_13B.sh
```bash
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh
```
An hourly pulse-check script runs alongside training to verify that the job is either running or scheduled.
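That pulse-check script is not included in this repository; a minimal sketch of the idea, assuming the trainer can be identified by the script name in its command line, could look like:
```python
# Hypothetical pulse check -- the actual script is not shipped with the repo.
import subprocess
import time

while True:
    # pgrep -f returns 0 when any process command line matches the pattern
    alive = subprocess.run(["pgrep", "-f", "pretrain_baichuan"],
                           capture_output=True).returncode == 0
    print(f"[pulse] training is {'running' if alive else 'NOT running'}")
    time.sleep(3600)  # check once per hour
```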
### Performance
#### Machine performance

The performance of Baichuan-13B on **Ascend NPUs** vs. the **Reference**:

| Device | Model | total iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operations (TFLOPs/s) |
| :----: | :----------: | :--------------: | :---------------------------: | :--------------------------: | :-----------------------: | :----------------------------------: |
| NPUs | Baichuan-13B | 1000 | 1.928 | 1024 | 16.067 | 89.37 |
| Reference | Baichuan-13B | 1000 | 1.535 | 862 | 19.852 | 72.39 |

#### Accuracy of the loss

NPU vs. Reference loss:

The NPU training runs smoothly with stable resource usage and no mid-run errors; the loss trends downward and converges at the expected rate. The relative error of the average loss is 0.00725 (less than 2%), the maximum relative error is 0.01978, and the maximum absolute error is 0.10811. The precision meets the requirements.

![NPU-LOSS](../../sources/images/baichuan/13B-loss-compare.png)

NPU vs. Reference loss relative error:

The relative error between the NPU and Reference loss stays below 0.02 throughout, as expected.

![NPU-Relative-Error](../../sources/images/baichuan/baichuan13B-loss-relative-error.png)
<font size=1>If downloading a file with `wget` fails, you can download it manually from the source page, after confirming the site is trustworthy.</font>


@ -0,0 +1,378 @@
# BaiChuan
<p align="left">
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/baichuan/README.md">Simplified Chinese</a></b> |
<b>English</b>
</p>
# Contents

- [Baichuan-7B](#baichuan-7b)
  - [Training](#training)
    - [Script](#script)
    - [Performance](#performance)
      - [Machine performance](#machine-performance)
      - [Accuracy of the loss](#accuracy-of-the-loss)
- [Baichuan-13B](#baichuan-13b)
  - [Training](#training-1)
    - [Script](#script-1)
    - [Performance](#performance-1)
      - [Machine performance](#machine-performance-1)
      - [Accuracy of the loss](#accuracy-of-the-loss-1)
# Baichuan-7B
## Training
Here's a hardware summary of pre-training Baichuan-7B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 8 x Ascend NPUs |
Here's a software summary of pre-training Baichuan-7B:
| Software | Version |
| :-----------------------: |:-----------:|
| Python | 3.7.16 |
| driver | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| firmware | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| CANN | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| binary arithmetic package | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| torch | 1.11.0 |
| torch_npu | [package](https://gitee.com/ascend/pytorch/releases) |
| apex | [package](https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v1.11.0/20230915.2/pytorch_v1.11.0_py37.tar.gz) |
### Script
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
mkdir logs
mkdir ckpt
```
2. Build environment
```bash
# python3.7
conda create -n test python=3.7
conda activate test
# install torch and torch_npu
pip install torch-1.11.0-cp37-cp37m-linux_aarch64.whl
pip install torch_npu-1.11.0.XXX-cp37-cp37m-linux_XXX.whl
# install megatron-core
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
# install deepspeed and deepspeed_npu
pip install deepspeed==0.9.2
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
cd deepspeed_npu
pip3 install -e ./
cd ..
# install other packages
pip install -r requirements.txt
```
*Note: if you want to train with the weights from huggingface, first fix a DeepSpeed checkpoint-loading bug by changing `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0` in the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py`:*
```text
# original deepspeed/runtime/engine.py, around lines 2746-2748
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None:
    return False

# modified
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None or len(zero_sd_list) == 0:
    return False
```
3. Prepare pretrained weights
Download the Baichuan-7B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan-7B/tree/main)
```shell
mkdir baichuan-7B-hf
cd ./baichuan-7B-hf
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/handler.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/pytorch_model.bin
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer_config.json
cd ..
```
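As an optional sanity check (an addition, not one of the original steps), the download can be verified by loading the tokenizer with `transformers`; `trust_remote_code=True` is required because Baichuan ships custom tokenizer code:
```python
# Hypothetical sanity check -- assumes the `transformers` package is installed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./baichuan-7B-hf", trust_remote_code=True)
print(len(tokenizer))  # vocabulary size; confirms tokenizer.model loaded
```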
To adapt the weights to the Baichuan-7B model, use the following script to convert the pre-training weights:
```shell
mkdir weight
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
python $SCRIPT_PATH \
--input-model-dir ./baichuan-7B-hf \
--output-model-dir ./weight \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--type 7B \
--pse \
--deepspeed \
--use_wpack_rotray \
--load_weight_map
```
4. Prepare dataset
Download the Baichuan-7B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download datasets
mkdir dataset_baichuan7B
cd ./dataset_baichuan7B
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
python ./tools/preprocess_data.py \
--input ./dataset_baichuan7B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./baichuan-7B-hf \
--output-prefix ./dataset_baichuan7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5. Configure the Baichuan-7B pre-training script: examples/baichuan/pretrain_baichuan_zero_7B.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the tokenizer and dataset paths according to your own setup
TOKENIZER_PATH=./baichuan-7B-hf/ #tokenizer path
DATA_PATH=./dataset_baichuan7B/alpaca_text_document #processed dataset
```
6. Launch Baichuan-7B pre-training script: examples/baichuan/pretrain_baichuan_zero_7B.sh
```shell
bash examples/baichuan/pretrain_baichuan_zero_7B.sh
```
*Note: to train with the huggingface weights, add the parameter `--load ./weight` to the launch arguments in `pretrain_baichuan_zero_7B.sh` (lines 74-107) and rerun it.*
### Performance
#### Machine performance
The performance of Baichuan-7B on **Ascend NPUs** vs. the **Reference**:

| Device | Model | total iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operations (TFLOPs/s) |
| ------ | ----------- | ---------------- | ----------------------------- | ---------------------------- | ------------------------- | ----------------------------------- |
| NPUs | Baichuan-7B | 1024 | 3.722 | 1905 | 2.14 | 102.69 |
| Reference | Baichuan-7B | 1024 | 3.978 | 2036 | 1.98 | 125.66 |
#### Accuracy of the loss
NPU vs. Reference loss:

The NPU training runs smoothly with stable resource usage and no mid-run errors; the loss trends downward and converges at the expected rate. The relative error of the average loss is 0.01093 (less than 2%), the maximum relative error is 0.1243, and the maximum absolute error is 0.4859. The precision meets the requirements.
![NPU-LOSS](../../sources/images/baichuan/7B_loss_compare.png)
NPU vs. Reference loss relative error:
![NPU-Relative-Error](../../sources/images/baichuan/7B_relative_error.png)
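As an aside, these error statistics can be recomputed from exported loss curves; here is a minimal sketch under the assumption that both runs dumped per-step losses to plain-text files (the file names are hypothetical):
```python
# Minimal sketch -- file names and export format are assumptions.
import numpy as np

npu = np.loadtxt("npu_loss.txt")  # hypothetical NPU loss export
ref = np.loadtxt("ref_loss.txt")  # hypothetical reference loss export

rel_err = np.abs(npu - ref) / np.abs(ref)
print(rel_err.mean(), rel_err.max(), np.abs(npu - ref).max())
```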
# Baichuan-13B
## Training
Here's a hardware summary of pre-training Baichuan-13B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 8 x Ascend NPUs |
Here's a software summary of pre-training Baichuan-13B:
| Software | Version |
| :-----------------------: |:-----------:|
| Python | 3.7.16 |
| driver | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| firmware | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| CANN | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| binary arithmetic package | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| torch | 1.11.0 |
| torch_npu | [package](https://gitee.com/ascend/pytorch/releases) |
### Script
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
mkdir logs
mkdir ckpt
```
2. Build environment
```bash
# python3.7
conda create -n test python=3.7
conda activate test
# install torch and torch_npu
pip install torch-1.11.0-cp37-cp37m-linux_aarch64.whl
pip install torch_npu-1.11.0.XXX-cp37-cp37m-linux_XXX.whl
#install megatron
git clone https://github.com/NVIDIA/Megatron-LM.git -b 23.05
cd Megatron-LM
pip3 install -e ./
cd ..
# install deepspeed and deepspeed_npu
pip install deepspeed==0.9.2
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
cd deepspeed_npu
pip3 install -e ./
cd ..
# install other packages
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
3. Prepare pretrained weights
Download the Baichuan-13B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan-13B-Base/tree/main)
```shell
mkdir baichuan-13B-hf
cd ./baichuan-13B-hf
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer.model
cd ..
```
To adapt the weights to the Baichuan-13B model, use the following script to convert the pre-training weights:
```shell
mkdir weight
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
python $SCRIPT_PATH \
--input-model-dir ./baichuan-13B-hf \
--output-model-dir ./weight \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--make-vocab-size-divisible-by 8 \
--type 13B \
--pse
```
4. Prepare dataset
Download the Baichuan-13B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
mkdir dataset_baichuan13B
cd ./dataset_baichuan13B
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
python ./tools/preprocess_data.py \
--input ./dataset_baichuan13B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./baichuan-13B-hf \
--output-prefix ./dataset_baichuan13B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
```
5. Configure the Baichuan-13B pre-training script: /examples/baichuan/pretrain_baichuan_ptd_13B.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the tokenizer and dataset paths according to your own setup
TOKENIZER_PATH=./baichuan-13B-hf
DATA_PATH=./dataset_baichuan13B/alpaca_text_document
```
6. Launch Baichuan-13B pre-training script: /examples/baichuan/pretrain_baichuan_ptd_13B.sh
```bash
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh
```
An hourly pulse-check script runs alongside training to verify that the job is either running or scheduled.
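The pulse-check script itself is not part of this repository; one minimal sketch of the idea, assuming the trainer is identifiable by its script name, follows:
```python
# Hypothetical hourly pulse check (the real script is not shipped here).
import subprocess
import time

while True:
    alive = subprocess.run(["pgrep", "-f", "pretrain_baichuan_ptd_13B"],
                           capture_output=True).returncode == 0
    print("training running" if alive else "training NOT running")
    time.sleep(3600)  # check once per hour
```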
### Performance
#### Machine performance
The performance of Baichuan-13B on **Ascend NPUs** vs. the **Reference**:

| Device | Model | total iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operations (TFLOPs/s) |
| :----: | :----------: | :--------------: | :---------------------------: | :--------------------------: | :-----------------------: | :---------------------------------: |
| NPUs | Baichuan-13B | 1000 | 1.928 | 1024 | 16.067 | 89.37 |
| Reference | Baichuan-13B | 1000 | 1.535 | 862 | 19.852 | 72.39 |
#### Accuracy of the loss
NPU vs. Reference loss:

The NPU training runs smoothly with stable resource usage and no mid-run errors; the loss trends downward and converges at the expected rate. The relative error of the average loss is 0.00725 (less than 2%), the maximum relative error is 0.01978, and the maximum absolute error is 0.10811. The precision meets the requirements.
![NPU-LOSS](../../sources/images/baichuan/13B-loss-compare.png)
NPU vs. Reference loss relative error:
The relative error between NPU and Reference Loss is less than 0.02 throughout, as expected.
![NPU-Relative-Error](../../sources/images/baichuan/baichuan13B-loss-relative-error.png)
<font size=1>If downloading a file with `wget` fails, you can download it manually from the source page, after confirming the site is trustworthy.</font>