!449 Add Baichuan-7B/Baichuan2-7B adaptor
Merge pull request !449 from xiongliangcheng/modellink
8
OWNERS
@@ -30,3 +30,11 @@ reviewers:
- wenjiang2357
- leizhenzhen
- liuyanghan
- Ares_Lzk
- flying-artillery
- xiong-liangcheng_admin
- gitee-code-template
- yaojia2021
- chantcalf
- kongfuziyue
- yuhui69
166
examples/baichuan/README.md
Normal file
@@ -0,0 +1,166 @@
# BaiChuan

<p align="left">
<b>简体中文</b> |
<b><a href="https://gitee.com/ascend/ModelLink/blob/modellink/examples/baichuan/README_en.md">English</a></b>
</p>

# Contents

- [Baichuan-7B](#baichuan-7b)
  - [Training](#training)
    - [Script](#script)
    - [Performance](#performance)
      - [Throughput](#throughput)
      - [Accuracy](#accuracy)

# Baichuan-7B

## Training

The hardware configuration for training Baichuan-7B is as follows:

| Hardware |  Configuration  |
|:--------:|:---------------:|
|   NPU    | 8 x Ascend NPUs |

### Script

1. Clone the repository to your local server:

```shell
git clone https://gitee.com/ascend/ModelLink.git
cd ModelLink
git checkout modellink
mkdir logs
mkdir ckpt
```

2. Build the environment:

```bash
# python3.8
conda create -n test python=3.8
conda activate test

# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl

# set the ascend-toolkit path according to your installation
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# install the AscendSpeed acceleration library
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
pip install -r requirements.txt
pip3 install -e .
cd ..

# install the remaining dependencies
pip install -r requirements.txt
```
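
A quick sanity check that the environment is usable (a minimal sketch; `torch.npu` is the device namespace patched in by torch_npu, and `is_available()` is assumed to report whether an NPU is visible on this host):

```shell
python -c "import torch, torch_npu; print(torch.npu.is_available())"
```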

3. (Optional) Prepare the pretrained weights

Download the pretrained weights from [huggingface](https://huggingface.co/baichuan-inc/Baichuan-7B/tree/main):

```shell
mkdir baichuan-7B-hf
cd ./baichuan-7B-hf
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/handler.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/pytorch_model.bin
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer_config.json
cd ..
```

Then convert the HF-format weights into a form that AscendSpeed can load:

```shell
mkdir baichuan-7B-mt

SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
python $SCRIPT_PATH \
    --input-model-dir ./baichuan-7B-hf \
    --output-model-dir ./baichuan-7B-mt \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 1 \
    --type 7B \
    --pse \
    --merge-mlp
```
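
The conversion writes one shard per tensor-parallel rank (TP=8 here); listing the output directory is a quick way to confirm it succeeded (the exact file layout is specific to the conversion tool):

```shell
ls ./baichuan-7B-mt
```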

4. Prepare the dataset

Download the dataset for Baichuan-7B from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet):

```shell
# download the dataset
mkdir dataset_baichuan7B
cd ./dataset_baichuan7B
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..

# preprocess the dataset
python ./tools/preprocess_data.py \
    --input ./dataset_baichuan7B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
    --tokenizer-name-or-path ./baichuan-7B-hf \
    --output-prefix ./dataset_baichuan7B/alpaca \
    --workers 4 \
    --log-interval 1000 \
    --tokenizer-type PretrainedFromHF
```
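
Preprocessing emits an indexed binary dataset under the given `--output-prefix`; the `_text_document` suffix below is inferred from the `DATA_PATH` used in step 5:

```shell
ls -lh ./dataset_baichuan7B/alpaca_text_document.bin \
       ./dataset_baichuan7B/alpaca_text_document.idx
```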

5. Configure the Baichuan-7B pre-training script: examples/baichuan/pretrain_baichuan_ptd_7B.sh

```shell
# set the ascend-toolkit path according to your installation
source /usr/local/Ascend/ascend-toolkit/set_env.sh

CKPT_SAVE_DIR="./ckpt"
DATA_PATH="./dataset_baichuan7B/alpaca_text_document"
TOKENIZER_MODEL="./baichuan-7B-hf/tokenizer.model"
CKPT_LOAD_DIR="./baichuan-7B-mt"
```

6. Launch the Baichuan-7B pre-training script: examples/baichuan/pretrain_baichuan_ptd_7B.sh

```shell
bash examples/baichuan/pretrain_baichuan_ptd_7B.sh
```

### Performance

#### Throughput

Performance comparison of Baichuan-7B on **Ascend NPUs** and the **reference device**:

|  Device   |    Model    | Iterations | Sample throughput (samples/s) | Token throughput (tokens/s/p) | Step time (s/step) |
|:---------:|:-----------:|:----------:|:-----------------------------:|:-----------------------------:|:------------------:|
|   NPUs    | Baichuan-7B |    1000    |             4.78              |            2448.76            |       6.688        |
| Reference | Baichuan-7B |    1000    |             5.45              |            2792.56            |       5.863        |
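
The three throughput columns agree to within rounding, given the configuration in pretrain_baichuan_ptd_7B.sh (global batch size 32, sequence length 4096, 8 NPUs); a quick cross-check:

```shell
# samples/s  = global batch size / step time : 32 / 6.688      -> ~4.78
# tokens/s/p = samples/s * seq len / devices : 4.78 * 4096 / 8 -> ~2447 (table: 2448.76)
python -c "print(32 / 6.688, 4.78 * 4096 / 8)"
```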

#### Accuracy

NPU vs. reference loss:

![NPU-LOSS](../../sources/images/baichuan/baichuan7B-loss-compare.png)

NPU vs. reference loss relative error:

![NPU-Relative-Error](../../sources/images/baichuan/baichuan7B-loss-relative-error.png)
165
examples/baichuan/README_en.md
Normal file
@@ -0,0 +1,165 @@
# BaiChuan

<p align="left">
<b><a href="https://gitee.com/ascend/ModelLink/blob/modellink/examples/baichuan/README.md">简体中文</a></b> |
<b>English</b>
</p>

# Contents

- [Baichuan-7B](#baichuan-7b)
  - [Training](#training)
    - [Script](#script)
    - [Performance](#performance)
      - [Machine performance](#machine-performance)
      - [Accuracy of the loss](#accuracy-of-the-loss)

# Baichuan-7B

## Training

Here's a hardware summary of pre-training Baichuan-7B:

| Hardware |      Value      |
| :------: | :-------------: |
|   NPU    | 8 x Ascend NPUs |

### Script

1. Clone the repository to your local server:

```shell
git clone https://gitee.com/ascend/ModelLink.git
cd ModelLink
git checkout modellink
mkdir logs
mkdir ckpt
```

2. Build the environment:

```bash
# python3.8
conda create -n test python=3.8
conda activate test

# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl

# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# install AscendSpeed
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
pip install -r requirements.txt
pip3 install -e .
cd ..

# install other packages
pip install -r requirements.txt
```

3. Prepare pretrained weights

Download the Baichuan-7B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan-7B/tree/main):

```shell
mkdir baichuan-7B-hf
cd ./baichuan-7B-hf
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/handler.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/pytorch_model.bin
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer_config.json
cd ..
```

To adapt the weights to the Baichuan-7B model, convert the pre-training weights with the following script (the output directory matches the CKPT_LOAD_DIR used in step 5):

```shell
mkdir baichuan-7B-mt

SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
python $SCRIPT_PATH \
    --input-model-dir ./baichuan-7B-hf \
    --output-model-dir ./baichuan-7B-mt \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 1 \
    --type 7B \
    --pse \
    --merge-mlp
```
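
The `--tensor-model-parallel-size` and `--pipeline-model-parallel-size` used for conversion need to match the `TP`/`PP` values in the training script (TP=8, PP=1 here), since the checkpoint is sharded accordingly; a quick way to compare them:

```shell
grep -E "^(TP|PP)=" examples/baichuan/pretrain_baichuan_ptd_7B.sh
```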

4. Prepare dataset

Download the Baichuan-7B dataset from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet):

```shell
# download datasets
mkdir dataset_baichuan7B
cd ./dataset_baichuan7B
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..

# process datasets
python ./tools/preprocess_data.py \
    --input ./dataset_baichuan7B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
    --tokenizer-name-or-path ./baichuan-7B-hf \
    --output-prefix ./dataset_baichuan7B/alpaca \
    --workers 4 \
    --log-interval 1000 \
    --tokenizer-type PretrainedFromHF
```

5. Configure the Baichuan-7B pre-training script: examples/baichuan/pretrain_baichuan_ptd_7B.sh

```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh

CKPT_SAVE_DIR="./ckpt"
DATA_PATH="./dataset_baichuan7B/alpaca_text_document"
TOKENIZER_MODEL="./baichuan-7B-hf/tokenizer.model"
CKPT_LOAD_DIR="./baichuan-7B-mt"
```

6. Launch the Baichuan-7B pre-training script: examples/baichuan/pretrain_baichuan_ptd_7B.sh

```shell
bash examples/baichuan/pretrain_baichuan_ptd_7B.sh
```

### Performance

#### Machine performance

The performance of Baichuan-7B on **Ascend NPUs** and the **reference device**:

|  Device   |    Model    | Total iterations | Throughput (samples/s) | Throughput (tokens/s/p) | Single-step time (s/step) |
|:---------:|:-----------:|:----------------:|:----------------------:|:-----------------------:|:-------------------------:|
|   NPUs    | Baichuan-7B |       1000       |          4.78          |         2448.76         |           6.688           |
| Reference | Baichuan-7B |       1000       |          5.45          |         2792.56         |           5.863           |

#### Accuracy of the loss

NPU vs. Reference loss:

![NPU-LOSS](../../sources/images/baichuan/baichuan7B-loss-compare.png)

NPU vs. Reference loss relative error:

![NPU-Relative-Error](../../sources/images/baichuan/baichuan7B-loss-relative-error.png)
89
examples/baichuan/pretrain_baichuan_ptd_7B.sh
Normal file
@@ -0,0 +1,89 @@
#!/bin/bash

export CUDA_DEVICE_MAX_CONNECTIONS=1

GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
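
# Multi-node note (an illustrative sketch, not required for the single-node default):
# set NNODES to the machine count, give each machine a unique NODE_RANK in 0..NNODES-1,
# and point MASTER_ADDR at the rank-0 machine; WORLD_SIZE then scales automatically.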

CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer model path"
CKPT_LOAD_DIR="your model load ckpt path"

TP=8
PP=1

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

GPT_ARGS="
    --tensor-model-parallel-size ${TP} \
    --pipeline-model-parallel-size ${PP} \
    --sequence-parallel \
    --num-layers 32 \
    --hidden-size 4096 \
    --ffn-hidden-size 11008 \
    --num-attention-heads 32 \
    --tokenizer-type Llama2Tokenizer \
    --tokenizer-model ${TOKENIZER_MODEL} \
    --load ${CKPT_LOAD_DIR} \
    --seq-length 4096 \
    --max-position-embeddings 4096 \
    --micro-batch-size 4 \
    --global-batch-size 32 \
    --make-vocab-size-divisible-by 128 \
    --lr 1e-5 \
    --train-iters 5000 \
    --lr-decay-style cosine \
    --untie-embeddings-and-output-weights \
    --disable-bias-linear \
    --attention-dropout 0.0 \
    --init-method-std 0.01 \
    --hidden-dropout 0.0 \
    --position-embedding-type rope \
    --normalization RMSNorm \
    --use-fused-rmsnorm \
    --use-flash-attn \
    --swiglu \
    --no-masked-softmax-fusion \
    --attention-softmax-in-fp32 \
    --min-lr 1e-6 \
    --weight-decay 1e-2 \
    --lr-warmup-fraction 0.1 \
    --clip-grad 1.0 \
    --adam-beta1 0.9 \
    --initial-loss-scale 8188.0 \
    --adam-beta2 0.95 \
    --no-gradient-accumulation-fusion \
    --no-load-optim \
    --no-load-rng \
    --fp16
"

DATA_ARGS="
    --data-path $DATA_PATH \
    --split 949,50,1
"

OUTPUT_ARGS="
    --log-interval 1 \
    --save-interval 1000 \
    --eval-interval 1000 \
    --eval-iters 1 \
"

torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --distributed-backend nccl \
    --save ${CKPT_SAVE_DIR}
166
examples/baichuan2/README.md
Normal file
@@ -0,0 +1,166 @@
# BaiChuan2

<p align="left">
<b>简体中文</b> |
<b><a href="https://gitee.com/ascend/ModelLink/blob/modellink/examples/baichuan2/README_en.md">English</a></b>
</p>

# Contents

- [Baichuan2-7B](#baichuan2-7b)
  - [Training](#training)
    - [Script](#script)
    - [Performance](#performance)
      - [Throughput](#throughput)
      - [Accuracy](#accuracy)

# Baichuan2-7B

## Training

The hardware configuration for training Baichuan2-7B is as follows:

| Hardware |  Configuration  |
|:--------:|:---------------:|
|   NPU    | 8 x Ascend NPUs |

### Script

1. Clone the repository to your local server:

```shell
git clone https://gitee.com/ascend/ModelLink.git
cd ModelLink
git checkout modellink
mkdir logs
mkdir ckpt
```

2. Build the environment:

```bash
# python3.8
conda create -n test python=3.8
conda activate test

# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl

# set the ascend-toolkit path according to your installation
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# install the AscendSpeed acceleration library
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
pip install -r requirements.txt
pip3 install -e .
cd ..

# install the remaining dependencies
pip install -r requirements.txt
```

3. (Optional) Prepare the pretrained weights

Download the pretrained weights from [huggingface](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/tree/main):

```shell
mkdir baichuan2-7B-hf
cd ./baichuan2-7B-hf
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/generation_utils.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model-00001-of-00002.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model-00002-of-00002.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenizer_config.json
cd ..
```

Then convert the HF-format weights into a form that AscendSpeed can load:

```shell
mkdir baichuan2-7B-mt

SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
# for ptd
python $SCRIPT_PATH \
    --input-model-dir ./baichuan2-7B-hf \
    --output-model-dir ./baichuan2-7B-mt \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 1 \
    --type 7B \
    --merge-mlp \
    --pse
```

4. Prepare the dataset

Download the dataset for Baichuan2-7B-Base from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet):

```shell
# download the dataset
mkdir dataset_baichuan2-7B
cd ./dataset_baichuan2-7B
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..

# preprocess the dataset
python ./tools/preprocess_data.py \
    --input ./dataset_baichuan2-7B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
    --tokenizer-name-or-path ./baichuan2-7B-hf \
    --output-prefix ./dataset_baichuan2-7B/alpaca \
    --workers 4 \
    --log-interval 1000 \
    --tokenizer-type PretrainedFromHF
```

5. Configure the Baichuan2-7B pre-training script: examples/baichuan2/pretrain_baichuan2_ptd_7B.sh

```shell
# set the ascend-toolkit path according to your installation
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# modify the dataset, checkpoint, and tokenizer paths as needed
CKPT_SAVE_DIR="./ckpt"
DATA_PATH="./dataset_baichuan2-7B/alpaca_text_document"
TOKENIZER_MODEL="./baichuan2-7B-hf/tokenizer.model"
CKPT_LOAD_DIR="./baichuan2-7B-mt"
```

6. Launch the Baichuan2-7B pre-training script: examples/baichuan2/pretrain_baichuan2_ptd_7B.sh

```shell
bash examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```

### Performance

#### Throughput

Performance comparison of Baichuan2-7B on **Ascend NPUs** and the **reference device**:

|  Device   |    Model     | Iterations | Sample throughput (samples/s) | Token throughput (tokens/s/p) | Step time (s/step) |
|:---------:|:------------:|:----------:|:-----------------------------:|:-----------------------------:|:------------------:|
|   NPUs    | Baichuan2-7B |    1000    |             4.59              |             2349              |       6.973        |
| Reference | Baichuan2-7B |    1000    |             5.40              |             2769              |       5.915        |
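
As with Baichuan-7B, the columns agree to within rounding, given the configuration in pretrain_baichuan2_ptd_7B.sh (global batch size 32, sequence length 4096, 8 NPUs):

```shell
# samples/s  = global batch size / step time : 32 / 6.973      -> ~4.59
# tokens/s/p = samples/s * seq len / devices : 4.59 * 4096 / 8 -> ~2350 (table: 2349)
python -c "print(32 / 6.973, 4.59 * 4096 / 8)"
```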

#### Accuracy

NPU vs. reference loss:

![NPU-LOSS](../../sources/images/baichuan2/baichuan2-7B-loss-compare.png)

NPU vs. reference loss relative error:

![NPU-Relative-Error](../../sources/images/baichuan2/baichuan2-7B-loss-relative-error.png)
168
examples/baichuan2/README_en.md
Normal file
@@ -0,0 +1,168 @@
# BaiChuan2

<p align="left">
<b><a href="https://gitee.com/ascend/ModelLink/blob/modellink/examples/baichuan2/README.md">简体中文</a></b> |
<b>English</b>
</p>

# Contents

- [Baichuan2-7B](#baichuan2-7b)
  - [Training](#training)
    - [Script](#script)
    - [Performance](#performance)
      - [Machine performance](#machine-performance)
      - [Accuracy of the loss](#accuracy-of-the-loss)

# Baichuan2-7B

## Training

Here's a hardware summary of pre-training Baichuan2-7B:

| Hardware |      Value      |
| :------: | :-------------: |
|   NPU    | 8 x Ascend NPUs |

### Script

1. Clone the repository to your local server:

```shell
git clone https://gitee.com/ascend/ModelLink.git
cd ModelLink
git checkout -b modellink origin/modellink
mkdir logs
mkdir ckpt
```

2. Build the environment:

```bash
# python3.8
conda create -n test python=3.8
conda activate test

# install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_XXX.whl

# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# install AscendSpeed
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
pip install -r requirements.txt
pip3 install -e .
cd ..

# install other packages
pip install -r requirements.txt
```

3. Prepare pretrained weights

Download the Baichuan2-7B checkpoint from [here](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/tree/main):

```shell
mkdir baichuan2-7B-hf
cd ./baichuan2-7B-hf
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/generation_utils.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model-00001-of-00002.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model-00002-of-00002.bin
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenizer_config.json
cd ..
```

To adapt the weights to the Baichuan2-7B model, convert the pre-training weights with the following script (the output directory matches the CKPT_LOAD_DIR used in step 5):

```shell
mkdir baichuan2-7B-mt

SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
# for ptd
python $SCRIPT_PATH \
    --input-model-dir ./baichuan2-7B-hf \
    --output-model-dir ./baichuan2-7B-mt \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 1 \
    --type 7B \
    --merge-mlp \
    --pse
```
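
The converted directory is what `CKPT_LOAD_DIR` points to in step 5; listing it confirms the tensor-parallel shards were written (the exact file layout is specific to the conversion tool):

```shell
ls ./baichuan2-7B-mt
```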

4. Prepare dataset

Download the Baichuan2-7B-Base dataset from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet):

```shell
# download datasets
mkdir dataset_baichuan2-7B
cd ./dataset_baichuan2-7B
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..

# process datasets
python ./tools/preprocess_data.py \
    --input ./dataset_baichuan2-7B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
    --tokenizer-name-or-path ./baichuan2-7B-hf \
    --output-prefix ./dataset_baichuan2-7B/alpaca \
    --workers 4 \
    --log-interval 1000 \
    --tokenizer-type PretrainedFromHF
```

5. Configure the Baichuan2-7B pre-training script: examples/baichuan2/pretrain_baichuan2_ptd_7B.sh

```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# modify the dataset, checkpoint, and tokenizer paths according to your own setup
CKPT_SAVE_DIR="./ckpt"
DATA_PATH="./dataset_baichuan2-7B/alpaca_text_document"
TOKENIZER_MODEL="./baichuan2-7B-hf/tokenizer.model"
CKPT_LOAD_DIR="./baichuan2-7B-mt"
```

6. Launch the Baichuan2-7B pre-training script: examples/baichuan2/pretrain_baichuan2_ptd_7B.sh

```shell
bash examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
```

### Performance

#### Machine performance

The performance of Baichuan2-7B on **Ascend NPUs** and the **reference device**:

|  Device   |    Model     | Total iterations | Throughput (samples/s) | Throughput (tokens/s/p) | Single-step time (s/step) |
|:---------:|:------------:|:----------------:|:----------------------:|:-----------------------:|:-------------------------:|
|   NPUs    | Baichuan2-7B |       1000       |          4.59          |          2349           |           6.973           |
| Reference | Baichuan2-7B |       1000       |          5.40          |          2769           |           5.915           |

#### Accuracy of the loss

NPU vs. Reference loss:

![NPU-LOSS](../../sources/images/baichuan2/baichuan2-7B-loss-compare.png)

NPU vs. Reference loss relative error:

![NPU-Relative-Error](../../sources/images/baichuan2/baichuan2-7B-loss-relative-error.png)
89
examples/baichuan2/pretrain_baichuan2_ptd_7B.sh
Normal file
@@ -0,0 +1,89 @@
#!/bin/bash

export CUDA_DEVICE_MAX_CONNECTIONS=1

GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_MODEL="your tokenizer model path"
CKPT_LOAD_DIR="your model load ckpt path"

TP=8
PP=1

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

GPT_ARGS="
    --tensor-model-parallel-size ${TP} \
    --pipeline-model-parallel-size ${PP} \
    --sequence-parallel \
    --num-layers 32 \
    --hidden-size 4096 \
    --ffn-hidden-size 11008 \
    --num-attention-heads 32 \
    --tokenizer-type Llama2Tokenizer \
    --tokenizer-model ${TOKENIZER_MODEL} \
    --load ${CKPT_LOAD_DIR} \
    --seq-length 4096 \
    --max-position-embeddings 4096 \
    --micro-batch-size 4 \
    --global-batch-size 32 \
    --make-vocab-size-divisible-by 128 \
    --lr 1e-6 \
    --train-iters 5000 \
    --lr-decay-style cosine \
    --untie-embeddings-and-output-weights \
    --disable-bias-linear \
    --attention-dropout 0.0 \
    --init-method-std 0.01 \
    --hidden-dropout 0.0 \
    --position-embedding-type rope \
    --normalization RMSNorm \
    --use-fused-rmsnorm \
    --use-flash-attn \
    --swiglu \
    --no-masked-softmax-fusion \
    --attention-softmax-in-fp32 \
    --min-lr 1e-8 \
    --weight-decay 1e-2 \
    --lr-warmup-fraction 0.1 \
    --clip-grad 1.0 \
    --adam-beta1 0.9 \
    --initial-loss-scale 8188.0 \
    --adam-beta2 0.95 \
    --no-gradient-accumulation-fusion \
    --no-load-optim \
    --no-load-rng \
    --bf16
"

DATA_ARGS="
    --data-path $DATA_PATH \
    --split 949,50,1
"

OUTPUT_ARGS="
    --log-interval 1 \
    --save-interval 1000 \
    --eval-interval 1000 \
    --eval-iters 1 \
"

torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --distributed-backend nccl \
    --save ${CKPT_SAVE_DIR}
BIN  sources/images/baichuan/baichuan7B-loss-compare.png  (new file, 59 KiB)
BIN  sources/images/baichuan/baichuan7B-loss-relative-error.png  (new file, 54 KiB)
BIN  sources/images/baichuan2/baichuan2-7B-loss-compare.png  (new file, 65 KiB)
BIN  sources/images/baichuan2/baichuan2-7B-loss-relative-error.png  (new file, 41 KiB)