llama2: upload the training and inference scripts and the README.

This commit is contained in:
ningbenzhe1 2023-11-24 12:15:46 +08:00
parent 569e1cc7b3
commit f801b32e15
8 changed files with 611 additions and 16 deletions

View File

@ -13,6 +13,9 @@
- [Performance](#performance)
- [Throughput](#throughput)
- [Accuracy](#accuracy)
- [Fine-tuning](#fine-tuning)
- [Full-parameter fine-tuning](#full-parameter-fine-tuning)
- [LoRA fine-tuning](#lora-fine-tuning)
- [Inference](#inference)
- [deepspeed_pipeline](#deepspeed_pipeline)
- [megatron](#megatron)
@ -27,7 +30,8 @@
- [deepspeed_pipeline](#deepspeed_pipeline)
- [megatron](#megatron)
- [Evaluation](#evaluation)
- [Example](#example)
- [Example](#example)
# Bloom-7B
## Training
@ -154,6 +158,26 @@ DATA_PATH=/home/bloom_data/enwiki_100k/enwiki-100k_text_document
bash examples/bloom/pretrain_bloom_7b1.sh
```
## Fine-tuning
### Full-parameter fine-tuning
The execution flow is the same as pre-training; configure the training weight path as follows:
```shell
# modify the pre-training weight path
CHECKPOINT_PATH='./ckpt'
```
### LoRA fine-tuning
The execution flow is the same as pre-training; modify the parameters as follows (integration into the launch script is sketched after this block):
```shell
# modify the pre-training weight path
CHECKPOINT_PATH='./ckpt'
# add the configuration argument below to the pretrain_bloom.py launch command
pretrain_bloom.py
--lora-target-modules query_key_value dense
```
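For orientation, a minimal sketch of where the flag goes, assuming `examples/bloom/pretrain_bloom_7b1.sh` builds the `pretrain_bloom.py` argument list as a line-continued command (illustration only; the launch itself is unchanged):
```shell
# Hypothetical excerpt: inside examples/bloom/pretrain_bloom_7b1.sh, append the LoRA flag
# to the existing pretrain_bloom.py arguments, e.g.
#   python ... pretrain_bloom.py \
#       ... existing training arguments ... \
#       --lora-target-modules query_key_value dense \
# then launch the script exactly as in pre-training:
bash examples/bloom/pretrain_bloom_7b1.sh
```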
## Performance
### Throughput
@ -486,6 +510,7 @@ bash ./examples/bloom/generate_bloom_176b_2nodes.sh
## Evaluation
Configure the Bloom-176B evaluation script: tasks/evaluation/eval_bloom.sh
```shell
@ -504,6 +529,12 @@ TASK="boolq"
--num-attention-heads 112
```
```text
# Note: a deepspeed bug needs to be worked around for evaluation
# Comment out line 671 of `<deepspeed-installed-path>/runtime/pipe/engine.py` (a scripted way is sketched below)
# self.total_loss += self.loss.detach()
```
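A hedged way to apply this change from the shell (verify line 671 first; the line number may shift across deepspeed releases):
```shell
# locate the installed deepspeed and inspect line 671 before editing it
DS_PATH=$(python -c "import deepspeed, os; print(os.path.dirname(deepspeed.__file__))")
sed -n '671p' "$DS_PATH/runtime/pipe/engine.py"
# comment the line out only if it is the expected statement
sed -i '671s/^/# /' "$DS_PATH/runtime/pipe/engine.py"
```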
```shell
bash ./tasks/evaluation/eval_bloom.sh
```
@ -528,7 +559,7 @@ bash ./tasks/evaluation/eval_bloom.sh
</tbody>
</table>
## Example
# Example
1. bloom 7b
![bloom_7b_generate.png](..%2F..%2Fsources%2Fimages%2Fbloom_7b_generate.png)
@ -536,7 +567,7 @@ bash ./tasks/evaluation/eval_bloom.sh
![bloom_176b_generate.png](..%2F..%2Fsources%2Fimages%2Fbloom_176b_generate.png)
## Citation
# Citation
```
@article{scao2022bloom,

View File

@ -9,21 +9,31 @@
# Contents
- [Bloom-7B](#contents)
- [Training](#pre-training)
- [Training](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Accuracy of the loss](#accuracy-of-the-loss)
- [Inference](#Inference)
- [Script](#script)
- [Machine performance](#Machine-performance)
- [Accuracy of the loss](#Accuracy-of-the-loss)
- [Fine-tune](#fine-tune)
- [Full parameter fine-tuning](#Full-parameter-fine-tuning)
- [LoRA fine-tuning](#LORA-fine-tuning)
- [Inference](#inference)
- [deepspeed pipeline](#deepspeed-pipeline)
- [megatron](#megatron)
- [Evaluation](#evaluation)
- [Bloom-176B](#contents)
- [Training](#pre-training)
- [Training](#training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Accuracy of the loss](#accuracy-of-the-loss)
- [Inference](#Inference)
- [Script](#script)
- [Inference](#inference)
- [deepspeed pipeline](#deepspeed-pipeline)
- [megatron](#megatron)
- [Evaluation](#evaluation)
- [Example](#example)
# Bloom-7B
@ -154,6 +164,30 @@ Run the examples/bloom/pretrain_bloom_7b1.sh on all nodes in the cluster.
bash examples/bloom/pretrain_bloom_7b1.sh
```
## Fine-tune
### Full parameter fine-tuning
The execution process is the same as pre-training. Configure the training weight path as follows:
```shell
# modify the model weight path
CHECKPOINT_PATH='./ckpt'
```
### LoRA fine-tuning
The execution process is the same as pre-training; modify the parameters as follows (a short sketch follows the code block):
```shell
# modify the model weight path
CHECKPOINT_PATH='./ckpt'
# Add configuration parameters
pretrain_bloom.py
--lora-target-modules query_key_value dense
```
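A short sketch, assuming the flag is added to the `pretrain_bloom.py` argument list inside `examples/bloom/pretrain_bloom_7b1.sh`; the grep is an optional check before launching:
```shell
# confirm the LoRA flag made it into the launch script, then start fine-tuning as in pre-training
grep -- '--lora-target-modules' examples/bloom/pretrain_bloom_7b1.sh
bash examples/bloom/pretrain_bloom_7b1.sh
```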
## Performance
### Machine Performance
@ -183,7 +217,7 @@ NPU vs GPU loss relative error.
We support AscendSpeed Inference for text generation with BLOOM 7B (deepspeed or megatron).
### deepspeed_pipeline
### deepspeed pipeline
```shell
# modify the model weight path and tokenizer path
@ -440,7 +474,7 @@ and GPU on a single-node system. The average relative error is 0.1%, less than 2
We support AscendSpeed Inference for text generation with BLOOM 176B (deepspeed or megatron).
### deepspeed_pipeline
### deepspeed pipeline
```shell
# modify the model weight path and tokenizer path
@ -506,6 +540,12 @@ In addition, you need to set the corresponding parameters according to the model
--num-attention-heads 112
```
```text
# Note that, a deepspeed bug needs to be fixed during evaluation
# Comment out line 671 in the file `<deepspeed-installed-path>/runtime/pipe/engine.py`
# self.total_loss += self.loss.detach()
```
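A hedged helper to apply the change described above (inspect the line first; the line number may differ across deepspeed releases):
```shell
# inspect, then comment out line 671 of the installed deepspeed pipeline engine (a .bak backup is kept)
DS_ENGINE=$(python -c "import deepspeed, os; print(os.path.join(os.path.dirname(deepspeed.__file__), 'runtime', 'pipe', 'engine.py'))")
sed -n '671p' "$DS_ENGINE"
sed -i.bak '671s/^/# /' "$DS_ENGINE"
```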
```shell
bash ./tasks/evaluation/eval_bloom.sh
```

View File

@ -3,7 +3,6 @@
<p align="left">
<b>简体中文</b> |
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/llama2/README_en.md">English</a> </b>
</p>
</p>
@ -16,6 +15,15 @@
- [Throughput](#throughput)
- [Accuracy](#accuracy)
- [LLaMA2-13B](#LLaMA2-13B)
- [Training](#training)
- [Script](#script)
- [Performance](#performance)
- [Throughput](#throughput)
- [Accuracy](#accuracy)
- [Inference](#inference)
- [Evaluation](#evaluation)
# LLaMA2-7B
@ -188,3 +196,211 @@ NPU vs. reference loss
Absolute error
![NPU-LOSS and NPU-Absolute-Error](../../sources/images/llama2/llama2_7b_shape_fp16_layer32_loss_with_weights_comparison_absolute.png)
# LLaMA2-13B
## Training
Hardware configuration for LLaMA2-13B training:
| Hardware | Configuration |
|:---:|:---------------:|
| NPU | 8 x Ascend NPUs |
Software configuration for LLaMA2-13B training:
| Software | Configuration |
|:-------------------------:|:------------------------------------------------------------------------------------------------------------:|
| python | 3.7.16 |
| driver | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| firmware | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| CANN | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| binary arithmetic package | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| torch | 1.11.0 |
| torch_npu | [package](https://gitee.com/ascend/pytorch/releases) |
### Script
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
mkdir logs
mkdir ckpt
```
2. Set up the environment (an optional sanity check is sketched after this step):
```bash
# python3.7
conda create -n test python=3.7
conda activate test
# install torch and torch_npu
pip install torch-1.11.0-cp37-cp37m-manylinux2014_aarch64.whl
pip install torch_npu-1.11.0*-cp37-cp37m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp37-cp37m-linux_aarch64.whl
# install megatron-core
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
# install deepspeed and deepspeed_npu
pip install deepspeed==0.9.2
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
cd deepspeed_npu
pip3 install -e ./
cd ..
# install other packages
pip install -r requirements.txt
```
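Optionally, a quick sanity check that the stack imports cleanly (a sketch; `torch.npu.is_available()` assumes torch_npu has been installed as above):
```shell
# verify torch / torch_npu / deepspeed import and that an NPU device is visible
python -c "import torch, torch_npu, deepspeed; print(torch.__version__, deepspeed.__version__, torch.npu.is_available())"
```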
3. Download the LLaMA2-13B [pretrained weights and tokenizer](https://huggingface.co/NousResearch/Llama-2-13b-hf/tree/main)
```bash
git lfs install
git clone https://huggingface.co/NousResearch/Llama-2-13b-hf
```
```text
# Note: to load huggingface pre-trained weights, a deepspeed checkpoint-loading bug must be fixed first
# In the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py`,
# change `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0`
# (a scripted way to apply this edit is sketched after this block)
# original deepspeed/runtime/engine.py, around lines 2746-2748
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None:
return False
# modified
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None or len(zero_sd_list) == 0:
return False
```
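A hedged way to apply this one-line change without opening an editor (it assumes the original condition appears exactly once in `runtime/engine.py`; a `.bak` backup is kept):
```shell
DS_PATH=$(python -c "import deepspeed, os; print(os.path.dirname(deepspeed.__file__))")
# rewrite the condition inside _load_zero_checkpoint, keeping a backup of the original file
sed -i.bak 's/if zero_sd_list is None:/if zero_sd_list is None or len(zero_sd_list) == 0:/' "$DS_PATH/runtime/engine.py"
grep -n "zero_sd_list is None" "$DS_PATH/runtime/engine.py"
```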
Convert the weights from huggingface format to AscendSpeed format:
```bash
# modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# convert the weight format
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-model-dir llama-2-13b-hf \
--output-model-dir ckpt \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--type 13B \
--deepspeed
```
4. Prepare the dataset
Download the LLaMA2-13B [dataset](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```shell
# download the data
mkdir dataset_llama2
cd ./dataset_llama2
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# preprocess the data
cd WORKSPACE
mkdir alpaca_preprocessed
python tools/preprocess_data.py --input WORKSPACE/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--output-prefix WORKSPACE/alpaca_preprocessed/alpaca \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path WORKSPACE/llama-13b-hf \
--tokenizer-not-use-fast \
--handler-name GeneralInstructionHandler
```
5. Configure the LLaMA2-13B pre-training script: examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```shell
# set the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# configure the tokenizer, dataset, and other paths
TOKENIZER_PATH=./llama-2-13b-hf/  # tokenizer path
DATA_PATH=WORKSPACE/alpaca_preprocessed/alpaca  # processed dataset path
```
6. Launch the LLaMA2-13B pre-training script: examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```shell
bash examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```
### Performance
#### Throughput
Performance comparison of LLaMA2-13B on **Ascend NPUs** and the **reference** device:
| Device | Model | Iterations | Throughput (samples/p/s) | Throughput (tokens/s/p) | Single-step time (s/step) | Compute (TFLOPs/s) |
|:----:|:---------:|:----:|:------------------:|:---------------------:|:---------------:|:----------------:|
| NPUs | LLaMA2-13B | 5000 | 2.868 | 1468.416 | 89.275 | 126.73 |
| Reference | LLaMA2-13B | -- | -- | 1750 | -- | -- |
#### Accuracy
NPU vs. reference loss
The NPU runs stably with steady resource usage and no errors during training; the loss shows a downward trend and the convergence speed is as expected.
The accuracy meets the requirements: the absolute error of the average loss is 0.0011, less than 0.5%.
![NPU-LOSS](../../sources/images/llama2/llama2_13b_bf16_loss_absolute.png)
## Inference
We support AscendSpeed inference for text generation with LLaMA2-13B.
Inference differs from pre-training; for example, we need to load the pre-trained checkpoint and set the length of the output samples:
Configure the LLaMA2-13B inference script: examples/llama2/generate_llama2_13B_tp8_pp1.sh
```shell
# modify the model weight path and tokenizer path
CHECKPOINT=./llama2-13b-tp8-pp1/
VOCAB_FILE=./llama2-13b-hf/
```
```shell
bash ./examples/llama2/generate_llama2_13B_tp8_pp1.sh
```
Example inference results:
![llama2-13B-generate.png](../../sources/images/llama2/llama2-13B-generate.png)
## Evaluation
We use the BoolQ benchmark to evaluate our model. Download the benchmark [here](https://huggingface.co/datasets/boolq).
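One way to fetch the benchmark locally is a git-lfs clone, mirroring the weight download above (a sketch; the evaluation config below expects the data under `./boolq/data/test/`):
```shell
git lfs install
git clone https://huggingface.co/datasets/boolq
```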
```shell
CHECKPOINT=./llama2-13b-tp8-pp1/
VOCAB_FILE=./llama2-13b-hf/
# configure the task and data path
DATA_PATH="./boolq/data/test/"
TASK="boolq"
# configure generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task $TASK \
--seq-length 4096 \
--max-new-tokens 32 \
--max-position-embeddings 4096 \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 40 \
--hidden-size 5120 \
--ffn-hidden-size 13824 \
--load ${CHECKPOINT} \
--num-attention-heads 40 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path $VOCAB_FILE \
--tokenizer-not-use-fast \
--fp16 \
--micro-batch-size 1 \
--seed 42 | tee logs/train.log
# start the evaluation
bash tasks/evaluation/eval.sh
```

View File

@ -2,10 +2,8 @@
<p align="left">
<b><a href="https://gitee.com/ascend/AscendSpeed/blob/master/examples/llama2/README.md">简体中文</a></b> |
<b>English</b>
</p>
</p>
# Contents
- [LLaMA2-7B](#contents)
@ -15,6 +13,15 @@
- [Machine performance](#machine-performance)
- [Accuracy of the loss](#accuracy-of-the-loss)
- [LLaMA2-13B](#contents)
- [Training](#pre-training)
- [Script](#script)
- [Performance](#performance)
- [Machine performance](#machine-performance)
- [Accuracy of the loss](#accuracy-of-the-loss)
- [Inference](#Inference)
- [Evaluation](#Evaluation)
# LLaMA2-7B
@ -187,3 +194,205 @@ The relative error of the average loss is 0.0046, less than 2%, the maximum rela
The absolute error of the average loss is 0.0009, less than 2%, the maximum absolute error is 0.0246.
![NPU-LOSS and NPU-Absolute-Error](../../sources/images/llama2/llama2_7b_shape_fp16_layer32_loss_with_weights_comparison_absolute.png)
# LLaMA2-13B
## Training
Here's a hardware summary of pre-training LLaMA2-13B:
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 8 x Ascend NPUs |
Here's a software summary of pre-training LLaMA2-13B:
| Software | Version |
| :-----------------------: |:-----------:|
| Python | 3.7.16 |
| driver | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| firmware | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
| CANN | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| binary arithmetic package | [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software) |
| torch | 1.11.0 |
| torch_npu | [package](https://gitee.com/ascend/pytorch/releases) |
### Script
1. Clone the repository to your local server:
```shell
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
mkdir logs
mkdir ckpt
```
2. Build the environment (an optional sanity check is sketched after this step)
```bash
# python3.7
conda create -n test python=3.7
conda activate test
# install torch and torch_npu
pip install torch-1.11.0-cp37-cp37m-manylinux2014_aarch64.whl
pip install torch_npu-1.11.0*-cp37-cp37m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp37-cp37m-linux_aarch64.whl
# install megatron-core
pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
# install deepspeed and deepspeed_npu
pip install deepspeed==0.9.2
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
cd deepspeed_npu
pip3 install -e ./
cd ..
# install other packages
pip install -r requirements.txt
```
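Optionally, a quick import check (a sketch; it assumes torch_npu and deepspeed were installed as above):
```shell
# confirm the NPU stack is importable and a device is visible
python -c "import torch, torch_npu, deepspeed; print(torch.npu.is_available())"
```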
*Note that if you want to train with weights from huggingface, first fix a deepspeed checkpoint-loading bug by changing `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0` in the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py` (a scripted way to apply this edit is sketched after the snippet below):*
```text
# original deepspeed/runtime/engine.py, around lines 2746-2748
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None:
return False
# modified
zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
if zero_sd_list is None or len(zero_sd_list) == 0:
return False
```
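A hedged one-liner to apply the patch (it assumes the condition appears exactly once in `runtime/engine.py`; a `.bak` backup is kept):
```shell
DS_PATH=$(python -c "import deepspeed, os; print(os.path.dirname(deepspeed.__file__))")
sed -i.bak 's/if zero_sd_list is None:/if zero_sd_list is None or len(zero_sd_list) == 0:/' "$DS_PATH/runtime/engine.py"
```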
3. Prepare pretrained weights and tokenizer
Download the LLaMA2-13B checkpoint from [here](https://huggingface.co/NousResearch/Llama-2-13b-hf/tree/main)
```bash
git lfs install
git clone https://huggingface.co/NousResearch/Llama-2-13b-hf
```
*Note that if you want to use the weights from huggingface, run the weight conversion script first. The following uses the llama-2-13b weight conversion as an example.*
```bash
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# convert to deepspeed weights
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-model-dir llama-2-13b-hf \
--output-model-dir ckpt \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--type 13B
```
4. Prepare dataset
Download the LLaMA2-13B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
```bash
# download the dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd WORKSPACE
mkdir alpaca_preprocessed
python tools/preprocess_data.py --input WORKSPACE/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--output-prefix WORKSPACE/alpaca_preprocessed/alpaca \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path WORKSPACE/llama-13b-hf \
--tokenizer-not-use-fast \
--handler-name GeneralInstructionHandler
```
5. Config LLaMA2-13B pre-training script: examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```shell
# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# modify the tokenizer path and dataset path according to your own paths
TOKENIZER_PATH=./llama-2-13b-hf/ #tokenizer path
DATA_PATH=WORKSPACE/alpaca_preprocessed/alpaca #processed dataset
```
6. Launch LLaMA2-13B pre-training script: examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```shell
bash examples/llama2/pretrain_llama2_13B_ptd_8p.sh
```
### Performance
#### Machine performance
The performance of LLaMA2-13B on **Ascend NPU** and **Reference**:
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
| :------: |:----------:|:----------------:|:-----------------------------:|:----------------------------:|:-------------------------:|:-----------------------------------:|
| NPUs | LLaMA2-13B | 5000 | 2.868 | 1468.416 | 89.275 | 126.73 |
| Reference | LLaMA2-13B | -- | -- | 1750 | -- | -- |
#### Accuracy of the loss
NPU vs Reference loss.
The NPU runs smoothly, the resource usage is stable, no errors are reported in the middle of the process, the Loss is on a decreasing trend, and the convergence speed is as expected.
The precision meets the requirements. The absolute error of the average loss is 0.0011, less than 0.5%.
![NPU-LOSS](../../sources/images/llama2/llama2_13b_bf16_loss_absolute.png)
## Inference
We support AscendSpeed Inference for text generation with Llama2 13B.
Inference differs from pre-training; for example, we need to load the pre-trained checkpoint and set the length of the output samples:
Config Llama2-13B inference script: examples/llama2/generate_llama2_13B_tp8_pp1.sh
```shell
# modify the model weight path and tokenizer path
CHECKPOINT=./llama2-13b-tp8-pp1/
VOCAB_FILE=./llama2-13b-hf/
```
```shell
bash ./examples/llama2/generate_llama2_13B_tp8_pp1.sh
```
Some inference samples are as follows:
![llama2-13B-generate.png](../../sources/images/llama2/llama2-13B-generate.png)
## Evaluation
We use the BoolQ benchmark to evaluate our model. Download the benchmark [here](https://huggingface.co/datasets/boolq).
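To fetch the benchmark locally, a git-lfs clone of the dataset repository works (a sketch; the `DATA_PATH` below assumes the data lands under `./boolq/data/test/`):
```shell
git lfs install
git clone https://huggingface.co/datasets/boolq
```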
```shell
CHECKPOINT=./llama2-13b-tp8-pp1/
VOCAB_FILE=./llama2-13b-hf/
# configure task and data path
DATA_PATH="./boolq/data/test/"
TASK="boolq"
# configure generation parameters
python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation.py \
--task-data-path $DATA_PATH \
--task $TASK \
--seq-length 4096 \
--max-new-tokens 32 \
--max-position-embeddings 4096 \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 40 \
--hidden-size 5120 \
--ffn-hidden-size 13824 \
--load ${CHECKPOINT} \
--num-attention-heads 40 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path $VOCAB_FILE \
--tokenizer-not-use-fast \
--fp16 \
--micro-batch-size 1 \
--seed 42 | tee logs/train.log
# start evaluation
bash tasks/evaluation/eval.sh
```

View File

@ -0,0 +1,37 @@
#!/bin/bash
export TOKENIZERS_PARALLELISM=false
source /usr/local/Ascend/ascend-toolkit/set_env.sh
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT"
CHECKPOINT=./model/LLAMA-2-13B-hf_tp8_pp1_save/
VOCAB_FILE=./model/LLAMA-2-13B-hf
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/inference/inference_llama.py \
--no-contiguous-buffers-in-local-ddp \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 40 \
--hidden-size 5120 \
--ffn-hidden-size 13824 \
--load "${CHECKPOINT}" \
--num-attention-heads 40 \
--max-position-embeddings 4096 \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path "$VOCAB_FILE" \
--tokenizer-not-use-fast \
--fp16 \
--micro-batch-size 1 \
--seq-length 4096 \
--max-new-tokens 256 \
--seed 42

View File

@ -0,0 +1,62 @@
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
DATA_PATH=./dataset/llama_text_document
LOAD_CHECKPOINT=./model/LLAMA-2-13B-hf_tp8_pp1
SAVE_CHECKPOINT=./model/LLAMA-2-13B-hf_tp8_pp1_save/
TOKENIZER_PATH=./model/LLAMA-2-13B-hf
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
# Main script
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_llama.py \
--DDP-impl local \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--sequence-parallel \
--num-layers 40 \
--hidden-size 5120 \
--ffn-hidden-size 13824 \
--num-attention-heads 40 \
--micro-batch-size 2 \
--global-batch-size 2 \
--seq-length 4096 \
--max-position-embeddings 4096 \
--train-iters 1 \
--lr-warmup-iters 0 \
--save $SAVE_CHECKPOINT \
--load $LOAD_CHECKPOINT \
--data-path $DATA_PATH \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 1e-6 \
--lr-decay-style cosine \
--min-lr 1e-8 \
--weight-decay 1e-1 \
--clip-grad 1.0 \
--initial-loss-scale 4096.0 \
--seed 1234 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--adam-eps 1.0e-5 \
--log-interval 1 \
--save-interval 5000 \
--eval-interval 1000 \
--eval-iters 1 \
--use-fused-rotary-pos-emb \
--use-flash-attn \
--bf16 | tee logs/train_13B_llama2_npu.log

Binary file not shown (new image, 40 KiB)

Binary file not shown (new image, 289 KiB)