Update llama2-7B README

2024-12-05 05:17:40 +08:00 · 2023-12-04 17:23:55 +08:00 · 2023-12-04 17:23:55 +08:00 · 9c594efe8e
commit 9c594efe8e
parent 680c0fbe65
8 changed files with 383 additions and 347 deletions
--- a/11
+++ b/11
@ -2,6 +2,7 @@ approvers:
 - dilililiwhy
 - wujianping996
 - fengliangjun66
+- xqiangx1991
 - ckirchhoff
 - block0219
 - wqh17101
@ -12,9 +13,17 @@ reviewers:
 - mhh001
 - zhangshengdong
 - kingsleyandher
+- guo-xinjie-1
+- matrixssy
 - ckirchhoff
 - block0219
 - wqh17101
 - ZeroFadeAway
 - Ronnie_zheng
-
+- sibylk
+- dl9864
+- demo-v
+- xulianhao
+- yangyifei626
+- z2fan
+- wenjiang2357
--- a/README.md
+++ b/README.md
@ -16,25 +16,25 @@
 </p>
 <p align="center">
        <b>简体中文</b> |
-        <b><a href="https://gitee.com/fengliangjun66/AscendSpeed/blob/master/README_en.md">English</a> </b> 
+        <b><a href="https://gitee.com/fengliangjun66/AscendSpeed/blob/master/README_en.md">English</a> </b>
    </p>
 </p>

 AscendSpeed旨在为华为 [昇腾芯片](https://open.codehub.huawei.com/OpenBaize/Ascend/ascendspeed/files?ref=master&filePath=examples%2Fbaichuan%2Fpretrain_baichuan_zero_7B.sh&isFile=true) 上的大语言模型提供端到端的解决方案, 包含模型，算法，算子，以及下游任务。

-## AscendSpeed解决方案概览 
+## AscendSpeed解决方案概览

 ---
 ### 大语言模型
-当前AscendSpeed支持下列模型的预训练以及全参微调: 
-
+当前AscendSpeed支持下列模型的预训练以及全参微调:
+* <a href="https://huggingface.co/BAAI/Aquila-7B/tree/main" style="color:green">Aquila</a>-[[使用说明: 7B]](examples/aquila/README.md)
 * <a href="https://github.com/baichuan-inc" style="color:green">Baichuan</a>-[[使用说明: 7B/13B]](examples/baichuan/README.md)
 * <a href="https://arxiv.org/pdf/2108.12409.pdf" style="color:green">Bloom</a>-[[使用说明: 7B/176B]](examples/bloom/README.md)
-* <a href="https://internlm.intern-ai.org.cn/" style="color:green">InternLM</a>-[[使用说明: 7B]](examples/intern/README.md)
+* <a href="https://internlm.intern-ai.org.cn/" style="color:green">InternLM</a>-[[使用说明: 7B/65B]](examples/intern/README.md)
 * <a href="https://huggingface.co/docs/transformers/main/model_doc/llama" style="color:green">LLaMA</a>-[[使用说明: 7B/13B/33B/65B]](examples/llama/README.md)
 * <a href="https://huggingface.co/docs/transformers/main/model_doc/llama2" style="color:green">LLaMA2</a>-[[使用说明: 7B/13B/70B]](examples/llama2/README.md)

-LLaMA2-34B, Aquila-7B, Baichuan2-7B/13B 等模型即将上线...
+LLaMA2-34B, Baichuan2-7B/13B 等模型即将上线...

 ### 下游任务
 当前AscendSpeed为大模型提供以下周边应用:
@ -86,6 +86,16 @@ LLaMA2-34B, Aquila-7B, Baichuan2-7B/13B 等模型即将上线...
    </tr>
  </thead>
  <tbody>
+    <tr>
+      <td rowspan="1"><a href="examples/aquila/README.md">Aquila</a></td>
+      <td>7B</td>
+      <td> 1x8</td>
+      <td> FP16 </td>
+      <td> 3644 </td>
+      <td> 4078 </td>
+      <td> <a href="./sources/images/aquila/aquila_comp1130.png">Loss</a> </td>
+      <td> <a href="examples/aquila/pretrain_aquila_7B.sh">训练</a> </td>
+    </tr>
    <tr>
      <td rowspan="2"><a href="examples/baichuan/README.md">Baichuan</a></td>
      <td>7B</td>
@ -125,7 +135,7 @@ LLaMA2-34B, Aquila-7B, Baichuan2-7B/13B 等模型即将上线...
      <td> <a href="examples/bloom/pretrain_bloom_176b.sh">训练</a> </td>
    </tr>
    <tr>
-      <td><a href="examples/intern/README.md">InternLM</td>
+      <td rowspan="2"><a href="examples/intern/README.md">InternLM</a></td>
      <td>7B</td>
      <td>1x8</td>
      <td>BF16</td>
@ -134,6 +144,15 @@ LLaMA2-34B, Aquila-7B, Baichuan2-7B/13B 等模型即将上线...
      <td>  <a href="sources/images/intern7b_loss.png">Loss</a>  </td>
      <td> <a href="examples/intern/pretrain_internlm_7b_zero.sh">训练</a> </td>
    </tr>
+    <tr>
+      <td >65B</td>
+      <td >4x8</td>
+      <td> BF16 </td>
+      <td> 342 </td>
+      <td> 414 </td>
+      <td> <a href="sources/images/intern65b_loss.png">Loss</a> </td>
+      <td> <a href="examples/intern/pretrain_internlm_65b_ptd_32p.sh">训练</a> </td>
+    </tr>
    <tr>
      <td rowspan="5"><a href="examples/llama/README.md">LLaMA</td>
      <td>7B</td>
@ -178,11 +197,11 @@ LLaMA2-34B, Aquila-7B, Baichuan2-7B/13B 等模型即将上线...
      <td rowspan="3"><a href="examples/llama2/README.md">LLaMA2</td>
      <td>7B</td>
      <td>1x8</td>
-      <td>FP16 </td>
-      <td> 2712 </td>
-      <td> 2348 </td>
-      <td> <a href="sources/images/llama2/llama2_7b_shape_fp16_layer32_loss_with_weights.png">Loss</a> </td>
-      <td> <a href="examples/llama2/pretrain_llama2_7b_zero_8p.sh">训练</a> </td>
+      <td>BF16 </td>
+      <td> 2662 </td>
+      <td> 2884 </td>
+      <td> <a href="sources/images/llama2/llama2-7b-tp8pp1mbs4gbs16-cann1115-Megatron-GPU-loss-releative.png">Loss</a> </td>
+      <td> <a href="examples/llama2/pretrain_llama2_7b_ptd.sh">训练</a> </td>
    </tr>
    <tr>
      <td>13B</td>
@ -212,8 +231,8 @@ LLaMA2-34B, Aquila-7B, Baichuan2-7B/13B 等模型即将上线...
 1. 拷贝仓库到你的个人服务器:

 ```bash
-git clone https://gitee.com/ascend/AscendSpeed.git 
-cd AscendSpeed 
+git clone https://gitee.com/ascend/AscendSpeed.git
+cd AscendSpeed
 mkdir logs
 mkdir ckpt
 ```
@ -285,7 +304,7 @@ python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-mode
 5. 启动训练

 ```bash
-# 在脚本中设置你自己的数据/权重/tokenizer等路径  
+# 在脚本中设置你自己的数据/权重/tokenizer等路径
 sh examples/llama/pretrain_llama_7B_zero_8p.sh
 ```

@ -370,10 +389,10 @@ sh examples/llama/pretrain_llama_7B_zero_8p.sh
    <tr>
      <td><a href="examples/llama2/README.md">LLaMA2</a></td>
      <td>7B</td>
+      <td> <a href="https://gitee.com/ascend/AscendSpeed/tree/master/examples/llama2/tune_llama2_7b_ptd.sh">lora</a> </td>
+      <td> <a href="https://gitee.com/ascend/AscendSpeed/tree/master/examples/llama2/generate_llama2_7b_ptd.sh">对话 </a> </td>
      <td> -- </td>
-      <td> <a href="https://gitee.com/ascend/AscendSpeed/tree/master/examples/llama/generate_llama_7B_tp2_pp2.sh">对话 </a> </td>
-      <td> -- </td>
-      <td> -- </td>
+      <td> <a href="https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json">alpaca_data.json </td>
    </tr>
  </tbody>
 </table>
@ -469,7 +488,7 @@ python tools/preprocess_data.py --input WORKSPACE/alpaca/train-00000-of-00001-a0
                                --handler-name GeneralInstructionHandler
 ```

-在处理后，`WORKSPACE/alpaca_preprocessed` 文件夹下会有3个 `bin` 文件 和 3个 `idx` 文件，我们便可以通过添加 `--data-path WORKSPACE/alpaca_preprocessed/alpaca` 和 `--is-instruction-dataset` 标志来进行指令微调。 
+在处理后，`WORKSPACE/alpaca_preprocessed` 文件夹下会有3个 `bin` 文件 和 3个 `idx` 文件，我们便可以通过添加 `--data-path WORKSPACE/alpaca_preprocessed/alpaca` 和 `--is-instruction-dataset` 标志来进行指令微调。
 此外，基于指令数据集，我们还可以通过加上 `--variable-seq-lengths` 标志使用动态序列长度训练模型。

 请注意，使用 `--handler-name GeneralInstructionHandler` 标志的指令数据集，在处理时会从 `ascendspeed/data/data_handler.py` 中选择 `GeneralInstructionHandler` 类来制作prompt。如果你处理的是 alpaca 格式风格的数据集，即包含 `instruction`, `input` 和 `output` 列的数据集，可以直接使用 `--handler-name GeneralInstructionHandler` 标志。
@ -483,7 +502,7 @@ python tools/preprocess_data.py --input WORKSPACE/alpaca/train-00000-of-00001-a0
 ```shell
 pip install peft==0.4.0
 ```
-你也可以选择直接从它Github仓库的 [源码安装](https://github.com/huggingface/peft/archive/refs/tags/v0.4.0.tar.gz)， 通过修改它的setup.py文件来回避一些依赖问题。 
+你也可以选择直接从它Github仓库的 [源码安装](https://github.com/huggingface/peft/archive/refs/tags/v0.4.0.tar.gz)， 通过修改它的setup.py文件来回避一些依赖问题。

 之后，你仅仅只需要在启动脚本中使能如下标志便可以启动lora微调训练:

@ -498,11 +517,11 @@ Lora有一些相关参数，在 [PEFT](https://github.com/huggingface/peft) 仓
 # Llama example
 --lora-r 64 \
 --lora-alpha 128 \
--lora-modules-to-save word_embeddings lm_head.lm_head \
+--lora-modules-to-save word_embeddings output_layer \
 --lora-register-forward-hook word_embeddings input_layernorm \
 ```

-在这些参数中，标志 `--lora-register-forward-hook` 被用于修复由PP造成的梯度链中断，它仅仅只需要在每一个PP阶段的输入层设置，并不会增加训练参数。
+在这些参数中，标志 `--lora-register-forward-hook` 被用于修复由PP造成的梯度链中断，它仅仅只需要在每一个PP阶段的输入层设置，并不会增加训练参数。 标志 `--lora-modules-to-save` 被用于扩展词表时的微调，若没此需求则无需传入此参数。

 最后，Lora微调后保存的权重仅仅只会包含新增的Lora权重。相似的，当你加载一个Lora模型时，除了原始权重路径需要设置，还需要设置一个加载Lora权重的路径，如下：

@ -536,9 +555,9 @@ AscendSpeed:

 这里有一些使用不同模式的样例脚本可以尝试运行，***请注意：***
 1. 如果你尝试使用 huggingface 的模型权重，请首先进行权重转换， 以 Llama-7B 为例:
-    
+
      - PTD 策略的转换
-    
+
           ```bash
           python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-model-dir llama-7b-hf \
                                                                               --output-model-dir llama-7b-tp2-pp2 \
@ -546,7 +565,7 @@ AscendSpeed:
                                                                               --pipeline-model-parallel-size 2 \
                                                                               --type 7B
           ```
-    
+
      - ZeRO 策略的转换
          ```bash
          python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-model-dir llama-7b-hf \
@ -554,7 +573,7 @@ AscendSpeed:
                                                                              --type 7B \
                                                                              --deepspeed
          ```
-    
+
 5. 下面脚本中的一些路径需要修改，比如：模型权重路径 和 词表路径.

    - 仅仅使用 PTD 策略训练的模型：在这种模式下，模型以 Megatron-LM 的风格被 流水并行 和 张量并行 切分
@ -752,7 +771,7 @@ VOCAB_FILE=../models/llama7b-hf/
 # 配置任务和数据路径
 DATA_PATH="dataset/boolq/test"
 TASK="boolq"
-# 配置生成参数 
+# 配置生成参数
 python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation_llama.py   \
       --task-data-path $DATA_PATH \
       --task $TASK\
@ -800,7 +819,7 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation_llama.py   \
       --micro-batch-size 1  \
       --seed 42 | tee logs/train.log
 ```
-##### BoolQ 
+##### BoolQ
 BoolQ 是一个 yes/no 的问答数据集， 每一个问题包含了一个（问题，文章，答案）三元组，同时有文章的标题作为额外的选择性输入。BoolQ 数据集的评估相对简单，只需要配置 `TASK="boolq"`, `--seq-length=512`, `--max-position-embeddings=512`, `--max-new-token=2`。
 零样本评估的结果通常会被给定的 prompt 影响，可以尝试通过在 `tasks/evaluation/evaluation.py` 中设置合适的 prompt 得到更高的分数，

@ -809,15 +828,15 @@ BoolQ 是一个 yes/no 的问答数据集， 每一个问题包含了一个（
 template = {instruction}
 ```

-##### MMLU 
+##### MMLU
 由于 MMLU 是一项多学科任务，并且需要进行 5-shot 评估，因此每个学科问题的长度差异很大。如果你想同时跑57个学科任务，可以尝试设置 `TASK="mmlu"`, `--seq-length=2048`, `--max-position-embeddings=2048`, `--max-new-token=2` (`--max-new-tokens` 可以在 2-4 取值)。
 在很多网站，MMLU 的精度会依据学科进行评估，57个学科主要属于四个大类， 因此该数据集也可以基于四个大类进行打分，[网站](https://github.com/hendrycks/test/blob/master/categories.py) 给出了具体的57个类别。


-##### GSM8K 
+##### GSM8K
 GSM8K 是一个有8.5k高质量小学数学应用题文本的数据集，每一个问题的回答是具体的数字。由于该数据集通常采用 few-shot 的形式进行评估，GSM8K的问题长度相对是比较长的，输出答案包含一整个思维链路，相关入参应该设置为 `TASK="gsm8k"`, `--seq-length=2048`, `--max-position-embeddings=2048`, `--max-new-token=128` (`--max-new-tokens` 可以是 256-512).

-##### HumanEval 
+##### HumanEval
 HumanEval 是一个用于挑战代码生成问题的数据集，具有164个编程问题，包含函数签名，文档，函数主体和单元测试等。该数据的所有问题都是手写的，以确保它们不在训练集中，由于答案包含长代码，相关参数可以设置为 `TASK="human_eval"`, `--seq-length=2048`,
 `--max-position-embeddings=2048`, `--max-new-token=1024`。

@ -841,7 +860,7 @@ Big-bench-hard 数据集是 BIG-Bench 的一个子集，专注于有挑战性的
 在 AscendSpeed 中使用张量并行， 可以在启动脚本中增加  `--tensor-model-parallel-size` 标志， 来明确用于拆分模型的GPU数量。

 ### <span id="jump2">  (虚拟 & 动态) 流水并行  </span>
-流水并行（Pipeline Parallelism (PP)）是一种将模型所有的Transformer模块划分为多个stage放在不同卡上训练的模型并行技术。 
+流水并行（Pipeline Parallelism (PP)）是一种将模型所有的Transformer模块划分为多个stage放在不同卡上训练的模型并行技术。
 在PP中，每一个stage都有相等数量的Transformer，同时通过将 全局的 （global） batchsize 拆分为多个 微 （micro） batchsize，在stage间流水化训练这些微 batchsize，以达到训练整个模型的目的。
 虚拟流水并行 （Virtual Pipeline (VP) Parallelism），通过增加虚拟的 stage 来减少 PP 运行时的空泡时间， 动态流水并行 （Dynamic Pipline Parallelism，DPP） 则是增强版本的 VP， 通过合理的设置每个微 batchsize的大小进一步降低空泡时间。
 PP 和 VP的基本原理如下：<div align=center> <img src="sources/images/pp_vp.png" height="350px" width="800px"> </div>
@ -1038,16 +1057,3 @@ AscendSpeed由华为公司的下列部门联合贡献
 - 许可描述: 在运行程序之前，建议您对训练所需的文件（如ckpt、日志等）采取权限控制（普通用户而不是root权限）等安全措施
 - 路径描述: 当你使用 `torch.load` 接口时，除非 weights_only 参数被设置为 True， 否则请隐式的使用 pickle 模块，我们不建议您加载和使用任何来源未知或不可信任的数据
 - 通信矩阵: [通信矩阵](https://gitee.com/ascend/AscendSpeed/wikis/Communition%20Matrix/AscendSpeed%20Communication%20Matrix?sort_id=8855822)
-
-
-
-
-
-
-
-
-
-
-
-
-
--- a/README_en.md
+++ b/README_en.md
@ -16,25 +16,25 @@
 </p>
 <p align="center">
        <b><a href="https://gitee.com/fengliangjun66/AscendSpeed/blob/master/README.md"> 简体中文</a></b> |
-        <b> English </b> 
+        <b> English </b>
    </p>
 </p>

 AscendSpeed provides end-to-end solutions for large language models on [Ascend](https://open.codehub.huawei.com/OpenBaize/Ascend/ascendspeed/files?ref=master&filePath=examples%2Fbaichuan%2Fpretrain_baichuan_zero_7B.sh&isFile=true) chips, including models, algorithms, kernels, and tasks.

-## Why AscendSpeed? 
+## Why AscendSpeed?

 ---
 ### Prepared Models
-Currently, the following models has been supported: 
-
+Currently, the following models has been supported:
+* <a href="https://huggingface.co/BAAI/Aquila-7B/tree/main" style="color:green">Aquila</a>-[[README: 7B]](examples/aquila/README.md)
 * <a href="https://github.com/baichuan-inc" style="color:green">Baichuan</a>-[[README: 7B/13B]](examples/baichuan/README.md)
 * <a href="https://arxiv.org/pdf/2108.12409.pdf" style="color:green">Bloom</a>-[[README: 7B/176B]](examples/bloom/README.md)
-* <a href="https://internlm.intern-ai.org.cn/" style="color:green">InternLM</a>-[[README: 7B]](examples/intern/README.md)
+* <a href="https://internlm.intern-ai.org.cn/" style="color:green">InternLM</a>-[[README: 7B/65B]](examples/intern/README.md)
 * <a href="https://huggingface.co/docs/transformers/main/model_doc/llama" style="color:green">LLaMA</a>-[[README: 7B/13B/33B/65B]](examples/llama/README.md)
 * <a href="https://huggingface.co/docs/transformers/main/model_doc/llama2" style="color:green">LLaMA2</a>-[[README: 7B/13B/70B]](examples/llama2/README.md)

-LLaMA2-34B, Aquila-7B, Baichuan2-7B/13B are coming soon ...
+LLaMA2-34B, Baichuan2-7B/13B are coming soon ...

 ### Downstream Tasks
 Currently, the following downstream tasks have been supported:
@ -85,6 +85,16 @@ Coming soon ...
    </tr>
  </thead>
  <tbody>
+    <tr>
+      <td rowspan="1"><a href="examples/aquila/README.md">Aquila</a></td>
+      <td>7B</td>
+      <td> 1x8</td>
+      <td> FP16 </td>
+      <td> 3644 </td>
+      <td> 4078 </td>
+      <td> <a href="./sources/images/aquila/aquila_comp1130.png">Loss</a> </td>
+      <td> <a href="examples/aquila/pretrain_aquila_7B.sh">Train</a> </td>
+    </tr>
    <tr>
      <td rowspan="2"><a href="examples/baichuan/README.md">Baichuan</a></td>
      <td>7B</td>
@ -124,7 +134,7 @@ Coming soon ...
      <td> <a href="examples/bloom/pretrain_bloom_176b.sh">Train</a> </td>
    </tr>
    <tr>
-      <td><a href="examples/intern/README.md">InternLM</td>
+      <td rowspan="2"><a href="examples/intern/README.md">InternLM</a></td>
      <td>7B</td>
      <td>1x8</td>
      <td>BF16</td>
@ -133,6 +143,15 @@ Coming soon ...
      <td>  <a href="sources/images/intern7b_loss.png">Loss</a>  </td>
      <td> <a href="examples/intern/pretrain_internlm_7b_zero.sh">Train</a> </td>
    </tr>
+    <tr>
+      <td >65B</td>
+      <td >4x8</td>
+      <td> BF16 </td>
+      <td> 342 </td>
+      <td> 414 </td>
+      <td> <a href="sources/images/intern65b_loss.png">Loss</a> </td>
+      <td> <a href="examples/intern/pretrain_internlm_65b_ptd_32p.sh">Train</a> </td>
+    </tr>
    <tr>
      <td rowspan="5"><a href="examples/llama/README.md">LLaMA</td>
      <td>7B</td>
@ -177,11 +196,11 @@ Coming soon ...
      <td rowspan="3"><a href="examples/llama2/README.md">LLaMA2</td>
      <td>7B</td>
      <td>1x8</td>
-      <td>FP16 </td>
-      <td> 2712</td>
-      <td> 2348 </td>
-      <td> <a href="sources/images/llama2/llama2_7b_shape_fp16_layer32_loss_with_weights.png">Loss</a> </td>
-      <td> <a href="examples/llama2/pretrain_llama2_7b_zero_8p.sh">Train</a> </td>
+      <td>BF16 </td>
+      <td> 2662</td>
+      <td> 2884 </td>
+      <td> <a href="sources/images/llama2/llama2-7b-tp8pp1mbs4gbs16-cann1115-Megatron-GPU-loss-releative.png">Loss</a> </td>
+      <td> <a href="examples/llama2/pretrain_llama2_7b_ptd.sh">Train</a> </td>
    </tr>
    <tr>
      <td>13B</td>
@ -211,8 +230,8 @@ This is an example of model training with AscendSpeed, and the detailed guidelin
 1. Clone the repository to your local server:

 ```bash
-git clone https://gitee.com/ascend/AscendSpeed.git 
-cd AscendSpeed 
+git clone https://gitee.com/ascend/AscendSpeed.git
+cd AscendSpeed
 mkdir logs
 mkdir ckpt
 ```
@ -285,7 +304,7 @@ python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-mode
 5. Start your task

 ```bash
-# set your data path / weight path / tokenizer path etc.   
+# set your data path / weight path / tokenizer path etc.
 sh examples/llama/pretrain_llama_7B_zero_8p.sh
 ```

@ -370,10 +389,10 @@ sh examples/llama/pretrain_llama_7B_zero_8p.sh
    <tr>
      <td><a href="examples/llama2/README.md">LLaMA2</a></td>
      <td>7B</td>
+      <td> <a href="https://gitee.com/ascend/AscendSpeed/tree/master/examples/llama2/tune_llama2_7b_ptd.sh">lora</a>  </td>
+      <td> <a href="https://gitee.com/ascend/AscendSpeed/tree/master/examples/llama/generate_llama2_7b_ptd.sh">inference </a> </td>
      <td> -- </td>
-      <td> <a href="https://gitee.com/ascend/AscendSpeed/tree/master/examples/llama/generate_llama_7B_tp2_pp2.sh">inference </a> </td>
-      <td> -- </td>
-      <td> -- </td>
+      <td> <a href="https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json">alpaca_data.json </td>
    </tr>
  </tbody>
 </table>
@ -388,7 +407,7 @@ sh examples/llama/pretrain_llama_7B_zero_8p.sh
 # for llama, download alpaca dataset, like
 wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet

-# download tokenizer configs and (selective) weights from 
+# download tokenizer configs and (selective) weights from
 # https://huggingface.co/yahma/llama-7b-hf/tree/main
 # revise "LLaMATokenizer" as "LlamaTokenizer" in tokenizer_config.json (This is a bug of huggingface)
 mkdir dataset
@ -402,7 +421,7 @@ python tools/preprocess_data.py --input train-00000-of-00001-a09b74b3ef9c3b56.pa

 #### Preprocessing pretraining dataset

-##### wikipedia dataset 
+##### wikipedia dataset

 + download [wikipedia data](https://huggingface.co/datasets/wikipedia/tree/main) from huggingface to WORKSPACE/wikipedia
 + download [llama tokenizer model and config](https://huggingface.co/yahma/llama-7b-hf/tree/main) from huggingface to WORKSPACE/llama-7b-hf
@ -414,7 +433,7 @@ cd WORKSPACE
 mkdir wikipedia_preprocessed

 # specify huggingface load_dataset parameters.(--input param will be ignored)
-# these params will just be feed into datasets.load_dataset function 
+# these params will just be feed into datasets.load_dataset function
 hf_config_json="./hf_config_json.json"
 cat <<EOT > $hf_config_json
 {
@ -463,7 +482,7 @@ python tools/preprocess_data.py --input WORKSPACE/train-00000-of-00001-a09b74b3e
 # for llama, download alpaca dataset, like
 # wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet

-# download tokenizer configs and (selective) weights from 
+# download tokenizer configs and (selective) weights from
 # https://huggingface.co/yahma/llama-7b-hf/tree/main
 # revise "LLaMATokenizer" as "LlamaTokenizer" in tokenizer_config.json (This is a bug of huggingface)

@ -477,7 +496,7 @@ python tools/preprocess_data.py --input WORKSPACE/alpaca/train-00000-of-00001-a0
                                --handler-name GeneralInstructionHandler
 ```

-After preprocessing, there will be three `bin` files and three `idx` files in the `WORKSPACE/alpaca_preprocessed` dictionary. Then, we can train a model with `--data-path WORKSPACE/alpaca_preprocessed/alpaca` and `--is-instruction-dataset` flags. 
+After preprocessing, there will be three `bin` files and three `idx` files in the `WORKSPACE/alpaca_preprocessed` dictionary. Then, we can train a model with `--data-path WORKSPACE/alpaca_preprocessed/alpaca` and `--is-instruction-dataset` flags.
 In addition, we have developed the dynamic padding function based on the instruction dataset, which can be implemented using the `--variable-seq-lengths` flag.

 Note that instruction dataset has a `--handler-name GeneralInstructionHandler` flag which will choose `GeneralInstructionHandler` class to create prompt in `ascendspeed/data/data_handler.py`.
@ -489,7 +508,7 @@ In addition, `BelleMultiTurnInstructionHandler` is used to handle [belle dataset
 ### <span id="jump12"> Low-parameter fine-tuning </span>
 #### Lora

-Now, we support Lora to fine-tune your models. 
+Now, we support Lora to fine-tune your models.

 First, you need to install version 0.4.0 of the peft library, like this:
 ```shell
@ -510,11 +529,11 @@ There are other Lora related arguments here, you can find their definitions in t
 # Llama example
 --lora-r 64 \
 --lora-alpha 128 \
--lora-modules-to-save word_embeddings lm_head.lm_head \
+--lora-modules-to-save word_embeddings output_layer \
 --lora-register-forward-hook word_embeddings input_layernorm \
 ```

-Among them, the argument `--lora-register-forward-hook` is used to repair the gradient chain break caused by PP. It only needs to be set to the input layer of each PP stage, and the repair will not increase the trainable parameters.
+Among them, the argument `--lora-register-forward-hook` is used to repair the gradient chain break caused by PP. It only needs to be set to the input layer of each PP stage, and the repair will not increase the trainable parameters. The argument `--lora-modules-to-save` is used for fine-tuning when expanding the vocabulary. If there is no need for this, there is no need to pass in this argument.

 Finally, only Lora's parameters are saved after turning on Lora. Similarly, when loading a model, you need to specify the original model weight path and the Lora weight path. Parameters such as the optimizer are subject to those in the Lora weight path.
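
As a side note on the `--lora-r 64` and `--lora-alpha 128` values in the example above, here is a minimal sketch in plain Python. It assumes the standard LoRA formulation (the one PEFT follows), where the low-rank update is scaled by `lora_alpha / r` and each adapted linear layer gains `r * (in_features + out_features)` trainable parameters. The 4096x4096 layer size and the helper names are illustrative assumptions, not values taken from AscendSpeed.

```python
def lora_scaling(r: int, lora_alpha: int) -> float:
    """Scaling factor applied to the low-rank update (standard LoRA: alpha / r)."""
    if r <= 0:
        raise ValueError("r must be positive")
    return lora_alpha / r


def lora_trainable_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one adapter: A is (r x in_features), B is (out_features x r)."""
    return r * (in_features + out_features)


if __name__ == "__main__":
    r, alpha = 64, 128  # values from the Llama example flags
    hidden = 4096       # hypothetical layer width, for illustration only
    print("scaling:", lora_scaling(r, alpha))
    print("extra params per adapted layer:", lora_trainable_params(hidden, hidden, r))
```

With these settings the adapter output is scaled by 2, and the adapter adds a small fraction of the parameters of the full 4096x4096 weight matrix.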

@ -523,7 +542,7 @@ Finally, only Lora's parameters are saved after turning on Lora. Similarly, when
 --lora-load ${LORA_CHECKPOINT} \
 ```

-There is an [example](examples/llama/tune_llama_ptd_13b.sh) could be referred. 
+There is an [example](examples/llama/tune_llama_ptd_13b.sh) could be referred.

 After using Lora to fine-tune the Llama model, the instruction dialogue effect is as follows:

@ -548,11 +567,11 @@ Currently, we support the following four cases of inference:
 Here are some example scripts in different mode mentioned above for you to launch directly.

 ***Please Note that:***
-1. If you want to use the weight from huggingface, please run the weight conversion script first. 
+1. If you want to use the weight from huggingface, please run the weight conversion script first.
    Take Llama-7B, for example:
-    
+
      - PTD only
-    
+
           ```bash
           python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-model-dir llama-7b-hf \
                                                                               --output-model-dir llama-7b-tp2-pp2 \
@ -560,7 +579,7 @@ Here are some example scripts in different mode mentioned above for you to launc
                                                                               --pipeline-model-parallel-size 2 \
                                                                               --type 7B
           ```
-    
+
    - DeepSpeed ZeRO only
        ```bash
        python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-model-dir llama-7b-hf \
@ -568,7 +587,7 @@ Here are some example scripts in different mode mentioned above for you to launc
                                                                            --type 7B \
                                                                            --deepspeed
        ```
-    
+
 2. You need to modify some variables in the shell script such as **model weight path** and **vocab path**.

    - **PTD only:** In this mode, the model is split by pipeline parallel and tensor parallel mode in megatron ways.
@ -767,7 +786,7 @@ VOCAB_FILE=../models/llama7b-hf/
 # configure task and data path
 DATA_PATH="dataset/boolq/test"
 TASK="boolq"
-# configure generation parameters 
+# configure generation parameters
 python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation_llama.py   \
       --task-data-path $DATA_PATH \
       --task $TASK\
@ -815,39 +834,39 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS evaluation_llama.py   \
       --micro-batch-size 1  \
       --seed 42 | tee logs/train.log
 ```
-##### BoolQ 
+##### BoolQ
 BoolQ is a question answering dataset for yes/no questions. Each question contains a triplet of (question, passage, answer), with the title of the page as optional additional context.
 The evaluation of the BoolQ data set is relatively simple, just configure `TASK="boolq"`, `--seq-length=512`, `--max-position-embeddings=512`, `--max-new-token=2`.
-The zero-shot results are usually affected by the given prompt, and a higher score can be obtained by a suitable prompt. 
+The zero-shot results are usually affected by the given prompt, and a higher score can be obtained by a suitable prompt.
 The prompt can be modified in `tasks/evaluation/evaluation.py`
 ```bash
 # Update new prompt by changing the template
 template = {instruction}
 ```

-##### MMLU 
+##### MMLU
 Since MMLU is a multidisciplinary task evaluated with 5-shot prompts, question length varies greatly across subjects. To run all 57 subjects at once, set `TASK="mmlu"`, `--seq-length=2048`, `--max-position-embeddings=2048`, `--max-new-token=2`. (`--max-new-tokens` can be set between 2 and 4).
 On many websites, MMLU accuracy is reported per discipline. The 57 subjects fall into four main categories, so results should also be summarized by category. The [website](https://github.com/hendrycks/test/blob/master/categories.py) lists the category of each of the 57 subjects.


-##### GSM8K 
+##### GSM8K
 GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems created by human problem writers. The answer to each question is a specific number. Because evaluation is few-shot, GSM8K prompts are relatively long and the output contains a chain of thought, so configure `TASK="gsm8k"`, `--seq-length=2048`, `--max-position-embeddings=2048`, `--max-new-token=128`. (`--max-new-tokens` can be set between 256-512).

-##### HumanEval 
-HumanEval dataset is a handcrafted set of 164 programming problems designed to challenge code generation models. The problems include a function signature, docstring, body, and several unit tests, all handwritten to ensure they're not included in the training set of code generation models. 
+##### HumanEval
+HumanEval dataset is a handcrafted set of 164 programming problems designed to challenge code generation models. The problems include a function signature, docstring, body, and several unit tests, all handwritten to ensure they're not included in the training set of code generation models.
 Since the answer of HumanEval dataset contains long codes, it is necessary to configure `TASK="human_eval"`, `--seq-length=2048`, `--max-position-embeddings=2048`, `--max-new-token=1024`.

 ##### AGIEval
-AGIEval is a human-centric benchmark specifically designed to evaluate the general 
-abilities of foundation models in tasks pertinent to human cognition and problem-solving. This benchmark is derived from 20 official, public, and high-standard admission and qualification exams intended for general human test-takers, such as general college admission tests (e.g., Chinese College Entrance Exam (Gaokao) and American SAT), law school admission tests, math competitions, lawyer qualification tests, and national civil service exams.Since the length of answers to different type of questions varies, we have to configure `TASK="agieval"`, `--seq-length=2048`, `--max-position-embeddings=2048`, `--max-new-token=1024` to fit the longest answer. 
+AGIEval is a human-centric benchmark specifically designed to evaluate the general
+abilities of foundation models in tasks pertinent to human cognition and problem-solving. This benchmark is derived from 20 official, public, and high-standard admission and qualification exams intended for general human test-takers, such as general college admission tests (e.g., Chinese College Entrance Exam (Gaokao) and American SAT), law school admission tests, math competitions, lawyer qualification tests, and national civil service exams.Since the length of answers to different type of questions varies, we have to configure `TASK="agieval"`, `--seq-length=2048`, `--max-position-embeddings=2048`, `--max-new-token=1024` to fit the longest answer.

 ##### Big-Bench-Hard
 The Big-Bench-Hard dataset is a subset of BIG-Bench, a diverse evaluation suite, and focuses on 23 challenging BIG-Bench tasks. These are the tasks on which prior language model evaluations did not outperform the average human rater. The dataset covers multiple areas including text understanding, reasoning, logical reasoning, mathematical reasoning, and common sense reasoning.
 Except for word_sorting, all tasks are multiple-choice questions, so we can set `TASK="bbh"`, `--seq-length=2048`, `--max-position-embeddings=2048`, `--max-new-token=32`. (`--max-new-tokens` can be set between 32-64).

 ##### CEval
-As [C-Eval](https://cevalbenchmark.com/) shows, C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13948 multi-choice questions spanning 52 diverse disciplines and four difficulty levels, as shown below. You may explore our dataset examples at Explore, or check our paper for more details. The dataset contains validation and test data, however, only validation data has label for auto-evaluation. If 
-you want to evaluate on test data, you should email your results to [C-Eval](https://cevalbenchmark.com/). 
+As [C-Eval](https://cevalbenchmark.com/) shows, C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13948 multi-choice questions spanning 52 diverse disciplines and four difficulty levels, as shown below. You may explore our dataset examples at Explore, or check our paper for more details. The dataset contains validation and test data, however, only validation data has label for auto-evaluation. If
+you want to evaluate on test data, you should email your results to [C-Eval](https://cevalbenchmark.com/).
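
The per-dataset settings recommended in the sections above can be collected in one place. The following is a minimal sketch, not part of AscendSpeed: the `RECOMMENDED` table and `build_eval_args` helper are hypothetical names, and the flag spelling `--max-new-tokens` is an assumption (the text uses both `--max-new-token` and `--max-new-tokens`). The values themselves are copied from the dataset descriptions.

```python
# Recommended evaluation settings, copied from the dataset descriptions above.
RECOMMENDED = {
    "boolq":      {"seq_length": 512,  "max_position_embeddings": 512,  "max_new_tokens": 2},
    "mmlu":       {"seq_length": 2048, "max_position_embeddings": 2048, "max_new_tokens": 2},
    "gsm8k":      {"seq_length": 2048, "max_position_embeddings": 2048, "max_new_tokens": 128},
    "human_eval": {"seq_length": 2048, "max_position_embeddings": 2048, "max_new_tokens": 1024},
    "agieval":    {"seq_length": 2048, "max_position_embeddings": 2048, "max_new_tokens": 1024},
    "bbh":        {"seq_length": 2048, "max_position_embeddings": 2048, "max_new_tokens": 32},
}


def build_eval_args(task: str) -> list[str]:
    """Return command-line arguments for one task (hypothetical helper)."""
    if task not in RECOMMENDED:
        raise KeyError(f"no recommended settings recorded for task {task!r}")
    cfg = RECOMMENDED[task]
    return [
        "--task", task,
        "--seq-length", str(cfg["seq_length"]),
        "--max-position-embeddings", str(cfg["max_position_embeddings"]),
        "--max-new-tokens", str(cfg["max_new_tokens"]),  # flag spelling is an assumption
    ]


if __name__ == "__main__":
    print(" ".join(build_eval_args("boolq")))
```

Keeping one entry per task reflects the advice to evaluate datasets separately, since their sequence-length and generation-length needs differ.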


 #### Configuration of models and datasets
@ -858,7 +877,7 @@ python convert_weights_from_huggingface.py \
        --output-model-dir /home/w425040/models/llama-7b-tp2-pp4 \
        --type 7B \
        --tensor-model-parallel-size 2 \
-        --pipeline-model-parallel-size 4 
+        --pipeline-model-parallel-size 4
 ```
 Then, configure the dataset path and task. Note: since the evaluation parameters differ between datasets, evaluating two or more datasets together is not recommended. Evaluation parameters such as `--seq-length`, `--max-new-tokens` and `--max-position-embeddings` need to be adjusted for each dataset. The recommended parameters for each dataset are given in the following instructions.

@ -869,7 +888,7 @@ VOCAB_FILE=../models/llama7b-hf/
 # configure task and data path
 DATA_PATH="dataset/boolq/test"
 TASK="boolq"
-# configure generation parameters 
+# configure generation parameters
 ```

 ## Introduction For Acceleration Features
@ -877,7 +896,7 @@ TASK="boolq"
 ---

 ### <span id="jump1"> Tensor Parallelism </span>
-Tensor parallelism (TP) is a kind of model parallelism strategy, which splits execution of a single transformer module over multiple devices. 
+Tensor parallelism (TP) is a kind of model parallelism strategy, which splits execution of a single transformer module over multiple devices.
 The basic principle of TP is:<div align=center>
 <img src="sources/images/tp_in_mlp.png" height="280px" width="500px">
 <img src="sources/images/tp_in_sa.png" height="280px" width="500px"></div>
@ -890,7 +909,7 @@ smaller microbatches. Virtual pipeline (VP) parallelism optimizes PP by add virt
 <img src="sources/images/pp_vp.png" height="350px" width="800px"></div>

 To enable pipeline model parallelism, use the `--pipeline-model-parallel-size` flag to specify the number of stages to split the model into (e.g., splitting a model with 24 transformer layers across 4 stages gives each stage 6 transformer layers).
-To enable virtual pipeline parallelism, additionally use `--num-layers-per-virtual-pipeline-stage` flag to decide number of layers per virtual stage. 
+To enable virtual pipeline parallelism, additionally use the `--num-layers-per-virtual-pipeline-stage` flag to set the number of layers per virtual stage.
 To enable optimized pipeline parallelism (OPP), additionally use the `--optimized-pipeline` and `--manual-mbs example-config-1` flags on top of PP. Note that both VP and OPP reduce bubble time but increase communication time.
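
The layer arithmetic behind these flags can be sketched as follows. This is a minimal illustration, not AscendSpeed code: the helper names are hypothetical, and the rule that each device holds `layers_per_stage / layers_per_virtual_stage` model chunks is assumed from the Megatron-LM convention.

```python
def layers_per_stage(num_layers: int, pp_size: int) -> int:
    """Transformer layers held by each pipeline stage (even split assumed)."""
    if num_layers % pp_size != 0:
        raise ValueError("num_layers must be divisible by the pipeline size")
    return num_layers // pp_size


def virtual_chunks_per_stage(num_layers: int, pp_size: int,
                             layers_per_virtual_stage: int) -> int:
    """Model chunks per device when virtual pipeline parallelism is enabled."""
    per_stage = layers_per_stage(num_layers, pp_size)
    if per_stage % layers_per_virtual_stage != 0:
        raise ValueError("layers per stage must be divisible by layers per virtual stage")
    return per_stage // layers_per_virtual_stage


# The example from the text: 24 layers across 4 stages.
print(layers_per_stage(24, 4))             # 6
# Hypothetical value: 3 layers per virtual stage gives 2 chunks per device.
print(virtual_chunks_per_stage(24, 4, 3))  # 2
```

A configuration where either division leaves a remainder is rejected here; whether AscendSpeed enforces the same constraint is not stated in this README.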

 ### <span id="jump3"> Fold3D </span>
@ -899,8 +918,8 @@ Fold3D hides the commutation time of data parallelism in VP. The basic principle
 To enable Fold3D, use the `--fold-mode "aiao"` flag to choose the strategy.

 ### <span id="jump4"> (Selective) Recomputation </span>
-To reduce NPU memory usage so deploy a large model to a training system, we support activation checkpointing and recomputation. 
-We support two levels of recompute granularity: full and selective. To enable full recomputation, please use `--checkpoint-activations` flag, and to enable selective recomputation, please use `--checkpoint-policy` flag to 
+To reduce NPU memory usage so that a large model can be deployed to a training system, we support activation checkpointing and recomputation.
+We support two levels of recompute granularity: full and selective. To enable full recomputation, use the `--checkpoint-activations` flag; to enable selective recomputation, use the `--checkpoint-policy` flag to
 choose the selective recomputation strategy. Note that the selective strategy is customized and optimized for Ascend chips.

 To improve the performance of model training while maximizing the use of NPU memory, we support [auto selective recomputing](https://gitee.com/ascend/AscendSpeed/wikis/Algorithm%20Introduction/Auto%20Selective%20Recomputing) strategy by manually adjusting the training memory size.
@ -911,16 +930,16 @@ Auto selective recomputing selects a strategy based on the training memory infor
 By default, profiling stops at step 10, and the minimum setting is 5 steps. It is recommended to stop profiling after training memory has stabilized, in order to obtain a better recomputation strategy.

 ### <span id="jump5"> Sequence Parallelism </span>
-Sequence parallelism (SP) is a kind of model parallelism strategy, which splits the sequence axis in dropout and layernorm layers. SP depends on TP in our implementation. 
+Sequence parallelism (SP) is a kind of model parallelism strategy, which splits the sequence axis in dropout and layernorm layers. SP depends on TP in our implementation.
 The allreduce operation in TP is split into reduce-scatter and allgather by SP, which reduces memory usage during model training. The basic principle of SP is:<div align=center>
 <img src="sources/images/sp.png" height="250px" width="800px"></div>
 To enable sequence parallelism (SP), set `--tensor-model-parallel-size` to a value larger than 1 and add the `--sequence-parallel` flag.
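
The claim that an allreduce can be split into a reduce-scatter followed by an allgather can be checked with a small pure-Python simulation. This is only a sketch under simplifying assumptions: ranks are modelled as plain lists, the helper names are hypothetical, and no NPU collective library is involved.

```python
from typing import List


def all_reduce(per_rank: List[List[float]]) -> List[List[float]]:
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(values) for values in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def reduce_scatter(per_rank: List[List[float]]) -> List[List[float]]:
    """Rank r keeps only the r-th shard of the element-wise sum."""
    world = len(per_rank)
    total = [sum(values) for values in zip(*per_rank)]
    shard = len(total) // world  # assumes the length divides evenly
    return [total[r * shard:(r + 1) * shard] for r in range(world)]


def all_gather(shards: List[List[float]]) -> List[List[float]]:
    """Every rank receives the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


# Two simulated ranks, each holding a partial result of length 4.
ranks = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
assert all_gather(reduce_scatter(ranks)) == all_reduce(ranks)
```

Between the two steps each rank holds only its own shard of the summed tensor, which is where the memory saving described above comes from.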

 ### <span id="jump6"> ZeRO-1/2/3 </span>
-Zero Redundancy Optimizer (ZeRO) is a kind of memory-optimization strategy in data parallelism proposed by MicroSoft. 
+Zero Redundancy Optimizer (ZeRO) is a memory-optimization strategy for data parallelism proposed by Microsoft.
 AscendSpeed supports ZeRO-1/2/3 by adding a deepspeed branch. The basic principle of ZeRO is:<div align=center>
 <img src="sources/images/ZeRO.png" height="250px" width="600px"></div>
-To enable ZeRO-1/2/3, a deepspeed config is required and an [example](examples/llama/pretrain_llama_7B_zero_8p.sh) could be referred. 
+To enable ZeRO-1/2/3, a DeepSpeed config is required; see this [example](examples/llama/pretrain_llama_7B_zero_8p.sh) for reference.
 Notably, if only ZeRO-1 is needed, DeepSpeed is not necessary: simply set the `--use-distributed-optimizer` flag.
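
As a rough illustration of what such a DeepSpeed config contains, the sketch below builds a minimal one in Python and prints it as JSON. The keys `train_micro_batch_size_per_gpu`, `fp16.enabled` and `zero_optimization.stage` are standard DeepSpeed config keys; the values and the helper name are assumptions, and the config in the linked example script may differ.

```python
import json


def make_zero_config(stage: int, micro_batch_size: int = 1) -> dict:
    """Build a minimal DeepSpeed config dict for the given ZeRO stage."""
    if stage not in (1, 2, 3):
        raise ValueError("ZeRO stage must be 1, 2 or 3")
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,  # illustrative value
        "fp16": {"enabled": True},                           # illustrative value
        "zero_optimization": {"stage": stage},
    }


print(json.dumps(make_zero_config(2), indent=2))
```

The printed JSON can be saved to a file and passed to the training script in whatever way the example script expects.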

 ### <span id="jump7"> Inverted Triangle Acceleration </span>
@ -940,7 +959,7 @@ For llama and other LLMs without bias in FFN, the linear transformation in FFN c
 ### <span id="jump10"> Memory Overcommitment </span>
 In mixed-precision training, multiple state tensors, such as parameter copies, gradient copies, and optimizer states, occupy a large amount of static memory (16N, where N is the number of parameters). However, the parameters and gradients that actually participate in forward and backward computation (4N) account for only a small proportion, so optimizing the remaining state tensors can yield large device memory savings. By analyzing how each part of the state tensors is actually used, we derive a memory reuse mechanism and, from it, a multilevel optimizer memory optimization scheme that integrates multiple algorithm modules.
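
To get a feel for these sizes, the sketch below converts the 16N and 4N figures into GiB. It assumes the figures are byte counts per parameter, which the text does not state explicitly, and the 7e9 parameter count is only an illustrative value for a 7B model.

```python
def static_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory in GiB for a state that costs `bytes_per_param` bytes per parameter."""
    return num_params * bytes_per_param / 2**30


n = 7e9  # illustrative parameter count for a 7B model
all_states = static_memory_gib(n, 16)   # 16N: all state tensors
compute_part = static_memory_gib(n, 4)  # 4N: parameters and gradients used in compute

print(f"all state tensors:  {all_states:.1f} GiB")
print(f"used in fwd/bwd:    {compute_part:.1f} GiB")
print(f"share used in compute: {compute_part / all_states:.0%}")
```

Under this reading, three quarters of the static memory belongs to states that are not touched during forward and backward computation, which is the part the optimization levels below target.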

- Memory Overcommitment O1 ——  **Relase FP32 Gradient** 
+- Memory Overcommitment O1 ——  **Release FP32 Gradient**
    - Advantages: Completely equivalent; Support for multiple optimizers; lossless performance.
    - Algorithm principle: The static memory that would otherwise permanently hold the FP32 gradient copy is reused. When needed, the FP16 gradient is converted to FP32 with a Foreach+Cast operation, saving 4N of space.
    - Usage: This equivalent algorithm is applicable to all optimizers and can be triggered by specifying  `--release-fp32-grad` in the script.
@ -992,5 +1011,5 @@ We appreciate every PR from community, and welcome to contribute to AscendSpeed.
 - Inner Function Description: Descriptions of some inner implementation interfaces are available at [InnerInterface](https://gitee.com/ascend/AscendSpeed/wikis/Inner%20API%20Description/Some%20inner%20interface%20description?sort_id=8824096)
 - Parameters Description: Parameter descriptions and usage are available at [param](https://gitee.com/ascend/AscendSpeed/wikis/Home).
 - Permission Description: It is recommended that the Linux umask value be greater than or equal to 027. Before running the program, you are advised to take security measures such as permission control for files required for training, such as ckpt, logs and so on. You are advised to run the program or execute commands as a regular user, not as root or a super user. Also, you are advised to set the folder permission to 750 and the file permission to 640. When multiple users share datasets, set the read and write permissions for folders and files based on the minimum permissions to avoid security problems such as unauthorized access.
- Path Description: When you're using interface such as `torch.load`, unless weights_only parameter is set to True, uses pickle module implicitly, which is known to be insecure. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. We don't suggest you load data that could have come from an untrusted source in an unsafe mode, or that could have been tampered with. Please load data you trust. Moreover, when you need to read data from outside or your specified path you'd better make it trusted and safe, including but not limited to weights path, dataset path. 
+- Path Description: Interfaces such as `torch.load` implicitly use the pickle module unless the `weights_only` parameter is set to True, and pickle is known to be insecure. It is possible to construct malicious pickle data that executes arbitrary code during unpickling. Do not load data in unsafe mode if it could have come from an untrusted source or could have been tampered with; load only data you trust. Moreover, when reading data from an external or user-specified path, including but not limited to weights paths and dataset paths, make sure the path is trusted and safe.
 - Communication Matrix: Please refer to this [link](https://gitee.com/ascend/AscendSpeed/wikis/Communition%20Matrix/AscendSpeed%20Communication%20Matrix?sort_id=8855822) to check the communication matrix.
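
The permission advice above (umask 027, folders 750, files 640) can be applied from Python as in the sketch below. It uses only the standard library on a POSIX system; the folder and file names are hypothetical placeholders, not paths used by AscendSpeed.

```python
import os
import stat
import tempfile


def make_private_dir(path: str) -> None:
    """Create a directory accessible to owner and group only (mode 750)."""
    os.makedirs(path, exist_ok=True)
    os.chmod(path, 0o750)  # explicit chmod, independent of the current umask


def write_private_file(path: str, data: bytes) -> None:
    """Write a file with owner read/write and group read only (mode 640)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o640)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    os.chmod(path, 0o640)


def mode_of(path: str) -> int:
    """Return the permission bits of a path."""
    return stat.S_IMODE(os.stat(path).st_mode)


os.umask(0o027)  # new files lose group write and all access for others

workdir = os.path.join(tempfile.mkdtemp(), "ckpt")  # hypothetical checkpoint folder
make_private_dir(workdir)
log_path = os.path.join(workdir, "train.log")       # hypothetical log file
write_private_file(log_path, b"step 1\n")
print(oct(mode_of(workdir)), oct(mode_of(log_path)))
```

For checkpoints, the Path Description item additionally suggests passing `weights_only=True` to `torch.load`, so that loading does not fall back to unrestricted unpickling.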
--- a/examples/llama2/README.md
+++ b/examples/llama2/README.md
@ -7,7 +7,7 @@

 # Contents

- [LLaMA2-7B](#LLaMA2-7B)
+- [LLAMA2-7B](#LLAMA2-7B)
  - [Training](#训练)
  - [Script](#脚本)
  - [Performance](#性能)
@ -35,26 +35,26 @@
  - [Evaluation](#评估)


-# LLaMA2-7B
+# LLAMA2-7B

 ## Training

-Hardware configuration for LLaMA2-7B training:
+Hardware configuration for LLAMA2-7B training:

 | Hardware |  Configuration  |
 | :--: | :-------------: |
 | NPU | 8 x Ascend NPUs |

-Software configuration for LLaMA2-7B training:
+Software configuration for LLAMA2-7B training:

 |         Software          |                                              Configuration                                               |
 | :-----------------------: | :-------------------------------------------------------------------------------------------------------: |
-|          python          |                                                  3.7.16                                                  |
+|          python          |                                                  3.8.18                                                  |
 |          driver          | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
 |         firmware         | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
 |           CANN           |       [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software)       |
 | binary arithmetic package |       [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software)       |
-|           torch           |                                                  1.11.0                                                  |
+|           torch           |                                                  2.1.0                                                  |
 |         torch_npu         |                             [package](https://gitee.com/ascend/pytorch/releases)                             |

 ### Script
@ -70,14 +70,14 @@ LLaMA2-7B 训练的软件配置:
 2. Build the environment

   ```bash
-   # python3.7
-   conda create -n test python=3.7
+   # python3.8
+   conda create -n test python=3.8
   conda activate test
   
   # install torch and torch_npu
-   pip install torch-1.11.0-cp37-cp37m-manylinux2014_aarch64.whl
-   pip install torch_npu-1.11.0*-cp37-cp37m-linux_aarch64.whl
-   pip install apex-0.1_ascend*-cp37-cp37m-linux_aarch64.whl
+   pip install torch-2.1.0-cp38-cp38-manylinux2014_aarch64.whl
+   pip install torch_npu-2.1.0*-cp38-cp38-linux_aarch64.whl
+   pip install apex-0.1_ascend*-cp38-cp38-linux_aarch64.whl
   
   # install megatron-core
   pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
@ -92,7 +92,7 @@ LLaMA2-7B 训练的软件配置:
   # install other packages
   pip install -r requirements.txt 
   ```
-3. Download the LLaMA2-7B [pre-trained weights and vocabulary](https://huggingface.co/daryl149/llama-2-7b-hf/tree/main)
+3. Download the LLAMA2-7B [pre-trained weights and vocabulary](https://huggingface.co/daryl149/llama-2-7b-hf/tree/main)

   ```shell
     #!/bin/bash
@ -110,38 +110,7 @@ LLaMA2-7B 训练的软件配置:
     cd ..
   ```

-   ```text
-   # Note: to load huggingface pre-trained weights, you need to fix a deepspeed bug in weight loading:
-   # in the `_load_zero_checkpoint` function of `<deepspeed-installed-path>/runtime/engine.py`,
-   # change `if zero_sd_list is None` to `if zero_sd_list is None or len(zero_sd_list) == 0`
-   
-   # original deepspeed/runtime/engine.py, around #Lines2746-2748
-   zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
-   if zero_sd_list is None:
-       return False
-   
-   # after the change
-   zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
-   if zero_sd_list is None or len(zero_sd_list) == 0:
-       return False
-   ```
-
-   3.1 Convert weights from huggingface format to AscendSpeed format: deepspeed mode
-
-   ```bash
-   # modify the ascend-toolkit path
-   source /usr/local/Ascend/ascend-toolkit/set_env.sh
-   
-   # convert the weight format
-   python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-model-dir llama-2-7b-hf \
-                                                                       --output-model-dir ckpt \
-                                                                       --tensor-model-parallel-size 1 \
-                                                                       --pipeline-model-parallel-size 1 \
-                                                                       --type 7B \
-                                                                       --deepspeed
-   ```
-
-   3.2 Convert weights from huggingface format to AscendSpeed format: PTD mode
+   Convert weights from huggingface format to AscendSpeed format: PTD mode

   ```bash
    # modify the ascend-toolkit path
@ -178,26 +147,7 @@ LLaMA2-7B 训练的软件配置:
       --log-interval 1000 \
       --tokenizer-type PretrainedFromHF
   ```
-
-   4.2 Pre-training in deepspeed mode
-   Configure the LLaMA2-7B pre-training script: examples/llama2/pretrain_llama2_7b_zero_8p.sh
-
-   ```shell
-    # set the ascend-toolkit path
-    source /usr/local/Ascend/ascend-toolkit/set_env.sh 
-
-    # configure the vocabulary and dataset paths
-    TOKENIZER_PATH=./llama-2-7b-hf/  # vocabulary path
-    DATA_PATH=./dataset_llama2/alpaca_text_document  # dataset path
-   ```
-
-   Launch the LLaMA2-7B pre-training script: examples/llama2/pretrain_llama2_7b_zero_8p.sh
-
-   ```shell
-    bash examples/llama2/pretrain_llama2_7b_zero_8p.sh 
-   ```
-
-   4.3 Pre-training in ptd mode
+   4.2 Pre-training in ptd mode
   Configure the LLaMA2-7B PTD pre-training script: examples/llama2/pretrain_llama2_7b_ptd.sh

   ```shell
@ -240,17 +190,17 @@ LLaMA2-7B 训练的软件配置:
     --append-eod
   ```

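-   5.2 Fine-tuning in deepspeed mode
-   
-   5.2.1 Full-parameter fine-tuning
-   The configuration script for full-parameter fine-tuning is basically the same as the pre-training script pretrain_llama2_7b_zero_8p.sh. *The only difference is the dataset.*
+   5.2 Full-parameter fine-tuning
+   The configuration script for full-parameter fine-tuning is basically the same as the pre-training script pretrain_llama2_7b_ptd.sh. *The differences are the dataset and the added training parameter --is-instruction-dataset.*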

   ```bash
   DATA_PATH=./finetune_dataset/alpaca
+   
+   --is-instruction-dataset \
   ```

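-   5.2.2 Lora fine-tuning
-   The Lora fine-tuning script is configured by adding the following lora parameters to the pre-training script pretrain_llama2_7b_zero_8p.sh:
+   5.3 Lora fine-tuning
+   The Lora fine-tuning script is configured by adding the following lora parameters to the pre-training script pretrain_llama2_7b_ptd.sh: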

   ```bash
       --lora-target-modules query_key_value dense gate_proj up_proj down_proj \
@ -271,8 +221,6 @@ LLaMA2-7B 训练的软件配置:
       --lora-load ${LORA_CHECKPOINT} \   # lora parameter checkpoint
   ```

-   5.3 Fine-tuning in PTD mode
-   *Fine-tuning in PTD mode works exactly the same as in deepspeed mode. See the previous subsection for details.*

 ### Performance

@ -403,8 +351,10 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS tasks/evaluation/evaluation
     --num-attention-heads 32  \
     --mlp-layer-fusion \
     --load ${CHECKPOINT}  \
+     --position-embedding-type rope \
+     --normalization RMSNorm \
     --tokenizer-type PretrainedFromHF  \
-     --tokenizer-name-or-path $VOCAB_FILE \
+     --tokenizer-name-or-path ${TOKENIZER_PATH} \
     --tokenizer-not-use-fast \
     --fp16  \
     --micro-batch-size 1  \
@ -418,67 +368,68 @@ bash tasks/evaluation/eval.sh
 ```
 The evaluation results are as follows:
 ```text
-                                subject  question_n       acc
-0            high_school_macroeconomics         390  0.466667
-1                          formal_logic         126  0.253968
-2                     international_law         121  0.652893
-3                   college_mathematics         100  0.330000
-4                      college_medicine         173  0.421965
-5                       world_religions         171  0.725146
-6                       moral_scenarios         895  0.220112
-7                             nutrition         306  0.513072
-8                high_school_statistics         216  0.361111
-9                      medical_genetics         100  0.490000
-10                    college_chemistry         100  0.300000
-11              professional_accounting         282  0.361702
-12                     professional_law        1534  0.338331
-13                        miscellaneous         783  0.698595
-14                            sociology         201  0.651741
-15                professional_medicine         272  0.496324
-16                    logical_fallacies         163  0.552147
-17                     public_relations         110  0.563636
-18                      college_biology         144  0.506944
-19         high_school_european_history         165  0.612121
-20                           philosophy         311  0.556270
-21                     abstract_algebra         100  0.310000
-22               high_school_psychology         545  0.678899
-23         high_school_computer_science         100  0.400000
-24               elementary_mathematics         378  0.312169
-25               high_school_us_history         204  0.617647
-26                     machine_learning         112  0.366071
-27                            astronomy         152  0.493421
-28                         global_facts         100  0.330000
-29              high_school_mathematics         270  0.255556
-30               electrical_engineering         145  0.496552
-31           high_school_microeconomics         238  0.415966
-32                      business_ethics         100  0.540000
-33             college_computer_science         100  0.400000
-34                  high_school_physics         151  0.317881
-35                      human_sexuality         131  0.526718
-36                      college_physics         102  0.245098
-37  high_school_government_and_politics         193  0.720207
-38                            marketing         234  0.747863
-39                high_school_geography         198  0.601010
-40                     security_studies         245  0.555102
-41                high_school_chemistry         203  0.418719
-42                           management         103  0.699029
-43                        jurisprudence         108  0.537037
-44                         econometrics         114  0.350877
-45                          human_aging         223  0.591928
-46                             virology         166  0.403614
-47                       moral_disputes         346  0.528902
-48                              anatomy         135  0.451852
-49              professional_psychology         612  0.498366
-50                   conceptual_physics         235  0.455319
-51                    computer_security         100  0.560000
-52                   clinical_knowledge         265  0.505660
-53                    us_foreign_policy         100  0.680000
-54                           prehistory         324  0.570988
-55            high_school_world_history         237  0.645570
-56                  high_school_biology         310  0.535484
-57                                total       14042  0.478422
-MMLU Running Time:  18266.85981464386
+                                subject  question_n   ref_acc   npu_acc      acc_diff
+17                     public_relations         110  0.563636  0.554545      0.009091
+44                         econometrics         114  0.368421  0.377193      0.008772
+30               electrical_engineering         145  0.503448  0.510345      0.006897
+5                       world_religions         171  0.701754  0.707602      0.005848
+25               high_school_us_history         204  0.647059  0.651961      0.004902
+45                          human_aging         223  0.596413  0.600897      0.004484
+38                            marketing         234  0.709402  0.713675      0.004274
+55            high_school_world_history         237  0.620253  0.624473      0.004219
+31           high_school_microeconomics         238  0.420168  0.424370      0.004202
+7                             nutrition         306  0.503268  0.500000      0.003268
+56                  high_school_biology         310  0.541935  0.545161      0.003226
+20                           philosophy         311  0.569132  0.565916      0.003215
+24               elementary_mathematics         378  0.291005  0.293651      0.002646
+22               high_school_psychology         545  0.645872  0.647706      0.001835
+12                     professional_law        1534  0.339635  0.340939      0.001304
+13                        miscellaneous         783  0.679438  0.678161      0.001277
+6                       moral_scenarios         895  0.221229  0.222346      0.001117
+37  high_school_government_and_politics         193  0.694301  0.694301      0.000000
+54                           prehistory         324  0.555556  0.555556      0.000000
+53                    us_foreign_policy         100  0.700000  0.700000      0.000000
+39                high_school_geography         198  0.626263  0.626263      0.000000
+40                     security_studies         245  0.522449  0.522449      0.000000
+41                high_school_chemistry         203  0.408867  0.408867      0.000000
+52                   clinical_knowledge         265  0.513208  0.513208      0.000000
+49              professional_psychology         612  0.482026  0.482026      0.000000
+42                           management         103  0.679612  0.679612      0.000000
+43                        jurisprudence         108  0.583333  0.583333      0.000000
+51                    computer_security         100  0.560000  0.560000      0.000000
+50                   conceptual_physics         235  0.417021  0.417021      0.000000
+35                      human_sexuality         131  0.526718  0.526718      0.000000
+46                             virology         166  0.439759  0.439759      0.000000
+47                       moral_disputes         346  0.514451  0.514451      0.000000
+48                              anatomy         135  0.459259  0.459259      0.000000
+36                      college_physics         102  0.215686  0.215686      0.000000
+0            high_school_macroeconomics         390  0.420513  0.420513      0.000000
+34                  high_school_physics         151  0.311258  0.311258      0.000000
+33             college_computer_science         100  0.420000  0.420000      0.000000
+2                     international_law         121  0.636364  0.636364      0.000000
+3                   college_mathematics         100  0.330000  0.330000      0.000000
+4                      college_medicine         173  0.410405  0.410405      0.000000
+8                high_school_statistics         216  0.314815  0.314815      0.000000
+9                      medical_genetics         100  0.450000  0.450000      0.000000
+10                    college_chemistry         100  0.290000  0.290000      0.000000
+11              professional_accounting         282  0.411348  0.411348      0.000000
+14                            sociology         201  0.601990  0.601990      0.000000
+15                professional_medicine         272  0.452206  0.452206      0.000000
+16                    logical_fallacies         163  0.521472  0.521472      0.000000
+18                      college_biology         144  0.506944  0.506944      0.000000
+19         high_school_european_history         165  0.575758  0.575758      0.000000
+21                     abstract_algebra         100  0.280000  0.280000      0.000000
+23         high_school_computer_science         100  0.430000  0.430000      0.000000
+26                     machine_learning         112  0.375000  0.375000      0.000000
+27                            astronomy         152  0.500000  0.500000      0.000000
+1                          formal_logic         126  0.222222  0.222222      0.000000
+29              high_school_mathematics         270  0.259259  0.259259      0.000000
+32                      business_ethics         100  0.450000  0.450000      0.000000
+28                         global_facts         100  0.380000  0.380000      0.000000
 ```
+| Dataset | Subjects | Questions | Reference accuracy | NPU accuracy |
+|:---:|:---:|:---:|:---:|:---:|
+| MMLU | 57| 14042 |0.4691|0.4698|
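
One plausible way to aggregate per-subject results into a single dataset-level accuracy is a question-weighted average, sketched below. Whether the summary row above was computed this way is an assumption; the source does not say. The three sample rows are the NPU accuracies of the first three subjects in the table, and the helper name is hypothetical.

```python
from typing import Iterable, Tuple


def weighted_accuracy(rows: Iterable[Tuple[int, float]]) -> float:
    """Question-weighted mean accuracy over (question_count, accuracy) rows."""
    rows = list(rows)
    total_questions = sum(count for count, _ in rows)
    if total_questions == 0:
        raise ValueError("no questions to aggregate")
    correct = sum(count * acc for count, acc in rows)
    return correct / total_questions


# (question count, NPU accuracy) for public_relations, econometrics,
# and electrical_engineering, copied from the table above.
sample = [(110, 0.554545), (114, 0.377193), (145, 0.510345)]
print(f"{weighted_accuracy(sample):.4f}")
```

Feeding all 57 subject rows into the same function gives a dataset-level figure that can be compared against the summary row.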


 # LLaMA2-13B
--- a/examples/llama2/README_en.md
+++ b/examples/llama2/README_en.md
@ -33,27 +33,27 @@
  - [Inference](#inference-70b)
  - [Evaluation](#Evaluation-70b)

-# LLaMA2-7B
+# LLAMA2-7B

 ## Training

-Here's a hardware summary of pre-training  LLaMA2-7B:
+Here's a hardware summary of pre-training  LLAMA2-7B:

 | Hardware |                      Value                      |
 | :------: | :---------------------------------------------: |
 |   NPU    |               8 x Ascend NPUs                   |

-Here's a software summary of pre-training  LLaMA2-7B: 
+Here's a software summary of pre-training  LLAMA2-7B: 


 |         Software          |   Version   |
 | :-----------------------: |:-----------:|
-|          Python           |   3.7.16    |
+|          Python           |   3.8.18    |
 |          driver           | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
 |         firmware          | [package](https://support.huawei.com/enterprise/zh/ascend-computing/atlas-900-pod-a2-pid-254184911/software) |
 |           CANN            |       [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software)       |
 | binary arithmetic package |       [package](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software)       |
-|           torch           |                                                    1.11.0                                                    |
+|           torch           |                                                    2.1.0                                                    |
 |         torch_npu         |                             [package](https://gitee.com/ascend/pytorch/releases)                             |

 ### Script
@ -69,14 +69,14 @@ Here's a software summary of pre-training  LLaMA2-7B:
 2. Build environment
   
    ```bash
-    # python3.7
-    conda create -n test python=3.7
+    # python3.8
+    conda create -n test python=3.8
    conda activate test
    
    # install torch and torch_npu
-    pip install torch-1.11.0-cp37-cp37m-manylinux2014_aarch64.whl
-    pip install torch_npu-1.11.0*-cp37-cp37m-linux_aarch64.whl
-    pip install apex-0.1_ascend*-cp37-cp37m-linux_aarch64.whl
+    pip install torch-2.1.0-cp38-cp38-manylinux2014_aarch64.whl
+    pip install torch_npu-2.1.0*-cp38-cp38-linux_aarch64.whl
+    pip install apex-0.1_ascend*-cp38-cp38-linux_aarch64.whl
    
    # install megatron-core
    pip3 install --no-use-pep517 -e git+https://github.com/NVIDIA/Megatron-LM.git@23.05#egg=megatron-core
@ -122,22 +122,8 @@ Here's a software summary of pre-training  LLaMA2-7B:
      wget https://huggingface.co/daryl149/llama-2-7b-hf/resolve/main/tokenizer_config.json
      cd ..
    ```
-    
-   3.1 weight conversion in deepspeed mode
-   *Note that if you want to use the weight from huggingface, please run the weight conversion script first. The following uses llama-2-7b model  weight conversion in deepspeed as an example.*
-    ```bash
-    # modify the script according to your own ascend-toolkit path
-    source /usr/local/Ascend/ascend-toolkit/set_env.sh
-    
-    # convert to deepspeed weights
-    python tools/ckpt_convert/llama/convert_weights_from_huggingface.py --input-model-dir llama-2-7b-hf \
-                                                                        --output-model-dir ckpt \
-                                                                        --tensor-model-parallel-size 1 \
-                                                                        --pipeline-model-parallel-size 1 \
-                                                                        --type 7B \
-                                                                        --deepspeed
-    ```
-   3.2 weight conversion in ptd mode
+
+   Weight conversion in PTD mode
   *Note that if you want to use the weight from huggingface, please run the weight conversion script first. The following uses llama-2-7b model weight conversion in ptd as an example.*
   ```bash
    # modify the script according to your own ascend-toolkit path
@ -173,23 +159,7 @@ Here's a software summary of pre-training  LLaMA2-7B:
 		 --tokenizer-type PretrainedFromHF
 	```

-	4.2 pre-training using deepspeed mode
-	Config LLAMA2-7B pre-training script: examples/llama2/pretrain_llama2_7b_zero_8p.sh 
-	```shell
-	# modify the script according to your own ascend-toolkit path
-	source /usr/local/Ascend/ascend-toolkit/set_env.sh 
-	
-	# modify script orign dataset path according to your own dataset path
-	TOKENIZER_PATH=./llama-2-7b-hf/  #tokenizer path
-	DATA_PATH=./dataset_llama2/alpaca_text_document  #processed dataset
-	```
-
-	Launch LLAMA2-7B  pre-training script: examples/llama2/pretrain_llama2_7b_zero_8p.sh
-	```shell
-	bash examples/llama2/pretrain_llama2_7b_zero_8p.sh 
-	```
-	
-	4.3 pre-training using ptd mode
+	4.2 pre-training using ptd mode
 	Config LLAMA2-7B pre-training script: examples/llama2/pretrain_llama2_7b_ptd.sh 
   ```shell
    # modify the script according to your own ascend-toolkit path
@ -229,14 +199,15 @@ Here's a software summary of pre-training  LLaMA2-7B:
 		  --handler-name GeneralInstructionHandler \
 		  --append-eod
    ```
-   5.2 fine-tuning using deepspeed mode
-   5.2.1 Full Parameters Fine-Tuning
-   The configuration script for full parameters fine-tuning  is basically the same as that for pretrain_llama2_7b_zero_8p.sh.*The only difference is the data set.*
+   5.2 Full Parameters Fine-Tuning
+   The configuration script for full-parameter fine-tuning is basically the same as pretrain_llama2_7b_ptd.sh. *The differences are the dataset and the added training parameter `--is-instruction-dataset`.*
   ```bash
   DATA_PATH=./finetune_dataset/alpaca
+   
+   --is-instruction-dataset \
   ```
-   5.2.2 Lora Fine-Tuning
-   The Lora fine-tuning script is configured by adding the following lora parameters to the pretrain_llama2_7b_zero_8p.sh script:
+   5.3 Lora Fine-Tuning
+   The Lora fine-tuning script is configured by adding the following lora parameters to the pretrain_llama2_7b_ptd.sh script:
   ```bash
       --lora-target-modules query_key_value dense gate_proj up_proj down_proj \
       --lora-r 16 \
@ -251,10 +222,6 @@ Here's a software summary of pre-training  LLaMA2-7B:
       --load ${ORIGIN_CHECKPOINT}  \
       --lora-load ${LORA_CHECKPOINT} \
   ```
-   
-   
-   5.3 fine-tuning using ptd mode
-   *The modification method is the same as that in deepspeed mode. For details, see the previous section.*

 ### Performance

@ -374,7 +341,6 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS tasks/evaluation/evaluation
     --seq-length 4096 \
     --max-new-tokens 1 \
     --max-position-embeddings 4096 \
-     --rotary-v3-impl \
     --tensor-model-parallel-size 8 \
     --pipeline-model-parallel-size 1  \
     --num-layers 32  \
@ -383,8 +349,10 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS tasks/evaluation/evaluation
     --num-attention-heads 32  \
     --mlp-layer-fusion \
     --load ${CHECKPOINT}  \
+     --position-embedding-type rope \
+     --normalization RMSNorm \
     --tokenizer-type PretrainedFromHF  \
-     --tokenizer-name-or-path $VOCAB_FILE \
+     --tokenizer-name-or-path ${TOKENIZER_PATH} \
     --tokenizer-not-use-fast \
     --fp16  \
     --micro-batch-size 1  \
@ -397,67 +365,68 @@ bash tasks/evaluation/eval.sh

 Evaluation results
 ```text
-                                subject  question_n       acc
-0            high_school_macroeconomics         390  0.466667
-1                          formal_logic         126  0.253968
-2                     international_law         121  0.652893
-3                   college_mathematics         100  0.330000
-4                      college_medicine         173  0.421965
-5                       world_religions         171  0.725146
-6                       moral_scenarios         895  0.220112
-7                             nutrition         306  0.513072
-8                high_school_statistics         216  0.361111
-9                      medical_genetics         100  0.490000
-10                    college_chemistry         100  0.300000
-11              professional_accounting         282  0.361702
-12                     professional_law        1534  0.338331
-13                        miscellaneous         783  0.698595
-14                            sociology         201  0.651741
-15                professional_medicine         272  0.496324
-16                    logical_fallacies         163  0.552147
-17                     public_relations         110  0.563636
-18                      college_biology         144  0.506944
-19         high_school_european_history         165  0.612121
-20                           philosophy         311  0.556270
-21                     abstract_algebra         100  0.310000
-22               high_school_psychology         545  0.678899
-23         high_school_computer_science         100  0.400000
-24               elementary_mathematics         378  0.312169
-25               high_school_us_history         204  0.617647
-26                     machine_learning         112  0.366071
-27                            astronomy         152  0.493421
-28                         global_facts         100  0.330000
-29              high_school_mathematics         270  0.255556
-30               electrical_engineering         145  0.496552
-31           high_school_microeconomics         238  0.415966
-32                      business_ethics         100  0.540000
-33             college_computer_science         100  0.400000
-34                  high_school_physics         151  0.317881
-35                      human_sexuality         131  0.526718
-36                      college_physics         102  0.245098
-37  high_school_government_and_politics         193  0.720207
-38                            marketing         234  0.747863
-39                high_school_geography         198  0.601010
-40                     security_studies         245  0.555102
-41                high_school_chemistry         203  0.418719
-42                           management         103  0.699029
-43                        jurisprudence         108  0.537037
-44                         econometrics         114  0.350877
-45                          human_aging         223  0.591928
-46                             virology         166  0.403614
-47                       moral_disputes         346  0.528902
-48                              anatomy         135  0.451852
-49              professional_psychology         612  0.498366
-50                   conceptual_physics         235  0.455319
-51                    computer_security         100  0.560000
-52                   clinical_knowledge         265  0.505660
-53                    us_foreign_policy         100  0.680000
-54                           prehistory         324  0.570988
-55            high_school_world_history         237  0.645570
-56                  high_school_biology         310  0.535484
-57                                total       14042  0.478422
-MMLU Running Time:  18266.85981464386
+                           subject_name  question_n   acc_ref   acc_npu  score_diff
+17                     public_relations         110  0.563636  0.554545      0.009091
+44                         econometrics         114  0.368421  0.377193      0.008772
+30               electrical_engineering         145  0.503448  0.510345      0.006897
+5                       world_religions         171  0.701754  0.707602      0.005848
+25               high_school_us_history         204  0.647059  0.651961      0.004902
+45                          human_aging         223  0.596413  0.600897      0.004484
+38                            marketing         234  0.709402  0.713675      0.004274
+55            high_school_world_history         237  0.620253  0.624473      0.004219
+31           high_school_microeconomics         238  0.420168  0.424370      0.004202
+7                             nutrition         306  0.503268  0.500000      0.003268
+56                  high_school_biology         310  0.541935  0.545161      0.003226
+20                           philosophy         311  0.569132  0.565916      0.003215
+24               elementary_mathematics         378  0.291005  0.293651      0.002646
+22               high_school_psychology         545  0.645872  0.647706      0.001835
+12                     professional_law        1534  0.339635  0.340939      0.001304
+13                        miscellaneous         783  0.679438  0.678161      0.001277
+6                       moral_scenarios         895  0.221229  0.222346      0.001117
+37  high_school_government_and_politics         193  0.694301  0.694301      0.000000
+54                           prehistory         324  0.555556  0.555556      0.000000
+53                    us_foreign_policy         100  0.700000  0.700000      0.000000
+39                high_school_geography         198  0.626263  0.626263      0.000000
+40                     security_studies         245  0.522449  0.522449      0.000000
+41                high_school_chemistry         203  0.408867  0.408867      0.000000
+52                   clinical_knowledge         265  0.513208  0.513208      0.000000
+49              professional_psychology         612  0.482026  0.482026      0.000000
+42                           management         103  0.679612  0.679612      0.000000
+43                        jurisprudence         108  0.583333  0.583333      0.000000
+51                    computer_security         100  0.560000  0.560000      0.000000
+50                   conceptual_physics         235  0.417021  0.417021      0.000000
+35                      human_sexuality         131  0.526718  0.526718      0.000000
+46                             virology         166  0.439759  0.439759      0.000000
+47                       moral_disputes         346  0.514451  0.514451      0.000000
+48                              anatomy         135  0.459259  0.459259      0.000000
+36                      college_physics         102  0.215686  0.215686      0.000000
+0            high_school_macroeconomics         390  0.420513  0.420513      0.000000
+34                  high_school_physics         151  0.311258  0.311258      0.000000
+33             college_computer_science         100  0.420000  0.420000      0.000000
+2                     international_law         121  0.636364  0.636364      0.000000
+3                   college_mathematics         100  0.330000  0.330000      0.000000
+4                      college_medicine         173  0.410405  0.410405      0.000000
+8                high_school_statistics         216  0.314815  0.314815      0.000000
+9                      medical_genetics         100  0.450000  0.450000      0.000000
+10                    college_chemistry         100  0.290000  0.290000      0.000000
+11              professional_accounting         282  0.411348  0.411348      0.000000
+14                            sociology         201  0.601990  0.601990      0.000000
+15                professional_medicine         272  0.452206  0.452206      0.000000
+16                    logical_fallacies         163  0.521472  0.521472      0.000000
+18                      college_biology         144  0.506944  0.506944      0.000000
+19         high_school_european_history         165  0.575758  0.575758      0.000000
+21                     abstract_algebra         100  0.280000  0.280000      0.000000
+23         high_school_computer_science         100  0.430000  0.430000      0.000000
+26                     machine_learning         112  0.375000  0.375000      0.000000
+27                            astronomy         152  0.500000  0.500000      0.000000
+1                          formal_logic         126  0.222222  0.222222      0.000000
+29              high_school_mathematics         270  0.259259  0.259259      0.000000
+32                      business_ethics         100  0.450000  0.450000      0.000000
+28                         global_facts         100  0.380000  0.380000      0.000000
 ```
+|  dataset | subject_num | question_num | reference_acc |NPU acc|
+|:---:|:-----------:|:------------:|:-------------:|:---:|
+| MMLU |     57      |    14042     |    0.4691     |0.4698|

 # LLaMA2-13B

--- a/examples/llama2/generate_llama2_7b_ptd.sh
+++ b/examples/llama2/generate_llama2_7b_ptd.sh
@ -38,6 +38,7 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/inference/inference
       --seq-length 4096 \
       --max-new-tokens 256 \
       --use-flash-attn \
+       --use-fused-rmsnorm \
       --seed 42 \
       --position-embedding-type rope \
       --normalization RMSNorm \
--- a/examples/llama2/pretrain_llama2_7b_ptd.sh
+++ b/examples/llama2/pretrain_llama2_7b_ptd.sh
@ -69,6 +69,7 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS \
       --lr-warmup-fraction 0.01 \
       --mlp-layer-fusion \
       --use-flash-attn \
+       --use-fused-rmsnorm \
       --bf16 | tee ./logs/ascendspeed-llama2-7b-ptd-bf16_train-${logfile}.log
 	   
 # Training script for ptd mode
--- a/examples/llama2/tune_llama2_7b_ptd.sh
+++ b/examples/llama2/tune_llama2_7b_ptd.sh
@ -0,0 +1,80 @@
+# This is an example: fine-tuning LLaMA2-7B with LoRA in PTD mode
+
+# The number of parameters is not aligned
+source /usr/local/Ascend/ascend-toolkit/set_env.sh
+export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
+export HCCL_CONNECT_TIMEOUT=1200
+export COMBINED_ENABLE=1
+
+# Modify these settings to match your environment
+LOAD_CHECKPOINT_PATH="your init model load path"
+SAVE_CHECKPOINT_PATH="your model ckpt save path"
+TOKENIZER_PATH=./llama-2-7b-hf/  # tokenizer path
+DATA_PATH=./finetune_dataset/alpaca  # processed dataset
+
+# Change these for a multi-node configuration
+MASTER_ADDR=localhost
+MASTER_PORT=6001
+NNODES=1
+NODE_RANK=0
+NPUS_PER_NODE=8
+WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
+
+DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
+logfile=$(date +%Y%m%d)_$(date +%H%M%S)
+
+python -m torch.distributed.launch $DISTRIBUTED_ARGS \
+       pretrain_llama.py \
+       --DDP-impl local \
+       --tensor-model-parallel-size 8 \
+       --pipeline-model-parallel-size 1 \
+       --sequence-parallel \
+       --num-layers 32 \
+       --hidden-size 4096 \
+       --position-embedding-type rope \
+       --normalization RMSNorm \
+       --ffn-hidden-size 11008 \
+       --num-attention-heads 32 \
+       --attention-dropout 0.0 \
+       --hidden-dropout 0.0 \
+       --init-method-std 0.01 \
+       --micro-batch-size 4 \
+       --global-batch-size 16 \
+       --seq-length 4096 \
+       --max-position-embeddings 4096 \
+       --data-path $DATA_PATH \
+       --tokenizer-name-or-path $TOKENIZER_PATH \
+       --tokenizer-not-use-fast \
+       --split 100,0,0 \
+       --distributed-backend nccl \
+       --lr 1.25e-5 \
+       --min-lr 1.25e-6 \
+       --lr-decay-style cosine \
+       --weight-decay 1e-1 \
+       --clip-grad 1.0 \
+       --initial-loss-scale 65536.0 \
+       --adam-beta1 0.9 \
+       --adam-beta2 0.95 \
+       --log-interval 1 \
+       --load ${LOAD_CHECKPOINT_PATH} \
+       --save ${SAVE_CHECKPOINT_PATH} \
+       --save-interval 10000 \
+       --eval-interval 10000 \
+       --eval-iters 0 \
+       --use-fused-rotary-pos-emb \
+       --no-masked-softmax-fusion \
+       --no-load-optim \
+       --no-load-rng \
+       --train-iters 50000 \
+       --lr-warmup-fraction 0.01 \
+       --mlp-layer-fusion \
+       --use-flash-attn \
+       --use-fused-rmsnorm \
+       --is-instruction-dataset \
+       --lora-target-modules query_key_value dense proj dense_4h_to_h \
+       --lora-r 16 \
+       --lora-alpha 32 \
+       --bf16 | tee ./logs/ascendspeed-llama2-7b-ptd-bf16_train-${logfile}.log
+
+# Training script for ptd mode
+
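The parallel layout implied by the values in `tune_llama2_7b_ptd.sh` can be checked with a little arithmetic. The sketch below is not part of the commit. It assumes the usual Megatron-style relationship, in which the data-parallel size is the world size divided by TP × PP and the global batch is split into micro-batches per data-parallel rank. The variable names `TP`, `PP`, `MICRO_BATCH`, `GLOBAL_BATCH`, `MODEL_PARALLEL`, `DP`, and `ACCUM_STEPS` are illustrative and do not appear in the script.

```shell
#!/usr/bin/env bash
# Hypothetical sanity check (not part of the commit): derive the parallel
# layout implied by the settings in tune_llama2_7b_ptd.sh.
set -euo pipefail

NPUS_PER_NODE=8
NNODES=1
TP=8             # --tensor-model-parallel-size
PP=1             # --pipeline-model-parallel-size
MICRO_BATCH=4    # --micro-batch-size
GLOBAL_BATCH=16  # --global-batch-size

WORLD_SIZE=$((NPUS_PER_NODE * NNODES))
MODEL_PARALLEL=$((TP * PP))

# Assumption: the world size must be a multiple of TP * PP.
if (( WORLD_SIZE % MODEL_PARALLEL != 0 )); then
    echo "world size ${WORLD_SIZE} is not divisible by TP*PP=${MODEL_PARALLEL}" >&2
    exit 1
fi
DP=$((WORLD_SIZE / MODEL_PARALLEL))

# Assumption: the global batch must be a multiple of micro-batch * DP.
if (( GLOBAL_BATCH % (MICRO_BATCH * DP) != 0 )); then
    echo "global batch ${GLOBAL_BATCH} is not divisible by micro-batch*DP=$((MICRO_BATCH * DP))" >&2
    exit 1
fi
ACCUM_STEPS=$((GLOBAL_BATCH / (MICRO_BATCH * DP)))

echo "world_size=${WORLD_SIZE} data_parallel=${DP} grad_accum_steps=${ACCUM_STEPS}"
```

With the values from the script, this gives a data-parallel size of 1 and 4 micro-batches accumulated per step. If `NNODES` or the parallel sizes change, rerunning the check shows whether the batch settings still divide evenly.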