add readme

2024-12-05 05:17:40 +08:00 · 2023-10-10 17:40:33 +08:00 · 4946f364a6
commit 4946f364a6
parent 4a42cf21f2
7 changed files with 73 additions and 42 deletions
--- a/examples/baichuan/README.md
+++ b/examples/baichuan/README.md
@ -20,13 +20,24 @@

 ## Training

-Here's a quick summary of training Baichuan-7B:
+Here's a hardware summary of pre-training Baichuan-7B:

-|          |                                               |
-| -------- | --------------------------------------------- |
-| Hardware | 1x8 Ascend NPUs                          |
-| Software | AscendSpeed                                   |
-| Dataset  | train-00000-of-00001-a09b74b3ef9c3b56.parquet |
+| Hardware |                      Value                      |
+| :------: | :---------------------------------------------: |
+|   NPU    |               8 x Ascend NPUs                   |
+
+Here's a software summary of pre-training Baichuan-7B: 
+
+
+|         Software          |                 Version                 |link  |
+| :-----------------------: | :-------------------------------------: | :---:|
+|          Python           |                  3.7.16                 |-|
+|          driver           |               23.0.RC3.B050             |[link](https://support.huawei.com/enterprise/zh/ascend-computing/ascend-hdk-pid-252764743/software/261159045?idAbsPath=fixnode01%7C23710424%7C251366513%7C22892968%7C252764743)|
+|         firmware          |              7.0.t8.0.b214              |[link](https://support.huawei.com/enterprise/zh/ascend-computing/ascend-hdk-pid-252764743/software/261159045?idAbsPath=fixnode01%7C23710424%7C251366513%7C22892968%7C252764743)|
+|           CANN            |Ascend-cann-toolkit-7.0.T8-linux    |[link](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software/261204647?idAbsPath=fixnode01%7C23710424%7C251366513%7C22892968%7C251168373)|
+|  binary operator package  |   Ascend-cann-kernels-XXX_7.0.T8_linux |[link](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software/261204647?idAbsPath=fixnode01%7C23710424%7C251366513%7C22892968%7C251168373)|
+|           torch           |                 1.11.0                  |[link](https://gitee.com/ascend/pytorch/releases/tag/v5.0.rc2.2-pytorch1.11.0)|
+|         torch_npu         |           1.11.0.post4-20230915         |[link](https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v1.11.0/20230915.2/pytorch_v1.11.0_py37.tar.gz)|


 ### Script
@ -47,7 +58,7 @@ conda create -n test python=3.7
 conda activate test

 # install torch and torch_npu
-pip install torch-1.11.0-cp37-cp37m-linux_aarch64.whl
+pip install torch-1.11.0-cp37-cp37m-manylinux2014_aarch64.whl
 pip install torch_npu-1.11.0.post4_XXXXXX-cp37-cp37m-linux_aarch64.whl
 pip install apex-0.1_ascend_XXXXXX-cp37-cp37m-linux_aarch64.whl

@ -134,13 +145,13 @@ The performance of Baichuan-7B in **Ascend NPU** and **Reference**:

 #### Accuracy of the loss

-NPU vs GPU loss.
+NPU vs Reference loss.

 The NPU run is stable: resource usage is steady, no errors occur during training, the loss trends downward, and the convergence speed is as expected. The relative error of the average loss is 0.01093 (below 2%), the maximum relative error is 0.1243, and the maximum absolute error is 0.4859. The precision meets the requirements.

 ![NPU-LOSS](./images/7B_loss_compare.png)

-NPU vs GPU loss relative error.
+NPU vs Reference loss relative error.

 ![NPU-Relative-Error](./images/7B_relative_error.png)

@ -149,16 +160,25 @@ NPU vs GPU loss relative error.
 # Baichuan-13B

 ## Training
-Here's a quick summary of training baichuan-13B:
-
-|          |                          |
-| :------: | :----------------------: |
-| Hardware |   1x8 Ascend NPUs   |
-| Software |       AscendSpeed        |
-| Dataset  | alpaca-data-conversation |

+Here's a hardware summary of pre-training Baichuan-13B:
+
+| Hardware |                      Value                      |
+| :------: | :---------------------------------------------: |
+|   NPU    |               8 x Ascend NPUs               |
+
+Here's a software summary of pre-training Baichuan-13B:


+|         Software          |                 Version                 |link  |
+| :-----------------------: | :-------------------------------------: | :---:|
+|          Python           |                  3.7.16                 |-|
+|          driver           |               23.0.RC3.B050             |[link](https://support.huawei.com/enterprise/zh/ascend-computing/ascend-hdk-pid-252764743/software/261159045?idAbsPath=fixnode01%7C23710424%7C251366513%7C22892968%7C252764743)|
+|         firmware          |              7.0.t8.0.b214              |[link](https://support.huawei.com/enterprise/zh/ascend-computing/ascend-hdk-pid-252764743/software/261159045?idAbsPath=fixnode01%7C23710424%7C251366513%7C22892968%7C252764743)|
+|           CANN            |Ascend-cann-toolkit-7.0.T8-linux    |[link](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software/261204647?idAbsPath=fixnode01%7C23710424%7C251366513%7C22892968%7C251168373)|
+|  binary operator package  |   Ascend-cann-kernels-XXX_7.0.T8_linux |[link](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software/261204647?idAbsPath=fixnode01%7C23710424%7C251366513%7C22892968%7C251168373)|
+|           torch           |                 1.11.0                  |[link](https://gitee.com/ascend/pytorch/releases/tag/v5.0.rc2.2-pytorch1.11.0)|
+|         torch_npu         |           1.11.0.post4-20230915         |[link](https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v1.11.0/20230915.2/pytorch_v1.11.0_py37.tar.gz)|



@ -179,7 +199,7 @@ conda create -n test python=3.7
 conda activate test

 # install torch and torch_npu
-pip install torch-1.11.0-cp37-cp37m-linux_aarch64.whl
+pip install torch-1.11.0-cp37-cp37m-manylinux2014_aarch64.whl
 pip install torch_npu-1.11.0.post4_XXXXXX-cp37-cp37m-linux_aarch64.whl
 pip install apex-0.1_ascend_XXXXXX-cp37-cp37m-linux_aarch64.whl

@ -227,17 +247,19 @@ python $SCRIPT_PATH \
    --output-model-dir ./model_weights \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 1 \
-    --type 13B 
+    --make-vocab-size-divisible-by 1 \
+    --type 13B \
+    --pse True     
 ```

 4. Prepare dataset
-Download the Baichuan-13B datasets from [here](https://github.com/lm-sys/FastChat/blob/v0.1.10/playground/data/alpaca-data-conversation.json) 
+Download the Baichuan-13B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet) 

 ```shell
  mkdir dataset_baichuan
  mkdir model_save
  cd ./dataset_baichuan
-  wget https://github.com/lm-sys/FastChat/blob/v0.1.10/playground/data/alpaca-data-conversation.json
+  wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
  cd ..

 ```
@ -245,13 +267,13 @@ Download the Baichuan-13B datasets from [here](https://github.com/lm-sys/FastCha
 ```shell
 #!/bin/bash

-SCRIPT_PATH=./tools/preprocess_data.py
-python $SCRIPT_PATH \
-    --llama-json-data-path ./dataset_baichuan/alpaca-data-conversation.json \
-    --tokenizer-model-path ./tokenizer \
-    --output-prefix internlm_eos_text \
+python ./tools/preprocess_data.py \
+    --input ./dataset_baichuan/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
+    --tokenizer-name-or-path ./tokenizer \
+    --output-prefix ./dataset_baichuan/alpaca \
    --workers 4 \
-    --log-interval 1000 
+    --log-interval 1000 \
+    --tokenizer-type PretrainedFromHF 
 ```


@ -264,9 +286,9 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh

 # modify the script's original dataset path to match your own dataset path
 TOKENIZER_PATH=./tokenizer/  
-DATA_PATH=./dataset_baichuan/internlm_eos_text  
+DATA_PATH=./dataset_baichuan/alpaca_text_document  
 LOAD_PATH=./model_weights
-CHECKPOINT_PATH=./model_save
+CHECKPOINT_PATH=./ckpt
 ```

 6. Launch Baichuan-13B pre-training script: /examples/baichuan/pretrain_baichuan_ptd_13B.sh
@ -288,23 +310,23 @@ The performance of the Baichuan-13B in **Ascend NPU** and **Reference**:
 | Device |    Model     | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
 | :----: | :----------: | :--------------: | :---------------------------: | :--------------------------: | :-----------------------: | :---------------------------------: |
 |  NPUs  | Baichuan-13B |       1000       |             1.928             |             1024             |          16.067           |                89.37                |
-|  Reference  | Baichuan-13B |       1000       |             1.535             |             785              |          20.852           |                68.39                |
+|  Reference  | Baichuan-13B |       1000       |             1.535             |             862              |          19.852           |                72.39                |



 #### Accuracy of the loss

-NPU vs GPU loss.
+NPU vs Reference loss.

-The NPU runs smoothly, the resource usage is stable, no errors are reported in the middle of the process, the Loss is on a decreasing trend, and the convergence speed is as expected.
+The NPU run is stable: resource usage is steady, no errors occur during training, the loss trends downward, and the convergence speed is as expected. The relative error of the average loss is 0.00725 (below 2%), the maximum relative error is 0.01978, and the maximum absolute error is 0.10811. The precision meets the requirements.

-![NPU-LOSS](./images/13B_loss_compare.png)
+![NPU-LOSS](./images/13B-loss-compare.png)

-NPU vs GPU loss relative error.
+NPU vs Reference loss relative error.

-The relative error between NPU and GPU Loss is less than 0.02 throughout, as expected.
+The relative error between NPU and Reference Loss is less than 0.02 throughout, as expected.

-![NPU-Relative-Error](./images/13B_relative_error.png)
+![NPU-Relative-Error](./images/baichuan13B-loss-relative-error.png)
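
The README reports three loss-comparison figures (relative error of the average loss, maximum relative error, maximum absolute error) without stating how they are computed. The following is a minimal sketch of one plausible reading, assuming both runs log one loss value per iteration; the function name `compare_losses` and the sample values are hypothetical and do not come from the commit.

```python
from statistics import mean


def compare_losses(npu_losses, ref_losses):
    """Return (relative error of the mean loss, max relative error, max absolute error).

    Assumed definitions, not confirmed by the README:
      - per-step absolute error: |npu - ref|
      - per-step relative error: |npu - ref| / |ref|
      - relative error of the average loss: |mean(npu) - mean(ref)| / |mean(ref)|
    """
    if len(npu_losses) != len(ref_losses) or not npu_losses:
        raise ValueError("loss curves must be non-empty and the same length")
    abs_errors = [abs(n - r) for n, r in zip(npu_losses, ref_losses)]
    rel_errors = [e / abs(r) for e, r in zip(abs_errors, ref_losses)]
    ref_mean = mean(ref_losses)
    mean_rel = abs(mean(npu_losses) - ref_mean) / abs(ref_mean)
    return mean_rel, max(rel_errors), max(abs_errors)


# Illustrative values only; these are not measurements from the Baichuan runs.
mean_rel, max_rel, max_abs = compare_losses([2.0, 1.0], [2.5, 1.0])
print(f"mean-loss relative error={mean_rel:.4f}, "
      f"max relative error={max_rel:.4f}, max absolute error={max_abs:.4f}")
```

Under these definitions, a 2% acceptance threshold would be checked against the first returned value.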



--- a/examples/baichuan/images/13B-loss-compare.png
+++ b/examples/baichuan/images/13B-loss-compare.png
--- a/examples/baichuan/images/13B_loss_compare.png
+++ b/examples/baichuan/images/13B_loss_compare.png
--- a/examples/baichuan/images/13B_relative_error.png
+++ b/examples/baichuan/images/13B_relative_error.png
--- a/examples/baichuan/images/7B_loss_compare.png
+++ b/examples/baichuan/images/7B_loss_compare.png
--- a/examples/baichuan/images/baichuan13B-loss-relative-error.png
+++ b/examples/baichuan/images/baichuan13B-loss-relative-error.png
--- a/tools/ckpt_convert/llama/convert_weights_from_huggingface.py
+++ b/tools/ckpt_convert/llama/convert_weights_from_huggingface.py
@ -10,7 +10,7 @@
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
+# See the License for the specific language governing permissions and 
 # limitations under the License.

 """Convert weight from huggingface to ascendspeed"""
@ -46,7 +46,7 @@ def get_args():
                        help="degree of pipeline model parallel")
    parser.add_argument("--added-token-num", type=int, default=0, help="the number of added tokens")
    parser.add_argument("--type", type=str, choices=["7B", "13B", "30B", "65B"], default="7B")
-
+    parser.add_argument("--pse", type=bool, default=False)
    return parser.parse_args()
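
A note on the `--pse` flag: with `type=bool`, argparse passes the raw string to `bool()`, and any non-empty string (including `"False"`) is truthy. The sketch below demonstrates that behaviour and one possible alternative; the `str2bool` helper is hypothetical and is not part of this commit.

```python
import argparse


def str2bool(value):
    """Parse a textual boolean. Hypothetical helper, not part of the commit."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


# As written in the commit: bool("False") evaluates to True.
naive = argparse.ArgumentParser()
naive.add_argument("--pse", type=bool, default=False)
naive_pse = naive.parse_args(["--pse", "False"]).pse

# With an explicit converter, "--pse False" parses as False.
strict = argparse.ArgumentParser()
strict.add_argument("--pse", type=str2bool, default=False)
strict_pse = strict.parse_args(["--pse", "False"]).pse

print(naive_pse, strict_pse)
```

The README invocation `--pse True` behaves the same under both parsers; the difference only shows up when a user passes `--pse False` explicitly.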


@ -109,11 +109,20 @@ def generate_ascendspeed_weights_again(config):

            for pp_i in range(pp_n_layer):
                ori_i = pp_n_layer * pp_rank + pp_i
-                rank_model[f"language_model.layers.{pp_i}.attention.rotary_emb.inv_freq"] = get_weight_from_name(
-                    f"model.layers.{ori_i}.self_attn.rotary_emb.inv_freq")
-                qw = row_split(get_weight_from_name(f"model.layers.{ori_i}.self_attn.q_proj.weight"), tp_size, tp_rank)
-                kw = row_split(get_weight_from_name(f"model.layers.{ori_i}.self_attn.k_proj.weight"), tp_size, tp_rank)
-                vw = row_split(get_weight_from_name(f"model.layers.{ori_i}.self_attn.v_proj.weight"), tp_size, tp_rank)
+                if args.pse:
+                    w_pack = get_weight_from_name(f"model.layers.{ori_i}.self_attn.W_pack.weight")
+                    ws = torch.split(w_pack, w_pack.shape[0] // 3)
+                    qw = row_split(ws[0], tp_size, tp_rank)
+                    kw = row_split(ws[1], tp_size, tp_rank)
+                    vw = row_split(ws[2], tp_size, tp_rank)
+                else:
+                    rank_model[f"language_model.layers.{pp_i}.attention.rotary_emb.inv_freq"] = get_weight_from_name(
+                        f"model.layers.{ori_i}.self_attn.rotary_emb.inv_freq")
+                    qw = row_split(get_weight_from_name(f"model.layers.{ori_i}.self_attn.q_proj.weight"), tp_size, tp_rank)
+                    kw = row_split(get_weight_from_name(f"model.layers.{ori_i}.self_attn.k_proj.weight"), tp_size, tp_rank)
+                    vw = row_split(get_weight_from_name(f"model.layers.{ori_i}.self_attn.v_proj.weight"), tp_size, tp_rank)
+
+   
                permute_w = permute_qkv_weight(torch.cat([qw, kw, vw], dim=0), n_heads, hidden_size, tp_size)
                rank_model[f"language_model.layers.{pp_i}.attention.query_key_value.weight"] = permute_w
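
The `args.pse` branch splits the packed `W_pack` weight into three equal row blocks (Q, K, V) before the per-rank split. The sketch below mirrors that slicing with plain Python lists so it runs without torch. The `row_slice` function is only an assumed stand-in for the script's `row_split`, whose implementation is not shown in this diff, and both function names are hypothetical.

```python
def split_w_pack(w_pack_rows):
    """Split a packed QKV matrix (a list of rows) into equal Q, K, V row blocks."""
    total = len(w_pack_rows)
    if total % 3 != 0:
        raise ValueError("packed QKV weight must have a row count divisible by 3")
    block = total // 3
    return (w_pack_rows[:block],
            w_pack_rows[block:2 * block],
            w_pack_rows[2 * block:])


def row_slice(rows, tp_size, tp_rank):
    """Assumed row-split behaviour: each tensor-parallel rank gets a contiguous slice."""
    per_rank = len(rows) // tp_size
    return rows[tp_rank * per_rank:(tp_rank + 1) * per_rank]


# Toy packed weight: 12 rows of width 1, so Q, K and V each have 4 rows.
w_pack = [[i] for i in range(12)]
q, k, v = split_w_pack(w_pack)
k_rank1 = row_slice(k, tp_size=2, tp_rank=1)
print(q, k_rank1)
```

The order matters: the split into Q, K and V happens on the full weight first, and only then is each block divided across tensor-parallel ranks.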