mirror of
https://gitee.com/ascend/ModelLink.git
synced 2024-12-05 05:17:40 +08:00
add readme
This commit is contained in:
parent
4a42cf21f2
commit
4946f364a6
@ -20,13 +20,24 @@
|
||||
|
||||
## Training
|
||||
|
||||
Here's a quick summary of training Baichuan-7B:
|
||||
Here's a hardware summary of pre-training Baichuan-7B:
|
||||
|
||||
| | |
| -------- | --------------------------------------------- |
| Hardware | 1x8 Ascend NPUs |
| Software | AscendSpeed |
| Dataset | train-00000-of-00001-a09b74b3ef9c3b56.parquet |
|
||||
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 8 x Ascend NPUs |
|
||||
|
||||
Here's a software summary of pre-training Baichuan-7B:
|
||||
|
||||
|
||||
| Software | Version | Link |
| :-----------------------: | :-------------------------------------: | :---: |
| Python | 3.7.16 | - |
| driver | 23.0.RC3.B050 | [link](https://support.huawei.com/enterprise/zh/ascend-computing/ascend-hdk-pid-252764743/software/261159045?idAbsPath=fixnode01%7C23710424%7C251366513%7C22892968%7C252764743) |
| firmware | 7.0.t8.0.b214 | [link](https://support.huawei.com/enterprise/zh/ascend-computing/ascend-hdk-pid-252764743/software/261159045?idAbsPath=fixnode01%7C23710424%7C251366513%7C22892968%7C252764743) |
| CANN | Ascend-cann-toolkit-7.0.T8-linux | [link](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software/261204647?idAbsPath=fixnode01%7C23710424%7C251366513%7C22892968%7C251168373) |
| CANN kernels (binary operator package) | Ascend-cann-kernels-XXX_7.0.T8_linux | [link](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software/261204647?idAbsPath=fixnode01%7C23710424%7C251366513%7C22892968%7C251168373) |
| torch | 1.11.0 | [link](https://gitee.com/ascend/pytorch/releases/tag/v5.0.rc2.2-pytorch1.11.0) |
| torch_npu | 1.11.0.post4-20230915 | [link](https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v1.11.0/20230915.2/pytorch_v1.11.0_py37.tar.gz) |
|
||||
|
||||
|
||||
### Script
|
||||
@ -47,7 +58,7 @@ conda create -n test python=3.7
|
||||
conda activate test
|
||||
|
||||
# install torch and torch_npu
|
||||
pip install torch-1.11.0-cp37-cp37m-linux_aarch64.whl
|
||||
pip install torch-1.11.0-cp37-cp37m-manylinux2014_aarch64.whl
|
||||
pip install torch_npu-1.11.0.post4_XXXXXX-cp37-cp37m-linux_aarch64.whl
|
||||
pip install apex-0.1_ascend_XXXXXX-cp37-cp37m-linux_aarch64.whl
|
||||
|
||||
@ -134,13 +145,13 @@ The performance of Baichuan-7B in **Ascend NPU** and **Reference**:
|
||||
|
||||
#### Accuracy of the loss
|
||||
|
||||
NPU vs GPU loss.
|
||||
NPU vs Reference loss.
|
||||
|
||||
The NPU run is stable: resource usage stays steady, no errors are reported during training, the loss decreases steadily, and the convergence speed is as expected. The relative error of the average loss is 0.01093 (about 1.1%, below 2%), the maximum relative error is 0.1243, and the maximum absolute error is 0.4859. The precision meets the requirements.
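
The three error figures quoted for the loss comparison can be computed from two logged loss curves. The following is a minimal sketch, assuming the per-iteration losses are available as equal-length lists. The function name `loss_errors` and the sample values are illustrative only, and "relative error of the average loss" is read here as `|mean(npu) - mean(ref)| / mean(ref)`, which is one plausible interpretation rather than the repository's confirmed definition.

```python
from statistics import mean


def loss_errors(npu_loss, ref_loss):
    """Compare two loss curves logged at the same iterations.

    Returns (relative error of the mean loss, max per-step relative error,
    max per-step absolute error). These definitions are assumptions, not
    taken from the repository.
    """
    if len(npu_loss) != len(ref_loss) or not npu_loss:
        raise ValueError("loss curves must be non-empty and equally long")

    abs_err = [abs(n - r) for n, r in zip(npu_loss, ref_loss)]
    rel_err = [a / abs(r) for a, r in zip(abs_err, ref_loss)]
    mean_rel = abs(mean(npu_loss) - mean(ref_loss)) / abs(mean(ref_loss))
    return mean_rel, max(rel_err), max(abs_err)


# Made-up values purely for illustration.
npu = [4.0, 3.0, 2.0]
ref = [4.0, 2.5, 2.0]
mean_rel, max_rel, max_abs = loss_errors(npu, ref)
print(f"mean-loss relative error: {mean_rel:.4f}")
print(f"max relative error:       {max_rel:.4f}")
print(f"max absolute error:       {max_abs:.4f}")
```

A per-step relative error curve (the `rel_err` list inside the function) is the quantity one would plot to obtain a figure like the relative-error image below.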
|
||||
|
||||
![NPU-LOSS](./images/7B_loss_compare.png)
|
||||
|
||||
NPU vs GPU loss relative error.
|
||||
NPU vs Reference loss relative error.
|
||||
|
||||
![NPU-Relative-Error](./images/7B_relative_error.png)
|
||||
|
||||
@ -149,16 +160,25 @@ NPU vs GPU loss relative error.
|
||||
# Baichuan-13B
|
||||
|
||||
## Training
|
||||
Here's a quick summary of training baichuan-13B:
|
||||
|
||||
| | |
| :------: | :----------------------: |
| Hardware | 1x8 Ascend NPUs |
| Software | AscendSpeed |
| Dataset | alpaca-data-conversation |
|
||||
|
||||
Here's a hardware summary of pre-training Baichuan-13B:
|
||||
|
||||
| Hardware | Value |
| :------: | :---------------------------------------------: |
| NPU | 8 x Ascend NPUs |
|
||||
|
||||
Here's a software summary of pre-training Baichuan-13B:
|
||||
|
||||
|
||||
| Software | Version | Link |
| :-----------------------: | :-------------------------------------: | :---: |
| Python | 3.7.16 | - |
| driver | 23.0.RC3.B050 | [link](https://support.huawei.com/enterprise/zh/ascend-computing/ascend-hdk-pid-252764743/software/261159045?idAbsPath=fixnode01%7C23710424%7C251366513%7C22892968%7C252764743) |
| firmware | 7.0.t8.0.b214 | [link](https://support.huawei.com/enterprise/zh/ascend-computing/ascend-hdk-pid-252764743/software/261159045?idAbsPath=fixnode01%7C23710424%7C251366513%7C22892968%7C252764743) |
| CANN | Ascend-cann-toolkit-7.0.T8-linux | [link](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software/261204647?idAbsPath=fixnode01%7C23710424%7C251366513%7C22892968%7C251168373) |
| CANN kernels (binary operator package) | Ascend-cann-kernels-XXX_7.0.T8_linux | [link](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software/261204647?idAbsPath=fixnode01%7C23710424%7C251366513%7C22892968%7C251168373) |
| torch | 1.11.0 | [link](https://gitee.com/ascend/pytorch/releases/tag/v5.0.rc2.2-pytorch1.11.0) |
| torch_npu | 1.11.0.post4-20230915 | [link](https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v1.11.0/20230915.2/pytorch_v1.11.0_py37.tar.gz) |
|
||||
|
||||
|
||||
|
||||
@ -179,7 +199,7 @@ conda create -n test python=3.7
|
||||
conda activate test
|
||||
|
||||
# install torch and torch_npu
|
||||
pip install torch-1.11.0-cp37-cp37m-linux_aarch64.whl
|
||||
pip install torch-1.11.0-cp37-cp37m-manylinux2014_aarch64.whl
|
||||
pip install torch_npu-1.11.0.post4_XXXXXX-cp37-cp37m-linux_aarch64.whl
|
||||
pip install apex-0.1_ascend_XXXXXX-cp37-cp37m-linux_aarch64.whl
|
||||
|
||||
@ -227,17 +247,19 @@ python $SCRIPT_PATH \
|
||||
--output-model-dir ./model_weights \
|
||||
--tensor-model-parallel-size 8 \
|
||||
--pipeline-model-parallel-size 1 \
|
||||
--type 13B
|
||||
--make-vocab-size-divisible-by 1 \
|
||||
--type 13B \
|
||||
--pse True
|
||||
```
|
||||
|
||||
4. Prepare dataset
|
||||
Download the Baichuan-13B datasets from [here](https://github.com/lm-sys/FastChat/blob/v0.1.10/playground/data/alpaca-data-conversation.json)
|
||||
Download the Baichuan-13B datasets from [here](https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet)
|
||||
|
||||
```shell
|
||||
mkdir dataset_baichuan
|
||||
mkdir model_save
|
||||
cd ./dataset_baichuan
|
||||
wget https://github.com/lm-sys/FastChat/blob/v0.1.10/playground/data/alpaca-data-conversation.json
|
||||
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
|
||||
cd ..
|
||||
|
||||
```
|
||||
@ -245,13 +267,13 @@ Download the Baichuan-13B datasets from [here](https://github.com/lm-sys/FastCha
|
||||
```shell
|
||||
#!/bin/bash
|
||||
|
||||
SCRIPT_PATH=./tools/preprocess_data.py
|
||||
python $SCRIPT_PATH \
|
||||
--llama-json-data-path ./dataset_baichuan/alpaca-data-conversation.json \
|
||||
--tokenizer-model-path ./tokenizer \
|
||||
--output-prefix internlm_eos_text \
|
||||
python ./tools/preprocess_data.py \
|
||||
--input ./dataset_baichuan/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
|
||||
--tokenizer-name-or-path ./tokenizer \
|
||||
--output-prefix ./dataset_baichuan/alpaca \
|
||||
--workers 4 \
|
||||
--log-interval 1000
|
||||
--log-interval 1000 \
|
||||
--tokenizer-type PretrainedFromHF
|
||||
```
|
||||
|
||||
|
||||
@ -264,9 +286,9 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
|
||||
# modify the script's original dataset path according to your own dataset path
|
||||
TOKENIZER_PATH=./tokenizer/
|
||||
DATA_PATH=./dataset_baichuan/internlm_eos_text
|
||||
DATA_PATH=./dataset_baichuan/alpaca_text_document
|
||||
LOAD_PATH=./model_weights
|
||||
CHECKPOINT_PATH=./model_save
|
||||
CHECKPOINT_PATH=./ckpt
|
||||
```
|
||||
|
||||
6. Launch Baichuan-13B pre-training script: /examples/baichuan/pretrain_baichuan_ptd_13B.sh
|
||||
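
The `DATA_PATH` value set before launching training is derived from the `--output-prefix` passed to `preprocess_data.py`. The following is a small sketch, assuming the Megatron-style naming `<prefix>_<json key>_<level>` with key `text` and level `document`. The helper `dataset_path` is hypothetical and not part of the repository, so check the result against the files your preprocessing run actually writes.

```python
def dataset_path(output_prefix, json_key="text", level="document"):
    """Build the dataset path prefix the training script expects.

    Assumes the preprocessing tool writes <prefix>_<key>_<level>.bin/.idx,
    as Megatron-style tools do; verify against your own output files.
    """
    return f"{output_prefix}_{json_key}_{level}"


data_path = dataset_path("./dataset_baichuan/alpaca")
print(data_path)  # ./dataset_baichuan/alpaca_text_document
print([data_path + ext for ext in (".bin", ".idx")])
```

Note that the path is given without the `.bin` or `.idx` extension; both files share the same prefix.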
@ -288,23 +310,23 @@ The performance of the Baichuan-13B in **Ascend NPU** and **Reference**:
|
||||
| Device | Model | total Iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) | floating point operation (TFLOPs/s) |
|
||||
| :----: | :----------: | :--------------: | :---------------------------: | :--------------------------: | :-----------------------: | :---------------------------------: |
|
||||
| NPUs | Baichuan-13B | 1000 | 1.928 | 1024 | 16.067 | 89.37 |
|
||||
| Reference | Baichuan-13B | 1000 | 1.535 | 785 | 20.852 | 68.39 |
|
||||
| Reference | Baichuan-13B | 1000 | 1.535 | 862 | 19.852 | 72.39 |
|
||||
|
||||
|
||||
|
||||
#### Accuracy of the loss
|
||||
|
||||
NPU vs GPU loss.
|
||||
NPU vs Reference loss.
|
||||
|
||||
The NPU runs smoothly, the resource usage is stable, no errors are reported in the middle of the process, the Loss is on a decreasing trend, and the convergence speed is as expected.
|
||||
The NPU run is stable: resource usage stays steady, no errors are reported during training, the loss decreases steadily, and the convergence speed is as expected. The relative error of the average loss is 0.00725 (about 0.7%, below 2%), the maximum relative error is 0.01978, and the maximum absolute error is 0.10811. The precision meets the requirements.
|
||||
|
||||
![NPU-LOSS](./images/13B_loss_compare.png)
|
||||
![NPU-LOSS](./images/13B-loss-compare.png)
|
||||
|
||||
NPU vs GPU loss relative error.
|
||||
NPU vs Reference loss relative error.
|
||||
|
||||
The relative error between NPU and GPU Loss is less than 0.02 throughout, as expected.
|
||||
The relative error between NPU and Reference Loss is less than 0.02 throughout, as expected.
|
||||
|
||||
![NPU-Relative-Error](./images/13B_relative_error.png)
|
||||
![NPU-Relative-Error](./images/baichuan13B-loss-relative-error.png)
|
||||
|
||||
|
||||
|
||||
|
BIN
examples/baichuan/images/13B-loss-compare.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 55 KiB |
Binary file not shown.
Before Width: | Height: | Size: 29 KiB |
Binary file not shown.
Before Width: | Height: | Size: 49 KiB |
Binary file not shown.
Before Width: | Height: | Size: 70 KiB After Width: | Height: | Size: 46 KiB |
BIN
examples/baichuan/images/baichuan13B-loss-relative-error.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 76 KiB |
@ -10,7 +10,7 @@
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
"""Convert weight from huggingface to ascendspeed"""
|
||||
@ -46,7 +46,7 @@ def get_args():
|
||||
help="degree of pipeline model parallel")
|
||||
parser.add_argument("--added-token-num", type=int, default=0, help="the number of added tokens")
|
||||
parser.add_argument("--type", type=str, choices=["7B", "13B", "30B", "65B"], default="7B")
|
||||
|
||||
parser.add_argument("--pse", type=lambda s: s.lower() == "true", default=False)
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
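
Boolean options such as `--pse` need care in `argparse`. When an option is declared with `type=bool`, the raw command-line string is passed to `bool()`, so any non-empty value, including `False`, parses as `True`. The sketch below shows that behaviour next to one common alternative; the `str2bool` helper is an illustrative name and not part of the conversion script.

```python
import argparse


def str2bool(value):
    """Parse common true/false spellings; hypothetical helper for illustration."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


naive = argparse.ArgumentParser()
naive.add_argument("--pse", type=bool, default=False)

strict = argparse.ArgumentParser()
strict.add_argument("--pse", type=str2bool, default=False)

naive_false = naive.parse_args(["--pse", "False"]).pse    # bool("False") is True
strict_false = strict.parse_args(["--pse", "False"]).pse
strict_true = strict.parse_args(["--pse", "True"]).pse
print(naive_false, strict_false, strict_true)
```

Passing `--pse True`, as the README command does, gives `True` under either declaration; the two differ only when a false value is passed explicitly.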
@ -109,11 +109,20 @@ def generate_ascendspeed_weights_again(config):
|
||||
|
||||
for pp_i in range(pp_n_layer):
|
||||
ori_i = pp_n_layer * pp_rank + pp_i
|
||||
rank_model[f"language_model.layers.{pp_i}.attention.rotary_emb.inv_freq"] = get_weight_from_name(
|
||||
f"model.layers.{ori_i}.self_attn.rotary_emb.inv_freq")
|
||||
qw = row_split(get_weight_from_name(f"model.layers.{ori_i}.self_attn.q_proj.weight"), tp_size, tp_rank)
|
||||
kw = row_split(get_weight_from_name(f"model.layers.{ori_i}.self_attn.k_proj.weight"), tp_size, tp_rank)
|
||||
vw = row_split(get_weight_from_name(f"model.layers.{ori_i}.self_attn.v_proj.weight"), tp_size, tp_rank)
|
||||
if args.pse:
|
||||
w_pack = get_weight_from_name(f"model.layers.{ori_i}.self_attn.W_pack.weight")
|
||||
ws = torch.split(w_pack, w_pack.shape[0] // 3)
|
||||
qw = row_split(ws[0], tp_size, tp_rank)
|
||||
kw = row_split(ws[1], tp_size, tp_rank)
|
||||
vw = row_split(ws[2], tp_size, tp_rank)
|
||||
else:
|
||||
rank_model[f"language_model.layers.{pp_i}.attention.rotary_emb.inv_freq"] = get_weight_from_name(
|
||||
f"model.layers.{ori_i}.self_attn.rotary_emb.inv_freq")
|
||||
qw = row_split(get_weight_from_name(f"model.layers.{ori_i}.self_attn.q_proj.weight"), tp_size, tp_rank)
|
||||
kw = row_split(get_weight_from_name(f"model.layers.{ori_i}.self_attn.k_proj.weight"), tp_size, tp_rank)
|
||||
vw = row_split(get_weight_from_name(f"model.layers.{ori_i}.self_attn.v_proj.weight"), tp_size, tp_rank)
|
||||
|
||||
|
||||
permute_w = permute_qkv_weight(torch.cat([qw, kw, vw], dim=0), n_heads, hidden_size, tp_size)
|
||||
rank_model[f"language_model.layers.{pp_i}.attention.query_key_value.weight"] = permute_w
|
||||
|
||||
|
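
The `args.pse` branch in this hunk unpacks a fused `W_pack` weight into query, key and value blocks, then shards each block across tensor-parallel ranks. A minimal sketch of that idea follows, using nested Python lists in place of tensors so that it runs without `torch`. `split_rows` and `shard_packed_qkv` are hypothetical stand-ins, and the assumption that `row_split` keeps the `tp_rank`-th of `tp_size` equal row chunks is inferred from its call signature rather than confirmed by the source. The later `permute_qkv_weight` step is not covered.

```python
def split_rows(matrix, parts):
    """Split a list of rows into `parts` equal, contiguous chunks."""
    if len(matrix) % parts != 0:
        raise ValueError("row count must be divisible by the number of parts")
    size = len(matrix) // parts
    return [matrix[i * size:(i + 1) * size] for i in range(parts)]


def shard_packed_qkv(w_pack, tp_size, tp_rank):
    """Unpack fused [Q; K; V] rows, then keep this rank's slice of each."""
    q, k, v = split_rows(w_pack, 3)
    shards = [split_rows(block, tp_size)[tp_rank] for block in (q, k, v)]
    # Concatenate along rows, mirroring torch.cat([qw, kw, vw], dim=0).
    return [row for shard in shards for row in shard]


# Toy packed weight: 6 rows (2 each for Q, K, V), labelled for readability.
w_pack = [["q0"], ["q1"], ["k0"], ["k1"], ["v0"], ["v1"]]
rank0 = shard_packed_qkv(w_pack, tp_size=2, tp_rank=0)
rank1 = shard_packed_qkv(w_pack, tp_size=2, tp_rank=1)
print(rank0)  # [['q0'], ['k0'], ['v0']]
print(rank1)  # [['q1'], ['k1'], ['v1']]
```

Each rank ends up with its own slice of Q, K and V stacked together, which is why the packed weight has to be split into three before sharding rather than sharded directly.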