
ModelLink provides end-to-end solutions for large language models on Ascend chips, including models, algorithms and tasks.


Supported Features

ModelLink currently provides the capabilities for large-model usage detailed in the sections below.

More novel and useful features for LLM training on Ascend are under development ...

Supported Models

ModelLink currently supports pre-training and fine-tuning for the following models:

| Model | Scale | Pretrain | Inference | LoRA | SFT | Chat | Evaluation | Contributor |
|---|---|---|---|---|---|---|---|---|
| Aquila | 7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Aquila2 | 7B | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| Baichuan | 7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Baichuan | 13B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Baichuan2 | 7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Baichuan2 | 13B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Bloom | 7B1 | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Bloom | 176B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| CodeLlama | 34B | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| InternLM | 7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| InternLM | 65B | pretrain | -- | -- | -- | -- | -- | 【Ascend】 |
| LLaMA | 7B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA | 13B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA | 33B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA | 65B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA2 | 7B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA2 | 13B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA2 | 34B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA2 | 70B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA3 | 8B | pretrain | generate | -- | -- | chat | eval | 【Community】 |
| LLaMA3 | 70B | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| Qwen | 7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Qwen | 14B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Qwen | 72B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Yi | 34B | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| Mixtral | 8x7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |

Script Naming Rules

| Script | Rule |
|---|---|
| pretrain_xxx.sh | Pre-training script |
| tune_xxx.sh | Fine-tuning script |
| generate_xxx.sh | Inference script |
| xxx_chat_xxx.sh | Chat script |
| evaluation_xxx.sh | Evaluation script |
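
Following these conventions, scripts are invoked directly from the examples folder. A minimal sketch; the llama2 paths below are illustrative, and the exact script names vary by model and configuration:

```shell
# Hypothetical paths that follow the naming rules above; check the
# examples folder for the exact script names of each model.
bash examples/llama2/pretrain_llama2_7b_ptd.sh     # pre-training
bash examples/llama2/generate_llama2_7b_ptd.sh     # inference
bash examples/llama2/evaluation_llama2_7b_ptd.sh   # evaluation
```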

Model Usage Guide and Version Notes

For the supported models listed above, we provide training scripts and README instructions in the examples folder, which describe the detailed processes for model training, inference, and evaluation.

【Please note the required environment versions for model usage, as follows】

| Software | Version |
|---|---|
| Python | 3.8 |
| driver | Ascend HDK 23.0.0 |
| firmware | Ascend HDK 23.0.0 |
| CANN | CANN 7.0.0 |
| torch | 2.1.0, 2.2.0 |
| torch_npu | release v5.0.0 |
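
As a quick environment sanity check, the installed versions and NPU visibility can be printed from the shell (a minimal sketch, assuming torch and torch_npu are already installed in the active Python environment):

```shell
# Print torch / torch_npu versions and confirm NPU devices are visible.
python -c "import torch, torch_npu; \
print('torch:', torch.__version__); \
print('torch_npu:', torch_npu.__version__); \
print('npu available:', torch_npu.npu.is_available())"
```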

【Based on the current version of Megatron, the performance statistics from our testing are as follows】

| Model | Parameters | Cluster Scale | Precision Mode | Performance | Reference Performance |
|---|---|---|---|---|---|
| Aquila | 7B | 1x8 | BF16 | 2849 | 2874 |
| Aquila2 | 7B | 1x8 | FP16 | 3323 | 2673 |
| Baichuan | 7B | 1x8 | FP16 | 2685 | 2036 |
| Baichuan | 13B | 1x8 | FP16 | 1213 | 862 |
| Baichuan2 | 7B | 1x8 | BF16 | 2664 | 3969 |
| Baichuan2 | 13B | 1x8 | BF16 | 1668 | 2062 |
| Bloom | 7B1 | 1x8 | FP16 | 2034 | 2525 |
| Bloom | 176B | 12x8 | BF16 | 100 | 107 |
| CodeLlama | 34B | 2x8 | BF16 | 837 | 762 |
| InternLM | 7B | 1x8 | BF16 | 2776 | 2854 |
| InternLM | 65B | 4x8 | BF16 | 341 | 414 |
| LLaMA | 7B | 1x8 | FP16 | 3600 | 3804 |
| LLaMA | 13B | 1x8 | FP16 | 1895 | 2012 |
| LLaMA | 33B | 4x8 | FP16 | 621 | 776 |
| LLaMA | 65B | 4x8 | BF16 | 348 | 426 |
| LLaMA2 | 7B | 1x8 | BF16 | 4200 | 3850 |
| LLaMA2 | 13B | 1x8 | BF16 | 1990 | 1920 |
| LLaMA2 | 34B | 2x8 | BF16 | 690 | 796 |
| LLaMA2 | 70B | 8x8 | BF16 | 350 | 339 |
| LLaMA3 | 8B | 1x8 | BF16 | 2483 | 2674 |
| LLaMA3 | 70B | 8x8 | BF16 | 283 | -- |
| Qwen | 7B | 1x8 | BF16 | 2499 | 2867 |
| Qwen | 14B | 1x8 | BF16 | 1560 | 1578 |
| Qwen | 72B | 16x8 | BF16 | 285 | 345 |
| Yi | 34B | 2x8 | BF16 | 809 | 730 |
| Mixtral | 8x7B | 2x8 | BF16 | 1054 | 1139 |

Acceleration Features

ModelLink supports various acceleration algorithms such as tensor parallelism, pipeline parallelism, sequence parallelism, recomputation, distributed optimizer, and more. The table below shows the enable switches corresponding to each acceleration feature:

| Acceleration Feature | Enable Parameter |
|---|---|
| Tensor parallel | --tensor-model-parallel-size |
| Pipeline parallel | --pipeline-model-parallel-size |
| Dynamic division for PP | --num-layer-list |
| Sequence parallel | --sequence-parallel |
| Recomputation | --recompute-granularity |
| Distributed optimizer | --use-distributed-optimizer |
| Overlap DDP allreduce | --overlap-grad-reduce |
| Overlap DDP allgather | --overlap-param-gather |
| Flash attention | --use-flash-attn |
| Fused rmsnorm | --use-fused-rmsnorm |
| Fused swiglu | --use-fused-swiglu |
| MC2 | --use-mc2 |
| Fused rotary position embedding | --use-fused-rotary-pos-emb |
These switches can be combined as needed in a training launch command, for example:

```shell
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    --tensor-model-parallel-size ${TP} \
    --pipeline-model-parallel-size ${PP} \
    --num-layer-list 1,2,2,2,1 \
    --sequence-parallel \
    --recompute-granularity full \
    --recompute-method block \
    --recompute-num-layers 72 \
    --use-distributed-optimizer \
    --use-flash-attn \
    --use-fused-rmsnorm \
    --use-fused-swiglu \
    --overlap-grad-reduce \
    --overlap-param-gather \
    --use-fused-rotary-pos-emb \
    --use-mc2 \
    ... \
    ...
```
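
Here $DISTRIBUTED_ARGS holds the standard torchrun launcher flags. A minimal single-node sketch; the address and port values are placeholders:

```shell
# torchrun launcher flags for a single node with 8 devices; adjust
# --nnodes, --node_rank, and --master_addr for multi-node clusters.
DISTRIBUTED_ARGS="
    --nproc_per_node 8
    --nnodes 1
    --node_rank 0
    --master_addr localhost
    --master_port 6000
"
```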

Analyze Profiling Data Based on Ascend Chips

ModelLink supports collecting profiling data during training on Ascend chips, which is useful for performance analysis and modeling:

```shell
--profile                          # enable profiling
--profile-step-start 5             # step at which profiling starts
--profile-step-end 6               # step at which profiling ends
--profile-ranks 0 1 2 3 4          # ranks to profile
--profile-level level2             # profiling level: level0, level1, or level2
--profile-with-cpu                 # collect CPU information
--profile-with-stack               # collect stack information
--profile-with-memory              # collect memory information
--profile-record-shapes            # record tensor shape information
--profile-save-path ./profile_dir  # path to save profiling data
```
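
For example, to profile a couple of training steps, these switches can simply be appended to the launch command shown earlier (a sketch; the flag values are illustrative):

```shell
# Profile training steps 5-6 on ranks 0-3 and write results to ./profile_dir.
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    ... \
    --profile \
    --profile-step-start 5 \
    --profile-step-end 6 \
    --profile-ranks 0 1 2 3 \
    --profile-level level2 \
    --profile-save-path ./profile_dir
```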

Acknowledgments


ModelLink is jointly contributed by the following departments of Huawei Corporation:

  • Ascend Computing Product Unit
  • Algorithm Unit of Computing Product Unit
  • Research Unit of Computing Product Unit
  • Open Computing Kit of Computing Product Unit
  • General Development Department
  • Global Technical Service Department

We appreciate every PR from the community and welcome contributions to ModelLink.

Appendix