ModelLink2/README_en.md
guoxinjie 62c40eef76 !1240 GPT3-175B: organize and add to repository
Merge pull request !1240 from guoxinjie/gelu
2024-05-27 02:34:20 +00:00


ModelLink provides end-to-end solutions for large language models on Ascend chips, including models, algorithms and tasks.


Supported features

ModelLink currently supports the following features for large-model usage:

More novel and useful features for LLM training on Ascend are under development ...

Supported Models

ModelLink currently supports pre-training and fine-tuning for the following models:

| Model | Scale | Pretrain | Inference | LoRA | SFT | Chat | Evaluation | Contributor |
|---|---|---|---|---|---|---|---|---|
| Aquila | 7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Aquila2 | 7B | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| | 34B | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| Baichuan | 7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| | 13B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Baichuan2 | 7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| | 13B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Bloom | 7B1 | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| | 176B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| CodeLlama | 34B | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| InternLM | 7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| | 65B | pretrain | -- | -- | -- | -- | -- | 【Ascend】 |
| LLaMA | 7B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| | 13B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| | 33B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| | 65B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA2 | 7B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| | 13B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| | 34B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| | 70B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA3 | 8B | pretrain | generate | -- | -- | chat | eval | 【Ascend】 |
| | 70B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Qwen | 7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| | 14B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| | 72B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Qwen1.5 | 7B | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| | 14B | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| Yi | 34B | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| Mixtral | 8x7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Mistral | 7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Gemma | 2B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| | 7B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| GPT3 | 175B | pretrain | -- | -- | -- | -- | -- | 【Community】 |

Script Naming Rules

| Script | Rule |
|---|---|
| pretrain_xxx.sh | Pre-training Script |
| tune_xxx.sh | Fine-tuning Script |
| generate_xxx.sh | Inference Script |
| xxx_chat_xxx.sh | Chat Script |
| evaluation_xxx.sh | Evaluation Script |
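
As an illustration of the naming rules above, the following sketch maps a script file name to its purpose. The helper `script_kind` and the file names passed to it are hypothetical and not part of ModelLink. Because a name can match both a prefix rule and the chat rule, the sketch checks the chat pattern first; that precedence is an assumption.

```shell
# Classify an example script by its file name (hypothetical helper, not part of ModelLink).
script_kind() {
    name=$(basename "$1")
    case "$name" in
        # Checked first: a chat script may also start with one of the prefixes below (assumption).
        *_chat_*.sh)     echo "chat" ;;
        pretrain_*.sh)   echo "pre-training" ;;
        tune_*.sh)       echo "fine-tuning" ;;
        generate_*.sh)   echo "inference" ;;
        evaluation_*.sh) echo "evaluation" ;;
        *)               echo "unknown" ;;
    esac
}

# Illustrative file names only.
script_kind examples/demo/pretrain_demo_7b.sh
script_kind examples/demo/evaluation_demo_7b.sh
```

A helper like this can be handy for listing which of the five tasks a model directory under examples covers.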

Model Usage Guide and Version Notes

For the supported models listed above, we provide training scripts and README instructions in the examples folder, which describe the detailed process for model training, inference, and evaluation.

Note the environment versions required for model usage:

| Software | Version |
|---|---|
| Python | 3.8 |
| driver | Ascend HDK 23.0.0 |
| firmware | Ascend HDK 23.0.0 |
| CANN | CANN 7.0.0 |
| torch | 2.1.0, 2.2.0 |
| torch_npu | release v5.0.0 |
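
Before launching a script, it can help to compare the local Python and torch versions with the table above. The sketch below is one possible check, not an official ModelLink tool: `version_ok` is a hypothetical helper, and it assumes the interpreter is available as `python3`. Driver, firmware, and CANN versions are not checked here.

```shell
# version_ok ACTUAL ALLOWED...: succeed if ACTUAL equals an allowed version,
# or extends it with ".x" (e.g. 3.8.10 for 3.8) or a "+suffix" (e.g. 2.1.0+cpu).
# Hypothetical helper, not part of ModelLink.
version_ok() {
    actual=$1
    shift
    for want in "$@"; do
        case "$actual" in
            "$want"|"$want".*|"$want"+*) return 0 ;;
        esac
    done
    return 1
}

# Query the local environment; fall back to a placeholder if a package is missing.
py_ver=$(python3 -c 'import platform; print(platform.python_version())' 2>/dev/null || echo unavailable)
torch_ver=$(python3 -c 'import torch; print(torch.__version__)' 2>/dev/null || echo unavailable)

if version_ok "$py_ver" 3.8; then
    echo "Python $py_ver: ok"
else
    echo "Python $py_ver: expected 3.8"
fi

if version_ok "$torch_ver" 2.1.0 2.2.0; then
    echo "torch $torch_ver: ok"
else
    echo "torch $torch_ver: expected 2.1.0 or 2.2.0"
fi
```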

Based on the current version of Megatron, the performance measured in our testing is as follows:

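| Model | Parameters | Cluster Scale | Precision Mode | Performance | Reference Performance |
|---|---|---|---|---|---|
| Aquila | 7B | 1x8 | BF16 | 2849 | 2874 |
| Aquila2 | 7B | 1x8 | FP16 | 3323 | 2673 |
| | 34B | 2x8 | BF16 | 854 | 732 |
| Baichuan | 7B | 1x8 | FP16 | 2685 | 2036 |
| | 13B | 1x8 | FP16 | 1213 | 862 |
| Baichuan2 | 7B | 1x8 | BF16 | 2664 | 3969 |
| | 13B | 1x8 | BF16 | 1668 | 2062 |
| Bloom | 7B1 | 1x8 | FP16 | 2034 | 2525 |
| | 176B | 12x8 | BF16 | 100 | 107 |
| CodeLlama | 34B | 2x8 | BF16 | 837 | 762 |
| InternLM | 7B | 1x8 | BF16 | 2776 | 2854 |
| | 65B | 4x8 | BF16 | 341 | 414 |
| LLaMA | 7B | 1x8 | FP16 | 3600 | 3804 |
| | 13B | 1x8 | FP16 | 1895 | 2012 |
| | 33B | 4x8 | FP16 | 621 | 776 |
| | 65B | 4x8 | BF16 | 348 | 426 |
| LLaMA2 | 7B | 1x8 | BF16 | 4200 | 3850 |
| | 13B | 1x8 | BF16 | 1990 | 1920 |
| | 34B | 2x8 | BF16 | 690 | 796 |
| | 70B | 8x8 | BF16 | 350 | 339 |
| LLaMA3 | 8B | 1x8 | BF16 | 2483 | 2674 |
| | 70B | 8x8 | BF16 | 283 | 355 |
| Qwen | 7B | 1x8 | BF16 | 2499 | 2867 |
| | 14B | 1x8 | BF16 | 1560 | 1578 |
| | 72B | 16x8 | BF16 | 285 | 345 |
| Qwen1.5 | 7B | 1x8 | BF16 | 2862 | 2621 |
| | 14B | 1x8 | BF16 | 1717 | 1702 |
| Yi | 34B | 2x8 | BF16 | 809 | 730 |
| Mixtral | 8x7B | 2x8 | BF16 | 487 | 610 |
| Mistral | 7B | 1x8 | BF16 | 2806 | 2734 |
| Gemma | 2B | 1x8 | BF16 | 6821 | 7602 |
| | 7B | 1x8 | BF16 | 2938 | 2607 |
| GPT3 | 175B | 16x8 | FP16 | 153 | -- |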
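
To read the table at a glance, the measured value can be expressed as a percentage of the reference value. The sketch below does this with `awk`; `rel_perf` is a hypothetical helper, and it assumes both columns use the same throughput unit.

```shell
# rel_perf MEASURED REFERENCE: print MEASURED as a percentage of REFERENCE.
# Hypothetical helper for reading the table; not part of ModelLink.
rel_perf() {
    awk -v m="$1" -v r="$2" 'BEGIN { printf "%.1f%%\n", 100 * m / r }'
}

rel_perf 4200 3850   # LLaMA2 7B row: prints 109.1%
rel_perf 2664 3969   # Baichuan2 7B row: prints 67.1%
```

Values above 100% mean the measured figure exceeds the reference for that row.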

Acceleration Features

ModelLink supports various acceleration algorithms, including tensor parallelism, pipeline parallelism, sequence parallelism, recomputation, and the distributed optimizer. The table below lists the switch that enables each acceleration feature:

| Acceleration Feature | Enable Parameter |
|---|---|
| Tensor Parallel | --tensor-model-parallel-size |
| Pipeline Parallel | --pipeline-model-parallel-size |
| Dynamic division for PP | --num-layer-list |
| Sequence Parallel | --sequence-parallel |
| Recomputation | --recompute-granularity |
| Distributed Optimizer | --use-distributed-optimizer |
| Overlap DDP allreduce | --overlap-grad-reduce |
| Overlap DDP allgather | --overlap-param-gather |
| Flash attention | --use-flash-attn |
| Fused rmsnorm | --use-fused-rmsnorm |
| Fused swiglu | --use-fused-swiglu |
| mc2 | --use-mc2 |
| Fused rotary position embedding | --use-fused-rotary-pos-emb |
| Sliding window attention | --sliding-window |

    torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
        --tensor-model-parallel-size ${TP} \
        --pipeline-model-parallel-size ${PP} \
        --num-layer-list 1,2,2,2,1 \
        --sequence-parallel \
        --recompute-granularity full \
        --recompute-method block \
        --recompute-num-layers 72 \
        --use-distributed-optimizer \
        --use-flash-attn \
        --use-fused-rmsnorm \
        --use-fused-swiglu \
        --overlap-grad-reduce \
        --overlap-param-gather \
        --use-fused-rotary-pos-emb \
        --use-mc2 \
        --sliding-window 4096 \
        ... \
        ...
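
When choosing `TP` and `PP`, two quick sanity checks can catch a bad layout before launch. The sketch below is illustrative only. `data_parallel_size` relies on the usual Megatron-style relation that the world size must be divisible by TP × PP, with the quotient being the data-parallel size (other parallel dimensions are ignored here). `check_layer_list` assumes that `--num-layer-list` carries one entry per pipeline stage and that the entries sum to the model's layer count. Both helper names are hypothetical.

```shell
# data_parallel_size WORLD_SIZE TP PP: print WORLD_SIZE / (TP * PP), or fail if not divisible.
# Hypothetical helper; ignores any parallel dimension other than TP and PP.
data_parallel_size() {
    world=$1
    model_parallel=$(( $2 * $3 ))
    if [ $(( world % model_parallel )) -ne 0 ]; then
        echo "world size $world is not divisible by TP*PP=$model_parallel" >&2
        return 1
    fi
    echo $(( world / model_parallel ))
}

# check_layer_list LIST PP NUM_LAYERS: succeed if the comma-separated LIST has PP entries
# that sum to NUM_LAYERS (assumed meaning of --num-layer-list).
check_layer_list() {
    stages=0
    total=0
    old_ifs=$IFS
    IFS=,
    for n in $1; do
        stages=$(( stages + 1 ))
        total=$(( total + n ))
    done
    IFS=$old_ifs
    [ "$stages" -eq "$2" ] && [ "$total" -eq "$3" ]
}

# Illustrative values: 16 devices with TP=2 and PP=4 leave a data-parallel size of 2.
data_parallel_size 16 2 4

# The list 1,2,2,2,1 describes 5 stages holding 8 layers in total.
check_layer_list 1,2,2,2,1 5 8 && echo "layer list is consistent"
```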

Analyze profiling data based on Ascend chips

ModelLink supports collecting and analyzing profiling data on Ascend chips, which is useful for model performance analysis. Profiling is controlled by the following arguments:

    --profile                          # enable profiling
    --profile-step-start 5             # the start step
    --profile-step-end 6               # the end step
    --profile-ranks 0 1 2 3 4          # ranks to profile
    --profile-level level2             # profiling level: level0, level1, or level2
    --profile-with-cpu                 # collect CPU information
    --profile-with-stack               # collect stack information
    --profile-with-memory              # collect memory information
    --profile-record-shapes            # record shape information
    --profile-save-path ./profile_dir  # path to save data
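
One way to keep these options out of the main command is to build them in a small helper and append the result to the launch line. The sketch below assumes a hypothetical helper `profile_args` that profiles rank 0 only and defaults to `level0`; the flag names come from the list above, and the final `echo` only prints the command rather than running it.

```shell
# profile_args START END SAVE_DIR [LEVEL]: print profiling flags for one step window.
# Hypothetical helper; profiles rank 0 only and defaults to level0.
profile_args() {
    echo "--profile" \
         "--profile-step-start $1" \
         "--profile-step-end $2" \
         "--profile-ranks 0" \
         "--profile-level ${4:-level0}" \
         "--profile-save-path $3"
}

PROFILE_ARGS=$(profile_args 5 6 ./profile_dir)

# Print the launch line for inspection; the remaining training flags are elided.
echo "torchrun \$DISTRIBUTED_ARGS pretrain_gpt.py ... $PROFILE_ARGS"
```

Restricting the ranks and the step window keeps the collected data small, which is usually enough for a first look.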

Enable deterministic computing based on Ascend chips

  • Add the option to the training script:
    --use-deter-comp
  • Set the environment variable:
    export HCCL_DETERMINISTIC=True
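
Since both steps have to be applied together, a single toggle in the launch script can keep them in sync. The sketch below is an assumption about how one might wire this up: `setup_determinism` and `DETER_ARGS` are hypothetical names, and the variable is spelled `HCCL_DETERMINISTIC`, the spelling used by HCCL.

```shell
# setup_determinism on|off: set or clear the environment variable and the script option together.
# Hypothetical helper; DETER_ARGS is meant to be appended to the training command.
setup_determinism() {
    if [ "$1" = "on" ]; then
        export HCCL_DETERMINISTIC=True
        DETER_ARGS="--use-deter-comp"
    else
        unset HCCL_DETERMINISTIC
        DETER_ARGS=""
    fi
}

setup_determinism on

# Print the launch line for inspection; the remaining training flags are elided.
echo "torchrun \$DISTRIBUTED_ARGS pretrain_gpt.py ... $DETER_ARGS"
```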

Acknowledgments


ModelLink is jointly contributed by the following departments of Huawei Corporation:

  • Ascend Computing Product Unit
  • Algorithm Unit of Computing Product Unit
  • Research Unit of Computing Product Unit
  • Open Computing Kit of Computing Product Unit
  • General Development Department
  • Global Technical Service Department

We appreciate every PR from the community, and contributions to ModelLink are welcome.

Appendix