ModelLink provides end-to-end solutions for large language models on Ascend chips, including models, algorithms and tasks.
ModelLink Solution Overview
Supported features
ModelLink currently supports the following features for large-model workflows:
- Dataset preparation for pre-training / Instruction dataset preparation for fine-tuning
- Pre-training/Full-parameter Fine-tuning/Low-parameter Fine-tuning
- Inference: human-machine dialogue
- Evaluation with numerous benchmarks
- Utilizing Acceleration Features (Acceleration Algorithms + Fusion Operators)
- Profiling data based on Ascend chips
- Checkpoint conversion between Hugging Face and Megatron formats (see the sketch after this list)
- Enable deterministic computing on Ascend
More novel and useful features for LLM training on Ascend are under development ...
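As an illustration of the checkpoint-conversion feature above, a Hugging Face → Megatron conversion typically looks like the sketch below. The script path, loader/saver names, parallel sizes, and directories are assumptions for illustration only; the exact invocation for each model is given in its readme under the examples folder.

```shell
# Minimal sketch, assuming a convert_ckpt.py entry point and a llama2_hf loader;
# consult the per-model readme in examples/ for the actual script and arguments.
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --target-pipeline-parallel-size 1 \
    --load-dir ./model_from_hf/llama-2-7b-hf/ \
    --save-dir ./model_weights/llama-2-7b-tp8-pp1/ \
    --tokenizer-model ./model_from_hf/llama-2-7b-hf/tokenizer.model
```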
Supported Models
ModelLink currently supports pre-training and fine-tuning for the following models:
Model | Scale | Pretrain | Inference | LoRA | SFT | Chat | Evaluation | Contributor |
---|---|---|---|---|---|---|---|---|
Aquila | 7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
Aquila2 | 7B | pretrain | generate | -- | -- | -- | eval | 【Community】 |
Aquila2 | 34B | pretrain | generate | -- | -- | -- | eval | 【Community】 |
Baichuan | 7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
Baichuan | 13B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
Baichuan2 | 7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
Baichuan2 | 13B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
Bloom | 7B1 | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
Bloom | 176B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
CodeLlama | 34B | pretrain | generate | -- | -- | -- | eval | 【Community】 |
InternLM | 7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
InternLM | 65B | pretrain | -- | -- | -- | -- | -- | 【Ascend】 |
LLaMA | 7B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
LLaMA | 13B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
LLaMA | 33B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
LLaMA | 65B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
LLaMA2 | 7B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
LLaMA2 | 13B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
LLaMA2 | 34B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
LLaMA2 | 70B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
LLaMA3 | 8B | pretrain | generate | -- | -- | chat | eval | 【Ascend】 |
LLaMA3 | 70B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
Qwen | 7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
Qwen | 14B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
Qwen | 72B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
Qwen1.5 | 7B | pretrain | generate | -- | -- | -- | eval | 【Community】 |
Qwen1.5 | 14B | pretrain | generate | -- | -- | -- | eval | 【Community】 |
Yi | 34B | pretrain | generate | -- | -- | -- | eval | 【Community】 |
Mixtral | 8x7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
Mistral | 7B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
Gemma | 2B | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
Gemma | 7B | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
GPT3 | 175B | pretrain | -- | -- | -- | -- | -- | 【Community】 |
Script Naming Rules
Script | Rule |
---|---|
pretrain_xxx.sh | Pre-training Script |
tune_xxx.sh | Fine-tuning Script |
generate_xxx.sh | Inference Script |
xxx_chat_xxx.sh | Chat Script |
evaluation_xxx.sh | Evaluation Script |
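For example, the scripts for a LLaMA2-7B model under the examples folder would follow this pattern; the paths below are hypothetical and only illustrate the naming rules, so check the per-model readme for the real file names:

```shell
bash examples/llama2/pretrain_llama2_7b_ptd.sh     # pre-training
bash examples/llama2/generate_llama2_7b_ptd.sh     # inference (human-machine dialogue)
bash examples/llama2/evaluation_llama2_7b_ptd.sh   # evaluation
```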
Model Usage Guide and Version Notes
For the supported models listed above, we provide training scripts and readme instructions in the examples folder, covering the full process of model training, inference, and evaluation.
【Please note the corresponding environment versions for model usage, as follows】
Software | Version |
---|---|
Python | 3.8 |
driver | Ascend HDK 23.0.0 |
firmware | Ascend HDK 23.0.0 |
CANN | CANN 7.0.0 |
torch | 2.1.0, 2.2.0 |
torch_npu | release v5.0.0 |
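A quick sanity check of the installed environment against the table above might look like this; it assumes a standard Ascend install where the torch_npu package and the npu-smi tool are available:

```shell
# Print torch / torch_npu versions and confirm that NPU devices are visible
python -c "import torch, torch_npu; print(torch.__version__, torch_npu.__version__, torch.npu.is_available())"
# Show driver and firmware information for the installed NPUs
npu-smi info
```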
【Based on the current version of megatron, the performance statistics from our testing are as follows】
Model | Parameters | Cluster Scale | Precision Mode | Performance (tokens/s/p) | Reference Performance (tokens/s/p) |
---|---|---|---|---|---|
Aquila | 7B | 1x8 | BF16 | 2849 | 2874 |
Aquila2 | 7B | 1x8 | FP16 | 3323 | 2673 |
Aquila2 | 34B | 2x8 | BF16 | 854 | 732 |
Baichuan | 7B | 1x8 | FP16 | 2685 | 2036 |
Baichuan | 13B | 1x8 | FP16 | 1213 | 862 |
Baichuan2 | 7B | 1x8 | BF16 | 2664 | 3969 |
Baichuan2 | 13B | 1x8 | BF16 | 1668 | 2062 |
Bloom | 7B1 | 1x8 | FP16 | 2034 | 2525 |
Bloom | 176B | 12x8 | BF16 | 100 | 107 |
CodeLlama | 34B | 2x8 | BF16 | 837 | 762 |
InternLM | 7B | 1x8 | BF16 | 2776 | 2854 |
InternLM | 65B | 4x8 | BF16 | 341 | 414 |
LLaMA | 7B | 1x8 | FP16 | 3600 | 3804 |
LLaMA | 13B | 1x8 | FP16 | 1895 | 2012 |
LLaMA | 33B | 4x8 | FP16 | 621 | 776 |
LLaMA | 65B | 4x8 | BF16 | 348 | 426 |
LLaMA2 | 7B | 1x8 | BF16 | 4200 | 3850 |
LLaMA2 | 13B | 1x8 | BF16 | 1990 | 1920 |
LLaMA2 | 34B | 2x8 | BF16 | 690 | 796 |
LLaMA2 | 70B | 8x8 | BF16 | 350 | 339 |
LLaMA3 | 8B | 1x8 | BF16 | 2483 | 2674 |
LLaMA3 | 70B | 8x8 | BF16 | 283 | 355 |
Qwen | 7B | 1x8 | BF16 | 2499 | 2867 |
Qwen | 14B | 1x8 | BF16 | 1560 | 1578 |
Qwen | 72B | 16x8 | BF16 | 285 | 345 |
Qwen1.5 | 7B | 1x8 | BF16 | 2862 | 2621 |
Qwen1.5 | 14B | 1x8 | BF16 | 1717 | 1702 |
Yi | 34B | 2x8 | BF16 | 809 | 730 |
Mixtral | 8x7B | 2x8 | BF16 | 487 | 610 |
Mistral | 7B | 1x8 | BF16 | 2806 | 2734 |
Gemma | 2B | 1x8 | BF16 | 6821 | 7602 |
Gemma | 7B | 1x8 | BF16 | 2938 | 2607 |
GPT3 | 175B | 16x8 | FP16 | 153 | -- |
Acceleration Features
ModelLink supports various acceleration algorithms such as tensor parallelism, pipeline parallelism, sequence parallelism, recomputation, distributed optimizer, and more. The table below shows the enable switches corresponding to each acceleration feature:
Acceleration Feature | Enable Parameter |
---|---|
Tensor Parallel | --tensor-model-parallel-size |
Pipeline Parallel | --pipeline-model-parallel-size |
Dynamic division for PP | --num-layer-list |
Sequence Parallel | --sequence-parallel |
Recomputation | --recompute-granularity |
Distributed Optimizer | --use-distributed-optimizer |
overlap DDP allreduce | --overlap-grad-reduce |
overlap DDP allgather | --overlap-param-gather |
Flash attention | --use-flash-attn |
Fused rmsnorm | --use-fused-rmsnorm |
Fused swiglu | --use-fused-swiglu |
mc2 | --use-mc2 |
Fused rotary position embedding | --use-fused-rotary-pos-emb |
Sliding window attention | --sliding-window |
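For example, a pre-training launch that enables most of these features looks like the following; the layer split, recompute depth, and window size are model-specific values: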
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--num-layer-list 1,2,2,2,1 \
--sequence-parallel \
--recompute-granularity full \
--recompute-method block \
--recompute-num-layers 72 \
--use-distributed-optimizer \
--use-flash-attn \
--use-fused-rmsnorm \
--use-fused-swiglu \
--overlap-grad-reduce \
--overlap-param-gather \
--use-fused-rotary-pos-emb \
--use-mc2 \
--sliding-window 4096 \
... \
...
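A note on the layer-related values above: --num-layer-list assigns a custom number of transformer layers to each pipeline stage, so its entries should sum to the model's total layer count (the example splits 8 layers across 5 stages), and --recompute-num-layers with --recompute-method block is interpreted per pipeline stage in Megatron.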
Analyzing profiling data based on Ascend chips
ModelLink supports collecting profiling data on Ascend chips, which is useful for performance analysis and modeling. The relevant arguments are:
--profile # enable profiling
--profile-step-start 5 # the start step
--profile-step-end 6 # the end step
--profile-ranks 0 1 2 3 4 # ranks for profiling
--profile-level level2 # level0, 1, 2 for data profiling
--profile-with-cpu # profiling cpu information
--profile-with-stack # profile stack information
--profile-with-memory # profile memory information
--profile-record-shapes # profile shape information
--profile-save-path ./profile_dir # path to save data
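Put together, these switches are simply appended to the training launch; a minimal sketch, where the step range, ranks, and output directory are illustrative values:

```shell
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    ... \
    --profile \
    --profile-step-start 5 \
    --profile-step-end 6 \
    --profile-ranks 0 1 2 3 4 \
    --profile-level level2 \
    --profile-with-cpu \
    --profile-with-stack \
    --profile-with-memory \
    --profile-record-shapes \
    --profile-save-path ./profile_dir
```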
Enable deterministic computing based on Ascend chips
- Add the following option to the training script:
--use-deter-comp
- Set the following environment variable:
export HCCL_DETERMINISTIC=True
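Combined, a deterministic run sets the environment variable and passes the switch in the same launch; a minimal sketch:

```shell
# Minimal sketch: enable deterministic computing for a pre-training run
export HCCL_DETERMINISTIC=True
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    --use-deter-comp \
    ... \
    ...
```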
Acknowledgments
ModelLink is a joint contribution from the following departments of Huawei Corporation:
- Ascend Computing Product Unit
- Algorithm Unit of Computing Product Unit
- Research Unit of Computing Product Unit
- Open Computing Kit of Computing Product Unit
- General Development Department
- Global Technical Service Department
We appreciate every PR from the community and welcome contributions to ModelLink.
Appendix
- Safety Statement