

简体中文 | English

ModelLink provides end-to-end solutions for large language models on Ascend chips, including models, algorithms and tasks.


Supported features

Features currently supported by ModelLink for large model usage:

More novel and useful features for LLM training on Ascend are under development ...

Supported Models

ModelLink currently supports pre-training and fine-tuning for the following models:

| Model | Parameters | Sequence length | Pretrain | Inference | LoRA | SFT | Chat | Evaluation | Contributor |
|-------|------------|-----------------|----------|-----------|------|-----|------|------------|-------------|
| Aquila | 7B | 2K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Aquila2 | 7B | 2K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| | 34B | 4K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| Baichuan | 7B | 4K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| | 13B | 4K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Baichuan2 | 7B | 4K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| | 13B | 4K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Bloom | 7B1 | 2K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| | 176B | 2K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| ChatGLM3 | 6B | 8K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| CodeLlama | 34B | 4K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| InternLM | 7B | 2K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| | 65B | 2K | pretrain | -- | -- | -- | -- | -- | 【Ascend】 |
| LLaMA | 7B | 2K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| | 13B | 2K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| | 33B | 2K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| | 65B | 2K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA2 | 7B | 4K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| | 13B | 4K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| | 34B | 4K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| | 70B | 4K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA3 | 8B | 8K | pretrain | generate | -- | -- | chat | eval | 【Ascend】 |
| | 70B | 8K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Qwen | 7B | 8K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| | 14B | 2K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| | 72B | 8K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Qwen1.5 | 0.5B | 8K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| | 1.8B | 8K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| | 4B | 8K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| | 7B | 8K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| | 14B | 8K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| | 32B | 8K | pretrain | generate | lora | -- | -- | eval | 【Community】 |
| | 72B | 8K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| Yi | 34B | 4K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| Mixtral | 8x7B | 32K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Mistral | 7B | 32K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Gemma | 2B | 8K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| | 7B | 8K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| GPT3 | 175B | 2K | pretrain | -- | -- | -- | -- | -- | 【Community】 |

Script Naming Rules

| Script | Rule |
|--------|------|
| pretrain_xxx.sh | Pre-training Script |
| tune_xxx.sh | Fine-tuning Script |
| generate_xxx.sh | Inference Script |
| xxx_chat_xxx.sh | Chat Script |
| evaluation_xxx.sh | Evaluation Script |

Model Usage Guide and Version Notes

For the supported models listed above, we provide training scripts and README instructions in the examples folder, which describe the detailed processes for model training, inference, and evaluation.
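As an illustration of the naming rules above, a run is started by executing the corresponding script. The paths here are hypothetical; check each model's directory under examples for the script names that actually ship with it.

```shell
# Hypothetical paths for illustration only; see the examples folder for the
# scripts provided with each model.
bash examples/llama2/pretrain_llama2_7b_ptd.sh      # pre-training
bash examples/llama2/generate_llama2_7b_ptd.sh      # inference
bash examples/llama2/evaluation_llama2_7b_ptd.sh    # evaluation
```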

【Please note the environment versions required for model usage, listed below】

| Software | Version |
|----------|---------|
| Python | 3.8 |
| driver | under development version |
| firmware | under development version |
| CANN | under development version |
| torch | 2.1.0, 2.2.0 |
| torch_npu | under development version |

【Based on the current version of Megatron, the performance statistics from our testing are as follows (hardware: Atlas 900 A2 PODc)】

| Model | Parameters | Sequence length | Cluster Scale | Precision Mode | Performance | Reference Performance |
|-------|------------|-----------------|---------------|----------------|-------------|-----------------------|
| Aquila | 7B | 2K | 1x8 | BF16 | 2849 | 2874 |
| Aquila2 | 7B | 2K | 1x8 | FP16 | 3323 | 2673 |
| | 34B | 4K | 2x8 | BF16 | 854 | 732 |
| Baichuan | 7B | 4K | 1x8 | FP16 | 2685 | 2036 |
| | 13B | 4K | 1x8 | FP16 | 1213 | 862 |
| Baichuan2 | 7B | 4K | 1x8 | BF16 | 2664 | 3969 |
| | 13B | 4K | 1x8 | BF16 | 1668 | 2062 |
| Bloom | 7B1 | 2K | 1x8 | FP16 | 2034 | 2525 |
| | 176B | 2K | 12x8 | BF16 | 100 | 107 |
| ChatGLM3 | 6B | 8K | 1x8 | FP16 | 4297 | 4267 |
| CodeLlama | 34B | 4K | 2x8 | BF16 | 837 | 762 |
| InternLM | 7B | 2K | 1x8 | BF16 | 2776 | 2854 |
| | 65B | 2K | 4x8 | BF16 | 341 | 414 |
| LLaMA | 7B | 2K | 1x8 | FP16 | 3600 | 3804 |
| | 13B | 2K | 1x8 | FP16 | 1895 | 2012 |
| | 33B | 2K | 4x8 | FP16 | 621 | 776 |
| | 65B | 2K | 4x8 | BF16 | 348 | 426 |
| LLaMA2 | 7B | 4K | 1x8 | BF16 | 4200 | 3850 |
| | 13B | 4K | 1x8 | BF16 | 1990 | 1920 |
| | 34B | 4K | 2x8 | BF16 | 749 | 796 |
| | 70B | 4K | 4x8 | BF16 | 420 | 430 |
| LLaMA3 | 8B | 8K | 1x8 | BF16 | 2483 | 2674 |
| | 70B | 8K | 8x8 | BF16 | 283 | 355 |
| Qwen | 7B | 8K | 1x8 | BF16 | 2499 | 2867 |
| | 14B | 2K | 1x8 | BF16 | 1560 | 1578 |
| | 72B | 8K | 16x8 | BF16 | 285 | 345 |
| Qwen1.5 | 0.5B | 8K | 1x8 | BF16 | 22834 | 25306 |
| | 1.8B | 8K | 1x8 | BF16 | 13029 | 12181 |
| | 4B | 8K | 1x8 | BF16 | 5033 | 5328 |
| | 7B | 8K | 1x8 | BF16 | 2862 | 2621 |
| | 14B | 8K | 1x8 | BF16 | 1717 | 1702 |
| | 32B | 8K | 4x8 | BF16 | 751 | 708 |
| | 72B | 8K | 8x8 | BF16 | 301 | 317 |
| Yi | 34B | 4K | 2x8 | BF16 | 809 | 730 |
| Mixtral | 8x7B | 32K | 2x8 | BF16 | 487 | 610 |
| Mistral | 7B | 32K | 1x8 | BF16 | 2806 | 2734 |
| Gemma | 2B | 8K | 1x8 | BF16 | 6821 | 7602 |
| | 7B | 8K | 1x8 | BF16 | 2938 | 2607 |
| GPT3 | 175B | 2K | 16x8 | FP16 | 153 | -- |

Acceleration Features

ModelLink supports various acceleration algorithms such as tensor parallelism, pipeline parallelism, sequence parallelism, recomputation, the distributed optimizer, and more. The table below lists the switch that enables each acceleration feature:

| Acceleration Feature | Enable Parameter |
|----------------------|------------------|
| Tensor Parallel | --tensor-model-parallel-size |
| Pipeline Parallel | --pipeline-model-parallel-size |
| Dynamic division for PP | --num-layer-list |
| Sequence Parallel | --sequence-parallel |
| Recomputation | --recompute-granularity |
| Distributed Optimizer | --use-distributed-optimizer |
| Overlap DDP allreduce | --overlap-grad-reduce |
| Flash attention | --use-flash-attn |
| Fused rmsnorm | --use-fused-rmsnorm |
| Fused swiglu | --use-fused-swiglu |
| mc2 | --use-mc2 |
| Fused rotary position embedding | --use-fused-rotary-pos-emb |
| Sliding window attention | --sliding-window |
For example, these switches can be combined in a single training launch command:

```shell
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    --tensor-model-parallel-size ${TP} \
    --pipeline-model-parallel-size ${PP} \
    --num-layer-list 1,2,2,2,1 \
    --sequence-parallel \
    --recompute-granularity full \
    --recompute-method block \
    --recompute-num-layers 72 \
    --use-distributed-optimizer \
    --use-flash-attn \
    --use-fused-rmsnorm \
    --use-fused-swiglu \
    --overlap-grad-reduce \
    --use-fused-rotary-pos-emb \
    --use-mc2 \
    --sliding-window 4096 \
    ... \
    ...
```
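A quick sanity check when picking parallel sizes (a sketch with assumed shell variable names, not part of the ModelLink scripts): the product of the tensor- and pipeline-parallel sizes must divide the total device count, and the remaining factor becomes the data-parallel size.

```shell
# Sketch only; variable names are illustrative and not defined by ModelLink.
NPUS_PER_NODE=8
NNODES=2
WORLD_SIZE=$((NPUS_PER_NODE * NNODES))   # 16 devices in a 2x8 cluster
TP=8                                     # --tensor-model-parallel-size
PP=2                                     # --pipeline-model-parallel-size
DP=$((WORLD_SIZE / (TP * PP)))           # implied data-parallel size, here 1
echo "TP=${TP} PP=${PP} DP=${DP}"
```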

Analyze profiling data on Ascend chips

ModelLink supports collecting profiling data on Ascend chips, which is useful for performance modelling and analysis:

```shell
--profile                             # enable profiling
--profile-step-start 5                # start step
--profile-step-end 6                  # end step
--profile-ranks 0 1 2 3 4             # ranks to profile
--profile-level level2                # profiling level: level0, level1, level2
--profile-with-cpu                    # profile CPU information
--profile-with-stack                  # profile stack information
--profile-with-memory                 # profile memory information
--profile-record-shapes               # profile shape information
--profile-save-path ./profile_dir     # path to save profiling data
```
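For illustration (assuming the same torchrun launch used in the acceleration example above), the profiling switches are appended to the pre-training arguments like any other flag:

```shell
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    ... \
    --profile \
    --profile-step-start 5 \
    --profile-step-end 6 \
    --profile-ranks 0 1 2 3 4 \
    --profile-level level2 \
    --profile-with-cpu \
    --profile-record-shapes \
    --profile-save-path ./profile_dir
```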

Enable deterministic computing on Ascend chips

  • Add the option to the training script:
```shell
--use-deter-comp
```
  • Set the environment variable:
```shell
export HCCL_DETERMINISTIC=True
```
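Putting both together, a sketch reusing the launch command from the examples above:

```shell
# Set the environment variable before launching, then pass the flag to the script.
export HCCL_DETERMINISTIC=True

torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    --use-deter-comp \
    ...
```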

Acknowledgments


ModelLink is jointly contributed by the following departments of Huawei Corporation:

  • Ascend Computing Product Unit
  • Algorithm Unit of Computing Product Unit
  • Research Unit of Computing Product Unit
  • Open Computing Kit of Computing Product Unit
  • General Development Department
  • Global Technical Service Department

We appreciate every PR from the community and welcome contributions to ModelLink.

Appendix