mirror of https://gitee.com/ascend/ModelLink.git synced 2024-12-11 01:18:47 +08:00

guoxinjie 62c40eef76 !1240 GPT3-175B 整理上库

Merge pull request !1240 from guoxinjie/gelu

2024-05-27 02:34:20 +00:00

ModelLink provides end-to-end solutions for large language models on Ascend chips, including models, algorithms and tasks.

ModelLink Solution Overview

Supported features

ModelLink currently supports the following features for large models:

Dataset preparation for pre-training and instruction dataset preparation for fine-tuning
Pre-training/Full-parameter Fine-tuning/Low-parameter Fine-tuning
Inference: human-machine dialogue
Evaluation with numerous benchmarks
Utilizing Acceleration Features (Acceleration Algorithms + Fusion Operators)
Profiling data based on Ascend chips
Convert ckpt between huggingface and megatron
Enable deterministic computing for Ascend

More novel and useful features for LLM training on Ascend are under development ...

Supported Models

ModelLink currently supports pre-training and fine-tuning for the following models:

Model	Scale	Pretrain	Inference	LoRA	SFT	Chat	Evaluation	Contributor
Aquila	7B	pretrain	generate	--	--	--	eval	【Ascend】
Aquila2	7B	pretrain	generate	--	--	--	eval	【Community】
Aquila2	34B	pretrain	generate	--	--	--	eval	【Community】
Baichuan	7B	pretrain	generate	--	--	--	eval	【Ascend】
Baichuan	13B	pretrain	generate	--	--	--	eval	【Ascend】
Baichuan2	7B	pretrain	generate	--	--	--	eval	【Ascend】
Baichuan2	13B	pretrain	generate	--	--	--	eval	【Ascend】
Bloom	7B1	pretrain	generate	--	--	--	eval	【Ascend】
Bloom	176B	pretrain	generate	--	--	--	eval	【Ascend】
CodeLlama	34B	pretrain	generate	--	--	--	eval	【Community】
InternLM	7B	pretrain	generate	--	--	--	eval	【Ascend】
InternLM	65B	pretrain	--	--	--	--	--	【Ascend】
LLaMA	7B	pretrain	generate	lora	--	--	eval	【Ascend】
	13B	pretrain	generate	lora	--	--	eval	【Ascend】
	33B	pretrain	generate	lora	--	--	eval	【Ascend】
	65B	pretrain	generate	lora	--	--	eval	【Ascend】
LLaMA2	7B	pretrain	generate	lora	--	--	eval	【Ascend】
	13B	pretrain	generate	lora	--	--	eval	【Ascend】
	34B	pretrain	generate	lora	--	--	eval	【Ascend】
	70B	pretrain	generate	lora	--	--	eval	【Ascend】
LLaMA3	8B	pretrain	generate	--	--	chat	eval	【Ascend】
LLaMA3	70B	pretrain	generate	--	--	--	eval	【Ascend】
Qwen	7B	pretrain	generate	--	--	--	eval	【Ascend】
	14B	pretrain	generate	--	--	--	eval	【Ascend】
	72B	pretrain	generate	--	--	--	eval	【Ascend】
Qwen1.5	7B	pretrain	generate	--	--	--	eval	【Community】
Qwen1.5	14B	pretrain	generate	--	--	--	eval	【Community】
Yi	34B	pretrain	generate	--	--	--	eval	【Community】
Mixtral	8x7B	pretrain	generate	--	--	--	eval	【Ascend】
Mistral	7B	pretrain	generate	--	--	--	eval	【Ascend】
Gemma	2B	pretrain	generate	--	--	--	eval	【Ascend】
Gemma	7B	pretrain	generate	lora	--	--	eval	【Ascend】
GPT3	175B	pretrain	--	--	--	--	--	【Community】

Script Naming Rules

Script	Rule
pretrain_xxx.sh	Pre-training Script
tune_xxx.sh	Fine-tuning Script
generate_xxx.sh	Inference Script
xxx_chat_xxx.sh	Chat Script
evaluation_xxx.sh	Evaluation Script
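
The naming rules above can be turned into a small helper that tells what a script does from its filename. This is only a sketch: `classify_script` is a hypothetical function (not part of ModelLink), and the order of the patterns is an assumption, so a name matching both a prefix rule and the `_chat_` rule is reported by its prefix.

```shell
# Classify a script by its filename, following the naming table above.
# classify_script is a hypothetical helper, not something shipped with ModelLink.
classify_script() {
    local name
    name="$(basename "$1")"
    case "$name" in
        pretrain_*.sh)   echo "pre-training" ;;
        tune_*.sh)       echo "fine-tuning" ;;
        generate_*.sh)   echo "inference" ;;
        evaluation_*.sh) echo "evaluation" ;;
        *_chat_*.sh)     echo "chat" ;;
        *)               echo "unknown" ;;
    esac
}

# The filenames below are made up for illustration.
classify_script examples/demo/pretrain_demo_7b.sh
classify_script demo_chat_7b.sh
```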

Model Usage Guide and Version Notes

For the supported models listed above, we provide training scripts and readme instructions in the examples folder, which contain detailed processes for model training, inference, and evaluation.

【The environment versions required for model usage are as follows】

Software	Version
Python	3.8
driver	Ascend HDK 23.0.0
firmware	Ascend HDK 23.0.0
CANN	CANN 7.0.0
torch	2.1.0, 2.2.0
torch_npu	release v5.0.0

【Based on the current version of Megatron, the performance statistics from our testing are as follows】

Model	Parameters	Cluster Scale	Precision Mode	Performance	Reference Performance
Aquila	7B	1x8	BF16	2849	2874
Aquila2	7B	1x8	FP16	3323	2673
Aquila2	34B	2x8	BF16	854	732
Baichuan	7B	1x8	FP16	2685	2036
Baichuan	13B	1x8	FP16	1213	862
Baichuan2	7B	1x8	BF16	2664	3969
Baichuan2	13B	1x8	BF16	1668	2062
Bloom	7B1	1x8	FP16	2034	2525
Bloom	176B	12x8	BF16	100	107
CodeLlama	34B	2x8	BF16	837	762
InternLM	7B	1x8	BF16	2776	2854
InternLM	65B	4x8	BF16	341	414
LLaMA	7B	1x8	FP16	3600	3804
	13B	1x8	FP16	1895	2012
	33B	4x8	FP16	621	776
	65B	4x8	BF16	348	426
LLaMA2	7B	1x8	BF16	4200	3850
	13B	1x8	BF16	1990	1920
	34B	2x8	BF16	690	796
	70B	8x8	BF16	350	339
LLaMA3	8B	1x8	BF16	2483	2674
LLaMA3	70B	8x8	BF16	283	355
Qwen	7B	1x8	BF16	2499	2867
	14B	1x8	BF16	1560	1578
	72B	16x8	BF16	285	345
Qwen1.5	7B	1x8	BF16	2862	2621
Qwen1.5	14B	1x8	BF16	1717	1702
Yi	34B	2x8	BF16	809	730
Mixtral	8x7B	2x8	BF16	487	610
Mistral	7B	1x8	BF16	2806	2734
Gemma	2B	1x8	BF16	6821	7602
Gemma	7B	1x8	BF16	2938	2607
GPT3	175B	16x8	FP16	153	--
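
To compare the two performance columns at a glance, the ratio of measured to reference performance can be computed per row. The sketch below assumes the rows are fed in as tab-separated `model, scale, performance, reference` fields; `relative_performance` is a hypothetical helper, and rows without a reference value (shown as `--` in the table) are reported as `n/a`.

```shell
# Print measured performance as a percentage of the reference value.
# Input: tab-separated lines of model, scale, performance, reference performance.
relative_performance() {
    awk -F'\t' '{
        if ($4 + 0 == 0)
            printf "%s-%s\tn/a\n", $1, $2                  # no reference value available
        else
            printf "%s-%s\t%.1f%%\n", $1, $2, 100 * $3 / $4
    }'
}

# Three rows copied from the table above.
printf 'Aquila\t7B\t2849\t2874\nLLaMA2\t7B\t4200\t3850\nGPT3\t175B\t153\t--\n' \
    | relative_performance
```

For the three sample rows this prints 99.1% for Aquila-7B, 109.1% for LLaMA2-7B, and `n/a` for GPT3-175B.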

Acceleration Features

ModelLink supports various acceleration algorithms such as tensor parallelism, pipeline parallelism, sequence parallelism, recomputation, distributed optimizer, and more. The table below shows the enable switches corresponding to each acceleration feature:

Acceleration Feature	Enable Parameter
Tensor Parallel	--tensor-model-parallel-size
Pipeline Parallel	--pipeline-model-parallel-size
Dynamic division for PP	--num-layer-list
Sequence Parallel	--sequence-parallel
Recomputation	--recompute-granularity
Distributed Optimizer	--use-distributed-optimizer
Overlap DDP allreduce	--overlap-grad-reduce
Overlap DDP allgather	--overlap-param-gather
Flash attention	--use-flash-attn
Fused rmsnorm	--use-fused-rmsnorm
Fused swiglu	--use-fused-swiglu
mc2	--use-mc2
Fused rotary position embedding	--use-fused-rotary-pos-emb
Sliding window attention	--sliding-window

torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    --tensor-model-parallel-size ${TP} \
    --pipeline-model-parallel-size ${PP} \
    --num-layer-list 1,2,2,2,1 \
    --sequence-parallel \
    --recompute-granularity full \
    --recompute-method block \
    --recompute-num-layers 72 \
    --use-distributed-optimizer \
    --use-flash-attn \
    --use-fused-rmsnorm \
    --use-fused-swiglu \
    --overlap-grad-reduce \
    --overlap-param-gather \
    --use-fused-rotary-pos-emb \
    --use-mc2 \
    --sliding-window 4096 \
    ... \
    ...
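
Before launching a command like the one above, it can help to sanity-check the parallel layout. The sketch below is not part of ModelLink: `check_layout` is a hypothetical helper, and the values of `TP`, `PP`, and `WORLD_SIZE` are made up. It relies on the Megatron requirement that the total number of devices be divisible by TP × PP, and it assumes that `--num-layer-list` carries one entry per pipeline stage.

```shell
# Hypothetical values for illustration; adjust them to your cluster.
TP=8
PP=5
WORLD_SIZE=40                 # total number of devices (assumed)
NUM_LAYER_LIST="1,2,2,2,1"    # same value as in the example command

# check_layout TP PP WORLD_SIZE NUM_LAYER_LIST
# Returns non-zero if the layout is inconsistent; otherwise prints a summary.
check_layout() {
    local tp=$1 pp=$2 world=$3 layer_list=$4
    if (( world % (tp * pp) != 0 )); then
        echo "world size $world is not divisible by TP*PP=$((tp * pp))" >&2
        return 1
    fi
    local -a layers
    IFS=',' read -r -a layers <<< "$layer_list"
    # Assumption: one list entry per pipeline stage.
    if (( ${#layers[@]} != pp )); then
        echo "--num-layer-list has ${#layers[@]} entries, expected $pp" >&2
        return 1
    fi
    local total=0 n
    for n in "${layers[@]}"; do
        total=$((total + n))
    done
    echo "data-parallel size: $((world / (tp * pp))), total layers: $total"
}

check_layout "$TP" "$PP" "$WORLD_SIZE" "$NUM_LAYER_LIST"
```

With the sample values, the check passes and reports a data-parallel size of 1 and 8 layers in total.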

Analyze profiling data based on Ascend chips

ModelLink supports collecting profiling data on Ascend chips for analysis, which is useful for performance modelling. The following options control profiling:

--profile                        # enable profiling
--profile-step-start 5           # the start step
--profile-step-end 6             # the end step
--profile-ranks 0 1 2 3 4        # ranks for profiling
--profile-level level2           # profiling level: level0, level1, or level2
--profile-with-cpu               # profiling cpu information
--profile-with-stack             # profile stack information
--profile-with-memory            # profile memory information
--profile-record-shapes          # profile shape information
--profile-save-path ./profile_dir    # path to save data
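
One way to keep these options manageable is to collect them in a bash array and append that array to the training command. This is a sketch under assumptions: `ENABLE_PROFILE` and `PROFILE_ARGS` are hypothetical names chosen for illustration, and the command is only printed here, not executed.

```shell
ENABLE_PROFILE=1    # hypothetical toggle, not a ModelLink setting

PROFILE_ARGS=()
if [ "$ENABLE_PROFILE" = "1" ]; then
    # Flags and values are taken from the list above.
    PROFILE_ARGS=(
        --profile
        --profile-step-start 5
        --profile-step-end 6
        --profile-ranks 0 1 2 3 4
        --profile-level level2
        --profile-with-cpu
        --profile-with-memory
        --profile-save-path ./profile_dir
    )
fi

# Print the resulting command line instead of running it.
echo "torchrun \$DISTRIBUTED_ARGS pretrain_gpt.py ${PROFILE_ARGS[*]}"
```

Setting `ENABLE_PROFILE=0` leaves the array empty, so the same launch line works with and without profiling.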

Enable deterministic computing based on Ascend chips

Add the following option to the training script:

--use-deter-comp

Set the following environment variable:

export HCCL_DETERMINISTIC=True

Acknowledgments

ModelLink is jointly contributed by the following departments of Huawei Corporation:

Ascend Computing Product Unit
Algorithm Unit of Computing Product Unit
Research Unit of Computing Product Unit
Open Computing Kit of Computing Product Unit
General Development Department
Global Technical Service Department

We appreciate every PR from the community and welcome contributions to ModelLink.

Appendix

Safety Statement: Safety Statement
