ModelLink2

mirror of https://gitee.com/ascend/ModelLink.git synced 2024-12-05 05:17:40 +08:00

MindSpeed-LLM Test Case Contribution Guide

CI Gate Watch List

Tests	Module	Structure	Features	Scripts	Acc.	Throu.	Mem.
ST	Pretrain	Mcore	TP, PP, VPP, recomputation, enable_recompute_layers_per_pp_rank, FA_TND	llama2_tp2_pp4_vpp2.sh	Y	Y	Y
		Mcore	cp_ring, distributed optimizer, reuse_fp32_param, recompute_activation_function, fused_rmsnorm, fused_swiglu, fused_rope, overlap_grad_reduce, overlap_param_gather	llama2_tp2_cp4_mem_recompute.sh	Y	Y	Y
		Mcore	cp_ring, general_cp, double_ring, distributed optimizer, reuse_fp32_param, recompute_activation_function, fused_rmsnorm, fused_swiglu, fused_rope, overlap_grad_reduce, overlap_param_gather	llama2_tp2_cp4_general_double_ring.sh	Y	Y	Y
		Mcore	recompute_in_advance, pp2vpp	llama3_tp2_pp2_vpp1.sh	Y	Y	Y
		Mcore	cp_hybrid, gqa	chatglm3_gqa_cp8.sh	Y	Y	Y
		Mcore	swap_attention, recompute_activation_function, enable_recompute_layers_per_pp_rank, reuse_fp32_param	llama2_tp2_pp4_vpp2_swap.sh	Y	Y	Y
		Mcore	glm_rope, rotary_percent	chatglm3_tp1_pp2_rope.sh	Y	Y	Y
		Mcore	EP, CP, num_experts, moe_router_topk, aux_loss, moe_allgather, group_query_attention, rotary_base	mixtral_mcore_tp4_cp2_ep2_ptd.sh	Y	Y	Y
		Mcore	moe_expert_capacity_factor, moe_alltoall, pad_to_capacity, topk_softmax_with_capacity	gpt4_mcore_tp4_cp2_32k_moe_drop.sh	Y	Y	Y
		Mcore	enable_high_availability	llama2_tp2_pp1_ha_save_ptd.sh	Y	Y	Y
		Mcore	mla_attention, moe_grouped_gemm, EP, allgather_dispatcher	deepseek_v2_mcore_tp1_pp1_ep8.sh	Y	Y	Y
		Mcore	post_norm, query_pre_attn_scalar, interleave_sliding_window, add_rmsnorm_offset, input_embeds_norm	gemma2_tp8_pp1_ptd.sh	Y	Y	Y
		Mcore	MOE, PP, EP, Drop, DPP, use_fused_moe_token_permute_and_unpermute	mixtral_tp1_pp4_ep2_drop_dpp.sh	Y	Y	Y
		Mcore	shared_experts, shared_expert_gate	qwen2_moe_tp1_pp2_ep2_cp2_32k_ptd.sh	Y	Y	Y
		Legacy	TP, PP, VPP, SP, full recomputation, fused_rmsnorm, fused_swiglu, fused_rope, overlap_grad_reduce	llama2_tp2_pp4_vpp2_legacy.sh	Y	Y	Y
	LoRA	Legacy	CCLoRA, TP, PP, full recomputation	tune_llama2_tp2_pp4_lora_ptd.sh	Y	Y	Y
		Legacy	CCLoRA (single device)	tune_llama2_tp1_pp1_lora_ptd.sh	Y	Y	Y
		Mcore	CCLoRA, TP, PP, MOE	tune_mixtral_tp2_pp2_lora_ptd.sh	Y	Y	Y
	FullSFT	Legacy	prompt_type, variable_seq_lengths	tune_qwen7b_tp8_pp1_full_ptd.sh	Y	Y	Y
		Mcore	prompt_type, variable_seq_lengths, VPP	tune_llama2_tp2_pp4_vpp2_mcore_full.sh	Y	Y	Y
		Mcore	adaptive CP, general_cp, SFT_pack_cp	tune_llama2_tp2_cp4_adaptive_cp.sh	Y	Y	Y
	RewardModel	Mcore	prompt_type, variable_seq_lengths	reward_chatglm3_tp2_pp4_full.sh	Y	Y	Y
UT	Inference	Legacy	greedy_search, lora_inference, deterministic_computation, chatglm3_inference, baichuan2_inference	test_inference.py	Y
	Evaluation	Legacy	mmlu, prompt_mmlu, prompt_boolq, prompt_ceval, qwen2_mmlu, lora_mmlu, agieval, humaneval, bbh	test_evaluate.py	Y
	CP	Mcore	hybrid	test_hybrid_context_parallel.py	Y
			ring_attn	test_ringattn_context_parallel.py	Y
			ulysses	test_ulysses_context_parallel.py	Y
			adaptive	test_adaptive_context_parallel.py	Y
	ModelModule	Mcore	rope	test_rotary_pos_embedding.py	Y
	ModelModule	Mcore	transformer_attention, alibi	test_attention.py	Y
	Checkpoint	Mcore	hf2mcore, tp, pp, ep, dpp, vpp, deepseek2; hf2mcore, tp, deepseek2	test_checkpoint.py	Y
		Mcore	hf2mcore, tp, pp, dpp, vpp, chatglm3, qwen2	test_hf2mcore.py	Y
		Legacy	legacy2mcore, lora	test_legacy2hf.py	Y
		Legacy	legacy2legacy, lora	test_legacy2legacy.py	Y
	ProcessData	Mcore	pretrain_data_alpaca, pretrain_merge_datasets, instruction_data_alpaca, instruction_merge_datasets	test_preprocess_data.py	Y
			instruction_data_alpaca, instruction_data_alpaca_history, instruction_data_sharegpt, instruction_data_openai	test_process_instruction_data_lf.py	Y
			instruction_data_handler	test_process_instruction_pack_data.py	Y
			pairwise_data_alpaca, pairwise_data_sharegpt	test_process_pairwise_data_lf.py	Y

Pipeline: Second-Level Pipeline Watch List

Model	Structure	Module	Test Case	Accuracy	Throughput	Memory
Baichuan2-13B	Legacy	pretrain	baichuan2_13B_legacy_tp8_pp1_ptd.sh	Y	Y	Y
		data_process	test_process_pretrain_data.py	Y
		ckpt_hf2mg	test_ckpt_hf2mg.py	Y
		inference	test_generation.py	Y
		evaluation	test_evaluation.py	Y
Chatglm3-6B	Legacy	pretrain	chatglm3_6B_legacy_tp1_pp2_ptd.sh	Y	Y	Y
		data_process	test_process_pretrain_data.py	Y
		inference	test_generation.py	Y
		evaluation	test_evaluation.py	Y
Bloom-7B	Legacy	pretrain	bloom_7B_legacy_tp8_pp1_ptd.sh	Y	Y	Y
		data_process	test_process_pretrain_data.py	Y
		inference	test_generation.py	Y
		evaluation	test_evaluation.py	Y
Gemma-7B	Legacy	pretrain	gemma_7B_legacy_tp8_pp1_ptd.sh	Y	Y	Y
		data_process	test_process_pretrain_data.py	Y
		inference	test_generation.py	Y
		evaluation	test_evaluation.py	Y
Qwen15-7B	Legacy	pretrain	qwen15_7B_legacy_tp8_pp1_ptd.sh	Y	Y	Y
		data_process	test_process_pretrain_data.py	Y
		inference	test_generation.py	Y
		evaluation	test_evaluation.py	Y

Development Rules

ST

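① Place contributed test scripts in the st/shell_scripts folder. Name each script {model name}_{partition strategy} or {model name}_{feature name}, for example llama2_tp2_pp4_vpp2_ptd.sh. Contributors must follow this convention strictly.

② Test scripts do not need to redirect their own logs. Log collection is handled centrally in st_run.sh.

③ Place baseline data in the st/baseline_results folder, with a file name that matches the shell script exactly. Otherwise the automation scripts will not find it.

④ To obtain baseline data, run the CI gate job to produce the first set of results and save them to a local log or txt file. Then run the transfer_logs_as_json function in st/st_utils/common.py locally to extract the data, and commit the result together with the test script.

⑤ When contributing, decide which metrics the case verifies: accuracy (Acc.), throughput (Throu.), and memory (Mem.). Enter Y in the column of each verified metric and leave the others blank.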
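
Taken together, rules ① and ③ mean that every script under st/shell_scripts needs a baseline with the same base name under st/baseline_results. A minimal sketch of a local check before submitting follows. The function name `find_missing_baselines` and the `.json` baseline extension are assumptions made for illustration; they are not part of the repository's tooling.

```python
from pathlib import Path


def find_missing_baselines(script_dir, baseline_dir, baseline_suffix=".json"):
    """Return names of .sh scripts that have no same-named baseline file.

    Hypothetical helper: the .json suffix is an assumption, adjust it to
    whatever the baseline files in the repository actually use.
    """
    script_dir, baseline_dir = Path(script_dir), Path(baseline_dir)
    missing = []
    for script in sorted(script_dir.glob("*.sh")):
        # The baseline must share the script's base name exactly.
        expected = baseline_dir / (script.stem + baseline_suffix)
        if not expected.is_file():
            missing.append(script.name)
    return missing
```

Calling `find_missing_baselines("st/shell_scripts", "st/baseline_results")` from the tests directory would then list any script whose baseline is absent or misnamed.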

UT

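① All UT cases should preferably be launched through distributed pytest: inherit from DistributedTest in tests/common.py and specify world_size. See the existing cases for reference.

② Folders should preferably be named by feature, with at most two directory levels. All cases use test as the name prefix.

③ New cases can be added as test_xxx functions alongside existing cases, keeping related test functionality together where possible. For cases that have a .json file, add a test_xxx entry to the .json, then pass the parameters in the .py file through @pytest.mark.parametrize to construct the cases. The key names in the .json must match the test_xxx names in the .py file.

④ When contributing, decide which metrics the case verifies: accuracy (Acc.), throughput (Throu.), and memory (Mem.). Enter Y in the column of each verified metric and leave the others blank.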
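
Rule ③ can be sketched as follows: parameters live in a JSON document keyed by test function name, and a small loader returns the list registered for one test. The JSON content, the `load_case_params` helper, and the parameter names are hypothetical; the real test modules may load and structure their configuration differently.

```python
import json

# Hypothetical content of a .json file that sits next to the test module.
# Each top-level key matches a test function name in the .py file.
CONFIG_TEXT = """
{
  "test_example": [
    {"param": {"max_new_tokens": 8}},
    {"param": {"max_new_tokens": 16}}
  ]
}
"""


def load_case_params(config_text, test_name):
    """Return the parameter sets registered under test_name."""
    config = json.loads(config_text)
    if test_name not in config:
        # A key that does not match the test function name is a config error.
        raise KeyError(f"no entry named {test_name!r} in the .json config")
    return config[test_name]
```

In a test module, the returned list would typically be handed to the decorator, for example `@pytest.mark.parametrize("params", load_case_params(CONFIG_TEXT, "test_example"))`, so a mismatch between the JSON key and the function name fails at collection time rather than silently running zero cases.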

Pipeline

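① Place contributed test scripts in the corresponding model folder under pipeline/, for example baichuan2-13B. Name each file {model name}_{partition strategy} or {model name}_{feature name}, for example baichuan2_13B_tp8_pp1_ptd.sh. Contributors must follow this convention strictly.

② Test scripts do not need to redirect their own logs. Log collection is handled centrally in pipe_run.sh.

③ Place baseline data in the pipeline/baseline folder, with a file name that matches the shell script exactly. Otherwise the automation scripts will not find it.

④ To obtain baseline data, run the CI gate job to produce the first set of results and save them to a local log or txt file. Then run the transfer_logs_as_json function in tests/st/st_utils/common.py locally to extract the data, and commit the result together with the test script.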
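
The naming convention encodes the partition strategy in underscore-separated tokens such as tp8 or pp1. As an illustration only, the sketch below reads those tokens back out of a script name. `parse_parallel_tokens` and the set of recognized prefixes are assumptions for this example, not a tool shipped with the repository.

```python
import re
from pathlib import Path

# Prefixes that appear in the script names listed in this document.
_TOKEN = re.compile(r"(tp|pp|vpp|cp|ep)(\d+)")


def parse_parallel_tokens(script_name):
    """Map each recognized parallelism token in a script name to its size."""
    sizes = {}
    for token in Path(script_name).stem.split("_"):
        # Only whole tokens count, so "13B" or "ptd" are ignored.
        match = _TOKEN.fullmatch(token)
        if match:
            sizes[match.group(1)] = int(match.group(2))
    return sizes
```

For instance, `parse_parallel_tokens("baichuan2_13B_legacy_tp8_pp1_ptd.sh")` yields `{"tp": 8, "pp": 1}`, which makes it easy to spot a script whose name does not state its partition strategy.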