
Mixtral

简体中文 | English

Table of Contents

8x7B

Hardware-Requirements

Minimum hardware requirements for training:

Hardware | Configuration
NPU | 16 x Ascend NPUs

Recommended hardware configuration for inference:

Hardware | Configuration
NPU | 8 x Ascend NPUs

Preparation

  1. Clone the code repository to the local server
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
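A quick sanity check (optional, not part of the official steps) that the Megatron copy and the working directories are in place:

# The megatron package copied from Megatron-LM and the working directories should now exist under ModelLink
ls -d megatron logs model_from_hf dataset ckpt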
  2. Set up the environment
# python3.8
conda create -n test python=3.8
conda activate test

# Install torch and torch_npu
pip install torch-2.1.0-cp38-cp38-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38-linux_aarch64.whl

# Modify the path according to your own ascend-toolkit installation
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# install AscendSpeed
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt 
pip3 install -e .
cd ..

# install other packages
pip install -r requirements.txt 
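Before moving on, a minimal import check can confirm the environment (assuming torch and torch_npu installed successfully; the output depends on your machine):

# Sanity check: torch_npu registers the NPU backend with torch
python -c "import torch, torch_npu; print(torch.__version__, torch.npu.is_available())"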
  3. Download the pre-trained weights and vocabulary for Mixtral-8x7B from here. (It is recommended to download only the weights in safetensors format.)
#!/bin/bash
cd ./model_from_hf/
git lfs install
git clone https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
mv Mixtral-8x7B-v0.1 Mixtral-8x7B
cd ..
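After the download finishes, the model directory should contain the files referenced by the conversion step below (a rough check; the exact shard names may differ between releases):

# Expect config.json, tokenizer.model and the *.safetensors shards here
ls ./model_from_hf/Mixtral-8x7B/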
  4. Weight conversion

HuggingFace weights --> Megatron weights with any parallel slicing strategy (This scenario is generally used to train open-source HuggingFace models on Megatron)

# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# HF to tp1-pp8-ep2
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader mixtral_hf \
    --saver mixtral \
    --load-dir ./model_from_hf/Mixtral-8x7B/ \
    --save-dir ./model_weights/Mixtral-8x7B-v0.1-tp1-pp8-ep2/ \
    --tokenizer-model ./model_from_hf/Mixtral-8x7B/tokenizer.model \
    --target-tensor-parallel-size 1 \
    --target-pipeline-parallel-size 8 \
    --target-expert-parallel-size 2
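If the conversion succeeds, the save directory is laid out as a Megatron-style checkpoint; the listing below shows the typical layout and is an assumption based on the standard Megatron format rather than output taken from this repository:

# Typical layout of a converted checkpoint (assumed standard Megatron format)
ls ./model_weights/Mixtral-8x7B-v0.1-tp1-pp8-ep2/
# latest_checkpointed_iteration.txt  release/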

Any Megatron weights with parallel slicing strategy --> Any Megatron weights with parallel slicing strategy (This scenario is generally used to reconfigure the sliced model weights, for example after training on two nodes with 16 cards under an EP2-PP8 strategy and then wanting to run inference on a single node with 8 cards under TP8)

# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# tp1-pp8-ep2 to tp1-pp8-ep1
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader mixtral_mg \
    --saver mixtral \
    --load-dir ./model_weights/Mixtral-8x7B-v0.1-tp1-pp8-ep2/ \
    --save-dir ./model_weights/Mixtral-8x7B-v0.1-tp1-pp8-ep1/ \
    --target-tensor-parallel-size 1 \
    --target-pipeline-parallel-size 8 \
    --target-expert-parallel-size 1 

Any Megatron weights with parallel slicing strategy --> HuggingFace weights (This scenario is generally used to convert the trained Megatron model back to the HuggingFace format)

# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# tp1-pp8-ep2 to HF
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader mixtral_mg \
    --saver mixtral \
    --save-model-type huggingface \
    --load-dir ./model_weights/Mixtral-8x7B-v0.1-tp1-pp8-ep2/ \
    --save-dir ./model_from_hf/Mixtral-8x7B/    # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Mixtral-8x7B/mg2hg/

Model-Training

Prepare dataset

Download the dataset from here and save it to the ModelLink/dataset/ directory.

# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/Mixtral-8x7B/
python ./tools/preprocess_data.py \
    --input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
    --tokenizer-name-or-path ./model_from_hf/Mixtral-8x7B/ \
    --output-prefix ./dataset/Mixtral-8x7B/alpaca \
    --workers 4 \
    --log-interval 1000 \
    --tokenizer-type PretrainedFromHF
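With --output-prefix ./dataset/Mixtral-8x7B/alpaca, preprocessing emits an indexed dataset; the resulting path is what DATA_PATH refers to in the training script below (file names follow the usual Megatron *_text_document convention):

# Expected preprocessing output
ls ./dataset/Mixtral-8x7B/
# alpaca_text_document.bin  alpaca_text_document.idx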

Configure Mixtral-8x7B pre-training script: examples/mixtral/pretrain_mixtral_8x7b_ptd.sh

# Set the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh 

# Configure according to the actual vocabulary, dataset, and model parameter save path
DATA_PATH="./dataset/Mixtral-8x7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Mixtral-8x7B/"
CKPT_SAVE_DIR="./ckpt/Mixtral-8x7B/"

# Configure distributed parameters according to the actual distributed cluster
GPUS_PER_NODE=8
MASTER_ADDR="your master node IP"
MASTER_PORT=6000
NNODES=2
NODE_RANK="current node id"
WORLD_SIZE=$(($GPUS_PER_NODE * $NNODES))

# Training parallel strategy
TP=1
PP=8
EP=2
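For reference, the variables above are typically assembled into a torchrun launch inside the script; the sketch below is illustrative only, and the real entry point and argument list are defined in pretrain_mixtral_8x7b_ptd.sh:

# Illustrative sketch of the distributed launch built from the variables above
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK \
                  --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py ...   # entry point and flags as defined in the shipped script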

Start the Mixtral-8x7B pre-training script: examples/mixtral/pretrain_mixtral_8x7b_ptd.sh

bash examples/mixtral/pretrain_mixtral_8x7b_ptd.sh

Note: For multi-node training, set up data sharing between machines so that non-master nodes can read the data generated by the master node, or copy that data from the master node to the non-master nodes directly.
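If shared storage is not available, the preprocessed data can simply be copied from the master node to each worker, for example (host name and destination path are placeholders):

# Copy the preprocessed dataset from the master node to a worker node (placeholder host/path)
scp -r ./dataset/Mixtral-8x7B/ user@worker-node:/path/to/ModelLink/dataset/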

Fine-Tuning

Prepare the fine-tuning dataset. Download the fine-tuning dataset from here.

# download datasets
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/silk-road/alpaca-data-gpt4-chinese/resolve/main/Alpaca_data_gpt4_zh.jsonl
cd ..

# process datasets  
mkdir ./finetune_dataset/Mixtral-8x7B/
python ./tools/preprocess_data.py \
    --input ./finetune_dataset/Alpaca_data_gpt4_zh.jsonl \
    --output-prefix ./finetune_dataset/Mixtral-8x7B/alpaca \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path ./model_from_hf/Mixtral-8x7B/ \
    --append-eod \
    --tokenizer-not-use-fast \
    --handler-name GeneralInstructionHandler \
    --workers 4
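Note that --output-prefix here is ./finetune_dataset/Mixtral-8x7B/alpaca, and the fine-tuning DATA_PATH below points at that prefix rather than at a single *_text_document file; a quick listing shows what the handler produced:

# The files written by GeneralInstructionHandler live under this prefix
ls ./finetune_dataset/Mixtral-8x7B/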

Supervised Fine-Tuning

The configuration script for full-parameter fine-tuning is basically the same as the pre-training script. The difference is that the fine-tuning dataset path and the training parameter --is-instruction-dataset are added.

Add the fine-tuning parameter --finetune and the pretrained-weight loading parameter --load so that fine-tuning starts from the first step.

DATA_PATH="./finetune_dataset/Mixtral-8x7B/alpaca"
CKPT_PATH="./ckpt/Mixtral-8x7B/"
--load ${CKPT_PATH} \
--finetune \
--is-instruction-dataset

Model-Performance

Throughput

Comparison of Mixtral-8x7B performance on 2 nodes (16 chips) with EP2 PP8: (with enough nodes, a larger EP gives higher throughput; the numbers below are not the optimal performance and are provided for reference only)

Device | Model | Iterations | Sample Throughput (samples/step) | Token Throughput (tokens/s/p) | Single-Step Iteration Time (s/step)
NPUs | Mixtral-8x7B | 1000 | 3.13 | 1053.63 | 31.13
Reference | Mixtral-8x7B | 1000 | 4.45 | 1139.3 | 28.76

Model-Inference

First, configure the inference script: examples/mixtral/generate_mixtral_8x7b_ptd.sh

# Execute set_env.sh according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh 

# Modify the model weight path and tokenizer path
CHECKPOINT="./model_weights/Mixtral-8x7B-v0.1-tp1-pp8-ep1/"
TOKENIZER_MODEL="./model_from_hf/Mixtral-8x7B/"

# Modify the parallel configuration according to the actual loaded model weights
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
GPUS_PER_NODE=8
TP=8
PP=1

Note: The Mixtral-8x7B-v0.1 model used in this document is a base (L0) model with text-continuation ability only; inference applies no chat template, so the output is prone to repetition or failing to stop.

If you want better human-machine dialogue capability, use the Mixtral-8x7B-Instruct-v0.1 model instead. That model has been trained for instruction following and must be used with its chat template. The basic operations are the same as above; only the startup entry changes:
torchrun $DISTRIBUTED_ARGS inference.py
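For reference, prompts for the Instruct model follow the Mistral [INST] template (shown here as an illustration; how the prompt is actually wrapped depends on inference.py):

# Illustrative Mixtral-Instruct style prompt; the BOS token is normally added by the tokenizer
PROMPT="[INST] Write a haiku about mountains. [/INST]"
echo "$PROMPT"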

Then you can start inference directly:

bash examples/mixtral/generate_mixtral_8x7b_ptd.sh

An example of inference is shown in the Inference image.

Model-Evaluation

Evaluate the model using the MMLU dataset. The dataset can be downloaded here. Configure the evaluation script: examples/mixtral/evaluate_mixtral_8x7b_ptd.sh

# Ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh 

# Modify the model parameter path and tokenizer path
TOKENIZER_PATH="./model_from_hf/Mixtral-8x7B/"                                           #tokenizer path
CHECKPOINT="./model_weights/Mixtral-8x7B-v0.1-tp1-pp8-ep1"                                         #model path

# Configure tasks and dataset paths
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"

Start the evaluation

bash examples/mixtral/evaluate_mixtral_8x7b_ptd.sh

The evaluation results are as follows

Dataset | Questions | Refer Accuracy | Ours
MMLU | 14042 | 0.658 | 0.660