Mixtral
Simplified Chinese | English
8x7B
Hardware-Requirements
Minimum hardware requirements for training:
Hardware | Configuration |
---|---|
NPU | 16 x Ascend NPUs |
Recommended hardware configuration for inference:
Hardware | Configuration |
---|---|
NPU | 8 x Ascend NPUs |
Preparation
- Clone the code repository to the local server
git clone https://gitee.com/ascend/ModelLink.git
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout -f bcce6f
cp -r megatron ../ModelLink/
cd ..
cd ModelLink
mkdir logs
mkdir model_from_hf
mkdir dataset
mkdir ckpt
- Set up the environment
# python3.8
conda create -n test python=3.8
conda activate test
# Install torch and torch_npu
pip install torch-2.1.0-cp38-cp38-manylinux2014_aarch64.whl
pip install torch_npu-2.1.0*-cp38-cp38-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38-linux_aarch64.whl
# modify the path according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# install AscendSpeed
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
git checkout 224ae35e8fc96778f957029d1371ddb623452a50
pip install -r requirements.txt
pip3 install -e .
cd ..
# install other packages
pip install -r requirements.txt
- Download the pre-trained weights and vocabulary for Mixtral-8x7B from here. (It is recommended to download only the weights in safetensors format.)
#!/bin/bash
cd ./model_from_hf/
git lfs install
git clone https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
mv Mixtral-8x7B-v0.1 Mixtral-8x7B
cd ..
- Weight conversion
HuggingFace weights --> Megatron weights with any parallel slicing strategy (This scenario is generally used to train open-source HuggingFace models on Megatron)
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# HF to tp1-pp8-ep2
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader mixtral_hf \
--saver mixtral \
--load-dir ./model_from_hf/Mixtral-8x7B/ \
--save-dir ./model_weights/Mixtral-8x7B-v0.1-tp1-pp8-ep2/ \
--tokenizer-model ./model_from_hf/Mixtral-8x7B/tokenizer.model \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 8 \
--target-expert-parallel-size 2
Any Megatron weights with parallel slicing strategy --> Any Megatron weights with parallel slicing strategy (This scenario is generally used to re-slice already-converted Megatron weights, for example after training with an EP2-PP8 strategy on two nodes with 16 cards and then running inference with TP8 on a single node with 8 cards)
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# tp1-pp8-ep2 to tp1-pp8-ep1
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader mixtral_mg \
--saver mixtral \
--load-dir ./model_weights/Mixtral-8x7B-v0.1-tp1-pp8-ep2/ \
--save-dir ./model_weights/Mixtral-8x7B-v0.1-tp1-pp8-ep1/ \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 8 \
--target-expert-parallel-size 1
Any Megatron weights with parallel slicing strategy --> HuggingFace weights (This scenario is generally used to convert the trained Megatron model back to the HuggingFace format)
# Modify the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# tp1-pp8-ep2 to HF
python tools/checkpoint/convert_ckpt.py \
--model-type GPT \
--loader mixtral_mg \
--saver mixtral \
--save-model-type huggingface \
--load-dir ./model_weights/Mixtral-8x7B-v0.1-tp1-pp8-ep2/ \
--save-dir ./model_from_hf/Mixtral-8x7B/ # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Mixtral-8x7B/mg2hg/
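As a quick sanity check after conversion (a minimal sketch; the exact checkpoint layout can vary with the Megatron version, and the mg2hg/ path comes from the comment above), list the converted Megatron checkpoint and load only the config of the converted HuggingFace weights:
# A converted Megatron checkpoint typically contains latest_checkpointed_iteration.txt
# plus one sub-directory per model-parallel rank
ls ./model_weights/Mixtral-8x7B-v0.1-tp1-pp8-ep2/
# Loading only the config is cheap and confirms the converted HF weights are readable;
# num_local_experts should be 8 for Mixtral-8x7B
python -c "from transformers import AutoConfig; cfg = AutoConfig.from_pretrained('./model_from_hf/Mixtral-8x7B/mg2hg/'); print(cfg.model_type, cfg.num_local_experts)"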
Model-Training
Prepare dataset
Download the dataset from here and save it to the ModelLink/dataset/ directory.
# download datasets
cd ./dataset
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# process datasets
mkdir ./dataset/Mixtral-8x7B/
python ./tools/preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./model_from_hf/Mixtral-8x7B/ \
--output-prefix ./dataset/Mixtral-8x7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
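Preprocessing writes an indexed binary dataset under the chosen --output-prefix; the DATA_PATH used in the training script below points at this prefix. A quick check (a sketch; exact file names may vary slightly across ModelLink versions):
# Expected output is a .bin/.idx pair matching --output-prefix,
# e.g. alpaca_text_document.bin and alpaca_text_document.idx
ls ./dataset/Mixtral-8x7B/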
Configure Mixtral-8x7B pre-training script: examples/mixtral/pretrain_mixtral_8x7b_ptd.sh
# Set the ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Configure according to the actual vocabulary, dataset, and model parameter save path
DATA_PATH="./dataset/Mixtral-8x7B/alpaca_text_document"
TOKENIZER_MODEL="./model_from_hf/Mixtral-8x7B/"
CKPT_SAVE_DIR="./ckpt/Mixtral-8x7B/"
# Configure distributed parameters according to the actual distributed cluster
GPUS_PER_NODE=8
MASTER_ADDR="your master node IP"
MASTER_PORT=6000
NNODES=2
NODE_RANK="current node id"
WORLD_SIZE=$(($GPUS_PER_NODE * $NNODES))
# Training parallel strategy
TP=1
PP=8
EP=2
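For this configuration the arithmetic works out as follows: WORLD_SIZE = GPUS_PER_NODE x NNODES = 16, the data-parallel size is WORLD_SIZE / (TP x PP) = 16 / 8 = 2, and EP = 2 divides that data-parallel size, which is the usual Megatron constraint for expert parallelism. A small guard such as the following (a sketch under that assumption, not part of the original script) can catch mismatches before launch:
# Check the usual Megatron parallelism constraints before launching
if (( WORLD_SIZE % (TP * PP) != 0 )); then
  echo "WORLD_SIZE=${WORLD_SIZE} is not divisible by TP*PP=$((TP * PP))"
fi
DP=$(( WORLD_SIZE / (TP * PP) ))
if (( DP % EP != 0 )); then
  echo "Expert parallel size EP=${EP} does not divide the data-parallel size DP=${DP}"
fi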
Start Mixtral-8x7B pre-training script: examples/mixtral/pretrain_mixtral_8x7b_ptd.sh
bash examples/mixtral/pretrain_mixtral_8x7b_ptd.sh
Note: For multi-node training, set up shared storage so that the non-master nodes can read the data prepared on the master node, or copy the preprocessed data from the master node to every other node.
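For example (a sketch; the hostname and target path below are placeholders):
# Copy the preprocessed dataset from the master node to a worker node
# (alternatively, place ./dataset on shared storage such as NFS)
scp -r ./dataset/Mixtral-8x7B/ user@worker-node:/path/to/ModelLink/dataset/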
Fine-Tuning
Prepare the fine-tuning dataset. Download the fine-tuning dataset from here.
# download datasets
mkdir finetune_dataset
cd ./finetune_dataset
wget https://huggingface.co/datasets/silk-road/alpaca-data-gpt4-chinese/resolve/main/Alpaca_data_gpt4_zh.jsonl
cd ..
# process datasets
mkdir ./finetune_dataset/Mixtral-8x7B/
python ./tools/preprocess_data.py \
--input ./finetune_dataset/Alpaca_data_gpt4_zh.jsonl \
--output-prefix ./finetune_dataset/Mixtral-8x7B/alpaca \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./model_from_hf/Mixtral-8x7B/ \
--append-eod \
--tokenizer-not-use-fast \
--handler-name GeneralInstructionHandler \
--workers 4
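The GeneralInstructionHandler writes several indexed binary files under the output prefix (the exact suffixes depend on the ModelLink version); confirm they were produced before starting fine-tuning:
# The fine-tuning DATA_PATH below points at this prefix
ls ./finetune_dataset/Mixtral-8x7B/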
Supervised Fine-Tuning
The configuration script for full-parameter fine-tuning is basically the same as the pre-training script. The differences are that the fine-tuning dataset path is used and the training parameter --is-instruction-dataset is added. Add the fine-tuning parameter --finetune and the pretrained-weight load parameter --load, so that fine-tuning starts from the first step.
DATA_PATH="./finetune_dataset/Mixtral-8x7B/alpaca"
CKPT_PATH="./ckpt/Mixtral-8x7B/"
--load ${CKPT_PATH} \
--finetune \
--is-instruction-dataset
Model-Performance
Throughput
Comparison of Mixtral-8x7B performance on 2 nodes (16 chips) with EP2 PP8. (When there are enough nodes, the larger the EP, the higher the throughput; the figures here are not the optimal performance and are provided for reference only.)
Device | Model | Iterations | Sample Throughput (samples/step) | Tokens Throughput (tokens/s/p) | Single Step Iteration Time (s/step) |
---|---|---|---|---|---|
NPUs | Mixtral-8x7B | 1000 | 3.13 | 1053.63 | 31.13 |
Reference | Mixtral-8x7B | 1000 | 4.45 | 1139.3 | 28.76 |
Model-Inference
First, configure the inference script: examples/mixtral/generate_mixtral_8x7b_ptd.sh
# Execute set_env.sh according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Modify the model weight path and tokenizer path
CHECKPOINT="./model_weights/Mixtral-8x7B-v0.1-tp1-pp8-ep1/"
TOKENIZER_MODEL="./model_from_hf/Mixtral-8x7B/"
# Modify the parallel configuration according to the actual loaded model weights
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
GPUS_PER_NODE=8
TP=8
PP=1
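The TP/PP values must describe the same slicing as the loaded checkpoint. Inference is typically launched without data parallelism, so the device count should equal TP x PP; a small guard (a sketch, not part of the original script) is:
# All devices are consumed by TP*PP when generating without data parallelism
if (( GPUS_PER_NODE * NNODES != TP * PP )); then
  echo "Device count $((GPUS_PER_NODE * NNODES)) does not match TP*PP=$((TP * PP))"
fi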
# Note
The Mixtral-8x7B-v0.1 model used in this document is a base (L0) model with continuation capability only; inference applies no chat template, so the output is prone to repetition or run-on answers.
If you want better human-machine dialogue capability, use the Mixtral-8x7B-Instruct-v0.1 model. That model has been trained for instruction following and must be used with its chat template. The basic operations are the same as above; only the startup entry changes:
torchrun $DISTRIBUTED_ARGS inference.py
Then you can start inference directly:
bash examples/mixtral/generate_mixtral_8x7b_ptd.sh
An example of inference is as follows:
Model-Evaluation
Evaluate the model using the MMLU dataset. The dataset can be downloaded from here. Configure the evaluation script: examples/mixtral/evaluate_mixtral_8x7b_ptd.sh
# Ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Modify the model parameter path and tokenizer path
TOKENIZER_PATH="./model_from_hf/Mixtral-8x7B/" #tokenizer path
CHECKPOINT="./model_weights/Mixtral-8x7B-v0.1-tp1-pp8-ep1" #model path
# Configure tasks and dataset paths
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"
Start the evaluation
bash examples/mixtral/evaluate_mixtral_8x7b_ptd.sh
The evaluation results are as follows
Dataset | Questions | Reference Accuracy | Ours |
---|---|---|---|
MMLU | 14042 | 0.658 | 0.660 |