
Aquila-7B

简体中文 | English

Training

Here's a hardware summary of pre-training Aquila-7B:

| Hardware | Device | Value |
| :------: | :----: | :-------------: |
| NPU | A+K | 8 x Ascend NPUs |

Script

  1. Clone the repository to your local server and switch to the modellink branch:

    git clone https://gitee.com/ascend/ModelLink.git
    git clone https://github.com/NVIDIA/Megatron-LM.git
    cd Megatron-LM
    git checkout core_r0.6.0
    cp -r megatron ../ModelLink/
    cd ..
    cd ModelLink
    mkdir logs
    mkdir model_from_hf
    mkdir dataset
    mkdir ckpt
    
  2. Build conda environment

    # python3.8
    conda create -n test python=3.8
    conda activate test
    # install torch, torch_npu and apex
    pip install torch==2.1.0
    pip install torch_npu-2.1.0.postxxxx-cp38-cp38-xxxx_aarch64.whl
    pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
    
    # source the set_env.sh file based on your host settings (you may need to change the path)
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    # clone the MindSpeed source code, enter the directory, and install the MindSpeed package from source
    git clone https://gitee.com/ascend/MindSpeed.git
    cd MindSpeed/
    git checkout 2b0edd2
    pip install -r requirements.txt
    pip install -e .
    cd ..
    
    # install other packages
    pip install -r requirements.txt
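
    After the installation, it is worth confirming that torch and torch_npu can see the NPU devices before moving on. A minimal check, assuming the standard torch_npu availability API:

    python -c "import torch; import torch_npu; print(torch_npu.npu.is_available())"
    # expected output: True (False usually points to a driver/CANN problem or a missing "source set_env.sh")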
    
  3. Download the Aquila-7B model, config, and tokenizer from here

    Save them to the ModelLink/model_from_hf/Aquila-7B/ directory.

    Prepare dataset.

    Step 1: Download the dataset from here and save it to the ModelLink/dataset/ directory.

    cd dataset/
    wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
    cd ..
    

    Step 2: Use the Aquila-7B tokenizer to pre-process the data:

    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    mkdir ./dataset/Aquila-7B/
    python ./tools/preprocess_data.py \
        --input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
        --tokenizer-name-or-path ./model_from_hf/Aquila-7B/ \
        --output-prefix ./dataset/Aquila-7B/alpaca \
        --workers 4 \
        --log-interval 1000  \
        --tokenizer-type PretrainedFromHF
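
    The command above should produce an indexed binary dataset under ./dataset/Aquila-7B/ using the prefix given by --output-prefix. The file names below assume the default "_text_document" suffix that preprocess_data.py appends for the "text" field, which is also what DATA_PATH expects in the later steps; this listing is only a sanity check:

    ls -lh ./dataset/Aquila-7B/
    # expected: alpaca_text_document.bin (token ids) and alpaca_text_document.idx (index)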
    
  4. Weight conversion

    HuggingFace weights --> Megatron weights (This scenario is generally used to train open-source HuggingFace models on Megatron)

    # please modify the path to set_env.sh based on your environment.
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    python tools/checkpoint/convert_ckpt.py \
        --model-type GPT \
        --load-dir ./model_from_hf/Aquila-7B/ \
        --save-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
        --loader llama2_hf \
        --saver megatron \
        --target-tensor-parallel-size 8 \
        --tokenizer-model ./model_from_hf/Aquila-7B/tokenizer.json
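
    If the conversion succeeds, the --save-dir should now hold Megatron-format checkpoints sharded over 8 tensor-parallel ranks. The listing below is an optional sanity check; the exact layout (a latest_checkpointed_iteration.txt file plus release/mp_rank_0X subdirectories) is an assumption based on the usual Megatron checkpoint format and may differ across versions:

    ls ./model_weights/Aquila-7B-v0.1-tp8-pp1/
    # expected (assumption): latest_checkpointed_iteration.txt and per-rank folders such as release/mp_rank_00 ... mp_rank_07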
    

    Megatron weights with any parallel slicing strategy --> HuggingFace weights (This scenario is generally used to convert a trained Megatron model back to the HuggingFace format)

    # Modify the ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    python tools/checkpoint/convert_ckpt.py --model-type GPT \
        --loader megatron \
        --saver megatron \
        --save-model-type save_huggingface_llama \
        --load-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
        --target-tensor-parallel-size 1 \
        --target-pipeline-parallel-size 1 \
        --save-dir ./model_from_hf/Aquila-7B/   # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Aquila-7B/mg2hg/
    
  5. Configure the Aquila-7B pre-training script.

    Configure the environment variables in the Aquila pre-training script:

    # set dataset path, CKPT load path for loading weights, and the tokenizer path
    TOKENIZER_PATH="./model_from_hf/Aquila-7B/"  #tokenizer path
    DATA_PATH="./dataset/Aquila-7B/alpaca_text_document"  #processed dataset
    CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/"   # pointing to the converted model weights
    CKPT_SAVE_DIR="./ckpt/Aquila-7B/"                   # pointing to the path to save checkpoints
    

    Note: if you do not load weights for pre-training, you can ignore CKPT_LOAD_DIR and remove the --load parameter from the training script, and vice versa. Likewise, if you do not want to save weights during pre-training, you can ignore CKPT_SAVE_DIR and remove the --save $CKPT_SAVE_DIR parameter, and vice versa. When you want to save a checkpoint and load it in a later pre-training run, follow both the "save" and "load" suggestions above.
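
    For orientation, the load/save options usually sit in the launch command of the pre-training script, roughly as sketched below. The *_ARGS variable names and the pretrain_gpt.py entry point are assumptions based on typical ModelLink launch scripts and may differ in your copy of examples/aquila/pretrain_aquila_7b_ptd.sh:

    # sketch of the launch command (abbreviated)
    torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
        $GPT_ARGS \
        $DATA_ARGS \
        $OUTPUT_ARGS \
        --load $CKPT_LOAD_DIR \
        --save $CKPT_SAVE_DIR \
        --distributed-backend nccl
    # drop "--load $CKPT_LOAD_DIR" to start from random initialization,
    # and "--save $CKPT_SAVE_DIR" to skip checkpoint saving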

  6. Launch Aquila-7B pre-training script.

    Before running the pre-training script, execute the set_env.sh script first to set up the environment variables. Alternatively, you can do this inside the Aquila pre-training script.

    # you may need to change the path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    

    Start pre-training Aquila-7B model:

    bash examples/aquila/pretrain_aquila_7b_ptd.sh
    

    Note: for multi-machine training where the machines do not share storage, the parameter --no-shared-storage must be added. With this parameter set, non-master nodes determine, based on the distributed parameters, whether they need to load the data themselves and check the corresponding cache and generated data.
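
    A minimal sketch of how the flag can be appended; GPT_ARGS is assumed to be the argument-list variable used in the script and may be named differently in your version:

    # in examples/aquila/pretrain_aquila_7b_ptd.sh, append the flag to the training arguments
    GPT_ARGS="$GPT_ARGS --no-shared-storage"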

Performance

Machine performance

The performance of Aquila-7B on Ascend NPU and the reference device:

| Device | Model | Iterations | Throughput rate (tokens/p/s) | Single iteration step time (s/step) |
| :-------: | :-------: | :--------: | :--------------------------: | :---------------------------------: |
| NPU | Aquila-7B | 1000 | 2849 | 5.75 |
| Reference | Aquila-7B | 1000 | 2874 | 5.70 |

Inference

We support MindSpeed Inference for text generation with the Aquila-7B model.

Inference differs from pre-training in that it requires loading pre-trained model weights. Therefore, complete the model weight conversion described above first, then configure the Aquila-7B inference shell script examples/aquila/generate_aquila_7b_ptd.sh. "CKPT_LOAD_DIR" must point to the converted weights directory, and "TOKENIZER_PATH" must point to the directory containing the Aquila vocabulary files -- in our example, "./model_from_hf/Aquila-7B/". Fill in the correct values for your actual setup.

# please change to actual values
CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila-7B/"

Start Aquila-7B Inference:

bash ./examples/aquila/generate_aquila_7b_ptd.sh

Sample results of Aquila-7B Inference:

[Sample generation output: aquila-7B_generate.png]

Evaluation with Benchmark

We use the BoolQ benchmark to evaluate the model. Go to the BoolQ benchmark page, download the dataset, and save it, for example to the "ModelLink/boolq/test" directory.
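
A possible way to fetch and place the dataset is sketched below; the URL is an assumption based on the public mirror from the original BoolQ release, so verify it against the benchmark page above and keep whatever file name your evaluation script expects:

mkdir -p ./boolq/test
cd ./boolq/test
# dev.jsonl is the publicly released evaluation split of BoolQ; the URL below is an assumption -- check the BoolQ page
wget https://storage.googleapis.com/boolq/dev.jsonl
cd ../..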

The evaluation task is similar to the inference task in that it also requires loading the pre-trained model weights. Please note that the model structure parameters used when converting the weights must be consistent with those used when running the evaluation task.

After weight conversion is complete, configure the Aquila-7B evaluation script examples/aquila/evaluate_aquila_7b_ptd.sh. Correctly specify the path to load the weights from, the path to the tokenizer and vocab, and so on (the following example is for reference only):

CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila-7B/"
EVAL_DATA_PATH="./boolq/test"
TASK="boolq"

Start the evaluation task:

bash ./examples/aquila/evaluate_aquila_7b_ptd.sh

Sample Aquila-7B performance running on Ascend NPU:

| Task | Model | NPU | Benchmark |
| :---: | :-------: | :---: | :-------: |
| BoolQ | Aquila-7B | 77.3% | |