
Aquila-7B

简体中文 | English

Training

Here's a hardware summary of pre-training Aquila-7B:

| Hardware | Device | Value |
|:---|:---|:---|
| NPU | A+K | 8 x Ascend NPUs |

Script

  1. Clone the repository to your local server and switch to modellink branch:

    git clone https://gitee.com/ascend/ModelLink.git
    git clone https://github.com/NVIDIA/Megatron-LM.git
    cd Megatron-LM
    git checkout -f bcce6f
    cp -r megatron ../ModelLink/
    cd ..
    cd ModelLink
    mkdir logs
    mkdir model_from_hf
    mkdir dataset
    mkdir ckpt
    
  2. Build conda environment

    # python3.8
    conda create -n test python=3.8
    conda activate test
    # install torch, torch_npu and apex
    pip install torch==2.1.0
    pip install torch_npu-2.1.0.postxxxx-cp38-cp38-xxxx_aarch64.whl
    pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
    
    # source the set_env.sh file based on your host settings (you may need to change the path)
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    # clone the AscendSpeed source code, enter the directory, and install the ascendspeed package from source
    git clone https://gitee.com/ascend/AscendSpeed.git
    cd AscendSpeed/
    git checkout 224ae35e8fc96778f957029d1371ddb623452a50
    pip install -r requirements.txt
    pip install -e .
    cd ..
    
    # install other packages
    pip install -r requirements.txt
    
  3. Download the Aquila-7B model, config, and tokenizer from here

    Save them to the ModelLink/model_from_hf/Aquila-7B/ directory.
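    For example, the files can be fetched with git-lfs. This is a minimal sketch that assumes the checkpoint is published as BAAI/Aquila-7B on the Hugging Face hub and that git-lfs is installed; adjust the source if you obtain the weights elsewhere.

    # run from the ModelLink directory
    cd ./model_from_hf/
    git lfs install
    git clone https://huggingface.co/BAAI/Aquila-7B
    cd ..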

    Prepare the dataset.

    step1: Download the dataset from here and save it to the ModelLink/dataset/ directory.

    cd dataset/
    wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
    cd ..
    

    step2: use the Aquila-7B tokenizer to pre-process the data:

    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    mkdir ./dataset/Aquila-7B/
    python ./tools/preprocess_data.py \
        --input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
        --tokenizer-name-or-path ./model_from_hf/Aquila-7B/ \
        --output-prefix ./dataset/Aquila-7B/alpaca \
        --workers 4 \
        --log-interval 1000  \
        --tokenizer-type PretrainedFromHF
    
  4. Convert weights

    HuggingFace weights --> Megatron weights (This scenario is generally used to train open-source HuggingFace models on Megatron)

    # please modify the path to set_env.sh based on your environment.
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    python tools/checkpoint/convert_ckpt.py \
        --model-type GPT \
        --load-dir ./model_from_hf/Aquila-7B/ \
        --save-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
        --loader llama2_hf \
        --saver megatron \
        --target-tensor-parallel-size 8 \
        --tokenizer-model ./model_from_hf/Aquila-7B/tokenizer.json
    

    Megatron weights --> HuggingFace weights (This scenario is generally used to convert a trained Megatron model back to the HuggingFace format)

    # Modify the ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    python tools/checkpoint/convert_ckpt.py --model-type GPT \
        --loader megatron \
        --saver megatron \
        --save-model-type save_huggingface_llama \
        --load-dir ./model_weights/Aquila-7B-v0.1-tp8-pp1/ \
        --target-tensor-parallel-size 1 \
        --target-pipeline-parallel-size 1 \
        --save-dir ./model_from_hf/Aquila-7B/   # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Aquila-7B/mg2hg/
    
  5. Configure the Aquila-7B pre-training script.

    Configure the environment variables in the pre-training script examples/aquila/pretrain_aquila_7b_ptd.sh:

    # set dataset path, CKPT load path for loading weights, and the tokenizer path
    TOKENIZER_PATH="./model_from_hf/Aquila-7B/"  #tokenizer path
    DATA_PATH="./dataset/Aquila-7B/alpaca_text_document"  #processed dataset
    CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/"   # pointing to the converted model weights
    CKPT_SAVE_DIR="./ckpt/Aquila-7B/"                   # pointing to the path to save checkpoints
    

    Note that if you do not load weights for pre-training, you can ignore CKPT_LOAD_DIR and remove the --load parameter from the training script, and vice versa. Likewise, if you do not want to save weights during pre-training, you can ignore CKPT_SAVE_DIR and remove the --save $CKPT_SAVE_DIR parameter, and vice versa. When you want to save a checkpoint and load it in a future pre-training run, follow both the "save" and "load" suggestions above; the sketch below illustrates how the two options fit together.
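    This is a minimal sketch (using a hypothetical CKPT_ARGS helper variable, not taken from the actual script) of how the two options map onto the variables above; drop whichever line you do not need.

    CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/"
    CKPT_SAVE_DIR="./ckpt/Aquila-7B/"
    CKPT_ARGS=""
    CKPT_ARGS+=" --load ${CKPT_LOAD_DIR}"   # omit to start from randomly initialized weights
    CKPT_ARGS+=" --save ${CKPT_SAVE_DIR}"   # omit to skip saving checkpoints during training
    echo "checkpoint arguments:${CKPT_ARGS}"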

  6. Launch Aquila-7B pre-training script.

    Before running the pre-training script, execute the set_env.sh script first to set up the environment variables. Alternatively, you can source it inside the Aquila pre-training script.

    # you may need to change the path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    

    Start pre-training Aquila-7B model:

    bash examples/aquila/pretrain_aquila_7b_ptd.sh
    

    Note: for multi-machine training, you need to set up shared storage so that non-primary nodes can read the data prepared on the primary node. Alternatively, copy the data generated on the primary node directly to the other nodes, as sketched below.
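    A hedged example of the copy-based alternative, assuming passwordless SSH, the hypothetical worker hostnames worker1 and worker2, and the same repository path on every node; adjust the hosts and paths to your cluster.

    # run from the ModelLink directory on the primary node
    for host in worker1 worker2; do
        ssh "${host}" "mkdir -p $(pwd)/dataset/Aquila-7B"
        rsync -a ./dataset/Aquila-7B/ "${host}:$(pwd)/dataset/Aquila-7B/"
    done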

Performance

Machine performance

The performance of Aquila-7B on Ascend NPUs and a reference device:

| Device | Hardware | Model | Iterations | Throughput rate (tokens/p/s) | Single-iteration step time (s/step) |
|:---|:---|:---|:---|:---|:---|
| NPU | 910b 1node*8p | Aquila-7B | 1000 | 2849 | 5.75 |
| Reference | | Aquila-7B | 1000 | 2874 | 5.70 |

Inference

We support AscendSpeed inference for text generation with the Aquila-7B model.

Inference differs from pre-training in that it requires loading the pre-trained model weights. Therefore, complete the model weight conversion described above first, then configure the Aquila-7B inference shell script examples/aquila/generate_aquila_7b_ptd.sh. "CKPT_LOAD_DIR" must point to the converted weights directory, and "TOKENIZER_PATH" must point to the directory containing the Aquila vocabulary files (in our example, "./model_from_hf/Aquila-7B/"). Fill in the correct values based on your actual scenario.

# please change to actual values
CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila-7B/"

Start Aquila-7B Inference:

bash ./examples/aquila/generate_aquila_7b_ptd.sh

Sample results of Aquila-7B Inference:

[Sample generation output: aquila-7B_generate.png]

Evaluation with Benchmark

We use the BoolQ benchmark to evaluate the model. Go to the BoolQ benchmark page, download the dataset, and save it, for example, to the ModelLink/boolq/test directory, as sketched below.
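A hedged sketch, assuming the validation split from the original BoolQ release is still served at the URL below; adjust the source and the file layout to whatever evaluate_aquila_7b_ptd.sh expects under ./boolq/test.

# run from the ModelLink directory
mkdir -p ./boolq/test
wget -P ./boolq/test https://storage.googleapis.com/boolq/dev.jsonl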

The evaluation task is similar to the inference task: it also requires loading the pre-trained model weights. Please note that the model structure parameters used when converting the weights must be consistent with those used when running the evaluation task.

After weight conversion is complete, configure the Aquila-7B evaluation script examples/aquila/evaluate_aquila_7b_ptd.sh. Correctly specify the path for loading the weights, the path to the tokenizer and vocab, and so on (the following example is for reference only):

CKPT_LOAD_DIR="./model_weights/Aquila-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Aquila-7B/"
EVAL_DATA_PATH="./boolq/test"
TASK="boolq"

Start evaluation task

bash ./examples/aquila/evaluate_aquila_7b_ptd.sh

Sample Aquila-7B performance on Ascend NPUs:

| Task | Model | NPU | Benchmark |
|:---|:---|:---|:---|
| BoolQ | Aquila-7B | 77.3% | |