BaiChuan2

简体中文 | English

Contents

Baichuan2-7B

Training

Here's a hardware summary of pre-training Baichuan2-7B:

| Hardware | Value           |
| -------- | --------------- |
| NPU      | 8 x Ascend NPUs |

Script

  1. Clone the repository to your local server:

    git clone https://gitee.com/ascend/ModelLink.git
    git clone https://github.com/NVIDIA/Megatron-LM.git
    cd Megatron-LM
    git checkout -f bcce6f
    cp -r megatron ../ModelLink/
    cd ..
    cd ModelLink
    mkdir logs
    mkdir model_from_hf
    mkdir dataset
    mkdir ckpt
    
  2. Build environment

    # python3.8
    conda create -n test python=3.8
    conda activate test
    
    # install torch and torch_npu 
    pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
    pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
    pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
    
    # modify the path according to your own  ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh 
    
    # install AscendSpeed
    git clone https://gitee.com/ascend/AscendSpeed.git
    cd AscendSpeed
    git checkout 224ae35e8fc96778f957029d1371ddb623452a50
    pip install -r requirements.txt 
    pip3 install -e .
    cd ..
    
    # install other packages
    pip install -r requirements.txt 
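    
    # optional sanity check (a sketch, not part of the original setup): verify that torch and
    # torch_npu import cleanly and that an NPU device is visible; the device-query API may
    # vary across torch_npu versions
    python -c "import torch, torch_npu; print(torch.__version__, torch_npu.__version__); print('NPU available:', torch.npu.is_available())"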
    
  3. Prepare pretrained weights

    Download the Baichuan2-7B checkpoint from here.

    mkdir ./model_from_hf/Baichuan2-7B/
    cd ./model_from_hf/Baichuan2-7B/
    wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/config.json
    wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/configuration_baichuan.py
    wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/generation_utils.py
    wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/modeling_baichuan.py
    wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model-00001-of-00002.bin
    wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model-00002-of-00002.bin
    wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/pytorch_model.bin.index.json
    wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/quantizer.py
    wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/special_tokens_map.json
    wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenization_baichuan.py
    wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenizer.model
    wget https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/resolve/main/tokenizer_config.json
    cd ../../
    
  4. Convert weights

    To adapt the pre-training weights to the Baichuan2-7B model, use the following script to convert them from the HuggingFace format to the Megatron format. (This direction is typically used to train an open-source HuggingFace model with Megatron.)

    # modify the ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    
    python tools/checkpoint/convert_ckpt.py \
        --model-type GPT \
        --loader llama2_hf \
        --saver megatron \
        --target-tensor-parallel-size 8 \
        --load-dir ./model_from_hf/Baichuan2-7B/ \
        --save-dir ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ \
        --tokenizer-model ./model_from_hf/Baichuan2-7B/tokenizer.model \
        --params-dtype bf16 \
        --w-pack True   
    

    To convert Megatron weights (with any parallel slicing strategy) back to the HuggingFace format (typically used to export a trained Megatron model), run:

    # Modify the ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    python tools/checkpoint/convert_ckpt.py --model-type GPT \
        --loader megatron \
        --saver megatron \
        --save-model-type save_huggingface_llama \
        --load-dir ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ \
        --target-tensor-parallel-size 1 \
        --target-pipeline-parallel-size 1 \
        --w-pack True \
        --save-dir ./model_from_hf/Baichuan2-7B/   # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Baichuan2-7B/mg2hg/
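    
    # optional check (a sketch, not part of the original workflow): the exported HuggingFace
    # files land in ./model_from_hf/Baichuan2-7B/mg2hg/; if config.json was exported there,
    # it should load with transformers (trust_remote_code because Baichuan2 ships custom code)
    ls -lh ./model_from_hf/Baichuan2-7B/mg2hg/
    python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('./model_from_hf/Baichuan2-7B/mg2hg/', trust_remote_code=True))"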
    
  5. Prepare dataset

    Download the Baichuan2-7B-Base datasets from here

    # download datasets
    cd ./dataset/
    wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
    cd ..
    
    # process datasets      
    mkdir ./dataset/Baichuan2-7B/
    python ./tools/preprocess_data.py \
        --input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
        --tokenizer-name-or-path ./model_from_hf/Baichuan2-7B/ \
        --output-prefix ./dataset/Baichuan2-7B/alpaca \
        --workers 4 \
        --log-interval 1000 \
        --tokenizer-type PretrainedFromHF
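    
    # optional check (a sketch): preprocessing writes an indexed dataset pair; the shared
    # prefix is what DATA_PATH points to in the next step
    ls -lh ./dataset/Baichuan2-7B/alpaca_text_document.bin \
           ./dataset/Baichuan2-7B/alpaca_text_document.idx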
    
  6. Configure the Baichuan2-7B pre-training script: examples/baichuan2/pretrain_baichuan2_ptd_7B.sh

    # modify the script according to your own  ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh 
    
    # modify the dataset, tokenizer, and checkpoint paths according to your own setup
    CKPT_SAVE_DIR="./ckpt/Baichuan2-7B/"
    DATA_PATH="./dataset/Baichuan2-7B/alpaca_text_document"
    TOKENIZER_MODEL="./model_from_hf/Baichuan2-7B/tokenizer.model"
    CKPT_LOAD_DIR="./model_weights/Baichuan2-7B-v0.1-tp8-pp1/"
    
  7. Launch Baichuan2-7B pre-training script: examples/baichuan2/pretrain_baichuan2_ptd_7B.sh

    bash examples/baichuan2/pretrain_baichuan2_ptd_7B.sh 
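    
    # training output is typically archived under ./logs/ (created in step 1); the exact file
    # name depends on the script, so adjust <log-file> accordingly
    ls ./logs/
    tail -f ./logs/<log-file>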
    

    Note: For multi-node training, set up shared storage so that non-primary nodes can read the data generated by the primary node, or copy that data to every other node, as in the sketch below.
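
    For example, the preprocessed dataset and converted weights produced on the primary node could be copied to each worker node. This is an illustrative sketch only; the host name and destination path are placeholders to replace with your own.

    # hypothetical worker host; repeat for every non-primary node
    WORKER=worker-node-1
    rsync -a ./dataset/Baichuan2-7B/ ${WORKER}:/path/to/ModelLink/dataset/Baichuan2-7B/
    rsync -a ./model_weights/Baichuan2-7B-v0.1-tp8-pp1/ ${WORKER}:/path/to/ModelLink/model_weights/Baichuan2-7B-v0.1-tp8-pp1/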

Performance

Machine performance

The performance of Baichuan2-7B on Ascend NPUs compared with the reference implementation:

| Device    | Model        | total iterations | throughput rate (samples/s) | throughput rate (tokens/s/p) | single-step time (s/step) |
| --------- | ------------ | ---------------- | --------------------------- | ---------------------------- | ------------------------- |
| NPUs      | Baichuan2-7B | 1000             | 5.2                         | 2664                         | 12.3                      |
| Reference | Baichuan2-7B | 1000             | --                          | 3969                         | --                        |

Inference

Configure the Baichuan2-7B inference script: examples/baichuan2/generate_baichuan2_7b_ptd.sh

# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh 
 
# modify script model path and tokenizer path
CHECKPOINT="./model_weights/Baichuan2-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Baichuan2-7B/"

Launch the Baichuan2-7B inference script: examples/baichuan2/generate_baichuan2_7b_ptd.sh

bash examples/baichuan2/generate_baichuan2_7b_ptd.sh

Some inference samples are shown in the inference example image.

Evaluation

We use the BoolQ benchmark to evaluate the model. The benchmark dataset can be downloaded here.

# config origin weight and vocab file path
CHECKPOINT=<origin-ckpt-path>
TOKENIZER_PATH=<tokenizer-path>
# config tasks and dataset path
DATA_PATH="./boolq/"
TASK="boolq"
bash ./examples/baichuan2/evaluate_baichuan2_7B_ptd.sh
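
For reference, the variables can be filled in with the paths created earlier in this guide. This is an illustrative example (it assumes the evaluation, like inference, loads the tp8/pp1 Megatron weights); substitute your own paths as needed.

# example values based on the directories used in the Training section above
CHECKPOINT="./model_weights/Baichuan2-7B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Baichuan2-7B/"
DATA_PATH="./boolq/"
TASK="boolq"
bash ./examples/baichuan2/evaluate_baichuan2_7B_ptd.sh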
| Task  | Subset | Model        | NPU | OpenSource |
| ----- | ------ | ------------ | --- | ---------- |
| BoolQ | test   | Baichuan2-7B | 0.7 | 0.632      |

Baichuan2-13B

Training

Here's a hardware summary of pre-training Baichuan2-13B:

| Hardware | Value           |
| -------- | --------------- |
| NPU      | 8 x Ascend NPUs |

Script

  1. Clone the repository to your local server:

    git clone https://gitee.com/ascend/ModelLink.git 
    git clone https://github.com/NVIDIA/Megatron-LM.git
    cd Megatron-LM
    git checkout -f bcce6f
    cp -r megatron ../ModelLink/
    cd ..
    cd ModelLink
    mkdir logs
    mkdir model_from_hf
    mkdir dataset
    mkdir ckpt
    
  2. Build environment

    # python3.8
    conda create -n test python=3.8
    conda activate test
    
    # install torch and torch_npu 
    pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
    pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
    pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
    
    # modify the path according to your own  ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh 
    
    # install AscendSpeed
    git clone https://gitee.com/ascend/AscendSpeed.git
    cd AscendSpeed
    git checkout 224ae35e8fc96778f957029d1371ddb623452a50
    pip install -r requirements.txt 
    pip3 install -e .
    cd ..
    
    # install other packages
    pip install -r requirements.txt 
    
  3. Prepare pretrained weights

    Download the Baichuan2-13B checkpoint from here

    mkdir ./model_from_hf/Baichuan2-13B/
    cd ./model_from_hf/Baichuan2-13B/
    wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/config.json
    wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/configuration_baichuan.py
    wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/generation_utils.py
    wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/modeling_baichuan.py
    wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
    wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
    wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
    wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/pytorch_model.bin.index.json
    wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/quantizer.py
    wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/special_tokens_map.json
    wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenization_baichuan.py
    wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenizer_config.json
    wget https://huggingface.co/baichuan-inc/Baichuan2-13B-Base/resolve/main/tokenizer.model
    cd ../../
    
  4. Convert weights

    To adapt the pre-training weights to the Baichuan2-13B model, use the following script to convert them from the HuggingFace format to the Megatron format. (This direction is typically used to train an open-source HuggingFace model with Megatron.)

    # modify the ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    
    python tools/checkpoint/convert_ckpt.py \
        --model-type GPT \
        --loader llama2_hf \
        --saver megatron \
        --target-tensor-parallel-size 8 \
        --load-dir ./model_from_hf/Baichuan2-13B/ \
        --save-dir ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ \
        --tokenizer-model ./model_from_hf/Baichuan2-13B/tokenizer.model \
        --params-dtype bf16 \
        --w-pack True  
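    
    # quick check (a sketch; the exact layout depends on the converter version): with
    # --target-tensor-parallel-size 8 the converted checkpoint should contain one
    # mp_rank_* shard directory per tensor-parallel rank
    find ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ -maxdepth 2 -type d | sort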
    

    To convert Megatron weights (with any parallel slicing strategy) back to the HuggingFace format (typically used to export a trained Megatron model), run:

    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    python tools/checkpoint/convert_ckpt.py --model-type GPT \
        --loader megatron \
        --saver megatron \
        --save-model-type save_huggingface_llama \
        --load-dir ./model_weights/Baichuan2-13B-v0.1-tp8-pp1/ \
        --target-tensor-parallel-size 1 \
        --target-pipeline-parallel-size 1 \
        --w-pack True \
        --save-dir ./model_from_hf/Baichuan2-13B/     # <-- Fill in the original HF model path here, new weights will be saved in ./model_from_hf/Baichuan2-13B/mg2hg/
    
  5. Prepare dataset

    Download the Baichuan2-13B datasets from here

    cd dataset/
    wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
    cd ..
    
    mkdir ./dataset/Baichuan2-13B/
    python ./tools/preprocess_data.py \
        --input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
        --tokenizer-name-or-path ./model_from_hf/Baichuan2-13B/ \
        --output-prefix ./dataset/Baichuan2-13B/alpaca \
        --workers 4 \
        --log-interval 1000 \
        --tokenizer-type PretrainedFromHF 
    
  6. Configure the Baichuan2-13B pre-training script: examples/baichuan2/pretrain_baichuan2_ptd_13B.sh

    # modify the script according to your own  ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh 
    
    # modify the dataset, tokenizer, and checkpoint paths according to your own setup
    CKPT_SAVE_DIR="./ckpt/Baichuan2-13B/"
    DATA_PATH="./dataset/Baichuan2-13B/alpaca_text_document"
    TOKENIZER_MODEL="./model_from_hf/Baichuan2-13B/tokenizer.model"
    CKPT_LOAD_DIR="./model_weights/Baichuan2-13B-v0.1-tp8-pp1/" 
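    
    # alternatively (a sketch), patch the same variables non-interactively, assuming the
    # assignments appear verbatim in the script
    sed -i \
        -e 's|^CKPT_SAVE_DIR=.*|CKPT_SAVE_DIR="./ckpt/Baichuan2-13B/"|' \
        -e 's|^DATA_PATH=.*|DATA_PATH="./dataset/Baichuan2-13B/alpaca_text_document"|' \
        -e 's|^TOKENIZER_MODEL=.*|TOKENIZER_MODEL="./model_from_hf/Baichuan2-13B/tokenizer.model"|' \
        -e 's|^CKPT_LOAD_DIR=.*|CKPT_LOAD_DIR="./model_weights/Baichuan2-13B-v0.1-tp8-pp1/"|' \
        examples/baichuan2/pretrain_baichuan2_ptd_13B.sh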
    
  7. Launch Baichuan2-13B pre-training script: examples/baichuan2/pretrain_baichuan2_ptd_13B.sh

    bash examples/baichuan2/pretrain_baichuan2_ptd_13B.sh
    

    Note: For multi-node training, set up shared storage so that non-primary nodes can read the data generated by the primary node, or copy that data to every other node.

Performance

Machine performance

The performance of Baichuan2-13B on Ascend NPUs compared with the reference implementation:

| Device    | Model         | total iterations | throughput rate (samples/s/p) | throughput rate (tokens/s/p) | single-step time (s/step) |
| --------- | ------------- | ---------------- | ----------------------------- | ---------------------------- | ------------------------- |
| NPUs      | Baichuan2-13B | 1000             | -                             | 1668                         | -                         |
| Reference | Baichuan2-13B | -                | -                             | 2062                         | -                         |

Inference

Configure the Baichuan2-13B inference script: examples/baichuan2/generate_baichuan2_13b_ptd.sh

# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh 
 
# modify script model path and tokenizer path
CHECKPOINT="./model_weights/Baichuan2-13B-v0.1-tp8-pp1/"
TOKENIZER_PATH="./model_from_hf/Baichuan2-13B/"

Launch the Baichuan2-13B inference script: examples/baichuan2/generate_baichuan2_13b_ptd.sh

bash examples/baichuan2/generate_baichuan2_13b_ptd.sh

Some inference samples are shown in the inference example image.

Evaluation

We use the BoolQ benchmark to evaluate the model. The benchmark dataset can be downloaded here.

# config origin weight and vocab file path
CHECKPOINT=<origin-ckpt-path>
TOKENIZER_PATH=<tokenizer-path>
# config tasks and dataset path
DATA_PATH="./boolq/"
TASK="boolq"
bash ./examples/baichuan2/evaluate_baichuan2_13B_ptd.sh
| Task  | Subset | Model         | NPU  | OpenSource |
| ----- | ------ | ------------- | ---- | ---------- |
| BoolQ | test   | Baichuan2-13B | 0.78 | 0.67       |