
LLaMA2

简体中文 | English

Contents

LLAMA2-7B

Training

Here's a hardware summary of pre-training LLAMA2-7B:

Hardware Value
NPU 8 x Ascend NPUs

Script

  1. Clone the repository to your local server:

    git clone https://gitee.com/ascend/ModelLink.git 
    cd ModelLink 
    mkdir logs
    mkdir ckpt
    
  2. Build environment

    # python3.8
    conda create -n test python=3.8
    conda activate test
    
    # install torch and torch_npu
    pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
    pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
    pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
    
    # modify ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh 
    
    # install AscendSpeed
    git clone https://gitee.com/ascend/AscendSpeed.git
    cd AscendSpeed
    pip install -r requirements.txt 
    pip3 install -e .
    cd ..
    
    # install other packages
    pip install -r requirements.txt 
    

    Note that if you want to train with weights from Hugging Face, please first fix a DeepSpeed checkpoint-loading bug: in the _load_zero_checkpoint function of <deepspeed-installed-path>/runtime/engine.py, change if zero_sd_list is None to if zero_sd_list is None or len(zero_sd_list) == 0, as shown below.

    # original deepspeed/runtime/engine.py, around lines 2746-2748
    zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
    if zero_sd_list is None:
        return False
    
    # modified
    zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
    if zero_sd_list is None or len(zero_sd_list) == 0:
        return False
    
  3. Prepare pretrained weights and tokenizer. Download the LLAMA2-7B checkpoint from here:

      #!/bin/bash
      mkdir -p llama-2-7b-hf
      cd llama-2-7b-hf
      wget https://huggingface.co/daryl149/llama-2-7b-hf/resolve/main/config.json
      wget https://huggingface.co/daryl149/llama-2-7b-hf/resolve/main/generation_config.json
      wget https://huggingface.co/daryl149/llama-2-7b-hf/resolve/main/pytorch_model-00001-of-00002.bin
      wget https://huggingface.co/daryl149/llama-2-7b-hf/resolve/main/pytorch_model-00002-of-00002.bin
      wget https://huggingface.co/daryl149/llama-2-7b-hf/resolve/main/pytorch_model.bin.index.json
      wget https://huggingface.co/daryl149/llama-2-7b-hf/resolve/main/special_tokens_map.json
      wget https://huggingface.co/daryl149/llama-2-7b-hf/resolve/main/tokenizer.json
      wget https://huggingface.co/daryl149/llama-2-7b-hf/resolve/main/tokenizer.model
      wget https://huggingface.co/daryl149/llama-2-7b-hf/resolve/main/tokenizer_config.json
      cd ..
    
  4. Weight conversion in PTD mode

    Note that if you want to use the weights from Hugging Face, please run the weight conversion script first. The following uses the llama-2-7b weight conversion in PTD mode as an example.

     # modify the script according to your own ascend-toolkit path
     source /usr/local/Ascend/ascend-toolkit/set_env.sh
    
     # convert to ptd weights
     python tools/checkpoint/convert_ckpt.py --model-type GPT \
                                     --loader llama2_hf \
                                     --saver megatron \
                                     --target-tensor-parallel-size 8 \
                                     --target-pipeline-parallel-size 1 \
                                     --load-dir ../llama-2-7b-hf \
                                     --save-dir {your megatron ckpt save path} \
                                     --tokenizer-model ../llama-2-7b-hf/tokenizer.model
    

    Megatron weights with any parallel slicing strategy --> Hugging Face weights

    cd ModelLink/
    # Modify the ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    python tools/checkpoint/convert_ckpt.py --model-type GPT \
        --loader megatron \
        --saver megatron \
        --save-model-type save_huggingface_llama \
        --load-dir ../llama27B-v0.1-pt8-pp1 \
        --target-tensor-parallel-size 1 \
        --target-pipeline-parallel-size 1 \
        --save-dir ../llama27B_downloaded  # <-- Fill in the original HF model path here, new weights will be saved in ../llama27B_downloaded/mg2hg
    

    Weight conversion is suitable for pre-training, fine-tuning, inference and evaluation. Adjust the parameters target-tensor-parallel-size and target-pipeline-parallel-size according to different tasks.
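
    For example, preparing weights for a different slicing strategy only requires changing the two target-* parameters (and the save path). A minimal sketch, assuming the same convert_ckpt.py interface as above; the TP/PP values and save path are illustrative:

    # same converter as above, with a different target slicing (2-way tensor / 2-way pipeline parallel)
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    python tools/checkpoint/convert_ckpt.py --model-type GPT \
        --loader llama2_hf \
        --saver megatron \
        --target-tensor-parallel-size 2 \
        --target-pipeline-parallel-size 2 \
        --load-dir ../llama-2-7b-hf \
        --save-dir ./llama2-7b-tp2-pp2 \
        --tokenizer-model ../llama-2-7b-hf/tokenizer.model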

  5. pre-training

    5.1 Prepare dataset

    Download the LLAMA2-7B datasets from here

    # download datasets
    mkdir dataset_llama2
    cd ./dataset_llama2
    wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
    cd ..
    
    # process datasets                              
    python ./tools/preprocess_data.py \
    	 --input ./dataset_llama2/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
    	 --tokenizer-name-or-path ./llama-2-7b-hf \
    	 --output-prefix ./dataset_llama2/alpaca \
    	 --workers 4 \
    	 --log-interval 1000 \
    	 --tokenizer-type PretrainedFromHF
    

    5.2 Pre-training in PTD mode. Config LLAMA2-7B pre-training script: examples/llama2/pretrain_llama2_7b_ptd.sh

     # modify the script according to your own ascend-toolkit path
     source /usr/local/Ascend/ascend-toolkit/set_env.sh 
    
     # modify config according to your own actual situation
     LOAD_CHECKPOINT_PATH="your init model load path"
     SAVE_CHECKPOINT_PATH="your model ckpt save path"
     TOKENIZER_MODEL=./llama-2-7b-hf/tokenizer.model  #tokenizer path
     DATA_PATH=./dataset_llama2/alpaca_text_document  #processed dataset
    

    Multi-machine training requires adding the parameter --overlap-grad-reduce, as sketched below.
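
    A minimal two-node sketch, assuming the pretrain script keeps its distributed launch settings in the usual MASTER_ADDR / NNODES / NODE_RANK variables and collects its training flags in a GPT_ARGS variable (adapt this to the actual contents of the script):

     # run on every node, with NODE_RANK set per node
     MASTER_ADDR="ip of the first node"   # same value on all nodes
     MASTER_PORT=6000
     NNODES=2
     NODE_RANK=0                          # 0 on the first node, 1 on the second

     # append the flag to the training arguments
     GPT_ARGS="$GPT_ARGS --overlap-grad-reduce"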

    Launch LLAMA2-7B pre-training script: examples/llama2/pretrain_llama2_7b_ptd.sh

     bash examples/llama2/pretrain_llama2_7b_ptd.sh 
    
  6. fine-tuning

    6.1 Prepare the fine-tuning dataset. Download the LLAMA2-7B datasets from here:

    # download datasets
    mkdir finetune_dataset
    cd ./finetune_dataset
    wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
    cd ..
    
    # process datasets                              
    python ./tools/preprocess_data.py \
      --input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
    	  --tokenizer-name-or-path ./llama-2-7b-hf \
    	  --output-prefix ./finetune_dataset/alpaca \
    	  --workers 4 \
    	  --log-interval 1000 \
    	  --tokenizer-type PretrainedFromHF \
    	  --handler-name GeneralInstructionHandler \
    	  --append-eod
    

    6.2 Full Parameters Fine-Tuning. The configuration script for full-parameter fine-tuning is basically the same as pretrain_llama2_7b_ptd.sh; the difference is that the dataset and the training parameter --is-instruction-dataset are added.

    Add the fine-tuning parameter --finetune so that fine-tuning starts from the first step.

    DATA_PATH=./finetune_dataset/alpaca
    TOKENIZER_PATH=./llama-2-7b-hf
    --finetune \
    --is-instruction-dataset \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path ${TOKENIZER_PATH} \
    --tokenizer-not-use-fast \
    

    6.3 LoRA Fine-Tuning. The LoRA fine-tuning script is configured by adding the following LoRA parameters to the pretrain_llama2_7b_ptd.sh script:

        --lora-target-modules query_key_value dense proj dense_4h_to_h \
        --lora-r 16 \
        --lora-alpha 32 \
    

    If the vocabulary is changed, add the following parameters:

      --lora-modules-to-save word_embeddings output_layer \
    

    To enable LoRA resumable training, add the following parameters:

        --load ${ORIGIN_CHECKPOINT}  \
        --lora-load ${LORA_CHECKPOINT} \
    

    Launch LLAMA2-7B LoRA fine-tuning script: examples/llama2/tune_llama2_7b_ptd.sh

     bash examples/llama2/tune_llama2_7b_ptd.sh 
    

Performance

Machine performance

The performance of LLaMA2-7B on Ascend NPUs compared with the reference:

Device     Model      total Iterations  throughput rate (samples/s/p)  throughput rate (tokens/s/p)  single-step time (s/step)  floating point operation (TFLOPs/s)
NPUs       LLaMA2-7B  1024              5.19                           2730                          3.08                       122.39
Reference  LLaMA2-7B  1024              5.63                           2884                          2.84                       131.96

Inference-7B

Config llama2-7B inference script: examples/llama2/generate_llama2_7b_ptd.sh

# modify the script according to your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh 
 
# modify script model path and tokenizer path
TOKENIZER_PATH=./llama2-7b-hf/  #tokenizer path
CHECKPOINT=./llama2-7b-tp8pp1  #model path

Config llama2-7B lora inference script: examples/llama2/generate_llama2_7b_lora_ptd.sh

# modify lora model path
CHECKPOINT_LORA="your lora model directory path"

Launch llama2-7B inference script: examples/llama2/generate_llama2_7b_ptd.sh

bash examples/llama2/generate_llama2_7b_ptd.sh

Launch llama2-7B lora inference script: examples/llama2/generate_llama2_7b_lora_ptd.sh

bash examples/llama2/generate_llama2_7b_lora_ptd.sh

Some inference samples are as follows: Inference

Evaluation-7B

We use the MMLU benchmark to evaluate our model. The benchmark can be downloaded here. Config llama2-7B evaluation script: examples/llama2/evaluate_llama2_7B_ptd.sh

source /usr/local/Ascend/ascend-toolkit/set_env.sh 

# modify script model path and tokenizer path
TOKENIZER_PATH=./llama2-7b-hf/  #tokenizer path
CHECKPOINT=./llama2-7b-tp8pp1  #model path
# configure task and data path
DATA_PATH="./mmlu/data/test/"
TASK="mmlu"

Launch llama2-7B evaluation script:

bash examples/llama2/evaluate_llama2_7B_ptd.sh

Evaluation results

                           subject_name  question_n   acc_ref   acc_npu  score_diff
17                     public_relations         110  0.563636  0.554545      0.009091
44                         econometrics         114  0.368421  0.377193      0.008772
30               electrical_engineering         145  0.503448  0.510345      0.006897
5                       world_religions         171  0.701754  0.707602      0.005848
25               high_school_us_history         204  0.647059  0.651961      0.004902
45                          human_aging         223  0.596413  0.600897      0.004484
38                            marketing         234  0.709402  0.713675      0.004274
55            high_school_world_history         237  0.620253  0.624473      0.004219
31           high_school_microeconomics         238  0.420168  0.424370      0.004202
7                             nutrition         306  0.503268  0.500000      0.003268
56                  high_school_biology         310  0.541935  0.545161      0.003226
20                           philosophy         311  0.569132  0.565916      0.003215
24               elementary_mathematics         378  0.291005  0.293651      0.002646
22               high_school_psychology         545  0.645872  0.647706      0.001835
12                     professional_law        1534  0.339635  0.340939      0.001304
13                        miscellaneous         783  0.679438  0.678161      0.001277
6                       moral_scenarios         895  0.221229  0.222346      0.001117
37  high_school_government_and_politics         193  0.694301  0.694301      0.000000
54                           prehistory         324  0.555556  0.555556      0.000000
53                    us_foreign_policy         100  0.700000  0.700000      0.000000
39                high_school_geography         198  0.626263  0.626263      0.000000
40                     security_studies         245  0.522449  0.522449      0.000000
41                high_school_chemistry         203  0.408867  0.408867      0.000000
52                   clinical_knowledge         265  0.513208  0.513208      0.000000
49              professional_psychology         612  0.482026  0.482026      0.000000
42                           management         103  0.679612  0.679612      0.000000
43                        jurisprudence         108  0.583333  0.583333      0.000000
51                    computer_security         100  0.560000  0.560000      0.000000
50                   conceptual_physics         235  0.417021  0.417021      0.000000
35                      human_sexuality         131  0.526718  0.526718      0.000000
46                             virology         166  0.439759  0.439759      0.000000
47                       moral_disputes         346  0.514451  0.514451      0.000000
48                              anatomy         135  0.459259  0.459259      0.000000
36                      college_physics         102  0.215686  0.215686      0.000000
0            high_school_macroeconomics         390  0.420513  0.420513      0.000000
34                  high_school_physics         151  0.311258  0.311258      0.000000
33             college_computer_science         100  0.420000  0.420000      0.000000
2                     international_law         121  0.636364  0.636364      0.000000
3                   college_mathematics         100  0.330000  0.330000      0.000000
4                      college_medicine         173  0.410405  0.410405      0.000000
8                high_school_statistics         216  0.314815  0.314815      0.000000
9                      medical_genetics         100  0.450000  0.450000      0.000000
10                    college_chemistry         100  0.290000  0.290000      0.000000
11              professional_accounting         282  0.411348  0.411348      0.000000
14                            sociology         201  0.601990  0.601990      0.000000
15                professional_medicine         272  0.452206  0.452206      0.000000
16                    logical_fallacies         163  0.521472  0.521472      0.000000
18                      college_biology         144  0.506944  0.506944      0.000000
19         high_school_european_history         165  0.575758  0.575758      0.000000
21                     abstract_algebra         100  0.280000  0.280000      0.000000
23         high_school_computer_science         100  0.430000  0.430000      0.000000
26                     machine_learning         112  0.375000  0.375000      0.000000
27                            astronomy         152  0.500000  0.500000      0.000000
1                          formal_logic         126  0.222222  0.222222      0.000000
29              high_school_mathematics         270  0.259259  0.259259      0.000000
32                      business_ethics         100  0.450000  0.450000      0.000000
28                         global_facts         100  0.380000  0.380000      0.000000
dataset  subject_num  question_num  reference_acc  NPU acc
MMLU     57           14042         0.4691         0.4698

LLaMA2-13B

Training

Here's a hardware summary of pre-training LLaMA2-13B:

Hardware Value
NPU 8 x Ascend NPUs

Script

  1. Clone the repository to your local server:

    git clone https://gitee.com/ascend/ModelLink.git 
    cd ModelLink 
    mkdir logs
    mkdir ckpt
    
  2. Build environment

    # python3.8
    conda create -n test python=3.8
    conda activate test
    
    # install torch and torch_npu
    pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
    pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
    pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
    
    # modify ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh 
    
    # install AscendSpeed
    git clone https://gitee.com/ascend/AscendSpeed.git
    cd AscendSpeed
    pip install -r requirements.txt 
    pip3 install -e .
    cd ..
    
    # install other packages
    pip install -r requirements.txt 
    
  3. Prepare pretrained weights and tokenizer. Download the LLaMA2-13B checkpoint from here:

    git lfs install
    git clone https://huggingface.co/NousResearch/Llama-2-13b-hf
    
  4. Weights convert

    Note that if you want to use the weights from Hugging Face, please run the weight conversion script first. The following uses the llama-2-13b weight conversion as an example.

    # modify the script according to your own ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    
    # convert weights
    python tools/checkpoint/convert_ckpt.py --model-type GPT \
        --loader llama2_hf \
        --saver megatron \
        --target-tensor-parallel-size 8 \
        --target-pipeline-parallel-size 1 \
        --load-dir ./llama2-13b-hf \
        --save-dir ./llama2-13b-hf-tp8 \
        --tokenizer-model ./llama2-13b-hf/tokenizer.model
    

    Weight conversion is suitable for pre-training, fine-tuning, inference and evaluation. Adjust the parameters target-tensor-parallel-size and target-pipeline-parallel-size according to different tasks.

    Megatron weights with any parallel slicing strategy --> Hugging Face weights

    cd ModelLink/
    # Modify the ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    python tools/checkpoint/convert_ckpt.py --model-type GPT \
        --loader megatron \
        --saver megatron \
        --save-model-type save_huggingface_llama \
        --load-dir ../llama213B-v0.1-pt8-pp1 \
        --target-tensor-parallel-size 1 \
        --target-pipeline-parallel-size 1 \
        --save-dir ../llama213B_downloaded  # <-- Fill in the original HF model path here, new weights will be saved in ../llama213B_downloaded/mg2hg
    
  5. pre-training

    5.1 Prepare dataset

    Download the LLAMA2-13B datasets from here

    # download datasets
    mkdir dataset_llama2
    cd ./dataset_llama2
    wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
    cd ..
    
    # process datasets                              
    python ./tools/preprocess_data.py \
    	 --input ./dataset_llama2/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
    	 --tokenizer-name-or-path ./llama-2-13b-hf \
    	 --output-prefix ./dataset_llama2/alpaca \
    	 --workers 4 \
    	 --log-interval 1000 \
    	 --tokenizer-type PretrainedFromHF
    

    5.2 Pre-training in PTD mode. Config LLAMA2-13B pre-training script: examples/llama2/pretrain_llama2_13B_ptd_8p.sh

     # modify the script according to your own ascend-toolkit path
     source /usr/local/Ascend/ascend-toolkit/set_env.sh 
    
     # modify config according to your own actual situation
     LOAD_CHECKPOINT_PATH="your init model load path"
     SAVE_CHECKPOINT_PATH="your model ckpt save path"
     TOKENIZER_MODEL=./llama-2-13b-hf/tokenizer.model  #tokenizer path
     DATA_PATH=./dataset_llama2/alpaca_text_document  #processed dataset
    

    Multi-machine training requires adding the parameter --overlap-grad-reduce.

    Launch LLAMA2-13B pre-training script: examples/llama2/pretrain_llama2_13B_ptd_8p.sh

     bash examples/llama2/pretrain_llama2_13B_ptd_8p.sh
    
  6. fine-tuning

    6.1 Prepare the fine-tuning dataset. Download the LLAMA2-13B datasets from here:

    # download datasets
    mkdir finetune_dataset
    cd ./finetune_dataset
    wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
    cd ..
    
    # process datasets                              
    python ./tools/preprocess_data.py \
      --input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
    	  --tokenizer-name-or-path ./llama-2-13b-hf \
    	  --output-prefix ./finetune_dataset/alpaca \
    	  --workers 4 \
    	  --log-interval 1000 \
    	  --tokenizer-type PretrainedFromHF \
    	  --handler-name GeneralInstructionHandler \
    	  --append-eod
    

    6.2 Full Parameters Fine-Tuning. The configuration script for full-parameter fine-tuning is basically the same as pretrain_llama2_13B_ptd_8p.sh; the difference is that the dataset and the training parameter --is-instruction-dataset are added.

    Add the fine-tuning parameter --finetune and the pretrained-weight load parameter --load so that fine-tuning starts from the first step.

    DATA_PATH=./finetune_dataset/alpaca
    TOKENIZER_PATH=./llama-2-13b-hf
    CKPT_PATH=./ckpt
    --load ${CKPT_PATH} \
    --finetune \
    --is-instruction-dataset \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path ${TOKENIZER_PATH} \
    --tokenizer-not-use-fast \
    

    6.3 LoRA Fine-Tuning. The LoRA fine-tuning script is configured by adding the following LoRA parameters on top of the full-parameter fine-tuning configuration (based on pretrain_llama2_13B_ptd_8p.sh):

        --lora-target-modules query_key_value dense proj dense_4h_to_h \
        --lora-r 16 \
        --lora-alpha 32 \
    

    If the vocabulary is changed, add the following parameters:

      --lora-modules-to-save word_embeddings output_layer \
    

    Launch LLAMA2-13B lora fine tune script: examples/llama2/tune_llama2_13b_ptd.sh

     bash examples/llama2/tune_llama2_13b_ptd.sh 
    

Performance

Machine performance

The performance of LLaMA2-13B on Ascend NPUs compared with the reference:

Device     Model       total Iterations  throughput rate (samples/s/p)  throughput rate (tokens/s/p)  single-step time (s/step)  floating point operation (TFLOPs/s)
NPUs       LLaMA2-13B  5000              3.027                          1550                          5.285                      133.77
Reference  LLaMA2-13B  --                --                             1750                          --                         --

Inference

We support inference for text generation with LLaMA2-13B. Inference differs from pre-training in that, for example, we need to load the pre-trained checkpoint and set the length of the output samples (see the sketch below).
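
A minimal sketch of what differs from pre-training, assuming the generate script forwards the checkpoint and a generation-length option such as --max-new-tokens to inference.py (the exact option names may differ; check the script itself):

# inference loads a converted checkpoint instead of a training dataset,
# and caps the length of the generated output, e.g.
--load ${CHECKPOINT} \
--max-new-tokens 256 \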

Config Llama2-13B inference script: examples/llama2/generate_llama2_13b_ptd.sh

# modify the model weight path and tokenizer path
CHECKPOINT=./llama2-13b-tp8-pp1/
TOKENIZER_PATH=./llama2-13b-hf/

Config Llama2-13B lora inference script: examples/llama2/generate_llama2_13b_lora_ptd.sh

# modify lora model directory path
CHECKPOINT_LORA="your lora model directory path"

Launch Llama2-13B inference script.

bash examples/llama2/generate_llama2_13b_ptd.sh

Launch Llama2-13B lora inference script.

bash examples/llama2/generate_llama2_13b_lora_ptd.sh

Some inference samples are as follows: llama2-13B-generate.png

Evaluation

We use the BoolQ benchmark to evaluate our model. The benchmark can be downloaded here.

    # modify the model weight path and tokenizer path
    CHECKPOINT=./llama2-13b-tp8-pp1/
    TOKENIZER_PATH=./llama2-13b-hf/
bash examples/llama2/evaluate_llama2_13B_ptd.sh

Evaluation results:

Task   Subset  Model       NPU    OpenSource
Boolq  Test    Llama2 13B  0.821  0.817

LLaMA2-34B/70B

Training-2

Here's a hardware summary of pre-training LLaMA2-34B/70B:

Model Hardware Value
34B NPU 16 x Ascend NPUs
70B NPU 64 x Ascend NPUs

Script-2

  1. Clone the repository to your local server:
git clone https://gitee.com/ascend/ModelLink.git 
cd ModelLink
mkdir logs
mkdir ckpt
  2. Build environment
# python3.8
conda create -n test python=3.8
conda activate test

# install torch and torch_npu 
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl

# modify the path according to your own  ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# install AscendSpeed
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
pip install -r requirements.txt 
pip3 install -e .
cd ..

# install other packages
pip install -r requirements.txt 
  3. Prepare pretrained weights and tokenizer. Download the LLaMA2-70B checkpoint from here:

    #!/bin/bash
    mkdir -p llama2-70b-hf
    cd llama2-70b-hf
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/config.json
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/generation_config.json
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/pytorch_model-00001-of-00015.bin
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/pytorch_model-00002-of-00015.bin
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/pytorch_model-00003-of-00015.bin
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/pytorch_model-00004-of-00015.bin
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/pytorch_model-00005-of-00015.bin
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/pytorch_model-00006-of-00015.bin
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/pytorch_model-00007-of-00015.bin
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/pytorch_model-00008-of-00015.bin
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/pytorch_model-00009-of-00015.bin
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/pytorch_model-00010-of-00015.bin
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/pytorch_model-00011-of-00015.bin
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/pytorch_model-00012-of-00015.bin
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/pytorch_model-00013-of-00015.bin
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/pytorch_model-00014-of-00015.bin
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/pytorch_model-00015-of-00015.bin
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/pytorch_model.bin.index.json
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/special_tokens_map.json
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/tokenizer.json
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/tokenizer.model
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/tokenizer_config.json
    cd ..
    

    For LLaMA2-34B, we use CodeLlama-34b weights and LLaMA2-70B tokenizer.

    CodeLlama-34b weights can be downloaded from here,

    #!/bin/bash
    mkdir -p codellama-34b-hf
    cd codellama-34b-hf
    wget https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf/resolve/main/config.json
    wget https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf/resolve/main/generation_config.json
    wget https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf/resolve/main/pytorch_model-00001-of-00007.bin
    wget https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf/resolve/main/pytorch_model-00002-of-00007.bin
    wget https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf/resolve/main/pytorch_model-00003-of-00007.bin
    wget https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf/resolve/main/pytorch_model-00004-of-00007.bin
    wget https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf/resolve/main/pytorch_model-00005-of-00007.bin
    wget https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf/resolve/main/pytorch_model-00006-of-00007.bin
    wget https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf/resolve/main/pytorch_model-00007-of-00007.bin
    wget https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf/resolve/main/pytorch_model.bin.index.json
    cd ..
    

    LLaMA2-70B tokenizer can be downloaded from here

    #!/bin/bash
    mkdir -p llama2-70b-hf
    cd llama2-70b-hf
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/special_tokens_map.json
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/tokenizer.json
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/tokenizer.model
    wget https://huggingface.co/meta-llama/Llama-2-70b-hf/resolve/main/tokenizer_config.json
    cd ..
    
  4. Weights convert

    Note that if you want to use the weights from Hugging Face, please run the weight conversion script first. The following converts the llama-2-70b model weights.

    # modify the script according to your own ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    
    # convert to megatron weights
    python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --target-pipeline-parallel-size 4 \
    --load-dir ./llama2-70b-hf/ \
    --save-dir ./load_ckpt \
    --tokenizer-model ./llama2-70b-hf/tokenizer.model                                                                    
    

    The following converts the llama-2-34b model weights.

    # modify the script according to your own ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    
    # convert to megatron weights
    python tools/checkpoint/convert_ckpt.py --model-type GPT \
     --loader llama2_hf \
     --saver megatron \
     --target-tensor-parallel-size 8 \
     --target-pipeline-parallel-size 4 \
     --load-dir ./codellama-34b-hf \
     --save-dir ./load_ckpt \
     --tokenizer-model ./llama2-70b-hf/tokenizer.model \
     --params-dtype bf16                                                              
    

    Megatron weights with any parallel slicing strategy --> Hugging Face weights.

    • The following converts llama-2-70b model weight.
    cd ModelLink/
    # Modify the ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    python tools/checkpoint/convert_ckpt.py --model-type GPT \
        --loader megatron \
        --saver megatron \
        --save-model-type save_huggingface_llama \
        --load-dir ../llama270B-v0.1-pt8-pp1 \
        --target-tensor-parallel-size 1 \
        --target-pipeline-parallel-size 1 \
        --save-dir ../llama270B_downloaded  # <-- Fill in the original HF model path here, new weights will be saved in ../llama270B_downloaded/mg2hg
    
    • The following converts llama-2-34b model weight.
    cd ModelLink/
    # Modify the ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    python tools/checkpoint/convert_ckpt.py --model-type GPT \
        --loader megatron \
        --saver megatron \
        --save-model-type save_huggingface_llama \
        --load-dir ../llama234B-v0.1-pt8-pp1 \
        --target-tensor-parallel-size 1 \
        --target-pipeline-parallel-size 1 \
        --save-dir ../llama234B_downloaded  # <-- Fill in the original HF model path here, new weights will be saved in ../llama234B_downloaded/mg2hg
    

    Weight conversion is suitable for pre-training, fine-tuning, inference and evaluation. Adjust the parameters target-tensor-parallel-size and target-pipeline-parallel-size according to different tasks.

  5. pre-training

    5.1 Prepare dataset

    There are two dataset examples: Alpaca and Moss.

    1. Alpaca Dataset

      Download the Alpaca datasets from here

    # download datasets
    mkdir dataset_llama2
    cd ./dataset_llama2
    wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
    cd ..
    
    # process datasets                              
    python ./tools/preprocess_data.py \
    --input ./dataset_llama2/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
    --tokenizer-name-or-path ./llama2-70b-hf \
    --output-prefix ./dataset_llama2/alpaca \
    --workers 4 \
    --log-interval 1000 \
    --tokenizer-type PretrainedFromHF
    
    2. Moss Dataset

      Download the Moss datasets from here

    # download datasets
    mkdir dataset_llama2
    cd ./dataset_llama2
    wget https://huggingface.co/datasets/fnlp/moss-003-sft-data/resolve/main/moss-003-sft-no-tools.jsonl.zip --no-check-certificate
    unzip moss-003-sft-no-tools.jsonl.zip
    cd ..
    
    # process datasets                              
    python tools/preprocess_data.py \
    --input ./dataset_llama2/moss-003-sft-no-tools.jsonl \
    --output-prefix ./dataset_llama2/moss \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path ./llama2-70b-hf \
    --tokenizer-not-use-fast \
    --handler-name MOSSInstructionHandler
    

    5.2 Pre-training in PTD mode

    LLaMA2-34B: examples/llama2/pretrain_llama2_34B_ptd_16p.sh

    # modify the script according to your own ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh 
    
    # modify the original dataset path in the script according to your own dataset path
    TOKENIZER_MODEL=./llama2-70b-hf/tokenizer.model  #tokenizer path
    DATA_PATH=./dataset_llama2/alpaca_text_document  #processed dataset
    

    LLaMA2-70B: examples/llama2/pretrain_llama2_70b_ptd.sh

    # modify the script according to your own ascend-toolkit path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh 
    
    # modify the original dataset path in the script according to your own dataset path
    TOKENIZER_PATH=./llama2-70b-hf/  #tokenizer path
    DATA_PATH=./dataset_llama2/alpaca_text_document  #processed dataset
    

    Launch pre-training script

    LLaMA2-34B: examples/llama2/pretrain_llama2_34B_ptd_16p.sh

    bash examples/llama2/pretrain_llama2_34B_ptd_16p.sh
    

    LLaMA2-70B: examples/llama2/pretrain_llama2_70b_ptd.sh

    bash examples/llama2/pretrain_llama2_70b_ptd.sh
    
  6. fine-tuning

    6.1 Prepare the fine-tuning dataset. Download the Alpaca datasets from here:

    # download datasets
    mkdir finetune_dataset
    cd ./finetune_dataset
    wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
    cd ..
    
    # process datasets                              
    python ./tools/preprocess_data.py \
      --input ./finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
      --tokenizer-name-or-path ./llama2-70b-hf \
    	  --output-prefix ./finetune_dataset/alpaca \
    	  --workers 4 \
    	  --log-interval 1000 \
    	  --tokenizer-type PretrainedFromHF \
    	  --handler-name GeneralInstructionHandler \
    	  --append-eod
    

    6.2 Full Parameters Fine-Tuning. The configuration script for full-parameter fine-tuning is basically the same as the corresponding pre-training script (pretrain_llama2_34B_ptd_16p.sh or pretrain_llama2_70b_ptd.sh); the difference is that the dataset and the training parameter --is-instruction-dataset are added.

    Add the fine-tuning parameter --finetune so that fine-tuning starts from the first step.

    DATA_PATH=./finetune_dataset/alpaca
    TOKENIZER_PATH=./llama-2-70b-hf
    --finetune \
    --is-instruction-dataset \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path ${TOKENIZER_PATH} \
    --tokenizer-not-use-fast \
    

    6.3 LoRA Fine-Tuning. The LoRA fine-tuning script is configured by adding the following LoRA parameters to the corresponding pre-training script (pretrain_llama2_34B_ptd_16p.sh or pretrain_llama2_70b_ptd.sh):

        --lora-target-modules query_key_value dense proj dense_4h_to_h \
        --lora-r 16 \
        --lora-alpha 32 \
    

    If the vocabulary is changed, add the following parameters:

      --lora-modules-to-save word_embeddings output_layer \
    

    To enable LoRA resumable training, add the following parameters:

        --load ${ORIGIN_CHECKPOINT}  \
        --lora-load ${LORA_CHECKPOINT} \
    

    Launch LLAMA2-34B lora fine tune script: examples/llama2/tune_llama2_34b_ptd.sh

     bash examples/llama2/tune_llama2_34b_ptd.sh 
    

    Launch LLAMA2-70B lora fine tune script: examples/llama2/tune_llama2_70b_ptd.sh

     bash examples/llama2/tune_llama2_70b_ptd.sh 
    

Performance-2

Machine performance-2

The performance of LLaMA2-34B/70B on Ascend NPUs compared with the reference:

Device     Model       throughput (tokens/s/p)
NPUs       LLaMA2-34B  690
Reference  LLaMA2-34B  796
NPUs       LLaMA2-70B  350
Reference  LLaMA2-70B  339

Inference-2

The models can run generation with 8 NPUs, for example:

Config inference script:

LLaMA2-34B: examples/llama2/generate_llama2_34B_ptd.sh

LLaMA2-70B: examples/llama2/generate_llama2_70b_ptd.sh

# Modify checkpoint path and vocabfile path.
CHECKPOINT=<checkpoint-path>
TOKENIZER_PATH=<tokenizer-path>

Config lora inference script:

# modify lora model directory path
CHECKPOINT_LORA="your lora model directory path"

Launch LLaMA2-34B inference:

bash ./examples/llama2/generate_llama2_34B_ptd.sh

Launch LLaMA2-34B lora inference:

bash ./examples/llama2/generate_llama2_34b_lora_ptd.sh

Launch LLaMA2-70B inference:

bash ./examples/llama2/generate_llama2_70b_ptd.sh

Launch LLaMA2-70B lora inference:

bash ./examples/llama2/generate_llama2_70b_lora_ptd.sh

Some inference samples of LLaMA2-34B are as follows:

llama2-34B-generate

Some inference samples of LLaMA2-70B are as follows: llama2-70B_generate.png

Evaluation-2

We use the BoolQ benchmark to evaluate our model. The benchmark is available here; download the dev part here and put it in a directory named "boolq_dev" (see the sketch below).
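
A minimal placement sketch; the file name dev.jsonl and the DATA_PATH variable are assumptions, so use whatever the downloaded dev split and the evaluation script actually expect:

# put the downloaded dev split under boolq_dev
mkdir -p boolq_dev
mv dev.jsonl ./boolq_dev/
# then point the evaluation script at it, e.g. DATA_PATH="./boolq_dev/"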

Config evaluation script:

LLaMA2-34B: examples/llama2/evaluate_llama2_34B_ptd.sh

LLaMA2-70B: examples/llama2/evaluate_llama2_70B_ptd.sh

# Modify checkpoint path and vocabfile path.
CHECKPOINT=<checkpoint-path>
TOKENIZER_PATH=<tokenizer-path>

Launch LLaMA2-34B evaluation:

bash examples/llama2/evaluate_llama2_34B_ptd.sh

Launch LLaMA2-70B evaluation:

bash examples/llama2/evaluate_llama2_70B_ptd.sh

Evaluation results with BoolQ dataset:

Task   Subset  Model       NPU                           Benchmark
BoolQ  dev     Llama2-70b  0.859 (Llama2-70b test)       0.877
BoolQ  dev     Llama2-34b  0.651 (AquilaChat2-34B test)  0.571