!98 add finetune chinese llama alpaca scripts and readme file

Merge pull request !98 from stacey/master
This commit is contained in:
i-robot 2023-10-08 09:10:24 +00:00 committed by Gitee
commit a9d3f05550
4 changed files with 415 additions and 4 deletions

README.md

@@ -5,7 +5,7 @@ AscendSpeed provides end-to-end test scripts for large language models on [Ascen
## Why AscendSpeed?
---
### Pretrained Models
### Prepared Models
Currently, the pre-training of the following models has been supported:
* <a href="https://github.com/baichuan-inc" style="color:green">Baichuan</a>-[7B/13B](examples/baichuan/README.md)
@@ -323,7 +323,7 @@ For llama and other LLMs without bias in FFN, the linear transformation in FFN c
<td> -- </td>
<td> -- </td>
<td> -- </td>
<td rowspan="8"> -- </td>
<td> -- </td>
</tr>
<tr>
<td rowspan="2"><a href="examples/bloom/README.md">Bloom</a></td>
@@ -331,12 +331,14 @@ For llama and other LLMs without bias in FFN, the linear transformation in FFN c
<td> -- </td>
<td> -- </td>
<td> -- </td>
<td> -- </td>
</tr>
<tr>
<td> 176B </td>
<td> -- </td>
<td> -- </td>
<td> -- </td>
<td> -- </td>
</tr>
<tr>
<td>InternLM</td>
@@ -344,25 +346,36 @@ For llama and other LLMs without bias in FFN, the linear transformation in FFN c
<td> -- </td>
<td> -- </td>
<td> -- </td>
<td> -- </td>
</tr>
<tr>
<td rowspan="3">LLaMA</td>
<td rowspan="4">LLaMA</td>
<td>7B</td>
<td> <a href="https://gitee.com/ascend/AscendSpeed/tree/master/examples/alpaca/finetune_chinese_llama_alpaca_7_13_33b_tp4_pp2.sh">lora</a> </td>
<td> -- </td>
<td> -- </td>
<td> -- </td>
<td> <a href="https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json">alpaca_data.json</a> </td>
</tr>
<tr>
<td>13B</td>
<td> <a href="https://gitee.com/ascend/AscendSpeed/tree/master/examples/alpaca/finetune_chinese_llama_alpaca_7_13_33b_tp4_pp2.sh">lora</a> </td>
<td> -- </td>
<td> -- </td>
<td> <a href="https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json">alpaca_data.json</a> </td>
</tr>
<tr>
<td>33B</td>
<td> <a href="https://gitee.com/ascend/AscendSpeed/tree/master/examples/alpaca/finetune_chinese_llama_alpaca_7_13_33b_tp4_pp2.sh">lora</a> </td>
<td> -- </td>
<td> -- </td>
<td> <a href="https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json">alpaca_data.json</a> </td>
</tr>
<tr>
<td > 65B </td>
<td > -- </td>
<td> -- </td>
<td> -- </td>
<td> -- </td>
</tr>
<tr>
<td>LLaMA2</td>
@@ -370,6 +383,7 @@ For llama and other LLMs without bias in FFN, the linear transformation in FFN c
<td> -- </td>
<td> -- </td>
<td> -- </td>
<td> -- </td>
</tr>
</tbody>
</table>

examples/alpaca/README.md

@@ -0,0 +1,189 @@
# Chinese-LLaMA-Alpaca
This directory contains the scripts used to reproduce the Chinese-LLaMA-Alpaca results in AscendSpeed.
The Chinese-LLaMA-Alpaca model comes from: [Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca](https://arxiv.org/abs/2304.08177)
> Cui, Yiming, Ziqing Yang, and Xin Yao. "Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca." arXiv preprint arXiv:2304.08177 (2023).
# Contents
- [Contents](#contents)
- [Model Weights](#model-weights)
- [Merge Weights](#merge-weights)
- [Fine-tune](#fine-tune)
- [Script](#script)
- [Citation](#citation)
# Model Weights
First download the [original LLaMA model](https://github.com/facebookresearch/llama) weights, then download the [Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) LoRA weights, which can be understood as a "patch" on top of the original LLaMA model. Finally, merge the LoRA weights into the original LLaMA model to obtain the complete model weights.
# Merge Weights
Before merging, make sure the machine has enough memory to load the complete model weights (for example, the 7B model requires about 13-15 GB). Also verify the integrity of the base model and the downloaded LoRA weights, and check that they match the values listed in SHA256.md; otherwise the merge cannot be performed. The original LLaMA release includes: tokenizer.model, tokenizer_checklist.chk, consolidated.*.pth and params.json.
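As a quick integrity check before merging, you can hash the downloaded files with `sha256sum` and compare the output against SHA256.md by eye (a minimal sketch; the paths and the LoRA archive name are placeholders for wherever you stored the downloads):
```
# original LLaMA files (placeholder paths)
sha256sum path_to_original_llama_root_dir/7B/consolidated.00.pth path_to_original_llama_root_dir/tokenizer.model
# downloaded Chinese-LLaMA/Alpaca LoRA package (placeholder file name)
sha256sum path_to_chinese_llama_or_alpaca_lora.zip
```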
#### Step 1: [Convert the original LLaMA model to HF format.](https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E6%89%8B%E5%8A%A8%E6%A8%A1%E5%9E%8B%E5%90%88%E5%B9%B6%E4%B8%8E%E8%BD%AC%E6%8D%A2#step-1-%E5%B0%86%E5%8E%9F%E7%89%88llama%E6%A8%A1%E5%9E%8B%E8%BD%AC%E6%8D%A2%E4%B8%BAhf%E6%A0%BC%E5%BC%8F)
Use the [convert_llama_weights_to_hf.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py) script provided by Transformers to convert the original LLaMA model to HuggingFace format.
```
python convert_llama_weights_to_hf.py \
--input_dir path_to_original_llama_root_dir \
--model_size 7B \
--output_dir path_to_original_llama_hf_dir
```
Model files in HF format will be generated in the `--output_dir` directory, such as:
```
config.json
generation_config.json
pytorch_model-00001-of-00002.bin
pytorch_model-00002-of-00002.bin
pytorch_model.bin.index.json
special_tokens_map.json
tokenizer_config.json
tokenizer.json
tokenizer.model
```
#### Step 2: [Combine LoRA weights to generate full model weights.](https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E6%89%8B%E5%8A%A8%E6%A8%A1%E5%9E%8B%E5%90%88%E5%B9%B6%E4%B8%8E%E8%BD%AC%E6%8D%A2#step-2-%E5%90%88%E5%B9%B6lora%E6%9D%83%E9%87%8D%E7%94%9F%E6%88%90%E5%85%A8%E9%87%8F%E6%A8%A1%E5%9E%8B%E6%9D%83%E9%87%8D)
This step expands the Chinese vocabulary of the original LLaMA model (HF format), merges the LoRA weights, and generates the full model weights. You can output either PyTorch-format weights (.pth files) or HuggingFace-format weights (.bin files). It is recommended to produce the pth files first, compare the SHA256 of the merged model, and only then convert to HF format if needed (a pth variant of the merge command is sketched after the parameter list below).
**Single LoRA weight merging** (applicable to Chinese-LLaMA, Chinese-LLaMA-Plus, Chinese-Alpaca).
Download the script [merge_llama_with_chinese_lora.py](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_llama_with_chinese_lora.py), and execute the following command:
```
python merge_llama_with_chinese_lora.py \
--base_model path_to_original_llama_hf_dir \
--lora_model path_to_chinese_llama_or_alpaca_lora \
--output_type huggingface \
--output_dir path_to_merged_hf_dir
```
Parameter Description:
- `--base_model`: directory containing the HF-format LLaMA weights and configuration files (generated in Step 1).
- `--lora_model`: directory containing the extracted Chinese LLaMA/Alpaca LoRA files.
- `--output_type`: output format, either `pth` or `huggingface`; defaults to `pth`.
- `--output_dir`: directory in which to save the full model weights; defaults to `./`.
- (Optional) `--offload_dir` (only valid for the old script `scripts/merge_llama_with_chinese_lora.py`): low-memory users need to specify an offload cache path.
- (Optional) `--verbose` (only valid for the new script `scripts/merge_llama_with_chinese_lora_low_mem.py`): display detailed information during the merge.
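As noted above, it is recommended to produce the pth output first for SHA256 verification; the corresponding variant of the single-LoRA command is sketched below (paths are placeholders, as before):
```
python merge_llama_with_chinese_lora.py \
--base_model path_to_original_llama_hf_dir \
--lora_model path_to_chinese_llama_or_alpaca_lora \
--output_type pth \
--output_dir path_to_merged_pth_dir
```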
**Multi-LoRA weight merging** (applicable to Chinese-Alpaca-Plus and Chinese-Alpaca-Pro).
Use the same [merge_llama_with_chinese_lora.py](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_llama_with_chinese_lora.py) script, passing both LoRA weights as a comma-separated list:
```
python merge_llama_with_chinese_lora.py \
--base_model path_to_original_llama_hf_dir \
--lora_model path_to_chinese_llama_plus_lora,path_to_chinese_alpaca_plus_lora \
--output_type huggingface \
--output_dir path_to_merged_hf_dir
```
#### Step 3: Check SHA256 after merge.
Be sure to check the [SHA256](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/SHA256.md) values after the merge is complete. It is recommended to convert to pth format first and, once the SHA256 comparison passes, convert to HF format if necessary, because the SHA256 of HF-format models often changes (the meta information changes).
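The same `sha256sum` check applies to the merged output; assuming a pth merge whose shards are named consolidated.*.pth (the exact file names may differ), a sketch is:
```
sha256sum path_to_merged_pth_dir/consolidated.*.pth
```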
#### Step 4: Convert the checkpoint from HuggingFace format to the AscendSpeed model-parallel format.
Execute the following command:
```
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py \
--input-model-dir path_to_merged_hf_dir \
--output-model-dir path_to_merged_ascendspeed_dir \
--tensor-model-parallel-size 4 \
--pipeline-model-parallel-size 2 \
--type 7B
```
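The tensor/pipeline parallel sizes above (4 and 2) match the torch-launcher fine-tune script below. If you plan to use the DeepSpeed launcher script instead, which runs with tensor parallel size 1 and pipeline parallel size 1, the conversion should presumably use matching sizes, for example:
```
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py \
--input-model-dir path_to_merged_hf_dir \
--output-model-dir path_to_merged_ascendspeed_dir \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--type 7B
```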
# Fine-tune
## Script
1. Clone the repository to your local server
```bash
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
mkdir logs
mkdir ckpt
```
2. Build environment
```bash
# python3.7
conda create -n test python=3.7
conda activate test
# install torch and torch_npu
pip install torch==1.11 -i https://pypi.tuna.tsinghua.edu.cn/simple
wget https://gitee.com/ascend/pytorch/releases/download/v5.0.rc2-pytorch1.11.0/torch_npu-1.11.0.post1-cp37-cp37m-linux_aarch64.whl  # ARM
wget https://gitee.com/ascend/pytorch/releases/download/v5.0.rc2-pytorch1.11.0/torch_npu-1.11.0.post1-cp37-cp37m-linux_x86_64.whl   # X86
pip install torch_npu-1.11.0.post1-cp37-cp37m-linux_XXXXXX.whl
# install deepspeed and deepspeed_npu
pip install deepspeed==0.9.2
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
cd deepspeed_npu
pip3 install -e ./
# install other packages
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
3. Prepare dataset
```bash
# for llama, download the alpaca dataset, for example:
wget http://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json
# download the tokenizer configs and (optionally) the weights from
# https://huggingface.co/decapoda-research/llama-7b-hf/tree/main
# change "LLaMATokenizer" to "LlamaTokenizer" in tokenizer_config.json (this works around a HuggingFace naming issue)
mkdir dataset
python tools/preprocess_data.py --input alpaca_data.json \
--output-prefix $DATA_PATH \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--handler-name GeneralInstructionHandler
```
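The preprocessing command above reads `$DATA_PATH` and `$TOKENIZER_PATH`, which the snippet does not set; a minimal sketch with illustrative placeholder paths is:
```bash
# illustrative placeholders -- adjust to your own paths
TOKENIZER_PATH=./llama-7b-hf/    # directory holding the tokenizer files downloaded above
DATA_PATH=./dataset/alpaca       # output prefix under which the preprocessed files are written
```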
4. Configure the Chinese-LLaMA-Alpaca fine-tuning script
The 7B/13B/33B parameter sets are selected through `$MODEL_PATH`: for example, if `$MODEL_PATH` matches `*7b*`, the 7B parameters are used. Edit the placeholder paths before launching, as shown in the sketch after the two launch commands below.
* Based on PyTorch's built-in distributed launcher: [Chinese-LLaMA-Alpaca-7B/13B/33B](finetune_chinese_llama_alpaca_7_13_33b_tp4_pp2.sh)
```bash
bash examples/alpaca/finetune_chinese_llama_alpaca_7_13_33b_tp4_pp2.sh
```
* Based on the DeepSpeed launcher: [Chinese-LLaMA-Alpaca-7B/13B/33B](finetune_chinese_llama_alpaca_7_13_33b_tp1_pp1_deepspeed.sh)
```bash
bash examples/alpaca/finetune_chinese_llama_alpaca_7_13_33b_tp1_pp1_deepspeed.sh
```
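As mentioned above, both scripts read three placeholder variables near the top that must be edited before launching. A sketch with illustrative values (reusing the paths from step 3):
```bash
# edit these placeholders inside the chosen script; the values below are illustrative
TOKENIZER_PATH=./llama-7b-hf/                  # same tokenizer directory as in step 3
DATA_PATH=./dataset/alpaca                     # same preprocessed-data prefix as in step 3
MODEL_PATH=./model/chinese_alpaca_7b_tp4_pp2   # weights converted in Step 4; the name must contain 7b, 13b or 33b
```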
# Citation
You may also consider citing the original work:
```
@article{chinese-llama-alpaca,
title={Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca},
author={Cui, Yiming and Yang, Ziqing and Yao, Xin},
journal={arXiv preprint arXiv:2304.08177},
url={https://arxiv.org/abs/2304.08177},
year={2023}
}
```

examples/alpaca/finetune_chinese_llama_alpaca_7_13_33b_tp1_pp1_deepspeed.sh

@@ -0,0 +1,124 @@
# This script is used to run Chinese LLaMA Alpaca with 7B/13B/33B weights based on deepspeed launcher, configured with tensor model parallel size of 1, pipeline model parallel size of 1.
# add HCCL_OP_BASE_FFTS_MODE_ENABLE
export HCCL_OP_BASE_FFTS_MODE_ENABLE=TRUE
# modify the script according to your own conda and ascend-toolkit path
export LD_LIBRARY_PATH=/usr/local/lib:/root/anaconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
source /usr/local/Ascend/ascend-toolkit/set_env.sh
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
# modify the original dataset path and tokenizer path below according to your own dataset path and tokenizer path
TOKENIZER_PATH=<tokenizer-path>
DATA_PATH=<data-path>
# your own merged model path
MODEL_PATH=<model-path>
DS_CONFIG=deepspeed_config_13B.json
ZERO_STAGE=2
GLOBAL_BATCH=16
MICRO_BATCH=2
# 7b/13b/33b models use the following parameters respectively
if [[ "$MODEL_PATH" == *13b* ]]; then
num_layers=40
hidden_size=5120
ffn_hidden_size=13824
num_heads=40
elif [[ "$MODEL_PATH" == *33b* ]]; then
num_layers=60
hidden_size=6656
ffn_hidden_size=17920
num_heads=52
else
num_layers=32
hidden_size=4096
ffn_hidden_size=11008
num_heads=32
fi
# This is the configuration for deepspeed
cat <<EOT > $DS_CONFIG
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 8,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "Adam"
},
"zero_optimization": {
"stage": $ZERO_STAGE,
"allgather_partitions": true,
"allgather_bucket_size": 1e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 1e8,
"contiguous_gradients": true
},
"gradient_accumulation_steps": 1,
"train_batch_size": $GLOBAL_BATCH,
"train_micro_batch_size_per_gpu":$MICRO_BATCH,
"zero_allow_untested_optimizer": true
}
EOT
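# assemble the DeepSpeed-specific arguments that are appended to the pretrain_llama.py launch below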
ds_args=" --deepspeed ${ds_args}"
ds_args=" --no-pipeline-parallel ${ds_args}"
ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}"
ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"
ds_args=" --deepspeed-activation-checkpointing ${ds_args}"
deepspeed pretrain_llama.py \
--DDP-impl local \
--is-instruction-dataset \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--num-layers $num_layers \
--hidden-size $hidden_size \
--ffn-hidden-size $ffn_hidden_size \
--num-attention-heads $num_heads \
--micro-batch-size $MICRO_BATCH \
--global-batch-size $GLOBAL_BATCH \
--seq-length 1024 \
--max-position-embeddings 2048 \
--train-iters 1 \
--lr-decay-iters 320000 \
--load $MODEL_PATH \
--data-path $DATA_PATH \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 1e-4 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-fraction .01 \
--checkpoint-activations \
--log-interval 1 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--lora-target-modules query_key_value dense gate_proj up_proj down_proj \
--lora-r 64 \
--lora-alpha 128 \
--lora-modules-to-save word_embeddings lm_head.lm_head \
$ds_args \
--fp16 | tee logs/train.log

examples/alpaca/finetune_chinese_llama_alpaca_7_13_33b_tp4_pp2.sh

@@ -0,0 +1,84 @@
# This script is used to run Chinese LLaMA Alpaca with 7B/13B/33B weights, configured with tensor model parallel size of 4, pipeline model parallel size of 2.
# modify the script according to your own conda and ascend-toolkit path
export LD_LIBRARY_PATH=/usr/local/lib:/root/anaconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
source /usr/local/Ascend/ascend-toolkit/set_env.sh
NPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
# modify the original dataset path and tokenizer path below according to your own dataset path and tokenizer path
TOKENIZER_PATH=<tokenizer-path>
DATA_PATH=<data-path>
# your own merged model path
MODEL_PATH=<model-path>
ZERO_STAGE=2
GLOBAL_BATCH=16
MICRO_BATCH=2
# Distributed setting
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
# 7b/13b/33b models use the following parameters respectively
if [[ "$MODEL_PATH" == *13b* ]]; then
num_layers=40
hidden_size=5120
ffn_hidden_size=13824
num_heads=40
elif [[ "$MODEL_PATH" == *33b* ]]; then
num_layers=60
hidden_size=6656
ffn_hidden_size=17920
num_heads=52
else
num_layers=32
hidden_size=4096
ffn_hidden_size=11008
num_heads=32
fi
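# launch LoRA fine-tuning of pretrain_llama.py through the PyTorch distributed launcher (tp=4, pp=2)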
python -m torch.distributed.launch ${DISTRIBUTED_ARGS} \
pretrain_llama.py \
--no-contiguous-buffers-in-local-ddp \
--DDP-impl local \
--is-instruction-dataset \
--tensor-model-parallel-size 4 \
--pipeline-model-parallel-size 2 \
--num-layers $num_layers \
--hidden-size $hidden_size \
--ffn-hidden-size $ffn_hidden_size \
--num-attention-heads $num_heads \
--micro-batch-size $MICRO_BATCH \
--global-batch-size $GLOBAL_BATCH \
--seq-length 2048 \
--max-position-embeddings 2048 \
--train-iters 1 \
--lr-decay-iters 320000 \
--load $MODEL_PATH \
--data-path $DATA_PATH \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--data-impl mmap \
--initial-loss-scale 4096 \
--split 949,50,1 \
--distributed-backend nccl \
--lr 1e-4 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-fraction .01 \
--eval-iters 0 \
--log-interval 1 \
--save-interval 10000 \
--lora-target-modules query_key_value dense gate_proj up_proj down_proj \
--lora-r 64 \
--lora-alpha 128 \
--lora-modules-to-save word_embeddings lm_head.lm_head \
--fp16 | tee logs/train.log