!98 add finetune chinese llama alpaca scripts and readme file

Merge pull request !98 from stacey/master
This commit is contained in:
i-robot 2023-10-08 09:10:24 +00:00 committed by Gitee
commit a9d3f05550
4 changed files with 415 additions and 4 deletions

README.md

@@ -5,7 +5,7 @@ AscendSpeed provides end-to-end test scripts for large language models on [Ascen
## Why AscendSpeed?
---
### Pretrained Models
### Prepared Models
Currently, the pre-training of the following models has been supported:
* <a href="https://github.com/baichuan-inc" style="color:green">Baichuan</a>-[7B/13B](examples/baichuan/README.md)
@@ -323,7 +323,7 @@ For llama and other LLMs without bias in FFN, the linear transformation in FFN c
<td> -- </td>
<td> -- </td>
<td> -- </td>
<td rowspan="8"> -- </td>
<td> -- </td>
</tr>
<tr>
<td rowspan="2"><a href="examples/bloom/README.md">Bloom</a></td>
@@ -331,12 +331,14 @@ For llama and other LLMs without bias in FFN, the linear transformation in FFN c
<td> -- </td>
<td> -- </td>
<td> -- </td>
<td> -- </td>
</tr>
<tr>
<td> 176B </td>
<td> -- </td>
<td> -- </td>
<td> -- </td>
<td> -- </td>
</tr>
<tr>
<td>InternLM</td>
@@ -344,25 +346,36 @@ For llama and other LLMs without bias in FFN, the linear transformation in FFN c
<td> -- </td>
<td> -- </td>
<td> -- </td>
<td> -- </td>
</tr>
<tr>
<td rowspan="3">LLaMA</td>
<td rowspan="4">LLaMA</td>
<td>7B</td>
<td> <a href="https://gitee.com/ascend/AscendSpeed/tree/master/examples/alpaca/finetune_chinese_llama_alpaca_7_13_33b_tp4_pp2.sh">lora</a> </td>
<td> -- </td>
<td> -- </td>
<td> -- </td>
<td> <a href="https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json">alpaca_data.json</a> </td>
</tr>
<tr>
<td>13B</td>
<td> <a href="https://gitee.com/ascend/AscendSpeed/tree/master/examples/alpaca/finetune_chinese_llama_alpaca_7_13_33b_tp4_pp2.sh">lora</a> </td>
<td> -- </td>
<td> -- </td>
<td> <a href="https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json">alpaca_data.json</a> </td>
</tr>
<tr>
<td>33B</td>
<td> <a href="https://gitee.com/ascend/AscendSpeed/tree/master/examples/alpaca/finetune_chinese_llama_alpaca_7_13_33b_tp4_pp2.sh">lora</a> </td>
<td> -- </td>
<td> -- </td>
<td> <a href="https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json">alpaca_data.json</a> </td>
</tr>
<tr>
<td > 65B </td>
<td > -- </td>
<td> -- </td>
<td> -- </td>
<td> -- </td>
</tr>
<tr>
<td>LLaMA2</td>
@@ -370,6 +383,7 @@ For llama and other LLMs without bias in FFN, the linear transformation in FFN c
<td> -- </td>
<td> -- </td>
<td> -- </td>
<td> -- </td>
</tr>
</tbody>
</table>

examples/alpaca/README.md

@@ -0,0 +1,189 @@
# Chinese-LLaMA-Alpaca
This directory contains the scripts used to reproduce the Chinese-LLaMA-Alpaca results in AscendSpeed.
The Chinese-LLaMA-Alpaca model comes from: [Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca](https://arxiv.org/abs/2304.08177)
> Cui, Yiming, Ziqing Yang, and Xin Yao. "Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca." arXiv preprint arXiv:2304.08177 (2023).
# Contents
- [Contents](#contents)
- [Model Weights](#model-weights)
- [Merge Weights](#merge-weights)
- [Fine-tune](#fine-tune)
- [Script](#script)
- [Citation](#citation)
# Model Weights
First download the [original LLaMA model](https://github.com/facebookresearch/llama) weights, then download the [Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) LoRA weights, which can be understood as a "patch" on top of the original LLaMA model. Finally, merge the LoRA weights into the original LLaMA model to obtain the complete model weights.
# Merge Weights
Before merging, make sure the machine has enough memory to load the complete model weights (for example, the 7B model requires about 13-15 GB). Also verify the integrity of the base model and the downloaded LoRA weights, and check that they match the values listed in SHA256.md; otherwise the merge cannot be performed. The original LLaMA release includes: tokenizer.model, tokenizer_checklist.chk, consolidated.*.pth and params.json.
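As a quick integrity check before merging, you can hash the downloaded files with `sha256sum` and compare the output against SHA256.md by eye (a minimal sketch; the paths and the LoRA archive name are placeholders for wherever you stored the downloads):
```
# original LLaMA files (placeholder paths)
sha256sum path_to_original_llama_root_dir/7B/consolidated.00.pth path_to_original_llama_root_dir/tokenizer.model
# downloaded Chinese-LLaMA/Alpaca LoRA package (placeholder file name)
sha256sum path_to_chinese_llama_or_alpaca_lora.zip
```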
#### Step 1: [Convert the original LLaMA model to HF format.](https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E6%89%8B%E5%8A%A8%E6%A8%A1%E5%9E%8B%E5%90%88%E5%B9%B6%E4%B8%8E%E8%BD%AC%E6%8D%A2#step-1-%E5%B0%86%E5%8E%9F%E7%89%88llama%E6%A8%A1%E5%9E%8B%E8%BD%AC%E6%8D%A2%E4%B8%BAhf%E6%A0%BC%E5%BC%8F)
Use the [convert_llama_weights_to_hf.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py) script provided by Transformers to convert the original LLaMA model to HuggingFace format.
```
python convert_llama_weights_to_hf.py \
--input_dir path_to_original_llama_root_dir \
--model_size 7B \
--output_dir path_to_original_llama_hf_dir
```
Model files in HF format will be generated in the `--output_dir` directory, such as:
```
config.json
generation_config.json
pytorch_model-00001-of-00002.bin
pytorch_model-00002-of-00002.bin
pytorch_model.bin.index.json
special_tokens_map.json
tokenizer_config.json
tokenizer.json
tokenizer.model
```
#### Step 2: [Combine LoRA weights to generate full model weights.](https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E6%89%8B%E5%8A%A8%E6%A8%A1%E5%9E%8B%E5%90%88%E5%B9%B6%E4%B8%8E%E8%BD%AC%E6%8D%A2#step-2-%E5%90%88%E5%B9%B6lora%E6%9D%83%E9%87%8D%E7%94%9F%E6%88%90%E5%85%A8%E9%87%8F%E6%A8%A1%E5%9E%8B%E6%9D%83%E9%87%8D)
This step expands the Chinese vocabulary of the original LLaMA model (HF format), merges the LoRA weights, and generates the full model weights. You can output either PyTorch-format weights (.pth files) or HuggingFace-format weights (.bin files). It is recommended to produce the pth files first, compare the SHA256 of the merged model, and only then convert to HF format if needed (a pth variant of the merge command is sketched after the parameter list below).
**Single LoRA weight merging** (applicable to Chinese-LLaMA, Chinese-LLaMA-Plus, Chinese-Alpaca).
Download the script [merge_llama_with_chinese_lora.py](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_llama_with_chinese_lora.py), and execute the following command:
```
python merge_llama_with_chinese_lora.py \
--base_model path_to_original_llama_hf_dir \
--lora_model path_to_chinese_llama_or_alpaca_lora \
--output_type huggingface \
--output_dir path_to_merged_hf_dir
```
Parameter Description:
- `--base_model`: directory containing the HF-format LLaMA weights and configuration files (generated in Step 1).
- `--lora_model`: directory containing the extracted Chinese LLaMA/Alpaca LoRA files.
- `--output_type`: output format, either `pth` or `huggingface`; defaults to `pth`.
- `--output_dir`: directory in which to save the full model weights; defaults to `./`.
- (Optional) `--offload_dir` (only valid for the old script `scripts/merge_llama_with_chinese_lora.py`): low-memory users need to specify an offload cache path.
- (Optional) `--verbose` (only valid for the new script `scripts/merge_llama_with_chinese_lora_low_mem.py`): display detailed information during the merge.
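As noted above, it is recommended to produce the pth output first for SHA256 verification; the corresponding variant of the single-LoRA command is sketched below (paths are placeholders, as before):
```
python merge_llama_with_chinese_lora.py \
--base_model path_to_original_llama_hf_dir \
--lora_model path_to_chinese_llama_or_alpaca_lora \
--output_type pth \
--output_dir path_to_merged_pth_dir
```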
**Multi-LoRA weight merging** (applicable to Chinese-Alpaca-Plus and Chinese-Alpaca-Pro).
Use the same [merge_llama_with_chinese_lora.py](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_llama_with_chinese_lora.py) script, passing both LoRA weights as a comma-separated list:
```
python merge_llama_with_chinese_lora.py \
--base_model path_to_original_llama_hf_dir \
--lora_model path_to_chinese_llama_plus_lora,path_to_chinese_alpaca_plus_lora \
--output_type huggingface \
--output_dir path_to_merged_hf_dir
```
#### Step 3: Check SHA256 after merge.
Be sure to check the [SHA256](https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/SHA256.md) values after the merge is complete. It is recommended to convert to pth format first and, once the SHA256 comparison passes, convert to HF format if necessary, because the SHA256 of HF-format models often changes (the meta information changes).
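The same `sha256sum` check applies to the merged output; assuming a pth merge whose shards are named consolidated.*.pth (the exact file names may differ), a sketch is:
```
sha256sum path_to_merged_pth_dir/consolidated.*.pth
```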
#### Step 4: Convert the checkpoint from HuggingFace format to the AscendSpeed model-parallel format.
Execute the following command:
```
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py \
--input-model-dir path_to_merged_hf_dir \
--output-model-dir path_to_merged_ascendspeed_dir \
--tensor-model-parallel-size 4 \
--pipeline-model-parallel-size 2 \
--type 7B
```
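The tensor/pipeline parallel sizes above (4 and 2) match the torch-launcher fine-tune script below. If you plan to use the DeepSpeed launcher script instead, which runs with tensor parallel size 1 and pipeline parallel size 1, the conversion should presumably use matching sizes, for example:
```
python tools/ckpt_convert/llama/convert_weights_from_huggingface.py \
--input-model-dir path_to_merged_hf_dir \
--output-model-dir path_to_merged_ascendspeed_dir \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--type 7B
```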
# Fine-tune
## Script
1. Clone the repository to your local server
```bash
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
mkdir logs
mkdir ckpt
```
2. Build environment
```bash
# python3.7
conda create -n test python=3.7
conda activate test
# install torch and torch_npu
pip install torch==1.11 -i https://pypi.tuna.tsinghua.edu.cn/simple
wget https://gitee.com/ascend/pytorch/releases/download/v5.0.rc2-pytorch1.11.0/torch_npu-1.11.0.post1-cp37-cp37m-linux_aarch64.whl  # ARM
wget https://gitee.com/ascend/pytorch/releases/download/v5.0.rc2-pytorch1.11.0/torch_npu-1.11.0.post1-cp37-cp37m-linux_x86_64.whl   # X86
pip install torch_npu-1.11.0.post1-cp37-cp37m-linux_XXXXXX.whl
# install deepspeed and deepspeed_npu
pip install deepspeed==0.9.2
git clone https://gitee.com/ascend/DeepSpeed.git -b v0.9.2 deepspeed_npu
cd deepspeed_npu
pip3 install -e ./
# install other packages
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
3. Prepare dataset
```bash
# for llama, download the alpaca dataset, for example:
wget http://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json
# download the tokenizer configs and (optionally) the weights from
# https://huggingface.co/decapoda-research/llama-7b-hf/tree/main
# change "LLaMATokenizer" to "LlamaTokenizer" in tokenizer_config.json (this works around a HuggingFace naming issue)
mkdir dataset
python tools/preprocess_data.py --input alpaca_data.json \
--output-prefix $DATA_PATH \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--handler-name GeneralInstructionHandler
```
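The preprocessing command above reads `$DATA_PATH` and `$TOKENIZER_PATH`, which the snippet does not set; a minimal sketch with illustrative placeholder paths is:
```bash
# illustrative placeholders -- adjust to your own paths
TOKENIZER_PATH=./llama-7b-hf/    # directory holding the tokenizer files downloaded above
DATA_PATH=./dataset/alpaca       # output prefix under which the preprocessed files are written
```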
4. Configure the Chinese-LLaMA-Alpaca fine-tuning script
The 7B/13B/33B parameter sets are selected through `$MODEL_PATH`: for example, if `$MODEL_PATH` matches `*7b*`, the 7B parameters are used. Edit the placeholder paths before launching, as shown in the sketch after the two launch commands below.
* Based on PyTorch's built-in distributed launcher: [Chinese-LLaMA-Alpaca-7B/13B/33B](finetune_chinese_llama_alpaca_7_13_33b_tp4_pp2.sh)
```bash
bash examples/alpaca/finetune_chinese_llama_alpaca_7_13_33b_tp4_pp2.sh
```
* Based on the DeepSpeed launcher: [Chinese-LLaMA-Alpaca-7B/13B/33B](finetune_chinese_llama_alpaca_7_13_33b_tp1_pp1_deepspeed.sh)
```bash
bash examples/alpaca/finetune_chinese_llama_alpaca_7_13_33b_tp1_pp1_deepspeed.sh
```
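As mentioned above, both scripts read three placeholder variables near the top that must be edited before launching. A sketch with illustrative values (reusing the paths from step 3):
```bash
# edit these placeholders inside the chosen script; the values below are illustrative
TOKENIZER_PATH=./llama-7b-hf/                  # same tokenizer directory as in step 3
DATA_PATH=./dataset/alpaca                     # same preprocessed-data prefix as in step 3
MODEL_PATH=./model/chinese_alpaca_7b_tp4_pp2   # weights converted in Step 4; the name must contain 7b, 13b or 33b
```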
# Citation
You may also consider citing the original work:
```
@article{chinese-llama-alpaca,
title={Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca},
author={Cui, Yiming and Yang, Ziqing and Yao, Xin},
journal={arXiv preprint arXiv:2304.08177},
url={https://arxiv.org/abs/2304.08177},
year={2023}
}
```

examples/alpaca/finetune_chinese_llama_alpaca_7_13_33b_tp1_pp1_deepspeed.sh

@@ -0,0 +1,124 @@
# This script is used to run Chinese LLaMA Alpaca with 7B/13B/33B weights based on deepspeed launcher, configured with tensor model parallel size of 1, pipeline model parallel size of 1.
# add HCCL_OP_BASE_FFTS_MODE_ENABLE
export HCCL_OP_BASE_FFTS_MODE_ENABLE=TRUE
# modify the script according to your own conda and ascend-toolkit path
export LD_LIBRARY_PATH=/usr/local/lib:/root/anaconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
source /usr/local/Ascend/ascend-toolkit/set_env.sh
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
# modify the original dataset path and tokenizer path below according to your own dataset path and tokenizer path
TOKENIZER_PATH=<tokenizer-path>
DATA_PATH=<data-path>
# your own merged model path
MODEL_PATH=<model-path>
DS_CONFIG=deepspeed_config_13B.json
ZERO_STAGE=2
GLOBAL_BATCH=16
MICRO_BATCH=2
# 7b/13b/33b models use the following parameters respectively
if [[ "$MODEL_PATH" == *13b* ]]; then
num_layers=40
hidden_size=5120
ffn_hidden_size=13824
num_heads=40
elif [[ "$MODEL_PATH" == *33b* ]]; then
num_layers=60
hidden_size=6656
ffn_hidden_size=17920
num_heads=52
else
num_layers=32
hidden_size=4096
ffn_hidden_size=11008
num_heads=32
fi
# This is the configuration for deepspeed
cat <<EOT > $DS_CONFIG
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 8,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "Adam"
},
"zero_optimization": {
"stage": $ZERO_STAGE,
"allgather_partitions": true,
"allgather_bucket_size": 1e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 1e8,
"contiguous_gradients": true
},
"gradient_accumulation_steps": 1,
"train_batch_size": $GLOBAL_BATCH,
"train_micro_batch_size_per_gpu":$MICRO_BATCH,
"zero_allow_untested_optimizer": true
}
EOT
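# assemble the DeepSpeed-specific arguments that are appended to the pretrain_llama.py launch below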
ds_args=" --deepspeed ${ds_args}"
ds_args=" --no-pipeline-parallel ${ds_args}"
ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}"
ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"
ds_args=" --deepspeed-activation-checkpointing ${ds_args}"
deepspeed pretrain_llama.py \
--DDP-impl local \
--is-instruction-dataset \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--num-layers $num_layers \
--hidden-size $hidden_size \
--ffn-hidden-size $ffn_hidden_size \
--num-attention-heads $num_heads \
--micro-batch-size $MICRO_BATCH \
--global-batch-size $GLOBAL_BATCH \
--seq-length 1024 \
--max-position-embeddings 2048 \
--train-iters 1 \
--lr-decay-iters 320000 \
--load $MODEL_PATH \
--data-path $DATA_PATH \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 1e-4 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-fraction .01 \
--checkpoint-activations \
--log-interval 1 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--lora-target-modules query_key_value dense gate_proj up_proj down_proj \
--lora-r 64 \
--lora-alpha 128 \
--lora-modules-to-save word_embeddings lm_head.lm_head \
$ds_args \
--fp16 | tee logs/train.log

examples/alpaca/finetune_chinese_llama_alpaca_7_13_33b_tp4_pp2.sh

@@ -0,0 +1,84 @@
# This script is used to run Chinese LLaMA Alpaca with 7B/13B/33B weights, configured with tensor model parallel size of 4, pipeline model parallel size of 2.
# modify the script according to your own conda and ascend-toolkit path
export LD_LIBRARY_PATH=/usr/local/lib:/root/anaconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
source /usr/local/Ascend/ascend-toolkit/set_env.sh
NPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
# modify the original dataset path and tokenizer path below according to your own dataset path and tokenizer path
TOKENIZER_PATH=<tokenizer-path>
DATA_PATH=<data-path>
# your own merged model path
MODEL_PATH=<model-path>
ZERO_STAGE=2
GLOBAL_BATCH=16
MICRO_BATCH=2
# Distributed setting
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
# 7b/13b/33b models use the following parameters respectively
if [[ "$MODEL_PATH" == *13b* ]]; then
num_layers=40
hidden_size=5120
ffn_hidden_size=13824
num_heads=40
elif [[ "$MODEL_PATH" == *33b* ]]; then
num_layers=60
hidden_size=6656
ffn_hidden_size=17920
num_heads=52
else
num_layers=32
hidden_size=4096
ffn_hidden_size=11008
num_heads=32
fi
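# launch LoRA fine-tuning of pretrain_llama.py through the PyTorch distributed launcher (tp=4, pp=2)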
python -m torch.distributed.launch ${DISTRIBUTED_ARGS} \
pretrain_llama.py \
--no-contiguous-buffers-in-local-ddp \
--DDP-impl local \
--is-instruction-dataset \
--tensor-model-parallel-size 4 \
--pipeline-model-parallel-size 2 \
--num-layers $num_layers \
--hidden-size $hidden_size \
--ffn-hidden-size $ffn_hidden_size \
--num-attention-heads $num_heads \
--micro-batch-size $MICRO_BATCH \
--global-batch-size $GLOBAL_BATCH \
--seq-length 2048 \
--max-position-embeddings 2048 \
--train-iters 1 \
--lr-decay-iters 320000 \
--load $MODEL_PATH \
--data-path $DATA_PATH \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--data-impl mmap \
--initial-loss-scale 4096 \
--split 949,50,1 \
--distributed-backend nccl \
--lr 1e-4 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-fraction .01 \
--eval-iters 0 \
--log-interval 1 \
--save-interval 10000 \
--lora-target-modules query_key_value dense gate_proj up_proj down_proj \
--lora-r 64 \
--lora-alpha 128 \
--lora-modules-to-save word_embeddings lm_head.lm_head \
--fp16 | tee logs/train.log