
Baichuan-13B

Contents

Pre-Training

Here's a quick summary of training Baichuan-13B:

Hardware: 96 64GB Atlas 910B NPUs
Software: AscendSpeed
Dataset:  alpaca-data-conversation

Datasets

Vicuna is created by fine-tuning a LLaMA base model using approximately 70K user-shared conversations gathered from ShareGPT.com with public APIs. To ensure data quality, we convert the HTML back to markdown and filter out some inappropriate or low-quality samples. Additionally, we divide lengthy conversations into smaller segments that fit the model's maximum context length. For detailed instructions to clean the ShareGPT data, check out here.

Due to some concerns, we may not release the data at the moment. If you would like to try the fine-tuning code, you can try to run it with our preprocessed alpaca dataset (originally from here).

Script

1. Install the AscendSpeed requirements environment.
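
A minimal sketch of this step, assuming the AscendSpeed source tree has already been cloned locally and provides a requirements.txt at its root (both are assumptions; follow the repository's own installation guide if it differs):

cd AscendSpeed                     # assumed local clone of the AscendSpeed repository
pip install -r requirements.txt    # assumed requirements file listing the Python dependencies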

2. Download the alpaca-data-conversation dataset: https://github.com/lm-sys/FastChat/blob/v0.1.10/playground/data/alpaca-data-conversation.json
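
For example (a sketch; wget and python3 are assumed to be available, and the raw-file URL below is the raw.githubusercontent.com form of the link above):

wget https://raw.githubusercontent.com/lm-sys/FastChat/v0.1.10/playground/data/alpaca-data-conversation.json
# Quick sanity check: print the top-level JSON type and the number of records
python3 -c "import json; d = json.load(open('alpaca-data-conversation.json')); print(type(d).__name__, len(d))"

Note that the training script below uses --data-impl mmap with DATA_PATH=./data/baichuan_text, which expects a preprocessed (tokenized and binarized) dataset rather than the raw JSON; build DATA_PATH with the repository's data preprocessing tooling first.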

3. Configure the Baichuan-13B pre-training script: AscendSpeed/examples/baichuan/pretrain_baichuan_ptd_13B.sh

# This is an example: training Baichuan-13B using PTD

# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200


# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=12892
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
GLOBAL_BATCH=32
MICRO_BATCH=1
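
# Paths: DATA_PATH below is the preprocessed dataset prefix passed to --data-path,
# TOKENIZER_PATH feeds --tokenizer-name-or-path, CHECKPOINT_PATH is where --save writes
# checkpoints, and LOAD_PATH holds the initial weights read by --load.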

DATA_PATH=./data/baichuan_text
TOKENIZER_PATH=./tokenizer
 
CHECKPOINT_PATH=./ckpt
LOAD_PATH=./weight

DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

logfile=$(date +%Y%m%d)_$(date +%H%M%S)
rm -rf kernel_meta*

# Main script
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
       pretrain_baichuan.py \
       --DDP-impl local \
       --tensor-model-parallel-size 8 \
       --pipeline-model-parallel-size 1 \
       --sequence-parallel \
       --num-layers 40 \
       --hidden-size 5120 \
       --ffn-hidden-size 13696 \
       --num-attention-heads 40 \
       --micro-batch-size $MICRO_BATCH \
       --global-batch-size $GLOBAL_BATCH \
       --seq-length 4096 \
       --max-position-embeddings 4096 \
       --train-iters 100000 \
       --save $CHECKPOINT_PATH \
       --load $LOAD_PATH \
       --data-path $DATA_PATH \
       --tokenizer-name-or-path $TOKENIZER_PATH \
       --tokenizer-not-use-fast \
       --data-impl mmap \
       --split 949,50,1 \
       --make-vocab-size-divisible-by 8 \
       --distributed-backend nccl \
       --lr 1e-6 \
       --lr-decay-style cosine \
       --min-lr 1e-8 \
       --weight-decay 1e-1 \
       --position-embedding-type alibi \
       --clip-grad 1.0 \
       --initial-loss-scale 8188.0 \
       --seed 1234 \
       --adam-beta1 0.9 \
       --adam-beta2 0.95 \
       --adam-eps 1.0e-5 \
       --log-interval 1 \
       --save-interval 1000 \
       --eval-interval 1000 \
       --eval-iters 10 \
       --fp16 | tee logs/loss_${logfile}.log
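
The variables under "# Change for multinode config" control the distributed launch. As a rough sketch (the host address and node count below are placeholders, not values from this repository), a two-node run would set them differently on each node:

# Hypothetical two-node example: run the same script on both nodes,
# pointing MASTER_ADDR at the rank-0 node and giving each node its own NODE_RANK.
MASTER_ADDR=10.0.0.1   # placeholder IP of the rank-0 node
MASTER_PORT=12892
NNODES=2
NODE_RANK=0            # set to 1 on the second node
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))   # 16 NPUs in total for this example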

4. Launch the Baichuan-13B pre-training script: AscendSpeed/examples/baichuan/pretrain_baichuan_ptd_13B.sh

bash examples/baichuan/pretrain_baichuan_ptd_13B.sh

An hourly pulse-checking script runs to verify that the training is either running or scheduled.

The training log will look like this:

TODO

Performance

Machine performance

The performance of Ascend 910B1 64GB NPUs compared with A800 GPUs is shown below:

| Device | Model | Total iterations | Throughput (samples/s/p) | Throughput (tokens/s/p) | Single-step time (s/step) | Floating-point operations (TFLOPs/s) |
|--------|-------|------------------|--------------------------|--------------------------|----------------------------|---------------------------------------|
| GPUs   | Baichuan-13B | 1000 | 1.535 | 785  | 20.852 | 68.39 |
| NPUs   | Baichuan-13B | 1000 | 1.928 | 1024 | 16.067 | 89.37 |

Notes:

  • The Baichuan-13B model was trained on alpaca-data-conversation on a single machine with 8 NPUs.

Here's a hardware summary of pre-training Baichuan-13B:

| Hardware | Value |
|----------|-------|
| CPU | 4 x KunPeng 920 @ 3.0 GHz, 64 cores per socket, 256 CPUs |
| RAM | 32 x 32 GB DDR4 |
| NPU | 8 x Ascend 910B1 64GB |

Here's a software summary of pre-training Baichuan-13B:

| Software | Version |
|----------|---------|
| OS | EulerOS release 2.0 (SP10) |
| uname | aarch64 |
| Python | 3.7.16 |
| driver | 23.0.RC3.B051 |
| firmware | 23.0.RC3.B051 |
| CANN | 7.0.RC1 |
| binary arithmetic package | Ascend-cann-kernels-910b_7.0.T8_linux |
| torch | 1.11.0 |
| torch_npu | 1.11.0.post4-20230915 |
| deepspeed | 0.9.2 |
| deepspeed-npu | 0.1 |
| transformers | 4.30.2 |
| AscendSpeed | 2023-7-21 |

Accuracy of the loss

NPU vs GPU loss.

Training on the NPUs runs smoothly: resource usage is stable, no errors are reported during the run, the loss trends downward, and the convergence speed is as expected.

[Figure: NPU-LOSS]

NPU vs GPU loss relative error.

The relative error between the NPU and GPU loss is less than 0.02 throughout, as expected.

[Figure: NPU-Relative-Error]

Fine-tune and Evaluation

TODO

Inference

We support AscendSpeed Inference for text generation with Baichuan-13B.

Model weights

We provide scripts for converting pretrained weights into checkpoints that AscendSpeed can load and use for inference. Download the Baichuan-13B checkpoint from here, make sure all chunks are downloaded completely, then use the following command to convert them into checkpoints that AscendSpeed can load.

#!/bin/bash

SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
python $SCRIPT_PATH \
    --input-model-dir "your huggingface checkpoint path" \
    --output-model-dir "your ascendspeed checkpoint path" \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 1 \
    --type 13B 
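
Note that the --tensor-model-parallel-size and --pipeline-model-parallel-size values used for conversion should match the parallel layout expected by the inference script (tensor parallel 8, pipeline parallel 1 for generate_baichuan_13B_tp8_pp1.sh), since the converted weights are split per parallel rank.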

Set CHECKPOINT_PATH in ./examples/baichuan/generate_baichuan_13B_tp8_pp1.sh to the path of the extracted folder. Since the checkpoint file is large, it is recommended to use the SSD or RAM disk to reduce the checkpoint loading time.
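
For example (a sketch only; the exact variable names inside the generation script may differ), point the script at the converted weights and tokenizer before launching:

# Hypothetical values to edit inside examples/baichuan/generate_baichuan_13B_tp8_pp1.sh
CHECKPOINT_PATH=/data/ckpt/baichuan-13b-tp8-pp1   # folder produced by the conversion step (placeholder path)
TOKENIZER_PATH=./tokenizer                        # tokenizer files from the Hugging Face checkpoint (placeholder path)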

Script

We generate text samples using the generate_baichuan_13B_tp8_pp1.sh script. Inference differs from pre-training in a few ways; for example, we need to load the pre-trained checkpoint and set the length of the output samples:

bash ./examples/baichuan/generate_baichuan_13B_tp8_pp1.sh

Alternatively, you can also use DeepSpeed from source:

TODO: XXXX

Citation

You may also consider citing the original work in your references.