Baichuan-13B
Contents
Pre-Training
Here's a quick summary of training Baichuan-13B:

Item | Value |
---|---|
Hardware | 96 x 64GB Atlas 910B NPUs |
Software | AscendSpeed |
Dataset | alpaca-data-conversation |
Datasets
Vicuna is created by fine-tuning a LLaMA base model using approximately 70K user-shared conversations gathered from ShareGPT.com with public APIs. To ensure data quality, we convert the HTML back to markdown and filter out some inappropriate or low-quality samples. Additionally, we divide lengthy conversations into smaller segments that fit the model's maximum context length. For detailed instructions to clean the ShareGPT data, check out here.
Due to some concerns, we may not release the data at the moment. If you would like to try the fine-tuning code, you can run it with our preprocessed alpaca dataset (originally from here).
Script
1. Install the AscendSpeed requirement environment.
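A minimal setup sketch is shown below; the repository URL and the presence of a requirements.txt are assumptions, so adapt them to your environment and match the package versions to the software table in this document:

```bash
# Illustrative only: clone AscendSpeed and install its Python dependencies.
# The repository URL and requirements file name are assumptions.
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
pip install -r requirements.txt
```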
2. Download the alpaca-data-conversation dataset: https://github.com/lm-sys/FastChat/blob/v0.1.10/playground/data/alpaca-data-conversation.json
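For example, the JSON file can be fetched with wget. The raw.githubusercontent.com address below is simply the raw-file form of the link above, and ./data is an assumed local directory:

```bash
# Download the dataset JSON into a local data directory (target path is an assumption).
mkdir -p ./data
wget -O ./data/alpaca-data-conversation.json \
  https://raw.githubusercontent.com/lm-sys/FastChat/v0.1.10/playground/data/alpaca-data-conversation.json
```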
3. Configure the Baichuan-13B pre-training script AscendSpeed/examples/baichuan/pretrain_baichuan_ptd_13B.sh:
# This is an example: training Baichuan-13B using PTD (tensor, pipeline and data parallelism)
# The number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=12892
NNODES=1
NODE_RANK=0
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
GLOBAL_BATCH=32
MICRO_BATCH=1
DATA_PATH=./data/baichuan_text
TOKENIZER_PATH=./tokenizer
CHECKPOINT_PATH=./ckpt
LOAD_PATH=./weight
DISTRIBUTED_ARGS="--nproc_per_node $NPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
logfile=$(date +%Y%m%d)_$(date +%H%M%S)
rm -rf kernel_meta*
# Main script
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_baichuan.py \
--DDP-impl local \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--sequence-parallel \
--num-layers 40 \
--hidden-size 5120 \
--ffn-hidden-size 13696 \
--num-attention-heads 40 \
--micro-batch-size $MICRO_BATCH \
--global-batch-size $GLOBAL_BATCH \
--seq-length 4096 \
--max-position-embeddings 4096 \
--train-iters 100000 \
--save $CHECKPOINT_PATH \
--load $LOAD_PATH \
--data-path $DATA_PATH \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--data-impl mmap \
--split 949,50,1 \
--make-vocab-size-divisible-by 8 \
--distributed-backend nccl \
--lr 1e-6 \
--lr-decay-style cosine \
--min-lr 1e-8 \
--weight-decay 1e-1 \
--position-embedding-type alibi \
--clip-grad 1.0 \
--initial-loss-scale 8188.0 \
--seed 1234 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--adam-eps 1.0e-5 \
--log-interval 1 \
--save-interval 1000 \
--eval-interval 1000 \
--eval-iters 10 \
--fp16 | tee logs/loss_${logfile}.log
4. Launch the Baichuan-13B pre-training script AscendSpeed/examples/baichuan/pretrain_baichuan_ptd_13B.sh:
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh
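The script above is written for a single node (NNODES=1, NODE_RANK=0). For a multi-node run, the variables under "Change for multinode config" are adjusted on every node; the sketch below uses placeholder values, not values from this guide:

```bash
# Illustrative multi-node settings: run the same script on each node,
# pointing MASTER_ADDR at the rank-0 node and giving each node its own NODE_RANK.
MASTER_ADDR=node0.example.com   # IP or hostname of the rank-0 node (placeholder)
MASTER_PORT=12892
NNODES=2                        # total number of nodes (placeholder)
NODE_RANK=0                     # 0 on the first node, 1 on the second, and so on
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
```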
There is an hourly pulse checking script running that checks that the training is either running or scheduled.
The training log will look like this:
TODO
Performance
Machine performance
The performance comparison uses Ascend 910B1 64GB NPUs and A800 GPUs:
Device | Model | Total iterations | Throughput (samples/s/p) | Throughput (tokens/s/p) | Single-step time (s/step) | Floating-point operations (TFLOPs/s) |
---|---|---|---|---|---|---|
GPUs | Baichuan-13B | 1000 | 1.535 | 785 | 20.852 | 68.39 |
NPUs | Baichuan-13B | 1000 | 1.928 | 1024 | 16.067 | 89.37 |
Notes:
- The Baichuan-13B model was trained on the alpaca-data-conversation dataset on a single machine with 8 NPUs.
Here's a hardware summary of pre-training Baichuan-13B:
Hardware | Value |
---|---|
CPU | 4 x Kunpeng 920 @ 3.0 GHz, 64 cores per socket, 256 CPUs in total |
RAM | 32 x 32 GB DDR4 |
NPU | 8 x Ascend 910B1 64GB |
Here's a software summary of pre-training Baichuan-13B:
Software | Version |
---|---|
OS | EulerOS release 2.0 (SP10) |
uname | aarch64 |
Python | 3.7.16 |
driver | 23.0.RC3.B051 |
firmware | 23.0.RC3.B051 |
CANN | 7.0.RC1 |
binary arithmetic package | Ascend-cann-kernels-910b_7.0.T8_linux |
torch | 1.11.0 |
torch_npu | 1.11.0.post4-20230915 |
deepspeed | 0.9.2 |
deepspeed-npu | 0.1 |
transformers | 4.30.2 |
AscendSpeed | 2023-7-21 |
Accuracy of the loss
NPU vs GPU loss.
The NPU runs smoothly, resource usage is stable, no errors are reported during training, the loss shows a decreasing trend, and the convergence speed is as expected.
NPU vs GPU loss relative error.
The relative error between NPU and GPU Loss is less than 0.02 throughout, as expected.
Fine-tune and Evaluation
TODO
Inference
We support AscendSpeed Inference for text generation with Baichuan-13B.
Model weights
We provide scripts to convert pretrained weights into weights that AscendSpeed can load and use for inference. Download the Baichuan-13B checkpoint from here, make sure all chunks are downloaded completely, then use the following command to convert them into checkpoints that AscendSpeed can load.
#!/bin/bash
SCRIPT_PATH=./tools/ckpt_convert/llama/convert_weights_from_huggingface.py
python $SCRIPT_PATH \
--input-model-dir "your huggingface checkpoint path" \
--output-model-dir "your ascendspeed checkpoint path" \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--type 13B
Set CHECKPOINT_PATH in ./examples/baichuan/generate_baichuan_13B_tp8_pp1.sh to the path of the converted checkpoint folder. Since the checkpoint files are large, it is recommended to use an SSD or a RAM disk to reduce the checkpoint loading time.
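As a minimal sketch, the variable can be updated in place with sed; the checkpoint directory below is a placeholder, and you should confirm how CHECKPOINT_PATH is actually defined in the script before relying on this pattern:

```bash
# Illustrative only: point the generation script at the converted weights (path is a placeholder).
sed -i 's|^CHECKPOINT_PATH=.*|CHECKPOINT_PATH=/data/baichuan-13b/ascendspeed_ckpt|' \
  ./examples/baichuan/generate_baichuan_13B_tp8_pp1.sh
```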
Script
We generate text samples using the generate_baichuan_13B_tp8_pp1.sh script. Inference differs from pre-training in a few ways; for example, we need to load the pre-trained checkpoint and set the length of the output samples:
bash ./examples/baichuan/generate_baichuan_13B_tp8_pp1.sh
Alternatively, you can also use DeepSpeed from source:
TODO: XXXX
Citation
You may also consider citing the original work in your references.