
BLOOM

This directory contains some of the scripts that were used to produce results in AscendSpeed. These scripts show, by example, how to run BLOOM from the terminal.

The BLOOM model is from: A 176B-Parameter Open-Access Multilingual Language Model

Scao, Teven Le, et al. "Bloom: A 176b-parameter open-access multilingual language model." arXiv preprint arXiv:2211.05100 (2022).

Contents

Pre-Training

BLOOM's architecture is very similar to GPT-3, with a few added improvements that are discussed later in this document.

Here's a quick summary of training BLOOM:

Hardware      96x 64GB Atlas 910B NPUs
Software      AscendSpeed
Architecture  GPT-3 w/ extras
Dataset       Oscar-1GB

Datasets

OSCAR, or Open Super-large Crawled ALMAnaCH coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The dataset used for training multilingual models such as BART incorporates 138 GB of text. The Oscar-1GB dataset was used for this Bloom-7B1 pre-training, and the data was processed into FastChat dialog format.

Script

1. Install the AscendSpeed requirements and environment.

2. Download the Oscar-1GB dataset: https://www.huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz
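
The downloaded file is compressed JSONL; before step 3 it needs to be decompressed and tokenized into the indexed binary format that DATA_PATH points to. A minimal sketch, assuming AscendSpeed ships a Megatron-style tools/preprocess_data.py and that the BLOOM tokenizer files sit under ./dataset/bloom_vocab/vocab_file/ (flag names and paths may differ in your version; check the tool's --help):

# decompress the raw corpus
xz -d oscar-1GB.jsonl.xz
mkdir -p ./dataset/oscar_data_1g

# tokenize into indexed .bin/.idx files; with the default "text" JSON key this
# produces ./dataset/oscar_data_1g/my-gpt2_text_document.{bin,idx}
python tools/preprocess_data.py \
    --input oscar-1GB.jsonl \
    --output-prefix ./dataset/oscar_data_1g/my-gpt2 \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path ./dataset/bloom_vocab/vocab_file/ \
    --append-eod \
    --workers 8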

3. Configure the Bloom-7B1 pre-training script: AscendSpeed/examples/bloom/pretrain_bloom_7b1.sh

# add HCCL_OP_BASE_FFTS_MODE_ENABLE
export HCCL_OP_BASE_FFTS_MODE_ENABLE=TRUE

# modify the script according to your own conda and ascend-toolkit path
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH # line 3
source /usr/local/Ascend/ascend-toolkit/set_env.sh # line 5

# modify the original dataset path in the script according to your own dataset path
TOKENIZER_NAME_OR_PATH=./dataset/bloom_vocab/vocab_file/ # line 34
DATA_PATH=./dataset/oscar_data_1g/my-gpt2_text_document # line 35

# Bloom-7B1 configuration parameters are consistent with the original script configuration
# PP=1, TP=8, micro_bs=4
MICRO_BATCH_SIZE=4 # micro_bs in Line 19
--tensor-model-parallel-size 8 # TP in Line 77
--pipeline-model-parallel-size 1 # PP in Line 78

# DeepSpeed config in Line 40 ~ Line 59
cat <<EOT > $config_json
{
  "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
  "train_batch_size": $GLOBAL_BATCH_SIZE,
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": $ZERO_STAGE
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 500,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "initial_scale_power": 12
  },
  "steps_per_print": 2000,
  "wall_clock_breakdown": false
}
EOT
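
Note that DeepSpeed enforces train_batch_size = train_micro_batch_size_per_gpu x gradient_accumulation_steps x data-parallel size, so GLOBAL_BATCH_SIZE in the script must satisfy this identity. A minimal sketch of the arithmetic for the 8-NPU, TP=8, PP=1 setup (the gradient-accumulation value below is illustrative; use whatever the script actually sets):

WORLD_SIZE=8
TP=8
PP=1
DP=$(( WORLD_SIZE / (TP * PP) ))   # data-parallel size is 1 in this configuration
MICRO_BATCH_SIZE=4
GRAD_ACC_STEPS=128                 # illustrative assumption
GLOBAL_BATCH_SIZE=$(( MICRO_BATCH_SIZE * GRAD_ACC_STEPS * DP ))
echo "train_batch_size = $GLOBAL_BATCH_SIZE"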

4. Launch the Bloom-7B1 pre-training script: AscendSpeed/examples/bloom/pretrain_bloom_7b1.sh

bash examples/bloom/pretrain_bloom_7b1.sh

An hourly pulse-check script runs to verify that training is either running or scheduled.
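
A minimal sketch of such a pulse check (the process pattern, log path, and cron entry are assumptions for illustration, not the script actually used):

#!/bin/bash
# pulse_check.sh: log an alert if no BLOOM pre-training process is found
if ! pgrep -f "pretrain_bloom" > /dev/null; then
    echo "$(date): no BLOOM training process found" >> ./pulse_check.log
fi

# run hourly via cron, e.g.:
# 0 * * * * /path/to/pulse_check.sh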

The training log will look like this:

TODO

Performance

Machine performance

The performance of Ascend 910B1 64GB NPUs is compared against A100 GPUs:

Device  Model      Total iterations  Throughput (samples/s/p)  Throughput (tokens/s/p)  Single-step time (s/step)  Floating point ops (TFLOPs/s)
GPUs    Bloom-7B1  995               1.23                      2540.23                  19.44                      5.68
NPUs    Bloom-7B1  995               1.26                      2588.03                  19.07                      5.91
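
For reference, the two throughput columns are related by the sequence length: tokens/s/p = samples/s/p x sequence length. Assuming the pre-training script uses a sequence length of 2048 (an assumption; check the script), the reported numbers are consistent up to rounding:

SAMPLES_PER_SEC_PER_NPU=1.26
SEQ_LEN=2048   # assumed sequence length
python -c "print($SAMPLES_PER_SEC_PER_NPU * $SEQ_LEN)"   # ~2580 tokens/s/p vs the reported 2588.03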

Notes:

  • The Bloom-7B1 model was trained on Oscar-1GB on a single machine with 8 NPUs

Here's a hardware summary of pre-training Bloom-7B:

Hardware  Value
CPU       4x KunPeng 920 @ 3.0 GHz, 64 cores per socket, 256 CPUs
RAM       32x 32 GB DDR4
NPU       8x Ascend 910B1 64 GB

Here's a software summary of pre-training Bloom-7B:

Software                   Version
OS                         EulerOS release 2.0 (SP10)
uname                      aarch64
Python                     3.7.5
driver                     23.0.RC2.b130
firmware                   23.0.RC2.b130
CANN                       6.3.RC2
binary arithmetic package  Ascend-ccann-kernels-910b_6.3.RC2_linux
torch                      1.11.0
torch_npu                  1.11.0.dev20230713
deepspeed                  0.9.2
deepspeed-npu              0.1
transformers               4.30.2
AscendSpeed                2023-7-21
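
Before launching training, it is worth checking that the driver, CANN toolkit, and torch_npu stack listed above are actually visible. A minimal sketch (paths and tool availability depend on your installation):

# NPU driver / device status
npu-smi info

# CANN environment and the PyTorch NPU backend
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python -c "import torch, torch_npu; print(torch.npu.is_available(), torch.npu.device_count())"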

Accuracy of the loss

NPU vs GPU loss.

The NPU training runs smoothly, resource usage is stable, no errors are reported during the run, the loss follows a decreasing trend, and the convergence speed is as expected.

NPU-LOSS

NPU vs GPU loss relative error.

The relative error between NPU and GPU Loss is less than 0.02 throughout, as expected.

NPU-Relative-Error

Fine-tune and Evaluation

TODO: provide a fine-tuning procedure (load the pre-trained weights first, then fine-tune; the script follows the same format as pre-training). Task evaluation results still need to be provided; this is under development.

Inference

We support AscendSpeed Inference for text generation with BLOOM 7B1.

Model weights

We provide scripts that convert pre-trained weights into weights that AscendSpeed can load and use for inference. Download the BLOOM-176B checkpoint from here, make sure all chunks are downloaded completely, then use the following command to convert them into checkpoints that AscendSpeed can load. --partition-layers specifies the layer partitioning under the pipeline parallel strategy. You can modify it to a different partitioning, but the sum of all elements of --partition-layers must equal 70, and the number of elements in --partition-layers must equal --pipeline-model-parallel-size.

#!/bin/bash

SCRIPT_PATH=./tasks/ckpt_convert/bloom/convert_weights_from_huggingface.py
python $SCRIPT_PATH \
    --input-model-dir "your huggingface checkpoint path" \
    --output-model-dir "your ascendspeed checkpoint path" \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 12 \
    --type 176B \
    --deepspeed \
    --partition-layers 6,6,6,6,6,6,6,6,6,6,6,4
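
A quick sanity check of the constraints described above (the elements of --partition-layers should sum to 70 and their count should match --pipeline-model-parallel-size), as a small shell sketch:

PARTITION_LAYERS="6,6,6,6,6,6,6,6,6,6,6,4"
PP_SIZE=12
SUM=$(echo "$PARTITION_LAYERS" | tr ',' '+' | bc)
COUNT=$(echo "$PARTITION_LAYERS" | tr ',' '\n' | wc -l)
echo "sum=$SUM (expect 70), count=$COUNT (expect $PP_SIZE)"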

Download the BLOOM model checkpoint from [here](TODO: XXXXX), make sure all chunks are downloaded completely, then use the following command to merge them into a single archive file and extract it:

cat bloom-7b1.tar.part_* > bloom-7b1.tar
tar xvf bloom-7b1.tar

Set CHECKPOINT_PATH in /generate_bloom.sh to the path of the extracted folder. Since the checkpoint file is large, it is recommended to use an SSD or a RAM disk to reduce the checkpoint loading time. Since the checkpoint we distribute uses 8-way tensor parallelism, a conversion script is also provided if you need to change the tensor parallel dimension.
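
If you want to stage the checkpoint on a RAM disk as suggested above, a minimal sketch using tmpfs (the mount point and size are illustrative; make sure the host has enough free RAM):

sudo mkdir -p /mnt/ckpt_ramdisk
sudo mount -t tmpfs -o size=64G tmpfs /mnt/ckpt_ramdisk
cp -r ./bloom-7b1 /mnt/ckpt_ramdisk/
# then set CHECKPOINT_PATH in the generate script to /mnt/ckpt_ramdisk/bloom-7b1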

TODO: add convert_tp tools.

python tools/convert_tp.py \
    --input-folder <SRC_CKPT_PATH>  \
    --output-folder <DST_CKPT_PATH> \
    --target-tp <TARGET_TP>

Script

We generate text samples using the generate_bloom script. Inference differs from pre-training in a few ways; for example, we need to load the pre-training checkpoint and set the length of the output samples:

bash ./generate_bloom_7b1.sh

Alternatively, you can use DeepSpeed from source:

TODO: XXXX

Samples

Left-To-Right Generation Examples

Example 1

Input: Who is the greatest artist? The greatest artist is ?

Output: TODO: XXXX.

Example 2 (Chinese)

Input: 问题:冬天,中国哪座城市最适合避寒?问题描述:能推荐一些国内适合冬天避寒的城市吗?回答用户:旅游爱好者 回答:? (Question: In winter, which Chinese city is the best place to escape the cold? Description: Can you recommend some domestic cities suitable for escaping the winter cold? Answering user: travel enthusiast. Answer: ?)

Output: 问题: XXXX.

All the provided scripts are tested on 8 Ascend 910B 64GB NPUs for BLOOM 7B1 (fp16). These scripts might not work for other models or a different number of NPUs.

Note: Sometimes NPU memory is not freed when an inference deployment crashes. You can free this memory by running killall python in the terminal.
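
A minimal sketch of that cleanup (npu-smi shows per-device memory, so you can confirm it was actually released; the -9 signal is a last resort for processes that ignore the default termination signal):

# check NPU memory usage before and after
npu-smi info

# terminate leftover python processes holding NPU memory
pkill -9 python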

Citation

You may also consider the original work in your references:

@article{scao2022bloom,
  title={Bloom: A 176b-parameter open-access multilingual language model},
  author={Scao, Teven Le and Fan, Angela and Akiki, Christopher and Pavlick, Ellie and Ili{\'c}, Suzana and Hesslow, Daniel and Castagn{\'e}, Roman and Luccioni, Alexandra Sasha and Yvon, Fran{\c{c}}ois and Gall{\'e}, Matthias and others},
  journal={arXiv preprint arXiv:2211.05100},
  year={2022}
}