diff --git a/README.md b/README.md
index bb3813faa..d456a64c5 100644
--- a/README.md
+++ b/README.md
@@ -390,6 +390,17 @@ ModelLink旨在为华为 [昇腾芯片](https://open.codehub.huawei.com/OpenBaiz
+简体中文 |
+English
+
+
+# Contents
+
+- [GPT3](#GPT3)
+- [Contents](#contents)
+- [GPT3-175B](#GPT3-175B)
+  - [Training](#training)
+    - [Script](#script)
+    - [Performance](#performance)
+      - [Throughput](#throughput)
+
+# GPT3-175B
+
+## Training
+
+Here is the hardware configuration used for GPT3-175B training:
+
+| Hardware |   Configuration   |
+| :------: | :---------------: |
+|   NPU    | 128 x Ascend NPUs |
+
+### Script
+
+1. Clone the repository to your local server:
+
+    ```shell
+    git clone https://gitee.com/ascend/ModelLink.git
+    git clone https://github.com/NVIDIA/Megatron-LM.git
+    cd Megatron-LM
+    git checkout -f bcce6f
+    cp -r megatron ../ModelLink/
+    cd ..
+    cd ModelLink
+    mkdir logs
+    mkdir vocab_file
+    mkdir dataset
+    ```
+
+2. Build the environment
+
+    ```bash
+    # python3.8
+    conda create -n test python=3.8
+    conda activate test
+
+    # install torch and torch_npu
+    pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
+    pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
+    pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
+
+    # configure the ascend-toolkit path
+    source /usr/local/Ascend/ascend-toolkit/set_env.sh
+
+    # install AscendSpeed
+    git clone https://gitee.com/ascend/AscendSpeed.git
+    cd AscendSpeed
+    git checkout 224ae35e8fc96778f957029d1371ddb623452a50
+    pip install -r requirements.txt
+    pip3 install -e .
+    cd ..
+
+    # install other dependencies
+    pip install -r requirements.txt
+    ```
+
+3. Prepare the dataset and vocab file for pre-training
+
+    3.1 Prepare the dataset
+
+    The raw data can be downloaded from [here](https://huggingface.co/datasets/wikipedia/tree/main/data/20220301.en).
+
+    ```shell
+    # download the enwiki raw data
+    # there are 41 files in total; a subset is enough to build the dataset
+    cd ./dataset
+    wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00000-of-00041.parquet
+    wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00001-of-00041.parquet
+    wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00002-of-00041.parquet
+    wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00003-of-00041.parquet
+    wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00004-of-00041.parquet
+    wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00005-of-00041.parquet
+    wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00006-of-00041.parquet
+    wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00007-of-00041.parquet
+    cd ..
+
+    # download the vocab file and merge table
+    cd vocab_file
+    wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
+    wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
+    cd ..
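+
+    # Note (assumption, not in the original doc): the preprocessing step below
+    # converts the downloaded files into Megatron-style binary indexed data
+    # (.bin/.idx files) under ./dataset/, and DATA_PATH in the pre-training
+    # script is expected to point at the generated prefix.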
+
+    # process the raw data into training data
+    python ./tools/preprocess_data.py \
+        --input ./dataset/ \
+        --output-prefix ./dataset/gpt_text_sentence \
+        --tokenizer-type GPT2BPETokenizer \
+        --vocab-file ./vocab_file/gpt2-vocab.json \
+        --merge-file ./vocab_file/gpt2-merges.txt \
+        --append-eod \
+        --workers 4 \
+        --log-interval 1000
+    ```
+
+    3.2 Pre-training in PTD mode
+
+    Configure the GPT3-175B PTD pre-training script: examples/gpt3/pretrain_gpt3_175B.sh
+
+    ```shell
+    # configure the ascend-toolkit path according to your installation
+    source /usr/local/Ascend/ascend-toolkit/set_env.sh
+
+    # configure the following paths according to where the files are actually stored
+    VOCAB_FILE="./vocab_file/gpt2-vocab.json"   # vocab file
+    MERGE_FILE="./vocab_file/gpt2-merges.txt"   # BPE merge table
+    DATA_PATH="./dataset/gpt_text_sentence"     # dataset path
+    ```
+
+    Launch the GPT3-175B PTD pre-training script: examples/gpt3/pretrain_gpt3_175B.sh
+
+    ```shell
+    bash examples/gpt3/pretrain_gpt3_175B.sh
+    ```
+
+### Performance
+
+#### Throughput
+
+Performance of GPT3-175B on **Ascend NPUs**:
+
+| Device |   Model   | Throughput (tokens/s/p) |
+| :----: | :-------: | :---------------------: |
+|  NPUs  | GPT3-175B |          153.1          |
+
diff --git a/examples/gpt3/readme_en.md b/examples/gpt3/readme_en.md
new file mode 100644
index 000000000..39dda4ae4
--- /dev/null
+++ b/examples/gpt3/readme_en.md
@@ -0,0 +1,136 @@
+# GPT3 $\color{black}{\rm\tiny{【model}}$ $\color{black}{\rm\tiny{contributed}}$ $\color{black}{\rm\tiny{by}}$ $\color{black}{\rm\tiny{Community】}}$
+
+简体中文 |
+English
+
+
+# Contents
+
+- [GPT3](#GPT3)
+- [Contents](#contents)
+- [GPT3-175B](#GPT3-175B)
+  - [Training](#training)
+    - [Script](#script)
+    - [Performance](#performance)
+      - [Machine performance](#machine-performance)
+
+# GPT3-175B
+
+## Training
+
+Here is a hardware summary of pre-training GPT3-175B:
+
+| Hardware |   Configuration   |
+| :------: | :---------------: |
+|   NPU    | 128 x Ascend NPUs |
+
+### Script
+
+1. Clone the repository to your local server:
+
+    ```shell
+    git clone https://gitee.com/ascend/ModelLink.git
+    git clone https://github.com/NVIDIA/Megatron-LM.git
+    cd Megatron-LM
+    git checkout -f bcce6f
+    cp -r megatron ../ModelLink/
+    cd ..
+    cd ModelLink
+    mkdir logs
+    mkdir vocab_file
+    mkdir dataset
+    ```
+
+2. Build the environment
+
+    ```bash
+    # python3.8
+    conda create -n test python=3.8
+    conda activate test
+
+    # install torch and torch_npu
+    pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl
+    pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl
+    pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
+
+    # configure the ascend-toolkit path
+    source /usr/local/Ascend/ascend-toolkit/set_env.sh
+
+    # install AscendSpeed
+    git clone https://gitee.com/ascend/AscendSpeed.git
+    cd AscendSpeed
+    git checkout 224ae35e8fc96778f957029d1371ddb623452a50
+    pip install -r requirements.txt
+    pip3 install -e .
+    cd ..
+
+    # install other packages
+    pip install -r requirements.txt
+    ```
+
+3. Prepare the dataset and vocab file for pre-training
+
+    3.1 Prepare the dataset
+
+    Download the GPT raw dataset from [here](https://huggingface.co/datasets/wikipedia/tree/main/data/20220301.en).
+
+    ```shell
+    # download the enwiki raw data
+    # there are 41 files in total; a subset is enough to build the dataset
+    cd ./dataset
+    wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00000-of-00041.parquet
+    wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00001-of-00041.parquet
+    wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00002-of-00041.parquet
+    wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00003-of-00041.parquet
+    wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00004-of-00041.parquet
+    wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00005-of-00041.parquet
+    wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00006-of-00041.parquet
+    wget https://huggingface.co/datasets/wikipedia/blob/main/data/20220301.en/train-00007-of-00041.parquet
+    cd ..
+
+    # download the vocab file and merge table
+    cd vocab_file
+    wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
+    wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
+    cd ..
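+
+    # Note (assumption, not part of the original script): the preprocessing step
+    # below converts the downloaded files into Megatron-style binary indexed data
+    # (.bin/.idx files) under ./dataset/, and DATA_PATH in the pre-training
+    # script is expected to point at the generated prefix.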
+
+    # process the raw data into training data
+    python ./tools/preprocess_data.py \
+        --input ./dataset/ \
+        --output-prefix ./dataset/gpt_text_sentence \
+        --tokenizer-type GPT2BPETokenizer \
+        --vocab-file ./vocab_file/gpt2-vocab.json \
+        --merge-file ./vocab_file/gpt2-merges.txt \
+        --append-eod \
+        --workers 4 \
+        --log-interval 1000
+    ```
+
+    3.2 Pre-training in PTD mode
+
+    Configure the GPT3-175B PTD pre-training script: examples/gpt3/pretrain_gpt3_175B.sh
+
+    ```shell
+    # configure the ascend-toolkit path according to your installation
+    source /usr/local/Ascend/ascend-toolkit/set_env.sh
+
+    # configure the following paths according to where the files are actually stored
+    VOCAB_FILE="./vocab_file/gpt2-vocab.json"   # vocab file for training
+    MERGE_FILE="./vocab_file/gpt2-merges.txt"   # BPE merge file for training
+    DATA_PATH="./dataset/gpt_text_sentence"     # dataset path
+    ```
+
+    Launch the GPT3-175B PTD pre-training script: examples/gpt3/pretrain_gpt3_175B.sh
+
+    ```shell
+    bash examples/gpt3/pretrain_gpt3_175B.sh
+    ```
+
+### Performance
+
+#### Machine performance
+
+The performance of GPT3-175B on **Ascend NPUs**:
+
+| Device |   Model   | Throughput (tokens/s/p) |
+| :----: | :-------: | :---------------------: |
+|  NPUs  | GPT3-175B |          153.1          |
+
diff --git a/modellink/model/transformer.py b/modellink/model/transformer.py
index 56f372d21..bd1ef11fe 100644
--- a/modellink/model/transformer.py
+++ b/modellink/model/transformer.py
@@ -420,7 +420,7 @@ class FlashSelfAttention(torch.nn.Module):
         """Implements the multihead softmax attention.
         Arguments
         ---------
-            q, k, v: The tensor containing the query, key, and value. (B, S, H, D)
+            q, k, v: The tensor containing the query, key, and value. (S, B, H, D)
         """
         args = get_args()
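
A note on the `modellink/model/transformer.py` change: the patch only corrects the docstring, documenting that `FlashSelfAttention` receives sequence-first tensors of shape (S, B, H, D) rather than batch-first (B, S, H, D); no behavior changes. The sketch below is a minimal illustration of what the corrected annotation means; the dimension sizes and variable names are assumptions made up for the example, not taken from the repository.

```python
import torch

# Illustrative sizes (assumptions): S = sequence length, B = micro-batch size,
# H = number of attention heads, D = per-head hidden dimension.
S, B, H, D = 2048, 2, 16, 64

# Per the corrected docstring, q, k and v are laid out sequence-first: (S, B, H, D).
q = torch.randn(S, B, H, D)
k = torch.randn(S, B, H, D)
v = torch.randn(S, B, H, D)

# A batch-first tensor (B, S, H, D) would need a transpose to match that layout.
q_batch_first = torch.randn(B, S, H, D)
assert q_batch_first.transpose(0, 1).shape == q.shape
```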