fork megatron-deepspeed code.

This commit is contained in:
chenzomi 2023-05-25 14:49:59 +08:00 committed by dilililiwhy
parent ea6e3d2ceb
commit e4a120a662
254 changed files with 44961 additions and 0 deletions

3
.gitignore vendored

@@ -6,6 +6,9 @@ __pycache__/
# C extensions
*.so
# MACOS
.DS_Store
# Distribution / packaging
.Python
build/

605
README.md Normal file

@@ -0,0 +1,605 @@
## Megatron-DeepSpeed
DeepSpeed version of NVIDIA's Megatron-LM that adds support for several features such as MoE model training, Curriculum Learning, 3D Parallelism, and others. The ```Megatron-DeepSpeed/examples/``` folder includes example scripts for the features supported by DeepSpeed.
### Run on Azure and AzureML
To try out DeepSpeed on Azure, this fork of Megatron offers easy-to-use recipes and bash scripts. We strongly recommend starting with the AzureML recipe in the ```examples/azureml``` folder. If you have custom infrastructure (e.g., HPC clusters) or an Azure VM-based environment, please refer to the bash scripts in the ```examples/azure``` folder.
Below is Megatron-LM's original README:
------
Megatron ([1](https://arxiv.org/pdf/1909.08053.pdf) and [2](https://arxiv.org/pdf/2104.04473.pdf)) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. This repository is for ongoing research on training large transformer language models at scale. We developed efficient, model-parallel (tensor and pipeline), and multi-node pre-training of transformer based models such as [GPT](https://arxiv.org/abs/2005.14165), [BERT](https://arxiv.org/pdf/1810.04805.pdf), and [T5](https://arxiv.org/abs/1910.10683) using mixed precision.
Below are some of the projects where we have directly used Megatron:
* [BERT and GPT Studies Using Megatron](https://arxiv.org/pdf/1909.08053.pdf)
* [BioMegatron: Larger Biomedical Domain Language Model](https://www.aclweb.org/anthology/2020.emnlp-main.379.pdf)
* [End-to-End Training of Neural Retrievers for Open-Domain Question Answering](https://arxiv.org/abs/2101.00408)
* [Large Scale Multi-Actor Generative Dialog Modeling](https://www.aclweb.org/anthology/2020.acl-main.8.pdf)
* [Local Knowledge Powered Conversational Agents](https://arxiv.org/abs/2010.10150)
* [MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models](https://www.aclweb.org/anthology/2020.emnlp-main.226.pdf)
* [RACE Reading Comprehension Dataset Leaderboard](http://www.qizhexie.com/data/RACE_leaderboard.html)
* [Scaling Language Model Training to a Trillion Parameters Using Megatron](https://arxiv.org/pdf/2104.04473.pdf)
* [Training Question Answering Models From Synthetic Data](https://www.aclweb.org/anthology/2020.emnlp-main.468.pdf)
Our codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. We vary hidden size, number of attention heads, and number of layers to arrive at a specific model size. As the model size increases, we also modestly increase the batch size. We leverage [NVIDIA's Selene supercomputer](https://www.top500.org/system/179842/) to perform scaling studies and use up to 3072 [A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for the largest model. The table below shows the model configurations along with the achieved FLOPs (both per GPU and aggregate over all GPUs). Note that the FLOPs are measured for end-to-end training, i.e., they include all operations, including data loading, optimization, and even logging.
![Cases](images/cases_april2021.png)
All the cases from 1 billion to 1 trillion parameters achieve more than 43% half-precision utilization, which is high for an end-to-end application. We observe that utilization initially remains roughly constant, but as the hidden size increases for larger models, utilization starts increasing and reaches 52% for the largest model. We also note that the achieved aggregate petaFLOPs across all GPUs increases almost linearly with the number of GPUs, demonstrating good weak scaling.
# Contents
* [Contents](#contents)
* [Setup](#setup)
* [Downloading Checkpoints](#downloading-checkpoints)
* [Usage](#usage)
* [Training](#training)
* [Data Preprocessing](#data-preprocessing)
* [BERT Pretraining](#bert-pretraining)
* [GPT Pretraining](#gpt-pretraining)
* [T5 Pretraining](#t5-pretraining)
* [Distributed Pretraining](#distributed-pretraining)
* [GPT-3 Example](#gpt-3-example)
* [Evaluation and Tasks](#evaluation-and-tasks)
* [GPT Text Generation](#gpt-text-generation)
* [GPT Evaluation](#gpt-evaluation)
* [WikiText Perplexity Evaluation](#wikitext-perplexity-evaluation)
* [LAMBADA Cloze Accuracy](#lambada-cloze-accuracy)
* [BERT Task Evaluation](#bert-task-evaluation)
* [RACE Evaluation](#race-evaluation)
* [MNLI Evaluation](#mnli-evaluation)
* [Datasets](#datasets)
* [Collecting Wikipedia Training Data](#collecting-wikipedia-training-data)
* [Collecting GPT Webtext Data](#collecting-gpt-webtext-data)
# Setup
We have tested Megatron with [NGC's PyTorch container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) version 20.12, which uses python 3.8, pytorch 1.8, cuda 11.1, and nccl 2.8.3.
To use this repository, please install the latest supported versions of PyTorch with GPU support (python 3.8, pytorch 1.8, cuda 11.1, and nccl 2.8.3 and above) and NVIDIA [APEX](https://github.com/NVIDIA/apex#quick-start). We strongly recommend using one of [NGC's recent PyTorch containers](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) (the latest compatible version at time of publication can be pulled with `docker pull nvcr.io/nvidia/pytorch:20.12-py3`). Data preprocessing requires [NLTK](https://www.nltk.org/install.html), though this is not required for training, evaluation, or downstream tasks.
<!--
To use megatron you can either clone the repo or install it via pip (make sure python3-dev is installed):
<pre>
pip install megatron-lm
</pre>
-->
## Downloading Checkpoints
We have provided pretrained [BERT-345M](https://ngc.nvidia.com/catalog/models/nvidia:megatron_bert_345m) and [GPT-345M](https://ngc.nvidia.com/catalog/models/nvidia:megatron_lm_345m) checkpoints for use in evaluation or for finetuning downstream tasks. To access these checkpoints, first [sign up](https://ngc.nvidia.com/signup) for and [set up](https://ngc.nvidia.com/setup/installers/cli) the NVIDIA GPU Cloud (NGC) Registry CLI. Further documentation for downloading models can be found in the [NGC documentation](https://docs.nvidia.com/dgx/ngc-registry-cli-user-guide/index.html#topic_6_4_1).
Alternatively, you can directly download the checkpoints using:
<pre>
BERT-345M-uncased: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_uncased/zip -O megatron_bert_345m_v0.1_uncased.zip
BERT-345M-cased: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_cased/zip -O megatron_bert_345m_v0.1_cased.zip
GPT-345M: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
</pre>
The models require vocabulary files to run. The BERT WordPiece vocab file can be extracted from Google's pretrained BERT models: [uncased](https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt), [cased](https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt). The GPT [vocab file](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json) and [merge table](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt) can be downloaded directly.
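For example, these vocabulary files can be fetched directly with `wget`:
<pre>
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
</pre>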
Additional notes for DeepSpeed: we have added helper scripts to download the checkpoint and vocabulary files and make the text-generation example runnable.
Steps to follow:
- `bash dataset/download_ckpt.sh` -- downloads and extracts the 345M GPT checkpoint.
- `bash dataset/download_vocab.sh` -- downloads the GPT merges and vocab files.
- `bash examples/generate_text.sh` -- generates text using the 345M GPT model.
# Usage
After installation, there are several possible workflows. The most comprehensive is:
1. Data preprocessing
2. Pretraining
3. Finetuning (Optional for zero-shot tasks)
4. Downstream task evaluation or text generation
However, steps 1 and 2 can be replaced by using one of the pretrained models mentioned above.
We've provided several scripts for pretraining both BERT and GPT in the [`examples`](./examples) directory, as well as scripts for both zero-shot and fine-tuned downstream tasks including MNLI, RACE, WikiText103, and LAMBADA evaluation. There is also a script for GPT interactive text generation.
# Training
## Data Preprocessing
The training data requires preprocessing. First, place your training data in a loose json format, with one json object containing a text sample per line. For example:
<pre>
{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}
</pre>
The name of the `text` field of the json can be changed by using the `--json-key` flag in [`preprocess_data.py`](./tools/preprocess_data.py) (a sketch follows the BERT example below). The other metadata are optional and are not used in training.
The loose json is then processed into a binary format for training. To convert the json into mmap, cached index file, or the lazy loader format use `preprocess_data.py`. Set the `--dataset-impl` flag to `mmap`, `cached`, or `lazy`, respectively (default is `mmap`). An example script to prepare data for BERT training is:
<pre>
python tools/preprocess_data.py \
--input my-corpus.json \
--output-prefix my-bert \
--vocab bert-vocab.txt \
--dataset-impl mmap \
--tokenizer-type BertWordPieceLowerCase \
--split-sentences
</pre>
The output will be two files named, in this case, `my-bert_text_sentence.bin` and `my-bert_text_sentence.idx`. The `--data-path` specified in later BERT training is the full path and new filename, but without the file extension.
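As noted above, if your corpus stores its text under a different key (say `content`, a hypothetical name used here for illustration), the same command can point at it; a sketch (the flag is spelled `--json-keys` in the preprocessing script):
<pre>
# "content" is a hypothetical key name; substitute the key used in your corpus
python tools/preprocess_data.py \
       --input my-corpus.json \
       --json-keys content \
       --output-prefix my-bert \
       --vocab bert-vocab.txt \
       --dataset-impl mmap \
       --tokenizer-type BertWordPieceLowerCase \
       --split-sentences
</pre>
The output files would then be expected to carry the key name, e.g. `my-bert_content_sentence.bin` and `my-bert_content_sentence.idx`.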
Some minor modifications are required for GPT data preprocessing, namely, the addition of a merge table, an end-of-document token, removal of sentence splitting, and a change to the tokenizer type:
<pre>
python tools/preprocess_data.py \
--input my-corpus.json \
--output-prefix my-gpt2 \
--vocab gpt2-vocab.json \
--dataset-impl mmap \
--tokenizer-type GPT2BPETokenizer \
--merge-file gpt2-merges.txt \
--append-eod
</pre>
Here the output files are named `my-gpt2_text_document.bin` and `my-gpt2_text_document.idx`. As before, in GPT training, use the longer name without the extension as `--data-path`.
Further command line arguments are described in the source file [`preprocess_data.py`](./tools/preprocess_data.py).
## BERT Pretraining
The `examples/pretrain_bert.sh` script runs single GPU 345M parameter BERT pretraining. Debugging is the primary use for single GPU training, as the code base and command line arguments are optimized for highly distributed training. Most of the arguments are fairly self-explanatory. By default, the learning rate decays linearly over the training iterations, starting at `--lr` down to a minimum set by `--min-lr` over `--lr-decay-iters` iterations. The fraction of training iterations used for warmup is set by `--lr-warmup-fraction`. While this is single GPU training, the batch size specified by `--micro-batch-size` is the batch size of a single forward-backward pass, and the code will perform gradient accumulation steps until it reaches `--global-batch-size`, which is the batch size per iteration. The data here is partitioned into a 949:50:1 ratio for training/validation/test sets via `--split 949,50,1` (the default is 969:30:1). This partitioning happens on the fly, but is consistent across runs with the same random seed (1234 by default, or specified manually with `--seed`). We use `--train-iters` to set the number of training iterations requested. Alternatively, one can provide `--train-samples`, which is the total number of samples to train on. If this option is present, then instead of providing `--lr-decay-iters`, one will need to provide `--lr-decay-samples` (a sketch of this variant follows the example below).
The logging, checkpoint-saving, and evaluation intervals are specified. Checkpointing the activations facilitates the training of larger models and/or batches. Note that the `--data-path` now includes the additional `_text_sentence` suffix added in preprocessing, but does not include the file extensions.
<pre>
CHECKPOINT_PATH=checkpoints/bert_345m
VOCAB_FILE=bert-vocab.txt
DATA_PATH=my-bert_text_sentence
BERT_ARGS="--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 512 \
--max-position-embeddings 512 \
--lr 0.0001 \
--lr-decay-iters 990000 \
--train-iters 2000000 \
--min-lr 0.00001 \
--lr-warmup-fraction 0.01 \
--micro-batch-size 4 \
--global-batch-size 8 \
--vocab-file $VOCAB_FILE \
--split 949,50,1 \
--fp16"
OUTPUT_ARGS="--log-interval 10 \
--save-interval 500 \
--eval-interval 100 \
--eval-iters 10 \
--checkpoint-activations"
python pretrain_bert.py \
$BERT_ARGS \
$OUTPUT_ARGS \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH
</pre>
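If you opt for the sample-based schedule mentioned above, the iteration-based arguments in `BERT_ARGS` are replaced by their sample-based counterparts. A minimal sketch (the sample counts are illustrative placeholders, not tuned values):
<pre>
# sample counts below are placeholders, not tuned values
BERT_ARGS="&#60;model and batch-size arguments as above&#62; \
           --train-samples 16000000 \
           --lr-decay-samples 15000000"
</pre>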
Further command line arguments are described in the source file [`arguments.py`](./megatron/arguments.py).
## GPT Pretraining
The `examples/pretrain_gpt.sh` script runs single GPU 345M parameter GPT pretraining. As mentioned above, single GPU training is primarily intended for debugging purposes, as the code is optimized for distributed training.
It follows largely the same format as the previous BERT script with a few notable differences: the tokenization scheme used is BPE (which requires a merge table and a `json` vocabulary file) instead of WordPiece, the model architecture allows for longer sequences (note that the max position embedding must be greater than or equal to the maximum sequence length), and the `--lr-decay-style` has been set to cosine decay. Note that the `--data-path` now includes the additional `_text_document` suffix added in preprocessing, but does not include the file extensions.
<pre>
CHECKPOINT_PATH=checkpoints/gpt2_345m
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
DATA_PATH=my-gpt2_text_document
GPT_ARGS="--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--micro-batch-size 4 \
--global-batch-size 8 \
--lr 0.00015 \
--train-iters 500000 \
--lr-decay-iters 320000 \
--lr-decay-style cosine \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--lr-warmup-fraction .01 \
--fp16"
OUTPUT_ARGS=&#60;same as those in <a href="#bert-pretraining">BERT pretraining</a> above&#62;
python pretrain_gpt.py \
$GPT_ARGS \
$OUTPUT_ARGS \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
</pre>
Further command line arguments are described in the source file [`arguments.py`](./megatron/arguments.py).
## T5 Pretraining
Very similar to BERT and GPT, the `examples/pretrain_t5.sh` script runs single GPU "base" (~220M parameter) T5 pretraining. The primary difference from BERT and GPT is the addition of the following arguments to accommodate the T5 architecture:
* `--kv-channels` sets the inner dimension of the "key" and "value" matrices of all attention mechanisms in the model. For BERT and GPT this defaults to the hidden size divided by the number of attention heads, but can be configured for T5.
* `--ffn-hidden-size` sets the hidden size in the feed-forward networks within a transformer layer. For BERT and GPT this defaults to 4 times the transformer hidden size, but can be configured for T5.
* `--encoder-seq-length` and `--decoder-seq-length` set the sequence length for the encoder and decoder separately.
All of the other arguments remain as they were for BERT and GPT pretraining.
<pre>
CHECKPOINT_PATH=checkpoints/t5_base
VOCAB_FILE=t5-vocab.txt
DATA_PATH=my-t5_text_sentence
T5_ARGS="--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--kv-channels 64 \
--ffn-hidden-size 3072 \
--encoder-seq-length 512 \
--decoder-seq-length 128 \
--max-position-embeddings 512 \
--lr 0.0001 \
--lr-decay-iters 990000 \
--train-iters 2000000 \
--min-lr 0.00001 \
--lr-warmup-fraction 0.01 \
--micro-batch-size 16 \
--global-batch-size 2048 \
--vocab-file $VOCAB_FILE \
--split 949,50,1 \
--fp16"
OUTPUT_ARGS=&#60;same as those in <a href="#bert-pretraining">BERT pretraining</a> above&#62;
python pretrain_t5.py \
$T5_ARGS \
$OUTPUT_ARGS \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH
</pre>
## Distributed Pretraining
The `examples/pretrain_{bert,gpt,t5}_distributed.sh` scripts use the PyTorch distributed launcher for distributed training. As such, multi-node training can be achieved by properly setting environment variables and using `init_method='env://'` in the launcher. See the official PyTorch [documentation](https://pytorch.org/docs/stable/distributed.html#launch-utility) for further description of these [environment variables](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization). By default, multi-node training uses the [nccl](https://developer.nvidia.com/nccl) distributed backend. A simple set of additional arguments and the use of the PyTorch distributed module with the Python flag `-m torch.distributed.launch`, detailed below, are the only additional requirements to adopt distributed training.
We use two types of parallelism: data and model parallelism. We provide two distributed data parallel implementations: a simple one of our own that performs gradient all-reduce at the end of the backpropagation step, and Torch's distributed data parallel wrapper that overlaps gradient reduction with the backpropagation computation. To switch between these two options, use `--DDP-impl local` or `--DDP-impl torch`, respectively. As expected, Torch distributed data parallelism is more efficient at larger model sizes. For example, for the 8.3 billion parameter model running on 512 GPUs, the scaling increases from 60% to 76% when Torch's distributed data parallel is used. However, the overlapping method requires more memory, and for some configurations (e.g., 2.5 billion parameters using 2-way model parallelism and 1.2 billion parameters with no model parallelism) it can make the overall training slower as a result. We empirically found that using a smaller model in those cases improves the training time.
Second, we developed a simple and efficient two-dimensional model-parallel approach. To use tensor model parallelism (splitting execution of a single transformer module over multiple GPUs), add the `--tensor-model-parallel-size` flag to specify the number of GPUs among which to split the model, along with the arguments passed to the distributed launcher as mentioned above. To use pipeline model parallelism (sharding the transformer modules into stages with an equal number of transformer modules on each stage, and then pipelining execution by breaking the batch into smaller microbatches), use the `--pipeline-model-parallel-size` flag to specify the number of stages to split the model into (e.g., splitting a model with 24 transformer layers across 4 stages would mean each stage gets 6 transformer layers).
<!-- The number of microbatches in a per-pipeline minibatch is controlled by the `--num-microbatches-in-minibatch` argument. With `WORLD_SIZE` GPUs, `TENSOR_MP_SIZE` tensor-model-parallel size, `PIPELINE_MP_SIZE` pipeline-model-parallel-size, `WORLD_SIZE`/(`TENSOR_MP_SIZE` * `PIPELINE_MP_SIZE`) GPUs will be used for data parallelism. The default values for `--tensor-model-parallel-size` and `--pipeline-model-parallel-size` is 1, which will not implement either form of model parallelism. -->
We have examples of how to use these two different forms of model parallelism in the example scripts ending in `distributed_with_mp.sh`; note that pipeline parallelism is not currently supported for the T5 model.
Other than these minor changes, distributed training is identical to training on a single GPU.
Distributed training:
<pre>
WORLD_SIZE=8
TENSOR_MP_SIZE=2
PIPELINE_MP_SIZE=2
DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
CHECKPOINT_PATH=&#60;same as above&#62;
VOCAB_FILE=&#60;same as above&#62;
DATA_PATH=&#60;same as above&#62;
MODEL_ARGS=&#60;same as above&#62;
OUTPUT_ARGS=&#60;same as above&#62;
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_<model>.py \
$MODEL_ARGS \
$OUTPUT_ARGS \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--tensor-model-parallel-size $TENSOR_MP_SIZE \
--pipeline-model-parallel-size $PIPELINE_MP_SIZE \
--DDP-impl torch
</pre>
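The example above runs on a single node. For a multi-node run, the same command is launched on every node; only the node count, node rank, and master address change. A sketch for two nodes with 8 GPUs each (the hostname is a placeholder for the rank-0 node):
<pre>
# run on every node, setting --node_rank to 0 on the first node and 1 on the second
DISTRIBUTED_ARGS="--nproc_per_node 8 \
                  --nnodes 2 \
                  --node_rank 0 \
                  --master_addr node0.example.com \
                  --master_port 6000"
</pre>
The remaining arguments and the `python -m torch.distributed.launch` invocation are unchanged.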
## GPT-3 Example
In `examples/pretrain_gpt3_175B.sh` we have provided an example of how to configure Megatron to run [GPT-3](https://arxiv.org/abs/2005.14165) with 175 billion parameters on 1024 GPUs. The script is designed for [slurm](https://slurm.schedmd.com/documentation.html) with the [pyxis](https://github.com/NVIDIA/pyxis) plugin but can be easily adapted to any other scheduler. It uses 8-way tensor parallelism and 16-way pipeline parallelism. With the options `global-batch-size 1536` and `rampup-batch-size 16 16 5859375`, training will start with a global batch size of 16 and linearly increase the global batch size to 1536 over 5,859,375 samples in increments of 16. The training dataset can be either a single dataset or multiple datasets combined with a set of weights.
With the full global batch size of 1536 on 1024 A100 GPUs, each iteration takes around 32 seconds, resulting in 138 teraFLOPs per GPU, which is 44% of the theoretical peak FLOPs.
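For reference, the parallelism and batch-size ramp settings described above correspond to arguments along the following lines (an excerpt only; see `examples/pretrain_gpt3_175B.sh` for the full configuration):
<pre>
       --tensor-model-parallel-size 8 \
       --pipeline-model-parallel-size 16 \
       --global-batch-size 1536 \
       --rampup-batch-size 16 16 5859375
</pre>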
<!--
## REALM Pipeline
We are working on implementing the [REALM](https://arxiv.org/pdf/2002.08909.pdf) system. The following sections (will) reflect the three stages of training it. For now it's just the ICT code.
Loosely, they are pretraining the retriever modules, then jointly training the language model and the retriever, and then finetuning a question answering head on the language model with fixed retriever.
### Inverse Cloze Task (ICT) Pretraining
1. Have a corpus in loose JSON format with the intention of creating a collection of fixed-size blocks of text as the fundamental units of data. For a corpus like Wikipedia, this will mean multiple sentences per block but also multiple blocks per document.
Run `tools/preprocess_data.py` to construct one or more indexed datasets with the `--split-sentences` argument to make sentences the basic unit. For the original REALM system, we construct two datasets, one with the title of every document, and another with the body.
Refer to the following script
<pre>
python preprocess_data.py \
--input /path/to/corpus.json \
--json-keys text title \
--split-sentences \
--tokenizer-type BertWordPieceLowerCase \
--vocab-file /path/to/vocab.txt \
--output-prefix corpus_indexed \
--workers 5 # works well for 10 CPU cores. Scale up accordingly.
</pre>
2. Use a custom samples mapping function in place of `megatron/data/realm_dataset_utils.get_block_samples_mapping` if required. To do this, you will need to implement a new function in C++ inside of `megatron/data/helpers.cpp`. The samples mapping data structure is used to select the data that will constitute every training sample in advance of the training loop.
The samples mapping is responsible for holding all of the required metadata needed to construct the sample from one or more indexed datasets. In REALM, the samples mapping contains the start and end sentence indices, as well as the document index (to find the correct title for a body) and a unique ID for every block.
3. Pretrain a BERT language model using `pretrain_bert.py`, with the sequence length equal to the block size in token ids. This model should be trained on the same indexed dataset that is used to supply the blocks for the information retrieval task.
In REALM, this is an uncased bert base model trained with the standard hyperparameters.
4. Use `pretrain_ict.py` to train an `ICTBertModel` which uses two BERT-based encoders to encode queries and blocks to perform retrieval with.
The script below trains the ICT model from REALM. It references a pretrained BERT model (step 3) in the `--bert-load` argument. The batch size used in the paper is 4096, so this would need to be run with data parallel world size 32.
<pre>
python pretrain_ict.py \
--num-layers 12 \
--num-attention-heads 12 \
--hidden-size 768 \
--batch-size 128 \
--seq-length 256 \
--max-position-embeddings 256 \
--ict-head-size 128 \
--train-iters 100000 \
--checkpoint-activations \
--bert-load /path/to/pretrained_bert \
--load checkpoints \
--save checkpoints \
--data-path /path/to/indexed_dataset \
--titles-data-path /path/to/titles_indexed_dataset \
--vocab-file /path/to/vocab.txt \
--lr 0.0001 \
--num-workers 2 \
--lr-decay-style linear \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-fraction .01 \
--save-interval 3000 \
--query-in-block-prob 0.1 \
--fp16
</pre>
### Building an Index of Block Embeddings
After having trained an ICT model, you can now embed an entire dataset of blocks by creating a `BlockData` structure. After that has been saved, you can load it
and wrap it with a `FaissMIPSIndex` to do fast similarity search which is key in the learned information retrieval pipeline. The initial index can be built with the following script, meant to be run in an interactive session. It can leverage multiple GPUs on multiple nodes to index large datasets much more quickly.
<pre>
python tools/create_doc_index.py \
--num-layers 12 \
--hidden-size 768 \
--ict-head-size 128 \
--num-attention-heads 12 \
--batch-size 128 \
--checkpoint-activations \
--seq-length 256 \
--max-position-embeddings 256 \
--ict-load /path/to/pretrained_ict \
--data-path /path/to/indexed_dataset \
--titles-data-path /path/to/titles_indexed_dataset \
--block-data-path embedded_blocks.pkl \
--indexer-log-interval 1000 \
--indexer-batch-size 128 \
--vocab-file /path/to/vocab.txt \
--num-workers 2 \
--fp16
</pre>
-->
# Evaluation and Tasks
We provide several command line arguments, detailed in the scripts listed below, to handle various zero-shot and fine-tuned downstream tasks. However, you can also finetune your model from a pretrained checkpoint on other corpora as desired. To do so, simply add the `--finetune` flag and adjust the input files and training parameters within the original training script. The iteration count will be reset to zero, and the optimizer and internal state will be reinitialized. If the fine-tuning is interrupted for any reason, be sure to remove the `--finetune` flag before continuing, otherwise the training will start again from the beginning.
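For example, to finetune from the pretrained BERT checkpoint on a new corpus, one could reuse the pretraining command and add the flag; a sketch (the corpus and save paths are placeholders):
<pre>
# data and save paths below are placeholders
python pretrain_bert.py \
       $BERT_ARGS \
       $OUTPUT_ARGS \
       --finetune \
       --load checkpoints/bert_345m \
       --save checkpoints/bert_345m_finetuned \
       --data-path my-finetune-corpus_text_sentence
</pre>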
Because evaluation requires substantially less memory than training, it may be advantageous to merge a model trained in parallel for use on a single GPU in downstream tasks. The following script accomplishes this. Currently only tensor model parallelism is supported on the input and pipeline model parallelism on the output. This example reads in a model with 2-way tensor model parallelism and writes out a model with 2-way pipeline model parallelism.
<pre>
TENSOR_MODEL_PARALLEL_SIZE=2
TARGET_PIPELINE_MODEL_PARALLEL_SIZE=2
VOCAB_FILE=bert-vocab.txt
CHECKPOINT_PATH=checkpoints/bert_345m
WORLD_SIZE=$TENSOR_MODEL_PARALLEL_SIZE python tools/merge_mp_partitions.py \
--model-type BERT \
--tensor-model-parallel-size $TENSOR_MODEL_PARALLEL_SIZE \
--pipeline-model-parallel-size 1 \
--target-pipeline-model-parallel-size $TARGET_PIPELINE_MODEL_PARALLEL_SIZE \
--tokenizer-type BertWordPieceLowerCase \
--vocab-file $VOCAB_FILE \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 512 \
--max-position-embeddings 512 \
--load $CHECKPOINT_PATH \
--save $CHECKPOINT_PATH/merged
</pre>
Several downstream tasks are described for both GPT and BERT models below. They can be run in distributed and model parallel modes with the same changes used in the training scripts.
## GPT Text Generation
`bash examples/generate_text.sh`
We generate text samples using largely the GPT pretraining script. A few changes need to be made, such as providing the path to the pretrained checkpoint, the length of the output samples, and whether to generate text unconditionally (`--num-samples` denotes how many samples to generate) or conditionally (pass `--sample-input-file <filename>`, where each line of the file will be used as the conditional text; a sketch of this variant follows the example below). There are a few optional parameters to play with, e.g., `top-k`, `top-p`, or `greedy` (set top-k and top-p to 0) sampling.
<pre>
CHECKPOINT_PATH=checkpoints/gpt2_345m
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
GPT_ARGS=&#60;same as those in <a href="#gpt-pretraining">GPT pretraining</a> above&#62;
MAX_OUTPUT_SEQUENCE_LENGTH=1024
TEMPERATURE=1.0
TOP_P=0.9
NUMBER_OF_SAMPLES=2
OUTPUT_FILE=samples.json
python tools/generate_samples_gpt.py \
$GPT_ARGS \
--load $CHECKPOINT_PATH \
--out-seq-length $MAX_OUTPUT_SEQUENCE_LENGTH \
--temperature $TEMPERATURE \
--genfile $OUTPUT_FILE \
--num-samples $NUMBER_OF_SAMPLES \
--top_p $TOP_P \
--recompute
</pre>
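A sketch of the conditional variant mentioned above replaces `--num-samples` with `--sample-input-file` (the prompt file name is a placeholder; each of its lines is used as a prompt):
<pre>
# prompts.txt is a placeholder; each line is used as conditional text
python tools/generate_samples_gpt.py \
       $GPT_ARGS \
       --load $CHECKPOINT_PATH \
       --out-seq-length $MAX_OUTPUT_SEQUENCE_LENGTH \
       --temperature $TEMPERATURE \
       --sample-input-file prompts.txt \
       --top_p $TOP_P \
       --recompute
</pre>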
## GPT Evaluation
We include example scripts for GPT evaluation on WikiText perplexity evaluation and LAMBADA Cloze accuracy.
### WikiText Perplexity Evaluation
For even comparison with prior works, we evaluate perplexity on the word-level [WikiText-103 test dataset](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip), and appropriately compute perplexity given the change in tokens when using our subword tokenizer.
We use the following command to run WikiText-103 evaluation on a 345M parameter model.
<pre>
TASK="WIKITEXT103"
VALID_DATA=&#60;wikitext path&#62;.txt
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m
COMMON_TASK_ARGS="--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--fp16 \
--vocab-file $VOCAB_FILE"
python tasks/main.py \
--task $TASK \
$COMMON_TASK_ARGS \
--valid-data $VALID_DATA \
--tokenizer-type GPT2BPETokenizer \
--merge-file $MERGE_FILE \
--load $CHECKPOINT_PATH \
--micro-batch-size 8 \
--checkpoint-activations \
--log-interval 10 \
--no-load-optim \
--no-load-rng
</pre>
### LAMBADA Cloze Accuracy
To compute LAMBADA cloze accuracy (the accuracy of predicting the last token given the preceding tokens) we utilize a detokenized, processed version of the [LAMBADA dataset](https://github.com/cybertronai/bflm/blob/master/lambada_test.jsonl).
We use the following command to run LAMBADA evaluation on a 345M parameter model. Note that the `--strict-lambada` flag should be used to require whole word matching. Make sure that `lambada` is part of the file path.
<pre>
TASK="LAMBADA"
VALID_DATA=&#60;lambada path&#62;.json
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m
COMMON_TASK_ARGS=&#60;same as those in <a href="#wikitext-perplexity-evaluation">WikiText Perplexity Evaluation</a> above&#62;
python tasks/main.py \
--task $TASK \
$COMMON_TASK_ARGS \
--valid-data $VALID_DATA \
--tokenizer-type GPT2BPETokenizer \
--strict-lambada \
--merge-file $MERGE_FILE \
--load $CHECKPOINT_PATH \
--micro-batch-size 8 \
--checkpoint-activations \
--log-interval 10 \
--no-load-optim \
--no-load-rng
</pre>
Further command line arguments are described in the source file [`main.py`](./tasks/main.py).
## BERT Task Evaluation
### RACE Evaluation
The following script finetunes the BERT model for evaluation on the [RACE dataset](http://www.cs.cmu.edu/~glai1/data/race/). The `TRAIN_DATA` and `VALID_DATA` directories contain the RACE dataset as separate `.txt` files. Note that for RACE, the batch size is the number of RACE queries to evaluate. Since each RACE query has four samples, the effective batch size passed through the model will be four times the batch size specified on the command line.
<pre>
TRAIN_DATA="data/RACE/train/middle"
VALID_DATA="data/RACE/dev/middle \
data/RACE/dev/high"
VOCAB_FILE=bert-vocab.txt
PRETRAINED_CHECKPOINT=checkpoints/bert_345m
CHECKPOINT_PATH=checkpoints/bert_345m_race
COMMON_TASK_ARGS="--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 512 \
--max-position-embeddings 512 \
--fp16 \
--vocab-file $VOCAB_FILE"
COMMON_TASK_ARGS_EXT="--train-data $TRAIN_DATA \
--valid-data $VALID_DATA \
--pretrained-checkpoint $PRETRAINED_CHECKPOINT \
--checkpoint-activations \
--save-interval 10000 \
--save $CHECKPOINT_PATH \
--log-interval 100 \
--eval-interval 1000 \
--eval-iters 10 \
--weight-decay 1.0e-1"
python tasks/main.py \
--task RACE \
$COMMON_TASK_ARGS \
$COMMON_TASK_ARGS_EXT \
--tokenizer-type BertWordPieceLowerCase \
--epochs 3 \
--micro-batch-size 4 \
--lr 1.0e-5 \
--lr-warmup-fraction 0.06
</pre>
### MNLI Evaluation
The following script finetunes the BERT model for evaluation with the [MultiNLI sentence pair corpus](https://www.nyu.edu/projects/bowman/multinli/). Because the matching tasks are quite similar, the script can be quickly tweaked to work with the [Quora Question Pairs](https://www.kaggle.com/quora/question-pairs-dataset) (QQP) dataset as well.
<pre>
TRAIN_DATA="data/glue_data/MNLI/train.tsv"
VALID_DATA="data/glue_data/MNLI/dev_matched.tsv \
data/glue_data/MNLI/dev_mismatched.tsv"
PRETRAINED_CHECKPOINT=checkpoints/bert_345m
VOCAB_FILE=bert-vocab.txt
CHECKPOINT_PATH=checkpoints/bert_345m_mnli
COMMON_TASK_ARGS=&#60;same as those in <a href="#race-evaluation">RACE Evaluation</a> above&#62;
COMMON_TASK_ARGS_EXT=&#60;same as those in <a href="#race-evaluation">RACE Evaluation</a> above&#62;
python tasks/main.py \
--task MNLI \
$COMMON_TASK_ARGS \
$COMMON_TASK_ARGS_EXT \
--tokenizer-type BertWordPieceLowerCase \
--epochs 5 \
--micro-batch-size 8 \
--lr 5.0e-5 \
--lr-warmup-fraction 0.065
</pre>
# Datasets
We do not host any datasets for GPT or BERT training; however, we detail their collection so that our results may be reproduced.
## Collecting Wikipedia Training Data
We recommend following the Wikipedia data extraction process specified by Google research: "the recommended pre-processing is to download [the latest dump](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2), extract the text with [WikiExtractor.py](https://github.com/attardi/wikiextractor), and then apply any necessary cleanup to convert it into plain text."
We recommend using the `--json` argument when using WikiExtractor, which will dump the Wikipedia data into loose json format (one json per line), making it more manageable on the file system and also readily consumable by our codebase. We recommend further preprocessing this json dataset by nltk punctuation standardization. For BERT training, use the `--split-sentences` flag to `preprocess_data.py` as described [above](#data-preprocessing) to include sentence breaks in the produced index. If you'd like to use Wikipedia data for GPT training you should still clean it with nltk/spacy/ftfy, but do not use the `--split-sentences` flag.
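A sketch of that pipeline for BERT data, assuming WikiExtractor's documented command-line interface (file names are placeholders):
<pre>
# extract plain text from the dump as loose json (one json object per line)
python WikiExtractor.py --json enwiki-latest-pages-articles.xml.bz2 -o extracted
# concatenate the extracted shards into a single json file (e.g. wikipedia.json), then:
python tools/preprocess_data.py \
       --input wikipedia.json \
       --output-prefix wiki-bert \
       --vocab bert-vocab.txt \
       --dataset-impl mmap \
       --tokenizer-type BertWordPieceLowerCase \
       --split-sentences
</pre>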
## Collecting GPT Webtext Data
We utilize the publicly available [OpenWebText](https://github.com/eukaryote31/openwebtext) library from [jcpeterson](https://github.com/jcpeterson/openwebtext) and [eukaryote31's](https://github.com/eukaryote31/openwebtext) work to download URLs. We then filtered, cleaned, and deduplicated all downloaded content according to the procedure described in our [openwebtext](./tools/openwebtext) directory. For Reddit URLs corresponding to content up to October 2018 we arrived at approximately 37GB of content.

41
SECURITY.md Normal file

@@ -0,0 +1,41 @@
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.7 BLOCK -->
## Security
Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/opensource/security/definition), please report it to us as described below.
## Reporting Security Issues
**Please do not report security vulnerabilities through public GitHub issues.**
Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/opensource/security/create-report).
If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/opensource/security/pgpkey).
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://aka.ms/opensource/security/msrc).
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue
This information will help us triage your report more quickly.
If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/opensource/security/bounty) page for more details about our active programs.
## Preferred Languages
We prefer all communications to be in English.
## Policy
Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/opensource/security/cvd).
<!-- END MICROSOFT SECURITY.MD BLOCK -->

5
dataset/README.md Normal file

@@ -0,0 +1,5 @@
# Run the scripts below to set up the dataset
bash download_books.sh
bash download_vocab.sh


@@ -0,0 +1,2 @@
wget https://the-eye.eu/public/AI/pile_neox/data/BookCorpusDataset_text_document.bin
wget https://the-eye.eu/public/AI/pile_neox/data/BookCorpusDataset_text_document.idx

8
dataset/download_ckpt.sh Normal file

@@ -0,0 +1,8 @@
mkdir -p checkpoints/gpt2_345m
cd checkpoints/gpt2_345m
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
unzip megatron_lm_345m_v0.0.zip
rm megatron_lm_345m_v0.0.zip
cd ../..


@@ -0,0 +1,2 @@
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt


@@ -0,0 +1,39 @@
{
"train_batch_size" : CONFIG_BATCH_SIZE,
"train_micro_batch_size_per_gpu": CONFIG_MBSIZE,
"steps_per_print": LOG_INTERVAL,
"zero_optimization": {
"stage": ZERO_STAGE,
"elastic_checkpoint": true
},
"gradient_clipping": 1.0,
"prescale_gradients": PRESCALE_GRAD,
"fp16": {
"enabled": CONFIG_FP16_ENABLED,
"loss_scale": 0,
"loss_scale_window": 500,
"hysteresis": 2,
"min_loss_scale": 1,
"initial_scale_power": 11
},
"bf16": {
"enabled": CONFIG_BF16_ENABLED
},
"curriculum_learning": {
"enabled": CONFIG_CL_ENABLED,
"curriculum_type": "seqlen",
"min_difficulty": CONFIG_CL_MIN,
"max_difficulty": CONFIG_CL_MAX,
"schedule_type": "fixed_linear",
"schedule_config": {
"total_curriculum_step": CONFIG_CL_DURATION,
"difficulty_step": 8
}
},
"wall_clock_breakdown" : false
}


@@ -0,0 +1,38 @@
{
"train_batch_size" : CONFIG_BATCH_SIZE,
"train_micro_batch_size_per_gpu": CONFIG_MBSIZE,
"steps_per_print": LOG_INTERVAL,
"zero_optimization": {
"stage": 2
},
"gradient_clipping": 1.0,
"prescale_gradients": false,
"fp16": {
"enabled": CONFIG_FP16_ENABLED,
"loss_scale": 0,
"loss_scale_window": 500,
"hysteresis": 2,
"min_loss_scale": 1,
"initial_scale_power": 11
},
"bf16": {
"enabled": CONFIG_BF16_ENABLED
},
"curriculum_learning": {
"enabled": CONFIG_CL_ENABLED,
"curriculum_type": "seqlen",
"min_difficulty": CONFIG_CL_MIN,
"max_difficulty": CONFIG_CL_MAX,
"schedule_type": "fixed_linear",
"schedule_config": {
"total_curriculum_step": CONFIG_CL_DURATION,
"difficulty_step": 8
}
},
"wall_clock_breakdown" : false
}


@@ -0,0 +1,71 @@
# This is an example zero-shot eval script. Please first read the readme_evalharness.md in the same directory.
CHECKPOINT_PATH=/blob/users/conglli/project/gpt3_with_pile/checkpoint/gpt3-with-pile-0.125B-lr-2.4e-3-minlr-6.0e-5-bs-2048-gpus-128-zero-0-mp-1-pp-1-no_pp-cl-startseqlen-72-step-20728-token-45B/global_step81566/
CONFIG_PATH=ds_config_gpt3-with-pile-0.125B-lr-2.4e-3-minlr-6.0e-5-bs-2048-gpus-128-zero-0-mp-1-pp-1-no_pp-cl-startseqlen-72-step-20728-token-45B.json
RESULT_PATH=gpt3-with-pile-0.125B-lr-2.4e-3-minlr-6.0e-5-bs-2048-gpus-128-zero-0-mp-1-pp-1-no_pp-cl-startseqlen-72-step-20728-token-45B_global_step81566.log
PP_SIZE=1
TP_SIZE=1
NO_PP="true"
EP_PARALLEL_SIZE=1
# Currently the eval harness does not support data parallelism.
# However, for MoE models it's possible to enable a "fake data parallel"
# in order to load experts on multiple GPUs. At the same time, it's not
# real data parallelism because we load the same data on all GPUs.
# On the other hand, it's better to use fewer GPUs than in training,
# to reduce communication overhead.
NUM_NODE=1
NUM_GPU_PER_NODE=1
TASKS="lambada"
# WikiText-2, not used in GPT-3 paper but used in GPT-2 paper
# TASKS="wikitext"
# Tasks that appeared in GPT-3 paper (sorted based on the order in paper), plus WikiText-2.
# TASKS="hellaswag,lambada,triviaqa,webqs,winogrande,piqa,arc_challenge,arc_easy,openbookqa,race,boolq,cb,copa,rte,wic,wsc,multirc,record,anli_r1,anli_r2,anli_r3,wikitext"
# All tasks that confirmed to work, there are more tasks on https://github.com/EleutherAI/lm-evaluation-harness that we didn't test.
# TASKS="hellaswag,lambada,triviaqa,webqs,winogrande,piqa,arc_challenge,arc_easy,openbookqa,race,boolq,cb,copa,rte,wic,wsc,multirc,record,anli_r1,anli_r2,anli_r3,wikitext,logiqa,mathqa,mc_taco,mrpc,prost,pubmedqa,qnli,qqp,sciq,sst,wnli"
VOCAB_FILE=/data/Megatron-LM/data/gpt2-vocab.json
MERGE_FILE=/data/Megatron-LM/data/gpt2-merges.txt
export HF_DATASETS_OFFLINE=1
# Dummy arguments to make megatron happy. No need to configure them.
# The reason we don't need to configure them and many other arguments is
# because the eval framework will read the arguments from the checkpoint file.
MEGATRON_REQUIRED_ARGS="\
--num-layers -1\
--hidden-size -1\
--num-attention-heads -1\
--seq-length -1 \
--max-position-embeddings -1
"
CMD="../../tasks/eval_harness/evaluate.py \
--load $CHECKPOINT_PATH\
--tensor-model-parallel-size $TP_SIZE \
--pipeline-model-parallel-size $PP_SIZE\
--moe-expert-parallel-size ${EP_PARALLEL_SIZE} \
--vocab-file $VOCAB_FILE\
--merge-file $MERGE_FILE\
--micro-batch-size 12\
--no-load-optim \
--no-load-rng \
--inference \
--disable-moe-token-dropping \
--adaptive_seq_len\
--eval_fp32\
--task_list $TASKS\
--results_path $RESULT_PATH \
--deepspeed \
--deepspeed_config $CONFIG_PATH \
$MEGATRON_REQUIRED_ARGS\
"
if [[ "${NO_PP}" = "true" ]]; then
CMD="${CMD} \
--no-pipeline-parallel"
fi
LAUNCHER="deepspeed --num_nodes $NUM_NODE --num_gpus $NUM_GPU_PER_NODE"
$LAUNCHER $CMD


@@ -0,0 +1,349 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
# MODEL_SIZE=0.125
# NUM_LAYERS=12
# HIDDEN_SIZE=768
# NUM_ATTN_HEADS=12
# GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
# MIN_LR=6.0e-5
## GPT-3 Medium 350M
# MODEL_SIZE=0.35
# NUM_LAYERS=24
# HIDDEN_SIZE=1024
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=3.0e-4
# MIN_LR=3.0e-5
## GPT-3 Large 760M
# MODEL_SIZE=0.76
# NUM_LAYERS=24
# HIDDEN_SIZE=1536
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=2.5e-4
# MIN_LR=2.5e-5
## GPT-3 XL 1.3B
MODEL_SIZE=1.3
NUM_LAYERS=24
HIDDEN_SIZE=2048
NUM_ATTN_HEADS=16
GLOBAL_BATCH_SIZE=512
# LR=2.0e-4
# MIN_LR=2.0e-5
## GPT-3 2.7B
# MODEL_SIZE=2.7
# NUM_LAYERS=32
# HIDDEN_SIZE=2560
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=512
# LR=1.6e-4
# MIN_LR=1.6e-5
## GPT-3 6.7B
# MODEL_SIZE=6.7
# NUM_LAYERS=32
# HIDDEN_SIZE=4096
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=1024
# LR=1.2e-4
# MIN_LR=1.2e-5
## GPT-3 13B
# MODEL_SIZE=13
# NUM_LAYERS=40
# HIDDEN_SIZE=5120
# NUM_ATTN_HEADS=40
# GLOBAL_BATCH_SIZE=1024
# LR=1.0e-4
# MIN_LR=1.0e-5
## GPT-3 175B
# MODEL_SIZE=175
# NUM_LAYERS=96
# HIDDEN_SIZE=12288
# NUM_ATTN_HEADS=96
# GLOBAL_BATCH_SIZE=1536
# LR=0.6e-4
# MIN_LR=0.6e-5
###############################################################################
### Training duration configs
## The main termination condition, original GPT-3 paper trains for 300B tokens
## For MoE models, we found that sometimes training a bit longer, to 330B tokens, helps
TRAIN_TOKENS=300000000000
# TRAIN_TOKENS=330000000000
## TRAIN_ITERS is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS above,
## and techniques like curriculum learning use fewer tokens in some steps, we
## just set this config large enough to make sure we have enough processed
## data and don't terminate on TRAIN_ITERS.
TRAIN_ITERS=$(( ${TRAIN_TOKENS} * 3 / ${GLOBAL_BATCH_SIZE} / ${SEQ_LEN} ))
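## For example, with TRAIN_TOKENS=300B, GLOBAL_BATCH_SIZE=512 and SEQ_LEN=2048 this
## works out to roughly 858K iterations, about 3x the ~286K needed to reach TRAIN_TOKENS.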
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration, this token-based config is preferable since
## no need to readjust when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
## For MoE model, we found that setting the decay token to 300B helps.
WARMUP_TOKENS=375000000
# LR_DECAY_TOKENS=260000000000
LR_DECAY_TOKENS=300000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
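## e.g. with GLOBAL_BATCH_SIZE=512, PP_SIZE=1, MP_SIZE=1 and NUM_GPUS=64,
## the bound is 512*1*1/64 = 8, matching the BATCH_SIZE below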
BATCH_SIZE=8
## Model parallelism, 1 is no MP
## Currently MoE models have divergence issue when MP > 1.
MP_SIZE=1
## Pipeline parallelism
## Currently we don't support PP for MoE. To disable PP, set PP_SIZE
## to 1 and use the "--no-pipeline-parallel" arg.
PP_SIZE=1
NUM_GPUS=64
###############################################################################
### MoE configs
## Number of experts. EP_SIZE 1 means dense model without MoE
# EP_SIZE=1
EP_SIZE=128
if [[ $EP_SIZE -gt $NUM_GPUS ]]; then
EP_PARALLEL_SIZE=$NUM_GPUS
else
EP_PARALLEL_SIZE=$EP_SIZE
fi
## Original GPT-3 model always set min LR at 10% of max LR. For MoE model, we
## found that lower LR and min LR (than the base dense model) helps.
## For 1.3B MoE-128 model we used LR=1.2e-4 and MIN_LR=1.0e-6.
## For 350M MoE-128 model we used LR=2.0e-4 and MIN_LR=2.0e-6, but they are not
## heavily tuned.
LR=1.2e-4
MIN_LR=1.0e-6
## Coefficient for MoE loss. We find that 0.01 is a good value at least for
## 1.3B MoE-128 model
MLC=0.01
## Below configs adjust the MoE expert token capacity limit during training and
## eval. To completely disable capacity limit, set MOE_DROP_TOKEN to false.
## Larger capacity factor or disabling capacity limit could improve training
## convergence, but will also reduce training throughput.
MOE_TRAIN_CAP_FACTOR=1.0
MOE_EVAL_CAP_FACTOR=1.0
MOE_MIN_CAP=4
MOE_DROP_TOKEN="true"
# MOE_DROP_TOKEN="false"
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="false"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=80
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_TOKENS=$((${CL_TOKENS} * 1000000000))
CL_STEP=$(( ${CL_TOKENS} / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
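## For example, with CL_START_SEQLEN=80 and SEQ_LEN=2048 (so CL_AVG_SEQLEN=1064),
## CL_TOKENS=60B and GLOBAL_BATCH_SIZE=512 give a CL_STEP of roughly 110K steps.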
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=10000
## Standard deviation for weight initialization
## We used 0.014 for 350M/1.3B dense/MoE models, and used 0.01 for 6.7B
## dense model. Usually larger model needs lower std.
INIT_STD=0.014
# INIT_STD=0.01
## Activation checkpointing saves GPU memory, but reduces training speed
ACTIVATION_CHECKPOINT="true"
# ACTIVATION_CHECKPOINT="false"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="gpt-${MODEL_SIZE}B-lr-${LR}-minlr-${MIN_LR}-bs-${GLOBAL_BATCH_SIZE}-gpus-${NUM_GPUS}-mp-${MP_SIZE}-pp-${PP_SIZE}"
if [[ $EP_SIZE -gt 1 ]]; then
NAME="${NAME}-ep-${EP_SIZE}-mlc-${MLC}-cap-${MOE_TRAIN_CAP_FACTOR}-drop-${MOE_DROP_TOKEN}"
fi
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-${CL_START_SEQLEN}-${CL_STEP}"
fi
OUTPUT_BASEPATH=$DIR/output
mkdir -p "${OUTPUT_BASEPATH}/tensorboard/"
mkdir -p "${OUTPUT_BASEPATH}/checkpoint/"
mkdir -p "${OUTPUT_BASEPATH}/log/"
TENSORBOARD_DIR="${OUTPUT_BASEPATH}/tensorboard/${NAME}_${host}_${current_time}"
mkdir -p ${TENSORBOARD_DIR}
## Note that for MoE model with billion-scale base model, the checkpoint can be
## as large as TB-scale which normal NFS cannot handle efficiently.
CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/${NAME}"
# USE_INTERNAL_DATA="true"
USE_INTERNAL_DATA="false"
if [ "${USE_INTERNAL_DATA}" = "true" ]; then
## The internal data is only accessible within Microsoft
## For cluster Azure-EastUS-V100-32GB-4, Azure-WestUS3-A100
# BASE_DATA_PATH=/vc_data/Megatron-LM/data
# DATA_HOME="/vc_data/pile-cc1-cc2-shuf"
## For cluster Lab-RR1-V100
BASE_DATA_PATH=/data/Megatron-LM/data
DATA_HOME="/turing-ssd/users/conglli/data/pile-cc1-cc2-shuf"
## For cluster Azure-CentralUS-A100
# BASE_DATA_PATH=/data/Megatron-LM/data
# DATA_HOME=/vc_data_1/users/amawa/blended
VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json
MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt
ARX="${DATA_HOME}/ArXiv_ftfy_cleaned_id_shuf_text_document"
BC2="${DATA_HOME}/BookCorpus2_ftfy_cleaned_id_shuf_text_document"
B3="${DATA_HOME}/Books3_ftfy_cleaned_id_shuf_text_document"
CC2020="${DATA_HOME}/CC-2020-50_id_cleaned_shuf_text_document"
CC2021="${DATA_HOME}/CC-2021-04_id_cleaned_shuf_text_document"
GIT="${DATA_HOME}/Github_ftfy_id_shuf_text_document"
GUT="${DATA_HOME}/Gutenberg_PG-19_ftfy_cleaned_id_cleaned_shuf_text_document"
NIH="${DATA_HOME}/NIH_ExPorter_ftfy_id_shuf_text_document"
OWT2="${DATA_HOME}/OpenWebText2_ftfy_cleaned_id_shuf_text_document"
PCC="${DATA_HOME}/Pile-CC_id_cleaned_shuf_text_document"
PM="${DATA_HOME}/PubMed_Abstracts_ftfy_id_shuf_text_document"
RN="${DATA_HOME}/rn_dedup_shuf_cleaned_0.7_cleaned_shuf_text_document"
SE="${DATA_HOME}/StackExchange_ftfy_id_shuf_text_document"
ST="${DATA_HOME}/stories_dedup0.7_shuf_cleaned_shuf_text_document"
WIK="${DATA_HOME}/Wikipedia_en_ftfy_id_shuf_text_document"
DATA_BLEND="0.14336 ${B3} 0.08962 ${RN} 0.19336 ${OWT2} 0.05689 ${SE} \
0.00859 ${ST} 0.02897 ${PM} 0.04771 ${WIK} 0.00873 ${GUT} 0.01007 ${BC2} \
0.00208 ${NIH} 0.13017 ${CC2020} 0.09446 ${PCC} 0.15652 ${CC2021} \
0.01359 ${ARX} 0.01588 ${GIT}"
else
VOCAB_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json
MERGE_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt
# Public the Pile dataset, can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
DATA_BLEND=/data/the_pile_public_merged_nopreprocessing/pile_text_document
fi
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_BLEND} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--moe-expert-parallel-size ${EP_PARALLEL_SIZE} \
--num-experts ${EP_SIZE} \
--moe-loss-coeff ${MLC} \
--moe-train-capacity-factor ${MOE_TRAIN_CAP_FACTOR} \
--moe-eval-capacity-factor ${MOE_EVAL_CAP_FACTOR} \
--moe-min-capacity ${MOE_MIN_CAP} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers ${NUM_LAYERS} \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-iters ${TRAIN_ITERS} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load ${CHECKPOINT_PATH} \
--save ${CHECKPOINT_PATH} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_DIR}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [[ $EP_SIZE -gt 1 ]]; then
megatron_options="${megatron_options} \
--create-moe-param-group"
fi
if [ "${MOE_DROP_TOKEN}" = "false" ]; then
megatron_options="${megatron_options} \
--disable-moe-token-dropping"
fi
template_json="ds_config_gpt_TEMPLATE.json"
config_json="ds_config_gpt_${NAME}.json"
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/0/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--pipeline-model-parallel-size ${PP_SIZE}"
# Currently MoE is not compatible with pipeline parallel
if [[ $EP_SIZE -gt 1 ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${OUTPUT_BASEPATH}/log/${NAME}_${host}_${current_time}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x


@@ -0,0 +1,341 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
# MODEL_SIZE=0.125
# NUM_LAYERS=12
# HIDDEN_SIZE=768
# NUM_ATTN_HEADS=12
# GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
# MIN_LR=6.0e-5
## GPT-3 Medium 350M
# MODEL_SIZE=0.35
# NUM_LAYERS=24
# HIDDEN_SIZE=1024
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=3.0e-4
# MIN_LR=3.0e-5
## GPT-3 Large 760M
# MODEL_SIZE=0.76
# NUM_LAYERS=24
# HIDDEN_SIZE=1536
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=2.5e-4
# MIN_LR=2.5e-5
## GPT-3 XL 1.3B
MODEL_SIZE=1.3
NUM_LAYERS=24
HIDDEN_SIZE=2048
NUM_ATTN_HEADS=16
GLOBAL_BATCH_SIZE=512
# LR=2.0e-4
# MIN_LR=2.0e-5
## GPT-3 2.7B
# MODEL_SIZE=2.7
# NUM_LAYERS=32
# HIDDEN_SIZE=2560
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=512
# LR=1.6e-4
# MIN_LR=1.6e-5
## GPT-3 6.7B
# MODEL_SIZE=6.7
# NUM_LAYERS=32
# HIDDEN_SIZE=4096
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=1024
# LR=1.2e-4
# MIN_LR=1.2e-5
## GPT-3 13B
# MODEL_SIZE=13
# NUM_LAYERS=40
# HIDDEN_SIZE=5120
# NUM_ATTN_HEADS=40
# GLOBAL_BATCH_SIZE=1024
# LR=1.0e-4
# MIN_LR=1.0e-5
## GPT-3 175B
# MODEL_SIZE=175
# NUM_LAYERS=96
# HIDDEN_SIZE=12288
# NUM_ATTN_HEADS=96
# GLOBAL_BATCH_SIZE=1536
# LR=0.6e-4
# MIN_LR=0.6e-5
###############################################################################
### Training duration configs
## The main termination condition, original GPT-3 paper trains for 300B tokens
## For MoE model, we found sometimes training a bit more to 330B tokens helps
TRAIN_TOKENS=300000000000
# TRAIN_TOKENS=330000000000
## TRAIN_ITERS is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS
## above, and techniques like curriculum learning use fewer tokens in some steps,
## we just set this config large enough to make sure we have enough
## processed data and don't terminate by TRAIN_ITERS.
TRAIN_ITERS=$(( ${TRAIN_TOKENS} * 3 / ${GLOBAL_BATCH_SIZE} / ${SEQ_LEN} ))
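## For example, with TRAIN_TOKENS=300B, GLOBAL_BATCH_SIZE=512 and SEQ_LEN=2048,
## the arithmetic above gives TRAIN_ITERS = 300e9*3/512/2048 ~= 858K iterations,
## roughly 3x the ~286K full-length iterations actually needed to reach
## TRAIN_TOKENS.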
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration. This token-based config is preferable since
## it does not need readjustment when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
## For MoE model, we found that setting the decay token to 300B helps.
WARMUP_TOKENS=375000000
# LR_DECAY_TOKENS=260000000000
LR_DECAY_TOKENS=300000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=8
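## For example, with NUM_GPUS=64, MP_SIZE=1 and PP_SIZE=1 below, the data-parallel
## size is 64, so each GPU sees GLOBAL_BATCH_SIZE/64 = 8 samples per step;
## BATCH_SIZE=8 therefore means one forward/backward pass per step, i.e. no
## gradient accumulation.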
## Model parallelism, 1 is no MP
## Currently MoE models have divergence issue when MP > 1.
MP_SIZE=1
## Pipeline parallelism
## Currently we don't support PP for MoE. To disable PP, set PP_SIZE
## to 1 and use the "--no-pipeline-parallel" arg.
PP_SIZE=1
NUM_GPUS=64
###############################################################################
### MoE configs
## Number of experts. EP_SIZE 128 means standard MoE
# EP_SIZE=128
EP_SIZE="64 64 64 64 64 64 64 64 64 64 128 128"
EP_PARALLEL_SIZE=$NUM_GPUS
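## Here EP_SIZE is a list of expert counts (one entry per MoE layer) rather than
## a single number: 64 experts on the earlier MoE layers and 128 on the last two,
## i.e. the pyramid structure that the "ep-pyramid-64+128" tag in NAME below
## refers to. EP_PARALLEL_SIZE spans all GPUs so the experts are spread across
## the whole cluster.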
## Original GPT-3 model always set min LR at 10% of max LR. For MoE model, we
## found that lower LR and min LR (than the base dense model) helps.
## For 1.3B PR-MoE-64/128 model we used LR=1.2e-4 and MIN_LR=1.0e-6, which were
## not heavily tuned.
LR=1.2e-4
MIN_LR=1.0e-6
## Coefficient for MoE loss. We find that 0.01 is a good value at least for
## 1.3B MoE-128 model
MLC=0.01
## Below configs adjust the MoE expert token capacity limit during training and
## eval. To completely disable capacity limit, set MOE_DROP_TOKEN to false.
## Larger capacity factor or disabling capacity limit could improve training
## convergence, but will also reduce training throughput.
MOE_TRAIN_CAP_FACTOR=1.0
MOE_EVAL_CAP_FACTOR=1.0
MOE_MIN_CAP=4
MOE_DROP_TOKEN="true"
# MOE_DROP_TOKEN="false"
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="false"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=80
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_TOKENS=$((${CL_TOKENS} * 1000000000))
CL_STEP=$(( ${CL_TOKENS} / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
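## Worked example with the values above: CL_AVG_SEQLEN = (80+2048)/2 = 1064 and
## CL_TOKENS = 60B, so CL_STEP = 60e9/(512*1064) ~= 110K steps for the curriculum
## to ramp the sequence length from CL_START_SEQLEN up to SEQ_LEN.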
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=10000
## Standard deviation for weight initialization
## We used 0.014 for the 350M/1.3B dense/MoE models, and 0.01 for the 6.7B
## dense model. Usually a larger model needs a lower std.
INIT_STD=0.014
# INIT_STD=0.01
## Activation checkpointing saves GPU memory, but reduces training speed
ACTIVATION_CHECKPOINT="true"
# ACTIVATION_CHECKPOINT="false"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="gpt-${MODEL_SIZE}B-lr-${LR}-minlr-${MIN_LR}-bs-${GLOBAL_BATCH_SIZE}-gpus-${NUM_GPUS}-mp-${MP_SIZE}-pp-${PP_SIZE}"
NAME="${NAME}-ep-pyramid-64+128-mlc-${MLC}-cap-${MOE_TRAIN_CAP_FACTOR}-drop-${MOE_DROP_TOKEN}"
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-${CL_START_SEQLEN}-${CL_STEP}"
fi
OUTPUT_BASEPATH=$DIR/output
mkdir -p "${OUTPUT_BASEPATH}/tensorboard/"
mkdir -p "${OUTPUT_BASEPATH}/checkpoint/"
mkdir -p "${OUTPUT_BASEPATH}/log/"
TENSORBOARD_DIR="${OUTPUT_BASEPATH}/tensorboard/${NAME}_${host}_${current_time}"
mkdir -p ${TENSORBOARD_DIR}
## Note that for an MoE model with a billion-scale base model, the checkpoint can
## be as large as TB-scale, which normal NFS cannot handle efficiently.
CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/${NAME}"
# USE_INTERNAL_DATA="true"
USE_INTERNAL_DATA="false"
if [ "${USE_INTERNAL_DATA}" = "true" ]; then
## The internal data is only accessible within Microsoft
## For cluster Azure-EastUS-V100-32GB-4, Azure-WestUS3-A100
BASE_DATA_PATH=/vc_data/Megatron-LM/data
DATA_HOME="/vc_data/pile-cc1-cc2-shuf"
## For cluster Lab-RR1-V100
# BASE_DATA_PATH=/data/Megatron-LM/data
# DATA_HOME="/turing-ssd/users/conglli/data/pile-cc1-cc2-shuf"
## For cluster Azure-CentralUS-A100
# BASE_DATA_PATH=/data/Megatron-LM/data
# DATA_HOME=/vc_data_1/users/amawa/blended
VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json
MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt
ARX="${DATA_HOME}/ArXiv_ftfy_cleaned_id_shuf_text_document"
BC2="${DATA_HOME}/BookCorpus2_ftfy_cleaned_id_shuf_text_document"
B3="${DATA_HOME}/Books3_ftfy_cleaned_id_shuf_text_document"
CC2020="${DATA_HOME}/CC-2020-50_id_cleaned_shuf_text_document"
CC2021="${DATA_HOME}/CC-2021-04_id_cleaned_shuf_text_document"
GIT="${DATA_HOME}/Github_ftfy_id_shuf_text_document"
GUT="${DATA_HOME}/Gutenberg_PG-19_ftfy_cleaned_id_cleaned_shuf_text_document"
NIH="${DATA_HOME}/NIH_ExPorter_ftfy_id_shuf_text_document"
OWT2="${DATA_HOME}/OpenWebText2_ftfy_cleaned_id_shuf_text_document"
PCC="${DATA_HOME}/Pile-CC_id_cleaned_shuf_text_document"
PM="${DATA_HOME}/PubMed_Abstracts_ftfy_id_shuf_text_document"
RN="${DATA_HOME}/rn_dedup_shuf_cleaned_0.7_cleaned_shuf_text_document"
SE="${DATA_HOME}/StackExchange_ftfy_id_shuf_text_document"
ST="${DATA_HOME}/stories_dedup0.7_shuf_cleaned_shuf_text_document"
WIK="${DATA_HOME}/Wikipedia_en_ftfy_id_shuf_text_document"
DATA_BLEND="0.14336 ${B3} 0.08962 ${RN} 0.19336 ${OWT2} 0.05689 ${SE} \
0.00859 ${ST} 0.02897 ${PM} 0.04771 ${WIK} 0.00873 ${GUT} 0.01007 ${BC2} \
0.00208 ${NIH} 0.13017 ${CC2020} 0.09446 ${PCC} 0.15652 ${CC2021} \
0.01359 ${ARX} 0.01588 ${GIT}"
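## DATA_BLEND alternates sampling weight and dataset path; the weights above sum
## to 1.0, so each value is the fraction of training data drawn from the
## corresponding pre-tokenized Pile subset.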
else
VOCAB_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json
MERGE_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt
# Public the Pile dataset, can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
DATA_BLEND=/data/the_pile_public_merged_nopreprocessing/pile_text_document
fi
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_BLEND} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--moe-expert-parallel-size ${EP_PARALLEL_SIZE} \
--num-experts ${EP_SIZE} \
--moe-loss-coeff ${MLC} \
--mlp-type residual \
--moe-train-capacity-factor ${MOE_TRAIN_CAP_FACTOR} \
--moe-eval-capacity-factor ${MOE_EVAL_CAP_FACTOR} \
--moe-min-capacity ${MOE_MIN_CAP} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers ${NUM_LAYERS} \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-iters ${TRAIN_ITERS} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load ${CHECKPOINT_PATH} \
--save ${CHECKPOINT_PATH} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_DIR}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
megatron_options="${megatron_options} \
--create-moe-param-group"
if [ "${MOE_DROP_TOKEN}" = "false" ]; then
megatron_options="${megatron_options} \
--disable-moe-token-dropping"
fi
template_json="ds_config_gpt_Zero2_TEMPLATE.json"
config_json="ds_config_gpt_${NAME}.json"
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--pipeline-model-parallel-size ${PP_SIZE}"
# Currently MoE is not compatible with pipeline parallel
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${OUTPUT_BASEPATH}/log/${NAME}_${host}_${current_time}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,355 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
# MODEL_SIZE=0.125
# NUM_LAYERS=12
# HIDDEN_SIZE=768
# NUM_ATTN_HEADS=12
# GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
# MIN_LR=6.0e-5
## GPT-3 Medium 350M
# MODEL_SIZE=0.35
# NUM_LAYERS=24
# HIDDEN_SIZE=1024
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=3.0e-4
# MIN_LR=3.0e-5
## GPT-3 Large 760M
# MODEL_SIZE=0.76
# NUM_LAYERS=24
# HIDDEN_SIZE=1536
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=2.5e-4
# MIN_LR=2.5e-5
## GPT-3 XL 1.3B
MODEL_SIZE=1.3
NUM_LAYERS=24
HIDDEN_SIZE=2048
NUM_ATTN_HEADS=16
GLOBAL_BATCH_SIZE=512
# LR=2.0e-4
# MIN_LR=2.0e-5
## GPT-3 2.7B
# MODEL_SIZE=2.7
# NUM_LAYERS=32
# HIDDEN_SIZE=2560
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=512
# LR=1.6e-4
# MIN_LR=1.6e-5
## GPT-3 6.7B
# MODEL_SIZE=6.7
# NUM_LAYERS=32
# HIDDEN_SIZE=4096
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=1024
# LR=1.2e-4
# MIN_LR=1.2e-5
## GPT-3 13B
# MODEL_SIZE=13
# NUM_LAYERS=40
# HIDDEN_SIZE=5120
# NUM_ATTN_HEADS=40
# GLOBAL_BATCH_SIZE=1024
# LR=1.0e-4
# MIN_LR=1.0e-5
## GPT-3 175B
# MODEL_SIZE=175
# NUM_LAYERS=96
# HIDDEN_SIZE=12288
# NUM_ATTN_HEADS=96
# GLOBAL_BATCH_SIZE=1536
# LR=0.6e-4
# MIN_LR=0.6e-5
###############################################################################
### Training duration configs
## The main termination condition, original GPT-3 paper trains for 300B tokens
## For MoE model, we found sometimes training a bit more to 330B tokens helps
TRAIN_TOKENS=300000000000
# TRAIN_TOKENS=330000000000
## TRAIN_ITERS is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS
## above, and techniques like curriculum learning use fewer tokens in some steps,
## we just set this config large enough to make sure we have enough
## processed data and don't terminate by TRAIN_ITERS.
TRAIN_ITERS=$(( ${TRAIN_TOKENS} * 3 / ${GLOBAL_BATCH_SIZE} / ${SEQ_LEN} ))
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration. This token-based config is preferable since
## it does not need readjustment when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
## For MoE model, we found that setting the decay token to 300B helps.
WARMUP_TOKENS=375000000
# LR_DECAY_TOKENS=260000000000
LR_DECAY_TOKENS=300000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=4
## Model parallelism, 1 is no MP
## Currently MoE models have divergence issue when MP > 1.
MP_SIZE=1
## Pipeline parallelism
## Currently we don't support PP for MoE. To disable PP, set PP_SIZE
## to 1 and use the "--no-pipeline-parallel" arg.
PP_SIZE=1
NUM_GPUS=128
###############################################################################
### MoE configs
## Number of experts. EP_SIZE 128 means standard MoE
# EP_SIZE=128
EP_SIZE="64 64 64 64 64 64 64 64 128 128"
EP_SIZE_TEACHER="64 64 64 64 64 64 64 64 64 64 128 128"
EP_PARALLEL_SIZE=$NUM_GPUS
## Original GPT-3 model always set min LR at 10% of max LR. For MoE model, we
## found that lower LR and min LR (than the base dense model) helps.
## For 1.3B PR-MoE-64/128 model we used LR=1.2e-4 and MIN_LR=1.0e-6, which were
## not heavily tuned.
LR=1.2e-4
MIN_LR=1.0e-6
## Coefficient for MoE loss. We find that 0.01 is a good value at least for
## 1.3B MoE-128 model
MLC=0.01
## Below configs adjust the MoE expert token capacity limit during training and
## eval. To completely disable capacity limit, set MOE_DROP_TOKEN to false.
## Larger capacity factor or disabling capacity limit could improve training
## convergence, but will also reduce training throughput.
MOE_TRAIN_CAP_FACTOR=1.0
MOE_EVAL_CAP_FACTOR=1.0
MOE_MIN_CAP=4
MOE_DROP_TOKEN="true"
# MOE_DROP_TOKEN="false"
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="false"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=80
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_TOKENS=$((${CL_TOKENS} * 1000000000))
CL_STEP=$(( ${CL_TOKENS} / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=10000
## Standard deviation for weight initialization
## We used 0.014 for the 350M/1.3B dense/MoE models, and 0.01 for the 6.7B
## dense model. Usually a larger model needs a lower std.
INIT_STD=0.014
# INIT_STD=0.01
## Activation checkpointing saves GPU memory, but reduces training speed
ACTIVATION_CHECKPOINT="true"
# ACTIVATION_CHECKPOINT="false"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="gpt-${MODEL_SIZE}B-lr-${LR}-minlr-${MIN_LR}-bs-${GLOBAL_BATCH_SIZE}-gpus-${NUM_GPUS}-mp-${MP_SIZE}-pp-${PP_SIZE}"
NAME="${NAME}-ep-pyramid-64+128-mos-mlc-${MLC}-cap-${MOE_TRAIN_CAP_FACTOR}-drop-${MOE_DROP_TOKEN}"
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-${CL_START_SEQLEN}-${CL_STEP}"
fi
OUTPUT_BASEPATH=$DIR/output
mkdir -p "${OUTPUT_BASEPATH}/tensorboard/"
mkdir -p "${OUTPUT_BASEPATH}/checkpoint/"
mkdir -p "${OUTPUT_BASEPATH}/log/"
TENSORBOARD_DIR="${OUTPUT_BASEPATH}/tensorboard/${NAME}_${host}"
mkdir -p ${TENSORBOARD_DIR}
## Note that for an MoE model with a billion-scale base model, the checkpoint can
## be as large as TB-scale, which normal NFS cannot handle efficiently.
CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/${NAME}"
### Mixture-of-Students (MoS) configs
KD_BETA_CE=1
CHECKPOINT_PATH_STUDENT="${OUTPUT_BASEPATH}/checkpoint/${NAME}"
CHECKPOINT_PATH_TEACHER="${OUTPUT_BASEPATH}/checkpoint/gpt-1.3B-lr-1.2e-4-minlr-1.0e-6-bs-512-gpus-128-mp-1-pp-1-ep-pyramid-64+128-mlc-0.01-cap-1.0-drop-true/"
CHECKPOINT_PATH_SAVE="${OUTPUT_BASEPATH}/checkpoint/${NAME}"
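## In this staged KD setup the student resumes from CHECKPOINT_PATH_STUDENT and
## saves to CHECKPOINT_PATH_SAVE, while the teacher is the previously trained
## 24-layer PR-MoE checkpoint loaded via --load-teacher. KD_BETA_CE is passed to
## --kd-beta-ce and, as the name suggests, weights the distillation cross-entropy
## term; the student itself uses a reduced depth (--num-layers 21 below) and the
## shorter EP_SIZE expert list defined above.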
USE_INTERNAL_DATA="true"
# USE_INTERNAL_DATA="false"
if [ "${USE_INTERNAL_DATA}" = "true" ]; then
## The internal data is only accessible within Microsoft
## For cluster Azure-EastUS-V100-32GB-4, Azure-WestUS3-A100
BASE_DATA_PATH=/vc_data/Megatron-LM/data
DATA_HOME="/vc_data/pile-cc1-cc2-shuf"
## For cluster Lab-RR1-V100
# BASE_DATA_PATH=/data/Megatron-LM/data
# DATA_HOME="/turing-ssd/users/conglli/data/pile-cc1-cc2-shuf"
## For cluster Azure-CentralUS-A100
# BASE_DATA_PATH=/data/Megatron-LM/data
# DATA_HOME=/vc_data_1/users/amawa/blended
VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json
MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt
ARX="${DATA_HOME}/ArXiv_ftfy_cleaned_id_shuf_text_document"
BC2="${DATA_HOME}/BookCorpus2_ftfy_cleaned_id_shuf_text_document"
B3="${DATA_HOME}/Books3_ftfy_cleaned_id_shuf_text_document"
CC2020="${DATA_HOME}/CC-2020-50_id_cleaned_shuf_text_document"
CC2021="${DATA_HOME}/CC-2021-04_id_cleaned_shuf_text_document"
GIT="${DATA_HOME}/Github_ftfy_id_shuf_text_document"
GUT="${DATA_HOME}/Gutenberg_PG-19_ftfy_cleaned_id_cleaned_shuf_text_document"
NIH="${DATA_HOME}/NIH_ExPorter_ftfy_id_shuf_text_document"
OWT2="${DATA_HOME}/OpenWebText2_ftfy_cleaned_id_shuf_text_document"
PCC="${DATA_HOME}/Pile-CC_id_cleaned_shuf_text_document"
PM="${DATA_HOME}/PubMed_Abstracts_ftfy_id_shuf_text_document"
RN="${DATA_HOME}/rn_dedup_shuf_cleaned_0.7_cleaned_shuf_text_document"
SE="${DATA_HOME}/StackExchange_ftfy_id_shuf_text_document"
ST="${DATA_HOME}/stories_dedup0.7_shuf_cleaned_shuf_text_document"
WIK="${DATA_HOME}/Wikipedia_en_ftfy_id_shuf_text_document"
DATA_BLEND="0.14336 ${B3} 0.08962 ${RN} 0.19336 ${OWT2} 0.05689 ${SE} \
0.00859 ${ST} 0.02897 ${PM} 0.04771 ${WIK} 0.00873 ${GUT} 0.01007 ${BC2} \
0.00208 ${NIH} 0.13017 ${CC2020} 0.09446 ${PCC} 0.15652 ${CC2021} \
0.01359 ${ARX} 0.01588 ${GIT}"
else
## Placeholder, we plan to test a public dataset
VOCAB_PATH=""
MERGE_PATH=""
DATA_BLEND=""
fi
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_BLEND} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--moe-expert-parallel-size ${EP_PARALLEL_SIZE} \
--num-experts ${EP_SIZE} \
--moe-loss-coeff ${MLC} \
--mlp-type residual \
--moe-train-capacity-factor ${MOE_TRAIN_CAP_FACTOR} \
--moe-eval-capacity-factor ${MOE_EVAL_CAP_FACTOR} \
--moe-min-capacity ${MOE_MIN_CAP} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers 21 \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-iters ${TRAIN_ITERS} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load ${CHECKPOINT_PATH_STUDENT} \
--save ${CHECKPOINT_PATH_SAVE} \
--mos \
--kd-beta-ce ${KD_BETA_CE} \
--num-layers-teacher ${NUM_LAYERS} \
--num-experts-teacher ${EP_SIZE_TEACHER} \
--hidden-size-teacher ${HIDDEN_SIZE} \
--num-attention-heads-teacher ${NUM_ATTN_HEADS} \
--load-teacher ${CHECKPOINT_PATH_TEACHER} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_DIR}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
megatron_options="${megatron_options} \
--create-moe-param-group"
if [ "${MOE_DROP_TOKEN}" = "false" ]; then
megatron_options="${megatron_options} \
--disable-moe-token-dropping"
fi
template_json="ds_config_gpt_Zero2_TEMPLATE.json"
config_json="ds_config_gpt_${NAME}.json"
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--pipeline-model-parallel-size ${PP_SIZE}"
# Currently MoE is not compatible with pipeline parallel
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
# run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${OUTPUT_BASEPATH}/log/${NAME}_${host}_${current_time}.log"
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${OUTPUT_BASEPATH}/log/${NAME}_${host}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,350 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
# MODEL_SIZE=0.125
# NUM_LAYERS=12
# HIDDEN_SIZE=768
# NUM_ATTN_HEADS=12
# GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
# MIN_LR=6.0e-5
## GPT-3 Medium 350M
# MODEL_SIZE=0.35
# NUM_LAYERS=24
# HIDDEN_SIZE=1024
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=3.0e-4
# MIN_LR=3.0e-5
## GPT-3 Large 760M
# MODEL_SIZE=0.76
# NUM_LAYERS=24
# HIDDEN_SIZE=1536
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=2.5e-4
# MIN_LR=2.5e-5
## GPT-3 XL 1.3B
MODEL_SIZE=1.3
NUM_LAYERS=24
HIDDEN_SIZE=2048
NUM_ATTN_HEADS=16
GLOBAL_BATCH_SIZE=512
LR=2.0e-4
MIN_LR=2.0e-5
## GPT-3 2.7B
# MODEL_SIZE=2.7
# NUM_LAYERS=32
# HIDDEN_SIZE=2560
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=512
# LR=1.6e-4
# MIN_LR=1.6e-5
## GPT-3 6.7B
# MODEL_SIZE=6.7
# NUM_LAYERS=32
# HIDDEN_SIZE=4096
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=1024
# LR=1.2e-4
# MIN_LR=1.2e-5
## GPT-3 13B
# MODEL_SIZE=13
# NUM_LAYERS=40
# HIDDEN_SIZE=5120
# NUM_ATTN_HEADS=40
# GLOBAL_BATCH_SIZE=1024
# LR=1.0e-4
# MIN_LR=1.0e-5
## GPT-3 175B
# MODEL_SIZE=175
# NUM_LAYERS=96
# HIDDEN_SIZE=12288
# NUM_ATTN_HEADS=96
# GLOBAL_BATCH_SIZE=1536
# LR=0.6e-4
# MIN_LR=0.6e-5
###############################################################################
### Training duration configs
## The main termination condition, original GPT-3 paper trains for 300B tokens
## For MoE model, we found sometimes training a bit more to 330B tokens helps
TRAIN_TOKENS=300000000000
# TRAIN_TOKENS=330000000000
## TRAIN_SAMPLES is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS
## above, and techniques like curriculum learning use fewer tokens in some steps,
## we just set this config large enough to make sure we have enough
## processed data and don't terminate by TRAIN_SAMPLES.
TRAIN_SAMPLES=$(( ${TRAIN_TOKENS} * 3 / ${SEQ_LEN} ))
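## For example, with TRAIN_TOKENS=300B and SEQ_LEN=2048 this gives
## TRAIN_SAMPLES = 300e9*3/2048 ~= 439M samples, about 3x the ~146M full-length
## samples actually needed to reach TRAIN_TOKENS.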
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration. This token-based config is preferable since
## it does not need readjustment when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
## For MoE model, we found that setting the decay token to 300B helps.
WARMUP_TOKENS=375000000
LR_DECAY_TOKENS=260000000000
# LR_DECAY_TOKENS=300000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=2
## Model parallelism, 1 is no MP
## Currently MoE models have divergence issue when MP > 1.
MP_SIZE=4
## Pipeline parallelism
## Currently we don't support PP for MoE. To disable PP, set PP_SIZE
## to 1 and use the "--no-pipeline-parallel" arg.
PP_SIZE=1
NUM_GPUS=64
###############################################################################
### MoE configs
## Number of experts. EP_SIZE 1 means dense model without MoE
EP_SIZE=1
# EP_SIZE=128
if [[ $EP_SIZE -gt $NUM_GPUS ]]; then
EP_PARALLEL_SIZE=$NUM_GPUS
else
EP_PARALLEL_SIZE=$EP_SIZE
fi
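## With the dense default (EP_SIZE=1) this yields EP_PARALLEL_SIZE=1; with e.g.
## EP_SIZE=128 on 64 GPUs it would be capped at 64, so each expert-parallel rank
## would host 128/64 = 2 experts per MoE layer.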
## Original GPT-3 model always set min LR at 10% of max LR. For MoE model, we
## found that lower LR and min LR (than the base dense model) helps.
## For 1.3B MoE-128 model we used LR=1.2e-4 and MIN_LR=1.0e-6.
## For 350M MoE-128 model we used LR=2.0e-4 and MIN_LR=2.0e-6, but they are not
## heavily tuned.
# LR=2.0e-4
# MIN_LR=2e-06
## Coefficient for MoE loss. We find that 0.01 is a good value at least for
## 1.3B MoE-128 model
MLC=0.01
## Below configs adjust the MoE expert token capacity limit during training and
## eval. To completely disable capacity limit, set MOE_DROP_TOKEN to false.
## Larger capacity factor or disabling capacity limit could improve training
## convergence, but will also reduce training throughput.
MOE_TRAIN_CAP_FACTOR=1.0
MOE_EVAL_CAP_FACTOR=1.0
MOE_MIN_CAP=4
MOE_DROP_TOKEN="true"
# MOE_DROP_TOKEN="false"
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="false"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=80
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_TOKENS=$((${CL_TOKENS} * 1000000000))
CL_STEP=$(( ${CL_TOKENS} / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=1000
## Standard deviation for weight initialization
## We used 0.014 for the 350M/1.3B dense/MoE models, and 0.01 for the 6.7B
## dense model. Usually a larger model needs a lower std.
INIT_STD=0.014
# INIT_STD=0.01
## Activation checkpointing saves GPU memory, but reduces training speed
ACTIVATION_CHECKPOINT="true"
# ACTIVATION_CHECKPOINT="false"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="gpt-${MODEL_SIZE}B-lr-${LR}-minlr-${MIN_LR}-bs-${GLOBAL_BATCH_SIZE}-gpus-${NUM_GPUS}-mp-${MP_SIZE}-pp-${PP_SIZE}"
if [[ $EP_SIZE -gt 1 ]]; then
NAME="${NAME}-ep-${EP_SIZE}-mlc-${MLC}-cap-${MOE_TRAIN_CAP_FACTOR}-drop-${MOE_DROP_TOKEN}"
fi
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-${CL_START_SEQLEN}-${CL_STEP}"
fi
OUTPUT_BASEPATH=$DIR/output
mkdir -p "${OUTPUT_BASEPATH}/tensorboard/"
mkdir -p "${OUTPUT_BASEPATH}/checkpoint/"
mkdir -p "${OUTPUT_BASEPATH}/log/"
TENSORBOARD_DIR="${OUTPUT_BASEPATH}/tensorboard/${NAME}_${host}_${current_time}"
mkdir -p ${TENSORBOARD_DIR}
## Note that for an MoE model with a billion-scale base model, the checkpoint can
## be as large as TB-scale, which normal NFS cannot handle efficiently.
CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/${NAME}"
# USE_INTERNAL_DATA="true"
USE_INTERNAL_DATA="false"
if [ "${USE_INTERNAL_DATA}" = "true" ]; then
## The internal data is only accessible within Microsoft
## For cluster Azure-EastUS-V100-32GB-4, Azure-WestUS3-A100
# BASE_DATA_PATH=/vc_data/Megatron-LM/data
# DATA_HOME="/vc_data/pile-cc1-cc2-shuf"
## For cluster Lab-RR1-V100
BASE_DATA_PATH=/data/Megatron-LM/data
DATA_HOME="/turing-ssd/users/conglli/data/pile-cc1-cc2-shuf"
## For cluster Azure-CentralUS-A100
# BASE_DATA_PATH=/data/Megatron-LM/data
# DATA_HOME=/vc_data_1/users/amawa/blended
VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json
MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt
ARX="${DATA_HOME}/ArXiv_ftfy_cleaned_id_shuf_text_document"
BC2="${DATA_HOME}/BookCorpus2_ftfy_cleaned_id_shuf_text_document"
B3="${DATA_HOME}/Books3_ftfy_cleaned_id_shuf_text_document"
CC2020="${DATA_HOME}/CC-2020-50_id_cleaned_shuf_text_document"
CC2021="${DATA_HOME}/CC-2021-04_id_cleaned_shuf_text_document"
GIT="${DATA_HOME}/Github_ftfy_id_shuf_text_document"
GUT="${DATA_HOME}/Gutenberg_PG-19_ftfy_cleaned_id_cleaned_shuf_text_document"
NIH="${DATA_HOME}/NIH_ExPorter_ftfy_id_shuf_text_document"
OWT2="${DATA_HOME}/OpenWebText2_ftfy_cleaned_id_shuf_text_document"
PCC="${DATA_HOME}/Pile-CC_id_cleaned_shuf_text_document"
PM="${DATA_HOME}/PubMed_Abstracts_ftfy_id_shuf_text_document"
RN="${DATA_HOME}/rn_dedup_shuf_cleaned_0.7_cleaned_shuf_text_document"
SE="${DATA_HOME}/StackExchange_ftfy_id_shuf_text_document"
ST="${DATA_HOME}/stories_dedup0.7_shuf_cleaned_shuf_text_document"
WIK="${DATA_HOME}/Wikipedia_en_ftfy_id_shuf_text_document"
DATA_BLEND="0.14336 ${B3} 0.08962 ${RN} 0.19336 ${OWT2} 0.05689 ${SE} \
0.00859 ${ST} 0.02897 ${PM} 0.04771 ${WIK} 0.00873 ${GUT} 0.01007 ${BC2} \
0.00208 ${NIH} 0.13017 ${CC2020} 0.09446 ${PCC} 0.15652 ${CC2021} \
0.01359 ${ARX} 0.01588 ${GIT}"
else
VOCAB_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json
MERGE_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt
# Public the Pile dataset, can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
DATA_BLEND=/data/the_pile_public_merged_nopreprocessing/pile_text_document
fi
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_BLEND} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--moe-expert-parallel-size ${EP_PARALLEL_SIZE} \
--num-experts ${EP_SIZE} \
--moe-loss-coeff ${MLC} \
--moe-train-capacity-factor ${MOE_TRAIN_CAP_FACTOR} \
--moe-eval-capacity-factor ${MOE_EVAL_CAP_FACTOR} \
--moe-min-capacity ${MOE_MIN_CAP} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--rampup-batch-size 32 32 1953125 \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers ${NUM_LAYERS} \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-samples ${TRAIN_SAMPLES} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load ${CHECKPOINT_PATH} \
--save ${CHECKPOINT_PATH} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_DIR}"
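## Note: --rampup-batch-size 32 32 1953125 (inside megatron_options above) ramps
## the global batch size from 32 up to GLOBAL_BATCH_SIZE in increments of 32 over
## the first 1,953,125 samples, which at SEQ_LEN=2048 corresponds to roughly the
## first 4B training tokens.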
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [[ $EP_SIZE -gt 1 ]]; then
megatron_options="${megatron_options} \
--create-moe-param-group"
fi
if [ "${MOE_DROP_TOKEN}" = "false" ]; then
megatron_options="${megatron_options} \
--disable-moe-token-dropping"
fi
template_json="ds_config_gpt_TEMPLATE.json"
config_json="ds_config_gpt_${NAME}.json"
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/0/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--pipeline-model-parallel-size ${PP_SIZE}"
# Currently MoE is not compatible with pipeline parallel
if [[ $EP_SIZE -gt 1 ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${OUTPUT_BASEPATH}/log/${NAME}_${host}_${current_time}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,285 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
# MODEL_SIZE=0.125
# NUM_LAYERS=12
# HIDDEN_SIZE=768
# NUM_ATTN_HEADS=12
# GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
# MIN_LR=6.0e-5
## GPT-3 Medium 350M
# MODEL_SIZE=0.35
# NUM_LAYERS=24
# HIDDEN_SIZE=1024
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=3.0e-4
# MIN_LR=3.0e-5
## GPT-3 Large 760M
# MODEL_SIZE=0.76
# NUM_LAYERS=24
# HIDDEN_SIZE=1536
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=2.5e-4
# MIN_LR=2.5e-5
## GPT-3 XL 1.3B
MODEL_SIZE=1.3
NUM_LAYERS=24
HIDDEN_SIZE=2048
NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=512
# LR=2.0e-4
MIN_LR=2.0e-5
# Curriculum learning (CL) enables stable large-batch training
GLOBAL_BATCH_SIZE=4096 # 8x
LR=8.0e-4 # 4x
## GPT-3 2.7B
# MODEL_SIZE=2.7
# NUM_LAYERS=32
# HIDDEN_SIZE=2560
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=512
# LR=1.6e-4
# MIN_LR=1.6e-5
## GPT-3 6.7B
# MODEL_SIZE=6.7
# NUM_LAYERS=32
# HIDDEN_SIZE=4096
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=1024
# LR=1.2e-4
# MIN_LR=1.2e-5
## GPT-3 13B
# MODEL_SIZE=13
# NUM_LAYERS=40
# HIDDEN_SIZE=5120
# NUM_ATTN_HEADS=40
# GLOBAL_BATCH_SIZE=1024
# LR=1.0e-4
# MIN_LR=1.0e-5
## GPT-3 175B
# MODEL_SIZE=175
# NUM_LAYERS=96
# HIDDEN_SIZE=12288
# NUM_ATTN_HEADS=96
# GLOBAL_BATCH_SIZE=1536
# LR=0.6e-4
# MIN_LR=0.6e-5
###############################################################################
### Training duration configs
## The main termination condition, original GPT-3 paper trains for 300B tokens
TRAIN_TOKENS=300000000000
## TRAIN_SAMPLES is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS
## above, and techniques like curriculum learning use fewer tokens in some samples,
## we just set this config large enough to make sure we have enough
## processed data and don't terminate by TRAIN_SAMPLES.
TRAIN_SAMPLES=$(( ${TRAIN_TOKENS} * 3 / ${SEQ_LEN} ))
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration. This token-based config is preferable since
## it does not need readjustment when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
WARMUP_TOKENS=375000000
LR_DECAY_TOKENS=260000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=16
## Model parallelism, 1 is no MP
MP_SIZE=2
## Pipeline parallelism. To disable PP, set PP_SIZE to 1 and NO_PP to true.
PP_SIZE=1
NO_PP="true"
## ZeRO stage
ZERO_STAGE=0
## Total number of GPUs
NUM_GPUS=128
DP_SIZE=$(( ${NUM_GPUS} / ${PP_SIZE} / ${MP_SIZE} ))
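## For example, NUM_GPUS=128 with PP_SIZE=1 and MP_SIZE=2 gives DP_SIZE=64, so
## each data-parallel rank handles GLOBAL_BATCH_SIZE/64 = 64 samples per step and
## BATCH_SIZE=16 implies 64/16 = 4 gradient accumulation steps.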
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="true"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=80
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_STEP=$(( ${CL_TOKENS} * 1000000000 / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=1000
## Standard deviation for weight initialization. Usually a larger model needs a
## lower std. We used the heuristic sqrt(1/(3*HIDDEN_SIZE)) from the
## MT-NLG 530B work (https://arxiv.org/pdf/2201.11990.pdf)
INIT_STD=0.013
## Activation checkpointing saves GPU memory, but reduces training speed
ACTIVATION_CHECKPOINT="true"
# ACTIVATION_CHECKPOINT="false"
## Whether or not to log optimizer states (norms, max abs values) to tensorboard.
## This is not required for training and might save GPU memory when turned off.
LOG_OPTIMIZER_STATE="true"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="gpt3-with-pile-${MODEL_SIZE}B-lr-${LR}-minlr-${MIN_LR}-bs-${GLOBAL_BATCH_SIZE}-gpus-${NUM_GPUS}-zero-${ZERO_STAGE}-mp-${MP_SIZE}-pp-${PP_SIZE}"
if [ "${NO_PP}" = "true" ]; then
NAME="${NAME}-no_pp"
fi
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-startseqlen-${CL_START_SEQLEN}-step-${CL_STEP}-token-${CL_TOKENS}B"
fi
LOG_PATH="log/"
TENSORBOARD_PATH="tensorboard/${NAME}_${host}_${current_time}"
CHECKPOINT_PATH="/blob/users/conglli/project/gpt3_with_pile/checkpoint/${NAME}"
mkdir -p ${LOG_PATH}
mkdir -p ${TENSORBOARD_PATH}
mkdir -p ${CHECKPOINT_PATH}
VOCAB_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json
MERGE_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt
# Public the Pile dataset, can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
DATA_PATH=/data/the_pile_public_merged_nopreprocessing/pile_text_document
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_PATH} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers ${NUM_LAYERS} \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-samples ${TRAIN_SAMPLES} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load ${CHECKPOINT_PATH} \
--save ${CHECKPOINT_PATH} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_PATH}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [ "${LOG_OPTIMIZER_STATE}" = "true" ]; then
megatron_options="${megatron_options} \
--log-optimizer-states-to-tensorboard"
fi
template_json="ds_config_gpt_TEMPLATE.json"
config_json="ds_config_${NAME}.json"
if [[ $ZERO_STAGE -gt 0 ]]; then
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/${ZERO_STAGE}/" \
| sed "s/PRESCALE_GRAD/false/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
else
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/${ZERO_STAGE}/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
fi
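## The two branches differ only in the ZeRO stage and the gradient prescaling
## flag: with ZERO_STAGE=0 gradients are prescaled before the allreduce
## (PRESCALE_GRAD=true), while with ZeRO enabled prescaling is turned off,
## presumably because ZeRO partitions and reduces the gradients itself.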
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${ZERO_STAGE} \
--pipeline-model-parallel-size ${PP_SIZE}"
if [[ "${NO_PP}" = "true" ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${LOG_PATH}/${NAME}_${host}_${current_time}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,373 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
MODEL_SIZE=0.125
NUM_LAYERS=12
HIDDEN_SIZE=768
NUM_ATTN_HEADS=12
GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
# MIN_LR=6.0e-5
## GPT-3 Medium 350M
# MODEL_SIZE=0.35
# NUM_LAYERS=24
# HIDDEN_SIZE=1024
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=3.0e-4
# MIN_LR=3.0e-5
## GPT-3 Large 760M
# MODEL_SIZE=0.76
# NUM_LAYERS=24
# HIDDEN_SIZE=1536
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=2.5e-4
# MIN_LR=2.5e-5
## GPT-3 XL 1.3B
# MODEL_SIZE=1.3
# NUM_LAYERS=24
# HIDDEN_SIZE=2048
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=512
# LR=2.0e-4
# MIN_LR=2.0e-5
## GPT-3 2.7B
# MODEL_SIZE=2.7
# NUM_LAYERS=32
# HIDDEN_SIZE=2560
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=512
# LR=1.6e-4
# MIN_LR=1.6e-5
## GPT-3 6.7B
# MODEL_SIZE=6.7
# NUM_LAYERS=32
# HIDDEN_SIZE=4096
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=1024
# LR=1.2e-4
# MIN_LR=1.2e-5
## GPT-3 13B
# MODEL_SIZE=13
# NUM_LAYERS=40
# HIDDEN_SIZE=5120
# NUM_ATTN_HEADS=40
# GLOBAL_BATCH_SIZE=1024
# LR=1.0e-4
# MIN_LR=1.0e-5
## GPT-3 175B
# MODEL_SIZE=175
# NUM_LAYERS=96
# HIDDEN_SIZE=12288
# NUM_ATTN_HEADS=96
# GLOBAL_BATCH_SIZE=1536
# LR=0.6e-4
# MIN_LR=0.6e-5
###############################################################################
### Training duration configs
## The main termination condition, original GPT-3 paper trains for 300B tokens
## For MoE model, we found sometimes training a bit more to 330B tokens helps
TRAIN_TOKENS=300000000000
# TRAIN_TOKENS=330000000000
## TRAIN_ITERS is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS
## above, and techniques like curriculum learning use fewer tokens in some steps,
## we just set this config large enough to make sure we have enough
## processed data and don't terminate by TRAIN_ITERS.
TRAIN_ITERS=$(( ${TRAIN_TOKENS} * 3 / ${GLOBAL_BATCH_SIZE} / ${SEQ_LEN} ))
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration. This token-based config is preferable since
## it does not need readjustment when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
## For MoE model, we found that setting the decay token to 300B helps.
WARMUP_TOKENS=375000000
# LR_DECAY_TOKENS=260000000000
LR_DECAY_TOKENS=300000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=4
## Model parallelism, 1 is no MP
## Currently MoE models have divergence issue when MP > 1.
MP_SIZE=1
## Pipeline parallelism
## Currently we don't support PP for MoE. To disable PP, set PP_SIZE
## to 1 and use the "--no-pipeline-parallel" arg.
PP_SIZE=1
NUM_GPUS=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2))
NUM_GPUS_PERNODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
NUM_NODE=$(( ${NUM_GPUS} / ${NUM_GPUS_PERNODE} ))
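## NUM_GPUS counts the GPU name lines reported by nvidia-smi across all nodes via
## ds_ssh (the -2 presumably discounts extra non-GPU lines in that output), while
## NUM_GPUS_PERNODE counts them locally; e.g. 16 nodes with 8 GPUs each should
## give NUM_GPUS=128, NUM_GPUS_PERNODE=8 and NUM_NODE=16.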
###############################################################################
### MoE configs
## Number of experts. EP_SIZE 1 means dense model without MoE
# EP_SIZE=1
EP_SIZE=64
if [[ $EP_SIZE -gt $NUM_GPUS ]]; then
EP_PARALLEL_SIZE=$NUM_GPUS
else
EP_PARALLEL_SIZE=$EP_SIZE
fi
## Original GPT-3 model always set min LR at 10% of max LR. For MoE model, we
## found that lower LR and min LR (than the base dense model) helps.
## For 1.3B MoE-128 model we used LR=1.2e-4 and MIN_LR=1.0e-6.
## For 350M MoE-128 model we used LR=2.0e-4 and MIN_LR=2.0e-6, but they are not
## heavily tuned.
LR=4.5e-4
MIN_LR=4.5e-06
## Coefficient for MoE loss. We find that 0.01 is a good value at least for
## 1.3B MoE-128 model
MLC=0.01
## Below configs adjust the MoE expert token capacity limit during training and
## eval. To completely disable capacity limit, set MOE_DROP_TOKEN to false.
## Larger capacity factor or disabling capacity limit could improve training
## convergence, but will also reduce training throughput.
MOE_TRAIN_CAP_FACTOR=1.0
MOE_EVAL_CAP_FACTOR=1.0
MOE_MIN_CAP=4
MOE_DROP_TOKEN="true"
# MOE_DROP_TOKEN="false"
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="false"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=80
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_TOKENS=$((${CL_TOKENS} * 1000000000))
CL_STEP=$(( ${CL_TOKENS} / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=10000
## Standard deviation for weight initialization
## We used 0.014 for the 350M/1.3B dense/MoE models, and 0.01 for the 6.7B
## dense model. Usually a larger model needs a lower std.
INIT_STD=0.014
# INIT_STD=0.01
## Activation checkpointing saves GPU memory, but reduces training speed
ACTIVATION_CHECKPOINT="true"
# ACTIVATION_CHECKPOINT="false"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="gpt-${MODEL_SIZE}B-lr-${LR}-minlr-${MIN_LR}-bs-${GLOBAL_BATCH_SIZE}-gpus-${NUM_GPUS}-mp-${MP_SIZE}-pp-${PP_SIZE}"
if [[ $EP_SIZE -gt 1 ]]; then
NAME="${NAME}-ep-${EP_SIZE}-mlc-${MLC}-cap-${MOE_TRAIN_CAP_FACTOR}-drop-${MOE_DROP_TOKEN}"
fi
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-${CL_START_SEQLEN}-${CL_STEP}"
fi
OUTPUT_BASEPATH=$DIR/output
mkdir -p "${OUTPUT_BASEPATH}/tensorboard/"
mkdir -p "${OUTPUT_BASEPATH}/checkpoint/"
mkdir -p "${OUTPUT_BASEPATH}/log/"
TENSORBOARD_DIR="${OUTPUT_BASEPATH}/tensorboard/${NAME}_${host}_${current_time}"
mkdir -p ${TENSORBOARD_DIR}
## Note that for an MoE model with a billion-scale base model, the checkpoint can
## be as large as TB-scale, which normal NFS cannot handle efficiently.
CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/${NAME}"
# USE_INTERNAL_DATA="true"
USE_INTERNAL_DATA="false"
if [ "${USE_INTERNAL_DATA}" = "true" ]; then
## The internal data is only accessible within Microsoft
## For cluster Azure-EastUS-V100-32GB-4, Azure-WestUS3-A100
# BASE_DATA_PATH=/vc_data/Megatron-LM/data
# DATA_HOME="/vc_data/pile-cc1-cc2-shuf"
## For cluster Lab-RR1-V100
BASE_DATA_PATH=/data/Megatron-LM/data
DATA_HOME="/turing-ssd/users/conglli/data/pile-cc1-cc2-shuf"
## For cluster Azure-CentralUS-A100
# BASE_DATA_PATH=/data/Megatron-LM/data
# DATA_HOME=/vc_data_1/users/amawa/blended
VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json
MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt
ARX="${DATA_HOME}/ArXiv_ftfy_cleaned_id_shuf_text_document"
BC2="${DATA_HOME}/BookCorpus2_ftfy_cleaned_id_shuf_text_document"
B3="${DATA_HOME}/Books3_ftfy_cleaned_id_shuf_text_document"
CC2020="${DATA_HOME}/CC-2020-50_id_cleaned_shuf_text_document"
CC2021="${DATA_HOME}/CC-2021-04_id_cleaned_shuf_text_document"
GIT="${DATA_HOME}/Github_ftfy_id_shuf_text_document"
GUT="${DATA_HOME}/Gutenberg_PG-19_ftfy_cleaned_id_cleaned_shuf_text_document"
NIH="${DATA_HOME}/NIH_ExPorter_ftfy_id_shuf_text_document"
OWT2="${DATA_HOME}/OpenWebText2_ftfy_cleaned_id_shuf_text_document"
PCC="${DATA_HOME}/Pile-CC_id_cleaned_shuf_text_document"
PM="${DATA_HOME}/PubMed_Abstracts_ftfy_id_shuf_text_document"
RN="${DATA_HOME}/rn_dedup_shuf_cleaned_0.7_cleaned_shuf_text_document"
SE="${DATA_HOME}/StackExchange_ftfy_id_shuf_text_document"
ST="${DATA_HOME}/stories_dedup0.7_shuf_cleaned_shuf_text_document"
WIK="${DATA_HOME}/Wikipedia_en_ftfy_id_shuf_text_document"
DATA_PATH="0.14336 ${B3} 0.08962 ${RN} 0.19336 ${OWT2} 0.05689 ${SE} \
0.00859 ${ST} 0.02897 ${PM} 0.04771 ${WIK} 0.00873 ${GUT} 0.01007 ${BC2} \
0.00208 ${NIH} 0.13017 ${CC2020} 0.09446 ${PCC} 0.15652 ${CC2021} \
0.01359 ${ARX} 0.01588 ${GIT}"
else
VOCAB_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json
MERGE_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt
# Public the Pile dataset, can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
# For cluster Azure-EastUS-V100-32GB-4, Lab-RR1-V100
DATA_PATH=/vc_data_blob/users/conglli/the_pile_public_merged_nopreprocessing/pile_text_document
# For cluster Azure-WestUS3-A100
# DATA_PATH=/blob/data/the_pile_public_merged_nopreprocessing/pile_text_document
fi
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_PATH} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--moe-expert-parallel-size ${EP_PARALLEL_SIZE} \
--num-experts ${EP_SIZE} \
--moe-loss-coeff ${MLC} \
--moe-train-capacity-factor ${MOE_TRAIN_CAP_FACTOR} \
--moe-eval-capacity-factor ${MOE_EVAL_CAP_FACTOR} \
--moe-min-capacity ${MOE_MIN_CAP} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers ${NUM_LAYERS} \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-iters ${TRAIN_ITERS} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load ${CHECKPOINT_PATH} \
--save ${CHECKPOINT_PATH} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_DIR}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [[ $EP_SIZE -gt 1 ]]; then
megatron_options="${megatron_options} \
--create-moe-param-group"
fi
if [ "${MOE_DROP_TOKEN}" = "false" ]; then
megatron_options="${megatron_options} \
--disable-moe-token-dropping"
fi
template_json="ds_config_gpt_TEMPLATE.json"
config_json="ds_config_gpt_${NAME}.json"
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/0/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--pipeline-model-parallel-size ${PP_SIZE}"
# Currently MoE is not compatible with pipeline parallel
if [[ $EP_SIZE -gt 1 ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
## When saving checkpoints to storage with a cache, there could be consistency
## issues with the pointer to the latest checkpoint. Here we find the correct
## pointer and broadcast it to all nodes.
ITERATION_FILE="$CHECKPOINT_PATH/latest_checkpointed_iteration.txt"
ITERATION_FILE_2="$CHECKPOINT_PATH/latest"
ITERATION=0
for (( node = 0; node <= NUM_NODE-1; node++ ))
do
if $(ssh -q worker-"$node" "test -f \"$ITERATION_FILE\""); then
LOCAL_ITERATION=$(ssh -q worker-"$node" cat $ITERATION_FILE)
ITERATION=$(( ${LOCAL_ITERATION} > ${ITERATION} ? ${LOCAL_ITERATION} : ${ITERATION} ))
fi
done
if [[ $ITERATION -gt 0 ]]; then
ITERATION_2="global_step${ITERATION}"
ds_ssh "echo $ITERATION > $ITERATION_FILE"
ds_ssh "echo $ITERATION_2 > $ITERATION_FILE_2"
fi
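## For example, if one worker's latest_checkpointed_iteration.txt reads 20000,
## the loop above picks that maximum and ds_ssh rewrites both pointer files on
## every node, so all nodes resume from the same global_step20000 checkpoint.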
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${OUTPUT_BASEPATH}/log/${NAME}_${host}_${current_time}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,309 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
MODEL_SIZE=0.125
NUM_LAYERS=12
HIDDEN_SIZE=768
NUM_ATTN_HEADS=12
# GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
MIN_LR=6.0e-5
# Curriculum learning (CL) enables stable large-batch training
GLOBAL_BATCH_SIZE=2048 # 8x
LR=2.4e-3 # 4x
## GPT-3 Medium 350M
# MODEL_SIZE=0.35
# NUM_LAYERS=24
# HIDDEN_SIZE=1024
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=3.0e-4
# MIN_LR=3.0e-5
## GPT-3 Large 760M
# MODEL_SIZE=0.76
# NUM_LAYERS=24
# HIDDEN_SIZE=1536
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=2.5e-4
# MIN_LR=2.5e-5
## GPT-3 XL 1.3B
# MODEL_SIZE=1.3
# NUM_LAYERS=24
# HIDDEN_SIZE=2048
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=512
# LR=2.0e-4
# MIN_LR=2.0e-5
## GPT-3 2.7B
# MODEL_SIZE=2.7
# NUM_LAYERS=32
# HIDDEN_SIZE=2560
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=512
# LR=1.6e-4
# MIN_LR=1.6e-5
## GPT-3 6.7B
# MODEL_SIZE=6.7
# NUM_LAYERS=32
# HIDDEN_SIZE=4096
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=1024
# LR=1.2e-4
# MIN_LR=1.2e-5
## GPT-3 13B
# MODEL_SIZE=13
# NUM_LAYERS=40
# HIDDEN_SIZE=5120
# NUM_ATTN_HEADS=40
# GLOBAL_BATCH_SIZE=1024
# LR=1.0e-4
# MIN_LR=1.0e-5
## GPT-3 175B
# MODEL_SIZE=175
# NUM_LAYERS=96
# HIDDEN_SIZE=12288
# NUM_ATTN_HEADS=96
# GLOBAL_BATCH_SIZE=1536
# LR=0.6e-4
# MIN_LR=0.6e-5
###############################################################################
### Training duration configs
## The main termination condition, original GPT-3 paper trains for 300B tokens
TRAIN_TOKENS=300000000000
## TRAIN_SAMPLES is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS
## above, and techniques like curriculum learning use fewer tokens in some samples,
## we just set this config large enough to make sure we have enough
## processed data and don't terminate by TRAIN_SAMPLES.
TRAIN_SAMPLES=$(( ${TRAIN_TOKENS} * 3 / ${SEQ_LEN} ))
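## Worked example with the values above: TRAIN_SAMPLES = 300e9 * 3 / 2048
## = 439,453,125 indexed samples, roughly 3x the ~146M full-length samples
## needed to reach 300B tokens, so TRAIN_TOKENS stays the binding limit.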
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration. This token-based config is preferable since it
## does not need to be readjusted when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
WARMUP_TOKENS=375000000
LR_DECAY_TOKENS=260000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=16
## Model parallelism, 1 is no MP
MP_SIZE=1
## Pipeline parallelism. To disable PP, set PP_SIZE to 1 and NO_PP to true.
PP_SIZE=1
NO_PP="true"
## ZeRO stage
ZERO_STAGE=0
## Total number of GPUs
NUM_GPUS=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2))
NUM_GPUS_PERNODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
NUM_NODE=$(( ${NUM_GPUS} / ${NUM_GPUS_PERNODE} ))
DP_SIZE=$(( ${NUM_GPUS} / ${PP_SIZE} / ${MP_SIZE} ))
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="true"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=72
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_STEP=$(( ${CL_TOKENS} * 1000000000 / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
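## Worked example with the values above: CL_AVG_SEQLEN = (72 + 2048) / 2 = 1060
## and CL_STEP = 60e9 / (2048 * 1060) = 27638 (integer division), i.e. the
## curriculum ramps the sequence length from 72 up to 2048 over about the first
## 27.6K steps (~60B tokens).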
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=1000
## Standard deviation for weight initialization. Usually a larger model needs a
## lower std. We used the heuristic sqrt(1/(3*HIDDEN_SIZE)) from the
## MT-NLG 530B work (https://arxiv.org/pdf/2201.11990.pdf)
INIT_STD=0.02
## Activation checkpointing saves GPU memory, but reduces training speed
ACTIVATION_CHECKPOINT="true"
# ACTIVATION_CHECKPOINT="false"
## Whether or not to log optimizer states (norms, max abs values) to tensorboard.
## This is not required for training and might save GPU memory when turned off.
LOG_OPTIMIZER_STATE="true"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="gpt3-with-pile-${MODEL_SIZE}B-lr-${LR}-minlr-${MIN_LR}-bs-${GLOBAL_BATCH_SIZE}-gpus-${NUM_GPUS}-zero-${ZERO_STAGE}-mp-${MP_SIZE}-pp-${PP_SIZE}"
if [ "${NO_PP}" = "true" ]; then
NAME="${NAME}-no_pp"
fi
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-startseqlen-${CL_START_SEQLEN}-step-${CL_STEP}-token-${CL_TOKENS}B"
fi
LOG_PATH="log/"
TENSORBOARD_PATH="tensorboard/${NAME}_${host}_${current_time}"
CHECKPOINT_PATH="/blob/users/conglli/project/gpt3_with_pile/checkpoint/${NAME}"
mkdir -p ${LOG_PATH}
mkdir -p ${TENSORBOARD_PATH}
mkdir -p ${CHECKPOINT_PATH}
VOCAB_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json
MERGE_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt
# The public Pile dataset can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
# For cluster Azure-EastUS-V100-32GB-4, Lab-RR1-V100
DATA_PATH=/vc_data_blob/users/conglli/the_pile_public_merged_nopreprocessing/pile_text_document
# For cluster Azure-WestUS3-A100
# DATA_PATH=/blob/data/the_pile_public_merged_nopreprocessing/pile_text_document
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_PATH} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers ${NUM_LAYERS} \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-samples ${TRAIN_SAMPLES} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load ${CHECKPOINT_PATH} \
--save ${CHECKPOINT_PATH} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_PATH}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [ "${LOG_OPTIMIZER_STATE}" = "true" ]; then
megatron_options="${megatron_options} \
--log-optimizer-states-to-tensorboard"
fi
template_json="ds_config_gpt_TEMPLATE.json"
config_json="ds_config_${NAME}.json"
if [[ $ZERO_STAGE -gt 0 ]]; then
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/${ZERO_STAGE}/" \
| sed "s/PRESCALE_GRAD/false/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
else
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/${ZERO_STAGE}/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
fi
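## Both branches fill the same placeholders in the DeepSpeed JSON template
## (batch sizes, log interval, ZeRO stage, fp16/bf16 flags, and the curriculum
## learning min/max seqlen and duration); the only difference is PRESCALE_GRAD,
## which is set to false when ZeRO is enabled (stage > 0) and true for the
## ZeRO-stage-0 baseline.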
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${ZERO_STAGE} \
--pipeline-model-parallel-size ${PP_SIZE}"
if [[ "${NO_PP}" = "true" ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
## When saving checkpoints to storage with a cache, there could be a consistency
## issue with the pointer to the latest checkpoint. Here we find the correct
## pointer and broadcast it to all nodes.
ITERATION_FILE="$CHECKPOINT_PATH/latest_checkpointed_iteration.txt"
ITERATION_FILE_2="$CHECKPOINT_PATH/latest"
ITERATION=0
for (( node = 0; node <= NUM_NODE-1; node++ ))
do
if $(ssh -q worker-"$node" "test -f \"$ITERATION_FILE\""); then
LOCAL_ITERATION=$(ssh -q worker-"$node" cat $ITERATION_FILE)
ITERATION=$(( ${LOCAL_ITERATION} > ${ITERATION} ? ${LOCAL_ITERATION} : ${ITERATION} ))
fi
done
if [[ $ITERATION -gt 0 ]]; then
ITERATION_2="global_step${ITERATION}"
ds_ssh "echo $ITERATION > $ITERATION_FILE"
ds_ssh "echo $ITERATION_2 > $ITERATION_FILE_2"
fi
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${LOG_PATH}/${NAME}_${host}_${current_time}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,349 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
# MODEL_SIZE=0.125
# NUM_LAYERS=12
# HIDDEN_SIZE=768
# NUM_ATTN_HEADS=12
# GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
# MIN_LR=6.0e-5
## GPT-3 Medium 350M
MODEL_SIZE=0.35
NUM_LAYERS=24
HIDDEN_SIZE=1024
NUM_ATTN_HEADS=16
GLOBAL_BATCH_SIZE=256
# LR=3.0e-4
# MIN_LR=3.0e-5
## GPT-3 Large 760M
# MODEL_SIZE=0.76
# NUM_LAYERS=24
# HIDDEN_SIZE=1536
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=2.5e-4
# MIN_LR=2.5e-5
## GPT-3 XL 1.3B
# MODEL_SIZE=1.3
# NUM_LAYERS=24
# HIDDEN_SIZE=2048
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=512
# LR=2.0e-4
# MIN_LR=2.0e-5
## GPT-3 2.7B
# MODEL_SIZE=2.7
# NUM_LAYERS=32
# HIDDEN_SIZE=2560
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=512
# LR=1.6e-4
# MIN_LR=1.6e-5
## GPT-3 6.7B
# MODEL_SIZE=6.7
# NUM_LAYERS=32
# HIDDEN_SIZE=4096
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=1024
# LR=1.2e-4
# MIN_LR=1.2e-5
## GPT-3 13B
# MODEL_SIZE=13
# NUM_LAYERS=40
# HIDDEN_SIZE=5120
# NUM_ATTN_HEADS=40
# GLOBAL_BATCH_SIZE=1024
# LR=1.0e-4
# MIN_LR=1.0e-5
## GPT-3 175B
# MODEL_SIZE=175
# NUM_LAYERS=96
# HIDDEN_SIZE=12288
# NUM_ATTN_HEADS=96
# GLOBAL_BATCH_SIZE=1536
# LR=0.6e-4
# MIN_LR=0.6e-5
###############################################################################
### Training duration configs
## The main termination condition; the original GPT-3 paper trains for 300B tokens.
## For MoE models, we found that sometimes training a bit longer, to 330B tokens, helps.
TRAIN_TOKENS=300000000000
# TRAIN_TOKENS=330000000000
## TRAIN_ITERS is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS above,
## and techniques like curriculum learning use fewer tokens in some steps,
## we set this config large enough to make sure we have enough processed data
## and don't terminate by TRAIN_ITERS.
TRAIN_ITERS=$(( ${TRAIN_TOKENS} * 3 / ${GLOBAL_BATCH_SIZE} / ${SEQ_LEN} ))
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration. This token-based config is preferable since it
## does not need to be readjusted when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
## For MoE models, we found that setting the decay tokens to 300B helps.
WARMUP_TOKENS=375000000
# LR_DECAY_TOKENS=260000000000
LR_DECAY_TOKENS=300000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=4
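## Sanity check for the constraint above (using MP_SIZE=1, PP_SIZE=1 and
## NUM_GPUS=64 set below): GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
## = 256*1*1/64 = 4, so BATCH_SIZE=4 is the largest valid micro batch size,
## which implies a single gradient accumulation step per iteration (global
## batch = micro batch * data-parallel size in this case).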
## Model parallelism, 1 is no MP
## Currently MoE models have divergence issue when MP > 1.
MP_SIZE=1
## Pipeline parallelism
## Currently we don't support PP for MoE. To disable PP, set PP_SIZE
## to 1 and use the "--no-pipeline-parallel" arg.
PP_SIZE=1
NUM_GPUS=64
###############################################################################
### MoE configs
## Number of experts. EP_SIZE 1 means dense model without MoE
# EP_SIZE=1
EP_SIZE=128
if [[ $EP_SIZE -gt $NUM_GPUS ]]; then
EP_PARALLEL_SIZE=$NUM_GPUS
else
EP_PARALLEL_SIZE=$EP_SIZE
fi
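## In other words, EP_PARALLEL_SIZE = min(EP_SIZE, NUM_GPUS). With EP_SIZE=128
## and NUM_GPUS=64 this gives 64 expert-parallel ranks, so each rank would host
## 128/64 = 2 experts (assuming experts are split evenly across the
## expert-parallel ranks).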
## The original GPT-3 models always set the min LR at 10% of the max LR. For MoE
## models, we found that a lower LR and min LR (than the base dense model) help.
## For 1.3B MoE-128 model we used LR=1.2e-4 and MIN_LR=1.0e-6.
## For 350M MoE-128 model we used LR=2.0e-4 and MIN_LR=2.0e-6, but they are not
## heavily tuned.
LR=2.0e-4
MIN_LR=2e-06
## Coefficient for MoE loss. We find that 0.01 is a good value at least for
## 1.3B MoE-128 model
MLC=0.01
## The configs below adjust the MoE expert token capacity limit during training
## and eval. To completely disable the capacity limit, set MOE_DROP_TOKEN to false.
## A larger capacity factor or disabling the capacity limit could improve training
## convergence, but will also reduce training throughput.
MOE_TRAIN_CAP_FACTOR=1.0
MOE_EVAL_CAP_FACTOR=1.0
MOE_MIN_CAP=4
MOE_DROP_TOKEN="true"
# MOE_DROP_TOKEN="false"
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="false"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=80
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_TOKENS=$((${CL_TOKENS} * 1000000000))
CL_STEP=$(( ${CL_TOKENS} / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=10000
## Standard deviation for weight initialization
## We used 0.014 for 350M/1.3B dense/MoE models, and used 0.01 for 6.7B
## dense model. Usually a larger model needs a lower std.
INIT_STD=0.014
# INIT_STD=0.01
## Activation checkpointing saves GPU memory, but reduces training speed
ACTIVATION_CHECKPOINT="true"
# ACTIVATION_CHECKPOINT="false"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="gpt-${MODEL_SIZE}B-lr-${LR}-minlr-${MIN_LR}-bs-${GLOBAL_BATCH_SIZE}-gpus-${NUM_GPUS}-mp-${MP_SIZE}-pp-${PP_SIZE}"
if [[ $EP_SIZE -gt 1 ]]; then
NAME="${NAME}-ep-${EP_SIZE}-mlc-${MLC}-cap-${MOE_TRAIN_CAP_FACTOR}-drop-${MOE_DROP_TOKEN}"
fi
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-${CL_START_SEQLEN}-${CL_STEP}"
fi
OUTPUT_BASEPATH=$DIR/output
mkdir -p "${OUTPUT_BASEPATH}/tensorboard/"
mkdir -p "${OUTPUT_BASEPATH}/checkpoint/"
mkdir -p "${OUTPUT_BASEPATH}/log/"
TENSORBOARD_DIR="${OUTPUT_BASEPATH}/tensorboard/${NAME}_${host}_${current_time}"
mkdir -p ${TENSORBOARD_DIR}
## Note that for an MoE model with a billion-scale base model, the checkpoint can
## be TB-scale, which a normal NFS cannot handle efficiently.
CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/${NAME}"
# USE_INTERNAL_DATA="true"
USE_INTERNAL_DATA="false"
if [ "${USE_INTERNAL_DATA}" = "true" ]; then
## The internal data is only accessible within Microsoft
## For cluster Azure-EastUS-V100-32GB-4, Azure-WestUS3-A100
# BASE_DATA_PATH=/vc_data/Megatron-LM/data
# DATA_HOME="/vc_data/pile-cc1-cc2-shuf"
## For cluster Lab-RR1-V100
BASE_DATA_PATH=/data/Megatron-LM/data
DATA_HOME="/turing-ssd/users/conglli/data/pile-cc1-cc2-shuf"
## For cluster Azure-CentralUS-A100
# BASE_DATA_PATH=/data/Megatron-LM/data
# DATA_HOME=/vc_data_1/users/amawa/blended
VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json
MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt
ARX="${DATA_HOME}/ArXiv_ftfy_cleaned_id_shuf_text_document"
BC2="${DATA_HOME}/BookCorpus2_ftfy_cleaned_id_shuf_text_document"
B3="${DATA_HOME}/Books3_ftfy_cleaned_id_shuf_text_document"
CC2020="${DATA_HOME}/CC-2020-50_id_cleaned_shuf_text_document"
CC2021="${DATA_HOME}/CC-2021-04_id_cleaned_shuf_text_document"
GIT="${DATA_HOME}/Github_ftfy_id_shuf_text_document"
GUT="${DATA_HOME}/Gutenberg_PG-19_ftfy_cleaned_id_cleaned_shuf_text_document"
NIH="${DATA_HOME}/NIH_ExPorter_ftfy_id_shuf_text_document"
OWT2="${DATA_HOME}/OpenWebText2_ftfy_cleaned_id_shuf_text_document"
PCC="${DATA_HOME}/Pile-CC_id_cleaned_shuf_text_document"
PM="${DATA_HOME}/PubMed_Abstracts_ftfy_id_shuf_text_document"
RN="${DATA_HOME}/rn_dedup_shuf_cleaned_0.7_cleaned_shuf_text_document"
SE="${DATA_HOME}/StackExchange_ftfy_id_shuf_text_document"
ST="${DATA_HOME}/stories_dedup0.7_shuf_cleaned_shuf_text_document"
WIK="${DATA_HOME}/Wikipedia_en_ftfy_id_shuf_text_document"
DATA_BLEND="0.14336 ${B3} 0.08962 ${RN} 0.19336 ${OWT2} 0.05689 ${SE} \
0.00859 ${ST} 0.02897 ${PM} 0.04771 ${WIK} 0.00873 ${GUT} 0.01007 ${BC2} \
0.00208 ${NIH} 0.13017 ${CC2020} 0.09446 ${PCC} 0.15652 ${CC2021} \
0.01359 ${ARX} 0.01588 ${GIT}"
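## DATA_BLEND is a list of weight/path pairs consumed by --data-path below; the
## weights above sum to 1.0, so each number is the sampling proportion of the
## corresponding Pile subset.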
else
VOCAB_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json
MERGE_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt
# The public Pile dataset can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
DATA_BLEND=/data/the_pile_public_merged_nopreprocessing/pile_text_document
fi
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_BLEND} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--moe-expert-parallel-size ${EP_PARALLEL_SIZE} \
--num-experts ${EP_SIZE} \
--moe-loss-coeff ${MLC} \
--moe-train-capacity-factor ${MOE_TRAIN_CAP_FACTOR} \
--moe-eval-capacity-factor ${MOE_EVAL_CAP_FACTOR} \
--moe-min-capacity ${MOE_MIN_CAP} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers ${NUM_LAYERS} \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-iters ${TRAIN_ITERS} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load ${CHECKPOINT_PATH} \
--save ${CHECKPOINT_PATH} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_DIR}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [[ $EP_SIZE -gt 1 ]]; then
megatron_options="${megatron_options} \
--create-moe-param-group"
fi
if [ "${MOE_DROP_TOKEN}" = "false" ]; then
megatron_options="${megatron_options} \
--disable-moe-token-dropping"
fi
template_json="ds_config_gpt_TEMPLATE.json"
config_json="ds_config_gpt_${NAME}.json"
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/0/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--pipeline-model-parallel-size ${PP_SIZE}"
# Currently MoE is not compatible with pipeline parallel
if [[ $EP_SIZE -gt 1 ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${OUTPUT_BASEPATH}/log/${NAME}_${host}_${current_time}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,342 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
# MODEL_SIZE=0.125
# NUM_LAYERS=12
# HIDDEN_SIZE=768
# NUM_ATTN_HEADS=12
# GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
# MIN_LR=6.0e-5
## GPT-3 Medium 350M
MODEL_SIZE=0.35
NUM_LAYERS=24
HIDDEN_SIZE=1024
NUM_ATTN_HEADS=16
GLOBAL_BATCH_SIZE=256
# LR=3.0e-4
# MIN_LR=3.0e-5
## GPT-3 Large 760M
# MODEL_SIZE=0.76
# NUM_LAYERS=24
# HIDDEN_SIZE=1536
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=2.5e-4
# MIN_LR=2.5e-5
## GPT-3 XL 1.3B
# MODEL_SIZE=1.3
# NUM_LAYERS=24
# HIDDEN_SIZE=2048
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=512
# LR=2.0e-4
# MIN_LR=2.0e-5
## GPT-3 2.7B
# MODEL_SIZE=2.7
# NUM_LAYERS=32
# HIDDEN_SIZE=2560
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=512
# LR=1.6e-4
# MIN_LR=1.6e-5
## GPT-3 6.7B
# MODEL_SIZE=6.7
# NUM_LAYERS=32
# HIDDEN_SIZE=4096
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=1024
# LR=1.2e-4
# MIN_LR=1.2e-5
## GPT-3 13B
# MODEL_SIZE=13
# NUM_LAYERS=40
# HIDDEN_SIZE=5120
# NUM_ATTN_HEADS=40
# GLOBAL_BATCH_SIZE=1024
# LR=1.0e-4
# MIN_LR=1.0e-5
## GPT-3 175B
# MODEL_SIZE=175
# NUM_LAYERS=96
# HIDDEN_SIZE=12288
# NUM_ATTN_HEADS=96
# GLOBAL_BATCH_SIZE=1536
# LR=0.6e-4
# MIN_LR=0.6e-5
###############################################################################
### Training duration configs
## The main termination condition; the original GPT-3 paper trains for 300B tokens.
## For MoE models, we found that sometimes training a bit longer, to 330B tokens, helps.
TRAIN_TOKENS=300000000000
# TRAIN_TOKENS=330000000000
## TRAIN_ITERS is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS above,
## and techniques like curriculum learning use fewer tokens in some steps,
## we set this config large enough to make sure we have enough processed data
## and don't terminate by TRAIN_ITERS.
TRAIN_ITERS=$(( ${TRAIN_TOKENS} * 3 / ${GLOBAL_BATCH_SIZE} / ${SEQ_LEN} ))
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration. This token-based config is preferable since it
## does not need to be readjusted when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
## For MoE models, we found that setting the decay tokens to 300B helps.
WARMUP_TOKENS=375000000
# LR_DECAY_TOKENS=260000000000
LR_DECAY_TOKENS=300000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=4
## Model parallelism, 1 is no MP
## Currently MoE models have divergence issue when MP > 1.
MP_SIZE=1
## Pipeline parallelism
## Currently we don't support PP for MoE. To disable PP, set PP_SIZE
## to 1 and use the "--no-pipeline-parallel" arg.
PP_SIZE=1
NUM_GPUS=64
###############################################################################
### MoE configs
## Number of experts. EP_SIZE 128 means standard MoE
# EP_SIZE=128
EP_SIZE="32 32 32 32 32 32 32 32 32 32 64 64"
EP_PARALLEL_SIZE=$NUM_GPUS
## The original GPT-3 models always set the min LR at 10% of the max LR. For MoE
## models, we found that a lower LR and min LR (than the base dense model) help.
## For 1.3B PR-MoE-64/128 model we used LR=1.2e-4 and MIN_LR=1.0e-6.
## For 350M PR-MoE-32/64 model we used LR=3.0e-4 and MIN_LR=1.0e-6, but they are not
## heavily tuned.
LR=3.0e-4
MIN_LR=1.0e-06
## Coefficient for MoE loss. We find that 0.01 is a good value at least for
## 1.3B MoE-128 model
MLC=0.01
## The configs below adjust the MoE expert token capacity limit during training
## and eval. To completely disable the capacity limit, set MOE_DROP_TOKEN to false.
## A larger capacity factor or disabling the capacity limit could improve training
## convergence, but will also reduce training throughput.
MOE_TRAIN_CAP_FACTOR=1.0
MOE_EVAL_CAP_FACTOR=1.0
MOE_MIN_CAP=4
MOE_DROP_TOKEN="true"
# MOE_DROP_TOKEN="false"
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="false"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=80
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_TOKENS=$((${CL_TOKENS} * 1000000000))
CL_STEP=$(( ${CL_TOKENS} / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=10000
## Standard deviation for weight initialization
## We used 0.014 for 350M/1.3B dense/MoE models, and used 0.01 for 6.7B
## dense model. Usually a larger model needs a lower std.
INIT_STD=0.014
# INIT_STD=0.01
## Activation checkpointing saves GPU memory, but reduces training speed
ACTIVATION_CHECKPOINT="true"
# ACTIVATION_CHECKPOINT="false"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="gpt-${MODEL_SIZE}B-lr-${LR}-minlr-${MIN_LR}-bs-${GLOBAL_BATCH_SIZE}-gpus-${NUM_GPUS}-mp-${MP_SIZE}-pp-${PP_SIZE}"
NAME="${NAME}-ep-pyramid-32+64-mlc-${MLC}-cap-${MOE_TRAIN_CAP_FACTOR}-drop-${MOE_DROP_TOKEN}"
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-${CL_START_SEQLEN}-${CL_STEP}"
fi
OUTPUT_BASEPATH=$DIR/output
mkdir -p "${OUTPUT_BASEPATH}/tensorboard/"
mkdir -p "${OUTPUT_BASEPATH}/checkpoint/"
mkdir -p "${OUTPUT_BASEPATH}/log/"
TENSORBOARD_DIR="${OUTPUT_BASEPATH}/tensorboard/${NAME}_${host}_${current_time}"
mkdir -p ${TENSORBOARD_DIR}
## Note that for an MoE model with a billion-scale base model, the checkpoint can
## be TB-scale, which a normal NFS cannot handle efficiently.
CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/${NAME}"
# USE_INTERNAL_DATA="true"
USE_INTERNAL_DATA="false"
if [ "${USE_INTERNAL_DATA}" = "true" ]; then
## The internal data is only accessible within Microsoft
## For cluster Azure-EastUS-V100-32GB-4, Azure-WestUS3-A100
BASE_DATA_PATH=/vc_data/Megatron-LM/data
DATA_HOME="/vc_data/pile-cc1-cc2-shuf"
## For cluster Lab-RR1-V100
# BASE_DATA_PATH=/data/Megatron-LM/data
# DATA_HOME="/turing-ssd/users/conglli/data/pile-cc1-cc2-shuf"
## For cluster Azure-CentralUS-A100
# BASE_DATA_PATH=/data/Megatron-LM/data
# DATA_HOME=/vc_data_1/users/amawa/blended
VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json
MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt
ARX="${DATA_HOME}/ArXiv_ftfy_cleaned_id_shuf_text_document"
BC2="${DATA_HOME}/BookCorpus2_ftfy_cleaned_id_shuf_text_document"
B3="${DATA_HOME}/Books3_ftfy_cleaned_id_shuf_text_document"
CC2020="${DATA_HOME}/CC-2020-50_id_cleaned_shuf_text_document"
CC2021="${DATA_HOME}/CC-2021-04_id_cleaned_shuf_text_document"
GIT="${DATA_HOME}/Github_ftfy_id_shuf_text_document"
GUT="${DATA_HOME}/Gutenberg_PG-19_ftfy_cleaned_id_cleaned_shuf_text_document"
NIH="${DATA_HOME}/NIH_ExPorter_ftfy_id_shuf_text_document"
OWT2="${DATA_HOME}/OpenWebText2_ftfy_cleaned_id_shuf_text_document"
PCC="${DATA_HOME}/Pile-CC_id_cleaned_shuf_text_document"
PM="${DATA_HOME}/PubMed_Abstracts_ftfy_id_shuf_text_document"
RN="${DATA_HOME}/rn_dedup_shuf_cleaned_0.7_cleaned_shuf_text_document"
SE="${DATA_HOME}/StackExchange_ftfy_id_shuf_text_document"
ST="${DATA_HOME}/stories_dedup0.7_shuf_cleaned_shuf_text_document"
WIK="${DATA_HOME}/Wikipedia_en_ftfy_id_shuf_text_document"
DATA_BLEND="0.14336 ${B3} 0.08962 ${RN} 0.19336 ${OWT2} 0.05689 ${SE} \
0.00859 ${ST} 0.02897 ${PM} 0.04771 ${WIK} 0.00873 ${GUT} 0.01007 ${BC2} \
0.00208 ${NIH} 0.13017 ${CC2020} 0.09446 ${PCC} 0.15652 ${CC2021} \
0.01359 ${ARX} 0.01588 ${GIT}"
else
VOCAB_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json
MERGE_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt
# The public Pile dataset can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
DATA_BLEND=/data/the_pile_public_merged_nopreprocessing/pile_text_document
fi
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_BLEND} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--moe-expert-parallel-size ${EP_PARALLEL_SIZE} \
--num-experts ${EP_SIZE} \
--moe-loss-coeff ${MLC} \
--mlp-type residual \
--moe-train-capacity-factor ${MOE_TRAIN_CAP_FACTOR} \
--moe-eval-capacity-factor ${MOE_EVAL_CAP_FACTOR} \
--moe-min-capacity ${MOE_MIN_CAP} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers ${NUM_LAYERS} \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-iters ${TRAIN_ITERS} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load ${CHECKPOINT_PATH} \
--save ${CHECKPOINT_PATH} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_DIR}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
megatron_options="${megatron_options} \
--create-moe-param-group"
if [ "${MOE_DROP_TOKEN}" = "false" ]; then
megatron_options="${megatron_options} \
--disable-moe-token-dropping"
fi
template_json="ds_config_gpt_TEMPLATE.json"
config_json="ds_config_gpt_${NAME}.json"
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/0/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--pipeline-model-parallel-size ${PP_SIZE}"
# Currently MoE is not compatible with pipeline parallel
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${OUTPUT_BASEPATH}/log/${NAME}_${host}_${current_time}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,354 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
# MODEL_SIZE=0.125
# NUM_LAYERS=12
# HIDDEN_SIZE=768
# NUM_ATTN_HEADS=12
# GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
# MIN_LR=6.0e-5
## GPT-3 Medium 350M
MODEL_SIZE=0.35
NUM_LAYERS=24
HIDDEN_SIZE=1024
NUM_ATTN_HEADS=16
GLOBAL_BATCH_SIZE=256
# LR=3.0e-4
# MIN_LR=3.0e-5
## GPT-3 Large 760M
# MODEL_SIZE=0.76
# NUM_LAYERS=24
# HIDDEN_SIZE=1536
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=2.5e-4
# MIN_LR=2.5e-5
## GPT-3 XL 1.3B
# MODEL_SIZE=1.3
# NUM_LAYERS=24
# HIDDEN_SIZE=2048
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=512
# LR=2.0e-4
# MIN_LR=2.0e-5
## GPT-3 2.7B
# MODEL_SIZE=2.7
# NUM_LAYERS=32
# HIDDEN_SIZE=2560
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=512
# LR=1.6e-4
# MIN_LR=1.6e-5
## GPT-3 6.7B
# MODEL_SIZE=6.7
# NUM_LAYERS=32
# HIDDEN_SIZE=4096
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=1024
# LR=1.2e-4
# MIN_LR=1.2e-5
## GPT-3 13B
# MODEL_SIZE=13
# NUM_LAYERS=40
# HIDDEN_SIZE=5120
# NUM_ATTN_HEADS=40
# GLOBAL_BATCH_SIZE=1024
# LR=1.0e-4
# MIN_LR=1.0e-5
## GPT-3 175B
# MODEL_SIZE=175
# NUM_LAYERS=96
# HIDDEN_SIZE=12288
# NUM_ATTN_HEADS=96
# GLOBAL_BATCH_SIZE=1536
# LR=0.6e-4
# MIN_LR=0.6e-5
###############################################################################
### Training duration configs
## The main termination condition; the original GPT-3 paper trains for 300B tokens.
## For MoE models, we found that sometimes training a bit longer, to 330B tokens, helps.
TRAIN_TOKENS=300000000000
# TRAIN_TOKENS=330000000000
## TRAIN_ITERS is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS above,
## and techniques like curriculum learning use fewer tokens in some steps,
## we set this config large enough to make sure we have enough processed data
## and don't terminate by TRAIN_ITERS.
TRAIN_ITERS=$(( ${TRAIN_TOKENS} * 3 / ${GLOBAL_BATCH_SIZE} / ${SEQ_LEN} ))
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration. This token-based config is preferable since it
## does not need to be readjusted when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
## For MoE models, we found that setting the decay tokens to 300B helps.
WARMUP_TOKENS=375000000
# LR_DECAY_TOKENS=260000000000
LR_DECAY_TOKENS=300000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=4
## Model parallelism, 1 is no MP
## Currently MoE models have divergence issue when MP > 1.
MP_SIZE=1
## Pipeline parallelism
## Currently we don't support PP for MoE. To disable PP, set PP_SIZE
## to 1 and use the "--no-pipeline-parallel" arg.
PP_SIZE=1
NUM_GPUS=64
###############################################################################
### MoE configs
## Number of experts. EP_SIZE 128 means standard MoE
# EP_SIZE=128
EP_SIZE="32 32 32 32 32 32 32 32 64 64"
EP_SIZE_TEACHER="32 32 32 32 32 32 32 32 32 32 64 64"
EP_PARALLEL_SIZE=$NUM_GPUS
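## The student PR-MoE here uses 10 expert-count entries (eight 32s, two 64s)
## while the teacher (EP_SIZE_TEACHER) keeps all 12, consistent with the
## shallower student below (--num-layers 21 vs. --num-layers-teacher set to
## NUM_LAYERS=24) used for Mixture-of-Students (MoS) distillation.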
## The original GPT-3 models always set the min LR at 10% of the max LR. For MoE
## models, we found that a lower LR and min LR (than the base dense model) help.
## For 1.3B PR-MoE-64/128 model we used LR=1.2e-4 and MIN_LR=1.0e-6.
## For 350M PR-MoE-32/64 model we used LR=3.0e-4 and MIN_LR=1.0e-6, but they are not
## heavily tuned.
LR=3.0e-4
MIN_LR=1.0e-06
## Coefficient for MoE loss. We find that 0.01 is a good value at least for
## 1.3B MoE-128 model
MLC=0.01
## The configs below adjust the MoE expert token capacity limit during training
## and eval. To completely disable the capacity limit, set MOE_DROP_TOKEN to false.
## A larger capacity factor or disabling the capacity limit could improve training
## convergence, but will also reduce training throughput.
MOE_TRAIN_CAP_FACTOR=1.0
MOE_EVAL_CAP_FACTOR=1.0
MOE_MIN_CAP=4
MOE_DROP_TOKEN="true"
# MOE_DROP_TOKEN="false"
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="false"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=80
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_TOKENS=$((${CL_TOKENS} * 1000000000))
CL_STEP=$(( ${CL_TOKENS} / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=10000
## Standard deviation for weight initialization
## We used 0.014 for 350M/1.3B dense/MoE models, and used 0.01 for 6.7B
## dense model. Usually a larger model needs a lower std.
INIT_STD=0.014
# INIT_STD=0.01
## Activation checkpointing saves GPU memory, but reduces training speed
ACTIVATION_CHECKPOINT="true"
# ACTIVATION_CHECKPOINT="false"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="gpt-${MODEL_SIZE}B-lr-${LR}-minlr-${MIN_LR}-bs-${GLOBAL_BATCH_SIZE}-gpus-${NUM_GPUS}-mp-${MP_SIZE}-pp-${PP_SIZE}"
NAME="${NAME}-ep-pyramid-32+64-mos-mlc-${MLC}-cap-${MOE_TRAIN_CAP_FACTOR}-drop-${MOE_DROP_TOKEN}"
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-${CL_START_SEQLEN}-${CL_STEP}"
fi
OUTPUT_BASEPATH=$DIR/output
mkdir -p "${OUTPUT_BASEPATH}/tensorboard/"
mkdir -p "${OUTPUT_BASEPATH}/checkpoint/"
mkdir -p "${OUTPUT_BASEPATH}/log/"
TENSORBOARD_DIR="${OUTPUT_BASEPATH}/tensorboard/${NAME}_${host}_${current_time}"
mkdir -p ${TENSORBOARD_DIR}
## Note that for an MoE model with a billion-scale base model, the checkpoint can
## be TB-scale, which a normal NFS cannot handle efficiently.
CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/${NAME}"
### Mixture-of-Students (MoS) configs
KD_BETA_CE=1
CHECKPOINT_PATH_STUDENT="${OUTPUT_BASEPATH}/checkpoint/${NAME}"
CHECKPOINT_PATH_TEACHER="${OUTPUT_BASEPATH}/checkpoint/gpt-1.3B-lr-1.2e-4-minlr-1.0e-6-bs-512-gpus-128-mp-1-pp-1-ep-pyramid-64+128-mlc-0.01-cap-1.0-drop-true/"
CHECKPOINT_PATH_SAVE="${OUTPUT_BASEPATH}/checkpoint/${NAME}"
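## MoS knowledge distillation setup: the student loads from and saves to its
## own checkpoint paths, while the pre-trained teacher checkpoint is loaded
## separately via --load-teacher below; KD_BETA_CE presumably weights the
## distillation (teacher cross-entropy) term against the standard LM loss.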
USE_INTERNAL_DATA="true"
# USE_INTERNAL_DATA="false"
if [ "${USE_INTERNAL_DATA}" = "true" ]; then
## The internal data is only accessible within Microsoft
## For cluster Azure-EastUS-V100-32GB-4, Azure-WestUS3-A100
BASE_DATA_PATH=/vc_data/Megatron-LM/data
DATA_HOME="/vc_data/pile-cc1-cc2-shuf"
## For cluster Lab-RR1-V100
# BASE_DATA_PATH=/data/Megatron-LM/data
# DATA_HOME="/turing-ssd/users/conglli/data/pile-cc1-cc2-shuf"
## For cluster Azure-CentralUS-A100
# BASE_DATA_PATH=/data/Megatron-LM/data
# DATA_HOME=/vc_data_1/users/amawa/blended
VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json
MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt
ARX="${DATA_HOME}/ArXiv_ftfy_cleaned_id_shuf_text_document"
BC2="${DATA_HOME}/BookCorpus2_ftfy_cleaned_id_shuf_text_document"
B3="${DATA_HOME}/Books3_ftfy_cleaned_id_shuf_text_document"
CC2020="${DATA_HOME}/CC-2020-50_id_cleaned_shuf_text_document"
CC2021="${DATA_HOME}/CC-2021-04_id_cleaned_shuf_text_document"
GIT="${DATA_HOME}/Github_ftfy_id_shuf_text_document"
GUT="${DATA_HOME}/Gutenberg_PG-19_ftfy_cleaned_id_cleaned_shuf_text_document"
NIH="${DATA_HOME}/NIH_ExPorter_ftfy_id_shuf_text_document"
OWT2="${DATA_HOME}/OpenWebText2_ftfy_cleaned_id_shuf_text_document"
PCC="${DATA_HOME}/Pile-CC_id_cleaned_shuf_text_document"
PM="${DATA_HOME}/PubMed_Abstracts_ftfy_id_shuf_text_document"
RN="${DATA_HOME}/rn_dedup_shuf_cleaned_0.7_cleaned_shuf_text_document"
SE="${DATA_HOME}/StackExchange_ftfy_id_shuf_text_document"
ST="${DATA_HOME}/stories_dedup0.7_shuf_cleaned_shuf_text_document"
WIK="${DATA_HOME}/Wikipedia_en_ftfy_id_shuf_text_document"
DATA_BLEND="0.14336 ${B3} 0.08962 ${RN} 0.19336 ${OWT2} 0.05689 ${SE} \
0.00859 ${ST} 0.02897 ${PM} 0.04771 ${WIK} 0.00873 ${GUT} 0.01007 ${BC2} \
0.00208 ${NIH} 0.13017 ${CC2020} 0.09446 ${PCC} 0.15652 ${CC2021} \
0.01359 ${ARX} 0.01588 ${GIT}"
else
## Placeholder; we plan to test a public dataset here.
VOCAB_PATH=""
MERGE_PATH=""
DATA_BLEND=""
fi
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_BLEND} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--moe-expert-parallel-size ${EP_PARALLEL_SIZE} \
--num-experts ${EP_SIZE} \
--moe-loss-coeff ${MLC} \
--mlp-type residual \
--moe-train-capacity-factor ${MOE_TRAIN_CAP_FACTOR} \
--moe-eval-capacity-factor ${MOE_EVAL_CAP_FACTOR} \
--moe-min-capacity ${MOE_MIN_CAP} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers 21 \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-iters ${TRAIN_ITERS} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load ${CHECKPOINT_PATH_STUDENT} \
--save ${CHECKPOINT_PATH_SAVE} \
--mos \
--kd-beta-ce ${KD_BETA_CE} \
--num-layers-teacher ${NUM_LAYERS} \
--num-experts-teacher ${EP_SIZE_TEACHER} \
--hidden-size-teacher ${HIDDEN_SIZE} \
--num-attention-heads-teacher ${NUM_ATTN_HEADS} \
--load-teacher ${CHECKPOINT_PATH_TEACHER} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_DIR}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
megatron_options="${megatron_options} \
--create-moe-param-group"
if [ "${MOE_DROP_TOKEN}" = "false" ]; then
megatron_options="${megatron_options} \
--disable-moe-token-dropping"
fi
template_json="ds_config_gpt_TEMPLATE.json"
config_json="ds_config_gpt_${NAME}.json"
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--pipeline-model-parallel-size ${PP_SIZE}"
# Currently MoE is not compatible with pipeline parallel
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${OUTPUT_BASEPATH}/log/${NAME}_${host}_${current_time}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,349 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
# MODEL_SIZE=0.125
# NUM_LAYERS=12
# HIDDEN_SIZE=768
# NUM_ATTN_HEADS=12
# GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
# MIN_LR=6.0e-5
## GPT-3 Medium 350M
MODEL_SIZE=0.35
NUM_LAYERS=24
HIDDEN_SIZE=1024
NUM_ATTN_HEADS=16
GLOBAL_BATCH_SIZE=256
LR=3.0e-4
MIN_LR=3.0e-5
## GPT-3 Large 760M
# MODEL_SIZE=0.76
# NUM_LAYERS=24
# HIDDEN_SIZE=1536
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=2.5e-4
# MIN_LR=2.5e-5
## GPT-3 XL 1.3B
# MODEL_SIZE=1.3
# NUM_LAYERS=24
# HIDDEN_SIZE=2048
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=512
# LR=2.0e-4
# MIN_LR=2.0e-5
## GPT-3 2.7B
# MODEL_SIZE=2.7
# NUM_LAYERS=32
# HIDDEN_SIZE=2560
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=512
# LR=1.6e-4
# MIN_LR=1.6e-5
## GPT-3 6.7B
# MODEL_SIZE=6.7
# NUM_LAYERS=32
# HIDDEN_SIZE=4096
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=1024
# LR=1.2e-4
# MIN_LR=1.2e-5
## GPT-3 13B
# MODEL_SIZE=13
# NUM_LAYERS=40
# HIDDEN_SIZE=5120
# NUM_ATTN_HEADS=40
# GLOBAL_BATCH_SIZE=1024
# LR=1.0e-4
# MIN_LR=1.0e-5
## GPT-3 175B
# MODEL_SIZE=175
# NUM_LAYERS=96
# HIDDEN_SIZE=12288
# NUM_ATTN_HEADS=96
# GLOBAL_BATCH_SIZE=1536
# LR=0.6e-4
# MIN_LR=0.6e-5
###############################################################################
### Training duration configs
## The main termination condition; the original GPT-3 paper trains for 300B tokens.
## For MoE models, we found that sometimes training a bit longer, to 330B tokens, helps.
TRAIN_TOKENS=300000000000
# TRAIN_TOKENS=330000000000
## TRAIN_SAMPLES is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS above,
## and techniques like curriculum learning use fewer tokens in some samples,
## we set this config large enough to make sure we have enough processed data
## and don't terminate by TRAIN_SAMPLES.
TRAIN_SAMPLES=$(( ${TRAIN_TOKENS} * 3 / ${SEQ_LEN} ))
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration. This token-based config is preferable since it
## does not need to be readjusted when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
## For MoE models, we found that setting the decay tokens to 300B helps.
WARMUP_TOKENS=375000000
LR_DECAY_TOKENS=260000000000
# LR_DECAY_TOKENS=300000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=4
## Model parallelism, 1 is no MP
## Currently MoE models have divergence issue when MP > 1.
MP_SIZE=1
## Pipeline parallelism
## Currently we don't support PP for MoE. To disable PP, set PP_SIZE
## to 1 and use the "--no-pipeline-parallel" arg.
PP_SIZE=1
NUM_GPUS=64
###############################################################################
### MoE configs
## Number of experts. EP_SIZE 1 means dense model without MoE
EP_SIZE=1
# EP_SIZE=128
if [[ $EP_SIZE -gt $NUM_GPUS ]]; then
EP_PARALLEL_SIZE=$NUM_GPUS
else
EP_PARALLEL_SIZE=$EP_SIZE
fi
## The original GPT-3 models always set the min LR at 10% of the max LR. For MoE
## models, we found that a lower LR and min LR (than the base dense model) help.
## For 1.3B MoE-128 model we used LR=1.2e-4 and MIN_LR=1.0e-6.
## For 350M MoE-128 model we used LR=2.0e-4 and MIN_LR=2.0e-6, but they are not
## heavily tuned.
# LR=2.0e-4
# MIN_LR=2e-06
## Coefficient for MoE loss. We find that 0.01 is a good value at least for
## 1.3B MoE-128 model
MLC=0.01
## The configs below adjust the MoE expert token capacity limit during training
## and eval. To completely disable the capacity limit, set MOE_DROP_TOKEN to false.
## A larger capacity factor or disabling the capacity limit could improve training
## convergence, but will also reduce training throughput.
MOE_TRAIN_CAP_FACTOR=1.0
MOE_EVAL_CAP_FACTOR=1.0
MOE_MIN_CAP=4
MOE_DROP_TOKEN="true"
# MOE_DROP_TOKEN="false"
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="false"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=80
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_TOKENS=$((${CL_TOKENS} * 1000000000))
CL_STEP=$(( ${CL_TOKENS} / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=1000
## Standard deviation for weight initialization
## We used 0.014 for 350M/1.3B dense/MoE models, and used 0.01 for 6.7B
## dense model. Usually a larger model needs a lower std.
INIT_STD=0.014
# INIT_STD=0.01
## Activation checkpointing saves GPU memory, but reduces training speed
ACTIVATION_CHECKPOINT="true"
# ACTIVATION_CHECKPOINT="false"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="gpt-${MODEL_SIZE}B-lr-${LR}-minlr-${MIN_LR}-bs-${GLOBAL_BATCH_SIZE}-gpus-${NUM_GPUS}-mp-${MP_SIZE}-pp-${PP_SIZE}"
if [[ $EP_SIZE -gt 1 ]]; then
NAME="${NAME}-ep-${EP_SIZE}-mlc-${MLC}-cap-${MOE_TRAIN_CAP_FACTOR}-drop-${MOE_DROP_TOKEN}"
fi
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-${CL_START_SEQLEN}-${CL_STEP}"
fi
OUTPUT_BASEPATH=$DIR/output
mkdir -p "${OUTPUT_BASEPATH}/tensorboard/"
mkdir -p "${OUTPUT_BASEPATH}/checkpoint/"
mkdir -p "${OUTPUT_BASEPATH}/log/"
TENSORBOARD_DIR="${OUTPUT_BASEPATH}/tensorboard/${NAME}_${host}_${current_time}"
mkdir -p ${TENSORBOARD_DIR}
## Note that for an MoE model with a billion-scale base model, the checkpoint can
## be TB-scale, which a normal NFS cannot handle efficiently.
CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/${NAME}"
# USE_INTERNAL_DATA="true"
USE_INTERNAL_DATA="false"
if [ "${USE_INTERNAL_DATA}" = "true" ]; then
## The internal data is only accessible within Microsoft
## For cluster Azure-EastUS-V100-32GB-4, Azure-WestUS3-A100
# BASE_DATA_PATH=/vc_data/Megatron-LM/data
# DATA_HOME="/vc_data/pile-cc1-cc2-shuf"
## For cluster Lab-RR1-V100
BASE_DATA_PATH=/data/Megatron-LM/data
DATA_HOME="/turing-ssd/users/conglli/data/pile-cc1-cc2-shuf"
## For cluster Azure-CentralUS-A100
# BASE_DATA_PATH=/data/Megatron-LM/data
# DATA_HOME=/vc_data_1/users/amawa/blended
VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json
MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt
ARX="${DATA_HOME}/ArXiv_ftfy_cleaned_id_shuf_text_document"
BC2="${DATA_HOME}/BookCorpus2_ftfy_cleaned_id_shuf_text_document"
B3="${DATA_HOME}/Books3_ftfy_cleaned_id_shuf_text_document"
CC2020="${DATA_HOME}/CC-2020-50_id_cleaned_shuf_text_document"
CC2021="${DATA_HOME}/CC-2021-04_id_cleaned_shuf_text_document"
GIT="${DATA_HOME}/Github_ftfy_id_shuf_text_document"
GUT="${DATA_HOME}/Gutenberg_PG-19_ftfy_cleaned_id_cleaned_shuf_text_document"
NIH="${DATA_HOME}/NIH_ExPorter_ftfy_id_shuf_text_document"
OWT2="${DATA_HOME}/OpenWebText2_ftfy_cleaned_id_shuf_text_document"
PCC="${DATA_HOME}/Pile-CC_id_cleaned_shuf_text_document"
PM="${DATA_HOME}/PubMed_Abstracts_ftfy_id_shuf_text_document"
RN="${DATA_HOME}/rn_dedup_shuf_cleaned_0.7_cleaned_shuf_text_document"
SE="${DATA_HOME}/StackExchange_ftfy_id_shuf_text_document"
ST="${DATA_HOME}/stories_dedup0.7_shuf_cleaned_shuf_text_document"
WIK="${DATA_HOME}/Wikipedia_en_ftfy_id_shuf_text_document"
DATA_BLEND="0.14336 ${B3} 0.08962 ${RN} 0.19336 ${OWT2} 0.05689 ${SE} \
0.00859 ${ST} 0.02897 ${PM} 0.04771 ${WIK} 0.00873 ${GUT} 0.01007 ${BC2} \
0.00208 ${NIH} 0.13017 ${CC2020} 0.09446 ${PCC} 0.15652 ${CC2021} \
0.01359 ${ARX} 0.01588 ${GIT}"
else
VOCAB_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json
MERGE_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt
# The public Pile dataset can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
DATA_BLEND=/data/the_pile_public_merged_nopreprocessing/pile_text_document
fi
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_BLEND} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--moe-expert-parallel-size ${EP_PARALLEL_SIZE} \
--num-experts ${EP_SIZE} \
--moe-loss-coeff ${MLC} \
--moe-train-capacity-factor ${MOE_TRAIN_CAP_FACTOR} \
--moe-eval-capacity-factor ${MOE_EVAL_CAP_FACTOR} \
--moe-min-capacity ${MOE_MIN_CAP} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers ${NUM_LAYERS} \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-samples ${TRAIN_SAMPLES} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load ${CHECKPOINT_PATH} \
--save ${CHECKPOINT_PATH} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_DIR}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [[ $EP_SIZE -gt 1 ]]; then
megatron_options="${megatron_options} \
--create-moe-param-group"
fi
if [ "${MOE_DROP_TOKEN}" = "false" ]; then
megatron_options="${megatron_options} \
--disable-moe-token-dropping"
fi
template_json="ds_config_gpt_TEMPLATE.json"
config_json="ds_config_gpt_${NAME}.json"
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/0/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--pipeline-model-parallel-size ${PP_SIZE}"
# Currently MoE is not compatible with pipeline parallel
if [[ $EP_SIZE -gt 1 ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${OUTPUT_BASEPATH}/log/${NAME}_${host}_${current_time}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,350 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
# MODEL_SIZE=0.125
# NUM_LAYERS=12
# HIDDEN_SIZE=768
# NUM_ATTN_HEADS=12
# GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
# MIN_LR=6.0e-5
## GPT-3 Medium 350M
# MODEL_SIZE=0.35
# NUM_LAYERS=24
# HIDDEN_SIZE=1024
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=3.0e-4
# MIN_LR=3.0e-5
## GPT-3 Large 760M
# MODEL_SIZE=0.76
# NUM_LAYERS=24
# HIDDEN_SIZE=1536
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=2.5e-4
# MIN_LR=2.5e-5
## GPT-3 XL 1.3B
# MODEL_SIZE=1.3
# NUM_LAYERS=24
# HIDDEN_SIZE=2048
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=512
# LR=2.0e-4
# MIN_LR=2.0e-5
## GPT-3 2.7B
# MODEL_SIZE=2.7
# NUM_LAYERS=32
# HIDDEN_SIZE=2560
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=512
# LR=1.6e-4
# MIN_LR=1.6e-5
## GPT-3 6.7B
MODEL_SIZE=6.7
NUM_LAYERS=32
HIDDEN_SIZE=4096
NUM_ATTN_HEADS=32
GLOBAL_BATCH_SIZE=1024
LR=1.2e-4
MIN_LR=1.2e-5
## GPT-3 13B
# MODEL_SIZE=13
# NUM_LAYERS=40
# HIDDEN_SIZE=5120
# NUM_ATTN_HEADS=40
# GLOBAL_BATCH_SIZE=1024
# LR=1.0e-4
# MIN_LR=1.0e-5
## GPT-3 175B
# MODEL_SIZE=175
# NUM_LAYERS=96
# HIDDEN_SIZE=12288
# NUM_ATTN_HEADS=96
# GLOBAL_BATCH_SIZE=1536
# LR=0.6e-4
# MIN_LR=0.6e-5
###############################################################################
### Training duration configs
## The main termination condition; the original GPT-3 paper trains for 300B tokens.
## For MoE models, we found that sometimes training a bit longer, to 330B tokens, helps.
TRAIN_TOKENS=300000000000
# TRAIN_TOKENS=330000000000
## TRAIN_SAMPLES is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS above,
## and techniques like curriculum learning use fewer tokens in some samples,
## we set this config large enough to make sure we have enough processed data
## and don't terminate by TRAIN_SAMPLES.
TRAIN_SAMPLES=$(( ${TRAIN_TOKENS} * 3 / ${SEQ_LEN} ))
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration. This token-based config is preferable since it
## does not need to be readjusted when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
## For MoE models, we found that setting the decay tokens to 300B helps.
WARMUP_TOKENS=375000000
LR_DECAY_TOKENS=260000000000
# LR_DECAY_TOKENS=300000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=4
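## e.g. with the defaults in this script (GLOBAL_BATCH_SIZE=1024, PP_SIZE=1,
## MP_SIZE=8, NUM_GPUS=64): 1024*1*8/64 = 128, so BATCH_SIZE=4 satisfies the
## constraint above.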
## Model parallelism, 1 is no MP
## Currently MoE models have a divergence issue when MP > 1.
MP_SIZE=8
## Pipeline parallelism
## Currently we don't support PP for MoE. To disable PP, set PP_SIZE
## to 1 and use the "--no-pipeline-parallel" arg.
PP_SIZE=1
NUM_GPUS=64
###############################################################################
### MoE configs
## Number of experts. EP_SIZE=1 means a dense model without MoE.
EP_SIZE=1
# EP_SIZE=128
if [[ $EP_SIZE -gt $NUM_GPUS ]]; then
EP_PARALLEL_SIZE=$NUM_GPUS
else
EP_PARALLEL_SIZE=$EP_SIZE
fi
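## e.g. with the dense default EP_SIZE=1 this gives EP_PARALLEL_SIZE=1; with the
## commented-out EP_SIZE=128 and NUM_GPUS=64, EP_PARALLEL_SIZE would be capped at 64.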
## The original GPT-3 models always set min LR at 10% of max LR. For MoE models, we
## found that a lower LR and min LR (than the base dense model) helps.
## For the 1.3B MoE-128 model we used LR=1.2e-4 and MIN_LR=1.0e-6.
## For the 350M MoE-128 model we used LR=2.0e-4 and MIN_LR=2.0e-6, but they are not
## heavily tuned.
# LR=2.0e-4
# MIN_LR=2e-06
## Coefficient for the MoE loss. We find that 0.01 is a good value, at least for
## the 1.3B MoE-128 model.
MLC=0.01
## The configs below adjust the MoE expert token capacity limit during training and
## eval. To completely disable the capacity limit, set MOE_DROP_TOKEN to false.
## A larger capacity factor or disabling the capacity limit could improve training
## convergence, but will also reduce training throughput.
MOE_TRAIN_CAP_FACTOR=1.0
MOE_EVAL_CAP_FACTOR=1.0
MOE_MIN_CAP=4
MOE_DROP_TOKEN="true"
# MOE_DROP_TOKEN="false"
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="false"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=80
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_TOKENS=$((${CL_TOKENS} * 1000000000))
CL_STEP=$(( ${CL_TOKENS} / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
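## e.g. with the defaults above: CL_AVG_SEQLEN=(80+2048)/2=1064 and CL_TOKENS=60B,
## so CL_STEP = 60e9 / (1024 * 1064) = 55069 steps (integer division).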
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=1000
## Standard deviation for weight initialization
## We used 0.014 for the 350M/1.3B dense/MoE models, and 0.01 for the 6.7B
## dense model. Usually larger models need a lower std.
# INIT_STD=0.014
INIT_STD=0.01
## Activation checkpointing saves GPU memory, but reduces training speed
ACTIVATION_CHECKPOINT="true"
# ACTIVATION_CHECKPOINT="false"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="gpt-${MODEL_SIZE}B-lr-${LR}-minlr-${MIN_LR}-bs-${GLOBAL_BATCH_SIZE}-gpus-${NUM_GPUS}-mp-${MP_SIZE}-pp-${PP_SIZE}"
if [[ $EP_SIZE -gt 1 ]]; then
NAME="${NAME}-ep-${EP_SIZE}-mlc-${MLC}-cap-${MOE_TRAIN_CAP_FACTOR}-drop-${MOE_DROP_TOKEN}"
fi
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-${CL_START_SEQLEN}-${CL_STEP}"
fi
OUTPUT_BASEPATH=$DIR/output
mkdir -p "${OUTPUT_BASEPATH}/tensorboard/"
mkdir -p "${OUTPUT_BASEPATH}/checkpoint/"
mkdir -p "${OUTPUT_BASEPATH}/log/"
TENSORBOARD_DIR="${OUTPUT_BASEPATH}/tensorboard/${NAME}_${host}_${current_time}"
mkdir -p ${TENSORBOARD_DIR}
## Note that for MoE models with a billion-scale base model, the checkpoint can be
## TB-scale, which normal NFS cannot handle efficiently.
CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/${NAME}"
# USE_INTERNAL_DATA="true"
USE_INTERNAL_DATA="false"
if [ "${USE_INTERNAL_DATA}" = "true" ]; then
## The internal data is only accessible within Microsoft
## For cluster Azure-EastUS-V100-32GB-4, Azure-WestUS3-A100
# BASE_DATA_PATH=/vc_data/Megatron-LM/data
# DATA_HOME="/vc_data/pile-cc1-cc2-shuf"
## For cluster Lab-RR1-V100
BASE_DATA_PATH=/data/Megatron-LM/data
DATA_HOME="/turing-ssd/users/conglli/data/pile-cc1-cc2-shuf"
## For cluster Azure-CentralUS-A100
# BASE_DATA_PATH=/data/Megatron-LM/data
# DATA_HOME=/vc_data_1/users/amawa/blended
VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json
MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt
ARX="${DATA_HOME}/ArXiv_ftfy_cleaned_id_shuf_text_document"
BC2="${DATA_HOME}/BookCorpus2_ftfy_cleaned_id_shuf_text_document"
B3="${DATA_HOME}/Books3_ftfy_cleaned_id_shuf_text_document"
CC2020="${DATA_HOME}/CC-2020-50_id_cleaned_shuf_text_document"
CC2021="${DATA_HOME}/CC-2021-04_id_cleaned_shuf_text_document"
GIT="${DATA_HOME}/Github_ftfy_id_shuf_text_document"
GUT="${DATA_HOME}/Gutenberg_PG-19_ftfy_cleaned_id_cleaned_shuf_text_document"
NIH="${DATA_HOME}/NIH_ExPorter_ftfy_id_shuf_text_document"
OWT2="${DATA_HOME}/OpenWebText2_ftfy_cleaned_id_shuf_text_document"
PCC="${DATA_HOME}/Pile-CC_id_cleaned_shuf_text_document"
PM="${DATA_HOME}/PubMed_Abstracts_ftfy_id_shuf_text_document"
RN="${DATA_HOME}/rn_dedup_shuf_cleaned_0.7_cleaned_shuf_text_document"
SE="${DATA_HOME}/StackExchange_ftfy_id_shuf_text_document"
ST="${DATA_HOME}/stories_dedup0.7_shuf_cleaned_shuf_text_document"
WIK="${DATA_HOME}/Wikipedia_en_ftfy_id_shuf_text_document"
DATA_BLEND="0.14336 ${B3} 0.08962 ${RN} 0.19336 ${OWT2} 0.05689 ${SE} \
0.00859 ${ST} 0.02897 ${PM} 0.04771 ${WIK} 0.00873 ${GUT} 0.01007 ${BC2} \
0.00208 ${NIH} 0.13017 ${CC2020} 0.09446 ${PCC} 0.15652 ${CC2021} \
0.01359 ${ARX} 0.01588 ${GIT}"
else
VOCAB_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json
MERGE_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt
# The public Pile dataset; it can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
DATA_BLEND=/data/the_pile_public_merged_nopreprocessing/pile_text_document
fi
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_BLEND} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--moe-expert-parallel-size ${EP_PARALLEL_SIZE} \
--num-experts ${EP_SIZE} \
--moe-loss-coeff ${MLC} \
--moe-train-capacity-factor ${MOE_TRAIN_CAP_FACTOR} \
--moe-eval-capacity-factor ${MOE_EVAL_CAP_FACTOR} \
--moe-min-capacity ${MOE_MIN_CAP} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--rampup-batch-size 32 32 4882812 \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers ${NUM_LAYERS} \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-samples ${TRAIN_SAMPLES} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load ${CHECKPOINT_PATH} \
--save ${CHECKPOINT_PATH} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_DIR}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [[ $EP_SIZE -gt 1 ]]; then
megatron_options="${megatron_options} \
--create-moe-param-group"
fi
if [ "${MOE_DROP_TOKEN}" = "false" ]; then
megatron_options="${megatron_options} \
--disable-moe-token-dropping"
fi
template_json="ds_config_gpt_TEMPLATE.json"
config_json="ds_config_gpt_${NAME}.json"
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/0/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
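# With the default values in this script, the placeholders above resolve to:
# global batch size 1024, micro batch size 4, log interval 10, ZeRO stage 0,
# prescaled gradients, fp16 enabled, bf16 disabled, and curriculum learning
# disabled.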
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--pipeline-model-parallel-size ${PP_SIZE}"
# Currently MoE is not compatible with pipeline parallel
if [[ $EP_SIZE -gt 1 ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${OUTPUT_BASEPATH}/log/${NAME}_${host}_${current_time}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,168 @@
# How to run lm-eval on Megatron-DeepSpeed checkpoint using the original setup
A large portion of this eval harness feature is inherited from https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/212, but with code/doc changes (e.g., to support the case without pipeline parallelism, and MoE models).
This particular setup uses the normal deepspeed checkpoint and requires no conversion to Megatron-LM.
## Prerequisites
1. Install software
On a login console with external network access:
Get the lm-eval harness (https://github.com/EleutherAI/lm-evaluation-harness) and `best-download==0.0.7`, which is needed to download some tasks.
```
(you may first need to run: pip install --upgrade pip)
pip install best-download==0.0.7
pip install lm-eval
(previously we used "pip install git+https://github.com/EleutherAI/lm-evaluation-harness" to install, but later found that the command above has fewer dependency issues)
```
2. Pre-download needed datasets
Create some symlinks to work around lm-harness' issues with the relative location of data:
```
mkdir data
cd ../../tasks/eval_harness/
ln -s ../../examples/MoE/data/ data
cd ../../examples/MoE/
```
<!-- Also make sure `data` is not on one of the limited partitions like WORKSF. -->
Then install datasets for the tasks:
```
python ../../tasks/eval_harness/download.py --task_list hellaswag,lambada,triviaqa,webqs,winogrande,piqa,arc_challenge,arc_easy,openbookqa,race,boolq,cb,copa,rte,wic,wsc,multirc,record,anli_r1,anli_r2,anli_r3,wikitext,logiqa,mathqa,mc_taco,mrpc,prost,pubmedqa,qnli,qqp,sciq,sst,wnli
```
and make sure that `HF_DATASETS_OFFLINE=1` is exported in the environment before running the harness.
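For example:
```
export HF_DATASETS_OFFLINE=1
```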
<!-- If there are things like custom tokenizers, pre-download those too, e.g.:
```
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('bigscience/oscar_13_languages_alpha_weight')"
```
and make sure that `export TRANSFORMERS_OFFLINE=1` is in the script.
You know there is a custom tokenizer if the training script had something like:
```
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path bigscience/oscar_13_languages_alpha_weight \
``` -->
3. Prepare the script
<!-- Prepare the run script, replace `variant` with a unique identifier for the current eval so that multiple evals could run in parallel and not all log into the same `results.json` file. so, e.g., `tr9c-1B3-swiglu`
```
cp examples/run_evalharness_deepspeed.slurm run_evalharness-variant.slurm
```
now edit `run_evalharness-variant.slurm`
Note that the eval code knows to pull the original training args from the checkpoint, so we don't need to pass any of those. And we just need to setup the evaluation args. -->
`ds_evalharness.sh` is the example script.
1. Edit:
```
PP_SIZE=1
TP_SIZE=1
NO_PP="true"
EP_PARALLEL_SIZE=1
NUM_NODE=1
NUM_GPU_PER_NODE=1
```
to match the eval topology.
Edit:
```
CHECKPOINT_PATH=
CONFIG_PATH=
RESULT_PATH=
```
to the checkpoint/ds config you want to use, and where to save the results.
<!-- If the model fits into 1 gpu, then there is nothing to change.
The eval script will automatically reshape the model if it was of a different topology. -->
2. Adjust the following to fit the chosen GPU. As of the last check, for a 1.3B model the settings are one of:
```
EVAL_MICRO_BATCH_SIZE=6 # 16GB GPU 1.3B model
EVAL_MICRO_BATCH_SIZE=12 # 32GB GPU 1.3B model
```
If you get OOM, lower it further.
3. If not using the DeepSpeed path, disable it by removing:
```
--deepspeed \
--deepspeed_config ds_config.json \
```
If you didn't disable it and the program crashes on checkpoint loading because it is unable to find some key, disable DeepSpeed as explained above.
Note that MoE models and models without pipeline parallelism currently might not work without DeepSpeed.
<!-- ## Eval
Currently it takes 2-3 hours to run on 32GB for 1.3B model, 6-7h for 16GB GPU, so a 20h slurm job should be enough.
When ready, launch:
```
sbatch ./run_evalharness-variant.slurm
```
To monitor progress:
```
tail -f $VARIANT-eval-harness.log
```
where the variant is what you set `$VARIANT` to in the slurm script.
The template is set up for 16GB GPUs since they are easier to come by. If you change to 32GB, adjust:
```
#SBATCH --constraint=v100-32g
...
EVAL_MICRO_BATCH_SIZE=12 # 32GB GPU 1.3B model
```
Note that the ETA reported at the start of the run can be 10x longer than the actual runtime. For example, it may suggest 18 hours but complete in 2 hours.
## Short eval
If you just want to quickly test that everything can run to the end, edit `tasks/eval_harness/evaluate.py`, e.g., to run only 10 batches:
```
- results = evaluator.evaluate(adaptor, task_dict, False, 0, None)
+ results = evaluator.evaluate(adaptor, task_dict, False, 0, 10)
```
(XXX: could be a cmd line option so that code won't need to be modified)
## Import into spreadsheet
https://docs.google.com/spreadsheets/d/1CI8Q9RCblLRzUOPJ6ViqBmo284-8ojluQ-CmaEuhuv0/edit?usp=sharing
Note that the spreadsheet format is quite different, so use this script:
```
./tasks/eval_harness/report-to-csv.py results.json
```
to reformat the JSON results into CSV while changing its shape to match the spreadsheet format.
Since some records might be missing or extraneous, here is the best way to do it:
1. copy the data from first 2 columns to some place under the main spreadsheet
2. put the pointer to the 3rd column next to where the 2 first columns were copied.
3. import `results.csv` using file-> import -> file ->
Import location: Replace data at selected cell
4. Now it should be easy to align the new records with the old ones - delete irrelevant records and Insert->Cells where data is missing until the first 2 columns match
5. now create 2 cols in the main table on top and now it should be safe to Copy-n-Paste the 2-col data range, without the task/metrics columns into the newly created space. -->

30
examples/README.md Normal file
View File

@ -0,0 +1,30 @@
## Recipes and Scripts
Please note that some of the script examples (e.g., the pretrain_*.sh scripts directly under the ```Megatron-DeepSpeed/examples/``` folder) are from NVIDIA's original Megatron-LM and do not have DeepSpeed integration (scripts with DeepSpeed integration include the ```deepspeed``` keyword). Below we list various examples that do have DeepSpeed integration.
### Azure
We strongly recommend starting with the AzureML recipe in the ```azureml``` folder.
If you have a custom infrastructure (e.g. HPC clusters) or Azure VM and VMSS based environments, please refer to the bash scripts in the ```azure``` folder.
### MoE
Please see the ```MoE``` folder for different training recipes and scripts for Mixture-of-Experts (MoE) based models and dense models. These recipes are for GPT-style NLG models.
### Data Efficiency
The ```data_efficiency``` folder includes GPT-3 and BERT pretraining examples for the DeepSpeed Data Efficiency Library. Please refer to the detailed tutorials in data_efficiency/README.MD.
### Curriculum Learning
Curriculum learning recipes are in the ```curriculum_learning``` folder. Please refer to the detailed tutorials linked inside. These recipes are for GPT-style NLG models.
Note that the DeepSpeed Data Efficiency Library above includes more general curriculum learning support. The legacy curriculum learning feature here still works, but we recommend using the DeepSpeed Data Efficiency Library instead.
### Model Compression
The ```compression``` folder includes examples of layer reduction for task-agnostic compression. Please refer to [this tutorial](https://www.deepspeed.ai/tutorials/model-compression/#11-layer-reduction) about the DeepSpeed Model Compression Library. These recipes are for GPT-style NLG models.
### BERT example
The ```bert_with_pile``` folder includes examples of BERT-style model pre-training (using the public Pile data or your own data) with DeepSpeed integration. Please refer to the README in that folder for a tutorial.

27
examples/azure/README.md Normal file
View File

@ -0,0 +1,27 @@
## Recipes for experimentation on Azure
The recipes have been tested from the command line on a cluster set up using Azure VMs and VMSS, as well as inside Docker-based environments.
To run any of the examples in this folder, please go to the base directory of Megatron-DeepSpeed and run as follows:
```bash examples/azure/run-benchmark-model.sh```
### Pre-requisites
To run the above script, you will need to either set up your own dataset and modify the scripts, or use our helper scripts to download the publicly available Books dataset and GPT vocab files. Please use the following from the ```dataset``` folder:
```bash dataset/download_books.sh```
```bash dataset/download_vocab.sh```
### Run 175B and 1T models
We have included two recipes, for the 175B model and the 1T model. We assume that users will modify and tune hyperparameters and configurations themselves when training these models. To facilitate initial training, we have made the recipes runnable with the Books dataset as follows:
```bash examples/azure/run-175b.sh```
```bash examples/azure/run-1t.sh```
### Note about ZeRO stage 3 and CPU offload
By default, we have enabled ZeRO Stage 3 for both recipes above. For the 1T model, we have also enabled the CPU-offload feature to save memory and allow a larger batch size, which offers better performance.

142
examples/azure/run-175b.sh Normal file
View File

@ -0,0 +1,142 @@
#!/bin/bash
set -ex
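# Note: the data_options string below is not consumed by the launch command at the
# bottom of this script; the data/vocab/merge paths are passed there directly.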
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_PATH} \
--data-impl mmap"
BASE_PATH=$PWD/dataset/
DATA_PATH=${BASE_PATH}/BookCorpusDataset_text_document
DS_CONFIG=ds_config.json
# Hostfile path
HF=/job/hostfile
# Disabling tensor/pipeline parallelism
TP=1
PP=1
# HEADS ~= HIDDEN/128
# Model: 175B
NLAYERS=96
HIDDEN=12288
HEADS=96
SEQ=1024
MICRO_BATCH=4
NODES=1
GPN=8
GLOBAL_BATCH=$(( ${GPN} * ${MICRO_BATCH} * ${NODES} ))
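# e.g. 8 GPUs per node * micro batch 4 * 1 node = global batch 32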
# Initial power scale for loss
SP=15
# Uncomment/comment one of the following blocks.
# For 1T model, start with microbatch=1, try to get 2 and 4. If OOM w/ 4, use cpu-offloading
# Set to cpu for offloading to cpu for larger models
#OFFLOAD_DEVICE="cpu"
#CPU_OPTIM=" --cpu-optimizer"
# Set to none and empty string for no cpu offloading
OFFLOAD_DEVICE="none"
CPU_OPTIM=" "
ZERO_STAGE=3
OUTPUT_DIR=ds_z_off-${OFFLOAD_DEVICE}_stage_${ZERO_STAGE}_nl${NLAYERS}_hs${HIDDEN}_mb${MICRO_BATCH}_seq${SEQ}_gb${GLOBAL_BATCH}_nodes${NODES}
#OUTPUT_DIR=baseline_nl${NLAYERS}_hs${HIDDEN}_gb${GLOBAL_BATCH}_mb${MICRO_BATCH}
mkdir -p $OUTPUT_DIR
cat <<EOT > $DS_CONFIG
{
"train_batch_size" : $GLOBAL_BATCH,
"train_micro_batch_size_per_gpu": $MICRO_BATCH,
"steps_per_print": 1,
"gradient_accumulation_steps": 1,
"zero_optimization": {
"stage": 3,
"stage3_max_live_parameters": 3e9,
"stage3_max_reuse_distance": 3e9,
"stage3_param_persistence_threshold": 1e5,
"stage3_prefetch_bucket_size": 5e7,
"contiguous_gradients": true,
"overlap_comm": true,
"reduce_bucket_size": 90000000,
"sub_group_size": 1e9,
"offload_optimizer": {
"device": "$OFFLOAD_DEVICE",
"buffer_count": 4,
"pipeline_read": false,
"pipeline_write": false,
"pin_memory": true
}
},
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"initial_scale_power" : $SP,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"wall_clock_breakdown": true,
"zero_allow_untested_optimizer": false,
"aio": {
"block_size": 1048576,
"queue_depth": 16,
"single_submit": false,
"overlap_events": true,
"thread_count": 2
}
}
EOT
export NCCL_DEBUG=warn
ds_args=" "
ds_args=" --deepspeed ${ds_args}"
ds_args=" --no-pipeline-parallel ${ds_args}"
ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}"
ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"
ds_args=" --deepspeed-activation-checkpointing ${ds_args}"
deepspeed --force_multi --num_nodes=$NODES --hostfile $HF pretrain_gpt.py \
--tensor-model-parallel-size $TP \
--pipeline-model-parallel-size $PP \
--num-layers $NLAYERS \
--hidden-size $HIDDEN \
--num-attention-heads $HEADS \
--seq-length $SEQ \
--loss-scale $SP \
--max-position-embeddings $SEQ \
--micro-batch-size $MICRO_BATCH \
--global-batch-size $GLOBAL_BATCH \
--train-iters 1000 \
--lr 6.0e-5 \
--min-lr 6.0e-6 \
--lr-decay-style cosine \
--log-interval 1 \
--eval-iters 40 \
--eval-interval 1000 \
--data-path $DATA_PATH \
--vocab-file $BASE_PATH/gpt2-vocab.json \
--merge-file $BASE_PATH/gpt2-merges.txt \
--save-interval 1000 \
--split 98,2,0 \
--clip-grad 1.0 \
--weight-decay 0.1 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--init-method-std 0.006 \
--fp16 \
--checkpoint-activations \
--tensorboard-dir $OUTPUT_DIR \
$CPU_OPTIM $ds_args \
--exit-interval 5000 | tee ${OUTPUT_DIR}/output.log

154
examples/azure/run-1t.sh Normal file
View File

@ -0,0 +1,154 @@
#!/bin/bash
set -ex
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_PATH} \
--data-impl mmap"
BASE_PATH=$PWD/dataset/
DATA_PATH=${BASE_PATH}/BookCorpusDataset_text_document
DS_CONFIG=ds_config.json
# Hostfile path
HF=/job/hostfile
# Disabling tensor/pipeline parallelism
TP=1
PP=1
# HEADS ~= HIDDEN/128
# Refer to Megatron-table in the README.md file for model sizes
# Model: 310B
#NLAYERS=96
#HIDDEN=16384
#HEADS=128
#SEQ=2048
# Model 530B
#NLAYERS=105
#HIDDEN=20480
#HEADS=160
#SEQ=2048
# Model 1T
NLAYERS=128
HIDDEN=25600
HEADS=160
SEQ=1024
MICRO_BATCH=1
NODES=1
GPN=8
GLOBAL_BATCH=$(( ${GPN} * ${MICRO_BATCH} * ${NODES} ))
# Initial power scale for loss
SP=15
# Uncomment/comment one of the following blocks.
# For 1T model, start with microbatch=1, try to get 2 and 4. If OOM w/ 4, use cpu-offloading
# Set to cpu for offloading to cpu for larger models
OFFLOAD_DEVICE="cpu"
CPU_OPTIM=" --cpu-optimizer"
# Set to none and empty string for no cpu offloading
#OFFLOAD_DEVICE="none"
#CPU_OPTIM=" "
ZERO_STAGE=3
OUTPUT_DIR=ds_z_off-${OFFLOAD_DEVICE}_stage_${ZERO_STAGE}_nl${NLAYERS}_hs${HIDDEN}_mb${MICRO_BATCH}_seq${SEQ}_gb${GLOBAL_BATCH}_nodes${NODES}
#OUTPUT_DIR=baseline_nl${NLAYERS}_hs${HIDDEN}_gb${GLOBAL_BATCH}_mb${MICRO_BATCH}
mkdir -p $OUTPUT_DIR
cat <<EOT > $DS_CONFIG
{
"train_batch_size" : $GLOBAL_BATCH,
"train_micro_batch_size_per_gpu": $MICRO_BATCH,
"steps_per_print": 1,
"gradient_accumulation_steps": 1,
"zero_optimization": {
"stage": 3,
"stage3_max_live_parameters": 3e9,
"stage3_max_reuse_distance": 3e9,
"stage3_param_persistence_threshold": 1e5,
"stage3_prefetch_bucket_size": 5e7,
"contiguous_gradients": true,
"overlap_comm": true,
"reduce_bucket_size": 90000000,
"sub_group_size": 1e9,
"offload_optimizer": {
"device": "$OFFLOAD_DEVICE",
"buffer_count": 4,
"pipeline_read": false,
"pipeline_write": false,
"pin_memory": true
}
},
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"initial_scale_power" : $SP,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"wall_clock_breakdown": true,
"zero_allow_untested_optimizer": false,
"aio": {
"block_size": 1048576,
"queue_depth": 16,
"single_submit": false,
"overlap_events": true,
"thread_count": 2
}
}
EOT
export NCCL_DEBUG=warn
ds_args=" "
ds_args=" --deepspeed ${ds_args}"
ds_args=" --no-pipeline-parallel ${ds_args}"
ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}"
ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"
ds_args=" --deepspeed-activation-checkpointing ${ds_args}"
deepspeed --force_multi --num_nodes=$NODES --hostfile $HF pretrain_gpt.py \
--tensor-model-parallel-size $TP \
--pipeline-model-parallel-size $PP \
--num-layers $NLAYERS \
--hidden-size $HIDDEN \
--num-attention-heads $HEADS \
--seq-length $SEQ \
--loss-scale $SP \
--max-position-embeddings $SEQ \
--micro-batch-size $MICRO_BATCH \
--global-batch-size $GLOBAL_BATCH \
--train-iters 1000 \
--lr 6.0e-5 \
--min-lr 6.0e-6 \
--lr-decay-style cosine \
--log-interval 1 \
--eval-iters 40 \
--eval-interval 1000 \
--data-path $DATA_PATH \
--vocab-file $BASE_PATH/gpt2-vocab.json \
--merge-file $BASE_PATH/gpt2-merges.txt \
--save-interval 1000 \
--split 98,2,0 \
--clip-grad 1.0 \
--weight-decay 0.1 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--init-method-std 0.006 \
--fp16 \
--checkpoint-activations \
--tensorboard-dir $OUTPUT_DIR \
$CPU_OPTIM $ds_args \
--exit-interval 5000 | tee ${OUTPUT_DIR}/output.log

View File

@ -0,0 +1,142 @@
#!/bin/bash
set -ex
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_PATH} \
--data-impl mmap"
BASE_PATH=$PWD/dataset/
DATA_PATH=${BASE_PATH}/BookCorpusDataset_text_document
DS_CONFIG=ds_config.json
# Hostfile path
HF=/job/hostfile
# Disabling tensor/pipeline parallelism
TP=1
PP=1
# HEADS ~= HIDDEN/128
# Model: Benchmark model
NLAYERS=1
HIDDEN=12288
HEADS=96
SEQ=1024
MICRO_BATCH=4
NODES=2
GPN=8
GLOBAL_BATCH=$(( ${GPN} * ${MICRO_BATCH} * ${NODES} ))
# Initial power scale for loss
SP=15
# Uncomment/comment one of the following blocks.
# For 1T model, start with microbatch=1, try to get 2 and 4. If OOM w/ 4, use cpu-offloading
# Set to cpu for offloading to cpu for larger models
#OFFLOAD_DEVICE="cpu"
#CPU_OPTIM=" --cpu-optimizer"
# Set to none and empty string for no cpu offloading
OFFLOAD_DEVICE="none"
CPU_OPTIM=" "
ZERO_STAGE=3
OUTPUT_DIR=ds_z_off-${OFFLOAD_DEVICE}_stage_${ZERO_STAGE}_nl${NLAYERS}_hs${HIDDEN}_mb${MICRO_BATCH}_seq${SEQ}_gb${GLOBAL_BATCH}_nodes${NODES}
#OUTPUT_DIR=baseline_nl${NLAYERS}_hs${HIDDEN}_gb${GLOBAL_BATCH}_mb${MICRO_BATCH}
mkdir -p $OUTPUT_DIR
cat <<EOT > $DS_CONFIG
{
"train_batch_size" : $GLOBAL_BATCH,
"train_micro_batch_size_per_gpu": $MICRO_BATCH,
"steps_per_print": 1,
"gradient_accumulation_steps": 1,
"zero_optimization": {
"stage": 3,
"stage3_max_live_parameters": 3e9,
"stage3_max_reuse_distance": 3e9,
"stage3_param_persistence_threshold": 1e5,
"stage3_prefetch_bucket_size": 5e7,
"contiguous_gradients": true,
"overlap_comm": true,
"reduce_bucket_size": 90000000,
"sub_group_size": 1e9,
"offload_optimizer": {
"device": "$OFFLOAD_DEVICE",
"buffer_count": 4,
"pipeline_read": false,
"pipeline_write": false,
"pin_memory": true
}
},
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"initial_scale_power" : $SP,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"wall_clock_breakdown": true,
"zero_allow_untested_optimizer": false,
"aio": {
"block_size": 1048576,
"queue_depth": 16,
"single_submit": false,
"overlap_events": true,
"thread_count": 2
}
}
EOT
export NCCL_DEBUG=warn
ds_args=" "
ds_args=" --deepspeed ${ds_args}"
ds_args=" --no-pipeline-parallel ${ds_args}"
ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}"
ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"
ds_args=" --deepspeed-activation-checkpointing ${ds_args}"
deepspeed --force_multi --num_nodes=$NODES --hostfile $HF pretrain_gpt.py \
--tensor-model-parallel-size $TP \
--pipeline-model-parallel-size $PP \
--num-layers $NLAYERS \
--hidden-size $HIDDEN \
--num-attention-heads $HEADS \
--seq-length $SEQ \
--loss-scale $SP \
--max-position-embeddings $SEQ \
--micro-batch-size $MICRO_BATCH \
--global-batch-size $GLOBAL_BATCH \
--train-iters 50 \
--lr 6.0e-5 \
--min-lr 6.0e-6 \
--lr-decay-style cosine \
--log-interval 1 \
--eval-iters 40 \
--eval-interval 1000 \
--data-path $DATA_PATH \
--vocab-file $BASE_PATH/gpt2-vocab.json \
--merge-file $BASE_PATH/gpt2-merges.txt \
--save-interval 1000 \
--split 98,2,0 \
--clip-grad 1.0 \
--weight-decay 0.1 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--init-method-std 0.006 \
--fp16 \
--checkpoint-activations \
--tensorboard-dir $OUTPUT_DIR \
$CPU_OPTIM $ds_args \
--exit-interval 5000 | tee ${OUTPUT_DIR}/output.log

View File

@ -0,0 +1,5 @@
FROM mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.5-gpu
USER root:root
RUN pip install pybind11
RUN pip install regex

View File

@ -0,0 +1,14 @@
## Megatron-DeepSpeed on AzureML
Example script for running Megatron-DeepSpeed using Azure Machine Learning.
------
# Workspace Setup
Set up an AML workspace. Refer to the [set-up doc](https://github.com/Azure/azureml-examples/tree/main/python-sdk#set-up).
# Dataset Preparation
Create an AML Dataset. To run a remote AML job, you need to provide an AML FileDataset.
Refer to the [prepare_dataset script](prepare_dataset.py) for how to upload .bin and .idx files to the blob store and how to create a FileDataset.
# Training
Run Megatron-DeepSpeed on Azure ML. Refer to [aml_submit script](aml_submit.py).

View File

@ -0,0 +1,198 @@
import os
import requests
import sys
# AzureML libraries
import azureml.core
from azureml.core import Dataset, Environment, Experiment, ScriptRunConfig, Workspace
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.runconfig import PyTorchConfiguration
from azureml.core.environment import DockerBuildContext
# Check core SDK version number
print("SDK version:", azureml.core.VERSION)
# For setting up a workspace, refer to: https://github.com/Azure/azureml-examples/tree/main/python-sdk#set-up
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')
#-------------------------------------------------------------------------------
# Prepare Compute Cluster
#-------------------------------------------------------------------------------
cluster_name = "a100-80gb"
# Use the existing compute target if it already exists; otherwise create a new one
try:
compute_target = ComputeTarget(workspace=ws, name=cluster_name)
print('Found existing compute target.')
except ComputeTargetException:
print('Creating a new compute target...')
compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_ND96amsr_A100_v4', min_nodes=32, max_nodes=32)
# create the cluster
compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
compute_target.wait_for_completion(show_output=True)
#-------------------------------------------------------------------------------
# Prepare Data
# Megatron-DeepSpeed takes in data_path, vocab_file, and merge_file.
# For AML, we are adding a parameter aml_data_download_path which specifies how to deliver the dataset to a compute target.
# In the submitted run, files in the datasets will be either mounted or downloaded to local path on the compute target.
#
# data_path for this example is path to the .bin and .idx file, excluding extension.
# e.g. for data/BookCorpusDataset_text_document.bin and data/BookCorpusDataset_text_document.idx,
# data_path = "data/BookCorpusDataset_text_document"
#
# Once the folder is downloaded to the compute target, it will use aml_data_download_path to locate the folder
# and data_path to locate .bin and .idx files
#
# vocab_file and merge_file would also be passed in a similar way.
#-------------------------------------------------------------------------------
datastore = ws.get_default_datastore()
blobstore_datadir = "bookcorpus_data"
data_path = "BookCorpusDataset_text_document"
# Load data folder which contains bookcorpus .bin and .idx files
train_dataset = Dataset.File.from_files(path=[(datastore, blobstore_datadir)])
aml_data_download_path = train_dataset.as_download(blobstore_datadir)
vocab_file_dataset = Dataset.File.from_files("https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json")
merge_file_dataset = Dataset.File.from_files("https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt")
vocab_file = vocab_file_dataset.as_download()
merge_file = merge_file_dataset.as_download()
#-------------------------------------------------------------------------------
# Setup training environment
#-------------------------------------------------------------------------------
megatron_ds_env = Environment.from_docker_build_context(name='megatron-ds-curated-acpt', docker_build_context=DockerBuildContext.from_local_directory(workspace = ws, path = '.', dockerfile_path='Dockerfile.dockerfile'))
megatron_ds_env.register(ws).build(ws).wait_for_completion() # Comment this out if environment already exists
#-------------------------------------------------------------------------------
# Training Settings and Arguments
#-------------------------------------------------------------------------------
node_count = 2
total_processes_count = 16
micro_batch_size = 1
global_batch_size = micro_batch_size * total_processes_count
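# e.g. micro_batch_size=1 * total_processes_count=16 -> global_batch_size=16 (no gradient accumulation)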
tensorboard_dir = '/tmp/outputs/tensorboard'
run_args = ['--tensor-model-parallel-size', 1,
'--pipeline-model-parallel-size', 1,
'--num-layers', 20,
'--hidden-size', 12288,
'--num-attention-heads', 96,
'--seq-length', 1024,
'--loss-scale', 15,
'--max-position-embeddings', 1024,
'--micro-batch-size', micro_batch_size,
'--global-batch-size', global_batch_size,
'--train-iters', 100,
'--lr', 6.0e-5,
'--min-lr', 6.0e-6,
'--lr-decay-style', 'cosine',
'--log-interval', 1,
'--eval-iters', 40,
'--eval-interval', 1000,
'--aml-data-download-path', aml_data_download_path,
'--data-path', data_path,
'--vocab-file', vocab_file,
'--merge-file', merge_file,
'--save-interval', 1000,
'--split', '98,2,0',
'--clip-grad', 1.0,
'--weight-decay', 0.1,
'--adam-beta1', 0.9,
'--adam-beta2', 0.95,
'--init-method-std', 0.006,
'--fp16',
'--data-impl', 'mmap',
'--checkpoint-activations',
'--tensorboard-dir', tensorboard_dir,
#'--cpu-optimizer',
'--deepspeed',
'--no-pipeline-parallel',
'--deepspeed_config', 'ds_config.json',
'--zero-stage', 3,
'--deepspeed-activation-checkpointing',
'--exit-interval', 5000,
]
#-------------------------------------------------------------------------------
# DeepSpeed ds_config.json
#-------------------------------------------------------------------------------
import json
ds_config = {
"train_batch_size" : global_batch_size,
"train_micro_batch_size_per_gpu": micro_batch_size,
"steps_per_print": 1,
"gradient_accumulation_steps": 1,
"zero_optimization": {
"stage": 3,
"stage3_max_live_parameters": 3e9,
"stage3_max_reuse_distance": 3e9,
"stage3_param_persistence_threshold": 1e5,
"stage3_prefetch_bucket_size": 5e7,
"contiguous_gradients": True,
"overlap_comm": True,
"reduce_bucket_size": 90000000,
"sub_group_size": 1e9,
"offload_optimizer": {
"device": "none",
"buffer_count": 4,
"pipeline_read": False,
"pipeline_write": False,
"pin_memory": True
}
},
"gradient_clipping": 1.0,
"fp16": {
"enabled": True,
"initial_scale_power" : 15,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"wall_clock_breakdown": True,
"zero_allow_untested_optimizer": False,
"aio": {
"block_size": 1048576,
"queue_depth": 16,
"single_submit": False,
"overlap_events": True,
"thread_count": 2
}
}
# Place ds_config.json in the same folder as pretrain_gpt.py (script to run)
ds_config_path = '../../ds_config.json'
with open(ds_config_path, 'w') as fp:
json.dump(ds_config, fp, indent=4)
#-------------------------------------------------------------------------------
# Create ScriptRunConfig
#-------------------------------------------------------------------------------
distr_config = PyTorchConfiguration(process_count=total_processes_count, node_count=node_count)
megatron_ds_src = ScriptRunConfig(source_directory='../../',
script='pretrain_gpt.py',
arguments=run_args,
compute_target=compute_target,
environment=megatron_ds_env,
distributed_job_config=distr_config)
megatron_ds_src.run_config.environment_variables['NCCL_DEBUG'] = 'WARN'
megatron_ds_src.run_config.environment_variables['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
megatron_ds_src.run_config.environment_variables['NCCL_SOCKET_IFNAME'] = 'eth0'
megatron_ds_src.run_config.environment_variables['NCCL_IB_PCI_RELAXED_ORDERING']='1'
megatron_ds_src.run_config.environment_variables['UCX_TLS']='tcp'
megatron_ds_src.run_config.environment_variables['UCX_NET_DEVICES']='eth0'
#-------------------------------------------------------------------------------
# Submit experiment
#-------------------------------------------------------------------------------
experiment_name = 'megatron-ds'
experiment = Experiment(ws, name=experiment_name)
run = experiment.submit(megatron_ds_src, tags={'bs':micro_batch_size, 'gpus':total_processes_count})

View File

@ -0,0 +1,33 @@
# Use this script to upload data to blob store
# AzureML libraries
from azureml.core import Workspace
from azureml.core.dataset import Dataset
from azureml.data.datapath import DataPath
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')
data_dir = "bookcorpus_data" # Local directory for where data is located that includes .bin and .idx files
blobstore_datadir = data_dir # Blob store directory to store data in
datastore = ws.get_default_datastore()
# Book Corpus Data
print("upload dataset to blob store")
uploaded_data = Dataset.File.upload_directory(
src_dir=data_dir,
target=DataPath(datastore, blobstore_datadir),
show_progress=True
)
# Usage after uploading the directory
# To refer to the folder directly:
train_dataset = Dataset.File.from_files(path=[(datastore, blobstore_datadir)])
print(train_dataset)
# To refer to a specific file:
# train_dataset = Dataset.File.from_files(path=[(datastore, blobstore_datadir + "/filename.ext")])
# Create DatasetConsumptionConfig to specify how to deliver the dataset to a compute target.
# In the submitted run, files in the datasets will be either mounted or downloaded to local path on the compute target.
# input_data_dir = train_dataset.as_mount()
# input_data_dir = train_dataset.as_download()

View File

@ -0,0 +1,23 @@
This ```bert_with_pile``` folder includes examples of BERT pre-training (using [the public Pile data](https://github.com/EleutherAI/the-pile) or your own data) with DeepSpeed integration. We also provide scripts for preprocessing the Pile data and for MNLI fine-tuning.
## Data preprocessing
```prepare_pile_data.py``` is the script for downloading, decompressing, and preprocessing [the public Pile data](https://github.com/EleutherAI/the-pile). Users can also modify this script to preprocess their own training data.
## BERT pre-training
```ds_pretrain_bert.sh``` is the script for BERT pre-training integrated with DeepSpeed, supporting [ZeRO](https://www.deepspeed.ai/tutorials/zero/) together with Megatron's tensor-slicing model parallelism. The training hyperparameters follow the [Megatron paper](https://arxiv.org/abs/1909.08053). Note that pipeline parallelism is currently not supported: DeepSpeed's pipeline parallelism is only integrated with the GPT case, and DeepSpeed is currently not integrated with Megatron's own pipeline parallelism.
As a reference performance number, our measurements show that this example is able to achieve a throughput of up to 145 TFLOPs per GPU when pre-training a 1.3B BERT model (with ZeRO stage-1, no model parallelism, 64 NVIDIA A100 GPUs, batch size 4096 (64 per GPU), and activation checkpointing).
One thing to note is that this pre-training recipe is NOT a strict reproduction of the [original BERT paper](https://arxiv.org/abs/1810.04805): the Pile data is larger than the data used in the original BERT (and the data used in the Megatron paper); Megatron-LM introduces some changes to the BERT model (see details in the [Megatron paper](https://arxiv.org/abs/1909.08053)); and the training hyperparameters are also different. Overall these differences lead to a longer training time but also better model quality than the original BERT (see the MNLI scores below), and support larger model scale through the combination of ZeRO and model parallelism. If you don't have enough compute budget, we recommend reducing the total training iterations (```train_iters``` in the script) and potentially increasing the learning rate at the same time. If you want to strictly reproduce the original BERT, we recommend using our [other BERT example](https://github.com/microsoft/DeepSpeedExamples/tree/master/bing_bert).
## BERT MNLI fine-tuning
```ds_finetune_bert_mnli.sh``` is the script for BERT MNLI fine-tuning, following the hyperparameters in the [Megatron paper](https://arxiv.org/abs/1909.08053). As a reference, the table below presents the scores of the model pre-trained with the script above, compared with the scores of the original BERT and the Megatron paper's BERT. Our BERT-Large score is slightly lower than the Megatron paper's, mainly due to the different data we used (the Pile data is much more diverse and larger than the data in the Megatron paper, which potentially has a negative effect on small million-scale models).
| MNLI dev set accuracy | **MNLI-m** | **MNLI-mm** |
| ---------- |---------- |---------- |
| BERT-Base, [original BERT](https://arxiv.org/abs/1810.04805) | 84.6 | 83.4 |
| BERT-Base, ours (median on 5 seeds) | 86.1 | 86.1 |
| BERT-Large, [original BERT](https://arxiv.org/abs/1810.04805) | 86.7 | 85.9 |
| BERT-Large, [Megatron paper](https://arxiv.org/abs/1909.08053) | 89.7 | 90.0 |
| BERT-Large, ours (median on 5 seeds) | 89.1 | 89.6 |

View File

@ -0,0 +1,28 @@
{
"train_batch_size" : CONFIG_BATCH_SIZE,
"train_micro_batch_size_per_gpu": CONFIG_MBSIZE,
"steps_per_print": LOG_INTERVAL,
"zero_optimization": {
"stage": ZERO_STAGE,
"elastic_checkpoint": true
},
"gradient_clipping": 1.0,
"prescale_gradients": PRESCALE_GRAD,
"fp16": {
"enabled": CONFIG_FP16_ENABLED,
"loss_scale": 0,
"loss_scale_window": 500,
"hysteresis": 2,
"min_loss_scale": 1,
"initial_scale_power": 11
},
"bf16": {
"enabled": CONFIG_BF16_ENABLED
},
"wall_clock_breakdown" : false
}

View File

@ -0,0 +1,150 @@
seed=1234
pretrained_checkpoint="/blob/users/conglli/project/bert_with_pile/checkpoint/bert-pile-0.336B-iters-2M-lr-1e-4-min-1e-5-wmup-10000-dcy-2M-sty-linear-gbs-1024-mbs-16-gpu-64-zero-0-mp-1-pp-1-nopp"
###############################################################################
### Main configs
### The main configs are from Megatron-LM paper
### https://arxiv.org/abs/1909.08053. Choose based on your desired model size
### or build your own configs.
seq_len=512
## From Table 6 in https://arxiv.org/abs/1909.08053.
task="MNLI"
global_batch_size=128
lr=1e-5
epochs=10
train_data="/blob/data/GlueData/MNLI/train.tsv"
valid_data="/blob/data/GlueData/MNLI/dev_matched.tsv \
/blob/data/GlueData/MNLI/dev_mismatched.tsv"
## Adjust based on number of GPUs.
batch_size=16
## BERT 110M (same config as original BERT-Base model)
## This config is not included in Megatron-LM paper
# model_size=0.11
# num_layers=12
# hidden_size=768
# num_attn_heads=12
## BERT 336M (same config as original BERT-Large model)
model_size=0.336
num_layers=24
hidden_size=1024
num_attn_heads=16
## BERT 1.3B
# model_size=1.3
# num_layers=24
# hidden_size=2048
# num_attn_heads=32
## BERT 3.9B
# model_size=3.9
# num_layers=48
# hidden_size=2560
# num_attn_heads=40
###############################################################################
### Parallelism configs
## Model parallelism, 1 is no MP
mp_size=1
## Pipeline parallelism. To disable PP, set pp_size to 1 and no_pp to true.
## Currently pipeline parallelism is not supported for BERT model: DeepSpeed's
## pipeline parallelism is only integrated with the GPT case, and currently
## DeepSpeed is not integrated with Megatron's own pipeline parallelism.
pp_size=1
no_pp="true"
## ZeRO stage
zero_stage=0
###############################################################################
### Misc configs
log_interval=10
eval_iters=50
eval_interval=100
save_interval=500000
## Activation checkpointing saves GPU memory, but reduces training speed
# activation_checkpoint="true"
activation_checkpoint="false"
###############################################################################
vocab_file="bert-large-uncased-vocab.txt"
if [ ! -f "$vocab_file" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
fi
jobname="${task}-bsz${global_batch_size}-lr${lr}-epochs${epochs}-seed${seed}"
checkpoint_path="${pretrained_checkpoint}-finetune/${jobname}"
mkdir -p ${checkpoint_path}
template_json="ds_config_bert_TEMPLATE.json"
config_json="ds_config_bert_bsz${global_batch_size}_mbsz${batch_size}_log${log_interval}_zero${zero_stage}.json"
if [[ $zero_stage -gt 0 ]]; then
sed "s/CONFIG_BATCH_SIZE/${global_batch_size}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/false/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
> ${config_json}
else
sed "s/CONFIG_BATCH_SIZE/${global_batch_size}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
> ${config_json}
fi
options=" \
--finetune \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${zero_stage} \
--task ${task} \
--seed ${seed} \
--train-data ${train_data} \
--valid-data ${valid_data} \
--tokenizer-type BertWordPieceLowerCase \
--vocab-file ${vocab_file} \
--epochs ${epochs} \
--pretrained-checkpoint ${pretrained_checkpoint} \
--tensor-model-parallel-size ${mp_size} \
--pipeline-model-parallel-size ${pp_size} \
--num-layers ${num_layers} \
--hidden-size ${hidden_size} \
--num-attention-heads ${num_attn_heads} \
--global-batch-size ${global_batch_size} \
--micro-batch-size ${batch_size} \
--lr ${lr} \
--lr-decay-style linear \
--lr-warmup-fraction 0.065 \
--seq-length ${seq_len} \
--max-position-embeddings ${seq_len} \
--save-interval ${save_interval} \
--save ${checkpoint_path} \
--log-interval ${log_interval} \
--eval-interval ${eval_interval} \
--eval-iters ${eval_iters} \
--weight-decay 1.0e-1 \
--fp16"
if [ "${activation_checkpoint}" = "true" ]; then
options="${options} \
--checkpoint-activations \
--deepspeed-activation-checkpointing"
fi
if [[ "${no_pp}" = "true" ]]; then
options="${options} \
--no-pipeline-parallel"
fi
# After the fine-tuning finishes, you can find the dev set accuracy numbers by
# "grep -e "overall:" -e "metrics for" ${checkpoint_path}/output.log"
deepspeed ../../tasks/main.py ${options} &> ${checkpoint_path}/output.log

View File

@ -0,0 +1,158 @@
seed=1234
pretrained_checkpoint="/blob/users/conglli/project/bert_with_pile/checkpoint/bert-pile-0.336B-iters-2M-lr-1e-4-min-1e-5-wmup-10000-dcy-2M-sty-linear-gbs-1024-mbs-16-gpu-64-zero-0-mp-1-pp-1-nopp"
###############################################################################
### Main configs
### The main configs are from Megatron-LM paper
### https://arxiv.org/abs/1909.08053. Choose based on your desired model size
### or build your own configs.
seq_len=512
## From Table 6 in https://arxiv.org/abs/1909.08053.
task="QQP"
train_data="/blob/data/GlueData/QQP/train.tsv"
valid_data="/blob/data/GlueData/QQP/dev.tsv"
## Adjust based on number of GPUs.
batch_size=16
## BERT 110M (same config as original BERT-Base model)
## This config is not included in Megatron-LM paper
# model_size=0.11
# num_layers=12
# hidden_size=768
# num_attn_heads=12
# global_batch_size=128
# lr=5e-5
# epochs=12
## BERT 336M (same config as original BERT-Large model)
model_size=0.336
num_layers=24
hidden_size=1024
num_attn_heads=16
global_batch_size=128
lr=5e-5
epochs=12
## BERT 1.3B
# model_size=1.3
# num_layers=24
# hidden_size=2048
# num_attn_heads=32
# global_batch_size=128
# lr=3e-5
# epochs=12
## BERT 3.9B
# model_size=3.9
# num_layers=48
# hidden_size=2560
# num_attn_heads=40
# global_batch_size=256
# lr=4e-5
# epochs=12
###############################################################################
### Parallelism configs
## Model parallelism, 1 is no MP
mp_size=1
## Pipeline parallelism. To disable PP, set pp_size to 1 and no_pp to true.
## Currently pipeline parallelism is not supported for BERT model: DeepSpeed's
## pipeline parallelism is only integrated with the GPT case, and currently
## DeepSpeed is not integrated with Megatron's own pipeline parallelism.
pp_size=1
no_pp="true"
## ZeRO stage
zero_stage=0
###############################################################################
### Misc configs
log_interval=10
eval_iters=50
eval_interval=100
save_interval=500000
## Activation checkpointing saves GPU memory, but reduces training speed
# activation_checkpoint="true"
activation_checkpoint="false"
###############################################################################
vocab_file="bert-large-uncased-vocab.txt"
if [ ! -f "$vocab_file" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
fi
jobname="${task}-bsz${global_batch_size}-lr${lr}-epochs${epochs}-seed${seed}"
checkpoint_path="${pretrained_checkpoint}-finetune/${jobname}"
mkdir -p ${checkpoint_path}
template_json="ds_config_bert_TEMPLATE.json"
config_json="ds_config_bert_bsz${global_batch_size}_mbsz${batch_size}_log${log_interval}_zero${zero_stage}.json"
if [[ $zero_stage -gt 0 ]]; then
sed "s/CONFIG_BATCH_SIZE/${global_batch_size}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/false/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
> ${config_json}
else
sed "s/CONFIG_BATCH_SIZE/${global_batch_size}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
> ${config_json}
fi
options=" \
--finetune \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${zero_stage} \
--task ${task} \
--seed ${seed} \
--train-data ${train_data} \
--valid-data ${valid_data} \
--tokenizer-type BertWordPieceLowerCase \
--vocab-file ${vocab_file} \
--epochs ${epochs} \
--pretrained-checkpoint ${pretrained_checkpoint} \
--tensor-model-parallel-size ${mp_size} \
--pipeline-model-parallel-size ${pp_size} \
--num-layers ${num_layers} \
--hidden-size ${hidden_size} \
--num-attention-heads ${num_attn_heads} \
--global-batch-size ${global_batch_size} \
--micro-batch-size ${batch_size} \
--lr ${lr} \
--lr-decay-style linear \
--lr-warmup-fraction 0.065 \
--seq-length ${seq_len} \
--max-position-embeddings ${seq_len} \
--save-interval ${save_interval} \
--save ${checkpoint_path} \
--log-interval ${log_interval} \
--eval-interval ${eval_interval} \
--eval-iters ${eval_iters} \
--weight-decay 1.0e-1 \
--fp16"
if [ "${activation_checkpoint}" = "true" ]; then
options="${options} \
--checkpoint-activations \
--deepspeed-activation-checkpointing"
fi
if [[ "${no_pp}" = "true" ]]; then
options="${options} \
--no-pipeline-parallel"
fi
# After the fine-tuning finishes, you can find the dev set accuracy numbers by
# "grep -e "overall:" -e "metrics for" ${checkpoint_path}/output.log"
deepspeed ../../tasks/main.py ${options} &> ${checkpoint_path}/output.log

View File

@ -0,0 +1,172 @@
seed=1234
## RACE have two sub-tasks that need to be finetuned separately
difficulty="middle"
# difficulty="high"
pretrained_checkpoint="/blob/users/conglli/project/bert_with_pile/checkpoint/bert-pile-0.336B-iters-2M-lr-1e-4-min-1e-5-wmup-10000-dcy-2M-sty-linear-gbs-1024-mbs-16-gpu-64-zero-0-mp-1-pp-1-nopp"
###############################################################################
### Main configs
### The main configs are from Megatron-LM paper
### https://arxiv.org/abs/1909.08053. Choose based on your desired model size
### or build your own configs.
seq_len=512
## From Table 6 in https://arxiv.org/abs/1909.08053.
task="RACE"
## Race dataset can be downloaded by:
## wget http://www.cs.cmu.edu/~glai1/data/race/RACE.tar.gz
train_data="/blob/data/RACE/train/${difficulty}"
## The Megatron paper https://arxiv.org/abs/1909.08053 says: "For the test set
## results of RACE, we first use the development set to find the checkpoint
## that gives us the median score on the 5 random seeds and we report the
## results from that checkpoint on the test set", which is quite a confusing
## description. For simplicity, we instead directly report the median dev and test
## set scores over 5 random seeds from a single pretrained_checkpoint.
valid_data="/blob/data/RACE/dev/${difficulty} \
/blob/data/RACE/test/${difficulty}"
## Adjust based on number of GPUs.
batch_size=4
## BERT 110M (same config as original BERT-Base model)
## This config is not included in Megatron-LM paper
# model_size=0.11
# num_layers=12
# hidden_size=768
# num_attn_heads=12
# global_batch_size=32
# lr=2e-5
# epochs=3
## BERT 336M (same config as original BERT-Large model)
model_size=0.336
num_layers=24
hidden_size=1024
num_attn_heads=16
global_batch_size=32
lr=2e-5
epochs=3
## BERT 1.3B
# model_size=1.3
# num_layers=24
# hidden_size=2048
# num_attn_heads=32
# global_batch_size=16
# lr=1e-5
# epochs=3
## BERT 3.9B
# model_size=3.9
# num_layers=48
# hidden_size=2560
# num_attn_heads=40
# global_batch_size=32
# lr=2e-5
# epochs=3
###############################################################################
### Parallelism configs
## Model parallelism, 1 is no MP
mp_size=1
## Pipeline parallelism. To disable PP, set pp_size to 1 and no_pp to true.
## Currently pipeline parallelism is not supported for BERT model: DeepSpeed's
## pipeline parallelism is only integrated with the GPT case, and currently
## DeepSpeed is not integrated with Megatron's own pipeline parallelism.
pp_size=1
no_pp="true"
## ZeRO stage
zero_stage=0
###############################################################################
### Misc configs
log_interval=10
eval_iters=50
eval_interval=100
save_interval=100000
## Activation checkpointing saves GPU memory, but reduces training speed
# activation_checkpoint="true"
activation_checkpoint="false"
###############################################################################
vocab_file="bert-large-uncased-vocab.txt"
if [ ! -f "$vocab_file" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
fi
jobname="${task}-${difficulty}-bsz${global_batch_size}-lr${lr}-epochs${epochs}-seed${seed}"
checkpoint_path="${pretrained_checkpoint}-finetune/${jobname}"
mkdir -p ${checkpoint_path}
template_json="ds_config_bert_TEMPLATE.json"
config_json="ds_config_bert_bsz${global_batch_size}_mbsz${batch_size}_log${log_interval}_zero${zero_stage}.json"
if [[ $zero_stage -gt 0 ]]; then
sed "s/CONFIG_BATCH_SIZE/${global_batch_size}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/false/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
> ${config_json}
else
sed "s/CONFIG_BATCH_SIZE/${global_batch_size}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
> ${config_json}
fi
options=" \
--finetune \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${zero_stage} \
--task ${task} \
--seed ${seed} \
--train-data ${train_data} \
--valid-data ${valid_data} \
--tokenizer-type BertWordPieceLowerCase \
--vocab-file ${vocab_file} \
--epochs ${epochs} \
--pretrained-checkpoint ${pretrained_checkpoint} \
--tensor-model-parallel-size ${mp_size} \
--pipeline-model-parallel-size ${pp_size} \
--num-layers ${num_layers} \
--hidden-size ${hidden_size} \
--num-attention-heads ${num_attn_heads} \
--global-batch-size ${global_batch_size} \
--micro-batch-size ${batch_size} \
--lr ${lr} \
--lr-decay-style linear \
--lr-warmup-fraction 0.06 \
--seq-length ${seq_len} \
--max-position-embeddings ${seq_len} \
--save-interval ${save_interval} \
--save ${checkpoint_path} \
--log-interval ${log_interval} \
--eval-interval ${eval_interval} \
--eval-iters ${eval_iters} \
--weight-decay 1.0e-1 \
--clip-grad 1.0 \
--fp16"
if [ "${activation_checkpoint}" = "true" ]; then
options="${options} \
--checkpoint-activations \
--deepspeed-activation-checkpointing"
fi
if [[ "${no_pp}" = "true" ]]; then
options="${options} \
--no-pipeline-parallel"
fi
# After the fine-tuning finishes, you can find the dev/test set accuracy numbers
# by "grep -e "overall:" -e "metrics for" ${checkpoint_path}/output.log"
deepspeed ../../tasks/main.py ${options} &> ${checkpoint_path}/output.log

View File

@ -0,0 +1,267 @@
#!/bin/bash
dir=`pwd`
###############################################################################
### Main configs
### The main configs are from Megatron-LM paper
### https://arxiv.org/abs/1909.08053. Choose based on your desired model size
### or build your own configs.
seq_len=512
global_batch_size=1024
lr=1e-4
min_lr=1e-5
## init_std is the standard deviation for weight initialization. Usually a
## larger model needs a lower std. Here we roughly follow the heuristic
## sqrt(1/3/hidden_size) from https://arxiv.org/pdf/2201.11990.pdf
## In addition, we found that the 3.9B model (even after tuning init_std) hits
## NaN loss from the very beginning and thus cannot be trained. This is probably
## because this example uses the public Pile data, which is more diverse (and
## potentially noisier) than the data used in the Megatron paper. One potential
## solution is to use only the subsets of Pile that are also used in the
## Megatron paper.
## BERT 110M (same config as original BERT-Base model)
## This config is not included in Megatron-LM paper
# model_size=0.11
# num_layers=12
# hidden_size=768
# num_attn_heads=12
# init_std=0.02
## BERT 336M (same config as original BERT-Large model)
model_size=0.336
num_layers=24
hidden_size=1024
num_attn_heads=16
init_std=0.02
## BERT 1.3B
# model_size=1.3
# num_layers=24
# hidden_size=2048
# num_attn_heads=32
# init_std=0.013
## BERT 3.9B
# model_size=3.9
# num_layers=48
# hidden_size=2560
# num_attn_heads=40
# init_std=0.011
###############################################################################
### Training duration configs
## The main termination condition; the original Megatron paper trains for 2M iters.
train_iters_in_million=2
train_iters=$((${train_iters_in_million} * 1000000))
###############################################################################
### lr configs
## lr warmup and decay duration. The original Megatron paper uses 10000 warmup
## iters. Decay iters is the same as train iters.
lr_warmup_iters=10000
lr_decay_iters_in_million=${train_iters_in_million}
lr_decay_iters=$((${lr_decay_iters_in_million} * 1000000))
lr_decay_style="linear"
###############################################################################
### Parallelism configs
## Model parallelism, 1 is no MP
mp_size=1
## Pipeline parallelism. To disable PP, set pp_size to 1 and no_pp to true.
## Currently pipeline parallelism is not supported for the BERT model: DeepSpeed's
## pipeline parallelism is only integrated with the GPT case, and currently
## DeepSpeed is not integrated with Megatron's own pipeline parallelism.
pp_size=1
no_pp="true"
## ZeRO stage
zero_stage=0
## Total number of GPUs. ds_ssh is from the DeepSpeed library.
num_gpus=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2))
num_gpus_pernode=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
num_node=$(( ${num_gpus} / ${num_gpus_pernode} ))
## Data parallel size.
dp_size=$(( ${num_gpus} / ${pp_size} / ${mp_size} ))
## Micro batch size per GPU
## Make sure that batch_size <= global_batch_size*pp_size*mp_size/num_gpus
## The batch_size calculation below assumes no gradient accumulation.
## Manually set it to a lower value if you hit out-of-memory errors during training.
batch_size=$(( ${global_batch_size} / ${dp_size} ))
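## Worked example (illustrative values only): on 8 nodes with 8 GPUs each,
## num_gpus=64; with pp_size=1 and mp_size=1 this gives dp_size=64, so
## batch_size = 1024 / 64 = 16 per GPU and no gradient accumulation is needed
## (global_batch_size = batch_size * dp_size).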
###############################################################################
### Misc configs
log_interval=100
eval_iters=10
eval_interval=1000
# num_save controls how frequently checkpoints are saved. num_save=20 means a
# checkpoint is saved every 5% of training. For longer training runs you would
# want a larger num_save so checkpoints are still written often enough, and vice versa.
num_save=100
save_interval=$((${train_iters} / ${num_save}))
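## With the values above (train_iters=2000000, num_save=100), a checkpoint is
## saved every 20000 iterations.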
## Activation checkpointing saves GPU memory, but reduces training speed
# activation_checkpoint="true"
activation_checkpoint="false"
## Whether or not to log optimizer states (norms, max abs values) to tensorboard.
## This is not required for training and might save GPU memory when turned off.
log_optimizer_state="true"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
## We use the public Pile dataset; see prepare_pile_data.py in the same directory
## for how to download and preprocess the data.
jobname="bert-pile"
## For internal use. Change data_home to your own training data path.
data_home="/vc_data_blob/users/conglli/the_pile_bert"
if [[ "$host" == *"webxt"* ]]; then
data_home="/blob/data/the_pile_bert"
fi
data_path="${data_home}/pile_bert_train_text_sentence"
vocab_path="bert-large-uncased-vocab.txt"
if [ ! -f "$vocab_path" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
fi
## Number of workers for the dataloader. We found that for BERT pre-training,
## num_workers greatly affects data loading time and overall training time. In
## our experiments with 64 GPUs, performance peaked at num_workers = 4, but this
## may differ depending on hardware. Also note that a larger num_workers adds
## more CPU compute/memory overhead.
num_workers=4
jobname="${jobname}-${model_size}B-iters-${train_iters_in_million}M"
jobname="${jobname}-lr-${lr}-min-${min_lr}-wmup-${lr_warmup_iters}-dcy-${lr_decay_iters_in_million}M-sty-${lr_decay_style}"
jobname="${jobname}-gbs-${global_batch_size}-mbs-${batch_size}-gpu-${num_gpus}-zero-${zero_stage}-mp-${mp_size}-pp-${pp_size}"
if [ "${no_pp}" = "true" ]; then
jobname="${jobname}-nopp"
fi
username=$(whoami)
output_home="/vc_data_blob/users/${username}/project/bert_with_pile"
if [[ "$host" == *"webxt"* ]]; then
output_home="/blob/users/${username}/project/bert_with_pile"
fi
log_path="${output_home}/log/"
checkpoint_path="${output_home}/checkpoint/${jobname}"
## Microsoft internal constraint: because tensorboard is logged by last rank,
## it's better to put the path in NFS instead of Blob.
tensorboard_dir="/vc_data/users/${username}/project/bert_with_pile/tensorboard/"
tensorboard_path="${tensorboard_dir}${jobname}_${host}_${current_time}"
mkdir -p ${log_path}
mkdir -p ${checkpoint_path}
mkdir -p ${tensorboard_path}
###############################################################################
data_options=" \
--vocab-file ${vocab_path} \
--data-path ${data_path} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.999 \
--init-method-std ${init_std} \
--tensor-model-parallel-size ${mp_size} \
--lr-decay-iters ${lr_decay_iters} \
--lr-warmup-iters ${lr_warmup_iters} \
--micro-batch-size ${batch_size} \
--global-batch-size ${global_batch_size} \
--num-layers ${num_layers} \
--hidden-size ${hidden_size} \
--num-attention-heads ${num_attn_heads} \
--seq-length ${seq_len} \
--max-position-embeddings ${seq_len} \
--train-iters ${train_iters} \
--lr ${lr} \
--min-lr ${min_lr} \
--lr-decay-style ${lr_decay_style} \
--split 949,50,1 \
--log-interval ${log_interval} \
--eval-interval ${eval_interval} \
--eval-iters ${eval_iters} \
--save-interval ${save_interval} \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--num-workers ${num_workers} \
--fp16 \
--load ${checkpoint_path} \
--save ${checkpoint_path} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${tensorboard_path}"
if [ "${activation_checkpoint}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [ "${log_optimizer_state}" = "true" ]; then
megatron_options="${megatron_options} \
--log-optimizer-states-to-tensorboard"
fi
template_json="ds_config_bert_TEMPLATE.json"
config_json="ds_config_bert_bsz${global_batch_size}_mbsz${batch_size}_log${log_interval}_zero${zero_stage}.json"
if [[ $zero_stage -gt 0 ]]; then
sed "s/CONFIG_BATCH_SIZE/${global_batch_size}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/false/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
> ${config_json}
else
sed "s/CONFIG_BATCH_SIZE/${global_batch_size}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
> ${config_json}
fi
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${zero_stage} \
--pipeline-model-parallel-size ${pp_size}"
if [[ "${no_pp}" = "true" ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${activation_checkpoint}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
## When saving checkpoints to storage with a cache, there can be consistency
## issues with the pointer to the latest checkpoint. Here we find the correct
## pointer and broadcast it to all nodes.
iteration_file="$checkpoint_path/latest_checkpointed_iteration.txt"
iteration_file_2="$checkpoint_path/latest"
iteration=0
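## Ask every worker node for its latest checkpointed iteration and keep the
## maximum; the agreed-upon value is then written back to all nodes below so
## that every rank resumes from the same step.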
for (( node = 0; node <= num_node-1; node++ ))
do
if $(ssh -q worker-"$node" "test -f \"$iteration_file\""); then
local_iteration=$(ssh -q worker-"$node" cat $iteration_file)
iteration=$(( ${local_iteration} > ${iteration} ? ${local_iteration} : ${iteration} ))
fi
done
if [[ $iteration -gt 0 ]]; then
iteration_2="global_step${iteration}"
ds_ssh "echo $iteration > $iteration_file"
ds_ssh "echo $iteration_2 > $iteration_file_2"
fi
deepspeed ${dir}/../../pretrain_bert.py ${megatron_options} ${data_options} ${deepspeed_options} &>> ${log_path}/${jobname}_${host}_${current_time}.log

View File

@ -0,0 +1,128 @@
import zstandard
import sys
import time
import os
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),
os.path.pardir,os.path.pardir)))
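# The two os.path.pardir hops add the repository root to sys.path so that
# megatron.data can be imported when this script is run from its own directory.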
from megatron.data import indexed_dataset
def pile_download(download_url, file_path, i):
start = time.time()
zstd_file_path = f"{file_path}{i:02}.jsonl.zst"
download_path = f"{download_url}{i:02}.jsonl.zst"
if not os.path.exists(zstd_file_path):
os.system(f"wget -P {file_path} {download_path}")
print(f"Finished downloading chunk {i} in {time.time() - start} sec")
def pile_decompress(download_url, file_path, i):
zstd_file_path = f"{file_path}{i:02}.jsonl.zst"
output_path = f"{file_path}{i:02}.jsonl"
if not os.path.exists(output_path):
if not os.path.exists(zstd_file_path):
pile_download(download_url, file_path, i)
start = time.time()
with open(zstd_file_path, 'rb') as compressed:
decomp = zstandard.ZstdDecompressor()
with open(output_path, 'wb') as destination:
decomp.copy_stream(compressed, destination)
os.remove(zstd_file_path)
print(f"Finished decompressing chunk {i} in {time.time() - start} sec")
def pile_preprocess(download_url, file_path, vocab_file, num_workers, i):
json_file_path = f"{file_path}{i:02}.jsonl"
output_prefix = f"{file_path}pile_bert_train_{i:02}"
if not os.path.exists(f"{output_prefix}_text_sentence.idx"):
if not os.path.exists(json_file_path):
pile_decompress(download_url, file_path, i)
start = time.time()
cmd = f"python ../../tools/preprocess_data.py \
--input {json_file_path} \
--output-prefix {output_prefix} \
--vocab {vocab_file} \
--dataset-impl mmap \
--tokenizer-type BertWordPieceLowerCase \
--split-sentences \
--workers {num_workers} "
# It's possible to hit a MemoryError during the command above since memory
# usage is proportional to num_workers. In that case we delete the incomplete
# output and the user should retry with a smaller num_workers. In our
# experience, chunks 6, 7, 9, 17, 18, 20, 21, 24, and 27 have particularly
# large memory usage.
if os.system(cmd) == 0: # Success
os.remove(json_file_path)
else:
print(f"Error: chunk {i} preprocessing got error, delete \
incomplete output. If MemoryError appeared, please retry \
with num_workers smaller than {num_workers}.")
if os.path.exists(f"{output_prefix}_text_sentence.idx"):
os.remove(f"{output_prefix}_text_sentence.idx")
if os.path.exists(f"{output_prefix}_text_sentence.bin"):
os.remove(f"{output_prefix}_text_sentence.bin")
print(f"Finished preprocessing chunk {i} in {time.time() - start} sec")
def pile_merge(file_path):
start = time.time()
num_chunks = 30
vocab_size = 30524
for i in range(num_chunks):
output_prefix = f"{file_path}pile_bert_train_{i:02}"
assert os.path.exists(f"{output_prefix}_text_sentence.idx")
assert os.path.exists(f"{output_prefix}_text_sentence.bin")
builder = indexed_dataset.make_builder(
f"{file_path}pile_bert_train_text_sentence.bin", impl="mmap",
vocab_size=vocab_size)
for i in range(num_chunks):
chunk_file = f"{file_path}pile_bert_train_{i:02}_text_sentence"
print(f"Merging file {chunk_file}")
builder.merge_file_(chunk_file)
print("Finalizing merged file ...")
builder.finalize(f"{file_path}pile_bert_train_text_sentence.idx")
print(f"Finished merging in {time.time() - start} sec")
# After verifying the merged data with real training, you may want to
# delete the data chunks.
# for i in range(num_chunks):
# output_prefix = f"{file_path}pile_bert_train_{i:02}"
# os.remove(f"{output_prefix}_text_sentence.idx")
# os.remove(f"{output_prefix}_text_sentence.bin")
if __name__ == '__main__':
# Path to download and store all the output files during the whole process.
# Estimated max storage usage would be around 1.6 TB (or 780GB if you skip the
# final merge). Memory usage is proportional to num_workers below (it can
# be as high as O(300GB) when num_workers is around 20).
file_path = "/blob/data/the_pile_bert/"
# The raw Pile data has 30 compressed .zst chunks. To process all chunks on a
# single machine, run "python prepare_pile_data.py range 0 30". You can also
# split the chunks across multiple machines to speed things up, since
# processing one chunk can take hours. The whole process uses only the CPU.
if sys.argv[1] == "merge":
# "python prepare_pile_data.py merge" means merge all 30 processed data
# chunks. Run it only after all 30 chunks are preprocessed. The memory
# usage during merge is about 600GB. If you don't have enough memory,
# one solution is to directly use the 30 data chunks as multiple
# datasets. See '--data-path' in
# github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/arguments.py
pile_merge(file_path)
else:
if sys.argv[1] == "range":
# "python prepare_pile_data.py range 0 30" means process chunk 0-29
selected_chunk = range(int(sys.argv[2]), int(sys.argv[3]))
else:
# "python prepare_pile_data.py 2 5 8" means process chunk 2, 5, 8
selected_chunk = [int(x) for x in sys.argv[1:]]
print("selected_chunk: ", selected_chunk)
# Number of worker processes. Adjust based on your CPU/memory.
num_workers = 20
# Where the raw Pile data can be downloaded. The URL may change in the
# future. Contact EleutherAI (https://github.com/EleutherAI/the-pile)
# if this URL does not work.
download_url = "https://the-eye.eu/public/AI/pile/train/"
vocab_file = "bert-large-uncased-vocab.txt"
vocab_url = "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt"
if not os.path.exists(vocab_file):
os.system(f"wget {vocab_url}")
os.makedirs(file_path, exist_ok=True)
for i in selected_chunk:
pile_preprocess(download_url, file_path, vocab_file, num_workers, i)
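# A typical end-to-end flow (assuming enough disk space and memory) is:
#   python prepare_pile_data.py range 0 30   # download, decompress and preprocess all 30 chunks
#   python prepare_pile_data.py merge        # merge the processed chunks into one dataset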

View File

@ -0,0 +1,253 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
MODEL_SIZE=0.125
NUM_LAYERS=12
HIDDEN_SIZE=768
NUM_ATTN_HEADS=12
GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
LR=6.0e-5
MIN_LR=6.0e-5
# Curriculum learning (CL) enables stable large-batch training
# GLOBAL_BATCH_SIZE=16 # 8x
# LR=6e-4 # 4x
###############################################################################
### Training duration configs
## The main termination condition; the original GPT-3 paper trains for 300B tokens
# TRAIN_TOKENS=300000000000
TRAIN_TOKENS=5250000000
## TRAIN_SAMPLES is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS above,
## and techniques like curriculum learning use fewer tokens in some samples, we
## set this config large enough to make sure we have enough processed data and
## don't terminate on TRAIN_SAMPLES.
TRAIN_SAMPLES=$(( ${TRAIN_TOKENS} * 3 / ${SEQ_LEN} ))
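## For example, with the reduced TRAIN_TOKENS=5.25B and SEQ_LEN=2048 above,
## this indexes roughly 7.7M samples (5.25e9 * 3 / 2048).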
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration. This token-based config is preferable since it
## does not need to be readjusted when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
WARMUP_TOKENS=375000000
LR_DECAY_TOKENS=260000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=4
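## Sanity check, assuming the 64 GPUs implied by the job name below: with
## PP_SIZE=MP_SIZE=1, GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS = 256/64 = 4,
## so BATCH_SIZE=4 fills each GPU's share per step without gradient accumulation.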
## Model parallelism, 1 is no MP
MP_SIZE=1
## Pipeline parallelism. To disable PP, set PP_SIZE to 1 and NO_PP to true.
PP_SIZE=1
NO_PP="true"
## ZeRO stage
ZERO_STAGE=0
## Total number of GPUs
NUM_GPUS=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2))
NUM_GPUS_PERNODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
NUM_NODE=$(( ${NUM_GPUS} / ${NUM_GPUS_PERNODE} ))
DP_SIZE=$(( ${NUM_GPUS} / ${PP_SIZE} / ${MP_SIZE} ))
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="false"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=72
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_STEP=$(( ${CL_TOKENS} * 1000000000 / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=1000
## Standard deviation for weight initialization. Usually a larger model needs a
## lower std. We used a heuristic equation of sqrt(1/3/HIDDEN_SIZE) from the
## MT-NLG 530B work (https://arxiv.org/pdf/2201.11990.pdf)
INIT_STD=0.02
## Activation checkpointing saves GPU memory, but reduces training speed
# ACTIVATION_CHECKPOINT="true"
ACTIVATION_CHECKPOINT="false"
## Whether or not to log optimizer states (norms, max abs values) to tensorboard.
## This is not required for training and might save GPU memory when turned off.
LOG_OPTIMIZER_STATE="true"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="125M10L_Compression_Test_INT8_64gpu_lr6e-5_tokens5.25B_nocl"
if [ "${NO_PP}" = "true" ]; then
NAME="${NAME}-no_pp"
fi
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-startseqlen-${CL_START_SEQLEN}-step-${CL_STEP}-token-${CL_TOKENS}B"
fi
LOG_PATH="log/"
TENSORBOARD_PATH="tensorboard/${NAME}_${host}_${current_time}"
CHECKPOINT_PATH="/blob/users/zheweiyao/compression_library/checkpoint/${NAME}"
mkdir -p ${LOG_PATH}
mkdir -p ${TENSORBOARD_PATH}
mkdir -p ${CHECKPOINT_PATH}
VOCAB_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json
MERGE_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt
# The public Pile dataset can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
# For cluster Azure-EastUS-V100-32GB-4, Lab-RR1-V100
# DATA_PATH=/vc_data_blob/users/conglli/the_pile_public_merged_nopreprocessing/pile_text_document
# For cluster Azure-WestUS3-A100
DATA_PATH=/blob/data/the_pile_public_merged_nopreprocessing/pile_text_document
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_PATH} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers 10 \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-samples ${TRAIN_SAMPLES} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load /blob/users/minjiaz/project/gpt3_distillation/checkpoint/gpt3-kd-staged-alpha1-with-pile-0.125B-lr-2.4e-3-minlr-6.0e-5-bs-2048-gpus-32-zero-0-mp-1-pp-1-no_pp-cl-startseqlen-72-step-27638-token-60B/ \
--save ${CHECKPOINT_PATH} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--no-load-lr-state \
--reset-iteration \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_PATH}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [ "${LOG_OPTIMIZER_STATE}" = "true" ]; then
megatron_options="${megatron_options} \
--log-optimizer-states-to-tensorboard"
fi
template_json="ds_config_gpt_TEMPLATE_compression.json"
config_json="ds_config_${NAME}.json"
if [[ $ZERO_STAGE -gt 0 ]]; then
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/${ZERO_STAGE}/" \
| sed "s/PRESCALE_GRAD/false/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
else
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/${ZERO_STAGE}/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
fi
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${ZERO_STAGE} \
--pipeline-model-parallel-size ${PP_SIZE}"
if [[ "${NO_PP}" = "true" ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
## When saving checkpoints to storage with a cache, there can be consistency
## issues with the pointer to the latest checkpoint. Here we find the correct
## pointer and broadcast it to all nodes.
ITERATION_FILE="$CHECKPOINT_PATH/latest_checkpointed_iteration.txt"
ITERATION_FILE_2="$CHECKPOINT_PATH/latest"
ITERATION=0
for (( node = 0; node <= NUM_NODE-1; node++ ))
do
if $(ssh -q worker-"$node" "test -f \"$ITERATION_FILE\""); then
LOCAL_ITERATION=$(ssh -q worker-"$node" cat $ITERATION_FILE)
ITERATION=$(( ${LOCAL_ITERATION} > ${ITERATION} ? ${LOCAL_ITERATION} : ${ITERATION} ))
fi
done
if [[ $ITERATION -gt 0 ]]; then
ITERATION_2="global_step${ITERATION}"
ds_ssh "echo $ITERATION > $ITERATION_FILE"
ds_ssh "echo $ITERATION_2 > $ITERATION_FILE_2"
fi
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${LOG_PATH}/${NAME}.log"
# run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options}"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,253 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
MODEL_SIZE=0.125
NUM_LAYERS=12
HIDDEN_SIZE=768
NUM_ATTN_HEADS=12
GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
LR=6.0e-5
MIN_LR=6.0e-5
# Curriculum learning (CL) enables stable large-batch training
# GLOBAL_BATCH_SIZE=16 # 8x
# LR=6e-4 # 4x
###############################################################################
### Training duration configs
## The main termination condition; the original GPT-3 paper trains for 300B tokens
# TRAIN_TOKENS=300000000000
TRAIN_TOKENS=5250000000
## TRAIN_SAMPLES is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS above,
## and techniques like curriculum learning use fewer tokens in some samples, we
## set this config large enough to make sure we have enough processed data and
## don't terminate on TRAIN_SAMPLES.
TRAIN_SAMPLES=$(( ${TRAIN_TOKENS} * 3 / ${SEQ_LEN} ))
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration. This token-based config is preferable since it
## does not need to be readjusted when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
WARMUP_TOKENS=375000000
LR_DECAY_TOKENS=260000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=4
## Model parallelism, 1 is no MP
MP_SIZE=1
## Pipeline parallelism. To disable PP, set PP_SIZE to 1 and NO_PP to true.
PP_SIZE=1
NO_PP="true"
## ZeRO stage
ZERO_STAGE=0
## Total number of GPUs
NUM_GPUS=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2))
NUM_GPUS_PERNODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
NUM_NODE=$(( ${NUM_GPUS} / ${NUM_GPUS_PERNODE} ))
DP_SIZE=$(( ${NUM_GPUS} / ${PP_SIZE} / ${MP_SIZE} ))
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="false"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=72
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_STEP=$(( ${CL_TOKENS} * 1000000000 / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=1000
## Standard deviation for weight initialization. Usually a larger model needs a
## lower std. We used a heuristic equation of sqrt(1/3/HIDDEN_SIZE) from the
## MT-NLG 530B work (https://arxiv.org/pdf/2201.11990.pdf)
INIT_STD=0.02
## Activation checkpointing saves GPU memory, but reduces training speed
# ACTIVATION_CHECKPOINT="true"
ACTIVATION_CHECKPOINT="false"
## Whether or not to log optimizer states (norms, max abs values) to tensorboard.
## This is not required for training and might save GPU memory when turned off.
LOG_OPTIMIZER_STATE="true"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="125M10L_Compression_Test_INT8_64gpu_lr6e-5_tokens5.25B_nocl_alpha"
if [ "${NO_PP}" = "true" ]; then
NAME="${NAME}-no_pp"
fi
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-startseqlen-${CL_START_SEQLEN}-step-${CL_STEP}-token-${CL_TOKENS}B"
fi
LOG_PATH="log/"
TENSORBOARD_PATH="tensorboard/${NAME}_${host}_${current_time}"
CHECKPOINT_PATH="/blob/users/minjiaz/compression_library/checkpoint/${NAME}"
mkdir -p ${LOG_PATH}
mkdir -p ${TENSORBOARD_PATH}
mkdir -p ${CHECKPOINT_PATH}
VOCAB_PATH=/blob/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json
MERGE_PATH=/blob/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt
# The public Pile dataset can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
# For cluster Azure-EastUS-V100-32GB-4, Lab-RR1-V100
# DATA_PATH=/vc_data_blob/users/conglli/the_pile_public_merged_nopreprocessing/pile_text_document
# For cluster Azure-WestUS3-A100
DATA_PATH=/blob/data/the_pile_public_merged_nopreprocessing/pile_text_document
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_PATH} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers 10 \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-samples ${TRAIN_SAMPLES} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load /blob/users/minjiaz/project/gpt3_distillation/checkpoint/gpt3-kd-staged-alpha1-with-pile-0.125B-lr-2.4e-3-minlr-6.0e-5-bs-2048-gpus-32-zero-0-mp-1-pp-1-no_pp-cl-startseqlen-72-step-27638-token-60B/ \
--save ${CHECKPOINT_PATH} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--no-load-lr-state \
--reset-iteration \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_PATH}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [ "${LOG_OPTIMIZER_STATE}" = "true" ]; then
megatron_options="${megatron_options} \
--log-optimizer-states-to-tensorboard"
fi
template_json="ds_config_gpt_TEMPLATE_compression.json"
config_json="ds_config_${NAME}.json"
if [[ $ZERO_STAGE -gt 0 ]]; then
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/${ZERO_STAGE}/" \
| sed "s/PRESCALE_GRAD/false/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
else
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/${ZERO_STAGE}/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
fi
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${ZERO_STAGE} \
--pipeline-model-parallel-size ${PP_SIZE}"
if [[ "${NO_PP}" = "true" ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
## When saving checkpoints to storage with a cache, there can be consistency
## issues with the pointer to the latest checkpoint. Here we find the correct
## pointer and broadcast it to all nodes.
ITERATION_FILE="$CHECKPOINT_PATH/latest_checkpointed_iteration.txt"
ITERATION_FILE_2="$CHECKPOINT_PATH/latest"
ITERATION=0
for (( node = 0; node <= NUM_NODE-1; node++ ))
do
if $(ssh -q worker-"$node" "test -f \"$ITERATION_FILE\""); then
LOCAL_ITERATION=$(ssh -q worker-"$node" cat $ITERATION_FILE)
ITERATION=$(( ${LOCAL_ITERATION} > ${ITERATION} ? ${LOCAL_ITERATION} : ${ITERATION} ))
fi
done
if [[ $ITERATION -gt 0 ]]; then
ITERATION_2="global_step${ITERATION}"
ds_ssh "echo $ITERATION > $ITERATION_FILE"
ds_ssh "echo $ITERATION_2 > $ITERATION_FILE_2"
fi
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${LOG_PATH}/${NAME}.log"
# run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options}"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,253 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
MODEL_SIZE=0.125
NUM_LAYERS=12
HIDDEN_SIZE=768
NUM_ATTN_HEADS=12
GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
LR=6.0e-5
MIN_LR=6.0e-5
# Curriculum learning (CL) enables stable large-batch training
# GLOBAL_BATCH_SIZE=16 # 8x
# LR=6e-4 # 4x
###############################################################################
### Training duration configs
## The main termination condition; the original GPT-3 paper trains for 300B tokens
# TRAIN_TOKENS=300000000000
TRAIN_TOKENS=5250000000
## TRAIN_SAMPLES is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS above,
## and techniques like curriculum learning use fewer tokens in some samples, we
## set this config large enough to make sure we have enough processed data and
## don't terminate on TRAIN_SAMPLES.
TRAIN_SAMPLES=$(( ${TRAIN_TOKENS} * 3 / ${SEQ_LEN} ))
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration. This token-based config is preferable since it
## does not need to be readjusted when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
WARMUP_TOKENS=375000000
LR_DECAY_TOKENS=260000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=4
## Model parallelism, 1 is no MP
MP_SIZE=1
## Pipeline parallelism. To disable PP, set PP_SIZE to 1 and NO_PP to true.
PP_SIZE=1
NO_PP="true"
## ZeRO stage
ZERO_STAGE=0
## Total number of GPUs
NUM_GPUS=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2))
NUM_GPUS_PERNODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
NUM_NODE=$(( ${NUM_GPUS} / ${NUM_GPUS_PERNODE} ))
DP_SIZE=$(( ${NUM_GPUS} / ${PP_SIZE} / ${MP_SIZE} ))
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="false"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=72
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_STEP=$(( ${CL_TOKENS} * 1000000000 / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=1000
## Standard deviation for weight initialization. Usually a larger model needs a
## lower std. We used a heuristic equation of sqrt(1/3/HIDDEN_SIZE) from the
## MT-NLG 530B work (https://arxiv.org/pdf/2201.11990.pdf)
INIT_STD=0.02
## Activation checkpointing saves GPU memory, but reduces training speed
# ACTIVATION_CHECKPOINT="true"
ACTIVATION_CHECKPOINT="false"
## Whether or not to log optimizer states (norms, max abs values) to tensorboard.
## This is not required for training and might save GPU memory when turned off.
LOG_OPTIMIZER_STATE="true"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="125M12L_Compression_Test_INT8_64gpu_lr6e-5_tokens5.25B_nocl_alpha"
if [ "${NO_PP}" = "true" ]; then
NAME="${NAME}-no_pp"
fi
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-startseqlen-${CL_START_SEQLEN}-step-${CL_STEP}-token-${CL_TOKENS}B"
fi
LOG_PATH="log/"
TENSORBOARD_PATH="tensorboard/${NAME}_${host}_${current_time}"
CHECKPOINT_PATH="/blob/users/minjiaz/compression_library/checkpoint/${NAME}"
mkdir -p ${LOG_PATH}
mkdir -p ${TENSORBOARD_PATH}
mkdir -p ${CHECKPOINT_PATH}
VOCAB_PATH=/blob/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json
MERGE_PATH=/blob/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt
# The public Pile dataset can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
# For cluster Azure-EastUS-V100-32GB-4, Lab-RR1-V100
# DATA_PATH=/vc_data_blob/users/conglli/the_pile_public_merged_nopreprocessing/pile_text_document
# For cluster Azure-WestUS3-A100
DATA_PATH=/blob/data/the_pile_public_merged_nopreprocessing/pile_text_document
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_PATH} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers 12 \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-samples ${TRAIN_SAMPLES} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load /blob/users/conglli/project/gpt3_with_pile/checkpoint/gpt3-with-pile-0.125B-lr-2.4e-3-minlr-6.0e-5-bs-2048-gpus-64-zero-0-mp-1-pp-1-no_pp-cl-startseqlen-72-step-27638-token-60B/ \
--save ${CHECKPOINT_PATH} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--no-load-lr-state \
--reset-iteration \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_PATH}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [ "${LOG_OPTIMIZER_STATE}" = "true" ]; then
megatron_options="${megatron_options} \
--log-optimizer-states-to-tensorboard"
fi
template_json="ds_config_gpt_TEMPLATE_compression.json"
config_json="ds_config_${NAME}.json"
if [[ $ZERO_STAGE -gt 0 ]]; then
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/${ZERO_STAGE}/" \
| sed "s/PRESCALE_GRAD/false/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
else
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/${ZERO_STAGE}/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
fi
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${ZERO_STAGE} \
--pipeline-model-parallel-size ${PP_SIZE}"
if [[ "${NO_PP}" = "true" ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
## When saving checkpoints to storage with a cache, there can be consistency
## issues with the pointer to the latest checkpoint. Here we find the correct
## pointer and broadcast it to all nodes.
ITERATION_FILE="$CHECKPOINT_PATH/latest_checkpointed_iteration.txt"
ITERATION_FILE_2="$CHECKPOINT_PATH/latest"
ITERATION=0
for (( node = 0; node <= NUM_NODE-1; node++ ))
do
if $(ssh -q worker-"$node" "test -f \"$ITERATION_FILE\""); then
LOCAL_ITERATION=$(ssh -q worker-"$node" cat $ITERATION_FILE)
ITERATION=$(( ${LOCAL_ITERATION} > ${ITERATION} ? ${LOCAL_ITERATION} : ${ITERATION} ))
fi
done
if [[ $ITERATION -gt 0 ]]; then
ITERATION_2="global_step${ITERATION}"
ds_ssh "echo $ITERATION > $ITERATION_FILE"
ds_ssh "echo $ITERATION_2 > $ITERATION_FILE_2"
fi
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${LOG_PATH}/${NAME}.log"
# run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options}"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,39 @@
{
"train_batch_size" : CONFIG_BATCH_SIZE,
"train_micro_batch_size_per_gpu": CONFIG_MBSIZE,
"steps_per_print": LOG_INTERVAL,
"zero_optimization": {
"stage": ZERO_STAGE,
"elastic_checkpoint": true
},
"gradient_clipping": 1.0,
"prescale_gradients": PRESCALE_GRAD,
"fp16": {
"enabled": CONFIG_FP16_ENABLED,
"loss_scale": 0,
"loss_scale_window": 500,
"hysteresis": 2,
"min_loss_scale": 1,
"initial_scale_power": 11
},
"bf16": {
"enabled": CONFIG_BF16_ENABLED
},
"curriculum_learning": {
"enabled": CONFIG_CL_ENABLED,
"curriculum_type": "seqlen",
"min_difficulty": CONFIG_CL_MIN,
"max_difficulty": CONFIG_CL_MAX,
"schedule_type": "fixed_linear",
"schedule_config": {
"total_curriculum_step": CONFIG_CL_DURATION,
"difficulty_step": 8
}
},
"wall_clock_breakdown" : false
}

View File

@ -0,0 +1,87 @@
{
"train_batch_size" : CONFIG_BATCH_SIZE,
"train_micro_batch_size_per_gpu": CONFIG_MBSIZE,
"steps_per_print": LOG_INTERVAL,
"zero_optimization": {
"stage": ZERO_STAGE,
"elastic_checkpoint": true
},
"gradient_clipping": 1.0,
"prescale_gradients": PRESCALE_GRAD,
"fp16": {
"enabled": CONFIG_FP16_ENABLED,
"loss_scale": 0,
"loss_scale_window": 500,
"hysteresis": 2,
"min_loss_scale": 1,
"initial_scale_power": 11
},
"bf16": {
"enabled": CONFIG_BF16_ENABLED
},
"curriculum_learning": {
"enabled": CONFIG_CL_ENABLED,
"curriculum_type": "seqlen",
"min_difficulty": CONFIG_CL_MIN,
"max_difficulty": CONFIG_CL_MAX,
"schedule_type": "fixed_linear",
"schedule_config": {
"total_curriculum_step": CONFIG_CL_DURATION,
"difficulty_step": 8
}
},
"wall_clock_breakdown" : false,
"compression_training": {
"weight_quantization": {
"shared_parameters":{
"enabled": true,
"quantizer_kernel": false,
"schedule_offset": 50,
"quantize_groups": 48,
"quantize_verbose": false,
"quantization_type": "symmetric",
"rounding": "nearest",
"fp16_mixed_quantize":{
"enabled": false,
"quantize_change_ratio": 0.001
}
},
"different_groups":{
"wq1": {
"params": {
"start_bits": 12,
"target_bits": 4,
"quantization_period": 50
},
"modules": [
"encoder.layers"
]
}
}
},
"activation_quantization": {
"shared_parameters":{
"enabled": true,
"quantization_type": "asymmetric",
"range_calibration": "static",
"schedule_offset": 50
},
"different_groups":{
"aq1": {
"params": {
"bits": 8
},
"modules": [
"encoder.layers"
]
}
}
}
}
}

View File

@ -0,0 +1,74 @@
# This is an example zero-shot eval script. Please first read readme_evalharness.md in the same directory.
# CHECKPOINT_PATH=/blob/users/minjiaz/compression_library/checkpoint/125M10L_Compression_Test_INT8_64gpu_lr6e-5_tokens5.25B_nocl_alpha-no_pp/global_step2000/
# CHECKPOINT_PATH=/blob/users/conglli/project/gpt3_with_pile/checkpoint/gpt3-with-pile-0.125B-lr-2.4e-3-minlr-6.0e-5-bs-2048-gpus-64-zero-0-mp-1-pp-1-no_pp-cl-startseqlen-72-step-27638-token-60B/global_step71000/
# CHECKPOINT_PATH=/blob/users/minjiaz/compression_library/checkpoint/125M12L_Compression_Test_INT8_64gpu_lr6e-5_tokens5.25B_nocl_alpha-no_pp/global_step5000/
CHECKPOINT_PATH=/blob/users/minjiaz/project/gpt3_distillation/checkpoint/gpt3-kd-test2-alpha1-with-pile-0.125B-lr-2.4e-3-minlr-6.0e-5-bs-2048-gpus-15-zero-0-mp-1-pp-1-no_pp-cl-startseqlen-72-step-27638-token-60B/global_step71426/
CONFIG_PATH=ds_config_gpt3-with-pile-0.125B-lr-2.4e-3-minlr-6.0e-5-bs-2048-gpus--1-zero-0-mp-1-pp-1-no_pp-cl-startseqlen-72-step-27638-token-60B.json
RESULT_PATH=gpt3-with-pile-0.125B-lr-2.4e-3-minlr-6.0e-5-bs-2048-gpus-128-zero-0-mp-1-pp-1-no_pp-cl-startseqlen-72-step-20728-token-45B_global_step81566.log
PP_SIZE=1
TP_SIZE=1
NO_PP="true"
EP_PARALLEL_SIZE=1
# Currently the eval harness does not support data parallelism.
# However, for MoE models it's possible to enable a "fake data parallel" mode
# in order to load experts on multiple GPUs. It is not real data parallelism
# because we load the same data on all GPUs. On the other hand, it's better to
# use fewer GPUs than in training to reduce communication overhead.
NUM_NODE=1
NUM_GPU_PER_NODE=1
# TASKS="lambada"
# WikiText-2, not used in the GPT-3 paper but used in the GPT-2 paper
TASKS="lambada,wikitext"
# Tasks that appeared in the GPT-3 paper (sorted in the order used in the paper), plus WikiText-2.
# TASKS="hellaswag,lambada,triviaqa,webqs,winogrande,piqa,arc_challenge,arc_easy,openbookqa,race,boolq,cb,copa,rte,wic,wsc,multirc,record,anli_r1,anli_r2,anli_r3,wikitext"
# All tasks confirmed to work; there are more tasks at https://github.com/EleutherAI/lm-evaluation-harness that we didn't test.
# TASKS="hellaswag,lambada,triviaqa,webqs,winogrande,piqa,arc_challenge,arc_easy,openbookqa,race,boolq,cb,copa,rte,wic,wsc,multirc,record,anli_r1,anli_r2,anli_r3,wikitext,logiqa,mathqa,mc_taco,mrpc,prost,pubmedqa,qnli,qqp,sciq,sst,wnli"
VOCAB_FILE=/blob/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json
MERGE_FILE=/blob/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt
export HF_DATASETS_OFFLINE=1
# Dummy arguments to make Megatron happy. There is no need to configure them:
# these and many other arguments don't matter here because the eval framework
# reads the actual arguments from the checkpoint file.
MEGATRON_REQUIRED_ARGS="\
--num-layers -1\
--hidden-size -1\
--num-attention-heads -1\
--seq-length -1 \
--max-position-embeddings -1
"
CMD="../../tasks/eval_harness/evaluate.py \
--load $CHECKPOINT_PATH\
--tensor-model-parallel-size $TP_SIZE \
--pipeline-model-parallel-size $PP_SIZE\
--moe-expert-parallel-size ${EP_PARALLEL_SIZE} \
--vocab-file $VOCAB_FILE\
--merge-file $MERGE_FILE\
--micro-batch-size 12\
--no-load-optim \
--no-load-rng \
--inference \
--disable-moe-token-dropping \
--adaptive_seq_len\
--eval_fp32\
--task_list $TASKS\
--results_path $RESULT_PATH \
--deepspeed \
--deepspeed_config $CONFIG_PATH \
$MEGATRON_REQUIRED_ARGS\
"
if [[ "${NO_PP}" = "true" ]]; then
CMD="${CMD} \
--no-pipeline-parallel"
fi
LAUNCHER="deepspeed --num_nodes $NUM_NODE --num_gpus $NUM_GPU_PER_NODE"
$LAUNCHER $CMD

View File

@ -0,0 +1,322 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
# MODEL_SIZE=0.125
# NUM_LAYERS=12
# HIDDEN_SIZE=768
# NUM_ATTN_HEADS=12
# GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
# MIN_LR=6.0e-5
## GPT-3 Medium 350M
# MODEL_SIZE=0.35
# NUM_LAYERS=24
# HIDDEN_SIZE=1024
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=3.0e-4
# MIN_LR=3.0e-5
## GPT-3 Large 760M
# MODEL_SIZE=0.76
# NUM_LAYERS=24
# HIDDEN_SIZE=1536
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=2.5e-4
# MIN_LR=2.5e-5
## GPT-3 XL 1.3B
MODEL_SIZE=1.3
NUM_LAYERS=24
HIDDEN_SIZE=2048
NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=512
# LR=2.0e-4
MIN_LR=2.0e-5
# Curriculum learning (CL) enables stable large-batch training
GLOBAL_BATCH_SIZE=4096 # 8x
LR=8.0e-4 # 4x
## GPT-3 2.7B
# MODEL_SIZE=2.7
# NUM_LAYERS=32
# HIDDEN_SIZE=2560
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=512
# LR=1.6e-4
# MIN_LR=1.6e-5
## GPT-3 6.7B
# MODEL_SIZE=6.7
# NUM_LAYERS=32
# HIDDEN_SIZE=4096
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=1024
# LR=1.2e-4
# MIN_LR=1.2e-5
## GPT-3 13B
# MODEL_SIZE=13
# NUM_LAYERS=40
# HIDDEN_SIZE=5120
# NUM_ATTN_HEADS=40
# GLOBAL_BATCH_SIZE=1024
# LR=1.0e-4
# MIN_LR=1.0e-5
## GPT-3 175B
# MODEL_SIZE=175
# NUM_LAYERS=96
# HIDDEN_SIZE=12288
# NUM_ATTN_HEADS=96
# GLOBAL_BATCH_SIZE=1536
# LR=0.6e-4
# MIN_LR=0.6e-5
###############################################################################
### Training duration configs
## The main termination condition; the original GPT-3 paper trains for 300B tokens
TRAIN_TOKENS=300000000000
## TRAIN_SAMPLES is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS above,
## and techniques like curriculum learning use fewer tokens in some samples, we
## set this config large enough to make sure we have enough processed data and
## don't terminate on TRAIN_SAMPLES.
TRAIN_SAMPLES=$(( ${TRAIN_TOKENS} * 3 / ${SEQ_LEN} ))
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration. This token-based config is preferable since it
## does not need to be readjusted when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
WARMUP_TOKENS=375000000
LR_DECAY_TOKENS=260000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=16
## Model parallelism, 1 is no MP
MP_SIZE=2
## Pipeline parallelism. To disable PP, set PP_SIZE to 1 and NO_PP to true.
PP_SIZE=1
NO_PP="true"
## ZeRO stage
ZERO_STAGE=0
## Total number of GPUs
NUM_GPUS=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2))
NUM_GPUS_PERNODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
NUM_NODE=$(( ${NUM_GPUS} / ${NUM_GPUS_PERNODE} ))
DP_SIZE=$(( ${NUM_GPUS} / ${PP_SIZE} / ${MP_SIZE} ))
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="true"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=80
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_STEP=$(( ${CL_TOKENS} * 1000000000 / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
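## Worked example with the values above: CL_AVG_SEQLEN = (80 + 2048) / 2 = 1064,
## so CL_STEP = 60e9 / (4096 * 1064) = 13767, matching the "step-13767" suffix
## in the teacher checkpoint path used below.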
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=10000
## Standard deviation for weight initialization. Usually a larger model needs a
## lower std. We used a heuristic equation of sqrt(1/3/HIDDEN_SIZE) from the
## MT-NLG 530B work (https://arxiv.org/pdf/2201.11990.pdf)
INIT_STD=0.013
## Activation checkpointing saves GPU memory, but reduces training speed
ACTIVATION_CHECKPOINT="true"
# ACTIVATION_CHECKPOINT="false"
## Whether or not to log optimizer states (norms, max abs values) to tensorboard.
## This is not required for training and might save GPU memory when turned off.
LOG_OPTIMIZER_STATE="true"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="gpt3-kd-with-pile-${MODEL_SIZE}B-lr-${LR}-minlr-${MIN_LR}-bs-${GLOBAL_BATCH_SIZE}-gpus-${NUM_GPUS}-zero-${ZERO_STAGE}-mp-${MP_SIZE}-pp-${PP_SIZE}"
if [ "${NO_PP}" = "true" ]; then
NAME="${NAME}-no_pp"
fi
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-startseqlen-${CL_START_SEQLEN}-step-${CL_STEP}-token-${CL_TOKENS}B"
fi
LOG_PATH="log/"
TENSORBOARD_PATH="tensorboard/${NAME}_${host}_${current_time}"
CHECKPOINT_PATH="/blob/users/minjiaz/project/gpt3_distillation/checkpoint/${NAME}"
mkdir -p ${LOG_PATH}
mkdir -p ${TENSORBOARD_PATH}
mkdir -p ${CHECKPOINT_PATH}
### KD configs
KD_BETA_CE=1
CHECKPOINT_PATH_TEACHER="/blob/users/conglli/project/gpt3_with_pile/checkpoint/gpt3-with-pile-1.3B-lr-8.0e-4-minlr-2.0e-5-bs-4096-gpus-128-zero-0-mp-2-pp-1-no_pp-cl-startseqlen-80-step-13767-token-60B/"
CHECKPOINT_PATH_SAVE="/blob/users/minjiaz/project/gpt3_distillation/checkpoint/${NAME}"
mkdir -p ${CHECKPOINT_PATH_SAVE}
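## Note on this KD recipe: the teacher is the 1.3B GPT-3 XL checkpoint above
## (NUM_LAYERS=24, HIDDEN_SIZE=2048), while the student trained here keeps the
## same hidden size but uses 21 layers (see "--num-layers 21" vs
## "--num-layers-teacher ${NUM_LAYERS}" below).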
VOCAB_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json
MERGE_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt
# The public Pile dataset can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
# DATA_PATH=/data/the_pile_public_merged_nopreprocessing/pile_text_document
# For cluster Azure-WestUS3-A100
DATA_PATH=/blob/data/the_pile_public_merged_nopreprocessing/pile_text_document
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_PATH} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers 21 \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-samples ${TRAIN_SAMPLES} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load ${CHECKPOINT_PATH} \
--save ${CHECKPOINT_PATH_SAVE} \
--kd \
--kd-beta-ce ${KD_BETA_CE} \
--num-layers-teacher ${NUM_LAYERS} \
--hidden-size-teacher ${HIDDEN_SIZE} \
--num-attention-heads-teacher ${NUM_ATTN_HEADS} \
--load-teacher ${CHECKPOINT_PATH_TEACHER} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_PATH}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [ "${LOG_OPTIMIZER_STATE}" = "true" ]; then
megatron_options="${megatron_options} \
--log-optimizer-states-to-tensorboard"
fi
template_json="ds_config_gpt_TEMPLATE.json"
config_json="ds_config_${NAME}.json"
if [[ $ZERO_STAGE -gt 0 ]]; then
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/${ZERO_STAGE}/" \
| sed "s/PRESCALE_GRAD/false/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
else
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/${ZERO_STAGE}/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
fi
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${ZERO_STAGE} \
--pipeline-model-parallel-size ${PP_SIZE}"
if [[ "${NO_PP}" = "true" ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
## When saving checkpoints to storage with a cache, there could be a consistency
## issue with the pointer to the latest checkpoint. Here we find the correct pointer
## and broadcast it to all nodes.
ITERATION_FILE="$CHECKPOINT_PATH/latest_checkpointed_iteration.txt"
ITERATION_FILE_2="$CHECKPOINT_PATH/latest"
ITERATION=0
for (( node = 0; node <= NUM_NODE-1; node++ ))
do
if $(ssh -q worker-"$node" "test -f \"$ITERATION_FILE\""); then
LOCAL_ITERATION=$(ssh -q worker-"$node" cat $ITERATION_FILE)
ITERATION=$(( ${LOCAL_ITERATION} > ${ITERATION} ? ${LOCAL_ITERATION} : ${ITERATION} ))
fi
done
if [[ $ITERATION -gt 0 ]]; then
ITERATION_2="global_step${ITERATION}"
ds_ssh "echo $ITERATION > $ITERATION_FILE"
ds_ssh "echo $ITERATION_2 > $ITERATION_FILE_2"
fi
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${LOG_PATH}/${NAME}_${host}_${current_time}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,323 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
MODEL_SIZE=0.125
NUM_LAYERS=12
HIDDEN_SIZE=768
NUM_ATTN_HEADS=12
# GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
MIN_LR=6.0e-5
# Curriculum learning (CL) enables stable large-batch training
GLOBAL_BATCH_SIZE=2048 # 8x
LR=2.4e-3 # 4x
## GPT-3 Medium 350M
# MODEL_SIZE=0.35
# NUM_LAYERS=24
# HIDDEN_SIZE=1024
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=3.0e-4
# MIN_LR=3.0e-5
## GPT-3 Large 760M
# MODEL_SIZE=0.76
# NUM_LAYERS=24
# HIDDEN_SIZE=1536
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=2.5e-4
# MIN_LR=2.5e-5
## GPT-3 XL 1.3B
# MODEL_SIZE=1.3
# NUM_LAYERS=24
# HIDDEN_SIZE=2048
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=512
# LR=2.0e-4
# MIN_LR=2.0e-5
## GPT-3 2.7B
# MODEL_SIZE=2.7
# NUM_LAYERS=32
# HIDDEN_SIZE=2560
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=512
# LR=1.6e-4
# MIN_LR=1.6e-5
## GPT-3 6.7B
# MODEL_SIZE=6.7
# NUM_LAYERS=32
# HIDDEN_SIZE=4096
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=1024
# LR=1.2e-4
# MIN_LR=1.2e-5
## GPT-3 13B
# MODEL_SIZE=13
# NUM_LAYERS=40
# HIDDEN_SIZE=5120
# NUM_ATTN_HEADS=40
# GLOBAL_BATCH_SIZE=1024
# LR=1.0e-4
# MIN_LR=1.0e-5
## GPT-3 175B
# MODEL_SIZE=175
# NUM_LAYERS=96
# HIDDEN_SIZE=12288
# NUM_ATTN_HEADS=96
# GLOBAL_BATCH_SIZE=1536
# LR=0.6e-4
# MIN_LR=0.6e-5
###############################################################################
### Training duration configs
## The main termination condition, original GPT-3 paper trains for 300B tokens
TRAIN_TOKENS=300000000000
## TRAIN_SAMPLES is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS above, and
## techniques like curriculum learning use fewer tokens in some samples, we set
## this config large enough to make sure we have enough processed data and don't
## terminate by TRAIN_SAMPLES.
TRAIN_SAMPLES=$(( ${TRAIN_TOKENS} * 3 / ${SEQ_LEN} ))
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration. This token-based config is preferable since it
## does not need to be readjusted when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
WARMUP_TOKENS=375000000
LR_DECAY_TOKENS=260000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=8
## Model parallelism, 1 is no MP
MP_SIZE=1
## Pipeline parallelism. To disable PP, set PP_SIZE to 1 and NO_PP to true.
PP_SIZE=1
NO_PP="true"
## ZeRO stage
ZERO_STAGE=0
## Total number of GPUs
NUM_GPUS=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2))
NUM_GPUS_PERNODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
NUM_NODE=$(( ${NUM_GPUS} / ${NUM_GPUS_PERNODE} ))
DP_SIZE=$(( ${NUM_GPUS} / ${PP_SIZE} / ${MP_SIZE} ))
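## Optional sanity check (a sketch, not part of the original recipe): fail fast if
## the micro batch size violates the constraint noted above, i.e. if
## BATCH_SIZE * DP_SIZE exceeds GLOBAL_BATCH_SIZE.
if [[ $(( ${BATCH_SIZE} * ${DP_SIZE} )) -gt ${GLOBAL_BATCH_SIZE} ]]; then
    echo "BATCH_SIZE=${BATCH_SIZE} is too large for GLOBAL_BATCH_SIZE=${GLOBAL_BATCH_SIZE} with DP_SIZE=${DP_SIZE}"
    exit 1
fi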
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="true"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=72
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_STEP=$(( ${CL_TOKENS} * 1000000000 / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=10000
## Standard deviation for weight initialization. A larger model usually needs a
## lower std. We used the heuristic sqrt(1/3/HIDDEN_SIZE) from the
## MT-NLG 530B work (https://arxiv.org/pdf/2201.11990.pdf)
INIT_STD=0.02
## Activation checkpointing saves GPU memory, but reduces training speed
ACTIVATION_CHECKPOINT="true"
# ACTIVATION_CHECKPOINT="false"
## Whether or not to log optimizer states (norms, max abs values) to tensorboard.
## This is not required for training and might save GPU memory when turned off.
LOG_OPTIMIZER_STATE="true"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="gpt3-kd-test1-alpha1-with-pile-${MODEL_SIZE}B-lr-${LR}-minlr-${MIN_LR}-bs-${GLOBAL_BATCH_SIZE}-gpus-${NUM_GPUS}-zero-${ZERO_STAGE}-mp-${MP_SIZE}-pp-${PP_SIZE}"
if [ "${NO_PP}" = "true" ]; then
NAME="${NAME}-no_pp"
fi
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-startseqlen-${CL_START_SEQLEN}-step-${CL_STEP}-token-${CL_TOKENS}B"
fi
LOG_PATH="log/"
TENSORBOARD_PATH="tensorboard/${NAME}_${host}_${current_time}"
CHECKPOINT_PATH="/blob/users/minjiaz/project/gpt3_distillation/checkpoint/${NAME}"
mkdir -p ${LOG_PATH}
mkdir -p ${TENSORBOARD_PATH}
mkdir -p ${CHECKPOINT_PATH}
### KD configs
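## Knowledge distillation setup (descriptive note): the teacher here is the 125M
## GPT-3 Small config above (NUM_LAYERS=12, HIDDEN_SIZE=768, NUM_ATTN_HEADS=12),
## loaded via --load-teacher from CHECKPOINT_PATH_TEACHER, while the student trained
## in this script keeps the same width but uses a reduced depth (--num-layers 10 in
## megatron_options). KD_BETA_CE is passed to the --kd-beta-ce argument below.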
KD_BETA_CE=1
CHECKPOINT_PATH_TEACHER="/blob/users/conglli/project/gpt3_with_pile/checkpoint/gpt3-with-pile-0.125B-lr-2.4e-3-minlr-6.0e-5-bs-2048-gpus-64-zero-0-mp-1-pp-1-no_pp-cl-startseqlen-72-step-27638-token-60B/"
CHECKPOINT_PATH_SAVE="/blob/users/minjiaz/project/gpt3_distillation/checkpoint/${NAME}"
mkdir -p ${CHECKPOINT_PATH_SAVE}
VOCAB_PATH=/blob/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json
MERGE_PATH=/blob/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt
# The public Pile dataset can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
# For cluster Azure-EastUS-V100-32GB-4, Lab-RR1-V100
# DATA_PATH=/vc_data_blob/users/conglli/the_pile_public_merged_nopreprocessing/pile_text_document
# For cluster Azure-WestUS3-A100
DATA_PATH=/blob/data/the_pile_public_merged_nopreprocessing/pile_text_document
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_PATH} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers 10 \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-samples ${TRAIN_SAMPLES} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load ${CHECKPOINT_PATH} \
--save ${CHECKPOINT_PATH_SAVE} \
--kd \
--kd-beta-ce ${KD_BETA_CE} \
--num-layers-teacher ${NUM_LAYERS} \
--hidden-size-teacher ${HIDDEN_SIZE} \
--num-attention-heads-teacher ${NUM_ATTN_HEADS} \
--load-teacher ${CHECKPOINT_PATH_TEACHER} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_PATH}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [ "${LOG_OPTIMIZER_STATE}" = "true" ]; then
megatron_options="${megatron_options} \
--log-optimizer-states-to-tensorboard"
fi
template_json="ds_config_gpt_TEMPLATE.json"
config_json="ds_config_${NAME}.json"
if [[ $ZERO_STAGE -gt 0 ]]; then
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/${ZERO_STAGE}/" \
| sed "s/PRESCALE_GRAD/false/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
else
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/${ZERO_STAGE}/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
fi
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${ZERO_STAGE} \
--pipeline-model-parallel-size ${PP_SIZE}"
if [[ "${NO_PP}" = "true" ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
## When saving checkpoints to storage with a cache, there could be a consistency
## issue with the pointer to the latest checkpoint. Here we find the correct pointer
## and broadcast it to all nodes.
ITERATION_FILE="$CHECKPOINT_PATH/latest_checkpointed_iteration.txt"
ITERATION_FILE_2="$CHECKPOINT_PATH/latest"
ITERATION=0
for (( node = 0; node <= NUM_NODE-1; node++ ))
do
if $(ssh -q worker-"$node" "test -f \"$ITERATION_FILE\""); then
LOCAL_ITERATION=$(ssh -q worker-"$node" cat $ITERATION_FILE)
ITERATION=$(( ${LOCAL_ITERATION} > ${ITERATION} ? ${LOCAL_ITERATION} : ${ITERATION} ))
fi
done
if [[ $ITERATION -gt 0 ]]; then
ITERATION_2="global_step${ITERATION}"
ds_ssh "echo $ITERATION > $ITERATION_FILE"
ds_ssh "echo $ITERATION_2 > $ITERATION_FILE_2"
fi
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${LOG_PATH}/${NAME}_${host}_${current_time}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,323 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
MODEL_SIZE=0.125
NUM_LAYERS=12
HIDDEN_SIZE=768
NUM_ATTN_HEADS=12
# GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
MIN_LR=6.0e-5
# Curriculum learning (CL) enables stable large-batch training
GLOBAL_BATCH_SIZE=2048 # 8x
LR=2.4e-3 # 4x
## GPT-3 Medium 350M
# MODEL_SIZE=0.35
# NUM_LAYERS=24
# HIDDEN_SIZE=1024
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=3.0e-4
# MIN_LR=3.0e-5
## GPT-3 Large 760M
# MODEL_SIZE=0.76
# NUM_LAYERS=24
# HIDDEN_SIZE=1536
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=2.5e-4
# MIN_LR=2.5e-5
## GPT-3 XL 1.3B
# MODEL_SIZE=1.3
# NUM_LAYERS=24
# HIDDEN_SIZE=2048
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=512
# LR=2.0e-4
# MIN_LR=2.0e-5
## GPT-3 2.7B
# MODEL_SIZE=2.7
# NUM_LAYERS=32
# HIDDEN_SIZE=2560
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=512
# LR=1.6e-4
# MIN_LR=1.6e-5
## GPT-3 6.7B
# MODEL_SIZE=6.7
# NUM_LAYERS=32
# HIDDEN_SIZE=4096
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=1024
# LR=1.2e-4
# MIN_LR=1.2e-5
## GPT-3 13B
# MODEL_SIZE=13
# NUM_LAYERS=40
# HIDDEN_SIZE=5120
# NUM_ATTN_HEADS=40
# GLOBAL_BATCH_SIZE=1024
# LR=1.0e-4
# MIN_LR=1.0e-5
## GPT-3 175B
# MODEL_SIZE=175
# NUM_LAYERS=96
# HIDDEN_SIZE=12288
# NUM_ATTN_HEADS=96
# GLOBAL_BATCH_SIZE=1536
# LR=0.6e-4
# MIN_LR=0.6e-5
###############################################################################
### Training duration configs
## The main termination condition, original GPT-3 paper trains for 300B tokens
TRAIN_TOKENS=300000000000
## TRAIN_SAMPLES is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS above, and
## techniques like curriculum learning use fewer tokens in some samples, we set
## this config large enough to make sure we have enough processed data and don't
## terminate by TRAIN_SAMPLES.
TRAIN_SAMPLES=$(( ${TRAIN_TOKENS} * 3 / ${SEQ_LEN} ))
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration. This token-based config is preferable since it
## does not need to be readjusted when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
WARMUP_TOKENS=375000000
LR_DECAY_TOKENS=260000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=8
## Model parallelism, 1 is no MP
MP_SIZE=1
## Pipeline parallelism. To disable PP, set PP_SIZE to 1 and NO_PP to true.
PP_SIZE=1
NO_PP="true"
## ZeRO stage
ZERO_STAGE=0
## Total number of GPUs
NUM_GPUS=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2))
NUM_GPUS_PERNODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
NUM_NODE=$(( ${NUM_GPUS} / ${NUM_GPUS_PERNODE} ))
DP_SIZE=$(( ${NUM_GPUS} / ${PP_SIZE} / ${MP_SIZE} ))
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="false"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=72
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_STEP=$(( ${CL_TOKENS} * 1000000000 / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=10000
## Standard deviation for weight initialization. A larger model usually needs a
## lower std. We used the heuristic sqrt(1/3/HIDDEN_SIZE) from the
## MT-NLG 530B work (https://arxiv.org/pdf/2201.11990.pdf)
INIT_STD=0.02
## Activation checkpointing saves GPU memory, but reduces training speed
ACTIVATION_CHECKPOINT="true"
# ACTIVATION_CHECKPOINT="false"
## Whether or not to log optimizer states (norms, max abs values) to tensorboard.
## This is not required for training and might save GPU memory when turned off.
LOG_OPTIMIZER_STATE="true"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="gpt3-kd-test1-alpha1-with-pile-${MODEL_SIZE}B-lr-${LR}-minlr-${MIN_LR}-bs-${GLOBAL_BATCH_SIZE}-gpus-${NUM_GPUS}-zero-${ZERO_STAGE}-mp-${MP_SIZE}-pp-${PP_SIZE}"
if [ "${NO_PP}" = "true" ]; then
NAME="${NAME}-no_pp"
fi
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-startseqlen-${CL_START_SEQLEN}-step-${CL_STEP}-token-${CL_TOKENS}B"
fi
LOG_PATH="log/"
TENSORBOARD_PATH="tensorboard/${NAME}_${host}_${current_time}"
CHECKPOINT_PATH="/blob/users/minjiaz/project/gpt3_distillation/checkpoint/${NAME}"
mkdir -p ${LOG_PATH}
mkdir -p ${TENSORBOARD_PATH}
mkdir -p ${CHECKPOINT_PATH}
### KD configs
KD_BETA_CE=1
CHECKPOINT_PATH_TEACHER="/blob/users/conglli/project/gpt3_with_pile/checkpoint/gpt3-with-pile-0.125B-lr-2.4e-3-minlr-6.0e-5-bs-2048-gpus-64-zero-0-mp-1-pp-1-no_pp-cl-startseqlen-72-step-27638-token-60B/"
CHECKPOINT_PATH_SAVE="/blob/users/minjiaz/project/gpt3_distillation/checkpoint/${NAME}"
mkdir -p ${CHECKPOINT_PATH_SAVE}
VOCAB_PATH=/blob/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json
MERGE_PATH=/blob/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt
# The public Pile dataset can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
# For cluster Azure-EastUS-V100-32GB-4, Lab-RR1-V100
# DATA_PATH=/vc_data_blob/users/conglli/the_pile_public_merged_nopreprocessing/pile_text_document
# For cluster Azure-WestUS3-A100
DATA_PATH=/blob/data/the_pile_public_merged_nopreprocessing/pile_text_document
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_PATH} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers 10 \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-samples ${TRAIN_SAMPLES} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load ${CHECKPOINT_PATH} \
--save ${CHECKPOINT_PATH_SAVE} \
--kd \
--kd-beta-ce ${KD_BETA_CE} \
--num-layers-teacher ${NUM_LAYERS} \
--hidden-size-teacher ${HIDDEN_SIZE} \
--num-attention-heads-teacher ${NUM_ATTN_HEADS} \
--load-teacher ${CHECKPOINT_PATH_TEACHER} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_PATH}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [ "${LOG_OPTIMIZER_STATE}" = "true" ]; then
megatron_options="${megatron_options} \
--log-optimizer-states-to-tensorboard"
fi
template_json="ds_config_gpt_TEMPLATE.json"
config_json="ds_config_${NAME}.json"
if [[ $ZERO_STAGE -gt 0 ]]; then
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/${ZERO_STAGE}/" \
| sed "s/PRESCALE_GRAD/false/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
else
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/${ZERO_STAGE}/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
fi
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${ZERO_STAGE} \
--pipeline-model-parallel-size ${PP_SIZE}"
if [[ "${NO_PP}" = "true" ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
## When saving checkpoints to storage with a cache, there could be a consistency
## issue with the pointer to the latest checkpoint. Here we find the correct pointer
## and broadcast it to all nodes.
ITERATION_FILE="$CHECKPOINT_PATH/latest_checkpointed_iteration.txt"
ITERATION_FILE_2="$CHECKPOINT_PATH/latest"
ITERATION=0
for (( node = 0; node <= NUM_NODE-1; node++ ))
do
if $(ssh -q worker-"$node" "test -f \"$ITERATION_FILE\""); then
LOCAL_ITERATION=$(ssh -q worker-"$node" cat $ITERATION_FILE)
ITERATION=$(( ${LOCAL_ITERATION} > ${ITERATION} ? ${LOCAL_ITERATION} : ${ITERATION} ))
fi
done
if [[ $ITERATION -gt 0 ]]; then
ITERATION_2="global_step${ITERATION}"
ds_ssh "echo $ITERATION > $ITERATION_FILE"
ds_ssh "echo $ITERATION_2 > $ITERATION_FILE_2"
fi
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${LOG_PATH}/${NAME}_${host}_${current_time}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,349 @@
#!/bin/bash
DIR=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
# MODEL_SIZE=0.125
# NUM_LAYERS=12
# HIDDEN_SIZE=768
# NUM_ATTN_HEADS=12
# GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
# MIN_LR=6.0e-5
## GPT-3 Medium 350M
MODEL_SIZE=0.35
NUM_LAYERS=24
HIDDEN_SIZE=1024
NUM_ATTN_HEADS=16
GLOBAL_BATCH_SIZE=256
LR=3.0e-4
MIN_LR=3.0e-5
## GPT-3 Large 760M
# MODEL_SIZE=0.76
# NUM_LAYERS=24
# HIDDEN_SIZE=1536
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=256
# LR=2.5e-4
# MIN_LR=2.5e-5
## GPT-3 XL 1.3B
# MODEL_SIZE=1.3
# NUM_LAYERS=24
# HIDDEN_SIZE=2048
# NUM_ATTN_HEADS=16
# GLOBAL_BATCH_SIZE=512
# LR=2.0e-4
# MIN_LR=2.0e-5
## GPT-3 2.7B
# MODEL_SIZE=2.7
# NUM_LAYERS=32
# HIDDEN_SIZE=2560
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=512
# LR=1.6e-4
# MIN_LR=1.6e-5
## GPT-3 6.7B
# MODEL_SIZE=6.7
# NUM_LAYERS=32
# HIDDEN_SIZE=4096
# NUM_ATTN_HEADS=32
# GLOBAL_BATCH_SIZE=1024
# LR=1.2e-4
# MIN_LR=1.2e-5
## GPT-3 13B
# MODEL_SIZE=13
# NUM_LAYERS=40
# HIDDEN_SIZE=5120
# NUM_ATTN_HEADS=40
# GLOBAL_BATCH_SIZE=1024
# LR=1.0e-4
# MIN_LR=1.0e-5
## GPT-3 175B
# MODEL_SIZE=175
# NUM_LAYERS=96
# HIDDEN_SIZE=12288
# NUM_ATTN_HEADS=96
# GLOBAL_BATCH_SIZE=1536
# LR=0.6e-4
# MIN_LR=0.6e-5
###############################################################################
### Training duration configs
## The main termination condition, original GPT-3 paper trains for 300B tokens
## For MoE models, we found that sometimes training a bit longer, to 330B tokens, helps
TRAIN_TOKENS=300000000000
# TRAIN_TOKENS=330000000000
## TRAIN_SAMPLES is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS above, and
## techniques like curriculum learning use fewer tokens in some steps, we set
## this config large enough to make sure we have enough processed data and don't
## terminate by TRAIN_SAMPLES.
TRAIN_SAMPLES=$(( ${TRAIN_TOKENS} * 3 / ${SEQ_LEN} ))
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration. This token-based config is preferable since it
## does not need to be readjusted when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
## For MoE models, we found that setting the decay tokens to 300B helps.
WARMUP_TOKENS=375000000
LR_DECAY_TOKENS=260000000000
# LR_DECAY_TOKENS=300000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=4
## Model parallelism, 1 is no MP
## Currently MoE models have a divergence issue when MP > 1.
MP_SIZE=1
## Pipeline parallelism
## Currently we don't support PP for MoE. To disable PP, set PP_SIZE
## to 1 and use the "--no-pipeline-parallel" arg.
PP_SIZE=1
NUM_GPUS=64
###############################################################################
### MoE configs
## Number of experts. EP_SIZE=1 means a dense model without MoE
EP_SIZE=1
# EP_SIZE=128
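## EP_PARALLEL_SIZE is the expert-parallel degree: it equals EP_SIZE when there are
## at least as many GPUs as experts, and is otherwise capped at NUM_GPUS (experts
## are never spread across more GPUs than are available).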
if [[ $EP_SIZE -gt $NUM_GPUS ]]; then
EP_PARALLEL_SIZE=$NUM_GPUS
else
EP_PARALLEL_SIZE=$EP_SIZE
fi
## The original GPT-3 models always set min LR at 10% of max LR. For MoE models, we
## found that a lower LR and min LR (than the base dense model) help.
## For the 1.3B MoE-128 model we used LR=1.2e-4 and MIN_LR=1.0e-6.
## For the 350M MoE-128 model we used LR=2.0e-4 and MIN_LR=2.0e-6, but they are not
## heavily tuned.
# LR=2.0e-4
# MIN_LR=2e-06
## Coefficient for the MoE loss. We find that 0.01 is a good value, at least for
## the 1.3B MoE-128 model
MLC=0.01
## The configs below adjust the MoE expert token capacity limit during training and
## eval. To completely disable the capacity limit, set MOE_DROP_TOKEN to false.
## A larger capacity factor or disabling the capacity limit could improve training
## convergence, but will also reduce training throughput.
MOE_TRAIN_CAP_FACTOR=1.0
MOE_EVAL_CAP_FACTOR=1.0
MOE_MIN_CAP=4
MOE_DROP_TOKEN="true"
# MOE_DROP_TOKEN="false"
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="false"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=80
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_TOKENS=$((${CL_TOKENS} * 1000000000))
CL_STEP=$(( ${CL_TOKENS} / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=1000
## Standard deviation for weight initialization.
## We used 0.014 for the 350M/1.3B dense/MoE models, and 0.01 for the 6.7B
## dense model. A larger model usually needs a lower std.
INIT_STD=0.014
# INIT_STD=0.01
## Activation checkpointing saves GPU memory, but reduces training speed
ACTIVATION_CHECKPOINT="true"
# ACTIVATION_CHECKPOINT="false"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
host="${HOSTNAME}"
NAME="gpt-kd-${MODEL_SIZE}B-lr-${LR}-minlr-${MIN_LR}-bs-${GLOBAL_BATCH_SIZE}-gpus-${NUM_GPUS}-mp-${MP_SIZE}-pp-${PP_SIZE}"
if [[ $EP_SIZE -gt 1 ]]; then
NAME="${NAME}-ep-${EP_SIZE}-mlc-${MLC}-cap-${MOE_TRAIN_CAP_FACTOR}-drop-${MOE_DROP_TOKEN}"
fi
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-${CL_START_SEQLEN}-${CL_STEP}"
fi
OUTPUT_BASEPATH=$DIR/output
mkdir -p "${OUTPUT_BASEPATH}/tensorboard/"
mkdir -p "${OUTPUT_BASEPATH}/checkpoint/"
mkdir -p "${OUTPUT_BASEPATH}/log/"
TENSORBOARD_DIR="${OUTPUT_BASEPATH}/tensorboard/${NAME}_${host}_${current_time}"
mkdir -p ${TENSORBOARD_DIR}
## Note that for an MoE model with a billion-scale base model, the checkpoint can be
## TB-scale, which normal NFS cannot handle efficiently.
CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/${NAME}"
# USE_INTERNAL_DATA="true"
USE_INTERNAL_DATA="false"
if [ "${USE_INTERNAL_DATA}" = "true" ]; then
## The internal data is only accessible within Microsoft
## For cluster Azure-EastUS-V100-32GB-4, Azure-WestUS3-A100
# BASE_DATA_PATH=/vc_data/Megatron-LM/data
# DATA_HOME="/vc_data/pile-cc1-cc2-shuf"
## For cluster Lab-RR1-V100
BASE_DATA_PATH=/data/Megatron-LM/data
DATA_HOME="/turing-ssd/users/conglli/data/pile-cc1-cc2-shuf"
## For cluster Azure-CentralUS-A100
# BASE_DATA_PATH=/data/Megatron-LM/data
# DATA_HOME=/vc_data_1/users/amawa/blended
VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json
MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt
ARX="${DATA_HOME}/ArXiv_ftfy_cleaned_id_shuf_text_document"
BC2="${DATA_HOME}/BookCorpus2_ftfy_cleaned_id_shuf_text_document"
B3="${DATA_HOME}/Books3_ftfy_cleaned_id_shuf_text_document"
CC2020="${DATA_HOME}/CC-2020-50_id_cleaned_shuf_text_document"
CC2021="${DATA_HOME}/CC-2021-04_id_cleaned_shuf_text_document"
GIT="${DATA_HOME}/Github_ftfy_id_shuf_text_document"
GUT="${DATA_HOME}/Gutenberg_PG-19_ftfy_cleaned_id_cleaned_shuf_text_document"
NIH="${DATA_HOME}/NIH_ExPorter_ftfy_id_shuf_text_document"
OWT2="${DATA_HOME}/OpenWebText2_ftfy_cleaned_id_shuf_text_document"
PCC="${DATA_HOME}/Pile-CC_id_cleaned_shuf_text_document"
PM="${DATA_HOME}/PubMed_Abstracts_ftfy_id_shuf_text_document"
RN="${DATA_HOME}/rn_dedup_shuf_cleaned_0.7_cleaned_shuf_text_document"
SE="${DATA_HOME}/StackExchange_ftfy_id_shuf_text_document"
ST="${DATA_HOME}/stories_dedup0.7_shuf_cleaned_shuf_text_document"
WIK="${DATA_HOME}/Wikipedia_en_ftfy_id_shuf_text_document"
DATA_BLEND="0.14336 ${B3} 0.08962 ${RN} 0.19336 ${OWT2} 0.05689 ${SE} \
0.00859 ${ST} 0.02897 ${PM} 0.04771 ${WIK} 0.00873 ${GUT} 0.01007 ${BC2} \
0.00208 ${NIH} 0.13017 ${CC2020} 0.09446 ${PCC} 0.15652 ${CC2021} \
0.01359 ${ARX} 0.01588 ${GIT}"
else
VOCAB_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-vocab.json
MERGE_PATH=/data/the_pile_public_merged_nopreprocessing/gpt2-merges.txt
# The public Pile dataset can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
DATA_BLEND=/data/the_pile_public_merged_nopreprocessing/pile_text_document
fi
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_BLEND} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--moe-expert-parallel-size ${EP_PARALLEL_SIZE} \
--num-experts ${EP_SIZE} \
--moe-loss-coeff ${MLC} \
--moe-train-capacity-factor ${MOE_TRAIN_CAP_FACTOR} \
--moe-eval-capacity-factor ${MOE_EVAL_CAP_FACTOR} \
--moe-min-capacity ${MOE_MIN_CAP} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers ${NUM_LAYERS} \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-samples ${TRAIN_SAMPLES} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load ${CHECKPOINT_PATH} \
--save ${CHECKPOINT_PATH} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_DIR}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [[ $EP_SIZE -gt 1 ]]; then
megatron_options="${megatron_options} \
--create-moe-param-group"
fi
if [ "${MOE_DROP_TOKEN}" = "false" ]; then
megatron_options="${megatron_options} \
--disable-moe-token-dropping"
fi
template_json="ds_config_gpt_TEMPLATE.json"
config_json="ds_config_gpt_${NAME}.json"
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/0/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--pipeline-model-parallel-size ${PP_SIZE}"
# Currently MoE is not compatible with pipeline parallelism
if [[ $EP_SIZE -gt 1 ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${OUTPUT_BASEPATH}/log/${NAME}_${host}_${current_time}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,32 @@
#!/bin/bash
# Compute embeddings for each entry of a given dataset (e.g. Wikipedia)
RANK=0
WORLD_SIZE=1
# Wikipedia data can be downloaded from the following link:
# https://github.com/facebookresearch/DPR/blob/master/data/download_data.py
EVIDENCE_DATA_DIR=<Specify path of Wikipedia dataset>
EMBEDDING_PATH=<Specify path to store embeddings>
CHECKPOINT_PATH=<Specify path of pretrained ICT model>
python tools/create_doc_index.py \
--num-layers 12 \
--hidden-size 768 \
--num-attention-heads 12 \
--tensor-model-parallel-size 1 \
--micro-batch-size 128 \
--checkpoint-activations \
--seq-length 512 \
--retriever-seq-length 256 \
--max-position-embeddings 512 \
--load ${CHECKPOINT_PATH} \
--evidence-data-path ${EVIDENCE_DATA_DIR} \
--embedding-path ${EMBEDDING_PATH} \
--indexer-log-interval 1000 \
--indexer-batch-size 128 \
--vocab-file bert-vocab.txt \
--num-workers 2 \
--fp16

View File

@ -0,0 +1 @@
This is an example of how to use DeepSpeed's curriculum learning (CL) feature, which provides faster and more stable language model pre-training. Currently it is only integrated for GPT pre-training. Note that there are two curriculum learning examples in two different repos for Megatron-LM GPT-2 pre-training. Both of them have some unique features and limitations. See details in our [tutorial](https://www.deepspeed.ai/tutorials/curriculum-learning/). For technical details please refer to our [paper](https://arxiv.org/abs/2108.06084).

View File

@ -0,0 +1,150 @@
#! /bin/bash
CONFIG=$1
TAG=$2
MODEL_SIZE=$3
LR=$4
TOTAL_BATCHSIZE=$5
SEQ_LEN=$6
MP_SIZE=$7
SEED=$8
SAVE_INTERVAL=$9
NUM_ITER=${10}
NUM_TOKEN=${11}
LR_DECAY_TOKEN=${12}
LR_WARMUP_ITER=${13}
CONFIG_TEMPLATE=${14}
CURRICULUM_STEP=${15}
CURRICULUM_MIN=${16}
# 12-layer, 768-hidden, 12-heads, 117M parameters
# 24-layer, 1024-hidden, 16-heads, 345M parameters
# 36-layer, 1280-hidden, 20-heads, 774M parameters
# 48-layer, 1600-hidden, 25-heads, 1558M parameters
if [[ $MODEL_SIZE -eq 117 ]]; then
NUM_LAYERS=12
HIDDEN_SIZE=768
NUM_ATTN_HEADS=12
elif [[ $MODEL_SIZE -eq 345 ]]; then
NUM_LAYERS=24
HIDDEN_SIZE=1024
NUM_ATTN_HEADS=16
elif [[ $MODEL_SIZE -eq 774 ]]; then
NUM_LAYERS=36
HIDDEN_SIZE=1280
NUM_ATTN_HEADS=20
elif [[ $MODEL_SIZE -eq 1558 ]]; then
NUM_LAYERS=48
HIDDEN_SIZE=1600
NUM_ATTN_HEADS=25
else
echo "Model size not supported."
exit 1
fi
# Pipeline parallelism. 1 means no pipelines.
PP_SIZE=1
# Change for multinode config
NUM_WORKERS=16
NUM_GPUS_PER_WORKER=8
NUM_GPUS=$(( ${NUM_WORKERS} * ${NUM_GPUS_PER_WORKER} ))
if [[ $PP_SIZE -gt 0 ]]; then
DP_SIZE=$(( ${NUM_GPUS} / (${PP_SIZE} * ${MP_SIZE}) ))
else
DP_SIZE=$(( ${NUM_GPUS} / ${MP_SIZE} ))
fi
# Batch size per GPU; here we assume a gradient accumulation step of 1.
# You can reduce this if the GPU runs out of memory.
BATCHSIZE=$((TOTAL_BATCHSIZE/DP_SIZE))
DATA_PATH=/vc_data/Megatron-LM/data/indexed_datasets/megatron
VOCAB_PATH=/vc_data/Megatron-LM/data/gpt2-vocab.json
MERGE_PATH=/vc_data/Megatron-LM/data/gpt2-merges.txt
#ZeRO Configs
stage=1
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
script_path=$(realpath $0)
script_dir=$(dirname $script_path)
host="${HOSTNAME}"
if [ "${CONFIG_TEMPLATE}" = "true" ]; then
template_json="$script_dir/ds_zero_stage_${stage}_config_${CONFIG}.json"
config_json="$script_dir/ds_zero_stage_${stage}_config_${CONFIG}_min${CURRICULUM_MIN}_max${SEQ_LEN}_step${CURRICULUM_STEP}.json"
sed "s/CONFIG_CL_MIN/${CURRICULUM_MIN}/" ${template_json} \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CURRICULUM_STEP}/" \
> ${config_json}
else
config_json="$script_dir/ds_zero_stage_${stage}_config_${CONFIG}.json"
fi
JOB_NAME="gpt2_${MODEL_SIZE}M_bsz${TOTAL_BATCHSIZE}_seq${SEQ_LEN}_lr${LR}_warmup${LR_WARMUP_ITER}_decay${LR_DECAY_TOKEN}_seed${SEED}_${TAG}_stage${stage}_n${NUM_WORKERS}_g${NUM_GPUS_PER_WORKER}_mp${MP_SIZE}"
LOG_NAME="${JOB_NAME}_${host}_${current_time}"
OUTPUT_BASEPATH="/vc_data_blob/users/conglli"
mkdir -p "${OUTPUT_BASEPATH}/tensorboard/curriculum/"
mkdir -p "${OUTPUT_BASEPATH}/checkpoint/curriculum/"
mkdir -p "${OUTPUT_BASEPATH}/log/curriculum/"
LOGDIR="${OUTPUT_BASEPATH}/tensorboard/curriculum/${LOG_NAME}"
CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/curriculum/${JOB_NAME}"
gpt_options=" \
--tensor-model-parallel-size ${MP_SIZE} \
--num-layers $NUM_LAYERS \
--hidden-size $HIDDEN_SIZE \
--num-attention-heads $NUM_ATTN_HEADS \
--seq-length $SEQ_LEN \
--max-position-embeddings $SEQ_LEN \
--micro-batch-size $BATCHSIZE \
--global-batch-size ${TOTAL_BATCHSIZE} \
--train-iters $NUM_ITER \
--train-tokens $NUM_TOKEN \
--lr-decay-tokens $LR_DECAY_TOKEN \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file $VOCAB_PATH \
--merge-file $MERGE_PATH \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--override-lr-scheduler \
--lr $LR \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-iters $LR_WARMUP_ITER \
--checkpoint-activations \
--log-interval 100 \
--save-interval $SAVE_INTERVAL \
--eval-interval 100 \
--eval-iters 10 \
--fp16 \
--seed $SEED \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--no-masked-softmax-fusion \
--tensorboard-dir ${LOGDIR}
"
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${stage} \
--pipeline-model-parallel-size ${PP_SIZE} \
--deepspeed-activation-checkpointing
"
full_options="${gpt_options} ${deepspeed_options}"
run_cmd="deepspeed --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} ../../pretrain_gpt.py ${full_options} &>> ${OUTPUT_BASEPATH}/log/curriculum/${JOB_NAME}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x

View File

@ -0,0 +1,37 @@
# # baseline
# CONFIG=baseline
# TAG=baseline
# MODEL_SIZE=1558
# LR=1.5e-4
# BSZ=512
# SEQ_LEN=1024
# MP_SIZE=1
# SEED=1234
# SAVE_INTERVAL=5000
# NUM_ITER=600000
# NUM_TOKEN=157286400000
# LR_DECAY_TOKEN=157286400000
# LR_WARMUP_ITER=3000
# CONFIG_TEMPLATE=false
# CURRICULUM_STEP=0
# CURRICULUM_MIN=0
# curriculum learning
CONFIG=curriculum_fixed_linear
MODEL_SIZE=1558
LR=6e-4
BSZ=4096
SEQ_LEN=1024
MP_SIZE=1
SEED=1234
SAVE_INTERVAL=1000
NUM_ITER=75000
NUM_TOKEN=157286400000
LR_DECAY_TOKEN=157286400000
LR_WARMUP_ITER=3000
CONFIG_TEMPLATE=true
CURRICULUM_STEP=45000
CURRICULUM_MIN=64
TAG="${CONFIG}_s${CURRICULUM_MIN}to${SEQ_LEN}_step${CURRICULUM_STEP}"
bash ds_pretrain_gpt2.sh $CONFIG $TAG $MODEL_SIZE $LR $BSZ $SEQ_LEN $MP_SIZE $SEED $SAVE_INTERVAL $NUM_ITER $NUM_TOKEN $LR_DECAY_TOKEN $LR_WARMUP_ITER $CONFIG_TEMPLATE $CURRICULUM_STEP $CURRICULUM_MIN

View File

@ -0,0 +1,26 @@
{
"train_batch_size": 512,
"gradient_accumulation_steps": 1,
"steps_per_print": 1,
"zero_optimization": {
"stage": 1
},
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.00015,
"max_grad_norm": 1.0,
"betas": [0.9, 0.95]
}
},
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"wall_clock_breakdown": false,
"zero_allow_untested_optimizer": false
}

View File

@ -0,0 +1,37 @@
{
"train_batch_size": 512,
"gradient_accumulation_steps": 1,
"steps_per_print": 1,
"zero_optimization": {
"stage": 1
},
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.00015,
"max_grad_norm": 1.0,
"betas": [0.9, 0.95]
}
},
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"wall_clock_breakdown": false,
"zero_allow_untested_optimizer": false,
"curriculum_learning": {
"enabled": true,
"curriculum_type": "seqlen",
"min_difficulty": CONFIG_CL_MIN,
"max_difficulty": CONFIG_CL_MAX,
"schedule_type": "fixed_linear",
"schedule_config": {
"total_curriculum_step": CONFIG_CL_DURATION,
"difficulty_step": 8
}
}
}

View File

@ -0,0 +1,23 @@
This directory includes GPT-3/BERT pretraining example scripts for DeepSpeed Data Efficiency Library technologies (curriculum learning, random-LTD, and the two composed together).
You need to install an updated DeepSpeed version (>=0.8.0), which contains the DeepSpeed Data Efficiency Library.
An additional tutorial can be found on the [DeepSpeed website](https://www.deepspeed.ai/tutorials/data-efficiency/).
Additional technical details can be found in our [random-LTD paper](https://arxiv.org/abs/2211.11586) and [data efficiency paper](https://arxiv.org/abs/2212.03597).
## GPT-3 pretraining and evaluation
Inside the ``gpt`` folder, ``ds_analyze_gpt_data_map.sh`` and ``ds_analyze_gpt_data_reduce.sh`` are first used for curriculum learning's offline data analysis and indexing.
``gpt/pretrain`` includes the pretraining example scripts. You can choose a setup to run by uncommenting one block in ``ds_pretrain_gpt_1.3B_dense_run.sh``. One thing to note is that in our [random-LTD paper](https://arxiv.org/abs/2211.11586) we did not scale the peak learning rate when using less than 100% of the data, while in our later [data efficiency paper](https://arxiv.org/abs/2212.03597) we found that scaling the LR based on the percentage of data used helps improve model quality.
``gpt/eval`` includes the zero-/few-shot evaluation example scripts. ``ds_evalharness_parallel_run.sh`` is for zero-shot, and ``ds_evalharness_parallel_run_10shot.sh`` is for 10-shot.
## BERT pretraining and finetuning
Inside the ``bert`` folder, ``pile_data_download_preprocess.py`` can first be used to download and preprocess the public Pile dataset.
``ds_analyze_bert_data_map.sh`` and ``ds_analyze_bert_data_reduce.sh`` are then used for curriculum learning's offline data analysis and indexing.
``bert/pretrain`` includes the pretraining example scripts. You can choose a setup to run by uncommenting one block in ``ds_pretrain_bert_336M_run.sh``. One thing to note is that in our [random-LTD paper](https://arxiv.org/abs/2211.11586) we did not scale the peak learning rate when using less than 100% of the data, while in our later [data efficiency paper](https://arxiv.org/abs/2212.03597) we found that scaling the LR based on the percentage of data used helps improve model quality.
``bert/finetune`` includes the MNLI/QQP/RACE finetuning example scripts following the [Megatron-LM paper](https://arxiv.org/abs/1909.08053). However, we found that the RACE task's accuracy is not very stable and that the Megatron-LM paper used an unnecessarily large number of epochs for MNLI/QQP. Thus we added the capability to finetune other GLUE tasks and switched to follow the hyperparameters of the [original BERT paper](https://arxiv.org/abs/1810.04805). The corresponding scripts are at ``bert/finetune_glue``, which we recommend using instead of ``bert/finetune``. Our [data efficiency paper](https://arxiv.org/abs/2212.03597) also uses the scripts under ``bert/finetune_glue`` for GLUE finetuning.
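As a rough usage sketch (the script names and the worker-id argument follow the BERT analysis scripts included in this repo; paths, node counts, and metrics must be adapted to your setup), the offline analysis is a two-phase map/reduce job: run the map script once on each worker node, then run the single-node reduce script to merge the per-worker results before pretraining.

```bash
# Phase 1: map. Run on every worker node, passing that node's worker id (0..num_workers-1).
bash ds_analyze_bert_data_map.sh 0

# Phase 2: reduce. Run once, after all map jobs finish, to merge the map results.
bash ds_analyze_bert_data_reduce.sh
```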

View File

@ -0,0 +1,239 @@
# coding=utf-8
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''
Copyright 2022 The Microsoft DeepSpeed Team
'''
import os
import time
import sys
import math
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),
os.path.pardir,os.path.pardir)))
from datetime import datetime
import numpy as np
import torch
from deepspeed.runtime.data_pipeline.data_sampling.data_analyzer \
import DataAnalyzer
from deepspeed.runtime.data_pipeline.data_sampling.indexed_dataset \
import MMapIndexedDataset
from megatron import get_args
from megatron import print_rank_0
from megatron.initialize import initialize_megatron
def get_tasks_args(parser):
"""Provide extra arguments required for data analyzing."""
group = parser.add_argument_group(title='data_analyzing')
group.add_argument('--analyzing-task', type=str, required=True,
default=None,
choices=['map',
'reduce'],
help='What type of analyzing task to perform.')
group.add_argument('--analyzing-data-type', type=str, required=True,
default=None,
choices=['BERT',
'GPT'],
help='What type of data.')
group.add_argument('--analyzing-metric', type=str, nargs='+', default=[],
help='What kinds of metrics to analyze.')
group.add_argument('--analyzing-num-workers', type=int, default=1,
help='Number of workers. Each worker could be a single CPU node.')
group.add_argument('--analyzing-worker-id', type=int, default=0,
help='Worker id of current node.')
group.add_argument('--analyzing-num-threads', type=int, default=1,
help='Number of threads for each worker.')
group.add_argument('--analyzing-num-threads-reduce', type=int, default=1,
help='Number of threads used in the reduce phase.')
group.add_argument('--analyzing-specific-threads', type=int, nargs='+', default=[],
help='Which specific threads to run. Helpful when specific threads failed in a previous run.')
return parser
def train_valid_test_datasets_provider_gpt():
"""Build train, valid, and test datasets."""
args = get_args()
print_rank_0('> building train, validation, and test datasets '
'for GPT ...')
from megatron.data.gpt_dataset import build_train_valid_test_datasets
train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
data_prefix=args.data_path,
data_impl=args.data_impl,
splits_string=args.split,
train_valid_test_num_samples=[1,1,1], # Just dummy numbers since we assume args.train_data_exact_num_epochs will override them
seq_length=args.seq_length,
seed=args.seed,
skip_warmup=(not args.mmap_warmup))
print_rank_0("> finished creating GPT datasets ...")
return train_ds, valid_ds, test_ds
def train_valid_test_datasets_provider_bert():
"""Build train, valid, and test datasets."""
args = get_args()
print_rank_0('> building train, validation, and test datasets '
'for BERT ...')
from megatron.data.dataset_utils import build_train_valid_test_datasets
train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
data_prefix=args.data_path,
data_impl=args.data_impl,
splits_string=args.split,
train_valid_test_num_samples=[1,1,1], # Just dummy numbers since we assume args.train_data_exact_num_epochs will override them
max_seq_length=args.seq_length,
masked_lm_prob=args.mask_prob,
short_seq_prob=args.short_seq_prob,
seed=args.seed,
skip_warmup=(not args.mmap_warmup),
binary_head=args.bert_binary_head)
print_rank_0("> finished creating BERT datasets ...")
return train_ds, valid_ds, test_ds
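# metric_seqlen (below) measures the effective sequence length of each sample, i.e. the
# number of non-padding tokens (the count of nonzero entries in padding_mask).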
def metric_seqlen(data):
metric = torch.count_nonzero(data['padding_mask'], dim=1)
return metric
def metric_total_vocab_freq(data):
args = get_args()
if args.analyzing_data_type == 'BERT':
frequency = torch.bincount(data['text'].view(-1),
minlength=args.padded_vocab_size+1,
weights=data['padding_mask'].view(-1))
elif args.analyzing_data_type == 'GPT':
frequency = torch.bincount(data['text'].view(-1),
minlength=args.padded_vocab_size+1)
return frequency
def metric_vocab_rarity(data):
args = get_args()
if args.analyzing_data_type == 'BERT':
rarity = torch.sum(data['padding_mask'] * \
args.total_vocab_freq[data['text']], dim=1).to(torch.long)
elif args.analyzing_data_type == 'GPT':
rarity = []
# Process rows one by one to avoid excessive memory consumption
for row in range(data['text'].size()[0]):
rarity.append(int(torch.sum(args.total_vocab_freq[data['text'][row]]).item()))
rarity = torch.tensor(rarity, dtype=torch.long)
print(f"rarity min {min(rarity)}, max {max(rarity)}, len {len(rarity)}, avg {sum(rarity)/len(rarity)}")
return rarity
def metric_seqlen_vocab_rarity(data):
args = get_args()
metric = torch.count_nonzero(data['padding_mask'], dim=1).to(torch.long) * args.seqlen_coeff
metric += torch.sum(data['padding_mask'] * \
args.total_vocab_freq[data['text']], dim=1).to(torch.long)
print(f"metric min {min(metric)}, max {max(metric)}, len {len(metric)}, avg {sum(metric)/len(metric)}")
return metric
def get_metric_function(metric_name):
if metric_name == 'seqlen':
return metric_seqlen
if metric_name == 'total_vocab_freq':
return metric_total_vocab_freq
if metric_name == 'vocab_rarity':
return metric_vocab_rarity
if metric_name == 'seqlen_vocab_rarity':
return metric_seqlen_vocab_rarity
def get_metric_type(metric_name):
if metric_name == 'seqlen':
return 'single_value_per_sample'
if metric_name == 'total_vocab_freq':
return 'accumulate_value_over_samples'
if metric_name == 'vocab_rarity':
return 'single_value_per_sample'
if metric_name == 'seqlen_vocab_rarity':
return 'single_value_per_sample'
def run_map():
args = get_args()
if args.analyzing_data_type == 'BERT':
args.mask_prob = 0 # When analyzing data, we don't want any mask.
train_ds, _, _ = train_valid_test_datasets_provider_bert()
elif args.analyzing_data_type == 'GPT':
train_ds, _, _ = train_valid_test_datasets_provider_gpt()
assert 'seqlen' not in args.analyzing_metric, 'GPT data has fixed seqlen, thus unnecessary to analyze seqlen metric.'
assert 'seqlen_vocab_rarity' not in args.analyzing_metric, 'GPT data has fixed seqlen, thus unnecessary to analyze seqlen metric.'
if 'vocab_rarity' in args.analyzing_metric or 'seqlen_vocab_rarity' in args.analyzing_metric:
total_vocab_freq_fname = f"{args.save}/total_vocab_freq/total_vocab_freq_metric_value"
assert os.path.isfile(f"{total_vocab_freq_fname}.bin") and os.path.isfile(f"{total_vocab_freq_fname}.idx"), "To analyze vocab rarity, first need to analyze the total vocab freq."
total_vocab_freq = MMapIndexedDataset(total_vocab_freq_fname, skip_warmup=True)
total_vocab_freq = np.copy(total_vocab_freq[0])
total_vocab_freq[total_vocab_freq == 0] = 1 # Avoid log(0) error
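# Convert raw counts to negative log-probabilities so that rarer tokens get larger values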
total_vocab_freq = np.log(total_vocab_freq/sum(total_vocab_freq)) * -1
args.total_vocab_freq = torch.tensor(total_vocab_freq, dtype=torch.double)
if 'seqlen_vocab_rarity' in args.analyzing_metric:
# Use a large coeff so that seqlen dominates vocab_rarity
max_possible_rarity = args.seq_length * torch.max(args.total_vocab_freq).item()
args.seqlen_coeff = 10 ** (math.ceil(math.log(max_possible_rarity, 10)) + 1)
print(f"Metric seqlen_vocab_rarity: using {args.seqlen_coeff} as coefficient for seqlen.")
metric_functions = [get_metric_function(x) for x in args.analyzing_metric]
metric_types = [get_metric_type(x) for x in args.analyzing_metric]
# For metric_dtypes we use int64 by default since it could be hard to estimate
# the appropriate dtype before the mapping analysis. During the reduce step, where
# we merge the analysis results, the DataAnalyzer will automatically choose
# the dtype of the merged result file as the smallest one that meets the range
# requirement.
metric_dtypes = [np.int64 for x in args.analyzing_metric]
start = time.time()
data_analyzer = DataAnalyzer(train_ds,
num_workers=args.analyzing_num_workers,
worker_id=args.analyzing_worker_id,
num_threads=args.analyzing_num_threads,
specific_threads=args.analyzing_specific_threads,
batch_size=args.global_batch_size, metric_names=args.analyzing_metric,
metric_functions=metric_functions, metric_types=metric_types,
metric_dtypes=metric_dtypes, save_path=args.save)
data_analyzer.run_map()
duration = (time.time() - start) / 3600.0
print(f"map job finished in {duration} hr.")
def run_reduce():
args = get_args()
if args.analyzing_data_type == 'BERT':
args.mask_prob = 0 # When analyzing data, we don't want any mask.
train_ds, _, _ = train_valid_test_datasets_provider_bert()
elif args.analyzing_data_type == 'GPT':
train_ds, _, _ = train_valid_test_datasets_provider_gpt()
metric_functions = [get_metric_function(x) for x in args.analyzing_metric]
metric_types = [get_metric_type(x) for x in args.analyzing_metric]
metric_dtypes = [np.int64 for x in args.analyzing_metric]
start = time.time()
data_analyzer = DataAnalyzer(train_ds,
num_workers=args.analyzing_num_workers,
num_threads=args.analyzing_num_threads,
num_threads_reduce=args.analyzing_num_threads_reduce,
batch_size=args.global_batch_size, metric_names=args.analyzing_metric,
metric_functions=metric_functions, metric_types=metric_types,
metric_dtypes=metric_dtypes, save_path=args.save)
data_analyzer.run_reduce()
duration = (time.time() - start) / 3600.0
print(f"reduce job finished in {duration} hr.")
if __name__ == "__main__":
initialize_megatron(extra_args_provider=get_tasks_args, allow_no_cuda=True)
args = get_args()
if args.analyzing_task == 'map':
run_map()
elif args.analyzing_task == 'reduce':
run_reduce()
else:
raise NotImplementedError('Task {} is not implemented.'.format(
args.analyzing_task))

View File

@ -0,0 +1,67 @@
#!/bin/bash
num_workers=1 # Num nodes to run the map job
num_threads=40 # Num threads on each node. Set this based on #CPU cores
# If different data epochs have slightly different data samples (e.g., due
# to randomness), then you need to specify a num_epochs large enough to cover
# the whole pretraining. If different data epochs are the same, set num_epochs to
# 1 to only index 1 epoch; during pretraining the DeepSpeed data efficiency
# library will automatically handle reshuffling when reaching another epoch.
num_epochs=5
# Which node this is (starting at 0 and ending at num_workers-1). This
# script only launches the map job on 1 worker node, since we don't expect
# to run on many nodes and workers don't need any communication. But you
# can modify this script to add an MPI/torch distributed launcher.
worker_id=$1
save_path="/blob/users/conglli/data/analysis_pile_bert_${num_epochs}epoch/"
metric='total_vocab_freq'
# metric='vocab_rarity' # this requires the result of total_vocab_freq
# metric='seqlen_vocab_rarity' # this requires the result of total_vocab_freq
# metric='seqlen'
seq_len=512
batch_size=10000
jobname="bert-pile-analyzing-${metric}-${num_epochs}epoch-map-worker${worker_id}"
## For the public Pile dataset, see prepare_pile_data.py in the same directory
## for how to download and preprocess the data.
## Change data_home to your own training data path.
# data_home="/vc_data_blob/users/conglli/the_pile_bert"
data_home="/blob/data/the_pile_bert"
data_path="${data_home}/pile_bert_train_text_sentence"
vocab_path="bert-large-uncased-vocab.txt"
if [ ! -f "$vocab_path" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
fi
# Make sure the "--split" is the same as what you will use for pre-training.
options=" \
--analyzing-task map \
--analyzing-data-type BERT \
--analyzing-metric ${metric} \
--analyzing-num-workers ${num_workers} \
--analyzing-worker-id ${worker_id} \
--analyzing-num-threads ${num_threads} \
--vocab-file ${vocab_path} \
--data-path ${data_path} \
--data-impl mmap \
--tokenizer-type BertWordPieceLowerCase \
--micro-batch-size ${batch_size} \
--global-batch-size ${batch_size} \
--seq-length ${seq_len} \
--max-position-embeddings ${seq_len} \
--num-layers 1 \
--hidden-size 1 \
--num-attention-heads 1 \
--split 949,50,1 \
--distributed-backend gloo \
--train-data-exact-num-epochs ${num_epochs} \
--return-data-index \
--save-interval 1 \
--save ${save_path}"
python ../analyze_data.py ${options} &> ${jobname}.log

View File

@ -0,0 +1,66 @@
#!/bin/bash
# Set these 2 to the same values you used for the map job. We need these 2
# configs to know how many map job result files there are.
num_workers=1
num_threads=40
# The reduce job only has 1 worker but can be accelerated by multithreading.
num_threads_reduce=40
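# As an illustration of the note above: with num_workers=1 and num_threads=40,
# the map stage was split into num_workers * num_threads = 40 (worker, thread)
# shards, and the reduce job uses these two numbers to locate all partial
# results under the save path (the exact file layout is determined by the
# DeepSpeed DataAnalyzer).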
# If different data epochs have slightly different data samples (e.g., due
# to randomness), then you need to specify a num_epochs large enough to cover
# the whole pretraining run. If all data epochs are identical, set num_epochs to
# 1 to index only 1 epoch; during pretraining the DeepSpeed data efficiency
# library will automatically handle reshuffling when reaching another epoch.
num_epochs=5
save_path="/blob/users/conglli/data/analysis_pile_bert_${num_epochs}epoch/"
metric='total_vocab_freq'
# metric='vocab_rarity' # this requires the result of total_vocab_freq
# metric='seqlen_vocab_rarity' # this requires the result of total_vocab_freq
# metric='seqlen'
seq_len=512
batch_size=10000
jobname="bert-pile-analyzing-${metric}-${num_epochs}epoch-reduce"
## The public Pile dataset; see prepare_pile_data.py in the same directory
## for how to download and preprocess the data.
## Change data_home to your own training data path.
# data_home="/vc_data_blob/users/conglli/the_pile_bert"
data_home="/blob/data/the_pile_bert"
data_path="${data_home}/pile_bert_train_text_sentence"
vocab_path="bert-large-uncased-vocab.txt"
if [ ! -f "$vocab_path" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
fi
# Make sure the "--split" is the same as what you will use for pre-training.
options=" \
--analyzing-task reduce \
--analyzing-data-type BERT \
--analyzing-metric ${metric} \
--analyzing-num-workers ${num_workers} \
--analyzing-num-threads ${num_threads} \
--analyzing-num-threads-reduce ${num_threads_reduce} \
--vocab-file ${vocab_path} \
--data-path ${data_path} \
--data-impl mmap \
--tokenizer-type BertWordPieceLowerCase \
--micro-batch-size ${batch_size} \
--global-batch-size ${batch_size} \
--seq-length ${seq_len} \
--max-position-embeddings ${seq_len} \
--num-layers 1 \
--hidden-size 1 \
--num-attention-heads 1 \
--split 949,50,1 \
--distributed-backend gloo \
--train-data-exact-num-epochs ${num_epochs} \
--return-data-index \
--save-interval 1 \
--save ${save_path}"
python ../analyze_data.py ${options} &> ${jobname}.log

View File

@ -0,0 +1,24 @@
{
"train_batch_size" : CONFIG_BATCH_SIZE,
"train_micro_batch_size_per_gpu": CONFIG_MBSIZE,
"steps_per_print": LOG_INTERVAL,
"zero_optimization": {
"stage": ZERO_STAGE,
"elastic_checkpoint": true
},
"gradient_clipping": 1.0,
"prescale_gradients": PRESCALE_GRAD,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 500,
"hysteresis": 2,
"min_loss_scale": 1,
"initial_scale_power": 11
},
"wall_clock_breakdown" : false
}

View File

@ -0,0 +1,150 @@
seed=1234
pretrained_checkpoint="/blob/users/conglli/project/bert_with_pile/checkpoint/bert-pile-0.336B-iters-2M-lr-1e-4-min-1e-5-wmup-10000-dcy-2M-sty-linear-gbs-1024-mbs-16-gpu-64-zero-0-mp-1-pp-1-nopp"
###############################################################################
### Main configs
### The main configs are from Megatron-LM paper
### https://arxiv.org/abs/1909.08053. Choose based on your desired model size
### or build your own configs.
seq_len=512
## From Table 6 in https://arxiv.org/abs/1909.08053.
task="MNLI"
global_batch_size=128
lr=1e-5
epochs=10
train_data="/blob/data/GlueData/MNLI/train.tsv"
valid_data="/blob/data/GlueData/MNLI/dev_matched.tsv \
/blob/data/GlueData/MNLI/dev_mismatched.tsv"
## Adjust based on number of GPUs.
batch_size=16
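## Illustrative example: with global_batch_size=128 and micro batch_size=16 and
## no gradient accumulation, this corresponds to 128 / 16 = 8 GPUs; scale
## batch_size accordingly for a different GPU count.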
## BERT 110M (same config as original BERT-Base model)
## This config is not included in Megatron-LM paper
# model_size=0.11
# num_layers=12
# hidden_size=768
# num_attn_heads=12
## BERT 336M (same config as original BERT-Large model)
model_size=0.336
num_layers=24
hidden_size=1024
num_attn_heads=16
## BERT 1.3B
# model_size=1.3
# num_layers=24
# hidden_size=2048
# num_attn_heads=32
## BERT 3.9B
# model_size=3.9
# num_layers=48
# hidden_size=2560
# num_attn_heads=40
###############################################################################
### Parallelism configs
## Model parallelism, 1 is no MP
mp_size=1
## Pipeline parallelism. To disable PP, set pp_size to 1 and no_pp to true.
## Currently pipeline parallelism is not supported for BERT model: DeepSpeed's
## pipeline parallelism is only integrated with the GPT case, and currently
## DeepSpeed is not integrated with Megatron's own pipeline parallelism.
pp_size=1
no_pp="true"
## ZeRO stage
zero_stage=0
###############################################################################
### Misc configs
log_interval=10
eval_iters=50
eval_interval=100
save_interval=500000
## Activation checkpointing saves GPU memory, but reduces training speed
# activation_checkpoint="true"
activation_checkpoint="false"
###############################################################################
vocab_file="bert-large-uncased-vocab.txt"
if [ ! -f "$vocab_file" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
fi
jobname="${task}-bsz${global_batch_size}-lr${lr}-epochs${epochs}-seed${seed}"
checkpoint_path="${pretrained_checkpoint}-finetune/${jobname}"
mkdir -p ${checkpoint_path}
template_json="ds_config_bert_TEMPLATE.json"
config_json="ds_config_bert_bsz${global_batch_size}_mbsz${batch_size}_log${log_interval}_zero${zero_stage}.json"
if [[ $zero_stage -gt 0 ]]; then
sed "s/CONFIG_BATCH_SIZE/${global_batch_size}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/false/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
> ${config_json}
else
sed "s/CONFIG_BATCH_SIZE/${global_batch_size}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
> ${config_json}
fi
options=" \
--finetune \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${zero_stage} \
--task ${task} \
--seed ${seed} \
--train-data ${train_data} \
--valid-data ${valid_data} \
--tokenizer-type BertWordPieceLowerCase \
--vocab-file ${vocab_file} \
--epochs ${epochs} \
--pretrained-checkpoint ${pretrained_checkpoint} \
--tensor-model-parallel-size ${mp_size} \
--pipeline-model-parallel-size ${pp_size} \
--num-layers ${num_layers} \
--hidden-size ${hidden_size} \
--num-attention-heads ${num_attn_heads} \
--global-batch-size ${global_batch_size} \
--micro-batch-size ${batch_size} \
--lr ${lr} \
--lr-decay-style linear \
--lr-warmup-fraction 0.065 \
--seq-length ${seq_len} \
--max-position-embeddings ${seq_len} \
--save-interval ${save_interval} \
--save ${checkpoint_path} \
--log-interval ${log_interval} \
--eval-interval ${eval_interval} \
--eval-iters ${eval_iters} \
--weight-decay 1.0e-1 \
--fp16"
if [ "${activation_checkpoint}" = "true" ]; then
options="${options} \
--checkpoint-activations \
--deepspeed-activation-checkpointing"
fi
if [[ "${no_pp}" = "true" ]]; then
options="${options} \
--no-pipeline-parallel"
fi
# After the fine-tuning finishes, you can find the dev set accuracy numbers by
# "grep -e "overall:" -e "metrics for" ${checkpoint_path}/output.log"
deepspeed ../../../../tasks/main.py ${options} &> ${checkpoint_path}/output.log

View File

@ -0,0 +1,158 @@
seed=1234
pretrained_checkpoint="/blob/users/conglli/project/bert_with_pile/checkpoint/bert-pile-0.336B-iters-2M-lr-1e-4-min-1e-5-wmup-10000-dcy-2M-sty-linear-gbs-1024-mbs-16-gpu-64-zero-0-mp-1-pp-1-nopp"
###############################################################################
### Main configs
### The main configs are from Megatron-LM paper
### https://arxiv.org/abs/1909.08053. Choose based on your desired model size
### or build your own configs.
seq_len=512
## From Table 6 in https://arxiv.org/abs/1909.08053.
task="QQP"
train_data="/blob/data/GlueData/QQP/train.tsv"
valid_data="/blob/data/GlueData/QQP/dev.tsv"
## Adjust based on number of GPUs.
batch_size=16
## BERT 110M (same config as original BERT-Base model)
## This config is not included in Megatron-LM paper
# model_size=0.11
# num_layers=12
# hidden_size=768
# num_attn_heads=12
# global_batch_size=128
# lr=5e-5
# epochs=12
## BERT 336M (same config as original BERT-Large model)
model_size=0.336
num_layers=24
hidden_size=1024
num_attn_heads=16
global_batch_size=128
lr=5e-5
epochs=12
## BERT 1.3B
# model_size=1.3
# num_layers=24
# hidden_size=2048
# num_attn_heads=32
# global_batch_size=128
# lr=3e-5
# epochs=12
## BERT 3.9B
# model_size=3.9
# num_layers=48
# hidden_size=2560
# num_attn_heads=40
# global_batch_size=256
# lr=4e-5
# epochs=12
###############################################################################
### Parallelism configs
## Model parallelism, 1 is no MP
mp_size=1
## Pipeline parallelism. To disable PP, set pp_size to 1 and no_pp to true.
## Currently pipeline parallelism is not supported for BERT model: DeepSpeed's
## pipeline parallelism is only integrated with the GPT case, and currently
## DeepSpeed is not integrated with Megatron's own pipeline parallelism.
pp_size=1
no_pp="true"
## ZeRO stage
zero_stage=0
###############################################################################
### Misc configs
log_interval=10
eval_iters=50
eval_interval=100
save_interval=500000
## Activation checkpointing saves GPU memory, but reduces training speed
# activation_checkpoint="true"
activation_checkpoint="false"
###############################################################################
vocab_file="bert-large-uncased-vocab.txt"
if [ ! -f "$vocab_file" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
fi
jobname="${task}-bsz${global_batch_size}-lr${lr}-epochs${epochs}-seed${seed}"
checkpoint_path="${pretrained_checkpoint}-finetune/${jobname}"
mkdir -p ${checkpoint_path}
template_json="ds_config_bert_TEMPLATE.json"
config_json="ds_config_bert_bsz${global_batch_size}_mbsz${batch_size}_log${log_interval}_zero${zero_stage}.json"
if [[ $zero_stage -gt 0 ]]; then
sed "s/CONFIG_BATCH_SIZE/${global_batch_size}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/false/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
> ${config_json}
else
sed "s/CONFIG_BATCH_SIZE/${global_batch_size}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
> ${config_json}
fi
options=" \
--finetune \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${zero_stage} \
--task ${task} \
--seed ${seed} \
--train-data ${train_data} \
--valid-data ${valid_data} \
--tokenizer-type BertWordPieceLowerCase \
--vocab-file ${vocab_file} \
--epochs ${epochs} \
--pretrained-checkpoint ${pretrained_checkpoint} \
--tensor-model-parallel-size ${mp_size} \
--pipeline-model-parallel-size ${pp_size} \
--num-layers ${num_layers} \
--hidden-size ${hidden_size} \
--num-attention-heads ${num_attn_heads} \
--global-batch-size ${global_batch_size} \
--micro-batch-size ${batch_size} \
--lr ${lr} \
--lr-decay-style linear \
--lr-warmup-fraction 0.065 \
--seq-length ${seq_len} \
--max-position-embeddings ${seq_len} \
--save-interval ${save_interval} \
--save ${checkpoint_path} \
--log-interval ${log_interval} \
--eval-interval ${eval_interval} \
--eval-iters ${eval_iters} \
--weight-decay 1.0e-1 \
--fp16"
if [ "${activation_checkpoint}" = "true" ]; then
options="${options} \
--checkpoint-activations \
--deepspeed-activation-checkpointing"
fi
if [[ "${no_pp}" = "true" ]]; then
options="${options} \
--no-pipeline-parallel"
fi
# After the fine-tuning finishes, you can find the dev set accuracy numbers by
# "grep -e "overall:" -e "metrics for" ${checkpoint_path}/output.log"
deepspeed ../../../../tasks/main.py ${options} &> ${checkpoint_path}/output.log

View File

@ -0,0 +1,172 @@
seed=1234
## RACE has two sub-tasks that need to be finetuned separately
difficulty="middle"
# difficulty="high"
pretrained_checkpoint="/blob/users/conglli/project/bert_with_pile/checkpoint/bert-pile-0.336B-iters-2M-lr-1e-4-min-1e-5-wmup-10000-dcy-2M-sty-linear-gbs-1024-mbs-16-gpu-64-zero-0-mp-1-pp-1-nopp"
###############################################################################
### Main configs
### The main configs are from Megatron-LM paper
### https://arxiv.org/abs/1909.08053. Choose based on your desired model size
### or build your own configs.
seq_len=512
## From Table 6 in https://arxiv.org/abs/1909.08053.
task="RACE"
## Race dataset can be downloaded by:
## wget http://www.cs.cmu.edu/~glai1/data/race/RACE.tar.gz
train_data="/blob/data/RACE/train/${difficulty}"
## The Megatron paper https://arxiv.org/abs/1909.08053 says: "For the test set
## results of RACE, we first use the development set to find the checkpoint
## that gives us the median score on the 5 random seeds and we report the
## results from that checkpoint on the test set", which is a rather confusing
## description. For simplicity, we instead directly take the median dev and test
## set scores over 5 random seeds from a single pretrained_checkpoint.
valid_data="/blob/data/RACE/dev/${difficulty} \
/blob/data/RACE/test/${difficulty}"
## Adjust based on number of GPUs.
batch_size=4
## BERT 110M (same config as original BERT-Base model)
## This config is not included in Megatron-LM paper
# model_size=0.11
# num_layers=12
# hidden_size=768
# num_attn_heads=12
# global_batch_size=32
# lr=2e-5
# epochs=3
## BERT 336M (same config as original BERT-Large model)
model_size=0.336
num_layers=24
hidden_size=1024
num_attn_heads=16
global_batch_size=32
lr=2e-5
epochs=3
## BERT 1.3B
# model_size=1.3
# num_layers=24
# hidden_size=2048
# num_attn_heads=32
# global_batch_size=16
# lr=1e-5
# epochs=3
## BERT 3.9B
# model_size=3.9
# num_layers=48
# hidden_size=2560
# num_attn_heads=40
# global_batch_size=32
# lr=2e-5
# epochs=3
###############################################################################
### Parallelism configs
## Model parallelism, 1 is no MP
mp_size=1
## Pipeline parallelism. To disable PP, set pp_size to 1 and no_pp to true.
## Currently pipeline parallelism is not supported for BERT model: DeepSpeed's
## pipeline parallelism is only integrated with the GPT case, and currently
## DeepSpeed is not integrated with Megatron's own pipeline parallelism.
pp_size=1
no_pp="true"
## ZeRO stage
zero_stage=0
###############################################################################
### Misc configs
log_interval=10
eval_iters=50
eval_interval=100
save_interval=100000
## Activation checkpointing saves GPU memory, but reduces training speed
# activation_checkpoint="true"
activation_checkpoint="false"
###############################################################################
vocab_file="bert-large-uncased-vocab.txt"
if [ ! -f "$vocab_file" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
fi
jobname="${task}-${difficulty}-bsz${global_batch_size}-lr${lr}-epochs${epochs}-seed${seed}"
checkpoint_path="${pretrained_checkpoint}-finetune/${jobname}"
mkdir -p ${checkpoint_path}
template_json="ds_config_bert_TEMPLATE.json"
config_json="ds_config_bert_bsz${global_batch_size}_mbsz${batch_size}_log${log_interval}_zero${zero_stage}.json"
if [[ $zero_stage -gt 0 ]]; then
sed "s/CONFIG_BATCH_SIZE/${global_batch_size}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/false/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
> ${config_json}
else
sed "s/CONFIG_BATCH_SIZE/${global_batch_size}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
> ${config_json}
fi
options=" \
--finetune \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${zero_stage} \
--task ${task} \
--seed ${seed} \
--train-data ${train_data} \
--valid-data ${valid_data} \
--tokenizer-type BertWordPieceLowerCase \
--vocab-file ${vocab_file} \
--epochs ${epochs} \
--pretrained-checkpoint ${pretrained_checkpoint} \
--tensor-model-parallel-size ${mp_size} \
--pipeline-model-parallel-size ${pp_size} \
--num-layers ${num_layers} \
--hidden-size ${hidden_size} \
--num-attention-heads ${num_attn_heads} \
--global-batch-size ${global_batch_size} \
--micro-batch-size ${batch_size} \
--lr ${lr} \
--lr-decay-style linear \
--lr-warmup-fraction 0.06 \
--seq-length ${seq_len} \
--max-position-embeddings ${seq_len} \
--save-interval ${save_interval} \
--save ${checkpoint_path} \
--log-interval ${log_interval} \
--eval-interval ${eval_interval} \
--eval-iters ${eval_iters} \
--weight-decay 1.0e-1 \
--clip-grad 1.0 \
--fp16"
if [ "${activation_checkpoint}" = "true" ]; then
options="${options} \
--checkpoint-activations \
--deepspeed-activation-checkpointing"
fi
if [[ "${no_pp}" = "true" ]]; then
options="${options} \
--no-pipeline-parallel"
fi
# After the fine-tuning finishes, you can find the dev/test set accuracy numbers
# by "grep -e "overall:" -e "metrics for" ${checkpoint_path}/output.log"
deepspeed ../../../../tasks/main.py ${options} &> ${checkpoint_path}/output.log

View File

@ -0,0 +1,111 @@
import os
import statistics
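# gather_numbers scans a log file line by line: for every line that contains
# match_keywords[i], it splits the line on spaces, finds the token equal to
# index_keywords[i], and records the float located index_offsets[i] tokens to
# its right. It returns a dict mapping each index keyword to the list of values
# collected this way.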
def gather_numbers(fname, match_keywords, index_keywords, index_offsets):
results = {}
for k in index_keywords:
results[k] = []
file1 = open(fname, 'r')
while True:
line = file1.readline()
if not line:
break
splits = line.split(' ')
for i in range(len(match_keywords)):
if match_keywords[i] in line:
ref_idx = splits.index(index_keywords[i])
results[index_keywords[i]].append(float(splits[ref_idx+index_offsets[i]]))
file1.close()
return results
def gather_MNLI_results(result_path):
overall = []
matched = []
mismatched = []
for file in os.listdir(result_path):
if file.startswith('MNLI'):
fname = f'{result_path}/{file}/output.log'
if os.path.exists(fname):
results = gather_numbers(fname,
['overall:', 'metrics for dev-matched:', 'metrics for dev-mismatched:'],
['overall:', 'dev-matched:', 'dev-mismatched:'],
[9, 9, 9])
overall_candidate = results['overall:']
matched_candidate = results['dev-matched:']
mismatched_candidate = results['dev-mismatched:']
if len(overall_candidate) > 0:
assert len(overall_candidate) == len(matched_candidate) and len(overall_candidate) == len(mismatched_candidate)
best_index = overall_candidate.index(max(overall_candidate))
overall.append(overall_candidate[best_index])
matched.append(matched_candidate[best_index])
mismatched.append(mismatched_candidate[best_index])
if len(overall) > 0:
if len(overall) % 2 == 1:
median_idx = overall.index(statistics.median(overall))
else:
median_idx = overall.index(statistics.median_high(overall))
print(f'MNLI how Megatron paper reported: overall results median {statistics.median(overall)}, corresponding matched/mismatched: {matched[median_idx]}/{mismatched[median_idx]}')
print(f'MNLI other results:')
print(f'MNLI overall results {overall}, median {statistics.median(overall)} (corresponding matched/mismatched {matched[median_idx]}/{mismatched[median_idx]}), mean {statistics.mean(overall)}, std {statistics.stdev(overall)}')
print(f'MNLI matched results {matched}, median {statistics.median(matched)}, mean {statistics.mean(matched)}, std {statistics.stdev(matched)}')
print(f'MNLI mismatched results {mismatched}, median {statistics.median(mismatched)}, mean {statistics.mean(mismatched)}, std {statistics.stdev(mismatched)}')
else:
print("Didn't find any MNLI result")
def gather_QQP_results(result_path):
overall = []
for file in os.listdir(result_path):
if file.startswith('QQP'):
fname = f'{result_path}/{file}/output.log'
if os.path.exists(fname):
results = gather_numbers(fname, ['overall:'], ['overall:'], [9])
overall_candidate = results['overall:']
if len(overall_candidate) > 0:
best_index = overall_candidate.index(max(overall_candidate))
overall.append(overall_candidate[best_index])
if len(overall) > 0:
print(f'QQP how Megatron paper reported: overall results median {statistics.median(overall)}')
print(f'QQP other results:')
print(f'QQP overall results {overall}, median {statistics.median(overall)}, mean {statistics.mean(overall)}, std {statistics.stdev(overall)}')
else:
print("Didn't find any QQP result")
def gather_RACE_results(result_path, task):
dev = []
test = []
for file in os.listdir(result_path):
if file.startswith(f'RACE-{task}'):
fname = f'{result_path}/{file}/output.log'
if os.path.exists(fname):
results = gather_numbers(fname,
[f'metrics for dev-{task}:', f'metrics for test-{task}:'],
[f'dev-{task}:', f'test-{task}:'],
[9, 9])
dev_candidate = results[f'dev-{task}:']
test_candidate = results[f'test-{task}:']
if len(dev_candidate) > 0:
assert len(dev_candidate) == len(test_candidate)
dev.append(max(dev_candidate))
test.append(max(test_candidate))
if len(dev) > 0:
if len(dev) % 2 == 1:
median_idx = dev.index(statistics.median(dev))
else:
median_idx = dev.index(statistics.median_high(dev))
print(f'RACE-{task} how Megatron paper reported: test result from the median of dev results {test[median_idx]}')
print(f'RACE-{task} other results:')
print(f'RACE-{task} dev results {dev}, median {statistics.median(dev)}, mean {statistics.mean(dev)}, std {statistics.stdev(dev)}')
print(f'RACE-{task} test results {test}, median {statistics.median(test)}, mean {statistics.mean(test)}, std {statistics.stdev(test)}')
else:
print(f"Didn't find any RACE-{task} result")
def gather_finetune_results(result_path):
print(f'Gather finetune results for {result_path}')
gather_MNLI_results(result_path)
gather_QQP_results(result_path)
gather_RACE_results(result_path, 'middle')
gather_RACE_results(result_path, 'high')
if __name__ == '__main__':
result_path='/blob/users/conglli/project/bert_with_pile/checkpoint/bert-pile-0.336B-iters-2M-lr-1e-4-min-1e-5-wmup-10000-dcy-2M-sty-linear-gbs-1024-mbs-16-gpu-64-zero-0-mp-1-pp-1-nopp-finetune/'
gather_finetune_results(result_path)

View File

@ -0,0 +1,24 @@
{
"train_batch_size" : CONFIG_BATCH_SIZE,
"train_micro_batch_size_per_gpu": CONFIG_MBSIZE,
"steps_per_print": LOG_INTERVAL,
"zero_optimization": {
"stage": ZERO_STAGE,
"elastic_checkpoint": true
},
"gradient_clipping": 1.0,
"prescale_gradients": PRESCALE_GRAD,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 500,
"hysteresis": 2,
"min_loss_scale": 1,
"initial_scale_power": 11
},
"wall_clock_breakdown" : false
}

View File

@ -0,0 +1,156 @@
hostname_and_rank=$1
master_port=$2
seed=$3
task=$4
lr=$5
pretrained_checkpoint=$6
# hostname_and_rank="worker-0:0,1,2,3"
# master_port=12345
# seed=1234
# task="MNLI"
# lr=2e-5
# pretrained_checkpoint="/blob/users/conglli/project/bert_with_pile/checkpoint/bert-pile-0.336B-iters-2M-lr-1e-4-min-1e-5-wmup-10000-dcy-2M-sty-linear-gbs-1024-mbs-16-gpu-64-zero-0-mp-1-pp-1-nopp"
###############################################################################
### Main configs
seq_len=512
global_batch_size=32
epochs=3
train_data="/blob/data/GlueData/${task}/train.tsv"
valid_data="/blob/data/GlueData/${task}/dev.tsv"
if [[ "${task}" = "MNLI" ]]; then
valid_data="/blob/data/GlueData/MNLI/dev_matched.tsv \
/blob/data/GlueData/MNLI/dev_mismatched.tsv"
fi
## Adjust based on number of GPUs.
batch_size=8
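## Illustrative example: with global_batch_size=32 and micro batch_size=8 and no
## gradient accumulation, this assumes 32 / 8 = 4 GPUs (consistent with the
## "4v100" tag in the output path below).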
## BERT 110M (BERT-Base)
# model_size=0.11
# num_layers=12
# hidden_size=768
# num_attn_heads=12
## BERT 336M (BERT-Large)
model_size=0.336
num_layers=24
hidden_size=1024
num_attn_heads=16
## BERT 1.3B
# model_size=1.3
# num_layers=24
# hidden_size=2048
# num_attn_heads=32
## BERT 3.9B
# model_size=3.9
# num_layers=48
# hidden_size=2560
# num_attn_heads=40
###############################################################################
### Parallelism configs
## Model parallelism, 1 is no MP
mp_size=1
## Pipeline parallelism. To disable PP, set pp_size to 1 and no_pp to true.
## Currently pipeline parallelism is not supported for BERT model: DeepSpeed's
## pipeline parallelism is only integrated with the GPT case, and currently
## DeepSpeed is not integrated with Megatron's own pipeline parallelism.
pp_size=1
no_pp="true"
## ZeRO stage
zero_stage=0
###############################################################################
### Misc configs
log_interval=10
eval_iters=50
eval_interval=100
## Activation checkpointing saves GPU memory, but reduces training speed
# activation_checkpoint="true"
activation_checkpoint="false"
###############################################################################
vocab_file="bert-large-uncased-vocab.txt"
if [ ! -f "$vocab_file" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
fi
jobname="${task}-bsz${global_batch_size}-lr${lr}-epochs${epochs}-seed${seed}"
# output_path="${pretrained_checkpoint}-finetune-glue-4v100/${jobname}"
output_path=$(basename "$pretrained_checkpoint")
output_path="glue-results/${output_path}-finetune-glue-4v100/${jobname}"
mkdir -p ${output_path}
template_json="ds_config_bert_TEMPLATE.json"
config_json="ds_config_bert_bsz${global_batch_size}_mbsz${batch_size}_log${log_interval}_zero${zero_stage}.json"
if [[ $zero_stage -gt 0 ]]; then
sed "s/CONFIG_BATCH_SIZE/${global_batch_size}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/false/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
> ${config_json}
else
sed "s/CONFIG_BATCH_SIZE/${global_batch_size}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
> ${config_json}
fi
options=" \
--finetune \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${zero_stage} \
--task ${task} \
--seed ${seed} \
--train-data ${train_data} \
--valid-data ${valid_data} \
--tokenizer-type BertWordPieceLowerCase \
--vocab-file ${vocab_file} \
--epochs ${epochs} \
--pretrained-checkpoint ${pretrained_checkpoint} \
--tensor-model-parallel-size ${mp_size} \
--pipeline-model-parallel-size ${pp_size} \
--num-layers ${num_layers} \
--hidden-size ${hidden_size} \
--num-attention-heads ${num_attn_heads} \
--global-batch-size ${global_batch_size} \
--micro-batch-size ${batch_size} \
--lr ${lr} \
--lr-decay-style linear \
--lr-warmup-fraction 0.1 \
--seq-length ${seq_len} \
--max-position-embeddings ${seq_len} \
--log-interval ${log_interval} \
--eval-interval ${eval_interval} \
--eval-iters ${eval_iters} \
--weight-decay 1.0e-1 \
--fp16"
if [ "${activation_checkpoint}" = "true" ]; then
options="${options} \
--checkpoint-activations \
--deepspeed-activation-checkpointing"
fi
if [[ "${no_pp}" = "true" ]]; then
options="${options} \
--no-pipeline-parallel"
fi
# After the fine-tuning finishes, you can find the dev set accuracy numbers by
# "grep -e "overall:" -e "metrics for" ${output_path}/output.log"
deepspeed --include=${hostname_and_rank} --master_port=${master_port} ../../../../tasks/main.py ${options} &> ${output_path}/output.log

View File

@ -0,0 +1,44 @@
hostname_and_rank=$1
master_port=$2
pretrained_checkpoint=$3
# hostname_and_rank="worker-0:0,1,2,3"
# master_port=12345
# pretrained_checkpoint="/blob/users/conglli/project/bert_with_pile/checkpoint/bert-pile-0.336B-iters-2M-lr-1e-4-min-1e-5-wmup-10000-dcy-2M-sty-linear-gbs-1024-mbs-16-gpu-64-zero-0-mp-1-pp-1-nopp"
tasks=(
RTE
MRPC
STS-B
CoLA
SST-2
QNLI
QQP
MNLI
)
seeds=(
1234
1235
1236
1237
1238
)
lrs=(
2e-5
3e-5
4e-5
5e-5
)
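# The nested loops below sweep every (task, seed, lr) combination:
# 8 tasks x 5 seeds x 4 learning rates = 160 sequential fine-tuning runs.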
for ((i=0;i<${#tasks[@]};++i)); do
task=${tasks[i]}
for ((j=0;j<${#seeds[@]};++j)); do
seed=${seeds[j]}
for ((k=0;k<${#lrs[@]};++k)); do
lr=${lrs[k]}
bash ds_finetune_bert_glue.sh ${hostname_and_rank} ${master_port} ${seed} ${task} ${lr} ${pretrained_checkpoint}
done
done
done

View File

@ -0,0 +1,118 @@
import os
import statistics
def gather_numbers(fname, match_keywords, index_keywords, index_offsets):
results = {}
for k in index_keywords:
results[k] = []
file1 = open(fname, 'r')
while True:
line = file1.readline()
if not line:
break
splits = line.split(' ')
for i in range(len(match_keywords)):
if match_keywords[i] in line:
ref_idx = splits.index(index_keywords[i])
results[index_keywords[i]].append(float(splits[ref_idx+index_offsets[i]]))
file1.close()
return results
def gather_GLUE_results(result_path, key, lr):
result = []
mnli_matched_result = []
mnli_mismatched_result = []
for file in os.listdir(result_path):
if file.startswith(key) and lr in file:
fname = f'{result_path}/{file}/output.log'
if os.path.exists(fname):
if key == "STS-B":
results = gather_numbers(fname, ['metrics for'], ['spearmanr'], [2])
overall_candidate = results['spearmanr']
overall_candidate = [x * 100.0 for x in overall_candidate]
elif key == "CoLA":
results = gather_numbers(fname, ['metrics for'], ['mcc'], [2])
overall_candidate = results['mcc']
overall_candidate = [x * 100.0 for x in overall_candidate]
elif key == "MNLI":
results = gather_numbers(fname,
['overall:', 'metrics for dev-matched:', 'metrics for dev-mismatched:'],
['overall:', 'dev-matched:', 'dev-mismatched:'],
[9, 9, 9])
overall_candidate = results['overall:']
matched_candidate = results['dev-matched:']
mismatched_candidate = results['dev-mismatched:']
else:
results = gather_numbers(fname, ['overall:'], ['overall:'], [9])
overall_candidate = results['overall:']
if len(overall_candidate) > 0:
if len(overall_candidate) != 3:
print(f"{result_path} task {key} lr {lr} only has {len(overall_candidate)} epoch")
best_index = overall_candidate.index(max(overall_candidate))
result.append(overall_candidate[best_index])
if key == "MNLI":
mnli_matched_result.append(matched_candidate[best_index])
mnli_mismatched_result.append(mismatched_candidate[best_index])
if len(result) > 0:
if len(result) != 5:
print(f"{result_path} task {key} lr {lr} only has {len(result)} seed")
if key == "MNLI":
best_index = result.index(statistics.median_high(result))
return round(mnli_matched_result[best_index],2), round(statistics.stdev(mnli_matched_result),2), round(mnli_mismatched_result[best_index],2), round(statistics.stdev(mnli_mismatched_result),2)
else:
return round(statistics.median_high(result),2), round(statistics.stdev(result),2)
else:
if key == "MNLI":
return None, None, None, None
else:
return None, None
def gather_finetune_results(result_path, extra_col=[], lr="2e-5"):
output = ""
for field in extra_col:
output += f"{field} &"
task_output = ""
median_list, std_list = [], []
m_median, m_std, mm_median, mm_std = gather_GLUE_results(result_path, "MNLI", lr)
if m_median is not None:
median_list += [m_median, mm_median]
std_list += [m_std, mm_std]
task_output += f"{m_median}±{m_std} & {mm_median}±{mm_std} &"
tasks = ["QQP", "QNLI", "SST-2", "CoLA", "STS-B", "MRPC", "RTE"]
for task in tasks:
t_median, t_std = gather_GLUE_results(result_path, task, lr)
if t_median is not None:
median_list += [t_median]
std_list += [t_std]
if task == "RTE":
task_output += f"{t_median}±{t_std} "
else:
task_output += f"{t_median}±{t_std} &"
overall_median = round(sum(median_list) / len(median_list), 2)
overall_std = round(sum(std_list) / len(std_list), 2)
output += f"{overall_median}±{overall_std} &"
output += task_output
output += " \\\\"
print(output)
if __name__ == '__main__':
print("\\begin{table}")
print("\\centering")
print("\\tiny")
text = "\\begin{tabular}{@{}l|"
for _ in range(11):
text += "c"
text += "@{}}"
print(text)
print("\\toprule")
print("Case & Train tokens & Average & MNLI-m & MNLI-mm & QQP & QNLI & SST-2 & CoLA & STS-B & MRPC & RTE \\\\")
print("\\midrule")
result_path='/blob/users/conglli/project/bert_with_pile/checkpoint/bert-pile-0.336B-iters-2M-lr-1e-4-min-1e-5-wmup-10000-dcy-2M-sty-linear-gbs-1024-mbs-16-gpu-64-zero-0-mp-1-pp-1-nopp-finetune/'
gather_finetune_results(result_path)
print("\\bottomrule")
print("\\end{tabular}")
print("\\end{table}")
print("")
print("")

View File

@ -0,0 +1,129 @@
import zstandard
import sys
import time
import os
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),
os.path.pardir,os.path.pardir,os.path.pardir)))
from megatron.data import indexed_dataset
def pile_download(download_url, file_path, i):
start = time.time()
zstd_file_path = f"{file_path}{i:02}.jsonl.zst"
download_path = f"{download_url}{i:02}.jsonl.zst"
if not os.path.exists(zstd_file_path):
os.system(f"wget -P {file_path} {download_path}")
print(f"Finished downloading chunk {i} in {time.time() - start} sec")
def pile_decompress(download_url, file_path, i):
zstd_file_path = f"{file_path}{i:02}.jsonl.zst"
output_path = f"{file_path}{i:02}.jsonl"
if not os.path.exists(output_path):
if not os.path.exists(zstd_file_path):
pile_download(download_url, file_path, i)
start = time.time()
with open(zstd_file_path, 'rb') as compressed:
decomp = zstandard.ZstdDecompressor()
with open(output_path, 'wb') as destination:
decomp.copy_stream(compressed, destination)
os.remove(zstd_file_path)
print(f"Finished decompressing chunk {i} in {time.time() - start} sec")
def pile_preprocess(download_url, file_path, vocab_file, num_workers, i):
json_file_path = f"{file_path}{i:02}.jsonl"
output_prefix = f"{file_path}pile_bert_train_{i:02}"
if not os.path.exists(f"{output_prefix}_text_sentence.idx"):
if not os.path.exists(json_file_path):
pile_decompress(download_url, file_path, i)
start = time.time()
cmd = f"python ../../tools/preprocess_data.py \
--input {json_file_path} \
--output-prefix {output_prefix} \
--vocab {vocab_file} \
--dataset-impl mmap \
--tokenizer-type BertWordPieceLowerCase \
--split-sentences \
--workers {num_workers} "
# It's possible to hit a MemoryError during the above cmd since the memory
# usage is proportional to num_workers. In that case we delete the
# incomplete output and the user should retry with a smaller num_workers.
# In our experience, chunks 6, 7, 9, 17, 18, 20, 21, 24, 27 have
# particularly large memory usage.
if os.system(cmd) == 0: # Success
os.remove(json_file_path)
else:
print(f"Error: chunk {i} preprocessing failed, deleting \
incomplete output. If a MemoryError appeared, please retry \
with num_workers smaller than {num_workers}.")
if os.path.exists(f"{output_prefix}_text_sentence.idx"):
os.remove(f"{output_prefix}_text_sentence.idx")
if os.path.exists(f"{output_prefix}_text_sentence.bin"):
os.remove(f"{output_prefix}_text_sentence.bin")
print(f"Finished preprocessing chunk {i} in {time.time() - start} sec")
def pile_merge(file_path):
start = time.time()
num_chunks = 30
vocab_size = 30524
for i in range(num_chunks):
output_prefix = f"{file_path}pile_bert_train_{i:02}"
assert os.path.exists(f"{output_prefix}_text_sentence.idx")
assert os.path.exists(f"{output_prefix}_text_sentence.bin")
builder = indexed_dataset.make_builder(
f"{file_path}pile_bert_train_text_sentence.bin", impl="mmap",
vocab_size=vocab_size)
for i in range(num_chunks):
chunk_file = f"{file_path}pile_bert_train_{i:02}_text_sentence"
print(f"Merging file {chunk_file}")
builder.merge_file_(chunk_file)
print("Finalizing merged file ...")
builder.finalize(f"{file_path}pile_bert_train_text_sentence.idx")
print(f"Finished merging in {time.time() - start} sec")
# After verifying the merged data with real training, you may want to
# delete the data chunks.
# for i in range(num_chunks):
# output_prefix = f"{file_path}pile_bert_train_{i:02}"
# os.remove(f"{output_prefix}_text_sentence.idx")
# os.remove(f"{output_prefix}_text_sentence.bin")
if __name__ == '__main__':
# Path to download and store all the output files during the whole process.
# Estimated max storage usage would be around 1.6 TB (or 780GB if you skip the
# final merge). Memory usage is proportional to the num_workers below (can
# be as high as O(300GB) if num_workers is around 20).
file_path = "/blob/data/the_pile_bert/"
# The raw Pile data has 30 compressed .zst chunks. To run on single
# machine for all chunks, run "python prepare_pile_data.py range 0 30".
# You can also split and run on multiple machines to speed up, since
# processing one chunk can take hours. The whole process only uses CPU.
if sys.argv[1] == "merge":
# "python prepare_pile_data.py merge" means merge all 30 processed data
# chunks. Run it only after all 30 chunks are preprocessed. The memory
# usage during merge is about 600GB. If you don't have enough memory,
# one solution is to directly use the 30 data chunks as multiple
# datasets. See '--data-path' in
# github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/arguments.py
pile_merge(file_path)
else:
if sys.argv[1] == "range":
# "python prepare_pile_data.py range 0 30" means process chunk 0-29
selected_chunk = range(int(sys.argv[2]), int(sys.argv[3]))
else:
# "python prepare_pile_data.py 2 5 8" means process chunk 2, 5, 8
selected_chunk = [int(x) for x in sys.argv[1:]]
print("selected_chunk: ", selected_chunk)
# Number of process. Adjust based on your CPU/Memory.
num_workers = 20
# Where the raw Pile data can be downloaded. The url may change in the
# future. Contact EleutherAI (https://github.com/EleutherAI/the-pile)
# if this url does not work.
download_url = "https://the-eye.eu/public/AI/pile/train/"
vocab_file = "bert-large-uncased-vocab.txt"
vocab_url = "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt"
if not os.path.exists(vocab_file):
os.system(f"wget {vocab_url}")
os.makedirs(file_path, exist_ok=True)
for i in selected_chunk:
pile_preprocess(download_url, file_path, vocab_file, num_workers, i)

View File

@ -0,0 +1,74 @@
{
"train_batch_size": GBSIZE,
"train_micro_batch_size_per_gpu": MBSIZE,
"steps_per_print": LOG_INTERVAL,
"zero_optimization": {
"stage": ZERO_STAGE,
"elastic_checkpoint": true
},
"gradient_clipping": 1.0,
"prescale_gradients": PRESCALE_GRAD,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 500,
"hysteresis": 2,
"min_loss_scale": 1,
"initial_scale_power": 11
},
"wall_clock_breakdown" : false,
"dataloader_drop_last": true,
"data_efficiency": {
"enabled": true,
"seed": DATA_EFFICIENCY_SEED,
"data_routing": {
"enabled": LTD_ENABLED,
"random_ltd":{
"enabled": LTD_ENABLED,
"total_layer_num": 24,
"random_ltd_layer_num": 22,
"random_ltd_layer_id": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
"model_mask_name": "attention_mask",
"model_type": "encoder",
"hidden_state_order": "seq_batch_dim",
"random_ltd_schedule": {
"min_value": LTD_MIN,
"max_value": LTD_MAX,
"schedule_type":"fixed_linear",
"schedule_config": {
"require_steps": LTD_STEP,
"seq_per_step": 16
}
}
}
},
"data_sampling": {
"enabled": CL_ENABLED,
"num_workers": DATA_SAMPLING_NUM_WORKERS,
"curriculum_learning": {
"enabled": CL_ENABLED,
"data_cluster_path": "CL_CLUSTER_PATH",
"curriculum_metrics": {
"CL_1st_METRIC_NAME": {
"index_to_sample_path": "CL_1st_SAMPLE_PATH",
"index_to_metric_path": "CL_1st_METRIC_PATH",
"difficulty_type": "CL_1st_DIFF_TYPE",
"clustering_type": "CL_1st_CLUSTER_TYPE",
"min_difficulty": CL_1st_MIN,
"max_difficulty": CL_1st_MAX,
"schedule_type": "fixed_root",
"schedule_config": {
"total_curriculum_step": CL_1st_TOTAL_STEP,
"difficulty_step": CL_1st_DIFF_STEP,
"root_degree": CL_1st_ROOT
}
}
}
}
}
}
}

View File

@ -0,0 +1,88 @@
{
"train_batch_size": GBSIZE,
"train_micro_batch_size_per_gpu": MBSIZE,
"steps_per_print": LOG_INTERVAL,
"zero_optimization": {
"stage": ZERO_STAGE,
"elastic_checkpoint": true
},
"gradient_clipping": 1.0,
"prescale_gradients": PRESCALE_GRAD,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 500,
"hysteresis": 2,
"min_loss_scale": 1,
"initial_scale_power": 11
},
"wall_clock_breakdown" : false,
"dataloader_drop_last": true,
"data_efficiency": {
"enabled": true,
"seed": DATA_EFFICIENCY_SEED,
"data_routing": {
"enabled": LTD_ENABLED,
"random_ltd":{
"enabled": LTD_ENABLED,
"total_layer_num": 24,
"random_ltd_layer_num": 22,
"random_ltd_layer_id": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
"model_mask_name": "attention_mask",
"model_type": "encoder",
"hidden_state_order": "seq_batch_dim",
"random_ltd_schedule": {
"min_value": LTD_MIN,
"max_value": LTD_MAX,
"schedule_type":"fixed_linear",
"schedule_config": {
"require_steps": LTD_STEP,
"seq_per_step": 16
}
}
}
},
"data_sampling": {
"enabled": CL_ENABLED,
"num_workers": DATA_SAMPLING_NUM_WORKERS,
"curriculum_learning": {
"enabled": CL_ENABLED,
"data_cluster_path": "CL_CLUSTER_PATH",
"curriculum_metrics": {
"CL_1st_METRIC_NAME": {
"index_to_sample_path": "CL_1st_SAMPLE_PATH",
"index_to_metric_path": "CL_1st_METRIC_PATH",
"difficulty_type": "CL_1st_DIFF_TYPE",
"clustering_type": "CL_1st_CLUSTER_TYPE",
"min_difficulty": CL_1st_MIN,
"max_difficulty": CL_1st_MAX,
"schedule_type": "fixed_root",
"schedule_config": {
"total_curriculum_step": CL_1st_TOTAL_STEP,
"difficulty_step": CL_1st_DIFF_STEP,
"root_degree": CL_1st_ROOT
}
},
"CL_2nd_METRIC_NAME": {
"index_to_sample_path": "CL_2nd_SAMPLE_PATH",
"index_to_metric_path": "CL_2nd_METRIC_PATH",
"difficulty_type": "CL_2nd_DIFF_TYPE",
"clustering_type": "CL_2nd_CLUSTER_TYPE",
"min_difficulty": CL_2nd_MIN,
"max_difficulty": CL_2nd_MAX,
"schedule_type": "fixed_root",
"schedule_config": {
"total_curriculum_step": CL_2nd_TOTAL_STEP,
"difficulty_step": CL_2nd_DIFF_STEP,
"root_degree": CL_2nd_ROOT
}
}
}
}
}
}
}

View File

@ -0,0 +1,472 @@
#!/bin/bash
dir=`pwd`
###############################################################################
### Main configs
### The main configs are from Megatron-LM paper
### https://arxiv.org/abs/1909.08053. Choose based on your desired model size
### or build your own configs.
seq_len=512
global_batch_size=1024
# lr=1e-4
lr=$1
min_lr=1e-5
## init_std is the standard deviation for weight initialization. Usually a
## larger model needs a lower std. Here we roughly follow the heuristic
## sqrt(1/(3*hidden_size)) from https://arxiv.org/pdf/2201.11990.pdf
## In addition, we find that the 3.9B model (even after tuning init_std) has a
## NaN loss issue from the beginning and is thus unable to train. This is
## probably because this example uses the public Pile data, which is more
## diverse (and potentially noisier) than the data used in the Megatron paper.
## One potential solution is to only use the sub-datasets of Pile that are also
## used by the Megatron paper.
## BERT 110M (same config as original BERT-Base model)
## This config is not included in Megatron-LM paper
# model_size=0.11
# num_layers=12
# hidden_size=768
# num_attn_heads=12
# init_std=0.02
## BERT 336M (same config as original BERT-Large model)
model_size=0.336
num_layers=24
hidden_size=1024
num_attn_heads=16
init_std=0.02
## BERT 1.3B
# model_size=1.3
# num_layers=24
# hidden_size=2048
# num_attn_heads=32
# init_std=0.013
## BERT 3.9B
# model_size=3.9
# num_layers=48
# hidden_size=2560
# num_attn_heads=40
# init_std=0.011
###############################################################################
### Training duration configs
## The main termination condition; the original Megatron paper trains for 2M iters.
## We changed to token-based termination since data efficiency techniques can
## change the number of tokens per step.
calc() { awk "BEGIN{ printf \"%.0f\n\", $* }"; }
# train_iters_in_million=2
train_iters_in_million=$2
train_tokens=$(calc $train_iters_in_million*1000000*$seq_len*$global_batch_size)
train_tokens_in_billion=$(calc $train_tokens/1000000000)
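## Illustrative example: with train_iters_in_million=2 (the original Megatron
## setting), seq_len=512 and global_batch_size=1024, train_tokens works out to
## 2e6 * 512 * 1024 = 1,048,576,000,000 tokens, i.e. roughly 1049B tokens.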
## A large enough number of iters, just to make sure we index enough data. The
## only effective termination condition is the train_tokens above.
train_iters=4000000
## Another wall-clock time termination condition in minutes. Set it large
## enough to avoid undesired early termination.
exit_duration=30000000
###############################################################################
### lr configs
## lr warmup and decay duration. The original Megatron paper uses 10000 warmup
## iters. We changed lr decay to token-based since data efficiency techniques
## can change the number of tokens per step.
lr_warmup_iters=10000
lr_decay_tokens_in_billion=${train_tokens_in_billion}
lr_decay_tokens=${train_tokens}
lr_decay_style="linear"
###############################################################################
### Parallelism configs
## Model parallelism, 1 is no MP
mp_size=1
## Pipeline parallelism. To disable PP, set pp_size to 1 and no_pp to true.
## Currently pipeline parallelism is not supported for BERT model: DeepSpeed's
## pipeline parallelism is only integrated with the GPT case, and currently
## DeepSpeed is not integrated with Megatron's own pipeline parallelism.
## Note that currently both curriculum learning and random-LTD are NOT
## compatible with pipeline parallelism.
pp_size=1
no_pp="true"
## ZeRO-based data parallelism, stage=0 will disable ZeRO
zero_stage=0
## Total number of GPUs. ds_ssh is from DeepSpeed library.
num_gpus=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2))
num_gpus_pernode=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
num_node=$(( ${num_gpus} / ${num_gpus_pernode} ))
## Data parallel size.
dp_size=$(( ${num_gpus} / ${pp_size} / ${mp_size} ))
## Micro batch size per GPU
## Make sure that batch_size <= global_batch_size*pp_size*mp_size/num_gpus
## Reduce it manually if GPU OOM
batch_size=$(( ${global_batch_size} / ${dp_size} ))
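## Illustrative example: on 64 GPUs with mp_size=1 and pp_size=1, dp_size=64 and
## batch_size = 1024 / 64 = 16 per GPU (the gbs-1024-mbs-16-gpu-64 configuration
## used elsewhere in these examples).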
###############################################################################
### Random layerwise token dropping (random-LTD) configs
## random-LTD's main switch. "false" means disabled. "true" means enabled.
ltd_enabled=${3:-'false'}
## The dropping ratio to start with; the value denotes the seqlen after
## dropping.
ltd_start=${4:-512}
## How many steps for random-LTD to gradually reduce dropping ratio to zero.
ltd_step_in_million=${5:-1}
# ltd_enabled="true"
# ltd_start=200
# ltd_step_in_million=1.8
ltd_step=$(calc $ltd_step_in_million*1000000)
## For BERT pretraining, we observe that random-LTD when combined with zero
## dropout can achieve better finetune accuracy on certain tasks. However, this
## is not guaranteed for all models/tasks. It is still recommended to try
## random-LTD both with and without dropout.
dropout=${6:-0.1}
###############################################################################
### Curriculum learning (CL) configs
## CL's main switch. "false" means disabled. "true" means enabled.
cl_enabled=${7:-'false'}
## Number of CL metrics to use.
cl_num_metric=${8:-1}
## Name of difficulty metric
cl_1st_metric=${9:-'dummy'}
## Path to the data indexes for this difficulty metric. The samples on the ith
## row of index_to_sample have the difficulty value given on the ith row of
## index_to_metric.
cl_1st_index_to_sample_path=${10:-'dummy'}
cl_1st_index_to_metric_path=${11:-'dummy'}
## During training, whether to increase difficulty by value or by percentile.
cl_1st_difficulty_type=${12:-'value'}
## "single_cluster" means no clustering is required and CL is likely achieved by
## data postprocessing. "schedule_based" means the data will be clustered based
## on the difficulty schedule (pacing function) below.
cl_1st_clustering_type=${13:-'single_cluster'}
## Start difficulty
cl_1st_min=${14:-512}
## End difficulty
cl_1st_max=${15:-512}
## Total step to reach end difficulty
cl_1st_total_step_in_million=${16:-1}
## When changing difficulty, always make sure it's a multiple of the
## difficulty_step below.
cl_1st_difficulty_step=${17:-1}
## Root degree of the schedule (pacing function).
cl_1st_root=${18:-1}
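## A rough sketch of what the fixed_root pacing does (paraphrased; see the
## DeepSpeed data efficiency documentation for the exact formula): difficulty is
## raised from cl_1st_min to cl_1st_max over cl_1st_total_step steps following
## approximately min + (max - min) * (step / total_step)^(1 / root), rounded to
## a multiple of cl_1st_difficulty_step; root=1 gives a linear schedule and
## root=2 a square-root-like schedule.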
cl_2nd_metric=${19:-'dummy'}
cl_2nd_index_to_sample_path=${20:-'dummy'}
cl_2nd_index_to_metric_path=${21:-'dummy'}
cl_2nd_difficulty_type=${22:-'value'}
cl_2nd_clustering_type=${23:-'single_cluster'}
cl_2nd_min=${24:-2048}
cl_2nd_max=${25:-2048}
cl_2nd_total_step_in_million=${26:-1}
cl_2nd_difficulty_step=${27:-1}
cl_2nd_root=${28:-1}
# cl_enabled="true"
# cl_num_metric=2
# cl_1st_metric="voc"
# ## The *_index_to_sample_percentile_merged is a concatenated index for perf
# ## improvement, but it only works when you set difficulty_type="percentile" in
# ## ds_config. If you use difficulty_type="value", you need to change this to
# ## *_index_to_sample
# # cl_1st_index_to_sample_path="/blob/users/conglli/data/analysis_pile_bert_5epoch/vocab_rarity/vocab_rarity_index_to_sample_percentile_merged"
# cl_1st_index_to_sample_path="/blob/users/conglli/data/analysis_pile_bert_5epoch/vocab_rarity/vocab_rarity_index_to_sample"
# cl_1st_index_to_metric_path="/blob/users/conglli/data/analysis_pile_bert_5epoch/vocab_rarity/vocab_rarity_index_to_metric"
# cl_1st_difficulty_type="value"
# cl_1st_clustering_type="schedule_based"
# cl_1st_min=600
# cl_1st_max=9069
# cl_1st_total_step_in_million=0.96
# cl_1st_difficulty_step=1
# cl_1st_root=2
# cl_2nd_metric="seqlen_truncate"
# cl_2nd_index_to_sample_path="dummy"
# cl_2nd_index_to_metric_path="dummy"
# cl_2nd_difficulty_type="value"
# cl_2nd_clustering_type="single_cluster"
# cl_2nd_min=128
# cl_2nd_max=512
# cl_2nd_total_step_in_million=0.96
# cl_2nd_difficulty_step=8
# cl_2nd_root=1
cl_1st_total_step=$(calc $cl_1st_total_step_in_million*1000000)
cl_2nd_total_step=$(calc $cl_2nd_total_step_in_million*1000000)
###############################################################################
### Misc configs
log_interval=100
eval_iters=10
eval_interval=1000
# num_save controls how frequently to save a checkpoint. num_save=20 means that a
# checkpoint will be saved every 5% of training. For longer training you would
# want a larger num_save to save more frequently, and vice versa.
num_save=100
estimated_train_iter=$((${train_tokens} / ${seq_len} / ${global_batch_size}))
save_interval=$((${estimated_train_iter} / ${num_save}))
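# Illustrative example: for the ~1.05e12-token run above, estimated_train_iter
# = train_tokens / (seq_len * global_batch_size) = 2,000,000, so num_save=100
# gives a save_interval of 20,000 iterations.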
## Activation checkpointing saves GPU memory, but reduces training speed
# activation_checkpoint="true"
activation_checkpoint="false"
## Whether or not to log optimizer states (norms, max abs values) to tensorboard.
## This is not required for training and might save GPU memory when turned off.
log_optimizer_state="true"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d_%H.%M.%S")
host="${HOSTNAME}"
seed=1234
## Number of workers for dataloader. We found that for BERT pre-training,
## num_workers will greatly affect data loading time and overall training
## time. In our experiment with 64 GPUs, performance peaks at
## num_workers = 4, but this may differ depending on hardware. Also note that
## a larger num_workers adds more CPU computation/memory overhead.
num_workers=4
## The public Pile dataset; see ../pile_data_download_preprocess.py for how
## to download and preprocess the data. Change data_home to where you store the
## pile_bert_train_text_sentence.bin and pile_bert_train_text_sentence.idx.
data_home="/vc_data_blob/users/conglli/the_pile_bert"
if [[ "$host" == *"webxt"* ]]; then
data_home="/blob/data/the_pile_bert"
fi
data_path="${data_home}/pile_bert_train_text_sentence"
## train_idx_path forces Megatron to use a specific data index file generated
## when we analyze the data. This is needed because our curriculum learning
## difficulty metric index is based on this data index.
train_idx_path="${data_home}/pile_bert_train_text_sentence_train_indexmap_exact5ep_509msl_0.10ssp_1234s.npy"
vocab_path="bert-large-uncased-vocab.txt"
if [ ! -f "$vocab_path" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
fi
prescale_grad="true"
jobname="bert_${model_size}B_tok${train_tokens_in_billion}B"
jobname="${jobname}_lr${lr}_min${min_lr}_w${lr_warmup_iters}_d${lr_decay_tokens_in_billion}B_${lr_decay_style}"
jobname="${jobname}_gbs${global_batch_size}_mbs${batch_size}_g${num_gpus}"
if [[ $zero_stage -gt 0 ]]; then
jobname="${jobname}_z${zero_stage}"
prescale_grad="false"
fi
if [[ $mp_size -gt 1 ]]; then
jobname="${jobname}_mp${mp_size}"
fi
if [ "${no_pp}" = "false" ]; then
jobname="${jobname}_pp${pp_size}"
fi
jobname="${jobname}_seed${seed}"
if [ "${ltd_enabled}" = "true" ]; then
jobname="${jobname}_ltd_${ltd_start}_${ltd_step_in_million}M_drop${dropout}"
fi
if [ "${cl_enabled}" = "true" ]; then
jobname="${jobname}_cl_${cl_1st_metric}_${cl_1st_min}_${cl_1st_max}_${cl_1st_total_step_in_million}M_${cl_1st_root}"
if [[ $cl_num_metric -gt 1 ]]; then
jobname="${jobname}_${cl_2nd_metric}_${cl_2nd_min}_${cl_2nd_max}_${cl_2nd_total_step_in_million}M_${cl_2nd_root}"
fi
fi
username=$(whoami)
output_home="/blob/users/${username}/project/data_efficient_bert"
log_path="${output_home}/log/"
checkpoint_path="${output_home}/checkpoint/${jobname}"
## Microsoft internal constraint: because tensorboard is logged by last rank,
## it's better to put the path in NFS instead of Blob.
tensorboard_dir="/vc_data/users/${username}/project/data_efficient_bert/tensorboard/"
tensorboard_path="${tensorboard_dir}${jobname}_${host}_${current_time}"
mkdir -p ${log_path}
mkdir -p ${checkpoint_path}
mkdir -p ${tensorboard_path}
if [ "${cl_enabled}" = "true" ]; then
data_cluster_path="${output_home}/data_cluster/${jobname}"
mkdir -p ${data_cluster_path}
fi
###############################################################################
data_options=" \
--vocab-file ${vocab_path} \
--data-path ${data_path} \
--data-impl mmap"
## If CL is used, make sure to set "--split" the same as what you used during
## offline data analysis&indexing.
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.999 \
--tensor-model-parallel-size ${mp_size} \
--init-method-std ${init_std} \
--lr-decay-tokens ${lr_decay_tokens} \
--lr-warmup-iters ${lr_warmup_iters} \
--micro-batch-size ${batch_size} \
--exit-duration-in-mins ${exit_duration} \
--global-batch-size ${global_batch_size} \
--num-layers ${num_layers} \
--hidden-size ${hidden_size} \
--num-attention-heads ${num_attn_heads} \
--seq-length ${seq_len} \
--max-position-embeddings ${seq_len} \
--train-tokens ${train_tokens} \
--train-iters ${train_iters} \
--lr ${lr} \
--min-lr ${min_lr} \
--lr-decay-style ${lr_decay_style} \
--split 949,50,1 \
--log-interval ${log_interval} \
--eval-interval ${eval_interval} \
--eval-iters ${eval_iters} \
--save-interval ${save_interval} \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--num-workers ${num_workers} \
--fp16 \
--seed ${seed} \
--load ${checkpoint_path} \
--save ${checkpoint_path} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${tensorboard_path}"
if [ "${activation_checkpoint}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [ "${log_optimizer_state}" = "true" ]; then
megatron_options="${megatron_options} \
--log-optimizer-states-to-tensorboard"
fi
if [ "${ltd_enabled}" = "true" ]; then
megatron_options="${megatron_options} \
--attention-dropout ${dropout} \
--hidden-dropout ${dropout} \
--random-ltd"
fi
if [ "${cl_enabled}" = "true" ]; then
megatron_options="${megatron_options} \
--train-idx-path ${train_idx_path} \
--data-efficiency-curriculum-learning"
fi
config_json="ds_config_gbs${global_batch_size}_mbs${batch_size}_log${log_interval}_zero${zero_stage}_seed${seed}"
if [ "${ltd_enabled}" = "true" ]; then
config_json="${config_json}_ltd_${ltd_start}_${ltd_step}"
fi
if [ "${cl_enabled}" = "true" ]; then
config_json="${config_json}_cl_${cl_1st_metric}_${cl_1st_min}_${cl_1st_max}_${cl_1st_total_step}_${cl_1st_root}"
if [[ $cl_num_metric -gt 1 ]]; then
config_json="${config_json}_${cl_2nd_metric}_${cl_2nd_min}_${cl_2nd_max}_${cl_2nd_total_step}_${cl_2nd_root}"
fi
fi
config_json="${config_json}.json"
if [[ $cl_num_metric -gt 1 ]]; then
template_json="ds_config_bert_2clmetrics_TEMPLATE.json"
sed "s/GBSIZE/${global_batch_size}/" ${template_json} \
| sed "s/MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/${prescale_grad}/" \
| sed "s/DATA_EFFICIENCY_SEED/${seed}/" \
| sed "s/LTD_ENABLED/${ltd_enabled}/" \
| sed "s/LTD_MIN/${ltd_start}/" \
| sed "s/LTD_MAX/${seq_len}/" \
| sed "s/LTD_STEP/${ltd_step}/" \
| sed "s/CL_ENABLED/${cl_enabled}/" \
| sed "s/DATA_SAMPLING_NUM_WORKERS/${num_workers}/" \
| sed "s#CL_CLUSTER_PATH#${data_cluster_path}#" \
| sed "s#CL_1st_METRIC_NAME#${cl_1st_metric}#" \
| sed "s#CL_1st_SAMPLE_PATH#${cl_1st_index_to_sample_path}#" \
| sed "s#CL_1st_METRIC_PATH#${cl_1st_index_to_metric_path}#" \
| sed "s#CL_1st_DIFF_TYPE#${cl_1st_difficulty_type}#" \
| sed "s#CL_1st_CLUSTER_TYPE#${cl_1st_clustering_type}#" \
| sed "s/CL_1st_MIN/${cl_1st_min}/" \
| sed "s/CL_1st_MAX/${cl_1st_max}/" \
| sed "s/CL_1st_TOTAL_STEP/${cl_1st_total_step}/" \
| sed "s/CL_1st_DIFF_STEP/${cl_1st_difficulty_step}/" \
| sed "s/CL_1st_ROOT/${cl_1st_root}/" \
| sed "s#CL_2nd_METRIC_NAME#${cl_2nd_metric}#" \
| sed "s#CL_2nd_SAMPLE_PATH#${cl_2nd_index_to_sample_path}#" \
| sed "s#CL_2nd_METRIC_PATH#${cl_2nd_index_to_metric_path}#" \
| sed "s#CL_2nd_DIFF_TYPE#${cl_2nd_difficulty_type}#" \
| sed "s#CL_2nd_CLUSTER_TYPE#${cl_2nd_clustering_type}#" \
| sed "s/CL_2nd_MIN/${cl_2nd_min}/" \
| sed "s/CL_2nd_MAX/${cl_2nd_max}/" \
| sed "s/CL_2nd_TOTAL_STEP/${cl_2nd_total_step}/" \
| sed "s/CL_2nd_DIFF_STEP/${cl_2nd_difficulty_step}/" \
| sed "s/CL_2nd_ROOT/${cl_2nd_root}/" \
> ${config_json}
else
template_json="ds_config_bert_1clmetric_TEMPLATE.json"
sed "s/GBSIZE/${global_batch_size}/" ${template_json} \
| sed "s/MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/${prescale_grad}/" \
| sed "s/DATA_EFFICIENCY_SEED/${seed}/" \
| sed "s/LTD_ENABLED/${ltd_enabled}/" \
| sed "s/LTD_MIN/${ltd_start}/" \
| sed "s/LTD_MAX/${seq_len}/" \
| sed "s/LTD_STEP/${ltd_step}/" \
| sed "s/CL_ENABLED/${cl_enabled}/" \
| sed "s/DATA_SAMPLING_NUM_WORKERS/${num_workers}/" \
| sed "s#CL_CLUSTER_PATH#${data_cluster_path}#" \
| sed "s#CL_1st_METRIC_NAME#${cl_1st_metric}#" \
| sed "s#CL_1st_SAMPLE_PATH#${cl_1st_index_to_sample_path}#" \
| sed "s#CL_1st_METRIC_PATH#${cl_1st_index_to_metric_path}#" \
| sed "s#CL_1st_DIFF_TYPE#${cl_1st_difficulty_type}#" \
| sed "s#CL_1st_CLUSTER_TYPE#${cl_1st_clustering_type}#" \
| sed "s/CL_1st_MIN/${cl_1st_min}/" \
| sed "s/CL_1st_MAX/${cl_1st_max}/" \
| sed "s/CL_1st_TOTAL_STEP/${cl_1st_total_step}/" \
| sed "s/CL_1st_DIFF_STEP/${cl_1st_difficulty_step}/" \
| sed "s/CL_1st_ROOT/${cl_1st_root}/" \
> ${config_json}
fi
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${zero_stage} \
--pipeline-model-parallel-size ${pp_size}"
if [[ "${no_pp}" = "true" ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${activation_checkpoint}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
## When saving checkpoints to a storage with cache, there could be a consistency
## issue with the pointer to the latest checkpoint. Here we find the correct
## pointer and broadcast it to all nodes.
iteration_file="$checkpoint_path/latest_checkpointed_iteration.txt"
iteration_file_2="$checkpoint_path/latest"
iteration=0
for (( node = 0; node <= num_node-1; node++ ))
do
if ssh -q worker-"$node" "test -f \"$iteration_file\""; then
local_iteration=$(ssh -q worker-"$node" cat $iteration_file)
iteration=$(( ${local_iteration} > ${iteration} ? ${local_iteration} : ${iteration} ))
fi
done
if [[ $iteration -gt 0 ]]; then
iteration_2="global_step${iteration}"
ds_ssh "echo $iteration > $iteration_file"
ds_ssh "echo $iteration_2 > $iteration_file_2"
fi
deepspeed ${dir}/../../../../pretrain_bert.py ${megatron_options} ${data_options} ${deepspeed_options} &>> ${log_path}/${jobname}_${host}_${current_time}.log

View File

@ -0,0 +1,363 @@
###############################################################################
### Each block below is one pretraining setup. Uncomment one block to try.
###############################################################################
### Baseline cases, mostly based on Megatron-LM's BERT-Large hyperparameters,
### but with some changes (different LR schedule).
## Baseline 1049B tokens (100%):
# lr=1e-4
# train_iters_in_million=2
# bash ds_pretrain_bert_336M_base_script.sh ${lr} ${train_iters_in_million}
###############################################################################
## Baseline 703B tokens (67%):
# lr=1.5e-4
# train_iters_in_million=134e-2
# bash ds_pretrain_bert_336M_base_script.sh ${lr} ${train_iters_in_million}
###############################################################################
## Baseline 524B tokens (50%):
# lr=2e-4
# train_iters_in_million=1
# bash ds_pretrain_bert_336M_base_script.sh ${lr} ${train_iters_in_million}
###############################################################################
### Curriculum learning (CL) + Random layerwise token dropping (random-LTD).
### DeepSpeed Data Efficiency's composed solution.
### BERT pretraining.
## CL+random-LTD 1049B tokens (100%):
# lr=1e-4
# train_iters_in_million=2
# ltd_enabled="true"
# ltd_start=128
# ltd_step_in_million=2
# dropout=1e-1
# cl_enabled="true"
# cl_num_metric=2
# cl_1st_metric="voc"
# cl_1st_index_to_sample_path="/vc_data/users/conglli/code/data_efficiency/data/analysis_pile_bert_5epoch/vocab_rarity/vocab_rarity_index_to_sample_percentile_merged"
# cl_1st_index_to_metric_path="/vc_data/users/conglli/code/data_efficiency/data/analysis_pile_bert_5epoch/vocab_rarity/vocab_rarity_index_to_metric"
# cl_1st_difficulty_type="percentile"
# cl_1st_clustering_type="schedule_based"
# cl_1st_min=5
# cl_1st_max=100
# cl_1st_total_step_in_million=96e-2
# cl_1st_difficulty_step=1
# cl_1st_root=2
# cl_2nd_metric="seqlen_truncate"
# cl_2nd_index_to_sample_path="dummy"
# cl_2nd_index_to_metric_path="dummy"
# cl_2nd_difficulty_type="value"
# cl_2nd_clustering_type="single_cluster"
# cl_2nd_min=128
# cl_2nd_max=512
# cl_2nd_total_step_in_million=96e-2
# cl_2nd_difficulty_step=8
# cl_2nd_root=1
# bash ds_pretrain_bert_336M_base_script.sh ${lr} ${train_iters_in_million} \
# ${ltd_enabled} ${ltd_start} ${ltd_step_in_million} ${dropout} \
# ${cl_enabled} ${cl_num_metric} ${cl_1st_metric} \
# ${cl_1st_index_to_sample_path} ${cl_1st_index_to_metric_path} \
# ${cl_1st_difficulty_type} ${cl_1st_clustering_type} ${cl_1st_min} \
# ${cl_1st_max} ${cl_1st_total_step_in_million} ${cl_1st_difficulty_step} \
# ${cl_1st_root} ${cl_2nd_metric} ${cl_2nd_index_to_sample_path} \
# ${cl_2nd_index_to_metric_path} ${cl_2nd_difficulty_type} \
# ${cl_2nd_clustering_type} ${cl_2nd_min} ${cl_2nd_max} \
# ${cl_2nd_total_step_in_million} ${cl_2nd_difficulty_step} ${cl_2nd_root}
###############################################################################
## CL+random-LTD 524B tokens (50%):
# lr=2e-4
# train_iters_in_million=1
# ltd_enabled="true"
# ltd_start=128
# ltd_step_in_million=1
# dropout=1e-1
# cl_enabled="true"
# cl_num_metric=2
# cl_1st_metric="voc"
# cl_1st_index_to_sample_path="/vc_data/users/conglli/code/data_efficiency/data/analysis_pile_bert_5epoch/vocab_rarity/vocab_rarity_index_to_sample_percentile_merged"
# cl_1st_index_to_metric_path="/vc_data/users/conglli/code/data_efficiency/data/analysis_pile_bert_5epoch/vocab_rarity/vocab_rarity_index_to_metric"
# cl_1st_difficulty_type="percentile"
# cl_1st_clustering_type="schedule_based"
# cl_1st_min=5
# cl_1st_max=100
# cl_1st_total_step_in_million=48e-2
# cl_1st_difficulty_step=1
# cl_1st_root=2
# cl_2nd_metric="seqlen_truncate"
# cl_2nd_index_to_sample_path="dummy"
# cl_2nd_index_to_metric_path="dummy"
# cl_2nd_difficulty_type="value"
# cl_2nd_clustering_type="single_cluster"
# cl_2nd_min=128
# cl_2nd_max=512
# cl_2nd_total_step_in_million=48e-2
# cl_2nd_difficulty_step=8
# cl_2nd_root=1
# bash ds_pretrain_bert_336M_base_script.sh ${lr} ${train_iters_in_million} \
# ${ltd_enabled} ${ltd_start} ${ltd_step_in_million} ${dropout} \
# ${cl_enabled} ${cl_num_metric} ${cl_1st_metric} \
# ${cl_1st_index_to_sample_path} ${cl_1st_index_to_metric_path} \
# ${cl_1st_difficulty_type} ${cl_1st_clustering_type} ${cl_1st_min} \
# ${cl_1st_max} ${cl_1st_total_step_in_million} ${cl_1st_difficulty_step} \
# ${cl_1st_root} ${cl_2nd_metric} ${cl_2nd_index_to_sample_path} \
# ${cl_2nd_index_to_metric_path} ${cl_2nd_difficulty_type} \
# ${cl_2nd_clustering_type} ${cl_2nd_min} ${cl_2nd_max} \
# ${cl_2nd_total_step_in_million} ${cl_2nd_difficulty_step} ${cl_2nd_root}
###############################################################################
### Random layerwise token dropping (random-LTD).
## random-LTD 1049B tokens (100%):
# lr=1e-4
# train_iters_in_million=2
# ltd_enabled="true"
# ltd_start=128
# ltd_step_in_million=2
# dropout=1e-1
# bash ds_pretrain_bert_336M_base_script.sh ${lr} ${train_iters_in_million} \
# ${ltd_enabled} ${ltd_start} ${ltd_step_in_million} ${dropout}
###############################################################################
## random-LTD 703B tokens (67%):
# lr=1.5e-4
# train_iters_in_million=134e-2
# ltd_enabled="true"
# ltd_start=128
# ltd_step_in_million=134e-2
# dropout=1e-1
# bash ds_pretrain_bert_336M_base_script.sh ${lr} ${train_iters_in_million} \
# ${ltd_enabled} ${ltd_start} ${ltd_step_in_million} ${dropout}
###############################################################################
## random-LTD 524B tokens (50%):
# lr=2e-4
# train_iters_in_million=1
# ltd_enabled="true"
# ltd_start=128
# ltd_step_in_million=1
# dropout=1e-1
# bash ds_pretrain_bert_336M_base_script.sh ${lr} ${train_iters_in_million} \
# ${ltd_enabled} ${ltd_start} ${ltd_step_in_million} ${dropout}
###############################################################################
### Curriculum learning (CL).
## CL vocab rarity + seqlen truncation 524B tokens (50%):
# lr=2e-4
# train_iters_in_million=1
# ltd_enabled="false"
# ltd_start=512
# ltd_step_in_million=1
# dropout=1e-1
# cl_enabled="true"
# cl_num_metric=2
# cl_1st_metric="voc"
# cl_1st_index_to_sample_path="/vc_data/users/conglli/code/data_efficiency/data/analysis_pile_bert_5epoch/vocab_rarity/vocab_rarity_index_to_sample_percentile_merged"
# cl_1st_index_to_metric_path="/vc_data/users/conglli/code/data_efficiency/data/analysis_pile_bert_5epoch/vocab_rarity/vocab_rarity_index_to_metric"
# cl_1st_difficulty_type="percentile"
# cl_1st_clustering_type="schedule_based"
# cl_1st_min=5
# cl_1st_max=100
# cl_1st_total_step_in_million=48e-2
# cl_1st_difficulty_step=1
# cl_1st_root=2
# cl_2nd_metric="seqlen_truncate"
# cl_2nd_index_to_sample_path="dummy"
# cl_2nd_index_to_metric_path="dummy"
# cl_2nd_difficulty_type="value"
# cl_2nd_clustering_type="single_cluster"
# cl_2nd_min=128
# cl_2nd_max=512
# cl_2nd_total_step_in_million=48e-2
# cl_2nd_difficulty_step=8
# cl_2nd_root=1
# bash ds_pretrain_bert_336M_base_script.sh ${lr} ${train_iters_in_million} \
# ${ltd_enabled} ${ltd_start} ${ltd_step_in_million} ${dropout} \
# ${cl_enabled} ${cl_num_metric} ${cl_1st_metric} \
# ${cl_1st_index_to_sample_path} ${cl_1st_index_to_metric_path} \
# ${cl_1st_difficulty_type} ${cl_1st_clustering_type} ${cl_1st_min} \
# ${cl_1st_max} ${cl_1st_total_step_in_million} ${cl_1st_difficulty_step} \
# ${cl_1st_root} ${cl_2nd_metric} ${cl_2nd_index_to_sample_path} \
# ${cl_2nd_index_to_metric_path} ${cl_2nd_difficulty_type} \
# ${cl_2nd_clustering_type} ${cl_2nd_min} ${cl_2nd_max} \
# ${cl_2nd_total_step_in_million} ${cl_2nd_difficulty_step} ${cl_2nd_root}
###############################################################################
## CL vocab rarity + seqlen truncation 703B tokens (67%):
# lr=1.5e-4
# train_iters_in_million=134e-2
# ltd_enabled="false"
# ltd_start=512
# ltd_step_in_million=1
# dropout=1e-1
# cl_enabled="true"
# cl_num_metric=2
# cl_1st_metric="voc"
# cl_1st_index_to_sample_path="/vc_data/users/conglli/code/data_efficiency/data/analysis_pile_bert_5epoch/vocab_rarity/vocab_rarity_index_to_sample_percentile_merged"
# cl_1st_index_to_metric_path="/vc_data/users/conglli/code/data_efficiency/data/analysis_pile_bert_5epoch/vocab_rarity/vocab_rarity_index_to_metric"
# cl_1st_difficulty_type="percentile"
# cl_1st_clustering_type="schedule_based"
# cl_1st_min=5
# cl_1st_max=100
# cl_1st_total_step_in_million=64e-2
# cl_1st_difficulty_step=1
# cl_1st_root=2
# cl_2nd_metric="seqlen_truncate"
# cl_2nd_index_to_sample_path="dummy"
# cl_2nd_index_to_metric_path="dummy"
# cl_2nd_difficulty_type="value"
# cl_2nd_clustering_type="single_cluster"
# cl_2nd_min=128
# cl_2nd_max=512
# cl_2nd_total_step_in_million=64e-2
# cl_2nd_difficulty_step=8
# cl_2nd_root=1
# bash ds_pretrain_bert_336M_base_script.sh ${lr} ${train_iters_in_million} \
# ${ltd_enabled} ${ltd_start} ${ltd_step_in_million} ${dropout} \
# ${cl_enabled} ${cl_num_metric} ${cl_1st_metric} \
# ${cl_1st_index_to_sample_path} ${cl_1st_index_to_metric_path} \
# ${cl_1st_difficulty_type} ${cl_1st_clustering_type} ${cl_1st_min} \
# ${cl_1st_max} ${cl_1st_total_step_in_million} ${cl_1st_difficulty_step} \
# ${cl_1st_root} ${cl_2nd_metric} ${cl_2nd_index_to_sample_path} \
# ${cl_2nd_index_to_metric_path} ${cl_2nd_difficulty_type} \
# ${cl_2nd_clustering_type} ${cl_2nd_min} ${cl_2nd_max} \
# ${cl_2nd_total_step_in_million} ${cl_2nd_difficulty_step} ${cl_2nd_root}
###############################################################################
## CL vocab rarity + seqlen truncation 1049B tokens (100%):
# lr=1e-4
# train_iters_in_million=2
# ltd_enabled="false"
# ltd_start=512
# ltd_step_in_million=1
# dropout=1e-1
# cl_enabled="true"
# cl_num_metric=2
# cl_1st_metric="voc"
# cl_1st_index_to_sample_path="/blob/users/conglli/data/analysis_pile_bert_5epoch/vocab_rarity/vocab_rarity_index_to_sample"
# cl_1st_index_to_metric_path="/blob/users/conglli/data/analysis_pile_bert_5epoch/vocab_rarity/vocab_rarity_index_to_metric"
# cl_1st_difficulty_type="percentile"
# cl_1st_clustering_type="schedule_based"
# cl_1st_min=5
# cl_1st_max=100
# cl_1st_total_step_in_million=96e-2
# cl_1st_difficulty_step=1
# cl_1st_root=2
# cl_2nd_metric="seqlen_truncate"
# cl_2nd_index_to_sample_path="dummy"
# cl_2nd_index_to_metric_path="dummy"
# cl_2nd_difficulty_type="value"
# cl_2nd_clustering_type="single_cluster"
# cl_2nd_min=128
# cl_2nd_max=512
# cl_2nd_total_step_in_million=96e-2
# cl_2nd_difficulty_step=8
# cl_2nd_root=1
# bash ds_pretrain_bert_336M_base_script.sh ${lr} ${train_iters_in_million} \
# ${ltd_enabled} ${ltd_start} ${ltd_step_in_million} ${dropout} \
# ${cl_enabled} ${cl_num_metric} ${cl_1st_metric} \
# ${cl_1st_index_to_sample_path} ${cl_1st_index_to_metric_path} \
# ${cl_1st_difficulty_type} ${cl_1st_clustering_type} ${cl_1st_min} \
# ${cl_1st_max} ${cl_1st_total_step_in_million} ${cl_1st_difficulty_step} \
# ${cl_1st_root} ${cl_2nd_metric} ${cl_2nd_index_to_sample_path} \
# ${cl_2nd_index_to_metric_path} ${cl_2nd_difficulty_type} \
# ${cl_2nd_clustering_type} ${cl_2nd_min} ${cl_2nd_max} \
# ${cl_2nd_total_step_in_million} ${cl_2nd_difficulty_step} ${cl_2nd_root}
###############################################################################
## CL vocab rarity + seqlen reorder 1049B tokens (100%):
# lr=1e-4
# train_iters_in_million=2
# ltd_enabled="false"
# ltd_start=512
# ltd_step_in_million=1
# dropout=1e-1
# cl_enabled="true"
# cl_num_metric=1
# cl_1st_metric="seqlenvocabrarity"
# cl_1st_index_to_sample_path="/blob/users/conglli/data/analysis_pile_bert_5epoch/seqlen_vocab_rarity/seqlen_vocab_rarity_index_to_sample_percentile_merged"
# cl_1st_index_to_metric_path="/blob/users/conglli/data/analysis_pile_bert_5epoch/seqlen_vocab_rarity/seqlen_vocab_rarity_index_to_metric"
# cl_1st_difficulty_type="percentile"
# cl_1st_clustering_type="schedule_based"
# cl_1st_min=5
# cl_1st_max=100
# cl_1st_total_step_in_million=96e-2
# cl_1st_difficulty_step=1
# cl_1st_root=2
# bash ds_pretrain_bert_336M_base_script.sh ${lr} ${train_iters_in_million} \
# ${ltd_enabled} ${ltd_start} ${ltd_step_in_million} ${dropout} \
# ${cl_enabled} ${cl_num_metric} ${cl_1st_metric} \
# ${cl_1st_index_to_sample_path} ${cl_1st_index_to_metric_path} \
# ${cl_1st_difficulty_type} ${cl_1st_clustering_type} ${cl_1st_min} \
# ${cl_1st_max} ${cl_1st_total_step_in_million} ${cl_1st_difficulty_step} \
# ${cl_1st_root}
###############################################################################
## CL vocab rarity 1049B tokens (100%):
# lr=1e-4
# train_iters_in_million=2
# ltd_enabled="false"
# ltd_start=512
# ltd_step_in_million=1
# dropout=1e-1
# cl_enabled="true"
# cl_num_metric=1
# cl_1st_metric="voc"
# cl_1st_index_to_sample_path="/blob/users/conglli/data/analysis_pile_bert_5epoch/vocab_rarity/vocab_rarity_index_to_sample"
# cl_1st_index_to_metric_path="/blob/users/conglli/data/analysis_pile_bert_5epoch/vocab_rarity/vocab_rarity_index_to_metric"
# cl_1st_difficulty_type="percentile"
# cl_1st_clustering_type="schedule_based"
# cl_1st_min=5
# cl_1st_max=100
# cl_1st_total_step_in_million=96e-2
# cl_1st_difficulty_step=1
# cl_1st_root=2
# bash ds_pretrain_bert_336M_base_script.sh ${lr} ${train_iters_in_million} \
# ${ltd_enabled} ${ltd_start} ${ltd_step_in_million} ${dropout} \
# ${cl_enabled} ${cl_num_metric} ${cl_1st_metric} \
# ${cl_1st_index_to_sample_path} ${cl_1st_index_to_metric_path} \
# ${cl_1st_difficulty_type} ${cl_1st_clustering_type} ${cl_1st_min} \
# ${cl_1st_max} ${cl_1st_total_step_in_million} ${cl_1st_difficulty_step} \
# ${cl_1st_root}
###############################################################################
## CL seqlen truncation 1049B tokens (100%):
# lr=1e-4
# train_iters_in_million=2
# ltd_enabled="false"
# ltd_start=512
# ltd_step_in_million=1
# dropout=1e-1
# cl_enabled="true"
# cl_num_metric=1
# cl_1st_metric="seqlen_truncate"
# cl_1st_index_to_sample_path="dummy"
# cl_1st_index_to_metric_path="dummy"
# cl_1st_difficulty_type="value"
# cl_1st_clustering_type="single_cluster"
# cl_1st_min=128
# cl_1st_max=512
# cl_1st_total_step_in_million=96e-2
# cl_1st_difficulty_step=8
# cl_1st_root=1
# bash ds_pretrain_bert_336M_base_script.sh ${lr} ${train_iters_in_million} \
# ${ltd_enabled} ${ltd_start} ${ltd_step_in_million} ${dropout} \
# ${cl_enabled} ${cl_num_metric} ${cl_1st_metric} \
# ${cl_1st_index_to_sample_path} ${cl_1st_index_to_metric_path} \
# ${cl_1st_difficulty_type} ${cl_1st_clustering_type} ${cl_1st_min} \
# ${cl_1st_max} ${cl_1st_total_step_in_million} ${cl_1st_difficulty_step} \
# ${cl_1st_root}
###############################################################################
## CL seqlen reorder 1049B tokens (100%):
# lr=1e-4
# train_iters_in_million=2
# ltd_enabled="false"
# ltd_start=512
# ltd_step_in_million=1
# dropout=1e-1
# cl_enabled="true"
# cl_num_metric=1
# cl_1st_metric="seqlen"
# cl_1st_index_to_sample_path="/blob/users/conglli/data/analysis_pile_bert_5epoch/seqlen/seqlen_index_to_sample_percentile_merged"
# cl_1st_index_to_metric_path="/blob/users/conglli/data/analysis_pile_bert_5epoch/seqlen/seqlen_index_to_metric"
# cl_1st_difficulty_type="percentile"
# cl_1st_clustering_type="single_cluster"
# cl_1st_min=5
# cl_1st_max=100
# cl_1st_total_step_in_million=96e-2
# cl_1st_difficulty_step=8
# cl_1st_root=2
# bash ds_pretrain_bert_336M_base_script.sh ${lr} ${train_iters_in_million} \
# ${ltd_enabled} ${ltd_start} ${ltd_step_in_million} ${dropout} \
# ${cl_enabled} ${cl_num_metric} ${cl_1st_metric} \
# ${cl_1st_index_to_sample_path} ${cl_1st_index_to_metric_path} \
# ${cl_1st_difficulty_type} ${cl_1st_clustering_type} ${cl_1st_min} \
# ${cl_1st_max} ${cl_1st_total_step_in_million} ${cl_1st_difficulty_step} \
# ${cl_1st_root}
###############################################################################

View File

@ -0,0 +1,70 @@
#!/bin/bash
num_workers=1 # Num nodes to run the map job
num_threads=40 # Num threads on each node. Set this based on #CPU cores
# If different data epochs have slightly different data samples (e.g., due
# to randomness), then you need to specify a num_epochs large enough to cover
# the whole pretraining. If different data epochs are the same, set num_epochs
# to 1 to only index 1 epoch; during pretraining, the DeepSpeed data efficiency
# library will automatically handle reshuffling when reaching another epoch.
num_epochs=1
# Which node this is (starting at 0 and ending at num_workers-1). This
# script only launches the map job on 1 worker node, since we don't expect
# to run on many nodes and the workers don't need any communication. But you
# can modify this script to add an MPI/torch distributed launcher.
worker_id=$1
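## Illustrative usage (the script names below are placeholders for whatever
## these files are saved as in your checkout):
## 1. Run the map job on each worker node i: bash <this_map_script>.sh i
## 2. After all map jobs finish, run the reduce script (later in this commit)
##    once on a single node.
## 3. For metric='vocab_rarity', first complete a full map+reduce pass with
##    metric='total_vocab_freq', since vocab_rarity depends on its result.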
save_path="/blob/users/conglli/data/analysis_pile_gpt_${num_epochs}epoch/"
metric='total_vocab_freq'
# metric='vocab_rarity' # this requires the result of total_vocab_freq
seq_len=2048
batch_size=10000
jobname="gpt-pile-analyzing-${metric}-${num_epochs}epoch-map-worker${worker_id}"
# The public Pile dataset can be downloaded at
# https://mystic.the-eye.eu/public/AI/pile_neox/
## Change data_home to your own training data path.
# data_home="/vc_data_blob/users/conglli/the_pile_public_merged_nopreprocessing"
data_home="/blob/data/the_pile_public_merged_nopreprocessing"
data_path="${data_home}/pile_text_document"
vocab_path="gpt2-vocab.json"
if [ ! -f "$vocab_path" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
fi
merge_path="gpt2-merges.txt"
if [ ! -f "$merge_path" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
fi
# Make sure the "--split" is the same as what you will use for pre-training.
options=" \
--analyzing-task map \
--analyzing-data-type GPT \
--analyzing-metric ${metric} \
--analyzing-num-workers ${num_workers} \
--analyzing-worker-id ${worker_id} \
--analyzing-num-threads ${num_threads} \
--vocab-file ${vocab_path} \
--merge-file ${merge_path} \
--data-path ${data_path} \
--data-impl mmap \
--tokenizer-type GPT2BPETokenizer \
--micro-batch-size ${batch_size} \
--global-batch-size ${batch_size} \
--seq-length ${seq_len} \
--max-position-embeddings ${seq_len} \
--num-layers 1 \
--hidden-size 1 \
--num-attention-heads 1 \
--split 949,50,1 \
--distributed-backend gloo \
--train-data-exact-num-epochs ${num_epochs} \
--return-data-index \
--save-interval 1 \
--save ${save_path}"
python ../analyze_data.py ${options} &> ${jobname}.log

View File

@ -0,0 +1,69 @@
#!/bin/bash
# Set these 2 to the same values you used during the map job. We need these 2
# configs to know how many map job result files there are.
num_workers=1
num_threads=40
# The reduce job only has 1 worker but can be accelerated by multithreading.
num_threads_reduce=40
# If different data epochs have slightly different data samples (e.g., due
# to randomness), then you need to specify a num_epochs large enough to cover
# the whole pretraining. If different data epochs are the same, set num_epochs
# to 1 to only index 1 epoch; during pretraining, the DeepSpeed data efficiency
# library will automatically handle reshuffling when reaching another epoch.
num_epochs=1
save_path="/blob/users/conglli/data/analysis_pile_gpt_${num_epochs}epoch/"
metric='total_vocab_freq'
# metric='vocab_rarity' # this requires the result of total_vocab_freq
seq_len=2048
batch_size=10000
jobname="gpt-pile-analyzing-${metric}-${num_epochs}epoch-reduce"
# The public Pile dataset can be downloaded at
# https://mystic.the-eye.eu/public/AI/pile_neox/
## Change data_home to your own training data path.
# data_home="/vc_data_blob/users/conglli/the_pile_public_merged_nopreprocessing"
data_home="/blob/data/the_pile_public_merged_nopreprocessing"
data_path="${data_home}/pile_text_document"
vocab_path="gpt2-vocab.json"
if [ ! -f "$vocab_path" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
fi
merge_path="gpt2-merges.txt"
if [ ! -f "$merge_path" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
fi
# Make sure the "--split" is the same as what you will use for pre-training.
options=" \
--analyzing-task reduce \
--analyzing-data-type GPT \
--analyzing-metric ${metric} \
--analyzing-num-workers ${num_workers} \
--analyzing-num-threads ${num_threads} \
--analyzing-num-threads-reduce ${num_threads_reduce} \
--vocab-file ${vocab_path} \
--merge-file ${merge_path} \
--data-path ${data_path} \
--data-impl mmap \
--tokenizer-type GPT2BPETokenizer \
--micro-batch-size ${batch_size} \
--global-batch-size ${batch_size} \
--seq-length ${seq_len} \
--max-position-embeddings ${seq_len} \
--num-layers 1 \
--hidden-size 1 \
--num-attention-heads 1 \
--split 949,50,1 \
--distributed-backend gloo \
--train-data-exact-num-epochs ${num_epochs} \
--return-data-index \
--save-interval 1 \
--save ${save_path}"
python ../analyze_data.py ${options} &> ${jobname}.log

View File

@ -0,0 +1,28 @@
{
"train_batch_size" : 2048,
"train_micro_batch_size_per_gpu": 16,
"steps_per_print": 10,
"zero_optimization": {
"stage": 0,
"elastic_checkpoint": true
},
"gradient_clipping": 1.0,
"prescale_gradients": true,
"fp16": {
"enabled": false,
"loss_scale": 0,
"loss_scale_window": 500,
"hysteresis": 2,
"min_loss_scale": 1,
"initial_scale_power": 11
},
"bf16": {
"enabled": false
},
"wall_clock_breakdown" : false
}

View File

@ -0,0 +1,77 @@
## CAUTION: first read Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md
## and follow its installation/data downloading steps.
## Code below only works when you run each evalharness task on a single GPU.
## For multi-GPU evalharness, check Megatron-DeepSpeed/blob/main/examples/MoE/ds_evalharness.sh
checkpoint_path=$1
config_path=$2
result_path=$3
rank=$4
tasks=$5
hostname=$6
master_port=$(( 12345 + ${rank} ))
batch_size=$7
num_fewshot=$8
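## Example invocation (illustrative; the checkpoint/result paths are
## placeholders). The multi-task launcher scripts later in this commit call
## this file as "ds_evalharness_1gpu.sh" with exactly this argument order:
# bash ds_evalharness_1gpu.sh /path/to/ckpt/global_step1000 ds_config_eval_dummy.json \
#     /path/to/eval_results 0 hellaswag worker-0 32 0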
mp_size=1
pp_size=1
no_pp="true"
ep_size=1
vocab_file="gpt2-vocab.json"
if [ ! -f "$vocab_file" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
fi
merge_file="gpt2-merges.txt"
if [ ! -f "$merge_file" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
fi
export HF_DATASETS_OFFLINE=1
dir2=$(dirname "$checkpoint_path")
dirname=$(basename "$dir2")/$(basename "$checkpoint_path")
result_path="${result_path}/${dirname}"
mkdir -p $result_path
result_file="${result_path}/${tasks}_${num_fewshot}shot.json"
# Dummy arguments to make Megatron happy. No need to configure them.
# The reason we don't need to configure them (and many other arguments) is
# that the eval framework will read the arguments from the checkpoint file.
megatron_required_args="\
--num-layers -1 \
--hidden-size -1 \
--num-attention-heads -1 \
--seq-length -1 \
--max-position-embeddings -1
"
command="../../../../tasks/eval_harness/evaluate.py \
--load ${checkpoint_path} \
--tensor-model-parallel-size ${mp_size} \
--pipeline-model-parallel-size ${pp_size} \
--moe-expert-parallel-size ${ep_size} \
--vocab-file ${vocab_file} \
--merge-file ${merge_file} \
--micro-batch-size ${batch_size} \
--no-load-optim \
--no-load-rng \
--inference \
--disable-moe-token-dropping \
--adaptive_seq_len \
--eval_fp32 \
--num_fewshot ${num_fewshot} \
--task_list ${tasks} \
--results_path ${result_file} \
--deepspeed \
--deepspeed_config ${config_path} \
${megatron_required_args} \
"
if [[ "${no_pp}" = "true" ]]; then
command="${command} \
--no-pipeline-parallel"
fi
launcher="deepspeed --include=$hostname:$rank --master_port=${master_port}"
$launcher $command &> "${result_path}/${tasks}_${num_fewshot}shot.log"

View File

@ -0,0 +1,358 @@
import json
import os
import math
from math import log10, floor
import copy
def mean(arr):
return sum(arr) / len(arr)
def pop_stddev(arr):
mu = mean(arr)
return math.sqrt(sum([(x - mu) ** 2 for x in arr]) / len(arr))
def sample_stddev(arr):
mu = mean(arr)
return math.sqrt(sum([(x - mu) ** 2 for x in arr]) / (len(arr) - 1))
def mean_stderr(arr):
return sample_stddev(arr) / math.sqrt(len(arr))
def median(arr):
return arr[len(arr) // 2]
metric_dict = {
"hellaswag":"acc_norm",
"lambada":"acc",
"triviaqa":"acc",
"webqs":"acc",
"winogrande":"acc",
"piqa":"acc_norm",
"arc_challenge":"acc_norm",
"arc_easy":"acc_norm",
"openbookqa":"acc_norm",
"race":"acc",
"boolq":"acc",
"cb":"acc",
"copa":"acc",
"rte":"acc",
"wic":"acc",
"wsc":"acc",
"multirc":"acc",
"record":"f1",
"anli_r1":"acc",
"anli_r2":"acc",
"anli_r3":"acc",
"wikitext":"word_perplexity",
"logiqa":"acc_norm",
"mathqa":"acc_norm",
"mc_taco":"f1",
"mrpc":"acc",
"prost":"acc_norm",
"pubmedqa":"acc",
"qnli":"acc",
"qqp":"acc",
"sciq":"acc_norm",
"sst":"acc",
"wnli":"acc"
}
official_dict = {
"hellaswag":["HellaSwag","acc"],
"lambada":["LAMBADA","acc"],
"triviaqa":["TriviaQA","acc"],
"webqs":["WebQs","acc"],
"winogrande":["Winogrande","acc"],
"piqa":["PIQA","acc"],
"arc_challenge":["ARC Challenge","acc"],
"arc_easy":["ARC Easy","acc"],
"openbookqa":["OpenBookQA","acc"],
"race":["RACE-h","acc"],
"boolq":["BoolQ","acc"],
"cb":["CB","acc"],
"copa":["Copa","acc"],
"rte":["RTE","acc"],
"wic":["WiC","acc"],
"wsc":["WSC","acc"],
"multirc":["MultiRC","acc"],
"record":["ReCoRD","f1"],
"anli_r1":["ANLI R1","acc"],
"anli_r2":["ANLI R2","acc"],
"anli_r3":["ANLI R3","acc"],
"wikitext":["WikiText-2","ppl"],
"logiqa":["LogiQA","acc"],
"mathqa":["MathQA","acc"],
"mc_taco":["MC-TACO","f1"],
"mrpc":["MRPC","acc"],
"prost":["PROST","acc"],
"pubmedqa":["PubMedQA","acc"],
"qnli":["QNLI","acc"],
"qqp":["QQP","acc"],
"sciq":["SciQ","acc"],
"sst":["SST-2","acc"],
"wnli":["WNLI","acc"]
}
# When comparing with the GPT-3 paper, the most trustworthy tasks are hellaswag
# through anli_r3, which have >= 1000 samples (less variation) and have <= 43%
# data contamination in the paper.
gpt3paper_zeroshoteval = {
"hellaswag":[33.7,43.6,51.0,54.7,62.8,67.4,70.9,78.9],
"lambada":[42.7,54.3,60.4,63.6,67.1,70.3,72.5,76.2],
"triviaqa":[4.15,7.61,14.0,19.7,31.3,38.7,41.8,64.3],
"webqs":[1.77,3.20,4.33,4.63,7.92,7.73,8.22,14.4],
"winogrande":[52.0,52.1,57.4,58.7,62.3,64.5,67.9,70.2],
"piqa":[64.6,70.2,72.9,75.1,75.6,78.0,78.5,81.0],
"arc_challenge":[26.6,29.5,31.8,35.5,38.0,41.4,43.7,51.4],
"arc_easy":[43.6,46.5,53.0,53.8,58.2,60.2,63.8,68.8],
"anli_r1":[33.4,34.2,33.4,33.4,34.2,32.3,33.2,34.6],
"anli_r2":[33.2,31.9,33.3,33.3,33.8,33.5,33.5,35.4],
"anli_r3":[33.6,34.0,33.8,33.4,35.3,34.8,34.4,34.5],
"openbookqa":[35.6,43.2,45.2,46.8,53.0,50.4,55.6,57.6],
"race":[35.2,37.9,40.1,40.9,42.4,44.1,44.6,45.5],
"boolq":[49.7,60.3,58.9,62.4,67.1,65.4,66.2,60.5],
"cb":[0.00,32.1,8.93,19.6,19.6,28.6,19.6,46.4],
"copa":[66.0,68.0,73.0,77.0,76.0,80.0,84.0,91.0],
"rte":[47.7,49.8,48.4,56.0,46.6,55.2,62.8,63.5],
"wic":[0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00],
"wsc":[59.6,56.7,65.4,61.5,66.3,60.6,64.4,65.4],
"multirc":[4.72,9.65,12.3,13.6,14.3,18.4,24.2,27.6],
"record":[71.9,79.2,82.8,85.2,87.3,89.5,90.4,91.0]
}
gpt3paper_fewshoteval = {
"hellaswag":[33.5,43.1,51.3,54.9,62.9,67.3,71.3,79.3],
"lambada":[22.0,40.4,63.2,57.0,78.1,79.1,81.3,86.4],
"triviaqa":[6.96,16.3,26.5,32.1,42.3,51.6,57.5,71.2],
"webqs":[5.46,12.6,15.9,19.6,24.8,27.7,33.5,41.5],
"winogrande":[51.3,52.6,57.5,59.1,62.6,67.4,70.0,77.7],
"piqa":[64.3,69.4,72.0,74.3,75.4,77.8,79.9,82.3],
"arc_challenge":[25.5,28.4,32.3,36.7,39.5,43.7,44.8,51.5],
"arc_easy":[42.7,51.0,58.1,59.1,62.1,65.8,69.1,70.1],
"anli_r1":[32.1,32.5,30.9,32.5,33.5,33.1,33.3,36.8],
"anli_r2":[35.7,33.8,32.1,31.4,32.6,33.3,32.6,34.0],
"anli_r3":[35.0,34.4,35.1,36.0,32.7,33.9,34.5,40.2],
"openbookqa":[37.0,43.6,48.0,50.6,55.6,55.2,60.8,65.4],
"race":[34.3,37.0,40.4,41.4,42.3,44.7,45.1,46.8],
"boolq":[43.1,60.6,62.0,64.1,70.3,70.0,70.2,77.5],
"cb":[42.9,58.9,53.6,69.6,67.9,60.7,66.1,82.1],
"copa":[67.0,64.0,72.0,77.0,83.0,83.0,86.0,92.0],
"rte":[52.3,48.4,46.9,50.9,56.3,49.5,60.6,72.9],
"wic":[49.8,55.0,53.0,53.0,51.6,53.1,51.1,55.3],
"wsc":[58.7,60.6,54.8,49.0,62.5,67.3,75.0,75.0],
"multirc":[6.09,11.8,16.8,20.8,24.7,23.8,25.0,32.5],
"record":[70.7,77.9,82.1,84.0,87.5,88.8,89.8,90.1]
}
gpt3paper_zeroshoteval_index = {
"125M":0, # Small
"350M":1, # Medium
"760M":2, # Large
"1.3B":3, # XL
"2.7B":4,
"6.7B":5,
"13B":6,
"175B":7
}
def round_sig(x, sig=3):
if x == 0:
return 0
return round(x, sig-int(floor(log10(abs(x))))-1)
def generate_result_table(tab_header, configs, task_order, caption, avg_range,
avg_tag, avg_only=False, fontsize="\\footnotesize", find_best=False,
candidate_range=None, candidate_task=None, split_name_by_space=False,
print_stderr=False, few_shot=False):
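# Brief parameter notes (added for readability; inferred from the code below):
# - configs: list of rows [case name, config fields..., eval_path]; eval_path
#   None means use the GPT-3 paper numbers above (then configs[i][1] must be a
#   model size key of gpt3paper_zeroshoteval_index).
# - task_order / avg_range / avg_tag: tasks to print, plus one "Avg." row per
#   entry k in avg_range averaging task_order[0..k], labeled with avg_tag.
# - find_best / candidate_range / candidate_task: among configs[candidate_range:],
#   keep only the configs best on each average and on each candidate task.
# - split_name_by_space: split case names on spaces into multiple header rows.
# - few_shot: use the paper's few-shot numbers instead of the zero-shot ones.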
# Gather results
result_list = []
for i in range(len(configs)):
result_dict = {}
eval_path = configs[i][-1]
if "paper" in configs[i][0]:
assert eval_path is None
if eval_path is None:
assert "paper" in configs[i][0]
assert configs[i][1] in gpt3paper_zeroshoteval_index, "the second element has to be the model size"
paper_result_idx = gpt3paper_zeroshoteval_index[configs[i][1]]
if few_shot:
for task in gpt3paper_fewshoteval:
result_dict[task] = [gpt3paper_fewshoteval[task][paper_result_idx]]
else:
for task in gpt3paper_zeroshoteval:
result_dict[task] = [gpt3paper_zeroshoteval[task][paper_result_idx]]
else:
for file in os.listdir(eval_path):
if file.endswith(".json"):
result = json.load(open(eval_path+"/"+file, "r"))
for task in result['results']:
if task != "wikitext":
result_dict[task] = [100.0*result['results'][task][metric_dict[task]]]
else:
result_dict[task] = [result['results'][task][metric_dict[task]]]
result_list.append(result_dict)
avg_list = []
for i in range(len(configs)):
average_results = []
for j in range(len(avg_range)):
results = []
for k in range(avg_range[j]+1):
if task_order[k] in result_list[i]:
results.append(result_list[i][task_order[k]][0])
if len(results) > 0:
average_results.append(float(sum(results))/len(results))
else:
average_results.append(0)
avg_list.append(average_results)
if find_best:
best_avg_value = [0 for _ in range(len(avg_range))]
best_avg_idx = [0 for _ in range(len(avg_range))]
best_task_value = [0 for _ in range(len(candidate_task))]
best_task_idx = [0 for _ in range(len(candidate_task))]
for i in range(candidate_range, len(configs)):
for j in range(len(avg_range)):
if avg_list[i][j] > best_avg_value[j]:
best_avg_value[j] = avg_list[i][j]
best_avg_idx[j] = i
for j in range(len(candidate_task)):
if result_list[i][candidate_task[j]][0] > best_task_value[j]:
best_task_value[j] = result_list[i][candidate_task[j]][0]
best_task_idx[j] = i
# reorder configs, result_list, avg_list to only keep the best cases
new_configs = configs[:candidate_range]
new_result_list = result_list[:candidate_range]
new_avg_list = avg_list[:candidate_range]
for i in range(len(avg_range)):
selected_config = copy.deepcopy(configs[best_avg_idx[i]])
selected_config[0] = "({})Best Avg{}".format(len(new_configs),
avg_tag[i])
new_configs.append(selected_config)
new_result_list.append(result_list[best_avg_idx[i]])
new_avg_list.append(avg_list[best_avg_idx[i]])
for i in range(len(candidate_task)):
selected_config = copy.deepcopy(configs[best_task_idx[i]])
selected_config[0] = "({})Best {}".format(len(new_configs),
official_dict[candidate_task[i]][0])
new_configs.append(selected_config)
new_result_list.append(result_list[best_task_idx[i]])
new_avg_list.append(avg_list[best_task_idx[i]])
configs = new_configs
result_list = new_result_list
avg_list = new_avg_list
# split the case names by space
if split_name_by_space:
max_num_row = 1
splitted_names = []
for i in range(len(configs)):
new_name = configs[i][0].split()
max_num_row = max(max_num_row, len(new_name))
splitted_names.append(new_name)
tab_header = ["" for _ in range(max_num_row-1)] + tab_header
for i in range(len(configs)):
padding = ["" for _ in range(max_num_row-len(splitted_names[i]))]
configs[i] = padding + splitted_names[i] + configs[i][1:]
# generate the table
print("\\begin{table}")
print("\centering")
print(fontsize)
print("\caption{"+caption+"}")
text = "\\begin{tabular}{@{}l|"
for _ in range(len(configs)):
text += "c"
text += "@{}}"
print(text)
print("\\toprule")
for i in range(len(tab_header)):
text = "{} &".format(tab_header[i])
for j in range(len(configs)):
if j != len(configs) - 1:
text += (configs[j][i] + "& ")
else:
text += (configs[j][i] + "\\\\")
print(text)
print("\midrule")
for i in range(len(avg_range)):
text = ("Avg. " + avg_tag[i])
arr = []
for j in range(len(configs)):
arr.append(avg_list[j][i])
text += " & {}".format(round_sig(avg_list[j][i]))
text += "\\\\"
if print_stderr:
arr_mean = mean(arr)
arr_std = sample_stddev(arr)
text += " % mean {:.3f}, std {:.3f}, mean+1std {:.3f}, mean+2std {:.3f}, mean+3std {:.3f}".format(
arr_mean, arr_std, arr_mean+arr_std, arr_mean+arr_std*2, arr_mean+arr_std*3)
print(text)
if not avg_only:
print("\midrule")
for i in range(len(task_order)):
task = task_order[i]
text = "({}) {}".format(i, official_dict[task][0])
arr = []
for j in range(len(configs)):
result_dict = result_list[j]
if task in result_dict:
text += " & {}".format(round_sig(result_dict[task][0]))
arr.append(result_dict[task][0])
else:
text += " & N/A"
text += "\\\\"
if print_stderr:
arr_mean = mean(arr)
arr_std = sample_stddev(arr)
if task != "wikitext":
text += " % mean {:.3f}, std {:.3f}, mean+1std {:.3f}, mean+2std {:.3f}, mean+3std {:.3f}".format(
arr_mean, arr_std, arr_mean+arr_std, arr_mean+arr_std*2, arr_mean+arr_std*3)
else:
text += " % mean {:.3f}, std {:.3f}, mean-1std {:.3f}, mean-2std {:.3f}, mean-3std {:.3f}".format(
arr_mean, arr_std, arr_mean-arr_std, arr_mean-arr_std*2, arr_mean-arr_std*3)
print(text)
print("\\bottomrule")
print("\end{tabular}")
print("\end{table}")
print("")
print("")
if __name__ == '__main__':
task_order = ["hellaswag","lambada","triviaqa","webqs","winogrande","piqa",
"arc_challenge","arc_easy","anli_r1","anli_r2","anli_r3","openbookqa",
"race","boolq","copa","rte","wsc","multirc","record","wikitext"]
avg_range = [18]
avg_tag = ["0-18"]
tab_header = ["Case","Model size","Train tokens","Batch size","Bsz warmup","LR","min LR","LR warmup","LR decay","decay style"]
configs = [
["(0)paper","125M","300B","256","4B","6e-4","6e-5","375M","260B","cosine", None], # gpt3 paper orig results, thus result path is None
["(1)repro","125M","300B","256","4B","6e-4","6e-5","375M","260B","cosine",
'/blob/users/conglli/project/data_efficiency_gpt/eval_results/gpt-pile-0.125B-tok300B-lr6.0e-4-min6.0e-5-wup375M-dcy260B-sty-cosine-gbs256-mbs4-gpu64-zero0-mp1-pp1-nopp-seed1234-bwup4B/global_step591581/'],
["(2)fixedBsz","125M","300B","256","N/A","6e-4","6e-5","3000M","260B","cosine",
'/blob/users/conglli/project/data_efficiency_gpt/eval_results/gpt-pile-0.125B-tok300B-lr6.0e-4-min6.0e-5-wup3000M-dcy260B-sty-cosine-gbs256-mbs4-gpu64-zero0-mp1-pp1-nopp-seed1234/global_step572205/'],
["(3)fixedBsz 300B+minLR","125M","300B","256","N/A","6e-4","1e-6","3000M","300B","cosine",
'/blob/users/conglli/project/data_efficiency_gpt/eval_results/gpt-pile-0.125B-tok300B-lr6.0e-4-min1.0e-6-wup3000M-dcy300B-sty-cosine-gbs256-mbs4-gpu64-zero0-mp1-pp1-nopp-seed1234/global_step572205/']
]
caption = 'Conglong: GPT-3 125M results zero-shot'
generate_result_table(tab_header, configs, task_order, caption, avg_range,
avg_tag, split_name_by_space=True, fontsize="\\tiny")
configs = [
["(0)paper","125M","300B","256","4B","6e-4","6e-5","375M","260B","cosine", None], # gpt3 paper orig results, thus result path is None
["(1)repro","125M","300B","256","4B","6e-4","6e-5","375M","260B","cosine",
'/blob/users/conglli/project/data_efficiency_gpt/eval_results_fewshot/gpt-pile-0.125B-tok300B-lr6.0e-4-min6.0e-5-wup375M-dcy260B-sty-cosine-gbs256-mbs4-gpu64-zero0-mp1-pp1-nopp-seed1234-bwup4B/global_step591581/'],
["(2)fixedBsz","125M","300B","256","N/A","6e-4","6e-5","3000M","260B","cosine",
'/blob/users/conglli/project/data_efficiency_gpt/eval_results_fewshot/gpt-pile-0.125B-tok300B-lr6.0e-4-min6.0e-5-wup3000M-dcy260B-sty-cosine-gbs256-mbs4-gpu64-zero0-mp1-pp1-nopp-seed1234/global_step572205/'],
["(3)fixedBsz 300B+minLR","125M","300B","256","N/A","6e-4","1e-6","3000M","300B","cosine",
'/blob/users/conglli/project/data_efficiency_gpt/eval_results_fewshot/gpt-pile-0.125B-tok300B-lr6.0e-4-min1.0e-6-wup3000M-dcy300B-sty-cosine-gbs256-mbs4-gpu64-zero0-mp1-pp1-nopp-seed1234/global_step572205/'],
]
caption = 'Conglong: GPT-3 125M results few-shot'
generate_result_table(tab_header, configs, task_order, caption, avg_range,
avg_tag, split_name_by_space=True, fontsize="\\tiny", few_shot=True)

View File

@ -0,0 +1,66 @@
## CAUTION: first read Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md
## and follow its installation/data downloading steps.
checkpoint_paths=(
/vc_data_blob/users/conglli/project/data_efficient_gpt/checkpoint/gpt-pile-0.125B-tok300B-lr6.0e-4-min6.0e-5-wup375M-dcy260B-sty-cosine-gbs256-mbs4-gpu64-zero0-mp1-pp1-nopp-seed1234-bwup4B/global_step591581/
/vc_data_blob/users/conglli/project/data_efficient_gpt/checkpoint/gpt-pile-0.125B-tok300B-lr6.0e-4-min6.0e-5-wup3000M-dcy260B-sty-cosine-gbs256-mbs4-gpu64-zero0-mp1-pp1-nopp-seed1234/global_step572205/
)
## No need to use the exact training config json; using this dummy one is fine.
config_path=ds_config_eval_dummy.json
username=$(whoami)
result_path="/blob/users/${username}/project/data_efficient_gpt/eval_results"
## Task(s) on the same row will be performed together in the same process.
## There exist other tasks that could run, but we skip them because they didn't
## appear or had strange scores in the GPT-3 paper: qqp, prost, cb, wic, mrpc,
## sst, wnli, pubmedqa, logiqa, qnli, sciq, mc_taco, mathqa. wikitext didn't
## appear in the paper, but we include it as a perplexity task.
tasks=(
record
triviaqa
hellaswag
arc_challenge
arc_easy
race
multirc
openbookqa
lambada
webqs
winogrande
piqa
anli_r1,anli_r2,anli_r3
boolq,copa
rte,wsc
wikitext
)
## Use localhost if you didn't set up a hostfile as described in
## https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node.
## If a hostfile exists, use a hostname (e.g., worker-0) from the hostfile.
# hostname="localhost"
hostname="worker-0"
batch_size=32
## This script is for zero-shot
num_fewshot=0
num_gpus=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
cuda_id=-1
total_mem=$(nvidia-smi --query-gpu=memory.total --format=csv -i 0 | grep -Eo [0-9]+)
## Code below only works when you run each evalharness task on a single GPU.
## For multi-GPU evalharness, check Megatron-DeepSpeed/blob/main/examples/MoE/ds_evalharness.sh
for l in "${!checkpoint_paths[@]}"; do
checkpoint_path=${checkpoint_paths[l]}
for ((i=0;i<${#tasks[@]};++i)); do
task=${tasks[i]}
free_mem=0
while [ $free_mem -lt $total_mem ]; do
cuda_id=$(((cuda_id+1)%num_gpus))
free_mem=$(nvidia-smi --query-gpu=memory.free --format=csv -i $cuda_id | grep -Eo [0-9]+)
sleep 60s
done
bash ds_evalharness_1gpu.sh $checkpoint_path $config_path $result_path $cuda_id $task $hostname $batch_size $num_fewshot &
done
done

View File

@ -0,0 +1,61 @@
## CAUTION: first read Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md
## and follow its installation/data downloading steps.
checkpoint_paths=(
/vc_data_blob/users/conglli/project/data_efficient_gpt/checkpoint/gpt-pile-0.125B-tok300B-lr6.0e-4-min6.0e-5-wup375M-dcy260B-sty-cosine-gbs256-mbs4-gpu64-zero0-mp1-pp1-nopp-seed1234-bwup4B/global_step591581/
/vc_data_blob/users/conglli/project/data_efficient_gpt/checkpoint/gpt-pile-0.125B-tok300B-lr6.0e-4-min6.0e-5-wup3000M-dcy260B-sty-cosine-gbs256-mbs4-gpu64-zero0-mp1-pp1-nopp-seed1234/global_step572205/
)
## No need to use the exact training config json; using this dummy one is fine.
config_path=ds_config_eval_dummy.json
username=$(whoami)
result_path="/blob/users/${username}/project/data_efficient_gpt/eval_results_10shot"
## Task(s) on the same row will be performed together in the same process.
tasks=(
record
triviaqa
hellaswag
arc_challenge
arc_easy
race
multirc
openbookqa
lambada
webqs
winogrande
piqa
anli_r1,anli_r2
anli_r3
boolq,copa
rte,wsc
)
num_fewshot=10
## Use localhost if you didn't set up a hostfile as described in
## https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node.
## If a hostfile exists, use a hostname (e.g., worker-0) from the hostfile.
# hostname="localhost"
hostname="worker-0"
batch_size=16
num_gpus=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
cuda_id=-1
total_mem=$(nvidia-smi --query-gpu=memory.total --format=csv -i 0 | grep -Eo [0-9]+)
## Code below only works when you run each evalharness task on a single GPU.
## For multi-GPU evalharness, check Megatron-DeepSpeed/blob/main/examples/MoE/ds_evalharness.sh
for l in "${!checkpoint_paths[@]}"; do
checkpoint_path=${checkpoint_paths[l]}
for ((i=0;i<${#tasks[@]};++i)); do
task=${tasks[i]}
free_mem=0
while [ $free_mem -lt $total_mem ]; do
cuda_id=$(((cuda_id+1)%num_gpus))
free_mem=$(nvidia-smi --query-gpu=memory.free --format=csv -i $cuda_id | grep -Eo [0-9]+)
sleep 60s
done
bash ds_evalharness_1gpu.sh $checkpoint_path $config_path $result_path $cuda_id $task $hostname $batch_size $num_fewshot &
done
done

View File

@ -0,0 +1,74 @@
{
"train_batch_size": GBSIZE,
"train_micro_batch_size_per_gpu": MBSIZE,
"steps_per_print": LOG_INTERVAL,
"zero_optimization": {
"stage": ZERO_STAGE,
"elastic_checkpoint": true
},
"gradient_clipping": 1.0,
"prescale_gradients": PRESCALE_GRAD,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 500,
"hysteresis": 2,
"min_loss_scale": 1,
"initial_scale_power": 11
},
"wall_clock_breakdown" : false,
"dataloader_drop_last": true,
"data_efficiency": {
"enabled": true,
"seed": DATA_EFFICIENCY_SEED,
"data_routing": {
"enabled": LTD_ENABLED,
"random_ltd":{
"enabled": LTD_ENABLED,
"total_layer_num": 24,
"random_ltd_layer_num": 22,
"random_ltd_layer_id": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
"model_mask_name": "attention_mask",
"model_type": "decoder",
"hidden_state_order": "seq_batch_dim",
"random_ltd_schedule": {
"min_value": LTD_MIN,
"max_value": LTD_MAX,
"schedule_type":"fixed_linear",
"schedule_config": {
"require_steps": LTD_STEP,
"seq_per_step": 16
}
}
}
},
"data_sampling": {
"enabled": CL_ENABLED,
"num_workers": DATA_SAMPLING_NUM_WORKERS,
"curriculum_learning": {
"enabled": CL_ENABLED,
"data_cluster_path": "CL_CLUSTER_PATH",
"curriculum_metrics": {
"CL_1st_METRIC_NAME": {
"index_to_sample_path": "CL_1st_SAMPLE_PATH",
"index_to_metric_path": "CL_1st_METRIC_PATH",
"difficulty_type": "CL_1st_DIFF_TYPE",
"clustering_type": "CL_1st_CLUSTER_TYPE",
"min_difficulty": CL_1st_MIN,
"max_difficulty": CL_1st_MAX,
"schedule_type": "fixed_root",
"schedule_config": {
"total_curriculum_step": CL_1st_TOTAL_STEP,
"difficulty_step": CL_1st_DIFF_STEP,
"root_degree": CL_1st_ROOT
}
}
}
}
}
}
}

View File

@ -0,0 +1,88 @@
{
"train_batch_size": GBSIZE,
"train_micro_batch_size_per_gpu": MBSIZE,
"steps_per_print": LOG_INTERVAL,
"zero_optimization": {
"stage": ZERO_STAGE,
"elastic_checkpoint": true
},
"gradient_clipping": 1.0,
"prescale_gradients": PRESCALE_GRAD,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 500,
"hysteresis": 2,
"min_loss_scale": 1,
"initial_scale_power": 11
},
"wall_clock_breakdown" : false,
"dataloader_drop_last": true,
"data_efficiency": {
"enabled": true,
"seed": DATA_EFFICIENCY_SEED,
"data_routing": {
"enabled": LTD_ENABLED,
"random_ltd":{
"enabled": LTD_ENABLED,
"total_layer_num": 24,
"random_ltd_layer_num": 22,
"random_ltd_layer_id": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
"model_mask_name": "attention_mask",
"model_type": "decoder",
"hidden_state_order": "seq_batch_dim",
"random_ltd_schedule": {
"min_value": LTD_MIN,
"max_value": LTD_MAX,
"schedule_type":"fixed_linear",
"schedule_config": {
"require_steps": LTD_STEP,
"seq_per_step": 16
}
}
}
},
"data_sampling": {
"enabled": CL_ENABLED,
"num_workers": DATA_SAMPLING_NUM_WORKERS,
"curriculum_learning": {
"enabled": CL_ENABLED,
"data_cluster_path": "CL_CLUSTER_PATH",
"curriculum_metrics": {
"CL_1st_METRIC_NAME": {
"index_to_sample_path": "CL_1st_SAMPLE_PATH",
"index_to_metric_path": "CL_1st_METRIC_PATH",
"difficulty_type": "CL_1st_DIFF_TYPE",
"clustering_type": "CL_1st_CLUSTER_TYPE",
"min_difficulty": CL_1st_MIN,
"max_difficulty": CL_1st_MAX,
"schedule_type": "fixed_root",
"schedule_config": {
"total_curriculum_step": CL_1st_TOTAL_STEP,
"difficulty_step": CL_1st_DIFF_STEP,
"root_degree": CL_1st_ROOT
}
},
"CL_2nd_METRIC_NAME": {
"index_to_sample_path": "CL_2nd_SAMPLE_PATH",
"index_to_metric_path": "CL_2nd_METRIC_PATH",
"difficulty_type": "CL_2nd_DIFF_TYPE",
"clustering_type": "CL_2nd_CLUSTER_TYPE",
"min_difficulty": CL_2nd_MIN,
"max_difficulty": CL_2nd_MAX,
"schedule_type": "fixed_root",
"schedule_config": {
"total_curriculum_step": CL_2nd_TOTAL_STEP,
"difficulty_step": CL_2nd_DIFF_STEP,
"root_degree": CL_2nd_ROOT
}
}
}
}
}
}
}

View File

@ -0,0 +1,515 @@
#!/bin/bash
dir=`pwd`
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
seq_len=2048
## The "GPT-3 XXX" below are configs from GPT-3 paper
## https://arxiv.org/abs/2005.14165, choose based on
## your desired model size or build your own configs
## init_std is standard deviation for weight initialization. Usually larger
## model needs lower std. We used a heuristic equation of sqrt(1/3/hidden_size)
## from the MT-NLG 530B work (https://arxiv.org/pdf/2201.11990.pdf)
## We changed min_lr to a lower number (1.0e-6), which we found is able to
## provide better zero-shot eval results.
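## Quick sanity check of that heuristic (illustrative only): sqrt(1/3/hidden_size)
## gives ~0.0208 for hidden_size=768, ~0.0128 for 2048, and ~0.0052 for 12288,
## which matches the init_std values used below.
# python3 -c "import math; print([round(math.sqrt(1/3/h), 4) for h in (768, 2048, 12288)])"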
## GPT-3 Small 125M
# model_size=0.125
# num_layers=12
# hidden_size=768
# num_attn_heads=12
# global_batch_size=256
# lr=6.0e-4
# min_lr=1.0e-6
# init_std=0.02
## GPT-3 Medium 350M
# model_size=0.35
# num_layers=24
# hidden_size=1024
# num_attn_heads=16
# global_batch_size=256
# lr=3.0e-4
# min_lr=1.0e-6
# init_std=0.018
## GPT-3 Large 760M
# model_size=0.76
# num_layers=24
# hidden_size=1536
# num_attn_heads=16
# global_batch_size=256
# lr=2.5e-4
# min_lr=1.0e-6
# init_std=0.015
## GPT-3 XL 1.3B
model_size=1.3
num_layers=24
hidden_size=2048
num_attn_heads=16
global_batch_size=512
# lr=2.0e-4
lr=$1
min_lr=1.0e-6
init_std=0.013
## GPT-3 2.7B
# model_size=2.7
# num_layers=32
# hidden_size=2560
# num_attn_heads=32
# global_batch_size=512
# lr=1.6e-4
# min_lr=1.0e-6
# init_std=0.011
## GPT-3 6.7B
# model_size=6.7
# num_layers=32
# hidden_size=4096
# num_attn_heads=32
# global_batch_size=1024
# lr=1.2e-4
# min_lr=1.0e-6
# init_std=0.009
## GPT-3 13B
# model_size=13
# num_layers=40
# hidden_size=5120
# num_attn_heads=40
# global_batch_size=1024
# lr=1.0e-4
# min_lr=1.0e-6
# init_std=0.008
## GPT-3 175B
# model_size=175
# num_layers=96
# hidden_size=12288
# num_attn_heads=96
# global_batch_size=1536
# lr=0.6e-4
# min_lr=1.0e-6
# init_std=0.005
###############################################################################
### Training duration configs
## The main termination condition; the original GPT-3 paper trains for 300B tokens.
# train_tokens_in_billion=300
train_tokens_in_billion=$2
train_tokens=$((${train_tokens_in_billion} * 1000000000))
## train_samples is another termination condition and also affects the number
## of data samples to be indexed. Since we want to reach the train_tokens
## above, and data efficiency techniques may change the number of tokens in
## some samples, we just set this config large enough to make sure we have
## enough processed data and don't terminate by train_samples.
train_samples=$(( 300 * 1000000000 * 2 / ${seq_len} ))
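## Sanity check: with seq_len=2048 the expression above evaluates to
## 300e9 * 2 / 2048 = 292,968,750 samples, about twice the number of
## 2048-token samples in a 300B-token budget (note the hard-coded 300).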
## Another wall-clock time termination condition in minutes. Set it large
## enough to avoid undesired early termination.
exit_duration=30000000
###############################################################################
### lr configs
## lr warmup and decay duration.
## Original GPT-3 paper uses 375M warmup tokens and 260B cosine decay tokens.
## Here we increase the warmup tokens to 3B since when batch size warmup is not
## used, there are more tokens per step. Thus we need to increase warmup tokens
## to make sure there are enough warmup steps, which is important for training
## stability.
lr_warmup_tokens_in_million=3000
lr_warmup_tokens=$((${lr_warmup_tokens_in_million} * 1000000))
## Here we changed the LR decay tokens to align with the total train tokens,
## since related work (e.g., https://arxiv.org/abs/2203.15556) finds that setting
## the learning rate schedule to match the number of training tokens results in
## the best final model quality.
lr_decay_tokens_in_billion=${train_tokens_in_billion}
lr_decay_tokens=$((${lr_decay_tokens_in_billion} * 1000000000))
lr_decay_style="cosine"
###############################################################################
### Parallelism configs
## Model parallelism, 1 is no MP
mp_size=1
## Pipeline parallelism. To disable PP, set pp_size to 1 and no_pp to true.
## Note that currently both curriculum learning and random-LTD are NOT
## compatible with pipeline parallelism.
pp_size=1
no_pp="true"
## ZeRO-based data parallelism, stage=0 will disable ZeRO
zero_stage=1
## Total number of GPUs. ds_ssh is from DeepSpeed library.
num_gpus=$(($(ds_ssh nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)-2))
num_gpus_pernode=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
num_node=$(( ${num_gpus} / ${num_gpus_pernode} ))
## Data parallel size.
dp_size=$(( ${num_gpus} / ${pp_size} / ${mp_size} ))
## Micro batch size per GPU
## Make sure that batch_size <= global_batch_size*pp_size*mp_size/num_gpus
## Reduce it manually if GPU OOM
batch_size=$(( ${global_batch_size} / ${dp_size} ))
###############################################################################
### Random layerwise token dropping (random-LTD) configs
## random-LTD's main switch. "false" means disabled. "true" means enabled.
ltd_enabled=${3:-'false'}
## The dropping ratio to start with; the value denotes the seqlen after
## dropping.
ltd_start=${4:-2048}
## How many steps random-LTD takes to gradually reduce the dropping ratio to zero.
ltd_step=${5:-1}
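## Illustrative only (an assumption based on the "fixed_linear" random-LTD
## schedule in the ds_config templates earlier in this commit): the kept
## sequence length grows roughly linearly from ltd_start to seq_len over
## ltd_step steps, in increments of seq_per_step (16 in the templates), e.g.:
# python3 -c "s0,s1,T,t=128,2048,200000,100000; print(s0+(s1-s0)*t/T)"  # ~1088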
# ltd_enabled="true"
# ltd_start=128
# ltd_step=200000
###############################################################################
### Curriculum learning (CL) configs
## CL's main switch. "false" means disabled. "true" means enabled.
cl_enabled=${6:-'false'}
## Number of CL metrics to use.
cl_num_metric=${7:-1}
## Name of difficulty metric
cl_1st_metric=${8:-'dummy'}
## Path to the data indexes for this difficulty metric. Samples on the ith row
## of index_to_sample have a difficulty value equal to the ith row of
## index_to_metric.
cl_1st_index_to_sample_path=${9:-'dummy'}
cl_1st_index_to_metric_path=${10:-'dummy'}
## During training, whether to increase difficulty by value or by percentile.
cl_1st_difficulty_type=${11:-'value'}
## "single_cluster" means no clustering is required and CL is likely achieved by
## data postprocessing. "schedule_based" means data will be clustered based on
## the difficulty schedule (pacing function) below.
cl_1st_clustering_type=${12:-'single_cluster'}
## Start difficulty
cl_1st_min=${13:-2048}
## End difficulty
cl_1st_max=${14:-2048}
## Total step to reach end difficulty
cl_1st_total_step=${15:-1}
## When changing difficulty, always make sure it's a multiple of the
## difficulty_step below.
cl_1st_difficulty_step=${16:-1}
## Root degree of the schedule (pacing function).
cl_1st_root=${17:-1}
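## Illustrative only (an assumption about the "fixed_root" schedule used in
## the ds_config templates; the exact rounding lives inside the DeepSpeed data
## efficiency library): difficulty at a given step is roughly
##   min + (max - min) * (step / total_step)^(1/root),
## rounded down to a multiple of difficulty_step and capped at max, e.g.:
# python3 -c "mn,mx,T,r=1,100,110000,2; t=27500; print(mn+(mx-mn)*(t/T)**(1.0/r))"  # ~50.5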
cl_2nd_metric=${18:-'dummy'}
cl_2nd_index_to_sample_path=${19:-'dummy'}
cl_2nd_index_to_metric_path=${20:-'dummy'}
cl_2nd_difficulty_type=${21:-'value'}
cl_2nd_clustering_type=${22:-'single_cluster'}
cl_2nd_min=${23:-2048}
cl_2nd_max=${24:-2048}
cl_2nd_total_step=${25:-1}
cl_2nd_difficulty_step=${26:-1}
cl_2nd_root=${27:-1}
# cl_enabled="true"
# cl_num_metric=2
# cl_1st_metric="voc"
# ## The *_index_to_sample_percentile_merged is a concatenated index for perf
# ## improvement, but it only works when you set difficulty_type="percentile" in
# ## ds_config. If you use difficulty_type="value", you need to change this to
# ## *_index_to_sample
# cl_1st_index_to_sample_path="/blob/users/conglli/data/analysis_pile_gpt_1epoch/vocab_rarity/vocab_rarity_index_to_sample_percentile_merged"
# # cl_1st_index_to_sample_path="/blob/users/conglli/data/analysis_pile_gpt_1epoch/vocab_rarity/vocab_rarity_index_to_sample"
# cl_1st_index_to_metric_path="/blob/users/conglli/data/analysis_pile_gpt_1epoch/vocab_rarity/vocab_rarity_index_to_metric"
# cl_1st_difficulty_type="percentile"
# cl_1st_clustering_type="schedule_based"
# cl_1st_min=1
# cl_1st_max=100
# cl_1st_total_step=110000
# cl_1st_difficulty_step=1
# cl_1st_root=2
# cl_2nd_metric="seqlen_truncate"
# cl_2nd_index_to_sample_path="dummy"
# cl_2nd_index_to_metric_path="dummy"
# cl_2nd_difficulty_type="value"
# cl_2nd_clustering_type="single_cluster"
# cl_2nd_min=80
# cl_2nd_max=2048
# cl_2nd_total_step=110000
# cl_2nd_difficulty_step=8
# cl_2nd_root=1
###############################################################################
### Misc configs
log_interval=100
eval_iters=10
eval_interval=100
# num_save controls how frequently to save checkpoints. num_save=20 means that
# a checkpoint will be saved every 5% of training. For longer training you
# would want a larger num_save to save more frequently, and vice versa.
num_save=100
estimated_train_iter=$((${train_tokens} / ${seq_len} / ${global_batch_size}))
save_interval=$((${estimated_train_iter} / ${num_save}))
## Activation checkpointing saves GPU memory, but reduces training speed
activation_checkpoint="true"
# activation_checkpoint="false"
## Whether or not to log optimizer states (norms, max abs values) to tensorboard.
## This is not required for training and might save GPU memory when turned off.
log_optimizer_state="true"
###############################################################################
### Output and data configs
current_time=$(date "+%Y.%m.%d_%H.%M.%S")
host="${HOSTNAME}"
seed=1234
num_workers=0
## The public Pile dataset can be downloaded at
## https://mystic.the-eye.eu/public/AI/pile_neox/. Change data_home to where you
## store pile_text_document.bin and pile_text_document.idx (a commented download
## sketch follows the data_path below).
data_home="/vc_data_blob/users/conglli/the_pile_public_merged_nopreprocessing"
if [[ "$host" == *"webxt"* ]]; then
data_home="/blob/data/the_pile_public_merged_nopreprocessing"
fi
data_path="${data_home}/pile_text_document"
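## A minimal, commented-out sketch for fetching the preprocessed Pile files
## into ${data_home}. The exact file names and layout on the mirror are an
## assumption; check the directory listing at the URL above and adjust.
# mkdir -p "${data_home}"
# wget -P "${data_home}" https://mystic.the-eye.eu/public/AI/pile_neox/pile_text_document.bin
# wget -P "${data_home}" https://mystic.the-eye.eu/public/AI/pile_neox/pile_text_document.idx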
## The *_idx_path variables force Megatron to use the specific data index files
## generated when we analyzed the data. This is needed because our index for the
## curriculum learning difficulty metric is based on this data index.
doc_idx_path="${data_home}/pile_text_document_train_indexmap_exact1ep_2048sl_1234s_doc_idx.npy"
sample_idx_path="${data_home}/pile_text_document_train_indexmap_exact1ep_2048sl_1234s_sample_idx.npy"
shuffle_idx_path="${data_home}/pile_text_document_train_indexmap_exact1ep_2048sl_1234s_shuffle_idx.npy"
vocab_path="gpt2-vocab.json"
if [ ! -f "$vocab_path" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
fi
merge_path="gpt2-merges.txt"
if [ ! -f "$merge_path" ]; then
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
fi
prescale_grad="true"
jobname="gpt_${model_size}B_tok${train_tokens_in_billion}B"
jobname="${jobname}_lr${lr}_min${min_lr}_w${lr_warmup_tokens_in_million}M_d${lr_decay_tokens_in_billion}B_${lr_decay_style}"
jobname="${jobname}_gbs${global_batch_size}_mbs${batch_size}_g${num_gpus}"
if [[ $zero_stage -gt 0 ]]; then
jobname="${jobname}_z${zero_stage}"
prescale_grad="false"
fi
if [[ $mp_size -gt 1 ]]; then
jobname="${jobname}_mp${mp_size}"
fi
if [ "${no_pp}" = "false" ]; then
jobname="${jobname}_pp${pp_size}"
fi
jobname="${jobname}_seed${seed}"
if [ "${ltd_enabled}" = "true" ]; then
jobname="${jobname}_ltd_${ltd_start}_${ltd_step}"
fi
if [ "${cl_enabled}" = "true" ]; then
jobname="${jobname}_cl_${cl_1st_metric}_${cl_1st_min}_${cl_1st_max}_${cl_1st_total_step}_${cl_1st_root}"
if [[ $cl_num_metric -gt 1 ]]; then
jobname="${jobname}_${cl_2nd_metric}_${cl_2nd_min}_${cl_2nd_max}_${cl_2nd_total_step}_${cl_2nd_root}"
fi
fi
username=$(whoami)
output_home="/blob/users/${username}/project/data_efficient_gpt"
log_path="${output_home}/log/"
checkpoint_path="${output_home}/checkpoint/${jobname}"
## Microsoft internal constraint: because tensorboard is logged by the last
## rank, it's better to put the path on NFS instead of Blob storage.
tensorboard_dir="/vc_data/users/${username}/project/data_efficient_gpt/tensorboard/"
tensorboard_path="${tensorboard_dir}${jobname}_${host}_${current_time}"
mkdir -p ${log_path}
mkdir -p ${checkpoint_path}
mkdir -p ${tensorboard_path}
if [ "${cl_enabled}" = "true" ]; then
data_cluster_path="${output_home}/data_cluster/${jobname}"
mkdir -p ${data_cluster_path}
fi
###############################################################################
data_options=" \
--vocab-file ${vocab_path} \
--merge-file ${merge_path} \
--data-path ${data_path} \
--data-impl mmap"
## If CL is used, make sure to set "--split" to the same value you used during
## the offline data analysis & indexing.
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${mp_size} \
--init-method-std ${init_std} \
--lr-decay-tokens ${lr_decay_tokens} \
--lr-warmup-tokens ${lr_warmup_tokens} \
--micro-batch-size ${batch_size} \
--exit-duration-in-mins ${exit_duration} \
--global-batch-size ${global_batch_size} \
--num-layers ${num_layers} \
--hidden-size ${hidden_size} \
--num-attention-heads ${num_attn_heads} \
--seq-length ${seq_len} \
--max-position-embeddings ${seq_len} \
--train-tokens ${train_tokens} \
--train-samples ${train_samples} \
--lr ${lr} \
--min-lr ${min_lr} \
--lr-decay-style ${lr_decay_style} \
--split 949,50,1 \
--log-interval ${log_interval} \
--eval-interval ${eval_interval} \
--eval-iters ${eval_iters} \
--save-interval ${save_interval} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers ${num_workers} \
--fp16 \
--seed ${seed} \
--load ${checkpoint_path} \
--save ${checkpoint_path} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${tensorboard_path}"
if [ "${activation_checkpoint}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [ "${log_optimizer_state}" = "true" ]; then
megatron_options="${megatron_options} \
--log-optimizer-states-to-tensorboard"
fi
if [ "${ltd_enabled}" = "true" ]; then
megatron_options="${megatron_options} \
--random-ltd"
fi
if [ "${cl_enabled}" = "true" ]; then
megatron_options="${megatron_options} \
--train-doc-idx-path ${doc_idx_path} \
--train-sample-idx-path ${sample_idx_path} \
--train-shuffle-idx-path ${shuffle_idx_path} \
--data-efficiency-curriculum-learning"
fi
config_json="ds_config_gbs${global_batch_size}_mbs${batch_size}_log${log_interval}_zero${zero_stage}_seed${seed}"
if [ "${ltd_enabled}" = "true" ]; then
config_json="${config_json}_ltd_${ltd_start}_${ltd_step}"
fi
if [ "${cl_enabled}" = "true" ]; then
config_json="${config_json}_cl_${cl_1st_metric}_${cl_1st_min}_${cl_1st_max}_${cl_1st_total_step}_${cl_1st_root}"
if [[ $cl_num_metric -gt 1 ]]; then
config_json="${config_json}_${cl_2nd_metric}_${cl_2nd_min}_${cl_2nd_max}_${cl_2nd_total_step}_${cl_2nd_root}"
fi
fi
config_json="${config_json}.json"
if [[ $cl_num_metric -gt 1 ]]; then
template_json="ds_config_gpt_2clmetrics_TEMPLATE.json"
sed "s/GBSIZE/${global_batch_size}/" ${template_json} \
| sed "s/MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/${prescale_grad}/" \
| sed "s/DATA_EFFICIENCY_SEED/${seed}/" \
| sed "s/LTD_ENABLED/${ltd_enabled}/" \
| sed "s/LTD_MIN/${ltd_start}/" \
| sed "s/LTD_MAX/${seq_len}/" \
| sed "s/LTD_STEP/${ltd_step}/" \
| sed "s/CL_ENABLED/${cl_enabled}/" \
| sed "s/DATA_SAMPLING_NUM_WORKERS/${num_workers}/" \
| sed "s#CL_CLUSTER_PATH#${data_cluster_path}#" \
| sed "s#CL_1st_METRIC_NAME#${cl_1st_metric}#" \
| sed "s#CL_1st_SAMPLE_PATH#${cl_1st_index_to_sample_path}#" \
| sed "s#CL_1st_METRIC_PATH#${cl_1st_index_to_metric_path}#" \
| sed "s#CL_1st_DIFF_TYPE#${cl_1st_difficulty_type}#" \
| sed "s#CL_1st_CLUSTER_TYPE#${cl_1st_clustering_type}#" \
| sed "s/CL_1st_MIN/${cl_1st_min}/" \
| sed "s/CL_1st_MAX/${cl_1st_max}/" \
| sed "s/CL_1st_TOTAL_STEP/${cl_1st_total_step}/" \
| sed "s/CL_1st_DIFF_STEP/${cl_1st_difficulty_step}/" \
| sed "s/CL_1st_ROOT/${cl_1st_root}/" \
| sed "s#CL_2nd_METRIC_NAME#${cl_2nd_metric}#" \
| sed "s#CL_2nd_SAMPLE_PATH#${cl_2nd_index_to_sample_path}#" \
| sed "s#CL_2nd_METRIC_PATH#${cl_2nd_index_to_metric_path}#" \
| sed "s#CL_2nd_DIFF_TYPE#${cl_2nd_difficulty_type}#" \
| sed "s#CL_2nd_CLUSTER_TYPE#${cl_2nd_clustering_type}#" \
| sed "s/CL_2nd_MIN/${cl_2nd_min}/" \
| sed "s/CL_2nd_MAX/${cl_2nd_max}/" \
| sed "s/CL_2nd_TOTAL_STEP/${cl_2nd_total_step}/" \
| sed "s/CL_2nd_DIFF_STEP/${cl_2nd_difficulty_step}/" \
| sed "s/CL_2nd_ROOT/${cl_2nd_root}/" \
> ${config_json}
else
template_json="ds_config_gpt_1clmetric_TEMPLATE.json"
sed "s/GBSIZE/${global_batch_size}/" ${template_json} \
| sed "s/MBSIZE/${batch_size}/" \
| sed "s/LOG_INTERVAL/${log_interval}/" \
| sed "s/ZERO_STAGE/${zero_stage}/" \
| sed "s/PRESCALE_GRAD/${prescale_grad}/" \
| sed "s/DATA_EFFICIENCY_SEED/${seed}/" \
| sed "s/LTD_ENABLED/${ltd_enabled}/" \
| sed "s/LTD_MIN/${ltd_start}/" \
| sed "s/LTD_MAX/${seq_len}/" \
| sed "s/LTD_STEP/${ltd_step}/" \
| sed "s/CL_ENABLED/${cl_enabled}/" \
| sed "s/DATA_SAMPLING_NUM_WORKERS/${num_workers}/" \
| sed "s#CL_CLUSTER_PATH#${data_cluster_path}#" \
| sed "s#CL_1st_METRIC_NAME#${cl_1st_metric}#" \
| sed "s#CL_1st_SAMPLE_PATH#${cl_1st_index_to_sample_path}#" \
| sed "s#CL_1st_METRIC_PATH#${cl_1st_index_to_metric_path}#" \
| sed "s#CL_1st_DIFF_TYPE#${cl_1st_difficulty_type}#" \
| sed "s#CL_1st_CLUSTER_TYPE#${cl_1st_clustering_type}#" \
| sed "s/CL_1st_MIN/${cl_1st_min}/" \
| sed "s/CL_1st_MAX/${cl_1st_max}/" \
| sed "s/CL_1st_TOTAL_STEP/${cl_1st_total_step}/" \
| sed "s/CL_1st_DIFF_STEP/${cl_1st_difficulty_step}/" \
| sed "s/CL_1st_ROOT/${cl_1st_root}/" \
> ${config_json}
fi
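## Optional sanity check (a minimal sketch, kept commented out so it does not
## change the run): validate that the sed-substituted template above produced
## well-formed JSON before launching.
# python -m json.tool "${config_json}" > /dev/null \
#     || { echo "Invalid DeepSpeed config generated: ${config_json}"; exit 1; }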
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${zero_stage} \
--pipeline-model-parallel-size ${pp_size}"
if [[ "${no_pp}" = "true" ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${activation_checkpoint}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
## When saving checkpoints to storage with a cache, there can be a consistency
## issue with the pointer to the latest checkpoint. Here we find the correct
## pointer and broadcast it to all nodes.
iteration_file="$checkpoint_path/latest_checkpointed_iteration.txt"
iteration_file_2="$checkpoint_path/latest"
iteration=0
for (( node = 0; node <= num_node-1; node++ ))
do
if $(ssh -q worker-"$node" "test -f \"$iteration_file\""); then
local_iteration=$(ssh -q worker-"$node" cat $iteration_file)
iteration=$(( ${local_iteration} > ${iteration} ? ${local_iteration} : ${iteration} ))
fi
done
if [[ $iteration -gt 0 ]]; then
iteration_2="global_step${iteration}"
ds_ssh "echo $iteration > $iteration_file"
ds_ssh "echo $iteration_2 > $iteration_file_2"
fi
deepspeed ${dir}/../../../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &>> ${log_path}/${jobname}_${host}_${current_time}.log

View File

@ -0,0 +1,366 @@
###############################################################################
### Each block below is one pretraining setup. Uncomment one block to try.
###############################################################################
### Baseline cases, mostly based on OpenAI's GPT-3 hyperparameters, but with
### some changes (without batch size warmup, and different LR schedule).
## Baseline 300B tokens (100%):
# lr=2.0e-4
# train_tokens_in_billion=300
# bash ds_pretrain_gpt_1.3B_dense_base_script.sh ${lr} \
# ${train_tokens_in_billion}
###############################################################################
## Baseline 200B tokens (67%):
# lr=3.0e-4 # scaled based on the train-token reduction ratio (2.0e-4 * 300/200)
# train_tokens_in_billion=200
# bash ds_pretrain_gpt_1.3B_dense_base_script.sh ${lr} \
# ${train_tokens_in_billion}
###############################################################################
## Baseline 150B tokens (50%):
# lr=4.0e-4
# train_tokens_in_billion=150
# bash ds_pretrain_gpt_1.3B_dense_base_script.sh ${lr} \
# ${train_tokens_in_billion}
###############################################################################
### Curriculum learning (CL) + Random layerwise token dropping (random-LTD).
### DeepSpeed Data Efficiency's best composed solution.
## CL+random-LTD 300B tokens (100%):
# lr=2.0e-4
# train_tokens_in_billion=300
# ltd_enabled="true"
# ltd_start=128
# ltd_step=200000
# cl_enabled="true"
# cl_num_metric=2
# cl_1st_metric="voc"
# cl_1st_index_to_sample_path="/blob/users/conglli/data/analysis_pile_gpt_1epoch/vocab_rarity/vocab_rarity_index_to_sample_percentile_merged"
# cl_1st_index_to_metric_path="/blob/users/conglli/data/analysis_pile_gpt_1epoch/vocab_rarity/vocab_rarity_index_to_metric"
# cl_1st_difficulty_type="percentile"
# cl_1st_clustering_type="schedule_based"
# cl_1st_min=1
# cl_1st_max=100
# cl_1st_total_step=110000
# cl_1st_difficulty_step=1
# cl_1st_root=2
# cl_2nd_metric="seqlen_truncate"
# cl_2nd_index_to_sample_path="dummy"
# cl_2nd_index_to_metric_path="dummy"
# cl_2nd_difficulty_type="value"
# cl_2nd_clustering_type="single_cluster"
# cl_2nd_min=80
# cl_2nd_max=2048
# cl_2nd_total_step=110000
# cl_2nd_difficulty_step=8
# cl_2nd_root=1
# bash ds_pretrain_gpt_1.3B_dense_base_script.sh ${lr} \
# ${train_tokens_in_billion} ${ltd_enabled} ${ltd_start} ${ltd_step} \
# ${cl_enabled} ${cl_num_metric} ${cl_1st_metric} \
# ${cl_1st_index_to_sample_path} ${cl_1st_index_to_metric_path} \
# ${cl_1st_difficulty_type} ${cl_1st_clustering_type} ${cl_1st_min} \
# ${cl_1st_max} ${cl_1st_total_step} ${cl_1st_difficulty_step} \
# ${cl_1st_root} ${cl_2nd_metric} ${cl_2nd_index_to_sample_path} \
# ${cl_2nd_index_to_metric_path} ${cl_2nd_difficulty_type} \
# ${cl_2nd_clustering_type} ${cl_2nd_min} ${cl_2nd_max} \
# ${cl_2nd_total_step} ${cl_2nd_difficulty_step} ${cl_2nd_root}
###############################################################################
## CL+random-LTD 150B tokens (50%):
# lr=4.0e-4
# train_tokens_in_billion=150
# ltd_enabled="true"
# ltd_start=128
# ltd_step=100000
# cl_enabled="true"
# cl_num_metric=2
# cl_1st_metric="voc"
# cl_1st_index_to_sample_path="/blob/users/conglli/data/analysis_pile_gpt_1epoch/vocab_rarity/vocab_rarity_index_to_sample_percentile_merged"
# cl_1st_index_to_metric_path="/blob/users/conglli/data/analysis_pile_gpt_1epoch/vocab_rarity/vocab_rarity_index_to_metric"
# cl_1st_difficulty_type="percentile"
# cl_1st_clustering_type="schedule_based"
# cl_1st_min=1
# cl_1st_max=100
# cl_1st_total_step=55000
# cl_1st_difficulty_step=1
# cl_1st_root=2
# cl_2nd_metric="seqlen_truncate"
# cl_2nd_index_to_sample_path="dummy"
# cl_2nd_index_to_metric_path="dummy"
# cl_2nd_difficulty_type="value"
# cl_2nd_clustering_type="single_cluster"
# cl_2nd_min=80
# cl_2nd_max=2048
# cl_2nd_total_step=55000
# cl_2nd_difficulty_step=8
# cl_2nd_root=1
# bash ds_pretrain_gpt_1.3B_dense_base_script.sh ${lr} \
# ${train_tokens_in_billion} ${ltd_enabled} ${ltd_start} ${ltd_step} \
# ${cl_enabled} ${cl_num_metric} ${cl_1st_metric} \
# ${cl_1st_index_to_sample_path} ${cl_1st_index_to_metric_path} \
# ${cl_1st_difficulty_type} ${cl_1st_clustering_type} ${cl_1st_min} \
# ${cl_1st_max} ${cl_1st_total_step} ${cl_1st_difficulty_step} \
# ${cl_1st_root} ${cl_2nd_metric} ${cl_2nd_index_to_sample_path} \
# ${cl_2nd_index_to_metric_path} ${cl_2nd_difficulty_type} \
# ${cl_2nd_clustering_type} ${cl_2nd_min} ${cl_2nd_max} \
# ${cl_2nd_total_step} ${cl_2nd_difficulty_step} ${cl_2nd_root}
###############################################################################
### Random layerwise token dropping (random-LTD).
## random-LTD 300B tokens (100%):
# lr=2.0e-4
# train_tokens_in_billion=300
# ltd_enabled="true"
# ltd_start=128
# ltd_step=200000
# bash ds_pretrain_gpt_1.3B_dense_base_script.sh ${lr} \
# ${train_tokens_in_billion} ${ltd_enabled} ${ltd_start} ${ltd_step}
###############################################################################
## random-LTD 200B tokens (67%):
# lr=3.0e-4
# train_tokens_in_billion=200
# ltd_enabled="true"
# ltd_start=128
# ltd_step=133333
# bash ds_pretrain_gpt_1.3B_dense_base_script.sh ${lr} \
# ${train_tokens_in_billion} ${ltd_enabled} ${ltd_start} ${ltd_step}
###############################################################################
## random-LTD 150B tokens (50%):
# lr=4.0e-4
# train_tokens_in_billion=150
# ltd_enabled="true"
# ltd_start=128
# ltd_step=100000
# bash ds_pretrain_gpt_1.3B_dense_base_script.sh ${lr} \
# ${train_tokens_in_billion} ${ltd_enabled} ${ltd_start} ${ltd_step}
###############################################################################
### Curriculum learning (CL).
## CL vocab rarity + seqlen truncation 300B tokens (100%):
# lr=2.0e-4
# train_tokens_in_billion=300
# ltd_enabled="false"
# ltd_start=2048
# ltd_step=1
# cl_enabled="true"
# cl_num_metric=2
# cl_1st_metric="voc"
# cl_1st_index_to_sample_path="/blob/users/conglli/data/analysis_pile_gpt_1epoch/vocab_rarity/vocab_rarity_index_to_sample_percentile_merged"
# cl_1st_index_to_metric_path="/blob/users/conglli/data/analysis_pile_gpt_1epoch/vocab_rarity/vocab_rarity_index_to_metric"
# cl_1st_difficulty_type="percentile"
# cl_1st_clustering_type="schedule_based"
# cl_1st_min=1
# cl_1st_max=100
# cl_1st_total_step=110000
# cl_1st_difficulty_step=1
# cl_1st_root=2
# cl_2nd_metric="seqlen_truncate"
# cl_2nd_index_to_sample_path="dummy"
# cl_2nd_index_to_metric_path="dummy"
# cl_2nd_difficulty_type="value"
# cl_2nd_clustering_type="single_cluster"
# cl_2nd_min=80
# cl_2nd_max=2048
# cl_2nd_total_step=110000
# cl_2nd_difficulty_step=8
# cl_2nd_root=1
# bash ds_pretrain_gpt_1.3B_dense_base_script.sh ${lr} \
# ${train_tokens_in_billion} ${ltd_enabled} ${ltd_start} ${ltd_step} \
# ${cl_enabled} ${cl_num_metric} ${cl_1st_metric} \
# ${cl_1st_index_to_sample_path} ${cl_1st_index_to_metric_path} \
# ${cl_1st_difficulty_type} ${cl_1st_clustering_type} ${cl_1st_min} \
# ${cl_1st_max} ${cl_1st_total_step} ${cl_1st_difficulty_step} \
# ${cl_1st_root} ${cl_2nd_metric} ${cl_2nd_index_to_sample_path} \
# ${cl_2nd_index_to_metric_path} ${cl_2nd_difficulty_type} \
# ${cl_2nd_clustering_type} ${cl_2nd_min} ${cl_2nd_max} \
# ${cl_2nd_total_step} ${cl_2nd_difficulty_step} ${cl_2nd_root}
###############################################################################
## CL vocab rarity + seqlen truncation 200B tokens (67%):
# lr=3.0e-4
# train_tokens_in_billion=200
# ltd_enabled="false"
# ltd_start=2048
# ltd_step=1
# cl_enabled="true"
# cl_num_metric=2
# cl_1st_metric="voc"
# cl_1st_index_to_sample_path="/blob/users/conglli/data/analysis_pile_gpt_1epoch/vocab_rarity/vocab_rarity_index_to_sample_percentile_merged"
# cl_1st_index_to_metric_path="/blob/users/conglli/data/analysis_pile_gpt_1epoch/vocab_rarity/vocab_rarity_index_to_metric"
# cl_1st_difficulty_type="percentile"
# cl_1st_clustering_type="schedule_based"
# cl_1st_min=1
# cl_1st_max=100
# cl_1st_total_step=73000
# cl_1st_difficulty_step=1
# cl_1st_root=2
# cl_2nd_metric="seqlen_truncate"
# cl_2nd_index_to_sample_path="dummy"
# cl_2nd_index_to_metric_path="dummy"
# cl_2nd_difficulty_type="value"
# cl_2nd_clustering_type="single_cluster"
# cl_2nd_min=80
# cl_2nd_max=2048
# cl_2nd_total_step=73000
# cl_2nd_difficulty_step=8
# cl_2nd_root=1
# bash ds_pretrain_gpt_1.3B_dense_base_script.sh ${lr} \
# ${train_tokens_in_billion} ${ltd_enabled} ${ltd_start} ${ltd_step} \
# ${cl_enabled} ${cl_num_metric} ${cl_1st_metric} \
# ${cl_1st_index_to_sample_path} ${cl_1st_index_to_metric_path} \
# ${cl_1st_difficulty_type} ${cl_1st_clustering_type} ${cl_1st_min} \
# ${cl_1st_max} ${cl_1st_total_step} ${cl_1st_difficulty_step} \
# ${cl_1st_root} ${cl_2nd_metric} ${cl_2nd_index_to_sample_path} \
# ${cl_2nd_index_to_metric_path} ${cl_2nd_difficulty_type} \
# ${cl_2nd_clustering_type} ${cl_2nd_min} ${cl_2nd_max} \
# ${cl_2nd_total_step} ${cl_2nd_difficulty_step} ${cl_2nd_root}
###############################################################################
## CL vocab rarity + seqlen truncation 150B tokens (50%):
# lr=4.0e-4
# train_tokens_in_billion=150
# ltd_enabled="false"
# ltd_start=2048
# ltd_step=1
# cl_enabled="true"
# cl_num_metric=2
# cl_1st_metric="voc"
# cl_1st_index_to_sample_path="/blob/users/conglli/data/analysis_pile_gpt_1epoch/vocab_rarity/vocab_rarity_index_to_sample_percentile_merged"
# cl_1st_index_to_metric_path="/blob/users/conglli/data/analysis_pile_gpt_1epoch/vocab_rarity/vocab_rarity_index_to_metric"
# cl_1st_difficulty_type="percentile"
# cl_1st_clustering_type="schedule_based"
# cl_1st_min=1
# cl_1st_max=100
# cl_1st_total_step=55000
# cl_1st_difficulty_step=1
# cl_1st_root=2
# cl_2nd_metric="seqlen_truncate"
# cl_2nd_index_to_sample_path="dummy"
# cl_2nd_index_to_metric_path="dummy"
# cl_2nd_difficulty_type="value"
# cl_2nd_clustering_type="single_cluster"
# cl_2nd_min=80
# cl_2nd_max=2048
# cl_2nd_total_step=55000
# cl_2nd_difficulty_step=8
# cl_2nd_root=1
# bash ds_pretrain_gpt_1.3B_dense_base_script.sh ${lr} \
# ${train_tokens_in_billion} ${ltd_enabled} ${ltd_start} ${ltd_step} \
# ${cl_enabled} ${cl_num_metric} ${cl_1st_metric} \
# ${cl_1st_index_to_sample_path} ${cl_1st_index_to_metric_path} \
# ${cl_1st_difficulty_type} ${cl_1st_clustering_type} ${cl_1st_min} \
# ${cl_1st_max} ${cl_1st_total_step} ${cl_1st_difficulty_step} \
# ${cl_1st_root} ${cl_2nd_metric} ${cl_2nd_index_to_sample_path} \
# ${cl_2nd_index_to_metric_path} ${cl_2nd_difficulty_type} \
# ${cl_2nd_clustering_type} ${cl_2nd_min} ${cl_2nd_max} \
# ${cl_2nd_total_step} ${cl_2nd_difficulty_step} ${cl_2nd_root}
###############################################################################
## CL vocab rarity + seqlen reshape 300B tokens (100%):
# lr=2.0e-4
# train_tokens_in_billion=300
# ltd_enabled="false"
# ltd_start=2048
# ltd_step=1
# cl_enabled="true"
# cl_num_metric=2
# cl_1st_metric="voc"
# cl_1st_index_to_sample_path="/blob/users/conglli/data/analysis_pile_gpt_1epoch/vocab_rarity/vocab_rarity_index_to_sample_percentile_merged"
# cl_1st_index_to_metric_path="/blob/users/conglli/data/analysis_pile_gpt_1epoch/vocab_rarity/vocab_rarity_index_to_metric"
# cl_1st_difficulty_type="percentile"
# cl_1st_clustering_type="schedule_based"
# cl_1st_min=1
# cl_1st_max=100
# cl_1st_total_step=110000
# cl_1st_difficulty_step=1
# cl_1st_root=2
# cl_2nd_metric="seqlen_reshape"
# cl_2nd_index_to_sample_path="dummy"
# cl_2nd_index_to_metric_path="dummy"
# cl_2nd_difficulty_type="value"
# cl_2nd_clustering_type="single_cluster"
# cl_2nd_min=80
# cl_2nd_max=2048
# cl_2nd_total_step=110000
# cl_2nd_difficulty_step=8
# cl_2nd_root=1
# bash ds_pretrain_gpt_1.3B_dense_base_script.sh ${lr} \
# ${train_tokens_in_billion} ${ltd_enabled} ${ltd_start} ${ltd_step} \
# ${cl_enabled} ${cl_num_metric} ${cl_1st_metric} \
# ${cl_1st_index_to_sample_path} ${cl_1st_index_to_metric_path} \
# ${cl_1st_difficulty_type} ${cl_1st_clustering_type} ${cl_1st_min} \
# ${cl_1st_max} ${cl_1st_total_step} ${cl_1st_difficulty_step} \
# ${cl_1st_root} ${cl_2nd_metric} ${cl_2nd_index_to_sample_path} \
# ${cl_2nd_index_to_metric_path} ${cl_2nd_difficulty_type} \
# ${cl_2nd_clustering_type} ${cl_2nd_min} ${cl_2nd_max} \
# ${cl_2nd_total_step} ${cl_2nd_difficulty_step} ${cl_2nd_root}
###############################################################################
## CL vocab rarity 300B tokens (100%):
# lr=2.0e-4
# train_tokens_in_billion=300
# ltd_enabled="false"
# ltd_start=2048
# ltd_step=1
# cl_enabled="true"
# cl_num_metric=1
# cl_1st_metric="voc"
# cl_1st_index_to_sample_path="/blob/users/conglli/data/analysis_pile_gpt_1epoch/vocab_rarity/vocab_rarity_index_to_sample_percentile_merged"
# cl_1st_index_to_metric_path="/blob/users/conglli/data/analysis_pile_gpt_1epoch/vocab_rarity/vocab_rarity_index_to_metric"
# cl_1st_difficulty_type="percentile"
# cl_1st_clustering_type="schedule_based"
# cl_1st_min=1
# cl_1st_max=100
# cl_1st_total_step=110000
# cl_1st_difficulty_step=1
# cl_1st_root=2
# bash ds_pretrain_gpt_1.3B_dense_base_script.sh ${lr} \
# ${train_tokens_in_billion} ${ltd_enabled} ${ltd_start} ${ltd_step} \
# ${cl_enabled} ${cl_num_metric} ${cl_1st_metric} \
# ${cl_1st_index_to_sample_path} ${cl_1st_index_to_metric_path} \
# ${cl_1st_difficulty_type} ${cl_1st_clustering_type} ${cl_1st_min} \
# ${cl_1st_max} ${cl_1st_total_step} ${cl_1st_difficulty_step} \
# ${cl_1st_root}
###############################################################################
## CL seqlen truncation 300B tokens (100%):
# lr=2.0e-4
# train_tokens_in_billion=300
# ltd_enabled="false"
# ltd_start=2048
# ltd_step=1
# cl_enabled="true"
# cl_num_metric=1
# cl_1st_metric="seqlen_truncate"
# cl_1st_index_to_sample_path="dummy"
# cl_1st_index_to_metric_path="dummy"
# cl_1st_difficulty_type="value"
# cl_1st_clustering_type="single_cluster"
# cl_1st_min=80
# cl_1st_max=2048
# cl_1st_total_step=110000
# cl_1st_difficulty_step=8
# cl_1st_root=1
# bash ds_pretrain_gpt_1.3B_dense_base_script.sh ${lr} \
# ${train_tokens_in_billion} ${ltd_enabled} ${ltd_start} ${ltd_step} \
# ${cl_enabled} ${cl_num_metric} ${cl_1st_metric} \
# ${cl_1st_index_to_sample_path} ${cl_1st_index_to_metric_path} \
# ${cl_1st_difficulty_type} ${cl_1st_clustering_type} ${cl_1st_min} \
# ${cl_1st_max} ${cl_1st_total_step} ${cl_1st_difficulty_step} \
# ${cl_1st_root}
###############################################################################
## CL seqlen reshape 300B tokens (100%):
# lr=2.0e-4
# train_tokens_in_billion=300
# ltd_enabled="false"
# ltd_start=2048
# ltd_step=1
# cl_enabled="true"
# cl_num_metric=1
# cl_1st_metric="seqlen_reshape"
# cl_1st_index_to_sample_path="dummy"
# cl_1st_index_to_metric_path="dummy"
# cl_1st_difficulty_type="value"
# cl_1st_clustering_type="single_cluster"
# cl_1st_min=80
# cl_1st_max=2048
# cl_1st_total_step=110000
# cl_1st_difficulty_step=8
# cl_1st_root=1
# bash ds_pretrain_gpt_1.3B_dense_base_script.sh ${lr} \
# ${train_tokens_in_billion} ${ltd_enabled} ${ltd_start} ${ltd_step} \
# ${cl_enabled} ${cl_num_metric} ${cl_1st_metric} \
# ${cl_1st_index_to_sample_path} ${cl_1st_index_to_metric_path} \
# ${cl_1st_difficulty_type} ${cl_1st_clustering_type} ${cl_1st_min} \
# ${cl_1st_max} ${cl_1st_total_step} ${cl_1st_difficulty_step} \
# ${cl_1st_root}
###############################################################################

View File

@ -0,0 +1,36 @@
#!/bin/bash
# Evaluate Natural Questions test data given Wikipedia embeddings and a
# pretrained ICT model.
# Datasets can be downloaded from the following link:
# https://github.com/facebookresearch/DPR/blob/master/data/download_data.py
EVIDENCE_DATA_DIR=<Specify path of Wikipedia dataset>
EMBEDDING_PATH=<Specify path of the embeddings>
CHECKPOINT_PATH=<Specify path of pretrained ICT model>
QA_FILE=<Path of the natural question test dataset>
python tasks/main.py \
--task ICT-ZEROSHOT-NQ \
--tokenizer-type BertWordPieceLowerCase \
--num-layers 12 \
--hidden-size 768 \
--num-attention-heads 12 \
--tensor-model-parallel-size 1 \
--micro-batch-size 128 \
--checkpoint-activations \
--seq-length 512 \
--max-position-embeddings 512 \
--load ${CHECKPOINT_PATH} \
--evidence-data-path ${EVIDENCE_DATA_DIR} \
--embedding-path ${EMBEDDING_PATH} \
--retriever-seq-length 256 \
--vocab-file bert-vocab.txt \
--qa-data-test ${QA_FILE} \
--num-workers 2 \
--faiss-use-gpu \
--retriever-report-topk-accuracies 1 5 20 100 \
--fp16

View File

@ -0,0 +1,38 @@
#!/bin/bash
WORLD_SIZE=8
DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
TASK="LAMBADA"
VALID_DATA=<lambada path>
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT=checkpoints/gpt2_345m
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/main.py \
--task $TASK \
--valid-data $VALID_DATA \
--tokenizer-type GPT2BPETokenizer \
--strict-lambada \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--load $CHECKPOINT \
--tensor-model-parallel-size 1 \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--batch-size 8 \
--checkpoint-activations \
--seq-length 1024 \
--max-position-embeddings 1024 \
--log-interval 10 \
--fp16 \
--no-load-optim \
--no-load-rng

View File

@ -0,0 +1,44 @@
#!/bin/bash
WORLD_SIZE=8
DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
TRAIN_DATA="data/glue_data/MNLI/train.tsv"
VALID_DATA="data/glue_data/MNLI/dev_matched.tsv \
data/glue_data/MNLI/dev_mismatched.tsv"
PRETRAINED_CHECKPOINT=checkpoints/bert_345m
VOCAB_FILE=bert-vocab.txt
CHECKPOINT_PATH=checkpoints/bert_345m_mnli
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/main.py \
--task MNLI \
--seed 1234 \
--train-data $TRAIN_DATA \
--valid-data $VALID_DATA \
--tokenizer-type BertWordPieceLowerCase \
--vocab-file $VOCAB_FILE \
--epochs 5 \
--pretrained-checkpoint $PRETRAINED_CHECKPOINT \
--tensor-model-parallel-size 1 \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--micro-batch-size 8 \
--checkpoint-activations \
--lr 5.0e-5 \
--lr-decay-style linear \
--lr-warmup-fraction 0.065 \
--seq-length 512 \
--max-position-embeddings 512 \
--save-interval 500000 \
--save $CHECKPOINT_PATH \
--log-interval 10 \
--eval-interval 100 \
--eval-iters 50 \
--weight-decay 1.0e-1 \
--fp16

View File

@ -0,0 +1,47 @@
#!/bin/bash
WORLD_SIZE=8
DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
TRAIN_DATA="data/RACE/train/middle"
VALID_DATA="data/RACE/dev/middle \
data/RACE/dev/high"
VOCAB_FILE=bert-vocab.txt
PRETRAINED_CHECKPOINT=checkpoints/bert_345m
CHECKPOINT_PATH=checkpoints/bert_345m_race
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/main.py \
--task RACE \
--seed 1234 \
--train-data $TRAIN_DATA \
--valid-data $VALID_DATA \
--tokenizer-type BertWordPieceLowerCase \
--vocab-file $VOCAB_FILE \
--epochs 3 \
--pretrained-checkpoint $PRETRAINED_CHECKPOINT \
--tensor-model-parallel-size 1 \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--micro-batch-size 4 \
--checkpoint-activations \
--lr 1.0e-5 \
--lr-decay-style linear \
--lr-warmup-fraction 0.06 \
--seq-length 512 \
--max-position-embeddings 512 \
--save-interval 100000 \
--save $CHECKPOINT_PATH \
--log-interval 10 \
--eval-interval 100 \
--eval-iters 50 \
--weight-decay 1.0e-1 \
--clip-grad 1.0 \
--hidden-dropout 0.1 \
--attention-dropout 0.1 \
--fp16

49
examples/generate_text.sh Executable file
View File

@ -0,0 +1,49 @@
#!/bin/bash
export TORCH_CUDA_ARCH_LIST=8.6+PTX
CHECKPOINT_PATH=checkpoints/gpt2_345m
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
b=8
mp=1
experts=1
nodes=1
gpus=1
use_tutel=""
#use_tutel="--use-tutel"
#ds_inference=""
ds_inference="--ds-inference"
launch_cmd="deepspeed --num_nodes $nodes --num_gpus $gpus"
L=24
H=1024
A=16
#experts1=${experts[$k]}
program_cmd="tools/generate_samples_gpt.py \
--tensor-model-parallel-size $mp \
--num-layers $L \
--hidden-size $H \
--num-attention-heads $A \
--max-position-embeddings 1024 \
--tokenizer-type GPT2BPETokenizer \
--fp16 \
--num-experts ${experts} \
--mlp-type standard \
--micro-batch-size $b \
--seq-length 1024 \
--out-seq-length 1024 \
--temperature 1.0 \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--genfile unconditional_samples.json \
--top_p 0.9 \
--log-interval 1 \
--num-samples 0 \
--load $CHECKPOINT_PATH \
$use_tutel $ds_inference"
echo $launch_cmd $program_cmd
$launch_cmd $program_cmd

18
examples/merge_mp_bert.sh Executable file
View File

@ -0,0 +1,18 @@
#!/bin/bash
TENSOR_MODEL_PARALLEL_SIZE=2
VOCAB_FILE=bert-vocab.txt
CHECKPOINT_PATH=checkpoints/bert_345m
WORLD_SIZE=$TENSOR_MODEL_PARALLEL_SIZE python tools/merge_mp_partitions.py \
--model-type BERT \
--tensor-model-parallel-size $TENSOR_MODEL_PARALLEL_SIZE \
--tokenizer-type BertWordPieceLowerCase \
--vocab-file $VOCAB_FILE \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 512 \
--max-position-embeddings 512 \
--load $CHECKPOINT_PATH

34
examples/pretrain_bert.sh Executable file
View File

@ -0,0 +1,34 @@
#!/bin/bash
RANK=0
WORLD_SIZE=1
DATA_PATH=<Specify path and file prefix>_text_sentence
CHECKPOINT_PATH=<Specify path>
python pretrain_bert.py \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--micro-batch-size 4 \
--global-batch-size 8 \
--seq-length 512 \
--max-position-embeddings 512 \
--train-iters 2000000 \
--lr-decay-iters 990000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file bert-vocab.txt \
--data-impl mmap \
--split 949,50,1 \
--lr 0.0001 \
--min-lr 0.00001 \
--lr-decay-style linear \
--lr-warmup-fraction .01 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--log-interval 100 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--fp16

View File

@ -0,0 +1,44 @@
#!/bin/bash
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
DATA_PATH=<Specify path and file prefix>_text_sentence
CHECKPOINT_PATH=<Specify path>
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_bert.py \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--micro-batch-size 4 \
--global-batch-size 32 \
--seq-length 512 \
--max-position-embeddings 512 \
--train-iters 1000000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file bert-vocab.txt \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 0.0001 \
--lr-decay-style linear \
--min-lr 1.0e-5 \
--lr-decay-iters 990000 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-fraction .01 \
--log-interval 100 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--fp16

View File

@ -0,0 +1,46 @@
#!/bin/bash
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
DATA_PATH=<Specify path and file prefix>_text_sentence
VOCAB_FILE=<Specify path to vocab.txt>
CHECKPOINT_PATH=<Specify path>
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_bert.py \
--tensor-model-parallel-size 2 \
--pipeline-model-parallel-size 2 \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--micro-batch-size 2 \
--global-batch-size 16 \
--max-position-embeddings 512 \
--train-iters 1000000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file $VOCAB_FILE \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 0.0001 \
--lr-decay-style linear \
--min-lr 1.0e-5 \
--lr-decay-iters 990000 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-fraction .01 \
--log-interval 100 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--fp16

41
examples/pretrain_gpt.sh Executable file
View File

@ -0,0 +1,41 @@
#! /bin/bash
# Runs the "345M" parameter model
RANK=0
WORLD_SIZE=1
DATA_PATH=<Specify path and file prefix>_text_document
CHECKPOINT_PATH=<Specify path>
python pretrain_gpt.py \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--micro-batch-size 4 \
--global-batch-size 8 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--train-iters 500000 \
--lr-decay-iters 320000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file gpt2-vocab.json \
--merge-file gpt2-merges.txt \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 0.00015 \
--min-lr 1.0e-5 \
--lr-decay-style cosine \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-fraction .01 \
--checkpoint-activations \
--log-interval 100 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--fp16

65
examples/pretrain_gpt3_175B.sh Executable file
View File

@ -0,0 +1,65 @@
#!/bin/bash
#SBATCH <SLURM OPTIONS> --nodes=128 --exclusive --ntasks-per-node=8 --job-name=megatron_gpt3_175b
DIR=`pwd`
DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'`
mkdir -p $DIR/logs
DATASET_1="<PATH TO THE FIRST DATASET>"
DATASET_2="<PATH TO THE SECOND DATASET>"
DATASET_3="<PATH TO THE THIRD DATASET>"
DATASET="0.2 ${DATASET_1} 0.3 ${DATASET_2} 0.5 ${DATASET_3}"
options=" \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 16 \
--num-layers 96 \
--hidden-size 12288 \
--num-attention-heads 96 \
--seq-length 2048 \
--max-position-embeddings 2048 \
--micro-batch-size 1 \
--global-batch-size 1536 \
--rampup-batch-size 16 16 5859375 \
--train-samples 146484375 \
--lr-decay-samples 126953125 \
--lr-warmup-samples 183105 \
--lr 6.0e-5 \
--min-lr 6.0e-6 \
--lr-decay-style cosine \
--log-interval 10 \
--eval-iters 40 \
--eval-interval 1000 \
--data-path ${DATASET} \
--vocab-file <PATH TO gpt-vocab.json> \
--merge-file <PATH TO gpt-merges.txt> \
--save-interval 1000 \
--save <PATH TO CHECKPOINTS DIRECTORY> \
--load <PATH TO CHECKPOINTS DIRECTORY> \
--split 98,2,0 \
--clip-grad 1.0 \
--weight-decay 0.1 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--init-method-std 0.006 \
--tensorboard-dir <TENSORBOARD DIRECTORY> \
--fp16 \
--checkpoint-activations "
run_cmd="python -u ${DIR}/pretrain_gpt.py $@ ${options}"
srun -l \
--container-image "nvcr.io/nvidia/pytorch:20.12-py3" \
--container-mounts "<DIRECTORIES TO MOUNT>" \
--output=$DIR/logs/%x_%j_$DATETIME.log sh -c "${run_cmd}"
set +x

View File

@ -0,0 +1,48 @@
#! /bin/bash
# Runs the "345M" parameter model
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
DATA_PATH=<Specify path and file prefix>_text_document
CHECKPOINT_PATH=<Specify path>
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_gpt.py \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--micro-batch-size 8 \
--global-batch-size 64 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--train-iters 500000 \
--lr-decay-iters 320000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file gpt2-vocab.json \
--merge-file gpt2-merges.txt \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 0.00015 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-fraction .01 \
--checkpoint-activations \
--log-interval 100 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--fp16

View File

@ -0,0 +1,50 @@
#! /bin/bash
# Runs the "345M" parameter model
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
DATA_PATH=<Specify path and file prefix>_text_document
CHECKPOINT_PATH=<Specify path>
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_gpt.py \
--tensor-model-parallel-size 2 \
--pipeline-model-parallel-size 2 \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--micro-batch-size 4 \
--global-batch-size 16 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--train-iters 500000 \
--lr-decay-iters 320000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file gpt2-vocab.json \
--merge-file gpt2-merges.txt \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 0.00015 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-fraction .01 \
--checkpoint-activations \
--log-interval 100 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--fp16

44
examples/pretrain_ict.sh Executable file
View File

@ -0,0 +1,44 @@
#! /bin/bash
# Runs the "217M" parameter biencoder model for ICT retriever
RANK=0
WORLD_SIZE=1
PRETRAINED_BERT_PATH=<Specify path of pretrained BERT model>
TEXT_DATA_PATH=<Specify path and file prefix of the text data>
TITLE_DATA_PATH=<Specify path and file prefix of the titles>
CHECKPOINT_PATH=<Specify path>
python pretrain_ict.py \
--num-layers 12 \
--hidden-size 768 \
--num-attention-heads 12 \
--tensor-model-parallel-size 1 \
--micro-batch-size 32 \
--seq-length 256 \
--max-position-embeddings 512 \
--train-iters 100000 \
--vocab-file bert-vocab.txt \
--tokenizer-type BertWordPieceLowerCase \
--DDP-impl torch \
--bert-load ${PRETRAINED_BERT_PATH} \
--log-interval 100 \
--eval-interval 1000 \
--eval-iters 10 \
--retriever-report-topk-accuracies 1 5 10 20 100 \
--retriever-score-scaling \
--load $CHECKPOINT_PATH \
--save $CHECKPOINT_PATH \
--data-path ${TEXT_DATA_PATH} \
--titles-data-path ${TITLE_DATA_PATH} \
--lr 0.0001 \
--lr-decay-style linear \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-fraction 0.01 \
--save-interval 4000 \
--exit-interval 8000 \
--query-in-block-prob 0.1 \
--fp16

38
examples/pretrain_t5.sh Normal file
View File

@ -0,0 +1,38 @@
#!/bin/bash
RANK=0
WORLD_SIZE=1
DATA_PATH=<Specify path and file prefix>
VOCAB_FILE=<Specify path to vocab.txt>
CHECKPOINT_PATH=<Specify path>
python pretrain_t5.py \
--num-layers 12 \
--hidden-size 768 \
--num-attention-heads 12 \
--kv-channels 64 \
--ffn-hidden-size 3072 \
--encoder-seq-length 512 \
--decoder-seq-length 128 \
--micro-batch-size 16 \
--global-batch-size 2048 \
--max-position-embeddings 512 \
--train-iters 1000000 \
--lr-decay-iters 1000000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file $VOCAB_FILE \
--data-impl mmap \
--split 949,50,1 \
--lr 0.0001 \
--min-lr 0.00001 \
--lr-decay-style linear \
--lr-warmup-fraction .01 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--log-interval 100 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--fp16

View File

@ -0,0 +1,47 @@
#!/bin/bash
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
DATA_PATH=<Specify path and file prefix>
VOCAB_FILE=<Specify path to vocab.txt>
CHECKPOINT_PATH=<Specify path>
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_t5.py \
--num-layers 12 \
--hidden-size 768 \
--num-attention-heads 12 \
--kv-channels 64 \
--ffn-hidden-size 3072 \
--encoder-seq-length 512 \
--decoder-seq-length 128 \
--micro-batch-size 16 \
--global-batch-size 2048 \
--max-position-embeddings 512 \
--train-iters 1000000 \
--lr-decay-iters 1000000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file $VOCAB_FILE \
--data-impl mmap \
--split 949,50,1 \
--lr 0.0001 \
--min-lr 0.00001 \
--lr-decay-style linear \
--lr-warmup-fraction .01 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--log-interval 100 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--fp16

Some files were not shown because too many files have changed in this diff.