Mirror of https://gitee.com/ascend/ModelLink.git (synced 2024-12-01 19:39:02 +08:00)

Commit 37cc0b949d (parent 106a415556): change megatron to ascendspeed

README.md (33 changes)
@@ -1,25 +1,12 @@
 ## AscendSpeed

-DeepSpeed version of NVIDIA's Megatron-LM that adds additional support for several features such as MoE model training, Curriculum Learning, 3D Parallelism, and others. The ```AscendSpeed/examples/``` folder includes example scripts about the features supported by DeepSpeed.
+DeepSpeed version that adds support for several features such as MoE model training, Curriculum Learning, 3D Parallelism, and others. The ```AscendSpeed/examples/``` folder includes example scripts for the features supported by DeepSpeed.

 ### Run on Azure and AzureML

-To try out DeepSpeed on Azure, this fork of Megatron offers easy-to-use recipes and bash scripts. We strongly recommend to start with AzureML recipe in the ```examples/azureml``` folder. If you have a custom infrastructure (e.g. HPC clusters) or Azure VM based environment, please refer to the bash scripts in the ```examples/azure``` folder.
+To try out DeepSpeed on Azure, this fork of AscendSpeed offers easy-to-use recipes and bash scripts. We strongly recommend starting with the AzureML recipe in the ```examples/azureml``` folder. If you have custom infrastructure (e.g. HPC clusters) or an Azure VM based environment, please refer to the bash scripts in the ```examples/azure``` folder.

-Below is Megatron-LM's original README:
+Below is AscendSpeed's original README:

 ------

 Megatron ([1](https://arxiv.org/pdf/1909.08053.pdf) and [2](https://arxiv.org/pdf/2104.04473.pdf)) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. This repository is for ongoing research on training large transformer language models at scale. We developed efficient, model-parallel (tensor and pipeline), and multi-node pre-training of transformer based models such as [GPT](https://arxiv.org/abs/2005.14165), [BERT](https://arxiv.org/pdf/1810.04805.pdf), and [T5](https://arxiv.org/abs/1910.10683) using mixed precision.

 Below are some of the projects where we have directly used Megatron:
 * [BERT and GPT Studies Using Megatron](https://arxiv.org/pdf/1909.08053.pdf)
 * [BioMegatron: Larger Biomedical Domain Language Model](https://www.aclweb.org/anthology/2020.emnlp-main.379.pdf)
 * [End-to-End Training of Neural Retrievers for Open-Domain Question Answering](https://arxiv.org/abs/2101.00408)
 * [Large Scale Multi-Actor Generative Dialog Modeling](https://www.aclweb.org/anthology/2020.acl-main.8.pdf)
 * [Local Knowledge Powered Conversational Agents](https://arxiv.org/abs/2010.10150)
 * [MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models](https://www.aclweb.org/anthology/2020.emnlp-main.226.pdf)
 * [RACE Reading Comprehension Dataset Leaderboard](http://www.qizhexie.com/data/RACE_leaderboard.html)
 * [Scaling Language Model Training to a Trillion Parameters Using Megatron](https://arxiv.org/pdf/2104.04473.pdf)
 * [Training Question Answering Models From Synthetic Data](https://www.aclweb.org/anthology/2020.emnlp-main.468.pdf)

 Our codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. We vary hidden size, number of attention heads, and number of layers to arrive at a specifc model size. As the model size increases, we also modestly increase the batch size. We leverage [NVIDIA's Selene supercomputer](https://www.top500.org/system/179842/) to perform scaling studies and use up to 3072 [A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for the largest model. The table below shows the model configurations along with the achieved FLOPs (both per GPU and aggregate over all GPUs). Note that the FLOPs are measured for end-to-end training, i.e., includes all operations including data loading, optimization, and even logging.

 ![Cases](images/cases_april2021.png)

@@ -52,14 +39,14 @@ All the cases from 1 billion to 1 trillion parameters achieve more than 43% half
 * [Collecting GPT Webtext Data](#collecting-gpt-webtext-data)

 # Setup

-We have tested Megatron with [NGC's PyTorch container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) version 20.12, which uses python 3.8, pytorch 1.8, cuda 11.1, and nccl 2.8.3.
+We have tested with [NGC's PyTorch container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) version 20.12, which uses python 3.8, pytorch 1.8, cuda 11.1, and nccl 2.8.3.

 To use this repository, please install the latest supported versions of PyTorch with GPU support (python 3.8, pytorch 1.8, cuda 11.1, and nccl 2.8.3 and above) and NVIDIA [APEX](https://github.com/NVIDIA/apex#quick-start). We strongly recommend using one of [NGC's recent PyTorch containers](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) (the latest compatible version at time of publication can be pulled with `docker pull nvcr.io/nvidia/pytorch:20.12-py3`). Data preprocessing requires [NLTK](https://www.nltk.org/install.html), though this is not required for training, evaluation, or downstream tasks.

 <!--
-To use megatron you can either clone the repo or install it via pip (make sure python3-dev is installed):
+To use ascendspeed you can either clone the repo or install it via pip (make sure python3-dev is installed):
 <pre>
-pip install megatron-lm
+pip install ascendspeed
 </pre>
 -->
@@ -176,7 +163,7 @@ python pretrain_bert.py \
        --data-path $DATA_PATH
 </pre>

-Further command line arguments are described in the source file [`arguments.py`](./megatron/arguments.py).
+Further command line arguments are described in the source file [`arguments.py`](./ascendspeed/arguments.py).

 ## GPT Pretraining

@@ -217,7 +204,7 @@ python pretrain_gpt.py \
        --data-path $DATA_PATH \
 </pre>

-Further command line arguments are described in the source file [`arguments.py`](./megatron/arguments.py).
+Further command line arguments are described in the source file [`arguments.py`](./ascendspeed/arguments.py).

 ## T5 Pretraining

@@ -311,7 +298,7 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_<model>.py \

 ## GPT-3 Example

-In `examples/pretrain_gpt3_175B.sh` we have provided an example of how to configure Megatron to run [GPT-3](https://arxiv.org/abs/2005.14165) with 175 billion parameters on 1024 GPUs. The script is designed for [slurm](https://slurm.schedmd.com/documentation.html) with [pyxis](https://github.com/NVIDIA/pyxis) plugin but can be easily adopted to any other scheduler. It uses 8-way and 16-way tensor and pipeline parallelism, respectively. With options `global-batch-size 1536` and `rampup-batch-size 16 16 5859375`, the training will start with global batch size 16 and linearly increase the global batch size to 1536 over 5,859,375 samples with incrmeental steps 16. The training dataset can be either a single set or a multiple datasets combined with a set of weights.
+In `examples/pretrain_gpt3_175B.sh` we have provided an example of how to configure AscendSpeed to run [GPT-3](https://arxiv.org/abs/2005.14165) with 175 billion parameters on 1024 GPUs. The script is designed for [slurm](https://slurm.schedmd.com/documentation.html) with the [pyxis](https://github.com/NVIDIA/pyxis) plugin but can easily be adapted to any other scheduler. It uses 8-way and 16-way tensor and pipeline parallelism, respectively. With options `global-batch-size 1536` and `rampup-batch-size 16 16 5859375`, training will start with a global batch size of 16 and linearly increase it to 1536 over 5,859,375 samples in increments of 16. The training dataset can be either a single dataset or multiple datasets combined with a set of weights.

 With full global batch size of 1536 on 1024 A100 GPUs, each iteration takes around 32 seconds resulting in 138 teraFLOPs per GPU which is 44% of the theoretical peak FLOPs.
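Reviewer's note: the ramp that `rampup-batch-size 16 16 5859375` describes is easy to sketch. The helper below is illustrative only; the function name and the integer rounding are mine, not AscendSpeed's actual implementation:

```python
# Sketch of the linear batch-size ramp implied by
# `--rampup-batch-size <start> <increment> <ramp-samples>` together with
# `--global-batch-size 1536`. Illustrative names, not AscendSpeed's API.

def rampup_global_batch_size(consumed_samples, start=16, increment=16,
                             ramp_samples=5_859_375, final=1536):
    """Return the global batch size in effect after `consumed_samples` samples."""
    steps = (final - start) // increment        # number of increments (95 here)
    samples_per_step = ramp_samples // steps    # samples consumed per increment
    completed = consumed_samples // samples_per_step
    return min(start + completed * increment, final)

print(rampup_global_batch_size(0))            # 16
print(rampup_global_batch_size(3_000_000))    # part-way up the ramp (784)
print(rampup_global_batch_size(6_000_000))    # 1536, ramp finished
```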
@@ -336,7 +323,7 @@ python preprocess_data.py \
        --workers 5 # works well for 10 CPU cores. Scale up accordingly.
 </pre>

-2. Use a custom samples mapping function in place of `megatron/data/realm_dataset_utils.get_block_samples_mapping` if required. To do this, you will need to implement a new function in C++ inside of `megatron/data/helpers.cpp`. The samples mapping data structure is used to select the data that will constitute every training sample in advance of the training loop.
+2. Use a custom samples mapping function in place of `ascendspeed/data/realm_dataset_utils.get_block_samples_mapping` if required. To do this, you will need to implement a new function in C++ inside of `ascendspeed/data/helpers.cpp`. The samples mapping data structure is used to select the data that will constitute every training sample in advance of the training loop.
    The samples mapping is responsible for holding all of the required metadata needed to construct the sample from one or more indexed datasets. In REALM, the samples mapping contains the start and end sentence indices, as well as the document index (to find the correct title for a body) and a unique ID for every block.
 3. Pretrain a BERT language model using `pretrain_bert.py`, with the sequence length equal to the block size in token ids. This model should be trained on the same indexed dataset that is used to supply the blocks for the information retrieval task.
    In REALM, this is an uncased bert base model trained with the standard hyperparameters.
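Reviewer's note: the metadata the README describes for one REALM samples-mapping entry can be pictured as below. The real mapping is built in C++ (`helpers.build_blocks_mapping`) and returned as a flat integer array, so this named structure is for exposition only:

```python
# Illustrative shape of one block samples-mapping entry; field names are mine.
from dataclasses import dataclass

@dataclass
class BlockSampleEntry:
    start_idx: int   # index of the first sentence in the block
    end_idx: int     # index one past the last sentence in the block
    doc_idx: int     # document index, used to look up the matching title
    block_idx: int   # unique ID for the block across the dataset
```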
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-"""Megatron arguments."""
+"""ascendspeed arguments."""

 import argparse
 import os
@@ -21,10 +21,11 @@ import os
 import torch
 import deepspeed


 def parse_args(extra_args_provider=None, defaults={},
                ignore_unknown_args=False):
     """Parse all arguments."""
-    parser = argparse.ArgumentParser(description='Megatron-LM Arguments',
+    parser = argparse.ArgumentParser(description='AscendSpeed Arguments',
                                      allow_abbrev=False)

     # Standard arguments.
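Reviewer's note: the `extra_args_provider` hook in this signature is a callable that receives the parser and may register application-specific flags before parsing. A minimal sketch, with an assumed group title and flag name:

```python
# Hedged sketch of an extra_args_provider callback; the group title and
# flag below are hypothetical examples, not flags AscendSpeed defines.
import argparse

def my_extra_args(parser):
    group = parser.add_argument_group(title='my application')
    group.add_argument('--my-flag', type=int, default=0,
                       help='Example application-specific argument.')
    return parser

# Usage (assuming parse_args from the hunk above is importable):
# args = parse_args(extra_args_provider=my_extra_args)
```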
@@ -22,7 +22,8 @@ import numpy as np
 from deepspeed.accelerator import get_accelerator
 import torch

-from megatron import (get_args,
+from ascendspeed import (get_args,
                       is_rank_0,
                       mpu,
                       print_rank_0,
@@ -161,7 +162,7 @@ def save_checkpoint(iteration, model, optimizer, lr_scheduler):
         torch.save(state_dict, checkpoint_name)

     if args.deepspeed:
-        # megatron model uses state_dict_for_save_checkpointing instead of the standard state_dict
+        # ascendspeed model uses state_dict_for_save_checkpointing instead of the standard state_dict
         # state_dict is used by deepspeed for module saving so it needs to point to the right function
         if args.no_pipeline_parallel:
             original_state_dict = model[0].module.state_dict
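Reviewer's note: the comment in this hunk describes a pointer swap. DeepSpeed calls `module.state_dict()` when it serializes a module, so the attribute is temporarily redirected to the checkpoint-specific variant. A sketch of the pattern, using the method name from the comment (the restore step and the `save_checkpoint` call shape are assumptions):

```python
def deepspeed_save_with_custom_state_dict(model, save_dir, client_state):
    """Sketch only: point module.state_dict at the checkpoint-specific
    variant so DeepSpeed serializes the right tensors, then restore it."""
    module = model[0].module
    original = module.state_dict
    # Method name taken from the comment above; assumed to exist on the module.
    module.state_dict = module.state_dict_for_save_checkpointing
    try:
        model[0].save_checkpoint(save_dir, client_state=client_state)
    finally:
        module.state_dict = original  # always restore the standard method
```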
@@ -329,16 +330,16 @@ def load_checkpoint(model, optimizer, lr_scheduler, load_arg='load', strict=True
     try:
         state_dict = torch.load(checkpoint_name, map_location='cpu')
     except ModuleNotFoundError:
-        from megatron.fp16_deprecated import loss_scaler
+        from ascendspeed.fp16_deprecated import loss_scaler
         # For backward compatibility.
         print_rank_0(' > deserializing using the old code structure ...')
         sys.modules['fp16.loss_scaler'] = sys.modules[
-            'megatron.fp16_deprecated.loss_scaler']
-        sys.modules['megatron.fp16.loss_scaler'] = sys.modules[
-            'megatron.fp16_deprecated.loss_scaler']
+            'ascendspeed.fp16_deprecated.loss_scaler']
+        sys.modules['ascendspeed.fp16.loss_scaler'] = sys.modules[
+            'ascendspeed.fp16_deprecated.loss_scaler']
         state_dict = torch.load(checkpoint_name, map_location='cpu')
         sys.modules.pop('fp16.loss_scaler', None)
-        sys.modules.pop('megatron.fp16.loss_scaler', None)
+        sys.modules.pop('ascendspeed.fp16.loss_scaler', None)
     except BaseException as e:
         print_rank_0('could not load the checkpoint')
         print_rank_0(e)
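Reviewer's note: the hunk above is the standard trick for unpickling checkpoints whose classes have since moved. Old checkpoints were pickled when `loss_scaler` lived under other module paths, so those paths are aliased to the current module before `torch.load`, then removed again. The same logic, factored as a helper (a sketch mirroring the excerpt, with the module paths taken from it):

```python
import sys
import importlib

import torch

def load_legacy_checkpoint(path,
                           current_mod='ascendspeed.fp16_deprecated.loss_scaler',
                           old_names=('fp16.loss_scaler',
                                      'ascendspeed.fp16.loss_scaler')):
    """Alias retired module paths to the current one so pickled references
    in old checkpoints resolve, then clean the aliases up afterwards."""
    importlib.import_module(current_mod)   # ensure the module is registered
    for name in old_names:
        sys.modules[name] = sys.modules[current_mod]
    try:
        return torch.load(path, map_location='cpu')
    finally:
        for name in old_names:
            sys.modules.pop(name, None)
```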
@@ -18,13 +18,13 @@
 import numpy as np
 import torch

-from megatron import (
+from ascendspeed import (
     get_args,
     get_tokenizer,
     mpu,
     print_rank_0
 )
-from megatron.data.dataset_utils import (
+from ascendspeed.data.dataset_utils import (
     get_samples_mapping,
     get_a_and_b_segments,
     truncate_segments,
@@ -4,10 +4,10 @@ import time
 import numpy as np
 import torch

-from megatron import get_args, get_tokenizer, mpu, print_rank_0
-from megatron.data.dataset_utils import create_masked_lm_predictions, \
+from ascendspeed import get_args, get_tokenizer, mpu, print_rank_0
+from ascendspeed.data.dataset_utils import create_masked_lm_predictions, \
     pad_and_convert_to_numpy
-from megatron.data.data_samplers import MegatronPretrainingSampler
+from ascendspeed.data.data_samplers import MegatronPretrainingSampler
 from deepspeed.accelerator import get_accelerator

 def make_attention_mask(source_block, target_block):
     """
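Reviewer's note: the body of `make_attention_mask` is cut off by the hunk boundary. A plausible reading, given the name and arguments, is a 2D mask that is 1 wherever both the source and target positions hold real (non-padding) tokens; the body below is a guess for exposition, not a quote of the file:

```python
import numpy as np

def make_attention_mask(source_block, target_block):
    """Mask attention so only real-token (id >= 1) pairs attend. Sketch only."""
    mask = (target_block[None, :] >= 1) * (source_block[:, None] >= 1)
    return mask.astype(np.int64)   # shape: (len(source), len(target))
```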
@@ -28,7 +28,7 @@ def get_one_epoch_dataloader(dataset, micro_batch_size=None):
         micro_batch_size = args.micro_batch_size
     num_workers = args.num_workers

-    # Use megatron's sampler with consumed samples set to 0 as
+    # Use ascendspeed's sampler with consumed samples set to 0 as
     # this is only for evaluation and don't intend to resume half way.
     # Also, set the drop last to false as don't intend to remove
     # the last batch
@@ -162,7 +162,7 @@ def get_block_samples_mapping(block_dataset, title_dataset, data_prefix, num_epo
         print_rank_0(' > building samples index mapping for {} ...'.format(
             name))

-        from megatron.data import helpers
+        from ascendspeed.data import helpers
         mapping_array = helpers.build_blocks_mapping(
             block_dataset.doc_idx,
             block_dataset.sizes,
@@ -20,8 +20,8 @@ import time
 import numpy as np
 import torch

-from megatron import print_rank_0
-from megatron import mpu
+from ascendspeed import print_rank_0
+from ascendspeed import mpu


 class BlendableDataset(torch.utils.data.Dataset):
@@ -49,7 +49,7 @@ class BlendableDataset(torch.utils.data.Dataset):
         self.dataset_index = np.zeros(self.size, dtype=np.uint8)
         self.dataset_sample_index = np.zeros(self.size, dtype=np.int64)

-        from megatron.data import helpers
+        from ascendspeed.data import helpers
         helpers.build_blending_indices(self.dataset_index,
                                        self.dataset_sample_index,
                                        weights, num_datasets, self.size,
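Reviewer's note: `helpers.build_blending_indices` is a C++ routine, but the idea is simple: for each output sample, pick the dataset whose realized sampling ratio lags its target weight the most. A pure-Python sketch of that greedy balancing (my reading of the algorithm, for exposition only):

```python
import numpy as np

def build_blending_indices(weights, size):
    """For each of `size` samples, choose a dataset index and a per-dataset
    sample index so realized proportions track the target weights."""
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()
    counts = np.zeros(len(weights), dtype=np.int64)
    dataset_index = np.zeros(size, dtype=np.uint8)
    dataset_sample_index = np.zeros(size, dtype=np.int64)
    for i in range(size):
        # Gap between target and achieved counts after i+1 samples.
        errors = weights * (i + 1) - counts
        d = int(np.argmax(errors))
        dataset_index[i] = d
        dataset_sample_index[i] = counts[d]
        counts[d] += 1
    return dataset_index, dataset_sample_index
```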
@@ -18,8 +18,8 @@

 import torch
 import random
-from megatron import get_args
-from megatron import mpu
+from ascendspeed import get_args
+from ascendspeed import mpu


 def build_pretraining_data_loader(dataset, consumed_samples):
@@ -29,7 +29,7 @@ def build_pretraining_data_loader(dataset, consumed_samples):
         return None
     args = get_args()

-    # Megatron sampler
+    # ascendspeed sampler
     if args.dataloader_type == 'single':
         batch_sampler = MegatronPretrainingSampler(
             total_samples=len(dataset),
@@ -26,13 +26,13 @@ import collections
 import numpy as np
 import torch

-from megatron import (
+from ascendspeed import (
     get_args,
     mpu,
     print_rank_0
 )
-from megatron.data.blendable_dataset import BlendableDataset
-from megatron.data.indexed_dataset import make_dataset as make_indexed_dataset
+from ascendspeed.data.blendable_dataset import BlendableDataset
+from ascendspeed.data.indexed_dataset import make_dataset as make_indexed_dataset
 from deepspeed.accelerator import get_accelerator

 DSET_TYPE_BERT = 'standard_bert'
 DSET_TYPE_ICT = 'ict'
@@ -515,9 +515,9 @@ def _build_train_valid_test_datasets(data_prefix, data_impl, splits_string,
     print_split_stats('test', 2)

     def build_dataset(index, name):
-        from megatron.data.bert_dataset import BertDataset
-        from megatron.data.ict_dataset import ICTDataset
-        from megatron.data.t5_dataset import T5Dataset
+        from ascendspeed.data.bert_dataset import BertDataset
+        from ascendspeed.data.ict_dataset import ICTDataset
+        from ascendspeed.data.t5_dataset import T5Dataset
         dataset = None
         if splits[index + 1] > splits[index]:
             # Get the pointer to the original doc-idx so we can set it later.
@@ -689,7 +689,7 @@ def get_samples_mapping(indexed_dataset,
         print_rank_0(' > building sapmles index mapping for {} ...'.format(
             name))
         # First compile and then import.
-        from megatron.data import helpers
+        from ascendspeed.data import helpers
         samples_mapping = helpers.build_mapping(
             indexed_dataset.doc_idx,
             indexed_dataset.sizes,
@@ -4,12 +4,12 @@ import time
 import numpy as np
 import torch

-from megatron import print_rank_0, mpu, logging
-from megatron.data.blendable_dataset import BlendableDataset
-from megatron.data.dataset_utils import get_datasets_weights_and_num_samples, get_split_by_range_, \
+from ascendspeed import print_rank_0, mpu, logging
+from ascendspeed.data.blendable_dataset import BlendableDataset
+from ascendspeed.data.dataset_utils import get_datasets_weights_and_num_samples, get_split_by_range_, \
     get_train_valid_test_split_
-from megatron.data.mtf_dataset import MTFDataset
-from megatron.data.indexed_dataset import make_dataset as make_indexed_dataset
+from ascendspeed.data.mtf_dataset import MTFDataset
+from ascendspeed.data.indexed_dataset import make_dataset as make_indexed_dataset

 logger = logging.get_logger(__name__)

@@ -17,15 +17,16 @@

 import os
 import time

 import numpy as np

 import torch
 from deepspeed.accelerator import get_accelerator
-from megatron import mpu, is_rank_0, print_rank_0, get_args
-from megatron.data.blendable_dataset import BlendableDataset
-from megatron.data.dataset_utils import get_datasets_weights_and_num_samples
-from megatron.data.dataset_utils import get_train_valid_test_split_
-from megatron.data.indexed_dataset import make_dataset as make_indexed_dataset

+from ascendspeed import mpu, is_rank_0, print_rank_0, get_args
+from ascendspeed.data.blendable_dataset import BlendableDataset
+from ascendspeed.data.dataset_utils import get_datasets_weights_and_num_samples
+from ascendspeed.data.dataset_utils import get_train_valid_test_split_
+from ascendspeed.data.indexed_dataset import make_dataset as make_indexed_dataset


 def build_train_valid_test_datasets(data_prefix, data_impl, splits_string,
@@ -286,7 +287,7 @@ def _build_index_mappings(name, data_prefix, documents, sizes,
     start_time = time.time()
     # Use C++ implementation for speed.
     # First compile and then import.
-    from megatron.data import helpers
+    from ascendspeed.data import helpers
     assert doc_idx.dtype == np.int32
     assert sizes.dtype == np.int32
     sample_idx = helpers.build_sample_idx(sizes, doc_idx, seq_length,
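Reviewer's note: `helpers.build_sample_idx` is also C++, but its job can be sketched in Python: walk the shuffled document order, pack tokens into fixed-length samples, and record for each sample the (document position, offset) where it starts, keeping a one-token overlap so labels can be shifted. A rough analogue, my reading of the algorithm:

```python
import numpy as np

def build_sample_idx(sizes, doc_idx, seq_length, num_samples):
    """sizes: tokens per document; doc_idx: shuffled document order."""
    sample_idx = np.zeros((num_samples + 1, 2), dtype=np.int64)
    doc_i, offset = 0, 0
    sample_idx[0] = (doc_i, offset)
    for s in range(1, num_samples + 1):
        remaining = seq_length + 1                 # +1 token for shifted labels
        while remaining != 0:
            doc_len = sizes[doc_idx[doc_i]] - offset
            remaining -= doc_len
            if remaining <= 0:                     # sample ends in this document;
                offset += remaining + doc_len - 1  # keep a 1-token overlap
                remaining = 0
            else:                                  # document exhausted, advance
                doc_i += 1
                offset = 0
        sample_idx[s] = (doc_i, offset)
    return sample_idx
```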
@@ -4,10 +4,10 @@ import random
 import numpy as np
 from torch.utils.data import Dataset

-from megatron import get_tokenizer
-from megatron import get_args
-from megatron.data.dataset_utils import get_indexed_dataset_
-from megatron.data.realm_dataset_utils import get_block_samples_mapping
+from ascendspeed import get_tokenizer
+from ascendspeed import get_args
+from ascendspeed.data.dataset_utils import get_indexed_dataset_
+from ascendspeed.data.realm_dataset_utils import get_block_samples_mapping

 def make_attention_mask(source_block, target_block):
     """
@@ -17,10 +17,10 @@ import os
 import shutil
 import struct
 from itertools import accumulate

 import numpy as np
 import torch
-from megatron import print_rank_0
+from ascendspeed import print_rank_0


 def __best_fitting_dtype(vocab_size=None):
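Reviewer's note: `__best_fitting_dtype` is truncated by the hunk boundary. A plausible reading of it, given the name: store token IDs in the smallest NumPy integer type the vocabulary fits into. The threshold below is an assumption, not a quote of the file:

```python
import numpy as np

def best_fitting_dtype(vocab_size=None):
    """Pick the narrowest integer dtype for token IDs. Sketch only."""
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16   # 2 bytes per token suffices
    return np.int32        # otherwise fall back to 4 bytes
```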
@@ -3,11 +3,11 @@
 import numpy as np
 import torch

-from megatron import print_rank_0, get_tokenizer, get_args
-from megatron.data.blendable_dataset import BlendableDataset
-from megatron.data.dataset_utils import get_datasets_weights_and_num_samples, get_split_by_range_
-from megatron.data.dataset_utils import get_train_valid_test_split_, get_indexed_dataset_
-from megatron.data.gpt_dataset import GPTDataset
+from ascendspeed import print_rank_0, get_tokenizer, get_args
+from ascendspeed.data.blendable_dataset import BlendableDataset
+from ascendspeed.data.dataset_utils import get_datasets_weights_and_num_samples, get_split_by_range_
+from ascendspeed.data.dataset_utils import get_train_valid_test_split_, get_indexed_dataset_
+from ascendspeed.data.gpt_dataset import GPTDataset


 def build_train_valid_test_datasets(data_prefix, data_impl, splits_string,
@@ -20,8 +20,8 @@ import time
 import numpy as np
 import torch

-from megatron import print_rank_0
-from megatron.data.indexed_dataset import make_dataset as make_indexed_dataset
+from ascendspeed import print_rank_0
+from ascendspeed.data.indexed_dataset import make_dataset as make_indexed_dataset

 class MTFDataset(torch.utils.data.Dataset):

@@ -22,8 +22,8 @@ import random
 import torch
 from torch.utils.data import Dataset

-from megatron import print_rank_0, get_args, get_tokenizer, mpu
-from megatron.data.biencoder_dataset_utils import make_attention_mask
+from ascendspeed import print_rank_0, get_args, get_tokenizer, mpu
+from ascendspeed.data.biencoder_dataset_utils import make_attention_mask

 def get_open_retrieval_wiki_dataset():
     args = get_args()
@@ -4,9 +4,9 @@ import time
 import numpy as np
 import torch

-from megatron import mpu, print_rank_0
-from megatron.data.dataset_utils import create_masked_lm_predictions, pad_and_convert_to_numpy
-from megatron import get_args, get_tokenizer, print_rank_0, mpu
+from ascendspeed import mpu, print_rank_0
+from ascendspeed.data.dataset_utils import create_masked_lm_predictions, pad_and_convert_to_numpy
+from ascendspeed import get_args, get_tokenizer, print_rank_0, mpu
 from deepspeed.accelerator import get_accelerator

 def get_one_epoch_dataloader(dataset, micro_batch_size=None):
@@ -23,7 +23,7 @@ def get_one_epoch_dataloader(dataset, micro_batch_size=None):
     sampler = torch.utils.data.SequentialSampler(dataset)
     # importantly, drop_last must be False to get all the data.
     assert False, 'DistributedBatchSampler deprecated, change the implementation'
-    from megatron.data.samplers import DistributedBatchSampler
+    from ascendspeed.data.samplers import DistributedBatchSampler
     batch_sampler = DistributedBatchSampler(sampler,
                                             batch_size=global_batch_size,
                                             drop_last=False,
@@ -152,7 +152,7 @@ def get_block_samples_mapping(block_dataset, title_dataset, data_prefix, num_epo
         print_rank_0(' > building samples index mapping for {} ...'.format(
             name))

-        from megatron.data import helpers
+        from ascendspeed.data import helpers
         mapping_array = helpers.build_blocks_mapping(
             block_dataset.doc_idx,
             block_dataset.sizes,
@@ -6,8 +6,8 @@ import shutil
 import numpy as np
 import torch

-from megatron import get_args
-from megatron import mpu
+from ascendspeed import get_args
+from ascendspeed import mpu


 def detach(tensor):
@@ -20,8 +20,8 @@ import collections
 import numpy as np
 import torch

-from megatron import get_tokenizer
-from megatron.data.dataset_utils import (
+from ascendspeed import get_tokenizer
+from ascendspeed.data.dataset_utils import (
     create_masked_lm_predictions,
     get_samples_mapping
 )
@@ -2,8 +2,8 @@
 # put some code used during development and manual testing of
 # indexed_dataset.

-from megatron.data import indexed_dataset
-from megatron.tokenizer import build_tokenizer
+from ascendspeed.data import indexed_dataset
+from ascendspeed.tokenizer import build_tokenizer
 import argparse
 import os
 import sys
@@ -15,7 +15,7 @@
 import os
 import torch
 from torchvision import datasets, transforms
-from megatron.data.autoaugment import ImageNetPolicy
+from ascendspeed.data.autoaugment import ImageNetPolicy


 def build_train_valid_datasets(data_path, crop_size=224, color_jitter=True):
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-"""Megatron global variables."""
+"""global variables."""

 import os
 import sys
@@ -21,7 +21,7 @@ import time

 import torch

-from megatron.tokenizer import build_tokenizer
+from ascendspeed.tokenizer import build_tokenizer
 from .arguments import parse_args
 from .microbatches import build_num_microbatches_calculator
 from deepspeed.accelerator import get_accelerator
@@ -2,15 +2,15 @@ import sys
 import torch
 import torch.distributed as dist

-from megatron import get_args
-from megatron import mpu
-from megatron.checkpointing import load_biencoder_checkpoint
-from megatron.data.orqa_wiki_dataset import get_open_retrieval_wiki_dataset
-from megatron.data.orqa_wiki_dataset import get_open_retrieval_batch
-from megatron.data.biencoder_dataset_utils import get_one_epoch_dataloader
-from megatron.data.realm_index import detach, OpenRetreivalDataStore
-from megatron.model.biencoder_model import biencoder_model_provider
-from megatron.training import get_model
+from ascendspeed import get_args
+from ascendspeed import mpu
+from ascendspeed.checkpointing import load_biencoder_checkpoint
+from ascendspeed.data.orqa_wiki_dataset import get_open_retrieval_wiki_dataset
+from ascendspeed.data.orqa_wiki_dataset import get_open_retrieval_batch
+from ascendspeed.data.biencoder_dataset_utils import get_one_epoch_dataloader
+from ascendspeed.data.realm_index import detach, OpenRetreivalDataStore
+from ascendspeed.model.biencoder_model import biencoder_model_provider
+from ascendspeed.training import get_model


 class IndexBuilder(object):
@@ -13,30 +13,31 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-"""Megatron initialization."""
+"""initialization."""

 import random
 import os
 import time

 import numpy as np
-import torch

-from megatron import get_adlr_autoresume
-from megatron import get_args
-from megatron import get_tensorboard_writer
-from megatron import mpu
-from megatron.global_vars import set_global_variables
-from megatron.mpu import (set_tensor_model_parallel_rank,
-                          set_tensor_model_parallel_world_size)
-from deepspeed.accelerator import get_accelerator
+import torch
+import deepspeed
+from deepspeed.accelerator import get_accelerator
+
+from ascendspeed import get_adlr_autoresume
+from ascendspeed import get_args
+from ascendspeed import get_tensorboard_writer
+from ascendspeed import mpu
+from ascendspeed.global_vars import set_global_variables
+from ascendspeed.mpu import (set_tensor_model_parallel_rank,
+                             set_tensor_model_parallel_world_size)


 def initialize_megatron(extra_args_provider=None, args_defaults=None,
                         ignore_unknown_args=False, allow_no_cuda=False):
     """Set global variables, initialize distributed, and
     set autoresume and random seeds.
-    `allow_no_cuda` should not be set unless using megatron for cpu only
+    `allow_no_cuda` should not be set unless using ascendspeed for cpu only
     data processing. In general this arg should not be set unless you know
     what you are doing.
     Returns a function to finalize distributed env initialization
@@ -47,7 +48,7 @@ def initialize_megatron(extra_args_provider=None, args_defaults=None,

     if not allow_no_cuda:
         # Make sure cuda is available.
-        assert get_accelerator().is_available(), 'Megatron requires accelerator.'
+        assert get_accelerator().is_available(), 'ascendspeed requires accelerator.'

     # Parse args, build tokenizer, and set adlr-autoresume,
     # tensorboard-writer, and timers.
@@ -77,7 +78,7 @@ def initialize_megatron(extra_args_provider=None, args_defaults=None,
         set_tensor_model_parallel_rank(args.rank)
         return finish_mpu_init
     else:
-        # Megatron's MPU is the master. Complete initialization right away.
+        # MPU is the master. Complete initialization right away.
         finish_mpu_init()

     # Initialize memory buffers.
@@ -97,7 +98,7 @@ def _compile_dependencies():
     if torch.distributed.get_rank() == 0:
         start_time = time.time()
         print('> compiling dataset index builder ...')
-        from megatron.data.dataset_utils import compile_helper
+        from ascendspeed.data.dataset_utils import compile_helper
         compile_helper()
         print('>>> done with dataset index builder. Compilation time: {:.3f} '
               'seconds'.format(time.time() - start_time), flush=True)
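Reviewer's note: the docstring mentions that `initialize_megatron` can return a function to finalize distributed initialization. That is a deferred-initialization pattern: heavy distributed setup is wrapped in a closure the caller (for example, an external launcher) invokes later, rather than running eagerly. A schematic sketch; the helper names and `lazy_mpu_init` flag are illustrative:

```python
def initialize(args, lazy_mpu_init=False):
    """Sketch of deferred init: setup is captured in a closure."""
    def finish_mpu_init():
        _initialize_distributed(args)   # assumed helper: init process groups
        _set_random_seed(args.seed)     # assumed helper: seed all ranks

    if lazy_mpu_init:
        # Caller becomes responsible for completing initialization later.
        return finish_mpu_init
    finish_mpu_init()                   # otherwise complete it right away
    return None
```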
@@ -17,7 +17,7 @@

 import math

-from megatron import print_rank_0, get_args
+from ascendspeed import print_rank_0, get_args

 class AnnealingLR(object):
     """Anneals the learning rate."""
@@ -19,8 +19,8 @@ from abc import abstractmethod
 import torch
 from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

-from megatron import get_args
-from megatron import mpu
+from ascendspeed import get_args
+from ascendspeed import mpu
 from .module import MegatronModule
 from deepspeed.accelerator import get_accelerator

@@ -15,7 +15,7 @@

 import torch
 import torch_npu
-from megatron.model.enums import AttnMaskType
+from ascendspeed.model.enums import AttnMaskType


 class NPUFusedScaleMaskSoftmax(torch.nn.Module):
@@ -2,8 +2,8 @@ import torch
 from torch import nn
 from torch.nn import functional as F

-from megatron import logging
-from megatron.model.utils import log_debug_usage
+from ascendspeed import logging
+from ascendspeed.model.utils import log_debug_usage

 logger = logging.get_logger(__name__)

@@ -18,8 +18,8 @@
 from functools import partial
 import torch

-from megatron import get_args
-from megatron import mpu
+from ascendspeed import get_args
+from ascendspeed import mpu
 from .module import MegatronModule, fp32_to_float16

 from .enums import AttnMaskType
@@ -29,8 +29,8 @@ from .utils import init_method_normal
 from .utils import scaled_init_method_normal

 from deepspeed.pipe import PipelineModule, LayerSpec, TiedLayerSpec
-from megatron.model import LayerNorm
-from megatron.model.module import float16_to_fp32
+from ascendspeed.model import LayerNorm
+from ascendspeed.model.module import float16_to_fp32
 from .language_model import EmbeddingPipe
 from .transformer import ParallelTransformerLayerPipe

@@ -94,7 +94,7 @@ class GPTModel(MegatronModule):
             self.initialize_word_embeddings(init_method_normal)

     def set_input_tensor(self, input_tensor):
-        """See megatron.model.transformer.set_input_tensor()"""
+        """See ascendspeed.model.transformer.set_input_tensor()"""
         self.language_model.set_input_tensor(input_tensor)

     def forward(self, input_ids, position_ids, attention_mask, labels=None,
@@ -18,13 +18,15 @@
 import torch
 import torch.nn.functional as F

-from megatron import get_args
-from megatron import mpu
+from ascendspeed import get_args
+from ascendspeed import mpu
 from .module import MegatronModule
-from megatron.model.enums import LayerType, AttnMaskType
-from megatron.model.transformer import ParallelTransformer
-from megatron.model.utils import get_linear_layer
-from megatron.model.utils import init_method_normal, scaled_init_method_normal
+
+from ascendspeed.model.enums import LayerType, AttnMaskType
+from ascendspeed.model.transformer import ParallelTransformer
+from ascendspeed.model.utils import get_linear_layer
+from ascendspeed.model.utils import init_method_normal, scaled_init_method_normal


 def parallel_lm_logits(input_, word_embeddings_weight, parallel_output,
                        bias=None):
@@ -367,7 +369,7 @@ class TransformerLanguageModel(MegatronModule):
             self._pooler_key = 'pooler'

     def set_input_tensor(self, input_tensor):
-        """ See megatron.model.transformer.set_input_tensor()"""
+        """ See ascendspeed.model.transformer.set_input_tensor()"""
         self.encoder.set_input_tensor(input_tensor)

     def forward(self, enc_input_ids, enc_position_ids, enc_attn_mask,
@@ -24,14 +24,15 @@ from functools import partial
 import torch
 import torch.nn.functional as F

-from megatron import get_args
-from megatron import mpu
-from megatron.model.module import MegatronModule, float16_to_fp32, fp32_to_float16
-from megatron.model.enums import AttnMaskType, LayerType, AttnType
-from megatron.model.utils import get_linear_layer, init_method_normal, scaled_init_method_normal, attention_mask_func, \
+from ascendspeed import get_args
+from ascendspeed import mpu
+from ascendspeed.model.module import MegatronModule, float16_to_fp32, fp32_to_float16
+from ascendspeed.model.enums import AttnMaskType, LayerType, AttnType
+from ascendspeed.model.utils import get_linear_layer, init_method_normal, scaled_init_method_normal, attention_mask_func, \
     openai_gelu, erf_gelu
-from megatron.model.fused_softmax import NPUFusedScaleMaskSoftmax
-from megatron.model.language_model import Pooler
+
+from ascendspeed.model.fused_softmax import NPUFusedScaleMaskSoftmax
+from ascendspeed.model.language_model import Pooler

 import deepspeed
 from deepspeed.accelerator import get_accelerator
@@ -85,7 +86,6 @@ def apply_rotary_pos_emb(q, k, cos, sin, offset: int = 0):
     return q_embed, k_embed

-
 # TODO not able to build apex cpp extention for Fused cuda kernel RMSNorm
 class RMSNorm(torch.nn.Module):  # for cpu
     def __init__(self, hidden_size, eps=1e-6):
         """
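Reviewer's note: the `RMSNorm` body is cut off by the hunk boundary. For readers unfamiliar with the layer, a minimal implementation consistent with the signature above (a sketch, not a quote of this file): normalize by the root-mean-square over the hidden dimension (no mean subtraction, unlike LayerNorm) and apply a learned scale.

```python
import torch

class RMSNorm(torch.nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(hidden_size))  # learned scale
        self.eps = eps

    def forward(self, x):
        # Root-mean-square over the last (hidden) dimension.
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)
```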
@@ -916,7 +916,7 @@ class LlamaModel(MegatronModule):
             parallel_output=self.parallel_output)

     def set_input_tensor(self, input_tensor):
-        """See megatron.model.transformer.set_input_tensor()"""
+        """See ascendspeed.model.transformer.set_input_tensor()"""
         self.language_model.set_input_tensor(input_tensor)

     def forward(self, input_ids, attention_mask, labels=None, layer_past=None, get_key_value=False):
@@ -19,8 +19,8 @@ import torch
 from torch.autograd import Variable
 from torch.nn.parameter import Parameter
 from deepspeed.accelerator import get_accelerator
-from megatron import get_args
-from megatron import mpu
+from ascendspeed import get_args
+from ascendspeed import mpu


 _FLOAT_TYPES = (torch.FloatTensor, get_accelerator().FloatTensor)
@@ -18,14 +18,15 @@ import math
 import torch
 import torch.nn.functional as F

-from megatron import get_args
-from megatron import mpu
-from megatron.model import LayerNorm
-from megatron.model.fused_softmax import NPUFusedScaleMaskSoftmax
-from megatron.model.enums import AttnMaskType, LayerType, AttnType
-from megatron.model.fused_bias_gelu import bias_gelu_impl
-from megatron.model.utils import attention_mask_func, openai_gelu, erf_gelu
-from megatron.model.module import MegatronModule
+from ascendspeed import get_args
+from ascendspeed import mpu
+from ascendspeed.model import LayerNorm
+from ascendspeed.model.fused_softmax import NPUFusedScaleMaskSoftmax
+from ascendspeed.model.enums import AttnMaskType, LayerType, AttnType
+from ascendspeed.model.fused_bias_gelu import bias_gelu_impl
+from ascendspeed.model.utils import attention_mask_func, openai_gelu, erf_gelu
+from ascendspeed.model.module import MegatronModule
+
 from torch import distributed as dist
 import deepspeed
 from deepspeed.moe.layer import MoE
@@ -19,7 +19,7 @@ import math

 import torch

-from megatron import get_args
+from ascendspeed import get_args

 def init_method_normal(sigma):
     """Init method based on N(0, sigma)."""
@@ -19,9 +19,9 @@ import math
 import einops
 import torch
 import torch.nn.functional as F
-from megatron import get_args
-from megatron.model.transformer import ParallelTransformer
-from megatron.model.utils import (
+from ascendspeed import get_args
+from ascendspeed.model.transformer import ParallelTransformer
+from ascendspeed.model.utils import (
     get_linear_layer,
     init_method_normal,
     scaled_init_method_normal,
@@ -35,7 +35,7 @@ from .random import get_cuda_rng_tracker
 from .utils import divide
 from .utils import split_tensor_along_last_dim
 from .utils import VocabUtility
-from megatron import get_args
+from ascendspeed import get_args
 import deepspeed.runtime.activation_checkpointing.checkpointing as ds_checkpointing
 from deepspeed.accelerator import get_accelerator

@@ -24,8 +24,8 @@ from torch import _C
 from deepspeed.accelerator import get_accelerator
 from torch.utils.checkpoint import detach_variable

-from megatron import get_args
-from megatron.memory import allocate_mem_buff
+from ascendspeed import get_args
+from ascendspeed.memory import allocate_mem_buff

 from .initialize import get_data_parallel_rank
 from .initialize import get_tensor_model_parallel_group
@@ -18,8 +18,8 @@ import apex
 import torch
 from deepspeed.accelerator import get_accelerator

-from megatron import get_args
-from megatron.model import LayerNorm
+from ascendspeed import get_args
+from ascendspeed.model import LayerNorm

 from .grad_scaler import ConstantGradScaler, DynamicGradScaler
 from .optimizer import Float16OptimizerWithFloat16Params, FP32Optimizer
@@ -29,9 +29,9 @@ if get_accelerator().device_name() == 'cuda':
     import amp_C


-from megatron import mpu
-from megatron.model.module import param_is_not_shared
-from megatron.mpu.layers import param_is_not_tensor_parallel_duplicate
+from ascendspeed import mpu
+from ascendspeed.model.module import param_is_not_shared
+from ascendspeed.mpu.layers import param_is_not_tensor_parallel_duplicate
 from deepspeed.accelerator import get_accelerator

 def clip_grad_norm_fp32(parameters, max_norm, norm_type=2):
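Reviewer's note: the two predicates imported in the last hunk hint at what `clip_grad_norm_fp32` must get right under tensor parallelism: count each parameter once (skip shared embeddings and tensor-parallel duplicates), combine the partial norms across ranks, then scale. A sketch of the L2 case, assuming the two predicates above are in scope; the helper name and `tp_group` argument are mine:

```python
import torch

def clip_grad_norm_fp32_sketch(parameters, max_norm, tp_group=None):
    grads = [p.grad.detach() for p in parameters
             if p.grad is not None
             and param_is_not_shared(p)                      # imported above
             and param_is_not_tensor_parallel_duplicate(p)]  # imported above
    device = grads[0].device if grads else torch.device('cpu')
    total_sq = torch.zeros(1, device=device)
    for g in grads:
        total_sq += g.float().norm() ** 2
    if tp_group is not None:
        # Each TP rank holds a shard, so sum the squared norms across ranks.
        torch.distributed.all_reduce(total_sq, group=tp_group)
    total_norm = total_sq.sqrt().item()
    clip_coeff = max_norm / (total_norm + 1e-6)
    if clip_coeff < 1.0:
        for p in parameters:                 # scale every grad, duplicates too
            if p.grad is not None:
                p.grad.detach().mul_(clip_coeff)
    return total_norm
```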
@@ -21,9 +21,9 @@ from abc import abstractmethod
 import torch


-from megatron import get_timers
-from megatron import mpu
-from megatron import print_rank_0
+from ascendspeed import get_timers
+from ascendspeed import mpu
+from ascendspeed import print_rank_0
 from deepspeed.accelerator import get_accelerator
 from .clip_grads import clip_grad_norm_fp32, count_zeros_fp32
@@ -17,14 +17,14 @@ from functools import reduce
 import operator
 import torch
 from deepspeed.accelerator import get_accelerator
-from megatron import get_args
-from megatron import mpu
+from ascendspeed import get_args
+from ascendspeed import mpu


 def _communicate(tensor_send_next, tensor_send_prev, recv_prev, recv_next,
                  use_ring_exchange=False):
     """Communicate tensors between stages. Used as helper method in other
-    communication methods that are used in megatron/schedules.py.
+    communication methods that are used in ascendspeed/schedules.py.

     Takes the following arguments:
         tensor_send_next: tensor to send to next rank (no tensor sent if
@@ -19,12 +19,14 @@ MINOR = 1.5
 # Use the following formatting: (major, minor)
 VERSION = (MAJOR, MINOR)

-__version__ = '.'.join(map(str, VERSION))
-__package_name__ = 'megatron-lm'
-__contact_names__ = 'NVIDIA INC'
-__url__ = 'https://github.com/NVIDIA/Megatron-LM'
-__download_url__ = 'https://github.com/NVIDIA/Megatron-LM/releases'
-__description__ = 'Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.'
-__license__ = 'See https://github.com/NVIDIA/Megatron-LM/blob/master/LICENSE'
-__keywords__ = 'deep learning, Megatron, gpu, NLP, nvidia, pytorch, torch, language'
+__description__ = 'AscendSpeed for LLMs of Ascend'
+__version__ = '0.0.1'
+__author__ = 'Ascend'
+__long_description__ = 'AscendSpeed for LLMs of Ascend'
+__url__ = 'https://gitee.com/ascend/AscendSpeed'
+__download_url__ = 'https://gitee.com/ascend/AscendSpeed/release'
+__keywords__ = 'Ascend, language, deep learning, NLP'
+__license__ = 'See https://gitee.com/ascend/AscendSpeed'
+__package_name__ = 'ascendspeed'
+__contact_names__ = 'Ascend'
@@ -18,15 +18,15 @@ import torch
 import torch_npu
 from torch.nn.parallel.distributed import DistributedDataParallel as torchDDP

-from megatron import get_args
-from megatron import get_num_microbatches
-from megatron import get_timers
-from megatron import mpu
-from megatron import p2p_communication
-from megatron import print_rank_0
-from megatron.utils import unwrap_model
-from megatron.model import DistributedDataParallel as LocalDDP
-from megatron.model import Float16Module
+from ascendspeed import get_args
+from ascendspeed import get_num_microbatches
+from ascendspeed import get_timers
+from ascendspeed import mpu
+from ascendspeed import p2p_communication
+from ascendspeed import print_rank_0
+from ascendspeed.utils import unwrap_model
+from ascendspeed.model import DistributedDataParallel as LocalDDP
+from ascendspeed.model import Float16Module


 def clear_npu_overflow_flag():
@@ -20,19 +20,20 @@ import json
 import os
 import time

-import torch
-import torch.nn.functional as F
-from megatron import get_args
-from megatron import get_tokenizer
-from megatron import mpu
-from megatron.utils import get_ltor_masks_and_position_ids, unwrap_model
-from megatron.p2p_communication import recv_forward, send_forward
-
-# These are needed to unwrap the model, would be nice to put these in megatron.utils if possible?
-from torch.nn.parallel.distributed import DistributedDataParallel as torchDDP
-from megatron.model import DistributedDataParallel as LocalDDP
-from megatron.model import Float16Module
+# These are needed to unwrap the model, would be nice to put these in ascendspeed.utils if possible?
+import torch
+import torch.nn.functional as F
+from torch.nn.parallel.distributed import DistributedDataParallel as torchDDP
+
+from ascendspeed import get_args
+from ascendspeed import get_tokenizer
+from ascendspeed import mpu
+from ascendspeed.utils import get_ltor_masks_and_position_ids, unwrap_model
+from ascendspeed.p2p_communication import recv_forward, send_forward
+from ascendspeed.model import DistributedDataParallel as LocalDDP
+from ascendspeed.model import Float16Module
 from deepspeed.accelerator import get_accelerator

 def get_batch(context_tokens):
     """Generate batch from context tokens."""
     args = get_args()
@@ -180,9 +181,9 @@ def generate_samples_input_from_file(model):
             decode_tokens = decode_tokens[0].cpu().numpy().tolist()
             trim_decode_tokens = tokenizer.detokenize(
                 decode_tokens)[raw_text_len:]
-            print("\nMegatron-LM:", trim_decode_tokens, flush=True)
+            print("\nAscendSpeed:", trim_decode_tokens, flush=True)

-            fname_out.write("\n\nMegatron-LM:")
+            fname_out.write("\n\nAscendSpeed:")
             fname_out.write(trim_decode_tokens)
             fname_out.write("\n")

@@ -301,7 +302,7 @@ def generate_samples_interactive(model, print_frequency=24):
             decode_tokens = decode_tokens[0].cpu().numpy().tolist()
             trim_decode_tokens = tokenizer.detokenize(
                 decode_tokens)[raw_text_len:]
-            print("\nMegatron-LM:", trim_decode_tokens, flush=True)
+            print("\nAscendSpeed:", trim_decode_tokens, flush=True)

         if mpu.is_pipeline_first_stage() \
            and mpu.get_tensor_model_parallel_rank() == 0:
@@ -313,7 +314,7 @@ def generate_samples_interactive(model, print_frequency=24):
             decode_tokens = decode_tokens[0].cpu().numpy().tolist()
             trim_decode_tokens = tokenizer.detokenize(
                 decode_tokens)[raw_text_len:]
-            print("\nMegatron-LM:", trim_decode_tokens, flush=True)
+            print("\nAscendSpeed:", trim_decode_tokens, flush=True)

             input("\nPress Enter to continue >>>")
@@ -26,40 +26,41 @@ _TRAIN_START_TIME = time.time()
 import torch
 from torch.nn.parallel.distributed import DistributedDataParallel as torchDDP

-from megatron import get_args
-from megatron import get_timers
-from megatron import get_tensorboard_writer
-from megatron import get_current_global_batch_size
-from megatron import get_num_microbatches
-from megatron import is_last_rank
-from megatron import update_num_microbatches
-from megatron import mpu
-from megatron import print_rank_0
-from megatron import print_rank_last
-from megatron.checkpointing import load_checkpoint
-from megatron.checkpointing import save_checkpoint
-from megatron.model import Float16Module
-from megatron.optimizer import get_megatron_optimizer
-from megatron.initialize import initialize_megatron
-from megatron.initialize import write_args_to_tensorboard
-from megatron.learning_rates import AnnealingLR
-from megatron.model import DistributedDataParallel as LocalDDP
-from megatron.utils import check_adlr_autoresume_termination
-from megatron.utils import unwrap_model
-from megatron.data.data_samplers import build_pretraining_data_loader
-from megatron.utils import calc_params_l2_norm
-from megatron.schedules import forward_backward_no_pipelining
-from megatron.schedules import forward_backward_pipelining_without_interleaving
-from megatron.schedules import forward_backward_pipelining_with_interleaving
-from megatron.utils import report_memory, throughput_calculator, checkpoint_throughput_calculator
-from deepspeed.accelerator import get_accelerator
+from ascendspeed import get_args
+from ascendspeed import get_timers
+from ascendspeed import get_tensorboard_writer
+from ascendspeed import get_current_global_batch_size
+from ascendspeed import get_num_microbatches
+from ascendspeed import is_last_rank
+from ascendspeed import update_num_microbatches
+from ascendspeed import mpu
+from ascendspeed import print_rank_0
+from ascendspeed import print_rank_last
+from ascendspeed.checkpointing import load_checkpoint
+from ascendspeed.checkpointing import save_checkpoint
+from ascendspeed.model import Float16Module
+from ascendspeed.optimizer import get_megatron_optimizer
+from ascendspeed.initialize import initialize_megatron
+from ascendspeed.initialize import write_args_to_tensorboard
+from ascendspeed.learning_rates import AnnealingLR
+from ascendspeed.model import DistributedDataParallel as LocalDDP
+from ascendspeed.utils import check_adlr_autoresume_termination
+from ascendspeed.utils import unwrap_model
+from ascendspeed.data.data_samplers import build_pretraining_data_loader
+from ascendspeed.utils import calc_params_l2_norm
+from ascendspeed.schedules import forward_backward_no_pipelining
+from ascendspeed.schedules import forward_backward_pipelining_without_interleaving
+from ascendspeed.schedules import forward_backward_pipelining_with_interleaving
+from ascendspeed.utils import report_memory, throughput_calculator, checkpoint_throughput_calculator
+from ascendspeed.model.transformer import ParallelTransformerLayer

 import deepspeed
+from deepspeed.accelerator import get_accelerator
 from deepspeed.compression.compress import init_compression, redundancy_clean
-
-from megatron.model.transformer import ParallelTransformerLayer
 from deepspeed.runtime.data_pipeline.data_routing.helper import convert_to_random_ltd


 def print_datetime(string):
     """Note that this call will sync across all ranks."""
     torch.distributed.barrier()
@@ -76,7 +77,7 @@ def pretrain(train_valid_test_dataset_provider,
     """Main training program.

     This function will run the followings in the order provided:
-    1) initialize Megatron.
+    1) initialize ascendspeed.
     2) setup model, optimizer and lr schedule using the model_provider.
     3) call train_val_test_data_provider to get train/val/test datasets.
     4) train the modle using the forward_step_func.
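Reviewer's note: the four docstring steps above, as a schematic driver. The helpers named here (`initialize_megatron`, `setup_model_and_optimizer`, `train`) are the ones this file defines; the skeleton only shows the call order, not real signatures:

```python
def pretrain(train_valid_test_dataset_provider, model_provider, forward_step_func):
    initialize_megatron()                                  # 1) globals, distributed, seeds
    model, optimizer, lr_scheduler = \
        setup_model_and_optimizer(model_provider)          # 2) model + optimizer + schedule
    train_ds, valid_ds, test_ds = \
        train_valid_test_dataset_provider()                # 3) datasets
    train(forward_step_func, model, optimizer,
          lr_scheduler, train_ds, valid_ds)                # 4) training loop
```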
@@ -109,9 +110,9 @@ def pretrain(train_valid_test_dataset_provider,
     torch.distributed.all_reduce(start_time_tensor,
                                  op=torch.distributed.ReduceOp.MIN)
     _TRAIN_START_TIME = start_time_tensor.item()
-    print_rank_0('time to initialize megatron (seconds): {:.3f}'.format(
+    print_rank_0('time to initialize ascendspeed (seconds): {:.3f}'.format(
         time.time() - _TRAIN_START_TIME))
-    print_datetime('after megatron is initialized')
+    print_datetime('after ascendspeed is initialized')

     args = get_args()
     timers = get_timers()
@@ -483,7 +484,7 @@ def setup_model_and_optimizer(model_provider_func, teacher=False,
     pp = mpu.get_pipeline_model_parallel_world_size()
     if args.data_efficiency_curriculum_learning and build_train_valid_test_datasets_provider is not None:
         train_ds = None
-        # Only need to build dataset on tp rank 0 since Megatron has the
+        # Only need to build dataset on tp rank 0 since ascendspeed has the
         # broadcast_data() function that broadcast data from tp rank 0.
         if mpu.get_tensor_model_parallel_rank() == 0:
             # Number of train/valid/test samples.
@@ -25,13 +25,13 @@ if get_accelerator().device_name() == 'cuda':
     from apex.multi_tensor_apply import multi_tensor_applier
     import amp_C

-from megatron import get_args
-from megatron import print_rank_0
-from megatron import get_adlr_autoresume
-from megatron import mpu
-from megatron.model.module import param_is_not_shared
-from megatron.mpu.layers import param_is_not_tensor_parallel_duplicate
-from megatron import get_num_microbatches
+from ascendspeed import get_args
+from ascendspeed import print_rank_0
+from ascendspeed import get_adlr_autoresume
+from ascendspeed import mpu
+from ascendspeed.model.module import param_is_not_shared
+from ascendspeed.mpu.layers import param_is_not_tensor_parallel_duplicate
+from ascendspeed import get_num_microbatches
 from deepspeed.accelerator import get_accelerator

 def unwrap_model(model, module_instances=(torchDDP)):
     return_list = True
@@ -135,7 +135,7 @@ def print_params_min_max_norm(optimizer, iteration):
 def check_adlr_autoresume_termination(iteration, model,
                                       optimizer, lr_scheduler):
     """Check for autoresume signal and exit if it is received."""
-    from megatron.checkpointing import save_checkpoint
+    from ascendspeed.checkpointing import save_checkpoint

     args = get_args()
     autoresume = get_adlr_autoresume()
examples/README.md (new file, 2 lines)

@@ -0,0 +1,2 @@
+## Recipes and Scripts
+
examples/bloom_task/pretrain_bloom_550m.sh (new file, 48 lines)

@@ -0,0 +1,48 @@
+#!/bin/bash
+
+# This is an example of pre-training BLOOM-550M
+# without parameter-specific settings or any parallel techniques.
+
+export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
+export HCCL_CONNECT_TIMEOUT=1200
+source /usr/local/Ascend/ascend-toolkit/set_env.sh
+RANK=0
+WORLD_SIZE=1
+
+DATA_PATH=./dataset/enwiki-gpt/gpt_text_sentence
+CHECKPOINT_PATH=./ckpt
+
+export LOCAL_RANK=0
+
+python pretrain_llama.py \
+       --DDP-impl local \
+       --use-contiguous-buffers-in-ddp \
+       --num-layers 24 \
+       --hidden-size 1024 \
+       --num-attention-heads 16 \
+       --micro-batch-size 4 \
+       --global-batch-size 8 \
+       --seq-length 2048 \
+       --max-position-embeddings 2048 \
+       --train-iters 500000 \
+       --lr-decay-iters 320000 \
+       --save $CHECKPOINT_PATH \
+       --load $CHECKPOINT_PATH \
+       --data-path $DATA_PATH \
+       --vocab-file ./dataset/gpt2-vocab.json \
+       --merge-file ./dataset/gpt2-merges.txt \
+       --data-impl mmap \
+       --split 949,50,1 \
+       --distributed-backend nccl \
+       --lr 0.00015 \
+       --min-lr 1.0e-5 \
+       --lr-decay-style cosine \
+       --weight-decay 1e-2 \
+       --clip-grad 1.0 \
+       --lr-warmup-fraction .01 \
+       --checkpoint-activations \
+       --log-interval 10 \
+       --save-interval 10000 \
+       --eval-interval 1000 \
+       --eval-iters 10 \
+       --fp16 | tee logs/train.log
@@ -21,23 +21,24 @@ from functools import partial

import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu
-import deepspeed_npu

-from megatron import get_args
-from megatron import print_rank_0
-from megatron import get_timers
-from megatron import get_tokenizer
-from megatron import mpu
-from megatron.data.gpt_dataset import build_train_valid_test_datasets
-from megatron.model import GPTModel, GPTModelPipe
-from megatron.training import pretrain
-from megatron.utils import get_ltor_masks_and_position_ids
-from megatron.utils import average_losses_across_data_parallel_group

import deepspeed
+import deepspeed_npu
from deepspeed.runtime.utils import see_memory_usage
+from deepspeed.accelerator.real_accelerator import get_accelerator

+from ascendspeed import get_args
+from ascendspeed import print_rank_0
+from ascendspeed import get_timers
+from ascendspeed import get_tokenizer
+from ascendspeed import mpu
+from ascendspeed.data.gpt_dataset import build_train_valid_test_datasets
+from ascendspeed.model import GPTModel, GPTModelPipe
+from ascendspeed.training import pretrain
+from ascendspeed.utils import get_ltor_masks_and_position_ids
+from ascendspeed.utils import average_losses_across_data_parallel_group


from torch import nn
import torch.nn.functional as F
@@ -22,16 +22,16 @@ import torch_npu
from torch_npu.contrib import transfer_to_npu
import deepspeed_npu

-from megatron import get_args
-from megatron import print_rank_0
-from megatron import get_timers
-from megatron import get_tokenizer
-from megatron import mpu
-from megatron.data.gpt_dataset import build_train_valid_test_datasets
-from megatron.model import LlamaModel, LlamaModelPipe
-from megatron.training import pretrain
-from megatron.utils import get_ltor_masks_and_position_ids
-from megatron.utils import average_losses_across_data_parallel_group
+from ascendspeed import get_args
+from ascendspeed import print_rank_0
+from ascendspeed import get_timers
+from ascendspeed import get_tokenizer
+from ascendspeed import mpu
+from ascendspeed.data.gpt_dataset import build_train_valid_test_datasets
+from ascendspeed.model import LlamaModel, LlamaModelPipe
+from ascendspeed.training import pretrain
+from ascendspeed.utils import get_ltor_masks_and_position_ids
+from ascendspeed.utils import average_losses_across_data_parallel_group

import deepspeed
from deepspeed.runtime.utils import see_memory_usage
setup.py
@@ -15,12 +15,11 @@

"""Setup for pip package."""

import os
import sys
import setuptools

if sys.version_info < (3,):
-    raise Exception("Python 2 is not supported by Megatron.")
+    raise Exception("Python 2 is not supported by AscendSpeed.")

__description__ = 'AscendSpeed for LLMs of Ascend'
__version__ = '0.0.1'
@@ -51,7 +50,7 @@ def req_file(filename):

install_requires = req_file("requirements.txt")

setuptools.setup(
-    package_data={'ascendspeed':['megatron/data/Makefile']},
+    package_data={'ascendspeed':['ascendspeed/data/Makefile']},
    name=__package_name__,
    # Versions should comply with PEP440. For a discussion on single-sourcing
    # the version across setup.py and the project code, see

@@ -77,6 +76,7 @@ setuptools.setup(
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Programming Language :: Python :: 3.8',
+        'Programming Language :: Python :: 3.9',
        # Additional Setting
        'Environment :: Console',
        'Natural Language :: English',
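With this setup.py in place, a local editable install is the usual way to try the package (a sketch; assumes the pins in requirements.txt resolve on the target machine):

```bash
pip install -e .
```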
@@ -20,21 +20,21 @@ import numpy as np
import time

import torch
-from megatron import get_args
-from megatron import print_rank_0
-from megatron import get_tokenizer
-from megatron import mpu
-from megatron.training import setup_model_and_optimizer, get_model
-from megatron.mpu.mappings import gather_from_tensor_model_parallel_region
+from ascendspeed import get_args
+from ascendspeed import print_rank_0
+from ascendspeed import get_tokenizer
+from ascendspeed import mpu
+from ascendspeed.training import setup_model_and_optimizer, get_model
+from ascendspeed.mpu.mappings import gather_from_tensor_model_parallel_region

-from megatron.utils import get_ltor_masks_and_position_ids, unwrap_model
-from megatron.p2p_communication import recv_forward, send_forward
+from ascendspeed.utils import get_ltor_masks_and_position_ids, unwrap_model
+from ascendspeed.p2p_communication import recv_forward, send_forward
import pickle
import json

from torch.nn.parallel.distributed import DistributedDataParallel as torchDDP
-from megatron.model.distributed import DistributedDataParallel as LocalDDP
-from megatron.model.module import Float16Module
+from ascendspeed.model.distributed import DistributedDataParallel as LocalDDP
+from ascendspeed.model.module import Float16Module
from deepspeed.runtime.pipe import schedule
from deepspeed.accelerator import get_accelerator
@@ -279,15 +279,15 @@ class EvalHarnessAdaptor(GPT2LM):

    def tokenizer_encode(self, text):
        """Tokenize text *without* adding special tokens."""
        # Splitting this into its own method in case we need to handle special cases for different tokenizers
-        from megatron.tokenizer.gpt2_tokenization import GPT2Tokenizer
+        from ascendspeed.tokenizer.gpt2_tokenization import GPT2Tokenizer
        if isinstance(self.tokenizer.tokenizer, GPT2Tokenizer):
            return self.tokenizer.tokenizer.encode(text)
        else:
            return self.tokenizer.tokenizer.encode(text, add_special_tokens=False)


-from megatron.initialize import initialize_megatron
-import megatron
+from ascendspeed.initialize import initialize_megatron
+import ascendspeed

from tools.convert_checkpoint.deepspeed_checkpoint import DeepSpeedCheckpoint
from tools.convert_checkpoint.deepspeed_to_megatron import _create_rank_checkpoint
@@ -303,9 +303,9 @@ def override_args(args, override_args, skip_keys, skip_if_specified_keys):

# Note(Hesslow):
# The model loading is a bit convoluted.
-# We want to parse out the model arguments from the checkpoint and use those to initialize megatron-ds.
+# We want to parse out the model arguments from the checkpoint and use those to initialize ascendspeed-ds.
#
-# However megatron-ds expects its arguments on the command line.
+# However ascendspeed-ds expects its arguments on the command line.
# And at that point we don't know them.
#
# Instead we use Jason's way: we load the arguments from the checkpoint and then override _parse_args to return whatever args we want.

@@ -314,12 +314,12 @@ def override_args(args, override_args, skip_keys, skip_if_specified_keys):
# In order to support this we _first_ parse the arguments normally, and then override them with the arguments from the checkpoint,
# keeping the default value of newer arguments.
#
-# We then use the megatron deepspeed converter to load the deepspeed checkpoints as if they were megatron checkpoints.
+# We then use the ascendspeed converter to load the deepspeed checkpoints as if they were ascendspeed checkpoints.
def load_ds_checkpoint_and_setup_megatron(extra_args_provider):
-    # parse the megatron args. But wait with initializing megatron.
+    # parse the ascendspeed args. But wait with initializing ascendspeed.
    # avoid printing the arguments, since they will later be overridden.
-    _print_args = megatron.arguments._print_args
-    megatron.arguments._print_args = lambda *_args, **kwarg: None
+    _print_args = ascendspeed.arguments._print_args
+    ascendspeed.arguments._print_args = lambda *_args, **kwarg: None
    args = _parse_args(extra_args_provider)

    ds_checkpoint = DeepSpeedCheckpoint(args.load,
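The body of `override_args` never appears in this diff; here is a minimal sketch consistent with the comments above (the exact skip semantics are an assumption):

```python
def override_args(args, override_args, skip_keys, skip_if_specified_keys):
    # Copy every checkpoint argument onto the freshly parsed args,
    # except keys we always skip, or keys the user set explicitly.
    for key, value in vars(override_args).items():
        if key in skip_keys:
            continue
        if key in skip_if_specified_keys and getattr(args, key, None) is not None:
            continue
        setattr(args, key, value)
```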
@@ -342,14 +342,14 @@ def load_ds_checkpoint_and_setup_megatron(extra_args_provider):

    override_args(args, cp_args, skip_keys, skip_if_specified)

-    # stop megatron from reparsing the arguments.
-    megatron.global_vars._parse_args = lambda *_args, **kwarg: args
-    megatron.global_vars._GLOBAL_ARGS = args
+    # stop ascendspeed from reparsing the arguments.
+    ascendspeed.global_vars._parse_args = lambda *_args, **kwarg: args
+    ascendspeed.global_vars._GLOBAL_ARGS = args

    initialize_megatron()
    torch.distributed.barrier()

-    # Initializing megatron will update e.g. the tokenizer size. Override again.
+    # Initializing ascendspeed will update e.g. the tokenizer size. Override again.
    override_args(args, cp_args, skip_keys, skip_if_specified)

    # print final arguments.
@@ -377,7 +377,7 @@ def load_ds_checkpoint_and_setup_megatron(extra_args_provider):
        model._config.zero_enabled = zero_enabled
    else:
        model = get_model(model_provider)[0]
-        # Initialize megatron model using the parsed state dict.
+        # Initialize ascendspeed model using the parsed state dict.
        sd = _create_rank_checkpoint(ds_checkpoint, None, mpu.get_tensor_model_parallel_rank(), mpu.get_pipeline_model_parallel_rank(), True)

        model.load_state_dict(sd['model'], strict=True)
@@ -399,7 +399,7 @@ def tasks_args(parser):
    group.add_argument('--eval_fp32', default=False, action='store_true', help='Should the evaluation run in fp32')
    return parser

-from megatron.global_vars import _parse_args
+from ascendspeed.global_vars import _parse_args

def main():
    start = time.time()
@@ -21,10 +21,10 @@ from functools import partial

import torch

-from megatron import get_args
-from megatron import print_rank_last, is_last_rank
-from megatron import mpu
-from megatron.schedules import get_forward_backward_func
+from ascendspeed import get_args
+from ascendspeed import print_rank_last, is_last_rank
+from ascendspeed import mpu
+from ascendspeed.schedules import get_forward_backward_func
from tasks.finetune_utils import build_data_loader
from tasks.finetune_utils import process_batch
from deepspeed.accelerator import get_accelerator
@@ -19,19 +19,19 @@ from functools import partial

import torch

-from megatron import get_args
-from megatron import print_rank_0
-from megatron import get_timers
-from megatron import mpu
-from megatron.checkpointing import load_checkpoint
-from megatron.checkpointing import save_checkpoint
-from megatron.training import evaluate_and_print_results
-from megatron.training import setup_model_and_optimizer
-from megatron.training import train_step
-from megatron.training import training_log
-from megatron.utils import average_losses_across_data_parallel_group
-from megatron.utils import calc_params_l2_norm
-from megatron.utils import check_adlr_autoresume_termination
+from ascendspeed import get_args
+from ascendspeed import print_rank_0
+from ascendspeed import get_timers
+from ascendspeed import mpu
+from ascendspeed.checkpointing import load_checkpoint
+from ascendspeed.checkpointing import save_checkpoint
+from ascendspeed.training import evaluate_and_print_results
+from ascendspeed.training import setup_model_and_optimizer
+from ascendspeed.training import train_step
+from ascendspeed.training import training_log
+from ascendspeed.utils import average_losses_across_data_parallel_group
+from ascendspeed.utils import calc_params_l2_norm
+from ascendspeed.utils import check_adlr_autoresume_termination
from deepspeed.accelerator import get_accelerator

def process_batch(batch):
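The diff truncates `process_batch` here; a minimal sketch of what such a function typically does in this Megatron-derived GLUE code (the batch keys `text`, `types`, `label`, `padding_mask` are assumptions carried over from upstream):

```python
from ascendspeed import get_args
from deepspeed.accelerator import get_accelerator

def process_batch(batch):
    """Move a GLUE batch onto the accelerator and produce model inputs."""
    args = get_args()
    device = get_accelerator().device_name()  # e.g. 'npu' on Ascend

    tokens = batch['text'].long().to(device).contiguous()
    types = batch['types'].long().to(device).contiguous()
    labels = batch['label'].long().to(device).contiguous()
    attention_mask = batch['padding_mask'].float().to(device).contiguous()
    if args.fp16:
        attention_mask = attention_mask.half()

    return tokens, types, labels, attention_mask
```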
@@ -15,7 +15,7 @@

"""CoLA dataset."""

-from megatron import print_rank_0
+from ascendspeed import print_rank_0
from tasks.data_utils import clean_text
from .data import GLUEAbstractDataset
@@ -20,7 +20,7 @@ from abc import abstractmethod

from torch.utils.data import Dataset

-from megatron import print_rank_0
+from ascendspeed import print_rank_0
from tasks.data_utils import build_sample
from tasks.data_utils import build_tokens_types_paddings_from_text
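The per-task dataset files in this diff (CoLA above, MNLI through STS-B below) all share one pattern: subclass GLUEAbstractDataset and implement `process_samples_from_single_path`. A minimal sketch of a hypothetical subclass, assuming the interface matches upstream Megatron's GLUE tasks (the TSV layout and record keys are assumptions):

```python
from ascendspeed import print_rank_0
from tasks.data_utils import clean_text
from .data import GLUEAbstractDataset


class ExampleGLUEDataset(GLUEAbstractDataset):
    """Hypothetical single-sentence task (e.g. acceptability judgments)."""

    def __init__(self, name, datapaths, tokenizer, max_seq_length):
        # Label assigned to unlabeled test rows so the pipeline still runs.
        self.test_label = 0
        super().__init__('example-task', name, datapaths,
                         tokenizer, max_seq_length)

    def process_samples_from_single_path(self, filename):
        """Read one TSV split and emit the dicts GLUEAbstractDataset expects."""
        print_rank_0(' > Processing {} ...'.format(filename))
        samples = []
        with open(filename, 'r') as f:
            for uid, line in enumerate(f):
                row = line.strip().split('\t')
                samples.append({'uid': uid,
                                'text_a': clean_text(row[-1]),
                                'text_b': None,
                                'label': int(row[1]) if len(row) > 1
                                         else self.test_label})
        return samples
```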
@@ -15,11 +15,11 @@

"""GLUE finetuning/evaluation."""

-from megatron import get_args
-from megatron import print_rank_0
-from megatron import get_tokenizer
-from megatron import mpu
-from megatron.model.classification import Classification
+from ascendspeed import get_args
+from ascendspeed import print_rank_0
+from ascendspeed import get_tokenizer
+from ascendspeed import mpu
+from ascendspeed.model.classification import Classification
from tasks.eval_utils import accuracy_func_provider
from tasks.finetune_utils import finetune, mse_forward_step
@@ -15,7 +15,7 @@

"""MNLI dataset."""

-from megatron import print_rank_0
+from ascendspeed import print_rank_0
from tasks.data_utils import clean_text
from .data import GLUEAbstractDataset

@@ -15,7 +15,7 @@

"""MRPC dataset."""

-from megatron import print_rank_0
+from ascendspeed import print_rank_0
from tasks.data_utils import clean_text
from .data import GLUEAbstractDataset

@@ -15,7 +15,7 @@

"""QNLI dataset."""

-from megatron import print_rank_0
+from ascendspeed import print_rank_0
from tasks.data_utils import clean_text
from .data import GLUEAbstractDataset

@@ -15,7 +15,7 @@

"""QQP dataset."""

-from megatron import print_rank_0
+from ascendspeed import print_rank_0
from tasks.data_utils import clean_text
from .data import GLUEAbstractDataset

@@ -15,7 +15,7 @@

"""RTE dataset."""

-from megatron import print_rank_0
+from ascendspeed import print_rank_0
from tasks.data_utils import clean_text
from .data import GLUEAbstractDataset

@@ -15,7 +15,7 @@

"""SST-2 dataset."""

-from megatron import print_rank_0
+from ascendspeed import print_rank_0
from tasks.data_utils import clean_text
from .data import GLUEAbstractDataset

@@ -15,7 +15,7 @@

"""STS-B dataset."""

-from megatron import print_rank_0
+from ascendspeed import print_rank_0
from tasks.data_utils import clean_text
from .data import GLUEAbstractDataset
@@ -20,8 +20,8 @@ import sys
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),
                                              os.path.pardir)))

-from megatron import get_args
-from megatron.initialize import initialize_megatron
+from ascendspeed import get_args
+from ascendspeed.initialize import initialize_megatron


def get_tasks_args(parser):
@@ -18,7 +18,7 @@
import os
import sys

-from megatron import get_args
+from ascendspeed import get_args
from tasks.orqa.evaluate_utils import ORQAEvaluator

def main():
Some files were not shown because too many files have changed in this diff.