change megatron to ascendspeed

This commit is contained in:
chenzomi 2023-06-10 21:26:01 +08:00 committed by dilililiwhy
parent 106a415556
commit 37cc0b949d
121 changed files with 3017 additions and 398 deletions

View File

@ -1,25 +1,12 @@
## AscendSpeed
DeepSpeed version of NVIDIA's Megatron-LM that adds additional support for several features such as MoE model training, Curriculum Learning, 3D Parallelism, and others. The ```AscendSpeed/examples/``` folder includes example scripts about the features supported by DeepSpeed.
A DeepSpeed-based codebase that adds support for several features such as MoE model training, Curriculum Learning, 3D Parallelism, and others. The ```AscendSpeed/examples/``` folder includes example scripts for the features supported by DeepSpeed.
### Run on Azure and AzureML
To try out DeepSpeed on Azure, this fork of Megatron offers easy-to-use recipes and bash scripts. We strongly recommend to start with AzureML recipe in the ```examples/azureml``` folder. If you have a custom infrastructure (e.g. HPC clusters) or Azure VM based environment, please refer to the bash scripts in the ```examples/azure``` folder.
To try out DeepSpeed on Azure, this fork of AscendSpeed offers easy-to-use recipes and bash scripts. We strongly recommend starting with the AzureML recipe in the ```examples/azureml``` folder. If you have a custom infrastructure (e.g. HPC clusters) or an Azure VM based environment, please refer to the bash scripts in the ```examples/azure``` folder.
Below is Megatron-LM's original README:
Below is AscendSpeed's original README:
------
Megatron ([1](https://arxiv.org/pdf/1909.08053.pdf) and [2](https://arxiv.org/pdf/2104.04473.pdf)) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. This repository is for ongoing research on training large transformer language models at scale. We developed efficient, model-parallel (tensor and pipeline), and multi-node pre-training of transformer based models such as [GPT](https://arxiv.org/abs/2005.14165), [BERT](https://arxiv.org/pdf/1810.04805.pdf), and [T5](https://arxiv.org/abs/1910.10683) using mixed precision.
Below are some of the projects where we have directly used Megatron:
* [BERT and GPT Studies Using Megatron](https://arxiv.org/pdf/1909.08053.pdf)
* [BioMegatron: Larger Biomedical Domain Language Model](https://www.aclweb.org/anthology/2020.emnlp-main.379.pdf)
* [End-to-End Training of Neural Retrievers for Open-Domain Question Answering](https://arxiv.org/abs/2101.00408)
* [Large Scale Multi-Actor Generative Dialog Modeling](https://www.aclweb.org/anthology/2020.acl-main.8.pdf)
* [Local Knowledge Powered Conversational Agents](https://arxiv.org/abs/2010.10150)
* [MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models](https://www.aclweb.org/anthology/2020.emnlp-main.226.pdf)
* [RACE Reading Comprehension Dataset Leaderboard](http://www.qizhexie.com/data/RACE_leaderboard.html)
* [Scaling Language Model Training to a Trillion Parameters Using Megatron](https://arxiv.org/pdf/2104.04473.pdf)
* [Training Question Answering Models From Synthetic Data](https://www.aclweb.org/anthology/2020.emnlp-main.468.pdf)
Our codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. We vary hidden size, number of attention heads, and number of layers to arrive at a specific model size. As the model size increases, we also modestly increase the batch size. We leverage [NVIDIA's Selene supercomputer](https://www.top500.org/system/179842/) to perform scaling studies and use up to 3072 [A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for the largest model. The table below shows the model configurations along with the achieved FLOPs (both per GPU and aggregate over all GPUs). Note that the FLOPs are measured for end-to-end training, i.e., they include all operations including data loading, optimization, and even logging.
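As a rough cross-check of the table below, the parameter count of a GPT configuration can be estimated from the layer count, hidden size, and vocabulary size alone using the standard transformer approximation. The sketch below is illustrative only (it ignores biases and layernorms) and is not part of this repository:
<pre>
# Approximate GPT parameter count (illustrative sketch, not repository code).
# Each transformer layer carries ~12*h^2 weights (QKV + output projection
# plus the 4x-wide MLP); the embedding table adds V*h on top.
def approx_gpt_params(num_layers, hidden_size, vocab_size=51200):
    per_layer = 12 * hidden_size ** 2
    embedding = vocab_size * hidden_size
    return num_layers * per_layer + embedding

# e.g. 24 layers with hidden size 2304 lands near 1.7 billion parameters
print(f"{approx_gpt_params(24, 2304) / 1e9:.2f}B")
</pre>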
![Cases](images/cases_april2021.png)
@ -52,14 +39,14 @@ All the cases from 1 billion to 1 trillion parameters achieve more than 43% half
* [Collecting GPT Webtext Data](#collecting-gpt-webtext-data)
# Setup
We have tested Megatron with [NGC's PyTorch container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) version 20.12, which uses python 3.8, pytorch 1.8, cuda 11.1, and nccl 2.8.3.
We have tested this codebase with [NGC's PyTorch container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) version 20.12, which uses python 3.8, pytorch 1.8, cuda 11.1, and nccl 2.8.3.
To use this repository, please install the latest supported versions of PyTorch with GPU support (python 3.8, pytorch 1.8, cuda 11.1, and nccl 2.8.3 and above) and NVIDIA [APEX](https://github.com/NVIDIA/apex#quick-start). We strongly recommend using one of [NGC's recent PyTorch containers](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) (the latest compatible version at time of publication can be pulled with `docker pull nvcr.io/nvidia/pytorch:20.12-py3`). Data preprocessing requires [NLTK](https://www.nltk.org/install.html), though this is not required for training, evaluation, or downstream tasks.
<!--
To use megatron you can either clone the repo or install it via pip (make sure python3-dev is installed):
To use ascendspeed you can either clone the repo or install it via pip (make sure python3-dev is installed):
<pre>
pip install megatron-lm
pip install ascendspeed
</pre>
-->
@ -176,7 +163,7 @@ python pretrain_bert.py \
--data-path $DATA_PATH
</pre>
Further command line arguments are described in the source file [`arguments.py`](./megatron/arguments.py).
Further command line arguments are described in the source file [`arguments.py`](./ascendspeed/arguments.py).
## GPT Pretraining
@ -217,7 +204,7 @@ python pretrain_gpt.py \
--data-path $DATA_PATH \
</pre>
Further command line arguments are described in the source file [`arguments.py`](./megatron/arguments.py).
Further command line arguments are described in the source file [`arguments.py`](./ascendspeed/arguments.py).
## T5 Pretraining
@ -311,7 +298,7 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_<model>.py \
## GPT-3 Example
In `examples/pretrain_gpt3_175B.sh` we have provided an example of how to configure Megatron to run [GPT-3](https://arxiv.org/abs/2005.14165) with 175 billion parameters on 1024 GPUs. The script is designed for [slurm](https://slurm.schedmd.com/documentation.html) with [pyxis](https://github.com/NVIDIA/pyxis) plugin but can be easily adopted to any other scheduler. It uses 8-way and 16-way tensor and pipeline parallelism, respectively. With options `global-batch-size 1536` and `rampup-batch-size 16 16 5859375`, the training will start with global batch size 16 and linearly increase the global batch size to 1536 over 5,859,375 samples with incrmeental steps 16. The training dataset can be either a single set or a multiple datasets combined with a set of weights.
In `examples/pretrain_gpt3_175B.sh` we have provided an example of how to configure AscendSpeed to run [GPT-3](https://arxiv.org/abs/2005.14165) with 175 billion parameters on 1024 GPUs. The script is designed for [slurm](https://slurm.schedmd.com/documentation.html) with [pyxis](https://github.com/NVIDIA/pyxis) plugin but can be easily adapted to any other scheduler. It uses 8-way and 16-way tensor and pipeline parallelism, respectively. With options `global-batch-size 1536` and `rampup-batch-size 16 16 5859375`, the training will start with global batch size 16 and linearly increase the global batch size to 1536 over 5,859,375 samples with incremental steps of 16. The training dataset can be either a single set or multiple datasets combined with a set of weights.
With full global batch size of 1536 on 1024 A100 GPUs, each iteration takes around 32 seconds resulting in 138 teraFLOPs per GPU which is 44% of the theoretical peak FLOPs.
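For reference, the ramp implied by `rampup-batch-size 16 16 5859375` together with `global-batch-size 1536` can be reproduced in a few lines; this is a hedged sketch of the schedule, not the repository's implementation:
<pre>
# Hedged sketch of the linear batch-size ramp described above
# (not the repository's implementation).
def rampup_batch_size(consumed_samples, start=16, increment=16,
                      ramp_samples=5_859_375, final=1536):
    if consumed_samples >= ramp_samples:
        return final
    num_increments = (final - start) // increment     # 95 steps of +16
    samples_per_step = ramp_samples / num_increments  # ~61,678 samples each
    return start + increment * int(consumed_samples / samples_per_step)

for consumed in (0, 1_000_000, 3_000_000, 5_859_375):
    print(consumed, rampup_batch_size(consumed))
</pre>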
@ -336,7 +323,7 @@ python preprocess_data.py \
--workers 5 # works well for 10 CPU cores. Scale up accordingly.
</pre>
2. Use a custom samples mapping function in place of `megatron/data/realm_dataset_utils.get_block_samples_mapping` if required. To do this, you will need to implement a new function in C++ inside of `megatron/data/helpers.cpp`. The samples mapping data structure is used to select the data that will constitute every training sample in advance of the training loop.
2. Use a custom samples mapping function in place of `ascendspeed/data/realm_dataset_utils.get_block_samples_mapping` if required. To do this, you will need to implement a new function in C++ inside of `ascendspeed/data/helpers.cpp`. The samples mapping data structure is used to select the data that will constitute every training sample in advance of the training loop.
The samples mapping is responsible for holding all of the required metadata needed to construct the sample from one or more indexed datasets. In REALM, the samples mapping contains the start and end sentence indices, as well as the document index (to find the correct title for a body) and a unique ID for every block (a sketch of one such record follows this list).
3. Pretrain a BERT language model using `pretrain_bert.py`, with the sequence length equal to the block size in token ids. This model should be trained on the same indexed dataset that is used to supply the blocks for the information retrieval task.
In REALM, this is an uncased bert base model trained with the standard hyperparameters.
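To make the mapping metadata concrete, here is a minimal sketch of one record; the field names are hypothetical, since the real structure is built in C++ by `ascendspeed/data/helpers.cpp`:
<pre>
# Illustrative record for one entry of the block samples mapping
# (field names are hypothetical; the real structure is built in C++).
from dataclasses import dataclass

@dataclass
class BlockSampleEntry:
    start_sentence_idx: int  # index of the first sentence in the block
    end_sentence_idx: int    # index just past the last sentence
    doc_idx: int             # document index, used to fetch the title
    block_id: int            # unique ID for every block

print(BlockSampleEntry(start_sentence_idx=10, end_sentence_idx=14,
                       doc_idx=3, block_id=42))
</pre>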

View File

@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Megatron arguments."""
"""ascendspeed arguments."""
import argparse
import os
@ -21,10 +21,11 @@ import os
import torch
import deepspeed
def parse_args(extra_args_provider=None, defaults={},
ignore_unknown_args=False):
"""Parse all arguments."""
parser = argparse.ArgumentParser(description='Megatron-LM Arguments',
parser = argparse.ArgumentParser(description='AscendSpeed Arguments',
allow_abbrev=False)
# Standard arguments.

View File

@ -22,7 +22,8 @@ import numpy as np
from deepspeed.accelerator import get_accelerator
import torch
from megatron import (get_args,
from ascendspeed import (get_args,
is_rank_0,
mpu,
print_rank_0,
@ -161,7 +162,7 @@ def save_checkpoint(iteration, model, optimizer, lr_scheduler):
torch.save(state_dict, checkpoint_name)
if args.deepspeed:
#megatron model uses state_dict_for_save_checkpointing instead of the standard state_dict
#ascendspeed model uses state_dict_for_save_checkpointing instead of the standard state_dict
#state_dict is used by deepspeed for module saving so it needs to point to the right function
if args.no_pipeline_parallel:
original_state_dict = model[0].module.state_dict
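The redirection described in the comments above can be pictured as follows. This is a hedged sketch of the pattern, not the exact repository code: the helper name is hypothetical, `state_dict_for_save_checkpointing` follows the comment above, and `save_checkpoint` is DeepSpeed's engine method.
<pre>
# Hedged sketch of the state_dict redirection described above.
# DeepSpeed calls module.state_dict() while saving, so the checkpoint
# code temporarily points it at the checkpoint-friendly variant.
def save_with_redirected_state_dict(module, deepspeed_engine, save_dir):
    original_state_dict = module.state_dict
    try:
        module.state_dict = module.state_dict_for_save_checkpointing
        deepspeed_engine.save_checkpoint(save_dir)
    finally:
        module.state_dict = original_state_dict  # always restore the pointer
</pre>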
@ -329,16 +330,16 @@ def load_checkpoint(model, optimizer, lr_scheduler, load_arg='load', strict=True
try:
state_dict = torch.load(checkpoint_name, map_location='cpu')
except ModuleNotFoundError:
from megatron.fp16_deprecated import loss_scaler
from ascendspeed.fp16_deprecated import loss_scaler
# For backward compatibility.
print_rank_0(' > deserializing using the old code structure ...')
sys.modules['fp16.loss_scaler'] = sys.modules[
'megatron.fp16_deprecated.loss_scaler']
sys.modules['megatron.fp16.loss_scaler'] = sys.modules[
'megatron.fp16_deprecated.loss_scaler']
'ascendspeed.fp16_deprecated.loss_scaler']
sys.modules['ascendspeed.fp16.loss_scaler'] = sys.modules[
'ascendspeed.fp16_deprecated.loss_scaler']
state_dict = torch.load(checkpoint_name, map_location='cpu')
sys.modules.pop('fp16.loss_scaler', None)
sys.modules.pop('megatron.fp16.loss_scaler', None)
sys.modules.pop('ascendspeed.fp16.loss_scaler', None)
except BaseException as e:
print_rank_0('could not load the checkpoint')
print_rank_0(e)
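The except-branch above works by aliasing the old module path in `sys.modules` so that pickled references resolve against the relocated code. A minimal standalone illustration, with hypothetical module names:
<pre>
# Minimal illustration of the sys.modules aliasing used above
# (module names here are hypothetical).
import sys
import types

new_module = types.ModuleType("newpkg.loss_scaler")
# Pickles that reference the old path now resolve to the relocated module:
sys.modules["oldpkg.loss_scaler"] = new_module
# ... torch.load(checkpoint_name) would unpickle successfully here ...
sys.modules.pop("oldpkg.loss_scaler", None)  # clean up afterwards
</pre>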

View File

@ -18,13 +18,13 @@
import numpy as np
import torch
from megatron import (
from ascendspeed import (
get_args,
get_tokenizer,
mpu,
print_rank_0
)
from megatron.data.dataset_utils import (
from ascendspeed.data.dataset_utils import (
get_samples_mapping,
get_a_and_b_segments,
truncate_segments,

View File

@ -4,10 +4,10 @@ import time
import numpy as np
import torch
from megatron import get_args, get_tokenizer, mpu, print_rank_0
from megatron.data.dataset_utils import create_masked_lm_predictions, \
from ascendspeed import get_args, get_tokenizer, mpu, print_rank_0
from ascendspeed.data.dataset_utils import create_masked_lm_predictions, \
pad_and_convert_to_numpy
from megatron.data.data_samplers import MegatronPretrainingSampler
from ascendspeed.data.data_samplers import MegatronPretrainingSampler
from deepspeed.accelerator import get_accelerator
def make_attention_mask(source_block, target_block):
"""
@ -28,7 +28,7 @@ def get_one_epoch_dataloader(dataset, micro_batch_size=None):
micro_batch_size = args.micro_batch_size
num_workers = args.num_workers
# Use megatron's sampler with consumed samples set to 0 as
# Use ascendspeed's sampler with consumed samples set to 0 as
# this is only for evaluation and we don't intend to resume halfway.
# Also, set drop_last to False as we don't intend to remove
# the last batch
@ -162,7 +162,7 @@ def get_block_samples_mapping(block_dataset, title_dataset, data_prefix, num_epo
print_rank_0(' > building samples index mapping for {} ...'.format(
name))
from megatron.data import helpers
from ascendspeed.data import helpers
mapping_array = helpers.build_blocks_mapping(
block_dataset.doc_idx,
block_dataset.sizes,

View File

@ -20,8 +20,8 @@ import time
import numpy as np
import torch
from megatron import print_rank_0
from megatron import mpu
from ascendspeed import print_rank_0
from ascendspeed import mpu
class BlendableDataset(torch.utils.data.Dataset):
@ -49,7 +49,7 @@ class BlendableDataset(torch.utils.data.Dataset):
self.dataset_index = np.zeros(self.size, dtype=np.uint8)
self.dataset_sample_index = np.zeros(self.size, dtype=np.int64)
from megatron.data import helpers
from ascendspeed.data import helpers
helpers.build_blending_indices(self.dataset_index,
self.dataset_sample_index,
weights, num_datasets, self.size,

View File

@ -18,8 +18,8 @@
import torch
import random
from megatron import get_args
from megatron import mpu
from ascendspeed import get_args
from ascendspeed import mpu
def build_pretraining_data_loader(dataset, consumed_samples):
@ -29,7 +29,7 @@ def build_pretraining_data_loader(dataset, consumed_samples):
return None
args = get_args()
# Megatron sampler
# ascendspeed sampler
if args.dataloader_type == 'single':
batch_sampler = MegatronPretrainingSampler(
total_samples=len(dataset),

View File

@ -26,13 +26,13 @@ import collections
import numpy as np
import torch
from megatron import (
from ascendspeed import (
get_args,
mpu,
print_rank_0
)
from megatron.data.blendable_dataset import BlendableDataset
from megatron.data.indexed_dataset import make_dataset as make_indexed_dataset
from ascendspeed.data.blendable_dataset import BlendableDataset
from ascendspeed.data.indexed_dataset import make_dataset as make_indexed_dataset
from deepspeed.accelerator import get_accelerator
DSET_TYPE_BERT = 'standard_bert'
DSET_TYPE_ICT = 'ict'
@ -515,9 +515,9 @@ def _build_train_valid_test_datasets(data_prefix, data_impl, splits_string,
print_split_stats('test', 2)
def build_dataset(index, name):
from megatron.data.bert_dataset import BertDataset
from megatron.data.ict_dataset import ICTDataset
from megatron.data.t5_dataset import T5Dataset
from ascendspeed.data.bert_dataset import BertDataset
from ascendspeed.data.ict_dataset import ICTDataset
from ascendspeed.data.t5_dataset import T5Dataset
dataset = None
if splits[index + 1] > splits[index]:
# Get the pointer to the original doc-idx so we can set it later.
@ -689,7 +689,7 @@ def get_samples_mapping(indexed_dataset,
print_rank_0(' > building samples index mapping for {} ...'.format(
name))
# First compile and then import.
from megatron.data import helpers
from ascendspeed.data import helpers
samples_mapping = helpers.build_mapping(
indexed_dataset.doc_idx,
indexed_dataset.sizes,

View File

@ -4,12 +4,12 @@ import time
import numpy as np
import torch
from megatron import print_rank_0, mpu, logging
from megatron.data.blendable_dataset import BlendableDataset
from megatron.data.dataset_utils import get_datasets_weights_and_num_samples, get_split_by_range_, \
from ascendspeed import print_rank_0, mpu, logging
from ascendspeed.data.blendable_dataset import BlendableDataset
from ascendspeed.data.dataset_utils import get_datasets_weights_and_num_samples, get_split_by_range_, \
get_train_valid_test_split_
from megatron.data.mtf_dataset import MTFDataset
from megatron.data.indexed_dataset import make_dataset as make_indexed_dataset
from ascendspeed.data.mtf_dataset import MTFDataset
from ascendspeed.data.indexed_dataset import make_dataset as make_indexed_dataset
logger = logging.get_logger(__name__)

View File

@ -17,15 +17,16 @@
import os
import time
import numpy as np
import torch
from deepspeed.accelerator import get_accelerator
from megatron import mpu, is_rank_0, print_rank_0, get_args
from megatron.data.blendable_dataset import BlendableDataset
from megatron.data.dataset_utils import get_datasets_weights_and_num_samples
from megatron.data.dataset_utils import get_train_valid_test_split_
from megatron.data.indexed_dataset import make_dataset as make_indexed_dataset
from ascendspeed import mpu, is_rank_0, print_rank_0, get_args
from ascendspeed.data.blendable_dataset import BlendableDataset
from ascendspeed.data.dataset_utils import get_datasets_weights_and_num_samples
from ascendspeed.data.dataset_utils import get_train_valid_test_split_
from ascendspeed.data.indexed_dataset import make_dataset as make_indexed_dataset
def build_train_valid_test_datasets(data_prefix, data_impl, splits_string,
@ -286,7 +287,7 @@ def _build_index_mappings(name, data_prefix, documents, sizes,
start_time = time.time()
# Use C++ implementation for speed.
# First compile and then import.
from megatron.data import helpers
from ascendspeed.data import helpers
assert doc_idx.dtype == np.int32
assert sizes.dtype == np.int32
sample_idx = helpers.build_sample_idx(sizes, doc_idx, seq_length,

View File

@ -4,10 +4,10 @@ import random
import numpy as np
from torch.utils.data import Dataset
from megatron import get_tokenizer
from megatron import get_args
from megatron.data.dataset_utils import get_indexed_dataset_
from megatron.data.realm_dataset_utils import get_block_samples_mapping
from ascendspeed import get_tokenizer
from ascendspeed import get_args
from ascendspeed.data.dataset_utils import get_indexed_dataset_
from ascendspeed.data.realm_dataset_utils import get_block_samples_mapping
def make_attention_mask(source_block, target_block):
"""

View File

@ -17,10 +17,10 @@ import os
import shutil
import struct
from itertools import accumulate
import numpy as np
import torch
from megatron import print_rank_0
from ascendspeed import print_rank_0
def __best_fitting_dtype(vocab_size=None):

View File

@ -3,11 +3,11 @@
import numpy as np
import torch
from megatron import print_rank_0, get_tokenizer, get_args
from megatron.data.blendable_dataset import BlendableDataset
from megatron.data.dataset_utils import get_datasets_weights_and_num_samples, get_split_by_range_
from megatron.data.dataset_utils import get_train_valid_test_split_, get_indexed_dataset_
from megatron.data.gpt_dataset import GPTDataset
from ascendspeed import print_rank_0, get_tokenizer, get_args
from ascendspeed.data.blendable_dataset import BlendableDataset
from ascendspeed.data.dataset_utils import get_datasets_weights_and_num_samples, get_split_by_range_
from ascendspeed.data.dataset_utils import get_train_valid_test_split_, get_indexed_dataset_
from ascendspeed.data.gpt_dataset import GPTDataset
def build_train_valid_test_datasets(data_prefix, data_impl, splits_string,

View File

@ -20,8 +20,8 @@ import time
import numpy as np
import torch
from megatron import print_rank_0
from megatron.data.indexed_dataset import make_dataset as make_indexed_dataset
from ascendspeed import print_rank_0
from ascendspeed.data.indexed_dataset import make_dataset as make_indexed_dataset
class MTFDataset(torch.utils.data.Dataset):

View File

@ -22,8 +22,8 @@ import random
import torch
from torch.utils.data import Dataset
from megatron import print_rank_0, get_args, get_tokenizer, mpu
from megatron.data.biencoder_dataset_utils import make_attention_mask
from ascendspeed import print_rank_0, get_args, get_tokenizer, mpu
from ascendspeed.data.biencoder_dataset_utils import make_attention_mask
def get_open_retrieval_wiki_dataset():
args = get_args()

View File

@ -4,9 +4,9 @@ import time
import numpy as np
import torch
from megatron import mpu, print_rank_0
from megatron.data.dataset_utils import create_masked_lm_predictions, pad_and_convert_to_numpy
from megatron import get_args, get_tokenizer, print_rank_0, mpu
from ascendspeed import mpu, print_rank_0
from ascendspeed.data.dataset_utils import create_masked_lm_predictions, pad_and_convert_to_numpy
from ascendspeed import get_args, get_tokenizer, print_rank_0, mpu
from deepspeed.accelerator import get_accelerator
def get_one_epoch_dataloader(dataset, micro_batch_size=None):
@ -23,7 +23,7 @@ def get_one_epoch_dataloader(dataset, micro_batch_size=None):
sampler = torch.utils.data.SequentialSampler(dataset)
# importantly, drop_last must be False to get all the data.
assert False, 'DistributedBatchSampler deprecated, change the implementation'
from megatron.data.samplers import DistributedBatchSampler
from ascendspeed.data.samplers import DistributedBatchSampler
batch_sampler = DistributedBatchSampler(sampler,
batch_size=global_batch_size,
drop_last=False,
@ -152,7 +152,7 @@ def get_block_samples_mapping(block_dataset, title_dataset, data_prefix, num_epo
print_rank_0(' > building samples index mapping for {} ...'.format(
name))
from megatron.data import helpers
from ascendspeed.data import helpers
mapping_array = helpers.build_blocks_mapping(
block_dataset.doc_idx,
block_dataset.sizes,

View File

@ -6,8 +6,8 @@ import shutil
import numpy as np
import torch
from megatron import get_args
from megatron import mpu
from ascendspeed import get_args
from ascendspeed import mpu
def detach(tensor):

View File

@ -20,8 +20,8 @@ import collections
import numpy as np
import torch
from megatron import get_tokenizer
from megatron.data.dataset_utils import (
from ascendspeed import get_tokenizer
from ascendspeed.data.dataset_utils import (
create_masked_lm_predictions,
get_samples_mapping
)

View File

@ -2,8 +2,8 @@
# put some code used during development and manual testing of
# indexed_dataset.
from megatron.data import indexed_dataset
from megatron.tokenizer import build_tokenizer
from ascendspeed.data import indexed_dataset
from ascendspeed.tokenizer import build_tokenizer
import argparse
import os
import sys

View File

@ -15,7 +15,7 @@
import os
import torch
from torchvision import datasets, transforms
from megatron.data.autoaugment import ImageNetPolicy
from ascendspeed.data.autoaugment import ImageNetPolicy
def build_train_valid_datasets(data_path, crop_size=224, color_jitter=True):

View File

@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Megatron global variables."""
"""global variables."""
import os
import sys
@ -21,7 +21,7 @@ import time
import torch
from megatron.tokenizer import build_tokenizer
from ascendspeed.tokenizer import build_tokenizer
from .arguments import parse_args
from .microbatches import build_num_microbatches_calculator
from deepspeed.accelerator import get_accelerator

View File

@ -2,15 +2,15 @@ import sys
import torch
import torch.distributed as dist
from megatron import get_args
from megatron import mpu
from megatron.checkpointing import load_biencoder_checkpoint
from megatron.data.orqa_wiki_dataset import get_open_retrieval_wiki_dataset
from megatron.data.orqa_wiki_dataset import get_open_retrieval_batch
from megatron.data.biencoder_dataset_utils import get_one_epoch_dataloader
from megatron.data.realm_index import detach, OpenRetreivalDataStore
from megatron.model.biencoder_model import biencoder_model_provider
from megatron.training import get_model
from ascendspeed import get_args
from ascendspeed import mpu
from ascendspeed.checkpointing import load_biencoder_checkpoint
from ascendspeed.data.orqa_wiki_dataset import get_open_retrieval_wiki_dataset
from ascendspeed.data.orqa_wiki_dataset import get_open_retrieval_batch
from ascendspeed.data.biencoder_dataset_utils import get_one_epoch_dataloader
from ascendspeed.data.realm_index import detach, OpenRetreivalDataStore
from ascendspeed.model.biencoder_model import biencoder_model_provider
from ascendspeed.training import get_model
class IndexBuilder(object):

View File

@ -13,30 +13,31 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Megatron initialization."""
"""initialization."""
import random
import os
import time
import numpy as np
import torch
from megatron import get_adlr_autoresume
from megatron import get_args
from megatron import get_tensorboard_writer
from megatron import mpu
from megatron.global_vars import set_global_variables
from megatron.mpu import (set_tensor_model_parallel_rank,
set_tensor_model_parallel_world_size)
from deepspeed.accelerator import get_accelerator
import torch
import deepspeed
from deepspeed.accelerator import get_accelerator
from ascendspeed import get_adlr_autoresume
from ascendspeed import get_args
from ascendspeed import get_tensorboard_writer
from ascendspeed import mpu
from ascendspeed.global_vars import set_global_variables
from ascendspeed.mpu import (set_tensor_model_parallel_rank,
set_tensor_model_parallel_world_size)
def initialize_megatron(extra_args_provider=None, args_defaults=None,
ignore_unknown_args=False, allow_no_cuda=False):
"""Set global variables, initialize distributed, and
set autoresume and random seeds.
`allow_no_cuda` should not be set unless using megatron for cpu only
`allow_no_cuda` should not be set unless using ascendspeed for cpu only
data processing. In general this arg should not be set unless you know
what you are doing.
Returns a function to finalize distributed env initialization
@ -47,7 +48,7 @@ def initialize_megatron(extra_args_provider=None, args_defaults=None,
if not allow_no_cuda:
# Make sure cuda is available.
assert get_accelerator().is_available(), 'Megatron requires accelerator.'
assert get_accelerator().is_available(), 'AscendSpeed requires accelerator.'
# Parse args, build tokenizer, and set adlr-autoresume,
# tensorboard-writer, and timers.
@ -77,7 +78,7 @@ def initialize_megatron(extra_args_provider=None, args_defaults=None,
set_tensor_model_parallel_rank(args.rank)
return finish_mpu_init
else:
# Megatron's MPU is the master. Complete initialization right away.
# MPU is the master. Complete initialization right away.
finish_mpu_init()
# Initialize memory buffers.
@ -97,7 +98,7 @@ def _compile_dependencies():
if torch.distributed.get_rank() == 0:
start_time = time.time()
print('> compiling dataset index builder ...')
from megatron.data.dataset_utils import compile_helper
from ascendspeed.data.dataset_utils import compile_helper
compile_helper()
print('>>> done with dataset index builder. Compilation time: {:.3f} '
'seconds'.format(time.time() - start_time), flush=True)
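The pattern shown above, compile on rank 0 and only then import the extension, can be summarized in a hedged sketch; the barrier placement is an assumption about how the other ranks are kept waiting, not the exact repository code:
<pre>
# Hedged sketch of the compile-then-import pattern shown above; the
# barrier placement is an assumption, not the exact repository code.
import torch

def compile_helpers_once():
    if torch.distributed.get_rank() == 0:
        from ascendspeed.data.dataset_utils import compile_helper
        compile_helper()                  # rank 0 builds the C++ helpers
    torch.distributed.barrier()           # other ranks wait for the build
    from ascendspeed.data import helpers  # now safe to import everywhere
    return helpers
</pre>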

View File

@ -17,7 +17,7 @@
import math
from megatron import print_rank_0, get_args
from ascendspeed import print_rank_0, get_args
class AnnealingLR(object):
"""Anneals the learning rate."""

View File

@ -19,8 +19,8 @@ from abc import abstractmethod
import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors
from megatron import get_args
from megatron import mpu
from ascendspeed import get_args
from ascendspeed import mpu
from .module import MegatronModule
from deepspeed.accelerator import get_accelerator

View File

@ -15,7 +15,7 @@
import torch
import torch_npu
from megatron.model.enums import AttnMaskType
from ascendspeed.model.enums import AttnMaskType
class NPUFusedScaleMaskSoftmax(torch.nn.Module):

View File

@ -2,8 +2,8 @@ import torch
from torch import nn
from torch.nn import functional as F
from megatron import logging
from megatron.model.utils import log_debug_usage
from ascendspeed import logging
from ascendspeed.model.utils import log_debug_usage
logger = logging.get_logger(__name__)

View File

@ -18,8 +18,8 @@
from functools import partial
import torch
from megatron import get_args
from megatron import mpu
from ascendspeed import get_args
from ascendspeed import mpu
from .module import MegatronModule, fp32_to_float16
from .enums import AttnMaskType
@ -29,8 +29,8 @@ from .utils import init_method_normal
from .utils import scaled_init_method_normal
from deepspeed.pipe import PipelineModule, LayerSpec, TiedLayerSpec
from megatron.model import LayerNorm
from megatron.model.module import float16_to_fp32
from ascendspeed.model import LayerNorm
from ascendspeed.model.module import float16_to_fp32
from .language_model import EmbeddingPipe
from .transformer import ParallelTransformerLayerPipe
@ -94,7 +94,7 @@ class GPTModel(MegatronModule):
self.initialize_word_embeddings(init_method_normal)
def set_input_tensor(self, input_tensor):
"""See megatron.model.transformer.set_input_tensor()"""
"""See ascendspeed.model.transformer.set_input_tensor()"""
self.language_model.set_input_tensor(input_tensor)
def forward(self, input_ids, position_ids, attention_mask, labels=None,

View File

@ -18,13 +18,15 @@
import torch
import torch.nn.functional as F
from megatron import get_args
from megatron import mpu
from ascendspeed import get_args
from ascendspeed import mpu
from .module import MegatronModule
from megatron.model.enums import LayerType, AttnMaskType
from megatron.model.transformer import ParallelTransformer
from megatron.model.utils import get_linear_layer
from megatron.model.utils import init_method_normal, scaled_init_method_normal
from ascendspeed.model.enums import LayerType, AttnMaskType
from ascendspeed.model.transformer import ParallelTransformer
from ascendspeed.model.utils import get_linear_layer
from ascendspeed.model.utils import init_method_normal, scaled_init_method_normal
def parallel_lm_logits(input_, word_embeddings_weight, parallel_output,
bias=None):
@ -367,7 +369,7 @@ class TransformerLanguageModel(MegatronModule):
self._pooler_key = 'pooler'
def set_input_tensor(self, input_tensor):
""" See megatron.model.transformer.set_input_tensor()"""
""" See ascendspeed.model.transformer.set_input_tensor()"""
self.encoder.set_input_tensor(input_tensor)
def forward(self, enc_input_ids, enc_position_ids, enc_attn_mask,

View File

@ -24,14 +24,15 @@ from functools import partial
import torch
import torch.nn.functional as F
from megatron import get_args
from megatron import mpu
from megatron.model.module import MegatronModule, float16_to_fp32, fp32_to_float16
from megatron.model.enums import AttnMaskType, LayerType, AttnType
from megatron.model.utils import get_linear_layer, init_method_normal, scaled_init_method_normal, attention_mask_func, \
from ascendspeed import get_args
from ascendspeed import mpu
from ascendspeed.model.module import MegatronModule, float16_to_fp32, fp32_to_float16
from ascendspeed.model.enums import AttnMaskType, LayerType, AttnType
from ascendspeed.model.utils import get_linear_layer, init_method_normal, scaled_init_method_normal, attention_mask_func, \
openai_gelu, erf_gelu
from megatron.model.fused_softmax import NPUFusedScaleMaskSoftmax
from megatron.model.language_model import Pooler
from ascendspeed.model.fused_softmax import NPUFusedScaleMaskSoftmax
from ascendspeed.model.language_model import Pooler
import deepspeed
from deepspeed.accelerator import get_accelerator
@ -85,7 +86,6 @@ def apply_rotary_pos_emb(q, k, cos, sin, offset: int = 0):
return q_embed, k_embed
# TODO not able to build apex cpp extension for Fused cuda kernel RMSNorm
class RMSNorm(torch.nn.Module): # for cpu
def __init__(self, hidden_size, eps=1e-6):
"""
@ -916,7 +916,7 @@ class LlamaModel(MegatronModule):
parallel_output=self.parallel_output)
def set_input_tensor(self, input_tensor):
"""See megatron.model.transformer.set_input_tensor()"""
"""See ascendspeed.model.transformer.set_input_tensor()"""
self.language_model.set_input_tensor(input_tensor)
def forward(self, input_ids, attention_mask, labels=None, layer_past=None, get_key_value=False):

View File

@ -19,8 +19,8 @@ import torch
from torch.autograd import Variable
from torch.nn.parameter import Parameter
from deepspeed.accelerator import get_accelerator
from megatron import get_args
from megatron import mpu
from ascendspeed import get_args
from ascendspeed import mpu
_FLOAT_TYPES = (torch.FloatTensor, get_accelerator().FloatTensor)

View File

@ -18,14 +18,15 @@ import math
import torch
import torch.nn.functional as F
from megatron import get_args
from megatron import mpu
from megatron.model import LayerNorm
from megatron.model.fused_softmax import NPUFusedScaleMaskSoftmax
from megatron.model.enums import AttnMaskType, LayerType, AttnType
from megatron.model.fused_bias_gelu import bias_gelu_impl
from megatron.model.utils import attention_mask_func, openai_gelu, erf_gelu
from megatron.model.module import MegatronModule
from ascendspeed import get_args
from ascendspeed import mpu
from ascendspeed.model import LayerNorm
from ascendspeed.model.fused_softmax import NPUFusedScaleMaskSoftmax
from ascendspeed.model.enums import AttnMaskType, LayerType, AttnType
from ascendspeed.model.fused_bias_gelu import bias_gelu_impl
from ascendspeed.model.utils import attention_mask_func, openai_gelu, erf_gelu
from ascendspeed.model.module import MegatronModule
from torch import distributed as dist
import deepspeed
from deepspeed.moe.layer import MoE

View File

@ -19,7 +19,7 @@ import math
import torch
from megatron import get_args
from ascendspeed import get_args
def init_method_normal(sigma):
"""Init method based on N(0, sigma)."""

View File

@ -19,9 +19,9 @@ import math
import einops
import torch
import torch.nn.functional as F
from megatron import get_args
from megatron.model.transformer import ParallelTransformer
from megatron.model.utils import (
from ascendspeed import get_args
from ascendspeed.model.transformer import ParallelTransformer
from ascendspeed.model.utils import (
get_linear_layer,
init_method_normal,
scaled_init_method_normal,

View File

@ -35,7 +35,7 @@ from .random import get_cuda_rng_tracker
from .utils import divide
from .utils import split_tensor_along_last_dim
from .utils import VocabUtility
from megatron import get_args
from ascendspeed import get_args
import deepspeed.runtime.activation_checkpointing.checkpointing as ds_checkpointing
from deepspeed.accelerator import get_accelerator

View File

@ -24,8 +24,8 @@ from torch import _C
from deepspeed.accelerator import get_accelerator
from torch.utils.checkpoint import detach_variable
from megatron import get_args
from megatron.memory import allocate_mem_buff
from ascendspeed import get_args
from ascendspeed.memory import allocate_mem_buff
from .initialize import get_data_parallel_rank
from .initialize import get_tensor_model_parallel_group

View File

@ -18,8 +18,8 @@ import apex
import torch
from deepspeed.accelerator import get_accelerator
from megatron import get_args
from megatron.model import LayerNorm
from ascendspeed import get_args
from ascendspeed.model import LayerNorm
from .grad_scaler import ConstantGradScaler, DynamicGradScaler
from .optimizer import Float16OptimizerWithFloat16Params, FP32Optimizer

View File

@ -29,9 +29,9 @@ if get_accelerator().device_name() == 'cuda':
import amp_C
from megatron import mpu
from megatron.model.module import param_is_not_shared
from megatron.mpu.layers import param_is_not_tensor_parallel_duplicate
from ascendspeed import mpu
from ascendspeed.model.module import param_is_not_shared
from ascendspeed.mpu.layers import param_is_not_tensor_parallel_duplicate
from deepspeed.accelerator import get_accelerator
def clip_grad_norm_fp32(parameters, max_norm, norm_type=2):

View File

@ -21,9 +21,9 @@ from abc import abstractmethod
import torch
from megatron import get_timers
from megatron import mpu
from megatron import print_rank_0
from ascendspeed import get_timers
from ascendspeed import mpu
from ascendspeed import print_rank_0
from deepspeed.accelerator import get_accelerator
from .clip_grads import clip_grad_norm_fp32, count_zeros_fp32

View File

@ -17,14 +17,14 @@ from functools import reduce
import operator
import torch
from deepspeed.accelerator import get_accelerator
from megatron import get_args
from megatron import mpu
from ascendspeed import get_args
from ascendspeed import mpu
def _communicate(tensor_send_next, tensor_send_prev, recv_prev, recv_next,
use_ring_exchange=False):
"""Communicate tensors between stages. Used as helper method in other
communication methods that are used in megatron/schedules.py.
communication methods that are used in ascendspeed/schedules.py.
Takes the following arguments:
tensor_send_next: tensor to send to next rank (no tensor sent if

View File

@ -19,12 +19,14 @@ MINOR = 1.5
# Use the following formatting: (major, minor)
VERSION = (MAJOR, MINOR)
__version__ = '.'.join(map(str, VERSION))
__package_name__ = 'megatron-lm'
__contact_names__ = 'NVIDIA INC'
__url__ = 'https://github.com/NVIDIA/Megatron-LM'
__download_url__ = 'https://github.com/NVIDIA/Megatron-LM/releases'
__description__ = 'Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.'
__license__ = 'See https://github.com/NVIDIA/Megatron-LM/blob/master/LICENSE'
__keywords__ = 'deep learning, Megatron, gpu, NLP, nvidia, pytorch, torch, language'
__description__ = 'AscendSpeed for LLMs of Ascend'
__version__ = '0.0.1'
__author__ = 'Ascend'
__long_description__ = 'AscendSpeed for LLMs of Ascend'
__url__ = 'https://gitee.com/ascend/AscendSpeed'
__download_url__ = 'https://gitee.com/ascend/AscendSpeed/release'
__keywords__ = 'Ascend, language, deep learning, NLP'
__license__ = 'See https://gitee.com/ascend/AscendSpeed'
__package_name__ = 'ascendspeed'
__contact_names__ = 'Ascend'

View File

@ -18,15 +18,15 @@ import torch
import torch_npu
from torch.nn.parallel.distributed import DistributedDataParallel as torchDDP
from megatron import get_args
from megatron import get_num_microbatches
from megatron import get_timers
from megatron import mpu
from megatron import p2p_communication
from megatron import print_rank_0
from megatron.utils import unwrap_model
from megatron.model import DistributedDataParallel as LocalDDP
from megatron.model import Float16Module
from ascendspeed import get_args
from ascendspeed import get_num_microbatches
from ascendspeed import get_timers
from ascendspeed import mpu
from ascendspeed import p2p_communication
from ascendspeed import print_rank_0
from ascendspeed.utils import unwrap_model
from ascendspeed.model import DistributedDataParallel as LocalDDP
from ascendspeed.model import Float16Module
def clear_npu_overflow_flag():

View File

@ -20,19 +20,20 @@ import json
import os
import time
# These are needed to unwrap the model; would be nice to put these in ascendspeed.utils if possible?
import torch
import torch.nn.functional as F
from megatron import get_args
from megatron import get_tokenizer
from megatron import mpu
from megatron.utils import get_ltor_masks_and_position_ids, unwrap_model
from megatron.p2p_communication import recv_forward, send_forward
# These are needed to unwrap the model, would be nice to put these in megatron.utils if possible?
from torch.nn.parallel.distributed import DistributedDataParallel as torchDDP
from megatron.model import DistributedDataParallel as LocalDDP
from megatron.model import Float16Module
from ascendspeed import get_args
from ascendspeed import get_tokenizer
from ascendspeed import mpu
from ascendspeed.utils import get_ltor_masks_and_position_ids, unwrap_model
from ascendspeed.p2p_communication import recv_forward, send_forward
from ascendspeed.model import DistributedDataParallel as LocalDDP
from ascendspeed.model import Float16Module
from deepspeed.accelerator import get_accelerator
def get_batch(context_tokens):
"""Generate batch from context tokens."""
args = get_args()
@ -180,9 +181,9 @@ def generate_samples_input_from_file(model):
decode_tokens = decode_tokens[0].cpu().numpy().tolist()
trim_decode_tokens = tokenizer.detokenize(
decode_tokens)[raw_text_len:]
print("\nMegatron-LM:", trim_decode_tokens, flush=True)
print("\nAscendSpeed:", trim_decode_tokens, flush=True)
fname_out.write("\n\nMegatron-LM:")
fname_out.write("\n\nAscendSpeed:")
fname_out.write(trim_decode_tokens)
fname_out.write("\n")
@ -301,7 +302,7 @@ def generate_samples_interactive(model, print_frequency=24):
decode_tokens = decode_tokens[0].cpu().numpy().tolist()
trim_decode_tokens = tokenizer.detokenize(
decode_tokens)[raw_text_len:]
print("\nMegatron-LM:", trim_decode_tokens, flush=True)
print("\nAscendSpeed:", trim_decode_tokens, flush=True)
if mpu.is_pipeline_first_stage() \
and mpu.get_tensor_model_parallel_rank() == 0:
@ -313,7 +314,7 @@ def generate_samples_interactive(model, print_frequency=24):
decode_tokens = decode_tokens[0].cpu().numpy().tolist()
trim_decode_tokens = tokenizer.detokenize(
decode_tokens)[raw_text_len:]
print("\nMegatron-LM:", trim_decode_tokens, flush=True)
print("\nAscendSpeed:", trim_decode_tokens, flush=True)
input("\nPress Enter to continue >>>")

View File

@ -26,40 +26,41 @@ _TRAIN_START_TIME = time.time()
import torch
from torch.nn.parallel.distributed import DistributedDataParallel as torchDDP
from megatron import get_args
from megatron import get_timers
from megatron import get_tensorboard_writer
from megatron import get_current_global_batch_size
from megatron import get_num_microbatches
from megatron import is_last_rank
from megatron import update_num_microbatches
from megatron import mpu
from megatron import print_rank_0
from megatron import print_rank_last
from megatron.checkpointing import load_checkpoint
from megatron.checkpointing import save_checkpoint
from megatron.model import Float16Module
from megatron.optimizer import get_megatron_optimizer
from megatron.initialize import initialize_megatron
from megatron.initialize import write_args_to_tensorboard
from megatron.learning_rates import AnnealingLR
from megatron.model import DistributedDataParallel as LocalDDP
from megatron.utils import check_adlr_autoresume_termination
from megatron.utils import unwrap_model
from megatron.data.data_samplers import build_pretraining_data_loader
from megatron.utils import calc_params_l2_norm
from megatron.schedules import forward_backward_no_pipelining
from megatron.schedules import forward_backward_pipelining_without_interleaving
from megatron.schedules import forward_backward_pipelining_with_interleaving
from megatron.utils import report_memory, throughput_calculator, checkpoint_throughput_calculator
from deepspeed.accelerator import get_accelerator
from ascendspeed import get_args
from ascendspeed import get_timers
from ascendspeed import get_tensorboard_writer
from ascendspeed import get_current_global_batch_size
from ascendspeed import get_num_microbatches
from ascendspeed import is_last_rank
from ascendspeed import update_num_microbatches
from ascendspeed import mpu
from ascendspeed import print_rank_0
from ascendspeed import print_rank_last
from ascendspeed.checkpointing import load_checkpoint
from ascendspeed.checkpointing import save_checkpoint
from ascendspeed.model import Float16Module
from ascendspeed.optimizer import get_megatron_optimizer
from ascendspeed.initialize import initialize_megatron
from ascendspeed.initialize import write_args_to_tensorboard
from ascendspeed.learning_rates import AnnealingLR
from ascendspeed.model import DistributedDataParallel as LocalDDP
from ascendspeed.utils import check_adlr_autoresume_termination
from ascendspeed.utils import unwrap_model
from ascendspeed.data.data_samplers import build_pretraining_data_loader
from ascendspeed.utils import calc_params_l2_norm
from ascendspeed.schedules import forward_backward_no_pipelining
from ascendspeed.schedules import forward_backward_pipelining_without_interleaving
from ascendspeed.schedules import forward_backward_pipelining_with_interleaving
from ascendspeed.utils import report_memory, throughput_calculator, checkpoint_throughput_calculator
from ascendspeed.model.transformer import ParallelTransformerLayer
import deepspeed
from deepspeed.accelerator import get_accelerator
from deepspeed.compression.compress import init_compression, redundancy_clean
from megatron.model.transformer import ParallelTransformerLayer
from deepspeed.runtime.data_pipeline.data_routing.helper import convert_to_random_ltd
def print_datetime(string):
"""Note that this call will sync across all ranks."""
torch.distributed.barrier()
@ -76,7 +77,7 @@ def pretrain(train_valid_test_dataset_provider,
"""Main training program.
This function will run the followings in the order provided:
1) initialize Megatron.
1) initialize ascendspeed.
2) setup model, optimizer and lr schedule using the model_provider.
3) call train_val_test_data_provider to get train/val/test datasets.
4) train the model using the forward_step_func.
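A minimal caller therefore only has to supply the providers named in the docstring. The sketch below is illustrative: only the call shape follows the docstring, and the provider signatures and bodies are placeholders.
<pre>
# Hedged sketch of a pretrain() entry point; only the call shape follows
# the docstring above, the provider bodies are placeholders.
from ascendspeed.training import pretrain

def train_valid_test_datasets_provider(train_val_test_num_samples):
    ...  # build and return (train_ds, valid_ds, test_ds)

def model_provider(pre_process=True, post_process=True):
    ...  # build and return the model

def forward_step(data_iterator, model):
    ...  # run one forward pass and return (output, loss_func)

if __name__ == "__main__":
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step)
</pre>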
@ -109,9 +110,9 @@ def pretrain(train_valid_test_dataset_provider,
torch.distributed.all_reduce(start_time_tensor,
op=torch.distributed.ReduceOp.MIN)
_TRAIN_START_TIME = start_time_tensor.item()
print_rank_0('time to initialize megatron (seconds): {:.3f}'.format(
print_rank_0('time to initialize ascendspeed (seconds): {:.3f}'.format(
time.time() - _TRAIN_START_TIME))
print_datetime('after megatron is initialized')
print_datetime('after ascendspeed is initialized')
args = get_args()
timers = get_timers()
@ -483,7 +484,7 @@ def setup_model_and_optimizer(model_provider_func, teacher=False,
pp = mpu.get_pipeline_model_parallel_world_size()
if args.data_efficiency_curriculum_learning and build_train_valid_test_datasets_provider is not None:
train_ds = None
# Only need to build dataset on tp rank 0 since Megatron has the
# Only need to build dataset on tp rank 0 since ascendspeed has the
# broadcast_data() function that broadcasts data from tp rank 0.
if mpu.get_tensor_model_parallel_rank() == 0:
# Number of train/valid/test samples.
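The comment above refers to the usual tensor-parallel data pattern: only tp rank 0 touches the dataset, and every other rank receives the batch via `mpu.broadcast_data`. A hedged sketch of that pattern, with illustrative key names:
<pre>
# Hedged sketch of the rank-0 data pattern described above;
# key names are illustrative.
import torch
from ascendspeed import mpu

def get_broadcast_batch(data_iterator):
    keys = ['text']
    if mpu.get_tensor_model_parallel_rank() == 0:
        data = next(data_iterator)   # only tp rank 0 reads the dataset
    else:
        data = None                  # other ranks receive the broadcast
    data_b = mpu.broadcast_data(keys, data, torch.int64)
    return data_b['text']
</pre>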

View File

@ -25,13 +25,13 @@ if get_accelerator().device_name() == 'cuda':
from apex.multi_tensor_apply import multi_tensor_applier
import amp_C
from megatron import get_args
from megatron import print_rank_0
from megatron import get_adlr_autoresume
from megatron import mpu
from megatron.model.module import param_is_not_shared
from megatron.mpu.layers import param_is_not_tensor_parallel_duplicate
from megatron import get_num_microbatches
from ascendspeed import get_args
from ascendspeed import print_rank_0
from ascendspeed import get_adlr_autoresume
from ascendspeed import mpu
from ascendspeed.model.module import param_is_not_shared
from ascendspeed.mpu.layers import param_is_not_tensor_parallel_duplicate
from ascendspeed import get_num_microbatches
from deepspeed.accelerator import get_accelerator
def unwrap_model(model, module_instances=(torchDDP)):
return_list = True
@ -135,7 +135,7 @@ def print_params_min_max_norm(optimizer, iteration):
def check_adlr_autoresume_termination(iteration, model,
optimizer, lr_scheduler):
"""Check for autoresume signal and exit if it is received."""
from megatron.checkpointing import save_checkpoint
from ascendspeed.checkpointing import save_checkpoint
args = get_args()
autoresume = get_adlr_autoresume()

examples/README.md Normal file
View File

@ -0,0 +1,2 @@
## Recipes and Scripts

View File

@ -0,0 +1,48 @@
#!/bin/bash
# This is an example of pre-training BLOOM-550M
# without model-specific tuning or any parallelism techniques.
export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
source /usr/local/Ascend/ascend-toolkit/set_env.sh
RANK=0
WORLD_SIZE=1
DATA_PATH=./dataset/enwiki-gpt/gpt_text_sentence
CHECKPOINT_PATH=./ckpt
export LOCAL_RANK=0
python pretrain_llama.py \
--DDP-impl local \
--use-contiguous-buffers-in-ddp \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--micro-batch-size 4 \
--global-batch-size 8 \
--seq-length 2048 \
--max-position-embeddings 2048 \
--train-iters 500000 \
--lr-decay-iters 320000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file ./dataset/gpt2-vocab.json \
--merge-file ./dataset/gpt2-merges.txt \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 0.00015 \
--min-lr 1.0e-5 \
--lr-decay-style cosine \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-fraction .01 \
--checkpoint-activations \
--log-interval 10 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--fp16 | tee logs/train.log

View File

@ -21,23 +21,24 @@ from functools import partial
import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu
import deepspeed_npu
from megatron import get_args
from megatron import print_rank_0
from megatron import get_timers
from megatron import get_tokenizer
from megatron import mpu
from megatron.data.gpt_dataset import build_train_valid_test_datasets
from megatron.model import GPTModel, GPTModelPipe
from megatron.training import pretrain
from megatron.utils import get_ltor_masks_and_position_ids
from megatron.utils import average_losses_across_data_parallel_group
import deepspeed
import deepspeed_npu
from deepspeed.runtime.utils import see_memory_usage
from deepspeed.accelerator.real_accelerator import get_accelerator
from ascendspeed import get_args
from ascendspeed import print_rank_0
from ascendspeed import get_timers
from ascendspeed import get_tokenizer
from ascendspeed import mpu
from ascendspeed.data.gpt_dataset import build_train_valid_test_datasets
from ascendspeed.model import GPTModel, GPTModelPipe
from ascendspeed.training import pretrain
from ascendspeed.utils import get_ltor_masks_and_position_ids
from ascendspeed.utils import average_losses_across_data_parallel_group
from torch import nn
import torch.nn.functional as F

View File

@ -22,16 +22,16 @@ import torch_npu
from torch_npu.contrib import transfer_to_npu
import deepspeed_npu
from megatron import get_args
from megatron import print_rank_0
from megatron import get_timers
from megatron import get_tokenizer
from megatron import mpu
from megatron.data.gpt_dataset import build_train_valid_test_datasets
from megatron.model import LlamaModel, LlamaModelPipe
from megatron.training import pretrain
from megatron.utils import get_ltor_masks_and_position_ids
from megatron.utils import average_losses_across_data_parallel_group
from ascendspeed import get_args
from ascendspeed import print_rank_0
from ascendspeed import get_timers
from ascendspeed import get_tokenizer
from ascendspeed import mpu
from ascendspeed.data.gpt_dataset import build_train_valid_test_datasets
from ascendspeed.model import LlamaModel, LlamaModelPipe
from ascendspeed.training import pretrain
from ascendspeed.utils import get_ltor_masks_and_position_ids
from ascendspeed.utils import average_losses_across_data_parallel_group
import deepspeed
from deepspeed.runtime.utils import see_memory_usage

View File

@ -15,12 +15,11 @@
"""Setup for pip package."""
import os
import sys
import setuptools
if sys.version_info < (3,):
raise Exception("Python 2 is not supported by Megatron.")
raise Exception("Python 2 is not supported by AscendSpeed.")
__description__ = 'AscendSpeed for LLMs of Ascend'
__version__ = '0.0.1'
@ -51,7 +50,7 @@ def req_file(filename):
install_requires = req_file("requirements.txt")
setuptools.setup(
package_data={'ascendspeed':['megatron/data/Makefile']},
package_data={'ascendspeed':['ascendspeed/data/Makefile']},
name=__package_name__,
# Versions should comply with PEP440. For a discussion on single-sourcing
# the version across setup.py and the project code, see
@ -77,6 +76,7 @@ setuptools.setup(
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: 3.9',
# Additional Setting
'Environment :: Console',
'Natural Language :: English',

View File

@ -20,21 +20,21 @@ import numpy as np
import time
import torch
from megatron import get_args
from megatron import print_rank_0
from megatron import get_tokenizer
from megatron import mpu
from megatron.training import setup_model_and_optimizer, get_model
from megatron.mpu.mappings import gather_from_tensor_model_parallel_region
from ascendspeed import get_args
from ascendspeed import print_rank_0
from ascendspeed import get_tokenizer
from ascendspeed import mpu
from ascendspeed.training import setup_model_and_optimizer, get_model
from ascendspeed.mpu.mappings import gather_from_tensor_model_parallel_region
from megatron.utils import get_ltor_masks_and_position_ids, unwrap_model
from megatron.p2p_communication import recv_forward, send_forward
from ascendspeed.utils import get_ltor_masks_and_position_ids, unwrap_model
from ascendspeed.p2p_communication import recv_forward, send_forward
import pickle
import json
from torch.nn.parallel.distributed import DistributedDataParallel as torchDDP
from megatron.model.distributed import DistributedDataParallel as LocalDDP
from megatron.model.module import Float16Module
from ascendspeed.model.distributed import DistributedDataParallel as LocalDDP
from ascendspeed.model.module import Float16Module
from deepspeed.runtime.pipe import schedule
from deepspeed.accelerator import get_accelerator
@ -279,15 +279,15 @@ class EvalHarnessAdaptor(GPT2LM):
def tokenizer_encode(self, text):
"""Tokenize text *without* adding special tokens."""
# Splitting this into its own method in case we need to handle special cases for different tokenizers
from megatron.tokenizer.gpt2_tokenization import GPT2Tokenizer
from ascendspeed.tokenizer.gpt2_tokenization import GPT2Tokenizer
if isinstance(self.tokenizer.tokenizer, GPT2Tokenizer):
return self.tokenizer.tokenizer.encode(text)
else:
return self.tokenizer.tokenizer.encode(text, add_special_tokens=False)
from megatron.initialize import initialize_megatron
import megatron
from ascendspeed.initialize import initialize_megatron
import ascendspeed
from tools.convert_checkpoint.deepspeed_checkpoint import DeepSpeedCheckpoint
from tools.convert_checkpoint.deepspeed_to_megatron import _create_rank_checkpoint
@ -303,9 +303,9 @@ def override_args(args, override_args, skip_keys, skip_if_specified_keys):
# Note(Hesslow):
# The model loading is a bit convoluted.
# We want to parse out the model arguments from the checkpoint and use those to initialize megatron-ds.
# We want to parse out the model arguments from the checkpoint and use those to initialize ascendspeed-ds.
#
# However megatron-ds expects its arguments on the command line.
# However ascendspeed-ds expects its arguments on the command line.
# And at that point we don't know them.
#
# Instead we use Jason's way: we load the arguments from the checkpoint and then override _parse_args to return whatever args we want.
@ -314,12 +314,12 @@ def override_args(args, override_args, skip_keys, skip_if_specified_keys):
# In order to support this we _first_ parse the arguments normally, and then override them with the arguments from the checkpoint.
# Keeping the default-value of newer arguments.
#
# We then use the megatron deepspeed converter to load the deepspeed checkpoints as if they we're megatron checkpoints.
# We then use the ascendspeed converter to load the deepspeed checkpoints as if they were ascendspeed checkpoints.
def load_ds_checkpoint_and_setup_megatron(extra_args_provider):
# parse the megatorn args. But wait with initalizing megatron.
# parse the ascendspeed args, but wait before initializing ascendspeed.
# avoid printing the arguments, since they will later be overridden.
_print_args = megatron.arguments._print_args
megatron.arguments._print_args = lambda *_args, **kwarg: None
_print_args = ascendspeed.arguments._print_args
ascendspeed.arguments._print_args = lambda *_args, **kwarg: None
args = _parse_args(extra_args_provider)
ds_checkpoint = DeepSpeedCheckpoint(args.load,
@ -342,14 +342,14 @@ def load_ds_checkpoint_and_setup_megatron(extra_args_provider):
override_args(args, cp_args, skip_keys, skip_if_specified)
# stop megatron from reparsing the arguments.
megatron.global_vars._parse_args = lambda *_args, **kwarg: args
megatron.global_vars._GLOBAL_ARGS = args
# stop ascendspeed from reparsing the arguments.
ascendspeed.global_vars._parse_args = lambda *_args, **kwarg: args
ascendspeed.global_vars._GLOBAL_ARGS = args
initialize_megatron()
torch.distributed.barrier()
# Initializing megatron will update eg. tokenizer size. Override again.
# Initializing ascendspeed will update e.g. the tokenizer size. Override again.
override_args(args, cp_args, skip_keys, skip_if_specified)
# print final arguments.
@ -377,7 +377,7 @@ def load_ds_checkpoint_and_setup_megatron(extra_args_provider):
model._config.zero_enabled = zero_enabled
else:
model = get_model(model_provider)[0]
# Initialize megatron model using the parsed state dict.
# Initialize ascendspeed model using the parsed state dict.
sd = _create_rank_checkpoint(ds_checkpoint, None, mpu.get_tensor_model_parallel_rank(), mpu.get_pipeline_model_parallel_rank(), True)
model.load_state_dict(sd['model'], strict=True)
@ -399,7 +399,7 @@ def tasks_args(parser):
group.add_argument('--eval_fp32', default = False, action='store_true', help='Should the evaluation run in fp32')
return parser
from megatron.global_vars import _parse_args
from ascendspeed.global_vars import _parse_args
def main():
start = time.time()

View File

@ -21,10 +21,10 @@ from functools import partial
import torch
from megatron import get_args
from megatron import print_rank_last, is_last_rank
from megatron import mpu
from megatron.schedules import get_forward_backward_func
from ascendspeed import get_args
from ascendspeed import print_rank_last, is_last_rank
from ascendspeed import mpu
from ascendspeed.schedules import get_forward_backward_func
from tasks.finetune_utils import build_data_loader
from tasks.finetune_utils import process_batch
from deepspeed.accelerator import get_accelerator

View File

@ -19,19 +19,19 @@ from functools import partial
import torch
from megatron import get_args
from megatron import print_rank_0
from megatron import get_timers
from megatron import mpu
from megatron.checkpointing import load_checkpoint
from megatron.checkpointing import save_checkpoint
from megatron.training import evaluate_and_print_results
from megatron.training import setup_model_and_optimizer
from megatron.training import train_step
from megatron.training import training_log
from megatron.utils import average_losses_across_data_parallel_group
from megatron.utils import calc_params_l2_norm
from megatron.utils import check_adlr_autoresume_termination
from ascendspeed import get_args
from ascendspeed import print_rank_0
from ascendspeed import get_timers
from ascendspeed import mpu
from ascendspeed.checkpointing import load_checkpoint
from ascendspeed.checkpointing import save_checkpoint
from ascendspeed.training import evaluate_and_print_results
from ascendspeed.training import setup_model_and_optimizer
from ascendspeed.training import train_step
from ascendspeed.training import training_log
from ascendspeed.utils import average_losses_across_data_parallel_group
from ascendspeed.utils import calc_params_l2_norm
from ascendspeed.utils import check_adlr_autoresume_termination
from deepspeed.accelerator import get_accelerator
def process_batch(batch):

View File

@ -15,7 +15,7 @@
"""CoLA dataset."""
from megatron import print_rank_0
from ascendspeed import print_rank_0
from tasks.data_utils import clean_text
from .data import GLUEAbstractDataset

View File

@ -20,7 +20,7 @@ from abc import abstractmethod
from torch.utils.data import Dataset
from megatron import print_rank_0
from ascendspeed import print_rank_0
from tasks.data_utils import build_sample
from tasks.data_utils import build_tokens_types_paddings_from_text

View File

@ -15,11 +15,11 @@
"""GLUE finetuning/evaluation."""
from megatron import get_args
from megatron import print_rank_0
from megatron import get_tokenizer
from megatron import mpu
from megatron.model.classification import Classification
from ascendspeed import get_args
from ascendspeed import print_rank_0
from ascendspeed import get_tokenizer
from ascendspeed import mpu
from ascendspeed.model.classification import Classification
from tasks.eval_utils import accuracy_func_provider
from tasks.finetune_utils import finetune, mse_forward_step

View File

@ -15,7 +15,7 @@
"""MNLI dataset."""
from megatron import print_rank_0
from ascendspeed import print_rank_0
from tasks.data_utils import clean_text
from .data import GLUEAbstractDataset

View File

@ -15,7 +15,7 @@
"""MRPC dataset."""
from megatron import print_rank_0
from ascendspeed import print_rank_0
from tasks.data_utils import clean_text
from .data import GLUEAbstractDataset

View File

@ -15,7 +15,7 @@
"""QNLI dataset."""
from megatron import print_rank_0
from ascendspeed import print_rank_0
from tasks.data_utils import clean_text
from .data import GLUEAbstractDataset

View File

@ -15,7 +15,7 @@
"""QQP dataset."""
from megatron import print_rank_0
from ascendspeed import print_rank_0
from tasks.data_utils import clean_text
from .data import GLUEAbstractDataset

View File

@ -15,7 +15,7 @@
"""RTE dataset."""
from megatron import print_rank_0
from ascendspeed import print_rank_0
from tasks.data_utils import clean_text
from .data import GLUEAbstractDataset

View File

@ -15,7 +15,7 @@
"""SST-2 dataset."""
from megatron import print_rank_0
from ascendspeed import print_rank_0
from tasks.data_utils import clean_text
from .data import GLUEAbstractDataset

View File

@ -15,7 +15,7 @@
"""STS-B dataset."""
from megatron import print_rank_0
from ascendspeed import print_rank_0
from tasks.data_utils import clean_text
from .data import GLUEAbstractDataset

View File

@ -20,8 +20,8 @@ import sys
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),
os.path.pardir)))
from megatron import get_args
from megatron.initialize import initialize_megatron
from ascendspeed import get_args
from ascendspeed.initialize import initialize_megatron
def get_tasks_args(parser):

View File

@ -18,7 +18,7 @@
import os
import sys
from megatron import get_args
from ascendspeed import get_args
from tasks.orqa.evaluate_utils import ORQAEvaluator
def main():

Some files were not shown because too many files have changed in this diff.