diff --git a/CHANGE_LOG.md b/CHANGE_LOG.md new file mode 100644 index 000000000..6182c1eca --- /dev/null +++ b/CHANGE_LOG.md @@ -0,0 +1,19 @@ +### v0.4.4 Dec. 2021 +The system is successfully evaluated on a multi-node cluster. +The benchmark scripts are integrated with memory-centric tiling borrowed from DeepSpeed. +It trains an 18B model on WeChat Yard. + + +### v0.4.3 Nov. 2021 +The system is evaluated on an A100 SuperPod. +Some optimizations are developed to further improve the model scale and efficiency, including memory saving communication (MSC) and allocation cache (CACHE). +A severe bug caused by async chunk copy using streams is identified and fixed. +It trains a 50B model on an 8xA100 SuperPod node. + + +### v0.4.0 Nov. 2021 +The system is upgraded with a better memory tracer. +We further improve the max model scale over v0.3.0 (15B vs. 12B) on the WeChat Yard Platform. + +### v0.3.0 Oct. 2021 +Our initial version significantly surpasses DeepSpeed in both model scale and computing efficiency. diff --git a/README.md b/README.md index 9c824c6ff..7fe2ade2a 100644 --- a/README.md +++ b/README.md @@ -1,10 +1,9 @@ ## PatrickStar: Parallel Training of Large Language Models via a Chunk-based Memory Management ![logo](./logo.png) -### News -1. Nov. 2021, v0.4.3 releaed. PatrickStar is evaluated on A100 SuperPod. Some execution options are provided, including memory saving communication technique, memory allocation cache. It trains 40B model on a SuperPod node. -2. Nov. 2021, v0.4.0 released. With a better memory tracer, PatrickStar further improves the max model scale than v0.3.0 (15B vs 12B). -3. Oct. 2021, v0.3.0 released. Our initial version significantly surpasses DeepSpeed. + +### Recent Progress +See [CHANGE_LOG.md](./CHANGE_LOG.md). ### Meeting PatrickStar Pre-Trained Models (PTM) are becoming the hotspot of both NLP research and industry application. However, the training of PTMs requires enormous hardware resources, which makes it only accessible to small portion of people in the AI community. Now, **PatrickStar will make PTM training available to everyone!** @@ -13,10 +12,10 @@ Out of memory error (OOM) is the nightmare of every engineer training PTMs. To p ### System Design The idea of Patrick is like this. The non-model data (mainly activations) varies during training, but the current heterogenous training solutions are **statically** spliting the model data to CPU and GPU. To make better use of the GPU, PatrickStar proposes a **dynamic** memory scheduling with the help of a chunk-based memory management module. The memory management of PatrickStar supports offloading everything but the current computing part of the model to CPU to save GPU. In addition, chunk-based memory management is efficient for collective communication when scaling to multiple GPU. -See [this doc](./INSIDE.md) for the idea behind PatrickStar. +See the paper and [this doc](./INSIDE.md) for the idea behind PatrickStar. ### Results -In experiment, Patrickstar v0.4.3 is able to train a **15 Billion**(15B) param model with 8xTesla V100 GPU and 240GB GPU memory, which is twice as large as the state of art. And the performance of PatrickStar is better for models of the same size as well. The pstar is PatrickStar v0.4.3. The deeps indicates performance of DeepSpeed v0.4.3 using the official example [DeepSpeed example](https://github.com/microsoft/DeepSpeedExamples/blob/master/Megatron-LM-v1.1.5-ZeRO3/examples/ds_pretrain_gpt2-zero3.sh) zero3 stage with activation optimzations openning by default.
+In our experiments, PatrickStar v0.4.3 is able to train an **18 Billion** (18B) parameter model with 8x Tesla V100 GPUs and 240GB GPU memory, which is over twice as large as the state of the art. PatrickStar's performance is also better than DeepSpeed's for models of the same size. In the figure below, pstar is PatrickStar v0.4.3 and deeps is DeepSpeed v0.4.3 using the official [DeepSpeed example](https://github.com/microsoft/DeepSpeedExamples/blob/master/Megatron-LM-v1.1.5-ZeRO3/examples/ds_pretrain_gpt2-zero3.sh) at ZeRO-3 stage with activation optimizations enabled by default. ![alt perf](./doc/mgpu_scalability.png "performance testing result") diff --git a/doc/mgpu_scalability.png b/doc/mgpu_scalability.png index 91c46cea9..ece8cae1b 100644 Binary files a/doc/mgpu_scalability.png and b/doc/mgpu_scalability.png differ diff --git a/doc/optimization_options.md b/doc/optimization_options.md index 14f021cb4..3162767ab 100644 --- a/doc/optimization_options.md +++ b/doc/optimization_options.md @@ -1,11 +1,11 @@ This page explains the optimization options for benchmarking. -Optimizations is divided into PatrickStar-related ones and general ones. -General Optimizations can be applied to any PyTorch-based frameworks. +Optimizations are divided into PatrickStar-related ones and general ones. +General Optimizations can be applied to any PyTorch-based framework. ## General Optimizations 1. Activation Checkpoing (a.k.a gradient checkpointing in [PyTorch](https://pytorch.org/docs/stable/checkpoint.html)) `--use_ckp` -Make sure this option is open for large model training. It can largely save activation memory footprint at cost of recomputing. +Make sure this option is enabled for large model training. It greatly reduces the activation memory footprint at the cost of recomputation. 2. Activation Offloading `--with_activation_offload` @@ -14,21 +14,23 @@ Note you have to use activation checkpoing first. 3. CPU Embedding `--use_cpu_embedding` -nn.Embedding is conducted on CPU, save GPU memory. More importantly, it shrinks the chunk size. For some small model, the biggest layer is Embedding. Therefore, the chunk size has to larger than the embedding numel. +nn.Embedding is executed on the CPU to save GPU memory. More importantly, it shrinks the chunk size. For some small models, the largest layer is the embedding, so the chunk size would otherwise have to be larger than the embedding numel. + 4. Tiling Linear (a.k.a Memory-centric tiling in [DeepSpeed](https://deepspeed.readthedocs.io/en/stable/zero3.html#memory-centric-tiling)) `--with_tiling_linear` -Memory-centric tiling (MCT) is able to split a param tensor of linear into pieces, and they do not need to be stored in contiguous memory space. This will help reduce chunk size. To achieve the best performance you have to tune the in_splits/out_splits of the parameters of the function. +Memory-centric tiling (MCT) splits the parameter tensor of a linear layer into pieces that do not need to be stored in contiguous memory space. This helps reduce the chunk size. However, to achieve the best performance, you have to tune the in_splits/out_splits parameters of the function. ## PatrickStar-related Optmizations 1. Memory Saving Communication. `--with_mem_saving_com` -Use one-to-all communication to replace the original collective communication. More specifically, reduce scatter is replaced with Nx reduce. all gather is replaced with Nx bcast. In this way, we do not need to keep a Nx chunk buffer for distributed training, therefore saving the GPU memory.
This method also changes the CPU-GPU and intra-GPU communication volume. In general, it reduces CPU-GPU comm volume at a cost of increasing intra-GPU bcast comm volume and also lower the intra-GPU bcast bandwidth. However, for some cases, it can improve the overall performance of the system from such tradeoff. It is suitable for training an extremely large model with a computing cluster with high-quality intra-GPU communication bandwidth, i.e. 50B model on a node of SuperPod. Details in Merge Request #250. +Use one-to-all communication to replace the original collective communication. More specifically, reduce-scatter is replaced with Nx reduce, and all-gather is replaced with Nx bcast. In this way, we do not need to keep an Nx chunk buffer for distributed training, thereby saving GPU memory. This method also changes the CPU-GPU and inter-GPU communication volume. In general, it reduces the CPU-GPU communication volume at the cost of increasing the inter-GPU bcast volume and lowering the inter-GPU bcast bandwidth. However, in some cases, this tradeoff can improve the overall performance of the system. It is suitable for training an extremely large model on a computing cluster with high inter-GPU communication bandwidth, e.g. a 50B model on a SuperPod node. Details in Merge Request #250. 2. Memory Allocation Caching. `--with_mem_cache` -Use a cache to allocate and release chunk memory. The cache is a size-limited queue, whose capacity is default as 2. It is helpful for Memory Saving Communication in distributed training. It avoid frequent release and allocate memory for remote chunks. See detail in #241. +Use a cache to allocate and release chunk memory. The cache is a size-limited queue whose default capacity is 2. It is helpful for Memory Saving Communication in distributed training. It avoids frequently releasing and allocating memory for remote chunks. See details in #241. + 2. Hybrid ADAM: `--use_hybrid_adam` @@ -51,3 +53,28 @@ PatirckStar is famous for dynamic partition model data. With help of this flag y 6. Release Remote Chunk After Initialization. `release_after_init` The is a computing efficient irrelevant option used for distributed training. It allocates memory for remote chunks but release it immediately. In this way, we can make sure the model parameter is randomly initialized the same as a serial version. Solve the problem with random seed. It is used in combination with the `--res_check` option to check the correctness of distributed training. + +7. Adjusting the CPU and GPU memory quota of the memory tracer. +We provide ways to adjust the CPU and GPU memory usage quota of the memory tracer. This optimization is not exposed as command-line parameters. As shown in pretrain_bert_demo.py, there is a JSON config for the memory tracer setting. You can adjust the four values with the ratio suffix. + +`warmup_gpu_chunk_mem_ratio`: the maximum fraction of a GPU's memory that can be used for chunks during the warmup iteration. + +`overall_gpu_mem_ratio`: the available GPU memory size / the real GPU memory capacity. Increase the value if you encounter CPU or GPU OOM during iterations. + +`overall_cpu_mem_ratio`: the available CPU memory size / the real CPU memory capacity. Increase the value if you encounter CPU or GPU OOM during iterations. + +`margin_use_ratio`: the GPU space used to host optimizer states / the remaining GPU space after excluding the peak chunk-used space measured during the warmup FWD+BWD. + +`use_fake_dist`: a debug flag that simulates multiple GPUs on a single GPU. It was used when we lacked multi-GPU hardware.
After we have multi-GPU we deprecated this flag. + +``` +"mem_tracer": { + "use_async_mem_monitor": args.with_async_mem_monitor, + "warmup_gpu_chunk_mem_ratio": 0.1, + "overall_gpu_mem_ratio": 0.8, + "overall_cpu_mem_ratio": 0.8, + "margin_use_ratio": 0.8, + "use_fake_dist": False, + "with_static_partition": args.with_static_partition, + }, +``` diff --git a/examples/pretrain_bert_demo.py b/examples/pretrain_bert_demo.py index 77824e407..2d09ba04e 100644 --- a/examples/pretrain_bert_demo.py +++ b/examples/pretrain_bert_demo.py @@ -592,11 +592,16 @@ def visit_and_register_hooks(module): SEQ_LEN = 1024 NUM_LAYER = 65 NUM_HEAD = 16 - elif MODEL_NAME == "GPT3_15B": + elif MODEL_NAME == "GPT3_18B": HIDDEN_DIM = 4096 SEQ_LEN = 1024 NUM_LAYER = 78 NUM_HEAD = 16 + elif MODEL_NAME == "GPT3_17B": + HIDDEN_DIM = 4096 + SEQ_LEN = 1024 + NUM_LAYER = 90 + NUM_HEAD = 16 # The following configs comes from paper # Efficient Large-Scale Language Model Training on GPU Clusters # NV model is wider in hidden-size @@ -627,6 +632,11 @@ def visit_and_register_hooks(module): SEQ_LEN = 1024 NUM_LAYER = 50 NUM_HEAD = 16 + elif MODEL_NAME == "GPT_DS_50B": + HIDDEN_DIM = 8192 + SEQ_LEN = 1024 + NUM_LAYER = 62 + NUM_HEAD = 16 elif MODEL_NAME == "GPT_DS_60B": HIDDEN_DIM = 8192 SEQ_LEN = 1024 diff --git a/examples/ps_modeling_bert.py b/examples/ps_modeling_bert.py deleted file mode 100644 index 4e703ed7c..000000000 --- a/examples/ps_modeling_bert.py +++ /dev/null @@ -1,444 +0,0 @@ -# BSD 3-Clause License -# -# Copyright (C) 2021 THL A29 Limited, a Tencent company. All rights reserved. -# -# Redistribution and use in source and binary forms, with or without modification, -# are permitted provided that the following conditions are met: -# -# * Redistributions of source code must retain the above copyright notice, this -# list of conditions and the following disclaimer. -# -# * Redistributions in binary form must reproduce the above copyright notice, -# this list of conditions and the following disclaimer in the documentation -# and/or other materials provided with the distribution. -# -# * Neither the name of the psutil authors nor the names of its contributors -# may be used to endorse or promote products derived from this software without -# specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND -# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED -# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE -# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR -# ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES -# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; -# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON -# ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS -# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - -# coding=utf-8 -# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. -# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""PyTorch BERT model. """ -import torch -from torch import nn -from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss - -from transformers.modeling_outputs import ( - BaseModelOutputWithPastAndCrossAttentions, - BaseModelOutputWithPoolingAndCrossAttentions, - SequenceClassifierOutput, -) -from transformers import BertLayer, BertPreTrainedModel -from transformers.models.bert.modeling_bert import BertEmbeddings, BertPooler - -from optimizations.checkpoint import checkpoint as ckp - - -class BertEncoder(nn.Module): - def __init__(self, config): - super().__init__() - self.config = config - self.layer = nn.ModuleList( - [BertLayer(config) for _ in range(config.num_hidden_layers)] - ) - self.gradient_checkpointing = True - - def forward( - self, - hidden_states, - attention_mask=None, - head_mask=None, - encoder_hidden_states=None, - encoder_attention_mask=None, - past_key_values=None, - use_cache=None, - output_attentions=False, - output_hidden_states=False, - return_dict=True, - ): - all_hidden_states = () if output_hidden_states else None - all_self_attentions = () if output_attentions else None - all_cross_attentions = ( - () if output_attentions and self.config.add_cross_attention else None - ) - - next_decoder_cache = () if use_cache else None - for i, layer_module in enumerate(self.layer): - if output_hidden_states: - all_hidden_states = all_hidden_states + (hidden_states,) - - layer_head_mask = head_mask[i] if head_mask is not None else None - past_key_value = past_key_values[i] if past_key_values is not None else None - - if self.gradient_checkpointing and self.training: - - if use_cache: - use_cache = False - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs) - - return custom_forward - - layer_outputs = ckp( - create_custom_forward(layer_module), - hidden_states, - attention_mask, - layer_head_mask, - encoder_hidden_states, - encoder_attention_mask, - past_key_value, - output_attentions, - ) - else: - layer_outputs = layer_module( - hidden_states, - attention_mask, - layer_head_mask, - encoder_hidden_states, - encoder_attention_mask, - past_key_value, - output_attentions, - ) - - hidden_states = layer_outputs[0] - if use_cache: - next_decoder_cache += (layer_outputs[-1],) - if output_attentions: - all_self_attentions = all_self_attentions + (layer_outputs[1],) - if self.config.add_cross_attention: - all_cross_attentions = all_cross_attentions + (layer_outputs[2],) - - if output_hidden_states: - all_hidden_states = all_hidden_states + (hidden_states,) - - if not return_dict: - return tuple( - v - for v in [ - hidden_states, - next_decoder_cache, - all_hidden_states, - all_self_attentions, - all_cross_attentions, - ] - if v is not None - ) - return BaseModelOutputWithPastAndCrossAttentions( - last_hidden_state=hidden_states, - past_key_values=next_decoder_cache, - hidden_states=all_hidden_states, - attentions=all_self_attentions, - cross_attentions=all_cross_attentions, - ) - - -class BertModel(BertPreTrainedModel): - """ - The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a 
layer of - cross-attention is added between the self-attention layers, following the architecture described in `Attention is - all you need `__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, - Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin. - To behave as an decoder the model needs to be initialized with the :obj:`is_decoder` argument of the configuration - set to :obj:`True`. To be used in a Seq2Seq model, the model needs to initialized with both :obj:`is_decoder` - argument and :obj:`add_cross_attention` set to :obj:`True`; an :obj:`encoder_hidden_states` is then expected as an - input to the forward pass. - """ - - def __init__(self, config, add_pooling_layer=True): - super().__init__(config) - self.config = config - - self.embeddings = BertEmbeddings(config) - self.encoder = BertEncoder(config) - - self.pooler = BertPooler(config) if add_pooling_layer else None - - self.init_weights() - - def get_input_embeddings(self): - return self.embeddings.word_embeddings - - def set_input_embeddings(self, value): - self.embeddings.word_embeddings = value - - def _prune_heads(self, heads_to_prune): - """ - Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base - class PreTrainedModel - """ - for layer, heads in heads_to_prune.items(): - self.encoder.layer[layer].attention.prune_heads(heads) - - def forward( - self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - encoder_hidden_states=None, - encoder_attention_mask=None, - past_key_values=None, - use_cache=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - ): - r""" - encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj: - `(batch_size, sequence_length, hidden_size)`, `optional`): - Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if - the model is configured as a decoder. - encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): - Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in - the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``: - - 1 for tokens that are **not masked**, - - 0 for tokens that are **masked**. - past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` - If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids` - (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)` - instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`. - use_cache (:obj:`bool`, `optional`): - If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up - decoding (see :obj:`past_key_values`). 
- """ - output_attentions = ( - output_attentions - if output_attentions is not None - else self.config.output_attentions - ) - output_hidden_states = ( - output_hidden_states - if output_hidden_states is not None - else self.config.output_hidden_states - ) - return_dict = ( - return_dict if return_dict is not None else self.config.use_return_dict - ) - - if self.config.is_decoder: - use_cache = use_cache if use_cache is not None else self.config.use_cache - else: - use_cache = False - - if input_ids is not None and inputs_embeds is not None: - raise ValueError( - "You cannot specify both input_ids and inputs_embeds at the same time" - ) - elif input_ids is not None: - input_shape = input_ids.size() - elif inputs_embeds is not None: - input_shape = inputs_embeds.size()[:-1] - else: - raise ValueError("You have to specify either input_ids or inputs_embeds") - - batch_size, seq_length = input_shape - device = input_ids.device if input_ids is not None else inputs_embeds.device - - # past_key_values_length - past_key_values_length = ( - past_key_values[0][0].shape[2] if past_key_values is not None else 0 - ) - - if attention_mask is None: - attention_mask = torch.ones( - ((batch_size, seq_length + past_key_values_length)), device=device - ) - - if token_type_ids is None: - if hasattr(self.embeddings, "token_type_ids"): - buffered_token_type_ids = self.embeddings.token_type_ids[:, :seq_length] - buffered_token_type_ids_expanded = buffered_token_type_ids.expand( - batch_size, seq_length - ) - token_type_ids = buffered_token_type_ids_expanded - else: - token_type_ids = torch.zeros( - input_shape, dtype=torch.long, device=device - ) - - # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] - # ourselves in which case we just need to make it broadcastable to all heads. 
- extended_attention_mask: torch.Tensor = self.get_extended_attention_mask( - attention_mask, input_shape, device - ) - - # If a 2D or 3D attention mask is provided for the cross-attention - # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length] - if self.config.is_decoder and encoder_hidden_states is not None: - ( - encoder_batch_size, - encoder_sequence_length, - _, - ) = encoder_hidden_states.size() - encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length) - if encoder_attention_mask is None: - encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device) - encoder_extended_attention_mask = self.invert_attention_mask( - encoder_attention_mask - ) - else: - encoder_extended_attention_mask = None - - # Prepare head mask if needed - # 1.0 in head_mask indicate we keep the head - # attention_probs has shape bsz x n_heads x N x N - # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] - # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length] - head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers) - - embedding_output = self.embeddings( - input_ids=input_ids, - position_ids=position_ids, - token_type_ids=token_type_ids, - inputs_embeds=inputs_embeds, - past_key_values_length=past_key_values_length, - ) - encoder_outputs = self.encoder( - embedding_output, - attention_mask=extended_attention_mask, - head_mask=head_mask, - encoder_hidden_states=encoder_hidden_states, - encoder_attention_mask=encoder_extended_attention_mask, - past_key_values=past_key_values, - use_cache=use_cache, - output_attentions=output_attentions, - output_hidden_states=output_hidden_states, - return_dict=return_dict, - ) - sequence_output = encoder_outputs[0] - pooled_output = ( - self.pooler(sequence_output) if self.pooler is not None else None - ) - - if not return_dict: - return (sequence_output, pooled_output) + encoder_outputs[1:] - - return BaseModelOutputWithPoolingAndCrossAttentions( - last_hidden_state=sequence_output, - pooler_output=pooled_output, - past_key_values=encoder_outputs.past_key_values, - hidden_states=encoder_outputs.hidden_states, - attentions=encoder_outputs.attentions, - cross_attentions=encoder_outputs.cross_attentions, - ) - - -class BertForSequenceClassification(BertPreTrainedModel): - def __init__(self, config): - super().__init__(config) - self.num_labels = config.num_labels - self.config = config - - self.bert = BertModel(config) - classifier_dropout = ( - config.classifier_dropout - if config.classifier_dropout is not None - else config.hidden_dropout_prob - ) - self.dropout = nn.Dropout(classifier_dropout) - self.classifier = nn.Linear(config.hidden_size, config.num_labels) - - self.init_weights() - - def forward( - self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - labels=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - ): - r""" - labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): - Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ..., - config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss), - If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy). 
- """ - return_dict = ( - return_dict if return_dict is not None else self.config.use_return_dict - ) - - outputs = self.bert( - input_ids, - attention_mask=attention_mask, - token_type_ids=token_type_ids, - position_ids=position_ids, - head_mask=head_mask, - inputs_embeds=inputs_embeds, - output_attentions=output_attentions, - output_hidden_states=output_hidden_states, - return_dict=return_dict, - ) - - pooled_output = outputs[1] - - pooled_output = self.dropout(pooled_output) - logits = self.classifier(pooled_output) - - loss = None - if labels is not None: - if self.config.problem_type is None: - if self.num_labels == 1: - self.config.problem_type = "regression" - elif self.num_labels > 1 and ( - labels.dtype == torch.long or labels.dtype == torch.int - ): - self.config.problem_type = "single_label_classification" - else: - self.config.problem_type = "multi_label_classification" - - if self.config.problem_type == "regression": - loss_fct = MSELoss() - if self.num_labels == 1: - loss = loss_fct(logits.squeeze(), labels.squeeze()) - else: - loss = loss_fct(logits, labels) - elif self.config.problem_type == "single_label_classification": - loss_fct = CrossEntropyLoss() - loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) - elif self.config.problem_type == "multi_label_classification": - loss_fct = BCEWithLogitsLoss() - loss = loss_fct(logits, labels) - if not return_dict: - output = (logits,) + outputs[2:] - return ((loss,) + output) if loss is not None else output - - return SequenceClassifierOutput( - loss=loss, - logits=logits, - hidden_states=outputs.hidden_states, - attentions=outputs.attentions, - ) diff --git a/examples/run_bert.sh b/examples/run_bert.sh index f481700e1..53da4329a 100644 --- a/examples/run_bert.sh +++ b/examples/run_bert.sh @@ -123,7 +123,7 @@ LOG_DIR="./logs_${MODEL_NAME}" mkdir -p ${LOG_DIR} GIT_VER=`git rev-parse --short=5 HEAD` -LOG_FILE="log.${MODEL_NAME}_gpu_${GPU_NUM}_cs_${CS}_bs_${BS}_cpueb_${CPU_EBD}_lightseq_${LIGHTSEQ}_offload_${ACT_OFFLOAD}_SP_${SP}_AMM_${AMM}_MSC_${MSC}_CACHE_${CACHE}_${GIT_VER}" +LOG_FILE="log.${MODEL_NAME}_gpu_${GPU_NUM}_cs_${CS}_bs_${BS}_cpueb_${CPU_EBD}_lightseq_${LIGHTSEQ}_offload_${ACT_OFFLOAD}_SP_${SP}_AMM_${AMM}_MSC_${MSC}_CACHE_${CACHE}_TILING_${TILING}_${GIT_VER}" is_run_flag=`python ./benchmark/is_run_this_file.py --path "${LOG_DIR}" --file "${LOG_FILE}"` echo is_run_flag $is_run_flag diff --git a/setup.py b/setup.py index 1c8406801..b999b98a1 100644 --- a/setup.py +++ b/setup.py @@ -41,7 +41,7 @@ def fetch_requirements(path): setup( name="patrickstar", - version="0.4.3", + version="0.4.4", description="PatrickStart library", long_description="PatrickStar: Parallel Training of Large Language Models via a Chunk-based Parameter Server", long_description_content_type="text/markdown",
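Note on the `mem_tracer` ratios documented in `doc/optimization_options.md` above: the short Python sketch below illustrates one way the four ratio values could translate into concrete byte budgets. It is only an illustrative reading of the documented semantics, not PatrickStar's actual implementation; the function name `chunk_memory_budgets`, the hardware sizes, and the example chunk peak are hypothetical.

```python
# Illustrative sketch only: how the documented mem_tracer ratios *could*
# translate into byte budgets. Names and example sizes are hypothetical;
# this is not PatrickStar's actual implementation.

GB = 1024 ** 3


def chunk_memory_budgets(
    real_gpu_mem=32 * GB,            # physical memory of one GPU (example value)
    real_cpu_mem=240 * GB,           # host memory available to the process (example value)
    overall_gpu_mem_ratio=0.8,       # usable fraction of GPU memory
    overall_cpu_mem_ratio=0.8,       # usable fraction of CPU memory
    warmup_gpu_chunk_mem_ratio=0.1,  # fraction usable for chunks during the warmup iteration
    margin_use_ratio=0.8,            # fraction of leftover GPU space given to optimizer states
    peak_chunk_used_mem=10 * GB,     # peak chunk memory observed after warmup FWD+BWD (example)
):
    overall_gpu = real_gpu_mem * overall_gpu_mem_ratio
    overall_cpu = real_cpu_mem * overall_cpu_mem_ratio
    # During warmup, only a small slice of the usable GPU memory holds chunks
    # (the ratio is assumed here to apply to the usable budget).
    warmup_gpu_chunk_budget = overall_gpu * warmup_gpu_chunk_mem_ratio
    # After warmup, the space left beyond the measured chunk peak (the "margin")
    # can partly host optimizer states on the GPU.
    margin = max(overall_gpu - peak_chunk_used_mem, 0)
    gpu_optimizer_state_budget = margin * margin_use_ratio
    return {
        "overall_gpu": overall_gpu,
        "overall_cpu": overall_cpu,
        "warmup_gpu_chunk_budget": warmup_gpu_chunk_budget,
        "gpu_optimizer_state_budget": gpu_optimizer_state_budget,
    }


if __name__ == "__main__":
    for name, num_bytes in chunk_memory_budgets().items():
        print(f"{name}: {num_bytes / GB:.1f} GB")
```

With the example numbers above (a 32GB GPU and the 0.8/0.1/0.8 ratios from the config), this sketch reports roughly 25.6 GB of usable GPU memory, about 2.6 GB of warmup chunk budget, and about 12.5 GB of optimizer-state budget given a 10 GB chunk peak.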