# Get the max iteration retrieved across the ranks.
if torch.distributed.is_initialized():
    iters_cuda = torch.tensor([iteration], dtype=torch.long, device='cuda' if 'nccl' in torch.distributed.get_backend() else 'cpu')
    torch.distributed.all_reduce(iters_cuda, op=torch.distributed.ReduceOp.MAX)
    max_iter = iters_cuda[0].item()
Why is the device of the communication tensor iters_cuda not chosen with the following condition instead:
device='cpu' if 'gloo' in torch.distributed.get_backend() else 'cuda'
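To make the difference between the two conditions concrete, they can be evaluated on plain backend-name strings. This is only a minimal sketch and not Megatron code: the helper names device_current and device_proposed are hypothetical, and the strings merely stand in for values that torch.distributed.get_backend() might return.

```python
def device_current(backend: str) -> str:
    # Condition as quoted from checkpointing.py: CUDA only when NCCL is named.
    return 'cuda' if 'nccl' in backend else 'cpu'


def device_proposed(backend: str) -> str:
    # Condition proposed in this question: CPU only when Gloo is named.
    return 'cpu' if 'gloo' in backend else 'cuda'


# Example backend names (assumed for illustration).
for backend in ('nccl', 'gloo', 'mpi'):
    print(backend, device_current(backend), device_proposed(backend))
```

In this sketch the two conditions agree for 'nccl' and 'gloo' and differ for 'mpi': the existing condition falls back to 'cpu', while the proposed one falls back to 'cuda'. Whether that fallback matters in practice depends on which backends are actually in use.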
The code in question is at lines 246-250 of megatron/megatron/training/checkpointing.py.