Closed
Labels
benchmarks
core (Core Iris library development)
enhancement (New feature or request)
examples (Examples showcasing Iris APIs and usage)
help wanted (Extra attention is needed)
iris (Iris project issue)
test (Unit/integration test related issues)
Description
Problem Description
We currently depend on mpi4py and launch code with mpirun. We would like to remove both dependencies and use PyTorch (torch.distributed) instead. The code should look like this:
import torch
import torch.distributed as dist
# start_processes(fn, args, nprocs, join) with this signature lives in torch.multiprocessing
from torch.multiprocessing import start_processes


class Iris:
    def __init__(self, heap_size_bytes: int):
        self.rank = dist.get_rank()
        self.world_size = dist.get_world_size()

        # Device is cuda:<rank> (single node); fall back to CPU when running on gloo
        if torch.cuda.is_available():
            torch.cuda.set_device(self.rank)
            self.device = torch.device(f"cuda:{self.rank}")
        else:
            self.device = torch.device("cpu")

        # Allocate heap and record 64-bit base pointer
        self.heap = torch.empty(heap_size_bytes, dtype=torch.int8, device=self.device)
        self.heap_base = int(self.heap.data_ptr())

        # All-gather heap bases
        self.peer_heap_bases = [0 for _ in range(self.world_size)]
        dist.all_gather_object(self.peer_heap_bases, self.heap_base)


def _worker(local_rank: int, world_size: int, init_url: str, heap_size_bytes: int):
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(
        backend=backend,
        init_method=init_url,
        world_size=world_size,
        rank=local_rank,
    )
    iris = Iris(heap_size_bytes)
    dist.barrier()
    dist.destroy_process_group()


def main(nprocs: int = 2, heap_size_bytes: int = 1 << 20):
    init_url = "tcp://127.0.0.1:29500"
    # Each child is called as _worker(i, *args) for i in range(nprocs)
    start_processes(
        fn=_worker,
        args=(nprocs, init_url, heap_size_bytes),
        nprocs=nprocs,
        join=True,
    )


if __name__ == "__main__":
    n = torch.cuda.device_count() or 2
    main(nprocs=n, heap_size_bytes=1 << 20)
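For unit tests, the heap-base exchange proposed above could be checked on a machine without GPUs or mpirun. The following is only a minimal sketch under stated assumptions: it uses the gloo backend, a single in-process rank, and a file-based rendezvous instead of a TCP port. The names `exchange_heap_bases` and `run_single_rank_check` are hypothetical and not part of Iris, and the CPU tensor merely stands in for the device heap.

```python
import os
import tempfile
from typing import List

import torch
import torch.distributed as dist


def exchange_heap_bases(heap: torch.Tensor) -> List[int]:
    """All-gather every rank's heap base address (hypothetical helper)."""
    world_size = dist.get_world_size()
    bases = [0] * world_size
    # all_gather_object fills bases[i] with the object contributed by rank i
    dist.all_gather_object(bases, int(heap.data_ptr()))
    return bases


def run_single_rank_check(heap_size_bytes: int = 1 << 10) -> List[int]:
    """Run the exchange with world_size=1 on gloo, entirely in-process."""
    with tempfile.TemporaryDirectory() as tmp:
        # File-based rendezvous: no TCP port needed (assumes a POSIX-style path)
        init_file = os.path.join(tmp, "rendezvous")
        dist.init_process_group(
            backend="gloo",
            init_method=f"file://{init_file}",
            world_size=1,
            rank=0,
        )
        try:
            # CPU tensor used as a stand-in for the GPU heap
            heap = torch.empty(heap_size_bytes, dtype=torch.int8)
            bases = exchange_heap_bases(heap)
            # With one rank, the gathered list holds only our own base pointer
            assert bases[0] == heap.data_ptr()
            return bases
        finally:
            dist.destroy_process_group()
```

Calling `run_single_rank_check()` returns a one-element list containing the local base pointer. A multi-rank variant would presumably call the same helper from each spawned worker.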
Operating System
Any
CPU
Any
GPU
Any
ROCm Version
Any
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response