[Issue]: Get rid of mpi4py and MPI Dependencies #153

@mawad-amd

Problem Description

We currently depend on mpi4py and launch code with mpirun. We would like to remove both dependencies and use PyTorch (torch.distributed) instead. The code should look roughly like this:

import torch
import torch.distributed as dist
from torch.multiprocessing import start_processes

class Iris:
    def __init__(self, heap_size_bytes: int):
        self.rank = dist.get_rank()
        self.world_size = dist.get_world_size()

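        # Device is cuda:<rank>; fall back to CPU so the gloo path in _worker also runs
        if torch.cuda.is_available():
            self.device = torch.device(f"cuda:{self.rank % torch.cuda.device_count()}")
            torch.cuda.set_device(self.device)
        else:
            self.device = torch.device("cpu")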

        # Allocate heap and record 64-bit base pointer
        self.heap = torch.empty(heap_size_bytes, dtype=torch.int8, device=self.device)
        self.heap_base = int(self.heap.data_ptr())

        # All-gather heap bases
        self.peer_heap_bases = [0 for _ in range(self.world_size)]
        dist.all_gather_object(self.peer_heap_bases, self.heap_base)


def _worker(local_rank: int, world_size: int, init_url: str, heap_size_bytes: int):
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(
        backend=backend,
        init_method=init_url,
        world_size=world_size,
        rank=local_rank
    )

    iris = Iris(heap_size_bytes)
    dist.barrier()
    dist.destroy_process_group()

def main(nprocs: int = 2, heap_size_bytes: int = 1 << 20):
    init_url = "tcp://127.0.0.1:29500"
    start_processes(
        fn=_worker,
        args=(nprocs, init_url, heap_size_bytes),
        nprocs=nprocs,
        join=True,
    )

if __name__ == "__main__":
    n = torch.cuda.device_count() or 2
    main(nprocs=n, heap_size_bytes=1 << 20)
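
One possible refinement of the sketch above: the rendezvous URL hardcodes port 29500, so two jobs started on the same machine would collide. The helper below is a minimal sketch of asking the OS for a free port instead. The names `find_free_port` and `make_init_url` are hypothetical and not part of Iris or PyTorch; only the standard-library `socket` module is used.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port on `host` (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the kernel pick any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


def make_init_url(host: str = "127.0.0.1") -> str:
    """Build a tcp:// rendezvous URL usable as init_method (hypothetical helper)."""
    return f"tcp://{host}:{find_free_port(host)}"
```

In `main`, this would replace the literal with `init_url = make_init_url()`, called once in the parent process so that every worker receives the same URL. The port is released before the workers bind it, so another process could in principle claim it in between; for a single-node launcher that small race is usually acceptable.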

Operating System

Any

CPU

Any

GPU

Any

ROCm Version

Any

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

Labels

benchmarks, core (Core Iris library development), enhancement (New feature or request), examples (Examples showcasing Iris APIs and usage), help wanted (Extra attention is needed), iris (Iris project issue), test (Unit/integration test related issues)
