sehoffmann/dmlcloud

A PyTorch library for easy distributed deep learning on HPC clusters. It supports both Slurm and MPI and offers a simple yet powerful API without unnecessary abstractions or overhead.

Highlights

  • Simple, yet powerful, API
  • Easy initialization of torch.distributed
  • Distributed checkpointing and metrics
  • Extensive logging and diagnostics
  • Wandb support
  • A wealth of useful utility functions
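
To illustrate the kind of boilerplate that "easy initialization of torch.distributed" refers to, here is a minimal sketch of doing it by hand under Slurm with plain PyTorch. It is not dmlcloud's API: the helper names `slurm_dist_config` and `init_from_slurm` are hypothetical, and the sketch assumes the job was launched with `srun` (so that `SLURM_PROCID`, `SLURM_NTASKS` and `SLURM_LOCALID` are set) and that `MASTER_ADDR` and `MASTER_PORT` are exported separately.

```python
import os


def slurm_dist_config(env=None):
    """Derive torch.distributed settings from Slurm's per-task environment variables.

    Hypothetical helper for illustration only; not part of dmlcloud.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env["SLURM_PROCID"]),         # global rank of this task
        "world_size": int(env["SLURM_NTASKS"]),   # total number of tasks
        "local_rank": int(env["SLURM_LOCALID"]),  # task index on this node
    }


def init_from_slurm(backend="nccl"):
    """Initialize torch.distributed by hand.

    Assumes MASTER_ADDR and MASTER_PORT are already set in the environment.
    """
    # Imported lazily so slurm_dist_config() can be used without torch installed.
    import torch.distributed as dist

    cfg = slurm_dist_config()
    dist.init_process_group(
        backend=backend,
        init_method="env://",
        rank=cfg["rank"],
        world_size=cfg["world_size"],
    )
    return cfg
```

The local rank is typically used to pick the GPU for the current process, for example with `torch.cuda.set_device(cfg["local_rank"])`.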

Installation

pip install dmlcloud

Minimal Example

TODO

Documentation

The official documentation is hosted on Read the Docs.

About

Distributed torch training using horovod and slurm
