A PyTorch library for easy distributed deep learning on HPC clusters. It supports both Slurm and MPI and avoids unnecessary abstractions and overhead.
- Simple, yet powerful, API
- Easy initialization of torch.distributed
- Distributed checkpointing and metrics
- Extensive logging and diagnostics
- Wandb support
- A wealth of useful utility functions
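
To give a sense of what initializing torch.distributed on a cluster involves, here is a minimal sketch in plain PyTorch. It is not dmlcloud's API: the helper names `detect_rank_info` and `init_distributed` are hypothetical, and the sketch assumes processes are launched with `srun` (Slurm) or Open MPI's `mpirun`, which export the environment variables read below.

```python
import os


def detect_rank_info(env=None):
    """Return (rank, world_size, local_rank) from Slurm or Open MPI variables.

    Hypothetical helper for illustration; not part of dmlcloud.
    """
    env = os.environ if env is None else env
    if "SLURM_PROCID" in env:
        # Exported by srun for every task in a job step.
        return (
            int(env["SLURM_PROCID"]),
            int(env["SLURM_NTASKS"]),
            int(env["SLURM_LOCALID"]),
        )
    if "OMPI_COMM_WORLD_RANK" in env:
        # Exported by Open MPI's mpirun for every process.
        return (
            int(env["OMPI_COMM_WORLD_RANK"]),
            int(env["OMPI_COMM_WORLD_SIZE"]),
            int(env["OMPI_COMM_WORLD_LOCAL_RANK"]),
        )
    # No launcher detected: fall back to a single process.
    return 0, 1, 0


def init_distributed(backend="gloo"):
    """Initialize torch.distributed using the detected rank information."""
    # Imported lazily so that rank detection also works without torch installed.
    import torch.distributed as dist

    rank, world_size, local_rank = detect_rank_info()
    # The env:// rendezvous needs an address and port. These defaults are an
    # assumption that only works on a single node; on multiple nodes,
    # MASTER_ADDR must point to the host running rank 0.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(
        backend=backend, init_method="env://", rank=rank, world_size=world_size
    )
    return rank, world_size, local_rank
```

Each process would call `init_distributed()` once at startup and use the returned `local_rank` to pick its GPU. Handling the multi-node address lookup and the launcher differences is the kind of boilerplate a library like this aims to remove.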
pip install dmlcloud
TODO
You can find the official documentation at Read the Docs.