Commit aa27a80

address feedback
1 parent f384e90 commit aa27a80

File tree

2 files changed (+13, -11 lines)

distributed.rst

Lines changed: 1 addition & 0 deletions
@@ -193,5 +193,6 @@ Custom Extensions
    intermediate/rpc_tutorial
    intermediate/rpc_param_server_tutorial
    intermediate/rpc_async_execution
+   intermediate/monarch_distributed_tutorial
    advanced/rpc_ddp_tutorial
    advanced/generic_join

intermediate_source/monarch_titan_distributed_tutorial.rst

Lines changed: 12 additions & 11 deletions
@@ -9,6 +9,7 @@ Introduction

 As deep learning models continue to grow in size and complexity, training them efficiently requires coordinating computation across multiple GPUs and nodes.
 In this tutorial, you will learn how to easily set up and run large-scale distributed workflows using Monarch's actor framework together with TorchTitan, on a SLURM-managed cluster.
+Monarch will allow us to drive a large cluster of machines (organized into a mesh) as if we were developing in a single-host, single-process environment.

 What is Monarch?
 ^^^^^^^^^^^^^^^^
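
The added sentence above describes Monarch's single-controller model. As a rough illustration only (not part of this commit), here is a minimal sketch loosely based on Monarch's getting-started example; the specific names (``this_host``, ``spawn_procs``, ``Actor``, ``endpoint``, ``.call(...).get()``) are assumptions and may differ across nightlies.

.. code-block:: python

   # Sketch only: assumed API, loosely following Monarch's getting-started docs.
   # Treat this_host / spawn_procs / Actor / endpoint as assumptions, not the
   # tutorial's actual code.
   from monarch.actor import Actor, endpoint, this_host


   class Greeter(Actor):
       @endpoint
       def greet(self, name: str) -> str:
           # Runs in a worker process, but is invoked like a local method call.
           return f"hello, {name}"


   procs = this_host().spawn_procs(per_host={"procs": 4})  # a small local proc mesh
   greeters = procs.spawn("greeter", Greeter)              # one Greeter per process
   print(greeters.greet.call("monarch").get())             # collect replies from the mesh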
@@ -23,26 +24,25 @@ Monarch is an actor framework designed to streamline the development of distribu

 For more details, see the `Monarch documentation <https://meta-pytorch.org/monarch/generated/examples/getting_started.html>`_.

-Why Use Monarch with TorchTitan?
+Why Use Monarch?
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 TorchTitan is a PyTorch native library for pre-training at scale.
-While TorchTitan provides excellent primitives for distributed training, launching and managing these jobs across clusters can be complex. Monarch addresses this with:
+While TorchTitan provides excellent primitives for distributed training, launching and managing these jobs across clusters can slow down iteration. Monarch addresses this with:

 1. **Simplified cluster interaction**: Reserve and manage compute resources with simple async Python calls instead of writing bash scripts
 2. **Interactive development**: Modify and re-run training code on existing allocations without waiting for new resources
 3. **Unified workflow**: Seamlessly move between local testing and cluster execution with the same code
-4. **Failure supervision**: Handle errors and failures gracefully, with fine-grained recovery options from the controller

 Prerequisites
 -------------

-To run this tutorial, you must have:
+We rely on a nightly build of TorchTitan for this tutorial, so please ensure that your other PyTorch libraries are also tracking nightly builds:

 1. **Monarch nightly installed:**
    `Install script <https://github.com/meta-pytorch/monarch/blob/main/scripts/install_nightly.py>`_
 2. **TorchTitan nightly installed:**
-   `TorchTitan install instructions <https://github.com/pytorch/torchtitan?tab=readme-ov-fileightly-builds>`_
+   `TorchTitan install instructions <https://github.com/pytorch/torchtitan?tab=readme-ov-file#nightly-builds>`_
 3. **A valid Titan model config** and **tokenizer** in your working directory (e.g., ``debug_model.toml`` from `TorchTitan configs <https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/train_configs/debug_model.toml>`_).
 4. **SLURM cluster access:**
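
Not part of this commit, but as a quick, optional sanity check for the nightly-build prerequisite above, something like the following can confirm which builds are installed (``importlib.metadata`` is used because TorchTitan may not expose a ``__version__`` attribute; that detail is an assumption).

.. code-block:: python

   # Optional sanity check: confirm torch and torchtitan resolve to nightly/dev builds.
   from importlib.metadata import PackageNotFoundError, version

   import torch

   print("torch:", torch.__version__)  # nightly builds look like '2.x.0.devYYYYMMDD+...'
   try:
       print("torchtitan:", version("torchtitan"))
   except PackageNotFoundError:
       print("torchtitan is not installed; see the install instructions above")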
@@ -317,11 +317,13 @@ This is where Monarch's power becomes most apparent.
 **Monarch Highlights**:

 1. **Interactive iteration**: After reserving the machine allocation, you can adjust your logic
-   and re-spawn actors, without requesting new resources.
+   and re-spawn actors, without requesting new resources. SLURM's shared filesystem ensures
+   that framework/workspace changes are synchronized across workers.
 2. **Transparent logging**: All logs from remote workers stream back to your
    client in real-time, making debugging feel like local execution

-Workflow:
+**Workflow**:
+
 Reserve Machines → Create Proc Mesh → Configure Logging → Spawn Actors → Train → Cleanup

 .. code-block:: python
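
The tutorial's actual code follows the ``.. code-block:: python`` directive in the full file and is not shown in this diff. Purely as an illustration of the workflow shape listed above, here is a hypothetical, self-contained sketch; ``reserve_machines`` and ``create_proc_mesh`` are local stand-ins, not Monarch or TorchTitan APIs.

.. code-block:: python

   # Hypothetical stand-ins that mirror the workflow:
   # Reserve Machines -> Create Proc Mesh -> Configure Logging -> Spawn Actors -> Train -> Cleanup
   import asyncio


   async def reserve_machines(num_hosts: int) -> dict:
       """Stand-in for requesting a SLURM allocation."""
       print(f"reserved {num_hosts} hosts")
       return {"hosts": num_hosts}


   async def create_proc_mesh(allocation: dict, gpus_per_host: int) -> list:
       """Stand-in for laying out one process per GPU across the allocation."""
       return [(h, g) for h in range(allocation["hosts"]) for g in range(gpus_per_host)]


   async def main() -> None:
       allocation = await reserve_machines(num_hosts=2)            # Reserve Machines
       mesh = await create_proc_mesh(allocation, gpus_per_host=8)  # Create Proc Mesh
       print(f"driving a mesh of {len(mesh)} worker processes")
       # Configure Logging, Spawn Actors, Train, and Cleanup follow in the real tutorial code.


   asyncio.run(main())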
@@ -430,14 +432,13 @@ Finally, we tie everything together in a main function that kicks off the workfl

     logger.info("Workflow completed!")

-Summary
--------
+Conclusion
+-----------

 Congrats! In this tutorial, you learned how to combine Monarch's actor framework with
 TorchTitan for scalable distributed training.

 **Further Reading**

-- Monarch also integrates with TorchFT to provide per-step fault-tolerance across replicated workers.
-  You can find a comprehensive `proof of concept <https://github.com/meta-pytorch/torchft/tree/main/torchft/examples/slurm>`_ of this integration in the TorchFT repo.
+- Monarch also integrates with TorchFT to provide per-step fault-tolerance across replicated workers. You can find a comprehensive `proof of concept <https://github.com/meta-pytorch/torchft/tree/main/examples/monarch>`_ of this integration in the TorchFT repo.
 - For an interactive notebook covering similar topics to this tutorial, please consult `this Monarch example <https://github.com/meta-pytorch/monarch/blob/main/examples/slurm_titan.ipynb>`_.
