intermediate_source/monarch_titan_distributed_tutorial.rst

Introduction
------------

As deep learning models continue to grow in size and complexity, training them efficiently requires coordinating computation across multiple GPUs and nodes.
In this tutorial, you will learn how to set up and run large-scale distributed workflows using Monarch's actor framework together with TorchTitan on a SLURM-managed cluster.
Monarch lets us drive a large cluster of machines (organized into a mesh) as if we were developing in a single-host, single-process environment.

What is Monarch?
^^^^^^^^^^^^^^^^

Monarch is an actor framework designed to streamline the development of distributed applications.

For more details, see the `Monarch documentation <https://meta-pytorch.org/monarch/generated/examples/getting_started.html>`_.

Why Use Monarch?
^^^^^^^^^^^^^^^^

TorchTitan is a PyTorch native library for pre-training at scale.

While TorchTitan provides excellent primitives for distributed training, launching and managing these jobs across clusters can slow down iteration. Monarch addresses this with:

1. **Simplified cluster interaction**: Reserve and manage compute resources with simple async Python calls instead of writing bash scripts (see the sketch after this list)
2. **Interactive development**: Modify and re-run training code on existing allocations without waiting for new resources
3. **Unified workflow**: Seamlessly move between local testing and cluster execution with the same code
4. **Failure supervision**: Handle errors and failures gracefully, with fine-grained recovery options from the controller
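
To make the first point concrete, here is a minimal sketch of what driving an allocation from plain Python can look like. The actor surface (``Actor``, ``endpoint``, ``spawn``, ``call``) follows Monarch's getting-started example; ``reserve_slurm_hosts`` is a hypothetical stand-in for whichever allocator entry point your Monarch setup exposes, not a confirmed API:

.. code-block:: python

    from monarch.actor import Actor, endpoint

    class Trainer(Actor):
        """Toy stand-in for the TorchTitan trainer actor built later on."""

        @endpoint
        def step(self, n: int) -> int:
            # A real trainer would run one training step here.
            return n + 1

    async def demo() -> None:
        # Hypothetical helper: reserve SLURM hosts and receive a process mesh.
        procs = await reserve_slurm_hosts(num_hosts=2, gpus_per_host=8)

        # Spawn one Trainer per process; endpoint calls fan out over the mesh.
        trainers = procs.spawn("trainers", Trainer)
        results = await trainers.step.call(0)

The same driver shape runs unchanged against a local process mesh, which is what makes the local-to-cluster workflow in point 3 possible.
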
Prerequisites
-------------

We rely on a nightly build of Titan for this tutorial, so please ensure that other Torch libraries are tracking nightly builds (a quick sanity check follows this list):

3. **A valid Titan model config** and **tokenizer** in your working directory (e.g., ``debug_model.toml`` from `TorchTitan configs <https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/train_configs/debug_model.toml>`_).
4. **SLURM cluster access:**
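
As a quick sanity check for the prerequisites above, the sketch below confirms that the installed Torch build is a nightly and that the Titan config parses. It assumes only that ``torch`` is installed and that ``debug_model.toml`` sits in the working directory:

.. code-block:: python

    import tomllib  # Python 3.11+; use the third-party "toml" package on older versions

    import torch

    # Nightly wheels carry a ".dev" tag, e.g. "2.6.0.dev20241001+cu124".
    print("torch:", torch.__version__)
    assert "dev" in torch.__version__, "expected a nightly (dev) build of torch"

    # Confirm the Titan model config is present and parses.
    with open("debug_model.toml", "rb") as f:
        cfg = tomllib.load(f)
    print("model section:", cfg.get("model", {}))
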
This is where Monarch's power becomes most apparent.

**Monarch Highlights**:

1. **Interactive iteration**: After reserving the machine allocation, you can adjust your logic
   and re-spawn actors without requesting new resources (see the sketch after this list). SLURM's shared filesystem ensures
   that framework/workspace changes are synchronized across workers.
2. **Transparent logging**: All logs from remote workers stream back to your
   client in real time, making debugging feel like local execution
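
As a sketch of the re-spawn loop in highlight 1, assuming ``procs`` is the process mesh from your earlier reservation and ``TitanTrainer`` stands in for the trainer actor defined earlier in the tutorial:

.. code-block:: python

    # Inside the async driver; `procs` and `TitanTrainer` come from earlier steps.
    trainers = procs.spawn("trainers", TitanTrainer)
    await trainers.train.call()  # worker logs stream back to this client

    # Edit your training logic (the shared filesystem syncs it to workers),
    # then re-spawn on the same allocation; no new SLURM request is needed.
    trainers = procs.spawn("trainers_v2", TitanTrainer)
    await trainers.train.call()
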
Finally, we tie everything together in a main function that kicks off the workflow.

.. code-block:: python

    ...
    logger.info("Workflow completed!")
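
For orientation, here is a condensed, hypothetical skeleton of such a main function; ``reserve_slurm_hosts`` and ``TitanTrainer`` again stand in for the helpers and actor built earlier in the tutorial:

.. code-block:: python

    import asyncio
    import logging

    logger = logging.getLogger(__name__)

    async def main() -> None:
        # Hypothetical stand-ins for the tutorial's earlier building blocks.
        procs = await reserve_slurm_hosts(num_hosts=2, gpus_per_host=8)
        trainers = procs.spawn("trainers", TitanTrainer)
        await trainers.train.call()
        logger.info("Workflow completed!")

    if __name__ == "__main__":
        asyncio.run(main())
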
Conclusion
-----------

Congrats! In this tutorial, you learned how to combine Monarch's actor framework with
TorchTitan for scalable distributed training.

**Further Reading**

- Monarch also integrates with TorchFT to provide per-step fault-tolerance across replicated workers. You can find a comprehensive `proof of concept <https://github.com/meta-pytorch/torchft/tree/main/examples/monarch>`_ of this integration in the TorchFT repo.
- For an interactive notebook covering similar topics to this tutorial, please consult `this Monarch example <https://github.com/meta-pytorch/monarch/blob/main/examples/slurm_titan.ipynb>`_.