forked from PyTorchKorea/tutorials-kr
Commit
beginner_source/ddp_series_intro.rst PyTorchKorea#891
Showing 1 changed file with 31 additions and 46 deletions.
beginner_source/ddp_series_intro.rst
@@ -1,56 +1,41 @@
-**Introduction** \|\| `What is DDP <ddp_series_theory.html>`__ \|\|
-`Single-Node Multi-GPU Training <ddp_series_multigpu.html>`__ \|\|
-`Fault Tolerance <ddp_series_fault_tolerance.html>`__ \|\|
-`Multi-Node training <../intermediate/ddp_series_multinode.html>`__ \|\|
-`minGPT Training <../intermediate/ddp_series_minGPT.html>`__
+**소개** \|\| `DDP란 무엇인가 <ddp_series_theory.html>`__ \|\|
+`단일 노드 다중-GPU 학습 <ddp_series_multigpu.html>`__ \|\|
+`장애 내성 <ddp_series_fault_tolerance.html>`__ \|\|
+`다중 노드 학습 <../intermediate/ddp_series_multinode.html>`__ \|\|
+`minGPT 학습 <../intermediate/ddp_series_minGPT.html>`__

-Distributed Data Parallel in PyTorch - Video Tutorials
-======================================================
+PyTorch의 분산 데이터 병렬 처리 - 비디오 튜토리얼
+=====================================================

-Authors: `Suraj Subramanian <https://github.com/suraj813>`__
+저자: `Suraj Subramanian <https://github.com/suraj813>`__

-Follow along with the video below or on `youtube <https://www.youtube.com/watch/-K3bZYHYHEA>`__.
+아래 비디오를 보거나 `YouTube <https://www.youtube.com/watch/-K3bZYHYHEA>`__에서 함께 시청하세요.

.. raw:: html

   <div style="margin-top:10px; margin-bottom:10px;">
   <iframe width="560" height="315" src="https://www.youtube.com/embed/-K3bZYHYHEA" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
   </div>

-This series of video tutorials walks you through distributed training in
-PyTorch via DDP.
-
-The series starts with a simple non-distributed training job, and ends
-with deploying a training job across several machines in a cluster.
-Along the way, you will also learn about
-`torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__ for
-fault-tolerant distributed training.
-
-The tutorial assumes a basic familiarity with model training in PyTorch.
-
-Running the code
-----------------
-
-You will need multiple CUDA GPUs to run the tutorial code. Typically,
-this can be done on a cloud instance with multiple GPUs (the tutorials
-use an Amazon EC2 P3 instance with 4 GPUs).
-
-The tutorial code is hosted in this
-`github repo <https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series>`__.
-Clone the repository and follow along!
-
-Tutorial sections
------------------
-
-0. Introduction (this page)
-1. `What is DDP? <ddp_series_theory.html>`__ Gently introduces what DDP is doing
-   under the hood
-2. `Single-Node Multi-GPU Training <ddp_series_multigpu.html>`__ Training models
-   using multiple GPUs on a single machine
-3. `Fault-tolerant distributed training <ddp_series_fault_tolerance.html>`__
-   Making your distributed training job robust with torchrun
-4. `Multi-Node training <../intermediate/ddp_series_multinode.html>`__ Training models using
-   multiple GPUs on multiple machines
-5. `Training a GPT model with DDP <../intermediate/ddp_series_minGPT.html>`__ “Real-world”
-   example of training a `minGPT <https://github.com/karpathy/minGPT>`__
-   model with DDP
+이 비디오 튜토리얼 시리즈는 PyTorch에서 DDP(Distributed Data Parallel)를 사용한 분산 학습에 대해 안내합니다.
+
+이 시리즈는 단순한 비분산 학습 작업에서 시작하여, 클러스터 내 여러 기기들(multiple machines)에서 학습 작업을 배포하는 것으로 마무리됩니다. 이 과정에서 `torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__을 사용한 장애 허용(fault-tolerant) 분산 학습에 대해서도 배우게 됩니다.
+
+이 튜토리얼은 PyTorch에서 모델 학습에 대한 기본적인 이해를 전제로 합니다.
+
+코드 실행
+--------
+
+튜토리얼 코드를 실행하려면 여러 개의 CUDA GPU가 필요합니다. 일반적으로 여러 GPU가 있는 클라우드 인스턴스에서 이를 수행할 수 있으며, 튜토리얼에서는 4개의 GPU가 탑재된 Amazon EC2 P3 인스턴스를 사용합니다.
+
+튜토리얼 코드는 이 `GitHub 저장소 <https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series>`__에 호스팅되어 있습니다. 저장소를 복제하고 함께 진행하세요!
+
+튜토리얼 섹션
+--------------
+
+0. 소개 (이 페이지)
+1. `DDP란 무엇인가? <ddp_series_theory.html>`__ DDP가 내부적으로 수행하는 작업에 대해 간단히 소개합니다.
+2. `싱글 노드 멀티-GPU 학습 <ddp_series_multigpu.html>`__ 한 기기에서 여러 GPU를 사용하여 모델을 학습하는 방법
+3. `장애 내성 분산 학습 <ddp_series_fault_tolerance.html>`__ torchrun을 사용하여 분산 학습 작업을 견고하게 만드는 방법
+4. `멀티 노드 학습 <../intermediate/ddp_series_multinode.html>`__ 여러 기기에서 여러 GPU를 사용하여 모델을 학습하는 방법
+5. `DDP를 사용한 GPT 모델 학습 <../intermediate/ddp_series_minGPT.html>`__ DDP를 사용한 `minGPT <https://github.com/karpathy/minGPT>`__ 모델 학습의 “실제 예시”
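The "Running the code" section in the file above expects a machine with multiple CUDA GPUs and points to torchrun as the launcher used later in the series. Below is a minimal pre-flight sketch a reader might run before cloning the repository; the file name ``check_gpus.py``, the two-GPU threshold, and the launch command in the final comment are illustrative assumptions, not part of this commit or of the tutorial code.

.. code:: python

   # check_gpus.py -- illustrative pre-flight check, not part of the tutorial repository.
   # Confirms that several CUDA GPUs are visible before following the DDP examples from
   # https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series
   import torch

   if __name__ == "__main__":
       n_gpus = torch.cuda.device_count()
       if n_gpus < 2:
           raise SystemExit(
               f"Found {n_gpus} CUDA GPU(s); the tutorials assume a multi-GPU machine "
               "(for example, a 4-GPU Amazon EC2 P3 instance)."
           )
       print(f"{n_gpus} CUDA GPUs detected; the single-node examples should run here.")
       # A single-node tutorial script is then typically launched with torchrun, e.g.:
       #   torchrun --standalone --nproc_per_node=4 your_training_script.py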