beginner_source/ddp_series_theory.rst translation (#896)
rumjie authored Oct 15, 2024
1 parent d933bf7 commit b8d92ec
Showing 1 changed file with 42 additions and 43 deletions.
`Introduction <ddp_series_intro.html>`__ \|\| **What is DDP** \|\|
`Single-Node Multi-GPU Training <ddp_series_multigpu.html>`__ \|\|
`Fault Tolerance <ddp_series_fault_tolerance.html>`__ \|\|
`Multi-Node training <../intermediate/ddp_series_multinode.html>`__ \|\|
`minGPT Training <../intermediate/ddp_series_minGPT.html>`__

What is Distributed Data Parallel (DDP)
=======================================

Authors: `Suraj Subramanian <https://github.com/suraj813>`__
μ €μž: `Suraj Subramanian <https://github.com/suraj813>`__
Translation: `박지은 <https://github.com/rumjie>`__

.. grid:: 2

    .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn

        * How DDP works under the hood
        * What is ``DistributedSampler``
        * How gradients are synchronized across GPUs

    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites

        * Familiarity with `basic non-distributed training <https://tutorials.pytorch.kr/beginner/basics/quickstart_tutorial.html>`__ in PyTorch

Follow along with the video below or on `youtube <https://www.youtube.com/watch/Cvdhwx-OBBo>`__.
μ•„λž˜μ˜ μ˜μƒμ΄λ‚˜ `유투브 μ˜μƒ youtube <https://www.youtube.com/watch/Cvdhwx-OBBo>`__ 을 따라 μ§„ν–‰ν•˜μ„Έμš”.

.. raw:: html

   <div style="margin-top:10px; margin-bottom:10px;">
     <iframe width="560" height="315" src="https://www.youtube.com/embed/Cvdhwx-OBBo" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
   </div>

This tutorial is a gentle introduction to PyTorch `DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__ (DDP),
which enables data parallel training in PyTorch. Data parallelism is a way to
process multiple data batches across multiple devices simultaneously
to achieve better performance. In PyTorch, the `DistributedSampler <https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler>`__
ensures each device gets a non-overlapping input batch. The model is replicated on all the devices;
each replica calculates gradients and simultaneously synchronizes with the others using the `ring all-reduce
algorithm <https://tech.preferred.jp/en/blog/technologies-behind-distributed-deep-learning-allreduce/>`__.
이 νŠœν† λ¦¬μ–Όμ€ νŒŒμ΄ν† μΉ˜μ—μ„œ λΆ„μ‚° 데이터 병렬 ν•™μŠ΅μ„ κ°€λŠ₯ν•˜κ²Œ ν•˜λŠ” `λΆ„μ‚° 데이터 병렬 <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__ (DDP)
에 λŒ€ν•΄ μ†Œκ°œν•©λ‹ˆλ‹€. 데이터 병렬 μ²˜λ¦¬λž€ 더 높은 μ„±λŠ₯을 λ‹¬μ„±ν•˜κΈ° μœ„ν•΄
μ—¬λŸ¬ 개의 λ””λ°”μ΄μŠ€μ—μ„œ μ—¬λŸ¬ 데이터 λ°°μΉ˜λ“€μ„ λ™μ‹œμ— μ²˜λ¦¬ν•˜λŠ” λ°©λ²•μž…λ‹ˆλ‹€.
νŒŒμ΄ν† μΉ˜μ—μ„œ, `λΆ„μ‚° μƒ˜ν”ŒλŸ¬ <https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler>`__ λŠ”
각 λ””λ°”μ΄μŠ€κ°€ μ„œλ‘œ λ‹€λ₯Έ μž…λ ₯ 배치λ₯Ό λ°›λŠ” 것을 보μž₯ν•©λ‹ˆλ‹€.
λͺ¨λΈμ€ λͺ¨λ“  λ””λ°”μ΄μŠ€μ— 볡제되며, 각 사본은 변화도λ₯Ό κ³„μ‚°ν•˜λŠ” λ™μ‹œμ— `Ring-All-Reduce
μ•Œκ³ λ¦¬μ¦˜ <https://tech.preferred.jp/en/blog/technologies-behind-distributed-deep-learning-allreduce/>`__ 을 μ‚¬μš©ν•΄ λ‹€λ₯Έ 사본과 λ™κΈ°ν™”λ©λ‹ˆλ‹€.
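
To see how these pieces fit together, here is a minimal sketch of a per-process training loop. This is an illustration, not the full script developed later in this series: it assumes the default process group has already been initialized with ``init_process_group`` (for example in a script launched by ``torchrun``), that each process drives the GPU whose index matches its rank (a single-node setup), and it uses a toy dataset and model as stand-ins.

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    # Assumes dist.init_process_group() has already been called and that this
    # process owns the GPU whose index equals its rank (single-node setup).
    rank = dist.get_rank()

    # Toy stand-ins for a real dataset and model.
    dataset = TensorDataset(torch.randn(1024, 20), torch.randn(1024, 1))
    model = nn.Linear(20, 1).to(rank)

    # DistributedSampler gives each process a non-overlapping shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # Each process holds a full replica of the model; DDP registers hooks that
    # all-reduce the gradients across replicas during backward().
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)          # reshuffle the shards each epoch
        for inputs, targets in loader:
            outputs = ddp_model(inputs.to(rank))
            loss = loss_fn(outputs, targets.to(rank))
            loss.backward()               # gradient synchronization happens here
            optimizer.step()
            optimizer.zero_grad()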

This `illustrative tutorial <https://tutorials.pytorch.kr/intermediate/dist_tuto.html#>`__ provides a more in-depth, Python-level view of the mechanics of DDP.

Why you should prefer DDP over ``DataParallel`` (DP)
----------------------------------------------------

`DataParallel <https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html>`__
is an older approach to data parallelism. DP is trivially simple (with just one extra line of code) but it is much less performant.
DDP improves upon the architecture in a few ways:

.. list-table::
   :header-rows: 1

   * - ``DataParallel``
     - ``DistributedDataParallel``
   * - More overhead; the model is replicated and destroyed at each forward pass
     - The model is replicated only once
   * - Only supports single-node parallelism
     - Supports scaling to multiple machines
   * - Slower; uses multithreading on a single process and runs into Global Interpreter Lock (GIL) contention
     - Faster (no GIL contention) because it uses multiprocessing
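
At the call site the difference is small even though the execution model is very different: ``DataParallel`` is a single-process wrapper, while DDP expects one process per GPU, typically launched with ``torchrun``. The following is a rough sketch of the two wrapping styles, using a toy ``nn.Linear`` as a stand-in for a real model; it assumes CUDA devices are available and that the DDP variant runs under ``torchrun``, which sets the ``LOCAL_RANK`` environment variable and the rendezvous variables read by ``init_process_group``.

.. code-block:: python

    import os
    import torch
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    # DataParallel: one process, many GPUs via threads (subject to the GIL);
    # the model is scattered to the GPUs on every forward pass.
    dp_model = nn.DataParallel(nn.Linear(20, 1).cuda())

    # DistributedDataParallel: one process per GPU, launched e.g. with
    #   torchrun --nproc_per_node=4 train.py
    # The model is replicated once; gradients are synchronized during backward().
    torch.distributed.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    ddp_model = DDP(nn.Linear(20, 1).to(local_rank), device_ids=[local_rank])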


Further Reading
---------------

- `Multi-GPU training with DDP <ddp_series_multigpu.html>`__ (next tutorial in this series)
- `DDP
  API <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
- `DDP Internal