Translate advanced_source/cpp_cuda_graphs.rst (#961)
* translate beginner_source/torchtext_custom_dataset_tutorial.py
hyoyoung authored Nov 30, 2024
1 parent 34644b5 commit 5fc8a7d
Using CUDA Graphs in PyTorch C++ API
====================================

**Translation**: `Hyoyoung Jang <https://github.com/hyoyoung>`_

.. note::
   |edit| View and edit this tutorial in `GitHub <https://github.com/pytorch/tutorials/blob/main/advanced_source/cpp_cuda_graphs.rst>`__. The full source code is available on `GitHub <https://github.com/pytorch/tutorials/blob/main/advanced_source/cpp_cuda_graphs>`__.

Prerequisites:

- `Using the PyTorch C++ Frontend <../advanced_source/cpp_frontend.html>`__
- `CUDA semantics <https://pytorch.org/docs/master/notes/cuda.html>`__
- PyTorch 2.0 or later
- CUDA 11 or later

NVIDIA's CUDA Graphs have been a part of the CUDA Toolkit since the
release of `version 10 <https://developer.nvidia.com/blog/cuda-graphs/>`_.
They can greatly reduce CPU overhead, thereby increasing the
performance of applications.

In this tutorial, we will be focusing on using CUDA Graphs with the `C++
frontend of PyTorch <https://tutorials.pytorch.kr/advanced/cpp_frontend.html>`_.
The C++ frontend is mostly utilized in production and deployment applications, which
are an important part of PyTorch use cases. Since `their first appearance
<https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/>`_,
CUDA Graphs have won users' and developers' hearts for being a very performant
and at the same time simple-to-use tool. In fact, CUDA Graphs are used by default
in ``torch.compile`` of PyTorch 2.0 to boost the productivity of training and inference.

We would like to demonstrate CUDA Graphs usage on PyTorchโ€™s `MNIST
example <https://github.com/pytorch/examples/tree/main/cpp/mnist>`_.
The usage of CUDA Graphs in LibTorch (C++ Frontend) is very similar to its
`Python counterpart <https://pytorch.org/docs/main/notes/cuda.html#cuda-graphs>`_
but with some differences in syntax and functionality.

Getting Started
---------------

The main training loop consists of several steps, as depicted in the
following code chunk:

.. code-block:: cpp

   // ...
     optimizer.step();
   }

The example above includes a forward pass, a backward pass, and weight updates.
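
Since the full loop body is collapsed in the view above, here is a minimal sketch of
such a non-graphed loop. It assumes a ``model`` module, an ``optimizer``, and a
``data_loader`` as in the MNIST example; the exact identifiers may differ from the
original source:

.. code-block:: cpp

   for (auto& batch : *data_loader) {
     // move the incoming batch to the GPU
     auto data = batch.data.to(torch::kCUDA);
     auto targets = batch.target.to(torch::kCUDA);

     optimizer.zero_grad();
     auto output = model.forward(data);             // forward pass
     auto loss = torch::nll_loss(output, targets);  // loss computation
     loss.backward();                               // backward pass
     optimizer.step();                              // weight update
   }
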
์œ„์˜ ์˜ˆ์‹œ์—๋Š” ์ˆœ์ „ํŒŒ, ์—ญ์ „ํŒŒ, ๊ฐ€์ค‘์น˜ ์—…๋ฐ์ดํŠธ๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

In this tutorial, we will be applying CUDA Graphs to all the compute steps through
whole-network graph capture. But before doing so, we need to slightly modify the source
code by preallocating tensors so that they can be reused in the main training loop.
Here is an example implementation:

.. code-block:: cpp

   // ...
     training_step(model, optimizer, data, targets, output, loss);
   }

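
For context, a fuller version of this preallocation pattern could look like the sketch
below. The batch size and tensor shapes are assumptions for MNIST-sized inputs, not
values taken from the original example:

.. code-block:: cpp

   // allocate the tensors once on the GPU so the same memory
   // can be reused on every iteration of the training loop
   auto opts = torch::TensorOptions().device(torch::kCUDA);
   torch::Tensor data    = torch::zeros({64, 1, 28, 28}, opts);
   torch::Tensor targets = torch::zeros({64}, opts.dtype(torch::kInt64));
   torch::Tensor output  = torch::zeros({1}, opts);
   torch::Tensor loss    = torch::zeros({1}, opts);

   for (auto& batch : *data_loader) {
     // copy the incoming batch into the preallocated tensors
     data.copy_(batch.data);
     targets.copy_(batch.target);
     training_step(model, optimizer, data, targets, output, loss);
   }
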
Where ``training_step`` simply consists of forward and backward passes with corresponding optimizer calls:

.. code-block:: cpp

   // ...
     optimizer.step();
   }

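
One plausible definition of ``training_step`` is sketched below. The ``Net`` model type
and ``torch::nll_loss`` follow the MNIST example, but treat the exact signature as an
assumption rather than the code shipped with the tutorial:

.. code-block:: cpp

   void training_step(
       Net& model,
       torch::optim::Optimizer& optimizer,
       torch::Tensor& data,
       torch::Tensor& targets,
       torch::Tensor& output,
       torch::Tensor& loss) {
     optimizer.zero_grad();
     output = model.forward(data);             // forward pass
     loss = torch::nll_loss(output, targets);  // loss computation
     loss.backward();                          // backward pass
     optimizer.step();                         // weight update
   }
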
PyTorch's CUDA Graphs API relies on stream capture, which in our case would be used like this:

.. code-block:: cpp

   // ...
   training_step(model, optimizer, data, targets, output, loss);
   graph.capture_end();

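
Spelled out a little more, the capture could be set up as in the sketch below.
``at::cuda::CUDAGraph`` and the stream helpers are existing ATen/LibTorch APIs, while the
surrounding identifiers are carried over from the earlier snippets as assumptions:

.. code-block:: cpp

   // headers typically needed for graph capture (at the top of the file)
   #include <ATen/cuda/CUDAGraph.h>
   #include <c10/cuda/CUDAStream.h>

   at::cuda::CUDAGraph graph;

   // capture must run on a non-default stream
   at::cuda::CUDAStream capture_stream = at::cuda::getStreamFromPool();
   at::cuda::setCurrentCUDAStream(capture_stream);

   graph.capture_begin();
   training_step(model, optimizer, data, targets, output, loss);
   graph.capture_end();
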
Before the actual graph capture, it is important to run several warm-up iterations on a
side stream to prepare the CUDA cache as well as the CUDA libraries (like CUBLAS and
CUDNN) that will be used during the training:

.. code-block:: cpp

   // ...
     training_step(model, optimizer, data, targets, output, loss);
   }

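
As a sketch, the warm-up phase could be written as follows, reusing the ``capture_stream``
from the previous snippet; ``num_warmup_iters`` is an assumed variable, not one defined by
the tutorial:

.. code-block:: cpp

   // several eager warm-up iterations on the side stream, before capture,
   // to initialize the CUDA cache and libraries such as CUBLAS and CUDNN
   for (int i = 0; i < num_warmup_iters; ++i) {
     training_step(model, optimizer, data, targets, output, loss);
   }
   // make sure the warm-up work has finished before capturing
   capture_stream.synchronize();
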
After the successful graph capture, we can replace the
``training_step(model, optimizer, data, targets, output, loss);`` call with
``graph.replay();`` to perform the training step.
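
A graphed training loop might then look like the sketch below: each new batch is copied
into the preallocated tensors and the captured work is re-launched with a single
``graph.replay()`` call (identifiers are assumptions carried over from the earlier snippets):

.. code-block:: cpp

   for (auto& batch : *data_loader) {
     // refresh the inputs in place; the captured graph reads from these buffers
     data.copy_(batch.data);
     targets.copy_(batch.target);

     // re-launch the captured forward/backward/optimizer-step work
     graph.replay();
   }
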
Training Results
----------------
Taking the code for a spin, we can see the following output from ordinary non-graphed training:

.. code-block:: shell

   ...
   user 0m44.018s
   sys 0m1.116s

While the training with the CUDA Graph produces the following output:

.. code-block:: shell

   ...
   user 0m7.048s
   sys 0m0.619s

Conclusion
----------

As we can see, just by applying a CUDA Graph on the `MNIST example
<https://github.com/pytorch/examples/tree/main/cpp/mnist>`_ we were able to improve
training performance by more than six times. Such a large performance improvement was
achievable due to the small model size. In the case of larger models with heavy GPU
usage, the CPU overhead is less impactful, so the improvement will be smaller.
Nevertheless, it is always advantageous to use CUDA Graphs to get the most performance
out of GPUs.
