From 106e5c5b913a0eb30da99d6c9951998ebbbaa04f Mon Sep 17 00:00:00 2001
From: dzier
Date: Fri, 20 Nov 2020 11:21:27 -0800
Subject: [PATCH] Update README post-20.11 release

---
 README.rst | 107 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 103 insertions(+), 4 deletions(-)

diff --git a/README.rst b/README.rst
index a202260..f452146 100644
--- a/README.rst
+++ b/README.rst
@@ -18,13 +18,60 @@
 PyProf - PyTorch Profiling tool
 ===============================
 
-**NOTE: You are currently on the r20.11 branch which tracks stabilization
-towards the release. This branch is not usable during stabilization.**
-
 .. overview-begin-marker-do-not-remove
 
+PyProf is a tool that profiles and analyzes the GPU performance of PyTorch
+models. PyProf aggregates kernel performance from `Nsight Systems
+`_ or `NvProf
+`_.
+
+What's New in 3.6.0
+-------------------
+
+* PyProf overhead was reduced to improve runtime performance
+
+  * Improved database query from Nsight Systems
+  * Refactored nvmarker.py
+
+Known Issues
+------------
+
+* Forward-Backward kernel correlation heuristics do not work correctly with
+  PyTorch 1.6. Recommended workarounds include:
+
+  * Use with PyTorch 1.5
+  * Use DLProf in the `20.10 NGC PyTorch container `_
+
+Features
+--------
+
+* Identifies the layer that launched a kernel: e.g. the association of
+  `ComputeOffsetsKernel` with a concrete PyTorch layer or API is not obvious.
+
+* Identifies the tensor dimensions and precision: without knowing the tensor
+  dimensions and precision, it's impossible to reason about whether the actual
+  (silicon) kernel time is close to maximum performance of such a kernel on
+  the GPU. Knowing the tensor dimensions and precision, we can figure out the
+  FLOPs and bandwidth required by a layer, and then determine how close to
+  maximum performance the kernel is for that operation.
+
+* Forward-backward correlation: PyProf determines what the forward pass step
+  is that resulted in the particular weight and data gradients (wgrad, dgrad),
+  which makes it possible to determine the tensor dimensions required by these
+  backprop steps to assess their performance.
+
+* Determines Tensor Core usage: PyProf can highlight the kernels that use
+  `Tensor Cores `_.
+
+* Correlate the line in the user's code that launched a particular kernel (program trace).
+
 .. overview-end-marker-do-not-remove
 
+The current release of PyProf is 3.6.0 and is available in the 20.11 release of
+the PyTorch container on `NVIDIA GPU Cloud (NGC) `_. The
+branch for this release is `r20.11
+`_.
+
 Quick Installation Instructions
 -------------------------------
 
@@ -46,7 +93,7 @@ Quick Installation Instructions
 
 * Should display ::
 
-    pyprof 3.6.0.dev0
+    pyprof 3.6.0
 
 .. quick-install-end-marker-do-not-remove
 
@@ -75,5 +122,57 @@ Quick Start Instructions
 
 .. quick-start-end-marker-do-not-remove
 
+Documentation
+-------------
+
+The User Guide can be found in the
+`documentation for current release
+`_, and
+provides instructions on how to install and profile with PyProf.
+
+A complete `Quick Start Guide `_
+provides step-by-step instructions to get you quickly started using PyProf.
+
+An `FAQ `_ provides
+answers for frequently asked questions.
+
+The `Release Notes
+`_
+indicate the required versions of the NVIDIA Driver and CUDA, and also describe
+which GPUs are supported by PyProf.
+
+Presentation and Papers
+^^^^^^^^^^^^^^^^^^^^^^^
+
+* `Automating End-to-End PyTorch Profiling `_.
+  * `Presentation slides `_.
+
+Contributing
+------------
+
+Contributions to PyProf are more than welcome. To
+contribute, make a pull request and follow the guidelines outlined in
+the `Contributing `_ document.
+
+Reporting problems, asking questions
+------------------------------------
+
+We appreciate any feedback, questions or bug reporting regarding this
+project. When help with code is needed, follow the process outlined in
+the Stack Overflow (https://stackoverflow.com/help/mcve)
+document. Ensure posted examples are:
+
+* minimal – use as little code as possible that still produces the
+  same problem
+
+* complete – provide all parts needed to reproduce the problem. Check
+  if you can strip external dependency and still show the problem. The
+  less time we spend on reproducing problems the more time we have to
+  fix them
+
+* verifiable – test the code you're about to provide to make sure it
+  reproduces the problem. Remove all other problems that are not
+  related to your request/question.
+
 .. |License| image:: https://img.shields.io/badge/License-Apache2-green.svg
    :target: http://www.apache.org/licenses/LICENSE-2.0