Update README and Docs to including missing content (#64)

Update README and Docs to including missing content
NVIDIA · Aug 12, 2020 · 3a3360b · 3a3360b
1 parent 60fb0cd
commit 3a3360b
Show file tree

Hide file tree

Showing 7 changed files with 344 additions and 63 deletions.
diff --git a/README.rst b/README.rst
@@ -118,12 +118,27 @@ Documentation
 -------------
 
 The User Guide can be found in the 
-`PyProf docs folder <https://github.com/NVIDIA/PyProf/blob/master/docs>`_, and 
+`documentation for current release 
+<https://docs.nvidia.com/deeplearning/frameworks/pyprof-user-guide/index.html>`_, and 
 provides instructions on how to install and profile with PyProf.
 
-An `FAQ <https://github.com/NVIDIA/PyProf/blob/master/docs/faqs.rst>`_ provides
+A complete `Quick Start Guide <https://docs.nvidia.com/deeplearning/frameworks/pyprof-user-guide/quickstart.html>`_ 
+provides step-by-step instructions to get you quickly started using PyProf.
+
+An `FAQ <https://docs.nvidia.com/deeplearning/frameworks/pyprof-user-guide/faqs.html>`_ provides
 answers for frequently asked questions.
 
+The `Release Notes 
+<https://docs.nvidia.com/deeplearning/frameworks/pyprof-release-notes/index.html>`_
+indicate the required versions of the NVIDIA Driver and CUDA, and also describe 
+which GPUs are supported by PyProf
+
+Presentation and Papers
+^^^^^^^^^^^^^^^^^^^^^^^
+
+* `Automating End-toEnd PyTorch Profiling <https://developer.nvidia.com/gtc/2020/video/s21143>`_.
+   * `Presentation slides <https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21143-automating-end-to-end-pytorch-profiling.pdf>`_.
+
 Contributing
 ------------
 

diff --git a/docs/advanced.rst b/docs/advanced.rst
@@ -0,0 +1,138 @@
+..
+ # Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ # 
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+Advanced PyProf Usage 
+=====================
+
+This section demonstrates some advanced techniques to get even more from your
+PyProf profiles.
+
+.. _section-layer-annotation:
+
+Layer Annotation
+----------------
+
+Adding custom NVTX ranges to the model layers will allow PyProf to aggregate
+profile results based on the ranges. ::
+
+  # examples/user_annotation/resnet.py
+  # Use the “layer:” prefix
+  
+  class Bottleneck(nn.Module):
+    def forward(self, x):
+      nvtx.range_push("layer:Bottleneck_{}".format(self.id))  # NVTX push marker
+      
+      nvtx.range_push("layer:Conv1")                          # Nested NVTX push/pop markers
+      out = self.conv1(x)
+      nvtx.range_pop()
+      
+      nvtx.range_push("layer:BN1")                            # Use the “layer:” prefix
+      out = self.bn1(out)
+      nvtx.range_pop()
+      
+      nvtx.range_push("layer:ReLU")
+      out = self.relu(out)
+      nvtx.range_pop()
+      
+      ...
+      
+      nvtx.range_pop()                                        # NVTX pop marker.return out
+
+.. _section-custom-function:
+
+Custom Function
+---------------
+
+The following is example of how to enable Torch Autograd to profile a custom
+function. ::
+
+  # examples/custom_func_module/custom_function.py
+  
+  import torch
+  import pyprof
+  pyprof.init()
+  
+  class Foo(torch.autograd.Function):
+    @staticmethoddef forward(ctx, in1, in2):
+      out = in1 + in2                    # This could be a custom C++ function
+      return out
+    @staticmethod
+    def backward(ctx, grad):
+      in1_grad, in2_grad = grad, grad    # This could be a custom C++ function
+      return in1_grad, in2_grad
+  
+  # Hook the forward and backward functions to pyprof
+  pyprof.wrap(Foo, 'forward')
+  pyprof.wrap(Foo, 'backward')
+
+.. _section-custom-module:
+
+Custom Module
+---------------
+
+The following is example of how to enable Torch Autograd to profile a custom
+module. ::
+
+  # examples/custom_func_module/custom_module.py
+  
+  import torch
+  import pyprof
+  pyprof.init()
+  
+  class Foo(torch.nn.Module):
+    def __init__(self, size):
+      super(Foo, self).__init__()
+      self.n = torch.nn.Parameter(torch.ones(size))
+      self.m = torch.nn.Parameter(torch.ones(size))
+      
+    def forward(self, input):
+      return self.n*input + self.m         # This could be a custom C++ function.
+  
+  # Hook the forward function to pyprof
+  pyprof.wrap(Foo, 'forward')
+
+Extensibility
+-------------
+
+* For custom functions and modules, users can add flops and bytes calculation
+
+* Python code is easy to extend - no need to recompile, no need to change the 
+  PyTorch backend and resolve merge conflicts on every version upgrade
+
+Actionable Items
+----------------
+
+The following list provides some common actionable items to consider when 
+analyzing profile results and deciding on how best to improve the performance. 
+For more customized and directed actionable items, consider using the `NVIDIA 
+Deep Learning Profiler <https://docs.nvidia.com/deeplearning/frameworks/dlprof-user-guide/index.html>`_ 
+that provide direct *Expert Systems* feedback based on the profile.
+
+* NvProf/ NsightSystems tell us what the hotspots are, but not if we can act on 
+  them.
+
+* If a kernel runs close to max perf based on FLOPs and bytes (and maximum FLOPs
+  and bandwidth of the GPU), then there’s no point in optimizing it even if it’s
+  a hotspot.
+
+* If the ideal timing based on FLOPs and bytes (max(compute_time, 
+  bandwidth_time)) is much shorter than the silicon time, there’s scope for 
+  improvement.
+
+* Tensor Core usage (conv): for Volta, convolutions should have the input 
+  channel count (C) and the output channel count (K) divisible by 8, in order to
+  use tensor cores. For Turing, it’s optimal for C and K to be divisible by 16. 
+
+* Tensor core usage (GEMM): M, N and K divisible by 8 (Volta) or 16 (Turing) (https://docs.nvidia.com/deeplearning/sdk/dl-performance-guide/index.html)  
diff --git a/docs/examples.rst b/docs/examples.rst
@@ -20,8 +20,8 @@ Examples
 
 This section provides several real examples on how to profile with PyPRrof.
 
-  *TODO:* Provide real examples. Everything here should also be added to
-  a QA L0_ test to lock in the code
+Profile Lenet
+-------------
 
 Navigate to the lenet example. ::
 

diff --git a/docs/index.rst b/docs/index.rst
@@ -32,5 +32,6 @@ NVIDIA PyProf - Pytorch Profiler
    quickstart
    install
    profile
+   advanced
    examples
    faqs
diff --git a/docs/install.rst b/docs/install.rst
@@ -27,4 +27,27 @@ Installing from GitHub
 
 .. include:: ../README.rst
    :start-after: quick-install-start-marker-do-not-remove
-   :end-before: quick-install-end-marker-do-not-remove
+   :end-before: quick-install-end-marker-do-not-remove
+
+.. _section-installing-from-ngc:
+
+Install from NGC Container
+--------------------------
+
+PyProf is available in the PyTorch container on the `NVIDIA GPU Cloud (NGC) 
+<https://ngc.nvidia.com>`_.
+
+Before you can pull a container from the NGC container registry, you
+must have Docker and nvidia-docker installed. For DGX users, this is
+explained in `Preparing to use NVIDIA Containers Getting Started Guide
+<http://docs.nvidia.com/deeplearning/dgx/preparing-containers/index.html>`_.
+For users other than DGX, follow the `nvidia-docker installation
+documentation <https://github.com/NVIDIA/nvidia-docker>`_ to install
+the most recent version of CUDA, Docker, and nvidia-docker.
+
+After performing the above setup, you can pull the PyProf container
+using the following command::
+
+  docker pull nvcr.io/nvidia/pytorch:20.07-py3
+
+Replace *20.07* with the version of PyTorch container that you want to pull.