diff --git a/README.rst b/README.rst index affad00..b116537 100644 --- a/README.rst +++ b/README.rst @@ -118,12 +118,27 @@ Documentation ------------- The User Guide can be found in the -`PyProf docs folder `_, and +`documentation for current release +`_, and provides instructions on how to install and profile with PyProf. -An `FAQ `_ provides +A complete `Quick Start Guide `_ +provides step-by-step instructions to get you quickly started using PyProf. + +An `FAQ `_ provides answers for frequently asked questions. +The `Release Notes +`_ +indicate the required versions of the NVIDIA Driver and CUDA, and also describe +which GPUs are supported by PyProf. + +Presentations and Papers +^^^^^^^^^^^^^^^^^^^^^^^^ + +* `Automating End-to-End PyTorch Profiling `_. + * `Presentation slides `_. + Contributing ------------ diff --git a/docs/advanced.rst b/docs/advanced.rst new file mode 100644 index 0000000..2edcad6 --- /dev/null +++ b/docs/advanced.rst @@ -0,0 +1,138 @@ +.. + # Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. + # + # Licensed under the Apache License, Version 2.0 (the "License"); + # you may not use this file except in compliance with the License. + # You may obtain a copy of the License at + # + # http://www.apache.org/licenses/LICENSE-2.0 + # + # Unless required by applicable law or agreed to in writing, software + # distributed under the License is distributed on an "AS IS" BASIS, + # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + # See the License for the specific language governing permissions and + # limitations under the License. + +Advanced PyProf Usage +===================== + +This section demonstrates some advanced techniques to get even more from your +PyProf profiles. + +.. _section-layer-annotation: + +Layer Annotation +---------------- + +Adding custom NVTX ranges to the model layers will allow PyProf to aggregate +profile results based on the ranges.
:: + + # examples/user_annotation/resnet.py + # Use the "layer:" prefix + + import torch.cuda.nvtx as nvtx + + class Bottleneck(nn.Module): + def forward(self, x): + nvtx.range_push("layer:Bottleneck_{}".format(self.id)) # NVTX push marker + + nvtx.range_push("layer:Conv1") # Nested NVTX push/pop markers + out = self.conv1(x) + nvtx.range_pop() + + nvtx.range_push("layer:BN1") # Use the "layer:" prefix + out = self.bn1(out) + nvtx.range_pop() + + nvtx.range_push("layer:ReLU") + out = self.relu(out) + nvtx.range_pop() + + ... + + nvtx.range_pop() # NVTX pop marker + return out + +.. _section-custom-function: + +Custom Function +--------------- + +The following is an example of how to enable Torch Autograd to profile a custom +function. :: + + # examples/custom_func_module/custom_function.py + + import torch + import pyprof + pyprof.init() + + class Foo(torch.autograd.Function): + @staticmethod + def forward(ctx, in1, in2): + out = in1 + in2 # This could be a custom C++ function + return out + @staticmethod + def backward(ctx, grad): + in1_grad, in2_grad = grad, grad # This could be a custom C++ function + return in1_grad, in2_grad + + # Hook the forward and backward functions to pyprof + pyprof.wrap(Foo, 'forward') + pyprof.wrap(Foo, 'backward') + +.. _section-custom-module: + +Custom Module +--------------- + +The following is an example of how to enable Torch Autograd to profile a custom +module. :: + + # examples/custom_func_module/custom_module.py + + import torch + import pyprof + pyprof.init() + + class Foo(torch.nn.Module): + def __init__(self, size): + super(Foo, self).__init__() + self.n = torch.nn.Parameter(torch.ones(size)) + self.m = torch.nn.Parameter(torch.ones(size)) + + def forward(self, input): + return self.n*input + self.m # This could be a custom C++ function.
+ + # Hook the forward function to pyprof + pyprof.wrap(Foo, 'forward') + +Extensibility +------------- + +* For custom functions and modules, users can add flops and bytes calculations + +* Python code is easy to extend - no need to recompile, no need to change the + PyTorch backend and resolve merge conflicts on every version upgrade + +Actionable Items +---------------- + +The following list provides some common actionable items to consider when +analyzing profile results and deciding how best to improve performance. +For more customized and directed actionable items, consider using the `NVIDIA +Deep Learning Profiler `_, +which provides direct *Expert Systems* feedback based on the profile. + +* NVProf/Nsight Systems tell us what the hotspots are, but not whether we can act on + them. + +* If a kernel runs close to max perf based on FLOPs and bytes (and maximum FLOPs + and bandwidth of the GPU), then there's no point in optimizing it even if it's + a hotspot. + +* If the ideal timing based on FLOPs and bytes (max(compute_time, + bandwidth_time)) is much shorter than the silicon time, there's scope for + improvement. + +* Tensor Core usage (conv): for Volta, convolutions should have the input + channel count (C) and the output channel count (K) divisible by 8, in order to + use Tensor Cores. For Turing, it's optimal for C and K to be divisible by 16. + +* Tensor Core usage (GEMM): M, N and K divisible by 8 (Volta) or 16 (Turing) (https://docs.nvidia.com/deeplearning/sdk/dl-performance-guide/index.html) diff --git a/docs/examples.rst b/docs/examples.rst index 0e841e0..0954ec8 100644 --- a/docs/examples.rst +++ b/docs/examples.rst @@ -20,8 +20,8 @@ Examples This section provides several real examples on how to profile with PyProf. - *TODO:* Provide real examples. Everything here should also be added to - a QA L0_ test to lock in the code +Profile LeNet +------------- Navigate to the lenet example.
:: diff --git a/docs/index.rst b/docs/index.rst index a699d6f..11a72dd 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -32,5 +32,6 @@ NVIDIA PyProf - Pytorch Profiler quickstart install profile + advanced examples faqs \ No newline at end of file diff --git a/docs/install.rst b/docs/install.rst index c5e4a58..2790cc1 100644 --- a/docs/install.rst +++ b/docs/install.rst @@ -27,4 +27,27 @@ Installing from GitHub .. include:: ../README.rst :start-after: quick-install-start-marker-do-not-remove - :end-before: quick-install-end-marker-do-not-remove \ No newline at end of file + :end-before: quick-install-end-marker-do-not-remove + +.. _section-installing-from-ngc: + +Install from NGC Container +-------------------------- + +PyProf is available in the PyTorch container on the `NVIDIA GPU Cloud (NGC) +`_. + +Before you can pull a container from the NGC container registry, you +must have Docker and nvidia-docker installed. For DGX users, this is +explained in `Preparing to use NVIDIA Containers Getting Started Guide +`_. +For users other than DGX, follow the `nvidia-docker installation +documentation `_ to install +the most recent version of CUDA, Docker, and nvidia-docker. + +After performing the above setup, you can pull the PyProf container +using the following command:: + + docker pull nvcr.io/nvidia/pytorch:20.07-py3 + +Replace *20.07* with the version of the PyTorch container that you want to pull. diff --git a/docs/profile.rst b/docs/profile.rst index 77deda2..1fb15d0 100644 --- a/docs/profile.rst +++ b/docs/profile.rst @@ -16,9 +16,6 @@ Profiling PyTorch with PyProf ============================= - TODO: this chapter should go into the details of profiling, - including any options. - Overview -------- For FLOP and bandwidth calculations, we use a relatively straightforward approach. @@ -39,17 +36,26 @@ determined by the grid/block dimensions.
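As a rough illustration of this kind of calculation, consider a GEMM. The sketch below is not PyProf's own code; the matrix sizes, the fp16 assumption, and the peak-rate constants are illustrative values chosen by the editor:

```python
# Hypothetical sketch of the FLOP/bandwidth math described above, for a
# GEMM C = A @ B with fp16 operands. All constants are assumptions.
M, N, K = 1024, 1024, 1024
bytes_per_elem = 2  # fp16

flops = 2 * M * N * K  # one multiply and one add per (m, n, k) triple
bytes_moved = (M * K + K * N + M * N) * bytes_per_elem  # read A and B, write C

# The ideal kernel time is bounded by whichever resource saturates first.
peak_flops = 125e12  # assumed peak fp16 Tensor Core throughput, FLOP/s
peak_bw = 900e9      # assumed peak memory bandwidth, bytes/s
ideal_time = max(flops / peak_flops, bytes_moved / peak_bw)

print(f"flops={flops}, bytes={bytes_moved}, ideal_time={ideal_time * 1e6:.1f} us")
```

If the measured silicon time of a kernel is much larger than this kind of ideal estimate, there is likely headroom to optimize.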
Since many PyTorch kernels are open-source (or even custom written by the user, as in CUDA Extensions), this provides the user with information that helps root cause performance issues and prioritize optimization work. +.. _section-components-and-flow: + +Components and Flow +------------------- + +There are four steps in the PyProf profiling flow: + +1. :ref:`Import PyProf `: The PyProf module is required to intercept all PyTorch, custom function, and module calls. + +2. :ref:`Profile PyTorch Model `: Profile the model with either NVProf or Nsight Systems to obtain an SQL database. + +3. :ref:`parse.py `: Extract information from the SQL database. + +4. :ref:`prof.py `: Use this information to calculate flops and bytes. .. _section-profile-enable-profiler: Enable Profiler in PyTorch Network ---------------------------------- - *TODO:* provide more detail about `torch.cuda.profiler`, why it is needed - and how to access it. The follow is cut and pasted from old README and needs - to be expanded. - - Pyprof makes use of the profiler functionality available in `Pytorch `_. The profiler allows you to inspect the cost of different operators @@ -90,22 +96,42 @@ Here's an example: :: if iter == iter_to_capture: profiler.stop() +.. _section-profile-with-nvidia-profilers: + +Profile with NVIDIA Profilers +----------------------------- + +After modifying the PyTorch script to import pyprof, you will need to use either +NVProf or Nsight Systems to profile the performance. Both profilers will output +an SQLite database containing the results of the profile. + +Please note that NVProf is currently being phased out, and it is recommended to +use Nsight Systems to future-proof the profiling process. + +Additionally, only Nsight Systems is available in either the pre-built NGC +container or a manually built Docker container. + ..
_section-profile-with-nvprof: -Profile with NVprof ------------------- +Profile with NVProf +^^^^^^^^^^^^^^^^^^^ If you are not using NVProf, skip ahead to :ref:`section-profile-with-nsys`. -Run NVprof to generate a SQL (NVVP) file. This file can be opened with NVVP. - -If using ``profiler.start()`` and ``profiler.stop()`` in ``net.py`` :: - - $ nvprof -f -o net.sql --profile-from-start off -- python net.py +Run NVProf to generate a SQL (NVVP) file. This file can be opened with NVVP. :: + + $ nvprof + -f # Overwrite existing file + -o net.sql # Create net.sql + python net.py -For all other profiling :: +If using ``profiler.start()`` and ``profiler.stop()`` in ``net.py`` :: - $ nvprof -f -o net.sql -- python net.py + $ nvprof + -f + -o net.sql + --profile-from-start off # Profiler start/stop inside net.py + python net.py **Note:** if you're experiencing issues with hardware counters and you get a message such as :: @@ -118,30 +144,50 @@ Please follow the steps described in :ref:`section-profile-hardware-counters`. .. _section-profile-with-nsys: Profile with Nsight Systems ---------------------------- - -Run Nsight Systems to generate a SQLite file. +^^^^^^^^^^^^^^^^^^^^^^^^^^^ -If using ``profiler.start()`` and ``profiler.stop()`` in ``net.py`` :: - - $ nsys profile -f true -o net -c cudaProfilerApi --stop-on-range-end true --export sqlite python net.py +Run Nsight Systems to generate a SQLite file. :: - $ nsys profile + -f true # Overwrite existing files + -o net # Create net.qdrep (used by Nsys viewer) + -c cudaProfilerApi # Optional argument required for profiler start/stop + --stop-on-range-end true # Optional argument required for profiler start/stop + --export sqlite # Export net.sql (similar to NVProf) + python net.py -For all other profiling :: - - $ nsys profile -f true -o net --export sqlite python net.py +If using ``profiler.start()`` and ``profiler.stop()`` in ``net.py``, the options +``-c cudaProfilerApi --stop-on-range-end true`` are required. ..
_section-parse-sql-file: Parse the SQL file ------------------ + Run the parser on the SQL file. The output is an ASCII file. Each line is a python dictionary which contains information about the kernel name, duration, parameters etc. This file can be used as input to other custom -scripts as well. **Note:** Nsys will create a file called net.sqlite. :: +scripts as well. Nsys will create a file called net.sqlite. :: + + $ python -m pyprof.parse net.sqlite > net.dict + +.. csv-table:: Extracted information for each GPU kernel + :header: "Tool", "Value", "Example" + :widths: 70, 100, 100 + + "NVProf/Nsys", "Kernel Name", "elementwise_kernel" + "", "Duration", "44736 ns" + "", "Grid and block dimensions", "(160,1,1)(128,1,1)" + "", "Thread ID, Device ID, Stream ID", "23, 0, 7" + "\+ PyProf", "Call stack", "resnet.py:210, resnet.py:168" + "", "Layer name", "Conv2_x:Bottleneck_1:ReLU" + "", "Operator", "ReLU" + "", "Tensor Shapes", "[32, 64, 56, 56]" + "", "Datatype", "fp16" + +.. _section-run-prof-script: - python -m pyprof.parse net.sqlite > net.dict - -Run the prof script +Run the Prof Script ------------------- Using the python dictionary created in step 3 as the input, PyProf can produce a CSV output, a columnated output (similar to ``column -t`` for terminal @@ -150,31 +196,39 @@ for instance). It produces 20 columns of information for every GPU kernel but you can select a subset of columns using the ``-c`` flag. Note that a few columns might have the value "na" implying either it's a work in progress or the tool was unable to extract that information. Assuming -the directory is `prof`, here are a few examples of how to use `prof.py`. :: +the directory is ``prof``, here are a few examples of how to use ``prof.py``. + +* Print usage and help.
Lists all available output columns:: + + $ python -m pyprof.prof -h + +* Columnated output of width 150 with some default columns:: + + $ python -m pyprof.prof -w 150 net.dict + +* CSV output:: + + $ python -m pyprof.prof --csv net.dict + +* Space separated output:: + + $ python -m pyprof.prof net.dict - # Print usage and help. Lists all available output columns. - python -m pyprof.prof -h +* Columnated output of width 130 with columns index,direction,kernel name,parameters,silicon time:: - # Columnated output of width 150 with some default columns. - python -m pyprof.prof -w 150 net.dict + $ python -m pyprof.prof -w 130 -c idx,dir,kernel,params,sil net.dict - # CSV output. - python -m pyprof.prof --csv net.dict +* CSV output with columns index,direction,kernel name,parameters,silicon time:: - # Space seperated output. - python -m pyprof.prof net.dict + $ python -m pyprof.prof --csv -c idx,dir,kernel,params,sil net.dict - # Columnated output of width 130 with columns index,direction,kernel name,parameters,silicon time. - python -m pyprof.prof -w 130 -c idx,dir,kernel,params,sil net.dict +* Space separated output with columns index,direction,kernel name,parameters,silicon time:: - # CSV output with columns index,direction,kernel name,parameters,silicon time. - python -m pyprof.prof --csv -c idx,dir,kernel,params,sil net.dict + $ python -m pyprof.prof -c idx,dir,kernel,params,sil net.dict - # Space separated output with columns index,direction,kernel name,parameters,silicon time. - python -m pyprof.prof -c idx,dir,kernel,params,sil net.dict +* Input redirection:: - # Input redirection. - python -m pyprof.prof < net.dict + $ python -m pyprof.prof < net.dict .. csv-table:: Options for prof.py :header: "Command", "Description" @@ -223,13 +277,14 @@ Profiling GPU workloads may require access to hardware performance counters.
Due to a fix in recent NVIDIA drivers addressing CVE-2018-6260, the hardware counters are disabled by default, and require elevated privileges to be enabled again. If you're using a recent driver, -you may see the following message when trying to run nvprof: +you may see the following message when trying to run nvprof :: -**_ERR_NVGPUCTRPERM The user running does not have permission to access NVIDIA GPU Performance Counters on the target device._** + ERR_NVGPUCTRPERM The user running does not have permission to access NVIDIA GPU Performance Counters on the target device. For details, see `here `_. -*Permanent solution* +Permanent Solution +^^^^^^^^^^^^^^^^^^ Follow the steps here. The current steps for Linux are: :: @@ -240,7 +295,8 @@ Follow the steps here. The above steps should result in a permanent change. -*Temporary solution* +Temporary Solution +^^^^^^^^^^^^^^^^^^ When running on bare metal, you can run nvprof with sudo. diff --git a/docs/quickstart.rst b/docs/quickstart.rst index 1443a79..6567248 100644 --- a/docs/quickstart.rst +++ b/docs/quickstart.rst @@ -20,24 +20,72 @@ Quickstart PyProf is available in the following ways: -* As installable python code located in GitHub. +* As :ref:`installable python code located in GitHub `. + +* As a pre-built Docker container available from the `NVIDIA GPU Cloud (NGC) + `_. For more information, see :ref:`section-installing-from-ngc`. + +* As a buildable Docker container. You can :ref:`build your + own container using Docker `. .. _section-quickstart-prerequisites: Prerequisites ------------- - TODO: List any prerequisites, including point to instructions on how to - install either +* If you are installing directly from GitHub or building your own Docker + container, you will need to clone the PyProf GitHub repo. Go to + https://github.com/NVIDIA/PyProf and then select the *clone* or *download* + drop-down button.
After cloning the repo be sure to select the r<xx.yy> release branch that corresponds to the version of PyProf you want to use:: + + $ git checkout r20.07 + +* If you are starting with a pre-built NGC container, you will need to install + Docker and nvidia-docker. For DGX users, see `Preparing to use NVIDIA Containers + `_. + For users other than DGX, see the `nvidia-docker installation documentation + `_. + +.. _section-quickstart-using-a-prebuilt-docker-container: + +Using a Prebuilt Docker Container +---------------------------------- + +Use ``docker pull`` to get the PyTorch container from NGC:: + + $ docker pull nvcr.io/nvidia/pytorch:<xx.yy>-py3 + +Where <xx.yy> is the version of PyProf that you want to pull. Once you have the +container, you can run the container with the following command:: + + $ docker run --gpus=1 --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -v/full/path/to/example/model/repository:/models <docker image> + +Where <docker image> is *nvcr.io/nvidia/pytorch:<xx.yy>-py3*. + +.. _section-quickstart-building-with-docker: + +Building With Docker +-------------------- + +Make sure you complete the steps in +:ref:`section-quickstart-prerequisites` before attempting to build the PyProf +container. To build PyProf from source, change to the root directory of +the GitHub repo and check out the release branch that +you want to build (or the master branch if you want to build the +under-development version):: + + $ git checkout r20.07 + +Then use Docker to build:: + + $ docker build --pull -t pyprof . -.. _section-quickstart-installing-from-github: +After the build completes you can run the container with the following command:: -Installing from GitHub ---------------------- + $ docker run --gpus=1 --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -v/full/path/to/example/model/repository:/models <docker image> -Make sure you complete the steps in :ref:`section-quickstart-prerequisites` -before attempting to install PyProf.
See :ref:`section-installing-from-github` -for details on how to install from GitHub +Where <docker image> is *pyprof*. .. _section-quickstart-profile-with-pyprof: