# TritonBench Metrics and Performance Measurement Options
TritonBench supports two types of metrics: built-in and user-defined.
All metrics are specified with the `--metrics <METRIC_NAMES>` option, where `<METRIC_NAMES>` is a comma-separated list of built-in or user-defined metric names.
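For example, assuming the standard `run.py --op <OP_NAME>` entry point, the following collects both latency and walltime for a single operator (`gemm` here is just an illustrative operator name):

```bash
python run.py --op gemm --metrics latency,walltime
```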
TritonBench supports a rich set of built-in metrics.
| Metric Name | Definition |
|---|---|
| `latency` | The latency given by `triton.testing.do_bench`, in milliseconds. |
| `kineto_trace` | Chrome Trace generated by Kineto. More details in Kineto Trace Analysis. |
| `walltime` | CPU-side wall latency, including CPU kernel launch time. |
| `ncu_rep` | (NVIDIA-only) Generate the NVIDIA Nsight Compute replay file. |
| `nsys_rep` | (NVIDIA-only) Generate the NVIDIA Nsight Systems replay file. |
| `speedup` | (Requires baseline backend) Latency speedup compared to the baseline backend. |
| `accuracy` | (Requires baseline backend) Numeric accuracy compared to the baseline backend. |
| `compile_time` | (Triton-only) Triton compile time. |
| `compile_trace` | (Triton-only) Kineto profiling of Triton compilation. |
Latency is the foundation of all performance-related metrics, such as memory throughput and TFLOPS. By default, latency is measured by `triton.testing.do_bench`. This method is fast, but it may not be accurate because it does not account for the time spent in CUDA kernel launch overhead, especially when the operator involves multiple CUDA kernel launches. The `--cudagraph` option uses CUDA Graph to reduce this CPU-side launch overhead.

Another option is `--latency-measure-mode profiler`, which is slower but more accurate because it uses the Kineto profiler to measure the latency, excluding the CUDA kernel launch overhead. Using `--latency-measure-mode profiler --cudagraph` together is by far the most accurate latency measurement approach.
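For example, the most accurate configuration can be requested as follows (again assuming the `run.py --op <OP_NAME>` entry point, with an illustrative operator):

```bash
python run.py --op gemm --latency-measure-mode profiler --cudagraph
```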
The `--cudagraph` option also works with `--metrics kineto_trace`, which collects the Kineto trace when launching the kernel with CUDA Graph. However, note that not all kernels will work with CUDA Graph enabled.
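A sketch of such an invocation (entry point and operator name are illustrative):

```bash
python run.py --op gemm --metrics kineto_trace --cudagraph
```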
There are three runtime options that can also affect the latency measurement: `--warmup`, `--rep`, and `--sleep` (see the example after this list).

- `--warmup <MS>`: The number of milliseconds to warm up the kernel before measuring the latency.
- `--rep <MS>`: The number of milliseconds to repeat the kernel execution. For example, `--rep 1000` repeats the kernel execution for 1 second.
- `--sleep <SEC>`: The number of seconds to sleep between backend executions. For example, `--sleep 1` sleeps for 1 second between backend executions, restoring the GPU power state to its normal idle level before each backend runs.
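For example, combining all three options (entry point and operator name are illustrative):

```bash
# 100 ms of warmup, 1 s of measured repetition, 1 s of idle time between backends
python run.py --op gemm --metrics latency --warmup 100 --rep 1000 --sleep 1
```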
Additionally, users can define custom metrics, or override existing ones, in `operator.py` using the `@register_metric` decorator. These user-defined metrics can build on the basic metrics provided by TritonBench, such as `latency`, `walltime`, and `kineto_trace`.
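Here is a minimal sketch of a user-defined `tflops` metric derived from the measured `latency`. The import path, method signature, and layout of the `metrics` object follow common TritonBench usage but are assumptions that may differ between versions, and the GEMM-style operator is purely illustrative:

```python
# operator.py -- sketch of a user-defined metric on a GEMM-style operator.
# The import path, method signature, and `metrics` layout are assumptions
# that may vary between TritonBench versions.
from typing import Any

from tritonbench.utils.triton_op import (
    BenchmarkOperator,
    BenchmarkOperatorMetrics,
    register_metric,
)


class Operator(BenchmarkOperator):
    # ... benchmark registration and input generation omitted ...

    @register_metric()
    def tflops(
        self, fn_name: str, example_inputs: Any, metrics: BenchmarkOperatorMetrics
    ) -> float:
        # Assume the operator benchmarks C = A @ B with A: (M, K) and B: (K, N).
        a, b = example_inputs
        m, k = a.size()
        _, n = b.size()
        flops = 2.0 * m * n * k
        # `metrics.latency` is assumed to hold the measured latency in
        # milliseconds; convert to seconds, then scale FLOPS to TFLOPS.
        return flops / (metrics.latency * 1e-3) / 1e12
```

Once registered, the metric can be requested like any built-in one, e.g. `--metrics tflops,latency`.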