TritonBench Metrics and Performance Measurement Options

TritonBench supports two types of metrics: built-in and user-defined. Metrics are specified with the `--metrics <METRIC_NAMES>` option, where `<METRIC_NAMES>` is a comma-separated list of built-in or user-defined metric names.
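
For example, the following collects two built-in metrics in a single run (the `run.py` entry point and the `gemm` operator name are illustrative; substitute the operator you want to benchmark):

```bash
# Collect latency and walltime for one operator
python run.py --op gemm --metrics latency,walltime
```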

Built-in metrics

TritonBench supports a rich set of built-in metrics.

| Metric Name | Definition |
| --- | --- |
| `latency` | The latency reported by `triton.testing.do_bench`, in milliseconds. |
| `kineto_trace` | Chrome trace generated by Kineto. More details in Kineto Trace Analysis. |
| `walltime` | CPU-side wall-clock latency, including CPU kernel launch time. |
| `ncu_rep` | (NVIDIA-only) Generates an NVIDIA Nsight Compute replay file. |
| `nsys_rep` | (NVIDIA-only) Generates an NVIDIA Nsight Systems replay file. |
| `speedup` | (Requires a baseline backend) Latency speedup compared to the baseline backend. |
| `accuracy` | (Requires a baseline backend) Numerical accuracy compared to the baseline backend. |
| `compile_time` | (Triton-only) Triton compile time. |
| `compile_trace` | (Triton-only) Kineto profiling of the Triton compile. |
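
Metrics that require a baseline backend, such as `speedup` and `accuracy`, can be requested alongside `latency` (the operator name is again illustrative; the baseline is whichever backend the operator registers as its baseline):

```bash
# speedup and accuracy are computed against the operator's baseline backend
python run.py --op gemm --metrics latency,speedup,accuracy
```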

Additional Options for High-precision GPU Kernel Latency Measurement

Latency: `--latency-measure-mode [triton_do_bench | profiler]` and `--cudagraph`

Latency is the foundation of all performance-related metrics, such as memory throughput and TFLOPS.

By default, latency is measured with `triton.testing.do_bench`. This method is fast, but it may not be accurate because the measured time includes CUDA kernel launch overhead, especially when the operator launches multiple CUDA kernels. Passing the `--cudagraph` option replays the benchmark with CUDA Graphs, which reduces the CPU-side launch overhead.

Another option is `--latency-measure-mode profiler`, which is slower but more accurate: it uses the Kineto profiler to measure the latency, excluding the CUDA kernel launch overhead. Using `--latency-measure-mode profiler --cudagraph` together is by far the most accurate latency measurement approach.
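
For example, the most accurate configuration combines both options (operator name illustrative):

```bash
# Profiler-based timing with CUDA Graph launch
python run.py --op gemm --metrics latency --latency-measure-mode profiler --cudagraph
```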

The `--cudagraph` option also works with `--metrics kineto_trace`, which collects the Kineto trace while the kernel is launched through a CUDA Graph. Note, however, that not all kernels work with CUDA Graphs enabled.
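
For example (operator name illustrative):

```bash
# Capture a Kineto trace while replaying through a CUDA Graph
python run.py --op gemm --metrics kineto_trace --cudagraph
```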

Latency: `--warmup <MS>`, `--rep <MS>` and `--sleep <SEC>`

Three runtime options can also affect the latency measurement: `--warmup`, `--rep`, and `--sleep` (combined in the example after the list below).

  • `--warmup <MS>`: The number of milliseconds to warm up the kernel before measuring latency.
  • `--rep <MS>`: The number of milliseconds over which to repeat the kernel execution. For example, `--rep 1000` repeats the kernel execution for 1 second.
  • `--sleep <SEC>`: The number of seconds to sleep between backend executions. For example, `--sleep 1` sleeps for 1 second between backends, letting the GPU return to its normal idle power state before each backend runs.
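
A minimal invocation combining the three options (the values and operator name are illustrative):

```bash
# 100 ms of warmup, 1 s of measured repetitions, 1 s of idle time between backends
python run.py --op gemm --metrics latency --warmup 100 --rep 1000 --sleep 1
```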

User-defined metrics

Additionally, users can define custom metrics, or override built-in ones, in `operator.py` using the `@register_metric` decorator. These user-defined metrics can build on the base metrics provided by TritonBench, such as `latency`, `walltime`, and `kineto_trace`.

Here is an example:
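
(This is a minimal sketch of a user-defined TFLOPS metric. It assumes the decorator and base class are importable from `tritonbench.utils.triton_op` and that a metric function receives the backend name, the example inputs, and the already-collected base metrics; check the repository for the exact import path and signature.)

```python
from tritonbench.utils.triton_op import BenchmarkOperator, register_metric

class Operator(BenchmarkOperator):
    # ... backend registrations (e.g. @register_benchmark) elided ...

    @register_metric()
    def tflops(self, fn_name, example_inputs, metrics):
        # Hypothetical FLOP count for a GEMM-style operator, derived
        # from the input shapes: C[m, n] = A[m, k] @ B[k, n].
        a, b = example_inputs
        m, k = a.shape
        _, n = b.shape
        flops = 2 * m * n * k
        # metrics.latency is assumed here to be the base latency metric
        # in milliseconds; convert to seconds before computing TFLOPS.
        return flops / (metrics.latency * 1e-3) / 1e12
```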