
Benchmark Methodology

Purpose

This repo already had benchmark harnesses, manifests, tier-1 suites, and cluster evaluation bundles. What it lacked was one explicit methodology that says how those pieces should be used when the output needs to stand up in engineering reviews, release gates, customer escalations, analyst briefings, or benchmark publications.

This document is that contract.

Use it with:

  • the BenchmarkRun template and validator (templates/benchmark_run.yaml, core/scripts/validate_benchmark_run.py)
  • the workload freeze contract in templates/benchmark_workload_spec.yaml
  • the performance intake and triage guide in docs/perf_intake_and_triage.md
  • the Kubernetes service design in cluster/docs/kubernetes_benchmark_service.md
  • the warehouse design in docs/performance_warehouse.md

The Three-Layer Stack

Use all three layers together. Do not skip straight to end-to-end numbers and then guess at root cause.

| Layer | Goal | Typical repo entrypoints |
| --- | --- | --- |
| Microbenchmarks | Isolate subsystem limits and regressions. | `python -m cli.aisp benchmark memory cache` |
| Component benchmarks | Measure serving, dataloaders, job startup, and scheduler paths with pinned workload semantics. | `python -m cli.aisp bench run --targets ... --profile minimal` (or `deep_dive`), core/scripts/profiling/perf_triage_bundle.sh, cluster/scripts/run_vllm_bench.sh, cluster/scripts/run_torchrun_transformer_train_step.sh |
| End-to-end benchmarks | Validate realistic workflows after lower layers are understood. | `python -m cli.aisp bench run-tier1 --single-gpu --profile minimal`, `python -m cli.aisp cluster common-eval --preset common-answer-fast` (or `--preset modern-llm`), end-to-end labs under labs/real_world_models |

The methodology program should model all three layers even if a single BenchmarkRun only enables a subset.

Workload Freeze Rule

Before comparing anything, freeze the workload definition:

  • same model
  • same sequence-length mix
  • same precision
  • same batching policy
  • same concurrency model
  • same dataset or prompt corpus
  • same correctness policy

The human-editable freeze contract lives in templates/benchmark_workload_spec.yaml. The declarative/operator-facing version lives in templates/benchmark_run.yaml and can be checked with core/scripts/validate_benchmark_run.py.
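
As a rough sketch of how the freeze rule can be enforced in tooling, the check below loads a workload spec and reports any frozen dimension that is missing or unset. The field names are illustrative assumptions, not the schema of templates/benchmark_workload_spec.yaml.

```python
# Hypothetical sketch: verify that a workload spec pins every frozen dimension.
# Field names are illustrative assumptions, not the actual template schema.
import sys

import yaml

FROZEN_FIELDS = [
    "model",
    "sequence_length_mix",
    "precision",
    "batching_policy",
    "concurrency_model",
    "dataset",
    "correctness_policy",
]

def unfrozen_fields(path: str) -> list[str]:
    """Return the frozen fields that are missing or left unset in the spec."""
    with open(path) as f:
        spec = yaml.safe_load(f) or {}
    workload = spec.get("workload", {})
    return [field for field in FROZEN_FIELDS if not workload.get(field)]

if __name__ == "__main__":
    missing = unfrozen_fields(sys.argv[1])
    if missing:
        print(f"workload is not frozen; unset fields: {missing}")
        sys.exit(1)
    print("workload freeze contract satisfied")
```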

Comparison Rule

Compare one variable at a time. Valid candidates include:

  • hardware generation
  • runtime version
  • scheduler path
  • control-plane path
  • driver stack
  • network topology
  • storage stack

If more than one variable moves, the run is exploratory, not publication-grade.
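
A minimal way to make this rule mechanical is to diff the two run configurations and count how many variables moved. The sketch below is illustrative only; the config keys are assumptions, not a repo schema.

```python
# Hypothetical sketch: classify a comparison by how many variables moved
# between two run configurations. The config keys are illustrative, not a
# repo schema.

def moved_variables(baseline: dict, candidate: dict) -> list[str]:
    """Return the keys whose values differ between the two run configs."""
    keys = set(baseline) | set(candidate)
    return sorted(k for k in keys if baseline.get(k) != candidate.get(k))

def classify_comparison(baseline: dict, candidate: dict) -> str:
    moved = moved_variables(baseline, candidate)
    if not moved:
        return "replicate: no variable moved"
    if len(moved) == 1:
        return f"publication-grade candidate: single variable moved ({moved[0]})"
    return f"exploratory: multiple variables moved ({moved})"

# Example: only the runtime version differs, so the comparison is valid.
baseline = {"hardware": "H100", "runtime": "vllm-0.5.x", "scheduler": "kueue"}
candidate = {"hardware": "H100", "runtime": "vllm-0.6.x", "scheduler": "kueue"}
print(classify_comparison(baseline, candidate))
```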

Metrics By Workload Type

Training

Track these as primary metrics:

  • time to train
  • MFU
  • scaling efficiency
  • training reliability

Inference

Track these as primary metrics:

  • TTFT
  • tokens per second
  • p99 latency
  • jitter
  • cost per token or request

Secondary metrics such as request goodput, p50 latency, GPU utilization, NIC throughput, and power are diagnostic, not substitutes for the user-facing outcome.
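
For inference, the primary metrics can all be derived from per-request timing records. The sketch below shows one way to compute them; the record fields are assumptions about what a harness logs, not the repo's actual output format.

```python
# Hypothetical sketch: derive the primary inference metrics from per-request
# timing records. The record fields (start, first_token, end, output_tokens)
# are assumptions about what a harness logs, not a repo schema.
import statistics

def inference_metrics(requests: list[dict]) -> dict:
    ttfts = [r["first_token"] - r["start"] for r in requests]
    latencies = sorted(r["end"] - r["start"] for r in requests)
    total_tokens = sum(r["output_tokens"] for r in requests)
    wall_clock = max(r["end"] for r in requests) - min(r["start"] for r in requests)
    # Simple order-statistic approximation of p99; fine for a sketch.
    p99_index = max(0, int(len(latencies) * 0.99) - 1)
    return {
        "ttft_p50_s": statistics.median(ttfts),
        "tokens_per_second": total_tokens / wall_clock,
        "latency_p99_s": latencies[p99_index],
        # Jitter reported here as the standard deviation of request latency.
        "latency_jitter_s": statistics.pstdev(latencies),
    }
```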

Artifact Contract

Treat benchmark artifacts like release artifacts.

Every serious run must capture:

  • pinned workload spec
  • image digests, not mutable image tags
  • driver, CUDA, NCCL, framework, and runtime versions
  • exact hardware topology
  • immutable raw artifacts
  • audit trail linking the claim back to the run

Every serious benchmark program should also preserve:

  • raw logs, traces, and profiler outputs in immutable low-cost storage
  • curated fact and dimension tables in a columnar analytical store
  • retention tiers for hot, warm, and cold data
  • a lineage path from any published number back to the raw artifact manifest
  • cardinality budgets so operational metrics remain usable

For publication-grade work, also require signed provenance. The repo template models this through spec.provenance.signing, even if the attestation backend is provided by your deployment environment rather than by the harness itself.

The current harness already captures manifests, profiler artifacts, raw timings, and artifact hashes. Cryptographic provenance signing, by contrast, is a policy requirement that still depends on deployment or CI infrastructure today; do not imply it exists if the attestation path was not actually used.
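
A minimal version of the artifact contract, signing aside, is a manifest that pins the image digest and version stack and content-addresses every raw artifact. The sketch below is an illustration only, not the repo's core/benchmark/run_manifest.py, and its field names are assumptions.

```python
# Hypothetical sketch: capture a minimal run manifest with content-addressed
# artifacts and a pinned environment. This is an illustration only, not the
# repo's core/benchmark/run_manifest.py; field names are assumptions.
import hashlib
import json
from pathlib import Path

def sha256_file(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(run_dir: Path, image_digest: str, versions: dict) -> Path:
    manifest = {
        "image_digest": image_digest,   # digest-pinned image, never a mutable tag
        "versions": versions,           # driver, CUDA, NCCL, framework, runtime
        "artifacts": {
            str(p.relative_to(run_dir)): sha256_file(p)
            for p in sorted(run_dir.rglob("*"))
            if p.is_file() and p.name != "manifest.json"
        },
    }
    out = run_dir / "manifest.json"
    out.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return out
```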

Existing repo primitives that already support this:

  • core/benchmark/run_manifest.py
  • cluster/runs/<run_id>/manifest.json
  • tier-1 history packages under artifacts/history/tier1/<run_id>/
  • canonical cluster packages under cluster/published/current/

Trial Policy

Do not report a single run.

  • run multiple trials
  • keep raw distributions
  • report confidence
  • apply explicit outlier policy
  • decompose performance shifts into compute, communication, storage, and orchestration overhead

The repo BenchmarkRun validator enforces:

  • minReplicates >= 3
  • confidence level must be declared
  • outlier policy must be declared
  • distributions and confidence intervals must be reported
  • rank-level outliers must be surfaced for distributed runs
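
The statistical core of the trial policy is small: keep the raw distribution, apply a declared outlier policy, and report an interval rather than a point. The sketch below uses an IQR outlier filter and a percentile bootstrap on the median; both choices are illustrative assumptions, not the validator's actual rules.

```python
# Hypothetical sketch: summarize replicate trials with an explicit IQR outlier
# policy and a percentile-bootstrap confidence interval on the median. The
# policy and thresholds are illustrative assumptions, not the validator rules.
import random
import statistics

def summarize_trials(samples: list[float], confidence: float = 0.95) -> dict:
    assert len(samples) >= 3, "minReplicates >= 3"
    q1, _, q3 = statistics.quantiles(samples, n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    kept = [s for s in samples if low <= s <= high]
    outliers = [s for s in samples if not (low <= s <= high)]

    # Percentile bootstrap on the median of the retained samples.
    rng = random.Random(0)
    medians = sorted(
        statistics.median(rng.choices(kept, k=len(kept))) for _ in range(2000)
    )
    lo = medians[int((1 - confidence) / 2 * len(medians))]
    hi = medians[int((1 + confidence) / 2 * len(medians)) - 1]

    return {
        "n": len(samples),
        "median": statistics.median(kept),
        "confidence_interval": (lo, hi),
        "outliers": outliers,
        "raw": samples,  # keep the raw distribution, not just the summary
    }
```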

Bottleneck Taxonomy

Classify the bottleneck before choosing instrumentation.

| Bottleneck | What to inspect first | Repo-native signal |
| --- | --- | --- |
| Compute-bound | GPU counters, NCU, kernel timelines, model-server metrics | profile_ncu, tier-1 deep dive, Nsight artifacts |
| Comm-bound | NCCL traces, RDMA probes, NVLink/NIC health | cluster all-reduce, all-gather, stability suites |
| Input-bound | Storage probes, network probes, dataloader metrics | fio bundles, data pipeline sweeps, triage bundle |
| Control-plane-bound | Scheduler timing, queue depth, job startup timing | Readiness runs, startup benchmarks, suite steps |

Do not over-index on GPU utilization alone. Low utilization can mean dataloader starvation, CPU throttling, queueing, network stalls, slow collectives, or bad topology placement.
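
A first-pass triage heuristic along these lines can be scripted, as long as it treats GPU utilization as a symptom rather than a verdict. The sketch below is a toy classifier; the signal names and thresholds are assumptions, and real triage should rely on the repo-native signals in the table above.

```python
# Hypothetical sketch: a toy first-pass triage heuristic over the taxonomy
# above. Signal names and thresholds are illustrative assumptions; real triage
# should rely on the repo-native signals listed in the table.

def classify_bottleneck(signals: dict) -> str:
    """signals: fractions of step time per phase, plus scheduler queue depth."""
    if signals.get("scheduler_queue_depth", 0) > 10:
        return "control-plane-bound: inspect scheduler timing and job startup"
    if signals.get("dataloader_wait_frac", 0.0) > 0.2:
        return "input-bound: inspect storage, network, and dataloader metrics"
    if signals.get("collective_time_frac", 0.0) > 0.3:
        return "comm-bound: inspect NCCL traces, RDMA paths, NVLink/NIC health"
    # GPU utilization is deliberately not the deciding signal here.
    return "likely compute-bound: inspect GPU counters and kernel timelines"

print(classify_bottleneck({"dataloader_wait_frac": 0.35, "collective_time_frac": 0.1}))
```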

Distributed Rule: The Slowest Rank Wins

Distributed training and serving are gated by the slowest rank or node.

Always collect:

  • rank-level collective latency and bandwidth
  • rank or node outliers
  • RDMA path validation
  • GPU-to-NIC affinity
  • PCIe width and speed
  • NVLink health
  • thermal throttling

If one GPU or one node is consistently slow, isolate it and check whether the cluster returns to baseline. In Kubernetes environments, make node cordon or removal an explicit remediation path, not an ad hoc decision.
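
Detecting the gating rank is mostly a matter of comparing per-rank collective timings against the group. The sketch below flags ranks whose latency exceeds the median by a tolerance; the 10% threshold is an illustrative assumption, not a repo default.

```python
# Hypothetical sketch: flag straggler ranks from per-rank all-reduce latencies.
# The 10% tolerance is an illustrative assumption, not a repo default.
import statistics

def find_stragglers(rank_latencies_ms: dict[int, float], tolerance: float = 0.10) -> list[int]:
    """Return ranks whose collective latency exceeds the median by more than `tolerance`."""
    median = statistics.median(rank_latencies_ms.values())
    return sorted(
        rank for rank, latency in rank_latencies_ms.items()
        if latency > median * (1 + tolerance)
    )

# Example: rank 6 is consistently slow, so it gates the whole job.
latencies = {0: 41.0, 1: 40.5, 2: 42.1, 3: 41.7, 4: 40.9, 5: 41.3, 6: 55.8, 7: 41.0}
print(find_stragglers(latencies))  # [6]
```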

Relevant repo surfaces:

  • cluster/scripts/run_allreduce_stability.sh
  • cluster/scripts/run_gemm_sanity_all_nodes.sh
  • cluster/analysis/plot_mamf.py
  • python -m cli.aisp cluster common-eval --preset modern-llm

Serving Stack Comparison Rule

When benchmarking llm-d, vLLM, Triton, KServe-style serving paths, or equivalent stacks, keep the serving contract fixed unless that contract is the variable under test.

Freeze:

  • protocol surface
  • deployment model
  • prompt and output token distributions
  • concurrency sweep points
  • batching policy
  • precision
  • GPU type

If protocol behavior or deployment shape changes between candidates, record that explicitly. Otherwise you are often measuring framework API behavior or control-plane differences more than serving efficiency.

Publication-Grade Versus Realism-Grade

These are different test modes and should stay different.

| Mode | Isolation policy | Why it exists |
| --- | --- | --- |
| Publication-grade | Dedicated nodes, stable background load, fixed topology, topology-exclusive scheduling when appropriate | Hard regression gates, public claims, RFPs, analyst or benchmark publications |
| Realism-grade | Multi-tenant scenarios included, cluster context preserved, background load tolerated and recorded | Customer-experience testing and operator reality |

Both modes must record enough cluster context to explain outliers later.

Declarative BenchmarkRun

The repo now exposes a portable BenchmarkRun spec in templates/benchmark_run.yaml.

Validate a spec locally:

python -m core.scripts.validate_benchmark_run --file templates/benchmark_run.yaml

Use this spec as the stable interface between:

  • workload owners
  • performance engineers
  • cluster operators
  • CI/CD schedulers
  • analytical storage

The BenchmarkRun contract also models:

  • observability join keys that tie cluster, job, request, rank, and hardware telemetry together
  • raw artifact versus curated warehouse sinks
  • retention tiers and lineage requirements
  • scenario playbooks for tail latency, starvation, stragglers, and scheduler backpressure
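
One way to picture the observability join keys is as a small record attached to every measurement before it is emitted, so cluster, job, request, rank, and hardware telemetry can be joined later. The field names below are illustrative assumptions, not the BenchmarkRun schema.

```python
# Hypothetical sketch: join keys attached to every measurement so cluster,
# job, request, rank, and hardware telemetry can be correlated later. Field
# names are illustrative assumptions, not the BenchmarkRun schema.
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class ObservabilityKeys:
    run_id: str      # ties back to the BenchmarkRun and its manifest
    cluster: str     # cluster or environment identifier
    job_id: str      # scheduler-level job identity
    request_id: str  # per-request identity for serving ("" for training)
    rank: int        # rank for distributed workloads (-1 if not applicable)
    node: str        # hardware locus: node hostname
    gpu_uuid: str    # hardware locus: specific device

def tag_measurement(keys: ObservabilityKeys, name: str, value: float) -> dict:
    """Attach the join keys to a single measurement before it is emitted."""
    return {"metric": name, "value": value, **asdict(keys)}
```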

Repo Mapping

Use the declarative spec to choose the right execution path instead of inventing a new one per workload.

| Need | Repo command |
| --- | --- |
| Fast benchmark health signal | `python -m cli.aisp bench run-tier1 --single-gpu --profile minimal` |
| Performance intake + quick instrumentation bundle | Follow docs/perf_intake_and_triage.md |
| Common cluster evaluation | `python -m cli.aisp cluster common-eval --preset common-answer-fast` |
| Full modern LLM cluster evaluation | `python -m cli.aisp cluster common-eval --preset modern-llm` |
| Multi-node contract validation before workloads | `python -m cli.aisp cluster common-eval --preset multinode-readiness` |

Kubernetes Service Design

The declarative service design requested in the benchmark methodology lives in cluster/docs/kubernetes_benchmark_service.md.

The warehouse design that backs those runs lives in docs/performance_warehouse.md.

Thin interface surfaces for these contracts are exposed through:

  • python -m cli.aisp tools benchmark-contracts
  • dashboard API GET /api/benchmark/contracts
  • MCP tool benchmark_contracts

That design makes the service Kubernetes-native without pretending the current repo already ships a full operator:

  • BenchmarkRun is the declarative workload and measurement contract
  • an operator resolves it into SLINKY and Kueue execution primitives
  • hot metrics land in time-series storage
  • long-term regression data lands in an analytical warehouse
  • canary, nightly, and pre-release runs are scheduled automatically