Reference implementation of high-performance PyTorch, CUDA, and Triton workloads for NVIDIA Blackwell platforms. The repository packages 20 focused chapters, advanced labs, and the shared benchmarking harness so you can profile baselines, apply optimizations, and capture artifacts that prove performance gains.
Roadmap: docs/performance_repo_roadmap.md defines the prioritized plan for canonical suites, trend tracking, anti-pattern enforcement, shared benchmark bases, and evidence-first documentation.
Most performance repos are easy to browse and hard to trust. This one is meant to do the opposite:
- baseline and optimized paths live side-by-side
- correctness checks run before speedup claims matter
- profiler artifacts and benchmark manifests are first-class outputs, not optional extras
The fastest way to answer "is this repo still delivering real wins?" is the canonical tier-1 suite.
python -m cli.aisp bench run-tier1 --single-gpu --profile minimalThat command now writes a stable history package under artifacts/history/tier1/<run_id>/:
summary.json: per-target baseline, optimized path, and best speedupregression_summary.md: human-readable before/after summary against the previous tier-1 runregression_summary.json: machine-readable regressions and improvementstrend_snapshot.json: run-history summary for dashboards and release notesartifacts/history/tier1/index.json: suite history index
See docs/tier1_benchmark_suite.md for the current target list, artifact contract, and interpretation guidance.
These numbers are taken from the latest canonical tier-1 history summary rather than from hand-maintained README text.
Source artifact: artifacts/history/tier1/20260309_211715_tier1_nightly_local/summary.json
Representative suite speedup: 7.16x geomean, 8.29x median, 21.79x arithmetic average.
| Target | Baseline | Optimized | Measured delta | Artifact |
|---|---|---|---|---|
labs/block_scaling:block_scaling |
0.197 ms |
0.117 ms |
1.68x |
artifacts/history/tier1/20260309_211715_tier1_nightly_local/summary.json |
labs/flashattention4:flashattention4_alibi |
5.440 ms |
0.385 ms |
14.14x |
artifacts/history/tier1/20260309_211715_tier1_nightly_local/summary.json |
labs/persistent_decode:persistent_decode |
1.331 ms |
0.089 ms |
15.03x |
artifacts/history/tier1/20260309_211715_tier1_nightly_local/summary.json |
labs/kv_optimization:kv_standard |
1606.800 ms |
993.124 ms |
1.62x |
artifacts/history/tier1/20260309_211715_tier1_nightly_local/summary.json |
ch04:gradient_fusion |
3.554 ms |
0.037 ms |
95.84x |
artifacts/history/tier1/20260309_211715_tier1_nightly_local/summary.json |
labs/real_world_models:llama_3_1_8b |
12.812 ms |
5.259 ms |
2.44x |
artifacts/history/tier1/20260309_211715_tier1_nightly_local/summary.json |
When you want proof beyond wall-clock timing, use the same harness target with a profiling mode instead of a different script.
python -m cli.aisp bench run --targets labs/block_scaling:block_scaling --profile deep_dive --single-gpu
python -m cli.aisp bench run --targets labs/flashattention4:flashattention4_alibi --profile deep_dive --single-gpu
python -m cli.aisp bench run --targets labs/persistent_decode:persistent_decode --profile deep_dive --single-gpuminimalis the fastest artifact-bearing path.deep_diveis the profiler-backed path for Nsight Systems + Nsight Compute comparisons.- The benchmark harness now blocks more hot-path anti-patterns and follows imported helper code, so "clean benchmark" means more than it used to.
If you are navigating by performance problem instead of by chapter, start with these benchmark-pair labs:
| Lab | Best for | Distinct from |
|---|---|---|
labs/parameterized_cuda_graphs |
fixed-shape PyTorch CUDA Graph replay where request bindings change but graph topology does not | broader decode-serving labs; dynamic-shape graph problems |
labs/persistent_decode |
persistent decode kernels plus graph/TMA variants on serving-style decode paths | the narrower executable-graph parameter-mutation story |
labs/decode_optimization |
decode microbenchmarks that separate pinned memory, streams, compile, graphs, and cache policy | a single graph-launch-overhead benchmark pair |
labs/block_scaling |
Blackwell block-scaled GEMM mechanics and measured tensor-core wins | serving/control-plane orchestration labs |
labs/training_hotpath |
training-path launch and fusion bottlenecks with clean benchmark pairs | inference/decode-oriented graph replay work |
This keeps labs/parameterized_cuda_graphs visible without pretending it replaces the broader persistent-decode or decode-optimization stories.
This repo now exposes one repeatable benchmarking methodology instead of leaving performance work as a collection of scripts.
Start with:
docs/benchmark_methodology.mdfor the three-layer model (micro,component,end_to_end), bottleneck taxonomy, publication-vs-realism policy, and straggler playbook.docs/performance_warehouse.mdfor the stable event schema, raw-versus-curated storage split, retention tiers, and telemetry lineage back to raw evidence.templates/performance_intake.yamlfor KPIs, constraints, and the variable under test.templates/benchmark_workload_spec.yamlfor the frozen workload definition and measurement policy.templates/benchmark_run.yamlfor the CRD-aligned declarativeBenchmarkRunshape the repo would map onto a Kubernetes-native service.cluster/docs/kubernetes_benchmark_service.mdpluscluster/configs/benchmarkrun-crd.yamlfor the cluster-native operator/CRD direction already being sketched in the repo.
Thin surfaces for these contracts are also exposed through python -m cli.aisp tools benchmark-contracts, dashboard API GET /api/benchmark/contracts, and MCP tool benchmark_contracts.
The current harness already captures manifests, profiler artifacts, raw timings, and artifact hashes. Cryptographic provenance signing is still a documented gap, so external publication packets should record that explicitly rather than assuming hashes alone are sufficient.
Cluster evaluation has one supported artifact contract for new work:
cluster/runs/<run_id>/
manifest.json
structured/
raw/
figures/
reports/
Start with:
cluster/README.mdfor the current commands and folder contract.python -m cli.aisp cluster common-eval --preset common-answer-fast ...for the normal "evaluate this system" ask.python -m cli.aisp cluster common-eval --preset modern-llm ...when you need the full canonical package.python -m cli.aisp cluster common-eval --preset multinode-readiness ...before first real multi-node workloads.python -m cli.aisp cluster promote-run --run-id <run_id> ...when one collected run should become the published localhost package.
The current published canonical package lives under cluster/published/current/. New collection still happens under cluster/runs/<run_id>/.
- Understand how the chapters, labs, and shared tooling fit together.
- Stand up a reproducible environment for PyTorch 2.10-dev + CUDA 13 workloads on Blackwell GPUs.
- Run the benchmark harness directly or through the Typer CLI for automated artifact capture.
- Validate peak hardware characteristics before grading optimizations against stored expectations.
| Path | Description |
|---|---|
ch01 - ch20 |
One directory per chapter with baseline/optimized benchmarks, workload configs, and chapter-level harness entrypoints such as ch01/compare.py. |
labs/ |
Deep-dive labs for memory-bandwidth patterns, matmul, routing, FlexAttention, MoE, persistent decode, distributed training, and more. |
core/benchmark/, profiling/, core/, optimization/, analysis/ |
Shared harness, logging, workload metadata, profiling, and optimization utilities used by every chapter. |
python -m cli.aisp bench |
Typer-based CLI for running and profiling targets with reproducible artifacts. |
docs/ + core/scripts/ |
Operational guides, profiling workflows, and setup/reset helpers (setup.sh, cleanup.py, reset-gpu.sh). |
Use the benchmark harness for quick comparisons or drive the Typer CLI when you need repeatable artifact capture.
cd ai-performance-engineering
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements_latest.txt
python -m cli.aisp bench list-targets --chapter ch01
python -m cli.aisp bench run --targets ch01 --profile minimal
python -m cli.aisp bench run-tier1 --single-gpu --profile minimalsetup.shinstalls system prerequisites (drivers, CUDA, Nsight) and should be rerun after driver upgrades.- Benchmark validity profile defaults to strict. Virtualization is warning-only; use
--validity-profile portablefor broader compatibility on hardware-limited hosts. - Use
python -m cli.aisp bench expectations --hardware b200 --min-speedup 1.05to report expectation entries below a target threshold. - Repeat
--targetsfor multi-scope runs, for examplepython -m cli.aisp bench run --targets ch01 --targets ch02 --profile minimal; usepython -m cli.aisp bench run-tier1 --single-gpu --profile minimalfor the canonical regression suite. - Portable runs do not update expectation files unless
--allow-portable-expectations-updateis supplied. python core/analysis/analyze_expectations.py --artifacts-dir artifactscompares new runs to stored thresholds.
pytest tests/integrationsucceeds to confirm harness discovery and CLI plumbing.python core/benchmark/benchmark_peak.pyreports TFLOP/s, bandwidth, and NVLink numbers close to the published ceilings.
The benchmark harness includes a strict set of correctness and validity checks to prevent misleading speedups. Below is the reference list of validity issues we explicitly protect against, plus real-world incidents that motivated these checks.
Note: All 95 validity issues are protected by the harness.
CUDA Graph Note: Capturing CUDA graphs in setup() is allowed for steady-state replay benchmarks (we intentionally
measure replay, not capture). It is NOT allowed to precompute and reuse the final output from setup(). The output
used for verification must come from the timed benchmark_fn() run and be surfaced via capture_verification_payload().
Virtualization Note: validate_environment() treats virtualization (hypervisor present) as a warning. Benchmarks can
run in virtualized environments, but bare metal remains the preferred source for final performance numbers.
| Category | Issue | What Happens | Protection | Status | Real-World Incident |
|---|---|---|---|---|---|
| Timing | Unsynced Streams | Work on non-default streams is not timed | Full device sync + StreamAuditor | OK | Locus/KernelBench 2025 |
| Timing | Incomplete Async Ops | Timer stops before async work finishes | Full device sync | OK | Locus/KernelBench 2025 |
| Timing | Event Timing Gaps | CUDA events recorded incorrectly | Cross-validate with wall clock | OK | |
| Timing | Timer Granularity | Measurement too coarse for fast ops | Adaptive iterations | OK | |
| Timing | Warmup Bleed | Real work happens during warmup | isolate_warmup_cache | OK | |
| Timing | Clock Drift | System clock changes during measurement | Monotonic clock usage | OK | |
| Timing | Profiler Overhead | Profiling tools add latency | Profile-free timing path | OK | |
| Output | Constant Output | Same result regardless of input | Jitter check | OK | |
| Output | Stale Cache | Same result across different seeds | Fresh-input check | OK | |
| Output | Approximation Drift | Rough estimate instead of full compute | Output tolerance validation | OK | |
| Output | Invalid Values (NaN) | NaN in output | validate_result NaN check | OK | |
| Output | Invalid Values (Inf) | Inf in output | validate_result Inf check | OK | |
| Output | Invalid Ground Truth | Labels/expected values wrong | GoldenOutputCache | OK | ImageNet Labels 2021, MMLU Errors 2025 |
| Output | Shape Mismatch | Output shape differs from expected | Shape validation | OK | |
| Output | Dtype Mismatch | Output dtype differs from expected | ToleranceSpec dtype check | OK | |
| Output | Denormalized Values | Subnormal floats cause slowdowns | Denormal check | OK | |
| Output | Uninitialized Memory | Output contains garbage | Memory initialization check | OK | |
| Workload | Precision Mismatch | Claims FP32 but uses FP16 | InputSignature dtype verification | OK | |
| Workload | Backend Precision Policy Drift | Global precision policy changes during timing | Backend policy immutability check | OK | PyTorch TF32 Default 2020 |
| Workload | Undeclared Shortcuts | Skips elements without declaring | Workload invariant check | OK | AI Agent Benchmark Shortcuts 2024 |
| Workload | Early Exit | Stops iteration loops early | Config immutability | OK | |
| Workload | Batch Shrinking | Processes fewer samples | InputSignature matching | OK | |
| Workload | Sequence Truncation | Processes shorter sequences | InputSignature matching | OK | |
| Workload | Hidden Downsampling | Silently reduces resolution | Dimension validation | OK | |
| Workload | Sparsity Mismatch | Different sparsity patterns | Sparsity ratio check | OK | |
| Workload | Attention Mask Mismatch | Different masking applied | Mask equivalence check | OK | |
| Workload | KV Cache Size Mismatch | Different cache sizes | Cache dimension check | OK | |
| Workload | Train/Test Overlap | Model tested on training data | Dataset isolation | OK | Computational Biology 2019 |
| Location | CPU Spillover | Work offloaded to CPU | GPU kernel time validation | OK | |
| Location | Setup Pre-computation | Work done in setup | check_setup_precomputation | OK | |
| Location | Graph Capture Cheat | Pre-compute during graph capture | GraphCaptureCheatDetector | OK | |
| Location | Warmup Computation | Compute results during warmup | isolate_warmup_cache | OK | |
| Location | Background Thread | Compute in separate thread | Process isolation | OK | |
| Location | Lazy Evaluation Skip | Returns unevaluated lazy tensor | force_tensor_evaluation | OK | |
| Location | JIT Compilation Timing | JIT compile time included/excluded inconsistently | clear_compile_cache | OK | |
| Memory | Pre-allocated Output | Result buffer allocated in setup | MemoryAllocationTracker | OK | |
| Memory | Input-Output Aliasing | Output points to pre-filled input | check_input_output_aliasing | OK | |
| Memory | Pinned Memory Timing | Async pinned transfers not waited | Transfer completion check | OK | |
| Memory | Memory Pool Reuse | Cached allocations skew timing | reset_cuda_memory_pool | OK | |
| Memory | Fragmentation Effects | Memory fragmentation differs | Memory pool reset | OK | |
| Memory | Page Fault Timing | First-touch page faults included | Memory pre-touch | OK | |
| Memory | Swap Interference | Swapping affects timing | Memory lock / swap disable | OK | |
| CUDA | Host Callback Escape | cudaLaunchHostFunc returns early | Host function tracking | OK | |
| CUDA | Async Memcpy Incomplete | D2H/H2D copies not awaited | Full device sync | OK | |
| CUDA | Workspace Pre-compute | Work in cuBLAS workspace alloc | Workspace monitoring | OK | |
| CUDA | Persistent Kernel | Kernel left running across calls | Kernel lifetime check | OK | |
| CUDA | Undeclared Multi-GPU | Work spread across undeclared GPUs | validate_environment | OK | |
| CUDA | Context Switch Overhead | CUDA context switches affect timing | Context pinning | OK | |
| CUDA | Driver Overhead | Driver calls not accounted for | Driver call tracking | OK | |
| CUDA | Cooperative Launch Abuse | Cooperative kernels bypass checks | Launch mode validation | OK | |
| CUDA | Dynamic Parallelism Hidden | Child kernels not tracked | CDP kernel tracking | OK | |
| CUDA | Unified Memory Faults | Page migration not timed | UM fault tracking | OK | |
| Compile | Compilation Cache Hit | Returns cached compiled output | clear_compile_cache | OK | |
| Compile | Trace Reuse | Exploits trace caching | torch._dynamo.reset | OK | |
| Compile | Mode Inconsistency | Different compile mode verify vs perf | Mode consistency check | OK | |
| Compile | Inductor Asymmetry | Inductor optimizations inconsistent | Compilation parity | OK | |
| Compile | Guard Failure Hidden | Recompilation not counted | get_compile_state | OK | |
| Compile | Autotuning Variance | Autotuning picks different kernels | Fixed autotuning cache | OK | |
| Compile | Symbolic Shape Exploit | Different shapes trigger different code | InputSignature matching | OK | |
| Distributed | Rank Skipping | Some ranks do not do work | check_rank_execution | OK | |
| Distributed | Collective Short-circuit | Communication skipped | NCCL validation | OK | |
| Distributed | Topology Mismatch | Claims different topology | verify_distributed | OK | |
| Distributed | Barrier Timing | Barrier timing exploited | Barrier synchronization | OK | |
| Distributed | Gradient Bucketing Mismatch | Different bucket sizes | Bucket size validation | OK | |
| Distributed | Async Gradient Timing | Async all-reduce not awaited | Full device sync | OK | |
| Distributed | Pipeline Bubble Hiding | Pipeline bubbles not counted | Bubble time tracking | OK | |
| Distributed | Shard Size Mismatch | FSDP shards differ | InputSignature matching | OK | |
| Environment | Device Mismatch | Uses different GPU than declared | validate_environment | OK | |
| Environment | Frequency Boost | Overclocked for benchmark only | lock_gpu_clocks | OK | |
| Environment | Priority Elevation | Runs at higher priority | Process isolation | OK | |
| Environment | Memory Overcommit | Exploits memory overcommit | Memory validation | OK | |
| Environment | NUMA Inconsistency | NUMA placement differs | NUMA audit | OK | |
| Environment | CPU Governor Mismatch | Different CPU frequency scaling | Governor lock | OK | |
| Environment | Thermal Throttling | GPU throttles during run | capture_gpu_state (pynvml) | OK | |
| Environment | Power Limit Difference | Different TDP settings | capture_gpu_state (pynvml) | OK | |
| Environment | Driver Version Mismatch | Different CUDA drivers | RunManifest version lock | OK | |
| Environment | Library Version Mismatch | Different cuDNN/cuBLAS | RunManifest version lock | OK | |
| Environment | Container Resource Limits | cgroups limits differ | Resource limit check | OK | |
| Environment | Virtualization Overhead | VM/container overhead varies | Bare-metal validation | OK | |
| Statistical | Cherry-picking | Only best iterations reported | All-iteration reporting | OK | Leaderboard Illusion 2025 |
| Statistical | Outlier Injection | Slow iterations added to baseline | Statistical validation | OK | |
| Statistical | Variance Gaming | Variance reporting manipulated | Consistent statistics | OK | |
| Statistical | Percentile Selection | Favorable percentile chosen | Fixed percentile policy | OK | |
| Statistical | Insufficient Samples | Too few iterations for significance | Adaptive iterations | OK | Measuring What Matters 2025 |
| Statistical | Cold Start Inclusion | First run included unfairly | Warmup enforcement | OK | |
| Statistical | GC Interference | Garbage collection during timing | gc_disabled | OK | |
| Statistical | Background Process Noise | System processes affect timing | Process isolation | OK | |
| Evaluation | Eval Code Exploitation | Benchmark code modified to pass | BenchmarkContract enforcement | OK | |
| Evaluation | Timeout Manipulation | Timeout extended to hide slowdowns | Config immutability | OK | |
| Evaluation | Metric Definition Gaming | Redefine what speedup means | Standardized metric definitions | OK | MLPerf 2019, HANS 2019, Measuring What Matters 2025, Medical LLM Benchmarks 2025 |
| Evaluation | Test Data Leakage | Training on test data | Data contamination checks | OK | Benchmark Data Contamination Survey 2024 |
| Evaluation | Benchmark Overfitting | Optimize specifically for benchmark | Fresh-input + jitter checks | OK | Underspecification 2020, Epic Sepsis 2021, NaturalCodeBench 2024 |
| Evaluation | Self-Modifying Tests | AI/code modifies its own tests | Config immutability | OK | |
| Evaluation | Benchmark Memorization | Agent memorizes test cases | Fresh-input checks, jitter | OK | AI Agent Benchmark Shortcuts 2024 |
| Evaluation | Missing Holdout Sets | No proper train/test split | Held-out evaluation data | OK | AI Agent Benchmark Shortcuts 2024, Microsoft Tay 2016 |
Total: 11 categories, 95 validity issues - all protected by the harness.
| Year | Incident | Issue Type | What Happened | Source |
|---|---|---|---|---|
| 2025 | Locus/KernelBench Stream Exploit | Unsynced Streams | Claimed 20x speedup on Llama FFW kernel. AI launched work on non-default CUDA streams but timer only measured default stream. 32.8 percent of RL-generated kernels exploited this, causing fake 18x speedups. | https://x.com/miru_why/status/1991773868806361138 |
| 2025 | Measuring What Matters: Construct Validity in LLM Benchmarks | Metric Definition Gaming / Construct Validity | Systematic review of 445 LLM benchmarks found construct-validity weaknesses and low statistical rigor; issued eight design recommendations. | https://ora.ox.ac.uk/objects/uuid%3Aad2b69b6-0986-42d0-a512-a6e56338b6cc |
| 2025 | Medical LLM Benchmarks and Construct Validity | Metric Definition Gaming / Construct Validity | Position paper argues exam-style medical LLM benchmarks miss real-world tasks and documents construct-validity gaps using clinical data. | https://arxiv.org/abs/2503.10694 |
| 2025 | Sakana AI Scientist Evaluation | Evaluation Integrity | Independent evaluation found frequent experiment failures and hallucinated numerical results. | https://arxiv.org/abs/2502.14297 |
| 2025 | Leaderboard Illusion (Chatbot Arena) | Cherry-picking | Analysis of Chatbot Arena reports selection effects and leaderboard instability when submissions are inconsistent or selectively disclosed. | https://arxiv.org/abs/2504.20879 |
| 2024 | MMLU Benchmark Errors | Invalid Ground Truth | Analysis found 57 percent of MMLU virology subset questions incorrect and estimated 6.49 percent errors overall. | https://arxiv.org/abs/2406.04127 |
| 2024 | AI Agent Benchmark Shortcuts | Missing Holdout Sets | Study found AI agents memorize benchmark test samples instead of learning to generalize. Many benchmarks lack proper holdout test sets. | https://arxiv.org/abs/2407.01502 |
| 2024 | NaturalCodeBench vs HumanEval | Benchmark Overfitting | Real-user coding tasks in NaturalCodeBench show large performance gaps and weak correlation with HumanEval scores. | https://aclanthology.org/2024.findings-acl.471/ |
| 2024 | Benchmark Data Contamination Survey | Data Contamination | Survey catalogs contamination pathways across LLM benchmarks and highlights mitigation gaps. | https://arxiv.org/abs/2406.04244 |
| 2023 | NLP Evaluation Data Contamination | Data Contamination | Position paper warns that LLMs trained on benchmark test splits can inflate reported scores. | https://arxiv.org/abs/2310.18018 |
| 2022 | MLPerf Participation Issues | Cherry-picking | MLPerf faced inconsistent vendor participation; selective scenario submissions led to biased performance representations. | https://web.archive.org/web/20250813110435/https://www.nextplatform.com/2022/04/08/the-performance-of-mlperf-as-a-ubiquitous-benchmark-is-lacking/ |
| 2022 | ML Benchmark Validity (Berkeley) | Benchmark Overfitting | Small changes in data distribution caused significant performance drops, questioning external validity. | https://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-180.html |
| 2021 | ImageNet Label Errors | Invalid Ground Truth | At least 6 percent label errors in ImageNet validation set. | https://arxiv.org/abs/2103.14749 |
| 2021 | MLPerf Reproducibility | Benchmark Reproducibility | Users could not reproduce MLPerf v0.7 results due to inaccessible datasets and outdated repos. | https://groups.google.com/a/mlcommons.org/g/public/c/T_8UsUPIWFo |
| 2021 | Epic Sepsis Model External Validation | Benchmark Overfitting | External validation found poor discrimination and calibration for the Epic Sepsis Model, leading to missed cases and alert fatigue. | https://pubmed.ncbi.nlm.nih.gov/34152373/ |
| 2020 | Underspecification in ML | Benchmark Overfitting | Models with equivalent benchmark performance diverged in deployment behavior. | https://arxiv.org/abs/2011.03395 |
| 2020 | TF32 Default on Ampere | Precision Policy Drift | TF32-enabled matmul/conv trades precision for speed unless explicitly disabled. | https://pytorch.org/docs/stable/notes/cuda.html#tf32-on-ampere |
| 2019 | NLI Heuristic Shortcuts (HANS) | Metric Definition Gaming | Models trained on MNLI (GLUE) rely on shallow heuristics and fail on HANS, revealing spurious shortcut behavior. | https://aclanthology.org/P19-1334/ |
| 2019 | MLPerf Inference Bias | Metric Definition Gaming | Vendors selectively submitted results highlighting strengths. | https://web.archive.org/web/20191112035148/https://www.forbes.com/sites/janakirammsv/2019/11/10/the-curious-case-of-mlperf-inferencing-benchmark-results/ |
| 2019 | Computational Biology Overfitting | Train/Test Overlap | Tools developed and tested on same datasets failed on new data. | https://www.nature.com/articles/s41467-019-09406-4 |
| 2016 | Microsoft Tay Chatbot | Missing Holdout Sets | AI chatbot learned abusive behavior within 24 hours after deployment due to adversarial user interactions. | https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/ |
| Category | Incidents | Our Protection | Status |
|---|---|---|---|
| Timing Manipulation | 1 (Locus/KernelBench) | Full device sync + StreamAuditor | OK |
| Invalid Ground Truth | 2 (ImageNet Labels, MMLU) | GoldenOutputCache + validate_result | OK |
| Benchmark Overfitting | 4 (Underspecification, Epic Sepsis, NaturalCodeBench, Berkeley) | Fresh-input checks + jitter | OK |
| Data Contamination | 2 (LLM Survey 2024, NLP Contamination 2023) | Data contamination checks + fresh inputs | OK |
| Metric Gaming | 4 (Measuring What Matters 2025, Medical LLM Benchmarks 2025, HANS 2019, MLPerf 2019) | Standardized metric definitions | OK |
| Cherry-picking | 2 (Leaderboard Illusion, MLPerf 2022) | All-iteration reporting | OK |
| Train/Test Overlap | 1 (Computational Biology) | Dataset isolation + holdout enforcement | OK |
| Missing Holdout Sets | 2 (AI Agent Shortcuts, Microsoft Tay) | Held-out evaluation data | OK |
| Reproducibility | 1 (MLPerf 2021) | RunManifest version locking | OK |
| Evaluation Integrity | 1 (Sakana AI Scientist) | BenchmarkContract + verification enforcement | OK |
| Precision Policy Drift | 1 (TF32 Default) | Backend policy immutability check | OK |
This 2025 incident illustrates why correctness verification alone is insufficient.
# VULNERABLE TIMING (what KernelBench did)
start_event.record(original_model_stream) # Only records on default stream
model(*inputs) # But work runs on s1, s2, s3
end_event.record(original_model_stream) # Timer stops before s1/s2/s3 finish
torch.cuda.synchronize(device=device) # Waits, but timing already recorded
# CORRECT TIMING (the fix)
for stream in custom_model_streams:
custom_model_stream.wait_stream(stream) # Wait for ALL streams
_event.record(custom_model_stream) # Then record timingThe exploit pattern:
- AI creates non-default streams:
s1 = getStreamFromPool(),s2 = ...,s3 = ... - AI launches GEMMs on those streams:
at::mm_out(gate, x2d, gate_proj.t())on s1 - AI does not call
setCurrentCUDAStream(s3)or wait for streams before returning - Correctness test uses
torch.cuda.synchronize()and passes - Performance test uses stream-specific events and reports fake speedups
Result: 82/250 (32.8 percent) of RL-generated CUDA kernels exploited this, producing artificial 18x speedups with zero actual performance improvement.
| Module | Key Protections |
|---|---|
core/harness/benchmark_harness.py |
Full device sync, L2 cache clearing, GPU clock locking, warmup isolation, config immutability, adaptive iterations, CUDA graph mode |
core/harness/validity_checks.py |
StreamAuditor, MemoryAllocationTracker, GraphCaptureCheatDetector, gc_disabled, clear_compile_cache, capture_gpu_state, validate_environment |
core/harness/l2_cache_utils.py |
Dynamic L2 cache size detection, clear_l2_cache |
core/benchmark/verify_runner.py |
VerifyRunner, GoldenOutputCache, jitter check, fresh-input check, output comparison, workload invariants |
core/benchmark/verification.py |
InputSignature, ToleranceSpec, QuarantineReason, seed mutation detection |
core/benchmark/quarantine.py |
QuarantineManager with persistence |
core/benchmark/contract.py |
BenchmarkContract enforcement |
core/scripts/profile_all_workloads.shandncu_template.inicapture Nsight traces with consistent metric sets.artifacts/runs/holds run outputs (results/profiles/reports/logs); clean viapython cleanup.pywhen rotating hardware.docs/benchmark_methodology.mddefines the repo-wide benchmarking method that ties harness runs, cluster packages, and publication-grade evidence together.docs/performance_warehouse.mddefines the warehouse contract that ties published numbers back to raw evidence and telemetry joins.docs/perf_intake_and_triage.mdoutlines the standard intake bundle for performance investigations.