Name	Name	Last commit message	Last commit date
parent directory ..
.cursor	.cursor
.github/workflows	.github/workflows
.plan-and-implement	.plan-and-implement
ch01	ch01
ch02	ch02
ch03	ch03
ch04	ch04
ch05	ch05
ch06	ch06
ch07	ch07
ch08	ch08
ch09	ch09
ch10	ch10
ch11	ch11
ch12	ch12
ch13	ch13
ch14	ch14
ch15	ch15
ch16	ch16
ch17	ch17
ch18	ch18
ch19	ch19
ch20	ch20
cli	cli
cluster	cluster
configs/benchmark_suites	configs/benchmark_suites
core	core
dashboard	dashboard
docs	docs
eval_datasets	eval_datasets
examples	examples
hack/numba	hack/numba
labs	labs
mcp	mcp
monitoring	monitoring
scripts	scripts
templates	templates
tests	tests
third_party	third_party
tools	tools
vendor/wheels	vendor/wheels
.cursorignore	.cursorignore
.env.example	.env.example
.gitignore	.gitignore
.pre-commit-config.yaml	.pre-commit-config.yaml
AGENTS.md	AGENTS.md
CONTRIBUTING.md	CONTRIBUTING.md
Makefile	Makefile
README.md	README.md
TODO.md	TODO.md
arch_config.py	arch_config.py
ch10-fix.md	ch10-fix.md
ch11-fix.md	ch11-fix.md
cleanup.py	cleanup.py
conftest.py	conftest.py
mcp.json	mcp.json
model.cache	model.cache
ncu_template.ini	ncu_template.ini
profile.sh	profile.sh
pyproject.toml	pyproject.toml
pytest.ini	pytest.ini
requirements_latest.txt	requirements_latest.txt
reset-gpu.sh	reset-gpu.sh
setup-dashboard.sh	setup-dashboard.sh
setup.sh	setup.sh
sitecustomize.py	sitecustomize.py
start-dashboard.sh	start-dashboard.sh
stop-dashboard.sh	stop-dashboard.sh
usercustomize.py	usercustomize.py
uv.lock	uv.lock

AI Systems Performance Engineering

Summary

Reference implementation of high-performance PyTorch, CUDA, and Triton workloads for NVIDIA Blackwell platforms. The repository packages 20 focused chapters, advanced labs, and the shared benchmarking harness so you can profile baselines, apply optimizations, and capture artifacts that prove performance gains.

Roadmap: docs/performance_repo_roadmap.md defines the prioritized plan for canonical suites, trend tracking, anti-pattern enforcement, shared benchmark bases, and evidence-first documentation.

Problem

Most performance repos are easy to browse and hard to trust. This one is meant to do the opposite:

baseline and optimized paths live side-by-side
correctness checks run before speedup claims matter
profiler artifacts and benchmark manifests are first-class outputs, not optional extras

Tier-1 Canonical Suite

The fastest way to answer "is this repo still delivering real wins?" is the canonical tier-1 suite.

python -m cli.aisp bench run-tier1 --single-gpu --profile minimal

That command now writes a stable history package under artifacts/history/tier1/<run_id>/:

summary.json: per-target baseline, optimized path, and best speedup
regression_summary.md: human-readable before/after summary against the previous tier-1 run
regression_summary.json: machine-readable regressions and improvements
trend_snapshot.json: run-history summary for dashboards and release notes
artifacts/history/tier1/index.json: suite history index

See docs/tier1_benchmark_suite.md for the current target list, artifact contract, and interpretation guidance.

Current Representative Deltas

These numbers are taken from the latest canonical tier-1 history summary rather than from hand-maintained README text.

Source artifact: artifacts/history/tier1/20260309_211715_tier1_nightly_local/summary.json

Representative suite speedup: 7.16x geomean, 8.29x median, 21.79x arithmetic average.

Target	Baseline	Optimized	Measured delta	Artifact
`labs/block_scaling:block_scaling`	`0.197 ms`	`0.117 ms`	`1.68x`	`artifacts/history/tier1/20260309_211715_tier1_nightly_local/summary.json`
`labs/flashattention4:flashattention4_alibi`	`5.440 ms`	`0.385 ms`	`14.14x`	`artifacts/history/tier1/20260309_211715_tier1_nightly_local/summary.json`
`labs/persistent_decode:persistent_decode`	`1.331 ms`	`0.089 ms`	`15.03x`	`artifacts/history/tier1/20260309_211715_tier1_nightly_local/summary.json`
`labs/kv_optimization:kv_standard`	`1606.800 ms`	`993.124 ms`	`1.62x`	`artifacts/history/tier1/20260309_211715_tier1_nightly_local/summary.json`
`ch04:gradient_fusion`	`3.554 ms`	`0.037 ms`	`95.84x`	`artifacts/history/tier1/20260309_211715_tier1_nightly_local/summary.json`
`labs/real_world_models:llama_3_1_8b`	`12.812 ms`	`5.259 ms`	`2.44x`	`artifacts/history/tier1/20260309_211715_tier1_nightly_local/summary.json`

Profiler Evidence

When you want proof beyond wall-clock timing, use the same harness target with a profiling mode instead of a different script.

python -m cli.aisp bench run --targets labs/block_scaling:block_scaling --profile deep_dive --single-gpu
python -m cli.aisp bench run --targets labs/flashattention4:flashattention4_alibi --profile deep_dive --single-gpu
python -m cli.aisp bench run --targets labs/persistent_decode:persistent_decode --profile deep_dive --single-gpu

minimal is the fastest artifact-bearing path.
deep_dive is the profiler-backed path for Nsight Systems + Nsight Compute comparisons.
The benchmark harness now blocks more hot-path anti-patterns and follows imported helper code, so "clean benchmark" means more than it used to.

Lab Navigation

If you are navigating by performance problem instead of by chapter, start with these benchmark-pair labs:

Lab	Best for	Distinct from
`labs/parameterized_cuda_graphs`	fixed-shape PyTorch CUDA Graph replay where request bindings change but graph topology does not	broader decode-serving labs; dynamic-shape graph problems
`labs/persistent_decode`	persistent decode kernels plus graph/TMA variants on serving-style decode paths	the narrower executable-graph parameter-mutation story
`labs/decode_optimization`	decode microbenchmarks that separate pinned memory, streams, compile, graphs, and cache policy	a single graph-launch-overhead benchmark pair
`labs/block_scaling`	Blackwell block-scaled GEMM mechanics and measured tensor-core wins	serving/control-plane orchestration labs
`labs/training_hotpath`	training-path launch and fusion bottlenecks with clean benchmark pairs	inference/decode-oriented graph replay work

This keeps labs/parameterized_cuda_graphs visible without pretending it replaces the broader persistent-decode or decode-optimization stories.

Benchmark Methodology

This repo now exposes one repeatable benchmarking methodology instead of leaving performance work as a collection of scripts.

Start with:

docs/benchmark_methodology.md for the three-layer model (micro, component, end_to_end), bottleneck taxonomy, publication-vs-realism policy, and straggler playbook.
docs/performance_warehouse.md for the stable event schema, raw-versus-curated storage split, retention tiers, and telemetry lineage back to raw evidence.
templates/performance_intake.yaml for KPIs, constraints, and the variable under test.
templates/benchmark_workload_spec.yaml for the frozen workload definition and measurement policy.
templates/benchmark_run.yaml for the CRD-aligned declarative BenchmarkRun shape the repo would map onto a Kubernetes-native service.
cluster/docs/kubernetes_benchmark_service.md plus cluster/configs/benchmarkrun-crd.yaml for the cluster-native operator/CRD direction already being sketched in the repo.

Thin surfaces for these contracts are also exposed through python -m cli.aisp tools benchmark-contracts, dashboard API GET /api/benchmark/contracts, and MCP tool benchmark_contracts.

The current harness already captures manifests, profiler artifacts, raw timings, and artifact hashes. Cryptographic provenance signing is still a documented gap, so external publication packets should record that explicitly rather than assuming hashes alone are sufficient.

Cluster Evaluation

Cluster evaluation has one supported artifact contract for new work:

cluster/runs/<run_id>/
  manifest.json
  structured/
  raw/
  figures/
  reports/

Start with:

cluster/README.md for the current commands and folder contract.
python -m cli.aisp cluster common-eval --preset common-answer-fast ... for the normal "evaluate this system" ask.
python -m cli.aisp cluster common-eval --preset modern-llm ... when you need the full canonical package.
python -m cli.aisp cluster common-eval --preset multinode-readiness ... before first real multi-node workloads.
python -m cli.aisp cluster promote-run --run-id <run_id> ... when one collected run should become the published localhost package.

The current published canonical package lives under cluster/published/current/. New collection still happens under cluster/runs/<run_id>/.

Learning Goals

Understand how the chapters, labs, and shared tooling fit together.
Stand up a reproducible environment for PyTorch 2.10-dev + CUDA 13 workloads on Blackwell GPUs.
Run the benchmark harness directly or through the Typer CLI for automated artifact capture.
Validate peak hardware characteristics before grading optimizations against stored expectations.

Directory Layout

Path	Description
`ch01` - `ch20`	One directory per chapter with baseline/optimized benchmarks, workload configs, and chapter-level harness entrypoints such as `ch01/compare.py`.
`labs/`	Deep-dive labs for memory-bandwidth patterns, matmul, routing, FlexAttention, MoE, persistent decode, distributed training, and more.
`core/benchmark/`, `profiling/`, `core/`, `optimization/`, `analysis/`	Shared harness, logging, workload metadata, profiling, and optimization utilities used by every chapter.
`python -m cli.aisp bench`	Typer-based CLI for running and profiling targets with reproducible artifacts.
`docs/` + `core/scripts/`	Operational guides, profiling workflows, and setup/reset helpers (`setup.sh`, `cleanup.py`, `reset-gpu.sh`).

Running the Benchmarks

Use the benchmark harness for quick comparisons or drive the Typer CLI when you need repeatable artifact capture.

cd ai-performance-engineering
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements_latest.txt
python -m cli.aisp bench list-targets --chapter ch01
python -m cli.aisp bench run --targets ch01 --profile minimal
python -m cli.aisp bench run-tier1 --single-gpu --profile minimal

setup.sh installs system prerequisites (drivers, CUDA, Nsight) and should be rerun after driver upgrades.
Benchmark validity profile defaults to strict. Virtualization is warning-only; use --validity-profile portable for broader compatibility on hardware-limited hosts.
Use python -m cli.aisp bench expectations --hardware b200 --min-speedup 1.05 to report expectation entries below a target threshold.
Repeat --targets for multi-scope runs, for example python -m cli.aisp bench run --targets ch01 --targets ch02 --profile minimal; use python -m cli.aisp bench run-tier1 --single-gpu --profile minimal for the canonical regression suite.
Portable runs do not update expectation files unless --allow-portable-expectations-update is supplied.
python core/analysis/analyze_expectations.py --artifacts-dir artifacts compares new runs to stored thresholds.

Validation Checklist

pytest tests/integration succeeds to confirm harness discovery and CLI plumbing.
python core/benchmark/benchmark_peak.py reports TFLOP/s, bandwidth, and NVLink numbers close to the published ceilings.

Wall of Shame

The benchmark harness includes a strict set of correctness and validity checks to prevent misleading speedups. Below is the reference list of validity issues we explicitly protect against, plus real-world incidents that motivated these checks.

Note: All 95 validity issues are protected by the harness.

CUDA Graph Note: Capturing CUDA graphs in setup() is allowed for steady-state replay benchmarks (we intentionally measure replay, not capture). It is NOT allowed to precompute and reuse the final output from setup(). The output used for verification must come from the timed benchmark_fn() run and be surfaced via capture_verification_payload().

Virtualization Note: validate_environment() treats virtualization (hypervisor present) as a warning. Benchmarks can run in virtualized environments, but bare metal remains the preferred source for final performance numbers.

Benchmark Validity Issues Reference

Category	Issue	What Happens	Protection	Status	Real-World Incident
Timing	Unsynced Streams	Work on non-default streams is not timed	Full device sync + StreamAuditor	OK	Locus/KernelBench 2025
Timing	Incomplete Async Ops	Timer stops before async work finishes	Full device sync	OK	Locus/KernelBench 2025
Timing	Event Timing Gaps	CUDA events recorded incorrectly	Cross-validate with wall clock	OK
Timing	Timer Granularity	Measurement too coarse for fast ops	Adaptive iterations	OK
Timing	Warmup Bleed	Real work happens during warmup	isolate_warmup_cache	OK
Timing	Clock Drift	System clock changes during measurement	Monotonic clock usage	OK
Timing	Profiler Overhead	Profiling tools add latency	Profile-free timing path	OK
Output	Constant Output	Same result regardless of input	Jitter check	OK
Output	Stale Cache	Same result across different seeds	Fresh-input check	OK
Output	Approximation Drift	Rough estimate instead of full compute	Output tolerance validation	OK
Output	Invalid Values (NaN)	NaN in output	validate_result NaN check	OK
Output	Invalid Values (Inf)	Inf in output	validate_result Inf check	OK
Output	Invalid Ground Truth	Labels/expected values wrong	GoldenOutputCache	OK	ImageNet Labels 2021, MMLU Errors 2025
Output	Shape Mismatch	Output shape differs from expected	Shape validation	OK
Output	Dtype Mismatch	Output dtype differs from expected	ToleranceSpec dtype check	OK
Output	Denormalized Values	Subnormal floats cause slowdowns	Denormal check	OK
Output	Uninitialized Memory	Output contains garbage	Memory initialization check	OK
Workload	Precision Mismatch	Claims FP32 but uses FP16	InputSignature dtype verification	OK
Workload	Backend Precision Policy Drift	Global precision policy changes during timing	Backend policy immutability check	OK	PyTorch TF32 Default 2020
Workload	Undeclared Shortcuts	Skips elements without declaring	Workload invariant check	OK	AI Agent Benchmark Shortcuts 2024
Workload	Early Exit	Stops iteration loops early	Config immutability	OK
Workload	Batch Shrinking	Processes fewer samples	InputSignature matching	OK
Workload	Sequence Truncation	Processes shorter sequences	InputSignature matching	OK
Workload	Hidden Downsampling	Silently reduces resolution	Dimension validation	OK
Workload	Sparsity Mismatch	Different sparsity patterns	Sparsity ratio check	OK
Workload	Attention Mask Mismatch	Different masking applied	Mask equivalence check	OK
Workload	KV Cache Size Mismatch	Different cache sizes	Cache dimension check	OK
Workload	Train/Test Overlap	Model tested on training data	Dataset isolation	OK	Computational Biology 2019
Location	CPU Spillover	Work offloaded to CPU	GPU kernel time validation	OK
Location	Setup Pre-computation	Work done in setup	check_setup_precomputation	OK
Location	Graph Capture Cheat	Pre-compute during graph capture	GraphCaptureCheatDetector	OK
Location	Warmup Computation	Compute results during warmup	isolate_warmup_cache	OK
Location	Background Thread	Compute in separate thread	Process isolation	OK
Location	Lazy Evaluation Skip	Returns unevaluated lazy tensor	force_tensor_evaluation	OK
Location	JIT Compilation Timing	JIT compile time included/excluded inconsistently	clear_compile_cache	OK
Memory	Pre-allocated Output	Result buffer allocated in setup	MemoryAllocationTracker	OK
Memory	Input-Output Aliasing	Output points to pre-filled input	check_input_output_aliasing	OK
Memory	Pinned Memory Timing	Async pinned transfers not waited	Transfer completion check	OK
Memory	Memory Pool Reuse	Cached allocations skew timing	reset_cuda_memory_pool	OK
Memory	Fragmentation Effects	Memory fragmentation differs	Memory pool reset	OK
Memory	Page Fault Timing	First-touch page faults included	Memory pre-touch	OK
Memory	Swap Interference	Swapping affects timing	Memory lock / swap disable	OK
CUDA	Host Callback Escape	cudaLaunchHostFunc returns early	Host function tracking	OK
CUDA	Async Memcpy Incomplete	D2H/H2D copies not awaited	Full device sync	OK
CUDA	Workspace Pre-compute	Work in cuBLAS workspace alloc	Workspace monitoring	OK
CUDA	Persistent Kernel	Kernel left running across calls	Kernel lifetime check	OK
CUDA	Undeclared Multi-GPU	Work spread across undeclared GPUs	validate_environment	OK
CUDA	Context Switch Overhead	CUDA context switches affect timing	Context pinning	OK
CUDA	Driver Overhead	Driver calls not accounted for	Driver call tracking	OK
CUDA	Cooperative Launch Abuse	Cooperative kernels bypass checks	Launch mode validation	OK
CUDA	Dynamic Parallelism Hidden	Child kernels not tracked	CDP kernel tracking	OK
CUDA	Unified Memory Faults	Page migration not timed	UM fault tracking	OK
Compile	Compilation Cache Hit	Returns cached compiled output	clear_compile_cache	OK
Compile	Trace Reuse	Exploits trace caching	torch._dynamo.reset	OK
Compile	Mode Inconsistency	Different compile mode verify vs perf	Mode consistency check	OK
Compile	Inductor Asymmetry	Inductor optimizations inconsistent	Compilation parity	OK
Compile	Guard Failure Hidden	Recompilation not counted	get_compile_state	OK
Compile	Autotuning Variance	Autotuning picks different kernels	Fixed autotuning cache	OK
Compile	Symbolic Shape Exploit	Different shapes trigger different code	InputSignature matching	OK
Distributed	Rank Skipping	Some ranks do not do work	check_rank_execution	OK
Distributed	Collective Short-circuit	Communication skipped	NCCL validation	OK
Distributed	Topology Mismatch	Claims different topology	verify_distributed	OK
Distributed	Barrier Timing	Barrier timing exploited	Barrier synchronization	OK
Distributed	Gradient Bucketing Mismatch	Different bucket sizes	Bucket size validation	OK
Distributed	Async Gradient Timing	Async all-reduce not awaited	Full device sync	OK
Distributed	Pipeline Bubble Hiding	Pipeline bubbles not counted	Bubble time tracking	OK
Distributed	Shard Size Mismatch	FSDP shards differ	InputSignature matching	OK
Environment	Device Mismatch	Uses different GPU than declared	validate_environment	OK
Environment	Frequency Boost	Overclocked for benchmark only	lock_gpu_clocks	OK
Environment	Priority Elevation	Runs at higher priority	Process isolation	OK
Environment	Memory Overcommit	Exploits memory overcommit	Memory validation	OK
Environment	NUMA Inconsistency	NUMA placement differs	NUMA audit	OK
Environment	CPU Governor Mismatch	Different CPU frequency scaling	Governor lock	OK
Environment	Thermal Throttling	GPU throttles during run	capture_gpu_state (pynvml)	OK
Environment	Power Limit Difference	Different TDP settings	capture_gpu_state (pynvml)	OK
Environment	Driver Version Mismatch	Different CUDA drivers	RunManifest version lock	OK
Environment	Library Version Mismatch	Different cuDNN/cuBLAS	RunManifest version lock	OK
Environment	Container Resource Limits	cgroups limits differ	Resource limit check	OK
Environment	Virtualization Overhead	VM/container overhead varies	Bare-metal validation	OK
Statistical	Cherry-picking	Only best iterations reported	All-iteration reporting	OK	Leaderboard Illusion 2025
Statistical	Outlier Injection	Slow iterations added to baseline	Statistical validation	OK
Statistical	Variance Gaming	Variance reporting manipulated	Consistent statistics	OK
Statistical	Percentile Selection	Favorable percentile chosen	Fixed percentile policy	OK
Statistical	Insufficient Samples	Too few iterations for significance	Adaptive iterations	OK	Measuring What Matters 2025
Statistical	Cold Start Inclusion	First run included unfairly	Warmup enforcement	OK
Statistical	GC Interference	Garbage collection during timing	gc_disabled	OK
Statistical	Background Process Noise	System processes affect timing	Process isolation	OK
Evaluation	Eval Code Exploitation	Benchmark code modified to pass	BenchmarkContract enforcement	OK
Evaluation	Timeout Manipulation	Timeout extended to hide slowdowns	Config immutability	OK
Evaluation	Metric Definition Gaming	Redefine what speedup means	Standardized metric definitions	OK	MLPerf 2019, HANS 2019, Measuring What Matters 2025, Medical LLM Benchmarks 2025
Evaluation	Test Data Leakage	Training on test data	Data contamination checks	OK	Benchmark Data Contamination Survey 2024
Evaluation	Benchmark Overfitting	Optimize specifically for benchmark	Fresh-input + jitter checks	OK	Underspecification 2020, Epic Sepsis 2021, NaturalCodeBench 2024
Evaluation	Self-Modifying Tests	AI/code modifies its own tests	Config immutability	OK
Evaluation	Benchmark Memorization	Agent memorizes test cases	Fresh-input checks, jitter	OK	AI Agent Benchmark Shortcuts 2024
Evaluation	Missing Holdout Sets	No proper train/test split	Held-out evaluation data	OK	AI Agent Benchmark Shortcuts 2024, Microsoft Tay 2016

Total: 11 categories, 95 validity issues - all protected by the harness.

Notable Real-World Incidents

Year	Incident	Issue Type	What Happened	Source
2025	Locus/KernelBench Stream Exploit	Unsynced Streams	Claimed 20x speedup on Llama FFW kernel. AI launched work on non-default CUDA streams but timer only measured default stream. 32.8 percent of RL-generated kernels exploited this, causing fake 18x speedups.	https://x.com/miru_why/status/1991773868806361138
2025	Measuring What Matters: Construct Validity in LLM Benchmarks	Metric Definition Gaming / Construct Validity	Systematic review of 445 LLM benchmarks found construct-validity weaknesses and low statistical rigor; issued eight design recommendations.	https://ora.ox.ac.uk/objects/uuid%3Aad2b69b6-0986-42d0-a512-a6e56338b6cc
2025	Medical LLM Benchmarks and Construct Validity	Metric Definition Gaming / Construct Validity	Position paper argues exam-style medical LLM benchmarks miss real-world tasks and documents construct-validity gaps using clinical data.	https://arxiv.org/abs/2503.10694
2025	Sakana AI Scientist Evaluation	Evaluation Integrity	Independent evaluation found frequent experiment failures and hallucinated numerical results.	https://arxiv.org/abs/2502.14297
2025	Leaderboard Illusion (Chatbot Arena)	Cherry-picking	Analysis of Chatbot Arena reports selection effects and leaderboard instability when submissions are inconsistent or selectively disclosed.	https://arxiv.org/abs/2504.20879
2024	MMLU Benchmark Errors	Invalid Ground Truth	Analysis found 57 percent of MMLU virology subset questions incorrect and estimated 6.49 percent errors overall.	https://arxiv.org/abs/2406.04127
2024	AI Agent Benchmark Shortcuts	Missing Holdout Sets	Study found AI agents memorize benchmark test samples instead of learning to generalize. Many benchmarks lack proper holdout test sets.	https://arxiv.org/abs/2407.01502
2024	NaturalCodeBench vs HumanEval	Benchmark Overfitting	Real-user coding tasks in NaturalCodeBench show large performance gaps and weak correlation with HumanEval scores.	https://aclanthology.org/2024.findings-acl.471/
2024	Benchmark Data Contamination Survey	Data Contamination	Survey catalogs contamination pathways across LLM benchmarks and highlights mitigation gaps.	https://arxiv.org/abs/2406.04244
2023	NLP Evaluation Data Contamination	Data Contamination	Position paper warns that LLMs trained on benchmark test splits can inflate reported scores.	https://arxiv.org/abs/2310.18018
2022	MLPerf Participation Issues	Cherry-picking	MLPerf faced inconsistent vendor participation; selective scenario submissions led to biased performance representations.	http://web.archive.org/web/20250813110435/https://www.nextplatform.com/2022/04/08/the-performance-of-mlperf-as-a-ubiquitous-benchmark-is-lacking/
2022	ML Benchmark Validity (Berkeley)	Benchmark Overfitting	Small changes in data distribution caused significant performance drops, questioning external validity.	https://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-180.html
2021	ImageNet Label Errors	Invalid Ground Truth	At least 6 percent label errors in ImageNet validation set.	https://arxiv.org/abs/2103.14749
2021	MLPerf Reproducibility	Benchmark Reproducibility	Users could not reproduce MLPerf v0.7 results due to inaccessible datasets and outdated repos.	https://groups.google.com/a/mlcommons.org/g/public/c/T_8UsUPIWFo
2021	Epic Sepsis Model External Validation	Benchmark Overfitting	External validation found poor discrimination and calibration for the Epic Sepsis Model, leading to missed cases and alert fatigue.	https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2781307
2020	Underspecification in ML	Benchmark Overfitting	Models with equivalent benchmark performance diverged in deployment behavior.	https://arxiv.org/abs/2011.03395
2020	TF32 Default on Ampere	Precision Policy Drift	TF32-enabled matmul/conv trades precision for speed unless explicitly disabled.	https://pytorch.org/docs/stable/notes/cuda.html#tf32-on-ampere
2019	NLI Heuristic Shortcuts (HANS)	Metric Definition Gaming	Models trained on MNLI (GLUE) rely on shallow heuristics and fail on HANS, revealing spurious shortcut behavior.	https://aclanthology.org/P19-1334/
2019	MLPerf Inference Bias	Metric Definition Gaming	Vendors selectively submitted results highlighting strengths.	http://web.archive.org/web/20191112035148/https://www.forbes.com/sites/janakirammsv/2019/11/10/the-curious-case-of-mlperf-inferencing-benchmark-results/
2019	Computational Biology Overfitting	Train/Test Overlap	Tools developed and tested on same datasets failed on new data.	https://www.nature.com/articles/s41467-019-09406-4
2016	Microsoft Tay Chatbot	Missing Holdout Sets	AI chatbot learned abusive behavior within 24 hours after deployment due to adversarial user interactions.	https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/

Incident Categories and Our Protections

Category	Incidents	Our Protection	Status
Timing Manipulation	1 (Locus/KernelBench)	Full device sync + StreamAuditor	OK
Invalid Ground Truth	2 (ImageNet Labels, MMLU)	GoldenOutputCache + validate_result	OK
Benchmark Overfitting	4 (Underspecification, Epic Sepsis, NaturalCodeBench, Berkeley)	Fresh-input checks + jitter	OK
Data Contamination	2 (LLM Survey 2024, NLP Contamination 2023)	Data contamination checks + fresh inputs	OK
Metric Gaming	4 (Measuring What Matters 2025, Medical LLM Benchmarks 2025, HANS 2019, MLPerf 2019)	Standardized metric definitions	OK
Cherry-picking	2 (Leaderboard Illusion, MLPerf 2022)	All-iteration reporting	OK
Train/Test Overlap	1 (Computational Biology)	Dataset isolation + holdout enforcement	OK
Missing Holdout Sets	2 (AI Agent Shortcuts, Microsoft Tay)	Held-out evaluation data	OK
Reproducibility	1 (MLPerf 2021)	RunManifest version locking	OK
Evaluation Integrity	1 (Sakana AI Scientist)	BenchmarkContract + verification enforcement	OK
Precision Policy Drift	1 (TF32 Default)	Backend policy immutability check	OK

Deep Dive: The Locus/KernelBench Stream Timing Vulnerability

This 2025 incident illustrates why correctness verification alone is insufficient.

# VULNERABLE TIMING (what KernelBench did)
start_event.record(original_model_stream)  # Only records on default stream
model(*inputs)                              # But work runs on s1, s2, s3
end_event.record(original_model_stream)    # Timer stops before s1/s2/s3 finish
torch.cuda.synchronize(device=device)      # Waits, but timing already recorded

# CORRECT TIMING (the fix)
for stream in custom_model_streams:
    custom_model_stream.wait_stream(stream)  # Wait for ALL streams
_event.record(custom_model_stream)           # Then record timing

The exploit pattern:

AI creates non-default streams: s1 = getStreamFromPool(), s2 = ..., s3 = ...
AI launches GEMMs on those streams: at::mm_out(gate, x2d, gate_proj.t()) on s1
AI does not call setCurrentCUDAStream(s3) or wait for streams before returning
Correctness test uses torch.cuda.synchronize() and passes
Performance test uses stream-specific events and reports fake speedups

Result: 82/250 (32.8 percent) of RL-generated CUDA kernels exploited this, producing artificial 18x speedups with zero actual performance improvement.

Protection Implementation Reference

Module	Key Protections
`core/harness/benchmark_harness.py`	Full device sync, L2 cache clearing, GPU clock locking, warmup isolation, config immutability, adaptive iterations, CUDA graph mode
`core/harness/validity_checks.py`	StreamAuditor, MemoryAllocationTracker, GraphCaptureCheatDetector, gc_disabled, clear_compile_cache, capture_gpu_state, validate_environment
`core/harness/l2_cache_utils.py`	Dynamic L2 cache size detection, clear_l2_cache
`core/benchmark/verify_runner.py`	VerifyRunner, GoldenOutputCache, jitter check, fresh-input check, output comparison, workload invariants
`core/benchmark/verification.py`	InputSignature, ToleranceSpec, QuarantineReason, seed mutation detection
`core/benchmark/quarantine.py`	QuarantineManager with persistence
`core/benchmark/contract.py`	BenchmarkContract enforcement

Notes

core/scripts/profile_all_workloads.sh and ncu_template.ini capture Nsight traces with consistent metric sets.
artifacts/runs/ holds run outputs (results/profiles/reports/logs); clean via python cleanup.py when rotating hardware.
docs/benchmark_methodology.md defines the repo-wide benchmarking method that ties harness runs, cluster packages, and publication-grade evidence together.
docs/performance_warehouse.md defines the warehouse contract that ties published numbers back to raw evidence and telemetry joins.
docs/perf_intake_and_triage.md outlines the standard intake bundle for performance investigations.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

AI Systems Performance Engineering

Summary

Problem

Tier-1 Canonical Suite

Current Representative Deltas

Profiler Evidence

Lab Navigation

Benchmark Methodology

Cluster Evaluation

Learning Goals

Directory Layout

Running the Benchmarks

Validation Checklist

Wall of Shame

Benchmark Validity Issues Reference

Notable Real-World Incidents

Incident Categories and Our Protections

Deep Dive: The Locus/KernelBench Stream Timing Vulnerability

Protection Implementation Reference

Notes

FilesExpand file tree

code

Directory actions

More options

Directory actions

More options

Latest commit

History

code

Folders and files

parent directory

README.md

AI Systems Performance Engineering

Summary

Problem

Tier-1 Canonical Suite

Current Representative Deltas

Profiler Evidence

Lab Navigation

Benchmark Methodology

Cluster Evaluation

Learning Goals

Directory Layout

Running the Benchmarks

Validation Checklist

Wall of Shame

Benchmark Validity Issues Reference

Notable Real-World Incidents

Incident Categories and Our Protections

Deep Dive: The Locus/KernelBench Stream Timing Vulnerability

Protection Implementation Reference

Notes