Welcome! This guide covers coding standards and best practices for the AI Performance Engineering codebase.
Example chapter layout (see ch07/):
ch07/__init__.py # Module exports
ch07/baseline_memory_access.py # Unoptimized reference
ch07/optimized_memory_access.py # Optimized version
ch07/baseline_memory_access.cu # CUDA kernel (if applicable)
ch07/optimized_memory_access.cu # Optimized CUDA kernel
ch07/Makefile # CUDA compilation rules
| Pattern | Purpose | Example |
|---|---|---|
baseline_*.py |
Unoptimized reference | ch07/baseline_memory_access.py |
optimized_*.py |
Optimized implementation | ch07/optimized_memory_access.py |
*_bench.py |
Standalone benchmark | ch13/fp8_perchannel_bench.py |
*_demo.py |
Demonstration/example | ch15/pipeline_parallel_demo.py |
Every benchmark inheriting from BaseBenchmark must implement:
from core.harness.benchmark_harness import BaseBenchmark, BenchmarkConfig
class MyBenchmark(BaseBenchmark):
"""One-line description of what this benchmark measures."""
def setup(self) -> None:
"""Initialize tensors, models, etc. Called once before benchmarking."""
self.N = 1024
self.tensor = torch.randn(self.N, device='cuda')
def benchmark_fn(self) -> None:
"""The operation to benchmark. Called repeatedly for timing."""
result = self.tensor.sum()
torch.cuda.synchronize()
def teardown(self) -> None:
"""Cleanup. Called once after benchmarking."""
del self.tensor
def get_custom_metrics(self) -> Optional[dict]:
"""Return domain-specific metrics. Use helper functions!"""
return compute_memory_transfer_metrics(
bytes_transferred=self.N * 4,
elapsed_ms=getattr(self, '_last_elapsed_ms', 1.0),
)
def get_benchmark():
"""Factory function for benchmark discovery."""
return MyBenchmark()def validate_result(self) -> bool:
"""Verify correctness. Return True if results are valid."""
return self._result is not None
def _nvtx_range(self, name: str):
"""Context manager for NVTX profiling ranges."""
return torch.cuda.nvtx.range(name)
def get_config(self) -> BenchmarkConfig:
"""Return custom benchmark configuration."""
return BenchmarkConfig(
warmup=10, # REQUIRED: See Warmup Requirements below
iterations=100,
)Warmup iterations are MANDATORY to ensure accurate benchmark measurements. Low warmup causes JIT/compile overhead to be INCLUDED in timing results, leading to incorrect speedup calculations.
| Feature Used | Minimum Warmup | Recommended Warmup |
|---|---|---|
| Basic CUDA | 5 | 5-10 |
| torch.compile | 10 | 10-15 |
| Triton kernels | 10 | 10-15 |
| CUDA Graphs | 10 | 10-15 |
Why this matters:
torch.compiletriggers JIT compilation on the first 1-3 calls- Triton kernels compile on first invocation
- CUDA driver initialization and cuDNN autotuning happen early
- Memory allocator needs warmup to reach steady state
The benchmark harness will AUTOMATICALLY raise warmup to minimum (5) if you set it lower, with a warning.
DO NOT:
# ❌ BAD - JIT overhead will pollute measurements
def get_config(self):
return BenchmarkConfig(warmup=0, iterations=5)
# ❌ BAD - torch.compile needs more warmup
def get_config(self):
return BenchmarkConfig(warmup=2, iterations=10)DO:
# ✓ GOOD - Sufficient warmup for accurate measurements
def get_config(self):
return BenchmarkConfig(warmup=10, iterations=20)Validation:
- Run
make audit-warmupto check all benchmarks - Pre-commit hook automatically validates warmup settings
make checkincludes warmup audit
# Standard library
from __future__ import annotations
from typing import Optional, Dict, Any
# Third-party
import torch
import torch.nn as nn
# Local
from core.harness.benchmark_harness import BaseBenchmark
from core.benchmark.metrics import compute_memory_transfer_metricsAlways use type hints for function signatures:
def compute_something(
tensor: torch.Tensor,
scale: float = 1.0,
device: Optional[str] = None,
) -> Dict[str, float]:
...Use Google-style docstrings:
def compute_bandwidth(bytes_transferred: int, elapsed_ms: float) -> float:
"""Calculate achieved bandwidth in GB/s.
Args:
bytes_transferred: Total bytes moved
elapsed_ms: Time elapsed in milliseconds
Returns:
Bandwidth in GB/s
Raises:
ValueError: If elapsed_ms is zero or negative
"""
...Always synchronize before timing:
def benchmark_fn(self) -> None:
# Do work
result = self.model(self.input)
# ALWAYS synchronize before timing ends
torch.cuda.synchronize()Always use the standardized metric helpers from core/benchmark/metrics.py:
from core.benchmark.metrics import (
compute_memory_transfer_metrics, # Ch2: bandwidth
compute_kernel_fundamentals_metrics, # Ch6: bank conflicts
compute_memory_access_metrics, # Ch7: coalescing
compute_optimization_metrics, # Ch8: speedup
compute_roofline_metrics, # Ch9: arithmetic intensity
compute_pipeline_metrics, # Ch10: pipeline efficiency
compute_stream_metrics, # Ch11: overlap
compute_graph_metrics, # Ch12: launch overhead
compute_precision_metrics, # Ch13: FP8/FP16
compute_triton_metrics, # Ch14: Triton kernels
compute_inference_metrics, # Ch15-17: TTFT/TPOT
compute_speculative_decoding_metrics, # Ch18: acceptance rate
compute_distributed_metrics, # Ch4: collective bandwidth
compute_moe_metrics, # MoE: expert utilization
)Use category.metric_name format:
# Good
{
"transfer.achieved_gbps": 100.5,
"transfer.efficiency_pct": 85.2,
"memory.coalescing_pct": 95.0,
}
# Bad
{
"bandwidth": 100.5, # Missing category
"efficiency": 85.2, # Missing category
"CoalescingPercent": 95.0, # Wrong format
}Use conditional returns for metrics that depend on runtime data:
def get_custom_metrics(self) -> Optional[dict]:
# Guard against missing data
if not hasattr(self, '_last_elapsed_ms'):
return None
return compute_memory_transfer_metrics(
bytes_transferred=self.N * 4,
elapsed_ms=self._last_elapsed_ms,
)# Run all tests
pytest tests/ -v
# Run specific test file
pytest tests/test_benchmark_metrics.py -v
# Run with coverage
pytest tests/ --cov=benchmark --cov=core --cov=profiling --cov-report=htmlimport pytest
from core.benchmark.metrics import compute_memory_transfer_metrics
class TestMemoryTransferMetrics:
def test_basic_transfer(self):
"""Test basic bandwidth calculation."""
metrics = compute_memory_transfer_metrics(
bytes_transferred=1e9, # 1 GB
elapsed_ms=100.0, # 100 ms
)
# 1 GB in 100 ms = 10 GB/s
assert abs(metrics["transfer.achieved_gbps"] - 10.0) < 0.01
def test_zero_time_handling(self):
"""Should handle edge cases gracefully."""
metrics = compute_memory_transfer_metrics(
bytes_transferred=1e9,
elapsed_ms=0.0, # Edge case
)
assert metrics["transfer.achieved_gbps"] > 0 # Should not crashEvery Python file should have a docstring:
#!/usr/bin/env python3
"""Optimized: Memory coalescing with vectorized loads.
Demonstrates 128-bit vectorized memory access patterns for
optimal memory bandwidth utilization on Blackwell GPUs.
Key optimizations:
- float4 vectorized loads (128-bit)
- Aligned memory access
- Coalesced access patterns
Expected speedup: 2-4× over baseline
"""Labs should have README.md with:
- Goal/purpose
- Key techniques demonstrated
- File descriptions
- Running instructions
- Configuration options
- Related chapters
# Check get_custom_metrics() status
python core/scripts/update_custom_metrics.py --analyze
# Apply suggested improvements
python core/scripts/update_custom_metrics.py --apply
# Validate metric quality
python core/scripts/update_custom_metrics.py --validate# Compare two benchmark result JSON files
python -m cli.aisp bench compare-runs \
--baseline artifacts/runs/baseline/benchmark_test_results.json \
--candidate artifacts/runs/candidate/benchmark_test_results.json# nsys profile
nsys profile -o output python ch07/optimized_memory_access.py
# ncu profile with chapter-specific metrics
ncu --set full --metrics $(python -c "from core.profiling.profiler_config import get_chapter_metrics; print(','.join(get_chapter_metrics(7)))") python ch07/optimized_memory_access.py# See all available targets
make help
# Run all checks before committing
make check
# Run tests with coverage
make test-cov
# Generate coverage report
make coverageThe following checks run automatically on push/PR:
- Unit tests for benchmark_metrics, profiler_config, update_custom_metrics
- Benchmark import validation
- Metrics coverage analysis
- Linting (flake8, mypy)
See ./.github/workflows/benchmark-validation.yml for details.
Before submitting changes:
- Benchmark inherits from
BaseBenchmark - Has
get_benchmark()factory function - Has
get_custom_metrics()using helper functions - Includes
torch.cuda.synchronize()inbenchmark_fn() - Has docstring with description
- Uses type hints
- Follows
baseline_/optimized_naming - Tests pass (
make test) - Validation passes (
make check)