Contributing Guide

Welcome! This guide covers coding standards and best practices for the AI Performance Engineering codebase.

File Organization
Benchmark Structure
Coding Standards
Metrics Guidelines
Testing
Documentation

File Organization

Chapter Structure

Example chapter layout (see ch07/):

ch07/__init__.py                    # Module exports
ch07/baseline_memory_access.py      # Unoptimized reference
ch07/optimized_memory_access.py     # Optimized version
ch07/baseline_memory_access.cu      # CUDA kernel (if applicable)
ch07/optimized_memory_access.cu     # Optimized CUDA kernel
ch07/Makefile                       # CUDA compilation rules

Naming Conventions

Pattern	Purpose	Example
`baseline_*.py`	Unoptimized reference	`ch07/baseline_memory_access.py`
`optimized_*.py`	Optimized implementation	`ch07/optimized_memory_access.py`
`*_bench.py`	Standalone benchmark	`ch13/fp8_perchannel_bench.py`
`*_demo.py`	Demonstration/example	`ch15/pipeline_parallel_demo.py`

Benchmark Structure

Required Methods

Every benchmark inheriting from BaseBenchmark must implement:

from core.harness.benchmark_harness import BaseBenchmark, BenchmarkConfig

class MyBenchmark(BaseBenchmark):
    """One-line description of what this benchmark measures."""
    
    def setup(self) -> None:
        """Initialize tensors, models, etc. Called once before benchmarking."""
        self.N = 1024
        self.tensor = torch.randn(self.N, device='cuda')
    
    def benchmark_fn(self) -> None:
        """The operation to benchmark. Called repeatedly for timing."""
        result = self.tensor.sum()
        torch.cuda.synchronize()
    
    def teardown(self) -> None:
        """Cleanup. Called once after benchmarking."""
        del self.tensor
    
    def get_custom_metrics(self) -> Optional[dict]:
        """Return domain-specific metrics. Use helper functions!"""
        return compute_memory_transfer_metrics(
            bytes_transferred=self.N * 4,
            elapsed_ms=getattr(self, '_last_elapsed_ms', 1.0),
        )

def get_benchmark():
    """Factory function for benchmark discovery."""
    return MyBenchmark()

Optional Methods

def validate_result(self) -> bool:
    """Verify correctness. Return True if results are valid."""
    return self._result is not None

def _nvtx_range(self, name: str):
    """Context manager for NVTX profiling ranges."""
    return torch.cuda.nvtx.range(name)

def get_config(self) -> BenchmarkConfig:
    """Return custom benchmark configuration."""
    return BenchmarkConfig(
        warmup=10,  # REQUIRED: See Warmup Requirements below
        iterations=100,
    )

⚠️ CRITICAL: Warmup Requirements

Warmup iterations are MANDATORY to ensure accurate benchmark measurements. Low warmup causes JIT/compile overhead to be INCLUDED in timing results, leading to incorrect speedup calculations.

Feature Used	Minimum Warmup	Recommended Warmup
Basic CUDA	5	5-10
torch.compile	10	10-15
Triton kernels	10	10-15
CUDA Graphs	10	10-15

Why this matters:

torch.compile triggers JIT compilation on the first 1-3 calls
Triton kernels compile on first invocation
CUDA driver initialization and cuDNN autotuning happen early
Memory allocator needs warmup to reach steady state

The benchmark harness will AUTOMATICALLY raise warmup to minimum (5) if you set it lower, with a warning.

DO NOT:

# ❌ BAD - JIT overhead will pollute measurements
def get_config(self):
    return BenchmarkConfig(warmup=0, iterations=5)

# ❌ BAD - torch.compile needs more warmup
def get_config(self):
    return BenchmarkConfig(warmup=2, iterations=10)

DO:

# ✓ GOOD - Sufficient warmup for accurate measurements
def get_config(self):
    return BenchmarkConfig(warmup=10, iterations=20)

Validation:

Run make audit-warmup to check all benchmarks
Pre-commit hook automatically validates warmup settings
make check includes warmup audit

Coding Standards

Imports

# Standard library
from __future__ import annotations
from typing import Optional, Dict, Any

# Third-party
import torch
import torch.nn as nn

# Local
from core.harness.benchmark_harness import BaseBenchmark
from core.benchmark.metrics import compute_memory_transfer_metrics

Type Hints

Always use type hints for function signatures:

def compute_something(
    tensor: torch.Tensor,
    scale: float = 1.0,
    device: Optional[str] = None,
) -> Dict[str, float]:
    ...

Docstrings

Use Google-style docstrings:

def compute_bandwidth(bytes_transferred: int, elapsed_ms: float) -> float:
    """Calculate achieved bandwidth in GB/s.
    
    Args:
        bytes_transferred: Total bytes moved
        elapsed_ms: Time elapsed in milliseconds
    
    Returns:
        Bandwidth in GB/s
    
    Raises:
        ValueError: If elapsed_ms is zero or negative
    """
    ...

CUDA Synchronization

Always synchronize before timing:

def benchmark_fn(self) -> None:
    # Do work
    result = self.model(self.input)
    
    # ALWAYS synchronize before timing ends
    torch.cuda.synchronize()

Metrics Guidelines

Use Helper Functions

Always use the standardized metric helpers from core/benchmark/metrics.py:

from core.benchmark.metrics import (
    compute_memory_transfer_metrics,    # Ch2: bandwidth
    compute_kernel_fundamentals_metrics, # Ch6: bank conflicts
    compute_memory_access_metrics,       # Ch7: coalescing
    compute_optimization_metrics,        # Ch8: speedup
    compute_roofline_metrics,            # Ch9: arithmetic intensity
    compute_pipeline_metrics,            # Ch10: pipeline efficiency
    compute_stream_metrics,              # Ch11: overlap
    compute_graph_metrics,               # Ch12: launch overhead
    compute_precision_metrics,           # Ch13: FP8/FP16
    compute_triton_metrics,              # Ch14: Triton kernels
    compute_inference_metrics,           # Ch15-17: TTFT/TPOT
    compute_speculative_decoding_metrics, # Ch18: acceptance rate
    compute_distributed_metrics,         # Ch4: collective bandwidth
    compute_moe_metrics,                 # MoE: expert utilization
)

Metric Naming Convention

Use category.metric_name format:

# Good
{
    "transfer.achieved_gbps": 100.5,
    "transfer.efficiency_pct": 85.2,
    "memory.coalescing_pct": 95.0,
}

# Bad
{
    "bandwidth": 100.5,      # Missing category
    "efficiency": 85.2,      # Missing category
    "CoalescingPercent": 95.0,  # Wrong format
}

Defensive Guards

Use conditional returns for metrics that depend on runtime data:

def get_custom_metrics(self) -> Optional[dict]:
    # Guard against missing data
    if not hasattr(self, '_last_elapsed_ms'):
        return None
    
    return compute_memory_transfer_metrics(
        bytes_transferred=self.N * 4,
        elapsed_ms=self._last_elapsed_ms,
    )

Testing

Running Tests

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_benchmark_metrics.py -v

# Run with coverage
pytest tests/ --cov=benchmark --cov=core --cov=profiling --cov-report=html

Writing Tests

import pytest
from core.benchmark.metrics import compute_memory_transfer_metrics

class TestMemoryTransferMetrics:
    def test_basic_transfer(self):
        """Test basic bandwidth calculation."""
        metrics = compute_memory_transfer_metrics(
            bytes_transferred=1e9,  # 1 GB
            elapsed_ms=100.0,       # 100 ms
        )
        
        # 1 GB in 100 ms = 10 GB/s
        assert abs(metrics["transfer.achieved_gbps"] - 10.0) < 0.01
    
    def test_zero_time_handling(self):
        """Should handle edge cases gracefully."""
        metrics = compute_memory_transfer_metrics(
            bytes_transferred=1e9,
            elapsed_ms=0.0,  # Edge case
        )
        assert metrics["transfer.achieved_gbps"] > 0  # Should not crash

Documentation

File Headers

Every Python file should have a docstring:

#!/usr/bin/env python3
"""Optimized: Memory coalescing with vectorized loads.

Demonstrates 128-bit vectorized memory access patterns for
optimal memory bandwidth utilization on Blackwell GPUs.

Key optimizations:
- float4 vectorized loads (128-bit)
- Aligned memory access
- Coalesced access patterns

Expected speedup: 2-4× over baseline
"""

README Files

Labs should have README.md with:

Goal/purpose
Key techniques demonstrated
File descriptions
Running instructions
Configuration options
Related chapters

Tools

Analyze Metrics Coverage

# Check get_custom_metrics() status
python core/scripts/update_custom_metrics.py --analyze

# Apply suggested improvements
python core/scripts/update_custom_metrics.py --apply

# Validate metric quality
python core/scripts/update_custom_metrics.py --validate

Benchmark Comparison

# Compare two benchmark result JSON files
python -m cli.aisp bench compare-runs \
    --baseline artifacts/runs/baseline/benchmark_test_results.json \
    --candidate artifacts/runs/candidate/benchmark_test_results.json

Profiling

# nsys profile
nsys profile -o output python ch07/optimized_memory_access.py

# ncu profile with chapter-specific metrics
ncu --set full --metrics $(python -c "from core.profiling.profiler_config import get_chapter_metrics; print(','.join(get_chapter_metrics(7)))") python ch07/optimized_memory_access.py

Development Workflow

Using the Makefile

# See all available targets
make help

# Run all checks before committing
make check

# Run tests with coverage
make test-cov

# Generate coverage report
make coverage

CI/CD Integration

The following checks run automatically on push/PR:

Unit tests for benchmark_metrics, profiler_config, update_custom_metrics
Benchmark import validation
Metrics coverage analysis
Linting (flake8, mypy)

See ./.github/workflows/benchmark-validation.yml for details.

Checklist

Before submitting changes:

Benchmark inherits from BaseBenchmark
Has get_benchmark() factory function
Has get_custom_metrics() using helper functions
Includes torch.cuda.synchronize() in benchmark_fn()
Has docstring with description
Uses type hints
Follows baseline_/optimized_ naming
Tests pass (make test)
Validation passes (make check)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Contributing Guide

Table of Contents

File Organization

Chapter Structure

Naming Conventions

Benchmark Structure

Required Methods

Optional Methods

⚠️ CRITICAL: Warmup Requirements

Coding Standards

Imports

Type Hints

Docstrings

CUDA Synchronization

Metrics Guidelines

Use Helper Functions

Metric Naming Convention

Defensive Guards

Testing

Running Tests

Writing Tests

Documentation

File Headers

README Files

Tools

Analyze Metrics Coverage

Benchmark Comparison

Profiling

Development Workflow

Using the Makefile

CI/CD Integration

Checklist

FilesExpand file tree

CONTRIBUTING.md

Latest commit

History

CONTRIBUTING.md

File metadata and controls

Contributing Guide

Table of Contents

File Organization

Chapter Structure

Naming Conventions

Benchmark Structure

Required Methods

Optional Methods

⚠️ CRITICAL: Warmup Requirements

Coding Standards

Imports

Type Hints

Docstrings

CUDA Synchronization

Metrics Guidelines

Use Helper Functions

Metric Naming Convention

Defensive Guards

Testing

Running Tests

Writing Tests

Documentation

File Headers

README Files

Tools

Analyze Metrics Coverage

Benchmark Comparison

Profiling

Development Workflow

Using the Makefile

CI/CD Integration

Checklist