117 changes: 117 additions & 0 deletions content/cuda/docs/async-copy/DOC.md
@@ -0,0 +1,117 @@
---
name: async-copy
description: "CUDA async copy essentials: cooperative_groups::memcpy_async, cuda::pipeline, wait rules, and the bridge to cp.async/TMA."
metadata:
languages: "cpp"
versions: "13.1"
revision: 1
updated-on: "2026-03-20"
source: official
tags: "cuda,gpu,kernel,async-copy,memcpy_async,cuda::pipeline,cuda::barrier,cp.async,tma,shared-memory"
---

# CUDA Async Copy (C++)

Use this page for the CUDA C++ view of asynchronous copies from global memory to shared memory and the synchronization rules around them.

## What Problem It Solves

A conventional copy into shared memory:

```cpp
shared[idx] = global[idx];
```

typically expands into:

1. load from global memory into a register
2. store from register into shared memory

Async copy can avoid that register staging path on supported hardware and can overlap data movement with computation.

## Main CUDA C++ Entry Points

Two common interfaces appear in NVIDIA documentation:

- `cooperative_groups::memcpy_async(...)`
- `cuda::memcpy_async(...)` together with `cuda::pipeline` or `cuda::barrier`

At a high level, both start an async transfer and require an explicit wait before the data in shared memory is consumed.

## Fundamental Safety Rule

After initiating the async copy:

- do not read the destination shared memory until the corresponding wait completes
- do not modify the source or destination participating region while the transfer is in flight

Until the wait completes, reading or writing the participating data can create a data race.

## Cooperative Groups Pattern

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

auto block = cg::this_thread_block();
extern __shared__ float smem[];

// Whole-block collective: every thread of the block must participate.
cg::memcpy_async(block, smem, gmem_ptr, bytes);

// cg::wait(block) waits for the outstanding copies and synchronizes the
// group, so a separate block.sync() is not needed for this purpose.
cg::wait(block);
```

Use `cg::wait(group)` before consuming the copied shared-memory data.

## Pipeline Pattern

For newer CUDA C++ paths, `cuda::pipeline` can express staged copy/compute overlap.

The common structure is:

1. acquire / start pipeline stage
2. issue `cuda::memcpy_async`
3. commit or advance the stage
4. wait for the prior stage
5. compute on the completed shared-memory tile

This is the higher-level CUDA C++ bridge to lower-level async copy hardware behavior.
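Under the assumption of a single pipeline stage and a single tile (names such as `pipeline_sketch` and `tile_elems` are hypothetical), the five steps above can be sketched with the libcu++ `cuda::pipeline` API:

```cpp
#include <cuda/pipeline>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Single-stage sketch of the acquire / copy / commit / wait / compute structure.
__global__ void pipeline_sketch(const float* gmem, int tile_elems) {
    extern __shared__ float smem[];
    auto block = cg::this_thread_block();

    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 1> state;
    auto pipe = cuda::make_pipeline(block, &state);

    pipe.producer_acquire();                                  // 1. acquire a stage
    cuda::memcpy_async(block, smem, gmem,
                       sizeof(float) * tile_elems, pipe);     // 2. issue the copy
    pipe.producer_commit();                                   // 3. commit the stage

    pipe.consumer_wait();                                     // 4. wait for completion
    // 5. compute on the smem tile here ...
    pipe.consumer_release();
}
```

A multi-stage variant uses a larger stage count in `pipeline_shared_state` and alternates tiles, overlapping the copy of stage N+1 with compute on stage N.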

## When Hardware Acceleration Matters

NVIDIA documents that on compute capability 8.0 and higher, async copies from global to shared memory can benefit from hardware acceleration that avoids an intermediate register path.

That does not remove the need for:

- alignment discipline
- correct wait behavior
- sensible shared-memory layout

## When To Escalate To PTX / TMA

Stay in CUDA C++ docs when:

- you are using `memcpy_async`
- you need pipeline-level copy/compute overlap
- you want a supported C++ interface

Drop to PTX / TMA docs when:

- you need precise `cp.async` group semantics
- you need bulk async copies or TMA
- you need `mbarrier` or cluster-scope completion behavior

## Related Topics

- Shared memory usage: `../shared-memory/DOC.md`
- Synchronization rules: `../synchronization/DOC.md`
- Cooperative Groups: `../cooperative-groups/DOC.md`
- PTX `cp.async`: `../ptx/instructions/data-movement/references/cp-async.md`
- PTX TMA: `../ptx/instructions/tma/DOC.md`

## Official Source Links (Fact Check)

- CUDA Programming Guide, Asynchronous Data Copies: https://docs.nvidia.com/cuda/archive/13.1.1/cuda-programming-guide/04-special-topics/async-copies.html
- CUDA Programming Guide, Cooperative Groups async copy examples: https://docs.nvidia.com/cuda/archive/11.8.0/cuda-c-programming-guide/index.html
- CUDA Programming Guide, `memcpy_async` and `cuda::pipeline`: https://docs.nvidia.com/cuda/archive/11.6.2/cuda-c-programming-guide/index.html

Last cross-check date: 2026-03-20
94 changes: 94 additions & 0 deletions content/cuda/docs/atomics-and-reductions/DOC.md
@@ -0,0 +1,94 @@
---
name: atomics-and-reductions
description: "CUDA atomics and reduction essentials: atomicAdd, shared/global scope, warp-first reduction, and common tradeoffs."
metadata:
languages: "cpp"
versions: "12.9"
revision: 1
updated-on: "2026-03-20"
source: official
tags: "cuda,gpu,kernel,atomics,reduction,atomicAdd,atomicCAS,shared-memory,warp-reduction"
---

# CUDA Atomics And Reductions (C++)

Use this page when deciding between direct atomics, shared-memory reductions, and warp-first reduction patterns.

## Atomic Basics

An atomic operation performs a read-modify-write on a word in global or shared memory as a single indivisible transaction.

Common examples:

- `atomicAdd`
- `atomicCAS`
- `atomicMax`
- `atomicMin`

Atomics are the right tool for concurrent updates to shared locations, but heavily contended addresses serialize into hot spots.
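For illustration, a minimal sketch of per-thread global accumulation with `atomicAdd` (kernel name and launch details hypothetical) — correct, but exactly the hot-spot pattern the later sections avoid:

```cpp
// Sketch: each thread adds its element into one global accumulator.
// Every thread contends for the same address.
__global__ void sum_naive(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(out, in[i]);  // read-modify-write as one indivisible operation
    }
}
```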

## Scope Choice

- shared-memory atomics are useful for contention within one block
- global-memory atomics are visible across blocks but usually cost more under heavy contention

A common pattern is:

1. reduce within a warp
2. reduce within a block using shared memory
3. emit one global atomic per block

## Preferred Reduction Structure

For many reductions, do not start with one atomic per thread.

Better default:

- first use warp shuffle reduction
- then combine warp results in shared memory
- then write one value per block or one atomic per block

This reduces contention and memory traffic.
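A minimal sketch of that default, assuming a block size that is a multiple of 32 and at most 1024 threads (kernel name hypothetical):

```cpp
// Sketch: warp shuffle -> shared-memory combine -> one atomic per block.
__global__ void sum_hierarchical(const float* in, float* out, int n) {
    __shared__ float warp_sums[32];            // one slot per warp in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    // 1. reduce within a warp via shuffles
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);

    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    if (lane == 0) warp_sums[warp] = v;        // 2. stage warp results in shared memory
    __syncthreads();

    // 3. first warp reduces the warp results, then one global atomic per block
    if (warp == 0) {
        int num_warps = blockDim.x / 32;
        v = (lane < num_warps) ? warp_sums[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset /= 2)
            v += __shfl_down_sync(0xffffffff, v, offset);
        if (lane == 0) atomicAdd(out, v);
    }
}
```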

## When Direct Atomics Are Fine

Direct global atomics are often acceptable when:

- the output has low contention
- the kernel is not dominated by the atomic path
- simplicity matters more than peak throughput

Examples:

- histogram with many bins and good distribution
- sparse accumulation with low collision probability

## When Atomics Become A Problem

Expect trouble when:

- many threads update the same location
- the output space is very small
- the kernel becomes serialization-bound

In those cases, switch to hierarchical reduction or privatization.

## Minimal Strategy Guide

- one scalar result per block: block reduction in shared memory
- one scalar result for the whole grid: block reduction plus final stage
- many bins with moderate collisions: shared-memory privatization, then flush
- warp-local aggregation: use shuffle before touching shared or global memory
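The "privatize, then flush" row can be sketched as follows, assuming 256 one-byte bins (names hypothetical):

```cpp
#define NUM_BINS 256

// Sketch: shared-memory privatization keeps contention inside the block,
// then each block flushes one global atomic per bin.
__global__ void histogram_privatized(const unsigned char* in,
                                     unsigned int* global_hist, int n) {
    __shared__ unsigned int local_hist[NUM_BINS];

    // zero the block-private histogram
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local_hist[b] = 0;
    __syncthreads();

    // accumulate into shared memory (grid-stride loop)
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local_hist[in[i]], 1u);
    __syncthreads();

    // flush: one global atomic per bin per block
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&global_hist[b], local_hist[b]);
}
```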

## Related Topics

- Shared memory staging: `../shared-memory/DOC.md`
- Warp-level collectives: `../warp-primitives/DOC.md`
- Synchronization rules: `../synchronization/DOC.md`

## Official Source Links (Fact Check)

- CUDA C++ Programming Guide, atomic functions: https://docs.nvidia.com/cuda/archive/12.9.1/cuda-c-programming-guide/index.html
- CUDA C++ Best Practices Guide, reduction and shared-memory patterns: https://docs.nvidia.com/cuda/archive/13.0.0/cuda-c-best-practices-guide/index.html

Last cross-check date: 2026-03-20
74 changes: 74 additions & 0 deletions content/cuda/docs/benchmarking-methodology/DOC.md
@@ -0,0 +1,74 @@
---
name: benchmarking-methodology
description: "CUDA benchmarking methodology essentials: warmup, synchronization discipline, stable inputs, percentile reporting, and fair comparisons."
metadata:
languages: "cpp"
versions: "12.9"
revision: 1
updated-on: "2026-03-20"
source: official
tags: "cuda,gpu,kernel,benchmark,methodology,warmup,timing,percentile,variance,fair-comparison"
---

# CUDA Benchmarking Methodology (C++)

Use this page when you need benchmark numbers that are comparable and reproducible.

## Core Rules

1. measure steady state, not cold start
2. use correct synchronization for the scope being measured
3. keep input shapes and distributions fixed across variants
4. report variability, not just one best run

## Warmup

Always include warmup iterations before measurement to absorb:

- JIT or first-use overheads
- cache/allocator/transient startup effects

## Timing Discipline

For kernel timing:

- use event-based timing around the measured stream segment
- avoid mixing host wall-clock timing with unsynchronized device work
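A minimal host-side sketch of event-based timing (the kernel, launch configuration, and `stream` are placeholders; error checking omitted for brevity):

```cpp
// Sketch: bracket the measured stream segment with events,
// then synchronize only on the stop event.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
my_kernel<<<grid, block, 0, stream>>>(args);   // measured segment
cudaEventRecord(stop, stream);

cudaEventSynchronize(stop);                    // wait only for the measured work
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```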

For end-to-end latency:

- include all relevant host/device stages intentionally
- document what is excluded

## Comparison Hygiene

- same hardware and driver/toolkit
- same input set and batch strategy
- same precision and algorithm settings
- same determinism flags where relevant

Any mismatch here can invalidate claimed speedups.

## Reporting

Report at least:

- median
- p90/p95 (or similar tail percentile)
- run-to-run variance

Reporting only the single fastest run is not sufficient for production-facing claims.

## Related Topics

- Streams and events: `../streams-and-events/DOC.md`
- Performance debugging: `../performance-debugging/DOC.md`
- NVTX profiling workflow: `../nvtx-and-profiling-workflow/DOC.md`
- Regression testing and CI: `../regression-testing-and-ci/DOC.md`

## Official Source Links (Fact Check)

- CUDA C++ Best Practices Guide, measurement and optimization workflow context: https://docs.nvidia.com/cuda/archive/13.0.0/cuda-c-best-practices-guide/index.html
- CUDA Runtime API, event timing APIs: https://docs.nvidia.com/cuda/cuda-runtime-api/index.html

Last cross-check date: 2026-03-20
72 changes: 72 additions & 0 deletions content/cuda/docs/build-and-abi-compatibility/DOC.md
@@ -0,0 +1,72 @@
---
name: build-and-abi-compatibility
description: "CUDA build and ABI compatibility essentials: arch targets, PTX/SASS forward-compat strategy, runtime/driver constraints, and packaging hygiene."
metadata:
languages: "cpp"
versions: "12.9"
revision: 1
updated-on: "2026-03-20"
source: official
tags: "cuda,gpu,kernel,build,abi,compatibility,sm-arch,ptx,sass,nvcc,driver-runtime"
---

# CUDA Build And ABI Compatibility (C++)

Use this page when shipping CUDA binaries across different GPU architectures and deployment environments.

## Targeting Strategy

Build artifacts can include:

- SASS for specific SM architectures
- PTX for forward compatibility via JIT on newer compatible drivers

A common practical strategy is to include both:

- native SASS for known deployment GPUs
- PTX fallback for future-compatible targets

## Why Compatibility Breaks

Typical mismatch classes:

- runtime-toolkit vs driver capability mismatch
- missing arch target in build flags
- ABI or dependency mismatch in host integration

Treat compatibility as part of release engineering, not a last-minute fix.

## NVCC Arch Hygiene

Use explicit arch targets and document them in build config.

- keep `-gencode` matrix aligned with actual fleet GPUs
- avoid shipping only one narrow arch unless environment is fixed
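A hypothetical `-gencode` matrix implementing both halves of the strategy above (architectures chosen for illustration only):

```shell
# Native SASS for two known fleet architectures (sm_80, sm_90),
# plus PTX for the newest one as a forward-compatible JIT fallback.
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -gencode arch=compute_90,code=compute_90 \
     -o app app.cu
```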

## Runtime/Driver Considerations

- new toolkits can require minimum driver versions
- deployment systems may lag driver updates

Validate on representative driver/toolkit combinations before release.

## Package-Level Practices

- pin toolkit version in CI
- record compile flags in build metadata
- verify cold-start JIT overhead if PTX fallback is expected
- add smoke tests per target GPU class

## Related Topics

- Error handling and debug build: `../error-handling-and-debug-build/DOC.md`
- Runtime API overview: `../runtime/DOC.md`
- PTX ISA overview: `../ptx/DOC.md`

## Official Source Links (Fact Check)

- NVCC Compiler Driver documentation: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html
- CUDA Compatibility documentation: https://docs.nvidia.com/deploy/cuda-compatibility/index.html
- CUDA Installation Guide (version/driver context): https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

Last cross-check date: 2026-03-20