117 changes: 117 additions & 0 deletions content/cuda/docs/async-copy/DOC.md
@@ -0,0 +1,117 @@
---
name: async-copy
description: "CUDA async copy essentials: cooperative_groups::memcpy_async, cuda::pipeline, wait rules, and the bridge to cp.async/TMA."
metadata:
languages: "cpp"
versions: "13.1"
revision: 1
updated-on: "2026-03-20"
source: official
tags: "cuda,gpu,kernel,async-copy,memcpy_async,cuda::pipeline,cuda::barrier,cp.async,tma,shared-memory"
---

# CUDA Async Copy (C++)

Use this page for the CUDA C++ view of asynchronous copies from global memory to shared memory and the synchronization rules around them.

## What Problem It Solves

A conventional copy into shared memory:

```cpp
shared[idx] = global[idx];
```

typically expands into:

1. load from global memory into a register
2. store from register into shared memory

Async copy can avoid that register staging path on supported hardware and can overlap data movement with computation.

## Main CUDA C++ Entry Points

Two common interfaces appear in NVIDIA documentation:

- `cooperative_groups::memcpy_async(...)`
- `cuda::memcpy_async(...)` together with `cuda::pipeline` or `cuda::barrier`

At a high level, both start an async transfer and require an explicit wait before the data in shared memory is consumed.

## Fundamental Safety Rule

After initiating the async copy:

- do not read the destination shared memory until the corresponding wait completes
- do not modify the source or destination participating region while the transfer is in flight

Until the wait completes, reading or writing the participating data can create a data race.

## Cooperative Groups Pattern

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

auto block = cg::this_thread_block();
extern __shared__ float smem[];

// Whole-block collective: every thread of the block must participate.
cg::memcpy_async(block, smem, gmem_ptr, bytes);

// cg::wait(block) waits for the outstanding copies and synchronizes the
// group, so a separate block.sync() is not needed for this purpose.
cg::wait(block);
```

Use `cg::wait(group)` before consuming the copied shared-memory data.

## Pipeline Pattern

For newer CUDA C++ paths, `cuda::pipeline` can express staged copy/compute overlap.

The common structure is:

1. acquire / start pipeline stage
2. issue `cuda::memcpy_async`
3. commit or advance the stage
4. wait for the prior stage
5. compute on the completed shared-memory tile

This is the higher-level CUDA C++ bridge to lower-level async copy hardware behavior.
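Under the assumption of a single pipeline stage and a single tile (names such as `pipeline_sketch` and `tile_elems` are hypothetical), the five steps above can be sketched with the libcu++ `cuda::pipeline` API:

```cpp
#include <cuda/pipeline>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Single-stage sketch of the acquire / copy / commit / wait / compute structure.
__global__ void pipeline_sketch(const float* gmem, int tile_elems) {
    extern __shared__ float smem[];
    auto block = cg::this_thread_block();

    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 1> state;
    auto pipe = cuda::make_pipeline(block, &state);

    pipe.producer_acquire();                                  // 1. acquire a stage
    cuda::memcpy_async(block, smem, gmem,
                       sizeof(float) * tile_elems, pipe);     // 2. issue the copy
    pipe.producer_commit();                                   // 3. commit the stage

    pipe.consumer_wait();                                     // 4. wait for completion
    // 5. compute on the smem tile here ...
    pipe.consumer_release();
}
```

A multi-stage variant uses a larger stage count in `pipeline_shared_state` and alternates tiles, overlapping the copy of stage N+1 with compute on stage N.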

## When Hardware Acceleration Matters

NVIDIA documents that on compute capability 8.0 and higher, async copies from global to shared memory can benefit from hardware acceleration that avoids an intermediate register path.

That does not remove the need for:

- alignment discipline
- correct wait behavior
- sensible shared-memory layout

## When To Escalate To PTX / TMA

Stay in CUDA C++ docs when:

- you are using `memcpy_async`
- you need pipeline-level copy/compute overlap
- you want a supported C++ interface

Drop to PTX / TMA docs when:

- you need precise `cp.async` group semantics
- you need bulk async copies or TMA
- you need `mbarrier` or cluster-scope completion behavior

## Related Topics

- Shared memory usage: `../shared-memory/DOC.md`
- Synchronization rules: `../synchronization/DOC.md`
- Cooperative Groups: `../cooperative-groups/DOC.md`
- PTX `cp.async`: `../ptx/instructions/data-movement/references/cp-async.md`
- PTX TMA: `../ptx/instructions/tma/DOC.md`

## Official Source Links (Fact Check)

- CUDA Programming Guide, Asynchronous Data Copies: https://docs.nvidia.com/cuda/archive/13.1.1/cuda-programming-guide/04-special-topics/async-copies.html
- CUDA Programming Guide, Cooperative Groups async copy examples: https://docs.nvidia.com/cuda/archive/11.8.0/cuda-c-programming-guide/index.html
- CUDA Programming Guide, `memcpy_async` and `cuda::pipeline`: https://docs.nvidia.com/cuda/archive/11.6.2/cuda-c-programming-guide/index.html

Last cross-check date: 2026-03-20
94 changes: 94 additions & 0 deletions content/cuda/docs/atomics-and-reductions/DOC.md
@@ -0,0 +1,94 @@
---
name: atomics-and-reductions
description: "CUDA atomics and reduction essentials: atomicAdd, shared/global scope, warp-first reduction, and common tradeoffs."
metadata:
languages: "cpp"
versions: "12.9"
revision: 1
updated-on: "2026-03-20"
source: official
tags: "cuda,gpu,kernel,atomics,reduction,atomicAdd,atomicCAS,shared-memory,warp-reduction"
---

# CUDA Atomics And Reductions (C++)

Use this page when deciding between direct atomics, shared-memory reductions, and warp-first reduction patterns.

## Atomic Basics

An atomic operation performs a read-modify-write on a word in global or shared memory as a single indivisible transaction.

Common examples:

- `atomicAdd`
- `atomicCAS`
- `atomicMax`
- `atomicMin`

Atomics are the right tool for concurrent updates to shared locations, but heavily contended addresses serialize into hot spots.
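For illustration, a minimal sketch of per-thread global accumulation with `atomicAdd` (kernel name and launch details hypothetical) — correct, but exactly the hot-spot pattern the later sections avoid:

```cpp
// Sketch: each thread adds its element into one global accumulator.
// Every thread contends for the same address.
__global__ void sum_naive(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(out, in[i]);  // read-modify-write as one indivisible operation
    }
}
```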

## Scope Choice

- shared-memory atomics are useful for contention within one block
- global-memory atomics are visible across blocks but usually cost more under heavy contention

A common pattern is:

1. reduce within a warp
2. reduce within a block using shared memory
3. emit one global atomic per block

## Preferred Reduction Structure

For many reductions, do not start with one atomic per thread.

Better default:

- first use warp shuffle reduction
- then combine warp results in shared memory
- then write one value per block or one atomic per block

This reduces contention and memory traffic.
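A minimal sketch of that default, assuming a block size that is a multiple of 32 and at most 1024 threads (kernel name hypothetical):

```cpp
// Sketch: warp shuffle -> shared-memory combine -> one atomic per block.
__global__ void sum_hierarchical(const float* in, float* out, int n) {
    __shared__ float warp_sums[32];            // one slot per warp in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    // 1. reduce within a warp via shuffles
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);

    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    if (lane == 0) warp_sums[warp] = v;        // 2. stage warp results in shared memory
    __syncthreads();

    // 3. first warp reduces the warp results, then one global atomic per block
    if (warp == 0) {
        int num_warps = blockDim.x / 32;
        v = (lane < num_warps) ? warp_sums[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset /= 2)
            v += __shfl_down_sync(0xffffffff, v, offset);
        if (lane == 0) atomicAdd(out, v);
    }
}
```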

## When Direct Atomics Are Fine

Direct global atomics are often acceptable when:

- the output has low contention
- the kernel is not dominated by the atomic path
- simplicity matters more than peak throughput

Examples:

- histogram with many bins and good distribution
- sparse accumulation with low collision probability

## When Atomics Become A Problem

Expect trouble when:

- many threads update the same location
- the output space is very small
- the kernel becomes serialization-bound

In those cases, switch to hierarchical reduction or privatization.

## Minimal Strategy Guide

- one scalar result per block: block reduction in shared memory
- one scalar result for the whole grid: block reduction plus final stage
- many bins with moderate collisions: shared-memory privatization, then flush
- warp-local aggregation: use shuffle before touching shared or global memory
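The "privatize, then flush" row can be sketched as follows, assuming 256 one-byte bins (names hypothetical):

```cpp
#define NUM_BINS 256

// Sketch: shared-memory privatization keeps contention inside the block,
// then each block flushes one global atomic per bin.
__global__ void histogram_privatized(const unsigned char* in,
                                     unsigned int* global_hist, int n) {
    __shared__ unsigned int local_hist[NUM_BINS];

    // zero the block-private histogram
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local_hist[b] = 0;
    __syncthreads();

    // accumulate into shared memory (grid-stride loop)
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local_hist[in[i]], 1u);
    __syncthreads();

    // flush: one global atomic per bin per block
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&global_hist[b], local_hist[b]);
}
```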

## Related Topics

- Shared memory staging: `../shared-memory/DOC.md`
- Warp-level collectives: `../warp-primitives/DOC.md`
- Synchronization rules: `../synchronization/DOC.md`

## Official Source Links (Fact Check)

- CUDA C++ Programming Guide, atomic functions: https://docs.nvidia.com/cuda/archive/12.9.1/cuda-c-programming-guide/index.html
- CUDA C++ Best Practices Guide, reduction and shared-memory patterns: https://docs.nvidia.com/cuda/archive/13.0.0/cuda-c-best-practices-guide/index.html

Last cross-check date: 2026-03-20
74 changes: 74 additions & 0 deletions content/cuda/docs/benchmarking-methodology/DOC.md
@@ -0,0 +1,74 @@
---
name: benchmarking-methodology
description: "CUDA benchmarking methodology essentials: warmup, synchronization discipline, stable inputs, percentile reporting, and fair comparisons."
metadata:
languages: "cpp"
versions: "12.9"
revision: 1
updated-on: "2026-03-20"
source: official
tags: "cuda,gpu,kernel,benchmark,methodology,warmup,timing,percentile,variance,fair-comparison"
---

# CUDA Benchmarking Methodology (C++)

Use this page when you need benchmark numbers that are comparable and reproducible.

## Core Rules

1. measure steady state, not cold start
2. use correct synchronization for the scope being measured
3. keep input shapes and distributions fixed across variants
4. report variability, not just one best run

## Warmup

Always include warmup iterations before measurement to absorb:

- JIT or first-use overheads
- cache/allocator/transient startup effects

## Timing Discipline

For kernel timing:

- use event-based timing around the measured stream segment
- avoid mixing host wall-clock timing with unsynchronized device work
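A minimal host-side sketch of event-based timing (the kernel, launch configuration, and `stream` are placeholders; error checking omitted for brevity):

```cpp
// Sketch: bracket the measured stream segment with events,
// then synchronize only on the stop event.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
my_kernel<<<grid, block, 0, stream>>>(args);   // measured segment
cudaEventRecord(stop, stream);

cudaEventSynchronize(stop);                    // wait only for the measured work
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```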

For end-to-end latency:

- include all relevant host/device stages intentionally
- document what is excluded

## Comparison Hygiene

- same hardware and driver/toolkit
- same input set and batch strategy
- same precision and algorithm settings
- same determinism flags where relevant

Any mismatch here can invalidate claimed speedups.

## Reporting

Report at least:

- median
- p90/p95 (or similar tail percentile)
- run-to-run variance

Reporting only the single fastest run is not sufficient for production-facing claims.

## Related Topics

- Streams and events: `../streams-and-events/DOC.md`
- Performance debugging: `../performance-debugging/DOC.md`
- NVTX profiling workflow: `../nvtx-and-profiling-workflow/DOC.md`
- Regression testing and CI: `../regression-testing-and-ci/DOC.md`

## Official Source Links (Fact Check)

- CUDA C++ Best Practices Guide, measurement and optimization workflow context: https://docs.nvidia.com/cuda/archive/13.0.0/cuda-c-best-practices-guide/index.html
- CUDA Runtime API, event timing APIs: https://docs.nvidia.com/cuda/cuda-runtime-api/index.html

Last cross-check date: 2026-03-20
72 changes: 72 additions & 0 deletions content/cuda/docs/build-and-abi-compatibility/DOC.md
@@ -0,0 +1,72 @@
---
name: build-and-abi-compatibility
description: "CUDA build and ABI compatibility essentials: arch targets, PTX/SASS forward-compat strategy, runtime/driver constraints, and packaging hygiene."
metadata:
languages: "cpp"
versions: "12.9"
revision: 1
updated-on: "2026-03-20"
source: official
tags: "cuda,gpu,kernel,build,abi,compatibility,sm-arch,ptx,sass,nvcc,driver-runtime"
---

# CUDA Build And ABI Compatibility (C++)

Use this page when shipping CUDA binaries across different GPU architectures and deployment environments.

## Targeting Strategy

Build artifacts can include:

- SASS for specific SM architectures
- PTX for forward compatibility via JIT on newer compatible drivers

A common practical strategy is to include both:

- native SASS for known deployment GPUs
- PTX fallback for future-compatible targets

## Why Compatibility Breaks

Typical mismatch classes:

- runtime-toolkit vs driver capability mismatch
- missing arch target in build flags
- ABI or dependency mismatch in host integration

Treat compatibility as part of release engineering, not a last-minute fix.

## NVCC Arch Hygiene

Use explicit arch targets and document them in build config.

- keep `-gencode` matrix aligned with actual fleet GPUs
- avoid shipping only one narrow arch unless environment is fixed
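A hypothetical `-gencode` matrix implementing both halves of the strategy above (architectures chosen for illustration only):

```shell
# Native SASS for two known fleet architectures (sm_80, sm_90),
# plus PTX for the newest one as a forward-compatible JIT fallback.
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -gencode arch=compute_90,code=compute_90 \
     -o app app.cu
```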

## Runtime/Driver Considerations

- new toolkits can require minimum driver versions
- deployment systems may lag driver updates

Validate on representative driver/toolkit combinations before release.

## Package-Level Practices

- pin toolkit version in CI
- record compile flags in build metadata
- verify cold-start JIT overhead if PTX fallback is expected
- add smoke tests per target GPU class

## Related Topics

- Error handling and debug build: `../error-handling-and-debug-build/DOC.md`
- Runtime API overview: `../runtime/DOC.md`
- PTX ISA overview: `../ptx/DOC.md`

## Official Source Links (Fact Check)

- NVCC Compiler Driver documentation: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html
- CUDA Compatibility documentation: https://docs.nvidia.com/deploy/cuda-compatibility/index.html
- CUDA Installation Guide (version/driver context): https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

Last cross-check date: 2026-03-20