Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
80 changes: 80 additions & 0 deletions content/apple/docs/metal-argument-buffers-and-residency/DOC.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
---
name: metal-argument-buffers-and-residency
description: "Apple Metal argument buffer patterns: encoding resource tables, residency requirements, and useResource or useHeap rules for compute workloads."
metadata:
languages: "cpp"
versions: "4.0"
revision: 1
updated-on: "2026-03-21"
source: official
tags: "apple,metal,argument-buffers,residency,useresource,useheap,resource-binding,indirect-access,compute,mtlargumentencoder"
---

# Metal Argument Buffers And Residency

Use this page when a Metal compute kernel needs many resources, indirect resource access, or lower CPU-side binding overhead.

## What Argument Buffers Solve

Argument buffers let you encode resource references into a buffer-backed table instead of rebinding many individual buffers or textures for every dispatch.

This is useful when:

- one kernel reads many buffers or textures
- the set of resources changes per dispatch
- the resource table is reused across many dispatches
- GPU-driven or indirection-heavy workflows need resource tables

## Core Host-Side Pattern

The usual structure is:

1. create an `MTLArgumentEncoder`
2. allocate a backing `MTLBuffer`
3. encode resource references into that buffer
4. bind the argument buffer to the kernel
5. make indirect resources resident before dispatch

The last step is the part people miss most often.

## Residency Rules Matter

If a kernel reaches resources through an argument buffer, those resources must be resident for the duration of the compute pass.

In practice this means:

- call `useResource(_:usage:)` for resources reached indirectly through an argument buffer
- call `useHeap(_:)` when residency is managed through a heap
- do this before the encoded dispatch that consumes those resources

If you bind a resource directly to a kernel argument, you do not need the extra residency call for that direct binding path.

## Good Usage Pattern

- keep the argument buffer layout stable across many dispatches
- separate "table rebuild" work from "per-dispatch scalar parameter" work
- prefer argument buffers when the binding count is the CPU bottleneck
- keep residency calls explicit and near the dispatch site

## Common Failure Modes

- a resource is encoded into the argument buffer but never made resident
- the argument buffer is updated but a stale resource table is still reused
- CPU-side code mutates resident resources during the compute pass
- argument buffers are introduced for tiny fixed-bind workloads that did not need them

## Review Checklist

- Does the kernel really access resources indirectly?
- Are all indirectly referenced buffers or textures marked resident?
- Is the argument buffer rebuilt only when the table actually changes?
- Is the performance goal CPU submission overhead rather than kernel ALU time?

## Official Source Links (Fact Check)

- Compute passes: https://developer.apple.com/documentation/metal/compute-passes
- Improving CPU performance by using argument buffers: https://developer.apple.com/documentation/metal/improving-cpu-performance-by-using-argument-buffers
- `useResource(_:usage:)`: https://developer.apple.com/documentation/metal/mtlcomputecommandencoder/useresource%28_%3Ausage%3A%29
- Metal shader converter binding model and argument buffer notes: https://developer.apple.com/metal/shader-converter/

Last cross-check date: 2026-03-21
56 changes: 56 additions & 0 deletions content/apple/docs/metal-broadcast-kernel-patterns/DOC.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
---
name: metal-broadcast-kernel-patterns
description: "Apple Metal broadcast kernel patterns: scalar or vector expansion, shape alignment, masked edges, and correctness checks for broadcast-heavy compute code."
metadata:
languages: "cpp"
versions: "4.0"
revision: 1
updated-on: "2026-03-21"
source: official
tags: "apple,metal,broadcast,shape-alignment,elementwise,tensor-shapes,masked-edges,compute"
---

# Metal Broadcast Kernel Patterns

Use this page when a Metal kernel combines tensors of different logical shapes using broadcast semantics.

## Why Broadcast Bugs Are Common

Broadcast kernels look simple because the math is usually elementwise.

The real complexity is shape alignment:

- which dimensions are expanded
- which dimensions are equal
- whether one operand is scalar, row-wise, column-wise, or channel-wise

If the shape contract is vague, the kernel may appear correct on square or fully dense cases while failing on realistic shapes.

## Safe Baseline Pattern

- align shapes explicitly in host code before launch
- pass logical sizes needed for each broadcasted dimension
- write one straightforward reference kernel
- test cases where only one dimension is broadcast, then several

## What To Verify

- broadcasted dimensions reuse the intended source index
- non-broadcasted dimensions advance normally
- output shape is derived from the broadcast rule, not copied from one operand blindly
- masked tails or rounded dispatches cannot write past the real output shape

## Common Failure Modes

- one dimension is broadcast on the host side but advanced in the kernel
- scalar and vector broadcast paths behave differently
- the kernel passes tests where all dimensions match, hiding broken broadcast logic
- the output allocation follows one input shape instead of the broadcasted result shape

## Official Source Links (Fact Check)

- Compute passes: https://developer.apple.com/documentation/metal/compute-passes
- Calculating threadgroup and grid sizes: https://developer.apple.com/documentation/metal/calculating-threadgroup-and-grid-sizes
- Metal Shading Language Specification: https://developer.apple.com/metal/resources/

Last cross-check date: 2026-03-21
76 changes: 76 additions & 0 deletions content/apple/docs/metal-buffer-layout-and-alignment/DOC.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
---
name: metal-buffer-layout-and-alignment
description: "Apple Metal buffer layout and alignment: resource sizing, texture-from-buffer alignment, and host/kernel layout discipline."
metadata:
languages: "cpp"
versions: "4.0"
revision: 1
updated-on: "2026-03-21"
source: official
tags: "apple,metal,buffer,alignment,layout,mtlbuffer,minimumtexturebufferalignment,minimumlineartexturealignment,bytesperrow,heap-size-align"
---

# Metal Buffer Layout And Alignment

Use this page when host-side data layout, buffer sizing, or buffer-backed texture creation is part of the kernel path.

## Why This Matters

Many Metal failures that look like "bad math" are actually layout problems:

- host structs and shader expectations do not match
- offsets are aligned incorrectly
- buffer-backed textures use invalid row pitch or offset values
- heap sizing and alignment are estimated incorrectly

## Layout Discipline

- define host and kernel-visible struct layouts explicitly
- keep element size, stride, and offset calculations centralized
- treat texture-from-buffer paths as alignment-sensitive, not as generic byte blobs

## Alignment APIs Apple Exposes

Apple documents alignment helpers on `MTLDevice`, including:

- `minimumTextureBufferAlignment(for:)`
- `minimumLinearTextureAlignment(for:)`
- `heapBufferSizeAndAlign(length:options:)`
- `heapTextureSizeAndAlign(descriptor:)`

Use these APIs instead of guessing alignment from prior hardware experience.

## Buffer-Backed Texture Rules

When creating textures from buffers, values such as:

- buffer offset
- bytes per row
- pixel format alignment

must satisfy the device's documented alignment constraints.

Apple's documentation explicitly ties alignment values to texture creation parameters.

## Common Failure Modes

- Buffer length is correct in bytes, but element stride is wrong.
- Struct fields are logically correct but host/kernel padding expectations differ.
- Texture buffer offset is not aligned to the device minimum.
- `bytesPerRow` is computed from logical width only and ignores required alignment.
- Heap size estimates ignore size-and-align APIs and under-allocate.

## Safe Practice

1. Compute all offsets from element-size and alignment helpers.
2. Reuse one layout definition across host and shader-facing code.
3. Validate resource creation parameters before debugging kernel math.

## Official Source Links (Fact Check)

- `minimumTextureBufferAlignment(for:)`: https://developer.apple.com/documentation/metal/mtldevice/minimumtexturebufferalignment%28for%3A%29
- `minimumLinearTextureAlignment(for:)`: https://developer.apple.com/documentation/metal/mtldevice/minimumlineartexturealignment%28for%3A%29
- `heapBufferSizeAndAlign(length:options:)`: https://developer.apple.com/documentation/metal/mtldevice/heapbuffersizeandalign%28length%3Aoptions%3A%29
- `bufferBytesPerRow`: https://developer.apple.com/documentation/metal/mtltexture/bufferbytesperrow

Last cross-check date: 2026-03-21
73 changes: 73 additions & 0 deletions content/apple/docs/metal-command-buffer-reuse-and-batching/DOC.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
---
name: metal-command-buffer-reuse-and-batching
description: "Apple Metal command buffer reuse and batching guidance: transient versus persistent objects, submission frequency, and indirect command buffer tradeoffs."
metadata:
languages: "cpp"
versions: "4.0"
revision: 1
updated-on: "2026-03-21"
source: official
tags: "apple,metal,command-buffer,batching,submission,indirect-command-buffer,persistent-objects,command-queue,reuse,icb"
---

# Metal Command Buffer Reuse And Batching

Use this page when CPU submission overhead starts to matter as much as the Metal kernel itself.

## First Principle

Apple distinguishes between transient and persistent objects.

Persistent objects should be created early and reused:

- `MTLDevice`
- `MTLCommandQueue`
- buffers
- textures
- pipeline states

Command buffers themselves are transient single-use objects.

That means:

- reuse pipeline and resource objects
- do not try to reuse committed command buffers
- reduce submission overhead by batching work intelligently

## Batching Guidance

Apple's best-practices material emphasizes submitting as few command buffers as practical without starving the GPU.

This usually means:

- group related work into fewer submissions
- avoid over-fragmenting compute work into many tiny command buffers
- profile CPU/GPU overlap before changing submission policy

## When Indirect Command Buffers Matter

Apple documents indirect command buffers (ICBs) as a way to reduce CPU overhead for repeated command patterns.

Use them when:

- command structure is repeated
- CPU encoding cost is significant
- the workload benefits from reusing encoded command structure

Do not reach for ICBs before validating that ordinary command submission is actually the bottleneck.

## Common Failure Modes

- wrapper code "optimizes" by caching the wrong objects and leaves command-buffer churn untouched
- work is split into too many tiny submissions
- submission count is reduced blindly and introduces dependency or latency issues
- ICB complexity is introduced before measuring CPU encoding cost

## Official Source Links (Fact Check)

- Command Organization and Execution Model: https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Cmd-Submiss/Cmd-Submiss.html
- Metal Best Practices Guide: Persistent Objects: https://developer.apple.com/library/archive/documentation/3DDrawing/Conceptual/MTLBestPracticesGuide/PersistentObjects.html
- Metal Best Practices Guide: Command Buffers: https://developer.apple.com/library/archive/documentation/3DDrawing/Conceptual/MTLBestPracticesGuide/CommandBuffers.html
- Encoding indirect command buffers on the CPU: https://developer.apple.com/documentation/metal/encoding-indirect-command-buffers-on-the-cpu

Last cross-check date: 2026-03-21
93 changes: 93 additions & 0 deletions content/apple/docs/metal-compute-launch-patterns/DOC.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
---
name: metal-compute-launch-patterns
description: "Apple Metal compute launch patterns: MTLDevice, pipeline creation, buffers, encoders, and dispatch sizing."
metadata:
languages: "cpp"
versions: "4.0"
revision: 1
updated-on: "2026-03-21"
source: official
tags: "apple,metal,compute,mtldevice,mtlcommandqueue,mtlcommandbuffer,mtlcomputecommandencoder,dispatchthreads,threadsperthreadgroup,metallib"
---

# Metal Compute Launch Patterns

Use this page for the host-side structure of launching Metal compute work: device discovery, pipeline creation, resource binding, and dispatch.

## Core Host Objects

The standard compute path is built around:

- `MTLDevice`: the GPU device handle
- `MTLCommandQueue`: source of command buffers
- `MTLCommandBuffer`: unit of submitted GPU work
- `MTLComputePipelineState`: compiled compute kernel state
- `MTLComputeCommandEncoder`: binds resources and dispatches a compute kernel

## Minimal Host Flow

1. Get a `MTLDevice`.
2. Create or load a library containing the kernel function.
3. Build a `MTLComputePipelineState` from the kernel.
4. Allocate buffers/textures.
5. Create a command buffer and compute encoder.
6. Bind resources with `setBuffer`, `setTexture`, and related APIs.
7. Dispatch threads or threadgroups.
8. End encoding, commit the command buffer, and wait only when the CPU truly needs completion.

## Dispatch Sizing Rule

There are two separate choices:

- total work size: how many threads should run overall
- threadgroup size: how many threads cooperate locally

The host must choose both consistently with the kernel's indexing logic and any threadgroup-memory usage.

## Practical Example Shape

```cpp
id<MTLCommandBuffer> cb = [queue commandBuffer];
id<MTLComputeCommandEncoder> enc = [cb computeCommandEncoder];

[enc setComputePipelineState:pso];
[enc setBuffer:inBuffer offset:0 atIndex:0];
[enc setBuffer:outBuffer offset:0 atIndex:1];

MTLSize grid = MTLSizeMake(n, 1, 1);
MTLSize tpg = MTLSizeMake(256, 1, 1);
[enc dispatchThreads:grid threadsPerThreadgroup:tpg];

[enc endEncoding];
[cb commit];
```

## Buffer And Binding Discipline

- buffer index values must match kernel `[[buffer(i)]]` attributes
- host-side buffer sizes must cover the kernel's full access range
- threadgroup memory declarations require matching dispatch assumptions
- command-buffer completion only guarantees GPU completion for that buffer, not correctness of your indexing logic

## Common Failure Modes

- Binding order does not match kernel buffer indices.
- Dispatch shape changes but kernel index math is left unchanged.
- Threadgroup size exceeds hardware or pipeline limits.
- CPU waits on every command buffer and destroys overlap unnecessarily.
- Library and pipeline creation are done inside a hot loop instead of being cached

## Profiling And Debugging Guidance

- Use Xcode Metal debugging tools to inspect resource bindings and dispatch layout.
- Use runtime validation to catch invalid API usage early.
- Treat incorrect output and poor throughput separately: one is often indexing or binding, the other is often sizing or memory behavior.

## Official Source Links (Fact Check)

- Performing calculations on a GPU: https://developer.apple.com/documentation/metal/performing-calculations-on-a-gpu
- Compute passes: https://developer.apple.com/documentation/metal/compute-passes
- Metal Programming Guide (archive): https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Compute-Ctx/Compute-Ctx.html
- Metal developer tools: https://developer.apple.com/metal/tools/

Last cross-check date: 2026-03-21
Loading