andrewyng · qhy991 · Mar 20, 2026 · Mar 20, 2026 · Mar 21, 2026 · Mar 21, 2026
diff --git a/content/apple/docs/metal-argument-buffers-and-residency/DOC.md b/content/apple/docs/metal-argument-buffers-and-residency/DOC.md
@@ -0,0 +1,80 @@
+---
+name: metal-argument-buffers-and-residency
+description: "Apple Metal argument buffer patterns: encoding resource tables, residency requirements, and useResource or useHeap rules for compute workloads."
+metadata:
+  languages: "cpp"
+  versions: "4.0"
+  revision: 1
+  updated-on: "2026-03-21"
+  source: official
+  tags: "apple,metal,argument-buffers,residency,useresource,useheap,resource-binding,indirect-access,compute,mtlargumentencoder"
+---
+
+# Metal Argument Buffers And Residency
+
+Use this page when a Metal compute kernel needs many resources, indirect resource access, or lower CPU-side binding overhead.
+
+## What Argument Buffers Solve
+
+Argument buffers let you encode resource references into a buffer-backed table instead of rebinding many individual buffers or textures for every dispatch.
+
+This is useful when:
+
+- one kernel reads many buffers or textures
+- the set of resources changes per dispatch
+- the resource table is reused across many dispatches
+- GPU-driven or indirection-heavy workflows need resource tables
+
+## Core Host-Side Pattern
+
+The usual structure is:
+
+1. create an `MTLArgumentEncoder`
+2. allocate a backing `MTLBuffer`
+3. encode resource references into that buffer
+4. bind the argument buffer to the kernel
+5. make indirect resources resident before dispatch
+
+The last step is the part people miss most often.
+
+## Residency Rules Matter
+
+If a kernel reaches resources through an argument buffer, those resources must be resident for the duration of the compute pass.
+
+In practice this means:
+
+- call `useResource(_:usage:)` for resources reached indirectly through an argument buffer
+- call `useHeap(_:)` when residency is managed through a heap
+- do this before the encoded dispatch that consumes those resources
+
+If you bind a resource directly to a kernel argument, you do not need the extra residency call for that direct binding path.
+
+## Good Usage Pattern
+
+- keep the argument buffer layout stable across many dispatches
+- separate "table rebuild" work from "per-dispatch scalar parameter" work
+- prefer argument buffers when the binding count is the CPU bottleneck
+- keep residency calls explicit and near the dispatch site
+
+## Common Failure Modes
+
+- a resource is encoded into the argument buffer but never made resident
+- the argument buffer is updated but a stale resource table is still reused
+- CPU-side code mutates resident resources during the compute pass
+- argument buffers are introduced for tiny fixed-bind workloads that did not need them
+
+## Review Checklist
+
+- Does the kernel really access resources indirectly?
+- Are all indirectly referenced buffers or textures marked resident?
+- Is the argument buffer rebuilt only when the table actually changes?
+- Is the performance goal CPU submission overhead rather than kernel ALU time?
+
+## Official Source Links (Fact Check)
+
+- Compute passes: https://developer.apple.com/documentation/metal/compute-passes
+- Improving CPU performance by using argument buffers: https://developer.apple.com/documentation/metal/improving-cpu-performance-by-using-argument-buffers
+- `useResource(_:usage:)`: https://developer.apple.com/documentation/metal/mtlcomputecommandencoder/useresource%28_%3Ausage%3A%29
+- Metal shader converter binding model and argument buffer notes: https://developer.apple.com/metal/shader-converter/
+
+Last cross-check date: 2026-03-21
diff --git a/content/apple/docs/metal-broadcast-kernel-patterns/DOC.md b/content/apple/docs/metal-broadcast-kernel-patterns/DOC.md
@@ -0,0 +1,56 @@
+---
+name: metal-broadcast-kernel-patterns
+description: "Apple Metal broadcast kernel patterns: scalar or vector expansion, shape alignment, masked edges, and correctness checks for broadcast-heavy compute code."
+metadata:
+  languages: "cpp"
+  versions: "4.0"
+  revision: 1
+  updated-on: "2026-03-21"
+  source: official
+  tags: "apple,metal,broadcast,shape-alignment,elementwise,tensor-shapes,masked-edges,compute"
+---
+
+# Metal Broadcast Kernel Patterns
+
+Use this page when a Metal kernel combines tensors of different logical shapes using broadcast semantics.
+
+## Why Broadcast Bugs Are Common
+
+Broadcast kernels look simple because the math is usually elementwise.
+
+The real complexity is shape alignment:
+
+- which dimensions are expanded
+- which dimensions are equal
+- whether one operand is scalar, row-wise, column-wise, or channel-wise
+
+If the shape contract is vague, the kernel may appear correct on square or fully dense cases while failing on realistic shapes.
+
+## Safe Baseline Pattern
+
+- align shapes explicitly in host code before launch
+- pass logical sizes needed for each broadcasted dimension
+- write one straightforward reference kernel
+- test cases where only one dimension is broadcast, then several
+
+## What To Verify
+
+- broadcasted dimensions reuse the intended source index
+- non-broadcasted dimensions advance normally
+- output shape is derived from the broadcast rule, not copied from one operand blindly
+- masked tails or rounded dispatches cannot write past the real output shape
+
+## Common Failure Modes
+
+- one dimension is broadcast on the host side but advanced in the kernel
+- scalar and vector broadcast paths behave differently
+- the kernel passes tests where all dimensions match, hiding broken broadcast logic
+- the output allocation follows one input shape instead of the broadcasted result shape
+
+## Official Source Links (Fact Check)
+
+- Compute passes: https://developer.apple.com/documentation/metal/compute-passes
+- Calculating threadgroup and grid sizes: https://developer.apple.com/documentation/metal/calculating-threadgroup-and-grid-sizes
+- Metal Shading Language Specification: https://developer.apple.com/metal/resources/
+
+Last cross-check date: 2026-03-21
diff --git a/content/apple/docs/metal-buffer-layout-and-alignment/DOC.md b/content/apple/docs/metal-buffer-layout-and-alignment/DOC.md
@@ -0,0 +1,76 @@
+---
+name: metal-buffer-layout-and-alignment
+description: "Apple Metal buffer layout and alignment: resource sizing, texture-from-buffer alignment, and host/kernel layout discipline."
+metadata:
+  languages: "cpp"
+  versions: "4.0"
+  revision: 1
+  updated-on: "2026-03-21"
+  source: official
+  tags: "apple,metal,buffer,alignment,layout,mtlbuffer,minimumtexturebufferalignment,minimumlineartexturealignment,bytesperrow,heap-size-align"
+---
+
+# Metal Buffer Layout And Alignment
+
+Use this page when host-side data layout, buffer sizing, or buffer-backed texture creation is part of the kernel path.
+
+## Why This Matters
+
+Many Metal failures that look like "bad math" are actually layout problems:
+
+- host structs and shader expectations do not match
+- offsets are aligned incorrectly
+- buffer-backed textures use invalid row pitch or offset values
+- heap sizing and alignment are estimated incorrectly
+
+## Layout Discipline
+
+- define host and kernel-visible struct layouts explicitly
+- keep element size, stride, and offset calculations centralized
+- treat texture-from-buffer paths as alignment-sensitive, not as generic byte blobs
+
+## Alignment APIs Apple Exposes
+
+Apple documents alignment helpers on `MTLDevice`, including:
+
+- `minimumTextureBufferAlignment(for:)`
+- `minimumLinearTextureAlignment(for:)`
+- `heapBufferSizeAndAlign(length:options:)`
+- `heapTextureSizeAndAlign(descriptor:)`
+
+Use these APIs instead of guessing alignment from prior hardware experience.
+
+## Buffer-Backed Texture Rules
+
+When creating textures from buffers, values such as:
+
+- buffer offset
+- bytes per row
+- pixel format alignment
+
+must satisfy the device's documented alignment constraints.
+
+Apple's documentation explicitly ties alignment values to texture creation parameters.
+
+## Common Failure Modes
+
+- Buffer length is correct in bytes, but element stride is wrong.
+- Struct fields are logically correct but host/kernel padding expectations differ.
+- Texture buffer offset is not aligned to the device minimum.
+- `bytesPerRow` is computed from logical width only and ignores required alignment.
+- Heap size estimates ignore size-and-align APIs and under-allocate.
+
+## Safe Practice
+
+1. Compute all offsets from element-size and alignment helpers.
+2. Reuse one layout definition across host and shader-facing code.
+3. Validate resource creation parameters before debugging kernel math.
+
+## Official Source Links (Fact Check)
+
+- `minimumTextureBufferAlignment(for:)`: https://developer.apple.com/documentation/metal/mtldevice/minimumtexturebufferalignment%28for%3A%29
+- `minimumLinearTextureAlignment(for:)`: https://developer.apple.com/documentation/metal/mtldevice/minimumlineartexturealignment%28for%3A%29
+- `heapBufferSizeAndAlign(length:options:)`: https://developer.apple.com/documentation/metal/mtldevice/heapbuffersizeandalign%28length%3Aoptions%3A%29
+- `bufferBytesPerRow`: https://developer.apple.com/documentation/metal/mtltexture/bufferbytesperrow
+
+Last cross-check date: 2026-03-21
diff --git a/content/apple/docs/metal-command-buffer-reuse-and-batching/DOC.md b/content/apple/docs/metal-command-buffer-reuse-and-batching/DOC.md
@@ -0,0 +1,73 @@
+---
+name: metal-command-buffer-reuse-and-batching
+description: "Apple Metal command buffer reuse and batching guidance: transient versus persistent objects, submission frequency, and indirect command buffer tradeoffs."
+metadata:
+  languages: "cpp"
+  versions: "4.0"
+  revision: 1
+  updated-on: "2026-03-21"
+  source: official
+  tags: "apple,metal,command-buffer,batching,submission,indirect-command-buffer,persistent-objects,command-queue,reuse,icb"
+---
+
+# Metal Command Buffer Reuse And Batching
+
+Use this page when CPU submission overhead starts to matter as much as the Metal kernel itself.
+
+## First Principle
+
+Apple distinguishes between transient and persistent objects.
+
+Persistent objects should be created early and reused:
+
+- `MTLDevice`
+- `MTLCommandQueue`
+- buffers
+- textures
+- pipeline states
+
+Command buffers themselves are transient single-use objects.
+
+That means:
+
+- reuse pipeline and resource objects
+- do not try to reuse committed command buffers
+- reduce submission overhead by batching work intelligently
+
+## Batching Guidance
+
+Apple's best-practices material emphasizes submitting as few command buffers as practical without starving the GPU.
+
+This usually means:
+
+- group related work into fewer submissions
+- avoid over-fragmenting compute work into many tiny command buffers
+- profile CPU/GPU overlap before changing submission policy
+
+## When Indirect Command Buffers Matter
+
+Apple documents indirect command buffers (ICBs) as a way to reduce CPU overhead for repeated command patterns.
+
+Use them when:
+
+- command structure is repeated
+- CPU encoding cost is significant
+- the workload benefits from reusing encoded command structure
+
+Do not reach for ICBs before validating that ordinary command submission is actually the bottleneck.
+
+## Common Failure Modes
+
+- wrapper code "optimizes" by caching the wrong objects and leaves command-buffer churn untouched
+- work is split into too many tiny submissions
+- submission count is reduced blindly and introduces dependency or latency issues
+- ICB complexity is introduced before measuring CPU encoding cost
+
+## Official Source Links (Fact Check)
+
+- Command Organization and Execution Model: https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Cmd-Submiss/Cmd-Submiss.html
+- Metal Best Practices Guide: Persistent Objects: https://developer.apple.com/library/archive/documentation/3DDrawing/Conceptual/MTLBestPracticesGuide/PersistentObjects.html
+- Metal Best Practices Guide: Command Buffers: https://developer.apple.com/library/archive/documentation/3DDrawing/Conceptual/MTLBestPracticesGuide/CommandBuffers.html
+- Encoding indirect command buffers on the CPU: https://developer.apple.com/documentation/metal/encoding-indirect-command-buffers-on-the-cpu
+
+Last cross-check date: 2026-03-21
diff --git a/content/apple/docs/metal-compute-launch-patterns/DOC.md b/content/apple/docs/metal-compute-launch-patterns/DOC.md
@@ -0,0 +1,93 @@
+---
+name: metal-compute-launch-patterns
+description: "Apple Metal compute launch patterns: MTLDevice, pipeline creation, buffers, encoders, and dispatch sizing."
+metadata:
+  languages: "cpp"
+  versions: "4.0"
+  revision: 1
+  updated-on: "2026-03-21"
+  source: official
+  tags: "apple,metal,compute,mtldevice,mtlcommandqueue,mtlcommandbuffer,mtlcomputecommandencoder,dispatchthreads,threadsperthreadgroup,metallib"
+---
+
+# Metal Compute Launch Patterns
+
+Use this page for the host-side structure of launching Metal compute work: device discovery, pipeline creation, resource binding, and dispatch.
+
+## Core Host Objects
+
+The standard compute path is built around:
+
+- `MTLDevice`: the GPU device handle
+- `MTLCommandQueue`: source of command buffers
+- `MTLCommandBuffer`: unit of submitted GPU work
+- `MTLComputePipelineState`: compiled compute kernel state
+- `MTLComputeCommandEncoder`: binds resources and dispatches a compute kernel
+
+## Minimal Host Flow
+
+1. Get a `MTLDevice`.
+2. Create or load a library containing the kernel function.
+3. Build a `MTLComputePipelineState` from the kernel.
+4. Allocate buffers/textures.
+5. Create a command buffer and compute encoder.
+6. Bind resources with `setBuffer`, `setTexture`, and related APIs.
+7. Dispatch threads or threadgroups.
+8. End encoding, commit the command buffer, and wait only when the CPU truly needs completion.
+
+## Dispatch Sizing Rule
+
+There are two separate choices:
+
+- total work size: how many threads should run overall
+- threadgroup size: how many threads cooperate locally
+
+The host must choose both consistently with the kernel's indexing logic and any threadgroup-memory usage.
+
+## Practical Example Shape
+
+```cpp
+id<MTLCommandBuffer> cb = [queue commandBuffer];
+id<MTLComputeCommandEncoder> enc = [cb computeCommandEncoder];
+
+[enc setComputePipelineState:pso];
+[enc setBuffer:inBuffer offset:0 atIndex:0];
+[enc setBuffer:outBuffer offset:0 atIndex:1];
+
+MTLSize grid = MTLSizeMake(n, 1, 1);
+MTLSize tpg  = MTLSizeMake(256, 1, 1);
+[enc dispatchThreads:grid threadsPerThreadgroup:tpg];
+
+[enc endEncoding];
+[cb commit];
+```
+
+## Buffer And Binding Discipline
+
+- buffer index values must match kernel `[[buffer(i)]]` attributes
+- host-side buffer sizes must cover the kernel's full access range
+- threadgroup memory declarations require matching dispatch assumptions
+- command-buffer completion only guarantees GPU completion for that buffer, not correctness of your indexing logic
+
+## Common Failure Modes
+
+- Binding order does not match kernel buffer indices.
+- Dispatch shape changes but kernel index math is left unchanged.
+- Threadgroup size exceeds hardware or pipeline limits.
+- CPU waits on every command buffer and destroys overlap unnecessarily.
+- Library and pipeline creation are done inside a hot loop instead of being cached
+
+## Profiling And Debugging Guidance
+
+- Use Xcode Metal debugging tools to inspect resource bindings and dispatch layout.
+- Use runtime validation to catch invalid API usage early.
+- Treat incorrect output and poor throughput separately: one is often indexing or binding, the other is often sizing or memory behavior.
+
+## Official Source Links (Fact Check)
+
+- Performing calculations on a GPU: https://developer.apple.com/documentation/metal/performing-calculations-on-a-gpu
+- Compute passes: https://developer.apple.com/documentation/metal/compute-passes
+- Metal Programming Guide (archive): https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Compute-Ctx/Compute-Ctx.html
+- Metal developer tools: https://developer.apple.com/metal/tools/
+
+Last cross-check date: 2026-03-21