Summary
Reduce global atomic contention in the branched flow trace pass by using workgroup-local shared memory for deposits, then flushing to global memory cooperatively.
Current Behavior
- Each ray deposits directly to the global caustic_texture via atomicAdd
- At high ray counts, many rays hit the same pixel, causing contention
- Workgroup size is 64 threads
Proposed Change
- Allocate a workgroup-local tile in shared memory
- Each thread deposits to the local tile first
- After the ray loop, cooperatively flush the tile to global memory via atomicAdd
- Deposits outside the tile footprint fall back to direct global atomic
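
The steps above could be sketched roughly as follows. This is a minimal illustration, not the existing shader. The 16x16 tile size, the Params struct, the binding layout, the caustic buffer name, the deposit helper, and the mapping from workgroup index to tile origin are all assumptions made for the example. It also assumes deposits are accumulated as u32 fixed-point values in a storage buffer, since WGSL atomics operate on atomic<u32> / atomic<i32>.

```wgsl
// Hypothetical sketch: every name, size, and binding here is an assumption.
const TILE_W: u32 = 16u;
const TILE_H: u32 = 16u;
const TILE_SIZE: u32 = TILE_W * TILE_H;
const WG_SIZE: u32 = 64u; // matches the current workgroup size

struct Params {
    width: u32,
    height: u32,
}

@group(0) @binding(0) var<uniform> params: Params;
// Stand-in for the global caustic accumulation target.
@group(0) @binding(1) var<storage, read_write> caustic: array<atomic<u32>>;

// Workgroup-local tile. WGSL zero-initializes workgroup variables
// at the start of each workgroup, so no explicit clear is needed.
var<workgroup> tile: array<atomic<u32>, TILE_SIZE>;

// Deposit into the local tile when the pixel is inside the tile footprint,
// otherwise fall back to a direct global atomic.
fn deposit(px: vec2<i32>, tile_origin: vec2<i32>, amount: u32) {
    if (px.x < 0 || px.y < 0 ||
        px.x >= i32(params.width) || px.y >= i32(params.height)) {
        return; // outside the image
    }
    let local = px - tile_origin;
    if (local.x >= 0 && local.y >= 0 &&
        local.x < i32(TILE_W) && local.y < i32(TILE_H)) {
        atomicAdd(&tile[u32(local.y) * TILE_W + u32(local.x)], amount);
    } else {
        atomicAdd(&caustic[u32(px.y) * params.width + u32(px.x)], amount);
    }
}

@compute @workgroup_size(WG_SIZE)
fn trace(@builtin(local_invocation_index) lid: u32,
         @builtin(workgroup_id) wg: vec3<u32>) {
    // Assumed mapping: one tile per workgroup, laid out row-major.
    let tiles_per_row = (params.width + TILE_W - 1u) / TILE_W;
    let tile_origin = vec2<i32>(
        i32((wg.x % tiles_per_row) * TILE_W),
        i32((wg.x / tiles_per_row) * TILE_H));

    // Ray loop goes here; each step calls deposit(px, tile_origin, amount).

    // Wait until every thread in the workgroup has finished depositing.
    workgroupBarrier();

    // Cooperative flush: each thread handles a strided slice of the tile
    // and issues at most one global atomic per non-empty tile pixel.
    for (var i = lid; i < TILE_SIZE; i += WG_SIZE) {
        let v = atomicLoad(&tile[i]);
        let px = tile_origin + vec2<i32>(i32(i % TILE_W), i32(i / TILE_W));
        if (v != 0u && px.x < i32(params.width) && px.y < i32(params.height)) {
            atomicAdd(&caustic[u32(px.y) * params.width + u32(px.x)], v);
        }
    }
}
```

Skipping empty tile pixels in the flush keeps the global atomic count proportional to the pixels actually hit. How the tile origin is chosen per workgroup is the main open design question, since it determines how often the fallback path is taken.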
Complexity Note
Medium-high complexity. Rays spread during propagation, so many deposits will fall outside the workgroup's tile region. For divergent ray patterns the fallback path (direct global atomic) may dominate, which limits the benefit. Benchmark with the GPU timestamp profiler (already implemented) before and after the change.
Acceptance Criteria
Context
From deep review plan item 3.4. Potential 64x reduction in global atomic operations for rays that stay within the tile. Actual benefit depends on ray divergence patterns.
Files: src/render/shaders/branched_flow_compute.wgsl