
[LOW] Performance: Per-workgroup shared memory deposits #36

@pjt222

Summary

Reduce global atomic contention in the branched flow trace pass by using workgroup-local shared memory for deposits, then flushing to global memory cooperatively.

Current Behavior

  • Each ray deposits directly to global caustic_texture via atomicAdd
  • At high ray counts, many rays hit the same pixel, causing contention
  • Workgroup size is 64 threads
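
For reference, the current deposit path presumably looks something like the sketch below. Binding indices, `TEX_WIDTH`, and the buffer layout are assumptions; since WGSL `atomicAdd` operates on storage-buffer atomics rather than textures, `caustic_texture` is likely backed by an `array<atomic<u32>>`:

```wgsl
// Assumed layout: caustic deposits accumulated in a storage buffer of
// atomics (WGSL atomicAdd cannot target a texture directly).
@group(0) @binding(0) var<storage, read_write> caustic_texture: array<atomic<u32>>;

const TEX_WIDTH: u32 = 1024u;  // assumed texture width

fn deposit(pixel: vec2<u32>, amount: u32) {
    // Every ray, every step: one global atomic per deposit.
    // Rays converging on the same pixel serialize on this atomic.
    atomicAdd(&caustic_texture[pixel.y * TEX_WIDTH + pixel.x], amount);
}
```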

Proposed Change

  1. Allocate a workgroup-local tile in shared memory
  2. Each thread deposits to the local tile first
  3. After the ray loop, cooperatively flush the tile to global memory via atomicAdd
  4. Deposits outside the tile footprint fall back to direct global atomic
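
The four steps above could be sketched roughly as follows. Tile size, function names, and `TEX_WIDTH` are assumptions, not the final design; a 16x16 tile is one plausible choice that fits comfortably in shared memory:

```wgsl
// Sketch only: names and tile size are illustrative assumptions.
const TILE_DIM: u32 = 16u;  // 16x16 texel tile per workgroup (assumed)
const TEX_WIDTH: u32 = 1024u;  // assumed texture width

// Step 1: workgroup-local tile in shared memory.
var<workgroup> tile: array<atomic<u32>, TILE_DIM * TILE_DIM>;

// Steps 2 and 4: deposit locally if inside the tile, else fall back
// to a direct global atomic.
fn deposit(pixel: vec2<u32>, tile_origin: vec2<u32>, amount: u32) {
    let local = vec2<i32>(pixel) - vec2<i32>(tile_origin);
    if (local.x >= 0 && local.y >= 0 &&
        local.x < i32(TILE_DIM) && local.y < i32(TILE_DIM)) {
        // In-tile: cheap shared-memory atomic.
        atomicAdd(&tile[u32(local.y) * TILE_DIM + u32(local.x)], amount);
    } else {
        // Out-of-tile fallback: direct global atomic.
        atomicAdd(&caustic_texture[pixel.y * TEX_WIDTH + pixel.x], amount);
    }
}

// Step 3: after the ray loop, cooperatively flush the tile.
// 64 threads stride over the 256 tile texels; zero texels are skipped.
fn flush_tile(local_index: u32, tile_origin: vec2<u32>) {
    workgroupBarrier();
    for (var i = local_index; i < TILE_DIM * TILE_DIM; i += 64u) {
        let v = atomicLoad(&tile[i]);
        if (v != 0u) {
            let p = tile_origin + vec2<u32>(i % TILE_DIM, i / TILE_DIM);
            atomicAdd(&caustic_texture[p.y * TEX_WIDTH + p.x], v);
        }
    }
}
```

With this shape, a pixel hit by many rays in the same workgroup costs many shared-memory atomics but only one global atomic at flush time.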

Complexity Note

Medium-high complexity. Rays spread during propagation, so many deposits will fall outside the workgroup's tile region. The fallback path (direct global atomic) may dominate for divergent ray patterns, limiting the benefit. Benchmark with the GPU timestamp profiler (already implemented) before and after.

Acceptance Criteria

  • Shared memory tile allocation in WGSL compute shader
  • Cooperative flush to global memory after ray loop
  • Fallback to global atomic for out-of-tile deposits
  • GPU timing comparison showing measurable improvement
  • All existing tests pass

Context

From deep review plan item 3.4. Potential 64x reduction in global atomic operations for rays that stay within the tile. Actual benefit depends on ray divergence patterns.

Files: src/render/shaders/branched_flow_compute.wgsl

Metadata

Assignees: no one assigned
Labels: enhancement (New feature or request), performance (Performance optimizations)
Projects: no projects
Milestone: no milestone
Relationships: none yet
Development: no branches or pull requests
