[LOW] Performance: Workgroup size tuning for branched flow #37

## Summary
Benchmark different workgroup sizes (64, 128, 256) for the branched flow trace compute shader using the GPU timestamp profiler.

## Current Behavior
- Trace pass uses `@workgroup_size(64)` (hardcoded in WGSL)
- Clear pass uses `@workgroup_size(16, 16)` (256 total)
- No benchmarking data on optimal workgroup size for this workload
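
Changing the trace pass workgroup size also changes how many workgroups the host must dispatch, so the dispatch arithmetic has to track the shader attribute. The following is a minimal sketch of that arithmetic, not the project's actual dispatch code: the function names and the `item_count`, `width`, and `height` parameters are hypothetical, and it assumes the trace pass runs one invocation per item in a 1D dispatch while the clear pass covers a 2D target in 16×16 tiles.

```rust
/// Number of workgroups needed so that every item gets one invocation.
/// Rounds up, so the last workgroup may be only partially used and the
/// shader needs a bounds check on its global invocation index.
/// `item_count` is a hypothetical name for whatever the trace pass iterates over.
fn trace_workgroup_count(item_count: u32, workgroup_size: u32) -> u32 {
    assert!(workgroup_size > 0, "workgroup size must be non-zero");
    item_count.div_ceil(workgroup_size)
}

/// Workgroup counts (x, y) for a 2D pass that uses `@workgroup_size(16, 16)`,
/// assuming one invocation per texel of a `width` x `height` target.
fn clear_workgroup_count(width: u32, height: u32) -> (u32, u32) {
    const TILE: u32 = 16;
    (width.div_ceil(TILE), height.div_ceil(TILE))
}
```

With this shape, doubling the workgroup size roughly halves the number of workgroups while the total invocation count stays about the same; only the rounding waste in the final workgroup changes.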

## Proposed Change
1. Test workgroup sizes 64, 128, and 256 for the trace pass
2. Measure per-pass GPU time using the `GpuProfiler` (already implemented)
3. Set the default to whichever performs best on representative hardware
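4. Consider making the workgroup size configurable via a compile-time constant (for example a WGSL `const`, or a pipeline-overridable `override`) instead of a hardcoded literal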
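
The steps above can be sketched as a small host-side harness: generate one shader variant per candidate size, collect per-pass GPU times for each, and pick the size with the lowest median. This is only an illustration under stated assumptions. All function names here are hypothetical, the textual substitution assumes the trace entry point contains the literal `@workgroup_size(64)` exactly once, and the timing samples are assumed to come from the existing `GpuProfiler`, whose API is not shown here.

```rust
/// Candidate workgroup sizes from the proposal.
const CANDIDATE_SIZES: [u32; 3] = [64, 128, 256];

/// Returns the WGSL source with the trace pass attribute rewritten to `size`.
/// Returns `None` unless the literal `@workgroup_size(64)` occurs exactly once,
/// so an unexpected shader layout fails loudly instead of silently.
fn with_workgroup_size(wgsl: &str, size: u32) -> Option<String> {
    const NEEDLE: &str = "@workgroup_size(64)";
    if wgsl.matches(NEEDLE).count() != 1 {
        return None;
    }
    Some(wgsl.replacen(NEEDLE, &format!("@workgroup_size({size})"), 1))
}

/// One (size, source) pair per candidate size.
fn shader_variants(wgsl: &str) -> Vec<(u32, String)> {
    CANDIDATE_SIZES
        .iter()
        .filter_map(|&size| with_workgroup_size(wgsl, size).map(|src| (size, src)))
        .collect()
}

/// Median of the timing samples in milliseconds, or `None` if there are none.
/// The median is less sensitive to warm-up frames and outliers than the mean.
fn median_ms(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Picks the workgroup size whose samples have the lowest median GPU time.
/// Sizes with no samples are ignored.
fn best_size(results: &[(u32, Vec<f64>)]) -> Option<u32> {
    results
        .iter()
        .filter_map(|(size, samples)| median_ms(samples).map(|m| (*size, m)))
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(size, _)| size)
}
```

Each variant would be compiled into its own pipeline and dispatched with a workgroup count matching its size. Note that 256 is also the default WebGPU limit for invocations per workgroup, so testing anything larger would require requesting higher device limits.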

## Acceptance Criteria
- [ ] Benchmark data for 3 workgroup sizes on at least one GPU
- [ ] Default updated to empirically best size
- [ ] All existing tests pass (including WGSL validation)

## Context
This issue comes from deep review plan item 4.2. It requires interactive benchmarking with the GPU profiler. The optimal size depends on GPU architecture (occupancy, register pressure), so results from one GPU may not transfer to another. Note: WSLg currently uses OpenGL ES software rendering (no Vulkan), so benchmark on native hardware or with GPU passthrough to get meaningful results.

**Files**: `src/render/shaders/branched_flow_compute.wgsl`
