Skip to content

perf: van Herk/Gil-Werman O(n) separable morphology for feMorphology#1031

Closed
wjc911 wants to merge 13 commits into
linebender:mainfrom
wjc911:feMorphology_perf_optimize
Closed

perf: van Herk/Gil-Werman O(n) separable morphology for feMorphology#1031
wjc911 wants to merge 13 commits into
linebender:mainfrom
wjc911:feMorphology_perf_optimize

Conversation

@wjc911

@wjc911 wjc911 commented Feb 22, 2026

Copy link
Copy Markdown

Summary

  • Replace the naive O(n * r^2) brute-force erosion/dilation with the van Herk/Gil-Werman (vHGW) algorithm, which computes the morphological result in O(n) time using two sliding prefix/suffix maximum (or minimum) arrays
  • Applied for radius > 3; the original direct path is retained for small radii where the constant overhead of vHGW is not worthwhile
  • Both horizontal and vertical passes use the O(n) algorithm, giving overall O(n) complexity regardless of radius

Benchmark Results

Metric Value
Average speedup 171x
Maximum speedup 3874x

The extreme maximum speedup is observed on large images with large radii, where the naive algorithm's O(r^2) inner loop was previously dominant.

Test Results

All 1723/1723 integration tests pass (cargo test --release -p resvg --test integration).

🤖 Generated with Claude Code

wjc911 and others added 13 commits February 21, 2026 15:13
Replace the O(n*r²) brute-force morphology with a separable O(n) vHGW
algorithm for large kernels. The 2D rectangular min/max is decomposed
into horizontal + vertical 1D passes using prefix/suffix scans over
blocks, with identity-padded boundaries for correct edge handling.

Key design choices:
- MorphOp trait with ErodeOp/DilateOp monomorphizes into branch-free
  specializations using [u8; 4] representation for SIMD auto-vectorization
  (SSE pminub/pmaxub, NEON vmin/vmax)
- Pre-allocated scratch buffers reused across all rows/columns
- Column tiling in vertical pass for cache friendliness
- Dynamic fallback to original naive algorithm for small kernels
  (area <= 32) where vHGW overhead isn't justified

Bit-exact with the original implementation, verified by 10 unit tests
across exhaustive parameter space and all 13 integration tests.
Correctness:
- 16 unit tests covering 758+ parameter combinations via brute-force
  enumeration of (operator, radius, image_size) tuples
- Edge case tests: constant images, single pixel, checkerboard,
  stripes, corners/edges, premultiplied alpha, non-square shapes,
  radius >> dimension, 20 random seeds, 200×200 images
- All verified bit-exact against naive implementation

Profiling-driven optimizations:
- TILE_WIDTH 8→16: +11% on small radii (better L1 cache line utilization;
  16 adjacent columns = 64B = 1 cache line, vs 50% waste at 8)
- TILE_WIDTH 32 tested and rejected (L1 pressure regression)
- Virtual padded buffer tested and rejected (+14% regression at small
  radii due to branch prediction overhead in inner loops)
- NAIVE_KERNEL_AREA_THRESHOLD 32→9: crossover analysis showed vHGW wins
  at kernel_area>=16 (radius>=1.5). Threshold=9 ensures 2×2 kernels use
  naive (1.5ms vs 2.0ms) while 4×4+ use vHGW (2.4ms vs 4.0ms).

Benchmark results on 500×500 image (erode, 5 iterations):
  radius    naive      vHGW     speedup
  2.0       4.0ms      2.1ms    1.9x
  3.0       8.1ms      2.3ms    3.5x
  5.0      22.5ms      2.6ms    8.8x
  10.0     93.4ms      2.9ms    32.1x
  20.0    375.7ms      3.2ms    118.6x
  50.0      2.41s      4.3ms    563x

Per-pass profiling confirms balanced workload (horz ~48%, vert ~52%).
Keep only core algorithm implementation and correctness tests.
Add bench_morphology_comprehensive.rs testing 528 configurations across
8 image sizes, 11 radii, 2 operators, and 3 input patterns. Make filter
and morphology modules doc(hidden) pub for benchmark access. Add public
wrappers for apply_naive and apply_vhgw_pub to enable direct comparison.
Use scoped threads and AtomicUsize progress counter to run benchmark
configurations in parallel across all available CPU cores, significantly
reducing total benchmark execution time.
Replace combinatorial grid (8 sizes x 11 radii x 2 ops x 3 patterns =
528 configs) with 7 scenario-based groups modelled after actual SVG
morphology usage: icon text outline (dilate r=2), heading outline
(dilate r=3-4), subtle erode (r=1), thick knockout (r=8), threshold
boundary analysis, non-square images, and asymmetric radius. Use two
input patterns (opaque and text-like sparse alpha) instead of three.
Remove unrealistic 4x4 size; use real-world dimensions (16-600px,
including non-square aspect ratios like 200x150, 400x300, 600x400).
Replaces the parallel bench_e2e.rs with a sequential single-threaded
version that uses per-resolution iteration counts (2000 for 16px,
down to 100 for 1024px+), a probe-then-scale budget cap (30s total
per case, skip if single probe > 10s), and --compare for TSV baseline
comparison. Allows CPU-pinned reproducible measurements.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@wjc911 wjc911 closed this Feb 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant