perf: van Herk/Gil-Werman O(n) separable morphology for feMorphology by wjc911 · Pull Request #1031 · linebender/resvg

wjc911 · 2026-02-22T18:31:52Z

Summary

Replace the naive O(n * r^2) brute-force erosion/dilation with the van Herk/Gil-Werman (vHGW) algorithm, which computes the morphological result in O(n) time using two sliding prefix/suffix maximum (or minimum) arrays
Applied for radius > 3; the original direct path is retained for small radii where the constant overhead of vHGW is not worthwhile
Both horizontal and vertical passes use the O(n) algorithm, giving overall O(n) complexity regardless of radius

Benchmark Results

Metric	Value
Average speedup	171x
Maximum speedup	3874x

The extreme maximum speedup is observed on large images with large radii, where the naive algorithm's O(r^2) inner loop was previously dominant.

Test Results

All 1723/1723 integration tests pass (cargo test --release -p resvg --test integration).

🤖 Generated with Claude Code

Replace the O(n*r²) brute-force morphology with a separable O(n) vHGW algorithm for large kernels. The 2D rectangular min/max is decomposed into horizontal + vertical 1D passes using prefix/suffix scans over blocks, with identity-padded boundaries for correct edge handling. Key design choices: - MorphOp trait with ErodeOp/DilateOp monomorphizes into branch-free specializations using [u8; 4] representation for SIMD auto-vectorization (SSE pminub/pmaxub, NEON vmin/vmax) - Pre-allocated scratch buffers reused across all rows/columns - Column tiling in vertical pass for cache friendliness - Dynamic fallback to original naive algorithm for small kernels (area <= 32) where vHGW overhead isn't justified Bit-exact with the original implementation, verified by 10 unit tests across exhaustive parameter space and all 13 integration tests.

Correctness: - 16 unit tests covering 758+ parameter combinations via brute-force enumeration of (operator, radius, image_size) tuples - Edge case tests: constant images, single pixel, checkerboard, stripes, corners/edges, premultiplied alpha, non-square shapes, radius >> dimension, 20 random seeds, 200×200 images - All verified bit-exact against naive implementation Profiling-driven optimizations: - TILE_WIDTH 8→16: +11% on small radii (better L1 cache line utilization; 16 adjacent columns = 64B = 1 cache line, vs 50% waste at 8) - TILE_WIDTH 32 tested and rejected (L1 pressure regression) - Virtual padded buffer tested and rejected (+14% regression at small radii due to branch prediction overhead in inner loops) - NAIVE_KERNEL_AREA_THRESHOLD 32→9: crossover analysis showed vHGW wins at kernel_area>=16 (radius>=1.5). Threshold=9 ensures 2×2 kernels use naive (1.5ms vs 2.0ms) while 4×4+ use vHGW (2.4ms vs 4.0ms). Benchmark results on 500×500 image (erode, 5 iterations): radius naive vHGW speedup 2.0 4.0ms 2.1ms 1.9x 3.0 8.1ms 2.3ms 3.5x 5.0 22.5ms 2.6ms 8.8x 10.0 93.4ms 2.9ms 32.1x 20.0 375.7ms 3.2ms 118.6x 50.0 2.41s 4.3ms 563x Per-pass profiling confirms balanced workload (horz ~48%, vert ~52%).

Keep only core algorithm implementation and correctness tests.

Add bench_morphology_comprehensive.rs testing 528 configurations across 8 image sizes, 11 radii, 2 operators, and 3 input patterns. Make filter and morphology modules doc(hidden) pub for benchmark access. Add public wrappers for apply_naive and apply_vhgw_pub to enable direct comparison.

Use scoped threads and AtomicUsize progress counter to run benchmark configurations in parallel across all available CPU cores, significantly reducing total benchmark execution time.

Replace combinatorial grid (8 sizes x 11 radii x 2 ops x 3 patterns = 528 configs) with 7 scenario-based groups modelled after actual SVG morphology usage: icon text outline (dilate r=2), heading outline (dilate r=3-4), subtle erode (r=1), thick knockout (r=8), threshold boundary analysis, non-square images, and asymmetric radius. Use two input patterns (opaque and text-like sparse alpha) instead of three. Remove unrealistic 4x4 size; use real-world dimensions (16-600px, including non-square aspect ratios like 200x150, 400x300, 600x400).

Replaces the parallel bench_e2e.rs with a sequential single-threaded version that uses per-resolution iteration counts (2000 for 16px, down to 100 for 1024px+), a probe-then-scale budget cap (30s total per case, skip if single probe > 10s), and --compare for TSV baseline comparison. Allows CPU-pinned reproducible measurements. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

wjc911 and others added 13 commits February 21, 2026 15:13

Run cargo fmt

536fa89

Remove benchmark and profiling code from morphology.rs

72e5d83

Keep only core algorithm implementation and correctness tests.

bench: parallelize feMorphology benchmark with std::thread::scope

7388c3e

Use scoped threads and AtomicUsize progress counter to run benchmark configurations in parallel across all available CPU cores, significantly reducing total benchmark execution time.

Fix morphology regression: inline naive path, VHGW as cold early-return

d8357b0

Clean up feMorphology optimization code

f226079

Apply cargo fmt to bench_e2e.rs

f22a1e1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Apply cargo fmt --all to fix CI formatting check

404dd32

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add missing copyright headers to benchmark example files

5e6e1ed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

wjc911 closed this Feb 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: van Herk/Gil-Werman O(n) separable morphology for feMorphology#1031

perf: van Herk/Gil-Werman O(n) separable morphology for feMorphology#1031
wjc911 wants to merge 13 commits into
linebender:mainfrom
wjc911:feMorphology_perf_optimize

wjc911 commented Feb 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

wjc911 commented Feb 22, 2026

Summary

Benchmark Results

Test Results

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant