perf: van Herk/Gil-Werman O(n) separable morphology for feMorphology #1031
Closed
wjc911 wants to merge 13 commits into
Replace the O(n*r²) brute-force morphology with a separable O(n) vHGW algorithm for large kernels. The 2D rectangular min/max is decomposed into horizontal + vertical 1D passes using prefix/suffix scans over blocks, with identity-padded boundaries for correct edge handling.

Key design choices:
- MorphOp trait with ErodeOp/DilateOp monomorphizes into branch-free specializations using [u8; 4] representation for SIMD auto-vectorization (SSE pminub/pmaxub, NEON vmin/vmax)
- Pre-allocated scratch buffers reused across all rows/columns
- Column tiling in vertical pass for cache friendliness
- Dynamic fallback to original naive algorithm for small kernels (area <= 32) where vHGW overhead isn't justified

Bit-exact with the original implementation, verified by 10 unit tests across exhaustive parameter space and all 13 integration tests.
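The 1D pass described above can be sketched as follows. This is a minimal illustration, not the PR's code: it works on single `u8` samples rather than the `[u8; 4]` pixels the commit mentions, and the `IDENTITY` constant, the `combine` method, and the function names `vhgw_1d` and `naive_1d` are hypothetical. Only the trait and operator names come from the commit message.

```rust
/// Hypothetical shape of the operator trait: one combining rule per operator.
trait MorphOp {
    /// Neutral element used for padding (255 for min, 0 for max).
    const IDENTITY: u8;
    fn combine(a: u8, b: u8) -> u8;
}

struct ErodeOp;
struct DilateOp;

impl MorphOp for ErodeOp {
    const IDENTITY: u8 = u8::MAX;
    fn combine(a: u8, b: u8) -> u8 { a.min(b) }
}

impl MorphOp for DilateOp {
    const IDENTITY: u8 = u8::MIN;
    fn combine(a: u8, b: u8) -> u8 { a.max(b) }
}

/// van Herk/Gil-Werman pass: out[i] combines src[i - left ..= i + right],
/// with out-of-range samples replaced by the operator's identity.
fn vhgw_1d<Op: MorphOp>(src: &[u8], left: usize, right: usize) -> Vec<u8> {
    let n = src.len();
    if n == 0 {
        return Vec::new();
    }
    let k = left + right + 1; // window length
    // Identity-pad so output i reads padded[i .. i + k], then round the
    // length up to a whole number of k-sized blocks.
    let padded_len = (n + k - 1 + k - 1) / k * k;
    let mut padded = vec![Op::IDENTITY; padded_len];
    padded[left..left + n].copy_from_slice(src);

    // prefix[j]: combine from the start of j's block up to j.
    // suffix[j]: combine from j up to the end of j's block.
    let mut prefix = padded.clone();
    let mut suffix = padded.clone();
    for block in (0..padded_len).step_by(k) {
        for j in block + 1..block + k {
            prefix[j] = Op::combine(prefix[j - 1], padded[j]);
        }
        for j in (block..block + k - 1).rev() {
            suffix[j] = Op::combine(suffix[j + 1], padded[j]);
        }
    }

    // A window of length k touches at most two adjacent blocks, so one
    // suffix value and one prefix value cover it exactly.
    (0..n)
        .map(|i| Op::combine(suffix[i], prefix[i + k - 1]))
        .collect()
}

/// Brute-force reference with the same boundary rule, for comparison.
fn naive_1d<Op: MorphOp>(src: &[u8], left: usize, right: usize) -> Vec<u8> {
    (0..src.len())
        .map(|i| {
            let lo = i.saturating_sub(left);
            let hi = (i + right).min(src.len() - 1);
            src[lo..=hi]
                .iter()
                .fold(Op::IDENTITY, |acc, &v| Op::combine(acc, v))
        })
        .collect()
}
```

Each output sample costs a constant number of `combine` calls regardless of window length, which is where the O(n) bound comes from; running the pass over rows and then over columns gives the rectangular 2D result.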
Correctness:
- 16 unit tests covering 758+ parameter combinations via brute-force enumeration of (operator, radius, image_size) tuples
- Edge case tests: constant images, single pixel, checkerboard, stripes, corners/edges, premultiplied alpha, non-square shapes, radius >> dimension, 20 random seeds, 200×200 images
- All verified bit-exact against naive implementation

Profiling-driven optimizations:
- TILE_WIDTH 8→16: +11% on small radii (better L1 cache line utilization; 16 adjacent columns = 64B = 1 cache line, vs 50% waste at 8)
- TILE_WIDTH 32 tested and rejected (L1 pressure regression)
- Virtual padded buffer tested and rejected (+14% regression at small radii due to branch prediction overhead in inner loops)
- NAIVE_KERNEL_AREA_THRESHOLD 32→9: crossover analysis showed vHGW wins at kernel_area>=16 (radius>=1.5). Threshold=9 ensures 2×2 kernels use naive (1.5ms vs 2.0ms) while 4×4+ use vHGW (2.4ms vs 4.0ms).

Benchmark results on 500×500 image (erode, 5 iterations):

  radius   naive     vHGW    speedup
  2.0      4.0ms     2.1ms   1.9x
  3.0      8.1ms     2.3ms   3.5x
  5.0      22.5ms    2.6ms   8.8x
  10.0     93.4ms    2.9ms   32.1x
  20.0     375.7ms   3.2ms   118.6x
  50.0     2.41s     4.3ms   563x

Per-pass profiling confirms balanced workload (horz ~48%, vert ~52%).
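The threshold-based fallback could look roughly like the sketch below. Only the constant name and its value come from the commit message; the `Strategy` enum, the `choose_strategy` function, and the assumption that kernel area is `columns * rows` of the kernel box are hypothetical.

```rust
/// Value taken from the commit message; kernels at or below this area
/// stay on the naive path.
const NAIVE_KERNEL_AREA_THRESHOLD: u32 = 9;

/// Hypothetical enum naming the two code paths.
#[derive(Debug, PartialEq, Eq)]
enum Strategy {
    Naive,
    Vhgw,
}

/// Assumption: kernel area is columns * rows of the kernel box.
fn choose_strategy(columns: u32, rows: u32) -> Strategy {
    // saturating_mul avoids overflow for very large kernels.
    if columns.saturating_mul(rows) <= NAIVE_KERNEL_AREA_THRESHOLD {
        Strategy::Naive
    } else {
        Strategy::Vhgw
    }
}
```

Under these assumptions a 2×2 kernel (area 4) takes the naive path and a 4×4 kernel (area 16) takes vHGW, matching the crossover described above.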
Keep only core algorithm implementation and correctness tests.
Add bench_morphology_comprehensive.rs testing 528 configurations across 8 image sizes, 11 radii, 2 operators, and 3 input patterns. Make filter and morphology modules doc(hidden) pub for benchmark access. Add public wrappers for apply_naive and apply_vhgw_pub to enable direct comparison.
Use scoped threads and AtomicUsize progress counter to run benchmark configurations in parallel across all available CPU cores, significantly reducing total benchmark execution time.
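A minimal sketch of that pattern, assuming a work-stealing index plus a separate progress counter. The function name `run_parallel` and its signature are hypothetical and not taken from the benchmark source.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Runs `run` once per config on `workers` scoped threads and returns the
/// results in input order. `progress` counts finished configs so a caller
/// (or a reporter thread) can poll it.
fn run_parallel<T, R, F>(
    configs: &[T],
    workers: usize,
    progress: &AtomicUsize,
    run: F,
) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let next = AtomicUsize::new(0); // index of the next unclaimed config
    let workers = workers.max(1);

    let mut results: Vec<(usize, R)> = thread::scope(|s| {
        let handles: Vec<_> = (0..workers)
            .map(|_| {
                s.spawn(|| {
                    let mut local = Vec::new();
                    loop {
                        // Each worker claims the next index atomically.
                        let i = next.fetch_add(1, Ordering::Relaxed);
                        if i >= configs.len() {
                            break;
                        }
                        local.push((i, run(&configs[i])));
                        progress.fetch_add(1, Ordering::Relaxed);
                    }
                    local
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    });

    // Workers finish in arbitrary order; restore input order.
    results.sort_by_key(|(i, _)| *i);
    results.into_iter().map(|(_, r)| r).collect()
}
```

A caller would typically pass `std::thread::available_parallelism()` (falling back to 1 on error) as `workers`. Scoped threads let the workers borrow `configs` and the counters directly, with no `Arc` needed.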
Replace combinatorial grid (8 sizes x 11 radii x 2 ops x 3 patterns = 528 configs) with 7 scenario-based groups modelled after actual SVG morphology usage: icon text outline (dilate r=2), heading outline (dilate r=3-4), subtle erode (r=1), thick knockout (r=8), threshold boundary analysis, non-square images, and asymmetric radius. Use two input patterns (opaque and text-like sparse alpha) instead of three. Remove unrealistic 4x4 size; use real-world dimensions (16-600px, including non-square aspect ratios like 200x150, 400x300, 600x400).
Replaces the parallel bench_e2e.rs with a sequential single-threaded version that uses per-resolution iteration counts (2000 for 16px, down to 100 for 1024px+), a probe-then-scale budget cap (30s total per case, skip if single probe > 10s), and --compare for TSV baseline comparison. Allows CPU-pinned reproducible measurements.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
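The probe-then-scale cap can be sketched as a small pure function. The 30s and 10s limits come from the commit message; the names `CASE_BUDGET`, `PROBE_SKIP_LIMIT`, and `plan_iterations`, and the choice to always run at least one iteration, are assumptions made for illustration.

```rust
use std::time::Duration;

/// Total time allowed per benchmark case (from the commit message).
const CASE_BUDGET: Duration = Duration::from_secs(30);
/// A case whose single probe run exceeds this is skipped (from the commit message).
const PROBE_SKIP_LIMIT: Duration = Duration::from_secs(10);

/// Given the duration of one probe run and the iteration count requested
/// for this resolution, returns how many iterations to actually run, or
/// `None` if the case should be skipped.
fn plan_iterations(probe: Duration, requested: u32) -> Option<u32> {
    if probe > PROBE_SKIP_LIMIT {
        return None;
    }
    if probe.is_zero() {
        // Too fast to measure; the budget cannot be the limiting factor.
        return Some(requested);
    }
    // How many probe-length runs fit into the budget.
    let affordable = CASE_BUDGET.as_nanos() / probe.as_nanos();
    let capped = affordable.min(u128::from(requested)) as u32;
    // Assumption: always run at least once if the case is not skipped.
    Some(capped.max(1))
}
```

For example, a 10 ms probe with 2000 requested iterations fits the budget unchanged, while a 1 s probe with 100 requested iterations is scaled down to 30.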
Summary
Benchmark Results
The largest speedups are observed on large images with large radii, where the naive algorithm's O(r^2) per-pixel inner loop previously dominated the run time.
Test Results
All 1723/1723 integration tests pass (cargo test --release -p resvg --test integration).

🤖 Generated with Claude Code