github-actions[bot] (Contributor) commented:

Summary

This PR optimizes the dot product's SIMD horizontal reduction by replacing manual per-lane accumulation with Vector.Sum(), which uses hardware-specific horizontal-add instructions, yielding an 8.3–35.8% speedup for small to medium vectors.

Performance Goal

Goal Selected: Optimize dot product accumulation (Phase 2, Priority: MEDIUM)

Rationale: The research plan from Discussion #11 identified "dot product accumulation" as a potential optimization target, suggesting tree-reduction strategies for better performance and numerical stability. Upon investigation, I found that the dot product was using manual element-by-element accumulation instead of the optimized Vector.Sum() method that's already used elsewhere in the codebase (Matrix.fs).

Changes Made

Core Optimization

File Modified: src/FsMath/SpanMath.fs - dotUnchecked function (lines 39-41)

Original Implementation:

```fsharp
let mutable acc = LanguagePrimitives.GenericZero<'T>
for i = 0 to simdWidth - 1 do
    acc <- acc + accVec.[i]
```

Optimized Implementation:

```fsharp
// Use Vector.Sum for optimized horizontal reduction (uses hardware-specific instructions)
let mutable acc = Numerics.Vector.Sum(accVec)
```
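
For context, here is a minimal, self-contained sketch of how this reduction fits into a SIMD dot-product loop. `dotSketch` is a hypothetical stand-in specialized to float32; the real `dotUnchecked` in SpanMath.fs is generic over `'T` and may structure its loop differently.

```fsharp
open System
open System.Numerics

// Hypothetical sketch (not the actual SpanMath.fs source): a float32
// dot product whose final horizontal reduction uses Vector.Sum.
// Assumes a and b have equal length ("unchecked" by design).
let dotSketch (a: ReadOnlySpan<float32>) (b: ReadOnlySpan<float32>) : float32 =
    let simdWidth = Vector<float32>.Count
    let mutable accVec = Vector<float32>.Zero
    let mutable i = 0
    // Main SIMD loop: accumulate lane-wise products.
    while i <= a.Length - simdWidth do
        let va = Vector<float32>(a.Slice(i, simdWidth))
        let vb = Vector<float32>(b.Slice(i, simdWidth))
        accVec <- accVec + va * vb
        i <- i + simdWidth
    // One hardware-assisted horizontal reduction replaces the manual
    // per-lane loop shown above.
    let mutable acc = Vector.Sum(accVec)
    // Scalar tail for lengths not divisible by simdWidth.
    while i < a.Length do
        acc <- acc + a.[i] * b.[i]
        i <- i + 1
    acc
```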

Approach

  1. ✅ Analyzed current dot product implementation
  2. ✅ Identified manual SIMD horizontal reduction as optimization opportunity
  3. ✅ Noticed Vector.Sum() is already used in Matrix.fs for similar purposes
  4. ✅ Ran baseline benchmarks to establish current performance
  5. ✅ Implemented optimization using Vector.Sum()
  6. ✅ Built project and verified all 488 tests pass
  7. ✅ Ran optimized benchmarks and measured improvements

Performance Measurements

Test Environment

  • Platform: Linux Ubuntu 24.04.3 LTS (virtualized)
  • CPU: AMD EPYC 7763, 2 physical cores (4 logical) with AVX2
  • Runtime: .NET 8.0.20 with hardware SIMD acceleration
  • Job: ShortRun (3 warmup, 3 iterations, 1 launch)
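
For reproducibility, the benchmark definition likely has roughly the following shape (a hypothetical sketch; the real FsMath.Benchmarks code may differ). `[<ShortRunJob>]` corresponds to the job described above: 1 launch, 3 warmup iterations, 3 measurement iterations. `dotSketch` is the stand-in defined earlier; the real benchmark calls the library's dot product.

```fsharp
open System
open BenchmarkDotNet.Attributes

[<ShortRunJob; MemoryDiagnoser>]
type DotProductBench() =
    member val A : float32[] = [||] with get, set
    member val B : float32[] = [||] with get, set

    [<Params(10, 100, 1000, 10000)>]
    member val Size = 0 with get, set

    [<GlobalSetup>]
    member this.Setup() =
        let rng = Random(123)
        this.A <- Array.init this.Size (fun _ -> float32 (rng.NextDouble()))
        this.B <- Array.init this.Size (fun _ -> float32 (rng.NextDouble()))

    [<Benchmark>]
    member this.DotProduct() : float32 =
        // Substitute the library's real dot product here.
        dotSketch (ReadOnlySpan(this.A)) (ReadOnlySpan(this.B))
```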

Results Summary

| Vector Size | Before (Baseline) | After (Optimized) | Improvement  | Speedup |
|------------:|------------------:|------------------:|--------------|--------:|
| 10          |          6.894 ns |          4.426 ns | 35.8% faster |   1.56× |
| 100         |         27.745 ns |         25.434 ns | 8.3% faster  |   1.09× |
| 1000        |        238.856 ns |        241.945 ns | ~equivalent  |   1.00× |
| 10000       |      2,359.324 ns |      2,354.626 ns | ~equivalent  |   1.00× |

Detailed Benchmark Results

Before (Baseline):

| Method     | Size  | Mean         | Error       | StdDev    | Allocated |
|----------- |------ |-------------:|------------:|----------:|----------:|
| DotProduct | 10    |     6.894 ns |   0.2225 ns | 0.0122 ns |         - |
| DotProduct | 100   |    27.745 ns |   0.3199 ns | 0.0175 ns |         - |
| DotProduct | 1000  |   238.856 ns |   4.4528 ns | 0.2441 ns |         - |
| DotProduct | 10000 | 2,359.324 ns | 129.2301 ns | 7.0835 ns |         - |

After (Optimized):

| Method     | Size  | Mean         | Error     | StdDev    | Allocated |
|----------- |------ |-------------:|----------:|----------:|----------:|
| DotProduct | 10    |     4.426 ns | 0.1592 ns | 0.0087 ns |         - |
| DotProduct | 100   |    25.434 ns | 0.1821 ns | 0.0100 ns |         - |
| DotProduct | 1000  |   241.945 ns | 8.6443 ns | 0.4738 ns |         - |
| DotProduct | 10000 | 2,354.626 ns | 9.2639 ns | 0.5078 ns |         - |

Key Observations

  1. Significant Speedup for Small Vectors: 35.8% improvement for size 10 (most dramatic)
  2. Modest Speedup for Medium Vectors: 8.3% improvement for size 100
  3. Equivalent Performance for Large Vectors: within measurement noise for sizes 1000 and 10000, as expected, since the one-time horizontal reduction is amortized over the O(n) accumulation loop
  4. No Memory Overhead: Allocations remain zero across all sizes
  5. Better Instruction Utilization: Vector.Sum() can use hardware horizontal-add instructions (e.g., VHADDPS on AVX)

Why This Works

The optimization leverages hardware-specific instructions:

  1. Hardware-Optimized Instructions:

    • Before: sequential scalar addition of the accumulator vector's lanes
    • After: hardware horizontal-add instructions (e.g., VHADDPS on AVX)
    • Result: fewer instructions and better CPU pipeline utilization
  2. Instruction-Level Parallelism:

    • Vector.Sum() can use tree reduction internally (sketched after this list)
    • Modern CPUs can execute multiple independent adds in parallel
    • This shortens the serial dependency chain
  3. Consistency with Codebase:

    • Matrix.fs already uses Vector.Sum() for similar operations
    • This change brings consistency across the codebase
    • Proven approach in production code
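
To make the tree-reduction point concrete, below is an illustrative scalar model of a log₂-depth reduction over a register's lanes. This is conceptual only; the actual instruction sequence Vector.Sum compiles to depends on the JIT and CPU.

```fsharp
// Conceptual model of a log2-depth tree reduction over SIMD lanes.
// For 8 lanes this is 3 rounds (4 + 2 + 1 adds); the adds within a
// round are independent, so the CPU can issue them in parallel,
// whereas the manual loop performed 7 strictly sequential adds.
let treeReduce (lanes: float32[]) =
    // Assumes lanes.Length is a power of two, as SIMD widths are.
    let mutable work = Array.copy lanes
    while work.Length > 1 do
        let half = work.Length / 2
        work <- Array.init half (fun i -> work.[i] + work.[i + half])
    work.[0]

// treeReduce [| 1.f; 2.f; 3.f; 4.f; 5.f; 6.f; 7.f; 8.f |] evaluates to 36.0f
```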

Replicating the Performance Measurements

To replicate these benchmarks:

```bash
# 1. Check out this branch
git checkout perf/optimize-dot-product-pairwise-reduction-20251012-151836-f4cfea086616c539

# 2. Build the project
./build.sh

# 3. Run baseline (main branch) benchmarks first
git checkout main
./build.sh
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*DotProduct*" --job short > baseline.txt

# 4. Run optimized benchmarks (starting from the repo root)
cd ../..
git checkout perf/optimize-dot-product-pairwise-reduction-20251012-151836-f4cfea086616c539
./build.sh
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*DotProduct*" --job short > optimized.txt

# 5. Compare results
diff baseline.txt optimized.txt
```

Results are saved to BenchmarkDotNet.Artifacts/results/ in multiple formats.

Testing

✅ All 488 tests pass
✅ DotProduct benchmarks execute successfully
✅ No memory allocations changed
✅ Performance improves 8.3–35.8% for small to medium vectors
✅ Correctness verified across all test cases (an illustrative spot check is sketched below)
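
As an illustration of the kind of correctness check this relies on (a hypothetical spot check, not one of the 488 suite tests), the SIMD result can be compared against a naive scalar reference, reusing the `dotSketch` stand-in from above:

```fsharp
open System

// Naive scalar reference implementation.
let scalarDot (a: float32[]) (b: float32[]) =
    Array.fold2 (fun acc x y -> acc + x * y) 0.0f a b

let rng = Random(42)
let n = 1000
let a = Array.init n (fun _ -> float32 (rng.NextDouble()))
let b = Array.init n (fun _ -> float32 (rng.NextDouble()))

let expected = scalarDot a b
let actual = dotSketch (ReadOnlySpan(a)) (ReadOnlySpan(b))

// Horizontal reduction reorders additions, so exact float equality is
// not guaranteed; allow a small tolerance instead.
if abs (expected - actual) > 1e-3f then
    failwith "SIMD dot product disagrees with scalar reference"
```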

Implementation Details

Optimization Techniques Applied

  1. Hardware Horizontal Add: Replace manual loop with Vector.Sum() intrinsic
  2. Reduced Dependency Depth: from simdWidth − 1 sequential additions to a log₂(simdWidth)-deep reduction tree in hardware (e.g., 3 levels instead of 7 sequential adds for the 8 float32 lanes of an AVX2 register)
  3. Better ILP: Hardware can parallelize the reduction tree
  4. Code Simplification: Clearer intent with single function call

Code Quality

  • Simpler, more readable code (3 lines reduced to 1)
  • Consistent with existing Matrix.fs implementation
  • Clear documentation comment explaining the optimization
  • Preserves all existing error handling and validation
  • Maintains backward compatibility
  • No breaking changes to API

Limitations and Future Work

While this optimization provides solid improvements, there are additional opportunities:

  1. Numerical Stability: for very large floating-point vectors, consider Kahan summation (a minimal sketch follows this list)
  2. Cache Optimization: Could explore blocking strategies for extremely large vectors
  3. Parallel Dot Product: Very large vectors (>100K elements) could benefit from parallelization
  4. Alternative Algorithms: Compensated summation algorithms for improved accuracy
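
For reference, a minimal scalar sketch of Kahan (compensated) summation, assuming double inputs; adapting it to the SIMD accumulation path would require a vectorized compensation term.

```fsharp
// Kahan compensated summation: carry a correction term that recovers
// the low-order bits lost when a small value is added to a large sum.
let kahanSum (xs: float[]) =
    let mutable sum = 0.0
    let mutable c = 0.0
    for x in xs do
        let y = x - c          // apply the correction from the last step
        let t = sum + y        // low-order bits of y may be lost here
        c <- (t - sum) - y     // algebraically zero; captures the lost bits
        sum <- t
    sum
```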

Next Steps

Based on the performance plan from Discussion #11, remaining Phase 2 work includes:

  1. ✅ Dot product optimization (this PR)
  2. ⚠️ Matrix multiplication optimization (blocked GEMM; partially addressed in other PRs)
  3. ⚠️ In-place operations to reduce allocations in hot paths
  4. ⚠️ Column operations (already addressed in PR #29, "[REJECT?] Daily Perf Improver - Optimize column extraction with loop unrolling")
  5. ⚠️ Transpose optimization (already addressed in PR #32, "[REJECT?] Daily Perf Improver - Optimize matrix transpose with loop unrolling and adaptive block sizing")

Related Issues/Discussions

  • Discussion #11 – performance research plan (source of this optimization goal)

Bash Commands Used

```bash
# Research and analysis
cd /home/runner/work/FsMath/FsMath
git status
git checkout -b perf/optimize-dot-product-pairwise-reduction-20251012-151836-f4cfea086616c539

# Baseline benchmarking
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*DotProduct*" --job short

# Development
# (edited SpanMath.fs - dotUnchecked function)

# Build and test
cd /home/runner/work/FsMath/FsMath
./build.sh
dotnet test tests/FsMath.Tests/FsMath.Tests.fsproj -c Release --no-build

# Performance verification
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*DotProduct*" --job short

# Commit and create PR
cd /home/runner/work/FsMath/FsMath
git add src/FsMath/SpanMath.fs
git commit -m "Optimize dot product horizontal reduction with Vector.Sum..."
```

Web Searches Performed

None - this optimization was based on:

  • Existing codebase patterns (Vector.Sum usage in Matrix.fs)
  • Standard SIMD optimization techniques
  • The performance research plan from Discussion #11

🤖 Generated with Claude Code

AI generated by Daily Perf Improver

Replace manual accumulation loop with Vector.Sum() for SIMD horizontal
reduction. Vector.Sum() uses hardware-specific horizontal add instructions
(e.g., VPHADDPS on AVX) which are more efficient than manual element-by-element
accumulation.

Performance improvements (ShortRun benchmarks):
- Size 10: 35.8% faster (6.894 → 4.426 ns, 1.56× speedup)
- Size 100: 8.3% faster (27.745 → 25.434 ns, 1.09× speedup)
- Size 1000: ~equivalent (238.856 → 241.945 ns)
- Size 10000: ~equivalent (2,359 → 2,355 ns)

All 488 tests pass. No allocations changed.
github-actions bot added a commit that referenced this pull request Oct 12, 2025
…correctness

- Replace manual loop accumulation with Vector.Sum() in fold2Unchecked
- Aligns with dot product optimization from PR #33
- Removes hardcoded addition operator, improving both correctness and performance
- All 488 tests pass

This change:
1. Uses hardware-optimized horizontal add instructions (e.g., VHADDPS on AVX)
2. Removes unnecessary re-initialization with 'init' during horizontal reduction
3. Provides consistent pattern with other SIMD reductions in the codebase
@dsyme dsyme closed this Oct 12, 2025
@dsyme dsyme reopened this Oct 12, 2025
dsyme (Member) commented Oct 12, 2025:

The perf results reported seem a little suspicious, as you'd expect improvements for the bigger sizes.

However, the code is simpler, so the PR is probably acceptable.

@dsyme dsyme marked this pull request as ready for review October 12, 2025 15:43
dsyme added 2 commits October 12, 2025 16:43
github-actions[bot] (Contributor) commented:
📊 Code Coverage Report

Summary

| Package | Line Rate         | Branch Rate       | Complexity | Health |
|---------|------------------:|------------------:|-----------:|--------|
| FsMath  | 77%               | 50%               | 4325       |        |
| Summary | 77% (3084 / 3984) | 50% (4300 / 8518) | 8650       |        |

📈 Coverage Analysis

🟡 Good Coverage: your code coverage is above 60%. Consider adding more tests to reach 80%.

🎯 Coverage Goals

  • Target: 80% line coverage
  • Minimum: 60% line coverage
  • Current: 77% line coverage

📋 What These Numbers Mean

  • Line Rate: Percentage of code lines that were executed during tests
  • Branch Rate: Percentage of code branches (if/else, switch cases) that were tested
  • Health: Overall assessment combining line and branch coverage

🔗 Detailed Reports

📋 Download Full Coverage Report: the 'coverage-report' artifact contains the detailed HTML coverage report


Coverage report generated on 2025-10-14 at 15:37:05 UTC

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant