github-actions[bot] (Contributor) commented:

Summary

This PR optimizes the dot product's SIMD horizontal reduction by replacing manual per-lane accumulation with Vector.Sum(), which uses hardware-specific horizontal-add instructions, yielding an 8.3–35.8% speedup for small to medium vectors.

Performance Goal

Goal Selected: Optimize dot product accumulation (Phase 2, Priority: MEDIUM)

Rationale: The research plan from Discussion #11 identified "dot product accumulation" as a potential optimization target, suggesting tree-reduction strategies for better performance and numerical stability. Upon investigation, I found that the dot product was using manual element-by-element accumulation instead of the optimized Vector.Sum() method that's already used elsewhere in the codebase (Matrix.fs).

Changes Made

Core Optimization

File Modified: src/FsMath/SpanMath.fs - dotUnchecked function (lines 39-41)

Original Implementation:

```fsharp
let mutable acc = LanguagePrimitives.GenericZero<'T>
for i = 0 to simdWidth - 1 do
    acc <- acc + accVec.[i]
```

Optimized Implementation:

```fsharp
// Use Vector.Sum for optimized horizontal reduction (uses hardware-specific instructions)
let mutable acc = Numerics.Vector.Sum(accVec)
```
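
For context, here is a minimal, self-contained sketch of how this reduction fits into a SIMD dot-product loop. `dotSketch` is a hypothetical stand-in specialized to float32; the real `dotUnchecked` in SpanMath.fs is generic over `'T` and may structure its loop differently.

```fsharp
open System
open System.Numerics

// Hypothetical sketch (not the actual SpanMath.fs source): a float32
// dot product whose final horizontal reduction uses Vector.Sum.
// Assumes a and b have equal length ("unchecked" by design).
let dotSketch (a: ReadOnlySpan<float32>) (b: ReadOnlySpan<float32>) : float32 =
    let simdWidth = Vector<float32>.Count
    let mutable accVec = Vector<float32>.Zero
    let mutable i = 0
    // Main SIMD loop: accumulate lane-wise products.
    while i <= a.Length - simdWidth do
        let va = Vector<float32>(a.Slice(i, simdWidth))
        let vb = Vector<float32>(b.Slice(i, simdWidth))
        accVec <- accVec + va * vb
        i <- i + simdWidth
    // One hardware-assisted horizontal reduction replaces the manual
    // per-lane loop shown above.
    let mutable acc = Vector.Sum(accVec)
    // Scalar tail for lengths not divisible by simdWidth.
    while i < a.Length do
        acc <- acc + a.[i] * b.[i]
        i <- i + 1
    acc
```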

Approach

  1. ✅ Analyzed current dot product implementation
  2. ✅ Identified manual SIMD horizontal reduction as optimization opportunity
  3. ✅ Noticed Vector.Sum() is already used in Matrix.fs for similar purposes
  4. ✅ Ran baseline benchmarks to establish current performance
  5. ✅ Implemented optimization using Vector.Sum()
  6. ✅ Built project and verified all 488 tests pass
  7. ✅ Ran optimized benchmarks and measured improvements

Performance Measurements

Test Environment

  • Platform: Linux Ubuntu 24.04.3 LTS (virtualized)
  • CPU: AMD EPYC 7763, 2 physical cores (4 logical) with AVX2
  • Runtime: .NET 8.0.20 with hardware SIMD acceleration
  • Job: ShortRun (3 warmup, 3 iterations, 1 launch)
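
For reproducibility, the benchmark definition likely has roughly the following shape (a hypothetical sketch; the real FsMath.Benchmarks code may differ). `[<ShortRunJob>]` corresponds to the job described above: 1 launch, 3 warmup iterations, 3 measurement iterations. `dotSketch` is the stand-in defined earlier; the real benchmark calls the library's dot product.

```fsharp
open System
open BenchmarkDotNet.Attributes

[<ShortRunJob; MemoryDiagnoser>]
type DotProductBench() =
    member val A : float32[] = [||] with get, set
    member val B : float32[] = [||] with get, set

    [<Params(10, 100, 1000, 10000)>]
    member val Size = 0 with get, set

    [<GlobalSetup>]
    member this.Setup() =
        let rng = Random(123)
        this.A <- Array.init this.Size (fun _ -> float32 (rng.NextDouble()))
        this.B <- Array.init this.Size (fun _ -> float32 (rng.NextDouble()))

    [<Benchmark>]
    member this.DotProduct() : float32 =
        // Substitute the library's real dot product here.
        dotSketch (ReadOnlySpan(this.A)) (ReadOnlySpan(this.B))
```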

Results Summary

| Vector Size | Before (Baseline) | After (Optimized) | Improvement  | Speedup |
|------------:|------------------:|------------------:|--------------|--------:|
| 10          |          6.894 ns |          4.426 ns | 35.8% faster |   1.56× |
| 100         |         27.745 ns |         25.434 ns | 8.3% faster  |   1.09× |
| 1000        |        238.856 ns |        241.945 ns | ~equivalent  |   1.00× |
| 10000       |      2,359.324 ns |      2,354.626 ns | ~equivalent  |   1.00× |

Detailed Benchmark Results

Before (Baseline):

| Method     | Size  | Mean         | Error       | StdDev    | Allocated |
|----------- |------ |-------------:|------------:|----------:|----------:|
| DotProduct | 10    |     6.894 ns |   0.2225 ns | 0.0122 ns |         - |
| DotProduct | 100   |    27.745 ns |   0.3199 ns | 0.0175 ns |         - |
| DotProduct | 1000  |   238.856 ns |   4.4528 ns | 0.2441 ns |         - |
| DotProduct | 10000 | 2,359.324 ns | 129.2301 ns | 7.0835 ns |         - |

After (Optimized):

| Method     | Size  | Mean         | Error     | StdDev    | Allocated |
|----------- |------ |-------------:|----------:|----------:|----------:|
| DotProduct | 10    |     4.426 ns | 0.1592 ns | 0.0087 ns |         - |
| DotProduct | 100   |    25.434 ns | 0.1821 ns | 0.0100 ns |         - |
| DotProduct | 1000  |   241.945 ns | 8.6443 ns | 0.4738 ns |         - |
| DotProduct | 10000 | 2,354.626 ns | 9.2639 ns | 0.5078 ns |         - |

Key Observations

  1. Significant Speedup for Small Vectors: 35.8% improvement for size 10 (most dramatic)
  2. Modest Speedup for Medium Vectors: 8.3% improvement for size 100
  3. Equivalent Performance for Large Vectors: within measurement noise for sizes 1000 and 10000, as expected, since the one-time horizontal reduction is amortized over the O(n) accumulation loop
  4. No Memory Overhead: Allocations remain zero across all sizes
  5. Better Instruction Utilization: Vector.Sum() can use hardware horizontal-add instructions (e.g., VHADDPS on AVX)

Why This Works

The optimization leverages hardware-specific instructions:

  1. Hardware-Optimized Instructions:

    • Before: sequential scalar addition of the accumulator vector's lanes
    • After: hardware horizontal-add instructions (e.g., VHADDPS on AVX)
    • Result: fewer instructions and better CPU pipeline utilization
  2. Instruction-Level Parallelism:

    • Vector.Sum() can use tree reduction internally (sketched after this list)
    • Modern CPUs can execute multiple independent adds in parallel
    • This shortens the serial dependency chain
  3. Consistency with Codebase:

    • Matrix.fs already uses Vector.Sum() for similar operations
    • This change brings consistency across the codebase
    • Proven approach in production code
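
To make the tree-reduction point concrete, below is an illustrative scalar model of a log₂-depth reduction over a register's lanes. This is conceptual only; the actual instruction sequence Vector.Sum compiles to depends on the JIT and CPU.

```fsharp
// Conceptual model of a log2-depth tree reduction over SIMD lanes.
// For 8 lanes this is 3 rounds (4 + 2 + 1 adds); the adds within a
// round are independent, so the CPU can issue them in parallel,
// whereas the manual loop performed 7 strictly sequential adds.
let treeReduce (lanes: float32[]) =
    // Assumes lanes.Length is a power of two, as SIMD widths are.
    let mutable work = Array.copy lanes
    while work.Length > 1 do
        let half = work.Length / 2
        work <- Array.init half (fun i -> work.[i] + work.[i + half])
    work.[0]

// treeReduce [| 1.f; 2.f; 3.f; 4.f; 5.f; 6.f; 7.f; 8.f |] evaluates to 36.0f
```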

Replicating the Performance Measurements

To replicate these benchmarks:

```bash
# 1. Check out this branch
git checkout perf/optimize-dot-product-pairwise-reduction-20251012-151836-f4cfea086616c539

# 2. Build the project
./build.sh

# 3. Run baseline (main branch) benchmarks first
git checkout main
./build.sh
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*DotProduct*" --job short > baseline.txt

# 4. Run optimized benchmarks (starting from the repo root)
cd ../..
git checkout perf/optimize-dot-product-pairwise-reduction-20251012-151836-f4cfea086616c539
./build.sh
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*DotProduct*" --job short > optimized.txt

# 5. Compare results
diff baseline.txt optimized.txt
```

Results are saved to BenchmarkDotNet.Artifacts/results/ in multiple formats.

Testing

✅ All 488 tests pass
✅ DotProduct benchmarks execute successfully
✅ No memory allocations changed
✅ Performance improves 8.3–35.8% for small to medium vectors
✅ Correctness verified across all test cases (an illustrative spot check is sketched below)
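
As an illustration of the kind of correctness check this relies on (a hypothetical spot check, not one of the 488 suite tests), the SIMD result can be compared against a naive scalar reference, reusing the `dotSketch` stand-in from above:

```fsharp
open System

// Naive scalar reference implementation.
let scalarDot (a: float32[]) (b: float32[]) =
    Array.fold2 (fun acc x y -> acc + x * y) 0.0f a b

let rng = Random(42)
let n = 1000
let a = Array.init n (fun _ -> float32 (rng.NextDouble()))
let b = Array.init n (fun _ -> float32 (rng.NextDouble()))

let expected = scalarDot a b
let actual = dotSketch (ReadOnlySpan(a)) (ReadOnlySpan(b))

// Horizontal reduction reorders additions, so exact float equality is
// not guaranteed; allow a small tolerance instead.
if abs (expected - actual) > 1e-3f then
    failwith "SIMD dot product disagrees with scalar reference"
```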

Implementation Details

Optimization Techniques Applied

  1. Hardware Horizontal Add: Replace manual loop with Vector.Sum() intrinsic
  2. Reduced Dependency Depth: from simdWidth − 1 sequential additions to a log₂(simdWidth)-deep reduction tree in hardware (e.g., 3 levels instead of 7 sequential adds for the 8 float32 lanes of an AVX2 register)
  3. Better ILP: Hardware can parallelize the reduction tree
  4. Code Simplification: Clearer intent with single function call

Code Quality

  • Simpler, more readable code (3 lines reduced to 1)
  • Consistent with existing Matrix.fs implementation
  • Clear documentation comment explaining the optimization
  • Preserves all existing error handling and validation
  • Maintains backward compatibility
  • No breaking changes to API

Limitations and Future Work

While this optimization provides solid improvements, there are additional opportunities:

  1. Numerical Stability: for very large floating-point vectors, consider Kahan summation (a minimal sketch follows this list)
  2. Cache Optimization: Could explore blocking strategies for extremely large vectors
  3. Parallel Dot Product: Very large vectors (>100K elements) could benefit from parallelization
  4. Alternative Algorithms: Compensated summation algorithms for improved accuracy
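
For reference, a minimal scalar sketch of Kahan (compensated) summation, assuming double inputs; adapting it to the SIMD accumulation path would require a vectorized compensation term.

```fsharp
// Kahan compensated summation: carry a correction term that recovers
// the low-order bits lost when a small value is added to a large sum.
let kahanSum (xs: float[]) =
    let mutable sum = 0.0
    let mutable c = 0.0
    for x in xs do
        let y = x - c          // apply the correction from the last step
        let t = sum + y        // low-order bits of y may be lost here
        c <- (t - sum) - y     // algebraically zero; captures the lost bits
        sum <- t
    sum
```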

Next Steps

Based on the performance plan from Discussion #11, remaining Phase 2 work includes:

  1. ✅ Dot product optimization (this PR)
  2. ⚠️ Matrix multiplication optimization (blocked GEMM; partially addressed in other PRs)
  3. ⚠️ In-place operations to reduce allocations in hot paths
  4. ⚠️ Column operations (already addressed in PR #29, "[REJECT?] Daily Perf Improver - Optimize column extraction with loop unrolling")
  5. ⚠️ Transpose optimization (already addressed in PR #32, "[REJECT?] Daily Perf Improver - Optimize matrix transpose with loop unrolling and adaptive block sizing")

Related Issues/Discussions

  • Discussion #11 – performance research plan (source of this optimization goal)

Bash Commands Used

```bash
# Research and analysis
cd /home/runner/work/FsMath/FsMath
git status
git checkout -b perf/optimize-dot-product-pairwise-reduction-20251012-151836-f4cfea086616c539

# Baseline benchmarking
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*DotProduct*" --job short

# Development
# (edited SpanMath.fs - dotUnchecked function)

# Build and test
cd /home/runner/work/FsMath/FsMath
./build.sh
dotnet test tests/FsMath.Tests/FsMath.Tests.fsproj -c Release --no-build

# Performance verification
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*DotProduct*" --job short

# Commit and create PR
cd /home/runner/work/FsMath/FsMath
git add src/FsMath/SpanMath.fs
git commit -m "Optimize dot product horizontal reduction with Vector.Sum..."
```

Web Searches Performed

None - this optimization was based on:

  • Existing codebase patterns (Vector.Sum usage in Matrix.fs)
  • Standard SIMD optimization techniques
  • The performance research plan from Discussion #11

🤖 Generated with Claude Code

AI generated by Daily Perf Improver

Replace manual accumulation loop with Vector.Sum() for SIMD horizontal
reduction. Vector.Sum() uses hardware-specific horizontal add instructions
(e.g., VPHADDPS on AVX) which are more efficient than manual element-by-element
accumulation.

Performance improvements (ShortRun benchmarks):
- Size 10: 35.8% faster (6.894 → 4.426 ns, 1.56× speedup)
- Size 100: 8.3% faster (27.745 → 25.434 ns, 1.09× speedup)
- Size 1000: ~equivalent (238.856 → 241.945 ns)
- Size 10000: ~equivalent (2,359 → 2,355 ns)

All 488 tests pass. No allocations changed.
github-actions bot added a commit that referenced this pull request Oct 12, 2025
…correctness

- Replace manual loop accumulation with Vector.Sum() in fold2Unchecked
- Aligns with dot product optimization from PR #33
- Removes hardcoded addition operator, improving both correctness and performance
- All 488 tests pass

This change:
1. Uses hardware-optimized horizontal add instructions (e.g., VHADDPS on AVX)
2. Removes unnecessary re-initialization with 'init' during horizontal reduction
3. Provides consistent pattern with other SIMD reductions in the codebase
@dsyme dsyme closed this Oct 12, 2025
@dsyme dsyme reopened this Oct 12, 2025
dsyme (Member) commented Oct 12, 2025:

The perf results reported seem a little suspicious, as you'd expect improvements for the bigger sizes.

However, the code is simpler, so the PR is probably acceptable.

@dsyme dsyme marked this pull request as ready for review October 12, 2025 15:43
dsyme added 2 commits October 12, 2025 16:43
github-actions[bot] (Contributor) commented:
📊 Code Coverage Report

Summary

| Package | Line Rate         | Branch Rate       | Complexity | Health |
|---------|------------------:|------------------:|-----------:|--------|
| FsMath  | 77%               | 50%               | 4325       |        |
| Summary | 77% (3084 / 3984) | 50% (4300 / 8518) | 8650       |        |

📈 Coverage Analysis

🟡 Good Coverage: your code coverage is above 60%. Consider adding more tests to reach 80%.

🎯 Coverage Goals

  • Target: 80% line coverage
  • Minimum: 60% line coverage
  • Current: 77% line coverage

📋 What These Numbers Mean

  • Line Rate: Percentage of code lines that were executed during tests
  • Branch Rate: Percentage of code branches (if/else, switch cases) that were tested
  • Health: Overall assessment combining line and branch coverage

🔗 Detailed Reports

📋 Download Full Coverage Report: the 'coverage-report' artifact contains the detailed HTML coverage report


Coverage report generated on 2025-10-14 at 15:37:05 UTC

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant