Daily Perf Improver - Optimize vector × matrix multiplication with SIMD #26
## Summary
This PR optimizes row-vector × matrix multiplication (`v × M`), achieving a 3.5-4.8× speedup for typical matrix sizes by reorganizing the computation to exploit row-major storage and SIMD acceleration.

## Performance Goal
**Goal Selected:** Optimize vector × matrix multiplication (Phase 2, Priority: MEDIUM)

**Rationale:** The research plan from Discussion #11 and the benchmarks from PR #20 identified that `VectorMatrixMultiply` (vector × matrix) was 4-5× slower than `MatrixVectorMultiply` (matrix × vector). This asymmetry was caused by column-wise memory access patterns that don't align with row-major storage.

## Changes Made
### Core Optimization
**File Modified:** `src/FsMath/Matrix.fs` - `multiplyRowVector` function (lines 581-645)

**Original Implementation:**

**Optimized Implementation:**
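The F# listings were not captured here; as an illustrative sketch only (plain Python rather than the project's F#, with hypothetical function names), the two strategies over a row-major flat array look like this:

```python
def vec_matmul_naive(v, data, n, m):
    """v (length n) times an n-by-m matrix stored row-major in `data`.
    Column-wise: computing output j strides through `data` with step m."""
    return [sum(v[i] * data[i * m + j] for i in range(n)) for j in range(m)]

def vec_matmul_rows(v, data, n, m):
    """Same product expressed as a linear combination of matrix rows:
    result = sum_i v[i] * row_i, touching memory contiguously."""
    result = [0.0] * m
    for i in range(n):
        w = v[i]
        if w == 0.0:
            continue  # zero-skip: this row contributes nothing
        row = data[i * m:(i + 1) * m]  # contiguous, cache-friendly slice
        for j in range(m):
            result[j] += w * row[j]
    return result
```

Both functions return the same vector; the second form is the one that maps onto SIMD, since each row can be consumed in vector-width chunks.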
### Benchmark Infrastructure
Added comprehensive matrix operation benchmarks from PR #20:

- `benchmarks/FsMath.Benchmarks/Matrix.fs` (108 lines, 14 benchmarks)
- `FsMath.Benchmarks.fsproj` - added Matrix.fs to compilation
- `Program.fs` - registered the MatrixBenchmarks class

## Approach
## Performance Measurements

### Test Environment

### Results Summary

### Detailed Benchmark Results

### Key Observations
- `VectorMatrixMultiply` now matches `MatrixVectorMultiply` performance

## Why This Works
The optimization addresses three key bottlenecks:

1. **Memory Access Pattern:** before, column-wise strided access (`data[i*m + j]`) - cache-unfriendly; after, row-wise contiguous access (`data[i*m..(i+1)*m]`) - cache-friendly.
2. **SIMD Utilization:** `Numerics.Vector<'T>(weight)` broadcasts each scalar weight across all SIMD lanes, so a whole chunk of a row is accumulated per vector instruction.
3. **Computational Efficiency:** rows whose weight is zero (`v[i] == 0`) are skipped entirely.
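To make the broadcast step concrete, here is a plain-Python simulation of what `Numerics.Vector<'T>(weight)` does in the F# kernel. The 4-lane width and the function names are hypothetical stand-ins; the real width is whatever `System.Numerics.Vector` selects for the host CPU:

```python
LANES = 4  # hypothetical SIMD width; the real width is chosen per CPU

def broadcast(w):
    """Simulate Numerics.Vector(weight): one scalar copied into every lane."""
    return [w] * LANES

def axpy_row(result, w, row):
    """result += w * row, LANES elements at a time, mirroring how a single
    SIMD multiply-add updates a whole chunk of the output row."""
    wv = broadcast(w)
    j = 0
    while j + LANES <= len(row):
        for k in range(LANES):          # one simulated vector instruction
            result[j + k] += wv[k] * row[j + k]
        j += LANES
    while j < len(row):                 # scalar tail for leftover elements
        result[j] += w * row[j]
        j += 1
    return result
```

In the actual kernel this inner update runs once per nonzero `v[i]`, accumulating that row into the result.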
## Replicating the Performance Measurements

To replicate these benchmarks:

Results are saved to `BenchmarkDotNet.Artifacts/results/` in GitHub MD, HTML, and CSV formats.

## Testing
✅ All 132 tests pass
✅ VectorMatrixMultiply benchmarks execute successfully
✅ Memory allocations unchanged
✅ Performance improves 3.5-4.8× for target sizes
✅ Correctness verified across all test cases
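The correctness claim can be spot-checked against a naive reference; the following is an illustrative property test in Python (not the project's F# test suite, and all names here are hypothetical), comparing the row-linear-combination form against the textbook definition on random inputs:

```python
import random

def vxm_reference(v, M):
    """Naive definition: result[j] = sum_i v[i] * M[i][j]."""
    return [sum(v[i] * M[i][j] for i in range(len(M)))
            for j in range(len(M[0]))]

def vxm_optimized(v, M):
    """Row-linear-combination form, as used by the optimization,
    including the skip for zero weights."""
    out = [0.0] * len(M[0])
    for w, row in zip(v, M):
        if w == 0.0:
            continue
        for j, x in enumerate(row):
            out[j] += w * x
    return out

# Randomized equivalence check over many shapes, with zero weights mixed in
random.seed(42)
for _ in range(100):
    n, m = random.randint(1, 8), random.randint(1, 8)
    v = [random.choice([0.0, random.uniform(-1, 1)]) for _ in range(n)]
    M = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(n)]
    ref, opt = vxm_reference(v, M), vxm_optimized(v, M)
    assert all(abs(a - b) < 1e-12 for a, b in zip(ref, opt))
```

Because both forms accumulate over `i` in the same order (zero terms contribute exactly 0.0), the results agree to within floating-point tolerance.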
## Implementation Details

### Optimization Techniques Applied
- Reformulated `v × M` as a linear combination of matrix rows
- Used `Numerics.Vector<'T>(weight)` to broadcast each scalar across SIMD lanes
- Skipped rows where `v[i] == 0`
### Code Quality

## Next Steps
This PR establishes parity between vector × matrix and matrix × vector operations. Based on the performance plan, remaining Phase 2 work includes:
- `getCol` still has strided access

## Future Optimization Opportunities
From this work, I identified additional optimization targets:
- Column extraction (`getCol`): could use SIMD gather instructions

## Related Issues/Discussions
🤖 Generated with Claude Code