Daily Perf Improver - Fix fold2 horizontal SIMD reduction bug #35
Summary
This PR fixes a correctness and performance bug in fold2Unchecked horizontal SIMD reduction by replacing manual loop accumulation with Vector.Sum(), aligning with the optimization pattern from PR #33.
Performance Goal
Goal Selected: Code correctness and SIMD optimization consistency (Phase 2)
Rationale: While analyzing the codebase for optimization opportunities as part of the performance improvement plan, I discovered that fold2Unchecked in SpanPrimitives.fs was using a suboptimal horizontal reduction pattern. The function was manually looping through SIMD vector elements instead of using the hardware-optimized Vector.Sum() method.
Bug Found
File: src/FsMath/SpanPrimitives.fs - fold2Unchecked function (lines 644-646)
Original Implementation:
Issues:
... + operator directly instead of a more generic approach
... SpanMath.dotUnchecked (PR #33, Daily Perf Improver - Optimize dot product with Vector.Sum horizontal reduction) and matrix operations
... init again, which is semantically incorrect since accVec already contains accumulated results
Changes Made
Optimized Implementation:
Approach
... fold2Unchecked horizontal reduction
Performance Impact
While fold2 is not currently used in the active codebase, this change provides:
Correctness Improvements
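As an illustration of the correctness concern, the sketch below shows how starting the horizontal reduction from init a second time changes the result. It is not the code from SpanPrimitives.fs: the way init is seeded into lane 0 and the names reduceWithInitAgain and reduceLanesOnly are assumptions made only for this example.

```fsharp
open System.Numerics

let lanes = Vector<float>.Count

// Assumption for illustration: init is seeded into lane 0, other lanes start at zero.
let init = 10.0
let seed = Array.zeroCreate<float> lanes
seed.[0] <- init

// Pretend one SIMD step has already added 1.0 to every lane.
let accVec = Vector<float>(seed) + Vector<float>(1.0)

// Hypothetical reduction that starts from init again: init is counted twice.
let reduceWithInitAgain () =
    let mutable acc = init
    for i in 0 .. lanes - 1 do
        acc <- acc + accVec.[i]
    acc

// Hypothetical reduction that only sums the lanes: init is counted once.
let reduceLanesOnly () = Vector.Sum(accVec)

printfn "starting from init again: %g" (reduceWithInitAgain ())
printfn "summing lanes only:       %g" (reduceLanesOnly ())
```

Under these assumptions the two results differ by exactly init, whatever the lane count on the machine running the code.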
... init during horizontal reduction
Performance Improvements (when fold2 is used)
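To make the idea of a tree-shaped horizontal reduction concrete, here is a scalar sketch that contrasts a sequential chain of additions with a pairwise tree. It only illustrates the general technique; whether Vector.Sum() is compiled to this shape on a given runtime or CPU is not something this PR confirms. The function names are hypothetical, and sumTree assumes a power-of-two length, as SIMD lane counts typically are.

```fsharp
// Sequential chain: every addition depends on the previous one.
let sumSequential (xs: float[]) =
    let mutable acc = 0.0
    for x in xs do
        acc <- acc + x
    acc

// Pairwise tree: additions within one level do not depend on each other.
// Assumes xs.Length is zero or a power of two.
let sumTree (xs: float[]) =
    if xs.Length = 0 then 0.0
    else
        let buf = Array.copy xs
        let mutable width = buf.Length
        while width > 1 do
            let half = width / 2
            // Fold the upper half onto the lower half.
            for i in 0 .. half - 1 do
                buf.[i] <- buf.[i] + buf.[i + half]
            width <- half
        buf.[0]

let lanesExample = [| 1.0; 2.0; 3.0; 4.0; 5.0; 6.0; 7.0; 8.0 |]
printfn "sequential: %g, tree: %g" (sumSequential lanesExample) (sumTree lanesExample)
```

For n lanes the chain has n - 1 dependent additions, while the tree has log2(n) dependent levels.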
Hardware Horizontal Add Instructions:
Instruction-Level Parallelism:
Vector.Sum() can use tree-reduction internally
Expected Performance Gain:
Testing
✅ All 488 tests pass
✅ Build succeeds with no errors
✅ No functional changes to existing code paths
✅ Change only affects fold2Unchecked horizontal reduction
Implementation Details
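The overall shape of a SIMD accumulation followed by a Vector.Sum() reduction can be sketched as below. This is a simplified stand-in, not the FsMath fold2Unchecked: it hard-codes a sum of products instead of taking the fold functions as parameters, and the name sumProducts is hypothetical.

```fsharp
open System.Numerics

// Hypothetical helper: computes init + sum(x[i] * y[i]) with a SIMD accumulator.
let sumProducts (x: float[]) (y: float[]) (init: float) =
    if x.Length <> y.Length then invalidArg "y" "length mismatch"
    let lanes = Vector<float>.Count
    let mutable accVec = Vector<float>.Zero
    let mutable i = 0
    // SIMD body: each lane accumulates its own partial sum.
    while i + lanes <= x.Length do
        accVec <- accVec + Vector<float>(x, i) * Vector<float>(y, i)
        i <- i + lanes
    // Horizontal reduction: collapse all lanes in one call; init is added exactly once.
    let mutable acc = init + Vector.Sum(accVec)
    // Scalar tail for elements that do not fill a whole vector.
    while i < x.Length do
        acc <- acc + x.[i] * y.[i]
        i <- i + 1
    acc

let xs = Array.init 10 (fun i -> float (i + 1))
let ys = Array.create 10 2.0
printfn "result: %g" (sumProducts xs ys 5.0)
```

Keeping init out of the vector accumulator and adding it once after the reduction is one way to avoid counting it per lane; the real function may seed its accumulator differently.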
Optimization Techniques Applied
Code Quality
... SpanMath.dotUnchecked (PR #33, Daily Perf Improver - Optimize dot product with Vector.Sum horizontal reduction) and matrix operations
Why This Pattern Works
The optimization leverages hardware-specific instructions:
Hardware-Optimized Instructions:
Correct Semantics:
accVec contains accumulated results from SIMD operations
... init again would be incorrect
Vector.Sum() properly sums all SIMD lanes
Consistency with Codebase:
... fold2 into alignment
Related Issues/Discussions
Future Work
Although fold2 is not currently used in the codebase:
Bash Commands Used
Web Searches Performed
None - this fix was based on:
🤖 Generated with Claude Code