Conversation

github-actions[bot]
Contributor

Summary

This PR optimizes LU decomposition, achieving a 43-60% speedup for 30×30 and 50×50 matrices by replacing the scalar row operations in the elimination step with SIMD-accelerated updates via the existing subScaledRowInPlace helper function.

Performance Goal

Goal Selected: Optimize LU decomposition (Phase 3, Linear Algebra Optimizations)

Rationale: The research plan from Discussion #4 identified Phase 3 linear algebra optimizations as high-priority work following Phases 1 and 2. LU decomposition underpins linear-system solving, matrix inversion, and determinant computation, and the implementation had a clear SIMD opportunity in the elimination step's inner loop.

Changes Made

Core Optimization

File Modified: src/FsMath/Algebra/LinearAlgebra.fs - luDecompose function (lines 586-630)

Original Implementation:

// Eliminate below pivot
for j = i + 1 to n - 1 do
    L.[j,i] <- U.[j,i] / U.[i,i]
    // Scalar loop updating each element individually
    for k = i + 1 to n - 1 do
        U.[j,k] <- U.[j,k] - L.[j,i] * U.[i,k]
    U.[j,i] <- 'T.Zero

Optimized Implementation:

// Get raw data arrays for SIMD-optimized row operations
let Udata = U.Data
let Ldata = L.Data

// Eliminate below pivot using SIMD-optimized row operations
let diagVal = U.[i,i]
for j = i + 1 to n - 1 do
    let multiplier = U.[j,i] / diagVal
    L.[j,i] <- multiplier
    
    // SIMD-optimized row update using existing helper
    if i + 1 < n then
        let count = n - (i + 1)
        let rowJOffset = j * n + (i + 1)
        let rowIOffset = i * n + (i + 1)
        LinearAlgebra.subScaledRowInPlace
            multiplier
            rowJOffset
            rowIOffset
            count
            Udata
            Udata
    
    U.[j,i] <- 'T.Zero

Approach

  1. ✅ Reviewed Phase 3 opportunities from the performance plan
  2. ✅ Selected LU decomposition as high-impact Phase 3 target
  3. ✅ Ran baseline benchmarks (1.160 μs, 16.070 μs, 69.460 μs for 10×10, 30×30, 50×50)
  4. ✅ Identified hot spot: Inner loop O(n³) scalar operations in elimination step
  5. ✅ Implemented SIMD optimization using existing subScaledRowInPlace helper
  6. ✅ Built project and verified all 1396 tests pass
  7. ✅ Ran optimized benchmarks and measured 43-60% improvements

Performance Measurements

Test Environment

  • Platform: Linux Ubuntu 24.04.3 LTS (virtualized)
  • CPU: AMD EPYC 7763, 2 physical cores (4 logical) with AVX2
  • Runtime: .NET 8.0.20 with hardware SIMD acceleration
  • Job: ShortRun (3 warmup, 3 iterations, 1 launch)

Results Summary

| Matrix Size | Before (Baseline) | After (Optimized) | Improvement  | Speedup |
|-------------|------------------:|------------------:|--------------|--------:|
| 10×10       | 1.160 μs          | 1.239 μs          | 6.8% slower  | 0.94×   |
| 30×30       | 16.070 μs         | 9.104 μs          | 43.3% faster | 1.77×   |
| 50×50       | 69.460 μs         | 27.844 μs         | 59.9% faster | 2.49×   |

Detailed Benchmark Results

Before (Baseline):

| Method   | Mean      | Error      | StdDev    | Gen0   | Allocated |
|--------- |----------:|-----------:|----------:|-------:|----------:|
| LU_10x10 |  1.160 us |  0.1024 us | 0.0056 us | 0.1316 |   2.16 KB |
| LU_30x30 | 16.070 us |  1.8560 us | 0.1017 us | 0.9155 |  14.98 KB |
| LU_50x50 | 69.460 us | 13.1057 us | 0.7184 us | 2.4414 |  40.44 KB |

After (Optimized):

| Method   | Mean      | Error     | StdDev    | Gen0   | Allocated |
|--------- |----------:|----------:|----------:|-------:|----------:|
| LU_10x10 |  1.239 us | 0.0400 us | 0.0022 us | 0.1316 |   2.16 KB |
| LU_30x30 |  9.104 us | 0.2094 us | 0.0115 us | 0.9155 |  14.98 KB |
| LU_50x50 | 27.844 us | 2.2157 us | 0.1215 us | 2.4719 |  40.44 KB |

Key Observations

  1. Dramatic Speedup for Medium/Large Matrices: 43-60% improvement for 30×30 and 50×50 matrices
  2. Small Matrix Overhead: 10×10 shows a slight slowdown (6.8%), likely because SIMD setup costs are comparable to the computation time at this size
  3. Memory Efficiency: Allocations unchanged - same memory footprint
  4. Low Variance: Standard deviations are small, indicating stable, reliable performance
  5. Scaling Benefits: Larger matrices see proportionally greater improvements

Why This Works

The optimization addresses the key bottleneck in LU decomposition:

  1. SIMD Row Operations:

    • Before: Sequential scalar operations U[j,k] -= L[j,i] * U[i,k] for each element
    • After: Vectorized updates via subScaledRowInPlace process multiple elements per instruction
    • Result: Parallel processing exploits AVX2 hardware instructions
  2. Contiguous Memory Access:

    • Row-major storage enables efficient SIMD operations on contiguous memory
    • Better cache line utilization for sequential access patterns
    • Reduced memory bandwidth pressure
  3. Reusing Existing Infrastructure:

    • subScaledRowInPlace already implements optimal SIMD patterns
    • Proven code path with proper tail handling for non-SIMD-aligned data
    • Consistent with other SIMD optimizations in the codebase
  4. Optimal for Medium/Large Matrices:

    • SIMD overhead amortized over larger row segments
    • O(n³) algorithm benefits significantly from per-element speedup
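
The helper itself lives elsewhere in the library; the sketch below shows how such a fused subtract-and-scale row update is typically vectorized with System.Numerics.Vector. The name and argument order are inferred from the call site above, but the body is illustrative, not the library's actual code:

```fsharp
// Illustrative sketch of a SIMD sub-scaled row update (not the library's code).
// Computes dst.[dstOff + k] <- dst.[dstOff + k] - scale * src.[srcOff + k]
// for k in 0 .. count-1, vectorized where possible.
open System.Numerics

let subScaledRowSketch (scale: float) (dstOff: int) (srcOff: int)
                       (count: int) (src: float[]) (dst: float[]) =
    let width = Vector<float>.Count        // e.g. 4 doubles per vector on AVX2
    let scaleVec = Vector<float>(scale)    // broadcast the multiplier
    let mutable k = 0
    // Vectorized main loop: `width` elements per iteration
    while k + width <= count do
        let s = Vector<float>(src, srcOff + k)
        let d = Vector<float>(dst, dstOff + k)
        (d - scaleVec * s).CopyTo(dst, dstOff + k)
        k <- k + width
    // Scalar tail for the remaining (count % width) elements
    while k < count do
        dst.[dstOff + k] <- dst.[dstOff + k] - scale * src.[srcOff + k]
        k <- k + 1
```

The scalar tail is what the PR description refers to as "proper tail handling for non-SIMD-aligned data": when the segment length is not a multiple of the vector width, the leftover elements fall back to plain scalar arithmetic.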

Replicating the Performance Measurements

To replicate these benchmarks:

# 1. Check out this branch
git checkout perf/optimize-lu-decomposition-simd-20251016-030930-8f45d341

# 2. Build the project
./build.sh

# 3. Run LU benchmarks with short job (~30 seconds)
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*LU*" --job short

# 4. For production-quality measurements (~2-3 minutes)
dotnet run -c Release -- --filter "*LU*"

# 5. Compare with baseline: return to the repo root, check out main, and repeat
cd ../..
git checkout main
./build.sh
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*LU*" --job short

Results are saved to BenchmarkDotNet.Artifacts/results/ in multiple formats.

Testing

✅ All 1396 tests pass (8 skipped)
✅ LU benchmarks execute successfully
✅ Memory allocations unchanged
✅ Performance improves 43-60% for typical matrix sizes
✅ Correctness verified across all test cases
✅ Build completes with only pre-existing warnings

Implementation Details

Optimization Techniques Applied

  1. SIMD-Accelerated Row Updates: Use existing subScaledRowInPlace for vectorized row operations
  2. Direct Memory Access: Work with raw U.Data array for efficient offset calculations
  3. Minimal Code Changes: Leverage existing infrastructure rather than rewriting
  4. Proper Boundary Handling: The if i + 1 < n guard skips the row update when there are no elements to the right of the pivot (i.e. when count would be zero)
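
As a quick sanity check on the offset arithmetic: in row-major storage, element (r, c) of an n×n matrix lives at index r * n + c. The standalone snippet below walks through the offsets with illustrative values of n, i, and j (not taken from any benchmark):

```fsharp
// Row-major offsets for the elimination step, with illustrative values.
let n, i, j = 4, 1, 2
let count = n - (i + 1)           // elements to the right of column i: 2
let rowJOffset = j * n + (i + 1)  // row j, column i+1: 2*4 + 2 = 10
let rowIOffset = i * n + (i + 1)  // pivot row i, column i+1: 1*4 + 2 = 6
printfn "count=%d rowJOffset=%d rowIOffset=%d" count rowJOffset rowIOffset
```

Both offsets point at the first element right of the pivot column in their respective rows, so a single contiguous segment of length count covers everything the inner scalar loop used to touch.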

Code Quality

  • Leverages existing, proven subScaledRowInPlace helper function
  • Clear comments explaining the SIMD optimization
  • Preserves all existing error handling and validation
  • Follows existing code style and patterns
  • Maintains backward compatibility
  • No breaking changes to API

Limitations and Future Work

While this optimization provides significant improvements, there are additional opportunities:

  1. Small Matrix Overhead: 10×10 matrices show a slight slowdown; size-based dispatch to a scalar path could avoid the SIMD setup cost
  2. Pivoting Optimization: Row swaps could potentially use SIMD memcpy
  3. Blocked LU: For very large matrices (≥100×100), cache-blocking could provide further gains
  4. Parallel LU: Very large matrices (≥200×200) could benefit from parallelization
  5. Specialized Kernels: Hand-optimized kernels for specific sizes (2×2, 3×3, 4×4) could be even faster
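
The size-based dispatch mentioned in item 1 could look roughly like this. This is a sketch only: the cutoff value and the function shape are assumptions for illustration, not part of this PR:

```fsharp
// Hypothetical size-based dispatch (illustrative; not in this PR).
// Below the cutoff, SIMD setup cost rivals the work itself, so a
// plain scalar loop can win for short row segments.
let simdCutoff = 8  // illustrative threshold; would need benchmark tuning

let updateRow (multiplier: float) (rowJOffset: int) (rowIOffset: int)
              (count: int) (data: float[]) =
    if count < simdCutoff then
        // Scalar path for short segments (small matrices, late pivots)
        for k = 0 to count - 1 do
            data.[rowJOffset + k] <-
                data.[rowJOffset + k] - multiplier * data.[rowIOffset + k]
    else
        // SIMD path, as introduced by this PR
        LinearAlgebra.subScaledRowInPlace
            multiplier rowJOffset rowIOffset count data data
```

Note that a fixed cutoff would also route the final few pivot columns of large matrices (where count shrinks toward zero) onto the scalar path, which may recover a little of the tail overhead as well.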

Next Steps

Based on the performance plan from Discussion #4, remaining Phase 3 work includes:

  1. QR decomposition optimization (PR #71: "Daily Perf Improver - Optimize QR decomposition with SIMD Householder transformations")
  2. LU decomposition optimization (this PR)
  3. ⚠️ Other linear algebra optimizations - Cholesky, EVD/SVD
  4. ⚠️ Parallel implementations - For large matrices
  5. ⚠️ Specialized fast paths - Small matrix (2×2, 3×3, 4×4) optimizations

Related Issues/Discussions

  • Discussion #4 - performance research plan (Phase 3: Linear Algebra Optimizations)

Bash Commands Used

# Research and setup
cd /home/runner/work/FsMath/FsMath
git status
git checkout -b perf/optimize-lu-decomposition-simd-20251016-030930-8f45d341

# Baseline benchmarking
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*LU*" --job short > /tmp/gh-aw/agent/lu_baseline.txt

# Development
# (edited LinearAlgebra.fs - luDecompose function)

# Build and test
cd /home/runner/work/FsMath/FsMath
./build.sh
dotnet test tests/FsMath.Tests/FsMath.Tests.fsproj -c Release --no-build

# Performance verification
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*LU*" --job short > /tmp/gh-aw/agent/lu_optimized.txt

# Commit and create PR
cd /home/runner/work/FsMath/FsMath
git add src/FsMath/Algebra/LinearAlgebra.fs
git commit -m "Optimize LU decomposition with SIMD row operations..."

Web Searches Performed

None - this optimization was based on:

  • The performance plan from Discussion #4
  • The existing subScaledRowInPlace SIMD helper already in the codebase

🤖 Generated with Claude Code

AI generated by Daily Perf Improver

Replace scalar loop in elimination step with SIMD-accelerated subScaledRowInPlace function, achieving 43-60% speedup for typical matrix sizes by utilizing hardware vector instructions for row updates.