Conversation

github-actions[bot]
Contributor

Summary

This PR optimizes LU decomposition, achieving a 43-60% speedup for 30×30 and 50×50 matrices by replacing the scalar row operations in the elimination step with SIMD-accelerated updates via the existing subScaledRowInPlace helper function.

Performance Goal

Goal Selected: Optimize LU decomposition (Phase 3, Linear Algebra Optimizations)

Rationale: The research plan from Discussion #4 identified Phase 3 linear algebra optimizations as high-priority work following Phases 1 and 2. LU decomposition underpins linear-system solving, matrix inversion, and determinant computation, and the implementation had a clear SIMD opportunity in the elimination step's inner loop.

Changes Made

Core Optimization

File Modified: src/FsMath/Algebra/LinearAlgebra.fs - luDecompose function (lines 586-630)

Original Implementation:

// Eliminate below pivot
for j = i + 1 to n - 1 do
    L.[j,i] <- U.[j,i] / U.[i,i]
    // Scalar loop updating each element individually
    for k = i + 1 to n - 1 do
        U.[j,k] <- U.[j,k] - L.[j,i] * U.[i,k]
    U.[j,i] <- 'T.Zero

Optimized Implementation:

// Get raw data arrays for SIMD-optimized row operations
let Udata = U.Data
let Ldata = L.Data

// Eliminate below pivot using SIMD-optimized row operations
let diagVal = U.[i,i]
for j = i + 1 to n - 1 do
    let multiplier = U.[j,i] / diagVal
    L.[j,i] <- multiplier
    
    // SIMD-optimized row update using existing helper
    if i + 1 < n then
        let count = n - (i + 1)
        let rowJOffset = j * n + (i + 1)
        let rowIOffset = i * n + (i + 1)
        LinearAlgebra.subScaledRowInPlace
            multiplier
            rowJOffset
            rowIOffset
            count
            Udata
            Udata
    
    U.[j,i] <- 'T.Zero

Approach

  1. ✅ Reviewed Phase 3 opportunities from the performance plan
  2. ✅ Selected LU decomposition as high-impact Phase 3 target
  3. ✅ Ran baseline benchmarks (1.160 μs, 16.070 μs, 69.460 μs for 10×10, 30×30, 50×50)
  4. ✅ Identified hot spot: Inner loop O(n³) scalar operations in elimination step
  5. ✅ Implemented SIMD optimization using existing subScaledRowInPlace helper
  6. ✅ Built project and verified all 1396 tests pass
  7. ✅ Ran optimized benchmarks and measured 43-60% improvements

Performance Measurements

Test Environment

  • Platform: Linux Ubuntu 24.04.3 LTS (virtualized)
  • CPU: AMD EPYC 7763, 2 physical cores (4 logical) with AVX2
  • Runtime: .NET 8.0.20 with hardware SIMD acceleration
  • Job: ShortRun (3 warmup, 3 iterations, 1 launch)

Results Summary

| Matrix Size | Before (Baseline) | After (Optimized) | Improvement  | Speedup |
|-------------|------------------:|------------------:|--------------|--------:|
| 10×10       | 1.160 μs          | 1.239 μs          | 6.8% slower  | 0.94×   |
| 30×30       | 16.070 μs         | 9.104 μs          | 43.3% faster | 1.77×   |
| 50×50       | 69.460 μs         | 27.844 μs         | 59.9% faster | 2.49×   |

Detailed Benchmark Results

Before (Baseline):

| Method   | Mean      | Error      | StdDev    | Gen0   | Allocated |
|--------- |----------:|-----------:|----------:|-------:|----------:|
| LU_10x10 |  1.160 us |  0.1024 us | 0.0056 us | 0.1316 |   2.16 KB |
| LU_30x30 | 16.070 us |  1.8560 us | 0.1017 us | 0.9155 |  14.98 KB |
| LU_50x50 | 69.460 us | 13.1057 us | 0.7184 us | 2.4414 |  40.44 KB |

After (Optimized):

| Method   | Mean      | Error     | StdDev    | Gen0   | Allocated |
|--------- |----------:|----------:|----------:|-------:|----------:|
| LU_10x10 |  1.239 us | 0.0400 us | 0.0022 us | 0.1316 |   2.16 KB |
| LU_30x30 |  9.104 us | 0.2094 us | 0.0115 us | 0.9155 |  14.98 KB |
| LU_50x50 | 27.844 us | 2.2157 us | 0.1215 us | 2.4719 |  40.44 KB |

Key Observations

  1. Dramatic Speedup for Medium/Large Matrices: 43-60% improvement for 30×30 and 50×50 matrices
  2. Small Matrix Overhead: 10×10 shows a slight slowdown (6.8%), likely because SIMD setup costs are comparable to the computation time at this size
  3. Memory Efficiency: Allocations unchanged - same memory footprint
  4. Low Variance: Standard deviations are small, indicating stable, reliable performance
  5. Scaling Benefits: Larger matrices see proportionally greater improvements

Why This Works

The optimization addresses the key bottleneck in LU decomposition:

  1. SIMD Row Operations:

    • Before: Sequential scalar operations U[j,k] -= L[j,i] * U[i,k] for each element
    • After: Vectorized updates via subScaledRowInPlace process multiple elements per instruction
    • Result: Parallel processing exploits AVX2 hardware instructions
  2. Contiguous Memory Access:

    • Row-major storage enables efficient SIMD operations on contiguous memory
    • Better cache line utilization for sequential access patterns
    • Reduced memory bandwidth pressure
  3. Reusing Existing Infrastructure:

    • subScaledRowInPlace already implements optimal SIMD patterns
    • Proven code path with proper tail handling for non-SIMD-aligned data
    • Consistent with other SIMD optimizations in the codebase
  4. Optimal for Medium/Large Matrices:

    • SIMD overhead amortized over larger row segments
    • O(n³) algorithm benefits significantly from per-element speedup
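
The helper itself lives elsewhere in the library; the sketch below shows how such a fused subtract-and-scale row update is typically vectorized with System.Numerics.Vector. The name and argument order are inferred from the call site above, but the body is illustrative, not the library's actual code:

```fsharp
// Illustrative sketch of a SIMD sub-scaled row update (not the library's code).
// Computes dst.[dstOff + k] <- dst.[dstOff + k] - scale * src.[srcOff + k]
// for k in 0 .. count-1, vectorized where possible.
open System.Numerics

let subScaledRowSketch (scale: float) (dstOff: int) (srcOff: int)
                       (count: int) (src: float[]) (dst: float[]) =
    let width = Vector<float>.Count        // e.g. 4 doubles per vector on AVX2
    let scaleVec = Vector<float>(scale)    // broadcast the multiplier
    let mutable k = 0
    // Vectorized main loop: `width` elements per iteration
    while k + width <= count do
        let s = Vector<float>(src, srcOff + k)
        let d = Vector<float>(dst, dstOff + k)
        (d - scaleVec * s).CopyTo(dst, dstOff + k)
        k <- k + width
    // Scalar tail for the remaining (count % width) elements
    while k < count do
        dst.[dstOff + k] <- dst.[dstOff + k] - scale * src.[srcOff + k]
        k <- k + 1
```

The scalar tail is what the PR description refers to as "proper tail handling for non-SIMD-aligned data": when the segment length is not a multiple of the vector width, the leftover elements fall back to plain scalar arithmetic.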

Replicating the Performance Measurements

To replicate these benchmarks:

# 1. Check out this branch
git checkout perf/optimize-lu-decomposition-simd-20251016-030930-8f45d341

# 2. Build the project
./build.sh

# 3. Run LU benchmarks with short job (~30 seconds)
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*LU*" --job short

# 4. For production-quality measurements (~2-3 minutes)
dotnet run -c Release -- --filter "*LU*"

# 5. Compare with baseline: return to the repo root, check out main, and repeat
cd ../..
git checkout main
./build.sh
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*LU*" --job short

Results are saved to BenchmarkDotNet.Artifacts/results/ in multiple formats.

Testing

✅ All 1396 tests pass (8 skipped)
✅ LU benchmarks execute successfully
✅ Memory allocations unchanged
✅ Performance improves 43-60% for typical matrix sizes
✅ Correctness verified across all test cases
✅ Build completes with only pre-existing warnings

Implementation Details

Optimization Techniques Applied

  1. SIMD-Accelerated Row Updates: Use existing subScaledRowInPlace for vectorized row operations
  2. Direct Memory Access: Work with raw U.Data array for efficient offset calculations
  3. Minimal Code Changes: Leverage existing infrastructure rather than rewriting
  4. Proper Boundary Handling: The if i + 1 < n guard skips the row update when there are no elements to the right of the pivot (i.e. when count would be zero)
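
As a quick sanity check on the offset arithmetic: in row-major storage, element (r, c) of an n×n matrix lives at index r * n + c. The standalone snippet below walks through the offsets with illustrative values of n, i, and j (not taken from any benchmark):

```fsharp
// Row-major offsets for the elimination step, with illustrative values.
let n, i, j = 4, 1, 2
let count = n - (i + 1)           // elements to the right of column i: 2
let rowJOffset = j * n + (i + 1)  // row j, column i+1: 2*4 + 2 = 10
let rowIOffset = i * n + (i + 1)  // pivot row i, column i+1: 1*4 + 2 = 6
printfn "count=%d rowJOffset=%d rowIOffset=%d" count rowJOffset rowIOffset
```

Both offsets point at the first element right of the pivot column in their respective rows, so a single contiguous segment of length count covers everything the inner scalar loop used to touch.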

Code Quality

  • Leverages existing, proven subScaledRowInPlace helper function
  • Clear comments explaining the SIMD optimization
  • Preserves all existing error handling and validation
  • Follows existing code style and patterns
  • Maintains backward compatibility
  • No breaking changes to API

Limitations and Future Work

While this optimization provides significant improvements, there are additional opportunities:

  1. Small Matrix Overhead: 10×10 matrices show a slight slowdown; size-based dispatch to a scalar path could avoid the SIMD setup cost
  2. Pivoting Optimization: Row swaps could potentially use SIMD memcpy
  3. Blocked LU: For very large matrices (≥100×100), cache-blocking could provide further gains
  4. Parallel LU: Very large matrices (≥200×200) could benefit from parallelization
  5. Specialized Kernels: Hand-optimized kernels for specific sizes (2×2, 3×3, 4×4) could be even faster
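
The size-based dispatch mentioned in item 1 could look roughly like this. This is a sketch only: the cutoff value and the function shape are assumptions for illustration, not part of this PR:

```fsharp
// Hypothetical size-based dispatch (illustrative; not in this PR).
// Below the cutoff, SIMD setup cost rivals the work itself, so a
// plain scalar loop can win for short row segments.
let simdCutoff = 8  // illustrative threshold; would need benchmark tuning

let updateRow (multiplier: float) (rowJOffset: int) (rowIOffset: int)
              (count: int) (data: float[]) =
    if count < simdCutoff then
        // Scalar path for short segments (small matrices, late pivots)
        for k = 0 to count - 1 do
            data.[rowJOffset + k] <-
                data.[rowJOffset + k] - multiplier * data.[rowIOffset + k]
    else
        // SIMD path, as introduced by this PR
        LinearAlgebra.subScaledRowInPlace
            multiplier rowJOffset rowIOffset count data data
```

Note that a fixed cutoff would also route the final few pivot columns of large matrices (where count shrinks toward zero) onto the scalar path, which may recover a little of the tail overhead as well.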

Next Steps

Based on the performance plan from Discussion #4, remaining Phase 3 work includes:

  1. QR decomposition optimization (PR #71: "Daily Perf Improver - Optimize QR decomposition with SIMD Householder transformations")
  2. LU decomposition optimization (this PR)
  3. ⚠️ Other linear algebra optimizations - Cholesky, EVD/SVD
  4. ⚠️ Parallel implementations - For large matrices
  5. ⚠️ Specialized fast paths - Small matrix (2×2, 3×3, 4×4) optimizations

Related Issues/Discussions

  • Discussion #4 - performance research plan (Phase 3: Linear Algebra Optimizations)

Bash Commands Used

# Research and setup
cd /home/runner/work/FsMath/FsMath
git status
git checkout -b perf/optimize-lu-decomposition-simd-20251016-030930-8f45d341

# Baseline benchmarking
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*LU*" --job short > /tmp/gh-aw/agent/lu_baseline.txt

# Development
# (edited LinearAlgebra.fs - luDecompose function)

# Build and test
cd /home/runner/work/FsMath/FsMath
./build.sh
dotnet test tests/FsMath.Tests/FsMath.Tests.fsproj -c Release --no-build

# Performance verification
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*LU*" --job short > /tmp/gh-aw/agent/lu_optimized.txt

# Commit and create PR
cd /home/runner/work/FsMath/FsMath
git add src/FsMath/Algebra/LinearAlgebra.fs
git commit -m "Optimize LU decomposition with SIMD row operations..."

Web Searches Performed

None - this optimization was based on:

  • The performance plan from Discussion #4
  • The existing subScaledRowInPlace SIMD helper already in the codebase

🤖 Generated with Claude Code

AI generated by Daily Perf Improver

Replace scalar loop in elimination step with SIMD-accelerated subScaledRowInPlace function, achieving 43-60% speedup for typical matrix sizes by utilizing hardware vector instructions for row updates.