~5x speedup over SparseArrays #28
more detailed benchmark comparing with ThreadedSparseCSR (each entry corresponds to n = 4, 6, ..., 20 in the script below, i.e. a 2^n x 2^n matrix; values below 1 mean slower than SparseArrays):

```julia
julia> speedup = map(reports["base"], reports["this"]) do br, r
           minimum(br).time / minimum(r).time
       end
9-element Vector{Float64}:
 0.0055334472724891184
 0.010322679308762808
 0.04274428084218395
 0.14642857142857144
 0.47042220219007225
 2.1794871794871793
 4.394308194754103
 5.724467113647089
 4.775544347485181

julia> speedup = map(reports["base"], reports["tmul"]) do br, r
           minimum(br).time / minimum(r).time
       end
9-element Vector{Float64}:
 0.006753095374756376
 0.03454683929931455
 0.09163389771566433
 0.5019588638589618
 0.9172189246816113
 3.053892215568862
 6.099349844322258
 5.354817480719794
 5.343557759590914

julia> speedup = map(reports["base"], reports["bmul"]) do br, r
           minimum(br).time / minimum(r).time
       end
9-element Vector{Float64}:
 0.014777543308632282
 0.03007957559681698
 0.11305128801015392
 0.4823529411764706
 1.7014805708482548
 7.927461139896373
 7.44011544011544
 5.3025629910750975
 5.277065594192646
```

with the following script:
```julia
using BenchmarkTools
using ParallelMergeCSR
using SparseArrays
using SparseMatricesCSR
using ThreadedSparseCSR

ThreadedSparseCSR.get_num_threads()

# this package (ParallelMergeCSR), acting on the transpose of a CSC matrix
function benchmark(n::Int)
    A = transpose(sprand(2^n, 2^n, 1e-4))
    x = rand(2^n)
    y = similar(x)
    return @benchmark ParallelMergeCSR.mul!($y, $A, $x, 1.0, 0.0)
end

# baseline: single-threaded SparseArrays
function benchmark_base(n::Int)
    A = transpose(sprand(2^n, 2^n, 1e-4))
    x = rand(2^n)
    y = similar(x)
    return @benchmark SparseArrays.mul!($y, $A, $x, 1.0, 0.0)
end

# ThreadedSparseCSR, tmul! variant
function benchmark_tmul(n::Int)
    x = rand(2^n)
    y = similar(x)
    csrA = SparseMatrixCSR(transpose(sprand(2^n, 2^n, 1e-4)))
    return @benchmark tmul!($y, $csrA, $x, 1.0, 0.0)
end

# ThreadedSparseCSR, bmul! variant
function benchmark_bmul(n::Int)
    x = rand(2^n)
    y = similar(x)
    csrA = SparseMatrixCSR(transpose(sprand(2^n, 2^n, 1e-4)))
    return @benchmark bmul!($y, $csrA, $x, 1.0, 0.0)
end

reports = Dict(
    "this" => [],
    "base" => [],
    "tmul" => [],
    "bmul" => [],
)

for n in 4:2:20
    @info "benchmarking" n
    push!(reports["this"], benchmark(n))
    push!(reports["base"], benchmark_base(n))
    push!(reports["tmul"], benchmark_tmul(n))
    push!(reports["bmul"], benchmark_bmul(n))
end

speedup = map(reports["base"], reports["this"]) do br, r
    minimum(br).time / minimum(r).time
end

speedup = map(reports["base"], reports["tmul"]) do br, r
    minimum(br).time / minimum(r).time
end

speedup = map(reports["base"], reports["bmul"]) do br, r
    minimum(br).time / minimum(r).time
end
```
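Note: the parallel variants only show a speedup when Julia is started with multiple threads (e.g. `julia -t 8`). A quick sanity check before benchmarking:

```julia
using ThreadedSparseCSR

Threads.nthreads()                   # Julia threads this session was started with
ThreadedSparseCSR.get_num_threads()  # threads ThreadedSparseCSR is set to use
```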
These results really don't make sense to me; why does the speedup seem to decrease with larger matrices?
Also, one issue I see is that the matrices are not the same between the benchmarks, so it's not an entirely fair comparison for smaller matrices, where the matrix-to-matrix fluctuations in the number of non-zero elements are larger. A fairer setup is sketched below.
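One way to address this (a sketch, assuming the same packages as in the script above; `benchmark_all` is a hypothetical name) is to draw a single random matrix per size and share it across all four methods:

```julia
using BenchmarkTools, SparseArrays, SparseMatricesCSR
using ParallelMergeCSR, ThreadedSparseCSR

# Sketch of a fair-comparison variant: one matrix per size, shared by all methods.
function benchmark_all(n::Int)
    At = transpose(sprand(2^n, 2^n, 1e-4))  # one matrix, reused everywhere
    csrA = SparseMatrixCSR(At)              # CSR copy of the same matrix
    x = rand(2^n)
    y = similar(x)
    return Dict(
        "base" => @benchmark(SparseArrays.mul!($y, $At, $x, 1.0, 0.0)),
        "this" => @benchmark(ParallelMergeCSR.mul!($y, $At, $x, 1.0, 0.0)),
        "tmul" => @benchmark(tmul!($y, $csrA, $x, 1.0, 0.0)),
        "bmul" => @benchmark(bmul!($y, $csrA, $x, 1.0, 0.0)),
    )
end
```

With this, the per-size ratios compare timings of exactly the same matrix, so small-n fluctuations in the number of non-zeros no longer bias the speedups.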
conclusion: this package seems slightly slower than the existing ones, but faster at some of the sizes; bmul is the fastest one. I'm looking forward to seeing how much @batch can improve on this.

not sure why, but this is currently 8x slower than SparseArrays; the following is the comparison at different sizes

I just have a naive benchmark, currently getting a 5x speedup at 2^20 x 2^20
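For context, a minimal row-parallel CSR matvec built on Polyester's `@batch` could look like the sketch below (`batch_mul!` is a hypothetical name, not ThreadedSparseCSR's actual implementation); each row's dot product is independent, so rows can be batched across threads:

```julia
using Polyester, SparseArrays, SparseMatricesCSR

# Sketch: row-parallel y = alpha*A*x + beta*y for a CSR matrix using @batch.
function batch_mul!(y::AbstractVector, A::SparseMatrixCSR,
                    x::AbstractVector, alpha::Number, beta::Number)
    colv = colvals(A)   # column index of each stored entry
    nzv  = nonzeros(A)  # stored values
    @batch for row in 1:size(A, 1)
        acc = zero(eltype(y))
        for k in nzrange(A, row)  # stored entries of this row
            acc += nzv[k] * x[colv[k]]
        end
        y[row] = alpha * acc + beta * y[row]
    end
    return y
end
```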
version info (8 threads on M1 Pro)

the script to reproduce