replace matrixmultiply with gemm #1176
base: main
Conversation
matmul benchmark results on my machine after enabling […]
the CI seems to be failing on cuda, but i'm not sure what's causing it. it says it can't find […]
Thanks @sarah-ek, this is impressive work! I'm personally cautiously positive towards working towards integrating `gemm`.
I'm sure @sebcrozet might have additional thoughts on this.
I believe very strongly that parallelism should be opt-in, on a case-by-case basis; automatically parallelizing […] behind the user's back is at odds with that. This begs the question of how the API for parallel kernel calls - like GEMM - should look in `nalgebra`.
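To make the design question concrete, here is a minimal, purely hypothetical sketch of a per-call opt-in; the `ParMul` trait and `par_mul` method are invented for illustration and are not part of nalgebra or this PR:

```rust
// Hypothetical sketch only: invented names, not nalgebra's API.
// The point is the shape of an opt-in design: multiplication stays
// sequential unless the caller explicitly asks for threads per call.
use nalgebra::DMatrix;

/// Invented extension trait: an explicitly parallel multiply.
trait ParMul {
    fn par_mul(&self, rhs: &Self, threads: usize) -> Self;
}

impl ParMul for DMatrix<f64> {
    fn par_mul(&self, rhs: &Self, _threads: usize) -> Self {
        // A real implementation would hand `_threads` to the GEMM kernel;
        // this stub just falls back to the sequential product.
        self * rhs
    }
}

fn main() {
    let a = DMatrix::<f64>::identity(8, 8);
    let b = DMatrix::<f64>::identity(8, 8);
    let c_seq = &a * &b;          // default: sequential, unchanged
    let c_par = a.par_mul(&b, 4); // opt-in: explicit at the call site
    assert_eq!(c_seq, c_par);
}
```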
I forgot to mention, but having more contributors run the benchmarks on their computers would also be helpful, since these low-level kernels might be highly sensitive to the hardware, and I'm sure you've tested it most extensively on your own hardware, @sarah-ek. I'm out of time right now, but I'll try to do it in due time.
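If it helps, a generic harness along these lines is one way for contributors to get comparable numbers; this is illustrative only and is not the benchmark suite behind the figures quoted in this PR (criterion and the size grid are my choices):

```rust
// Illustrative criterion benchmark for nalgebra's dynamic matrix product.
// Cargo.toml (dev-dependencies): criterion and nalgebra.
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
use nalgebra::DMatrix;

fn bench_matmul(c: &mut Criterion) {
    let mut group = c.benchmark_group("dmatrix_mul");
    for &n in &[64usize, 256, 1024] {
        // Deterministic fill so no extra features or dependencies are needed.
        let a = DMatrix::<f64>::from_fn(n, n, |i, j| (i + 2 * j) as f64);
        let b = DMatrix::<f64>::from_fn(n, n, |i, j| (3 * i + j) as f64);
        group.bench_with_input(BenchmarkId::from_parameter(n), &n, |bencher, _| {
            bencher.iter(|| &a * &b);
        });
    }
    group.finish();
}

criterion_group!(benches, bench_matmul);
criterion_main!(benches);
```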
so, gemm is based on the BLIS papers, which can be found here: https://github.com/flame/blis#citations. i'm aware the documentation is sparse at the moment, and i'm willing to put in the work to improve it.

i have a more extensive benchmark suite in the gemm repository, and i can post the results on my machine. i can also make it easier for other people to benchmark the code. i'm also curious how it performs on machines other than my desktop and laptop (both x86_64), and i'm open to suggestions regarding that.

as for the unsafety of the code: i run a subset of my tests in miri to make sure there's no UB. i'm also willing to write more tests and make sure all the code paths are covered.
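For readers unfamiliar with this style of testing, here is a sketch of how a miri-checked subset can be selected; it is illustrative, not gemm's actual test suite. The small case runs under `cargo +nightly miri test`, while the heavy case is skipped there:

```rust
// Illustrative only: a naive reference product plus two tests, showing
// how part of a test suite can be kept runnable under miri.
fn naive_matmul(a: &[f64], b: &[f64], m: usize, k: usize, n: usize) -> Vec<f64> {
    let mut c = vec![0.0; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                c[i * n + j] += a[i * k + p] * b[p * n + j];
            }
        }
    }
    c
}

#[test]
fn small_matmul_runs_under_miri() {
    // Small sizes keep miri's interpreter overhead tractable.
    let (m, k, n) = (4, 3, 5);
    let a: Vec<f64> = (0..m * k).map(|x| x as f64).collect();
    let b: Vec<f64> = (0..k * n).map(|x| x as f64).collect();
    assert_eq!(naive_matmul(&a, &b, m, k, n).len(), m * n);
}

#[test]
#[cfg_attr(miri, ignore)] // far too slow under the interpreter
fn large_matmul_stress() {
    let n = 256;
    let a = vec![1.0; n * n];
    let b = vec![1.0; n * n];
    let c = naive_matmul(&a, &b, n, n, n);
    assert!(c.iter().all(|&x| (x - n as f64).abs() < 1e-9));
}
```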
You have done an incredible job @sarah-ek, thank you for this PR! I agree with @Andlon’s remarks. Looking at its code base, we can’t use `gemm` […]. I think the transition should be slower than just replacing `matrixmultiply`.
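One common way to stage such a transition in the Rust ecosystem is an opt-in cargo feature that selects the backend; the sketch below uses an invented feature name and a made-up free function, not anything this PR or nalgebra actually defines:

```rust
// Hypothetical sketch; "experimental-gemm" is an invented feature name.
// In Cargo.toml it would pair with roughly:
//
//   [features]
//   experimental-gemm = ["dep:gemm"]
//
// so the new dependency is only compiled when users ask for it.

/// Opt-in path: only exists when the feature is enabled.
#[cfg(feature = "experimental-gemm")]
pub fn matmul_f64(a: &[f64], b: &[f64], m: usize, k: usize, n: usize) -> Vec<f64> {
    // the call into the gemm crate would go here
    let _ = (a, b, m, k, n);
    unimplemented!("gemm-backed kernel")
}

/// Default path: today's behavior, unchanged for everyone else.
#[cfg(not(feature = "experimental-gemm"))]
pub fn matmul_f64(a: &[f64], b: &[f64], m: usize, k: usize, n: usize) -> Vec<f64> {
    let mut c = vec![0.0; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                c[i * n + j] += a[i * k + p] * b[p * n + j];
            }
        }
    }
    c
}
```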
I think this happens because the compiler version […]
these are all great ideas. i'm not familiar with how CI works, but i can use this as a learning opportunity. having it as an experimental feature also sounds good to me. as for the […]
this PR replaces the `matrixmultiply` dependency with `gemm`, which is a crate i developed for high performance matrix multiplication. its single threaded performance is about 10-20% better than `matrixmultiply` for medium/large matrices, and its multithreaded performance is competitive with blis and intel mkl (and about 10% faster than eigen on my machine).

the multithreaded parallelism isn't implemented in this PR, since i'm not sure how to expose it in `nalgebra` (global variable, thread_local, per call basis?).

`gemm` also has a `nightly` feature that enables avx512 for extra performance, which is also not implemented in this PR since i don't know how we want to expose it either.
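As a purely illustrative aside on the three options above (global variable, thread_local, per call basis), a minimal sketch of the thread_local variant might look like this; every name in it is invented:

```rust
// Sketch of the thread_local option: a per-thread thread-count override
// that a matmul kernel could consult at dispatch time. Invented names.
use std::cell::Cell;

thread_local! {
    static MATMUL_THREADS: Cell<usize> = Cell::new(1); // 1 = sequential default
}

/// Run `f` with the matmul thread count temporarily set to `n`.
pub fn with_matmul_threads<R>(n: usize, f: impl FnOnce() -> R) -> R {
    MATMUL_THREADS.with(|t| {
        let old = t.replace(n);
        let out = f();
        t.set(old);
        out
    })
}

/// What a kernel would read before deciding whether to spawn threads.
pub fn current_matmul_threads() -> usize {
    MATMUL_THREADS.with(|t| t.get())
}

fn main() {
    assert_eq!(current_matmul_threads(), 1);
    with_matmul_threads(8, || {
        // a gemm call made here would see 8 threads
        assert_eq!(current_matmul_threads(), 8);
    });
    assert_eq!(current_matmul_threads(), 1); // restored afterwards
}
```

The global-variable variant would be the same idea with an `AtomicUsize` shared across threads, and the per-call variant would thread the count through the API explicitly, as in the earlier per-call sketch.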