Very poor gemmt performance compared to gemm and syrk #4921
Comments
The current GEMMT implementation is just a loop around GEMV, so its performance largely depends on that of the individual optimized GEMV kernels. It is provided for compatibility, but not yet optimized for speed. As the Reference BLAS looks to be adding its own interpretation of what used to be an unofficial extension, a total rework may be necessary at some point in any case.
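To make the "loop around GEMV" structure concrete, here is a minimal, self-contained sketch in plain C. It is not the code from interface/gemmt.c: the names `ref_gemv` and `gemmt_upper_by_gemv` are hypothetical, and the sketch assumes column-major storage, no transposition, and an update of the upper triangle only.

```c
#include <assert.h>
#include <stddef.h>

/* Reference GEMV (hypothetical helper): y := alpha*A*x + beta*y,
 * where A is m x k, column-major, with leading dimension lda. */
void ref_gemv(size_t m, size_t k, double alpha,
              const double *A, size_t lda,
              const double *x, double beta, double *y)
{
    for (size_t i = 0; i < m; i++) {
        double acc = 0.0;
        for (size_t p = 0; p < k; p++)
            acc += A[i + p * lda] * x[p];
        y[i] = alpha * acc + beta * y[i];
    }
}

/* Upper triangle of C := alpha*A*B + beta*C, with A n x k, B k x n, C n x n.
 * Column j of the upper triangle only needs rows 0..j, so each column is
 * one GEMV on the leading (j+1) x k part of A. The strictly lower part of
 * C is never touched. */
void gemmt_upper_by_gemv(size_t n, size_t k, double alpha,
                         const double *A, size_t lda,
                         const double *B, size_t ldb,
                         double beta, double *C, size_t ldc)
{
    assert(lda >= n && ldb >= k && ldc >= n);
    for (size_t j = 0; j < n; j++)
        ref_gemv(j + 1, k, alpha, A, lda, B + j * ldb, beta, C + j * ldc);
}
```

With this structure every column is a separate matrix-vector product, so none of the cache blocking that makes GEMM fast is available, which fits the observation that performance is bounded by the GEMV kernels.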
On ARM, the performance of this somewhat naive gemmt implementation is about on par with gemm and clearly better than syrk, provided the number of threads is capped at about 30 (on 64 cores, gemmt comes out horrendously bad again, taking about ten times as long as gemm). Interestingly, the obvious optimization of allocating the memory buffer only once, instead of allocating and freeing it for every individual gemv step in interface/gemmt.c, does not result in a significant improvement.
If someone wants to add a generic implementation with reasonable performance, the LAPACK implementation prior to the introduction of GEMMT (aka GEMMTR in LAPACK) as part of the BLAS may serve as inspiration. It reduces the problem essentially to GEMM by blocking into panels; only the small triangular part is computed with GEMV.
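The panel-blocking idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the LAPACK routine itself: `ref_gemm` and `gemmt_upper_blocked` are hypothetical names, the block width `nb` is a free parameter, storage is column-major, and only the upper triangle without transposition is handled. A real implementation would call the optimized GEMM and GEMV kernels instead of the naive loops used here.

```c
#include <assert.h>
#include <stddef.h>

/* Reference GEMM (hypothetical helper): C := alpha*A*B + beta*C,
 * with A m x k, B k x n, C m x n, all column-major.
 * Called with n == 1 it acts as a GEMV. */
void ref_gemm(size_t m, size_t n, size_t k, double alpha,
              const double *A, size_t lda,
              const double *B, size_t ldb,
              double beta, double *C, size_t ldc)
{
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++) {
            double acc = 0.0;
            for (size_t p = 0; p < k; p++)
                acc += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
}

/* Upper triangle of C := alpha*A*B + beta*C, processed in column panels
 * of width nb. A is n x k, B is k x n, C is n x n. */
void gemmt_upper_blocked(size_t n, size_t k, size_t nb, double alpha,
                         const double *A, size_t lda,
                         const double *B, size_t ldb,
                         double beta, double *C, size_t ldc)
{
    assert(nb > 0);
    for (size_t j0 = 0; j0 < n; j0 += nb) {
        size_t jb = (n - j0 < nb) ? n - j0 : nb;

        /* Rectangular part above the diagonal block: rows [0, j0),
         * columns [j0, j0 + jb). This is a plain GEMM. */
        ref_gemm(j0, jb, k, alpha, A, lda,
                 B + j0 * ldb, ldb, beta, C + j0 * ldc, ldc);

        /* Small jb x jb diagonal block: column j of the block only needs
         * rows j0 .. j0 + j, so each column is one GEMV-sized product. */
        for (size_t j = 0; j < jb; j++)
            ref_gemm(j + 1, 1, k, alpha, A + j0, lda,
                     B + (j0 + j) * ldb, ldb, beta,
                     C + j0 + (j0 + j) * ldc, ldc);
    }
}
```

For n much larger than nb, almost all of the arithmetic ends up in the rectangular GEMM calls; the column-by-column work is confined to the diagonal blocks.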
I'm running some timings on the operation t(X)*X on row-major matrices having many more rows than columns. I'm finding that for these types of inputs, function gemmt is much slower than the equivalent from syrk or gemm, by a very wide margin.
Timings in milliseconds for input size 1,000,000 x 32, Intel i12700H, average of 3 runs:
gemmt: 216.178
syrk: 41.0468
gemm: 39.55553
Version: OpenBLAS 0.3.28, built with OpenMP, compiled from source (gcc with cmake system). The same issue happens with pthreads, and the same timing difference is observed when running single-threaded.
For reference, timings for other libraries:
gemmt: 25.66533
syrk: 12.57197
gemm: 15.69447
tabmat's "sandwich" op: 29.3
Code to reproduce: