ENH: DGEMM workunits #146
base: main
Conversation
Maybe a reviewer should run the benchmark code locally a few times (once the deps are installed), to confirm whether they see the large variations I see sometimes. I don't think that's necessarily surprising with multi-threaded benchmarks, but it is annoying.
I tried to make the benchmarks a bit more robust/clearer in terms of my concerns about outliers, and switched to setting `OMP_NUM_THREADS` manually. On this branch with 1 and 10 threads, respectively: (plots attached). On `main`: (plots attached). I think these results are a bit clearer, though I'd probably agree we still need to iterate more on the benchmark reporting. The boxplots should make outliers more obvious, but a short summary of the fold slowdown/speedup +- standard deviation, or something like that, should probably be placed on the plots as well at some point.
Of course, we could also try to use a standard benchmarking library.
I updated to include some text that shows a simple avg. +- std. dev. of relative speed on the plots, colored red if slower and green if faster. On this branch with 1 and 10 threads: (plots attached). Same thing on `main`: (plots attached).
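For reference, a minimal sketch of the kind of plot annotation described above, assuming the per-trial timings are already collected; the data, file name, and threshold logic here are illustrative rather than taken from the actual benchmark script:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative timings (seconds per trial); the real benchmark collects these
# from repeated dgemm runs on this branch and from the SciPy reference.
scipy_trials = np.array([0.010, 0.011, 0.010, 0.012, 0.010])
pykokkos_trials = np.array([0.150, 0.420, 0.160, 0.155, 0.170])

# Per-trial fold slowdown relative to the SciPy mean, summarized as avg +/- std.
fold = pykokkos_trials / scipy_trials.mean()
summary = f"{fold.mean():.1f}x +/- {fold.std():.1f}x vs. SciPy"
color = "red" if fold.mean() > 1.0 else "green"  # red = slower, green = faster

fig, ax = plt.subplots()
ax.boxplot([scipy_trials, pykokkos_trials], labels=["SciPy dgemm", "pykokkos dgemm"])
ax.set_ylabel("time per trial (s)")
ax.set_title("DGEMM benchmark (boxplots expose outlier trials)")
ax.text(0.5, 0.9, summary, color=color, transform=ax.transAxes, ha="center")
fig.savefig("dgemm_bench.png", dpi=150)
```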
Force-pushed from 66b69e9 to 91c17a7.
Review comment on tests/test_linalg.py (outdated):

```python
@pytest.mark.parametrize("input_width, tile_width", [
    (4, 2),
    # (8, 2),
```
The 2 x 2 tiled matrix multiplication tests are now passing for me locally at least, with SciPy (OpenBLAS) `dgemm` as a reference, when 4x4 matrices are used for `a` and `b`.

Other matrix and/or tile sizes can cause substantial issues, but perhaps I've carried this far enough forward for an initial review of the tiling algorithm; suggestions for generalizing to varied matrix and tile sizes would be helpful (within the confines of, e.g., powers of 2 at least).

Also, I believe we may need to add checks to prevent certain types of segfaults/errors, for example requesting team (thread) sizes, league (block) sizes, or shared memory sizes that are not hardware-compatible...
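To make the comparison above concrete, here is a sketch of what one of these tests looks like; the keyword accepted by the tiled PyKokkos `dgemm` (e.g. `tile_width`) is assumed here for illustration, so check `pykokkos.linalg.l3_blas` on this branch for the real signature:

```python
import numpy as np
import pytest
from numpy.testing import assert_allclose
from scipy.linalg import blas

# NOTE: the tile_width keyword below is an assumption for illustration; the
# actual signature lives in pykokkos.linalg.l3_blas on this branch.
from pykokkos.linalg.l3_blas import dgemm as pk_dgemm


@pytest.mark.parametrize("input_width, tile_width", [
    (4, 2),
])
def test_tiled_dgemm_matches_openblas(input_width, tile_width):
    rng = np.random.default_rng(18898787)
    a = rng.random((input_width, input_width))
    b = rng.random((input_width, input_width))
    alpha = 1.0

    expected = blas.dgemm(alpha, a, b)                     # SciPy/OpenBLAS reference
    actual = pk_dgemm(alpha, a, b, tile_width=tile_width)  # assumed keyword
    assert_allclose(np.asarray(actual), expected)
```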
@JBludau The segfault in CI vs. passing locally is possibly fairly low-level hierarchical-parallelism stuff, but having this lower-level behavior carefully caught with a thoughtful error message would be most helpful for debugging.
hmm yeah, that message is not helpful at all ... let me think about what we could do here
The segfault is not reproducible locally, even with ...
Oh, I can reproduce the segfault locally now by setting ...
Force-pushed from 274b15a to 241e5f4.
Tests should now be passing for 2x2 tiling with any square power-of-2 input matrices. If CI agrees with me, the next step is to try the benchmarks again, I think.
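To make the tiling scheme itself easier to review, here is a pure-NumPy reference for the square, power-of-2 case; it mirrors the tile-by-tile accumulation (one output tile at a time, walking tiles along the shared dimension), but it is only a host-side sketch, not the PyKokkos kernel:

```python
import numpy as np


def tiled_matmul_reference(a, b, tile_width):
    # Blocked matmul for square inputs whose width is a multiple of tile_width.
    n = a.shape[0]
    assert a.shape == b.shape == (n, n) and n % tile_width == 0
    out = np.zeros((n, n), dtype=np.float64)
    tiles_per_dim = n // tile_width
    for ti in range(tiles_per_dim):          # output tile row ("team" analogue)
        for tj in range(tiles_per_dim):      # output tile column
            acc = np.zeros((tile_width, tile_width))
            for tk in range(tiles_per_dim):  # walk tiles along the shared dimension
                a_tile = a[ti * tile_width:(ti + 1) * tile_width,
                           tk * tile_width:(tk + 1) * tile_width]
                b_tile = b[tk * tile_width:(tk + 1) * tile_width,
                           tj * tile_width:(tj + 1) * tile_width]
                acc += a_tile @ b_tile       # scratch-buffer accumulation analogue
            out[ti * tile_width:(ti + 1) * tile_width,
                tj * tile_width:(tj + 1) * tile_width] = acc
    return out


rng = np.random.default_rng(0)
a = rng.random((8, 8))
b = rng.random((8, 8))
np.testing.assert_allclose(tiled_matmul_reference(a, b, 2), a @ b)
```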
I'd say benchmarking is still blocked by portability issues--if I try running the linear algebra tests here with the default execution space switched to CUDA:

```diff
diff --git a/tests/test_linalg.py b/tests/test_linalg.py
index 4367d0c..c24ec41 100644
--- a/tests/test_linalg.py
+++ b/tests/test_linalg.py
@@ -1,4 +1,7 @@
 import pykokkos as pk
+space = pk.ExecutionSpace.Cuda
+pk.set_default_space(space)
+
 from pykokkos.linalg.l3_blas import dgemm
 import numpy as np
```

...
To be honest, even if everything I give to the workunit is in CUDA space (by manually hacking around gh-186), I still see that same error.
gh-187 may be related, not sure, but what is clear is that the error messages/tracebacks do not do a great job of telling me exactly what is wrong--I can see that some view casting/assignment is going wrong, but the precise control flow from Python through the compilation machinery is somewhat obscured.
The issue above, related to row-major vs. column-major discrepancies when casting between CPU and GPU, still remains after merging gh-188. I'm not sure what the route forward is, but I think Jakob would like me to try using, e.g., Fortran ordering with NumPy before we change anything under the hood.
For reference, I did try using Fortran ordering as well, so maybe the casting issue isn't exactly to do with the layout.
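For what it's worth, a minimal sketch of what "trying Fortran ordering with NumPy" amounts to--only the memory layout (strides) changes, not the logical contents, which is consistent with the layout not being the whole story:

```python
import numpy as np

a_c = np.arange(16, dtype=np.float64).reshape(4, 4)  # row-major (C order)
a_f = np.asfortranarray(a_c)                          # column-major (Fortran order)

# Same logical matrix, different memory layout/strides.
assert a_c.flags.c_contiguous and a_f.flags.f_contiguous
assert a_c.strides != a_f.strides
np.testing.assert_array_equal(a_c, a_f)
```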
I tried using the approach from https://github.com/kokkos/pykokkos/blob/develop/examples/pykokkos/from_array.py, given the discussion today that our workunits are not actually portable between host and CUDA space at the moment, by doing stuff like this and sending in CUDA arrays/views:

```diff
-@pk.workunit
+@pk.workunit(view_a = pk.ViewTypeInfo(space=pk.CudaSpace, layout=pk.LayoutRight),
+             view_b = pk.ViewTypeInfo(space=pk.CudaSpace, layout=pk.LayoutRight),
+             out = pk.ViewTypeInfo(space=pk.CudaSpace, layout=pk.LayoutRight))
 def dgemm_impl_tiled_no_view_c(team_member: pk.TeamMember,
                                k_a: int,
                                alpha: float,
```

and tons more changes in ... I think for now the right call is for me to wait on a few things: ...
There are issues open related to most of those, and the core algorithm here is pretty solid/passing tests on the host, but we're still pretty far from a smooth experience switching between host and device, etc.
@NaderAlAwar @JBludau I made some hacks here and now the benchmarks are working for OpenMP tiled and CUDA non-tiled, but not CUDA tiled yet. Perhaps you can help me diagnose the CUDA + tiled error I pasted below the fold. To reproduce: ...
Of course, a tiling algorithm that works with OpenMP should be safe with CUDA, but we probably don't have static analysis to catch some of this yet. I also note that the machine I was running on did occasionally suffer from gh-195, so I'm somewhat curious whether you'll simply be able to run the benchmarks just fine once you have all the deps installed...
* `dgemm` now uses `pykokkos` workunits/kernels to achieve much faster performance than before
* I had to correct a mistake in the benchmark code--we now use larger tiling dimensions to expand the data to avoid having empty arrays there--the net effect is bigger benchmark sizes, which seems desirable anyway
* the benchmark code was also adjusted to modulate/directly control the number of OpenMP threads used by PyKokkos using the `threadpoolctl` library--this seems to stabilize the timing from trial to trial a bit better, but there is still quite a bit more variation than I'd like between trials (benchmarking concurrent code is hard...) for PyKokkos (warmup issues?)
* the small, medium, large slowdowns vs. SciPy are more reasonable now (with kernels pre-compiled/cached)
  - from kokkosgh-134: 310X, 4014X, and 4985X slower, respectively
  - here with 1 OpenMP thread: 75X, 19X, 14X
  - here with 4 OpenMP threads: 62X, 66X, 10X
  - here with 10 OpenMP threads: 38X, 18X, 13X
* it may also be interesting to check these on the GPU, although OpenBLAS is just using the host as well
* remove `threadpoolctl` stuff and switch to using `OMP_NUM_THREADS` manually (both approaches are sketched after this commit list) + do way more trials and use boxplots to better visualize the outliers I might be concerned about
* add fold ratios directly to plots to facilitate performance comparisons
* early draft of scratch memory setup for the tiled DGEMM workunit
* at the moment this doesn't work because of kokkosgh-180, so will need to deal with that first
* created two scratch mem locations per team, and add draft code to fill them up (probably wrong)
* draft code to fill the result view with the tiling operations (probably wrong)
* add some tests for the tiled kernel vs. SciPy `dgemm` (new cases are failing, which makes sense for now)
* all tiled matmul tests passing; simplified algorithm
* more tiled DGEMM testing/bug fixing
* allow varied league_size, but currently segfaults when greater than `4` it seems...
* `dgemm()` now accepts a `league_size` argument, in case that might be useful for GPU where more blocks of threads may be allowed? We no longer calculate `league_size` automatically because this can cause segfaults/issues... (wrt actually available resources, I think...)
* the tiled DGEMM kernel now passes tests with several input widths that are different powers of 2
* add limited league size variation support--size of 1 and some convenient multiples of 4 may work; tests for 1 and 4 are passing locally
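Since the commits above move from `threadpoolctl` to `OMP_NUM_THREADS`, here is a minimal sketch of both ways of pinning the OpenMP thread count for a benchmark run; whether PyKokkos' OpenMP backend actually honors `threadpoolctl` is exactly the open question noted in the description:

```python
import os

# Option 1: set OMP_NUM_THREADS before the OpenMP runtime starts up,
# i.e. before importing pykokkos, so the backend reads it at initialization.
os.environ.setdefault("OMP_NUM_THREADS", "10")

import pykokkos as pk  # noqa: E402  (import deliberately after setting the env var)

# Option 2: threadpoolctl limits any OpenMP pools it can detect at runtime.
from threadpoolctl import threadpool_limits  # noqa: E402

with threadpool_limits(limits=10, user_api="openmp"):
    pass  # ... run the dgemm benchmark trials here ...
```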
Force-pushed from 408e5e6 to 876cc99.
I rebased this branch and confirmed that gh-195 is not related. I can reproduce the hard crash above on this branch now in two different scenarios, so it looks real: ...

Apart from that, it feels like we're pretty close to getting the benchmarks now!
So, with a debug CUDA version I get: ...
Looks like the problem is that we can not set the desired amount of scratch memory for the policy defined in pykokkos/pykokkos/interface/execution_policy.py (lines 91 to 93 at 4432889).
Looks like we need to add this to the Python side and then use it in the C++ code generation in order to request the desired number of bytes of scratch memory.
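To put a number on "the desired amount of bytes", a small hypothetical helper for the per-team scratch requirement of the tiled kernel, assuming the two `tile_width` x `tile_width` double-precision scratch buffers per team drafted in this PR; the helper is illustrative only, not existing PyKokkos API:

```python
# Hypothetical helper (not PyKokkos API): bytes of team-level scratch the tiled
# DGEMM workunit would request, assuming two tile_width x tile_width
# double-precision scratch buffers per team (one tile of A, one tile of B).
def per_team_scratch_bytes(tile_width: int, dtype_bytes: int = 8) -> int:
    return 2 * tile_width * tile_width * dtype_bytes


# e.g. 2x2 float64 tiles -> 64 bytes per team; this is the value that would
# eventually have to reach Kokkos' scratch-size request in the generated C++.
print(per_team_scratch_bytes(2))   # 64
print(per_team_scratch_bytes(16))  # 4096
```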
* `dgemm` now uses `pykokkos` workunits/kernels to achieve much faster performance than before
* I had to correct a mistake in the benchmark code--we now use larger tiling dimensions to expand the data to avoid having empty arrays there--the net effect is bigger benchmark sizes, which seems desirable anyway
* the benchmark code was also adjusted to modulate/directly control the number of OpenMP threads used by PyKokkos using the `threadpoolctl` library--this seems to stabilize the timing from trial to trial a bit better, but there is still quite a bit more variation than I'd like between trials (benchmarking concurrent code is hard...) for PyKokkos (warmup issues?) -- it may be good to confirm that we actually believe `pykokkos` is correctly controlled by this threading lib
* the small, medium, large slowdowns vs. SciPy are more reasonable now (with kernels pre-compiled/cached)
* it may also be interesting to check these on the GPU, although OpenBLAS is just using the host as well

Sample plot for 10 threads: (plot attached)