ENH: DGEMM workunits #146
base: main
Conversation
Maybe a reviewer should run the benchmark code locally a few times (once the deps are installed), to confirm whether they see the large variations I see sometimes. I don't think that's necessarily surprising with multi-threaded benchmarks, but it is annoying.
I tried to make the benchmarks a bit more robust/clearer in terms of my concerns about outliers, and switched to setting `OMP_NUM_THREADS` manually. On this branch with 1 and 10 threads, respectively: (plots attached). On `main`: (plots attached). I think these results are a bit clearer, though I'd probably agree we still need to iterate more on the benchmark reporting. The boxplots should make outliers more obvious, but a short summary of the fold slowdown/speedup +- standard deviation, or something like that, should probably be placed on the plots as well at some point.
Of course, we could also try to use a standard benchmarking library.
I updated to include some text that shows a simple avg. +- std. dev. of relative speed on the plots, colored red if slower and green if faster. On this branch with 1 and 10 threads: (plots attached). Same thing on `main`: (plots attached).
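For reference, a minimal sketch of the kind of plot annotation described above, assuming the per-trial timings are already collected; the data, file name, and threshold logic here are illustrative rather than taken from the actual benchmark script:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative timings (seconds per trial); the real benchmark collects these
# from repeated dgemm runs on this branch and from the SciPy reference.
scipy_trials = np.array([0.010, 0.011, 0.010, 0.012, 0.010])
pykokkos_trials = np.array([0.150, 0.420, 0.160, 0.155, 0.170])

# Per-trial fold slowdown relative to the SciPy mean, summarized as avg +/- std.
fold = pykokkos_trials / scipy_trials.mean()
summary = f"{fold.mean():.1f}x +/- {fold.std():.1f}x vs. SciPy"
color = "red" if fold.mean() > 1.0 else "green"  # red = slower, green = faster

fig, ax = plt.subplots()
ax.boxplot([scipy_trials, pykokkos_trials], labels=["SciPy dgemm", "pykokkos dgemm"])
ax.set_ylabel("time per trial (s)")
ax.set_title("DGEMM benchmark (boxplots expose outlier trials)")
ax.text(0.5, 0.9, summary, color=color, transform=ax.transAxes, ha="center")
fig.savefig("dgemm_bench.png", dpi=150)
```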
Force-pushed from 66b69e9 to 91c17a7.
Review comment on tests/test_linalg.py (outdated):

```python
@pytest.mark.parametrize("input_width, tile_width", [
    (4, 2),
    # (8, 2),
```
The 2 x 2 tiled matrix multiplication tests are now passing for me locally at least, with SciPy (OpenBLAS) `dgemm` as a reference, when 4x4 matrices are used for `a` and `b`.

Other matrix and/or tile sizes can cause substantial issues, but perhaps I've carried this far enough forward for an initial review of the tiling algorithm; suggestions for generalizing to varied matrix and tile sizes would be helpful (within the confines of, e.g., powers of 2 at least).

Also, I believe we may need to add checks to prevent certain types of segfaults/errors, for example requesting team (thread) sizes, league (block) sizes, or shared memory sizes that are not hardware-compatible...
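To make the comparison above concrete, here is a sketch of what one of these tests looks like; the keyword accepted by the tiled PyKokkos `dgemm` (e.g. `tile_width`) is assumed here for illustration, so check `pykokkos.linalg.l3_blas` on this branch for the real signature:

```python
import numpy as np
import pytest
from numpy.testing import assert_allclose
from scipy.linalg import blas

# NOTE: the tile_width keyword below is an assumption for illustration; the
# actual signature lives in pykokkos.linalg.l3_blas on this branch.
from pykokkos.linalg.l3_blas import dgemm as pk_dgemm


@pytest.mark.parametrize("input_width, tile_width", [
    (4, 2),
])
def test_tiled_dgemm_matches_openblas(input_width, tile_width):
    rng = np.random.default_rng(18898787)
    a = rng.random((input_width, input_width))
    b = rng.random((input_width, input_width))
    alpha = 1.0

    expected = blas.dgemm(alpha, a, b)                     # SciPy/OpenBLAS reference
    actual = pk_dgemm(alpha, a, b, tile_width=tile_width)  # assumed keyword
    assert_allclose(np.asarray(actual), expected)
```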
@JBludau The segfault in CI vs. passing locally is possibly fairly low-level hierarchical-parallelism stuff, but having this lower-level behavior carefully caught with a thoughtful error message would be most helpful for debugging.
hmm yeah, that message is not helpful at all ... let me think about what we could do here
The segfault is not reproducible locally, even with ...
Oh, I can reproduce the segfault locally now by setting ...
Force-pushed from 274b15a to 241e5f4.
Tests should now be passing for 2x2 tiling with any square power-of-2 input matrices. If CI agrees with me, the next step is to try the benchmarks again, I think.
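To make the tiling scheme itself easier to review, here is a pure-NumPy reference for the square, power-of-2 case; it mirrors the tile-by-tile accumulation (one output tile at a time, walking tiles along the shared dimension), but it is only a host-side sketch, not the PyKokkos kernel:

```python
import numpy as np


def tiled_matmul_reference(a, b, tile_width):
    # Blocked matmul for square inputs whose width is a multiple of tile_width.
    n = a.shape[0]
    assert a.shape == b.shape == (n, n) and n % tile_width == 0
    out = np.zeros((n, n), dtype=np.float64)
    tiles_per_dim = n // tile_width
    for ti in range(tiles_per_dim):          # output tile row ("team" analogue)
        for tj in range(tiles_per_dim):      # output tile column
            acc = np.zeros((tile_width, tile_width))
            for tk in range(tiles_per_dim):  # walk tiles along the shared dimension
                a_tile = a[ti * tile_width:(ti + 1) * tile_width,
                           tk * tile_width:(tk + 1) * tile_width]
                b_tile = b[tk * tile_width:(tk + 1) * tile_width,
                           tj * tile_width:(tj + 1) * tile_width]
                acc += a_tile @ b_tile       # scratch-buffer accumulation analogue
            out[ti * tile_width:(ti + 1) * tile_width,
                tj * tile_width:(tj + 1) * tile_width] = acc
    return out


rng = np.random.default_rng(0)
a = rng.random((8, 8))
b = rng.random((8, 8))
np.testing.assert_allclose(tiled_matmul_reference(a, b, 2), a @ b)
```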
I'd say benchmarking is still blocked by portability issues--if I try running the linear algebra tests here with the default execution space switched to CUDA:

```diff
diff --git a/tests/test_linalg.py b/tests/test_linalg.py
index 4367d0c..c24ec41 100644
--- a/tests/test_linalg.py
+++ b/tests/test_linalg.py
@@ -1,4 +1,7 @@
 import pykokkos as pk
+space = pk.ExecutionSpace.Cuda
+pk.set_default_space(space)
+
 from pykokkos.linalg.l3_blas import dgemm
 import numpy as np
```

...
To be honest, even if everything I give to the workunit is in CUDA space (by manually hacking around gh-186), I still see that same error.
gh-187 may be related, not sure, but what is clear is that the error messages/tracebacks do not do a great job of telling me exactly what is wrong--I can see that some view casting/assignment is going wrong, but the precise control flow from Python through the compilation machinery is somewhat obscured.
The issue above, related to row-major vs. column-major discrepancies when casting between CPU and GPU, still remains after merging gh-188. I'm not sure what the route forward is, but I think Jakob would like me to try using, e.g., Fortran ordering with NumPy before we change anything under the hood.
For reference, I did try using Fortran ordering as well, so maybe the casting issue isn't exactly to do with the layout.
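For what it's worth, a minimal sketch of what "trying Fortran ordering with NumPy" amounts to--only the memory layout (strides) changes, not the logical contents, which is consistent with the layout not being the whole story:

```python
import numpy as np

a_c = np.arange(16, dtype=np.float64).reshape(4, 4)  # row-major (C order)
a_f = np.asfortranarray(a_c)                          # column-major (Fortran order)

# Same logical matrix, different memory layout/strides.
assert a_c.flags.c_contiguous and a_f.flags.f_contiguous
assert a_c.strides != a_f.strides
np.testing.assert_array_equal(a_c, a_f)
```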
I tried using the approach from https://github.com/kokkos/pykokkos/blob/develop/examples/pykokkos/from_array.py, given the discussion today that our workunits are not actually portable between host and CUDA space at the moment, by doing stuff like this and sending in CUDA arrays/views:

```diff
-@pk.workunit
+@pk.workunit(view_a = pk.ViewTypeInfo(space=pk.CudaSpace, layout=pk.LayoutRight),
+             view_b = pk.ViewTypeInfo(space=pk.CudaSpace, layout=pk.LayoutRight),
+             out = pk.ViewTypeInfo(space=pk.CudaSpace, layout=pk.LayoutRight))
 def dgemm_impl_tiled_no_view_c(team_member: pk.TeamMember,
                                k_a: int,
                                alpha: float,
```

and tons more changes in ... I think for now the right call is for me to wait on a few things: ...
There are issues open related to most of those, and the core algorithm here is pretty solid/passing tests on the host, but we're still pretty far from a smooth experience switching between host and device, etc.
@NaderAlAwar @JBludau I made some hacks here and now the benchmarks are working for OpenMP tiled and CUDA non-tiled, but not CUDA tiled yet. Perhaps you can help me diagnose the CUDA + tiled error I pasted below the fold. To reproduce: ...
Of course, a tiling algorithm that works with OpenMP should be safe with CUDA, but we probably don't have static analysis to catch some of this yet. I also note that the machine I was running on did occasionally suffer from gh-195, so I'm somewhat curious whether you'll simply be able to run the benchmarks just fine once you have all the deps installed...
* `dgemm` now uses `pykokkos` workunits/kernels to achieve much faster performance than before
* I had to correct a mistake in the benchmark code--we now use larger tiling dimensions to expand the data to avoid having empty arrays there--the net effect is bigger benchmark sizes, which seems desirable anyway
* the benchmark code was also adjusted to modulate/directly control the number of OpenMP threads used by PyKokkos using the `threadpoolctl` library--this seems to stabilize the timing from trial to trial a bit better, but there is still quite a bit more variation than I'd like between trials (benchmarking concurrent code is hard...) for PyKokkos (warmup issues?)
* the small, medium, large slowdowns vs. SciPy are more reasonable now (with kernels pre-compiled/cached)
  - from kokkosgh-134: 310X, 4014X, and 4985X slower, respectively
  - here with 1 OpenMP thread: 75X, 19X, 14X
  - here with 4 OpenMP threads: 62X, 66X, 10X
  - here with 10 OpenMP threads: 38X, 18X, 13X
* it may also be interesting to check these on the GPU, although OpenBLAS is just using the host as well
* remove `threadpoolctl` stuff and switch to using `OMP_NUM_THREADS` manually (both approaches are sketched after this commit list) + do way more trials and use boxplots to better visualize the outliers I might be concerned about
* add fold ratios directly to plots to facilitate performance comparisons
* early draft of scratch memory setup for the tiled DGEMM workunit
* at the moment this doesn't work because of kokkosgh-180, so will need to deal with that first
* created two scratch mem locations per team, and add draft code to fill them up (probably wrong)
* draft code to fill the result view with the tiling operations (probably wrong)
* add some tests for the tiled kernel vs. SciPy `dgemm` (new cases are failing, which makes sense for now)
* all tiled matmul tests passing; simplified algorithm
* more tiled DGEMM testing/bug fixing
* allow varied league_size, but currently segfaults when greater than `4` it seems...
* `dgemm()` now accepts a `league_size` argument, in case that might be useful for GPU where more blocks of threads may be allowed? We no longer calculate `league_size` automatically because this can cause segfaults/issues... (wrt actually available resources, I think...)
* the tiled DGEMM kernel now passes tests with several input widths that are different powers of 2
* add limited league size variation support--size of 1 and some convenient multiples of 4 may work; tests for 1 and 4 are passing locally
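Since the commits above move from `threadpoolctl` to `OMP_NUM_THREADS`, here is a minimal sketch of both ways of pinning the OpenMP thread count for a benchmark run; whether PyKokkos' OpenMP backend actually honors `threadpoolctl` is exactly the open question noted in the description:

```python
import os

# Option 1: set OMP_NUM_THREADS before the OpenMP runtime starts up,
# i.e. before importing pykokkos, so the backend reads it at initialization.
os.environ.setdefault("OMP_NUM_THREADS", "10")

import pykokkos as pk  # noqa: E402  (import deliberately after setting the env var)

# Option 2: threadpoolctl limits any OpenMP pools it can detect at runtime.
from threadpoolctl import threadpool_limits  # noqa: E402

with threadpool_limits(limits=10, user_api="openmp"):
    pass  # ... run the dgemm benchmark trials here ...
```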
Force-pushed from 408e5e6 to 876cc99.
I rebased this branch and confirmed that gh-195 is not related. I can reproduce the hard crash above on this branch now in two different scenarios, so it looks real: ...

Apart from that, it feels like we're pretty close to getting the benchmarks now!
So, with a debug CUDA version I get: ...
Looks like the problem is that we can not set the desired amount of scratch memory for the policy defined in pykokkos/pykokkos/interface/execution_policy.py (lines 91 to 93 at 4432889).
Looks like we need to add this to the Python side and then use it in the C++ code generation in order to request the desired number of bytes of scratch memory.
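To put a number on "the desired amount of bytes", a small hypothetical helper for the per-team scratch requirement of the tiled kernel, assuming the two `tile_width` x `tile_width` double-precision scratch buffers per team drafted in this PR; the helper is illustrative only, not existing PyKokkos API:

```python
# Hypothetical helper (not PyKokkos API): bytes of team-level scratch the tiled
# DGEMM workunit would request, assuming two tile_width x tile_width
# double-precision scratch buffers per team (one tile of A, one tile of B).
def per_team_scratch_bytes(tile_width: int, dtype_bytes: int = 8) -> int:
    return 2 * tile_width * tile_width * dtype_bytes


# e.g. 2x2 float64 tiles -> 64 bytes per team; this is the value that would
# eventually have to reach Kokkos' scratch-size request in the generated C++.
print(per_team_scratch_bytes(2))   # 64
print(per_team_scratch_bytes(16))  # 4096
```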
* `dgemm` now uses `pykokkos` workunits/kernels to achieve much faster performance than before
* I had to correct a mistake in the benchmark code--we now use larger tiling dimensions to expand the data to avoid having empty arrays there--the net effect is bigger benchmark sizes, which seems desirable anyway
* the benchmark code was also adjusted to modulate/directly control the number of OpenMP threads used by PyKokkos using the `threadpoolctl` library--this seems to stabilize the timing from trial to trial a bit better, but there is still quite a bit more variation than I'd like between trials (benchmarking concurrent code is hard...) for PyKokkos (warmup issues?) -- it may be good to confirm that we actually believe `pykokkos` is correctly controlled by this threading lib
* the small, medium, large slowdowns vs. SciPy are more reasonable now (with kernels pre-compiled/cached)
* it may also be interesting to check these on the GPU, although OpenBLAS is just using the host as well

Sample plot for 10 threads: (plot attached)