ENH: DGEMM workunits #146

Open · wants to merge 21 commits into base: main

Conversation

tylerjereddy
Contributor

  • dgemm now uses pykokkos workunits/kernels to achieve much faster performance than before

  • I had to correct a mistake in the benchmark code--we now use larger tiling dimensions to expand the data to avoid having empty arrays there--the net effect is bigger benchmark sizes, which seems desirable anyway

  • the benchmark code was also adjusted to modulate/directly control the number of OpenMP threads used by PyKokkos via the threadpoolctl library (see the sketch after this list)--this seems to stabilize the timing from trial to trial a bit better, but there is still quite a bit more variation between trials than I'd like for PyKokkos (benchmarking concurrent code is hard... warmup issues?)--it may be good to confirm that we actually believe PyKokkos is correctly controlled by this threading library

  • the small, medium, large slowdowns vs. SciPy are more reasonable now (with kernels pre-compiled/cached)

    • from ENH: DGEMM non-optimized #134: 310X, 4014X, and 4985X slower, respectively
    • here with 1 OpenMP thread: 75X, 19X, 14X
    • here with 4 OpenMP threads: 62X, 66X, 10X
    • here with 10 OpenMP threads: 38X, 18X, 13X
  • it may also be interesting to check these on the GPU, although OpenBLAS is just using the host as well
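Roughly, the threadpoolctl-based thread control mentioned above boils down to the sketch below; run_dgemm_trials() is just a hypothetical stand-in for the real benchmark loop, and whether the limit actually reaches the Kokkos-managed OpenMP pool is exactly the open question flagged above:

from threadpoolctl import threadpool_limits

# cap the OpenMP pool at 4 threads for everything inside the block; whether this
# limit actually reaches the Kokkos-managed thread pool is the open question above
with threadpool_limits(limits=4, user_api="openmp"):
    run_dgemm_trials()  # hypothetical benchmark driver, stands in for the real loop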

Sample plot for 10 threads:

[plot: DGEMM_perf_compare_10_threads]

@tylerjereddy added the enhancement label on Jan 3, 2023
@tylerjereddy
Contributor Author

Maybe a reviewer should run the benchmark code locally a few times (once the deps are installed) to confirm whether they see the large variations I sometimes see. I don't think that's necessarily surprising with multi-threaded benchmarks, but it's annoying.

@tylerjereddy
Contributor Author

I tried to make the benchmarks a bit more robust/clearer in terms of my concerns about outliers, and switched to using the OMP_NUM_THREADS env var directly rather than the threadpoolctl library I mentioned above.
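For reference, the env-var approach amounts to something like this sketch; dgemm_compare.py is the benchmark script on this branch, and the assumption is that OMP_NUM_THREADS has to be visible before Kokkos/OpenMP initializes:

import os

# OMP_NUM_THREADS is read when the OpenMP runtime starts, so in practice the
# benchmark is launched with it already exported, e.g.:
#   OMP_NUM_THREADS=10 python dgemm_compare.py ...
# setting it from Python is only likely to take effect before pykokkos is imported
os.environ["OMP_NUM_THREADS"] = "10"

import pykokkos as pk  # noqa: E402 (import deliberately placed after the env var)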

On this branch with 1 and 10 threads, respectively:

[plot: DGEMM_perf_compare_1_threads]
[plot: DGEMM_perf_compare_10_threads]

On develop (pure Python "kernel") I have to reduce the number of trials and the number of repeats per trial drastically because PyKokkos is just so much slower (so the y-axis times are for 100-fold less work here):

[plot: DGEMM_perf_compare_1_threads]
[plot: DGEMM_perf_compare_10_threads]

I think these results are a bit clearer, though I'd agree we still need to iterate on the benchmark reporting. The boxplots should make outliers more obvious, but a short summary of the fold slowdown/speedup +- standard deviation should probably be placed on the plots as well at some point.

@tylerjereddy
Contributor Author

Of course, we could also try to use a standard benchmarking library like asv, but I'm not so sure our build system/toolchain is anywhere near ready for straightforward installation by the automated machinery there.

@tylerjereddy
Contributor Author

I updated to include some text that shows a simple avg. +- std. dev. of relative speed on the plots, colored red if slower and green if faster.
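The annotation logic is roughly the sketch below; the timing numbers are made up for illustration, and only the boxplot plus colored avg./std. text mirrors what the benchmark script does:

import numpy as np
import matplotlib.pyplot as plt

# hypothetical per-trial timings in seconds, purely for illustration
scipy_times = np.array([0.010, 0.011, 0.009, 0.010, 0.012])
pykokkos_times = np.array([0.38, 0.41, 0.36, 0.39, 0.45])

ratios = pykokkos_times / scipy_times        # > 1 means pykokkos is slower
mean, std = ratios.mean(), ratios.std()

fig, ax = plt.subplots()
ax.boxplot([scipy_times, pykokkos_times])
ax.set_xticks([1, 2])
ax.set_xticklabels(["scipy", "pykokkos"])
ax.set_ylabel("time per trial (s)")
# red text if slower than the reference, green if faster
ax.text(0.5, 0.9, f"{mean:.1f}x +- {std:.1f} vs. SciPy",
        transform=ax.transAxes, ha="center",
        color="red" if mean > 1.0 else "green")
fig.savefig("DGEMM_perf_compare_10_threads.png")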

On this branch with 1 and 10 threads:

[plot: DGEMM_perf_compare_1_threads]
[plot: DGEMM_perf_compare_10_threads]

Same thing on develop, but with far fewer repeats because it is so much slower to run (a reviewer should feel free to run this for the same number of trials as above to be thorough, though):

[plot: DGEMM_perf_compare_1_threads]
[plot: DGEMM_perf_compare_10_threads]


@pytest.mark.parametrize("input_width, tile_width", [
(4, 2),
#(8, 2),
Contributor Author

The 2 x 2 tiled matrix multiplication tests are now passing for me locally at least, with SciPy (OpenBLAS) dgemm as a reference, when 4x4 matrices are used for a and b.

Other matrix and/or tile sizes can cause substantial issues, but perhaps I've carried this far enough for an initial review of the tiling algorithm; suggestions for generalizing to varied matrix and tile sizes (at least within powers of 2) would be helpful.
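For clarity, the tiling scheme under test is the standard blocked matrix multiply; a pure-NumPy restatement of it (a reference sketch only, tiled_matmul is not the workunit itself) is:

import numpy as np

def tiled_matmul(a, b, tile=2):
    # each (tile x tile) block of the output accumulates the products of the
    # matching blocks of a and b; assumes square, power-of-2 inputs
    n = a.shape[0]
    c = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                c[i:i+tile, j:j+tile] += a[i:i+tile, k:k+tile] @ b[k:k+tile, j:j+tile]
    return c

a = np.arange(16, dtype=np.float64).reshape(4, 4)
b = np.ones((4, 4))
assert np.allclose(tiled_matmul(a, b), a @ b)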

Also, I believe we may need to add checks to prevent certain types of segfaults/errors, for example requesting team (thread) sizes, league (block) sizes, or shared memory sizes that are not hardware-compatible...

Contributor Author

@JBludau The segfault in CI vs. passing locally is possibly fairly low-level hierarchical parallelism stuff, but having this lower-level behavior carefully caught with a thoughtful error message would be most helpful for debugging.

Contributor

hmm yeah, that message is not helpful at all ... let me think about what we could do here

@tylerjereddy
Contributor Author

tylerjereddy commented Mar 14, 2023

We're currently limited to 4x4 matrices and 2x2 tiling, so I'm a bit confused about how we can be ready to benchmark anything useful yet, but here are some sample results from an i9-7900X CPU with or without the reference implementation (SciPy/OpenBLAS).

[plot: DGEMM_perf_compare_10_threads_no_scipy]
[plot: DGEMM_perf_compare_10_threads]

@tylerjereddy
Contributor Author

Segfault is not reproducible locally, even with act -j test_pykokkos...

@tylerjereddy
Contributor Author

Oh, I can reproduce the segfault locally now by setting export OMP_NUM_THREADS=1. It isn't actually clear to me that PyKokkos should allow this situation to even reach the compilation stage--a kernel with a barrier sync is compiled at runtime when it already knows that the number of threads is going to be one.
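A hypothetical pre-flight guard (not existing PyKokkos API, just a sketch of the kind of check I mean) could turn this into a clear error instead of a segfault:

import os

def check_team_size(requested_team_size: int) -> None:
    # hypothetical guard: fail loudly before dispatch if the OpenMP pool cannot
    # supply the requested team size, the situation that currently ends in a
    # segfault when the workunit contains a team barrier
    n_threads = int(os.environ.get("OMP_NUM_THREADS", "1"))
    if n_threads < requested_team_size:
        raise RuntimeError(
            f"OMP_NUM_THREADS={n_threads} is smaller than the requested team size "
            f"of {requested_team_size}; a team barrier cannot synchronize"
        )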

@tylerjereddy
Contributor Author

Setting OMP_NUM_THREADS to be >= the team size fixes the segfault, albeit by oversubscribing the available CI hardware. I'll leave further discussion of that UX/DX experience to gh-185.

@tylerjereddy
Contributor Author

Tests should now be passing for 2x2 tiling with any square power-of-2 input matrices. If CI agrees with me, the next step is to try the benchmarks again.
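For illustration, the parametrization ends up looking something like the sketch below; the exact size list and test name are assumptions, and the pykokkos dgemm call itself is omitted rather than guessed at:

import numpy as np
import pytest
from scipy.linalg.blas import dgemm as scipy_dgemm

@pytest.mark.parametrize("input_width, tile_width", [
    (4, 2),
    (8, 2),
    (16, 2),
    (32, 2),
])
def test_square_power_of_two_inputs(input_width, tile_width):
    rng = np.random.default_rng(1234)
    a = rng.random((input_width, input_width))
    b = rng.random((input_width, input_width))
    expected = scipy_dgemm(1.0, a, b)
    # the tiled pykokkos dgemm result would be compared against `expected` here
    assert expected.shape == (input_width, input_width)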

@tylerjereddy
Contributor Author

I'd say benchmarking is still blocked by portability issues--if I try running the linear algebra tests here in the Cuda execution space they fail, so I think I'll need someone to help patch that (cc @NaderAlAwar @JBludau). Ideally, static analysis of some sort would catch this (i.e., the lack of portability) while testing on the host. Note that gh-186 seems somewhat related, but I don't think it's the whole story...

diff --git a/tests/test_linalg.py b/tests/test_linalg.py
index 4367d0c..c24ec41 100644
--- a/tests/test_linalg.py
+++ b/tests/test_linalg.py
@@ -1,4 +1,7 @@
 import pykokkos as pk
+space = pk.ExecutionSpace.Cuda
+pk.set_default_space(space)
+
 from pykokkos.linalg.l3_blas import dgemm
 
 import numpy as np
        workunit_cache[cache_key] = (func, args)
>       func(**args)
E       RuntimeError: Unable to cast Python instance of type <class 'kokkos.libpykokkos.KokkosView_float64_CudaSpace_LayoutRight_2'> to C++ type 'Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::HostSpace, Kokkos::Experimental::EmptyViewHooks>'

pykokkos/interface/parallel_dispatch.py:179: RuntimeError

To be honest, even if everything I give to the workunit is in Cuda space (by manually hacking around gh-186) I still see that same error.

@tylerjereddy
Contributor Author

gh-187 may be related, I'm not sure, but what is clear is that the error messages/tracebacks do not do a great job of telling me exactly what is wrong--I can see that some view casting/assignment is going wrong, but the precise control flow from Python through the compilation machinery is somewhat obscured.

@tylerjereddy
Contributor Author

The issue above, related to row-major vs. column-major discrepancies when casting between CPU and GPU, still remains after merging gh-188. I'm not sure what the route forward is, but I think Jakob would like me to try Fortran ordering with NumPy before we change anything under the hood.

@tylerjereddy
Contributor Author

For reference, I did try using np.asfortranarray() in tests/test_linalg.py when the default execution space is set to Cuda, and there was only a small change in the error message--swapping the layouts but still unable to cast:

RuntimeError: Unable to cast Python instance of type <class 'kokkos.libpykokkos.KokkosView_float64_CudaSpace_LayoutLeft_2'> to C++ type 'Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::HostSpace, Kokkos::Experimental::EmptyViewHooks>'

So maybe the casting issue isn't strictly about the layout.
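For context, np.asfortranarray() only flips the layout half of the view type (C order maps to LayoutRight, Fortran order to LayoutLeft), while the CudaSpace-vs-HostSpace part of the error stays the same:

import numpy as np

a = np.ones((4, 4))            # C-contiguous: corresponds to LayoutRight
a_f = np.asfortranarray(a)     # column-major copy: corresponds to LayoutLeft

print(a.flags["C_CONTIGUOUS"], a.flags["F_CONTIGUOUS"])      # True False
print(a_f.flags["C_CONTIGUOUS"], a_f.flags["F_CONTIGUOUS"])  # False True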

@tylerjereddy
Contributor Author

Given the discussion today that our workunits are not actually portable between host and CUDA space at the moment, I tried the approach from https://github.com/kokkos/pykokkos/blob/develop/examples/pykokkos/from_array.py, making changes like this and sending in CUDA arrays/views via pk.from_cupy():

-@pk.workunit
+@pk.workunit(view_a = pk.ViewTypeInfo(space=pk.CudaSpace, layout=pk.LayoutRight),
+             view_b = pk.ViewTypeInfo(space=pk.CudaSpace, layout=pk.LayoutRight),
+             out = pk.ViewTypeInfo(space=pk.CudaSpace, layout=pk.LayoutRight))
 def dgemm_impl_tiled_no_view_c(team_member: pk.TeamMember,
                                k_a: int,
                                alpha: float,

and tons more changes in tests/test_linalg.py to selectively feed in CuPy or NumPy arrays/views depending on the default execution space. But the development experience is just too painful--I keep running up against kernel segfaults.
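The test-side switching was roughly along these lines (a sketch only--make_views is hypothetical, pk.from_numpy/pk.from_cupy are the helpers from the pykokkos examples, and their exact behavior on this branch is an assumption):

import numpy as np
import pykokkos as pk

def make_views(shape, space):
    # hand back a device view when running in the Cuda space, a host view otherwise
    if space == pk.ExecutionSpace.Cuda:
        import cupy as cp
        return pk.from_cupy(cp.zeros(shape, dtype=cp.float64))
    return pk.from_numpy(np.zeros(shape, dtype=np.float64))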

I think for now the right call is for me to wait on a few things:

  • Kokkos/C++ runtime errors always get properly propagated up to the Python layer rather than segfaulting
  • Reference counting is robust
  • workunits can be written agnostic to execution space

There are issues open related to most of those, and the core algorithm here is pretty solid/passing tests on the host, but we're still pretty far from a smooth experience switching between host and device, etc.

@tylerjereddy
Contributor Author

@NaderAlAwar @JBludau I made some hacks here and now the benchmarks are working for OpenMP tiled, CUDA non-tiled, but not CUDA tiled yet. Perhaps you can help me diagnose the CUDA + tiled error I pasted below the fold. To reproduce:

  • SciPy benchmark works of course: python dgemm_compare.py -n 5 -m scipy -p 9 -l 4 -s OpenMP
  • OpenMP space with tiling works (but is slow): python dgemm_compare.py -n 5 -m pykokkos_with_tiling -p 9 -l 4 -s OpenMP
  • CUDA space with no tiling works: python dgemm_compare.py -n 5 -m pykokkos_no_tiling -p 9 -l 4 -s Cuda
  • tiled CUDA gets a pretty low-level looking crash in my hands: python dgemm_compare.py -n 5 -m pykokkos_with_tiling -p 9 -l 4 -s Cuda
cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /tmp/tyler/spack-stage/spack-stage-kokkos-3.7.01-dtngsfqs7inn3lymcgvstolhwo4nmcx3/spack-src/core/src/Cuda/Kokkos_Cuda_Instance.cpp:161
Backtrace:
Kokkos::Impl::save_stacktrace() [0x7f5978a91fa9]
Kokkos::Impl::traceback_callstack(std::ostream&) [0x7f5978a85f6e]
Kokkos::Impl::host_abort(char const*) [0x7f5978a85f9f]
Kokkos::Impl::cuda_internal_error_abort(cudaError, char const*, char const*, int) [0x7f5978a99db4]
Kokkos::Impl::cuda_device_synchronize(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [0x7f5978a99eb2]
Kokkos::Impl::ExecSpaceManager::static_fence(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [0x7f5978a7d8bd]
[0x7f58cc0f4207]
run_dgemm_impl_tiled_no_view_c(int, double, int, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::HostSpace, Kokkos::Experimental::EmptyViewHooks>, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::HostSpace, Kokkos::Experimental::EmptyViewHooks>, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::HostSpace, Kokkos::Experimental::EmptyViewHooks>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int, int, int, int) [0x7f58cc0f6ec4]
[0x7f58cc0f79af]
[0x7f58cc0f9e7f]
[0x7f58cc10ac17]
[0x7f5983e189b3]
_PyObject_Call [0x7f5983dc6b3c]
_PyEval_EvalFrameDefault [0x7f5983d654f6]
PyEval_EvalCode [0x7f5983ec562d]
[0x7f5983f107fd]
_PyRun_SimpleFileObject [0x7f5983f1205a]
_PyRun_AnyFileObject [0x7f5983f125df]
Py_RunMain [0x7f5983f326f0]
Py_BytesMain [0x7f5983f32c7e]
[0x7f5983a42d90]
__libc_start_main [0x7f5983a42e40]
_start [0x563fa7e0a095]
Aborted (core dumped)

Of course, a tiling algorithm that works with OpenMP should be safe with CUDA, but we probably don't have static analysis to catch some of this yet. I also note that the machine I was running on did occasionally suffer from gh-195, so I'm somewhat curious whether you'll simply be able to run the benchmarks just fine once you have all the deps installed...

Commit messages from the subsequent commits on this branch:

* remove `threadpoolctl` stuff and switch to using `OMP_NUM_THREADS` manually + do way more trials and use boxplots to better visualize outliers I might be concerned about
* add fold ratios directly to plots to facilitate performance comparisons
* early draft of scratch memory setup for the tiled DGEMM workunit

* at the moment this doesn't work because of kokkosgh-180, so will need to deal with that first
* created two scratch mem locations per team, and add draft code to fill them up (probably wrong)

* draft code to fill the result view with the tiling operations (probably wrong)

* add some tests for the tiled kernel vs. SciPy `dgemm` (new cases are failing, which makes sense for now)
* all tiled matmul tests passing; simplified algorithm
* more tiled DGEMM testing/bug fixing
* allow varied league_size, but currently segfaults when greater than `4` it seems...
* `dgemm()` now accepts a `league_size` argument, in case that might be useful for GPU where more blocks of threads may be allowed? We no longer calculate `league_size` automatically because this can cause segfaults/issues... (wrt actually available resources I think...)

* the tiled DGEMM kernel now passes tests with several input widths that are different powers of 2
* add limited league size variation support--size of 1 and some convenient multiples of 4 may work; tests for 1 and 4 are passing locally
@tylerjereddy
Contributor Author

I rebased this branch and confirmed that gh-195 is not related. I can reproduce the hard crash above on this branch now in two different scenarios, so it looks real:

  • CUDA 11.7 on NVIDIA GeForce GTX 1080 Ti
  • CUDA 12.0 on Tesla V100-PCIE-32GB

Apart from that, it feels like we're pretty close to getting the benchmarks now!

@JBludau
Contributor

JBludau commented Mar 27, 2023

so, with a debug cuda version I get:

ScratchMemorySpace<...>::get_shmem: Failed to allocate 32 byte(s); remaining capacity is 0 byte(s)

@JBludau
Contributor

JBludau commented Mar 28, 2023

Looks like the problem is that we cannot set the desired amount of scratch memory for the TeamPolicy, as the function is not implemented:

def set_scratch_size(self, level: int, per_team_or_thread):  # -> TeamPolicy:
    pass

Looks like we need to add this on the Python side and then use it in the C++ code generation in order to request the desired number of bytes of scratch memory.
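For what it's worth, a 2x2 tile of float64 is 2 * 2 * 8 = 32 bytes per scratch view, which lines up with the failed 32-byte allocation above. A hypothetical Python-side version (the attribute name here is made up, and the real fix still needs the matching C++ code generation to emit policy.set_scratch_size(level, Kokkos::PerTeam(bytes))) might just record the request for codegen:

def set_scratch_size(self, level: int, per_team_or_thread):  # -> TeamPolicy:
    # record the request so the C++ code generation can pass it through to the
    # TeamPolicy; "scratch_requests" is a hypothetical attribute name
    self.scratch_requests = getattr(self, "scratch_requests", [])
    self.scratch_requests.append((level, per_team_or_thread))
    return self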

@NaderAlAwar changed the base branch from develop to main on May 24, 2023