ENH: DGEMM workunits #146 (Open)
tylerjereddy wants to merge 21 commits into kokkos:main from tylerjereddy:treddy_dgemm_with_workunits
Commits on Mar 27, 2023
Commit c88c279:
* `dgemm` now uses `pykokkos` workunits/kernels to achieve much faster performance than before.
* I had to correct a mistake in the benchmark code: we now use larger tiling dimensions when expanding the data, to avoid producing empty arrays. The net effect is larger benchmark sizes, which seems desirable anyway.
* The benchmark code was also adjusted to directly control the number of OpenMP threads used by PyKokkos via the `threadpoolctl` library. This seems to stabilize the timing from trial to trial a bit better, but there is still more variation between trials than I'd like for PyKokkos (benchmarking concurrent code is hard; warmup issues?).
* The small/medium/large slowdowns vs. SciPy are more reasonable now (with kernels pre-compiled/cached):
  - from kokkosgh-134: 310X, 4014X, and 4985X slower, respectively
  - here with 1 OpenMP thread: 75X, 19X, 14X
  - here with 4 OpenMP threads: 62X, 66X, 10X
  - here with 10 OpenMP threads: 38X, 18X, 13X
* It may also be interesting to check these on the GPU, although OpenBLAS is just using the host as well.
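The slowdown figures above are fold ratios of timings. A minimal sketch of how such a ratio can be computed with the standard library; the toy `slow`/`fast` callables are stand-ins for the PyKokkos and SciPy `dgemm` calls, not the PR's actual benchmark code:

```python
import timeit
from statistics import median

def fold_slowdown(candidate, reference, repeat=5, number=3):
    """Median runtime of `candidate` divided by median runtime of
    `reference`; a value of 75.0 corresponds to "75X slower"."""
    t_cand = [timeit.timeit(candidate, number=number) for _ in range(repeat)]
    t_ref = [timeit.timeit(reference, number=number) for _ in range(repeat)]
    return median(t_cand) / median(t_ref)

# Toy stand-ins: one workload doing ~10x the work of the other.
slow = lambda: sum(i * i for i in range(20000))
fast = lambda: sum(i * i for i in range(2000))
ratio = fold_slowdown(slow, fast)
```

Using medians over several repeats is one common way to dampen the trial-to-trial variation mentioned above.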
Commit e49cbf5
Commit 1cdbb9c:
* Remove the `threadpoolctl` machinery and switch to setting `OMP_NUM_THREADS` manually; also run many more trials and use boxplots to better visualize the outliers I'm concerned about.
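The switch described above can be sketched as follows; this is an illustrative stand-in, not the PR's benchmark script. `OMP_NUM_THREADS` has to be set before the OpenMP runtime initializes (i.e., before importing the library under test), and the outlier rule shown is the standard 1.5 x IQR whisker criterion that a boxplot draws:

```python
import os
from statistics import quantiles

# Must happen before the OpenMP runtime starts, i.e., before importing
# the library whose thread count we want to control.
os.environ["OMP_NUM_THREADS"] = "4"

def tukey_outliers(samples):
    """Flag timings outside the 1.5 * IQR whiskers of a boxplot."""
    q1, _, q3 = quantiles(samples, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [t for t in samples if t < lo or t > hi]

# Hypothetical trial timings in seconds, with one warmup-style outlier.
timings = [1.0, 1.1, 1.05, 0.98, 1.02, 5.0]
outliers = tukey_outliers(timings)
```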
Commit 0de8ced:
* Add fold ratios directly to the plots to facilitate performance comparisons.
Commit 6323724
Commit 2031748: ENH: use scratch for tiled DGEMM
* Early draft of the scratch memory setup for the tiled DGEMM workunit.
* At the moment this doesn't work because of kokkosgh-180, so that will need to be dealt with first.
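The scratch-tile scheme can be modeled in pure Python. This is an illustrative sketch of the algorithm, not PyKokkos code (the real workunit was blocked by kokkosgh-180 at this point): each "team" copies one tile of A and one tile of B into scratch buffers, then accumulates the product into the corresponding block of C.

```python
def tiled_matmul(A, B, n, tile):
    """Square n x n matmul over row-major nested lists, accumulating
    one tile x tile block of C at a time (n must be divisible by tile)."""
    C = [[0.0] * n for _ in range(n)]
    for ti in range(0, n, tile):            # tile row of C
        for tj in range(0, n, tile):        # tile column of C
            for tk in range(0, n, tile):    # reduction dimension
                # "Scratch" copies of the A and B tiles, analogous to
                # per-team scratch views in the Kokkos formulation.
                sA = [[A[ti + i][tk + k] for k in range(tile)]
                      for i in range(tile)]
                sB = [[B[tk + k][tj + j] for j in range(tile)]
                      for k in range(tile)]
                for i in range(tile):
                    for j in range(tile):
                        acc = 0.0
                        for k in range(tile):
                            acc += sA[i][k] * sB[k][j]
                        C[ti + i][tj + j] += acc
    return C

# Single-tile usage: 2 x 2 times the identity returns the input.
C = tiled_matmul([[1.0, 2.0], [3.0, 4.0]],
                 [[1.0, 0.0], [0.0, 1.0]], 2, 2)
```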
Commit ca7bf74:
* Created two scratch memory locations per team, and added draft code to fill them (probably wrong).
* Added draft code to fill the result view via the tiling operations (probably wrong).
* Added some tests for the tiled kernel vs. SciPy `dgemm` (the new cases are failing, which makes sense for now).
Commit 2605bbc
Commit 73a18bb
Commit 521c849: ENH: tiled matmul tests passing
* All tiled matmul tests are now passing, with a simplified algorithm.
Commit e40f5c4:
* More tiled DGEMM testing and bug fixing.
Commit 64b8d0d:
* Allow a varied `league_size`, though it currently appears to segfault when greater than 4.
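One plausible way tiles map to teams, shown purely as an assumed illustration (not taken from the kernel), is round-robin over `league_size`; a mismatch between `league_size` and the tiles or hardware resources actually available is the kind of thing that could underlie a crash like the one above.

```python
def team_assignments(n_tiles, league_size):
    """Round-robin mapping of tile indices to teams: team t handles
    tiles t, t + league_size, t + 2 * league_size, ..."""
    return {t: list(range(t, n_tiles, league_size))
            for t in range(league_size)}

# 8 tiles spread over a league of 4 teams.
assign = team_assignments(8, 4)
```

With `league_size` larger than the number of tiles, some teams would receive empty work lists, which a kernel has to handle explicitly.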
Commit 8087560
Commit 92a4a25
Commit e3166eb
Commit cc7d976:
* `dgemm()` now accepts a `league_size` argument, in case that is useful on the GPU, where more blocks of threads may be allowed. We no longer calculate `league_size` automatically, because doing so can cause segfaults/issues (with respect to the actually available resources, I think).
* The tiled DGEMM kernel now passes tests with several input widths that are different powers of 2.
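The power-of-2 test sweep can be sketched as follows, with a pure-Python reference matmul standing in for the SciPy `dgemm` comparison; the actual PyKokkos `dgemm` call and its `league_size` argument are omitted here since the module layout isn't shown in this excerpt.

```python
def reference_matmul(A, B, n):
    """Naive n x n matmul used as the correctness reference."""
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Sweep several input widths that are powers of 2; for all-ones inputs,
# every entry of the product equals n, which makes checking trivial.
results = {}
for n in (2, 4, 8, 16):
    A = [[1.0] * n for _ in range(n)]
    B = [[1.0] * n for _ in range(n)]
    results[n] = reference_matmul(A, B, n)[0][0]
```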
Commit 87b8f00
Commit 6c71f6d: ENH: support different league sizes
* Add limited support for varying the league size: a size of 1 and some convenient multiples of 4 may work; the tests for 1 and 4 are passing locally.
Commit 40a654d
Commit 30fca1f
Commit 876cc99