ENH: DGEMM workunits #146 (Open)
tylerjereddy wants to merge 21 commits into kokkos:main from tylerjereddy:treddy_dgemm_with_workunits
Commits on Mar 27, 2023
Commit c88c279:
* `dgemm` now uses `pykokkos` workunits/kernels to achieve much faster performance than before.
* I had to correct a mistake in the benchmark code: we now use larger tiling dimensions when expanding the data, to avoid producing empty arrays. The net effect is larger benchmark sizes, which seems desirable anyway.
* The benchmark code was also adjusted to directly control the number of OpenMP threads used by PyKokkos via the `threadpoolctl` library. This seems to stabilize the timing from trial to trial a bit better, but there is still more variation between trials than I'd like for PyKokkos (benchmarking concurrent code is hard; warmup issues?).
* The small/medium/large slowdowns vs. SciPy are more reasonable now (with kernels pre-compiled/cached):
  - from kokkosgh-134: 310X, 4014X, and 4985X slower, respectively
  - here with 1 OpenMP thread: 75X, 19X, 14X
  - here with 4 OpenMP threads: 62X, 66X, 10X
  - here with 10 OpenMP threads: 38X, 18X, 13X
* It may also be interesting to check these on the GPU, although OpenBLAS is just using the host as well.
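The slowdown figures above are fold ratios of timings. A minimal sketch of how such a ratio can be computed with the standard library; the toy `slow`/`fast` callables are stand-ins for the PyKokkos and SciPy `dgemm` calls, not the PR's actual benchmark code:

```python
import timeit
from statistics import median

def fold_slowdown(candidate, reference, repeat=5, number=3):
    """Median runtime of `candidate` divided by median runtime of
    `reference`; a value of 75.0 corresponds to "75X slower"."""
    t_cand = [timeit.timeit(candidate, number=number) for _ in range(repeat)]
    t_ref = [timeit.timeit(reference, number=number) for _ in range(repeat)]
    return median(t_cand) / median(t_ref)

# Toy stand-ins: one workload doing ~10x the work of the other.
slow = lambda: sum(i * i for i in range(20000))
fast = lambda: sum(i * i for i in range(2000))
ratio = fold_slowdown(slow, fast)
```

Using medians over several repeats is one common way to dampen the trial-to-trial variation mentioned above.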
Commit e49cbf5
Commit 1cdbb9c:
* Remove the `threadpoolctl` machinery and switch to setting `OMP_NUM_THREADS` manually; also run many more trials and use boxplots to better visualize the outliers I'm concerned about.
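The switch described above can be sketched as follows; this is an illustrative stand-in, not the PR's benchmark script. `OMP_NUM_THREADS` has to be set before the OpenMP runtime initializes (i.e., before importing the library under test), and the outlier rule shown is the standard 1.5 x IQR whisker criterion that a boxplot draws:

```python
import os
from statistics import quantiles

# Must happen before the OpenMP runtime starts, i.e., before importing
# the library whose thread count we want to control.
os.environ["OMP_NUM_THREADS"] = "4"

def tukey_outliers(samples):
    """Flag timings outside the 1.5 * IQR whiskers of a boxplot."""
    q1, _, q3 = quantiles(samples, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [t for t in samples if t < lo or t > hi]

# Hypothetical trial timings in seconds, with one warmup-style outlier.
timings = [1.0, 1.1, 1.05, 0.98, 1.02, 5.0]
outliers = tukey_outliers(timings)
```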
Commit 0de8ced:
* Add fold ratios directly to the plots to facilitate performance comparisons.
Commit 6323724
Commit 2031748: ENH: use scratch for tiled DGEMM
* Early draft of the scratch memory setup for the tiled DGEMM workunit.
* At the moment this doesn't work because of kokkosgh-180, so that will need to be dealt with first.
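The scratch-tile scheme can be modeled in pure Python. This is an illustrative sketch of the algorithm, not PyKokkos code (the real workunit was blocked by kokkosgh-180 at this point): each "team" copies one tile of A and one tile of B into scratch buffers, then accumulates the product into the corresponding block of C.

```python
def tiled_matmul(A, B, n, tile):
    """Square n x n matmul over row-major nested lists, accumulating
    one tile x tile block of C at a time (n must be divisible by tile)."""
    C = [[0.0] * n for _ in range(n)]
    for ti in range(0, n, tile):            # tile row of C
        for tj in range(0, n, tile):        # tile column of C
            for tk in range(0, n, tile):    # reduction dimension
                # "Scratch" copies of the A and B tiles, analogous to
                # per-team scratch views in the Kokkos formulation.
                sA = [[A[ti + i][tk + k] for k in range(tile)]
                      for i in range(tile)]
                sB = [[B[tk + k][tj + j] for j in range(tile)]
                      for k in range(tile)]
                for i in range(tile):
                    for j in range(tile):
                        acc = 0.0
                        for k in range(tile):
                            acc += sA[i][k] * sB[k][j]
                        C[ti + i][tj + j] += acc
    return C

# Single-tile usage: 2 x 2 times the identity returns the input.
C = tiled_matmul([[1.0, 2.0], [3.0, 4.0]],
                 [[1.0, 0.0], [0.0, 1.0]], 2, 2)
```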
Commit ca7bf74:
* Created two scratch memory locations per team, and added draft code to fill them (probably wrong).
* Added draft code to fill the result view via the tiling operations (probably wrong).
* Added some tests for the tiled kernel vs. SciPy `dgemm` (the new cases are failing, which makes sense for now).
Commit 2605bbc
Commit 73a18bb
Commit 521c849: ENH: tiled matmul tests passing
* All tiled matmul tests are now passing, with a simplified algorithm.
Commit e40f5c4:
* More tiled DGEMM testing and bug fixing.
Commit 64b8d0d:
* Allow a varied `league_size`, though it currently appears to segfault when greater than 4.
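One plausible way tiles map to teams, shown purely as an assumed illustration (not taken from the kernel), is round-robin over `league_size`; a mismatch between `league_size` and the tiles or hardware resources actually available is the kind of thing that could underlie a crash like the one above.

```python
def team_assignments(n_tiles, league_size):
    """Round-robin mapping of tile indices to teams: team t handles
    tiles t, t + league_size, t + 2 * league_size, ..."""
    return {t: list(range(t, n_tiles, league_size))
            for t in range(league_size)}

# 8 tiles spread over a league of 4 teams.
assign = team_assignments(8, 4)
```

With `league_size` larger than the number of tiles, some teams would receive empty work lists, which a kernel has to handle explicitly.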
Commit 8087560
Commit 92a4a25
Commit e3166eb
Commit cc7d976:
* `dgemm()` now accepts a `league_size` argument, in case that is useful on the GPU, where more blocks of threads may be allowed. We no longer calculate `league_size` automatically, because doing so can cause segfaults/issues (with respect to the actually available resources, I think).
* The tiled DGEMM kernel now passes tests with several input widths that are different powers of 2.
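The power-of-2 test sweep can be sketched as follows, with a pure-Python reference matmul standing in for the SciPy `dgemm` comparison; the actual PyKokkos `dgemm` call and its `league_size` argument are omitted here since the module layout isn't shown in this excerpt.

```python
def reference_matmul(A, B, n):
    """Naive n x n matmul used as the correctness reference."""
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Sweep several input widths that are powers of 2; for all-ones inputs,
# every entry of the product equals n, which makes checking trivial.
results = {}
for n in (2, 4, 8, 16):
    A = [[1.0] * n for _ in range(n)]
    B = [[1.0] * n for _ in range(n)]
    results[n] = reference_matmul(A, B, n)[0][0]
```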
Commit 87b8f00
Commit 6c71f6d: ENH: support different league sizes
* Add limited support for varying the league size: a size of 1 and some convenient multiples of 4 may work; the tests for 1 and 4 are passing locally.
Commit 40a654d
Commit 30fca1f
Commit 876cc99